YARN Capacity Scheduler and Node Labels Part 1
Updated: Sep 29, 2019
As I've worked with clients all over the U.S., I've seen many different schemes for managing YARN queues. Oftentimes, the different schemes are driven by a misunderstanding of how YARN actually manages queues and jobs.
For example, some administrators use partitions to ensure certain jobs had enough resources. Instead, using this scheme actually restricts how many resources their jobs could access, lowering the overall cluster utilization.
Based on that experience, for the next three weeks, I'm going to discuss YARN queue management with regards to queues and partitions, to give an overview of how to control YARN jobs automatically.
In this first part today, I'm going to explore exactly how YARN works with queues, and the various mechanisms available to control how YARN does this. In part 2, we will then discuss partitions and node labels. Finally, part 3 will discuss how partitions play with YARN queues, and then return to the example given above to predict how we can better achieve our goal.
Note that this series is purely about using the Capacity Scheduler, which at this point is the default scheduler for YARN on HDP. Legacy Cloudera platforms use the Fair Scheduler, which differs in some key ways. That being said, this post will be useful for CDH users as well, just be aware of the differences.
Edit: Part 2 is now available here!
Capacity Scheduler Configuration
YARN controls what jobs get what resources through the use of hierarchical queues. To better understand what this means, it's important to understand the basic controls that we, as YARN Queue Designers, have on an individual queue
Capacity – This is the main parameter, and the one you’ll hear used most often. The name is a bit misleading, however, as it refers to the minimum amount of resources that is guaranteed to the queue for use at a time. If those resources are not available, it may take action to get to that level. That includes taking resources from other queues and, if enabled, preempting containers in flight. All children queues of a queue must have capacities adding up to 100%, so that if all queues only get what they are requiring, it will use up all resources. This is the number which should be negotiated between sibling queues in order to determine the best values for all sibling queues.
Max Capacity – This is the maximum percentage of the parent queue which this queue can take. This can occasionally be useful if you have jobs that run for a long time and can’t be preempted, in order to ensure in these cases there are at least some resources available until one of the jobs finishes. This may also be set on queues that are the lowest priority, to ensure that the jobs don’t need to be preempted to get resources.
User Limit Factor – This is a floating-point multiplier which is applied to the capacity, to set a maximum percentage of the parent queue which can be used by a single user. As an example, if the queue is set to a capacity of 20%, and the user limit factor is set to 2.0, then a single user will be able to use 40% of the parent queue’s resources, whether that is one job or many jobs. For most queues, setting this to a large enough value that it is essentially not enforced is best practice. Any queues for ad-hoc jobs should be set to a lower number, however, in order to prevent one user from using all the resources in the parent queue.
Minimum User Limit – This is a limit that can prevent new users from entering jobs into a queue. You would do this for a queue that is open to ad-hoc queries by data scientists, where you want to ensure that a user does not have a job start, only to have so few resources that everyone’s jobs slow down. Instead, if they wait their turn in accepted until there is a good amount of resources available, then things can go much smoother for everyone. The value is a percentage of the queue which must be reached, assuming all users on the queue get equal amounts of the queue. For example, if the limit is set to 25%, then up to 4 distinct users can run jobs on the queue, but a 5th user (who would move the average percentage to 20%) would have to wait in accepted until one of the jobs finishes. A value of 100 actually disables this feature, and is often used. When necessary, it is often set to somewhere between 10 - 25%, meaning that between 4 - 10 different users can run jobs on the queue.
Maximum Applications – This is a maximum number of applications allowed to be accepted or running in the queue at once. Any applications submitted after this maximum has been reached are rejected outright and fail. By default, this is set to 10,000. This is very rarely set, and often limits on how many jobs can run on the cluster are done through other mechanisms.
Maximum AM Resource – This is the maximum percentage of the queue which may be used by application masters. This helps prevent the situation where so many applications have been started, that there is no more room for the worker nodes, so the cluster is stuck in a deadlock situation. This should generally be set at the top level and inherited everywhere else, unless there is a compelling reason to do so. The value should be set such that the remaining percentage has enough resources for workers to complete their jobs.
Priority – Priority helps YARN determine which queue to pull from next, and when to apply preemption. The default value is 0, which is the lowest recommended value. Setting this value higher helps prevent most cases of preemption from queues that we don’t want preemption to occur on, and guide YARN to using the lower priority queues to get more resources when necessary.
Ordering Policy – This parameter determines how YARN decides which job to take next from a queue, once it decides which queue should be chosen. The default is FIFO, which means that it will wait until the next job in the queue has enough resources to run. It is recommended to use Fair instead, which will pull jobs that aren’t the next job if they are small enough to finish quickly, while the bigger jobs wait for enough resources. This should be set at the top level and kept consistent.
Setting these values at a parent queue means the value will filter down to the child queues, until it is set otherwise. Capacity values are percentage of the parent queue, not of the entire cluster, which can get confusing quickly, so it is best to use a spreadsheet to keep on top of it if using multiple tiers. The YARN documentation on queue structures gives more details on YARN and how it works with queues, and I highly recommend reading through it here.
As an example, see the image above. We have two queues and one partition. The queues are set up with 50% capacity each, with 50% max capacity as well. A job is running on each queue currently, with a third job that started on Queue B, and then finished its job. In the meantime, Job 4 is still waiting for resources on Queue A. Because the max capacity on A is 50%, it can't use the space Queue B has, so it's stuck waiting for Job 2 to finish.
This image shows what would happen if the max capacity was set at 100% instead. Note that Job 4 now has space to run on the resources for Queue B, but it is still in Queue A. If jobs were to come into Queue B, then either Job 4 would have some of its resources removed (if preemption is enabled on Queue A), or it would have to wait until there is enough space just on Queue A for Job 4 to run.