YARN Capacity Scheduler and Node Labels Part 3
In this previous posts, we explored exactly how YARN works with queues, node labels and partitions. In this post, we will discuss how partitions play with YARN queues. Finally, we will return to the example given in the first post in this series to predict how they might have better achieved their goal. It is recommended you read the posts or understand how these mechanisms work before continuing through this post.
Note that this series is purely about using the Capacity Scheduler, which at this point is the default scheduler for YARN on HDP. Legacy Cloudera platforms use the Fair Scheduler, which differs in some key ways. That being said, this post will be useful for CDH users as well, just be aware of the differences.
Interaction Between Queues and Partitions
So now that we have an understanding of queues and partitions in YARN, how do these concepts play together? Both at a high level do the same thing, control where a job can run, but they do it in vastly different ways. So what happens when you use both at the same time?
First off, the queue configuration we had been using up until now referred to the default partition. Any other partition will by default have a capacity of 0% and a maximum capacity of 100% for all queues. This means that once you add a new partition, you will need to update the configuration such that the capacity specifically for that partition across all queues sums to 100%, just like you would do for capacity for the default partition.
To set these capacities, you need to set yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.capacity and yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.maximumCapacity appropriately for each queue and partition combination you want to enable.
You should also set yarn.scheduler.capacity.<queue-path>.accessible-node-labels to allow access to the partitions you want that queue to be able to use. You can use the wildcard character (*) here if you so choose.
The last piece of this puzzle is the default node label expression. This property is set on each queue, and indicates when a job is run on the queue, what node label it should use if no other node label is specified. This allows you to have queues which default to a specific partition, making it easy on developers to use partitions by purely using the right queue.
Keep in mind this doesn't enforce that any jobs in that queue must use that partition, however. That is the role of the accessible-node-labels property mentioned above. I can have a job run in a queue pointed to the gpu partition, but set the node label on one of the jobs to empty, so it'll run in the default partition. To reiterate, this only sets the default node label used, not the only partition.
One More Example
To help put all of this into action, the above shows an example of what happens when queues and partitions combine. As before, any job which does not have a node label specified (namely Job 4) is the same as having specified the default partition. Lets assume in all cases, capacity is set to maximum capacity. We also assume that the GPU partition is non-exclusive.
Note that there are two instances of Queue A and Queue B. Additionally, while the two queues are split evenly across the GPU partition, there is more room in the default partition for Queue B than Queue A, highlighting the difference in capacities and settings queues can have between partitions.
At first, Job 1 and Job 2 don't request a partition, so they are placed in Queue A and Queue B for the default partition. When Job 3 comes into the cluster, it wants to use the gpu partition, so it is placed in Queue B for the GPU Partition. So far, we've been able to do everything that has been asked of us. But then Job 4 comes and ruins that. Job 4 requests Queue A without a node label specified. Because Queue A is currently full, and we've hit our maximum capacity as well (since we assume capacity equals maximum capacity). But, we do have space in Queue A on the GPU Partition, so we use that space instead.
As you can see, this gets hairy pretty quickly. That's why its generally recommended to stay away from partitions unless you have a good reason for it (such as the two reasons mentioned above). As we'll see next, the queueing structure often gives you the ability to solve most requirements you might have.
Revisiting the Original Problem
After three weeks, we can return to the situation I postulated originally at the beginning of this series. The situation is that a company notices that Spark developers have a bad habit of requesting too many resources, hogging the entire cluster. This prevents Hive jobs from running, causing major headaches.
The current solution we have is to partition the cluster using an exclusive partition. One partition is meant for Spark jobs, and a queue is made which has a default node label pointing to this partition. All other jobs are run in the remaining queues, and run on the default partition. At first this works, since you've now essentially restricted the Spark jobs to a specific set of nodes. Yes, you could run the jobs on the default partition, but the Spark queue on the default partition is set at 0% capacity, so it has lower priority than the other jobs, and thus is rarely scheduled.
But then a project comes along that needs as many resources as we can spare for a few weeks to do a major load in Spark. We put the job in the Spark queue, but it's not getting enough resources, and is stuck competing with other jobs running in the Spark queue. We decide to create a new queue, but the question is where to put the queue, and what to take capacity away from?
If we take capacity away from the Spark queue, we can only get so much. If we take away from the other queues, then we can't take away from the Spark queues. If we try to take away from both, we need to figure out how to make the percentages work, and quickly run into confusing scenarios where one person is talking about capacity of the whole cluster while the other is talking about capacity of the default partition or of the Spark queue.
Based on the last three weeks of posts, what's the right way out of this mess? The best way to handle this long term is to remove the partition, and put everything into one partition. The Spark partition was not done due to heterogeneous nodes, nor is it due to a network partition. It is due purely to wanting to restrict how much of the cluster Spark jobs can take up. This can instead be done using the maximum capacity parameter which YARN queues have. So we put everything into one partition, put limits on the Spark queue so it doesn't take over the cluster, and then add a new queue for the spike job we need to finish in a few weeks.
And like that, we've gotten ourselves into a good short-term situation for all parties involved. The spike job has a strictly controlled amount of resources we can tweak easily. Spark jobs have a resource cap to prevent them from taking over the cluster. Most importantly, all jobs can now share all resources, so we don't have nodes sitting idle while jobs are desperate for resources. We still need to better arrange the queues to allow for easier tweaking of all of the job resources, but for now we can put our feet up and rest, before jumping into the next YARN puzzle!