YARN Capacity Scheduler and Node Labels Part 2
Updated: Oct 6, 2019
This is a continuation of my series on the mechanisms for using YARN capacity scheduler to control your jobs. The first post in the series is here. The third and final post is now available here.
In this first part, we explored exactly how YARN works with queues, and the various mechanisms available to control how YARN does this. With that information, we will now discuss partitions, and how partitions play with YARN queues. In the series finale next week, we will return to the example given above to predict how they might have better achieved their goal.
Note that this series is purely about using the Capacity Scheduler, which at this point is the default scheduler for YARN on HDP. Legacy Cloudera platforms use the Fair Scheduler, which differs in some key ways. That being said, this post will be useful for CDH users as well, just be aware of the differences.
Uses of YARN Partitions and Node Labels
As artificial intelligence has grown, it has become clear that running these jobs on GPUs as opposed to CPUs gave significant performance benefits, and also freed the CPUs to be used for activities that GPUs are not optimized for, typically I/O intensive jobs.
Until the release of Hadoop 3 in recent years, the best way to do this was to run a Spark job (or a similar general-purpose batch processing framework), and run GPU commands from within a map job on all of the executors. This raised the issue: GPUs cost a lot of money, and perhaps the jobs that needed GPUs are not nearly all of the jobs running on the cluster. How do we ensure that GPU jobs run on worker nodes with GPUs without buying expensive GPUs for all of our worker nodes?
Another situation in which this might be used is in a split cluster, where workers are spread across a data center or even between data centers. While this is not optimal due to slower network traffic, it sometimes is necessary, especially when using IaaS offerings from public cloud vendors (such as EC2, Azure VMs, or Compute Engine). In these cases, there may be some worker nodes which are closer to a resource that is used, such as a database or a REST API. In these cases, it may make sense to have these in a separate, inclusive partition, so that jobs can specify that partition when they heavily use those resources. This will speed up that job's I/O work significantly.
How to Use Node Labels
Node labels allow you to specify a characteristic of a node. A node label automatically makes a partition, which is a collection of nodes with the same label. Because of this, every node can have up to one label only. Any node without a label is put into a special default partition, creatively named DEFAULT_PARTITION in the UI.
Every partition is either exclusive or inclusive. If inclusive, then any extra space on the partition can be used by jobs running in the default partition. Any jobs specifically running in that partition get priority, however, and can preempt (assuming preemption is enabled). Exclusive partitions do not allow this behavior, so nodes within the partition can only be used when the node label is specified. Note that in both cases, any extra room in the default partition can be used by jobs in other partitions, if no room is available in their own partitions.
To create a node label and partition, you first need to add it to YARN to define that partition. This is done through a command like the following.
The above command creates two partitions. A non-exclusive partition using label1, and an exclusive partition using label2. Note that exclusive partitions are the default, if exclusivity is not specified.
Once the partition is created, adding a node to that partition is as easy as the following command.
This will add node1 to the partition marked by label1, and node2 to the partition marked by label2. You're now ready to submit a job to one of the partitions!
When submitted, a job can specify a node label which it needs to run on. In the GPU example above, we might have a gpu node label, and any jobs which use GPU resources would specify this in the spark-submit or similar submission command. This would force the workers to only exist on nodes within the gpu partition. If no node label is specified, then the default partition is used.
As an example, Spark uses the properties spark.yarn.am.nodeLabelExpression and spark.yarn.executor.nodeLabelExpression to set the node label on which the driver (when in YARN cluster mode) and the executor can run, respectively. These are set like any other configuration option, most commonly through the --conf flag of spark-submit.
Warning Against Exclusive Partitions
One more thing I want to mention here is that when a partition is exclusive, it is essentially creating a mini-cluster within your overall cluster. We have one YARN Resource Manager, but two sets of workers nodes, one of which can't use resources from the other. If the default partition is undersized and constantly busy, or the other partition is oversized and not often full, then essentially they are two separate clusters. This can lead to resources being idle in the other partition while the default partition is full with jobs waiting.
As an example, see the situation above. Any job without a node label specified is automatically going to be as if the node label was set to the default partition. In this case, Job 3 didn't have room in Queue A to run on the default partition, but it did have room on the GPU partition, so it went there. If a job came in later and needed the GPU partition, it could be preempted, but at least you'd be able to run your job earlier. This makes sense in the case of GPUs, since the GPU partition is just adding resources, so anything that can run on the default partition can likely run on the GPU partition as well.
If instead the GPU partition was exclusive, however, we'd have the situation above. Note that even though there is plenty of room in the GPU partition for Queue A to place Job 3, it cannot since the GPU partition is only for jobs that are labelled appropriately. This leads to jobs being stuck in the accepted state even when there are resources available that could run the job in question. This is obviously not an ideal situation, which is why using exclusive partitions needs to be done cautiously.