For the last few weeks, I've been diving into various Spark job optimization myths which I've seen as a consultant at my various clients. I started with why increasing the executor memory may not give you the performance boost you expect. Last week we discussed why increasing the number of executors also may not give you the boost you expect.
This week, we are going to change gears a bit, and focus on the driver. While the driver is not oftentimes the source of a lot of our issues, it is still a good place to look at for improvements. Every little improvement adds up, after all.
The most common misconception I see developers fall into with regards to the driver configuration is increasing driver memory. We'll discuss why this is generally not a good decision, and the rare cases when it might be reasonable to do.
As with previous weeks, I'm running tests on a local 3-node HDP 2.6.1 cluster using YARN. I've set YARN to have 6 GB total and 4 cores, but all other configuration settings are default. I haven't enabled security of any kind. I'm using Vagrant to set this up, and have supplied the Vagrant file on Github.
What Does the Driver Look Like Anyways?
Oftentimes when writing Spark jobs, we spend so much time focusing on the executors or on the data that we forget what the driver even does and how it does it. This actually isn't a horrible thing, however, since, from its view, it is just any other Java/Scala/Python/R program, using a library called Spark. There's no fancy memory allocation happening on the driver, like what we see in the executor, and you can even run a Spark job just like you would any other JVM job, and it'll work fine if you develop it right.
Based on this, a Spark driver will have the memory set up like any other JVM application, as shown below. There is a heap to the left, with varying generations managed by the garbage collector. This portion may vary wildly depending on your exact version and implementation of Java, as well as which garbage collection algorithm you use. The right-hand side is your permanents, where things like the stack, constants, and the code itself are held. This should all be very familiar to you if you've ever taken a computer architecture course. If not, you don't need to understand the details here, just that it is similar to any other JVM application.
Keep in mind in all of this there are a few exceptions that we are glossing over for simplicity's sake. One example is if you use YARN cluster mode, then YARN will set up your JVM instance for you, and do some memory management, including setting the heap size for you. For the most part, these are transparent and don't hae a huge effect on our results.
So What Does That Mean?
At this point, unless you're a theoretical computer science junkie like me, you're probably asking yourself "so what?" and figuring out how much further you have to go. It's simple: optimize the driver code like you would optimize any Java application. This includes simple things like:
Avoid unnecessary memory usage
Only set the heap to what you need
Don't use globals
Obviously, there are a lot more, but these three translate very well to Spark specific projects as well, so we'll focus on them. We'll talk about each of these as they pertain to Spark below.
Avoid Unnecessary Memory Usage
This is a pretty obvious one in a normal JVM application. If you don't need the contents of a huge file, don't read it in. Simple, right?
Yet so often I see applications that collect all of the data from a DataFrame into memory. This does exactly the same thing: takes a large amount of data stored safely elsewhere, and pulls it into memory. This is often done as a collect() call. Because of this, using collect() is often the first sign that something is wrong, and needs to be fixed.
Why collect your data to the driver to process it, when that is what Spark is there for? Collect calls should be used only when either you are in a development environment testing your code, or when you know 100% without a doubt that it will never be large. Even then, the second one is doubtful. After all, how many of us know 100% something will never happen in our applications?
Only Set the Heap to What You Need
Another one is not setting the heap size to be too large. Most generic JVM applications I've seen, the heap size isn't set unless it was found to be absolutely necessary. Sometimes that is in place of optimization, and sometimes that is despite optimization. Regardless, because it isn't too easy to set the heap size, most developers don't mess with it until they need to.
That's not the case in Spark. Spark makes it really easy, especially if you are using the YARN cluster mode. It's just another switch of the many you need to set anyways, so many people set it. Additionally, because there is the misconception that increasing executor memory speeds things up, that naturally translates to driver memory as well.
This is all wrong. The default driver memory size is 1 GB, and in my experience, that is all you need. Looking at the memory layout above, what do you expect to take up more than 1 GB of memory? The stack and constants should be small. The heap will have pointers to DataFrames, and maybe a configuration file loaded, but not much else. You shouldn't be collecting data, so that shouldn't be on the heap. There's just not much you need to have on the heap in a well-written Spark driver.
There is one caveat to this: SPARK-17556. This is a known bug where if you use a broadcast join, the broadcast table is kept in driver memory before broadcasting it. This means that if you are using broadcast joins a lot, you are essentially collecting each of those tables into memory. In this case, it is reasonable to increase the memory usage for driver memory, until the bug is fixed.
Don't Use Globals
One final thing that you should avoid is globals. Global variables are bad for so many reasons, but one is that the data is kept around forever, even when it isn't needed anymore. So if you have a DataFrame that reads the data in from a file, but only need it once to start the processing, why keep it around? Yet, if it is kept in a global variable, it will be kept around for the entire application. Instead, put it in an object that will be removed once it can no longer be referenced due to leaving that scope. This saves you room and headaches down the road.
This isn't one of the flashiest optimizations you can do or one that will change your life, but it's an important optimization to run. Reducing your memory usage on the driver will lower your YARN usage amount and might even speed up your application.
Additionally, I've found that applying these patterns helps clear up the code immensely. It can be really difficult sometimes to determine where code is supposed to be running between the driver and executor. Enforcing these policies oftentimes will make that distinction clearer, making your application more maintainable.
Finally, it's just good practice. The old saying in sports "practice like you play" applies here. If you don't have good habits when writing your driver code, you're more likely to not have good habits when you write the user-defined functions.