Since we rung in the new year, we've been discussing various myths that I often see development teams run into when trying to optimize their Spark jobs. So far, we have covered:
Why increasing the executor memory may not give you the performance boost you expect.
Why increasing the number of executors also may not give you the boost you expect.
Why increasing driver memory will rarely have an impact on your system.
Why increasing overhead memory is almost never the right solution.
Why increasing the number of executor cores won't always improve performance.
This week marks the last planned entry in this series. In this entry, we're going to discuss something a bit different, and that is the common excuse I see from developers to immediately try to increase resources, instead of understanding the system they are building. And that excuse is the notorious OutOfMemoryError exception.
This error can lead to many different solutions based on a variety of factors, including where the error came from, what other errors are seen on that node and across the system, and the exact message included with the exception. Because of that, we're going to focus on what you should do when you see this exception, so that you'll better understand what the error is actually trying to tell you before you waste cluster resources.
Determining the True Source of the Error
The first step when you see this error is to figure out where the error is occurring. Is it occurring on the driver? Or is it on only one of the executors? Or maybe all of the executors are seeing this error? If you are running in YARN client mode, then any errors from the driver will be printed to standard out. If this error is not printed to standard out in these cases, then the driver is not the one throwing this error.
If not, or if you are running in any other mode, then you'll need to look at the individual logs for the nodes. As we've discussed in past posts, using the yarn logs command will help you immensely here, along with a good helping of grep magic. You can use the exact error message from the original issue you saw to find the error messages quickly across all nodes. Don't use the timestamp, since this will preclude other executors that may have run into the same issue at different times.
One note I want to make here is that I often see developers pointing to the YARN resource manager UI for their job like the image below, and do all of their debugging based on this. Note that while this one doesn't mention memory issues specifically, there are definitely situations where it will show an OutOfMemoryError exception in this output.
The error shown here is just a single error in what may be many different, interconnected (or not) errors. This gives you a good idea of what to look for in the logs to find the actual error, but it isn't going to be all you need.
Is This The Error We're Looking For?
Once you find where the error is coming from, the next step we need to take is understand what other errors and warnings were being reported around the same time on the same nodes. If the error was reported on the driver, understanding any errors in the executors will also be necessary. If the error happens when interfacing with an external system (such as a JDBC connection), then check the logs on that system to make sure an issue didn't happen there.
This may all seem to be overkill. We have the error, right? Well, not quite. It's rare for a job to have one error reported and that be the error we are looking for. Instead, it's common to see multiple errors across multiple nodes, all interconnected. As an example, you'll often see network errors on the driver when an executor crashes. The issue isn't the network errors, but instead whatever caused the executor to crash. Yet both will be reported.
Checking for errors and warnings (using the ERROR and WARN keywords from log4j) gives us an understanding of all of the errors being reported. Keep the timestamps in mind as well, since often the earliest errors are the root issues.
When It Really Is a Memory Issue
If you go through the steps above, and are left with a memory issue as the most likely culprit, there's still some more work to be done. If the error is coming from the driver, then looking at the possible causes in my earlier post in this series on driver memory would help. If it's coming from only one or two executors, then you might be dealing with skew. Finally, the cause for the OOM exception may indicate a specific set of memory such as overhead or permanent generation. These are handled in specific ways and may be caused by specific things, so you'll want to dig into exactly what that type of memory is used for, and what you can do to reduce that usage.
Conclusion
Increasing memory usage by your application may not be the end of the world, but jumping to that solution is a common practice among developers that needs to change in order to build better applications. Follow the steps above, and you'll be many steps ahead of your peers in writing better, more efficient code.
And with that, I'm ending our series on Spark job optimization myths. I hope that you've gained a lot of good information from this series, and use it to improve your jobs. Feel free to comment here, or tweet me any other myths or topics you'd like me to cover, and I'll consider them for a future post.
Commentaires