Debugging from the Field: Sudden En Masse Failures in 100s of Spark Streaming Jobs

David McGinnis
Sep 15, 2019
4 min read

This is one of a series of posts revolving around debugging stories from the field. The goal of this series is to help demonstrate how to debug issues in a Hadoop application, as well as show some of the decision making process that goes into diagnosing issues. I encourage you to look past just the problem and the solution, and instead focus on the process. Feel free to post comments on possibilities you think we missed, or questions on why we went a certain direction, so that we can all learn from each other! All names are omitted in order to protect the innocent and not-so-innocent.

Background

A few weeks ago, I was included in a frantic email chain from a client over the weekend about a job failure. The job that was failing was a very important full load job that was already behind schedule, and needed to be finished ASAP. The job itself was a collection of hundreds of Spark Streaming jobs which had been running for a week or two when it went down. Many (but not all) of the jobs had failed over the weekend, all around the same time.

The investigating engineer highlighted the following lines from the application log, asserting this meant that there were not enough available resources on the cluster. This has been a concern with this job in the past, so naturally it was concerning to see this.

While this error message was certainly troublesome, something bothered me about it. If a YARN job doesn't have enough resources to do what it needs to, you'll often see it stuck in the ACCEPTED state, or you'll see containers killed due to preemption. Java heap space errors can sometimes come up with the overhead parameter on a Spark job isn't set properly, or when the system tries to collect objects from Spark that are too large for the driver heap. This meant we needed to dig a bit deeper to understand the problem, before simply chalking it up to not enough resources.

Getting the Logs

The first step I followed once I got in after going through this thread was to bring up the logs, and check out the errors in them. The first thing I noticed, is that the logs were too large to bring down, somewhere more than 24 GB of logs. This is likely due to the fact that it is a Spark Streaming job, so the job had been running constantly for weeks.

This was obviously a problem. Luckily, YARN handles this situation beautifully. The YARN logs command has a parameter -size which limits how large each file can be. This sadly doesn't appear to be documented on the Hadoop website, but is documented in the CLI help (accessed through the -help command). You can see the text of it here.

With this parameter, if it is set to positive, each file will contain the first N bytes, but if it is set to negative, each file will contain the last N bytes. Since we want to see what happened near the end of the jobs life, we can use -100000000 to get the last 100 MB for each file. This should be enough to see what we want to see, without overloading the disk.

Investigating the Logs

Once this was brought down, I had a folder for each executor node, with all of the logs underneath the executor folders. I could then search for errors in the logs. This came back with around a million error messages. Speaking with the development team, most of them appeared to be due to invalid data, and they verified they could be ignored for now.

These sorts of errors were marked with APP_LOG, so excluding those lines, we were able to use the following command to see all error lines that aren't data errors.

This didn't give us anything interesting, with just the usual errors about kill signal being given to executors after the root failure, or other mundane sounding things. This only produced 10 lines or so, however, so I was able to expand my view a bit to see the 10 lines before and after each error line. Oftentimes, exception callstacks are printed after an error line, so doing this allows you to see exceptions that might be hiding otherwise.

This brought up the start of a callstack for a NullPointerException. Modifying the command above, we were able to search for NullPointerException and the 30 lines after it, in order to see as full of the callstack as we could. The callstack looked something like the following.

Verifying No Heap Issues

At this point, we had shown that one of the jobs failed due to a NullPointerException, but we hadn't come across the heap failure. Without finding the heap failure and tracking it down, there was still doubt as to whether the analysis had been flawed. So, I first searched the logs for OutOfMemoryError to see if the job I had been analyzing had that log anywhere. Sadly, it was no where to be found.

So, I followed the steps above to download a second jobs' information. This one did have the log entry, so we can finally understand it! Interestingly enough, the error came up 3 days prior to the failure. Moreso, this new job did have the NullPointerException just prior to failing, just like the previous job. This proved that the error did happen, but did not directly cause the failures we had seen.

Conclusion

With this information, I was able to prove that the jobs in question failed due to a NullPointerException, and point the development team to the line in question which caused the issue. Additionally, I was able to bring up the heap issue being with their code, with an initial recommendation to try increasing the overhead space to see if that solves this issue.

As this case shows, it is very important to understand where to look and how to filter out noise from the signal you care about. Understanding basics like how to use grep and filter through logs can get you a long way to finding the core issues causing your failures.