Debugging From The Field: The Case of the Ignored Configuration Change
This post assumes you understand how the YARN log aggregation system works. If you don't know what that is, or even if you do, I highly recommend you check out my post from last week here for details on the internal workings of the system. This is an important yet under-documented system that I believe every administrator would benefit from understanding better!
As I've been discussing over the last few weeks, I'm currently working with a client that has issues surrounding the HDFS Small File Problem. While digging into that, the team realized that the YARN application logs were taking up 18% of the cluster's storage. While that isn't an enormous share on its own, it was significant enough to be worth investigating.
After discussions between the administrative team and the development team, we decided to lower the retention period for the YARN logs in HDFS from 30 days to 15 days. In theory, this should cut the number of files roughly in half without much work on our part, as long as we were willing to give up the ability to debug jobs more than a few weeks old.
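For reference, the retention period is controlled by the yarn.log-aggregation.retain-seconds property in yarn-site.xml (in our case the change was made through Ambari rather than by editing the file directly). The change amounted to:

```xml
<!-- yarn-site.xml: how long aggregated logs are kept, in seconds -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <!-- was 2592000 (30 days); lowered to 1296000 (15 days) -->
  <value>1296000</value>
</property>
```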
We made the change on a Sunday, but four days later the number of files in the YARN logs directory had not appreciably changed. We could also see that the files had been trimmed to the last 30 days, so the deletion service was running, just not with the new configuration. This is where the real fun begins.
Covering the Obvious
First, we checked the obvious culprits. We verified that the value in Ambari was 15 days. Since the value is stored in seconds, this isn't obvious at a glance, but it checked out. We then looked through the other log aggregation configurations and found that the IndexedFile format had been enabled and set as the default. This was concerning, since it was an unusual configuration that could cause issues, but we set it aside for the moment.
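The unit conversion is worth spelling out, since a wrong zero here would be easy to miss:

```shell
# 15 days expressed in seconds, the unit the retention property uses.
expected=$((15 * 24 * 60 * 60))
echo "15 days = ${expected} seconds"

# And the reverse check: the value we saw in Ambari, divided back into days.
configured=1296000
echo "configured value = $((configured / 86400)) days"
```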
Since the configurations looked OK, we checked the configurations in the files themselves on the nodes. We were able to easily verify through some quick bash work that the configurations matched what they should be. We then went a step deeper and checked that the configuration was set before the YARN processes were started. Because the processes had been started more than 24 hours before, we couldn't use ps to get the exact time safely, so we instead used the Ambari run logs. You can see some of them below.
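The "quick bash work" amounted to extracting the property's value from yarn-site.xml on each node and comparing results. A minimal sketch of the parsing step, run here against a throwaway sample file rather than over ssh (the real path, typically something like /etc/hadoop/conf/yarn-site.xml, and the node list are assumptions):

```shell
# Sample yarn-site.xml standing in for the real file on a node.
cat > /tmp/yarn-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>1296000</value>
  </property>
</configuration>
EOF

# Print the <value> on the line following the property name.
grep -A1 'yarn.log-aggregation.retain-seconds' /tmp/yarn-site-sample.xml \
  | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
```

In practice we looped this over every node with ssh and diffed the outputs against the expected value.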
Below this screenshot, there were many more operations, starting with manual stops of various services, followed by manual starts of others, including some failed and aborted operations. Even in just the operations shown above, there is a lot more going on than a restart of the necessary services. This is a huge red flag. Ambari makes configuration changes easy: a single button restarts all components with stale configurations. The operations that start with "Restart all components with Stale Configs..." come from that button.
Additionally, the stopping and starting of all components on a single node, regardless of service, is a red flag. That operation makes sense if you are decommissioning a node, but that isn't what was happening here. Instead, it disrupts every cluster service that has a component on that node without taking any service down completely. Depending on the services and the cluster, this may or may not lead to degraded performance or incorrect data. Either way, there shouldn't generally be a need to do this.
Despite all of this, we can see that the YARN Resource Manager was started at 9:03 AM, while the configuration files on the disk were updated around 8:15 AM that same day. That means the YARN Resource Manager instance currently running should have the new configuration.
The final obvious thing to check was the logs: perhaps an error was occurring somewhere when the deletion ran. This seemed unlikely to cause the behavior we were seeing, but we wanted to be sure. Sure enough, after various rounds of grepping, the logs for the Resource Manager, the Timeline Server, and one of the Node Managers contained nothing concerning.
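The grepping was along these lines; log paths vary by distribution, so the path in the comment is an assumption, and the sample file below is invented so the filter itself can be seen working:

```shell
# Invented sample lines standing in for a Resource Manager log
# (e.g. /var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log on HDP).
cat > /tmp/rm-sample.log <<'EOF'
2019-06-06 09:03:12 INFO  AggregatedLogDeletionService: aggregated log deletion started.
2019-06-06 09:03:13 INFO  AggregatedLogDeletionService: aggregated log deletion finished.
EOF

# Look for anything alarming; fall back to a friendly message if nothing matches.
grep -iE 'error|exception|fail' /tmp/rm-sample.log || echo "no concerning messages"
```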
Strange File Format Configuration
Now that we had covered the obvious, we turned our attention to that odd configuration: YARN was set to use the indexed file format instead of the default TFile format. This was a configuration I had not seen before, so it required some digging. Sure enough, this feature was added in Hadoop 2.9 but back-ported to HDP 2.6.x, which explains how it was in use on our cluster.
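For context, the configuration in question looks roughly like this. The property names come from the Hadoop 2.9+ log aggregation file-controller feature; treat the exact values as illustrative rather than a copy of our cluster's settings:

```xml
<!-- Register the indexed format and make it the preferred (first-listed)
     controller; TFile remains available as the classic format. -->
<property>
  <name>yarn.log-aggregation.file-formats</name>
  <value>IndexedFormat,TFile</value>
</property>
<property>
  <name>yarn.log-aggregation.file-controller.IndexedFormat.class</name>
  <value>org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController</value>
</property>
```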
During my investigation, I stumbled across YARN-8279, which describes a bug in the deletion service where files were not deleted if the indexed file format was used. This seemed intriguing, considering we were using this feature in an early version, and seeing issues surrounding files not being deleted.
After more consideration, however, this couldn't be our issue. Log files were being deleted; they were just being kept for 30 days instead of the desired 15. Additionally, that bug involved the Node Managers appending -ifile to the suffix value, which caused the deletion service to miss those files. In our HDFS, the files were still placed under the original suffix folder, so that bug wasn't in play.
The Obvious Answer That Wasn't
If you've been paying attention, you've likely noticed something that had been bugging me this entire time. We know how the files are placed, and we know the configuration that controls deletion, but which service actually performs the deletions? I had been working on the premise that the YARN Resource Manager did the deleting, but what if that wasn't the case? Given the botched deployment, what if the right process was never restarted?
All of the documentation I could find said the deletion runs in the history server. But YARN doesn't have a history server anymore, just a Resource Manager and a Timeline Server. Working from that, I turned to the Hadoop source code and eventually found the file for the deletion service. I then cloned the code and, using IntelliJ, traced the call path up to the history server's main function. At that point, the trail went cold.
Thanks to some help from the Hadoop community, and especially Prabhu Josephraj from Cloudera, I found out that the history server is maintained by the MapReduce v2 set of services, even though it handles log aggregation for YARN and reads its configuration from yarn-site.xml. Looking back, the namespace and directory of the history server code above gave it away, but this seemed so bizarre to me at first that I didn't even consider it.
With confirmation from Cloudera and the community in hand, I asked the administrators to restart the MapReduce v2 History Server. They made the change, and shortly afterward we confirmed that only 15 days of logs remained in HDFS.
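Confirming the cleanup was a matter of listing the aggregated log directory and checking the oldest modification dates. In practice the listing came from something like `hdfs dfs -ls -R /app-logs` (the path varies by install); the sketch below runs the same date filter over a canned listing so it can be checked without a cluster:

```shell
# Cutoff date 15 days back (GNU date, with a BSD date fallback).
cutoff=$(date -d '15 days ago' +%Y-%m-%d 2>/dev/null || date -v-15d +%Y-%m-%d)

# Canned `hdfs dfs -ls`-style lines; field 6 is the modification date.
printf '%s\n' \
  "-rw-r--r--  3 hdfs hdfs 1048576 2019-01-02 09:15 /app-logs/alice/logs/application_1" \
  "-rw-r--r--  3 hdfs hdfs 2097152 2099-01-01 10:30 /app-logs/alice/logs/application_2" \
  | awk -v cutoff="$cutoff" '$6 < cutoff {print "older than cutoff:", $NF}'
```

Anything printed by the filter is a file the deletion service should already have removed; after the restart, the real listing came back clean.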
The biggest lesson out of all of this is how important it is for administrators to understand how the system works, at least at a high level. If you are making configuration changes but don't know which processes use that configuration, or how the process works, things like this can cause real issues.
This was one of those issues that would have been an easy fix with better documentation. Instead, this system is woefully under-documented. I hope that with this post and the last one, I've managed to add documentation that will help future administrators dealing with this and related issues.