Rationale
This post is based on research I did while investigating a bug we found after tweaking some settings related to YARN log retention. I'll be posting a post-mortem of that investigation next week, but first I wanted to document the system itself, since it is woefully under-documented, both in the official documentation and on forums.
This is all of the information about the system that I have found useful. Note that all configurations I mention below are in the yarn-site.xml file.
In the interest of letting you play along at home when we go through the investigation, you may want to skip the deletion service section at the end, since it contains the information that we were missing during our investigation.
YARN Log Aggregation Overview
If you've worked with YARN enough, you're likely aware that YARN logs are kept in HDFS. Other than that, you likely haven't looked into how those logs are put there, how they are removed over time, or anything other than where to look to find the YARN job logs. I was the same way when this issue first came through, so the first thing I did was investigate the system, to make sure the right configurations and services were being used.
The system that maintains the application logs in HDFS is called the Log Aggregation system, and it is flexible enough to handle any Hadoop-compatible file system, not just HDFS. This includes S3, ADLS, and SAN disks.
The logs are uploaded by the node managers at the end of each YARN job and are then deleted periodically by the deletion service. We'll cover each piece in turn.
Node Manager Log Aggregation
By default, YARN keeps the logs on the individual worker nodes' local disks for a period set by yarn.nodemanager.log.retain-seconds, after which they are deleted. Because the logs live on the individual nodes, there is no central place you can go to get them, nor is there a way to get the logs from a node that has gone down.
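For reference, here's what that looks like in yarn-site.xml; a minimal sketch, with the value set to what I believe is the stock default of three hours:

```xml
<!-- yarn-site.xml: how long node managers keep local container logs
     when log aggregation is NOT enabled
     (10800 seconds = 3 hours, the stock default in yarn-default.xml) -->
<property>
  <name>yarn.nodemanager.log.retain-seconds</name>
  <value>10800</value>
</property>
```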
That's where log aggregation comes into play. Log aggregation is enabled by the yarn.log-aggregation-enable configuration. It defaults to false, although most distributions seem to flip the default to true, since that is the value that makes the most sense anyway. If log aggregation is enabled, the logs are placed in the directory at yarn.nodemanager.remote-app-log-dir once the job has completed. As mentioned above, remote-app-log-dir can be any path: an HDFS path, a local path to a NAS, or even something like S3 or ADLS (assuming the core-site configuration has been set up for those blob stores). If no protocol is specified, HDFS is assumed, just as it would be with the hdfs dfs command.
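Here's a minimal yarn-site.xml sketch enabling aggregation; /tmp/logs is the stock default destination, written here without a scheme so it resolves against the default filesystem:

```xml
<!-- yarn-site.xml: turn on log aggregation and pick a destination.
     With no scheme on the path, the default filesystem (typically
     HDFS) is assumed. -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/tmp/logs</value>
</property>
```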
When log aggregation is enabled, yarn.nodemanager.log.retain-seconds is essentially ignored.
Under the directory you specify, a directory is created for each user. Beneath each user directory is the directory (or directories) specified by yarn.nodemanager.remote-app-log-dir-suffix. That means the YARN application logs for a given user will be located under ${yarn.nodemanager.remote-app-log-dir}/${user}/${yarn.nodemanager.remote-app-log-dir-suffix}.
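A sketch of the suffix setting; "logs" is the stock default, and the path in the comment uses a hypothetical user and application ID purely for illustration:

```xml
<!-- yarn-site.xml: the per-user subdirectory name. With the settings
     above, logs for a hypothetical user "alice" would land under
     /tmp/logs/alice/logs/application_<id> -->
<property>
  <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
  <value>logs</value>
</property>
```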
If you've ever seen a note about log aggregation status on a YARN Resource Manager UI page, the status shown there is actually the status of the node managers sending their logs to HDFS or whatever other storage you specified. That section is a good way of determining whether your log aggregation configuration is working properly.
A useful feature is forcing YARN to aggregate logs while the job is still running. For long-running jobs such as Spark Streaming jobs, this is invaluable. To enable it, set yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds to a positive value. When this is set, a timer fires at the given interval, and each time it does, log aggregation runs on any new files. Note that this will likely slow down your job, especially at low values; for that reason, the lowest you can set it to is one hour.
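A sketch of enabling rolling aggregation at that one-hour floor:

```xml
<!-- yarn-site.xml: aggregate new log files every hour while the job
     runs (-1, the default, disables rolling aggregation entirely) -->
<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>3600</value>
</property>
```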
One final configuration to know about is only available from Hadoop 2.9 onwards: yarn.log-aggregation.file-formats, which sets the file format(s) in which the aggregated logs are saved. By default, this is TFile, but it can be set to anything, provided you have the right classes available on the classpath. The only other option currently built into Hadoop is IndexedFile, which is meant to have better performance for large log files.
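A sketch of opting into the indexed format; as I understand it, the value is a comma-separated priority list, so keeping TFile at the end leaves it as a fallback:

```xml
<!-- yarn-site.xml: prefer IndexedFile, fall back to TFile
     (Hadoop 2.9+ only) -->
<property>
  <name>yarn.log-aggregation.file-formats</name>
  <value>IndexedFile,TFile</value>
</property>
```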
Log Aggregation Deletion Service
Reminder: For those who don't want to ruin the fun of the investigation debrief next week, skip this section for now.
The logs are retained for yarn.log-aggregation.retain-seconds seconds, which is checked every yarn.log-aggregation.retain-check-interval-seconds seconds. By default, the check interval is set to 10% of the retention period, so if your retention period is 30 days, every 3 days we will check and delete any old files.
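A sketch matching the 30-day example above:

```xml
<!-- yarn-site.xml: keep aggregated logs for 30 days; -1 for the check
     interval means "retention period / 10", i.e. every 3 days here -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>2592000</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-check-interval-seconds</name>
  <value>-1</value>
</property>
```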
This timer is maintained by the log aggregation deletion service, which is run as part of the MapReduce v2 History Server. Despite the location, this service handles the logs generated by all YARN jobs, including non-MapReduce jobs. I am not clear on the rationale for this service remaining in MapReduce instead of moving to YARN like the rest of the services, but that is the state as of Hadoop 2.9, and likely on into Hadoop 3 as well.
Because this process runs on a different node than the Node Managers, take care to keep the configuration consistent between the Node Managers and the MapReduce v2 History Server. Otherwise, logs will not be found and deleted appropriately.
Conclusion
Hopefully, this gives you a good idea of the internal workings of YARN log aggregation. There are certain things I've glossed over here, but this should at least give you the background to work with the system efficiently.
You can find more information on these and other configuration options here.