Stop Feeding the Small File Monster!
Updated: Nov 21, 2019
If you've worked with Hadoop for any amount of time, you've likely run into the dreaded small file problem. While the advent of online storage solutions like S3 and ADLS has reduced the issue, it still exists for the hundreds of companies that use HDFS.
HDFS, like any file system, stores data in blocks. A block is a fixed-size chunk of disk space; one or more blocks together make up a file. Every block has the same size, called the block size. This makes disk usage more efficient: instead of odd, small areas of free space that are essentially unusable, free space always comes in multiples of the block size.
The downside to this approach is that, on a local disk, if you don't have enough data to fill a block, you still use the full block's worth of disk space. The obvious way to keep this from wasting too much space is to use a small block size. On a normal hard disk, the block size is typically somewhere between a few hundred bytes and a few kilobytes.
HDFS is different, however, since pointers to the blocks are held in memory, and an HDFS cluster often holds much more data than a single hard disk. To keep the number of pointers small enough for the name node to hold in memory, HDFS uses a much larger block size: 128 MB by default in Hadoop 2 and later, and often configured higher, for example 256 MB. Thankfully, this doesn't lead to wasted space, since HDFS only stores the actual data on disk and not the full block size.
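The difference between the two allocation models can be sketched with a little arithmetic. This is an illustration only: the function names are made up for this post, and the 128 MB block size is an assumed configuration value, not something read from a cluster.

```python
# Assumed HDFS block size in bytes; check dfs.blocksize on a real cluster.
HDFS_BLOCK_SIZE = 128 * 1024 * 1024


def blocks_needed(file_size: int, block_size: int = HDFS_BLOCK_SIZE) -> int:
    """Number of blocks a file of file_size bytes is split into."""
    # Integer ceiling division; an empty file needs no blocks.
    return (file_size + block_size - 1) // block_size


def local_disk_usage(file_size: int, block_size: int) -> int:
    """Traditional file system: space is allocated in whole blocks."""
    return blocks_needed(file_size, block_size) * block_size


def hdfs_disk_usage(file_size: int) -> int:
    """HDFS: each replica stores only the actual bytes of the file."""
    return file_size
```

With a 4 KB local block size, a 1-byte file occupies 4096 bytes on a local disk, while the same file in HDFS occupies 1 byte per replica. It still costs one block entry in the name node, which is where the real problem starts.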
Despite this, there is another issue that can plague an HDFS cluster: the extra pressure that small files put on the memory of the name node. Above we noted that the name node tracks all of the blocks in the cluster. Additionally, it tracks all of the files, along with a list of the blocks that make up each file. One file that takes up 10 blocks uses much less name node memory than 10 files that take up one block each, because of all the metadata that needs to be kept for each file. Thus, having all of your files take up only one block each will increase your memory usage in the name node substantially. This is the small file monster.
(Note: Thanks to Michael Melnick in the comments below, who pointed out my incorrect statements about the behavior of HDFS with regard to incomplete blocks. I have corrected the information above to state that HDFS does not suffer from that issue, as opposed to a traditional file system such as NFS.)
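To put rough numbers on the comparison above, here is a minimal sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of name node heap per namespace object (a file or a block); that figure is an approximation, not a measurement, and the function names are hypothetical.

```python
# Assumed rule of thumb, not a measured value: heap bytes per file or block.
BYTES_PER_OBJECT = 150


def namenode_objects(num_files: int, blocks_per_file: int) -> int:
    """Each file is one namespace object, plus one object per block."""
    return num_files * (1 + blocks_per_file)


def estimated_heap_bytes(num_files: int, blocks_per_file: int) -> int:
    """Very rough heap estimate for the given files."""
    return namenode_objects(num_files, blocks_per_file) * BYTES_PER_OBJECT


one_large_file = namenode_objects(num_files=1, blocks_per_file=10)
ten_small_files = namenode_objects(num_files=10, blocks_per_file=1)
```

Under these assumptions, one 10-block file costs 11 objects while ten 1-block files cost 20, and the gap widens as the data per file shrinks. Ten million single-block files come out at roughly 3 GB of heap by this estimate.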
Finding Small Files
Heap size in Java is always stuck between two extremes. Too low, and there isn't enough room for the memory you need. Too high, and garbage collection takes forever, essentially shutting down your service for minutes at a time. In a normal situation, you can set the heap to a reasonable size and be fine. With the name node process in particular, heap requirements grow slowly as files are added to the cluster. Outside of situations where many thousands or millions of files are added suddenly, you should know far in advance when you'll need more heap space for the name node. This lets you keep the heap small enough to avoid garbage collection headaches.
When you can't set the heap size any smaller without running out of space, and you still get GC pauses lasting minutes, that's when you know you have a problem. At that point, you should start investigating how many small files exist on the system. A good definition of a small file is one that is smaller than 1 MB. Comparing the total number of files with the total space they occupy is also a good test: a high file count with little space used means a low average file size.
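The file-count-versus-space check can be sketched as follows. It assumes the default output of `hdfs dfs -count <path>`, which prints the directory count, file count, content size in bytes, and path on one line; the helper names and the 1 MB threshold are illustrative choices.

```python
SMALL_FILE_BYTES = 1024 * 1024  # the 1 MB definition used in this post


def average_file_size(count_line: str) -> float:
    """Average bytes per file from one line of `hdfs dfs -count` output.

    Expected columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
    """
    _dirs, file_count, content_size, _path = count_line.strip().split(None, 3)
    files = int(file_count)
    return int(content_size) / files if files else 0.0


def looks_like_small_files(count_line: str,
                           threshold: int = SMALL_FILE_BYTES) -> bool:
    """True when the path has files and their average size is under threshold."""
    files = int(count_line.split()[1])
    return files > 0 and average_file_size(count_line) < threshold
```

A low average does not prove every file is small, but it is a cheap first signal for deciding which directories deserve a closer look.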
If you're running an HDP or HDF cluster, then you can use the Activity Explorer provided by SmartSense to find this information. Its HDFS Dashboard shows your distribution of file sizes, as well as the users who have the most small files. Otherwise, you can use a bash script or something like the HdfsCLI Python library to generate this list and analyze it. Be aware that these scripts can take a long time to run on a large cluster.
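As one possible shape for such a script, the sketch below counts small files per directory from the text output of `hdfs dfs -ls -R`. It assumes the usual eight-column listing (permissions, replication, owner, group, size, date, time, path); the function name is made up, and the threshold is the 1 MB definition from above.

```python
import posixpath
from collections import Counter

SMALL_FILE_BYTES = 1024 * 1024  # 1 MB


def small_files_by_dir(listing_lines, threshold=SMALL_FILE_BYTES):
    """Count files smaller than threshold, grouped by parent directory."""
    counts = Counter()
    for line in listing_lines:
        parts = line.strip().split(None, 7)
        # Skip headers, blank lines, and directory entries ("d...").
        if len(parts) < 8 or parts[0].startswith("d"):
            continue
        size, path = int(parts[4]), parts[7]
        if size < threshold:
            counts[posixpath.dirname(path)] += 1
    return counts
```

In practice the lines would come from something like `subprocess.run(["hdfs", "dfs", "-ls", "-R", "/data"], capture_output=True, text=True).stdout.splitlines()`, and `counts.most_common(20)` would list the worst directories first.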
Fixing the Issue
Keep in mind that some small files are expected. For example, it is common to keep source code or documentation in HDFS in order to ensure it is available across the cluster. These files are unlikely to fill a whole block, or even to reach 1 MB. For these files there isn't a lot you can do.
If you find a lot of data files that belong to the same dataset and sit in the same folder, that is a surefire indication that you can combine them into fewer, larger files. Another common issue I see is a table partitioned so finely that each partition holds only a little data, producing a bunch of small files. Changing the partitioning scheme to make fewer, larger partitions helps fix this. You may also find common utility files and JARs, which can be placed in one central location for all teams to use.
Preventing the Issue
Once you've gotten the problem under control, you need to take a step back and work out how to keep your gains. You certainly don't want a fire drill every six months when new systems inevitably cause the issue again.
The key is to educate the developers. Preventing this issue starts at design and goes all the way through implementation. Developers need to consider how large their files will be when they work on the design. If there are issues, there are a few ways to fix them. While these are only a few of the many options, they are ones I've seen work.
If your job is doing a full load of data every time, then things are a lot easier, since you only have the files you generate to think about. Often, the output is few enough files that you can ignore the issue entirely. If you are generating so many files that it is a concern, however, reducing the number of workers will reduce the number of files and make each file bigger. This can be done through a coalesce call in Spark, or by changing the merge properties on your Hive job.
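One way to pick the number to pass to coalesce is to divide the expected output size by a target file size. The sketch below is plain Python with a hypothetical helper name; the 256 MB target is an assumption that you would match to your cluster's block size.

```python
TARGET_FILE_BYTES = 256 * 1024 * 1024  # assumed target size per output file


def target_file_count(total_bytes: int,
                      target_file_bytes: int = TARGET_FILE_BYTES) -> int:
    """How many output files give roughly target_file_bytes each."""
    if total_bytes <= 0:
        return 1
    # Ceiling division, and never fewer than one file.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


# In a PySpark job the result could be used like this (not run here):
#   df.coalesce(target_file_count(estimated_output_bytes)).write.parquet(path)
```

For a 10 GB output this gives 40 files instead of however many tasks happened to run. On the Hive side, the comparable knobs are the `hive.merge.*` settings, such as `hive.merge.mapfiles` and `hive.merge.size.per.task`; check the values that suit your Hive version and execution engine.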
If the small files are a result of frequent small batch jobs that make new files each time, then you can condense the data regularly so the small files don't linger too long. If the data is part of a partitioned Hive table, then it's natural to condense a partition when you are "closing" it and creating a new one. You can condense using an INSERT OVERWRITE command in Hive, a Spark job, or even a simple bash job using the HDFS command-line utility. If the data set isn't partitioned, then a regularly scheduled job that condenses the existing data is reasonable as well.
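Whichever tool does the actual rewriting, the condensing job first has to decide which files to merge together. Here is a minimal planning sketch under stated assumptions: the input is a list of (path, size) pairs for one closed partition, the function name and both thresholds are hypothetical, and the merge itself is left to Hive, Spark, or a shell job.

```python
MB = 1024 * 1024


def plan_compaction(files, target_bytes=256 * MB, small_bytes=1 * MB):
    """Group small files into batches of at most about target_bytes each.

    files: iterable of (path, size_in_bytes) pairs.
    Returns a list of batches, where each batch is a list of paths to merge.
    """
    batches, current, current_size = [], [], 0
    for path, size in sorted(files):
        if size >= small_bytes:
            continue  # already large enough; leave it alone
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    # Merging a single file with nothing else gains nothing.
    return [batch for batch in batches if len(batch) > 1]
```

Each returned batch would become one output file. Running this only on partitions that are no longer being written to avoids racing with the jobs that are still producing small files.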
Finally, you might consider using a technology stack that doesn't write your files directly to HDFS. The obvious alternatives would be cloud storage technologies like S3 and ADLS, but this also includes on-premises data storage services like Kafka and HBase. (HBase runs on top of HDFS, but it manages and compacts its own storage files.) These won't have the small file problem, but they do dictate how data should be stored. Because of this, the approach is only really valid when your access patterns match the ones the storage is designed for. For example, if you need to run analytical queries on your data, HBase and Kafka likely won't be a good fit for you. If you need to look up specific rows, however, HBase might be a good option.
If, after going through the files, you find that few of them are actionable based on the above, then you may legitimately have a lot of useful files. This applies to a very small number of users, but it does happen. For these users, a good solution is HDFS Federation. HDFS Federation essentially splits up an HDFS cluster based on the root folder. So you can now say that /data uses one name node, while /projects uses another name node instance. Making this split reduces the pressure on each name node, since it no longer has to track as many files. From the outside, requests are routed to the appropriate name node instance, in a similar way to how HDFS handles high availability.
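The routing idea can be pictured as a lookup from path prefix to name node. The sketch below is only a model of that idea, not HDFS code: the mount table contents and name node names are hypothetical, and on a real federated cluster this mapping typically lives in client-side configuration such as a ViewFs mount table.

```python
# Hypothetical mount table: root folder -> name node responsible for it.
MOUNT_TABLE = {
    "/data": "namenode-a",
    "/projects": "namenode-b",
}


def resolve_namenode(path: str, mount_table=MOUNT_TABLE) -> str:
    """Return the name node for path using the longest matching prefix."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so /database does not match /data.
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no name node is mounted for {path}")
    return mount_table[best]
```

Because each name node only holds the metadata under its own prefixes, the file and block counts that drive heap usage are split between them.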
This has just been a very high-level look at the small file monster and what we can do to slay it. There are many more ways to solve this problem, and covering all of them would take up (and probably already has taken up) a book. I hope that if you are dealing with this problem, you can take the ideas above and look for a solution that best fits your current architecture and requirements. If you have solutions you think work as well, feel free to add comments below describing them!