Running Garbage Collection on Your Cluster
As companies start to use their new Hadoop cluster, it feels like new projects are being added every day. This team wants to try out some new ML algorithms on the cluster, and that team wants to offload analytics work from their ancient Oracle database onto the new hotness. I once worked with a client that had 11 projects in flight to move onto Hadoop. The thing was, the cluster was still being stood up and stabilized, and we didn't even have DevOps processes for the cluster yet.
Anyone who has worked with a Hadoop cluster long enough can tell you that not all of these projects end up being long-running. Some may need just one run to get the information the team needed, and are then abandoned. Others might not solve the problem originally intended, so they are left in place. Still others run out of budget, or stall for any number of other reasons.
Oftentimes, however, these projects aren't deleted from the cluster. The data may still be sitting in HDFS, the Hive tables still present, the code still available somewhere on the cluster. Sometimes, scheduled jobs launched through Oozie or cron even keep running, with no one stopping them. All of this takes up room in HDFS and run time on your YARN queues. Not only that, but it makes the system more confusing for a new user. And what happens if you upgrade the cluster? A good administrator will ensure that all projects have their libraries updated accordingly, but these projects no longer have an owner, making the process of figuring out what the jobs are and whether they can be removed that much more difficult.
Picking Up After Yourself
I recently worked with a client who found themselves in exactly this situation. They were staring down an upgrade from HDP 2.6.x to HDP 3.1. Since they weren't using Hive Interactive, this meant upgrading from Hive 1.3.1 to Hive 3.x. To do this upgrade successfully, there are tasks that need to be done for each and every Hive table, since this jump converts traditional Hive managed tables into Hive ACID tables.
We started looking at the tables present in the cluster and found nearly 40,000 in production, with similar numbers in the lower environments. Even automated, the conversion would take a while; for the changes that couldn't be automated (the ones requiring a human touch), we were looking at man-months of effort to get this done!
While trying to get a better handle on these tables, we found that a lot of them were fairly obviously temporary or experimental. For example, I think we can agree that default.junk or default.abcd probably aren't useful to anyone anymore.
At this point, my colleague Ron Lee and I started formulating a plan to get rid of all of these useless data sets and jobs. Ron coined the term "Cluster Garbage Collection", which I'm partial to as well.
So what does cluster garbage collection involve? At a high level, it is simply a matter of going through the cluster, taking inventory of the data and processes running there, and determining which pieces can safely be removed. Some example tasks include:
Dropping tables that haven't been accessed or modified recently
Stopping Oozie bundles and coordinators that support retired projects
Deleting HDFS folders for projects or users that are no longer active
Removing Ranger policies that are no longer used
Obviously doing this carries a lot of risk: if you delete data that is still needed, you've caused yourself a huge headache. You can automate some of the checking by looking at Ranger audit logs and the timestamps in Hive and HDFS metadata to confirm that nothing has been accessed or modified recently. That is a good way to narrow down your initial list, but you should still ask around and find an owner, or at least someone who knows something about the project; the user who owns the tables and data is a good place to start. That person should be consulted before any deletion occurs.
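To make that narrowing step concrete, here is a minimal sketch of the staleness filter. It assumes you have already pulled an inventory of items with their last-access and last-modified timestamps (from Ranger audit logs and HDFS/Hive metadata); the `INVENTORY` entries and the 180-day idle threshold are hypothetical placeholders, not anything from a real cluster.

```python
from datetime import datetime, timedelta

# Hypothetical inventory entries: (name, last_access, last_modified).
# In practice these timestamps would come from Ranger audit logs and
# HDFS/Hive metadata.
INVENTORY = [
    ("default.junk",       datetime(2018, 3, 1),  datetime(2018, 2, 15)),
    ("sales.daily_orders", datetime(2019, 6, 20), datetime(2019, 6, 19)),
    ("default.abcd",       datetime(2017, 11, 5), datetime(2017, 11, 5)),
]

def stale_candidates(inventory, now, max_idle_days=180):
    """Return names of items neither accessed nor modified in the
    last max_idle_days -- candidates for garbage collection, pending
    a human check with the owner."""
    cutoff = now - timedelta(days=max_idle_days)
    return [
        name
        for name, last_access, last_modified in inventory
        if last_access < cutoff and last_modified < cutoff
    ]

print(stale_candidates(INVENTORY, datetime(2019, 7, 1)))
```

The output is only a candidate list; each name still goes to a human for sign-off before anything is dropped.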
To be extra safe, I would also recommend compiling a list of all of the items to be deleted and distributing it amongst the users of the cluster for a final check. This won't be foolproof, since some people won't look at it, or will only skim it, and so on. That said, it is a good last chance before those items are removed. For things like HDFS files, you might even consider moving everything into a directory such as /ToBeDeleted and waiting a while to see if any complaints come up. That way, you haven't actually deleted the files yet, and can restore them easily enough.
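One small detail that makes the /ToBeDeleted approach reversible is preserving each file's original directory layout under the quarantine root, so a restore is just the reverse move. A sketch of that path mapping, assuming a quarantine root of /ToBeDeleted (the name here is just the example from above):

```python
import posixpath

STAGING_ROOT = "/ToBeDeleted"  # hypothetical quarantine directory

def staging_path(path, staging_root=STAGING_ROOT):
    """Map an HDFS path to its quarantine location, keeping the
    original directory structure so the move can be reversed to
    restore the file."""
    return posixpath.join(staging_root, path.lstrip("/"))

print(staging_path("/user/alice/old_project"))
```

The actual move would then be something like `hdfs dfs -mv /user/alice/old_project /ToBeDeleted/user/alice/old_project`, and restoring is the same command with the arguments swapped.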
As I mentioned above, we have just started to explore this concept, but I am excited about its possibilities. As we try this model at various clients, I'll expound more on the details of what works and what doesn't. If you have any ideas, feel free to add a comment below or send me an email. Hopefully this idea jogs your memory about items you need to remove from your own cluster, and you get to it yourself!