Oozie is one of those technologies that has been around forever and yet is often ignored, much like ZooKeeper. Certainly I have ignored it so far on this blog. After talking with an old colleague about some Oozie standards I had given him a few years ago, I realized the omission and wanted to write a post to give Oozie its due. I'm not going to cover the technical aspects of Oozie; instead I'll focus on best practices, such as when and why you should use Oozie at all, and when to use bundles. Hopefully this post gives you some appreciation for Oozie as a service and helps you understand its role in the Hadoop ecosystem.
Why Oozie?
Oozie is one of the oldest services in the Hadoop ecosystem (alongside YARN, Spark, HBase, Hive, etc.), and a variant of it is included with the two biggest distributions out there (namely HDP and CDH). Despite that, of all the standard services in the Hadoop ecosystem, Oozie has to be the most frequently replaced. With so many newer competitors out there, from Luigi to Airflow and even NiFi, it isn't hard to see why. Oozie is definitely an older technology, and it hasn't had a good update to catch up with newer trends, such as submitting jobs through a web interface or creating pipelines programmatically.
That said, I think there is still room in the world for Oozie. First of all, it is the workflow manager included with both HDP and CDH. While it may not be included with CDP, many teams will still be using the two legacy distributions for a while to come. Because of this, a team just starting its big data journey that doesn't have an enterprise-wide workflow management tool, or has one that doesn't work well with Hadoop, should give Oozie a serious look.
Additionally, while Oozie may not have the features or shine that Airflow has, it has enough tools to do the job right. The only area where I've genuinely felt Oozie held me back is error handling, and even those issues are minor and can be worked around with email actions and good logging practices.
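For example, a failure-notification action is only a few lines. Here's a minimal sketch (the address and node names are hypothetical); routing every action's error transition through something like this goes a long way:

```xml
<!-- Hypothetical failure-notification action: other actions' error paths
     transition here, so failures arrive with context attached. -->
<action name="notify-failure">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>data-eng-oncall@example.com</to>
        <subject>Oozie workflow ${wf:name()} failed</subject>
        <body>Failed node: ${wf:lastErrorNode()}
Error: ${wf:errorMessage(wf:lastErrorNode())}</body>
    </email>
    <ok to="fail"/>
    <error to="fail"/>
</action>
```

Note that the Oozie server needs its SMTP settings configured before email actions will actually send anything.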
Finally, Oozie is distributed, allowing us to spread the load across our cluster and to handle node failures more gracefully. While the same is true of a lot of workflow management tools, ASG-Zena and *shudder* cron both lack this. I have seen clients use both in lieu of Oozie, and both caused various issues that would've been avoided by using Oozie.
What Are Bundles Good For?
One question that comes up frequently, and the question which was the impetus for this post, is when to use Oozie bundles. Bundles are collections of coordinators which can be stopped and started together, and can share configuration information. They can also be used to share dependencies between coordinators (more on that later).
And that's pretty much it. Which leads inevitably to the question of "Do I need a bundle?" I very rarely recommend that clients use bundles, so the default answer to that question is probably no. The one place I have seen them be helpful is on multi-tenant clusters running systems with multiple coordinators. In that case, it is useful to be able to stop and start all of a team's coordinators at once during an upgrade of that team's code. Without bundles, you'd need to know every coordinator a given team runs and stop each of them individually.
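For illustration, here's a rough sketch of what such a bundle looks like (all names and paths are hypothetical); it just lists the coordinators it owns, plus any shared configuration:

```xml
<bundle-app name="team-a-pipelines" xmlns="uri:oozie:bundle:0.2">
    <!-- Shared properties that every coordinator in the bundle can reference -->
    <parameters>
        <property>
            <name>baseDir</name>
            <value>${nameNode}/apps/team-a</value>
        </property>
    </parameters>
    <coordinator name="ingest-coord">
        <app-path>${baseDir}/coordinators/ingest</app-path>
    </coordinator>
    <coordinator name="aggregate-coord">
        <app-path>${baseDir}/coordinators/aggregate</app-path>
    </coordinator>
</bundle-app>
```

With that in place, a single `oozie job -suspend <bundle-id>` (and later `-resume`) pauses and restarts everything the team owns in one shot.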
How Do I Make A Coordinator Run When I Want It To?
The last topic I'll discuss today is dealing with dependencies between coordinators. At one client, we had a team in charge of ingesting data from outside systems and putting it into base tables meant to be used by the rest of the company. Every other team would then read from these tables to populate their own tables and datastores. This all sounds great, and it was, but it led to a problem. If the ingestion coordinator starts at 2 AM, how do we know when it will finish so the other systems can start running?
Well, we can add an input event on a dataset. This essentially tells Oozie not to start the coordinator's action until a dataset instance, or even a specific done flag, has been generated by another coordinator. In theory this should work perfectly, and in fact it can.
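To make that concrete, here's roughly what the downstream coordinator would look like (the paths, names, and times are hypothetical): it declares the upstream output as a dataset and waits on the current instance before running its workflow:

```xml
<coordinator-app name="downstream-coord" frequency="${coord:days(1)}"
                 start="2019-01-01T04:00Z" end="2020-01-01T04:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <!-- One instance of the upstream team's output per day -->
        <dataset name="base-tables" frequency="${coord:days(1)}"
                 initial-instance="2019-01-01T02:00Z" timezone="UTC">
            <uri-template>${nameNode}/data/base/${YEAR}${MONTH}${DAY}</uri-template>
            <!-- The instance isn't "ready" until this flag file exists -->
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <!-- Block this coordinator's action until today's instance is ready -->
        <data-in name="base-ready" dataset="base-tables">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${appPath}/workflows/downstream</app-path>
        </workflow>
    </action>
</coordinator-app>
```

Every one of those values (frequency, initial-instance, timezone, the instance expression) has to line up with the upstream coordinator, which is exactly where the trouble starts.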
The problem is getting it all to work exactly right. I've seen clients struggle for weeks to get the parameters and syntax exactly right so that everything lines up. At the end of the day, they get it working, but they're never really confident it will keep working.
Instead, I recommend a drastically different approach: consumers of data assume the data is always available, and update from whatever data is new. There are a lot of ways to do this, most of which you'd likely have to do anyway with a dataset approach. Approaches I've seen work include partitioning based on ingestion timestamp, Kafka topics carrying the data (or metadata about which rows are new), and keeping a file in HDFS with the last update timestamp read, then reading anything newer than that. All of these approaches have pros and cons, and really deserve a post of their own.
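As a minimal sketch of the watermark-file variant (the script, paths, and names here are all hypothetical), the consuming workflow can read the stored timestamp with a shell action and hand it to the load step via captured output:

```xml
<workflow-app name="incremental-load" xmlns="uri:oozie:workflow:0.5">
    <start to="read-watermark"/>

    <!-- read_watermark.sh (hypothetical) reads the watermark file from HDFS and
         prints "last_update=<timestamp>"; capture-output records that for later actions -->
    <action name="read-watermark">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>read_watermark.sh</exec>
            <file>${appPath}/bin/read_watermark.sh</file>
            <capture-output/>
        </shell>
        <ok to="load-new-rows"/>
        <error to="fail"/>
    </action>

    <!-- The Hive script filters for rows newer than the watermark -->
    <action name="load-new-rows">
        <hive xmlns="uri:oozie:hive-action:0.5">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>${appPath}/hive/load_new_rows.hql</script>
            <param>last_update=${wf:actionData('read-watermark')['last_update']}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Incremental load failed at ${wf:lastErrorNode()}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The final step of a successful run would write the new watermark back to HDFS, so the next run picks up where this one left off.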
That said, this approach also solves a second issue which data dependencies do not: it makes a full reload of the data possible. With a dataset approach, if you need to fully rebuild a table (due to a change in the processing logic, perhaps), you'd have to simulate a full reload in the source system, either by actually doing that reload or by creating fake done flags. Even then, columns like ingestion timestamps can make this more difficult than it should be.
Conclusion
These are just a few tips and tricks for how to use Oozie. Hopefully they give you a better appreciation for Oozie as a far-from-perfect yet convenient workflow management tool for Hadoop. Feel free to comment if you disagree, or if you have other tips for using Oozie that you'd like to share!