Last week, we discussed a concept I've been using with clients for years called I/O interfaces. It lets us abstract away external systems in a generalized fashion so that we can switch between systems easily, even dynamically based on context if necessary. Check out the blog post linked above to catch up, since this week we are diving into the technical details, the architectural concepts behind them, and how general they may be.
The Nuts-and-Bolts of It
Each category of I/O interface starts as an abstract interface that covers all of the functionality we need from that type of service. For example, a messaging service I/O interface might have a getNextMessage and a putMessage function, ignoring advanced features, such as partitions, that some services support and others do not. We then implement this interface for each service we want to support. Often this will initially include two implementations: one that mocks things out for testing purposes, and one that uses the service we actually want to use. This may look like the architecture below to start.
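As a rough sketch of that starting point in Python, the messaging example could look something like the code below. The names are my own for illustration (getNextMessage and putMessage become snake_case methods), and the production class assumes Amazon SQS purely as an example service. The SQS client is passed in rather than created, so the sketch can be imported without boto3 installed.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Any, Optional


class MessagingInterface(ABC):
    """I/O interface: only the operations every messaging service can offer."""

    @abstractmethod
    def get_next_message(self) -> Optional[str]:
        """Return the next message body, or None if nothing is waiting."""

    @abstractmethod
    def put_message(self, message: str) -> None:
        """Send one message to the service."""


class InMemoryMessaging(MessagingInterface):
    """Test implementation backed by an in-process FIFO queue."""

    def __init__(self) -> None:
        self._queue = deque()

    def get_next_message(self) -> Optional[str]:
        return self._queue.popleft() if self._queue else None

    def put_message(self, message: str) -> None:
        self._queue.append(message)


class SqsMessaging(MessagingInterface):
    """Production implementation (assumed here to be Amazon SQS).

    `client` is expected to behave like boto3.client("sqs"); it is injected
    so that this sketch does not need boto3 installed to be imported.
    """

    def __init__(self, client: Any, queue_url: str) -> None:
        self._client = client
        self._queue_url = queue_url

    def get_next_message(self) -> Optional[str]:
        response = self._client.receive_message(
            QueueUrl=self._queue_url, MaxNumberOfMessages=1
        )
        messages = response.get("Messages", [])
        if not messages:
            return None
        # Delete after reading so the message is not delivered again.
        self._client.delete_message(
            QueueUrl=self._queue_url, ReceiptHandle=messages[0]["ReceiptHandle"]
        )
        return messages[0]["Body"]

    def put_message(self, message: str) -> None:
        self._client.send_message(QueueUrl=self._queue_url, MessageBody=message)
```

Code that only knows about MessagingInterface cannot tell which of the two it has been handed, which is the whole point.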
The chosen implementation can be created at the top level (either in the initial entry point for a serverless function or in the test) and then passed through to wherever these services are needed in the system. If there are only one or two external services, passing them directly may be feasible, but as the number of services swells, it becomes necessary to create an object which contains all of our I/O Interface implementations, which I call an environment. A sample of what an environment may look like is below.
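One minimal way to sketch an environment in Python is a small dataclass that simply holds one implementation per I/O interface. Everything here is illustrative: StorageInterface, its read and write methods, and the archive_next_message function are hypothetical names I made up to show a second service alongside messaging.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class MessagingInterface(Protocol):
    def get_next_message(self) -> Optional[str]: ...
    def put_message(self, message: str) -> None: ...


class StorageInterface(Protocol):
    # Hypothetical second I/O interface, e.g. for a blob or object store.
    def read(self, key: str) -> bytes: ...
    def write(self, key: str, data: bytes) -> None: ...


@dataclass(frozen=True)
class Environment:
    """Bundles every I/O interface so only one object is passed around."""

    messaging: MessagingInterface
    storage: StorageInterface


def archive_next_message(env: Environment, key: str) -> bool:
    """Example unit: copy the next message into storage under `key`.

    Returns False when there was no message to archive.
    """
    message = env.messaging.get_next_message()
    if message is None:
        return False
    env.storage.write(key, message.encode("utf-8"))
    return True
```

Functions deeper in the system take the environment as a single parameter and pull out whichever interfaces they need, so adding a new service later does not change every function signature along the way.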
There are a few ways for the environment to create its I/O interfaces. It can take them as parameters or it can contain logic to determine which ones to create (e.g., if test libraries are in the path, use the test I/O interfaces, otherwise use the production I/O interfaces, etc.). A third option is to have multiple Environment classes, each of which creates the I/O Interface implementations it needs. Then a top-level function just needs to create the right type, and everything underneath it is created properly. This might look like the following for a simple test/production situation.
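A hedged sketch of the third option in Python follows. The class names TestEnvironment and ProductionEnvironment match the ones used in this post, but the rest is assumed for illustration: SQS as the production service, the lambda_handler entry point, and the "queue_url" key on the event are all hypothetical. boto3 is imported lazily inside the production class so that tests never need it installed.

```python
from collections import deque
from typing import Any, Optional


class InMemoryMessaging:
    """Test implementation: an in-process FIFO queue."""

    def __init__(self) -> None:
        self._queue = deque()

    def get_next_message(self) -> Optional[str]:
        return self._queue.popleft() if self._queue else None

    def put_message(self, message: str) -> None:
        self._queue.append(message)


class SqsMessaging:
    """Production implementation, assuming Amazon SQS via boto3."""

    def __init__(self, queue_url: str) -> None:
        import boto3  # lazy import: only production code paths need boto3

        self._client = boto3.client("sqs")
        self._queue_url = queue_url

    def get_next_message(self) -> Optional[str]:
        response = self._client.receive_message(
            QueueUrl=self._queue_url, MaxNumberOfMessages=1
        )
        messages = response.get("Messages", [])
        if not messages:
            return None
        self._client.delete_message(
            QueueUrl=self._queue_url, ReceiptHandle=messages[0]["ReceiptHandle"]
        )
        return messages[0]["Body"]

    def put_message(self, message: str) -> None:
        self._client.send_message(QueueUrl=self._queue_url, MessageBody=message)


class Environment:
    """Base class: each subclass decides which implementations to create."""

    messaging: Any


class TestEnvironment(Environment):
    def __init__(self) -> None:
        self.messaging = InMemoryMessaging()


class ProductionEnvironment(Environment):
    def __init__(self, queue_url: str) -> None:
        self.messaging = SqsMessaging(queue_url)


def run_pipeline(env: Environment) -> int:
    """Everything below the entry point only ever sees an Environment."""
    processed = 0
    while env.messaging.get_next_message() is not None:
        processed += 1
    return processed


def lambda_handler(event: dict, context: Any) -> dict:
    """Hypothetical entry point: the only place that picks the environment."""
    env = ProductionEnvironment(queue_url=event["queue_url"])
    return {"processed": run_pipeline(env)}
```

The creation logic stays trivial because each class hard-codes its own choices; the cost is one more class per situation we want to support.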
I tend to prefer the third option since it makes the creation logic a bit simpler and gives us more flexibility, but it comes at the cost of more overhead and a bit more complexity, so thought should be given before going with this approach.
Testing with I/O Interfaces
One of the benefits of I/O interfaces mentioned last week was how much easier they make testing, since the external systems are already abstracted away for us.
To make this concrete, consider a data pipeline which uses a set of I/O interfaces defined by the last UML diagram above. We will also assume that ProductionEnvironment is created at the entry point for the pipeline (whether that's the function a Lambda job calls, or the main function for a more traditional job). For everything underneath that entry point, we can just create a TestEnvironment instance, pass it into the unit being tested, and then evaluate any results in the TestEnvironment, if necessary, afterwards.
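The create, pass in, and evaluate flow described above can be sketched as a plain pytest-style test. The unit under test (normalize_messages) and the input and output attributes on TestEnvironment are hypothetical stand-ins, not taken from a real pipeline. One practical note: pytest tries to collect classes whose names start with Test, so the sketch sets `__test__ = False` on TestEnvironment to opt it out.

```python
from collections import deque
from typing import Optional


class InMemoryMessaging:
    """Test I/O interface implementation: an in-process FIFO queue."""

    def __init__(self) -> None:
        self._queue = deque()

    def get_next_message(self) -> Optional[str]:
        return self._queue.popleft() if self._queue else None

    def put_message(self, message: str) -> None:
        self._queue.append(message)


class TestEnvironment:
    __test__ = False  # tell pytest this is not a test class despite its name

    def __init__(self) -> None:
        self.input = InMemoryMessaging()
        self.output = InMemoryMessaging()


def normalize_messages(env) -> int:
    """Hypothetical unit under test.

    Reads every input message, trims whitespace, lowercases it, and writes
    the result to the output. Returns the number of messages handled.
    """
    handled = 0
    while True:
        message = env.input.get_next_message()
        if message is None:
            return handled
        env.output.put_message(message.strip().lower())
        handled += 1


def test_normalize_messages() -> None:
    # 1. Create the test environment and seed its inputs.
    env = TestEnvironment()
    env.input.put_message("  Hello ")
    env.input.put_message("WORLD")

    # 2. Pass it into the unit being tested.
    assert normalize_messages(env) == 2

    # 3. Evaluate the results left behind in the environment.
    assert env.output.get_next_message() == "hello"
    assert env.output.get_next_message() == "world"
    assert env.output.get_next_message() is None
```

No mocking framework appears anywhere in the test, since the unit never touches a real service.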
We should then test the individual I/O interface implementations as well. This is where we will need the mocking frameworks mentioned before, such as moto for AWS calls or requests-mock for REST API calls made through requests. Similarly, any integration tests targeting the entry point itself will need the same tools. Since these tests are so high-level, however, we can keep their number low, reducing the number of tests that have to deal with these heavyweight libraries.
An old colleague of mine brought up on LinkedIn the use of environment emulators such as LocalStack for AWS or the native emulators for GCP. While I have not used these tools specifically, I have seen a sibling team of mine use a Docker container that allows Glue jobs to be run locally. The integration tests that currently rely on heavyweight mocking libraries are a perfect place to use these emulators, allowing us to test the code at a high level or along the edges with other systems while reducing the burden somewhat.
I am interested in working with LocalStack and plan to try implementing it in a system soon to see how it compares to mocking AWS through moto. When I do, I will likely write about it here, so stay tuned!
Some readers may be wondering how this differs from popular design patterns that already exist. One pattern this design resembles is the Facade design pattern. While each individual I/O interface implementation is likely a Facade, the strength here is having the same interface across the implementations for different services, allowing us to mix and match them on the fly based on the environment we are running in, configuration parameters, or anything else we choose.
In this way, it's actually similar to the Strategy design pattern, since you are passing a strategy object (I/O Interface) around which determines how to perform a common task. A strategy is defined as being a single algorithm, however. Here, an I/O Interface is a collection of related algorithms, so this pattern doesn't match this design exactly either.
This wraps up our two-week look at this pattern. While seemingly simple, this pattern has come in handy for me many times and is one that I think can be applied in most data engineering pipelines in one way or another.
While this pattern may not match a Gang of Four pattern exactly, I can't have been the only one to arrive at it. I have implemented it successfully at many clients at this point, so I am confident it solves the problem well. I would encourage any readers who have seen this pattern or similar approaches to let me know, since it would be good to know whether this has reinvented the wheel or is genuinely a unique approach.