Apache Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows. It is written in Python, extensible with Python, and comes with a robust set of integrations created and supported by the open source community. With Airflow, workflows are managed as scripts, monitored through the user interface, and extended through plugins.
- What more could a team of data engineers ask for?
- What if someone else handled scaling and managing it while you reap the rewards?
This post is a collection of thoughts and observations on whether AWS Managed Workflows for Apache Airflow could replace self-hosted Airflow for our use cases. Your results may differ.
Airflow is the primary extract, transform, and load (ETL) tool at the company where I work. After years of using Talend, it felt like a breath of fresh air. ETLs can only be relied upon if they are built and maintained as code, since changes to them are then versioned, peer-reviewed, and testable before deployment.
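To make the "tested before deployment" point concrete, here is a minimal sketch in plain Python. The `normalize_order` transform and its field names are hypothetical, invented for illustration, and are not taken from our actual pipelines. The idea is only that a transform written as an ordinary function can be unit-tested in CI before any DAG that calls it is deployed.

```python
from datetime import date


def normalize_order(raw: dict) -> dict:
    """Hypothetical transform step: clean up one raw order record."""
    return {
        "order_id": int(raw["order_id"]),
        "customer": raw["customer"].strip().lower(),
        # Store money as integer cents to avoid float drift downstream.
        "amount_cents": round(float(raw["amount"]) * 100),
        "ordered_on": date.fromisoformat(raw["ordered_on"]),
    }


def test_normalize_order() -> None:
    """A unit test that can run in CI before the DAG is deployed."""
    raw = {
        "order_id": "42",
        "customer": "  Alice ",
        "amount": "19.99",
        "ordered_on": "2021-11-30",
    }
    assert normalize_order(raw) == {
        "order_id": 42,
        "customer": "alice",
        "amount_cents": 1999,
        "ordered_on": date(2021, 11, 30),
    }


if __name__ == "__main__":
    test_normalize_order()
    print("transform test passed")
```

Because the function has no Airflow dependency, the same test runs unchanged on a laptop, in CI, or inside a container.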
If you are not familiar with Airflow concepts, be sure to learn them first.
As with the rest of our infrastructure, our Airflow ran on AWS. Specifically, it ran as a collection of containers on AWS Elastic Container Service (ECS), with the metadata database hosted in AWS RDS.
Because it was running in containers, we could use Docker to run it locally for local development and testing. With Datadog, monitoring it was pretty easy.
Everything ran fine until one day the metadata database hit its maximum number of connections, which caused the Airflow scheduler, the component that schedules the jobs that make up workflows, to stop working.
It took some time, but we were able to scale up the database to support more connections. The outage also meant that business-critical workflows did not run. This made us consider whether we wanted to keep running this platform ourselves or would rather just enjoy the results.
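One rough way to reason about this kind of failure: every process that talks to the metadata database keeps its own connection pool, so the worst case is roughly the number of processes multiplied by the pool size plus the allowed overflow. The sketch below is back-of-the-envelope arithmetic only. The function names and all the numbers are illustrative assumptions, not our actual settings or a description of Airflow internals.

```python
def worst_case_connections(processes: int, pool_size: int, max_overflow: int) -> int:
    """Upper bound on open connections if every pool is fully used.

    Assumes each process owns one pool that may open up to
    pool_size + max_overflow connections.
    """
    return processes * (pool_size + max_overflow)


def fits_database(needed: int, max_connections: int, reserved: int = 0) -> bool:
    """True if the worst case stays within the database limit.

    `reserved` is a hypothetical allowance for admin or monitoring sessions.
    """
    return needed <= max_connections - reserved


if __name__ == "__main__":
    # Illustrative numbers only: 20 processes, pool of 5, overflow of 10.
    needed = worst_case_connections(processes=20, pool_size=5, max_overflow=10)
    print(f"worst case: {needed} connections")
    print("fits a 200-connection database:", fits_database(needed, 200, reserved=5))
```

With these made-up numbers the worst case is 300 connections, which would not fit a database limited to 200, so either the limit has to grow or the pools have to shrink.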
Leaving on a journey
[Image: Bilbo Baggins embarks on a journey, from The Hobbit: An Unexpected Journey]
In late 2020, AWS announced Managed Workflows for Apache Airflow (MWAA). To me, this confirmed how valuable and good Airflow is as a tool: if AWS added it to their service offering, it must be good (much like AWS Managed Prometheus and Grafana). I did, however, wait a whole year to give the kinks time to be ironed out.
So, shall we begin the adventure? I set out to determine whether MWAA could be a viable replacement for the Airflow we ran and used ourselves. After using anything for a long time, you start to form routines and procedures around it, and our Airflow was no exception. We had our own ways to:
- use Airflow
- write and run tests
- install custom plugins, operators, and DAGs using a local development environment
- manage connections and variables
- keep our existing custom configuration
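As one illustration of the last two items, Airflow can read connections, variables, and configuration overrides from environment variables named `AIRFLOW_CONN_<CONN_ID>`, `AIRFLOW_VAR_<KEY>`, and `AIRFLOW__<SECTION>__<KEY>`. The sketch below only builds those names in plain Python. The helper functions, the connection URI, and the variable values are hypothetical, and this is not necessarily how our own setup handled it.

```python
from typing import Dict, Tuple


def connection_env(conn_id: str, uri: str) -> Tuple[str, str]:
    """Environment variable that defines an Airflow connection as a URI."""
    return f"AIRFLOW_CONN_{conn_id.upper()}", uri


def variable_env(key: str, value: str) -> Tuple[str, str]:
    """Environment variable that defines an Airflow variable."""
    return f"AIRFLOW_VAR_{key.upper()}", value


def config_env(section: str, key: str, value: str) -> Tuple[str, str]:
    """Environment variable that overrides one airflow.cfg option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}", value


def build_env() -> Dict[str, str]:
    """Collect the settings a container would be started with (all values made up)."""
    return dict(
        [
            connection_env("warehouse", "postgres://user:secret@db.example.com:5432/dwh"),
            variable_env("target_schema", "analytics"),
            config_env("core", "load_examples", "False"),
        ]
    )


if __name__ == "__main__":
    for name, value in build_env().items():
        print(f"{name}={value}")
```

Keeping these settings in one place makes it easier to compare what a self-hosted deployment relies on with what a managed service lets you set.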
On MWAA, I wanted to figure out how to do all of that with little to no change on our end.