Set up Airflow & Pentaho DI using Docker

Need a standardized environment to develop and test PDI transformations and Airflow DAGs? Below are two containerization approaches that might help you.

Sarit Si
5 min read · Apr 29, 2021
Photo by @commonboxturtle on Unsplash

Scenario

At my workplace we have a number of data pipelines, each involving different types of processes orchestrated by Apache Airflow. Among these processes are a bunch of kettle jobs and transformations. With Pentaho Data Integration (PDI) and Airflow installed as native tools on our machines, we developers create the transformations and add them to Airflow DAGs. When it comes to end-to-end testing of the data pipeline with a new code branch, the QA team needs the same set of developer tools, databases, packages and environment variables set up on their machines (if not on a QA server). The same scenario arises whenever a new member joins our team.

To avoid the hassle of setting up environments for these different services, and to keep them isolated and self-contained without contaminating the host machine, we can turn to Docker.

What is Docker?

Docker is an open-source platform that lets you quickly build, test and deploy all the services required to run an application. It does this by packaging each service, together with all the parameters required for its operation, into its own isolated, self-contained container.

How I used Docker

I came across a great article that talks about using an Airflow Docker container to trigger kettle jobs/transformations in a locally installed PDI. I extended this by containerizing PDI as well and connecting it to the Airflow container. Below are the two approaches I played with.

  1. PDI (with Carte) and Airflow in separate containers
  2. PDI (without Carte) and Airflow in the same container

Approach 1

  • Two Dockerfiles are used to build the PDI (with Carte) and Airflow images. For this scenario, the customizations are done on top of the Airflow base image.
  • Docker Compose uses the above Dockerfiles to build the PDI and Airflow images (on the first run). In parallel, the Redis and Postgres base images are pulled from Docker Hub and their containers created.
  • Once all services are up, DAGs triggered via the Airflow webserver instruct the worker to call Carte's executeJob/executeTrans APIs in the PDI container, passing along the details of the job/transformation to be run, as sketched below.
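
In essence, the worker-side call is a single HTTP request to Carte. This is only a sketch: the hostname pdi, port 8181, job path and credentials below are placeholders (cluster/cluster is Carte's default login) rather than the exact values from my setup.

import requests

# Minimal sketch: ask Carte (running in the PDI container) to execute a job.
# Host, port, credentials and the job path are assumptions for illustration.
CARTE_URL = "http://pdi:8181/kettle/executeJob/"

def run_kettle_job(job_path: str) -> None:
    response = requests.get(
        CARTE_URL,
        params={"job": job_path, "level": "Info"},
        auth=("cluster", "cluster"),  # Carte's default credentials
        timeout=600,
    )
    response.raise_for_status()  # fail the task if Carte returns an HTTP error
    print(response.text)         # Carte replies with an XML result document

run_kettle_job("/home/pentaho/jobs/sample_job.kjb")

Wrapped in a PythonOperator (or swapped for an HTTP operator), something along these lines is what the worker ends up doing for each PDI task in this approach.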

Approach 2

  • A single Dockerfile is used, which downloads PDI on top of the Airflow base image. There is no Carte server; Airflow triggers PDI tasks locally via kitchen.sh/pan.sh within the same container.
  • The Docker Compose file uses the above Dockerfile to build and spawn the airflow-pdi worker container. The stock Airflow base image is used to spawn the webserver and scheduler containers.
  • DAGs triggered via the webserver invoke kitchen.sh/pan.sh inside the worker to run the assigned job/transformation, as sketched below.
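
A minimal DAG sketch for this approach, assuming the Airflow 2.x API and PDI unpacked at /opt/pentaho/data-integration inside the image (the install path and kettle file paths are illustrative, not the ones from my repository):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative paths; they depend on where the Dockerfile unpacks PDI and
# where the kettle source folders are mounted inside the container.
PDI_HOME = "/opt/pentaho/data-integration"

with DAG(
    dag_id="pdi_local_example",
    start_date=datetime(2021, 4, 1),
    schedule_interval=None,  # trigger manually from the webserver
    catchup=False,
) as dag:
    run_job = BashOperator(
        task_id="run_kettle_job",
        bash_command=f"{PDI_HOME}/kitchen.sh -file=/opt/kettle/jobs/sample_job.kjb -level=Basic",
    )

    run_trans = BashOperator(
        task_id="run_kettle_trans",
        bash_command=f"{PDI_HOME}/pan.sh -file=/opt/kettle/trans/sample_trans.ktr -level=Basic",
    )

    run_job >> run_trans

Because kitchen.sh/pan.sh start a fresh JVM for every invocation, this approach tends to pay a per-task startup cost, which is consistent with the longer task durations reported below.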

In either approach, all the services can be brought up with a single command.

docker-compose up

Environment variables defined in a .env file are automatically exported to each of the containers. Simply point the volume source mounts at your project source code folder(s), and any updates to the kettle/DAG files made locally on the host become visible inside the containers.
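
For instance, a variable declared in the .env file (the names below are made up for illustration) can be read inside the worker container from DAG or helper code like any other environment variable:

import os

# Hypothetical variable names: anything declared in the .env file and passed
# through docker-compose is visible to code running inside the containers.
pdi_home = os.environ.get("PDI_HOME", "/opt/pentaho/data-integration")
carte_user = os.environ.get("CARTE_USER", "cluster")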

Evaluation

I evaluated both approaches based on:

  • Total image size on disk
  • Task duration
  • CPU utilization and memory used

Total image size on disk

The PDI package downloaded while building the Docker image is ~1.8 GB by itself. Hence, in the Dockerfiles I made sure the base images are minimal and that as few additional layers as possible are added on top, either by chaining multiple RUN statements into one or by using a multi-stage build.

Approach 1 = 3.72 GB
Breakdown of custom images:

  • pdi: openjdk base image + PDI = 2.17 GB
  • airflow: apache/airflow base image + additional packages = 1.2 GB

Approach 2 = 3.24 GB
Breakdown of custom images:

  • airflow-pdi: airflow base image + PDI = 2.89 GB

Task duration

I ran the same set of parallel tasks three times.

Below are the durations as seen in the Task Duration tab of Airflow.

Approach 1: < 1.5 seconds

Approach 2: between 42 and 78 seconds

CPU utilization and memory used

In the first approach:
- highest CPU utilization (~77%) was by the container running PDI (pdi-master)
- average memory used by the PDI container = ~13% of 7.7 GB

In the second approach:
- highest CPU utilization (~230%) was by the container running PDI (docker-airflow-pdi-02_airflow-worker_1)
- average memory used by the PDI container = ~55% of 7.7 GB

Which one I chose

Airflow & PDI in separate containers (Approach 1). Although the total image footprint of Approach 1 is ~500 MB higher than the other, task runs are much faster and demand far fewer system resources.

How I use it

  • Code the DAGs in VS Code and the kettle transformations in the local PDI setup.
  • Mount the above source code folder(s) to the respective target volumes in the docker compose file so that they are visible inside the containers.
  • docker-compose up
  • Run the DAG and test the output
  • Create pull request

How the testing team uses it

  • Fetch new code
  • docker compose up
  • Run the DAG and test output

Conclusion

So far this has worked well for me. I maintain this setup as a separate project repository of its own. It has helped me and my team eliminate environment setup time in scenarios such as: a new code folder (I simply change the volume mounts in the docker compose file), a new team member (git pull this repo), or adding Python dashboards (add and spin up a Dash/Plotly container as a new service).

I am pretty sure there are other (better) alternatives for this scenario, and I will keep looking for them. Meanwhile, if you have any tips on how I could have handled this better, or any corrections to this article, feel free to share them in the comments section.

Thanks for reading :).
