-- Read about DataTalks.Club Data Engineering Zoomcamp --
The second week of the Data Engineering Zoomcamp by DataTalks.Club brought a new tool, one of the most popular data pipeline platforms: Apache Airflow. So we are going to create some workflows!
First you have to run the Docker Compose Airflow installation in the environment of your choice, which can be (but is not limited to) macOS, Linux, a GCP VM, or the very popular WSL.
What's more, we also need the Google Cloud SDK installed in our Airflow environment in order to connect to the Cloud Storage bucket and create tables in BigQuery.
That means we cannot just use the official docker-compose.yaml referenced in Airflow's docs; we have to build a custom Dockerfile with an extended apache/airflow image containing our additional dependencies, and then incorporate it into docker-compose.yaml 🙌
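To make this concrete, an extended image could look roughly like the sketch below. This is only an illustration, not the course's exact Dockerfile: the Airflow tag, the SDK install method, and the requirements.txt file are my assumptions.

```dockerfile
# Sketch: extend the official Airflow image with the Google Cloud SDK.
# Version tag and install paths are illustrative.
FROM apache/airflow:2.2.3

USER root
RUN apt-get update -qq && apt-get install -y -qq curl \
    # Google's installer script, run non-interactively (one possible approach)
    && curl -sSL https://sdk.cloud.google.com | bash -s -- --install-dir=/opt --disable-prompts
ENV PATH="/opt/google-cloud-sdk/bin:${PATH}"

# Switch back to the unprivileged airflow user for the Python dependencies
USER airflow
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```

The resulting image can then be referenced from docker-compose.yaml via a `build:` section instead of `image:`.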
Fortunately, the course instructors have prepared all the files for us, in two versions:
- Official Version which consists of:
- airflow-scheduler
- airflow-webserver
- airflow-worker
- airflow-init
- flower
- postgres
- redis
- celery
- Custom No-Frills Version which consists of:
- airflow-scheduler
- airflow-webserver
- postgres
Since I have some previous commercial experience with Airflow, I was pretty sure that the LocalExecutor would be enough for this course, so I decided to follow the No-Frills path with its limited number of services.
It soon turned out that although this lightweight solution has a significant number of users who reported "It works❗️" on their macOS, Linux, or Windows/WSL machines, for some unknown reason I still kept seeing posts on the Slack channel asking for help with some issue or error that happened while running the custom version on Windows/WSL.
My curiosity forced me to take an in-depth look at this topic, especially because I was one of those Windows/WSL users and at that time it did not work for me either 🤣
My laptop has 8GB of RAM and 4 CPU cores. Assuming that Windows 10 plus VS Code consumes around 3GB on its own, there is still ~5GB free for the Docker engine, which is 1GB above the recommended minimum (per the Airflow docs). I did not notice any CPU-related requirements, to be honest.
As I said before, there are three services defined in docker-compose.yaml that should start in the following order: postgres first, then the scheduler and the webserver, which depend on it.
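That ordering is typically expressed with depends_on in the Compose file; a minimal sketch (service names as in the no-frills setup, all other keys elided):

```yaml
# Sketch: startup ordering only, everything else omitted
services:
  postgres:
    image: postgres:13
  scheduler:
    depends_on:
      - postgres
  webserver:
    depends_on:
      - postgres
```

Note that depends_on only controls start order, not readiness; that is what the postgres healthcheck discussed later is for.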
The first round of build & up took some time, but less than 10 minutes. Postgres was fine; the scheduler was trying to insert something into the database, which was not possible because the webserver kept raising exceptions and restarting while trying to initialize Airflow's internal database.
Actually, I noticed this when I was dealing with Issue #2:
nervuzz@DELL:~/repos/data-engineering-zoomcamp/WEEK_2/airflow$ docker compose config
services:
  postgres:
    deploy:
      resources:
        limits:
          memory: "314572800"
    environment:
      _AIRFLOW_WWW_USER_CREATE: "True"
      _AIRFLOW_WWW_USER_PASSWORD: :airflow} # <---
      _AIRFLOW_WWW_USER_USERNAME: :airflow} # <---
(...)
Obviously there is something wrong with those values. So what do we have in the .env file?
# .env
(...)
# Airflow
_AIRFLOW_WWW_USER_CREATE=True
_AIRFLOW_WWW_USER_USERNAME=${_AIRFLOW_WWW_USER_USERNAME:airflow}
_AIRFLOW_WWW_USER_PASSWORD=${_AIRFLOW_WWW_USER_PASSWORD:airflow}
(...)
Okay, at first sight there is nothing wrong with referencing another variable; other environment variables like AIRFLOW__CORE__SQL_ALCHEMY_CONN use this syntax as well.
But this is not just a reference to another variable: it's variable substitution syntax, which won't work inside a .env file, according to the official Docker Compose documentation. (Note also that the default-value form requires a dash, ${VAR:-default}; ${VAR:airflow} is not valid substitution syntax at all, which is why the trailing :airflow} leaks through literally.)
On the other hand, we can use substitution (as the instructors did) in the docker-compose.yaml file.
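The same default-value rules apply in a plain POSIX shell, which makes them easy to experiment with before touching Compose files at all. A minimal sketch (the variable names and values are made up for the demo):

```shell
# POSIX default-value substitution needs a dash: ${VAR:-default}
_TEST_1=buzz
unset _TEST_2

echo "${_TEST_1:-foo}"   # prints "buzz": variable is set, default ignored
echo "${_TEST_2:-bar}"   # prints "bar":  variable is unset, default applies
```

Compose uses the same `${VAR:-default}` form when interpolating docker-compose.yaml.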
Let's make some modifications and test its behavior again:
# .env
(...)
_TEST_1=buzz
# no variable _TEST_2
_AIRFLOW_TEST_1=${_TEST_1:foo}
_AIRFLOW_TEST_2=${_TEST_2:bar}
_AIRFLOW_TEST_3=${_TEST_2}
_AIRFLOW_TEST_4=${:_TEST_2}
nervuzz@DELL:~/repos/data-engineering-zoomcamp/WEEK_2/airflow$ docker compose config
(...)
    environment:
      (...)
      _AIRFLOW_TEST_1: buzz:foo}
      _AIRFLOW_TEST_2: :bar}
      _AIRFLOW_TEST_3: ""
      _AIRFLOW_TEST_4: $${:_TEST_2}
      _TEST_1: buzz
(...)
Solution: Set default values the same way as we did for postgres:
# .env
(...)
_AIRFLOW_WWW_USER_CREATE=True
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow
(...)
Side notes: Since we are using the no-frills version of docker-compose.yaml and a custom entrypoint.sh, these environment variables are obsolete anyway. Our webserver user (admin/admin) is created explicitly in the entrypoint:
# entrypoint.sh
(...)
airflow users create -r Admin -u admin -p admin -e admin@example.com -f admin -l airflow
(...)
Let's look at some real example logs:
dtc-de-postgres-1 | 2022-02-02 21:25:47.569 UTC [1] LOG: starting PostgreSQL 13.5 (Debian 13.5-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
dtc-de-postgres-1 | 2022-02-02 21:25:47.569 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
dtc-de-postgres-1 | 2022-02-02 21:25:47.570 UTC [1] LOG: listening on IPv6 address "::", port 5432
dtc-de-postgres-1 | 2022-02-02 21:25:47.578 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
dtc-de-postgres-1 | 2022-02-02 21:25:47.596 UTC [1] LOG: database system is ready to accept connections
dtc-de-webserver-1 | Traceback (most recent call last):
dtc-de-webserver-1 | File "/home/airflow/.local/bin/airflow", line 5, in <module>
dtc-de-webserver-1 | from airflow.__main__ import main
dtc-de-webserver-1 | ModuleNotFoundError: No module named 'airflow'
dtc-de-webserver-1 | Traceback (most recent call last):
dtc-de-webserver-1 | File "/home/airflow/.local/bin/airflow", line 5, in <module>
dtc-de-webserver-1 | from airflow.__main__ import main
dtc-de-webserver-1 | ModuleNotFoundError: No module named 'airflow'
dtc-de-webserver-1 | Traceback (most recent call last):
dtc-de-webserver-1 | File "/home/airflow/.local/bin/airflow", line 5, in <module>
dtc-de-webserver-1 | from airflow.__main__ import main
dtc-de-webserver-1 | ModuleNotFoundError: No module named 'airflow'
dtc-de-webserver-1 exited with code 1
Observations: Python cannot find the airflow module, but since we are using Airflow's official image, it must be there. Maybe we are not using the correct Python user install directory (where pip keeps installed packages)?
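A quick way to check which "user base" a given Python interpreter resolves to; the paths in the comments are just examples, not guaranteed values:

```shell
# Print the per-user install locations that `pip install --user` would use
python3 -m site --user-base   # e.g. /home/airflow/.local, or /root/.local for root
python3 -m site --user-site   # the site-packages directory underneath it
```

Running this inside the container (for instance via `docker compose exec`) shows whether the packages baked into the image land where the current user's Python actually looks.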
Workaround: Run the image with the default user
# Dockerfile
(...)
RUN chmod +x scripts
USER $AIRFLOW_UID # <-- Delete or comment out that line
Solution: Full solution TBD
Explanation: root is the default user. More details TBD
We have just gotten rid of one missing module and BOOM, there is another one:
dtc-de-postgres-1 | 2022-02-03 15:48:21.137 UTC [1] LOG: database system is ready to accept connectio
dtc-de-webserver-1 | Traceback (most recent call last):
dtc-de-webserver-1 | File "/home/airflow/.local/bin/airflow", line 5, in <module>
dtc-de-webserver-1 | from airflow.__main__ import main
dtc-de-webserver-1 | File "/root/.local/lib/python3.7/site-packages/airflow/__init__.py", line 46, in <module>
dtc-de-webserver-1 | settings.initialize()
dtc-de-webserver-1 | File "/root/.local/lib/python3.7/site-packages/airflow/settings.py", line 495, in initialize
dtc-de-webserver-1 | configure_orm()
dtc-de-webserver-1 | File "/root/.local/lib/python3.7/site-packages/airflow/settings.py", line 233, in configure_orm
dtc-de-webserver-1 | engine = create_engine(SQL_ALCHEMY_CONN, connect_args=connect_args, **engine_args)
dtc-de-webserver-1 | File "<string>", line 2, in create_engine
dtc-de-webserver-1 | File "/root/.local/lib/python3.7/site-packages/sqlalchemy/util/deprecations.py", line 309, in warned
dtc-de-webserver-1 | return fn(*args, **kwargs)
dtc-de-webserver-1 | File "/root/.local/lib/python3.7/site-packages/sqlalchemy/engine/create.py", line 560, in create_engine
dtc-de-webserver-1 | dbapi = dialect_cls.dbapi(**dbapi_args)
dtc-de-webserver-1 | File "/root/.local/lib/python3.7/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 782, in dbapi
dtc-de-webserver-1 | import psycopg2
dtc-de-webserver-1 | ModuleNotFoundError: No module named 'psycopg2'
dtc-de-webserver-1 exited with code 0
Solution: Add psycopg2 (or psycopg2-binary) to the Python requirements, and add two additional dependencies, libpq-dev and gcc, to the apt-get install command in the Dockerfile:
# requirements.txt
(...)
psycopg2
# Dockerfile
(...)
USER root
RUN apt-get update -qq && apt-get -y install libpq-dev gcc vim -qq
(...)
Quite by accident, I noticed this error in the Docker engine logs:
WARN[2022-02-03T18:20:55.656604700+01:00] Health check for container 4ac563b736fb02c3f6265db442274bde97fda2c9c9209771cdb731536ff10a69 error: OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "pg_isready -U airflow": executable file not found in $PATH: unknown
Indeed, there is such a command used as the health check test for the postgres service:
# docker-compose.yaml
version: '3'
services:
  postgres:
    image: postgres:13
    # ...
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
BTW, pg_isready is a utility shipped with Postgres that checks the connection status of a PostgreSQL server.
The Compose file specification did not help too much; it says:
- when test is an array, the first item must be NONE, CMD or CMD-SHELL
- when test is a string, it is equivalent to prefixing it with CMD-SHELL (syntax)
After some googling, I found that the CMD-SHELL instruction prepends the test command with "/bin/sh -c", in contrast to the CMD instruction, which executes the test command directly, without a shell. And without a shell there is no word splitting: if the command and its arguments arrive as a single string, Docker looks for one executable literally named "pg_isready -U airflow", and of course no such file exists anywhere in $PATH... 🥺
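The same lookup behavior can be reproduced in any shell, entirely outside Docker; the quoted command and the fallback message below are made up for the demo:

```shell
# Exec-style lookup: the whole quoted string is treated as ONE program name,
# so the lookup fails just like the health check did.
"echo hello" 2>/dev/null || echo "no executable literally named 'echo hello'"

# CMD-SHELL-style: /bin/sh -c lets the shell split the string into words.
/bin/sh -c "echo hello"   # prints "hello"
```

This is exactly the difference between the CMD and CMD-SHELL health check forms.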
Solution: Everything should be clear now (at least to me); it's time to fix this error:
# docker-compose.yaml
test: ["CMD-SHELL", "pg_isready -U airflow"]
# or equivalent
test: pg_isready -U airflow
Software-related issues can really be a pain, but there is something else that knows how to push your buttons: an operating system that freezes while you work 🔥🔥🔥
The docker compose build command is able to eat nearly 4.5GB of RAM!
But that's not a big issue, because with a reasonable Internet connection it takes 5 to 10 minutes (depending, of course, on the number of services), and you will probably notice the peak of memory consumption close to the end of the operation.
Things are different with the docker compose up command. By default, the Docker engine gives each service "permission" to consume as much memory as possible, with some predefined value as the upper bound, which in my case is around 6GB.
This screenshot was taken when there were no running DAGs, no database transactions, and no webserver traffic. So Airflow needs at least 1.24GB just to run its core components! I have not yet tested how much these values change with, say, two DAGs running every 5 minutes. I guess the webserver will not rise much; however, I'm pretty sure the scheduler will fluctuate a lot.
Fortunately, there is a built-in feature which enables us to limit RAM and CPU consumption on a per-service level.
# docker-compose.yaml
version: '3'
services:
  postgres:
    (...)
    deploy:
      resources:
        limits:
          memory: 300M
  scheduler:
    (...)
    deploy:
      resources:
        limits:
          memory: 1g
  webserver:
    (...)
    deploy:
      resources:
        limits:
          memory: 1300m
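As a side note, this also explains the odd "314572800" we saw in the docker compose config output earlier: it's simply the 300M postgres limit expanded to bytes.

```shell
# 300M expressed in bytes, as `docker compose config` renders it
echo $((300 * 1024 * 1024))   # prints 314572800
```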
However, the compose up command must be changed for this configuration to take effect:
docker compose --compatibility up
# Options:
# --compatibility Run compose in backward compatibility mode
# See details: https://github.com/docker/compose/pull/5684
Now we can take a breath and tailor these limits to our needs!
Feel free to leave a comment!