Dagster Pipeline Example
Technical blog post coming soon™!
A boilerplate for creating data pipelines using Dagster, Docker, and Poetry. To use this repo, clone it or click “Use this template” and follow the instructions below
- Picks up code changes immediately (just hit Reload in dagit; don’t have to restart the container!)
- Unified Dockerfile for development & deployment; easily integrates with CI/CD processes
- Packages the source code according to PEP517 & PEP518
- Tractable package management using poetry. No more hideous pip freeze > requirements.txt!
Setup (using a container)
Build and Run Dagster
    docker compose down  # if already running
    docker compose build
    docker compose up
Done! At this point, you should be able to successfully navigate to the Dagit UI and launch the job
Configure Slack (optional)
The top_hacker_news job runs out of the box and simply logs its results to the console. If you configure a Slack Webhook, the job also sends its output to the corresponding channel, which is much more fun.
After creating the Slack Webhook, copy the Slack Webhook URL and uncomment the environment variable lines in docker-compose.yml, then restart the container
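The "log by default, post to Slack when configured" behaviour described above can be sketched roughly as follows. This is not the repo's actual implementation: the environment variable name SLACK_WEBHOOK_URL and the function names are assumptions made for illustration, so check docker-compose.yml for the real variable name.

```python
import json
import logging
import os
import urllib.request

logger = logging.getLogger(__name__)

# Hypothetical variable name; use whatever docker-compose.yml actually defines.
WEBHOOK_ENV_VAR = "SLACK_WEBHOOK_URL"


def build_payload(text: str) -> bytes:
    """Encode a message as the JSON body a Slack incoming webhook accepts."""
    return json.dumps({"text": text}).encode("utf-8")


def post_to_slack(webhook_url: str, text: str) -> None:
    """POST the message to the webhook URL (supplying data makes it a POST)."""
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10):
        pass


def report(text: str, post=post_to_slack) -> str:
    """Send to Slack when a webhook is configured; otherwise just log."""
    webhook_url = os.environ.get(WEBHOOK_ENV_VAR)
    if webhook_url:
        post(webhook_url, text)
        return "slack"
    logger.info(text)
    return "log"
```

Passing the sender in as the `post` argument keeps the branching logic testable without touching the network.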
Install Poetry (optional)
When using containerization, installing poetry locally is not necessary, but it is recommended; the venv it creates can be used for code completion, simple interactive debugging, and more
Alternative Setup (no container)
The alternative setup runs locally without any containerization
Note: It’s recommended that the application be run using the Docker approach
Run Dagster Locally
Running locally is very similar to using the container
- Install poetry (not optional in this case)
- Export the environment variable(s)
- Open a terminal in the project root and run the following commands
    # First command optional. Creates `.venv` in the project root; very useful when using VSCode!
    poetry config virtualenvs.in-project true
    poetry install
    # To use poetry (i.e. activate the virtualenv):
    poetry shell
    dagit -w workspace.yaml
Running Tests Locally
I’ll be honest, I haven’t focused on testing with this repo. Suggestions for improvement are welcome.
Assuming poetry is installed and the environment created, run the following from the project root:
    poetry shell
    pytest
During Development, When Should I Rebuild/Restart the Docker Container?
If you change any env vars or files that are outside of src, then you’ll want to rebuild the docker container, e.g. when…
- adding new packages to
- adding a volume mount for DAGSTER_HOME
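As a rough illustration of the rule above, the check "did anything outside src change?" could be written like this. The function name and the assumption that paths are given relative to the project root are mine, not part of the repo.

```python
from pathlib import PurePosixPath

# "src" is the directory named in the text; changes inside it are picked up
# by reloading, anything else calls for a rebuild.
LIVE_RELOAD_DIR = "src"


def needs_rebuild(changed_paths, env_changed=False):
    """Return True if an env var changed or any changed file is outside src.

    Paths are assumed to be relative to the project root.
    """
    if env_changed:
        return True
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        if not parts or parts[0] != LIVE_RELOAD_DIR:
            return True
    return False
```

For example, `needs_rebuild(["src/jobs/top.py"])` is false, while touching `pyproject.toml` or the `Dockerfile` makes it true.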
How Do I Install Python Packages?
Just add the package to [tool.poetry.dependencies] (or [tool.poetry.dev-dependencies]) in pyproject.toml and rebuild the container. If using poetry locally without containerization, also run poetry update to update the lockfile
Poetry Doesn’t Like My Lock File. What do I do?
Don’t worry! Delete poetry.lock and run poetry install locally to recreate it
Does This Approach Work for Dagster Daemon?
Yes! If you’re developing sensors, partitions, or schedules and want to test them in your container, simply uncomment the following line in the dev stage of the Dockerfile:
# RUN echo "poetry run dagster-daemon run &" >> /usr/bin/dev_command.sh
How Do I Deploy This Repo through CI/CD?
I leave this as an exercise for the reader and/or the reader’s DevOps team. Here are some tips, though:
- Use semantic versioning to version-bump pyproject.toml and associate this with the container version
- You don’t need to target a specific stage in the Dockerfile; the end result is a Dagster User Code Deployment in a ready-to-use container
- If using helm, make sure you’ve added the correct container version to the list of User Code Deployments; don’t forget to apply any secrets/env vars as needed
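To make the first tip concrete, here is a minimal sketch of bumping a MAJOR.MINOR.PATCH version and reusing it as the container tag. It only handles plain three-part versions (no pre-release or build suffixes), and the repository name in the usage note is a made-up placeholder.

```python
def bump_version(version: str, part: str) -> str:
    """Bump one component of a plain MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


def image_reference(repository: str, version: str) -> str:
    """Tag the container image with the same version as the package."""
    return f"{repository}:{version}"
```

A CI step could then call `image_reference("registry.example.com/dagster-user-code", bump_version(current, "patch"))`, where `current` is the version read from pyproject.toml, so the package and the container always share one version.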
How Can I Debug My Op (or Other Functions)?
Use debugpy (already installed). Add - "5678:5678" to the container’s list of ports. In the actual op you’d like to debug, import debugpy and add the following three calls:
    import debugpy

    # It's very important that we specify both address and port!
    debugpy.listen(('0.0.0.0', 5678))
    # Block until you can attach the debugger in VSCode
    debugpy.wait_for_client()
    # Add this final line wherever you'd like within the op
    debugpy.breakpoint()
Finally, you’ll need to create a launch.json for python remote attach. In VSCode, click “Run and Debug” -> “Create a launch.json file” and follow the prompts (python -> remote attach -> localhost -> 5678)
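Because debugpy.wait_for_client() blocks until a debugger attaches, it can be handy to guard the debugpy calls above behind a switch so they are easy to leave in place. The sketch below is one way to do that; the ENABLE_DEBUGPY variable and the helper name are assumptions for illustration and do not exist in the repo.

```python
import os

# Hypothetical opt-in switch; not defined anywhere in the repo.
DEBUG_ENV_VAR = "ENABLE_DEBUGPY"
DEBUG_PORT = 5678

_listening = False  # debugpy.listen() should only be called once per process


def maybe_wait_for_debugger() -> bool:
    """Start the debug server and block for a client only when opted in."""
    global _listening
    if os.environ.get(DEBUG_ENV_VAR) != "1":
        return False
    import debugpy  # imported lazily so normal runs never load it

    if not _listening:
        # Bind to all interfaces so the port mapped from the host can reach it.
        debugpy.listen(("0.0.0.0", DEBUG_PORT))
        _listening = True
    debugpy.wait_for_client()
    return True
```

Call `maybe_wait_for_debugger()` at the top of the op and keep `debugpy.breakpoint()` wherever you want execution to pause; with the variable unset, the helper returns immediately.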