September 16, 2024 • 3 minute read
Sakila Co.: An End-to-End Open-Source Analytics Starter Project
By Fraser Marlow (@frasermarlow)
Data engineering teams looking to build a complete open-source data stack are turning to Dagster to design, run, and observe their entire data platform.
In this article, Pipat Methavanitpong, a member of the Dagster community, shares one such project, built with Dagster, Postgres, dlt, dbt, ClickHouse, and Metabase. The result is a fully open-source, containerized analytics platform in a box.
Pipat built the project around the Sakila sample database, but it can be repurposed to provide an analytics platform for any set of data sources. This example can benefit any small company looking to set up a basic self-hosted analytics capability.
About Pipat Methavanitpong
Pipat believes data engineering is the cornerstone of digital transformation and data-driven decision-making. He aims to provide cost-effective, easy-to-operate data platforms for businesses.
You can connect with Pipat on the Dagster Community Slack (@Pipat Methavanitpong).
About the Sakila Database
The Sakila database was created to fill the gap for a complex, realistic database schema to help users learn and experiment with MySQL. Its movie rental store schema provides a practical example that can be used to showcase everything from basic queries to advanced SQL features, and its continued use highlights its success as a tool for learning database management.
The Open-Source Analytics Stack
The project's genesis came when Pipat was asked to build an analytics pipeline at a previous company. He rapidly ran into the cost constraints of acquiring multiple tools and the complexity of weaving together commercial solutions into one coherent stack. He set about building a reproducible stack of tools—an analytics platform in a box, if you will—using best-of-breed open-source solutions.
“The only cost is the operation cost: man-hours,” says Pipat. “So it’s not really free, but once built, this project can easily be cloned and reused by another company.”
Pipat selected the following technologies, which he sees as both high-performing and accessible for most data teams:
- PostgreSQL as the source database, chosen for its popularity.
- Dagster for orchestration.
- dlt for data ingestion (via Dagster’s embedded-elt integration).
- dbt for data transformation.
- ClickHouse for data warehousing. As an OLAP DBMS, Pipat likes ClickHouse’s performance and resource efficiency.
- Metabase for data visualization.
Pipat shared his thinking about these technology choices:
“I selected Dagster over Airflow, because Dagster fundamentally looks at things from the data’s perspective: was the asset materialized, did it pass the quality tests? So any data professional can just look at the lineage graph and immediately understand the state of the pipeline and the data it generates. Tools like Airflow use an ‘if-this-then-that’ approach. Other tools are just concerned with the operators or the job, not the state of the data.”
That said, Dagster does not provide out-of-the-box operators the way Airflow does. This is where dlt comes in: dlt is a Python library for loading data from various sources into the destination of your choice, and it provides incremental sync and schema evolution.
“I can be more declarative with my pipelines by leveraging the dlt library. With this combination, you have Dagster as the brain to control and dlt as the arms and legs to run the data movement tasks, a perfect fit.”
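In practice, this pairing is exposed through Dagster's embedded-elt integration, which wraps a dlt pipeline so that every table it loads shows up as a Dagster asset. Here is a minimal sketch of the pattern; the pipeline, dataset, and source names are illustrative, not taken from Pipat's repository:

```python
import dlt
from dagster import AssetExecutionContext, Definitions
from dagster_embedded_elt.dlt import DagsterDltResource, dlt_assets
from dlt.sources.sql_database import sql_database  # dlt's SQL source (dlt >= 0.5)

# Each table dlt syncs from the operational Postgres database becomes a
# Dagster asset, so lineage and run status are visible in the Dagster UI.
@dlt_assets(
    dlt_source=sql_database(),  # connection details resolved from dlt's secrets
    dlt_pipeline=dlt.pipeline(
        pipeline_name="sakila_ingest",  # illustrative name
        dataset_name="sakila_raw",      # illustrative name
        destination="clickhouse",
    ),
)
def sakila_dlt_assets(context: AssetExecutionContext, dlt: DagsterDltResource):
    # Dagster decides when this runs; dlt does the actual data movement.
    yield from dlt.run(context=context)

defs = Definitions(
    assets=[sakila_dlt_assets],
    resources={"dlt": DagsterDltResource()},
)
```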
“For ClickHouse, I was looking for an AWS Redshift alternative that was open source but equally performant. I considered ClickHouse and Apache Druid, but ClickHouse has wider adoption.”
“Metabase is a user-friendly data visualization tool. It provides both a UI-based query builder and a SQL editor for creating charts and dashboards. It might not be as fancy as Superset, which competes with Tableau, but it is simple and good enough for a company to start on a data-driven decision journey,” says Pipat.
Deploying the Project on Docker
In terms of deployment, the whole project lives in a single Docker Compose file. If you clone the repository and run the initialization script, it creates the environment file for you.
From there, you just initialize the databases and start the project. The project’s README.md file will give you the details you need.
Design Considerations
In this project, Pipat set up dlt and dbt to work in close synchronization by mapping asset keys between the dlt and dbt steps, as sketched below. dlt also offers a choice between incremental load and full refresh, providing additional flexibility.
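One common way to achieve this mapping with the dagster-dbt integration is a custom translator that points dbt sources at the asset keys the dlt step emits. The sketch below assumes a hypothetical key prefix and manifest path; the project's actual keys may differ:

```python
from pathlib import Path
from typing import Any, Mapping

from dagster import AssetExecutionContext, AssetKey
from dagster_dbt import DagsterDbtTranslator, DbtCliResource, dbt_assets

class IngestAwareTranslator(DagsterDbtTranslator):
    def get_asset_key(self, dbt_resource_props: Mapping[str, Any]) -> AssetKey:
        # Point each dbt source at the asset key emitted by the dlt step,
        # so Dagster renders one continuous lineage graph across both tools.
        if dbt_resource_props["resource_type"] == "source":
            return AssetKey(["sakila_raw", dbt_resource_props["name"]])  # assumed prefix
        return super().get_asset_key(dbt_resource_props)

@dbt_assets(
    manifest=Path("dbt_project/target/manifest.json"),  # assumed path
    dagster_dbt_translator=IngestAwareTranslator(),
)
def sakila_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()
```

With the keys aligned, Dagster knows each dbt staging model depends on the corresponding dlt-loaded table and can schedule and display the two steps as one pipeline.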
Credentials are managed in a file rather than in environment variables, an approach Pipat sees as more secure.
“The Dagster dlt tutorial passes credentials using environment variables, which I don’t think is secure. If someone forgot to turn the debug flag off, the credentials might leak to bad actors. So I pass the path to a credentials file through an environment variable instead and tell dlt where to load from.”
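A minimal sketch of that pattern: only the path to a secrets file travels through the environment, and the file itself is read at runtime. The CREDENTIALS_FILE variable and the TOML layout are assumptions for illustration, not the project's actual names:

```python
import os
import tomllib  # stdlib TOML parser, Python 3.11+

import dlt

# Only the *path* to the secrets file is exposed in the environment; the
# secret values themselves stay on disk and out of any debug output.
with open(os.environ["CREDENTIALS_FILE"], "rb") as f:  # hypothetical variable
    secrets = tomllib.load(f)

pipeline = dlt.pipeline(
    pipeline_name="sakila_ingest",
    dataset_name="sakila_raw",
    destination=dlt.destinations.clickhouse(
        # assumed layout of the secrets file
        credentials=secrets["destination"]["clickhouse"]["credentials"]
    ),
)
```

Alternatively, dlt also reads a `.dlt/secrets.toml` file by default, so mounting that file into the container achieves the same separation between code and credentials.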
“On the data warehouse side, I organize dbt models into stages following dbt’s best practices. In the data warehouse stage, I model data into fact and dimension tables based on Kimball’s modeling techniques.”
A Springboard Project for Future Growth
Pipat hopes that this data platform template will help companies start their data-driven journey regardless of budget. From here, data teams can swap components in and out based on their requirements, skills, preferences, and budget; for example, they can use Dagster+ instead of self-hosting to reclaim the hours otherwise spent on infrastructure.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!