September 16, 20243 minute read

Sakila Co.: An End-to-End Open-Source Analytics Starter Project

Jumpstart your analytics work with some of today’s best open-source technologies.
Fraser Marlow
Name
Fraser Marlow
Handle
@frasermarlow

Data engineering teams looking to build a complete open-source data stack are turning to Dagster to design, run, and observe their entire data platform.

In this article, Pipat Methavanitpong, a member of the Dagster community, shares one such project, using Dagster, Postgres, dlt, dbt, Clickhouse and Metabase. The result is a completely open-source containerized analytics-platform-in-a-box.

Pipat built the project around the Sakila sample database, but it can be repurposed to provide an analytics platform for any set of data sources. This example can benefit any small company looking to set up a basic self-hosted analytics capability.

About Pipat Methavanitpong

Pipat Methavanitpong headshot
Pipat is a self-taught software engineer with education and work experience from semiconductor to cloud computing. After taking a deep learning course in 2017, he has been interested in data engineering in how to tackle big data complexities and make data available to everyone in an organization.

He believes data engineering is the cornerstone of digital transformation and data-driven decisions. He aims to provide cost-effective and easy-to-operate data platforms for businesses.

You can connect with Pipat on the Dagster Community Slack (@Pipat Methavanitpong).

About the Sakila Database

The Sakila database was created to fill the gap for a complex, realistic database schema to help users learn and experiment with MySQL. Its movie rental store schema provides a practical example that can be used to showcase everything from basic queries to advanced SQL features, and its continued use highlights its success as a tool for learning database management.

The Open-Source Analytics Stack:

The project's genesis came when Pipat was asked to build an analytics pipeline at a previous company. He rapidly ran into the cost constraints of acquiring multiple tools and the complexity of weaving together commercial solutions into one coherent stack. He set about building a reproducible stack of tools—an analytics platform in a box, if you will—using best-of-breed open-source solutions.

“The only cost is the operation cost is man-hours” says Pipat, “So it’s not really free, but once built, this project can easily be cloned and reused by another company.”

The Group1001 ML/AI tech stack as orchestrated by Dagster
This analytics-platform-in-a-box project includes several key technologies, all running in Docker containers.

Pipat selected the following technologies, as representing both high-performing yet accessible tools for most data teams:

  • PostgreSQL due to its popularity.
  • Dagster for orchestration
  • dlt for data ingestion (using the Dagster embedded elt function)
  • Clickhouse for data warehousing. As a OLAP DBMS, Pipat likes Clickhouse’s performance and resource efficiency.
  • Metabase for data visualization.

Pipat shared his thinking about these technology choices:

“I selected Dagster over Airflow, because Dagster fundamentally looks at things from the data’s perspective: was the asset materialized, did it pass the quality tests? So any data professional can just look at the lineage graph and immediately understand the state of the pipeline and the data it generates. Tools like Airflow use a ‘if-this-then-that’ approach. Other tools are just concerned about the operators or the job, not the state of the data.”

This said, Dagster does not provide out-of-the-box operators like Airflow. This is where "dlt" comes in. dlt is a Python library for loading data from various sources to the destination of your choice. Dlt provides an incremental sync mode and schema evolution.

“I can be more declarative with my pipelines by leveraging the dlt library. With this combination, you have Dagster as a brain to control and dlt as arms and legs to run the data movement tasks, a perfect fit.”

“For Clickhouse, I was looking for an AWS Redshift alternative, that was open source but equally performant. I considered Clickhouse and Apache Druid, but Clickhouse has wider adoption.”

“Metabase is a user-friendly data visualization tool. It provides both a UI-based query builder and SQL query editor to create charts and dashboards. It might not be as fancy as SuperSet, which competes with Tableau. But it is simple and good enough for a company to start on a data-driven decision journey.” says Pipat.

Deploying the Project on Docker

In terms of deployment, the whole project is in one Docker compose file. If you clone the depository and run the initialization script, the project creates the environment file for you.

From there, you just initialize the databases and start up the project. The Project’s README.md file will give you the details you need.

The Sakila project’s Docker containers.
The Sakila project’s Docker containers.

Design Considerations:

In this project, Pipat set dlt and dbt up to work in close synchronization by mapping asset keys between the dlt and dbt steps. dlt also offers the option of incremental load or full refresh, providing additional flexibility.

dlt and dbt working hand-in-hand in the Dagster asset graph.
dlt and dbt working hand-in-hand.

Credentials are managed in a file rather than in environment variables, which Pipat sees as being more secure.

“The Dagster dlt tutorial passes credentials using environment variables, which I don't think it's secure. If someone forgot to turn the debug flag off, the credentials might leak out to bad actors. So I pass a credential file as an environment variable instead and tell where dlt should load from.”

“In the data warehouse side, I organize dbt models into stages based on its best practice. In the data warehouse stage, I model data into fact and dimension tables based on Kimball's modeling techniques.“

The Sakila project’s Asset Graph in Dagster.
The Sakila project’s Asset Graph in Dagster.

A Springboard Project for Future Growth

Pipat hopes that this data platform template will help companies start their data-driven journey regardless of budget. From here, data teams can switch components in and out as they see fit based on their requirements, team's skills, preferences, and budget; for example, they can use Dagster+ instead of hosting themselves to regain hours working on infrastructure.


The Dagster Labs logo

We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a Github discussion. If you run into any bugs, let us know with a Github issue. And if you're interested in working with us, check out our open roles!

Follow us:


Read more filed under
Blog post category for Community. Community
Share this article
Share Dagster on Twitter / XCheck out Dagster on LinkedInShare Dagster on Reddit

Dagster Newsletter: Get updates delivered to your inbox

Dagster University