I live in the Front Range of Colorado, which means each ski weekend I have to make the difficult choice of which world-class resort to visit. To aid in this choice, I've built data pipelines and dashboards over the years that show an aggregated view of resort conditions such as weather, terrain status, and snow conditions. These data products make it easy to compare resorts in a single web visit instead of visiting each resort's website.
Beyond being a useful tool, this project has proven to be a great minimal PoC to implement when learning or evaluating new data tools. Pulling data from an API, transforming it, loading it to a warehouse, and building a dashboard represent a common data PoC. This post discusses how I approached learning Dagster, how the Dagster version of the snow report compares to other implementations, and overall recommendations for learning new data technologies.
5 Steps to Learning New Data Technologies
A major challenge in data engineering is keeping track of all the new data tools. Evaluating new tools can be especially challenging. To learn Dagster I took the following approach:
- Cursory review of documents
- Understand and play with an example
- Develop a useful but minimal project from scratch
- Deploy the project to understand infrastructure
- Reflect on how the minimal project compares to other implementations
Below I discuss how I applied each step of this process to learning and evaluating Dagster.
Understand an Example
While it can be tempting to start learning a framework by diving into a migration, a better first step is often to start with a template. With Dagster, there are multiple example projects that can be scaffolded locally or cloned from Dagster Cloud. My use case was similar to this template so I began by copying the template locally. Installing the dependencies and running a template locally provide a best-case experience for future local development. For Dagster, I found the install was as easy as installing Python packages into a virtual environment, and the local web interface was immediately accessible. After installation I scanned the code and tested my understanding by making minor code changes to see if the result matched my expectations.
Building a PoC
After gaining an understanding of the framework through sample code, it can be helpful to build a more realistic project. For me, I keep re-building the same data products for each PoC because re-implementing the patterns allows me to focus on the changes needed for the framework and not the details of my use case. In this case, I had Python code to call an API for each ski area. I wanted to use this code in Dagster to query the API and then assemble the results in a database.
After reading the documentation and playing with the template I identified a few specific goals for my Dagster implementation of the snow report.
Separate the production pipeline from my local testing and ideally build a staging pipeline that runs on PRs to my code.
In each environment use the same logic but different resources. The local pipeline runs with just pandas data frames that are stored on disk. The staging pipeline writes the resort data to Google Cloud Storage and then loads and cleans it in BigQuery within a staging dataset. The production pipeline is the same as the staging pipeline but uses the production dataset.
The code correctly reflects the dependency between the resort data and the final summary tables. Each resort data pull is an independent step instead of all the resort data being pulled in one function with a for-loop. This separation makes it easy to re-run one resort if the API call fails. The resort summary can be updated with the new resort data while using cached results for the other resorts.
The project should support backfills. If I want to change the logic that loads the resort data, Dagster should make it easy to go back through the historical raw data and load it using the new logic.
I started from the template project and made gradual changes. I first swapped the HackerNews API for my ski resort API and made sure the data pulls were correct by building a simple asset that called the API for one resort. Then I added the other resorts and built a downstream asset that summarized all of the resort data into one summary table. This first pass relied only on local resources storing data in the file system. My second pass added Google Cloud Storage and BigQuery. With storage set up, I refactored the code to use an asset factory instead of hard-coding each ski resort. Finally, I refactored the code further to add partitions and test the dbt integration. The final code for my project is available here.
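The factory idea and the per-resort separation can be illustrated without any framework. The sketch below is plain Python rather than Dagster's API: the resort names, `fetch_conditions`, `make_resort_step`, and the in-memory `cache` are all hypothetical stand-ins for the real API client, the asset factory, and stored asset results.

```python
from typing import Callable, Dict, List

RESORTS: List[str] = ["resort_a", "resort_b", "resort_c"]  # hypothetical names


def fetch_conditions(resort: str) -> dict:
    """Stand-in for the real API call; returns a made-up snow report."""
    return {"resort": resort, "new_snow_in": len(resort)}


def make_resort_step(
    resort: str, fetch: Callable[[str], dict]
) -> Callable[[Dict[str, dict]], None]:
    """Factory: one independent step per resort instead of a for-loop in one function."""

    def step(cache: Dict[str, dict]) -> None:
        cache[resort] = fetch(resort)

    step.__name__ = f"{resort}_raw"
    return step


def summarize(cache: Dict[str, dict]) -> List[dict]:
    """Downstream step: combine whatever per-resort results are cached."""
    return sorted(cache.values(), key=lambda row: row["resort"])


steps = {resort: make_resort_step(resort, fetch_conditions) for resort in RESORTS}
cache: Dict[str, dict] = {}
for step in steps.values():
    step(cache)
summary = summarize(cache)

# Re-run a single resort (say its API call failed earlier); the others stay cached.
steps["resort_b"](cache)
summary = summarize(cache)
```

Because each resort is its own step, a retry touches one entry in the cache and the summary is rebuilt from the rest, which is the behavior the asset graph provides in the real project.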
Throughout this process, I tried to note Dagster-specific concepts:
In Dagster, assets are saved to BigQuery through IO managers. I wrote a custom bq_io_manager to handle writing my results to BQ. My implementation expects the BQ dataset and table to already exist and be supplied as resource configuration. I wrote my own IO manager because I wanted to append data to my raw table every day instead of overwriting it.
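To make the append-versus-overwrite distinction concrete, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for BigQuery. It is not the bq_io_manager from the project and does not subclass anything from Dagster; it only mirrors the write-one-side, read-the-other shape of an IO manager. The class name, the three-column schema, and the sample rows are assumptions made for illustration.

```python
import sqlite3
from typing import Iterable, List, Tuple

Row = Tuple[str, str, float]  # (report_date, resort, new_snow_in): hypothetical schema


class AppendOnlyTableStore:
    """Hypothetical store with an IO-manager-like shape: one method writes, one reads."""

    def __init__(self, conn: sqlite3.Connection, table: str) -> None:
        # The table must already exist, mirroring the "dataset and table exist" assumption.
        exists = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
        ).fetchone()
        if exists is None:
            raise ValueError(f"table {table!r} does not exist")
        self.conn = conn
        self.table = table

    def handle_output(self, rows: Iterable[Row]) -> None:
        """Append rows; earlier loads are never truncated or replaced."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                f'INSERT INTO "{self.table}" VALUES (?, ?, ?)', list(rows)
            )

    def load_input(self) -> List[Row]:
        """Return every row appended so far, oldest load first."""
        return self.conn.execute(
            f'SELECT * FROM "{self.table}" ORDER BY rowid'
        ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resort_raw (report_date TEXT, resort TEXT, new_snow_in REAL)")
store = AppendOnlyTableStore(conn, "resort_raw")
store.handle_output([("2022-12-01", "resort_a", 4.0)])  # made-up sample data
store.handle_output([("2022-12-02", "resort_a", 7.5)])  # second day is appended
```

An overwriting store would replace the first day's row on the second call; the append-only version keeps both, which is what a growing raw table needs.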
While I primarily focused on Dagster assets, I also wrote an op and job that use a resource, bq_writer, to drop and create a table called resort_clean. This job is triggered by a sensor whenever the resort_raw asset is updated. I added this op and job to test out how Dagster handled declarative tasks and to evaluate sensors, which provide event-driven automation in addition to cron scheduling.
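The core of event-driven triggering is a cursor: each poll looks only at events newer than the last one handled. The sketch below shows that logic in plain Python. It is an illustration of the pattern, not Dagster's sensor API; `MaterializationEvent`, `resort_raw_sensor`, and the job name `rebuild_resort_clean` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class MaterializationEvent:
    asset: str
    event_id: int  # monotonically increasing, like a position in an event log


def resort_raw_sensor(
    events: List[MaterializationEvent], cursor: Optional[int]
) -> Tuple[List[str], Optional[int]]:
    """Return the jobs to launch and the advanced cursor.

    One job run is requested per new resort_raw event; the cursor keeps the
    same event from triggering the job again on the next poll.
    """
    last_seen = cursor if cursor is not None else -1
    new = [e for e in events if e.asset == "resort_raw" and e.event_id > last_seen]
    if not new:
        return [], cursor
    return ["rebuild_resort_clean"] * len(new), max(e.event_id for e in new)


log = [MaterializationEvent("resort_raw", 1), MaterializationEvent("resort_summary", 2)]
runs, cursor = resort_raw_sensor(log, None)  # first poll sees the resort_raw event
log.append(MaterializationEvent("resort_raw", 3))
runs2, cursor2 = resort_raw_sensor(log, cursor)  # only the new event triggers a run
runs3, cursor3 = resort_raw_sensor(log, cursor2)  # nothing new, nothing requested
```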
I added a few dbt assets to test out the Dagster and dbt integration. To tell dbt that it depends on Dagster, you use a dbt source. The asset key in Dagster must match the dbt source name + table, which in turn means the asset key must be key prefix + asset name. I wanted dbt to switch schemas (BQ datasets) for local/branch/prod deployments. I did this with an if statement in my dbt source.
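The naming constraint is easier to see when both sides are written out. This small Python sketch only models the matching rule and the per-deployment dataset switch; the dataset names and the `snow` prefix are hypothetical, and the real switch lives in the dbt source definition rather than in Python.

```python
from typing import Dict, List

# Hypothetical mapping from deployment type to BigQuery dataset name.
DATASETS: Dict[str, str] = {
    "local": "snow_dev",
    "branch": "snow_staging",
    "prod": "snow_prod",
}


def asset_key(key_prefix: List[str], asset_name: str) -> List[str]:
    """Dagster side: the asset key is the key prefix followed by the asset name."""
    return [*key_prefix, asset_name]


def dbt_source_key(source_name: str, table: str) -> List[str]:
    """dbt side: a source is identified by its source name followed by the table name."""
    return [source_name, table]


def dataset_for(deployment: str) -> str:
    """Plays the role of the if statement in the dbt source: pick the schema per deployment."""
    if deployment not in DATASETS:
        raise ValueError(f"unknown deployment: {deployment!r}")
    return DATASETS[deployment]


# The two keys must be equal for dbt models to depend on the Dagster asset.
keys_match = asset_key(["snow"], "resort_raw") == dbt_source_key("snow", "resort_raw")
```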
I evaluated a few different ways to handle authenticating to external services. For Google Cloud Storage I relied on the underlying environment to have access. In production, this underlying access is granted through the Kubernetes service account, which, in turn, binds to a GCP IAM service account in a process called workload identity. For BigQuery, I went with the alternative approach of supplying the IAM service account credentials directly to the client code. Locally those credentials are passed through environment variables. In production, those environment variables are set through Kubernetes secrets.
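The two authentication styles can coexist behind one small helper: use explicit credentials when an environment variable provides them, otherwise signal that the caller should rely on whatever the environment already grants. The sketch below assumes a hypothetical variable name, `BQ_SERVICE_ACCOUNT_JSON`, holding the service account key as JSON; the project may pass credentials differently.

```python
import json
import os
from typing import Mapping, Optional


def load_service_account_info(
    env: Mapping[str, str] = os.environ,
    var: str = "BQ_SERVICE_ACCOUNT_JSON",  # hypothetical variable name
) -> Optional[dict]:
    """Return parsed service-account info, or None to signal ambient credentials."""
    raw = env.get(var)
    if not raw:
        return None  # caller falls back to credentials provided by the environment
    info = json.loads(raw)
    missing = {"client_email", "private_key"} - set(info)
    if missing:
        raise ValueError(f"{var} is missing fields: {sorted(missing)}")
    return info


# Locally the variable comes from the shell; in production a Kubernetes secret sets it.
fake_env = {
    "BQ_SERVICE_ACCOUNT_JSON": json.dumps(
        {"client_email": "sa@example.com", "private_key": "fake"}
    )
}
info = load_service_account_info(fake_env)
```

Failing fast on a malformed value keeps a misconfigured secret from surfacing later as an opaque client error.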
Deploying the Project
With my PoC code running locally, I set out to test the production infrastructure requirements. Dagster Cloud supports both serverless and hybrid architectures. I opted for a hybrid architecture using Google Kubernetes Engine (GKE). I found the GKE deployment straightforward. I used the provided Helm chart to run an agent in a GKE cluster built with default settings. I then copied their default GitHub Actions to automate deploying my code as a Docker image into the cluster. Aside from fighting with Google IAM, I found this setup simpler than other Kubernetes tools that required custom Helm charts.
Reflecting on Dagster
My Dagster PoC highlighted a few critical advantages.
Dagster is aware that my goal is to create datasets. This awareness makes it possible for me to "re-materialize" specific assets (like data from one resort) and to see asset lineage. In comparison, other schedulers just execute tasks and lack the rich metadata and handy interface that is aware of the results. Partitions and backfills are also possible because Dagster is aware I am building datasets. If I update my asset logic, I can re-run the tasks to build the new asset with historical "upstream" data in a backfill, instead of manually managing tons of backfill logic in my code.
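For a sense of what a backfill does, here is a plain-Python sketch of the idea: split history into daily partitions, then re-apply updated load logic to the stored raw data for each day. This is the bookkeeping the orchestrator takes over, not Dagster's partition API. The function names, the sample data, and the inches-to-centimeters change are hypothetical.

```python
from datetime import date, timedelta
from typing import Callable, Dict, List


def daily_partition_keys(start: date, end: date) -> List[str]:
    """One key per day in [start, end], formatted as YYYY-MM-DD."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def backfill(
    raw_by_day: Dict[str, List[dict]],
    load_logic: Callable[[List[dict]], List[dict]],
    keys: List[str],
) -> Dict[str, List[dict]]:
    """Re-apply (possibly updated) load logic to stored raw data, one partition at a time."""
    return {key: load_logic(raw_by_day[key]) for key in keys if key in raw_by_day}


def load_v2(rows: List[dict]) -> List[dict]:
    """Updated logic: also report snowfall in centimeters."""
    return [{**row, "new_snow_cm": round(row["new_snow_in"] * 2.54, 1)} for row in rows]


raw = {  # made-up historical raw data, keyed by partition
    "2022-12-01": [{"resort": "resort_a", "new_snow_in": 4}],
    "2022-12-02": [{"resort": "resort_a", "new_snow_in": 10}],
}
keys = daily_partition_keys(date(2022, 12, 1), date(2022, 12, 3))
rebuilt = backfill(raw, load_v2, keys)  # days with no raw data are skipped
```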
Code Structure and Local Development
Because Dagster knows that I'm building datasets, it has strong opinions about how to structure my code. The data processing logic is separate from the data storage logic. This separation made it easy for me to have local development build on pandas with staging and production built on GCS and BigQuery. The core logic was the same (see assets), and the different storage for local/staging/prod was handled by resources. While you can fuss around with if statements to achieve similar outcomes in other tools, the first-class support in Dagster makes a world of difference.
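The separation of logic from storage can be sketched in plain Python. In the example below, `build_summary` is the core logic and never changes; only the storage object handed to it differs per environment. The class names, dataset names, and the dict-backed "warehouse" are hypothetical stand-ins for pandas-on-disk and GCS/BigQuery, and this is not Dagster's resource API.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Protocol


class Storage(Protocol):
    def write(self, name: str, rows: List[dict]) -> None: ...
    def read(self, name: str) -> List[dict]: ...


class LocalDiskStorage:
    """Local development: JSON files on disk."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def write(self, name: str, rows: List[dict]) -> None:
        (self.root / f"{name}.json").write_text(json.dumps(rows))

    def read(self, name: str) -> List[dict]:
        return json.loads((self.root / f"{name}.json").read_text())


class FakeWarehouseStorage:
    """Stand-in for staging/production: tables held in a dict, tagged with a dataset."""

    def __init__(self, dataset: str) -> None:
        self.dataset = dataset
        self.tables: Dict[str, List[dict]] = {}

    def write(self, name: str, rows: List[dict]) -> None:
        self.tables[name] = list(rows)

    def read(self, name: str) -> List[dict]:
        return self.tables[name]


def build_summary(storage: Storage) -> List[dict]:
    """Core logic: identical in every environment; only `storage` differs."""
    rows = storage.read("resort_raw")
    deepest = max(rows, key=lambda row: row["new_snow_in"])
    summary = [{"deepest_resort": deepest["resort"], "resorts": len(rows)}]
    storage.write("resort_summary", summary)
    return summary


def storage_for(deployment: str, local_root: Path) -> Storage:
    """The only place that knows which environment the code is running in."""
    if deployment == "local":
        return LocalDiskStorage(local_root)
    dataset = "snow_prod" if deployment == "prod" else "snow_staging"
    return FakeWarehouseStorage(dataset)


rows = [{"resort": "resort_a", "new_snow_in": 4}, {"resort": "resort_b", "new_snow_in": 9}]
with tempfile.TemporaryDirectory() as tmp:
    local = storage_for("local", Path(tmp))
    local.write("resort_raw", rows)
    local_summary = build_summary(local)

prod = storage_for("prod", Path("unused"))
prod.write("resort_raw", rows)
prod_summary = build_summary(prod)
```

The environment check lives in one function instead of being scattered through the processing code as if statements.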
Speaking of local development, in Dagster, I developed everything locally with Python. I didn't have to mess with Docker or minikube. In other schedulers I would have done the initial development with cloud resources, which dramatically slows things down (see v2 of this project). I get really distracted if I have to wait for Kubernetes schedulers to test a code change!
Once my Dagster code was ready, I did have to set up my production deployment, which took some Kubernetes iterations. However, those iterations were config iterations, not code changes: a one-time setup cost for the project. Now that production and staging are configured, I can make changes to my core code locally without ever waiting on the Kubernetes setup.
The code structure also makes it possible to do unit tests that use mock resources.
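As an illustration of testing with a mock resource, the sketch below uses Python's standard `unittest.mock`. The function `resort_report`, the `get_conditions` method, and the payload field names are hypothetical; the point is only that logic which receives its API client as an argument can be tested without network access.

```python
from unittest.mock import Mock


def resort_report(api_client, resort: str) -> dict:
    """Asset-style logic: all I/O goes through the injected `api_client`."""
    payload = api_client.get_conditions(resort)
    return {
        "resort": resort,
        "new_snow_in": float(payload["snow24h"]),
        "lifts_open": int(payload["liftsOpen"]),
    }


def test_resort_report_parses_payload() -> None:
    # The mock stands in for the real API client and returns a canned payload.
    api_client = Mock()
    api_client.get_conditions.return_value = {"snow24h": "6", "liftsOpen": "12"}

    report = resort_report(api_client, "resort_a")

    assert report == {"resort": "resort_a", "new_snow_in": 6.0, "lifts_open": 12}
    api_client.get_conditions.assert_called_once_with("resort_a")


test_resort_report_parses_payload()
```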
Airflow workers load all the code all the time. This architecture can create performance issues, but it also causes a dependency nightmare where the data transformation Python code has to be compatible with Airflow's internal dependencies. The natural workaround to these two problems is to create distinct Airflow clusters for everything, which sort of defeats the point of a scheduler knowing about the dependencies between things!
Dagster is built differently and ensures that the control plane is separate from my user code. For a toy project like this one, the impact is mostly hypothetical, but for actual workloads these architecture hurdles are a big deal.
My goal in evaluating Dagster was different from that of most data engineering teams. I was not trying to pick a new data tool, but rather to decide whether I should leave Google to work on a new orchestrator full time! My test project confirmed my initial excitement about Dagster and now, two months later, I am thrilled to be helping teams evaluate Dagster and see all these benefits for themselves. (Except for days where my Dagster snow report alerts me to critical powder thresholds!)