By Sandy Ryza (@s_ryz)
Unless you work at a company whose mission is to build a better classifier for the Iris dataset, doing machine learning means building data pipelines. To train an ML model - whether it’s an sklearn logistic regression or a TensorFlow/PyTorch/Hugging Face deep learning model - you need a training set, which usually requires a set of data transformations. After you’ve trained your model, you need to evaluate it - which, in realistic settings, often requires another set of data transformations for backtesting. And once you’ve chosen a model, you often want to use it to perform inference on batches of data, which requires - you guessed it - more data transformations.
Then, you need all of this to be repeatable. As new training data comes in, you need to re-train and re-evaluate your model. And you need to be able to iterate - to try out changes without interfering with the data and machine learning models that you’ve deployed to production.
In the five years I spent as a data scientist and ML engineer at Cloudera, Clover Health, and Motive, I found that building and operating these pipelines was both the most time-consuming and the most important part of my job. Improving them often had a larger impact on the final result than improving the model itself.
I came to work on Dagster because of the frustrations I experienced with existing tooling for building and operating ML pipelines. I came to believe that the best way to make repeatable ML pipelines is with a data orchestrator: a system that models graphs of data assets and the data transformations that connect them. The right orchestrator can dramatically accelerate the speed at which you improve your ML models, reduce errors and downtime, and help your team understand how your ML assets are faring in production.
Orchestrating ML pipelines with Dagster
Dagster is an open source data orchestrator that is widely used for building machine learning pipelines. Roughly half of Dagster’s users use it for ML.
Dagster makes it easy to define training and batch inference pipelines in Python, to test and experiment with them locally, and then to run them reliably in production. Its Pythonic APIs enable defining pipelines by applying decorators to vanilla Python functions.
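For instance, here’s a minimal sketch of what that looks like, assuming a recent Dagster release; the asset names and toy data are made up for illustration. Each decorated function becomes a data asset, and dependencies are declared simply by naming upstream assets as function parameters:

```python
import pandas as pd
from dagster import Definitions, asset


@asset
def raw_events() -> pd.DataFrame:
    # In a real pipeline, this might read from a data lake or warehouse.
    return pd.DataFrame({"user_id": [1, 1, 2], "event": ["click", "buy", "click"]})


@asset
def events_per_user(raw_events: pd.DataFrame) -> pd.DataFrame:
    # The parameter name declares a data dependency on the raw_events asset.
    return raw_events.groupby("user_id").size().rename("num_events").reset_index()


defs = Definitions(assets=[raw_events, events_per_user])
```

Running `dagster dev` against a module like this brings up a local UI where either asset can be materialized with a click.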
Compared to workflow managers like Apache Airflow that are used for orchestrating data pipelines, Dagster is different in a couple of main ways:
- It’s designed for use during development. Its lightweight execution model means you can access the benefits of an orchestrator — like re-executing from the middle of a pipeline and parallelizing steps — while you’re experimenting with features or ML model parameters.
- It models data assets, not just tasks. A data asset is a machine learning model, table, or other persistent object that captures some understanding of the world. An orchestrator that understands assets can model the end-to-end dataflow between ML models and the datasets upstream and downstream of them, so you can tweak a data transformation that lives far upstream and see how that affects model performance before merging to production.
ML pipelines are asset graphs
One way of looking at an ML pipeline is as a big fat graph of data assets.
At the root are core datasets that form the basis for the features that the ML model will operate on. ML engineers often need to modify or add to these core datasets to get the data they need for their machine learning models.
The feature set and label set are derived from these core datasets.
The training ML pipeline operates on the feature set and label set. Its outputs are an ML model - e.g. an sklearn LogisticRegression or a Keras Model - a set of predictions on the training and test data, and evaluation metrics for the model.
Truly evaluating a model often requires going further than just looking at its accuracy on the test set. Often it’s important to run backtests on historical data to understand the impact it’s likely to have on the business. For example, if you’re training a model that detects fraud, you can run it on past transactions to understand the dollar amount of fraud that the model would have caught over a given time period.
Finally, in batch inference settings, there will be an asset that captures the model’s predictions on unlabeled data, usually updated more frequently than the model is re-trained.
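Sketched as Dagster assets (with made-up names, toy data, and a deliberately simplistic evaluation), that graph might look roughly like this:

```python
import pandas as pd
from dagster import asset
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


@asset
def core_transactions() -> pd.DataFrame:
    # Stand-in for a core dataset; in practice this comes from upstream pipelines.
    return pd.DataFrame({
        "amount": [12.0, 990.0, 3.5, 420.0, 57.0, 8.0, 305.0, 66.0],
        "merchant_risk_score": [0.1, 0.9, 0.2, 0.8, 0.3, 0.1, 0.7, 0.4],
        "is_fraud": [0, 1, 0, 1, 0, 0, 1, 0],
    })


@asset
def feature_set(core_transactions: pd.DataFrame) -> pd.DataFrame:
    return core_transactions[["amount", "merchant_risk_score"]]


@asset
def label_set(core_transactions: pd.DataFrame) -> pd.Series:
    return core_transactions["is_fraud"]


@asset
def fraud_model(feature_set: pd.DataFrame, label_set: pd.Series) -> LogisticRegression:
    # Hold out a test set; the fixed random_state lets downstream assets reproduce the split.
    X_train, _, y_train, _ = train_test_split(feature_set, label_set, random_state=0)
    return LogisticRegression().fit(X_train, y_train)


@asset
def model_evaluation(
    fraud_model: LogisticRegression, feature_set: pd.DataFrame, label_set: pd.Series
) -> dict:
    # A simple holdout accuracy check; backtests would be further assets downstream.
    _, X_test, _, y_test = train_test_split(feature_set, label_set, random_state=0)
    return {"accuracy": float(accuracy_score(y_test, fraud_model.predict(X_test)))}


@asset
def fraud_predictions(fraud_model: LogisticRegression, feature_set: pd.DataFrame) -> pd.Series:
    # In a batch inference setting, this would score new, unlabeled data on its own cadence.
    return pd.Series(fraud_model.predict(feature_set), index=feature_set.index)
```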
Needless to say, there’s a lot going on here, and trying to remember what’s impacted after making a change to one of these assets is a difficult and error-prone endeavor.
Execution vs. data dependencies
It’s useful to distinguish here between data dependencies, which are represented in the asset graph, and execution dependencies.
Just because these assets form a single graph doesn’t mean that you want to update them all on the same schedule:
- Different datasets that you build features from might be updated on different cadences - for example, new events might arrive in your data lake every hour, but a data dump from a vendor might arrive only once per month.
- Even if the datasets you build features from are updated hourly, you likely don’t want to re-train your model every hour. It wouldn’t change significantly and could be computationally expensive - training GPT-3 took nine days and $4.5 million.
- You might want to evaluate or backtest your model more frequently than you train it, to understand how it’s performing on new data.
- The frequency at which you run inference depends on how the results of that inference will be used.
However, it’s often useful to kick off ad-hoc runs that span the entire pipeline:
- When experimenting with changes to a model, you might want to try adding a column to a core dataset, building features from that column, training the model, and backtesting it.
- If there’s a failure, or after deploying a code change, you might want to kick off a backfill that updates all the assets in the graph.
Dagster decouples data dependencies from execution dependencies: it lets users put any subgraph of their asset graph on a schedule, but also kick off ad-hoc runs that execute any subset of assets in the full graph.
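As a sketch of what that looks like in code (reusing the illustrative asset names from above, with `my_project.assets` as a hypothetical module and purely illustrative cron cadences):

```python
from dagster import Definitions, ScheduleDefinition, define_asset_job

# Hypothetical module containing the asset definitions sketched earlier.
from my_project.assets import (
    core_transactions,
    feature_set,
    fraud_model,
    fraud_predictions,
    label_set,
    model_evaluation,
)

# Refresh the feature and label sets hourly...
features_job = define_asset_job("refresh_features", selection=["feature_set", "label_set"])

# ...score new data daily...
inference_job = define_asset_job("batch_inference", selection=["fraud_predictions"])

# ...but only retrain and re-evaluate the model weekly.
training_job = define_asset_job(
    "retrain_and_evaluate", selection=["fraud_model", "model_evaluation"]
)

defs = Definitions(
    assets=[
        core_transactions,
        feature_set,
        label_set,
        fraud_model,
        model_evaluation,
        fraud_predictions,
    ],
    schedules=[
        ScheduleDefinition(job=features_job, cron_schedule="0 * * * *"),
        ScheduleDefinition(job=inference_job, cron_schedule="0 2 * * *"),
        ScheduleDefinition(job=training_job, cron_schedule="0 6 * * 1"),
    ],
)
```

Ad-hoc runs over any other subset of the graph - including the whole thing - can then be launched from the Dagster UI without defining additional jobs.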
Orchestration during model development
Development in machine learning usually starts with a question: What if we change this hyperparameter value? What if we add this new feature? What if we swap out this data source with a different one?
Any input to a model training process, and anything that input data depends on, can affect the ultimate model that comes out. We want to understand the impact of that change:
- How will it affect the performance of our model?
- Will our model run at all?
The speed at which you can improve your model hinges on how quickly you can pose and answer these questions. Answering one of these questions typically happens in a few steps:
1. Change the code that generates one of the assets in your graph - e.g. your feature set or a dataset that your feature set is derived from.
2. Materialize that asset and a set of downstream assets.
3. Inspect the results.
For many ML engineers, step 2 is a highly manual and error-prone process. They run some scripts, execute some cells in a notebook, see some results they didn’t expect, realize they forgot to run an important intermediate script, find that intermediate script, etc.
An orchestrator speeds this up significantly. Because it provides a single point of entry for executing everything, tracks dependencies, and tracks what has already executed, step 2 becomes a single button click or terminal command. And if something fails in the middle, it’s easy to fix your code and pick up where you left off.
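With Dagster, that terminal command or button click might look like the following sketch. It reuses the illustrative asset names from earlier, `my_project.assets` is again a hypothetical module, and exactly where unselected upstream values are loaded from depends on your IO manager and instance setup:

```python
from dagster import materialize

# Hypothetical module containing the asset definitions sketched earlier.
from my_project.assets import (
    core_transactions,
    feature_set,
    fraud_model,
    fraud_predictions,
    label_set,
    model_evaluation,
)

# After editing the code behind feature_set, rebuild it and everything downstream
# of it in a single local, in-process run. Unselected upstream assets (like
# core_transactions) are not recomputed; their previously materialized values are loaded.
result = materialize(
    [core_transactions, feature_set, label_set, fraud_model, model_evaluation, fraud_predictions],
    selection="feature_set*",
)
assert result.success
```

Recent Dagster releases expose roughly the same entry point on the command line, e.g. `dagster asset materialize --select "feature_set*" -m my_project`.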
Dagster: lightweight in development, heavyweight in production
There’s a reason that orchestrators like Airflow aren’t typically part of an ML development lifecycle: they’re too heavyweight.
To run Airflow, you need to run a database, a scheduler, and a web server. Additionally, when you write an Airflow pipeline, it’s usually bound to production services that are overkill during development. When experimenting with a model, you don’t want to spin up a Kubernetes pod for each step, and you don’t want to write data to your production data warehouse.
Dagster helps separate the “business logic” in a pipeline from the infrastructure it runs on top of. This means you can have a very lightweight setup in development without sacrificing robustness in production. For example (see the sketch after this list):
- During development, you can store intermediate data in memory. During production, you can store it in S3 or Snowflake.
- During development, you can run all the steps in your pipeline within a single process. During production, each step can happen in its own Kubernetes pod.
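Here is a minimal sketch of the storage half of that split, assuming a recent Dagster release. The DAGSTER_DEPLOYMENT environment variable, paths, and bucket name are made up, and the exact S3 IO manager import depends on your dagster-aws version; the per-step Kubernetes behavior would similarly be swapped in through Dagster’s Kubernetes run launcher and executor in production:

```python
import os

from dagster import Definitions, FilesystemIOManager

# Hypothetical environment variable distinguishing local development from production.
if os.getenv("DAGSTER_DEPLOYMENT", "dev") == "dev":
    # Development: keep intermediate data on the local filesystem so iteration is
    # fast and nothing touches production storage.
    io_manager = FilesystemIOManager(base_dir="/tmp/dagster_dev")
else:
    # Production: persist intermediate data to S3 instead. The exact names depend
    # on your dagster-aws version; roughly:
    from dagster_aws.s3 import S3PickleIOManager, S3Resource

    io_manager = S3PickleIOManager(s3_resource=S3Resource(), s3_bucket="my-ml-bucket")

defs = Definitions(
    # assets=[...],  # the same asset definitions run in both environments
    resources={"io_manager": io_manager},
)
```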
Developing across the entire asset graph
The more that you’re able to merge machine learning and data processing steps into a single ML pipeline, the more “what happens if…?” questions you can answer.
If you make any changes to the base tables that you build features or labels from, they’re going to affect what happens in the modeling phase. A change to how one of the columns is defined in your “users” table is just as likely to hurt or boost your model’s precision as a tweak to a hyperparameter is.
Similarly, there are often many transformation steps that come after the modeling. If you’re tweaking a hyperparameter in your model, often you’re curious about more than just the effect of that change on your model’s precision. For example, if you’re using a model to approve loans, it’s useful to understand how a change to the model will affect the number of loans you’re likely to approve. And that might take many steps and data transformations after the modeling phase.
If you maintain separate pipelines for core data, ML models, and analytics, it becomes very difficult to build this understanding. But if machine learning, core data, and analytics live underneath the same framework, it becomes easy to assess how making a change to a core table will affect ML models. Or to assess how changes to ML models are likely to impact the business.
But putting everything together is easier said than done. A couple of difficulties tend to emerge:
- Agreeing on a framework for defining ML pipelines. Analytics engineers love dbt. ML engineers want a Pythonic API and a development environment that permits quick iteration.
- Managing code. If the entire ML pipeline needs to be defined inside a single DAG object, that object becomes an unwieldy point of contention between different teams.
Dagster is built to mitigate these difficulties:
- It has a Pythonic API and light footprint that makes it accessible to ML engineers and data scientists. You can define pipelines by writing functions that accept and return the Python objects exposed by frameworks like sklearn, PyTorch, TensorFlow, or Keras.
- It has deep integration with tools like dbt - it can represent the dbt graph at full fidelity and interleave dbt models with Dagster assets (see the sketch after this list).
- It enables defining pipelines “bottom-up” instead of “top-down.” You don’t need a monolithic object that captures all the dependencies. Instead, each node contains references to the upstream nodes it depends on.
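As a rough sketch of the dbt point (assuming dagster-dbt is installed; the project paths, the orders_enriched model, and its columns are all made up, and loading a dbt model as a DataFrame also assumes an IO manager that can read the corresponding warehouse table):

```python
import pandas as pd
from dagster import Definitions, asset
from dagster_dbt import load_assets_from_dbt_project

# Load every model in a dbt project as a Dagster asset (paths are placeholders).
dbt_models = load_assets_from_dbt_project(
    project_dir="path/to/dbt_project", profiles_dir="path/to/dbt_profiles"
)


@asset
def feature_set(orders_enriched: pd.DataFrame) -> pd.DataFrame:
    # Depends on the hypothetical dbt model named "orders_enriched"; the parameter
    # name matches that model's default asset key, so Dagster wires the dependency.
    return orders_enriched[["order_total", "days_since_signup"]]


defs = Definitions(assets=[*dbt_models, feature_set])
```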
Notebooks and reusable data assets
Data orchestrators and notebooks like Jupyter have some overlap in their features and philosophies. They both let you split up your program into chunks that can be executed independently: cells in notebooks and ops in Dagster. This lets you pick up execution from the middle: unlike with scripts, you don’t need to start from the beginning whenever you make a change.
Orchestrators will likely never replace notebooks. However, when parts of your notebook have reached a level of stability, pulling them out into your machine learning pipeline can give them better reliability and observability in production, as well as make them possible to reuse elsewhere.
Each notebook cell typically corresponds to one or a few data assets - it produces some objects that are used by downstream cells in the notebook. Often these objects would be widely useful across your data platform. If you could expose them and keep them continuously up-to-date, then other notebooks and machine learning pipelines could take advantage of them.
It’s straightforward to factor out a cell in a notebook into a reusable Dagster asset: it just requires moving the code into an editor, surrounding it in a function definition, and applying the @asset decorator. The asset can still be accessed from within the notebook: you can invoke a function to materialize it or a function to load its value.
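For example, here’s a minimal sketch, where customer_features, its upstream raw_customers asset, and the signup_date column are all made up for illustration:

```python
# In a Python module (e.g. assets.py) rather than a notebook cell:
import pandas as pd
from dagster import asset


@asset
def customer_features(raw_customers: pd.DataFrame) -> pd.DataFrame:
    # Formerly a notebook cell; now a reusable, schedulable asset.
    features = raw_customers.copy()
    features["tenure_days"] = (
        pd.Timestamp.now(tz="UTC") - pd.to_datetime(features["signup_date"], utc=True)
    ).dt.days
    return features
```

Back in the notebook, you can still work with the asset - for example by calling `materialize_to_memory([raw_customers, customer_features])` and then pulling the value out of the result with `output_for_node("customer_features")` - though the exact value-loading APIs vary a bit across Dagster versions.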
Dagster also includes integrations with Papermill and Noteable, which enable executing notebooks as part of a data pipeline.
Conclusion
A machine learning model is only as good as the pipeline that trains and evaluates it.
Building and maintaining ML pipelines is a difficult and time-consuming endeavor, but it’s likely the most important part of an ML engineer’s job.
A data orchestrator is a harness for developing and operating these machine learning pipelines. The right orchestrator can have a huge impact on the success of this endeavor.
By being lightweight enough to use during development and modeling the entire graph of assets that surrounds a machine learning model, Dagster enables ML engineers to try out ideas faster, translate those ideas into production more easily, and understand how they’re performing over time.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!