We’re thrilled to announce a new integration between Dagster and a fellow open-source project, Great Expectations (GX). GX is the fastest-growing open-source data validation and documentation framework, helping data teams save time and maintain the analytic integrity of their data. GX lets users express what they expect from data in a simple, declarative manner. “Expectations” are like unit tests for datasets: they help you guard against upstream data changes and monitor data quality.
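To make that concrete, here is a minimal sketch of what expectations look like in GX’s Pandas-backed API (the file name and column names are hypothetical):

import great_expectations as ge

# Load a CSV as a GX dataset; expectation methods validate
# immediately and return a result object
df = ge.read_csv("users.csv")

# Declarative assertions about the data, like unit tests for a dataset
df.expect_column_values_to_not_be_null("user_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)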
If you haven’t worked with GX before, you might want to take a look at Getting started with Great Expectations before continuing. We’ll assume that you are familiar with GX concepts like the data context, expectations, and data docs in what follows. If you’re already using GX and you’re eager to integrate your existing test suites into your Dagster pipelines, you may want to jump right into a fully worked example. Otherwise, keep reading for an introduction to the system and how it integrates with Dagster.
A schematic example
Many real-world pipelines have roughly the following schematic structure: first they read in some large data file, then they run some large-scale, computationally costly analysis on it.
import pandas as pd
from dagster import pipeline, solid

@solid
def load_data():
    # Read in some large data file
    return pd.read_csv("data.csv")

@solid
def analyze(dataframe):
    # Run a large-scale, computationally costly analysis
    ...

@pipeline
def data_pipe():
    analyze(load_data())
This is great when you’re on the happy path. But in the real world, the data frame returned from load_data might have serious and unexpected issues — whether from human error, faulty APIs, or shifts in the reality to which your data corresponds. Ideally, we want to catch these data issues before we do costly analysis or pass bad data to downstream processes.
from dagster_ge import ge_validation_solid_factory

# "my_datasource" and "my_suite" name a datasource and expectation
# suite already defined in your GX project
validate = ge_validation_solid_factory("my_datasource", "my_suite")

@pipeline
def data_pipe():
    analyze(validate(load_data()))
The core of our new integration with GX is a validation solid that can run your expectation suites alongside the other processes in your pipelines. Like other data processes modeled as Dagster solids, the validation solid is configurable at runtime. Because the GX data context is modeled as a Dagster resource, you can swap out your GX config in Dagit or by passing run config through the Python APIs. In the current state of the integration, you’ll still need to define the configurations you want to switch between using GX’s own tooling, but we’re working with the GX team to parameterize and expose as much config as possible natively in Dagster.
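Concretely, a launch-time config swap through the Python APIs might look like the following sketch (the project path is a placeholder, and we assume the ge_data_context resource has been wired into data_pipe, as shown in the next section):

from dagster import execute_pipeline

# Each run can point the ge_data_context resource at a different GX
# project root, swapping the entire GX config at launch time
result = execute_pipeline(
    data_pipe,
    run_config={
        "resources": {
            "ge_data_context": {
                "config": {"ge_root_dir": "/path/to/great_expectations"}
            }
        }
    },
)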
How it works
At its core, this integration exposes two interfaces between Dagster and GX: the ge_data_context resource, which makes GX config available to pipelines, and the ge_validation_solid_factory, which builds the solids that actually use it.
The ge_data_context resource points to the root directory of your GX config. Users of our integration don’t need to change anything about existing GX workflows: you can still follow GX’s intro tutorials, build expectation suites using GX’s native tooling, and so forth. All you need to do to expose the resulting data context to Dagster is configure this resource.
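As a sketch, supplying the resource to the pipeline from the earlier example might look like this (the mode wiring is illustrative; the same resource config can also be provided in Dagit):

from dagster import ModeDefinition, pipeline
from dagster_ge import ge_data_context

# Expose the GX data context to solids by supplying the
# ge_data_context resource on the pipeline's mode
@pipeline(
    mode_defs=[ModeDefinition(resource_defs={"ge_data_context": ge_data_context})]
)
def data_pipe():
    analyze(validate(load_data()))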
The ge_validation_solid_factory builds a solid that executes an expectation suite. The factory takes the name of your datasource, the name of the expectation suite you want to run, and an optional third parameter naming a validation operator. That lets you invoke any special functionality you’ve already built through GX, such as Slack notifications on failed expectations. By default, the solid simply runs the expectation suite and outputs the results without saving anything to file.
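For example, assuming the optional third parameter is passed as validation_operator_name and that you’ve configured an action_list_operator in your GX project (both names here are assumptions about your setup), using it might look like:

# Run the suite through a validation operator defined in your GX
# project, e.g. one that sends Slack notifications on failures
# (the parameter and operator names below are assumed)
validate_and_notify = ge_validation_solid_factory(
    "my_datasource",
    "my_suite",
    validation_operator_name="action_list_operator",
)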
To see all this in code, check out the integration example in the Dagster docs.
Where we’re going from here
We hope these basic APIs let teams hit the ground running when bringing GX’s powerful data quality capabilities to their Dagster pipelines.
Of course, this is just the beginning. If you’re interested in the future of finer-grained integrations that expose more of the data context configuration and creation process at the Dagster resource level — or if you have other ideas about how we can make these two projects even better together — please get in touch! Our two teams would love to hear from you.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!