Release

Dagster 0.11.0: Lucky Star

In 0.11.0, we introduce dynamic orchestration, a new backfill UI, and support for tracking asset lineage.

April 1, 20215 minute read


Please welcome to the stage our eleventh major release of Dagster, codenamed "Lucky Star."

This release marks a substantial step forwards in several areas.

Release highlights

  • Dynamic orchestration, which lets pipeline subsets execute once for each element of an input processed at runtime, is now ready for wider usage.
  • Backfills of pipeline runs over partitioned datasets, to rebuild data assets when processes or underlying data change, are now supported with specialized UI.
  • Asset lineage to track the interdependencies of data assets produced by pipeline runs now has core API support.
  • Dagster is coiffed to impress

    We've also launched a new docs site, improved the asset catalog, improved interoperability between the Dagster type system and Python type hints, made many improvements to Dagit and to the robustness of core system components, and now support MySQL as a backend for core storage.

    Dynamic orchestration

    Many data pipelines include steps that have to be repeated once for each element of some input. Very often, you don't know in advance how many inputs you're going to have to process. And once you're done, you'll often want to collect the outputs of each process for downstream processing as an aggregate.

    For example, if you're processing files that you grab from an SFTP drop, you may want to run a complex subset of a pipeline once for each file, but the number of files may differ day-to-day. Once you've processed all the files, you might want to compute aggregate statistics or create a summary report.

    We think of this as the simplest kind of dynamic orchestration — where the work to be orchestrated is determined not at pipeline definition time but at runtime, dependent on data that's observed as part of pipeline execution.

    One of our longest-standing Github issues has been for an abstraction to help model this common case. In Dagster 0.10.0 "The Edge of Glory", we introduced experimental support for dynamic orchestration with the map operation, allowing solids to yield special outputs using the DynamicOutput class, over which the execution of downstream solids could be mapped at runtime.

    In 0.11.0, we've expanded our experimental support for this pattern (and closed issue 462!) with the collect operation, so that downstream solids can aggregate and process all of the results of the solids executed dynamically using map.

    What does this look like in practice? Let's imagine a data pipeline in which we process a directory containing some number of files. We will be doing significant computation on these files, so we want each file to be handled in its own unit of computation. This will let us parallelize computation across processes or machines and retry the processing of each file individually if something goes awry. Every time this pipeline runs, the directory contains new files, and the number of files can vary substantially.

    In this example, the files_in_directory solid is defined using a special DynamicOutputDefinition. The body of the solid walks a directory (set via config) and yields a DynamicOutput for each file that it finds in the directory. Each DynamicOutput includes a user-specified mapping_key.

    @solid(
        config_schema={"path": str},
        output_defs=[DynamicOutputDefinition(str)],
    )
    def files_in_directory(context):
        path = context.solid_config["path"]
        dirname, _, filenames = next(os.walk(path))
        for file in filenames:
            yield DynamicOutput(
                value=os.path.join(dirname, file),
                mapping_key=file_name_to_key(file),
            )
    

    In the pipeline definition, we need to map over the outputs of this solid and then, later, collect the results of the mapped computation.

    @pipeline
    def process_directory():
        files = files_in_directory()
        file_results = files.map(process_file)
        summarize_directory(file_results.collect())
    

    We believe that a user should know what is going to happen before a pipeline executes. So Dagit always lets you render a pipeline in its logical form before computation, even in the presence of dynamic orchestration. Let's look at how the map and collect operations are represented in Dagit. Both the mapped dynamic output and the collected input have special sigils and the mapped solid is embellished in the pipeline view to illustrate that it will be executed dynamically.

    Dynamic pipeline as it appears in Dagit

    When we execute the pipeline, each copy of the mapped computations is identified in the execution timeline view by its associated mapping_key in square brackets.

    Dynamic pipeline in Dagit's execution viewer

    Dynamic orchestration is not meant to replace frameworks that perform massively parallel computation over large data sets. For that task, you should use a specialized distributed computation framework like Spark or Dask. Dagster sits at a different layer of the stack, and dynamic orchestration is intended to let you express coarse-grained, logical parallelism in your pipelines.

    We'd love to learn about your use cases and to get your feedback on this feature as we move it from experimental to fully supported.

    Running backfills from Dagit

    As data practitioners, we often find ourselves working with partitioned data pipelines. For example, you might have a pipeline that consumes a table of events generated by a web application and produces aggregate statistics, like "average session length." It's quite likely that you'll want to partition this pipeline by time — perhaps running it daily, over the past day's data — so that you can track the aggregate statistics as they change over time. Or you may have a pipeline that consumes a table of user data, where you run the pipeline for each user cohort to track the performance of different user groups.

    But partitions can lead to operational complexities. When everything's going smoothly, you can incrementally update your data assets by creating new pipeline runs as new data becomes available. But what happens when you change the definition of a key metric, and you need to compare it to historical data that was computed using different pipeline logic? What if your upstream data is revised and you need to understand the consequences? This is where it becomes critical to kick off a backfill — a set of pipeline runs that revise previously calculated partitions.

    We're continuing to invest in making this task easier than ever, and in 0.11.0, we're shipping a new dedicated backfill interface in Dagit to help you launch, monitor, and cancel in-flight backfills over any or all of your partitions.

    The Partitions Tab

    In Dagster, partitioned pipelines are represented using the PartitionSet API, which defines the partitions over which a given pipeline can be run (for example, a set of days).

    The Dagit partitions tab makes it easy to view all the historical runs for a particular partition set, organized by partition. You can navigate to this interface from the pipeline page or search for a partition set in Dagit's universal search box.

    The Dagit partitions tab

    Each row in this matrix corresponds to one solid in the partitioned pipeline. Each column corresponds to one of the partitions. So each box represents a solid's execution over a particular partition. The box's shade indicates the state of the latest run for that solid over the given partition. Red indicates a failure, and white indicates that no run has been executed.

    Because Dagster keeps an immutable record of all pipeline executions, you can drill down into an individual partition to see the history of all runs executed for that partition. Unlike systems like Airflow, which overwrite task history, with Dagster it's possible to reconstruct what happened in the past when debugging tricky production issues.

    The history of runs for a partition

    Launching backfills

    You can also launch backfills directly from the partitions tab. The backfill launch UI includes the ability to visually select partition to backfill:

    Selecting the partitions over which to launch a backfill

    After launching a backfill, you can jump back to the partitions view to monitor its progress:

    Observing the progress of a backfill

    Execution model

    Backfill execution should also be much more robust, with backfills now managed by Dagster's daemon process. This means you can launch large backfills over thousands of partitions without the risk of bringing Dagit down or interfering with other uses of the system.

    We know that backfills are often a huge pain for users of other data orchestration systems, so we're very interested in reducing the pain associated with this common chore. If you find yourself running a lot of backfills, the core team would greatly appreciate your feedback on usability.

    Thinking about asset lineage

    Dagster's asset catalog, introduced in Dagster 0.8.0 "In The Zone", lets you track metadata about the assets produced by pipeline runs in a centralized interface. The asset catalog has already proven to be incredibly useful for collaboration between diverse clients of the data platform.

    In 0.11.0, we're expanding this system by adding an experimental interface to track asset lineage: which other assets a given asset depends on. By tagging the outputs of solids with associated asset keys, you can tell Dagster to use what it already knows about the data dependencies between solids to track the dependencies of assets on other assets, layering lineage information on top of the data dependency graph.

    You can associate outputs with assets in a couple of ways (see the docs for details), but the simplest is just to set the asset_key parameter on an OutputDefinition.

    @solid(output_defs=[OutputDefinition(asset_key=AssetKey("my_db.users"))])
    def scrape_users(_):
        users_df = some_api_call()
        persist_to_db(users_df)
        return users_df
    
    
    @solid(output_defs=[OutputDefinition(asset_key=AssetKey("ml_models.user_prediction"))])
    def create_model(_, users_df):
        my_ml_model = train_prediction_model(users_df)
        persist_to_model_store(my_ml_model)
        return my_ml_model
    
    
    @pipeline
    def my_user_model_pipeline():
        get_prediction_model(scrape_users())
    

    In this toy example, our scrape_users solid pulls some data into a database table so that the downstream create_model solid can build an ML model. We'd like users of our data platform who are browsing through the asset catalog to know that the ML model depends on the database table.

    Dagster already knows about the data dependencies between our solids. With a minimal amount of additional metadata, Dagster can construct the lineage of the assets produced by those computations at runtime.

    Inferring the lineage graph from data dependencies

    It's early days for this feature, and we expect that the lineage interface in particular will evolve significantly. But as soon as you associate assets with outputs, you'll start to see information about asset lineage appear in the Asset Catalog in Dagit.

    Asset lineage displayed in the Dagit asset catalog

    We're very excited to see the use cases that the community serves with this feature and to hear more about your specific needs: from data debugging (which tables do I need to query to find the source of my data issue?) to data discovery (which tables are used to train this model?), to impact analysis (if I change this file on s3, who do I need to notify?).

    Check out the docs to learn more, and if you're excited about this product direction, please don't hesitate to reach out to the core team with feedback.

    New documentation site

    We're also very proud to present our new documentation site, redesigned from the underlying infrastructure to the information architecture of each page, which lives at https://docs.dagster.io. This new site will host documentation for all Dagster versions from 0.11.0 onwards. (If you're on an older version of Dagster, you can still view pre-0.11.0 documentation at https://legacy-docs.dagster.io.)

    We hope you'll find the new organization makes it easier for you to quickly find the information you need about Dagster. We've also made it easier for the community to give feedback and contribute to the documentation.

    Top-level content organization

    • The Tutorial is a step-by-step walk-through of building a basic Dagster pipeline that takes advantage of the most important Dagster features. If you're new to Dagster, we highly recommend starting here.
    • Main Concepts contains comprehensive overviews of all of the main Dagster APIs and features, from solids and pipelines to modes, resources, Dagit, and more. Each concepts page contains examples and patterns you can re-use and build off of in your own Dagster projects.

    A new concepts page in the Dagster documentation

    • Deployment explains Dagster's architecture, how to deploy Dagster on platforms such as AWS and Kubernetes, and the components that can be configured in deployments, such as the instance, daemon, and run launchers. If you've successfully deployed Dagster on a platform that isn't discussed here, and are interested in collaborating on a deployment guide, please reach out to us!

    • Integrations includes guides to the many integrations with popular data tools and platforms maintained as part of the core repository, including Pandas, dbt, Slack, Jupyter Notebooks, and more.

    • The Guides are still early in development, but is intended to document best practices combining multiple Dagster concepts to accomplish common goals. If you have suggestions on what guides you would like to see, feel free to open an issue or send feedback on Slack.

    • Finally, the API Reference* contains comprehensive API documentation of all the decorators, functions, and classes exported by Dagster. If you want to explore the full API surface area, this is the place to search.

    Searching the docs

    Searching the Dagster documentation

    We've also invested a bunch of effort into improving search in the Dagster docs. Previously, the search bars for the main content and for the API docs were separate. We've now introduced a single, unified search bar, with more visual hinting to show where results are coming from. We're always working on improving search results, so that you can quickly find the information you need. If you find that you are not getting the results you expect, please give us feedback or submit an issue.

    Contributing to documentation

    With the new release, we've also made it easier for the community to give feedback and contribute back to the docs site.

    If you've encountered a docs issue, there are now two main ways to give feedback:

    1. The Share Feedback button in the header of the documentation site
    2. Open a Github issue with the Documentation Template

    If you spot issues such as a typo, confusing explanation, or broken code example, you can also click the Edit on Github button on any page to fix the content right in your browser and send a PR.

    If you're interested in writing longer-form content for the documentation site or the Dagster blog, please reach out! Our community is full of people using Dagster in all sorts of interesting ways, and we'd love to feature your learnings on the site.

    We're particularly interested in help with:

    • Guides to illustrate best practices
    • Deployment documentation for different platforms
    • Examples and patterns illustrating the Main Concepts

    Thank you!

    There's much more in this release that we haven't had space to cover in this blog post. As always, please see the changelog for full details about changes in this release, including many more small fixes and improvements, and the migration guide for instructions on migrating code and Dagster instances from 0.10.x to 0.11.0.

    A huge thank you as always to the community contributors who helped make Lucky Star a reality: @alex-treebeard, @basilvetas, @bollwyvl, @dehume-drizly, @emilmelnikov, @ericct, @jmsanders, @joshuataylor, @luander, @mrdavidlaing, @pacha, @rjendoubi, @stonebig, @taljaards, @wingyplus.

    And thanks to everyone for being active on Slack and GitHub and helping the core team better understand the community's needs. We're very excited about the evolution of the system and would love to help you get as involved as you want to be. Please don't hesitate to reach out with any questions or suggestions.


    Read more filed under
    Release