
June 15, 2022 · 6 minute read

Dagster 0.15.0: Cool for the Summer

Mollie Pettit

We’re excited to release version 0.15.0 of Dagster! In 0.15.0, software-defined assets are now marked fully stable and ready for primetime; we recommend using them whenever building and maintaining data assets is your goal.

In 0.14.0, we released a fairly mature version of software-defined assets that we encouraged users to start experimenting with. There were a few large gaps that we believed stood in the way of unreserved endorsement at the time, and we have since addressed these. Software-defined assets now support configuration, they’re easier to work with at scale, and they can be interleaved with ops through graph-backed assets.

Software-defined assets aren't a replacement for Dagster's core computational concepts—graphs, jobs, and ops. Rather, they're a layer on top that links those computations to the long-lived objects they interact with. Using software-defined assets means building your Dagster jobs in a way that declares ahead of time what assets they produce and consume. This gives you far-reaching lineage, improved code ergonomics, and makes it possible to operate your assets directly.

Independent of software-defined assets, we’ve made a number of other improvements to Dagster:

  • Job retries: you can now automatically re-execute runs from failure. This is analogous to op-level retries, except at the job level.
  • Redesigned partition & backfill pages: they're faster and show the status of all partitions.
  • Improved left navigation pane in Dagit: it’s now organized by repository.
  • Python API improvements: input_values argument to execute_in_process, generic Outputs, job-level metadata, and more.

To learn more about software-defined assets, check out this Introducing Software-Defined Assets blog post and this Data Council talk on Software-Defined Assets in Dagster. Here's a nuts and bolts guide on why, when, and how to use them, written for existing Dagster users.

Software-Defined Assets

Organizing software-defined assets

In Dagster 0.14.0, we rendered all assets together in a single asset graph. This worked fine when you were starting out with a few assets, but became very unwieldy when you had a repository with hundreds of assets.

We wanted to make it easy to organize assets into smaller groups that would cut out the noise of irrelevant assets.

By dogfooding our APIs in a few different scenarios, we discovered that our AssetGroup class was not a good fit as the unit of organization. This is because the collections of asset definitions that make the most sense to pass around together in code don’t necessarily have a 1:1 correspondence with the collections that make sense to show together in Dagit. For example, although it’s typical to load and pass around all your dbt asset definitions together, it’s most useful to see them organized in Dagit according to the folder structure that the dbt models live in.

With 0.15.0, asset definitions now have a group_name property. You can set this property on the assets directly, or apply it at load time by providing an argument to functions like assets_from_package_module. We’re also making it easy to load a set of assets across a variety of subfolders and assign group_names based on the names of the subfolders.
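
For example, here’s a minimal sketch of setting the group directly on an asset (the asset and group names are made up for illustration):

from dagster import asset

@asset(group_name="marketing")
def new_user_signups():
    # hypothetical asset: pull signups from wherever they live
    ...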

Asset groups show up in the left navigation pane in Dagit, and you can click on one to see the graph of all assets in the group.

Asset Config

Software-defined assets now support configuration. This allows for the same flexibility in parameterizing assets that you have with regular ops. For example, an asset representing an intermediate result in financial modeling might use an interest rate parameter in its defining computation. Previously, this would have been awkward to represent: you would have had to hardcode a single rate or place the parameterized rate in an upstream asset. With asset configuration, you can now define the rate parameter in the asset’s config_schema and specify its value as runtime config.

To understand how asset configuration works, it’s useful to understand a broader point about Dagster’s architecture. The asset layer sits atop the op layer. The computation of an asset is handled by a wrapped op, so an asset’s compute function is simply its corresponding op’s compute function. Consequently, asset config is implemented by forwarding an asset’s config_schema to its underlying op. Runtime config for the asset is provided by directly specifying the config for the op. Config is accessed in the asset’s compute function under context.op_config:

from dagster import asset, materialize

@asset(config_schema={"interest_rate": float})
def asset_with_config(context):
    # some_financial_calculation stands in for your own computation
    return some_financial_calculation(rate=context.op_config["interest_rate"])

materialize(
    [asset_with_config],
    run_config={"ops": {"asset_with_config": {"config": {"interest_rate": 0.05}}}},
)

Dagit also has a few modifications to support asset config. For assets with a config_schema, the schema is rendered in both the sidebar of the Assets Explorer and in the Asset Details page.

Further, when selecting assets for materialization, if at least one selected asset has a defined config schema, you’ll be presented with a modal Launchpad interface when you click “Materialize”. This allows you to provide config values before launching the run that materializes your assets.

Reusing graphs and ops across assets / Graph-backed assets

As of 0.14.0, the computation for a software-defined asset had to live entirely in the function decorated with @asset or @multi_asset. For assets requiring complex computation, this led to a couple of shortcomings: the computation was forced to exist in one contiguous block, and it couldn’t be reused by other assets or jobs.

0.15.0 unifies software-defined assets with fundamental Dagster building blocks, enabling you to construct software-defined assets using ops and graphs. This means you can split your asset computation into separate ops and reuse those ops across different assets and jobs. You can create asset definitions from graphs as follows:

from dagster import AssetsDefinition, GraphIn, GraphOut, graph

@graph(
    ins={"new_user_signups": GraphIn()},
    out={"signups_today": GraphOut(), "num_signups_today": GraphOut()},
)
def users_filtered_by_date(new_user_signups):
    signups_today = filter_for_date(new_user_signups)
    return signups_today, num_users(signups_today)

asset_def = AssetsDefinition.from_graph(users_filtered_by_date)

AssetsDefinition.from_graph accepts a graph and infers input and output assets from the decorated graph function. In this case, users_filtered_by_date accepts new_user_signups as its input asset and outputs two assets, signups_today and num_signups_today.
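
The filter_for_date and num_users ops referenced above aren’t defined in this post. As a rough sketch with placeholder logic, they might look something like this, and because they’re ordinary ops, they can be reused in other graphs and jobs:

from dagster import op

@op
def filter_for_date(new_user_signups):
    # placeholder logic: keep only the signups marked as happening today
    return [signup for signup in new_user_signups if signup["is_today"]]

@op
def num_users(signups_today):
    # placeholder logic: count the filtered signups
    return len(signups_today)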

A revamped dagster-dbt integration

Dagster is often responsible for kicking off a computation in an external tool such as Fivetran, Airbyte, or dbt, which may be responsible for updating several assets. A tool like dbt not only updates many assets at once, but also tracks a complex web of dependencies between these assets.

In 0.14.0, Dagster introduced the ability to load the models in a dbt project as a set of Dagster assets. This feature allows you to view and interact with your dbt graph inside the broader context of its upstream and downstream dependencies.

These dbt assets are all backed by a single dbt run operation, meaning that the underlying set of steps that Dagster needs to execute is quite simple, even if the set of assets that need to be updated is complex.

This initial implementation of the integration was useful, but had some limitations. Concretely, it was still impossible to kick off computation that included just a subset of the dbt models in the project; if you wanted to rematerialize a single dbt asset from Dagit, or run a subset of your dbt models on a schedule, you were mostly out of luck.

For 0.15.0, we’ve reworked a set of internal abstractions to allow Dagster to interact more seamlessly with tools like dbt. The features enabled by these changes include:

  • Rematerializing arbitrary sets of dbt models from Dagit
    • You can now select a few models through the UI and kick off a run to rematerialize just those models. No need to type out a selection string!
  • Scheduling jobs that run subsets of a dbt graph
    • Previously, you had to invoke load_assets_from_dbt_project multiple times (with a different select parameter) if you wanted to create a job that did anything other than run your entire dbt project. Now, you can load all your dbt models at once and select among them when building your jobs (see the sketch after this list).
    • Support for tag-based selection is coming soon!
  • Handling dependencies that go from dbt, to Python, then back to dbt
    • For these cases, Dagster will automatically figure out how to split up the dbt project so that everything runs in the proper order.
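
As a rough sketch of what the job-subsetting workflow can look like in code: the project path and model names below are made up, and define_asset_job is one way to build the subsetted job.

from dagster import define_asset_job, repository
from dagster_dbt import load_assets_from_dbt_project

# load every model in the dbt project as a software-defined asset
dbt_assets = load_assets_from_dbt_project(project_dir="path/to/my_dbt_project")

# build a job that materializes only a subset of those models
orders_job = define_asset_job(name="orders_models_job", selection=["orders", "orders_cleaned"])

@repository
def my_repo():
    return [*dbt_assets, orders_job]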

You can read more about these features in this Orchestrating Python and dbt with Dagster blog post.

Note: the screenshots in this section were taken from the modern_data_stack_assets example on GitHub.

Core API

input_values

Dagster jobs can have top-level inputs defined, which can be passed to constituent ops and graphs within the job.

from dagster import job

# some_op is an op defined elsewhere in your project
@job
def the_job(x):
    some_op(x)

Now, when using execute_in_process, you can provide a dictionary of values directly to supply the job’s top-level inputs. This makes it possible to use top-level inputs with Dagster types that can’t be specified via config, such as dataframes.

the_job.execute_in_process(input_values={"x": 5})

Input values can also be provided when converting a graph into a job using to_job:

from dagster import graph

# some_op is an op defined elsewhere in your project
@graph
def the_graph(x):
    some_op(x)

the_graph.to_job(input_values={"x": 5})

Generic Outputs

Dagster’s Output and DynamicOutput types can now be directly returned from ops and used as generic type annotations. This lets you make full use of type annotations while still providing output metadata to Dagster.

With Output:

from dagster import Output, op

@op
def op_returns_int() -> Output[int]:
    return Output(5)

With DynamicOutput:

from typing import List

from dagster import DynamicOut, DynamicOutput, op

@op
def op_returns_int() -> List[DynamicOutput[int]]:
    return [DynamicOutput(5, mapping_key="1"), DynamicOutput(2, mapping_key="2")]

To learn more, check out the docs on Outputs and Dynamic Outputs.

Run retries

Make your jobs more robust to transient failure: you can now set policies to automatically retry job runs. This is analogous to op-level retries, but at the job level. By default, retries pick up from failure, meaning only failed ops and their dependents are re-executed.
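
As a rough sketch, the deployment-wide setting lives in your instance’s dagster.yaml (the retry count below is just an example):

run_retries:
  enabled: true
  max_retries: 3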

To learn more, check out the docs.

Metadata on jobs

Users can now attach metadata to their jobs and see that metadata displayed in Dagit. Metadata provides a flexible system for specifying information about the job definition itself. Teams might find it particularly useful for tracking the owner of a job or linking to external documentation.

from dagster import MetadataValue, job

@job(
    metadata={
        "owner": "data team",
        "link": MetadataValue.url(url="https://dagster.io"),
    },
)
def with_metadata():
    my_op()  # my_op is an op defined elsewhere

What’s new in Dagit

Redesigned left navigation pane

The left nav in Dagit is now grouped by repository so it’s easier to organize and find your jobs and asset groups.

Asset catalog

Assets in the asset catalog now include the latest materialization and run information so you don’t have to click into each one to view its status.

For software-defined assets, the Asset Details page now includes a lineage graph, so you can quickly inspect upstream or downstream dependencies. The asset graph itself has also received a number of visual improvements to feel less overwhelming.



New backfills and partitions experience

Generally speaking, there are several benefits to using partitioning when designing a data pipeline:

  • Improved performance: By partitioning data assets, you can distribute them across multiple nodes in a cluster, allowing for parallel processing and faster performance.
  • Better scalability: Partitioning allows you to easily add or remove nodes from the cluster as your data volume increases or decreases, making it easier to scale your data pipeline.
  • Improved resiliency: Partitioning helps to distribute the data and workload across multiple nodes, reducing the risk of a single point of failure and improving the overall resiliency of the data pipeline.
  • Improved maintainability: By dividing the data and workload into smaller, more manageable pieces, partitioning can make it easier to maintain and troubleshoot the data pipeline.
  • Better data organization: Partitioning can help to organize and structure the data in a way that makes it easier to query and access, improving the overall usability of the data.
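
In Dagster, you opt into partitioning by attaching a partitions definition to your assets or jobs. Here’s a minimal sketch of a daily-partitioned asset, with an arbitrary asset name and start date:

from dagster import DailyPartitionsDefinition, asset

@asset(partitions_def=DailyPartitionsDefinition(start_date="2022-01-01"))
def daily_signups(context):
    # each run materializes a single day's partition
    date = context.partition_key  # e.g. "2022-06-15"
    ...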

The partitions and backfills experience in Dagit has been redesigned to be faster and more intuitive.

The partitions page now displays a status bar at the top of the page so you can quickly view the status across all your partitions for each job and catch missing or failed partitions right away. Click anywhere within the status bar to expand the per-step status breakdown.

When you’re ready to launch a backfill, you can now select all the failed or missing partitions with just a few clicks. Lastly, the partitions page now includes a history of all previous backfills you’ve run, with clearer status indicators so you can quickly catch and debug issues.



Working with configuration

In the launchpad, you can now preview any default configuration by hovering over elements in the configuration schema panel. It’s also easier to reuse the configuration from a previous run by clicking the new “Open in launchpad” button on the run page, and to quickly scaffold missing configuration or remove unnecessary configuration within the launchpad.

Bulk actions on runs

You can now multi-select and bulk re-execute or terminate runs from the runs page in Dagit.

Wrapping up

For more details on this release, check out the change log, release notes, and migration guide.

A special thanks to everyone in the community who contributed to this release: @chasleslr, @fahadkh, @aroig, @dwinston, @3cham, @kervel, @peay, @bollwyvl, @trevenrawr, @Javier162380, @jrouly, @dehume, @ascrookes, @kstennettlull, @swotai, @HAMZA310, @kbd, @LeoHuckvale, @joe-hdai, @antquinonez5, @abkfenris, @frcode, @dwallace0723, @proteusiq, @kahnwong, and @iswariyam.

Stay cool!

And with that, we’ll let Demi Lovato sing you out. Enjoy! (Curious how we ended up naming our release after this song? Check out this Release Naming GitHub Discussion.)



We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!
