February 28, 2023 • 3 minute read
Dagster Integrations Update
By Rex Ledesma (@_rexledesma)
Integrations and the Modern Data Stack
The Modern Data Stack (MDS) has brought many benefits to data engineering, allowing teams to rapidly build bespoke platforms with either open-source solutions or hosted Cloud services.
But the Modern Data Stack is hardly plug-and-play. If you select eight amazing stand-alone components, your team must stitch all those solutions together.
This is especially true for the orchestration layer, which must interact with many different technologies and run across diverse compute environments.
This is why the Dagster Labs team has been putting a lot of energy into building MDS integrations available to Dagster users. These integrations will save your team time, allowing you to tap into more of the value of your chosen solutions.
We have shipped 47 integrations and are adding many more, in close collaboration with our technology partners and community members.
Dagster integrations span a range of applications, including Alerting, Data Quality, Deployment, Ingestion, ML Ops, Monitoring, Data Science, Secret Management, Storage, and Transformation.
What’s new in integration land?
All MDS solutions embrace CI/CD, meaning that new versions roll out quickly. The Dagster Labs team ships new features and bug fixes for Dagster every week. As other solutions do the same, the integrations team has to keep pace on two fronts: updating the integrations to take full advantage of new Dagster features, and staying compatible with the latest versions of each third-party solution.
Here are some integration highlights from releases since the beginning of this year:
IO Manager updates
Many Dagster integrations feature dedicated IO Managers. They are particularly useful for storage technologies like data warehouses.
IO (Input/Output) Managers are a key concept in the Dagster framework. As assets are materialized, or ops are executed, the results are written to persistent storage. Where and how they get stored is handled by the IO Manager, making it easy to separate code that's responsible for logical data transformation from code that's responsible for reading and writing the results. IO Managers are explained in more detail here.
Here are some recent updates to our integrations' IO Managers:
We have updated the Snowflake, DuckDB, and BigQuery integrations, adding partition support to the IO Managers:
- The Snowflake IO Manager now supports PySpark DataFrames, in addition to the existing Pandas support. The Snowflake IO Manager and Snowflake Resource now support private key authentication in addition to username and password authentication.
- The DuckDB IO Manager now supports Polars DataFrames, in addition to existing Pandas and PySpark support.
- The new BigQuery IO Manager has been released and currently supports Pandas DataFrames, with PySpark support coming soon!
- The Snowflake, BigQuery, and DuckDB IO Managers have all been updated to support static partitions, time window partitions, multi-partitions, and (coming soon) dynamic partitions.
We’ve also made several ergonomic improvements to the above IO Managers. For example, these IO Managers now create missing schemas and, in many cases, can automatically infer the types of your inputs and outputs without type hints.
And, of course, you can always define your own IO Manager to fit your needs.
Airflow migration
We offer a dagster-airflow library that helps teams migrate off Apache Airflow and onto Dagster.
This Airflow integration has received major enhancements in the last few weeks. It now supports a local, ephemeral Airflow database for use in migrated Airflow DAGs, and a more fine-grained composition of transformed Airflow DAGs into Dagster. Retry policies are now converted over from Airflow DAGs, and we have added new parameters for configuring the Airflow connections used by migrated DAGs.
On February 9th, we hosted a special Airflow migration event, and we have recordings of the sessions that detail how to successfully migrate off Apache Airflow.
Enhanced Databricks integration
Many Dagster projects integrate Spark jobs, and Databricks is a platform of choice. We recently worked with the Databricks team to enhance the dagster-databricks integration. For instance, we added support for triggering Databricks Jobs as ops and encoding them as Software-defined Assets.
New ML Ops integration (Weights & Biases)
We collaborated with Weights & Biases (W&B) on a new integration that lets you orchestrate MLOps pipelines and maintain ML assets with Dagster.
With this new Weights & Biases integration, Dagster developers can easily:
- Use and create W&B Artifacts.
- Use and create Registered Models in W&B Model Registry.
- Run training jobs on dedicated compute using W&B Launch.
- Use the W&B client in ops and assets.
Explore Dagster’s integrations and speed up your work
We encourage you to explore the Dagster integrations and keep an eye on our release notes as we continue to improve and expand the list of integrations available. And if you are interested in building out a new integration for Dagster, get in touch. We would love to discuss the project with you.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!