August 14, 2024 • 5 minute read •
Combining Dagster and SDF: The Post-Modern Data Stack for End-to-End Data Platforms
By TéJaun RiChard (@tejaun)
The coordination between data orchestration and transformation is essential to data operations. While orchestration (through tools like Dagster) focuses on scheduling, monitoring, and executing data pipelines to ensure timely operations, transformation (through tools like SDF – the Semantic Data Fabric) deals with manipulating and refining data to extract actionable insights through processes like cleaning, aggregating, and modeling. Together, these elements are foundational to engineering scalable and efficient data pipelines.
However, many organizations face challenges due to a lack of coordination between these two layers, with traditional orchestrators having no visibility into what’s happening in transformation tools. This misalignment can lead to fragmented pipelines where the seamless handoff from orchestration to transformation is compromised or obscure (due to the black box nature of traditional solutions), resulting in inefficiencies, lack of visibility, inconsistent data quality, and ultimately, increased costs and operational complexity.
Combining Dagster with SDF lets companies harness the strengths of both platforms and supports orchestration and transformation needs through a unified system that streamlines operations, provides transparency, and ensures scalable, high quality data pipelines.
In this blog post, we’ll explore how these platforms complement each other and the benefits they bring to data practitioners.
Data Orchestration with Dagster
Dagster is a powerful orchestration platform that, by design, brings structure and efficiency to data flows. Dagster lets you define and manage data assets, capture the insights needed to resolve issues, and observe pipeline performance. Dagster also handles complex pipelines in various execution environments while playing nice with other tools in the data stack.
A key strength of Dagster is its native understanding of the underlying transformation graph, which enables fine-grained control, deep observability, and the ability to model cross-technology lineage between the transformation layer and other systems.
With Dagster in the equation, you get a transparent orchestration layer that lets you introspect and manage the underlying compute graph, which ultimately means more coherent pipelines, enhanced visibility, and greater efficiency.
Data Transformation with SDF
SDF is a transformation layer that understands your SQL. It simplifies testing, reporting, and debugging. It complements Dagster with capabilities in model validation, linting, and static analysis, ensuring data integrity and consistency while producing cleaner, more maintainable code. SDF also optimizes efficiency in data transformations.
While dbt has been instrumental in mapping out and running transformation models in data stacks, SDF offers capabilities that go further. SDF deeply understands proprietary dialects of SQL and executes them directly with its built-in engine, eliminating the need for transpilation. This fundamental understanding of SQL and its execution environment helps streamline data processes, reducing unnecessary compute costs and minimizing pipeline delays.
SDF’s approach leads to more efficient data transformations, better error detection, and a clearer understanding of column-level lineage, all while ensuring that your SQL transformations are both optimized and reliable.
Tip: To understand SDF’s approach to data transformation versus the approach of dbt, check out this blog post by SDF.
Combining The Two
It’s easy to integrate SDF projects into Dagster with the `@sdf_assets` decorator. Its ergonomics are nearly identical to our dbt integration, so it should be familiar to our users.
Here’s an example:
from dagster_sdf import SdfWorkspace, sdf_assets

@sdf_assets(
    workspace=SdfWorkspace(
        workspace_dir=moms_flower_shop_path,
        target_dir=moms_flower_shop_target_dir,
    )
)
def my_flower_shop_assets(): ...
Using this decorator enables you to define your SDF asset logic directly within Dagster, combining our data orchestration with SDF’s data transformation.
Exploring the Combination
When combined, the tools offer several key benefits for orchestration and transformation:
- Unified Pipeline Management
- Enhanced Data Quality and Reliable Deployments
- Cost Savings
- Metadata Management
- Improved Performance and Developer Experience
Unified Pipeline Management
Combining Dagster and SDF creates a cohesive pipeline management system.
Dagster handles the task scheduling and execution monitoring, while SDF takes care of things like validation, cleaning, and aggregation. As a result, the process from ingestion to transformation to data warehouse is painless and perfectly synchronized.
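Conceptually, the orchestration half of this handoff boils down to running transformation steps in dependency order. Here’s a minimal, purely illustrative sketch using Python’s standard library (the step names are hypothetical and stand in for the asset graph Dagster manages for you):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "clean_orders": {"raw_orders"},
    "aggregate_daily": {"clean_orders"},
    "load_warehouse": {"aggregate_daily"},
}

# An orchestrator schedules steps so dependencies always run before dependents.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['raw_orders', 'clean_orders', 'aggregate_daily', 'load_warehouse']
```

Dagster does far more than this (retries, partitioning, observability), but the dependency-ordered execution above is the core contract that keeps ingestion, transformation, and loading in sync.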
Enhanced Data Quality and Reliable Deployments
SDF’s static impact analysis and model compilation ensure that pipelines are validated before deployment, reducing the risk of errors. This process evaluates the impact of changes and compiles models to verify their correctness. For example, SDF allows users to catch errors like mistyped column names nearly instantaneously prior to execution in the data warehouse, saving both time and money.
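The essence of that check is simple: compare the columns a query references against the known schema before anything runs in the warehouse. Here’s a toy sketch of the idea in pure Python (the schema and column names are hypothetical; SDF performs this kind of analysis directly on your SQL):

```python
# Hypothetical warehouse schema: table name -> set of known columns.
SCHEMA = {"orders": {"order_id", "customer_id", "amount"}}

def check_columns(table: str, referenced: set) -> set:
    """Return any referenced columns that don't exist in the table's schema."""
    return referenced - SCHEMA.get(table, set())

# A mistyped column ('amout') is flagged immediately, before any warehouse run.
print(check_columns("orders", {"order_id", "amout"}))  # {'amout'}
```

Catching this class of error statically, rather than after minutes of warehouse compute, is where the time and cost savings come from.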
Dagster enhances this by orchestrating and monitoring the pipelines, offering real-time visibility and control. The testing-first approach in Dagster enables the identification and resolution of issues before they affect production, ensuring reliable and consistent deployments.
Using them together means you’ll catch potential errors early in the development cycle, improving overall pipeline efficiency.
Cost Savings
SDF's advanced static analysis and local execution capabilities enable extensive testing and debugging of pipelines locally, minimizing the need for cloud resources.
The reduced dependency on cloud resources during development and testing can significantly lower cloud compute costs. Running pipelines and transformations locally lets organizations use cloud resources only when necessary, keeping these expenses in check.
This approach ensures that your entire ETL process is efficient and cost-effective.
Metadata Management
Dagster’s support for metadata management, combined with SDF’s static analysis, offers valuable insights into column-level data lineage, dependencies, and performance metrics, ensuring a clear understanding of your data pipeline and making troubleshooting and performance improvements easier.
As a result, you can precisely track data flows and transformations, identify bottlenecks, and enhance overall metadata management for more informed decision-making.
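To make column-level lineage concrete, here’s a toy sketch (all column names hypothetical) of the kind of question it answers: given an output column, which upstream columns feed it, directly or transitively? SDF derives this map from your SQL; the example below just walks one by hand:

```python
# Hypothetical lineage map: output column -> its immediate upstream columns.
LINEAGE = {
    "daily_revenue.total": {"clean_orders.amount"},
    "clean_orders.amount": {"raw_orders.amount"},
}

def upstream_closure(column: str, lineage: dict) -> set:
    """Walk the lineage map to collect every column a given column depends on."""
    seen, stack = set(), [column]
    while stack:
        for parent in lineage.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(upstream_closure("daily_revenue.total", LINEAGE)))
# ['clean_orders.amount', 'raw_orders.amount']
```

Answering “what breaks downstream if this column changes?” is exactly this traversal run in reverse, which is why column-level lineage is so useful for impact analysis.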
Improved Developer Experience and Performance
End-to-end testing and debugging of pipelines locally lets developers not only reduce their reliance on cloud compute vendors but also validate pipelines effectively.
Dagster lets developers define and manage data assets transparently, capturing insights that help with issue resolution and observing pipeline performance in various execution environments. SDF complements this by speeding up testing and debugging and by helping developers understand the effects of changes to their SQL queries across the data pipeline, so they can make informed decisions and avoid unintended consequences.
Conclusion
Combining Dagster and SDF offers a unified framework that excels in both orchestration and transformation, enabling data practitioners to create transparent, scalable, reliable, and cost-effective data pipelines.
Key Benefits Recap
- Unified Pipeline Management: Streamlined operations using Dagster’s data orchestration and SDF’s data transformation.
- Enhanced Data Quality: Error-free pipelines that have been rigorously tested and validated to ensure high data quality.
- Cost Savings: Reduced operational costs by minimizing reliance on cloud compute resources during development and testing.
- Metadata Management: Valuable insights into data lineage, dependencies, and performance metrics.
- Improved Developer Experience: Efficient testing, debugging, and enhanced visibility throughout your data pipelines.
Ready to Get Started?
Take your data pipelines to the next level by combining the two technologies and seeing your data orchestration and transformation processes improve firsthand.
Get started by checking out the Dagster docs or SDF’s Getting Started with Dagster and SDF Guide, where you’ll be able to install the dagster-sdf integration, set up a Dagster project, and try an example project to experience firsthand how this combination can enhance your data orchestration and transformation processes.
Additionally, you can check out this video of a deep dive we held, where Pedram, our Head of DevRel, discussed with SDF Labs CEO and co-founder Lukas Schulte how Dagster and SDF can connect local development and production orchestration.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!