August 30, 2024 • 5 minute read
Dagster Deep Dive Recap: Evolution of the Data Platform
- Name: TéJaun RiChard
- Handle: @tejaun
Data engineers often face a disconnect between local development (offline) and production orchestration (online). This gap leads to slower iteration, increased costs, and potential errors that only surface during runtime.
In our recent Dagster Deep Dive, led by our very own Pedram Navid, we explored the integration between Dagster and SDF (Semantic Data Framework) alongside SDF Labs co-founder and CEO Lukas Schulte, showcasing how you can use the combined strengths of both to enhance your data operations.
If you missed the live event, we’ve embedded the on-demand video below so you can watch it here.
Key Points
Throughout the dive, we explored a few key points and heard expert insights on current challenges and what needs to happen to remove those obstacles:
Challenges in Current SQL and Developer Tooling
We began by discussing the limitations that come with different SQL dialects and the inefficiencies in current pipelines, highlighting the need for more consistent and efficient developer tools in data engineering.
Introduction to SDF
Building on this foundation, we introduced SDF as a transformation layer with deep SQL understanding, capable of enhancing data orchestration through improved metadata handling and addressing many of the challenges identified earlier.
Combining SDF and Dagster
The core of our discussion focused on the integration of SDF with Dagster. We showed how you can use the two to create better data pipelines and reduce operational costs through enhanced metadata handling and optimized execution paths.
Enhancing Efficiency and Quality
To illustrate the practical benefits of the integration, we talked about SDF’s role as a true SQL compiler. This capability enables local development and execution without relying on a data warehouse, which makes development more efficient and helps produce higher-quality data.
Demo Highlights
To bring these concepts to life, Pedram and the SDF team demoed the integration in action, walking through steps such as:
- Setting up an SDF workspace with classifiers and SQL models
- Scaffolding a Dagster project integrated with SDF
- Materializing assets using SDF's execution engine
- Demonstrating intelligent caching for faster subsequent runs
- Handling and catching errors at compile-time, before reaching the orchestration layer
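The last item is easier to picture with a toy example. The sketch below is not SDF's API or implementation. It is a minimal illustration, with hypothetical names (`Model`, `compile_models`), of checking table and column references against known schemas before anything is handed to an orchestrator:

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Model:
    """Hypothetical stand-in for a SQL model: one source table, a few columns."""
    name: str
    source: str
    columns: Tuple[str, ...]


def compile_models(models: List[Model], schemas: Dict[str, Set[str]]) -> List[str]:
    """Check every model's references; an empty result means all of them resolved.

    Models are assumed to be listed in dependency order (upstream first).
    """
    known = dict(schemas)  # copy so the caller's schemas are not mutated
    errors = []
    for model in models:
        if model.source not in known:
            errors.append(f"{model.name}: unknown table '{model.source}'")
            continue
        missing = [c for c in model.columns if c not in known[model.source]]
        if missing:
            errors.append(
                f"{model.name}: unknown column(s) {missing} in '{model.source}'"
            )
            continue
        # A model that resolves becomes a table that later models can read.
        known[model.name] = set(model.columns)
    return errors


schemas = {"raw_orders": {"id", "amount"}}
models = [
    Model("orders", "raw_orders", ("id", "amount")),
    Model("revenue", "orders", ("amount", "region")),  # 'region' does not exist
]
errors = compile_models(models, schemas)
```

Here `errors` contains a single entry for `revenue`, reported before any run is scheduled. A real SQL compiler does far more (parsing, type checking, dialect handling), but the ordering is the point: problems surface while you are still editing, not at runtime.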
Poll Summary
We also conducted two polls during the dive to better understand the challenges our attendees face with their data pipelines and the features they believe would make the most significant impact on their processes.
Here’s a breakdown of the insights we gathered.
Common Challenges in Data Pipelines
The first poll asked participants to identify the biggest challenge they currently face with their data pipelines. The results highlighted several key areas of concern:
- Inconsistent Data Quality emerged as the most significant challenge, with 42.1% of respondents citing it as their primary issue. This reflects a common pain point in the industry, where ensuring data integrity across various pipelines and sources remains a persistent struggle.
- Increased Costs and Operational Complexity followed closely, with 31.6% of attendees indicating this as their main challenge. As data operations scale, so too do the associated costs and complexities, making this a crucial area for optimization.
- Fragmented Pipelines were identified by 21.1% of respondents as a major challenge. Fragmentation can lead to inefficiencies and bottlenecks, underscoring the need for more cohesive and integrated data pipeline solutions.
- Slow Iteration due to Online Data Development was the least cited issue, with 5.3% of participants noting it as a challenge. While less common, this issue can still significantly impact productivity, particularly in environments that rely heavily on rapid iteration cycles.
Desired Features for Improved Pipelines
The second poll focused on the features presented during the session and which ones attendees believed would have the most significant impact on their pipelines. The results revealed a diverse range of preferences:
- Local SQL Development without a Data Warehouse was the top choice, with 34.4% of respondents believing it would be the most impactful feature. This preference highlights the growing demand for more flexible development environments that don’t require full reliance on data warehouses.
- Precise Column-Level Lineage and Fast SQL Feedback Loops were both highly favored, each receiving 28.1% of the votes. These features are crucial for improving the accuracy and speed of data operations, offering more transparency and quicker iterations.
- SQL Validation Across Dialects was seen as the least impactful, with 9.4% of respondents selecting it. While still important, this feature might be seen as more of a specialized need compared to the broader appeal of the other options.
Key Takeaways
As we wrapped up the deep dive, several key takeaways emerged:
- The Dagster/SDF integration aligns your transformation and orchestration layers.
- Developers can build SQL locally with more confidence that it will execute successfully in production.
- Dagster gains rich metadata from SDF, enhancing observability and control.
- The combined solution offers significant performance improvements and cost reductions.
Q&A
The dive concluded with an engaging Q&A session, where we got to address important questions from our audience.
Here are those questions and their answers:
- How would you suggest integrating or migrating from dbt?
  - SDF has a nice tutorial on how to migrate from a dbt project. It also outlines the incompatibilities between the two nicely. You can find it here.
- Does SDF support Python models in addition to SQL models (like SQLMesh)?
  - At the moment, only SQL-based models are supported.
- Is SDF compliant with Databricks?
  - Not yet, but it’s on their roadmap!
- What is the added value for local development coming from SDF + DataFusion compared to e.g. SQLMesh + DuckDB? Does SDF use SQLGlot under the hood?
  - SDF operates at a logical plan layer, providing type bindings and schema-level type validation. This allows for SQL compilation rather than translation, offering more robust validation and execution capabilities.
- It seems SDF is offering features from SQLGlot / SQLMesh as well as dbt. What does this mean for potential future integrations with SQLMesh models? Will Dagster support all three approaches eventually?
  - While there are similarities, SDF's approach focuses on compilation rather than translation. Dagster's future integrations will likely depend on user needs and the evolving data tooling landscape.
- When you change one thing it propagates downstream. But are you doing a full refresh? How does it work when all your processes are incremental loads?
  - SDF's caching layer keeps track of state changes. Only affected parts of the DAG are invalidated and rebuilt, supporting both full refreshes and incremental loads efficiently.
- If everything is running locally, do you need local sample data?
  - Local development can use small seed files or Parquet files. For larger datasets, SDF can authenticate with object stores like S3 to pull data for transformations.
- You're not caching tables, right? Otherwise your Snowflake table would need to fit inside your Dagster container.
  - Correct. SDF connects to the cloud database (e.g., Snowflake) to pull current schemas for root tables, ensuring synchronization between cloud and local code without caching entire tables locally.
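The caching answer above, where only the affected parts of the DAG are invalidated and rebuilt, can be sketched in a few lines. This is an illustration of the general idea rather than SDF's caching layer. The function name `stale_nodes` and the example table names are hypothetical:

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def stale_nodes(downstream: Dict[str, List[str]], changed: Iterable[str]) -> Set[str]:
    """Return the changed nodes plus everything reachable downstream of them."""
    changed = list(changed)
    stale = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, ()):
            if child not in stale:
                stale.add(child)
                queue.append(child)
    return stale


# Each key maps a node to the nodes that read from it.
dag = {
    "raw_orders": ["orders"],
    "raw_customers": ["customers"],
    "orders": ["revenue"],
    "customers": ["revenue"],
    "revenue": [],
}

to_rebuild = stale_nodes(dag, ["raw_orders"])
```

In this toy graph, a change to `raw_orders` marks `raw_orders`, `orders`, and `revenue` for a rebuild, while `raw_customers` and `customers` can be served from cache.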
Conclusion
This Dagster Deep Dive highlighted a shift in data engineering practices. Closing the gap between local development and production orchestration addresses some of the most pressing challenges data engineers face today.
The Dagster-SDF integration presents one way to do that, letting you streamline your pipelines and improve their quality and reliability. Adopting solutions like this can help you overcome fragmented pipelines, inconsistent data quality, and the disconnect between environments, improving the efficiency of your data operations and paving the way for scalable, transparent, and reliable data infrastructure.
Be sure to watch the on-demand video of this webinar that we’ve embedded to explore the combined strengths of the two. You can also read our blog post on the integration or see SDF’s take on it for more info.
Additionally, stay tuned for future deep dives so that you can continue to gain valuable insights on how to stay on top of your data management.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!