September 6, 2024 • 4 minute read
Dagster Deep Dive Recap: Building a True Data Platform
By TéJaun RiChard (@tejaun)
As data engineering evolves, companies are running into problems with the modern data stack: observability, orchestration, and cost. To address these problems, our recent Dagster Deep Dive, led by [Pedram Navid](https://www.linkedin.com/in/pedramnavid/), explored how to build a true data platform that goes beyond the modern data stack.
If you missed the live event, don’t worry: the full recording is embedded below.
Highlights
We covered the following during the deep dive:
The Unmet Promise of the Modern Data Stack
While the modern data stack has improved upon previous tools, it has also introduced new problems:
- Fragmented observability across multiple tools
- Limited orchestration beyond basic scheduling
- High cost and vendor lock-in
The Data Platform Engineer
We introduced the data platform engineer as the role that’s been created to solve these problems.
This role is about managing complex data infrastructure, with engineers building platforms that serve the needs of stakeholders. They enable consumers to build pipelines without having to learn complex languages, making data more accessible to a broader set of users.
And they’re moving from building individual pipelines to building frameworks and services that support the entire data ecosystem within an organization.
This marks a significant evolution in data engineering.
What Makes a Good Data Platform?
We discussed how a true data platform forms the foundation of modern data-driven companies, and what characteristics a platform should have to meet the changing needs of companies and their data teams:
- Scalability and Maintainability: It should grow with your company’s data maturity
- High-Quality Governance: Data testing, fact assertion, alerting
- Data Observability and Insights: Stakeholders should be able to see the state of the data and dependencies
- Software Development Lifecycle Integration: Testing, version control, branching
- Support for Heterogeneous Use Cases: Different languages and tools
- Declarative Workflows: Flexible, easy to understand process definitions
Takeaways
Here's a recap of the main takeaways from the deep dive:
- The modern data stack is an improvement over legacy systems but has introduced new problems in data engineering: fragmented observability, limited orchestration, and high cost.
- The data platform engineer role has emerged to solve these problems.
- A good data platform should be scalable and maintainable, with high-quality governance and built-in data observability and insights.
- A data platform should also have:
  - Software development lifecycle integration
  - Support for heterogeneous use cases (different languages and tools)
  - Declarative workflows
- Prefer code-based solutions over no-code or low-code tools for complex data engineering tasks.
- Dagster can help build such a unified data platform with features like code locations and asset checks for data quality and governance.
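To make that last point concrete, here is a minimal sketch of an asset check in Dagster. The asset and check names (`orders`, `orders_not_empty`) are hypothetical; the pattern is what matters: the check runs against the asset and reports a pass/fail result plus metadata as part of the platform's governance layer.

```python
from dagster import AssetCheckResult, Definitions, asset, asset_check

@asset
def orders():
    # Hypothetical asset; in practice this would load or transform real data.
    return [{"order_id": 1, "amount": 42.0}]

@asset_check(asset=orders)
def orders_not_empty(orders):
    # A simple governance check: record how many rows we saw and whether any exist.
    return AssetCheckResult(passed=len(orders) > 0, metadata={"row_count": len(orders)})

defs = Definitions(assets=[orders], asset_checks=[orders_not_empty])
```

Checks like this surface in the Dagster UI next to the asset they guard, which is how the governance and observability takeaways translate into day-to-day practice.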
Q&A
Near the end of the deep dive, Pedram answered questions from our audience. Here are those questions and answers:
- What are Dagster’s features for data governance, especially data access management?
- The roadmap for data access management features isn’t set, but we’re considering role-based access control (RBAC). It’s possible for future development, but the timeline is unknown.
- Are column-level lineage and data catalog in Dagster Plus or open source?
- Column-level lineage is a Dagster+ feature, not open source.
- How do you schedule assets across multiple code locations?
- Keep each code location as self-contained as possible. For dependencies between code locations, you can use an asset sensor to trigger the downstream pipeline (see the sketch after this Q&A). You can also look into declarative automation, which lets you specify that an asset should materialize at a certain cadence without getting too granular about the details.
- How do you move from an existing Airflow setup to Dagster?
- We’re working on a solution for this exact problem. Reach out on Slack and we’ll connect you with someone who can help.
- What’s Dagster’s view on machine learning processes or LLM Ops?
- We use Dagster internally for LLM Ops, treating it essentially as data ops. With LLMs, data quality and metadata become important. Dagster has built-in integrations like the OpenAI integration, so you can observe token consumption through Dagster Insights. You can also emit custom metadata to track metrics like token count over time or LLM response quality (see the sketch after this Q&A).
- Can I use Dagster Plus and then move to my own server with the open source option?
- Yes, that’s possible. But if you become dependent on Dagster+ only features, you’ll need to remove those when you move. The core code itself isn’t locked in.
- Can I access the data catalog programmatically and tag datasets for filtering?
- Tagging and filtering by tags in the catalog is supported. For programmatic access to the data catalog, ask in the Slack channel for more info.
- Can I use the Great Expectations plugin for data assets?
- Yes, you can use Great Expectations or Dagster’s built-in asset checks. It depends on whether you need the more complex features Great Expectations offers.
- What plugins or integrations would be helpful for a Financial Service Company?
- Financial service companies face similar data engineering challenges as other industries. Data replication is key, so tools like Airbyte for replicating data across databases to your data warehouse are useful. Remember, Dagster is Python-based, so you can use any Python library, even without a specific built-in integration.
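As a concrete illustration of the cross-code-location answer above, here is a minimal sketch of an asset sensor. The asset, job, and sensor names (`upstream_table`, `reporting_table`, and so on) are hypothetical; the idea is that one code location reacts when an asset owned by another code location is materialized.

```python
from dagster import (
    AssetKey,
    Definitions,
    EventLogEntry,
    RunRequest,
    SensorEvaluationContext,
    asset,
    asset_sensor,
    define_asset_job,
)

@asset
def reporting_table():
    # Downstream asset owned by this code location.
    ...

refresh_reporting_table = define_asset_job("refresh_reporting_table", selection="reporting_table")

@asset_sensor(asset_key=AssetKey("upstream_table"), job=refresh_reporting_table)
def upstream_table_sensor(context: SensorEvaluationContext, asset_event: EventLogEntry):
    # "upstream_table" lives in another code location; when it materializes,
    # kick off the job that refreshes the reporting table here.
    yield RunRequest(run_key=context.cursor)

defs = Definitions(
    assets=[reporting_table],
    jobs=[refresh_reporting_table],
    sensors=[upstream_table_sensor],
)
```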
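And for the LLM Ops answer, here is a minimal sketch of emitting custom metadata from an asset so that token usage shows up on the asset in the Dagster UI. The asset name and the `call_llm` helper are placeholders standing in for a real LLM client.

```python
from dagster import MaterializeResult, asset

def call_llm(prompt: str) -> tuple[str, int]:
    # Placeholder for a real LLM call; returns the response text and a token count.
    return f"summary of: {prompt}", len(prompt.split())

@asset
def article_summaries() -> MaterializeResult:
    prompts = ["first article text", "second article text"]
    results = [call_llm(p) for p in prompts]
    total_tokens = sum(tokens for _, tokens in results)
    # Custom metadata is recorded per materialization, so metrics like token
    # count can be tracked over time alongside the asset.
    return MaterializeResult(
        metadata={"token_count": total_tokens, "num_summaries": len(results)}
    )
```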
Conclusion
Building a true data platform is about more than adopting modern data stack tools. It requires thinking about scalability, governance, and observability. Dagster addresses those needs with features that tackle the challenges of today’s data engineering.
Watch the on-demand webinar above to learn how to build a data platform with Dagster, and stay tuned for more. You can also catch up on past deep dives to learn more about data platform engineering, Dagster, and data engineering best practices and tools.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!