September 6, 2024 • 4 minute read
Dagster Deep Dive Recap: Building a True Data Platform
By TéJaun RiChard (@tejaun)
As data engineering evolves, companies are running into problems with the modern data stack: observability, orchestration, and cost. To address these problems, our recent Dagster Deep Dive, led by [Pedram Navid](https://www.linkedin.com/in/pedramnavid/), explored how to build a true data platform that goes beyond the modern data stack.
If you missed the live event, don’t worry: the full recording is embedded below.
Highlights
We covered the following during the deep dive:
The Unmet Promise of the Modern Data Stack
While the modern data stack has improved upon previous tools, it has also introduced new problems:
- Fragmented observability across multiple tools
- Limited orchestration beyond basic scheduling
- High cost and vendor lock-in
The Data Platform Engineer
We introduced the data platform engineer as the role that’s been created to solve these problems.
This role is about managing complex data infrastructure, with engineers building platforms that serve the needs of stakeholders. They enable consumers to build pipelines without having to learn complex languages, making data more accessible to a broader set of users.
And they’re moving from building individual pipelines to building frameworks and services that support the entire data ecosystem within an organization.
This marks a significant evolution in data engineering.
What Makes a Good Data Platform?
We discussed how a true data platform forms the foundation of modern data-driven companies, and what characteristics a platform should have to meet the changing needs of companies and their data teams:
- Scalability and Maintainability: It should grow with your company’s data maturity
- High-Quality Governance: Data testing, fact assertion, alerting
- Data Observability and Insights: Stakeholders should be able to see the state of the data and dependencies
- Software Development Lifecycle Integration: Testing, version control, branching
- Support for Heterogeneous Use Cases: Different languages and tools
- Declarative Workflows: Flexible, easy to understand process definitions
Takeaways
Here's a recap of the main takeaways from the deep dive:
- The modern data stack is an improvement over legacy systems but has introduced new problems in data engineering: fragmented observability, limited orchestration, and high cost.
- The data platform engineer role has emerged to solve these problems.
- A good data platform should be scalable and maintainable, with high-quality governance and built-in data observability and insights.
- A data platform should also have:
  - Software development lifecycle integration
  - Support for heterogeneous use cases (different languages and tools)
  - Declarative workflows
- Prefer code-based solutions over no-code or low-code tools for complex data engineering tasks.
- Dagster can help build such a unified data platform with features like code locations and asset checks for data quality and governance.
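To make that last point concrete, here is a minimal sketch of an asset check in Dagster. The asset and check names (`orders`, `orders_not_empty`) are hypothetical; the pattern is what matters: the check runs against the asset and reports a pass/fail result plus metadata as part of the platform's governance layer.

```python
from dagster import AssetCheckResult, Definitions, asset, asset_check

@asset
def orders():
    # Hypothetical asset; in practice this would load or transform real data.
    return [{"order_id": 1, "amount": 42.0}]

@asset_check(asset=orders)
def orders_not_empty(orders):
    # A simple governance check: record how many rows we saw and whether any exist.
    return AssetCheckResult(passed=len(orders) > 0, metadata={"row_count": len(orders)})

defs = Definitions(assets=[orders], asset_checks=[orders_not_empty])
```

Checks like this surface in the Dagster UI next to the asset they guard, which is how the governance and observability takeaways translate into day-to-day practice.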
Q&A
Near the end of the deep dive, Pedram answered questions from our audience. Here are those questions and answers:
- What are Dagster’s features for data governance, especially data access management?
- The roadmap for data access management features isn’t set, but we’re considering role-based access control (RBAC). It’s possible for future development, but the timeline is unknown.
- Are column-level lineage and data catalog in Dagster Plus or open source?
- Column-level lineage is a Dagster+ feature, not open source.
- How do you schedule assets across multiple code locations?
- Keep each code location as self-contained as possible. For dependencies between code locations, you can use an asset sensor to trigger the downstream pipeline (see the sketch after this Q&A). You can also look into declarative automation, which lets you specify that an asset should materialize at a certain cadence without getting too granular about the details.
- How do you move from an existing Airflow setup to Dagster?
- We’re working on a solution for this exact problem. Reach out on Slack and we’ll connect you with someone who can help.
- What’s Dagster’s view on machine learning processes or LLM Ops?
- We use Dagster internally for LLM Ops, treating it essentially as data ops. With LLMs, data quality and metadata become important. Dagster has built-in integrations like the OpenAI integration, so you can observe token consumption through Dagster Insights. You can also emit custom metadata to track metrics like token count over time or LLM response quality (see the sketch after this Q&A).
- Can I use Dagster Plus and then move to my own server with the open source option?
- Yes, that’s possible. But if you become dependent on Dagster+ only features, you’ll need to remove those when you move. The core code itself isn’t locked in.
- Can I access the data catalog programmatically and tag datasets for filtering?
- Tagging and filtering by tags in the catalog is supported. For programmatic access to the data catalog, ask in the Slack channel for more info.
- Can I use the Great Expectations plugin for data assets?
- Yes, you can use Great Expectations or Dagster’s built-in asset checks. It depends on whether you need the more complex features Great Expectations offers.
- What plugins or integrations would be helpful for a Financial Service Company?
- Financial service companies face similar data engineering challenges as other industries. Data replication is key, so tools like Airbyte for replicating data across databases to your data warehouse are useful. Remember, Dagster is Python-based, so you can use any Python library, even without a specific built-in integration.
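As a concrete illustration of the cross-code-location answer above, here is a minimal sketch of an asset sensor. The asset, job, and sensor names (`upstream_table`, `reporting_table`, and so on) are hypothetical; the idea is that one code location reacts when an asset owned by another code location is materialized.

```python
from dagster import (
    AssetKey,
    Definitions,
    EventLogEntry,
    RunRequest,
    SensorEvaluationContext,
    asset,
    asset_sensor,
    define_asset_job,
)

@asset
def reporting_table():
    # Downstream asset owned by this code location.
    ...

refresh_reporting_table = define_asset_job("refresh_reporting_table", selection="reporting_table")

@asset_sensor(asset_key=AssetKey("upstream_table"), job=refresh_reporting_table)
def upstream_table_sensor(context: SensorEvaluationContext, asset_event: EventLogEntry):
    # "upstream_table" lives in another code location; when it materializes,
    # kick off the job that refreshes the reporting table here.
    yield RunRequest(run_key=context.cursor)

defs = Definitions(
    assets=[reporting_table],
    jobs=[refresh_reporting_table],
    sensors=[upstream_table_sensor],
)
```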
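And for the LLM Ops answer, here is a minimal sketch of emitting custom metadata from an asset so that token usage shows up on the asset in the Dagster UI. The asset name and the `call_llm` helper are placeholders standing in for a real LLM client.

```python
from dagster import MaterializeResult, asset

def call_llm(prompt: str) -> tuple[str, int]:
    # Placeholder for a real LLM call; returns the response text and a token count.
    return f"summary of: {prompt}", len(prompt.split())

@asset
def article_summaries() -> MaterializeResult:
    prompts = ["first article text", "second article text"]
    results = [call_llm(p) for p in prompts]
    total_tokens = sum(tokens for _, tokens in results)
    # Custom metadata is recorded per materialization, so metrics like token
    # count can be tracked over time alongside the asset.
    return MaterializeResult(
        metadata={"token_count": total_tokens, "num_summaries": len(results)}
    )
```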
Conclusion
Building a true data platform is about more than adopting modern data stack tools. It requires thinking about scalability, governance, and observability. Dagster addresses those needs with features that tackle the challenges of today’s data engineering.
Watch the on-demand webinar above to learn how to build a data platform with Dagster, and stay tuned for more. You can also catch up on past deep dives to learn more about data platform engineering, Dagster, and data engineering best practices and tools.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!