September 28, 2023 • 3 minute read
Escaping the Modern Data Trap
By Pete Hunt (@floydophone) and Nick Schrock (@schrockn)
Dagster Launch Week starts October 9th. Each day from October 9th to the 13th, we’ll be announcing a new feature, project, or capability that we think users will love.
The theme of Launch Week is “Escaping the Modern Data Trap.” We’ll get to what that means in a moment. But first, I want to recap how we got to this point and why we are doing a Launch Week in the first place.
We entered 2023 at an interesting stage in the company’s lifecycle. We had released Dagster 1.0 and Dagster Cloud about five months prior, and I was just a few weeks into my new gig as CEO.
On the one hand, we had recently become the fastest-growing data orchestrator and were exceeding our growth goals. On the other, we were kicking off our series B during the worst fundraising environment of my career (fortunately, this ended up working out).
These events triggered serious discussions about where we were going to take the product and the business over the next year. We talked to lots of users, and summarized our thinking earlier this year in the Dagster Master Plan.
Throughout the course of our discussions with data and machine learning engineers, negative feedback about the Modern Data Stack kept coming up. Internally, we started referring to this theme as the Modern Data Trap.
The Modern Data Trap
Our users are data engineers and data platform engineers at a wide variety of company sizes, industries, and levels of seniority.
When we talked to them, we heard a number of positive things about the Modern Data Stack:
- Raising a big tent. Tools like dbt™ Core brought new stakeholders into the software engineering process.
- Fewer home-grown tools. Companies stopped building their own version of tools like SQL templating, data catalogs, and SaaS ELT connectors in favor of pulling tools off the shelf.
- Cloud adoption. The Modern Data Stack accelerated the adoption of cloud technologies, which reduced fixed costs and improved developer happiness.
We also heard about many critical problems with the Modern Data Stack:
- Assumed homogeneity. Many tools in the Modern Data Stack assume you live in an idealized world where there is a single, modern data warehouse. In reality, most businesses have hundreds of different legacy data sources that must be integrated together.
- Too many disconnected tools. Everyone has seen the ridiculous market maps of the Modern Data Stack. Data teams are managing dozens, if not hundreds, of open-source, on-prem, and cloud tools, which adds enormous complexity to the stack.
- Too expensive. The explosion of tools comes with an explosion in SaaS fees. This is compounded by the challenging macroeconomic environment of the past two years.
- Inflexibility. Because so many Modern Data Stack tools jumped on the low- and no-code bandwagon, they are now very hard to customize. If your requirements slightly deviate from your tool’s view of the world, you may need to start from scratch.
- Ignorance of software engineering best practices. Again, low- and no-code tools try to avoid software engineering rather than embrace it. This means that software engineering best practices like testing, continuous delivery, and observability are not ubiquitously adopted.
Take, for example, data movement tools like Fivetran or Stitch. On the surface, they are simple: they move data from some external system into your data warehouse. While you’ve solved one problem, you’ve created several new ones:
- How do you transform the data? Now you need dbt Core (recently bundled with Fivetran), PySpark, or another transformation tool.
- What if you need to integrate with internal services, existing codebases, or another data warehouse? Now you need an orchestrator like Dagster.
- How do you ensure your data is correct? Now you need to adopt a data quality tool.
- How do you activate the data? Now you need a data activation or reverse ETL tool.
- How do you keep track of data that now lives in many different systems? Now you need a data catalog and governance suite.
This situation is OK if you are a small business, have simple requirements, and cannot afford to hire data engineers. However, as your needs get more demanding, Big Complexity rears its ugly head. We believe that software engineering best practices are the only way to tame Big Complexity.
Escaping the Trap: From Orchestrator to Data Control Plane
We realized that we could solve these problems, and more, not by delivering yet another standalone product, but by evolving the definition of what a “data orchestrator” is. Rather than simply a “service that schedules compute,” we believe the orchestrator should be a true control plane for the Modern Data Stack.
Specifically, we believe that a large segment of Modern Data Stack tools:
- Are overkill for data engineers. They are either expensive, expansive SaaS services or complex, multi-service OSS tools that include lots of heavyweight bells and whistles that many data engineers don’t need.
- Would benefit from deep orchestrator integration. Many tools in the Modern Data Stack either participate directly in the execution of data pipelines or rely on data where the orchestrator is the source of truth. Deep orchestrator integration can make these tools simpler, more robust, and more powerful.
Some customers truly need all the bells and whistles of a heavyweight external tool, and for those, we’ll continue to focus on high-quality integrations. However, we think that many users would be better served by lightweight features directly integrated into the orchestrator that are specifically targeted at data engineers.
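To make the control-plane idea concrete, here is a minimal sketch in plain Python. It is not Dagster's API: the `Step` class, the `run_pipeline` function, and the sample data are all hypothetical names invented for illustration. It only shows why a quality check gets simpler when it lives next to the step it guards, since the orchestrator already knows the execution order and the upstream dependencies.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


@dataclass
class Step:
    """Hypothetical pipeline step: a function plus the steps it depends on."""
    fn: Callable[..., object]
    deps: List[str] = field(default_factory=list)
    # Lightweight quality checks that run right after the step produces output.
    checks: List[Callable[[object], bool]] = field(default_factory=list)


def run_pipeline(steps: Dict[str, Step]) -> Dict[str, dict]:
    """Run steps in dependency order and record check results in one report."""
    graph = {name: step.deps for name, step in steps.items()}
    outputs: Dict[str, object] = {}
    report: Dict[str, dict] = {}
    for name in TopologicalSorter(graph).static_order():
        step = steps[name]
        # Pass each upstream output to the step, in the order deps are listed.
        outputs[name] = step.fn(*(outputs[d] for d in step.deps))
        report[name] = {
            "checks_passed": all(check(outputs[name]) for check in step.checks),
            "upstream": list(step.deps),
        }
    return report


# Illustrative data only: two toy steps, one with a check attached.
steps = {
    "raw_orders": Step(fn=lambda: [{"id": 1, "amount": 20}, {"id": 2, "amount": -5}]),
    "clean_orders": Step(
        fn=lambda rows: [r for r in rows if r["amount"] >= 0],
        deps=["raw_orders"],
        checks=[lambda rows: all(r["amount"] >= 0 for r in rows)],
    ),
}
report = run_pipeline(steps)
```

In this toy version, the check result and the lineage for each step land in the same report, with no separate service to deploy or keep in sync. A heavyweight external tool would have to rediscover that information on its own.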
Launch Week Agenda
At Launch Week, we’ll be showing off a number of these new bundled features and deep integrations to empower data engineers to have a greater impact and a radically improved developer experience.
Here’s the agenda.
- Friday, October 6: Join us for a pre-Launch Week conversation about the state of the Modern Data Stack.
- Monday, October 9: Sandy Ryza will be talking about data quality.
- Tuesday, October 10: Jarred Colli will help data teams reduce their spend on Modern Data Stack tools with our new Dagster Insights feature.
- Wednesday, October 11: Erin Cochran will discuss our recent investments in developer education and announce two very special new projects.
- Thursday, October 12: Pedram Navid will show Dagster users how to reduce operational headaches and cash burn by changing how they do ELT.
- Friday, October 13: Nick Schrock will talk about One More Thing.
We’re looking forward to seeing you there!
Pete and Nick
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!