Stop Reinventing Orchestration: Embedded ELT in the Orchestrator

October 12, 2023
Solve data ingestion issues with Dagster's Embedded ELT feature, a lightweight embedded library.

We have seen a lot of attention given to many parts of the data engineering lifecycle, from storage to transformation to analytics. However, one part of the lifecycle that remains frustrating for data engineers is ingestion.

New ELT tools include several high-performance, lightweight libraries. (Source: Team 8 Data Analytics Market Map)

For many teams, the choice is often between rolling your own ingestion solution, cobbling together heavyweight third-party frameworks like Airbyte or Meltano, or paying a SaaS company that charges for ingestion by the row.

Rolling your own ingestion can be enticing, but you’re quickly faced with a range of requirements such as monitoring, retries, state management, and more. While open-source products such as Airbyte or Meltano are powerful in their own right, they have complex requirements to deploy, requiring many different services to operate. Airbyte, for example, will spin up 11 different services as part of the deployment. Fivetran, Stitch, and other managed solutions might offload that burden but can be cost-prohibitive.

We’ve seen a shift in the data engineering landscape away from heavy frameworks and libraries toward smaller, embedded solutions. Much of the hype around DuckDB comes from how light and simple it is to use, and the developer experience of that tool has set the bar for how we think about the rest of the categories in the data platform.

We’ve also seen some promising work in the ingestion space with tools like Sling, dlt, Steampipe, pg2parquet, CloudQuery, and Alto, offering small, simple libraries that are great for extracting and loading data. However, on their own, these tools still lack basic functionality that makes them suitable for production.

We at Dagster saw an opportunity: we already had a powerful orchestration platform that managed much of the complexity that the heavier ingestion frameworks duplicated. What if we could leverage that to provide a great foundation to build using these lighter-weight tools?

To help bridge the gap in ingestion, we’re introducing the Dagster Embedded ELT library: a set of pre-built assets and resources around lightweight frameworks that make it easy to ingest and load data.

>    Pedram Navid, Head of Data Engineering and Developer Relations at Dagster Labs, shares details of the new Embedded ELT feature.  

What Makes Ingestion Difficult?

Ingestion alone may seem simple, but building a reliable ingestion pipeline is more than querying a database and saving the results. Let’s quickly look at some of the primary concerns that make ingestion difficult:

  1. Observability: it’s critical to understand how data flows through the system, from counting the rows ingested to capturing and storing logs and alerting on failures.
  2. Error Handling: if there’s one universal truth about data pipelines, it’s that they will fail in some fashion. Being able to handle errors through retries and workflow dependencies is essential.
  3. State Management: unless you want to pull all the data out of your database on every run, you’ll need a way to manage state as part of incremental loads.
  4. Data Quality: validating data before loading it, identifying schema changes and unexpected values, and ensuring constraints and consistency are met is a minimum requirement for many data teams.
  5. Type Conversions: everything from strings to JSON to date-time columns often needs conversion logic when moving between two systems.
  6. Schema Drift: as your source data structure changes, you will need to handle that in the destination system, whether by adding or dropping columns or even changing column types.

A careful observer might notice that there are two categories of problems here. The first four, observability, error handling, state management, and data quality, are all the natural domains of orchestrators, while type conversions and schema drift are precisely what good ELT tools handle for you.

The problem is that, absent a good orchestrator, ELT tools need to bundle large components to operate correctly. These siloed, purpose-specific orchestrators often cause data teams much pain.

Unbundling the Orchestrator

At Dagster, we’ve seen users find great success by centralizing orchestration in one place for workflows across all stages of the data lifecycle. Instead of building a feature-incomplete orchestrator for every tool, data engineers are taking advantage of the full power of Dagster to build resilient pipelines.

In that spirit, we’re introducing dagster-embedded-elt: a framework for building ELT pipelines with Dagster through helpful pre-built assets and resources.

Our first release features the free open-source tool Sling, which makes it easy to sync data between data systems, be it a database, a data warehouse, or an object store. Check out the list of Sling connectors.

Sling is an embeddable library that offers many connectors out-of-the-box.

We’ve built abstractions around the tool so that syncing data from a production database to your data warehouse can be done in just a few lines of code. Our integration is built around Sling's replication configuration, a declarative way to define how to sync data between databases and object stores.

To use dagster-embedded-elt to sync data between systems, it's as simple as creating a replication configuration, defining your sources and targets, and calling the replicate method.

First, we'll define a replication config. We can do this either in the native yaml supported by Sling, or as a Python dictionary:

source: MY_POSTGRES
target: MY_SNOWFLAKE

defaults:
  mode: full-refresh
  object: "{stream_schema}_{stream_table}"

streams:
  public.accounts:
  public.users:
  public.finance_departments:
    object: "departments"
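As noted above, the same replication config can also be passed as a Python dictionary. Here is a sketch that mirrors the YAML, using the same `source`, `target`, `defaults`, and `streams` keys:

```python
# The replication config from above, expressed as a plain Python dictionary.
# Stream entries with no overrides map to None, matching the empty YAML values.
replication_config = {
    "source": "MY_POSTGRES",
    "target": "MY_SNOWFLAKE",
    "defaults": {
        "mode": "full-refresh",
        "object": "{stream_schema}_{stream_table}",
    },
    "streams": {
        "public.accounts": None,
        "public.users": None,
        "public.finance_departments": {"object": "departments"},
    },
}
```

Defining the config in Python can be convenient when you want to generate the stream list programmatically rather than maintain a YAML file.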

Next, we define the source and destination connections and create a Dagster resource:

sling_resource = SlingResource(
    connections=[
        SlingConnectionResource(
            name="MY_POSTGRES",
            type="postgres",
            connection_string=EnvVar("POSTGRES_CONNECTION_STRING"),
        ),
        SlingConnectionResource(
            name="MY_SNOWFLAKE",
            type="snowflake",
            host=EnvVar("SNOWFLAKE_HOST"),
            user=EnvVar("SNOWFLAKE_USER"),
            role="REPORTING",
        ),
    ]
)

With the resource set, we then use the @sling_assets decorator, and Dagster will read and parse your replication config, build your assets, and run your syncs, all in just a few lines of code.

# Path to the replication config, or optionally, create the config as a Python dictionary
replication_config = file_relative_path(__file__, "../sling_replication.yaml")

@sling_assets(replication_config=replication_config)
def my_assets(context, sling: SlingResource):
    yield from sling.replicate(
        replication_config=replication_config,
        dagster_sling_translator=DagsterSlingTranslator(),
    )
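To make the assets loadable by Dagster, they are registered together with the resource. The sketch below assumes the `my_assets` and `sling_resource` definitions shown above and uses Dagster's standard `Definitions` object; the resource key matches the `sling` parameter name in the asset function:

```python
from dagster import Definitions

# Register the Sling assets and resource so they appear in the Dagster UI.
# Assumes my_assets and sling_resource are defined as in the snippets above.
defs = Definitions(
    assets=[my_assets],
    resources={"sling": sling_resource},
)
```

With this in place, running `dagster dev` in the project picks up the assets, and each materialization triggers the corresponding Sling sync.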

Going forward, as we add more ingestion libraries, the design of this plugin architecture will make it simpler to swap solutions without rewriting your entire codebase. In building this, we had three key goals in mind:

1) Make it easier to replicate data from your operational databases such as Postgres and MySQL to your data lake and data warehouse.

2) Make it faster to get started on your core analytics platform without a slew of hosted SaaS solutions.

3) Remove cost anxiety when it comes to large production tables, so you don’t have to worry about how many rows you are syncing and whether a backfill will surprise you at your next invoice.

We are initially launching with Sling as a proof of concept, which gives you great options out-of-the-box. We’d love your feedback to help extend this feature to other integrations, and we will build this capability out based on input from Dagster users.

The Modern Data Stack has given us many great tools over the past years. It has also created problems by adding a layer of complexity (and cost) on top of traditionally basic data engineering tasks. We hope that with this integration, Dagster can simplify data engineering again and hopefully bring some fun back along the way.

 Learn more about Embedded ELT in the Dagster Docs.

Update: Dagster's embedded ELT capability has evolved since this article was published. You might want to check out:

  * Sling Out Your ETL Provider with Embedded ELT
  * Expanding the Dagster Embedded ELT Ecosystem with dltHub for Data Ingestion

Have feedback or questions? Start a discussion in Slack or Github.

