Backfills in Data & Machine Learning: A Primer

June 6, 2023

A step-by-step guide to using backfills and partitions to simplify data management for data & ML engineers.

In the life of a data engineer, perhaps nothing inspires more difficulty or awe than the backfill. Flub a backfill, and you'll have bad data, angry stakeholders, and a fat cloud bill. Nail a backfill, and you're rewarded with months or years of pristine data (until the next backfill). Backfills are a central piece of the data engineering skillset, a fiery furnace where all the individual challenges of the profession melt together into one big challenge.

Backfills frequently go wrong. They accidentally target the wrong data. Or they fail, and it's too difficult to pick up from the middle, so they get restarted from the beginning. Or they leave data in an inconsistent state: the records in related tables don't match up. And the stakes are high: restarting a large backfill from the beginning can mean spending an extra $10k on cloud compute and living with out-of-date data for days.

In this post, we'll survey backfills: what they are, why we need them, what makes them difficult, and how to deal with that difficulty.

Backfills often go hand-in-hand with partitions, an approach to data management that can help make backfills dramatically simpler and more sane. This post will also delve into how to use partitions to avoid many of backfilling’s main pitfalls.

What's a backfill?

A backfill is when you take a data asset that's normally updated incrementally and update historical parts of it.

For example, you have a table, and each day, you add records to it that correspond to events that happened during that day. Backfilling the table means filling in or overwriting data for days in the past.

We use the term “data asset” instead of only “table” because not all pipelines operate on tabular data. You might also backfill a set of images that you use for training a vision model. Or backfill a set of ML models.
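To make the daily-table example concrete, here is a minimal sketch with the table modeled as an in-memory dict keyed by day; a real pipeline would write to a warehouse table partitioned the same way, and the `update_day`, `incremental_update`, and `backfill` names are illustrative, not from any particular framework:

```python
from datetime import date, timedelta

# Hypothetical in-memory "table": one slice of records per day.
table: dict[date, list[dict]] = {}

def update_day(day: date, records: list[dict]) -> None:
    """Overwrite a single day's slice of the table (idempotent)."""
    table[day] = records

def incremental_update(records: list[dict]) -> None:
    """Normal operation: write only today's data."""
    update_day(date.today(), records)

def backfill(start: date, end: date, fetch) -> None:
    """Backfill: re-write every day in a historical range."""
    day = start
    while day <= end:
        update_day(day, fetch(day))
        day += timedelta(days=1)
```

The key property is that `update_day` is idempotent: writing a day's slice twice produces the same result, which is what makes re-running a backfill safe.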

Why backfill your data?

You typically run a backfill if you're in one of these situations:

Greenfield - you’ve added data assets to your data pipeline

You’ve just developed a new data asset, e.g. you’ve written code that takes a table with raw data and computes a table with cleaner data. In normal operation, this data asset will be updated incrementally using the most recent data, but you need to initialize it with historical data.

Brownfield - one of the assets in your data pipeline has changed

You’ve changed the code that generates one of your data assets, e.g., you fixed a bug in the filtering logic. Or the source that you're pulling data from notices their data was corrupted and publishes corrected source data. You need to throw away all the historical data that you generated with the corrupted data or buggy code and overwrite it with the correct data.

Recovering from failure

The service that you’re pulling your data from has been down for a week. Or you notice that your pipeline has been failing for a week because it doesn’t know how to handle a record type that was added. After the problem is fixed, you need a backfill to fill in the gap.

Backfilling a graph of data assets

Often, the asset that you’re backfilling has other assets that are derived from it. In that case, after backfilling your asset, you’ll usually need to backfill those downstream assets as well. For example, if you backfill the raw_events table, and the events table is generated from the raw_events table, then you’ll need to backfill the events table too.

Orchestrators, which are usually responsible for tracking dependencies in data pipelines, are particularly helpful for managing these kinds of backfills. They can kick off work to backfill the downstream table after the backfill of the upstream table is complete.
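The raw_events/events example generalizes to any dependency graph. As a rough sketch (the asset names and the `deps` mapping are hypothetical), Python's standard-library `graphlib` can compute which assets sit downstream of the one you backfilled, and the order in which to backfill them:

```python
from graphlib import TopologicalSorter

# Hypothetical asset graph: each asset maps to the assets it depends on.
deps = {
    "raw_events": set(),
    "events": {"raw_events"},
    "daily_summary": {"events"},
    "dashboard_table": {"daily_summary"},
}

def downstream_of(asset: str) -> set[str]:
    """All assets that transitively depend on `asset`."""
    out: set[str] = set()
    frontier = {asset}
    while frontier:
        nxt = {a for a, d in deps.items() if d & frontier} - out
        out |= nxt
        frontier = nxt
    return out

def backfill_order(root: str) -> list[str]:
    """The root plus its downstream assets, in dependency order."""
    targets = {root} | downstream_of(root)
    sub = {a: deps[a] & targets for a in targets}
    return list(TopologicalSorter(sub).static_order())
```

This bookkeeping is exactly what an orchestrator does for you: it already knows the graph, so it can fan the backfill out to downstream assets automatically.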

Backfills and partitions

Backfilling becomes dramatically easier when you think of your data in terms of partitions.

Partitioning is an approach to incremental data management that views each data asset as a collection of partitions. For example, a table of events might correspond to a set of hourly partitions, each of which contains the events that occurred within a single hour. At the end of each hour, you fill in that hour's partition. When you run a backfill, you overwrite partitions for hours in the past.
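For instance, assigning events to hourly partitions might look like the following sketch (the string key format is just one convention, not a standard):

```python
from collections import defaultdict
from datetime import datetime

def partition_key(ts: datetime) -> str:
    """Map an event timestamp to its hourly partition key,
    e.g. '2023-06-06-14:00'."""
    return ts.strftime("%Y-%m-%d-%H:00")

def partition_events(events: list[datetime]) -> dict[str, list[datetime]]:
    """Group events by the hourly partition they belong to."""
    parts: dict[str, list[datetime]] = defaultdict(list)
    for ts in events:
        parts[partition_key(ts)].append(ts)
    return dict(parts)
```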

Partitions help with backfills for a few reasons:

What needs to be backfilled?

Partitions allow you to track what data needs to be backfilled and what data does not. Each partition is an atomic unit of data that was updated using particular code at a particular time. This allows you to make statements like:

  • This partition is missing, so it needs to be backfilled.
  • This partition was built using outdated logic, so it needs to be backfilled.
  • This partition was built using the latest logic, so it doesn't need to be backfilled.

Instead of a large ball of data that's an accretion of updates at different times with different code and different source data, a partitioned data asset is a collection of objects, i.e., partitions, that are easy to reason about individually.
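Those statements translate directly into code. Here's a hedged sketch, assuming you record which code version last materialized each partition (the metadata shape is hypothetical):

```python
def partitions_to_backfill(
    all_partitions: list[str],
    materialized: dict[str, str],  # partition key -> code version used
    current_version: str,
) -> list[str]:
    """A partition needs a backfill if it's missing or was built with
    outdated logic; partitions built with the latest logic are skipped."""
    stale = []
    for p in all_partitions:
        version = materialized.get(p)
        if version is None or version != current_version:
            stale.append(p)
    return stale
```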

Partitions also help you figure out what portions of downstream assets you need to backfill after backfilling a portion of an asset. If you realize that the data in your events table was corrupted during a particular range of dates, and you backfill those dates, you'll likely also need to backfill those same dates in any tables that depend on your events table.

Partitions, parallelism, and fault tolerance

When the code you use to generate your data is single-threaded, partitions help you run backfills in bite-sized chunks, in parallel.

Backfills often demand large amounts of computation. Massively parallel computation engines like Snowflake or Spark can distribute the computation across many computers and recover when parts of it fail.

But if your asset is generated with a Python loop that fetches data from a REST endpoint, or a set of transformations over Pandas DataFrames, then executing it over a large historical time range can take days or even weeks. Or it might simply require too much memory to complete successfully. And if there’s a failure in the middle, you might need to start over from the beginning.

When working with this kind of single-threaded code, it's often preferable to run a task for each partition instead of executing the entire backfill at once. This allows the backfill to complete faster, because partitions can be filled in parallel. It also makes them more fault-tolerant - if the task that's processing one of the partitions fails, then you can restart it without losing the progress made on the other partitions.

Note that this isn't always the case. If your data is computed by running a SQL query in Snowflake, it's usually easier to let Snowflake handle the parallelism. Splitting it up into multiple queries would add extra overhead. Whether or not to parallelize a backfill by partition depends a lot on the compute framework you're using – e.g., Pandas vs. Spark vs. Snowflake. For this reason, it's very useful to have the flexibility to run backfills in both ways.
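When you do decide to parallelize by partition, the pattern is roughly: submit one task per partition, collect the failures, and retry only the failed partitions. A sketch with Python's `concurrent.futures` (the function names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def backfill_partitions(partitions, run_partition, max_workers=4, retries=2):
    """Run one task per partition in parallel; retry failed partitions
    without redoing the ones that already succeeded."""
    remaining = list(partitions)
    succeeded, failed = [], []
    for _attempt in range(retries + 1):
        failed = []
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(run_partition, p): p for p in remaining}
            for fut in as_completed(futures):
                p = futures[fut]
                if fut.exception() is None:
                    succeeded.append(p)
                else:
                    failed.append(p)
        if not failed:
            break
        remaining = failed  # retry only what failed
    return succeeded, failed
```

In practice an orchestrator handles the retries and the bookkeeping for you, but the shape of the logic is the same.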

For more on data partitions, read Sandy's article ["Partitions in Data Pipelines"](/blog/partitioned-data-pipelines).

Backfills gone wrong

Backfills can go wrong in a few different ways:

  • Targeting the wrong subset - If you neglect to backfill parts of your data that need to be backfilled, you risk finishing with the false impression that your data is up-to-date. If you backfill parts of your data that don't need to be backfilled, then you've wasted some time and money. If you backfill a data asset without backfilling the data assets derived from it, you risk ending up with your data in an inconsistent and confusing state.
  • Resource overload - Backfills can require significant amounts of memory and processing power. This can cause them to overwhelm your system or starve important workloads.
  • Cost overload - A large backfill might end up costing much more than expected.
  • Getting lost in the middle - If parts of your backfill fail, you can end up in a state where you know something has gone wrong but don’t know what, and need to restart from the beginning.

To avoid these issues, it's essential to plan and execute backfills carefully.

Running a backfill: a step-by-step guide

So you've decided it's time to run a backfill. How should you proceed? It's helpful to think of a backfill in terms of a few steps:

Step 0: Manage your data in a way that makes it easy to backfill

This often means using partitions to organize your data and track its changes. It also means tracking what code and upstream data you've used to generate your data so you know what parts are out-of-date. Lastly, it means tracking the dependencies between your data assets, so you can determine what's impacted by a change.

Step 1: Plan your backfill

What data assets (i.e., tables and files) do you need to backfill? What partitions of those data assets does the backfill need to include? The better a job you've done with Step 0, the easier this step will be.

What's your backfill strategy? Are you going to run your backfill as a single big Snowflake query? A separate task per partition? Different strategies for different assets?

How will you avoid overwhelming your system? You might need to provision additional resources or configure queues that prevent too much from happening simultaneously.

If you're running your backfill without an orchestrator, you might need to write a script that will carry it out.
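If you do end up writing such a script, a minimal sketch might enumerate the partitions the plan covers and launch a small sample before committing to the full run (all names here are hypothetical):

```python
from datetime import date, timedelta

def plan_backfill(start_day: date, end_day: date) -> list[str]:
    """Enumerate the daily partitions the backfill should cover."""
    days = []
    d = start_day
    while d <= end_day:
        days.append(d.isoformat())
        d += timedelta(days=1)
    return days

def launch(partitions, run_partition, sample_size=3):
    """Run a small sample first; any failure there aborts cheaply
    before the bulk of the backfill is launched."""
    sample, rest = partitions[:sample_size], partitions[sample_size:]
    for p in sample:
        run_partition(p)
    for p in rest:
        run_partition(p)
```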

Step 2: Launch your backfill

This might involve invoking a script. It might involve clearing a set of Airflow tasks. If you're using Dagster, you'll use the UI to select a set of assets and partitions to backfill, or submit a request to the GraphQL API.

If you're not using an orchestrator to manage your backfill, this step might need to be spread out - after parts of your backfill are complete, you'll need to kick off the next steps manually.

Step 2a: Start small - if your backfill is large, start with a small subset, e.g. just a few partitions, to verify that it works as expected.

Step 3: Monitor your backfill

Backfills can take hours or days to complete, and a lot can go wrong during that time. You'll need to pay attention to failures and potentially re-launch failed tasks. If you notice an error in your logic, you might need to restart the backfill from the beginning or re-do a portion of it.

Step 4: Verify the results

Just because all the work completed doesn't mean the results are what you expect. You’ll want to verify that your backfilled data looks the way you expect it to, either by running queries against it manually or by executing data quality checks.
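A simple version of such checks, sketched over partitioned data held as a mapping from partition key to rows (a stand-in for a real table; in practice these checks would run as SQL queries or through a data-quality framework):

```python
def verify_backfill(table, expected_partitions, min_rows=1):
    """Return a list of human-readable problems; an empty list means
    the backfilled data passed the checks."""
    problems = []
    for p in expected_partitions:
        rows = table.get(p)
        if rows is None:
            problems.append(f"partition {p} is missing")
        elif len(rows) < min_rows:
            problems.append(f"partition {p} has only {len(rows)} rows")
    return problems
```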
