
June 6, 2023 · 7 minute read

Backfills in Data & Machine Learning: A Primer

Sandy Ryza (@s_ryz)

In the life of a data engineer, perhaps nothing inspires more difficulty or awe than the backfill. Flub a backfill, and you'll have bad data, angry stakeholders, and a fat cloud bill. Nail a backfill, and you're rewarded with months or years of pristine data (until the next backfill). Backfills are a central piece of the data engineering skillset, a fiery furnace where all the individual challenges of the profession melt together into one big challenge.

Backfills frequently go wrong. They accidentally target the wrong data. Or they fail, and it’s too difficult to pick up from the middle, so they get restarted from the beginning. Or they leave data in an inconsistent state: the records in related tables don’t match up. And the stakes are high: restarting a large backfill from the beginning can mean spending an extra $10k on cloud compute and living with out-of-date data for days.

In this post, we'll survey backfills: what they are, why we need them, what makes them difficult, and how to deal with that difficulty.

Backfills often go hand-in-hand with partitions, an approach to data management that can help make backfills dramatically simpler and more sane. This post will also delve into how to use partitions to avoid many of backfilling’s main pitfalls.


What's a backfill?

A backfill is when you take a data asset that's normally updated incrementally and update historical parts of it.

For example, you have a table, and each day, you add records to it that correspond to events that happened during that day. Backfilling the table means filling in or overwriting data for days in the past.
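To make this concrete, here's a minimal Python sketch, where load_day is a hypothetical function that computes and writes one day's worth of records. The daily job and the backfill run the same logic; they just target different dates.

```python
from datetime import date, timedelta

def load_day(day: date) -> None:
    """Hypothetical: compute and write the records for a single day,
    overwriting anything already stored for that day."""
    ...

# Normal operation: each morning, fill in yesterday's records.
load_day(date.today() - timedelta(days=1))

# Backfill: re-run the same logic for a range of historical days.
def backfill(start: date, end: date) -> None:
    day = start
    while day <= end:
        load_day(day)
        day += timedelta(days=1)

backfill(date(2023, 1, 1), date(2023, 3, 31))
```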

We use the term “data asset” instead of only “table” because not all pipelines operate on tabular data. You might also backfill a set of images that you use for training a vision model. Or backfill a set of ML models.

Why backfill your data?

You typically run a backfill if you're in one of these situations:

Greenfield - you’ve added data assets to your data pipeline

You’ve just developed a new data asset, e.g. you’ve written code that takes a table with raw data and computes a table with cleaner data. In normal operation, this data asset will be updated incrementally using the most recent data, but you need to initialize it with historical data.

Brownfield - one of the assets in your data pipeline has changed

You’ve changed the code that generates one of your data assets, e.g., you fixed a bug in the filtering logic. Or the source that you're pulling data from notices their data was corrupted and publishes corrected source data. You need to throw away all the historical data that you generated with the corrupted data or buggy code and overwrite it with the correct data.

Recovering from failure

The service that you’re pulling your data from has been down for a week. Or you notice that your pipeline has been failing for a week because it doesn’t know how to handle a record type that was added. After the problem is fixed, you need a backfill to fill in the gap.

Backfilling a graph of data assets

Often, the asset that you’re backfilling has other assets that are derived from it. In that case, after backfilling your asset, you’ll usually need to backfill those downstream assets as well. I.e., if you backfill the raw_events table, and the events table is generated from the raw_events table, then you’ll need to backfill the events table too.

Orchestrators, which are usually responsible for tracking dependencies in data pipelines, are particularly helpful for managing these kinds of backfills. They can kick off work to backfill the downstream table after the backfill of the upstream table is complete.
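For example, here's a minimal sketch of that kind of dependency using Dagster's partitioned-asset API (fetch_raw_events and clean_events are hypothetical helpers). Because the orchestrator knows that events depends on raw_events and that both share the same daily partitions, a backfill that targets both assets for a date range can rebuild them in dependency order.

```python
from dagster import DailyPartitionsDefinition, asset

daily_partitions = DailyPartitionsDefinition(start_date="2023-01-01")

@asset(partitions_def=daily_partitions)
def raw_events(context):
    # Pull the raw records that landed during this partition's day.
    start, end = context.partition_time_window
    return fetch_raw_events(start, end)  # hypothetical extraction helper

@asset(partitions_def=daily_partitions)
def events(raw_events):
    # Clean a single day of raw records produced by the upstream asset.
    return clean_events(raw_events)  # hypothetical transformation helper
```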

Backfills and partitions

Backfilling becomes dramatically easier when you think of your data in terms of partitions.

Partitioning is an approach to incremental data management that views each data asset as a collection of partitions. For example, a table of events might correspond to a set of hourly partitions, each of which contains the events that occurred within a single hour. At the end of each hour, you fill in the partition for each hour. When you run a backfill, you overwrite partitions for hours in the past.

Partitions help with backfills for a few reasons:

What needs to be backfilled?

Partitions allow you to track what data needs to be backfilled and what data does not. Each partition is an atomic unit of data that was updated using particular code at a particular time. This allows you to make statements like:

  • This partition is missing, so it needs to be backfilled.
  • This partition was built using outdated logic, so it needs to be backfilled.
  • This partition was built using the latest logic, so it doesn't need to be backfilled.

Instead of a large ball of data that's an accretion of updates at different times with different code and different source data, a partitioned data asset is a collection of objects, i.e., partitions, that are easy to reason about individually.
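One way to act on that bookkeeping is to record, for each partition, which version of the code last built it, and compare that against the current version to decide what to rebuild. The sketch below assumes a hypothetical metadata store keyed by partition.

```python
CURRENT_CODE_VERSION = "v7"  # bumped whenever the asset's logic changes

def partitions_to_backfill(all_partitions, partition_metadata):
    """Return the partition keys that are missing or were built with stale logic.

    partition_metadata is a hypothetical mapping of partition key ->
    {"code_version": ...} recorded when the partition was last materialized.
    """
    stale = []
    for key in all_partitions:
        record = partition_metadata.get(key)
        if record is None:
            stale.append(key)  # never materialized: needs a backfill
        elif record["code_version"] != CURRENT_CODE_VERSION:
            stale.append(key)  # built with outdated logic: needs a backfill
    return stale
```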

Partitions also help you figure out what portions of downstream assets you need to backfill after backfilling a portion of an asset. If you realize that the data in your events table was corrupted during a particular range of dates, and you backfill those dates, you'll likely also need to backfill those same dates in any tables that depend on your events table.

Partitions, parallelism, and fault tolerance

When the code you use to generate your data is single-threaded, partitions help you run backfills in bite-sized chunks, in parallel.

Backfills often demand large amounts of computation. Massively parallel computation engines like Snowflake or Spark can distribute the computation across many computers and recover when parts of it fail.

But if your asset is generated with a Python loop that fetches data from a REST endpoint, or with a set of transformations over Pandas DataFrames, then executing it over a large historical time range can take days or even weeks. Or it might simply require too much memory to complete successfully. And if there’s a failure in the middle, you might need to start over from the beginning.

When working with this kind of single-threaded code, it's often preferable to run a task for each partition instead of executing the entire backfill at once. This allows the backfill to complete faster, because partitions can be filled in parallel. It also makes them more fault-tolerant - if the task that's processing one of the partitions fails, then you can restart it without losing the progress made on the other partitions.

Note that this isn't always the case. If your data is computed by running a SQL query in Snowflake, it's usually easier to let Snowflake handle the parallelism. Splitting it up into multiple queries would add extra overhead. Whether or not to parallelize a backfill by partition depends a lot on the compute framework you're using – e.g., Pandas vs. Spark vs. Snowflake. For this reason, it's very useful to have the flexibility to run backfills in both ways.
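When you do go the per-partition route, the launcher can be quite simple. Here's a sketch that runs one task per partition with a thread pool, where load_partition is a hypothetical function that fills a single partition. Failed partitions are collected so they can be retried without redoing the ones that succeeded.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def backfill_in_parallel(partition_keys, load_partition, max_workers=8):
    """Run one task per partition; return the partitions that failed."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_partition, key): key for key in partition_keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                future.result()
            except Exception as exc:
                print(f"partition {key} failed: {exc}")
                failed.append(key)
    return failed

# Retry only what failed, instead of restarting the whole backfill:
# remaining = backfill_in_parallel(failed, load_partition)
```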

Backfills gone wrong

Backfills can go wrong in a few different ways:

  • Targeting the wrong subset - If you neglect to backfill parts of your data that need to be backfilled, you risk finishing with the false impression that your data is up-to-date. If you backfill parts of your data that don't need to be backfilled, then you've wasted some time and money. If you backfill a data asset without backfilling the data assets derived from it, you risk ending up with your data in an inconsistent and confusing state.
  • Resource overload - Backfills can require significant amounts of memory and processing power. This can cause them to overwhelm your system or starve important workloads.
  • Cost overload - A large backfill might end up costing much more than expected.
  • Getting lost in the middle - If parts of your backfill fail, you can end up in a state where you know something has gone wrong but don’t know what, and need to restart from the beginning.

To avoid these issues, it's essential to plan and execute backfills carefully.

Running a backfill: a step-by-step guide

So you've decided it's time to run a backfill. How should you proceed? It's helpful to think of a backfill in terms of a few steps:

Step 0: Manage your data in a way that makes it easy to backfill

This often means using partitions to organize your data and track its changes. It also means tracking what code and upstream data you've used to generate your data so you know what parts are out-of-date. Lastly, it means tracking the dependencies between your data assets, so you can determine what's impacted by a change.

Step 1: Plan your backfill

What data assets (i.e., tables and files) do you need to backfill? What partitions of those data assets does the backfill need to include? The better a job you've done with Step 0, the easier this step will be.

What's your backfill strategy? Are you going to run your backfill as a single big Snowflake query? A separate task per partition? Different strategies for different assets?

How will you avoid overwhelming your system? You might need to provision additional resources or configure queues that prevent too much from happening simultaneously.

If you're running your backfill without an orchestrator, you might need to write a script that will carry it out.
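Whatever tooling you use, it helps to write the plan down as data before launching anything, so the targeted subset, the ordering, and the strategy are explicit. A hypothetical sketch:

```python
backfill_plan = [
    # Assets listed in dependency order; partitions are inclusive date ranges.
    {"asset": "raw_events", "partitions": ("2023-01-01", "2023-03-31"),
     "strategy": "one_task_per_partition"},
    {"asset": "events", "partitions": ("2023-01-01", "2023-03-31"),
     "strategy": "single_warehouse_query"},
]
```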

Step 2: Launch your backfill

This might involve invoking a script. It might involve clearing a set of Airflow tasks. If you're using Dagster, you'll use the UI to select a set of assets and partitions to backfill or submit a request to the GraphQL API.

If you're not using an orchestrator to manage your backfill, this step might need to be spread out - after parts of your backfill are complete, you'll need to kick off the next steps manually.

Step 2a: Start small - if your backfill is large, start with a small subset, e.g. just a few partitions, to verify that it works as expected.

Step 3: Monitor your backfill

Backfills can take hours or days to complete, and a lot can go wrong during that time. You'll need to pay attention to failures and potentially re-launch failed tasks. If you notice an error in your logic, you might need to restart the backfill from the beginning or re-do a portion of it.

Step 4: Verify the results

Just because all the work completed doesn't mean the results are what you expect. You’ll want to verify that your backfilled data looks the way you expect it to, either by running queries against it manually or by executing data quality checks.
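For example, a lightweight check is to confirm that every backfilled partition actually contains data and that row counts look plausible; count_rows_for_day is a hypothetical query helper.

```python
def verify_backfill(days, count_rows_for_day, min_expected_rows=1):
    """Flag any backfilled day whose row count looks suspicious."""
    suspicious = []
    for day in days:
        n = count_rows_for_day(day)  # hypothetical, e.g. SELECT COUNT(*) ... WHERE date = day
        if n < min_expected_rows:
            suspicious.append((day, n))
    return suspicious
```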



We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a Github discussion. If you run into any bugs, let us know with a Github issue. And if you're interested in working with us, check out our open roles!


