July 12, 2022 • 4 minute read
Roman Roads in Data Engineering: Don't Write Data Pipelines from Scratch
By Claire Lin and Sandy Ryza (@s_ryz)
Roman roads & data engineering
The Romans are famous for their expansion from a tiny town in Italy to ruling a territory as large as the modern US. Many of their ancient peers were just as fierce, but the Romans were singular in their obsession with infrastructure: they would literally build roads as they traveled. When the Romans went on a campaign to Southern Gaul, they’d build roads through its rugged terrain. When they later wanted to campaign in Northern Gaul, they could walk on those roads and avoid navigating that same terrain a second time.
Data teams might not be Roman legions, but building data products requires navigating its own kind of challenging terrain: the complexities of the datasets that they work with. “Why does this field get omitted sometimes?” “What do we consider a user session?” “How do I pull data from S3?”
In many data organizations, this same terrain is covered over and over again. Each time you build a new report or ML model, you end up re-cleaning the data, re-answering questions, re-implementing logic. Deadlines are often tight, and, without the right tools, building a road is a whole project of its own.
But the right tools let anyone build Roman roads: to work in a way that lays the foundation for their next data product while building their current one.
For this to work, you need a few things:
- Writing code in a reusable way needs to be barely harder than writing it in a non-reusable way.
- You need to be able to find what you’ve built before.
- You need to be able to trust what you’ve built before. Figuring out how past data assets were constructed and verifying that they’re still correct is often hard enough that starting from scratch looks easier.
What are the Roman roads in the world of data? Data teams work with two kinds of reusable entities:
- Data assets - a core dataset or ML model is often useful for a variety of different analyses or data products. For example, a canonical dataset of website logins is useful across dozens of analyses.
- Data operations - an operation or transformation is often applicable to a variety of data assets. For example, a function that geocodes all the addresses in a dataset might be useful for multiple datasets that have addresses.
Roman roads & Dagster
We built Dagster with Roman roads in mind. It models both reusable assets and reusable operations. You can write an operation to transform one asset, and then reuse that same operation to transform another asset. Or, you can apply multiple operations to the same asset to generate new reusable assets.
Dagster provides abstractions to model these components:
- An asset, a reusable data entity.
- An op, a reusable operation. Ops can be used to generate assets.
Reusing an asset
Say that we’re an airline and we’ve made optimizations in 2022 to reduce the number of layovers in flights within the US. In order to analyze our optimized flights, we want to create a table of flights within the US since 2022:
from dagster import asset

@asset
def us_flights_since_2022(all_flights):
    # Keep only flights that both depart from and arrive in the US.
    us_flights = all_flights[
        (all_flights['departure_country'] == 'USA') &
        (all_flights['arrival_country'] == 'USA')
    ]
    return us_flights[us_flights['date'].dt.strftime('%Y') >= '2022']
Later, we realize that we’d also like to build a table of US flights before 2022. This requires:
- Computing a table of US flights, which we’ve already done in our us_flights_since_2022 asset. In order to reuse this table, we can define it as an asset named us_flights.
- Computing the US flights before 2022 by reusing us_flights.
This allows us to reuse us_flights to build both the table of flights before 2022 and the table of flights since 2022, without recreating already-generated data.
@asset
def us_flights(all_flights):
    return all_flights[
        (all_flights['departure_country'] == 'USA') &
        (all_flights['arrival_country'] == 'USA')
    ]

@asset
def us_flights_before_2022(us_flights):
    return us_flights[us_flights['date'].dt.strftime('%Y') < '2022']

@asset
def us_flights_since_2022(us_flights):
    return us_flights[us_flights['date'].dt.strftime('%Y') >= '2022']
Dagit, Dagster’s user interface, surfaces the reusable components you’ve constructed. You can view every asset dependency required to generate a certain asset, or view which assets reuse a given asset.
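For example, one rough way to browse these assets in Dagit locally (assuming a Dagster version that lets a repository return assets directly; the flights_repo name and flights.py filename are hypothetical) is:

from dagster import repository

@repository
def flights_repo():
    # Collect the reusable assets so Dagit can display their dependency graph.
    return [us_flights, us_flights_before_2022, us_flights_since_2022]

With the definitions saved in flights.py, running dagit -f flights.py serves the asset graph in a local browser.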
Reusing an op
In the previous section, we defined a us_flights_since_2022 asset. We’d now like to calculate the layover percentage breakdown of these flights (e.g. 50% of flights had zero layovers, 25% had one layover, etc.):
@asset
def layover_breakdown_since_2022(us_flights_since_2022):
    grouped_by_num_layovers = us_flights_since_2022.groupby('num_layovers').size()
    layover_counts_percentage = grouped_by_num_layovers / len(us_flights_since_2022)
    return layover_counts_percentage
Dagster allows us to reuse the same piece of code to transform multiple data assets. This means that creating similar data assets is an incremental change.
For example, computing the layover breakdown before 2022 requires the same computation we used to generate the layover_breakdown_since_2022 asset above. Because we want to reuse that computation, we can define it as a reusable op named calculate_layover_breakdown. Then, when we compute the layover breakdown before 2022, we reuse this op:
from dagster import op, AssetKey, AssetsDefinition

@op
def calculate_layover_breakdown(flights):
    grouped_by_num_layovers = flights.groupby('num_layovers').size()
    layover_counts_percentage = grouped_by_num_layovers / len(flights)
    return layover_counts_percentage

layover_breakdown_since_2022 = AssetsDefinition.from_op(
    calculate_layover_breakdown,
    keys_by_input_name={"flights": AssetKey("us_flights_since_2022")},
    keys_by_output_name={"result": AssetKey("layover_breakdown_since_2022")},
)

layover_breakdown_before_2022 = AssetsDefinition.from_op(
    calculate_layover_breakdown,
    keys_by_input_name={"flights": AssetKey("us_flights_before_2022")},
    keys_by_output_name={"result": AssetKey("layover_breakdown_before_2022")},
)
Ops support configuration and resources, allowing you to specify runtime configuration or plug in different implementations for different development environments. In addition, ops can be interconnected to create a graph, which lets you distribute computation across multiple machines. Like ops, graphs can also back assets.
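As a rough sketch of the idea (the warehouse resource, threshold config, and flag_long_layovers op below are hypothetical, and assume the config_schema / required_resource_keys style of the Dagster API from this era):

from dagster import op, resource

class LocalWarehouse:
    def write(self, df):
        df.to_csv('flagged_flights.csv', index=False)

@resource
def local_warehouse(_init_context):
    # Swap in a different implementation (e.g. a cloud warehouse client) in production.
    return LocalWarehouse()

@op(config_schema={'threshold': int}, required_resource_keys={'warehouse'})
def flag_long_layovers(context, flights):
    # Runtime configuration is supplied per run; the resource is supplied per environment.
    threshold = context.op_config['threshold']
    context.resources.warehouse.write(flights[flights['num_layovers'] >= threshold])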
In Dagit, you can view every invocation of a given op. This allows you to trace the locations where a computation is reused, and the assets generated subsequently.
Wrapping up
Dagster’s asset abstraction ensures the legitimacy of data assets and allows you to build assets atop previous ones. Similarly, Dagster’s op abstraction enables defining new, similar assets by reusing existing code throughout your data platform.
With these tools, you can now build Roman roads for your data platform — your current platform becomes the foundation for your future data products.
Acknowledgements
Thank you to John Sears for the Roman roads analogy!
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!