

Executing a computation and persisting the results into storage.

Definition of data materialization

"Materialization" refers to the process of executing a computation and persisting the results into a physical storage medium, such as a disk or memory. Materialization has particular significance in the context of large-scale data processing and transformation operations.

Data materialization in context

Here are the primary contexts in which "materialization" is used in data engineering:

  1. Lazy vs. Eager Evaluation: Many modern computational frameworks use a technique called lazy evaluation, where operations are only computed when absolutely necessary. In these systems, data processing operations are built into a computational plan or a directed acyclic graph (DAG). The actual computations are delayed until an action that requires the results (like saving data or collecting it to the driver node) is called. At that point, the operations are materialized. The opposite approach, eager evaluation, materializes the result immediately after each operation. Dagster also adopts the concepts of lazy and eager data asset materialization policies.

  2. Intermediate Results in Data Pipelines: In distributed data processing systems like Apache Spark or Apache Beam, data goes through a series of transformations. Each transformation might create an intermediate dataset. Depending on the design of the pipeline, these intermediate datasets might be materialized to disk or memory to optimize performance or fault tolerance.

  3. Stream Processing: In stream processing, materialization may refer to storing the current state of a computation so that it can be restored after failures or for querying purposes. Systems like Apache Kafka's Streams API provide state stores that can materialize data for such scenarios.
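
The lazy-versus-eager distinction in the first point above can be illustrated with plain Python generators (a stand-in for a framework's computational plan, not Spark itself):

```python
# Eager: each step runs immediately and materializes a full list in memory
numbers = list(range(1_000_000))
doubled = [n * 2 for n in numbers]  # fully computed and stored right now

# Lazy: the generator expression is a plan, not a result; nothing is
# computed until a terminal action consumes the pipeline
lazy_doubled = (n * 2 for n in range(1_000_000))
total = sum(lazy_doubled)  # the computation is materialized here

print(total == sum(doubled))  # True: same result, different evaluation strategy
```

The lazy pipeline never holds the full intermediate dataset in memory, which is exactly why frameworks like Spark defer computation until an action forces materialization.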

Benefits (and downsides) of data materialization

Materialization can offer several benefits, including:

  • Improved performance for frequently accessed computations.
  • Versioning of a data asset over time.
  • Fault tolerance, by reducing the need to recompute data after data pipeline failures.
  • Smart incremental processing, whereby a data pipeline can skip specific upstream branches if no data dependency on that branch has changed.

However, materialization can also introduce overheads related to storage and the time taken to write results to physical media. Effective data engineering involves making judicious decisions about when and where to materialize data based on the specific requirements of a use case.

Materializing a data asset vs. materializing a database view

The term "materialization" is also used in database systems, where a view is a virtual table representing the results of a query. Whenever a user queries a view, the underlying query runs from scratch against the base tables. On the other hand, a materialized view is a physical copy of the data resulting from a query. It's essentially a snapshot of the data at a certain point in time. Materialized views can be refreshed periodically or based on specific triggers.

Using materialized views improves query performance by avoiding the overhead of recomputing the virtual table's data every time. This is especially useful for complex and frequently accessed queries. Of course, the snapshot may not reflect the very latest data since it's a static copy. So you need to define a strategy for how often to refresh your cached view.
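
SQLite (used in the example later in this article) has no native materialized views, but a refresh strategy can be sketched by rebuilding a snapshot table; in PostgreSQL the rebuild step would be a single `REFRESH MATERIALIZED VIEW` statement. The table and column names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE orders (amount REAL)")
conn.execute("INSERT INTO orders VALUES (10), (20)")

def refresh_snapshot(conn):
    # Rebuild the snapshot from the live base table; in PostgreSQL this
    # would be: REFRESH MATERIALIZED VIEW order_totals
    conn.execute("DROP TABLE IF EXISTS order_totals_snapshot")
    conn.execute(
        "CREATE TABLE order_totals_snapshot AS "
        "SELECT SUM(amount) AS total FROM orders"
    )

refresh_snapshot(conn)
print(conn.execute("SELECT total FROM order_totals_snapshot").fetchone())  # (30.0,)

# New data arrives; the snapshot is stale until the next scheduled refresh
conn.execute("INSERT INTO orders VALUES (5)")
print(conn.execute("SELECT total FROM order_totals_snapshot").fetchone())  # still (30.0,)

refresh_snapshot(conn)
print(conn.execute("SELECT total FROM order_totals_snapshot").fetchone())  # (35.0,)
```

The window between the insert and the second refresh is precisely the staleness you trade for faster reads.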

Materialization vs. Caching

Both data materialization and caching involve storing data to optimize access times and reduce computational or retrieval overhead. However, they differ in their primary objectives, usage scenarios, and lifecycles.

Data Materialization refers to persisting the intermediate or final results of data processing and transformation tasks, often in the context of data engineering or databases. It is also used in big data processing, where intermediate results are persisted to optimize iterative computations or enhance fault tolerance. Materialized data might have a longer lifecycle, and its refresh or update cycle is typically well-defined. For instance, a materialized view might be refreshed nightly or weekly.

Caching, on the other hand, refers to a strategy of temporarily storing frequently accessed data close to the point of use to reduce data retrieval times and offload source systems. It is common in web applications, where frequently accessed data (like user profiles) is stored in memory for rapid access. It is also used in databases and content delivery networks to speed up data or content delivery by temporarily storing it in a location closer to the user or application.

Cached data usually has a shorter lifecycle than materialized data and may be evicted based on policies (e.g., Least Recently Used (LRU) policy). The nature of caching is transient; if a cache is cleared, the system should still be able to fetch the original data, albeit with a performance hit.
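
The transient, eviction-driven nature of a cache can be sketched with Python's functools.lru_cache, which discards the least recently used entry once maxsize is exceeded (the function and counter here are illustrative):

```python
from functools import lru_cache

call_count = 0  # tracks how often the "source system" is actually hit

@lru_cache(maxsize=2)  # only the 2 most recently used results are kept
def expensive_lookup(key):
    global call_count
    call_count += 1
    return key * 10

expensive_lookup(1)  # miss -> computed
expensive_lookup(2)  # miss -> computed
expensive_lookup(1)  # hit  -> served from cache
expensive_lookup(3)  # miss -> evicts key 2 (least recently used)
expensive_lookup(2)  # miss -> recomputed after eviction

print(call_count)  # 4: five calls, but only four hit the source
```

Note that evicting key 2 did not lose any data; the system simply recomputed it on the next request, at the cost of one extra source hit. A materialized dataset, by contrast, is expected to remain on storage until it is explicitly refreshed or dropped.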

So, while both mechanisms aim to improve data access times, data materialization is more about persisting data for optimized processing and computations over potentially longer periods whereas caching is about temporarily storing data for quick access, often accepting the possibility of stale data for the sake of speed.

An example of data materialization in Python

We can demonstrate data materialization using Python and SQLite, a lightweight database system. This example will focus on the difference between views and materialized views.

Suppose you have a dataset of products and their sales. We'll create two tables:

  1. products - listing products and their details.
  2. sales - listing individual sales records.

Our goal is to get the total sales amount for each product.

Here's how we can use SQLite in Python to demonstrate this:

import sqlite3

# Set up an in-memory SQLite database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Create products table
cursor.execute('''
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    price REAL NOT NULL
)
''')

# Create sales table
cursor.execute('''
CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER,
    quantity_sold INTEGER,
    FOREIGN KEY (product_id) REFERENCES products (product_id)
)
''')

# Insert sample data into products
products = [(1, 'Laptop', 2500),
            (2, 'Mouse', 40),
            (3, 'Keyboard', 80)]
cursor.executemany('INSERT INTO products VALUES (?, ?, ?)', products)

# Insert sample data into sales
sales = [(1, 1, 5),  # 5 laptops sold
         (2, 2, 10), # 10 mice sold
         (3, 3, 7),  # 7 keyboards sold
         (4, 2, 5)]  # 5 more mice sold
cursor.executemany('INSERT INTO sales VALUES (?, ?, ?)', sales)

# Create a VIEW for total sales for each product
cursor.execute('''
CREATE VIEW product_sales AS
SELECT p.product_id, p.product_name, SUM(p.price * s.quantity_sold) AS total_sales
FROM products p
JOIN sales s ON p.product_id = s.product_id
GROUP BY p.product_id
''')

# Now, let's materialize this view into a physical table
cursor.execute('''
CREATE TABLE product_sales_materialized AS
SELECT * FROM product_sales
''')

# Query the materialized data
cursor.execute('SELECT * FROM product_sales_materialized')
print(cursor.fetchall())

In this example:

  • First, we set up an in-memory SQLite database.
  • We then defined two tables: products and sales.
  • After inserting some sample data, we created a view named product_sales that calculates the total sales for each product.
  • To demonstrate materialization, we copied the data from the view into a physical table named product_sales_materialized.
  • Finally, we queried the materialized table to get our results.

The final print statement displays our materialized data, which will appear as follows:

[(1, 'Laptop', 12500.0), (2, 'Mouse', 600.0), (3, 'Keyboard', 560.0)]

In a real-world scenario, using materialized views can help in speeding up frequently accessed and complex queries by storing their results in a physical table, thus avoiding recalculating the results every time.
