Data Preaggregation | Dagster Glossary

Data preaggregation definition

Preaggregation is a specific type of aggregation that is computed ahead of query time, typically as data is loaded or stored, with the aim of optimizing query performance and reducing the computation needed at query time.


In data engineering, data aggregation and data pre-aggregation are important techniques used to process and organize data, but they are applied at different stages and for different purposes. Here's a breakdown of the key differences and when to use each approach:

Quick recap of Data Aggregation vs. Pre-aggregation:

Data Aggregation:

Definition:

Data aggregation is the process of summarizing and condensing data into a more manageable format at the time it is queried. It typically happens at runtime, when users or systems request specific insights from raw data.
Example: Summing up daily sales data to get monthly or yearly sales numbers.
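
As a minimal sketch of on-demand aggregation, the example below sums daily sales into monthly totals each time the function is called. The sample records and the function name `monthly_sales_on_demand` are hypothetical and used only for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw daily sales records: (day, amount).
DAILY_SALES = [
    (date(2024, 1, 5), 100.0),
    (date(2024, 1, 20), 50.0),
    (date(2024, 2, 3), 75.0),
]


def monthly_sales_on_demand(rows):
    """Scan the raw rows and sum amounts by (year, month) on every call."""
    totals = defaultdict(float)
    for day, amount in rows:
        totals[(day.year, day.month)] += amount
    return dict(totals)


# The full scan of the raw data happens here, at query time.
monthly = monthly_sales_on_demand(DAILY_SALES)
```

Because the raw rows are scanned on every call, the same function could just as easily group by year or by week, at the cost of repeating the work for each query.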

Data Pre-Aggregation:

Definition:

Data pre-aggregation involves summarizing and condensing data into predefined aggregates before it is queried. This means aggregating data at predefined intervals or by predefined categories during data processing, rather than at query time.
Example: Pre-aggregating daily sales data into monthly or yearly sales figures during data loading or batch processing.
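
One way to picture this is a load step that builds a summary table once, so that later queries read a single precomputed row. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table names `daily_sales` and `monthly_sales` and the helper functions are assumptions made for illustration, not part of any particular pipeline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2024-01-05", 100.0), ("2024-01-20", 50.0), ("2024-02-03", 75.0)],
)


def build_monthly_sales(conn):
    """Batch step: rebuild the pre-aggregated monthly table from the raw rows."""
    conn.execute("DROP TABLE IF EXISTS monthly_sales")
    conn.execute(
        """
        CREATE TABLE monthly_sales AS
        SELECT substr(day, 1, 7) AS month, SUM(amount) AS total
        FROM daily_sales
        GROUP BY substr(day, 1, 7)
        """
    )


def monthly_total(conn, month):
    """Query step: look up one precomputed row instead of scanning raw data."""
    row = conn.execute(
        "SELECT total FROM monthly_sales WHERE month = ?", (month,)
    ).fetchone()
    return row[0] if row else 0.0


build_monthly_sales(conn)  # runs during loading or batch processing
jan_total = monthly_total(conn, "2024-01")  # cheap lookup at query time
```

The trade-off is visible in the sketch: the query is a simple lookup, but it can only answer questions at the monthly grain that was chosen when the summary table was built.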

Quick comparison:

Characteristic	Data Aggregation	Data Pre-Aggregation
Timing	Happens at query time (on-demand)	Happens before query time (precomputed)
Flexibility	High: Can aggregate in different ways dynamically	Low: Limited to predefined aggregates
Performance/Latency	Potentially slower, especially for large datasets	Faster query performance (lower latency)
Granularity	Raw, detailed data remains available	Pre-aggregated data; raw data may also be retained
Resource Usage	Requires more computational resources at runtime	Reduces runtime computation, but increases storage needs
Query Types	Best for ad hoc and varied queries	Best for predictable, repetitive queries
Use Cases	Suitable for flexible reporting and analysis	Ideal for dashboards and reports with known metrics
Scalability	May struggle with very large datasets	Better suited for handling large datasets
Storage Requirements	Lower storage needs (raw data only)	Higher storage needs (pre-aggregated and raw data)
Maintenance	Minimal, as raw data is queried directly	Requires maintaining pre-aggregated data in sync with updates

When to Use Aggregation:

When there is a need for real-time flexibility in how data is summarized or aggregated.
When data queries vary widely and cannot be predicted in advance.
When you have the necessary computational resources to handle on-demand aggregation without significantly impacting performance.
When the underlying dataset is small enough that aggregating it on demand does not cause major performance bottlenecks.

When to Use Pre-aggregation:

When performance and low-latency queries are critical, especially in dashboards, reports, or applications that rely on pre-defined insights.
When the types of queries are predictable, and you can anticipate what aggregates will be needed (e.g., daily, weekly, monthly summaries).
When dealing with large datasets where runtime aggregation would be too slow or computationally expensive.
In batch processing scenarios where data can be pre-processed at regular intervals, making it easier to aggregate large amounts of data ahead of time.

Best Practices:

Use Pre-Aggregation for Performance-Critical Applications:
- If your application requires fast responses to queries and you know in advance which aggregations are needed, pre-aggregation is the way to go. This is common in business intelligence tools, dashboards, and other applications where low query latency is crucial.
Use On-Demand Aggregation for Flexibility:
- For ad hoc querying where the aggregation needs might change frequently, it is better to rely on dynamic aggregation. This approach allows for more flexible exploration of the data without being limited by predefined summaries.
Hybrid Approach:
- Often, a hybrid approach is best. For frequently queried aggregates, you can pre-aggregate data to improve performance. At the same time, you retain raw data so that less frequent or ad hoc queries can still be performed dynamically.
Data Volume Considerations:
- Pre-aggregation makes sense when the volume of raw data is so large that aggregating on the fly would negatively impact performance. But if the data volume is manageable, on-demand aggregation can provide greater flexibility without major performance trade-offs.
Storage and Maintenance:
- Pre-aggregating data will increase storage requirements because both raw data and pre-aggregated results may need to be retained. You should also account for the maintenance overhead of keeping pre-aggregated data in sync with raw data updates.
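
The hybrid approach and its maintenance cost can be sketched together. In the example below, a hypothetical `SalesStore` class keeps the raw rows, updates a pre-aggregated monthly total on every write so the two stay in sync, and falls back to scanning the raw rows for ad hoc questions. The class, its methods, and the sample data are illustrative assumptions, not a prescribed design.

```python
from collections import defaultdict


class SalesStore:
    """Hypothetical store that keeps raw rows plus one pre-aggregate."""

    def __init__(self):
        self.raw = []  # raw rows: (month, region, amount)
        self.monthly_totals = defaultdict(float)  # pre-aggregate by month

    def add_sale(self, month, region, amount):
        # Every write updates both copies; this is the sync overhead
        # (and the extra storage) that pre-aggregation introduces.
        self.raw.append((month, region, amount))
        self.monthly_totals[month] += amount

    def total_for_month(self, month):
        # Predictable, frequent query: served from the pre-aggregate.
        return self.monthly_totals.get(month, 0.0)

    def total_where(self, predicate):
        # Ad hoc query: aggregated on demand from the raw rows.
        return sum(
            amount for month, region, amount in self.raw
            if predicate(month, region)
        )


store = SalesStore()
store.add_sale("2024-01", "EU", 100.0)
store.add_sale("2024-01", "US", 50.0)
store.add_sale("2024-02", "EU", 75.0)

january = store.total_for_month("2024-01")
europe = store.total_where(lambda month, region: region == "EU")
```

Here the monthly total is a constant-time lookup, while the per-region total was never precomputed and is still answerable because the raw rows were retained.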

By carefully analyzing the performance requirements, the flexibility of query needs, and the data volume, you can decide when to use data aggregation or data pre-aggregation in your data engineering pipelines.