Dagster Data Engineering Glossary:
Data Compaction
Data compaction definition:
In data engineering, data compaction is the process of reducing the size of data while preserving its essential information. It is used in data management and storage optimization, especially when dealing with large volumes of data. Compaction is commonly used in databases, file systems, and data storage systems to reduce storage costs and improve data access speeds.
Compaction enables efficient use of resources, enhances system performance, and ensures that data remains accessible and valuable over time.
Compaction in data engineering:
Compaction may involve one of several approaches:
Data Storage Reduction: Compaction aims to reduce the amount of storage space required to store data. This is achieved by identifying and eliminating redundant or unnecessary information within the data while retaining the valuable content.
Compression: One of the most common techniques used in compaction is data compression. Data compression algorithms like gzip, LZ4, or Snappy are applied to data to shrink its size by encoding it in a more space-efficient format. Compressed data can be decompressed on the fly when needed for analysis or retrieval, although this extra step comes with its own performance trade-offs. See Data Compression entry. We provide an example of Data Compression using Python below.
Data Merging: In some data storage systems, especially those based on log-structured or columnar storage, compaction involves merging smaller data segments or files into larger ones. This consolidation reduces the overhead of managing numerous smaller files and improves read and write performance. See the entry on Data Merging; a brief merging sketch appears after this list.
Aggregation: Compaction may involve aggregating data at different levels of granularity. For example, daily data may be aggregated into weekly or monthly summaries, reducing the overall data volume. Aggregating data can also improve query performance for analytical workloads. See Aggregation entry.
Data Cleanup: Another aspect of compaction involves data cleanup and the removal of obsolete or duplicate records. This helps maintain data quality and ensures that only relevant information is retained. See Cleanse and Deduplication entries; a short deduplication example also appears after this list.
Index Optimization: In databases, compaction may include optimizing data structures like indexes. Indexes can be rebuilt or reorganized to reduce fragmentation and improve query performance. See Data Indexing entry.
Time-Series Compaction: In time-series databases, compaction techniques are often applied to efficiently store historical data. Older data points may be downsampled or aggregated to reduce storage requirements while still preserving important trends. We provide an example of time-series downsampling below.
Data Lifecycle Management: Compaction is often part of data lifecycle management, where data is archived, moved to lower-cost storage tiers, or deleted when it is no longer actively needed. This helps in cost optimization and compliance with data retention policies.
Consistency: In distributed data systems, compaction ensures data consistency by removing outdated or conflicting data versions, helping maintain the integrity of distributed databases.
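To make a couple of these approaches more concrete, here is a minimal sketch of compaction via data merging, assuming a hypothetical directory of many small Parquet files (for example, the output of a streaming job) that we consolidate into a single larger file. The paths are made up for illustration, and reading or writing Parquet from Pandas requires the pyarrow or fastparquet package.

import glob
import pandas as pd

# Hypothetical directory of many small Parquet segments written by an upstream job
small_files = sorted(glob.glob("events/part-*.parquet"))

# Read each small segment and concatenate them into a single DataFrame
merged_df = pd.concat(
    (pd.read_parquet(path) for path in small_files),
    ignore_index=True,
)

# Write one consolidated file; readers now open a single segment instead of many
merged_df.to_parquet("events/compacted.parquet", index=False)

And here is an equally small sketch of the cleanup approach, removing exact duplicate records with Pandas (the column names and values are made up for illustration):

import pandas as pd

# Toy dataset containing duplicate rows (illustrative values)
records = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "email": ["a@example.com", "a@example.com", "b@example.com", "c@example.com", "c@example.com"],
})

# Keep the first occurrence of each fully duplicated row
deduplicated = records.drop_duplicates()
print(f"{len(records)} rows before, {len(deduplicated)} rows after deduplication")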
Data compaction vs. data compression vs. data reduction
While related, compaction, compression and reduction are slightly different techniques:
| | Compaction | Compression | Reduction |
| --- | --- | --- | --- |
| Aim | Minimize storage space while preserving data contents and structure. | Reduce data size by encoding it in a more space-efficient format, minimizing transfer or storage. | Reduce the overall amount of data while retaining its critical information. |
| Techniques | Data compression, merging smaller data segments into larger ones, and optimizing data structures like indexes to reduce fragmentation. | Encoding data in a compressed form to represent it in fewer bits. | Aggregation, downsampling, encoding, filtering out less relevant data points, and dimensionality reduction for feature selection. |
| Key benefits | Improve storage efficiency and reduce costs. | Reduce data transfer times, save storage space, and improve data transmission efficiency. | Improve data manageability, simplify analytics, and speed up query performance. |
In summary, data compaction is a broader concept that encompasses various techniques aimed at minimizing storage space while preserving data integrity and structure. It often involves consolidation, aggregation, and optimization of data. All three concepts are essential in modern data engineering, and they may be used together to achieve efficient data storage, transmission, and processing.
Two code examples of Data Compaction in Python
As mentioned above, there are many techniques we can employ in data compaction. Let's look at two: compression and time-series downsampling. Please note that you need the necessary Python libraries (the second example uses Pandas and NumPy) installed in your environment to run this code.
Compaction via compression
Let's create a self-contained Python example that demonstrates data compaction using Python's built-in `gzip` library. In this example, we will compress and decompress data to illustrate how data compaction reduces the storage space required for the same dataset.
import gzip
import io
## Sample data to compact
data_to_compact = """
Dagster is different from other data orchestrators.
It’s the first orchestrator built to be used at every stage of the data development lifecycle - local development, unit tests, integration tests, staging environments, all the way up to production.
And it’s the first orchestrator that includes software-defined assets - it frees up teams to think about critical data assets they’re trying to build and let the orchestrator manage the tasks.
"""
## Step 1: Compress the data
compressed_data = io.BytesIO()
with gzip.GzipFile(fileobj=compressed_data, mode='wb') as f:
    f.write(data_to_compact.encode('utf-8'))
compressed_data.seek(0)
## Step 2: Decompress the data
decompressed_data = io.BytesIO()
with gzip.GzipFile(fileobj=compressed_data, mode='rb') as f:
    decompressed_data.write(f.read())
## Step 3: Print the results
original_data_size = len(data_to_compact.encode('utf-8'))
compressed_data_size = len(compressed_data.getvalue())
decompressed_data.seek(0)
decompressed_data_text = decompressed_data.read().decode('utf-8')
print("Original Data Size:", original_data_size, "bytes")
print("Compressed Data Size:", compressed_data_size, "bytes")
print("Decompressed Data Size:", len(decompressed_data_text), "bytes")
print("\nOriginal Data:")
print(data_to_compact)
print("\nDecompressed Data:")
print(decompressed_data_text)
In this example:
- We first create some sample text data (`data_to_compact`).
- We use the `gzip` library to compress the data and save it to `compressed_data`.
- Then, we decompress the data from `compressed_data` and store it in `decompressed_data`.
- Finally, we compare the original data size, compressed data size, and decompressed data size, and we print both the original and decompressed data to verify that the essential information is preserved.
You can run this code to see how data compaction through compression reduces the storage space required for the same dataset while preserving its content.
Compaction via time-series downsampling
Time-series compaction, often referred to as downsampling or aggregation, is a common task for data engineers working with large volumes of time-series data. In this example, we'll use Python and the Pandas library to demonstrate how to downsample time-series data to reduce its granularity while preserving essential information.
For this example, we'll work with simulated temperature data measured every minute and downsample it to hourly averages.
import pandas as pd
import numpy as np
def mem_usage(dataframe):
    if isinstance(dataframe, pd.DataFrame):
        # Get memory usage of each column
        memory_usage = dataframe.memory_usage(deep=True)
        # Calculate the total memory usage of the DataFrame in bytes
        total_memory_usage = memory_usage.sum()
        # Convert bytes to megabytes (MB) for a more readable output
        total_memory_usage_mb = total_memory_usage / (1024 * 1024)  # 1 MB = 1024 * 1024 bytes
        return f"Total memory usage of the DataFrame: {total_memory_usage} bytes ({total_memory_usage_mb:.2f} MB)"
    else:
        return False
## Generate sample time-series data (simulated temperature readings)
np.random.seed(42)
date_range = pd.date_range(start='2024-01-01', end='2024-01-30', freq='T')
temperature_data = np.random.uniform(0, 100, size=(len(date_range),))
temperature_df = pd.DataFrame({'datetime': date_range, 'temperature': temperature_data})
## Display the first few rows of the original data
print("Original Data:")
print(temperature_df)
bytes_used = mem_usage(temperature_df)
if bytes_used is not False:
    print(bytes_used)
## Downsample the data to hourly averages
hourly_avg_temperature_df = temperature_df.resample('H', on='datetime').mean()
## Display the first few rows of the downsampled data
print("\nDownsampled Data (Hourly Averages):")
print(hourly_avg_temperature_df)
bytes_used = mem_usage(hourly_avg_temperature_df)
if bytes_used is not False:
    print(bytes_used)
In this example:
- We generate simulated time-series data for temperature readings over a period of 29 days, with measurements taken every minute.
- We create a Pandas DataFrame to store this data, with columns 'datetime' for timestamps and 'temperature' for temperature values.
- We then downsample the data to hourly averages using the `resample` method with a frequency of `'H'` (hourly), applying the `mean` function for aggregation. You can use other aggregation functions like `sum`, `min`, or `max` depending on your specific requirements, or change the frequency to `'D'` for daily aggregations, as shown in the sketch at the end of this entry.
- Finally, we print the first and last few rows of both the original and downsampled data to see the difference in granularity.
Here is the output our example will generate (the temperature values are random, but reproducible thanks to the fixed seed):
Original Data:
datetime temperature
0 2024-01-01 00:00:00 37.454012
1 2024-01-01 00:01:00 95.071431
2 2024-01-01 00:02:00 73.199394
3 2024-01-01 00:03:00 59.865848
4 2024-01-01 00:04:00 15.601864
... ... ...
41756 2024-01-29 23:56:00 55.088656
41757 2024-01-29 23:57:00 69.560490
41758 2024-01-29 23:58:00 40.454901
41759 2024-01-29 23:59:00 73.148175
41760 2024-01-30 00:00:00 16.333289
[41761 rows x 2 columns]
Total memory usage of the DataFrame: 668304 bytes (0.64 MB)
Downsampled Data (Hourly Averages):
temperature
datetime
2024-01-01 00:00:00 46.750077
2024-01-01 01:00:00 48.671940
2024-01-01 02:00:00 46.212512
2024-01-01 03:00:00 50.480704
2024-01-01 04:00:00 55.487067
... ...
2024-01-29 20:00:00 51.402151
2024-01-29 21:00:00 48.945141
2024-01-29 22:00:00 49.935838
2024-01-29 23:00:00 44.468143
2024-01-30 00:00:00 16.333289
[697 rows x 1 columns]
Total memory usage of the DataFrame: 11152 bytes (0.01 MB)
Note the major reduction in memory usage (0.01 MB vs. 0.64 MB), but downsampling also means we have lost the granularity of the original dataset. It's crucial to choose the right aggregation method and frequency depending on the nature of the data and your analysis goals.
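As mentioned in the explanation above, you can compact the data further by changing the resampling frequency. Here is a minimal sketch, reusing the `temperature_df` DataFrame and `mem_usage` function from the example above, that downsamples to daily granularity and keeps several statistics per day; the choice of `'D'` and of the `mean`/`min`/`max` aggregates is just one illustrative trade-off between compaction and preserved detail.

# Downsample to daily granularity and keep several statistics per day
daily_summary_df = temperature_df.resample('D', on='datetime')['temperature'].agg(['mean', 'min', 'max'])

print("\nDownsampled Data (Daily Summaries):")
print(daily_summary_df)

bytes_used = mem_usage(daily_summary_df)
if bytes_used is not False:
    print(bytes_used)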