Data Downsampling | Dagster Glossary

Data Downsampling

Reduce the amount of data for analysis, storage, or processing.

Definition of downsampling:

In data engineering, "downsampling" refers to the process of reducing the amount of data for analysis, storage, or processing. This is done by systematically selecting a subset of the data at a lower rate than the original.

The main goal of downsampling is to reduce the computational requirements, storage needs, and potentially noise in your data. For instance, in time series data, it is common to downsample to different granularities (daily, hourly, etc.) to make the data easier to work with and to align it more closely with the problem you're trying to solve.

In image or audio file processing, downsampling can be used to reduce the resolution of an image or sound file.
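As a rough illustration of reducing image resolution, the sketch below halves the width and height of a small grayscale image by averaging each 2x2 block of pixels. The function name `downsample_2x` and the nested-list image representation are assumptions made for this example; real image libraries offer their own resizing routines.

```python
def downsample_2x(image):
    """Halve the resolution of a grayscale image by averaging each 2x2 block.

    `image` is a list of equal-length rows of numbers. A trailing odd
    row or column is dropped to keep the sketch short.
    """
    height = len(image) - len(image) % 2
    width = len(image[0]) - len(image[0]) % 2
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


# A hypothetical 4x4 image becomes a 2x2 image.
image = [
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [50, 60, 70, 80],
]
small = downsample_2x(image)
print(small)  # [[15.0, 35.0], [55.0, 75.0]]
```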

Downsampling can be performed in various ways, such as:

  1. Average Pooling: This method averages a group of data points to create a single representative data point. For example, you might reduce the granularity of time-series data from seconds to minutes by averaging all the data points within each minute.

  2. Decimation: This method discards some data points without any replacement. For instance, you might keep every nth data point and discard the rest.

  3. Reservoir Sampling: This is a randomized algorithm that allows you to select a sample of n items from a dataset of an unknown size, m, where m > n, in a single pass over the data. The algorithm guarantees that every possible subset of n items has an equal chance of being the sampled subset.
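The three methods above can be sketched in plain Python as follows. The function names are hypothetical, and the reservoir sampler uses the classic "Algorithm R" formulation, which is one common way to implement the idea rather than the only one.

```python
import random


def average_pool(values, window):
    """Average each consecutive group of `window` values.

    The last group may be shorter if len(values) is not a multiple of window.
    """
    groups = [values[i:i + window] for i in range(0, len(values), window)]
    return [sum(group) / len(group) for group in groups]


def decimate(values, step):
    """Keep every `step`-th value, starting with the first, and drop the rest."""
    return values[::step]


def reservoir_sample(stream, n, rng=random):
    """Pick n items uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < n:
            # Fill the reservoir with the first n items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = item
    return reservoir


readings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
pooled = average_pool(readings, 4)   # [2.5, 6.5, 10.5]
decimated = decimate(readings, 4)    # [1, 5, 9]
sample = reservoir_sample(iter(readings), 3, random.Random(0))
print(pooled, decimated, sample)
```

Note that the reservoir sampler only ever holds n items in memory, which is what makes it suitable for streams whose total size is not known in advance.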

Downsampling has to be done carefully. If not handled properly, important information can be lost, which could lead to inaccurate model predictions or analysis. It's also worth mentioning that downsampling is different from data compression, which reduces the storage space required by encoding the data more compactly. Lossless compression does this without discarding any information, while lossy compression (which relies on techniques such as quantization) trades some fidelity for a smaller size.
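To make the risk of information loss concrete, here is a small sketch with made-up data: one minute of per-second readings containing a single short spike. Depending on the method chosen, the spike disappears, is diluted, or is preserved.

```python
# Hypothetical data: 60 per-second readings with one spike at second 37.
readings = [10] * 60
readings[37] = 500

# Decimation: keeping every 15th reading misses the spike entirely.
decimated = readings[::15]

# Averaging: the spike is diluted into a slightly higher mean.
mean_value = sum(readings) / len(readings)

# Taking the maximum preserves the spike but hides the typical level.
max_value = max(readings)

print(decimated)   # [10, 10, 10, 10]
print(mean_value)  # roughly 18.17
print(max_value)   # 500
```

Which behavior is acceptable depends on the question being asked of the data, so the aggregation method is worth choosing deliberately.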

Example of downsampling in Python

Here's an example of downsampling a time series data set using the Pandas library in Python.

Please note that you need pandas and NumPy installed in your Python environment to run this code.

import pandas as pd
import numpy as np

# Create an hourly date range ('h' is the hourly alias; pandas versions before 2.2 also accept 'H')
date_rng = pd.date_range(start='1/1/2023', end='12/31/2023', freq='h')

# Create a DataFrame with the date_rng as the index and random data
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 500, size=len(date_rng))
df.set_index('date', inplace=True)

# Print original DataFrame
print("Original DataFrame:")
print(df)

# Downsampling: reduce hourly rows to daily rows by averaging each day
df_daily = df.resample('D').mean()

# Print downsampled DataFrame
print("\nDownsampled DataFrame:")
print(df_daily)

In this script, we first generate a date range from January 1, 2023 to December 31, 2023 with an hourly frequency. Because the end date is interpreted as midnight, the range stops at 2023-12-31 00:00:00, so the last day contains a single reading. This range serves as the index of our DataFrame. The data column of the DataFrame is populated with random integers from 0 to 499. The DataFrame is then resampled (downsampled) to a daily frequency using the resample method, and the mean of the hourly data for each day is computed. Both the original and downsampled DataFrames are then printed.

The resample method is pandas' flexible tool for frequency conversion and resampling of time-series data. Its argument is a frequency alias that sets the target frequency: 'D' is daily, 'H' is hourly, 'M' is month-end, and so on. Note that pandas 2.2 and later deprecate the uppercase 'H' and 'M' aliases in favor of 'h' and 'ME'.

Your output will look something like this:

Original DataFrame:
2023-01-01 00:00:00   402
2023-01-01 01:00:00   191
2023-01-01 02:00:00   432
2023-01-01 03:00:00   288
2023-01-01 04:00:00   274
...                   ...
2023-12-30 20:00:00   329
2023-12-30 21:00:00   277
2023-12-30 22:00:00   469
2023-12-30 23:00:00   289
2023-12-31 00:00:00   193

[8737 rows x 1 columns]

Downsampled DataFrame:
2023-01-01  253.083333
2023-01-02  226.041667
2023-01-03  259.208333
2023-01-04  225.208333
2023-01-05  296.166667
...                ...
2023-12-27  244.666667
2023-12-28  261.625000
2023-12-29  248.708333
2023-12-30  250.375000
2023-12-31  193.000000

[365 rows x 1 columns]

So we downsampled a dataset of 8,737 rows to a smaller set of just 365 rows.

You can replace .mean() with any other function (sum(), min(), max(), median(), etc.) depending on what kind of downsampling you need.
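As a minimal sketch of swapping in other aggregations, the example below computes several statistics per day in one call using agg. The data is a deterministic toy series (the integers 0 to 47) chosen for this illustration so the results are easy to check by hand.

```python
import pandas as pd

# 48 hourly readings (0..47) spanning two days. "60min" is used as the
# frequency string here because it is accepted across pandas versions.
index = pd.date_range(start="2023-01-01", periods=48, freq="60min")
df = pd.DataFrame({"data": range(48)}, index=index)

# Compute several daily aggregates at once instead of only the mean.
daily = df["data"].resample("D").agg(["min", "max", "sum", "median"])
print(daily)
```

Each row of the result describes one day: the first day covers the values 0 to 23 (sum 276, median 11.5) and the second covers 24 to 47 (sum 852, median 35.5).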
