Data Sampling | Dagster Glossary


Data Sampling

Extract a subset of data for exploratory analysis or to reduce computational complexity.

Data sampling definition:

Data sampling refers to the process of selecting a subset of data from a larger dataset for analysis or processing. This is often done to reduce the computational requirements of working with the entire dataset or to obtain a representative sample for testing or experimentation.

There are several techniques for data sampling, including random sampling, stratified sampling, and cluster sampling. Random sampling selects data points from the dataset entirely at random, so every point has an equal chance of being chosen. Stratified sampling divides the dataset into subgroups (strata) and samples from each subgroup, preserving the proportions of the original data. Cluster sampling divides the dataset into clusters, randomly selects some of those clusters, and keeps every data point in the selected clusters.
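Since the examples below cover random and stratified sampling, here is a minimal sketch of cluster sampling, assuming a hypothetical in-memory dataset where each row is tagged with a `cluster` column (for example, a store or region ID):

```python
import random

import pandas as pd

# Hypothetical dataset: each row belongs to one of four clusters
data = pd.DataFrame({
    "cluster": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "value": range(8),
})

# Cluster sampling: randomly select whole clusters, then keep
# every row belonging to the chosen clusters.
random.seed(42)
chosen = random.sample(sorted(data["cluster"].unique()), k=2)
sampled = data[data["cluster"].isin(chosen)]

print(sampled)
```

Note that unlike random sampling, every row in a selected cluster is kept; the randomness is in which clusters are chosen, not which rows.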

Data sampling example using Python:

Please note that to run these examples you will need pandas installed in your Python environment (and scikit-learn for the stratified sampling example).

Here is an example of how to perform random sampling in Python using the pandas library:

import pandas as pd

# Load the dataset
data = pd.read_csv('data.csv')

print(f"Full dataset is {len(data)} items long")

# Randomly sample 10 rows from the dataset
sampled_data = data.sample(n=10)

print(f"Sampled dataset is {len(sampled_data)} items long")

In this example, we load a dataset from a CSV file using pandas and then use the sample() method to randomly select 10 rows from it. The resulting sampled_data DataFrame contains the randomly selected rows, which can then be used for further analysis or processing.

Depending on your input data.csv this code might print out:

Full dataset is 57 items long
Sampled dataset is 10 items long
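If you need a percentage of rows rather than a fixed count, sample() also accepts a frac argument, and passing random_state fixes the seed so the sample is reproducible across runs. A minimal sketch, with a small in-memory DataFrame standing in for data.csv:

```python
import pandas as pd

# Stand-in for the CSV above: 100 rows of dummy data
data = pd.DataFrame({"value": range(100)})

# Sample 20% of the rows; random_state makes the draw reproducible
sampled_data = data.sample(frac=0.2, random_state=42)

print(f"Sampled dataset is {len(sampled_data)} items long")
```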

Here's an example of performing stratified sampling on a dataset using Python's scikit-learn library. Assuming an input file data.csv with a column label:

import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

# Load dataset
data = pd.read_csv('data.csv')

# Create stratified sample
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in splitter.split(data, data['label']):
    train_data = data.iloc[train_index]
    test_data = data.iloc[test_index]

# Check class distribution in training set
print(train_data['label'].value_counts() / len(train_data))

In this example, we first load a dataset from a CSV file using pandas. We then create a StratifiedShuffleSplit object with a test set size of 20% and a random seed of 42, and pass the splitter both the dataset and the column to stratify by (data['label']). We iterate over the single split generated by the splitter and assign the resulting train and test indices to new variables. Finally, we print the class distribution of the training set to confirm it matches the distribution of the full dataset.
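If you only need a stratified sample (rather than a train/test split), pandas can do this directly with groupby().sample(). A sketch using a hypothetical imbalanced dataset in place of data.csv:

```python
import pandas as pd

# Hypothetical dataset with an imbalanced 'label' column: 80 'a', 20 'b'
data = pd.DataFrame({
    "label": ["a"] * 80 + ["b"] * 20,
    "value": range(100),
})

# Draw 50% of the rows from each label group, preserving the 80/20 split
stratified = data.groupby("label").sample(frac=0.5, random_state=42)

print(stratified["label"].value_counts())
```

Because the sampling fraction is applied per group, the class proportions in the sample match those of the original dataset.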
