Dagster Data Engineering Glossary:

Data Filtering

Extract a subset of data based on specific criteria or conditions.

Data filtering definition:

Filtering data in the context of modern data pipelines refers to the process of selecting a subset of data based on some criteria or condition. This is often used to reduce the amount of data to be processed, to remove irrelevant or redundant data, or to prepare data for further analysis.

Data filtering example using Python:

In Python, data can be filtered with the built-in filter() function or with a list comprehension. The filter() function applies a function to each element of an iterable and returns an iterator that yields only the elements for which the function returns True; wrapping it in list() collects those elements into a list. Here is an example:

# Define a list of numbers
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Filter out even numbers
filtered_numbers = list(filter(lambda x: x % 2 == 1, numbers))

# Print the filtered numbers
print(filtered_numbers)

This would output:

[1, 3, 5, 7, 9]

In this example, the lambda function returns True when the input number is odd. The filter() function applies this lambda to each element of the numbers list, and list() collects the results into a new list containing only the odd numbers.

A list comprehension provides another way to filter data in Python. Here's an example:

# Define a list of numbers
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Filter out even numbers using a list comprehension
filtered_numbers = [x for x in numbers if x % 2 == 1]

# Print the filtered numbers
print(filtered_numbers)

This would also output:

[1, 3, 5, 7, 9]

In this example, the list comprehension creates a new list that contains only the odd numbers from the original numbers list. The condition x % 2 == 1 filters out even numbers.
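The same idea extends from plain numbers to the record-like data that pipelines usually handle. The following is a minimal sketch, not a prescribed approach: the orders list, its field names, and the is_large_shipped helper are all hypothetical and exist only for illustration. It uses a generator expression, which filters lazily and so avoids building an intermediate list when the input is large.

```python
# Hypothetical sample records; the field names are illustrative only.
orders = [
    {"id": 1, "status": "shipped", "amount": 120.0},
    {"id": 2, "status": "cancelled", "amount": 35.5},
    {"id": 3, "status": "shipped", "amount": 15.0},
    {"id": 4, "status": "pending", "amount": 250.0},
]


def is_large_shipped(order, min_amount=100.0):
    """Return True for shipped orders at or above min_amount (hypothetical rule)."""
    return order["status"] == "shipped" and order["amount"] >= min_amount


# A generator expression filters lazily: no record is tested until iteration.
large_shipped = (order for order in orders if is_large_shipped(order))

# Consume the generator and keep only the ids of the matching records.
large_shipped_ids = [order["id"] for order in large_shipped]
print(large_shipped_ids)
```

Keeping the condition in a named function makes the filtering rule easy to test and reuse across pipeline steps.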

Filtering data can also be done using libraries such as Pandas or NumPy. These libraries provide efficient ways to filter data in large datasets. For example, the pandas.DataFrame.query() method can be used to filter rows in a Pandas DataFrame based on a specified condition.
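As a rough illustration of the library-based approach, the sketch below filters a NumPy array with a boolean mask and filters a Pandas DataFrame both with a boolean mask and with DataFrame.query(). The sample data and column names are made up for this example, and it assumes pandas and numpy are installed.

```python
import numpy as np
import pandas as pd

# NumPy: a boolean mask keeps only the elements where the condition holds.
numbers = np.arange(1, 11)
odd_numbers = numbers[numbers % 2 == 1]

# Pandas: illustrative data; the column names are assumptions for this sketch.
df = pd.DataFrame(
    {
        "city": ["Lima", "Oslo", "Cairo", "Quito"],
        "temp_c": [19.5, 4.0, 31.2, 14.8],
    }
)

# Option 1: boolean indexing with a mask built from a column comparison.
warm_by_mask = df[df["temp_c"] > 15]

# Option 2: query() expresses the same condition as a string.
warm_by_query = df.query("temp_c > 15")

print(odd_numbers.tolist())
print(warm_by_query["city"].tolist())
```

Both Pandas options select the same rows here; query() can be easier to read when a condition combines several columns.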

Other data engineering terms related to ‘Data Transformation’:

Align

Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.

Clean or Cleanse

Remove invalid or inconsistent data values, such as empty fields or outliers.

Cluster

Group data points based on similarities or patterns to facilitate analysis and modeling.

Curate

Select, organize, and annotate data to make it more useful for analysis and modeling.

Denoise

Remove noise or artifacts from data to improve its accuracy and quality.

Denormalize

Optimize data for faster read access by reducing the number of joins needed to retrieve related data.

Derive

Extract, transform, and generate new data from existing datasets.

Discretize

Transform continuous data into discrete categories or bins to simplify analysis.

ETL

Extract, transform, and load data between different systems.

Encode

Convert categorical variables into numerical representations for ML algorithms.

Fragment

Break data down into smaller chunks for storage and management purposes.

Homogenize

Make data uniform, consistent, and comparable.

Impute

Fill in missing data values with estimated values to facilitate analysis.

Linearize

Transform the relationship between variables to make datasets approximately linear.

Munge

Normalize

Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.

Reduce

Convert a large set of data into a smaller, more manageable form without significant loss of information.

Reshape

Change the structure of data to better fit specific analysis or modeling requirements.

Serialize

Convert data into a linear format for efficient storage and processing.

Shred

Break down large datasets into smaller, more manageable pieces for easier processing and analysis.

Skew

An imbalance in the distribution or representation of data.

Split

Divide a dataset into training, validation, and testing sets for machine learning model training.

Standardize

Transform data to a common unit or format to facilitate comparison and analysis.

Tokenize

Convert data into tokens or smaller units to simplify analysis or processing.

Transform

Convert data from one format or structure to another.

Wrangle

Convert unstructured data into a structured format.
