Filter

Extract a subset of data based on specific criteria or conditions.

Data filtering definition:

Filtering data in the context of modern data pipelines refers to the process of selecting a subset of data based on some criteria or condition. This is often used to reduce the amount of data to be processed, to remove irrelevant or redundant data, or to prepare data for further analysis.

Data filtering example using Python:

In Python, data can be filtered with the built-in filter() function or with a list comprehension. The filter() function applies a function to each element of an iterable and, in Python 3, returns an iterator that yields only the elements for which the function returns a true value; wrapping the result in list() produces a list. Here is an example:

# Define a list of numbers
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Keep only the odd numbers (filter out the even ones);
# filter() returns an iterator, so list() collects the results
filtered_numbers = list(filter(lambda x: x % 2 == 1, numbers))

# Print the filtered numbers
print(filtered_numbers)

This would output:

[1, 3, 5, 7, 9]

In this example, the lambda function returns True when the input number is odd. The filter() function applies this lambda to each element of the numbers list and yields only the odd numbers, which list() then collects into a new list.

List comprehension provides another way to filter data in Python. Here's an example:

# Define a list of numbers
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Filter out even numbers using a list comprehension
filtered_numbers = [x for x in numbers if x % 2 == 1]

# Print the filtered numbers
print(filtered_numbers)

This would also output:

[1, 3, 5, 7, 9]

In this example, the list comprehension creates a new list that contains only the odd numbers from the original numbers list. The condition x % 2 == 1 filters out even numbers.
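The same approach applies to records, not only plain numbers. As a minimal sketch, assuming a hypothetical list of order records (the field names and the is_valid_us_order helper are invented for illustration), a named predicate can combine several conditions and drop rows with missing values:

```python
# Hypothetical records, invented for illustration
records = [
    {"id": 1, "country": "US", "amount": 120.0},
    {"id": 2, "country": "DE", "amount": None},
    {"id": 3, "country": "US", "amount": 35.5},
    {"id": 4, "country": "FR", "amount": 80.0},
]


def is_valid_us_order(record):
    """Return True for US records that have a non-missing amount."""
    return record["country"] == "US" and record["amount"] is not None


# Keep only the records that satisfy the predicate
us_orders = [r for r in records if is_valid_us_order(r)]

# Print the ids of the records that passed the filter
print([r["id"] for r in us_orders])  # [1, 3]
```

Giving the condition a name keeps the comprehension readable and lets the same predicate be reused with filter().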

Filtering data can also be done using libraries such as pandas or NumPy. These libraries apply a condition to a whole column or array at once rather than element by element, which makes them efficient on large datasets. For example, the pandas.DataFrame.query() method filters the rows of a DataFrame based on a condition written as a string expression.
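As a rough sketch of what this can look like, the snippet below filters a small DataFrame with query() and with an equivalent boolean mask, then applies the same kind of mask to a NumPy array. The column names and values are made up for illustration, and it assumes pandas and NumPy are installed:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "value": [5, 12, 8, 20],
})

# query() takes the condition as a string expression
large_rows = df.query("value > 10")

# Equivalent filter written as a boolean mask
large_rows_mask = df[df["value"] > 10]

# NumPy arrays can be filtered with a boolean mask in the same way
values = np.array([5, 12, 8, 20])
large_values = values[values > 10]

print(large_rows["name"].tolist())  # ['b', 'd']
print(large_values.tolist())        # [12, 20]
```

Both pandas forms return a new DataFrame and leave the original unchanged.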


Other data engineering terms related to Data Transformation:

Align

Aligning data can mean one of three things: aligning datasets, meeting business rules or arranging data elements in memory.

Big Data Processing

Process large volumes of data in parallel and distributed computing environments to improve performance.

Clean or Cleanse

Remove invalid or inconsistent data values, such as empty fields or outliers.

Cluster

Group data points based on similarities or patterns to facilitate analysis and modeling.

Denoising

Remove noise or artifacts from data to improve its accuracy and quality.

Denormalize

Optimize data for faster read access by reducing the number of joins needed to retrieve related data.

Discretize

Transform continuous data into discrete categories or bins to simplify analysis.

ETL

Extract, transform, and load data between different systems.

Fragment

Break data down into smaller chunks for storage and management purposes.

Impute

Fill in missing data values with estimated values to facilitate analysis.

Munge

See 'wrangle'.

Normalize

Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.

Reduce

Convert a large set of data into a smaller, more manageable form without significant loss of information.

Reshape

Change the structure of data to better fit specific analysis or modeling requirements.

Serialize

Convert data into a linear format for efficient storage and processing.

Shred

Break down large datasets into smaller, more manageable pieces for easier processing and analysis.

Skew

An imbalance in the distribution or representation of data.

Standardize

Transform data to a common unit or format to facilitate comparison and analysis.

Tokenize

Convert data into tokens or smaller units to simplify analysis or processing.

Transform

Convert data from one format or structure to another.

Wrangle

Convert unstructured data into a structured format.