Data Engineering Terms Explained

A guide to key terms used in data engineering. Entries with the icon include useful code examples in Python.
For installation instructions for the packages used in the examples, visit the packages page.


Dagster Glossary code icon

Aggregate

Combine data from multiple sources into a single dataset.
Dagster Glossary code icon

Align

Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.
Dagster Glossary code icon

Anomaly Detection

Identify data points or events that deviate significantly from expected patterns or behaviors.
Dagster Glossary code icon

Anonymize

Remove personal or identifying information from data.

Archive

Move rarely accessed data to a low-cost, long-term storage solution to reduce costs. Store data for long-term retention and compliance.
Dagster Glossary code icon

AsyncIO

Speed up execution with asynchronous I/O.
Dagster Glossary code icon

Augment

Add new data or information to an existing dataset to enhance its value.
Dagster Glossary code icon

Backpressure

A mechanism to handle situations where data is produced faster than it can be consumed.
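One common way to get backpressure is a bounded buffer between producer and consumer. The sketch below is illustrative only: `run_pipeline`, `n_items`, and `max_buffer` are hypothetical names, and it uses the standard library's `queue.Queue`, whose `put` blocks while the queue is full.

```python
import queue
import threading


def run_pipeline(n_items: int, max_buffer: int) -> list:
    """Join a producer and a consumer with a bounded queue (hypothetical helper)."""
    buffer = queue.Queue(maxsize=max_buffer)
    consumed = []

    def producer() -> None:
        for i in range(n_items):
            # put() blocks while the buffer is full, which slows the producer
            # down to the pace of the consumer: that is the backpressure.
            buffer.put(i)
        buffer.put(None)  # sentinel: nothing more to read

    def consumer() -> None:
        while True:
            item = buffer.get()
            if item is None:
                break
            consumed.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed


result = run_pipeline(n_items=100, max_buffer=5)
```

The buffer never holds more than `max_buffer` items, so memory use stays bounded even if the producer is much faster than the consumer.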

Backup

Create a copy of data to protect against loss or corruption.
Dagster Glossary code icon

Batch Processing

Process large volumes of data all at once in a single operation or batch.

Big Data Processing

Process large volumes of data in parallel and distributed computing environments to improve performance.
Dagster Glossary code icon

Cache

Store expensive computation results so they can be reused, not recomputed.
Dagster Glossary code icon

Categorize

Organize and classify data into different categories, groups, or segments.
Dagster Glossary code icon

Clean or Cleanse

Remove invalid or inconsistent data values, such as empty fields or outliers.
Dagster Glossary code icon

Cluster

Group data points based on similarities or patterns to facilitate analysis and modeling.
Dagster Glossary code icon

Compress

Reduce the size of data to save storage space and improve processing performance.
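As a rough illustration, the standard library's `zlib` module can compress a byte string losslessly. The sample data below is made up for the example; real savings depend heavily on how repetitive the input is.

```python
import zlib

# Made-up, highly repetitive CSV-like payload (an assumption for this sketch).
raw = b"timestamp,sensor,value\n" + b"2024-01-01,temp,21.5\n" * 1000

compressed = zlib.compress(raw, level=6)  # higher level: smaller output, more CPU
restored = zlib.decompress(compressed)    # lossless: recovers the original bytes

ratio = len(compressed) / len(raw)
```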
Dagster Glossary code icon

Consolidate

Combine multiple datasets into one to create a more comprehensive view of the data.
Dagster Glossary code icon

Curate

Select, organize and annotate data to make it more useful for analysis and modeling.
Dagster Glossary code icon

De-identify

Remove personally identifiable information (PII) from data to protect privacy and comply with regulations.
Dagster Glossary code icon

Deduplicate

Identify and remove duplicate records or entries to improve data quality.
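A minimal sketch of exact-match deduplication in plain Python, assuming records are dictionaries and that the caller decides which fields define a duplicate. The helper `deduplicate` and the sample rows are hypothetical.

```python
def deduplicate(records, key_fields):
    """Keep the first occurrence of each record, comparing only key_fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 1, "email": "a@example.com"},  # exact duplicate of the first row
]
unique_rows = deduplicate(rows, key_fields=("id", "email"))
```

This only catches exact matches on the chosen fields; near-duplicates (typos, differing case) need fuzzy matching, which is outside this sketch.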
Dagster Glossary code icon

Denoise

Remove noise or artifacts from data to improve its accuracy and quality.
Dagster Glossary code icon

Denormalize

Optimize data for faster read access by reducing the number of joins needed to retrieve related data.
Dagster Glossary code icon

Derive

Extract, transform, and generate new data from existing datasets.

Deserialize

Deserialization is essentially the reverse process of serialization. See: 'Serialize'.
Dagster Glossary code icon

Dimensionality

Analyze the number of features or attributes in the data to improve performance.
Dagster Glossary code icon

Discretize

Transform continuous data into discrete categories or bins to simplify analysis.
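For example, binning can be sketched with the standard library's `bisect` module. The bin edges and labels below are invented for illustration, and `discretize` is a hypothetical helper.

```python
from bisect import bisect_right


def discretize(value, edges, labels):
    """Map a continuous value to a bin label.

    edges must be sorted; labels needs len(edges) + 1 entries.
    A value equal to an edge falls into the bin that starts at that edge.
    """
    return labels[bisect_right(edges, value)]


# Hypothetical age bands: <18, 18-34, 35-64, 65+
AGE_EDGES = [18, 35, 65]
AGE_LABELS = ["minor", "young adult", "adult", "senior"]

binned = [discretize(age, AGE_EDGES, AGE_LABELS) for age in (12, 18, 40, 70)]
# binned == ["minor", "young adult", "adult", "senior"]
```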
Dagster Glossary code icon

Downsample

Reduce the amount of data for analysis, storage, or processing.
Dagster Glossary code icon

ETL

Extract, transform, and load data between different systems.
Dagster Glossary code icon

Encode

Convert categorical variables into numerical representations for ML algorithms.
Dagster Glossary code icon

Enrich

Enhance data with additional information from external sources.
Dagster Glossary code icon

Explore

Understand the data, identify patterns, and gain insights.

Export

Extract data from a system for use in another system or application.
Dagster Glossary code icon

Extrapolate

Predict values outside a known range, based on the trends or patterns identified within the available data.

Fan-Out

A pipeline design in which one operation is broken into - or results in - many parallel downstream tasks.
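A small sketch of the idea using the standard library's `concurrent.futures`: one list of partitions fans out to parallel workers, and the partial results are then collected again. The partitions and the `process_partition` function are hypothetical stand-ins for real downstream tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def process_partition(partition):
    """Hypothetical downstream task: here it just sums one partition."""
    return sum(partition)


partitions = [[1, 2, 3], [4, 5], [6]]

# Fan-out: each partition is handed to its own worker.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

# Fan-in: combine the parallel results into one value.
total = sum(partial_sums)
```

`pool.map` returns results in input order, so `partial_sums` lines up with `partitions` regardless of which worker finishes first.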
Dagster Glossary code icon

Feature Extraction

Identify and extract relevant features from raw data for use in analysis or modeling.
Dagster Glossary code icon

Feature Selection

Identify and select the most relevant and informative features for analysis or modeling.
Dagster Glossary code icon

Filter

Extract a subset of data based on specific criteria or conditions.
Dagster Glossary code icon

Fragment

Break data down into smaller chunks for storage and management purposes.
Dagster Glossary code icon

Geospatial Analysis

Analyze data that has geographic or spatial components to identify patterns and relationships.
Dagster Glossary code icon

Hash

Convert data into a fixed-length code to improve data security and integrity.
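To make "fixed-length" concrete, here is a short sketch using SHA-256 from the standard library's `hashlib`. The helper name `sha256_hex` and the sample inputs are assumptions for the example; SHA-256 is one of many hash functions and is not necessarily the right choice for every use (password storage, for instance, calls for a dedicated password-hashing scheme).

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


digest_a = sha256_hex(b"order-1001,49.99")
digest_b = sha256_hex(b"order-1001,49.98")  # one character changed

# Both digests have the same length regardless of input size, and the
# one-character change in the input produces a different digest.
same_length = len(digest_a) == len(digest_b) == 64
```

Comparing a stored digest with a freshly computed one is a simple integrity check: if the data changed, the digests will not match.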
Dagster Glossary code icon

Homogenize

Make data uniform, consistent, and comparable.
Dagster Glossary code icon

Idempotent

An operation that produces the same result no matter how many times it is performed with the same input.
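A typical example in pipelines is an upsert keyed on a record ID: re-running it after a retry leaves the target unchanged. The sketch below uses a plain dictionary as a stand-in for a table; `upsert` and the record layout are hypothetical.

```python
def upsert(table: dict, record: dict) -> dict:
    """Insert or overwrite a record by its id (hypothetical in-memory table)."""
    table[record["id"]] = record
    return table


table = {}
upsert(table, {"id": 7, "status": "shipped"})
snapshot = dict(table)

# Running the same operation again (for example, after a retry) changes nothing.
upsert(table, {"id": 7, "status": "shipped"})
unchanged = table == snapshot
```

By contrast, appending the record to a list on every run would not be idempotent, because each retry would add another copy.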
Dagster Glossary code icon

Impute

Fill in missing data values with estimated or imputed values to facilitate analysis.
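Mean imputation is one of the simplest strategies. The sketch below assumes missing values are represented as `None` and that at least one value is observed; `impute_mean` is a hypothetical helper, and other strategies (median, model-based) may suit skewed data better.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)  # raises StatisticsError if nothing was observed
    return [fill if v is None else v for v in values]


filled = impute_mean([10.0, None, 20.0, None, 30.0])
# filled == [10.0, 20.0, 20.0, 20.0, 30.0]
```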
Dagster Glossary code icon

Index

Create an optimized data structure for fast search and retrieval.
Dagster Glossary code icon

Ingest

The initial collection and import of data from various sources into your processing environment.
Dagster Glossary code icon

Integrate

Combine data from different sources to create a unified view for analysis or reporting.
Dagster Glossary code icon

Interpolate

Use known data values to estimate unknown data values.
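The simplest case is linear interpolation between two known points. The function name `lerp` and the sample points are assumptions for this sketch.

```python
def lerp(x, x0, y0, x1, y1):
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Known readings: 10.0 at t=2.0 and 20.0 at t=3.0; estimate the value at t=2.5.
estimate = lerp(2.5, x0=2.0, y0=10.0, x1=3.0, y1=20.0)
# estimate == 15.0
```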
Dagster Glossary code icon

Lineage

Understand how data moves through a pipeline, including its origin, transformations, dependencies, and ultimate consumption.
Dagster Glossary code icon

Load

Insert data into a database or data warehouse, or into your pipeline for processing.
Dagster Glossary code icon

Mask

Obfuscate sensitive data to protect its privacy and security.
Dagster Glossary code icon

Materialize

Execute a computation and persist the results to storage.
Dagster Glossary code icon

Memoize

Store the results of expensive function calls and reuse them when the same inputs occur again.
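In Python, the standard library's `functools.lru_cache` decorator is a common way to do this. In the sketch below, `slow_square` is a hypothetical stand-in for an expensive function, and the `calls` list exists only to show how often the function body actually runs.

```python
from functools import lru_cache

calls = []  # records each real execution, for demonstration only


@lru_cache(maxsize=None)
def slow_square(n: int) -> int:
    """Stand-in for an expensive computation."""
    calls.append(n)
    return n * n


results = [slow_square(4), slow_square(4), slow_square(5)]
# The second slow_square(4) is served from the cache, so the body ran twice.
```

Memoization is only safe for functions whose output depends solely on their arguments; results that depend on external state can go stale in the cache.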
Dagster Glossary code icon

Merge

Combine data from multiple datasets into a single dataset.
Dagster Glossary code icon

Mine

Extract useful information, patterns or insights from large volumes of data using statistics and machine learning.
Dagster Glossary code icon

Model

Create a conceptual representation of data objects.

Monitor

Track data processing metrics and system health to ensure high availability and performance.
Dagster Glossary code icon

Multiprocessing

Optimize execution time with multiple parallel processes.

Munge

See 'wrangle'.
Dagster Glossary code icon

Named Entity Recognition

Locate and classify named entities in text into pre-defined categories.
Dagster Glossary code icon

NoSQL

Non-relational databases designed for scalability, schema flexibility, and optimized performance in specific use-cases.
Dagster Glossary code icon

Normality Testing

Assess the normality of data distributions to ensure validity and reliability of statistical analysis.
Dagster Glossary code icon

Normalize

Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.
Dagster Glossary code icon

Obfuscate

Make data unintelligible or difficult to understand.
Dagster Glossary code icon

Parallelize

Boost execution speed of large data processing by breaking the task into many smaller concurrent tasks.
Dagster Glossary code icon

Parse

Interpret and convert data from one format to another.
Dagster Glossary code icon

Partition

Divide data into smaller subsets for improved performance.
Dagster Glossary code icon

Pickle

Convert a Python object into a byte stream for efficient storage.
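A minimal round trip with the standard library's `pickle` module, using a made-up record. Note that unpickling can execute arbitrary code, so `pickle.loads` should only be used on data from a trusted source.

```python
import pickle

record = {"id": 42, "tags": ["a", "b"], "score": 0.97}  # made-up sample object

payload = pickle.dumps(record)    # Python object -> bytes
restored = pickle.loads(payload)  # bytes -> equal Python object (trusted data only)
```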

Pre-aggregate

See 'aggregate'.
Dagster Glossary code icon

Prep

Transform your data so it is fit-for-purpose.
Dagster Glossary code icon

Preprocess

Transform raw data before data analysis or machine learning modeling.
Dagster Glossary code icon

Profile

Generate statistical summaries and distributions of data to understand its characteristics.
Dagster Glossary code icon

Purge

Delete data that is no longer needed or relevant to free up storage space.
Dagster Glossary code icon

Reduce

Convert a large set of data into a smaller, more manageable form without significant loss of information.
Dagster Glossary code icon

Repartition

Redistribute data across multiple partitions for improved parallelism and performance.
Dagster Glossary code icon

Replicate

Create a copy of data for redundancy or distributed processing.
Dagster Glossary code icon

Reshape

Change the structure of data to better fit specific analysis or modeling requirements.
Dagster Glossary code icon

Sample

Extract a subset of data for exploratory analysis or to reduce computational complexity.

Scaling

Increasing the capacity or performance of a system to handle more data or traffic.
Dagster Glossary code icon

Schema Mapping

Translate data from one schema or structure to another to facilitate data integration.
Dagster Glossary code icon

Scrape

Extract data from a website or another source.
Dagster Glossary code icon

Secure

Protect data from unauthorized access, modification, or destruction.
Dagster Glossary code icon

Sentiment Analysis

Analyze text data to identify and categorize the emotional tone or sentiment expressed.
Dagster Glossary code icon

Serialize

Convert data into a linear format for efficient storage and processing.
Dagster Glossary code icon

Shard

Partition a database into smaller, more manageable pieces.
Dagster Glossary code icon

Shred

Break down large datasets into smaller, more manageable pieces for easier processing and analysis.
Dagster Glossary code icon

Shuffle

Randomize the order of data records to improve analysis and prevent bias.
Dagster Glossary code icon

Skew

An imbalance in the distribution or representation of data.
Dagster Glossary code icon

Spill

Temporarily transfer data that exceeds available memory to disk.
Dagster Glossary code icon

Split

Divide a dataset into training, validation, and testing sets for machine learning model training.
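A plain-Python sketch of a three-way split. The 70/15/15 proportions, the fixed seed, and the helper name `split_dataset` are assumptions chosen for illustration; libraries such as scikit-learn offer their own utilities for this.

```python
import random


def split_dataset(rows, train=0.7, val=0.15, seed=0):
    """Shuffle rows reproducibly, then cut into train / validation / test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains becomes the test set
    )


train_set, val_set, test_set = split_dataset(range(100))
```

Shuffling before cutting avoids accidentally putting, say, all of the oldest records in the training set; for time series, a chronological split is usually more appropriate.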
Dagster Glossary code icon

Standardize

Transform data to a common unit or format to facilitate comparison and analysis.

Synchronize

Ensure that data in different systems or databases are in sync and up-to-date.
Dagster Glossary code icon

Thread

Enable concurrent execution in Python by decoupling tasks which are not sequentially dependent.
Dagster Glossary code icon

Time Series Analysis

Analyze data over time to identify trends, patterns, and relationships.
Dagster Glossary code icon

Tokenize

Convert data into tokens or smaller units to simplify analysis or processing.

Transform

Convert data from one format or structure to another.

Unstructured Data Analysis

Analyze unstructured data, such as text or images, to extract insights and meaning.
Dagster Glossary code icon

Validate

Check data for completeness, accuracy, and consistency.
Dagster Glossary code icon

Vectorize

Execute a single operation on multiple data points simultaneously.
Dagster Glossary code icon

Version

Maintain a history of changes to data for auditing and tracking purposes.
Dagster Glossary code icon

Wrangle

Convert unstructured data into a structured format.

About the artwork.

The art you see throughout the glossary was generated thanks to Midjourney and curated by the Dagster Labs team. It was inspired by some of the great artists of the 20th century (and some from earlier periods). See if you can recognize the 'work' of Marcel Duchamp, Frederic Remington, Keith Haring, Claes Oldenburg, Roy Lichtenstein, Wassily Kandinsky, and others.

