Data Engineering Terms Explained

A guide to key terms used in data engineering. Some entries include useful code examples in Python.
For installation instructions for the packages used in the examples, visit the packages page.


Aggregate

Combine data from multiple sources into a single dataset.
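A minimal sketch, using hypothetical per-source sales records, of aggregating rows from two sources into one summary dataset:

```python
from collections import defaultdict

# Hypothetical records from two separate sources
source_a = [{"region": "east", "sales": 100}, {"region": "west", "sales": 50}]
source_b = [{"region": "east", "sales": 25}]

# Combine both sources and total sales per region
totals = defaultdict(int)
for record in source_a + source_b:
    totals[record["region"]] += record["sales"]
```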

Align

Aligning data can mean one of three things: aligning datasets, meeting business rules or arranging data elements in memory.

Anomaly Detection

Identify data points or events that deviate significantly from expected patterns or behaviors.
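One simple approach (among many) is a z-score check: flag any reading more than two standard deviations from the mean. The sensor values here are made up for illustration:

```python
import statistics

readings = [10.1, 9.8, 10.3, 10.0, 42.0, 9.9]
mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

# Flag points more than 2 standard deviations from the mean
anomalies = [x for x in readings if abs(x - mean) > 2 * stdev]
```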

Anonymize

Remove personal or identifying information from data.

Archive

Move rarely accessed data to a low-cost, long-term storage solution to reduce costs while retaining it for compliance.

Augment

Add new data or attributes to an existing dataset to enhance its value for analysis and reporting.

Backup

Create a copy of data to protect against loss or corruption.

Big Data Processing

Process large volumes of data in parallel and distributed computing environments to improve performance.

Clean or Cleanse

Remove invalid or inconsistent data values, such as empty fields or outliers.
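A minimal sketch of dropping records with empty or missing fields, using made-up records:

```python
raw = [
    {"name": "Ada", "age": 36},
    {"name": "", "age": None},  # invalid: empty name, missing age
    {"name": "Grace", "age": 45},
]

# Keep only records with a non-empty name and a present age
cleaned = [r for r in raw if r["name"] and r["age"] is not None]
```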

Cluster

Group data points based on similarities or patterns to facilitate analysis and modeling.

Compress

Reduce the size of data to save storage space and improve processing performance.

Consolidate

Combine data from multiple sources into a single dataset.

Curation

Select, organize and annotate data to make it more useful for analysis and modeling.

De-identify

Remove personally identifiable information (PII) from data to protect privacy and comply with regulations.

Deduplicate

Identify and remove duplicate records or entries to improve data quality.
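For hashable records, one idiomatic sketch uses `dict.fromkeys`, which drops duplicates while preserving first-seen order:

```python
records = ["a@x.com", "b@y.com", "a@x.com"]

# dict keys are unique and keep insertion order
deduped = list(dict.fromkeys(records))
```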

Denoising

Remove noise or artifacts from data to improve its accuracy and quality.

Denormalize

Optimize data for faster read access by reducing the number of joins needed to retrieve related data.

Dimensionality

The number of features or attributes in a dataset; reducing it can simplify analysis and improve performance.

Discretize

Transform continuous data into discrete categories or bins to simplify analysis.
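A sketch of binning a continuous value (age) into discrete categories; the thresholds and labels here are arbitrary assumptions:

```python
def discretize(age, bins=(18, 35, 65)):
    # Hypothetical bins: minor / young adult / adult / senior
    labels = ["minor", "young adult", "adult", "senior"]
    for threshold, label in zip(bins, labels):
        if age < threshold:
            return label
    return labels[-1]
```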

ETL

Extract, transform, and load data between different systems.

Enrich

Enhance data with additional information from external sources.

Export

Extract data from a system for use in another system or application.

Feature Extraction

Identify and extract relevant features from raw data for use in analysis or modeling.

Feature Selection

Identify and select the most relevant and informative features for analysis or modeling.

Filter

Extract a subset of data based on specific criteria or conditions.
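A minimal sketch: keep only orders over a (hypothetical) threshold with a list comprehension:

```python
orders = [{"id": 1, "total": 250}, {"id": 2, "total": 40}, {"id": 3, "total": 980}]

# Keep only orders whose total exceeds 100
large_orders = [o for o in orders if o["total"] > 100]
```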

Fragment

Split a dataset into smaller pieces, often distributed across storage locations, to improve manageability and performance.

Geospatial Analysis

Analyze data that has geographic or spatial components to identify patterns and relationships.

Hash

Convert data into a fixed-length code to improve data security and integrity.
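A sketch using the standard library's `hashlib`: SHA-256 maps any input to a fixed-length digest, and the same input always yields the same digest.

```python
import hashlib

def fingerprint(value: str) -> str:
    # SHA-256 always produces a 64-character hex digest
    return hashlib.sha256(value.encode("utf-8")).hexdigest()
```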

Impute

Fill in missing data values with estimated or imputed values to facilitate analysis.
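One common strategy, sketched here with made-up values, is mean imputation: replace each missing value with the mean of the observed values.

```python
import statistics

values = [12.0, None, 15.0, None, 18.0]

# Compute the fill value from the non-missing entries only
observed = [v for v in values if v is not None]
fill = statistics.mean(observed)

imputed = [v if v is not None else fill for v in values]
```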

Index

Create an optimized data structure for fast search and retrieval.

Ingest

The initial collection and import of data from various sources into your processing environment.

Integrate

Combine data from different sources to create a unified view for analysis or reporting.

Load

Insert data into a database or data warehouse, or your pipeline for processing.

Mask

Obfuscate sensitive data to protect its privacy and security.

Memoize

Store the results of expensive function calls and reuse them when the same inputs occur again.
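Python's standard library provides this via `functools.lru_cache`; the classic illustration is the naive recursive Fibonacci, which becomes fast once intermediate results are cached:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Without the cache this recursion is exponential;
    # with it, each fib(k) is computed only once.
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```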

Merge

Combine data from multiple datasets into a single dataset.
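A sketch of merging two datasets keyed on a shared user ID (the records are hypothetical):

```python
users = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
emails = {1: {"email": "ada@x.com"}, 2: {"email": "grace@y.com"}}

# Merge the two dicts record-by-record on their shared key
merged = {uid: {**users[uid], **emails.get(uid, {})} for uid in users}
```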

Mine

Extract useful information, patterns or insights from large volumes of data using statistics and machine learning.

Model

Create a conceptual representation of data objects.

Monitor

Track data processing metrics and system health to ensure high availability and performance.

Munge

See 'wrangle'.

Named Entity Recognition

Locate and classify named entities in text into pre-defined categories.

Normality Testing

Assess the normality of data distributions to ensure validity and reliability of statistical analysis.

Normalize

Standardize data values into a consistent format or scale to facilitate comparison and analysis.
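One common form is min-max normalization, which rescales values into the range [0, 1]; a minimal sketch with sample values:

```python
values = [2.0, 4.0, 6.0, 10.0]
lo, hi = min(values), max(values)

# Rescale so the minimum maps to 0.0 and the maximum to 1.0
normalized = [(v - lo) / (hi - lo) for v in values]
```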

Obfuscate

Make data unintelligible or difficult to understand.

Parse

Interpret and convert data from one format to another.

Partition

Divide data into smaller subsets for improved performance.

Pickle

Convert a Python object into a byte stream for efficient storage.
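A round-trip sketch with the standard library's `pickle` module; note that pickle data should only be loaded from trusted sources:

```python
import pickle

config = {"retries": 3, "timeout": 30.0}

blob = pickle.dumps(config)     # Python object -> byte stream
restored = pickle.loads(blob)   # byte stream -> Python object
```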

Prep

Transform your data so it is fit-for-purpose.

Preprocess

Transform raw data before data analysis or machine learning modeling.

Profile

Generate statistical summaries and distributions of data to understand its characteristics.

Purge

Delete data that is no longer needed or relevant to free up storage space.

Reduce

Convert a large set of data into a smaller, more manageable form without significant loss of information.

Repartition

Redistribute data across multiple partitions for improved parallelism and performance.

Replicate

Create a copy of data for redundancy or distributed processing.

Reshape

Change the structure of data to better fit specific analysis or modeling requirements.

Sampling

Extract a subset of data for exploratory analysis or to reduce computational complexity.

Scaling

Increasing the capacity or performance of a system to handle more data or traffic.

Schema Mapping

Translate data from one schema or structure to another to facilitate data integration.

Secure

Protect data from unauthorized access, modification, or destruction.

Sentiment Analysis

Analyze text data to identify and categorize the emotional tone or sentiment expressed.

Serialize

Convert data into a linear format for efficient storage and processing.
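JSON is one widely used serialization format; a round-trip sketch with the standard library:

```python
import json

record = {"id": 7, "tags": ["etl", "batch"]}

payload = json.dumps(record)        # object -> linear text format
round_tripped = json.loads(payload)
```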

Shred

Break down large datasets into smaller, more manageable pieces for easier processing and analysis.

Shuffle

Randomize the order of data records to improve analysis and prevent bias.

Skew

An imbalance in the distribution or representation of data.

Standardize

Transform data to a common unit or format to facilitate comparison and analysis.

Synchronize

Ensure that data in different systems or databases are in sync and up-to-date.

Time Series Analysis

Analyze data over time to identify trends, patterns, and relationships.

Tokenize

Convert data into tokens or smaller units to simplify analysis or processing.
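A minimal word-tokenization sketch using a regular expression; real tokenizers handle punctuation, casing, and languages far more carefully:

```python
import re

text = "Extract, transform, and load."

# Lowercase, then pull out alphabetic word tokens
tokens = re.findall(r"[a-z]+", text.lower())
```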

Transform

Convert data from one format or structure to another.

Unstructured Data Analysis

Analyze unstructured data, such as text or images, to extract insights and meaning.

Validate

Check data for completeness, accuracy, and consistency.

Version

Maintain a history of changes to data for auditing and tracking purposes.

Wrangle

Convert unstructured data into a structured format.