Data Engineering Terms Explained
A guide to key terms used in data engineering. Many entries include useful code examples in Python.
For installation instructions for the packages used in the examples, visit the packages page.
Aggregate
Combine data from multiple sources into a single dataset.
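A minimal pandas sketch that combines two same-schema sources (the data here is illustrative):

```python
import pandas as pd

# Two sources sharing the same schema (toy data for illustration)
orders_us = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})
orders_eu = pd.DataFrame({"order_id": [3, 4], "amount": [15.0, 5.0]})

# Stack them into a single dataset
orders = pd.concat([orders_us, orders_eu], ignore_index=True)
print(orders)
```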
Align
Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.
Anomaly Detection
Identify data points or events that deviate significantly from expected patterns or behaviors.
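A simple z-score sketch with pandas; production systems often use more robust or model-based detectors:

```python
import pandas as pd

# Toy measurements with one obvious outlier
values = pd.Series([10, 11, 9, 10, 12, 11, 95, 10])

# Flag points more than two standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
anomalies = values[z_scores.abs() > 2]
print(anomalies)  # flags the 95
```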
Anonymize
Remove personal or identifying information from data.
Archive
Move rarely accessed data to a low-cost, long-term storage solution, retaining it for compliance while reducing storage costs.
Augment
Add new data, information, or attributes to an existing dataset to enhance its value for analysis and reporting.
Backup
Create a copy of data to protect against loss or corruption.
Big Data Processing
Process large volumes of data in parallel and distributed computing environments to improve performance.
Clean or Cleanse
Remove invalid or inconsistent data values, such as empty fields or outliers.
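A minimal pandas sketch, assuming illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bo", None, "Cy"],
                   "age": [25, None, 31, 200]})

# Drop rows with empty fields, then filter an implausible outlier
cleaned = df.dropna()
cleaned = cleaned[cleaned["age"].between(0, 120)]
print(cleaned)
```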
Cluster
Group data points based on similarities or patterns to facilitate analysis and modeling.
Compress
Reduce the size of data to save storage space and improve processing performance.
Consolidate
Combine data from multiple sources into a single dataset.
Curation
Select, organize and annotate data to make it more useful for analysis and modeling.
De-identify
Remove personally identifiable information (PII) from data to protect privacy and comply with regulations.
Deduplicate
Identify and remove duplicate records or entries to improve data quality.
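With pandas, exact duplicates can be dropped in one call (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"email": ["a@x.com", "b@x.com", "a@x.com"],
                   "plan": ["free", "pro", "free"]})

# Keep only the first occurrence of each duplicate record
deduped = df.drop_duplicates()
print(deduped)
```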
Denoising
Remove noise or artifacts from data to improve its accuracy and quality.
Denormalize
Optimize data for faster read access by reducing the number of joins needed to retrieve related data.
Dimensionality
Analyze, and often reduce, the number of features or attributes in the data to improve performance.
Discretize
Transform continuous data into discrete categories or bins to simplify analysis.
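A sketch using pandas, with hypothetical age bins:

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 48, 72])

# Bin continuous ages into labeled categories
groups = pd.cut(ages, bins=[0, 18, 65, 120],
                labels=["minor", "adult", "senior"])
print(groups)
```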
ETL
Extract, transform, and load data between different systems.
Enrich
Enhance data with additional information from external sources.
Export
Extract data from a system for use in another system or application.
Feature Extraction
Identify and extract relevant features from raw data for use in analysis or modeling.
Feature Selection
Identify and select the most relevant and informative features for analysis or modeling.
Filter
Extract a subset of data based on specific criteria or conditions.
Fragment
Split data into smaller pieces (fragments) that can be stored or processed independently.
Geospatial Analysis
Analyze data that has geographic or spatial components to identify patterns and relationships.
Hash
Convert data into a fixed-length code to improve data security and integrity.
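A minimal sketch with Python's standard library:

```python
import hashlib

record = "user@example.com"

# SHA-256 yields a fixed-length digest regardless of input size
digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
print(digest)
```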
Impute
Fill in missing data values with estimated or substituted values to facilitate analysis.
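For example, filling gaps with the column mean, one of several common strategies:

```python
import pandas as pd

scores = pd.Series([80.0, None, 90.0, None, 70.0])

# Replace missing values with the mean of the observed values
imputed = scores.fillna(scores.mean())
print(imputed)
```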
Index
Create an optimized data structure for fast search and retrieval.
Ingest
The initial collection and import of data from various sources into your processing environment.
Integrate
Combine data from different sources to create a unified view for analysis or reporting.
Load
Insert data into a database or data warehouse, or your pipeline for processing.
Mask
Obfuscate sensitive data to protect its privacy and security.
Memoize
Store the results of expensive function calls and reuse them when the same inputs occur again.
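Python's standard library provides this via functools:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Cached results mean each subproblem is computed only once
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))  # fast, thanks to the cache
```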
Merge
Combine data from multiple datasets into a single dataset.
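A pandas sketch joining two datasets on a shared key (names are illustrative):

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "name": ["Ann", "Bo"]})
orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10, 20, 5]})

# Join the two datasets on their common key
merged = users.merge(orders, on="user_id")
print(merged)
```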
Mine
Extract useful information, patterns or insights from large volumes of data using statistics and machine learning.
Model
Create a conceptual representation of data objects.
Monitor
Track data processing metrics and system health to ensure high availability and performance.
Munge
See 'wrangle'.
Named Entity Recognition
Locate and classify named entities in text into pre-defined categories.
Normality Testing
Assess the normality of data distributions to ensure validity and reliability of statistical analysis.
Normalize
Standardize data values and organize them into a consistent format to facilitate comparison and analysis.
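For example, min-max normalization rescales values to the [0, 1] range:

```python
import pandas as pd

values = pd.Series([10.0, 20.0, 30.0, 40.0])

# Rescale so the minimum maps to 0 and the maximum to 1
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)
```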
Obfuscate
Make data unintelligible or difficult to understand.
Parse
Interpret and convert data from one format to another.
Partition
Divide data into smaller subsets for improved performance.
Pickle
Convert a Python object into a byte stream for efficient storage.
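A round-trip sketch with the standard library (note: only unpickle data you trust):

```python
import pickle

config = {"retries": 3, "targets": ["db1", "db2"]}

# Serialize the object to bytes, then restore it
blob = pickle.dumps(config)
restored = pickle.loads(blob)
assert restored == config
```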
Prep
Transform your data so it is fit-for-purpose.
Preprocess
Transform raw data before data analysis or machine learning modeling.
Profile
Generate statistical summaries and distributions of data to understand its characteristics.
Purge
Delete data that is no longer needed or relevant to free up storage space.
Reduce
Convert a large set of data into a smaller, more manageable form without significant loss of information.
Repartition
Redistribute data across multiple partitions for improved parallelism and performance.
Replicate
Create a copy of data for redundancy or distributed processing.
Reshape
Change the structure of data to better fit specific analysis or modeling requirements.
Sampling
Extract a subset of data for exploratory analysis or to reduce computational complexity.
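A pandas sketch drawing a reproducible random sample:

```python
import pandas as pd

df = pd.DataFrame({"value": range(1000)})

# Draw a random 5% sample; a fixed seed keeps it reproducible
sample = df.sample(frac=0.05, random_state=42)
print(len(sample))  # 50
```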
Scaling
Increase the capacity or performance of a system to handle more data or traffic.
Schema Mapping
Translate data from one schema or structure to another to facilitate data integration.
Secure
Protect data from unauthorized access, modification, or destruction.
Sentiment Analysis
Analyze text data to identify and categorize the emotional tone or sentiment expressed.
Serialize
Convert data into a linear format for efficient storage and processing.
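JSON is one common target format; a minimal round trip:

```python
import json

event = {"id": 7, "type": "click", "ts": "2024-01-01T00:00:00Z"}

# Serialize to a string that can be stored or sent over the wire
payload = json.dumps(event)
print(json.loads(payload)["type"])  # "click"
```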
Shred
Break down large datasets into smaller, more manageable pieces for easier processing and analysis.
Shuffle
Randomize the order of data records to improve analysis and prevent bias.
Skew
An imbalance in the distribution or representation of data.
Standardize
Transform data to a common unit or format to facilitate comparison and analysis.
Synchronize
Ensure that data in different systems or databases are in sync and up-to-date.
Time Series Analysis
Analyze data over time to identify trends, patterns, and relationships.
Tokenize
Convert data into tokens or smaller units to simplify analysis or processing.
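A simple word-level tokenizer; NLP libraries offer more sophisticated ones:

```python
import re

text = "Data engineering, explained simply."

# Split the text into lowercase word tokens
tokens = re.findall(r"\w+", text.lower())
print(tokens)  # ['data', 'engineering', 'explained', 'simply']
```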
Transform
Convert data from one format or structure to another.
Unstructured Data Analysis
Analyze unstructured data, such as text or images, to extract insights and meaning.
Validate
Check data for completeness, accuracy, and consistency.
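A sketch of simple completeness and range checks (dedicated frameworks such as Great Expectations go much further):

```python
import pandas as pd

df = pd.DataFrame({"email": ["a@x.com", None], "age": [25, -3]})

# Collect any rule violations instead of failing on the first one
problems = []
if df["email"].isna().any():
    problems.append("missing emails")
if not df["age"].between(0, 120).all():
    problems.append("ages out of range")
print(problems)
```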
Version
Maintain a history of changes to data for auditing and tracking purposes.
Wrangle
Convert unstructured data into a structured format.