Data Engineering Terms Explained
A guide to key terms used in data engineering. Many entries include useful code examples in Python.
For installation instructions for the packages used in the examples, visit the packages page.
Aggregate
Combine data from multiple sources into a single dataset.
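A minimal pandas sketch that combines two same-schema sources (the data here is illustrative):

```python
import pandas as pd

# Two sources sharing the same schema (toy data for illustration)
orders_us = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})
orders_eu = pd.DataFrame({"order_id": [3, 4], "amount": [15.0, 5.0]})

# Stack them into a single dataset
orders = pd.concat([orders_us, orders_eu], ignore_index=True)
print(orders)
```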
Align
Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.
Anomaly Detection
Identify data points or events that deviate significantly from expected patterns or behaviors.
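A simple z-score sketch with pandas; production systems often use more robust or model-based detectors:

```python
import pandas as pd

# Toy measurements with one obvious outlier
values = pd.Series([10, 11, 9, 10, 12, 11, 95, 10])

# Flag points more than two standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
anomalies = values[z_scores.abs() > 2]
print(anomalies)  # flags the 95
```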
Anonymize
Remove personal or identifying information from data.
Archive
Move rarely accessed data to a low-cost, long-term storage solution, retaining it for compliance while reducing storage costs.
Augment
Add new data, information, or attributes to an existing dataset to enhance its value for analysis and reporting.
Backup
Create a copy of data to protect against loss or corruption.
Big Data Processing
Process large volumes of data in parallel and distributed computing environments to improve performance.
Clean or Cleanse
Remove invalid or inconsistent data values, such as empty fields or outliers.
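A minimal pandas sketch, assuming illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bo", None, "Cy"],
                   "age": [25, None, 31, 200]})

# Drop rows with empty fields, then filter an implausible outlier
cleaned = df.dropna()
cleaned = cleaned[cleaned["age"].between(0, 120)]
print(cleaned)
```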
Cluster
Group data points based on similarities or patterns to facilitate analysis and modeling.
Compress
Reduce the size of data to save storage space and improve processing performance.
Consolidate
Combine data from multiple sources into a single dataset.
Curation
Select, organize and annotate data to make it more useful for analysis and modeling.
De-identify
Remove personally identifiable information (PII) from data to protect privacy and comply with regulations.
Deduplicate
Identify and remove duplicate records or entries to improve data quality.
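With pandas, exact duplicates can be dropped in one call (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"email": ["a@x.com", "b@x.com", "a@x.com"],
                   "plan": ["free", "pro", "free"]})

# Keep only the first occurrence of each duplicate record
deduped = df.drop_duplicates()
print(deduped)
```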
Denoising
Remove noise or artifacts from data to improve its accuracy and quality.
Denormalize
Optimize data for faster read access by reducing the number of joins needed to retrieve related data.
Dimensionality
Analyze, and often reduce, the number of features or attributes in the data to improve performance.
Discretize
Transform continuous data into discrete categories or bins to simplify analysis.
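A sketch using pandas, with hypothetical age bins:

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 48, 72])

# Bin continuous ages into labeled categories
groups = pd.cut(ages, bins=[0, 18, 65, 120],
                labels=["minor", "adult", "senior"])
print(groups)
```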
ETL
Extract, transform, and load data between different systems.
Enrich
Enhance data with additional information from external sources.
Export
Extract data from a system for use in another system or application.
Feature Extraction
Identify and extract relevant features from raw data for use in analysis or modeling.
Feature Selection
Identify and select the most relevant and informative features for analysis or modeling.
Filter
Extract a subset of data based on specific criteria or conditions.
Fragment
Split data into smaller pieces (fragments) that can be stored or processed independently.
Geospatial Analysis
Analyze data that has geographic or spatial components to identify patterns and relationships.
Hash
Convert data into a fixed-length code to improve data security and integrity.
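A minimal sketch with Python's standard library:

```python
import hashlib

record = "user@example.com"

# SHA-256 yields a fixed-length digest regardless of input size
digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
print(digest)
```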
Impute
Fill in missing data values with estimated or substituted values to facilitate analysis.
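For example, filling gaps with the column mean, one of several common strategies:

```python
import pandas as pd

scores = pd.Series([80.0, None, 90.0, None, 70.0])

# Replace missing values with the mean of the observed values
imputed = scores.fillna(scores.mean())
print(imputed)
```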
Index
Create an optimized data structure for fast search and retrieval.
Ingest
The initial collection and import of data from various sources into your processing environment.
Integrate
Combine data from different sources to create a unified view for analysis or reporting.
Load
Insert data into a database or data warehouse, or your pipeline for processing.
Mask
Obfuscate sensitive data to protect its privacy and security.
Memoize
Store the results of expensive function calls and reuse them when the same inputs occur again.
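Python's standard library provides this via functools:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Cached results mean each subproblem is computed only once
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))  # fast, thanks to the cache
```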
Merge
Combine data from multiple datasets into a single dataset.
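A pandas sketch joining two datasets on a shared key (names are illustrative):

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "name": ["Ann", "Bo"]})
orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10, 20, 5]})

# Join the two datasets on their common key
merged = users.merge(orders, on="user_id")
print(merged)
```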
Mine
Extract useful information, patterns or insights from large volumes of data using statistics and machine learning.
Model
Create a conceptual representation of data objects.
Monitor
Track data processing metrics and system health to ensure high availability and performance.
Munge
See 'wrangle'.
Named Entity Recognition
Locate and classify named entities in text into pre-defined categories.
Normality Testing
Assess the normality of data distributions to ensure validity and reliability of statistical analysis.
Normalize
Standardize data values and organize them into a consistent format to facilitate comparison and analysis.
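For example, min-max normalization rescales values to the [0, 1] range:

```python
import pandas as pd

values = pd.Series([10.0, 20.0, 30.0, 40.0])

# Rescale so the minimum maps to 0 and the maximum to 1
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)
```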
Obfuscate
Make data unintelligible or difficult to understand.
Parse
Interpret and convert data from one format to another.
Partition
Divide data into smaller subsets for improved performance.
Pickle
Convert a Python object into a byte stream for efficient storage.
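A round-trip sketch with the standard library (note: only unpickle data you trust):

```python
import pickle

config = {"retries": 3, "targets": ["db1", "db2"]}

# Serialize the object to bytes, then restore it
blob = pickle.dumps(config)
restored = pickle.loads(blob)
assert restored == config
```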
Prep
Transform your data so it is fit-for-purpose.
Preprocess
Transform raw data before data analysis or machine learning modeling.
Profile
Generate statistical summaries and distributions of data to understand its characteristics.
Purge
Delete data that is no longer needed or relevant to free up storage space.
Reduce
Convert a large set of data into a smaller, more manageable form without significant loss of information.
Repartition
Redistribute data across multiple partitions for improved parallelism and performance.
Replicate
Create a copy of data for redundancy or distributed processing.
Reshape
Change the structure of data to better fit specific analysis or modeling requirements.
Sampling
Extract a subset of data for exploratory analysis or to reduce computational complexity.
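A pandas sketch drawing a reproducible random sample:

```python
import pandas as pd

df = pd.DataFrame({"value": range(1000)})

# Draw a random 5% sample; a fixed seed keeps it reproducible
sample = df.sample(frac=0.05, random_state=42)
print(len(sample))  # 50
```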
Scaling
Increase the capacity or performance of a system to handle more data or traffic.
Schema Mapping
Translate data from one schema or structure to another to facilitate data integration.
Secure
Protect data from unauthorized access, modification, or destruction.
Sentiment Analysis
Analyze text data to identify and categorize the emotional tone or sentiment expressed.
Serialize
Convert data into a linear format for efficient storage and processing.
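JSON is one common target format; a minimal round trip:

```python
import json

event = {"id": 7, "type": "click", "ts": "2024-01-01T00:00:00Z"}

# Serialize to a string that can be stored or sent over the wire
payload = json.dumps(event)
print(json.loads(payload)["type"])  # "click"
```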
Shred
Break down large datasets into smaller, more manageable pieces for easier processing and analysis.
Shuffle
Randomize the order of data records to improve analysis and prevent bias.
Skew
An imbalance in the distribution or representation of data.
Standardize
Transform data to a common unit or format to facilitate comparison and analysis.
Synchronize
Ensure that data in different systems or databases are in sync and up-to-date.
Time Series Analysis
Analyze data over time to identify trends, patterns, and relationships.
Tokenize
Convert data into tokens or smaller units to simplify analysis or processing.
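A simple word-level tokenizer; NLP libraries offer more sophisticated ones:

```python
import re

text = "Data engineering, explained simply."

# Split the text into lowercase word tokens
tokens = re.findall(r"\w+", text.lower())
print(tokens)  # ['data', 'engineering', 'explained', 'simply']
```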
Transform
Convert data from one format or structure to another.
Unstructured Data Analysis
Analyze unstructured data, such as text or images, to extract insights and meaning.
Validate
Check data for completeness, accuracy, and consistency.
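A sketch of simple completeness and range checks (dedicated frameworks such as Great Expectations go much further):

```python
import pandas as pd

df = pd.DataFrame({"email": ["a@x.com", None], "age": [25, -3]})

# Collect any rule violations instead of failing on the first one
problems = []
if df["email"].isna().any():
    problems.append("missing emails")
if not df["age"].between(0, 120).all():
    problems.append("ages out of range")
print(problems)
```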
Version
Maintain a history of changes to data for auditing and tracking purposes.
Wrangle
Convert unstructured data into a structured format.