Dagster Glossary | Data Orchestration Terms Explained

A/B Testing

A form of statistical hypothesis testing based on a randomized experiment with two variants, A and B, used to compare two models or strategies and determine which performs better.
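
As a rough illustration of the comparison step, the sketch below runs a two-proportion z-test on conversion counts from the two variants. The function name and the idea of measuring "conversions" are assumptions made for the example, and the z-test is only one of several tests that could be used.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / conv_b: number of conversions observed for each variant.
    n_a / n_b: number of users exposed to each variant.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Calling `two_proportion_z_test(100, 1000, 150, 1000)` returns a positive z-score and a small p-value, which would suggest that B's higher conversion rate is unlikely to be due to chance alone.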

Learn More

See Wikipedia

ACID Properties

The set of properties of database transactions intended to guarantee validity even in the event of errors or failures, encompassing Atomicity, Consistency, Isolation, and Durability.
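
Atomicity is the easiest of the four properties to demonstrate in a few lines. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `accounts` table, the `transfer` helper, and the balances are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint if src cannot cover the amount.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If the debit fails, the credit that ran just before it is rolled back as well, so the table never shows a half-completed transfer.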

Learn More

See Wikipedia

API (Application Programming Interface)

A set of rules and definitions that allow different software entities to communicate with each other.

Learn More

AWS Step Functions

A service that lets you coordinate AWS components, applications, and microservices using visual workflows.

Learn More

See vendor site

Aggregate

Combine data from multiple sources into a single dataset.

Learn More

Agile Methodology

An iterative approach to software development and project management that prioritizes flexibility and customer satisfaction, often used by data engineering teams to manage projects.

Learn More

See Wikipedia

Alation

A machine learning data catalog that helps people find, understand, and trust their data.

Learn More

Vendor Website

Align

Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.

Learn More

Aligning

Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.

Learn More

See Glossary entry

Amazon DynamoDB

A managed NoSQL database service provided by Amazon Web Services.

Learn More

Vendor Website

Amazon Kinesis

A platform to stream data on AWS, offering powerful services to make it easy to load and analyze streaming data.

Learn More

Vendor Website

Amazon Redshift

A fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL.

Learn More

Vendor Website

Amazon Web Services (AWS)

A cloud platform offering a broad set of global cloud-based products, including compute, storage, databases, analytics, networking, mobile, developer tools, and more.

Learn More

Vendor Website

Annotation

The process of adding metadata or explanatory notes to data, often used in machine learning to create labeled data for training models.

Learn More

Anomaly Detection

Identify data points or events that deviate significantly from expected patterns or behaviors.
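
One simple way to flag such deviations is a z-score check, sketched below with the standard library only. The function name and the default threshold of 3 standard deviations are assumptions for the example; real pipelines often need more robust methods.

```python
import statistics


def zscore_anomalies(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values are identical, so nothing deviates.
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]
```

For a series of twenty readings of 10 and a single reading of 100, only the 100 is returned.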

Learn More

Anonymize

Remove personal or identifying information from data.

Learn More

Apache Airflow

A platform to programmatically author, schedule, and monitor workflows of tasks.

Learn More

Apache Arrow

Apache Arrow is a cross-language development platform for in-memory data that specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

Learn More

Project website

Apache Atlas

A scalable and extensible set of core foundational governance services, enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop.

Learn More

Project website

Apache Camel

An open-source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data.

Learn More

Project website

Apache Flink

A framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

Learn More

Project website

Apache Hadoop

A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Learn More

Project website

Apache Kafka

A distributed streaming platform capable of handling trillions of events a day.

Learn More

Project website

Apache NiFi

A tool designed to automate the flow of data between software systems.

Learn More

Project website

Apache Pulsar

A highly scalable, low-latency messaging platform running on commodity hardware.

Learn More

Project website

Apache Samza

A stream processing framework for running applications that process data as it is created.

Learn More

Project website

Apache Spark

A fast and general-purpose cluster computing system, providing high-level APIs in Java, Scala, Python, and R.

Learn More

Project website

Apache Storm

A free and open-source distributed real-time computation system.

Learn More

Project website

Append

Adding or attaching new records or data items to the end of an existing dataset, database table, file, or list.

Learn More

Argo

An open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes.

Learn More

Vendor Website

Association Rule Mining

A machine learning method aimed at identifying interesting relations between variables (items or events) in large databases, frequently used for market basket analysis.

Learn More

Find out more

AsyncIO

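Python's standard library for asynchronous I/O, which speeds up I/O-bound programs by letting a single thread overlap many waiting operations.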
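
A minimal sketch of the idea using Python's `asyncio` module follows. The `fetch` coroutine is a hypothetical stand-in for a network or disk call, with `asyncio.sleep` simulating the wait.

```python
import asyncio
import time


async def fetch(name, delay):
    """Pretend to fetch a resource; the sleep stands in for I/O wait time."""
    await asyncio.sleep(delay)
    return name


async def fetch_all():
    # gather() runs the three coroutines concurrently and keeps result order.
    return await asyncio.gather(
        fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1)
    )


def run_demo():
    """Run the three fetches and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(fetch_all())
    return results, time.perf_counter() - start
```

Because the three waits overlap, the elapsed time should be close to the longest single delay rather than the sum of all three.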

Learn More

Augment

Add new data or information to an existing dataset to enhance its value.

Learn More

Augmented Data Management

The use of AI and ML technologies to optimize and enhance data management tasks, improving data quality and metadata development.

Learn More

Find out more

Auto-materialize

The automatic execution of computations and the persistence of their results.

Learn More

Automated Machine Learning (AutoML)

The automation of the end-to-end process of applying machine learning to real-world problems, facilitating the development of ML models by experts and non-experts alike.

Learn More

Learn more

Avro

A compact, fast binary serialization format developed within the Apache Hadoop project and suited to serializing large amounts of data. It uses JSON to define data types and protocols, and serializes the data itself in a compact binary form.

Learn More

Visit the project

BSON (Binary JSON)

A binary-encoded serialization of JSON-like documents used to store documents and make remote procedure calls in MongoDB. BSON supports embedded documents and arrays, offering additional data types not supported by JSON.

Learn More

Backend-as-a-Service (BaaS)

A cloud computing service model that acts as middleware, giving developers ways to connect their web and mobile applications to cloud services via application programming interfaces (APIs) and software development kits (SDKs).

Learn More

See Wikipedia

Backpressure

A mechanism to handle situations where data is produced faster than it can be consumed.
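
A bounded buffer is one common way to apply backpressure: when the buffer is full, the producer is forced to wait. The sketch below uses the standard library's `queue.Queue`; the function name, buffer size, and artificial consumer delay are assumptions for the example.

```python
import queue
import threading
import time


def run_pipeline(n_items=10, buffer_size=2):
    """Run a fast producer against a slow consumer through a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_size)
    consumed = []
    max_depth = 0

    def producer():
        nonlocal max_depth
        for i in range(n_items):
            buffer.put(i)  # blocks while the buffer is full
            max_depth = max(max_depth, buffer.qsize())
        buffer.put(None)  # sentinel: no more items

    def consumer():
        while (item := buffer.get()) is not None:
            time.sleep(0.01)  # simulate slow processing
            consumed.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed, max_depth
```

The producer never gets more than `buffer_size` items ahead of the consumer, so memory use stays bounded no matter how fast data arrives.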

Learn More

Backup

Create a copy of data to protect against loss or corruption.

Learn More

Batch Processing

Process large volumes of data all at once in a single operation or batch.

Learn More

Big Data

Refers to extremely large datasets that can be analyzed for patterns, trends, and associations, typically involving varied and complex structures. What constitutes 'big' is debated, but a rule of thumb is a volume of data that cannot be analyzed on a single machine.

Learn More

See Wikipedia

Big Data Processing

Process large volumes of data in parallel and distributed computing environments to improve performance.

Learn More

Big O Notation

A mathematical notation used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, primarily used to classify algorithms by how they respond to changes in input size.

Learn More

See Wikipedia

Binary Tree

A tree data structure in which each node has at most two children, referred to as the left child and the right child.
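
The structure can be sketched in a few lines of Python. The `Node` class and `in_order` helper below are illustrative names, not part of any particular library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None   # left child, if any
    right: Optional["Node"] = None  # right child, if any


def in_order(node):
    """Visit the left subtree, then the node itself, then the right subtree."""
    if node is None:
        return []
    return in_order(node.left) + [node.value] + in_order(node.right)
```

For example, `in_order(Node(2, Node(1), Node(3)))` visits the left child, the root, and then the right child.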

Learn More

See Wikipedia

Bitwise Operation

Operations that manipulate a value at the level of its individual bits, such as AND, OR, XOR, NOT, and shifts.
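
A common use is packing several on/off flags into one integer. The flag names below are hypothetical and chosen only for the example.

```python
# Each flag occupies its own bit.
READ, WRITE, EXECUTE = 0b001, 0b010, 0b100

perms = READ | WRITE             # OR sets bits: 0b011
can_write = bool(perms & WRITE)  # AND tests whether a bit is set
read_only = perms & ~WRITE       # AND with NOT clears a bit: 0b001
toggled = perms ^ EXECUTE        # XOR flips a bit: 0b111
shifted = 1 << 3                 # left shift by 3 multiplies by 2**3
```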

Learn More

See Wikipedia

Blend

A term coined by data analytics vendors to describe the process of combining data from multiple sources to create a cohesive, unified dataset. Typically used in the context of data analysis and business intelligence.

Learn More

Blockchain

A system of recording information in a way that makes it difficult or impossible to change, hack, or cheat the system. A blockchain is a digital ledger of transactions that is duplicated and distributed across the entire network of computer systems on the blockchain.

Learn More

See Wikipedia

Broadcast

A method in parallel computing where data is sent from one point (a root node) to all other nodes in the topology.

Learn More

Broadcasting

A method in distributed computing to send the same message to all nodes in a network.

Learn More

Bucketing

A method for dividing a dataset into discrete buckets or bins to separate it into roughly equal parts based on some characteristic.
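
One way to do this is hash bucketing, sketched below: each record's key is hashed and the hash decides the bucket. The function names are illustrative, and `zlib.crc32` is used only because it gives the same result on every run; buckets come out roughly equal in size only when there are many distinct keys.

```python
import zlib
from collections import defaultdict


def bucket_for(key, n_buckets):
    """Map a string key to a bucket index in the range [0, n_buckets)."""
    return zlib.crc32(key.encode("utf-8")) % n_buckets


def bucketize(keys, n_buckets):
    """Group keys by their bucket index."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[bucket_for(key, n_buckets)].append(key)
    return dict(buckets)
```

Because the bucket depends only on the key, the same key always lands in the same bucket, which is what makes bucketed joins and lookups possible.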

Learn More

Bulk Extract

The process of extracting large amounts of data from a database in a single transaction.

Learn More

Business Intelligence (BI)

A set of strategies and technologies used by enterprises for the data analysis of business information, helping companies make more informed business decisions.

Learn More

CAP Theorem

A theorem in computer science stating that a distributed system cannot simultaneously provide more than two of three guarantees: Consistency, Availability, and Partition Tolerance.

Learn More

CBOR (Concise Binary Object Representation)

A binary format that encodes data more efficiently and compactly than JSON. It is designed to serialize and deserialize complex data structures efficiently without losing the schema-free property of JSON.

Learn More

CRON

A time-based job scheduler in Unix-like computer operating systems for scheduling periodic jobs at fixed times, dates, or intervals.
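
Schedules are written as five-field expressions: minute, hour, day of month, month, and day of week. The sketch below is a deliberately simplified matcher written for illustration, not how cron itself is implemented. It handles `*`, single numbers, ranges, lists, and `/step`, and it ignores details such as month and weekday names, `7` for Sunday, and cron's special handling when both day fields are restricted.

```python
from datetime import datetime


def _field_matches(field, value, minimum):
    """Check one cron field (e.g. '*/15' or '1-5') against a value."""
    for part in field.split(","):
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            lo, hi = minimum, None
        elif "-" in base:
            lo, hi = map(int, base.split("-"))
        else:
            lo = hi = int(base)
        if value < lo or (hi is not None and value > hi):
            continue
        if (value - lo) % step == 0:
            return True
    return False


def cron_matches(expr, when):
    """Return True if the datetime `when` satisfies the five-field expression."""
    minute, hour, dom, month, dow = expr.split()
    # cron counts Sunday as 0; Python's weekday() counts Monday as 0.
    cron_dow = (when.weekday() + 1) % 7
    return (
        _field_matches(minute, when.minute, 0)
        and _field_matches(hour, when.hour, 0)
        and _field_matches(dom, when.day, 1)
        and _field_matches(month, when.month, 1)
        and _field_matches(dow, cron_dow, 0)
    )
```

With this matcher, `"0 9 * * 1-5"` means 09:00 on Monday through Friday, and `"*/15 * * * *"` means every 15 minutes.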

Learn More

CSV (Comma Separated Values)

A simple, plain-text file format used to store tabular data, where each line represents a data record, and each record consists of one or more fields, separated by commas. Suitable for a wide range of applications due to its simplicity, but lacks a standard schema, which can lead to parsing errors.
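
The parsing pitfalls are easy to reproduce. The sketch below, using made-up sample data, compares Python's `csv` module with a naive split on commas for a record whose first field contains a quoted comma.

```python
import csv
import io

# Hypothetical sample data: the first record has a comma inside a quoted field.
raw = 'name,city,amount\n"Smith, Jane",Boston,42\nLee,"Austin",7\n'

# The csv module understands quoting and keeps "Smith, Jane" as one field.
rows = list(csv.DictReader(io.StringIO(raw)))

# A naive split treats the quoted comma as a separator and yields 4 fields.
naive = raw.splitlines()[1].split(",")
```

Note that every value in `rows` is a string, including `amount`: with no schema in the file, type conversion is left to the reader.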

Learn More

CURL

A command-line tool and library for transferring data with URLs, supporting various protocols like HTTP, FTP, and more.

Learn More
