Dagster Data Engineering Glossary: Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
Genetic Algorithm
A search heuristic inspired by Charles Darwin's theory of natural selection, used to find approximate solutions to optimization and search problems.
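As an illustrative sketch (not from the glossary itself), a minimal genetic algorithm in Python can evolve random bit strings toward a simple fitness target; the population size, mutation rate, and fitness function below are arbitrary choices:

```python
import random

# Minimal genetic algorithm sketch: evolve bit strings toward all ones by
# maximizing the number of 1 bits. All constants here are illustrative.
TARGET_LENGTH = 20
POPULATION_SIZE = 30
GENERATIONS = 50
MUTATION_RATE = 0.05

def fitness(individual):
    """Fitness is the count of 1 bits; the global optimum is the all-ones string."""
    return sum(individual)

def crossover(parent_a, parent_b):
    """Single-point crossover combines two parents into one child."""
    point = random.randint(1, TARGET_LENGTH - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(individual):
    """Flip each bit with a small probability to keep exploring the search space."""
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in individual]

population = [[random.randint(0, 1) for _ in range(TARGET_LENGTH)]
              for _ in range(POPULATION_SIZE)]

for _ in range(GENERATIONS):
    # Selection: keep the fittest half of the population as parents.
    population.sort(key=fitness, reverse=True)
    parents = population[: POPULATION_SIZE // 2]
    # Reproduction: build the next generation from mutated offspring of random parents.
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(POPULATION_SIZE)]

best = max(population, key=fitness)
print(f"Best fitness after {GENERATIONS} generations: {fitness(best)} / {TARGET_LENGTH}")
```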
Geo-replication
Replication of datasets across geographical locations, primarily for data resilience and availability purposes.
Geospatial Analysis
The analysis of data with geographic or spatial components to identify patterns and relationships.
Git
A free and open-source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
GitHub
A web-based platform that provides hosting for software development and a community of developers to work together and share code.
Google BigQuery
A fully-managed, serverless data warehouse that enables super-fast SQL queries using the processing power of Google's infrastructure.
Google Cloud Platform (GCP)
A provider of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, and YouTube.
Gradient Boosting
A machine learning technique for regression and classification problems that builds a model in stages, with each new stage (typically a decision tree) correcting the errors of the ensemble built so far, optimizing for predictive accuracy.
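As a hedged example (assuming scikit-learn is installed; the synthetic dataset and hyperparameter values are illustrative only), gradient boosting can be tried in a few lines:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data stands in for a real dataset.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 100 stages fits a shallow tree to the errors of the ensemble so far.
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```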
Graph Database
A database designed to treat the relationships between data as equally important to the data itself, used to store data whose relations are best represented as a graph.
Graph Processing
A type of data processing that uses graph theory to analyze and visually represent data relationships.
Graph Theory
The mathematical study of graphs, structures made up of nodes (vertices) connected by edges, used to model and understand intricate relationships within data systems.
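As a small illustration (the node names are invented for the example), a graph can be represented as an adjacency list and traversed with breadth-first search to find which nodes are reachable from a starting point:

```python
from collections import deque

# Adjacency list: each node maps to the nodes it points at.
graph = {
    "raw_events": ["cleaned_events"],
    "cleaned_events": ["daily_metrics", "user_profiles"],
    "daily_metrics": [],
    "user_profiles": [],
}

def reachable_from(start):
    """Breadth-first search: visit every node reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

print(reachable_from("raw_events"))  # all four nodes are reachable from raw_events
```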
Greedy Algorithm
An algorithmic paradigm that makes locally optimal choices at each stage with the hope of finding the global optimum.
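A minimal sketch of the idea: greedy coin change repeatedly takes the largest coin that still fits. With the denominations below the greedy choice happens to be globally optimal, though that is not true for every coin system:

```python
def greedy_change(amount, denominations=(25, 10, 5, 1)):
    """At each step, take the largest coin that fits (the locally optimal choice)."""
    coins = []
    for coin in sorted(denominations, reverse=True):
        while amount >= coin:
            amount -= coin
            coins.append(coin)
    return coins

print(greedy_change(68))  # [25, 25, 10, 5, 1, 1, 1]
```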
Grid Computing
A form of distributed computing in which a virtual 'supercomputer' is composed of clustered, networked, loosely coupled computers acting in parallel to perform very large tasks.
Grid Search
An approach to hyperparameter tuning that methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.
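A brief sketch using scikit-learn's GridSearchCV (assuming scikit-learn is installed; the estimator and parameter grid are arbitrary examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One model is built and cross-validated for every combination in the grid (3 x 2 = 6 here).
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```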
HDF5 (Hierarchical Data Format version 5)
A file format and set of tools for managing complex data. It is designed for flexible, efficient I/O with high-volume, complex datasets, and it supports an unlimited variety of datatypes.
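A small sketch using the h5py library (an assumption: h5py and NumPy must be installed; file, group, and dataset names are illustrative):

```python
import h5py
import numpy as np

# Write: datasets are organized hierarchically into groups, like files in folders.
with h5py.File("example.h5", "w") as f:
    sensors = f.create_group("sensors")
    sensors.create_dataset("temperature", data=np.random.rand(1_000), compression="gzip")

# Read: address a dataset by its path within the file.
with h5py.File("example.h5", "r") as f:
    temps = f["sensors/temperature"][:]
    print(temps.shape)
```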
HTML Parsing
Analyzing HTML code to extract relevant information and understand the structure of the content, often used in web scraping.
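A minimal sketch using Python's built-in html.parser module (the HTML snippet is made up for the example):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkExtractor()
parser.feed('<p>See <a href="https://dagster.io">Dagster</a> and <a href="/docs">the docs</a>.</p>')
print(parser.links)  # ['https://dagster.io', '/docs']
```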
Hadoop Distributed File System (HDFS)
A distributed file system designed to run on commodity hardware, providing high-throughput access to application data and fault tolerance.
Hash Function
A function that converts an input into a fixed-size string of bytes, typically a digest that is effectively unique to the given input.
Hashing
The process of transforming input of any length into a fixed-size string of text, typically using a hash function.
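A short sketch with Python's standard hashlib module, covering both terms above: whatever the input length, the digest has a fixed size, and the same input always produces the same digest:

```python
import hashlib

digest = hashlib.sha256(b"data engineering").hexdigest()
print(digest)       # 64 hex characters, regardless of input length
print(len(digest))  # 64

# A tiny change to the input produces a completely different digest.
print(hashlib.sha256(b"data engineering!").hexdigest())
```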
Heap
A specialized tree-based data structure that satisfies the heap property, used in computer memory management and in the heapsort algorithm.
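A brief sketch with Python's heapq module, which implements a binary min-heap on top of a plain list (the job names are illustrative):

```python
import heapq

jobs = [(5, "archive"), (1, "ingest"), (3, "transform")]
heapq.heapify(jobs)                   # reorder the list so the smallest item sits at index 0
heapq.heappush(jobs, (2, "validate"))

while jobs:
    priority, name = heapq.heappop(jobs)  # always removes the smallest (highest-priority) item
    print(priority, name)                 # 1 ingest, 2 validate, 3 transform, 5 archive
```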
Helm
A package manager for Kubernetes that allows developers and operators to more easily package, configure, and deploy applications and services onto Kubernetes clusters.
Heterogeneous Database System
A system that uses middleware to connect databases that are not alike and are running on different DBMSs, possibly on different platforms.
Hierarchical Database Model
A data model where data is organized into a tree-like structure with a single root, to which all other data is linked in a hierarchy.
High Availability
A characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period.
High Cardinality
A term used to describe the uniqueness of data values contained in a column. If a column has a high number of unique values, it is said to have high cardinality.
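A quick sketch with pandas (assuming pandas is installed; the column names are invented): counting distinct values per column is a simple proxy for cardinality:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1001, 1002, 1003, 1004, 1005],  # high cardinality: every value is unique
    "country": ["US", "US", "DE", "US", "DE"],  # low cardinality: few distinct values
})

print(df.nunique())  # user_id: 5, country: 2
```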
High-Availability Systems
Systems designed to be operational and accessible for longer periods, minimizing downtime and ensuring continuous service.
Homogeneous Database System
A system where all databases are based on the same DBMS technology.
Horizontal Scaling
Adding more machines to a network to improve the capability to handle more load and perform better, also known as scaling out.
Hortonworks
A former provider of open-source data management and analytics platforms built on Apache Hadoop; Hortonworks merged with Cloudera in 2019.
Hot storage
The immediate, high-speed storage of data that is frequently accessed and modified, enabling rapid retrieval and updates.
Huge Pages
Memory pages that are larger than the standard memory page size, which can reduce address-translation overhead when working with large amounts of memory.
Hybrid Cloud
An IT architecture that incorporates some degree of workload portability, orchestration, and management across a mix of on-premises data centers, private clouds, and public clouds.
Hyperparameter
A configuration value that is external to the model and cannot be estimated from the data; hyperparameters are set before training and guide the process of estimating model parameters.
Hyperparameter Tuning
The process of optimizing the configuration parameters of a machine learning model, called hyperparameters, to improve model performance on a given metric.
Hypervisor
A piece of software, firmware, or hardware that creates and runs virtual machines (VMs).
Idempotence
A property of certain operations in mathematics and computer science, whereby they can be applied multiple times without changing the result beyond the initial application.
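A minimal sketch of the idea: setting an absolute value is idempotent (repeating it changes nothing), whereas incrementing a value is not:

```python
inventory = {}

def set_stock(item, quantity):
    """Idempotent: applying this once or many times leaves the same state."""
    inventory[item] = quantity

set_stock("widgets", 10)
set_stock("widgets", 10)
set_stock("widgets", 10)
print(inventory)  # {'widgets': 10}, the same result as after the first call
```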
Immutable Data
Data that, once created, cannot be changed. Any modification requires creating a new instance.
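A small sketch: a frozen dataclass models an immutable record, so "modifying" it means creating a new instance (the field names are illustrative):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Event:
    user_id: int
    action: str

event = Event(user_id=42, action="login")
updated = replace(event, action="logout")  # returns a new Event; the original is unchanged
print(event, updated)
# Assigning event.action = "purchase" would raise dataclasses.FrozenInstanceError.
```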
Impala
An open-source, native analytic database for Apache Hadoop, providing high-performance, low-latency SQL queries on Hadoop data.
Imputation
The process of replacing missing data with substituted values, allowing more robust analysis when dealing with incomplete datasets.
Impute
Fill in missing data values with estimated substitutes to facilitate analysis.
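A short sketch with pandas (assuming pandas is installed; the data and the choice of mean imputation are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"temperature": [21.0, None, 23.5, None, 22.0]})

# Replace missing values with the column mean so downstream analysis sees no gaps.
df["temperature_imputed"] = df["temperature"].fillna(df["temperature"].mean())
print(df)
```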
In-Memory Database (IMDB)
A database management system that primarily relies on main memory for data storage, making it typically faster than databases that rely on disk storage.
Indexing
The process of creating a data structure (an index) to improve the speed of data retrieval operations on a database.
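A compact sketch using Python's built-in sqlite3 module (the table and column names are made up): after the index is created, lookups by customer_id can use a B-tree search instead of a full table scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 100, i * 1.5) for i in range(10_000)])

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 7").fetchall()
print(plan)  # the query plan should mention idx_orders_customer
```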
Informatica
A provider of proprietary (closed-source) data management and data integration solutions.
Information Retrieval
The process of obtaining information from a repository, often concerning text-based search.
Infrastructure as Code (IaC)
A key DevOps practice that involves managing and provisioning computing infrastructure through machine-readable script files, rather than through physical hardware configuration or interactive configuration tools.
Ingest
The initial collection and import of data from various sources into your processing environment.
Ingestion
The process of importing, transferring, loading, and processing data for later use or storage in a database.
Input/Output Operations Per Second (IOPS)
A common performance measurement used to benchmark computer storage devices like hard disk drives (HDD), solid-state drives (SSD), and storage area networks (SAN).
Instance
A single occurrence of an object, often referring to virtual machines (VMs) or individual database items.
Integrate
Combine data from different sources to create a unified view for analysis or reporting.
Integration Testing
A level of software testing where individual units are combined and tested as a group, to expose faults in the interaction between integrated units.
Integrity Constraints
Rules applied to maintain the quality and accuracy of the data inside a database, such as uniqueness, referential integrity, and check constraints.
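A brief sketch with Python's sqlite3 module showing uniqueness, referential, and check constraints (the schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        email TEXT UNIQUE NOT NULL                      -- uniqueness constraint
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),   -- referential integrity
        total       REAL CHECK (total >= 0)             -- check constraint
    );
""")

conn.execute("INSERT INTO customers (email) VALUES ('a@example.com')")
try:
    conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, -5)")
except sqlite3.IntegrityError as exc:
    print("Rejected:", exc)  # the CHECK constraint blocks the negative total
```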
Interactive Query
A query mechanism allowing users to ask spontaneous questions and receive rapid responses, used in analyzing datasets.