Dagster Data Engineering Glossary: Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
Genetic Algorithm
A search heuristic inspired by Charles Darwin's theory of natural selection, used to find approximate solutions to optimization and search problems.
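As an illustrative sketch (not from the glossary itself), a minimal genetic algorithm in Python can evolve random bit strings toward a simple fitness target; the population size, mutation rate, and fitness function below are arbitrary choices:

```python
import random

# Minimal genetic algorithm sketch: evolve bit strings toward all ones by
# maximizing the number of 1 bits. All constants here are illustrative.
TARGET_LENGTH = 20
POPULATION_SIZE = 30
GENERATIONS = 50
MUTATION_RATE = 0.05

def fitness(individual):
    """Fitness is the count of 1 bits; the global optimum is the all-ones string."""
    return sum(individual)

def crossover(parent_a, parent_b):
    """Single-point crossover combines two parents into one child."""
    point = random.randint(1, TARGET_LENGTH - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(individual):
    """Flip each bit with a small probability to keep exploring the search space."""
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in individual]

population = [[random.randint(0, 1) for _ in range(TARGET_LENGTH)]
              for _ in range(POPULATION_SIZE)]

for _ in range(GENERATIONS):
    # Selection: keep the fittest half of the population as parents.
    population.sort(key=fitness, reverse=True)
    parents = population[: POPULATION_SIZE // 2]
    # Reproduction: build the next generation from mutated offspring of random parents.
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(POPULATION_SIZE)]

best = max(population, key=fitness)
print(f"Best fitness after {GENERATIONS} generations: {fitness(best)} / {TARGET_LENGTH}")
```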
Geo-replication
Replication of datasets across geographical locations, primarily for data resilience and availability purposes.
Geospatial Analysis
The analysis of data with geographic or spatial components to identify patterns and relationships.
Git
A free and open-source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
GitHub
A web-based platform that provides hosting for software development and a community of developers to work together and share code.
Google BigQuery
A fully-managed, serverless data warehouse that enables super-fast SQL queries using the processing power of Google's infrastructure.
Google Cloud Platform (GCP)
A provider of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, and YouTube.
Gradient Boosting
A machine learning technique for regression and classification problems that builds a model in stages, with each new stage (typically a decision tree) correcting the errors of the ensemble built so far, optimizing for predictive accuracy.
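As a hedged example (assuming scikit-learn is installed; the synthetic dataset and hyperparameter values are illustrative only), gradient boosting can be tried in a few lines:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data stands in for a real dataset.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 100 stages fits a shallow tree to the errors of the ensemble so far.
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```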
Graph Database
A database designed to treat the relationships between data as equally important to the data itself, used to store data whose relations are best represented as a graph.
Graph Processing
A type of data processing that uses graph theory to analyze and visually represent data relationships.
Graph Theory
The mathematical study of graphs, structures made up of nodes (vertices) connected by edges, used to model and understand intricate relationships within data systems.
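As a small illustration (the node names are invented for the example), a graph can be represented as an adjacency list and traversed with breadth-first search to find which nodes are reachable from a starting point:

```python
from collections import deque

# Adjacency list: each node maps to the nodes it points at.
graph = {
    "raw_events": ["cleaned_events"],
    "cleaned_events": ["daily_metrics", "user_profiles"],
    "daily_metrics": [],
    "user_profiles": [],
}

def reachable_from(start):
    """Breadth-first search: visit every node reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

print(reachable_from("raw_events"))  # all four nodes are reachable from raw_events
```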
Greedy Algorithm
An algorithmic paradigm that makes locally optimal choices at each stage with the hope of finding the global optimum.
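A minimal sketch of the idea: greedy coin change repeatedly takes the largest coin that still fits. With the denominations below the greedy choice happens to be globally optimal, though that is not true for every coin system:

```python
def greedy_change(amount, denominations=(25, 10, 5, 1)):
    """At each step, take the largest coin that fits (the locally optimal choice)."""
    coins = []
    for coin in sorted(denominations, reverse=True):
        while amount >= coin:
            amount -= coin
            coins.append(coin)
    return coins

print(greedy_change(68))  # [25, 25, 10, 5, 1, 1, 1]
```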
Grid Computing
A form of distributed computing in which a virtual 'supercomputer' is composed of clustered, networked, loosely coupled computers acting in parallel to perform very large tasks.
Grid Search
An approach to hyperparameter tuning that methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.
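A brief sketch using scikit-learn's GridSearchCV (assuming scikit-learn is installed; the estimator and parameter grid are arbitrary examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One model is built and cross-validated for every combination in the grid (3 x 2 = 6 here).
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```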
HDF5 (Hierarchical Data Format version 5)
A file format and set of tools for managing complex data. It is designed for flexible, efficient I/O with high-volume, complex datasets, and it supports an unlimited variety of datatypes.
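A small sketch using the h5py library (an assumption: h5py and NumPy must be installed; file, group, and dataset names are illustrative):

```python
import h5py
import numpy as np

# Write: datasets are organized hierarchically into groups, like files in folders.
with h5py.File("example.h5", "w") as f:
    sensors = f.create_group("sensors")
    sensors.create_dataset("temperature", data=np.random.rand(1_000), compression="gzip")

# Read: address a dataset by its path within the file.
with h5py.File("example.h5", "r") as f:
    temps = f["sensors/temperature"][:]
    print(temps.shape)
```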
HTML Parsing
Analyzing HTML code to extract relevant information and understand the structure of the content, often used in web scraping.
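A minimal sketch using Python's built-in html.parser module (the HTML snippet is made up for the example):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkExtractor()
parser.feed('<p>See <a href="https://dagster.io">Dagster</a> and <a href="/docs">the docs</a>.</p>')
print(parser.links)  # ['https://dagster.io', '/docs']
```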
Hadoop Distributed File System (HDFS)
A distributed file system designed to run on commodity hardware, providing high-throughput access to application data and fault tolerance.
Hash Function
A function that converts an input into a fixed-size string of bytes, typically a digest that is effectively unique to the given input.
Hashing
The process of transforming input of any length into a fixed-size string of text, typically using a hash function.
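A short sketch with Python's standard hashlib module, covering both terms above: whatever the input length, the digest has a fixed size, and the same input always produces the same digest:

```python
import hashlib

digest = hashlib.sha256(b"data engineering").hexdigest()
print(digest)       # 64 hex characters, regardless of input length
print(len(digest))  # 64

# A tiny change to the input produces a completely different digest.
print(hashlib.sha256(b"data engineering!").hexdigest())
```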
Heap
A specialized tree-based data structure that satisfies the heap property, used in computer memory management and in the heapsort algorithm.
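A brief sketch with Python's heapq module, which implements a binary min-heap on top of a plain list (the job names are illustrative):

```python
import heapq

jobs = [(5, "archive"), (1, "ingest"), (3, "transform")]
heapq.heapify(jobs)                   # reorder the list so the smallest item sits at index 0
heapq.heappush(jobs, (2, "validate"))

while jobs:
    priority, name = heapq.heappop(jobs)  # always removes the smallest (highest-priority) item
    print(priority, name)                 # 1 ingest, 2 validate, 3 transform, 5 archive
```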
Helm
A package manager for Kubernetes that allows developers and operators to more easily package, configure, and deploy applications and services onto Kubernetes clusters.
Heterogeneous Database System
A system that uses middleware to connect databases that are not alike and are running on different DBMSs, possibly on different platforms.
Hierarchical Database Model
A data model where data is organized into a tree-like structure with a single root, to which all other data is linked in a hierarchy.
High Availability
A characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period.
High Cardinality
A term used to describe the uniqueness of data values contained in a column. If a column has a high number of unique values, it is said to have high cardinality.
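A quick sketch with pandas (assuming pandas is installed; the column names are invented): counting distinct values per column is a simple proxy for cardinality:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1001, 1002, 1003, 1004, 1005],  # high cardinality: every value is unique
    "country": ["US", "US", "DE", "US", "DE"],  # low cardinality: few distinct values
})

print(df.nunique())  # user_id: 5, country: 2
```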
High-Availability Systems
Systems designed to be operational and accessible for longer periods, minimizing downtime and ensuring continuous service.
Homogeneous Database System
A system where all databases are based on the same DBMS technology.
Horizontal Scaling
Adding more machines to a network to improve the capability to handle more load and perform better, also known as scaling out.
Hortonworks
A former provider of open-source data management and analytics platforms built on Apache Hadoop; Hortonworks merged with Cloudera in 2019.
Hot storage
The immediate, high-speed storage of data that is frequently accessed and modified, enabling rapid retrieval and updates.
Huge Pages
Memory pages that are larger than the standard memory page size, which can reduce address-translation overhead when working with large amounts of memory.
Hybrid Cloud
An IT architecture that incorporates some degree of workload portability, orchestration, and management across a mix of on-premises data centers, private clouds, and public clouds.
Hyperparameter
A configuration value that is external to the model and cannot be estimated from the data; hyperparameters are set before training and guide the process of estimating model parameters.
Hyperparameter Tuning
The process of optimizing the configuration parameters of a machine learning model, called hyperparameters, to improve model performance on a given metric.
Hypervisor
A piece of software, firmware, or hardware that creates and runs virtual machines (VMs).
Idempotence
A property of certain operations in mathematics and computer science, whereby they can be applied multiple times without changing the result beyond the initial application.
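A minimal sketch of the idea: setting an absolute value is idempotent (repeating it changes nothing), whereas incrementing a value is not:

```python
inventory = {}

def set_stock(item, quantity):
    """Idempotent: applying this once or many times leaves the same state."""
    inventory[item] = quantity

set_stock("widgets", 10)
set_stock("widgets", 10)
set_stock("widgets", 10)
print(inventory)  # {'widgets': 10}, the same result as after the first call
```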
Immutable Data
Data that, once created, cannot be changed. Any modification requires creating a new instance.
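A small sketch: a frozen dataclass models an immutable record, so "modifying" it means creating a new instance (the field names are illustrative):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Event:
    user_id: int
    action: str

event = Event(user_id=42, action="login")
updated = replace(event, action="logout")  # returns a new Event; the original is unchanged
print(event, updated)
# Assigning event.action = "purchase" would raise dataclasses.FrozenInstanceError.
```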
Impala
An open-source, native analytic database for Apache Hadoop, providing high-performance, low-latency SQL queries on Hadoop data.
Imputation
The process of replacing missing data with substituted values, allowing more robust analysis when dealing with incomplete datasets.
Impute
Fill in missing data values with estimated substitutes to facilitate analysis.
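A short sketch with pandas (assuming pandas is installed; the data and the choice of mean imputation are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"temperature": [21.0, None, 23.5, None, 22.0]})

# Replace missing values with the column mean so downstream analysis sees no gaps.
df["temperature_imputed"] = df["temperature"].fillna(df["temperature"].mean())
print(df)
```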
In-Memory Database (IMDB)
A database management system that primarily relies on main memory for data storage, making it typically faster than databases that rely on disk storage.
Indexing
The process of creating a data structure (an index) to improve the speed of data retrieval operations on a database.
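A compact sketch using Python's built-in sqlite3 module (the table and column names are made up): after the index is created, lookups by customer_id can use a B-tree search instead of a full table scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 100, i * 1.5) for i in range(10_000)])

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 7").fetchall()
print(plan)  # the query plan should mention idx_orders_customer
```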
Informatica
A provider of proprietary (closed-source) data management and data integration solutions.
Information Retrieval
The process of obtaining information from a repository, often concerning text-based search.
Infrastructure as Code (IaC)
A key DevOps practice that involves managing and provisioning computing infrastructure through machine-readable script files, rather than through physical hardware configuration or interactive configuration tools.
Ingest
The initial collection and import of data from various sources into your processing environment.
Ingestion
The process of importing, transferring, loading, and processing data for later use or storage in a database.
Input/Output Operations Per Second (IOPS)
A common performance measurement used to benchmark computer storage devices like hard disk drives (HDD), solid-state drives (SSD), and storage area networks (SAN).
Instance
A single occurrence of an object, often referring to virtual machines (VMs) or individual database items.
Integrate
Combine data from different sources to create a unified view for analysis or reporting.
Integration Testing
A level of software testing where individual units are combined and tested as a group, to expose faults in the interaction between integrated units.
Integrity Constraints
Rules applied to maintain the quality and accuracy of the data inside a database, such as uniqueness, referential integrity, and check constraints.
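A brief sketch with Python's sqlite3 module showing uniqueness, referential, and check constraints (the schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        email TEXT UNIQUE NOT NULL                      -- uniqueness constraint
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),   -- referential integrity
        total       REAL CHECK (total >= 0)             -- check constraint
    );
""")

conn.execute("INSERT INTO customers (email) VALUES ('a@example.com')")
try:
    conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, -5)")
except sqlite3.IntegrityError as exc:
    print("Rejected:", exc)  # the CHECK constraint blocks the negative total
```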
Interactive Query
A query mechanism allowing users to ask spontaneous questions and receive rapid responses, used in analyzing datasets.