Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
Cursor
A database object used to traverse the results of a SQL query, allowing individual rows to be accessed.
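For illustration, Python's built-in `sqlite3` module exposes this pattern directly; the table and rows below are hypothetical:

```python
import sqlite3

# Hypothetical in-memory table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

cur = conn.cursor()            # the cursor traverses the result set
cur.execute("SELECT id, name FROM users ORDER BY id")
first = cur.fetchone()         # advance one row at a time
rest = cur.fetchall()          # consume the remaining rows
conn.close()
```

Fetching row by row lets a program process large result sets without loading everything into memory at once.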
Cybersecurity
The practice of protecting systems, networks, and programs from digital attacks aimed at accessing, changing, or destroying sensitive information.
DAG (Directed Acyclic Graph)
A finite directed graph with no directed cycles, used extensively in representing data flow in data processing systems like Apache Airflow.
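The execution order implied by a DAG can be recovered with a topological sort. A minimal sketch using Python's standard library, with hypothetical step names:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key lists the steps it depends on.
dag = {
    "extract": [],
    "transform": ["extract"],
    "validate": ["extract"],
    "load": ["transform", "validate"],
}
# static_order() yields steps so every step runs after its dependencies.
order = list(TopologicalSorter(dag).static_order())
```

Because the graph is acyclic, such an ordering always exists; a cycle would make `static_order()` raise `CycleError`.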
Dagster
An open-source platform for defining, building, and managing critical data assets.
Data Aggregation
The process of gathering and summarizing information in a specified form, often used in statistical analysis.
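A minimal sketch of aggregation with the standard library, using hypothetical sales records rolled up into per-region totals:

```python
from collections import defaultdict

# Hypothetical (region, amount) records aggregated into per-region totals.
sales = [("east", 100), ("west", 250), ("east", 50)]
totals = defaultdict(int)
for region, amount in sales:
    totals[region] += amount
```

The same group-and-summarize pattern underlies SQL's `GROUP BY` and dataframe `groupby` operations.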
Data Allocation
The assignment of storage space to specific data, often in the context of distributed databases where data is allocated across multiple nodes.
Data Analytics
The science of analyzing raw data to make conclusions about that information.
Data Annotation
The process of adding explanatory notes or comments to data, often used in the context of machine learning to create labeled training data.
Data Architecture
The overall structure, organization, and rules used to manage and use data within an organization, including the arrangement of data and data processing.
Data Block
The smallest unit of data storage in a database, storing a set of rows or a subset of a table's columns.
Data Catalog
A centralized repository that allows for the management, collaboration, discovery, and consumption of organizational datasets, serving as a metadata inventory.
Data Dictionary
A collection of descriptions of the data objects or items in a data model for the benefit of programmers and others who need to refer to them.
Data Drift
A phenomenon where the statistical properties of incoming data change over time, potentially impacting model performance and accuracy.
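One simple detection heuristic compares summary statistics of incoming data against a baseline. A sketch with illustrative numbers and an assumed two-standard-deviation threshold:

```python
from statistics import mean, stdev

# Hypothetical baseline vs. incoming feature values; flag drift when the
# incoming mean moves more than two baseline standard deviations away.
baseline = [10.0, 11.0, 9.5, 10.5, 10.0]
incoming = [14.0, 15.5, 14.5, 15.0, 14.0]
shift = abs(mean(incoming) - mean(baseline))
drifted = shift > 2 * stdev(baseline)
```

Production systems typically use distribution-level tests (e.g. Kolmogorov-Smirnov or population stability index) rather than a bare mean comparison.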
Data Fabric
A unified architecture that provides a consistent and coherent set of capabilities and services across different environments.
Data Federation
The process of aggregating data from different sources to create a single, unified view.
Data Fusion
The process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source.
Data Governance
The overall management of the availability, usability, integrity, and security of data employed in an enterprise, involving a set of practices and policies.
Data Lake
A centralized storage repository that allows storing structured and unstructured data at any scale, usually used for big data and real-time analytics.
Data Lakehouse
A modern data architecture that combines the best elements of data lakes and data warehouses, enabling efficient handling of both structured and unstructured data.
Data Lifecycle
The journey that data goes through from creation and initial storage to the time it becomes obsolete and is deleted.
Data Lifecycle Management
The process of managing the flow of data throughout its lifecycle from creation and initial storage to the time it is archived or deleted.
Data Mart
A subset of a data warehouse that is designed for a specific line of business or department within an organization.
Data Mesh
A decentralized approach to data architecture and organizational structure that treats data as a product and emphasizes domain-oriented decentralized data ownership and architecture.
Data Ops
An automated, process-oriented methodology used to improve the quality and reduce the cycle time of data analytics.
Data Pipeline
A series of data processing steps involved in the flow of data from the source to its final destination, usually used in the context of ETL and data integration.
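At its simplest, a pipeline is a chain of stages where each stage's output feeds the next. A sketch with hypothetical extract, transform, and load functions:

```python
# Hypothetical three-stage pipeline: each stage is a plain function,
# so the flow is explicit and each step can be tested in isolation.
def extract():
    return ["  Alice ", "BOB", ""]

def transform(rows):
    # Clean up whitespace and casing; drop empty records.
    return [r.strip().title() for r in rows if r.strip()]

def load(rows, sink):
    sink.extend(rows)
    return sink

sink = []
load(transform(extract()), sink)
```

Orchestrators like Dagster and Airflow manage scheduling, retries, and dependencies around exactly this kind of staged flow.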
Data Provenance
Information that helps to trace the origins, processing, and use of data, helping to determine the quality and reliability of the dataset.
Data Quality
A comprehensive way of maintaining the accuracy, reliability, and consistency of data over its entire life cycle.
Data Redundancy
The existence of data that is additional to the actual data and permits correction of errors in stored or transmitted data.
Data Reservoir
An expansive storage repository that allows for the integration and storage of data from various sources in its native format.
Data Silo
A repository of data isolated or segregated from other parts of the organization's data system.
Data Stewardship
Responsible management and oversight of an organization's data to help provide business users with high-quality data.
Data Vault Modeling
A database modeling method designed for enterprise data warehouses, with a focus on long-term historical storage, traceability, and scalability.
Data Volume
The amount of data available for analysis, usually referred to in the context of Big Data.
Data Warehouse
A central repository of integrated data from disparate sources, used to store and manage large volumes of historical data and enable fast, complex queries across all the consolidated data.
Database Indexing
The use of special data structures that speed up operations on a table, such as search, filter, and sort.
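SQLite makes the effect easy to observe: after creating an index, the query planner reports an index search instead of a full table scan. The table, index name, and rows below are illustrative:

```python
import sqlite3

# Hypothetical table; the index lets SQLite look up rows by email
# without scanning the whole table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@x.com"), (2, "b@x.com")])
# EXPLAIN QUERY PLAN shows whether the index is used; the detail text
# is in the last column of the returned row.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?", ("b@x.com",)
).fetchone()
conn.close()
```

Indexes trade extra storage and slower writes for faster reads, so they are usually added only on frequently queried columns.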
Database Management System (DBMS)
A software package designed to define, manipulate, retrieve, and manage data in a database.
Database Mirroring
A technique used to increase data availability by maintaining two copies of a single database that must reside on different server instances of SQL Server Database Engine.
Database Normalization
A systematic approach of decomposing tables to eliminate data redundancy and undesirable characteristics like insertion, update, and deletion anomalies.
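The core idea can be sketched outside SQL: a flat table that repeats a customer's city on every order is split so each fact is stored once. The data below is hypothetical:

```python
# Hypothetical denormalized rows repeating each customer's city on every
# order; normalization splits them into a customers table and an orders
# table, so each city is stored exactly once.
orders_flat = [
    (1, "alice", "Paris", "book"),
    (2, "alice", "Paris", "pen"),
    (3, "bob", "Lyon", "lamp"),
]
customers = {name: city for _, name, city, _ in orders_flat}
orders = [(oid, name, item) for oid, name, _, item in orders_flat]
```

After the split, updating a customer's city touches one row instead of every order, eliminating update anomalies.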
Database Schema
The structure or blueprint of a database that outlines how data is organized and how the data entities relate to one another.
De-identify
Remove personally identifiable information (PII) from data to protect privacy and comply with regulations.
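A minimal sketch of one masking approach, using regular expressions over a hypothetical record; real de-identification pipelines use more robust detectors than these simple patterns:

```python
import re

# Hypothetical record; masks email addresses and a simple SSN-style pattern.
record = "Contact jane.doe@example.com about SSN 123-45-6789."
masked = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", record)
masked = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", masked)
```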
Deadlock
A condition where two or more database transactions are unable to proceed because each is waiting for the other to release a lock, leading to a cyclic waiting condition.
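The classic prevention technique is to acquire locks in a single global order, so a cyclic wait can never form. A minimal sketch with two hypothetical workers:

```python
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()
results = []

# Both workers acquire the locks in the same global order (lock_a before
# lock_b); with a consistent ordering, a cyclic wait cannot form.
def worker(name):
    with lock_a:
        with lock_b:
            results.append(name)

threads = [threading.Thread(target=worker, args=(n,)) for n in ("t1", "t2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Had one worker taken `lock_b` first while the other held `lock_a`, each could block waiting on the other's lock indefinitely.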
Decision Tree
A tree-like model of decisions used to make predictions, especially in machine learning algorithms.
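In practice such trees are learned from data (e.g. with scikit-learn), but the structure itself is just nested conditions. A hand-rolled illustrative example with made-up rules:

```python
# Hypothetical hand-written decision tree predicting whether to bike to
# work: split first on weather, then on distance.
def predict(weather, distance_km):
    if weather == "rain":
        return "no"
    return "yes" if distance_km <= 10 else "no"
```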
Deep Learning
A subset of machine learning that utilizes neural networks with many layers (hence “deep”) to analyze various factors of data and to learn and make intelligent decisions.
Delta Lake
An open-source storage layer that brings reliability to data lakes, ensuring ACID transactions, scalable metadata handling, and unifying streaming and batch data processing.
Denormalize
Optimize data for faster read access by reducing the number of joins needed to retrieve related data.
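A sketch of the idea with hypothetical tables: the customer's city is pre-joined into each order row so reads need no lookup at query time:

```python
# Hypothetical normalized tables; denormalizing copies the customer's city
# into each order row, trading duplicated data for faster reads.
customers = {"alice": "Paris", "bob": "Lyon"}
orders = [(1, "alice", "book"), (2, "bob", "lamp")]
orders_denorm = [(oid, name, customers[name], item) for oid, name, item in orders]
```

The trade-off is the mirror image of normalization: reads get cheaper, but updates must now touch every duplicated copy.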
Dependency Parsing
A Natural Language Processing (NLP) technique to analyze the grammatical structure of a sentence to establish relationships between words.
Deserialize
The reverse of serialization: reconstructing an in-memory object from its stored or transmitted representation. See: 'Serialize'.
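For example, parsing a JSON string back into a Python object is deserialization; the payload below is hypothetical:

```python
import json

# Hypothetical JSON payload deserialized back into a Python dict.
payload = '{"id": 7, "tags": ["etl", "batch"]}'
obj = json.loads(payload)
```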
DevOps
A set of practices that combines software development (Dev) and IT operations (Ops), aiming to shorten the systems development life cycle and provide continuous delivery.
Differential Privacy
A system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.
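A minimal sketch of the Laplace mechanism, the standard way to answer a count query with differential privacy: noise scaled to sensitivity/epsilon is added to the true answer. The parameters and count below are illustrative:

```python
import math
import random

# Minimal sketch of the Laplace mechanism: answer a count query with noise
# drawn from Laplace(sensitivity / epsilon). Values here are illustrative.
def private_count(true_count, epsilon, rng):
    sensitivity = 1.0              # one person changes a count by at most 1
    scale = sensitivity / epsilon
    u = rng.random() - 0.5         # inverse-CDF sampling of Laplace noise
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

rng = random.Random(0)             # seeded for reproducibility
noisy = private_count(100, epsilon=1.0, rng=rng)
```

Smaller epsilon means more noise and stronger privacy; the released value is close to, but deliberately not equal to, the true count.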
Dimension Table
A table in a star schema of a data warehouse that stores categorical, descriptive, hierarchical, or textual attributes of data.
Dimensional Modeling
A design technique used in data warehousing to map and visualize data in a way that’s intuitive to business users, typically using facts and dimensions.
Dimensionality
The number of features or attributes in a dataset; high dimensionality can degrade model performance and increase computational cost.
Dimensionality Reduction
The process of reducing the number of random variables under consideration by obtaining a set of principal variables, crucial for dealing with the “curse of dimensionality” in high-dimensional spaces.
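Projection methods like PCA are the usual tools; a simpler selection-based sketch using only the standard library drops low-variance columns from an illustrative dataset:

```python
from statistics import pvariance

# Minimal sketch of dimensionality reduction via feature selection: keep
# only columns whose variance exceeds a threshold. Data are illustrative.
rows = [(1.0, 5.0, 0.0), (2.0, 5.0, 0.0), (3.0, 5.0, 0.1)]
cols = list(zip(*rows))            # transpose rows into columns
keep = [i for i, col in enumerate(cols) if pvariance(col) > 0.01]
reduced = [tuple(row[i] for i in keep) for row in rows]
```

The constant and near-constant columns carry little information, so discarding them shrinks the feature space with minimal loss.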