Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
Distributed Computing
A model in which components located on networked computers communicate and coordinate their actions by passing messages to achieve a common goal, crucial for handling large
Distributed Ledger Technology
A decentralized database managed by multiple participants, across multiple nodes.
Distributed Ledger Technology (DLT)
A digital system for recording the transaction of assets wherein transactions and their details are recorded in multiple places at the same time, the most common form being blockchain technology.
Distributed System
A system where components located on networked computers communicate and coordinate their actions by passing messages.
Docker
A platform used to develop, ship, and run applications inside containers, which package an application together with its dependencies so that it runs consistently across environments.
Document Store Database
A type of NoSQL database designed to store, manage, and retrieve document-oriented information, also known as semi-structured data.
Domain-Driven Design (DDD)
An approach to software development that centers the design and development process on the business domain, ensuring that the software solves real business problems.
Drift Detection
Identifying when the statistical properties of the target variable that a model is trying to predict change over time, which can degrade the model's accuracy.
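As a rough illustration of the idea, the sketch below flags drift when the mean of recent target values moves too far from a reference window, measured in reference standard deviations. The function name `target_drifted` and the threshold of 2.0 are assumptions made for this example; production systems typically use more robust statistical tests.

```python
from statistics import mean, stdev


def target_drifted(reference, current, threshold=2.0):
    """Return True if the mean of `current` has shifted away from `reference`.

    The shift is measured in units of the reference standard deviation.
    The threshold is an illustrative assumption, not a recommended value.
    """
    ref_mean = mean(reference)
    ref_std = stdev(reference)
    if ref_std == 0:
        # No spread in the reference window: any change in mean counts as drift.
        return mean(current) != ref_mean
    shift = abs(mean(current) - ref_mean) / ref_std
    return shift > threshold


reference_window = [10, 11, 9, 10, 10]
stable_window = [10, 10, 11]
shifted_window = [14, 15, 13]

stable_result = target_drifted(reference_window, stable_window)
shifted_result = target_drifted(reference_window, shifted_window)
```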
Dynamic Data
Data that change frequently and are usually generated in real-time, such as stock prices or sensor data.
ETL Testing
The process of validating, verifying, and qualifying data while preventing duplicate records and data loss, conducted during the ETL process.
Eager Execution
A programming environment that evaluates operations immediately, instead of building graphs to run later, typically used in TensorFlow for debugging and interactive development.
Early Stopping
A form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent, by halting training once performance on held-out validation data stops improving rather than running all planned iterations.
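A minimal sketch of the stopping rule, assuming the per-epoch validation losses are already available as a list. The `patience` parameter (how many non-improving epochs to tolerate) and the function name are assumptions for this example; real training loops compute the loss inside the loop.

```python
def train_with_early_stopping(val_losses, patience=2):
    """Return (stop_epoch, best_loss) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_loss
    # Ran out of epochs without triggering the stopping rule.
    return len(val_losses) - 1, best_loss


# Loss improves for three epochs, then gets worse twice in a row.
losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.5]
stop_epoch, best_loss = train_with_early_stopping(losses, patience=2)
```

Note that the final loss of 0.5 is never reached here: the rule trades the chance of a later improvement for a shorter run and less risk of overfitting.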
Edge Computing
A distributed computing paradigm that brings computation and data storage closer to the sources of data generation, improving response times and saving bandwidth.
Elasticity
The ability of a system to efficiently allocate resources to meet demand and then deallocate resources when they are no longer needed.
Elasticsearch
A search engine based on the Lucene library, providing a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
Embedded Analytics
The integration of analytical capabilities and content within business process applications.
Embedding Layer
A layer within a neural network that learns to map the input data (such as words in text) into fixed-size dense vectors of continuous values, usually as the first layer in a network processing sequential or textual data.
Ensemble Learning
A technique used in machine learning that combines several models to solve a single predictive problem, enhancing the performance and robustness of the model.
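One simple way to combine models is majority voting. The sketch below uses three hand-written rules as stand-ins for trained classifiers; the rules, names, and labels are purely illustrative assumptions.

```python
from collections import Counter


# Three deliberately weak "models", each looking at a different signal.
def model_keyword(text):
    return "spam" if "free" in text.lower() else "ham"


def model_punctuation(text):
    return "spam" if "!" in text else "ham"


def model_length(text):
    return "spam" if len(text) > 40 else "ham"


def ensemble_predict(text, models):
    """Return the label predicted by the majority of the models."""
    votes = Counter(model(text) for model in models)
    return votes.most_common(1)[0][0]


models = [model_keyword, model_punctuation, model_length]
spam_prediction = ensemble_predict("Free prize!", models)
ham_prediction = ensemble_predict("See you at lunch", models)
```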
Entity Resolution
The process of identifying and linking mentions of the same entity across different data sources, critical for creating a unified view of entities from disparate data sources.
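A minimal sketch of rule-based matching, assuming records are `(source, name)` pairs and that two names refer to the same entity when they agree after normalization. The suffix list and function names are assumptions for this example; real systems usually add fuzzy matching and more attributes.

```python
import re


def normalize_name(name):
    """Lowercase, strip punctuation, and drop common company suffixes."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]", "", name)
    name = re.sub(r"\b(inc|ltd|llc|corp)\b", "", name)
    return " ".join(name.split())


def resolve(records):
    """Group (source, name) records that share a normalized name."""
    groups = {}
    for source, name in records:
        groups.setdefault(normalize_name(name), []).append((source, name))
    return groups


records = [
    ("crm", "Acme, Inc."),
    ("billing", "ACME Inc"),
    ("crm", "Globex LLC"),
]
entities = resolve(records)
```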
Entity-Relationship Model
A data model for describing a database in an abstract way, using entities, relationships, and attributes.
Ephemeral Storage
Temporary storage that is provisioned for a short period of time and is deleted when the instance using it is terminated.
Event-driven Architecture
A software architecture paradigm promoting the production, detection, consumption of, and reaction to events.
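The produce/consume relationship can be sketched with a small in-process event bus. The `EventBus` class, the `file_arrived` event name, and the handlers are hypothetical; production systems typically put a message broker between producers and consumers.

```python
from collections import defaultdict


class EventBus:
    """A tiny synchronous publish/subscribe bus."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The producer does not know which consumers react to the event.
        for handler in self._handlers[event_type]:
            handler(payload)


bus = EventBus()
actions = []

bus.subscribe("file_arrived", lambda event: actions.append(f"ingest {event['path']}"))
bus.subscribe("file_arrived", lambda event: actions.append(f"notify {event['path']}"))

bus.publish("file_arrived", {"path": "data/orders.csv"})
```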
Evolutionary Algorithm
A subset of evolutionary computation, a generic population-based metaheuristic optimization algorithm used to find approximate solutions to optimization and search problems.
Exabyte
A unit of information or computer storage equal to one quintillion bytes (1 billion gigabytes).
Exascale Computing
Computing systems capable of at least one exaFLOPS, or a billion billion (10^18) floating-point operations per second, representing a thousandfold increase over petascale.
Explainable AI (XAI)
An area in AI that develops methods and techniques to help human users understand and trust the output and operations of machine learning models.
Extract
The process of retrieving data from source systems, which are often unstructured or poorly structured, for further processing or storage.
Extract, Load, Transform (ELT)
A variant of ETL in which extracted data is loaded into the target system and then transformed.
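The ordering can be sketched with an in-memory SQLite database standing in for the target system: raw values are loaded untouched, and the cleanup then runs as SQL inside the target. The table and column names are assumptions for this example.

```python
import sqlite3

# Extract: raw rows as they arrive from the source, all strings.
raw_rows = [("1", " Ada ", "10.50"), ("2", "grace", "7.25")]

conn = sqlite3.connect(":memory:")

# Load: store the data as-is in a raw table.
conn.execute("CREATE TABLE raw_orders (id TEXT, customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: clean and cast inside the target system, using its SQL engine.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(id AS INTEGER)     AS id,
           LOWER(TRIM(customer))   AS customer,
           CAST(amount AS REAL)    AS amount
    FROM raw_orders
    """
)

rows = conn.execute("SELECT id, customer, amount FROM orders ORDER BY id").fetchall()
```

Keeping `raw_orders` around means the transformation can be revised and re-run later without extracting the data again.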
Extrapolate
Predict values outside a known range, based on the trends or patterns identified within the available data.
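As a small worked example, the sketch below fits a least-squares line to three known points and evaluates it at an x value beyond the observed range. The data and function name are made up for illustration, and the result is only as trustworthy as the assumption that the linear trend continues.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Known range: x from 1 to 3.
xs = [1, 2, 3]
ys = [2, 4, 6]

slope, intercept = fit_line(xs, ys)

# Extrapolate to x = 5, which lies outside the known range.
predicted = slope * 5 + intercept
```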
Factory Pattern
A creational design pattern that defines an interface for creating an object while letting subclasses decide which class to instantiate.
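A minimal sketch of the factory method variant, using hypothetical `Loader` and `Reader` classes: the base `Loader` defines the workflow, and each subclass decides which concrete `Reader` gets created.

```python
from abc import ABC, abstractmethod


class Reader(ABC):
    @abstractmethod
    def read(self, text):
        """Parse raw text into a list of rows."""


class CsvReader(Reader):
    def read(self, text):
        return [line.split(",") for line in text.splitlines()]


class TsvReader(Reader):
    def read(self, text):
        return [line.split("\t") for line in text.splitlines()]


class Loader(ABC):
    @abstractmethod
    def make_reader(self):
        """Factory method: subclasses choose the concrete Reader."""

    def load(self, text):
        # The shared workflow never names a concrete Reader class.
        return self.make_reader().read(text)


class CsvLoader(Loader):
    def make_reader(self):
        return CsvReader()


class TsvLoader(Loader):
    def make_reader(self):
        return TsvReader()


csv_rows = CsvLoader().load("a,b\n1,2")
tsv_rows = TsvLoader().load("a\tb")
```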
Fan-Out
A pipeline design in which one operation is broken into, or results in, many parallel downstream tasks.
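The shape of a fan-out can be sketched with a thread pool from the Python standard library: one upstream step splits the work, and several downstream tasks process the pieces in parallel. The partitioning scheme and function names are assumptions for this example; an orchestrator would usually manage these tasks for you.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, n):
    """Upstream step: break one collection into n partitions."""
    return [records[i::n] for i in range(n)]


def process_partition(partition):
    """Downstream task: runs once per partition."""
    return sum(value * value for value in partition)


records = list(range(10))
partitions = split_into_partitions(records, 3)

# Fan out: one task per partition, executed in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(process_partition, partitions))

# Combine the parallel results again (often called fan-in).
total = sum(partial_results)
```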
Fault Tolerance
The property that enables a system to continue operating properly in the event of the failure of some of its components.
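One common building block is failover between redundant components. In the sketch below, the replicas are plain functions and the names are hypothetical; a read succeeds as long as at least one replica is healthy.

```python
def read_with_failover(replicas, key):
    """Try each replica in order and return the first successful read."""
    for replica in replicas:
        try:
            return replica(key)
        except ConnectionError:
            # This component failed; keep operating using the next one.
            continue
    raise RuntimeError("all replicas failed")


def failing_replica(key):
    raise ConnectionError("replica is down")


def healthy_replica(key):
    return {"user:1": "Ada"}[key]


value = read_with_failover([failing_replica, healthy_replica], "user:1")
```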
Feather
A binary columnar serialization format optimized for use with DataFrames in analytics. It is language agnostic, though it is most commonly used with Python and R, and is well suited to fast, lightweight reading and writing of data frames.
Feature Engineering
The process of using domain knowledge to create new features from the existing ones, improving the performance of machine learning models.
Feature Extraction
The process of identifying and extracting relevant features from raw data for use in analysis or modeling.
Feature Scaling
A method used to normalize the range of independent variables or features of data.
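Min-max scaling is one common form: each value x is mapped to (x - min) / (max - min), so the feature ends up in the range 0 to 1. The sketch below is a plain-Python illustration; the function name is an assumption, and libraries such as scikit-learn provide equivalent transformers.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # A constant feature carries no range information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


scaled = min_max_scale([10, 20, 40])
```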
Feature Selection
The process of identifying and selecting the most relevant and informative features for analysis or modeling.
Feature Store
A centralized repository for storing, serving, and sharing machine learning features, allowing for the consistent use of features across different models.
Federated Learning
A machine learning approach that trains an algorithm across multiple decentralized devices or servers, each holding local data samples, without exchanging those samples.
Federated Query
A type of query in database computing, spanning multiple databases, possibly using different database management systems.
Flink
An open-source stream-processing framework for high-throughput, fault-tolerant, and scalable processing of data streams.
Flume
A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Foreign Key
A set of one or more columns used to establish a link between the data in two tables by referencing a unique key in another table.
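A short sketch using Python's built-in `sqlite3` module shows the link and the integrity check it enables. The `customers` and `orders` tables are made up for this example. SQLite only enforces foreign keys when the `foreign_keys` pragma is enabled on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id)
    )
    """
)

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (id, customer_id) VALUES (100, 1)")

# An order pointing at a customer that does not exist violates the constraint.
try:
    conn.execute("INSERT INTO orders (id, customer_id) VALUES (101, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```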
Full Stack Development
The development of both front end (client-side) and back end (server-side) portions of a web application.
Function as a Service (FaaS)
A category of cloud services that provides a platform allowing customers to develop, run, and manage application functionality without building and maintaining the underlying infrastructure.
Functional Programming
A programming paradigm that treats computation as the evaluation of mathematical functions and avoids changing state and mutable data.
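A brief sketch of the style in Python, which supports functional techniques without enforcing them: the input is an immutable tuple, `normalize` is a pure function, and results are built with `map` and `reduce` instead of by updating variables in place. The names and data are illustrative assumptions.

```python
from functools import reduce


def normalize(name):
    """Pure function: the output depends only on the input, with no side effects."""
    return name.strip().lower()


names = ("  Ada ", "GRACE", "Linus ")

# Build new values instead of modifying the existing ones.
cleaned = tuple(map(normalize, names))
total_length = reduce(lambda acc, name: acc + len(name), cleaned, 0)
```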