Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
Inversion of Control (IoC)
A design principle in which the custom-written portions of a computer program receive the flow of control from a generic, reusable library.
Isolation Levels
Different configurations used in databases to trade off consistency for performance, such as Read Uncommitted, Read Committed, Repeatable Read, and Serializable.
Iterative Model
A software development model that involves repeating the same set of activities for each portion of the project, allowing refinement with each iteration.
JSON (JavaScript Object Notation)
A lightweight, text-based, and human-readable data interchange format used for representing structured data. It is based on a subset of the JavaScript Programming Language and is easy for humans to read and write and for machines to parse and generate.
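For example, a minimal round trip with Python's built-in `json` module:

```python
import json

# Serialize a Python dict to a JSON string, then parse it back.
record = {"name": "orders", "rows": 1200, "partitioned": True}
encoded = json.dumps(record)   # '{"name": "orders", "rows": 1200, "partitioned": true}'
decoded = json.loads(encoded)
```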
Java Database Connectivity (JDBC)
An API for the Java programming language that defines how a client may access a database, providing methods to query and update data in a database.
Jenkins
An open-source automation server, helping to automate parts of the software development process.
Join Operation
A SQL operation used to combine rows from two or more tables based on a related column between them.
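For example, an inner join between two related tables, sketched with Python's built-in `sqlite3` module (table and column names are illustrative):

```python
import sqlite3

# In-memory database with two tables related by customer id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Join orders to customers on the related column, then aggregate per customer.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
# rows -> [('Ada', 65.0), ('Grace', 15.0)]
```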
Jupyter Notebook
An open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.
Just-In-Time Compilation (JIT)
A way of executing computer code that involves compilation during the execution of a program at runtime rather than prior to execution, improving the execution efficiency.
K-Means Clustering
A partitioning method that divides a dataset into subsets (clusters), where each data point belongs to the cluster with the nearest mean.
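A naive one-dimensional sketch of the algorithm in plain Python (initializing the means from the first k points for simplicity, which a production implementation would not do):

```python
def kmeans_1d(points, k, iters=20):
    """Naive 1-D k-means: assign each point to the nearest mean, recompute means."""
    means = points[:k]  # simplistic initialization for illustration
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # Recompute each mean; keep the old mean if a cluster went empty.
        means = [sum(c) / len(c) if c else means[i] for i, c in enumerate(clusters)]
    return sorted(means)
```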
K-Nearest Neighbors (KNN)
A simple, supervised machine learning algorithm used for classification and regression, which predicts the classification or value of a new point based on the K nearest points.
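A minimal classification sketch in Python for 2-D points (majority vote among the k nearest neighbors by squared Euclidean distance):

```python
from collections import Counter

def knn_predict(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest labeled neighbors.
    train is a list of ((x, y), label) pairs."""
    by_dist = sorted(train, key=lambda item: (item[0][0] - new_point[0]) ** 2
                                             + (item[0][1] - new_point[1]) ** 2)
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]
```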
Kafka
An open-source stream processing platform developed at LinkedIn and donated to the Apache Software Foundation, designed for high throughput, fault tolerance, and scalability.
Key Performance Indicator (KPI)
A type of performance measurement that evaluates the success of an organization, employee, etc., in achieving objectives.
Key-Value Store
A type of NoSQL database that uses a simple key/value method to store data, suitable for storing large amounts of data.
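The core interface is just `put` and `get` by key; a minimal in-memory sketch (a real store adds persistence, replication, and expiry):

```python
class KeyValueStore:
    """Minimal in-memory key-value store backed by a dict."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)
```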
Kibana
An open-source data visualization dashboard for Elasticsearch, providing visualization capabilities on top of the content indexed in Elasticsearch clusters.
Knowledge Graph
A knowledge base used to store complex structured and unstructured information used by machines and humans to enhance search and understand relationships and properties of the data.
Kubernetes
An open-source platform designed to automate deploying, scaling, and operating application containers, allowing for easy management of containerized applications across multiple hosts.
Lambda Architecture
A data processing architecture designed to handle massive quantities of data by combining batch processing and stream processing, providing a balance between latency, throughput, and fault-tolerance.
Latent Semantic Analysis (LSA)
A technique in natural language processing and information retrieval to discover relationships between words and the concepts they form.
Lazy Loading
A design pattern used in computer programming to defer initialization of an object until the point at which it is needed.
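A common sketch of the pattern in Python, where an expensive read is deferred until the attribute is first accessed (`_read_rows` is a hypothetical stand-in for real I/O):

```python
class Dataset:
    """Defers loading rows until the first access."""
    def __init__(self, path):
        self.path = path
        self._rows = None  # nothing loaded yet

    def _read_rows(self):
        # Placeholder for an expensive read from disk or network.
        return [f"row-{i}" for i in range(3)]

    @property
    def rows(self):
        if self._rows is None:       # load only on first access
            self._rows = self._read_rows()
        return self._rows
```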
Lineage
A record of how data moves through a pipeline, including its origin, transformations, dependencies, and ultimate consumption.
Linear Regression
A statistical method used to model the relationship between a dependent variable and one or more independent variables, predicting outcomes.
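For a single predictor, the ordinary-least-squares fit has a closed form, sketched here in plain Python:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x
```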
Linearizability
A consistency guarantee ensuring that each individual operation on a distributed system appears to take effect instantaneously, at some point between its start and completion.
Linearize
The process of transforming the relationship between variables so that a dataset becomes approximately linear.
Linked Data
A method of publishing structured data so that it can be interlinked and become more useful, leveraging the structure of the data to enhance its usability and discoverability.
Load Balancer
A device or software function that distributes network or application traffic across multiple servers, optimizing resource use, maximizing throughput, minimizing response time, and avoiding overload.
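Round robin is one of the simplest balancing policies; a minimal sketch in Python (server names are hypothetical):

```python
import itertools

class RoundRobinBalancer:
    """Cycles through backend servers so requests spread evenly."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        # Return the next server in rotation.
        return next(self._cycle)
```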
Load Shedding
The process of reducing the load on a system by restricting the number of incoming requests.
Load Testing
A type of non-functional testing conducted to understand the behavior of the application under a specific expected load, identifying the maximum operating capacity of an application and any bottlenecks.
Localization
The process of adapting internationalized software for a specific region or language by adding locale-specific components and translating text.
Locking
A mechanism employed by RDBMSs to regulate data access in multi-user environments, ensuring the integrity of data by preventing multiple users from altering the same data at the same time.
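The same idea applies at the application level; a sketch using Python's `threading.Lock` to prevent lost updates when several threads mutate shared state:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:          # only one thread may mutate counter at a time
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 40_000; without the lock, interleaved updates could leave it lower
```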
Log Files
Files that record either events occurring in an operating system or other software, or messages exchanged between users of communication software.
Log Mining
A process that involves analyzing log files from different sources to uncover insights, which can be used for various purposes such as security, performance monitoring, and user behavior analysis.
Logistic Regression
A statistical method used to analyze a dataset and predict binary outcomes, utilizing a logistic function to model a binary dependent variable.
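A minimal one-feature sketch using stochastic gradient descent on the log loss (the learning rate and epoch count here are arbitrary illustrative choices):

```python
import math

def train_logistic(points, labels, lr=0.5, epochs=500):
    """Fit w, b for P(y=1 | x) = sigmoid(w*x + b) by stochastic gradient descent."""
    w = b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))   # predicted probability
            w -= lr * (p - y) * x                       # gradient of the log loss
            b -= lr * (p - y)
    return w, b

def predict(w, b, x):
    """Predict class 1 when the modeled probability is at least 0.5."""
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5 else 0
```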
Logstash
A server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a 'stash' like Elasticsearch.
Long Short-Term Memory (LSTM)
A special kind of recurrent neural network (RNN) capable of learning long-term dependencies, particularly useful when important events in a sequence are separated by long time lags.
Long-Polling
A web communication technique where the client requests information from the server, and the server holds the request open until new information is available.
Lookup Table
A table used to map input values to output values, replacing a runtime computation or join with a simple retrieval operation.
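In Python a small lookup table is often just a dict; a sketch enriching records with hypothetical country names:

```python
# Dimension-style lookup table mapping country codes to names (illustrative values).
country_names = {"US": "United States", "DE": "Germany", "JP": "Japan"}

orders = [("order-1", "DE"), ("order-2", "US")]
# Look up each code, with a fallback for codes missing from the table.
enriched = [(order_id, country_names.get(code, "Unknown")) for order_id, code in orders]
```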
Loss Function
A function used in optimization to measure the difference between the predicted value and the actual value, guiding the model training process.
Low Latency
Characterized by a short delay from input into a system to the desired outcome, crucial in systems requiring real-time response.
Luigi
An older Python module that helps you build basic pipelines of batch jobs.
Machine Learning
A method of data analysis that automates analytical model building, enabling systems to learn from data, identify patterns, and make decisions.
Machine Learning Operations (MLOps)
A practice for collaboration and communication between data scientists and operations professionals to help manage the production machine learning lifecycle.
Machine Learning Pipeline
A sequence of data processing and machine learning tasks, assembled to create a model, with each step in the sequence processing the data and passing it on to the next step.
Machine-to-Machine (M2M)
Direct communication between devices using any communications channel, including wired and wireless.
MapR
A comprehensive data platform offering the speed, scale, and reliability required by enterprise-grade applications.
MapReduce
A programming model for processing and generating large datasets in parallel with a distributed algorithm on a cluster, initially developed by Google.
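The classic example is word count; a sketch of the map and reduce phases as plain Python functions (a real MapReduce system distributes each phase across a cluster):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Map: emit a (word, 1) pair for each word in a document."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    """Reduce: sum counts per key (the shuffle step groups pairs by word)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big pipelines", "data pipelines data"]
word_counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
# word_counts -> {'big': 2, 'data': 3, 'pipelines': 2}
```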
Markdown
A lightweight markup language with plain text formatting syntax designed for creating rich text using a plain text editor.
Master Data Management (MDM)
A method that defines and manages the critical data of an organization to provide a single point of reference across the organization.
Materialized View
A database object that stores the results of a query as a separate schema object, providing faster access to table data than re-running the query.
Mean Squared Error (MSE)
A measure of the average of the squares of the errors, used as a risk metric corresponding to the expected value of the squared (quadratic) error or loss.
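The computation itself is a one-liner in Python:

```python
def mse(predicted, actual):
    """Mean of the squared differences between predictions and observations."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
```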
Median
A measure of central tendency representing the middle value of a sorted list of numbers, separating the higher half from the lower half of the data set.
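For example, in Python (averaging the two middle values when the list has even length):

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if even length."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
```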