Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
Optimistic Concurrency Control
A concurrency control method used in transactional systems that assumes conflicts between simultaneous updates are rare: transactions proceed without locking and are validated (for example, via a version check) before changes are committed.
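A minimal sketch of the idea using a version number: each update succeeds only if the record's version is unchanged since it was read. (Illustrative only; names like `Record` are hypothetical and not tied to any particular database.)

```python
# Optimistic concurrency control via a version counter: writers do not
# lock; they detect conflicts at write time and retry if needed.

class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0

def read(record):
    """Return the value together with the version that was seen."""
    return record.value, record.version

def update(record, new_value, expected_version):
    """Apply the update only if nobody else modified the record."""
    if record.version != expected_version:
        return False  # conflict: caller should re-read and retry
    record.value = new_value
    record.version += 1
    return True

r = Record("a")
value, v = read(r)
update(r, "b", v)        # succeeds, bumps version to 1
ok = update(r, "c", v)   # fails: version has moved on since the read
```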
Optimization
The process of adjusting a system to improve its efficiency or use of resources, usually in the context of improving the performance of algorithms and models.
Orchestration
Automated configuration, coordination, and management of computer systems, middleware, and services.
Outlier Detection
The identification of rare items, events, or observations in a data set that raise suspicions due to differences in pattern or behavior from the majority of the data.
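One simple approach among many is the z-score rule: flag points that sit unusually far from the mean, measured in standard deviations. A sketch with the standard library:

```python
# Z-score outlier detection: a point is flagged when its distance from
# the mean exceeds `threshold` standard deviations.
import statistics

def zscore_outliers(data, threshold=2.5):
    mean = statistics.mean(data)
    stdev = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / stdev > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 120]
outliers = zscore_outliers(data)  # only 120 stands out
```

Note that the usual cutoff of 3 standard deviations is too strict for very small samples (a single extreme point in 10 observations cannot exceed a z-score of about 2.85), hence the lower threshold here.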
Overfitting
A modeling error that occurs when a function is too closely tailored to the training dataset; hence, the model performs well on the training dataset but poorly on new, unseen data.
P-value
A measure in statistical hypothesis testing that quantifies the strength of the evidence against the null hypothesis.
Page Cache
A transparent cache for the pages originating from a secondary storage device such as a hard disk drive.
Pandas
A fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation library built on top of Python.
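A small taste of the library (assuming pandas is installed): load tabular data into a DataFrame, filter rows with a boolean mask, and compute a grouped aggregate.

```python
# Basic pandas operations: construction, filtering, and groupby.
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "SF", "SF"],
    "sales": [100, 150, 200, 50],
})

big = df[df["sales"] > 75]                   # keep rows with sales > 75
totals = df.groupby("city")["sales"].sum()   # total sales per city
```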
Parallel Processing
A type of computation in which many calculations or processes are carried out simultaneously, suitable for tasks where many operations are independent of each other.
Parallelize
Boosting the execution speed of large data-processing jobs by breaking a task into many smaller tasks that run concurrently.
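A sketch of the pattern: split the input into chunks, process the chunks concurrently, and combine the partial results. Threads are used here for brevity; CPU-bound Python work would more typically use processes (e.g., `multiprocessing`) because of the GIL.

```python
# Parallelize a computation by chunking the input and mapping the
# per-chunk work over a pool of workers.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for real per-chunk work (parsing, transforming, ...).
    return sum(x * x for x in chunk)

data = list(range(1000))
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, chunks))

total = sum(partials)  # identical to processing `data` serially
```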
Parameter Tuning
The adjustment of a model's internal parameters (such as weights) during training, with the aim of improving model accuracy.
Parquet
A columnar storage file format optimized for use with big data processing frameworks. It is highly efficient for both storage and processing, especially for complex nested data structures, and it supports schema evolution, allowing the schema to be modified after data has been ingested.
Partition
Data partitioning is a technique that data engineers and ML engineers use to divide data into smaller subsets for improved performance.
Partitioning
A database design technique to improve performance, manageability, or availability by splitting tables into smaller, more manageable pieces.
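One common scheme is hash partitioning: rows are assigned to a partition by hashing their key, so all rows with the same key land together and each partition can be stored and scanned independently. A toy sketch:

```python
# Hash partitioning: route each row to one of `num_partitions` buckets
# based on its key, keeping rows with equal keys co-located.
def partition_for(key, num_partitions):
    return hash(key) % num_partitions

rows = [("alice", 1), ("bob", 2), ("carol", 3), ("alice", 4)]
partitions = {i: [] for i in range(4)}
for key, value in rows:
    partitions[partition_for(key, 4)].append((key, value))
# Both "alice" rows are guaranteed to be in the same partition.
```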
Pattern Recognition
A branch of machine learning that focuses on the recognition of patterns and regularities in data.
Payload
The part of the transmitted data that is the actual intended message, excluding any headers or metadata sent mainly for the purpose of the delivery of the payload.
Peer-to-Peer Network
A decentralized network where each connected computer has equal status and can interact with each other without a central server.
Percentile
A statistical measure that indicates the value below which a given percentage of observations fall in a group of observations.
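The standard library can compute percentiles directly: `statistics.quantiles(data, n=100)` returns the 99 cut points separating the percentiles.

```python
# Computing the 90th percentile: 90% of observations fall below it.
import statistics

data = list(range(1, 101))  # the values 1..100
cuts = statistics.quantiles(data, n=100)  # 99 percentile cut points
p90 = cuts[89]  # the 90th percentile (index 89 of 99 cut points)
```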
Performance Tuning
The improvement of system performance, typically in computer systems and networks, by adjusting various underlying parameters and configurations.
Permutation Test
A type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points.
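In practice the full set of rearrangements is usually too large, so the null distribution is approximated by random shuffles. A sketch for a two-sample test on the difference in means:

```python
# Approximate permutation test: shuffle the pooled data many times and
# count how often the permuted statistic is at least as extreme as the
# observed one.
import random

def permutation_test(a, b, num_permutations=10_000, seed=0):
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    count = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        stat = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if stat >= observed:
            count += 1
    return count / num_permutations  # approximate p-value

p = permutation_test([1, 2, 3, 4, 5], [8, 9, 10, 11, 12])
# The groups clearly differ, so p should be very small.
```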
Persistence Layer
The data access layer in a software application that stores and retrieves data from databases, files, and other storage locations.
Pipeline
A set of tools and processes chained together to automate the flow of data from source to storage, allowing for stages of transformation and analysis in between.
Polyglot Persistence
The use of various, often complementary database technologies to handle varying data storage needs within a given software application.
Polynomial Regression
A type of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial.
Power BI
A business analytics service by Microsoft, providing interactive visualizations with self-service business intelligence capabilities.
Precision
A classification metric: the number of true positive results divided by the number of all results predicted positive (true positives plus false positives).
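A quick illustration of the formula, computed from paired ground-truth labels and predictions:

```python
# Precision = TP / (TP + FP): of everything predicted positive,
# what fraction was actually positive?
def precision(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    return tp / (tp + fp)

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
score = precision(y_true, y_pred)  # 2 TP, 1 FP -> 2/3
```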
Predictive Analytics
The use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data.
Predictive Modeling
The process of creating, testing, and validating a model to best predict the probability of an outcome.
Primary Key
A unique identifier for a record in a database table that helps maintain data integrity.
Principal Component Analysis (PCA)
A dimensionality reduction technique used to emphasize variation and bring out strong patterns in a dataset, often used before fitting a machine learning model to the data.
Probabilistic Data Structure
A high-performance, low-memory data structure that provides approximations to set operations, often used for tasks like membership tests, frequency counting, and finding heavy hitters.
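The classic example is the Bloom filter, which answers membership queries with "definitely not present" or "possibly present" using a fixed-size bit array instead of storing the items themselves. A toy version:

```python
# A toy Bloom filter: k hash positions are set per added item; a lookup
# reports "possibly present" only if all k positions are set. False
# positives are possible; false negatives are not.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
```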
Process
Manipulation of data to convert it from one form to another or to reduce it to a more manageable state.
Process Isolation
A form of data security which prevents running processes from interacting with each other, often used in multitasking operating systems to increase security and stability.
Profile
Generate statistical summaries and distributions of data to understand its characteristics.
Programmatic Advertising
The automated buying and selling of online advertising, optimizing based on algorithms and data.
Projection
A database operation that returns a set of columns (attributes) in a table, reducing the number of columns in the resultant relation.
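In SQL, projection is simply listing the columns you want in `SELECT`. A small demonstration with the standard library's in-memory SQLite database:

```python
# Projection: SELECT only the `name` column, dropping `id` and `email`
# from the resulting relation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@example.com')")

rows = conn.execute("SELECT name FROM users").fetchall()
conn.close()
```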
Protobuf (Protocol Buffers)
Developed by Google, Protocol Buffers is a method for serializing structured data, comparable to XML and JSON but simpler and more compact and efficient than either. Protobuf is language-agnostic, making it highly versatile across different systems.
Prototyping
The process of quickly creating a working model (a prototype) of a part of a system, allowing for faster and more efficient final design and development.
Pseudonymization
A data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers.
Pub/Sub (Publish/Subscribe)
A messaging pattern in which senders (publishers) do not send messages directly to specific receivers (subscribers); instead, messages are grouped into classes called topics, and subscribers receive the topics they have expressed interest in without the publishers knowing who they are.
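The decoupling is easy to see in a minimal in-memory broker (a hypothetical sketch; real systems like Kafka or Google Pub/Sub add durability, ordering, and delivery guarantees):

```python
# A minimal in-memory publish/subscribe broker: publishers send to a
# topic without knowing who, if anyone, is subscribed.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("orders", received.append)
broker.publish("orders", {"id": 1})
broker.publish("shipments", {"id": 2})  # no subscribers: dropped
```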
Pull Request
A method of submitting contributions to an open development project, often used in collaborative development to manage changes from multiple contributors.
Push Notification
A message that pops up on a mobile device or desktop from an app or website, typically used to deliver updates, news, or promotions.
PyTorch
An open-source machine learning library for Python, developed by Facebook’s AI Research lab.
Python Pickle
A module in Python used for serializing and de-serializing Python object structures, converting Python objects into a byte stream.
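A round trip looks like this; note that unpickling untrusted data is unsafe, since the byte stream can execute arbitrary code on load.

```python
# Serialize a Python object to bytes with pickle, then restore an
# equal (but distinct) object.
import pickle

original = {"name": "pipeline", "steps": [1, 2, 3]}
blob = pickle.dumps(original)   # bytes suitable for a file or network
restored = pickle.loads(blob)   # equal to `original`, not the same object
```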
QlikView
A Business Intelligence (BI) tool ideal for data visualization, analytics development, and reporting.
Quantile
Data points that divide a dataset into contiguous intervals of equal probability; the median, quartiles, and percentiles are all quantiles.
Quantum Computing
A type of computation that takes advantage of the quantum states of particles to store information, potentially allowing for the solving of complex problems much faster than classical computers can.
Query Language
A type of computer language that requests and retrieves data from database management systems.
Query Optimization
The process of choosing the most efficient way to execute a SQL statement, typically by rewriting queries and selecting the best available query plan.