Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
Optimistic Concurrency Control
A concurrency control method used in transactional systems that assumes conflicts between simultaneous updates are rare: transactions proceed without locking and are validated (for example, via a version check) before changes are committed.
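A minimal sketch of the idea using a version number: each update succeeds only if the record's version is unchanged since it was read. (Illustrative only; names like `Record` are hypothetical and not tied to any particular database.)

```python
# Optimistic concurrency control via a version counter: writers do not
# lock; they detect conflicts at write time and retry if needed.

class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0

def read(record):
    """Return the value together with the version that was seen."""
    return record.value, record.version

def update(record, new_value, expected_version):
    """Apply the update only if nobody else modified the record."""
    if record.version != expected_version:
        return False  # conflict: caller should re-read and retry
    record.value = new_value
    record.version += 1
    return True

r = Record("a")
value, v = read(r)
update(r, "b", v)        # succeeds, bumps version to 1
ok = update(r, "c", v)   # fails: version has moved on since the read
```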
Optimization
The process of adjusting a system to improve its efficiency or use of resources, usually in the context of improving the performance of algorithms and models.
Orchestration
Automated configuration, coordination, and management of computer systems, middleware, and services.
Outlier Detection
The identification of rare items, events, or observations in a data set that raise suspicions due to differences in pattern or behavior from the majority of the data.
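One simple approach among many is the z-score rule: flag points that sit unusually far from the mean, measured in standard deviations. A sketch with the standard library:

```python
# Z-score outlier detection: a point is flagged when its distance from
# the mean exceeds `threshold` standard deviations.
import statistics

def zscore_outliers(data, threshold=2.5):
    mean = statistics.mean(data)
    stdev = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / stdev > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 120]
outliers = zscore_outliers(data)  # only 120 stands out
```

Note that the usual cutoff of 3 standard deviations is too strict for very small samples (a single extreme point in 10 observations cannot exceed a z-score of about 2.85), hence the lower threshold here.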
Overfitting
A modeling error that occurs when a function is too closely tailored to the training dataset; hence, the model performs well on the training dataset but poorly on new, unseen data.
P-value
A measure in statistical hypothesis testing that quantifies the strength of the evidence against the null hypothesis.
Page Cache
A transparent cache for the pages originating from a secondary storage device such as a hard disk drive.
Pandas
A fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation library built on top of Python.
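A small taste of the library (assuming pandas is installed): load tabular data into a DataFrame, filter rows with a boolean mask, and compute a grouped aggregate.

```python
# Basic pandas operations: construction, filtering, and groupby.
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "SF", "SF"],
    "sales": [100, 150, 200, 50],
})

big = df[df["sales"] > 75]                   # keep rows with sales > 75
totals = df.groupby("city")["sales"].sum()   # total sales per city
```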
Parallel Processing
A type of computation in which many calculations or processes are carried out simultaneously, suitable for tasks where many operations are independent of each other.
Parallelize
Boosting the execution speed of large data-processing jobs by breaking a task into many smaller tasks that run concurrently.
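A sketch of the pattern: split the input into chunks, process the chunks concurrently, and combine the partial results. Threads are used here for brevity; CPU-bound Python work would more typically use processes (e.g., `multiprocessing`) because of the GIL.

```python
# Parallelize a computation by chunking the input and mapping the
# per-chunk work over a pool of workers.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for real per-chunk work (parsing, transforming, ...).
    return sum(x * x for x in chunk)

data = list(range(1000))
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, chunks))

total = sum(partials)  # identical to processing `data` serially
```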
Parameter Tuning
The adjustment of a model's internal parameters (such as weights) during training, with the aim of improving model accuracy.
Parquet
A columnar storage file format optimized for use with big data processing frameworks. It is highly efficient for both storage and processing, especially for complex nested data structures, and it supports schema evolution, allowing the schema to be modified after data has been ingested.
Partition
Data partitioning is a technique that data engineers and ML engineers use to divide data into smaller subsets for improved performance.
Partitioning
A database design technique to improve performance, manageability, or availability by splitting tables into smaller, more manageable pieces.
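One common scheme is hash partitioning: rows are assigned to a partition by hashing their key, so all rows with the same key land together and each partition can be stored and scanned independently. A toy sketch:

```python
# Hash partitioning: route each row to one of `num_partitions` buckets
# based on its key, keeping rows with equal keys co-located.
def partition_for(key, num_partitions):
    return hash(key) % num_partitions

rows = [("alice", 1), ("bob", 2), ("carol", 3), ("alice", 4)]
partitions = {i: [] for i in range(4)}
for key, value in rows:
    partitions[partition_for(key, 4)].append((key, value))
# Both "alice" rows are guaranteed to be in the same partition.
```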
Pattern Recognition
A branch of machine learning that focuses on the recognition of patterns and regularities in data.
Payload
The part of the transmitted data that is the actual intended message, excluding any headers or metadata sent mainly for the purpose of the delivery of the payload.
Peer-to-Peer Network
A decentralized network where each connected computer has equal status and can interact with each other without a central server.
Percentile
A statistical measure that indicates the value below which a given percentage of observations fall in a group of observations.
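The standard library can compute percentiles directly: `statistics.quantiles(data, n=100)` returns the 99 cut points separating the percentiles.

```python
# Computing the 90th percentile: 90% of observations fall below it.
import statistics

data = list(range(1, 101))  # the values 1..100
cuts = statistics.quantiles(data, n=100)  # 99 percentile cut points
p90 = cuts[89]  # the 90th percentile (index 89 of 99 cut points)
```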
Performance Tuning
The improvement of system performance, typically in computer systems and networks, by adjusting various underlying parameters and configurations.
Permutation Test
A type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points.
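In practice the full set of rearrangements is usually too large, so the null distribution is approximated by random shuffles. A sketch for a two-sample test on the difference in means:

```python
# Approximate permutation test: shuffle the pooled data many times and
# count how often the permuted statistic is at least as extreme as the
# observed one.
import random

def permutation_test(a, b, num_permutations=10_000, seed=0):
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    count = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        stat = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if stat >= observed:
            count += 1
    return count / num_permutations  # approximate p-value

p = permutation_test([1, 2, 3, 4, 5], [8, 9, 10, 11, 12])
# The groups clearly differ, so p should be very small.
```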
Persistence Layer
The data access layer in a software application that stores and retrieves data from databases, files, and other storage locations.
Pipeline
A set of tools and processes chained together to automate the flow of data from source to storage, allowing for stages of transformation and analysis in between.
Polyglot Persistence
The use of various, often complementary database technologies to handle varying data storage needs within a given software application.
Polynomial Regression
A type of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial.
Power BI
A business analytics service by Microsoft, providing interactive visualizations with self-service business intelligence capabilities.
Precision
A classification metric: the number of true positive results divided by the number of all results predicted positive (true positives plus false positives).
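A quick illustration of the formula, computed from paired ground-truth labels and predictions:

```python
# Precision = TP / (TP + FP): of everything predicted positive,
# what fraction was actually positive?
def precision(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    return tp / (tp + fp)

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
score = precision(y_true, y_pred)  # 2 TP, 1 FP -> 2/3
```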
Predictive Analytics
The use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data.
Predictive Modeling
The process of creating, testing, and validating a model to best predict the probability of an outcome.
Primary Key
A unique identifier for a record in a database table that helps maintain data integrity.
Principal Component Analysis (PCA)
A dimensionality reduction technique used to emphasize variation and bring out strong patterns in a dataset, often used before fitting a machine learning model to the data.
Probabilistic Data Structure
A high-performance, low-memory data structure that provides approximations to set operations, often used for tasks like membership tests, frequency counting, and finding heavy hitters.
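The classic example is the Bloom filter, which answers membership queries with "definitely not present" or "possibly present" using a fixed-size bit array instead of storing the items themselves. A toy version:

```python
# A toy Bloom filter: k hash positions are set per added item; a lookup
# reports "possibly present" only if all k positions are set. False
# positives are possible; false negatives are not.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
```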
Process
Manipulation of data to convert it from one form to another or to reduce it to a more manageable state.
Process Isolation
A form of data security which prevents running processes from interacting with each other, often used in multitasking operating systems to increase security and stability.
Profile
Generate statistical summaries and distributions of data to understand its characteristics.
Programmatic Advertising
The automated buying and selling of online advertising, optimizing based on algorithms and data.
Projection
A database operation that returns a set of columns (attributes) in a table, reducing the number of columns in the resultant relation.
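In SQL, projection is simply listing the columns you want in `SELECT`. A small demonstration with the standard library's in-memory SQLite database:

```python
# Projection: SELECT only the `name` column, dropping `id` and `email`
# from the resulting relation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@example.com')")

rows = conn.execute("SELECT name FROM users").fetchall()
conn.close()
```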
Protobuf (Protocol Buffers)
Developed by Google, Protocol Buffers is a method for serializing structured data, comparable to XML and JSON but simpler and more compact and efficient than either. Protobuf is language-agnostic, making it highly versatile across different systems.
Prototyping
The process of quickly creating a working model (a prototype) of a part of a system, allowing for faster and more efficient final design and development.
Pseudonymization
A data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers.
Pub/Sub (Publish/Subscribe)
A messaging pattern in which senders (publishers) do not send messages directly to specific receivers (subscribers); instead, messages are grouped into classes called topics, and subscribers receive the topics they have expressed interest in without the publishers knowing who they are.
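The decoupling is easy to see in a minimal in-memory broker (a hypothetical sketch; real systems like Kafka or Google Pub/Sub add durability, ordering, and delivery guarantees):

```python
# A minimal in-memory publish/subscribe broker: publishers send to a
# topic without knowing who, if anyone, is subscribed.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("orders", received.append)
broker.publish("orders", {"id": 1})
broker.publish("shipments", {"id": 2})  # no subscribers: dropped
```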
Pull Request
A method of submitting contributions to an open development project, often used in collaborative development to manage changes from multiple contributors.
Push Notification
A message that pops up on a mobile device or desktop from an app or website, typically used to deliver updates, news, or promotions.
PyTorch
An open-source machine learning library for Python, developed by Facebook’s AI Research lab.
Python Pickle
A module in Python used for serializing and de-serializing Python object structures, converting Python objects into a byte stream.
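A round trip looks like this; note that unpickling untrusted data is unsafe, since the byte stream can execute arbitrary code on load.

```python
# Serialize a Python object to bytes with pickle, then restore an
# equal (but distinct) object.
import pickle

original = {"name": "pipeline", "steps": [1, 2, 3]}
blob = pickle.dumps(original)   # bytes suitable for a file or network
restored = pickle.loads(blob)   # equal to `original`, not the same object
```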
QlikView
A Business Intelligence (BI) tool ideal for data visualization, analytics development, and reporting.
Quantile
Data points that divide a dataset into contiguous intervals of equal probability; the median, quartiles, and percentiles are all quantiles.
Quantum Computing
A type of computation that takes advantage of the quantum states of particles to store information, potentially allowing for the solving of complex problems much faster than classical computers can.
Query Language
A type of computer language that requests and retrieves data from database management systems.
Query Optimization
The process of choosing the most efficient way to execute a SQL statement, typically by rewriting queries and selecting the best available query plan.