Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
Rack Awareness
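A data placement strategy in distributed systems that accounts for which physical rack each node sits in. Placing replicas on different racks keeps data available when a rack or switch fails, and serving reads from the nearest replica reduces latency and cross-rack network traffic.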
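As a rough illustration of the idea, the sketch below spreads copies of a piece of data across racks before reusing any rack. The function name `place_replicas` and the rack and node names are hypothetical; real systems apply their own placement policies.

```python
def place_replicas(nodes_by_rack, replication_factor):
    """Pick (rack, node) pairs, visiting every rack once before reusing a rack."""
    racks = sorted(nodes_by_rack)
    placements = []
    position = 0  # index of the node to take from each rack in the current pass
    while len(placements) < replication_factor:
        progressed = False
        for rack in racks:
            nodes = nodes_by_rack[rack]
            if position < len(nodes) and len(placements) < replication_factor:
                placements.append((rack, nodes[position]))
                progressed = True
        if not progressed:
            raise ValueError("not enough nodes for the requested replication factor")
        position += 1
    return placements


# Hypothetical cluster layout: three racks, four nodes.
cluster = {"rack-a": ["n1", "n2"], "rack-b": ["n3"], "rack-c": ["n4"]}
replicas = place_replicas(cluster, 3)
```

With three replicas and three racks, each copy lands on a different rack, so losing any single rack leaves two copies reachable.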
Radial Basis Function (RBF)
A function whose value depends on the distance between the input and some fixed point, typically used in various areas such as function approximation, time series prediction, and classification.
Random Forest
An ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
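The aggregation step described above can be sketched in a few lines. This covers only how per-tree outputs are combined, not how the trees are trained, and the prediction values are made up for illustration.

```python
from statistics import mean, mode

# Hypothetical outputs of five trained trees for a single input.
tree_class_predictions = ["spam", "ham", "spam", "spam", "ham"]
tree_regression_predictions = [2.0, 3.0, 4.0]

# Classification: the forest returns the most common class among the trees.
forest_class = mode(tree_class_predictions)

# Regression: the forest returns the average of the trees' predictions.
forest_value = mean(tree_regression_predictions)
```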
Range Query
A type of query that retrieves data based on a range of values, typically used in the context of numerical or datetime values.
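A minimal sketch of a range query over dates, using Python's built-in `sqlite3` module. The `orders` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [("2024-01-05", 20.0), ("2024-01-15", 35.5), ("2024-02-01", 12.0), ("2024-02-20", 80.0)],
)

# BETWEEN is inclusive on both ends; ISO-8601 date strings sort chronologically.
rows = conn.execute(
    "SELECT order_date, amount FROM orders "
    "WHERE order_date BETWEEN ? AND ? ORDER BY order_date",
    ("2024-01-10", "2024-02-10"),
).fetchall()
```

An index on the filtered column usually lets the database answer such queries without scanning every row.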
Real-Time Bidding (RTB)
A means by which advertising inventory is bought and sold on a per-impression basis, via programmatic instantaneous auction.
Real-Time Processing
Processing data continuously as it enters a system and producing results within a timeframe short enough to affect the sources of the incoming data.
Recommender System
A subclass of information filtering system that seeks to predict the 'rating' or 'preference' a user would give to an item.
Reconcile
The process of ensuring that two or more datasets are consistent with each other, identifying any discrepancies and resolving them.
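One simple way to find discrepancies is to compare two datasets by key. The sketch below assumes each dataset fits in memory as a dictionary; the `reconcile` helper and the sample records are hypothetical.

```python
def reconcile(source, target):
    """Report keys missing on either side and keys whose values differ."""
    shared_keys = source.keys() & target.keys()
    return {
        "missing_in_target": sorted(source.keys() - target.keys()),
        "missing_in_source": sorted(target.keys() - source.keys()),
        "mismatched": sorted(k for k in shared_keys if source[k] != target[k]),
    }


source = {1: "alice@example.com", 2: "bob@example.com", 3: "carol@example.com"}
target = {1: "alice@example.com", 2: "bob@example.org", 4: "dave@example.com"}
report = reconcile(source, target)
```

Resolving the discrepancies (deciding which side is correct) is a separate, usually business-specific step.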
Record Linkage
The process of finding entries that refer to the same entity in different data sources.
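A minimal sketch of fuzzy record linkage using string similarity from Python's standard `difflib` module. The records, the `link_records` helper, and the 0.85 threshold are assumptions for illustration; production systems typically compare several fields and use blocking to avoid comparing every pair.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity between two names in [0, 1], ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def link_records(left, right, threshold=0.85):
    """Return (left_id, right_id) pairs whose names are at least `threshold` similar."""
    return [
        (l["id"], r["id"])
        for l in left
        for r in right
        if similarity(l["name"], r["name"]) >= threshold
    ]


crm = [{"id": "a1", "name": "Jon Smith"}, {"id": "a2", "name": "Maria Garcia"}]
billing = [{"id": "b1", "name": "John Smith"}, {"id": "b2", "name": "Mary Jones"}]
links = link_records(crm, billing)
```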
Recurrent Neural Network (RNN)
A class of artificial neural networks designed for sequence prediction problems and other tasks where data points have connections to previous points, such as time series analysis and natural language processing.
Reduce
Convert a large set of data into a smaller, more manageable form without significant loss of information.
Redundancy
The duplication of critical components or functions of a system with the intention of increasing reliability of the system, usually in the form of a backup or fail-safe.
Referential Integrity
A property of data requiring that every reference is valid, for example that each foreign key value matches an existing primary key, so that relationships between tables remain consistent.
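The sketch below shows a database rejecting a row that points at a nonexistent parent, using Python's built-in `sqlite3` module. The table names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when this is enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(id))"
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (customer_id) VALUES (1)")  # valid: customer 1 exists

try:
    conn.execute("INSERT INTO orders (customer_id) VALUES (999)")  # no such customer
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```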
Regression Analysis
A statistical process for estimating the relationships among variables, often used for prediction and forecasting, where one variable is dependent on one or more independent variables.
Regular Expression (Regex)
A sequence of characters defining a search pattern, typically used by string-searching algorithms for 'find' or 'find and replace' operations on strings. Regular expressions are widely used in data cleaning and transformation.
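As a small data-cleaning illustration, the sketch below uses Python's `re` module to rewrite differently formatted ten-digit phone numbers into one format. The pattern and sample strings are assumptions chosen for the example and do not cover international formats.

```python
import re

# Three digit groups with optional parentheses and an optional space, dot, or dash between them.
PHONE_PATTERN = re.compile(r"\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})")


def normalize_phone(text):
    """Rewrite every matching phone number in `text` as NNN-NNN-NNNN."""
    return PHONE_PATTERN.sub(r"\1-\2-\3", text)


raw_values = ["(555) 123-4567", "555.123.4567", "5551234567"]
cleaned = [normalize_phone(value) for value in raw_values]
```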
Regularization
A technique used to prevent overfitting in a machine learning model by adding a penalty term to the model's loss function. The most commonly used forms are L1 and L2 regularization.
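A minimal sketch of a penalized loss for a linear model, assuming mean squared error as the base loss. The function names and the `l1` and `l2` strength parameters are hypothetical.

```python
def mean_squared_error(weights, rows, targets):
    """Average squared difference between linear predictions and targets."""
    errors = [
        sum(w * x for w, x in zip(weights, row)) - target
        for row, target in zip(rows, targets)
    ]
    return sum(e * e for e in errors) / len(errors)


def regularized_loss(weights, rows, targets, l1=0.0, l2=0.0):
    """Base loss plus an L1 penalty (absolute weights) and an L2 penalty (squared weights)."""
    l1_penalty = l1 * sum(abs(w) for w in weights)
    l2_penalty = l2 * sum(w * w for w in weights)
    return mean_squared_error(weights, rows, targets) + l1_penalty + l2_penalty


weights = [2.0, -1.0]
rows = [[1.0, 0.0], [0.0, 1.0]]
targets = [2.0, -1.0]
plain = regularized_loss(weights, rows, targets)
with_l2 = regularized_loss(weights, rows, targets, l2=0.1)
```

Here the weights fit the data perfectly, so the unpenalized loss is zero and any remaining loss comes purely from the penalty on large weights.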
Reinforcement Learning
A type of machine learning where an agent learns how to behave in an environment by performing certain actions and receiving rewards or penalties in return.
Relational Algebra
A theoretical set of mathematical principles and concepts forming the foundational basis for implementing and optimizing queries in Relational Database Management Systems.
Relational Database
A type of database that stores data in structured tables and is based on the relational model.
Relational Model
A database model based on first-order predicate logic, serving as the basis for relational databases, where all data is represented in terms of tuples, grouped into relations.
Repartition
Redistribute data across multiple partitions for improved parallelism and performance.
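A rough sketch of hash-based repartitioning in plain Python, assuming records are dictionaries and partitions are lists. The `repartition` helper is hypothetical; engines such as Spark provide their own implementations.

```python
import zlib


def repartition(partitions, num_partitions, key):
    """Redistribute records so that records with the same key share a partition."""
    new_partitions = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for record in partition:
            # crc32 gives a hash that is stable across runs, unlike the built-in hash() for strings.
            digest = zlib.crc32(str(key(record)).encode("utf-8"))
            new_partitions[digest % num_partitions].append(record)
    return new_partitions


skewed = [
    [{"user": "ann", "n": 1}, {"user": "bo", "n": 2}, {"user": "ann", "n": 3}, {"user": "cy", "n": 4}],
    [],
]
balanced = repartition(skewed, 3, key=lambda record: record["user"])
```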
Replica Set
A group of database nodes that maintains the same data set, providing redundancy and increasing data availability with multiple copies of data on different database servers.
Representation Learning
An area of machine learning where automatic feature learning from raw data is explored, aimed at identifying better representations and improving model generalization.
Request-Response
A message exchange pattern in which a requester sends a request message to a replier system, which then sends a response message in return.
Reshape
Change the structure of data to better fit specific analysis or modeling requirements.
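One common reshape is turning wide data (one column per period) into long data (one row per observation). The sketch below does this in plain Python; the `melt` helper and the sample rows are hypothetical, and libraries such as pandas offer equivalent built-in operations.

```python
def melt(rows, id_field, var_name, value_name):
    """Turn every non-id column of each row into its own (id, variable, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_field:
                long_rows.append({id_field: row[id_field], var_name: column, value_name: value})
    return long_rows


wide = [
    {"city": "Oslo", "2022": 10, "2023": 12},
    {"city": "Lima", "2022": 7, "2023": 9},
]
long_rows = melt(wide, id_field="city", var_name="year", value_name="sales")
```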
Resilient Distributed Dataset (RDD)
A fault-tolerant collection of elements that can be processed in parallel; the fundamental data structure of Apache Spark.
Response Variable
The variable that is being predicted or modeled, often denoted as the dependent variable or output variable.
Ridge Regression
A regularization technique for analyzing multiple regression data that suffer from multicollinearity, shrinking the coefficients of the model towards zero to stabilize them.
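To show the shrinkage effect, the sketch below solves ridge regression in the simplest setting: one feature and no intercept, where minimizing sum((y - w*x)^2) + alpha * w^2 gives w = sum(x*y) / (sum(x^2) + alpha). The function name and data are assumptions for illustration.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form ridge estimate for a single feature without an intercept."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
ordinary = ridge_slope(xs, ys, alpha=0.0)   # no penalty: ordinary least squares
shrunk = ridge_slope(xs, ys, alpha=14.0)    # penalty pulls the coefficient toward zero
```

With several correlated features the same penalty is applied to all coefficients, which is what stabilizes the estimates under multicollinearity.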
Risk Analysis
The process of identifying and analyzing potential issues that could negatively impact key business initiatives or projects.
Rollback
An operation by which the database management system undoes the changes made by a failed or partially completed transaction, returning the database to its previous consistent state.
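A minimal sketch of a rollback using Python's built-in `sqlite3` module. The `accounts` table and `transfer` helper are invented for the example; a CHECK constraint makes the second statement fail so the first one has to be undone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move `amount` between accounts; undo everything if any step fails."""
    try:
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, receiver))
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, sender))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()  # discards the credit that was already applied to the receiver
        return False


succeeded = transfer(conn, "alice", "bob", 500)  # alice cannot cover 500
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```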
Root Mean Square Error (RMSE)
A standard way to measure the error of a model in predicting quantitative data. It is the square root of the average of the squared differences between predicted and observed values.
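The definition translates directly into code. The sketch below is one straightforward way to compute it in plain Python; the sample values are made up.

```python
import math


def rmse(predicted, observed):
    """Square root of the mean squared difference between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Three exact predictions and one that is off by 4: mean squared error is 16 / 4 = 4.
error = rmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0])
```

Because errors are squared before averaging, a single large miss raises RMSE more than several small ones.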
Routing
The process of selecting a path for traffic in a network or between or across multiple networks, based on routing table information.
Row-Level Security (RLS)
A method of restricting access at the database row level, based on parameters such as user roles or identity, enabling fine-grained access control.
Ruby on Rails
A server-side web application framework written in Ruby. It follows the model-view-controller (MVC) pattern and provides default structures for a database, a web service, and web pages.
SQL (Structured Query Language)
A standardized programming language used for managing and querying relational databases.
SQL Injection
A code injection technique, used to attack data-driven applications, in which malicious SQL statements are inserted into an entry field for execution.
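The sketch below contrasts a query built by string concatenation with a parameterized query, using Python's built-in `sqlite3` module. The `users` table and the input string are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", "s3cret"), ("bob", "hunter2")])

malicious_input = "' OR '1'='1"

# Vulnerable: the input becomes part of the SQL text, so the WHERE clause is always true.
unsafe_sql = "SELECT name FROM users WHERE name = '" + malicious_input + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: the placeholder passes the input as a value, never as SQL syntax.
safe_rows = conn.execute("SELECT name FROM users WHERE name = ?", (malicious_input,)).fetchall()
```

The concatenated query returns every user, while the parameterized query returns nothing because no user has that literal name.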
Sample
Extract a subset of data for exploratory analysis or to reduce computational complexity.
Sampling
The process of selecting a subset of elements from a larger set to approximate the properties of the whole set, often used for statistical analysis.
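A minimal sketch of simple random sampling without replacement, using Python's standard `random` module. The population and the seed value are arbitrary choices for the example; seeding makes the sample reproducible.

```python
import random

population = list(range(1, 1001))

rng = random.Random(42)                  # fixed seed so the same sample is drawn on every run
sample = rng.sample(population, k=100)   # 100 distinct elements chosen uniformly at random

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)  # an estimate of the population mean
```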
Sandboxing
A security mechanism used to run an application in a confined environment, isolating it from the system, preventing it from causing harm or accessing sensitive data.
Scalability
The capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth.
Scalar
A quantity represented by a single element in the corresponding field, usually a single number, as opposed to a vector or matrix.
Schema
The organization or structure for a database, defining tables, fields, relationships, indexes, etc.
Schema Evolution
The ability of a database system to handle changes in a database schema, especially relevant for systems that require flexibility and adaptability to changing data requirements.
Schema Mapping
Translate data from one schema or structure to another to facilitate data integration.
Schema-on-Read
A strategy where data structure is applied or inferred at read time, typically used in big data processing where the structure is not defined in advance and the data is instead interpreted when it is analyzed.
Schema-on-Write
A strategy where data structure is defined before writing data, typically used in relational databases where data must conform to a known schema before it's written to disk.
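The contrast between schema-on-write and schema-on-read can be sketched with JSON records. Both helper functions and the two-field schema are hypothetical and stand in for what a database or query engine would do.

```python
import json

WRITE_SCHEMA = {"id": int, "name": str}


def validate_on_write(record):
    """Schema-on-write: reject a record before storing it if it does not match the schema."""
    for field, expected_type in WRITE_SCHEMA.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} must be of type {expected_type.__name__}")
    return json.dumps(record)


def read_with_schema(raw_line, schema):
    """Schema-on-read: store anything, then pick and cast fields when the data is read."""
    data = json.loads(raw_line)
    return {field: cast(data[field]) if field in data else None for field, cast in schema.items()}


stored = validate_on_write({"id": 1, "name": "Ada"})
raw = '{"id": "7", "name": "Ada", "extra": true}'   # accepted as-is at write time
parsed = read_with_schema(raw, {"id": int, "name": str})
```

Schema-on-write catches bad data early at the cost of flexibility, while schema-on-read defers that cost to every reader.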
Scikit-learn
A free software machine learning library for the Python programming language. It features a range of classification, regression, and clustering algorithms, along with efficient tools for data mining and data analysis.
Scraping
The process of extracting data from websites, converting it from unstructured to structured form.
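As a small illustration of turning markup into structured records, the sketch below extracts links from an HTML string with Python's standard `html.parser` module. The `LinkExtractor` class and the HTML snippet are made up; fetching pages and respecting a site's terms of use are outside its scope.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the text and href of every <a> element that has an href attribute."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append({"text": text, "url": self._current_href})
            self._current_href = None


html = '<ul><li><a href="/docs">Docs</a></li><li><a href="/blog">Blog</a></li></ul>'
parser = LinkExtractor()
parser.feed(html)
links = parser.links
```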
Scrub
A process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated, also known as data cleansing.