Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
Inversion of Control (IoC)
A design principle in which the custom-written portions of a computer program receive the flow of control from a generic, reusable library.
Isolation Levels
Different configurations used in databases to trade off consistency for performance, such as Read Uncommitted, Read Committed, Repeatable Read, and Serializable.
Iterative Model
A software development model that involves repeating the same set of activities for each portion of the project, allowing refinement with each iteration.
JSON (JavaScript Object Notation)
A lightweight, text-based, and human-readable data interchange format used for representing structured data. It is based on a subset of the JavaScript Programming Language and is easy for humans to read and write and for machines to parse and generate.
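For example, a minimal round trip with Python's built-in `json` module:

```python
import json

# Serialize a Python dict to a JSON string, then parse it back.
record = {"name": "orders", "rows": 1200, "partitioned": True}
encoded = json.dumps(record)   # '{"name": "orders", "rows": 1200, "partitioned": true}'
decoded = json.loads(encoded)
```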
Java Database Connectivity (JDBC)
An API for the Java programming language that defines how a client may access a database, providing methods to query and update data in a database.
Jenkins
An open-source automation server, helping to automate parts of the software development process.
Join Operation
A SQL operation used to combine rows from two or more tables based on a related column between them.
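For example, an inner join between two related tables, sketched with Python's built-in `sqlite3` module (table and column names are illustrative):

```python
import sqlite3

# In-memory database with two tables related by customer id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Join orders to customers on the related column, then aggregate per customer.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
# rows -> [('Ada', 65.0), ('Grace', 15.0)]
```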
Jupyter Notebook
An open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.
Just-In-Time Compilation (JIT)
A way of executing computer code that involves compilation during the execution of a program at runtime rather than prior to execution, improving the execution efficiency.
K-Means Clustering
A partitioning method that divides a dataset into subsets (clusters), where each data point belongs to the cluster with the nearest mean.
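A naive one-dimensional sketch of the algorithm in plain Python (initializing the means from the first k points for simplicity, which a production implementation would not do):

```python
def kmeans_1d(points, k, iters=20):
    """Naive 1-D k-means: assign each point to the nearest mean, recompute means."""
    means = points[:k]  # simplistic initialization for illustration
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # Recompute each mean; keep the old mean if a cluster went empty.
        means = [sum(c) / len(c) if c else means[i] for i, c in enumerate(clusters)]
    return sorted(means)
```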
K-Nearest Neighbors (KNN)
A simple, supervised machine learning algorithm used for classification and regression, which predicts the classification or value of a new point based on the K nearest points.
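A minimal classification sketch in Python for 2-D points (majority vote among the k nearest neighbors by squared Euclidean distance):

```python
from collections import Counter

def knn_predict(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest labeled neighbors.
    train is a list of ((x, y), label) pairs."""
    by_dist = sorted(train, key=lambda item: (item[0][0] - new_point[0]) ** 2
                                             + (item[0][1] - new_point[1]) ** 2)
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]
```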
Kafka
An open-source stream processing platform developed at LinkedIn and donated to the Apache Software Foundation, designed for high throughput, fault tolerance, and scalability.
Key Performance Indicator (KPI)
A type of performance measurement that evaluates the success of an organization, employee, etc., in achieving objectives.
Key-Value Store
A type of NoSQL database that uses a simple key/value method to store data, suitable for storing large amounts of data.
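The core interface is just `put` and `get` by key; a minimal in-memory sketch (a real store adds persistence, replication, and expiry):

```python
class KeyValueStore:
    """Minimal in-memory key-value store backed by a dict."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)
```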
Kibana
An open-source data visualization dashboard for Elasticsearch, providing visualization capabilities on top of the content indexed in Elasticsearch clusters.
Knowledge Graph
A knowledge base used to store complex structured and unstructured information used by machines and humans to enhance search and understand relationships and properties of the data.
Kubernetes
An open-source platform designed to automate deploying, scaling, and operating application containers, allowing for easy management of containerized applications across multiple hosts.
Lambda Architecture
A data processing architecture designed to handle massive quantities of data by combining batch processing and stream processing, providing a balance between latency, throughput, and fault-tolerance.
Latent Semantic Analysis (LSA)
A technique in natural language processing and information retrieval to discover relationships between words and the concepts they form.
Lazy Loading
A design pattern used in computer programming to defer initialization of an object until the point at which it is needed.
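A common sketch of the pattern in Python, where an expensive read is deferred until the attribute is first accessed (`_read_rows` is a hypothetical stand-in for real I/O):

```python
class Dataset:
    """Defers loading rows until the first access."""
    def __init__(self, path):
        self.path = path
        self._rows = None  # nothing loaded yet

    def _read_rows(self):
        # Placeholder for an expensive read from disk or network.
        return [f"row-{i}" for i in range(3)]

    @property
    def rows(self):
        if self._rows is None:       # load only on first access
            self._rows = self._read_rows()
        return self._rows
```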
Lineage
A record of how data moves through a pipeline, including its origin, transformations, dependencies, and ultimate consumption.
Linear Regression
A statistical method used to model the relationship between a dependent variable and one or more independent variables, predicting outcomes.
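For a single predictor, the ordinary-least-squares fit has a closed form, sketched here in plain Python:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x
```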
Linearizability
A consistency guarantee ensuring that each individual operation on a distributed system appears to take effect instantaneously, at some point between its start and completion.
Linearize
The process of transforming the relationship between variables so that a dataset becomes approximately linear.
Linked Data
A method of publishing structured data so that it can be interlinked and become more useful, leveraging the structure of the data to enhance its usability and discoverability.
Load Balancer
A device or software function that distributes network or application traffic across multiple servers, optimizing resource use, maximizing throughput, minimizing response time, and avoiding overload.
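Round robin is one of the simplest balancing policies; a minimal sketch in Python (server names are hypothetical):

```python
import itertools

class RoundRobinBalancer:
    """Cycles through backend servers so requests spread evenly."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        # Return the next server in rotation.
        return next(self._cycle)
```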
Load Shedding
The process of reducing the load on a system by restricting the number of incoming requests.
Load Testing
A type of non-functional testing conducted to understand the behavior of the application under a specific expected load, identifying the maximum operating capacity of an application and any bottlenecks.
Localization
The process of adapting internationalized software for a specific region or language by adding locale-specific components and translating text.
Locking
A mechanism employed by RDBMSs to regulate data access in multi-user environments, ensuring the integrity of data by preventing multiple users from altering the same data at the same time.
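The same idea applies at the application level; a sketch using Python's `threading.Lock` to prevent lost updates when several threads mutate shared state:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:          # only one thread may mutate counter at a time
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 40_000; without the lock, interleaved updates could leave it lower
```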
Log Files
Files that record either events occurring in an operating system or other software, or messages exchanged between users of communication software.
Log Mining
A process that involves analyzing log files from different sources to uncover insights, which can be used for various purposes such as security, performance monitoring, and user behavior analysis.
Logistic Regression
A statistical method used to analyze a dataset and predict binary outcomes, utilizing a logistic function to model a binary dependent variable.
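A minimal one-feature sketch using stochastic gradient descent on the log loss (the learning rate and epoch count here are arbitrary illustrative choices):

```python
import math

def train_logistic(points, labels, lr=0.5, epochs=500):
    """Fit w, b for P(y=1 | x) = sigmoid(w*x + b) by stochastic gradient descent."""
    w = b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))   # predicted probability
            w -= lr * (p - y) * x                       # gradient of the log loss
            b -= lr * (p - y)
    return w, b

def predict(w, b, x):
    """Predict class 1 when the modeled probability is at least 0.5."""
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5 else 0
```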
Logstash
A server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a 'stash' like Elasticsearch.
Long Short-Term Memory (LSTM)
A special kind of recurrent neural network (RNN) capable of learning long-term dependencies, particularly useful when important events in a sequence are separated by long time lags.
Long-Polling
A web communication technique where the client requests information from the server, and the server holds the request open until new information is available.
Lookup Table
A table used to map input values to output values, replacing a runtime computation or join with a simple retrieval operation.
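In Python a small lookup table is often just a dict; a sketch enriching records with hypothetical country names:

```python
# Dimension-style lookup table mapping country codes to names (illustrative values).
country_names = {"US": "United States", "DE": "Germany", "JP": "Japan"}

orders = [("order-1", "DE"), ("order-2", "US")]
# Look up each code, with a fallback for codes missing from the table.
enriched = [(order_id, country_names.get(code, "Unknown")) for order_id, code in orders]
```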
Loss Function
A function used in optimization to measure the difference between the predicted value and the actual value, guiding the model training process.
Low Latency
Characterized by a short delay from input into a system to the desired outcome, crucial in systems requiring real-time response.
Luigi
An older Python module that helps you build basic pipelines of batch jobs.
Machine Learning
A method of data analysis that automates analytical model building, enabling systems to learn from data, identify patterns, and make decisions.
Machine Learning Operations (MLOps)
A practice for collaboration and communication between data scientists and operations professionals to help manage the production machine learning lifecycle.
Machine Learning Pipeline
A sequence of data processing and machine learning tasks, assembled to create a model, with each step in the sequence processing the data and passing it on to the next step.
Machine-to-Machine (M2M)
Direct communication between devices using any communications channel, including wired and wireless.
MapR
A comprehensive data platform offering the speed, scale, and reliability required by enterprise-grade applications.
MapReduce
A programming model for processing and generating large datasets in parallel with a distributed algorithm on a cluster, initially developed by Google.
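The classic example is word count; a sketch of the map and reduce phases as plain Python functions (a real MapReduce system distributes each phase across a cluster):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Map: emit a (word, 1) pair for each word in a document."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    """Reduce: sum counts per key (the shuffle step groups pairs by word)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big pipelines", "data pipelines data"]
word_counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
# word_counts -> {'big': 2, 'data': 3, 'pipelines': 2}
```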
Markdown
A lightweight markup language with plain text formatting syntax designed for creating rich text using a plain text editor.
Master Data Management (MDM)
A method that defines and manages the critical data of an organization to provide a single point of reference across the organization.
Materialized View
A database object that stores the results of a query as a separate schema object, providing faster access to table data than re-running the query.
Mean Squared Error (MSE)
A measure of the average of the squares of the errors, used as a risk metric corresponding to the expected value of the squared (quadratic) error or loss.
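The computation itself is a one-liner in Python:

```python
def mse(predicted, actual):
    """Mean of the squared differences between predictions and observations."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
```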
Median
A measure of central tendency representing the middle value of a sorted list of numbers, separating the higher half from the lower half of the data set.
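For example, in Python (averaging the two middle values when the list has even length):

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if even length."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
```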