Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
Message Passing
A method by which information is communicated between distributed or parallel processes in a computer system.
Message Queue
A form of asynchronous service-to-service communication used in serverless and microservices architectures.
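The pattern can be sketched in-process with Python's standard library (a real deployment would use a broker such as RabbitMQ or Amazon SQS; the message names here are illustrative):

```python
import queue
import threading

# Minimal producer/consumer sketch using the stdlib queue.Queue:
# the producer enqueues messages without waiting for the consumer.
q = queue.Queue()
results = []

def consumer():
    while True:
        msg = q.get()
        if msg is None:   # sentinel meaning "no more messages"
            break
        results.append(msg.upper())
        q.task_done()

t = threading.Thread(target=consumer)
t.start()

for msg in ["order.created", "order.paid"]:
    q.put(msg)            # asynchronous: put() returns immediately
q.put(None)
t.join()

print(results)  # ['ORDER.CREATED', 'ORDER.PAID']
```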
MessagePack
A compact binary serialization format that encodes objects and their fields more efficiently than JSON, making it useful when performance and bandwidth are concerns.
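The size advantage of binary encodings can be illustrated with the stdlib `struct` module (this is a rough stand-in, not the MessagePack format itself, which is provided by the third-party `msgpack` package and adds self-describing type tags):

```python
import json
import struct

# Compare a text encoding (JSON) with a fixed binary layout for the
# same record: a 4-byte int plus an 8-byte float is 12 bytes total.
record = {"id": 12345, "temp": 21.5}

text = json.dumps(record).encode("utf-8")
binary = struct.pack("<id", record["id"], record["temp"])

print(len(text), len(binary))  # the binary form is markedly smaller
```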
Metadata
Data that provides information about other data, such as data structure or content details.
Metadata Management
The administration of data that describes other data, involving establishing and managing descriptions, definitions, scope, ownership, and other characteristics of metadata.
Micro-Batching
A data processing method that deals with relatively small batches of data, providing a middle ground between batch processing and stream processing.
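A micro-batch loop can be sketched in a few lines of Python (batch size and the event source are illustrative):

```python
import itertools

# Consume a stream in small fixed-size batches: each batch is
# processed as a unit, trading a little latency for throughput.
def micro_batches(stream, batch_size=3):
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

events = range(8)
print(list(micro_batches(events)))  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```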
Microservices
A software development technique that structures an application as a collection of loosely coupled services, allowing for improved scalability and ease of updates.
Microservices Architecture
An architectural style that structures an application as a collection of services, which are highly maintainable and testable, loosely coupled, independently deployable, and precisely scoped.
Microsoft Azure
A cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through Microsoft-managed data centers.
Microsoft SSIS (SQL Server Integration Services)
A platform for data integration and workflow applications.
Middleware
Software that acts as a bridge between an operating system or database and applications, enabling communication and data management.
Migrate
The process of transferring data between storage types, formats, or computer systems, usually performed programmatically.
Mine
Extract useful information, patterns, or insights from large volumes of data using statistics and machine learning.
Model Deployment
The integration of a machine learning model into an existing production environment to make practical business decisions based on data.
Model Selection
The task of selecting a statistical model from a set of candidate models, based on the performance of the models on a given dataset.
Model Validation
The process of assessing how well your model performs at making predictions on new data, by using various metrics and statistical methods.
Monitor
Track data processing metrics and system health to ensure high availability and performance.
Monitoring
The process of observing and checking the quality or content of data over a period, aimed at detecting patterns, performance, failures, or other attributes.
Multi-Cloud
The use of multiple cloud computing and storage services in a single network architecture, utilized by businesses to spread computing resources and minimize the risk of data loss or downtime.
Multi-tenancy
A mode of software operation in which multiple independent instances of one or more applications operate in a shared environment.
Multidimensional Scaling (MDS)
A means of visualizing the level of similarity of individual cases of a dataset, used in information visualization to detect patterns in high-dimensional data.
Multilabel Classification
A type of classification task where each instance (or data point) can belong to multiple classes, as opposed to just one in the traditional case.
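Multilabel targets are commonly represented as binary indicator vectors, one slot per label; a minimal sketch (the label set and examples are made up):

```python
# Each instance may carry several labels at once, so its target is a
# 0/1 vector over the full label set rather than a single class.
LABELS = ["politics", "sports", "tech"]

def to_indicator(tags):
    return [1 if label in tags else 0 for label in LABELS]

print(to_indicator({"tech", "politics"}))  # [1, 0, 1]
print(to_indicator({"sports"}))            # [0, 1, 0]
```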
Multilayer Perceptron (MLP)
A class of feedforward artificial neural networks consisting of at least three layers of nodes, used for classification and regression.
Multithreading
The ability of a CPU, or a single core in a multi-core processor, to provide multiple threads of execution concurrently.
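A minimal sketch with Python's `threading` module (thread count and loop size are arbitrary); note the lock guarding the shared counter against race conditions:

```python
import threading

counter = 0
lock = threading.Lock()

def worker():
    global counter
    for _ in range(1000):
        with lock:        # serialize access to the shared counter
            counter += 1

# Four threads of execution run concurrently within one process.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 4000
```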
Mutability
The capability of an object to be altered or changed, often used in contrast with immutability, which refers to the incapacity to be changed.
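The contrast is easy to show with Python's built-in types, where lists are mutable and tuples are immutable:

```python
# Lists allow in-place changes; tuples reject them.
mutable = [1, 2, 3]
mutable[0] = 99              # succeeds: the list is altered in place
assert mutable == [99, 2, 3]

immutable = (1, 2, 3)
try:
    immutable[0] = 99        # raises: tuples cannot be changed
except TypeError as exc:
    print("tuple is immutable:", exc)
```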
N+1 Query Problem
A common performance problem in applications that use ORMs to fetch data; it occurs when the system retrieves related objects in a separate query for each object, leading to a high number of executed SQL queries.
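The problem can be demonstrated with the stdlib `sqlite3` module (the schema and data are made up): fetching each author's books individually issues 1 + N queries, while a single JOIN fetches everything at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO books VALUES (1, 1, 'B1'), (2, 1, 'B2'), (3, 2, 'B3');
""")

# N+1 pattern: one query for the authors, then one more per author.
queries = 0
authors = conn.execute("SELECT id, name FROM authors").fetchall()
queries += 1
for author_id, _name in authors:
    conn.execute("SELECT title FROM books WHERE author_id = ?",
                 (author_id,)).fetchall()
    queries += 1
print("N+1 issues", queries, "queries")  # 3 queries for 2 authors

# Fix: a single JOIN replaces all of them.
rows = conn.execute("""
    SELECT a.name, b.title
    FROM authors a JOIN books b ON b.author_id = a.id
""").fetchall()
print("JOIN issues 1 query,", len(rows), "rows")
```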
Named Entity Recognition (NER)
A subtask of information extraction that locates and classifies named entities in text into pre-defined categories such as person names, organizations, and locations.
Namespace
A container that holds a set of identifiers to help avoid collisions between identifiers with the same name.
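Python modules are a familiar example: two modules can define the same identifier without colliding, because each lives in its own namespace.

```python
import math
import cmath

# Both modules define a function named `sqrt`, but the module
# namespaces (math. vs cmath.) keep the two identifiers distinct.
print(math.sqrt(4))    # 2.0 (real square root)
print(cmath.sqrt(-4))  # 2j  (complex square root)
```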
Natural Language Processing (NLP)
A field of artificial intelligence that focuses on the interaction between computers and humans through natural language, enabling computers to understand, interpret, and generate human language.
Naïve Bayes Classifier
A family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features.
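A from-scratch sketch on a toy spam/ham dataset shows the idea: multiply the class prior by per-word likelihoods, treating the words as independent (real projects would reach for a library such as scikit-learn; the training data here is invented, and add-one smoothing handles unseen words):

```python
import math
from collections import Counter, defaultdict

train = [
    ("spam", "win money now"),
    ("spam", "win prize now"),
    ("ham", "meeting at noon"),
    ("ham", "lunch at noon"),
]

class_counts = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)
vocab = set()
for label, text in train:
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def predict(text):
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(class_counts[label] / len(train))
        for word in text.split():
            score += math.log((word_counts[label][word] + 1)
                              / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("win money"))      # spam
print(predict("lunch meeting"))  # ham
```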
Network Partition
A network failure that divides a network into two or more disconnected sub-networks due to the failure of network links or devices.
Neural Network
A set of algorithms, modeled loosely after the human brain, designed to recognize patterns in data through machine learning.
NoSQL
Non-relational databases designed for scalability, schema flexibility, and optimized performance in specific use-cases.
NoSQL Database
A non-relational database that allows for storage and processing of large amounts of unstructured data and is designed for distributed data stores where very large-scale processing is needed.
Normality Testing
Assess the normality of data distributions to ensure validity and reliability of statistical analysis.
Normalization
The process of organizing the columns (attributes) and tables (relations) of a relational database to reduce redundancy and dependency.
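A small before/after sketch with the stdlib `sqlite3` module (the schema is made up): a flat table repeats each customer's city on every order, while the normalized design stores that fact exactly once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized: the customer's city is repeated on every order.
    CREATE TABLE orders_flat (order_id INTEGER, customer TEXT, city TEXT);
    INSERT INTO orders_flat VALUES (1, 'Ada', 'London'), (2, 'Ada', 'London');

    -- Normalized: each customer fact is stored once and referenced.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (order_id INTEGER,
                         customer_id INTEGER REFERENCES customers(id));
    INSERT INTO customers VALUES (1, 'Ada', 'London');
    INSERT INTO orders VALUES (1, 1), (2, 1);
""")

# A JOIN reconstructs the flat view without the stored duplication.
rows = conn.execute("""
    SELECT o.order_id, c.name, c.city
    FROM orders o JOIN customers c ON c.id = o.customer_id
""").fetchall()
print(rows)
```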
Normalize
Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.
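One common form is min-max normalization, which rescales values into the range [0, 1] (the input numbers are illustrative):

```python
# Rescale each value relative to the observed minimum and maximum;
# assumes the input contains at least two distinct values.
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```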
Null Hypothesis
A general statement or default position that there is no relationship between two measured phenomena, to be tested and refuted in the process of statistical hypothesis testing.
NumPy
A library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
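A quick sketch of the vectorized array operations NumPy provides (assuming `numpy` is installed):

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])

print(a * 10)          # elementwise scaling of every entry
print(a.mean(axis=0))  # column means: [2. 3.]
print(a @ a)           # matrix product of a with itself
```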
OLAP (Online Analytical Processing)
A category of software tools that allows users to analyze data across multiple dimensions of a database.
OLAP Cube
A multi-dimensional array of data used for complex calculations, enabling users to drill down into multiple levels of hierarchical data, making it a key technology for data analysis and reporting.
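The roll-up behavior of a cube can be sketched with a plain dictionary, using `None` as the "all" member of each dimension (the sales figures and dimensions are made up):

```python
from collections import defaultdict
from itertools import product

sales = [
    ("EU", "widget", 10), ("EU", "gadget", 5),
    ("US", "widget", 7), ("US", "widget", 3),
]

# Aggregate every (region, product) combination, including the
# rolled-up totals where a dimension is collapsed to None ("all").
cube = defaultdict(int)
for region, prod, amount in sales:
    for r, p in product((region, None), (prod, None)):
        cube[(r, p)] += amount

print(cube[("US", "widget")])  # 10 -- one cell
print(cube[(None, "widget")])  # 20 -- widgets across all regions
print(cube[(None, None)])      # 25 -- grand total
```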
OLTP (Online Transaction Processing)
A type of processing that facilitates and manages transaction-oriented applications.
ORC (Optimized Row Columnar)
A columnar storage file format optimized for heavy read access and well suited to storing and processing big data workloads. It is highly compressed and efficient, reducing the storage space needed for large datasets.
Object Storage
A storage architecture that manages data as objects, as opposed to other storage architectures like file systems or block storage.
Object-Relational Mapping (ORM)
A programming technique to convert data between incompatible type systems in object-oriented programming languages.
Observability
The ability to understand the internal state of a system from its external outputs, crucial in modern computing environments to ensure the reliability, availability, and performance of systems.
One-Hot Encoding
A process of converting categorical data variables so they can be provided to machine learning algorithms to improve predictions.
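A minimal sketch without external libraries (the category list is illustrative): each category maps to a vector with a single 1 in its position.

```python
categories = ["red", "green", "blue"]
index = {cat: i for i, cat in enumerate(categories)}

# One-hot: a zero vector with a 1 at the category's position.
def one_hot(value):
    vec = [0] * len(categories)
    vec[index[value]] = 1
    return vec

print(one_hot("green"))  # [0, 1, 0]
```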
Online Analytical Processing (OLAP)
A category of software tools that analyze data from various database perspectives and enable users to interactively analyze multidimensional data from multiple perspectives.
Ontology
A representation of a set of concepts within a domain and the relationships between those concepts, used to reason about the entities within that domain.
Open Database Connectivity (ODBC)
A standard application programming interface (API) for accessing database management systems.