Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
A/B Testing
A statistical hypothesis test for a randomized experiment with two variants, A and B, used to compare two models or strategies and determine which performs better.
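As an illustrative sketch, a simple A/B comparison of conversion rates can be run as a two-proportion z-test; the counts below are hypothetical:

```python
from math import sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic comparing conversion rates of variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error
    return (p_b - p_a) / se

# Hypothetical experiment: B converted 120/1000 visitors vs. 100/1000 for A.
z = two_proportion_z(100, 1000, 120, 1000)
# |z| > 1.96 would indicate significance at the 5% level (two-sided).
```

Here the lift is real but small, so z lands below 1.96 and the experiment would need more traffic to reach significance.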
ACID Properties
The set of properties of database transactions intended to guarantee validity even in the event of errors or failures, encompassing Atomicity, Consistency, Isolation, and Durability.
API (Application Programming Interface)
A set of rules and definitions that allow different software entities to communicate with each other.
AWS Step Functions
A service that lets you coordinate AWS components, applications, and microservices using visual workflows.
Agile Methodology
An iterative approach to software development and project management that prioritizes flexibility and customer satisfaction, often used by data engineering teams to manage projects.
Alation
A machine learning data catalog that helps people find, understand, and trust the data.
Align
Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.
Amazon DynamoDB
A managed NoSQL database service provided by Amazon Web Services.
Amazon Kinesis
A platform to stream data on AWS, offering powerful services to make it easy to load and analyze streaming data.
Amazon Redshift
A fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL.
Amazon Web Services (AWS)
Offers a broad set of global cloud-based products including compute, storage, databases, analytics, networking, mobile, developer tools, and more.
Annotation
The process of adding metadata or explanatory notes to data, often used in machine learning to create labeled data for training models.
Anomaly Detection
The process of identifying data points or events that deviate significantly from expected patterns or behaviors.
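One minimal approach (among many) is z-score thresholding, sketched here with the standard library; the sensor readings are made up for illustration:

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 55.0]  # 55.0 is the outlier
print(zscore_anomalies(readings, threshold=2.0))     # [55.0]
```

Real pipelines typically use rolling statistics or model-based detectors, since a single global mean is easily skewed by the outliers themselves.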
Apache Airflow
A platform to programmatically author, schedule, and monitor workflows of tasks.
Apache Arrow
Apache Arrow is a cross-language development platform for in-memory data that specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
Apache Atlas
A scalable and extensible set of core foundational governance services, enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop.
Apache Camel
An open-source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data.
Apache Flink
A framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
Apache Hadoop
A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Apache Kafka
A distributed streaming platform capable of handling trillions of events a day.
Apache NiFi
A tool designed to automate the flow of data between software systems.
Apache Pulsar
A highly scalable, low-latency messaging platform running on commodity hardware.
Apache Samza
A stream processing framework for running applications that process data as it is created.
Apache Spark
A fast and general-purpose cluster computing system, providing high-level APIs in Java, Scala, Python, and R.
Apache Storm
A free and open-source distributed real-time computation system.
Append
Adding or attaching new records or data items to the end of an existing dataset, database table, file, or list.
Archive
The practice of moving rarely accessed data to a low-cost, long-term storage solution, both to reduce costs and to retain data for compliance.
Argo
An open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes.
Association Rule Mining
A machine learning method aimed at identifying interesting relations between variables (items or events) in large databases, frequently used for market basket analysis.
Augmented Data Management
The use of AI and ML technologies to optimize and enhance data management tasks, improving data quality and metadata development.
Auto-materialize
The automatic execution of computations and the persistence of their results.
Automated Machine Learning (AutoML)
The process of automating the end-to-end process of applying machine learning to real-world problems, facilitating the development of ML models by experts and non-experts alike.
Avro
A binary serialization format developed within the Apache Hadoop project, compact, fast, and suitable for serializing large amounts of data. It uses JSON for defining data types and protocols, and it serializes data in a compact binary format.
BSON (Binary JSON)
A binary-encoded serialization of JSON-like documents used to store documents and make remote procedure calls in MongoDB. BSON supports embedded documents and arrays, offering additional data types not supported by JSON.
Backend-as-a-Service (BaaS)
A cloud computing service model that serves as the middleware that provides developers with ways to connect their web and mobile applications to cloud services via application programming interfaces (APIs) and software developers' kits (SDKs).
Backpressure
A mechanism to handle situations where data is produced faster than it can be consumed.
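A bounded buffer is the simplest form of backpressure: when the buffer fills, the producer blocks until the consumer catches up. A toy sketch using Python's standard library:

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=5)   # bounded buffer: the source of backpressure
consumed = []

def producer():
    for i in range(20):
        buf.put(i)             # blocks while the buffer is full, slowing
                               # the producer to the consumer's pace

def consumer():
    for _ in range(20):
        item = buf.get()
        time.sleep(0.01)       # simulate a slow consumer
        consumed.append(item)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(consumed[:5])  # [0, 1, 2, 3, 4] -- order preserved despite blocking
```

Streaming systems such as Flink and Kafka consumers apply the same idea at scale, propagating "slow down" signals upstream rather than dropping or unboundedly buffering data.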
Big Data
Refers to extremely large datasets that can be analyzed for patterns, trends, and associations, typically involving varied and complex structures. What constitutes 'big' is debated, but a rule of thumb is a volume of data that cannot be analyzed on a single machine.
Big Data Processing
Processing large volumes of data in parallel, distributed computing environments to improve performance.
Big O Notation
A mathematical notation used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, primarily used to classify algorithms by how they respond to changes in input size.
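To make the classification concrete, here is a sketch contrasting an O(n) linear scan with an O(log n) binary search (via the standard-library `bisect` module), which halves the search space at each step:

```python
from bisect import bisect_left

def linear_search(items, target):
    """O(n): worst-case comparisons grow linearly with input size."""
    for i, v in enumerate(items):
        if v == target:
            return i
    return -1

def binary_search(items, target):
    """O(log n): items must be sorted; each step halves the remaining range."""
    i = bisect_left(items, target)
    return i if i < len(items) and items[i] == target else -1

data = list(range(1_000_000))
assert linear_search(data, 999_999) == 999_999  # ~10**6 comparisons
assert binary_search(data, 999_999) == 999_999  # ~20 comparisons
```

Doubling the input roughly doubles the linear search's work but adds only one step to the binary search, which is exactly what the notations O(n) and O(log n) capture.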
Binary Tree
A tree data structure in which each node has at most two children, referred to as the left child and the right child.
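A minimal Python sketch of the structure, here specialized to a binary search tree so that an in-order traversal yields sorted output:

```python
class Node:
    """A binary-tree node with at most two children."""
    def __init__(self, value):
        self.value = value
        self.left = None   # left child
        self.right = None  # right child

def insert(root, value):
    """Insert into a binary search tree: smaller values go left."""
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def inorder(root):
    """In-order traversal visits left subtree, node, then right subtree."""
    return inorder(root.left) + [root.value] + inorder(root.right) if root else []

root = None
for v in [5, 3, 8, 1, 4]:
    root = insert(root, v)
print(inorder(root))  # [1, 3, 4, 5, 8]
```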
Bitwise Operation
Operations that manipulate one or more bits at the level of their individual binary representation.
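The core operators, shown on two small Python integers:

```python
a, b = 0b1100, 0b1010   # 12 and 10

print(bin(a & b))   # AND:  0b1000  (8)  -- bits set in both
print(bin(a | b))   # OR:   0b1110  (14) -- bits set in either
print(bin(a ^ b))   # XOR:  0b110   (6)  -- bits set in exactly one
print(bin(a << 1))  # left shift:  0b11000 (24) -- multiply by 2
print(bin(a >> 2))  # right shift: 0b11   (3)   -- floor-divide by 4
```

Bitwise operations are common in data engineering for compact flag fields, bitmap indexes, and hash functions.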
Blend
A term coined by data analytics vendors to describe the process of combining data from multiple sources to create a cohesive, unified dataset. Typically used in the context of data analysis and business intelligence.
Blockchain
A system of recording information in a way that makes it difficult or impossible to change, hack, or cheat the system. A blockchain is a digital ledger of transactions that is duplicated and distributed across the entire network of computer systems on the blockchain.
Broadcast
A method in parallel computing where data is sent from one point (a root node) to all other nodes in the topology.
Broadcasting
A method in distributed computing to send the same message to all nodes in a network.
Bucketing
A method for dividing a dataset into discrete buckets or bins to separate it into roughly equal parts based on some characteristic.
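One simple variant is equal-frequency (quantile) bucketing, sketched here in plain Python; real systems such as Spark or Hive instead bucket by hashing a key column:

```python
def bucketize(values, n_buckets):
    """Split values into n roughly equal-sized buckets by sorted order."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_buckets)
    buckets, start = [], 0
    for i in range(n_buckets):
        # The first `extra` buckets absorb the remainder, one element each.
        end = start + size + (1 if i < extra else 0)
        buckets.append(ordered[start:end])
        start = end
    return buckets

print(bucketize([7, 1, 9, 4, 3, 8, 2, 6, 5, 0], 3))
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```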
Bulk Extract
The process of extracting large amounts of data from a database in a single transaction.
Business Intelligence (BI)
A set of strategies and technologies used by enterprises for the data analysis of business information, helping companies make more informed business decisions.
CAP Theorem
States that it is impossible for a distributed system to simultaneously provide more than two of the following three guarantees: Consistency, Availability, and Partition tolerance.
CBOR (Concise Binary Object Representation)
A binary format that encodes data more efficiently and compactly than JSON. It is designed to serialize and deserialize complex data structures efficiently without losing the schema-free property of JSON.
CRON
A time-based job scheduler in Unix-like computer operating systems for scheduling periodic jobs at fixed times, dates, or intervals.
CSV (Comma Separated Values)
A simple, plain-text file format used to store tabular data, where each line represents a data record, and each record consists of one or more fields, separated by commas. Suitable for a wide range of applications due to its simplicity, but lacks a standard schema, which can lead to parsing errors.
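Reading and writing CSV is handled by Python's standard-library `csv` module; note in this sketch that, because CSV carries no schema, every value round-trips as a string:

```python
import csv
import io

# Write a small table to an in-memory buffer, then read it back.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "score"])
writer.writeheader()
writer.writerows([{"name": "ada", "score": 95}, {"name": "grace", "score": 98}])

buf.seek(0)
rows = list(csv.DictReader(buf))
print(rows[0])  # {'name': 'ada', 'score': '95'} -- 95 came back as a string
```

Downstream consumers must re-cast types themselves, which is one reason schema-bearing formats like Avro or Parquet are preferred for data pipelines.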