Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
Cap’n Proto
A data interchange format similar to Protobuf, but designed to avoid a separate encoding and decoding step. Rather than being parsed and unpacked, the data is accessed directly in the binary form in which it is stored, reducing processing time.
CAP Theorem
A theorem in computer science stating that a distributed system cannot simultaneously provide more than two of three guarantees: Consistency, Availability, and Partition Tolerance.
Categorical Data
A type of data that can take on one of a limited and usually fixed number of possible values, representing the membership of an object in a group, such as ‘male’ or ‘female’.
Causal Inference
The process of drawing conclusions about one variable’s effect on another, critical in understanding relationships in data and making informed decisions based on those relationships.
CBOR (Concise Binary Object Representation)
A binary format that encodes data more compactly than JSON. It is designed to serialize and deserialize complex data structures efficiently without losing the schema-free property of JSON.
Chaining
Linking two or more computing tasks together so that, as soon as one task is finished, the next task immediately begins.
Character Encoding
A mapping from a repertoire of characters to the numeric codes and byte sequences used to store or transmit them, e.g., ASCII or UTF-8.
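As a brief illustration, the Python sketch below encodes one string as UTF-8 and shows that ASCII cannot represent it. The sample string is an arbitrary choice.

```python
text = "caf\u00e9"                      # "café", with a single accented character

utf8_bytes = text.encode("utf-8")       # the accented character takes two bytes in UTF-8

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False                    # the accented character is outside the ASCII repertoire

# Decoding with the same encoding recovers the original string.
round_trip = utf8_bytes.decode("utf-8")
print(len(text), len(utf8_bytes), ascii_ok)
```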
Checkpoint
A snapshot of the state of a system at a specific point in time, usually used to recover from failures.
Checkpointing
Saving the state of a process at certain points so that it can be restarted from that point in case of failure.
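A minimal sketch of the idea in Python, assuming a hypothetical `process` function that sums a list and writes its progress to a JSON file after each item. The file name and the `fail_at` parameter exist only to simulate a failure.

```python
import json
import os
import tempfile

def process(items, checkpoint_path, fail_at=None):
    """Sum items, saving progress after each one so a rerun can resume."""
    state = {"next_index": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)          # resume from the last checkpoint
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += items[i]
        state["next_index"] = i + 1
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)           # checkpoint after each item
    return state["total"]

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
data = [1, 2, 3, 4, 5]
try:
    process(data, path, fail_at=3)        # fails after three items are processed
except RuntimeError:
    pass
result = process(data, path)              # picks up at index 3 instead of index 0
print(result)
```

Checkpointing after every item keeps the sketch simple; a real pipeline would typically checkpoint less often to limit write overhead.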
Circular Dependency
A relation between two or more modules which either directly or indirectly depend on each other to function properly.
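One way to detect such a cycle, sketched with Python's standard `graphlib` module (Python 3.9+). The module names in the two dependency maps are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter

def has_cycle(dependencies):
    """Return True if the mapping {node: set of prerequisites} contains a cycle."""
    try:
        # static_order() is lazy, so consume it to force the cycle check.
        tuple(TopologicalSorter(dependencies).static_order())
        return False
    except CycleError:
        return True

acyclic = {"b": {"a"}, "c": {"b"}}               # c needs b, b needs a
cyclic = {"a": {"c"}, "b": {"a"}, "c": {"b"}}    # a needs c, c needs b, b needs a

print(has_cycle(acyclic), has_cycle(cyclic))
```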
Classify
The process of organizing data by relevant categories for efficient use and secure data management.
Class Variable
A variable that is shared by all instances of a class, belonging to the class rather than any object instance.
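For example, in Python (the `Connection` class here is hypothetical):

```python
class Connection:
    open_count = 0                  # class variable: one value shared by all instances

    def __init__(self, name):
        self.name = name            # instance variable: one value per object
        Connection.open_count += 1  # update the shared value through the class

a = Connection("a")
b = Connection("b")
print(Connection.open_count, a.open_count, b.open_count)
```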
Clean Code
Code that is easy to understand and easy to change, adhering to good programming principles and practices.
Clean or Cleanse
Remove invalid or inconsistent data values, such as empty fields or outliers.
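A small Python sketch of this step, assuming made-up records and an arbitrary plausible range of 0 to 120 for an `age` field:

```python
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},    # empty field
    {"id": 3, "age": 212},     # implausible outlier
    {"id": 4, "age": 58},
]

def is_valid(record, low=0, high=120):
    """Keep a record only if its age is present and within the assumed range."""
    age = record["age"]
    return age is not None and low <= age <= high

cleaned = [r for r in records if is_valid(r)]
print([r["id"] for r in cleaned])
```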
Cloud Computing
The delivery of various services over the Internet, such as storage, processing, and networking resources.
Cloudera
A provider of software for data engineering, data warehousing, machine learning, and analytics.
Cluster
Group data points based on similarities or patterns to facilitate analysis and modeling.
Cluster Analysis
A group of algorithms used to categorize data into groups, or clusters, where objects in the same group are more similar to each other than to those in other groups.
Cold storage
A storage strategy for data that is accessed infrequently and is primarily for archival purposes, offering cost-efficiency at the expense of retrieval speed.
Columnar Database
A database optimized for reading and writing columns of data as opposed to rows of data, often used for analytics and reporting.
Combinatorial Explosion
A phenomenon in computer science where the number of possible solutions or combinations in a problem grows very rapidly, often exponentially or factorially, with the size of the problem.
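To give a sense of scale, the Python snippet below counts the subsets and the orderings of n items for a few small values of n.

```python
import math

for n in (5, 10, 15, 20):
    # 2**n subsets and n! orderings of n distinct items
    print(n, 2 ** n, math.factorial(n))

subsets_20 = 2 ** 20
orderings_20 = math.factorial(20)
```

At n = 20 there are already more than a million subsets and more than 2 × 10^18 orderings, which is why brute-force enumeration stops being practical so quickly.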
Command-Line Interface (CLI)
A text-based user interface used to interact with software by entering commands into the interface.
Comment
A programming language feature allowing the insertion of human-readable descriptions or annotations in the source code.
Commit
The act of saving changes in a database, version control system, or transactional system, making them permanent.
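In a database, the effect can be sketched with Python's built-in `sqlite3` module. The `accounts` table and its contents are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.commit()                       # the insert is now permanent

conn.execute("UPDATE accounts SET balance = 0 WHERE name = 'alice'")
conn.rollback()                     # the uncommitted update is discarded

balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
print(balance)
```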
Common Gateway Interface (CGI)
A standard protocol for web servers to execute programs and generate dynamic content, often used for form processing.
Compilation
The process of translating a high-level programming language into machine language or bytecode that can be executed by a computer’s CPU.
Compound Key
A key that consists of multiple attributes to uniquely identify an entity in a database.
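A short sketch using Python's `sqlite3` module, with a hypothetical `enrollments` table in which neither column is unique on its own but the pair is:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE enrollments (
        student_id INTEGER,
        course_id  INTEGER,
        grade      TEXT,
        PRIMARY KEY (student_id, course_id)
    )
""")
conn.execute("INSERT INTO enrollments VALUES (1, 101, 'A')")
conn.execute("INSERT INTO enrollments VALUES (1, 102, 'B')")   # same student, different course: allowed

try:
    conn.execute("INSERT INTO enrollments VALUES (1, 101, 'C')")   # same pair again
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM enrollments").fetchone()[0]
print(duplicate_rejected, row_count)
```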
Computed Column
A virtual column in a database table that is based on a calculation or expression using other columns in the table.
Concurrency Control
Techniques to manage simultaneous operations in a database system, ensuring consistency and resolving conflicts.
Concurrent Processing
A computing concept where several tasks are executed during overlapping time periods, enabling more efficient use of computing resources.
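As an illustration, the Python sketch below runs four simulated I/O-bound tasks on a thread pool so that their waiting periods overlap. The `fetch` function and its 0.1 second delay are stand-ins for real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(task_id):
    time.sleep(0.1)                 # stand-in for waiting on a network or disk
    return task_id * 2

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, range(4)))   # results come back in input order
elapsed = time.perf_counter() - start

print(results, round(elapsed, 2))
```

Run one after another, the four tasks would need about 0.4 seconds; with overlapping execution the total is typically close to the duration of a single task.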
Configuration File
A file used to configure the initial settings of software programs, usually written in XML, JSON, or YAML.
Configuration Management
The process of systematically managing, organizing, and controlling changes to documents, code, and other artifacts during the development process.
Connection Pool
A cache of database connections maintained to be reused by future requests, reducing the overhead of opening and closing connections.
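A deliberately minimal sketch in Python, assuming a hypothetical `ConnectionPool` class built on a queue of `sqlite3` connections. Production pools also handle timeouts, health checks, and broken connections, none of which is shown here.

```python
import queue
import sqlite3
from contextlib import contextmanager

class ConnectionPool:
    def __init__(self, size):
        self._idle = queue.Queue()
        for _ in range(size):
            # Open every connection up front so requests never pay the setup cost.
            self._idle.put(sqlite3.connect(":memory:", check_same_thread=False))

    @contextmanager
    def connection(self):
        conn = self._idle.get()       # blocks if every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)      # return it to the pool instead of closing it

pool = ConnectionPool(size=1)
with pool.connection() as first:
    value = first.execute("SELECT 1").fetchone()[0]
with pool.connection() as second:
    reused = second is first          # the same connection object is handed out again

print(value, reused)
```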
Consensus Algorithm
A process used in computer science to achieve agreement on a single data value among distributed processes or systems.
Consolidate
Combine multiple datasets into one to create a more comprehensive view of the data.
Container
A lightweight, stand-alone, and executable software package that includes everything needed to run a piece of software, including the code, runtime, and system libraries.
Containerization
The practice of packaging software together with everything it needs to run, including the code, runtime, system tools, and libraries, into a container so that it runs consistently across environments.
Continuous Delivery
A software development discipline where software is built in such a way that it can be released to production at any time.
Continuous Deployment (CD)
A software engineering approach in which software changes are delivered and deployed to production continuously and automatically after passing a series of automated tests.
Continuous Integration (CI)
A development practice where developers integrate code into a shared repository frequently, ideally several times a day, to detect errors quickly.
Control Flow
The order in which individual statements, instructions, or function calls are executed within a program.
Convergence
The state where different nodes (or systems) update their internal states to a common value, usually used in the context of iterative algorithms and distributed systems.
Convolutional Neural Network (CNN)
A class of deep learning neural networks, most commonly applied to analyzing visual imagery, used in image recognition and classification tasks.
Cosine Similarity
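A measure of similarity between two non-zero vectors, computed as the cosine of the angle between them. It is widely used in text analysis, natural language processing, and similar fields to compare items represented as vectors.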
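The usual formulation treats the two entities as numeric vectors and takes the cosine of the angle between them, that is, their dot product divided by the product of their lengths. A plain-Python sketch with example vectors chosen for illustration:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])   # parallel vectors: close to 1
orthogonal = cosine_similarity([1, 0], [0, 1])             # perpendicular vectors: 0
print(same_direction, orthogonal)
```

The function assumes both vectors are non-zero and of equal length; a zero vector would cause a division by zero.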
Covariance
A statistical measure that indicates the extent to which two variables change together.
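As a worked example, the Python function below computes the sample covariance (dividing by n - 1) of two small, made-up series:

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]

rising = covariance(hours, score)           # positive: the variables move together
falling = covariance(hours, score[::-1])    # negative: one rises as the other falls
print(rising, falling)
```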
Crash Recovery
The process by which an operating system or application restarts operation after a crash, possibly recovering lost data.
CRON
A time-based job scheduler in Unix-like computer operating systems for scheduling periodic jobs at fixed times, dates, or intervals.
Cron Job
A scheduled task in Unix-based operating systems, used to automate repetitive tasks.
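A cron schedule is conventionally written as five space-separated fields: minute, hour, day of month, month, and day of week. The Python sketch below only labels those fields; the `describe` helper is hypothetical and does not validate or evaluate the expression.

```python
FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

def describe(expression):
    """Map each of the five cron fields to its name."""
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError("expected five space-separated fields")
    return dict(zip(FIELDS, parts))

# "30 2 * * 1" means 02:30 every Monday (day_of_week 1 is Monday).
schedule = describe("30 2 * * 1")
print(schedule)
```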
Cross-Join
A SQL join that returns the Cartesian product of the joined tables, meaning every row of the first table is combined with every row of the second table.
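The same Cartesian product can be sketched in Python with `itertools.product`, using two small illustrative lists in place of tables:

```python
from itertools import product

sizes = ["S", "M", "L"]
colors = ["red", "blue"]

# Every size paired with every color: 3 x 2 = 6 rows.
rows = list(product(sizes, colors))
print(len(rows), rows[0], rows[-1])
```

Because the result size is the product of the input sizes, cross joins on large tables can become very expensive.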
Cross-Validation
A statistical method used to estimate the skill of machine learning models. It is primarily used in applied machine learning to assess a predictive modeling algorithm’s performance when there is no separate test dataset available.
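One common variant is k-fold cross-validation, in which the data is split into k folds and each fold serves once as the test set while the rest is used for training. The Python sketch below only generates the index splits; the `k_fold_indices` helper is hypothetical, and model fitting and scoring are left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), test)
```

In practice the data is usually shuffled before splitting, which this sketch omits.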
Cryptography
The practice and study of techniques for securing communication and data against unauthorized third parties.
CSV (Comma Separated Values)
A simple, plain-text file format used to store tabular data, where each line represents a data record, and each record consists of one or more fields, separated by commas. Suitable for a wide range of applications due to its simplicity, but lacks a standard schema, which can lead to parsing errors.
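The parsing pitfall is easy to reproduce. In the Python sketch below, a field containing a comma is quoted; splitting the line on commas breaks it apart, while the standard `csv` module handles the quoting. The sample data is made up.

```python
import csv
import io

raw = 'name,city\n"Doe, Jane",Paris\nJohn,"New York"\n'

# Naive splitting treats the comma inside the quoted name as a separator.
naive = raw.splitlines()[1].split(",")

# csv.reader understands quoted fields.
rows = list(csv.reader(io.StringIO(raw)))

print(len(naive), rows[1])
```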
