Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
Cache Invalidation
The process of replacing or removing entries in a cache because the underlying data has changed.
Caching
The process of storing copies of files in a cache, or temporary storage location, so that they can be accessed more quickly.
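As a rough illustration of caching and cache invalidation together, the sketch below uses Python's `functools.lru_cache`. The `SOURCE` dictionary and the `load` function are hypothetical stand-ins for a slow backing store and its accessor; they are not part of any particular system.

```python
from functools import lru_cache

# Hypothetical backing store standing in for a slow database or API.
SOURCE = {"user:1": "Ada"}
calls = 0  # counts how often the backing store is actually read


@lru_cache(maxsize=128)
def load(key):
    """Read a value from the backing store; results are cached per key."""
    global calls
    calls += 1
    return SOURCE[key]


first = load("user:1")    # cache miss: reads the backing store
second = load("user:1")   # cache hit: served from the cache

SOURCE["user:1"] = "Grace"  # the underlying data changes
stale = load("user:1")      # still the old value, the cache is now stale

load.cache_clear()          # invalidate the cache
fresh = load("user:1")      # miss again: picks up the new value
```

Clearing the whole cache is the bluntest form of invalidation; real systems often invalidate per key or rely on expiry times instead.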
Callback
A piece of executable code that is passed as an argument to other code and is expected to execute at a given time.
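A minimal sketch of a callback in Python, assuming a hypothetical `process_rows` function that accepts another function and calls it once per row:

```python
def process_rows(rows, on_row):
    """Call the on_row callback for each row and collect the results."""
    return [on_row(row) for row in rows]


def to_upper(row):
    """A callback: it is passed to process_rows, which decides when to call it."""
    return row.upper()


result = process_rows(["a", "b"], to_upper)
```

Here `to_upper` is passed as an argument without being called; `process_rows` invokes it later, once for each row.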
Capacity Planning
The process used to determine how much hardware and software is required to meet future workload demands.
Cap’n Proto
A data interchange format and RPC system similar to Protobuf, designed to avoid a separate encoding and decoding step. Instead of parsing the data and then unpacking it, the data is accessed directly in the binary form in which it is stored, reducing processing time.
Categorical Data
A type of data that can take on one of a limited and usually fixed number of possible values, representing the membership of an object in a group, such as ‘male’ or ‘female’.
Causal Inference
A process used to draw conclusions about one variable’s effect on another, critical in understanding relationships in data and making informed decisions based on those relationships.
Chaining
Linking two or more computing tasks together so that, as soon as one task is finished, the next task immediately begins.
Character Encoding
A mapping from a repertoire of characters to numeric codes and the byte sequences used to store or transmit them, e.g., ASCII or UTF-8.
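The short Python sketch below shows how the same character can map to different bytes under different encodings, and what happens when bytes are decoded with the wrong one. The variable names are illustrative only.

```python
text = "\u00e9"  # the character é (LATIN SMALL LETTER E WITH ACUTE)

utf8_bytes = text.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # one byte in Latin-1

# ASCII has no code for this character, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding with the matching encoding round-trips the text.
decoded = utf8_bytes.decode("utf-8")

# Decoding UTF-8 bytes as Latin-1 yields garbled text ("mojibake").
mojibake = utf8_bytes.decode("latin-1")
```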
Checkpoint
A snapshot of the state of a system at a specific point in time, usually used to recover from failures.
Checkpointing
Saving the state of a process at certain points so that it can be restarted from that point in case of failure.
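One way checkpointing might look in practice is sketched below: a loop saves its progress to a JSON file after each item and resumes from that file on the next run. The `run` function, the `fail_at` parameter, and the file layout are assumptions made for illustration.

```python
import json
import os
import tempfile


def run(items, checkpoint_path, fail_at=None):
    """Sum items, saving progress after each one so a rerun can resume."""
    state = {"next_index": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        # Resume from the last saved checkpoint.
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += items[i]
        state["next_index"] = i + 1
        # Save the checkpoint after each completed item.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state["total"]


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
items = [1, 2, 3, 4]

try:
    run(items, path, fail_at=2)  # fails after processing the first two items
except RuntimeError:
    pass

total = run(items, path)  # resumes at index 2 instead of starting over
```

This sketch overwrites the checkpoint file in place; a more careful version would write to a temporary file and rename it so a crash mid-write cannot corrupt the checkpoint.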
Circular Dependency
A relation between two or more modules which either directly or indirectly depend on each other to function properly.
Class Variable
A variable that is shared by all instances of a class, belonging to the class rather than any object instance.
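A small Python sketch of the distinction, using a hypothetical `Connection` class: `open_count` is a class variable shared by all instances, while `name` belongs to each instance.

```python
class Connection:
    open_count = 0  # class variable: one value shared by every instance

    def __init__(self, name):
        self.name = name             # instance variable: one value per object
        Connection.open_count += 1   # update the shared value on the class


a = Connection("a")
b = Connection("b")
```

Both `a.open_count` and `b.open_count` read the same shared value through the class, whereas `a.name` and `b.name` differ.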
Classify
The process of organizing data by relevant categories for efficient use and secure data management.
Clean Code
Code that is easy to understand and easy to change, adhering to good programming principles and practices.
Clean or Cleanse
Remove invalid or inconsistent data values, such as empty fields or outliers.
Cloud Computing
The delivery of various services over the Internet, such as storage, processing, and networking resources.
Cloudera
A provider of software for data engineering, data warehousing, machine learning, and analytics.
Cluster
Group data points based on similarities or patterns to facilitate analysis and modeling.
Cluster Analysis
A group of algorithms used to categorize data into groups, or clusters, where objects in the same group are more similar to each other than to those in other groups.
Cold Storage
A storage strategy for data that is accessed infrequently and is primarily for archival purposes, offering cost-efficiency at the expense of retrieval speed.
Columnar Database
A database optimized for reading and writing columns of data as opposed to rows of data, often used for analytics and reporting.
Combinatorial Explosion
A phenomenon in computer science where the number of possible solutions or combinations in a problem grows exponentially, or faster, with the size of the problem.
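To give a sense of scale, the sketch below counts the orderings (permutations) and subsets of a set of `n` items. These two counting problems are chosen only as familiar examples; many other problems show the same kind of growth.

```python
import math
from itertools import permutations

# Enumerating every ordering is feasible for tiny inputs...
small = list(permutations(["a", "b", "c"]))

# ...but the number of orderings is n!, which grows very quickly.
orderings = {n: math.factorial(n) for n in (3, 10, 20)}

# The number of subsets of n items is 2**n, which doubles with each new item.
subsets = {n: 2 ** n for n in (3, 10, 20)}
```

Three items have 6 orderings, ten items have 3,628,800, and twenty items have more than 2 quintillion, which is why brute-force enumeration stops being practical so quickly.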
Command-Line Interface (CLI)
A text-based user interface used to interact with software by entering commands into the interface.
Comment
A programming language feature allowing the insertion of human-readable descriptions or annotations in the source code.
Commit
The act of saving changes in a database, version control system, or transactional system, making them permanent.
Common Gateway Interface (CGI)
A standard protocol for web servers to execute programs and generate dynamic content, often used for form processing.
Compilation
The process of translating a high-level programming language into machine code that a computer’s CPU can execute, or into bytecode that a virtual machine can run.
Compound Key
A key that consists of multiple attributes to uniquely identify an entity in a database.
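As an illustration, the sketch below uses Python's built-in `sqlite3` module with a hypothetical `order_lines` table whose primary key is the pair (`order_id`, `line_no`). Neither column is unique on its own, but the combination must be.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE order_lines (
        order_id INTEGER,
        line_no  INTEGER,
        sku      TEXT,
        PRIMARY KEY (order_id, line_no)  -- compound key over two columns
    )
    """
)

# The same order_id may repeat as long as line_no differs.
conn.execute("INSERT INTO order_lines VALUES (1, 1, 'A')")
conn.execute("INSERT INTO order_lines VALUES (1, 2, 'B')")

# Repeating the full (order_id, line_no) pair violates the key.
try:
    conn.execute("INSERT INTO order_lines VALUES (1, 2, 'C')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM order_lines").fetchone()[0]
```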
Computed Column
A virtual column in a database table that is based on a calculation or expression using other columns in the table.
Concurrency Control
Techniques to manage simultaneous operations in a database system, ensuring consistency and resolving conflicts.
Concurrent Processing
A computing concept where several tasks are executed during overlapping time periods, enabling more efficient use of computing resources.
Configuration File
A file used to configure the initial settings of software programs, usually written in XML, JSON, or YAML.
Configuration Management
The process of systematically managing, organizing, and controlling changes to documents, code, and other artifacts during the development process.
Connection Pool
A cache of database connections maintained to be reused by future requests, reducing the overhead of opening and closing connections.
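The sketch below shows the basic idea with a hypothetical `ConnectionPool` class built on `queue.Queue` and in-memory SQLite connections. It is a simplified illustration; production pools (for example those shipped with database drivers or ORMs) also handle timeouts, health checks, and thread safety of the connections themselves.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """A fixed-size pool that hands out and takes back open connections."""

    def __init__(self, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # Connections are opened once, up front.
            self._pool.put(sqlite3.connect(":memory:"))

    @contextmanager
    def connection(self):
        conn = self._pool.get()  # blocks if every connection is in use
        try:
            yield conn
        finally:
            self._pool.put(conn)  # return it to the pool instead of closing


pool = ConnectionPool(size=2)
seen = []
for _ in range(4):
    with pool.connection() as conn:
        seen.append(id(conn))
        value = conn.execute("SELECT 1").fetchone()[0]
```

Four requests are served by only two underlying connections, so the cost of opening a connection is paid twice rather than four times.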
Consensus Algorithm
A process used in computer science to achieve agreement on a single data value among distributed processes or systems.
Consolidate
Combine multiple datasets into one to create a more comprehensive view of the data.
Container
A lightweight, stand-alone, and executable software package that includes everything needed to run a piece of software, including the code, runtime, and system libraries.
Containerization
The practice of packaging software together with everything it needs to run, including the code, runtime, system tools, and libraries, into lightweight, stand-alone containers that run consistently across environments.
Continuous Delivery
A software development discipline where software is built in such a way that it can be released to production at any time.
Continuous Deployment (CD)
A software engineering approach in which software functionalities are delivered and deployed continuously and automatically into production, after passing a series of automated tests.
Continuous Integration (CI)
A development practice where developers integrate code into a shared repository frequently, ideally several times a day, to detect errors quickly.
Control Flow
The order in which individual statements, instructions, or function calls are executed within a program.
Convergence
The state where different nodes (or systems) update their internal states to a common value, usually used in the context of iterative algorithms and distributed systems.
Convolutional Neural Network (CNN)
A class of deep learning neural networks, most commonly applied to analyzing visual imagery, used in image recognition and classification tasks.
Cosine Similarity
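A measure of similarity between two non-zero vectors, computed as the cosine of the angle between them. It is widely used in text analysis, natural language processing, etc., where documents or words are represented as vectors.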
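In its usual formulation, the two entities are represented as numeric vectors and the similarity is their dot product divided by the product of their lengths. A minimal pure-Python sketch, with an illustrative `cosine_similarity` helper:

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Raises ZeroDivisionError if either vector is all zeros.
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1, 2], [2, 4])   # close to 1
orthogonal = cosine_similarity([1, 0], [0, 1])       # exactly 0
opposite = cosine_similarity([1, 0], [-1, 0])        # exactly -1
```

Because the result depends only on direction, not magnitude, `[1, 2]` and `[2, 4]` count as maximally similar.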
Covariance
A statistical measure that indicates the extent to which two variables change together.
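A small worked sketch of the sample covariance (dividing by `n - 1`), written as an illustrative `sample_covariance` helper rather than any library's API:

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


rising_together = sample_covariance([1, 2, 3], [2, 4, 6])   # positive
moving_opposite = sample_covariance([1, 2, 3], [6, 4, 2])   # negative
```

A positive value means the variables tend to increase together; a negative value means one tends to decrease as the other increases.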
Crash Recovery
The process by which an operating system or application restarts operation after a crash, possibly recovering lost data.
Cron Job
A scheduled task in Unix-like operating systems, run by the cron daemon at fixed times or intervals and used to automate repetitive tasks.
Cross-Join
A SQL join that returns the Cartesian product of the joined tables, meaning every row of the first table is combined with every row of the second table.
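The sketch below demonstrates this with Python's built-in `sqlite3` module and two hypothetical tables, `sizes` and `colors`. Three rows crossed with two rows yield six combinations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sizes (size TEXT)")
conn.execute("CREATE TABLE colors (color TEXT)")
conn.executemany("INSERT INTO sizes VALUES (?)", [("S",), ("M",), ("L",)])
conn.executemany("INSERT INTO colors VALUES (?)", [("red",), ("blue",)])

# Every size is paired with every color; there is no join condition.
rows = conn.execute(
    "SELECT size, color FROM sizes CROSS JOIN colors"
).fetchall()
```

Because the result size is the product of the input sizes, cross joins on large tables can become very expensive.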
Cross-Validation
A statistical method used to estimate the skill of machine learning models. It is primarily used in applied machine learning to assess a predictive modeling algorithm’s performance when there is no separate test dataset available, by repeatedly training on part of the data and evaluating on the rest.
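One common variant is k-fold cross-validation, where the data is split into `k` folds and each fold serves once as the held-out test set. The sketch below only generates the index splits; `k_fold_indices` is an illustrative helper, and model training and scoring are left out.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=3)
```

Each sample lands in exactly one test fold, so every data point is used for evaluation once and for training in the remaining rounds. In practice the data is usually shuffled first, which this sketch omits.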