Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
T-distribution
A type of probability distribution that is symmetrical and bell-shaped, like the normal distribution, but has heavier tails.
Tableau
A data visualization tool used to convert raw data into an understandable, visual format such as interactive charts and dashboards.
Tagging
The practice of labeling data with tags that categorize or annotate it, often used in organizing content or in natural language processing to identify parts of speech.
Talend
A software integration vendor that provides data integration, data management, enterprise application integration, and big data software and services.
Temporal Database
A database that is optimized to manage data relating to time instances, maintaining information about the times at which certain data is valid.
Tensor
A mathematical object that generalizes scalars, vectors, and matrices to higher dimensions, represented as a multi-dimensional array and used in machine learning and deep learning models, particularly in neural networks.
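For illustration, a minimal sketch assuming NumPy is installed, building arrays of increasing rank:

```python
import numpy as np

scalar = np.array(5.0)                      # rank 0: a single value
vector = np.array([1.0, 2.0, 3.0])          # rank 1: a 1-D array
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])             # rank 2: a 2-D array
tensor = np.zeros((2, 3, 4))                # rank 3: a 3-D array

print(scalar.ndim, vector.ndim, matrix.ndim, tensor.ndim)  # 0 1 2 3
```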
TensorFlow
An open-source software library for dataflow and differentiable programming across a range of tasks, developed by the Google Brain team.
Terabyte (TB)
A unit of information or computer storage equal to approximately one trillion bytes: 10^12 bytes in the decimal convention, or 1,024 gigabytes (2^40 bytes) in the binary convention.
Teradata
A data warehousing company whose products include a powerful, scalable, and reliable data warehousing platform.
Text Mining
The process of deriving meaningful information from natural language text. It involves preprocessing (cleaning and transforming) the text data and applying natural language processing (NLP) techniques.
Thread
A unit of execution that enables concurrency by running tasks that are not sequentially dependent; in Python, threads are created and managed with the standard threading module.
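For illustration, a minimal sketch using Python's standard threading module; the fetch function and source names are hypothetical placeholders for I/O-bound work:

```python
import threading

def fetch(source):
    # Placeholder for I/O-bound work such as an API call or file read.
    print(f"fetching {source}")

# Run two independent tasks concurrently on separate threads.
threads = [threading.Thread(target=fetch, args=(s,)) for s in ("orders", "users")]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for both tasks to finish
```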
Throughput
The amount of data transferred or processed in a specified time period, often used as a measure of system or network performance.
Time Complexity
A concept in computer science that describes the amount of time an algorithm takes to run as a function of the length of the input.
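For illustration, a small sketch of how time complexity shows up in practice: a membership test scans a Python list element by element (O(n)) but uses a hash lookup in a set (O(1) on average):

```python
data_list = list(range(1_000_000))
data_set = set(data_list)

# O(n): the list is scanned element by element until a match is found.
print(999_999 in data_list)

# O(1) on average: the set uses a hash table lookup.
print(999_999 in data_set)
```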
Time Series Analysis
The analysis of data collected over time to identify trends, patterns, and relationships.
Time Series Database (TSDB)
A database optimized for handling time series data, which are data points indexed in time order, commonly used for analyzing, storing, and querying time series data.
Tokenization
The process of converting input text into smaller units, or tokens, typically words or phrases, used in natural language processing to understand the structure of the text.
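For illustration, a minimal word-level tokenization sketch using Python's standard re module:

```python
import re

text = "Data engineers move data; data scientists model it."

# Word-level tokenization with a simple regular expression.
tokens = re.findall(r"\w+", text.lower())
print(tokens)
# ['data', 'engineers', 'move', 'data', 'data', 'scientists', 'model', 'it']
```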
Top-Down Design
A design methodology that begins with specifying the high-level structure of a system and decomposes it into its components, focusing on the system as a whole before examining its parts.
Topology
In networking, it refers to the arrangement of different elements (links, nodes, etc.) in a computer network. In data analysis, it refers to the study of geometric properties and spatial relations.
Training Set
A subset of a dataset used to train machine learning models, helping the models make predictions or decisions without being explicitly programmed to perform the task.
Transactional Database
A type of database that manages transaction-oriented applications, ensuring ACID properties (Atomicity, Consistency, Isolation, Durability) to maintain reliability in every transaction.
Transfer Learning
A research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem.
Transformation
The process of converting data from one format or structure into another, often involving cleaning, aggregating, enriching, and reformatting the data.
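For illustration, a minimal transformation sketch assuming pandas is installed; the orders data here is hypothetical:

```python
import pandas as pd

# Hypothetical raw orders data.
raw = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": ["10.5", "20.0", None],
    "country": ["us", "US", "ca"],
})

# Clean, reformat, and aggregate: a typical transformation step.
clean = (
    raw.dropna(subset=["amount"])
       .assign(amount=lambda df: df["amount"].astype(float),
               country=lambda df: df["country"].str.upper())
)
totals = clean.groupby("country", as_index=False)["amount"].sum()
print(totals)
```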
Tree Structure
A hierarchical structure used in computer science to represent relationships between individual data points or nodes, where each node (except the root) has exactly one parent node and zero or more child nodes.
Triggers
Procedural code automatically executed in response to certain events on a particular table or view in a database, often used to maintain the integrity of the data.
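For illustration, a minimal sketch using Python's built-in sqlite3 module; the accounts and audit_log tables are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER, balance REAL);
    CREATE TABLE audit_log (account_id INTEGER, old_balance REAL, new_balance REAL);

    -- Trigger: runs automatically after every balance update.
    CREATE TRIGGER log_balance_change AFTER UPDATE OF balance ON accounts
    BEGIN
        INSERT INTO audit_log VALUES (OLD.id, OLD.balance, NEW.balance);
    END;

    INSERT INTO accounts VALUES (1, 100.0);
    UPDATE accounts SET balance = 150.0 WHERE id = 1;
""")

print(conn.execute("SELECT * FROM audit_log").fetchall())  # [(1, 100.0, 150.0)]
```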
Tuple
An ordered list of elements, often used to represent a single row in a relational database table, or a single record in a dataset.
Turing Machine
A mathematical model of computation that defines an abstract machine, which manipulates symbols on a strip of tape according to a table of rules, foundational in the theory of computation.
Type Casting
The process of converting a variable from one data type to another, such as changing a float to an integer or a string to a number.
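For illustration, a minimal type casting sketch in plain Python:

```python
price = "19.99"
quantity = 3.0

price_as_float = float(price)    # str -> float
quantity_as_int = int(quantity)  # float -> int (truncates toward zero)
label = str(quantity_as_int)     # int -> str

print(price_as_float, quantity_as_int, label)  # 19.99 3 3
```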
URL Encoding
A method of encoding information in a Uniform Resource Identifier (URI) where certain characters are replaced by corresponding hexadecimal values, used in the submission of form data in HTTP requests.
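For illustration, a minimal sketch using Python's standard urllib.parse module:

```python
from urllib.parse import quote, urlencode

# Percent-encode a single value for safe inclusion in a URL.
print(quote("data engineering & pipelines"))  # data%20engineering%20%26%20pipelines

# Encode a dict of query parameters for an HTTP request.
print(urlencode({"q": "dagster docs", "page": 2}))  # q=dagster+docs&page=2
```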
Undirected Graph
A graph in which edges have no orientation, meaning the edge from vertex A to vertex B is identical to the edge from vertex B to vertex A.
Union
An SQL operation that combines the result sets of two or more queries into a single result set, removing duplicate rows by default.
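For illustration, a minimal sketch using Python's built-in sqlite3 module; the order tables are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE online_orders (customer TEXT);
    CREATE TABLE store_orders  (customer TEXT);
    INSERT INTO online_orders VALUES ('alice'), ('bob');
    INSERT INTO store_orders  VALUES ('bob'), ('carol');
""")

# UNION returns one distinct result set across both queries.
rows = conn.execute("""
    SELECT customer FROM online_orders
    UNION
    SELECT customer FROM store_orders
    ORDER BY customer
""").fetchall()
print(rows)  # [('alice',), ('bob',), ('carol',)]
```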
Unique Constraint
A constraint applied to a column to ensure that it cannot contain duplicate values.
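For illustration, a minimal sketch using Python's built-in sqlite3 module; the users table is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT UNIQUE)")
conn.execute("INSERT INTO users VALUES ('a@example.com')")

try:
    # A second insert with the same value violates the unique constraint.
    conn.execute("INSERT INTO users VALUES ('a@example.com')")
except sqlite3.IntegrityError as err:
    print(err)  # UNIQUE constraint failed: users.email
```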
Univariate Analysis
The simplest form of data analysis, examining a single variable without regard to any others in order to describe and summarize its underlying patterns.
Unstructured Data
Information that doesn't reside in a traditional row-column database and is often text-heavy.
Unstructured Data Analysis
The analysis of unstructured data, such as text or images, to extract insights and meaning.
Unsupervised Learning
A type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses, often for clustering or association.
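For illustration, a minimal clustering sketch assuming scikit-learn and NumPy are installed; the points are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two loose groups.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# K-means infers cluster assignments without any labeled responses.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [1 1 1 0 0 0]
```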
Update Anomaly
A data inconsistency that occurs when not all instances of a redundant piece of data are updated, leading to inconsistent and inaccurate data in a database.
Upstream
In data processing, refers to the tasks, operations, or stages that occur before a particular stage in a pipeline or data flow.
User-Defined Function (UDF)
A function provided by the user of a program or environment, allowing for the creation of functions that are not included in the original software.
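For illustration, a minimal sketch registering a UDF with Python's built-in sqlite3 module; reverse_text is a hypothetical function name:

```python
import sqlite3

def reverse_text(value):
    # A simple user-defined function not built into SQLite.
    return value[::-1] if value is not None else None

conn = sqlite3.connect(":memory:")
conn.create_function("reverse_text", 1, reverse_text)

print(conn.execute("SELECT reverse_text('dagster')").fetchone())  # ('retsgad',)
```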
Variable Selection
The process of selecting the most relevant features (variables, predictors) for use in model construction, reducing dimensionality and improving model performance.
Variance Inflation Factor (VIF)
A measure used to quantify how much the variance of a regression coefficient is inflated due to multicollinearity in the model.
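For reference, the standard formula: the VIF for predictor i is based on the R² obtained by regressing that predictor on all the other predictors:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$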
Variational Autoencoder (VAE)
A type of autoencoder with added constraints on the encoded representations being learned, often used for generating new data that's similar to the training data.
Vectorization
The process of converting an algorithm from operating on a single value at a time to operating on a set of values (vector) at one time, improving performance by exploiting data-level parallelism.
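For illustration, a minimal sketch assuming NumPy is installed, contrasting a per-element Python loop with the equivalent vectorized operation:

```python
import numpy as np

prices = np.random.rand(1_000_000)

# Scalar approach: one value at a time in a Python loop.
discounted_loop = [p * 0.9 for p in prices]

# Vectorized approach: the whole array at once, in optimized C code.
discounted_vec = prices * 0.9

assert np.allclose(discounted_loop, discounted_vec)
```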
Version Control
The management of changes to documents, computer programs, large websites, and other collections of information, allowing for revisions and variations to be tracked and managed efficiently.
Vertex
In graph theory, a vertex (or node) is a fundamental unit of a graph that can be connected to other vertices by edges; vertices represent entities in graph-based storage and analysis systems.
Vertical Scaling
Adding more resources, such as CPU or memory, to an existing server, or replacing the server with a more powerful one.
View
A virtual table based on the result-set of an SQL statement, often used to focus, simplify, and customize the perception each user has of the database.
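For illustration, a minimal sketch using Python's built-in sqlite3 module; the orders table and eu_orders view are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EU', 100.0), (2, 'US', 250.0), (3, 'EU', 50.0);

    -- A view: a virtual table defined by a query, not by stored data.
    CREATE VIEW eu_orders AS
        SELECT id, amount FROM orders WHERE region = 'EU';
""")

print(conn.execute("SELECT * FROM eu_orders").fetchall())  # [(1, 100.0), (3, 50.0)]
```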
Virtual Private Network (VPN)
A technology that creates a safe and encrypted connection over a less secure network, such as the internet, allowing for secure remote access to network resources.
Virtualization
The process of creating a virtual version of something, including virtual computer hardware systems, storage devices, and network resources.
Virtualization (in analytics)
A data integration process to provide a unified, real-time, and consistent view of data across different data sources without having to move or replicate the data.
Visualization
The graphical representation of information and data, using visual elements like charts, graphs, and maps.
Volatile Memory
Computer memory that requires power to maintain the stored information; all data is lost when the system’s power is turned off or interrupted.
Volume Testing
A type of software testing that checks the system’s performance and behavior under high volumes of data, ensuring the software can handle large data quantities effectively.