Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
Segmentation
The process of dividing a data set into distinct and meaningful groups, usually to perform more specific analysis, or to target specific subsets of users.
Semantic Analysis
The process of analyzing the meanings of words, texts, and sentences, typically used in NLP to understand the context and intent behind the words.
Semi-Supervised Learning
A class of machine learning tasks and techniques that train on both labeled and unlabeled data, typically a small amount of labeled data combined with a large amount of unlabeled data.
Sentiment Analysis
Analyze text data to identify and categorize the emotional tone or sentiment expressed.
Sequential Pattern Mining
A method of discovering frequent subsequences or patterns in a sequence of items or events, usually in datasets of customer transactions or other sequence data.
Serverless Computing
A cloud-computing execution model where the cloud provider runs the server and dynamically manages the allocation of machine resources, allowing developers to focus on individual functions.
Service-Oriented Architecture (SOA)
An architectural pattern in software design where application components provide services to other components through a communication protocol over a network.
Shred
Break down large datasets into smaller, more manageable pieces for easier processing and analysis.
Similarity Measure
A numeric measure of how alike two data objects are, often used in clustering, classification, or nearest neighbor analysis.
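As one illustration of such a measure, the sketch below computes cosine similarity between two numeric vectors. Cosine similarity is only one of many possible choices (Jaccard and Euclidean-based measures are others), and the function name is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0.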
Single Source of Truth (SSOT)
A practice of structuring information models and associated schema such that every data element is mastered in only one place.
Site Reliability Engineering (SRE)
A discipline that applies aspects of software engineering to infrastructure and operations problems, with the aim of creating scalable and highly reliable software systems.
Skewness
A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean, indicating whether the data points are skewed to the left or right.
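For a finite sample, one common estimate is the moment-based coefficient (third central moment divided by the second central moment raised to the power 3/2). The sketch below assumes that particular estimator, with no bias correction; the function name is illustrative.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (no bias correction)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5
```

A positive result indicates a longer right tail, a negative result a longer left tail, and a symmetric sample gives zero.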
Sliding Window
A technique used in analyzing or processing sequences of data, where a window of specified size moves across the data, and for each position of the window, a computation is performed.
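A minimal sketch of the idea, assuming the per-window computation is a simple average and the window advances one element at a time; the function name is illustrative.

```python
from collections import deque


def moving_average(values, window_size):
    """Return the mean of each full window of `window_size` consecutive values."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    window = deque(maxlen=window_size)
    running_sum = 0.0
    averages = []
    for value in values:
        if len(window) == window_size:
            # The oldest element is about to be evicted from the window.
            running_sum -= window[0]
        window.append(value)
        running_sum += value
        if len(window) == window_size:
            averages.append(running_sum / window_size)
    return averages
```

Keeping a running sum means each step costs constant time instead of re-summing the whole window.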
Snappy
A fast and efficient data compression and decompression library developed by Google, designed to balance processing speed and compression ratio. It is often used to compress data stored in Hadoop environments and for other similar applications.
Snapshot
A point-in-time copy of data that can be used as a backup for recovery purposes.
Snapshot Isolation
A guarantee provided by some database systems that all reads made in a transaction will see a consistent snapshot of the database, and the transaction itself will successfully commit only if no updates it has made conflict with any concurrent updates made since that snapshot.
Social Graph
A graph that depicts personal relations of internet users, representing the interconnection of relationships in an online social network.
Soft Delete
A data removal strategy where records are marked as deleted but are not physically removed from the database, enabling potential recovery.
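One common way to implement this is a nullable timestamp column that marks a row as deleted. The sketch below uses an in-memory SQLite database; the `users` table, the `deleted_at` column, and the helper names are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, deleted_at TEXT)"
)
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("grace",)])


def soft_delete(conn, user_id):
    """Mark the row as deleted instead of removing it."""
    conn.execute(
        "UPDATE users SET deleted_at = datetime('now') "
        "WHERE id = ? AND deleted_at IS NULL",
        (user_id,),
    )


def restore(conn, user_id):
    """Undo a soft delete by clearing the marker."""
    conn.execute("UPDATE users SET deleted_at = NULL WHERE id = ?", (user_id,))


def active_users(conn):
    """Queries must filter out rows that carry a deletion marker."""
    rows = conn.execute(
        "SELECT name FROM users WHERE deleted_at IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]


soft_delete(conn, 1)
```

The trade-off is that every read path has to remember the `deleted_at IS NULL` filter, which is why some teams hide it behind a view.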
Software as a Service (SaaS)
A cloud computing service model that provides access to software and its functions remotely as a web-based service, allowing users to access software applications over the internet.
Software-defined Asset
A declarative design pattern that represents a data asset through code.
Sorting Algorithm
An algorithm that puts elements of a list in a certain order, often numerical or lexicographical.
Sparse Matrix
A matrix mostly containing zero values, represented and stored efficiently in memory by only storing the non-zero elements.
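A minimal sketch of one storage scheme, a dictionary keyed by (row, column), often called dictionary-of-keys. Production libraries typically use compressed formats instead; the class and method names here are illustrative.

```python
class SparseMatrix:
    """Dictionary-of-keys sparse matrix: only non-zero cells are stored."""

    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols
        self._cells = {}  # maps (row, col) -> non-zero value

    def _check(self, row, col):
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise IndexError("cell index out of range")

    def set(self, row, col, value):
        self._check(row, col)
        if value == 0:
            # Writing a zero frees the slot rather than storing it.
            self._cells.pop((row, col), None)
        else:
            self._cells[(row, col)] = value

    def get(self, row, col):
        self._check(row, col)
        return self._cells.get((row, col), 0)

    def stored_cells(self):
        return len(self._cells)
```

A 1000 x 1000 matrix with a single non-zero entry stores one dictionary item rather than a million numbers.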
Spatial Database
A database optimized to store and query data representing objects defined in a geometric space, often used for storing and analyzing geographical or spatial information.
Spatial Index
A data structure that allows for accessing a spatial object efficiently, essential in spatial databases and geodatabases.
Spatial Indexing
The technique of building and maintaining spatial indexes so that spatial objects in a database can be accessed more efficiently, crucial in GIS systems, spatial databases, and spatial data processing.
Speculative Execution
An optimization technique where a computer system performs some tasks before it knows whether these tasks will be needed, to reduce latency and improve throughput.
Split
Divide a dataset into training, validation, and testing sets for machine learning model training.
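A minimal sketch of a random three-way split, assuming the records are independent so that shuffling is appropriate (time-series data usually needs a chronological split instead). The function name and default fractions are illustrative.

```python
import random


def train_val_test_split(records, val_fraction=0.25, test_fraction=0.25, seed=0):
    """Shuffle a copy of `records` and cut it into train, validation, and test lists."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

Every record lands in exactly one of the three sets, so no test example can leak into training.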
Stack
A last-in, first-out (LIFO) data structure that stores a collection of elements, with two principal operations: Push, which adds an element to the collection, and Pop, which removes the most recently added element.
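The two operations can be sketched on top of a Python list, where the end of the list acts as the top of the stack. The class name and the extra `peek` helper are illustrative.

```python
class Stack:
    """Stack backed by a list; the last list element is the top."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def peek(self):
        if not self._items:
            raise IndexError("peek at empty stack")
        return self._items[-1]

    def __len__(self):
        return len(self._items)
```

Pushing 1, 2, 3 and then popping returns 3 first, since the most recently added element always comes off first.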
Standardize
Transform data to a common unit or format to facilitate comparison and analysis.
Star Schema
The simplest style of data warehouse schema, in which a central fact table is linked to one or more dimension tables, enabling easy and efficient data retrieval.
Stateful Application
An application that saves client data from the activities of one session for use in the next session.
Stateless Application
An application that does not save client data generated in one session for use in the next session with that client.
Stateless Protocol
A communications protocol that treats each request as an independent transaction, without requiring the server to retain session information or status about each communicating partner for the duration of multiple requests.
Stemming
The process of reducing a word to its stem by stripping affixes (prefixes and suffixes). Unlike the lemma produced by lemmatization, the resulting stem is not guaranteed to be a valid dictionary word.
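To make the idea concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm or any other established stemmer; the suffix list and the minimum stem length of three characters are arbitrary assumptions chosen for illustration.

```python
# Checked in order, so longer or more specific suffixes come first.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With these rules "jumping", "jumped", and "jumps" all reduce to "jump", while "ponies" becomes "poni", which shows that a stem need not be a real word.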
Stored Procedure
Precompiled and stored SQL statements and procedural logic for easy database operations and complex data manipulations.
Strategic Information Systems
Information systems that are developed in response to corporate business initiatives to give competitive advantage to organizations.
Stream Processing
The real-time processing of data continuously, concurrently, and record by record, often used in applications that require real-time response and analytics.
Streaming Data
Data that is generated continuously, often by a large number of data sources that send small data records simultaneously.
Structured Data
Data that is organized and formatted in a way that is easily searchable, often residing in relational databases and including data types such as numbers, dates, and strings.
Structured Query Language (SQL)
A standard programming language specifically for managing and querying data in relational databases.
Subquery
A SQL query nested inside a larger query, used to retrieve data that will be used in the main query as a condition to further restrict the data to be retrieved.
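A small runnable illustration using an in-memory SQLite database: the inner query computes the average order amount, and the outer query keeps only orders above it. The `orders` table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ada", 10.0), ("grace", 30.0), ("linus", 50.0)],
)

# The parenthesized SELECT is the subquery; its result feeds the WHERE condition.
above_average = conn.execute(
    """
    SELECT customer, amount
    FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY amount
    """
).fetchall()
```

Here the average is 30.0, so only the 50.0 order satisfies the outer condition.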
Support Vector Machine (SVM)
A supervised machine learning algorithm, used for classification or regression analysis, that separates data into classes by finding the hyperplane that maximizes the margin between the classes.
Surrogate Key
A unique identifier for a record in a database table that serves as a substitute for natural primary keys and is typically auto-generated.
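The sketch below shows the idea in SQLite: `customer_key` is an auto-generated surrogate key, while `email` is the natural identifier. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        email TEXT NOT NULL UNIQUE,                      -- natural key
        name TEXT
    )
    """
)
cursor = conn.execute(
    "INSERT INTO customers (email, name) VALUES (?, ?)",
    ("ada@example.com", "Ada"),
)
ada_key = cursor.lastrowid  # value generated by the database

# The natural key can change without touching the surrogate key,
# so rows in other tables that reference customer_key stay valid.
conn.execute(
    "UPDATE customers SET email = ? WHERE customer_key = ?",
    ("ada@new.example.com", ada_key),
)
```

Because the surrogate key carries no business meaning, it stays stable when attributes such as the email address are corrected or updated.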
Swarm Intelligence
The collective behavior of decentralized, self-organized systems, typically inspired by nature, like ant colonies, bird flocking, and fish schooling, used in artificial intelligence for problem-solving and optimization.
Synchronization
The coordination of events to operate a system in unison, ensuring that multiple threads or processes do not interfere with each other.
Syntactic Sugar
Syntax within a programming language that is designed to make things easier to read or to express.
Syntax Analysis
The analysis of the symbols or statements in a computer program to ensure their correct arrangement, often used in compilers to check the syntax of the programming code.
Synthetic Data
Data that's artificially created, rather than being generated by actual events, often used for testing and training machine learning models when real data is scarce or sensitive.
Systematic Sampling
A statistical method involving the selection of elements from an ordered sampling frame, selecting every kth (where k is a constant) item in the frame.
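A minimal sketch, assuming the usual convention of a random starting point within the first k items followed by every kth item after it; the function name is illustrative.

```python
import random


def systematic_sample(items, k, seed=None):
    """Pick a random start in [0, k) and then take every kth item."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    start = random.Random(seed).randrange(k)
    return items[start::k]
```

For a frame of 100 ordered items and k = 10, this returns 10 items spaced exactly 10 positions apart. If the frame has a periodic pattern that lines up with k, the sample can be biased, so the ordering of the frame matters.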