Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
A statistical hypothesis test for a randomized experiment with two variants, A and B, used to compare two models or strategies and determine which performs better.
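As a rough illustration of the comparison described above, the sketch below runs a two-proportion z-test using only the standard library. The function name `two_proportion_z_test` and the conversion counts are made up for the example; real experiments typically rely on a statistics package and a significance level chosen in advance.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 120 of 1000 users convert on A, 150 of 1000 on B.
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between A and B is unlikely to be due to chance alone; it does not by itself say the difference is large enough to matter.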
The set of properties of database transactions intended to guarantee validity even in the event of errors or failures, encompassing Atomicity, Consistency, Isolation, and Durability.
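To make the atomicity part of these guarantees concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the names, and the `transfer` helper are hypothetical; the point is only that a failed transaction leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds in one transaction; any failure rolls back both updates."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

# This transfer would overdraw alice, so the CHECK constraint rejects it
# and the credit already applied to bob is rolled back as well.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```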
Combining data from multiple sources into a single dataset.
An iterative approach to software development and project management that prioritizes flexibility and customer satisfaction, often used by data engineering teams to manage projects.
A machine learning data catalog that helps people find, understand, and trust the data.
Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.
A managed NoSQL database service provided by Amazon Web Services.
A platform to stream data on AWS, offering powerful services to make it easy to load and analyze streaming data.
A fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL.
Amazon Web Services (AWS)
Offers a broad set of global cloud-based products including compute, storage, databases, analytics, networking, mobile, developer tools, and more.
The process of adding metadata or explanatory notes to data, often used in machine learning to create labeled data for training models.
The identification of items, events, or observations which do not conform to an expected pattern or other items in a dataset, crucial in fraud detection, network security, and fault detection.
Remove personal or identifying information from data.
A platform to programmatically author, schedule, and monitor workflows of tasks.
Apache Arrow is a cross-language development platform for in-memory data that specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
A scalable and extensible set of core foundational governance services, enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop.
An open-source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data.
A framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
A distributed streaming platform capable of handling trillions of events a day.
A tool designed to automate the flow of data between software systems.
A highly scalable, low-latency messaging platform running on commodity hardware.
A stream processing framework for running applications that process data as it is created.
A fast and general-purpose cluster computing system, providing high-level APIs in Java, Scala, Python, and R.
A free and open-source distributed real-time computation system.
API (Application Programming Interface)
A set of rules and definitions that allow different software entities to communicate with each other.
The process of adding new, updated, or corrected information to an existing database or list.
Move rarely accessed data to a low-cost, long-term storage solution, both to reduce costs and to keep the data for long-term retention and compliance.
An open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes.
Association Rule Mining
A machine learning method aimed at identifying interesting relations between variables (items or events) in large databases, frequently used for market basket analysis.
A Python library for asynchronous I/O. It is built around the coroutines of Python and provides tools to manage them and handle the I/O in an efficient way.
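A small sketch of the coroutine model described above: two simulated I/O waits run concurrently and their results are gathered. The `fetch` coroutine and its delays are placeholders standing in for real network or disk calls.

```python
import asyncio

async def fetch(name, delay):
    # asyncio.sleep stands in for a network or disk wait (an assumption).
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # Both coroutines wait at the same time; gather keeps argument order.
    return await asyncio.gather(fetch("a", 0.02), fetch("b", 0.01))

results = asyncio.run(main())
print(results)
```

Although "b" finishes first, `asyncio.gather` returns results in the order the coroutines were passed in.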
The technique of increasing the diversity of your training dataset by modifying the existing data points, often used in training deep learning models to improve model generalization.
Augmented Data Management
The use of AI and ML technologies to optimize and enhance data management tasks, improving data quality and metadata development.
Automated Machine Learning (AutoML)
The process of automating the end-to-end process of applying machine learning to real-world problems, facilitating the development of ML models by experts and non-experts alike.
A compact, fast binary serialization format developed within the Apache Hadoop project, suitable for serializing large amounts of data. Schemas (data types and protocols) are defined in JSON, while the data itself is serialized in binary.
AWS Step Functions
Enables you to coordinate AWS components, applications and microservices using visual workflows.
A cloud computing service model that acts as middleware, giving developers ways to connect their web and mobile applications to cloud services via application programming interfaces (APIs) and software development kits (SDKs).
A mechanism to handle situations where data is produced faster than it can be consumed.
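One common way to get this behavior is a bounded buffer: when it fills up, the producer is forced to wait. The sketch below assumes a single producer and a single slow consumer, with an arbitrary buffer size of 5.

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=5)   # bounded buffer: the source of backpressure
consumed = []

def consumer():
    while True:
        item = buffer.get()
        if item is None:          # sentinel: no more items
            break
        time.sleep(0.001)         # simulate slow processing
        consumed.append(item)

worker = threading.Thread(target=consumer)
worker.start()

for i in range(20):
    buffer.put(i)                 # blocks whenever the buffer is full
buffer.put(None)
worker.join()
print(len(consumed))
```

Because `put` blocks on a full queue, the producer can never run more than five items ahead of the consumer.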
Create a copy of data to protect against loss or corruption.
The processing of data in a batch or group where the entire batch is processed before any individual item in the batch is considered processed.
Refers to extremely large datasets that can be analyzed for patterns, trends, and associations, typically involving varied and complex structures. What constitutes 'big' is debated, but a rule of thumb is a volume of data that cannot be analyzed on a single machine.
Big Data Processing
Process large volumes of data in parallel and distributed computing environments to improve performance.
Big O Notation
A mathematical notation used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, primarily used to classify algorithms by how they respond to changes in input size.
A tree data structure in which each node has at most two children, referred to as the left child and the right child.
Operations that manipulate one or more bits at the level of their individual binary representation.
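For example, permission-style flags can be packed into a single integer and manipulated bit by bit. The flag names below are illustrative only.

```python
# Hypothetical permission flags, one bit each.
READ, WRITE, EXECUTE = 0b001, 0b010, 0b100

perms = READ | WRITE                  # set two flags with OR
can_write = bool(perms & WRITE)       # test a flag with AND
can_execute = bool(perms & EXECUTE)

perms ^= WRITE                        # toggle a flag with XOR
cleared = perms & ~READ               # clear a flag with AND NOT

doubled = 5 << 1                      # shift left: multiply by 2
halved = 5 >> 1                       # shift right: floor-divide by 2
```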
A term coined by data analytics vendors to describe the process of combining data from multiple sources to create a cohesive, unified dataset. Typically used in the context of data analysis and business intelligence.
A system of recording information in a way that makes it difficult or impossible to change, hack, or cheat the system. A blockchain is a digital ledger of transactions that is duplicated and distributed across the entire network of computer systems on the blockchain.
A method in parallel computing where data is sent from one point (a root node) to all other nodes in the topology.
A method in distributed computing to send the same message to all nodes in a network.
BSON (Binary JSON)
A binary-encoded serialization of JSON-like documents used to store documents and make remote procedure calls in MongoDB. BSON supports embedded documents and arrays, offering additional data types not supported by JSON.
A method for dividing a dataset into discrete buckets or bins to separate it into roughly equal parts based on some characteristic.
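One simple variant is hash bucketing, where a stable hash of a key decides the bucket. This sketch uses `zlib.crc32` as the hash; the key format and the bucket count of 4 are arbitrary choices for illustration.

```python
import zlib

def bucket_for(key: str, num_buckets: int) -> int:
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_buckets

# Hypothetical keys spread over 4 buckets.
user_ids = [f"user-{i}" for i in range(1000)]
buckets = {b: [] for b in range(4)}
for uid in user_ids:
    buckets[bucket_for(uid, 4)].append(uid)

print({b: len(members) for b, members in buckets.items()})
```

How evenly the buckets fill depends on the hash function and the keys, so the sizes are only roughly equal.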
The process of extracting large amounts of data from a database in a single transaction.
Business Intelligence (BI)
A set of strategies and technologies used by enterprises for the data analysis of business information, helping companies make more informed business decisions.
A process in a computing system where entries in a cache are replaced or removed due to change in the underlying data.
The process of storing copies of files in a cache, or temporary storage location, so that they can be accessed more quickly.
A piece of executable code that is passed as an argument to other code and is expected to execute at a given time.
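A minimal sketch of the idea: `process_rows` (a made-up helper) does the iteration, and the caller supplies the function that decides what happens to each row.

```python
def process_rows(rows, on_row):
    """Call `on_row` once per row; the caller decides what to do with it."""
    for row in rows:
        on_row(row)

seen = []
process_rows([1, 2, 3], seen.append)                      # a bound method as callback

scaled = []
process_rows([1, 2, 3], lambda r: scaled.append(r * 10))  # a lambda as callback
```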
A theorem stating that a distributed system cannot simultaneously provide more than two of three guarantees: Consistency, Availability, and Partition Tolerance.
A data interchange format similar to Protobuf, but faster. Instead of parsing the data and then unpacking it, the data is directly accessed in the binary form in which it is stored, reducing processing time.
The process used to determine how much hardware and software is required to meet future workload demands.
A highly scalable NoSQL database designed to handle large amounts of data.
A type of data that can take on one of a limited and usually fixed number of possible values, representing the membership of an object in a group, such as ‘male’ or ‘female’.
Organizing and classifying data into different categories, groups, or segments.
A process used to draw conclusions about one variable’s effect on another, critical in understanding relationships in data and making informed decisions based on those relationships.
CBOR (Concise Binary Object Representation)
A binary format encoding data in a more efficient and compact manner than JSON. It is designed to efficiently serialize and deserialize complex data structures without losing schema-free property of JSON.
Linking two or more computing tasks together so that, as soon as one task is finished, the next task immediately begins.
A method used to represent a repertoire of characters by some kind of encoding system, e.g., ASCII or UTF-8.
A snapshot of the state of a system at a specific point in time, usually used to recover from failures.
The process of saving the state of a system at specific points, so it can be returned to that state in case of failure.
A relation between two or more modules which either directly or indirectly depend on each other to function properly.
A variable that is shared by all instances of a class, belonging to the class rather than any object instance.
A method that is bound to the class and not the instance of the class.
The process of organizing data by relevant categories for efficient use and secure data management.
Code that is easy to understand and easy to change, adhering to good programming principles and practices.
Clean or Cleanse
The process of identifying and correcting (or removing) errors and inconsistencies in datasets to improve their quality.
The delivery of various services over the Internet, such as storage, processing, and networking resources.
A provider of software for data engineering, data warehousing, machine learning, and analytics.
Group data points based on similarities or patterns to facilitate analysis and modeling.
A group of algorithms used to categorize data into groups, or clusters, where objects in the same group are more similar to each other than to those in other groups.
A SQL function that returns the first non-null value in a list.
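The behavior can be tried against an in-memory SQLite database, which supports `COALESCE`. The `contacts` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, mobile TEXT, landline TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Ana", None, "555-0100"),
        ("Ben", "555-0101", "555-0102"),
        ("Cy", None, None),
    ],
)

# Prefer mobile, fall back to landline, then to a literal default.
rows = conn.execute(
    "SELECT name, COALESCE(mobile, landline, 'unknown') "
    "FROM contacts ORDER BY name"
).fetchall()
print(rows)
```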
A storage strategy for data that is accessed infrequently and is primarily for archival purposes, offering cost-efficiency at the expense of retrieval speed.
A database optimized for reading and writing columns of data as opposed to rows of data, often used for analytics and reporting.
A phenomenon in computer science where the number of possible solutions or combinations in a problem grows exponentially with the size of the problem.
Command-Line Interface (CLI)
A text-based user interface used to interact with software by entering commands into the interface.
A programming language feature allowing the insertion of human-readable descriptions or annotations in the source code.
The act of saving changes in a database, version control system, or transactional system, making them permanent.
Common Gateway Interface (CGI)
A standard protocol for web servers to execute programs and generate dynamic content, often used for form processing.
The process of translating a high-level programming language into machine language or bytecode that can be executed by a computer’s CPU.
A key that consists of multiple attributes to uniquely identify an entity in a database.
Reduce the size of data to save storage space and improve processing performance.
Reducing the size of a data file, typically to save space or speed up transmission.
The process of reducing the size of data, usually to save space or speed up transmission over networks.
A virtual column in a database table that is based on a calculation or expression using other columns in the table.
Techniques to manage simultaneous operations in a database system, ensuring consistency and resolving conflicts.
A computing concept where several tasks are executed during overlapping time periods, enabling more efficient use of computing resources.
A file used to configure the initial settings of software programs, usually written in XML, JSON, or YAML.
The process of systematically managing, organizing, and controlling the changes in the documents, codes, and other entities during the development process.
A cache of database connections maintained to be reused by future requests, reducing the overhead of opening and closing connections.
A process used in computer science to achieve agreement on a single data value among distributed processes or systems.
Combine multiple datasets into one to create a more comprehensive view of the data.
A lightweight, stand-alone, and executable software package that includes everything needed to run a piece of software, including the code, runtime, and system libraries.
A lightweight, stand-alone, and executable software package that includes everything needed to run a piece of software, including the code, runtime, system tools, and libraries.
A software development discipline where software is built in such a way that it can be released to production at any time.
Continuous Deployment (CD)
A software engineering approach in which software functionalities are delivered and deployed continuously and automatically into production, after passing a series of automated tests.
Continuous Integration (CI)
A development practice where developers integrate code into a shared repository frequently, ideally several times a day, to detect errors quickly.
The order in which individual statements, instructions, or function calls are executed within a program.
The state where different nodes (or systems) update their internal states to a common value, usually used in the context of iterative algorithms and distributed systems.
Convolutional Neural Network (CNN)
A class of deep learning neural networks, most commonly applied to analyzing visual imagery, used in image recognition and classification tasks.
A statistical measure that indicates the extent to which two variables change together.
The process by which an operating system or application restarts operation after a crash, possibly recovering lost data.
A time-based job scheduler in Unix-like computer operating systems for scheduling periodic jobs at fixed times, dates, or intervals.
A scheduled task in Unix-based operating systems, used to automate repetitive tasks.
A SQL join that returns the Cartesian product of the joined tables, meaning every row of the first table is combined with every row of the second table.
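A quick way to see the Cartesian product is to cross join two tiny tables in SQLite. The `sizes` and `colors` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sizes (size TEXT)")
conn.execute("CREATE TABLE colors (color TEXT)")
conn.executemany("INSERT INTO sizes VALUES (?)", [("S",), ("M",)])
conn.executemany("INSERT INTO colors VALUES (?)", [("red",), ("blue",)])

# 2 sizes x 2 colors = 4 combined rows.
rows = conn.execute(
    "SELECT size, color FROM sizes CROSS JOIN colors ORDER BY size, color"
).fetchall()
print(rows)
```

The row count is the product of the two table sizes, which is why cross joins on large tables get expensive quickly.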
A statistical method used to estimate the skill of machine learning models. It is primarily used in applied machine learning to assess a predictive modeling algorithm’s performance when there is no separate test dataset available.
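One widely used form is k-fold cross-validation, where the data is split into k parts and each part takes a turn as the held-out test set. The sketch below only generates the index splits; the helper name `k_fold_indices` is an assumption, and training and scoring a model on each split is left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

In practice the data is usually shuffled before splitting; that step is omitted here to keep the example deterministic.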
The practice and study of techniques for securing communication and data from third parties or the public.
CSV (Comma Separated Values)
A simple, plain-text file format used to store tabular data, where each line represents a data record, and each record consists of one or more fields, separated by commas. Suitable for a wide range of applications due to its simplicity, but lacks a standard schema, which can lead to parsing errors.
Select, organize and annotate data to make it more useful for analysis and modeling.
A command-line tool and library for transferring data with URLs, supporting various protocols like HTTP, FTP, and more.
A database object used to traverse the results of a SQL query, allowing individual rows to be accessed.
The practice of protecting systems, networks, and programs from digital attacks aimed at accessing, changing, or destroying sensitive information.
An open source solution for defining, building and managing critical data assets.
The process of gathering and summarizing information in a specified form, often used in statistical analysis.
The assignment of storage space to specific data, often in the context of distributed databases where data is allocated across multiple nodes.
The science of analyzing raw data to draw conclusions about that information.
The process of adding explanatory notes or comments to data, often used in the context of machine learning to create labeled training data.
The process of adding new, updated, or corrected information to an existing database or list.
The overall structure, organization, and rules used to manage and use data within an organization, including the arrangement of data and data processing.
The smallest unit of data storage in a database, storing a set of rows or a subset of a table's columns.
A centralized repository that allows for the management, collaboration, discovery, and consumption of organizational datasets, serving as a metadata inventory.
The gradual loss or deterioration of data quality over time.
A collection of descriptions of the data objects or items in a data model for the benefit of programmers and others who need to refer to them.
A phenomenon where the statistical properties of incoming data change over time, potentially impacting model performance and accuracy.
A unified architecture that provides a consistent and coherent set of capabilities and services across different environments.
The process of aggregating data from different sources to create a single, unified view.
The process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source.
The overall management of the availability, usability, integrity, and security of data employed in an enterprise, involving a set of practices and policies.
A centralized storage repository that allows storing structured and unstructured data at any scale, usually used for big data and real-time analytics.
A modern data architecture that combines the best elements of data lakes and data warehouses, enabling efficient handling of both structured and unstructured data.
The journey that data goes through from creation and initial storage to the time it becomes obsolete and is deleted.
Data Lifecycle Management
The process of managing the flow of data throughout its lifecycle from creation and initial storage to the time it is archived or deleted.
The visualization of the flow and transformation of data as it moves through the various stages of a data pipeline, crucial for understanding and maintaining complex data systems.
A subset of a data warehouse that is designed for a specific line of business or department within an organization.
Subsets of data warehouses designed to provide data for specific business lines or departments.
A decentralized approach to data architecture and organizational structure that treats data as a product and emphasizes domain-oriented decentralized data ownership and architecture.
An automated, process-oriented methodology used to improve the quality and reduce the cycle time of data analytics.
A series of data processing steps involved in the flow of data from the source to its final destination, usually used in the context of ETL and data integration.
Information that helps to trace the origins, processing, and use of data, helping to determine the quality and reliability of the dataset.
A comprehensive way of maintaining the accuracy, reliability, and consistency of data over its entire life cycle.
The existence of data that is additional to the actual data and permits correction of errors in stored or transmitted data.
An expansive storage repository that allows for the integration and storage of data from various sources in its native format.
A repository of data isolated or segregated from other parts of the organization's data system.
Responsible management and oversight of an organization's data to help provide business users with high-quality data.
Data Vault Modeling
A database modeling method specifically designed for top-down data warehouses with a focus on long-term historical storage, traceability, and scalability.
The amount of data available for analysis, usually referred to in the context of Big Data.
A central repository of integrated data from disparate sources, used to store and manage large volumes of historical data and enable fast, complex queries across all the consolidated data.
The use of special data structures that improve the speed of operations in a table, such as search, filter, and sort.
Database Management System (DBMS)
A software package designed to define, manipulate, retrieve, and manage data in a database.
A technique used to increase data availability by maintaining two copies of a single database that must reside on different server instances of SQL Server Database Engine.
A systematic approach of decomposing tables to eliminate data redundancy and undesirable characteristics like insertion, update, and deletion anomalies.
The structure or blueprint of a database that outlines how data is organized and how the data entities relate to one another.
Remove personally identifiable information (PII) from data to protect privacy and comply with regulations.
A condition where two or more database transactions are unable to proceed because each is waiting for the other to release a lock, leading to a cyclic waiting condition.
A tree-like model of decisions used to make predictions, especially in machine learning algorithms.
A process used to eliminate redundant copies of data, ensuring data accuracy and reducing storage overhead.
Identify and remove duplicate records or entries to improve data quality.
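A minimal sketch of key-based deduplication, assuming records are dictionaries and that "duplicate" means "same values in the chosen key fields". The `deduplicate` helper and the sample records are hypothetical.

```python
def deduplicate(records, key_fields):
    """Keep the first record seen for each combination of key fields."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},   # same email as id 1
]
unique = deduplicate(records, ["email"])
print(unique)
```

Real datasets often need fuzzier matching (case, whitespace, typos) than the exact comparison used here.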
A subset of machine learning that utilizes neural networks with many layers (hence “deep”) to analyze various factors of data and to learn and make intelligent decisions.
An open-source storage layer that brings reliability to data lakes, ensuring ACID transactions, scalable metadata handling, and unifying streaming and batch data processing.
Remove noise or artifacts from data to improve its accuracy and quality.
The process of attempting to optimize the performance of a database by adding redundant data or by grouping data.
A Natural Language Processing (NLP) technique to analyze the grammatical structure of a sentence to establish relationships between words.
Extracting, transforming, and generating new data from existing datasets.
Deserialization is essentially the reverse process of serialization. See: 'Serialize'.
A set of practices that combines software development (Dev) and IT operations (Ops), aiming to shorten the systems development life cycle and provide continuous delivery.
A system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.
A table in a star schema of a data warehouse that stores categorical, descriptive, hierarchical, or textual attributes of data.
A design technique used in data warehousing to map and visualize data in a way that’s intuitive to business users, typically using facts and dimensions.
Analyzing the number of features or attributes in the data to improve performance.
The process of reducing the number of random variables under consideration by obtaining a set of principal variables, crucial for dealing with the “curse of dimensionality” in high-dimensional spaces.
DAG (Directed Acyclic Graph)
A finite directed graph with no directed cycles, used extensively in representing data flow in data processing systems like Apache Airflow.
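Because a DAG has no cycles, its tasks can always be put in an order where every dependency runs first. The sketch below uses the standard library's `graphlib` module (Python 3.9+) on a made-up four-task pipeline; it is not how any particular orchestrator is implemented.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (hypothetical pipeline).
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

# A valid execution order: every task appears after its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

def has_cycle(graph):
    """Return True if the dependency graph is not a DAG."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False
```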
Transform continuous data into discrete categories or bins to simplify analysis.
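As a small example, ages can be mapped to labeled bins with a sorted list of bin edges and `bisect`. The edges and labels are arbitrary choices for illustration.

```python
from bisect import bisect_right

# Hypothetical bin edges: each edge is the lower bound of the next bin.
edges = [18, 35, 65]
labels = ["minor", "young adult", "adult", "senior"]

def discretize(age):
    """Map a continuous age to one of the labeled bins."""
    return labels[bisect_right(edges, age)]

binned = [discretize(a) for a in (17, 18, 34.5, 35, 64, 65)]
print(binned)
```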
A model in which components located on networked computers communicate and coordinate their actions by passing messages to achieve a common goal.
Distributed Ledger Technology
A decentralized database managed by multiple participants, across multiple nodes.
Distributed Ledger Technology (DLT)
A digital system for recording the transaction of assets wherein transactions and their details are recorded in multiple places at the same time, the most common form being blockchain technology.
A system where components located on networked computers communicate and coordinate their actions by passing messages.
A platform used to develop, ship, and run applications inside containers, promoting software reliability and scalability.
Document Store Database
A type of NoSQL database designed to store, manage, and retrieve document-oriented information, also known as semi-structured data.
Domain-Driven Design (DDD)
An approach to software development that centers the design and development process on the business domain, ensuring that the software solves real business problems.
The process of reducing the amount of data in a dataset, primarily by reducing the number of points in the data or reducing the precision of the data.
Identifying when the statistical properties of the target variable, which the model is trying to predict, change.
Data that change frequently and are usually generated in real-time, such as stock prices or sensor data.
A programming environment that evaluates operations immediately, instead of building graphs to run later, typically used in TensorFlow for debugging and interactive development.
A form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent, by stopping the training process before it completes all iterations.
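The stopping rule is often expressed with a "patience" counter: stop once the validation loss has failed to improve for a given number of epochs. The sketch below applies that rule to a precomputed list of losses; the function name, the patience value, and the loss values are all assumptions, and no actual model is trained.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = stopped_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        stopped_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break   # stop before the remaining epochs are run
    return best_epoch, stopped_epoch

# Hypothetical validation losses, one per epoch.
losses = [0.90, 0.70, 0.60, 0.65, 0.66, 0.50]
best_epoch, stopped_epoch = early_stopping_epoch(losses, patience=2)
print(best_epoch, stopped_epoch)
```

Note that training stops at epoch 4 and never reaches the lower loss at epoch 5, which is the trade-off a patience setting makes.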
A distributed computing paradigm that brings computation and data storage closer to the sources of data generation, improving response times and saving bandwidth.
The ability of a system to efficiently allocate resources to meet demand and then deallocate resources when they are no longer needed.
A search engine based on the Lucene library, providing a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
The integration of analytical capabilities and content within the business process applications.
A layer within a neural network that learns to map the input data (such as words in text) into fixed-size dense vectors of continuous values, usually as the first layer in a network processing sequential or textual data.
Convert categorical variables into numerical representations for ML algorithms.
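One-hot encoding is a common way to do this: each category becomes its own 0/1 column. The `one_hot` helper below is a hand-rolled sketch; libraries such as pandas or scikit-learn provide their own encoders.

```python
def one_hot(values):
    """Return (categories, rows) where each row has a single 1."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

categories, rows = one_hot(["red", "green", "red", "blue"])
print(categories)
print(rows)
```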
The process of converting data into a code to prevent unauthorized access.
The process of enhancing, refining, and improving raw data by adding information to it.
A technique used in machine learning that combines several models to solve a single predictive problem, enhancing the performance and robustness of the model.
The process of identifying and linking mentions of the same entity across different data sources, critical for creating a unified view of entities from disparate data sources.
A data model for describing a database in an abstract way, using entities, relationships, and attributes.
Temporary storage that is provisioned for a short period of time and is deleted when the instance using it is terminated.
ETL (Extract, Transform, Load)
A type of data integration that refers to the three steps used to blend data from multiple sources.
The process of validating, verifying, and qualifying data while preventing duplicate records and data loss, conducted during the ETL process.
A software architecture paradigm promoting the production, detection, consumption of, and reaction to events.
A subset of evolutionary computation, a generic population-based metaheuristic optimization algorithm used to find approximate solutions to optimization and search problems.
A unit of information or computer storage equal to one quintillion bytes (1 billion gigabytes).
Computing systems capable of at least one exaFLOP, or a billion billion calculations per second, representing a thousandfold increase over petascale.
Explainable AI (XAI)
An area in AI that develops methods and techniques to help human users understand and trust the output and operations of machine learning models.
Understand the data, identify patterns, and gain insights.
Extract data from a system for use in another system or application.
The process of retrieving data out of unstructured data sources for further processing or storage.
Extract, Load, Transform (ELT)
A variant of ETL in which extracted data is loaded into the target system and then transformed.
Predict values outside a known range, based on the trends or patterns identified within the available data.
Factory patterns let you create objects through a common interface, with subclasses deciding which class to instantiate.
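A sketch of the factory method variant: the base `Importer` defines the workflow, and each subclass decides which `Reader` to create. All class names here are hypothetical.

```python
import csv
import io
import json
from abc import ABC, abstractmethod

class Reader(ABC):
    @abstractmethod
    def read(self, text):
        """Parse raw text into a list of records."""

class CsvReader(Reader):
    def read(self, text):
        return list(csv.DictReader(io.StringIO(text)))

class JsonReader(Reader):
    def read(self, text):
        return json.loads(text)

class Importer(ABC):
    @abstractmethod
    def make_reader(self) -> Reader:
        """Factory method: subclasses choose the concrete Reader."""

    def load(self, text):
        # The shared workflow never names a concrete Reader class.
        return self.make_reader().read(text)

class CsvImporter(Importer):
    def make_reader(self):
        return CsvReader()

class JsonImporter(Importer):
    def make_reader(self):
        return JsonReader()

csv_rows = CsvImporter().load("a,b\n1,2\n")
json_rows = JsonImporter().load('[{"a": 1}]')
```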
A pipeline design in which one operation is broken into - or results in - many parallel downstream tasks.
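A minimal sketch of this design with a thread pool: one job is split into partitions that are processed in parallel, then the partial results are combined. The partitions and the `process_partition` function are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    # Placeholder for real per-partition work.
    return sum(partition)

# Hypothetical input, already split into three partitions.
partitions = [[1, 2], [3, 4], [5, 6]]

# Fan out: one downstream task per partition, run in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(process_partition, partitions))

# Combine the partial results back into a single value.
total = sum(partials)
print(partials, total)
```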
The property that enables a system to continue operating properly in the event of the failure of some of its components.
A binary columnar serialization format optimized for use with DataFrames in analytics. It is language agnostic, though it is most commonly used with Python and R. Ideal for fast, lightweight reading and writing of data frames.
The process of using domain knowledge to create new features from the existing ones, improving the performance of machine learning models.
Identify and extract relevant features from raw data for use in analysis or modeling.
A method used to normalize the range of independent variables or features of data.
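Min-max scaling is one such method: it maps each value into the range 0 to 1 using x' = (x - min) / (max - min). The helper name and the handling of constant features below are choices made for this sketch.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest is 0.0 and the largest is 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no range; map everything to 0.0 (a choice).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30])
print(scaled)
```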
The process of selecting a subset of relevant features (variables, predictors) for use in model construction, reducing overfitting and improving model generalization.
A centralized repository for storing, serving, and sharing machine learning features, allowing for the consistent use of features across different models.
A machine learning approach that trains an algorithm across multiple decentralized devices or servers holding local data samples and without exchanging them.
A type of query in database computing, spanning multiple databases, possibly using different database management systems.
The way in which data is stored in a file, designated by a file extension.
Extract a subset of data based on specific criteria or conditions.
An open-source stream-processing framework for high-throughput, fault-tolerant, and scalable processing of data streams.
A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
A set of one or more columns used to establish a link between the data in two tables by referencing a unique key in another table.
Break data down into smaller chunks for storage and management purposes.
Full Stack Development
The development of both front end (client-side) and back end (server-side) portions of a web application.
Function as a Service (FaaS)
A category of cloud services that provides a platform allowing customers to develop, run, and manage application functionalities without the complexity of building and maintaining the underlying infrastructure.
A programming paradigm that treats computation as the evaluation of mathematical functions and avoids changing state and mutable data.
A form of automatic memory management in which a background process identifies and frees objects that are no longer needed by the program.
Gated Recurrent Unit (GRU)
A variant of the Recurrent Neural Network (RNN) that uses gating mechanisms to capture dependencies in sequences of varied lengths.
A search heuristic that is inspired by Charles Darwin’s theory of natural evolution, used to find approximate solutions to optimization and search problems.
Replication of datasets across geographical locations, primarily for data resilience and availability purposes.
The gathering, display, and manipulation of imagery, GPS, satellite photographs, and historical data represented in terms of geographic coordinates.
A free and open-source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
A web-based platform that provides hosting for software development and a community of developers to work together and share code.
A fully-managed, serverless data warehouse that enables super-fast SQL queries using the processing power of Google's infrastructure.
Google Cloud Platform (GCP)
A provider of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, and YouTube.
A machine learning technique for regression and classification problems, which builds an ensemble of weak learners (typically decision trees) in a stage-wise fashion, with each stage correcting the errors of the previous ones.
A database designed to treat the relationships between data as equally important to the data itself, used to store data whose relations are best represented as a graph.
A type of data processing that uses graph theory to analyze and visually represent data relationships.
A field in discrete mathematics that studies graphs, which are mathematical structures used to model pairwise relations between objects, important in understanding the structure of various kinds of networks, including data networks.
An algorithmic paradigm that makes locally optimal choices at each stage with the hope of finding the global optimum.
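The "hope" in the greedy definition above can be made concrete with coin change, a common illustration chosen here as an assumption rather than taken from the glossary. The function name and denominations are hypothetical. Always taking the largest coin that fits is optimal for some coin systems and not for others.

```python
def greedy_change(amount, denominations):
    """Repeatedly take the largest coin that still fits (the locally best choice)."""
    coins = []
    for coin in sorted(denominations, reverse=True):
        while amount >= coin:
            amount -= coin
            coins.append(coin)
    return coins

# With these denominations the greedy result happens to be optimal.
canonical = greedy_change(63, [25, 10, 5, 1])

# With these it is not: greedy uses three coins, while 3 + 3 needs only two.
non_canonical = greedy_change(6, [4, 3, 1])
```

The second call shows why a greedy algorithm only hopes for the global optimum: each step is locally best, yet the overall answer uses more coins than necessary.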
A form of distributed computing whereby a 'super and virtual computer' is composed of clustered, networked, loosely coupled computers acting in parallel to perform very large tasks.
An approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.
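A minimal sketch of the grid search idea above, using only the standard library. The scoring function toy_score is a hypothetical stand-in for "train a model and measure it on validation data", and the parameter names are illustrative assumptions.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical score: peaks at learning_rate=0.1 and depth=3.
def toy_score(learning_rate, depth):
    return -((learning_rate - 0.1) ** 2) - (depth - 3) ** 2

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 3, 4]}
best_params, best_score = grid_search(grid, toy_score)
```

The cost grows with the product of the list lengths (nine evaluations here), which is the main practical limit of an exhaustive grid.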
A file format and a software application used for file compression and decompression.
Hadoop Distributed File System (HDFS)
A distributed file system designed to run on commodity hardware, providing high-throughput access to application data and fault tolerance.
Convert data into a fixed-length code to improve data security and integrity.
A function that converts an input into a fixed-size string of bytes, typically a digest that is always the same for a given input and, for a well-designed function, very unlikely to match the digest of a different input.
The process of transforming input of any length into a fixed-size string of text, typically using a hash function.
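The hashing definitions above can be sketched with Python's hashlib module. SHA-256 is chosen here purely as an example of a hash function, and the helper name sha256_hex is an assumption for illustration.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

short_digest = sha256_hex(b"a")
long_digest = sha256_hex(b"a" * 10_000)

# Inputs of very different lengths produce digests of the same length.
same_length = len(short_digest) == len(long_digest)

# Hashing the same input again always yields the same digest.
repeatable = sha256_hex(b"a") == short_digest
```

Because the output size is fixed while the input space is unbounded, two different inputs can in principle share a digest; good hash functions make that extremely unlikely in practice.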
HDF5 (Hierarchical Data Format version 5)
A file format and set of tools for managing complex data. It is designed for flexible, efficient I/O and for high volume and complex data sets and supports an unlimited variety of datatypes.
A specialized tree-based data structure that satisfies the heap property, used to implement priority queues and the heapsort algorithm.
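The heap property can be seen with Python's heapq module, which maintains a min-heap inside a plain list. The task names and priorities below are made up for illustration.

```python
import heapq

# (priority, task) pairs; a lower number means more urgent.
tasks = [(3, "archive logs"), (1, "load orders"), (2, "refresh dashboard")]

heap = []
for task in tasks:
    heapq.heappush(heap, task)

# In a min-heap the smallest item is always at index 0.
most_urgent = heap[0]

# Popping repeatedly yields items in ascending order (the idea behind heapsort).
ordered = [heapq.heappop(heap) for _ in range(len(heap))]
```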
A package manager for Kubernetes that allows developers and operators to more easily package, configure, and deploy applications and services onto Kubernetes clusters.
Heterogeneous Database System
A system that uses middleware to connect databases that are not alike and are running on different DBMSs, possibly on different platforms.
Hierarchical Database Model
A data model where data is organized into a tree-like structure with a single root, to which all other data is linked in a hierarchy.
A characteristic of a system aiming to ensure an agreed level of operational performance for a higher than normal period.
A term used to define the uniqueness of data values contained in a column. If a column has a high number of unique values, it is said to have high cardinality.
Systems designed to be operational and accessible for longer periods, minimizing downtime and ensuring continuous service.
Homogeneous Database System
A system where all databases are based on the same DBMS technology.
Make data uniform, consistent, and comparable.
Adding more machines to a network to improve the capability to handle more load and perform better, also known as scaling out.
Provides comprehensive solutions for data management and analytics.
The immediate, high-speed storage of data that is frequently accessed and modified, enabling rapid retrieval and updates.
Analyzing HTML code to extract relevant information and understand the structure of the content, often used in web scraping.
Memory pages that are larger than the standard memory page size, beneficial in managing large amounts of memory.
An IT architecture that incorporates some degree of workload portability, orchestration, and management across a mix of on-premises data centers, private clouds, and public clouds.
A configuration value that is external to the model and cannot be estimated from data; hyperparameters are set before training and govern how the model parameters are estimated.
The process of optimizing the configuration parameters of a machine learning model, called hyperparameters, to improve model performance on a given metric.
A piece of software, firmware, or hardware that creates and runs virtual machines (VMs).
A property of certain operations in mathematics and computer science, whereby they can be applied multiple times without changing the result beyond the initial application.
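A small sketch of the idempotence definition above, contrasting an idempotent operation with a non-idempotent one. The record layout and function names are hypothetical.

```python
def set_status(record: dict, status: str) -> dict:
    """Idempotent: repeating the call with the same status changes nothing further."""
    updated = dict(record)
    updated["status"] = status
    return updated

def increment_retries(record: dict) -> dict:
    """Not idempotent: every additional call changes the result again."""
    updated = dict(record)
    updated["retries"] = updated.get("retries", 0) + 1
    return updated

job = {"id": 42, "retries": 0}

once = set_status(job, "done")
twice = set_status(once, "done")

bumped_once = increment_retries(job)
bumped_twice = increment_retries(bumped_once)
```

This matters in pipelines because a retried or replayed step that is idempotent can run again safely, while a non-idempotent one may double-count.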
Data that, once created, cannot be changed. Any modification necessitates the creation of a new instance.
An open-source, native analytic database for Apache Hadoop, providing high-performance, low-latency SQL queries on Hadoop data.
The process of replacing missing data with substituted values, allowing more robust analysis when dealing with incomplete datasets.
Fill in missing data values with estimated or imputed values to facilitate analysis.
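One simple form of imputation is mean imputation, sketched below with the standard library. Treating None as the missing-value marker and the helper name impute_mean are assumptions made for this example; other strategies (median, model-based) follow the same shape.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

readings = [10.0, None, 14.0, None, 12.0]
imputed = impute_mean(readings)
```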
In-Memory Database (IMDB)
A database management system that primarily relies on main memory for data storage, which makes it faster than disk-based databases.
Create an optimized data structure for fast search and retrieval.
The process of creating a data structure (an index) to improve the speed of data retrieval operations on a database.
A closed-source data management and data integration solutions provider.
The process of obtaining information from a repository, often concerning text-based search.
Infrastructure as Code (IaC)
A key DevOps practice that involves managing and provisioning computing infrastructure through machine-readable script files, rather than through physical hardware configuration or interactive configuration tools.
The initial collection and import of data from various sources into your processing environment.
The process of importing, transferring, loading, and processing data for later use or storage in a database.
Input/Output Operations Per Second (IOPS)
A common performance measurement used to benchmark computer storage devices like hard disk drives (HDD), solid-state drives (SSD), and storage area networks (SAN).
A single occurrence of an object, often referring to virtual machines (VMs) or individual database items.
The process of combining data from different sources and providing users with a unified view of them.
A level of software testing where individual units are combined and tested as a group, to expose faults in the interaction between integrated units.
Rules applied to maintain the quality and accuracy of the data inside a database, such as uniqueness, referential integrity, and check constraints.
A query mechanism allowing users to ask spontaneous questions and receive rapid responses, used in analyzing datasets.
The ability of different IT systems, software applications, and devices to communicate, exchange, and use information effectively.
Use known data values to estimate unknown data values.
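Linear interpolation is the simplest case of the idea above: estimate a value between two known points by assuming a straight line between them. The function name lerp and the sample points are illustrative assumptions.

```python
def lerp(x, x0, y0, x1, y1):
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Known readings at x=1.0 (y=10.0) and x=3.0 (y=20.0); estimate the value at x=2.0.
estimate = lerp(2.0, 1.0, 10.0, 3.0, 20.0)
```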
Interval Data Type
A type of data that represents a duration between two datetime values, such as the span of time between a start-time and an end-time.
Inversion of Control (IoC)
A design principle in which the custom-written portions of a computer program receive the flow of control from a generic, reusable library.
Different configurations used in databases to trade off consistency for performance, such as Read Uncommitted, Read Committed, Repeatable Read, and Serializable.
A software development model that involves repeating the same set of activities for each portion of the project, allowing refinement with each iteration.
Java Database Connectivity (JDBC)
An API for the Java programming language that defines how a client may access a database, providing methods to query and update data in a database.
An open-source automation server, helping to automate parts of the software development process.
A SQL operation used to combine rows from two or more tables based on a related column between them.
An SQL operation performed to connect rows from two or more tables based on a related column.
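The join definitions above can be sketched with Python's built-in sqlite3 module. The tables customers and orders and their contents are hypothetical; the related column is orders.customer_id, which matches customers.id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Inner join: each order row is paired with the customer it refers to.
rows = conn.execute("""
    SELECT c.name, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()
```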
An open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.
Just-In-Time Compilation (JIT)
A way of executing computer code that involves compilation during the execution of a program at runtime rather than prior to execution, improving the execution efficiency.
A partitioning method that divides a dataset into subsets (clusters), where each data point belongs to the cluster with the nearest mean.
K-Nearest Neighbors (KNN)
A simple, supervised machine learning algorithm used for classification and regression, which predicts the classification or value of a new point based on the K nearest points.
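A minimal classification sketch of K-Nearest Neighbors in plain Python, assuming Euclidean distance and a majority vote. The training points and labels are invented for illustration, and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the label of query from the k closest training points.

    train is a list of (point, label) pairs, where point is a tuple of numbers.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
prediction = knn_predict(train, (1.1, 1.0))
```

For regression, the vote would be replaced by an average of the neighbors' values.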
An open-source stream processing platform developed by LinkedIn and donated to the Apache Software Foundation, designed for high-throughput, fault-tolerance, and scalability.
Key Performance Indicator (KPI)
A type of performance measurement that evaluates the success of an organization, employee, etc., in achieving objectives.
A type of NoSQL database that uses a simple key/value method to store data, suitable for storing large amounts of data.
An open-source data visualization dashboard for Elasticsearch, providing visualization capabilities on top of the content indexed in Elasticsearch clusters.
A platform provided by Amazon Web Services (AWS) to collect, process, and analyze real-time, streaming data.
A knowledge base used to store complex structured and unstructured information used by machines and humans to enhance search and understand relationships and properties of the data.
An open-source platform designed to automate deploying, scaling, and operating application containers, allowing for easy management of containerized applications across multiple hosts.
A data processing architecture designed to handle massive quantities of data by combining batch processing and stream processing, providing a balance between latency, throughput, and fault-tolerance.
Delaying the binding of referenced attributes and methods until runtime.
Latent Semantic Analysis (LSA)
A technique in natural language processing and information retrieval to discover relationships between words and the concepts they form.
A design pattern used in computer programming to defer initialization of an object until the point at which it is needed.
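One way to express the lazy loading pattern in Python is functools.cached_property (Python 3.8+), which computes an attribute on first access and reuses the result afterwards. The Report class and its counter are hypothetical and exist only to make the deferred load visible.

```python
from functools import cached_property

class Report:
    load_count = 0  # counts how many times the expensive load actually runs

    @cached_property
    def rows(self):
        Report.load_count += 1
        return [1, 2, 3]  # stand-in for an expensive query or file read

report = Report()
loads_before_access = Report.load_count  # nothing loaded at construction time

first = report.rows   # triggers the load
second = report.rows  # served from the cached value
loads_after_access = Report.load_count
```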
Understand how data moves through a pipeline, including its origin, transformations, dependencies, and ultimate consumption.
A statistical method used to model the relationship between a dependent variable and one or more independent variables, predicting outcomes.
A method of publishing structured data so that it can be interlinked and become more useful, leveraging the structure of the data to enhance its usability and discoverability.
The process of transferring data from one location, format, or application to another, typically into a database.
A device or software function that distributes network or application traffic across multiple servers, optimizing resource use, maximizing throughput, minimizing response time, and avoiding overload.
The process of reducing the load on a system by restricting the number of incoming requests.
A type of non-functional testing conducted to understand the behavior of the application under a specific expected load, identifying the maximum operating capacity of an application and any bottlenecks.
The process of adapting internationalized software for a specific region or language by adding locale-specific components and translating text.
A mechanism employed by RDBMSs to regulate data access in multi-user environments, ensuring the integrity of data by preventing multiple users from altering the same data at the same time.
Files that record events occurring in an operating system or other running software, or messages exchanged between users of communication software.
A process that involves analyzing log files from different sources to uncover insights, which can be used for various purposes such as security, performance monitoring, and user behavior analysis.
A statistical method used to analyze a dataset and predict binary outcomes, utilizing a logistic function to model a binary dependent variable.
A server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a 'stash' like Elasticsearch.
Long Short-Term Memory (LSTM)
A special kind of RNN capable of learning long-term dependencies, particularly useful when important events are separated by very long time lags.
A web communication technique where the client requests information from the server, and the server holds the request open until new information is available.
A data exploration and discovery business intelligence platform.
A table with one or more columns, where you look up a value in the table based on the value in one or more columns.
A function used in optimization to measure the difference between the predicted value and the actual value, guiding the model training process.
Characterized by a short delay from input into a system to the desired outcome, crucial in systems requiring real-time response.
An older Python module that helps you build basic pipelines of batch jobs.
A method of data analysis that automates analytical model building, enabling systems to learn from data, identify patterns, and make decisions.
Machine Learning Operations (MLOps)
A practice for collaboration and communication between data scientists and operations professionals to help manage the production machine learning lifecycle.
Machine Learning Pipeline
A sequence of data processing and machine learning tasks, assembled to create a model, with each step in the sequence processing the data and passing it on to the next step.
Direct communication between devices using any communications channel, including wired and wireless.
The process of defining relationships between two distinct data models.
Offers a comprehensive data platform with the speed, scale, and reliability required by enterprise-grade applications.
A programming model for processing and generating large datasets in parallel with a distributed algorithm on a cluster, initially developed by Google.
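The shape of the programming model above can be sketched as a word count in a single Python process. This only illustrates the map, shuffle, and reduce phases; it is not distributed, and the function names are assumptions rather than any framework's API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

documents = ["big data big pipelines", "data pipelines move data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster, the map calls run in parallel on different machines and the shuffle moves data across the network; the per-key independence of the reduce step is what makes that parallelism possible.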
A lightweight markup language with plain text formatting syntax designed for creating rich text using a plain text editor.
The method of protecting sensitive information in non-production environments by altering data records so that the structure remains similar while the information itself is changed.
Master Data Management (MDM)
A method that defines and manages the critical data of an organization to provide a single point of reference across the organization.
Executing a computation and persisting the results into storage.
A database object that contains the results of a query, providing indirect access to table data by storing the results of the query in a separate schema object.
Mean Squared Error (MSE)
A measure of the average of the squares of the errors, used as a risk metric corresponding to the expected value of the squared (quadratic) error or loss.
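The Mean Squared Error definition translates directly into a few lines of Python. The function name and sample values are illustrative.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Errors are 1, 0, and -2, so the squared errors are 1, 0, and 4.
mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```

Squaring makes large errors count disproportionately, which is why a single bad prediction can dominate the metric.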
A measure of central tendency representing the middle value of a sorted list of numbers, separating the higher half from the lower half of the data set.
Store the results of expensive function calls and reuse them when the same inputs occur again.
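In Python, one readily available way to memoize a function is functools.lru_cache. The Fibonacci function is a conventional illustration chosen here as an assumption; without the cache it would recompute the same subproblems many times over.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache keyed by the function arguments
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

value = fib(30)

# cache_info() reports how many calls were computed (misses) versus reused (hits).
info = fib.cache_info()
```

Each distinct argument from 0 to 30 is computed once; every other call is answered from the cache.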
Combine data from multiple datasets into a single dataset.
A method by which information is communicated between distributed or parallel processes in a computer system.
A form of asynchronous service-to-service communication used in serverless and microservices architectures.
A binary format efficiently encoding objects and their fields in a compact binary representation. It is more efficient and compact compared to JSON, used when performance and bandwidth are concerns.
Data that provides information about other data, such as data structure or content details.
The administration of data that describes other data, involving establishing and managing descriptions, definitions, scope, ownership, and other characteristics of metadata.
A data processing method that deals with relatively small batches of data, providing a middle ground between batch processing and stream processing.
A software development technique that structures an application as a collection of loosely coupled services, allowing for improved scalability and ease of updates.
An architectural style that structures an application as a collection of services, which are highly maintainable and testable, loosely coupled, independently deployable, and precisely scoped.
A cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through Microsoft-managed data centers.
Microsoft SQL Server
A relational database management system developed by Microsoft.
Microsoft SSIS (SQL Server Integration Services)
A platform for data integration and workflow applications.
Software that acts as a bridge between an operating system or database and applications, enabling communication and data management.
The process of transferring data between storage types, formats, or computer systems, usually performed programmatically.
Extract useful information, patterns or insights from large volumes of data using statistics and machine learning.
The process of creating abstract representations of the structure and relationship between various data items in an application or database.
The integration of a machine learning model into an existing production environment to make practical business decisions based on data.
The task of selecting a statistical model from a set of candidate models, based on the performance of the models on a given dataset.
The process of assessing how well your model performs at making predictions on new data, by using various metrics and statistical methods.
A popular NoSQL database, utilizing a document-oriented database model.
Track data processing metrics and system health to ensure high availability and performance.
The process of observing and checking the quality or content of data over a period, aimed at detecting patterns, performance, failures, or other attributes.
The use of multiple cloud computing and storage services in a single network architecture, utilized by businesses to spread computing resources and minimize the risk of data loss or downtime.
A reference to the mode of operation of software where multiple independent instances of one or multiple applications operate in a shared environment.
Multidimensional Scaling (MDS)
A means of visualizing the level of similarity of individual cases of a dataset, used in information visualization to detect patterns in high-dimensional data.
A type of classification task where each instance (or data point) can belong to multiple classes, as opposed to just one in the traditional case.
Multilayer Perceptron (MLP)
A class of feedforward artificial neural network consisting of at least three layers of nodes, used for classification and regression.
Optimize execution time with multiple parallel processes.
The ability of a CPU, or a single core in a multi-core processor, to provide multiple threads of execution concurrently.
The capability of an object to be altered or changed, often used in contrast with immutability, which refers to the incapacity to be changed.
A popular open-source relational database management system.
N+1 Query Problem
A common performance problem in applications that use ORMs to fetch data. It occurs when the system retrieves related objects with a separate query for each parent object, leading to a high number of executed SQL queries.
Naïve Bayes Classifier
A family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features.
Named Entity Recognition (NER)
A subtask of information extraction that classifies named entities in text into pre-defined categories such as person names, organizations, locations, etc.
A container that holds a set of identifiers to help avoid collisions between identifiers with the same name.
Natural Language Processing (NLP)
A field of artificial intelligence that focuses on the interaction between computers and humans through natural language, enabling computers to understand, interpret, and generate human language.
A network failure that divides a network into two or more disconnected sub-networks due to the failure of network devices.
A set of algorithms, modeled loosely after the human brain, designed to recognize patterns in data through machine learning.
Assess the normality of data distributions to ensure validity and reliability of statistical analysis.
The process of organizing the columns (attributes) and tables (relations) of a relational database to reduce redundancy and dependency.
Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.
A non-relational database that allows for storage and processing of large amounts of unstructured data and is designed for distributed data stores where very large-scale processing is needed.
A general statement or default position that there is no relationship between two measured phenomena, to be tested and refuted in the process of statistical hypothesis testing.
A library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
The technique of disguising data by replacing, encrypting, or removing sensitive information to protect the data subject.
A storage architecture that manages data as objects, as opposed to other storage architectures like file systems or block storage.
Object-Relational Mapping (ORM)
A programming technique for converting data between a relational database and the objects of an object-oriented programming language, whose type systems are otherwise incompatible.
The ability to understand the internal state of a system from its external outputs, crucial in modern computing environments to ensure the reliability, availability, and performance of systems.
OLAP (Online Analytical Processing)
A category of software tools that allows users to analyze data from multiple database dimensions.
A multi-dimensional array of data used for complex calculations, enabling users to drill down into multiple levels of hierarchical data, making it a key technology for data analysis and reporting.
OLTP (Online Transaction Processing)
A type of processing that facilitates and manages transaction-oriented applications.
A process of converting categorical variables into binary indicator columns, one per category, so they can be provided to machine learning algorithms that expect numeric input.
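A minimal sketch of one-hot encoding in plain Python: each category becomes its own position, and each value is represented by a vector with a single 1. The helper name one_hot and the alphabetical ordering of categories are choices made for this example.

```python
def one_hot(values):
    """Return the sorted categories and one indicator vector per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot(["red", "green", "red", "blue"])
```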
Online Analytical Processing (OLAP)
A category of software tools that enable users to interactively analyze multidimensional data from multiple perspectives.
A representation of a set of concepts within a domain and the relationships between those concepts, used to reason about the entities within that domain.
Open Database Connectivity (ODBC)
A standard application programming interface (API) for accessing database management systems.
Operating System (OS)
Software that manages computer hardware and provides various services for computer programs, serving as a bridge between users and the computer hardware.
Operational Data Store (ODS)
A database designed to integrate data from multiple sources for additional operations on the data, serving as an intermediary between the source systems and the data warehouse.
Optimistic Concurrency Control
A concurrency control method for transactional systems that lets transactions proceed without locking and checks for conflicting updates before committing.
The process of adjusting a system to improve its efficiency or use of resources, usually in the context of improving the performance of algorithms and models.
A multi-model database management system.
ORC (Optimized Row Columnar)
A columnar storage file format that is optimized for heavy read access and well suited to storing and processing big data workloads. It compresses well, reducing the amount of storage space needed for large datasets.
Automated configuration, coordination, and management of computer systems, middleware, and services.
The identification of rare items, events, or observations in a data set that raise suspicions due to differences in pattern or behavior from the majority of the data.
A modeling error that occurs when a function is too closely tailored to the training dataset; hence, the model performs well on the training dataset but poorly on new, unseen data.
A measure in statistical hypothesis testing: the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. Smaller values indicate stronger evidence for rejecting the null hypothesis.
A transparent cache for the pages originating from a secondary storage device such as a hard disk drive.
An algorithm used by Google Search to rank web pages in its search results.
A fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation library built on top of Python.
A type of computation in which many calculations or processes are carried out simultaneously, suitable for tasks where many operations are independent of each other.
Boost execution speed of large data processing by breaking the task into many smaller concurrent tasks.
The adjustment of a model's internal parameters, such as weights, during training, with the aim of improving model accuracy.
A columnar storage file format optimized for use with big data processing frameworks. It is highly efficient for both storage and processing, especially for complex nested data structures, and it supports schema evolution, allowing users to modify Parquet schema after data ingestion.
Interpret and convert data from one format to another.
The process of dividing a database into smaller, more manageable pieces, usually for improving performance, manageability, and availability.
A database design technique to improve performance, manageability, or availability by splitting tables into smaller, more manageable pieces.
A branch of machine learning that focuses on the recognition of patterns and regularities in data.
The part of the transmitted data that is the actual intended message, excluding any headers or metadata sent mainly for the purpose of the delivery of the payload.
A decentralized network where each connected computer has equal status and can interact with each other without a central server.
A statistical measure that indicates the value below which a given percentage of observations fall in a group of observations.
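There are several conventions for computing a percentile. The sketch below assumes the nearest-rank method, which always returns a value that is actually present in the data; libraries often interpolate between values instead. The function name and latency figures are made up.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest value such that at least p percent of the data is <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

latencies_ms = [12, 15, 11, 90, 14, 13, 16, 18, 17, 19]
p50 = percentile_nearest_rank(latencies_ms, 50)
p90 = percentile_nearest_rank(latencies_ms, 90)
p99 = percentile_nearest_rank(latencies_ms, 99)
```

High percentiles such as p99 expose outliers (here the single 90 ms request) that an average would blur.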
The improvement of system performance, typically in computer systems and networks, by adjusting various underlying parameters and configurations.
A type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points.
The data access layer in a software application that stores and retrieves data from databases, files, and other storage locations.
Convert a Python object into a byte stream for efficient storage.
A set of tools and processes chained together to automate the flow of data from source to storage, allowing for stages of transformation and analysis in between.
The use of various, often complementary database technologies to handle varying data storage needs within a given software application.
A type of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial.
Advanced, open-source object-relational database management system.
A business analytics service by Microsoft, providing interactive visualizations with self-service business intelligence capabilities.
A metric in classification that measures the number of true positive results divided by the number of all results predicted as positive, that is, true positives plus false positives.
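Reading "all positive results" as everything the classifier labeled positive, precision is TP / (TP + FP). A minimal sketch follows; the function name, the label encoding (1 for positive), and the sample data are assumptions.

```python
def precision(actual, predicted, positive=1):
    """True positives divided by all predictions of the positive class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if p == positive and a == positive)
    false_pos = sum(1 for a, p in pairs if p == positive and a != positive)
    total_predicted_pos = true_pos + false_pos
    return true_pos / total_predicted_pos if total_predicted_pos else 0.0

actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 1]

# Four items were predicted positive; two of them really are positive.
score = precision(actual, predicted)
```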
The use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data.
The process of creating, testing, and validating a model to best predict the probability of an outcome.
Transform your data so it is fit-for-purpose.
Transform raw data before data analysis or machine learning modeling.
A unique identifier for a record in a database table, ensuring that each record can be uniquely identified and retrieved.
Principal Component Analysis (PCA)
A dimensionality reduction technique used to emphasize variation and bring out strong patterns in a dataset, often used before fitting a machine learning model to the data.
Probabilistic Data Structure
A high-performance, low-memory data structure that provides approximations to set operations, often used for tasks like membership tests, frequency counting, and finding heavy hitters.
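A Bloom filter is one well-known probabilistic data structure for membership tests. The sketch below is a simplified illustration, not a tuned implementation: the bit-array size, the number of hashes, and the use of seeded SHA-256 digests are all assumptions made for readability.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

seen = BloomFilter()
for user_id in ["u1", "u2", "u3"]:
    seen.add(user_id)

known_present = seen.might_contain("u2")
```

The approximation is one-sided: an added item is never reported as absent, while an item that was never added can occasionally be reported as possibly present.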
Manipulation of data to convert it from one form to another or to reduce it to a more manageable state.
A form of data security which prevents running processes from interacting with each other, often used in multitasking operating systems to increase security and stability.
The process of examining, analyzing, and reviewing data to collect statistics and information about the quality and the nature of the data items.
The automated buying and selling of online advertising, optimizing based on algorithms and data.
A database operation that returns a set of columns (attributes) in a table, reducing the number of columns in the resultant relation.
Protobuf (Protocol Buffers)
A method developed by Google for serializing structured data, comparable to XML and JSON but designed to be simpler and more efficient. Protobuf is language-agnostic, making it highly versatile for different systems.
The process of quickly creating a working model (a prototype) of a part of a system, allowing for faster and more efficient final design and development.
A data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers.
A messaging pattern where senders of messages (publishers) do not send messages directly to specific receivers (subscribers) but instead categorize them into topics, to which subscribers subscribe.
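The decoupling in the publish/subscribe pattern can be sketched with a tiny in-process broker. The Broker class, its method names, and the topic names are hypothetical; real systems add persistence, delivery guarantees, and network transport.

```python
from collections import defaultdict

class Broker:
    """Routes each published message to every callback subscribed to its topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher names only a topic, never a specific receiver.
        for callback in self._subscribers.get(topic, ()):
            callback(message)

broker = Broker()
orders_seen, audit_seen = [], []
broker.subscribe("orders", orders_seen.append)
broker.subscribe("orders", audit_seen.append)

broker.publish("orders", {"id": 1})
broker.publish("payments", {"id": 2})  # no subscribers, so nothing is delivered
```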
A method of submitting contributions to an open development project, often used in collaborative development to manage changes from multiple contributors.
The process of permanently and irreversibly deleting old and irrelevant records from a database.
A message that pops up on a mobile device or desktop from an app or website, typically used to deliver updates, news, or promotions.
A module in Python used for serializing and de-serializing Python object structures, converting Python objects into a byte stream.
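A short round trip with the pickle module described above. The record contents are made up for illustration.

```python
import pickle

record = {"id": 7, "tags": ["etl", "batch"], "ratio": 0.25}

payload = pickle.dumps(record)    # Python object -> bytes
restored = pickle.loads(payload)  # bytes -> equivalent Python object
```

Unpickling can execute arbitrary code, so pickle.loads should only be used on data from a trusted source.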
An open-source machine learning library for Python, developed by Facebook’s AI Research lab.
A Business Intelligence (BI) tool ideal for data visualization, analytics development, and reporting.
A data point or set of data points in a dataset that divide your data into “parts” of equal probability, such as the median, quartiles, percentiles, etc.
A type of computation that takes advantage of the quantum states of particles to store information, potentially allowing for the solving of complex problems much faster than classical computers can.
A type of computer language that requests and retrieves data from database management systems.
The process of choosing the most efficient means of executing a SQL statement, usually involving the optimization of SQL queries and projections, and the choice of optimal query plans.
A sequence of steps used to access data in a SQL relational database management system, important for optimizing database queries and improving system performance.
A concept applied in distributed computing to minimize the latency and use of resources while retrieving data and to ensure data availability during component failures.
Radial Basis Function (RBF)
A function whose value depends on the distance between the input and some fixed point, typically used in various areas such as function approximation, time series prediction, and classification.
An ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the mode of the individual trees' classes (classification) or the mean of their predictions (regression).
A type of query that retrieves data based on a range of values, typically used in the context of numerical or datetime values.
Real-Time Bidding (RTB)
A means by which advertising inventory is bought and sold on a per-impression basis, via programmatic instantaneous auction.
The processing of data that continuously enters a system and obtains results within a timeframe short enough to affect the sources of the incoming data.
A subclass of information filtering system that seeks to predict the 'rating' or 'preference' a user would give to an item.
The process of ensuring that two or more datasets are consistent with each other, identifying any discrepancies and resolving them.
The process of finding entries that refer to the same entity in different data sources.
Recurrent Neural Network (RNN)
A class of artificial neural networks designed for sequence prediction problems and other tasks where data points have connections to previous points, such as time series analysis and natural language processing.
An in-memory data structure store, used as a database, cache, and message broker.
The process of reducing the amount of raw data, either by aggregating it, choosing representative subsets, or transforming it into a more compact representation.
Convert a large set of data into a smaller, more manageable form without significant loss of information.
The duplication of critical components or functions of a system with the intention of increasing reliability of the system, usually in the form of a backup or fail-safe.
A property of data stating that all of its references are valid, ensuring that the relationships between tables remain consistent.
A statistical process for estimating the relationships among variables, often used for prediction and forecasting, where one variable is dependent on one or more independent variables.
Regular Expression (Regex)
A sequence of characters defining a search pattern, typically used by string-searching algorithms for 'find' or 'find and replace' operations on strings, crucial for data cleaning and transformation.
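A small sketch of regex-based cleaning with Python's built-in `re` module. The input string and the date format (`YYYY-MM-DD`) are assumptions chosen for the example.

```python
import re

# Hypothetical messy input line.
raw = "  Order  #123 ,  shipped   2024-01-05 "

# "Find and replace": collapse every run of whitespace to a single space.
collapsed = re.sub(r"\s+", " ", raw).strip()

# "Find": extract anything shaped like an ISO date (four digits, two, two).
dates = re.findall(r"\d{4}-\d{2}-\d{2}", raw)
```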
A technique used to prevent overfitting in a machine learning model by adding a penalty term to the model's loss function. Commonly used forms are L1 and L2 regularization.
A type of machine learning where an agent learns how to behave in an environment by performing certain actions and receiving rewards or penalties in return.
A theoretical set of mathematical principles and concepts forming the foundational basis for implementing and optimizing queries in Relational Database Management Systems.
A type of database that stores data in structured tables and is based on the relational model.
A database model based on first-order predicate logic, serving as the basis for relational databases, where all data is represented in terms of tuples, grouped into relations.
Redistribute data across multiple partitions for improved parallelism and performance.
A group of database nodes that maintains the same data set, providing redundancy and increasing data availability with multiple copies of data on different database servers.
The process of copying data from a database in one server or computer to a database in another so that all users share the same level of information.
An area of machine learning where automatic feature learning from raw data is explored, aimed at identifying better representations and improving model generalization.
A message exchange pattern in which a requester sends a request message to a replier system, which then sends a response message in return.
Change the structure of data to better fit specific analysis or modeling requirements.
Resilient Distributed Dataset (RDD)
A fault-tolerant collection of elements that can be processed in parallel; the fundamental data structure of Spark.
The variable that is being predicted or modeled, often denoted as the dependent variable or output variable.
An architectural style for designing networked applications, utilizing stateless, cacheable communications protocols, typically HTTP.
A regularization technique for analyzing multiple regression data that suffer from multicollinearity, shrinking the coefficients of the model towards zero to stabilize them.
The process of identifying and analyzing potential issues that could negatively impact key business initiatives or projects.
The operation by which a database management system undoes the changes made by a partially completed transaction after the transaction fails.
Root Mean Square Error (RMSE)
A standard way to measure the error of a model in predicting quantitative data: the square root of the average of the squared differences between the predicted and the observed values.
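The calculation can be sketched in a few lines of plain Python. The function name `rmse` and its argument names are illustrative choices, not part of any particular library.

```python
import math

def rmse(predicted, observed):
    """Square root of the mean squared difference between two sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the differences are squared before averaging, large errors weigh more heavily than small ones.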
The process of selecting a path for traffic in a network or between or across multiple networks, based on routing table information.
Row-Level Security (RLS)
A method of restricting access at the database row level, based on parameters such as user roles or identity, enabling fine-grained access control.
Ruby on Rails
A server-side web application framework written in Ruby. It follows the model-view-controller (MVC) pattern and provides default structures for a database, a web service, and web pages.
Extract a subset of data for exploratory analysis or to reduce computational complexity.
The process of selecting a subset of elements from a larger set to approximate the properties of the whole set, often used for statistical analysis.
A security mechanism used to run an application in a confined environment, isolating it from the system, preventing it from causing harm or accessing sensitive data.
The capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth.
A quantity represented by a single element in the corresponding field, usually a single number, as opposed to a vector or matrix.
Increasing the capacity or performance of a system to handle more data or traffic.
The organization or structure for a database, defining tables, fields, relationships, indexes, etc.
The ability of a database system to handle changes in a database schema, especially relevant for systems that require flexibility and adaptability to changing data requirements.
Translate data from one schema or structure to another to facilitate data integration.
A strategy where data structure is inferred at read time, typically used in big data processing where data is not predefined and is instead interpreted when it is analyzed.
A strategy where data structure is defined before writing data, typically used in relational databases where data must conform to a known schema before it's written to disk.
A free software machine learning library for the Python programming language. It features various classification, regression, clustering algorithms, and efficient tools for data mining and data analysis.
An open-source Python library used for scientific and technical computing.
Extract data from a website or another source.
The process of extracting data from websites, converting it from unstructured to structured form.
A process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated, also known as data cleansing.
A software application designed to search for information in a database, with requested information returned to the user as search results.
Search Engine Optimization (SEO)
The practice of optimizing content to be discovered through a search engine’s organic search results, affecting the visibility of a website or a web page.
Protect data from unauthorized access, modification, or destruction.
The process of dividing a data set into distinct and meaningful groups, usually to perform more specific analysis, or to target specific subsets of users.
The process of analyzing the meanings of words, texts, and sentences, typically used in NLP to understand the context and intent behind the words.
A class of machine learning tasks and techniques that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data.
The use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
Sequential Pattern Mining
A method of discovering frequent subsequences or patterns in a sequence of items or events, usually in datasets of customer transactions or other sequence data.
The process of converting complex data structures into a format that can be easily stored or transmitted and then reconstructed later.
A cloud-computing execution model where the cloud provider runs the server and dynamically manages the allocation of machine resources, allowing developers to focus on individual functions.
Service-Oriented Architecture (SOA)
An architectural pattern in software design where services are provided to the other components by application components, through a communication protocol over a network.
A method of splitting and storing a single logical dataset in multiple databases to spread the load, enhancing the performance and enabling horizontal scaling.
Break down large datasets into smaller, more manageable pieces for easier processing and analysis.
Randomize the order of data records to improve analysis and prevent bias.
A numeric measure of how alike two data objects are, often used in clustering, classification, or nearest neighbor analysis.
Single Source of Truth (SSOT)
A practice of structuring information models and associated schema such that every data element is mastered in only one place.
Site Reliability Engineering (SRE)
A discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems, aiming for creating scalable and highly reliable software systems.
A condition in which the distribution of data is not uniform, impacting the performance of data processing in parallel computing environments.
A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean, indicating whether the data points are skewed to the left or right.
A technique used in analyzing or processing sequences of data, where a window of specified size moves across the data, and for each position of the window, a computation is performed.
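One common instance is a moving average. The sketch below assumes a fixed window size and a step of one position; `sliding_mean` is a hypothetical helper name.

```python
def sliding_mean(values, size):
    """Mean of each window of `size` consecutive values, moving one step at a time."""
    if size <= 0 or size > len(values):
        raise ValueError("window size must be between 1 and len(values)")
    return [
        sum(values[i:i + size]) / size
        for i in range(len(values) - size + 1)
    ]
```

This version recomputes each window's sum from scratch for clarity; for long sequences, a running sum that adds the entering value and subtracts the leaving one avoids the repeated work.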
A fast and efficient data compression and decompression library developed by Google, designed to balance processing speed and compression ratio. It is often used to compress data stored in Hadoop environments and for other similar applications.
A set point in time copy of data that can be used as a backup for recovery purposes.
A guarantee provided by some database systems that all reads made in a transaction will see a consistent snapshot of the database, and the transaction itself will successfully commit only if no updates it has made conflict with any concurrent updates made since that snapshot.
A cloud-based data warehouse service designed for high-performance analytics.
A normalized form of Star Schema in a Data Warehouse, reducing redundancy and improving data integrity, with a central fact table connected to multiple normalized dimension tables.
A graph that depicts personal relations of internet users, representing the interconnection of relationships in an online social network.
A data removal strategy where records are marked as deleted but are not physically removed from the database, enabling potential recovery.
Software as a Service (SaaS)
A cloud computing service model that provides access to software and its functions remotely as a web-based service, allowing users to access software applications over the internet.
An algorithm that puts elements of a list in a certain order, often numerical or lexicographical.
A matrix mostly containing zero values, represented and stored efficiently in memory by only storing the non-zero elements.
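One simple way to store only the non-zero elements is a dictionary keyed by (row, column). This is a sketch of the idea, not how any specific library lays out its sparse formats; `to_sparse` and `get` are hypothetical helpers.

```python
def to_sparse(dense):
    """Keep only non-zero entries, keyed by (row, column)."""
    return {
        (r, c): value
        for r, row in enumerate(dense)
        for c, value in enumerate(row)
        if value != 0
    }

def get(sparse, r, c):
    """Missing keys are implicit zeros."""
    return sparse.get((r, c), 0)

# Illustrative 3x3 matrix with two non-zero entries.
dense = [
    [0, 0, 3],
    [0, 0, 0],
    [5, 0, 0],
]
sparse = to_sparse(dense)
```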
A database optimized to store and query data representing objects defined in a geometric space, often used for storing and analyzing geographical or spatial information.
A data structure that allows for accessing a spatial object efficiently, essential in spatial databases and geodatabases.
A data structure that allows for accessing a spatial object in a database in a more efficient manner, crucial in GIS systems, spatial databases, and spatial data processing.
An optimization technique where a computer system performs some tasks before it knows whether these tasks will be needed, to reduce latency and improve throughput.
Temporarily transfer data that exceeds available memory to disk.
Divide a dataset into training, validation, and testing sets for machine learning model training.
SQL (Structured Query Language)
A standardized programming language used for managing and querying relational databases.
A code injection technique, used to attack data-driven applications, in which malicious SQL statements are inserted into an entry field for execution.
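The sketch below contrasts a query built by string concatenation with a parameterized one, using Python's built-in `sqlite3` module and an in-memory database. The `users` table, its rows, and the malicious input are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a1"), ("bob", "b2")],
)

# Input an attacker might type into a form field.
malicious = "nobody' OR '1'='1"

# Unsafe: the input is spliced into the SQL text, so the OR clause
# becomes part of the statement and matches every row.
unsafe_sql = f"SELECT name FROM users WHERE name = '{malicious}'"
leaked = conn.execute(unsafe_sql).fetchall()

# Safer: a placeholder keeps the input as data, never as SQL.
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (malicious,)
).fetchall()
```

With the parameterized form, the whole input is compared as a literal name, so no rows match.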
A C library that provides a lightweight, disk-based database.
A data structure that stores a collection of elements, with two principal operations: Push, which adds an element to the collection, and Pop, which removes the most recently added element.
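In Python, a plain list is often used this way, with `append` acting as Push and `pop` as Pop. The element values here are arbitrary.

```python
stack = []

# Push: add elements to the top of the stack.
stack.append("a")
stack.append("b")
stack.append("c")

# Pop: remove and return the most recently added element.
top = stack.pop()
```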
Transform data to a common unit or format to facilitate comparison and analysis.
The simplest style of data warehouse schema that organizes data in a single fact table linked to one or more dimension tables, enabling easy and efficient data retrieval.
An application that saves client data from the activities of one session for use in the next session.
An application that does not save client data generated in one session for use in the next session with that client.
A communications protocol that treats each request as an independent transaction, without requiring the server to retain session information or status about each communicating partner for the duration of multiple requests.
The process of reducing a word to its stem by stripping affixes such as suffixes and prefixes. The related process of reducing a word to its dictionary form, known as a lemma, is called lemmatization.
Strategic Information Systems
Information systems that are developed in response to corporate business initiatives to give competitive advantage to organizations.
The real-time processing of data continuously, concurrently, and record by record, often used in applications that require real-time response and analytics.
Data that is generated continuously by thousands of data sources, sending data records simultaneously and in small sizes.
Data that is organized and formatted in a way that is easily searchable, often residing in relational databases and including data types such as numbers, dates, and strings.
Structured Query Language (SQL)
A standard programming language specifically for managing and querying data in relational databases.
A SQL query nested inside a larger query, used to retrieve data that will be used in the main query as a condition to further restrict the data to be retrieved.
Support Vector Machine (SVM)
A supervised machine learning algorithm, used for classification or regression analysis, that separates data into classes by finding the hyperplane that maximizes the margin between the classes.
A unique identifier for a record in a database table that serves as a substitute for natural primary keys and is typically auto-generated.
The collective behavior of decentralized, self-organized systems, typically inspired by nature, like ant colonies, bird flocking, and fish schooling, used in artificial intelligence for problem-solving and optimization.
The coordination of events to operate a system in unison, ensuring that multiple threads or processes do not interfere with each other.
The process of establishing consistency among data from a source to a target data storage and vice versa.
Syntax within a programming language that is designed to make things easier to read or to express.
The analysis of the symbols or statements in a computer program to ensure their correct arrangement, often used in compilers to check the syntax of the programming code.
Data that's artificially created, rather than being generated by actual events, often used for testing and training machine learning models when real data is scarce or sensitive.
A statistical method involving the selection of elements from an ordered sampling frame, selecting every kth (where k is a constant) item in the frame.
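With an ordered list as the sampling frame, selecting every kth item amounts to a slice with a step. The helper name `systematic_sample` and the fixed `start` offset are assumptions for illustration; in practice the start is usually drawn at random from the first k positions.

```python
def systematic_sample(frame, k, start=0):
    """Return every k-th item of an ordered frame, beginning at index `start`."""
    if k <= 0 or not 0 <= start < k:
        raise ValueError("need k > 0 and 0 <= start < k")
    return frame[start::k]
```

A random start can be chosen with, for example, `random.randrange(k)`.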
Systems Development Life Cycle (SDLC)
The process of creating or altering systems, and the models and methodologies that development teams use to develop systems.
T-distributed Stochastic Neighbor Embedding (t-SNE)
A machine learning algorithm for dimensionality reduction, particularly well suited for the visualization of high-dimensional datasets.
A type of probability distribution that is symmetrical and bell-shaped, like the normal distribution, but has heavier tails.
A data visualization tool that is used for converting raw, unstructured data into an understandable or readable format.
The practice of labeling data with tags that categorize or annotate it, often used in organizing content or in natural language processing to identify parts of speech.
A software integration vendor that provides data integration, data management, enterprise application integration, and big data software and services.
A database that is optimized to manage data relating to time instances, maintaining information about the times at which certain data is valid.
A mathematical object represented as arrays of higher dimensions, extended from matrices and used in machine learning and deep learning models, particularly in neural networks.
An open-source software library for dataflow and differentiable programming across a range of tasks, developed by the Google Brain team.
A unit of information or computer storage equal to one trillion (10^12) bytes, or 1,000 gigabytes. The binary counterpart, the tebibyte, equals 2^40 bytes, or 1,024 gibibytes.
Offers products related to data warehousing, including a powerful, scalable, and reliable data warehousing solution.
The process of deriving meaningful information from natural language text, involving the preprocessing (cleaning and transforming) of text data and the application of natural language processing (NLP) techniques.
Enable concurrent execution in Python by decoupling tasks which are not sequentially dependent.
The amount of data transferred or processed in a specified time period, often used as a measure of system or network performance.
A concept in computer science that describes the amount of time an algorithm takes to run as a function of the length of the input.
Time Series Analysis
A statistical technique that deals with time series data, or trend analysis, involving the use of various methods to analyze time series data and extract meaningful statistics and characteristics about the data.
Time Series Database (TSDB)
A database optimized for handling time series data, which are data points indexed in time order, commonly used for analyzing, storing, and querying time series data.
The process of converting input text into smaller units, or tokens, typically words or phrases, used in natural language processing to understand the structure of the text.
Convert data into tokens or smaller units to simplify analysis or processing.
A design methodology that begins with specifying the high-level structure of a system and decomposes it into its components, focusing on the system as a whole before examining its parts.
In networking, it refers to the arrangement of different elements (links, nodes, etc.) in a computer network. In data analysis, it refers to the study of geometric properties and spatial relations.
A subset of a dataset used to train machine learning models, helping the models make predictions or decisions without being explicitly programmed to perform the task.
A type of database that manages transaction-oriented applications, ensuring ACID properties (Atomicity, Consistency, Isolation, Durability) to maintain reliability in every transaction.
A research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem.
The process of converting data from one format, structure, or type to another.
The process of converting data from one format or structure into another, often involving cleaning, aggregating, enriching, and reformatting the data.
A hierarchical structure used in computer science to represent relationships between individual data points or nodes, where each node is connected to one parent node and zero or more child nodes.
Procedural code automatically executed in response to certain events on a particular table or view in a database, often used to maintain the integrity of the data.
An ordered list of elements, often used to represent a single row in a relational database table, or a single record in a dataset.
A mathematical model of computation that defines an abstract machine, which manipulates symbols on a strip of tape according to a table of rules, foundational in the theory of computation.
The process of converting a variable from one data type to another, such as changing a float to an integer or a string to a number.
A graph in which edges have no orientation, meaning the edge from vertex A to vertex B is identical to the edge from vertex B to vertex A.
An operation in SQL that allows for the return of one distinct result set from multiple queries.
A constraint applied on a field to ensure that it cannot have duplicate values.
The simplest form of data analysis, which examines a single variable without regard to any other and focuses on describing and summarizing the patterns in its values.
Information that doesn't reside in a traditional row-column database and is often text-heavy.
Unstructured Data Analysis
Analyze unstructured data, such as text or images, to extract insights and meaning.
A type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses, often for clustering or association.
A data inconsistency that occurs when not all instances of a redundant piece of data are updated, leading to inconsistent and inaccurate data in a database.
A database operation that either inserts a row into a database table if a corresponding row does not exist, or updates the row if it does exist.
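A minimal sketch using Python's built-in `sqlite3` module. It assumes a SQLite version that supports the `ON CONFLICT ... DO UPDATE` clause (3.24 or later); other databases spell this operation differently. The `inventory` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER)")

# Insert the row, or update qty if a row with the same sku already exists.
UPSERT = """
INSERT INTO inventory (sku, qty) VALUES (?, ?)
ON CONFLICT(sku) DO UPDATE SET qty = excluded.qty
"""

conn.execute(UPSERT, ("A-1", 5))  # no row yet: inserts
conn.execute(UPSERT, ("A-1", 9))  # row exists: updates

rows = conn.execute("SELECT sku, qty FROM inventory").fetchall()
```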
In data processing, refers to the tasks, operations, or stages of processing occurring or located before a particular stage in a specified direction or flow.
A method of encoding information in a Uniform Resource Identifier (URI) where certain characters are replaced by corresponding hexadecimal values, used in the submission of form data in HTTP requests.
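For illustration, the standard-library `urllib.parse` module performs this replacement. The input string is made up, and passing `safe=""` is a choice made here so that `/` is encoded as well.

```python
from urllib.parse import quote, unquote

# Hypothetical value containing reserved and non-ASCII characters.
original = "a b&c=d/é"

# Each unsafe character becomes "%" followed by the hex value of its UTF-8 bytes.
encoded = quote(original, safe="")

# Decoding reverses the substitution.
decoded = unquote(encoded)
```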
User-Defined Function (UDF)
A function provided by the user of a program or environment, allowing for the creation of functions that are not included in the original software.
The process of ensuring that a program operates on clean, correct, and useful data, checking the accuracy and quality of the input data before it is processed.
The process of selecting the most relevant features (variables, predictors) for use in model construction, reducing dimensionality and improving model performance.
Variance Inflation Factor (VIF)
A measure used to quantify how much the variance of a regression coefficient is inflated due to multicollinearity in the model.
Variational Autoencoder (VAE)
A type of autoencoder with added constraints on the encoded representations being learned, often used for generating new data that's similar to the training data.
The process of converting an algorithm from operating on a single value at a time to operating on a set of values (vector) at one time, improving performance by exploiting data-level parallelism.
Executing a single operation on multiple data points simultaneously.
The approach of managing changes and history of data in a dataset, useful for reproducing results, rolling back changes, and understanding changes in data over time.
The management of changes to documents, computer programs, large websites, and other collections of information, allowing for revisions and variations to be tracked and managed efficiently.
In graph theory, a vertex (or node) is a fundamental unit of a graph, connected to other vertices by edges; vertices represent entities in graph-based storage and analysis systems.
Adding more resources, such as CPU or memory, to an existing server, or replacing the server with a more powerful one.
A virtual table based on the result-set of an SQL statement, often used to focus, simplify, and customize the perception each user has of the database.
Virtual Private Network (VPN)
A technology that creates a safe and encrypted connection over a less secure network, such as the internet, allowing for secure remote access to network resources.
Virtualization (in analytics)
A data integration process to provide a unified, real-time, and consistent view of data across different data sources without having to move or replicate the data.
The process of creating a virtual version of something, including virtual computer hardware systems, storage devices, and network resources.
The graphical representation of information and data, using visual elements like charts, graphs, and maps.
Computer memory that requires power to maintain the stored information; all data is lost when the system’s power is turned off or interrupted.
A type of software testing that checks the system’s performance and behavior under high volumes of data, ensuring the software can handle large data quantities effectively.
The process of identifying, quantifying, and prioritizing the vulnerabilities in a system, involving the evaluation of system or software weaknesses and potential threats.
The process of developing abstract representations of a data warehouse system, typically structured in a way that helps in understanding, analyzing, and designing the data warehouse.
Web Application Firewall (WAF)
A security policy enforcement point positioned between a web application and the client endpoint, monitoring, and controlling communications to protect against attacks.
The automated process of browsing the web to collect information about websites and their pages, often used by search engines to index web content.
A software framework designed to aid the development of web applications including web services, web resources, and web APIs.
An automated method used to extract large amounts of data from websites quickly, used in data mining where you extract useful information or knowledge from data.
Standardized software systems designed to communicate over the Internet using standardized protocols, allowing different applications to talk to each other.
A graph in which a number (the weight) is assigned to each edge, representing quantities such as cost, length, or capacity, depending on the problem at hand.
The process of breaking up text into tokens based on whitespace characters such as spaces, tabs, and newline characters, commonly used in natural language processing.
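In Python, calling `str.split()` with no arguments is a quick way to do this: it splits on any run of whitespace and drops empty strings. The sample text is arbitrary.

```python
# Arbitrary sample containing spaces, a tab, and a newline.
text = "load  the\tdata\nnow"

# No separator argument: split on any run of whitespace characters.
tokens = text.split()
```

Punctuation stays attached to neighboring words under this scheme, which is one reason more elaborate tokenizers exist.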
Wide Column Store
A type of NoSQL database that uses tables, rows, and columns, but unlike a relational database, names and format of the columns can vary from row to row in the same table.
A character used to replace or represent one or more characters in string comparisons, often used in search operations to represent unknown characters in the search pattern.
In SQL, a type of function that performs a calculation across a set of table rows related to the current row, providing access to rows at a specified physical offset without using a self-join.
The sequence of industrial, administrative, or other processes through which a piece of work passes from initiation to completion, automated by software in many cases.
The process of transforming raw data into a more usable or appropriate format.
A function, method, or class that contains a piece of existing code and typically adds some additional functionality or converts inputs or outputs.
Write-Ahead Logging (WAL)
A method where changes are written to a log before they are applied, ensuring data integrity and consistency by providing a recovery mechanism in case of system failures.
XML (eXtensible Markup Language)
A markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. Widely used for the representation of arbitrary data structures such as those used in web services.
A database that stores data in a structured format, typically XML, allowing for complex and hierarchical data relationships.
The process of analyzing an XML document to read the codes and to access or modify data, used in various applications to interact with XML data.
XOR (Exclusive Or)
A logical operator that outputs true only when inputs differ (one is true, the other is false).
A query language for selecting nodes from an XML document, providing a way to navigate through elements and attributes in XML documents.
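A short sketch using Python's built-in `xml.etree.ElementTree`, which supports only a limited subset of XPath. The XML document and its element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document for illustration.
doc = ET.fromstring(
    "<catalog>"
    "<book id='b1'><title>ETL Basics</title></book>"
    "<book id='b2'><title>Streams</title></book>"
    "</catalog>"
)

# Path expression: every <title> that is a child of a <book> child of the root.
titles = [t.text for t in doc.findall("./book/title")]

# Attribute predicate: the title of the book whose id is 'b2'.
second = doc.find("./book[@id='b2']/title").text
```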
YARN (Yet Another Resource Negotiator)
A resource-management technology in Hadoop, allocating resources to various applications and managing resource consumption and task scheduling.
A unit of information or computer storage equal to one septillion bytes.
A property specifying the stack order of elements, commonly used in web development to manage overlaying of elements.
A statistical measurement that describes a value's relationship to the mean of a group of values, measured in terms of standard deviations from the mean.
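The measurement can be sketched with the standard-library `statistics` module. This version assumes the population standard deviation (`pstdev`); using the sample standard deviation instead would give slightly different values. `z_scores` is a hypothetical helper name.

```python
import statistics

def z_scores(values):
    """Distance of each value from the mean, in units of standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumed)
    if sd == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(v - mean) / sd for v in values]
```

A z-score of 0 means the value equals the mean; positive and negative scores lie above and below it.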
Zero Trust Security
A security concept centered on the belief that organizations should not automatically trust anything inside or outside their perimeters and must verify every access request before granting access.
A method of transferring data in computer systems so that it does not need to be copied from one buffer or memory location to another.
An attack that targets software vulnerabilities that are unknown to the software vendor or for which no fix is yet available.
A unit of digital information storage used to denote the size of data. It is equivalent to one sextillion (10^21) bytes or 1000 exabytes.
The process of replicating data across different zones in a multi-zone environment, usually for data redundancy and availability.
In storage area networking, zoning is the process of grouping resources in a network so that they communicate only with each other and are isolated from other resources, improving security and performance.
An open-source technology that provides a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services.