
Dagster Data Engineering Glossary: Data Derivation

Extracting, transforming, and generating new data from existing datasets.

Deriving data definition:

In data engineering, "deriving data" refers to the process of extracting, transforming, and generating new data from existing datasets.

This derived data is often used to enhance understanding, improve data quality, support decision-making, or meet specific business needs. The derived data is usually a result of applying some transformation, computation, or aggregation on the source data.

Overall, deriving data is a fundamental aspect of data engineering, enabling organizations to extract greater value from their data assets.

Key elements of data derivation:

  1. Source Data: This is the original dataset you start with. It could be ingested from databases, data lakes, external APIs, or other sources.

  2. Transformation: This involves applying operations to the source data to generate new information. Transformations can be simple arithmetic operations, aggregations, filtering, joining multiple datasets, or even applying complex algorithms.

  3. Derived Data: The materialized result of the transformation. This is the new data that was not present in its original form in the source data but has been extracted or computed based on it.

Example use cases of deriving data (a short code sketch illustrating a few of these follows the list):

  • Aggregations: Summing up sales figures by month to generate a monthly sales report.

  • Feature Engineering in Machine Learning: Creating new predictive features from existing dataset columns. For example, from a dataset containing "date of birth", you can derive an "age" column.

  • Data Enrichment: Combining data from different sources. If you have a list of users with their email addresses and another list with email addresses and purchase history, you could derive a combined dataset showing user activity.

  • Data Normalization: Transforming data to fit a standard scale or format. For instance, converting temperature readings from Fahrenheit to Celsius.

  • Data Imputation: Filling missing values based on certain rules or algorithms derived from the available data.

  • Temporal Derivations: Generating data like "Month to Date" or "Year to Date" metrics.
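
To make the source → transformation → derived data pattern concrete, here is a minimal sketch using pandas (the column names and values are made up for illustration, and pandas is just one convenient option) that derives an age column from a date of birth, converts Fahrenheit readings to Celsius, and aggregates sales by month:

import pandas as pd

# Hypothetical source data: the columns and values are illustrative only.
source = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "date_of_birth": pd.to_datetime(["1990-05-01", "1985-11-23", "2000-02-14"]),
    "temp_f": [98.6, 101.2, 99.1],
    "sale_amount": [120.00, 75.50, 210.00],
    "sale_date": pd.to_datetime(["2023-01-15", "2023-01-28", "2023-02-03"]),
})

derived = source.copy()

# Feature engineering: derive an approximate age from the date of birth.
derived["age"] = (pd.Timestamp.today() - derived["date_of_birth"]).dt.days // 365

# Normalization: convert Fahrenheit readings to Celsius.
derived["temp_c"] = (derived["temp_f"] - 32) * 5 / 9

# Aggregation: sum sales by month to produce a monthly sales report.
monthly_sales = (
    derived.groupby(derived["sale_date"].dt.to_period("M"))["sale_amount"]
    .sum()
    .rename("monthly_sales")
)

print(derived[["customer_id", "age", "temp_c"]])
print(monthly_sales)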

It's important to note the following when working with derived data:

  • Data Lineage: It's essential to track where the derived data comes from and understand the transformations applied. This helps in ensuring transparency, reproducibility, and trust in the derived data.

  • Quality Checks: Derived data should undergo quality checks to ensure its accuracy and relevance. Incorrect transformations can lead to misleading conclusions.

  • Storage and Management: Depending on the size and usage, derived data might be stored differently from the source data. Considerations about storage, retrieval efficiency, and cost are important.

An example scenario of data derivation in data engineering:

To illustrate the concept of data derivation, let's consider a scenario: Real-time Stream Processing for Anomaly Detection.

Say you work for a fintech company that processes millions of transactional events every minute through your payment gateway. You aim to derive new critical data assets in real time to identify anomalous transaction patterns, which might indicate fraudulent activity. These data assets will be used by downstream teams (data consumers) to run anti-fraud processes.

Going back to our three elements of data derivation:

1. Source Data: Real-time streaming data of transactions, ingested through tools like Apache Kafka. Each transaction record consists of attributes like user_id, transaction_id, transaction_amount, location, merchant_id, and timestamp.

2. Transformation & Derivation:

You would use several techniques for transforming the data and deriving your new key data assets:

  • Windowing: Using Apache Flink or Apache Spark Structured Streaming, the data stream is divided into time-based windows, e.g., tumbling windows of 5 minutes (a PySpark sketch of this and the join step follows this list).

  • Feature Engineering: For each window, derive new features such as:

    • Average transaction amount for the user in the last 24 hours.
    • Number of distinct locations from which transactions originated in the last 30 minutes.
    • Count of transactions for the user in the last 5 minutes.
  • Joining with Reference Data: Join the real-time data stream with static datasets (residing in a distributed file system like HDFS or a database) that might include:

    • User's historical transaction behavior.
    • Merchant risk scores.
  • Anomaly Scoring: Use a pre-trained machine learning model (like an Isolation Forest or One-Class SVM) to score each transaction for its likelihood of being anomalous. This model would have been previously trained on historical data and continuously retrained at regular intervals.
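
To make the windowing and reference-data join steps more concrete, here is a minimal PySpark Structured Streaming sketch. The Kafka broker address, topic name, message schema, and reference-data path are assumptions for illustration, and the anomaly-scoring step is omitted for brevity:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("TransactionFeatureSketch").getOrCreate()

# Read the transaction stream from Kafka. The broker and topic are placeholders,
# and this requires the spark-sql-kafka connector package on the classpath.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")
    .load()
)

# Assume each Kafka message value is a JSON record with the attributes listed above.
schema = ("user_id STRING, transaction_id STRING, transaction_amount DOUBLE, "
          "location STRING, merchant_id STRING, timestamp TIMESTAMP")
transactions = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
       .select("t.*")
)

# Windowing: per-user counts, totals, and distinct locations over tumbling
# 5-minute windows, tolerating events that arrive up to 10 minutes late.
windowed_features = (
    transactions
    .withWatermark("timestamp", "10 minutes")
    .groupBy(F.window("timestamp", "5 minutes"), "user_id")
    .agg(
        F.count("transaction_id").alias("txn_count_5m"),
        F.sum("transaction_amount").alias("txn_amount_5m"),
        F.approx_count_distinct("location").alias("distinct_locations_5m"),
    )
)

# Joining with reference data: a static table of historical user behavior
# (the path is hypothetical).
user_history = spark.read.parquet("/data/reference/user_history")
features = windowed_features.join(user_history, on="user_id", how="left")

# Write the derived features downstream; a console sink is shown for simplicity.
query = (
    features.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()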

3. Derived Data:

  • For each transaction in the current time window, a derived record is produced that includes the original attributes, the newly derived features, and an anomaly score.

  • Aggregated metrics are created, such as the total number of high-risk scores in the current window and the average risk score per merchant.

4. Sink & Action:

  • You store derived insights in a real-time analytics database like Druid or ClickHouse for dashboarding and alerting.

  • You use the anomaly score to trigger alerts or even block transactions in real time, using thresholds or business rules (a minimal sketch follows).
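
As an illustration of the last step, here is a small, self-contained sketch of a rule layer on top of the anomaly score. The field names and thresholds are hypothetical; in practice they would be tuned against historical fraud labels and wired into the streaming pipeline or an alerting service:

from dataclasses import dataclass

@dataclass
class ScoredTransaction:
    transaction_id: str
    user_id: str
    transaction_amount: float
    anomaly_score: float  # produced by the pre-trained anomaly model

def action_for(txn: ScoredTransaction,
               alert_threshold: float = 0.7,
               block_threshold: float = 0.95) -> str:
    """Map an anomaly score to an action using simple business rules."""
    if txn.anomaly_score >= block_threshold:
        return "block"
    if txn.anomaly_score >= alert_threshold:
        return "alert"
    return "allow"

print(action_for(ScoredTransaction("t-1001", "u-42", 1250.00, 0.98)))  # block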

Considerations:

For this scenario to work, you would want to invest time in several aspects of the project:

  • State Management: You need to maintain the state across windows for continuity, especially for continuous queries that rely on historical data.

  • Scalability: As data rates fluctuate, the processing system needs to scale. Running Dagster on Kubernetes can help orchestrate the scaling of such streaming applications.

  • Model Management: Machine Learning models need frequent re-training. Ensuring the ML model used for anomaly scoring remains up-to-date and relevant is crucial. This involves model versioning, continuous training, and strategies for model deployment without interrupting the streaming process, all of which can be orchestrated via Dagster data pipelines (a minimal sketch follows this list).
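
For the model management piece, a periodic retraining job might be wired up in Dagster along the following lines. The asset bodies are placeholders and the nightly schedule is an assumption, not a prescription:

from dagster import AssetSelection, Definitions, ScheduleDefinition, asset, define_asset_job

@asset
def labeled_transactions():
    """Placeholder: pull the latest labeled historical transactions."""
    return []

@asset
def anomaly_model(labeled_transactions):
    """Placeholder: retrain the anomaly-scoring model and return its artifact."""
    return {"model_version": "example", "trained_on": len(labeled_transactions)}

# A job that rematerializes the model and everything it depends on.
retrain_job = define_asset_job(
    "retrain_anomaly_model",
    selection=AssetSelection.assets(anomaly_model).upstream(),
)

defs = Definitions(
    assets=[labeled_transactions, anomaly_model],
    jobs=[retrain_job],
    # Retrain nightly at 02:00; adjust the cron expression as needed.
    schedules=[ScheduleDefinition(job=retrain_job, cron_schedule="0 2 * * *")],
)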

This example embodies advanced data engineering concepts, blending real-time processing, machine learning, and scalable architectures to derive valuable insights from raw transactional data.

An example of deriving data using Python and Spark

Let's consider a scenario involving Apache Spark, one of the most widely-used distributed data processing frameworks. In this example, we'll derive features from user transaction data for further analytics or machine learning.

Again, let's look at processing a large dataset of user transactions to derive user behavior metrics over a time window. We will not replicate the entire scenario listed above, but just demonstrate the derivation concept.

First, let's create two dummy source data files. Naturally, in a real-world scenario, data sources would likely be more diverse (e.g., databases, data lakes, streaming sources), and configurations would be more intricate.

Create the following two files locally (in the same folder as your demo code):

  • transactions.csv contains columns: user_id, transaction_id, transaction_amount, timestamp

user_id,transaction_id,transaction_amount,timestamp
1,1001,50.25,2023-09-13 10:00:00
1,1002,75.00,2023-09-12 14:15:10
2,1003,100.00,2023-09-15 12:30:45
2,1004,25.50,2023-09-14 09:05:00
3,1005,110.75,2023-09-11 16:20:20
1,1006,45.00,2023-09-17 15:30:00
3,1007,95.25,2023-09-18 11:00:00

  • user_profiles.csv contains columns: user_id, signup_date, user_type

user_id,signup_date,user_type
1,2023-01-01,Dagster Team
2,2023-06-15,Dagster Enterprise
3,2022-11-20,Dagster Team

Using these as our source files, we will derive the following data:

  • Total transaction amount for each user in the last 7 days.
  • Total transaction count for each user in the last 7 days.
  • Days since each user signed up, as of today.

Note: You need PySpark (and a working Spark environment) installed in your Python environment to run this code.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from datetime import datetime, timedelta

# Initialize a local SparkSession ("local[*]" runs Spark on all local cores)
spark = SparkSession.builder \
    .appName("DeriveDataExample") \
    .master("local[*]") \
    .getOrCreate()

# Reduce log noise
spark.sparkContext.setLogLevel("WARN")

# Load the source data
transactions = spark.read.csv('transactions.csv', header=True, inferSchema=True)
user_profiles = spark.read.csv('user_profiles.csv', header=True, inferSchema=True)

# Calculate the timestamp for 7 days ago
seven_days_ago = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d %H:%M:%S')

# Filter transactions to the last 7 days
recent_transactions = transactions.filter(transactions.timestamp > seven_days_ago)

# Derive total transaction amounts and counts for each user in the last 7 days
derived_metrics = recent_transactions.groupBy('user_id').agg(
    F.sum('transaction_amount').alias('total_amount_7d'),
    F.count('transaction_id').alias('transaction_count_7d')
)

# Derive days since user signup
user_derived = user_profiles.withColumn('days_since_signup',
                                        F.datediff(F.current_date(), F.col('signup_date')))

# Join the derived metrics with the user profiles
final_derived_data = user_derived.join(derived_metrics, on='user_id', how='left').fillna(0)

# Display or save the derived data
final_derived_data.show()

# Optionally, save to a destination
# final_derived_data.write.csv('output_derived_data.csv', header=True)

# Stop the SparkSession when done
spark.stop()

The output of the example will be the derived data as follows:

+-------+-----------+------------------+-----------------+---------------+--------------------+
|user_id|signup_date|         user_type|days_since_signup|total_amount_7d|transaction_count_7d|
+-------+-----------+------------------+-----------------+---------------+--------------------+
|      1| 2023-01-01|      Dagster Team|              262|           45.0|                   1|
|      2| 2023-06-15|Dagster Enterprise|               97|          125.5|                   2|
|      3| 2022-11-20|      Dagster Team|              304|          95.25|                   1|
+-------+-----------+------------------+-----------------+---------------+--------------------+

Using Apache Spark and Python (PySpark), this example of data derivation demonstrates:

  • Distributed data processing with Apache Spark.
  • Aggregation methods on large datasets.
  • Data transformations and enrichments (joining, date calculations).
  • Scalability: This code can handle very large datasets across clusters.

Other data engineering terms related to Data Transformation:

Align
Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.

Clean or Cleanse
Remove invalid or inconsistent data values, such as empty fields or outliers.

Cluster
Group data points based on similarities or patterns to facilitate analysis and modeling.

Curate
Select, organize, and annotate data to make it more useful for analysis and modeling.

Denoise
Remove noise or artifacts from data to improve its accuracy and quality.

Denormalize
Optimize data for faster read access by reducing the number of joins needed to retrieve related data.

Discretize
Transform continuous data into discrete categories or bins to simplify analysis.

ETL
Extract, transform, and load data between different systems.

Encode
Convert categorical variables into numerical representations for ML algorithms.

Filter
Extract a subset of data based on specific criteria or conditions.

Fragment
Break data down into smaller chunks for storage and management purposes.

Homogenize
Make data uniform, consistent, and comparable.

Impute
Fill in missing data values with estimated or imputed values to facilitate analysis.

Linearize
Transforming the relationship between variables to make datasets approximately linear.

Munge
See 'wrangle'.

Normalize
Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.

Reduce
Convert a large set of data into a smaller, more manageable form without significant loss of information.

Reshape
Change the structure of data to better fit specific analysis or modeling requirements.

Serialize
Convert data into a linear format for efficient storage and processing.

Shred
Break down large datasets into smaller, more manageable pieces for easier processing and analysis.

Skew
An imbalance in the distribution or representation of data.

Split
Divide a dataset into training, validation, and testing sets for machine learning model training.

Standardize
Transform data to a common unit or format to facilitate comparison and analysis.

Tokenize
Convert data into tokens or smaller units to simplify analysis or processing.

Transform
Convert data from one format or structure to another.

Wrangle
Convert unstructured data into a structured format.