Dagster Data Engineering Glossary:
ETL: Extract, Transform, Load
ETL definition:
ELT stands for Extract, Load, Transform, and is a process used in modern data pipelines for integrating and transforming data from various sources into a centralized data store. The ELT process is similar to the more traditional ETL (Extract, Transform, Load) process, but with a key difference: data is extracted from source systems and loaded directly into a data store, where it can then be transformed.
Dagster provides many integrations with common ETL/ELT tools.
Data ETL example in Python:
Here's an example of an ELT process in Python using the Pandas library and SQLite3. Please note that you need to have the necessary Python libraries installed in your Python environment to run the code:
Given two input files of customers.csv and orders.csv as follows:
customer_id,name,total_orders
1,Mary,42
2,Brian,28
3,Mercedes,7
4,Rose, 20
and
customer_id,item,value
1,cat food,12
1,apples,6
1,books,24
2,bananas,12
2,candles,12
2,coffee,4
3,nails,7
4,paper,18
4,pencils,2
Extract: We start by extracting data from our source systems. For this example, let's assume we have a CSV file containing customer orders.
import pandas as pd
# Extract data from CSV file
orders_df = pd.read_csv('orders.csv')
customers_df = pd.read_csv('customers.csv')
Load: Next, we load the data into our data store. In this case, we'll use a SQLite database.
import sqlite3
# Connect to SQLite database
conn = sqlite3.connect('my_database.db')
# Load data into database
orders_df.to_sql('orders', conn, if_exists='replace', index=False)
customers_df.to_sql('customers', conn, if_exists='replace', index=False)
Transform: Finally, we transform the data as needed. For example, we might want to join the orders data with customer data to get more insights.
# Extract customer data from database
customers_df = pd.read_sql_query('SELECT * FROM customers', conn)
# Join orders and customers data
enriched_df = pd.merge(orders_df, customers_df, on='customer_id')
print(enriched_df)
In this example, we used Pandas to extract data from a CSV file, load it into a SQLite database, and then join it with customer data to enrich the dataset. This ELT process can be repeated for other sources of data, allowing us to integrate and transform multiple datasets into a centralized data store.
Our example would yield the following output:
customer_id item value name total_orders
0 1 cat food 12 Mary 42
1 1 apples 6 Mary 42
2 1 books 24 Mary 42
3 2 bananas 12 Brian 28
4 2 candles 12 Brian 28
5 2 coffee 4 Brian 28
6 3 nails 7 Mercedes 7
7 4 paper 18 Rose 20
8 4 pencils 2 Rose 20