Dagster Data Engineering Glossary:
Schema Inference
Schema inference definition:
Schema inference is a process where the structure of a dataset (such as the data types and column names) is automatically identified, often when working with semi-structured or unstructured data.
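To make the definition concrete, here is a minimal sketch of how an inference pass might work, using only the Python standard library. It is an illustration under simplifying assumptions, not how Pandas or PyArrow implement inference. The function names (infer_value_type, infer_column_type, infer_schema) and the small set of type labels are invented for this example.

```python
import csv
import io
from datetime import date


def infer_value_type(text):
    """Return a label for the narrowest type that can parse one raw string."""
    if text is None or text == "":
        return "null"
    if text in ("True", "False"):
        return "bool"
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        pass
    try:
        date.fromisoformat(text)
        return "date"
    except ValueError:
        pass
    return "string"


def infer_column_type(values):
    """Combine per-value labels into a single label for the whole column."""
    kinds = {infer_value_type(v) for v in values} - {"null"}
    if not kinds:
        return "null"
    if len(kinds) == 1:
        return kinds.pop()
    if kinds == {"int", "float"}:
        return "float"  # widen mixed integers and decimals to float
    return "string"  # fall back to text when the values disagree


def infer_schema(csv_text):
    """Map each column name in CSV text to its inferred type label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {
        name: infer_column_type(row[name] for row in rows)
        for name in (reader.fieldnames or [])
    }


SAMPLE = """id,name,age,date_of_birth,role,is_related
1,Thom Yorke,55,1968-10-07,vocals,False
2,Jonny Greenwood,52,1971-10-05,guitar,True
"""

print(infer_schema(SAMPLE))
```

For these two sample rows, the sketch reports id and age as int, date_of_birth as date, is_related as bool, and the remaining columns as string. The design choice worth noting is the fallback order: each value is tested against the strictest candidate first, and a column is widened to a more general type only when its values disagree.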
An example of schema inference in Python using Pandas and PyArrow
In Python, schema inference can be demonstrated using libraries such as Pandas for data manipulation and PyArrow for more advanced schema inference, which is especially useful with large datasets or in distributed computing.
To run this code, you need Pandas and PyArrow installed in your Python environment.
Say we have a simple source file as follows:
id,name,age,date_of_birth,role,is_related
1,Thom Yorke,55,1968-10-07,vocals,False
2,Jonny Greenwood,52,1971-10-05,guitar,True
3,Ed OBrien,55,1968-04-15,guitar,False
4,Philip Selway,56,1967-05-23,drums,False
5,Colin Greenwood,54,1969-06-26,bass,True
- Read .csv File: Use Pandas to read a .csv file. Pandas infers the data types for each column.
- Advanced Schema Inference with PyArrow: After reading the data with Pandas, we'll convert the Pandas DataFrame into a PyArrow table for more advanced schema inference, which is particularly useful in big data scenarios.
Assuming we have the .csv file saved locally as our_data.csv, we can run the following simple script:
import pandas as pd
import pyarrow as pa


def infer_schema_with_pandas(file_path):
    """
    Infer schema using Pandas
    """
    df = pd.read_csv(file_path)
    return df.dtypes


def convert_to_pyarrow_table(df):
    """
    Convert Pandas DataFrame to PyArrow Table
    """
    return pa.Table.from_pandas(df)


def infer_schema_with_pyarrow(table):
    """
    Infer schema using PyArrow
    """
    return table.schema


def print_full_schema(schema):
    """
    Print out the schema fields
    """
    for field in schema:
        print(f"Field name: {field.name}, Type: {field.type}, Nullable: {field.nullable}, Metadata: {field.metadata}")


# Path to your CSV file
file_path = 'our_data.csv'

# Infer schema with Pandas
pandas_schema = infer_schema_with_pandas(file_path)
print("Schema inferred by Pandas:")
print(pandas_schema)

# Advanced Schema Inference with PyArrow
df = pd.read_csv(file_path)
arrow_table = convert_to_pyarrow_table(df)
arrow_schema = infer_schema_with_pyarrow(arrow_table)
print("\nSchema inferred by PyArrow:")
print_full_schema(arrow_schema)
In this very basic example:
- infer_schema_with_pandas reads the .csv file using Pandas and returns the inferred data types.
- convert_to_pyarrow_table converts the Pandas DataFrame to a PyArrow Table.
- infer_schema_with_pyarrow returns the PyArrow table's schema, and print_full_schema prints each field's name, type, nullability, and metadata. PyArrow schemas can express more nuanced data types and are well suited to big data applications.
The output will be as follows:
Schema inferred by Pandas:
id                int64
name             object
age               int64
date_of_birth    object
role             object
is_related         bool
dtype: object
Schema inferred by PyArrow:
Field name: id, Type: int64, Nullable: True, Metadata: None
Field name: name, Type: string, Nullable: True, Metadata: None
Field name: age, Type: int64, Nullable: True, Metadata: None
Field name: date_of_birth, Type: string, Nullable: True, Metadata: None
Field name: role, Type: string, Nullable: True, Metadata: None
Field name: is_related, Type: bool, Nullable: True, Metadata: None