Dagster Data Engineering Glossary:


Data Serialization

Convert data into a linear format for efficient storage and processing.

Data serialization definition:

Data serialization is the process of converting complex data structures, such as objects or dictionaries, into a format that can be stored or transmitted, such as a byte stream or JSON string. This is useful in modern data pipelines for tasks such as saving data to disk, transmitting data across a network, or storing data in a database.
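For example, a Python dictionary can be serialized into a JSON string, and that string can later be parsed back into an equivalent dictionary. Here is a minimal sketch using Python's built-in json module:

import json

# A nested Python data structure
record = {'name': 'Dagster', 'tags': ['etl', 'orchestration']}

# Serialize: flatten the structure into a JSON string
payload = json.dumps(record)
print(payload)  # {"name": "Dagster", "tags": ["etl", "orchestration"]}

# Deserialize: reconstruct an equivalent Python object
restored = json.loads(payload)
assert restored == record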

What are the benefits of serializing data?

The main goals of data serialization are:

  1. Interoperability: To enable data exchange between different systems or components. Serialized data can be transmitted between systems built with different programming languages or architectures.

  2. Persistence: To allow the state of an object or data structure to be saved to storage media (like a hard drive) and then later retrieved and reconstructed.

  3. Transmission: To facilitate sending data across a network. Serialized data can be sent over protocols like HTTP, or through sockets, and then reconstructed at the receiving end (see the sketch after this list).

  4. Consistency: Serialized data often adheres to specific schemas or formats, ensuring that the data exchanged or stored maintains a predictable structure.
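As promised, here is a minimal sketch of goal 3: transmitting serialized data across a network using Python's built-in urllib module. The endpoint URL is a hypothetical placeholder, not a real service.

import json
import urllib.request

# Serialize a Python dictionary into a JSON byte payload
payload = json.dumps({"event": "run_started", "run_id": 42}).encode("utf-8")

# Build a POST request carrying the serialized bytes
req = urllib.request.Request(
    "https://example.com/ingest",  # hypothetical endpoint
    data=payload,
    headers={"Content-Type": "application/json"},
)

# urllib.request.urlopen(req) would transmit the bytes; the receiving
# end would reconstruct the original structure with json.loads()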

What about compression?

While some serialization formats or methods might result in a smaller representation of the data (and thus have a secondary effect of compression), true compression is a separate process. In many scenarios, especially where bandwidth or storage is at a premium, serialized data might be further compressed using specific compression algorithms or techniques. But remember, each step (serializing/deserializing, compressing/decompressing) consumes time and resources.
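To make the two steps concrete, here is a minimal sketch that serializes a structure with pickle and then compresses the resulting bytes with Python's built-in gzip module as a separate, second step:

import gzip
import pickle

# A repetitive structure that compresses well
data = {'values': list(range(1000))}

# Step 1: serialize the structure into bytes
serialized = pickle.dumps(data)

# Step 2: compress the serialized bytes
compressed = gzip.compress(serialized)

print(len(serialized), 'bytes serialized')
print(len(compressed), 'bytes after compression')

# Reverse the steps: decompress, then deserialize
restored = pickle.loads(gzip.decompress(compressed))
assert restored == data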

For some applications, especially in advanced data engineering or analytics use cases, certain serialization formats (like Avro, Parquet, or Protocol Buffers) might be preferred because they offer both efficient serialization/deserialization and a compact data representation; scroll down for an example of using Protocol Buffers in Python. Still, compression is a beneficial side effect, not the primary goal of serialization.

A simple data serialization example using Python:

Let's look at a simple serialization/deserialization example. You will find a more advanced example further down in the glossary entry.

Python supports several common serialization formats, including pickle and JSON through built-in modules, and YAML through third-party libraries. Here's an example of using the pickle module to serialize a Python object:

import pickle

# Define an object to serialize
data = {
    'name': 'Dagster',
    'age': 4,
    'email': 'dagster@dagsterlabs.com'
}

# Serialize the object to a byte stream
serialized_data = pickle.dumps(data)

# Write the byte stream to a file
with open('data.pickle', 'wb') as f:
    f.write(serialized_data)

This code defines a dictionary, data, and then serializes it using the pickle module's dumps() method. The resulting byte stream is then written to a file named data.pickle. If you open the file, you will see the data written out as:

��=}�(�name��Dagster��age�K�email��dagster@dagsterlabs.com�u.

To deserialize the data later, you can use the loads() method:

import pickle

# Read the byte stream from the file
with open('data.pickle', 'rb') as f:
    serialized_data = f.read()

# Deserialize the byte stream into a Python object
data = pickle.loads(serialized_data)

# Print the deserialized data
print(data)

This code reads the serialized data from the file, deserializes it using pickle's loads() method, and then prints the resulting Python object.
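For comparison, here is the same round trip using the built-in json module. It produces a human-readable text file instead of a byte stream, at the cost of supporting fewer Python types than pickle:

import json

# The same dictionary as above
data = {
    'name': 'Dagster',
    'age': 4,
    'email': 'dagster@dagsterlabs.com'
}

# Serialize the object to a text file
with open('data.json', 'w') as f:
    json.dump(data, f)

# Deserialize the file contents back into a Python object
with open('data.json') as f:
    print(json.load(f))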

A data engineering example of data serialization

For more advanced data engineering use cases, especially when dealing with distributed systems and big data platforms, data serialization often involves formats like Apache Avro, Parquet, or Protocol Buffers (Protobuf). These formats provide more efficient serialization and deserialization, often winning on both size and speed compared to plain JSON or XML.

Let's look at an example using Protocol Buffers (or "Protobuf"), a method developed by Google for serializing structured data.

First, let's make sure we have the right packages involved.

  1. Install Protobuf: On a Mac the command you enter in the Terminal is brew install protobuf. On Linux it's apt install -y protobuf-compiler. On Windows you will find instructions here. (More general installation instructions here.)

  2. Install the Protobuf Python runtime: The Python code that protoc generates depends on the protobuf package, so install it with pip install protobuf.

  3. Define the Data Structure (.proto file): We'll define the structure of the data you want to serialize in a .proto file. Let's consider a simple message format for a user. Create a file in a sandbox directory on your computer called user.proto as follows:

syntax = "proto3";

message User {
    int32 id = 1;
    string name = 2;
    string email = 3;
}
  4. Have Protobuf create the Python serializer: You need the protoc compiler to generate Python code from the user.proto file:

protoc --python_out=. user.proto

This command instructs protoc to generate Python code (--python_out) in the current directory (.) based on the message definitions found in the file user.proto.

After running this command you should have a new Python file called user_pb2.py in your directory.

  5. Run the example: Create a new file called example.py with the following code.

import sys
import user_pb2

# Create a new User instance
user = user_pb2.User()
user.id = 123
user.name = "Alice"
user.email = "alice@example.com"

# Serialize the user data
serialized_data = user.SerializeToString()

print("Serialized data:", serialized_data)
print("Serialized data is type: ", type(serialized_data), " of size ", sys.getsizeof(serialized_data), "bytes")

# Deserialize the data
deserialized_user = user_pb2.User()
deserialized_user.ParseFromString(serialized_data)

print("Deserialized data:")
print(type(deserialized_user), " of size ", sys.getsizeof(deserialized_user), "bytes")
print("ID:", deserialized_user.id)
print("Name:", deserialized_user.name)
print("Email:", deserialized_user.email)

When you execute this example with python3 example.py, your output will look like this:

Serialized data: b'\x08{\x12\x05Alice\x1a\x11alice@example.com'
Serialized data is type:  <class 'bytes'>  of size  61 bytes
Deserialized data:
<class 'user_pb2.User'>  of size  80 bytes
ID: 123
Name: Alice
Email: alice@example.com

Congratulations! You have serialized, then deserialized, data using a more advanced data engineering technique.
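To see why formats like Protobuf are considered compact, here is a rough, illustrative comparison: serializing the same three fields as JSON produces a noticeably larger payload than the Protobuf message above (exact sizes will vary with the data):

import json

# The same record, serialized as JSON for comparison
json_payload = json.dumps(
    {'id': 123, 'name': 'Alice', 'email': 'alice@example.com'}
).encode('utf-8')

# The Protobuf bytes from the output above
protobuf_payload = b'\x08{\x12\x05Alice\x1a\x11alice@example.com'

print(len(json_payload), 'bytes as JSON')          # 58 bytes
print(len(protobuf_payload), 'bytes as Protobuf')  # 28 bytes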


Other data engineering terms related to Data Transformation:

Align
Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.

Clean or Cleanse
Remove invalid or inconsistent data values, such as empty fields or outliers.

Cluster
Group data points based on similarities or patterns to facilitate analysis and modeling.

Curate
Select, organize, and annotate data to make it more useful for analysis and modeling.

Denoise
Remove noise or artifacts from data to improve its accuracy and quality.

Denormalize
Optimize data for faster read access by reducing the number of joins needed to retrieve related data.

Derive
Extracting, transforming, and generating new data from existing datasets.

Discretize
Transform continuous data into discrete categories or bins to simplify analysis.

ETL
Extract, transform, and load data between different systems.

Encode
Convert categorical variables into numerical representations for ML algorithms.

Filter
Extract a subset of data based on specific criteria or conditions.

Fragment
Break data down into smaller chunks for storage and management purposes.

Homogenize
Make data uniform, consistent, and comparable.

Impute
Fill in missing data values with estimated or imputed values to facilitate analysis.

Linearize
Transforming the relationship between variables to make datasets approximately linear.

Munge
See 'wrangle'.

Normalize
Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.

Reduce
Convert a large set of data into a smaller, more manageable form without significant loss of information.

Reshape
Change the structure of data to better fit specific analysis or modeling requirements.

Shred
Break down large datasets into smaller, more manageable pieces for easier processing and analysis.

Skew
An imbalance in the distribution or representation of data.

Split
Divide a dataset into training, validation, and testing sets for machine learning model training.

Standardize
Transform data to a common unit or format to facilitate comparison and analysis.

Tokenize
Convert data into tokens or smaller units to simplify analysis or processing.

Transform
Convert data from one format or structure to another.

Wrangle
Convert unstructured data into a structured format.