
Data Serialization

Convert data into a linear format for efficient storage and processing.

Data serialization definition:

Data serialization is the process of converting complex data structures, such as objects or dictionaries, into a format that can be stored or transmitted, such as a byte stream or a JSON string. In modern data pipelines this is useful for saving data to disk, sending it across a network, or storing it in a database.

What are the benefits of serializing data?

The main goals of data serialization are:

  1. Interoperability: To enable data exchange between different systems or components. Serialized data can be transmitted between systems built with different programming languages or architectures (see the short JSON sketch after this list).

  2. Persistence: To allow the state of an object or data structure to be saved to storage media (like a hard drive) and later retrieved and reconstructed.

  3. Transmission: To facilitate sending data across a network. Serialized data can be sent over protocols like HTTP, or through sockets, and reconstructed at the receiving end.

  4. Consistency: To ensure that exchanged or stored data maintains a predictable structure, since serialized data often adheres to a specific schema or format.
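
To make these goals concrete, here is a minimal sketch using Python's built-in json module (the field names are illustrative): the resulting JSON string can be parsed by virtually any language, written to disk, or sent over HTTP.

import json

# Serialize a Python dictionary to a JSON string that any
# JSON-aware system (JavaScript, Java, Go, ...) can consume
record = {'name': 'Dagster', 'version': 1}
json_string = json.dumps(record)
print(json_string)  # {"name": "Dagster", "version": 1}

# Reconstruct the object on the receiving end
restored = json.loads(json_string)
assert restored == record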

What about compression?

While some serialization formats or methods might result in a smaller representation of the data (and thus have a secondary effect of compression), true compression is a separate process. In many scenarios, especially where bandwidth or storage is at a premium, serialized data might be further compressed using specific compression algorithms or techniques. But remember, each step (serializing/deserializing, compressing/decompressing) consumes time and resources.
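
As an illustration, here is one way to layer standard-library compression on top of serialization in Python. This is a minimal sketch; the exact byte counts will vary.

import gzip
import json

# Step 1: serialize the data structure into bytes
payload = json.dumps({'values': list(range(1000))}).encode('utf-8')

# Step 2: compress the serialized bytes as a separate step
compressed = gzip.compress(payload)

print(len(payload), 'bytes serialized')
print(len(compressed), 'bytes after gzip compression')

# Reverse the steps: decompress, then deserialize
restored = json.loads(gzip.decompress(compressed).decode('utf-8'))
assert restored['values'][999] == 999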

For some applications, especially in advanced data engineering or analytics use cases, certain serialization formats (like Avro, Parquet, or Protocol Buffers) might be preferred because they can offer both efficient serialization/deserialization and a compact data representation - scroll down for an example of using Protocol Buffers in Python. Still, compression is a beneficial side effect, not the primary goal of serialization.
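
For instance, assuming pandas and a Parquet engine such as pyarrow are installed, a DataFrame can be serialized to Parquet in a couple of lines. This is a sketch, not a benchmark; the file name is illustrative.

import pandas as pd

# Writing a DataFrame to Parquet serializes it into a compact,
# columnar (and typically compressed) on-disk representation
df = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
df.to_parquet('data.parquet')  # requires a Parquet engine like pyarrow

# Reading it back deserializes the file into an equivalent DataFrame
df2 = pd.read_parquet('data.parquet')
print(df2)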

A simple data serialization example using Python

Let's look at a simple serialization/deserialization example. You will find a more advanced example further down in the glossary entry.

Python supports several serialization formats: the pickle and json modules ship with the standard library, and formats like YAML are available through third-party packages. Here's an example of using the pickle module to serialize a Python object:

import pickle

# Define an object to serialize
data = {
    'name': 'Dagster',
    'age': 4,
    'email': 'dagster@dagsterlabs.com'
}

# Serialize the object to a byte stream
serialized_data = pickle.dumps(data)

# Write the byte stream to a file
with open('data.pickle', 'wb') as f:
    f.write(serialized_data)

This code defines a dictionary named data and serializes it using the pickle module's dumps() method. The resulting byte stream is then written to a file named data.pickle. If you open the file, you will see the (binary) data written out as:

��=}�(�name��Dagster��age�K�email��dagster@dagsterlabs.com�u.

To deserialize the data later, you can use the loads() method:

import pickle

# Read the byte stream from the file
with open('data.pickle', 'rb') as f:
    serialized_data = f.read()

# Deserialize the byte stream into a Python object
data = pickle.loads(serialized_data)

# Print the deserialized data
print(data)

This code reads the serialized data from the file, deserializes it using pickle's loads() method, and then prints the resulting Python object.
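
Note that pickle can also read and write open file objects directly via dump() and load(), skipping the intermediate byte string. A minimal sketch:

import pickle

data = {'name': 'Dagster', 'age': 4}

# dump() serializes straight into an open binary file
with open('data.pickle', 'wb') as f:
    pickle.dump(data, f)

# load() deserializes straight from the file
with open('data.pickle', 'rb') as f:
    restored = pickle.load(f)

assert restored == data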

A data engineering example of data serialization

For more advanced data engineering use cases, especially when dealing with distributed systems and big data platforms, data serialization often involves formats like Apache Avro, Parquet, or Protocol Buffers (Protobuf). These formats provide efficient serialization and deserialization, often outperforming simple JSON or XML in both size and speed.

Let's look at an example using Protocol Buffers (or "Protobuf"), a method developed by Google for serializing structured data.

First, let's make sure we have the right packages involved.

  1. Install Protobuf: On a Mac, the command you enter in the Terminal is brew install protobuf. On Linux it's apt install -y protobuf-compiler. On Windows you will find instructions here. (More general install instructions here.)

  2. Install the Python runtime: The Python code that protoc generates depends on the protobuf package, so install it with pip install protobuf.

  3. Define the Data Structure (.proto file): We'll define the structure of the data you want to serialize in a .proto file. Let's consider a simple message format for a user. Create a file in a sandbox directory on your computer called user.proto as follows:

syntax = "proto3";

message User {
    int32 id = 1;
    string name = 2;
    string email = 3;
}
  4. Have Protobuf create the Python serializer. You need the protoc compiler to generate Python code from the user.proto file:
protoc --python_out=. user.proto

This command instructs protoc to generate Python code (--python_out) in the current directory (.) from the message definitions in user.proto.

After running this command you should have a new Python file called user_pb2.py in your directory.

  5. Run the example: Create a new file called example.py with the following code.
import sys
import user_pb2

# Create a new User instance
user = user_pb2.User()
user.id = 123
user.name = "Alice"
user.email = "alice@example.com"

# Serialize the user data
serialized_data = user.SerializeToString()

print("Serialized data:", serialized_data)
print("Serialized data is type: ", type(serialized_data), " of size ", sys.getsizeof(serialized_data), "bytes")

# Deserialize the data
deserialized_user = user_pb2.User()
deserialized_user.ParseFromString(serialized_data)

print("Deserialized data:")
print(type(deserialized_user), " of size ", sys.getsizeof(deserialized_user), "bytes")
print("ID:", deserialized_user.id)
print("Name:", deserialized_user.name)
print("Email:", deserialized_user.email)

When you execute this example with python3 example.py, your output will look like this:

Serialized data: b'\x08{\x12\x05Alice\x1a\x11alice@example.com'
Serialized data is type:  <class 'bytes'>  of size  61 bytes
Deserialized data:
<class 'user_pb2.User'>  of size  80 bytes
ID: 123
Name: Alice
Email: alice@example.com

Congratulations - you serialized, then deserialized data using a more advanced data engineering technique.
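
As with the pickle example above, the serialized Protobuf bytes can be persisted to disk or sent over a network and reconstructed later. Here is a minimal sketch reusing the generated user_pb2 module (the user.bin filename is just an example):

import user_pb2

# Build and serialize a message
user = user_pb2.User(id=123, name="Alice", email="alice@example.com")
with open('user.bin', 'wb') as f:
    f.write(user.SerializeToString())

# Later (or on another machine): read the bytes back and reconstruct
restored = user_pb2.User()
with open('user.bin', 'rb') as f:
    restored.ParseFromString(f.read())

assert restored.name == "Alice"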


Other data engineering terms related to Data Transformation:

Align: Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.
Clean or Cleanse: Remove invalid or inconsistent data values, such as empty fields or outliers.
Cluster: Group data points based on similarities or patterns to facilitate analysis and modeling.
Curate: Select, organize, and annotate data to make it more useful for analysis and modeling.
Denoise: Remove noise or artifacts from data to improve its accuracy and quality.
Denormalize: Optimize data for faster read access by reducing the number of joins needed to retrieve related data.
Derive: Extracting, transforming, and generating new data from existing datasets.
Discretize: Transform continuous data into discrete categories or bins to simplify analysis.
ETL: Extract, transform, and load data between different systems.
Encode: Convert categorical variables into numerical representations for ML algorithms.
Filter: Extract a subset of data based on specific criteria or conditions.
Fragment: Break data down into smaller chunks for storage and management purposes.
Homogenize: Make data uniform, consistent, and comparable.
Impute: Fill in missing data values with estimated or imputed values to facilitate analysis.
Linearize: Transforming the relationship between variables to make datasets approximately linear.
Munge: See 'wrangle'.
Normalize: Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.
Reduce: Convert a large set of data into a smaller, more manageable form without significant loss of information.
Reshape: Change the structure of data to better fit specific analysis or modeling requirements.
Shred: Break down large datasets into smaller, more manageable pieces for easier processing and analysis.
Skew: An imbalance in the distribution or representation of data.
Split: Divide a dataset into training, validation, and testing sets for machine learning model training.
Standardize: Transform data to a common unit or format to facilitate comparison and analysis.
Tokenize: Convert data into tokens or smaller units to simplify analysis or processing.
Transform: Convert data from one format or structure to another.
Wrangle: Convert unstructured data into a structured format.