
Dagster Data Engineering Glossary:


Data Compression

Reduce the size of data to save storage space and improve processing performance.

Data compression definition:

Data compression is the process of encoding data so that it occupies fewer bytes. It matters in data pipelines because smaller data lowers storage and transfer costs and can speed up I/O-bound processing, at the price of some CPU time spent compressing and decompressing.

Data compression examples in Python:

Python's standard library includes several modules for data compression. The two most commonly used are gzip and zipfile. Because both ship with Python, you can use them without installing any external packages.

The gzip module provides a simple way to compress and decompress files using the gzip format, which is a common file format used for compressing files on Unix-based systems. The module provides functions to read and write gzip files, as well as utilities to compress and decompress data in memory.
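The in-memory utilities mentioned above can be sketched as follows. This is a minimal illustration using gzip.compress and gzip.decompress; the sample payload is made up for the example.

```python
import gzip

# Hypothetical sample payload: repetitive text compresses well.
raw = b"timestamp,status\n" + b"2023-04-19T11:00:00,ok\n" * 500

compressed = gzip.compress(raw)         # bytes in, gzip-framed bytes out
restored = gzip.decompress(compressed)  # reverses the step above

# Compare the sizes before and after compression.
print(len(raw), len(compressed))

# gzip is lossless, so the round trip returns the original bytes.
assert restored == raw
```

This form is handy when the data already lives in memory, for example a payload about to be sent over the network, and no intermediate file is needed.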

The zipfile module provides a way to create, read, and extract files from ZIP archives. ZIP files are a popular way to package multiple files into a single file, and the zipfile module makes them easy to work with in Python. Its ZipFile class lets you create new ZIP files, add files to existing ZIP files, and extract files from ZIP files.

gzip

Here is an example of gzip as used in code. Because gzip is part of the standard library, there is nothing extra to install.

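import gzip
import shutil

# compress a file: stream the raw bytes into a gzip-compressed file
with open('data.txt', 'rb') as f_in, gzip.open('data.txt.gz', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

# decompress a file: stream the decompressed bytes into a new file
with gzip.open('data.txt.gz', 'rb') as f_in, open('data2.txt', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)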

Given an input file data.txt, this generates a compressed copy, data.txt.gz, and then decompresses that copy into a new file, data2.txt.

If you open the compressed file in a text editor you will notice it is hard to read and looks something like this:

��@d�data.txtE�a�$�
D{�(((���{Jz�96bg��
��J�������8s>�?���������ۿٞ1��9�xϷ�?��<��oO~���kc����zy���{��y������X�<��o�;���~���c�^�����6��,_��i�{����u�V?�w���������Go�~���;���6yK�����,����?�~{���y�sxt-y�w��5�����g��}�d+�}����i븀�'࣋e�o�5��k���s+�
������y���w�c��k�Ƴ���n�1���S_�,�������kO��pJ�֡�=O=����3Z=�=��i�e���w�ٲ�f+s�O/��޲��gs�ݕΣ&�iX����÷9����V=���'-۫|�{�����|�F<��j���^/]Zꩅ��1�sV�㝝ӫ�n,���Z��z6xW-�u����=���}g�z���
��w��;W��X��Y��7��/�:~�@��xm6�G���;�>?��'�1W�_�

You will also notice that data.txt and data2.txt are the exact same size, while data.txt.gz is a smaller size:

$ ls -la
-rw-r--r--   1 16893 Apr 19 11:00 data.txt
-rw-r--r--   1   7952 Apr 19 11:00 data.txt.gz
-rw-r--r--   1 16893 Apr 19 11:00 data2.txt
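To put a number on the difference, here is a small sketch that computes the space saving from the two sizes in the listing above. The helper name space_saving is hypothetical and introduced only for illustration.

```python
def space_saving(original_size, compressed_size):
    """Return the fraction of space saved (0.5 means the file halved)."""
    return 1 - compressed_size / original_size

# Sizes in bytes, taken from the directory listing above.
saving = space_saving(16893, 7952)
print(f"{saving:.1%}")
```

For files on disk, the two sizes can be read with os.path.getsize instead of being typed in by hand.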

zipfile

Next, here is an example of the zipfile module as used in code. Assuming we have two arbitrary files (data1.txt and data2.txt), we can add them to a ZIP archive and then extract them from the archive into a new folder:

import zipfile

# create a zip file; ZIP_DEFLATED compresses the members (the default only stores them)
with zipfile.ZipFile('data.zip', 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.write('data1.txt')
    f.write('data2.txt')

# extract all files from the zip file into the folder 'newfolder'
with zipfile.ZipFile('data.zip', 'r') as f:
    f.extractall('newfolder')
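The zipfile module can also append to an existing archive and report the size of each member. The following is a minimal sketch of both; the file names and contents are hypothetical and are created in a temporary directory so the example runs on its own.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample files, created only for this sketch.
    paths = []
    for name in ("data1.txt", "data2.txt", "data3.txt"):
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            f.write("some repetitive sample text\n" * 200)
        paths.append(path)

    archive = os.path.join(tmp, "data.zip")

    # Create the archive with two members.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(paths[0], arcname="data1.txt")
        zf.write(paths[1], arcname="data2.txt")

    # Mode "a" appends to the existing archive instead of overwriting it.
    with zipfile.ZipFile(archive, "a", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(paths[2], arcname="data3.txt")

    # List the members with their original and compressed sizes.
    with zipfile.ZipFile(archive, "r") as zf:
        members = zf.namelist()
        for info in zf.infolist():
            print(info.filename, info.file_size, info.compress_size)
```

Passing arcname stores each file under a short name rather than its full temporary path.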

While gzip and zipfile are the most frequently used, they are not the only options. For example, the bz2 module compresses and decompresses files using the bzip2 format, which often achieves higher compression ratios than gzip, typically at the cost of slower compression. It is commonly used for compressing large text files. Like gzip and zipfile, bz2 is part of the Python standard library.

Let’s look at an example:

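import bz2
import shutil

# compress a file: stream the raw bytes into a bzip2-compressed file
with open('data.txt', 'rb') as f_in, bz2.open('data.txt.bz2', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

# decompress into a new file so the original data.txt is not overwritten
with bz2.open('data.txt.bz2', 'rb') as f_in, open('data2.txt', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)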

Like the gzip output, this file is unreadable in a simple text editor and will look something like this:

BZh91AY&SY�q��S�`0�`$���ץ��e�`vsV�nNn���N��U�f�,����Vgu[�];j��f����Ѿ���M�l��u�٢�wwF;i��ܶ�S�53*�S���S�LM*������POz��$� $�U
�;���n�����~
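Whether bzip2 actually beats gzip depends on the data, so it is worth measuring on a representative sample. Here is a rough sketch that compresses the same bytes with both modules in memory; the sample data is synthetic and made up for this example, and real files may behave differently.

```python
import bz2
import gzip

# Hypothetical sample: synthetic CSV-like rows, generated for illustration.
lines = [f"{i},sensor-{i % 7},{(i * 37) % 101}\n" for i in range(20000)]
raw = "".join(lines).encode("utf-8")

results = {}
for name, module in (("gzip", gzip), ("bz2", bz2)):
    packed = module.compress(raw)
    # Both formats are lossless, so the round trip must match exactly.
    assert module.decompress(packed) == raw
    results[name] = len(packed)

print("raw bytes:", len(raw))
for name, size in results.items():
    print(f"{name}: {size} bytes ({size / len(raw):.1%} of original)")
```

Compression and decompression time can matter as much as the ratio in a pipeline, so a fuller comparison would time each call as well.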

Data compression vs. data compaction vs. data reduction

While related, compaction, compression and reduction are slightly different techniques:

| | Compression | Compaction | Reduction |
| --- | --- | --- | --- |
| Aim | Reduce data size by encoding it in a more space-efficient format, minimizing transfer or storage. | Minimize storage space while preserving data contents and structure. | Reduce the overall amount of data while retaining its critical information. |
| Techniques | Encoding data in a compressed form that represents it in fewer bits. | Data compression, merging smaller data segments into larger ones, and optimizing data structures like indexes to reduce fragmentation. | Aggregation, downsampling, encoding, filtering out less relevant data points, dimensionality reduction for feature selection. |
| Key benefits | Reduce data transfer times, save storage space, and improve data transmission efficiency. | Improve storage efficiency and reduce costs. | Improve data manageability, simplify analytics, and speed up query performance. |

Compression, compaction, and reduction are complementary and are often used together to achieve data engineering objectives.
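As a toy illustration of how the three can be combined, the sketch below merges small segments (compaction), downsamples the result (reduction), and then gzips the serialized output (compression). The data, the one-in-ten sampling rate, and the JSON encoding are all assumptions made for this example, not a recommended pipeline design.

```python
import gzip
import json

# Hypothetical input: 100 small segments of 10 readings each.
segments = [
    [{"t": s * 10 + i, "value": (s * 10 + i) % 5} for i in range(10)]
    for s in range(100)
]

# Compaction: merge the small segments into one larger one; no rows are lost.
compacted = [row for segment in segments for row in segment]

# Reduction: keep every 10th reading (downsampling); this discards detail.
reduced = compacted[::10]

# Compression: encode the serialized bytes more compactly; fully reversible.
payload = json.dumps(reduced).encode("utf-8")
compressed = gzip.compress(payload)

print(len(segments), len(compacted), len(reduced), len(payload), len(compressed))

# Only the compression step can be undone exactly.
assert json.loads(gzip.decompress(compressed)) == reduced
```

Note the difference in reversibility: decompression restores the reduced data byte for byte, while the readings dropped during reduction cannot be recovered.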

