
Dagster Data Engineering Glossary:


Data Anonymization

Remove personal or identifying information from data.

Data anonymization definition:

Data anonymization is the process of removing personal or identifying information from data to protect the privacy of individuals. In the context of modern data pipelines, anonymization is an important technique used to protect sensitive data, especially in industries such as healthcare, finance, and government.

There are several techniques used in anonymization, such as:

  • Removing personal information: This involves removing information such as names, addresses, social security numbers, and any other identifying information from the dataset. This is the most basic form of anonymization.
  • Pseudonymization: This involves replacing identifying information with pseudonyms or codes, for example replacing names with unique IDs. This technique is reversible: anyone who holds the mapping between the pseudonyms and the original values can reconstruct the original data, so that mapping must be stored separately and protected.
  • Generalization: This involves replacing specific values with a more general value. For example, replacing specific ages with age ranges or replacing specific zip codes with broader geographical regions.
  • Noise addition: This involves adding random noise to the data to mask the original values. For example, adding random numbers to ages or incomes to obscure the original values.
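To make the reversible nature of pseudonymization concrete, here is a minimal sketch that uses only the Python standard library. The function name `pseudonymize`, the `P_` prefix, and the token length are illustrative assumptions, not part of any particular library or of Dagster.

```python
import secrets


def pseudonymize(values):
    """Replace each distinct value with a random token.

    Returns (pseudonyms, mapping). The mapping is the "code" that allows
    the original values to be recovered, so it should be stored apart
    from the pseudonymized data and kept secret.
    """
    mapping = {}  # original value -> pseudonym
    pseudonyms = []
    for value in values:
        if value not in mapping:
            # Hypothetical token format: a fixed prefix plus 16 random hex characters
            mapping[value] = "P_" + secrets.token_hex(8)
        pseudonyms.append(mapping[value])
    return pseudonyms, mapping


names = ["James Bond", "William Tell", "James Bond"]
pseudonyms, mapping = pseudonymize(names)

# Reversing the process requires the mapping table
reverse = {token: original for original, token in mapping.items()}
recovered = [reverse[token] for token in pseudonyms]
```

The same person always receives the same token, so records can still be joined on the pseudonym, while the tokens themselves reveal nothing about the original names.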

Here's an example of anonymizing a dataset using the pandas library in Python. To run it, you need pandas and NumPy installed in your Python environment.

Assuming data.csv is as follows:

Patient ID,Age,Name,ZipPostalCode,Address,Social Security Number,Income
12,46,James Bond,BD74AD,27 Ardmore Terrace,CN142371,1000
13,32,William Tell,21209,23 Archery Road,321-65-6278,1000
14,61,Cheryl Crow,88905,1 Sunset Boulevard,788-52-1876,1000
15,66,Marie Curie,78690,10 Rue Maurice Ravel,FR2939793,10000

You can run the following code:

import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('data.csv')

# Remove personal information
df = df.drop(['Name', 'Address', 'Social Security Number'], axis=1)

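# Pseudonymization (illustrative only: a fixed prefix leaves the original ID readable)
df['Patient ID'] = df['Patient ID'].apply(lambda x: 'Patient_' + str(x))

# Generalization: bucket each age into its decade, e.g. 46 becomes '40s'
df['Age'] = df['Age'].apply(lambda x: str(x//10) + '0s')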

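# Noise addition: add one Gaussian draw (mean 0, standard deviation 500) per row
df['Income'] = df['Income'] + np.random.normal(0, 500, size=len(df))

# Show the anonymized dataset
print(df)

This yields output similar to the following (the Income values change on every run because the noise is random):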

   Patient ID  Age ZipPostalCode        Income
0  Patient_12  40s        BD74AD   1545.911736
1  Patient_13  30s         21209   1410.581151
2  Patient_14  60s         88905   1487.859904
3  Patient_15  60s         78690  10521.217944

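In the example above, we load a dataset and remove personal information such as name, address, and social security number. We then pseudonymize the patient ID, generalize each age to a decade range, and add random noise to the income column. The 'Patient_' prefix only illustrates the idea: because the original ID is still visible, a real pipeline would substitute a code that cannot be read back without a separately stored mapping. ZipPostalCode is left unchanged here, although it could be generalized in the same way as age. Together, these techniques help protect the privacy of individuals while retaining much of the dataset's utility.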
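The generalization of zip codes described earlier can be sketched in the same spirit. The helper below is a hypothetical illustration in plain Python: the name `generalize_postal_code` and the choice to keep two leading characters are assumptions, and the right level of coarsening depends on the postal system and on how many people share each prefix.

```python
def generalize_postal_code(code, keep=2):
    """Keep the first `keep` characters and mask the rest with '*'."""
    text = str(code).strip()
    return text[:keep] + "*" * max(len(text) - keep, 0)


codes = ["BD74AD", "21209", "88905", "78690"]
generalized = [generalize_postal_code(code) for code in codes]
print(generalized)  # ['BD****', '21***', '88***', '78***']
```

With pandas, a helper like this could be applied to the example column via df['ZipPostalCode'].astype(str).map(generalize_postal_code).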

