
Dagster Data Engineering Glossary:


Data De-identification

Remove personally identifiable information (PII) from data to protect privacy and comply with regulations.

De-identify data - a definition:

De-identifying data is the process of removing or obfuscating personally identifiable information from datasets, while still retaining their utility for analysis. One common method is to replace sensitive data with pseudonyms or anonymized values.

In Python, one way to de-identify data is to use regular expressions (“regex”) to find and replace sensitive information. For example, the re library can be used to search for patterns such as email addresses, phone numbers, or social security numbers, and replace them with generic terms like "EMAIL", "PHONE", or "SSN".

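The re module is part of Python's standard library, so no installation is required. (The separate third-party regex package, installed with pip install regex, offers additional features but is not needed for this example.)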

Example of de-identifying data in Python:

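import re

# Match email addresses, including multi-part domains such as mail.example.co.uk
email_pattern = re.compile(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b')
# Match 10-digit phone numbers with optional '-' or '.' separators
phone_pattern = re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b')

text = "John Smith's email is john.smith@example.com and his phone number is 555-123-4567"
print(text)

# Replace each match with a generic placeholder
text = email_pattern.sub('EMAIL', text)
text = phone_pattern.sub('PHONE', text)

print(text)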

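The first print shows the original text; the second would output: "John Smith's email is EMAIL and his phone number is PHONE". Note that the name "John Smith" matches neither pattern and remains in the text, so names need separate handling.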
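The same approach extends to the social security numbers mentioned earlier. The following is a minimal sketch, assuming U.S. numbers written in the 123-45-6789 form; the pattern and the redact_ssn helper are illustrative names, not part of any library, and other formats would need their own patterns.

```python
import re

# Assumed format: three digits, two digits, four digits, separated by hyphens.
ssn_pattern = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')


def redact_ssn(text: str) -> str:
    """Replace anything shaped like an SSN with the generic token 'SSN'."""
    return ssn_pattern.sub('SSN', text)


print(redact_ssn("The number on file is 123-45-6789."))
```

The word boundaries (\b) keep the pattern from matching inside longer digit runs, such as order or account numbers.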

Another method is to replace sensitive values with a hash. Because the same input always produces the same hash, the hashed values can still be used to link records across datasets without exposing the original values. For example, the hashlib module in Python's standard library can generate a SHA-256 hash of a sensitive value:

import hashlib

email = 'john.smith@dagster.com'
hashed_email = hashlib.sha256(email.encode()).hexdigest()

print(hashed_email)

This would output a unique hash value that can be used to link records with the same email address, without revealing the actual email: "ce909f6d89a6f29d7fc442d6fcacc166b89f87c1baeac9c24b14f986ae2c75ca".
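One caveat: a plain SHA-256 hash of an email address can be matched by anyone who hashes a list of candidate addresses and compares the results. A common mitigation is a keyed hash (HMAC). The sketch below uses the standard-library hmac module; SECRET_KEY and the pseudonymize helper are hypothetical names introduced for illustration, and the key is hard-coded only to keep the example self-contained.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key would be long, random,
# and loaded from a secrets store rather than written in source code.
SECRET_KEY = b'replace-with-a-long-random-secret'


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest (HMAC) of a normalized value."""
    # Normalize case and whitespace so equivalent inputs map to the same token.
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode('utf-8'), hashlib.sha256).hexdigest()


token_a = pseudonymize('john.smith@dagster.com')
token_b = pseudonymize(' John.Smith@Dagster.com ')

# Same address, different formatting: the tokens match, so records still link.
print(token_a == token_b)
```

Records can still be joined on the token as long as every dataset is processed with the same key, while someone without the key cannot reproduce the tokens from a list of guessed addresses.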

De-identifying data in Pandas:

Let’s look at the de-identification process using the popular Pandas library:

  1. Identify sensitive data: Before de-identification, identify which data needs to be protected. This can include personal identifiers such as names, addresses, and social security numbers.
  2. Anonymize the data: Next, remove or alter the identifying information. The following tools can help:
  • pandas.DataFrame.drop(): Drops columns that contain sensitive information.
  • pandas.DataFrame.replace(): Replaces sensitive values with non-identifying ones.
  • hashlib: This standard-library module can hash sensitive values such as email addresses or phone numbers.
  3. Validate the data: After de-identification, check that no identifying information remains. The following Pandas methods can support these checks:
  • pandas.DataFrame.duplicated(): Flags duplicate rows in the dataset.
  • pandas.DataFrame.isnull(): Flags missing values in the dataset.
  • pandas.DataFrame.any(): Reports whether any value in a boolean DataFrame is True. Combined with a pattern check such as pandas.Series.str.contains(), it can show whether any identifying information remains.

Using Pandas functions such as the ones listed above can help ensure that sensitive information is properly removed or altered in datasets.
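The three steps can be sketched end to end as follows. This is a minimal illustration, not a complete de-identification pipeline: the sample DataFrame, its column names, the two regular expressions, and the sha256_hex and contains_pattern helpers are all assumptions made for the example.

```python
import hashlib

import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe'],
    'email': ['john.smith@example.com', 'jane.doe@example.com'],
    'notes': ['Call 555-123-4567 after 5pm', 'No phone on file'],
    'purchase_total': [42.50, 17.25],
})

PHONE_REGEX = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
EMAIL_REGEX = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'


def sha256_hex(value: str) -> str:
    """Hash a single string value with SHA-256."""
    return hashlib.sha256(value.encode('utf-8')).hexdigest()


# Step 1: decide which columns are sensitive (here: name, email, notes).
# Step 2: drop, hash, or replace the sensitive values.
deidentified = df.drop(columns=['name'])
deidentified['email'] = deidentified['email'].map(sha256_hex)
deidentified = deidentified.replace(
    {'notes': PHONE_REGEX}, {'notes': 'PHONE'}, regex=True
)


# Step 3: validate that no value still matches a sensitive pattern.
def contains_pattern(frame: pd.DataFrame, pattern: str) -> bool:
    """Return True if any cell, viewed as text, matches the pattern."""
    hits = frame.astype(str).apply(lambda col: col.str.contains(pattern, regex=True))
    return bool(hits.any().any())


print(deidentified.columns.tolist())
print(contains_pattern(deidentified, EMAIL_REGEX))
print(contains_pattern(deidentified, PHONE_REGEX))
```

The validation step only catches values that match the listed patterns; free-text fields can still contain names or other identifiers that no simple pattern detects.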

