Dagster Data Engineering Glossary:
Data Masking
Data masking definition:
Data masking is the process of obscuring specific data elements within a dataset to protect sensitive information. It is commonly used when data is shared or analyzed but certain elements of it must remain confidential.
The purpose of data masking is to keep sensitive information from being exposed to unauthorized users while still allowing the data to be used for analysis, sharing, and similar purposes.
There are several techniques for data masking, including:
- Substitution: replacing sensitive data with a different value, such as a pseudonym or a hash value.
- Shuffling: randomly reordering the values of a sensitive column so that they no longer correspond to their original records.
- Truncation: removing a portion of sensitive data, such as only showing the first few digits of a social security number.
- Noise addition: adding random noise to sensitive data to make it harder to identify.
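The four techniques above can be sketched with the Python standard library alone. This is a minimal illustration, not a production implementation: the function names (`substitute`, `shuffle_column`, `truncate`, `add_noise`), their default parameters, and the sample values are assumptions made for this example.

```python
import hashlib
import random


def substitute(value: str) -> str:
    """Substitution: replace the value with a pseudonym (here, a shortened SHA-256 digest)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def shuffle_column(values: list[str], seed: int = 0) -> list[str]:
    """Shuffling: return the same values in a random order so they no longer line up with their rows."""
    shuffled = list(values)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    return shuffled


def truncate(value: str, keep: int = 3) -> str:
    """Truncation: keep only the first `keep` characters and drop the rest."""
    return value[:keep]


def add_noise(amount: float, scale: float = 0.05, seed: int = 0) -> float:
    """Noise addition: perturb a number by a random factor of at most +/- `scale`."""
    rng = random.Random(seed)
    return amount * (1 + rng.uniform(-scale, scale))


# Hypothetical sample values, chosen only for illustration
ssns = ["555221111", "555332222", "555443333"]
print([substitute(s) for s in ssns])
print(shuffle_column(ssns))
print([truncate(s) for s in ssns])
print(add_noise(52000.0))
```

The fixed seeds make the sketch repeatable. Real masking code would normally use an unpredictable source of randomness.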
Data masking example using Python:
To run the following example, the Python libraries it imports must be installed in your environment.
Given the input file ss_numbers.csv:
first_name,last_name,social_security_number,email
Aisha,Khan,555221111,aisha.khan@company.com
Mia,Tran,555332222,mia.tran@company.com
Jasmine,Lee,555443333,jasmine.lee@company.com
Rohan,Gupta,555554444,rohan.gupta@company.com
[...]
Imran,Khan,555198888,imran.khan@company.com
Aisha,Shah,555209999,aisha.shah@company.com
Jalen,Davis,555210000,jalen.davis@company.com
We can mask the data using:
import pandas as pd
# Load the data, reading every column as text so SSNs keep any leading zeros
df = pd.read_csv('ss_numbers.csv', dtype=str)
# Replace SSNs made up only of digits with a fixed placeholder
df['social_security_number'] = df['social_security_number'].apply(lambda x: "********" if str(x).isdigit() else x)
# Replace company email addresses with a single generic address
df['email'] = df['email'].apply(lambda x: 'masked_email@example.com' if str(x).endswith('@company.com') else x)
# Save the masked data to a new file
df.to_csv('masked_data.csv', index=False)
In this example, we load a CSV file containing sensitive data such as social security numbers and email addresses. We then mask both columns by substitution: social security numbers that consist only of digits are replaced with a fixed string of asterisks, and email addresses that end with '@company.com' are replaced with a single placeholder address. Values that do not match these rules are left unchanged. Finally, we save the masked data to a new CSV file, which will look like this:
first_name,last_name,social_security_number,email
Aisha,Khan,********,masked_email@example.com
Mia,Tran,********,masked_email@example.com
Jasmine,Lee,********,masked_email@example.com
Rohan,Gupta,********,masked_email@example.com
[...]
Aisha,Shah,********,masked_email@example.com
Jalen,Davis,********,masked_email@example.com
Best practices for data masking in Python
- Identify sensitive data: Before masking data, it is important to identify what data is considered sensitive. This could include personally identifiable information (PII) such as names, addresses, and social security numbers, as well as financial information such as credit card numbers and bank account numbers.
- Choose a masking technique: There are several techniques for data masking, including substitution, shuffling, and encryption. The choice of technique will depend on the sensitivity of the data and the desired level of protection.
- Mask the data: Once the technique has been chosen, the data can be masked. Here is an example of using substitution to mask a social security number using regular expressions:
import re
ssn = "123-45-6789"
# Mask the first five digits, leaving only the last four visible
masked_ssn = re.sub(r"^\d{3}-\d{2}", "XXX-XX", ssn)
print(masked_ssn) # Output: XXX-XX-6789
- Test the masked data: It is important to test the masked data to ensure that it is still usable for its intended purpose, but without exposing sensitive information. This can involve checking that the data is still accurate and consistent.
- Document the masking process: It is important to document the masking process to ensure that it can be replicated and audited in the future. This can include recording the masking technique used and any parameters or rules that were applied.
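As a rough illustration of the testing step, one possible check is to scan the masked output for values that still look like social security numbers. The helper name `find_unmasked_ssns`, the pattern, and the sample rows below are assumptions for this sketch. A real check would be tailored to the formats present in your own data.

```python
import re

# Matches nine digits, optionally grouped as 3-2-4 with hyphens
SSN_PATTERN = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")


def find_unmasked_ssns(rows: list[dict[str, str]]) -> list[tuple[int, str]]:
    """Return (row index, column name) for every cell that still looks like an SSN."""
    leaks = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            if SSN_PATTERN.search(str(value)):
                leaks.append((index, column))
    return leaks


# Hypothetical masked rows; the second one was deliberately left unmasked
masked_rows = [
    {"first_name": "Aisha", "social_security_number": "********"},
    {"first_name": "Mia", "social_security_number": "555332222"},
]
print(find_unmasked_ssns(masked_rows))
```

A check like this could run after every masking job, with any non-empty result treated as a failure.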
Masking vs. Obfuscation vs. Hashing
Data masking, data obfuscation, and data hashing are techniques used to protect sensitive data, but they differ in how they accomplish this.
Data masking hides sensitive information within a dataset by replacing the original values with modified ones that cannot be used to identify the original information. Well-designed masking preserves the format of the data, and in some cases its statistical properties, so that the masked dataset remains usable. Whether masking is reversible depends on the technique: fixed-value substitution and truncation discard the original values, whereas approaches such as encryption or tokenization allow authorized parties to recover them.
Data obfuscation, on the other hand, is a broader term for making data unintelligible or difficult to understand. It involves altering the data in a way that makes it difficult to interpret without additional context or knowledge. Depending on the method, obfuscation may or may not be reversible, and on its own it does not guarantee that the original data cannot be recovered.
Data hashing converts data into a fixed-length string of characters, called a hash. With a cryptographic hash function, the same input always produces the same hash, even small changes to the input produce a completely different hash, and it is computationally infeasible to find two inputs that share a hash. Hashing is a one-way function: the original data cannot be computed from the hash, although values drawn from a small set of possibilities, such as nine-digit social security numbers, can be recovered by hashing every candidate and comparing the results. The main purpose of hashing is to verify the integrity of data during transmission and storage by checking that its hash has not changed.
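The hashing behavior described above can be sketched with Python's standard `hashlib` and `hmac` modules. Because a nine-digit social security number has few enough possible values to enumerate, the second half uses a keyed hash (HMAC) as one common way to make such guessing harder. The `SECRET_KEY` value and the `keyed_hash` helper are hypothetical names introduced for this example.

```python
import hashlib
import hmac

ssn = "555221111"

# A plain SHA-256 hash: fixed length, and any change to the input changes the digest
digest = hashlib.sha256(ssn.encode("utf-8")).hexdigest()
changed = hashlib.sha256("555221112".encode("utf-8")).hexdigest()
print(len(digest))        # 64 hexadecimal characters, regardless of input length
print(digest == changed)  # False: a one-digit change yields a different hash

# Hypothetical secret key; in practice it would come from a secrets manager
SECRET_KEY = b"replace-with-a-real-secret"


def keyed_hash(value: str, key: bytes = SECRET_KEY) -> str:
    """Return an HMAC-SHA256 digest, which cannot be recomputed without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


print(keyed_hash(ssn))
```

Hashing the same value with the same key always yields the same digest, so keyed hashes can still be used to join or deduplicate masked records.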