Dagster Data Engineering Glossary:
Data Cleaning
Data cleaning definition:
Data cleansing, also known as data cleaning, is the process of identifying and correcting or removing inaccuracies, inconsistencies, and errors in data to improve its quality and reliability. In data engineering, data cleansing is an essential step in the data pipeline to ensure that downstream processes are not negatively affected by bad data. But poor data cleansing practices can actually negatively impact your data.
Data cleaning example using Python:
Here are some best practices and practical examples in Python for data cleansing in data engineering:
Identify and handle missing data: Use Python libraries such as Pandas and NumPy to detect and handle missing data. The following code snippet shows how to detect missing data using Pandas:
Please note that you need to have the necessary Python libraries installed in your Python environment to run this code.
import pandas as pd
import numpy as np
# Create a sample dataset with missing data
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, np.nan],
'C': [8, 9, 10, 11]
})
# Print the original dataset
print("Original dataset:\n", df)
# Detect missing values
print("\nMissing values:\n", df.isnull())
# Remove rows with missing data
df.dropna(inplace=True)
# Print the modified dataset
print("\nModified dataset:\n", df)
# Fill missing values with a specific value
df.fillna(value=0, inplace=True)
# Print the modified dataset
print("\nModified dataset after filling missing values:\n", df)
# Interpolate missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, np.nan],
'C': [8, 9, 10, 11]
})
df.interpolate(method='linear', inplace=True)
# Print the modified dataset
print("\nModified dataset after interpolating missing values:\n", df)
In the above example, we create a sample dataset with missing data and use Pandas and NumPy to detect and handle missing data. We first print the original dataset and then detect missing values using the isnull()
method. Next, we remove rows with missing data using the dropna()
method and print the modified dataset. We then fill missing values with a specific value using the fillna()
method and print the modified dataset again. Finally, we interpolate missing values using the interpolate()
method and print the modified dataset.
It will output the following:
Original dataset:
A B C
0 1.0 5.0 8
1 2.0 NaN 9
2 NaN 7.0 10
3 4.0 NaN 11
Missing values:
A B C
0 False False False
1 False True False
2 True False False
3 False True False
Modified dataset:
A B C
0 1.0 5.0 8
Modified dataset after filling missing values:
A B C
0 1.0 5.0 8
Modified dataset after interpolating missing values:
A B C
0 1.0 5.0 8
1 2.0 6.0 9
2 3.0 7.0 10
3 4.0 7.0 11
Remove duplicates: Use Python libraries such as Pandas to remove duplicate records from the data. The following code snippet shows how to remove duplicates using Pandas:
Assuming your starting file data.csv
is as follows (note the duplicate records):
Patient ID,Age,Name,ZipPostalCode,Address,Social Security Number,Income
12,46,James Bond,BD74AD,27 Ardmore Terrace,CN142371,1000
13,32,William Tell,21209,23 Archery Road,321-65-6278,1000
14,61,Cheryl Crow,88905,1 Sunset Boulevard,788-52-1876,1000
13,32,William Tell,21209,23 Archery Road,321-65-6278,1000
15,66,Marie Curie,78690,10 Rue Maurice Ravel,FR2939793,10000
15,66,Marie Curie,78690,10 Rue Maurice Ravel,FR2939793,10000
12,46,James Bond,BD74AD,27 Ardmore Terrace,CN142371,1000
You can remove the duplicates as follows:
import pandas as pd
df = pd.read_csv('data.csv')
# remove duplicates
df.drop_duplicates(inplace=True)
print(df)
You output will be:
Patient ID Age Name ZipPostalCode Address Social Security Number Income
0 12 46 James Bond BD74AD 27 Ardmore Terrace CN142371 1000
1 13 32 William Tell 21209 23 Archery Road 321-65-6278 1000
2 14 61 Cheryl Crow 88905 1 Sunset Boulevard 788-52-1876 1000
4 15 66 Marie Curie 78690 10 Rue Maurice Ravel FR2939793 10000
PyJanitor is another library that provides a set of functions for cleaning and tidying data. It includes functions for identifying and removing duplicates, converting data types, and handling missing values.
import pandas as pd
import janitor
df = pd.read_csv('data.csv')
df = df.clean_names() # cleans column names
df = df.drop_duplicates() # removes duplicate rows
df = df.fillna(0) # fills missing values with 0
print(df)
Which will yield the following output:
feature_1 feature_2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
Standardize data: Use Python libraries such as Pandas and scikit-learn to standardize data. The following code snippet shows how to standardize data using scikit-learn:
Given the input file:
A,B,C
101,211,0.98
312,456,0.97
105,776,0.89
0.07,809,0.98
9,1034,0.99
Running the following:
import pandas as pd
from sklearn.preprocessing import StandardScaler
data = pd.read_csv('data.csv')
scaler = StandardScaler()
# standardize the data
standardized_data = scaler.fit_transform(data)
print(standardized_data)
Will yield the following output:
[[-0.03929484 -1.54190454 0.49245809]
[ 1.83909466 -0.69527385 0.21887026]
[-0.00368556 0.41052949 -1.96983236]
[-0.93780599 0.52456546 0.49245809]
[-0.85830827 1.30208344 0.76604592]]
Remove outliers: Use Python libraries such as Pandas and NumPy to identify and remove outliers from the data. The following code snippet shows how to remove outliers using Pandas:
Given the input file:
column_name
0.123
0.129
0.110
0.19
0.134
0.09
The following operation…
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# remove outliers
df = df[(np.abs(df['column_name'] - df['column_name'].mean()) / df['column_name'].std()) < 3]
… will remove the two datapoints deemed outliers, and will yield the following output:
column_name
0 0.123
1 0.129
2 0.110
4 0.134
Handle inconsistent data: Use Python libraries such as Pandas to handle inconsistent data, such as different data types or formats. The following code snippet shows how to handle inconsistent data using Pandas:
import pandas as pd
df = pd.read_csv('data.csv')
# convert data type of column to float
df['column_name'] = df['column_name'].astype(float)