Data Cleaning

Data cleaning definition:

Data cleansing, also known as data cleaning, is the process of identifying and correcting or removing inaccuracies, inconsistencies, and errors in data to improve its quality and reliability. In data engineering, data cleansing is an essential step in the data pipeline to ensure that downstream processes are not negatively affected by bad data. But poor data cleansing practices can actually negatively impact your data.

Data cleaning example using Python:

Here are some best practices and practical examples in Python for data cleansing in data engineering:

Identify and handle missing data: Use Python libraries such as Pandas and NumPy to detect and handle missing data. The following code snippet shows how to detect missing data using Pandas:

Please note that you need to have the necessary Python libraries installed in your Python environment to run this code.

import pandas as pd
import numpy as np

# Create a sample dataset with missing data
df = pd.DataFrame({
   'A': [1, 2, np.nan, 4],
   'B': [5, np.nan, 7, np.nan],
   'C': [8, 9, 10, 11]
})

# Print the original dataset
print("Original dataset:\n", df)

# Detect missing values
print("\nMissing values:\n", df.isnull())

# Remove rows with missing data
df.dropna(inplace=True)

# Print the modified dataset
print("\nModified dataset:\n", df)

# Fill missing values with a specific value
df.fillna(value=0, inplace=True)

# Print the modified dataset
print("\nModified dataset after filling missing values:\n", df)

# Interpolate missing values
df = pd.DataFrame({
   'A': [1, 2, np.nan, 4],
   'B': [5, np.nan, 7, np.nan],
   'C': [8, 9, 10, 11]
})

df.interpolate(method='linear', inplace=True)

# Print the modified dataset
print("\nModified dataset after interpolating missing values:\n", df)

In the above example, we create a sample dataset with missing data and use Pandas and NumPy to detect and handle missing data. We first print the original dataset and then detect missing values using the isnull() method. Next, we remove rows with missing data using the dropna() method and print the modified dataset. We then fill missing values with a specific value using the fillna() method and print the modified dataset again. Finally, we interpolate missing values using the interpolate() method and print the modified dataset.

It will output the following:

Original dataset:
      A    B   C
0  1.0  5.0   8
1  2.0  NaN   9
2  NaN  7.0  10
3  4.0  NaN  11

Missing values:
        A      B      C
0  False  False  False
1  False   True  False
2   True  False  False
3  False   True  False

Modified dataset:
      A    B  C
0  1.0  5.0  8

Modified dataset after filling missing values:
      A    B  C
0  1.0  5.0  8

Modified dataset after interpolating missing values:
      A    B   C
0  1.0  5.0   8
1  2.0  6.0   9
2  3.0  7.0  10
3  4.0  7.0  11

Remove duplicates: Use Python libraries such as Pandas to remove duplicate records from the data. The following code snippet shows how to remove duplicates using Pandas:

Assuming your starting file data.csv is as follows (note the duplicate records):

Patient ID,Age,Name,ZipPostalCode,Address,Social Security Number,Income
12,46,James Bond,BD74AD,27 Ardmore Terrace,CN142371,1000
13,32,William Tell,21209,23 Archery Road,321-65-6278,1000
14,61,Cheryl Crow,88905,1 Sunset Boulevard,788-52-1876,1000
13,32,William Tell,21209,23 Archery Road,321-65-6278,1000
15,66,Marie Curie,78690,10 Rue Maurice Ravel,FR2939793,10000
15,66,Marie Curie,78690,10 Rue Maurice Ravel,FR2939793,10000
12,46,James Bond,BD74AD,27 Ardmore Terrace,CN142371,1000

You can remove the duplicates as follows:

import pandas as pd

df = pd.read_csv('data.csv')
# remove duplicates
df.drop_duplicates(inplace=True)
print(df)

You output will be:

   Patient ID  Age          Name ZipPostalCode               Address Social Security Number  Income
0          12   46    James Bond        BD74AD    27 Ardmore Terrace               CN142371    1000
1          13   32  William Tell         21209       23 Archery Road            321-65-6278    1000
2          14   61   Cheryl Crow         88905    1 Sunset Boulevard            788-52-1876    1000
4          15   66   Marie Curie         78690  10 Rue Maurice Ravel              FR2939793   10000

PyJanitor is another library that provides a set of functions for cleaning and tidying data. It includes functions for identifying and removing duplicates, converting data types, and handling missing values.

import pandas as pd
import janitor

df = pd.read_csv('data.csv')
df = df.clean_names() # cleans column names
df = df.drop_duplicates() # removes duplicate rows
df = df.fillna(0) # fills missing values with 0
print(df)

Which will yield the following output:

   feature_1  feature_2
0          1          2
1          2          4
2          3          6
3          4          8
4          5         10

Standardize data: Use Python libraries such as Pandas and scikit-learn to standardize data. The following code snippet shows how to standardize data using scikit-learn:

Given the input file:

A,B,C
101,211,0.98
312,456,0.97
105,776,0.89
0.07,809,0.98
9,1034,0.99

Running the following:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('data.csv')
scaler = StandardScaler()

# standardize the data
standardized_data = scaler.fit_transform(data)
print(standardized_data)

Will yield the following output:

[[-0.03929484 -1.54190454  0.49245809]
 [ 1.83909466 -0.69527385  0.21887026]
 [-0.00368556  0.41052949 -1.96983236]
 [-0.93780599  0.52456546  0.49245809]
 [-0.85830827  1.30208344  0.76604592]]

Remove outliers: Use Python libraries such as Pandas and NumPy to identify and remove outliers from the data. The following code snippet shows how to remove outliers using Pandas:

Given the input file:

column_name
0.123
0.129
0.110
0.19
0.134
0.09

The following operation…

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

# remove outliers
df = df[(np.abs(df['column_name'] - df['column_name'].mean()) / df['column_name'].std()) < 3]

… will remove the two datapoints deemed outliers, and will yield the following output:

   column_name
0        0.123
1        0.129
2        0.110
4        0.134

Handle inconsistent data: Use Python libraries such as Pandas to handle inconsistent data, such as different data types or formats. The following code snippet shows how to handle inconsistent data using Pandas:

import pandas as pd

df = pd.read_csv('data.csv')

# convert data type of column to float
df['column_name'] = df['column_name'].astype(float)

Data cleaning definition:

Data cleaning example using Python:

Align

Cluster

Curate

Denoise

Denormalize

Derive

Discretize

ETL

Encode

Filter

Fragment

Homogenize

Impute

Linearize

Munge

Normalize

Reduce

Reshape

Serialize

Shred

Skew

Split

Standardize

Tokenize

Transform

Wrangle