
Clean or Cleanse

Remove invalid or inconsistent data values, such as empty fields or outliers.

Data cleaning definition:

Data cleansing, also known as data cleaning, is the process of identifying and correcting or removing inaccuracies, inconsistencies, and errors in data to improve its quality and reliability. In data engineering, data cleansing is an essential step in the data pipeline to ensure that downstream processes are not negatively affected by bad data. Poor cleansing practices, however, can degrade your data instead of improving it.

Data cleaning example using Python:

Here are some best practices and practical examples in Python for data cleansing in data engineering:

Identify and handle missing data: Use Python libraries such as Pandas and NumPy to detect and handle missing data. The following code snippet shows how to detect, remove, fill, and interpolate missing data using Pandas:

Please note that you need to have the necessary libraries installed in your Python environment to run these examples.
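For example, assuming a standard Python environment, all of the libraries used in this article can be installed with pip:

pip install pandas numpy scikit-learn pyjanitor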

import pandas as pd
import numpy as np

# Create a sample dataset with missing data
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, np.nan],
    'C': [8, 9, 10, 11]
})

# Print the original dataset
print("Original dataset:\n", df)

# Detect missing values
print("\nMissing values:\n", df.isnull())

# Remove rows with missing data
df.dropna(inplace=True)

# Print the modified dataset
print("\nModified dataset:\n", df)

# Recreate the dataset, then fill missing values with a specific value
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, np.nan],
    'C': [8, 9, 10, 11]
})
df.fillna(value=0, inplace=True)

# Print the modified dataset
print("\nModified dataset after filling missing values:\n", df)

# Recreate the dataset, then interpolate missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, np.nan],
    'C': [8, 9, 10, 11]
})
df.interpolate(method='linear', inplace=True)

# Print the modified dataset
print("\nModified dataset after interpolating missing values:\n", df)

In the above example, we create a sample dataset with missing data and use Pandas and NumPy to detect and handle it. We first print the original dataset and then detect missing values using the isnull() method. Next, we remove rows with missing data using the dropna() method and print the result. Because dropna() has already removed every row containing a missing value, we recreate the dataset before filling missing values with a specific value (here 0) using the fillna() method. Finally, we recreate the dataset once more and interpolate missing values using the interpolate() method, which estimates each missing value from its neighbors.

It will output the following:

Original dataset:
      A    B   C
0  1.0  5.0   8
1  2.0  NaN   9
2  NaN  7.0  10
3  4.0  NaN  11

Missing values:
        A      B      C
0  False  False  False
1  False   True  False
2   True  False  False
3  False   True  False

Modified dataset:
      A    B  C
0  1.0  5.0  8

Modified dataset after filling missing values:
      A    B   C
0  1.0  5.0   8
1  2.0  0.0   9
2  0.0  7.0  10
3  4.0  0.0  11

Modified dataset after interpolating missing values:
      A    B   C
0  1.0  5.0   8
1  2.0  6.0   9
2  3.0  7.0  10
3  4.0  7.0  11
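On larger datasets, printing the full Boolean mask from isnull() is impractical. A per-column count of missing values is usually more useful; a minimal sketch using the same sample data:

import pandas as pd
import numpy as np

# Same sample dataset as above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, np.nan],
    'C': [8, 9, 10, 11]
})

# Count missing values per column
print(df.isnull().sum())
# A    1
# B    2
# C    0
# dtype: int64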

Remove duplicates: Use Python libraries such as Pandas to remove duplicate records from the data. The following code snippet shows how to remove duplicates using Pandas:

Assuming your starting file data.csv is as follows (note the duplicate records):

Patient ID,Age,Name,ZipPostalCode,Address,Social Security Number,Income
12,46,James Bond,BD74AD,27 Ardmore Terrace,CN142371,1000
13,32,William Tell,21209,23 Archery Road,321-65-6278,1000
14,61,Cheryl Crow,88905,1 Sunset Boulevard,788-52-1876,1000
13,32,William Tell,21209,23 Archery Road,321-65-6278,1000
15,66,Marie Curie,78690,10 Rue Maurice Ravel,FR2939793,10000
15,66,Marie Curie,78690,10 Rue Maurice Ravel,FR2939793,10000
12,46,James Bond,BD74AD,27 Ardmore Terrace,CN142371,1000

You can remove the duplicates as follows:

import pandas as pd

df = pd.read_csv('data.csv')
# remove duplicates
df.drop_duplicates(inplace=True)
print(df)

Your output will be:

   Patient ID  Age          Name ZipPostalCode               Address Social Security Number  Income
0          12   46    James Bond        BD74AD    27 Ardmore Terrace               CN142371    1000
1          13   32  William Tell         21209       23 Archery Road            321-65-6278    1000
2          14   61   Cheryl Crow         88905    1 Sunset Boulevard            788-52-1876    1000
4          15   66   Marie Curie         78690  10 Rue Maurice Ravel              FR2939793   10000
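Note that drop_duplicates() only removes rows that are identical in every column. If duplicate records can disagree in some fields, you can deduplicate on a subset of columns instead; a sketch, assuming Patient ID is the unique key:

import pandas as pd

df = pd.read_csv('data.csv')

# Keep the first occurrence of each Patient ID, even if the
# duplicate rows differ in other columns
df = df.drop_duplicates(subset=['Patient ID'], keep='first')

# Renumber the index after dropping rows
df = df.reset_index(drop=True)
print(df)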

PyJanitor is another library that provides a set of functions for cleaning and tidying data. It includes functions for identifying and removing duplicates, converting data types, and handling missing values.

import pandas as pd
import janitor

df = pd.read_csv('data.csv')
df = df.clean_names() # cleans column names
df = df.drop_duplicates() # removes duplicate rows
df = df.fillna(0) # fills missing values with 0
print(df)

Which, given the data.csv file above, will yield the following output (note the cleaned, lowercase column names; fillna(0) changes nothing here because the file has no missing values):

   patient_id  age          name zippostalcode               address social_security_number  income
0          12   46    James Bond        BD74AD    27 Ardmore Terrace               CN142371    1000
1          13   32  William Tell         21209       23 Archery Road            321-65-6278    1000
2          14   61   Cheryl Crow         88905    1 Sunset Boulevard            788-52-1876    1000
4          15   66   Marie Curie         78690  10 Rue Maurice Ravel              FR2939793   10000
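Because PyJanitor registers its functions as DataFrame methods, the same pipeline is often written as a single method chain, which reads in the order the steps run:

import pandas as pd
import janitor

df = (
    pd.read_csv('data.csv')
    .clean_names()       # lowercase, underscore-separated column names
    .drop_duplicates()   # remove duplicate rows
    .fillna(0)           # fill missing values with 0
)
print(df)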

Standardize data: Use Python libraries such as Pandas and scikit-learn to standardize data. The following code snippet shows how to standardize data using scikit-learn:

Given the input file:

A,B,C
101,211,0.98
312,456,0.97
105,776,0.89
0.07,809,0.98
9,1034,0.99

Running the following:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('data.csv')
scaler = StandardScaler()

# standardize the data
standardized_data = scaler.fit_transform(data)
print(standardized_data)

Will yield the following output:

[[-0.03929484 -1.54190454  0.49245809]
 [ 1.83909466 -0.69527385  0.21887026]
 [-0.00368556  0.41052949 -1.96983236]
 [-0.93780599  0.52456546  0.49245809]
 [-0.85830827  1.30208344  0.76604592]]
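Under the hood, StandardScaler rescales each column to zero mean and unit variance: z = (x - mean) / std, using the population standard deviation. One practical detail: fit_transform returns a plain NumPy array, so if you want to keep working with labeled data you can wrap the result back into a DataFrame:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('data.csv')
scaler = StandardScaler()

# Rebuild a DataFrame so the column labels survive the transformation
standardized = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
print(standardized)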

Remove outliers: Use Python libraries such as Pandas and NumPy to identify and remove outliers from the data. The following code snippet shows how to remove outliers using Pandas:

Given the input file:

column_name
0.123
0.129
0.110
0.19
0.134
0.09

The following operation…

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

# remove outliers more than 1 standard deviation from the mean;
# the conventional cutoff is 3, but with only six data points no
# value can fall 3 standard deviations out, so a tighter cutoff
# is used here for illustration
df = df[(np.abs(df['column_name'] - df['column_name'].mean()) / df['column_name'].std()) < 1]
print(df)

… will remove the two data points deemed outliers (0.19 and 0.09) and yield the following output:

   column_name
0        0.123
1        0.129
2        0.110
4        0.134
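The z-score approach is sensitive to the outliers themselves, since extreme values inflate the mean and standard deviation used to score them, and, as noted above, it breaks down on very small samples. A common alternative is the interquartile range (IQR) rule, which flags points more than 1.5 × IQR outside the quartiles; a minimal sketch on the same file:

import pandas as pd

df = pd.read_csv('data.csv')

# Compute the interquartile range of the column
q1 = df['column_name'].quantile(0.25)
q3 = df['column_name'].quantile(0.75)
iqr = q3 - q1

# Keep only points within 1.5 * IQR of the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[df['column_name'].between(lower, upper)]
print(df)

On this particular sample, the IQR rule flags 0.19 as an outlier but keeps 0.09, illustrating that different methods can disagree at the margins.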

Handle inconsistent data: Use Python libraries such as Pandas to handle inconsistent data, such as different data types or formats. The following code snippet shows how to handle inconsistent data using Pandas:

import pandas as pd

df = pd.read_csv('data.csv')

# convert data type of column to float
df['column_name'] = df['column_name'].astype(float)
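Type casting is only one kind of inconsistency. Mixed date formats and untidy strings (stray whitespace, inconsistent casing) are at least as common; a minimal sketch, assuming hypothetical date_column and name columns in data.csv:

import pandas as pd

df = pd.read_csv('data.csv')

# parse date strings; values that cannot be parsed become NaT
# instead of raising an error
df['date_column'] = pd.to_datetime(df['date_column'], errors='coerce')

# normalize strings: strip surrounding whitespace, standardize casing
df['name'] = df['name'].str.strip().str.title()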

Other data engineering terms related to Data Transformation:

Align

Aligning data can mean one of three things: aligning datasets, meeting business rules or arranging data elements in memory.

Big Data Processing

Process large volumes of data in parallel and distributed computing environments to improve performance.

Cluster

Group data points based on similarities or patterns to facilitate analysis and modeling.

Denoising

Remove noise or artifacts from data to improve its accuracy and quality.

Denormalize

Optimize data for faster read access by reducing the number of joins needed to retrieve related data.

Discretize

Transform continuous data into discrete categories or bins to simplify analysis.

ETL

Extract, transform, and load data between different systems.

Filter

Extract a subset of data based on specific criteria or conditions.

Fragment

Split a dataset into smaller pieces (fragments) that can be stored or processed separately.

Impute

Fill in missing data values with estimated or imputed values to facilitate analysis.

Munge

See 'wrangle'.

Normalize

Standardize data values to facilitate comparison and analysis, and organize data into a consistent format.

Reduce

Convert a large set of data into a smaller, more manageable form without significant loss of information.

Reshape

Change the structure of data to better fit specific analysis or modeling requirements.

Serialize

Convert data into a linear format for efficient storage and processing.

Shred

Break down large datasets into smaller, more manageable pieces for easier processing and analysis.

Skew

An imbalance in the distribution or representation of data.

Standardize

Transform data to a common unit or format to facilitate comparison and analysis.

Tokenize

Convert data into tokens or smaller units to simplify analysis or processing.

Transform

Convert data from one format or structure to another.

Wrangle

Convert unstructured data into a structured format.