Data Cleaning | Dagster Glossary


Data Cleaning

Remove invalid or inconsistent data values, such as empty fields or outliers.

Data cleaning definition:

Data cleansing, also known as data cleaning, is the process of identifying and correcting or removing inaccuracies, inconsistencies, and errors in data to improve its quality and reliability. In data engineering, data cleansing is an essential step in the data pipeline to ensure that downstream processes are not negatively affected by bad data. Poorly executed cleansing, however, can itself introduce errors and degrade your data.

Data cleaning example using Python:

Here are some best practices and practical examples in Python for data cleansing in data engineering:

Identify and handle missing data: Use Python libraries such as Pandas and NumPy to detect and handle missing data. The following code snippet shows how to detect, drop, fill, and interpolate missing data using Pandas:

Please note that you need to have the necessary Python libraries installed in your Python environment to run this code.

import pandas as pd
import numpy as np

# Create a sample dataset with missing data
df = pd.DataFrame({
   'A': [1, 2, np.nan, 4],
   'B': [5, np.nan, 7, np.nan],
   'C': [8, 9, 10, 11]
})

# Print the original dataset
print("Original dataset:\n", df)

# Detect missing values
print("\nMissing values:\n", df.isnull())

# Remove rows with missing data
df.dropna(inplace=True)

# Print the modified dataset
print("\nModified dataset:\n", df)

# Re-create the sample dataset, then fill missing values with a specific value
df = pd.DataFrame({
   'A': [1, 2, np.nan, 4],
   'B': [5, np.nan, 7, np.nan],
   'C': [8, 9, 10, 11]
})
df.fillna(value=0, inplace=True)

# Print the modified dataset
print("\nModified dataset after filling missing values:\n", df)

# Re-create the sample dataset, then interpolate missing values
df = pd.DataFrame({
   'A': [1, 2, np.nan, 4],
   'B': [5, np.nan, 7, np.nan],
   'C': [8, 9, 10, 11]
})

df.interpolate(method='linear', inplace=True)

# Print the modified dataset
print("\nModified dataset after interpolating missing values:\n", df)

In the above example, we create a sample dataset with missing data and use Pandas and NumPy to detect and handle missing data. We first print the original dataset and then detect missing values using the isnull() method. Next, we remove rows with missing data using the dropna() method and print the modified dataset. We then re-create the sample data and fill its missing values with a specific value using the fillna() method (re-creating is necessary because dropna already discarded the incomplete rows). Finally, we re-create the data once more and interpolate the missing values using the interpolate() method.

It will output the following:

Original dataset:
      A    B   C
0  1.0  5.0   8
1  2.0  NaN   9
2  NaN  7.0  10
3  4.0  NaN  11

Missing values:
        A      B      C
0  False  False  False
1  False   True  False
2   True  False  False
3  False   True  False

Modified dataset:
      A    B  C
0  1.0  5.0  8

Modified dataset after filling missing values:
      A    B   C
0  1.0  5.0   8
1  2.0  0.0   9
2  0.0  7.0  10
3  4.0  0.0  11

Modified dataset after interpolating missing values:
      A    B   C
0  1.0  5.0   8
1  2.0  6.0   9
2  3.0  7.0  10
3  4.0  7.0  11
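Before choosing between dropping, filling, and interpolating, it often helps to summarize how much data is missing in each column. A minimal sketch using the same sample data:

```python
import pandas as pd
import numpy as np

# Same sample dataset as above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, np.nan],
    'C': [8, 9, 10, 11]
})

# Count of missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)

# Share of missing values per column, useful when deciding
# whether a column is worth keeping at all
missing_ratio = df.isnull().mean()
print(missing_ratio)
```

Here column B is 50% missing, which might argue for interpolation (or dropping the column) rather than deleting half the rows.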

Remove duplicates: Use Python libraries such as Pandas to remove duplicate records from the data. The following code snippet shows how to remove duplicates using Pandas:

Assuming your starting file data.csv is as follows (note the duplicate records):

Patient ID,Age,Name,ZipPostalCode,Address,Social Security Number,Income
12,46,James Bond,BD74AD,27 Ardmore Terrace,CN142371,1000
13,32,William Tell,21209,23 Archery Road,321-65-6278,1000
14,61,Cheryl Crow,88905,1 Sunset Boulevard,788-52-1876,1000
13,32,William Tell,21209,23 Archery Road,321-65-6278,1000
15,66,Marie Curie,78690,10 Rue Maurice Ravel,FR2939793,10000
15,66,Marie Curie,78690,10 Rue Maurice Ravel,FR2939793,10000
12,46,James Bond,BD74AD,27 Ardmore Terrace,CN142371,1000

You can remove the duplicates as follows:

import pandas as pd

df = pd.read_csv('data.csv')
# remove duplicates
df.drop_duplicates(inplace=True)
print(df)

Your output will be:

   Patient ID  Age          Name ZipPostalCode               Address Social Security Number  Income
0          12   46    James Bond        BD74AD    27 Ardmore Terrace               CN142371    1000
1          13   32  William Tell         21209       23 Archery Road            321-65-6278    1000
2          14   61   Cheryl Crow         88905    1 Sunset Boulevard            788-52-1876    1000
4          15   66   Marie Curie         78690  10 Rue Maurice Ravel              FR2939793   10000
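By default, drop_duplicates compares entire rows; its subset and keep parameters let you deduplicate on a key column instead. A minimal sketch using a few of the fields above, built inline so the example is self-contained:

```python
import pandas as pd

# A few fields from the patient records above, including the duplicates
df = pd.DataFrame({
    'Patient ID': [12, 13, 14, 13, 15],
    'Name': ['James Bond', 'William Tell', 'Cheryl Crow',
             'William Tell', 'Marie Curie'],
    'Income': [1000, 1000, 1000, 1000, 10000],
})

# Treat rows as duplicates when the key column matches,
# keeping the first occurrence of each Patient ID
deduped = df.drop_duplicates(subset=['Patient ID'], keep='first')
print(deduped)
```

Deduplicating on a key is useful when duplicate records differ slightly in non-key fields (for example, formatting differences in the address).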

PyJanitor is another library that provides a set of functions for cleaning and tidying data. It includes functions for identifying and removing duplicates, converting data types, and handling missing values.

import pandas as pd
import janitor

df = pd.read_csv('data.csv')
df = df.clean_names() # cleans column names
df = df.drop_duplicates() # removes duplicate rows
df = df.fillna(0) # fills missing values with 0
print(df)

Which, applied to the data.csv above, yields the deduplicated records with cleaned (lowercased, underscore-separated) column names:

   patient_id  age          name zippostalcode               address social_security_number  income
0          12   46    James Bond        BD74AD    27 Ardmore Terrace               CN142371    1000
1          13   32  William Tell         21209       23 Archery Road            321-65-6278    1000
2          14   61   Cheryl Crow         88905    1 Sunset Boulevard            788-52-1876    1000
4          15   66   Marie Curie         78690  10 Rue Maurice Ravel              FR2939793   10000

Standardize data: Use Python libraries such as Pandas and scikit-learn to standardize data. The following code snippet shows how to standardize data using scikit-learn:

Given the input file:

A,B,C
101,211,0.98
312,456,0.97
105,776,0.89
0.07,809,0.98
9,1034,0.99

Running the following:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('data.csv')
scaler = StandardScaler()

# standardize the data
standardized_data = scaler.fit_transform(data)
print(standardized_data)

Will yield the following output:

[[-0.03929484 -1.54190454  0.49245809]
 [ 1.83909466 -0.69527385  0.21887026]
 [-0.00368556  0.41052949 -1.96983236]
 [-0.93780599  0.52456546  0.49245809]
 [-0.85830827  1.30208344  0.76604592]]
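Note that fit_transform returns a plain NumPy array and drops the column labels. A common follow-up step is to wrap the result back into a DataFrame; a minimal sketch using the same input values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same values as the data.csv above, built inline to be self-contained
data = pd.DataFrame({
    'A': [101, 312, 105, 0.07, 9],
    'B': [211, 456, 776, 809, 1034],
    'C': [0.98, 0.97, 0.89, 0.98, 0.99],
})

scaler = StandardScaler()

# Restore the original column names and index on the scaled array
standardized = pd.DataFrame(scaler.fit_transform(data),
                            columns=data.columns, index=data.index)

# Each column now has (approximately) zero mean and unit variance
print(standardized.mean().round(6))
print(standardized.std(ddof=0).round(6))
```

Keeping the labels makes the standardized data easier to join back to other tables or inspect by column name.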

Remove outliers: Use Python libraries such as Pandas and NumPy to identify and remove outliers from the data. The following code snippet shows how to remove outliers using Pandas:

Given the input file:

column_name
0.123
0.129
0.110
0.19
0.134
0.09

The conventional rule flags values more than 3 standard deviations from the mean. With only six data points, however, no sample z-score can exceed about 2 (the maximum is (n-1)/√n ≈ 2.04), so a 3σ filter would keep every row; for this tiny illustration we use a cutoff of 1 instead:

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

# remove outliers: keep rows whose z-score is below the cutoff
# (3 is the usual threshold; 1 is used here so this small sample has outliers to remove)
df = df[(np.abs(df['column_name'] - df['column_name'].mean()) / df['column_name'].std()) < 1]

This removes the two datapoints deemed outliers (0.19 and 0.09) and yields the following output:

   column_name
0        0.123
1        0.129
2        0.110
4        0.134
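The z-score rule assumes roughly bell-shaped data. A common, more robust alternative is the interquartile-range (IQR) rule, which flags values beyond 1.5 × IQR outside the quartiles. A minimal sketch on the same six values:

```python
import pandas as pd

# Same six values as the data.csv above
s = pd.Series([0.123, 0.129, 0.110, 0.19, 0.134, 0.09])

# Quartiles and interquartile range
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Standard 1.5 * IQR fences
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the fences; here 0.19 falls
# outside the upper fence and is removed
filtered = s[(s >= lower) & (s <= upper)]
print(filtered)
```

Because quartiles are insensitive to extreme values, the IQR rule works better than the z-score on small or skewed samples.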

Handle inconsistent data: Use Python libraries such as Pandas to handle inconsistent data, such as different data types or formats. The following code snippet shows how to handle inconsistent data using Pandas:

import pandas as pd

df = pd.read_csv('data.csv')

# convert data type of column to float
df['column_name'] = df['column_name'].astype(float)
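One caveat: astype(float) raises an error if any value cannot be parsed. When a column may contain stray strings, pd.to_numeric with errors='coerce' converts unparseable entries to NaN so they can be handled like any other missing data. A minimal self-contained sketch (the 'N/A' entry is a hypothetical bad value):

```python
import pandas as pd

# A column of numbers stored as strings, with one unparseable entry
df = pd.DataFrame({'column_name': ['1.5', '2.0', 'N/A', '3.25']})

# errors='coerce' turns unparseable values into NaN instead of raising
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')

print(df['column_name'])
```

The resulting NaN can then be dropped, filled, or interpolated with the techniques shown earlier.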

Other data engineering terms related to Data Transformation:

- Align: Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.
- Cluster: Group data points based on similarities or patterns to facilitate analysis and modeling.
- Curate: Select, organize, and annotate data to make it more useful for analysis and modeling.
- Denoise: Remove noise or artifacts from data to improve its accuracy and quality.
- Denormalize: Optimize data for faster read access by reducing the number of joins needed to retrieve related data.
- Derive: Extract, transform, and generate new data from existing datasets.
- Discretize: Transform continuous data into discrete categories or bins to simplify analysis.
- ETL: Extract, transform, and load data between different systems.
- Encode: Convert categorical variables into numerical representations for ML algorithms.
- Filter: Extract a subset of data based on specific criteria or conditions.
- Fragment: Break data down into smaller chunks for storage and management purposes.
- Homogenize: Make data uniform, consistent, and comparable.
- Impute: Fill in missing data values with estimated or imputed values to facilitate analysis.
- Linearize: Transform the relationship between variables to make datasets approximately linear.
- Munge: See 'wrangle'.
- Normalize: Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.
- Reduce: Convert a large set of data into a smaller, more manageable form without significant loss of information.
- Reshape: Change the structure of data to better fit specific analysis or modeling requirements.
- Serialize: Convert data into a linear format for efficient storage and processing.
- Shred: Break down large datasets into smaller, more manageable pieces for easier processing and analysis.
- Skew: An imbalance in the distribution or representation of data.
- Split: Divide a dataset into training, validation, and testing sets for machine learning model training.
- Standardize: Transform data to a common unit or format to facilitate comparison and analysis.
- Tokenize: Convert data into tokens or smaller units to simplify analysis or processing.
- Transform: Convert data from one format or structure to another.
- Wrangle: Convert unstructured data into a structured format.