Dagster Data Engineering Glossary:
Data Imputation
Imputation definition:
Imputation is the process of replacing missing values in a dataset with estimated values. It is often necessary when working with real-world datasets, where values can be missing for reasons such as data entry errors or incomplete data recording.
Imputation can be done using various techniques such as mean imputation, median imputation, mode imputation, and regression imputation, among others. Mean imputation replaces missing values with the mean of the non-missing values in that column. Similarly, median imputation replaces missing values with the median of the non-missing values in that column. Mode imputation replaces missing values with the mode (i.e., most frequent value) of the non-missing values in that column. Regression imputation involves using a regression model to estimate missing values based on the other variables in the dataset.
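As a minimal sketch of how the three simple strategies differ, the snippet below fills the same gap in a small, made-up pandas Series with the mean, the median, and the mode. The values and variable names are illustrative assumptions, chosen so that a single outlier pulls the mean away from the other two statistics.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing value and one outlier (100)
values = pd.Series([1, 2, 2, 3, 100, np.nan])

# Each statistic is computed over the non-missing values only
mean_filled = values.fillna(values.mean())
median_filled = values.fillna(values.median())
# mode() returns a Series (there can be ties), so take the first entry
mode_filled = values.fillna(values.mode().iloc[0])

# Compare what each strategy put in place of the missing value
print(mean_filled.iloc[-1], median_filled.iloc[-1], mode_filled.iloc[-1])
```

With skewed data like this, the mean-based fill lands far from most observed values, while the median and mode stay close to the typical value. Which behavior is preferable depends on the dataset.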
Data imputation example using Python:
Note that the code examples in this entry require the pandas and NumPy libraries to be installed in your Python environment.
Here's a practical example of mean imputation in Python:
import pandas as pd
import numpy as np
# create a sample dataset with missing values
data = {'A': [1, 2, np.nan, 4, 5],
        'B': [6, np.nan, 8, 9, 10],
        'C': [11, 12, 13, np.nan, 15]}
df = pd.DataFrame(data)
# perform mean imputation on column A
df['A'] = df['A'].fillna(df['A'].mean())
# display the updated dataset
print(df)
The code above would yield the following output:
     A     B     C
0  1.0   6.0  11.0
1  2.0   NaN  12.0
2  3.0   8.0  13.0
3  4.0   9.0   NaN
4  5.0  10.0  15.0
In this example, we first create a sample dataset with missing values in columns A, B, and C. We then perform mean imputation on column A by passing the result of the mean() method, which skips missing values by default, to the fillna() method. In the resulting dataset, the missing value in column A is replaced with 3.0, the mean of the non-missing values in that column. Columns B and C are not imputed and still contain their missing values.
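Regression imputation, mentioned in the definition, could be sketched on the same sample data as follows. This is only an illustration under stated assumptions: regression_impute is a hypothetical helper written for this example, it uses a straight-line fit from NumPy's polyfit in place of a full regression model, and it assumes the predictor column (A, already mean-imputed) has no missing values.

```python
import numpy as np
import pandas as pd

# Same sample dataset as above, with column A already mean-imputed
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'B': [6, np.nan, 8, 9, 10],
                   'C': [11, 12, 13, np.nan, 15]})

def regression_impute(frame, target, predictor):
    """Hypothetical helper: fill gaps in `target` from a linear fit on `predictor`."""
    known = frame[target].notna()
    # Fit target = slope * predictor + intercept on the complete rows only
    slope, intercept = np.polyfit(frame.loc[known, predictor],
                                  frame.loc[known, target], deg=1)
    predicted = slope * frame[predictor] + intercept
    # Keep observed values; use predictions only where target is missing
    return frame[target].fillna(predicted)

df['B'] = regression_impute(df, 'B', 'A')
df['C'] = regression_impute(df, 'C', 'A')
print(df.round(2))
```

Because B and C follow A almost exactly in this toy dataset, the fitted line fills the gaps with values close to 7 and 14. Mean imputation would instead have used 8.25 for B and 12.75 for C, ignoring the relationship between the columns.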