
Dagster Data Engineering Glossary:

Feature Extraction

Identify and extract relevant features from raw data for use in analysis or modeling.

Feature extraction definition:

Feature extraction is the process of identifying and deriving informative features from raw data for use in machine learning or other data analysis tasks. The aim of feature extraction is to reduce the dimensionality of the data while retaining as much relevant information as possible.

In the context of modern data pipelines, feature extraction is an important step in preparing data for machine learning models. It can significantly improve the accuracy of predictive models, shorten training time, and reduce the risk of overfitting.

Feature extraction examples using Python:

Please note that you need to have the necessary libraries installed in your Python environment to run the following code examples.

The following are some techniques for feature extraction in Python:

Bag of Words (BoW): BoW is a simple but effective technique for feature extraction in natural language processing (NLP) tasks. It involves counting the frequency of each word in a corpus of text and representing each document as a vector of word counts that can be used as input for machine learning algorithms. This can be implemented using libraries like scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document.", "This is the second document.", "And this is the third one.", "Is this the first document?"]

# Learn the vocabulary and transform each document into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())

This code would yield the following output:

[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
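The columns of this matrix follow the vectorizer's learned vocabulary, which is sorted alphabetically. To confirm which word each column represents, you can print the vocabulary (this assumes scikit-learn 1.0 or later, where get_feature_names_out is available):

# Show the vocabulary term behind each column of the count matrix
print(vectorizer.get_feature_names_out())

which prints ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this'].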

TF-IDF: Term Frequency-Inverse Document Frequency (TF-IDF) is another technique for feature extraction in NLP tasks. It assigns weights to each word based on its frequency in the document and its rarity in the corpus.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is the first document.", "This is the second document.", "And this is the third one.", "Is this the first document?"]

# Compute L2-normalized TF-IDF weights for each document
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())

This code would yield the following output:

[[0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]
 [0.         0.42796959 0.         0.34989318 0.         0.67049706 0.34989318 0.         0.34989318]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.         0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]]
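By default, scikit-learn uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and then L2-normalizes each row. You can inspect the learned IDF weight for each vocabulary term on the fitted vectorizer via its idf_ attribute:

# Pair each vocabulary term with its learned IDF weight
for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{term}: {idf:.4f}")

Words that appear in every document ('is', 'the', 'this') receive the minimum weight of 1.0, while words confined to a single document ('and', 'one', 'second', 'third') receive the largest.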

Principal Component Analysis (PCA): PCA is a technique for feature extraction that reduces the dimensionality of the data while retaining as much information as possible. It works by projecting the data onto a new coordinate system whose axes, the principal components, are ordered by how much of the data's variance they capture.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the iris dataset: 150 samples with 4 features each
iris = load_iris()
X = iris.data
y = iris.target

# Project the 4-dimensional data onto its first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca)

This code would yield the following output, truncated here for brevity (the full array has one row per sample):

[[-2.68412563  0.31939725]
 [-2.71414169 -0.17700123]
 [-2.88899057 -0.14494943]
 [-2.74534286 -0.31829898]
 [-2.72871654  0.32675451]
 ...
 [ 1.90094161  0.11662796]
 [ 1.39018886 -0.28266094]]
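To quantify how much information the projection retains, the fitted PCA object exposes explained_variance_ratio_, the fraction of total variance captured by each component. For the iris data, the first two components capture roughly 92% and 5% of the variance, so little information is lost:

# Fraction of the dataset's variance explained by each retained component
print(pca.explained_variance_ratio_)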

Scale-Invariant Feature Transform (SIFT) for Image data: In computer vision, feature extraction involves transforming raw image data into numerical representations that capture important characteristics of the image, such as edges, corners, and texture. One popular technique for image feature extraction is the Scale-Invariant Feature Transform (SIFT), which detects and describes local features in an image. This can be implemented using the OpenCV library:

import cv2

# Load the image and convert it to grayscale, the standard input for SIFT
# (cv2.imread already returns an 8-bit image, so no extra normalization is needed)
img = cv2.imread('image.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Create a SIFT detector
sift = cv2.SIFT_create()

# Detect keypoints and compute their 128-dimensional descriptors
kp, des = sift.detectAndCompute(gray, None)

# Print the number of keypoints and the shape of the descriptor matrix
print('Number of keypoints:', len(kp))
print('Descriptor shape:', des.shape)

Based on the input image you provide, your output will look something like this:

Number of keypoints: 1448
Descriptor shape: (1448, 128)
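A quick way to sanity-check the detector is to draw the keypoints back onto the image. This is a minimal sketch that reuses the img and kp variables from above; 'keypoints.jpg' is just an example output path:

# Render keypoints as circles sized by scale, with lines showing orientation
img_kp = cv2.drawKeypoints(img, kp, None, flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite('keypoints.jpg', img_kp)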

Autoregressive Integrated Moving Average (ARIMA) model for time series data: In time series analysis, feature extraction involves transforming raw time series data into numerical representations that capture important patterns and trends in the data, such as seasonality, trend, and volatility. One popular technique for time series feature extraction is the Autoregressive Integrated Moving Average (ARIMA) model, which models the data as a combination of autoregressive, differencing, and moving average terms. This can be implemented using the statsmodels library:

Given the simple timeseries.csv input file:

Date,Value
2023-05-01,10
2023-05-02,15
2023-05-03,20
2023-05-04,18
2023-05-05,22
2023-05-06,25
2023-05-07,30
2023-05-08,35
2023-05-09,40
2023-05-10,45
2023-05-11,50
2023-05-12,55
2023-05-13,60
2023-05-14,65
2023-05-15,70
the following code fits an ARIMA(1, 1, 1) model to the series:

import pandas as pd
import statsmodels.api as sm

# Load the time series, parsing the Date column and using it as the index
data = pd.read_csv('timeseries.csv', parse_dates=['Date'], index_col='Date')

# Create an ARIMA model
model = sm.tsa.ARIMA(data, order=(1, 1, 1))

# Fit the model to the data
results = model.fit()

# Print the model summary
print(results.summary())

Running this code will return the following output:

                               SARIMAX Results
==============================================================================
Dep. Variable:                  Value   No. Observations:                   15
Model:                 ARIMA(1, 1, 1)   Log Likelihood                 -30.790
Date:                Sat, 22 Apr 2023   AIC                             67.580
Time:                        21:57:35   BIC                             69.497
Sample:                    05-01-2023   HQIC                            67.403
                         - 05-15-2023
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.9956      0.047     20.976      0.000       0.903       1.089
ma.L1         -0.7962      0.914     -0.871      0.384      -2.588       0.995
sigma2         3.9599      4.594      0.862      0.389      -5.045      12.965
===================================================================================
Ljung-Box (L1) (Q):      0.14   Jarque-Bera (JB): 45.05
Prob(Q):                 0.70   Prob(JB):         0.00
Heteroskedasticity (H):  0.05   Skew:            -2.75
Prob(H) (two-sided):     0.00   Kurtosis:         9.86
=================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
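For feature extraction purposes, the fitted coefficients (along with diagnostics such as AIC) can themselves serve as a compact numerical description of the series, and the same results object can generate forecasts. A minimal sketch continuing from the fit above; the 5-step horizon is arbitrary:

# Use the fitted ARIMA coefficients as a feature vector for the series
print(results.params)

# Forecast the next 5 periods
print(results.forecast(steps=5))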
