
Dagster Data Engineering Glossary:


Data Clustering

Group data points based on similarities or patterns to facilitate analysis and modeling.

Data clustering definition:

Data clustering is an unsupervised learning technique that groups data points into clusters based on their similarity. It is useful in many data engineering applications, such as customer segmentation, anomaly detection, and image segmentation.

Data clustering examples in Python:

In Python, several libraries can be used for data clustering, most notably scikit-learn, which provides implementations of algorithms such as K-Means, agglomerative (hierarchical) clustering, and DBSCAN, and SciPy, which provides hierarchical clustering routines.

Here are some practical examples of using these libraries for data clustering:

K-Means clustering:

K-Means is a popular algorithm used for clustering data. We can apply K-Means clustering to a synthetic dataset with 300 samples, 4 clusters, and a fixed random state for reproducibility, as follows:

Please note that you need the necessary Python libraries (NumPy, Matplotlib, and scikit-learn; the later examples also use SciPy and pandas) installed in your Python environment to run this code.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Create synthetic dataset
n_samples = 300
random_state = 42
X, y = make_blobs(n_samples=n_samples, centers=4, random_state=random_state, cluster_std=1.2)

# Perform K-Means clustering
k = 4  # Number of clusters
kmeans = KMeans(n_clusters=k, random_state=random_state, n_init=1)  # single initialization; raise n_init for more robust centroids
kmeans.fit(X)

# Visualize the results
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='x', label='Cluster Centers')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering with k = {}'.format(k))
plt.legend()
plt.show()

Running this produces a scatter plot of the four clusters, with the cluster centers marked as red crosses.

This is a fairly simple example of K-Means clustering. Scroll down for a more sophisticated example.

Hierarchical clustering:

Hierarchical clustering is a technique that creates a tree-like structure of clusters.

Here is an example of hierarchical clustering on a synthetic dataset. You can modify the dataset, linkage method, or clustering parameters to see how hierarchical clustering behaves on different data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Create synthetic dataset
n_samples = 300
random_state = 42
X, y = make_blobs(n_samples=n_samples, centers=4, random_state=random_state, cluster_std=1.2)

# Standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform hierarchical clustering using complete linkage
linkage_matrix = linkage(X_scaled, method='complete')

# Plot dendrogram
plt.figure(figsize=(10, 5))
dendrogram(linkage_matrix, truncate_mode='level', p=3)
plt.title('Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# Choose the number of clusters based on the dendrogram
n_clusters = 4

# Perform Agglomerative Clustering with the chosen number of clusters,
# using the same complete linkage as the dendrogram above
agg_clustering = AgglomerativeClustering(n_clusters=n_clusters, linkage='complete')
agg_clustering.fit(X_scaled)

# Visualize the results
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=agg_clustering.labels_, cmap='viridis', alpha=0.7)
plt.xlabel('Feature 1 (Standardized)')
plt.ylabel('Feature 2 (Standardized)')
plt.title('Hierarchical Clustering with {} Clusters'.format(n_clusters))
plt.show()

This script performs the following steps:

  1. Import necessary libraries: NumPy, Matplotlib, scikit-learn, and SciPy.
  2. Create a synthetic dataset with 300 samples, 4 clusters, and a specified random state for reproducibility.
  3. Standardize the dataset using StandardScaler.
  4. Perform hierarchical clustering using the complete-linkage method.
  5. Plot the dendrogram to visualize the hierarchical clustering structure.
  6. Choose the number of clusters based on the dendrogram (see the fcluster sketch after this list for a threshold-based alternative).
  7. Perform Agglomerative Clustering with the chosen number of clusters.
  8. Visualize the results, including the cluster assignments.
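
If you would rather cut the dendrogram at a distance threshold instead of fixing the number of clusters up front, SciPy's fcluster can derive flat cluster labels directly from the linkage matrix. Here is a minimal sketch that reuses the linkage_matrix computed above; the threshold of 5.0 is an arbitrary value for illustration, not a recommended setting:

from scipy.cluster.hierarchy import fcluster

# Cut the tree at a distance threshold: merges above this
# distance are kept as separate clusters
flat_labels = fcluster(linkage_matrix, t=5.0, criterion='distance')
print('Number of clusters at this threshold:', len(set(flat_labels)))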

Both k-means and hierarchical clustering are good techniques to explore if you are new to data clustering.

DBSCAN clustering:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is another popular clustering algorithm; because it labels low-density points as noise, it is also useful for detecting outliers. Here's an example of using DBSCAN to cluster customer data based on spending habits (the file and column names are placeholders for your own data):

from sklearn.cluster import DBSCAN
import pandas as pd

# load customer spending data
data = pd.read_csv('input.csv')

# extract relevant features
features = ['col_A', 'col_B', 'col_C', 'col_D']
X = data[features]

# perform DBSCAN clustering with eps=3 and min_samples=2
# (tune these to your data, and consider standardizing the
# features first, since DBSCAN is distance-based)
dbscan = DBSCAN(eps=3, min_samples=2).fit(X)

# add cluster labels to the original data
data['cluster'] = dbscan.labels_

# save results to a new CSV file
data.to_csv('output.csv', index=False)
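
DBSCAN marks points it considers noise with the label -1, so it is worth checking how many clusters and outliers a given choice of eps and min_samples produces. A quick follow-up sketch, assuming the dbscan object fitted above:

import numpy as np

# Points labeled -1 are noise (potential outliers)
n_noise = int(np.sum(dbscan.labels_ == -1))
n_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
print(f"Found {n_clusters} clusters and {n_noise} noise points")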

These are just a few examples of how data clustering can be used in Python-based data pipelines. Other clustering techniques include:

  1. Hierarchical Clustering
  2. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  3. Expectation-Maximization (EM) Clustering
  4. Fuzzy C-Means Clustering
  5. Affinity Propagation Clustering
  6. Spectral Clustering
  7. Subspace Clustering
  8. Grid-Based Clustering

…and many more, including model-based clustering techniques.

The choice of algorithm and parameters will depend on the specific problem being solved and the characteristics of the data.
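
As an illustration of one technique from the list above, here is a minimal sketch of Expectation-Maximization (EM) clustering using scikit-learn's GaussianMixture on the same kind of synthetic dataset as earlier. Unlike K-Means, it produces soft assignments, i.e., a probability of membership in each component:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic dataset, as in the earlier examples
X, _ = make_blobs(n_samples=300, centers=4, random_state=42, cluster_std=1.2)

# Fit a Gaussian mixture model via the EM algorithm
gmm = GaussianMixture(n_components=4, random_state=42)
labels = gmm.fit_predict(X)

# Soft assignments: probability of each point belonging to each component
probs = gmm.predict_proba(X)
print(probs[:5].round(3))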


A more sophisticated example of K-Means clustering in Python:

Here's a more elaborate example of K-Means clustering in Python using the popular machine learning library, scikit-learn. In this example, we'll apply K-Means clustering to the Iris dataset and evaluate the quality of the clustering using various metrics.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine the optimal number of clusters using the elbow method and silhouette scores
inertia = []
silhouette_scores = []
k_values = list(range(2, 11))

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))

# Plot the elbow method and silhouette scores
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(k_values, inertia, 'o-')
ax1.set_title('Elbow Method')
ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('Inertia')

ax2.plot(k_values, silhouette_scores, 'o-')
ax2.set_title('Silhouette Scores')
ax2.set_xlabel('Number of Clusters (k)')
ax2.set_ylabel('Silhouette Score')

plt.show()

# Choose the optimal number of clusters based on the elbow method and silhouette scores
optimal_k = 3

# Perform K-Means clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
kmeans.fit(X_scaled)

# Visualize the results
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans.labels_, cmap='viridis', alpha=0.7)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='x', label='Cluster Centers')
plt.xlabel('Sepal Length (Standardized)')
plt.ylabel('Sepal Width (Standardized)')
plt.title('K-Means Clustering with k = {}'.format(optimal_k))
plt.legend()
plt.show()

# Evaluate the clustering
silhouette = silhouette_score(X_scaled, kmeans.labels_)
ari = adjusted_rand_score(y, kmeans.labels_)
print("Silhouette Score: {:.3f}".format(silhouette))
print("Adjusted Rand Index: {:.3f}".format(ari))

This script performs the following steps:

  1. Import necessary libraries: NumPy, pandas, Matplotlib, and scikit-learn.
  2. Load the Iris dataset and extract features and target labels.
  3. Standardize the dataset using StandardScaler.
  4. Determine the optimal number of clusters (k) using the elbow method and silhouette scores.
  5. Plot the elbow method and silhouette scores to visualize the optimal number of clusters.
  6. Perform K-Means clustering with the optimal number of clusters.
  7. Visualize the results, including the cluster assignments and cluster centers.
  8. Evaluate the quality of the clustering using the silhouette score and adjusted Rand index (see the cross-tabulation sketch after this list).
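
Because the Iris dataset ships with true species labels, you can also cross-tabulate them against the cluster assignments to see how well the clusters line up with the species (cluster IDs are arbitrary, so don't expect a neat diagonal). A quick sketch using pandas, which the script above already imports:

import pandas as pd

# Compare cluster assignments to the true species labels
print(pd.crosstab(y, kmeans.labels_, rownames=['true species'], colnames=['cluster']))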

In this example, we use the elbow method and silhouette scores to determine the optimal number of clusters for K-Means clustering. We then perform the clustering with the optimal number of clusters and visualize the results. Finally, we evaluate the quality of the clustering using silhouette scores and adjusted Rand index.

By running the script, you can inspect the elbow-method and silhouette-score plots that help determine the optimal number of clusters.

In this case, k=3 seems to be a good choice. The visualization of the clustering result shows the different clusters and their centers.

The Silhouette Score and Adjusted Rand Index are used to measure the quality of the clustering.

Silhouette Score: 0.460
Adjusted Rand Index: 0.620

The Silhouette Score ranges from -1 to 1; higher values indicate better-separated, more cohesive clusters. The Adjusted Rand Index measures the similarity between the true labels and the predicted cluster assignments, with values near 0 for random labelings and 1 for perfect agreement. Here, both scores indicate a reasonable, though not perfect, clustering.

You can modify the dataset or clustering parameters to experiment with different datasets and see how K-Means clustering works on various data.
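
One caveat: the scatter plot above uses only the first two standardized features (sepal length and width), while the clustering ran on all four. For a 2D view that reflects all four features, one option is to project the data with PCA first. A minimal sketch, assuming the X_scaled and kmeans objects from the script above:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the 4-dimensional standardized data onto 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
centers_pca = pca.transform(kmeans.cluster_centers_)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans.labels_, cmap='viridis', alpha=0.7)
plt.scatter(centers_pca[:, 0], centers_pca[:, 1], s=200, c='red', marker='x', label='Cluster Centers')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-Means Clusters in PCA Space')
plt.legend()
plt.show()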


Other data engineering terms related to Data Transformation:

Align

Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.

Clean or Cleanse

Remove invalid or inconsistent data values, such as empty fields or outliers.

Curate

Select, organize, and annotate data to make it more useful for analysis and modeling.

Denoise

Remove noise or artifacts from data to improve its accuracy and quality.

Denormalize

Optimize data for faster read access by reducing the number of joins needed to retrieve related data.

Derive

Extract, transform, and generate new data from existing datasets.

Discretize

Transform continuous data into discrete categories or bins to simplify analysis.

ETL

Extract, transform, and load data between different systems.

Encode

Convert categorical variables into numerical representations for ML algorithms.

Filter

Extract a subset of data based on specific criteria or conditions.

Fragment

Break data down into smaller chunks for storage and management purposes.

Homogenize

Make data uniform, consistent, and comparable.

Impute

Fill in missing data values with estimated or imputed values to facilitate analysis.

Linearize

Transform the relationship between variables to make datasets approximately linear.

Munge

See 'wrangle'.

Normalize

Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.

Reduce

Convert a large set of data into a smaller, more manageable form without significant loss of information.

Reshape

Change the structure of data to better fit specific analysis or modeling requirements.

Serialize

Convert data into a linear format for efficient storage and processing.

Shred

Break down large datasets into smaller, more manageable pieces for easier processing and analysis.

Skew

An imbalance in the distribution or representation of data.

Split

Divide a dataset into training, validation, and testing sets for machine learning model training.

Standardize

Transform data to a common unit or format to facilitate comparison and analysis.

Tokenize

Convert data into tokens or smaller units to simplify analysis or processing.

Transform

Convert data from one format or structure to another.

Wrangle

Convert unstructured data into a structured format.