SciPy - K Means Clustering

What is K-Means Clustering?

K-means clustering is a technique for partitioning data into K clusters. SciPy implements it in the scipy.cluster.vq module, which provides the kmeans and kmeans2 functions for clustering and the vq function for assigning data points to the nearest centroid.

The algorithm iteratively reassigns data points to their nearest centroid and then updates each centroid, aiming to minimize the within-cluster variance. The kmeans2 function supports several initialization methods through its minit parameter, including random initialization and K-Means++, which usually improves the initial centroid placement.

K-means clustering is useful for identifying groupings in datasets, although SciPy's implementation is less feature-rich than those in other libraries such as scikit-learn.

Types of K-Means Clustering

K-means clustering is a widely used algorithm for partitioning data into clusters. Various types and variations of K-means clustering are employed to address specific needs or improve performance. Let's see them in detail −

Standard K-Means Clustering

Standard K-Means Clustering is a popular algorithm used for partitioning data into K distinct clusters. It works through an iterative process to assign data points to clusters and update cluster centroids.

How Does K-Means Clustering Work?

The following steps show how standard K-means clustering works −

Initialization

Initialization is a crucial step in K-means clustering because it significantly affects the algorithm's performance and the quality of the final clustering results.

Select the Number of Clusters (K) − Decide how many clusters to form in the dataset. This is a crucial parameter that affects the results. Methods such as the Elbow method, the Silhouette Score, the Gap Statistic, or cross-validation can help choose K, depending on the requirements.
Initialize Centroids − Start by randomly selecting K data points from the dataset to serve as the initial centroids (cluster centers). Alternatively, a more advanced initialization method such as K-Means++ can be used to achieve better clustering results.
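
As a rough illustration of the Elbow method mentioned above, the sketch below runs kmeans for several candidate values of K and prints the distortion that the function returns for each. The synthetic data and the range of candidate values are assumptions made for this illustration only.

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Synthetic data: three well-separated 2-D blobs (illustrative only)
np.random.seed(0)
data = np.vstack([np.random.normal(0, 0.5, (50, 2)),
                  np.random.normal(3, 0.5, (50, 2)),
                  np.random.normal(6, 0.5, (50, 2))])

# Run K-means for each candidate K and record the returned distortion
candidate_ks = list(range(1, 7))
distortions = []
for k in candidate_ks:
    _, distortion = kmeans(data, k)
    distortions.append(distortion)

# The "elbow" is the K after which the distortion stops dropping sharply
for k, d in zip(candidate_ks, distortions):
    print(f"K={k}: distortion={d:.3f}")
```

For data like this, the distortion typically drops steeply up to K = 3 and flattens afterwards, which suggests three clusters. The exact values depend on the random initialization.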

Assignment Step

In the Assignment Step, each data point in the dataset is assigned to the nearest cluster based on the current positions of the centroids. This step is crucial because it determines the composition of each cluster, which directly influences the subsequent update step.

Compute Distances − For each data point, calculate the distance to each of the K centroids. Common distance metrics include Euclidean distance and Manhattan distance; standard K-means, including the functions in scipy.cluster.vq, uses Euclidean distance.
Assign Clusters − Assign each data point to the cluster associated with the nearest centroid. This forms K clusters where each data point belongs to the cluster with the closest centroid.
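
The two parts of the assignment step can be sketched with a distance matrix followed by an argmin. The small arrays below are made up for illustration; the comparison with vq is included only to show that it produces the same assignment for Euclidean distance.

```python
import numpy as np
from scipy.cluster.vq import vq
from scipy.spatial.distance import cdist

# Hypothetical data points and current centroids
data = np.array([[0.1, 0.2], [2.9, 3.1], [6.2, 5.8], [0.4, -0.3]])
centroids = np.array([[0.0, 0.0], [3.0, 3.0], [6.0, 6.0]])

# Compute distances: entry [i, j] is the Euclidean distance
# from data point i to centroid j
distances = cdist(data, centroids, metric="euclidean")

# Assign clusters: index of the nearest centroid for each point
labels = distances.argmin(axis=1)

# vq performs the same assignment and also returns the distance
# from each point to its nearest centroid
vq_labels, vq_dists = vq(data, centroids)

print(labels)
print(vq_labels)
```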

Update Step

The Update Step refines the positions of the centroids based on the current cluster assignments of the data points. After each data point has been assigned to its nearest centroid in the assignment step, every centroid is moved to the mean position of the data points assigned to its cluster. The assignment and update steps repeat until the centroids stabilize.

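Recalculate Centroids − Once all data points are assigned to clusters, recompute the centroid of each cluster. The new centroid is the mean of all the data points within that cluster. Mathematically, the formula is given as follows −

\[ \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \]

where \( C_j \) is the set of data points assigned to cluster \( j \) and \( \mu_j \) is its centroid.
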
Update Centroids − Replace the old centroids with the newly recalculated centroids.
Convergence Check − After updating the centroids the algorithm checks whether the centroids have moved significantly compared to their previous positions.

If the movement or change of centroids is below a certain threshold then the algorithm considers this as convergence and stops the iterations. Otherwise, the algorithm goes back to the Assignment Step to reassign data points to the nearest updated centroid.
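
The assignment, update and convergence check can be put together in a short NumPy loop. This is a minimal sketch for illustration and not SciPy's implementation; the function name kmeans_sketch, the tolerance tol and the limit max_iter are hypothetical choices.

```python
import numpy as np

def kmeans_sketch(data, initial_centroids, tol=1e-6, max_iter=100):
    """Illustrative K-means loop (hypothetical helper, not a SciPy function)."""
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float).copy()

    def assign(cents):
        # Euclidean distance from every point to every centroid
        dists = np.linalg.norm(data[:, None, :] - cents[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(max_iter):
        labels = assign(centroids)            # assignment step
        new_centroids = centroids.copy()
        for j in range(len(centroids)):       # update step
            members = data[labels == j]
            if len(members) > 0:              # keep old centroid if cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Convergence check: largest distance any centroid moved
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tol:
            break

    return centroids, assign(centroids)

# Two obvious groups of three points each
points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)
final_centroids, final_labels = kmeans_sketch(points, points[[0, 3]])
print(final_centroids)
print(final_labels)
```

Here the stopping rule is based on centroid movement, as described above. Other implementations may use a different criterion, such as the change in distortion between iterations.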

Example

The following example shows how to apply standard K-means clustering using SciPy, visualize the results and interpret the output −

import numpy as np
from scipy.cluster.vq import kmeans, vq
import matplotlib.pyplot as plt

# Generate some synthetic data
np.random.seed(0)
data = np.vstack([np.random.normal(0, 0.5, (50, 2)), 
                  np.random.normal(3, 0.5, (50, 2)), 
                  np.random.normal(6, 0.5, (50, 2))])

# Number of clusters
k = 3

# Perform K-means clustering
centroids, distortion = kmeans(data, k)

# Assign each sample to a cluster
labels, _ = vq(data, centroids)

# Plot the results
plt.figure(figsize=(8, 6))
for i in range(k):
    plt.scatter(data[labels == i, 0], data[labels == i, 1], label=f'Cluster {i+1}')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.title('K-Means Clustering using SciPy')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

print(f"Centroids:\n{centroids}")
print(f"Distortion: {distortion}")

Following is the output of the Standard K-Means clustering −

Centroids:
[[0.00840308 0.05140493]
 [5.95346338 5.98730436]
 [2.99063924 3.09137373]]
Distortion: 0.6258523704544776

K-Means++ Clustering

K-means++ is an advanced version of the standard K-means clustering algorithm which is designed to improve the initialization step by choosing the initial centroids more strategically. This approach enhances the performance and accuracy of the K-means algorithm by reducing the likelihood of poor clustering results due to random initialization.

How Does K-Means++ Work?

The following steps show how K-Means++ clustering works −

First Centroid Selection − The algorithm begins by randomly selecting the first centroid from the data points.
Subsequent Centroid Selection − For each data point \( x_i \) that has not yet been selected as a centroid, calculate the distance \( D(x_i) \) between \( x_i \) and the nearest already chosen centroid.

Select the next centroid from the remaining data points with a probability proportional to \( D(x_i)^2 \). This means that data points farther from the existing centroids have a higher probability of being chosen as new centroids. The probability is given as follows −

\[ P(x_i) = \frac{D(x_i)^2}{\sum_{j} D(x_j)^2} \]

Repeat this process until \( k \) centroids have been selected.
Standard K-means − Once the initial centroids are selected using the K-means++ method, the standard K-means clustering algorithm is applied. This involves iteratively assigning data points to the nearest centroid, updating the centroids and repeating until convergence.
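
The seeding procedure described above can be sketched in plain NumPy. The helper kmeans_pp_init is a hypothetical name used for illustration and is not how SciPy implements the '++' option internally; the sketch also assumes the data points are not all identical. The resulting centroids are then passed to kmeans2 with minit='matrix', which treats its second argument as the initial centroids.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def kmeans_pp_init(data, k, rng):
    """Pick k initial centroids with K-means++ seeding (illustrative helper)."""
    n = data.shape[0]
    # First centroid: chosen uniformly at random from the data points
    centroids = [data[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest chosen centroid
        diffs = data[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to D(x)^2;
        # already chosen points have distance 0 and cannot be picked again
        probs = d2 / d2.sum()
        centroids.append(data[rng.choice(n, p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.5, (50, 2)),
                  rng.normal(3, 0.5, (50, 2)),
                  rng.normal(6, 0.5, (50, 2))])

init = kmeans_pp_init(data, 3, rng)

# Run standard K-means starting from the K-means++ seeds
centroids, labels = kmeans2(data, init, minit='matrix')
print(init)
print(centroids)
```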

Example

The kmeans function in scipy.cluster.vq does not offer K-means++ initialization, but the kmeans2 function in the same module does. Setting its minit parameter to '++' initializes the centroids using the K-means++ strategy. Below is the example −

import numpy as np
from scipy.cluster.vq import kmeans2
import matplotlib.pyplot as plt

# Generate some synthetic data
np.random.seed(0)
data = np.vstack([np.random.normal(0, 0.5, (50, 2)), 
                  np.random.normal(3, 0.5, (50, 2)), 
                  np.random.normal(6, 0.5, (50, 2))])

# Number of clusters
k = 3

# Perform K-means++ clustering
centroids, labels = kmeans2(data, k, minit='++')

# Plot the results
plt.figure(figsize=(8, 6))
for i in range(k):
    plt.scatter(data[labels == i, 0], data[labels == i, 1], label=f'Cluster {i+1}')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.title('K-Means++ Clustering using SciPy')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

print(f"Centroids:\n{centroids}")

Here is the output of the K-means++ clustering −

Centroids:
[[5.95346338 5.98730436]
 [0.00840308 0.05140493]
 [2.99063924 3.09137373]]

Vector Quantization

Vector quantization involves partitioning a large set of vectors into a smaller set of clusters. Each vector in the dataset is approximated by the nearest representative vector which is known as a codebook vector or centroid. The set of these codebook vectors is called a codebook.

Vector quantization (VQ) is a technique in signal processing and machine learning used to compress and encode vector data by mapping it to a finite set of representative vectors. It is widely used in data compression, pattern recognition and clustering.

How Vector Quantization Works

The following steps show how vector quantization works −

Training

Initialization: Choose an initial set of codebook vectors. This can be done randomly or using methods like K-means clustering.
Assignment: Assign each data vector to the nearest codebook vector.
Update: Recalculate the codebook vectors as the mean of all vectors assigned to each codebook vector.
Iteration: Repeat the assignment and update steps until convergence, meaning that the codebook vectors no longer change significantly.

Encoding

After training, each data vector is encoded by the index of its nearest codebook vector rather than by the vector itself. This reduces the amount of data needed to represent the original data.

Decoding

To reconstruct the data, replace each index with the corresponding codebook vector. The result is an approximation of the original data, because the detail within each cluster is lost.
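
Encoding and decoding can be sketched with the vq function and NumPy indexing. The codebook and data values below are made up for illustration; in practice the codebook would come from a training step such as kmeans.

```python
import numpy as np
from scipy.cluster.vq import vq

# Hypothetical codebook (as if obtained from training) and data to encode
codebook = np.array([[0.0, 0.0], [3.0, 3.0], [6.0, 6.0]])
data = np.array([[0.2, -0.1], [2.8, 3.3], [5.9, 6.1], [3.1, 2.7]])

# Encoding: each vector is replaced by the index of its nearest codebook vector
codes, dists = vq(data, codebook)

# Decoding: look each index up in the codebook
reconstructed = codebook[codes]

# Average reconstruction error (Euclidean distance per vector)
mean_error = np.linalg.norm(data - reconstructed, axis=1).mean()

print(codes)
print(reconstructed)
print(mean_error)
```

Only the integer codes and the codebook need to be stored, which is where the compression comes from. The reconstruction error shows how much detail the approximation loses.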

Example

Here is an example of vector quantization using K-means clustering from scipy.cluster.vq, which effectively performs vector quantization −

import numpy as np
from scipy.cluster.vq import kmeans, vq
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(0)
data = np.vstack([np.random.normal(0, 0.5, (100, 2)), 
                  np.random.normal(3, 0.5, (100, 2)), 
                  np.random.normal(6, 0.5, (100, 2))])

# Number of codebook vectors (clusters)
k = 3

# Perform K-means clustering to get codebook vectors
centroids, distortion = kmeans(data, k)

# Assign each sample to a cluster
labels, _ = vq(data, centroids)

# Plot the results
plt.figure(figsize=(8, 6))
for i in range(k):
    plt.scatter(data[labels == i, 0], data[labels == i, 1], label=f'Cluster {i+1}')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label='Codebook Vectors')
plt.title('Vector Quantization using K-means')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

print(f"Codebook Vectors:\n{centroids}")
print(f"Distortion: {distortion}")

The output of the Vector quantization using K-Means Clustering is given below −

Codebook Vectors:
[[-4.78840902e-04  7.13893340e-02]
 [ 5.94382124e+00  5.94843116e+00]
 [ 2.92846083e+00  2.94352468e+00]]
Distortion: 0.6249447014860251
