SciPy - Hierarchical Clustering



What is Hierarchical Clustering?

In SciPy, hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters, either by successively merging smaller clusters into larger ones (the agglomerative approach) or by splitting larger clusters into smaller ones (the divisive approach).

This method does not require specifying the number of clusters beforehand. The result is typically visualized using a dendrogram, which is a tree-like diagram showing the arrangement of clusters and the distances between them at each step.

Hierarchical clustering reveals the data's natural structure and relationships, which makes it useful for exploratory data analysis and for identifying patterns or groupings in complex datasets.

Types of Hierarchical Clustering

Hierarchical clustering can be categorized by how it forms clusters. Each type builds clusters with a different method and handles the data differently.

Following are the two primary types of hierarchical clustering −

  • Agglomerative Hierarchical Clustering − This approach is bottom-up. It starts with each data point as its own individual cluster and progressively merges the closest pairs of clusters.
  • Divisive Hierarchical Clustering − This approach is top-down. It starts with all data points in a single cluster and recursively splits it into smaller clusters.

Now let's look at each type of hierarchical clustering in detail.

Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering is a bottom-up approach where each data point starts as its own cluster. It iteratively merges the closest pairs of clusters based on a chosen linkage criterion such as single, complete or average linkage until all points are grouped into a single cluster or a predefined number of clusters is reached.

This method builds a hierarchy of clusters which is often visualized using a dendrogram, illustrating the sequence of merges and the distances at which they occurred. It is widely used in data analysis to uncover the underlying structure and relationships within data.
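To make the role of the linkage criterion concrete, here is a small sketch (an illustration, not part of the original tutorial) that runs linkage() on the same four 1-D points with three different criteria −

from scipy.cluster.hierarchy import linkage
import numpy as np

# Four 1-D points forming two tight pairs: {0, 1} and {5, 6}
X = np.array([[0.0], [1.0], [5.0], [6.0]])

# The third column of the linkage matrix holds the merge distances
for method in ('single', 'complete', 'average'):
    Z = linkage(X, method=method)
    print(method, Z[:, 2])

The two pairs always merge at distance 1, but the final merge happens at 4 for 'single' (distance between the nearest points of the two pairs), 6 for 'complete' (the farthest points) and 5 for 'average' (the mean of all pairwise distances).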

Agglomerative hierarchical clustering in SciPy

SciPy's scipy.cluster.hierarchy module provides comprehensive tools for performing agglomerative hierarchical clustering, a method of cluster analysis that builds a hierarchy of clusters through a series of merge operations.

Below are the functions used to perform agglomerative hierarchical clustering.

Linkage Computation

The linkage() function computes the hierarchical clustering encoded in a linkage matrix. This matrix describes the clustering process and is used for further analysis.

Syntax

Here is the syntax of the SciPy linkage() function −

scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean')

Parameters

Following are the parameters of the linkage() function −

  • y − A condensed distance matrix (the 1-D form returned by pdist) or a 2-D array of observation vectors; see the sketch after this list.
  • method − The linkage method, such as 'single', 'complete', 'average' or 'ward'. The default is 'single'.
  • metric − The distance metric used when y is an observation matrix. The default is 'euclidean'.
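
When you already have pairwise distances, you can pass the condensed form from pdist instead of the raw observations. Here is a minimal sketch (not part of the original example) showing that both inputs produce the same linkage matrix −

from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
import numpy as np

np.random.seed(0)
data = np.random.rand(5, 2)  # 5 points in 2D space

# Condensed (1-D) pairwise distance vector produced by pdist
condensed = pdist(data, metric='euclidean')

# Passing the precomputed distances gives the same result as passing data
Z_from_distances = linkage(condensed, method='average')
Z_from_data = linkage(data, method='average')
print(np.allclose(Z_from_distances, Z_from_data))  # True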

Example

Following is an example of using the linkage() function. It performs hierarchical clustering with SciPy's linkage() function and Ward's method −

from scipy.cluster.hierarchy import linkage
import numpy as np

# Generate random sample data
data = np.random.rand(10, 2)  # 10 points in 2D space

# Compute the linkage matrix
Z = linkage(data, method='ward')
print(Z)

Following is the output of the linkage() function. The exact values will vary from run to run because the input data is random. Each row records one merge: the indices of the two clusters joined, the distance between them and the number of original observations in the new cluster −

[[ 0.          7.          0.0505634   2.        ]
 [ 4.         10.          0.09255057  3.        ]
 [ 1.          5.          0.15725673  2.        ]
 [ 2.          8.          0.22920974  2.        ]
 [ 9.         11.          0.24129559  4.        ]
 [ 3.         13.          0.29270489  3.        ]
 [ 6.         14.          0.32005747  5.        ]
 [12.         15.          0.93642962  5.        ]
 [16.         17.          0.98112101 10.        ]]
 

Dendrogram Visualization

The dendrogram() function creates a dendrogram which is a tree-like diagram that shows the arrangement and distances of clusters as they are merged.

Syntax

Here is the syntax of the SciPy dendrogram() function −

scipy.cluster.hierarchy.dendrogram(Z, **kwargs)

Parameters

Below are the parameters of the dendrogram() function −

  • Z − The linkage matrix returned by linkage().
  • kwargs − Optional arguments for customization such as color_threshold, labels and leaf_rotation; a customization sketch follows the example below.

Example

Here is an example of using the dendrogram() function. It plots the hierarchy encoded in the linkage matrix computed by the linkage() function −

from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import numpy as np

# Sample data: four points in 2D space
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Compute the linkage matrix with Ward's method
Z = linkage(data, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title('Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

Here is the output image generated by the dendrogram() function −

[Output: dendrogram plot of the four sample points]
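
The optional keyword arguments mentioned earlier let you customize the plot. Below is a minimal sketch (the leaf labels 'A' to 'D' are made up for illustration) that reuses the linkage matrix from the previous example −

from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
Z = linkage(data, method='ward')

plt.figure(figsize=(10, 7))
# labels names each leaf, leaf_rotation tilts the leaf labels (in degrees)
# and color_threshold sets the height below which links get cluster colors
dendrogram(Z, labels=['A', 'B', 'C', 'D'], leaf_rotation=45,
           color_threshold=3.0)
plt.title('Customized Dendrogram')
plt.ylabel('Distance')
plt.show()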

Forming Flat Clusters

The fcluster() function in SciPy is used to extract flat clusters from the hierarchical clustering encoded in a linkage matrix. It cuts the hierarchy into a specific number of clusters, or at a distance threshold, making the results easier to work with and interpret.

Syntax

Here is the syntax of the SciPy fcluster() function −

scipy.cluster.hierarchy.fcluster(Z, t, criterion='inconsistent', **kwargs)

Parameters

Here are the parameters of the fcluster() function −

  • Z − The linkage matrix obtained from the linkage() function, which encodes the hierarchical clustering of the data.
  • t − The threshold for forming flat clusters; its meaning depends on the criterion.
  • criterion − Determines how the flat clusters are formed. Common values are 'inconsistent' (the default), 'distance' and 'maxclust'; a 'distance' sketch follows the example below.
  • **kwargs − Additional keyword arguments that can be defined depending on the criterion, such as depth for the 'inconsistent' criterion.

Example

Here is an example of the fcluster() function, which cuts the hierarchy into a fixed maximum number of flat clusters −

from scipy.cluster.hierarchy import fcluster, linkage
import numpy as np

# Sample data: four points in 2D space
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Compute the linkage matrix with Ward's method
Z = linkage(data, method='ward')

# Form at most 3 flat clusters
clusters = fcluster(Z, t=3, criterion='maxclust')
print(f"Cluster assignments: {clusters}")

Below is the output of the fcluster() function. Note that only two clusters are formed: 'maxclust' yields at most t clusters, and because both pairs merge at the same height, no cut of this tree produces exactly three −

Cluster assignments: [1 1 2 2]
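
The criterion parameter can also cut the tree at a fixed height rather than asking for a target number of clusters. Here is a minimal sketch (an illustration, not from the original tutorial) using criterion='distance' on the same data −

from scipy.cluster.hierarchy import fcluster, linkage
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
Z = linkage(data, method='ward')

# Cut the tree at height 4.0: merges above this distance are undone,
# so the two pairs (merged at about 2.83) stay separate clusters
clusters = fcluster(Z, t=4.0, criterion='distance')
print(f"Cluster assignments: {clusters}")  # [1 1 2 2]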

Divisive Hierarchical Clustering

Divisive Hierarchical Clustering is a clustering method where the process starts with a single, all-encompassing cluster containing all data points and iteratively splits it into smaller clusters until each cluster meets a certain criterion or the desired number of clusters is achieved.

This is in contrast to Agglomerative Hierarchical Clustering which begins with individual data points and merges them into larger clusters.

SciPy does not provide a built-in implementation of divisive hierarchical clustering, so let's illustrate how it might be implemented manually in Python. Below are the steps to follow −

  • Initialize with all data points in one cluster.
  • Iteratively split the largest cluster.
  • Repeat this until the desired number of clusters is reached.

For simplicity, the example below uses a basic splitting strategy that divides a cluster at the median distance from its centroid; real-world implementations are usually more sophisticated.

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

def split_cluster(cluster):
    """ Split a cluster into two sub-clusters based on a simple approach. """
    # Compute the centroid of the cluster
    centroid = np.mean(cluster, axis=0)
    # Compute distances from each point to the centroid
    distances = cdist(cluster, [centroid], metric='euclidean').ravel()
    # Split the cluster into two based on distance from the centroid
    median_distance = np.median(distances)
    cluster1 = cluster[distances < median_distance]
    cluster2 = cluster[distances >= median_distance]
    return cluster1, cluster2

def divisive_clustering(data, n_clusters):
    """ Perform Divisive Hierarchical Clustering. """
    clusters = [data]
    while len(clusters) < n_clusters:
        # Find and remove the largest cluster by index; list.remove() would
        # compare NumPy arrays elementwise, which fails for equal-sized clusters
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        largest_cluster = clusters.pop(idx)
        # Split the largest cluster
        cluster1, cluster2 = split_cluster(largest_cluster)
        # Add the new clusters
        clusters.append(cluster1)
        clusters.append(cluster2)
    return clusters

# Generate synthetic data
np.random.seed(0)
data = np.random.rand(100, 2)  # 100 points in 2D space

# Perform divisive hierarchical clustering
n_clusters = 4
clusters = divisive_clustering(data, n_clusters)

# Plot the clusters
plt.figure(figsize=(8, 6))
for cluster in clusters:
    plt.scatter(cluster[:, 0], cluster[:, 1])
plt.title('Divisive Hierarchical Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Following is the output of the Divisive Hierarchical Clustering −

[Output: scatter plot of the four clusters found by divisive clustering]