SciPy - Hierarchical Clustering



What is Hierarchical Clustering?

In SciPy, hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters, either by successively merging smaller clusters into larger ones (the agglomerative approach) or by splitting larger clusters into smaller ones (the divisive approach).

This method does not require specifying the number of clusters beforehand. The result is typically visualized using a dendrogram, which is a tree-like diagram showing the arrangement of clusters and the distances between them at each step.

Hierarchical clustering reveals the data's natural structure and relationships, which makes it useful for exploratory data analysis and for identifying patterns or groupings in complex datasets.

Types of Hierarchical Clustering

Hierarchical clustering can be categorized by how it forms clusters. Each type builds clusters with a different method and handles the data differently.

Following are the two primary types of hierarchical clustering −

  • Agglomerative Hierarchical Clustering − This approach is bottom-up. It starts with each data point as its own individual cluster and progressively merges the closest pairs of clusters.
  • Divisive Hierarchical Clustering − This approach is top-down. It starts with all data points in a single cluster and recursively splits it into smaller clusters.

Now let's look at each type of hierarchical clustering in detail.

Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering is a bottom-up approach where each data point starts as its own cluster. It iteratively merges the closest pairs of clusters based on a chosen linkage criterion such as single, complete or average linkage until all points are grouped into a single cluster or a predefined number of clusters is reached.

This method builds a hierarchy of clusters which is often visualized using a dendrogram, illustrating the sequence of merges and the distances at which they occurred. It is widely used in data analysis to uncover the underlying structure and relationships within data.
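To make the role of the linkage criterion concrete, here is a small sketch (an illustration, not part of the original tutorial) that runs linkage() on the same four 1-D points with three different criteria −

from scipy.cluster.hierarchy import linkage
import numpy as np

# Four 1-D points forming two tight pairs: {0, 1} and {5, 6}
X = np.array([[0.0], [1.0], [5.0], [6.0]])

# The third column of the linkage matrix holds the merge distances
for method in ('single', 'complete', 'average'):
    Z = linkage(X, method=method)
    print(method, Z[:, 2])

The two pairs always merge at distance 1, but the final merge happens at 4 for 'single' (distance between the nearest points of the two pairs), 6 for 'complete' (the farthest points) and 5 for 'average' (the mean of all pairwise distances).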

Agglomerative hierarchical clustering in SciPy

SciPy's scipy.cluster.hierarchy module provides comprehensive tools for performing agglomerative hierarchical clustering, a method of cluster analysis that builds a hierarchy of clusters through a series of merge operations.

Below are the functions used to perform agglomerative hierarchical clustering.

Linkage Computation

The linkage() function computes the hierarchical clustering encoded in a linkage matrix. This matrix describes the clustering process and is used for further analysis.

Syntax

Here is the syntax of the SciPy linkage() function −

scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean')

Parameters

Following are the parameters of the linkage() function −

  • y − A condensed distance matrix (the 1-D form returned by pdist) or a 2-D array of observation vectors; see the sketch after this list.
  • method − The linkage method, such as 'single', 'complete', 'average' or 'ward'. The default is 'single'.
  • metric − The distance metric used when y is an observation matrix. The default is 'euclidean'.
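
When you already have pairwise distances, you can pass the condensed form from pdist instead of the raw observations. Here is a minimal sketch (not part of the original example) showing that both inputs produce the same linkage matrix −

from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
import numpy as np

np.random.seed(0)
data = np.random.rand(5, 2)  # 5 points in 2D space

# Condensed (1-D) pairwise distance vector produced by pdist
condensed = pdist(data, metric='euclidean')

# Passing the precomputed distances gives the same result as passing data
Z_from_distances = linkage(condensed, method='average')
Z_from_data = linkage(data, method='average')
print(np.allclose(Z_from_distances, Z_from_data))  # True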

Example

Following is an example of using the linkage() function. It performs hierarchical clustering with SciPy's linkage() function and Ward's method −

from scipy.cluster.hierarchy import linkage
import numpy as np

# Generate random sample data
data = np.random.rand(10, 2)  # 10 points in 2D space

# Compute the linkage matrix
Z = linkage(data, method='ward')
print(Z)

Following is the output of the linkage() function. The exact values will vary from run to run because the input data is random. Each row records one merge: the indices of the two clusters joined, the distance between them and the number of original observations in the new cluster −

[[ 0.          7.          0.0505634   2.        ]
 [ 4.         10.          0.09255057  3.        ]
 [ 1.          5.          0.15725673  2.        ]
 [ 2.          8.          0.22920974  2.        ]
 [ 9.         11.          0.24129559  4.        ]
 [ 3.         13.          0.29270489  3.        ]
 [ 6.         14.          0.32005747  5.        ]
 [12.         15.          0.93642962  5.        ]
 [16.         17.          0.98112101 10.        ]]
 

Dendrogram Visualization

The dendrogram() function creates a dendrogram which is a tree-like diagram that shows the arrangement and distances of clusters as they are merged.

Syntax

Here is the syntax of the SciPy dendrogram() function −

scipy.cluster.hierarchy.dendrogram(Z, **kwargs)

Parameters

Below are the parameters of the dendrogram() function −

  • Z − The linkage matrix returned by linkage().
  • kwargs − Optional arguments for customization such as color_threshold, labels and leaf_rotation; a customization sketch follows the example below.

Example

Here is an example of using the dendrogram() function. It plots the hierarchy encoded in the linkage matrix computed by the linkage() function −

from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import numpy as np

# Sample data: four points in 2D space
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Compute the linkage matrix with Ward's method
Z = linkage(data, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title('Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

Here is the output image generated by the dendrogram() function −

[Output: dendrogram plot of the four sample points]
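
The optional keyword arguments mentioned earlier let you customize the plot. Below is a minimal sketch (the leaf labels 'A' to 'D' are made up for illustration) that reuses the linkage matrix from the previous example −

from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
Z = linkage(data, method='ward')

plt.figure(figsize=(10, 7))
# labels names each leaf, leaf_rotation tilts the leaf labels (in degrees)
# and color_threshold sets the height below which links get cluster colors
dendrogram(Z, labels=['A', 'B', 'C', 'D'], leaf_rotation=45,
           color_threshold=3.0)
plt.title('Customized Dendrogram')
plt.ylabel('Distance')
plt.show()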

Forming Flat Clusters

The fcluster() function in SciPy is used to extract flat clusters from the hierarchical clustering encoded in a linkage matrix. It cuts the hierarchy into a specific number of clusters, or at a distance threshold, making the results easier to work with and interpret.

Syntax

Here is the syntax of the SciPy fcluster() function −

scipy.cluster.hierarchy.fcluster(Z, t, criterion='inconsistent', **kwargs)

Parameters

Here are the parameters of the fcluster() function −

  • Z − The linkage matrix obtained from the linkage() function, which encodes the hierarchical clustering of the data.
  • t − The threshold for forming flat clusters; its meaning depends on the criterion.
  • criterion − Determines how the flat clusters are formed. Common values are 'inconsistent' (the default), 'distance' and 'maxclust'; a 'distance' sketch follows the example below.
  • **kwargs − Additional keyword arguments that can be defined depending on the criterion, such as depth for the 'inconsistent' criterion.

Example

Here is an example of the fcluster() function, which cuts the hierarchy into a fixed maximum number of flat clusters −

from scipy.cluster.hierarchy import fcluster, linkage
import numpy as np

# Sample data: four points in 2D space
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Compute the linkage matrix with Ward's method
Z = linkage(data, method='ward')

# Form at most 3 flat clusters
clusters = fcluster(Z, t=3, criterion='maxclust')
print(f"Cluster assignments: {clusters}")

Below is the output of the fcluster() function. Note that only two clusters are formed: 'maxclust' yields at most t clusters, and because both pairs merge at the same height, no cut of this tree produces exactly three −

Cluster assignments: [1 1 2 2]
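
The criterion parameter can also cut the tree at a fixed height rather than asking for a target number of clusters. Here is a minimal sketch (an illustration, not from the original tutorial) using criterion='distance' on the same data −

from scipy.cluster.hierarchy import fcluster, linkage
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
Z = linkage(data, method='ward')

# Cut the tree at height 4.0: merges above this distance are undone,
# so the two pairs (merged at about 2.83) stay separate clusters
clusters = fcluster(Z, t=4.0, criterion='distance')
print(f"Cluster assignments: {clusters}")  # [1 1 2 2]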

Divisive Hierarchical Clustering

Divisive Hierarchical Clustering is a clustering method where the process starts with a single, all-encompassing cluster containing all data points and iteratively splits it into smaller clusters until each cluster meets a certain criterion or the desired number of clusters is achieved.

This is in contrast to Agglomerative Hierarchical Clustering which begins with individual data points and merges them into larger clusters.

SciPy does not provide a built-in implementation of divisive hierarchical clustering, so let's illustrate how it might be implemented manually in Python. Below are the steps to follow −

  • Initialize with all data points in one cluster.
  • Iteratively split the largest cluster.
  • Repeat this until the desired number of clusters is reached.

For simplicity, the example below uses a basic splitting strategy that divides a cluster at the median distance from its centroid; real-world implementations are usually more sophisticated.

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

def split_cluster(cluster):
    """ Split a cluster into two sub-clusters based on a simple approach. """
    # Compute the centroid of the cluster
    centroid = np.mean(cluster, axis=0)
    # Compute distances from each point to the centroid
    distances = cdist(cluster, [centroid], metric='euclidean').ravel()
    # Split the cluster into two based on distance from the centroid
    median_distance = np.median(distances)
    cluster1 = cluster[distances < median_distance]
    cluster2 = cluster[distances >= median_distance]
    return cluster1, cluster2

def divisive_clustering(data, n_clusters):
    """ Perform Divisive Hierarchical Clustering. """
    clusters = [data]
    while len(clusters) < n_clusters:
        # Find and remove the largest cluster by index; list.remove() would
        # compare NumPy arrays elementwise, which fails for equal-sized clusters
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        largest_cluster = clusters.pop(idx)
        # Split the largest cluster
        cluster1, cluster2 = split_cluster(largest_cluster)
        # Add the new clusters
        clusters.append(cluster1)
        clusters.append(cluster2)
    return clusters

# Generate synthetic data
np.random.seed(0)
data = np.random.rand(100, 2)  # 100 points in 2D space

# Perform divisive hierarchical clustering
n_clusters = 4
clusters = divisive_clustering(data, n_clusters)

# Plot the clusters
plt.figure(figsize=(8, 6))
for cluster in clusters:
    plt.scatter(cluster[:, 0], cluster[:, 1])
plt.title('Divisive Hierarchical Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Following is the output of the Divisive Hierarchical Clustering −

[Output: scatter plot of the four clusters found by divisive clustering]