Kickstart ML with Python snippets

Exploring Hierarchical Agglomerative Clustering (HAC) basics

Hierarchical Agglomerative Clustering (HAC) is a type of hierarchical clustering algorithm used to build a hierarchy of clusters. HAC is an iterative process where each data point starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Key Concepts

  1. Dendrogram:

    • A tree-like diagram that records the sequences of merges or splits. It visually represents the hierarchy of clusters formed during the HAC process.
  2. Linkage Criteria:

    • Single Linkage: The distance between two clusters is defined as the shortest distance between points in the two clusters.
    • Complete Linkage: The distance between two clusters is defined as the longest distance between points in the two clusters.
    • Average Linkage: The distance between two clusters is defined as the average distance between all pairs of points in the two clusters.
    • Ward’s Linkage: Merges the pair of clusters that yields the smallest increase in total within-cluster variance (it assumes Euclidean distances).
  3. Distance Metric:

    • Common distance metrics include Euclidean, Manhattan, and Cosine distances; Euclidean is the most commonly used. The snippet after this list shows how the distance metric and linkage method are selected in scipy.
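
As a quick illustration of these choices, the sketch below calls scipy's linkage function with different linkage methods and distance metrics. The array X_demo is an assumed toy example (any 2-D numeric array works); note that Ward's method only accepts the Euclidean metric.

import numpy as np
from scipy.cluster.hierarchy import linkage

# Assumed toy data: five points in 2-D (replace with your own array)
X_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]])

# Same data, different linkage criteria and distance metrics
Z_single   = linkage(X_demo, method='single',   metric='euclidean')  # shortest pairwise distance
Z_complete = linkage(X_demo, method='complete', metric='cityblock')  # longest pairwise (Manhattan) distance
Z_average  = linkage(X_demo, method='average',  metric='cosine')     # mean pairwise cosine distance
Z_ward     = linkage(X_demo, method='ward')                          # Ward's method (Euclidean only)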

Steps in HAC Algorithm

  1. Compute the Distance Matrix:

    • Calculate the distance between every pair of data points.
  2. Initialize Clusters:

    • Start with each data point as its own cluster.
  3. Merge Clusters:

    • Iteratively merge the two closest clusters based on the chosen linkage criteria.
  4. Repeat:

    • Repeat step 3 until all points are merged into a single cluster or until a stopping criterion is met (e.g., a desired number of clusters); a minimal from-scratch sketch of this loop follows the list.
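
To make these steps concrete, here is a minimal, purely illustrative from-scratch sketch of single-linkage HAC. The function name naive_single_linkage_hac and the fixed-cluster-count stopping rule are assumptions made for illustration; in practice you would rely on scipy as in the example further down this page.

import numpy as np
from itertools import combinations

def naive_single_linkage_hac(X, n_clusters=2):
    # Step 1: pairwise Euclidean distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Step 2: every point starts in its own cluster
    clusters = {i: [i] for i in range(len(X))}
    # Steps 3-4: repeatedly merge the two closest clusters (single linkage)
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: D[np.ix_(clusters[pair[0]], clusters[pair[1]])].min())
        clusters[a].extend(clusters.pop(b))
    return list(clusters.values())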

Practical Example in Python

Let’s walk through an example using the scipy, scikit-learn, matplotlib, and seaborn libraries in Python.

Step-by-Step Example

  1. Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.datasets import make_blobs
import seaborn as sns

# Optional for better plot aesthetics
sns.set(style="whitegrid")
  2. Generate Synthetic Data:
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()
  3. Perform HAC:
# Compute the linkage matrix using Ward's method
Z = linkage(X, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z, truncate_mode='lastp', p=12, leaf_rotation=45., leaf_font_size=15., show_contracted=True)
plt.title('Dendrogram')
plt.xlabel('Sample index or (cluster size)')
plt.ylabel('Distance')
plt.show()
  4. Form Flat Clusters:
# Determine the clusters
max_d = 7.5  # Maximum distance for cutting the dendrogram
clusters = fcluster(Z, max_d, criterion='distance')

# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Hierarchical Agglomerative Clustering (HAC)')
plt.show()

Explanation of the Code

  1. Data Generation:

    • We use make_blobs to generate synthetic data with 4 centers (clusters) for visualization and understanding of HAC.
  2. Linkage Matrix:

    • We compute the linkage matrix using linkage(X, method='ward'). Each row of Z records the two clusters merged at that step, the distance between them, and the size of the resulting cluster.
  3. Dendrogram:

    • We plot the dendrogram using dendrogram(Z). This visualizes the hierarchy of merges and helps in choosing the number of clusters: look for the largest vertical gap between successive merge heights and cut the tree with a horizontal line inside that gap.
  4. Forming Flat Clusters:

    • We use fcluster(Z, max_d, criterion='distance') to cut the dendrogram at a specified distance (max_d) to form flat clusters; a sketch of choosing this cut programmatically follows this list.
    • We visualize the clustered data by coloring each data point based on its cluster label.
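
If you prefer not to pick max_d by eye, one common heuristic (shown here as a sketch, not part of the original example) is to cut inside the largest gap between consecutive merge distances stored in the third column of Z; alternatively, fcluster can be asked for a fixed number of clusters directly.

# Cut where the merge distances jump the most (the largest vertical gap in the dendrogram)
merge_distances = Z[:, 2]              # distance at which each merge happened
gaps = np.diff(merge_distances)        # jumps between consecutive merges
cut = merge_distances[gaps.argmax()] + gaps.max() / 2
clusters_auto = fcluster(Z, cut, criterion='distance')

# Or request a fixed number of clusters directly
clusters_k4 = fcluster(Z, t=4, criterion='maxclust')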

Practical Tips

  1. Choosing the Linkage Method:

    • Different linkage methods can produce different results. Experiment with single, complete, average, and Ward’s linkage to see which method works best for your data (see the comparison sketch after this list).
  2. Determining the Number of Clusters:

    • Use the dendrogram to choose the number of clusters: draw a horizontal cut through the tallest vertical gap (the longest vertical lines not crossed by any horizontal merge); the number of vertical lines the cut intersects is the number of clusters.
  3. Scaling Features:

    • Scale your features before clustering, especially if they have different units or scales. Use StandardScaler from sklearn.preprocessing.
    from sklearn.preprocessing import StandardScaler
    
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
  4. Handling Large Datasets:

    • HAC can be computationally expensive for large datasets. Consider using faster clustering algorithms like K-Means or MiniBatchKMeans for very large datasets.
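
As a rough way to compare linkage methods (tip 1), the sketch below scores the flat clustering produced by each method with the silhouette coefficient (higher is better). It assumes the X_scaled array from the scaling tip above and a target of 4 clusters; adapt both to your data.

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

# Compare linkage methods by the silhouette score of their flat clusterings
for method in ['single', 'complete', 'average', 'ward']:
    Z_m = linkage(X_scaled, method=method)
    labels = fcluster(Z_m, t=4, criterion='maxclust')
    print(f"{method:>8}: silhouette = {silhouette_score(X_scaled, labels):.3f}")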

Back to Kickstart ML with Python cookbook page