Kickstart ML with Python snippets
Exploring Hierarchical Agglomerative Clustering (HAC) basics
Hierarchical Agglomerative Clustering (HAC) is a bottom-up hierarchical clustering algorithm used to build a hierarchy of clusters. HAC is an iterative process: each data point starts in its own cluster, and the closest pairs of clusters are merged as one moves up the hierarchy.

Dendrogram:
 A tree-like diagram that records the sequence of merges or splits. It visually represents the hierarchy of clusters formed during the HAC process.

Linkage Criteria:
 Single Linkage: The distance between two clusters is defined as the shortest distance between points in the two clusters.
 Complete Linkage: The distance between two clusters is defined as the longest distance between points in the two clusters.
 Average Linkage: The distance between two clusters is defined as the average distance between all pairs of points in the two clusters.
 Ward’s Linkage: Merges the pair of clusters that minimizes the increase in total within-cluster variance.

Distance Metric:
 Common distance metrics include Euclidean, Manhattan, and Cosine distances. Euclidean distance is most commonly used.
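To see concretely how the choice of metric changes pairwise distances, here is a small sketch using scipy.spatial.distance.pdist (the two points are arbitrary, chosen so the numbers are easy to verify by hand):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Two illustrative 2-D points: (1, 0) and (3, 4)
pts = np.array([[1.0, 0.0], [3.0, 4.0]])

print(pdist(pts, metric='euclidean')[0])  # sqrt(2^2 + 4^2) ≈ 4.472
print(pdist(pts, metric='cityblock')[0])  # |2| + |4| = 6.0 (Manhattan)
print(pdist(pts, metric='cosine')[0])     # 1 - cos(angle) = 1 - 0.6 = 0.4
```

Note that the same pair of points can be "close" under one metric and "far" under another, which is why the metric choice affects the resulting hierarchy.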
Steps in HAC Algorithm

Compute the Distance Matrix:
 Calculate the distance between every pair of data points.

Initialize Clusters:
 Start with each data point as its own cluster.

Merge Clusters:
 Iteratively merge the two closest clusters based on the chosen linkage criteria.

Repeat:
 Repeat step 3 until all points are merged into a single cluster or until a stopping criterion is met (e.g., a desired number of clusters).
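The four steps above can be sketched in a few lines of plain NumPy. This naive single-linkage implementation is for illustration only (its complexity is far worse than scipy's optimized routines, which we use in the example that follows):

```python
import numpy as np

def naive_single_linkage(X, n_clusters):
    """Naive HAC with single linkage, for illustration only.
    Use scipy.cluster.hierarchy.linkage for real work."""
    # Step 1: distance matrix between every pair of points
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Step 2: each point starts in its own cluster
    clusters = [[i] for i in range(len(X))]
    # Steps 3-4: merge the two closest clusters until the
    # stopping criterion (desired cluster count) is met
    while len(clusters) > n_clusters:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: shortest distance between members
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best, best_d = (a, b), d
        a, b = best
        clusters[a] += clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return clusters

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
print(naive_single_linkage(X, 2))  # [[0, 1], [2, 3]]
```

Swapping the `min` for `max` or a mean would turn this into complete or average linkage, respectively.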
Practical Example in Python
Let’s walk through an example using the scipy and seaborn libraries in Python.
Step-by-Step Example
 Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.datasets import make_blobs
import seaborn as sns
# Optional for better plot aesthetics
sns.set(style="whitegrid")
 Generate Synthetic Data:
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()
 Perform HAC:
# Compute the linkage matrix using Ward's method
Z = linkage(X, method='ward')
# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z, truncate_mode='lastp', p=12, leaf_rotation=45., leaf_font_size=15., show_contracted=True)
plt.title('Dendrogram')
plt.xlabel('Sample index or (cluster size)')
plt.ylabel('Distance')
plt.show()
 Form Flat Clusters:
# Determine the clusters
max_d = 7.5 # Maximum distance for cutting the dendrogram
clusters = fcluster(Z, max_d, criterion='distance')
# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Hierarchical Agglomerative Clustering (HAC)')
plt.show()

Data Generation:
 We use make_blobs to generate synthetic data with 4 centers (clusters) for visualization and understanding of HAC.

Linkage Matrix:
 We compute the linkage matrix using linkage(X, method='ward'). This matrix contains the distances and the pairs of clusters merged at each step.

Dendrogram:
 We plot the dendrogram using dendrogram(Z). This visualizes the hierarchy of clusters and helps in determining the optimal number of clusters by visually inspecting the longest vertical lines that are not crossed by horizontal lines.

Forming Flat Clusters:
 We use fcluster(Z, max_d, criterion='distance') to cut the dendrogram at a specified distance (max_d) and form flat clusters.
 We visualize the clustered data by coloring each data point based on its cluster label.
Practical Tips

Choosing the Linkage Method:
 Different linkage methods can produce different results. Experiment with single, complete, average, and Ward’s linkage to see which method works best for your data.
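One way to run such an experiment is to fit all four linkage methods on the same data and compare the flat clusters each produces. The sketch below reuses the synthetic blobs from the example above (the comparison by cluster sizes is just one simple diagnostic):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Same synthetic blobs as in the example above
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=4, criterion='maxclust')  # ask for 4 flat clusters
    sizes = np.sort(np.bincount(labels)[1:])[::-1]   # labels start at 1
    print(f"{method:>8}: cluster sizes = {list(sizes)}")
```

On well-separated blobs all four methods tend to agree; on elongated or noisy data they can differ dramatically (single linkage in particular is prone to "chaining").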

Determining the Number of Clusters:
 Use the dendrogram to visually inspect the number of clusters. Look for the longest vertical line in the dendrogram that can be cut by a horizontal line without intersecting any clusters.
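The "longest vertical line" heuristic can also be applied programmatically: the third column of the linkage matrix holds the merge distances in increasing order, and the largest jump between consecutive merges suggests a natural cut height. A small sketch (the variable names are our own):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
Z = linkage(X, method='ward')

merge_d = Z[:, 2]          # merge distances, in increasing order
gaps = np.diff(merge_d)    # jump between consecutive merges
i = int(np.argmax(gaps))   # biggest jump = longest vertical stretch
k = len(merge_d) - i       # clusters remaining if we cut inside that gap
print(f"Suggested number of clusters: {k}")
```

Treat this as a starting point rather than an answer; the heuristic can be misled by outliers or clusters at very different scales.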

Scaling Features:
 Scale your features before clustering, especially if they have different units or scales. Use StandardScaler from sklearn.preprocessing:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Handling Large Datasets:
 HAC can be computationally expensive for large datasets, since it needs the full pairwise distance matrix (O(n²) in memory). Consider using faster clustering algorithms like KMeans or MiniBatchKMeans for very large datasets.
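As a sketch of the MiniBatchKMeans alternative (the dataset size and batch_size here are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger dataset: HAC would need a 20,000 x 20,000 distance matrix here,
# while MiniBatchKMeans updates centroids from small random batches instead.
X, _ = make_blobs(n_samples=20_000, centers=4, cluster_std=0.60, random_state=0)

mbk = MiniBatchKMeans(n_clusters=4, batch_size=1024, random_state=0)
labels = mbk.fit_predict(X)
print(np.bincount(labels))  # size of each of the 4 clusters
```

The trade-off is that you lose the dendrogram and must fix the number of clusters up front.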