Kickstart ML with Python snippets

Exploring the basic concepts of the k-medoids (PAM) algorithm

K-Medoids (Partitioning Around Medoids) is a clustering algorithm similar to K-Means, but instead of using the mean of the points in a cluster (the centroid), it uses the most centrally located data point (the medoid) to represent each cluster. Because medoids are actual data points rather than averages, the algorithm is more robust to outliers and noise.

  1. Medoid:

    • The medoid is the most centrally located point in a cluster: it minimizes the sum of dissimilarities between itself and the other points in the cluster. A short sketch after this list shows how to compute one.
  2. Dissimilarity Measure:

    • Commonly used dissimilarity measures include Euclidean distance, Manhattan distance, etc.
  3. Algorithm Steps:

    • PAM alternates between assigning each point to its nearest medoid and updating each cluster's medoid, repeating until the medoids stop changing. The steps are detailed in the next section.
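
To make the medoid definition concrete, here is a minimal NumPy sketch (variable names are illustrative, not from any library) that finds the medoid of a small point set as the point with the smallest total distance to all the others; note how the outlier at (8, 8) cannot become the medoid:

import numpy as np

# Five 2-D points: four clustered together plus one outlier
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [8.0, 8.0], [1.2, 1.3]])

# Pairwise Euclidean distance matrix (n x n)
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# The medoid is the point with the smallest summed distance to all others
medoid_index = np.argmin(distances.sum(axis=1))
print('Medoid:', points[medoid_index])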

Steps in the PAM Algorithm

  1. Initialization:

    • Randomly select k data points as initial medoids.
  2. Assignment:

    • Assign each data point to the nearest medoid based on a dissimilarity measure.
  3. Update:

    • For each cluster, compute the total dissimilarity between each point in the cluster and all other points in the cluster.
    • Select the point that minimizes this total dissimilarity as the new medoid.
  4. Repeat:

    • Repeat the assignment and update steps until there is no change in the medoids.
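
Putting these four steps together, the following is a minimal from-scratch sketch of PAM, assuming Euclidean distance and the simplified within-cluster update described above (library implementations use faster and more careful variants):

import numpy as np

def pam(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: randomly pick k data points as the starting medoids
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    # Precompute the full pairwise distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    for _ in range(max_iter):
        # Assignment: each point joins its nearest medoid
        labels = np.argmin(D[:, medoid_idx], axis=1)
        new_idx = medoid_idx.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # Update: the new medoid minimizes total within-cluster distance
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_idx[c] = members[np.argmin(costs)]
        # Repeat: stop once the medoids no longer change
        if np.array_equal(np.sort(new_idx), np.sort(medoid_idx)):
            break
        medoid_idx = new_idx
    labels = np.argmin(D[:, medoid_idx], axis=1)
    return labels, medoid_idx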

Practical Example in Python

Let's implement k-medoids using the scikit-learn-extra library, whose KMedoids estimator follows the familiar scikit-learn API.

Step-by-Step Example

  1. Install the scikit-learn-extra library:
pip install scikit-learn-extra
  2. Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids
import seaborn as sns

# Optional for better plot aesthetics
sns.set(style="whitegrid")
  3. Generate Synthetic Data:
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()
  4. Apply K-Medoids:
# Apply K-Medoids
kmedoids = KMedoids(n_clusters=4, random_state=0)
kmedoids.fit(X)

# Get the cluster labels
labels = kmedoids.labels_

# Get the medoids
medoids = kmedoids.cluster_centers_

# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(medoids[:, 0], medoids[:, 1], c='red', s=200, alpha=0.75, marker='x')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Medoids Clustering')
plt.show()
Explanation of the Example

  1. Data Generation:

    • We use make_blobs to generate synthetic data with 4 well-separated centers, which makes the clustering easy to visualize and reason about.
  2. Model Fitting:

    • We initialize the K-Medoids model with n_clusters=4, meaning we want to partition the data into 4 clusters.
    • We fit the model to the data using kmedoids.fit(X), which runs the K-Medoids algorithm and stores the resulting cluster assignments on the fitted model.
  3. Cluster Labels and Medoids:

    • kmedoids.labels_ gives the cluster label for each data point.
    • kmedoids.cluster_centers_ gives the coordinates of the medoids of the clusters; see the snippet after this list for recovering their row indices as well.
  4. Visualization:

    • We plot the data points, coloring them by their cluster labels.
    • The medoids are plotted as red ‘x’ marks, showing the central points of each cluster.
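
Because the medoids are actual rows of X, KMedoids also exposes their row indices through the medoid_indices_ attribute, which is useful for looking up the original records. Continuing from the fitted model above:

# Row indices of the medoids in X
medoid_idx = kmedoids.medoid_indices_
print(medoid_idx)

# Each medoid is an actual data point, so this matches cluster_centers_
print(np.allclose(X[medoid_idx], kmedoids.cluster_centers_))  # True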

Practical Tips

  1. Choosing the Number of Clusters (k):

    • Use methods like the Elbow Method or the Silhouette Score to determine a good number of clusters; a sweep over several values of k is sketched after this list.
    from sklearn.metrics import silhouette_score

    silhouette_avg = silhouette_score(X, labels)
    print(f'Silhouette Score: {silhouette_avg}')
  2. Scaling Features:

    • Scale your features before applying K-Medoids to ensure that all features contribute equally to the distance metric.
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
  3. Initialization Sensitivity:

    • K-Medoids can be sensitive to the initial medoid selection. Using a smarter initialization scheme such as 'k-medoids++', or running the algorithm with several different random states and keeping the best result, can improve the final clustering.
    kmedoids = KMedoids(n_clusters=4, random_state=0, init='k-medoids++')
  4. Handling Large Datasets:

    • For very large datasets, consider sampling to reduce the data size before applying K-Medoids, since the pairwise-distance computations scale poorly with the number of points; see the sketch below.
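
As referenced in tips 1 and 4, the sketch below combines both ideas: it draws a random subsample to keep the pairwise-distance cost manageable and then sweeps candidate values of k, keeping the one with the best silhouette score. The subsample size and the k range are illustrative choices, not fixed rules:

import numpy as np
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score

# Subsample the data to keep the O(n^2) distance computations manageable
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(X), size=min(200, len(X)), replace=False)
X_sample = X[sample_idx]

# Sweep candidate cluster counts and score each clustering by silhouette
scores = {}
for k in range(2, 8):
    model = KMedoids(n_clusters=k, random_state=0).fit(X_sample)
    scores[k] = silhouette_score(X_sample, model.labels_)

best_k = max(scores, key=scores.get)
print(f'Best k by silhouette: {best_k} (score {scores[best_k]:.3f})')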
