Kickstart ML with Python snippets
Exploring kmedoids (PAM) algorithm basic concepts
KMedoids (Partitioning Around Medoids) is a clustering algorithm similar to KMeans but instead of using the mean of the points in a cluster (centroid), it uses the most centrally located data point (medoid) to represent the cluster. This makes it more robust to outliers and noise.

Medoid:
 The medoid is the most centrally located point in a cluster. It minimizes the sum of dissimilarities between itself and the other points in the cluster.

Dissimilarity Measure:
 Commonly used dissimilarity measures include Euclidean distance, Manhattan distance, etc.

Algorithm Steps:
 Initialization: Randomly select k points as the initial medoids.
 Assignment: Assign each data point to the nearest medoid.
 Update: For each cluster, select the point that minimizes the sum of dissimilarities within the cluster as the new medoid.
 Repeat: Repeat the assignment and update steps until the medoids do not change.
Steps in the PAM Algorithm

Initialization:
 Randomly select k data points as initial medoids.

Assignment:
 Assign each data point to the nearest medoid based on a dissimilarity measure.

Update:
 For each cluster, compute the total dissimilarity between each point in the cluster and all other points in the cluster.
 Select the point that minimizes this total dissimilarity as the new medoid.

Repeat:
 Repeat the assignment and update steps until there is no change in the medoids.
Practical Example in Python
Let's implement the kmedoids algorithm using the scikitlearnextra
library,
which provides an implementation of kmedoids.
StepbyStep Example
 Install the
scikitlearnextra
library:
pip install scikitlearnextra
 Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids
import seaborn as sns
# Optional for better plot aesthetics
sns.set(style="whitegrid")
 Generate Synthetic Data:
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()
 Apply KMedoids:
# Apply KMedoids
kmedoids = KMedoids(n_clusters=4, random_state=0)
kmedoids.fit(X)
# Get the cluster labels
labels = kmedoids.labels_
# Get the medoids
medoids = kmedoids.cluster_centers_
# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(medoids[:, 0], medoids[:, 1], c='red', s=200, alpha=0.75, marker='x')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('KMedoids Clustering')
plt.show()

Data Generation:
 We use
make_blobs
to generate synthetic data with 4 centers (clusters) for visualization and understanding of KMedoids.
 We use

Model Fitting:
 We initialize the KMedoids model with
n_clusters=4
, meaning we want to partition the data into 4 clusters.  We fit the model to the data using
kmedoids.fit(X)
, which performs the KMedoids algorithm and returns cluster labels for each data point.
 We initialize the KMedoids model with

Cluster Labels and Medoids:
kmedoids.labels_
gives the cluster label for each data point.kmedoids.cluster_centers_
gives the coordinates of the medoids of the clusters.

Visualization:
 We plot the data points, coloring them by their cluster labels.
 The medoids are plotted as red ‘x’ marks, showing the central points of each cluster.
Practical Tips

Choosing the Number of Clusters (k):
 Use methods like the Elbow Method or Silhouette Score to determine the optimal number of clusters.
from sklearn.metrics import silhouette_score silhouette_avg = silhouette_score(X, labels)print(f'Silhouette Score:{silhouette_avg}')

Scaling Features:
 Scale your features before applying KMedoids to ensure that all features contribute equally to the distance metric.
from sklearn.preprocessing import StandardScaler scaler = StandardScaler() X_scaled = scaler.fit_transform(X)

Initialization Sensitivity:
 KMedoids can be sensitive to initial medoid selection. Using multiple initializations and selecting the best result can improve performance.
kmedoids = KMedoids(n_clusters=4, random_state=0, init='kmedoids++')

Handling Large Datasets:
 For very large datasets, consider using sampling techniques to reduce the size of the data before applying KMedoids.