Kickstart ML with Python snippets
Exploring the KMeans algorithm: basic concepts
KMeans clustering is a type of unsupervised learning used to partition data into K distinct, non-overlapping clusters. Each data point is assigned to the cluster with the nearest mean, which serves as the prototype of the cluster.

Centroids:
 The center of a cluster.
 Each cluster is represented by its centroid, which is the mean of all data points in the cluster.
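
To make this concrete, here is a minimal sketch (with a small made-up NumPy array) of computing a centroid as the mean of the points assigned to one cluster:

import numpy as np

# Hypothetical points assigned to a single cluster (two features each)
cluster_points = np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5]])

# The centroid is the per-feature mean of the cluster's points
centroid = cluster_points.mean(axis=0)
print(centroid)  # [1.5 1.5]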

Inertia:
 Also known as the within-cluster sum of squares.
 Measures the compactness of the clusters, calculated as the sum of squared distances between each point and its centroid.
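
Continuing the sketch above, the contribution of one cluster to the inertia can be computed by hand (hypothetical data again, just to make the definition concrete):

# Squared Euclidean distance from each point to its centroid, summed;
# total inertia adds this quantity over all K clusters
inertia = ((cluster_points - centroid) ** 2).sum()
print(inertia)  # 1.0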

K (Number of Clusters):
 The number of clusters to partition the data into.
 Needs to be specified before running the algorithm.
Steps in KMeans Algorithm

Initialization:
 Select K initial centroids randomly from the dataset.

Assignment:
 Assign each data point to the nearest centroid, forming K clusters.

Update:
 Calculate the new centroids as the mean of all data points in each cluster.

Repeat:
 Repeat the assignment and update steps until the centroids no longer change (convergence) or a maximum number of iterations is reached.
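
To make these four steps concrete, here is a minimal from-scratch sketch in NumPy (a simplified illustration; it does not handle empty clusters and is not the optimized implementation scikit-learn uses):

import numpy as np

def kmeans_simple(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick K distinct points from the dataset as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment: label each point with the index of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until convergence: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
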
Practical Example in Python
Let’s walk through an example using the sklearn library in Python.
Step-by-Step Example
 Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
 Generate Synthetic Data:
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualize the data
plt.scatter(X[:,0], X[:,1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()
 Apply KMeans Clustering:
# Set the number of clusters
k = 4

# Fit the KMeans model
kmeans = KMeans(n_clusters=k)
kmeans.fit(X)

# Get the cluster centers and labels
centers = kmeans.cluster_centers_
labels = kmeans.labels_

# Print the cluster centers
print(f"Cluster Centers:\n{centers}")
 Visualize the Clusters:
# Visualize the clustered data
plt.scatter(X[:,0], X[:,1], c=labels, s=50, cmap='viridis')
plt.scatter(centers[:,0], centers[:,1], c='red', s=200, alpha=0.75, marker='x')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('KMeans Clustering')
plt.show()
 Evaluate the Model:
# Calculate inertia (sum of squared distances to the nearest centroid)
inertia = kmeans.inertia_
print(f"Inertia: {inertia}")

Data Generation:
 We use make_blobs to generate synthetic data with 4 centers (clusters). This helps us visualize and understand how KMeans works.

Model Fitting:
 We initialize the KMeans model with k=4, meaning we want to partition the data into 4 clusters.
 We fit the model to the data using kmeans.fit(X), which performs the KMeans algorithm.

Cluster Centers and Labels:
 kmeans.cluster_centers_ gives the coordinates of the centroids of the clusters.
 kmeans.labels_ gives the cluster label for each data point.
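
Once the model is fitted, the learned centroids can also assign new, unseen points to clusters via kmeans.predict. A small sketch (the two points below are made up for illustration and reuse the kmeans model fitted above):

import numpy as np

# Hypothetical new observations with the same two features
new_points = np.array([[0.0, 4.0], [-1.5, 7.0]])

# predict() assigns each point to the nearest learned centroid
new_labels = kmeans.predict(new_points)
print(new_labels)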

Visualization:
 We plot the data points, coloring them by their cluster label.
 The centroids are plotted as red ‘x’ marks, showing the center of each cluster.

Inertia:
 Inertia is calculated to measure how well the clusters have been formed. Lower inertia indicates more compact clusters.
Practical Tips and Tricks

Choosing K:
 Use the Elbow Method to determine the optimal number of clusters. Plot inertia for different values of K and look for an "elbow" point where the rate of decrease slows down.
# Elbow Method
inertia_values = []
k_values = range(1, 10)

for k in k_values:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    inertia_values.append(kmeans.inertia_)

plt.plot(k_values, inertia_values, 'bx')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()

Scaling Features:
 Scale your features before applying KMeans, especially if they have different units or scales. Use StandardScaler or MinMaxScaler from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
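
If you prefer, scaling and clustering can be chained so that any later data is scaled the same way; a minimal sketch using scikit-learn's make_pipeline (variable names are just for illustration):

from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Chain the scaler and the clusterer into a single estimator
pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=4))
pipeline.fit(X)

# make_pipeline names each step after its lowercased class name
scaled_labels = pipeline.named_steps['kmeans'].labels_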

Initialization:
 KMeans is sensitive to the initial centroids. Use the k-means++ initialization to improve convergence (it is also the default in scikit-learn).
kmeans = KMeans(n_clusters=4, init='k-means++')

Handling Large Datasets:
 For large datasets, consider using MiniBatchKMeans, which reduces computational cost by updating the centroids on small random mini-batches of the data.
from sklearn.cluster import MiniBatchKMeans

mini_batch_kmeans = MiniBatchKMeans(n_clusters=4)
mini_batch_kmeans.fit(X)
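
MiniBatchKMeans also accepts a batch_size parameter; smaller batches run faster but give a noisier approximation of the full KMeans solution.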