Kickstart ML with Python snippets

Exploring the basic concepts of the K-Means algorithm

K-Means Clustering is a type of unsupervised learning used to partition data into K distinct, non-overlapping clusters. Each data point is assigned to the cluster with the nearest mean (centroid), which serves as the prototype of the cluster.

  1. Centroids:

    • The center of a cluster.
    • Each cluster is represented by its centroid, which is the mean of all data points in the cluster.
  2. Inertia:

    • Also known as the within-cluster sum of squares.
    • Measures the compactness of the clusters, calculated as the sum of squared distances between each point and its centroid; a short sketch computing it by hand follows this list.
  3. K (Number of Clusters):

    • The number of clusters to partition the data into.
    • Needs to be specified before running the algorithm.
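To make the inertia definition concrete, here is a minimal sketch that computes the within-cluster sum of squares by hand (the toy points, labels, and centroids below are illustrative, not taken from the example later in this page):

import numpy as np

# Toy data: four 2-D points, their cluster labels, and the two cluster centroids
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.25, 1.5], [8.5, 8.75]])  # mean of each cluster's points

# Inertia: sum of squared distances from each point to its assigned centroid
inertia = np.sum((points - centroids[labels]) ** 2)
print(inertia)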

Steps in K-Means Algorithm

  1. Initialization:

    • Select K initial centroids randomly from the dataset.
  2. Assignment:

    • Assign each data point to the nearest centroid, forming K clusters.
  3. Update:

    • Calculate the new centroids as the mean of all data points in each cluster.
  4. Repeat:

    • Repeat the assignment and update steps until the centroids no longer change (convergence) or a maximum number of iterations is reached. A compact NumPy sketch of the full loop follows this list.
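These four steps fit in a few lines of NumPy. The sketch below is an illustrative toy implementation on random data (it assumes no cluster ever ends up empty), not a replacement for scikit-learn's optimized version:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # toy data; any (n_samples, n_features) array works
k, max_iter = 4, 100

# 1. Initialization: pick k distinct points from X as the starting centroids
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(max_iter):
    # 2. Assignment: label each point with the index of its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # 3. Update: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    # 4. Repeat until convergence: stop once the centroids stop moving
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids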

Practical Example in Python

Let’s walk through an example using the sklearn library in Python.

Step-by-Step Example

  1. Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
  2. Generate Synthetic Data:
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualize the data
plt.scatter(X[:,0], X[:,1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()
  3. Apply K-Means Clustering:
# Set the number of clusters
k = 4

# Fit the K-Means model
kmeans = KMeans(n_clusters=k)
kmeans.fit(X)

# Get the cluster centers and labels
centers = kmeans.cluster_centers_
labels = kmeans.labels_

# Print the cluster centers
print(f"Cluster Centers:\n{centers}")
  4. Visualize the Clusters:
# Visualize the clustered data
plt.scatter(X[:,0], X[:,1], c=labels, s=50, cmap='viridis')
plt.scatter(centers[:,0], centers[:,1], c='red', s=200, alpha=0.75, marker='x')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering')
plt.show()
  5. Evaluate the Model:
# Calculate inertia (sum of squared distances to the nearest centroid)
inertia = kmeans.inertia_
print(f"Inertia: {inertia}")
Explanation of the Example

  1. Data Generation:

    • We use make_blobs to generate synthetic data with 4 centers (clusters). This helps us visualize and understand how K-Means works.
  2. Model Fitting:

    • We initialize the K-Means model with k=4, meaning we want to partition the data into 4 clusters.
    • We fit the model to the data using kmeans.fit(X), which performs the K-Means algorithm.
  3. Cluster Centers and Labels:

    • kmeans.cluster_centers_ gives the coordinates of the centroids of the clusters.
    • kmeans.labels_ gives the cluster label for each data point (labeling new, unseen points is sketched after this list).
  4. Visualization:

    • We plot the data points, coloring them by their cluster label.
    • The centroids are plotted as red ‘x’ marks, showing the center of each cluster.
  5. Inertia:

    • Inertia is calculated to measure how well the clusters have been formed. Lower inertia indicates more compact clusters.
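Once fitted, the same model can label points it has never seen. Here is a small sketch reusing the kmeans object fitted above (the two new points are hypothetical):

# Hypothetical new observations to assign to the learned clusters
new_points = np.array([[0.0, 4.0], [-1.5, 2.5]])

# predict() returns the index of the nearest centroid for each point
new_labels = kmeans.predict(new_points)
print(f"Assigned clusters: {new_labels}")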

Practical Tips and Tricks

  1. Choosing K:

    • Use the Elbow Method to determine the optimal number of clusters. Plot inertia for different values of K and look for an "elbow" point where the rate of decrease slows down.
    # Elbow Method
    inertia_values = []
    k_values = range(1, 10)
    for k in k_values:
        kmeans = KMeans(n_clusters=k)
        kmeans.fit(X)
        inertia_values.append(kmeans.inertia_)

    plt.plot(k_values, inertia_values, 'bx-')
    plt.xlabel('Number of Clusters (K)')
    plt.ylabel('Inertia')
    plt.title('Elbow Method for Optimal K')
    plt.show()
  2. Scaling Features:

    • Scale your features before applying K-Means, especially if they have different units or scales. Use StandardScaler or MinMaxScaler from sklearn.preprocessing.
    from sklearn.preprocessing import StandardScaler
    
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
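    The model is then fitted on the scaled copy instead of the raw data (a minimal follow-up to the snippet above):

    kmeans = KMeans(n_clusters=4)
    kmeans.fit(X_scaled)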
  3. Initialization:

    • K-Means is sensitive to the choice of initial centroids. Use the k-means++ initialization (scikit-learn's default) to improve convergence.
    kmeans = KMeans(n_clusters=4, init='k-means++')
  4. Handling Large Datasets:

    • For large datasets, consider using MiniBatchKMeans, which reduces computational cost by using mini-batches.
    from sklearn.cluster import MiniBatchKMeans
    
    mini_batch_kmeans = MiniBatchKMeans(n_clusters=4)
    mini_batch_kmeans.fit(X)
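    For data that arrives in chunks or does not fit in memory at once, MiniBatchKMeans also supports incremental fitting via partial_fit. A minimal sketch, splitting X into 10 chunks purely for illustration:

    import numpy as np

    # Incremental variant: feed the data one mini-batch at a time
    mbk = MiniBatchKMeans(n_clusters=4)
    for chunk in np.array_split(X, 10):
        mbk.partial_fit(chunk)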
