Kickstart ML with Python snippets
Getting started with clustering (unsupervised learning) in Python
Clustering is a type of unsupervised learning that involves grouping data points into clusters based on their similarities.
What is Clustering?
Clustering is the process of dividing a dataset into groups, or clusters, where the data points in each cluster are more similar to each other than to those in other clusters. It is a form of unsupervised learning because it doesn't rely on predefined labels or outcomes.
Why Use Clustering?
 Exploratory Data Analysis: To uncover hidden patterns or structures in data.
 Data Compression: To reduce the complexity of data by grouping similar data points together.
 Anomaly Detection: To identify outliers or unusual data points.
 Segmentation: To segment customers, markets, or other entities into distinct groups for targeted analysis or action.
Clustering Methods Overview
Clustering methods can be broadly categorized into four types: Partitioning, Hierarchical, Density-based, and Grid-based clustering. Each method has unique characteristics and is suited to different types of data and clustering requirements.
1. Partitioning Clustering
Partitioning Clustering methods divide the data into distinct, non-overlapping subsets (clusters). The goal is to partition the dataset into K clusters, where each cluster has a centroid, and each data point belongs to the cluster with the nearest centroid.
Examples:

K-Means:
 Algorithm:
 1. Initialize K cluster centroids randomly.
 2. Assign each data point to the nearest centroid.
 3. Update each centroid to the mean of its assigned points.
 4. Repeat steps 2 and 3 until the centroids converge.
 Applications: Customer segmentation, image compression.
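The steps above map directly onto scikit-learn's KMeans. A minimal sketch on two synthetic blobs (assuming scikit-learn is installed; the data and parameter choices are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# n_clusters is the K from the steps above; n_init restarts the random
# initialization several times and keeps the lowest-inertia result
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # one centroid near (0, 0), one near (5, 5)
print(kmeans.labels_[:5])       # cluster index of the first five points
```

`cluster_centers_` holds the final centroids and `labels_` the per-point assignments, mirroring the update and assignment steps respectively.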

K-Medoids (PAM):
 Algorithm:
 1. Initialize K medoids (actual data points) randomly.
 2. Assign each data point to the nearest medoid.
 3. Update each medoid to minimize the sum of dissimilarities between points and their medoid.
 4. Repeat steps 2 and 3 until the medoids converge.
 Applications: Similar use cases to K-Means, but preferred when the data contains outliers, since medoids are actual data points and cannot be dragged away by extreme values.
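scikit-learn has no built-in K-Medoids, so here is a small NumPy sketch of the alternating assign/update loop (a simplified PAM-style variant; the function name, data, and stopping rule are my own choices for illustration):

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Naive K-Medoids: alternate assignment and medoid-update steps."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=k, replace=False)
    # Precompute all pairwise Euclidean distances once
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    for _ in range(n_iter):
        labels = np.argmin(dists[:, medoids], axis=1)  # nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            # The new medoid is the member that minimizes the total
            # dissimilarity to all other members of its cluster
            costs = dists[np.ix_(members, members)].sum(axis=0)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):  # converged
            break
        medoids = new_medoids
    return medoids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)),
               rng.normal(5, 0.5, (30, 2))])
medoids, labels = k_medoids(X, k=2)
# medoids are indices of real data points, not averaged positions
```

Because both the assignment and update steps can only lower the total cost, the loop reaches a fixed point rather than oscillating.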
2. Hierarchical Clustering
Hierarchical Clustering methods build a hierarchy of clusters by either merging smaller clusters into larger ones (agglomerative) or splitting larger clusters into smaller ones (divisive). The result is a dendrogram, which represents the nested structure of the clusters.
Examples:

Hierarchical Agglomerative Clustering (HAC):
 Algorithm:
 1. Start with each data point as its own cluster.
 2. Merge the two closest clusters based on a distance metric (e.g., Euclidean distance).
 3. Repeat step 2 until all points are merged into a single cluster.
 Applications: Gene expression analysis, social network analysis.
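Assuming SciPy is available, the merge process (and the dendrogram it produces) can be sketched as follows; the blob data and the choice of average linkage are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(4, 0.3, (20, 2))])

# linkage repeatedly merges the two closest clusters; Z encodes the
# dendrogram (which clusters merged, and at what distance)
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram so that exactly two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
```

`scipy.cluster.hierarchy.dendrogram(Z)` can then draw the nested structure; `fcluster` labels start at 1 rather than 0.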

Divisive Clustering:
 Algorithm:
 1. Start with a single cluster containing all data points.
 2. Recursively split the clusters into smaller clusters.
 3. Continue splitting until each cluster contains a single data point.
 Applications: Less commonly used in practice, but useful when a top-down view of the cluster hierarchy is needed.
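There is no standard divisive routine in scikit-learn's classic API, so here is an illustrative top-down sketch that repeatedly bisects clusters with 2-means (the function name and the `min_size` stopping rule are my own; a full divisive run would continue down to single points):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, indices=None, min_size=5):
    """Recursively bisect clusters until they fall below min_size."""
    if indices is None:
        indices = np.arange(len(X))
    if len(indices) <= min_size:
        return [indices]                  # small enough: stop splitting
    halves = KMeans(n_clusters=2, n_init=5,
                    random_state=0).fit_predict(X[indices])
    left, right = indices[halves == 0], indices[halves == 1]
    if len(left) == 0 or len(right) == 0:
        return [indices]                  # degenerate split: stop
    return divisive(X, left, min_size) + divisive(X, right, min_size)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(4, 0.3, (10, 2))])
clusters = divisive(X, min_size=10)   # splits once into the two blobs
```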
3. Density-Based Clustering
Density-Based Clustering methods identify clusters based on the density of data points in a region. Clusters are formed in regions of high density, separated by regions of low density.
Examples:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
 Algorithm:
 1. Select a point randomly and find all points within a given radius (epsilon).
 2. If the number of points within this radius exceeds a threshold (minPts), form a cluster.
 3. Expand the cluster by including all density-reachable points.
 4. Repeat steps 1-3 for all points, identifying clusters and noise.
 Applications: Spatial data analysis, anomaly detection.
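With scikit-learn, epsilon and minPts map onto `eps` and `min_samples`; a minimal sketch on synthetic data with a deliberately isolated point (the parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (30, 2)),
               rng.normal(3, 0.2, (30, 2)),
               [[10.0, 10.0]]])          # an isolated point

# eps is the radius and min_samples the minPts threshold from the steps above
db = DBSCAN(eps=0.6, min_samples=5).fit(X)
# Points that belong to no dense region receive the noise label -1
print(db.labels_[-1])  # -> -1: the isolated point is flagged as noise
```

Note that DBSCAN never needs the number of clusters up front; it falls out of the density structure.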

OPTICS (Ordering Points To Identify the Clustering Structure):
 Algorithm:
 Similar to DBSCAN, but designed to handle clusters of varying density.
 Orders the points based on their reachability distance.
 Clusters are identified by analyzing the reachability plot.
 Applications: Complex data with varying densities.
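scikit-learn's OPTICS exposes the point ordering and reachability distances directly, so the reachability plot described above can be inspected; a sketch with two blobs of different density (data and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (40, 2)),   # tight blob
               rng.normal(5, 0.8, (40, 2))])  # looser blob

opt = OPTICS(min_samples=5).fit(X)
# reachability_[ordering_] is the reachability plot: valleys correspond
# to clusters, and labels_ are extracted by analyzing that plot
plot = opt.reachability_[opt.ordering_]
```

Plotting `plot` as a bar chart shows two valleys at different depths, which is how OPTICS copes with the densities that a single DBSCAN `eps` would miss.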
4. Grid-Based Clustering
Grid-Based Clustering methods divide the data space into a finite number of cells forming a grid structure. The clustering is performed on the grid cells instead of the actual data points, which makes the method efficient for large datasets.
Examples:

STING (Statistical Information Grid):
 Algorithm:
 1. Divide the data space into rectangular cells.
 2. Compute statistical information for each cell (e.g., mean, variance).
 3. Use this information to merge or split cells to form clusters.
 Applications: Efficient clustering for large spatial datasets.
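The cell-statistics step can be sketched with plain NumPy; this only shows per-cell counts (real STING also stores per-cell mean and variance in a multi-level hierarchy), and the grid size and value range are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(1, 0.3, (100, 2)),
               rng.normal(4, 0.3, (100, 2))])

# Divide the space into a 5x5 grid of rectangular cells and count the
# points in each cell; high-count cells mark candidate cluster regions
bins = 5
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1],
                                        bins=bins, range=[[0, 5], [0, 5]])
```

Once the summaries exist, queries run against the small grid rather than the raw points, which is where the efficiency on large datasets comes from.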

CLIQUE (CLustering In QUEst):
 Algorithm:
 1. Divide the data space into grids.
 2. Identify dense cells based on a density threshold.
 3. Form clusters by merging adjacent dense cells.
 Applications: High-dimensional data clustering.
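A toy two-dimensional version of the dense-cell idea, using SciPy's connected-component labelling to merge adjacent dense cells (real CLIQUE works subspace by subspace in high dimensions; the grid size and density threshold here are arbitrary):

```python
import numpy as np
from scipy.ndimage import label

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(1, 0.2, (100, 2)),
               rng.normal(4, 0.2, (100, 2))])

bins, threshold = 10, 5
counts, _, _ = np.histogram2d(X[:, 0], X[:, 1],
                              bins=bins, range=[[0, 5], [0, 5]])
dense = counts >= threshold          # keep only sufficiently dense cells
merged, n_clusters = label(dense)    # merge adjacent dense cells
print(n_clusters)  # -> 2 for these well-separated blobs
```

`label` assigns one integer per connected group of dense cells, which is exactly the "merge adjacent dense cells" step.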
Summary of Key Concepts

Partitioning Clustering:
 Divides data into K non-overlapping clusters.
 Examples: K-Means, K-Medoids.
 Applications: Customer segmentation, image compression.

Hierarchical Clustering:
 Builds a tree-like structure of clusters.
 Examples: HAC, Divisive Clustering.
 Applications: Gene expression analysis, social network analysis.

Density-Based Clustering:
 Forms clusters based on data density.
 Examples: DBSCAN, OPTICS.
 Applications: Spatial data analysis, anomaly detection.

Grid-Based Clustering:
 Divides data space into a grid structure.
 Examples: STING, CLIQUE.
 Applications: Efficient clustering for large spatial and high-dimensional datasets.