Kickstart ML with Python snippets
Exploring the density-based DBSCAN ML algorithm: basic concepts
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups data points that are closely packed together, marking points that lie alone in low-density regions as outliers. It is useful for data that contains clusters of varying shapes and sizes.

Core Points:
 A point is a core point if it has at least min_samples points (including itself) within a given distance eps.

Border Points:
 A point is a border point if it is not a core point, but it is within the eps distance of a core point.

Noise Points:
 A point is a noise point if it is neither a core point nor a border point.

Directly Density-Reachable:
 A point p is directly density-reachable from a point q if p is within the eps distance from q and q is a core point.

Density-Reachable:
 A point p is density-reachable from a point q if there is a chain of points p1, p2, ..., pn such that p1 = q and pn = p, and each pi+1 is directly density-reachable from pi.
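Under these definitions, every point takes exactly one of the three roles. A minimal NumPy sketch on a toy 1-D dataset (the values and thresholds below are made up purely for illustration):

```python
import numpy as np

# Toy 1-D dataset (hypothetical values chosen for illustration).
points = np.array([0.0, 0.2, 0.4, 0.6, 1.0, 5.0])
eps, min_samples = 0.5, 3

# Pairwise distances; neighbor counts include the point itself.
dist = np.abs(points[:, None] - points[None, :])
counts = (dist <= eps).sum(axis=1)

core = counts >= min_samples
# Border: not a core point, but within eps of at least one core point.
border = ~core & ((dist <= eps) & core[None, :]).any(axis=1)
noise = ~core & ~border

print(points[core])    # [0.  0.2 0.4 0.6]
print(points[border])  # [1.]
print(points[noise])   # [5.]
```

The isolated value 5.0 has no neighbors within eps, so it is noise; 1.0 is too sparse to be core but sits within eps of the core point 0.6, making it a border point.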
Steps in DBSCAN Algorithm

Initialization:
 Select an arbitrary point from the dataset.

Cluster Formation:
 If the selected point is a core point, form a cluster by finding all points density-reachable from it.
 If the selected point is a border point, it may be assigned to an existing cluster or marked as noise if it doesn't meet the density criteria.

Repeat:
 Repeat the process until all points have been visited.
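The three steps above can be sketched from scratch. This is a brute-force illustration, not how scikit-learn implements DBSCAN; the helper name dbscan_sketch and the demo data are assumptions for the example:

```python
import numpy as np

def dbscan_sketch(X, eps, min_samples):
    """Toy DBSCAN (label -1 = noise); brute force, for illustration only."""
    n = len(X)
    labels = np.full(n, -1)              # everything starts as noise
    visited = np.zeros(n, dtype=bool)
    # Precompute eps-neighborhoods with brute-force pairwise distances.
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighborhoods = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighborhoods[i]) < min_samples:
            continue                     # not a core point; may stay noise
        labels[i] = cluster              # core point seeds a new cluster
        queue = list(neighborhoods[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # border/core point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighborhoods[j]) >= min_samples:
                    queue.extend(neighborhoods[j])  # expand via core points
        cluster += 1
    return labels

# Two tight groups plus one isolated point (made-up demo data).
X = np.array([[0, 0], [0, 0.1], [0.1, 0],
              [5, 5], [5, 5.1], [5.1, 5],
              [10, 10]], dtype=float)
print(dbscan_sketch(X, eps=0.3, min_samples=3))  # [ 0  0  0  1  1  1 -1]
```

Note how the cluster grows only through core points: border points are labeled when reached but never extend the queue.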
Practical Example in Python
Let's walk through an example of using the DBSCAN algorithm with Python and the scikit-learn library.
Step-by-Step Example
 Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
import seaborn as sns
# Optional for better plot aesthetics
sns.set(style="whitegrid")
 Generate Synthetic Data:
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.50, random_state=0)

# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()
 Apply DBSCAN:
# Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
clusters = dbscan.fit_predict(X)

# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN Clustering')
plt.show()
 Interpret Results:
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise_ = list(clusters).count(-1)

print(f'Estimated number of clusters: {n_clusters_}')
print(f'Estimated number of noise points: {n_noise_}')

Data Generation:
 We use make_blobs to generate synthetic data with 4 centers (clusters). This helps us visualize and understand how DBSCAN works.

Model Fitting:
 We initialize the DBSCAN model with eps=0.3 (the maximum distance between two samples for them to be considered in the same neighborhood) and min_samples=5 (the minimum number of samples in a neighborhood for a point to be considered a core point).
 We fit the model to the data using dbscan.fit_predict(X), which runs the DBSCAN algorithm and returns a cluster label for each data point.

Cluster Interpretation:
 Points with the same cluster label are grouped together.
 Points labeled as -1 are considered noise (outliers).

Visualization:
 We plot the data points, coloring them by their cluster labels to visualize the clustering results.
 Noise points (if any) are usually colored differently to indicate they are outliers.
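One way to draw noise in a distinct color is to mask on the -1 label. This sketch regenerates the synthetic data from the steps above so it stands alone:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Same synthetic data and parameters as in the walkthrough.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.50, random_state=0)
clusters = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

noise = clusters == -1
# Clustered points colored by label; noise drawn as red crosses.
plt.scatter(X[~noise, 0], X[~noise, 1], c=clusters[~noise], cmap='viridis', s=50)
plt.scatter(X[noise, 0], X[noise, 1], c='red', marker='x', s=50, label='noise')
plt.legend()
plt.show()
```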
Practical Tips

Choosing Parameters:
 The choice of eps and min_samples is crucial. Use domain knowledge or heuristics to set these parameters.
 The k-distance graph can help in choosing eps. Plot the sorted distances of each point to its kth nearest neighbor and look for a knee point.
from sklearn.neighbors import NearestNeighbors

neighbors = NearestNeighbors(n_neighbors=5)
neighbors_fit = neighbors.fit(X)
distances, indices = neighbors_fit.kneighbors(X)
distances = np.sort(distances[:, 4], axis=0)

plt.plot(distances)
plt.xlabel('Points')
plt.ylabel('5th Nearest Neighbor Distance')
plt.title('K-Distance Graph')
plt.show()

Handling Different Densities:
 DBSCAN handles clusters of arbitrary shape better than K-Means, but a single eps may struggle when densities vary significantly within one dataset. In such cases, consider OPTICS (Ordering Points To Identify the Clustering Structure), an extension of DBSCAN designed for varying densities.
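As a sketch of that alternative, scikit-learn ships OPTICS with a similar fit_predict interface; the centers, spreads, and parameter values below are illustrative assumptions, not tuned choices:

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Blobs with different spreads simulate varying densities (made-up values).
X, _ = make_blobs(n_samples=300, centers=[(0, 0), (5, 5), (10, 0)],
                  cluster_std=[0.3, 1.0, 0.5], random_state=0)

optics = OPTICS(min_samples=5, xi=0.05)
labels = optics.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f'OPTICS found {n_clusters} clusters')
```

Unlike DBSCAN, OPTICS does not fix a single eps; it orders points by reachability distance and extracts clusters from that ordering.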

Scaling Features:
 Scale your features before applying DBSCAN to ensure that all features contribute equally to the distance metric.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Interpretation of Noise:
 Noise points can provide valuable insights, indicating outliers or points that do not belong to any cluster. Handle these points based on the context of your application.
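To act on those noise points, you can pull them out by their -1 label. A small sketch, regenerating the walkthrough's synthetic data so it runs standalone:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Same synthetic data and parameters as in the walkthrough.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.50, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

outliers = X[labels == -1]   # rows DBSCAN left unassigned to any cluster
print(f'{len(outliers)} of {len(X)} points flagged as noise')
```

Depending on the application, these rows might be dropped, investigated as anomalies, or re-clustered with looser parameters.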