Kickstart ML with Python snippets
Exploring the KNN classifier algorithm basic concepts
k-Nearest Neighbors (k-NN) is a simple, instance-based learning algorithm used for both classification and regression tasks. It classifies a data point based on how its neighbors are classified.
Instance-Based Learning:
- k-NN is a type of instance-based (or "lazy") learning: the algorithm memorizes the training dataset instead of learning an explicit model during a training phase.
Distance Metric:
- The algorithm relies on a distance metric to find the closest neighbors. Common distance metrics include the following (a short numeric sketch of all three appears after this concept list):
- Euclidean Distance: \( d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \)
- Manhattan Distance: \( d(p, q) = \sum_{i=1}^{n} |p_i - q_i| \)
- Minkowski Distance: \( d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^r \right)^{1/r} \), a generalization of Euclidean (\( r = 2 \)) and Manhattan (\( r = 1 \)) distances.
Choosing k:
- The number of neighbors (k) is a crucial hyperparameter. A small k can be sensitive to noise, while a large k can smooth out class boundaries.
Voting Mechanism:
- In classification, the class of a data point is determined by the majority class among its k-nearest neighbors.
- In regression, the prediction is the average of the values of its k-nearest neighbors.
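To make the distance metrics above concrete, here is a minimal NumPy sketch; the two sample points p and q are arbitrary illustrative values, not taken from any particular dataset.
import numpy as np
# Two arbitrary example points in a 4-dimensional feature space
p = np.array([5.1, 3.5, 1.4, 0.2])
q = np.array([6.2, 2.9, 4.3, 1.3])
# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((p - q) ** 2))
# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(p - q))
# Minkowski distance with r=3 (r=1 reproduces Manhattan, r=2 reproduces Euclidean)
r = 3
minkowski = np.sum(np.abs(p - q) ** r) ** (1 / r)
print(f"Euclidean: {euclidean:.3f}, Manhattan: {manhattan:.3f}, Minkowski (r=3): {minkowski:.3f}")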
Steps in k-NN Algorithm
- Choose the number of neighbors (k).
- Calculate the distance between the query instance and all the training instances.
- Sort the distances in ascending order and select the k-nearest neighbors.
- Perform majority voting (for classification) or take the average (for regression) to make the prediction.
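These four steps can be sketched from scratch with NumPy as a learning aid for the classification case; the helper name knn_predict and the tiny toy arrays below are illustrative assumptions, not part of scikit-learn.
import numpy as np
from collections import Counter
def knn_predict(X_train, y_train, x_query, k=3):
    # Step 2: Euclidean distance from the query point to every training instance
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Step 3: indices of the k nearest neighbors (ascending distance)
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
# Tiny toy dataset: two features, two classes
X_toy = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_toy = np.array([0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([1.2, 1.9]), k=3))  # prints 0, the majority class of the 3 nearest neighbors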
Practical Example in Python
Let's walk through an example of using the k-NN classifier with Python and the scikit-learn library.
Step-by-Step Example
- Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
- Load and Prepare Data:
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
- Train the k-NN Model:
# Initialize the k-NN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)
# Fit the model
knn.fit(X_train, y_train)
- Make Predictions:
# Predict on the test set
y_pred = knn.predict(X_test)
- Evaluate the Model:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Print confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
# Print classification report
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)
Data Preparation:
- We load the Iris dataset, which is a classic dataset for classification tasks.
- We split the data into training and testing sets to evaluate the model's performance on unseen data.
- We standardize the features to ensure that all features contribute equally to the distance metric.
Model Training:
- We initialize the k-NN classifier with n_neighbors=3, meaning we use the 3 nearest neighbors to make predictions.
- We fit the model to the training data using knn.fit(X_train, y_train).
Making Predictions:
- We use the trained model to predict the class labels for the test set using knn.predict(X_test).
Model Evaluation:
- We calculate the accuracy of the model, which is the proportion of correctly predicted instances.
- We generate a confusion matrix to see how well the model performs for each class.
- We print a classification report, which includes precision, recall, and F1-score for each class.
Practical Tips
Choosing k:
- Use cross-validation to find the optimal value of k. A common approach is to plot the accuracy for different values of k and choose the k with the highest accuracy.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
# Use cross-validation to find the best k
k_values = range(1, 26)
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5, scoring='accuracy').mean() for k in k_values]
plt.plot(k_values, cv_scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Cross-Validated Accuracy')
plt.title('Selecting K with Cross-Validation')
plt.show()
Scaling Features:
- Always scale your features before applying k-NN, as the algorithm relies on distance metrics that can be affected by different feature scales.
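A convenient way to guarantee this is to wrap the scaler and the classifier in a single scikit-learn pipeline. The minimal sketch below assumes X_train, y_train, X_test, and y_test hold a raw (unscaled) train/test split.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# The scaler learns its parameters from the training data and reapplies them to the test data
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_pipeline.fit(X_train, y_train)
print(knn_pipeline.score(X_test, y_test))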
Handling Large Datasets:
- k-NN can be computationally expensive for large datasets. Consider using approximate nearest neighbor methods or dimensionality reduction techniques like PCA to speed up the computation.
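As one possible sketch, PCA can be combined with k-NN in a pipeline and a tree-based neighbor search can replace brute-force distance computation; the n_components=2 value here is an arbitrary illustrative choice that should be tuned for real data.
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
# Reduce dimensionality first, then use a k-d tree to avoid comparing against every training point
fast_knn = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree'))
fast_knn.fit(X_train, y_train)
print(fast_knn.score(X_test, y_test))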
Handling Imbalanced Data:
- If your data is imbalanced, consider using techniques like SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes before training the model.
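A minimal sketch using the third-party imbalanced-learn package (an extra dependency assumed here, installable with pip install imbalanced-learn); the resampling is applied to the training split only so the test set stays untouched.
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier
# Oversample the minority class(es) in the training data only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_resampled, y_resampled)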