Kickstart ML with Python snippets
Exploring the KNN classifier algorithm basic concepts
k-Nearest Neighbors (k-NN) is a simple, instance-based learning algorithm used for both classification and regression tasks. It classifies a data point based on how its neighbors are classified.
Instance-Based Learning:
- k-NN is a type of instance-based (or "lazy") learning: the algorithm memorizes the training dataset instead of learning an explicit model during a training phase.
Distance Metric:
- The algorithm relies on a distance metric to find the closest neighbors. Common distance metrics include the following (a short numeric sketch of all three appears after this concept list):
- Euclidean Distance: \( d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \)
- Manhattan Distance: \( d(p, q) = \sum_{i=1}^{n} |p_i - q_i| \)
- Minkowski Distance: \( d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^r \right)^{1/r} \), a generalization of Euclidean (\( r = 2 \)) and Manhattan (\( r = 1 \)) distances.
Choosing k:
- The number of neighbors (k) is a crucial hyperparameter. A small k can be sensitive to noise, while a large k can smooth out class boundaries.
Voting Mechanism:
- In classification, the class of a data point is determined by the majority class among its k-nearest neighbors.
- In regression, the prediction is the average of the values of its k-nearest neighbors.
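To make the distance metrics above concrete, here is a minimal NumPy sketch; the two sample points p and q are arbitrary illustrative values, not taken from any particular dataset.
import numpy as np
# Two arbitrary example points in a 4-dimensional feature space
p = np.array([5.1, 3.5, 1.4, 0.2])
q = np.array([6.2, 2.9, 4.3, 1.3])
# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((p - q) ** 2))
# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(p - q))
# Minkowski distance with r=3 (r=1 reproduces Manhattan, r=2 reproduces Euclidean)
r = 3
minkowski = np.sum(np.abs(p - q) ** r) ** (1 / r)
print(f"Euclidean: {euclidean:.3f}, Manhattan: {manhattan:.3f}, Minkowski (r=3): {minkowski:.3f}")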
Steps in k-NN Algorithm
- Choose the number of neighbors (k).
- Calculate the distance between the query instance and all the training instances.
- Sort the distances in ascending order and select the k-nearest neighbors.
- Perform majority voting (for classification) or take the average (for regression) to make the prediction.
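These four steps can be sketched from scratch with NumPy as a learning aid for the classification case; the helper name knn_predict and the tiny toy arrays below are illustrative assumptions, not part of scikit-learn.
import numpy as np
from collections import Counter
def knn_predict(X_train, y_train, x_query, k=3):
    # Step 2: Euclidean distance from the query point to every training instance
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Step 3: indices of the k nearest neighbors (ascending distance)
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
# Tiny toy dataset: two features, two classes
X_toy = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_toy = np.array([0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([1.2, 1.9]), k=3))  # prints 0, the majority class of the 3 nearest neighbors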
Practical Example in Python
Let's walk through an example of using the k-NN classifier with Python and the scikit-learn library.
Step-by-Step Example
- Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
- Load and Prepare Data:
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
- Train the k-NN Model:
# Initialize the k-NN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)
# Fit the model
knn.fit(X_train, y_train)
- Make Predictions:
# Predict on the test set
y_pred = knn.predict(X_test)
- Evaluate the Model:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Print confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
# Print classification report
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)
Data Preparation:
- We load the Iris dataset, which is a classic dataset for classification tasks.
- We split the data into training and testing sets to evaluate the model's performance on unseen data.
- We standardize the features to ensure that all features contribute equally to the distance metric.
Model Training:
- We initialize the k-NN classifier with n_neighbors=3, meaning we use the 3 nearest neighbors to make predictions.
- We fit the model to the training data using knn.fit(X_train, y_train).
Making Predictions:
- We use the trained model to predict the class labels for the test set using knn.predict(X_test).
Model Evaluation:
- We calculate the accuracy of the model, which is the proportion of correctly predicted instances.
- We generate a confusion matrix to see how well the model performs for each class.
- We print a classification report, which includes precision, recall, and F1-score for each class.
Practical Tips
Choosing k:
- Use cross-validation to find the optimal value of k. A common approach is to plot the accuracy for different values of k and choose the k with the highest accuracy.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
# Use cross-validation to find the best k
k_values = range(1, 26)
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5, scoring='accuracy').mean() for k in k_values]
plt.plot(k_values, cv_scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Cross-Validated Accuracy')
plt.title('Selecting K with Cross-Validation')
plt.show()
Scaling Features:
- Always scale your features before applying k-NN, as the algorithm relies on distance metrics that can be affected by different feature scales.
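A convenient way to guarantee this is to wrap the scaler and the classifier in a single scikit-learn pipeline. The minimal sketch below assumes X_train, y_train, X_test, and y_test hold a raw (unscaled) train/test split.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# The scaler learns its parameters from the training data and reapplies them to the test data
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_pipeline.fit(X_train, y_train)
print(knn_pipeline.score(X_test, y_test))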
Handling Large Datasets:
- k-NN can be computationally expensive for large datasets. Consider using approximate nearest neighbor methods or dimensionality reduction techniques like PCA to speed up the computation.
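As one possible sketch, PCA can be combined with k-NN in a pipeline and a tree-based neighbor search can replace brute-force distance computation; the n_components=2 value here is an arbitrary illustrative choice that should be tuned for real data.
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
# Reduce dimensionality first, then use a k-d tree to avoid comparing against every training point
fast_knn = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree'))
fast_knn.fit(X_train, y_train)
print(fast_knn.score(X_test, y_test))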
Handling Imbalanced Data:
- If your data is imbalanced, consider using techniques like SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes before training the model.
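A minimal sketch using the third-party imbalanced-learn package (an extra dependency assumed here, installable with pip install imbalanced-learn); the resampling is applied to the training split only so the test set stays untouched.
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier
# Oversample the minority class(es) in the training data only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_resampled, y_resampled)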