Kickstart ML with Python snippets

Exploring the basic concepts of the k-NN classifier algorithm

k-Nearest Neighbors (k-NN) is a simple, instance-based learning algorithm used for both classification and regression tasks. It predicts the label of a data point from the labels (or values) of its nearest neighbors in the training set.

  1. Instance-Based Learning:

    • k-NN is a type of instance-based (lazy) learning: the algorithm stores the training dataset and defers all computation to prediction time instead of fitting an explicit model during training.
  2. Distance Metric:

    • The algorithm relies on a distance metric to find the closest neighbors. Common distance metrics include:
      • Euclidean Distance: ( d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} )
      • Manhattan Distance: ( d(p, q) = \sum_{i=1}^{n} |p_i - q_i| )
      • Minkowski Distance: ( d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^m \right)^{1/m} ), a generalization of Manhattan (m = 1) and Euclidean (m = 2) distances. A small NumPy sketch of these distances follows this list.
  3. Choosing k:

    • The number of neighbors (k) is a crucial hyperparameter. A small k can be sensitive to noise, while a large k can smooth out class boundaries.
  4. Voting Mechanism:

    • In classification, the class of a data point is determined by the majority class among its k-nearest neighbors.
    • In regression, the prediction is the average of the values of its k-nearest neighbors.
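
To make the distance metrics concrete, here is a minimal NumPy sketch computing all three for a pair of example points; the vectors p and q and the Minkowski order m = 3 are illustrative values, not taken from the text above.

import numpy as np

# Two example feature vectors (illustrative values)
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 3.5])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((p - q) ** 2))

# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(p - q))

# Minkowski distance of order m (m = 1 gives Manhattan, m = 2 gives Euclidean)
m = 3
minkowski = np.sum(np.abs(p - q) ** m) ** (1 / m)

print(euclidean, manhattan, minkowski)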

Steps in k-NN Algorithm

  1. Choose the number of neighbors (k).
  2. Calculate the distance between the query instance and all the training instances.
  3. Sort the distances in ascending order and select the k-nearest neighbors.
  4. Perform majority voting (for classification) or take the average (for regression) to make the prediction.
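
To make these steps concrete, here is a minimal from-scratch sketch of k-NN classification using NumPy and Euclidean distance; the function name knn_predict and the toy data are illustrative, not part of any library.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Step 2: compute the distance from the query point to every training point
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Step 3: sort by distance and take the indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: four training points with two features and two classes
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y_toy = np.array([0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([3.8, 4.0]), k=3))  # expected output: 1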

Practical Example in Python

Let's walk through an example of using the k-NN classifier with Python and the scikit-learn library.

Step-by-Step Example

  1. Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
  2. Load and Prepare Data:
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
  3. Train the k-NN Model:
# Initialize the k-NN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)

# Fit the model
knn.fit(X_train, y_train)
  4. Make Predictions:
# Predict on the test set
y_pred = knn.predict(X_test)
  5. Evaluate the Model:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Print confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Print classification report
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)

Explanation of the Example

  1. Data Preparation:

    • We load the Iris dataset, which is a classic dataset for classification tasks.
    • We split the data into training and testing sets to evaluate the model's performance on unseen data.
    • We standardize the features to ensure that all features contribute equally to the distance metric.
  2. Model Training:

    • We initialize the k-NN classifier with n_neighbors=3, meaning we use the 3 nearest neighbors to make predictions.
    • We fit the model to the training data using knn.fit(X_train, y_train).
  3. Making Predictions:

    • We use the trained model to predict the class labels for the test set using knn.predict(X_test).
  4. Model Evaluation:

    • We calculate the accuracy of the model, which is the proportion of correctly predicted instances.
    • We generate a confusion matrix to see how well the model performs for each class.
    • We print a classification report, which includes precision, recall, and F1-score for each class.

Practical Tips

  1. Choosing k:

    • Use cross-validation to find the optimal value of k. A common approach is to plot the accuracy for different values of k and choose the k with the highest accuracy.
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    import matplotlib.pyplot as plt
    
    # Use cross-validation to find the best k
    k_values = range(1, 26)
    cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5, scoring='accuracy').mean() for k in k_values]
    
    plt.plot(k_values, cv_scores)
    plt.xlabel('Number of Neighbors K')
    plt.ylabel('Cross-Validated Accuracy')
    plt.title('Selecting K with Cross-Validation')
    plt.show()
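
    # A possible follow-up (not in the original snippet): pick the k with the
    # highest cross-validated accuracy from the scores computed above.
    best_k = k_values[cv_scores.index(max(cv_scores))]
    print(f"Best k: {best_k}")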
  2. Scaling Features:

    • Always scale your features before applying k-NN, because the algorithm relies on distance metrics that are sensitive to differing feature scales; a pipeline sketch follows this list.
  3. Handling Large Datasets:

    • k-NN can be computationally expensive for large datasets, since every prediction scans the training set. Consider approximate nearest neighbor methods or dimensionality reduction techniques like PCA to speed up the computation; the pipeline sketch below includes a PCA step.
  4. Handling Imbalanced Data:

    • If your data is imbalanced, consider using techniques like SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes before training the model; a hedged sketch follows this list.
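
For the scaling and large-dataset tips above, here is a minimal sketch using a scikit-learn Pipeline that chains StandardScaler, PCA, and k-NN so preprocessing is fit only on the training data; the choice of two principal components is an illustrative assumption, and X_train, X_test, y_train, y_test refer to the splits created earlier in this example.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Chain scaling, dimensionality reduction, and k-NN into a single estimator
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),            # illustrative: keep 2 components
    ('knn', KNeighborsClassifier(n_neighbors=3))
])

pipe.fit(X_train, y_train)
print(f"Pipeline accuracy: {pipe.score(X_test, y_test):.2f}")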
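
For the imbalanced-data tip, here is a hedged sketch assuming the third-party imbalanced-learn package is installed; SMOTE resamples only the training set so the test set remains untouched.

# Requires the imbalanced-learn package: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier

# Oversample the minority class(es) in the training data only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_resampled, y_resampled)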

Back to Kickstart ML with Python cookbook page