Kickstart ML with Python snippets

Getting started with classification (supervised learning) in Python

Classification is a type of supervised learning method used to predict a categorical outcome (often referred to as the target or class label) based on one or more predictor variables (features). The goal of classification is to assign new observations to one of the predefined classes.

Key Concepts in Classification

1. Types of Classification

  • Binary Classification: The target variable has two possible classes.

    • Examples: Spam vs. Not Spam, Fraudulent vs. Non-Fraudulent.
  • Multi-Class Classification: The target variable has more than two possible classes.

    • Examples: Handwritten digit recognition (0-9), Species classification in biology.
  • Multi-Label Classification: Each instance may belong to multiple classes simultaneously.

    • Examples: Tagging a text document with multiple topics.
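
The three problem types differ mainly in the shape of the target. The following is a minimal sketch using scikit-learn's synthetic data generators; the sample counts, class counts, and variable names are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification, make_multilabel_classification

# Binary: one label per sample, two distinct values
_, y_binary = make_classification(n_samples=100, n_classes=2, random_state=0)

# Multi-class: one label per sample, more than two distinct values
_, y_multiclass = make_classification(
    n_samples=100, n_classes=3, n_informative=4, random_state=0
)

# Multi-label: one 0/1 indicator row per sample, one column per label
_, y_multilabel = make_multilabel_classification(
    n_samples=100, n_classes=4, random_state=0
)

print("Binary:     ", y_binary.shape, np.unique(y_binary))
print("Multi-class:", y_multiclass.shape, np.unique(y_multiclass))
print("Multi-label:", y_multilabel.shape)
```

In the multi-label case a single row can contain several ones, which is what "belonging to multiple classes simultaneously" looks like in the data.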

2. Model Representation

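Many classification models estimate the probability of each class for a given input, and the class with the highest probability is chosen as the predicted class. Others, such as SVMs and KNN, assign a class directly from a decision boundary or a neighbor vote.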
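
To make the "highest probability wins" rule concrete, here is a small sketch that assumes a scikit-learn LogisticRegression fitted on a synthetic three-class dataset (the dataset parameters are arbitrary choices for illustration). It checks that picking the most probable class by hand gives the same result as the model's predict method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class problem (illustrative parameters)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X, y)

# One row per sample, one column per class; each row sums to 1
proba = model.predict_proba(X)

# Pick the class whose column holds the largest probability
manual_pred = model.classes_[np.argmax(proba, axis=1)]

print("Rows sum to 1:", np.allclose(proba.sum(axis=1), 1.0))
print("Matches predict():", np.array_equal(manual_pred, model.predict(X)))
```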

Logistic Regression Equation (Binary Classification): $$ P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)}}$$

  • \( P(y=1|x) \): Probability that the output is class 1 given the input features \( x \).
  • \( x_1, x_2, \ldots, x_n \): Input features.
  • \( \beta_0, \beta_1, \ldots, \beta_n \): Model parameters.
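
The equation can be evaluated by hand for a single observation. The sketch below uses made-up parameter and feature values (beta_0, beta, and x are hypothetical, chosen so the arithmetic is easy to follow) and a small helper named sigmoid.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one hypothetical observation
beta_0 = -1.0                 # intercept
beta = np.array([2.0, -0.5])  # coefficients beta_1, beta_2
x = np.array([1.0, 2.0])      # features x_1, x_2

# Linear combination: -1.0 + 2.0*1.0 + (-0.5)*2.0 = 0.0
z = beta_0 + beta @ x

# P(y=1|x) = 1 / (1 + e^0) = 0.5
p = sigmoid(z)
print(f"z = {z:.1f}, P(y=1|x) = {p:.2f}")
```

A linear combination of zero sits exactly on the decision boundary, so the model gives both classes a probability of 0.5; positive values of z push the probability of class 1 above 0.5 and negative values push it below.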

3. Key Metrics

  • Accuracy: The ratio of correctly predicted instances to the total instances.
  • Precision: The ratio of correctly predicted positive observations to the total predicted positives.
  • Recall (Sensitivity): The ratio of correctly predicted positive observations to all actual positives.
  • F1 Score: The harmonic mean of precision and recall.
  • Confusion Matrix: A table used to describe the performance of a classification model by comparing actual vs. predicted values.
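  • ROC Curve and AUC: The Receiver Operating Characteristic curve plots the true positive rate against the false positive rate across decision thresholds. The Area Under the Curve summarizes it in a single number, where 0.5 corresponds to random guessing and 1.0 to perfect separation.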
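
These definitions can be checked on a tiny example. The sketch below uses ten made-up labels and predictions (y_true and y_pred are hypothetical), derives the metrics from the confusion matrix counts, and compares them with scikit-learn's own functions.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical ground truth and predictions for ten samples
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 7 / 10
precision = tp / (tp + fp)                          # 3 / 4
recall = tp / (tp + fn)                             # 3 / 5
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} "
      f"Recall={recall:.2f} F1={f1:.2f}")

# The hand-computed values agree with scikit-learn
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```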

4. Common Algorithms

  • Logistic Regression: A linear model that estimates class probabilities with the logistic function, most commonly used for binary classification.
  • Decision Trees: A tree-like model of decisions used for classification tasks.
  • Random Forest: An ensemble method that uses multiple decision trees.
  • Support Vector Machines (SVM): A model that finds the hyperplane that best separates the classes.
  • K-Nearest Neighbors (KNN): A model that classifies based on the majority class among the k-nearest neighbors.
  • Naive Bayes: A probabilistic classifier based on Bayes' theorem with the assumption of independence among features.
  • Neural Networks: Models inspired by the human brain, used for complex pattern recognition.
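
All of these algorithms are available in scikit-learn behind the same fit/score interface, so they can be tried side by side. The following is a rough sketch on a synthetic dataset; the dataset parameters and the mostly default hyperparameters are assumptions made for illustration, and the resulting scores say nothing about how the algorithms rank on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary dataset (illustrative parameters)
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# One estimator per algorithm in the list above
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
}

# Fit each model and record its accuracy on the held-out test set
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)
    print(f"{name:<20} accuracy: {scores[name]:.2f}")
```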

Practical Example of Logistic Regression in Python

Here’s how you can implement a simple binary classification model using Python and the scikit-learn library:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)

# ROC Curve and AUC
y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
print(f"ROC AUC: {roc_auc:.2f}")

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()
                            
The code above proceeds in four steps:

  1. Data Preparation:

    • We generate a synthetic binary classification dataset using make_classification.
    • We split the dataset into training and testing sets using train_test_split.
  2. Model Fitting:

    • We create an instance of the LogisticRegression model and fit it to the training data.
  3. Making Predictions:

    • We use the predict method to make predictions on the test data.
  4. Model Evaluation:

    • We calculate the accuracy score.
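    • We generate a confusion matrix to see how many test samples fall into each combination of actual and predicted class.
    • We generate a classification report to get precision, recall, and F1 score for each class.
    • We plot the ROC curve and calculate the AUC to see how well the predicted probabilities separate the two classes across decision thresholds.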
