Kickstart ML with Python snippets

Understanding the basics of ROC curves

  1. True Positive Rate (TPR), also known as Sensitivity or Recall:

    • TPR measures the proportion of actual positives that are correctly identified by the model.
    • Formula: $$ \text{TPR} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} $$
  2. False Positive Rate (FPR):

    • FPR measures the proportion of actual negatives that are incorrectly identified as positives by the model.
    • Formula: $$ \text{FPR} = \frac{\text{False Positives (FP)}}{\text{False Positives (FP)} + \text{True Negatives (TN)}} $$
  3. Threshold:

    • The classification threshold determines the point at which the model's predicted probability is converted into a binary decision (positive or negative).
    • By varying this threshold, you generate different (FPR, TPR) pairs, which are plotted to form the ROC curve; a small numeric sketch follows this list.
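
To make the two formulas concrete, here is a minimal sketch (using made-up labels and predicted probabilities) that computes TPR and FPR at a single threshold of 0.5:

import numpy as np

# Hypothetical ground-truth labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

# Convert probabilities into binary decisions at one threshold
threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)

# Count the four confusion-matrix cells
tp = np.sum((y_pred == 1) & (y_true == 1))
fn = np.sum((y_pred == 0) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))

tpr = tp / (tp + fn)  # sensitivity / recall
fpr = fp / (fp + tn)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")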

How to Construct an ROC Curve:

  1. Calculate TPR and FPR:

    • For a series of threshold values (typically swept from 0 to 1), calculate the TPR and FPR at each threshold.
  2. Plot the Points:

    • Plot the TPR against the FPR for each threshold value.
  3. Connect the Points:

    • Connect these points to form the ROC curve; a NumPy sketch of all three steps follows.
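
The three steps above fit in a few lines of NumPy. This sketch reuses the toy y_true and y_prob arrays from the previous example and sweeps a grid of thresholds:

import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

# Step 1: calculate TPR and FPR for a series of thresholds
tpr_list, fpr_list = [], []
for threshold in np.linspace(0, 1, 11):
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    tpr_list.append(tp / (tp + fn))
    fpr_list.append(fp / (fp + tn))

# Steps 2 and 3: plotting fpr_list against tpr_list and connecting
# the points (e.g. with matplotlib, as in the full example below)
for f, t in zip(fpr_list, tpr_list):
    print(f"threshold point: FPR = {f:.2f}, TPR = {t:.2f}")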

Interpreting the ROC Curve:

  • Diagonal Line (45-degree line):

    • This represents a random classifier with no discrimination capability (TPR = FPR).
  • Above the Diagonal:

    • Points above the diagonal indicate better-than-random performance.
  • Closer to the Top-Left Corner:

    • The closer the ROC curve is to the top-left corner, the better the model's performance, since that corner combines high TPR with low FPR; a sketch of locating such a point follows this list.
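
One common way to make "closest to the top-left corner" precise, not covered above but widely used, is Youden's J statistic (J = TPR - FPR); this sketch picks the threshold that maximizes it on the toy data from earlier:

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# J = TPR - FPR is the vertical distance above the diagonal;
# its maximum marks the point nearest the top-left trade-off
best_idx = np.argmax(tpr - fpr)
print(f"Best threshold: {thresholds[best_idx]:.2f} "
      f"(TPR = {tpr[best_idx]:.2f}, FPR = {fpr[best_idx]:.2f})")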

Area Under the ROC Curve (AUC-ROC):

  • AUC (Area Under the Curve):
    • The AUC value represents the overall ability of the model to discriminate between positive and negative classes.
    • AUC ranges from 0 to 1 (a short computation sketch follows this list):
      • 0.5: Represents a random classifier.
      • >0.5: Represents a model better than random.
      • 1.0: Represents a perfect classifier.
      • <0.5: Represents a model worse than random (its predictions are systematically inverted).
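
As a quick sketch of how AUC is computed in practice: sklearn exposes it directly via roc_auc_score, which is equivalent to integrating the ROC curve itself (toy arrays reused from earlier):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

# Direct computation from labels and scores
print(f"AUC = {roc_auc_score(y_true, y_prob):.2f}")

# Equivalent: build the curve, then integrate under it
fpr, tpr, _ = roc_curve(y_true, y_prob)
print(f"AUC = {auc(fpr, tpr):.2f}")  # same value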

Python Example:

Here’s how you can plot an ROC curve using the sklearn library:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Generate a binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Predict probabilities
y_prob = model.predict_proba(X_test)[:, 1]

# Compute ROC curve and ROC area
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--', label='Random classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

Interpreting the ROC Curve and AUC

  1. ROC Curve Shape:

    • The curve starts at (0, 0) and ends at (1, 1).
    • A curve that bows towards the top-left corner indicates good performance. The model achieves high TPR while maintaining low FPR.
  2. Diagonal Line (Random Classifier):

    • The dashed line from (0, 0) to (1, 1) represents a random classifier with no skill. Points above this line indicate better-than-random performance.
  3. Area Under the Curve (AUC):

    • The AUC provides a single measure of overall model performance; values between 0.5 and 1.0 indicate better-than-random discrimination. An AUC of 0.93, for example, would indicate excellent performance.

Practical Tips

  • Threshold Selection: Depending on the application, you might choose different thresholds to balance TPR against FPR. In medical diagnostics, for example, you might prefer a higher TPR (sensitivity) even at the cost of a higher FPR; see the sketch after these tips.
  • Comparison of Models: ROC curves are useful for comparing the performance of different models. The model with the higher AUC is generally considered better.
  • Understanding Trade-offs: The ROC curve helps visualize the trade-off between sensitivity (TPR) and specificity (1 - FPR). This is crucial in applications where the cost of false positives and false negatives differs.
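
As a sketch of the threshold-selection tip, this snippet finds the lowest-FPR operating point that still reaches a target sensitivity of 0.9 (the target value and toy data are illustrative assumptions):

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# roc_curve returns points in order of increasing FPR, so the first
# point whose TPR meets the target is the cheapest one in FPR terms
target_tpr = 0.9
idx = np.argmax(tpr >= target_tpr)
print(f"Threshold {thresholds[idx]:.2f} gives "
      f"TPR = {tpr[idx]:.2f} at FPR = {fpr[idx]:.2f}")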
