Kickstart ML with Python snippets

Getting started with regression (supervised learning) in Python

Regression is a supervised learning method used to predict a continuous outcome variable (the dependent variable) from one or more predictor variables (the independent variables). The goal of regression analysis is to model the relationship between the dependent variable and the independent variables.

Key Concepts in Regression

1. Types of Regression

  • Linear Regression: Models the relationship between the dependent variable and one or more independent variables by fitting a linear equation to observed data.

    • Simple Linear Regression: Uses one independent variable.
    • Multiple Linear Regression: Uses more than one independent variable.
  • Polynomial Regression: A form of regression analysis in which the relationship between the independent variable and the dependent variable is modeled as an nth degree polynomial.

  • Logistic Regression: Used when the dependent variable is binary. It predicts the probability of a binary outcome rather than a continuous outcome.

  • Ridge and Lasso Regression: Regularized versions of linear regression that add a penalty to the model complexity to prevent overfitting (a short scikit-learn sketch of these variants follows this list).

    • Ridge Regression: Adds an L2 penalty.
    • Lasso Regression: Adds an L1 penalty, which can also lead to feature selection.
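
These variants all share the same fit/predict interface in scikit-learn. A minimal sketch, assuming toy data and arbitrary alpha values chosen only for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Toy data: one feature, five observations (values are arbitrary)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Simple linear regression
linear = LinearRegression().fit(X, y)

# Polynomial regression: expand X to powers of x, then fit a linear model
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

# Ridge (L2 penalty) and Lasso (L1 penalty); alpha sets the penalty strength
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Logistic regression expects a binary target
y_binary = (y > 3).astype(int)
logistic = LogisticRegression().fit(X, y_binary)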

2. Model Representation

The basic idea of regression is to find the best-fitting line (or hyperplane in multiple dimensions) that predicts the dependent variable based on the independent variables.

Simple Linear Regression Equation: $$ y = \beta_0 + \beta_1 x + \epsilon $$

  • \( y \): Dependent variable (what you want to predict).
  • \( x \): Independent variable (predictor).
  • \( \beta_0 \): Intercept (constant term).
  • \( \beta_1 \): Coefficient (slope).
  • \( \epsilon \): Error term (residuals).

Multiple Linear Regression Equation: $$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon $$
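
To make the equation concrete: with hypothetical coefficients \( \beta_0 = 1 \) and \( \beta_1 = 2 \), the model predicts \( \hat{y} = 1 + 2 \cdot 3 = 7 \) at \( x = 3 \). The same computation in code (the error term \( \epsilon \) describes the data, not the predictions):

import numpy as np

# Hypothetical coefficients for y = beta0 + beta1 * x
beta0, beta1 = 1.0, 2.0
x = np.array([1.0, 2.0, 3.0])

# Predicted values from the linear equation
y_hat = beta0 + beta1 * x
print(y_hat)  # [3. 5. 7.]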

3. Key Metrics

  • Mean Absolute Error (MAE): Average of absolute errors between predicted and actual values.
  • Mean Squared Error (MSE): Average of squared errors between predicted and actual values.
  • Root Mean Squared Error (RMSE): Square root of the mean squared error, providing error in the same units as the dependent variable.
  • R-squared (R²): Proportion of variance in the dependent variable that is predictable from the independent variables.
  • Adjusted R-squared: Adjusted for the number of predictors, providing a more accurate measure of model fit when multiple predictors are used.
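
All of these can be computed with a few lines of NumPy and scikit-learn. A minimal sketch using made-up actual and predicted values; note that adjusted R-squared has no dedicated scikit-learn function, so it is derived from its formula:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

# Adjusted R-squared from its formula, with n observations and p predictors
n, p = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MAE: {mae}, MSE: {mse}, RMSE: {rmse}")
print(f"R-squared: {r2}, Adjusted R-squared: {adj_r2}")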

4. Assumptions of Linear Regression

  • Linearity: The relationship between the independent and dependent variables is linear.
  • Independence: The observations are independent of each other.
  • Homoscedasticity: The residuals have constant variance at every level of the independent variable.
  • Normality: The residuals of the model are normally distributed.
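
A quick visual check of these assumptions is a plot of residuals against fitted values: a random, even scatter around zero supports linearity and homoscedasticity, while a curve or funnel shape suggests a violation. A minimal sketch, using hypothetical actual and fitted values:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical actual and fitted values for illustration
y_true = np.array([2.0, 3.0, 5.0, 7.0, 11.0])
y_fit = np.array([1.6, 3.9, 6.2, 8.5, 10.8])
residuals = y_true - y_fit

# Residuals vs. fitted values
plt.scatter(y_fit, residuals)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted')
plt.show()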

5. Evaluating Model Performance

  • Residual Plots: Used to check the assumptions of linear regression.
  • Cross-Validation: Technique to assess the model’s performance by dividing the data into training and validation sets.
  • Regularization Techniques: Methods like Ridge and Lasso regression to prevent overfitting by adding a penalty to the model complexity.
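
For cross-validation, scikit-learn's cross_val_score runs k-fold scoring in a single call. A minimal sketch, assuming toy data and an arbitrary choice of 5 folds:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy data: a noisy line, ten observations (values are arbitrary)
rng = np.random.default_rng(0)
X = np.arange(10).reshape(-1, 1)
y = 2 * X.ravel() + rng.normal(scale=0.5, size=10)

# 5-fold cross-validation; scoring='r2' reports R-squared per held-out fold
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(f"Mean R-squared: {scores.mean():.3f} (+/- {scores.std():.3f})")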

Practical Example of Linear Regression in Python

Here’s how you can implement a simple linear regression model using Python and the scikit-learn library:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 3, 5, 7, 11])

# Fit the linear regression model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Print coefficients
print(f"Intercept:{model.intercept_}")
print(f"Coefficient:{model.coef_[0]}")

# Calculate metrics
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)
print(f"MSE:{mse}")
print(f"RMSE:{rmse}")
print(f"R-squared:{r2}")

# Plot results
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression Example')
plt.show()

What the example does, step by step:

  1. Data Preparation: We define the independent variable \( X \) and the dependent variable \( y \) as NumPy arrays.
  2. Model Fitting: We create an instance of the LinearRegression model and fit it to the data.
  3. Making Predictions: We use the predict method to generate predicted \( y \) values from the fitted model.
  4. Coefficients: We print the intercept and the coefficient of the model.
  5. Model Evaluation: We calculate and print the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared value to evaluate the model's performance.
  6. Plotting: We visualize the original data points and the fitted line to judge the model fit.
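
Once fitted, the same model can score unseen inputs with predict. A minimal continuation of the example above (the new X values are made up for illustration):

# Predict for inputs the model has not seen
X_new = np.array([[6], [7]])
print(model.predict(X_new))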
