Kickstart ML with Python snippets
A high-level introduction to probability distributions with Python
Probability distributions describe how the values of a random variable are distributed. Different distributions are used depending on the type of data and the nature of the phenomenon being modeled.
1. Normal Distribution
The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution with a characteristic bell-shaped curve. It is defined by two parameters: the mean (µ) and the standard deviation (σ).
- Mean (µ): The center of the distribution.
- Standard Deviation (σ): The spread or width of the distribution.
Properties:
- Symmetrical around the mean.
- Approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations (Empirical Rule).
Formula: $$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} $$
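The density formula and the empirical rule above can be checked numerically. The following is a minimal sketch rather than part of the original example: the helper `normal_pdf` is a hypothetical name introduced here for illustration, and the comparison assumes SciPy is installed.

```python
import math
from scipy import stats


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution at x, written directly from the formula.

    Hypothetical helper for illustration only.
    """
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


mu, sigma = 0.0, 1.0

# The hand-written density should agree with SciPy's implementation.
manual = normal_pdf(1.5, mu, sigma)
reference = stats.norm.pdf(1.5, loc=mu, scale=sigma)
print(manual, reference)

# Empirical rule: probability mass within k standard deviations of the mean,
# computed as the difference of the cumulative distribution function.
coverage = {
    k: stats.norm.cdf(mu + k * sigma, mu, sigma)
    - stats.norm.cdf(mu - k * sigma, mu, sigma)
    for k in (1, 2, 3)
}
for k, prob in coverage.items():
    print(f"within {k} sigma: {prob:.4f}")
```

With µ = 0 and σ = 1, the three coverage values come out at roughly 0.683, 0.954, and 0.997, which matches the 68%, 95%, and 99.7% figures of the empirical rule.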
Use Case: Heights of people, measurement errors, IQ scores.
Practical Example in Python:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# Parameters
mu, sigma = 0, 1
# Generate data
data = np.random.normal(mu, sigma, 1000)
# Plot the distribution
plt.hist(data, bins=30, density=True, alpha=0.6, color='g')
# Plot the theoretical density function
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, sigma)
plt.plot(x, p, 'k', linewidth=2)
plt.title('Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()
2. Binomial Distribution
The binomial distribution is a discrete probability distribution that models the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success.
- n: Number of trials.
- p: Probability of success in each trial.
Properties:
- Discrete distribution: takes only the integer values 0, 1, ..., n.
- Describes the number of successes in n trials.
Formula: $$ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} $$ where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient.
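To make the formula concrete, the sketch below evaluates it term by term and compares the result with `scipy.stats.binom.pmf`. The function name `binomial_pmf` is a hypothetical helper introduced for illustration, and the parameters n = 10, p = 0.5 are simply the fair-coin case.

```python
import math

import numpy as np
from scipy import stats


def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p.

    Hypothetical helper for illustration only.
    """
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 10, 0.5
ks = np.arange(0, n + 1)

# Evaluate the formula for every possible number of successes.
binom_manual = np.array([binomial_pmf(int(k), n, p) for k in ks])
binom_reference = stats.binom.pmf(ks, n, p)

print("P(X = 5):", binom_manual[5])
print("sum of all probabilities:", binom_manual.sum())
print("matches SciPy:", np.allclose(binom_manual, binom_reference))
```

For ten fair coin flips, exactly five heads has probability 252/1024, or about 0.246, and the probabilities over k = 0, ..., 10 sum to 1.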
Use Case: Flipping a coin, number of defective products in a batch.
Practical Example in Python:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# Parameters
n, p = 10, 0.5
# Generate data
data = np.random.binomial(n, p, 1000)
# Plot the distribution
plt.hist(data, bins=np.arange(0, n+2) - 0.5, density=True, alpha=0.6, color='b')
# Plot the theoretical probability mass function
x = np.arange(0, n+1)
pmf = stats.binom.pmf(x, n, p)  # separate name so the success probability p is not overwritten
plt.plot(x, pmf, 'k', linewidth=2)
plt.title('Binomial Distribution')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.show()
3. Poisson Distribution
The Poisson distribution is a discrete probability distribution that models the number of events occurring within a fixed interval of time or space, provided the events occur at a constant mean rate and independently of the time since the last event.
- λ (lambda): The average number of events in the given interval.
Properties:
- Discrete distribution: takes the non-negative integer values 0, 1, 2, ...
- Describes the number of events in a fixed interval.
Formula: $$ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} $$
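The formula can be evaluated directly and compared with `scipy.stats.poisson.pmf`, as in the sketch below. The helper `poisson_pmf` is a hypothetical name introduced for illustration, and λ = 3 is an arbitrary choice. The last lines also check a standard property of the Poisson distribution, namely that its mean and variance both equal λ.

```python
import math

import numpy as np
from scipy import stats


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean rate lam.

    Hypothetical helper for illustration only.
    """
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 3
ks = np.arange(0, 15)

# Evaluate the formula for k = 0, ..., 14.
poisson_manual = np.array([poisson_pmf(int(k), lam) for k in ks])
poisson_reference = stats.poisson.pmf(ks, lam)

print("P(X = 0):", poisson_manual[0])  # equals exp(-lam)
print("matches SciPy:", np.allclose(poisson_manual, poisson_reference))

# The mean and the variance of a Poisson distribution are both lam.
print("mean:", stats.poisson.mean(lam), "variance:", stats.poisson.var(lam))
```

Because the distribution has no upper bound, the probabilities for k = 0, ..., 14 sum to slightly less than 1; for λ = 3 the remaining tail is negligible.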
Use Case: Number of emails received per hour, number of accidents at a traffic light.
Practical Example in Python:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# Parameter
lam = 3
# Generate data
data = np.random.poisson(lam, 1000)
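# Plot the distribution, with one bin centered on each integer count
# (an arbitrary bins=15 would misalign with the integers and distort the densities)
plt.hist(data, bins=np.arange(0, data.max() + 2) - 0.5, density=True, alpha=0.6, color='r')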
# Plot the theoretical probability mass function
x = np.arange(0, 15)
p = stats.poisson.pmf(x, lam)
plt.plot(x, p, 'k', linewidth=2)
plt.title('Poisson Distribution')
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.show()