Kickstart ML with Python snippets

Introduction to Hypothesis Testing with Python

Hypothesis Testing is a statistical method used to make decisions about a population parameter based on sample data. It involves formulating two competing hypotheses: the null hypothesis and the alternative hypothesis.

Key Concepts

1. Null Hypothesis (H0)

  • Definition: The null hypothesis is a statement that there is no effect or no difference, and it represents the status quo or a baseline condition. It is what we seek to test against.
  • Example: There is no difference in the mean test scores of two groups of students.

2. Alternative Hypothesis (H1 or Ha)

  • Definition: The alternative hypothesis is a statement that indicates the presence of an effect or a difference. It is the claim we seek evidence for; a test can support it but cannot prove it.
  • Example: There is a difference in the mean test scores of two groups of students.
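
As a rough illustration of the test-score example above, the sketch below compares two groups with a two-sample t-test. The scores are synthetic, and the group means, spread, and sizes are assumptions chosen only for demonstration. It uses scipy.stats.ttest_ind with equal_var=False (Welch's t-test), which is one reasonable choice here rather than the only one.

```python
import numpy as np
from scipy import stats

# Hypothetical test scores for two groups of students (synthetic data).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=70, scale=10, size=40)
group_b = rng.normal(loc=75, scale=10, size=40)

# H0: the two groups have the same mean score.
# H1: the mean scores differ (two-sided alternative).
# equal_var=False selects Welch's t-test, which does not assume equal variances.
result = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"t-statistic: {result.statistic:.3f}")
print(f"p-value: {result.pvalue:.4f}")
```

A small p-value here would count as evidence against H0 (no difference in means); a large one would mean the data do not contradict it.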

Steps in Hypothesis Testing

  1. Formulate Hypotheses:

    • Define the null hypothesis (H0) and the alternative hypothesis (H1).
  2. Select Significance Level (α):

    • Choose a significance level, commonly set at 0.05, which represents a 5% risk of rejecting the null hypothesis when it is true (a Type I error).
  3. Choose the Appropriate Test:

    • Select a statistical test based on the data and the hypotheses. Common tests include t-tests, chi-square tests, and ANOVA.
  4. Calculate Test Statistic:

    • Compute the test statistic using sample data.
  5. Determine the p-value:

    • The p-value is the probability of observing a test statistic at least as extreme as the one obtained, assuming the null hypothesis is true.
  6. Make a Decision:

    • Compare the p-value to the significance level (α). If the p-value is less than α, reject the null hypothesis; otherwise, fail to reject it.
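
Steps 4 to 6 can be sketched by hand for a one-sample t-test, where the statistic is t = (x̄ − μ0) / (s / √n). The sample values and the hypothesized mean mu_0 below are made up for illustration. The manual result is cross-checked against scipy.stats.ttest_1samp.

```python
import numpy as np
from scipy import stats

# Hypothetical sample and hypothesized mean (illustrative values only).
sample = np.array([5.1, 4.8, 6.2, 5.5, 4.9, 5.7, 6.0, 5.3])
mu_0 = 5.0
alpha = 0.05

n = sample.size
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)          # sample standard deviation (n - 1 denominator)
standard_error = sample_std / np.sqrt(n)

# Step 4: test statistic t = (x_bar - mu_0) / (s / sqrt(n))
t_manual = (sample_mean - mu_0) / standard_error

# Step 5: two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Step 6: decision
reject_h0 = p_manual < alpha

# Cross-check against SciPy's built-in one-sample t-test
t_scipy, p_scipy = stats.ttest_1samp(sample, mu_0)
print(f"manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

The two lines of output should agree, since both compute the same statistic and the same two-sided p-value.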

Practical Example in Python

Let's walk through an example of hypothesis testing using Python and the scipy library.

Example: One-Sample t-Test

Suppose we want to test whether the mean of a sample dataset is equal to a known population mean.

  1. Import Libraries:
import numpy as np
from scipy import stats
  2. Generate Sample Data:
# Generate a sample dataset
np.random.seed(0)
sample_data = np.random.normal(loc=5, scale=2, size=30)
# Known population mean
population_mean = 5
  3. Formulate Hypotheses:
  • Null Hypothesis (H0): The population from which the sample was drawn has a mean equal to the known population mean (5).
  • Alternative Hypothesis (H1): The population from which the sample was drawn has a mean different from the known population mean (5).
  4. Select Significance Level (α):
alpha = 0.05
  5. Perform One-Sample t-Test:
# Perform one-sample t-test
t_statistic, p_value = stats.ttest_1samp(sample_data, population_mean)

print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")
  6. Make a Decision:
# Compare p-value to significance level
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Explanation

  1. Data Generation:

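    • We draw 30 values from a normal distribution with mean 5 and standard deviation 2, then test the sample against a population mean of 5. Because the data are generated with the hypothesized mean, the null hypothesis is true by construction.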
  2. Hypotheses Formulation:

    • We define our null and alternative hypotheses based on the problem statement.
  3. Significance Level:

    • We choose a significance level (α) of 0.05, which is a common threshold in hypothesis testing.
  4. Performing the Test:

    • We use the ttest_1samp function from the scipy.stats module to perform a one-sample t-test.
    • The test returns a t-statistic and a p-value.
  5. Decision Making:

    • We compare the p-value to our significance level to decide whether to reject or fail to reject the null hypothesis.
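
To see what a significance level of 0.05 means in practice, the sketch below repeats the experiment above many times with the null hypothesis true in every run. The number of repetitions (n_experiments) and the random seed are arbitrary assumptions. Over many repetitions, the share of runs that reject H0 should land near α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
population_mean = 5
n_experiments = 2000   # hypothetical number of repeated experiments

# Each row is one experiment of 30 draws from a population whose true mean is 5,
# so the null hypothesis is true in every experiment.
samples = rng.normal(loc=5, scale=2, size=(n_experiments, 30))

# axis=1 runs one t-test per row
_, p_values = stats.ttest_1samp(samples, population_mean, axis=1)

# Fraction of experiments that (wrongly) reject the true null hypothesis
false_positive_rate = np.mean(p_values < alpha)
print(f"Share of experiments rejecting a true H0: {false_positive_rate:.3f}")
```

This is why a single rejection is not proof of an effect: even when nothing is going on, roughly one run in twenty is expected to cross the 0.05 threshold.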

Practical Advice

  1. Choosing the Right Test:

    • Use t-tests for comparing the means of one or two groups, chi-square tests for categorical (count) data, and ANOVA for comparing the means of more than two groups.
    • Ensure that the assumptions of the chosen test (e.g., normality, independence) are met.
  2. Interpreting p-values:

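    • A p-value less than the significance level (α) means the result is statistically significant at that level, leading to rejection of the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis.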
    • A p-value greater than α indicates insufficient evidence to reject the null hypothesis.
  3. Avoiding Common Pitfalls:

    • Do not confuse failing to reject the null hypothesis with accepting it. It simply means there isn't enough evidence against it.
    • Ensure the sample size is adequate to detect a meaningful effect.
  4. Reporting Results:

    • Clearly state the hypotheses, the test used, the test statistic, the p-value, and the decision.
    • Provide context and implications of the findings.
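
The test choices mentioned in the advice above map onto functions in scipy.stats. The sketch below is a minimal illustration using made-up data: the contingency table counts and the three group samples are assumptions, not results from any real study. It also shows the Shapiro-Wilk test as one possible way to check the normality assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Categorical data: hypothetical 2x2 contingency table of counts
# (rows: group A / group B, columns: pass / fail).
observed = np.array([[30, 10],
                     [22, 18]])
# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square: statistic={chi2:.3f}, dof={dof}, p-value={p_chi2:.4f}")

# More than two groups: one-way ANOVA on three hypothetical samples.
g1 = rng.normal(loc=5.0, scale=2, size=30)
g2 = rng.normal(loc=5.5, scale=2, size=30)
g3 = rng.normal(loc=6.0, scale=2, size=30)
f_stat, p_anova = stats.f_oneway(g1, g2, g3)
print(f"ANOVA: F={f_stat:.3f}, p-value={p_anova:.4f}")

# Assumption check: Shapiro-Wilk test, whose null hypothesis is that
# the data come from a normal distribution.
w_stat, p_shapiro = stats.shapiro(g1)
print(f"Shapiro-Wilk: W={w_stat:.3f}, p-value={p_shapiro:.4f}")
```

When reporting, each of these would be stated with its hypotheses, the test name, the statistic, the p-value, and the decision at the chosen α.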
