SciPy - Statistical Tests and Inference



Statistical tests and inference involve deriving conclusions about a population from sample data. These methodologies are fundamental for validating hypotheses, analyzing data trends, and making informed decisions in research, economics, engineering, and many other fields. SciPy's scipy.stats module offers a comprehensive set of tools to perform various statistical tests and data inferences.

Important Statistical Tests in SciPy

The scipy.stats library in Python includes a variety of functions to execute tests such as t-tests, chi-square tests and ANOVA, helping you validate assumptions and test hypotheses in different applications.

SciPy provides several statistical tests designed to assess different types of data and determine if observed differences or relationships are statistically significant. These tests play a critical role in hypothesis testing and analysis.

t-Test

A t-test is used to assess whether the means of two groups differ from one another; it is typically applied in situations such as comparing the results of two sample groups. The scipy.stats.ttest_ind() function performs a t-test on two independent samples.

The following example demonstrates how to perform a t-test on two datasets −

from scipy.stats import ttest_ind
import numpy as np

# Generate sample data
group1 = np.random.normal(0, 1, 100)
group2 = np.random.normal(0.5, 1, 100)

# Conduct the t-test
stat, p_value = ttest_ind(group1, group2)

print(f"t-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")

Here is the result of the t-test, showing the t-statistic and p-value, which help us determine whether the difference between the two groups is statistically significant −

t-statistic: -3.1020
p-value: 0.0022
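
As a quick follow-up, the p-value can be compared against a chosen significance level to decide whether the difference is significant. The sketch below assumes the conventional 0.05 threshold, which is a common choice rather than a value fixed by SciPy −

from scipy.stats import ttest_ind
import numpy as np

# Significance level (0.05 is a conventional choice, assumed for this sketch)
alpha = 0.05

# Generate sample data (results vary between runs since no seed is set)
group1 = np.random.normal(0, 1, 100)
group2 = np.random.normal(0.5, 1, 100)

# Conduct the t-test and interpret the p-value
stat, p_value = ttest_ind(group1, group2)

if p_value < alpha:
    print("Reject the null hypothesis: the group means appear to differ.")
else:
    print("Fail to reject the null hypothesis: no significant difference detected.")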

Chi-Squared Test

The Chi-Squared Test is typically used to analyze categorical data, determining whether there is an association between two categorical variables. It's useful in situations like contingency tables where data is grouped into categories.

To perform a Chi-Squared test, SciPy provides the scipy.stats.chi2_contingency() function −

from scipy.stats import chi2_contingency
import numpy as np

# Example data in a contingency table
data = np.array([[10, 20], [20, 30]])

# Run the chi-squared test
chi2_stat, p_val, dof, expected = chi2_contingency(data)

print(f"Chi-squared statistic: {chi2_stat:.4f}")
print(f"p-value: {p_val:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"Expected values: \n{expected}")

Below is the output of the Chi-squared test showing the statistic, p-value, degrees of freedom, and expected values:

Chi-squared statistic: 0.1280
p-value: 0.7205
Degrees of freedom: 1
Expected values:
[[11.25 18.75]
 [18.75 31.25]]

ANOVA (Analysis of Variance)

ANOVA tests whether there are significant differences among the means of three or more groups. It's useful when comparing multiple datasets to determine if at least one of them is different from the others.

To perform a one-way ANOVA, we can use the scipy.stats.f_oneway() function. The following example performs the ANOVA test −

from scipy.stats import f_oneway
import numpy as np

# Example data from three groups
group1 = np.random.normal(0, 1, 100)
group2 = np.random.normal(1, 1, 100)
group3 = np.random.normal(2, 1, 100)

# Run one-way ANOVA
f_stat, p_value = f_oneway(group1, group2, group3)

print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value:.4f}")

Here's the result of the ANOVA test showing the F-statistic and p-value, which help us assess whether the group means are statistically different:

F-statistic: 75.5012
p-value: 0.0000

Normality Tests

To determine whether a dataset follows a normal distribution, we can use normality tests such as the Shapiro-Wilk test or D'Agostino and Pearson's test, both available in SciPy. The scipy.stats.shapiro() function conducts the Shapiro-Wilk test to check normality −

from scipy.stats import shapiro
import numpy as np

# Example data
data = np.random.normal(0, 1, 100)

# Perform Shapiro-Wilk normality test
stat, p_value = shapiro(data)

print(f"Test statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")

Following is the output of the Shapiro-Wilk test, which helps to evaluate whether the sample data is consistent with a normal distribution −

Test statistic: 0.9878
p-value: 0.4939
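
The D'Agostino and Pearson's test mentioned above is also available in SciPy as scipy.stats.normaltest(). The following is a minimal sketch of its use on randomly generated data, so the exact statistic and p-value will vary between runs −

from scipy.stats import normaltest
import numpy as np

# Example data (randomly generated, so results will vary)
data = np.random.normal(0, 1, 100)

# Perform the D'Agostino and Pearson's normality test
stat, p_value = normaltest(data)

print(f"Test statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")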

Using Statistical Inference in SciPy

SciPy provides essential tools for making inferences about a population from sample data. The key concepts are listed below, followed by a short sketch that ties them together −

  • p-value: This is used to determine the statistical significance of test results. A p-value below a threshold (commonly 0.05) suggests a significant result.
  • Confidence Intervals: Estimate the range in which a population parameter (such as the mean) lies based on sample data.
  • Effect Size: Quantifies the magnitude of an observed effect or difference.
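
The short sketch below ties these three ideas together on two randomly generated samples. It is only an outline under common assumptions: the 0.05 threshold and the 95% confidence level are conventional choices, and Cohen's d is computed by hand here as one possible measure of effect size −

from scipy import stats
import numpy as np

# Two sample groups (randomly generated, so numbers will vary between runs)
group1 = np.random.normal(0, 1, 100)
group2 = np.random.normal(0.5, 1, 100)

# p-value: independent two-sample t-test, judged against a 0.05 threshold
t_stat, p_value = stats.ttest_ind(group1, group2)
significant = "significant" if p_value < 0.05 else "not significant"
print(f"p-value: {p_value:.4f} ({significant} at the 0.05 level)")

# Confidence interval: 95% interval for the mean of group1 using the t distribution
ci_low, ci_high = stats.t.interval(0.95, len(group1) - 1,
                                   loc=np.mean(group1), scale=stats.sem(group1))
print(f"95% confidence interval for group1 mean: ({ci_low:.4f}, {ci_high:.4f})")

# Effect size: Cohen's d from the pooled standard deviation of the two groups
pooled_std = np.sqrt((np.var(group1, ddof=1) + np.var(group2, ddof=1)) / 2)
cohens_d = (np.mean(group1) - np.mean(group2)) / pooled_std
print(f"Cohen's d: {cohens_d:.4f}")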

Using these methods, researchers can perform thorough statistical analyses and make decisions backed by solid evidence from their data.
