SciPy - Descriptive Statistics



Descriptive statistics is a branch of statistics that focuses on summarizing and organizing data to reveal meaningful insights. It helps in understanding the distribution, central tendency, and variability of data. The Python library SciPy, particularly its stats module, provides various functions to compute descriptive statistics efficiently.
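As a quick orientation, the following minimal sketch uses scipy.stats.describe() to compute several summary statistics in a single call. The sample data is an assumption chosen for illustration. Note that describe() reports the variance with ddof=1 and the kurtosis as excess kurtosis.

```python
from scipy import stats

# Illustrative sample data (an assumption, not taken from a real dataset)
data = [10, 20, 30, 40, 50]

# describe() returns a named tuple holding several summary statistics
summary = stats.describe(data)

print("Count:", summary.nobs)
print("Min/Max:", summary.minmax)
print("Mean:", summary.mean)
print("Variance (ddof=1):", summary.variance)
print("Skewness:", summary.skewness)
print("Kurtosis (excess):", summary.kurtosis)
```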

Key Measures in Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset. These measures fall into three main categories: central tendency, dispersion, and shape.

Measures of Central Tendency in SciPy

Measures of central tendency summarize a dataset by identifying a single value that represents the center or "typical" value of the data. The three main measures of central tendency are described below −

Mean (Arithmetic Average)

The mean is calculated by summing all data points and dividing by the total number of points. It is sensitive to outliers, which can significantly affect its value. The formula for the mean is given below −

$$\text{Mean} = \frac{\sum X}{N}$$

Below is an example of finding the mean with the scipy.stats.tmean() function. It computes a trimmed mean, which equals the arithmetic mean when no limits are given −

from scipy import stats

data = [10, 20, 30, 40, 50]

# Calculate mean using SciPy
mean_value = stats.tmean(data)
print("Mean:", mean_value)

Here is the output of the mean computed with the scipy.stats.tmean() function −

Mean: 30.0
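To connect the formula to the function call, the sketch below computes the mean by hand and compares it with scipy.stats.tmean(). It also shows the limits parameter, which ignores values outside a chosen interval. The outlier value 1000 and the limits (0, 100) are assumptions chosen for illustration.

```python
from scipy import stats

data = [10, 20, 30, 40, 50]

# Mean straight from the formula: sum of the values divided by their count
manual_mean = sum(data) / len(data)

# tmean with no limits is the ordinary arithmetic mean
scipy_mean = stats.tmean(data)

# A single extreme value (illustrative) shifts the mean considerably
data_with_outlier = data + [1000]
plain_mean = stats.tmean(data_with_outlier)

# With limits, values outside [0, 100] are ignored before averaging
trimmed_mean = stats.tmean(data_with_outlier, limits=(0, 100))

print("Manual mean:", manual_mean)
print("tmean:", scipy_mean)
print("Mean with outlier:", plain_mean)
print("Trimmed mean:", trimmed_mean)
```

The trimmed result matches the mean of the original five values because the limits exclude only the extreme point.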

Median

The median is the value that falls in the center of a sorted dataset. When there is an even number of data points, the median is the average of the two middle values. Unlike the mean, the median is less affected by outliers.

Here is an example that calculates the median as the 50th percentile with the scipy.stats.scoreatpercentile() function −

from scipy import stats

# Sample data
data = [10, 20, 30, 40, 50]

# Calculate median using SciPy's scoreatpercentile
median_value = stats.scoreatpercentile(data, 50)
print("Median:", median_value)

Below is the output of the median calculated using the function scipy.stats.scoreatpercentile()

Median: 30.0
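The two properties described above, averaging the middle values for an even count and robustness to outliers, can be checked with a short sketch. It uses numpy.median() alongside scoreatpercentile(); both datasets are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

even_data = [10, 20, 30, 40]            # even number of points
outlier_data = [10, 20, 30, 40, 5000]   # one extreme value (illustrative)

# Even count: the median is the average of the two middle values (20 and 30)
median_even = np.median(even_data)
percentile_even = stats.scoreatpercentile(even_data, 50)

# The outlier pulls the mean far from the bulk of the data, but not the median
median_outlier = np.median(outlier_data)
mean_outlier = np.mean(outlier_data)

print("Median (even count):", median_even)
print("50th percentile (even count):", percentile_even)
print("Median with outlier:", median_outlier)
print("Mean with outlier:", mean_outlier)
```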

Mode

The mode is the value that occurs most frequently in the dataset. A dataset with more than one mode is called multimodal.

The following example calculates the mode with the scipy.stats.mode() function −

from scipy import stats

# Sample data
data = [10, 20, 20, 30, 40]

# Calculate mode using SciPy
mode_value = stats.mode(data)

# Access mode and count correctly
print("Mode:", mode_value.mode, "Frequency:", mode_value.count)

Below is the output of the Mode calculated using the function scipy.stats.mode()

Mode: 20 Frequency: 2
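For multimodal data, scipy.stats.mode() reports only one value (the smallest of the tied values). One possible way to list every mode is to count the distinct values with numpy.unique(), as in the sketch below. The dataset is an assumption chosen so that two values tie.

```python
import numpy as np
from scipy import stats

# Illustrative data in which 20 and 30 both occur twice
data = [10, 20, 20, 30, 30, 40]

# stats.mode reports a single value; for ties it returns the smallest one
result = stats.mode(data)

# Count every distinct value, then keep those that reach the highest count
values, counts = np.unique(data, return_counts=True)
all_modes = values[counts == counts.max()].tolist()

print("stats.mode:", result.mode, "count:", result.count)
print("All modes:", all_modes)
```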

Measures of Dispersion in SciPy

Measures of dispersion indicate how data values are spread out or dispersed within a dataset. They help determine the variability or consistency of data points relative to each other. The key measures of dispersion are described below −

Range

The range is the simplest way to measure dispersion, calculated by subtracting the smallest value from the largest value in the dataset. Although it gives a quick sense of data spread, it is highly influenced by outliers.

Here is an example that computes the range using Python's built-in max() and min() functions; numpy.ptp() returns the same value −

# Sample data
data = [10, 20, 20, 30, 40]

range_value = max(data) - min(data)
print("Range:", range_value)

Here is the output of the range calculation −

Range: 30
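Because the range depends only on the two extreme values, a single outlier can change it drastically. The sketch below computes the range with numpy.ptp() and compares it with the interquartile range from scipy.stats.iqr(), which is offered here as a more outlier-resistant alternative. The outlier value 400 is an assumption chosen for illustration.

```python
import numpy as np
from scipy import stats

data = [10, 20, 20, 30, 40]

# Peak-to-peak: the same max - min computation in a single call
range_value = np.ptp(data)

# Interquartile range: 75th percentile minus 25th percentile
iqr_value = stats.iqr(data)

# Replace the largest value with an extreme one (illustrative)
data_with_outlier = [10, 20, 20, 30, 400]
range_outlier = np.ptp(data_with_outlier)
iqr_outlier = stats.iqr(data_with_outlier)

print("Range:", range_value, "IQR:", iqr_value)
print("Range with outlier:", range_outlier, "IQR with outlier:", iqr_outlier)
```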

Variance

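Variance measures how far the data values deviate from the mean. The population variance is the average of the squared differences between each data point and the mean. A higher variance indicates more spread-out data.

The mathematical representation of the population variance is given below −

$$\text{Variance} = \frac{\sum (X - \text{Mean})^2}{N}$$

The sample variance divides by N - 1 instead of N to correct the bias introduced by estimating the mean from the same data. The scipy.stats.tvar() function computes the sample variance by default (ddof=1), as the following example shows −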

from scipy import stats

data = [10, 20, 30, 40, 50]

# Calculate variance using SciPy
variance_value = stats.tvar(data)
print("Variance:", variance_value)

Here is the output of the variance calculation using scipy.stats.tvar() function −

Variance: 250.0
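The value 250.0 is a sample variance (divisor N - 1); dividing by N instead gives 200.0 for the same data. The following sketch works through both versions by hand and compares them with numpy.var() and scipy.stats.tvar().

```python
import numpy as np
from scipy import stats

data = [10, 20, 30, 40, 50]
n = len(data)
mean = sum(data) / n

# Squared distance of each point from the mean
squared_deviations = [(x - mean) ** 2 for x in data]

population_variance = sum(squared_deviations) / n      # divide by N
sample_variance = sum(squared_deviations) / (n - 1)    # divide by N - 1

print("Population variance (manual):", population_variance)
print("Sample variance (manual):", sample_variance)
print("np.var, ddof=0:", np.var(data))
print("np.var, ddof=1:", np.var(data, ddof=1))
print("stats.tvar:", stats.tvar(data))
```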

Standard Deviation

Standard deviation is the square root of the variance and measures dispersion in the same units as the original data. It indicates how far the values typically lie from the mean.

The example below computes the sample standard deviation (ddof=1 by default) using the scipy.stats.tstd() function −

from scipy import stats

data = [10, 20, 30, 40, 50]

# Calculate standard deviation using SciPy
std_deviation = stats.tstd(data)
print("Standard Deviation:", std_deviation)

Below is the output of the standard deviation calculation using scipy.stats.tstd() function −

Standard Deviation: 15.811388300841896
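As a check on the relationship between the two measures, the sketch below takes the square root of the tvar() result and compares it with tstd(). It also contrasts NumPy's default population version (ddof=0) with the sample version (ddof=1).

```python
import math

import numpy as np
from scipy import stats

data = [10, 20, 30, 40, 50]

# Standard deviation is the square root of the variance
std_from_variance = math.sqrt(stats.tvar(data))
std_scipy = stats.tstd(data)

# NumPy defaults to the population version (ddof=0)
std_population = np.std(data)
std_sample = np.std(data, ddof=1)

print("sqrt(tvar):", std_from_variance)
print("tstd:", std_scipy)
print("np.std, ddof=0:", std_population)
print("np.std, ddof=1:", std_sample)
```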

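Measures of Shape in SciPy

Measures of shape describe how the data is distributed around its center, beyond location and spread. The two common measures are skewness and kurtosis.

Skewness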

Skewness measures the asymmetry of a dataset's distribution around its mean. A positive skewness indicates a long right tail, whereas a negative skewness indicates a long left tail. By default, scipy.stats.skew() computes the Fisher-Pearson coefficient of skewness from the second and third central moments, as given below −

$$\text{Skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\right)^{3/2}}$$

Below is an example of how to calculate Skewness using the scipy.stats.skew() function −

from scipy import stats

data = [10, 20, 20, 30, 40, 50, 60]

# Calculate skewness using SciPy
skewness_value = stats.skew(data)
print("Skewness:", skewness_value)

Here is the output when calculating Skewness using the function scipy.stats.skew()

Skewness: 0.28372927689018057
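The result can be reproduced from central moments. The sketch below computes the third central moment divided by the second central moment raised to the power 3/2, which is the default (bias=True) definition used by scipy.stats.skew(). It also shows the bias=False option, which applies a small-sample correction.

```python
import numpy as np
from scipy import stats

data = np.array([10, 20, 20, 30, 40, 50, 60], dtype=float)

# Central moments: averages of powers of the deviations from the mean
deviations = data - data.mean()
m2 = np.mean(deviations ** 2)
m3 = np.mean(deviations ** 3)

# Default (biased) skewness: third moment over second moment to the 3/2
manual_skewness = m3 / m2 ** 1.5
scipy_skewness = stats.skew(data)

# bias=False applies a correction for small sample sizes
corrected_skewness = stats.skew(data, bias=False)

print("Manual skewness:", manual_skewness)
print("stats.skew:", scipy_skewness)
print("stats.skew, bias=False:", corrected_skewness)
```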

Kurtosis

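Kurtosis measures the heaviness of the tails of a data distribution. High kurtosis suggests the presence of outliers or extreme values, while low kurtosis indicates a distribution with fewer outliers. By default, scipy.stats.kurtosis() returns the excess (Fisher) kurtosis, which subtracts 3 so that a normal distribution scores 0, as given below −

$$\text{Kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\right)^{2}} - 3$$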

Below is an example of calculating Kurtosis using the scipy.stats.kurtosis() function −

from scipy import stats

data = [10, 20, 20, 30, 40, 50, 60]

# Calculate kurtosis using SciPy
kurtosis_value = stats.kurtosis(data)
print("Kurtosis:", kurtosis_value)

Here is the output when calculating Kurtosis using the function scipy.stats.kurtosis()

Kurtosis: -1.2208044982698956
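The negative value is an excess kurtosis, measured relative to a normal distribution. The sketch below reproduces it from the second and fourth central moments and shows the fisher=False option, which returns the Pearson kurtosis without subtracting 3.

```python
import numpy as np
from scipy import stats

data = np.array([10, 20, 20, 30, 40, 50, 60], dtype=float)

# Central moments: averages of powers of the deviations from the mean
deviations = data - data.mean()
m2 = np.mean(deviations ** 2)
m4 = np.mean(deviations ** 4)

# Excess (Fisher) kurtosis: fourth moment over squared second moment, minus 3
manual_excess = m4 / m2 ** 2 - 3
scipy_excess = stats.kurtosis(data)

# Pearson kurtosis: the same ratio without subtracting 3
scipy_pearson = stats.kurtosis(data, fisher=False)

print("Manual excess kurtosis:", manual_excess)
print("stats.kurtosis:", scipy_excess)
print("stats.kurtosis, fisher=False:", scipy_pearson)
```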