Module `stats`

A grab bag of common descriptive and inferential statistical functions.

Useful for statistical analysis.

Credit goes to sans.9536 of the Minecraft Computer Mods Discord for these functions and the original file.

Info:

Author: sans.9536

Utility Functions

find (tbl, value)	Find index of value in an array-like table (linear search)
testdata (n[, a=0[, b=1]])	Generates n quantitative data points between a and b

Basic Descriptive Statistics

sum (data)	Computes the sum of numeric values in the dataset
mean (data)	Computes the arithmetic mean (average) of the dataset
min (data)	Computes the minimum value of the dataset
max (data)	Computes the maximum value of the dataset
variance (data[, isSample=false])	Computes population variance by default.
stdev (data[, isSample=false])	Computes the standard deviation of the dataset (the square root of the variance; set isSample to true to use the sample variance)
size (data)	Number of elements in the dataset
range (data)	Computes the range of the dataset
median (data)	Computes the median (50th percentile). If the dataset has an even length, returns the average of the two middle values.
mode (data)	Computes the mode and returns a sorted array of the most frequent value(s)
sex (data)	Computes the standard error of the mean (commonly called SE), which indicates how precisely the sample mean estimates the population mean
skewness (data)	Computes the asymmetry (skewness) of the data around its mean. Requires at least 3 observations.
kurtosis (data)	Computes the kurtosis, a measure of how heavy the tails and how sharp the peak of a distribution are, using an excess kurtosis formula variant. Requires at least 4 observations.

Alternative Means

geometricMean (data)	Computes the geometric mean. Only defined for strictly positive values. Useful for multiplicative rates, such as combining percentage growth rates.
harmonicMean (data)	Computes the harmonic mean. Best suited to averaging rates and ratios, or other inversely proportional quantities.
trimmedMean (data[, trim=0.05])	Computes the trimmed mean. Removes the `trim` fraction of values from each tail and averages the rest. More robust to outliers than the arithmetic mean because the extremes are discarded.
winsorMean (data[, alpha=0.05])	Computes the winsorized mean. Clamps extreme values to the given quantiles before averaging. More robust to outliers than the arithmetic mean.
weightedMean (data, weights)	Computes the weighted mean

Robust Statistics

madMedian (data)	Computes the median absolute deviation (MAD). More robust than the standard deviation as a measure of variability when outliers are present.

Inequality Measures

gini (data)	Computes the Gini coefficient of inequality (0..1). Returns 0 for datasets with fewer than 2 values or a mean of zero. Represents how unequally a dataset is distributed; usually explained in terms of income, but it works for other datasets too.

Bivariate Statistics

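covariance (x, y)	Computes the sample covariance (uses n-1 denominator). Measures how the two datasets vary together: positive when they tend to rise and fall together, negative when one rises as the other falls.
correlation (x, y)	Computes the Pearson correlation coefficient. Returns 0 if undefined (for example, when either dataset has zero variance). Measures the strength and direction of the linear relationship on a scale from -1 to 1, which makes it easier to interpret than the covariance.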

Quantiles and Outliers

percentile (data, p)	Computes the p-th percentile by linear interpolation between order statistics. Sorts the dataset and returns the value at the position given by p, interpolating between neighboring elements when that position falls between two indices. p must be in [0,1].
quartiles (data)	Computes the first and third quartiles as a table (`{ Q1 = ..., Q3 = ... }`)
iqr (data)	Computes the interquartile range Q3 - Q1, the distance between the first and third quartiles
outliers (data)	Computes a list of all outliers in the dataset. A value counts as an outlier if it lies more than 1.5 times the IQR below the first quartile or above the third quartile.

Linear Regression

linReg (x, y)	Computes a simple least-squares linear regression (y ~ a + b x). Finds the line that minimizes the sum of squared vertical distances to the data points.
linRegPred (model, xval)	Predicts the y value for a given linear regression model at a given position
r2 (x, y, model)	Computes the coefficient of determination R^2 for a linear model. Represents the proportion of the variation in y that is explained by x under the fitted model.

Hypothesis Testing - Student's t Distribution

tCDF (t, df)	Calculates the cumulative distribution function of Student's t distribution. Uses the relationship between the t distribution and the incomplete beta function, which is evaluated with a continued fraction approximation.
normalCDF (x)	Calculates the normal (Gaussian) cumulative distribution function. Helper function for tCDF when df is large; uses an approximation of the error function.
incopmleteBeta (x, a, b)	Calculates the incomplete beta function I_x(a,b). Helper function for the tCDF calculation; uses a continued fraction approximation.
logGamma (x)	Calculates the logarithm of the gamma function. Helper function for incopmleteBeta; uses Lanczos approximation coefficients.

Hypothesis Testing - t-tests

oneSampleTTest (data, mu0)	Performs a one-sample t-test. Tests whether the sample mean differs significantly from a hypothesized population mean. Returns the p-value; p < 0.05 is the conventional threshold for calling the difference statistically significant.
twoSampleTTest (x, y, equalVar)	Performs a two-sample t-test. Tests whether the means of two independent samples differ significantly. Returns the p-value; p < 0.05 is the conventional threshold for calling the difference statistically significant.
pairTTest (before, after)	Performs a paired t-test. Tests whether the mean difference between paired observations is significant. Useful for before/after comparisons on the same subjects.
linRegTTest (x, y)	Performs a t-test on the linear regression slope. Checks a dataset against its linear regression.

Utility Functions

find (tbl, value)

Find index of value in an array-like table (linear search)

testdata (n[, a=0[, b=1]])

Generates n quantitative data points between a and b

Basic Descriptive Statistics

sum (data)

Computes the sum of numeric values in the dataset

mean (data)

Computes the arithmetic mean (average) of the dataset

min (data)

Computes the minimum value of the dataset

max (data)

Computes the maximum value of the dataset

variance (data[, isSample=false])

Computes population variance by default.

Set isSample to true for sample variance.

stdev (data[, isSample=false])

Computes the standard deviation of the dataset (the square root of the variance)

Set isSample to true to use the sample variance.
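The difference between the two denominators can be sketched as follows. This is a Python illustration of the standard formulas, not the module's own code; the names `variance`, `stdev`, and `is_sample` only mirror the signatures above and are otherwise assumptions.

```python
import math

def variance(data, is_sample=False):
    """Population variance (divide by n) unless is_sample is True (divide by n - 1)."""
    n = len(data)
    mean = sum(data) / n
    squared_dev = sum((x - mean) ** 2 for x in data)
    return squared_dev / (n - 1 if is_sample else n)

def stdev(data, is_sample=False):
    """Standard deviation as the square root of the chosen variance."""
    return math.sqrt(variance(data, is_sample))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data))        # 4.0 (population)
print(stdev(data))           # 2.0
print(variance(data, True))  # sample variance, 32 / 7
```

The sample form is always slightly larger, and the gap shrinks as n grows.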

size (data)

Number of elements in the dataset

range (data)

Computes the range of the dataset

median (data)

Computes the median (50th percentile)

If the dataset has an even length, returns the average of the two middle values
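The odd and even cases can be illustrated with a short Python sketch. It is written for this page as an example of the rule described above, under the assumption that the input is non-empty; it is not taken from the module.

```python
def median(data):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```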

mode (data)

Computes the mode and returns a sorted array of the most frequent value(s)
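Because several values can tie for the highest frequency, the result is a list rather than a single value. A minimal Python sketch of that behavior, assuming hashable, sortable values (the implementation details are illustrative, not the module's):

```python
from collections import Counter

def mode(data):
    """Return every value that occurs with the highest frequency, sorted ascending."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

print(mode([1, 2, 2, 3, 3]))  # [2, 3]
print(mode([5, 1, 5]))        # [5]
```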

sex (data)

Computes the standard error of the mean (commonly called SE), which indicates how precisely the sample mean estimates the population mean
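The textbook definition is SE = s / sqrt(n), where s is the sample standard deviation. The Python sketch below assumes that definition (with the n-1 denominator); whether the module uses the sample or population standard deviation is not stated here, so treat that choice as an assumption.

```python
import math

def standard_error(data):
    """Standard error of the mean: sample standard deviation divided by sqrt(n)."""
    n = len(data)
    mean = sum(data) / n
    sample_var = sum((x - mean) ** 2 for x in data) / (n - 1)
    return math.sqrt(sample_var / n)

print(standard_error([2, 4, 4, 4, 5, 5, 7, 9]))
```

Quadrupling the number of observations roughly halves the standard error, all else being equal.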

skewness (data)

Computes the asymmetry (skewness) of data around its mean

Requires at least 3 observations

kurtosis (data)

Computes the kurtosis using an excess kurtosis formula variant

Requires at least 4 observations and indicates how heavy the tails and how sharp the peak of a distribution are

Alternative Means

geometricMean (data)

Computes the geometric mean

Only defined for strictly positive values

Useful for multiplicative rates such as combining percentage growth rates
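One common way to compute this is to average the logarithms and exponentiate, which avoids overflow from multiplying many values. The Python sketch below takes that route as an assumption; the module may compute it differently.

```python
import math

def geometric_mean(data):
    """n-th root of the product, computed via the mean of logarithms."""
    if any(x <= 0 for x in data):
        raise ValueError("geometric mean requires strictly positive values")
    return math.exp(sum(math.log(x) for x in data) / len(data))

# Growth factors for +10% followed by -10%: the average factor is below 1,
# because the two changes do not cancel out.
print(geometric_mean([1.10, 0.90]))
```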

harmonicMean (data)

Computes the harmonic mean

Best suited to averaging rates and ratios, or other inversely proportional quantities
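A classic case is average speed over equal distances. The following Python sketch shows the formula n / sum(1/x); it assumes all values are non-zero and is an illustration rather than the module's code.

```python
def harmonic_mean(data):
    """Number of values divided by the sum of their reciprocals."""
    return len(data) / sum(1 / x for x in data)

# Driving the same distance at 60 and then at 40 gives an average speed of 48,
# not the arithmetic mean of 50.
print(harmonic_mean([60, 40]))
```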

trimmedMean (data[, trim=0.05])

Computes the trimmed mean

Removes the trim fraction of values from each tail and averages the rest

More robust to outliers than the arithmetic mean because the extremes are discarded
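A Python sketch of the idea follows. How the number of dropped values is rounded is an assumption here (rounded down); the module's rounding rule is not documented above.

```python
import math

def trimmed_mean(data, trim=0.05):
    """Drop floor(n * trim) values from each end of the sorted data, then average."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(data)
    k = math.floor(len(ordered) * trim)  # assumption: cut count is rounded down
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(trimmed_mean(data, 0.1))  # 5.5, the outlier 100 is discarded
print(sum(data) / len(data))    # 14.5, the plain mean is pulled up by it
```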

winsorMean (data[, alpha=0.05])

Computes the winsorized mean

Clamps extreme values to given quantiles before averaging

More robust to outliers than the arithmetic mean
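Unlike trimming, winsorizing keeps every observation but pulls the extremes in to the quantile limits. The Python sketch below assumes the limits are the `alpha` and `1 - alpha` quantiles found by linear interpolation; the helper `quantile` is hypothetical and the module's exact quantile rule may differ.

```python
def quantile(ordered, p):
    """Linear interpolation between order statistics of already sorted data."""
    pos = p * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])

def winsor_mean(data, alpha=0.05):
    """Clamp values to the [alpha, 1 - alpha] quantile range, then average."""
    ordered = sorted(data)
    low = quantile(ordered, alpha)
    high = quantile(ordered, 1 - alpha)
    clamped = [min(max(x, low), high) for x in data]
    return sum(clamped) / len(clamped)

print(winsor_mean([1, 2, 3, 4, 100], 0.25))  # 3.0, values become 2, 2, 3, 4, 4
```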

weightedMean (data, weights)

Computes the weighted mean
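The weighted mean is sum(x_i * w_i) / sum(w_i). A short Python sketch, assuming `data` and `weights` have the same length and the weights do not sum to zero (the error handling is illustrative):

```python
def weighted_mean(data, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(data) != len(weights):
        raise ValueError("data and weights must have the same length")
    return sum(x * w for x, w in zip(data, weights)) / sum(weights)

# A score of 80 with weight 1 and a score of 90 with weight 3.
print(weighted_mean([80, 90], [1, 3]))  # 87.5
```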

Robust Statistics

madMedian (data)

Computes the median absolute deviation (MAD)

More robust than the standard deviation as a measure of variability when outliers are present
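The MAD is the median of the absolute deviations from the median. The Python sketch below returns the raw MAD without any scaling constant; whether the module applies a consistency factor is not stated above, so that is an assumption.

```python
from statistics import median

def mad(data):
    """Median of the absolute deviations from the data's median (unscaled)."""
    center = median(data)
    return median([abs(x - center) for x in data])

# The outlier 100 barely affects the result.
print(mad([1, 2, 3, 4, 100]))  # 1
```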

Inequality Measures

gini (data)

Computes the Gini coefficient of inequality (0..1)

Returns 0 for datasets with fewer than 2 values or a mean of zero

Represents how unequally a dataset is distributed; usually explained in terms of income, but it works for other datasets too
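One standard definition is the mean absolute difference between all pairs, divided by twice the mean. The Python sketch below uses that definition and assumes non-negative values; the module may use an equivalent sorted-data formula instead.

```python
def gini(data):
    """Gini coefficient via the pairwise absolute-difference formula."""
    n = len(data)
    total = sum(data)
    if n < 2 or total == 0:
        return 0.0
    diff_sum = sum(abs(a - b) for a in data for b in data)
    # 2 * n^2 * mean simplifies to 2 * n * total.
    return diff_sum / (2 * n * total)

print(gini([5, 5, 5, 5]))   # 0.0, perfectly equal
print(gini([0, 0, 0, 10]))  # 0.75, one holder has everything
```

With this formula the maximum for n values is (n - 1) / n, so it approaches 1 only for large datasets.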

Bivariate Statistics

covariance (x, y)

Computes the sample covariance (uses n-1 denominator)

Measures how the two datasets vary together: positive when they tend to rise and fall together, negative when one rises as the other falls

correlation (x, y)

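Computes the Pearson correlation coefficient

Returns 0 if undefined (for example, when either dataset has zero variance)

Measures the strength and direction of the linear relationship on a scale from -1 to 1, which makes it easier to interpret than the covariance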
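The relationship between the two measures can be sketched in Python: the correlation is the covariance divided by the product of the two standard deviations. The code assumes equal-length inputs with at least two points and is an illustration, not the module's implementation.

```python
import math

def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson r; returns 0 when either dataset has zero variance."""
    sd_x = math.sqrt(covariance(x, x))
    sd_y = math.sqrt(covariance(y, y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return covariance(x, y) / (sd_x * sd_y)

x, y = [1, 2, 3, 4], [2, 4, 6, 8]
print(covariance(x, y))   # depends on the units of x and y
print(correlation(x, y))  # unit-free, close to 1 for a perfect linear relation
```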

Quantiles and Outliers

percentile (data, p)

Computes the p-th percentile by linear interpolation between order statistics

Sorts the dataset and returns the value at the position given by p, interpolating between neighboring elements when that position falls between two indices

p must be in [0,1]
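A Python sketch of linear interpolation between order statistics follows. It assumes the position is `p * (n - 1)` on a zero-based index, which is one of several common conventions; the module's exact convention is not documented above.

```python
def percentile(data, p):
    """Value at fractional position p * (n - 1) in the sorted data."""
    if not 0 <= p <= 1:
        raise ValueError("p must be in [0, 1]")
    ordered = sorted(data)
    pos = p * (len(ordered) - 1)
    lower = int(pos)                          # index at or below the position
    upper = min(lower + 1, len(ordered) - 1)  # next index, capped at the end
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

data = [40, 10, 30, 20]
print(percentile(data, 0.5))  # 25.0, halfway between 20 and 30
print(percentile(data, 1))    # 40
```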

quartiles (data)

Computes the first and third quartiles as table ({ Q1 = ..., Q3 = ... })

Quartiles are found by cutting the dataset in half using the median, and finding the median of those sets

iqr (data)

Computes the interquartile range Q3 - Q1

The distance between first and third quartiles

outliers (data)

Computes a list of all outliers in the dataset

A value counts as an outlier if it lies more than 1.5 times the IQR below the first quartile or above the third quartile
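The quartile, IQR, and outlier rules fit together as in the Python sketch below. It follows the median-of-halves description and assumes the middle element is left out of both halves when the length is odd; that detail, and the dictionary keys, are assumptions made for the example.

```python
from statistics import median

def quartiles(data):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]  # skips the middle element if n is odd
    return {"Q1": median(lower), "Q3": median(upper)}

def outliers(data):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    q = quartiles(data)
    iqr = q["Q3"] - q["Q1"]
    low_fence = q["Q1"] - 1.5 * iqr
    high_fence = q["Q3"] + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(quartiles(data))  # {'Q1': 2.5, 'Q3': 7.5}
print(outliers(data))   # [100]
```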

Linear Regression

linReg (x, y)

Computes a simple least-squares linear regression (y ~ a + b x)

Finds the line that minimizes the sum of squared vertical distances to the data points

linRegPred (model, xval)

Predicts the y value for a given linear regression model at a given position

r2 (x, y, model)

Computes the coefficient of determination R^2 for a linear model

Represents the proportion of the variation in y that is explained by x under the fitted model
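The three regression functions can be sketched together in Python using the closed-form least-squares solution. The model is represented here as a dictionary with keys `a` (intercept) and `b` (slope), following the `y ~ a + b x` notation; the actual shape of the module's model table is an assumption.

```python
def lin_reg(x, y):
    """Closed-form least squares: slope = Sxy / Sxx, intercept = mean_y - slope * mean_x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {"a": mean_y - slope * mean_x, "b": slope}

def lin_reg_pred(model, xval):
    """Predicted y at xval for a fitted model."""
    return model["a"] + model["b"] * xval

def r2(x, y, model):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - lin_reg_pred(model, a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

x, y = [0, 1, 2, 3], [1, 3, 5, 7]
model = lin_reg(x, y)
print(model)                   # {'a': 1.0, 'b': 2.0}
print(lin_reg_pred(model, 4))  # 9.0
print(r2(x, y, model))         # 1.0 for a perfect fit
```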

Hypothesis Testing - Student's t Distribution

tCDF (t, df)

Calculates the cumulative distribution function of Student's t distribution

Uses the relationship between the t distribution and the incomplete beta function, which is evaluated with a continued fraction approximation

normalCDF (x)

Calculates the normal (Gaussian) cumulative distribution function

Helper function for tCDF when df is large; uses an approximation of the error function
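The link between the normal CDF and the error function is Phi(x) = (1 + erf(x / sqrt(2))) / 2. The Python sketch below uses the standard library's `math.erf` in place of a hand-written approximation, so it shows the identity rather than the module's approximation.

```python
import math

def normal_cdf(x):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(normal_cdf(0))     # 0.5
print(normal_cdf(1.96))  # about 0.975
```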

incopmleteBeta (x, a, b)

Calculates the incomplete beta function I_x(a,b)

Helper function for the tCDF calculation; uses a continued fraction approximation

logGamma (x)

Calculates the logarithm of the gamma function

Helper function for incopmleteBeta; uses Lanczos approximation coefficients

Hypothesis Testing - t-tests

oneSampleTTest (data, mu0)

Performs a one-sample t-test

Tests whether the sample mean differs significantly from a hypothesized population mean

Returns the p-value; p < 0.05 is the conventional threshold for calling the difference statistically significant
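The test statistic behind this is t = (mean - mu0) / (s / sqrt(n)) with n - 1 degrees of freedom. The Python sketch below computes only the statistic and the degrees of freedom; turning t into a p-value needs the Student's t CDF, which is left out. The function name and return shape are assumptions for illustration.

```python
import math

def one_sample_t(data, mu0):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(data)
    mean = sum(data) / n
    sample_sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    t = (mean - mu0) / (sample_sd / math.sqrt(n))
    return t, n - 1

t, df = one_sample_t([2, 4, 4, 4, 5, 5, 7, 9], 4)
print(t, df)
```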

twoSampleTTest (x, y, equalVar)

Performs a two-sample t-test

Tests whether the means of two independent samples differ significantly

Returns the p-value; p < 0.05 is the conventional threshold for calling the difference statistically significant
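An equal-variance flag usually selects between the pooled (Student) form and Welch's form of the test. The Python sketch below shows both standard formulas for the statistic and the degrees of freedom; it assumes that is what `equalVar` controls, and it stops short of the p-value, which needs the t CDF.

```python
import math

def two_sample_t(x, y, equal_var=False):
    """Return (t statistic, degrees of freedom) for pooled or Welch's t-test."""
    nx, ny = len(x), len(y)
    mean_x, mean_y = sum(x) / nx, sum(y) / ny
    var_x = sum((a - mean_x) ** 2 for a in x) / (nx - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (ny - 1)
    if equal_var:
        # Pooled variance, df = nx + ny - 2.
        pooled = ((nx - 1) * var_x + (ny - 1) * var_y) / (nx + ny - 2)
        se = math.sqrt(pooled * (1 / nx + 1 / ny))
        df = nx + ny - 2
    else:
        # Welch's test with the Welch-Satterthwaite degrees of freedom.
        se = math.sqrt(var_x / nx + var_y / ny)
        df = (var_x / nx + var_y / ny) ** 2 / (
            (var_x / nx) ** 2 / (nx - 1) + (var_y / ny) ** 2 / (ny - 1)
        )
    return (mean_x - mean_y) / se, df

print(two_sample_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6], equal_var=True))
print(two_sample_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

When both samples have the same size and variance, as in this example, the two forms agree.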

pairTTest (before, after)

Performs a paired t-test

Tests whether the mean difference between paired observations is significant

Useful for before/after comparisons on the same subjects
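A paired t-test is a one-sample t-test on the per-subject differences, tested against zero. The Python sketch below computes the statistic and degrees of freedom under that standard definition; the sign convention (after minus before) is an assumption, and the p-value step is omitted.

```python
import math

def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for the differences after - before."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

t, df = paired_t([10, 12, 9, 11], [12, 14, 10, 13])
print(t, df)  # t is 7.0 with 3 degrees of freedom
```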

linRegTTest (x, y)

Performs a t-test on the linear regression slope

Checks a dataset against its linear regression

Advanced Math Library
