Module stats
A grab bag of common descriptive and inferential statistical functions.
Useful for statistical analysis.
Credit goes to sans.9536 of the Minecraft Computer Mods discord for these functions and the original file.
Info:
- Author: sans.9536
Utility Functions
| find (tbl, value) | Find index of value in an array-like table (linear search) |
| testdata (n[, a=0[, b=1]]) | Generates n quantitative data points between a and b |
Basic Descriptive Statistics
| sum (data) | Computes the sum of numeric values in the dataset |
| mean (data) | Computes the arithmetic mean (average) of the dataset |
| min (data) | Computes the minimum value of the dataset |
| max (data) | Computes the maximum value of the dataset |
| variance (data[, isSample=false]) | Computes population variance by default. |
| stdev (data[, isSample=false]) | Computes the standard deviation of the dataset using sample variance |
| size (data) | Number of elements in the dataset |
| range (data) | Computes the range of the dataset |
| median (data) | Computes the median (50th percentile). If even length, returns average of the two middle values |
| mode (data) | Computes the mode and returns a sorted array of the most frequent value(s) |
| sex (data) | Computes the standard error of the mean (commonly called SE) indicating precision |
| skewness (data) | Computes the asymmetry (skewness) of data around its mean. Requires at least 3 observations |
| kurtosis (data) | Computes the tailedness (kurtosis) of the data using an excess kurtosis formula variant. Requires at least 4 observations and indicates how peaked or heavy-tailed a distribution is |
Alternative Means
| geometricMean (data) | Computes the geometric mean. Only defined for strictly positive values. Useful for multiplicative rates such as combining percentage growth rates |
| harmonicMean (data) | Computes the harmonic mean. Best used for averaging rates and ratios, or inversely proportional quantities |
| trimmedMean (data[, trim=0.05]) | Computes the trimmed mean. Removes `trim` fraction from each tail and averages the rest. More robust to outliers than the arithmetic mean, because a percentage of the extremes is removed |
| winsorMean (data[, alpha=0.05]) | Computes the winsorized mean. Clamps extreme values to given quantiles before averaging. More robust to outliers than the arithmetic mean |
| weightedMean (data, weights) | Computes the weighted mean |
Robust Statistics
| madMedian (data) | Computes the median absolute deviation (MAD). More robust measure of variability than the standard deviation when outliers are present |
Inequality Measures
| gini (data) | Computes the Gini coefficient for inequality (0..1). Returns 0 for datasets with fewer than 2 elements or zero mean. Represents how unequally distributed a dataset is; best explained by income, but works for other sets too |
Bivariate Statistics
| covariance (x, y) | Computes the sample covariance (uses n-1 denominator). Represents how much the two datasets vary together |
| correlation (x, y) | Computes the Pearson correlation coefficient. Returns 0 if undefined. Represents the strength and direction of the linear relationship; a covariance scaled to [-1, 1], so much more interpretable |
Quantiles and Outliers
| percentile (data, p) | Computes a percentile by linear interpolation between order statistics. Sorts the dataset and returns the value at fraction p, interpolating between neighbouring indices when p does not land exactly on one. p in [0,1] |
| quartiles (data) | Computes the first and third quartiles as table (`{ Q1 = ..., Q3 = ... }`). Quartiles are found by cutting the dataset in half using the median, and taking the median of each half |
| iqr (data) | Computes the interquartile range Q3 - Q1. The distance between first and third quartiles |
| outliers (data) | Computes a list of all outliers in the dataset. Values count as outliers if they lie more than 1.5 times the IQR below the first quartile or above the third quartile |
Linear Regression
| linReg (x, y) | Computes a simple least-squares linear regression (y ~ a + b x). Finds the line that minimizes the sum of squared vertical distances to the data points |
| linRegPred (model, xval) | Predicts the y value for a given linear regression model at a given position |
| r2 (x, y, model) | Computes the coefficient of determination R^2 for linear model. Represents how much of the variation in variable y can be attributed to variable x |
Hypothesis Testing - Student's t Distribution
| tCDF (t, df) | Calculate cumulative distribution function for Student's t distribution. Uses the relationship to the incomplete beta function, which is approximated numerically with a continued fraction |
| normalCDF (x) | Calculate normal (Gaussian) cumulative distribution function. Helper function for tCDF when df is large, using an error function approximation |
| incopmleteBeta (x, a, b) | Calculate incomplete beta function I_x(a,b). Helper function for tCDF calculation using continued fraction approximation |
| logGamma (x) | Calculate logarithm of gamma function. Helper function for incopmleteBeta using Lanczos approximation coefficients |
Hypothesis Testing - t-tests
| oneSampleTTest (data, mu0) | Perform one-sample t-test. Tests whether sample mean significantly differs from a hypothesized population mean. Returns p-value: if p < 0.05, the difference is statistically significant at the conventional 5% level |
| twoSampleTTest (x, y, equalVar) | Perform two-sample t-test. Tests whether means of two independent samples significantly differ. Returns p-value: if p < 0.05, the difference is statistically significant at the conventional 5% level |
| pairTTest (before, after) | Perform paired t-test. Tests whether the mean difference between paired observations is significant. Useful for before/after comparisons on the same subjects |
| linRegTTest (x, y) | Perform t-test on linear regression slope. Checks a dataset against its linear regression |
Utility Functions
- find (tbl, value)
- Find index of value in an array-like table (linear search)
- testdata (n[, a=0[, b=1]])
- Generates n quantitative data points between a and b
Basic Descriptive Statistics
- sum (data)
- Computes the sum of numeric values in the dataset
- mean (data)
- Computes the arithmetic mean (average) of the dataset
- min (data)
- Computes the minimum value of the dataset
- max (data)
- Computes the maximum value of the dataset
- variance (data[, isSample=false])
-
Computes population variance by default.
Set `isSample` to true for sample variance.
- stdev (data[, isSample=false])
- Computes the standard deviation of the dataset using sample variance
- size (data)
- Number of elements in the dataset
- range (data)
- Computes the range of the dataset
- median (data)
-
Computes the median (50th percentile)
If even length, returns average of the two middle values
- mode (data)
- Computes the mode and returns a sorted array of the most frequent value(s)
- sex (data)
- Computes the standard error of the mean (commonly called SE) indicating precision
- skewness (data)
-
Computes the asymmetry (skewness) of data around its mean
Requires at least 3 observations
- kurtosis (data)
-
Computes the tailedness (kurtosis) of the data using an excess kurtosis formula variant
Requires at least 4 observations and indicates how peaked or heavy-tailed a distribution is
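The descriptive measures above can be sketched in a few lines. The following is an illustrative Python sketch, not the module's own code: the function names are stand-ins, and the skewness and kurtosis use plain population-moment definitions, whereas the module's "variant" may apply a small-sample correction.

```python
import math

def variance(data, is_sample=False):
    """Population variance by default; n - 1 denominator if is_sample is true."""
    n = len(data)
    mean = sum(data) / n
    squared = sum((x - mean) ** 2 for x in data)
    return squared / (n - 1) if is_sample else squared / n

def std_error(data):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return math.sqrt(variance(data, is_sample=True) / len(data))

def central_moment(data, k):
    """k-th central moment with an n denominator."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Moment-based skewness m3 / m2^1.5 (assumed variant, no correction)."""
    if len(data) < 3:
        raise ValueError("skewness needs at least 3 observations")
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """m4 / m2^2 - 3, so a normal distribution scores 0 (assumed variant)."""
    if len(data) < 4:
        raise ValueError("kurtosis needs at least 4 observations")
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data))                  # population variance
print(variance(data, is_sample=True))  # sample variance
print(std_error(data))
```

The only difference between the two variance modes is the denominator, which is why the sample variance is always slightly larger for the same data.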
Alternative Means
- geometricMean (data)
-
Computes the geometric mean
Only defined for strictly positive values
Useful for multiplicative rates such as combining percentage growth rates
- harmonicMean (data)
-
Computes the harmonic mean
Best used for averaging rates and ratios, or inversely proportional quantities
- trimmedMean (data[, trim=0.05])
-
Computes the trimmed mean
Removes `trim` fraction from each tail and averages the rest
More robust to outliers than the arithmetic mean, because a percentage of the extremes is removed
- winsorMean (data[, alpha=0.05])
-
Computes the winsorized mean
Clamps extreme values to given quantiles before averaging
More robust to outliers than the arithmetic mean
- weightedMean (data, weights)
- Computes the weighted mean
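To make the differences between these means concrete, here is a minimal Python sketch. It is illustrative only: the names are stand-ins, the number of trimmed elements is assumed to be rounded down, and the winsorized version clamps to order statistics rather than interpolated quantiles, which may differ from what the module does.

```python
import math

def geometric_mean(data):
    """exp of the mean log; only defined for strictly positive values."""
    if any(x <= 0 for x in data):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(x) for x in data) / len(data))

def harmonic_mean(data):
    """n divided by the sum of reciprocals."""
    return len(data) / sum(1 / x for x in data)

def trimmed_mean(data, trim=0.05):
    """Drop the trim fraction from each tail, then average the rest."""
    s = sorted(data)
    k = int(len(s) * trim)  # assumed: count per tail is rounded down
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

def winsor_mean(data, alpha=0.05):
    """Clamp each tail to its nearest kept value, then average everything."""
    s = sorted(data)
    k = int(len(s) * alpha)
    low, high = s[k], s[len(s) - 1 - k]
    return sum(min(max(x, low), high) for x in s) / len(s)

def weighted_mean(data, weights):
    """Sum of value * weight divided by the sum of weights."""
    return sum(x * w for x, w in zip(data, weights)) / sum(weights)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(sum(data) / len(data))    # plain mean, pulled up by the 100
print(trimmed_mean(data, 0.1))  # the 1 and the 100 are dropped
print(winsor_mean(data, 0.1))   # the 1 becomes 2, the 100 becomes 9
```

Trimming shrinks the sample, while winsorizing keeps the sample size and only moves the extreme values inward.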
Robust Statistics
- madMedian (data)
-
Computes the median absolute deviation (MAD)
More robust measure of variability than the standard deviation when outliers are present
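As a rough illustration of the idea, the MAD is the median of the absolute deviations from the median. The Python sketch below uses the standard library's `statistics.median`; the function name `mad` is a stand-in, and no normal-consistency scaling factor is applied, which is an assumption about the module.

```python
from statistics import median

def mad(data):
    """Median of |x - median(data)|, with no scaling constant applied."""
    center = median(data)
    return median(abs(x - center) for x in data)

# A single extreme value barely moves the MAD.
print(mad([1, 2, 3, 4, 5]))
print(mad([1, 2, 3, 4, 100]))
```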
Inequality Measures
- gini (data)
-
Computes the Gini coefficient for inequality (0..1)
Returns 0 for datasets with fewer than 2 elements or zero mean
Represents how unequally distributed a dataset is; best explained by income, but works for other sets too
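One common way to compute the Gini coefficient is the mean absolute difference between all pairs, divided by twice the mean. The Python sketch below uses that definition as an assumption; the module may use an equivalent sorted-rank formula instead, and the name `gini` here is only a stand-in.

```python
def gini(data):
    """Mean absolute pairwise difference over twice the mean."""
    n = len(data)
    total = sum(data)
    if n < 2 or total == 0:
        return 0.0  # mirrors the documented fallback for undefined cases
    diff_sum = sum(abs(a - b) for a in data for b in data)
    # 2 * n^2 * mean simplifies to 2 * n * total
    return diff_sum / (2 * n * total)

print(gini([5, 5, 5, 5]))    # everyone holds the same amount
print(gini([0, 0, 0, 10]))   # one element holds everything
```

With this definition the maximum for n elements is (n - 1) / n, so the value approaches 1 only for large datasets.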
Bivariate Statistics
- covariance (x, y)
-
Computes the sample covariance (uses n-1 denominator)
Represents how much the two datasets vary together
- correlation (x, y)
-
Computes the Pearson correlation coefficient
Returns 0 if undefined
Represents the strength and direction of the linear relationship; a covariance scaled to [-1, 1], so much more interpretable
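The relationship between the two functions can be sketched as follows: the correlation is the covariance divided by the product of the two standard deviations. This is an illustrative Python sketch with stand-in names, and it assumes "undefined" means that one of the inputs has zero variance.

```python
import math

def covariance(x, y):
    """Sample covariance with an n - 1 denominator."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson r: covariance scaled by both standard deviations."""
    sd_x = math.sqrt(covariance(x, x))
    sd_y = math.sqrt(covariance(y, y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumed meaning of "returns 0 if undefined"
    return covariance(x, y) / (sd_x * sd_y)

x = [1, 2, 3, 4]
print(covariance(x, [2, 4, 6, 8]))   # depends on the units of x and y
print(correlation(x, [2, 4, 6, 8]))  # unit-free, close to 1 for a perfect line
```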
Quantiles and Outliers
- percentile (data, p)
-
Computes a percentile by linear interpolation between order statistics
Sorts the dataset and returns the value at fraction p, interpolating between neighbouring indices when p does not land exactly on one
p in [0,1]
- quartiles (data)
-
Computes the first and third quartiles as table (`{ Q1 = ..., Q3 = ... }`)
Quartiles are found by cutting the dataset in half using the median, and taking the median of each half
- iqr (data)
-
Computes the interquartile range Q3 - Q1
The distance between first and third quartiles
- outliers (data)
-
Computes a list of all outliers in the dataset
Values count as outliers if they lie more than 1.5 times the IQR below the first quartile or above the third quartile
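The quantile functions build on one another, which a short Python sketch can show. The names are stand-ins and two details are assumptions: the percentile position is taken as p * (n - 1) on zero-based indices, and for odd-length data the middle element is left out of both halves when computing quartiles.

```python
from statistics import median

def percentile(data, p):
    """Linear interpolation between order statistics, p in [0, 1]."""
    s = sorted(data)
    pos = p * (len(s) - 1)           # assumed position convention
    low = int(pos)
    high = min(low + 1, len(s) - 1)
    return s[low] + (pos - low) * (s[high] - s[low])

def quartiles(data):
    """Median of the lower half and median of the upper half."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half + len(s) % 2:]    # assumed: skip the middle element if odd
    return {"Q1": median(lower), "Q3": median(upper)}

def iqr(data):
    q = quartiles(data)
    return q["Q3"] - q["Q1"]

def outliers(data):
    """Values beyond 1.5 * IQR below Q1 or above Q3."""
    q = quartiles(data)
    fence = 1.5 * (q["Q3"] - q["Q1"])
    return [x for x in data if x < q["Q1"] - fence or x > q["Q3"] + fence]

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(quartiles(data))
print(outliers(data))
```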
Linear Regression
- linReg (x, y)
-
Computes a simple least-squares linear regression (y ~ a + b x)
Finds the line that minimizes the sum of squared vertical distances to the data points
- linRegPred (model, xval)
- Predicts the y value for a given linear regression model at a given position
- r2 (x, y, model)
-
Computes the coefficient of determination R^2 for linear model
Represents how much of the variation in variable y can be attributed to variable x
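For a single predictor, the least-squares line has a closed form: the slope b is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept a makes the line pass through the point of means. The Python sketch below follows that recipe. The names are stand-ins, and representing the model as a dictionary with keys `a` and `b` is an assumption made for illustration; the module's actual model fields are not documented here.

```python
def lin_reg(x, y):
    """Fit y ~ a + b x by ordinary least squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return {"a": mean_y - b * mean_x, "b": b}  # hypothetical model layout

def lin_reg_pred(model, xval):
    """Evaluate the fitted line at xval."""
    return model["a"] + model["b"] * xval

def r2(x, y, model):
    """1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - lin_reg_pred(model, xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

x, y = [1, 2, 3, 4], [3, 5, 7, 9]
model = lin_reg(x, y)
print(model)
print(lin_reg_pred(model, 5))
print(r2(x, y, model))
```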
Hypothesis Testing - Student's t Distribution
- tCDF (t, df)
-
Calculate cumulative distribution function for Student's t distribution
Uses the relationship to the incomplete beta function, which is approximated numerically with a continued fraction
- normalCDF (x)
-
Calculate normal (Gaussian) cumulative distribution function
Helper function for tCDF when df is large, using an error function approximation
- incopmleteBeta (x, a, b)
-
Calculate incomplete beta function I_x(a,b)
Helper function for tCDF calculation using continued fraction approximation
- logGamma (x)
-
Calculate logarithm of gamma function
Helper function for incopmleteBeta using Lanczos approximation coefficients
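The chain of helpers described above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the module's code: the names are stand-ins, Python's built-in `math.lgamma` and `math.erf` replace the Lanczos and error-function approximations, and the continued fraction is evaluated with the modified Lentz method, which is one standard choice. The t CDF uses the identity that the two-tailed area equals I_x(df/2, 1/2) with x = df / (df + t^2).

```python
import math

def _beta_cf(x, a, b, max_iter=200, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # even step of the recurrence
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # odd step of the recurrence
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def incomplete_beta(x, a, b):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    # use the symmetry I_x(a, b) = 1 - I_(1-x)(b, a) where it converges faster
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(x, a, b) / a
    return 1.0 - front * _beta_cf(1.0 - x, b, a) / b

def t_cdf(t, df):
    """P(T <= t) for Student's t with df degrees of freedom."""
    tail = 0.5 * incomplete_beta(df / (df + t * t), df / 2.0, 0.5)
    return 1.0 - tail if t > 0 else tail

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(t_cdf(2.0, 10))
print(normal_cdf(2.0))  # the t CDF approaches this as df grows
```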
Hypothesis Testing - t-tests
- oneSampleTTest (data, mu0)
-
Perform one-sample t-test
Tests whether sample mean significantly differs from a hypothesized population mean
Returns p-value: if p < 0.05, the difference is statistically significant at the conventional 5% level
- twoSampleTTest (x, y, equalVar)
-
Perform two-sample t-test
Tests whether means of two independent samples significantly differ
Returns p-value: if p < 0.05, the difference is statistically significant at the conventional 5% level
- pairTTest (before, after)
-
Perform paired t-test
Tests whether the mean difference between paired observations is significant
Useful for before/after comparisons on the same subjects
- linRegTTest (x, y)
-
Perform t-test on linear regression slope
Checks a dataset against its linear regression
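The test statistics behind the one-sample, two-sample, and paired tests can be sketched in Python as below. This is an illustration with stand-in names, not the module's code. Unlike the documented functions, the sketch returns the t statistic and degrees of freedom rather than a p-value, to stay independent of any particular t CDF; a two-sided p-value would then be 2 * (1 - CDF(|t|, df)), assuming the module's tests are two-sided. The unequal-variance branch uses the Welch-Satterthwaite degrees of freedom, which is also an assumption.

```python
import math

def _summary(data):
    """Return (n, mean, sample variance)."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)
    return n, mean, var

def one_sample_t(data, mu0):
    """t statistic and df for H0: population mean equals mu0."""
    n, mean, var = _summary(data)
    return (mean - mu0) / math.sqrt(var / n), n - 1

def two_sample_t(x, y, equal_var):
    """Pooled-variance t if equal_var is true, otherwise Welch's t."""
    nx, mean_x, var_x = _summary(x)
    ny, mean_y, var_y = _summary(y)
    if equal_var:
        df = nx + ny - 2
        pooled = ((nx - 1) * var_x + (ny - 1) * var_y) / df
        se = math.sqrt(pooled * (1 / nx + 1 / ny))
    else:
        a, b = var_x / nx, var_y / ny
        se = math.sqrt(a + b)
        df = (a + b) ** 2 / (a ** 2 / (nx - 1) + b ** 2 / (ny - 1))
    return (mean_x - mean_y) / se, df

def paired_t(before, after):
    """One-sample t-test on the per-pair differences (after - before)."""
    diffs = [a - b for b, a in zip(before, after)]
    return one_sample_t(diffs, 0.0)

print(one_sample_t([5, 6, 7, 8, 9], 5))
print(two_sample_t([1, 2, 3], [4, 5, 6], equal_var=False))
print(paired_t([10, 12, 14], [11, 14, 17]))
```

The paired test is just the one-sample test applied to differences, which is why it suits before/after measurements on the same subjects.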