Skip to main content

module analysis::statistics::Inference

rascal-0.34.0

Statistical inference methods.

Usage

import analysis::statistics::Inference;

The following functions are provided:

function chiSquare

Chi-square coefficient of data values.

num chiSquare(lrel[num expected, int observed] values)

Compute the ChiSquare statistic comparing observed and expected frequency counts.

Examples

Consider an example from the web page mentioned above. To test the hypothesis that a random sample of 100 people has been drawn from a population in which men and women are equal in frequency, the observed number of men and women would be compared to the theoretical frequencies of 50 men and 50 women. If there were 44 men in the sample and 56 women, then we have the following:

rascal>import analysis::statistics::Inference;
ok
rascal>chiSquare([<50, 44>, <50, 56>])
num: 1.44

function chiSquareTest

Chi-square test on data values.

num chiSquareTest(lrel[num expected, int observed] values)

bool chiSquareTest(lrel[num expected, int observed] values, real alpha)

Perform a chi-square test comparing expected and observed frequency counts. There are two forms of this test:

  • Returns the observed significance level, or p-value, associated with a Chi-square goodness of fit test comparing observed frequency counts to expected counts.

  • Performs a Chi-square goodness of fit test evaluating the null hypothesis that the observed counts conform to the frequency distribution described by the expected counts, with significance level alpha (0 < alpha < 0.5). Returns true iff the null hypothesis can be rejected with confidence 1 - alpha.

function tTest

T-test on sample data.

num tTest(list[num] sample1, list[num] sample2)

bool tTest(list[num] sample1, list[num] sample2, num alpha)

bool tTest(num mu, list[num] sample, num alpha)

Perform student's t-test The test is provided in three variants:

  • Returns the observed significance level, or p-value, associated with a two-sample, two-tailed t-test comparing the means of the input samples. The number returned is the smallest significance level at which one can reject the null hypothesis that the two means are equal in favor of the two-sided alternative that they are different. For a one-sided test, divide the returned value by 2.

The t-statistic used is t = (m1 - m2) / sqrt(var1/n1 + var2/n2) where

n1 is the size of the first sample n2 is the size of the second sample; m1 is the mean of the first sample; m2 is the mean of the second sample; var1 is the variance of the first sample; var2 is the variance of the second sample.

  • Performs a two-sided t-test evaluating the null hypothesis that sample1 and sample2 are drawn from populations with the same mean, with significance level alpha. This test does not assume that the subpopulation variances are equal. Returns true iff the null hypothesis that the means are equal can be rejected with confidence 1 - alpha. To perform a 1-sided test, use alpha / 2.

  • Performs a two-sided t-test evaluating the null hypothesis that the mean of the population from which sample is drawn equals mu. Returns true iff the null hypothesis can be rejected with confidence 1 - alpha. To perform a 1-sided test, use alpha * 2.

Examples

We use the data from the following http://web.mst.edu/~psyworld/texample.htm#1[example] to illustrate the t-test. First, we compute the t-statistic using the formula given above.

rascal>import util::Math;
ok
rascal>import analysis::statistics::Descriptive;
ok
rascal>import List;
ok
rascal>s1 = [5,7,5,3,5,3,3,9];
list[int]: [5,7,5,3,5,3,3,9]
rascal>s2 = [8,1,4,6,6,4,1,2];
list[int]: [8,1,4,6,6,4,1,2]
rascal>(mean(s1) - mean(s2))/sqrt(variance(s1)/size(s1) + variance(s2)/size(s2));
real: 0.84731854581

This is the same result as obtained in the cited example. We can also compute it directly using the tTest functions:

rascal>import analysis::statistics::Inference;
ok
rascal>tTest(s1, s2);
num: 0.4115203997374087

Observe that this is a smaller value than comes out directly of the formula. Recall that: The number returned is the smallest significance level at which one can reject the null hypothesis that the two means are equal in favor of the two-sided alternative that they are different. Finally, we perform the test around the significance level we just obtained:

rascal>tTest(s1,s2,0.40);
bool: false
rascal>tTest(s1,s2,0.50);
bool: true

function anovaFValue

Analysis of Variance (ANOVA) f-value.

num anovaFValue(list[list[num]] categoryData)

Perform Analysis of Variance test also described here

Compute the F statistic -- also known as F-test -- using the definitional formula F = msbg/mswg where

  • msbg = between group mean square.
  • mswg = within group mean square.

are as defined here

function anovaPValue

Analysis of Variance (ANOVA) p-value.

num anovaPValue(list[list[num]] categoryData)

Perform Analysis of Variance test also described here

Computes the exact p-value using the formula p = 1 - cumulativeProbability(F) where F is the Anova F Value.

function anovaTest

Analysis of Variance (ANOVA) test.

bool anovaTest(list[list[num]] categoryData, num alpha)

Perform Analysis of Variance also described here

Returns true iff the estimated p-value is less than alpha (0 < alpha <= 0.5).

The exact p-value is computed using the formula p = 1 - cumulativeProbability(F) where F is the Anova F Value.

function gini

Gini coefficient.

real gini(lrel[num observation,int frequency] values)

Computes the Gini Coefficient that measures the inequality among values in a frequency distribution.

The Gini coefficient is computed using Deaton's formula and returns a value between 0 (completely equal distribution) and 1 (completely unequal distribution).

Examples

rascal>import analysis::statistics::Inference;
ok

A completely equal distribution:

rascal>gini([<10000, 1>, <10000, 1>, <10000, 1>]);
real: 0.0

A rather unequal distribution:

rascal>gini([<998000, 1>, <20000, 3>, <117500, 1>, <70000, 2>, <23500, 5>, <45200,1>]);
real: 0.8530758129256304