The t-test

The t-test is used when you want to compare the means of two small samples (<30 observations). In a two-sided t-test, you test the null-hypothesis that µ1 = µ2. In a one-sided t-test, by contrast, you test the null-hypothesis that µ1 ≥ µ2 or that µ1 ≤ µ2.
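As a rough sketch of how the two-sided and one-sided variants differ in practice, the `alternative` argument of R's `t.test` selects which null-hypothesis is tested. The data below are simulated purely for illustration; the variable names, the seed and the distribution parameters are assumptions, not values from a real data set.

```r
# Illustrative data only: two simulated samples of 10 values each
set.seed(7)  # arbitrary seed for reproducibility
sample1 <- rnorm(10, mean = 7.8, sd = 1.5)
sample2 <- rnorm(10, mean = 3.5, sd = 1.5)

# Two-sided: H0 is mu1 = mu2
two_sided <- t.test(sample1, sample2, alternative = "two.sided", var.equal = TRUE)
# One-sided: H0 is mu1 <= mu2 (the alternative is mu1 > mu2)
greater <- t.test(sample1, sample2, alternative = "greater", var.equal = TRUE)
# One-sided: H0 is mu1 >= mu2 (the alternative is mu1 < mu2)
less <- t.test(sample1, sample2, alternative = "less", var.equal = TRUE)

print(c(two_sided = two_sided$p.value,
        greater   = greater$p.value,
        less      = less$p.value))
```

The two-sided p-value is twice the smaller of the two one-sided p-values, because the two-sided test counts extreme values in both tails.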

You need to check the following assumptions before proceeding with the t-test:

  1. The observations are independent
  2. The samples have the same variance
  3. The data in each sample are normally distributed

The t-test relies on the test statistic t, which is calculated by:

t = \frac{\overline{x}_1 - \overline{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}

where \overline{x}_1 and \overline{x}_2  are the sample means, s_1^2  and s_2^2  are the sample variances, and n_1 and n_2   are the sample sizes of sample 1 and 2, respectively.
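A minimal sketch of this calculation from summary values follows. The function name `t_statistic` and its argument names are hypothetical, chosen for illustration, and the standard error is assumed to be the square root of s_1^2/n_1 + s_2^2/n_2. The input values are the means and variances used in the R example further down.

```r
# Two-sample t statistic from summary values
# (function and argument names are illustrative, not from a package)
t_statistic <- function(mean1, mean2, var1, var2, n1, n2) {
  # Standard error of the difference between the two sample means
  standard_error <- sqrt(var1 / n1 + var2 / n2)
  (mean1 - mean2) / standard_error
}

t_value <- t_statistic(mean1 = 7.8, mean2 = 3.5,
                       var1 = 2.4, var2 = 2.1,
                       n1 = 10, n2 = 10)
print(round(t_value, 2))  # 6.41
```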

The null-hypothesis is rejected when |t| > tα (two-sided test). The critical value, tα, is found in a t-table for a chosen level of significance (e.g. α = 0.05 or 0.01) at a specific number of degrees of freedom (n1 + n2 - 2 for two samples). Rejecting at α = 0.05 or α = 0.01 means you conclude that the means differ at a confidence level of 95 % or 99 %, respectively; α is the risk you accept of rejecting a null-hypothesis that is in fact true.
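Instead of a printed t-table, the critical values can be looked up with R's quantile function `qt`. The sketch below assumes a two-sided test and 18 degrees of freedom (two samples of 10 observations); the variable names are illustrative.

```r
# Two-sided critical values of t
deg_free <- 18                                  # assumed: 10 + 10 - 2
t_crit_05 <- qt(1 - 0.05 / 2, df = deg_free)    # alpha = 0.05
t_crit_01 <- qt(1 - 0.01 / 2, df = deg_free)    # alpha = 0.01
print(round(c(t_crit_05, t_crit_01), 3))        # 2.101 2.878
```

For a two-sided test, α is split over both tails, which is why the quantile is taken at 1 - α/2 rather than at 1 - α.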

Example

We continue with the example of the industrial workers from the two-sided Z-test. You want to test if the monthly salaries of industrial workers differ between China and India.

  1. Construct the null-hypothesis

H0: the mean salaries in China and India do not differ (µChina = µIndia)

You only have samples of 10 salaries each for industrial workers from China and India.

  2. Calculate the mean (\overline{x}) and variance (s^2) for each sample, in this case:
\overline{x}_{china} = 7.8, s_{china}^2 = 2.4
\overline{x}_{india} = 3.5, s_{india}^2 = 2.1

3. Check that the variances are equal
– Perform an F-test: calculate the F statistic
F = s_{china}^2 / s_{india}^2 = 2.4 / 2.1 = 1.14
– Calculate the degrees of freedom (v)
v_{china} = 10 - 1 = 9
v_{india} = 10 - 1 = 9

– Check the two-sided critical value for F at α=0.05 where v1 = 9 and v2 = 9 in a table of critical F values: Fα=0.05 = 4.03

– Compare the calculated F statistic with Fα=0.05

F = 1.14 < Fα=0.05 = 4.03

– Reject H0 or H1

H0 can’t be rejected; there is no evidence that the variances differ, so we proceed under the assumption of equal variances.
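The F-test step can be sketched in R as follows. The variances 2.4 and 2.1 are assumed here because they are the values used to simulate the data in the R section below and they reproduce F = 1.14; the variable names are illustrative. The critical value 4.03 corresponds to the upper 2.5 % quantile of the F distribution, i.e. a two-sided test at α = 0.05.

```r
# Sample variances and sample sizes (assumed example values)
var_china <- 2.4
var_india <- 2.1
n_china <- 10
n_india <- 10

# F statistic: larger variance in the numerator
f_value <- var_china / var_india

# Two-sided critical value at alpha = 0.05 with 9 and 9 degrees of freedom
f_crit <- qf(1 - 0.05 / 2, df1 = n_china - 1, df2 = n_india - 1)

print(round(c(f_value, f_crit), 2))  # 1.14 4.03
print(f_value < f_crit)              # TRUE: H0 of equal variances is not rejected
```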

4. Calculate the t statistic

t = \frac{7.8 - 3.5}{\sqrt{\frac{2.4}{10} + \frac{2.1}{10}}} = \frac{4.3}{0.671} = 6.41

– Look up the critical value for t at α = 0.05

We check the t-table at df = (n1 + n2) - 2 = (10 + 10) - 2 = 18 where α = 0.05. There we find that tα = 2.101.

– Compare the calculated t statistic with tα=0.05

t = 6.41 > tα=0.05 = 2.101

t is even greater than tα=0.01 = 2.878

5. Reject H0 or H1

H0 can be rejected; the mean salaries in China and India differ (µChina ≠ µIndia)

6. Interpret the result

The difference is significant even at α = 0.01: at the 99 % confidence level, we conclude that the mean salary of industrial workers is higher in China than in India.
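As a complement to the table look-up, the two-sided p-value for the calculated t statistic can be obtained from the t distribution function `pt`. This is only a sketch; it takes t = 6.41 and df = 18 from the example above, and the variable names are illustrative.

```r
t_value <- 6.41   # calculated t statistic from the example
deg_free <- 18    # (10 + 10) - 2

# Two-sided p-value: probability of a value at least this extreme in either tail
p_value <- 2 * pt(-abs(t_value), df = deg_free)

print(p_value < 0.01)  # TRUE: significant at the 1 % level
```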

How to do it in R


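# Simulate the data: 10 salaries per country, drawn from normal
# distributions with the means and variances of the example
set.seed(1)  # arbitrary seed, only to make the simulated data reproducible
data.china <- rnorm(10, mean = 7.8, sd = sqrt(2.4))
data.india <- rnorm(10, mean = 3.5, sd = sqrt(2.1))

# Perform the two-sided t-test; var.equal = TRUE gives the equal-variance
# test with df = n1 + n2 - 2 (the default, var.equal = FALSE, is Welch's test)
t.test(data.china, data.india, var.equal = TRUE)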
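The equal-variance check and the individual parts of the test result can also be obtained in R. The sketch below uses `var.test` for the F-test and stores the `t.test` result so that its components can be read; the seed and the names `f_result` and `result` are assumptions for illustration. Because the data are simulated, the numbers will not match the hand calculation exactly.

```r
set.seed(42)  # arbitrary seed so the simulated samples are reproducible
data.china <- rnorm(10, mean = 7.8, sd = sqrt(2.4))
data.india <- rnorm(10, mean = 3.5, sd = sqrt(2.1))

# F-test of the null-hypothesis that the two variances are equal
f_result <- var.test(data.china, data.india)
print(f_result$statistic)  # F statistic
print(f_result$parameter)  # numerator and denominator degrees of freedom

# Equal-variance two-sample t-test, stored for inspection
result <- t.test(data.china, data.india, var.equal = TRUE)
print(result$statistic)    # t statistic
print(result$parameter)    # degrees of freedom (n1 + n2 - 2)
print(result$p.value)      # two-sided p-value
print(result$estimate)     # the two sample means
```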

t test in depth

The estimate of σ² is uncertain at small sample sizes. This makes it difficult to use the Z-test, as it relies on an estimate of the variance (s²) that is exactly the same as, or very close to, the true population variance (σ²). The standard deviation (s) is used in the calculation of the standard error of the difference, the unit in which the distance between the observed difference of two sample means and zero (no difference) is expressed. If the standard error is unreliable due to small sample size, how do we proceed?
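A small simulation can illustrate how uncertain the variance estimate is at small sample sizes. This is only a sketch: the helper `sample_variances`, the sample sizes 5 and 100, the number of repetitions and the seed are all assumptions chosen for illustration. Samples are drawn from a standard normal distribution, so the true variance is 1.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Draw `reps` samples of size n and return the sample variance of each
# (hypothetical helper, not from a package)
sample_variances <- function(n, reps = 2000) {
  replicate(reps, var(rnorm(n)))
}

# Spread (standard deviation) of the variance estimates around the true value
spread_small <- sd(sample_variances(5))
spread_large <- sd(sample_variances(100))

print(c(n5 = spread_small, n100 = spread_large))
```

The estimates from samples of 5 scatter much more widely around the true variance than those from samples of 100, which is the uncertainty the t distribution accounts for.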

Well, William Sealy Gosset solved this puzzle elegantly in 1908 while working for the Guinness brewery. He worked out the distribution of a statistic (t) that takes this uncertainty into account. The shape of the t distribution depends on the sample size or, more correctly, on the degrees of freedom (df or v). So, for every df > 0, there is a distribution of t. As with the distribution of Z, the cut-off values that leave 5 or 1 % of the area in the tails can be calculated. These are the critical values presented in any t-table for a specific level of significance (α).

At α=0.05 and 0.01, the corresponding critical values of Z are 1.96 and 2.58, respectively. But, with greater uncertainty at low sample sizes, the critical value of t exceeds the corresponding critical value of Z. This is because the t distribution is flattened out at low sample sizes and has relatively heavy tails compared to Z. Thus, beyond the same cut-off value, the t distribution has a larger tail area than Z. To reduce the tail area to 5 or 1 % of the distribution, the cut-off value of t has to be increased. The lower the sample size, the larger t has to be to leave the same tail area as the Z distribution. The t distribution thus changes with sample size and becomes identical to the normal curve at an infinite number of degrees of freedom. Hence, as the sample size grows, t closes in on the value of Z.
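The convergence of t towards Z can be checked numerically with the quantile functions `qt` and `qnorm`. The sketch below compares two-sided critical values at α = 0.05; the chosen degrees of freedom and the variable names are arbitrary, for illustration only.

```r
# Two-sided critical values at alpha = 0.05 for increasing degrees of freedom
deg_free <- c(2, 5, 10, 18, 30, 100, 1000)
t_crit <- qt(0.975, df = deg_free)
z_crit <- qnorm(0.975)  # about 1.96

print(round(data.frame(df = deg_free, t_crit = t_crit, z_crit = z_crit), 3))
```

The critical value of t decreases as the degrees of freedom increase, and it always stays above the critical value of Z.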