The t-test

The t-test is used when you want to compare the means of two small samples (<30 observations). In a two-sided t-test, you test the null hypothesis that µ1 = µ2. In a one-sided t-test, by contrast, you test the null hypothesis that µ1 ≥ µ2 or that µ1 ≤ µ2.

You need to check the following assumptions before proceeding with the t-test:

  1. The observations are independent
  2. The samples have the same variance
  3. The data in each sample are normally distributed
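
As a rough sketch, the second and third assumptions could be screened in R along the following lines. The vectors `sample1` and `sample2` are simulated placeholders and not data from this page; `shapiro.test()` and `var.test()` are functions from base R's stats package. Independence follows from how the data were collected and cannot be checked this way.

```r
# Hypothetical data: two small samples simulated for illustration only
set.seed(42)
sample1 <- rnorm(10, mean = 5, sd = 1.5)
sample2 <- rnorm(10, mean = 6, sd = 1.5)

# Assumption 3: Shapiro-Wilk test of normality for each sample
# (H0: the sample comes from a normal distribution)
norm1 <- shapiro.test(sample1)
norm2 <- shapiro.test(sample2)

# Assumption 2: F-test of equal variances
# (H0: the ratio of the two variances is 1)
vartest <- var.test(sample1, sample2)

# Small p-values (e.g. < 0.05) speak against the respective assumption
print(c(normality1 = norm1$p.value,
        normality2 = norm2$p.value,
        equal_var  = vartest$p.value))
```

With only 10 observations per sample these tests have little power, so a plot of the data (for example `qqnorm()`) is a sensible complement.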

The t-test relies on the test statistic t, which is calculated by:

t = \frac{\overline{x}_1 - \overline{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}

where \overline{x}_1 and \overline{x}_2 are the sample means, s_1^2 and s_2^2 are the sample variances, and n_1 and n_2 are the sample sizes of sample 1 and 2, respectively.

The null hypothesis is rejected when |t| > tα. The critical value, tα, is found in a t-table for different levels of significance (e.g. α = 0.05 and 0.01) at a specific number of degrees of freedom (n1 + n2 - 2 for two samples). Rejecting at α = 0.05 or 0.01 means that, if the null hypothesis were true, a value of t at least this extreme would be expected in fewer than 5 % or 1 % of cases, respectively; you then conclude that the means differ.
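
A minimal sketch of this decision rule in R, assuming the equal-variance, two-sided setting used on this page. The helper `t_from_summary` and the summary statistics passed to it are made up for illustration; `qt()` is base R's quantile function for the t distribution. The helper uses the pooled-variance form of the standard error, which coincides with sqrt(s1^2/n1 + s2^2/n2) when the two samples are the same size.

```r
# Hypothetical helper: two-sample t statistic from summary statistics,
# using the pooled variance (equal-variance assumption)
t_from_summary <- function(mean1, mean2, var1, var2, n1, n2) {
  df <- n1 + n2 - 2
  pooled_var <- ((n1 - 1) * var1 + (n2 - 1) * var2) / df
  se <- sqrt(pooled_var * (1 / n1 + 1 / n2))
  list(t = (mean1 - mean2) / se, df = df)
}

# Made-up summary statistics, for illustration only
res <- t_from_summary(mean1 = 12, mean2 = 10, var1 = 4, var2 = 5, n1 = 8, n2 = 8)

alpha <- 0.05
t_crit <- qt(1 - alpha / 2, df = res$df)  # two-sided critical value

# Reject H0 when the absolute t statistic exceeds the critical value
reject_h0 <- abs(res$t) > t_crit
print(c(t = res$t, t_crit = t_crit))
print(reject_h0)
```

Looking up the critical value with `qt()` replaces the printed t-table; `1 - alpha / 2` is used because a two-sided test splits α between both tails.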

Example

We go on with the example of the industrial workers from the two-sided Z-test. You want to test if the monthly salaries of industrial workers differ between China and India.

1. Construct the null hypothesis

H0: the mean salaries in China and India do not differ (µChina = µIndia)

You only have a sample of 10 salaries of industrial workers from each of China and India.

2. Calculate the mean (\overline{x}) and variance (s^2) for each sample, in this case:

\overline{x}_{china} = 7.8, s_{china}^2 = 2.4
\overline{x}_{india} = 3.5, s_{india}^2 = 2.1

3. Check that the variances are equal

- Perform an F-test, calculate the F statistic

F = s_{china}^2 / s_{india}^2 = 2.4 / 2.1 = 1.14

- Calculate the degrees of freedom (v)

v_{china} = 10 - 1 = 9
v_{india} = 10 - 1 = 9

- Check the critical value for F at α = 0.05 (two-sided), where v1 = 9 and v2 = 9, in a table of critical F values: Fα=0.05 = 4.03

- Compare the calculated F statistic with Fα=0.05

F = 1.14 < Fα=0.05 = 4.03

- Reject H0 or H1

H0 can't be rejected; there is no evidence that the variances differ, so the assumption of equal variances is retained.
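
The variance check can be reproduced in R instead of using a printed F table. This is a sketch under two assumptions: the sample variances 2.4 and 2.1 are taken from the simulation parameters in the R code on this page (they reproduce F = 1.14), and the tabulated 4.03 is read as the upper 2.5 % point, i.e. a two-sided test at α = 0.05. `qf()` is base R's quantile function for the F distribution.

```r
# Sample variances and sizes assumed from the example
var_china <- 2.4
var_india <- 2.1
n_china <- 10
n_india <- 10

# F statistic: the larger variance divided by the smaller one
f_stat <- var_china / var_india

# Two-sided critical value at alpha = 0.05 (upper 2.5 % point);
# df1 belongs to the numerator, df2 to the denominator
f_crit <- qf(1 - 0.05 / 2, df1 = n_china - 1, df2 = n_india - 1)

print(round(c(F = f_stat, F_crit = f_crit), 2))

# TRUE means H0 (equal variances) cannot be rejected
print(f_stat < f_crit)
```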

4. Calculate the t statistic

t = \frac{7.8 - 3.5}{\sqrt{\frac{2.4}{10} + \frac{2.1}{10}}} = \frac{4.3}{0.671} = 6.41

- Look up the critical value for t at α = 0.05

We check the t-table at df = (n1 + n2) - 2 = (10 + 10) - 2 = 18 where α = 0.05. There we find that tα = 2.101

- Compare the calculated t statistic with tα=0.05

t = 6.41 > tα=0.05 = 2.101

t is even greater than tα=0.01 = 2.878

5. Reject H0 or H1

H0 can be rejected; the mean salaries in China and India differ (µChina ≠ µIndia)

6. Interpret the result

The difference is significant at the 1 % level: in this example, industrial workers in China have a higher mean salary than industrial workers in India.
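
The numbers in this worked example can be checked from the summary statistics alone. The sketch below assumes means of 7.8 and 3.5 and variances of 2.4 and 2.1 with 10 observations per country, as in the simulation parameters of the R code on this page; the variable names are chosen for illustration. `qt()` and `pt()` are base R functions for the t distribution.

```r
# Summary statistics assumed from the example
mean_china <- 7.8; var_china <- 2.4; n_china <- 10
mean_india <- 3.5; var_india <- 2.1; n_india <- 10

# Degrees of freedom for the equal-variance two-sample t-test
df <- n_china + n_india - 2

# t statistic (the sample sizes are equal, so the pooled-variance
# standard error gives the same value)
t_stat <- (mean_china - mean_india) /
  sqrt(var_china / n_china + var_india / n_india)

# Two-sided critical values at alpha = 0.05 and 0.01
t_crit <- qt(1 - c(0.05, 0.01) / 2, df = df)

# Two-sided p-value
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

print(round(t_stat, 2))
print(round(t_crit, 3))
print(p_value < 0.01)
```

The p-value carries the same information as the table lookup: it falls below 0.01 exactly when |t| exceeds the critical value at α = 0.01.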

How to do it in R


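# Simulate the data: 10 salaries per country with the example's means and variances
set.seed(1)  # for reproducibility
data.china <- rnorm(10, mean = 7.8, sd = sqrt(2.4))
data.india <- rnorm(10, mean = 3.5, sd = sqrt(2.1))

# Perform the two-sided t-test assuming equal variances (df = 18);
# without var.equal = TRUE, t.test() runs Welch's test instead
t.test(data.china, data.india, var.equal = TRUE)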
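
For the one-sided case, a possible variant is sketched below. The data are simulated as above (the object names `china` and `india` and the seed are arbitrary choices), and the direction of the alternative is an assumption made for illustration. `alternative` and `var.equal` are arguments of base R's `t.test()`, which returns a list-like object whose components can be extracted.

```r
# Simulated data, for illustration only
set.seed(1)
china <- rnorm(10, mean = 7.8, sd = sqrt(2.4))
india <- rnorm(10, mean = 3.5, sd = sqrt(2.1))

# One-sided test: H0 is mu_china <= mu_india, H1 is mu_china > mu_india
res <- t.test(china, india, alternative = "greater", var.equal = TRUE)

print(res$statistic)  # t statistic
print(res$parameter)  # degrees of freedom (n1 + n2 - 2)
print(res$p.value)    # one-sided p-value
```

Because the data are random draws, the t statistic will not equal the 6.41 from the hand calculation, which is based on the exact means and variances.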

t test in depth

The estimate of σ² is uncertain at small sample sizes. This makes it difficult to use the Z-test, as it relies on an estimate of the variance (s²) that is exactly the same as, or very close to, the true population variance (σ²). The standard deviation (s) is used in the calculation of the standard error of the difference, which is used to express the distance between the observed difference between two sample means and zero (no difference). If the standard error is unreliable due to small sample size, how do we proceed?

Well, William Sealy Gosset solved this puzzle elegantly in 1908 while working for the Guinness brewery. He simply worked out the distribution(s) of a statistic (t) that takes these uncertainties into account. The distribution of t is a function of sample size or, more correctly, the degrees of freedom (df or v). So, for every df > 0, Gosset worked out the distribution of t. As with the distribution of Z, the cut-off values that leave 5 or 1 % of the values in the tails can be calculated. These are the critical values that are presented in any t-table for a specific level of significance (α).

These values correspond to Z = 1.96 and 2.58 at α = 0.05 and 0.01, respectively. But, with greater uncertainty at low sample sizes, the critical value of t exceeds the corresponding critical value of Z. This is because the t distribution is flattened out at low sample sizes and is relatively wide in the tails compared to Z. Thus, the t distribution has a larger area in the tails than Z beyond the same value of the test statistic. To reduce the tail area to 5 or 1 % of the distribution, the cut-off value of t has to be increased. The lower the sample size, the larger t has to be to cut off the same tail area as in the Z distribution. With all this in mind, the distribution changes with sample size and is identical to a normal curve at an infinite number of degrees of freedom. Hence, t closes in on the value of Z as the sample size increases.
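
The convergence of t towards Z can be made visible by tabulating two-sided critical values for increasing degrees of freedom. This is an illustrative sketch; the chosen degrees of freedom and the column names are arbitrary, while `qt()` and `qnorm()` are base R quantile functions.

```r
# Degrees of freedom to compare (arbitrary selection)
dfs <- c(2, 5, 10, 18, 30, 100, 1000)

# Two-sided critical values of t at alpha = 0.05 and 0.01
crit <- data.frame(
  df        = dfs,
  t_alpha05 = qt(1 - 0.05 / 2, df = dfs),
  t_alpha01 = qt(1 - 0.01 / 2, df = dfs)
)
print(round(crit, 3))

# Corresponding two-sided critical values of Z
z_crit <- qnorm(1 - c(0.05, 0.01) / 2)
print(round(z_crit, 2))
```

Each column decreases as df grows and stays above the matching Z value, which is the pattern described in the paragraph above.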