ANOVA

Analysis of Variance (ANOVA) is used when you want to compare the means of more than two groups. The test tells you whether there is a significant difference between any of the means or not. To investigate which means differ, you need to perform a Tukey test or another pairwise test.

You need to check the following assumptions before proceeding with ANOVA:

  1. The observations are independent
  2. The observations within each group are normally distributed
  3. The observations within the groups have the same variance

The goal of ANOVA is to test whether there is a statistically significant difference between any of the group means. That is, we test the null hypothesis that \mu_1 = \mu_2 = \dots = \mu_a. This is accomplished by calculating the F-statistic.

The rationale of the test is to compare the variance between the groups to the variance within the groups. If the between variance is significantly greater than the within variance, we say that there is an effect of the factor variable being investigated, meaning that at least one group mean differs from the others. A factor is a nominal variable where each group of the factor is called a level. For example, the factor color has the levels "blue", "red", "orange", etc.
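A minimal illustration of a factor in R (the color values here are made up for the example):

	#A factor is a nominal variable; its groups are called levels
	color<-factor(c("blue","red","orange","blue","red"))
	levels(color)		#"blue" "orange" "red" (levels are sorted alphabetically)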

The variance components are computed using the equations in the ANOVA table below:

Source of variation   df               SS                                                                    MS                        F
Between groups        v_1 = a - 1      SS_B = \sum_{i=1}^{a} n_i (\overline{X}_i - \overline{X})^2           MS_B = SS_B / (a - 1)     F = MS_B / MS_W
Within groups         v_2 = n_T - a    SS_W = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (X_{ij} - \overline{X}_i)^2    MS_W = SS_W / (n_T - a)
Total                 n_T - 1          SS_T = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (X_{ij} - \overline{X})^2

where a is the number of groups (or levels), n_T is the total number of observations, n_i is the number of observations within each group, X_{ij} is the j^{th} observation in group i, \overline{X}_i is the mean of group i and \overline{X} is the mean of all observations (the grand mean).

The null hypothesis is rejected when F > F_\alpha. The critical value F_\alpha is found in an F-table for different levels of significance (e.g. \alpha = 0.05 and 0.01) at the degrees of freedom v_1 (df_B) and v_2 (df_W). That is, you are 95 % (\alpha = 0.05) or 99 % (\alpha = 0.01) confident, respectively, that the null hypothesis can be rejected, i.e. that at least one mean differs.
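Instead of looking the critical value up in a printed F-table, it can be computed in R with the quantile function qf(); here for v_1 = 3 and v_2 = 44, the degrees of freedom of the sales example below:

	#Critical F values at alpha = 0.05 and 0.01
	qf(0.95,df1=3,df2=44)
	qf(0.99,df1=3,df2=44)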

Example

A company wants to find out if there is a difference in total sales between four geographical areas. There are 12 shops in each area, giving 12 yearly total sales (million dollars) for each area (Area 1 to Area 4).

  1. Construct the null hypothesis

H0: the mean total sales do not differ between any of the areas (\mu_{Area 1} = \mu_{Area 2} = \mu_{Area 3} = \mu_{Area 4})

  2. Calculate the mean (\overline{x}) and variance (s^2) for each sample:


3. Check that the variances are equal

- Perform an F-test. Calculate the F statistic using the largest variance as the numerator and the smallest as the denominator. In this case, we use the variance of Area 1 as the numerator and the variance of Area 2 as the denominator.

F = \frac{s^2_{Area 1}}{s^2_{Area 2}} = 2.14

- Calculate the degrees of freedom:

v_{Area 1} = n - 1 = 12 - 1 = 11
v_{Area 2} = n - 1 = 12 - 1 = 11

- Check the critical value for F at \alpha = 0.05 where v_1 = 11 and v_2 = 11 in a table of critical F values: F_{\alpha=0.05} = 2.82

- Compare the calculated F statistic with F_{\alpha=0.05}:

F = 2.14 < F_{\alpha=0.05} = 2.82

- Reject or retain H0

H0 can't be rejected; the assumption of equal variances holds.
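As a quick check, the same variance ratio can be computed in R (a minimal sketch, assuming the example data has been imported as data2, as in the R section below):

	#Variance ratio (F) test by hand
	vars<-tapply(data2$Sales,data2$Area,var)	#variance of each area
	max(vars)/min(vars)				#the F statistic, about 2.14 here
	qf(0.95,df1=11,df2=11)				#the critical value, about 2.82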

4. Calculate the F-statistic

Use the equations for the degrees of freedom, Sums of Squares, Mean Squares and the F-statistic to create an ANOVA table. You can also let statistical software do this for you. With the sums of squares derived in the in-depth section below (SS_B = 25.86, SS_W = 4.34, SS_T = 30.20), a = 4 groups and n_T = 48 observations, the table becomes:

Source     df    SS       MS       F
Between    3     25.86    8.62     87.4
Within     44    4.34     0.099
Total      47    30.20

5. Reject or retain H0

F > F_{crit}, which means that the probability of getting the calculated F-value if the null hypothesis were true is less than 0.05. The null hypothesis is therefore rejected.

6. Interpret the result

The mean total sales differ between areas. Looking at the means, we may suspect that the mean of Area 3 is larger than the others. This can be tested using a Tukey test.
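A minimal sketch of such a follow-up test in R, assuming the example data has been imported as data2 (see the R section below):

	#Tukey's HSD test for pairwise differences between the area means
	m.aov<-aov(Sales~Area,data=data2)
	TukeyHSD(m.aov)		#confidence intervals and p-values for each pair of areas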

How to do it in R

############# ANOVA ###########

#1. Import the data

	data2<-read.csv("http://www.ilovestats.org/wp-content/uploads/2015/07/Example_data.csv",dec=",",sep=";")

#2. Do the ANOVA
	m<-lm(Sales~Area,data=data2)
	anova(m)
	summary(m)

#3 Visualize
	#SST
		par(mfcol=c(1,3))
		plot(data2$Sales ~rep(c(1,2,3,4),each=12),xaxt="n",main="SST",xlab="Area",ylab="Sales (million $)",las=1)
		axis(side=1,at=c(1,2,3,4),labels=c(1,2,3,4))
		abline(h=mean(data2$Sales),col="blue",lty="dashed")

		segments(rep(c(1,2,3,4),each=12),data2$Sales,
		rep(c(1,2,3,4),each=12),mean(data2$Sales),col="red")	#distance from each observation to the grand mean

		Area1<-round(tapply(data2$Sales,data2$Area,mean),digits=2)[1]
		Area2<-round(tapply(data2$Sales,data2$Area,mean),digits=2)[2]
		Area3<-round(tapply(data2$Sales,data2$Area,mean),digits=2)[3]
		Area4<-round(tapply(data2$Sales,data2$Area,mean),digits=2)[4]

	#SSB

		plot(data2$Sales ~rep(c(1,2,3,4),each=12),xaxt="n",main="SSB",xlab="Area",ylab="Sales (million $)",las=1)
		axis(side=1,at=c(1,2,3,4),labels=c(1,2,3,4))
		abline(h=mean(data2$Sales),col="blue",lty="dashed")

		segments(0.9,Area1,1.1,Area1,lwd=2)
		segments(1.9,Area2,2.1,Area2,lwd=2)
		segments(2.9,Area3,3.1,Area3,lwd=2)
		segments(3.9,Area4,4.1,Area4,lwd=2)

		segments(c(1,2,3,4),round(tapply(data2$Sales,data2$Area,mean),digits=2),
			c(1,2,3,4),mean(data2$Sales),col="red")

	#SSW

		plot(data2$Sales ~rep(c(1,2,3,4),each=12),xaxt="n",main="SSW",xlab="Area",ylab="Sales (million $)",las=1)
		axis(side=1,at=c(1,2,3,4),labels=c(1,2,3,4))
		abline(h=mean(data2$Sales),col="blue",lty="dashed")
		segments(0.9,Area1,1.1,Area1,lwd=2)
		segments(1.9,Area2,2.1,Area2,lwd=2)
		segments(2.9,Area3,3.1,Area3,lwd=2)
		segments(3.9,Area4,4.1,Area4,lwd=2)

		segments(rep(c(1,2,3,4),each=12),data2$Sales,
		rep(c(1,2,3,4),each=12),rep(round(tapply(data2$Sales,data2$Area,mean),digits=2),each=12),col="red")	#distance from each observation to its group mean

#4. Check the assumptions

	#4.1 Normality
		#QQ plot
			st.res<-rstandard(m)
			x11()
			qqnorm(st.res,ylab="Standardized Residuals",xlab="Theoretical",las=1,bty="l")
 			qqline(st.res)

	# Histogram
		x11()
		par(mfcol=c(2,2))
		tapply(data2$Sales,data2$Area,hist,col="skyblue",las=1,yaxt="n",xaxt="n",xlab="Sales",main="Histogram")

	#4.2 Equal variances
		#Compute the variances of each Area
			d<-data.frame(data2[which(data2$Area=="Area 1"),],data2[which(data2$Area=="Area 2"),],
			data2[which(data2$Area=="Area 3"),], data2[which(data2$Area=="Area 4"),])

			std<-tapply(data2$Sales,data2$Area,sd)
			var<-std^2

	#F-test
		var.test(d$Sales,d$Sales.1)

ANOVA in depth

Introduction

The idea of ANOVA is to test whether the variance between a set of groups is larger than the variance within the groups. If the variance between the groups is significantly larger, we say that there is an effect of the factor to which the groups belong (e.g. Temperature) on the dependent variable (e.g. growth). This means that the mean of at least one group deviates from the mean of the other groups.


Partitioning the variation

Since we want to compare the variance between the groups with the variance within the groups, we first need to calculate them. We can say that we are partitioning the total variance in the data set into two components: (1) the between variance (MS_B) and (2) the within variance (MS_W).

In the process of calculating these components, we first need to calculate the sum of squares, which is the numerator in the equation for the variance.

For the total variation in the data, ignoring the groups, the sum of squares is the summed squared distance between every observation and the mean of all observations (the grand mean). Using our ANOVA example of the sales of shops in four areas, this can be illustrated as:

[Figure: SST, each observation's distance (red lines) to the grand mean (dashed line)]

where the red lines illustrate the distance between each observation and the grand mean. When these distances are squared and summed we get the total sum of squares, the total variation in the data. The total sum of squares is calculated by:

SS_T = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (X_{ij} - \overline{X})^2

where X_{ij} is the value of the j^{th} observation of the i^{th} group and \overline{X} is the grand mean.

So far, the variation is unpartitioned. We could also say that there is no explained variation, only unexplained, since we are only looking at the data as a whole. In our example, SS_T = 30.20. Thus, if no factor were to explain the variation, the unexplained variation would equal SS_T = 30.20.
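The same quantity can be computed directly in R (assuming the example data has been imported as data2, as in the R section above):

	#Total sum of squares: squared distances from every observation to the grand mean
	SST<-sum((data2$Sales-mean(data2$Sales))^2)
	SST	#30.20 in this example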

Now, let's look at the variation within the groups:

[Figure: SSW, each observation's distance (red lines) to its group mean]

The red lines illustrate the distance between each observation and the mean of its group. When these are squared and summed, we get the within sum of squares (SS_W). This is the unexplained variation in the data after considering the effect of the factor (Area). In our example, this was calculated as 4.34, a considerable reduction from the unexplained variation before the effect of the factor was considered, i.e. 30.20. The distance from each observation to its group mean is shorter than its distance to the grand mean. The within sum of squares is calculated by:

SS_W = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (X_{ij} - \overline{X}_i)^2

where X_{ij} is the value of the j^{th} observation of the i^{th} group and \overline{X}_i is the mean of the i^{th} group.
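In R this can be sketched as (again assuming data2):

	#Within sum of squares: squared distances from each observation to its group mean
	SSW<-sum(tapply(data2$Sales,data2$Area,function(x) sum((x-mean(x))^2)))
	SSW	#4.34 in this example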

Now, if the unexplained variation dropped to 4.34 out of 30.20, the explained variation has to be 30.20 - 4.34 = 25.86. This component of the total variation can be illustrated as:

[Figure: SSB, each group mean's distance (red lines) to the grand mean]

The red lines correspond to the distance between the mean of each group and the grand mean. When these are squared, summed and multiplied by the number of observations in each group, we get the between sum of squares. The equation for this component of the variation is:

SS_B = \sum_{i=1}^{a} n_i (\overline{X}_i - \overline{X})^2

where n_i is the number of observations in the i^{th} group, \overline{X}_i is the mean of the i^{th} group and \overline{X} is the grand mean.
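And in R (assuming data2 and SSW as above), which also verifies that the two components add up to the total variation:

	#Between sum of squares: group size times squared distance from group mean to grand mean
	n.i<-tapply(data2$Sales,data2$Area,length)
	means.i<-tapply(data2$Sales,data2$Area,mean)
	SSB<-sum(n.i*(means.i-mean(data2$Sales))^2)
	SSB		#25.86 in this example
	SSB+SSW		#equals SST: the partitioning of the total variation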

Linear model

ANOVA can, just like linear regression, be treated as a linear model. Consider the fact that the value of each observation in a population is the sum of the mean of the population and the deviation of the observation from that mean:

x_i = \mu + e_i

where x_i is the value of the i^{th} observation, \mu is the mean of the population and e_i is the error term, the deviation from the mean: e_i = z \times \sigma or e_i = x_i - \mu.

However, when a factor is added, each observation can be expressed as:

x_{ij} = \mu + F_i + e_{ij}

where x_{ij} is the j^{th} observation of the i^{th} group, \mu is the grand mean, F_i is the effect of the factor and e_{ij} is the deviation of the j^{th} observation from the mean of the i^{th} group.

The error terms, and thus the individual observations, are not estimated by the ANOVA procedure. The model output provides estimates of the factor effects, that is, each level's deviation from the intercept. This means that the linear model estimates the mean of each group:

\overline{x}_i = \mu + F_i

If there is no effect of the factor, the means of all groups equal the grand mean.
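This can be seen in the model output in R. Note that lm() uses treatment contrasts by default, so the intercept is the mean of the first level (Area 1) rather than the grand mean, and each remaining coefficient is that level's deviation from the intercept (a sketch, assuming the model m fitted in the R section above):

	#The intercept is the mean of Area 1; the other coefficients are deviations from it
	coef(m)
	tapply(data2$Sales,data2$Area,mean)	#compare: the group means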