## ANOVA

Analysis of Variance (ANOVA) is used when you want to compare the means of more than two groups. The test tells you whether there is a significant difference between any of the means or not. To investigate which means differ, you need to perform a Tukey test or another pairwise test.

You need to check the fol­low­ing assump­tions before pro­ceed­ing with ANOVA:

1. The obser­va­tions are inde­pen­dent
2. The observations within each group are normally distributed
3. The obser­va­tions with­in the groups have the same vari­ance
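As a rough sketch of how assumptions 2 and 3 might be checked in R, the block below fits the model on simulated data and applies `shapiro.test()` to the residuals and `bartlett.test()` to the groups. The data frame `sim` and its values are invented for illustration, and these two tests are only one possible choice. Independence (assumption 1) follows from the study design and cannot be verified this way.

```r
# Simulated example data: 4 areas x 12 shops (values are invented)
set.seed(1)
sim <- data.frame(
  Area  = factor(rep(paste("Area", 1:4), each = 12)),
  Sales = rnorm(48, mean = rep(c(2.0, 2.2, 3.5, 2.1), each = 12), sd = 0.3)
)

m <- lm(Sales ~ Area, data = sim)

# Assumption 2: normality, checked on the model residuals
norm_test <- shapiro.test(residuals(m))

# Assumption 3: equal variances across the groups
var_test <- bartlett.test(Sales ~ Area, data = sim)

# Large p-values give no evidence against the respective assumption
print(norm_test$p.value)
print(var_test$p.value)
```

Formal tests like these are best read together with a QQ plot and a look at the group variances, since small samples give them little power.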

The goal of ANOVA is to test if there is a statistically significant difference between any of the means of the groups. That means that we test the null hypothesis that $\mu_1 = \mu_2 = \dots = \mu_a$, where $a$ is the number of groups. This is accomplished by calculating the F-statistic.

The rationale of the test is to compare the variance between the groups to the variance within the groups. If the between variance is sufficiently greater than the within variance, we say that there is an effect of the factor variable being investigated. That means that at least one group mean differs from the others. A factor is a nominal variable where each group of the factor is called a level. For example, the factor color has the levels “blue”, “red”, “orange” etc.

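The variance components are computed using the equations in the ANOVA table below:

| Source of variation | Degrees of freedom | Sum of Squares | Mean Square | F |
| --- | --- | --- | --- | --- |
| Between groups | $df_B = a - 1$ | $SS_B = \sum_{i=1}^{a} n_i (\overline{X}_i - \overline{X})^2$ | $MS_B = SS_B / df_B$ | $F = MS_B / MS_W$ |
| Within groups | $df_W = n_T - a$ | $SS_W = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (X_{ij} - \overline{X}_i)^2$ | $MS_W = SS_W / df_W$ | |
| Total | $df_T = n_T - 1$ | $SS_T = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (X_{ij} - \overline{X})^2$ | | |

where $a$ is the number of groups (or levels), $n_T$ is the total number of observations, $n_i$ is the number of observations within group $i$, $X_{ij}$ is the $j^{th}$ observation in group $i$, $\overline{X}_i$ is the mean of group $i$ and $\overline{X}$ is the mean of all observations (grand mean).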
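The quantities defined above can be computed by hand in a few lines of R. This is a minimal sketch on invented toy data (the vectors `y` and `g` are assumptions, not the sales data), followed by a cross-check against R's built-in ANOVA table.

```r
# Toy data: 3 groups with 3 observations each (invented values)
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
g <- factor(rep(c("A", "B", "C"), each = 3))

a   <- nlevels(g)                  # number of groups
n_T <- length(y)                   # total number of observations
n_i <- tapply(y, g, length)        # observations per group
group_means <- tapply(y, g, mean)  # mean of each group
grand_mean  <- mean(y)             # mean of all observations

# Sums of squares
ss_b <- sum(n_i * (group_means - grand_mean)^2)
ss_w <- sum((y - group_means[as.integer(g)])^2)
ss_t <- sum((y - grand_mean)^2)

# Degrees of freedom, mean squares and the F-statistic
df_b <- a - 1
df_w <- n_T - a
ms_b <- ss_b / df_b
ms_w <- ss_w / df_w
f_stat <- ms_b / ms_w

print(c(SS_B = ss_b, SS_W = ss_w, SS_T = ss_t, F = f_stat))

# Cross-check against R's built-in ANOVA table
print(anova(lm(y ~ g)))
```

For this toy data the sums of squares are 54 (between), 6 (within) and 60 (total), and F is 27.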

The null hypothesis is rejected when $F > F_\alpha$. The critical value ($F_\alpha$) is found in an F-table for the chosen level of significance (e.g. α = 0.05 or 0.01) at the degrees of freedom $v_1$ ($df_B$) and $v_2$ ($df_W$). Rejecting at α = 0.05 (or 0.01) means that, if the null hypothesis were true, an F value at least this large would occur in fewer than 5 % (or 1 %) of samples. In that case we conclude that at least one mean differs from the others.
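Instead of an F-table, the critical value can be obtained from R's F distribution functions. The sketch below assumes $a = 4$ groups and $n_T = 48$ observations (the layout of the sales example in this section). The observed statistic `f_obs` is a hypothetical value chosen only to show the decision rule.

```r
# Degrees of freedom, assuming a = 4 groups and n_T = 48 observations
a    <- 4
n_T  <- 48
df_b <- a - 1     # v1
df_w <- n_T - a   # v2

# Critical values: upper-tail quantiles of the F distribution
f_crit_05 <- qf(1 - 0.05, df_b, df_w)
f_crit_01 <- qf(1 - 0.01, df_b, df_w)

# p-value for an observed F statistic (hypothetical value, for illustration)
f_obs <- 3.5
p_val <- pf(f_obs, df_b, df_w, lower.tail = FALSE)

print(c(f_crit_05 = f_crit_05, f_crit_01 = f_crit_01, p = p_val))

# Reject H0 at alpha = 0.05 when f_obs exceeds the critical value,
# which is equivalent to p_val < 0.05
print(f_obs > f_crit_05)
```

Comparing `f_obs` with the critical value and comparing the p-value with α always lead to the same decision.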

### Example

A company wants to find out if there is a difference in total sales between four geographical areas. There are 12 shops in each area, giving 12 values of total sales per year (million dollars) for each area (Area 1-Area 4), or 48 observations in all.

1. Con­struct the null-hypoth­e­sis

H0: the mean total sales do not differ between any of the areas (µArea 1 = µArea 2 = µArea 3 = µArea 4)

2. Calculate the mean ( $\overline{x}$) and variance ( $s^2$) for each sample

3. Check that the variances are equal

- Perform an F-test. Calculate the F statistic by using the largest variance as numerator and the smallest variance as denominator. In this case, we use the variance of Area 1 and Area 2 as numerator and denominator, respectively.

- Calculate the degrees of freedom: $v_{area1} = n-1 = 12-1 = 11$ and $v_{area2} = n-1 = 12-1 = 11$

- Check the critical value for F at α=0.05 where $v_{area1}$ = 11 and $v_{area2}$ = 11 in a table of critical F values: Fα=0.05 = 2.82

- Compare the calculated F statistic with Fα=0.05: F = 2.14 < Fα=0.05 = 2.82

- Reject or retain H0: H0 can’t be rejected, so there is no evidence against the assumption of equal variances.

4. Cal­cu­late the F-sta­tis­tic

Use the equations for the degrees of freedom, Sum of Squares, Mean Squares and the F-statistic to create an ANOVA table. You can also let statistical software do this for you.

5. Reject or retain H0

F > Fcrit, which means that the probability of getting an F-value at least as large as the calculated one, if the null hypothesis were true, is less than 0.05. The null hypothesis is therefore rejected.

6. Inter­pret the result

The mean total sales differ between areas. Looking at the means, we may suspect that the mean of Area 3 is larger than the others. This can be tested using a Tukey test.
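A Tukey test of this kind could be run in R with `TukeyHSD()`, which expects a model fitted with `aov()`. The sketch below uses simulated data in which Area 3 is given a higher mean on purpose. The data frame `sim` and all its values are invented and are not the sales figures from the example.

```r
# Simulated example data: 4 areas x 12 shops (values are invented)
set.seed(42)
sim <- data.frame(
  Area  = factor(rep(paste("Area", 1:4), each = 12)),
  Sales = rnorm(48, mean = rep(c(2.0, 2.2, 3.5, 2.1), each = 12), sd = 0.3)
)

# TukeyHSD() needs a model fitted with aov()
fit   <- aov(Sales ~ Area, data = sim)
tukey <- TukeyHSD(fit, conf.level = 0.95)

# One row per pair of areas: difference in means, confidence limits
# and a p-value adjusted for the number of comparisons
print(tukey)
```

With four areas there are six pairwise comparisons. Pairs whose confidence interval excludes zero differ at the chosen level.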

### How to do it in R

    ############# ANOVA ###########

    # 1. Import the data
    # data2 is assumed to be a data frame with the columns Sales and Area

    # 2. Do the ANOVA
    m <- lm(Sales ~ Area, data = data2)
    anova(m)
    summary(m)

    # 3. Visualize
    x <- as.integer(factor(data2$Area))   # area index (1-4) of each observation
    grand.mean <- mean(data2$Sales)
    group.means <- round(tapply(data2$Sales, data2$Area, mean), digits = 2)

    # Scatter plot of sales per area with the grand mean as a dashed line
    base.plot <- function(title) {
      plot(x, data2$Sales, xaxt = "n", main = title, xlab = "Area",
           ylab = "Sales (million $)", las = 1)
      axis(side = 1, at = 1:4, labels = 1:4)
      abline(h = grand.mean, col = "blue", lty = "dashed")
    }

    # Short horizontal bars marking the group means
    mean.bars <- function() {
      segments(1:4 - 0.1, group.means, 1:4 + 0.1, group.means, lwd = 2)
    }

    par(mfcol = c(1, 3))

    # SST: each observation to the grand mean
    base.plot("SST")
    segments(x, data2$Sales, x, grand.mean, col = "red")

    # SSB: each group mean to the grand mean
    base.plot("SSB")
    mean.bars()
    segments(1:4, group.means, 1:4, grand.mean, col = "red")

    # SSW: each observation to its group mean
    base.plot("SSW")
    mean.bars()
    segments(x, data2$Sales, x, group.means[x], col = "red")

    # 4. Check the assumptions
    # 4.1 Normality
    # QQ plot
    st.res <- rstandard(m)
    dev.new()
    qqnorm(st.res, ylab = "Standardized Residuals", xlab = "Theoretical",
           las = 1, bty = "l")
    qqline(st.res)

    # Histogram of each area
    dev.new()
    par(mfcol = c(2, 2))
    tapply(data2$Sales, data2$Area, hist, col = "skyblue", las = 1,
           yaxt = "n", xaxt = "n", xlab = "Sales", main = "Histogram")

    # 4.2 Equal variances
    # Compute the variance of each area
    std <- tapply(data2$Sales, data2$Area, sd)
    variances <- std^2

    # F-test comparing Area 1 and Area 2
    var.test(data2$Sales[data2$Area == "Area 1"],
             data2$Sales[data2$Area == "Area 2"])



### ANOVA in depth

#### Introduction

The idea of ANOVA is to test whether the variance between a set of groups is larger than the variance within the groups. If the variance between the groups is significantly larger, we say that there is an effect of the factor to which the groups belong (e.g. temperature) on the dependent variable (e.g. growth). This means that the mean of at least one group deviates from the mean of the other groups.

#### Partitioning the variation

Since we want to compare the variance between the groups with the variance within the groups, we first need to calculate them. We can say that we are partitioning the total variation in the data set into two components: (1) the between variance (MSB) and (2) the within variance (MSW).

In the process of cal­cu­lat­ing these com­po­nents we first need to cal­cu­late the sum of squares, which  is the numer­a­tor in the equa­tion for the vari­ance.

For the total variation in the data neglecting the groups, the sum of squares is the summed squared distance between every observation and the mean of all observations (the grand mean). Using our ANOVA example of sales of different shops in four areas, this can be illustrated by drawing a red line from each observation to the grand mean (the SST panel in the R code above). When these distances are squared and summed we get the total sum of squares, the total variation in the data. The total sum of squares is calculated by:

$$SS_T = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (X_{ij} - \overline{X})^2$$

where $X_{ij}$ is the value of the $j^{th}$ observation of the $i^{th}$ group and $\overline{X}$ is the grand mean.

So far, the variation is unpartitioned. We could also say that there is no explained variation, only unexplained, since we are only looking at the data as a whole. In our example, $SS_T = 30.20$. Thus, if no factor was to explain the variation, the unexplained variation would equal $SS_T = 30.20$.

Now, let’s look at the vari­a­tion with­in the groups:

The red lines illustrate the distance between each observation and the mean of its group. When these are squared and summed, we get the within sum of squares ( $SS_W$). This is the unexplained variation in the data after considering the effect of the factor (Area). In our example, this was calculated as 4.34, which is a considerable reduction from the unexplained variation before the effect of the factor was considered, i.e. 30.20. Taken together, the distances to the group means are shorter than the distances to the grand mean. The within sum of squares is calculated by:

$$SS_W = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (X_{ij} - \overline{X}_i)^2$$

where $X_{ij}$ is the value of the $j^{th}$ observation of the $i^{th}$ group and $\overline{X}_i$ is the mean of the $i^{th}$ group.

Now, if the unexplained variation dropped to 4.34 out of 30.20, the explained variation has to be $30.20 - 4.34 = 25.86$. This component of the total variation can be illustrated as:

The red lines correspond to the distance between the mean of each group and the grand mean. When each distance is squared, multiplied by the number of observations in its group and summed over the groups, we get the between sum of squares ( $SS_B$). The equation for this component of the variation is:

$$SS_B = \sum_{i=1}^{a} n_i (\overline{X}_i - \overline{X})^2$$

where $n_i$ is the number of observations in the $i^{th}$ group, $\overline{X}_i$ is the mean of the $i^{th}$ group and $\overline{X}$ is the grand mean.
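As a worked check of the partitioning, the sums of squares from the example can be carried through to the F-statistic. The degrees of freedom assume $a = 4$ areas and $n_T = 48$ shops, as described in the example. The mean squares and the F value are derived here from the rounded sums of squares rather than quoted from software output, so treat them as approximate.

```latex
\begin{align*}
SS_T &= SS_B + SS_W = 25.86 + 4.34 = 30.20 \\
MS_B &= \frac{SS_B}{a - 1} = \frac{25.86}{3} = 8.62 \\
MS_W &= \frac{SS_W}{n_T - a} = \frac{4.34}{44} \approx 0.0986 \\
F &= \frac{MS_B}{MS_W} \approx \frac{8.62}{0.0986} \approx 87.4
\end{align*}
```

An F value of this size is far above the critical value at α = 0.05 for 3 and 44 degrees of freedom (roughly 2.8), which agrees with the decision to reject the null hypothesis in the example.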

#### Linear model

The ANOVA can, just like the lin­ear regres­sion, be treat­ed as a lin­ear mod­el. Con­sid­er the fact that the val­ue of each obser­va­tion in a pop­u­la­tion is the sum of the mean of the pop­u­la­tion and the devi­a­tion of the obser­va­tion from the mean: $x_i = \mu + e_i$

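where $x_i$ is the value of the $i^{th}$ observation, $\mu$ is the mean of the population and $e_i$ is the error term, i.e. the deviation from the mean: $e_i = x_i - \mu$, or equivalently $e_i = z_i \times \sigma$, where $z_i$ is the standardized score of the observation and $\sigma$ is the population standard deviation.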

However, when a factor is added, each observation can be expressed as: $x_{ij} = \mu + F_i + e_{ij}$

where $x_{ij}$ is the $j^{th}$ observation of the $i^{th}$ group, $\mu$ is the grand mean, $F_i$ is the effect of the factor for the $i^{th}$ group and $e_{ij}$ is the deviation of the $j^{th}$ observation from the mean of the $i^{th}$ group.

The error term, and thus the individual observations, are not estimated by the ANOVA procedure. The model output provides the estimates of the effect of the factor, that is, the deviation from the intercept for each level of the factor. This means that the linear model estimates the mean of each group: $\overline{x}_i = \mu + F_i$

If there is no effect by the fac­tor, the mean of all groups equals the grand mean.
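How the intercept and the effects appear in R's output depends on how the factor is coded. The sketch below uses invented toy data (`y` and `g` are assumptions) to compare R's default treatment coding with sum-to-zero coding, which matches the $\mu + F_i$ formulation when the groups are of equal size.

```r
# Toy data: 3 groups with 3 observations each (invented values)
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
g <- factor(rep(c("A", "B", "C"), each = 3))

# Default (treatment) coding: the intercept is the mean of the first level,
# the other coefficients are differences from that level
m_treat <- lm(y ~ g)
print(coef(m_treat))

# Sum-to-zero coding: the intercept is the mean of the group means
# (equal to the grand mean when the groups have equal size) and each
# coefficient is the effect F_i of one level; the effect of the last
# level is minus the sum of the others
m_sum <- lm(y ~ g, contrasts = list(g = "contr.sum"))
print(coef(m_sum))

# Under either coding the fitted values are the group means
print(tapply(fitted(m_sum), g, mean))
```

For this toy data the treatment coding gives 2, 3 and 6 (mean of group A and the differences of B and C from A), while the sum-to-zero coding gives 5, -3 and 0 (grand mean and the effects of A and B). The group means 2, 5 and 8 are recovered in both cases.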