## ANOVA

Analysis of Variance (ANOVA) is used when you want to compare the means of more than two groups. The test tells you whether or not there is a significant difference between any of the means. To investigate which means differ, you need to perform a Tukey test or another pairwise test.

You need to check the following assumptions before proceeding with ANOVA:

1. The observations are independent
2. The observations within each group are normally distributed
3. The observations within the groups have the same variance
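As a minimal sketch of how assumptions 2 and 3 might be checked in R, the snippet below applies `shapiro.test()` to the model residuals and `bartlett.test()` to the groups. The data frame `sim` and its values are simulated for illustration only; they are not the sales data used in the example further down. These formal tests are one option among several; graphical checks and a pairwise F-test are alternatives.

```r
# Hypothetical data: 4 groups of 12 observations (simulated, not real sales)
set.seed(1)
sim <- data.frame(
  Area  = factor(rep(paste("Area", 1:4), each = 12)),
  Sales = rnorm(48, mean = rep(c(5, 5.5, 7, 5.2), each = 12), sd = 0.3)
)

m.sim <- lm(Sales ~ Area, data = sim)

# Assumption 2: normality of the residuals (Shapiro-Wilk test)
norm.test <- shapiro.test(residuals(m.sim))

# Assumption 3: equal variances across the groups (Bartlett's test)
bart <- bartlett.test(Sales ~ Area, data = sim)

# Large p-values mean the data give no evidence against the assumption
print(norm.test$p.value)
print(bart$p.value)
```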

The goal of ANOVA is to:

Test whether there is a statistically significant difference between any of the group means. That is, we test the null-hypothesis that $\mu_1 = \mu_2 = \dots = \mu_a$, where $a$ is the number of groups. This is accomplished by calculating the F-statistic.

The rationale of the test is to compare the variance between the groups to the variance within the groups. If the between variance is sufficiently greater than the within variance, we say that there is an effect of the factor variable being investigated. That means that at least one group mean differs from the others. A factor is a nominal variable where each group of the factor is called a level. For example, the factor color has the levels “blue”, “red”, “orange” etc.

The variance components are computed using the equations in the ANOVA table below:
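A standard one-way ANOVA table, consistent with the symbol definitions that follow, can be written as shown here. The layout is an assumption; the formulas are the usual textbook ones.

| Source of variation | Degrees of freedom | Sum of Squares | Mean Square | F |
|---|---|---|---|---|
| Between groups | $df_B = a - 1$ | $SS_B = \sum_{i=1}^{a} n_i (\overline{X}_i - \overline{X})^2$ | $MS_B = SS_B / df_B$ | $F = MS_B / MS_W$ |
| Within groups | $df_W = n_T - a$ | $SS_W = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (X_{ij} - \overline{X}_i)^2$ | $MS_W = SS_W / df_W$ | |
| Total | $df_T = n_T - 1$ | $SS_T = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (X_{ij} - \overline{X})^2$ | | |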

where $a$ is the number of groups (or levels), $n_T$  is the total number of observations, $n_i$ is the number of observations within each group,  $X_{ij}$ is the $j^{th}$observation in group $i$, $\overline{X}_i$ is the mean of group  $i$  and $\overline{X}$ is the mean of all observations (grand mean).

The null hypothesis is rejected when F > Fα. The critical value (Fα) is found in an F-table for the chosen level of significance (e.g. α = 0.05 or 0.01) at the degrees of freedom $v_1$ ($df_B$) and $v_2$ ($df_W$). Rejecting at α = 0.05 (or 0.01) means that an F-value this large would occur in fewer than 5 % (or 1 %) of cases if the null-hypothesis were true; the conclusion is then that at least one mean differs from the others.
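In R, the table lookup can be replaced by `qf()` for the critical value and `pf()` for the p-value. In this small sketch the degrees of freedom correspond to four groups with 12 observations each, and the observed F-value `F.obs` is a made-up number used only to show the decision rule.

```r
alpha <- 0.05
df.B  <- 4 - 1     # v1 = a - 1
df.W  <- 48 - 4    # v2 = n_T - a

# Critical value: the (1 - alpha) quantile of the F distribution
F.crit <- qf(1 - alpha, df1 = df.B, df2 = df.W)

# p-value for a hypothetical observed F statistic
F.obs <- 3.5
p.val <- pf(F.obs, df1 = df.B, df2 = df.W, lower.tail = FALSE)

# The two decision rules agree: F.obs > F.crit exactly when p.val < alpha
print(c(F.crit = F.crit, p.value = p.val))
print((F.obs > F.crit) == (p.val < alpha))
```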

Example

A company wants to find out if there is a difference in total sales between four geographical areas. There are 12 shops in each area, giving 12 values of total sales per year (million dollars) for each area (Area 1-Area 4).

1. Construct the null-hypothesis

H0: the mean total sales do not differ between any of the areas (µArea 1 = µArea 2 = µArea 3 = µArea 4)

2. Calculate the mean ($\overline{x}$) and variance ($s^2$) for each sample.

3. Check that the variances are equal

– Perform an F-test. Calculate the F statistic by using the largest variance as numerator and the smallest variance as denominator. In this case, we use the variances of Area 1 and Area 2 as numerator and denominator, respectively.

– Calculate the degrees of freedom:

$v_{area1} = n-1 = 12-1 = 11$
$v_{area2} = n-1 = 12-1 = 11$

– Check the critical value for F at α=0.05 where $v_1$ = 11 and $v_2$ = 11 in a table of critical F values: Fα=0.05 = 2.82

– Compare the calculated F statistic with Fα=0.05

F = 2.14 < Fα=0.05 = 2.82

– Reject or retain H0

H0 can’t be rejected; the data give no evidence against the assumption of equal variances.
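The table lookup in this step can be reproduced with `qf()`. The sketch uses the F-value 2.14 and the degrees of freedom from the text; the two sample variances themselves are not listed here, so only their ratio is used.

```r
F.var <- 2.14      # largest variance / smallest variance (value from the text)
v1 <- 12 - 1       # degrees of freedom of the numerator
v2 <- 12 - 1       # degrees of freedom of the denominator

# Critical value at alpha = 0.05, as read from the F-table
F.crit <- qf(0.95, df1 = v1, df2 = v2)
print(round(F.crit, 2))

# TRUE means that H0 (equal variances) is retained
print(F.var < F.crit)
```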

4. Calculate the F-statistic

Use the equations for the degrees of freedom, Sum of Squares, Mean Squares and the F-statistic to create an ANOVA table. You can also let statistical software do this for you.
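The following sketch builds the entries of the ANOVA table by hand and compares the resulting F-value with the one reported by `anova()`. The data frame `sim` is simulated for illustration (four areas, 12 shops each); it is not the sales data of the example, so the numbers will differ from those in the text.

```r
# Hypothetical data: 4 areas with 12 shops each (values are simulated)
set.seed(42)
sim <- data.frame(
  Area  = factor(rep(paste("Area", 1:4), each = 12)),
  Sales = rnorm(48, mean = rep(c(5, 5.5, 7, 5.2), each = 12), sd = 0.3)
)

a      <- nlevels(sim$Area)                     # number of groups
n.T    <- nrow(sim)                             # total number of observations
grand  <- mean(sim$Sales)                       # grand mean
g.mean <- tapply(sim$Sales, sim$Area, mean)     # group means
n.i    <- tapply(sim$Sales, sim$Area, length)   # group sizes

# Sums of squares
SSB <- sum(n.i * (g.mean - grand)^2)
SSW <- sum((sim$Sales - ave(sim$Sales, sim$Area))^2)

# Mean squares and F-statistic
MSB <- SSB / (a - 1)
MSW <- SSW / (n.T - a)
F.hand <- MSB / MSW

# The same quantities from the built-in ANOVA table
tab <- anova(lm(Sales ~ Area, data = sim))
print(tab)
print(all.equal(F.hand, tab[["F value"]][1]))
```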

5. Reject or retain H0

F > Fcrit, which means that the probability of getting an F-value at least as large as the calculated one, if the null-hypothesis were true, is less than 0.05. The null hypothesis is therefore rejected.

6. Interpret the result

The mean total sales differ between at least two of the areas. Looking at the means, we may suspect that the mean of Area 3 is larger than the others. This can be tested using a Tukey test.
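A minimal sketch of such a follow-up test in R, using `aov()` and `TukeyHSD()` from base R. The data frame `sim` is simulated for illustration and is not the sales data from the example.

```r
# Hypothetical data: 4 areas with 12 shops each (values are simulated)
set.seed(42)
sim <- data.frame(
  Area  = factor(rep(paste("Area", 1:4), each = 12)),
  Sales = rnorm(48, mean = rep(c(5, 5.5, 7, 5.2), each = 12), sd = 0.3)
)

# TukeyHSD() needs a model fitted with aov()
fit <- aov(Sales ~ Area, data = sim)
tuk <- TukeyHSD(fit, conf.level = 0.95)

# One row per pair of areas: difference in means, confidence limits
# and the p-value adjusted for multiple comparisons
print(tuk)
```

Pairs whose confidence interval excludes zero (equivalently, whose adjusted p-value is below 0.05) are the ones whose means differ.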

How to do it in R

############# ANOVA ###########

#1. Import the data
# (data2 is assumed to hold the columns Sales and Area, sorted by Area, 12 rows per area)

#2. Do the ANOVA
m<-lm(Sales~Area,data=data2)
anova(m)
summary(m)

#3. Visualize
x.pos<-rep(c(1,2,3,4),each=12) # x position of each observation
grand<-mean(data2$Sales) # grand mean
grp<-round(tapply(data2$Sales,data2$Area,mean),digits=2) # group means
par(mfcol=c(1,3))

#SST: distance from each observation to the grand mean
plot(data2$Sales~x.pos,xaxt="n",main="SST",xlab="Area",ylab="Sales (million $)",las=1)
axis(side=1,at=c(1,2,3,4),labels=c(1,2,3,4))
abline(h=grand,col="blue",lty="dashed")
segments(x.pos,data2$Sales,x.pos,grand,col="red")

#SSB: distance from each group mean to the grand mean
plot(data2$Sales~x.pos,xaxt="n",main="SSB",xlab="Area",ylab="Sales (million $)",las=1)
axis(side=1,at=c(1,2,3,4),labels=c(1,2,3,4))
abline(h=grand,col="blue",lty="dashed")
segments(c(1,2,3,4)-0.1,grp,c(1,2,3,4)+0.1,grp,lwd=2)
segments(c(1,2,3,4),grp,c(1,2,3,4),grand,col="red")

#SSW: distance from each observation to its group mean
plot(data2$Sales~x.pos,xaxt="n",main="SSW",xlab="Area",ylab="Sales (million $)",las=1)
axis(side=1,at=c(1,2,3,4),labels=c(1,2,3,4))
abline(h=grand,col="blue",lty="dashed")
segments(c(1,2,3,4)-0.1,grp,c(1,2,3,4)+0.1,grp,lwd=2)
segments(x.pos,data2$Sales,x.pos,rep(grp,each=12),col="red")

#4. Check the assumptions
#4.1 Normality
#QQ plot
st.res<-rstandard(m)
dev.new() # opens a new plot window (portable replacement for x11())
qqnorm(st.res,ylab="Standardized Residuals",xlab="Theoretical",las=1,bty="l")
qqline(st.res)

#Histogram
dev.new()
par(mfcol=c(2,2))
invisible(tapply(data2$Sales,data2$Area,hist,col="skyblue",las=1,yaxt="n",xaxt="n",xlab="Sales",main="Histogram"))

#4.2 Equal variances
#Compute the variances of each Area
std<-tapply(data2$Sales,data2$Area,sd)
variances<-std^2
variances
#F-test: Area 1 against Area 2
var.test(data2$Sales[data2$Area=="Area 1"],data2$Sales[data2$Area=="Area 2"])



### ANOVA in depth

Introduction

The idea of ANOVA is to test whether the variance between a set of groups is larger than the variance within the groups. If the variance between the groups is significantly larger, we say that there is an effect of the factor to which the groups belong (e.g. temperature) on the dependent variable (e.g. growth). This means that the mean of at least one group deviates from the means of the other groups.

Partitioning the variation

Since we want to compare the variance between the groups with the variance within the groups, we first need to calculate them. We can say that we are partitioning the total variation in the data set into two components: (1) the between-group variation, which gives the between variance (MSB), and (2) the within-group variation, which gives the within variance (MSW).

In the process of calculating these components we first need to calculate the sums of squares; a sum of squares is the numerator in the equation for a variance.

For the total variation in the data, ignoring the groups, the sum of squares is the sum of the squared distances between every observation and the mean of all observations (the grand mean). Using our ANOVA example of sales in shops in four areas, this corresponds to the “SST” panel produced by the R code above, where the red lines illustrate the distance between each observation and the grand mean. When these distances are squared and summed we get the total sum of squares, the total variation in the data. The total sum of squares is calculated by:

$SS_T = \sum_{i=1}^{a}\sum_{j=1}^{n_i}(X_{ij} - \overline{X})^2$

where $X_{ij}$ is the value of the $j^{th}$ observation of the $i^{th}$ group and $\overline{X}$ is the grand mean.

So far, the variation is unpartitioned. We could also say that there is no explained variation, only unexplained, since we are only looking at the data as a whole. In our example, $SS_T$ = 30.20. Thus, if no factor were to explain the variation, the unexplained variation would equal $SS_T$ = 30.20.

Now, let’s look at the variation within the groups, shown in the “SSW” panel produced by the R code above. The red lines illustrate the distance between each observation and the mean of its group. When these are squared and summed, we get the within sum of squares ($SS_W$). This is the unexplained variation in the data after considering the effect of the factor (Area). In our example, this was calculated as 4.34, which is a considerable reduction from the unexplained variation before the effect of the factor was considered, i.e. 30.20. On the whole, the distances to the group means are shorter than the distances to the grand mean. The within sum of squares is calculated by:

$SS_W = \sum_{i=1}^{a}\sum_{j=1}^{n_i}(X_{ij} - \overline{X}_i)^2$

where $X_{ij}$ is the value of the $j^{th}$ observation of the $i^{th}$ group and $\overline{X}_i$ is the mean of the $i^{th}$ group.

Now, if the unexplained variation dropped from 30.20 to 4.34, the explained variation has to be 30.20 - 4.34 = 25.86. This component of the total variation is shown in the “SSB” panel produced by the R code above. The red lines correspond to the distance between the mean of each group and the grand mean. When these distances are squared, multiplied by the number of observations in the group and summed over the groups, we get the between sum of squares ($SS_B$). The equation for this component of the variation is:

$SS_B = \sum_{i=1}^{a} n_i(\overline{X}_i - \overline{X})^2$

where $n_i$ is the number of observations in the $i^{th}$ group, $\overline{X}_i$ is the mean of the $i^{th}$ group and $\overline{X}$ is the grand mean.
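As a worked check with the sums of squares quoted above, and assuming the design of the example ($a = 4$ areas and $n_T = 48$ shops, so $df_B = 3$ and $df_W = 44$), the partition and the resulting F-statistic are:

$SS_T = SS_B + SS_W = 25.86 + 4.34 = 30.20$

$MS_B = \frac{SS_B}{a - 1} = \frac{25.86}{3} = 8.62$

$MS_W = \frac{SS_W}{n_T - a} = \frac{4.34}{44} \approx 0.0986$

$F = \frac{MS_B}{MS_W} \approx \frac{8.62}{0.0986} \approx 87.4$

The share of the total variation explained by the factor is $SS_B / SS_T = 25.86 / 30.20 \approx 0.86$.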

Linear model

The ANOVA can, just like the linear regression, be treated as a linear model. Consider the fact that the value of each observation in a population is the sum of the mean of the population and the deviation of the observation from the mean:

$x_i = \mu + e_i$

where $x_i$ is the value of the $i^{th}$ observation, $\mu$ is the mean of the population and $e_i$ is the error term, i.e. the deviation from the mean: $e_i = x_i - \mu$, or equivalently $e_i = z_i \times \sigma$, where $z_i$ is the standardized value of the observation.

However, when a factor is added, each observation can be expressed as:

$x_{ij} = \mu + F_i + e_{ij}$

where $x_{ij}$ is the $j^{th}$ observation of the $i^{th}$ group, $\mu$ is the grand mean, $F_i$ is the effect of the factor for group $i$ and $e_{ij}$ is the deviation of the $j^{th}$ observation from the mean of its group.

The error terms, and thus the individual observations, are not estimated by the ANOVA procedure. The model output provides estimates of the effect of the factor, that is, the deviation from the intercept for each level of the factor. The linear model therefore estimates the mean of each group:

$\overline{x}_i = \mu + F_i$

If there is no effect of the factor, $F_i = 0$ for every group and the means of all groups equal the grand mean.
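The link between the model coefficients and the group means can be checked in R. One caveat: with R's default treatment contrasts, the intercept of `lm()` is the mean of the first level rather than the grand mean $\mu$, and the other coefficients are differences from that level. With sum-to-zero contrasts (`contr.sum`) and a balanced design, the intercept equals the grand mean, as in the equation above. The data frame `sim` is simulated for illustration only.

```r
# Hypothetical balanced data: 4 areas with 12 shops each (values are simulated)
set.seed(42)
sim <- data.frame(
  Area  = factor(rep(paste("Area", 1:4), each = 12)),
  Sales = rnorm(48, mean = rep(c(5, 5.5, 7, 5.2), each = 12), sd = 0.3)
)
group.means <- tapply(sim$Sales, sim$Area, mean)

# Default (treatment) contrasts: intercept = mean of the first level,
# remaining coefficients = differences from the first level
m.trt <- lm(Sales ~ Area, data = sim)
b <- coef(m.trt)
fitted.means <- c(b[1], b[1] + b[-1])
print(all.equal(unname(fitted.means), as.vector(group.means)))

# Sum-to-zero contrasts: intercept = grand mean (balanced design)
m.sum <- lm(Sales ~ Area, data = sim, contrasts = list(Area = "contr.sum"))
print(all.equal(unname(coef(m.sum)[1]), mean(sim$Sales)))
```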