## Linear regression

Linear regression is used when you want to test for a correlation between a dependent variable and an independent variable. The equation, or model, that results from the linear regression can be used to predict values of the dependent variable from values of the independent variable. Linear regression is a special case of a generalized linear model with the identity link function.

You need to check the following assumptions before proceeding with a linear regression:

1. The values of x are fixed, i.e. measured without error
2. The observations are independent
3. The observations of y at each x have the same variance ($s_{x=1}^2 = s_{x=2}^2 = ... = s_{x=n}^2$)
4. The observations of y at each x are normally distributed
5. There is a linear relationship between $y$ and $x$

The linear regression follows the simple equation of a straight line:

$y = a + bx$

where $y$ is the dependent variable, $a$ is the intercept, $b$ is the slope (the regression coefficient) and $x$ is the independent variable.

The goal of a linear regression is to:

1. Estimate the values of $a$ and $b$ so that we can describe the correlation between $y$ and $x$.
2. Test if there is a statistically significant correlation between $y$ and $x$. That means that we test the null hypothesis that the line has no slope, i.e. $b = 0$.

These two objectives can easily be accomplished using R or any other statistical software, including Excel. The output typically gives the estimates of $a$ and $b$ as well as their standard errors. The null hypothesis is tested using a t-test, which tells us whether $b$ deviates from zero. The associated p value equals the probability of obtaining the estimated $b$ if the true $b = 0$. H0 can be rejected if the p value associated with $b$ is less than the pre-selected significance level, e.g. 0.01 or 0.05.

Example

Is there a correlation between the following lengths and weights of fish?

Here is how to do it in R:

```r
# 1. The data
y <- c(0.18,0.22,0.21,0.38,0.46,0.5,0.58,0.8,0.8) # Weights (g)
x <- c(26,27,28,32,36,37,36.5,40,41)              # Lengths (mm)

# 2. Specify the model: a linear regression of y on x
m <- lm(y ~ x)

# 3. Hypothesis testing using a t-test
summary(m)
# Outputs a summary of the linear model (m) including the coefficients,
# their SE, t value and P. The row for x is the one of interest for
# testing the null hypothesis of b = 0.

# 4. Hypothesis testing using ANOVA
anova(m)
# Outputs an ANOVA table, which tests if there is a correlation
# between y and x. The null hypothesis is rejected at P < 0.05.

# 5. Plotting the regression
# Creates a plot with xy points and a regression line based on the model
plot(x, y, las=1, xlab="Length (mm)", ylab="Wet weight (g)", bty="l",
     xaxt="n", cex.lab=1.3)
axis(side=1, at=seq(26,41,2), labels=seq(26,41,2))
abline(coef(m)[1], coef(m)[2])

# Adds 95 % confidence limits
xp <- seq(min(x), max(x))
p <- predict(m, newdata=data.frame(x=xp), interval="confidence",
             level=0.95)
lines(xp, p[,2], col="red", lty=2)
lines(xp, p[,3], col="red", lty=2)

# 6. Checking the assumptions
# Q-Q plot to check the assumption of normality
st.res <- rstandard(m)
qqnorm(st.res, ylab="Standardized Residuals", xlab="Theoretical",
       las=1, bty="l")
qqline(st.res)

# Residual plot to check homogeneity of variances and linearity
plot(m$fitted.values, m$residuals, xlab="Predicted", ylab="Residuals",
     las=1, bty="l", main="Residual plot")
abline(h=0, lty=3)

# Autocorrelation plot to check for independence of the residuals
acf(m$residuals, type="correlation", main="Autocorrelation plot", las=1)
```



### Linear regression in depth

Here I go through the linear regression more in depth. This section contains the following:

- Fitting and describing a regression line
- Confidence intervals
- Variation components
- Hypothesis testing using t-test
- Hypothesis testing using ANOVA
- Checking the assumptions

Fitting and describing a regression line

Now, we will go through how to fit a regression line using ordinary least squares, and how to mathematically describe the line.

Let's consider the correlation between the length (mm) and wet weight (g) of the fish from the example above.

In a regression, we want to fit a line that best describes the correlation between the variables, in this case wet weight and length. This is achieved using ordinary least squares: we fit the line that minimizes the squared residuals, i.e. the squared distances between each point and the regression line.

This line can be described using a linear model, which basically is the same as the equation for a straight line:

$y = a + bx$

where $y$ is the value of the dependent variable, $a$ is the intercept, $b$ is the slope of the regression line and $x$ is the value of the independent variable.

The intercept is the starting value, the value of $y$ when $x = 0$. That is where the line crosses the y-axis. The slope is the change in $y$ for a one-unit increase in $x$.

To calculate the intercept, we must first calculate the slope:

$b = \frac{n\sum{xy} - \sum{x}\sum{y}}{n\sum{x^2} - (\sum{x})^2}$

where $n$ is the number of points (x,y coordinates), $x$ is the independent variable and $y$ is the dependent variable.

For our example we get:

$b = \frac{9 \times 149.69 - 303.5 \times 4.13}{9 \times 10491.25 - 303.5^2} = \frac{93.76}{2309} = 0.04$

As a rough approximation, you could also take the difference between the first and the last point in the plot for both $y$ and $x$ according to the following equation:

$b \approx \frac{y_n - y_1}{x_n - x_1}$

The intercept $a$ is calculated by the following equation:

$a = \overline{y}- b\overline{x}$

where $\overline{y}$ is the mean of all $y$ values, $b$ is the slope and $\overline{x}$ is the mean of all $x$ values.

For our example we get:

$a = 0.459 - 0.04 \times 33.72 = -0.90$

Putting everything together, the model looks like this:

$\text{wet weight} = -0.90 + 0.04 \times \text{Length}$

This equation can now be used to predict the wet weight for any length. However, the model should only be used within the interval of lengths it was built upon; that is, you should not extrapolate.
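As a quick illustration, the fitted model can be used to predict the weight of a fish within the observed length range; the 35 mm length below is just an arbitrary value inside that range, not taken from the original text:

```r
# Fit the model from the fish example and predict within the data range
y <- c(0.18,0.22,0.21,0.38,0.46,0.5,0.58,0.8,0.8)  # Weights (g)
x <- c(26,27,28,32,36,37,36.5,40,41)               # Lengths (mm)
m <- lm(y ~ x)

# Predicted wet weight (g) for a hypothetical 35 mm fish
predict(m, newdata = data.frame(x = 35))

# By hand, using the rounded coefficients from the text:
-0.90 + 0.04 * 35   # = 0.50 g
```

Note that `predict()` uses the unrounded coefficients, so it differs slightly from the hand calculation.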

But what does the fitted line actually represent? Due to both genes and environment, fish of the same length do not all have the same weight; there is some variation in wet weight at each length. This can be illustrated by adding some points (fish) to the plot:

What we can see here is that the regression line runs through the mean of most of the distributions of wet weight at each length. That means that the regression estimates the mean of all possible values of $y$ for any given $x$.

Now, let's examine a case where there is no correlation between $x$ and $y$:

When there is no correlation, the regression line is simply estimating the mean of $y$. The value of the line is the same for all $x$. This means that the slope is zero and that $y = a$ (the intercept). It also means that $x$ has no effect on $y$.

Variation components

In the case above with no correlation, you can easily see that the total variation in the regression is the same as the variation in $y$. We can make it clearer by adding the residuals:

Each squared residual (red lines), i.e. the squared distance between a point and the regression line, is expressed as:

$e = (y - \widehat{y})^2$

where $y$ is the observed value of the dependent variable and $\widehat{y}$ is the fitted value (the value on the regression line) of the dependent variable.

In this case, the fitted value $\widehat{y}$ is the same as the mean of $y$, $\overline{y}$. It can then be expressed as:

$e = (y - \overline{y})^2$

When summing all the squared distances we get the residual sum of squares:

$SS_{res} = \sum{(y - \widehat{y})^2}$

Does this look familiar? Yes, it is the numerator in the equation for the variance. In the case of no correlation, the numerator of the variance (the sum of squares of $y$) is the same as the residual sum of squares (SSres). So, if there is no correlation:

$SS_{res} = SS_y = \sum{(y - \overline{y})^2}$

The sum of squares of $y$ (SSy) is also called the total sum of squares, SST, since it simply incorporates the total variation of the dependent variable. The residual sum of squares is a measure of how much unexplained variation there is in the regression. In the case of no correlation, SSres = SST; in other words, there is 100 % unexplained variation.

Now let's examine our fish case, where there is a correlation. In all cases, we are dealing with a total amount of variation (SST), the variation in $y$ that is to be explained by the variation in $x$:

The red dotted lines are the squared residuals from each point to the mean, which when summed become the total sum of squares:

$SS_y = SS_T = \sum{(y - \overline{y})^2} = 0.452$

And now the regression line is added to the plot:

The solid red lines are the squared residuals:

$SS_{res} = \sum{(y-\widehat{y})^2} = 0.03$

The proportion of unexplained variation can be calculated as:

$\text{Prop. unexplained } SS = \frac{SS_{res}}{SS_T} = \frac{0.029}{0.452} = 0.06$

We have one variation component left to explain: the sum of squares explained by the regression. This component is simply calculated by:

$SS_{reg} = SS_T - SS_{res} = 0.452 - 0.029 = 0.423$

The proportion of explained variation is thus:

$\text{Prop. explained } SS = \frac{SS_{reg}}{SS_T} = \frac{0.423}{0.452} = 0.94$

The proportion of explained variation is the same as the coefficient of determination, $R^2$.

These sums of squares can be calculated step by step from the individual observations.
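A short R sketch, using the fish data from the example above, computes the same variation components step by step:

```r
# Variation components for the fish data, computed from scratch
y <- c(0.18,0.22,0.21,0.38,0.46,0.5,0.58,0.8,0.8)  # Weights (g)
x <- c(26,27,28,32,36,37,36.5,40,41)               # Lengths (mm)
m <- lm(y ~ x)

SST   <- sum((y - mean(y))^2)    # total sum of squares
SSres <- sum((y - fitted(m))^2)  # unexplained (residual) sum of squares
SSreg <- SST - SSres             # explained sum of squares

round(c(SST = SST, SSres = SSres, SSreg = SSreg, R2 = SSreg / SST), 3)
# SST = 0.452, SSres = 0.029, SSreg = 0.423, R2 = 0.936
```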

Hypothesis testing using t-test

We do not only want to fit a line and define it using the linear model; we also want to test if there is a statistically significant correlation between $y$ and $x$. The slope ($b$) of the line indicates whether there is a correlation or not: if it deviates from zero, either positively or negatively, there is a correlation. However, the estimated slope is just an estimate; it depends on the observed values. If the experiment was repeated we would get other observations and slightly different estimates of both the slope and the intercept. Therefore, we need a statistical test of the null hypothesis that the slope does not deviate from zero. That means calculating the probability that the estimated slope comes from a distribution of possible slopes with mean zero at a certain number of degrees of freedom (here $n-2$). According to the central limit theorem, the distribution of a large number of estimates of a parameter takes the form of a normal distribution. So, this is the same principle as the test we use to test for a difference between means. We simply use the t-test:

$t = \frac{b - 0}{SE_b}$

The difference between means is here substituted by the slope. As in the t-test for means, we want to calculate the number ($t$) of standard errors ($SE_b$) that the parameter estimate ($b$) lies from the population mean of zero. If the probability that the estimated slope belongs to a population of slopes with mean zero is less than 0.05, the null hypothesis is rejected. Go to the t-test section for more details on the principle of the t-test. Also see the Linear Regression overview section for the equation for $SE_b$. For our fish example, the output of `summary(m)` gives the estimates, their standard errors, t values and p values.

For the hypothesis testing, only the t value for the slope, $b$, is interesting. Here it is quite large, with a p value far below 0.01. This means that the probability that the estimated slope belongs to a population of slopes with mean zero is less than 0.01. Therefore, we reject the null hypothesis of $b = 0$.
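The t value reported by `summary(m)` can be reproduced by hand. This is a sketch that assumes $SE_b = \sqrt{MS_{res}/SS_x}$, the standard formula for the standard error of the slope:

```r
# t-test of the slope, computed by hand for the fish data
y <- c(0.18,0.22,0.21,0.38,0.46,0.5,0.58,0.8,0.8)
x <- c(26,27,28,32,36,37,36.5,40,41)
m <- lm(y ~ x)
n <- length(x)

b     <- coef(m)[["x"]]
MSres <- sum(residuals(m)^2) / (n - 2)       # residual variance
SEb   <- sqrt(MSres / sum((x - mean(x))^2))  # standard error of the slope
t     <- b / SEb                             # SEs that b lies from zero
p     <- 2 * pt(-abs(t), df = n - 2)         # two-sided p value

c(t = t, p = p)  # matches the row for x in summary(m)
```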

Hypothesis testing using ANOVA

The slope of the regression line can also be investigated using analysis of variance. Here we use the sums of squares and their associated degrees of freedom to calculate the explained ($MS_{reg}$) and unexplained ($MS_{res}$) variance:

$MS_{reg} = \frac{SS_{reg}}{1}, \quad MS_{res} = \frac{SS_{res}}{n-2}$

Now we want to test the null hypothesis that the explained variance ($MS_{reg}$) is equal to or smaller than the unexplained variance ($MS_{res}$). To do this we simply take the ratio of the two, which follows the F-distribution:

$F = \frac{MS_{reg}}{MS_{res}}$

If there is no correlation, then $SS_{reg} = 0$ since $SS_T = SS_{res}$: all variation is taken up by the unexplained variation, and therefore $MS_{reg} = 0$. So when there is no correlation the F-value will be less than 1. If there is a correlation, $MS_{reg} > MS_{res}$ and the F-value will be greater than 1. That means that we can only reject the null hypothesis of no correlation when F > 1, and we retain the null hypothesis when F ≤ 1. The test is therefore one-sided: we do not always reject the null hypothesis when the variances differ, only when the variance for the regression is greater than the variance for the residuals.

The F-distribution is defined by the degrees of freedom for the numerator (V1) and the denominator (V2). The degrees of freedom for the regression (V1) is always 1, and the degrees of freedom for the residuals (V2) is $n-2$, where $n$ is the number of observed points. Putting everything together, we get an ANOVA table:

The p value is given by the F-distribution. Let's see how this looks for the fish example above. Remember that we have already calculated the sums of squares.

The explained variance is roughly 100 times greater than the unexplained variance, which is practically zero. Therefore, the probability that the observed F-value belongs to a population of F-values where the null hypothesis is true is less than 0.01. This is illustrated below:

This is the F-distribution with degrees of freedom 1 (for the regression) and 7 (for the residuals), i.e. the values of F you may get when the null hypothesis is true. The shaded area shows the values of F that take up 5 percent (i.e. p = 0.05) of the population of F. If your estimated F-value goes beyond the critical value of 5.591, the probability that it belongs to this distribution when the null hypothesis is true is less than 0.05, and you reject the null hypothesis. If your value lies within the white area, the null hypothesis is retained. Our estimated F-value from the fish example goes far beyond the critical value. We can therefore reject the null hypothesis of no correlation between weight and length. This means that the slope significantly deviates from zero.
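The F-test can also be reproduced by hand; the critical value and p value come from the F-distribution functions `qf()` and `pf()` with 1 and $n-2$ degrees of freedom (a sketch, equivalent to `anova(m)`):

```r
# F-test of the regression, computed by hand for the fish data
y <- c(0.18,0.22,0.21,0.38,0.46,0.5,0.58,0.8,0.8)
x <- c(26,27,28,32,36,37,36.5,40,41)
m <- lm(y ~ x)
n <- length(x)

SST   <- sum((y - mean(y))^2)
SSres <- sum(residuals(m)^2)
MSreg <- (SST - SSres) / 1        # explained variance, df = 1
MSres <- SSres / (n - 2)          # unexplained variance, df = n - 2
Fval  <- MSreg / MSres

qf(0.95, df1 = 1, df2 = n - 2)                      # critical value, about 5.59
pf(Fval, df1 = 1, df2 = n - 2, lower.tail = FALSE)  # p value, as in anova(m)
```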

Confidence intervals

The intercept and slope are estimates. If observations were collected a million times, you would get a distribution of different estimated values. Since the parameters affect $y$, the fitted values ($\widehat{y}$) would differ from time to time. They would, according to the central limit theorem, constitute a normal distribution where the mean is the true $\widehat{y}$. As for the confidence interval of any mean, the true value falls within the interval $\widehat{y} \pm t \times SE$ with a probability determined by the level of significance (usually 0.05 or 0.01). The t-value for a specific level of significance is found in a t-table. But how to calculate SE? The equation looks like this:

$SE_{\widehat{y}} = \sqrt{MS_{res}\left(\frac{1}{n} + \frac{(x - \overline{x})^2}{SS_x}\right)}$

where $MS_{res}$ is the mean square for the residuals (the residual variance), $n$ is the number of points, $x$ is a value of the independent variable, $\overline{x}$ is the mean of the independent variable and $SS_x$ is the sum of squares of $x$.

$SS_x$ is calculated in the same way as for $y$:

$SS_x = \sum{(x - \overline{x})^2}$

Using these equations, the SE can be calculated for every $x$ value in our fish example.

However, to construct the confidence zone, we need to calculate the SE for all values along the regression line. The t-value that represents the significance level 0.05 is 2.365 at df = 7.

Now calculate the upper and lower 95 % confidence limits for the regression line by $\widehat{y}_i \pm 2.365 \times SE_i$. Then we get something like this:
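The confidence limits produced by `predict()` in the example code can be reproduced from these equations; the following sketch computes $SE_{\widehat{y}}$ and the 95 % limits by hand:

```r
# 95 % confidence limits for the regression line, computed by hand
y <- c(0.18,0.22,0.21,0.38,0.46,0.5,0.58,0.8,0.8)
x <- c(26,27,28,32,36,37,36.5,40,41)
m <- lm(y ~ x)
n <- length(x)

MSres <- sum(residuals(m)^2) / (n - 2)  # residual variance
SSx   <- sum((x - mean(x))^2)           # sum of squares of x

xp   <- seq(min(x), max(x))
SEy  <- sqrt(MSres * (1 / n + (xp - mean(x))^2 / SSx))
yhat <- as.numeric(predict(m, newdata = data.frame(x = xp)))

tcrit <- qt(0.975, df = n - 2)  # 2.365 at df = 7
upper <- yhat + tcrit * SEy
lower <- yhat - tcrit * SEy
# identical to predict(m, ..., interval = "confidence", level = 0.95)
```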

Checking the assumptions

Assumption of normality

The y-values that can be observed at each x, whose mean is the fitted value on the regression line, need to be normally distributed. Otherwise, you will get biased estimates of the intercept and slope as well as of the standard error. The hypothesis testing could also be affected.

How to check for normality, then? Use a normal probability (Q-Q) plot. This plot compares the standardized residuals of the regression with the residuals you would expect if they came from a normal distribution. If the residuals are perfectly normal, the points will fall on the line:

There seems to be quite a good linear relationship between the observed and theoretical residuals. However, two residuals at the lower end depart clearly from the line, which indicates some departure from normality.

How can we fix this? We could try transforming the lengths (x-axis). Transforming one or both variables makes the observations more homogeneous and deals with the skewness and kurtosis of the data. A common transformation is to take the natural log of the independent variable.
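As a sketch of this remedy (the choice of transformation is a judgment call, and this refit is not part of the original analysis), the model can be refitted with log-transformed lengths and the Q-Q plot redrawn:

```r
# Refit with the natural log of length and re-check normality
y <- c(0.18,0.22,0.21,0.38,0.46,0.5,0.58,0.8,0.8)
x <- c(26,27,28,32,36,37,36.5,40,41)

m2 <- lm(y ~ log(x))   # natural log of the independent variable
st.res2 <- rstandard(m2)
qqnorm(st.res2, ylab = "Standardized Residuals", xlab = "Theoretical",
       las = 1, bty = "l")
qqline(st.res2)
```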

Equal variances

If the variances of the normal distributions of y at each x are not equal, the standard error will be biased, as well as the parameter estimates in the linear model.

This assumption can be checked by plotting the residuals against the values predicted by the regression:

You do not want any trend or correlation between the residuals and the predicted values; if the variances are equal, you will get an even scatter around zero. In this case, it seems like the residuals are larger at the lower and higher values of y, which could indicate that the variances are unequal at the ends compared to the middle of the distribution of the fitted values.

As when fixing the normality problem, the data can be log-transformed to obtain equal variances.

Errors need to be independent

All observations need to be independent of each other; the value of observation 1 should not depend on observation 2. Dependence is often a problem in time series and spatial data.

You can check this assumption with a lag plot, where each residual is plotted against the next residual: residual no. 1 is plotted against residual no. 2, and so on. This gives us n-1 points, since there is no residual after the last one. If there is no correlation among the residuals (autocorrelation), the points are scattered randomly within the plot. For the fish data, the lag plot looks like this:
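The example code above checks independence with `acf()`; the lag plot described here is not part of that code, but it can be drawn with a few lines (a sketch):

```r
# Lag plot: each residual plotted against the next one
y <- c(0.18,0.22,0.21,0.38,0.46,0.5,0.58,0.8,0.8)
x <- c(26,27,28,32,36,37,36.5,40,41)
m <- lm(y ~ x)

res <- residuals(m)
plot(res[-length(res)], res[-1],               # n - 1 points
     xlab = "Residual i", ylab = "Residual i + 1",
     las = 1, bty = "l", main = "Lag plot")
abline(h = 0, v = 0, lty = 3)
```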

The points look quite random, but more points are located in the upper right and lower left quadrants than in the others, which could indicate a correlation.

You could also use an autocorrelation plot, which plots the correlation between the current and lagged residuals:

The blue lines represent critical limits; if the ACF goes beyond these, there is a significant correlation. Not surprisingly, the residuals at lag zero are perfectly correlated with themselves. I think this plot is better than the lag plot since it leaves less room for subjective interpretation, and it clearly shows that there is no autocorrelation in the data.

Assumption of linearity

Since the regression describes a linear pattern, $y$ and $x$ should display a linear relationship. If they do not, the predictions will be wrong.

You can check this assumption by plotting the observed values against the predicted ones. The points should fall on the 1:1 line if there is a perfect linear relationship between the variables. This is how it looks for the fish weight and length data:
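This observed-versus-predicted plot is not in the example code above; a minimal sketch:

```r
# Observed vs predicted values, with a 1:1 reference line
y <- c(0.18,0.22,0.21,0.38,0.46,0.5,0.58,0.8,0.8)
x <- c(26,27,28,32,36,37,36.5,40,41)
m <- lm(y ~ x)

plot(fitted(m), y, xlab = "Predicted", ylab = "Observed",
     las = 1, bty = "l")
abline(0, 1, lty = 2)   # points on this line = perfect linear relationship
```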

The points deviate to some degree from the line, indicating that there could be some non-linearity, but in general it looks good and we can conclude that the assumption is met.