Linear regression

Linear regression is used when you want to test for a correlation between a dependent variable and an independent variable. The equation or model that results from the linear regression can be used to predict values of the dependent variable from values of the independent variable. Linear regression is a special case of a generalized linear model using the identity link function.
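As a quick illustration of that last point, here is a minimal R sketch (using the fish data from the example further down) showing that lm() and a Gaussian GLM with an identity link give the same coefficient estimates:

    #Fish data from the example below
    y<-c(0.18,0.22,0.21,0.38,0.46,0.5,0.58,0.8,0.8) #Weights (g)
    x<-c(26,27,28,32,36,37,36.5,40,41)              #Lengths (mm)

    coef(lm(y~x))                                   #Ordinary linear regression
    coef(glm(y~x,family=gaussian(link="identity"))) #Same model fitted as a GLM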

You need to check the following assumptions before proceeding with a linear regression:

  1. The values of x are fixed, i.e. measured without error
  2. The observations are independent
  3. The observations of y at each x have the same variance (s_{x_1}^2 = s_{x_2}^2 = \dots = s_{x_n}^2)
  4. The observations of y at each x are normally distributed
  5. There is a linear relationship between x and y

The linear regression follows the simple equation of a straight line:

y = a + bx

where y is the dependent variable, a is the intercept, b is the slope (regression coefficient) and x is the independent variable.

The goal of a linear regression is to:

  1. Estimate the values of a and b so that we can describe the correlation between y and x.
  2. Test if there is a statistically significant correlation between y and x. That means that we test the null hypothesis that the line has no slope, i.e. b = 0.

These two objectives can easily be accomplished using R or any other statistical software, including Excel. The output usually looks something like this:

Here we get the estimates of a and b as well as their standard errors. The null hypothesis is tested using a t-test, which tells us if b deviates from zero. The associated p value equals the probability that the estimated b belongs to a population of b's that could be estimated if the true b = 0. H0 can be rejected if the p value associated with b is less than the pre-selected significance level, e.g. 0.01 or 0.05.

Example

Is there a correlation between the following lengths and weights of fish?

Here is how to do it in R

#1.	The data
    y<-c(0.18,0.22,0.21,0.38,0.46,0.5,0.58,0.8,0.8) #Weights (g)
    x<-c(26,27,28,32,36,37,36.5,40,41)              #Lengths (mm)

#2.	Specify the model. Performs a linear regression using y and x
    m<-lm(y~x)         

#3.	Hypothesis testing using t-test
    summary(m)  
    #Outputs a summary of the linear model (m) including the coefficients, 
    #their SE, t-value and P. The row for x (the slope) is the one of interest for
    #testing the null hypothesis of b = 0.
  
#4.	Hypothesis testing using anova
    anova(m)  
    #Outputs an anova, which tests if there is a correlation between y and x. 
    #The null hypothesis is rejected at P<0.05.
    
#5.	Plotting the regression
    #Creates a plot with xy points and a regression line based on the model:
    plot(x,y,las=1,xlab="Length (mm)",ylab="wet weight (g)",bty="l",type="n",
        xaxt="n",cex.lab=1.3)
    axis(side=1,at=seq(26,41,2),labels=seq(26,41,2))
    points(x,y)
    abline(coef(m)[1], coef(m)[2])
    
    #Creates confidence limits
    xp<-seq(min(x),max(x))
    p<-predict(m,newdata=data.frame(x=xp),interval = c("confidence"),
        level = 0.95,type="response")
    lines(xp,p[,2],col="red",lty=2)
    lines(xp,p[,3],col="red",lty=2)

#6.	Checking the assumptions
    #Creates a Q-Q plot to check the assumptions of normality:
    st.res<-rstandard(m)
    qqnorm(st.res,ylab="Standardized Residuals",xlab="Theoretical",las=1,bty="l") 
    qqline(st.res)
  
    #Creates a residual plot to check homogeneity of variances and linearity between variables:
    plot(m$fitted.values,m$residuals,xlab="Predicted",ylab="Residuals",las=1,bty="l",main="Residual plot")
    abline(h=0,lty=3)
    
    #Creates an autocorrelation plot to test for independence between residuals:
    acf(m$residuals,type = "correlation",main="Autocorrelation plot",plot = TRUE,las=1,bty="l")
    

Linear regression in depth

Here I go through the linear regression in more depth. This section contains the following:

  • Fitting and describing a regression line
  • Confidence intervals
  • Variation components
  • Hypothesis testing using t-test
  • Hypothesis testing using ANOVA
  • Checking the assumptions


Fitting and describing a regression line

Now, we will go through how to fit a regression line using ordinary least squares, and how to mathematically describe the line.

Let's consider the correlation between the length (mm) and wet weight (g) of the fish from the example above:

In a regression, we want to fit a line that best describes the correlation between the variables, in this case wet weight and length. This is achieved by using ordinary least squares. That means that we fit the line that minimizes the squared residuals, i.e. the squared distances between each point and the regression line:

This line can be described using a linear model, which basically is the same as the equation for a straight line:

y = a + bx

where y is the value of the dependent variable, a is the intercept, b is the slope of the regression line and x is the value of the independent variable.

The intercept is the starting value, the value of y when x = 0. That is where the line crosses the y-axis. The slope is the change in y for a one-unit increase in x.

To calculate the intercept, we must first calculate the slope:

b = \frac{n\sum{xy} - \sum{x}\sum{y}}{n\sum{x^2} - (\sum{x})^2}

where n is the number of points (x,y coordinates), x is the independent variable and y is the dependent variable.

For our example we get:

b = \frac{9 \times 149.69 - 303.5 \times 4.13}{9 \times 10491.25 - 303.5^2} = 0.04

As a rough check, you could also take the change in y over the change in x between the first and the last point in the plot:

b = \frac{y_n - y_1}{x_n - x_1} = \frac{0.80 - 0.18}{41 - 26} = 0.04

The intercept a is calculated by the following equation:

a = \overline{y} - b\overline{x}

where \overline{y} is the mean of all y values, b is the slope and \overline{x} is the mean of all x values.

For our example we get:

a = 0.459 - 0.04 \times 33.72 = -0.90
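These hand calculations can also be checked in R. A minimal sketch, assuming the vectors x and y from the example data above:

    n<-length(x)
    b<-(n*sum(x*y)-sum(x)*sum(y))/(n*sum(x^2)-sum(x)^2) #Slope
    a<-mean(y)-b*mean(x)                                #Intercept
    c(a=a,b=b)                                          #Compare with coef(lm(y~x))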

Putting everything together, the model looks like this:

  wet weight = -0.90 + 0.04 \times Length

This equation can now be used to predict the wet weight for any length. However, the model should only be used within the interval of lengths it was built upon; that is, you should not extrapolate.
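For example, here is a small sketch of predicting the wet weight of a fish of 30 mm, a length within the observed range (30 mm is just an illustrative value), using the model m fitted earlier:

    predict(m,newdata=data.frame(x=30))  #Predicted wet weight (g) at a length of 30 mm
    coef(m)[1]+coef(m)[2]*30             #The same prediction calculated by hand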

But what does the fitted line actually represent? Due to both genes and environment, fish of the same length do not all have the same weight. There is some variation in wet weight at each length. This can be illustrated by adding some points (fish) to the plot:

What we can see here is that the regression line runs through the mean of most of the distributions of wet weight for each length. That means that the regression estimates the mean of all possible values of y for any given x.

Now, let's examine a case where there is no correlation between x and y:

When there is no correlation, the regression line simply estimates the mean of y. The value of the line is the same for all x. This means that the slope is zero and that y = a (the intercept). It also means that x has no effect on y.

Variation components

In the case above with no correlation, you can easily see that the total variation in the regression is the same as the variation in y. We can make it clearer by adding the residuals:

Each squared residual (red lines), i.e. the squared distance between a point and the regression line, is expressed as:

e^2 = (y - \widehat{y})^2

where y is the observed value of the dependent variable and \widehat{y} is the fitted value (the value on the regression line) of the dependent variable.

In this case, the fitted value \widehat{y} is the same as the mean of y, \overline{y}. The squared residual can then be expressed as:

e^2 = (y - \overline{y})^2

When summing all the squared distances we get the residual sum of squares:

SS_{res} = \sum{(y - \widehat{y})^2}

Does this look familiar? Yes, it is the numerator in the equation for the variance. In the case of no correlation, the numerator of the variance (the sum of squares of y) is the same as the residual sum of squares (SS_{res}). So, if there is no correlation:

SS_{res} = \sum{(y - \overline{y})^2} = SS_y

The sum of squares of y (SS_y) is also called the total sum of squares, SS_T, since it simply incorporates the total variation of the dependent variable. The residual sum of squares is a measure of how much unexplained variation there is in the regression. In the case of no correlation, SS_{res} = SS_T. In other words, there is 100 % unexplained variation.

Now let's examine our fish case where there is a correlation. In all cases, we are dealing with a total amount of variation (SS_T), the variation in y that is to be explained by the variation in x:

The red dotted lines represent the squared deviations of each point from the mean, which when summed become the total sum of squares:

SS_y = SS_T = \sum{(y - \overline{y})^2} = 0.452

And now the regression line is added to the plot:

The solid red lines represent the squared residuals:

SS_{res} = \sum{(y-\widehat{y})^2} = 0.029

The proportion of unexplained variation can be calculated as:

Prop. unexplained SS = \frac{SS_{res}}{SS_T} = \frac{0.029}{0.452} = 0.06

We have one variation component left: the sum of squares explained by the regression. This component is simply calculated as:

SS_{reg} = SS_T - SS_{res} = 0.452 - 0.029 = 0.423

The proportion of explained variation is thus:

Prop. explained SS = \frac{SS_{reg}}{SS_T} = \frac{0.423}{0.452} = 0.94

The proportion of explained variation is the same as the coefficient of determination, r^2.

The table below shows how I calculated the sum of squares step by step:
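Instead of doing everything by hand, the sums of squares can also be reproduced in R. A minimal sketch, assuming the model m fitted above:

    SS_T<-sum((y-mean(y))^2)          #Total sum of squares
    SS_res<-sum(residuals(m)^2)       #Residual (unexplained) sum of squares
    SS_reg<-SS_T-SS_res               #Sum of squares explained by the regression
    c(SS_T=SS_T,SS_res=SS_res,SS_reg=SS_reg,r2=SS_reg/SS_T)
    summary(m)$r.squared              #Should match r2 above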

Hypothesis testing using t-test

We do not only want to fit a line and describe it using the linear model. We also want to test if there is a statistically significant correlation between x and y. The slope (b) of the line indicates whether there is a correlation or not. If it deviates from zero, either positively or negatively, there is a correlation. However, the estimated slope is just an estimate. It depends on the observed values. If the experiment were repeated we would get other observations and slightly different estimates of both the slope and the intercept. Therefore, we need a statistical test of the null hypothesis that the slope does not deviate from zero. That means calculating the probability that the estimated slope comes from a distribution of possible slopes with mean zero at a certain number of degrees of freedom (n - 2). According to the central limit theorem, the distribution of a large number of estimates of a parameter takes the form of a normal distribution. So, this is the same principle as the test we use to test for a difference between means. We simply use the t-test:

t = \frac{b - 0}{SE_b}

The difference between means is here replaced by the slope. As in the t-test for means, we want to calculate the number (t) of standard errors (SE_b) that the parameter estimate (b) lies from the population mean of zero. If the probability that the estimated slope belongs to a population of slopes with mean zero is less than 0.05, the null hypothesis is rejected. Go to the t-test section for more details on the principle of the t-test. Also see the Linear Regression overview section for the equation for SE_b. Here is what the output looks like for our fish example above:

For the hypothesis testing, only the t value for the slope, b, is interesting. Here, it is quite large, with a p value far below 0.01. This means that the probability that the estimated slope belongs to a population of slopes with mean zero is less than 0.01. Therefore, we reject the null hypothesis of b = 0.
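The t-value and p-value can also be reproduced by hand in R. This is a sketch assuming the usual formula SE_b = \sqrt{MS_{res}/SS_x} for the standard error of the slope:

    n<-length(x)
    MS_res<-sum(residuals(m)^2)/(n-2)       #Residual variance
    SS_x<-sum((x-mean(x))^2)                #Sum of squares of x
    SE_b<-sqrt(MS_res/SS_x)                 #Standard error of the slope
    t_b<-coef(m)[2]/SE_b                    #t-value for the slope
    2*pt(-abs(t_b),df=n-2)                  #Two-sided p-value; compare with summary(m)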

Hypothesis testing using ANOVA

The slope of the regression line can also be investigated using analysis of variance (ANOVA). Here we use the sums of squares and their associated degrees of freedom to calculate the explained (MS_{reg}) and unexplained (MS_{res}) variance:

MS_{reg} = \frac{SS_{reg}}{df_{reg}} = \frac{SS_{reg}}{1}

MS_{res} = \frac{SS_{res}}{df_{res}} = \frac{SS_{res}}{n - 2}

Now we want to test the null hypothesis that the explained variance (MS_{reg}) is equal to or smaller than the unexplained variance (MS_{res}). To do this, we simply take the ratio of the two, which follows the F-distribution:

F = \frac{MS_{reg}}{MS_{res}}

If there is no correlation, then SS_{reg} = 0 since SS_T = SS_{res}. All variation is taken up by the unexplained variation, and therefore MS_{reg} = 0. So when there is no correlation, the F-value will be less than 1. If there is a correlation, MS_{reg} > MS_{res} and the F-value will be greater than 1. That means that we only consider rejecting the null hypothesis of no correlation when F > 1, and retain it when F ≤ 1; the test is therefore one-sided. We do not always reject the null hypothesis when the variances differ, only when the variance for the regression is greater than the variance for the residuals.

The F-distribution is defined by the degrees of freedom of the numerator (v1) and the denominator (v2). The degrees of freedom for the regression (v1) is always 1, and the degrees of freedom for the residuals (v2) is n - 2, where n is the number of observed points. When putting everything together we get an ANOVA table:

The p-value is given by the F-distribution. Let's look at how this works with the fish example above. Remember that we have already calculated the sums of squares.

The explained variance is about 105 times greater than the unexplained variance, which is practically zero. Therefore, the probability that the observed F-value belongs to a population of F-values where the null hypothesis is true is less than 0.01. This is illustrated below:

This is the F-distribution with 1 degree of freedom for the regression and 7 for the residuals, i.e. the values of F you may get when the null hypothesis is true (when the true F ≤ 1). The shaded area shows the values of F that take up 5 percent (i.e. p = 0.05) of the population of F. If your estimated F-value goes beyond the critical value of 5.591, the probability that it belongs to this distribution when the null hypothesis is true is less than 0.05, and you reject the null hypothesis. If your value lies within the white area, the null hypothesis is retained. Our estimated F-value from the fish example goes far beyond the critical value. We can therefore reject the null hypothesis of no correlation between weight and length. This means that the slope significantly deviates from zero.
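Here is a minimal R sketch of the same F-test calculations, assuming the sums of squares (SS_reg and SS_res) computed earlier:

    MS_reg<-SS_reg/1                                 #Explained variance (df = 1)
    MS_res<-SS_res/(length(x)-2)                     #Unexplained variance (df = n - 2)
    F_val<-MS_reg/MS_res                             #F-ratio
    qf(0.95,df1=1,df2=length(x)-2)                   #Critical value, 5.591 at df 1 and 7
    pf(F_val,df1=1,df2=length(x)-2,lower.tail=FALSE) #p-value; compare with anova(m)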


Confidence intervals

The intercept and slope are estimates. If observations were collected a million times, you would get a distribution of different estimated values. Since the parameter estimates affect the fitted values (\widehat{y}), these would also differ from time to time. They would, according to the central limit theorem, constitute a normal distribution where the mean is the true \widehat{y}. As for the confidence interval of any mean, the true value falls within the interval \widehat{y} \pm t \times SE with a probability determined by the level of significance (usually 0.05 or 0.01). The t-value for a specific level of significance is found in a t-table. But how do we calculate SE? The equation looks like this:

SE_{\widehat{y}} = \sqrt{MS_{res}\left(\frac{1}{n} + \frac{(x - \overline{x})^2}{SS_x}\right)}

where MS_{res} is the mean square for the residuals (the residual variance), n is the number of points, x is a value of the independent variable, \overline{x} is the mean of the independent variable and SS_x is the sum of squares of x.

SS_x is calculated in the same way as for y:

SS_x = \sum{(x - \overline{x})^2}

Using these equations, the SE for all fitted values in our fish example is:

However, to construct the confidence zone, we need to calculate the SE for all values along the regression line. The t-value that represents the significance level 0.05 is 2.365 at df = 7.

Now calculate the upper and lower 95 % confidence limits for the regression line as \widehat{y}_i \pm 2.365 \times SE_i. Then we get something like this:
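Here is a sketch of how the confidence band can be computed by hand in R and compared with predict():

    n<-length(x)
    MS_res<-sum(residuals(m)^2)/(n-2)               #Residual variance
    SS_x<-sum((x-mean(x))^2)                        #Sum of squares of x
    xp<-seq(min(x),max(x),by=0.5)                   #Lengths along the regression line
    y_hat<-coef(m)[1]+coef(m)[2]*xp                 #Fitted values
    SE_fit<-sqrt(MS_res*(1/n+(xp-mean(x))^2/SS_x))  #SE of the fitted values
    t_crit<-qt(0.975,df=n-2)                        #2.365 at df = 7
    cbind(lower=y_hat-t_crit*SE_fit,upper=y_hat+t_crit*SE_fit)
    #Compare with: predict(m,newdata=data.frame(x=xp),interval="confidence")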

Checking the assumptions

Assumption of normality

The y-values that can be observed at each x, with the fitted value from the regression line as their mean, need to be normally distributed. Otherwise, you will get biased estimates of the intercept and slope as well as of the standard errors. The hypothesis testing could also be affected.

How do we check for normality then? Use a normal probability (Q-Q) plot. This plot compares the standardized residuals from the regression with the residuals you would expect if they came from a normal distribution. If the residuals are perfectly normal, the points will fall on the line:

There seems to be quite a good linear relationship between the observed and theoretical residuals. However, two residuals in the lower end clearly depart from the line, which indicates some departure from normality.

How can we fix this? We could try to transform the lengths (x-axis). Transforming one or both variables makes the observations more homogeneous and deals with the skewness and kurtosis of the data. A common transformation is to take the natural log of the independent variable.
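A minimal sketch of what such a transformation could look like in R; whether it actually helps should be judged from the new Q-Q plot:

    m.log<-lm(y~log(x))               #Refit the model with log-transformed lengths
    st.res.log<-rstandard(m.log)
    qqnorm(st.res.log,ylab="Standardized Residuals",xlab="Theoretical",las=1)
    qqline(st.res.log)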

Equal variances

If the variances of the normal distributions of y at each x are not equal, the standard errors will be biased, as will the parameter estimates in the linear model.

This assumption can be checked by plotting the residuals against the values predicted by the regression:

You do not want any trend or correlation between the residuals and the predicted values. If the variances are equal you will get an even scatter around zero. In this case, it looks like the residuals are larger at low and high fitted values, which could indicate that the variances are unequal at the ends compared to the middle of the distribution of fitted values.

As when fixing the normality problem, the data can be log transformed to get equal variances.

Errors need to be independent

All observations need to be independent of each other. Thus, the value of observation 1 should not depend on the value of observation 2. Dependence is often an issue in time series and spatial data.

You can check this assumption with a lag plot, where each residual is plotted against the next residual: residual no. 1 is plotted against residual no. 2, and so on. This gives n - 1 points in the graph, since there is no residual after the last one. If there is no correlation among the residuals (autocorrelation), the points are scattered randomly within the plot. For the fish data, the lag plot looks like this:

The points look quite random, but more points are located in the upper right and lower left quadrants compared to the others, which could indicate a correlation.
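The lag plot itself can be created with a couple of lines in R (a minimal sketch using the residuals of m):

    res<-residuals(m)
    plot(head(res,-1),tail(res,-1),xlab="Residual i",ylab="Residual i+1",
        las=1,bty="l",main="Lag plot")
    abline(h=0,v=0,lty=3)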

You could also use an autocorrelation plot, which plots the correlation between the current and lagged residuals:

The blue lines represent critical limits. If the ACF goes beyond these, there is a significant correlation. Not surprisingly, the residuals at lag 0 are perfectly correlated with themselves. I think this plot is better than the lag plot since it gives less room for subjective interpretation. It clearly shows that there is no autocorrelation in the data.

Assumption of linearity

Since the regression describes a linear pattern, x and y should display a linear relationship. If they do not, the predictions will be wrong.

You can check this assumption by plotting the observed values against the predicted values. The points should fall on the 1:1 line if there is a perfect linear relationship between the variables. This is how it looks for the fish weight and length data:

The points deviate to some degree from the line, indicating that there could be some non-linearity, but in general it looks good and we can conclude that the assumption is met.
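A minimal sketch of this observed-versus-predicted plot in R:

    plot(fitted(m),y,xlab="Predicted wet weight (g)",ylab="Observed wet weight (g)",
        las=1,bty="l")
    abline(0,1,lty=2)  #1:1 line; points on this line indicate a perfect linear relationship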