## Linear regression

Linear regression is used when you want to test for a relationship between a dependent variable and one or more independent variables. The equation or model that results from the linear regression can be used to predict values of the dependent variable from values of the independent variables. Linear regression is a special case of a generalized linear model, using a normal error distribution and the identity link function.

You need to check the following assumptions before proceeding with a linear regression:

1. The values of x are fixed, i.e. measured without error
2. The observations are independent
3. The observations of y at each x have the same variance ($s_{x=1}^2 = s_{x=2}^2 = \dots = s_{x=n}^2$)
4. The observations of y at each x are normally distributed
5. There is a linear relationship between $x$ and $y$

The linear regression follows the simple equation of a straight line:

$y = a + bx$

where $y$ is the dependent variable, $a$ is the intercept, $b$ is the slope (the regression coefficient) and $x$ is the independent variable.

The goal of a linear regression is to:

1. Estimate the values of $a$ and $b$ so that we can describe the correlation between $y$ and $x$.
2. Test if there is a statistically significant correlation between $y$ and $x$. That means that we test the null-hypothesis that the line has no slope, i.e. $b = 0$.

These two objectives can easily be accomplished using R or any other statistical software, including Excel actually. The output usually looks something like this:

Here we get the estimates of $a$ and $b$ as well as their standard errors. The null-hypothesis is tested using a t-test, which tells us if $b$ deviates from zero. The associated p value equals the probability that the estimated $b$ belongs to a population of $b$'s that can be estimated if the true $b = 0$. H0 can be rejected if the p value associated with $b$ is less than the pre-selected significance level, e.g. 0.01 or 0.05.

### Example

Is there a correlation between the following lengths and weights of fish:

Here is how to do it in R

#1.	The data
y<-c(0.18,0.22,0.21,0.38,0.46,0.5,0.58,0.8,0.8) #Weights (g)
x<-c(26,27,28,32,36,37,36.5,40,41)              #Lengths (mm)

#2.	Specify the model. Performs a linear regression using y and x
m<-lm(y~x)

#3.	Hypothesis testing using t-test
summary(m)
#Outputs a summary of the linear model (m) including the coefficients,
#their SE, t-value and P. The row for x is the one of interest for
#testing the null-hypothesis of b=0.

#4.	Hypothesis testing using anova
anova(m)
#Outputs an anova, which tests if there is a correlation between y and x.
#The null hypothesis is rejected at P<0.05.

#5.	Plotting the regression
#Creates a plot with xy points and a regression line based on the model:
plot(x,y,las=1,xlab="Length (mm)",ylab="wet weight (g)",bty="l",type="n",
xaxt="n",cex.lab=1.3)
axis(side=1,at=seq(26,41,2),labels=seq(26,41,2))
abline(coef(m)[1], coef(m)[2])

#Creates confidence limits
xp<-seq(min(x),max(x))
p<-predict(m,newdata=data.frame(x=xp),interval = c("confidence"),
level = 0.95,type="response")
lines(xp,p[,2],col="red",lty=2)
lines(xp,p[,3],col="red",lty=2)

#6.	Checking the assumptions
#Creates a Q-Q plot to check the assumptions of normality:
st.res<-rstandard(m)
qqnorm(st.res,ylab="Standardized Residuals",xlab="Theoretical",las=1,bty="l")
qqline(st.res)

#Creates a residual plot to check homogeneity of variances and linearity between variables:
plot(m$fitted.values,m$residuals,xlab="Predicted",ylab="Residuals",las=1,bty="l",main="Residual plot")
abline(h=0,lty=3)

#Creates an autocorrelation plot to test for independence between residuals:
acf(m$residuals,type = "correlation",main="Autocorrelation plot",plot = TRUE,las=1,bty="l")

### Linear regression in depth

Here I go through the linear regression in more depth. This section contains the following:

• Fitting and describing a regression line
• Variation components
• Hypothesis testing using t-test
• Hypothesis testing using ANOVA
• Confidence intervals
• Checking the assumptions

#### Fitting and describing a regression line

Now, we will go through how to fit a regression line using ordinary least squares, and how to mathematically describe the line.

Let’s consider the correlation between the length (mm) and wet weight (g) of the fishes from the example above:

In a regression, we want to fit the line that best describes the correlation between the variables, in this case wet weight and length. This is achieved by using ordinary least squares. That means that we fit the line that minimizes the sum of the squared residuals, i.e. the squared vertical distances between each point and the regression line.
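To make the least-squares idea concrete, here is a minimal sketch in R using the fish data from the example. The helper `rss()` and the two perturbed lines are my own illustration, not part of the original analysis. The sketch only shows that nudging the fitted line in either direction makes the residual sum of squares larger.

```r
# Fish data from the example above
y <- c(0.18, 0.22, 0.21, 0.38, 0.46, 0.5, 0.58, 0.8, 0.8)  # weights (g)
x <- c(26, 27, 28, 32, 36, 37, 36.5, 40, 41)               # lengths (mm)

# Hypothetical helper: residual sum of squares for a line y = a + b*x
rss <- function(a, b) sum((y - (a + b * x))^2)

# Ordinary least squares fit
m <- lm(y ~ x)
a_hat <- unname(coef(m)[1])
b_hat <- unname(coef(m)[2])

# Residual sum of squares for the fitted line and for two other lines
rss_ols     <- rss(a_hat, b_hat)
rss_steeper <- rss(a_hat, b_hat + 0.005)  # slightly steeper slope
rss_shifted <- rss(a_hat + 0.05, b_hat)   # slightly higher intercept

c(ols = rss_ols, steeper = rss_steeper, shifted = rss_shifted)
```

The least-squares line should always give the smallest of the three values, whichever perturbation you try.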

This line can be described using a linear model, which basically is the same as the equation for a straight line:

$y = a + bx$

where $y$ is the value of the dependent variable, $a$  is the intercept, $b$ is the slope of the regression line and $x$  is the value of the independent variable.

The intercept is the starting value, the value of $y$ when $x = 0$. That is where the line crosses the y-axis. The slope is the change in $y$ for a one-unit increase in $x$.

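To calculate the intercept, we must first calculate the slope:

$b = \frac{n\sum{xy} - \sum{x}\sum{y}}{n\sum{x^2} - (\sum{x})^2}$

where $n$ is the number of points (x,y coordinates), $x$ is the independent variable and $y$ is the dependent variable.

For our example we get:

$b = \frac{9 \times 149.69 - 303.5 \times 4.13}{9 \times 10491.25 - 303.5^2} = \frac{93.755}{2309} = 0.0406 \approx 0.04$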

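As a rough approximation, you could also take the distance between the first and last point in the plot for both $y$ and $x$ according to the following equation:

$b \approx \frac{y_{last} - y_{first}}{x_{last} - x_{first}} = \frac{0.80 - 0.18}{41 - 26} \approx 0.04$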

The intercept $a$  is calculated by the following equation:

$a = \overline{y}- b\overline{x}$

where $\overline{y}$ is the mean of all $y$ values, $b$ is the slope and $\overline{x}$ is the mean of all $x$ values.

For our example we get (using the slope before rounding, $b = 0.0406$):

$a = 0.459 - 0.0406 \times 33.72 = -0.91$

Putting everything together, the model looks like this:

$\text{wet weight} = -0.91 + 0.0406 \times \text{Length}$
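The hand calculation of the slope and intercept can be checked with a few lines of R. This is a minimal sketch that assumes the usual computational form of the least-squares slope, $b = (n\sum{xy} - \sum{x}\sum{y}) / (n\sum{x^2} - (\sum{x})^2)$, and compares the result with `lm()`.

```r
# Fish data from the example above
y <- c(0.18, 0.22, 0.21, 0.38, 0.46, 0.5, 0.58, 0.8, 0.8)  # weights (g)
x <- c(26, 27, 28, 32, 36, 37, 36.5, 40, 41)               # lengths (mm)
n <- length(x)

# Slope from the sums (assumed computational form of the OLS slope)
b <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)

# Intercept from the means and the slope
a <- mean(y) - b * mean(x)

c(intercept = a, slope = b)

# The same estimates from lm() for comparison
coef(lm(y ~ x))
```

Both routes should agree, which is a handy sanity check when you work through the sums in a spreadsheet.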

This equation can now be used to predict the wet weight for a given length. However, the model should only be used within the interval of lengths that it is built upon. That is, you should not try to extrapolate.
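Here is a small sketch of how a prediction could be made in R while guarding against extrapolation. The length of 35 mm and the name `new_length` are only illustrative choices of mine.

```r
# Fish data from the example above
y <- c(0.18, 0.22, 0.21, 0.38, 0.46, 0.5, 0.58, 0.8, 0.8)  # weights (g)
x <- c(26, 27, 28, 32, 36, 37, 36.5, 40, 41)               # lengths (mm)
m <- lm(y ~ x)

# Illustrative length (mm) for which we want a predicted wet weight
new_length <- 35

# Only predict inside the range of lengths the model was built upon
inside_range <- new_length >= min(x) && new_length <= max(x)

pred <- if (inside_range) {
  unname(predict(m, newdata = data.frame(x = new_length)))
} else {
  NA_real_  # refuse to extrapolate
}
pred
```

Returning `NA` outside the observed range is just one possible design; a warning or an error would work equally well.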

But, what does the fitted line actually represent? Due to both genes and environment, the fishes do not have the same weight although they have the same length. There is some variation of the wet weight for each length. This can be illustrated by adding some points (fishes) to the plot:

What we can see here is that the regression line runs through the mean of most of the distributions of wet weight for each length. That means that the regression estimates the mean of all possible values of $y$ for any given $x$.

Now, let’s examine a case where there is no correlation between $x$ and $y$ :

When there is no correlation, the regression line is simply estimating the mean of $y$. The value of the line is the same for all $x$. This means that the slope is zero, and that $y = a$ (the intercept). It also means that $x$ has no effect on $y$.

#### Variation components

In the case above with no correlation, you can easily see that the total variation in the regression is the same as the variation in $y$. We can make it clearer by adding the residuals:

Each squared residual (red lines), which is the squared distance between each point and the regression line, is expressed as:

$e = (y - \widehat{y})^2$

where $y$ is the observed value of the dependent variable and $\widehat{y}$ is the fitted value (value on the regression line) of the dependent variable.

In this case, the fitted value $\widehat{y}$ is the same as the mean of $y$, $\overline{y}$. It can then be expressed as:

$e = (y - \overline{y})^2$

When summing all the squared distances we get the residual sum of squares:

$SS_{res} = \sum{(y - \overline{y})^2}$

Does this look familiar? Yes, it is the numerator in the equation for the variance. In the case of no correlation, the numerator of the variance (the sum of squares of $y$, $SS_y$) is the same as the residual sum of squares ($SS_{res}$). So if there is no correlation:

$SS_y = SS_{res}$

The sum of squares of $y$ ($SS_y$) is also called the total sum of squares, $SS_T$, since it simply incorporates the total variation of the dependent variable. The residual sum of squares is a measure of how much unexplained variation there is in the regression. In the case of no correlation, $SS_{res} = SS_T$. In other words, there is 100 % unexplained variation.
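One way to see this in R is to fit a model with an intercept only, `y ~ 1`, which forces a flat line through the mean of $y$. Using an intercept-only model to stand in for the no-correlation case is my own assumption for illustration. The fish weights are used simply because they are at hand.

```r
# Fish weights from the example above
y <- c(0.18, 0.22, 0.21, 0.38, 0.46, 0.5, 0.58, 0.8, 0.8)  # weights (g)

# Intercept-only model: the fitted line is flat, so the slope is zero
m0 <- lm(y ~ 1)

# The intercept is the mean of y
a0 <- unname(coef(m0)[1])

# Residual sum of squares of the flat line
ss_res0 <- sum(residuals(m0)^2)

# Total sum of squares of y
ss_t <- sum((y - mean(y))^2)

c(intercept = a0, mean_y = mean(y), SSres = ss_res0, SST = ss_t)
```

With a flat line the residual sum of squares and the total sum of squares coincide, so nothing is explained by the model.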

Now let’s examine our fish case where there is a correlation. In all cases, we are dealing with a total amount of variation (SST), the variation in $y$ that is to be explained by the variation in $x$:

The red dotted lines are the squared residuals from each point to the mean, which when summed become the total sum of squares:

$SS_y = SS_T = \sum{(y - \overline{y})^2} = 0.452$

And now the regression line is added to the plot:

The solid red lines are the squared residuals:

$SS_{res} = \sum{(y-\widehat{y})^2} = 0.029$

The proportion of unexplained variation can be calculated as:

$\text{Prop. unexplained SS} = \frac{SS_{res}}{SS_T} = \frac{0.029}{0.452} = 0.06$

We have one variation component left to explain: the sum of squares explained by the regression. This component is simply calculated by:

$SS_{reg} = SS_T - SS_{res} = 0.452 - 0.029 = 0.423$

The proportion of explained variation is thus:

$\text{Prop. explained SS} = \frac{SS_{reg}}{SS_T} = \frac{0.423}{0.452} = 0.94$

The proportion of explained variation is the same as the coefficient of determination, $R^2$.

The table below shows how I calculated the sum of squares step by step:
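The same step-by-step calculation can be sketched in R. The column names in the data frame (`sq_total`, `sq_res`) are my own labels, chosen only for this illustration.

```r
# Fish data from the example above
y <- c(0.18, 0.22, 0.21, 0.38, 0.46, 0.5, 0.58, 0.8, 0.8)  # weights (g)
x <- c(26, 27, 28, 32, 36, 37, 36.5, 40, 41)               # lengths (mm)
m <- lm(y ~ x)
y_hat <- unname(fitted(m))  # values on the regression line

# One row per fish: squared distance to the mean and to the line
steps <- data.frame(x, y, y_hat,
                    sq_total = (y - mean(y))^2,
                    sq_res   = (y - y_hat)^2)
steps

# Variation components
ss_t   <- sum(steps$sq_total)  # total sum of squares
ss_res <- sum(steps$sq_res)    # residual (unexplained) sum of squares
ss_reg <- ss_t - ss_res        # sum of squares explained by the regression

# Proportion of explained variation (coefficient of determination)
r2 <- ss_reg / ss_t
c(SST = ss_t, SSres = ss_res, SSreg = ss_reg, R2 = r2)
```

The value of `r2` can be compared with `summary(m)$r.squared`, which reports the same quantity.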

#### Hypothesis testing using t-test

We do not only want to fit a line and define it using the linear model. We also want to test if there is a statistically significant correlation between $x$ and $y$. The slope ($b$) of the line indicates whether there is a correlation or not. If it deviates from zero, in either the positive or the negative direction, there is a correlation. However, the estimated slope is just an estimate. It depends on the observed values. If the experiment was repeated we would get other observations and slightly different estimates of both the slope and the intercept. Therefore, we need to perform a statistical test of the null hypothesis that the slope does not deviate from zero. That means calculating the probability that the estimated slope comes from a distribution of possible slopes with mean zero at a certain number of degrees of freedom ($n-2$, since two parameters, $a$ and $b$, are estimated). According to the central limit theorem, the distribution of a large number of estimates of a parameter approaches a normal distribution. So, this is the same principle as the test we use to test for a difference between means. We simply use the t-test:

$t = \frac{b - 0}{SE_b} = \frac{b}{SE_b}$

The difference between the means is here substituted for by the slope. As in the t-test for means, we want to calculate the number ($t$) of standard errors ($SE_b$) that the parameter estimate ($b$) lies from the population mean of zero. If the probability that the estimated slope belongs to a population of slopes with mean zero is less than 0.05, the null hypothesis is rejected. Go to the t-test section for more details on the principle of the t-test. Also see the Linear Regression overview section to get the equation for $SE_b$. Here is what the output looks like for our fish example above:

For the hypothesis testing only the t value for the slope, $b$, is interesting. Here, it is quite large with a p value far below 0.01. This means that the probability that the estimated slope belongs to a population with mean zero is less than 0.01. Therefore, we reject the null-hypothesis of $b = 0$.
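The t-test for the slope can also be reproduced by hand in R. This sketch assumes the standard formula $SE_b = \sqrt{MS_{res} / SS_x}$ for the standard error of the slope, which is not given in this section, and $n-2$ degrees of freedom.

```r
# Fish data from the example above
y <- c(0.18, 0.22, 0.21, 0.38, 0.46, 0.5, 0.58, 0.8, 0.8)  # weights (g)
x <- c(26, 27, 28, 32, 36, 37, 36.5, 40, 41)               # lengths (mm)
m <- lm(y ~ x)
n <- length(x)

b      <- unname(coef(m)[2])              # estimated slope
ms_res <- sum(residuals(m)^2) / (n - 2)   # residual variance
ss_x   <- sum((x - mean(x))^2)            # sum of squares of x

# Assumed standard formula for the standard error of the slope
se_b <- sqrt(ms_res / ss_x)

# Number of standard errors the slope lies from zero
t_val <- b / se_b

# Two-sided p value from the t-distribution with n - 2 degrees of freedom
p_val <- 2 * pt(abs(t_val), df = n - 2, lower.tail = FALSE)

c(b = b, SE_b = se_b, t = t_val, p = p_val)

# The same row as reported by summary()
coef(summary(m))["x", ]
```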

#### Hypothesis testing using ANOVA

The slope of the regression line can also be investigated using analysis of variance. Here we use the sums of squares and their associated degrees of freedom to calculate the explained ($MS_{reg}$) and unexplained ($MS_{res}$) variance:

$MS_{reg} = \frac{SS_{reg}}{df_{reg}}$ and $MS_{res} = \frac{SS_{res}}{df_{res}}$

Now, we want to test the null-hypothesis that the explained variance ($MS_{reg}$) is equal to or smaller than the unexplained variance ($MS_{res}$). To do this we simply take the ratio of the two, which follows the F-distribution:

$F = \frac{MS_{reg}}{MS_{res}}$

If there is no correlation then $SS_{reg} = 0$, since $SS_T = SS_{res}$. All variation is taken by the unexplained variation. Therefore $MS_{reg} = 0$. So when there is no correlation the F-value will be less than 1. If there is a correlation, $MS_{reg} > MS_{res}$ and the F-value will be greater than 1. That means that only F-values greater than 1 speak against the null-hypothesis of no correlation. How far above 1 the F-value must be before we reject is given by the critical value of the F-distribution. When F ≤ 1 the null-hypothesis is always retained. The test is therefore one-sided. We do not always reject the null-hypothesis when the variances differ, only when the variance for the regression is sufficiently greater than the variance for the residuals.

The F-distribution is defined by the degrees of freedom for the numerator (V1) and the denominator (V2). In a simple linear regression, the degrees of freedom for the regression (V1) is always 1, and the degrees of freedom for the residuals (V2) is $n-2$, where $n$ is the number of observed points. When putting everything together we get an ANOVA-table:

The p-value is given by the F-distribution. Let’s see what this looks like with the fish example above. Remember that we have already calculated the sum of squares.

The explained variance is about 102 times greater than the unexplained variance, which is practically zero. Therefore, the probability that the observed F-value belongs to a population of F where the null-hypothesis is true is less than 0.01. This is illustrated below:

This is the F-distribution with degrees of freedom 1 (for the regression) and 7 (for the residuals), which shows the values of F you may get when the null-hypothesis is true. That is, when the true F ≤ 1. The shaded area shows the values of F that take up 5 percent (i.e. p=0.05) of the population of F. If your estimated F-value goes beyond the critical value 5.591, the probability that it belongs to this distribution when the null-hypothesis is true is less than 0.05. Then you reject the null-hypothesis. If your value lies within the white area, the null-hypothesis is retained. Our estimated F-value from the fish example goes far beyond the critical value. We can therefore reject the null-hypothesis of no correlation between Weight and Length. This means that the slope significantly deviates from zero.
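The F-test can be worked through by hand in R as a check on `anova()`. This is a minimal sketch for the fish data. The variable names are my own, and the critical value is taken at the 0.05 level with 1 and $n-2$ degrees of freedom.

```r
# Fish data from the example above
y <- c(0.18, 0.22, 0.21, 0.38, 0.46, 0.5, 0.58, 0.8, 0.8)  # weights (g)
x <- c(26, 27, 28, 32, 36, 37, 36.5, 40, 41)               # lengths (mm)
m <- lm(y ~ x)
n <- length(x)

# Sums of squares
ss_t   <- sum((y - mean(y))^2)
ss_res <- sum(residuals(m)^2)
ss_reg <- ss_t - ss_res

# Degrees of freedom and mean squares
df_reg <- 1
df_res <- n - 2
ms_reg <- ss_reg / df_reg  # explained variance
ms_res <- ss_res / df_res  # unexplained variance

# F-value, critical value at the 0.05 level and one-sided p value
f_val  <- ms_reg / ms_res
f_crit <- qf(0.95, df_reg, df_res)
p_val  <- pf(f_val, df_reg, df_res, lower.tail = FALSE)

c(F = f_val, F_crit = f_crit, p = p_val)

# The ANOVA table from R for comparison
anova(m)
```

In a simple linear regression the F-value equals the square of the t-value for the slope, so the two tests give the same p value.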

#### Confidence intervals

The intercept and slope are estimates. If observations were collected a million times, you would get a distribution of different estimate values. Since the parameters affect $y$, the fitted values ($\widehat{y}$) would differ from time to time. They would, according to the central limit theorem, constitute a normal distribution where the mean is the true $\widehat{y}$. As for the confidence interval of any mean, the true value falls within the interval $\widehat{y} \pm t \times SE$ with a probability determined by the level of significance (usually 0.05 or 0.01). The t-value for a specific level of significance is found in a t-table. But how to calculate SE? Well, the equation looks like this:

$SE = \sqrt{MS_{res}\left(\frac{1}{n} + \frac{(x - \overline{x})^2}{SS_x}\right)}$

where $MS_{res}$ is the mean square for the residuals or residual variance, $n$ is the number of points, $x$ is a value of the independent variable, $\overline{x}$ is the mean of the independent variable and $SS_x$ is the sum of squares of $x$.

$SS_x$ is calculated in the same way as for $y$:

$SS_x = \sum{(x - \overline{x})^2}$

Using these equations, the SE for all $x$ values in our fish example is:

However, to construct the confidence zone, we need to calculate the SE for all values along the regression line. The t-value that represents the significance level 0.05 is 2.365 at df = 7.

Now calculate the upper and lower 95 % confidence interval for the regression line by  $\widehat{y}_i \pm 2.365 \times SE_i$. Then we get something like this:
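The confidence limits can be calculated by hand in R and compared with `predict()`. This sketch assumes the standard error formula $SE = \sqrt{MS_{res}(1/n + (x - \overline{x})^2 / SS_x)}$ for a fitted value. The grid `xp` of lengths is the same one used in the plotting example earlier.

```r
# Fish data from the example above
y <- c(0.18, 0.22, 0.21, 0.38, 0.46, 0.5, 0.58, 0.8, 0.8)  # weights (g)
x <- c(26, 27, 28, 32, 36, 37, 36.5, 40, 41)               # lengths (mm)
m <- lm(y ~ x)
n <- length(x)

ms_res <- sum(residuals(m)^2) / (n - 2)  # residual variance
ss_x   <- sum((x - mean(x))^2)           # sum of squares of x

# Lengths along the regression line and their fitted values
xp    <- seq(min(x), max(x))
y_hat <- unname(coef(m)[1] + coef(m)[2] * xp)

# Standard error of each fitted value (assumed standard formula)
se <- sqrt(ms_res * (1 / n + (xp - mean(x))^2 / ss_x))

# t-value for the 0.05 significance level at n - 2 degrees of freedom
t_crit <- qt(0.975, df = n - 2)

# Lower and upper 95 % confidence limits
lower <- y_hat - t_crit * se
upper <- y_hat + t_crit * se

# The same limits from predict() for comparison
p <- predict(m, newdata = data.frame(x = xp), interval = "confidence", level = 0.95)
head(cbind(xp, lower, upper, p))
```

Note that the interval is narrowest at the mean length and widens towards both ends of the line.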

#### Checking the assumptions

##### Assumption of normality

The y-values that can be observed at each x, where the mean is the value fitted by the regression line, need to be normally distributed. Otherwise, the standard errors, and with them the confidence intervals and the hypothesis tests, can be unreliable, especially in small samples.

How to check for normality then? Well, use a normal probability plot. This plot compares the standardized residuals in the regression with the residuals that you would get if they come from a normal distribution. If the residuals are perfectly normal, the points will fall on the line:

There seems to be quite a good linear relationship between the observed and theoretical residuals. However, two residuals in the lower end depart noticeably from the line, which indicates some departure from normality.

How can we fix this? Well, we could try to transform the lengths (x-axis). Transforming one or both variables can make the observations more homogeneous and reduce the skewness and kurtosis of the data. A common transformation is to take the natural log of the independent variable.

##### Equal variances

If the variances of the normal distributions of y at each x are not equal, the standard errors will be biased, and with them the confidence intervals and hypothesis tests for the parameters in the linear model.

This assumption can be checked by plotting the residuals against the predicted values by the regression:

You do not want any trend or correlation between the residuals and predicted values. If variances are equal you will get an even scatter around zero. In this case, it seems like the residuals are getting larger at lower and larger values of y, which could indicate that variances are unequal at the ends compared to the middle of the distribution of the fitted values.

As when fixing the normality problem, the data can be log transformed to get equal variances.
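As a rough sketch of what such a transformation could look like in R, the model can be refitted with the natural log of the lengths and the two residual plots compared side by side. Transforming only $x$ is an assumption on my part. Logging $y$, or both variables, are other options, and whether any of them helps has to be judged from the plots.

```r
# Fish data from the example above
y <- c(0.18, 0.22, 0.21, 0.38, 0.46, 0.5, 0.58, 0.8, 0.8)  # weights (g)
x <- c(26, 27, 28, 32, 36, 37, 36.5, 40, 41)               # lengths (mm)

# Original model and a model with log-transformed lengths
m     <- lm(y ~ x)
m_log <- lm(y ~ log(x))  # log() is the natural log in R

# Residual plots side by side for a visual comparison
op <- par(mfrow = c(1, 2))
plot(fitted(m), residuals(m), xlab = "Predicted", ylab = "Residuals",
     las = 1, bty = "l", main = "Untransformed")
abline(h = 0, lty = 3)
plot(fitted(m_log), residuals(m_log), xlab = "Predicted", ylab = "Residuals",
     las = 1, bty = "l", main = "log(Length)")
abline(h = 0, lty = 3)
par(op)  # restore the previous plot layout
```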

##### Errors need to be independent

All observations need to be independent of each other. Thus the value of observation 1 should not depend on observation 2. Dependence between observations is often the case in time series and spatial data.

You can check this assumption by a lag plot where each residual is plotted against the next residual. So, residual nr 1 is plotted against residual nr 2 etc. This provides us with n-1 points in this graph since there can be no residual after the last one. If there is no correlation among the residuals (autocorrelation), the points are scattered randomly within the plot. For the fish data, the lag plot looks like this:

The points look quite random, but more points are located in the upper right and lower left quadrants compared to the others, which could indicate a positive correlation.
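A lag plot like the one described can be sketched in R as follows. The names `res_current` and `res_next` are my own, and the residuals are taken in the order the observations are listed in the example.

```r
# Fish data from the example above
y <- c(0.18, 0.22, 0.21, 0.38, 0.46, 0.5, 0.58, 0.8, 0.8)  # weights (g)
x <- c(26, 27, 28, 32, 36, 37, 36.5, 40, 41)               # lengths (mm)
m <- lm(y ~ x)

res <- unname(residuals(m))
n   <- length(res)

# Pair each residual with the next one, which gives n - 1 points
res_current <- res[-n]  # residuals 1 to n - 1
res_next    <- res[-1]  # residuals 2 to n

plot(res_current, res_next, xlab = "Residual i", ylab = "Residual i + 1",
     las = 1, bty = "l", main = "Lag plot")
abline(h = 0, v = 0, lty = 3)  # divide the plot into four quadrants

# Correlation between each residual and the next one
lag1_cor <- cor(res_current, res_next)
lag1_cor
```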

You could also use an autocorrelation plot, which plots the correlation between the current and lagged residuals:

The blue lines represent critical limits. If the ACF goes beyond these, there is a significant correlation. No surprise, at lag 0 each residual is perfectly correlated with itself. I think this plot is better than the lag plot since it gives less room for subjective interpretations. This plot clearly shows that there is no significant autocorrelation in the data.

##### Assumption of linearity

Since the regression describes a linear pattern, $x$ and $y$ should display a linear relationship. If they do not, the predictions will be wrong.

You can check this assumption by plotting the observed values against the predicted. The points should be on the line if there is a perfect linear relationship between the variables. This is how it looks for the fish weights and lengths data:

The points deviate to some degree from the line, indicating there could be some non-linearity, but in general it looks good and we can conclude that the assumption is met.
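A plot of observed against predicted values could be created along these lines in R. The axis labels and the dotted 1:1 reference line are my own choices for this sketch.

```r
# Fish data from the example above
y <- c(0.18, 0.22, 0.21, 0.38, 0.46, 0.5, 0.58, 0.8, 0.8)  # weights (g)
x <- c(26, 27, 28, 32, 36, 37, 36.5, 40, 41)               # lengths (mm)
m <- lm(y ~ x)

# Predicted (fitted) values from the regression
y_pred <- unname(fitted(m))

# Observed against predicted values
plot(y_pred, y, xlab = "Predicted wet weight (g)",
     ylab = "Observed wet weight (g)", las = 1, bty = "l")

# 1:1 line, where the points would fall if the fit was perfect
abline(0, 1, lty = 3)
```

A systematic curve of the points around the 1:1 line, rather than a random scatter, would suggest non-linearity.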