Logistic regression

Logistic regression is used when the dependent variable is binary. Each observation takes one of two values, 0 or 1. These may for example represent yes/no or absent/present. As in any other regression, logistic regression estimates the degree of influence of the independent variables on the dependent variable. The dependent variable is here viewed as a probability, for example the probability that a democrat will vote for Obama.

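The logistic regression follows the equation:

$logit(p) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$

It can also be specified as:

$\ln(\frac{p}{1-p}) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$

where $p$ is the probability that the outcome of an event is 1 (for example Heads in a coin toss), $\alpha$ is the intercept, $\beta_n$ is the nth regression coefficient and $X_n$ is the value of the nth predictor variable.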

If the regression coefficient or the value of $X$ is zero, the logit simply equals the intercept (the intercept model). The regression coefficient tells us how much the log odds change for every unit change in $X$. Transformed to the odds scale, $e^{\beta}$ says how many times greater the odds of the event are at one unit of $X$ compared to the unit before. This is the odds ratio: the ratio between two odds of the same event that we get when moving one unit on the scale of $X$.

The dependent variable, $logit(p)$, is on a log scale so that the model can be treated as a linear model. On the log scale the odds change linearly when moving along $X$, whereas on the odds scale the change is exponential. You can transform the dependent variable to the odds for event 1 by using the following equation:

$Odds = \frac{p}{1-p} = e^{\alpha + \beta X}$

To calculate $p$ use this equation:

$p = \frac{e^{\alpha + \beta X}}{1 + e^{\alpha + \beta X}}$
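
The conversions between the logit, the odds and the probability can be checked in R. The sketch below is only an illustration: the values of `alpha`, `beta` and `x` are made-up assumptions and are not estimates from any data set in this text. `plogis()` and `qlogis()` are the base R functions for the inverse logit and the logit.

```r
# Hypothetical parameter values, chosen only for illustration
alpha <- -1.5   # intercept (assumed)
beta  <- 0.8    # regression coefficient (assumed)
x     <- 2      # value of the predictor (assumed)

# Linear predictor: logit(p) = alpha + beta * x
logit_p <- alpha + beta * x

# Odds for event 1: exponentiate the logit
odds <- exp(logit_p)

# Probability of event 1: odds / (1 + odds)
p <- odds / (1 + odds)

# Base R offers the same conversions directly
p_builtin     <- plogis(logit_p)  # inverse logit
logit_builtin <- qlogis(p)        # logit, recovers logit_p

print(c(logit = logit_p, odds = odds, p = p))
```

Doing the steps by hand and with the built-in functions should give the same numbers, which is a quick way to convince yourself that the three scales carry the same information.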

The goal of the logistic regression is to:

1. Estimate the parameters $\alpha$ and $\beta$ so that we can describe the relationship between $logit(p)$ and the independent variable(s).
1. See if there is a statistically significant relationship between $logit(p)$ and $X$. That means that we test the null hypothesis that there is no effect of $X$ on $logit(p)$, that is $\beta$ = 0.

You can reach the goal easily by using statistical software such as SPSS, SAS, Statistica or R. Note that Excel will not work here. See the example and logistic regression in depth on how to interpret the output.

Important terms:

Probability: The number of occurrences of a specified event (e.g. “yes”) divided by the total number of events (e.g. all “yes” and “no”):

$p = \frac{n_{yes}}{n_{yes} + n_{no}}$

Odds: The probability of an event divided by the probability of the other event. The odds specifies how many times larger the probability of an event ($p$) is in relation to the probability of the other event ($q$):

$Odds = \frac{p}{q} = \frac{p}{1-p}$

Odds ratio: The ratio of two odds. It is the factor by which the odds for a specific event is larger or smaller than another odds for the same event. The difference between the odds results from a change in $X$:

$Odds\ ratio = \frac{Odds_{X+1}}{Odds_{X}}$

Example

Below is an extract of a made-up data set of who a republican will vote for as president. Each person has been asked if he or she will vote for Obama or Romney. Since we are after the probability that one will vote for Obama, he is represented by 1 and Romney by 0.

Calculating the probability of a republican voting for Obama from these data, we get $p = 0.2$.

To calculate the probability of a republican voting for Romney we take $1 - p = q = 1 - 0.2 = 0.8$

Calculating the odds for a republican voting for Obama we get:

$Odds = \frac{p}{q} = \frac{0.2}{0.8} = 0.25$

This means that the probability of a republican voting for Obama is $\frac{1}{4}$ of the probability of a republican voting for Romney. These odds can also be expressed as 1:4.

In logistic regression, we use the log odds as the dependent variable, $\ln(\frac{p}{1-p})$, so we can treat the model as linear during computation.

We can go on putting these values into the equation for the intercept model:

$logit(p) = \ln(\frac{0.2}{0.8}) = \ln(0.25) = -1.39 = \alpha$

This says that the log odds that a republican will vote for Obama is -1.39. On the odds scale this is $e^{-1.39} = 0.25$.

This is an intercept model since there is no independent variable. The intercept is the logit of the average of our array of 0’s and 1’s.
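
The intercept model can be reproduced with a few lines of R. This is a minimal sketch: the vector `obama` is a hypothetical stand-in for the survey answers, made up so that its mean is 0.2 as in the calculation above, and the object names are my own.

```r
# Hypothetical answers from 10 republicans: 1 = Obama, 0 = Romney.
# The values are made up so that the mean equals 0.2, as in the text.
obama <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)

p    <- mean(obama)          # proportion voting for Obama
odds <- p / (1 - p)          # 0.2 / 0.8
logit_by_hand <- log(odds)   # natural log of the odds

# Intercept model: no independent variable, only "~ 1"
m0 <- glm(obama ~ 1, family = binomial)
alpha <- unname(coef(m0)[1])

print(c(p = p, odds = odds, logit_by_hand = logit_by_hand, alpha = alpha))
```

The intercept estimated by `glm()` should agree with the logit computed by hand from the mean of the 0/1 vector.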

Now consider that we have also asked a number of democrats who they would vote for in the election:

Now we have an independent variable ($X$), Political Party, that can explain variability in the dependent variable, $logit(p)$, which is the same as $\ln(\frac{p}{1-p})$, i.e. the natural log of the odds for a person voting for Obama. The independent variable in this case is also binary, “Democrat” or “Republican”, represented by 1 and 0, respectively. If we want the $logit(p)$ when asking a democrat instead of a republican, the model looks like this (R has estimated the value of $\beta$):

$logit(p)_D = \alpha + \beta X = -1.3863 + 2.7726 \times 1 = 1.3863$

Now we use the equation to calculate the $logit(p)$ when asking a republican:

$logit(p)_R = \alpha + \beta X = -1.3863 + 2.7726 \times 0 = -1.3863$

To calculate the odds ratio, take the difference between the two log odds and exponentiate it:

$\ln(Odds(p))_D - \ln(Odds(p))_R = 2.7726$

$Odds\ ratio = e^{2.7726} = 16$

You could say that “Republican” is our baseline level from which everything else changes. In this case the log odds can only change by an amount of 2.7726, since the independent variable is binary. It can only take two values, 0 or 1.
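
The two-group model can be sketched in R as follows. The data frame `votes` is hypothetical: the counts are made up so that 20 % of the republicans and 80 % of the democrats answer "Obama", which corresponds to the odds of 1:4 and 4:1 used in this text. The original data set is not reproduced here.

```r
# Hypothetical data: 10 republicans (Party = 0) and 10 democrats (Party = 1).
# Counts are assumptions chosen to give odds of 1:4 and 4:1.
votes <- data.frame(
  Party = rep(c(0, 1), each = 10),
  Obama = c(rep(c(1, 0), times = c(2, 8)),   # republicans: 2 Obama, 8 Romney
            rep(c(1, 0), times = c(8, 2)))   # democrats:   8 Obama, 2 Romney
)

m1 <- glm(Obama ~ Party, family = binomial, data = votes)

alpha <- unname(coef(m1)["(Intercept)"])  # log odds for a republican
beta  <- unname(coef(m1)["Party"])        # difference in log odds (D - R)

logit_R <- alpha + beta * 0
logit_D <- alpha + beta * 1
odds_ratio <- exp(beta)                   # same as exp(logit_D - logit_R)

print(round(c(alpha = alpha, beta = beta, odds_ratio = odds_ratio), 4))
```

With these assumed counts the intercept is the log odds for the baseline level (republicans) and `beta` is the amount by which the log odds change when moving to democrats.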

Ok, the model gives us the log odds for either a republican or a democrat voting for Obama in the election. What we are actually interested in is the probability, not the odds. So, how do we get the probability? It’s easy, just use this equation:

$p = \frac{e^{logit(p)}}{1 + e^{logit(p)}}$

The probability that a republican will vote for Obama:

$p = \frac{e^{-1.3863}}{1 + e^{-1.3863}} = \frac{0.25}{1.25} = 0.2$
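
The same back-transformation can be done in R for both parties. In this small sketch `alpha` is $\ln(0.25)$ from the republican-only example and `beta` is the difference in log odds given above, both rounded to four decimals. The object names are my own.

```r
# Log odds for a republican (X = 0) and a democrat (X = 1)
alpha <- -1.3863   # ln(0.25), the intercept
beta  <-  2.7726   # difference in log odds between the parties
logit <- c(republican = alpha + beta * 0,
           democrat   = alpha + beta * 1)

# p = exp(logit) / (1 + exp(logit))
p_by_hand <- exp(logit) / (1 + exp(logit))

# plogis() is the built-in inverse logit and gives the same result
p_builtin <- plogis(logit)

print(round(p_by_hand, 3))
```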

Is the difference between republican and democrat voters statistically significant? To determine this we need to go on and perform a logistic regression:

1. Construct the null hypothesis.

H0: There is no effect of Political party on who one will vote for as president ($\beta$ = 0)

1. Import the data file into R.
1. Run a logistic regression with “Obama” as dependent and “Party” as independent (or predictor) variable.

We get the following output:

The output tells us that the difference in log odds between republicans and democrats is statistically significant. The null hypothesis can be rejected.

How to do this in R

#1. Import the data

#2. Run the model
m1<-glm(Obama~Party,family=binomial,data=data)

#3. View the output
summary(m1)

#4 Test for significance using an Analysis of Deviance
anova(m1,test="Chisq")
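
For a self-contained run of these steps, the sketch below uses a hypothetical data frame in place of the imported file. The counts are made up (2 of 10 republicans and 8 of 10 democrats answering "Obama"), so the p-values it produces belong to this toy data and not to the original survey. It shows where the two tests can be read from: the Wald z-test in the coefficient table of `summary()` and the likelihood ratio test in the analysis of deviance table.

```r
# Hypothetical stand-in for the imported data (assumed counts)
votes <- data.frame(
  Party = rep(c(0, 1), each = 10),           # 0 = republican, 1 = democrat
  Obama = c(rep(c(1, 0), times = c(2, 8)),
            rep(c(1, 0), times = c(8, 2)))
)

m1 <- glm(Obama ~ Party, family = binomial, data = votes)

# Wald z-test for beta, taken from the coefficient table of summary()
coef_table <- coef(summary(m1))
p_wald <- coef_table["Party", "Pr(>|z|)"]

# Likelihood ratio test against the intercept model (analysis of deviance)
dev_table <- anova(m1, test = "Chisq")
p_lrt <- dev_table["Party", "Pr(>Chi)"]

print(c(p_wald = p_wald, p_lrt = p_lrt))
```

The two tests usually agree in large samples. In small samples like this one they can differ noticeably, and the analysis of deviance is often the more reliable of the two.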


Example 2

Suppose you have made a survey asking people who they will vote for in the upcoming presidential election, Obama or Romney. Are their answers affected by income? In other words, is there an effect of income on who one will vote for in the election?

It is in the form (the entire dataset is found in the R example):

1. Construct the null hypothesis.

H0: There is no effect of Income on who one will vote for as president ($\beta$ = 0)

1. Import the data file into R.
1. Run a logistic regression with “Obama” as dependent and “Income” as independent (or predictor) variable.

This example was run in R as a Generalized Linear Model with the following result.

1. Reject H0 or H1

H0 can be rejected as the regression coefficient ($\beta$) deviates from zero, $\beta$ ≠ 0. The probability that the estimated $\beta$ belongs to the distribution of estimates that arises when the true $\beta$ = 0 is less than 0.01, i.e. 1 %.

There is an effect by Income on who one will vote for in the election.

1. Interpret the result

There is a negative relationship between Income and the log odds of voting for Obama. That means that with increasing Income, the log odds of voting for Obama decrease.
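
To put a number on such a negative effect, the coefficient can be turned into an odds ratio. The sketch below does not use the survey data. It simulates a hypothetical data set from an assumed model, $logit(p) = 2.5 - 0.0005 \times Income$, purely to show the calculation. The sample size, the coefficients and the object names are all assumptions.

```r
set.seed(1)  # make the simulation reproducible

# Simulated survey (hypothetical): Income in dollars and a 0/1 answer
# generated from an assumed model logit(p) = 2.5 - 0.0005 * Income
n <- 200
Income <- seq(1000, 9000, length.out = n)
Obama  <- rbinom(n, size = 1, prob = plogis(2.5 - 0.0005 * Income))
sim <- data.frame(Income, Obama)

m <- glm(Obama ~ Income, family = binomial, data = sim)
beta <- unname(coef(m)["Income"])

# Change in log odds and odds ratio for a 1000 dollar increase in Income
log_odds_change <- beta * 1000
odds_ratio_1000 <- exp(log_odds_change)

print(c(beta = beta, odds_ratio_1000 = odds_ratio_1000))
```

An odds ratio below 1 means that the odds of voting for Obama shrink by that factor for every additional 1000 dollars of income.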

The model can be specified as:

where $p$ is the probability of one voting for Obama and $X$ is the Income.

The relationship between the log odds of one voting for Obama and Income. This is the linear predictor scale (the scale of the logit). Notice that the scale is symmetric where the logit is of the opposite sign at 9000 Dollars compared to at 1000 Dollars.

The relationship between the probability of one voting for Obama and Income. This is the response scale.

How to do this in R

#1. Import the data

#2. Run the model
m1<-glm(Obama~Income,family=binomial,data=data1)

#3. View the output
summary(m1)

#4.1 Test for significance using an Analysis of Deviance
anova(m1,test="Chisq")

#5. Get some graphs

#5.1 Construct a vector to be used as X (Income)
ND<-with(data1,data.frame(Income=seq(min(Income),max(Income),length=length(Income))))

#5.2 Predict the value of logit(p) based on the value of X
P<-predict(m1,ND)

#5.3 Predict the upper and lower confidence levels for the logit
P1<-predict(m1,ND,se.fit=TRUE)
P1.fit<-P1$fit
P1.U<-P1.fit+1.96*P1$se.fit
P1.l<-P1.fit-1.96*P1$se.fit

#5.4 Predict the value of p based on the value of X
P2<-predict(m1,ND,type="response")

#5.5 Predict the upper and lower confidence levels for p
P3<-predict(m1,ND,type="response",se.fit=TRUE)
P3.fit<-P3$fit
P3.U<-P3.fit+1.96*P3$se.fit
P3.l<-P3.fit-1.96*P3$se.fit

# Plot the graphs

#5.6 Plot with logit
plot(P~Income,ylim=c(min(P)*1.5,max(P)), data=ND,ylab="logit(p)", xlab="Income (Dollars)", las=1,type="l",bty="l",cex.lab=1.3,lwd=2)
lines(P1.U~Income,data=ND,lty=2,lwd=2)
lines(P1.l~Income,data=ND,lty=2,lwd=2)

#5.7 Plot with p
#Open a new graphics window (dev.new() works on all platforms)
dev.new()
plot(P2~Income,data=ND,ylab="p", ylim=c(0,1),xlab="Income (Dollars)", las=1,type="l",bty="l",cex.lab=1.3,lwd=2)
lines(P3.U~Income,data=ND,lty=2,lwd=2)
lines(P3.l~Income,data=ND,lty=2,lwd=2)
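
One thing to be aware of is that confidence limits computed as fit ± 1.96 × SE directly on the response scale can fall below 0 or above 1. A common alternative, sketched here, is to compute the limits on the logit scale and back-transform them with `plogis()`. This is a swapped-in technique, not the method used in the code above. The data are simulated from an assumed model so that the block runs on its own, and `data1` here is a hypothetical stand-in for the real file.

```r
set.seed(1)

# Hypothetical stand-in for data1, simulated from an assumed model
n <- 200
Income <- seq(1000, 9000, length.out = n)
Obama  <- rbinom(n, size = 1, prob = plogis(2.5 - 0.0005 * Income))
data1  <- data.frame(Income, Obama)

m1 <- glm(Obama ~ Income, family = binomial, data = data1)
ND <- data.frame(Income = seq(min(data1$Income), max(data1$Income),
                              length.out = 50))

# Predict on the logit (link) scale together with standard errors
L <- predict(m1, ND, type = "link", se.fit = TRUE)

# Back-transform the fit and the limits to the probability scale
p_fit <- plogis(L$fit)
p_lwr <- plogis(L$fit - 1.96 * L$se.fit)
p_upr <- plogis(L$fit + 1.96 * L$se.fit)

print(round(head(data.frame(ND, p_fit, p_lwr, p_upr)), 3))
```

The vectors `p_fit`, `p_lwr` and `p_upr` can be drawn with `plot()` and `lines()` in the same way as in step 5.7. Because `plogis()` always returns values between 0 and 1, the limits stay inside the valid range for a probability.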


Logistic regression in depth

I always want to understand why equations look the way they do. Have you realized why $p$ is calculated as in the equation below?

$p = \frac{Odds}{1 + Odds}$

When an odds is calculated it is expressed as the factor by which the probability of the specified event (represented by 1) is larger or smaller than that of the other event (0). If the probabilities are equal, the odds for the specified event equals 1. There is a 50 % chance that you will observe 1. The ratio can be specified as 1:1. Each probability takes up a 1 in a total of two 1’s, so there are two parts in total. The portion that one of the probabilities takes up in this relationship is one of two parts or $\frac{1}{2} = 0.5$, which is the probability of the event. The odds always expresses an x:1 or 1:x relationship, which can be used to calculate the probabilities.

If you are asking a democrat in the example above who he or she will vote for, the odds is higher for voting for Obama compared to Romney. It can be expressed as 4:1. The probability of voting for Obama is four times higher than the probability of voting for Romney. There is a total of 5 parts in this odds, and the probability of voting for Obama takes up 4 parts. Just divide 4 by 5 and you get the probability of voting for Obama, i.e. 0.8. This is expressed as:

$p = \frac{4}{4 + 1} = 0.8$

Now, if you ask a republican, the odds for voting for Obama is lower compared to Romney. It can be expressed as 1:4. The probability of voting for Obama is $\frac{1}{4}$ of the probability of voting for Romney. In this case it is not as intuitive why the +1 is in the denominator when we put everything in the equation:

$p = \frac{0.25}{0.25 + 1} = 0.2$
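
One way to see where the +1 comes from, whatever the size of the odds, is to solve the definition of the odds for $p$. This short derivation uses only the definition $Odds = \frac{p}{1-p}$ given under Important terms:

$Odds = \frac{p}{1-p}$

$Odds \cdot (1 - p) = p$

$Odds = p + Odds \cdot p = p \cdot (1 + Odds)$

$p = \frac{Odds}{1 + Odds}$

The 1 in the denominator is the share taken up by the other event once the odds is written as Odds:1. Odds of 1:4 are the same as 0.25:1, so the specified event takes up 0.25 parts out of 0.25 + 1 parts in total.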