Logistic regression

Logistic regression is used when you are dealing with a binary dependent variable. Each observation can take one of two values, 0 or 1, which may for example represent yes/no or absent/present. As in any other regression, logistic regression estimates the degree of influence of the independent variables on the dependent variable. The dependent variable is here viewed as a probability, for example the probability that a democrat will vote for Obama.

Logistic regression follows the equation:

logit(p) = \alpha + \beta_1 X_1 + \dots + \beta_n X_n

It can also be specified as:

\ln(\frac{p}{1-p}) = \alpha + \beta_1 X_1 + \dots + \beta_n X_n

where p is the probability that the outcome of an event is 1 (for example heads in a coin toss), \alpha is the intercept, \beta_n is the nth regression coefficient and X_n is the value of the predictor variable.

If the regression coefficient or the value of X is zero, the logit simply equals the intercept (the intercept-only model). The coefficient tells us how much the log odds changes for every unit change in X. Transformed back to the original scale (the response scale), it tells us, for example, how much greater the odds of the event is at this unit of X compared to the unit before. This is the odds ratio: the ratio between two odds of the same event that we get when moving along the scale of X.

The dependent variable, logit(p), is on a log scale in order to treat the model as a linear model. On the log scale, the change is linear when moving along the scale of X, whereas it is exponential on the response scale. You can transform the dependent variable back to the original scale, the odds for event 1, by using the following equation:

Odds = \frac{p}{1-p} = e^{\alpha + \beta X}

To calculate p, use this equation:

p = \frac{e^{\alpha + \beta X}}{1 + e^{\alpha + \beta X}}
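As a quick numeric sketch of these two transformations (Python here for illustration, although the article's worked analyses use R; the function names are my own):

```python
import math

def p_to_logit(p):
    """Probability -> log odds (the logit scale)."""
    return math.log(p / (1 - p))

def logit_to_p(logit):
    """Log odds -> probability (the response scale)."""
    return math.exp(logit) / (1 + math.exp(logit))

print(logit_to_p(0.0))                         # 0.5: a logit of 0 means even odds
print(round(logit_to_p(p_to_logit(0.2)), 10))  # 0.2: the two transforms are inverses
```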

The goal of the logistic regression is to:

  1. Estimate the parameters \alpha and \beta so that we can describe the relationship between logit(p) and the independent variable(s).
  2. See if there is a statistically significant correlation between logit(p) and X. That means that we test the null hypothesis that there is no effect of X on logit(p), that is \beta = 0.

You can reach the goal easily by using statistical software such as SPSS, SAS, Statistica or R. Note that Excel will not work here. See the example and Logistic regression in depth on how to interpret the output.

Important terms:

Probability: the number of occurrences of a specified event (e.g. “yes”) divided by the total number of events (e.g. all “yes” and “no”):

p = \frac{n_{yes}}{n_{yes} + n_{no}}

Odds: the probability of an event divided by the probability of the other event. The odds specifies how many times larger the probability of an event (p) is in relation to the probability of the other event (q):

Odds = \frac{p}{q} = \frac{p}{1-p}

Odds ratio: the ratio of two odds. It is the factor by which the odds for a specific event is larger or smaller than another odds for the same event; the difference between the two odds results from a change in X:

Odds\,ratio = \frac{Odds_1}{Odds_2} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}
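The three terms can be checked on made-up counts (a Python sketch; the counts 2/8 and 8/2 are chosen to match the voting example below):

```python
# Made-up counts: 2 "yes" and 8 "no" in group 1, 8 "yes" and 2 "no" in group 2
p1 = 2 / (2 + 8)               # probability in group 1 -> 0.2
p2 = 8 / (8 + 2)               # probability in group 2 -> 0.8
odds1 = p1 / (1 - p1)          # odds in group 1 -> 0.25, i.e. 1:4
odds2 = p2 / (1 - p2)          # odds in group 2 -> 4, i.e. 4:1
print(round(odds2 / odds1, 2)) # 16.0 -> the odds ratio between the groups
```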

Example

Below is an extract of a made-up data set of who a republican will vote for as president. Each person has been asked if he or she will vote for Obama or Romney. Since we are after the probability that one will vote for Obama, he is represented by 1 and Romney by 0.

To calculate the probability of a republican voting for Obama we get:

p = \frac{n_{Obama}}{n_{total}} = 0.2

To calculate the probability of a republican voting for Romney we take q = 1 - p = 1 - 0.2 = 0.8.

Calculating the odds of a republican voting for Obama we get:

Odds = \frac{p}{1-p} = \frac{0.2}{0.8} = 0.25

This means that the probability of a republican voting for Obama is \frac{1}{4} of the probability of a republican voting for Romney. This odds can also be expressed as 1:4.

In logistic regression we use the log odds, \ln(\frac{p}{1-p}), as the dependent variable so we can treat the model as linear during computation.

We can go on, putting these values into the equation for the intercept model:

logit(p) = \ln(\frac{0.2}{0.8}) = \ln(0.25) = -1.39

This says that the log odds of a republican voting for Obama is -1.39. On the odds scale this is e^{-1.39} = 0.25.

This is an intercept-only model since there is no independent variable. The intercept is the natural log of the odds computed from the average of our array of 0's and 1's.
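A sketch of that intercept computation in Python, assuming the example's made-up split of 2 Obama answers out of 10 republicans:

```python
import math

# Made-up responses mirroring the example: 2 of 10 republicans answer "Obama" (1)
votes = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
p = sum(votes) / len(votes)        # the sample mean, here 0.2
intercept = math.log(p / (1 - p))  # logit of the mean = ln(0.2/0.8)
print(round(intercept, 4))         # -1.3863
```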

Now consider that we have also asked a number of democrats who they would vote for in the election:

Now we have an independent variable (X), Political Party, that can explain variability in the dependent variable, logit(p), which is the same as \ln(\frac{p}{1-p}), i.e. the natural log of the odds of a person voting for Obama. The independent variable in this case is also binary: “Democrat” or “Republican”, represented by 1 and 0, respectively. If we want the logit(p) when asking a democrat instead of a republican, the model looks like this (R has estimated the value of \beta):

logit(p) = -1.3863 + 2.7726X

logit(p)_D = -1.3863 + 2.7726 \cdot 1 = 1.3863

Now we use the equation to calculate the logit(p) when asking a republican:

logit(p)_R = -1.3863 + 2.7726 \cdot 0 = -1.3863

To calculate the odds ratio:

\ln(Odds(p))_D - \ln(Odds(p))_R = 1.3863 - (-1.3863) = 2.7726

Odds\,ratio = e^{2.7726} = 16

You could say that “Republican” is our baseline level, from which everything else changes. In this case the logit can only change by one amount, \beta, since the independent variable is binary: it can only take two values, 0 or 1.
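The estimates can be reproduced by hand from the two group probabilities (a Python sketch using the made-up survey proportions, p = 0.2 for republicans and p = 0.8 for democrats):

```python
import math

# Group probabilities from the made-up survey
logit_rep = math.log(0.2 / 0.8)  # "Republican" is the baseline level (X = 0)
logit_dem = math.log(0.8 / 0.2)  # "Democrat" is X = 1
alpha = logit_rep                # intercept: the logit at X = 0
beta = logit_dem - logit_rep     # slope: change in log odds going from X = 0 to X = 1
print(round(alpha, 4), round(beta, 4))  # -1.3863 2.7726
print(round(math.exp(beta), 4))         # 16.0 -> the odds ratio 4 / (1/4)
```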

OK, the model gives us the log odds of either a republican or a democrat voting for Obama in the election. What we are actually interested in is the probability, not the odds. So how do we get the probability? It's easy, just use this equation:

p = \frac{e^{logit(p)}}{1 + e^{logit(p)}}

The probability that a republican will vote for Obama:

p_R = \frac{e^{-1.3863}}{1 + e^{-1.3863}} = \frac{0.25}{1.25} = 0.2
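A quick check of this back-transformation in Python, using the fitted logits from the example (invlogit is my own helper name):

```python
import math

def invlogit(x):
    """Back-transform from the logit scale to a probability."""
    return math.exp(x) / (1 + math.exp(x))

# Fitted logits from the example: -1.3863 for republicans, -1.3863 + 2.7726 for democrats
print(round(invlogit(-1.3863), 2))           # 0.2
print(round(invlogit(-1.3863 + 2.7726), 2))  # 0.8
```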

Is the difference between republican and democrat voters statistically significant? To determine this we need to go on and perform a logistic regression:

  1. Construct the null hypothesis.

H0: There is no effect of Political party on who one will vote for as president (\beta = 0).

  2. Import the data file into R.
  3. Run a logistic regression with “Obama” as dependent and “Party” as independent (or predictor) variable.

We get the following output:

(R summary output for the model; it can be reproduced with the code below.)

The output tells us that the difference in log odds between republicans and democrats is statistically significant. The null hypothesis can be rejected.

How to do this in R

#1. Import the data
data<-read.csv("http://www.ilovestats.org/wp-content/uploads/2015/08/example1.csv",dec=",",sep=";")

#2. Run the model
m1<-glm(Obama~Party,family=binomial,data=data)

#3. View the output
summary(m1)

#4. Test for significance using an Analysis of Deviance
anova(m1,test="Chisq")

Example 2

Suppose you have made a survey asking people who they will vote for in the upcoming presidential election, Obama or Romney. Are their answers affected by income? In other words: is there an effect of income on who one will vote for in the election?

It is in the form (the entire dataset is found in the R example):

  1. Construct the null hypothesis.

H0: There is no effect of Income on who one will vote for as president (\beta = 0).

  2. Import the data file into R.
  3. Run a logistic regression with “Obama” as dependent and “Income” as independent (or predictor) variable.

This example was run in R as a Generalized Linear Model with the following result.

  4. Reject or retain H0.

H0 can be rejected as the regression coefficient (\beta) deviates from zero, \beta ≠ 0. The probability that the estimated \beta belongs to a distribution of estimates that can be achieved when the true \beta = 0 is less than 0.01, i.e. 1 %.

There is an effect of Income on who one will vote for in the election.

  5. Interpret the result.

There is a negative correlation between Income and the log odds of voting for Obama. That means that with increasing Income, the log odds of voting for Obama decreases.

The model can be specified as:

\ln(\frac{p}{1-p}) = \alpha + \beta X

where p is the probability of one voting for Obama and X is the Income.

The relationship between the log odds of one voting for Obama and Income. This is the linear predictor scale (the scale of the logit). Notice that the scale is symmetric: the logit at 9000 dollars has the opposite sign of the logit at 1000 dollars.

The relationship between the probability of one voting for Obama and Income. This is the response scale.
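The shape of both scales can be sketched with hypothetical coefficients (these are not the fitted values from the R output; \alpha and \beta are chosen here so that the logit flips sign between 1000 and 9000 dollars, as the caption above describes):

```python
import math

# Hypothetical coefficients for illustration only
alpha, beta = 2.5, -0.0005   # beta < 0: the log odds of voting Obama falls as Income rises

def p_obama(income):
    z = alpha + beta * income               # the linear predictor (logit scale)
    return math.exp(z) / (1 + math.exp(z))  # back-transformed to the response scale

# The logit is +2 at 1000 dollars and -2 at 9000 dollars, so the probabilities mirror
print(round(p_obama(1000), 3), round(p_obama(9000), 3))  # 0.881 0.119
```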

How to do this in R

#1. Import the data
data1<-read.csv("http://www.ilovestats.org/wp-content/uploads/2015/08/example2.csv",header=T,dec=",",sep=";")
head(data1)

#2. Run the model
m1<-glm(Obama~Income,family=binomial,data=data1)

#3. View the output
summary(m1)

#4. Test for significance using an Analysis of Deviance
anova(m1,test="Chisq")

#5. Get some graphs

 #5.1 Construct a vector to be used as X (Income)
 ND<-with(data1,data.frame(Income=seq(min(Income),max(Income),length=length(Income))))

 #5.2 Predict the value of logit(p) based on the value of X
 P<-predict(m1,ND)

 #5.3 Predict the upper and lower confidence levels for the logit
 P1<-predict(m1,ND,type="link",se.fit=T)
 P1.fit<-P1$fit
 P1.U<-P1.fit+1.96*P1$se.fit
 P1.l<-P1.fit-1.96*P1$se.fit

 #5.4 Predict the value of p based on the value of X
 P2<-predict(m1,ND,type="response")

 #5.5 Predict the upper and lower confidence levels for p
 P3<-predict(m1,ND,type="response",se.fit=T)
 P3.fit<-P3$fit
 P3.U<-P3.fit+1.96*P3$se.fit
 P3.l<-P3.fit-1.96*P3$se.fit

# Plot the graphs

 #5.6 Plot with logit
 plot(P~Income,ylim=c(min(P)*1.5,max(P)),data=ND,ylab="logit(p)",xlab="Income (Dollars)",las=1,type="l",bty="l",cex.lab=1.3,lwd=2)
 lines(P1.U~Income,data=ND,lty=2,lwd=2)
 lines(P1.l~Income,data=ND,lty=2,lwd=2)

 #5.7 Plot with p
 x11()
 plot(P2~Income,data=ND,ylab="p",ylim=c(0,1),xlab="Income (Dollars)",las=1,type="l",bty="l",cex.lab=1.3,lwd=2)
 lines(P3.U~Income,data=ND,lty=2,lwd=2)
 lines(P3.l~Income,data=ND,lty=2,lwd=2)

Logistic regression in depth

I always want to understand why equations look the way they do. Have you realized why p is calculated as in the equation below?

p = \frac{e^{\alpha + \beta X}}{1 + e^{\alpha + \beta X}}

When an odds is calculated, it is expressed as the factor by which the probability of the specified event (represented by 1) is larger or smaller than that of the other event (0). If the probabilities are equal, the odds for the specified event equals 1: there is a 50 % chance that you will observe a 1. The ratio can be specified as 1:1. Each probability takes up one part of a total of two parts, so the portion that one of the probabilities takes up in this relationship is one of two parts, \frac{1}{2} = 0.5, which is the probability of the event. The odds always expresses an x:1 or 1:x relationship, which can be used to calculate the probabilities.

If you ask a democrat in the example above who he or she will vote for, the odds is higher for Obama than for Romney. It can be expressed as 4:1: the probability of voting for Obama is four times the probability of voting for Romney. There is a total of 5 parts in this odds, and the probability of voting for Obama takes up 4 of them. Just divide 4 by 5 and you get the probability of voting for Obama, i.e. 0.8. This is expressed as:

p = \frac{4}{4 + 1} = 0.8

Now, if you ask a republican, the odds of voting for Obama is lower than for Romney. It can be expressed as 1:4: the probability of voting for Obama is \frac{1}{4} of the probability of voting for Romney. This case makes it less intuitive why the +1 is in the denominator when we put everything into the equation:

p = \frac{e^{-1.3863}}{1 + e^{-1.3863}} = \frac{0.25}{0.25 + 1} = 0.2
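The parts argument can be condensed to p = odds/(odds + 1), where the +1 counts the single part belonging to the other event. A one-line Python check on the two odds from the example:

```python
# p = odds / (odds + 1): the "+1" is the one part of the other event
for odds in (4.0, 0.25):            # 4:1 (democrat) and 1:4 (republican)
    print(odds, odds / (odds + 1))  # 4.0 0.8 and 0.25 0.2
```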