## Logistic regression

Logistic regression is used when the dependent variable is binary. Each observation takes one of two values, 0 or 1, which may for example represent yes/no or absent/present. As in any other regression, logistic regression estimates the degree of influence of the independent variables on the dependent variable. The dependent variable is here viewed as a probability, for example the probability that a Democrat will vote for Obama.

The logistic regression follows the equation:

$$logit(p) = \ln\left(\frac{p}{1-p}\right) = \alpha + \beta_1 X_1 + \dots + \beta_n X_n$$

It can also be specified as:

$$\frac{p}{1-p} = e^{\alpha + \beta_1 X_1 + \dots + \beta_n X_n}$$

where $p$ is the probability that the outcome of an event is 1 (for example heads in a coin toss), $\alpha$ is the intercept, $\beta_n$ is the $n$th regression coefficient and $X_n$ is the value of the $n$th predictor variable.

If the regression coefficient or the value of $X$ is zero, the logit simply equals the intercept (the intercept model). The regression coefficient tells us how much the log odds changes for every unit change in $X$. When transformed back to the original (response) scale, $e^{\beta}$ tells us how many times greater the odds for the event is at a given unit of $X$ compared to the unit before. This is the odds ratio: the ratio between two odds of the same event that we get when moving one unit along the scale of $X$.

The dependent variable, $logit(p)$, is on a log scale in order to treat the model as a linear model. On the log scale, the change in the log odds is linear when moving along the scale of $X$, whereas it is exponential on the response scale. You can transform the dependent variable back to the original scale, the odds for event 1, by using the following equation:

$$\frac{p}{1-p} = e^{logit(p)}$$

To calculate $p$, use this equation:

$$p = \frac{e^{logit(p)}}{1 + e^{logit(p)}}$$

The goal of the logistic regression is to:

1. Estimate the parameters $\alpha$ and $\beta_n$ so that we can describe the relationship between $logit(p)$ and the independent variable(s)
1. Test whether there is a statistically significant relationship between $logit(p)$ and $X$. That means that we test the null hypothesis that there is no effect of $X$ on $logit(p)$, i.e. that $\beta = 0$.
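The transformations between the logit, odds and probability scales can be checked numerically in R, where `exp()` gives the odds and the built-in `plogis()` computes the inverse logit in one step (the logit value below is just an illustration):

```r
# A logit value, e.g. taken from a fitted model
logit_p <- -1.3863

# Transform to the odds scale
odds <- exp(logit_p)                      # about 0.25

# Transform to the probability scale
p <- exp(logit_p) / (1 + exp(logit_p))   # about 0.2

# plogis() computes the same inverse logit in one step
p_check <- plogis(logit_p)
```
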

You can reach these goals easily by using statistical software such as SPSS, SAS, Statistica or R (note that Excel will not work here). See the example below, and the section on logistic regression in depth, for how to interpret the output.

Important terms:

Probability: the number of occurrences of a specified event (e.g. "yes") divided by the total number of events (e.g. all "yes" and "no"):

$$p = \frac{n_{yes}}{n_{yes} + n_{no}}$$

Odds: the probability of an event divided by the probability of the other event. The odds specifies how many times larger the probability of an event ($p$) is in relation to the probability of the other event ($q$):

$$Odds = \frac{p}{1-p} = \frac{p}{q}$$

Odds ratio: the ratio of two odds. It is the factor by which the odds for a specific event is larger or smaller than another odds for the same event, resulting from a change in $X$:

$$OR = \frac{Odds_{X+1}}{Odds_X}$$

### Example

Below is an extract of a made-up data set of who a Republican will vote for as president. Each person has been asked if he or she will vote for Obama or Romney. Since we are after the probability that one will vote for Obama, he is represented by 1 and Romney by 0. To calculate the probability of a Republican voting for Obama we get:

$$p = \frac{n_{Obama}}{n_{Obama} + n_{Romney}} = 0.2$$

To calculate the probability of a Republican voting for Romney we take $1 - p = q = 1 - 0.2 = 0.8$.

Calculating the odds for a Republican voting for Obama we get:

$$Odds = \frac{p}{1-p} = \frac{0.2}{0.8} = \frac{1}{4} = 0.25$$

This means that the probability of a Republican voting for Obama is $\frac{1}{4}$ of the probability of a Republican voting for Romney. This odds can also be expressed as 1:4.
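The probability and odds in this example can be computed directly in R; the counts below are made up to match $p = 0.2$:

```r
# Made-up counts: 2 Obama answers and 8 Romney answers among Republicans
n_obama  <- 2
n_romney <- 8

# Probability of a Republican voting for Obama
p <- n_obama / (n_obama + n_romney)   # 0.2
q <- 1 - p                            # 0.8

# Odds of a Republican voting for Obama, i.e. 1:4
odds <- p / q                         # 0.25
```
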

In logistic regression, we use the log odds as the dependent variable, $\ln(\frac{p}{1-p})$, so we can treat the model as linear during computation.

We can go on putting these values into the equation for the intercept model:

$$logit(p) = \ln\left(\frac{p}{1-p}\right) = \ln(0.25) = -1.39$$

This says that the odds on the logit scale that a Republican will vote for Obama is $-1.39$. On the original scale this is $e^{-1.39} = 0.25$.

This is an intercept model, since there is no independent variable. The intercept is the natural log of the odds computed from the average of our array of 0's and 1's.
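This can be verified in R: fitting an intercept-only model to a made-up 0/1 vector with 20 % ones gives an intercept equal to the logit of the mean (`qlogis()` is R's logit function):

```r
# Made-up responses: 2 Obama (1) and 8 Romney (0) among Republicans
y <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)

# Intercept-only ("null") logistic model
m0 <- glm(y ~ 1, family = binomial)

# The intercept equals logit(mean(y)) = ln(0.2 / 0.8) = -1.39
intercept <- unname(coef(m0))
logit_of_mean <- qlogis(mean(y))
```
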

Now consider that we have also asked a number of Democrats who they would vote for in the election. Now we have an independent variable ($X$), Political Party, that can explain variability in the dependent variable, $logit(p)$, which is the same as $\ln(\frac{p}{1-p})$, i.e. the natural log of the odds for a person voting for Obama. The independent variable in this case is also binary: "Democrat" or "Republican", represented by 1 and 0, respectively. If we want the $logit(p)$ when asking a Democrat instead of a Republican, the model looks like this (R has estimated the value of $\beta$):

$$logit(p)_D = \alpha + \beta \cdot 1 = -1.3863 + 2.7726 = 1.3863$$

Now we use the equation to calculate the $logit(p)$ when asking a Republican:

$$logit(p)_R = \alpha + \beta \cdot 0 = -1.3863$$

To calculate the odds ratio, take the difference on the log scale:

$$\ln(Odds(p))_D - \ln(Odds(p))_R = 1.3863 - (-1.3863) = 2.7726$$

so the odds ratio itself is $e^{2.7726} = 16$.

You could say that "Republican" is our baseline level, from which everything else changes. In this case the logit can only change by an amount of 2.77, since the independent variable is binary: it can only take two values, 0 or 1.
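With the probabilities from this example (0.2 for Republicans, 0.8 for Democrats), the Party coefficient is just the difference between the two group logits, and exponentiating it gives the odds ratio; a quick check in R:

```r
p_rep <- 0.2                    # Republican P(vote Obama)
p_dem <- 0.8                    # Democrat  P(vote Obama)

logit_rep <- qlogis(p_rep)      # -1.3863, the intercept
logit_dem <- qlogis(p_dem)      #  1.3863

beta <- logit_dem - logit_rep   # 2.7726, the Party coefficient
OR   <- exp(beta)               # 16: Democrats' odds of voting for
                                # Obama are 16 times larger
```
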

OK, the model gives us the log odds that either a Republican or a Democrat will vote for Obama in the election. What we are actually interested in is the probability, not the odds. So how do we get the probability? It's easy, just use this equation:

$$p = \frac{e^{logit(p)}}{1 + e^{logit(p)}}$$

The probability that a Republican will vote for Obama:

$$p = \frac{e^{-1.3863}}{1 + e^{-1.3863}} = \frac{0.25}{1.25} = 0.2$$

Is the difference between Republican and Democrat voters statistically significant? To determine this we need to go on and perform a logistic regression:

1. Construct the null hypothesis.

H0: There is no effect of Political Party on who one will vote for as president ($\beta = 0$).

1. Import the data file into R.
1. Run a logistic regression with "Obama" as dependent and "Party" as independent (or predictor) variable.

We get the following output: it tells us that the difference in log odds between Republicans and Democrats is statistically significant, so the null hypothesis can be rejected.

How to do this in R

```r
#1. Import the data

#2. Run the model
m1 <- glm(Obama ~ Party, family = binomial, data = data)

#3. View the output
summary(m1)

#4. Test for significance using an Analysis of Deviance
anova(m1, test = "Chisq")
```
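Since the original data file is not included here, a self-contained sketch with made-up data (10 Republicans and 10 Democrats, coded to match the proportions 0.2 and 0.8 in the example) reproduces the whole workflow and recovers $\beta = 2.77$:

```r
# Made-up data matching the proportions in the example
data <- data.frame(
  Obama = c(rep(c(1, 0), times = c(2, 8)),   # Republicans: 2 of 10 vote Obama
            rep(c(1, 0), times = c(8, 2))),  # Democrats:   8 of 10 vote Obama
  Party = rep(c("Republican", "Democrat"), each = 10)
)
# Make "Republican" the baseline level
data$Party <- relevel(factor(data$Party), ref = "Republican")

m1 <- glm(Obama ~ Party, family = binomial, data = data)
summary(m1)
anova(m1, test = "Chisq")

beta <- unname(coef(m1)["PartyDemocrat"])   # about 2.77
```
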


### Example 2

Suppose you have made a survey asking people who they will vote for in the upcoming presidential election, Obama or Romney. Are their answers affected by income? In other words, is there an effect of Income on who one will vote for in the election?

The data is in the form shown in the R example (where the entire dataset is found).

1. Construct the null hypothesis.

H0: There is no effect of Income on who one will vote for as president ($\beta = 0$).

1. Import the data file into R.
1. Run a logistic regression with "Obama" as dependent and "Income" as independent (or predictor) variable.

This example was run in R as a Generalized Linear Model, with the following result.

1. Reject or retain H0

H0 can be rejected, as the regression coefficient ($\beta$) deviates from zero, $\beta \neq 0$. The probability that the estimated $\beta$ belongs to a distribution of estimated $\beta$s that can be obtained when the true $\beta = 0$ is less than 0.01, i.e. 1 %.

There is an effect of Income on who one will vote for in the election.

1. Interpret the result

There is a negative relationship between Income and the log odds of voting for Obama. That means that with increasing Income, the log odds of voting for Obama decreases.

The model can be specified as:

$$logit(p) = \alpha + \beta X$$

where $p$ is the probability of one voting for Obama and $X$ is the Income. The first graph shows the relationship between the log odds of voting for Obama and Income. This is the linear predictor scale (the scale of the logit). Notice that this scale is symmetric: the logit at 9000 dollars has the same magnitude but opposite sign as the logit at 1000 dollars. The second graph shows the relationship between the probability of voting for Obama and Income. This is the response scale.
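The two scales can be sketched with hypothetical coefficients ($\alpha = 4.5$, $\beta = -0.0009$, chosen only to give a negative Income effect with the logit crossing zero at 5000 dollars; the actual estimates come from the R output):

```r
# Hypothetical coefficients, chosen so the logit is 0 at Income = 5000
alpha <- 4.5
beta  <- -0.0009

Income  <- seq(1000, 9000, by = 100)
logit_p <- alpha + beta * Income   # linear predictor scale: a straight line
p       <- plogis(logit_p)         # response scale: an S-shaped curve

# The logit is symmetric around Income = 5000:
# 3.6 at 1000 dollars, -3.6 at 9000 dollars, and p = 0.5 in the middle
```
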

How to do this in R

```r
#1. Import the data

#2. Run the model
m1 <- glm(Obama ~ Income, family = binomial, data = data1)

#3. View the output
summary(m1)

#4. Test for significance using an Analysis of Deviance
anova(m1, test = "Chisq")

#5. Get some graphs

#5.1 Construct a vector to be used as X (Income)
ND <- with(data1, data.frame(Income = seq(min(Income), max(Income),
                                          length = length(Income))))

#5.2 Predict the value of logit(p) based on the value of X
P1 <- predict(m1, ND, se.fit = TRUE)
P  <- P1$fit

#5.3 Predict the upper and lower confidence levels for the logit
P1.fit <- P1$fit
P1.U <- P1.fit + 1.96 * P1$se.fit
P1.l <- P1.fit - 1.96 * P1$se.fit

#5.4 Predict the value of p based on the value of X
P2 <- predict(m1, ND, type = "response")

#5.5 Predict the upper and lower confidence levels for p
P3 <- predict(m1, ND, type = "response", se.fit = TRUE)
P3.fit <- P3$fit
P3.U <- P3.fit + 1.96 * P3$se.fit
P3.l <- P3.fit - 1.96 * P3$se.fit

# Plot the graphs

#5.6 Plot with logit
plot(P ~ Income, ylim = c(min(P) * 1.5, max(P)), data = ND,
     ylab = "logit(p)", xlab = "Income (Dollars)",
     las = 1, type = "l", bty = "l", cex.lab = 1.3, lwd = 2)
lines(P1.U ~ Income, data = ND, lty = 2, lwd = 2)
lines(P1.l ~ Income, data = ND, lty = 2, lwd = 2)

#5.7 Plot with p
x11()
plot(P2 ~ Income, data = ND, ylab = "p", ylim = c(0, 1),
     xlab = "Income (Dollars)",
     las = 1, type = "l", bty = "l", cex.lab = 1.3, lwd = 2)
lines(P3.U ~ Income, data = ND, lty = 2, lwd = 2)
lines(P3.l ~ Income, data = ND, lty = 2, lwd = 2)
```


### Logistic regression in depth

I always want to understand why equations look the way they do. Have you realized why $p$ is calculated as in the equation below?

$$p = \frac{Odds}{Odds + 1}$$

When an odds is calculated, it is expressed as the factor by which the probability of the specified event (represented by 1) is larger or smaller than that of the other event (0). If the probabilities are equal, the odds for the specified event equals 1: there is a 50 % chance that you will observe a 1. The ratio can be specified as 1:1. Each probability takes up one part in a total of two parts. So the portion that one of the probabilities takes up in this relationship is one of two parts, $\frac{1}{2} = 0.5$, which is the probability of the event. The odds always expresses an x:1 or 1:x relationship, which can be used to calculate the probabilities.

If you ask a Democrat in the example above who he or she will vote for, the odds is higher for Obama than for Romney. It can be expressed as 4:1: the probability of voting for Obama is four times higher than the probability of voting for Romney. There are a total of 5 parts in this odds, and the probability of voting for Obama takes up 4 of them. Just divide 4 by 5 and you get the probability of voting for Obama, i.e. 0.8. This is expressed as:

$$p = \frac{Odds}{Odds + 1} = \frac{4}{4 + 1} = 0.8$$

Now, if you ask a Republican, the odds for voting for Obama is lower than for Romney. It can be expressed as 1:4: the probability of voting for Obama is $\frac{1}{4}$ of the probability of voting for Romney. This case makes it less intuitive why the $+1$ is in the denominator when we put everything into the equation:

$$p = \frac{Odds}{Odds + 1} = \frac{0.25}{0.25 + 1} = \frac{0.25}{1.25} = 0.2$$

Dividing both sides of the 1:4 relationship by 4 gives 0.25:1, so the denominator $Odds + 1$ is still the total number of parts, of which the specified event takes up $Odds$ parts.
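The parts argument can be checked directly with a small sketch:

```r
# Democrat: odds 4:1 -> Obama takes 4 parts of a total of 5
odds_dem <- 4
p_dem <- odds_dem / (odds_dem + 1)   # 4/5 = 0.8

# Republican: odds 1:4, i.e. 0.25:1 -> Obama takes 0.25 parts of 1.25
odds_rep <- 1 / 4
p_rep <- odds_rep / (odds_rep + 1)   # 0.25/1.25 = 0.2

# In both cases, odds + 1 is the total number of parts
```
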