pacman::p_load(ggplot2, dplyr)
D = read.csv("data/quality.csv")   # Read in dataset
D = D[, c(14, 4, 5)]               # keep the response and two predictor columns
names(D) = c("y", "x1", "x2")      # rename them y, x1, x2
table(D$y)
0 1
98 33
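As a quick check on the class balance, the counts above imply an overall rate of y = 1 of roughly 25%; a minimal sketch:
prop = 33/(98+33); prop       # overall rate of y = 1, about 0.252
log(prop/(1-prop))            # its logit, about -1.09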
Generalized Linear Model: glm(y~x1+x2, family=binomial)
glm1 = glm(y~x1+x2, D, family=binomial)
summary(glm1)
Call:
glm(formula = y ~ x1 + x2, family = binomial, data = D)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.377 -0.627 -0.510 -0.154 2.119
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.5402 0.4500 -5.64 0.000000017 ***
x1 0.0627 0.0240 2.62 0.00892 **
x2 0.1099 0.0326 3.37 0.00076 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 147.88 on 130 degrees of freedom
Residual deviance: 116.45 on 128 degrees of freedom
AIC: 122.4
Number of Fisher Scoring iterations: 5
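Since the model is linear on the logit scale, each coefficient can also be read on the odds scale; a minimal sketch (interpretation, not part of the output above):
exp(coef(glm1))   # each unit increase in x1 (or x2) multiplies the odds of y = 1 by this factor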
b = coef(glm1); b   # extract the regression coefficients
(Intercept) x1 x2
-2.540212 0.062735 0.109896
\(logit = f(x) = b_0 + b_1 x_1 + b_2 x_2\)
\(odd = Exp(logit)\)
\(Pr[y = 1] = prob = \frac{odd}{1+odd}\)
Given x1=3, x2=4, what are the predicted logit, odd, and probability?
logit = sum(b * c(1, 3, 4))
odd = exp(logit)
prob = odd/(1+odd)
c(logit=logit, odd=odd, prob=prob)
logit odd prob
-1.91242 0.14772 0.12871
🗿 : What if x1=2, x2=3?
logit = sum(b * c(1, 2, 3))
odd = exp(logit)
prob = odd/(1+odd)
c(logit=logit, odd=odd, prob=prob)
logit odd prob
-2.08505 0.12430 0.11056
logit = sum(b * c(1, 20, 30))
odd = exp(logit)
prob = odd/(1+odd)
c(logit=logit, odd=odd, prob=prob)
logit odd prob
2.01137 7.47354 0.88199
df = data.frame(x1=c(5,25,20), x2=c(9,30,30)); df
x1 x2
1 5 9
2 25 30
3 20 30
predict(glm1, df, type="response")
1 2 3
0.22488 0.91093 0.88199
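These predictions can be reproduced by hand from the formulas above; a minimal sketch using the design matrix:
X = cbind(1, df$x1, df$x2)     # intercept column plus the two predictors
logit = as.vector(X %*% b)     # linear predictor for each row of df
odd = exp(logit)
odd/(1+odd)                    # same as predict(glm1, df, type="response")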
💡 : What glm(family=binomial) does: in the space of \(\{x\}\), it finds the (class) boundary that separates \(y\).
We can plot the line of logit = 0 (i.e., odd = 1, prob = 0.5) on the plane of \(X\):
par(cex=0.8, mar=c(4,4,1,1))
plot(D$x1, D$x2, col=2+D$y, pch=20, cex=1.2, xlab="X1", ylab="X2")   # points colored by class
abline(-b[1]/b[3], -b[2]/b[3], col="blue", lty=3)                    # the logit = 0 boundary
Furthermore, we can translate a given probability (or logit) and the coefficients into the intercept & slope of a line …
\[f(x) = b_0 + b_1 x_1 + b_2 x_2 \; \Rightarrow \; x_2 = \frac{f - b_0}{b_2} - \frac{b_1}{b_2}x_1\]
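As a sanity check (a sketch, with an arbitrarily chosen x1), any point lying on the logit = 0 line should be predicted with probability 0.5:
x1 = 10                                   # an arbitrary x1
x2 = (0 - b[1])/b[3] - (b[2]/b[3])*x1     # the matching x2 on the boundary
predict(glm1, data.frame(x1=x1, x2=x2), type="response")   # ~ 0.5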
p = seq(0.1, 0.9, 0.1)
logit = log(p/(1-p))
data.frame(prob = p, logit)
prob logit
1 0.1 -2.19722
2 0.2 -1.38629
3 0.3 -0.84730
4 0.4 -0.40547
5 0.5 0.00000
6 0.6 0.40547
7 0.7 0.84730
8 0.8 1.38629
9 0.9 2.19722
Then mark these probability contours on the scatter plot:
par(cex=0.8, mar=c(4,4,1,1))
plot(D$x1, D$x2, col=2+D$y, pch=20, cex=1.3, xlab='X1', ylab='X2')
for(f in logit) {   # one contour line per probability level; blue marks prob = 0.5
  abline((f-b[1])/b[3], -b[2]/b[3], col=ifelse(f==0,'blue','cyan')) }
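If you want the contours to be self-explanatory, you can label each line with its probability; a sketch (the label position is a guess and may need adjusting for your data range):
x = quantile(D$x1, 0.9)                        # put labels near the right of the plot
for(i in seq_along(logit)) {
  y = (logit[i]-b[1])/b[3] - b[2]/b[3]*x       # height of the i-th contour at x
  text(x, y, labels=p[i], col='gray40', cex=0.7) }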
🗿 : What do the blue/cyan lines mean?
🗿 : Given any point in the figure above, how can you tell its (predicted) probability, approximately?
Odd = \(p/(1-p)\)
Logit = \(log(odd)\) = \(log(\frac{p}{1-p})\)
\(o = p/(1-p)\) ; \(p = o/(1+o)\) ; \(logit = log(o)\)
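These conversions are built into base R as qlogis() (probability → logit) and plogis() (logit → probability); a minimal sketch:
pr = 0.3
pr/(1-pr)                 # odd
log(pr/(1-pr))            # logit, the same as qlogis(pr)
qlogis(pr)
plogis(qlogis(pr))        # back to the original probability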
par(cex=0.8, mfcol=c(1,2))
curve(x/(1-x), 0.02, 0.98, col='cyan',lwd=2,
ylab='odd', xlab='p')
abline(v=seq(0,1,0.1), h=seq(0,50,5), col='lightgray', lty=3)
curve(log(x/(1-x)), 0.005, 0.995, lwd=2, col='purple',
ylab="logit",xlab='p')
abline(v=seq(0,1,0.1), h=seq(-5,5,1), col='lightgray', lty=3)
Linear Model: \(y = f(x) = b_0 + b_1x_1 + b_2x_2 + ...\)
Generalized Linear Model (GLM): \(Link(y) = f(x)\)
Logistic Regression: \(logit(y) = log(\frac{p}{1-p}) = f(x) \text{ where } p = prob[y=1]\)
Logistic Function: \(Logistic(F_x) = \frac{1}{1+Exp(-F_x)} = \frac{Exp(F_x)}{1+Exp(F_x)}\)
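The inverse link stored in a binomial GLM is exactly this logistic function; a small sketch to confirm numerically:
f = -2:2
cbind(logistic = 1/(1+exp(-f)),
      linkinv  = binomial()$linkinv(f))   # the two columns agree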
par(cex=0.8, mfrow=c(1,1))
curve(1/(1+exp(-x)), -5, 5, col='blue', lwd=2,main="Logistic Function",
xlab="f(x): the logit of y = 1", ylab="the probability of y = 1")
abline(v=-5:5, h=seq(0,1,0.1), col='lightgray', lty=2)
abline(v=0,h=0.5,col='pink')
points(0,0.5,pch=20,cex=1.5,col='red')
🗿 : What are the definitions of logit & logistic function? What is the relationship between them?