Generalized Linear Models
Kevin Donovan
February 28, 2021
Introduction
Recall: Discussed association analyses with correlation and linear regression
\[
\begin{align}
&Y=\beta_0+\beta_1X_1+\ldots+\beta_pX_p+\epsilon \\
&\\
&\text{where E}(\epsilon)=0 \text{; Var}(\epsilon)=\sigma^2 \\
&\epsilon_i \perp \epsilon_j \text{ for }i\neq j; X_1,\ldots,X_p\perp \epsilon
\end{align}
\]
- What about discrete outcomes and non-normal error terms?
Generalized linear model
- Idea: Create more general model structure to handle many distributions for outcome and residuals
- Keep flexibility and interpretability of linear regression model
- Especially flexibility with predictors/covariates (allow continuous, categorical, transformations, etc.)
- Such a structure referred to as generalized linear models or GLMs
- Focus on the distribution of the outcome rather than the distribution of the residuals
Regression for binary outcomes
- Suppose we want to assess association between binary outcome and predictors
- Ex. Autism (ASD) diagnosis
- What if we use a linear regression model?
- \(ASD=\beta_0+\beta_1X_1+\ldots+\beta_pX_p+\epsilon\)
- \(E(ASD|X)=\beta_0+\beta_1X_1+\ldots+\beta_pX_p\)
- For binary \(Y\), \(\text{E}(Y|X)=\text{Pr}(Y=1|X)\)
- \(\rightarrow \text{Pr}(ASD=\text{Yes}|X)=\beta_0+\beta_1X_1+\ldots+\beta_pX_p\)
- This is called the linear probability model
- Limitation: \(0 \leq\text{Pr}(Y=1|X) \leq 1\) but linear function not constrained
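As a quick simulated illustration of this limitation (hypothetical data, not from the lecture):

```r
# Simulated illustration: fitting lm() to a binary outcome can produce
# fitted "probabilities" outside [0, 1]
set.seed(1)
x <- rnorm(200, mean = 0, sd = 3)
y <- rbinom(200, 1, plogis(x))   # binary outcome
lpm <- lm(y ~ x)                 # linear probability model
range(fitted(lpm))               # often extends below 0 or above 1
```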
Logistic regression
- Specify a transformation of the probability that is unconstrained, and model it linearly
- Model: Single Feature \(X\)
- \(\text{logit}[p(ASD=\text{Yes}|X)]=\beta_0+\beta_1X\)
- \(\text{logit}(p)=\text{ln}(\frac{p}{1-p})\), i.e., the log odds of \(Y=1\) given \(X\)
- Thus, we model the log odds, not the probability, as a linear function of the predictors
- In terms of conditional probability:
- \(p(ASD=\text{Yes}|X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}\)
- Can transform logit model to represent probabilities of interest
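R has the logit and inverse logit built in as `qlogis()` and `plogis()`, which makes moving between the two scales easy:

```r
# logit and inverse logit (the quantile and distribution functions
# of the standard logistic distribution)
qlogis(0.75)           # logit(0.75) = log(0.75 / 0.25) ~ 1.10
plogis(qlogis(0.75))   # inverse logit recovers the probability: 0.75
```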
Logistic regression
- [Figure: visualization of the relationship between probability \(p\) and \(\text{logit}(p)\)]
Logistic regression
- [Figure: visualization of the relationship between \(p\) and \(\text{logit}(p)\) when modeling ASD diagnosis]
Model fitting
- Since the distribution of the outcome given the covariates is modeled directly, we estimate parameters by maximum likelihood instead of the least squares used before
Maximum likelihood example
- Intuition:
- Find the estimates of \(\beta\) that best match the observed data, assuming the data were generated from the specified likelihood
- Specifying the likelihood directly makes calculations more feasible, but the assumptions may not hold (more on this later); a minimal sketch follows
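A minimal sketch of this idea on simulated data (not the IBIS data): maximize the Bernoulli log-likelihood numerically. `glm()` does the equivalent via Fisher scoring / iteratively reweighted least squares.

```r
# Minimal sketch: maximize the Bernoulli log-likelihood directly
set.seed(2)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-1 + 2 * x))   # true beta0 = -1, beta1 = 2

neg_loglik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)      # Pr(Y = 1 | X) under the model
  -sum(dbinom(y, size = 1, prob = p, log = TRUE))
}

optim(c(0, 0), neg_loglik)$par            # ~ coef(glm(y ~ x, family = binomial()))
```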
- Fitting in R: the `glm()` function
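The call producing the output below would look like the following (the formula, family, and data names are taken from the output itself):

```r
# Logistic regression of 24-month ASD diagnosis on the 24-month
# MSEL composite standard score
fit <- glm(factor(SSM_ASD_v24) ~ `V24 mullen,composite_standard_score`,
           family = binomial(), data = ibis_data)
summary(fit)
```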
##
## Call:
## glm(formula = factor(SSM_ASD_v24) ~ `V24 mullen,composite_standard_score`,
## family = binomial(), data = ibis_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1954 -0.5375 -0.3091 -0.1466 3.2387
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.671167 0.827694 8.06 7.63e-16 ***
## `V24 mullen,composite_standard_score` -0.088884 0.009268 -9.59 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.73 on 572 degrees of freedom
## Residual deviance: 368.89 on 571 degrees of freedom
## (14 observations deleted due to missingness)
## AIC: 372.89
##
## Number of Fisher Scoring iterations: 6
| Variable | Estimate | Std. Error | Z Statistic | P-value |
|----------|----------|------------|-------------|---------|
| Intercept | 6.671 | 0.8277 | 8.1 | <0.005 |
| 24 Month MSEL Composite Standard Score | -0.089 | 0.0093 | -9.6 | <0.005 |
Logistic regression
Estimated model: \(\hat{\text{Pr}}[ASD=\text{YES}|MSEL]=\frac{e^{6.67-0.09MSEL}}{1+e^{6.67-0.09MSEL}}\)
Interpretation:
- \(\hat{\beta_0}=6.67\)
- \(\hat{\text{Pr}}[ASD=\text{YES}|MSEL=0]=\frac{e^{6.67}}{1+e^{6.67}}\)
- \(\hat{\beta_1}=-0.09\)
- \(\rightarrow\) Probability of ASD diagnosis decreases as MSEL increases
Logistic regression
Intercept:
- MSEL = 0 doesn't correspond to a realistic score, making \(\hat{\beta_0}\) hard to interpret
Solution: center MSEL at its mean (see the sketch below)
- MSEL \(- \mu = 0 \rightarrow\) MSEL \(= \mu\), so the intercept now corresponds to a subject with an average score
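A sketch of the centering step (the centered column name `mullen_center` appears in the output below; the mean is taken over non-missing scores):

```r
# Center the MSEL composite at its sample mean and refit
msel <- ibis_data$`V24 mullen,composite_standard_score`
ibis_data$mullen_center <- msel - mean(msel, na.rm = TRUE)

fit_centered <- glm(factor(SSM_ASD_v24) ~ mullen_center,
                    family = binomial(), data = ibis_data)
summary(fit_centered)
```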
##
## Call:
## glm(formula = factor(SSM_ASD_v24) ~ mullen_center, family = binomial(),
## data = ibis_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1954 -0.5375 -0.3091 -0.1466 3.2387
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.336873 0.180835 -12.92 <2e-16 ***
## mullen_center -0.088884 0.009268 -9.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.73 on 572 degrees of freedom
## Residual deviance: 368.89 on 571 degrees of freedom
## (14 observations deleted due to missingness)
## AIC: 372.89
##
## Number of Fisher Scoring iterations: 6
| Variable | Estimate | Std. Error | Z Statistic | P-value |
|----------|----------|------------|-------------|---------|
| Intercept | -2.337 | 0.1808 | -12.9 | <0.005 |
| 24 Month MSEL Composite Centered | -0.089 | 0.0093 | -9.6 | <0.005 |
Interpretation:
- \(\hat{\beta_0} = -2.34\)
\[
\hat{\text{Pr}}[ASD=\text{YES}|MSEL=\mu]=\frac{e^{-2.34}}{1+e^{-2.34}}\approx 0.09
\]
The slope \(\hat{\beta_1}\) is unchanged by centering
Logistic regression
Model-based estimated probabilities (non-centered):
For a patient with MSEL = 90 (slightly below 100, the population mean for the standard score)
\[
\hat{\text{Pr}}[ASD=\text{YES}|MSEL=90]=\frac{e^{6.67-0.09*90}}{1+e^{6.67-0.09*90}}=0.19
\]
Based on \(\hat{\text{Pr}}[ASD=\text{YES}|MSEL=90]\) we can create a predicted response \(\hat{ASD}\) by thresholding (see the sketch below)
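A sketch of both computations in R, reusing the `fit` object from the earlier sketch (the 0.5 cutoff is an arbitrary illustrative choice):

```r
# Estimated probability at MSEL = 90, directly from the coefficients
plogis(6.67 - 0.09 * 90)                    # ~ 0.19

# Predicted probabilities for all subjects, then threshold at 0.5
p_hat <- predict(fit, type = "response")
asd_hat <- ifelse(p_hat > 0.5, "Yes", "No")
```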
Logistic regression: confounding
Example: Credit Card Default Rate
- Consider predicting if a person defaults on their loan based on
- Student status (Student or Not Student)
- Now consider adding two more features: credit balance and income
- Why did the student coefficient change so much? Confounding
Logistic regression: confounding
- Being a student \(\rightarrow\) higher balance (more loans)
- \(\rightarrow\) higher marginal default rate vs non-students
- But is it the higher balance or simply them being students leading to defaulting more often?
- Need to compare students and non-students controlling for balance to answer this
- Can be done using regression, as in the sketch below
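This example resembles the `Default` data from the `ISLR` package; under that assumption, a hedged sketch of the comparison:

```r
# Marginal vs. adjusted association between student status and default
library(ISLR)
coef(glm(default ~ student, family = binomial(), data = Default))
coef(glm(default ~ student + balance + income, family = binomial(), data = Default))
# The student coefficient changes sign once balance is controlled for
```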
Generalized linear models
- Logistic regression one example of a GLM
Structure of model:
- Choose conditional distribution \(f(y|x)\)
- Linear regression: \(Y|X \sim \text{Normal}(\mu_{y|x}, \sigma^2)\)
- \(\mu_{y|x} = \text{E}(Y|X); \sigma^2=\text{Var}(Y|X)=\text{Var}(\epsilon)\)
- Logistic regression: \(Y|X \sim \text{Bernoulli}[p(x)]\) (the binomial family with one trial)
- \(p(x) = \mu_{y|x} = \text{Pr}(Y=1|X)\)
Generalized linear models
- Choose link function \(g(\mu_{y|x})=\beta_0+\beta_1X_1+\ldots+\beta_pX_p\)
- Linear regression: \(g(\mu_{y|x})=\mu_{y|x}\)
- Logistic regression: \(g(\mu_{y|x})=\log(\frac{\mu_{y|x}}{1-\mu_{y|x}})\)
- Idea: \(g:\mathcal{X}_\mu \rightarrow (-\infty, \infty)\) maps the constrained range of the mean onto the whole real line, so the linear predictor is unconstrained
- Construct likelihood and fit
- Assuming independent observations:
- \(L(\beta)=\prod_{i=1}^{n} f(y_i|x_i)\)
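In R, a GLM is specified by a family (the distribution) and a link; a hedged sketch with hypothetical `y`, `x`, and `dat`:

```r
# Linear regression as a GLM: Normal distribution, identity link
glm(y ~ x, family = gaussian(link = "identity"), data = dat)

# Logistic regression: binomial distribution, logit link (the default)
glm(y ~ x, family = binomial(link = "logit"), data = dat)
```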
Generalized linear models
- Examples include
- Poisson regression (outcome is a count)
- Gamma regression (outcome is skewed and non-negative)
- Multinomial logistic (categorical outcome, 2+ categories)
- Beta, negative binomial, etc.
- Most can be fit using the `glm()` function in R
- Some, like beta, negative binomial, and overdispersed Poisson regression, require other packages (see the sketch below)
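A hedged sketch of a few of these (hypothetical variables `count`, `y`, `x1`, `x2`, and data frame `dat`; the negative binomial needs the `MASS` package):

```r
# Poisson regression for a count outcome (log link by default)
fit_pois <- glm(count ~ x1 + x2, family = poisson(), data = dat)

# Gamma regression for a skewed, non-negative outcome
fit_gamma <- glm(y ~ x1 + x2, family = Gamma(link = "log"), data = dat)

# Negative binomial regression (handles overdispersion) via MASS
library(MASS)
fit_nb <- glm.nb(count ~ x1 + x2, data = dat)
```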