Generalized Linear Models
Kevin Donovan
February 28, 2021
Introduction
Recall: Discussed association analyses with correlation and linear regression
\[
\begin{align}
&Y=\beta_0+\beta_1X_1+\ldots+\beta_pX_p+\epsilon \\
&\\
&\text{where E}(\epsilon)=0 \text{; Var}(\epsilon)=\sigma^2 \\
&\epsilon_i \perp \epsilon_j \text{ for }i\neq j; X_1,\ldots,X_p\perp \epsilon
\end{align}
\]
- What about discrete outcomes and non-normal error terms?
Generalized linear model
- Idea: Create more general model structure to handle many distributions for outcome and residuals
- Keep flexibility and interpretability of linear regression model
- Especially flexibility with predictors/covariates (allow continuous, categorical, transformations, etc.)
- Such a structure referred to as generalized linear models or GLMs
- Focus on the distribution of the outcome rather than the distribution of the residuals
Regression for binary outcomes
- Suppose we want to assess association between binary outcome and predictors
- Ex. Autism (ASD) diagnosis
- What if we use a linear regression model?
- \(ASD=\beta_0+\beta_1X_1+\ldots+\beta_pX_p+\epsilon\)
- \(E(ASD|X)=\beta_0+\beta_1X_1+\ldots+\beta_pX_p\)
- For binary \(Y\), \(\text{E}(Y|X)=\text{Pr}(Y=1|X)\)
- \(\rightarrow \text{Pr}(ASD=\text{Yes}|X)=\beta_0+\beta_1X_1+\ldots+\beta_pX_p\)
- This is called the linear probability model
- Limitation: \(0 \leq\text{Pr}(Y=1|X) \leq 1\) but linear function not constrained
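As a quick simulated illustration of this limitation (hypothetical data, not from the lecture):

```r
# Simulated illustration: fitting lm() to a binary outcome can produce
# fitted "probabilities" outside [0, 1]
set.seed(1)
x <- rnorm(200, mean = 0, sd = 3)
y <- rbinom(200, 1, plogis(x))   # binary outcome
lpm <- lm(y ~ x)                 # linear probability model
range(fitted(lpm))               # often extends below 0 or above 1
```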
Logistic regression
- Specify a transformation of the probability that is unconstrained, and model it linearly
- Model: Single Feature \(X\)
- \(\text{logit}[p(ASD=\text{Yes}|X)]=\beta_0+\beta_1X\)
- \(\text{logit}(p)=\text{ln}(\frac{p}{1-p})\), i.e., the log odds of \(Y=1\) given \(X\)
- Thus, we model the log odds, not the probability, as a linear function of the predictors
- In terms of conditional probability:
- \(p(ASD=\text{Yes}|X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}\)
- Can transform logit model to represent probabilities of interest
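R has the logit and inverse logit built in as `qlogis()` and `plogis()`, which makes moving between the two scales easy:

```r
# logit and inverse logit (the quantile and distribution functions
# of the standard logistic distribution)
qlogis(0.75)           # logit(0.75) = log(0.75 / 0.25) ~ 1.10
plogis(qlogis(0.75))   # inverse logit recovers the probability: 0.75
```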
Logistic regression
- [Figure: visualization of the relationship between probability \(p\) and \(\text{logit}(p)\)]
Logistic regression
- [Figure: visualization of the relationship between \(p\) and \(\text{logit}(p)\) when modeling ASD diagnosis]
Model fitting
- Since the distribution of the outcome given the covariates is modeled directly, we estimate parameters by maximum likelihood instead of the least squares used before
Maximum likelihood example
- Intuition:
- Find the estimates of \(\beta\) that best match the observed data, assuming the data were generated from the specified likelihood
- Specifying the likelihood directly makes calculations more feasible, but the assumptions may not hold (more on this later); a minimal sketch follows
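A minimal sketch of this idea on simulated data (not the IBIS data): maximize the Bernoulli log-likelihood numerically. `glm()` does the equivalent via Fisher scoring / iteratively reweighted least squares.

```r
# Minimal sketch: maximize the Bernoulli log-likelihood directly
set.seed(2)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-1 + 2 * x))   # true beta0 = -1, beta1 = 2

neg_loglik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)      # Pr(Y = 1 | X) under the model
  -sum(dbinom(y, size = 1, prob = p, log = TRUE))
}

optim(c(0, 0), neg_loglik)$par            # ~ coef(glm(y ~ x, family = binomial()))
```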
- Fitting in R: the `glm()` function
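The call producing the output below would look like the following (the formula, family, and data names are taken from the output itself):

```r
# Logistic regression of 24-month ASD diagnosis on the 24-month
# MSEL composite standard score
fit <- glm(factor(SSM_ASD_v24) ~ `V24 mullen,composite_standard_score`,
           family = binomial(), data = ibis_data)
summary(fit)
```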
##
## Call:
## glm(formula = factor(SSM_ASD_v24) ~ `V24 mullen,composite_standard_score`,
## family = binomial(), data = ibis_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1954 -0.5375 -0.3091 -0.1466 3.2387
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.671167 0.827694 8.06 7.63e-16 ***
## `V24 mullen,composite_standard_score` -0.088884 0.009268 -9.59 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.73 on 572 degrees of freedom
## Residual deviance: 368.89 on 571 degrees of freedom
## (14 observations deleted due to missingness)
## AIC: 372.89
##
## Number of Fisher Scoring iterations: 6
| Variable | Estimate | Std. Error | Z Statistic | P-value |
|----------|----------|------------|-------------|---------|
| Intercept | 6.671 | 0.8277 | 8.1 | <0.005 |
| 24 Month MSEL Composite Standard Score | -0.089 | 0.0093 | -9.6 | <0.005 |
Logistic regression
Estimated model: \(\hat{\text{Pr}}[ASD=\text{YES}|MSEL]=\frac{e^{6.67-0.09MSEL}}{1+e^{6.67-0.09MSEL}}\)
Interpretation:
- \(\hat{\beta_0}=6.67\)
- \(\hat{\text{Pr}}[ASD=\text{YES}|MSEL=0]=\frac{e^{6.67}}{1+e^{6.67}}\)
- \(\hat{\beta_1}=-0.09\)
- \(\rightarrow\) Probability of ASD diagnosis decreases as MSEL increases
Logistic regression
Intercept:
- MSEL = 0 doesn't correspond to a realistic score, making \(\hat{\beta_0}\) hard to interpret
Solution: center MSEL at its mean (see the sketch below)
- MSEL \(- \mu = 0 \rightarrow\) MSEL \(= \mu\), so the intercept now corresponds to a subject with an average score
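A sketch of the centering step (the centered column name `mullen_center` appears in the output below; the mean is taken over non-missing scores):

```r
# Center the MSEL composite at its sample mean and refit
msel <- ibis_data$`V24 mullen,composite_standard_score`
ibis_data$mullen_center <- msel - mean(msel, na.rm = TRUE)

fit_centered <- glm(factor(SSM_ASD_v24) ~ mullen_center,
                    family = binomial(), data = ibis_data)
summary(fit_centered)
```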
##
## Call:
## glm(formula = factor(SSM_ASD_v24) ~ mullen_center, family = binomial(),
## data = ibis_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1954 -0.5375 -0.3091 -0.1466 3.2387
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.336873 0.180835 -12.92 <2e-16 ***
## mullen_center -0.088884 0.009268 -9.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.73 on 572 degrees of freedom
## Residual deviance: 368.89 on 571 degrees of freedom
## (14 observations deleted due to missingness)
## AIC: 372.89
##
## Number of Fisher Scoring iterations: 6
| Variable | Estimate | Std. Error | Z Statistic | P-value |
|----------|----------|------------|-------------|---------|
| Intercept | -2.337 | 0.1808 | -12.9 | <0.005 |
| 24 Month MSEL Composite Centered | -0.089 | 0.0093 | -9.6 | <0.005 |
Interpretation:
- \(\hat{\beta_0} = -2.34\)
\[
\hat{\text{Pr}}[ASD=\text{YES}|MSEL=\mu]=\frac{e^{-2.34}}{1+e^{-2.34}}\approx 0.09
\]
The slope \(\hat{\beta_1}\) is unchanged by centering
Logistic regression
Model-based estimated probabilities (non-centered):
For a patient with MSEL = 90 (slightly below 100, the population mean for the standard score)
\[
\hat{\text{Pr}}[ASD=\text{YES}|MSEL=90]=\frac{e^{6.67-0.09*90}}{1+e^{6.67-0.09*90}}=0.19
\]
Based on \(\hat{\text{Pr}}[ASD=\text{YES}|MSEL=90]\) we can create a predicted response \(\hat{ASD}\) by thresholding (see the sketch below)
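A sketch of both computations in R, reusing the `fit` object from the earlier sketch (the 0.5 cutoff is an arbitrary illustrative choice):

```r
# Estimated probability at MSEL = 90, directly from the coefficients
plogis(6.67 - 0.09 * 90)                    # ~ 0.19

# Predicted probabilities for all subjects, then threshold at 0.5
p_hat <- predict(fit, type = "response")
asd_hat <- ifelse(p_hat > 0.5, "Yes", "No")
```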
Logistic regression: confounding
Example: Credit Card Default Rate
- Consider predicting if a person defaults on their loan based on
- Student status (Student or Not Student)
- Now consider adding two more features: credit balance and income
- Why did the student coefficient change so much? Confounding
Logistic regression: confounding
- Being a student \(\rightarrow\) higher balance (more loans)
- \(\rightarrow\) higher marginal default rate vs non-students
- But is it the higher balance or simply them being students leading to defaulting more often?
- Need to compare students and non-students controlling for balance to answer this
- Can be done using regression, as in the sketch below
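This example resembles the `Default` data from the `ISLR` package; under that assumption, a hedged sketch of the comparison:

```r
# Marginal vs. adjusted association between student status and default
library(ISLR)
coef(glm(default ~ student, family = binomial(), data = Default))
coef(glm(default ~ student + balance + income, family = binomial(), data = Default))
# The student coefficient changes sign once balance is controlled for
```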
Generalized linear models
- Logistic regression one example of a GLM
Structure of model:
- Choose conditional distribution \(f(y|x)\)
- Linear regression: \(Y|X \sim \text{Normal}(\mu_{y|x}, \sigma^2)\)
- \(\mu_{y|x} = \text{E}(Y|X); \sigma^2=\text{Var}(Y|X)=\text{Var}(\epsilon)\)
- Logistic regression: \(Y|X \sim \text{Bernoulli}[p(x)]\) (the binomial family with one trial)
- \(p(x) = \mu_{y|x} = \text{Pr}(Y=1|X)\)
Generalized linear models
- Choose link function \(g(\mu_{y|x})=\beta_0+\beta_1X_1+\ldots+\beta_pX_p\)
- Linear regression: \(g(\mu_{y|x})=\mu_{y|x}\)
- Logistic regression: \(g(\mu_{y|x})=\log(\frac{\mu_{y|x}}{1-\mu_{y|x}})\)
- Idea: \(g:\mathcal{X}_\mu \rightarrow (-\infty, \infty)\) maps the constrained range of the mean onto the whole real line, so the linear predictor is unconstrained
- Construct likelihood and fit
- Assuming independent observations:
- \(L(\beta)=\prod_{i=1}^{n} f(y_i|x_i)\)
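In R, a GLM is specified by a family (the distribution) and a link; a hedged sketch with hypothetical `y`, `x`, and `dat`:

```r
# Linear regression as a GLM: Normal distribution, identity link
glm(y ~ x, family = gaussian(link = "identity"), data = dat)

# Logistic regression: binomial distribution, logit link (the default)
glm(y ~ x, family = binomial(link = "logit"), data = dat)
```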
Generalized linear models
- Examples include
- Poisson regression (outcome is a count)
- Gamma regression (outcome is skewed and non-negative)
- Multinomial logistic (categorical outcome, 2+ categories)
- Beta, negative binomial, etc.
- Most can be fit using the `glm()` function in R
- Some, like beta, negative binomial, and overdispersed Poisson regression, require other packages (see the sketch below)
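A hedged sketch of a few of these (hypothetical variables `count`, `y`, `x1`, `x2`, and data frame `dat`; the negative binomial needs the `MASS` package):

```r
# Poisson regression for a count outcome (log link by default)
fit_pois <- glm(count ~ x1 + x2, family = poisson(), data = dat)

# Gamma regression for a skewed, non-negative outcome
fit_gamma <- glm(y ~ x1 + x2, family = Gamma(link = "log"), data = dat)

# Negative binomial regression (handles overdispersion) via MASS
library(MASS)
fit_nb <- glm.nb(count ~ x1 + x2, data = dat)
```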