Association Analyses with IBIS Data: Correlation and Linear Regression Analyses
Kevin Donovan
January 27, 2021
Introduction
Previous Session: introduced concepts in statistical inference
Now: focus on analytic methods and their implementation in R
Multivariable Analysis
Previously: Discussed simple univariable analyses (comparing means)
Suppose one is interested in how multiple variables are related distributionally
Simplest case: Two variables \(X\) and \(Y\)
Covariance and Correlation
Covariance: \(\text{Cov}(X, Y)=\text{E}[(X-\text{E}[X])(Y-\text{E}[Y])]\)
Looking inside the outer mean: \((X-\text{E}[X])(Y-\text{E}[Y])\)
- \(>0\) means the variables tend to deviate from their means in the same direction
- \(<0\) means they tend to deviate in opposite directions
Limitation: the size of the covariance depends on the units of the variables
Pearson Correlation: \(\text{Corr}(X, Y)=\frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)}\sqrt{\text{Var}(Y)}}\)
Standardizes relationship size using variances
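These definitions can be sketched in R on hypothetical vectors `x` and `y` (illustrative values only):

```r
# Hypothetical example vectors
x <- c(1.2, 2.4, 3.1, 4.8, 5.0)
y <- c(2.0, 2.9, 3.9, 5.1, 6.2)

cov(x, y)                     # covariance, in units of x times units of y
cov(x, y) / (sd(x) * sd(y))   # Pearson correlation: covariance standardized by SDs
cor(x, y)                     # same value via the built-in
```

Note that `cov()`, `sd()`, and `cor()` all use the sample (\(n-1\)) denominators, so the standardized ratio matches `cor()` exactly.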
Covariance visual explanation
Correlation in IBIS
| x_var | y_var | estimate | p.value |
|---|---|---|---|
| V06 MSEL Composite SS | V06 AOSI Raw TS | -0.366 | <0.005 |
| V12 MSEL Composite SS | V12 AOSI Raw TS | -0.296 | <0.005 |
Correlation: Assessing Significance
- Pearson Correlation
- \(r=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2}\sqrt{\sum(y_i-\bar{y})^2}}\)
- Test statistic: under \(H_0: \text{Corr}(X,Y)=0\)
- If \(X, Y\) are bivariate normal, \(T=r\sqrt{\frac{n-2}{1-r^2}}\) follows a \(t\) distribution with \(n-2\) degrees of freedom exactly
- Otherwise, for large samples the same statistic follows this distribution approximately by the CLT
- Spearman Correlation
- Better reflects non-linear, but monotonic, relationships
- More robust to outliers
- Nonparametric test based on rank, better for small sample size
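As a sketch of both tests in R on simulated data (the variables and effect size here are made up for illustration):

```r
set.seed(123)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)   # linearly related with noise

# Pearson: t test with n - 2 degrees of freedom under bivariate normality
cor.test(x, y, method = "pearson")

# Spearman: rank-based, robust to outliers and monotone nonlinearity
cor.test(x, y, method = "spearman")
```

Both calls return the estimate, the test statistic, and a p-value for \(H_0: \text{Corr}(X,Y)=0\).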
Correlation Estimates: Example
Limitations of Correlation
- Assesses how well \(X\) and \(Y\) “tie together”; clinical effect size not well represented
- Only assesses linear or monotonic association
- Adding “confounders” to the relationship is not straightforward
- Does not directly compare means/medians
Linear Regression: Setup
Consider variables \(X\) and \(Y\) again
Consider a directional relationship:
\(X\) is denoted the independent variable; \(Y\) the dependent variable
\(X\) and \(Y\) related through mean: \(\text{E}(Y|X)=\beta_0+\beta_1X\)
Linear Regression: Setup
Full Model: \[
\begin{align}
&Y=\beta_0+\beta_1X+\epsilon \\
&\\
&\text{where E}(\epsilon)=0 \text{; Var}(\epsilon)=\sigma^2 \\
&\epsilon_i \perp \epsilon_j \text{ for }i\neq j; X\perp \epsilon
\end{align}
\]
Linear Regression: Inference
Estimation:
Find “line of best fit” in data
- Let \(\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i\)
- Define sum of the squared error \((SSE) = \sum_{i=1}^{n}(\hat{Y_i}-Y_i)^2\)
- Goal: find \(\hat{\beta_0}\) and \(\hat{\beta_1}\) which minimize \(SSE\)
- With single \(X\), have
- \(\hat{\beta_1}=r_{xy}\frac{\hat{\sigma_y}}{\hat{\sigma_x}}\)
- \(\hat{\beta_0}=\bar{Y}-\hat{\beta_1}\bar{X}\)
Can see slope estimate is scaled correlation
- \(\hat{\beta_0}=\hat{\text{E}}(Y|X=0)\)
- \(\hat{\beta_1}=\hat{\text{E}}(Y|X=x+1)-\hat{\text{E}}(Y|X=x) \text{ for any }x\)
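The estimation above can be checked numerically in R; the data below are simulated with arbitrary coefficients just to illustrate:

```r
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

fit <- lm(y ~ x)
coef(fit)                  # least-squares estimates of beta_0 and beta_1

# The slope is the correlation scaled by the ratio of SDs
r_xy <- cor(x, y)
r_xy * sd(y) / sd(x)       # matches coef(fit)["x"]
```

With a single predictor, the identity \(\hat{\beta}_1 = r_{xy}\,\hat{\sigma}_y/\hat{\sigma}_x\) holds exactly, which is what the two slope computations confirm.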
Linear Regression: Inference
Confidence Intervals and Testing:
- If \(\epsilon_i \perp \epsilon_j\) for \(i \neq j\) and \(\epsilon \sim\text{Normal}(0,\sigma^2)\)
- Under \(H_0: \beta_p=0\) for \(p=0,1\), can create a test statistic with a \(t\) distribution with \(n-2\) degrees of freedom
- Use to construct \(95\%\) CIs, do hypothesis testing for non-zero \(\beta_p\)
- If \(\epsilon_i \perp \epsilon_j\) for \(i \neq j\) and \(\text{Var}(\epsilon_i)=\sigma^2\) for all \(i\)
- Under \(H_0: \beta_p=0\), can do same as above using CLT for “large” sample
- Due to finite sample with CLT, test statistic distribution is “approximate”
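In R, `summary()` and `confint()` carry out this testing and interval construction; the data here are simulated with arbitrary true coefficients:

```r
set.seed(1)
x <- rnorm(80)
y <- 0.5 + 1.5 * x + rnorm(80)
fit <- lm(y ~ x)

summary(fit)    # t statistics and p-values for H0: beta_p = 0
confint(fit)    # 95% CIs based on the t distribution with n - 2 DoF
```

`summary(fit)` reports one row per coefficient (estimate, standard error, \(t\) value, p-value), and `confint(fit)` returns the matching interval for each.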
Linear Regression: Covariates
Above all apply for general regression equation:
\(Y=\beta_0+\beta_1X_1+\ldots+\beta_pX_p+\epsilon\)
Where \(\text{E}(Y|X_1, \ldots, X_p)=\beta_0+\beta_1X_1+\ldots+\beta_pX_p\)
\(Y|X_1, \ldots, X_p=\) “controlling for \(X_1, \ldots, X_p\)”
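A simulated sketch of what “controlling for” a covariate does to a slope estimate (the confounder `z` and all coefficients below are invented for illustration):

```r
set.seed(7)
z <- rnorm(200)                           # hypothetical confounder
x <- 0.8 * z + rnorm(200)                 # z influences x
y <- 1 + 0.5 * x + 0.7 * z + rnorm(200)   # z also influences y

coef(lm(y ~ x))       # unadjusted: slope absorbs part of z's effect
coef(lm(y ~ x + z))   # adjusted: slope close to 0.5, "controlling for z"
```

Adding `z` to the model formula moves the slope on `x` toward the true conditional effect, which is the regression analogue of adjusting for a confounder.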
Confounders
- Often illustrated using a DAG (directed acyclic graph)
- X -> Y: \(\Delta_x \text{E}(Y|X=x, Z=z)\)
- X -> Z: \(\Delta_x \text{E}(Z|X=x)\)
- Z -> Y: \(\Delta_z \text{E}(Y|Z=z)\)
Diagnostics
- Recall: Model has a number of assumptions
- \(\text{E}(Y|X_1, \ldots, X_p)=\beta_0+\beta_1X_1+\ldots+\beta_pX_p\)
- \(\epsilon_i \sim\text{Normal}(0,\sigma^2)\) for all \(i\)
- \(\epsilon_i \perp \epsilon_j\) for \(i \neq j\)
- Must evaluate if data violates assumptions
- Generally, \(H_0\): Assumptions are met
Diagnostics
- Normality
- Residual QQ-plot
- Homoskedasticity
- Residual by fitted value plot
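Both diagnostic plots can be produced in base R; the fitted model below uses simulated data purely for illustration:

```r
set.seed(99)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

res <- resid(fit)

# Normality check: residual QQ-plot against the normal quantiles
qqnorm(res)
qqline(res)

# Homoskedasticity check: residuals vs. fitted values
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Roughly, points hugging the QQ-line support the normality assumption, and a patternless band around zero in the residual-by-fitted plot supports constant variance. `plot(fit)` produces these (and other) diagnostics automatically.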