Association Analyses with IBIS Data: Correlation and Linear Regression Analyses
Kevin Donovan
January 27, 2021
Introduction
Previous Session: introduced concepts in statistical inference
Now: focus on analytic methods and their implementation in R
Multivariable Analysis
Previously: Discussed simple univariable analyses (comparing means)
Suppose one is interested in how multiple variables are related distributionally
Simplest case: Two variables \(X\) and \(Y\)
Covariance and Correlation
Covariance: \(\text{Cov}(X, Y)=\text{E}[(X-\text{E}[X])(Y-\text{E}[Y])]\)
Looking inside the outer mean: \((X-\text{E}[X])(Y-\text{E}[Y])\)
- \(>0\) means the variables tend to deviate from their means in the same direction
- \(<0\) means they tend to deviate in opposite directions
Limitation: the size of the covariance depends on the units of the variables
Pearson Correlation: \(\text{Corr}(X, Y)=\frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)}\sqrt{\text{Var}(Y)}}\)
Standardizes relationship size using variances
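These definitions can be sketched in R on hypothetical vectors `x` and `y` (illustrative values only):

```r
# Hypothetical example vectors
x <- c(1.2, 2.4, 3.1, 4.8, 5.0)
y <- c(2.0, 2.9, 3.9, 5.1, 6.2)

cov(x, y)                     # covariance, in units of x times units of y
cov(x, y) / (sd(x) * sd(y))   # Pearson correlation: covariance standardized by SDs
cor(x, y)                     # same value via the built-in
```

Note that `cov()`, `sd()`, and `cor()` all use the sample (\(n-1\)) denominators, so the standardized ratio matches `cor()` exactly.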
Covariance visual explanation
Correlation in IBIS
| x_var | y_var | estimate | p.value |
|---|---|---|---|
| V06 MSEL Composite SS | V06 AOSI Raw TS | -0.366 | <0.005 |
| V12 MSEL Composite SS | V12 AOSI Raw TS | -0.296 | <0.005 |
Correlation: Assessing Significance
- Pearson Correlation
- \(r=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2}\sqrt{\sum(y_i-\bar{y})^2}}\)
- Test statistic: under \(H_0: \text{Corr}(X,Y)=0\)
- If \(X, Y\) are bivariate normal, \(T=r\sqrt{\frac{n-2}{1-r^2}}\) follows a \(t\) distribution with \(n-2\) degrees of freedom exactly
- Otherwise, for large samples the same statistic follows this distribution approximately by the CLT
- Spearman Correlation
- Better reflects non-linear, but monotonic, relationships
- More robust to outliers
- Nonparametric test based on rank, better for small sample size
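As a sketch of both tests in R on simulated data (the variables and effect size here are made up for illustration):

```r
set.seed(123)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)   # linearly related with noise

# Pearson: t test with n - 2 degrees of freedom under bivariate normality
cor.test(x, y, method = "pearson")

# Spearman: rank-based, robust to outliers and monotone nonlinearity
cor.test(x, y, method = "spearman")
```

Both calls return the estimate, the test statistic, and a p-value for \(H_0: \text{Corr}(X,Y)=0\).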
Correlation Estimates: Example
Limitations of Correlation
- Assesses how well \(X\) and \(Y\) “tie together”; clinical effect size not well represented
- Only assesses linear or monotonic association
- Adding “confounders” to the relationship is not straightforward
- Does not directly compare means/medians
Linear Regression: Setup
Consider variables \(X\) and \(Y\) again
Consider a directional relationship:
\(X\) is denoted the independent variable; \(Y\) the dependent variable
\(X\) and \(Y\) related through mean: \(\text{E}(Y|X)=\beta_0+\beta_1X\)
Linear Regression: Setup
Full Model: \[
\begin{align}
&Y=\beta_0+\beta_1X+\epsilon \\
&\\
&\text{where E}(\epsilon)=0 \text{; Var}(\epsilon)=\sigma^2 \\
&\epsilon_i \perp \epsilon_j \text{ for }i\neq j; X\perp \epsilon
\end{align}
\]
Linear Regression: Inference
Estimation:
Find “line of best fit” in data
- Let \(\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i\)
- Define sum of the squared error \((SSE) = \sum_{i=1}^{n}(\hat{Y_i}-Y_i)^2\)
- Goal: find \(\hat{\beta_0}\) and \(\hat{\beta_1}\) which minimize \(SSE\)
- With single \(X\), have
- \(\hat{\beta_1}=r_{xy}\frac{\hat{\sigma_y}}{\hat{\sigma_x}}\)
- \(\hat{\beta_0}=\bar{Y}-\hat{\beta_1}\bar{X}\)
Can see slope estimate is scaled correlation
- \(\hat{\beta_0}=\hat{\text{E}}(Y|X=0)\)
- \(\hat{\beta_1}=\hat{\text{E}}(Y|X=x+1)-\hat{\text{E}}(Y|X=x) \text{ for any }x\)
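The estimation above can be checked numerically in R; the data below are simulated with arbitrary coefficients just to illustrate:

```r
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

fit <- lm(y ~ x)
coef(fit)                  # least-squares estimates of beta_0 and beta_1

# The slope is the correlation scaled by the ratio of SDs
r_xy <- cor(x, y)
r_xy * sd(y) / sd(x)       # matches coef(fit)["x"]
```

With a single predictor, the identity \(\hat{\beta}_1 = r_{xy}\,\hat{\sigma}_y/\hat{\sigma}_x\) holds exactly, which is what the two slope computations confirm.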
Linear Regression: Inference
Confidence Intervals and Testing:
- If \(\epsilon_i \perp \epsilon_j\) for \(i \neq j\) and \(\epsilon \sim\text{Normal}(0,\sigma^2)\)
- Under \(H_0: \beta_p=0\) for \(p=0,1\), can create a test statistic with a \(t\) distribution with \(n-2\) degrees of freedom
- Use to construct \(95\%\) CIs, do hypothesis testing for non-zero \(\beta_p\)
- If \(\epsilon_i \perp \epsilon_j\) for \(i \neq j\) and \(\text{Var}(\epsilon_i)=\sigma^2\) for all \(i\)
- Under \(H_0: \beta_p=0\), can do same as above using CLT for “large” sample
- Due to finite sample with CLT, test statistic distribution is “approximate”
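In R, `summary()` and `confint()` carry out this testing and interval construction; the data here are simulated with arbitrary true coefficients:

```r
set.seed(1)
x <- rnorm(80)
y <- 0.5 + 1.5 * x + rnorm(80)
fit <- lm(y ~ x)

summary(fit)    # t statistics and p-values for H0: beta_p = 0
confint(fit)    # 95% CIs based on the t distribution with n - 2 DoF
```

`summary(fit)` reports one row per coefficient (estimate, standard error, \(t\) value, p-value), and `confint(fit)` returns the matching interval for each.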
Linear Regression: Covariates
Above all apply for general regression equation:
\(Y=\beta_0+\beta_1X_1+\ldots+\beta_pX_p+\epsilon\)
Where \(\text{E}(Y|X_1, \ldots, X_p)=\beta_0+\beta_1X_1+\ldots+\beta_pX_p\)
\(Y|X_1, \ldots, X_p=\) “controlling for \(X_1, \ldots, X_p\)”
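A simulated sketch of what “controlling for” a covariate does to a slope estimate (the confounder `z` and all coefficients below are invented for illustration):

```r
set.seed(7)
z <- rnorm(200)                           # hypothetical confounder
x <- 0.8 * z + rnorm(200)                 # z influences x
y <- 1 + 0.5 * x + 0.7 * z + rnorm(200)   # z also influences y

coef(lm(y ~ x))       # unadjusted: slope absorbs part of z's effect
coef(lm(y ~ x + z))   # adjusted: slope close to 0.5, "controlling for z"
```

Adding `z` to the model formula moves the slope on `x` toward the true conditional effect, which is the regression analogue of adjusting for a confounder.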
Confounders
- Often illustrated using a DAG (directed acyclic graph)
- X -> Y: \(\Delta_x \text{E}(Y|X=x, Z=z)\)
- X -> Z: \(\Delta_x \text{E}(Z|X=x)\)
- Z -> Y: \(\Delta_z \text{E}(Y|Z=z)\)
Diagnostics
- Recall: Model has a number of assumptions
- \(\text{E}(Y|X_1, \ldots, X_p)=\beta_0+\beta_1X_1+\ldots+\beta_pX_p\)
- \(\epsilon_i \sim\text{Normal}(0,\sigma^2)\) for all \(i\)
- \(\epsilon_i \perp \epsilon_j\) for \(i \neq j\)
- Must evaluate if data violates assumptions
- Generally, \(H_0\): Assumptions are met
Diagnostics
- Normality
- Residual QQ-plot
- Homoskedasticity
- Residual by fitted value plot
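Both diagnostic plots can be produced in base R; the fitted model below uses simulated data purely for illustration:

```r
set.seed(99)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

res <- resid(fit)

# Normality check: residual QQ-plot against the normal quantiles
qqnorm(res)
qqline(res)

# Homoskedasticity check: residuals vs. fitted values
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Roughly, points hugging the QQ-line support the normality assumption, and a patternless band around zero in the residual-by-fitted plot supports constant variance. `plot(fit)` produces these (and other) diagnostics automatically.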