Kevin Donovan
January 11, 2021
Previous Sessions: focused on data management and visualization with coding
Now: focus on statistical analysis of data
Objectives
Limitations:
Ex.: What is the probability of seeing heads after flipping a coin?
We infer systematic properties and test hypotheses based on data
Inferences are probabilistic due to unavoidable uncertainty from only seeing output of process
Parameters = Fixed properties of “black box”
Determine “signal” present in process with random variation being “noise”
Examples:
First Step: Choose parameter(s) we want to estimate
Second Step: Determine method for estimation
Third Step: Assess estimation
Concern: How do we take into account the probabilistic uncertainty?
Estimator = function of observed data
Sample = \((X_1, X_2, \ldots, X_n)\) from chosen population
Exs.:
Sample Mean = \((X_1 + X_2 + \ldots + X_n)/n = \bar{X}\)
Sample Variance = \([(X_1 - \bar{X})^2+(X_2 - \bar{X})^2+ \ldots+(X_n - \bar{X})^2]/(n-1) = \hat{\sigma}^2\)
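As a minimal sketch of these two estimators, the Python below computes both from a small sample. The function names, the `ddof` argument, and the data values are hypothetical, chosen only for illustration. The divisor is left as a parameter because both \(n\) and \(n-1\) are in common use.

```python
import statistics

def sample_mean(xs):
    """Sample mean: (X_1 + ... + X_n) / n."""
    return sum(xs) / len(xs)

def sample_variance(xs, ddof=1):
    """Sum of squared deviations from the sample mean, divided by (n - ddof).

    ddof=1 divides by n - 1 (the usual unbiased estimator);
    ddof=0 divides by n.
    """
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - ddof)

data = [2.0, 4.0, 4.0, 6.0]  # hypothetical sample, for illustration only
print("mean:", sample_mean(data))
print("variance (n - 1):", sample_variance(data))
print("variance (n):", sample_variance(data, ddof=0))
# Cross-check against the standard library (which uses n - 1)
print("statistics.variance:", statistics.variance(data))
```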
Estimate will vary from sample to sample. How to measure this variability?
Answer: Standard Error and Confidence Intervals
Standard Error = Standard deviation of estimator
Confidence Interval = Interval of plausible values for parameter based on sample
CI Interpretation: Across repeated samples, 95% of intervals constructed this way will include the true parameter value
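The repeated-sampling interpretation can be checked by simulation. The sketch below assumes a normal population with known true mean, builds an approximate 95% interval for the mean from each simulated sample (sample mean plus or minus 1.96 estimated standard errors), and counts how often the interval contains the true mean. All names, the seed, and the settings are hypothetical.

```python
import math
import random
import statistics

def mean_ci(xs, z=1.96):
    """Approximate 95% CI for the mean: xbar +/- z * (s / sqrt(n))."""
    n = len(xs)
    xbar = statistics.mean(xs)
    se = statistics.stdev(xs) / math.sqrt(n)  # estimated standard error of the mean
    return xbar - z * se, xbar + z * se

def coverage(true_mean=0.0, true_sd=1.0, n=50, n_reps=2000, seed=1):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        xs = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        lo, hi = mean_ci(xs)
        hits += lo <= true_mean <= hi
    return hits / n_reps

print("Estimated coverage:", coverage())
```

The estimated coverage should land close to 0.95. It can fall slightly below, because the normal multiplier 1.96 is used with an estimated standard deviation.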
Statistical Analogue: Hypothesis Testing for parameter(s)
Example:
Scientific Hypothesis: Infants at 12 months with Autism (ASD) have larger brain volumes than those without ASD
Statistical Hypothesis: \(\mu_{ASD, 12}>\mu_{ASD_{Neg}, 12}\)
where \(\mu_{x, 12}=\) mean of total brain volume at 12 months of age for group “x”
For Scientific Testing: Focused on falsifying claim
For Statistical Testing: Uncertainty \(\implies\) cannot “falsify” claim; instead, judge evidence to see if one has “enough” to “reject” claim
Process:
Step One: Determine null (baseline claim) and alternative hypotheses
Step Two: Determine quantities used as evidence to make decision (test statistic)
Step Three: Evaluate evidence
Evaluating Evidence:
Method 1: Binary decision
Method 2: Continuous evaluation
Test Statistic: Function of data used to evaluate null hypothesis claim
Example: Recall our standard normal distribution example
Given this population, suppose we wanted to evaluate the following:
\(H_0\) (Null): mean \(=0\)
\(H_1\) (Alt.): mean \(\neq 0\)
Idea: Let’s compare the sample mean from a single sample to 0
Issue: How to determine “difference” from 0 given random sampling variability?
Idea: Determine distribution of sample mean to determine correspondence with \(H_0\)
Sample Mean Distribution: Distribution of sample mean values across repeated samples from the population
Note: Need enough samples to accurately estimate distribution (i.e. sample size)
Observation: Sample size goes up \(\implies\) estimated distribution of sample mean “converges” to normal curve
Turns out, this is a mathematical fact for any population distribution with finite variance, given independent random samples
Referred to as the Central Limit Theorem (CLT)
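A small simulation can illustrate the CLT. The sketch below assumes a deliberately skewed population (exponential with mean 1 and standard deviation 1), draws many samples of size \(n\), and checks whether the sample means behave like a normal distribution with standard deviation \(1/\sqrt{n}\). The function name, seed, and settings are hypothetical.

```python
import math
import random
import statistics

def simulate_sample_means(n, n_samples=2000, seed=2021):
    """Draw n_samples samples of size n from Exponential(1); return their means."""
    rng = random.Random(seed)
    return [
        statistics.mean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]

for n in (2, 30, 200):
    means = simulate_sample_means(n)
    se_theory = 1.0 / math.sqrt(n)  # population sd / sqrt(n)
    # For a normal curve, about 95% of values lie within 1.96 sd of the center
    within = sum(abs(m - 1.0) <= 1.96 * se_theory for m in means) / len(means)
    print(f"n={n}: sd of sample means={statistics.stdev(means):.4f}, "
          f"theory={se_theory:.4f}, fraction within 1.96 SE={within:.3f}")
```

As \(n\) grows, the spread of the sample means should track \(1/\sqrt{n}\) and the fraction within 1.96 standard errors should approach 0.95, even though the population itself is far from normal.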
Recall: Need distribution of sample mean to determine “rare” outcome
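From CLT, we know \(\text{Normal}(0, \sigma^2/n)\) (with \(\sigma\) = population standard deviation, \(n\) = sample size)
is an accurate estimate of this distribution, assuming the null hypothesis of mean = 0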
Suppose we obtain a sample of size n=100 and compute the sample mean
Let’s compare this observed sample mean to the expected distribution of values under the null
Can we quantify this comparison? Yes
In this case from above, sample mean = one possible test statistic
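The comparison is quantified by the p-value = probability, assuming the null is true, of a test statistic at least as extreme as the one observed (a measure of correspondence between sample and null)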
Let’s compute the p-value in the above example
P-value is 0.276
Can see much of distribution is shaded in \(\implies\) observed sample not unusual under null
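A two-sided p-value of this kind can be sketched as follows, assuming the population standard deviation is known (\(\sigma = 1\)) and the sample mean is approximately normal under the null. The observed sample mean used here (0.15) is a hypothetical value, not the one behind the result reported above, and the function name and the 0.05 cutoff are illustrative choices.

```python
import math
from statistics import NormalDist

def two_sided_p_value(sample_mean, n, sigma=1.0, mu0=0.0):
    """P(|sample mean - mu0| at least as large as observed), under H0.

    Assumes the sample mean is Normal(mu0, sigma^2 / n) under the null.
    """
    se = sigma / math.sqrt(n)      # standard error of the sample mean
    z = (sample_mean - mu0) / se   # standardized distance from the null value
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_sided_p_value(sample_mean=0.15, n=100)  # hypothetical observed mean
print(f"P-value is {p:.3f}")

# Binary decision at a conventional (but arbitrary) 0.05 threshold
alpha = 0.05
print("Reject H0" if p < alpha else "Fail to reject H0")
```

The same p-value supports both ways of evaluating evidence: reported as is (continuous evaluation) or compared against a threshold (binary decision).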
Note that the CLT-based distribution above assumes that the true population standard deviation is known
This is rarely the case and often needs to be estimated
This results in the following test statistic:
\(T=\frac{\bar{X}-\mu_0}{\hat{\sigma}/\sqrt{n}}\)
where \(\mu_0\) = mean under \(H_0\)
which has a T distribution with \(n-1\) degrees of freedom
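The statistic \(T\) can be computed directly from a sample, as in the minimal sketch below. The function name and data are hypothetical. The standard deviation is estimated with the \(n-1\) divisor, which is what Python's `statistics.stdev` uses.

```python
import math
import statistics

def t_statistic(xs, mu0=0.0):
    """Return (T, degrees of freedom) for testing H0: mean = mu0."""
    n = len(xs)
    xbar = statistics.mean(xs)
    sigma_hat = statistics.stdev(xs)  # estimated sd, n - 1 divisor
    t = (xbar - mu0) / (sigma_hat / math.sqrt(n))
    return t, n - 1

data = [1.0, 2.0, 3.0]  # hypothetical sample, for illustration only
t, df = t_statistic(data, mu0=0.0)
print(f"T = {t:.4f} with {df} degrees of freedom")
```

Turning \(T\) into a p-value requires the t distribution's tail probabilities, which the Python standard library does not provide. One option, assuming SciPy is installed, is `scipy.stats.t.sf`.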
Finding a test statistic when looking at the mean in this scenario is easy; however, in general it can be quite challenging for more complicated scenarios
Recap: Testing process
Complications:
Many considerations need to be taken into account when testing
These (and many more) aspects will be covered in future sessions