Statistical Inference Intro

Kevin Donovan

January 11, 2021

Introduction

Previous Sessions: focused on data management and visualization with coding

Now: focus on statistical analysis of data

Objectives

Populations, Samples, Data Properties

Common Viewpoint:

Limitations:

Populations, Samples, Data Properties

Alternative Viewpoint: Data Generating Mechanism

Ex.: What is the probability of seeing heads after flipping a coin?
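The coin-flip question can be explored by simulation: repeatedly flip a (simulated) coin and watch the observed proportion of heads settle near the true probability. This is a minimal sketch; the function name and the fair-coin assumption (`p=0.5`) are illustrative, not from the slides.

```python
import random

def estimate_heads_probability(n_flips, p=0.5, seed=0):
    """Estimate P(heads) by simulating n_flips tosses of a coin
    with true heads probability p (assumed 0.5, i.e. a fair coin)."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p for _ in range(n_flips))
    return heads / n_flips

est = estimate_heads_probability(100_000)  # close to 0.5
```

With many flips the estimate concentrates around 0.5, which is exactly the "data generating mechanism" viewpoint: we only see outputs of the process and infer its fixed properties.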

Populations, Samples, Data Properties

Statistical analysis uses both inductive and deductive reasoning

We infer systematic properties and test hypotheses based on data

Inferences are probabilistic due to the unavoidable uncertainty from only seeing the output of the process

Parameters and Estimation

Parameters = Fixed properties of “black box”

Determine the “signal” present in the process, with random variation being the “noise”

Examples:

Parameters and Estimation

Parameters and Estimation

First Step: Choose parameter(s) we want to estimate

Second Step: Determine method for estimation

Third Step: Assess estimation

Concern: How do we take into account the probabilistic uncertainty?


Parameters and Estimation

Estimator = function of observed data

Sample = \((X_1, X_2, \ldots, X_n)\) from chosen population

Exs.:

Sample Mean = \((X_1 + X_2 + \ldots + X_n)/n = \bar{X}\)

Sample Variance = \([(X_1 - \bar{X})^2+(X_2 - \bar{X})^2+ \ldots+(X_n - \bar{X})^2]/n = \hat{\sigma}^2\)
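The two estimators above translate directly into code. This is a minimal sketch (the function names are illustrative); note the variance here divides by \(n\), matching the slide's \(\hat{\sigma}^2\), rather than the unbiased \(n-1\) version.

```python
def sample_mean(xs):
    """(X_1 + ... + X_n) / n"""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Average squared deviation from the sample mean,
    dividing by n as in the slide's estimator."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sample_mean(data)      # 5.0
sample_variance(data)  # 4.0
```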

Parameters and Estimation

Estimate will vary from sample to sample. How to measure this variability?

Answer: Standard Error and Confidence Intervals

Parameters and Estimation

Standard Error = Standard deviation of the estimator (square root of its variance)

Confidence Interval = Interval of plausible values for parameter based on sample


Parameters and Estimation

CI Interpretation: 95% of intervals constructed from repeated samples will include the true parameter value
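The "95% of intervals" interpretation can be checked by simulation: draw many samples, build a 95% interval from each, and count how often the interval covers the true mean. A minimal sketch, assuming a known population standard deviation (a z-interval); the function name and all parameter values are illustrative.

```python
import math
import random

def coverage(n_sims=2000, n=50, mu=0.0, sigma=1.0, z=1.96, seed=1):
    """Fraction of 95% known-sigma confidence intervals,
    xbar +/- z * sigma / sqrt(n), that cover the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        half = z * sigma / math.sqrt(n)
        if xbar - half <= mu <= xbar + half:
            hits += 1
    return hits / n_sims

coverage()  # close to 0.95
```

Any single interval either covers the true value or it does not; the 95% refers to the long-run fraction across repeated samples, which is what the simulation estimates.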

Hypothesis Testing


Hypothesis Testing

Statistical Analogue: Hypothesis Testing for parameter(s)

Example:

Scientific Hypothesis: Infants at 12 months with Autism (ASD) have larger brain volumes than those without ASD

Statistical Hypothesis: \(\mu_{ASD, 12}>\mu_{ASD_{Neg}, 12}\)

where \(\mu_{x, 12}=\) mean of total brain volume at 12 months of age for group “x”

Hypothesis Testing

For Scientific Testing: Focused on falsifying claim

For Statistical Testing: Uncertainty \(\implies\) cannot “falsify” claim

instead

judge evidence to see if one has “enough” to “reject” claim

Hypothesis Testing

Process:

Step One: Determine null (baseline claim) and alternative hypotheses

Step Two: Determine quantities used as evidence to make decision (test statistic)

Step Three: Evaluate evidence

Hypothesis Testing

Evaluating Evidence:

Method 1: Binary decision

Method 2: Continuous evaluation

Hypothesis Testing

Test Statistic: Function of data used to evaluate null hypothesis claim

Example: Recall our standard normal distribution example

Given this population, suppose we wanted to evaluate the following:

\(H_0\) (Null): mean \(=0\)

\(H_1\) (Alt.): mean \(\neq 0\)

Idea: Let’s compare the sample mean from a single sample to 0

Issue: How to determine “difference” from 0 given random sampling variability?

Idea: Derive the distribution of the sample mean to assess its correspondence with \(H_0\)

Hypothesis Testing

Sample Mean Distribution: Distribution of the sample mean across repeated samples from the population

Hypothesis Testing

Note: Need enough samples to accurately estimate distribution (i.e. sample size)

Observation: Sample size goes up \(\implies\) estimated distribution of the sample mean “converges” to a normal curve

It turns out this is a mathematical fact for any distribution with finite variance, given independent random samples

Referred to as the Central Limit Theorem (CLT)
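The CLT can be seen empirically by sampling from a clearly non-normal population and recording the sample mean many times. A minimal sketch, assuming an exponential population with rate 1 (so population mean 1 and variance 1); the function name and sizes are illustrative.

```python
import random

def sample_mean_distribution(n=30, n_reps=5000, seed=2):
    """Draw n_reps samples of size n from a skewed exponential(1)
    population and return the sample mean of each. By the CLT these
    means are approximately normal with mean 1 and sd 1/sqrt(n)."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n
            for _ in range(n_reps)]

means = sample_mean_distribution()
```

Plotting a histogram of `means` would show a roughly bell-shaped curve centered at 1 with spread \(1/\sqrt{30}\), even though the underlying population is strongly skewed.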

Hypothesis Testing

Recall: Need distribution of sample mean to determine “rare” outcome

From the CLT, we know the normal distribution

\(N(0, \sigma^2/n)\)

is an accurate approximation of this distribution, assuming the null hypothesis of mean \(=0\)

Hypothesis Testing

Suppose we obtain a sample of size n=100 and compute the sample mean

Let’s compare this observed sample mean to the expected distribution of values under the null

Hypothesis Testing

Can we quantify this comparison? Yes

  1. Compute test statistic = statistic whose distribution is known under the null

In this case, the sample mean is one possible test statistic

  2. Compute a “probability-based” measure = probability of observing a sample mean as or more extreme than the one seen in the data

This measure is denoted the p-value = a measure of correspondence between the sample and the null

Hypothesis Testing

Let’s compute the p-value in the above example

[1] “P-value is 0.276”

Can see much of the distribution is shaded in \(\implies\) observed sample mean is not unusual under the null
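A p-value of this kind (known \(\sigma\), CLT normal approximation) can be computed with only the standard library. A minimal sketch; the function name and the example inputs are illustrative, not the slide's actual sample (though a sample mean of 0.109 with \(n=100\) and \(\sigma=1\) happens to give a similar p-value).

```python
import math

def z_pvalue_two_sided(xbar, mu0, sigma, n):
    """Two-sided p-value for the sample mean under H0: mean = mu0,
    using the CLT normal approximation with known sigma."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    # standard normal CDF via the error function:
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

z_pvalue_two_sided(0.109, 0.0, 1.0, 100)  # ≈ 0.2757
```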

Note that the CLT-based test above assumes the true population standard deviation is known

This is rarely the case and often needs to be estimated

This results in the following test statistic:

\(T=\frac{\bar{X}-\mu_0}{\hat{\sigma}/\sqrt{n}}\)

where \(\mu_0\) = mean under \(H_0\)

which has a T distribution with \(n-1\) degrees of freedom
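The T statistic above is straightforward to compute. A minimal sketch (the function name is illustrative); note that \(\hat{\sigma}\) here uses the usual \(n-1\) denominator, as the t-test convention requires, rather than the \(n\) denominator shown earlier. The p-value would then come from a T distribution with \(n-1\) degrees of freedom (e.g. via a statistics library).

```python
import math

def t_statistic(xs, mu0):
    """T = (xbar - mu0) / (sigma_hat / sqrt(n)), where sigma_hat is
    the (n-1)-denominator sample standard deviation."""
    n = len(xs)
    xbar = sum(xs) / n
    sigma_hat = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return (xbar - mu0) / (sigma_hat / math.sqrt(n))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
t_statistic(data, 5.0)  # 0.0 (sample mean equals mu0 here)
```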

Hypothesis Testing

Finding a test statistic when looking at the mean in this scenario is easy

However

In general, it can be quite challenging for more complicated scenarios

Recap: Testing process

  1. Determine parameters of interest
  2. Define null and alternative hypotheses
  3. Derive test statistic
  4. Translate test statistic to more interpretable measure (e.g. p-value)

Hypothesis Testing

Complications:

Many considerations need to be taken into account when testing

  1. Multiple testing and Type I error
  2. Hypothesis generating analyses vs hypothesis testing analyses
  3. P-value hacking and hypothesis generation from data

These (and many more) aspects will be covered in future sessions

Songs of the session

Mexican Grand Prix by Mogwai

Your Hand in Mine by Explosions in the Sky

Pacific Theme by Broken Social Scene

The Big Ship by Brian Eno
