Machine Learning Basics

Steps of process:

Creating algorithm: Learning from observed data
Testing algorithm: Assess quality of learning

What data to use for what process?

Need sufficient sample to learn patterns

Train and test on same data \(\rightarrow\) bias

What do we mean by bias?

Bias, generalizability, variance

Generalizability
- Algorithm performs well on independent samples from population
- Higher level: generalizes to different populations
Bias
- Average over/underperformance in independent sample vs. observed sample
- Only see performance in observed sample, biased results can be very dangerous!
Variance
- Algorithm’s accuracy and results differ based on examined data
- Not good! Limits certainty in observed results

Loss function

Need some measure of an algorithm’s “accuracy/error” for it to learn
Metric used called loss function
- Varies depending on method or model used
- Ex. residual sum of squares

\[ \begin{align} &Y_i = \text{outcome for person }i \\ &X_i = \text{predictor value for person }i \\ &\text{Model: }\hat{Y}_i=\beta_0+\beta_1X_i \\ & \\ &\text{Loss function: }L(\beta_0, \beta_1,Y,X)=\sum_{i=1}^{n}[Y_i-(\beta_0+\beta_1X_i)]^2\\ &\{\hat{\beta}_0, \hat{\beta}_1\} = \{\beta_0, \beta_1\} \text{ which minimizes } L(\beta_0, \beta_1,Y,X) \end{align} \]

Supervised learning

Learning when outcome of interest is observed
- Common example: predicting observed trait (diagnosis)
Can use observed trait in training data to learn patterns with predictor variables
Predictor variables often called features
Differs with unsupervised learning (trait not observed)

Models and tuning

Various models used to train supervised learning algorithm
- Examples
  1. Linear regression
  2. Logistic regression
  3. Penalized regression
  4. K-means
  5. Decision trees
  6. Support vector machines
  7. Neural networks (deep learning)
Each has set of tuning parameters which alter their specific structure
- Ex. K-means: \(k\) = # of neighbors in a neighborhood
- After tuning set, then model is fixed and training can be done
- How to determine tuning parameters?

Testing

Unbiased testing \(\rightarrow\) independent dataset needed
- Training results may overfit to training data \(\rightarrow\) not reliable metric for general performance
- Where to get independent dataset?
Testing data sources:
1. Completely separate sample (from same or different population)
2. Random partition into 2/3 parts
3. Random partition into \(k\) parts, repeat training and testing process
  - Called k-fold cross validation (k-fold CV)

Testing

Metrics
- Must choose metrics to quantify algorithm’s accuracy/utility
- Choice depends on many factors:
  1. Continuous or categorical outcome?
  2. Should certain values be weighted differently? (cost may vary by outcome)
  3. What is the purpose of the algorithm (ex. screening tool)?

Examples in research

Example with IBIS

Let’s go through examples using IBIS data
- Will go step-by-step
- Discuss additional complications that may arise
- Discuss potential areas of bias and inefficiency
- Illustrate various methods

Predicting ASD Diagnosis

In all cases, we will be attempting to predict 24 month Autism (ASD) diagnosis
- Will try using brain and behavioral measures at 12 months
Starting point: simple brain model
- Total brain volume and EACSF at 12 months

Predicting ASD Diagnosis

Step 1: Select model
- We consider K-Nearest Neighbor (KNN) and logistic regression
- KNN visualized below

Predicting ASD Diagnosis

Step 2: Select training and testing set strategy
- Generally, will need to use holdout method (often hungry for more data!)
- Single train-test split or CV? If CV how many folds?
- Considerations
  - Interpretability
  - Bias-variance trade-off

Drawbacks of holdout

Test set error can be highly dependent on split
- Thus highly variable
- Especially for small dataset or small group sizes
Only subset of data used to train algorithm
- May result in poorer algorithm
- \(\rightarrow\) may overestimate test error
Can we aggregate results over multiple test sets?

K-fold cross validation

Denote \(K\) folds by \(C_1, C_2, \ldots, C_K\), each with \(n_k\) observations
For a given fold \(l\):
1. Train algorithm on data in other folds: \(\{C_k\}\) s.t. \(k\neq l\)
2. Test by computing predicted values for data in \(C_l\) only
3. Repeat for each fold \(l=1, \ldots, K\), average error (ex. \(MSE_l\))
K fold CV error rate

\[ CV_{(K)}=\sum_{k=1}^{K}\frac{n_k}{n}MSE_k \]

where \(MSE_k=\sum_{i\in C_k}(y_i-\hat{y_i})^2/n_k\) where \(y_i\) is outcome and \(\hat{y_i}\) is predicted outcome from training on \(C_k\) only

\(K=n\) yields \(n-fold\) or leave-one out cross-validation
\(CV_{(K)}\) is accurate measure of generalized error rate for algorithm trained on whole sample

CV for classification

Same process as before, divide data into \(K\) partitions \(C_1, \ldots, C_K\)
Choose error/accuracy rate of interest
- E.g. sensitivity, specificity, classification error, etc.
For classification error
- Compute CV error

\[ CV_K=\sum_{k=1}^{K}\frac{n_k}{n}\text{Error}_k=\frac{1}{K}\sum_{k=1}^{K}\text{Error}_k \]

where \(\text{Error}_k=\sum_{i \in C_k}I(y_i \neq \hat{y_i})/n_k\)

Can we estimate the variability of this estimate?
- Commonly used estimate of standard error:

\[ \hat{\text{SE}}(\text{CV}_K)=\sqrt{\sum_{k=1}^{K}(\text{Error}_k-\overline{\text{Error}_k})^2/(K-1)} \]

While useful, not accurate (Why?)
Also can be used in continuous prediction CV (using MSE for example)

CV visually

Smoothed estimate of generalized error

Choosing K: bias-variance tradeoff

Recall: Holdout method uses only portion of data for training
- \(\rightarrow\) test/validation performance overestimate
- \(\rightarrow\) more folds \(\rightarrow\) more data in training folds \(\rightarrow\) better algorithm \(\rightarrow\) lower mean error
- \(\rightarrow\) LOOCV least biased estimate of test error
Compared to hold out, each training set in K-fold contains \(\frac{(k-1)n}{k}\) obs
- Generally more then in holdout \(\rightarrow\) less biased estimate

CV with tuning

Recall: Often when training a prediction algorithm, need to select tuning parameters
- Ex. # of neighbors with KNN
- Where is tuning implemented in CV?
Example: Consider set of 5000 predictors and 50 samples of data
- 1. Starting with the 5000 predictors and full data, first find 100 predictors with largest correlation with outcome
- 1. Then train and test an algorithm with only these 100 predictors, using logistic regression as an example
How do we estimate the algorithm’s test set performance without bias?
Can we only apply CV in step 2, after the predictors have been chosen using the full data?

NO

Why?
- This selection of parameters greatly impacts the algorithm’s performance and thus is a form of tuning
- Tuning needs to be done within the training framework, otherwise you are training and testing on the same dataset
- Thus, need to do step 1 within your CV scheme

Wrong way: visual

Only doing step 2 inside CV process

Right way: visual

Doing both steps 1 and 2 within CV process

Right way: visual

Make sure tuning and testing are done on separate datasets!

Concept of a validation set

Sometimes, three data paritions are created
- One for training, one for validation, and one for testing
- Validation dataset is specifically used for tuning
- Final fitting done on training+validation, evaluated on testing
- Separate training and tuning sets may \(\rightarrow\) more generalizable tuning parameters selected
- Alternative: do CV with single training/tuning set when tuning (nested CV)

Predicting ASD Diagnosis

Step 2: Select training and testing set strategy
- Let’s use KNN, tune within CV process using nested CV
Step 3: Select metric
- Evaluate on test folds in CV process, CV sensitivity, specificity, etc.
- Other metrics also available, these suffice for two-group case
Implementation in R: train and createFolds from caret package
- train can be used to tune and train many types of models
- postResample and confusionMatrix to see metrics

knn_fit <-
  train(RiskGroup~EACSF_V12+TBV_V12, data=ibis_brain_data_complete, 
        tuneLength=10, method="knn", preProcess = c("scale", "center")
        trControl=trainControl(method="cv"))

Implement in R: KNN

# Create folds
tt_indices <- createFolds(y=ibis_brain_data_complete$RiskGroup, k=10)

# Do train and test for each fold
for(i in 1:length(tt_indices)){
  # Create train and test sets
  train_data <- ibis_brain_data_complete[-tt_indices[[i]],]
  test_data <- ibis_brain_data_complete[-tt_indices[[i]],]
  
  knn_fit <-
  train(RiskGroup~EACSF_V12+TBV_V12, data=train_data, 
        tuneLength=10, method="knn", preProcess = c("scale", "center")
        trControl=trainControl(method="cv"))
}

Implement in R: KNN

# Create folds
tt_indices <- createFolds(y=ibis_brain_data_complete$RiskGroup, k=10)

# Do train and test for each fold
cv_results <- list()
for(i in 1:length(tt_indices)){
  # Create train and test sets
  train_data <- ibis_brain_data_complete[-tt_indices[[i]],]
  test_data <- ibis_brain_data_complete[-tt_indices[[i]],]
  
  knn_fit <-
  train(RiskGroup~EACSF_V12+TBV_V12, data=train_data, 
        tuneLength=10, method="knn", preProcess = c("scale", "center"),
        trControl=trainControl(method="cv"))
  
  # Save confusion matrix for each fold
  cv_results[[i]] <- confusionMatrix(data=predict(knn_fit, newdata=test_data),
                                     reference=test_data$RiskGroup,
                                     positive="HR-ASD")
}

# Print confusion matrix for a fold
cv_results[[1]]

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction HR-ASD HR-Neg
##     HR-ASD      4      3
##     HR-Neg     20    116
##                                           
##                Accuracy : 0.8392          
##                  95% CI : (0.7685, 0.8952)
##     No Information Rate : 0.8322          
##     P-Value [Acc > NIR] : 0.4652809       
##                                           
##                   Kappa : 0.1972          
##                                           
##  Mcnemar's Test P-Value : 0.0008492       
##                                           
##             Sensitivity : 0.16667         
##             Specificity : 0.97479         
##          Pos Pred Value : 0.57143         
##          Neg Pred Value : 0.85294         
##              Prevalence : 0.16783         
##          Detection Rate : 0.02797         
##    Detection Prevalence : 0.04895         
##       Balanced Accuracy : 0.57073         
##                                           
##        'Positive' Class : HR-ASD          
##

Unbalanced Data

Algorithm does terribly for ASD prediction but accuracy is high
- Good accuracy but poor sensitivity
- KNN (and most others) use general error as “loss function”
- \(\rightarrow\) if minority class is very small, see poor performance
Solutions
- Weighting: weight minority class more in error calculation
- Resampling: generate synthetic sample with balance between classes
  - Ex. SMOTE
  - Complications?

Weighting

By default, each obs gets same weight \(\rightarrow\) contribute to error the same amount
But, error in predicting one type may have higher cost then for other type
Can implement this using chosen weights
Ex. linear regression
- No weights

\[ \hat{y} = \underset{\beta}{\mathrm{argmin}}\sum_{i=1}^n(y_i-\beta x_i)^2 \]

Weights

\[ \begin{align} & \hat{y}_w = \underset{\beta}{\mathrm{argmin}}\sum_{i=1}^nw_i(y_i-\beta x_i)^2 \\ & \text{where } \{w_i\}_{i=1}^n \text{ are subject-specific weights} \end{align} \]

Common choice for weights: inverse probability weighting

\[ \begin{align} & w_i = 1/\hat{\pi}_k \text{ for } k=1,\ldots, K \\ & \text{where } \hat{\pi}_k=\sum_{i=1}^{n} I(i \text{ in group } K)/n \end{align} \]

Those in rare groups get high weight, cost for their error is high

Resampling

Caveats:
- SMOTE requires some numeric predictors to determine ‘’neighborhoods’’ to draw samples
  - Ad hoc: convert categorical to numeric
  - May be suboptimal (Why?)
- Can be difficult to select ROC-based threshold from training set after SMOTE
  - Sometimes threshold chosen from test set, but this is suboptimal!

Implement in R

With caret can implement in train function (weights argument)
Common SMOTE method: DmWR package
- However, package has been removed from CRAN
- Alternatives: smotefamily and bimba

Examples

Inverse probability weighting

# Create training set weights per person
train_weights <- ifelse(ibis_brain_data_complete$RiskGroup=="HR-ASD",
                        1/prop.table(table(ibis_brain_data_complete$RiskGroup))["HR-ASD"],
                        1/prop.table(table(ibis_brain_data_complete$RiskGroup))["HR-Neg"])

knn_fit <-
  train(RiskGroup~EACSF_V12+TBV_V12, data=ibis_brain_data_complete, 
        tuneLength=10, method="knn", preProcess = c("scale", "center")
        trControl=trainControl(method="cv"))

SMOTE

# Look at frequencies
table(ibis_brain_data_complete$RiskGroup)

## 
## HR-ASD HR-Neg 
##     27    132

# Run SMOTE
ibis_brain_data_complete_smote <- SMOTE(RiskGroup~EACSF_V12+TBV_V12, 
                                        data = ibis_brain_data_complete,
                                        perc.under = 150)
# Now look at frequencies
table(ibis_brain_data_complete_smote$RiskGroup)

## 
## HR-ASD HR-Neg 
##     81     81

knn_fit <-
  train(RiskGroup~EACSF_V12+TBV_V12, data=ibis_brain_data_complete_smote, 
        tuneLength=10, method="knn", preProcess = c("scale", "center"),
        trControl=trainControl(method="cv"))

Logistic regression

KNN can be adapted to compute probabilities of being in a class (see knnflex)
Logistic regression models these probabilities directly
Let’s try this method instead

# Create folds
tt_indices <- createFolds(y=ibis_brain_data_complete$RiskGroup, k=10)

# Do train and test for each fold
cv_results <- list()
for(i in 1:length(tt_indices)){
  # Create train and test sets
  train_data <- ibis_brain_data_complete[-tt_indices[[i]],]
  test_data <- ibis_brain_data_complete[-tt_indices[[i]],]
  
  # Run logistic regression model
  logreg_fit <-
  glm(RiskGroup~EACSF_V12+TBV_V12, data=train_data, 
      family=binomial)
  
  # Save confusion matrix for each fold, use 50% probability threshold
  test_preds <- factor(ifelse(predict(logreg_fit, newdata=test_data, type="response")>0.5, 
                       "HR-Neg", "HR-ASD"))
  cv_results[[i]] <- confusionMatrix(data=predict(logreg_fit, newdata=test_data, type="response"),
                                     reference=test_data$RiskGroup,
                                     positive="HR-ASD")
}
cv_results[[1]]

ROC Curve

We chose a 0.5 threshold, but this may be suboptimal
- Why?
Instead, can evaluate the performance at all possible thresholds
- Results in a set values which can be plotted: ROC curve
- Can summarize curve: area under the curve (AUC)
Done in R using pROC and ggroc

# Run logistic regression model
logreg_fit <-
  glm(RiskGroup~EACSF_V12+TBV_V12, data=ibis_brain_data_complete, 
      family=binomial)

# Get estimated probabilities of HR-ASD
est_probs <- 1-predict(logreg_fit, newdata=ibis_brain_data_complete, type="response")

# Compute ROC curve with AUC
roc_curve <- roc(response=ibis_brain_data_complete$RiskGroup,
                 predictor=est_probs,
                 levels=c("HR-Neg", "HR-ASD"))

# Plot ROC curve
ggroc(roc_curve)+
  labs(title = paste0("AUC = ", round(auc(roc_curve),2)))+
  theme_bw()

ROC Curve

Can use ggroc to customize plot more
Can use various metrics to determine “best” threshold
- Ex. maximum Youden’s index

# Return max Youden's index, with specificity and sensitivity
best_thres_data <- 
  data.frame(coords(roc_curve, x="best", best.method = c("youden", "closest.topleft")))

# View performance at "best threshold"
best_thres_data

##   threshold specificity sensitivity
## 1 0.2284176   0.8257576   0.5185185

data_add_line <-
  data.frame("sensitvity"=c(1-best_thres_data$specificity,
                            best_thres_data$sensitivity),
             "specificity"=c(best_thres_data$specificity,
                            best_thres_data$specificity))
  
ggroc(roc_curve)+
    geom_point(
    data = best_thres_data,
    mapping = aes(x=specificity, y=sensitivity), size=2, color="red")+
    geom_point(mapping=aes(x=best_thres_data$specificity, 
               y=1-best_thres_data$specificity), 
               size=2, color="red")+
    geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
                 color="darkgrey", linetype="dashed")+
    geom_text(data = best_thres_data,
              mapping=aes(x=specificity, y=0.90,
                          label=paste0("Threshold = ", round(threshold,2),
                                       "\nSensitivity = ", round(sensitivity,2),
                                       "\nSpecificity = ", round(specificity,2),
                                       "\nAUC = ", round(auc(roc_curve),2))))+
    geom_line(data=data_add_line,
              mapping=aes(x=specificity, y=sensitvity),
              linetype="dashed")+
  theme_classic()

Intro to Machine Learning: Supervised Learning

Introduction

Machine Learning Basics

Bias, generalizability, variance

Loss function

Supervised learning

Models and tuning

Testing

Testing

Examples in research

Example with IBIS

Predicting ASD Diagnosis

Predicting ASD Diagnosis

Predicting ASD Diagnosis

Drawbacks of holdout

K-fold cross validation

CV for classification

CV visually

Choosing K: bias-variance tradeoff

CV with tuning

NO

Wrong way: visual

Right way: visual

Right way: visual

Concept of a validation set

Predicting ASD Diagnosis

Implement in R: KNN

Implement in R: KNN

Unbalanced Data

Weighting

Resampling

Implement in R

Examples

Logistic regression

ROC Curve

ROC Curve