Model selection and overfitting

Lecture 18

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Spring 2024

2024-03-26

Warm up

While you wait for class to begin…

Any questions from prepare materials?

Announcements

  • Read the feedback (including written feedback) on labs

  • Project repos to be released back to you on Friday

  • Status update emails in your inbox

Questions from last time

Can you review the augment() function?

The augment() function “augments” a data set (the original data or new observations) with information from a fitted model.

  • It calculates predictions, \(\hat{y}\)s, under the given model.

  • It also calculates the residuals, \(e = y - \hat{y}\), for each observation whose outcome is present in the data.

  • It returns a data frame of the input data augmented with predicted values and residuals.

augment() - Setup

library(tidyverse)
library(tidymodels)

mtcars_fit <- linear_reg() |>
  fit(mpg ~ disp + cyl, data = mtcars)

tidy(mtcars_fit)
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  34.7       2.55       13.6  4.02e-14
2 disp         -0.0206    0.0103     -2.01 5.42e- 2
3 cyl          -1.59      0.712      -2.23 3.37e- 2

augment() - Augment the original data

augment(mtcars_fit, new_data = mtcars)
# A tibble: 32 × 13
   .pred .resid   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21.8 -0.844  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21.8 -0.844  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  26.1 -3.29   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  19.8  1.57   21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  14.6  4.15   18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  20.5 -2.41   18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.6 -0.253  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  25.3 -0.892  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  25.4 -2.61   22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  21.7 -2.49   19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

augment() - Augment new data

new_cars <- tibble(
  mpg = c(23, 22), disp = c(150, 350), cyl = c(4, 6)
)

new_cars
# A tibble: 2 × 3
    mpg  disp   cyl
  <dbl> <dbl> <dbl>
1    23   150     4
2    22   350     6

augment(mtcars_fit, new_data = new_cars)
# A tibble: 2 × 5
  .pred .resid   mpg  disp   cyl
  <dbl>  <dbl> <dbl> <dbl> <dbl>
1  25.2  -2.22    23   150     4
2  17.9   4.07    22   350     6
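
To see where the .pred and .resid columns come from, the same values can be rebuilt by hand. This is a minimal sketch, assuming the mtcars_fit and new_cars objects defined above (recreated here so the block runs on its own); the name manual is made up for illustration.

```r
library(tidyverse)
library(tidymodels)

# Same model and new observations as on the previous slides
mtcars_fit <- linear_reg() |>
  fit(mpg ~ disp + cyl, data = mtcars)

new_cars <- tibble(
  mpg = c(23, 22), disp = c(150, 350), cyl = c(4, 6)
)

# predict() returns a tibble with a single .pred column;
# bind it to the data and compute residual = observed - predicted
manual <- new_cars |>
  bind_cols(predict(mtcars_fit, new_data = new_cars)) |>
  mutate(.resid = mpg - .pred)

manual
```

The .pred and .resid columns in manual should match what augment() returns; augment() does both steps in one call.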

What is the difference between \(R^2\) and adjusted \(R^2\)?

  • \(R^2\):

    • Proportion of variability in the outcome explained by the model.

    • Useful for quantifying the fit of a given model.

  • Adjusted \(R^2\):

    • Proportion of variability in the outcome explained by the model, with a penalty added for the number of predictors in the model.

    • Useful for comparing models with different numbers of predictors.
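
Both quantities can be read off with glance(). The sketch below compares a one-predictor and a two-predictor model on mtcars and then recomputes adjusted \(R^2\) by hand using the usual formula \(1 - (1 - R^2)\frac{n - 1}{n - k - 1}\), where \(n\) is the number of observations and \(k\) the number of predictors. The object names (fit_small, fit_large, adj_r2) are made up for illustration.

```r
library(tidyverse)
library(tidymodels)

# Two nested models: one predictor vs. two predictors
fit_small <- linear_reg() |>
  fit(mpg ~ disp, data = mtcars)

fit_large <- linear_reg() |>
  fit(mpg ~ disp + cyl, data = mtcars)

# R-squared and adjusted R-squared for each model
glance(fit_small) |> select(r.squared, adj.r.squared)
glance(fit_large) |> select(r.squared, adj.r.squared)

# Adjusted R-squared by hand for the larger model
n <- nrow(mtcars)  # number of observations
k <- 2             # number of predictors (disp, cyl)
r2 <- glance(fit_large)$r.squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
adj_r2
```

\(R^2\) can never decrease when a predictor is added, which is why it is not a fair basis for comparing models of different sizes; adjusted \(R^2\) increases only if the added predictor improves the fit enough to offset the penalty.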

From last time

Application exercise: ae-13-modeling-loans

  • Go to your project called ae.
  • Continue working on ae-13-modeling-loans.qmd.

Goals:

  • Review prediction and interpretation of model results
  • Review main and interaction effects models
  • Discuss model selection further

Recap: ae-13-modeling-loans

  • What is the practical difference between a model with parallel and non-parallel lines?

  • What is the definition of R-squared?

  • Why do we choose models based on adjusted R-squared and not R-squared?
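
As a reference for the first and last questions, here is a hedged sketch of a main effects (parallel lines) model next to an interaction effects (non-parallel lines) model. It uses mtcars rather than the loans data from the application exercise, and the names mtcars_am, main_fit, and int_fit are made up for illustration.

```r
library(tidyverse)
library(tidymodels)

# Recode transmission type as a categorical predictor (0 = automatic, 1 = manual)
mtcars_am <- mtcars |>
  mutate(am = if_else(am == 1, "manual", "automatic"))

# Main effects: same slope for wt in both groups (parallel lines)
main_fit <- linear_reg() |>
  fit(mpg ~ wt + am, data = mtcars_am)

# Interaction effects: slope for wt is allowed to differ by group (non-parallel lines)
int_fit <- linear_reg() |>
  fit(mpg ~ wt * am, data = mtcars_am)

tidy(main_fit)
tidy(int_fit)

# Compare the two models on adjusted R-squared
glance(main_fit)$adj.r.squared
glance(int_fit)$adj.r.squared
```

The interaction model has one extra term (wt:ammanual), which is the difference in the wt slope between the two transmission types.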