Lecture 19
Duke University
STA 199 - Spring 2024
2024-03-28
Any questions from prepare materials?
Duke University is a community dedicated to scholarship, leadership, and service and to the principles of honesty, fairness, respect, and accountability. Citizens of this community commit to reflect upon and uphold these principles in all academic and nonacademic endeavors, and to protect and promote a culture of integrity.
To uphold the Duke Community Standard:
https://trinity.duke.edu/undergraduate/academic-policies/community-standard-student-conduct
“Duke University has high expectations for students’ scholarship and conduct. In accepting admission, students indicate their willingness to subscribe to and be governed by the rules and regulations of the university, which flow from the Duke Community Standard. These policies reflect the Duke Community Standard’s fundamental values—honesty, fairness, respect, and accountability. Undergraduates acknowledge the right of the university to take disciplinary action, including suspension or expulsion, for failure to abide by the regulations or for other conduct adjudged unsatisfactory or detrimental to the university community. Students and groups may be held accountable for any violation of university policy that may or may not be included in this guide, whether on or off campus.”
https://trinity.duke.edu/undergraduate/academic-policies/community-standard-student-conduct
Be a Good Human
Use electronic devices for things related to coursework only and in a way that does not distract your classmates
No videos on cell phones
No phone calls!
Take off your headphones
Keep chatter to “your turn” portions or limited to clarification questions
If you have a guest in class, make sure they are aware of the DCS and take responsibility for their behaviour
Peer eval 2 is due Sunday night; results will be published on Monday
Lab 6 is due on Monday:
Render your document. If your code is running off the page so we can’t see your entire code, we will not evaluate any of it. The question will automatically receive a 0. This is something you can and should verify before you turn in your work.
If you’re using functions that are not introduced in the course materials, you must cite your sources. Failure to do so is a violation of the Duke Community Standard and will be treated as such.
Lab 7 will cover material from this week and next week. Start working through the prepare materials between now and Monday.
ae-13-modeling-loans
What is the practical difference between a model with parallel and non-parallel lines?
What is the definition of R-squared?
Why do we choose models based on adjusted R-squared and not R-squared?
Predicting interest rate from credit utilization and homeownership
# A tibble: 4 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)              9.93      0.140    70.8   0
2 credit_util              5.34      0.207    25.7   2.20e-141
3 homeownershipMortgage    0.696     0.121     5.76  8.71e-  9
4 homeownershipOwn         0.128     0.155     0.827 4.08e-  1
All else held constant, for each additional percentage point of credit utilization, interest rate is predicted to be higher, on average, by 0.0534 percentage points. (Credit utilization is recorded as a proportion, so one percentage point is a change of 0.01, and 5.34 × 0.01 = 0.0534.)
All else held constant, the model predicts that loan applicants who have a mortgage for their home receive an interest rate that is 0.696 percentage points higher, on average, than those who rent their home.
All else held constant, the model predicts that loan applicants who own their home receive an interest rate that is 0.128 percentage points higher, on average, than those who rent their home.
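The three interpretations above can be checked numerically. The following is a minimal Python sketch (the course itself works in R), assuming credit utilization is recorded as a proportion between 0 and 1 and that renting is the baseline homeownership level. The function name `predict_interest_rate` is hypothetical; the coefficients are copied from the model output.

```python
# Coefficients copied from the model output above (main-effects model).
INTERCEPT = 9.93
B_CREDIT_UTIL = 5.34
# Assumption: "Rent" is the baseline level, so its offset is 0.
B_HOMEOWNERSHIP = {"Rent": 0.0, "Mortgage": 0.696, "Own": 0.128}


def predict_interest_rate(credit_util, homeownership):
    """Predicted interest rate (in %) for a credit utilization proportion."""
    return INTERCEPT + B_CREDIT_UTIL * credit_util + B_HOMEOWNERSHIP[homeownership]


# One percentage point more credit utilization (0.50 -> 0.51), same homeownership.
slope_per_point = predict_interest_rate(0.51, "Rent") - predict_interest_rate(0.50, "Rent")

# Same credit utilization, different homeownership.
mortgage_gap = predict_interest_rate(0.50, "Mortgage") - predict_interest_rate(0.50, "Rent")
own_gap = predict_interest_rate(0.50, "Own") - predict_interest_rate(0.50, "Rent")

print(round(slope_per_point, 4), round(mortgage_gap, 3), round(own_gap, 3))
```

The homeownership gaps are the same at every value of credit utilization, which is what "parallel lines" means for this model.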
# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     2.39     0.00512     468.  0
2 credit_checks   0.0236   0.00166      14.2 2.39e-45
\[ \widehat{log(interest~rate)} = 2.39 + 0.0236 \times credit~checks \]
For each additional credit check, the log of interest rate is predicted to be higher, on average, by 0.0236. (This is a difference on the log scale, not a percentage.)
\[ log(interest~rate_{x+1}) - log(interest~rate_{x}) = 0.0236 \]
\[ log(\frac{interest~rate_{x+1}}{interest~rate_{x}}) = 0.0236 \]
\[ e^{log(\frac{interest~rate_{x+1}}{interest~rate_{x}})} = e^{0.0236} \]
\[ \frac{interest~rate_{x+1}}{interest~rate_{x}} = 1.024 \]
For each additional credit check, interest rate is predicted to be higher, on average, by a factor of 1.024.
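The back-transformation above can be reproduced in a few lines. This is a minimal Python sketch (the course itself works in R), assuming `log` is the natural logarithm, as the use of `e` in the derivation implies. The function name `predict_interest_rate` is hypothetical; the coefficients are copied from the model output.

```python
import math

INTERCEPT = 2.39   # intercept on the log scale, from the output above
SLOPE = 0.0236     # credit_checks coefficient on the log scale

# Multiplicative change in interest rate per additional credit check.
factor = math.exp(SLOPE)
# The same change expressed as a percentage increase.
percent_change = (factor - 1) * 100


def predict_interest_rate(credit_checks):
    """Back-transform the log-scale prediction to the interest rate scale."""
    return math.exp(INTERCEPT + SLOPE * credit_checks)


# The ratio of predictions one credit check apart equals the factor,
# regardless of the starting number of credit checks.
ratio = predict_interest_rate(3) / predict_interest_rate(2)

print(round(factor, 3), round(percent_change, 1), round(ratio, 3))
```

A factor of about 1.024 is the same as saying the predicted interest rate is about 2.4% higher for each additional credit check.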
Similar to linear regression… but
Modeling tool when our response is categorical
Variables with binary outcomes follow the Bernoulli distribution:
\(y_i \sim Bern(p)\)
\(p\): Probability of success
\(1-p\): Probability of failure
We can’t model \(y\) directly, so instead we model \(p\)
\[ p_i = \beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon \]
But remember that \(p\) must be between 0 and 1
We need a link function that transforms the linear model to have an appropriate range
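To see why a link function is needed, consider what happens when a straight line is used for \(p\) directly. The following Python sketch uses hypothetical coefficients (`b0`, `b1`) chosen only for illustration; they are not estimated from any data in this lecture.

```python
# Hypothetical coefficients for illustration only; not estimated from data.
b0, b1 = 0.2, 0.1

xs = [-5, 0, 5, 10]
# "Probabilities" produced by modeling p with a straight line.
linear_p = [b0 + b1 * x for x in xs]

# Collect the predictor values whose fitted "probability" leaves [0, 1].
invalid = [(x, p) for x, p in zip(xs, linear_p) if not 0 <= p <= 1]
print(invalid)
```

With these values, the line gives a negative "probability" at x = -5 and one above 1 at x = 10, neither of which is a valid probability.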
The logit function takes values between 0 and 1 (probabilities) and maps them to values in the range negative infinity to positive infinity:
\[ logit(p) = log \bigg( \frac{p}{1 - p} \bigg) \]
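As a quick check of the formula above, here is a minimal Python sketch of the logit function (the course itself works in R). The function name `logit` and the input guard are illustrative choices, not part of the course materials.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1 - p))


# Probabilities near 0 map to large negative values, 0.5 maps to 0,
# and probabilities near 1 map to large positive values.
for p in (0.001, 0.25, 0.5, 0.75, 0.999):
    print(p, round(logit(p), 3))
```

The output is symmetric around 0: the logit of 0.25 is the negative of the logit of 0.75.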
Recall, the goal is to take values between -\(\infty\) and \(\infty\) and map them to probabilities.
We need the opposite of the link function… or the inverse
Taking the inverse of the logit function will map arbitrary real values back to the range [0, 1]
\[ logit(p_i) = log \bigg( \frac{p_i}{1 - p_i} \bigg) = \beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon \]
\[ p_i = \frac{e^{\beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon}}{1 + e^{\beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon}} \]
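The inverse transformation can be sketched the same way. This minimal Python example (the course itself works in R) evaluates the formula for a single predictor, using only the linear part of the model. The coefficients `b0` and `b1` are hypothetical and chosen only for illustration, and the function names are assumptions of this sketch.

```python
import math


def inverse_logit(eta):
    """Map a linear predictor (any real number) to a probability.

    Fine for moderate values of eta; math.exp overflows for very large eta.
    """
    return math.exp(eta) / (1 + math.exp(eta))


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


# Hypothetical coefficients for illustration only; not fitted to any data.
b0, b1 = -1.0, 0.5

for x in (-10, 0, 2, 10):
    eta = b0 + b1 * x        # linear predictor on the log-odds scale
    p = inverse_logit(eta)   # back-transformed to a probability
    print(x, eta, round(p, 4))
```

Whatever value the linear predictor takes, the result stays between 0 and 1, and applying `logit` to it recovers the linear predictor.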
Generalized linear models allow us to fit models to predict non-continuous outcomes
Predicting binary outcomes requires modeling the log-odds of success, where p = probability of success