 ## R Programming Homework Solution on Adjusted R-squared, AIC, LRT, and LOOCV

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(cvTools)
library(glmnet)
library(sandwich)
library(stargazer)
library(lmtest)
library(alr4)
library(sciplot)
```

## Model Comparison (5 points)

Use the Puromycin data set built into R. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/Puromycin). For this problem, you will run through various methods of model comparison - Adjusted R-squared, AIC, LRT, and LOOCV.

1. Create two models that predict rate. The first model will use conc as the predictor. The second model will use conc and state as predictors. Do not include any interactions or other transformations. Run each of your models through the summary() function.

```{r}
data(Puromycin)
attach(Puromycin)
# Model 1: concentration only; Model 2: concentration plus state
m1a <- lm(rate ~ conc)
m1b <- lm(rate ~ conc + state)
stargazer(m1a, m1b, type = "text")
```

According to adjusted R-squared, the second model is preferred: it explains 69% of the variance in the dependent variable, compared to 63% for the first model.

2. Use the AIC() function to calculate the AIC for each model.

```{r}
AIC(m1a)
AIC(m1b)
```

The second model is also preferred according to AIC (lower is better): its AIC of 221 is below the first model's 224.

3. Run a likelihood ratio test between the two models.

```{r}
lrtest(m1a, m1b)
```

The likelihood ratio test is significant (p = 0.03 < 0.05), so adding state significantly improves the fit and the second model is preferred.

4. Use the cvFit function within the cvTools package to run LOOCV for both models.

```{r}
# LOOCV is K-fold cross-validation with K equal to the number of observations
n <- nrow(Puromycin)
cvFit(m1a, y = rate, K = n, data = Puromycin)
cvFit(m1b, y = rate, K = n, data = Puromycin)
```

The estimated prediction error is larger for the first model; therefore, the second model is preferred according to LOOCV.

***

## Lasso (5 points)

Use the swiss data set built into R. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/swiss). For this problem, you will use Lasso regression to predict Fertility based on all other predictors. You will need to utilize the glmnet() and cv.glmnet() functions within the glmnet package. See the Lecture Notes for details.

1. Create the design matrix using the model.matrix() function. This is used as one of the arguments for glmnet() when running Lasso.

```{r}
data(swiss)
attach(swiss)
# Drop the intercept column; glmnet adds its own intercept
x <- model.matrix(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)[, -1]
```

2. Run Lasso using different values of $\lambda$ to determine which variable is the first to have its estimated slope coefficient go to 0. This may take multiple iterations. Remember to set the value of $\alpha$ to 1, as we did in the Lecture Notes.
```{r}
# Fit the Lasso over a decreasing sequence of lambda values (glmnet prefers
# a sequence to a single value) and collect the coefficients at each one
lambdas <- c(10, 8, 6, 2, 1, 0.8, 0.6, 0.4, 0.2, 0)
lasso.path <- glmnet(x, y = Fertility, alpha = 1, lambda = lambdas)
coeffsx <- as.matrix(coef(lasso.path))
colnames(coeffsx) <- paste0("Lambda", lambdas)
print(round(coeffsx, 3))
```

Agriculture seems to be the first variable to have its estimated slope coefficient go to 0.

3. At what (approximate) value of $\lambda$ do **all** of the estimated slope coefficients go to 0? An integer value is fine here!

```{r}
print(coeffsx)
```

$\lambda = 10$.

4. Use the cv.glmnet() function to determine the optimal value for $\lambda$.

```{r}
# Find the value of lambda that minimizes the cross-validation error
cv.lasso <- cv.glmnet(x, y = Fertility, alpha = 1)
plot(cv.lasso)
cv.lasso$lambda.min
```



What value of $\lambda$ is returned? Note: This will vary from trial-to-trial (that's okay!).

$\lambda \approx 0.023$.
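
As an aside (not required by the assignment), cv.glmnet() also reports lambda.1se, the largest $\lambda$ whose cross-validated error is within one standard error of the minimum; it typically selects a sparser model. A quick sketch of checking it:

```{r}
# One-standard-error rule: a more conservative choice of lambda
cv.lasso$lambda.1se
coef(cv.lasso, s = "lambda.1se")
```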

5. Run lasso regression again using your optimal $\lambda$ from Part 4.
```{r}
# Refit the Lasso at the cross-validated optimal lambda
lasso2 <- glmnet(x, y = Fertility, alpha = 1, lambda = cv.lasso$lambda.min)
coef(lasso2)
```

What variables are in the model?

All variables are still in the model.

***

## Variances (12 points)

Use the stopping data set from the alr4 package. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/alr4/versions/1.0.5/topics/stopping). For this problem, you will run through various methods of dealing with nonconstant variance - using OLS, WLS (assuming known weights), WLS (with a sandwich estimator), and bootstrap.

1. Create a scatterplot of Distance versus Speed (Y vs X).

```{r}
data(stopping, package = "alr4")
attach(stopping)
# car::scatterplot takes x first, then y
scatterplot(Speed, Distance)
```

2. Briefly explain why this graph supports fitting a quadratic regression model.

The points in the scatterplot follow a quadratic curve more closely than a straight line.

3. Fit a quadratic model with constant variance (OLS). Remember to use the I() function to inhibit the squared term.

```{r}
quadraticM <- lm(Distance ~ Speed + I(Speed^2))
stargazer(quadraticM, type = "text", report = "vcs*")
```

4. What are the standard errors for the two terms?

0.556 for the first-order term and 0.013 for the quadratic one.

5. Refit the quadratic model, but use WLS with the assumption that $Var(Distance|Speed) = Speed \cdot \sigma^2$.
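Since $Var(Distance|Speed) = Speed \cdot \sigma^2$, each observation should receive weight proportional to $1/Speed$. Below is a minimal sketch of one way to do this via the weights argument of lm() (the object name wlsM is my own):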
```{r, echo=FALSE, warning=FALSE}
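# Sketch: WLS with weights proportional to 1/Speed, since
# Var(Distance|Speed) = Speed * sigma^2
wlsM <- lm(Distance ~ Speed + I(Speed^2), weights = 1/Speed, data = stopping)
stargazer(wlsM, type = "text", report = "vcs*")
```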