Regression Modeling with Crime Data

ISYE 6501 Homework Week 8, 2023-2024: complete solution, Georgia Institute of Technology

Question 11.1

Using the crime data set uscrime.txt from Questions 8.2, 9.1, and 10.1, build a regression model using:

1. Stepwise regression

2. Lasso

3. Elastic net

For Parts 2 and 3, remember to scale the data first; otherwise, the regression coefficients will be on different scales and the constraint won't have the desired effect.

For Parts 2 and 3, use the glmnet function in R.

Notes on R:

- For the elastic net model, what we called λ in the videos, glmnet calls "alpha"; you can get a range of results by varying alpha from 1 (lasso) to 0 (ridge regression) [and, of course, other values of alpha in between].

- In a function call like glmnet(x,y,family="mgaussian",alpha=1) the predictors x need to be in R's matrix format, rather than data frame format. You can convert a data frame to a matrix using as.matrix, for example x <- as.matrix(data[,1:(n-1)])

- Rather than specifying a value of T, glmnet returns models for a variety of values of T.
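The notes above can be sketched as follows. This is a minimal illustration on simulated data, not the crime data set: the data frame `dat`, its column names, and the three alpha values are assumptions made for the example. It uses family = "gaussian", which is the usual choice for a single numeric response, rather than the "mgaussian" shown in the note.

```r
library(glmnet)

set.seed(42)
n <- 60
p <- 5

# Simulated predictors on deliberately different scales (hypothetical data)
dat <- data.frame(matrix(rnorm(n * p), n, p) %*% diag(c(1, 10, 100, 0.1, 1)))
dat$y <- 2 * dat$X1 + 0.3 * dat$X2 + rnorm(n)

# glmnet needs a numeric matrix, not a data frame; scale() returns a matrix
x <- scale(as.matrix(dat[, 1:p]))
y <- dat$y

# alpha = 0 is ridge, alpha = 1 is lasso, values in between are elastic net
alphas <- c(0, 0.5, 1)
fits <- lapply(alphas, function(a) glmnet(x, y, family = "gaussian", alpha = a))

# Each fit holds a whole path of models, one per penalty value in fit$lambda,
# which is why no single constraint value T has to be supplied
sapply(fits, function(f) length(f$lambda))

# Number of nonzero coefficients along the lasso path
fits[[3]]$df
```

Note that glmnet also standardizes the predictors internally by default (standardize = TRUE), so the explicit scale() call here mainly makes the step visible.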

Answer:

1. Stepwise regression:

I scaled the data and ran stepwise regression for variable selection, which returned 8 variables: M.F, U1, Prob, U2, M, Ed, Ineq, and Po1. A regression on these 8 variables gives an adjusted R-squared of 0.74. Because the data set has only 47 points, I used 47-fold (leave-one-out) cross-validation on the model, which gives an R-squared of 0.67.

I then tried to simplify the model to 7 or 6 variables by dropping insignificant variables such as M.F and U1. The revised model performs about the same: its adjusted R-squared is 0.73 and its cross-validated R-squared is 0.67.
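As a rough illustration of the selection step described above, here is a minimal sketch using base R's step() on simulated data. This is a swapped-in technique: the solution itself uses caret's "lmStepAIC" method, and the data frame `sim` and its variables are made up for the example.

```r
set.seed(1)
n <- 100

# Hypothetical data: y depends on x1 and x2 only; x3 and x4 are pure noise
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

# Start from the full model and remove terms while doing so lowers the AIC
full <- lm(y ~ ., data = sim)
sel <- step(full, direction = "backward", trace = 0)

# Variables that survive the backward search
kept <- attr(terms(sel), "term.labels")
kept

# In-sample quality of the selected model
summary(sel)$adj.r.squared
```

The adjusted R-squared printed here is computed on the same data used for selection, so it tends to be optimistic; that is why the answer also reports a cross-validated R-squared.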

#------------------------------------- Stepwise Regression -------------------------------------

# load the data (assumes uscrime.txt is in the working directory and has a header row)
data <- read.table("uscrime.txt", header = TRUE)

# scale every predictor except the binary variable So (column 2); the response Crime (column 16) stays unscaled
scaledData <- as.data.frame(scale(data[, c(1, 3:15)]))
scaledData <- cbind(data[, 2], scaledData, data[, 16]) # add columns So and Crime back in
colnames(scaledData)[1] <- "So"
colnames(scaledData)[16] <- "Crime"

library(caret)

# 5 repeats of 5-fold CV: the 5-fold CV error is averaged over 5 repetitions
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

# backward stepwise selection by AIC; the lower model only includes an intercept
Stepwise <- train(Crime ~ ., data = scaledData, method = "lmStepAIC",
                  scope = list(lower = Crime ~ 1, upper = Crime ~ .),
                  direction = "backward", trControl = ctrl)

# using 8 variables in the model

Step_model = lm(Crime ~ M.F+U1+Prob+U2+M+Ed+Ineq+Po1, data = scaledData)

summary(Step_model)

# 47-fold (leave-one-out) cross-validation
SST <- sum((data$Crime - mean(data$Crime))^2)
SSE <- 0
for(i in 1:nrow(scaledData)) {
  # fit on all rows except i, then predict the held-out row i
  Step_model_i = lm(Crime ~ M.F+U1+Prob+U2+M+Ed+Ineq+Po1, data = scaledData[-i,])
  pred_i <- predict(Step_model_i,newdata=scaledData[i,])
  SSE <- SSE + ((pred_i - data[i,16])^2)
}
R2_mod <- 1 - SSE/SST
R2_mod

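# using 7 variables in the model
# NOTE: the formula below still lists all 8 variables, as in the source; drop M.F and/or U1
# (the terms named above as insignificant) to fit the reduced models
SSE <- 0 # reset the accumulator before the second cross-validation run
for(i in 1:nrow(scaledData)) {
  Step_model_i = lm(Crime ~ M.F+U1+Prob+U2+M+Ed+Ineq+Po1, data = scaledData[-i,])
  pred_i <- predict(Step_model_i,newdata=scaledData[i,])
  SSE <- SSE + ((pred_i - data[i,16])^2)
}
R2_mod <- 1 - SSE/SST
R2_mod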
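The leave-one-out loop is repeated for each candidate model, so it can be wrapped in a small helper. The sketch below is an illustration under assumptions: the function name loocv_r2 and the simulated data frame `sim` are made up for the example and do not come from the solution. It also cross-checks the loop against the PRESS identity for ordinary least squares, where the leave-one-out residual equals the ordinary residual divided by (1 - leverage).

```r
# Hypothetical helper: leave-one-out cross-validated R-squared for an lm formula
loocv_r2 <- function(formula, data) {
  response <- all.vars(formula)[1]
  y <- data[[response]]
  sse <- 0
  for (i in seq_len(nrow(data))) {
    fit_i <- lm(formula, data = data[-i, , drop = FALSE])        # fit without row i
    pred_i <- predict(fit_i, newdata = data[i, , drop = FALSE])  # predict row i
    sse <- sse + (y[i] - pred_i)^2
  }
  sst <- sum((y - mean(y))^2)
  unname(1 - sse / sst)
}

set.seed(2)
sim <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
sim$y <- 1 + 2 * sim$x1 + rnorm(40)

r2_loop <- loocv_r2(y ~ x1 + x2, sim)

# Cross-check: PRESS residuals give the same answer from a single fit
fit <- lm(y ~ x1 + x2, data = sim)
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
r2_press <- 1 - press / sum((sim$y - mean(sim$y))^2)

all.equal(r2_loop, r2_press)
```

With such a helper, comparing the 8-, 7-, and 6-variable models only requires changing the formula passed in, which avoids copy-paste slips between runs.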

2. Lasso:

I ran Lasso for variable selection. In the coefficient output, Lasso kept 10 variables (nonzero coefficients), and I used them in a regression model. The model produces a slightly lower adjusted R-squared of 0.71, and its cross-validation also gives a lower R-squared of 0.58.

In the coefficient output of the 10-variable regression model, the variables So, LF, M.F, and NW appear insignificant. Once they are removed, the remaining 6 variables are the same ones that the stepwise analysis suggests.
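The select-then-refit workflow described above can be sketched as follows. This is a minimal example on simulated data: the matrix `x`, the variable names v1 to v6, and the seed are assumptions for illustration. It pulls out the variables with nonzero lasso coefficients at lambda.min and refits an ordinary regression on them.

```r
library(glmnet)

set.seed(7)
n <- 80

# Hypothetical predictors: only v1 and v2 truly affect the response
x <- matrix(rnorm(n * 6), n, 6, dimnames = list(NULL, paste0("v", 1:6)))
y <- 4 * x[, "v1"] - 3 * x[, "v2"] + rnorm(n)

# Cross-validated lasso (alpha = 1)
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 5, type.measure = "mse", family = "gaussian")

# Coefficients at the penalty with the lowest CV error; convert the sparse result to a dense matrix
b <- as.matrix(coef(cv_fit, s = "lambda.min"))

# Names of predictors with nonzero coefficients, excluding the intercept
kept <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
kept

# Refit ordinary least squares on the selected predictors only
refit <- lm(y ~ ., data = data.frame(y = y, x[, kept, drop = FALSE]))
summary(refit)
```

A nonzero lasso coefficient is not a significance test, so the p-values in the refit summary are what show which of the selected variables still look weak.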

#---------------------------- Lasso Regression ----------------------------

# install.packages('glmnet') # run once if glmnet is not installed
library(glmnet)

# running lasso for variable selection
lasso <- cv.glmnet(x = as.matrix(scaledData[,-16]), y = as.matrix(scaledData$Crime), alpha = 1,
                   nfolds = 5, type.measure = "mse", family = "gaussian")

# coefficient output of variables by lasso (at the lambda with the lowest CV error)
coef(lasso, s = lasso$lambda.min)

# using 10 variables in the model
lasso_model = lm(Crime ~ So+M+Ed+Po1+LF+M.F+NW+U2+Ineq+Prob, data = scaledData)
summary(lasso_model)

for (i in 0:10) {