Regression Modeling with Crime Data

A detailed analysis of regression modeling techniques applied to a crime data set, prepared for a Georgia Institute of Technology course (ISYE 6501). The analysis uses stepwise regression, lasso, and elastic net for variable selection and model building. It walks through scaling the data, performing cross-validation, and evaluating model quality with adjusted R-squared and cross-validated R-squared. The results show that variable selection plays a crucial role in developing a parsimonious and explanatory regression model: stepwise regression identifies a set of 8 variables that achieves the best model performance. The document also discusses the tradeoff between model complexity and explanatory power, and the importance of balancing simplicity against predictive accuracy.

2023/2024

Available from 08/04/2024

ISYE 6501 Homework Week 8 2023-2024 complete
solution Georgia Institute Of Technology
Question 11.1
Using the crime data set uscrime.txt from Questions 8.2, 9.1, and 10.1, build a regression model using:
1. Stepwise regression
2. Lasso
3. Elastic net
For Parts 2 and 3, remember to scale the data first; otherwise, the regression coefficients will be on
different scales and the constraint won’t have the desired effect.
For Parts 2 and 3, use the glmnet function in R.
Notes on R:
• For the elastic net model, what we called λ in the videos, glmnet calls “alpha”; you can get a
range of results by varying alpha from 1 (lasso) to 0 (ridge regression) [and, of course, other
values of alpha in between].
• In a function call like glmnet(x,y,family="mgaussian",alpha=1) the predictors x need to be in R’s
matrix format, rather than data frame format. You can convert a data frame to a matrix using
as.matrix; for example, x <- as.matrix(data[,1:n-1])
• Rather than specifying a value of T, glmnet returns models for a variety of values of T.
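The preparation described in these notes can be sketched in base R as follows. This is only an illustration on a small made-up data frame (the column names and values are hypothetical, not uscrime.txt). It standardizes the predictors with scale(), converts them with as.matrix, and shows that 1:n-1 is parsed as (1:n)-1, which selects the intended columns only because R ignores index 0.

```r
# Hypothetical toy data: two predictors on very different scales plus a response.
toy <- data.frame(
  income = c(32000, 45000, 51000, 38000, 60000),
  rate   = c(0.02, 0.05, 0.03, 0.04, 0.01),
  y      = c(10, 14, 15, 12, 18)
)
n <- ncol(toy)

# Operator precedence: ":" binds tighter than "-", so 1:n-1 is (1:n) - 1.
idx_loose  <- 1:n-1      # 0 1 2
idx_strict <- 1:(n-1)    # 1 2

# Index 0 is silently dropped, so both select the two predictor columns.
x_loose  <- as.matrix(toy[, idx_loose])
x_strict <- as.matrix(toy[, idx_strict])

# Standardize each predictor to mean 0 and standard deviation 1.
x_scaled <- scale(x_strict)
round(colMeans(x_scaled), 10)
apply(x_scaled, 2, sd)
```

Writing 1:(n-1) makes the intent explicit and does not rely on the zero index being ignored.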
Answer:
1. Stepwise regression:
I scaled the data and performed stepwise regression for variable selection; it came back with 8
variables: M.F, U1, Prob, U2, M, Ed, Ineq, and Po1. Running a regression with these 8 variables, I got
an adjusted R-squared of 0.74 for model quality. Because there are only 47 data points, I used 47-fold
cross-validation (leave-one-out) on the model, which gives an R-squared of 0.67.
I tried to simplify the model to 7 or 6 variables by dropping the insignificant variables
M.F and U1. The revised model doesn’t show much improvement: it leads to an
adjusted R-squared of 0.73 and a cross-validated R-squared of 0.67.
#------------------------------------- Stepwise Regression -------------------------------------
# load the data (assumes uscrime.txt is in the working directory and has a header row)
data <- read.table("uscrime.txt", header = TRUE)
# scale every predictor except the binary variable So (column 2); the response Crime (column 16) stays unscaled
scaledData <- as.data.frame(scale(data[,c(1,3:15)]))
scaledData <- cbind(data[,2],scaledData,data[,16]) # add So back in as column 1 and Crime as column 16
colnames(scaledData)[1] <- "So"
colnames(scaledData)[16] <- "Crime"
library(caret)
# 5-fold cross-validation repeated 5 times; the reported error is averaged over the repeats
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
# backward elimination by AIC; the lower model only includes an intercept
Stepwise <- train(Crime ~ ., data = scaledData, method = "lmStepAIC", scope = list(lower = Crime~1, upper =
Crime~.), direction = "backward", trControl = ctrl)

Step:  AIC=503.
.outcome ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob

         Df Sum of Sq     RSS  AIC
<none>                1453068 503.
- M.F     1    103159 1556227 505.
- U1      1    127044 1580112 505.
- Prob    1    247978 1701046 509.
- U2      1    255443 1708511 509.
- M       1    296790 1749858 510.
- Ed      1    445788 1898855 514.
- Ineq    1    738244 2191312 521.
- Po1     1   1672038 3125105 537.
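Each row of this table gives the AIC the model would have if that term were removed, and the row for removing nothing has the lowest value, so backward elimination stops with these 8 variables. The same procedure can be sketched with base R's step() function. This is a sketch on synthetic data (the variables x1 to x4 and their coefficients are made up) and an alternative to the caret wrapper, not the exact call used in the analysis.

```r
set.seed(1)
n <- 60
# Synthetic predictors: x1 and x2 drive the response, x3 and x4 are pure noise.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

# Start from the full model with all four predictors.
full <- lm(y ~ ., data = sim)

# Backward elimination by AIC; the lower scope is the intercept-only model.
reduced <- step(full,
                scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4),
                direction = "backward",
                trace = 0)   # trace = 1 prints a table for every elimination step

formula(reduced)
AIC(full)
AIC(reduced)
```

Setting trace = 1 prints one such table per step, which helps when checking why a particular variable was kept or dropped.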

# using 8 variables in the model

Step_model = lm(Crime ~ M.F+U1+Prob+U2+M+Ed+Ineq+Po1, data = scaledData)

summary(Step_model)

Call:
lm(formula = Crime ~ M.F + U1 + Prob + U2 + M + Ed + Ineq + Po1, data = scaledData)

Residuals:
    Min      1Q  Median      3Q     Max
-444.70 -111.07    3.03  122.15  483.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   905.09      28.52  31.731  < 2e-16 ***
M.F            65.83      40.08   1.642  0.
U1           -109.73      60.20  -1.823  0.       .
Prob          -86.31      33.89  -2.547  0.01505  *
U2            158.22      61.22   2.585  0.01371  *
M             117.28      42.10   2.786  0.00828  **
Ed            201.50      59.02   3.414  0.00153  **
Ineq          244.70      55.69   4.394  8.63e-05 ***
Po1           305.07      46.14   6.613  8.26e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 195.5 on 38 degrees of freedom
Multiple R-squared: 0.7888, Adjusted R-squared: 0.
F-statistic: 17.74 on 8 and 38 DF, p-value: 1.159e-

# 47 fold cross-validation (leave-one-out)
# total sum of squares of the response
SST <- sum((data$Crime - mean(data$Crime))^2)
SSE <- 0
for(i in 1:nrow(scaledData)) {
# fit on all rows except i, then predict the held-out row i
Step_model_i = lm(Crime ~ M.F+U1+Prob+U2+M+Ed+Ineq+Po1, data = scaledData[-i,])
pred_i <- predict(Step_model_i,newdata=scaledData[i,])
SSE <- SSE + ((pred_i - data[i,16])^2)
}
R2_mod <- 1 - SSE/SST
R2_mod

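# using 7 variables in the model
# (the source is truncated here and repeats the 8-variable formula; dropping M.F,
# the least significant term, is an assumption)
SSE <- 0
for(i in 1:nrow(scaledData)) {
Step_model_i = lm(Crime ~ U1+Prob+U2+M+Ed+Ineq+Po1, data = scaledData[-i,])
pred_i <- predict(Step_model_i,newdata=scaledData[i,])
SSE <- SSE + ((pred_i - data[i,16])^2)
}
R2_mod <- 1 - SSE/SST
R2_mod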
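The leave-one-out loop can be wrapped in a reusable function. The sketch below does that and checks it against a known shortcut for least squares, where the held-out residual for row i equals the ordinary residual divided by (1 - h_ii), so one fit on all rows is enough. The function names and the synthetic data are illustrative assumptions, not part of the original analysis.

```r
# Leave-one-out cross-validated R-squared for a linear model, by refitting n times.
loocv_r2_loop <- function(formula, data) {
  y <- model.response(model.frame(formula, data))
  sse <- 0
  for (i in seq_len(nrow(data))) {
    fit_i  <- lm(formula, data = data[-i, , drop = FALSE])
    pred_i <- predict(fit_i, newdata = data[i, , drop = FALSE])
    sse <- sse + (y[i] - pred_i)^2
  }
  unname(1 - sse / sum((y - mean(y))^2))
}

# Shortcut: held-out residual = e_i / (1 - h_ii), using the hat values of a single fit.
loocv_r2_press <- function(formula, data) {
  fit <- lm(formula, data = data)
  y <- model.response(model.frame(fit))
  press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
  1 - press / sum((y - mean(y))^2)
}

set.seed(42)
sim <- data.frame(x1 = rnorm(47), x2 = rnorm(47))   # synthetic stand-in with 47 rows
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(47)

r2_loop  <- loocv_r2_loop(y ~ x1 + x2, sim)
r2_press <- loocv_r2_press(y ~ x1 + x2, sim)
r2_loop
r2_press
```

Both functions return the same value. The cross-validated R-squared cannot exceed the in-sample R-squared of the same model, which matches the drop from 0.74 to 0.67 reported for the stepwise model.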

2. Lasso:

I ran Lasso for variable selection. In the coefficient output, Lasso leaves 10 variables with
non-zero coefficients, and I used them in a regression model. The model produces a slightly lower
adjusted R-squared of 0.71, and its cross-validation also gives a lower R-squared of 0.58.
In the coefficient output of the 10-variable regression model, the variables So, LF, M.F, and NW
appear insignificant. Once they are removed, the remaining 6 variables (M, Ed, Po1, U2, Ineq, and Prob)
are the same 6 that remain in the stepwise model after dropping M.F and U1.

#---------------------------- Lasso Regression ----------------------------
# install.packages('glmnet') # run once if the package is not installed
library(glmnet)
# running lasso (alpha = 1) for variable selection, with 5-fold cross-validation to choose lambda
lasso <- cv.glmnet(x=as.matrix(scaledData[,-16]), y=as.matrix(scaledData$Crime), alpha=1, nfolds=5,
type.measure="mse", family="gaussian")
# coefficients of the variables at the lambda with the lowest cross-validated error
coef(lasso, s=lasso$lambda.min)

            s
(Intercept) 889.
So           45.
M            77.
Ed           98.
Po1         311.
Po2           .
LF            2.
M.F          46.
Pop           .
NW            2.
U1            .
U2           29.
Wealth        .
Ineq        164.
Prob        -77.
Time          .
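In this printout a lone "." marks a coefficient that is exactly zero (here Po2, Pop, U1, Wealth, and Time). Lasso can produce exact zeros because its coordinate-wise update is a soft-thresholding step. The sketch below shows that operator in isolation, assuming the idealized case of orthonormal predictors, where the lasso estimate is the soft-thresholded least-squares estimate. The function name soft_threshold and the numbers are hypothetical.

```r
# Soft-thresholding operator: shrink z toward zero by gamma, and set it to
# exactly zero whenever |z| <= gamma.
soft_threshold <- function(z, gamma) {
  sign(z) * pmax(abs(z) - gamma, 0)
}

# Hypothetical least-squares coefficients on standardized predictors.
ols <- c(a = 3.0, b = -0.4, c = 1.2, d = 0.1)

shrunk <- soft_threshold(ols, gamma = 0.5)
shrunk   # a = 2.5, b = 0, c = 0.7, d = 0
```

Small coefficients are removed entirely while large ones are only shrunk, which is why scaling the predictors first matters: the threshold is applied on the same footing to every variable.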

# using 10 variables in the model

lasso_model = lm(Crime ~So+M+Ed+Po1+LF+M.F+NW+U2+Ineq+Prob, data = scaledData)

summary(lasso_model)

Call: lm(formula = Crime ~ So + M + Ed + Po1 + LF + M.F + NW + U2 + Ineq + Prob, data = scaledData)
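The formula for this follow-up regression can be built from the selection pattern rather than typed by hand. The sketch below hard-codes the zero and non-zero pattern read off the lasso coefficient output and turns it into a formula with reformulate(). The object names kept, selected, and lasso_formula are illustrative.

```r
# Zero / non-zero pattern read off the lasso coefficient output
# (TRUE = non-zero coefficient, FALSE = printed as ".").
kept <- c(So = TRUE, M = TRUE, Ed = TRUE, Po1 = TRUE, Po2 = FALSE, LF = TRUE,
          M.F = TRUE, Pop = FALSE, NW = TRUE, U1 = FALSE, U2 = TRUE,
          Wealth = FALSE, Ineq = TRUE, Prob = TRUE, Time = FALSE)

# Names of the variables lasso kept, in their original order.
selected <- names(kept)[kept]

# Build "Crime ~ So + M + ..." from the selected names.
lasso_formula <- reformulate(selected, response = "Crime")
lasso_formula
length(selected)
```

With a fitted cv.glmnet object, the same pattern could presumably be obtained by comparing the rows of as.matrix(coef(lasso, s = lasso$lambda.min)), excluding the intercept, with zero. That step is an assumption and was not run here.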

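#---------------------------- Elastic Net ----------------------------
# try alpha = 0, 0.1, ..., 1 and record the fraction of deviance explained at lambda.min
R2 = c()
for (i in 0:10) {
elastic_net = cv.glmnet(x=as.matrix(scaledData[,-16]), y=as.matrix(scaledData$Crime), alpha=i/10, nfolds=5,
type.measure="mse", family="gaussian")
# use dev.ratios as R2 in regression
R2 = cbind(R2, elastic_net$glmnet.fit$dev.ratio[which(elastic_net$glmnet.fit$lambda ==
elastic_net$lambda.min)])
}
R2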

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]

# which alpha (called λ in the course videos) gives the highest R2
alpha = (which.max(R2)-1)/10

alpha

[1] 0.
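The index arithmetic works because the loop stores the result for alpha = 0 in column 1 and the result for alpha = 1 in column 11. The sketch below shows the same lookup with an explicit alpha grid. The R2 values are hypothetical placeholders, since the values from the analysis are not shown.

```r
# The grid used by the loop: i/10 for i in 0:10.
alphas <- seq(0, 1, by = 0.1)

# Hypothetical R2 values, one per alpha; NOT the values from the analysis.
r2 <- c(0.71, 0.74, 0.73, 0.75, 0.72, 0.78, 0.74, 0.73, 0.76, 0.72, 0.70)

best_index <- which.max(r2)        # position in 1..11
best_alpha <- alphas[best_index]   # same as (best_index - 1) / 10
best_alpha
```

Note that dev.ratio is the fraction of deviance explained on the data used for fitting, so it tends to favor less regularized fits. Comparing the cross-validated error that cv.glmnet stores in cvm would be an alternative criterion for choosing alpha.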

# use the best alpha in elastic net
elastic_net <- cv.glmnet(x=as.matrix(scaledData[,-16]), y=as.matrix(scaledData$Crime), alpha=alpha, nfolds=5,
type.measure="mse", family="gaussian")

# Coefficient output of variables by elastic net

coef(elastic_net, s=elastic_net$lambda.min)

            s
(Intercept) 894.
So           31.
M           106.
Ed          179.
Po1         291.
Po2           .
LF            .
M.F          53.
Pop         -22.
NW           18.
U1          -78.
U2          124.
Wealth       63.
Ineq        256.
Prob        -91.
Time         -1.

# use 13 variables in the model

elastic_net_model = lm(Crime ~ So+M+Ed+Po1+M.F+Pop+NW+U1+U2+Wealth+Ineq+Prob+Time, data = scaledData)

summary(elastic_net_model)

Call: lm(formula = Crime ~ So + M + Ed + Po1 + M.F + Pop + NW + U1 + U2 + Wealth + Ineq + Prob + Time, data = scaledData)

Residuals:
    Min      1Q  Median      3Q     Max
-441.49 -112.23   17.28  116.21  476.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  898.403     53.039  16.938  < 2e-16  ***
So            19.630    128.514   0.153  0.
M            114.387     50.973   2.244  0.031654 *
Ed           194.473     64.239   3.027  0.004760 **
Po1          290.682     67.449   4.310  0.000139 ***
M.F           47.290     49.692   0.952  0.
Pop          -31.180     47.769  -0.653  0.
NW            21.612     60.190   0.359  0.
U1           -91.277     67.188  -1.359  0.
U2           142.231     68.245   2.084  0.044969 *
Wealth        85.100     97.455   0.873  0.
Ineq         286.057     86.442   3.309  0.002269 **
Prob         -97.248     48.918  -1.988  0.       .
Time          -8.224     46.715  -0.176  0.
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 205.6 on 33 degrees of freedom
Multiple R-squared: 0.7973, Adjusted R-squared: 0.
F-statistic: 9.986 on 13 and 33 DF, p-value: 5.194e-
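The adjusted R-squared can be recomputed from the multiple R-squared, the number of data points, and the number of predictors, which makes the complexity penalty explicit. The sketch below applies the standard formula to the two multiple R-squared values printed in the summaries (0.7888 with 8 predictors and 0.7973 with 13 predictors, both on 47 data points). The function name adjusted_r2 is illustrative.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Multiple R-squared values as printed in the two summaries (n = 47 data points).
adj_stepwise    <- adjusted_r2(0.7888, n = 47, p = 8)    # stepwise model, 8 predictors
adj_elastic_net <- adjusted_r2(0.7973, n = 47, p = 13)   # elastic net model, 13 predictors
adj_stepwise
adj_elastic_net
```

This gives roughly 0.744 for the 8-variable model, consistent with the 0.74 reported for stepwise regression, and roughly 0.717 for the 13-variable model. The larger model has the higher multiple R-squared but the lower adjusted value, because five extra predictors cost more degrees of freedom than they explain.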

# 47 fold cross-validation (leave-one-out)
SST <- sum((data$Crime - mean(data$Crime))^2)
SSE <- 0
for(i in 1:nrow(scaledData)) {
# same 13 variables as elastic_net_model (Po2 has a zero coefficient and is left out)
elastic_net_model_i = lm(Crime ~ So+M+Ed+Po1+M.F+Pop+NW+U1+U2+Wealth+Ineq+Prob+Time, data = scaledData[-i,])
pred_i <- predict(elastic_net_model_i,newdata=scaledData[i,])
SSE <- SSE + ((pred_i - data[i,16])^2)
}
R2_mod <- 1 - SSE/SST
R2_mod