









Statistical Learning - Exercise R code as solution manual: ISLR (Introduction to Statistical Learning), James, Witten, Hastie, Tibshirani










---
title: "Chapter 2: Statistical Learning"
author: "Solutions to Exercises"
date: "November 19, 2015"
output:
  html_document:
    keep_md: no
---
>EXERCISE 1:

__Part a)__

A flexible learning method would perform better because the sample size is large enough to fit more parameters and the small number of predictors limits model variance.

__Part b)__

A flexible learning method would perform worse because it would be more likely to overfit the small number of observations.

__Part c)__

A flexible learning method would perform better because it is less restrictive on the shape of the fit.

__Part d)__

A flexible learning method would perform worse because it would be more likely to overfit the noise in the error terms.
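Part c can be illustrated with a small simulation. The sketch below assumes a sine-shaped true relationship, Gaussian noise and a degree-9 polynomial as the flexible method; none of these choices come from the exercise, they are only illustrative.

```{r}
set.seed(1)

# Assumed non-linear truth plus noise (illustrative, not from the exercise)
n <- 1000
x <- runif(n, 0, 2 * pi)
y <- sin(2 * x) + rnorm(n, sd = 0.3)

# Split into training and test halves
train <- data.frame(x = x[1:500],   y = y[1:500])
test  <- data.frame(x = x[501:n],   y = y[501:n])

fit_linear   <- lm(y ~ x, data = train)            # inflexible: straight line
fit_flexible <- lm(y ~ poly(x, 9), data = train)   # flexible: degree-9 polynomial

# Test mean squared error of a fitted model
mse <- function(fit) mean((test$y - predict(fit, newdata = test))^2)

c(linear = mse(fit_linear), flexible = mse(fit_flexible))
```

With a large sample and a clearly non-linear relationship, the flexible fit should show the lower test MSE, which is the situation described in Parts a and c.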
***

>EXERCISE 2:

__Part a)__

***

>EXERCISE 3:

__Part a)__

```{r}
curve(300*cos(x/3)+350, add=TRUE, col="green", lwd=2) # bias
curve(225*cos(x/3)+450, add=TRUE, col="blue", lwd=2)  # train error
```
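The `curve(..., add=TRUE)` calls above draw onto a plot that is already open. As a self-contained alternative, the sketch below draws all five curves in one call. The functional forms are assumptions chosen only to give the qualitative shapes (decreasing bias, increasing variance, U-shaped test error); they are not the author's curves.

```{r}
# Illustrative shapes only; every functional form here is an assumption
flex <- seq(0, 10, length.out = 200)             # model flexibility (arbitrary units)

sq_bias     <- 100 * exp(-0.5 * flex)             # falls as flexibility grows
variance    <- 2 * exp(0.35 * flex)               # rises as flexibility grows
irreducible <- rep(20, length(flex))              # constant noise floor
test_error  <- sq_bias + variance + irreducible   # sum of the three, U-shaped
train_error <- 100 * exp(-0.35 * flex) + 5        # keeps falling, ends below the noise floor

curves <- cbind(sq_bias, variance, irreducible, test_error, train_error)
cols   <- c("green", "red", "grey40", "black", "blue")

matplot(flex, curves, type = "l", lty = 1, lwd = 2, col = cols,
        xlab = "Flexibility", ylab = "Error")
legend("top", legend = c("bias", "variance", "irreducible error",
                         "test error", "train error"),
       col = cols, lty = 1, lwd = 2, bty = "n")
```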
__Part b)__

* `variance` will increase with higher flexibility because changing data points will have more effect on the parameter estimates
* `bias` will decrease with higher flexibility because there are fewer assumptions made about the shape of the fit
* `test error` will have a U-shaped curve because it reflects the interaction between `variance` and `bias`
* `irreducible error` is the same regardless of model fit
* `train error` will always decrease with more model flexibility because an overfit model will produce lower MSE on the training data

***

>EXERCISE 4:

__Part a)__

* win/loss in basketball game
    * Response: team win or loss
    * Predictors: team strength/weakness, opponent strength/weakness, player injuries
    * Goal: both prediction to know win or loss and inference to understand what factors influence win/loss result
* renew/non-renew insurance policy
    * Response: policyholder renew or cancel
    * Predictors: price change, customer elasticity, competitor price
* type of particle
    * Response: particle type
    * Predictors: image, size, shape, time

__Part b)__

* fantasy points
    * Response: player fantasy points for next game
    * Predictors: past points, injuries, teammates, opponents
* salary for job posting
    * Response: salary
    * Predictors: position title, location, company/peer salaries
* insurance costs
    * Response: loss cost for policyholder
    * Predictors: experience losses, customer behavior stats

__Part c)__

* types of shoppers
* commodity groups
* personality styles

***

>EXERCISE 5:

Advantages of a very flexible model include a better fit to the data and fewer prior assumptions. Disadvantages are that it is hard to interpret and prone to overfitting.

***

>EXERCISE 7:

__Part a)__

```{r}
# training observations (X1, X2, X3) from the table in ISLR Exercise 7
obs1 <- c(0, 3, 0)
obs2 <- c(2, 0, 0)
obs3 <- c(0, 1, 3)
obs4 <- c(0, 1, 2)
obs5 <- c(-1, 0, 1)
obs6 <- c(1, 1, 1)
obs0 <- c(0, 0, 0)  # test point

# Euclidean distance from each observation to the test point
(dist1 <- sqrt(sum((obs1-obs0)^2)))
(dist2 <- sqrt(sum((obs2-obs0)^2)))
(dist3 <- sqrt(sum((obs3-obs0)^2)))
(dist4 <- sqrt(sum((obs4-obs0)^2)))
(dist5 <- sqrt(sum((obs5-obs0)^2)))
(dist6 <- sqrt(sum((obs6-obs0)^2)))
```

__Part b)__
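A minimal sketch of the K-nearest-neighbours vote built on the Part a distances. The six training observations and their colours are taken from the table in ISLR Exercise 7; `knn_predict` is a hypothetical helper name introduced here for illustration.

```{r}
# Training data as given in the table of ISLR Exercise 7
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)
test_point <- c(0, 0, 0)

# Euclidean distance from every training observation to the test point
X <- as.matrix(train[, c("X1", "X2", "X3")])
train$dist <- sqrt(rowSums(sweep(X, 2, test_point)^2))

# Majority vote among the k nearest observations (hypothetical helper)
knn_predict <- function(k) {
  nearest <- train$Y[order(train$dist)][seq_len(k)]
  names(which.max(table(nearest)))
}

knn_predict(1)  # "Green": observation 5 is the single nearest neighbour
knn_predict(3)  # "Red": observations 5, 6 and 2 vote Green, Red, Red
```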
***

>EXERCISE 8:

__Part a)__
```{r}
require(ISLR)
data(College)
str(College)
```

__Part b)__
```{r, eval=FALSE}
# these steps were already taken on College data in the ISLR package,
# so this chunk is shown for reference and not evaluated
fix(College)                      # pops up table in window
rownames(College) <- College[,1]  # set row names
College <- College[,-1]           # drop first col
```

__Part c)__
```{r}
# i.
summary(College)
# ii.
pairs(College[,1:10])
# iii.
boxplot(Outstate~Private, data=College, xlab="Private", ylab="Outstate")
```

***

>EXERCISE 9:

__Part a)__

* quantitative: _mpg, cylinders (can treat as qual too), displacement, horsepower, weight, acceleration, year_
* qualitative: _origin, name_

__Part b)__

```{r}
range(Auto$mpg)
range(Auto$cylinders)
range(Auto$displacement)
range(Auto$horsepower)
range(Auto$weight)
range(Auto$acceleration)
range(Auto$year)
```

__Part c)__
```{r}
sapply(Auto[,1:7], mean)
sapply(Auto[,1:7], sd)
```

__Part d)__
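Part d repeats the range, mean and standard deviation calls from Parts b and c on a subset of rows. One possible way to collect them in a single table is a small helper. `summarise_numeric` is a hypothetical name, and the built-in `mtcars` data stands in for `Auto` so that the sketch runs without the ISLR package.

```{r}
# Hypothetical helper: min, max, mean and sd for every numeric column
summarise_numeric <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]   # keep numeric columns only
  t(sapply(num, function(col) {
    c(min = min(col), max = max(col), mean = mean(col), sd = sd(col))
  }))
}

# Full data versus the data with a block of rows removed
full         <- summarise_numeric(mtcars)
subset_stats <- summarise_numeric(mtcars[-(10:20), ])

full
subset_stats
```

With the ISLR package loaded, the same helper could be called as `summarise_numeric(Auto[-(10:85), 1:7])`.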
```{r}
# create temporary data frame with the numeric columns
tmp <- Auto[,-(8:9)]   # drop origin, name
tmp <- tmp[-(10:85),]  # drop rows 10 to 85
sapply(tmp, range)
sapply(tmp, mean)
sapply(tmp, sd)
```

__Part e)__
```{r}
pairs(Auto[,1:7])
```

***

>EXERCISE 10:

__Part a)__
```{r}
require(ISLR)
require(MASS)
```

* some correlations with each variable, except `chas`
* `crim` rates seem to spike within certain zones
    * when `rad` is > 20
    * when `tax` is between 600 and 700
    * when `zn` is close to 0
    * etc.
* negative correlations with `dis`, `medv` and maybe `black`

__Part d)__

```{r}
require(ggplot2)
g <- ggplot(Boston, aes(x=1:nrow(Boston), y=crim))
g + geom_point()
g <- ggplot(Boston, aes(x=1:nrow(Boston), y=tax))
g + geom_point()
g <- ggplot(Boston, aes(x=1:nrow(Boston), y=ptratio))
g + geom_point()
```

__Part f)__
```{r}
median(Boston$ptratio) # 19.
```

__Part g)__
```{r}
# there are two towns with lowest medv value of 5
(seltown <- Boston[Boston$medv == min(Boston$medv),])
# overall quartiles and range of predictors
sapply(Boston, quantile)
```
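To compare the selected towns with the overall distribution more directly, one option is to convert each of their predictor values into a percentile rank within the full data set. This is a sketch of that idea using the empirical CDF from base R; `pct_rank` is a name introduced here for illustration.

```{r}
library(MASS)  # provides the Boston data set

# Towns with the lowest median home value
seltown <- Boston[Boston$medv == min(Boston$medv), ]

# Share of all towns with a value at or below the selected town's value,
# computed column by column with the empirical CDF
pct_rank <- sapply(names(Boston), function(v) ecdf(Boston[[v]])(seltown[[v]]))
rownames(pct_rank) <- rownames(seltown)

round(100 * pct_rank)  # percentile ranks, one row per selected town
```

Values near 0 or 100 flag predictors where these towns sit at an extreme of the Boston distribution.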