Introduction to Statistical Learning ISLR Chapter 2 Solutions Code, Exercises of Statistics

Statistical Learning - Exercise R code as solution manual for ISLR, Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani)

Typology: Exercises

2020/2021

Uploaded on 05/26/2021

ekaashaah

---
title: "Chapter 2: Statistical Learning"
author: "Solutions to Exercises"
date: "November 19, 2015"
output:
  html_document:
    keep_md: no
---
***
## CONCEPTUAL
***
<a id="ex01"></a>
>EXERCISE 1:

__Part a)__

A flexible learning method would perform __better__ because the sample size is large enough to fit more parameters and the small number of predictors limits model variance.

__Part b)__

A flexible learning method would perform __worse__ because, with few observations and many predictors, it would be more likely to overfit.

__Part c)__

A flexible learning method would perform __better__ because it is less restrictive on the shape of the fit.

__Part d)__

A flexible learning method would perform __worse__ because it would be more likely to overfit by fitting the noise in the error terms.
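The trade-off behind these answers can be illustrated with a small simulation. The sketch below is not part of the original solutions: the true function `sin(2 * x)`, the noise level, the sample sizes, and the use of polynomial degree as a stand-in for "flexibility" are all assumptions chosen for illustration.

```r
set.seed(1)

# Hypothetical non-linear truth plus noise (assumed for illustration)
f <- function(x) sin(2 * x)
n_train <- 50
n_test  <- 1000

train <- data.frame(x = runif(n_train, 0, 3))
train$y <- f(train$x) + rnorm(n_train, sd = 0.3)
test <- data.frame(x = runif(n_test, 0, 3))
test$y <- f(test$x) + rnorm(n_test, sd = 0.3)

# Fit polynomials of increasing degree and record train/test MSE
degrees <- 1:10
mse <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = mean((train$y - predict(fit, train))^2),
    test  = mean((test$y  - predict(fit, test))^2))
}))
rownames(mse) <- paste0("degree_", degrees)
round(mse, 3)
```

Because each polynomial model nests the previous one, the training MSE cannot rise as the degree grows. The test MSE usually falls at first and then climbs again once the fit starts chasing noise.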

***
>EXERCISE 2:

__Part a)__

* regression
* inference
* n = 500 observations
* p = 3 variables
    * profit
    * number of employees
    * industry

__Part b)__

* classification
* prediction
* n = 20 observations
* p = 13 variables
    * price charged
    * marketing budget
    * competition price
    * ten other variables

```{r}
curve(300*cos(x/3)+350, add=TRUE, col="green", lwd=2)  # bias
curve(225*cos(x/3)+450, add=TRUE, col="blue", lwd=2)   # train error
```

__Part b)__

* `variance` will increase with higher flexibility because changing data points will have more effect on the parameter estimates
* `bias` will decrease with higher flexibility because there are fewer assumptions made about the shape of the fit
* `test error` will have a U-shaped curve because it reflects the interaction between `variance` and `bias`
* `irreducible error` is the same regardless of model fit
* `train error` will always decrease with more model flexibility because an overfit model will produce lower MSE on the training data

***
>EXERCISE 4:

__Part a)__

* win/loss in basketball game
    * Response: team win or loss
    * Predictors: team strength/weakness, opponent strength/weakness, player injuries
    * Goal: both prediction to know win or loss and inference to understand what factors influence win/loss result
* renew/non-renew insurance policy
    * Response: policyholder renew or cancel
    * Predictors: price change, customer elasticity, competitor price
* type of particle
    * Response: particle type
    * Predictors: image, size, shape, time

__Part b)__

* fantasy points
    * Response: player fantasy points for next game
    * Predictors: past points, injuries, teammates, opponents
* salary for job posting
    * Response: salary
    * Predictors: position title, location, company/peer salaries
* insurance costs
    * Response: loss cost for policyholder
    * Predictors: experience losses, customer behavior stats

__Part c)__

* types of shoppers
* commodity groups
* personality styles

***
>EXERCISE 5:

Advantages of a very flexible model include a better fit to the data and fewer prior assumptions. Its disadvantages are that it is harder to interpret and prone to overfitting.

***

```{r}
obs5 <- c(-1, 0, 1)
obs6 <- c(1, 1, 1)
obs0 <- c(0, 0, 0)
(dist1 <- sqrt(sum((obs1-obs0)^2)) )
(dist2 <- sqrt(sum((obs2-obs0)^2)) )
(dist3 <- sqrt(sum((obs3-obs0)^2)) )
(dist4 <- sqrt(sum((obs4-obs0)^2)) )
(dist5 <- sqrt(sum((obs5-obs0)^2)) )
(dist6 <- sqrt(sum((obs6-obs0)^2)) )
```

__Part b)__

* closest 1 neighbor is obs
* prediction = Green

__Part c)__

* closest 3 neighbors are obs5, obs6, obs
* prediction = Red

__Part d)__

The best value of K should be small, because a smaller K gives a more flexible classifier that can capture more of the non-linear decision boundary.
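Parts a) through c) can be reproduced end to end in a few lines. The sketch below assumes the six training observations from the table in the textbook's version of this exercise: only `obs5` and `obs6` appear in the code above, so the remaining rows and all the colour labels are an assumption taken from that table. `knn_predict` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Training data as listed in the textbook table for this exercise (assumed here)
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red"),
  row.names = paste0("obs", 1:6)
)
test_point <- c(0, 0, 0)

# Euclidean distance from each training observation to the test point
X <- as.matrix(train[, c("X1", "X2", "X3")])
dists <- sqrt(rowSums(sweep(X, 2, test_point)^2))

# Hypothetical helper: majority vote among the k nearest observations
knn_predict <- function(dists, labels, k) {
  nearest <- order(dists)[seq_len(k)]
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

round(sort(dists), 2)
knn_predict(dists, train$Y, k = 1)
knn_predict(dists, train$Y, k = 3)
```

Under these assumptions the three nearest observations are obs5, obs6 and obs2, so K = 1 predicts Green and K = 3 predicts Red, in line with the answers to Parts b) and c).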

***
## APPLIED
***

>EXERCISE 8:

__Part a)__

```{r}
require(ISLR)
data(College)
str(College)
```

__Part b)__

```{r, eval=FALSE}
# these steps were already taken on College data in the ISLR package,
# so this chunk is shown for reference and not evaluated
fix(College)                      # pops up table in window
rownames(College) <- College[,1]  # set row names
College <- College[,-1]           # drop first col
```

__Part c)__

```{r}
# i.
summary(College)
# ii.
pairs(College[,1:10])
# iii.
boxplot(Outstate~Private, data=College, xlab="Private", ylab="Outstate")
```

***

* quantitative: _mpg, cylinders (can treat as qual too), displacement, horsepower, weight, acceleration, year_
* qualitative: _origin, name_

__Part b)__

```{r}
range(Auto$mpg)
range(Auto$cylinders)
range(Auto$displacement)
range(Auto$horsepower)
range(Auto$weight)
range(Auto$acceleration)
range(Auto$year)
```

__Part c)__

```{r}
sapply(Auto[,1:7], mean)
sapply(Auto[,1:7], sd)
```

__Part d)__

```{r}
# create temp matrix for numeric columns
tmp <- Auto[,-(8:9)]   # drop origin, name
tmp <- tmp[-(10:85),]  # drop rows
sapply(tmp, range)
sapply(tmp, mean)
sapply(tmp, sd)
```

__Part e)__

```{r}
pairs(Auto[,1:7])
```
* mpg is negatively correlated with cylinders, displacement, horsepower, and weight
* horsepower is positively correlated with weight
* mpg mostly increases for newer model years

__Part f)__

Yes, the plots show that there are relationships between mpg and other variables in the data set.
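Visual impressions from a scatterplot matrix can be checked numerically with `cor()`. The sketch below is an illustration only: it uses R's built-in `mtcars` data as a stand-in so that it runs without the ISLR package, and `cor_with` is a hypothetical helper written for this example. The stand-in numbers are not the `Auto` results.

```r
# Hypothetical helper: correlation of every numeric column with one target column
cor_with <- function(df, target) {
  num <- df[sapply(df, is.numeric)]
  cors <- cor(num, use = "complete.obs")[, target]
  sort(cors[names(cors) != target])
}

# Stand-in data (assumption): mtcars ships with base R
round(cor_with(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")], "mpg"), 2)
round(cor(mtcars$hp, mtcars$wt), 2)

# With the ISLR package installed, the analogous call would be:
# cor_with(ISLR::Auto[, 1:7], "mpg")
```

In the stand-in data, fuel economy is negatively correlated with cylinders, displacement, horsepower and weight, while horsepower and weight are positively correlated with each other.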

***
>EXERCISE 10:

__Part a)__

```{r}
require(ISLR)
require(MASS)
```

* some correlations with each variable, except chas
* crim rates seem to spike within certain zones
    * when `rad` is > 20
    * when `tax` is between 600 and 700
    * when `zn` is close to 0
    * etc.
* negative correlations with `dis`, `medv` and maybe `black`

__Part d)__

```{r}
require(ggplot2)
g <- ggplot(Boston, aes(x=1:nrow(Boston), y=crim))
g + geom_point()
g <- ggplot(Boston, aes(x=1:nrow(Boston), y=tax))
g + geom_point()
g <- ggplot(Boston, aes(x=1:nrow(Boston), y=ptratio))
g + geom_point()
```
* definitely outliers for crim and tax
* no clear outlier for ptratio

__Part e)__

```{r}
table(Boston$chas)  # 35 towns
```

__Part f)__

```{r}
median(Boston$ptratio)  # 19.
```

__Part g)__

```{r}
# there are two towns with lowest medv value of 5
(seltown <- Boston[Boston$medv == min(Boston$medv),])
# overall quartiles and range of predictors
sapply(Boston, quantile)
```

* age, rad at max
* crim, indus, nox, tax, ptratio, lstat at or above 75th percentile
* low for zn, rm, dis

__Part h)__

```{r}
# count of towns
nrow(Boston[Boston$rm > 7,])  # 64 with > 7 rooms
nrow(Boston[Boston$rm > 8,])  # 13 with > 8 rooms
# row 1: mean for towns with > 8 rooms per dwelling
# row 2: median for all towns
```
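For the comparison in Part g), reading each town's values against the quantile table by eye can be replaced with empirical percentiles. The following is a minimal sketch, assuming the `Boston` data from the MASS package; `pct_rank` is a hypothetical helper introduced for this illustration.

```r
library(MASS)  # provides the Boston data set

# Towns with the lowest median home value (as in Part g)
seltown <- Boston[Boston$medv == min(Boston$medv), ]

# Hypothetical helper: empirical percentile of each value within its column
pct_rank <- function(values, column) round(100 * ecdf(column)(values), 1)

# Pair each column of the selected towns with the full column of the same name
pct <- as.data.frame(Map(pct_rank, seltown, Boston),
                     row.names = rownames(seltown))
t(pct)
```

A value of 100 means the town sits at the maximum of that predictor, and values near 0 mark predictors where the town is among the lowest, which makes the "at max", "above 75th percentile" and "low" groupings easy to read off.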