Introduction to Statistical Learning ISLR Chapter 2 Solutions Code, Exercises of Statistics

University of Massachusetts - Amherst Statistics

Statistical Learning: exercise R code as a solution manual for ISLR (Introduction to Statistical Learning) by James, Witten, Hastie, Tibshirani

Typology: Exercises

2020/2021

Uploaded on 05/26/2021

ekaashaah 🇺🇸


---
title: "Chapter 2: Statistical Learning"
author: "Solutions to Exercises"
date: "November 19, 2015"
output:
  html_document:
    keep_md: no
---

***

## CONCEPTUAL

***

>EXERCISE 1:

__Part a)__

A flexible learning method would perform __better__: the sample size is large enough to fit more parameters, and the small number of predictors limits model variance.

__Part b)__

A flexible learning method would perform __worse__: with few observations and many predictors, it would be more likely to overfit.

__Part c)__

A flexible learning method would perform __better__ because it is less restrictive on the shape of the fit and can follow a non-linear relationship.


__Part d)__

A flexible learning method would perform __worse__ because it would be more likely to overfit, fitting the noise in the error terms rather than the underlying relationship.
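The answers to Parts a) to d) can be read off the standard bias-variance decomposition of the expected test error at a point $x_0$. The notation below ($\hat{f}$ for the fitted model, $\epsilon$ for the error term) does not appear in the solutions above and assumes squared-error loss:

$$
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)
$$

A flexible method lowers the bias term but raises the variance term. A large sample (Part a) keeps the variance in check, while few observations with many predictors (Part b) or a large $\mathrm{Var}(\epsilon)$ (Part d) let the variance term dominate. A non-linear true relationship (Part c) makes the bias of an inflexible method large.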

***

>EXERCISE 2:

__Part a)__

* regression
* inference
* n = 500 observations
* p = 3 variables
    * profit
    * number of employees
    * industry

__Part b)__

* classification
* prediction
* n = 20 observations
* p = 13 variables
    * price charged
    * marketing budget
    * competition price
    * ten other variables

***

>EXERCISE 3:

__Part a)__

```{r}
# these calls add curves to an existing plot (add=TRUE)
curve(300*cos(x/3)+350, add=TRUE, col="green", lwd=2)  # bias
curve(225*cos(x/3)+450, add=TRUE, col="blue", lwd=2)   # train error
```
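Since the two `curve()` calls above only add to a plot that is already open, a self-contained version may be easier to run. The following is a minimal sketch, not the original plot: the flexibility scale and the functional form of every curve are assumptions chosen only to reproduce the qualitative shapes (falling bias, rising variance, flat irreducible error, U-shaped test error, falling train error).

```r
# Hypothetical flexibility scale and curve shapes, chosen only for illustration
flex        <- seq(1, 10, length.out = 100)
bias_sq     <- 8 / flex                   # assumed: falls as flexibility grows
variance    <- 0.1 * flex^2               # assumed: rises as flexibility grows
irreducible <- rep(1, length(flex))       # constant noise floor
test_err    <- bias_sq + variance + irreducible
train_err   <- 6 / flex                   # assumed: keeps falling with flexibility

# Draw the test error first, then overlay the other four curves
plot(flex, test_err, type = "l", col = "red", lwd = 2,
     ylim = c(0, max(test_err)), xlab = "Flexibility", ylab = "Error")
lines(flex, variance,    col = "orange", lwd = 2)
lines(flex, bias_sq,     col = "green",  lwd = 2)
lines(flex, irreducible, col = "black",  lwd = 2, lty = 2)
lines(flex, train_err,   col = "blue",   lwd = 2)
legend("top",
       legend = c("test error", "variance", "squared bias",
                  "irreducible error", "train error"),
       col = c("red", "orange", "green", "black", "blue"),
       lty = c(1, 1, 1, 2, 1), lwd = 2)
```

Because the test error is built as the sum of the other three components, it never drops below the irreducible error, and its minimum sits at an intermediate flexibility.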

__Part b)__

* `variance` will increase with higher flexibility because changing data points will have more effect on the parameter estimates
* `bias` will decrease with higher flexibility because there are fewer assumptions made about the shape of the fit
* `test error` will have a U-shaped curve because it reflects the interaction between `variance` and `bias`
* `irreducible error` is the same regardless of model fit
* `train error` will always decrease with more model flexibility because an overfit model will produce lower MSE on the training data

***

>EXERCISE 4:

__Part a)__

* win/loss in basketball game
    * Response: team win or loss
    * Predictors: team strength/weakness, opponent strength/weakness, player injuries
    * Goal: both prediction to know win or loss and inference to understand what factors influence win/loss result
* renew/non-renew insurance policy
    * Response: policyholder renew or cancel
    * Predictors: price change, customer elasticity, competitor price
* type of particle
    * Response: particle type
    * Predictors: image, size, shape, time

__Part b)__

* fantasy points
    * Response: player fantasy points for next game
    * Predictors: past points, injuries, teammates, opponents
* salary for job posting
    * Response: salary
    * Predictors: position title, location, company/peer salaries
* insurance costs
    * Response: loss cost for policyholder
    * Predictors: experience losses, customer behavior stats

__Part c)__

* types of shoppers
* commodity groups
* personality styles

***

>EXERCISE 5:

Advantages of a very flexible model include a better fit to the data and fewer prior assumptions. Disadvantages are that it is hard to interpret and prone to overfitting.

***

>EXERCISE 7:

__Part a)__

```{r}
# obs1 to obs4 hold the first four rows of the exercise table
obs5 <- c(-1, 0, 1)
obs6 <- c(1, 1, 1)
obs0 <- c(0, 0, 0)  # test point
# Euclidean distance from each observation to the test point
(dist1 <- sqrt(sum((obs1-obs0)^2)) )
(dist2 <- sqrt(sum((obs2-obs0)^2)) )
(dist3 <- sqrt(sum((obs3-obs0)^2)) )
(dist4 <- sqrt(sum((obs4-obs0)^2)) )
(dist5 <- sqrt(sum((obs5-obs0)^2)) )
(dist6 <- sqrt(sum((obs6-obs0)^2)) )
```

__Part b)__

* closest 1 neighbor is obs
* prediction = Green

__Part c)__

* closest 3 neighbors are obs5, obs6, obs
* prediction = Red

__Part d)__

The best value of K should be small: a small K gives a more flexible classifier, which can capture more of a highly non-linear decision boundary.
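The distance calculation and the K = 1 and K = 3 predictions can be reproduced in one pass. This is a minimal sketch under one assumption: the six training rows are taken to be the table printed with this exercise in the ISLR textbook (rows 5 and 6 match `obs5` and `obs6` above). `knn_predict` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Training data: assumed to be the table from ISLR Chapter 2, Exercise 7
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)
test_point <- c(0, 0, 0)

# Euclidean distance from every training row to the test point
dists <- sqrt(rowSums(sweep(as.matrix(train[, 1:3]), 2, test_point)^2))

# Hypothetical helper: majority vote among the k nearest neighbours
knn_predict <- function(k) {
  nearest <- order(dists)[seq_len(k)]
  votes   <- table(train$Y[nearest])
  names(votes)[which.max(votes)]
}

round(dists, 2)
knn_predict(1)  # "Green"
knn_predict(3)  # "Red"
```

Sorting the distances once and voting over the first `k` entries keeps the helper independent of the particular value of K, so other values can be tried without recomputing anything.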

***

## APPLIED

***

>EXERCISE 8:

__Part a)__

```{r}
require(ISLR)
data(College)
str(College)
```

__Part b)__

```{r}
# these steps were already applied to the College data in the ISLR package;
# they are shown for reference and should not be re-run on that copy
fix(College)                      # pops up table in window
rownames(College) <- College[,1]  # set row names
College <- College[,-1]           # drop first col
```

__Part c)__

```{r}
# i.
summary(College)
# ii.
pairs(College[,1:10])
# iii.
boxplot(Outstate~Private, data=College, xlab="Private", ylab="Outstate")
```

***

>EXERCISE 9:

__Part a)__

* quantitative: _mpg, cylinders (can treat as qual too), displacement, horsepower, weight, acceleration, year_
* qualitative: _origin, name_

__Part b)__

```{r}
range(Auto$mpg)
range(Auto$cylinders)
range(Auto$displacement)
range(Auto$horsepower)
range(Auto$weight)
range(Auto$acceleration)
range(Auto$year)
```

__Part c)__

```{r}
sapply(Auto[,1:7], mean)
sapply(Auto[,1:7], sd)
```

__Part d)__

```{r}
# create temp data frame for numeric columns
tmp <- Auto[,-(8:9)]   # drop origin, name
tmp <- tmp[-(10:85),]  # drop rows 10 through 85
sapply(tmp, range)
sapply(tmp, mean)
sapply(tmp, sd)
```

__Part e)__

```{r}
pairs(Auto[,1:7])
```

* mpg is negatively correlated with cylinders, displacement, horsepower, and weight
* horsepower is positively correlated with weight
* mpg mostly increases for newer model years

__Part f)__

Yes, the plots show that there are relationships between mpg and other variables in the data set.
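The relationships read off the scatterplot matrix in Part e) can be cross-checked numerically with a correlation table. The sketch below is one way to do that. `cor_with` is a hypothetical helper introduced here, and the `Auto` part only runs if the ISLR package is installed.

```r
# Hypothetical helper: correlation of every numeric column with `target`,
# sorted from most negative to most positive
cor_with <- function(df, target) {
  num  <- df[, sapply(df, is.numeric), drop = FALSE]
  cors <- cor(num, use = "complete.obs")[, target]
  sort(cors[names(cors) != target])
}

# Runs only if the ISLR package (which provides Auto) is installed
if (requireNamespace("ISLR", quietly = TRUE)) {
  data(Auto, package = "ISLR")
  print(round(cor_with(Auto[, 1:7], "mpg"), 2))
  print(round(cor(Auto$horsepower, Auto$weight), 2))
}
```

The sign of each entry indicates the direction of the linear association, which is a quick check on what the eye picks up from `pairs()`.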

***

>EXERCISE 10:

__Part a)__

```{r}
require(ISLR)
require(MASS)
```

__Part c)__

* some correlations with each variable, except chas
* crim rates seem to spike within certain zones
    * when `rad` is > 20
    * when `tax` is between 600 and 700
    * when `zn` is close to 0
    * etc.
* negative correlations with `dis`, `medv` and maybe `black`

__Part d)__

```{r}
require(ggplot2)
g <- ggplot(Boston, aes(x=1:nrow(Boston), y=crim))
g + geom_point()
g <- ggplot(Boston, aes(x=1:nrow(Boston), y=tax))
g + geom_point()
g <- ggplot(Boston, aes(x=1:nrow(Boston), y=ptratio))
g + geom_point()
```

* definitely outliers for crim and tax
* no clear outlier for ptratio

__Part e)__

```{r}
table(Boston$chas)  # 35 towns
```

__Part f)__

```{r}
median(Boston$ptratio)  # 19.
```

__Part g)__

```{r}
# there are two towns with lowest medv value of 5
(seltown <- Boston[Boston$medv == min(Boston$medv),])
# overall quartiles and range of predictors
sapply(Boston, quantile)
```

* age, rad at max
* crim, indus, nox, tax, ptratio, lstat at or above 75th percentile
* low for zn, rm, dis

__Part h)__

```{r}
# count of towns
nrow(Boston[Boston$rm > 7,])  # 64 with > 7 rooms
nrow(Boston[Boston$rm > 8,])  # 13 with > 8 rooms
# row 1: mean for towns with > 8 rooms per dwelling
# row 2: median for all towns
```
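The two-row comparison described by the last two comments in Part h) (mean of each predictor for towns with more than 8 rooms per dwelling, against the median over all towns) could be built as follows. This is a sketch rather than the original code: `compare_subset` is a hypothetical helper, and the `Boston` part only runs if the MASS package is installed.

```r
# Hypothetical helper: row 1 is the column means of the selected rows,
# row 2 is the column medians of the full data frame
compare_subset <- function(df, keep) {
  rbind(subset_mean    = sapply(df[keep, , drop = FALSE], mean),
        overall_median = sapply(df, median))
}

# Runs only if the MASS package (which provides Boston) is installed
if (requireNamespace("MASS", quietly = TRUE)) {
  data(Boston, package = "MASS")
  print(round(compare_subset(Boston, Boston$rm > 8), 2))
}
```

Stacking the two summaries with `rbind()` lines up each predictor in its own column, so differences between the large-dwelling towns and the typical town can be read across a single table.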
