Math 407, Machine Learning: Practice Midterm, Exams of Machine Learning

Princeton University Machine Learning


Typology: Exams

2022/2023

Uploaded on 05/11/2023

anoushka 🇺🇸


Machine Learning: Practice Midterm, Spring 2018

Name:_________________________________________________________________________________

Instructions

You may use the following resources on the midterm:

1. your filled in "methods" table with the accompanying notation page,

2. a single page of notes (8.5 by 11in),

3. a calculator.

In this exam, you will use the methods of (statistical) machine learning to solve two prediction problems.

The first problem is to predict the energy consumption of appliances in a certain house and the second is

to predict whether or not a particular office is occupied. Both datasets are from the UCI Machine Learning

Repository.

Please note that this exam is longer than I expect you to be able to do in 50 minutes. The actual midterm

will have a subset of the questions that are on this practice exam, possibly applied to different datasets.


Prediction Problem 1: Energy Consumption

Problem Description:

The goal is to predict the energy consumption (in watt-hours, Wh) of appliances in a certain house. Here is a

description of the dataset by the authors:

Data was collected every 10 min for about 4.5 months. The house temperature and humidity conditions were

monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and

humidity conditions around 3.3 min. Then, the wireless data was averaged for 10 minutes periods. The

energy data was logged every 10 minutes with m-bus energy meters. Weather from the nearest airport

weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis

(rp5.ru), and merged together with the experimental data sets using the date and time column.

Here is a list of the variable names in the dataset:

- date time, year-month-day hour:minute:second
- Appliances, energy use in Wh
- T1, Temperature in kitchen area, in Celsius
- RH_1, Humidity in kitchen area, in %
- T2, Temperature in living room area, in Celsius
- RH_2, Humidity in living room area, in %
- T3, Temperature in laundry room area
- RH_3, Humidity in laundry room area, in %
- T4, Temperature in office room, in Celsius
- RH_4, Humidity in office room, in %
- T5, Temperature in bathroom, in Celsius
- RH_5, Humidity in bathroom, in %
- T6, Temperature outside the building (north side), in Celsius
- RH_6, Humidity outside the building (north side), in %
- T7, Temperature in ironing room, in Celsius
- RH_7, Humidity in ironing room, in %
- T8, Temperature in teenager room 2, in Celsius
- RH_8, Humidity in teenager room 2, in %
- T9, Temperature in parents room, in Celsius
- RH_9, Humidity in parents room, in %
- To, Temperature outside (from Chievres weather station), in Celsius
- Pressure (from Chievres weather station), in mm Hg
- RH_out, Humidity outside (from Chievres weather station), in %
- Wind speed (from Chievres weather station), in m/s
- Visibility (from Chievres weather station), in km
- Tdewpoint (from Chievres weather station), in Celsius

Review the following graphs of the response variable “Appliances” and a few possible predictors, then

answer question 5.

5. From what you can see about the dataset in the above graphs, do you believe the assumptions of

kNNs are satisfied? (Hint: this is a trick question!)

6. Briefly explain how you could use a simulation to estimate the variance of the irreducible error in

predicting the energy use from the available predictor variables. (you don’t need to write any code,

just explain in two or three sentences).
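As a toy illustration of the idea behind question 6, the sketch below simulates data from a regression function and noise level that we invent ourselves (the function f and the value of sigma are assumptions, not properties of the energy dataset). Because the true f is known in a simulation, the mean squared error of predicting with f itself estimates the variance of the irreducible error.

```r
set.seed(1)

n     <- 10000                               # number of simulated observations
x     <- runif(n, 0, 10)                     # a single made-up predictor
f     <- function(x) 50 + 10 * sin(x)        # assumed "true" regression function
sigma <- 5                                   # assumed sd of the irreducible error

# simulate responses: true signal plus irreducible noise
y <- f(x) + rnorm(n, sd = sigma)

# even the perfect predictor f(x) leaves the noise behind,
# so its MSE estimates Var(epsilon) = sigma^2
irreducible_var_hat <- mean((y - f(x))^2)
print(irreducible_var_hat)
```

With real data f is unknown, so this only shows what the quantity means; it is not a recipe that can be applied directly to the Appliances variable.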

7. The code snippet below shows kNN being trained and tested for k=1,2,...,15. Use the resulting output

to choose the best value of k to use.

I choose k=_________________________

    # choose a test set
    set.seed(11)
    samp <- sample(1:nrow(dd), round(0.1 * nrow(dd)), replace = TRUE)

    # use the following 25 variables as predictors in kNN
    pnames <- c("T1", "RH_1", "T2", "RH_2", "T3", "RH_3", "T4", "RH_4",
                "T5", "RH_5", "T6", "RH_6", "T7", "RH_7", "T8", "RH_8",
                "T9", "RH_9", "T_out", "Press_mm_hg", "RH_out", "Windspeed",
                "Visibility", "Tdewpoint", "hour")

    # storage for mean absolute error of kNN for 15 different k's
    MAEk <- numeric(15)

    # train kNN for each k and record its test-set mean absolute error
    for (k in 1:15) {
      knnk <- FNN::knn.reg(train = dd[-samp, pnames],
                           test = dd[samp, pnames],
                           y = dd[-samp, "Appliances"],
                           k = k)
      y <- dd[samp, "Appliances"]
      errork <- (y - knnk$pred)
      MAEk[k] <- mean(abs(errork))
    }

    plot(1:15, MAEk,
         main = "Mean Absolute Error of kNN on test set",
         xlab = "k", ylab = "Mean Absolute Error",
         type = 'l', xlim = c(1, 15))
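Once a vector like MAEk has been filled in, the k with the smallest test error can also be read off programmatically instead of from the plot. A minimal sketch, using made-up error values (MAEk_example is hypothetical and is not the output of the exam's code):

```r
# hypothetical test-set MAE values for k = 1, ..., 15 (invented for illustration)
MAEk_example <- c(40.1, 38.5, 37.9, 38.2, 38.8, 39.5, 40.0, 40.6,
                  41.1, 41.5, 41.9, 42.3, 42.6, 42.9, 43.2)

# index of the smallest error; because k runs 1..15, the index equals k
best_k   <- which.min(MAEk_example)
best_mae <- MAEk_example[best_k]

print(best_k)
print(best_mae)
```

Note that which.min returns the first minimum if several values of k tie.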

8. What (if anything) could we try to improve the performance of kNNs?

The linear regression model shown on the previous page has an MAE of 52.75, which is worse than the

best kNN model. One advantage of a linear regression over kNN is that the model is more interpretable.

9. Use the table of model coefficients in the R output and the location codes to determine:

a) which location has the largest predicted increase in energy use of the appliances for an increase of

one degree Celsius (T), if all other variables in the model are held constant?

Location: ________________________________________

b) which location has the largest reduction in predicted energy use of the appliances for an increase of

1% in relative humidity (RH), if all other variables in the model are held constant?

Location:_________________________________________

10. The month each measurement was recorded was included in the linear regression model as a

qualitative variable with categories "Jan", "Feb", "Mar", "Apr" and "May". The category "Jan" was

used as the baseline and the model coefficients under "MonthFeb", "MonthMar", "MonthApr" and

"MonthMay" correspond to the difference between the baseline "Jan" and the months of "Feb",

"Mar", "Apr" and "May". Thus the predicted energy use of appliances in February may be found by

adding the intercept ("Intercept") to the coefficient of "MonthFeb".

Which month has the lowest predicted energy use by appliances, if all other variables in the model

are held constant?

Month:___________________________________________
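The baseline coding described in question 10 can be sketched on synthetic data. Everything below is invented for illustration (the monthly means, the noise level and the variable names energy and Month are assumptions); the point is only that R reports one coefficient per non-baseline category, and that each month's prediction is the intercept plus that month's coefficient, with 0 added for the baseline.

```r
set.seed(2)

# synthetic monthly data with "Jan" as the first (baseline) factor level
Month <- factor(rep(c("Jan", "Feb", "Mar", "Apr", "May"), each = 50),
                levels = c("Jan", "Feb", "Mar", "Apr", "May"))
true_means <- c(Jan = 100, Feb = 90, Mar = 95, Apr = 80, May = 85)  # made up
energy <- unname(true_means[as.character(Month)]) + rnorm(length(Month), sd = 3)

fit   <- lm(energy ~ Month)
coefs <- coef(fit)   # "(Intercept)", "MonthFeb", "MonthMar", "MonthApr", "MonthMay"

# the baseline month has an implicit coefficient of 0
offsets       <- c(Jan = 0, coefs[-1])
pred_by_month <- coefs[["(Intercept)"]] + offsets

print(pred_by_month)
lowest <- names(which.min(pred_by_month))
print(lowest)
```

In a model with additional predictors, the same comparison between months holds when all other variables are held constant, since the other terms contribute equally to every month.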

11. Besides adding more predictor variables, what (if anything) could you try changing about the linear

regression model to improve its performance?

12. The linear regression model had an estimated test MSE of 9628.81. Do you expect that the MSE of

the training set for the model is larger, smaller or exactly the same as the MSE of the test set?

Circle one: larger smaller exactly the same
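For reference, training and test MSE are computed the same way and differ only in which rows are used. The sketch below uses a small synthetic dataset (the data frame d, its columns and the 10% split are all assumptions made for illustration) and does not reproduce the exam's model or its value of 9628.81.

```r
set.seed(3)

# synthetic data: a linear signal in two made-up predictors plus noise
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 3 + 2 * d$x1 - d$x2 + rnorm(n, sd = 2)

# hold out 10% of the rows as a test set
test_idx <- sample(seq_len(n), size = round(0.1 * n))
fit <- lm(y ~ x1 + x2, data = d[-test_idx, ])

# mean squared error between observed and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)

mse_train <- mse(d$y[-test_idx], predict(fit, newdata = d[-test_idx, ]))
mse_test  <- mse(d$y[test_idx],  predict(fit, newdata = d[test_idx, ]))

print(c(train = mse_train, test = mse_test))
```

Which of the two comes out larger in a single run depends on the random split, so one run should not be read as a general rule.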

13. The test MSE of the linear regression model was higher than the test MSE of the best kNN model. Do

you expect that the higher MSE is a result of more bias, a higher variance in model fits or a higher

variance of the irreducible error?

Circle one: bias variance in model fits variance of the irreducible error

Prediction Problem 2: Office Occupancy

Is a particular office occupied or not? To answer this question, the following variables were collected

every minute in the office for about two weeks:

- date time, year-month-day hour:minute:second
- Temperature, in Celsius
- Relative Humidity, in %
- Light, in Lux
- CO2, in ppm
- Humidity Ratio, derived quantity from temperature and relative humidity, in kg water-vapor/kg-air
- Occupancy ("Empty" or "Occupied")

The dataset was broken into three parts, a training set and two test sets.

1. Which methods from class are appropriate to try when predicting a binary response variable

such as "Occupancy" using four predictor variables such as "time of day" (hour), "light" (Lux), "CO2"

(ppm) and "humidity ratio" (kg water-vapor/kg-air)?

2. Based on the following histograms of three predictor variables by occupancy, do you expect that

either LDA or QDA will work well to predict whether or not the office is empty? Briefly explain.

8. Write down the prediction model from the trained logistic regression model shown in the following

R output:

model
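Since the R output itself is not available here, the following sketch shows on synthetic data how a fitted logistic regression turns into a prediction model. The variables light, co2 and occupied and the coefficients used to simulate them are assumptions for illustration only. The model has the form P(Occupied) = 1 / (1 + exp(-(b0 + b1*light + b2*co2))), where b0, b1 and b2 are the estimated coefficients.

```r
set.seed(4)

# synthetic office data (all values invented)
n     <- 500
light <- runif(n, 0, 600)       # Lux
co2   <- runif(n, 400, 1200)    # ppm
p_true   <- plogis(-6 + 0.01 * light + 0.003 * co2)
occupied <- rbinom(n, size = 1, prob = p_true)

fit <- glm(occupied ~ light + co2, family = binomial)
b   <- coef(fit)

# the prediction model written out by hand from the coefficient table
predict_occupied <- function(light, co2) {
  eta <- b[["(Intercept)"]] + b[["light"]] * light + b[["co2"]] * co2
  1 / (1 + exp(-eta))
}

manual  <- predict_occupied(light, co2)
builtin <- predict(fit, type = "response")

# the hand-written formula should agree with R's own predictions
print(all.equal(unname(manual), unname(builtin)))
```

A predicted probability above a chosen threshold (0.5 is a common default) would then be classified as "Occupied".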
