Math 407, Machine Learning: Practice Midterm

Uploaded on 05/11/2023

anoushka
Machine Learning: Practice Midterm, Spring 2018
Name:_________________________________________________________________________________
Instructions
You may use the following resources on the midterm:
1. your filled-in "methods" table with the accompanying notation page,
2. a single page of notes (8.5 by 11 in),
3. a calculator.
In this exam, you will use the methods of (statistical) machine learning to solve two prediction problems.
The first problem is to predict the energy consumption of appliances in a certain house and the second is
to predict whether or not a particular office is occupied. Both datasets are from the UCI Machine Learning
Repository.
Please note that this exam is longer than I expect you to be able to do in 50 minutes. The actual midterm
will have a subset of the questions that are on this practice exam, possibly applied to different datasets.

Prediction Problem 1: Energy Consumption

Problem Description:

The goal is to predict the energy consumption (in Wh) of appliances in a certain house. Here is a description of the dataset by the authors:

Data was collected every 10 min for about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions around 3.3 min. Then, the wireless data was averaged for 10 minutes periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru), and merged together with the experimental data sets using the date and time column.

Here is a list of the variable names in the dataset:

• date time, year-month-day hour:minute:second
• Appliances, energy use in Wh
• T1, Temperature in kitchen area, in Celsius
• RH_1, Humidity in kitchen area, in %
• T2, Temperature in living room area, in Celsius
• RH_2, Humidity in living room area, in %
• T3, Temperature in laundry room area
• RH_3, Humidity in laundry room area, in %
• T4, Temperature in office room, in Celsius
• RH_4, Humidity in office room, in %
• T5, Temperature in bathroom, in Celsius
• RH_5, Humidity in bathroom, in %
• T6, Temperature outside the building (north side), in Celsius
• RH_6, Humidity outside the building (north side), in %
• T7, Temperature in ironing room, in Celsius
• RH_7, Humidity in ironing room, in %
• T8, Temperature in teenager room 2, in Celsius
• RH_8, Humidity in teenager room 2, in %
• T9, Temperature in parents room, in Celsius
• RH_9, Humidity in parents room, in %
• To, Temperature outside (from Chievres weather station), in Celsius
• Pressure (from Chievres weather station), in mm Hg
• RH_out, Humidity outside (from Chievres weather station), in %
• Wind speed (from Chievres weather station), in m/s
• Visibility (from Chievres weather station), in km
• Tdewpoint (from Chievres weather station), in Celsius

Review the following graphs of the response variable “Appliances” and a few possible predictors, then answer question 5.

5. From what you can see about the dataset in the above graphs, do you believe the assumptions of kNNs are satisfied? (Hint: this is a trick question!)

6. Briefly explain how you could use a simulation to estimate the variance of the irreducible error in predicting the energy use from the available predictor variables. (You don’t need to write any code; just explain in two or three sentences.)

7. The code snippet below shows kNN being trained and tested for k=1,2,...,15. Use the resulting output to choose the best value of k to use.

I choose k=_________________________

# choose a test set
# (replace=TRUE samples row indices with replacement, so samp may contain repeats)
set.seed(11)
samp <- sample(1:nrow(dd), round(.1 * nrow(dd)), replace=TRUE)

# use the following 25 variables as predictors in kNN
pnames <- c("T1", "RH_1", "T2", "RH_2", "T3", "RH_3", "T4", "RH_4",
            "T5", "RH_5", "T6", "RH_6", "T7", "RH_7", "T8", "RH_8",
            "T9", "RH_9", "T_out", "Press_mm_hg", "RH_out",
            "Windspeed", "Visibility", "Tdewpoint", "hour")

# storage for mean absolute error of kNN for 15 different k's
MAEk <- numeric(15)

# train kNN on the non-test rows and predict the test rows for each k
for (k in 1:15) {
  knnk <- FNN::knn.reg(train=dd[-samp, pnames], test=dd[samp, pnames],
                       y=dd[-samp, "Appliances"], k=k)
  y <- dd[samp, "Appliances"]
  errork <- (y - knnk$pred)
  MAEk[k] <- mean(abs(errork))
}

# plot test-set MAE against k
plot(1:15, MAEk, main="Mean Absolute Error of kNN on test set",
     xlab="k", ylab="Mean Absolute Error", type='l', xlim=c(1, 15))

8. What (if anything) could we try to improve the performance of kNNs?
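Because kNN relies on distances between predictor vectors, one thing that is often tried is putting the predictors on a common scale. The sketch below is not part of the exam code. It uses a small made-up data frame (`toy`, with hypothetical values in columns named after two of the predictors) and a hypothetical train/test split to show the mechanics with base R's `scale()`.

```r
# Toy data only: values are made up to illustrate very different scales.
toy <- data.frame(
  Press_mm_hg = c(733.5, 745.2, 755.1, 760.3, 741.8, 750.0),
  Windspeed   = c(1.0, 7.0, 3.5, 2.0, 6.0, 4.5)
)
train_idx <- 1:4                      # hypothetical training rows
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Standardize using the training set's means and standard deviations only,
# then apply the same transformation to the test set.
train_sc <- scale(train)
test_sc  <- scale(test,
                  center = attr(train_sc, "scaled:center"),
                  scale  = attr(train_sc, "scaled:scale"))

# After scaling, each training column has mean 0 and standard deviation 1.
print(round(colMeans(train_sc), 10))
print(apply(train_sc, 2, sd))
```

Scaled matrices like `train_sc` and `test_sc` could then be passed as the `train=` and `test=` arguments of `FNN::knn.reg`. Whether this helps on the energy data is something the test-set MAE would have to show.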

The linear regression model shown on the previous page has an MAE of 52.75, which is worse than that of the best kNN model. One advantage of a linear regression over kNN is that the model is more interpretable.

9. Use the table of model coefficients in the R output and the location codes to determine:

a) which location has the largest predicted increase in energy use of the appliances for an increase of one degree Celsius (T), if all other variables in the model are held constant?

Location: ________________________________________

b) which location has the largest reduction in predicted energy use of the appliances for an increase of 1% in relative humidity (RH), if all other variables in the model are held constant?

Location: ________________________________________

10. The month each measurement was recorded was included in the linear regression model as a qualitative variable with categories "Jan", "Feb", "Mar", "Apr" and "May". The category "Jan" was used as the baseline and the model coefficients under "MonthFeb", "MonthMar", "MonthApr" and "MonthMay" correspond to the difference between the baseline "Jan" and the months of "Feb", "Mar", "Apr" and "May". Thus the predicted energy use of appliances in February may be found by adding the intercept ("Intercept") to the coefficient of "MonthFeb".

Which month has the lowest predicted energy use by appliances, if all other variables in the model are held constant?

Month: ___________________________________________
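The baseline coding described in question 10 can be written out as a single equation. This is a sketch in notation that does not appear in the exam: $\beta_0$ stands for the intercept, $\beta_{\text{Feb}},\dots,\beta_{\text{May}}$ for the month coefficients, $\mathbf{1}[\cdot]$ for an indicator that is 1 when the condition holds and 0 otherwise, and $\sum_j \beta_j x_j$ for all remaining predictors in the model.

```latex
\hat{y} = \beta_0
  + \beta_{\text{Feb}}\,\mathbf{1}[\text{Month}=\text{Feb}]
  + \beta_{\text{Mar}}\,\mathbf{1}[\text{Month}=\text{Mar}]
  + \beta_{\text{Apr}}\,\mathbf{1}[\text{Month}=\text{Apr}]
  + \beta_{\text{May}}\,\mathbf{1}[\text{Month}=\text{May}]
  + \sum_j \beta_j x_j
```

For a January measurement every indicator is zero, so with the other predictors held fixed the prediction for month $m$ differs from the January prediction by exactly $\beta_m$.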

11. Besides adding more predictor variables, what (if anything) could you try changing about the linear regression model to improve its performance?

12. The linear regression model had an estimated test MSE of 9628.81. Do you expect that the MSE of the training set for the model is larger, smaller or exactly the same as the MSE of the test set?

Circle one:    larger    smaller    exactly the same

13. The test MSE of the linear regression model was higher than the test MSE of the best kNN model. Do you expect that the higher MSE is a result of more bias, a higher variance in model fits or a higher variance of the irreducible error?

Circle one:    bias    variance in model fits    variance of the irreducible error
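The three options in question 13 correspond to the three terms of the usual decomposition of the expected test MSE at a point $x_0$. The notation is an assumption, not taken from the exam: the data are modeled as $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\varepsilon$ independent of $x$, and $\hat{f}$ is the model fitted on a random training set.

```latex
\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
  = \big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2
  + \operatorname{Var}\big(\hat{f}(x_0)\big)
  + \operatorname{Var}(\varepsilon)
```

The first term is the squared bias, the second is the variance in model fits, and the third is the variance of the irreducible error.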

Prediction Problem 2: Office Occupancy

Is a particular office occupied or not? To answer this question, the following variables were collected every minute in the office for about two weeks:

• date time, year-month-day hour:minute:second
• Temperature, in Celsius
• Relative Humidity, in %
• Light, in Lux
• CO2, in ppm
• Humidity Ratio, derived quantity from temperature and relative humidity, in kgwater-vapor/kg-air
• Occupancy ("Empty" or "Occupied")

The dataset was broken into three parts, a training set and two test sets.

1. Which methods from class are appropriate to try when predicting a binary response variable such as "Occupancy" using four predictor variables such as "time of day" (hour), "light" (Lux), "CO2" (ppm) and "humidity ratio" (kgwater-vapor/kg-air)?

2. Based on the following histograms of three predictor variables by occupancy, do you expect that either LDA or QDA will work well to predict whether or not the office is empty? Briefly explain.

8. Write down the prediction model from the trained logistic regression model shown in the following R output:

model

Call:  glm(formula = Occupancy ~ Light + CO2 + HumidityRatio + hour + I(hour^2), family = "binomial", data = rr)

Coefficients:
  (Intercept)          Light            CO2  HumidityRatio           hour      I(hour^2)
    -6.746022       0.020607       0.004841      21.519182            -0.             0.

Degrees of Freedom: 8142 Total (i.e. Null);  8137 Residual
Null Deviance:      8420
Residual Deviance:  1045     AIC: 1057

9. Use the above R output from logistic regression to compute the odds ratio of the office being occupied for a one Lux increase in the amount of light detected.
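One way to set up this computation is sketched below. It relies on the standard fact that, in a logistic regression, a one-unit increase in a predictor multiplies the odds by the exponential of that predictor's coefficient when the other predictors are held fixed. The variable names `beta_light` and `odds_ratio` are illustrative; the coefficient value is the one printed under "Light" in the R output.

```r
# Coefficient of Light as printed in the glm output (log-odds scale).
beta_light <- 0.020607

# A one-unit (one Lux) increase multiplies the odds of "Occupied"
# by exp(coefficient), holding the other predictors fixed.
odds_ratio <- exp(beta_light)
print(odds_ratio)
```

An odds ratio slightly above 1 here means each additional Lux raises the odds of occupancy by roughly 2%.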

10. Briefly describe how 5-fold cross validation could have been used to select the best model instead of a single training and test set. You don’t need to write code, just explain the idea in a sentence or two.
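For readers who want to see the mechanics, here is a minimal sketch of 5-fold cross validation in base R. It does not use the occupancy data: the data frame `sim` is simulated, and the two candidate formulas and the 0.5 classification cutoff are hypothetical choices made only for illustration.

```r
set.seed(1)

# Simulated stand-in data (not the occupancy dataset): one predictor x
# and a binary response y generated from a logistic model.
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * x))
sim <- data.frame(x = x, y = y)

# Randomly assign each row to one of 5 folds of equal size.
K <- 5
fold <- sample(rep(1:K, length.out = n))

# Candidate models to compare (hypothetical choices for illustration).
formulas <- list(linear = y ~ x, quadratic = y ~ x + I(x^2))

cv_error <- sapply(formulas, function(f) {
  fold_err <- numeric(K)
  for (k in 1:K) {
    held_out <- sim[fold == k, ]
    # Fit on the other four folds, predict on the held-out fold.
    fit <- glm(f, family = "binomial", data = sim[fold != k, ])
    p_hat <- predict(fit, newdata = held_out, type = "response")
    pred <- as.integer(p_hat > 0.5)
    fold_err[k] <- mean(pred != held_out$y)   # misclassification rate
  }
  mean(fold_err)   # average over the 5 held-out folds
})

print(cv_error)
best <- names(which.min(cv_error))
```

Each observation is held out exactly once, so every model is scored on all of the data, and the model with the smallest average held-out error is the one selected.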