Data, Method, Experiment - Machine Learning - Homework 2 | CS 545, Assignments of Computer Science

Material Type: Assignment; Professor: Anderson; Class: Machine Learning; Subject: Computer Science; University: Colorado State University; Term: Fall 2009;


Uploaded on 11/08/2009

CS545: Assignment 2

Zach Cashero

September 15, 2009

Contents

1 Introduction
2 Data
3 Method
  3.1 Converting Data to Numeric
  3.2 Partitioning the Data
  3.3 Standardizing the Data
  3.4 Linear Model
4 Experiments
  4.1 Varying λ Values and Dropping Attributes
  4.2 Different Training Set Fractions
  4.3 Looking at Min RMSE Values
  4.4 Scatterplot
  4.5 Analyzing Standard Deviation
5 Conclusions

1 Introduction

Linear regression is a simple approach to building a model that predicts a target variable. Here, each data sample consists of a vector of input variables and a single target variable. A model is trained on a set of these samples in order to assign a weight to each input variable, or attribute. Throughout this report, the simplest form of linear regression is used: the model is linear in both the weights and the input variables. All algorithms were implemented in R.

The first part of the report describes the data set and introduces the problem. The next sections focus on the R code used to implement the algorithms and experiments. The data is then analyzed and visualized in several forms, and a discussion of the results follows each experiment.

2 Data

The data set comes from the UCI Machine Learning Repository. We will be looking at the abalone data, which consists of a set of measurements of the abalone sea snail. Each sample consists of nine attributes: sex, length, diameter, height, whole weight, shucked weight, viscera weight, shell weight, and rings. There are 4177 samples in this data set. The problem is to predict the number of rings from the other eight attributes. The number of rings is correlated with the animal's age. Here is a brief summary of the statistics for all attributes besides sex:

Figure 1: Distribution of Target Values. (Histogram; x-axis: Rings, y-axis: Frequency.)

      Length  Diam   Height  Whole  Shucked  Viscera  Shell  Rings
Min   0.075   0.055  0.000   0.002  0.001    0.001    0.002   1
Max   0.815   0.650  1.130   2.826  1.488    0.760    1.005  29
Mean  0.524   0.408  0.140   0.829  0.359    0.181    0.239   9.934
SD    0.120   0.099  0.042   0.490  0.222    0.110    0.139   3.224

A distribution of the target values can be viewed with the following statement in R:

hist(abalone[, "Rings"], breaks = 1:max(abalone[, "Rings"]), xlab = "Rings")

This distribution is shown in Figure 1. The histogram shows that the majority of the data samples have between 5 and 15 rings. This suggests that any model built from this data will predict more reliably in this range, where it has the most examples to learn from.

The pairs() function in R gives a good initial idea of the relationships between the different attributes. This is shown in Figure 2. The sex attribute was not included in this plot. The most interesting part of this figure is how all the other attributes relate to rings; these plots are in the right column and the bottom row. Rings is visibly correlated with every other attribute. Though the correlations differ somewhat, they all appear to be mostly linear in nature.

3 Method

This section describes the R implementation used to manipulate and prepare the data. It also shows how

the linear model is computed and used to make predictions.

X <- abalone[1:(numDim - 1)]
T <- abalone[, numDim, drop=FALSE]
X <- apply(X, 2, as.numeric)
T <- apply(T, 2, as.numeric)

Since there is only one data set available, the data needs to be partitioned into training and testing sets. To divide the data, samples are picked randomly from the original set and placed into the training set until it reaches a given size. The remaining samples constitute the testing set. The partition is referred to by the fraction of the original data used for the training set, such as 0.5. The following R code defines a function that in turn creates and returns a function. The makePartitionF function should be called whenever a new random partition of the data is desired. It takes the data set and a training set fraction as arguments.

makePartitionF <- function(data, trainFraction) {
  numSamples <- nrow(data)
  randorder <- sample(numSamples)
  nTrain <- round(trainFraction * numSamples)
  function(newData, isTrainSet=TRUE) {
    reorderedData <- newData[randorder, , drop=FALSE]
    if (isTrainSet)
      reorderedData[1:nTrain, , drop=FALSE]
    else
      reorderedData[(nTrain+1):numSamples, , drop=FALSE]
  }
}

Once the makePartitionF function has been called, the data can be partitioned by calling the returned partition function with the data and a boolean value that signifies whether the training set or the testing set is wanted.

partition <- makePartitionF(X, 0.5)
Xtrain <- partition(X, isTrainSet=TRUE)
Ttrain <- partition(T, isTrainSet=TRUE)
Xtest <- partition(X, isTrainSet=FALSE)
Ttest <- partition(T, isTrainSet=FALSE)
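Because the random order is drawn once inside makePartitionF and captured by the returned function, X and T are split with the same row permutation. The following is a minimal sketch that checks this on toy data; the matrices Xtoy and Ttoy are made up for illustration and are not part of the report.

```r
# Same closure-based partitioner as in the report, repeated here so the
# example runs on its own.
makePartitionF <- function(data, trainFraction) {
  numSamples <- nrow(data)
  randorder <- sample(numSamples)              # one permutation, fixed at creation
  nTrain <- round(trainFraction * numSamples)
  function(newData, isTrainSet = TRUE) {
    reordered <- newData[randorder, , drop = FALSE]
    if (isTrainSet) {
      reordered[1:nTrain, , drop = FALSE]
    } else {
      reordered[(nTrain + 1):numSamples, , drop = FALSE]
    }
  }
}

# Hypothetical toy data: the target is 10 times the first input column,
# so row alignment between inputs and targets is easy to verify.
set.seed(1)
Xtoy <- matrix(1:20, nrow = 10, ncol = 2)
Ttoy <- matrix(10 * Xtoy[, 1], ncol = 1)

partition <- makePartitionF(Xtoy, 0.7)
XtrainToy <- partition(Xtoy, isTrainSet = TRUE)
TtrainToy <- partition(Ttoy, isTrainSet = TRUE)
XtestToy  <- partition(Xtoy, isTrainSet = FALSE)
TtestToy  <- partition(Ttoy, isTrainSet = FALSE)

dim(XtrainToy)                                # 7 rows for training
dim(XtestToy)                                 # 3 rows for testing
all(TtrainToy[, 1] == 10 * XtrainToy[, 1])    # rows of X and T still match
```

The sketch assumes a training fraction strictly between 0 and 1; with a fraction of 1 the test index (nTrain + 1):numSamples would count backwards and fail.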

3.3 Standardizing the Data

The data should also be standardized before working with it. This means converting the data to a form with zero mean and unit variance. For each attribute, this is done by subtracting the attribute's mean from each value and dividing by the standard deviation. The following R code was provided by the instructor and uses the same approach of creating and returning a new function. The makeStandardizeF function calculates the mean and standard deviation of the data set passed into it, and the returned function uses those values for standardizing. This means that the training set should be passed into makeStandardizeF so that both the training and testing sets are standardized based on the training set. Along with standardizing, the returned function adds a column of 1's to the data to serve as the bias.

makeStandardizeF <- function(X) {
  if (missing(X)) {
    cat("Usage:
  standardize <- makeStandardizeF(X) ## X is nSamples x nDimensions
  Xs <- standardize(X)
  X2s <- standardize(X2)\n")
    return(invisible())
  }
  mu <- colMeans(X)
  sigma <- apply(X, 2, sd) ### column-wise standard deviations (sd() on a whole matrix is no longer column-wise in current R)
  sigma[sigma==0] <- 1 ### replace any values with 0 standard deviation
  function(newX) {
    nr <- nrow(newX)
    nc <- ncol(newX)
    newXs <- (newX - matrix(mu, nr, nc, byrow=TRUE)) / matrix(sigma, nr, nc, byrow=TRUE)
    newXs <- cbind(1, newXs)
    colnames(newXs)[1] <- "bias"
    return(newXs)
  }
}

The use of this function is shown in the next section when creating and using the linear model.
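As a quick check of what standardizing on the training set means, the sketch below applies a compact variant of the standardizer to made-up matrices (XtrainToy and XtestToy are hypothetical stand-ins, not the abalone data). The training columns come out with mean 0 and standard deviation 1, while the test columns generally do not, because they are scaled with the training statistics.

```r
# Compact variant of the standardizer (usage message omitted);
# apply(X, 2, sd) is used for column-wise standard deviations.
makeStandardizeF <- function(X) {
  mu <- colMeans(X)
  sigma <- apply(X, 2, sd)
  sigma[sigma == 0] <- 1
  function(newX) {
    nr <- nrow(newX)
    nc <- ncol(newX)
    newXs <- (newX - matrix(mu, nr, nc, byrow = TRUE)) /
      matrix(sigma, nr, nc, byrow = TRUE)
    newXs <- cbind(1, newXs)
    colnames(newXs)[1] <- "bias"   # requires newX to have column names
    newXs
  }
}

# Hypothetical toy matrices standing in for the training and testing inputs.
set.seed(2)
XtrainToy <- matrix(rnorm(40, mean = 5, sd = 3), nrow = 20,
                    dimnames = list(NULL, c("a", "b")))
XtestToy <- matrix(rnorm(10, mean = 5, sd = 3), nrow = 5,
                   dimnames = list(NULL, c("a", "b")))

standardize <- makeStandardizeF(XtrainToy)   # statistics come from the training set
XtrainS <- standardize(XtrainToy)
XtestS <- standardize(XtestToy)

colnames(XtrainS)                    # bias column first, then the inputs
round(colMeans(XtrainS[, -1]), 10)   # training columns: mean 0
apply(XtrainS[, -1], 2, sd)          # training columns: standard deviation 1
colMeans(XtestS[, -1])               # test columns: generally not exactly 0
```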

3.4 Linear Model

The linear model was created using the ridge regression model, which incorporates a λ value that determines

how much the weights are regularized. The weights were computed with the following equation:

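$$w = (X^{\top} X + \lambda I)^{-1} X^{\top} T$$

Here X is the standardized input matrix, including the bias column, and T is the column of target values. In the implementation below, the diagonal entry of I that corresponds to the bias is set to 0, so the bias weight is not regularized.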
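For reference, this expression can be derived as follows, assuming the usual ridge objective of squared error plus λ times the squared weights (the report does not state the objective explicitly), with X the standardized input matrix and T the column of targets:

```latex
\begin{aligned}
E(w) &= (Xw - T)^{\top}(Xw - T) + \lambda\, w^{\top} w \\
\nabla_w E(w) &= 2 X^{\top}(Xw - T) + 2 \lambda w = 0 \\
(X^{\top} X + \lambda I)\, w &= X^{\top} T \\
w &= (X^{\top} X + \lambda I)^{-1} X^{\top} T
\end{aligned}
```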

The llsMake function shown below creates a linear model from input data, target values, and a λ value. Since the model is built from whatever data is passed in, this function should only be used with the training data. The return value is a list containing the weights and the standardize function that was created with the training data.

llsMake <- function(X, Y, lambda) {
  # standardize the training data
  standardize <- makeStandardizeF(X)
  Xs <- standardize(X)
  iden <- diag(c(0, rep(1, ncol(Xs) - 1)))
  w = (solve(t(Xs) %*% Xs + lambda * iden) %*% t(Xs) %*% Y)
  return(list(weights=w, standardizeFunc=standardize))
}

The llsUse function uses the model created above to make predictions on the data passed in as X. The first line standardizes the data with the function created in llsMake. The predictions are then computed by multiplying the standardized data matrix by the weights.

llsUse <- function(model, X) {
  Xs <- model$standardizeFunc(X)
  Xs %*% model$weights
}
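To see the fit and predict steps working together, here is a self-contained sketch on synthetic data. The helper names standardizeWith, ridgeFit, ridgePredict, and rmse are hypothetical stand-ins that mirror the structure of makeStandardizeF, llsMake, and llsUse, and the data is generated purely for illustration.

```r
# Hypothetical synthetic data: two informative inputs plus a little noise.
set.seed(3)
n <- 200
Xtoy <- cbind(a = rnorm(n), b = runif(n))
Ttoy <- matrix(3 + 2 * Xtoy[, "a"] - 4 * Xtoy[, "b"] + rnorm(n, sd = 0.1),
               ncol = 1)

# Standardize with the statistics of X and prepend a bias column.
standardizeWith <- function(X) {
  mu <- colMeans(X)
  sigma <- apply(X, 2, sd)
  sigma[sigma == 0] <- 1
  function(newX) cbind(bias = 1, sweep(sweep(newX, 2, mu, "-"), 2, sigma, "/"))
}

# Ridge fit; the bias weight is left unpenalized, as in llsMake.
ridgeFit <- function(X, Y, lambda) {
  standardize <- standardizeWith(X)
  Xs <- standardize(X)
  iden <- diag(c(0, rep(1, ncol(Xs) - 1)))
  w <- solve(t(Xs) %*% Xs + lambda * iden, t(Xs) %*% Y)
  list(weights = w, standardizeFunc = standardize)
}

ridgePredict <- function(model, X) model$standardizeFunc(X) %*% model$weights

rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

trainRows <- 1:100
model <- ridgeFit(Xtoy[trainRows, ], Ttoy[trainRows, , drop = FALSE], lambda = 1)
rmse(ridgePredict(model, Xtoy[trainRows, ]), Ttoy[trainRows, ])     # train RMSE
rmse(ridgePredict(model, Xtoy[-trainRows, ]), Ttoy[-trainRows, ])   # test RMSE
```

The sketch calls solve(A, b) rather than forming the inverse explicitly; both give the same weights here, and solving the linear system directly is the more common numerical choice.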

4 Experiments

The main data used in the following experiments is generated by the R loops below, which iterate over a series of λ values and training set fractions. For each combination, the root mean squared error (RMSE) is calculated for the training and test sets, and this is repeated for 200 random partitions using the same λ and training set fraction.

lambdas <- seq(0, 5, by=0.25)
trainingSetResults <- NULL
for (trainFraction in seq(0.1, 0.9, by=0.1)) {
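The remainder of this loop is not available in this copy. The following is only a sketch of how such a loop could be organized, not the author's code: the helper fitPredictRMSE, the loop nesting, the toy data, and the reduced repetition count are all assumptions. The column order (fraction, λ, train RMSE, test RMSE) is inferred from how trainingSetResults is indexed in the later listings.

```r
# Hypothetical stand-in data; the report uses the abalone inputs and targets.
set.seed(4)
n <- 120
Xtoy <- cbind(a = rnorm(n), b = rnorm(n), c = rnorm(n))
Ttoy <- matrix(1 + Xtoy %*% c(2, -1, 0) + rnorm(n), ncol = 1)

# Fit ridge weights on standardized training inputs (unpenalized bias column)
# and return the train and test RMSE.
fitPredictRMSE <- function(Xtr, Ttr, Xte, Tte, lambda) {
  mu <- colMeans(Xtr)
  sigma <- apply(Xtr, 2, sd)
  std <- function(X) cbind(1, sweep(sweep(X, 2, mu, "-"), 2, sigma, "/"))
  Xs <- std(Xtr)
  iden <- diag(c(0, rep(1, ncol(Xs) - 1)))
  w <- solve(t(Xs) %*% Xs + lambda * iden, t(Xs) %*% Ttr)
  rmse <- function(X, Tgt) sqrt(mean((std(X) %*% w - Tgt)^2))
  c(trainRMSE = rmse(Xtr, Ttr), testRMSE = rmse(Xte, Tte))
}

lambdas <- seq(0, 5, by = 0.25)
fractions <- seq(0.1, 0.9, by = 0.1)
nReps <- 3   # the report uses 200 random partitions; kept small here

trainingSetResults <- NULL
for (trainFraction in fractions) {
  for (lambda in lambdas) {
    for (r in seq_len(nReps)) {
      ord <- sample(n)                        # new random partition
      nTrain <- round(trainFraction * n)
      tr <- ord[1:nTrain]
      te <- ord[(nTrain + 1):n]
      res <- fitPredictRMSE(Xtoy[tr, , drop = FALSE], Ttoy[tr, , drop = FALSE],
                            Xtoy[te, , drop = FALSE], Ttoy[te, , drop = FALSE],
                            lambda)
      trainingSetResults <- rbind(trainingSetResults,
                                  c(trainFraction, lambda, res))
    }
  }
}
colnames(trainingSetResults) <- c("fraction", "lambda", "trainRMSE", "testRMSE")
dim(trainingSetResults)   # 9 fractions x 21 lambdas x 3 repetitions = 567 rows
```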

Figure 3: Train and test RMSE at different λ values with a training set fraction of 0.5. The left plot uses a model with all attributes, while the right plot uses a model with the least significant attributes dropped. (y-axis: RMSE; legend: Train - All Attributes, Test - All Attributes, Train - Dropped Attributes, Test - Dropped Attributes.)

sortedW <- meanWeights[order(abs(meanWeights), decreasing=TRUE)]
keepWeights <- names(sortedW[1:(length(sortedW) - 3)])

The kept weights can then be used to recompute all of the predictions with a modified model. The model is created and then modified using:

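model <- llsMake(Xtrain, Ttrain, lambda)
# assign to the full element name: model$w <- ... would add a new element "w" and leave model$weights unchanged
model$weights <- model$weights[keepWeights, , drop=FALSE]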
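Once some weights are removed, the matching columns of the standardized inputs have to be removed as well, or the matrix product in the prediction step no longer conforms. The report's code for this step is not shown here, so the following is only a sketch with made-up weights and inputs; it drops the two smallest weights rather than three so that the toy example stays small.

```r
# Hypothetical weights and standardized inputs, named the way llsMake and
# the standardize function would name them.
w <- matrix(c(9.9, 1.5, -0.02, 0.7, 0.01), ncol = 1,
            dimnames = list(c("bias", "a", "b", "c", "d"), NULL))
set.seed(5)
Xs <- cbind(bias = 1, matrix(rnorm(24), nrow = 6,
                             dimnames = list(NULL, c("a", "b", "c", "d"))))

# Rank weights by magnitude and drop the two smallest.
meanWeights <- w[, 1]
sortedW <- meanWeights[order(abs(meanWeights), decreasing = TRUE)]
keepWeights <- names(sortedW[1:(length(sortedW) - 2)])
keepWeights   # "bias" "a" "c"

# Keep only the matching rows of w and columns of Xs so the product conforms.
wKept <- w[keepWeights, , drop = FALSE]
predsKept <- Xs[, keepWeights, drop = FALSE] %*% wKept
predsFull <- Xs %*% w
max(abs(predsKept - predsFull))   # small, because the dropped weights are tiny
```

Selecting by name relies on the weights carrying row names and the standardized inputs carrying column names, which holds when the inputs have column names to begin with.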

The following R code creates the plot that is shown in Figure 3. The meanResults matrix comes from the original loop that uses a model with all attributes, and newMeanResults comes from the model with dropped attributes.

yRange <- range(c(meanResults[, 2:3], newMeanResults[, 2:3]))
plot(meanResults[, 1], meanResults[, 2], type="l", xlab=expression(lambda),
     ylab="RMSE", ylim=yRange)
lines(meanResults[, 1], meanResults[, 3], type="l", lty=2)
lines(newMeanResults[, 1], newMeanResults[, 2], type="l", col="red")
lines(newMeanResults[, 1], newMeanResults[, 3], type="l", lty=2, col="red")
legend("center", c("Train - All Attributes", "Test - All Attributes",
                   "Train - Dropped Attributes", "Test - Dropped Attributes"),
       col=c("black", "black", "red", "red"), lty=c(1, 2, 1, 2))

In Figure 3, the effects of λ can first be analyzed by looking at the model with all attributes. Since we mainly care about minimizing the test RMSE, the dashed lines are the most interesting. There appears to be a minimum with λ somewhere between 1 and 2: the RMSE increases both as λ increases beyond that range and as λ decreases below it. This indicates that reducing the model complexity does help to minimize the RMSE, but that λ should not be increased too much, because the model then becomes less representative of the data as a whole.

Figure 3 also shows that the model with dropped attributes did not perform as well as the model with all attributes. The train RMSE did decrease, but the more important test RMSE increased. This suggests that the model fit the training data better but overfit a little more than the original model.

4.2 Different Training Set Fractions

This experiment will analyze the effects of the training set fraction on the test and train RMSE. The mean

RMSE was calculated for each unique combination of training set fraction and λ using the following R code.

meanTrainResults <- NULL
for (f in unique(trainingSetResults[, 1])) {
  for (l in unique(trainingSetResults[, 2])) {
    mask <- apply(trainingSetResults[, 1:2], 1, function(cs) all(cs==c(f, l)))
    meanTrainResults <- rbind(meanTrainResults, colMeans(trainingSetResults[mask, , drop=
  }
}
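The same per-combination averaging can also be written with aggregate() from base R. This is an alternative sketch, not the method used in the report, and the small results matrix below is made up so the example runs on its own; its column names are hypothetical labels for fraction, λ, train RMSE, and test RMSE.

```r
# Hypothetical results with two repetitions per (fraction, lambda) combination.
trainingSetResults <- cbind(
  fraction  = rep(c(0.1, 0.5), each = 4),
  lambda    = rep(rep(c(0, 1), each = 2), times = 2),
  trainRMSE = c(1.0, 1.2, 1.1, 1.3, 2.0, 2.2, 2.1, 2.3),
  testRMSE  = c(3.0, 3.2, 3.1, 3.3, 2.5, 2.7, 2.6, 2.8)
)

# Group by fraction and lambda, then average both RMSE columns.
resultsDF <- as.data.frame(trainingSetResults)
meanTrainResults <- aggregate(cbind(trainRMSE, testRMSE) ~ fraction + lambda,
                              data = resultsDF, FUN = mean)

# Sort by fraction, then lambda, to match the order of the nested loops.
meanTrainResults <- meanTrainResults[order(meanTrainResults$fraction,
                                           meanTrainResults$lambda), ]
meanTrainResults
```

Note that aggregate() returns a data frame rather than a matrix, so later code that indexes by column number would still work, but matrix-only operations would need as.matrix() first.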

This data can then be visualized in nine different plots, one for each training set fraction. The code below

shows how to produce this plot which is shown in Figure 4.

xRange <- range(meanTrainResults[, 2])
yRange <- range(meanTrainResults[, 3:4])
par(mfrow=c(3, 3))
count <- 0
for (f in unique(meanTrainResults[, 1])) {
  mask <- meanTrainResults[, 1]==f
  plot(meanTrainResults[mask, 2], meanTrainResults[mask, 3], type="l", xlim=xRange, ylim=yR
  lines(meanTrainResults[mask, 2], meanTrainResults[mask, 4], type="l", lty=2)
  count <- count + 1
  if (count == length(unique(meanTrainResults[, 1])))
    legend("topright", c("Test", "Train"), lty=c(2, 1))
}

Figure 4 shows a clear pattern. As the training set fraction increases, the train RMSE increases and the test RMSE decreases. This makes sense: a small training set has less variation, so the model can fit that data more specifically, which results in a small train RMSE. However, that model does not generalize well to the rest of the data, which is why the test RMSE is so high. As the training set size increases, there is more variation across the training data, which makes the predictions more difficult and raises the train RMSE. With a larger training set, though, the hope is that the model better represents all of the data. This seems to be true, since it made better predictions for the test set and achieved a lower test RMSE.

Figure 4 also shows how the training set size interacts with the λ values. Changing λ has a much more dramatic effect with the smaller training sets. As the training set fraction increases, the lines almost level out, showing that λ no longer has a significant effect. This also makes sense, because regularization matters most with a small training set, where the risk of overfitting is higher.

4.3 Looking at Min RMSE Values

This experiment will isolate those points at which λ was at an optimal value by finding where the test RMSE

was at a minimum for each training set fraction. First, those data points were extracted into a separate

matrix with the following R code.

minRMSEvalues <- NULL
for (f in unique(meanTrainResults[, 1])) {
  subset <- meanTrainResults[meanTrainResults[, 1]==f, ]
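The rest of this listing is not available in this copy. One way the extraction could be finished is sketched below, using which.min() on the test RMSE column; this is an assumption about the approach, the variable rowsForF is a hypothetical name, and the numbers are made up purely so the example runs.

```r
# Made-up mean results: fraction, lambda, train RMSE, test RMSE.
meanTrainResults <- cbind(
  fraction  = rep(c(0.1, 0.5), each = 3),
  lambda    = rep(c(0, 1, 2), times = 2),
  trainRMSE = c(1.90, 2.00, 2.10, 2.15, 2.16, 2.17),
  testRMSE  = c(2.60, 2.40, 2.50, 2.24, 2.25, 2.26)
)

minRMSEvalues <- NULL
for (f in unique(meanTrainResults[, 1])) {
  # All rows for this training set fraction, kept as a matrix.
  rowsForF <- meanTrainResults[meanTrainResults[, 1] == f, , drop = FALSE]
  best <- which.min(rowsForF[, 4])       # row with the lowest test RMSE
  minRMSEvalues <- rbind(minRMSEvalues, rowsForF[best, ])
}
minRMSEvalues   # one row per fraction: the lambda that minimizes test RMSE
```

The sketch avoids naming a variable subset, since that name masks the base R function subset() within the session.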

Figure 5: These plots look at the RMSE and λ values for each training set fraction at their minimum points. (Left panel: Min RMSE for Different Partitions, Min Average RMSE against Training Set Fraction. Right panel: λ Values for Different Partitions, against Training Set Fraction.)

par(mfrow=c(1, 2))
xRange <- range(c(allTrainPreds[, 2], allTestPreds[, 2]))
yRange <- range(c(allTrainPreds[, 1], allTestPreds[, 1]))
plot(allTrainPreds[, 2], allTrainPreds[, 1], pch=18, ylab="Actual Rings Value", xlab="Predict
abline(0, 1, col="red")
plot(allTestPreds[, 2], allTestPreds[, 1], pch=18, ylab="Actual Rings Value", xlab="Predicted
abline(0, 1, col="red")

The first observation from Figure 6 is that both plots look very similar. The red line drawn on top of each plot represents a perfect model, where the predicted value equals the actual value. The more tightly the points cluster around this line, the better the model. The model is clearly imperfect, especially at the extremes: it is less accurate for ring values below about 8 and above about 16 or 17. Within this range, however, the predicted values are mainly grouped around the line. One possible reason is the initial distribution of ring values. Looking back at Figure 1, most of the data lies within this range, which could indicate that the model did not have enough data for the more extreme ring values to make accurate predictions there.

4.5 Analyzing Standard Deviation

This experiment analyzes the effects of λ and training set fraction on the standard deviation σ of the RMSE. Since the previous experiments only dealt with the means, a new matrix is needed that contains σ for each unique combination of λ and training set fraction. The following R code accomplishes this.

sdTrainResults <- NULL
for (f in unique(trainingSetResults[, 1])) {
  for (l in unique(trainingSetResults[, 2])) {
    mask <- apply(trainingSetResults[, 1:2], 1, function(cs) all(cs==c(f, l)))
    sdTrainResults <- rbind(sdTrainResults, c(f, l, sd(trainingSetResults[mask, 3:4, dr
  }
}

Figure 6: Plots of actual vs. predicted values for the train and test sets. (Panels: Training Set, Test Set; x-axis: Predicted Rings Value, y-axis: Actual Rings Value.)
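The per-combination standard deviations can be sketched on a small made-up results matrix as shown below. This sketch uses apply(..., 2, sd) to get one standard deviation per column, because in current versions of R calling sd() on a matrix returns a single value for all entries rather than one per column. The column names and numbers are hypothetical.

```r
# Made-up results: fraction, lambda, train RMSE, test RMSE (3 repetitions each).
trainingSetResults <- cbind(
  fraction  = rep(0.5, 6),
  lambda    = rep(c(0, 1), each = 3),
  trainRMSE = c(2.0, 2.1, 2.2, 2.1, 2.1, 2.1),
  testRMSE  = c(2.3, 2.5, 2.7, 2.2, 2.4, 2.6)
)

sdTrainResults <- NULL
for (f in unique(trainingSetResults[, 1])) {
  for (l in unique(trainingSetResults[, 2])) {
    # Rows belonging to this (fraction, lambda) combination.
    mask <- trainingSetResults[, 1] == f & trainingSetResults[, 2] == l
    # Column-wise standard deviation of the train and test RMSE.
    sds <- apply(trainingSetResults[mask, 3:4, drop = FALSE], 2, sd)
    sdTrainResults <- rbind(sdTrainResults, c(fraction = f, lambda = l, sds))
  }
}
sdTrainResults
```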

The data can then be plotted in a figure similar to the earlier one. This code creates a figure with nine plots, one for each training set fraction, as shown in Figure 7. Each plot shows σ against the λ values.

xRange <- range(sdTrainResults[, 2])
yRange <- range(sdTrainResults[, 3:4])
par(mfrow=c(3, 3))
count <- 0
for (f in unique(sdTrainResults[, 1])) {
  mask <- sdTrainResults[, 1]==f
  plot(sdTrainResults[mask, 2], sdTrainResults[mask, 3], t
  lines(sdTrainResults[mask, 2], sdTrainResults[mask, 4], type="l", lty=2)
  count <- coun
  if (count == length(unique(sdTrainResults[, 1])))
    legend("right", c("Test", "Train"), lty=c(2, 1))
}

Figure 7 shows that as the training set fraction increases, the test σ increases and the train σ decreases. The train σ can be explained by the size of the training set used to build the model: as the training set grows, the model has more data to train on, and the train σ decreases steadily. The best training set fraction with respect to σ appears to be somewhere in the middle, between 0.4 and 0.6, where both train σ and test σ are low and almost equal to each other. It is also important to note that λ has no significant effect on σ, since all of the plots are almost straight lines.