Data, Method, Experiment - Machine Learning - Homework 2 | CS 545, Assignments of Computer Science

Colorado State University (CSU), Computer Science

Prof. Charles Anderson

Material Type: Assignment; Professor: Anderson; Class: Machine Learning; Subject: Computer Science; University: Colorado State University; Term: Fall 2009;


Uploaded on 11/08/2009

CS545: Assignment 2

Zach Cashero

September 15, 2009

Contents

1 Introduction
2 Data
3 Method
  3.1 Converting Data to Numeric
  3.2 Partitioning the Data
  3.3 Standardizing the Data
  3.4 Linear Model
4 Experiments
  4.1 Varying λ Values and Dropping Attributes
  4.2 Different Training Set Fractions
  4.3 Looking at Min RMSE Values
  4.4 Scatterplot
  4.5 Analyzing Standard Deviation
5 Conclusions

1 Introduction

Linear regression is a simple approach to building a model that predicts a target variable. In this case, each data sample consists of a vector of input variables and a single target variable. A model is trained on a set of these samples in order to assign a weight to each input variable, or attribute. Throughout this report, the simplest form of linear regression is used: a model that is linear in both the weights and the input variables. All algorithms were implemented in R.

The first part of the report describes the data set and introduces the problem. The next sections focus on the R code used to implement the algorithms and experiments. The data is then analyzed and visualized in several forms, and a discussion of the results follows each experiment.

2 Data

The data set comes from the UCI Machine Learning Repository. We will be looking at the abalone data, which consists of a set of measurements for the abalone sea snail. Each sample consists of nine values: eight input attributes (sex, length, diameter, height, whole weight, shucked weight, viscera weight, and shell weight) and the target, rings. There are 4177 samples in this data set. The problem is to predict the number of rings from the other eight attributes. The number of rings is correlated with the animal's age. Here is a brief summary of the statistics for all attributes besides sex:


            Length  Diam   Height  Whole  Shucked  Viscera  Shell  Rings
    Min     0.075   0.055  0.000   0.002  0.001    0.001    0.002   1
    Max     0.815   0.650  1.130   2.826  1.488    0.760    1.005  29
    Mean    0.524   0.408  0.140   0.829  0.359    0.181    0.239   9.934
    SD      0.120   0.099  0.042   0.490  0.222    0.110    0.139   3.224

Figure 1: Distribution of Target Values (histogram of Rings against Frequency).

A distribution of the target values can be viewed with the following statement in R:

    hist(abalone[, "Rings"], breaks = 1:max(abalone[, "Rings"]), xlab = "Rings")

This distribution is shown in Figure 1. The histogram shows that the majority of the samples have between 5 and 15 rings. This suggests that any model built from this data will be most accurate in this range, where it has the most examples to learn from.

The pairs() function in R gives a good initial idea of the relationships between the attributes. Its output is shown in Figure 2; the sex attribute was not included in this plot. The most interesting part of the figure is how the other attributes relate to rings, which is shown in the right column and the bottom row. Rings is clearly correlated with every other attribute. Though the correlations differ somewhat, they all appear to be mostly linear.

3 Method

This section describes the R implementation used to manipulate and prepare the data. It also shows how the linear model is computed and used to make predictions.

    # Inputs are all columns but the last; the target is the last column
    X <- abalone[1:(numDim - 1)]
    T <- abalone[, numDim, drop = FALSE]
    # Convert every column to numeric
    X <- apply(X, 2, as.numeric)
    T <- apply(T, 2, as.numeric)

3.2 Partitioning the Data

Since there is only one data set available, the data needs to be partitioned into training and testing sets. To divide the data, samples are picked randomly from the original set and placed into the training set until it reaches a given size. The remaining samples constitute the testing set. A partition is referred to by the fraction of the original data used for the training set, such as 0.5. The following R code defines a function that in turn creates and returns a function. The makePartitionF function should be called whenever a new random partition of the data is desired. It takes the data set and a training set fraction as arguments.

    makePartitionF <- function(data, trainFraction) {
      numSamples <- nrow(data)
      # Fix one random ordering of the rows; every later call reuses it
      randorder <- sample(numSamples)
      nTrain <- round(trainFraction * numSamples)
      function(newData, isTrainSet = TRUE) {
        reorderedData <- newData[randorder, , drop = FALSE]
        if (isTrainSet)
          reorderedData[1:nTrain, , drop = FALSE]
        else
          reorderedData[(nTrain + 1):numSamples, , drop = FALSE]
      }
    }

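Once makePartitionF has been called, the data can be partitioned by subsequent calls to the returned partition function, passing in the data and a boolean value that signifies whether the training set or the testing set is wanted. Because the random ordering is fixed when makePartitionF is called, X and T are split along the same rows.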

    partition <- makePartitionF(X, 0.5)
    Xtrain <- partition(X, isTrainSet = TRUE)
    Ttrain <- partition(T, isTrainSet = TRUE)
    Xtest <- partition(X, isTrainSet = FALSE)
    Ttest <- partition(T, isTrainSet = FALSE)
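The closure behavior described above can be checked on a small example. The following is a minimal sketch, assuming toy matrices (toyX, toyT, and the seed are invented for illustration and are not part of the report); makePartitionF is repeated only so that the block runs on its own.

```r
# Same closure as in the text, repeated so the example is self-contained.
makePartitionF <- function(data, trainFraction) {
  numSamples <- nrow(data)
  randorder <- sample(numSamples)
  nTrain <- round(trainFraction * numSamples)
  function(newData, isTrainSet = TRUE) {
    reorderedData <- newData[randorder, , drop = FALSE]
    if (isTrainSet)
      reorderedData[1:nTrain, , drop = FALSE]
    else
      reorderedData[(nTrain + 1):numSamples, , drop = FALSE]
  }
}

# Hypothetical toy data: 10 samples, 2 inputs, and a target tied to input "a".
toyX <- matrix(1:20, nrow = 10, dimnames = list(NULL, c("a", "b")))
toyT <- matrix(toyX[, "a"] * 100, ncol = 1, dimnames = list(NULL, "target"))

set.seed(42)  # only to make the sketch repeatable
partition <- makePartitionF(toyX, 0.7)
toyXtrain <- partition(toyX, isTrainSet = TRUE)
toyTtrain <- partition(toyT, isTrainSet = TRUE)
toyXtest  <- partition(toyX, isTrainSet = FALSE)
toyTtest  <- partition(toyT, isTrainSet = FALSE)

# Inputs and targets stay aligned: each target is still 100 times its "a" value.
all(toyTtrain[, 1] == 100 * toyXtrain[, "a"])
all(toyTtest[, 1] == 100 * toyXtest[, "a"])
```

One caveat of this design: with a training fraction of 1, the range (nTrain + 1):numSamples counts downward instead of being empty, so the function is best used with fractions strictly between 0 and 1.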

3.3 Standardizing the Data

The data should also be standardized before it is used. This means converting the data to a form with zero mean and unit variance. For each attribute, this is done by subtracting the attribute's mean from each value and dividing by the attribute's standard deviation. The following R code was provided by the instructor and uses the same approach of creating and returning a new function. makeStandardizeF calculates the mean and standard deviation of the data set passed into it, and the returned function uses those values for standardizing. This means that the training set should be passed into makeStandardizeF so that both the training and testing sets are standardized based on the training set. Along with standardizing, the returned function adds a column of 1's to the data to serve as the bias.

    makeStandardizeF <- function(X) {
      if (missing(X)) {
        cat("Usage:
        standardize <- makeStandardizeF(X)  ## X is nSamples x nDimensions
        Xs <- standardize(X)
        X2s <- standardize(X2)\n")
        return(invisible())
      }
      mu <- colMeans(X)
      # Column-wise standard deviations. Current versions of R no longer
      # return one value per column when sd() is applied directly to a matrix.
      sigma <- apply(X, 2, sd)
      sigma[sigma == 0] <- 1  ### replace any values with 0 standard deviation
      function(newX) {
        nr <- nrow(newX)
        nc <- ncol(newX)
        newXs <- (newX - matrix(mu, nr, nc, byrow = TRUE)) / matrix(sigma, nr, nc, byrow = TRUE)
        newXs <- cbind(1, newXs)
        colnames(newXs)[1] <- "bias"
        return(newXs)
      }
    }

The use of this function is shown in the next section when creating and using the linear model.
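As a quick check of the idea in Section 3.3, the sketch below standardizes a test set with statistics taken from the training set only. It is an illustration under assumptions: the data is synthetic, and the names trainX, testX, and standardizeWithTrainStats are hypothetical; sweep() is used here in place of the matrix construction in the instructor's code.

```r
set.seed(7)  # only to make the sketch repeatable

# Hypothetical data: two attributes on very different scales.
trainX <- cbind(small = rnorm(50, mean = 0.5, sd = 0.1),
                large = rnorm(50, mean = 200, sd = 30))
testX  <- cbind(small = rnorm(20, mean = 0.5, sd = 0.1),
                large = rnorm(20, mean = 200, sd = 30))

# The statistics come from the training set only.
mu    <- colMeans(trainX)
sigma <- apply(trainX, 2, sd)

# sweep() applies a per-column operation to every row of the matrix.
standardizeWithTrainStats <- function(newX) {
  centered <- sweep(newX, 2, mu, "-")
  cbind(bias = 1, sweep(centered, 2, sigma, "/"))
}

trainXs <- standardizeWithTrainStats(trainX)
testXs  <- standardizeWithTrainStats(testX)

colMeans(trainXs)       # attribute columns are 0 up to rounding error
apply(trainXs, 2, sd)   # attribute columns are 1 up to rounding error
colMeans(testXs)        # attribute columns are near 0, but not exactly 0
```

The test set columns are only approximately standardized, which is expected: the test data is deliberately kept out of the computation of the mean and standard deviation.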

3.4 Linear Model

The linear model was created using ridge regression, which incorporates a λ value that determines how strongly the weights are regularized. The weights were computed with the following equation, where X is the standardized training data including its bias column and T holds the target values:

$$w = (X^T X + \lambda I)^{-1} X^T T$$

The llsMake function shown below creates a linear model from input data, target values, and a λ value. Since the model depends on the data passed in, this function should only be used with the training data. The return value is a list containing the weights and the standardize function that was created from the training data.

    llsMake <- function(X, Y, lambda) {
      # standardize the training data
      standardize <- makeStandardizeF(X)
      Xs <- standardize(X)
      # identity matrix with a 0 in the bias position, so the bias is not penalized
      iden <- diag(c(0, rep(1, ncol(Xs) - 1)))
      w <- solve(t(Xs) %*% Xs + lambda * iden) %*% t(Xs) %*% Y
      return(list(weights = w, standardizeFunc = standardize))
    }

The llsUse function uses the model created above to make predictions on the data passed in as X. The first line standardizes the data with the function created in llsMake. The predictions are then computed by multiplying the standardized data matrix by the weights.

    llsUse <- function(model, X) {
      Xs <- model$standardizeFunc(X)
      Xs %*% model$weights
    }
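The ridge computation in Section 3.4 can be sanity-checked on synthetic data. The sketch below is not the report's code: the data, the helper name ridgeWeights, and the λ value of 50 are assumptions chosen for illustration. It checks two properties that should hold for this formulation: with λ = 0 the weights agree with ordinary least squares from lm(), and a larger λ shrinks the non-bias weights.

```r
set.seed(3)  # only to make the sketch repeatable
n <- 100

# Hypothetical inputs and targets (the name T is avoided because it aliases TRUE).
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
targets <- 10 + 2 * X[, "x1"] - 1 * X[, "x2"] + rnorm(n, sd = 0.3)

# Standardize the inputs and prepend the bias column, as in the text.
Xs <- cbind(bias = 1, scale(X))

ridgeWeights <- function(Xs, targets, lambda) {
  penalty <- diag(c(0, rep(1, ncol(Xs) - 1)))  # no penalty on the bias
  # solve(A, b) solves A w = b without forming the inverse explicitly
  solve(t(Xs) %*% Xs + lambda * penalty, t(Xs) %*% targets)
}

w0  <- ridgeWeights(Xs, targets, lambda = 0)
w50 <- ridgeWeights(Xs, targets, lambda = 50)

# With lambda = 0 the result matches ordinary least squares.
olsFit <- lm(targets ~ Xs[, -1])
all.equal(as.numeric(w0), as.numeric(coef(olsFit)))

# A larger lambda shrinks the non-bias weights toward zero.
sum(w50[-1]^2) < sum(w0[-1]^2)
```

Because the standardized columns have zero mean and the bias is unpenalized, the bias weight equals the mean of the targets for any λ in this sketch.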

4 Experiments

The main data used in the following experiments is generated by the loops in the R code below, which iterate over a series of λ values and training set fractions. For each combination, the root mean squared error (RMSE) is calculated for the training and test sets, and this is repeated for 200 random partitions using the same λ and training set fraction.

    lambdas <- seq(0, 5, by = 0.25)
    trainingSetResults <- NULL
    for (trainFraction in seq(0.1, 0.9, by = 0.1)) {
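The body of this loop is not available in this copy of the report. As a rough sketch only, and not the author's implementation, an experiment of the kind described (several training fractions, several λ values, repeated random partitions, train and test RMSE recorded per run) could look like the following. Everything here is an assumption made for illustration: the synthetic data, the helpers rmse and fitRidge, the reduced grids, and the use of 5 partitions instead of 200 to keep the run short.

```r
set.seed(1)  # only to make the sketch repeatable
n <- 200

# Hypothetical synthetic data standing in for the abalone inputs and targets.
X <- matrix(rnorm(n * 3), n, 3)
targets <- X %*% c(2, -1, 0.5) + 10 + rnorm(n, sd = 0.5)

rmse <- function(pred, target) sqrt(mean((pred - target)^2))

# Ridge fit on standardized inputs; the bias column is not penalized.
fitRidge <- function(X, targets, lambda) {
  mu <- colMeans(X)
  sigma <- apply(X, 2, sd)
  std <- function(Z) cbind(1, scale(Z, center = mu, scale = sigma))
  Xs <- std(X)
  penalty <- diag(c(0, rep(1, ncol(Xs) - 1)))
  w <- solve(t(Xs) %*% Xs + lambda * penalty, t(Xs) %*% targets)
  list(w = w, std = std)
}

numPartitions <- 5  # the report uses 200
results <- NULL
for (trainFraction in c(0.3, 0.7)) {
  for (lambda in c(0, 1, 5)) {
    for (r in seq_len(numPartitions)) {
      idx <- sample(n, round(trainFraction * n))  # a fresh random partition
      Xtr <- X[idx, , drop = FALSE]
      Ttr <- targets[idx, , drop = FALSE]
      Xte <- X[-idx, , drop = FALSE]
      Tte <- targets[-idx, , drop = FALSE]
      m <- fitRidge(Xtr, Ttr, lambda)
      results <- rbind(results,
                       c(trainFraction, lambda,
                         rmse(m$std(Xtr) %*% m$w, Ttr),
                         rmse(m$std(Xte) %*% m$w, Tte)))
    }
  }
}
colnames(results) <- c("fraction", "lambda", "trainRMSE", "testRMSE")
head(results)
```

The column order (fraction, λ, train RMSE, test RMSE) is chosen to match how trainingSetResults is indexed in the later sections.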

Figure 3: Train and test RMSE at different λ values with a training set fraction of 0.5. The left plot uses a model with all attributes, while the right plot uses a model with the least significant attributes dropped. (Legend: Train - All Attributes, Test - All Attributes, Train - Dropped Attributes, Test - Dropped Attributes.)

    # Rank the weights by absolute value and keep all but the three smallest
    sortedW <- meanWeights[order(abs(meanWeights), decreasing = TRUE)]
    keepWeights <- names(sortedW[1:(length(sortedW) - 3)])

These names can then be used to recompute all of the predictions with a modified model. The model is created and modified using:

    model <- llsMake(Xtrain, Ttrain, lambda)
    model$w <- model$w[keepWeights, , drop = FALSE]
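The ranking step that produces keepWeights can be illustrated on its own. In this minimal sketch the weight values are invented for illustration and are not results from the report; only the two ranking lines are taken from the text.

```r
# Hypothetical mean weights, named by attribute (values invented for illustration).
meanWeights <- c(bias = 9.9, length = -0.2, diameter = 1.1, height = 0.5,
                 whole = 4.4, shucked = -4.5, viscera = -1.0, shell = 1.2)

# Same ranking as in the text: sort by absolute value, drop the three smallest.
sortedW <- meanWeights[order(abs(meanWeights), decreasing = TRUE)]
keepWeights <- names(sortedW[1:(length(sortedW) - 3)])
dropped <- setdiff(names(meanWeights), keepWeights)

keepWeights  # the five weights with the largest absolute value
dropped      # the three attributes that would be removed
```

Comparing weights by magnitude is only meaningful because the inputs are standardized first; on raw data, an attribute measured on a small scale would receive a large weight regardless of its importance.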

The following R code creates the plot that is shown in Figure 3. The meanResults matrix is from the original loop that uses a model with all attributes, and newMeanResults uses the model with dropped attributes.

    yRange <- range(c(meanResults[, 2:3], newMeanResults[, 2:3]))
    plot(meanResults[, 1], meanResults[, 2], type = "l", xlab = expression(lambda),
         ylab = "RMSE", ylim = yRange)
    lines(meanResults[, 1], meanResults[, 3], type = "l", lty = 2)
    lines(newMeanResults[, 1], newMeanResults[, 2], type = "l", col = "red")
    lines(newMeanResults[, 1], newMeanResults[, 3], type = "l", lty = 2, col = "red")
    legend("center", c("Train - All Attributes", "Test - All Attributes",
                       "Train - Dropped Attributes", "Test - Dropped Attributes"),
           col = c("black", "black", "red", "red"), lty = c(1, 2, 1, 2))

In Figure 3, the effects of λ can first be analyzed by looking at the model with all attributes. Since we mainly care about minimizing the test RMSE, the dashed lines are the most interesting. The test RMSE appears to reach a minimum for λ somewhere between 1 and 2, and it increases as λ moves away from that value in either direction. This indicates that reducing the model complexity does help to minimize the RMSE, but that λ should not be increased too much, because the model then becomes less representative of the data as a whole.

Figure 3 also shows that the model with dropped attributes did not perform as well as the model with all attributes. The train RMSE did decrease, but the more important test RMSE increased. This shows that the model fit the training data better, but seemed to overfit a little more than the original model.
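The trade-off described above can be made concrete by writing out the objective that ridge regression minimizes. The form below is the standard one, written under the assumption (taken from the llsMake code, where the penalty matrix has a 0 in the bias position) that the bias weight $w_0$ is not penalized. The symbols $N$, $D$, $t_n$, $x_n$, and $I_0$ are notation introduced here for illustration: $N$ samples, $D$ non-bias weights, and $I_0$ the identity matrix with its bias entry set to 0.

```latex
E(w) = \sum_{n=1}^{N} \left( t_n - w^T x_n \right)^2 + \lambda \sum_{j=1}^{D} w_j^2
\qquad\Longrightarrow\qquad
\nabla E(w) = 0 \;\iff\; \left( X^T X + \lambda I_0 \right) w = X^T T
```

With λ = 0 this reduces to ordinary least squares. As λ grows, the penalty term dominates and pushes the non-bias weights toward zero, which is consistent with the rise in RMSE observed for large λ.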

4.2 Different Training Set Fractions

This experiment analyzes the effects of the training set fraction on the test and train RMSE. The mean RMSE was calculated for each unique combination of training set fraction and λ using the following R code.

    meanTrainResults <- NULL
    for (f in unique(trainingSetResults[, 1])) {
      for (l in unique(trainingSetResults[, 2])) {
        # Select the rows that belong to this (fraction, lambda) combination
        mask <- apply(trainingSetResults[, 1:2], 1, function(cs) all(cs == c(f, l)))
        meanTrainResults <- rbind(meanTrainResults, colMeans(trainingSetResults[mask, , drop=
      }
    }
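The same grouping (one mean per combination of training set fraction and λ) can also be expressed with R's built-in aggregate() function. This is an alternative sketch rather than the report's method, and toyResults is a hypothetical data frame with invented values laid out like trainingSetResults.

```r
# Hypothetical results: fraction, lambda, train RMSE, test RMSE
# (values invented for illustration; two random partitions per combination).
toyResults <- data.frame(
  fraction  = c(0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5),
  lambda    = c(0,   0,   1,   1,   0,   0,   1,   1),
  trainRMSE = c(2.0, 2.2, 2.1, 2.3, 2.15, 2.25, 2.2, 2.3),
  testRMSE  = c(2.6, 2.8, 2.5, 2.7, 2.3,  2.5,  2.25, 2.45)
)

# One row per (fraction, lambda) pair, holding the mean of each RMSE column.
toyMeans <- aggregate(cbind(trainRMSE, testRMSE) ~ fraction + lambda,
                      data = toyResults, FUN = mean)
toyMeans <- toyMeans[order(toyMeans$fraction, toyMeans$lambda), ]
toyMeans
```

Grouping by column values this way avoids building a row mask for every combination, at the cost of working with a data frame instead of a matrix.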

This data can then be visualized in nine different plots, one for each training set fraction. The code below produces these plots, which are shown in Figure 4.

    xRange <- range(meanTrainResults[, 2])
    yRange <- range(meanTrainResults[, 3:4])
    par(mfrow = c(3, 3))
    count <- 0
    for (f in unique(meanTrainResults[, 1])) {
      mask <- meanTrainResults[, 1] == f
      plot(meanTrainResults[mask, 2], meanTrainResults[mask, 3], type = "l", xlim = xRange, ylim=yR
      lines(meanTrainResults[mask, 2], meanTrainResults[mask, 4], type = "l", lty = 2)
      count <- count + 1
      # Draw the legend only on the last of the nine plots
      if (count == length(unique(meanTrainResults[, 1])))
        legend("topright", c("Test", "Train"), lty = c(2, 1))
    }

Figure 4 shows an interesting picture of the results. As the training set fraction increases, the train RMSE increases and the test RMSE decreases. This makes sense: a small training set contains less variation, so the model can fit that data more closely, which results in a small train RMSE. However, that model does not generalize well to the rest of the data, which is why the test RMSE is so high. As the training set grows, there is more variation across the training data, which makes it harder to fit and raises the train RMSE. With a larger training set, though, the hope is that the model better represents all of the data. This seems to be true, since the model made better predictions for the test set and had a lower test RMSE.

Figure 4 also shows how the training set size interacts with the λ values. Changing λ has a much more dramatic effect with the smaller training sets. As the training set fraction increases, the lines almost level out, showing that λ does not have a significant effect. This also makes sense, because λ matters most with a small training set, where the risk of overfitting is higher.

4.3 Looking at Min RMSE Values

This experiment isolates the points at which λ was at an optimal value by finding where the test RMSE was at a minimum for each training set fraction. First, those data points were extracted into a separate matrix with the following R code.

    minRMSEvalues <- NULL
    for (f in unique(meanTrainResults[, 1])) {
      subset <- meanTrainResults[meanTrainResults[, 1] == f, ]
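The rest of this loop is cut off in this copy of the report. As an illustration only, and not the author's code, one way to pick the row with the lowest test RMSE for each training set fraction uses which.min(). The matrix toyMeanResults and its values are invented, and its column order (fraction, λ, train RMSE, test RMSE) is an assumption based on how meanTrainResults is indexed in the plotting code.

```r
# Hypothetical mean results: fraction, lambda, train RMSE, test RMSE
# (values invented for illustration).
toyMeanResults <- rbind(
  c(0.1, 0, 2.00, 2.60),
  c(0.1, 1, 2.05, 2.50),
  c(0.1, 2, 2.10, 2.55),
  c(0.5, 0, 2.15, 2.30),
  c(0.5, 1, 2.16, 2.29),
  c(0.5, 2, 2.18, 2.31)
)
colnames(toyMeanResults) <- c("fraction", "lambda", "trainRMSE", "testRMSE")

toyMinRows <- NULL
for (f in unique(toyMeanResults[, "fraction"])) {
  rowsForF <- toyMeanResults[toyMeanResults[, "fraction"] == f, , drop = FALSE]
  best <- which.min(rowsForF[, "testRMSE"])  # first row with the lowest test RMSE
  toyMinRows <- rbind(toyMinRows, rowsForF[best, ])
}
toyMinRows
```

Each row of toyMinRows then holds the λ that minimized the test RMSE for one training set fraction, along with the RMSE values at that point.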

Figure 5: These plots look at the RMSE and λ values for each training set fraction at their minimum points. (Panels: "Min RMSE for Different Partitions", showing Min Average RMSE against Training Set Fraction, and "λ Values for Different Partitions", plotted against Training Set Fraction.)

    par(mfrow = c(1, 2))
    xRange <- range(c(allTrainPreds[, 2], allTestPreds[, 2]))
    yRange <- range(c(allTrainPreds[, 1], allTestPreds[, 1]))
    plot(allTrainPreds[, 2], allTrainPreds[, 1], pch = 18, ylab = "Actual Rings Value", xlab="Predict
    abline(0, 1, col = "red")
    plot(allTestPreds[, 2], allTestPreds[, 1], pch = 18, ylab = "Actual Rings Value", xlab="Predicted
    abline(0, 1, col = "red")

The first observation when looking at Figure 6 is that both plots look very similar. The red line drawn on top of each plot marks where the predicted value equals the actual value, so it represents perfect predictions. The more tightly the points cluster around this line, the better the model. It is clear that the model is far from perfect, especially at the extremes. The model is not as good at predicting ring values below about 8 or above about 16 or 17. Within this range, however, the predicted values are mainly grouped around the line. One possible reason is the initial distribution of values for rings. Looking back at Figure 1, most of the data lies within this range, which could indicate that the model did not have enough data for the more extreme values of rings to make accurate predictions there.

4.5 Analyzing Standard Deviation

This experiment analyzes the effects of λ and the training set fraction on the standard deviation σ of the RMSE. Since the previous experiments dealt only with the means, a new matrix is needed that contains σ for each unique combination of λ and training set fraction. The following R code accomplishes this.

    sdTrainResults <- NULL
    for (f in unique(trainingSetResults[, 1])) {
      for (l in unique(trainingSetResults[, 2])) {
        mask <- apply(trainingSetResults[, 1:2], 1, function(cs) all(cs == c(f, l)))
        sdTrainResults <- rbind(sdTrainResults, c(f, l, sd(trainingSetResults[mask, 3:4, dr
      }
    }

Figure 6: Plots of actual vs. predicted values for the train and test sets. (Panels: "Training Set" and "Test Set", each showing Actual Rings Value against Predicted Rings Value.)

The data can then be plotted in a figure similar to the earlier one. This code creates a figure with nine plots, one for each training set fraction, as shown in Figure 7. Each plot shows σ against the λ values.

    xRange <- range(sdTrainResults[, 2])
    yRange <- range(sdTrainResults[, 3:4])
    par(mfrow = c(3, 3))
    count <- 0
    for (f in unique(sdTrainResults[, 1])) {
      mask <- sdTrainResults[, 1] == f
      plot(sdTrainResults[mask, 2], sdTrainResults[mask, 3], t
      lines(sdTrainResults[mask, 2], sdTrainResults[mask, 4], type = "l", lty = 2)
      count <- coun
      if (count == length(unique(sdTrainResults[, 1])))
        legend("right", c("Test", "Train"), lty = c(2, 1))
    }

Figure 7 shows that as the training set fraction increases, the test σ increases and the train σ decreases. The train σ can be explained by the size of the training set used to build the model: as the training set grows, the model has more data to train on, and the train σ decreases steadily. The test σ likely rises for the mirror-image reason, since a larger training fraction leaves fewer test samples and each test RMSE becomes a noisier estimate. It appears that the best training set fraction, with respect to σ, is somewhere in the middle, between 0.4 and 0.6, where both the train σ and the test σ are low and almost equal. It is also important to note that λ does not have a significant effect on σ, since all of the plots are nearly flat lines.
