






Material Type: Assignment; Professor: Anderson; Class: Machine Learning; Subject: Computer Science; University: Colorado State University; Term: Fall 2009;







Figure 1: Distribution of Target Values.
        Length   Diam  Height  Whole  Shucked  Viscera  Shell   Rings
Min      0.075  0.055   0.000  0.002    0.001    0.001  0.002   1
Max      0.815  0.650   1.130  2.826    1.488    0.760  1.005  29
Mean     0.524  0.408   0.140  0.829    0.359    0.181  0.239   9.934
SD       0.120  0.099   0.042  0.490    0.222    0.110  0.139   3.224
A distribution of the target values can be viewed with the following statement in R:
hist(abalone[, "Rings"], breaks=1:max(abalone[, "Rings"]), xlab="Rings")
This distribution is shown in Figure 1. The histogram shows that the majority of the data samples have
a value for rings between 5 and 15. This suggests that any model built from this data will be most
reliable in this range, where it has the most examples to learn from.
The pairs() function in R can give a good initial idea of the relationship between the different attributes.
This is shown in Figure 2. The sex attribute was not included in this plot. The most interesting part of this
figure is how all the other attributes relate to rings. These plots are in the right column and the bottom
row. It is obvious that there is a correlation between rings and all the other attributes. Though there are
some differences in the correlations, they all seem to be mostly linear in nature.
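The visual impression from pairs() can be complemented with numbers. The sketch below is only an illustration: the `toy` data frame and its values are made up and are not the abalone data, and `ringCor` is a name introduced here.

```r
# Hypothetical stand-in for a few numeric abalone columns (values are made up).
set.seed(1)
n <- 200
Length <- runif(n, 0.1, 0.8)
toy <- data.frame(
  Length = Length,
  Diam   = 0.8 * Length + rnorm(n, sd = 0.02),
  Rings  = round(3 + 12 * Length + rnorm(n, sd = 2))
)

# Scatterplot matrix of every attribute against every other attribute.
pairs(toy)

# Pearson correlation of each attribute with the target column.
ringCor <- cor(toy)[, "Rings"]
print(round(ringCor, 3))
```

A correlation close to 1 or -1 supports the reading that a relationship is roughly linear, although it cannot replace looking at the plots.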
3 Method
This section describes the R implementation used to manipulate and prepare the data. It also shows how
the linear model is computed and used to make predictions.
X <- abalone[1:(numDim - 1)]
T <- abalone[, numDim, drop=FALSE]
X <- apply(X, 2, as.numeric)
T <- apply(T, 2, as.numeric)
Since there is only one data set available, the data needs to be partitioned into training and testing sets.
To divide the data, samples are picked randomly from the original set and placed into the training set until
it is a given size. The rest of the samples that are not in the training set constitute the testing set. The
partition is referred to by the fraction of the original data used for the training set, such as 0.5. The following
R code defines a function that in turn creates and returns a function. The makePartitionF function should
be called whenever a new random partition of the data is desired. It takes the data set and a training set
fraction as arguments.
makePartitionF <- function(data, trainFraction) {
  numSamples <- nrow(data)
  randorder <- sample(numSamples)
  nTrain <- round(trainFraction * numSamples)
  function(newData, isTrainSet=TRUE) {
    reorderedData <- newData[randorder, , drop=FALSE]
    if (isTrainSet)
      reorderedData[1:nTrain, , drop=FALSE]
    else
      reorderedData[(nTrain+1):numSamples, , drop=FALSE]
  }
}
Once makePartitionF has been called, the data can be partitioned by subsequent calls to the returned partition
function, passing in the data and a boolean value that signifies whether the training set or the testing set is wanted.
partition <- makePartitionF(X, 0.5)
Xtrain <- partition(X, isTrainSet=TRUE)
Ttrain <- partition(T, isTrainSet=TRUE)
Xtest <- partition(X, isTrainSet=FALSE)
Ttest <- partition(T, isTrainSet=FALSE)
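Because the returned function keeps the random ordering it was created with, inputs and targets stay aligned row by row. The following sketch checks this on made-up data; the `toyX` and `toyT` matrices and the 0.7 fraction are assumptions chosen only for the illustration.

```r
makePartitionF <- function(data, trainFraction) {
  numSamples <- nrow(data)
  randorder <- sample(numSamples)              # one fixed random permutation
  nTrain <- round(trainFraction * numSamples)
  function(newData, isTrainSet = TRUE) {
    reorderedData <- newData[randorder, , drop = FALSE]
    if (isTrainSet)
      reorderedData[1:nTrain, , drop = FALSE]
    else
      reorderedData[(nTrain + 1):numSamples, , drop = FALSE]
  }
}

# Made-up data: the target is simply the row index, so alignment is easy to check.
set.seed(42)
toyX <- matrix(1:20, nrow = 10, ncol = 2, dimnames = list(NULL, c("id", "idPlus10")))
toyT <- matrix(1:10, ncol = 1, dimnames = list(NULL, "target"))

partition <- makePartitionF(toyX, 0.7)
toyXtrain <- partition(toyX, isTrainSet = TRUE)
toyTtrain <- partition(toyT, isTrainSet = TRUE)
toyXtest  <- partition(toyX, isTrainSet = FALSE)
toyTtest  <- partition(toyT, isTrainSet = FALSE)

# Rows of X and T stay paired because both calls reuse the same permutation.
print(all(toyXtrain[, "id"] == toyTtrain[, "target"]))
print(c(train = nrow(toyXtrain), test = nrow(toyXtest)))
```

One caveat of this design: a fraction that rounds nTrain up to numSamples leaves no test rows, and the index (nTrain+1):numSamples then counts backward, so fractions strictly between 0 and 1 are the safe choice.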
3.3 Standardizing the Data
The data should also be standardized before it is used. This means converting the data to a form with
zero mean and unit variance. This is done for each attribute by subtracting the attribute's mean from
each value and dividing by the attribute's standard deviation. The following R code was provided by the
instructor and uses the same approach of creating and returning a new function. It calculates the mean and
standard deviation of the data set passed into the makeStandardizeF function and uses those values for standardizing.
This means that the training set should be passed into this function so that both the training and testing sets
are standardized based on the training set. Along with standardizing, the returned function adds a column of 1's
onto the data to serve as the bias.
makeStandardizeF <- function(X) {
  if (missing(X)) {
    cat("Usage:
  standardize <- makeStandardizeF(X)  ## X is nSamples x nDimensions
  Xs <- standardize(X)
  X2s <- standardize(X2)\n")
    return(invisible())
  }
  mu <- colMeans(X)
  ## column-wise standard deviations; the listing originally called sd(X),
  ## which returns a single value for a matrix in current versions of R
  sigma <- apply(X, 2, sd)
  sigma[sigma == 0] <- 1  ### replace any values with 0 standard deviation
  function(newX) {
    nr <- nrow(newX)
    nc <- ncol(newX)
    newXs <- (newX - matrix(mu, nr, nc, byrow=TRUE)) / matrix(sigma, nr, nc, byrow=TRUE)
    newXs <- cbind(1, newXs)
    colnames(newXs)[1] <- "bias"
    return(newXs)
  }
}
The use of this function is shown in the next section when creating and using the linear model.
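As a quick sanity check of the standardization idea, the sketch below applies a compact restatement of the standardizer to made-up data. The `toyTrain` and `toyTest` matrices are assumptions, and the use of sweep() and apply() is a choice made here for brevity rather than the instructor's exact code.

```r
# Compact restatement of the standardizer (an adaptation for illustration).
makeStandardizeF <- function(X) {
  mu <- colMeans(X)
  sigma <- apply(X, 2, sd)      # column-wise standard deviations
  sigma[sigma == 0] <- 1        # avoid division by zero for constant columns
  function(newX) {
    newXs <- sweep(sweep(newX, 2, mu, "-"), 2, sigma, "/")
    cbind(bias = 1, newXs)      # prepend the constant bias column
  }
}

set.seed(7)
toyTrain <- matrix(rnorm(60, mean = 5, sd = 3), ncol = 2,
                   dimnames = list(NULL, c("a", "b")))
toyTest  <- matrix(rnorm(20, mean = 5, sd = 3), ncol = 2,
                   dimnames = list(NULL, c("a", "b")))

standardize <- makeStandardizeF(toyTrain)   # statistics come from the training set only
trainS <- standardize(toyTrain)
testS  <- standardize(toyTest)

# Training columns have mean 0 and standard deviation 1 (up to rounding) ...
print(round(colMeans(trainS[, -1]), 10))
print(round(apply(trainS[, -1], 2, sd), 10))
# ... while the test columns are only approximately standardized.
print(round(colMeans(testS[, -1]), 3))
```

The test set is deliberately scaled with the training statistics, so its columns will generally not have exactly zero mean or unit variance.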
3.4 Linear Model
The linear model was created using the ridge regression model, which incorporates a λ value that determines
how much the weights are regularized. The weights were computed with the following equation:
$$ w = (X^{\top} X + \lambda I)^{-1} X^{\top} T $$
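For readers who want the step behind this closed form, a sketch of the standard ridge regression derivation follows. It assumes a squared-error objective with an L2 penalty on the weights, which the text does not state explicitly; $T$ denotes the target values.

```latex
\begin{align*}
J(w) &= (Xw - T)^{\top}(Xw - T) + \lambda\, w^{\top} w \\
\nabla_w J(w) &= 2X^{\top}(Xw - T) + 2\lambda w = 0 \\
(X^{\top}X + \lambda I)\, w &= X^{\top} T \\
w &= (X^{\top}X + \lambda I)^{-1} X^{\top} T
\end{align*}
```

In the R code for this section, the first diagonal entry of $I$ is set to 0, so the bias weight is left unpenalized.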
The llsMake function shown below creates a linear model based on input data, target values, and a λ
value. Since the model is created based on which data is passed in, this function should only be used with
the training data. The return value is a list containing the weights and the standardize function that was
created with the training data.
llsMake <- function(X, Y, lambda) {
  # standardize the training data
  standardize <- makeStandardizeF(X)
  Xs <- standardize(X)
  iden <- diag(c(0, rep(1, ncol(Xs) - 1)))
  w <- solve(t(Xs) %*% Xs + lambda * iden) %*% t(Xs) %*% Y
  return(list(weights=w, standardizeFunc=standardize))
}
The llsUse function uses the model created above to make predictions on the data passed in as X. The
first line standardizes the data with the function created in llsMake. The predictions are
then computed by multiplying the data matrix by the weights.
llsUse <- function(model, X) {
  Xs <- model$standardizeFunc(X)
  Xs %*% model$weights
}
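To see the two functions working together, the sketch below fits the model on made-up data and compares λ = 0 against ordinary least squares. The compact function bodies, the `rmse` helper, and the `toyX`/`toyT` data are assumptions introduced for this illustration only.

```r
# Compact versions of the functions from this section, repeated so the sketch is self-contained.
makeStandardizeF <- function(X) {
  mu <- colMeans(X)
  sigma <- apply(X, 2, sd)   # column-wise sd; an adaptation for current R
  sigma[sigma == 0] <- 1
  function(newX) cbind(bias = 1, sweep(sweep(newX, 2, mu, "-"), 2, sigma, "/"))
}
llsMake <- function(X, Y, lambda) {
  standardize <- makeStandardizeF(X)
  Xs <- standardize(X)
  iden <- diag(c(0, rep(1, ncol(Xs) - 1)))   # bias weight is not penalized
  w <- solve(t(Xs) %*% Xs + lambda * iden) %*% t(Xs) %*% Y
  list(weights = w, standardizeFunc = standardize)
}
llsUse <- function(model, X) model$standardizeFunc(X) %*% model$weights

# Hypothetical helper, not defined in the text shown here.
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

# Made-up data with a known linear relationship plus noise.
set.seed(3)
toyX <- matrix(rnorm(300), ncol = 3, dimnames = list(NULL, c("x1", "x2", "x3")))
toyT <- toyX %*% c(2, -1, 0.5) + 10 + rnorm(100, sd = 0.5)

trainRows <- 1:70
model0 <- llsMake(toyX[trainRows, ], toyT[trainRows, , drop = FALSE], lambda = 0)
model5 <- llsMake(toyX[trainRows, ], toyT[trainRows, , drop = FALSE], lambda = 5)

testRmse0 <- rmse(llsUse(model0, toyX[-trainRows, ]), toyT[-trainRows, ])
testRmse5 <- rmse(llsUse(model5, toyX[-trainRows, ]), toyT[-trainRows, ])
print(c(lambda0 = testRmse0, lambda5 = testRmse5))

# With lambda = 0 the predictions should agree with ordinary least squares.
olsFit <- lm(toyT[trainRows] ~ toyX[trainRows, ])
print(all.equal(as.vector(llsUse(model0, toyX[trainRows, ])),
                as.vector(fitted(olsFit))))
```

Because the standardized columns are centered and the bias is not penalized, the bias weight equals the mean of the training targets, and increasing λ only shrinks the remaining weights.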
4 Experiments
The main data used in the following experiments is generated with the loops in the R code below. The code
iterates over a series of λ values and training set fractions. For each combination, the root mean squared error
(RMSE) on the training and test sets is calculated for 200 random partitions, all using the same
λ and training set fraction.
lambdas <- seq(0, 5, by=0.25)
trainingSetResults <- NULL
for (trainFraction in seq(0.1, 0.9, by=0.1)) {
Figure 3: Train and test RMSE at different λ values with a training set fraction of 0.5. The left plot uses a
model with all attributes, while the right plot uses a model with the least significant attributes dropped.
sortedW <- meanWeights[order(abs(meanWeights), decreasing=TRUE)]
keepWeights <- names(sortedW[1:(length(sortedW) - 3)])
These can then be used by recomputing all of the predictions with a modified model. The model is then
created and modified using:
model <- llsMake(Xtrain, Ttrain, lambda)
model$w <- model$w[keepWeights, , drop=FALSE]
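To make the selection step concrete, here is a toy run with made-up weights. The numbers are invented and only the attribute names are borrowed from the data set. The last part shows one possible way to predict with a reduced weight matrix, which is an assumption and not necessarily what was done to produce Figure 3.

```r
# Made-up mean weights, named after attributes only for illustration.
meanWeights <- c(bias = 9.9, Length = -0.1, Diam = 0.9, Height = 0.4,
                 Whole = 4.5, Shucked = -4.4, Viscera = -1.0, Shell = 1.2)

# Sort by magnitude and drop the three smallest, as in the text.
sortedW <- meanWeights[order(abs(meanWeights), decreasing = TRUE)]
keepWeights <- names(sortedW[1:(length(sortedW) - 3)])
print(keepWeights)

# One possible way to predict with the reduced model (an assumption):
# subset the standardized columns by the same names so the matrix
# product stays conformable.
Xs <- cbind(bias = 1, matrix(0.5, nrow = 2, ncol = 7,
            dimnames = list(NULL, names(meanWeights)[-1])))
w <- matrix(meanWeights, ncol = 1, dimnames = list(names(meanWeights), NULL))
reducedPred <- Xs[, keepWeights, drop = FALSE] %*% w[keepWeights, , drop = FALSE]
print(reducedPred)
```

Selecting by name rather than by position keeps the weights and the data columns matched even after the weights have been reordered.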
The following R code creates the plot that is shown in Figure 3. The meanResults matrix is from the
original loop that uses a model with all attributes, and the newMeanResults matrix uses the model with dropped
attributes.
yRange <- range(c(meanResults[, 2:3], newMeanResults[, 2:3]))
plot(meanResults[, 1], meanResults[, 2], type="l", xlab=expression(lambda),
     ylab="RMSE", ylim=yRange)
lines(meanResults[, 1], meanResults[, 3], type="l", lty=2)
lines(newMeanResults[, 1], newMeanResults[, 2], type="l", col="red")
lines(newMeanResults[, 1], newMeanResults[, 3], type="l", lty=2, col="red")
legend("center", c("Train - All Attributes", "Test - All Attributes",
                   "Train - Dropped Attributes", "Test - Dropped Attributes"),
       col=c("black", "black", "red", "red"), lty=c(1, 2, 1, 2))
In Figure 3, the effects of λ can first be analyzed by looking at the model with all attributes. Since
we mainly care about minimizing the test RMSE, it is most interesting to look at the dashed lines. It
appears that there is a minimum with λ somewhere between 1 and 2. As λ increases beyond that, the RMSE
increases, and as λ decreases below that, the RMSE also increases. This indicates that reducing the model
complexity does help to minimize the RMSE, but that λ should not be increased too much because the
model will not be as representative of the data as a whole.
Figure 3 also shows that the model with dropped attributes did not perform as well as the model with all
attributes. The train RMSE did decrease, but the more important test RMSE value increased. This shows
that the model fit the training data better, but seemed to overfit a little more than the original model.
4.2 Different Training Set Fractions
This experiment will analyze the effects of the training set fraction on the test and train RMSE. The mean
RMSE was calculated for each unique combination of training set fraction and λ using the following R code.
meanTrainResults <- NULL
for (f in unique(trainingSetResults[, 1])) {
  for (l in unique(trainingSetResults[, 2])) {
    mask <- apply(trainingSetResults[, 1:2], 1, function(cs) all(cs == c(f, l)))
    meanTrainResults <- rbind(meanTrainResults, colMeans(trainingSetResults[mask, , drop=
  }
}
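The same grouping can also be expressed with aggregate(), which avoids the explicit double loop. The sketch below uses a made-up results table; the column names `fraction`, `lambda`, `trainRMSE`, and `testRMSE` are hypothetical labels, since the text only refers to the columns by position.

```r
# Made-up results table; the column names are hypothetical labels.
set.seed(11)
toyResults <- expand.grid(fraction = c(0.1, 0.5), lambda = c(0, 1), rep = 1:4)
toyResults$trainRMSE <- runif(nrow(toyResults), 2.0, 2.2)
toyResults$testRMSE  <- runif(nrow(toyResults), 2.2, 2.4)

# Mean train and test RMSE for every (fraction, lambda) combination.
toyMeans <- aggregate(cbind(trainRMSE, testRMSE) ~ fraction + lambda,
                      data = toyResults, FUN = mean)
print(toyMeans)
```

This works on a data frame rather than a matrix, so a results matrix would first need as.data.frame() and column names.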
This data can then be visualized in nine different plots, one for each training set fraction. The code below
shows how to produce this plot which is shown in Figure 4.
xRange <- range(meanTrainResults[, 2])
yRange <- range(meanTrainResults[, 3:4])
par(mfrow=c(3, 3))
count <- 0
for (f in unique(meanTrainResults[, 1])) {
  mask <- meanTrainResults[, 1] == f
  plot(meanTrainResults[mask, 2], meanTrainResults[mask, 3], type="l", xlim=xRange, ylim=yR
  lines(meanTrainResults[mask, 2], meanTrainResults[mask, 4], type="l", lty=2)
  count <- count + 1
  if (count == length(unique(meanTrainResults[, 1])))
    legend("topright", c("Test", "Train"), lty=c(2, 1))
}
Figure 4 shows an interesting picture of the results. As the training set fraction increases, the train
RMSE increases and the test RMSE decreases. This makes sense because for a small training set, there is
not as much variation, so the model can be more specific for that data, which would result in a small train
RMSE value. However, that model does not generalize very well to the rest of the data which is why the test
RMSE is so high. As the training set size increases, there is more variation across the training data, which
results in more difficult predictions and a higher train RMSE value. With a larger training set though, the
hope is that the model is able to better represent all of the data. This seems to be true since it was able to
make better predictions for the test set and have a lower RMSE value.
Figure 4 also shows how the training set size interacts with the λ values. The changing λ values have a
much more dramatic effect using the smaller training sets. As the training set fraction increases, the lines
almost level out to show that λ does not have a significant effect. This also makes sense because λ is most
effective when given a small training set because there is a higher risk of overfitting.
4.3 Looking at Min RMSE Values
This experiment will isolate those points at which λ was at an optimal value by finding where the test RMSE
was at a minimum for each training set fraction. First, those data points were extracted into a separate
matrix with the following R code.
minRMSEvalues <- NULL
for (f in unique(meanTrainResults[, 1])) {
  subset <- meanTrainResults[meanTrainResults[, 1] == f, ]
Figure 5: These plots look at the RMSE and λ values for each training set fraction at their minimum points. Panels: Min RMSE for Different Partitions; λ Values for Different Partitions.
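The extraction of the minimum-test-RMSE row for each training set fraction, as described in Section 4.3, could be sketched as follows. The `toyMeanResults` values are made up, the column names are hypothetical labels, and the loop body is one plausible approach rather than the code used for the report.

```r
# Made-up mean results: columns are fraction, lambda, train RMSE, test RMSE.
toyMeanResults <- rbind(
  c(0.1, 0, 1.90, 2.60), c(0.1, 1, 1.95, 2.45), c(0.1, 2, 2.00, 2.50),
  c(0.5, 0, 2.10, 2.25), c(0.5, 1, 2.11, 2.24), c(0.5, 2, 2.12, 2.26)
)
colnames(toyMeanResults) <- c("fraction", "lambda", "trainRMSE", "testRMSE")

# For each fraction, keep the row whose test RMSE is smallest.
toyMin <- NULL
for (f in unique(toyMeanResults[, "fraction"])) {
  rows <- toyMeanResults[toyMeanResults[, "fraction"] == f, , drop = FALSE]
  toyMin <- rbind(toyMin, rows[which.min(rows[, "testRMSE"]), ])
}
print(toyMin)
```

Note that which.min() returns only the first minimum, so ties between λ values are resolved in favor of the smaller λ when rows are ordered by λ.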
par(mfrow=c(1, 2))
xRange <- range(c(allTrainPreds[, 2], allTestPreds[, 2]))
yRange <- range(c(allTrainPreds[, 1], allTestPreds[, 1]))
plot(allTrainPreds[, 2], allTrainPreds[, 1], pch=18, ylab="Actual Rings Value", xlab="Predict
abline(0, 1, col="red")
plot(allTestPreds[, 2], allTestPreds[, 1], pch=18, ylab="Actual Rings Value", xlab="Predicted
abline(0, 1, col="red")
The first observation when looking at Figure 6 is that both plots look very similar. The red line drawn
on top of the plots marks perfect predictions, where the predicted value equals the actual value, so the
more tightly the points cluster around this line, the better the model. The model is clearly weak at the
extremes: it predicts poorly for ring values below about 8 and above about 16 or 17.
Within this range, however, the predicted values are mainly grouped around the line. One
possible reason could be the initial distribution of values for rings. Looking back to Figure 1, most of the
data lies within this range, which could indicate that the model did not have enough data for the more
extreme values of rings in order to make an accurate prediction.
4.5 Analyzing Standard Deviation
This experiment analyzes the effects of λ and training set fraction on σ. Since the previous experiments
dealt only with the means, a new matrix is needed that contains σ for each unique combination of λ
and training set fraction. The following R code builds it.
sdTrainResults <- NULL
for (f in unique(trainingSetResults[, 1])) {
  for (l in unique(trainingSetResults[, 2])) {
    mask <- apply(trainingSetResults[, 1:2], 1, function(cs) all(cs == c(f, l)))
    sdTrainResults <- rbind(sdTrainResults, c(f, l, sd(trainingSetResults[mask, 3:4, dr
  }
}
Figure 6: Plots of actual vs. predicted values for the train and test sets. Panels: Training Set; Test Set.
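In current versions of R, sd() applied to a whole matrix returns a single number, so the train and test columns need to be handled separately. The sketch below shows one way to get a per-combination σ with aggregate(); the `toyResults` table and its column names are made up for the illustration.

```r
# Made-up results table; the column names are hypothetical labels.
set.seed(5)
toyResults <- expand.grid(fraction = c(0.1, 0.5), lambda = c(0, 1), rep = 1:5)
toyResults$trainRMSE <- rnorm(nrow(toyResults), mean = 2.1, sd = 0.05)
toyResults$testRMSE  <- rnorm(nrow(toyResults), mean = 2.3, sd = 0.10)

# Standard deviation of train and test RMSE for every (fraction, lambda) pair;
# FUN is applied to each column on its own.
toySd <- aggregate(cbind(trainRMSE, testRMSE) ~ fraction + lambda,
                   data = toyResults, FUN = sd)
print(toySd)
```

On a plain matrix, apply(m, 2, sd) gives the same column-wise behavior.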
The data can then be plotted in a figure similar to the earlier one. This code creates a figure with nine plots,
one for each training set fraction, as shown in Figure 7. Each plot shows σ against the λ values.
xRange <- range(sdTrainResults[, 2])
yRange <- range(sdTrainResults[, 3:4])
par(mfrow=c(3, 3))
count <- 0
for (f in unique(sdTrainResults[, 1])) {
  mask <- sdTrainResults[, 1] == f
  plot(sdTrainResults[mask, 2], sdTrainResults[mask, 3], t
  lines(sdTrainResults[mask, 2], sdTrainResults[mask, 4], type="l", lty=2)
  count <- coun
  if (count == length(unique(sdTrainResults[, 1])))
    legend("right", c("Test", "Train"), lty=c(2, 1))
}
Figure 7 shows that as the training set fraction increases, the test σ increases and the train σ decreases.
The train σ can be explained by the size of the training set used to make the model: as the training
set size increases, the model has more data to train on, and the train σ continually decreases.
It appears that the best training set fraction, with respect to σ, is somewhere in the middle, between 0.4 and
0.6, where both train σ and test σ are low and almost equal to each other. It is also important to note that
λ does not have any significant effect on σ, since all of the plots are nearly flat lines.