Principal components analysis - Lectures Notes - 11, Study notes of Machine Learning

Artificial Intelligence. Lectures Notes of Machine Learning. Prof. Andrew Ng - Stanford University - Contents: Principal components analysis

Typology: Study notes

2010/2011

Uploaded on 10/30/2011

CS229 Lecture notes
Andrew Ng
Part XI
Principal components analysis

In our discussion of factor analysis, we gave a way to model data $x \in \mathbb{R}^n$ as “approximately” lying in some $k$-dimensional subspace, where $k \ll n$. Specifically, we imagined that each point $x^{(i)}$ was created by first generating some $z^{(i)}$ lying in the $k$-dimensional affine space $\{\Lambda z + \mu;\ z \in \mathbb{R}^k\}$, and then adding $\Psi$-covariance noise. Factor analysis is based on a probabilistic model, and parameter estimation used the iterative EM algorithm.

In this set of notes, we will develop a method, Principal Components Analysis (PCA), that also tries to identify the subspace in which the data approximately lies. However, PCA will do so more directly: it requires only an eigenvector calculation (easily done with the eig function in Matlab) and does not need to resort to EM.

Suppose we are given a dataset $\{x^{(i)};\ i = 1, \ldots, m\}$ of attributes of $m$ different types of automobiles, such as their maximum speed, turn radius, and so on. Let $x^{(i)} \in \mathbb{R}^n$ for each $i$ ($n \ll m$). But unknown to us, two different attributes, some $x_i$ and $x_j$, respectively give a car’s maximum speed measured in miles per hour, and the maximum speed measured in kilometers per hour. These two attributes are therefore almost linearly dependent, up to only small differences introduced by rounding off to the nearest mph or kph. Thus, the data really lies approximately on an $(n - 1)$-dimensional subspace. How can we automatically detect, and perhaps remove, this redundancy?

For a less contrived example, consider a dataset resulting from a survey of pilots for radio-controlled helicopters, where $x_1^{(i)}$ is a measure of the piloting skill of pilot $i$, and $x_2^{(i)}$ captures how much he/she enjoys flying. Because RC helicopters are very difficult to fly, only the most committed students, ones that truly enjoy flying, become good pilots. So, the two attributes $x_1$ and $x_2$ are strongly correlated. Indeed, we might posit that the data actually lies along some diagonal axis (the $u_1$ direction) capturing the intrinsic piloting “karma” of a person, with only a small amount of noise lying off this axis. (See figure.) How can we automatically compute this $u_1$ direction?

[Figure: scatter plot of $x_1$ (skill) against $x_2$ (enjoyment), with the directions $u_1$ and $u_2$ marked.]

We will shortly develop the PCA algorithm. But prior to running PCA per se, typically we first pre-process the data to normalize its mean and variance, as follows:

  1. Let $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$.
  2. Replace each $x^{(i)}$ with $x^{(i)} - \mu$.
  3. Let $\sigma_j^2 = \frac{1}{m} \sum_{i} \left(x_j^{(i)}\right)^2$.
  4. Replace each $x_j^{(i)}$ with $x_j^{(i)} / \sigma_j$.
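The four pre-processing steps can be sketched in a few lines of NumPy. This is only an illustration: the function name `normalize`, the layout of the data as an (m, n) array with one example per row, and the sample values are assumptions made here, not part of the notes. It also assumes no attribute is constant, so that every $\sigma_j$ is nonzero.

```python
import numpy as np

def normalize(X):
    """Apply steps 1-4 to an (m, n) array X whose rows are the x^(i)."""
    mu = X.mean(axis=0)                                # step 1: per-attribute mean
    X_centered = X - mu                                # step 2: subtract the mean
    sigma = np.sqrt(np.mean(X_centered ** 2, axis=0))  # step 3: uses 1/m, as in the notes
    return X_centered / sigma, mu, sigma               # step 4: rescale each coordinate

# Made-up example: column 0 is maximum speed (mph), column 1 is number of seats.
X = np.array([[120.0, 2.0],
              [ 95.0, 4.0],
              [150.0, 2.0],
              [110.0, 4.0]])
X_norm, mu, sigma = normalize(X)
print(X_norm.mean(axis=0))         # approximately zero in each column
print((X_norm ** 2).mean(axis=0))  # approximately one in each column
```

Returning `mu` and `sigma` alongside the normalized data is a design choice of this sketch, so that the same shift and scale could later be applied to new points.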

Steps (1-2) zero out the mean of the data, and may be omitted for data known to have zero mean (for instance, time series corresponding to speech or other acoustic signals). Steps (3-4) rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same “scale.” For instance, if $x_1$ was cars’ maximum speed in mph (taking values in the high tens or low hundreds) and $x_2$ was the number of seats (taking values around 2-4), then this renormalization rescales the different attributes to make them more comparable. Steps (3-4) may be omitted if we had a priori knowledge that the different attributes are all on the same scale.

[Figure: the data points (crosses) with their projections (circles) onto a candidate direction $u$.]

We see that the projected data still has a fairly large variance, and the points tend to be far from zero. In contrast, suppose we had instead picked the following direction:

[Figure: the same data points with their projections onto a second candidate direction $u$.]
Here, the projections have a significantly smaller variance, and are much closer to the origin.

We would like to automatically select the direction $u$ corresponding to the first of the two figures shown above. To formalize this, note that given a unit vector $u$ and a point $x$, the length of the projection of $x$ onto $u$ is given by $x^T u$. I.e., if $x^{(i)}$ is a point in our dataset (one of the crosses in the plot), then its projection onto $u$ (the corresponding circle in the figure) is distance $x^T u$ from the origin. Hence, to maximize the variance of the projections, we would like to choose a unit-length $u$ so as to maximize:

$$
\frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)T} u \right)^2
= \frac{1}{m} \sum_{i=1}^{m} u^T x^{(i)} x^{(i)T} u
= u^T \left( \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)T} \right) u.
$$
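The equality above can be checked numerically. The sketch below computes the variance of the projections directly and via the quadratic form, for two candidate directions. The helper name `projection_variance` and the five sample points are made up for illustration; the data is chosen to have zero mean and to lie close to the diagonal.

```python
import numpy as np

def projection_variance(X, u):
    """Return the projection variance two ways: directly and as u^T Sigma u."""
    u = u / np.linalg.norm(u)        # enforce unit length
    projections = X @ u              # entry i is x^(i)T u
    direct = np.mean(projections ** 2)
    Sigma = X.T @ X / X.shape[0]     # (1/m) sum_i x^(i) x^(i)T
    quadratic = u @ Sigma @ u
    return direct, quadratic

# Made-up zero-mean data lying close to the diagonal direction (1, 1).
X = np.array([[-2.0, -1.9],
              [-1.0, -1.1],
              [ 0.0,  0.1],
              [ 1.0,  0.9],
              [ 2.0,  2.0]])

along = projection_variance(X, np.array([1.0, 1.0]))    # along the diagonal
across = projection_variance(X, np.array([1.0, -1.0]))  # perpendicular to it
print(along, across)
```

For this data the direction along the diagonal gives a much larger projection variance than the perpendicular one, which matches the contrast between the two figures.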

We easily recognize that maximizing this quantity subject to $\|u\|_2 = 1$ gives the principal eigenvector of $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)T}$, which is just the empirical covariance matrix of the data (assuming it has zero mean).^1

To summarize, we have found that if we wish to find a 1-dimensional subspace with which to approximate the data, we should choose $u$ to be the principal eigenvector of $\Sigma$. More generally, if we wish to project our data into a $k$-dimensional subspace ($k < n$), we should choose $u_1, \ldots, u_k$ to be the top $k$ eigenvectors of $\Sigma$. The $u_i$’s now form a new, orthogonal basis for the data.^2

Then, to represent $x^{(i)}$ in this basis, we need only compute the corresponding vector

$$
y^{(i)} =
\begin{bmatrix}
u_1^T x^{(i)} \\
u_2^T x^{(i)} \\
\vdots \\
u_k^T x^{(i)}
\end{bmatrix}
\in \mathbb{R}^k.
$$

Thus, whereas $x^{(i)} \in \mathbb{R}^n$, the vector $y^{(i)}$ now gives a lower, $k$-dimensional, approximation/representation for $x^{(i)}$. PCA is therefore also referred to as a dimensionality reduction algorithm. The vectors $u_1, \ldots, u_k$ are called the first $k$ principal components of the data.
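A minimal sketch of the whole procedure follows, assuming the data has already been normalized to zero mean. It uses NumPy's `numpy.linalg.eigh` in place of the Matlab eig function mentioned earlier; `eigh` is intended for symmetric matrices and returns eigenvalues in ascending order, so the sketch reorders them. The function name `pca`, the row-per-example layout, and the sample data are assumptions for illustration.

```python
import numpy as np

def pca(X, k):
    """PCA on zero-mean data X of shape (m, n).

    Returns U (n, k), whose columns are u_1..u_k, Y (m, k), whose row i
    is y^(i), and the k corresponding eigenvalues of Sigma.
    """
    m = X.shape[0]
    Sigma = X.T @ X / m                       # empirical covariance (zero mean assumed)
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending eigenvalues, orthonormal columns
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest eigenvalues
    U = eigvecs[:, order]
    Y = X @ U                                 # Y[i, j] = u_j^T x^(i)
    return U, Y, eigvals[order]

# Made-up zero-mean data lying close to the diagonal direction (1, 1).
X = np.array([[-2.0, -1.9],
              [-1.0, -1.1],
              [ 0.0,  0.1],
              [ 1.0,  0.9],
              [ 2.0,  2.0]])
U, Y, top_vals = pca(X, k=1)
print(U[:, 0])   # close to +/-(0.71, 0.71); the sign of an eigenvector is arbitrary
print(Y[:, 0])   # one-dimensional representation of each point
```

The variance of the projected coordinates along $u_j$ equals the corresponding eigenvalue of $\Sigma$, which is one way to sanity-check the output.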

Remark. Although we have shown it formally only for the case of k = 1, using well-known properties of eigenvectors it is straightforward to show that

^1 If you haven’t seen this before, try using the method of Lagrange multipliers to maximize $u^T \Sigma u$ subject to the constraint that $u^T u = 1$. You should be able to show that $\Sigma u = \lambda u$, for some $\lambda$, which implies $u$ is an eigenvector of $\Sigma$, with eigenvalue $\lambda$.

^2 Because $\Sigma$ is symmetric, the $u_i$’s will be (or always can be chosen to be) orthogonal to each other.
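
The Lagrange-multiplier argument suggested in the first footnote can be sketched in three lines. The symbol $\mathcal{L}$ for the Lagrangian is notation introduced here, and the gradient step uses the symmetry of $\Sigma$.

```latex
\begin{align*}
\mathcal{L}(u, \lambda) &= u^T \Sigma u - \lambda \, (u^T u - 1), \\
\nabla_u \mathcal{L}(u, \lambda) &= 2 \Sigma u - 2 \lambda u = 0
  \quad \Longrightarrow \quad \Sigma u = \lambda u, \\
u^T \Sigma u &= \lambda \, u^T u = \lambda .
\end{align*}
```

So every stationary point is a unit eigenvector of $\Sigma$, and the objective there equals its eigenvalue; the maximum is therefore attained at an eigenvector with the largest eigenvalue.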