



Artificial Intelligence. Lecture Notes of Machine Learning. Prof. Andrew Ng - Stanford University - Contents: Principal components analysis




In our discussion of factor analysis, we gave a way to model data $x \in \mathbb{R}^n$ as “approximately” lying in some $k$-dimensional subspace, where $k \ll n$. Specifically, we imagined that each point $x^{(i)}$ was created by first generating some $z^{(i)}$ lying in the $k$-dimensional affine space $\{\Lambda z + \mu;\ z \in \mathbb{R}^k\}$, and then adding $\Psi$-covariance noise. Factor analysis is based on a probabilistic model, and parameter estimation used the iterative EM algorithm.

In this set of notes, we will develop a method, Principal Components Analysis (PCA), that also tries to identify the subspace in which the data approximately lies. However, PCA will do so more directly: it requires only an eigenvector calculation (easily done with the eig function in Matlab) and does not need to resort to EM.

Suppose we are given a dataset $\{x^{(i)};\ i = 1, \ldots, m\}$ of attributes of $m$ different types of automobiles, such as their maximum speed, turn radius, and so on. Let $x^{(i)} \in \mathbb{R}^n$ for each $i$ ($n \ll m$). But unknown to us, two different attributes, some $x_i$ and $x_j$, respectively give a car’s maximum speed measured in miles per hour and the maximum speed measured in kilometers per hour. These two attributes are therefore almost linearly dependent, up to only small differences introduced by rounding off to the nearest mph or kph. Thus, the data really lies approximately on an $n - 1$ dimensional subspace. How can we automatically detect, and perhaps remove, this redundancy?

For a less contrived example, consider a dataset resulting from a survey of pilots for radio-controlled helicopters, where $x_1^{(i)}$ is a measure of the piloting skill of pilot $i$, and $x_2^{(i)}$ captures how much he/she enjoys flying. Because RC helicopters are very difficult to fly, only the most committed students, ones that truly enjoy flying, become good pilots. So, the two attributes $x_1$ and $x_2$ are strongly correlated. Indeed, we might posit that the data actually lies along some diagonal axis (the $u_1$ direction) capturing the intrinsic piloting “karma” of a person, with only a small amount of noise lying off this axis. (See figure.) How can we automatically compute this $u_1$ direction?
[Figure: scatter plot of $x_1$ (skill) against $x_2$ (enjoyment), with the directions $u_1$ and $u_2$ marked.]
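The mph/kph redundancy described above can be illustrated numerically. The following is a minimal sketch, not part of the original notes: the speeds are made-up values drawn at random, and the helper `pearson` is a hypothetical name introduced here. It shows that two attributes measuring the same quantity in different units, each rounded to the nearest integer, are almost (but not exactly) linearly dependent.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


random.seed(0)  # fixed seed so the run is repeatable

# Assumption: 200 hypothetical cars with true top speeds between 60 and 180 mph.
true_speed_mph = [random.uniform(60.0, 180.0) for _ in range(200)]

# The same quantity recorded twice, each rounded to the nearest whole unit.
speed_mph = [float(round(s)) for s in true_speed_mph]
speed_kph = [float(round(s * 1.609344)) for s in true_speed_mph]

r = pearson(speed_mph, speed_kph)
print(f"correlation(mph, kph) = {r:.6f}")
```

A correlation this close to 1 is the signature of the redundancy: the two columns carry essentially one attribute's worth of information, and only the rounding noise lies off the shared direction.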
We will shortly develop the PCA algorithm. But prior to running PCA per se, typically we first pre-process the data to normalize its mean and variance, as follows:
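1. Let $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$.
2. Replace each $x^{(i)}$ with $x^{(i)} - \mu$.
3. Let $\sigma_j^2 = \frac{1}{m} \sum_{i} (x_j^{(i)})^2$.
4. Replace each $x_j^{(i)}$ with $x_j^{(i)} / \sigma_j$.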
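The mean and variance normalization can be sketched in a few lines of plain Python. This is an illustrative sketch rather than code from the notes: the function name `normalize` and the small car dataset are assumptions made for the example, and the code assumes no coordinate is constant across the dataset (otherwise its standard deviation is zero and the division fails).

```python
import math


def normalize(data):
    """Zero the mean of each coordinate, then scale each to unit variance.

    `data` is a list of m examples, each a list of n floats.
    Returns the normalized data plus the per-coordinate mean and
    standard deviation that were used.
    """
    m = len(data)
    n = len(data[0])

    # Steps 1-2: compute the per-coordinate mean and subtract it.
    mu = [sum(x[j] for x in data) / m for j in range(n)]
    centered = [[x[j] - mu[j] for j in range(n)] for x in data]

    # Steps 3-4: compute each coordinate's standard deviation (of the
    # already-centered data) and divide by it.
    sigma = [math.sqrt(sum(x[j] ** 2 for x in centered) / m) for j in range(n)]
    scaled = [[x[j] / sigma[j] for j in range(n)] for x in centered]
    return scaled, mu, sigma


# Made-up cars: (maximum speed in mph, number of seats).
cars = [[120.0, 2.0], [95.0, 4.0], [150.0, 2.0], [110.0, 4.0]]
scaled, mu, sigma = normalize(cars)

for row in scaled:
    print([f"{v:+.3f}" for v in row])
```

After this transformation both columns have zero mean and unit variance, so the speed column (originally spanning tens of mph) no longer dominates the seat column purely because of its units.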
Steps (1-2) zero out the mean of the data, and may be omitted for data known to have zero mean (for instance, time series corresponding to speech or other acoustic signals). Steps (3-4) rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same “scale.” For instance, if $x_1$ were cars’ maximum speed in mph (taking values in the high tens or low hundreds) and $x_2$ were the number of seats (taking values around 2-4), then this renormalization rescales the different attributes to make them more comparable. Steps (3-4) may be omitted if we had a priori knowledge that the different attributes are all on the same scale. One