



















Feature Selection, Multilayer Networks, Minimum-Residual Projection, PCA, Compression, Classification, Gaussians, Probabilistic, Linear Subspaces, Unsupervised Learning, Feature Selection, Filter Methods, Mutual Information, Max-MI Feature Selection, Filter Methods, Wrapper Methods, Neural Networks, Two-Layer Network, Feed-Forward Networks, Greg Shakhnarovich, Lecture Slides, Introduction to Machine Learning, Computer Science, Toyota Technological Institute at Chicago, United States of America.




















TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
November 19, 2010
PCA is the solution to minimum-residual projection.
Let rows of X be the data points (N × d matrix).
Construct the d × d data covariance matrix $S = \frac{1}{N} X^T X$.
Let $\phi_1, \ldots, \phi_d$ be the orthonormal eigenvectors of $S$ corresponding to the eigenvalues $\lambda_1 \ge \ldots \ge \lambda_d$.
The optimal k-dim linear subspace is given by
$\Phi = [\phi_1, \ldots, \phi_k]$.
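The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the function name `pca_subspace` and the synthetic data are hypothetical, and the sketch centers the data first, which is an assumption needed for $\frac{1}{N} X^T X$ to be a covariance matrix.

```python
import numpy as np

def pca_subspace(X, k):
    """Return (Phi, eigvals): the top-k eigenvectors of S as a d x k matrix,
    and all d eigenvalues of S sorted in descending order."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # assumption: center the data first
    N = Xc.shape[0]
    S = Xc.T @ Xc / N                       # d x d data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder to lambda_1 >= ... >= lambda_d
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs[:, :k], eigvals

# Hypothetical data: 500 points in 3-D with very different spread per axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.1])

Phi, lam = pca_subspace(X, k=2)
Xc = X - X.mean(axis=0)
Z = Xc @ Phi                                # N x k coordinates in the subspace
residual = np.mean(np.sum((Xc - Z @ Phi.T) ** 2, axis=1))
```

With the 1/N normalization used here, the mean squared residual of the projection equals the sum of the discarded eigenvalues, which is one way to check an implementation.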
A very common methodology: perform PCA on all data and learn a classifier in the low-dimensional space.
Tempting: it may turn a computationally infeasible problem into a practical one.
Careful! The direction of largest variance need not be the most discriminative direction.
[Figure: 2-D data with the PCA subspace and the LDA subspace marked, followed by histograms of the projected data for Class +1, the negative class, and the total.]
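The warning about PCA before classification can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: the two Gaussian classes, the `separation` score, and the use of the Fisher direction $S_w^{-1}(\mu_+ - \mu_-)$ as the "LDA subspace" are choices made here, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical data: both classes are stretched along axis 0 (large variance)
# but separated only along axis 1 (small variance).
scale = np.array([5.0, 0.5])
pos = rng.normal(size=(n, 2)) * scale + np.array([0.0, 1.5])
neg = rng.normal(size=(n, 2)) * scale + np.array([0.0, -1.5])
X = np.vstack([pos, neg])
Xc = X - X.mean(axis=0)

# PCA direction: eigenvector of the covariance with the largest eigenvalue.
S = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(S)        # ascending eigenvalues
pca_dir = eigvecs[:, -1]

# Fisher LDA direction (assumed here): Sw^{-1} (mu_pos - mu_neg), normalized.
Sw = np.cov(pos, rowvar=False) + np.cov(neg, rowvar=False)
lda_dir = np.linalg.solve(Sw, pos.mean(axis=0) - neg.mean(axis=0))
lda_dir /= np.linalg.norm(lda_dir)

def separation(direction):
    """Gap between projected class means, in units of the pooled projected std."""
    p, q = pos @ direction, neg @ direction
    return abs(p.mean() - q.mean()) / np.sqrt(0.5 * (p.var() + q.var()))

print("PCA direction:", pca_dir, "separation:", separation(pca_dir))
print("LDA direction:", lda_dir, "separation:", separation(lda_dir))
```

In this construction the PCA direction follows the high-variance axis, where the class means coincide, while the Fisher direction follows the low-variance axis that actually separates the classes.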
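Probabilistic PCA is a method of fitting a constrained Gaussian ("pancake"); in the eigenvector basis its covariance is the diagonal matrix
$$\mathrm{diag}(\lambda_1, \ldots, \lambda_k, \underbrace{\sigma^2, \ldots, \sigma^2}_{d-k})$$
ML estimate for the noise variance $\sigma^2$:
$$\sigma^2 = \frac{1}{d-k} \sum_{j=k+1}^{d} \lambda_j$$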
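As a small worked example of the noise-variance estimate, the sketch below averages the discarded eigenvalues and builds the constrained spectrum. The function names and the eigenvalues are hypothetical, chosen only for illustration.

```python
def ppca_noise_variance(eigvals, k):
    """ML noise variance: the average of the d - k smallest eigenvalues."""
    lam = sorted(eigvals, reverse=True)     # lambda_1 >= ... >= lambda_d
    d = len(lam)
    if not 0 <= k < d:
        raise ValueError("k must satisfy 0 <= k < d")
    return sum(lam[k:]) / (d - k)

def ppca_spectrum(eigvals, k):
    """Eigenvalues of the constrained covariance: top k kept, the rest set to sigma^2."""
    lam = sorted(eigvals, reverse=True)
    sigma2 = ppca_noise_variance(lam, k)
    return lam[:k] + [sigma2] * (len(lam) - k)

# Hypothetical eigenvalues of a 5-dimensional data covariance.
eigvals = [4.0, 2.0, 0.5, 0.3, 0.2]
sigma2 = ppca_noise_variance(eigvals, k=2)  # (0.5 + 0.3 + 0.2) / 3
spectrum = ppca_spectrum(eigvals, k=2)
print(sigma2, spectrum)
```

Because the discarded eigenvalues are replaced by their mean, the total variance (the sum of the spectrum) is unchanged.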
Linearity assumption constrains the type of subspaces we can find.
A general formulation: a hidden manifold.
One possible method: kernel PCA
Very active area of research...
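For concreteness, kernel PCA can be sketched as ordinary PCA carried out on a centered kernel matrix. The version below is a minimal illustration that assumes an RBF kernel; the function name `kernel_pca`, the bandwidth `gamma`, and the concentric-circle data are all hypothetical choices, not part of the lecture.

```python
import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """Embed the N training points into k dimensions using an RBF kernel."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    K = np.exp(-gamma * np.maximum(dist2, 0.0))         # N x N kernel matrix
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one          # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)               # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:k]                 # top-k components
    lam = np.maximum(eigvals[idx], 0.0)                 # guard against round-off
    return eigvecs[:, idx] * np.sqrt(lam)               # N x k embedding

# Hypothetical data: two concentric circles, which no linear subspace separates.
theta = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
inner = np.c_[np.cos(theta), np.sin(theta)]
outer = 3.0 * inner
X = np.vstack([inner, outer])
Z = kernel_pca(X, k=2, gamma=0.5)
```

The embedding depends strongly on the kernel and its bandwidth, so `gamma` here should be read as a tuning parameter rather than a recommended value.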
Suppose we are considering a finite number of features (or basis functions): $x = [x_1, \ldots, x_d]^T$.
We are interested in selecting a subset of these features, $x_{s_1}, \ldots, x_{s_k}$, that leads to the best classification or regression performance.
We have already seen this: lasso regularization.
PCA: more like “feature generation”
Mutual information between the random variables X and Y is defined as the reduction in entropy (uncertainty) of X given Y:
$$
\begin{aligned}
I(X; Y) &\triangleq H(X) - H(X \mid Y) \\
&= -\sum_x p(x) \log p(x) + \sum_x \sum_y p(x, y) \log p(x \mid y), \qquad \text{where } p(x) = \sum_y p(x, y) \\
&= -\sum_x \sum_y p(x, y) \log p(x) + \sum_x \sum_y p(x, y) \log p(x \mid y) \\
&= \sum_{x, y} p(x, y) \log \frac{p(x \mid y)}{p(x)} \\
&= \sum_{x, y} p(x, y) \log \frac{p(x \mid y)\, p(y)}{p(x)\, p(y)} \\
&= D_{KL}\big(p(x, y) \,\|\, p(x)\, p(y)\big).
\end{aligned}
$$
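The identity between the entropy form and the KL form can be checked numerically on a small discrete joint distribution. The sketch below uses base-2 logarithms (bits); the function names and the joint table are hypothetical.

```python
from math import log2

def mutual_information_kl(joint):
    """I(X;Y) in bits via the KL form; joint[i][j] = p(x=i, y=j)."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    return sum(
        pxy * log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0                              # 0 log 0 is treated as 0
    )

def mutual_information_entropy(joint):
    """I(X;Y) in bits via H(X) - H(X|Y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_x = -sum(p * log2(p) for p in px if p > 0)
    h_x_given_y = -sum(
        pxy * log2(pxy / py[j])                 # p(x|y) = p(x,y) / p(y)
        for row in joint
        for j, pxy in enumerate(row)
        if pxy > 0
    )
    return h_x - h_x_given_y

# Hypothetical joint distribution over two binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
mi_kl = mutual_information_kl(joint)
mi_h = mutual_information_entropy(joint)
mi_indep = mutual_information_kl([[0.25, 0.25], [0.25, 0.25]])
print(mi_kl, mi_h, mi_indep)
```

The two functions should agree up to floating-point error, and the independent table should give zero, since then $p(x, y) = p(x)\,p(y)$ everywhere.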
We can evaluate MI between the class label $y$ and a feature $x_j$:
$$I(x_j; y) = \sum_{y \in \mathcal{Y}} \sum_{x_j} p(x_j, y) \log \frac{p(x_j \mid y)\, p(y)}{p(x_j)\, p(y)}$$
This requires estimating $p(y)$ (easy), $p(x_j)$ and $p(x_j \mid y)$ (may be hard).
Sanity check: for a binary classification problem, $I(x_j; y) \le H(y) \le 1$ bit (with base-2 logarithms) for any feature $x_j$.
How many features to include? Where to place the threshold?
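A filter of this kind can be sketched for discrete features by plugging empirical frequencies into the MI formula and ranking features by score. This is only an illustration: the function names and the toy data are hypothetical, the plug-in estimate is an assumption that suits small discrete alphabets, and the sketch does not answer where to place the threshold.

```python
from collections import Counter
from math import log2

def empirical_mi(xs, ys):
    """Plug-in estimate of I(x; y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    cx, cy = Counter(xs), Counter(ys)
    # p(x,y) log [p(x,y) / (p(x) p(y))], with all probabilities as count / n
    return sum(
        (c / n) * log2(c * n / (cx[x] * cy[y]))
        for (x, y), c in joint.items()
    )

def rank_features_by_mi(rows, labels):
    """Return (order, scores): feature indices sorted by decreasing MI with the label."""
    d = len(rows[0])
    scores = [empirical_mi([r[j] for r in rows], labels) for j in range(d)]
    order = sorted(range(d), key=lambda j: scores[j], reverse=True)
    return order, scores

# Hypothetical toy data: 8 examples, 3 binary features, binary labels.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
f0 = [0, 0, 0, 0, 1, 1, 1, 1]   # copies the label
f1 = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of the label
f2 = [0, 0, 0, 1, 1, 1, 1, 0]   # agrees with the label on 6 of 8 examples
rows = list(zip(f0, f1, f2))

order, scores = rank_features_by_mi(rows, labels)
print(order, scores)
```

Consistent with the sanity check above, no score can exceed 1 bit for binary labels; the feature that copies the label reaches that bound, and the independent feature scores zero.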