TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
November 19, 2010
PCA is the solution to minimum-residual projection.
Let rows of X be the data points (N × d matrix).
Construct the d × d data covariance matrix S = (1/N) XᵀX.
Let φ_1, …, φ_d be the orthonormal eigenvectors of S, corresponding to the eigenvalues λ_1 ≥ … ≥ λ_d.
The optimal k-dimensional linear subspace is given by
Φ = [φ_1, …, φ_k].
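As a concrete illustration, here is a minimal numpy sketch of this recipe (toy data and variable names are illustrative; the data are mean-centered before forming S):

```python
import numpy as np

# Toy data: N points in d dimensions (rows of X are data points).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Center the data; S = (1/N) X^T X is then the data covariance.
Xc = X - X.mean(axis=0)
N, d = Xc.shape
S = (Xc.T @ Xc) / N

# Orthonormal eigenvectors of S, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)        # eigh: S is symmetric
order = np.argsort(eigvals)[::-1]
lam, Phi_full = eigvals[order], eigvecs[:, order]

k = 2
Phi = Phi_full[:, :k]                       # optimal k-dim subspace: Phi = [phi_1, ..., phi_k]
Z = Xc @ Phi                                # k-dimensional representation of each point
```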
Suppose we have computed the k-dimensional PCA representation.
We need to transmit/store once: the mean x̄ (1 × d) and the basis Φ (d × k).
For each new example, we only need to convey z which is 1 × k.
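A short sketch of the resulting compression scheme, continuing in the same illustrative setup (encode: z = (x − x̄)Φ; decode: x̂ = x̄ + zΦᵀ):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
x_mean = X.mean(axis=0)
S = (X - x_mean).T @ (X - x_mean) / len(X)
lam, V = np.linalg.eigh(S)
Phi = V[:, np.argsort(lam)[::-1][:2]]       # d x k basis, as computed above (k = 2)

# One-time cost: the mean (d numbers) and the basis Phi (d*k numbers).
d, k = Phi.shape
print("stored once:", d + d * k, "numbers")

# Per-example cost: the 1 x k code z instead of the 1 x d raw vector.
x_new = X[0]
z = (x_new - x_mean) @ Phi                  # encode: k numbers to transmit
x_hat = x_mean + z @ Phi.T                  # decode: approximate reconstruction
print("per example:", k, "numbers; reconstruction error:", np.linalg.norm(x_new - x_hat))
```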
A very common methodology: perform PCA on all the data and learn a classifier in the low-dimensional space.
Tempting: it may turn a computationally infeasible problem into a practical one.
Careful! The direction of largest variance need not be the most discriminative direction.
[Figure: 2-D data projected onto the PCA subspace vs. the LDA subspace, with the resulting 1-D distributions for Class +1, Class −1, and the total.]
Suppose p(x) = N (x; μ, Σ).
Recall the eigendecomposition of the covariance:
Σ = R diag(λ_1, …, λ_d) Rᵀ.
The rotation R determines the orientation of the ellipse; diag(λ_1, …, λ_d) specifies the scaling along the principal directions.
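A quick numerical check of this decomposition in 2-D (the angle and scalings below are arbitrary illustration values):

```python
import numpy as np

# Build a 2-D covariance from an explicit rotation and per-axis scalings.
theta = np.pi / 6                                   # arbitrary orientation
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])     # rotation: orientation of the ellipse
lam = np.array([4.0, 0.25])                         # scalings along the principal directions
Sigma = R @ np.diag(lam) @ R.T

# The eigendecomposition of Sigma recovers the scalings (up to ordering/sign of eigenvectors).
eigvals, eigvecs = np.linalg.eigh(Sigma)
print(np.sort(eigvals)[::-1])                       # -> [4.0, 0.25]
```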
Suppose we take all d eigenvectors of Σ.
Columns of Φ are d orthonormal eigenvectors ⇒ it’s a rotation matrix.
Probabilistic PCA is a method of fitting a constrained Gaussian (“pancake”): the covariance is constrained to have eigenvalues
diag(λ_1, …, λ_k, σ², …, σ²),
i.e., the d − k trailing eigenvalues are all tied to a single noise variance σ².
ML estimate for the noise variance σ²:
σ² = (1 / (d − k)) Σ_{j=k+1}^{d} λ_j.
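A sketch of this fit from the sample covariance eigenvalues (toy data; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(Xc)

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, R = eigvals[order], eigvecs[:, order]

d, k = S.shape[0], 3
sigma2 = lam[k:].mean()                      # ML noise variance: average of the d - k discarded eigenvalues
lam_constrained = np.concatenate([lam[:k], np.full(d - k, sigma2)])
Sigma_ppca = R @ np.diag(lam_constrained) @ R.T   # covariance of the constrained ("pancake") Gaussian
```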
Linearity assumption constrains the type of subspaces we can find.
A general formulation: a hidden manifold.
One possible method: kernel PCA
Very active area of research...
Density estimation:
Clustering: k-means/medoids, hierarchical, spectral,...
Unsupervised dimensionality reduction (PCA).
Main points:
Suppose we are considering a finite number of features (or basis functions), x = [x_1, …, x_d]ᵀ.
We are interested in selecting a subset of these features, x_{s_1}, …, x_{s_k}, that leads to the best classification or regression performance.
We have already seen this: lasso regularization.
PCA: more like “feature generation”
Wrapper methods: try to optimize the feature subset for a given supervised learning algorithm (e.g., for a given classifier).
Filter methods: evaluate features based on a criterion independent of a classification/regression method.
Mutual Information between the random variables X and Y is defined as the reduction in entropy (uncertainty) of X given Y:

I(X; Y) ≜ H(X) − H(X|Y)
= −Σ_x p(x) log p(x) + Σ_x Σ_y p(x, y) log p(x | y)      [using p(x) = Σ_y p(x, y)]
= −Σ_x Σ_y p(x, y) log p(x) + Σ_x Σ_y p(x, y) log p(x | y)
= Σ_{x,y} p(x, y) log [ p(x | y) / p(x) ]
= Σ_{x,y} p(x, y) log [ p(x | y) p(y) / (p(x) p(y)) ]
= D_KL( p(x, y) || p(x) p(y) ).
I(X; Y) = H(X) − H(X|Y) = D_KL( p(X, Y) || p(X)p(Y) )
Continuous version:
I(X; Y) = ∫_y ∫_x p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy.
MI is always non-negative (since KL divergence is).
Since p(x, y) = p(y, x) and p(x)p(y) = p(y)p(x), MI is symmetric.
The data processing inequality: for any function f,
I(X; Y ) ≥ I(X; f (Y )).
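These identities can be checked numerically on a small discrete joint distribution (the table below is arbitrary):

```python
import numpy as np

# Arbitrary 2x3 joint distribution p(x, y) (entries sum to 1).
pxy = np.array([[0.20, 0.10, 0.15],
                [0.05, 0.30, 0.20]])
px = pxy.sum(axis=1, keepdims=True)          # p(x)
py = pxy.sum(axis=0, keepdims=True)          # p(y)

# I(X;Y) as the KL divergence between p(x,y) and p(x)p(y).
I_kl = np.sum(pxy * np.log2(pxy / (px * py)))

# I(X;Y) as H(X) - H(X|Y).
H_x = -np.sum(px * np.log2(px))
H_x_given_y = -np.sum(pxy * np.log2(pxy / py))      # p(x|y) = p(x,y)/p(y)
I_ent = H_x - H_x_given_y

print(I_kl, I_ent)                            # equal, and non-negative
# Symmetry: computing I(Y;X) from the transposed table gives the same value.
print(np.isclose(I_kl, np.sum(pxy.T * np.log2(pxy.T / (py.T * px.T)))))
```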
We can evaluate the MI between the class label y and a feature x_j:
I(x_j; y) = Σ_{y∈Y} ∫ p(x_j, y) log [ p(x_j | y) p(y) / (p(x_j) p(y)) ] dx_j
This requires estimating p(y) (easy), and p(x_j) and p(x_j | y) (may be hard).
Sanity check: for a binary classification problem, I(x_j; y) ≤ 1 bit for any feature x_j.
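A rough sketch of this filter score, estimating the required distributions by discretizing each feature into a few bins (the data, bin count, and helper name are illustrative; a plug-in estimate like this is crude but conveys the idea):

```python
import numpy as np

def mi_score(feature, y, n_bins=8):
    """Estimate I(x_j; y) by histogramming the feature (crude plug-in estimate)."""
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(feature, edges)                    # bin index in 0..n_bins-1
    joint = np.zeros((n_bins, len(np.unique(y))))
    for b, label in zip(bins, y):
        joint[b, label] += 1
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                                          # terms with p(x,y)=0 contribute 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px * py)[nz]))

# Toy data: only the first feature actually depends on the binary label y.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = rng.normal(size=(1000, 4))
X[:, 0] += 2.0 * y

scores = [mi_score(X[:, j], y) for j in range(X.shape[1])]
print(np.round(scores, 3))    # feature 0 should score highest; a threshold / top-k choice remains
```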
How many features to include? Where to place the threshold?
Ignores redundancy between features.
Ignores dependency between features: e.g., x_1 and x_2 may each be uninformative, but together provide perfect prediction.
[Figure: 2-D example in which neither x_1 nor x_2 is informative on its own, but together they separate the two classes.]
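This failure mode is easy to verify numerically; the XOR-style construction below (an illustration, not the lecture's exact example) has two features that are individually useless but jointly determine the label:

```python
import numpy as np

def mi_bits(pxy):
    """Mutual information (in bits) of a discrete joint distribution table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px * py)[nz]))

# x1, x2 uniform on {0,1}, y = x1 XOR x2.
# Joint of (x1, y): rows x1 in {0,1}, columns y in {0,1} -- uniform, so I = 0.
p_x1_y = np.array([[0.25, 0.25],
                   [0.25, 0.25]])
# Joint of ((x1, x2), y): rows are the 4 pairs (0,0), (0,1), (1,0), (1,1).
p_pair_y = np.array([[0.25, 0.00],
                     [0.00, 0.25],
                     [0.00, 0.25],
                     [0.25, 0.00]])

print(mi_bits(p_x1_y))    # 0.0 bits: x1 alone says nothing about y
print(mi_bits(p_pair_y))  # 1.0 bit:  (x1, x2) together determine y
```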
The classifier at hand may take advantage of information in some features but not others.
Wrapper methods are defined for a particular regressor/classifier.
In general, selecting the optimal subset of features is NP-hard: there are (d choose k) subsets of size k.
A (heuristic) solution: greedy feature selection.
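A sketch of greedy forward selection wrapped around a particular classifier, scoring candidate subsets by cross-validated accuracy (the classifier, data, and function names here are illustrative choices, not part of the lecture):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def greedy_forward_selection(X, y, k, make_clf):
    """Greedily add, at each step, the feature that most improves the cross-validated score."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        scores = {j: cross_val_score(make_clf(), X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)
print(greedy_forward_selection(X, y, k=4, make_clf=lambda: LogisticRegression(max_iter=1000)))
```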
General form of the linear methods we have seen:
ŷ(x; w) = f( wᵀφ(x) )
Logistic regression: f(z) = (1 + exp(−z))⁻¹; linear regression: f(z) = z.
Representation as a neural network:
[Diagram: single-layer view — inputs x_1, …, x_d, weights w_1, …, w_m, output unit f.]
With a bias term: φ_0 ≡ 1, with weight w_0.
[Diagram: two-layer network — inputs x_1, …, x_d; first-layer weights w_{11}^{(1)}, w_{21}^{(1)}, …, w_{d1}^{(1)}; hidden units with bias h_0 ≡ 1; second-layer weights w_1^{(2)}, …, w_m^{(2)} and bias w_0^{(2)}; output unit f.]
Idea: learn parametric features φ_j(x) = h(w_jᵀx + w_{0j}) for some (possibly nonlinear) function h.
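A minimal numpy sketch of the resulting two-layer forward pass, taking h = tanh and a logistic output f as illustrative choices:

```python
import numpy as np

def forward(x, W1, b1, w2, b2):
    """Two-layer network: learned features phi_j(x) = h(w_j^T x + w_0j), then a linear output."""
    h = np.tanh(W1 @ x + b1)          # hidden layer: m parametric features of the input
    z = w2 @ h + b2                   # second-layer linear combination
    return 1.0 / (1.0 + np.exp(-z))   # f(z): logistic output for classification

rng = np.random.default_rng(0)
d, m = 5, 3                                      # input dimension and number of hidden units
W1, b1 = rng.normal(size=(m, d)), np.zeros(m)    # first-layer weights w^(1) and biases
w2, b2 = rng.normal(size=m), 0.0                 # second-layer weights w^(2) and bias w_0^(2)

x = rng.normal(size=d)
print(forward(x, W1, b1, w2, b2))                # y_hat(x; w) in (0, 1)
```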