Lecture 24: Feature selection, multilayer networks

TTIC 31020: Introduction to Machine Learning

Instructor: Greg Shakhnarovich

TTI–Chicago

November 19, 2010

Review

PCA is the solution to minimum-residual projection.

Let rows of X be the data points (N × d matrix).

Construct the d × d data covariance matrix S = (1/N) XᵀX;

Let φ1, …, φd be the orthonormal eigenvectors of S corresponding to the eigenvalues λ1 ≥ … ≥ λd.

The optimal k-dim linear subspace is given by

Φ = [φ1, …, φk].
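A minimal sketch of this procedure (assuming NumPy; the names are illustrative, not from the slides):

    import numpy as np

    def pca(X, k):
        """Sketch of PCA: rows of X are the data points (N x d)."""
        mu = X.mean(axis=0)                    # 1 x d mean vector
        Xc = X - mu                            # center the data
        S = Xc.T @ Xc / X.shape[0]             # d x d data covariance matrix
        lam, Phi = np.linalg.eigh(S)           # eigenvalues in ascending order
        order = np.argsort(lam)[::-1][:k]      # indices of the k largest eigenvalues
        Phi_k = Phi[:, order]                  # d x k matrix [phi_1, ..., phi_k]
        Z = Xc @ Phi_k                         # N x k low-dimensional representation z
        return mu, Phi_k, Z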

PCA and compression

Suppose we have computed k-dimensional PCA representation.

We need to transmit/store:

  • The 1 × d mean vector;
  • The k × d projection matrix.

For each new example, we only need to convey z, which is 1 × k.

  • If we transmit N examples, we need d + dk + Nk numbers instead of Nd (see the worked example below).
  • Tradeoff between accuracy and compression.
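For example (numbers chosen here for illustration, not from the slides): with N = 1000 examples, d = 100 dimensions and k = 10 components, we transmit 100 + 100·10 + 1000·10 = 11,100 numbers instead of 1000·100 = 100,000.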

PCA and classification

A very common methodology: perform PCA on all the data and learn a classifier in the low-dimensional space.

Tempting: it may turn a computationally infeasible problem into a practical one.

Careful! The direction of largest variance need not be the most discriminative direction.

[Figure: two-class data shown with the PCA subspace and the LDA subspace, together with the distributions of the projections onto each (per class and total).]

PCA and Gaussians

Suppose p(x) = N (x; μ, Σ).

Recall:

Σ = R diag(λ1, …, λd) Rᵀ.

Rotation R determines the orientation of the ellipse; diag(λ1, …, λd) specifies the scaling along the principal directions.

Suppose we take all d eigenvectors of Σ.

Columns of Φ are d orthonormal eigenvectors ⇒ it’s a rotation matrix.
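As a quick sanity check (a sketch using NumPy; the matrix below is just an illustrative example, not from the slides):

    import numpy as np

    # Eigendecomposition of a covariance matrix as rotation + axis-aligned scaling.
    Sigma = np.array([[3.0, 1.0],
                      [1.0, 2.0]])
    lam, R = np.linalg.eigh(Sigma)                      # columns of R are orthonormal eigenvectors
    print(np.allclose(R @ np.diag(lam) @ R.T, Sigma))   # True: Sigma = R diag(lam) R^T
    print(np.allclose(R @ R.T, np.eye(2)))              # True: R is orthogonal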

Probabilistic PCA

Probabilistic PCA is a method of fitting a constrained Gaussian (“pancake”):

Σ = Φ diag(λ1, …, λk, σ², …, σ²) Φᵀ

ML estimate for the noise variance σ²:

σ² = (1 / (d − k)) Σ_{j=k+1}^{d} λj
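A minimal sketch of this estimate (assuming NumPy; S is the data covariance matrix and the eigenvalues are sorted in decreasing order):

    import numpy as np

    def ppca_noise_variance(S, k):
        """Sketch: ML noise variance = average of the trailing d - k eigenvalues of S."""
        lam = np.linalg.eigvalsh(S)[::-1]    # eigenvalues in decreasing order
        return lam[k:].mean()                # sigma^2 = (1 / (d - k)) * sum_{j > k} lambda_j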

Linear subspaces vs. manifolds

Linearity assumption constrains the type of subspaces we can find.

A general formulation: a hidden manifold.

One possible method: kernel PCA

Very active area of research...

Summary: unsupervised learning

Density estimation:

  • parametric closed-form (Gaussian, Bernoulli);
  • non-parametric (kernel-based);
  • semi-parametric (the EM algorithm for mixture models).

Clustering: k-means/medoids, hierarchical, spectral,...

Unsupervised dimensionality reduction (PCA).

Main points:

  • Need to define criterion carefully;
  • usually have to accept local optimum.

Feature selection

Suppose we are considering a finite number of features (or basis functions): x = [x1, …, xd]ᵀ.

We are interested in selecting a subset of these features, x_{s1}, …, x_{sk}, that leads to the best classification or regression performance.

We have already seen this: lasso regularization.

PCA: more like “feature generation”

  • zj = φjᵀ x is a linear combination of all x1, …, xd.

Wrapper versus filter methods

Wrapper methods: try to optimize the feature subset for a given supervised learning algorithm (e.g., for a given classifier).

  • Regularization;
  • Greedy methods.

Filter methods: evaluate features based on a criterion independent of a classification/regression method.

  • Information value: a good feature contains a large amount of information about the label.

Mutual information

Mutual information between the random variables X and Y is defined as the reduction in entropy (uncertainty) of X given Y:

I(X; Y) ≜ H(X) − H(X|Y)

        = − Σ_x p(x) log p(x) + Σ_x Σ_y p(x, y) log p(x | y)

        = − Σ_x Σ_y p(x, y) log p(x) + Σ_x Σ_y p(x, y) log p(x | y)      [using p(x) = Σ_y p(x, y)]

        = Σ_{x,y} p(x, y) log [ p(x | y) / p(x) ]

        = Σ_{x,y} p(x, y) log [ p(x | y) p(y) / (p(x) p(y)) ]

        = DKL( p(x, y) || p(x)p(y) ).

MI: properties

I(X; Y) = H(X) − H(X|Y) = DKL( p(X, Y) || p(X)p(Y) )

Continuous version:

I(X; Y) = ∫_y ∫_x p(x, y) log [ p(x, y) / (p(x)p(y)) ] dx dy.

MI is always non-negative (since the KL-divergence is).

Since p(x, y) = p(y, x) and p(x)p(y) = p(y)p(x), MI is symmetric.

The data processing inequality: for any function f,

I(X; Y) ≥ I(X; f(Y)).

Max-MI feature selection: classification

We can evaluate the MI between the class label y and a feature xj:

I(xj; y) = Σ_{y∈Y} Σ_{xj} p(xj, y) log [ p(xj | y) p(y) / (p(xj) p(y)) ]

This requires estimating p(y) (easy), and p(xj) and p(xj | y) (may be hard).

Sanity check: for a binary classification problem, I(xj; y) ≤ 1 for any feature xj.
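A minimal sketch (not from the slides) of this filter approach for discrete features, estimating the probabilities by empirical counts and then ranking features by their MI with the label:

    import numpy as np

    def mutual_information(x, y):
        """Estimate I(x; y) in bits for a discrete feature x and discrete label y."""
        mi = 0.0
        for xv in np.unique(x):
            for yv in np.unique(y):
                p_xy = np.mean((x == xv) & (y == yv))   # empirical joint p(x, y)
                p_x = np.mean(x == xv)                  # empirical marginal p(x)
                p_y = np.mean(y == yv)                  # empirical marginal p(y)
                if p_xy > 0:
                    mi += p_xy * np.log2(p_xy / (p_x * p_y))
        return mi

    # Filter method: rank features by MI with the label and keep the top k.
    # scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    # selected = np.argsort(scores)[::-1][:k]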

Filter methods: shortcomings

How many features to include? Where to place the threshold?

Ignores redundancy between features

  • If the same (informative) feature is repeated 100 times, it will get selected 100 times.

Ignores dependency between features: e.g., x1 and x2 may each be uninformative, but together provide perfect prediction.

[Figure: 2-D example of two classes where neither feature alone is informative, but the two together are.]

The classifier at hand may take advantage of information in some features but not others.

Wrapper methods

Wrapper methods are defined for a particular regressor/classifier.

In general, selecting the optimal subset of features is NP-hard:

  • Combinatorics: need to consider all (d choose k) subsets.

A (heuristic) solution: greedy feature selection.
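A minimal sketch of greedy forward selection (not from the slides): score_fn is a hypothetical callable that trains and evaluates the wrapped classifier/regressor on a given list of feature indices, e.g. by cross-validation.

    def greedy_forward_selection(score_fn, d, k):
        """Greedily add, one at a time, the feature that most improves the wrapped model's score."""
        selected = []
        remaining = set(range(d))
        while len(selected) < k and remaining:
            # Evaluate each candidate feature added to the current subset; keep the best.
            best_j = max(remaining, key=lambda j: score_fn(selected + [j]))
            selected.append(best_j)
            remaining.remove(best_j)
        return selected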

Neural networks

General form of the linear methods we have seen:

ŷ(x; w) = f( wᵀ φ(x) )

Logistic regression: f(z) = (1 + exp(−z))⁻¹; linear regression: f(z) = z.

Representation as a neural network:

[Diagram: inputs x1, …, xd feed into units φ1, …, φm, plus a bias unit φ0 ≡ 1; the unit outputs are combined with weights w0, w1, …, wm and passed through f.]

Two-layer network

[Diagram: inputs x1, …, xd feed into m hidden units h through first-layer weights w(1); the hidden units, plus a bias unit h0 ≡ 1, feed into the output unit f through second-layer weights w(2).]

Idea: learn parametric features φj(x) = h(wjᵀ x + w0j) for some (possibly nonlinear) function h.
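A minimal sketch of the resulting forward pass (assuming NumPy; the tanh hidden nonlinearity and sigmoid output are illustrative choices, not fixed by the slides):

    import numpy as np

    def two_layer_forward(x, W1, b1, w2, b2):
        """Forward pass of a two-layer network.

        x:  input vector, shape (d,)
        W1: first-layer weights, shape (m, d);  b1: hidden biases, shape (m,)
        w2: second-layer weights, shape (m,);   b2: output bias (scalar)
        """
        phi = np.tanh(W1 @ x + b1)           # learned features phi_j(x) = h(w_j^T x + w_0j)
        z = w2 @ phi + b2                    # linear combination of hidden-unit outputs
        return 1.0 / (1.0 + np.exp(-z))      # output nonlinearity f (here a sigmoid)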