Statistical Learning
Introduction to Artificial Intelligence
CS440/ECE448 Lecture 25

1-unit projects for grad students: get in touch with me!
New homework out tonight.

Last lecture
• Multi-layer perceptrons
• Backpropagation

This lecture
• Bayesian learning
• MAP learning
• ML learning
• Support vector machines

Reading
• Chapter 20

Multi-layer perceptrons
Layers are usually fully connected;
numbers of hidden units typically chosen by hand
[Figure: layered feed-forward network with output units ai, weights Wj,i, hidden units aj, weights Wk,j, input units ak]
Expressiveness of multi-layer perceptrons
All continuous functions w/ 2 layers, all functions w/ 3 layers
[Figure: network output hW(x1, x2) plotted as a surface over the input space]
Combine two opposite-facing threshold functions to make a ridge
Combine two perpendicular ridges to make a bump
Add bumps of various sizes and locations to fit any surface
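A minimal numpy sketch of this construction (illustrative only; the interval endpoints and steepness values are assumptions, not taken from the lecture): summing two opposite-facing sigmoid thresholds gives a ridge, and thresholding the sum of two perpendicular ridges gives a bump.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Ridge along one axis: two opposite-facing soft thresholds, summed.
def ridge(x, lo=-1.0, hi=1.0, steep=10.0):
    return sigmoid(steep * (x - lo)) + sigmoid(-steep * (x - hi)) - 1.0

# Bump near the origin: add two perpendicular ridges, then threshold the sum.
def bump(x1, x2, steep=10.0):
    return sigmoid(steep * (ridge(x1) + ridge(x2) - 1.5))

x1, x2 = np.meshgrid(np.linspace(-3, 3, 7), np.linspace(-3, 3, 7))
print(np.round(bump(x1, x2), 2))   # close to 1 near the origin, close to 0 elsewhere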
Backpropagation
Output layer: same as for single-layer perceptron,
  Wj,i ← Wj,i + α × aj × Δi
where Δi = Erri × g′(ini)
Hidden layer: back-propagate the error from the output layer:
  Δj = g′(inj) Σi Wj,i Δi
Update rule for weights in hidden layer:
  Wk,j ← Wk,j + α × ak × Δj
(Most neuroscientists deny that back-propagation occurs in the brain)
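The update rules above translate directly into a short numpy sketch. Everything below is an assumed illustration (sigmoid activation g, a bias input, the XOR data set, the layer sizes and learning rate), not the lecture's code.

import numpy as np

rng = np.random.default_rng(0)

def g(z):                       # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    s = g(z)
    return s * (1.0 - s)

# Toy data: XOR, with a constant 1 appended to each input as a bias term.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

n_hidden = 4
W_kj = rng.normal(size=(3, n_hidden))        # input -> hidden weights
W_ji = rng.normal(size=(n_hidden + 1, 1))    # hidden (+ bias unit) -> output weights
alpha = 0.5

for epoch in range(20000):
    # Forward pass
    in_j = X @ W_kj
    a_j = np.hstack([g(in_j), np.ones((len(X), 1))])   # hidden activations + bias unit
    in_i = a_j @ W_ji
    a_i = g(in_i)

    # Output layer: Delta_i = Err_i * g'(in_i)
    delta_i = (y - a_i) * g_prime(in_i)
    # Hidden layer: Delta_j = g'(in_j) * sum_i W_j,i Delta_i   (bias unit excluded)
    delta_j = g_prime(in_j) * (delta_i @ W_ji[:-1].T)

    # Update rules: W_j,i += alpha * a_j * Delta_i ; W_k,j += alpha * a_k * Delta_j
    W_ji += alpha * a_j.T @ delta_i
    W_kj += alpha * X.T @ delta_j

a_j = np.hstack([g(X @ W_kj), np.ones((len(X), 1))])
print(np.round(g(a_j @ W_ji), 2))   # typically close to [0, 1, 1, 0]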
Full Bayesian Learning
View learning as Bayesian updating of a probability distribution over the hypothesis space.
H is the hypothesis variable, with values h1, h2, … and prior P(H).
The jth observation dj gives the outcome of the random variable Dj;
the training data is d = d1, …, dN.
Bayesian learning: Example ctd.
Likelihood of the data under each hypothesis (assumption: data are i.i.d., independently and identically distributed):
  P(d | hi) = Πj P(dj | hi)
Prior distribution P(H) = <0.1, 0.2, 0.4, 0.2, 0.1>; e.g., P(lime | h3) = 0.5.

Posterior Probability of Hypotheses
  P(hi | d) = α P(d | hi) P(hi) = α P(hi) Πj P(dj | hi)
Prediction: P(next candy is lime | d) = Σi P(lime | hi) P(hi | d)
[Figure: posterior probabilities P(hi | d) and the prediction probability P(next candy is lime | d), plotted against the number of samples in d]
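A short numpy sketch of this update, assuming (as in the plot) that every observed candy is lime; the variable names are mine, not the lecture's.

import numpy as np

p_lime_given_h = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # P(lime | hi) for h1..h5
posterior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])          # starts at the prior P(hi)

observations = ["lime"] * 10     # assumed observation sequence: ten lime candies

for n, d in enumerate(observations, start=1):
    likelihood = p_lime_given_h if d == "lime" else 1.0 - p_lime_given_h
    posterior = posterior * likelihood        # P(hi | d) = α P(hi) Πj P(dj | hi)
    posterior = posterior / posterior.sum()   # α normalises over the hypotheses
    p_next_lime = float(p_lime_given_h @ posterior)   # P(next candy is lime | d)
    print(n, np.round(posterior, 3), round(p_next_lime, 3))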
MAP Approximation
Summing over the hypothesis space is often intractable
(e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes)
Maximum a posteriori (MAP) learning: choose hMAP maximizing P(hi | d)
I.e., maximize P(d | hi) P(hi), or equivalently log P(d | hi) + log P(hi)
Log terms can be viewed as (negative of)
bits to encode data given hypothesis + bits to encode hypothesis
This is the basic idea of minimum description length (MDL) learning
For deterministic hypotheses, P(d | hi) is 1 if consistent, 0 otherwise
⇒ MAP = simplest consistent hypothesis (cf. science)
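Continuing the candy sketch above, MAP learning simply takes the argmax of log P(d | hi) + log P(hi) instead of keeping the whole posterior (again assuming ten lime observations):

import numpy as np

p_lime_given_h = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
log_prior = np.log([0.1, 0.2, 0.4, 0.2, 0.1])
with np.errstate(divide="ignore"):                 # log(0) = -inf rules out the all-cherry h1
    log_likelihood = 10 * np.log(p_lime_given_h)   # ten lime candies observed
h_map = int(np.argmax(log_likelihood + log_prior))
print("h_MAP = h%d, P(lime | h_MAP) = %.2f" % (h_map + 1, p_lime_given_h[h_map]))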
Posterior Probability of Hypotheses
Assumes P(X | d) ≈ P(X | hMAP).

Multiple Parameters
Red/green wrapper depends probabilistically on flavor:
  P(F = cherry) = θ, P(W = red | F = cherry) = θ1, P(W = red | F = lime) = θ2
Likelihood for, e.g., a cherry candy in a green wrapper:
  P(F = cherry, W = green | hθ,θ1,θ2)
    = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
    = θ (1 − θ1)
N candies, rc red-wrapped cherry candies, etc. (c cherry, ℓ lime, gc green-wrapped cherry, rℓ and gℓ for lime):
  P(d | hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ
  L = log P(d | hθ,θ1,θ2) = c log θ + ℓ log(1 − θ) + rc log θ1 + gc log(1 − θ1) + rℓ log θ2 + gℓ log(1 − θ2)
Multiple Parameters ctd
Derivatives of L contain only the relevant parameter:
  ∂L/∂θ = c/θ − ℓ/(1 − θ) = 0   ⇒   θ = c / (c + ℓ)
  ∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0   ⇒   θ1 = rc / (rc + gc)
  ∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0   ⇒   θ2 = rℓ / (rℓ + gℓ)
With complete data, parameters can be learned separately
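Because the parameters decouple, ML estimation with complete data reduces to counting. A sketch with made-up counts (the numbers below are hypothetical, not from the lecture):

# Hypothetical counts of (flavor, wrapper) pairs in a sample of N candies.
counts = {
    ("cherry", "red"): 273, ("cherry", "green"): 93,    # r_c, g_c
    ("lime",   "red"): 104, ("lime",   "green"): 530,   # r_l, g_l
}

c = counts[("cherry", "red")] + counts[("cherry", "green")]   # number of cherry candies
l = counts[("lime", "red")]   + counts[("lime", "green")]     # number of lime candies

theta  = c / (c + l)                           # P(F = cherry)           = c / N
theta1 = counts[("cherry", "red")] / c         # P(W = red | F = cherry) = r_c / (r_c + g_c)
theta2 = counts[("lime", "red")]   / l         # P(W = red | F = lime)   = r_l / (r_l + g_l)
print(theta, theta1, theta2)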
Naive Bayes Model
Variables:
• one class C
• n attributes X1, …, Xn
Assume the Xi's are conditionally independent given C.
Parameters:
• θ = P(C = true)
• θi1 = P(Xi = true | C = true)
• θi2 = P(Xi = true | C = false)
Learning: independent ML estimation of the parameters
Classification: P(C | x1, …, xn) = α P(C) Πi P(xi | C)
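A minimal sketch of learning and classification in this model with boolean attributes; the tiny data set and the variable names are illustrative assumptions.

import numpy as np

# Each row holds attribute values x1..xn followed by the class C.
data = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
X, C = data[:, :-1].astype(bool), data[:, -1].astype(bool)

# Learning: independent ML estimates of the parameters.
theta    = C.mean()              # P(C = true)
theta_i1 = X[C].mean(axis=0)     # P(Xi = true | C = true)
theta_i2 = X[~C].mean(axis=0)    # P(Xi = true | C = false)

def classify(x):
    x = np.asarray(x, dtype=bool)
    # P(C | x1..xn) = α P(C) Πi P(xi | C)
    p_true  = theta       * np.prod(np.where(x, theta_i1, 1 - theta_i1))
    p_false = (1 - theta) * np.prod(np.where(x, theta_i2, 1 - theta_i2))
    return p_true / (p_true + p_false)

print(classify([1, 0, 1]))   # posterior probability that C = true for this example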
Linearly Separable Classes
[Figure: two linearly separable classes of examples, with the support hyperplanes and the support vectors that lie on them]
Support Vector Machines
What is the maximum-margin separating plane? (Boser, Guyon & Vapnik, 1992; Vapnik, 1995)
[Figure: positive and negative examples separated by the planes w.x + b = 0, w.x + b = −1, and w.x + b = 1]

Support vector machines ctd
• Examples are of the form (xi, yi), where yi = ±1.
• They all satisfy yi (w.xi + b) ≥ 1.
• The distance between the two planes w.x + b = 1 and w.x + b = −1 (the margin) is 2 / |w|.
• Thus finding the maximum-margin plane amounts to
  – minimizing ½ |w|^2
  – subject to yi (w.xi + b) ≥ 1 for i = 1, …, n.
• A quadratic programming problem!
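As an illustration (not the lecture's code), scikit-learn's linear-kernel SVC with a very large C approximately solves this hard-margin quadratic program; the toy data set below is an assumption.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],           # positive examples
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])    # negative examples
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, ", b =", b)
print("margin 2/|w| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
print(y * (X @ w + b))   # each value should be >= 1 (up to solver tolerance)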