
Lecture 25: Multilayer networks, advanced topics

TTIC 31020: Introduction to Machine Learning

Instructor: Greg Shakhnarovich

TTI–Chicago

November 22, 2010

Feed-forward networks

Collect all the weights into a vector w

Feedforward operation, from input x to output \hat{y}:

\hat{y}(x; w) = f\left( \sum_{j=1}^{m} w_j^{(2)} \, h\left( \sum_{i=1}^{d} w_{ij}^{(1)} x_i + w_{0j}^{(1)} \right) + w_0^{(2)} \right)

Common choices for f: linear, logistic, tanh, threshold.
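As a concrete illustration (not from the original slides), here is a minimal numpy sketch of this feed-forward computation; the names W1, b1, w2, b2 and the choices h = tanh, f = identity are assumptions for the example.

import numpy as np

# Sketch of the feed-forward formula above (illustrative, not the course code).
# W1 holds w^(1)_ij (m x d), b1 the hidden biases w^(1)_0j,
# w2 holds w^(2)_j (length m), b2 the output bias w^(2)_0.
def mlp_forward(x, W1, b1, w2, b2, h=np.tanh, f=lambda a: a):
    hidden = h(W1 @ x + b1)       # h( sum_i w^(1)_ij x_i + w^(1)_0j ), j = 1..m
    return f(w2 @ hidden + b2)    # f( sum_j w^(2)_j h(...) + w^(2)_0 )

# Example with d = 3 inputs and m = 4 hidden units:
rng = np.random.default_rng(0)
print(mlp_forward(rng.normal(size=3), rng.normal(size=(4, 3)),
                  rng.normal(size=4), rng.normal(size=4), 0.0))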

Training the network

Error of the network on a training set:

L(X; w) = \sum_{i=1}^{N} \frac{1}{2} \left( y_i - \hat{y}(x_i; w) \right)^2

Generally, no closed-form solution; resort to gradient descent.

It is enough to evaluate the derivative of L on a single example with respect to each weight, \partial L(x_i) / \partial w_{ij}^{(q)}. Let's start with a simple linear model \hat{y} = \sum_j w_j x_j:

\frac{\partial L(x_i)}{\partial w_j} = \underbrace{(\hat{y}_i - y_i)}_{\text{error}} \, x_{ij}.
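A minimal sketch (not from the slides) of stochastic gradient descent built on this per-example gradient for the linear model; X, y, lr, and epochs are illustrative assumptions. Each update applies exactly the (\hat{y}_i - y_i) x_{ij} expression derived above.

import numpy as np

# Stochastic gradient descent for y_hat = sum_j w_j x_j with 1/2 squared error.
def sgd_linear(X, y, lr=0.01, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = w @ x_i
            w -= lr * (y_hat - y_i) * x_i   # dL(x_i)/dw_j = (y_hat_i - y_i) * x_ij
    return w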

Backpropagation

General unit activation in a multilayer network:

u_t = h\left( \sum_j w_{jt} z_j \right)

[diagram: unit t receives inputs z_1, \ldots, z_s through weights w_{1t}, \ldots, w_{st}, applies h, and outputs u_t]

Forward propagation: calculate a_t = \sum_j w_{jt} z_j for each unit. The loss L depends on w_{jt} only through a_t:

\frac{\partial L}{\partial w_{jt}} = \frac{\partial L}{\partial a_t} \frac{\partial a_t}{\partial w_{jt}} = \frac{\partial L}{\partial a_t} \, z_j

Backpropagation

\frac{\partial L}{\partial w_{jt}} = \underbrace{\frac{\partial L}{\partial a_t}}_{\delta_t} \, z_j

Output unit with linear activation: \delta_t = \hat{y} - y

Hidden unit t with activation function h, which sends its output z_t to units s \in S through weights w_{ts}:

\delta_t = \sum_{s \in S} \frac{\partial L}{\partial a_s} \frac{\partial a_s}{\partial a_t} = h'(a_t) \sum_{s \in S} w_{ts} \delta_s
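Putting the two backpropagation slides together, here is a minimal numpy sketch (an illustration, not the course's code) for one hidden layer of tanh units with a linear output and squared-error loss; all variable names are assumptions.

import numpy as np

# Forward pass: a1 are hidden pre-activations a_t, z1 the hidden outputs z_t.
def forward(x, W1, b1, W2, b2):
    a1 = W1 @ x + b1
    z1 = np.tanh(a1)
    y_hat = W2 @ z1 + b2
    return z1, y_hat

# Backward pass: deltas as on the slide, gradients dL/dw_jt = delta_t * z_j.
def backward(x, y, W1, b1, W2, b2):
    z1, y_hat = forward(x, W1, b1, W2, b2)
    delta_out = y_hat - y                                # linear output: delta = y_hat - y
    delta_hid = (1.0 - z1 ** 2) * (W2.T @ delta_out)     # h'(a_t) * sum_s w_ts delta_s
    return (np.outer(delta_hid, x), delta_hid,           # dL/dW1, dL/db1
            np.outer(delta_out, z1), delta_out)          # dL/dW2, dL/db2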

MLP as universal approximators

Theoretical result [Cybenko, 1989]: a 2-layer net with linear output can approximate any continuous function over a compact domain to arbitrary accuracy (given enough hidden units!).

Examples: 3 hidden units with \tanh(z) = \frac{e^{2z} - 1}{e^{2z} + 1} activation

[figures from Bishop]
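To make the approximation claim tangible, here is a small sketch (not from the slides) that trains a network with 3 tanh hidden units and a linear output by plain gradient descent to fit x^2 on [-1, 1]; the target function, initialization, learning rate, and iteration counts are all illustrative assumptions.

import numpy as np

# Fit x^2 on [-1, 1] with 3 tanh hidden units and a linear output (illustrative).
rng = np.random.default_rng(0)
xs = np.linspace(-1, 1, 50); ys = xs ** 2
W1 = rng.normal(scale=0.5, size=3); b1 = np.zeros(3)   # hidden layer (1-D input)
w2 = rng.normal(scale=0.5, size=3); b2 = 0.0           # linear output unit
lr = 0.05
for _ in range(2000):
    for x, y in zip(xs, ys):
        z = np.tanh(W1 * x + b1)
        y_hat = w2 @ z + b2
        d_out = y_hat - y                       # output delta
        d_hid = (1 - z ** 2) * w2 * d_out       # hidden deltas (backprop)
        w2 -= lr * d_out * z; b2 -= lr * d_out
        W1 -= lr * d_hid * x; b1 -= lr * d_hid
fit = np.tanh(np.outer(xs, W1) + b1) @ w2 + b2
print("max |error|:", np.max(np.abs(fit - ys)))  # should shrink as training proceeds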

Model complexity of MLP

What determines model complexity for multilayer networks?
Number of units; weight magnitude; activation functions.
To prevent overfitting, can apply the usual tools: CV, regularization.

Two main forms of regularization in neural networks:

  • Weight decay:

L(X; w) = \sum_{i=1}^{N} \frac{1}{2} \left( y_i - \hat{y}(x_i; w) \right)^2 + \frac{1}{2} w^T w

  • Early stopping

[plots: error vs. training iterations, illustrating early stopping]
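A minimal sketch (not from the slides) of both regularizers applied to the earlier linear model: weight decay adds a term proportional to w to the gradient, and early stopping keeps the weights with the lowest validation error. The names lam, patience, and the other parameters are illustrative assumptions.

import numpy as np

# Weight decay + early stopping for the linear model (illustrative sketch).
def fit(X, y, X_val, y_val, lr=0.01, lam=0.1, max_epochs=500, patience=10):
    w = np.zeros(X.shape[1])
    best_w, best_err, since_best = w.copy(), np.inf, 0
    for _ in range(max_epochs):
        for x_i, y_i in zip(X, y):
            grad = (w @ x_i - y_i) * x_i + lam * w     # weight decay adds lam * w
            w -= lr * grad
        val_err = 0.5 * np.sum((X_val @ w - y_val) ** 2)
        if val_err < best_err:                         # track the best validation error
            best_w, best_err, since_best = w.copy(), val_err, 0
        else:
            since_best += 1
            if since_best >= patience:                 # early stopping: no recent improvement
                break
    return best_w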

Next: advanced topics

Sequential data

Departure from the i.i.d. assumption:

  • Probability of observing x_i depends on x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_N.

The sequential dimension may be temporal or spatial:

  • Speech (measurements of acoustic waveform);
  • Language (words);
  • Images (pixels)...

Almost always: assume dependence on past only.

p(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_N) = p(x_i \mid x_1, \ldots, x_{i-1}).

Complexity grows as we increase N.
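To see why complexity grows with N (a step filled in here, not on the slide), apply the chain rule of probability:

p(x_1, \ldots, x_N) = \prod_{i=1}^{N} p(x_i \mid x_1, \ldots, x_{i-1}),

so each successive factor conditions on a longer history; without further assumptions, the conditionals require ever more parameters as i, and hence N, grows. The Markov models below limit how much of that history matters.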

Markov models

The k-th order Markov model:

p(x_i \mid x_1, \ldots, x_{i-1}) = p(x_i \mid x_{i-k}, \ldots, x_{i-1}).

Zeroth order: each x_i independent [plate diagram: x_i repeated N times]

First order (bigrams): \ldots \to x_{i-2} \to x_{i-1} \to x_i \to \ldots

Second order (trigrams): \ldots \; x_{i-2} \; x_{i-1} \; x_i \; x_{i+1} \; \ldots (each node depends on the previous two)
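As a rough sense of scale (an added note, not from the slides): for a discrete alphabet of size V, a k-th order model must specify a distribution over V symbols for each of the V^k possible contexts, i.e. on the order of V^k (V - 1) free parameters, which is why small k is used in practice.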

Markov models for language

A k-th order Markov model is also called a k-gram model.

Example (C. Shannon): character k-grams as a generative model.

k = 0: XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGXYD QPAAMKBZAACIBZLHJQD

k = 1: OCRO HLI RGWR NMIELWIS EU LL NBBESEBYA TH EEI ALHENHTTPA OO BTTV

k = 2: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE FUSO TIZIN ANDY TOBE SEACE CTISBE

k = 3: IN NO IST LAY WHEY CRATICT FROURE BERS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE

k = 4: THE GENERATED JOB PROVIDUAL BETTER TRAND THE DISPLAYED CODE ABOVERY UPONDULTS WELL THE CODERST IN THESTICAL IT TO HOCK BOTHE

How to wreck a nice beach you sing calm incense

(cartoon stolen from T. Hoffman’s slides)

Cartoon

A language model may reduce ambiguity in the acoustic signal.

However, the state of the Markov chain is not observed.