Topics: multilayer networks and advanced topics; feed-forward networks, training, and backpropagation; MLPs as universal approximators; model complexity; sequential data and Markov models; general graphical models; undirected models; semi-supervised learning; active learning; reinforcement learning. Lecture slides for Introduction to Machine Learning by Greg Shakhnarovich, Toyota Technological Institute at Chicago.
TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
November 22, 2010
Collect all the weights into a vector w
Feedforward operation, from input x to output ˆy:
$$\hat{y}(\mathbf{x}; \mathbf{w}) \;=\; f\!\left( \sum_{j=1}^{m} w_j^{(2)} \, h\!\left( \sum_{i=1}^{d} w_{ij}^{(1)} x_i + w_{0j}^{(1)} \right) \right)$$
Common choices for f: linear, logistic, tanh, threshold.
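As a concrete illustration of this feedforward operation, here is a minimal NumPy sketch with one hidden layer; the array layout, the tanh hidden activation, and the toy dimensions are assumptions for the example, not part of the slides:

```python
import numpy as np

def feedforward(x, W1, b1, w2, f=lambda a: a):
    """Compute y_hat(x; w) for a network with one hidden layer.

    x  : input vector, shape (d,)
    W1 : first-layer weights, shape (m, d)   -- W1[j, i] = w^(1)_{ij}
    b1 : first-layer biases,  shape (m,)     -- b1[j]    = w^(1)_{0j}
    w2 : output weights,      shape (m,)     -- w2[j]    = w^(2)_j
    f  : output activation (identity = linear output)
    """
    a = W1 @ x + b1          # pre-activations of the hidden units
    z = np.tanh(a)           # hidden outputs h(a_j); tanh assumed here
    return f(w2 @ z)         # network output

# Toy usage: d = 3 inputs, m = 4 hidden units
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1, w2 = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=4)
print(feedforward(x, W1, b1, w2))
```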
Error of the network on a training set:
$$L(X; \mathbf{w}) \;=\; \frac{1}{2} \sum_{i=1}^{N} \left( y_i - \hat{y}(\mathbf{x}_i; \mathbf{w}) \right)^2$$
Generally, there is no closed-form solution; we resort to gradient descent.
It is enough to evaluate the derivative of L with respect to a single weight, $\partial L / \partial w_{ij}^{(q)}$, one example at a time.
Let's start with a simple linear model $\hat{y} = \sum_j w_j x_j$:
$$\frac{\partial L(\mathbf{x}_i)}{\partial w_j} \;=\; \underbrace{\left( \hat{y}_i - y_i \right)}_{\text{error}} \, x_{ij}.$$
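A minimal sketch of stochastic gradient descent using this per-example gradient for the linear model; the learning rate, epoch count, and toy data are illustrative assumptions:

```python
import numpy as np

def sgd_linear(X, y, lr=0.01, epochs=50):
    """Fit y_hat = sum_j w_j x_j by per-example gradient steps."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            y_hat = w @ X[i]
            grad = (y_hat - y[i]) * X[i]   # (error) * x_ij for every j
            w -= lr * grad
    return w

# Toy usage: recover a known weight vector from noisy data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)
print(sgd_linear(X, y))
```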
General unit activation in a multilayer network:
$$u_t \;=\; h\!\left( \sum_j w_{jt} z_j \right)$$
[Diagram: unit $t$ receives inputs $z_1, z_2, \ldots$ through weights $w_{1t}, w_{2t}, \ldots, w_{st}$ and produces output $u_t$ via activation $h$.]
Forward propagation: calculate for each unit $a_t = \sum_j w_{jt} z_j$.
The loss L depends on $w_{jt}$ only through $a_t$:
$$\frac{\partial L}{\partial w_{jt}} \;=\; \frac{\partial L}{\partial a_t} \, \frac{\partial a_t}{\partial w_{jt}} \;=\; \underbrace{\frac{\partial L}{\partial a_t}}_{\delta_t} \, z_j, \qquad \text{since } \frac{\partial a_t}{\partial w_{jt}} = z_j.$$
Output unit with linear activation: $\delta_t = \hat{y} - y$.
Hidden unit $t$ with activation function $h$, whose output feeds units $s \in S$:
$$\delta_t \;=\; \sum_{s \in S} \frac{\partial L}{\partial a_s} \, \frac{\partial a_s}{\partial a_t} \;=\; h'(a_t) \sum_{s \in S} w_{ts} \, \delta_s$$
[Diagram: unit $t$ connects to the downstream units $s \in S$ via weights $w_{ts}$.]
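Putting the forward and backward passes together, here is a minimal NumPy sketch of backpropagation for the one-hidden-layer network above, with tanh hidden units and a linear output; the function name and interface are assumptions for the example:

```python
import numpy as np

def backprop_single_example(x, y, W1, b1, w2):
    """Return gradients of L = 0.5*(y_hat - y)^2 w.r.t. all weights."""
    # Forward pass: hidden pre-activations, hidden outputs, network output
    a = W1 @ x + b1                 # a_j = sum_i w^(1)_{ij} x_i + w^(1)_{0j}
    z = np.tanh(a)                  # z_j = h(a_j)
    y_hat = w2 @ z                  # linear output unit

    # Backward pass: deltas
    delta_out = y_hat - y                         # output unit: delta = y_hat - y
    delta_hidden = (1 - z**2) * (w2 * delta_out)  # h'(a_t) * sum_s w_ts * delta_s

    # Each weight's gradient is its delta times the input feeding that weight
    grad_w2 = delta_out * z
    grad_W1 = np.outer(delta_hidden, x)
    grad_b1 = delta_hidden
    return grad_W1, grad_b1, grad_w2
```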
Theoretical result [Cybenko, 1989]: a 2-layer net with a linear output can approximate any continuous function over a compact domain to arbitrary accuracy (given enough hidden units!). Examples: 3 hidden units with $\tanh(z) = \frac{e^{2z} - 1}{e^{2z} + 1}$ activation.
[from Bishop]
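As a rough illustration of this result (along the lines of Bishop's figure), the sketch below fits a 1-D target with just 3 tanh hidden units by batch gradient descent; the target function $x^2$, the added output bias, the learning rate, and the step count are all assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)          # 1-D inputs on a compact domain
y = x**2                            # target function (assumed for the demo)

m = 3                               # 3 tanh hidden units
W1, b1 = rng.normal(size=m), rng.normal(size=m)
w2, b2 = rng.normal(size=m), 0.0
lr = 0.05

for _ in range(20000):
    # forward pass over the whole grid
    a = np.outer(x, W1) + b1        # (50, 3) hidden pre-activations
    z = np.tanh(a)
    y_hat = z @ w2 + b2
    err = y_hat - y
    # backward pass: batch gradients of the mean squared error
    delta_h = (1 - z**2) * np.outer(err, w2)
    w2 -= lr * (z.T @ err) / len(x)
    b2 -= lr * err.mean()
    W1 -= lr * (delta_h.T @ x) / len(x)
    b1 -= lr * delta_h.mean(axis=0)

print(np.max(np.abs(y_hat - y)))    # worst-case error over the grid
```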
What determines model complexity for multilayer networks? The number of units, the weight magnitudes, and the activation functions.
To prevent overfitting, we can apply the usual tools: cross-validation and regularization.
Two main forms of regularization in neural networks: weight decay (a penalty on the weights, below) and early stopping.
$$L(X; \mathbf{w}) \;=\; \frac{1}{2} \sum_{i=1}^{N} \left( y_i - \hat{y}(\mathbf{x}_i; \mathbf{w}) \right)^2 \;+\; \frac{\lambda}{2}\, \mathbf{w}^{T}\mathbf{w}$$
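In gradient terms (a direct consequence of the penalized loss above, assuming the $\lambda/2$ scaling), each weight simply picks up an extra $\lambda \mathbf{w}$ term:

```latex
\nabla_{\mathbf{w}} L(X;\mathbf{w})
  \;=\; \sum_{i=1}^{N} \bigl(\hat{y}(\mathbf{x}_i;\mathbf{w}) - y_i\bigr)\,
        \nabla_{\mathbf{w}} \hat{y}(\mathbf{x}_i;\mathbf{w})
  \;+\; \lambda\,\mathbf{w}
```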
[Plots from Bishop: training-set and validation-set error over training iterations 0–50, motivating early stopping.]
Departure from the i.i.d. assumption:
The sequential dimension may be temporal or spatial.
Almost always: assume dependence on the past only,
$$p(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_N) \;=\; p(x_i \mid x_1, \ldots, x_{i-1}).$$
Complexity grows as we increase N.
The k-th order Markov model:
$$p(x_i \mid x_1, \ldots, x_{i-1}) \;=\; p(x_i \mid x_{i-k}, \ldots, x_{i-1}).$$
Zeroth order: the $x_i$ are independent (no edges between the N nodes).
First order (bigrams): a chain $\cdots \to x_{i-2} \to x_{i-1} \to x_i \to \cdots$
Second order (trigrams): each $x_i$ depends on both $x_{i-1}$ and $x_{i-2}$ (graph over $\ldots, x_{i-2}, x_{i-1}, x_i, x_{i+1}, \ldots$).
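For concreteness (a standard consequence of the definition, not spelled out on the slide), the joint distribution under a first-order model factorizes as

```latex
p(x_1, \ldots, x_N) \;=\; p(x_1) \prod_{i=2}^{N} p(x_i \mid x_{i-1})
```

A k-th order model conditions each factor on the previous k symbols instead.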
The k-th order Markov model is also called a k-gram model.
Example (C. Shannon): character k-grams as a generative model.
k = 0: XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGXYD QPAAMKBZAACIBZLHJQD
k = 1: OCRO HLI RGWR NMIELWIS EU LL NBBESEBYA TH EEI ALHENHTTPA OO BTTV
k = 2: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE FUSO TIZIN ANDY TOBE SEACE CTISBE
k = 3: IN NO IST LAY WHEY CRATICT FROURE BERS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE
k = 4: THE GENERATED JOB PROVIDUAL BETTER TRAND THE DISPLAYED CODE ABOVERY UPONDULTS WELL THE CODERST IN THESTICAL IT TO HOCK BOTHE
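A minimal sketch of how such samples can be generated: count k-gram statistics from a corpus, then sample each character given the previous k; the toy corpus and unsmoothed counts are assumptions of the example:

```python
import random
from collections import Counter, defaultdict

def train_kgram(text, k):
    """Count, for each length-k context, how often each next character follows."""
    counts = defaultdict(Counter)
    for i in range(len(text) - k):
        counts[text[i:i + k]][text[i + k]] += 1
    return counts

def sample(counts, k, length=60, seed=0):
    """Generate text one character at a time from p(x_i | x_{i-k}, ..., x_{i-1})."""
    rng = random.Random(seed)
    out = rng.choice(list(counts))               # pick a starting context
    for _ in range(length):
        dist = counts.get(out[-k:] if k else "")
        if not dist:                             # unseen context: restart
            dist = counts[rng.choice(list(counts))]
        chars, weights = zip(*dist.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

corpus = "the quick brown fox jumps over the lazy dog " * 50  # toy corpus
print(sample(train_kgram(corpus, 2), 2))
```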
(cartoon stolen from T. Hoffman’s slides)
A language model may reduce ambiguity in the acoustic signal.
However, the state of the Markov chain is not observed.