Deterministic Annealing: A Technique for Optimization and Pattern Classification, Study notes of Computer Graphics

University of California-Santa Cruz Computer Graphics

An overview of deterministic annealing (DA), a technique for optimization and pattern classification. DA is based on the principles of annealing in statistical mechanics and is used to search for global minima or maxima. The notes cover the basics of simulated annealing, the differences between simulated annealing and deterministic annealing, and the application of deterministic annealing to pattern classification, with formulas and examples to illustrate the concepts.

Typology: Study notes

Pre 2010

Uploaded on 08/19/2009


Graham Grindlay

CMPS 290C

03/03/03

Notes: Deterministic Annealing

1. Review of Simulated (Stochastic) Annealing:

- SA is an optimization technique that can be employed to find global minima or maxima.
- It is derived from statistical mechanics and simulates the process of annealing, which is the gradual cooling of metal or glass where the material’s molecules settle into an optimum lattice structure.
- SA simulates the random evolution of a physical system and reaches equilibrium as the steady-state distribution over states of a corresponding Markov chain.
- SA can be shown to converge in probability to the set of globally optimal solutions.
- SA can be extremely computationally expensive for many problems.

[Figure: plot of the example cost function C(x) = sin(x) against x]

- Basic Algorithm, with cost function $C$, cost change $\Delta C = C(x_{new}) - C(x_{old})$, temperature $T$, and Boltzmann’s constant $k$:

1. Set the independent variables to their expected values (this is used as the initial centering point).
2. Set $T$ to a relatively high number.
3. Perturb the independent variables.
4. Calculate the new result with the new variables.
5. If the new result is lower than or equal to the best, save the result.
6. Else, if the new result is higher than the best, choose a random number $r$ uniformly from $[0, 1]$. If $r < \exp\{-\Delta C / kT\}$, then save the result.
7. Repeat steps 3-6 $n$ number of times.
8. If an improvement has been made after the $n$ iterations, set the center point to be the best point.


9. Reduce the temperature. Repeat steps 3-9 for the chosen number of temperatures.
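The simulated annealing steps can be sketched in Python as follows. This is a minimal illustration rather than the notes' exact procedure: it assumes a single real variable, a uniform perturbation of width `step`, a geometric cooling rule `T <- cooling * T`, and `k = 1`, none of which are specified in the notes. It also compares each candidate with the current point (the usual Metropolis rule) while tracking the best point separately.

```python
import math
import random


def simulated_annealing(cost, x0, t_start=10.0, t_min=1e-3, cooling=0.9,
                        n=200, step=1.0, k=1.0, seed=0):
    """Minimize cost(x) over one real variable; returns (best_x, best_cost)."""
    rng = random.Random(seed)
    best, best_cost = x0, cost(x0)          # step 1: initial centering point
    t = t_start                             # step 2: relatively high temperature
    while t > t_min:
        # steps 1 and 8: each stage starts from the best point found so far
        current, current_cost = best, best_cost
        for _ in range(n):                  # step 7: repeat steps 3-6 n times
            candidate = current + rng.uniform(-step, step)   # step 3: perturb
            candidate_cost = cost(candidate)                 # step 4: new result
            delta = candidate_cost - current_cost
            # steps 5-6: accept if not worse, otherwise accept with
            # probability exp(-delta / (k * T))
            if delta <= 0 or rng.random() < math.exp(-delta / (k * t)):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= cooling                        # step 9: reduce the temperature
    return best, best_cost


best_x, best_c = simulated_annealing(lambda x: (x - 2.0) ** 2, x0=8.0)
print(best_x, best_c)
```

The quadratic cost here is only a convenient test case; any callable `cost` works, and the schedule parameters would need tuning for a rougher cost surface.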

Steps 3-6 correspond to the Metropolis algorithm which we saw in the MCMC paper. Recall that in the MCMC paper, Metropolis is defined as having an acceptance probability of:

$$A(x_i, x^*) = \min\left\{1, \frac{p(x^*)}{p(x_i)}\right\}$$

When we define our cost function, C, to be –log(p(x)), we can see that steps 3-6 are equivalent to Metropolis because we accept with probability 1 when our new result is better than the last (corresponding to a ratio of >1 in the Metropolis formula), otherwise (disregarding constants T and k) we accept with probability

$$\exp\{-(-\log(p(x^*)) - (-\log(p(x_i))))\} = \frac{p(x^*)}{p(x_i)}$$
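A quick numerical check of this equivalence, as a sketch: the two probability values are arbitrary illustrative numbers, and the constants $T$ and $k$ are set to 1 as in the argument above.

```python
import math


def metropolis_acceptance(p_new, p_old):
    """Metropolis acceptance probability min{1, p(x*) / p(x_i)}."""
    return min(1.0, p_new / p_old)


def annealing_acceptance(c_new, c_old):
    """Accept with probability 1 if the cost does not rise, else exp(-delta C)."""
    delta = c_new - c_old
    return 1.0 if delta <= 0 else math.exp(-delta)


# With C(x) = -log p(x), both rules give the same acceptance probability.
p_old, p_new = 0.4, 0.1                      # arbitrary illustrative values
c_old, c_new = -math.log(p_old), -math.log(p_new)
print(metropolis_acceptance(p_new, p_old), annealing_acceptance(c_new, c_old))
```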

Deterministic Annealing:
- Minimizes the Lagrangian (which is analogous to the Helmholtz free energy equation in statistical mechanics):

$$F = \langle D \rangle - T H,$$

where $\langle D \rangle$ is the expectation of the target function to be minimized, $T$ is a Lagrange multiplier (the temperature), and $H$ is the Shannon entropy of the Markov chain.

- Rather than simulating the exact stochastic evolution of the system, DA determines the effective distribution over the states of the system at each temperature and optimizes the expected value of the cost function (free energy equation $F$) directly.
- The main difference between DA and SA is therefore that DA does not evolve a minimum at each temperature through many modifications to the target function’s parameters, but instead finds the minimum at each temperature directly.
- DA does not guarantee a global optimum, but it can still avoid many local minima.
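To make "finding the minimum at each temperature directly" concrete, here is a small sketch under a simplifying assumption that is not in the notes: the system has a finite set of states with known costs (the list `costs` is hypothetical). In that case the distribution minimizing $F = \langle D \rangle - TH$ at temperature $T$ is the Gibbs distribution $p_k \propto \exp(-D_k / T)$, so no sampling is needed.

```python
import math


def gibbs(costs, t):
    """Distribution minimizing <D> - T*H over a finite state set."""
    lowest = min(costs)                      # shift for numerical stability
    weights = [math.exp(-(c - lowest) / t) for c in costs]
    z = sum(weights)
    return [w / z for w in weights]


def free_energy(p, costs, t):
    """F = <D> - T*H for distribution p, with H the Shannon entropy."""
    expected = sum(pk * c for pk, c in zip(p, costs))
    entropy = -sum(pk * math.log(pk) for pk in p if pk > 0)
    return expected - t * entropy


costs = [3.0, 1.0, 2.5, 1.2]                 # hypothetical state costs
t = 5.0
while t > 0.01:
    p = gibbs(costs, t)
    print(f"T={t:.3f}  F={free_energy(p, costs, t):.4f}  p={[round(x, 3) for x in p]}")
    t *= 0.5                                 # assumed geometric cooling
```

As the temperature falls, the printed distribution moves from nearly uniform toward a point mass on the lowest-cost state.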
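- Basic Algorithm:

1. Set parameters: initial temperature $T_{init}$, final temperature $T_{final}$, minimum entropy $H_{min}$, and an annealing schedule function $\alpha(\cdot)$.
2. Set $T = T_{init}$ and $\gamma = 0$.
3. Compute $F = \min_{\{\theta_j\}, \gamma} \{\langle D \rangle - TH\}$ using gradient descent or a similar method.
4. Lower the temperature: $T \leftarrow \alpha(T)$.
5. If $T > T_{final}$, go to step 3; else go to step 6.
6. Quenching: increase $\gamma$ according to $\gamma \leftarrow q(\gamma)$ while minimizing over $\{\theta_j\}$.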

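- $\gamma$ controls the “fuzziness” of the distribution.
- As $\gamma \to \infty$, the highest discriminant function wins with probability 1.
- For finite positive $\gamma$, classes with larger discriminant function values are assigned higher probabilities of winning.
- For $\gamma = 0$, the distribution is uniform.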
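The three regimes of $\gamma$ can be checked numerically. The sketch below assumes the winning probabilities take the Gibbs form $P(j) \propto \exp(\gamma d_j)$ over discriminant values $d_j$, matching the Gibbs distribution used later in these notes; the discriminant values are made up for illustration.

```python
import math


def winning_probabilities(discriminants, gamma):
    """P(j) proportional to exp(gamma * d_j), computed stably."""
    top = max(discriminants)
    weights = [math.exp(gamma * (d - top)) for d in discriminants]
    z = sum(weights)
    return [w / z for w in weights]


d = [1.0, 2.0, 0.5]                          # hypothetical discriminant values
for gamma in (0.0, 1.0, 1000.0):
    print(gamma, [round(p, 3) for p in winning_probabilities(d, gamma)])
```

With `gamma = 0.0` every class gets probability 1/3, with `gamma = 1.0` the ordering of the probabilities follows the ordering of the discriminants, and with a very large `gamma` the class with the highest discriminant wins with probability 1.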
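- Probability of error:

$$\langle P_e \rangle = 1 - \frac{1}{N} \sum_{i=1}^{N} P_{x_i}(y_i),$$

where $P_{x_i}(y_i)$ is the probability that pattern $x_i$ is assigned to its correct class $y_i$.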

We also need to define the entropy of our distribution so that we can constrain the level of randomness in the classifier:

$$H = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{J} P_{x_i}(j) \log P_{x_i}(j)$$
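Both quantities are easy to compute once the assignment probabilities are available. The sketch below assumes they are stored as a list of rows, one row per pattern, with entry `j` holding the probability that the pattern is assigned to class `j`; this layout and the sample numbers are illustrative assumptions.

```python
import math


def expected_error(probs, labels):
    """<Pe> = 1 - (1/N) * sum_i P(pattern i assigned to its correct class)."""
    n = len(probs)
    return 1.0 - sum(row[y] for row, y in zip(probs, labels)) / n


def average_entropy(probs):
    """H = -(1/N) * sum_i sum_j P_ij * log(P_ij), skipping zero entries."""
    n = len(probs)
    return -sum(p * math.log(p) for row in probs for p in row if p > 0) / n


probs = [[0.5, 0.5], [1.0, 0.0]]             # hypothetical assignment probabilities
labels = [0, 0]                              # correct class of each pattern
print(expected_error(probs, labels), average_entropy(probs))
```

The second pattern is assigned deterministically and contributes no entropy, so the average entropy comes entirely from the first pattern.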

- Ultimate objective: minimize $\langle P_e \rangle$ subject to $H = 0$.
- Lagrangian equivalent: $F = \min_{\{\theta_j\}, \gamma} (\langle P_e \rangle - TH)$
- As $T \to 0$, the optimization reduces to the unconstrained minimization of $\langle P_e \rangle$, which leads to the optimal non-random maximum discriminant classifier.
- When $T = 0$, the classifier becomes a hard (non-random) classifier.
- DA is much faster than SA.

Using DA with HMMs for Speech Recognition:

Normally, we use the Maximum Likelihood technique to design HMM classifiers for speech recognition. While ML is fast, it does not directly address the goal of classifier design which is minimization of error.
Minimum Classification Error techniques attempt to solve the direct problem instead. However, this is challenging because the error surface is extremely difficult to optimize: its piecewise-constant nature prevents the use of gradient descent techniques.
DA allows us to directly minimize the cost surface while simultaneously avoiding many local minima.
We are given a training set $\{(x_1, c_1), (x_2, c_2), \ldots, (x_N, c_N)\}$, where $x_i$ (a feature vector of length $l_i$) is an utterance of the word $c_i$, which belongs to a finite-sized dictionary $C = \{1, \ldots, J\}$.
The recognition system consists of a set of HMMs $\{H_j : j = 1, 2, \ldots, J\}$, one for each word in the dictionary.
Each model $H_j$ has $S_j$ states and is fully specified by a parameter set $\theta_j$, which consists of the usual HMM components (priors $\pi_j$, transition probabilities $A_j$, and observation probabilities $B_j$).
We determine our classifier $C$, which maps training pattern $x_i$ to class $C(x_i)$, in the following way:
Given the pattern $x_i$, a path score is computed for each HMM $H_j$ and for each sequence of states $s = (s(0), s(1), \ldots, s(l_i))$ in $H_j$’s trellis:

$$l(x_i, s, H_j) = \log \pi_j[s(0)] + \sum_{t=0}^{l_i - 1} \log A_j[s(t), s(t+1)] + \sum_{t=1}^{l_i} \log B_j[s(t), x_i(t)]$$

The path with the highest score is determined and we map xi to the class of the HMM that the winning path belongs to.
$$d_j(x_i) = \max_{s \in S_{l_i}(H_j)} l(x_i, s, H_j),$$

where $S_{l_i}(H_j)$ is the set of state sequences of length $l_i$ in model $H_j$.

$$C(x_i) = \arg\max_j d_j(x_i)$$
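The best-path rule can be sketched with a log-domain Viterbi recursion. The sketch makes several assumptions that go beyond the notes: observations are discrete symbols indexing the columns of $B_j$, the path score is the log prior of the initial state $s(0)$ plus one log transition and one log observation term per time step, and the two toy models are invented for illustration.

```python
import math


def log_rows(rows):
    """Element-wise log of a table of strictly positive probabilities."""
    return [[math.log(p) for p in row] for row in rows]


def best_path_score(obs, log_pi, log_a, log_b):
    """d_j(x): highest path score of one HMM for the symbol sequence obs."""
    n_states = len(log_pi)
    score = list(log_pi)                     # best score ending in each s(0)
    for symbol in obs:                       # t = 1, ..., l_i
        score = [
            max(score[prev] + log_a[prev][s] for prev in range(n_states))
            + log_b[s][symbol]
            for s in range(n_states)
        ]
    return max(score)


def classify(obs, models):
    """C(x) = argmax_j d_j(x); also returns all best-path scores."""
    scores = [best_path_score(obs, *model) for model in models]
    winner = max(range(len(models)), key=lambda j: scores[j])
    return winner, scores


# Two hypothetical 2-state word models over binary symbols.
pi = [math.log(0.5), math.log(0.5)]
a = log_rows([[0.5, 0.5], [0.5, 0.5]])
model_0 = (pi, a, log_rows([[0.9, 0.1], [0.8, 0.2]]))   # favors symbol 0
model_1 = (pi, a, log_rows([[0.1, 0.9], [0.2, 0.8]]))   # favors symbol 1
label, scores = classify([0, 0, 1], [model_0, model_1])
print(label, scores)
```

The recursion keeps only the best score per state at each step, so it finds the maximum over all state sequences without enumerating them.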

The empirical error rate is:

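$$P_e = 1 - \frac{1}{N} \sum_{i=1}^{N} \delta(C(x_i), c_i),$$

where $\delta(\cdot, \cdot)$ is the Kronecker delta function. The design goal is to minimize $P_e$ over the model parameters $\theta$.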

The problem is the piecewise constant nature of Pe, making direct descent optimization impossible.
ML avoids this problem by replacing the true cost function with a sub-optimal design objective.
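The piecewise-constant behavior is easy to see on a toy problem. The sketch below uses a hypothetical one-parameter threshold classifier in place of the HMM system (an assumption made purely for illustration) and evaluates the empirical error rate at several nearby parameter values.

```python
def empirical_error_rate(predicted, true):
    """Pe = 1 - (1/N) * sum_i delta(C(x_i), c_i)."""
    correct = sum(1 for c_hat, c in zip(predicted, true) if c_hat == c)
    return 1.0 - correct / len(true)


def threshold_classifier(x, theta):
    """Hypothetical classifier: class 1 if x exceeds the threshold theta."""
    return 1 if x > theta else 0


xs = [0.2, 0.4, 0.6, 0.8]                    # toy patterns
cs = [0, 0, 1, 1]                            # their correct classes


def error_at(theta):
    return empirical_error_rate([threshold_classifier(x, theta) for x in xs], cs)


for theta in (0.30, 0.31, 0.45, 0.50):
    print(theta, error_at(theta))
```

Small changes in `theta` leave the error unchanged until a pattern crosses the threshold, at which point the error jumps. The gradient is zero almost everywhere, which is why descent methods cannot be applied to $P_e$ directly.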
To use DA, we must first define a randomized version of our best-path classification rule. Instead of assigning pattern $x_i$ to a unique winning state sequence, it associates each pattern $x_i$ with every state sequence $s$ in the trellis of every HMM $H_j$, with probability $P[s, j \mid x_i]$.
This is the probability of selecting state sequence $s$ and, consequently, HMM $H_j$; it is optimally modeled by the Gibbs distribution:

$$P[s, j \mid x_i] = \frac{e^{\gamma\, l(x_i, s, H_j)}}{\sum_{j'} \sum_{s' \in S_{l_i}(H_{j'})} e^{\gamma\, l(x_i, s', H_{j'})}}$$

The parameter $\gamma$ determines the level of entropy of the distribution in the same way as described earlier.
We can now actually minimize the classifier error using the expected misclassification rate of the random classifier:

$$\langle P_e \rangle = 1 - \frac{1}{N} \sum_{i=1}^{N} P[c_i \mid x_i]$$

P[ci | xi] is the probability that the correct class is selected as the winner and can be found by summing over paths:

$$P[c_i \mid x_i] = \sum_{s \in S_{l_i}(H_{c_i})} P[s, c_i \mid x_i]$$
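For very small models, the randomized classifier can be evaluated by brute force, which makes the definitions concrete. The sketch below enumerates every state sequence of every model, applies the Gibbs distribution over path scores, and sums over the paths of the correct model. It assumes discrete observation symbols, strictly positive probabilities, and a path score made of the log prior of $s(0)$ plus one log transition and one log observation term per time step; the toy models and data are invented, and a practical implementation would replace the enumeration with a forward-style recursion.

```python
import itertools
import math


def path_score(obs, path, pi, a, b):
    """Log score of one state sequence (s(0), ..., s(l)) for symbols obs."""
    score = math.log(pi[path[0]])
    for t in range(1, len(path)):
        score += math.log(a[path[t - 1]][path[t]]) + math.log(b[path[t]][obs[t - 1]])
    return score


def class_probabilities(obs, models, gamma):
    """P[j | x]: Gibbs probability mass summed over all paths of model j."""
    per_model = []
    for pi, a, b in models:
        paths = itertools.product(range(len(pi)), repeat=len(obs) + 1)
        per_model.append([path_score(obs, p, pi, a, b) for p in paths])
    top = max(max(scores) for scores in per_model)      # for stability
    weights = [sum(math.exp(gamma * (l - top)) for l in scores)
               for scores in per_model]
    z = sum(weights)
    return [w / z for w in weights]


def expected_error(dataset, models, gamma):
    """<Pe> = 1 - (1/N) * sum_i P[c_i | x_i]."""
    total = sum(class_probabilities(x, models, gamma)[c] for x, c in dataset)
    return 1.0 - total / len(dataset)


# Two hypothetical 2-state word models over binary symbols.
pi = [0.5, 0.5]
a = [[0.5, 0.5], [0.5, 0.5]]
models = [(pi, a, [[0.9, 0.1], [0.8, 0.2]]),
          (pi, a, [[0.1, 0.9], [0.2, 0.8]])]
dataset = [([0, 0, 1], 0), ([1, 1, 0], 1)]
for gamma in (0.0, 1.0, 50.0):
    print(gamma, expected_error(dataset, models, gamma))
```

Unlike the empirical error rate of the hard classifier, this expected error varies smoothly with $\gamma$ and with the model parameters, which is what makes gradient-based minimization possible.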

So our design problem for the random classifier is to find the optimal values of the model parameters $\{\theta_j\}$ and $\gamma$, which determine $P[s, c_i \mid x_i]$, so as to minimize the probability of error.
In order to keep from getting stuck in shallow local minima, we need to gradually reduce the randomness of the classifier through a constraint on the entropy, H: