Deterministic Annealing: A Technique for Optimization and Pattern Classification, Study notes of Computer Graphics

An overview of deterministic annealing (DA), a technique used for optimization and pattern classification. DA is based on the principles of annealing in statistical mechanics and is used to search for global minima or maxima. The notes cover the basics of simulated annealing, the differences between simulated annealing and deterministic annealing, and the application of deterministic annealing in pattern classification, with formulas and examples to illustrate the concepts.

Uploaded on 08/19/2009

Graham Grindlay
CMPS 290C
03/03/03
Notes: Deterministic Annealing


  1. Review of Simulated (Stochastic) Annealing:
    • SA is an optimization technique that can be employed to find global minima or maxima.
    • It is derived from statistical mechanics and simulates the process of annealing, which is the gradual cooling of metal or glass where the material’s molecules settle into an optimum lattice structure.
    • SA simulates the random evolution of a physical system and reaches equilibrium as the steady-state distribution over states of a corresponding Markov chain.
    • SA can be shown to converge in probability to the set of globally optimal solutions.
    • Can be extremely computationally expensive for many problems.

[Figure: plot of an example cost function C(x).]

  • Basic Algorithm: cost function $C$, $\Delta C = C(x_{new}) - C(x_{old})$, temperature $T$, $k$ = Boltzmann’s constant.
      1. Set the independent variables to their expected values (this is used as the initial centering point).
      2. Set $T$ to a relatively high number.
      3. Perturb the independent variables.
      4. Calculate the new result with the new variables.
      5. If the new result is lower than or equal to the best, save the result.
      6. Else, if the new result is higher than the best, choose a random number $r$ uniformly from $[0, 1]$. If $r < \exp\{-\Delta C / kT\}$, then save the result.
      7. Repeat steps 3-6 $n$ number of times.
      8. If an improvement has been made after the $n$ iterations, set the center point to be the best point.
      9. Reduce the temperature.
      10. Repeat steps 3-9 for $t$ number of temperatures.

  • Steps 3-6 correspond to the Metropolis algorithm which we saw in the MCMC paper. Recall that in the MCMC paper, Metropolis is defined as having an acceptance probability of:

$$A(x^{(i)}, x^*) = \min\left\{1, \frac{p(x^*)}{p(x^{(i)})}\right\}$$

  • When we define our cost function, $C$, to be $-\log p(x)$, we can see that steps 3-6 are equivalent to Metropolis: we accept with probability 1 when our new result is at least as good as the last (corresponding to a ratio of $\geq 1$ in the Metropolis formula); otherwise (disregarding the constants $T$ and $k$) we accept with probability

$$\exp\{-[(-\log p(x^*)) - (-\log p(x^{(i)}))]\} = \frac{p(x^*)}{p(x^{(i)})}$$
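The simulated annealing steps in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the exact procedure from the notes: the function name `simulated_annealing`, the geometric cooling factor, the uniform perturbation of width `step`, and the default parameter values are all hypothetical choices. The acceptance test compares the new point with the current point of the chain (the Metropolis rule), and the best point seen so far is tracked separately.

```python
import math
import random


def simulated_annealing(cost, x0, t_start=10.0, cooling=0.9, n_temps=50,
                        n_iters=100, step=1.0, k=1.0, seed=0):
    """Minimize a 1-D cost function with a basic simulated annealing loop."""
    rng = random.Random(seed)
    center = x0                          # step 1: initial centering point
    best_x, best_c = x0, cost(x0)
    temp = t_start                       # step 2: start at a high temperature
    for _ in range(n_temps):             # step 10: loop over temperatures
        cur_x, cur_c = center, cost(center)
        for _ in range(n_iters):         # step 7: repeat steps 3-6 n times
            new_x = cur_x + rng.uniform(-step, step)   # step 3: perturb
            new_c = cost(new_x)                        # step 4: evaluate
            delta = new_c - cur_c                      # Delta C
            # steps 5-6: always accept an improvement; accept a worse
            # point with probability exp(-Delta C / (k T))
            if delta <= 0 or rng.random() < math.exp(-delta / (k * temp)):
                cur_x, cur_c = new_x, new_c
            if cur_c < best_c:           # remember the best point so far
                best_x, best_c = cur_x, cur_c
        center = best_x                  # step 8: re-center on the best point
        temp *= cooling                  # step 9: reduce the temperature
    return best_x, best_c
```

For example, `simulated_annealing(lambda x: (x - 2.0) ** 2, x0=-5.0)` returns a point close to the minimum at x = 2 together with its cost.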

  2. Deterministic Annealing:
    • Minimizes the Lagrangian (which is analogous to the Helmholtz free energy equation in statistical mechanics):

$$F = \langle D \rangle - TH,$$

where $\langle D \rangle$ is the expectation of the target function to be minimized, $T$ is a Lagrange multiplier (temperature), and $H$ is the Shannon entropy of the Markov chain.

  • Rather than simulating the exact stochastic evolution of the system, DA determines the effective distribution over the states of the system at each temperature and optimizes the expected value of the cost function (free energy equation F) directly.
  • So the main difference of DA as compared to SA is that it does not evolve a minimum at each temperature through many modifications to the target function’s parameters, but instead finds the minimum at each temperature directly.
  • Does not guarantee a global optimum, but can still avoid many local minima.
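  • Basic Algorithm:
      1. Set parameters: initial temperature $T_{init}$, final temperature $T_{final}$, minimum entropy $H_{min}$, and annealing schedule function $\alpha(\cdot)$.
      2. Set $T = T_{init}$ and set $\gamma$ to its initial value.
      3. Compute $F = \min_{\{\theta_j\}, \gamma} \{\langle D \rangle - TH\}$ using gradient descent or similar.
      4. Lower the temperature: $T \leftarrow \alpha(T)$.
      5. If $T > T_{final}$, go to step 3; else go to step 6.
      6. Quenching: increase $\gamma$ according to $\gamma \leftarrow q(\gamma)$, minimizing $\langle D \rangle$ over $\{\theta_j\}$.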

  • $\gamma$ controls the "fuzziness" of the distribution.
  • As $\gamma \to \infty$, the highest discriminant function wins with probability 1.
  • For finite positive $\gamma$, classes with larger discriminant function values are assigned higher probabilities of winning.
  • For $\gamma = 0$, the distribution is uniform.
  • Probability of error:

$$P_e = 1 - \frac{1}{N}\sum_{i=1}^{N} P_{x_i}(y_i)$$

  • We also need to define the entropy of our distribution so that we can constrain the level of randomness in the classifier:

$$H = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{J} P_{x_i}(j)\log P_{x_i}(j)$$

  • Ultimate objective: minimize Pe when H = 0.
  • Lagrangian equivalent: $F = \min_{\{\theta_j\}, \gamma} (P_e - TH)$
  • As $T \to 0$, optimization reduces to the unconstrained minimization of $P_e$, which forces $\gamma \to \infty$ and leads to the optimal non-random maximum discriminant classifier.
  • When T = 0, classifier becomes a hard (non-random) classifier.
  • Much faster than SA.
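The role of $\gamma$, the probability of error and the entropy in this section can be illustrated with a small Python sketch. The notes do not state the form of the class distribution at this point, so the sketch assumes a Gibbs (softmax) distribution over the discriminant values, $P_x(j) \propto e^{\gamma d_j(x)}$, which has the limiting behavior listed above. The function names and the 0-based class labels are hypothetical.

```python
import math


def class_probabilities(discriminants, gamma):
    """Assumed Gibbs distribution over classes for one pattern."""
    top = max(discriminants)             # subtract the max for stability
    weights = [math.exp(gamma * (d - top)) for d in discriminants]
    total = sum(weights)
    return [w / total for w in weights]


def error_and_entropy(all_discriminants, labels, gamma):
    """Return (P_e, H), both averaged over the N training patterns."""
    n = len(labels)
    p_correct = 0.0
    entropy = 0.0
    for d, y in zip(all_discriminants, labels):
        probs = class_probabilities(d, gamma)
        p_correct += probs[y]            # probability the correct class wins
        entropy -= sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - p_correct / n, entropy / n
```

With `gamma = 0` every class receives the same probability, so the entropy takes its maximum value log J. As `gamma` grows, the probability mass moves to the class with the largest discriminant and the entropy falls toward 0.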
  3. Using DA with HMMs for Speech Recognition:
  • Normally, we use the Maximum Likelihood technique to design HMM classifiers for speech recognition. While ML is fast, it does not directly address the goal of classifier design which is minimization of error.
  • Minimum Classification Error techniques attempt to solve the direct problem instead. However, this is challenging because the error surface is extremely difficult to optimize: its piecewise-constant nature prevents the use of gradient descent techniques.
  • DA allows us to directly minimize the cost surface while simultaneously avoiding many local minima.
  • Given a training set $\{(x_1, c_1), (x_2, c_2), \ldots, (x_N, c_N)\}$, where $x_i$ (a feature vector of length $l_i$) is an utterance of the word $c_i$, which belongs to a finite-sized dictionary $C = \{1, \ldots, J\}$.
  • Recognition system consists of a set of HMMs $\{H_j : j = 1, 2, \ldots, J\}$, one for each word in the dictionary.
  • Each model $H_j$ has $S_j$ states and is fully specified by a parameter set $\theta_j$, which consists of the usual HMM components (priors $\pi_j$, transition probabilities $A_j$, and observation probabilities $B_j$).
  • We determine our classifier $C$, which maps training pattern $x_i$ to class $C(x_i)$, in the following way:
  • Given the pattern $x_i$, a path score is computed for each HMM $H_j$ and for each sequence of states $s = (s(0), s(1), \ldots, s(l_i))$ in $H_j$'s trellis:

$$l(x_i, s, H_j) = \log \pi_j[s(0)] + \sum_{t=1}^{l_i} \{\log A_j[s(t-1), s(t)] + \log B_j[s(t), x_i(t)]\}$$

  • The path with the highest score is determined and we map xi to the class of the HMM that the winning path belongs to.
  • $d_j(x_i) = \max_{s \in S_{l_i}(H_j)} l(x_i, s, H_j)$, where $S_{l_i}(H_j)$ is the set of state sequences of length $l_i$ in model $H_j$.
  • $C(x_i) = \arg\max_j d_j(x_i)$

  • The empirical error rate is:

$$P_e = 1 - \frac{1}{N}\sum_{i=1}^{N} \delta(C(x_i), c_i),$$

which is to be minimized over the model parameters $\theta$, and where $\delta(\cdot, \cdot)$ is the Kronecker delta function.
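The best-path classification rule and the empirical error rate can be sketched as follows. This is an illustration under several assumptions that are not in the notes: observations are discrete symbols, each HMM is a plain dictionary with hypothetical keys `"pi"`, `"A"` and `"B"`, classes are numbered from 0, the path score adds the log prior of the initial state to the log transition and observation terms, and all paths are enumerated by brute force (practical only for tiny models; the Viterbi algorithm would normally be used instead).

```python
import itertools
import math


def path_score(x, s, hmm):
    """Log path score for observation list x and state path s.

    s has len(x) + 1 entries: s[0] is the initial state and s[t] emits
    the observation x[t - 1].
    """
    score = math.log(hmm["pi"][s[0]])
    for t in range(1, len(s)):
        score += math.log(hmm["A"][s[t - 1]][s[t]])
        score += math.log(hmm["B"][s[t]][x[t - 1]])
    return score


def discriminant(x, hmm):
    """d_j(x): the highest path score over all state sequences."""
    n_states = len(hmm["pi"])
    paths = itertools.product(range(n_states), repeat=len(x) + 1)
    return max(path_score(x, s, hmm) for s in paths)


def classify(x, hmms):
    """C(x): index of the HMM that owns the winning path."""
    scores = [discriminant(x, h) for h in hmms]
    return scores.index(max(scores))


def empirical_error(data, hmms):
    """P_e: fraction of (x, c) pairs whose predicted class differs from c."""
    wrong = sum(1 for x, c in data if classify(x, hmms) != c)
    return wrong / len(data)
```

Because `classify` returns a hard decision, `empirical_error` only changes when a pattern crosses a decision boundary, which is the piecewise-constant behavior that makes direct descent difficult.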

  • The problem is the piecewise constant nature of Pe, making direct descent optimization impossible.
  • ML avoids this problem by replacing the true cost function with a sub-optimal design objective.
  • To use DA, we must first define a randomized version of our best path classification rule which instead of assigning pattern xi to a unique winning state sequence, associates each pattern xi with every state sequence s, in the trellis of every HMM Hj with probability P[s, j | xi].
  • This is the probability of selecting state sequence s and consequently, HMM Hj and is optimally modeled by the Gibbs distribution:

$$P[s, j \mid x_i] = \frac{e^{\gamma\, l(x_i, s, H_j)}}{\sum_{j'} \sum_{s' \in S_{l_i}(H_{j'})} e^{\gamma\, l(x_i, s', H_{j'})}}$$

  • The parameter $\gamma$ determines the level of entropy of the distribution in the same way as described earlier.
  • We can now actually minimize the classifier error using the expected misclassification rate of the random classifier:

$$P_e = 1 - \frac{1}{N}\sum_{i=1}^{N} P[c_i \mid x_i]$$

  • P[ci | xi] is the probability that the correct class is selected as the winner and can be found by summing over paths:

$$P[c_i \mid x_i] = \sum_{s \in S_{l_i}(H_{c_i})} P[s, c_i \mid x_i]$$
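The randomized classifier can be sketched in the same style. The code below assumes discrete observation symbols, HMMs stored as dictionaries with hypothetical keys `"pi"`, `"A"` and `"B"`, 0-based class labels, a path score that adds the log prior of the initial state to the log transition and observation terms, and brute-force enumeration of every path, so it is only suitable for tiny models. It computes the class probabilities by summing the Gibbs weights over paths and then the expected misclassification rate.

```python
import itertools
import math


def path_score(x, s, hmm):
    """Log path score; s[0] is the initial state, s[t] emits x[t - 1]."""
    score = math.log(hmm["pi"][s[0]])
    for t in range(1, len(s)):
        score += math.log(hmm["A"][s[t - 1]][s[t]])
        score += math.log(hmm["B"][s[t]][x[t - 1]])
    return score


def class_posteriors(x, hmms, gamma):
    """P[j | x]: Gibbs probabilities P[s, j | x] summed over all paths s."""
    scored = []                          # (class j, gamma * path score)
    for j, hmm in enumerate(hmms):
        n_states = len(hmm["pi"])
        for s in itertools.product(range(n_states), repeat=len(x) + 1):
            scored.append((j, gamma * path_score(x, s, hmm)))
    top = max(v for _, v in scored)      # subtract the max for stability
    totals = [0.0] * len(hmms)
    for j, v in scored:
        totals[j] += math.exp(v - top)
    z = sum(totals)                      # normalizer over all HMMs and paths
    return [t / z for t in totals]


def expected_error(data, hmms, gamma):
    """P_e = 1 - (1/N) * sum over i of P[c_i | x_i]."""
    p_correct = sum(class_posteriors(x, hmms, gamma)[c] for x, c in data)
    return 1.0 - p_correct / len(data)
```

Unlike the hard error count, `expected_error` varies smoothly with the model parameters and with `gamma`, which is what allows gradient-based minimization at each temperature.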

  • So our design problem for the random classifier is to find the optimal values of the model parameters $\{\theta_j\}$ and $\gamma$, which determine $P[s, c_i \mid x_i]$, so as to minimize the probability of error.
  • In order to keep from getting stuck in shallow local minima, we need to gradually reduce the randomness of the classifier through a constraint on the entropy, H: