

























Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Material Type: Paper; Class: GRECO-ROMAN ARCHTCT; Subject: Classics; University: University of California - Los Angeles; Term: Spring 2002;
Typology: Papers
1 / 33
This page cannot be seen from the preview
Don't miss anything!


























Department of Statistics^ Alan Yuille University of California at Los AngelesLos Angeles, CA 90095 [email protected]
Department of Computer Science^ Anand Rangarajan University of Florida
In Neural Computation. Vol. 15. No. 4. pp 915-936. April. 2003.
∗ (^) Prof. Anand Rangarajan. Dept. of CISE, Univ. of Florida Room 301, CSE
Building Gainesville, FL 32611-6120 Phone: (352) 392 1507 Fax: (352) 392 1220 e-mail: [email protected] Submitted to Neural Computation on 17 April 2002. Revised 21 July 2002. Final Revision 27 August 2002. Abstract The Concave-Convex procedure (CCCP) is a way to construct discrete time it- erative dynamical systems which are guaranteed to monotonically decrease global optimization/energy functions. This procedure can be applied to almost any op- timization problem and many existing algorithms can be interpreted in terms of
ster, Laird and Rubin 1977), minimization algorithms based on Legendre trans- forms (Rangarajan, Yuille and Mjolsness 1999). CCCP can be viewed as a spe- cial case of variational bounding (Rustagi 1976, Jordan, Ghahramani, Jaakkola and Saul 1999) and related techniques including lower bound maximization/upper bound minimization (Luttrell 1994), surrogate functions and majorization (Lange, Hunter and Yang 2000). CCCP gives a novel geometric perspective on these al- gorithms and yields new convergence proofs. Finally we reformulate other classic algorithms in terms of CCCP by changing variables. These include the Generalized Iterative Scaling (GIS) algorithm (Dar- roch and Ratcliff 1972) and Sinkhorn’s algorithm for obtaining doubly stochastic matrices (Sinkhorn 1964). Sinkhorn’s algorithm can be used to solve the linear assignment problem (Kosowsky and Yuille 1994) and CCCP variants of Sinkhorn can be used to solve additional constraint problems. We introduce CCCP in section (2) and prove that it converges. Section (3) illustrates CCCP with examples from neural networks, mean field theory, self- annealing, and EM. In section (4.2) we prove the relationships between CCCP and the EM algorithm, Legendre transforms, and variational bounding. Section (5) shows that other algorithms such as GIS and Sinkhorn can be expressed in CCCP by a change of variables.
This section introduces the main results of CCCP and summarizes them in three Theorems: (i) Theorem 1 states the general conditions under which CCCP can be applied, (ii) Theorem 2 defines CCCP and proves its convergence, and (iii) Theo-
rem 3 describes an inner loop that may be necessary for some CCCP algorithms. Theorem 1 shows that any function, subject to weak conditions, can be ex- pressed as the sum of a convex and concave part (this decomposition is not unique). This will imply that CCCP can be applied to almost any optimization problem. Theorem 1. Let E(~x) be an energy function with bounded Hessian ∂^2 E(~x)/∂~x∂~x. Then we can always decompose it into the sum of a convex function and a concave function. Proof. Select any convex function F (~x) with positive definite Hessian with eigenvalues bounded below by > 0. Then there exists a positive constant λ such that the Hessian of E(~x) + λF (~x) is positive definite and hence E(~x) + λF (~x) is convex. Hence we can express E(~x) as the sum of a convex part, E(~x) + λF (~x), and a concave part −λF (~x). Theorem 2 defines the CCCP procedure and proves that it converges to a minimum or a saddle point of the energy function. (After completing this work we found that a version of Theorem 2 appeared in an unpublished technical report by D. Geman (1984)). Theorem 2. Consider an energy function E(~x) (bounded below) of form E(~x) = Evex(~x) + Ecave(~x) where Evex(~x), Ecave(~x) are convex and concave func- tions of ~x respectively. Then the discrete iterative CCCP algorithm ~xt^7 → ~xt+ given by: ∇^ ~Evex(~xt+1) = −∇~Ecave(~xt), (1)
is guaranteed to monotonically decrease the energy E(~x) as a function of time and hence to converge to a minimum or saddle point of E(~x) (or even a local maxima
applying Rolle’s theorem. We can get a graphical illustration of this algorithm by the reformulation shown in figure (2). Think of decomposing the energy function E(~x) into E 1 (~x) − E 2 (~x) where both E 1 (~x) and E 2 (~x) are convex. (This is equivalent to decomposing E(~x) into a a convex term E 1 (~x) plus a concave term −E 2 (~x)). The algorithm proceeds by matching points on the two terms which have the same tangents. For an input ~x 0 we calculate the gradient ∇~E 2 (~x 0 ) and find the point ~x 1 such that ∇~E 1 (~x 1 ) = ∇^ ~E 2 (~x 0 ). We next determine the point ~x 2 such that ∇~E 1 (~x 2 ) = ∇~E 2 (~x 1 ), and repeat.
(^100 5 )
20
30
40
50
60
70
(^00 5 )
10
20
30
40
50
60
70
80
X0 X2 X
Figure 2: A CCCP algorithm illustrated for Convex minus Convex. We want to minimize the function in the Left Panel. We decompose it (Right Panel) into a convex part (top curve) minus a convex term (bottom curve). The algorithm iterates by matching points on the two curves which have the same tangent vectors, see text for more details. The algorithm rapidly converges to the solution at x = 5.0.
The second statement of Theorem 2 can be used to analyze the convergence rates of the algorithm by placing lower bounds on the (positive semi-definite) matrix {∇~∇~Evex(~x∗) − ∇~∇~Ecave(~x∗∗)}. Moreover, if we can bound this by a
matrix B then we obtain E(~xt+1) − E(~xt) ≤ −(1/2){∇~E− vex^1 (−∇~Ecave(~xt)) − ~xt}T^ B{∇~E vex−^1 (−∇~Ecave(~xt))−~xt} ≤ 0, where ∇~Evex(~x)−^1 is the inverse of ∇~Evex(~x). We can therefore think of (1/2){ ∇~E− vex^1 (−∇~Ecave(~xt))−~xt}T^ B{∇~E vex−^1 (−∇~Ecave(~xt))− ~xt} as an auxiliary function (Della Pietra, Della Pietra, and Lafferty 1997). We can extend Theorem 2 to allow for linear constraints on the variables ~x, for example ∑ i cμi xi = αμ^ where the {cμi }, {αμ} are constants. This follows directly because properties such as convexity and concavity are preserved when linear constraints are imposed. We can change to new coordinates defined on the hyperplane defined by the linear constraints. Then we apply Theorem 1 in this coordinate system. Observe that Theorem 2 defines the update as an implicit function of ~xt+1. In many cases, as we will show in section (3), it is possible to solve for ~xt+ analytically. In other cases we may need an algorithm, or inner loop, to determine ~xt+1^ from ∇~Evex(~xt+1). In these cases we will need the following theorem where we re-express CCCP in terms of minimizing a time sequence of convex update energy functions Et+1(~xt+1) to obtain the updates ~xt+1^ (i.e. at the tth^ iteration of CCCP we need to minimize the energy Et+1(~xt+1)). Theorem 3. Let E(~x) = Evex(~x) + Ecave(~x) where ~x is required to satisfy the linear constraints ∑ i cμi xi = αμ, where the {cμi }, {αμ} are constants. Then the update rule for ~xt+1^ can be formulated as setting ~xt+1^ = arg min~x Et+1(~x) for a time sequence of convex update energy functions Et+1(~x) defined by:
Et+1(~x) = Evex(~x) + ∑ i
xi^ ∂E ∂xcavei (~xt) + ∑ μ
λμ{∑ i
ciμxi − αμ}, (3)
where the lagrange parameters {λμ} impose linear constraints. Proof. Direct calculation. The convexity of Et+1(~x) implies that there is a unique minimum ~xt+1^ =
imate) estimates of the mean states of the {Via} with respect to the distribution P [V ] = e−E[V^ ]/T^ /Z, where T is a temperature parameter and Z is a normaliza- tion constant. As described in (Yuille and Kosowsky 1994), to ensure that the minima of E[V ] and Eef f [S; T ] all coincide (as T 7 → 0 ) it is sufficient that Cijab be negative definite. Moreover, this can be attained by adding a term −K ∑ ia V (^) ia^2 to E[V ] (for sufficiently large K) without altering the structure of the minima of E[V ]. Hence, without loss of generality we can consider (1/2) ∑ i,j,a,b CijabViaVjb to be a concave function. We impose the linear constraints by adding a Lagrange multiplier term ∑ a pa{∑ i Via− 1 } to the energy where the {pa} are the Lagrange multipliers. The effective energy is given by:
Eef f [S] = (1/2) ∑ i,j,a,b
CijabSiaSjb +∑ ia
θiaSia +T ∑ ia
Sia log Sia +∑ a
pa{∑ i
Sia − 1 }. (4) We decompose Eef f [S] into a convex part Evex = T ∑ ia Sia log Sia+∑ a pa{∑ i Sia− 1 } and a concave part Ecave[S] = (1/2) ∑ i,j,a,b CijabSiaSjb + ∑ ia θiaSia. Taking derivatives yields: (^) ∂S∂ia Evex[S] = T log Sia + pa and (^) ∂S∂ia Ecave[S] = ∑ j,b CijabSjb + θia. Applying CCCP by setting ∂E ∂Svexia (St+1) = − ∂E ∂Scaveia (St) gives T {1 + log S iat+1 } + pa = − ∑ j,b CijabSjbt − θia. We solve for the Lagrange multipliers {pa} by imposing the constraints ∑ i S iat+1 = 1, ∀ a. This gives a discrete update rule:
S iat+1 = e
(− 1 /T ){ 2 ∑ j,b Cijab Sjbt+θia} ∑ c e(−^1 /T^ ){^2
∑ j,b Cijcb Stjb+θic}. (5)
The next example concerns mean field methods to model combinatorial opti- mization problems such as the quadratic assignment problem (Rangarajan, Gold and Mjolsness 1996, Rangarajan, Yuille and Mjolsness 1999) or the Travelling
Salesman Problem. It uses the same quadratic energy function as the last exam- ple but adds extra linear constraints. These additional constraints prevent us from expressing the update rule analytically and require an inner loop to implement Theorem 3. Example 2. Mean Field Algorithms to minimize discrete energy functions of form E[V ] = ∑ i,j,a,b CijabViaVjb +∑ ia θiaVia with linear constraints ∑ i Via = 1, ∀a and ∑ a Via = 1, ∀i. Discussion. This differs from the previous example because we need to add an additional constraint term ∑ i qi(∑ a Sia − 1) to the effective energy Eef f [S] in equation (4) where {qi} are Lagrange multipliers. This constraint term is also added to the convex part of the energy Evex[S] and we apply CCCP. Unlike the previous example, it is no longer possible to express St+1^ as an analytic function of St. Instead we resort to Theorem 3. Solving for St+1^ is equivalent to minimizing the convex cost function:
Et+1[St+1; p, q] = T ∑ ia
S iat+1 log S iat+1 + ∑ a
pa{∑ i
S iat+1 − 1 }
qi{∑ a
S iat+1 − 1 } + ∑ ia
S iat+1^ ∂E Scaveia (Siat). (6) It can be shown that minimizing Et+1[St+1; p, q] can also be done by CCCP, see section (5.3). Therefore each step of CCCP for this example requires an inner loop which in turn can be solved by a CCCP algorithm. Our next example is self-annealing (Rangarajan 2000). This algorithm can be applied to the effective energies of the last two examples provided we remove the “entropy term” ∑ ia Sia log Sia. Hence self-annealing can be applied to the same combinatorial optimization and clustering problems. It can also be applied to the relaxation labelling problems studied in computer vision and indeed the classic
Removing the constraints ∑ i qi(∑ a Sia − 1) gives us an update rule (compare example 1): S iat+1 = S iate(−^1 /γ){∑^ jb^ Cijab^ Sjbt+θia} ∑ c Sicte(−^1 /γ){
∑ jb Cijbc Sbjt +θic} , (9)
which, by expanding the exponential by a Taylor series, gives the equations for relaxation labelling (Rosenfeld, Hummel and Zucker 1976) (see (Rangarajan 2000) for details). Our final example is the elastic net (Durbin and Willshaw 1987, Durbin, Szeliski and Yuille 1989) in the formulation presented in (Yuille 1990). This is an example of constrained mixture models (Jordan and Jacobs 1994) and uses an EM algorithm (Dempster, Laird and Rubin 1977). Example 4. The elastic net (Durbin and Willshaw 1987) attempts to solve the Travelling Salesman Problem (TSP) by finding the shortest tour through a set of cities at positions {~xi}. The net is represented by a set of nodes at positions {~ya} and the algorithm performs steepest descent on a cost function E[~y]. This corresponds to a probability distribution P (~y) = e−E[~y]/Z on the node positions which can be interpreted (Durbin, Szeliski and Yuille 1989) as a constrained mix- ture model (Jordan and Jacobs 1994). The elastic net can be reformulated (Yuille
(Yuille 1990):
Eef f [S, ~y] = ∑ ia Sia(~xi −~ya)^2 +γ ∑ a,b
~yaT Aab~yb +T ∑ i,a Sia log Sia +∑ i λi(∑ a Sia −1), (10) where the {Aab} are components of a positive definite matrix representing a spring energy and {λa} are Lagrange multipliers which impose the constraints ∑ a Sia = 1 , ∀ i. By setting E[~y] = Eef f [S∗(~y), ~y] where S∗(~y) = arg minS Eef f [S, ~y], we obtain the original elastic net cost function E[~y] = −T ∑ i log ∑ a e−|~xi−~ya|^2 /T^ + γ ∑ a,b ~yaT Aab~yb (Durbin and Willshaw 1987). P [~y] = e−E[~y]/Z can be interpreted (Durbin, Szeliski and Yuille 1989) as a constrained mixture model (Jordan and Jacobs 1994). The effective energy Eef f [S, ~y] can decreased by minimizing it with respect to {Sia} and {~ya} alternatively. This gives update rules:
St ia+1 = e ∑−|~xi−~yta|^2 /T ∑^ j^ e−|~xj^ −~yat|^2 /T^ ,^ (11) i St ia+1 (~y at+1 − ~xi) + ∑ b
Aab~yt b+1 = 0 ∀ a, (12)
where {~yt a+1 } can be computed from the {S iat+1 } by solving the linear equations. To interpret equations (11,12) as CCCP, we define a new energy function E[S] = Eef f [S, ~y∗(S)] where ~y∗(S) = arg min~y Eef f [S, ~y] (which can be obtained by solving the linear equation (12) for {~ya}). We decompose E[S] = Evex[S]+Ecave[S] where:
Evex[S] = T ∑ ia
Sia log Sia + ∑ i
λi(∑ a
Sia − 1), Ecave[S] = ∑ ia
Sia|~xi − ~y a∗(S)|^2 + γ ∑ ab
~y a∗(S) · ~y b∗ (S)Aab. (13) It is clear that Evex[S] is a convex function of S. It can be verified alge- braically that Ecave[S] is a concave function of S and that its first derivative is
Eem[~y, Pˆ ] = − ∑ V
P^ ˆ (V ) log P (~y, V )+∑ V
P^ ˆ (V ) log Pˆ (V )+λ{∑^ Pˆ (V )− 1 }, (15)
where λ is a lagrange multiplier. The EM algorithm proceeds by minimizing Eem[~y, Pˆ ] with respect to Pˆ (V ) and ~y alternatively:
Pˆ t+1(V ) = (^) ∑P (~yt, V^ ) Vˆ P^ (~yt,^ Vˆ^ ),^ ~y
t+1 (^) = arg min ~y −^
V
P^ ˆ t+1(V ) log P (~y, V ). (16) These update rules are guaranteed to lower Eem[~y, Pˆ and give convergence to a saddle point or a local minimum (Dempster, Laird and Rubin 1977, Hathaway 1986, Neal and Hinton 1998). For example, this formulation of the EM algorithm enables us to rederive the effective energy for the elastic net and show that the alternating algorithm is EM. We let the {~ya} correspond to the positions of the nodes of the net and the {Via} be binary variables indicating the correspondences between cities and nodes (related to the {Sia} in example 4). P (~y, V ) = e−E[~y,V^ ]/T^ /Z where E[~y, V ] = ∑ ia Via|~xi − ~ya| (^2) + γ ∑ ab ~ya · ~ybAab with constraint that ∑ a Via = 1, ∀ i. We
define Sia = Pˆ (Via = 1), ∀ i, a. Then Eem[~y, S] is equal to the effective energy Eef f [~y, S] in the elastic net example, see equation (10). The update rules for EM, see equation (16), are equivalent to the alternating algorithm to minimize the effective energy, see equations (11,12). We now show that all EM algorithms are CCCP. This requires two intermediate results which we state as lemmas. Lemma 1. Minimizing Eem[~y, Pˆ ] is equivalent to minimizing the function E[ Pˆ ] = Evex[ Pˆ ] + Ecave[ Pˆ ] where Evex[ Pˆ ] = ∑ V Pˆ (V ) log Pˆ (V ) + λ{∑^ Pˆ (V ) − 1 }
is a convex function and Ecave[ Pˆ ] = − ∑ V Pˆ (V ) log P (~y∗( Pˆ ), V ) is a concave function, where we define ~y∗( Pˆ ) = arg min~y − ∑ V Pˆ (V ) log P (~y, V ). Proof. Set E[ Pˆ ] = Eem[~y∗( Pˆ ), Pˆ ], where ~y∗( Pˆ ) = arg min~y − ∑ V Pˆ (V ) log P (~y, V ). It is straightforward to decompose E[ Pˆ ] as Evex[ Pˆ ] + Ecave[ Pˆ ] and verify that Evex[ Pˆ ] is a convex function. To determine that Ecave[ Pˆ ] is concave requires showing that its Hessian is negative semi-definite. This is performed in lemma
Lemma 2. Ecave[ Pˆ ] is a concave function and (^) ∂ Pˆ∂ (V ) Ecave = − log P (~y∗( Pˆ ), V ), where ~y∗( Pˆ ) = arg min~y − ∑ V Pˆ (V ) log P (~y, V ). Proof. We first derive consequences of the definition of ~y∗^ which will be required when computing the Hessian of Ecave. The definition implies: ∑ V
P^ ˆ (V ) (^) ∂~∂y μ^ log^ P^ (~y
∗, V ) = 0 ∀μ (17) ∂ ∂~yμ^ log^ P^ (~y
V
ν
∂~y ν∗ ∂ Pˆ ( V˜ )
∂~yμ∂~yν^ log^ P^ (~y ∗, V ) = 0, ∀ μ., (18)
where the first equation is an identity which is valid for all Pˆ and the second equation follows by differentiating the first equation with respect to Pˆ. More- over, since ~y∗^ is a minimum of − ∑ V Pˆ (V ) log P (~y, V ) we also know that the matrix ∑ V Pˆ (V ) (^) ∂y∂μ^2 ∂yν log P (~y∗, V ) is negative definite. (We use the convention that (^) ∂~∂yμ log P (~y∗, V ) denotes the derivative of the function log P (~y, V ) with respect to ~yμ evaluated at ~y = ~y∗). We now calculate the derivatives of Econv [ Pˆ ] with respect to Pˆ. We obtain: ∂ ∂ Pˆ ( V˜ )Ecave^ =^ −^ log^ P^ (~y
V
mu
∂~y μ∗ ∂ Pˆ ( V˜ )
∂ log P (~y∗, V ) ∂fμ^ =^ −^ log^ P^ (~y
where we have used the definition of ~y∗, see equation (17), to eliminate the second term on the right hand side. This proves the first statement of the theorem.
The Legendre transform can be used to reformulate optimization problems by introducing auxiliary variables (Mjolsness and Garrett 1990). The idea is that some of the formulations may be more effective and computationally cheaper than others. We will concentrate on Legendre minimization, see (Rangarajan, Gold and Mjolsness 1996, Rangarajan, Yuille and Mjolsness 1999), instead of Legendre min- max emphasized in (Mjolsness and Garrett 1990). In the later, see (Mjolsness and Garrett 1990), the introduction of auxiliary variables converts the problem to a min-max problem where the goal is to find a saddle point. By contrast, in Legendre minimization, see (Rangarajan, Gold and Mjolsness 1996), the problem remains a minimization one (and so it becomes easier to analyze convergence). In Theorem 5 we show that Legendre minimization algorithms are equivalent to CCCP provided we first decompose the energy into a convex plus a concave part. The CCCP viewpoint emphasizes the geometry of the approach and com- plements the algebraic manipulations given in (Rangarajan, Yuille and Mjolsness 1999). (Moreover, the results of this paper show the generality of CCCP while, by contrast, Legendre transform methods have been applied only on a case by case basis). Definition 1. Let F (~x) be a convex function. For each value ~y let F ∗(~y) = min~x{F (~x)+~y ·~x.}. Then F ∗(~y) is concave and is the Legendre transform of F (~x). Two properties can be derived from this definition (Strang 1986). Property 1. F (~x) = max~y {F ∗(~y) − ~y · ~x}. Property 2. F (.) and F ∗(.) are related by ∂F ∂~y^ ∗ (~y) = { ∂F ∂~x }−^1 (−~y), − ∂F ∂~x (~x) = {∂F ∂~y^ ∗ }−^1 (~x). (By { ∂F ∂~y^ ∗ }−^1 (~x) we mean the value ~y such that ∂F ∂~y^ ∗ (~y) = ~x.)
The Legendre minimization algorithms (Rangarajan, Gold and Mjolsness 1996, Rangarajan, Yuille and Mjolsness 1999) exploits Legendre transforms. The op- timization function E 1 (~x) is expressed as E 1 (~x) = f (~x) + g(~x) where g(~x) is required to be a convex function. This is equivalent to minimizing E 2 (~x, ~y) = f (~x) + ~x · ~y + ˆg(~y), where ˆg(.) is the inverse Legendre transform of g(.). Legendre minimization consists of minimizing E 2 (~x, ~y) with respect to ~x and ~y alternatively. Theorem 5. Let E 1 (~x) = f (~x)+g(~x) and E 2 (~x, ~y) = f (~x)+~x·~y +h(~y), where f (.), h(.) are convex functions and g(.) is concave. Then applying CCCP to E 1 (~x) is equivalent to minimizing E 2 (~x, ~y) with respect to ~x and ~y alternatively, where g(.) is the Legendre transform of h(.). This is equivalent to Legendre minimization. Proof. We can write E 1 (~x) = f (~x) + min~y {g∗(~y) + ~x · ~y} where g∗(.) is the Legendre transform of g(.) (identify g(.) with F ∗(.) and g∗(.) with F (.) in Defi- nition 1 and Property 1). Thus minimizing E 1 (~x) with respect to ~x is equivalent to minimizing Eˆ 1 (~x, ~y) = f (~x) + ~x · ~y + g∗(~y) with respect to ~x and ~y. (Alterna- tively, we can set g∗(~y) = h(~y) in the expression for E 2 (~x, ~y) and obtain a cost function Eˆ 2 (~x) = f (~x) + g(~x).) Alternatively minimization over ~x and ~y gives: (i) ∂f /∂~x = ~y to determine ~xt+1^ in terms of ~yt, and (ii) ∂g∗/∂~y = ~x to determine ~yt^ in terms of ~xt^ which, by Property 2 of the Legendre transform is equivalent to setting ~y = −∂g/∂~x. Combining these two stages gives CCCP:
∂f ∂~x (~x t+1) = − ∂g ∂~x (~x t).