Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

Amir Globerson    Tommi Jaakkola
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139
gamir,tommi@csail.mit.edu

Abstract

We present a novel message passing algorithm for approximating the MAP problem in graphical models. The algorithm is similar in structure to max-product but, unlike max-product, it always converges and can be proven to find the exact MAP solution in various settings. The algorithm is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights. We also describe a generalization of the method to cluster based potentials. The new method is tested on synthetic and real-world problems, and compares favorably with previous approaches.

Graphical models are an effective approach for modeling complex objects via local interactions. In such models, a distribution over a set of variables is assumed to factor according to cliques of a graph, with potentials assigned to each clique. Finding the assignment with highest probability in these models is key to using them in practice, and is often referred to as the MAP (maximum a-posteriori) assignment problem. In the general case the problem is NP hard, with complexity exponential in the tree-width of the underlying graph.

Linear programming (LP) relaxations have proven very useful in approximating the MAP problem, and often yield satisfactory empirical results. These approaches relax the constraint that the solution is integral, and generally yield non-integral solutions. However, when the LP solution is integral, it is guaranteed to be the exact MAP. For some classes of problems the LP relaxation is provably correct. These include the minimum cut problem and maximum weight matching in bipartite graphs [8]. Although LP relaxations can be solved using standard LP solvers, this may be computationally intensive for large problems [13]. The key problem with generic LP solvers is that they do not use the graph structure explicitly and thus may be sub-optimal in terms of computational efficiency.

The max-product method [7] is a message passing algorithm that is often used to approximate the MAP problem. In contrast to generic LP solvers, it makes direct use of the graph structure in constructing and passing messages, and is also very simple to implement. The relation between max-product and the LP relaxation has remained largely elusive, although there are some notable exceptions: for tree-structured graphs, max-product and LP both yield the exact MAP. A recent result [1] showed that for maximum weight matching on bipartite graphs, max-product and LP also yield the exact MAP. Finally, tree-reweighted max-product (TRMP) algorithms [5, 10] were shown in [6] to converge to the LP solution for binary variables.

In this work, we propose the Max Product Linear Programming algorithm (MPLP), a very simple variation on max-product that is guaranteed to converge and has several advantageous properties. MPLP is derived from the dual of the LP relaxation, and is equivalent to block coordinate descent in the dual. Although this results in monotone improvement of the dual objective, global convergence is not always guaranteed, since coordinate descent may get stuck in suboptimal points. This can be remedied using various approaches, but in practice we have found MPLP to converge to the LP solution in a majority of the cases we studied. To derive MPLP we use a special form of the dual LP, which involves the introduction of redundant primal variables and constraints. We show how the dual variables corresponding to these constraints turn out to be the messages in the algorithm. We evaluate the method on Potts models and protein design problems, and show that it compares favorably with max-product (which often does not converge for these problems) and TRMP.

1 The Max-Product and MPLP Algorithms

The max-product algorithm [7] is one of the most often used methods for solving MAP problems. Although it is neither guaranteed to converge to the correct solution, nor in fact to converge at all, it provides satisfactory results in some cases. Here we present two algorithms: EMPLP (edge based MPLP) and NMPLP (node based MPLP), which are structurally very similar to max-product but have several key advantages:

After each iteration, the messages yield an upper bound on the MAP value, and the sequence of bounds is monotone decreasing and convergent. The messages also have a limit point that is a fixed point of the update rule.

No additional parameters (e.g., tree weights as in [6]) are required.

If the fixed point beliefs have a unique maximizer then they correspond to the exact MAP.

For binary variables, MPLP can be used to obtain the solution to an LP relaxation of the MAP problem. Thus, when this LP relaxation is exact and variables are binary, MPLP will find the MAP solution. Moreover, for any variable whose beliefs are not tied, the MAP assignment can be found (i.e., the solution is partially decodable).

Pseudo-code for the algorithms (and for max-product) is given in Fig. 1. As we show in the next sections, MPLP is essentially a block coordinate descent algorithm in the dual of a MAP LP relaxation. Every update of the MPLP messages corresponds to exact minimization over a set of dual variables. For EMPLP, minimization is over the set of variables corresponding to an edge, and for NMPLP it is over the set of variables corresponding to all the edges a given node appears in (i.e., a star). The properties of MPLP result from its relation to the LP dual. In what follows we describe the derivation of the MPLP algorithms and prove their properties.

2 The MAP Problem and its LP Relaxation

We consider functions over $n$ variables $x = \{x_1, \ldots, x_n\}$ defined as follows. Given a graph $G = (V, E)$ with $n$ vertices, and potentials $\theta_{ij}(x_i, x_j)$ for all edges $ij \in E$, define the function¹

$$f(x; \theta) = \sum_{ij \in E} \theta_{ij}(x_i, x_j) \,. \quad (1)$$

The MAP problem is defined as finding an assignment $x^M$ that maximizes the function $f(x; \theta)$. Below we describe the standard LP relaxation for this problem. Denote by $\{\mu_{ij}(x_i, x_j)\}_{ij \in E}$ distributions over variables corresponding to edges $ij \in E$, and by $\{\mu_i(x_i)\}_{i \in V}$ distributions corresponding to nodes $i \in V$. We will use $\mu$ to denote a given set of distributions over all edges and nodes. The set $M_L(G)$ is defined as the set of $\mu$ where pairwise and singleton distributions are consistent:

$$M_L(G) = \left\{ \mu \ge 0 \;\middle|\; \begin{array}{ll} \sum_{\hat{x}_i} \mu_{ij}(\hat{x}_i, x_j) = \mu_j(x_j), \;\; \sum_{\hat{x}_j} \mu_{ij}(x_i, \hat{x}_j) = \mu_i(x_i) & \forall ij \in E,\, x_i,\, x_j \\ \sum_{x_i} \mu_i(x_i) = 1 & \forall i \in V \end{array} \right\}$$

Now consider the following linear program:

$$\text{MAPLPR}: \quad \mu^L = \arg\max_{\mu \in M_L(G)} \mu \cdot \theta \,, \quad (2)$$

where $\mu \cdot \theta$ is shorthand for $\mu \cdot \theta = \sum_{ij \in E} \sum_{x_i, x_j} \theta_{ij}(x_i, x_j)\, \mu_{ij}(x_i, x_j)$. It is easy to show (see e.g. [10]) that the optimum of MAPLPR yields an upper bound on the MAP value, i.e. $\mu^L \cdot \theta \ge f(x^M)$. Furthermore, when the optimal $\mu_i(x_i)$ have only integral values, the assignment that maximizes $\mu_i(x_i)$ yields the correct MAP assignment. In what follows we show how the MPLP algorithms can be derived from the dual of MAPLPR.

¹We note that some authors also add a term $\sum_{i \in V} \theta_i(x_i)$ to $f(x; \theta)$. However, these terms can be included in the pairwise functions $\theta_{ij}(x_i, x_j)$, so we ignore them for simplicity.
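To make Eq. 1 concrete, here is a minimal Python sketch (ours, not from the paper; the toy graph, potentials, and names such as `edges` and `theta` are illustrative assumptions) that evaluates $f(x; \theta)$ and recovers the MAP by enumeration, which is feasible only for tiny models:

```python
# Sketch only: evaluate f(x; theta) of Eq. 1 and find the MAP by
# enumeration on a toy 3-cycle. All names here are illustrative.
import itertools
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 0)]              # a small cycle graph
n_vars, n_states = 3, 2
theta = {e: rng.uniform(-1, 1, (n_states, n_states)) for e in edges}

def f(x):
    """f(x; theta) = sum over edges of theta_ij(x_i, x_j), as in Eq. 1."""
    return sum(theta[(i, j)][x[i], x[j]] for (i, j) in edges)

# Brute force over all k^n assignments: the exact MAP, usable as ground
# truth when checking the bounds produced by the LP relaxation.
x_map = max(itertools.product(range(n_states), repeat=n_vars), key=f)
print("MAP assignment:", x_map, "value:", f(x_map))
```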

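The relaxation itself is an ordinary LP, so on small instances MAPLPR (Eq. 2) can be handed to a generic solver. The sketch below (our own encoding, using scipy.optimize.linprog; the column layout and all names are assumptions) builds the constraints of $M_L(G)$ explicitly and checks the upper-bound property $\mu^L \cdot \theta \ge f(x^M)$ stated above:

```python
# Sketch only: MAPLPR (Eq. 2) as an explicit LP on a toy cycle,
# solved with a generic solver. Variable layout and names are ours.
import itertools
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
edges = [(0, 1), (1, 2), (2, 0)]
n_vars, k = 3, 2
theta = {e: rng.uniform(-1, 1, (k, k)) for e in edges}

# column bookkeeping: mu_ij(x_i, x_j) for each edge, then mu_i(x_i)
idx, p = {}, 0
for e in edges:
    for xi in range(k):
        for xj in range(k):
            idx[(e, xi, xj)] = p; p += 1
for i in range(n_vars):
    for xi in range(k):
        idx[(i, xi)] = p; p += 1

c = np.zeros(p)                      # linprog minimizes, so negate theta
for e in edges:
    for xi in range(k):
        for xj in range(k):
            c[idx[(e, xi, xj)]] = -theta[e][xi, xj]

A, b = [], []
for (i, j) in edges:                 # marginalization constraints of M_L(G)
    for xj in range(k):
        row = np.zeros(p)
        for xi in range(k):
            row[idx[((i, j), xi, xj)]] = 1.0
        row[idx[(j, xj)]] = -1.0
        A.append(row); b.append(0.0)
    for xi in range(k):
        row = np.zeros(p)
        for xj in range(k):
            row[idx[((i, j), xi, xj)]] = 1.0
        row[idx[(i, xi)]] = -1.0
        A.append(row); b.append(0.0)
for i in range(n_vars):              # normalization of node marginals
    row = np.zeros(p)
    for xi in range(k):
        row[idx[(i, xi)]] = 1.0
    A.append(row); b.append(1.0)

res = linprog(c, A_eq=np.array(A), b_eq=np.array(b), bounds=(0, None))
lp_bound = -res.fun
f = lambda x: sum(theta[e][x[e[0]], x[e[1]]] for e in edges)
map_val = max(f(x) for x in itertools.product(range(k), repeat=n_vars))
assert lp_bound >= map_val - 1e-7    # mu_L . theta >= f(x_M), as in the text
```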
3 The LP Relaxation Dual

Since MAPLPR is an LP, it has an equivalent convex dual. In App. A we derive a special dual of MAPLPR using a different representation of $M_L(G)$ with redundant variables. The advantage of this dual is that it allows the derivation of simple message passing algorithms. The dual is described in the following proposition.

Proposition 1 The following optimization problem is a convex dual of MAPLPR:

$$\text{DMAPLPR}: \quad \min_\beta \; \sum_i \max_{x_i} \sum_{k \in N(i)} \max_{x_k} \beta_{ki}(x_k, x_i)$$
$$\text{s.t.} \quad \beta_{ji}(x_j, x_i) + \beta_{ij}(x_i, x_j) = \theta_{ij}(x_i, x_j) \quad (3)$$

where the dual variables are $\beta_{ij}(x_i, x_j)$ for all $ij, ji \in E$ and values of $x_i$ and $x_j$.

The dual has an intuitive interpretation in terms of re-parameterizations. Consider the star-shaped graph $G_i$ consisting of node $i$ and all its neighbors $N(i)$. Assume the potential on edge $ki$ (for $k \in N(i)$) is $\beta_{ki}(x_k, x_i)$. The value of the MAP assignment for this model is $\max_{x_i} \sum_{k \in N(i)} \max_{x_k} \beta_{ki}(x_k, x_i)$. This is exactly the term for $i$ in the objective of DMAPLPR. Thus the dual corresponds to individually decoding star graphs around all nodes $i \in V$, where the potentials on the graph edges should sum to the original potential. It is easy to see that this will always result in an upper bound on the MAP value. The somewhat surprising result of the duality is that there exists a $\beta$ assignment such that star decoding yields the optimal value of MAPLPR.

4 Block Coordinate Descent in the Dual

To obtain a convergent algorithm we use a simple block coordinate descent strategy. At every iteration, fix all variables except a subset, and optimize over this subset. It turns out that this can be done in closed form for the cases we consider.

We begin by deriving the EMPLP algorithm. Consider fixing all the $\beta$ variables except those corresponding to some edge $ij \in E$ (i.e., $\beta_{ij}$ and $\beta_{ji}$), and minimizing DMAPLPR over the non-fixed variables. Only two terms in the DMAPLPR objective depend on $\beta_{ij}$ and $\beta_{ji}$. We can write those as

$$f(\beta_{ij}, \beta_{ji}) = \max_{x_i} \left[ \lambda_i^{-j}(x_i) + \max_{x_j} \beta_{ji}(x_j, x_i) \right] + \max_{x_j} \left[ \lambda_j^{-i}(x_j) + \max_{x_i} \beta_{ij}(x_i, x_j) \right] \quad (4)$$

where we defined $\lambda_i^{-j}(x_i) = \sum_{k \in N(i) \setminus j} \lambda_{ki}(x_i)$ and $\lambda_{ki}(x_i) = \max_{x_k} \beta_{ki}(x_k, x_i)$ as in App. A. Note that the function $f(\beta_{ij}, \beta_{ji})$ depends on the other $\beta$ values only through $\lambda_j^{-i}(x_j)$ and $\lambda_i^{-j}(x_i)$. This implies that the optimization can be done solely in terms of $\lambda_{ij}(x_j)$ and there is no need to store the $\beta$ values explicitly. The optimal $\beta_{ij}, \beta_{ji}$ are obtained by minimizing $f(\beta_{ij}, \beta_{ji})$ subject to the re-parameterization constraint $\beta_{ji}(x_j, x_i) + \beta_{ij}(x_i, x_j) = \theta_{ij}(x_i, x_j)$. The following proposition characterizes the minimum of $f(\beta_{ij}, \beta_{ji})$. In fact, as mentioned above, we do not need to characterize the optimal $\beta_{ij}(x_i, x_j)$ itself, but only the new $\lambda$ values.

Proposition 2 Minimizing the function $f(\beta_{ij}, \beta_{ji})$ yields the following $\lambda_{ji}(x_i)$ (and the equivalent expression for $\lambda_{ij}(x_j)$):

$$\lambda_{ji}(x_i) = -\tfrac{1}{2} \lambda_i^{-j}(x_i) + \tfrac{1}{2} \max_{x_j} \left[ \lambda_j^{-i}(x_j) + \theta_{ij}(x_i, x_j) \right]$$

The proposition is proved in App. B. The $\lambda$ updates above result in the EMPLP algorithm, described in Fig. 1. Note that since the $\beta_{ij}$ optimization affects both $\lambda_{ji}(x_i)$ and $\lambda_{ij}(x_j)$, both these messages need to be updated simultaneously.

We proceed to derive the NMPLP algorithm. For a given node $i \in V$, we consider all its neighbors $j \in N(i)$, and wish to optimize over the variables $\beta_{ji}(x_j, x_i)$ for $ij, ji \in E$ (i.e., all the edges in a star centered on $i$), while the other variables are fixed. One way of doing so is to use the EMPLP algorithm for the edges in the star, and iterate it until convergence. We now show that the result of this optimization can be found in closed form.
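The star-decoding objective of Proposition 1 is easy to evaluate for any feasible $\beta$. The sketch below (ours; the uniform half-split of each $\theta_{ij}$ is just the simplest feasible point, not the optimum) computes the bound and checks that it dominates the MAP value:

```python
# Sketch only: the star-decoding dual objective of Eq. 3 evaluated at the
# simplest feasible point (each edge potential split in half). Any feasible
# beta gives an upper bound on the MAP value; names are ours.
import itertools
import numpy as np

rng = np.random.default_rng(1)
edges = [(0, 1), (1, 2), (2, 0)]
n_vars, k = 3, 2
theta = {e: rng.uniform(-1, 1, (k, k)) for e in edges}

# beta[(j, i)][x_j, x_i] is the piece of theta_ij assigned to node i's star;
# the two copies satisfy beta_ji + beta_ij = theta_ij by construction.
beta = {}
for (i, j) in edges:
    beta[(j, i)] = 0.5 * theta[(i, j)].T
    beta[(i, j)] = 0.5 * theta[(i, j)]

nbrs = {i: [] for i in range(n_vars)}
for (i, j) in edges:
    nbrs[i].append(j); nbrs[j].append(i)

def dual_objective(beta):
    # sum_i max_{x_i} sum_{k in N(i)} max_{x_k} beta_ki(x_k, x_i)
    total = 0.0
    for i in range(n_vars):
        lam = sum(beta[(k, i)].max(axis=0) for k in nbrs[i])
        total += lam.max()
    return total

f = lambda x: sum(theta[e][x[e[0]], x[e[1]]] for e in edges)
map_val = max(f(x) for x in itertools.product(range(k), repeat=n_vars))
assert dual_objective(beta) >= map_val - 1e-9   # the bound of Section 3
```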

Inputs: A graph $G = (V, E)$, potential functions $\theta_{ij}(x_i, x_j)$ for each edge $ij \in E$.

Initialization: Initialize messages to any value.

Algorithm: Iterate until a stopping criterion is satisfied:

Max-product: Iterate over messages and update ($c_{ij}$ shifts the max to zero):
$$m_{ji}(x_i) \leftarrow \max_{x_j} \left[ m_j^{-i}(x_j) + \theta_{ij}(x_i, x_j) \right] - c_{ij}$$

EMPLP: For each $ij \in E$, update $\lambda_{ji}(x_i)$ and $\lambda_{ij}(x_j)$ simultaneously (the update for $\lambda_{ij}(x_j)$ is the same with $i$ and $j$ exchanged):
$$\lambda_{ji}(x_i) \leftarrow -\tfrac{1}{2} \lambda_i^{-j}(x_i) + \tfrac{1}{2} \max_{x_j} \left[ \lambda_j^{-i}(x_j) + \theta_{ij}(x_i, x_j) \right]$$

NMPLP: Iterate over nodes $i \in V$ and update all $\gamma_{ij}(x_j)$ where $j \in N(i)$:
$$\gamma_{ij}(x_j) \leftarrow \max_{x_i} \left[ \theta_{ij}(x_i, x_j) - \gamma_{ji}(x_i) + \frac{2}{|N(i)| + 1} \sum_{k \in N(i)} \gamma_{ki}(x_i) \right]$$

Calculate node beliefs: Set $b_i(x_i)$ to be the sum of incoming messages into node $i \in V$ (e.g., for NMPLP set $b_i(x_i) = \sum_{k \in N(i)} \gamma_{ki}(x_i)$).

Output: Return the assignment $x$ defined as $x_i = \arg\max_{\hat{x}_i} b_i(\hat{x}_i)$.

Figure 1: The max-product, EMPLP and NMPLP algorithms. Max-product, EMPLP and NMPLP use messages $m_{ji}$, $\lambda_{ji}$ and $\gamma_{ji}$ respectively. We use the notation $m_j^{-i}(x_j) = \sum_{k \in N(j) \setminus i} m_{kj}(x_j)$.

The assumption about $\beta$ being fixed outside the star implies that $\lambda_j^{-i}(x_j)$ is fixed. Define $\gamma_{ji}(x_i) = \max_{x_j} \left[ \theta_{ij}(x_i, x_j) + \lambda_j^{-i}(x_j) \right]$. Simple algebra yields the following relation between $\lambda_i^{-j}(x_i)$ and $\gamma_{ki}(x_i)$ for $k \in N(i)$:

$$\lambda_i^{-j}(x_i) = -\gamma_{ji}(x_i) + \frac{2}{|N(i)| + 1} \sum_{k \in N(i)} \gamma_{ki}(x_i) \quad (5)$$

Plugging this into the definition of $\gamma_{ji}(x_i)$ we obtain the NMPLP update in Fig. 1. The messages for both algorithms can be initialized to any value, since it can be shown that after one iteration they will correspond to valid $\beta$ values.

5 Convergence Properties

The MPLP algorithm decreases the dual objective (i.e., an upper bound on the MAP value) at every iteration, and thus its dual objective values form a convergent sequence. Using arguments similar to [5] it can be shown that MPLP has a limit point that is a fixed point of its updates. This in itself does not guarantee convergence to the dual optimum, since coordinate descent algorithms may get stuck at a point that is not a global optimum. There are ways of overcoming this difficulty, for example by smoothing the objective [4] or using techniques as in [2] (see p. 636). We leave such extensions for further work. In this section we provide several results about the properties of the MPLP fixed points and their relation to the corresponding LP.

First, we claim that if all beliefs have unique maxima then the exact MAP assignment is obtained.

Proposition 3 If the fixed point of MPLP has $b_i(x_i)$ such that for all $i$ the function $b_i(x_i)$ has a unique maximizer $x_i^*$, then $x^*$ is the solution to the MAP problem and the LP relaxation is exact.

Since the dual objective is always greater than or equal to the MAP value, it suffices to show that there exists a dual feasible point whose objective value is $f(x^*)$. Denote by $\beta^*, \lambda^*$ the values of the corresponding dual parameters at the fixed point of MPLP. Then the dual objective satisfies

$$\sum_i \max_{x_i} \sum_{k \in N(i)} \lambda^*_{ki}(x_i) = \sum_i \sum_{k \in N(i)} \max_{x_k} \beta^*_{ki}(x_k, x^*_i) = \sum_i \sum_{k \in N(i)} \beta^*_{ki}(x^*_k, x^*_i) = f(x^*)$$

The first equality holds because each belief $b_i(x_i) = \sum_{k \in N(i)} \lambda^*_{ki}(x_i)$ attains its unique maximum at $x^*_i$. To see why the second equality holds, note that $b_i(x^*_i) = \max_{x_i, x_j} \left[ \lambda_i^{-j}(x_i) + \beta^*_{ji}(x_j, x_i) \right]$ and $b_j(x^*_j) = \max_{x_i, x_j} \left[ \lambda_j^{-i}(x_j) + \beta^*_{ij}(x_i, x_j) \right]$. By the equalization property in Eq. 9 the arguments of the two max operations are equal. From the unique maximum assumption it follows that $x^*_i, x^*_j$ are the unique maximizers of the above. It follows that $\beta^*_{ji}, \beta^*_{ij}$ are also maximized by $x^*_i, x^*_j$. The last equality uses the re-parameterization constraint: each edge is counted once from each endpoint, and the two copies sum to $\theta_{ij}(x^*_i, x^*_j)$.
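To connect Fig. 1 and the monotone bound of Section 5 to running code, here is a compact EMPLP sketch (ours; the toy model, the `lam` dictionary of per-directed-edge message vectors, and the tie tolerance are all illustrative assumptions, not the paper's implementation):

```python
# Sketch of the EMPLP updates of Fig. 1 on a toy 4-cycle; names are ours.
import numpy as np

rng = np.random.default_rng(2)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n_vars, n_states = 4, 3
theta = {e: rng.uniform(-1, 1, (n_states, n_states)) for e in edges}

nbrs = {i: [] for i in range(n_vars)}
for (i, j) in edges:
    nbrs[i].append(j); nbrs[j].append(i)

# lam[(j, i)] is the message lambda_ji(x_i); initialization is arbitrary.
lam = {}
for (i, j) in edges:
    lam[(j, i)] = np.zeros(n_states)
    lam[(i, j)] = np.zeros(n_states)

def lam_minus(i, j):
    """lambda_i^{-j}(x_i): messages into i from all neighbors except j."""
    return sum(lam[(k, i)] for k in nbrs[i] if k != j)

def dual_bound():
    """The DMAPLPR objective: sum_i max_{x_i} sum_k lambda_ki(x_i).
    Valid as an upper bound once the messages correspond to valid beta,
    i.e. after the first iteration; it then decreases monotonically."""
    return sum(sum(lam[(k, i)] for k in nbrs[i]).max() for i in range(n_vars))

for t in range(100):
    for (i, j) in edges:          # update both directions simultaneously
        li, lj = lam_minus(i, j), lam_minus(j, i)
        tij = theta[(i, j)]       # indexed [x_i, x_j]
        new_ji = -0.5 * li + 0.5 * (tij + lj[None, :]).max(axis=1)
        new_ij = -0.5 * lj + 0.5 * (tij + li[:, None]).max(axis=0)
        lam[(j, i)], lam[(i, j)] = new_ji, new_ij

# Decode beliefs b_i = sum of incoming messages; per Proposition 3, a
# tie-free maximizer at a fixed point certifies the exact MAP.
beliefs = [sum(lam[(k, i)] for k in nbrs[i]) for i in range(n_vars)]
x_star = [int(b.argmax()) for b in beliefs]
tie_free = all(np.sort(b)[-1] - np.sort(b)[-2] > 1e-8 for b in beliefs)
print(x_star, tie_free, dual_bound())
```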

In the general case, the MPLP fixed point may not correspond to a primal optimum because of the local-optima problem with coordinate descent. However, when the variables are binary, fixed points do correspond to primal solutions, as the following proposition states.

Proposition 4 When the $x_i$ are binary, the MPLP fixed point can be used to obtain the primal optimum.

The claim can be shown by constructing a primal optimal solution $\mu^*$. For tied $b_i$, set $\mu^*_i(x_i)$ to 0.5, and for untied $b_i$, set $\mu^*_i(x^*_i)$ to 1. If $b_i, b_j$ are not tied we set $\mu^*_{ij}(x^*_i, x^*_j) = 1$. If $b_i$ is not tied but $b_j$ is, we set $\mu^*_{ij}(x^*_i, x_j) = 0.5$. If $b_i, b_j$ are tied, then $\beta_{ji}, \beta_{ij}$ can be shown to be maximized at either $x_i, x_j = (0,0),(1,1)$ or $x_i, x_j = (0,1),(1,0)$. We then set $\mu^*_{ij}$ to be 0.5 at one of these assignment pairs. The resulting $\mu^*$ is clearly primal feasible. Setting $\delta^*_i = b^*_i$ we obtain that the dual variables $(\delta^*, \lambda^*, \beta^*)$ and the primal $\mu^*$ satisfy complementary slackness for the LP in Eq. 7, and therefore $\mu^*$ is primal optimal. The binary optimality result implies partial decodability, since [6] shows that the LP is partially decodable for binary variables.

6 Beyond Pairwise Potentials: Generalized MPLP

In the previous sections we considered maximizing functions which factor according to the edges of the graph. A more general setting considers clusters $c_1, \ldots, c_k \subseteq \{1, \ldots, n\}$ (the set of clusters is denoted by $C$), and a function $f(x; \theta) = \sum_c \theta_c(x_c)$ defined via potentials over clusters $\theta_c(x_c)$. The MAP problem in this case also has an LP relaxation (see e.g. [11]). To define the LP we introduce the following definitions: $S = \{c \cap \hat{c} : c, \hat{c} \in C,\; c \cap \hat{c} \neq \emptyset\}$ is the set of intersections between clusters, and $S(c) = \{s \in S : s \subseteq c\}$ is the set of overlap sets for cluster $c$. We now consider marginals over the variables in $c \in C$ and $s \in S$, and require that cluster marginals agree on their overlaps. Denote this set by $M_L(C)$. The LP relaxation is then to maximize $\mu \cdot \theta$ subject to $\mu \in M_L(C)$.

As in Sec. 4, we can derive message passing updates that result in monotone decrease of the dual LP of the above relaxation. The derivation is similar and we omit the details. The key observation is that one needs to introduce $|S(c)|$ copies of each marginal $\mu_c(x_c)$ (instead of the two copies in the pairwise case). Next, as in the EMPLP derivation, we assume all $\beta$ are fixed except those corresponding to some cluster $c$. The resulting messages are $\lambda_{c \to s}(x_s)$ from a cluster $c$ to all of its intersection sets $s \in S(c)$. The update on these messages turns out to be (see the sketch below):

$$\lambda_{c \to s}(x_s) = -\left(1 - \frac{1}{|S(c)|}\right) \lambda_s^{-c}(x_s) + \frac{1}{|S(c)|} \max_{x_{c \setminus s}} \left[ \sum_{\hat{s} \in S(c) \setminus s} \lambda_{\hat{s}}^{-c}(x_{\hat{s}}) + \theta_c(x_c) \right]$$

where for a given $c \in C$ all $\lambda_{c \to s}$ should be updated simultaneously for $s \in S(c)$, and $\lambda_s^{-c}(x_s)$ is defined as the sum of messages into $s$ that are not from $c$. We refer to this algorithm as Generalized EMPLP (GEMPLP). It is possible to derive an algorithm similar to NMPLP that updates several clusters simultaneously, but its structure is more involved and we do not address it here.
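The sketch referenced above: one GEMPLP cluster update in Python, restricted for brevity (our own simplifying assumption) to the special case where every intersection set is a single variable. Here `lam_in[s]` plays the role of $\lambda_s^{-c}$, and all names are illustrative:

```python
# Sketch only: one GEMPLP cluster update, assuming singleton intersection
# sets. theta_c is an ndarray over the cluster's variables; lam_in[s] is
# the sum of messages into variable s from clusters other than c.
import numpy as np

def gemplp_update(theta_c, cluster_vars, lam_in):
    """Return new messages {s: lambda_{c->s}} for every s in the cluster."""
    k = len(cluster_vars)                  # |S(c)|: one overlap per variable
    new = {}
    for a, s in enumerate(cluster_vars):
        # add lambda_{s_hat}^{-c} along the axis of every other variable
        t = theta_c.copy()
        for b, s_hat in enumerate(cluster_vars):
            if b != a:
                shape = [1] * k
                shape[b] = -1
                t = t + lam_in[s_hat].reshape(shape)
        axes = tuple(b for b in range(k) if b != a)
        msg = t.max(axis=axes)             # max over x_{c \ s}
        new[s] = -(1 - 1.0 / k) * lam_in[s] + (1.0 / k) * msg
    return new

# usage on a toy 3-variable cluster with 3 states per variable
rng = np.random.default_rng(5)
theta_c = rng.uniform(-1, 1, (3, 3, 3))
msgs = gemplp_update(theta_c, (7, 8, 9), {v: np.zeros(3) for v in (7, 8, 9)})
```

As in the pairwise case, all $|S(c)|$ outgoing messages of a cluster are computed from the same fixed incoming messages and written back together.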

7 Related Work

Weiss et al. [11] recently studied the fixed points of a class of max-product-like algorithms. Their analysis focused on properties of fixed points rather than convergence guarantees. Specifically, they showed that if the counting numbers used in a generalized max-product algorithm satisfy certain properties, then its fixed points will be the exact MAP if the beliefs have unique maxima, and for binary variables the solution can be partially decodable. Both these properties are obtained for the MPLP fixed points, and in fact we can show that MPLP satisfies the conditions in [11], so that we obtain these properties as corollaries of [11]. We stress, however, that [11] does not address convergence of algorithms, but rather properties of their fixed points, if they converge.

MPLP is similar in some aspects to Kolmogorov's TRW-S algorithm [5]. TRW-S is also a monotone coordinate descent method in a dual of the LP relaxation, and its fixed points have guarantees similar to those of MPLP [6]. Furthermore, convergence to a local optimum may occur, as it does for MPLP. One advantage of MPLP lies in the simplicity of its updates and the fact that it is parameter free. The other is its simple generalization to potentials over clusters of nodes (Sec. 6).

Recently, several new dual LP algorithms have been introduced which are more closely related to our formalism. Werner [12] presented a class of algorithms which also improve the dual LP at every iteration. The simplest of those is the max-sum-diffusion algorithm, which is similar to our EMPLP algorithm, although the updates are different from ours. Independently, Johnson et al. [4] presented a class of algorithms that improve duals of the MAP-LP using coordinate descent. They decompose the model into tractable parts by replicating variables and enforce replication constraints within the Lagrangian dual. Our basic formulation in Eq. 3 could be derived from their perspective. However, the updates in the algorithm and the analysis differ. Johnson et al. also presented a method for overcoming the local optimum problem, by smoothing the objective so that it is strictly convex. Such an approach could also be used within our algorithms. Vontobel and Koetter [9] recently introduced a coordinate descent algorithm for decoding LDPC codes. Their method is specifically tailored for this case, and uses updates that are similar to our edge based updates.

Finally, the concept of dual coordinate descent may be used in approximating marginals as well. In [3] we use such an approach to optimize a variational bound on the partition function. The derivation uses some of the ideas used in the MPLP dual but, importantly, does not find the minimum for each coordinate. Instead, a gradient-like step is taken at every iteration to decrease the dual objective.

8 Experiments

We compared NMPLP to three other message passing algorithms:² tree-reweighted max-product (TRMP) [10],³ standard max-product (MP), and GEMPLP. For MP and TRMP we used the standard approach of damping messages using a factor of α = 0.5. We ran all algorithms for a maximum of 2000 iterations, and used the hit-time measure to compare their speed of convergence. This measure is defined as follows: at every iteration the beliefs can be used to obtain an assignment $x$ with value $f(x)$. We define the hit-time as the first iteration at which the maximum value of $f(x)$ is achieved.⁴

We first experimented with a 10×10 grid graph, with 5 values per variable. The function $f(x)$ was a Potts model: $f(x) = \sum_{ij \in E} \theta_{ij}\, \mathbb{1}(x_i = x_j) + \sum_{i \in V} \theta_i(x_i)$.⁵ The values for $\theta_{ij}$ and $\theta_i(x_i)$ were randomly drawn from $[-c_I, c_I]$ and $[-c_F, c_F]$ respectively, and we used values of $c_I$ and $c_F$ in the range $[0.1, 2.35]$ (with intervals of 0.25), resulting in 100 different models (a construction sketch follows this section). The clusters for GEMPLP were the faces of the graph [14]. To see if NMPLP converges to the LP solution we also used an LP solver to solve the LP relaxation. We found that the normalized difference between the NMPLP and LP objectives was at most $10^{-3}$ (median $10^{-7}$), suggesting that NMPLP typically converged to the LP solution.

Fig. 2 (top row) shows the results for the three algorithms. It can be seen that while all non-cluster based algorithms obtain similar $f(x)$ values, NMPLP has better hit-time (in the median) than TRMP and MP, and MP does not converge in many cases (see caption). GEMPLP converges more slowly than NMPLP, but obtains much better $f(x)$ values. In fact, in 99% of the cases the normalized difference between the GEMPLP objective and the $f(x)$ value was less than $10^{-5}$, suggesting that the exact MAP solution was found.

We next applied the algorithms to the real-world problems of protein design. In [13], Yanover et al. show how these problems can be formalized in terms of finding a MAP in an appropriately constructed graphical model.⁶ We used all algorithms except GEMPLP (since there is no natural choice for clusters in this case) to approximate the MAP solution on the 97 models used in [13]. In these models the number of states per variable is 2-158, and there are up to 180 variables per model. Fig. 2 (bottom) shows results for all the design problems. In this case only 11% of the MP runs converged, and NMPLP was better than TRMP in terms of hit-time and comparable in $f(x)$ value. The performance of MP was good on the runs where it converged.

²As expected, NMPLP was faster than EMPLP, so only NMPLP results are given.
³The edge weights for TRMP corresponded to a uniform distribution over all spanning trees.
⁴This is clearly a post-hoc measure since it can only be obtained after the algorithm has exceeded its maximum number of iterations. However, it is a reasonable algorithm-independent measure of convergence.
⁵The potential $\theta_i(x_i)$ may be folded into the pairwise potential to yield a model as in Eq. 1.
⁶Data available from http://jmlr.csail.mit.edu/papers/volume7/yanover06a/Rosetta Design Dataset.tgz
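Referring back to the Potts setup above, here is a sketch (ours; the even split of each node potential across its incident edges is just one way to implement footnote 5, and the function name and defaults are illustrative) of generating such a grid model in the pairwise form of Eq. 1:

```python
# Sketch only: a random Potts grid in pairwise form, following the
# experimental setup of Section 8; all names and defaults are ours.
import numpy as np

def random_potts_grid(n=10, k=5, c_i=1.0, c_f=1.0, seed=0):
    rng = np.random.default_rng(seed)
    nodes = [(r, c) for r in range(n) for c in range(n)]
    edges = [((r, c), (r, c + 1)) for r in range(n) for c in range(n - 1)]
    edges += [((r, c), (r + 1, c)) for r in range(n - 1) for c in range(n)]
    theta_node = {v: rng.uniform(-c_f, c_f, k) for v in nodes}
    deg = {v: 0 for v in nodes}
    for (u, v) in edges:
        deg[u] += 1; deg[v] += 1
    theta = {}
    for (u, v) in edges:
        w = rng.uniform(-c_i, c_i)          # Potts coupling theta_ij
        t = w * np.eye(k)                   # theta_ij * 1[x_i = x_j]
        # fold node potentials into pairwise terms (footnote 5), splitting
        # each theta_i evenly over the deg(i) incident edges
        t = t + theta_node[u][:, None] / deg[u] + theta_node[v][None, :] / deg[v]
        theta[(u, v)] = t
    return nodes, edges, theta

nodes, edges, theta = random_potts_grid()
print(len(nodes), "nodes,", len(edges), "edges")
```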

[Figure 2 appears here; only its caption survives extraction. Panels (a)-(d) are box-plots of hit-time and value differences for MP, TRMP and GEMPLP.]

Figure 2: Evaluation of message passing algorithms on Potts models and protein design problems. (a,c): Convergence time results for the Potts models (a) and protein design problems (c). The box-plots (horizontal red line indicates median) show the difference between the hit-time for the other algorithms and NMPLP. (b,d): Value of integer solutions for the Potts models (b) and protein design problems (d). The box-plots show the normalized difference between the value of $f(x)$ for NMPLP and the other algorithms. All figures are such that better MPLP performance yields positive Y-axis values. Max-product converged on 58% of the cases for the Potts models, and on 11% of the protein problems. Only convergent max-product runs are shown.

9 Conclusion

We have presented a convergent algorithm for MAP approximation that is based on block coordinate descent on the dual of the MAP-LP relaxation. The algorithm can also be extended to cluster based functions, which empirically result in improved MAP estimates. This is in line with the observations in [14] that generalized belief propagation algorithms can result in significant performance improvements. However, generalized max-product algorithms [14] are not guaranteed to converge, whereas GEMPLP is. Furthermore, the GEMPLP algorithm does not require a region graph and only involves intersections between pairs of clusters. In conclusion, MPLP has the advantage of resolving the convergence problems of max-product while retaining its simplicity, and offering the theoretical guarantees of LP relaxations. We thus believe it should be useful in a wide array of applications.

A Derivation of the Dual

Before deriving the dual, we first express the constraint set $M_L(G)$ in a slightly different way. The definition of $M_L(G)$ in Sec. 2 uses a single distribution $\mu_{ij}(x_i, x_j)$ for every $ij \in E$. In what follows, we use two copies of this pairwise distribution for every edge, which we denote $\bar{\mu}_{ij}(x_i, x_j)$ and $\bar{\mu}_{ji}(x_j, x_i)$, and we add the constraint that these two copies both equal the original $\mu_{ij}(x_i, x_j)$. For this extended set of pairwise marginals, we consider the following set of constraints, which is clearly equivalent to $M_L(G)$. In the rightmost column we give the dual variable that will correspond to each constraint (we omit non-negativity constraints):

$$\begin{array}{lll} \bar{\mu}_{ij}(x_i, x_j) = \mu_{ij}(x_i, x_j) & \forall ij \in E,\, x_i,\, x_j & \beta_{ij}(x_i, x_j) \\ \bar{\mu}_{ji}(x_j, x_i) = \mu_{ij}(x_i, x_j) & \forall ij \in E,\, x_i,\, x_j & \beta_{ji}(x_j, x_i) \\ \sum_{\hat{x}_i} \bar{\mu}_{ij}(\hat{x}_i, x_j) = \mu_j(x_j) & \forall ij \in E,\, x_j & \lambda_{ij}(x_j) \\ \sum_{\hat{x}_j} \bar{\mu}_{ji}(\hat{x}_j, x_i) = \mu_i(x_i) & \forall ij \in E,\, x_i & \lambda_{ji}(x_i) \\ \sum_{x_i} \mu_i(x_i) = 1 & \forall i \in V & \delta_i \end{array} \quad (6)$$

We denote the set of $(\bar{\mu}, \mu)$ satisfying these constraints by $\bar{M}_L(G)$. We can now state an LP that is equivalent to MAPLPR, only with an extended set of variables and constraints. The equivalent problem is to maximize $\mu \cdot \theta$ subject to $(\bar{\mu}, \mu) \in \bar{M}_L(G)$ (note that the objective uses the original $\mu$ copy). LP duality transformation of the extended problem yields the following LP:

$$\begin{array}{ll} \min \sum_i \delta_i & \\ \text{s.t.} \;\; \lambda_{ij}(x_j) - \beta_{ij}(x_i, x_j) \ge 0 & \forall ij, ji \in E,\, x_i,\, x_j \\ \phantom{\text{s.t.} \;\;} \beta_{ij}(x_i, x_j) + \beta_{ji}(x_j, x_i) = \theta_{ij}(x_i, x_j) & \forall ij \in E,\, x_i,\, x_j \\ \phantom{\text{s.t.} \;\;} -\sum_{k \in N(i)} \lambda_{ki}(x_i) + \delta_i \ge 0 & \forall i \in V,\, x_i \end{array} \quad (7)$$

We next simplify the above LP by eliminating some of its constraints and variables. Since each variable $\delta_i$ appears in only one constraint, and the objective minimizes $\delta_i$, it follows that $\delta_i = \max_{x_i} \sum_{k \in N(i)} \lambda_{ki}(x_i)$ and the constraints with $\delta_i$ can be discarded. Similarly, since $\lambda_{ij}(x_j)$ appears in a single constraint, we have for all $ij \in E$, $ji \in E$, $x_i$, $x_j$ that $\lambda_{ij}(x_j) = \max_{x_i} \beta_{ij}(x_i, x_j)$, and the constraints with $\lambda_{ij}(x_j)$, $\lambda_{ji}(x_i)$ can also be discarded. Using the eliminated $\delta_i$ and $\lambda_{ji}(x_i)$ variables, we obtain that the LP in Eq. 7 is equivalent to that in Eq. 3. Note that the objective in Eq. 3 is convex since it is a sum of point-wise maxima of convex functions.
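As a numeric sanity check of the equalization construction used in the proof of App. B below (Eq. 9), the following sketch (ours; random $\lambda$ and $\theta$, with both $\beta$ copies stored indexed [x_i, x_j] for convenience) verifies feasibility, the optimal value of Eq. 4, and the closed-form messages of Prop. 2:

```python
# Sketch only: numeric check of the equalization construction (Eq. 9).
# Random lambda and theta are ours; k is the number of states.
import numpy as np

rng = np.random.default_rng(4)
k = 4
theta = rng.uniform(-1, 1, (k, k))      # theta_ij(x_i, x_j)
lam_i = rng.uniform(-1, 1, k)           # lambda_i^{-j}(x_i), held fixed
lam_j = rng.uniform(-1, 1, k)           # lambda_j^{-i}(x_j), held fixed

# closed-form betas implied by Eq. 9; both stored indexed [x_i, x_j]
beta_ij = 0.5 * (theta + lam_i[:, None] - lam_j[None, :])
beta_ji = 0.5 * (theta + lam_j[None, :] - lam_i[:, None])

# feasibility: the re-parameterization constraint of Eq. 3 holds
assert np.allclose(beta_ij + beta_ji, theta)

# the two max terms of Eq. 8 sum exactly to the lower bound, so f attains it
term1 = (lam_i[:, None] + beta_ji).max()
term2 = (lam_j[None, :] + beta_ij).max()
bound = (lam_i[:, None] + lam_j[None, :] + theta).max()
assert np.isclose(term1 + term2, bound)

# the resulting messages are exactly those of Proposition 2
lam_ji = beta_ji.max(axis=1)            # lambda_ji(x_i) = max_{x_j} beta_ji
prop2 = -0.5 * lam_i + 0.5 * (theta + lam_j[None, :]).max(axis=1)
assert np.allclose(lam_ji, prop2)
```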

B Proof of Proposition 2

We wish to minimize $f$ in Eq. 4 subject to the constraint that $\beta_{ij} + \beta_{ji} = \theta_{ij}$. Rewrite $f$ as

$$f(\beta_{ij}, \beta_{ji}) = \max_{x_i, x_j} \left[ \lambda_i^{-j}(x_i) + \beta_{ji}(x_j, x_i) \right] + \max_{x_i, x_j} \left[ \lambda_j^{-i}(x_j) + \beta_{ij}(x_i, x_j) \right] \quad (8)$$

The pointwise sum of the two arguments in the max operations is $\lambda_i^{-j}(x_i) + \lambda_j^{-i}(x_j) + \theta_{ij}(x_i, x_j)$ (because of the constraints on $\beta$). Thus the minimum of $f$ cannot be less than $\max_{x_i, x_j} \left[ \lambda_i^{-j}(x_i) + \lambda_j^{-i}(x_j) + \theta_{ij}(x_i, x_j) \right]$. One assignment to $\beta$ that achieves this minimum is obtained by requiring an equalization condition:⁷

$$\lambda_j^{-i}(x_j) + \beta_{ij}(x_i, x_j) = \lambda_i^{-j}(x_i) + \beta_{ji}(x_j, x_i) = \frac{1}{2}\left( \theta_{ij}(x_i, x_j) + \lambda_i^{-j}(x_i) + \lambda_j^{-i}(x_j) \right) \quad (9)$$

which implies $\beta_{ij}(x_i, x_j) = \frac{1}{2}\left( \theta_{ij}(x_i, x_j) + \lambda_i^{-j}(x_i) - \lambda_j^{-i}(x_j) \right)$ and a similar expression for $\beta_{ji}$. The resulting $\lambda_{ij}(x_j) = \max_{x_i} \beta_{ij}(x_i, x_j)$ are then the ones in Prop. 2.

⁷Other solutions are possible but may not yield some of the properties of MPLP.

Acknowledgments

The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson was also supported by the Rothschild Yad-Hanadiv fellowship.

References

[1] M. Bayati, D. Shah, and M. Sharma. Maximum weight matching via max-product belief propagation. IEEE Trans. on Information Theory (to appear), 2007.
[2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
[3] A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In UAI, 2007.
[4] J. K. Johnson, D. M. Malioutov, and A. S. Willsky. Lagrangian relaxation for MAP estimation in graphical models. In Allerton Conf. Communication, Control and Computing, 2007.
[5] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568-1583, 2006.
[6] V. Kolmogorov and M. Wainwright. On the optimality of tree-reweighted max-product message passing. In 21st Conference on Uncertainty in Artificial Intelligence (UAI), 2005.
[7] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[8] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction, dual extragradient and Bregman projections. Journal of Machine Learning Research, pages 1627-1653, 2006.
[9] P. O. Vontobel and R. Koetter. Towards low-complexity linear-programming decoding. In Proc. 4th Int. Symposium on Turbo Codes and Related Topics, 2006.
[10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Trans. on Information Theory, 51(11):1120-1146, 2005.
[11] Y. Weiss, C. Yanover, and T. Meltzer. MAP estimation, linear programming and belief propagation with convex free energies. In UAI, 2007.
[12] T. Werner. A linear programming approach to max-sum, a review. IEEE Trans. on PAMI, 2007.
[13] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation, an empirical study. Journal of Machine Learning Research, 7:1887-1907, 2006.
[14] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282-2312, 2005.