An Introduction to LP Relaxations for MAP Inference

An Introduction to LP Relaxations for MAP Inference
Adrian Weller, MLSALT4 Lecture, Feb 27, 2017
With thanks to David Sontag (NYU) for use of some of his slides and illustrations.
For more information, see http://mlg.eng.cam.ac.uk/adrian/
1 / 39

High level overview of our 3 lectures
1. Directed and undirected graphical models (last Friday)
2. LP relaxations for MAP inference (today)
3. Junction tree algorithm for exact inference, belief propagation, variational methods for approximate inference (this Friday)
Further reading / viewing:
- Murphy, Machine Learning: a Probabilistic Perspective
- Barber, Bayesian Reasoning and Machine Learning
- Bishop, Pattern Recognition and Machine Learning
- Koller and Friedman, Probabilistic Graphical Models; https://www.coursera.org/course/pgm
- Wainwright and Jordan, Graphical Models, Exponential Families, and Variational Inference
2 / 39

Example of MAP inference: image denoising
Inference combines prior beliefs with observed evidence to form a prediction.
(Figure: image denoising via MAP inference.)
3 / 39

Example of MAP inference: protein side-chain placement
Find the minimum energy configuration of amino acid side-chains: given the desired 3D structure along a fixed carbon backbone, choose the side-chain orientations giving the most stable folding (Yanover, Meltzer, Weiss '06).
- Orientations of the side-chains are represented by discretized angles called rotamers
- Rotamer choices for nearby amino acids are energetically coupled (attractive and repulsive forces)
- Each side-chain is a variable X_i; each edge has a potential function θ_ij(x_i, x_j), given as a table, and the joint distribution over the variables is defined by these potentials (with a partition function for normalization)
(Figure: protein backbone with side-chain variables X_1, ..., X_4 and edge potentials θ_12(x_1, x_2), θ_13(x_1, x_3), θ_34(x_3, x_4).)
Key problems: find marginals; find the most likely assignment (MAP), the focus of this talk
4 / 39

Outline of talk
- Background on undirected graphical models
- Basic LP relaxation
- Tighter relaxations
- Message passing and dual decomposition
We'll comment on: when is an LP relaxation tight?
5 / 39

Background: undirected graphical models
- Powerful way to represent relationships across variables
- Many applications including: computer vision, social network analysis, deep belief networks, protein folding...
- In this talk, focus on pairwise models with discrete variables (sometimes binary)
- Example: grid for computer vision
6 / 39

Background: undirected graphical models
- Discrete variables X_1, ..., X_n with X_i ∈ {0, ..., k_i - 1}
- Potential functions, which we will (somehow) write as a vector θ
- Write x = (..., x_1, ..., x_n, ...) for one overcomplete configuration of all variables, and θ·x for its total score
- Probability distribution given by p(x) = (1/Z) exp(θ·x)
- To ensure probabilities sum to 1, we need the normalizing constant or partition function Z = Σ_x exp(θ·x)
- We are interested in maximum a posteriori (MAP) inference, i.e. finding a global configuration with highest probability: x* ∈ argmax_x p(x) = argmax_x θ·x
7 / 39

Background: how do we write potentials as a vector θ?
- θ·x means the total score of a configuration x, where we sum over all potential functions
- If we have potential functions θ_c over some subsets c ∈ C of variables, then we want Σ_{c ∈ C} θ_c(x_c), where x_c means the configuration of the variables just in the subset c
- θ_c(x_c) provides a measure of local compatibility: a table of values
- If we only have some unary/singleton potentials θ_i and edge/pairwise potentials θ_ij, then we can write the total score as Σ_i θ_i(x_i) + Σ_{(i,j)} θ_ij(x_i, x_j)
- Indices? Usually assume either no unary potentials (absorb them into edges) or one for every variable, leading to a graph topology (V, E) with total score
  Σ_{i ∈ V={1,...,n}} θ_i(x_i) + Σ_{(i,j) ∈ E} θ_ij(x_i, x_j)
8 / 39
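As a concrete illustration of this scoring (a minimal sketch of mine, not from the lecture: the potential tables and the helper name total_score are made up), the following computes Σ_i θ_i(x_i) + Σ_{(i,j)} θ_ij(x_i, x_j) for a tiny chain model and finds the MAP assignment by brute-force enumeration:

```python
import itertools
import numpy as np

# A tiny pairwise model: 3 binary variables on a chain 0 - 1 - 2.
# Unary potentials theta_i(x_i) and pairwise potentials theta_ij(x_i, x_j)
# are arbitrary illustrative numbers.
theta_unary = {0: np.array([0.0, 1.0]),
               1: np.array([0.5, 0.0]),
               2: np.array([0.0, 0.2])}
theta_pair = {(0, 1): np.array([[2.0, 0.0], [0.0, 2.0]]),   # "agree" edge
              (1, 2): np.array([[0.0, 1.5], [1.5, 0.0]])}   # "disagree" edge

def total_score(x):
    """Total score theta.x = sum_i theta_i(x_i) + sum_ij theta_ij(x_i, x_j)."""
    score = sum(theta_unary[i][x[i]] for i in theta_unary)
    score += sum(theta_pair[i, j][x[i], x[j]] for (i, j) in theta_pair)
    return score

# MAP by brute force: enumerate all 2^3 configurations (only feasible for tiny n).
best = max(itertools.product([0, 1], repeat=3), key=total_score)
print("MAP assignment:", best, "score:", total_score(best))
```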

Background: overcomplete representation
The overcomplete representation conveniently allows us to write
  θ·x = Σ_{i ∈ V} θ_i(x_i) + Σ_{(i,j) ∈ E} θ_ij(x_i, x_j)
Concatenate singleton and edge terms into big vectors (shown here for binary variables):
  θ = ( ..., θ_i(0), θ_i(1), ..., θ_ij(0,0), θ_ij(0,1), θ_ij(1,0), θ_ij(1,1), ... )
  x = ( ..., 1[X_i = 0], 1[X_i = 1], ..., 1[X_i = 0, X_j = 0], 1[X_i = 0, X_j = 1], 1[X_i = 1, X_j = 0], 1[X_i = 1, X_j = 1], ... )
There are many possible values of x. How many? Π_{i ∈ V} k_i
9 / 39
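Continuing the same made-up chain model (a sketch assuming the theta_unary, theta_pair and total_score definitions from the previous snippet), one can build the overcomplete vectors explicitly and check that the dot product θ·x reproduces the summed score:

```python
import numpy as np  # continues the toy model (theta_unary, theta_pair, total_score) defined above

def overcomplete_theta_and_x(x):
    """Return (theta_vec, x_vec) in the overcomplete representation,
    ordered as all unary blocks followed by all edge blocks."""
    theta_vec, x_vec = [], []
    for i, table in theta_unary.items():          # unary blocks: theta_i(0), theta_i(1)
        for v in range(len(table)):
            theta_vec.append(table[v])
            x_vec.append(1.0 if x[i] == v else 0.0)                   # 1[X_i = v]
    for (i, j), table in theta_pair.items():      # edge blocks: theta_ij(a, b)
        for a in range(table.shape[0]):
            for b in range(table.shape[1]):
                theta_vec.append(table[a, b])
                x_vec.append(1.0 if (x[i], x[j]) == (a, b) else 0.0)  # 1[X_i=a, X_j=b]
    return np.array(theta_vec), np.array(x_vec)

x = (1, 1, 0)
theta_vec, x_vec = overcomplete_theta_and_x(x)
assert np.isclose(theta_vec @ x_vec, total_score(x))   # theta . x == total score
```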

Background: binary pairwise models
- θ·x is the score of a configuration x
- Probability distribution given by p(x) = (1/Z) exp(θ·x)
- For MAP inference, we want x* ∈ argmax_x p(x) = argmax_x θ·x
- We want to optimize over the {0, 1} coordinates of the overcomplete configuration space corresponding to all 2^n possible settings
- The convex hull of these points defines the marginal polytope M
- Each point µ ∈ M corresponds to a probability distribution over the 2^n configurations, giving a vector of marginals
10 / 39

Background: the marginal polytope (all valid marginals)
(Figure 2-1: illustration of the marginal polytope for a Markov random field with three nodes (Wainwright & Jordan, '03). Each vertex is the overcomplete indicator vector µ of an integral assignment, with blocks for the assignments of X_1, X_2, X_3 and for the edge assignments of X_1X_2, X_1X_3, X_2X_3; a convex combination such as µ = ½ µ' + ½ µ'' of two vertices is also a vector of valid marginal probabilities.)
11 / 39

Background: overcomplete and minimal representations
- The overcomplete representation is highly redundant, e.g. µ_i(0) + µ_i(1) = 1 for all i
- How many dimensions for n binary variables with m edges? 2n + 4m
- Instead, we sometimes pick a minimal representation. What's the minimum number of dimensions we need? n + m
- For example, we could use q = (q_1, ..., q_n, ..., q_ij, ...) where q_i = µ_i(1) for all i and q_ij = µ_ij(1,1) for all (i, j); then
  µ_i = (1 - q_i, q_i),  µ_j = (1 - q_j, q_j),
  µ_ij = [ [1 + q_ij - q_i - q_j, q_j - q_ij], [q_i - q_ij, q_ij] ]
- Note there are many other possible minimal representations
12 / 39
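For one binary edge, the conversion from the minimal representation back to overcomplete marginals can be written directly from the formulas above; this is an illustrative sketch (the helper name edge_marginals is mine):

```python
import numpy as np

def edge_marginals(q_i, q_j, q_ij):
    """Recover overcomplete marginals for one binary edge from the minimal
    representation q_i = mu_i(1), q_j = mu_j(1), q_ij = mu_ij(1, 1)."""
    mu_i = np.array([1 - q_i, q_i])
    mu_j = np.array([1 - q_j, q_j])
    mu_ij = np.array([[1 + q_ij - q_i - q_j, q_j - q_ij],
                      [q_i - q_ij,           q_ij]])
    return mu_i, mu_j, mu_ij

mu_i, mu_j, mu_ij = edge_marginals(0.6, 0.3, 0.2)
assert np.isclose(mu_ij.sum(), 1.0)            # pairwise marginal normalizes
assert np.allclose(mu_ij.sum(axis=1), mu_i)    # consistent with mu_i
assert np.allclose(mu_ij.sum(axis=0), mu_j)    # consistent with mu_j
```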

LP relaxation: MAP as an integer linear program (ILP)
MAP inference as a discrete optimization problem is to identify a configuration with maximum total score:
  x* ∈ argmax_x Σ_{i ∈ V} θ_i(x_i) + Σ_{ij ∈ E} θ_ij(x_i, x_j)
     = argmax_x θ·x
     = argmax_µ Σ_{i ∈ V} Σ_{x_i} θ_i(x_i) µ_i(x_i) + Σ_{ij ∈ E} Σ_{x_i, x_j} θ_ij(x_i, x_j) µ_ij(x_i, x_j)
     = argmax_µ θ·µ   s.t. µ is integral
Any other constraints?
13 / 39

What are the constraints?
Force every cluster of variables to choose a local assignment:
  µ_i(x_i) ∈ {0, 1}  for all i ∈ V, x_i        Σ_{x_i} µ_i(x_i) = 1  for all i ∈ V
  µ_ij(x_i, x_j) ∈ {0, 1}  for all ij ∈ E, x_i, x_j        Σ_{x_i, x_j} µ_ij(x_i, x_j) = 1  for all ij ∈ E
Enforce that these assignments are consistent:
  µ_i(x_i) = Σ_{x_j} µ_ij(x_i, x_j)  for all ij ∈ E, x_i
  µ_j(x_j) = Σ_{x_i} µ_ij(x_i, x_j)  for all ij ∈ E, x_j
14 / 39

MAP as an integer linear program (ILP)
  MAP(θ) = max_µ Σ_{i ∈ V} Σ_{x_i} θ_i(x_i) µ_i(x_i) + Σ_{ij ∈ E} Σ_{x_i, x_j} θ_ij(x_i, x_j) µ_ij(x_i, x_j) = max_µ θ·µ
subject to:
  µ_i(x_i) ∈ {0, 1}  for all i ∈ V, x_i    (edge terms?)
  Σ_{x_i} µ_i(x_i) = 1  for all i ∈ V
  µ_i(x_i) = Σ_{x_j} µ_ij(x_i, x_j)  for all ij ∈ E, x_i
  µ_j(x_j) = Σ_{x_i} µ_ij(x_i, x_j)  for all ij ∈ E, x_j
Many good off-the-shelf solvers exist, such as CPLEX and Gurobi
15 / 39
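The ILP above can be written down with standard tooling. Below is a minimal sketch (my construction, not the lecture's; the model, index maps and variable names are made up) that assembles the objective and constraints for a small "frustrated cycle" of three binary variables and solves it with the mixed-integer solver bundled in SciPy >= 1.9 (HiGHS); CPLEX or Gurobi, as mentioned on the slide, would be the usual choice for larger instances. The frustrated cycle is chosen deliberately: its LP relaxation, revisited in a later snippet, is not tight.

```python
import numpy as np
from scipy.optimize import linprog

# Toy model: a "frustrated" cycle of 3 binary variables where every edge
# prefers its endpoints to disagree. Potentials are illustrative, not from the slides.
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (2, 0)]
theta_unary = {i: np.zeros(2) for i in nodes}
theta_pair = {e: np.array([[0.0, 1.0], [1.0, 0.0]]) for e in edges}

# Index the overcomplete vector mu: first all mu_i(x_i), then all mu_ij(x_i, x_j).
node_idx = {(i, v): 2 * i + v for i in nodes for v in (0, 1)}
base = 2 * len(nodes)
edge_idx = {(e, a, b): base + 4 * k + 2 * a + b
            for k, e in enumerate(edges) for a in (0, 1) for b in (0, 1)}
num_vars = base + 4 * len(edges)

# Objective: linprog minimizes, so use the negated scores.
c = np.zeros(num_vars)
for i in nodes:
    for v in (0, 1):
        c[node_idx[i, v]] = -theta_unary[i][v]
for e in edges:
    for a in (0, 1):
        for b in (0, 1):
            c[edge_idx[e, a, b]] = -theta_pair[e][a, b]

A_eq, b_eq = [], []
for i in nodes:                      # normalization: sum_{x_i} mu_i(x_i) = 1
    row = np.zeros(num_vars)
    row[[node_idx[i, 0], node_idx[i, 1]]] = 1.0
    A_eq.append(row); b_eq.append(1.0)
for e in edges:                      # local consistency with both endpoints
    i, j = e
    for v in (0, 1):                 # mu_i(v) = sum_{x_j} mu_ij(v, x_j)
        row = np.zeros(num_vars)
        row[node_idx[i, v]] = 1.0
        row[[edge_idx[e, v, 0], edge_idx[e, v, 1]]] = -1.0
        A_eq.append(row); b_eq.append(0.0)
    for v in (0, 1):                 # mu_j(v) = sum_{x_i} mu_ij(x_i, v)
        row = np.zeros(num_vars)
        row[node_idx[j, v]] = 1.0
        row[[edge_idx[e, 0, v], edge_idx[e, 1, v]]] = -1.0
        A_eq.append(row); b_eq.append(0.0)

# Solve as an ILP: integrality=1 makes every coordinate of mu a 0/1 variable
# (requires SciPy >= 1.9, which bundles the HiGHS MIP solver).
ilp = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1),
              integrality=np.ones(num_vars, dtype=int), method="highs")
print("MAP value:", -ilp.fun)        # 2.0: at best two of the three edges disagree
```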

Linear programming (LP) relaxation for MAP
The integer linear program was:
  MAP(θ) = max_µ θ·µ
subject to
  µ_i(x_i) ∈ {0, 1}  for all i ∈ V, x_i
  Σ_{x_i} µ_i(x_i) = 1  for all i ∈ V
  µ_i(x_i) = Σ_{x_j} µ_ij(x_i, x_j)  for all ij ∈ E, x_i
  µ_j(x_j) = Σ_{x_i} µ_ij(x_i, x_j)  for all ij ∈ E, x_j
Now relax the integrality constraints, allowing variables to take values between 0 and 1:
  µ_i(x_i) ∈ [0, 1]  for all i ∈ V, x_i
16 / 39

Basic LP relaxation for MAP
  LP(θ) = max_µ θ·µ  s.t.
  µ_i(x_i) ∈ [0, 1]  for all i ∈ V, x_i
  Σ_{x_i} µ_i(x_i) = 1  for all i ∈ V
  µ_i(x_i) = Σ_{x_j} µ_ij(x_i, x_j)  for all ij ∈ E, x_i
  µ_j(x_j) = Σ_{x_i} µ_ij(x_i, x_j)  for all ij ∈ E, x_j
Linear programs can be solved efficiently: simplex, interior point, ellipsoid algorithms
Since the LP relaxation maximizes over a larger set, its value can only be higher:
  MAP(θ) ≤ LP(θ)
17 / 39
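Continuing the scipy sketch from the ILP slide (same made-up frustrated cycle, reusing c, A_eq and b_eq), the basic LP relaxation is obtained simply by dropping the integrality requirement. On this model it returns fractional marginals and a strictly larger value, illustrating MAP(θ) ≤ LP(θ):

```python
# Same constraint matrices as the ILP sketch above; only integrality is dropped.
lp = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
             bounds=(0, 1), method="highs")
print("LP value:", -lp.fun)              # 3.0 > MAP value of 2.0: the relaxation is loose here
print("singleton marginals:", lp.x[:6])  # all 0.5: a fractional solution
```

The all-0.5 solution is a fractional point of the local polytope discussed on the next slide.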

The local polytope
  LP(θ) = max_µ θ·µ  s.t.
  µ_i(x_i) ∈ [0, 1]  for all i ∈ V, x_i
  Σ_{x_i} µ_i(x_i) = 1  for all i ∈ V
  µ_i(x_i) = Σ_{x_j} µ_ij(x_i, x_j)  for all ij ∈ E, x_i
  µ_j(x_j) = Σ_{x_i} µ_ij(x_i, x_j)  for all ij ∈ E, x_j
- All these constraints are linear, hence they define a polytope in the space of marginals
- Here we enforced only local (pairwise) consistency, which defines the local polytope
- If instead we had optimized over the marginal polytope, which enforces global consistency, then we would have MAP(θ) = LP(θ), i.e. the LP would be tight. Why? Why don't we do this?
18 / 39

Tighter relaxations of the marginal polytope
- Enforcing consistency of pairs of variables leads to the local polytope L_2
- The marginal polytope enforces consistency over all variables: M = L_n
- It is natural to consider the Sherali-Adams hierarchy of successively tighter relaxations L_r, 2 ≤ r ≤ n, which enforce consistency over clusters of r variables
- Just up from the local polytope is the triplet polytope TRI = L_3
19 / 39

Stylized illustration of polytopes
(Figure: nested polytopes, from innermost to outermost: marginal polytope M = L_n (global consistency), ..., triplet polytope L_3 (triplet consistency), local polytope L_2 (pair consistency). Inner polytopes are more accurate but more computationally intensive; outer ones are less accurate but less computationally intensive.)
- It can be shown that for binary variables, TRI = CYC, the cycle polytope, which enforces consistency over all cycles
- In general TRI ⊆ CYC; it is an open problem whether TRI = CYC [SonPhD, Ch. 3]
20 / 39

When is the LP tight?
- STRUCTURE: For a model without cycles, the local polytope L_2 = M, the marginal polytope, hence the basic ("first order") LP is always tight. More generally, if a model has treewidth r then LP + L_{r+1} is tight [WaiJor04]
- POTENTIALS: Separately, if we allow any structure but restrict the class of potential functions, interesting results are known. For example, the basic LP is tight if all potentials are supermodular. Fascinating recent work [KolThaZiv15]: if we do not restrict structure, then for any given family of potentials, either the basic LP relaxation is tight or the problem class is NP-hard!
- HYBRID: Identifying hybrid conditions (combining restrictions on structure and potentials) is an exciting current research area
21 / 39

When is MAP inference (relatively) easy?
- STRUCTURE: a tree
- POTENTIALS: an attractive (binary) model
Both can be handled efficiently by the basic LP relaxation, LP + L_2
22 / 39

Cutting planes
(Figure 2-6 from [SonPhD]: illustration of the cutting-plane algorithm. (a) Solve the LP relaxation. (b) Find a violated constraint, add it to the relaxation, and repeat. (c) Result of solving the tighter LP relaxation. (d) Finally, we find the MAP assignment.)
Even optimizing over TRI(G), the first lifting of the Sherali-Adams hierarchy, is impractical for all but the smallest graphical models. The approach is motivated by the observation that it may not be necessary to add all of the constraints that make up a higher-order relaxation such as TRI(G). In particular, it is possible that the pairwise LP relaxation alone is close to being tight, and that only a few carefully chosen constraints would suffice.
23 / 39

Cutting planes
(Figure from [SonPhD]: an invalid constraint cuts off integer points; a useless constraint does not separate the current fractional solution from the integer points.)
We want to add constraints that are both valid and useful:
- Valid: does not cut off any integer points
- Useful: leads us to update to a better solution, e.g. by separating the current fractional solution from the integer points; the tightest valid constraints to add are the facets of the marginal polytope
One way to guarantee that constraints are useful is to use them within the cutting-plane algorithm.
24 / 39

Methods for solving general integer linear programs
- Local search: start from an arbitrary assignment (e.g., random); repeatedly choose a variable and change its value to improve the objective
- Branch-and-bound: exhaustive search over the space of assignments, pruning branches that can be provably shown not to contain a MAP assignment. Can use the LP relaxation or its dual to obtain upper bounds; a lower bound is obtained from the value of any assignment found
- Branch-and-cut (the most powerful method; used by CPLEX and Gurobi): same as branch-and-bound, but spends more time getting tighter bounds by adding cutting planes to cut off fractional solutions of the LP relaxation, making the upper bound tighter
25 / 39

Message passing
- Can be a computationally efficient way to obtain or approximate a MAP solution; takes advantage of the graph structure
- The classic example is max-product belief propagation (BP)
- Sufficient conditions are known under which BP will always converge to the solution of the basic LP; they include that the basic LP is tight [ParkShin-UAI15]
- In general, however, BP may not converge to the LP solution (even for supermodular potentials)
- Other methods have been developed, many of which relate to dual decomposition...
26 / 39

Dual decomposition and reparameterizations
Consider the MAP problem for pairwise Markov random fields:
  MAP(θ) = max_x Σ_{i ∈ V} θ_i(x_i) + Σ_{ij ∈ E} θ_ij(x_i, x_j)
If we push the maximizations inside the sums, the value can only increase:
  MAP(θ) ≤ Σ_{i ∈ V} max_{x_i} θ_i(x_i) + Σ_{ij ∈ E} max_{x_i, x_j} θ_ij(x_i, x_j)
Note that the right-hand side can be easily evaluated.
One can always reparameterize a distribution by operations like
  θ_i^new(x_i) = θ_i^old(x_i) + f(x_i)
  θ_ij^new(x_i, x_j) = θ_ij^old(x_i, x_j) - f(x_i)
for any function f(x_i), without changing the distribution/energy.
27 / 39

Dual decomposition
(Figure 1.2 from [SonGloJaa11]: illustration of the dual decomposition objective. Left: the original pairwise model over x_1, ..., x_4 consisting of four factors f(x_1, x_2), g(x_1, x_3), h(x_2, x_4), k(x_3, x_4). Right: the maximization is decomposed into one subproblem per factor, coupled to the variables through terms such as f_1(x_1), f_2(x_2), g_1(x_1), g_3(x_3), h_2(x_2), h_4(x_4), k_3(x_3), k_4(x_4).)
28 / 39

Dual decomposition
Define:
  θ̄_i(x_i) = θ_i(x_i) + Σ_{ij ∈ E} δ_{j→i}(x_i)
  θ̄_ij(x_i, x_j) = θ_ij(x_i, x_j) - δ_{j→i}(x_i) - δ_{i→j}(x_j)
It is easy to verify that, for every x,
  Σ_i θ_i(x_i) + Σ_{ij ∈ E} θ_ij(x_i, x_j) = Σ_i θ̄_i(x_i) + Σ_{ij ∈ E} θ̄_ij(x_i, x_j)
Thus we have:
  MAP(θ) = MAP(θ̄) ≤ Σ_{i ∈ V} max_{x_i} θ̄_i(x_i) + Σ_{ij ∈ E} max_{x_i, x_j} θ̄_ij(x_i, x_j)
- Every value of δ gives a different upper bound on the value of the MAP
- The tightest upper bound can be obtained by minimizing the RHS with respect to δ
29 / 39

Dual decomposition
We obtain the following dual objective:
  L(δ) = Σ_{i ∈ V} max_{x_i} ( θ_i(x_i) + Σ_{ij ∈ E} δ_{j→i}(x_i) ) + Σ_{ij ∈ E} max_{x_i, x_j} ( θ_ij(x_i, x_j) - δ_{j→i}(x_i) - δ_{i→j}(x_j) )
  DUAL-LP(θ) = min_δ L(δ)
This provides an upper bound on the value of the MAP assignment:
  MAP(θ) ≤ DUAL-LP(θ) ≤ L(δ)
How can we find a δ which gives tight bounds?
30 / 39
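L(δ) is cheap to evaluate for any δ. As an illustration (my sketch, reusing the made-up frustrated-cycle model nodes, edges, theta_unary, theta_pair and the numpy import from the ILP snippet), at δ = 0 it is simply the sum of the per-node and per-edge maxima:

```python
from collections import defaultdict

def dual_bound(delta):
    """L(delta): sum of per-node and per-edge maxima of the reparameterized potentials.
    delta[(i, j)][v] is the message delta_{j -> i}(x_i = v) for edge (i, j)."""
    bound = 0.0
    for i in nodes:
        reparam = theta_unary[i].copy()
        for (a, b) in edges:                      # add incoming messages delta_{j->i}
            if i == a:
                reparam += delta[(a, b)]
            elif i == b:
                reparam += delta[(b, a)]
        bound += reparam.max()
    for (i, j) in edges:
        reparam = (theta_pair[(i, j)]
                   - delta[(i, j)][:, None]       # subtract delta_{j->i}(x_i)
                   - delta[(j, i)][None, :])      # subtract delta_{i->j}(x_j)
        bound += reparam.max()
    return bound

zero_delta = defaultdict(lambda: np.zeros(2))
print("L(0) =", dual_bound(zero_delta))          # 3.0 here, an upper bound on MAP = 2.0
```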

Solving the dual efficiently
Many ways to solve the dual linear program, i.e. to minimize L(δ) with respect to δ:
  L(δ) = Σ_{i ∈ V} max_{x_i} ( θ_i(x_i) + Σ_{ij ∈ E} δ_{j→i}(x_i) ) + Σ_{ij ∈ E} max_{x_i, x_j} ( θ_ij(x_i, x_j) - δ_{j→i}(x_i) - δ_{i→j}(x_j) )
- One option is to use the subgradient method
- Can also solve using block coordinate descent, which gives algorithms that look very much like max-sum belief propagation
31 / 39

Max-product linear programming (MPLP) algorithm
Input: a set of potentials θ_i(x_i), θ_ij(x_i, x_j)
Output: an assignment x_1, ..., x_n that approximates a MAP solution
Algorithm:
- Initialize δ_{i→j}(x_j) = 0, δ_{j→i}(x_i) = 0 for all ij ∈ E, x_i, x_j
- Iterate until the change in L(δ) is small enough: for each edge ij ∈ E (sequentially), perform the updates
    δ_{j→i}(x_i) = -1/2 δ_i^{-j}(x_i) + 1/2 max_{x_j} [ θ_ij(x_i, x_j) + δ_j^{-i}(x_j) ]
    δ_{i→j}(x_j) = -1/2 δ_j^{-i}(x_j) + 1/2 max_{x_i} [ θ_ij(x_i, x_j) + δ_i^{-j}(x_i) ]
  where δ_i^{-j}(x_i) = θ_i(x_i) + Σ_{ik ∈ E, k ≠ j} δ_{k→i}(x_i)
- Return x_i ∈ argmax_{x̂_i} θ_i^δ(x̂_i), where θ_i^δ is the reparameterized singleton potential θ̄_i from the dual decomposition slide
32 / 39
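A compact sketch of these MPLP updates for the binary pairwise case is below (my code, continuing the frustrated-cycle model and the dual_bound helper from the earlier snippets; not a reference implementation). Because of ties in the decoding step, the returned assignment is only guaranteed to be a MAP solution when the relaxation is tight and the argmax is unique.

```python
def mplp(theta_unary, theta_pair, edges, num_iters=50):
    """A minimal MPLP-style block coordinate descent sketch for pairwise models
    with binary variables. delta[(i, j)] holds delta_{j->i}(x_i) as a length-2 array."""
    delta = {(i, j): np.zeros(2) for (a, b) in edges for (i, j) in [(a, b), (b, a)]}

    def delta_bar(i, excluded_edge):
        # delta_i^{-j}(x_i) = theta_i(x_i) + sum over the other edges of delta_{k->i}(x_i)
        out = theta_unary[i].copy()
        for (a, b) in edges:
            if (a, b) == excluded_edge or (b, a) == excluded_edge:
                continue
            if i == a:
                out += delta[(a, b)]
            elif i == b:
                out += delta[(b, a)]
        return out

    for _ in range(num_iters):
        for (i, j) in edges:
            bar_i = delta_bar(i, (i, j))          # delta_i^{-j}
            bar_j = delta_bar(j, (i, j))          # delta_j^{-i}
            t = theta_pair[(i, j)]                # t[x_i, x_j]
            delta[(i, j)] = -0.5 * bar_i + 0.5 * (t + bar_j[None, :]).max(axis=1)
            delta[(j, i)] = -0.5 * bar_j + 0.5 * (t + bar_i[:, None]).max(axis=0)

    # Decode: x_i in argmax of the reparameterized singleton potentials
    assignment = {}
    for i in theta_unary:
        reparam = theta_unary[i].copy()
        for (a, b) in edges:
            if i == a:
                reparam += delta[(a, b)]
            elif i == b:
                reparam += delta[(b, a)]
        assignment[i] = int(np.argmax(reparam))
    return assignment, delta

assignment, delta = mplp(theta_unary, theta_pair, edges)
print("decoded assignment:", assignment, "dual bound:", dual_bound(delta))
```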

Generalization to arbitrary factor graphs [SonGloJaa11]
(Figure 1.4 from [SonGloJaa11]: description of the MPLP block coordinate descent algorithm for minimizing the dual L(δ).)
Inputs: a set of factors θ_i(x_i), θ_f(x_f)
Output: an assignment x_1, ..., x_n that approximates the MAP
Algorithm:
- Initialize δ_fi(x_i) = 0 for all f ∈ F, i ∈ f, x_i
- Iterate until the change in L(δ) is small enough: for each f ∈ F, perform the updates
    δ_fi(x_i) = -δ_i^{-f}(x_i) + (1/|f|) max_{x_{f\i}} [ θ_f(x_f) + Σ_{i' ∈ f} δ_{i'}^{-f}(x_{i'}) ]     (1.16)
  simultaneously for all i ∈ f and x_i, where δ_i^{-f}(x_i) = θ_i(x_i) + Σ_{f' ≠ f} δ_{f'i}(x_i)
- Return x_i ∈ argmax_{x̂_i} θ_i^δ(x̂_i)
Similar algorithms can be devised for different choices of coordinate blocks. The assignment returned in the final step follows the decoding scheme of [SonGloJaa11].
33 / 39

Experimental results
Performance on a stereo vision inference task:
(Figure: objective value vs. iteration; the dual objective decreases while the value of the decoded assignment increases until the duality gap closes and the problem is solved optimally.)
34 / 39

Dual decomposition = basic LP relaxation
Recall we obtained the following dual linear program:
  L(δ) = Σ_{i ∈ V} max_{x_i} ( θ_i(x_i) + Σ_{ij ∈ E} δ_{j→i}(x_i) ) + Σ_{ij ∈ E} max_{x_i, x_j} ( θ_ij(x_i, x_j) - δ_{j→i}(x_i) - δ_{i→j}(x_j) )
  DUAL-LP(θ) = min_δ L(δ)
We showed two ways of upper bounding the value of the MAP assignment:
  MAP(θ) ≤ Basic LP(θ)            (1)
  MAP(θ) ≤ DUAL-LP(θ) ≤ L(δ)      (2)
Although we derived these linear programs in seemingly very different ways, it turns out that:
  Basic LP(θ) = DUAL-LP(θ)  [SonGloJaa11]
The dual LP allows us to upper bound the value of the MAP assignment without solving an LP to optimality.
35 / 39

Linear programming duality
(Figure: the MAP assignment x*, the integer linear program, the primal LP relaxation, and the dual LP relaxation, ordered by the values they attain.)
  MAP(θ) ≤ Basic LP(θ) = DUAL-LP(θ) ≤ L(δ)
36 / 39

Cutting-plane algorithm
(Figure 2-6 from [SonPhD], repeated: (a) Solve the LP relaxation. (b) Find a violated constraint, add it to the relaxation, and repeat. (c) Result of solving the tighter LP relaxation. (d) Finally, we find the MAP assignment.)
The approach is motivated by the observation that it may not be necessary to add all of the constraints that make up a higher-order relaxation such as TRI(G). In particular, it is possible that the pairwise LP relaxation alone is close to being tight, and that only a few carefully chosen constraints would suffice to obtain an integer solution. The algorithms in [SonPhD] tighten the relaxation in a problem-specific way, using additional computation just for the hard parts of each instance.
37 / 39

Conclusion
- LP relaxations yield a powerful approach for MAP inference
- They naturally lead to considerations of the polytope view and cutting planes, and of dual decomposition and message passing
- Close relationship to methods for marginal inference
- They help build understanding as well as develop new algorithmic tools
- Exciting current research
Thank you
38 / 39

References
V. Kolmogorov, J. Thapper, and S. Živný. The power of linear programming for general-valued CSPs. SIAM Journal on Computing, 44(1):1-36, 2015.
S. Park and J. Shin. Max-product belief propagation for linear programming: applications to combinatorial optimization. In UAI, 2015.
D. Sontag. Approximate inference in graphical models using LP relaxations. PhD thesis, MIT, 2010.
D. Sontag, A. Globerson, and T. Jaakkola. Introduction to dual decomposition for inference. In Optimization for Machine Learning, MIT Press, 2011.
J. Thapper and S. Živný. The complexity of finite-valued CSPs. arXiv technical report, 2015. http://arxiv.org/abs/1210.2987v3
M. Wainwright and M. Jordan. Treewidth-based conditions for exactness of the Sherali-Adams and Lasserre relaxations. Technical Report 671, Univ. California, Berkeley, 2004.
M. Wainwright and M. Jordan. Graphical models, exponential families and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
39 / 39