Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application

Weiran Wang    Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science, University of California, Merced
{wwang5,mcarreira-perpinan}@ucmerced.edu

September 3, 2013

Abstract

We provide an elementary proof of a simple, efficient algorithm for computing the Euclidean projection of a point onto the probability simplex. We also show an application in Laplacian K-modes clustering.

1 Projection onto the probability simplex

Consider the problem of computing the Euclidean projection of a point y = [y_1,...,y_D]^T ∈ R^D onto the probability simplex, which is defined by the following optimization problem:

    min_{x ∈ R^D}  (1/2) ‖x − y‖²        (1a)
    s.t.  x^T 1 = 1                       (1b)
          x ≥ 0.                          (1c)

This is a quadratic program and the objective function is strictly convex, so there is a unique solution, which we denote by x = [x_1,...,x_D]^T with a slight abuse of notation.

2 Algorithm

The following O(D log D) algorithm finds the solution x to problem (1):

Algorithm 1  Euclidean projection of a vector onto the probability simplex.
    Input: y ∈ R^D
    Sort y into u: u_1 ≥ u_2 ≥ ... ≥ u_D
    Find ρ = max{1 ≤ j ≤ D:  u_j + (1/j)(1 − Σ_{i=1}^j u_i) > 0}
    Define λ = (1/ρ)(1 − Σ_{i=1}^ρ u_i)
    Output: x s.t. x_i = max{y_i + λ, 0}, i = 1,...,D.

The complexity of the algorithm is dominated by the cost of sorting the components of y. The algorithm is not iterative and identifies the active set exactly after at most D steps. It can be easily implemented (see section 4).

The algorithm has the following geometric interpretation. The solution can be written as x_i = max{y_i + λ, 0}, i = 1,...,D, where λ is chosen such that Σ_{i=1}^D x_i = 1. Place the values y_1,...,y_D as points on the X axis. Then the solution is given by a rigid shift of the points such that the points that remain to the right of the Y axis sum to 1.

The pseudocode above appears in Duchi et al. (2008) (see http://www.cs.berkeley.edu/~duchi/projects/duchishsich08.html), although earlier papers (Brucker, 1984; Pardalos and Kovoor, 1990) solved the problem in greater generality.
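For concreteness, here is a small worked example of Algorithm 1 (the numbers are ours, added for illustration). Take D = 3 and y = (1.2, 0.3, −0.5), which is already sorted, so u = y. The tests are:

    j = 1:  u_1 + (1 − u_1)/1 = 1.2 + (1 − 1.2) = 1 > 0
    j = 2:  u_2 + (1 − u_1 − u_2)/2 = 0.3 + (1 − 1.5)/2 = 0.05 > 0
    j = 3:  u_3 + (1 − u_1 − u_2 − u_3)/3 = −0.5 + (1 − 1.0)/3 = −0.5 ≤ 0.

Hence ρ = 2, λ = (1 − u_1 − u_2)/2 = −0.25, and x = (max{1.2 − 0.25, 0}, max{0.3 − 0.25, 0}, max{−0.5 − 0.25, 0}) = (0.95, 0.05, 0), which is nonnegative and sums to 1 as required.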

Other algorithms  The problem can be solved in many other ways, for example by particularizing QP algorithms such as active-set methods, gradient-projection methods or interior-point methods. It can also be solved by alternating projection onto the two constraints in a finite number of steps (Michelot, 1986). Another way (Boyd and Vandenberghe, 2004, Exercise 4.1, solution available at http://see.stanford.edu/materials/lsocoee364b/hw4sol.pdf) is to construct a Lagrangian formulation by dualizing the equality constraint and then solve a 1-D nonsmooth optimization over the Lagrange multiplier. Algorithm 1 has the advantage of being very simple, not iterative, and identifying exactly the active set at the solution after at most D steps (each of cost O(1) after sorting).

3 A simple proof

A proof of correctness of Algorithm 1 can be found in Shalev-Shwartz and Singer (2006) and Chen and Ye (2011), but we offer a simpler proof which involves only the KKT theorem. We apply the standard KKT conditions for the problem (Nocedal and Wright, 2006). The Lagrangian of the problem is

    L(x, λ, β) = (1/2) ‖x − y‖² − λ(x^T 1 − 1) − β^T x,

where λ and β = [β_1,...,β_D]^T are the Lagrange multipliers for the equality and inequality constraints, respectively. At the optimal solution x the following KKT conditions hold:

    x_i − y_i − λ − β_i = 0,   i = 1,...,D        (2a)
    x_i ≥ 0,                   i = 1,...,D        (2b)
    β_i ≥ 0,                   i = 1,...,D        (2c)
    x_i β_i = 0,               i = 1,...,D        (2d)
    Σ_{i=1}^D x_i = 1.                            (2e)

From the complementarity condition (2d), it is clear that if x_i > 0, we must have β_i = 0 and x_i = y_i + λ > 0; if x_i = 0, we must have β_i ≥ 0 and x_i = y_i + λ + β_i = 0, whence y_i + λ = −β_i ≤ 0. Obviously, the components of the optimal solution x that are zero correspond to the smaller components of y. Without loss of generality, we assume the components of y are sorted and x uses the same ordering, i.e.,

    y_1 ≥ ... ≥ y_ρ ≥ y_{ρ+1} ≥ ... ≥ y_D,
    x_1 ≥ ... ≥ x_ρ > x_{ρ+1} = ... = x_D,

and that x_1 ≥ ... ≥ x_ρ > 0, x_{ρ+1} = ... = x_D = 0. In other words, ρ is the number of positive components in the solution x. Now we apply the last condition (2e) and have

    1 = Σ_{i=1}^D x_i = Σ_{i=1}^ρ x_i = Σ_{i=1}^ρ (y_i + λ),

which gives λ = (1/ρ)(1 − Σ_{i=1}^ρ y_i). Hence ρ is the key to the solution. Once we know ρ (there are only D possible values of it), we can compute λ, and the optimal solution is obtained by just adding λ to each component of y and thresholding as in the end of Algorithm 1. It is easy to check that this solution indeed satisfies all KKT conditions.

In the algorithm, we carry out the tests for j = 1,...,D of whether t_j = y_j + (1/j)(1 − Σ_{i=1}^j y_i) > 0. We now prove that the number of times this test turns out positive is exactly ρ. The following theorem is essentially Lemma 3 of Shalev-Shwartz and Singer (2006).

Theorem 1. Let ρ be the number of positive components in the solution x, then

    ρ = max{1 ≤ j ≤ D:  y_j + (1/j)(1 − Σ_{i=1}^j y_i) > 0}.

Proof. Recall from the KKT conditions (2) that ρλ = 1 − Σ_{i=1}^ρ y_i, y_i + λ > 0 for i = 1,...,ρ, and y_i + λ ≤ 0 for i = ρ+1,...,D. In the sequel, we show that for j = 1,...,D, the test will continue to be positive until j = ρ and then stay non-positive afterwards, i.e., y_j + (1/j)(1 − Σ_{i=1}^j y_i) > 0 for j ≤ ρ, and y_j + (1/j)(1 − Σ_{i=1}^j y_i) ≤ 0 for j > ρ.

(i) For j = ρ, we have y_ρ + (1/ρ)(1 − Σ_{i=1}^ρ y_i) = y_ρ + λ = x_ρ > 0.

(ii) For j < ρ, we have

    y_j + (1/j)(1 − Σ_{i=1}^j y_i) = y_j + (1/j)(Σ_{i=1}^ρ y_i + ρλ − Σ_{i=1}^j y_i) = y_j + (1/j)(Σ_{i=j+1}^ρ y_i + ρλ)
                                   = y_j + (1/j)(Σ_{i=j+1}^ρ (y_i + λ) + jλ) = y_j + λ + (1/j) Σ_{i=j+1}^ρ (y_i + λ).

Since y_i + λ > 0 for i = 1,...,ρ, we have y_j + (1/j)(1 − Σ_{i=1}^j y_i) > 0.

(iii) For j > ρ, we have

    y_j + (1/j)(1 − Σ_{i=1}^j y_i) = y_j + (1/j)(Σ_{i=1}^ρ y_i + ρλ − Σ_{i=1}^j y_i) = y_j + (1/j)(ρλ − Σ_{i=ρ+1}^j y_i)
                                   = (1/j)(j y_j + ρλ − Σ_{i=ρ+1}^j y_i) = (1/j)(ρ(y_j + λ) + Σ_{i=ρ+1}^j (y_j − y_i)).

Notice y_j + λ ≤ 0 for j > ρ, and y_j ≤ y_i for i ≤ j since y is sorted, therefore y_j + (1/j)(1 − Σ_{i=1}^j y_i) ≤ 0.  ∎

Remarks.

1. We denote λ_j = (1/j)(1 − Σ_{i=1}^j y_i). At the j-th test, λ_j can be considered as a guess of the true λ (indeed, λ_ρ = λ). If we use this guess to compute a tentative solution x(λ_j), where x_i(λ_j) = max{y_i + λ_j, 0}, then it is easy to see that x_i(λ_j) > 0 for i = 1,...,j, and Σ_{i=1}^j x_i(λ_j) = 1. In other words, the first j components of x(λ_j) are positive and sum to 1. If we find x_{j+1}(λ_j) = 0, or y_{j+1} + λ_j ≤ 0, then we know we have found the optimal solution and j = ρ, because x(λ_j) satisfies all KKT conditions.

2. To extend the algorithm to a simplex with a different scale, i.e., Σ_{i=1}^D x_i = a for a > 0, replace the 1 − Σ_{i=1}^j u_i terms with a − Σ_{i=1}^j u_i in Algorithm 1.

4 Matlab code

The following vectorized Matlab code implements Algorithm 1. It projects each row vector in the N × D matrix Y onto the probability simplex in D dimensions.

function X = SimplexProj(Y)
  [N,D] = size(Y);
  X = sort(Y,2,'descend');                        % sort each row in decreasing order (u)
  Xtmp = (cumsum(X,2)-1)*diag(sparse(1./(1:D)));  % Xtmp(n,j) = (sum_{i<=j} u_i - 1)/j = -lambda_j
  X = max(bsxfun(@minus,Y,Xtmp(sub2ind([N,D],(1:N)',sum(X>Xtmp,2)))),0);  % pick -lambda at j = rho per row, then threshold
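As a quick usage sketch (ours, not part of the paper; the random input and the checks are illustrative), the function can be exercised as follows:

  Y = randn(1000,5);            % 1000 random 5-dimensional row vectors
  X = SimplexProj(Y);
  max(abs(sum(X,2)-1))          % each row sums to 1 (up to roundoff)
  min(X(:))                     % all entries are nonnegative

A single vector y is projected by passing it as a 1 × D row vector, i.e., x = SimplexProj(y).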

5 An application: Laplacian K-modes clustering

Consider the Laplacian K-modes clustering algorithm of Wang and Carreira-Perpiñán (2013), which is an extension of the K-modes algorithm of Carreira-Perpiñán and Wang (2013a). Given a dataset x_1,...,x_N ∈ R^D and suitably defined affinity values w_nm ≥ 0 between each pair of points x_n and x_m (for example, Gaussian), we define the objective function

    min_{Z,C}  (λ/2) Σ_{m=1}^N Σ_{n=1}^N w_mn ‖z_m − z_n‖² − Σ_{n=1}^N Σ_{k=1}^K z_nk G(‖(x_n − c_k)/σ‖²)    (3a)
    s.t.  Σ_{k=1}^K z_nk = 1,  for n = 1,...,N                                                                (3b)
          z_nk ≥ 0,  for n = 1,...,N, k = 1,...,K.                                                            (3c)

Z is a matrix of N × K, with Z^T = (z_1,...,z_N), where z_n = (z_n1,...,z_nK)^T are the soft assignments of point x_n to clusters 1,...,K, and c_1,...,c_K ∈ R^D are modes of the kernel density estimates defined for each cluster (where G(‖·‖²) gives a Gaussian). The problem of projection on the simplex appears in the training problem, i.e., in finding a local minimum (Z, C) of (3), and in the out-of-sample problem, i.e., in assigning a new point to the clusters.

Training  The optimization of (3) is done by alternating minimization over Z and C. For fixed C, the problem over Z is a quadratic program of NK variables:

    min_Z  λ tr(Z^T L Z) − tr(B^T Z)    (4a)
    s.t.   Z 1_K = 1_N                  (4b)
           Z ≥ 0,                       (4c)

where b_nk = G(‖(x_n − c_k)/σ‖²), n = 1,...,N, k = 1,...,K, and L is the graph Laplacian computed from the pairwise affinities w_mn. The problem is convex since L is positive semidefinite. One simple and quite efficient way to solve it is to use an accelerated gradient projection algorithm. The basic gradient projection algorithm iteratively takes a step in the negative gradient from the current point and projects the new point onto the constraint set. In our case, this projection is simple: it separates over each row of Z (i.e., the soft assignment of each data point, n = 1,...,N) and corresponds to a projection on the probability simplex of a K-dimensional row vector, i.e., problem (1).
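As an illustration, the basic (non-accelerated) gradient-projection scheme described above can be sketched in a few lines of Matlab, reusing SimplexProj from section 4. This is our own sketch, not the authors' code; the variables Z, L, B, lambda, the step size eta and the iteration count maxit are assumed to be given.

  % Basic gradient projection for problem (4) -- sketch under the assumptions above.
  for it = 1:maxit
    G = 2*lambda*(L*Z) - B;        % gradient of lambda*tr(Z'*L*Z) - tr(B'*Z), with L symmetric
    Z = SimplexProj(Z - eta*G);    % project each row of Z onto the simplex (problem (1))
  end

In practice the step size would be set from the Lipschitz constant of the gradient (e.g., eta ≤ 1/(2λ‖L‖₂)), or the accelerated variant mentioned above would be used.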

Out-of-sample mapping  Given a new, test point x ∈ R^D, we wish to assign it to the clusters found during training. We are given the affinity vectors w = (w_n) and g = (g_k), where w_n is the affinity between x and x_n, n = 1,...,N, and g_k = G(‖(x − c_k)/σ‖²), k = 1,...,K, respectively. A natural and efficient way to define an out-of-sample mapping z(x) is to solve a problem of the form (3) with a dataset consisting of the original training set augmented with x, but keeping Z and C fixed to the values obtained during training (this avoids having to solve for all points again). Hence, the only free parameter is the assignment vector z for the new point x. After dropping constant terms, the optimization problem (3) reduces to the following quadratic program over K variables:

    min_z  (1/2) ‖z − (z̄ + γ g)‖²    (5a)
    s.t.   z^T 1_K = 1                (5b)
           z ≥ 0,                     (5c)

where γ = 1/(2λ Σ_{n=1}^N w_n) and

    z̄ = Z^T (w / (1^T w)) = Σ_{n=1}^N (w_n / Σ_{n'=1}^N w_{n'}) z_n

is a weighted average of the training points' assignments, and so z̄ + γg is itself an average between this and the point-to-centroid affinities. Thus, the solution is the projection of the K-dimensional vector z̄ + γg onto the probability simplex, i.e., problem (1) again.

LASS  In general, the optimization problem of eq. (4) is called the Laplacian assignment model (LASS) by Carreira-Perpiñán and Wang (2013b). Here, we consider N items and K categories and want to learn a soft assignment z_nk of each item to each category, given an item-item graph Laplacian matrix L of N × N and an item-category similarity matrix B of N × K. The training is as in eq. (4) and the out-of-sample mapping as in eq. (5), and the problem of projection on the simplex appears in both cases.

References

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, U.K., 2004.

P. Brucker. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3(3):163–166, Aug. 1984.

M. Á. Carreira-Perpiñán and W. Wang. The K-modes algorithm for clustering. Unpublished manuscript, arXiv:1304.6478, Apr. 23 2013a.

M. Á. Carreira-Perpiñán and W. Wang. A simple assignment model with Laplacian smoothing. Unpublished manuscript, 2013b.

Y. Chen and X. Ye. Projection onto a simplex. Unpublished manuscript, arXiv:1101.6081, Feb. 10 2011.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In A. McCallum and S. Roweis, editors, Proc. of the 25th Int. Conf. Machine Learning (ICML'08), pages 272–279, Helsinki, Finland, July 5–9 2008.

C. Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. J. Optimization Theory and Applications, 50(1):195–200, July 1986.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer-Verlag, New York, second edition, 2006.

P. M. Pardalos and N. Kovoor. An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds. Math. Prog., 46(1–3):321–328, Jan. 1990.

S. Shalev-Shwartz and Y. Singer. Efficient learning of label ranking by soft projections onto polyhedra. J. Machine Learning Research, 7:1567–1599, 2006.

W. Wang and M. Á. Carreira-Perpiñán. Laplacian K-modes clustering. Unpublished manuscript, 2013.