Iteratively Re-weighted Least Squares for Sums of Convex Functions


James Burke (University of Washington), Jiashan Wang (LinkedIn), Frank Curtis (Lehigh University), Hao Wang (ShanghaiTech University), Daiwei He (University of Washington)

Outline

1. Classical Iterative Re-Weighting
2. Exact Penalization
3. The Iterative Re-Weighting Algorithm (IRWA)
4. Convergence
5. A Numerical Experiment: $\ell_1$-SVM
6. Least-Squares and Sums of Convex Functions
7. General Least-Squares Based Algorithms
8. General Methods Applied to $J(x, \varepsilon) = \sum_{i=1}^{l} \sqrt{\operatorname{dist}^2(A_i x + b_i \mid C_i) + \varepsilon_i^2}$
9. Comparison with IRWA

Classical Iterative Re-Weighting: A mainstay of statistical computing

$\ell_1$-Regression:
$$\min_{x \in \mathbb{R}^n} \|Ax + b\|_1 := \sum_{i=1}^{m} |A_i x + b_i|,$$
where $A_i$ is the $i$th row of $A$.

Iterative Least Squares Approach: Having $x^k$, approximate $\|Ax + b\|_1$ by
$$\|Ax + b\|_1 = \sum_{i=1}^{m} |A_i x + b_i| = \sum_{i=1}^{m} \frac{(A_i x + b_i)^2}{|A_i x + b_i|} \approx \sum_{i=1}^{m} \frac{(A_i x + b_i)^2}{|A_i x^k + b_i|}.$$
Solve the linear least squares problem in the approximation for $x^{k+1}$ and iterate.

Modified $\varepsilon$-approximate Weighted Least Squares:
$$\|Ax + b\|_1 \approx \sum_{i=1}^{m} w(x^k, \varepsilon_i^k)\,(A_i x + b_i)^2, \qquad \text{where } w(x^k, \varepsilon_i^k) := 1\Big/\sqrt{(A_i x^k + b_i)^2 + (\varepsilon_i^k)^2}.$$
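As a concrete illustration (not part of the original slides), the modified $\varepsilon$-approximate scheme can be written in a few lines of NumPy; the stopping rule and the simple halving of $\varepsilon$ are illustrative assumptions.

```python
import numpy as np

def irls_l1(A, b, eps0=1.0, max_iter=100, tol=1e-8):
    """Minimize ||Ax + b||_1 by iteratively re-weighted least squares.

    Each iteration solves the weighted least-squares problem
        min_x  sum_i w_i (A_i x + b_i)^2,
    with w_i = 1 / sqrt((A_i x^k + b_i)^2 + (eps_i^k)^2).
    """
    m, n = A.shape
    x = np.zeros(n)
    eps = np.full(m, eps0, dtype=float)
    for _ in range(max_iter):
        r = A @ x + b                               # residuals A_i x^k + b_i
        w = 1.0 / np.sqrt(r**2 + eps**2)            # re-weighting
        AW = A * w[:, None]
        # weighted normal equations (A^T W A) x = -A^T W b; assumes full column rank
        x_new = np.linalg.solve(A.T @ AW, -AW.T @ b)
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
        eps *= 0.5                                  # drive eps toward 0 (illustrative rule)
    return x
```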

Classical Iterative Re-Weighting: Lawson 1961; Rice and Usow 1968; Karlovitz 1970; Kahng 1972; Fletcher, Grant and Hebden 1971; Schlossmacher 1973; Beaton and Tukey 1974; Wolke and Schwetlick 1988; O'Leary 1990; Burrus and Barreto 1992; Vargas and Burrus 1999.

The Exact Penalty Subproblem
$$\min_{x \in X} J_0(x) := g^T x + \tfrac{1}{2} x^T H x + \sum_{i=1}^{l} \operatorname{dist}(A_i x + b_i \mid C_i),$$
where $g \in \mathbb{R}^n$, $H \in \mathbb{S}^n_+$, $A_i \in \mathbb{R}^{m_i \times n}$, $b_i \in \mathbb{R}^{m_i}$, and $C_i \subset \mathbb{R}^{m_i}$ are non-empty, closed, and convex, and
$$\operatorname{dist}(y_i \mid C_i) := \inf_{z_i \in C_i} \|y_i - z_i\|_2 = \|y_i - P_{C_i}(y_i)\|_2,$$
where $P_{C_i}$ is the projection onto $C_i$.

Examples:
- Equality: $C_i := \{0\}$
- Inequality: $C_i := (-\infty, 0]$
- Interval inequality: $C_i = [l_i, u_i]$
- $\ell_p$-ball: $\|y_i\|_p \le \tau_i$, $p = 1, 2, \infty$
- Trust region: $X = \{x : \|x\|_p \le \tau\}$, $p = 1, 2, \infty$
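All of the example sets above admit cheap projections, which is what the distance terms require in practice. The following sketch of these projections, and of $\operatorname{dist}(y \mid C) = \|y - P_C(y)\|_2$, is illustrative only; the function names are made up.

```python
import numpy as np

def proj_zero(y):                     # C = {0}
    return np.zeros_like(y)

def proj_nonpositive(y):              # C = (-inf, 0], componentwise
    return np.minimum(y, 0.0)

def proj_box(y, l, u):                # C = [l, u]
    return np.clip(y, l, u)

def proj_l2_ball(y, tau):             # C = {z : ||z||_2 <= tau}
    nrm = np.linalg.norm(y)
    return y if nrm <= tau else (tau / nrm) * y

def dist(y, proj, *args):
    """dist(y | C) = ||y - P_C(y)||_2 for any projection P_C."""
    return np.linalg.norm(y - proj(y, *args))

# Example: distance of y to the box [-1, 1]^2.
print(dist(np.array([2.0, -3.0]), proj_box, -1.0, 1.0))   # sqrt(1 + 4)
```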

NLP and Exact Penalties

Nonlinear Programming (NLP):
$$\text{minimize } f(x) \quad \text{subject to } F(x) \in C \text{ and } x \in X$$
- $f : \mathbb{R}^n \to \mathbb{R}$ and $F : \mathbb{R}^n \to \mathbb{R}^m$ smooth
- $X \subset \mathbb{R}^n$ and $C \subset \mathbb{R}^m$ non-empty, closed, and convex
- $C := C_1 \times C_2 \times \cdots \times C_l$, $F(x) := (F_1(x), F_2(x), \ldots, F_l(x))$, $F_i(x) \in C_i$, $F_i : \mathbb{R}^n \to \mathbb{R}^{m_i}$, $C_i \subset \mathbb{R}^{m_i}$

Exact penalty formulation: given $\alpha > 0$,
$$\min_{x \in X} f(x) + \alpha \sum_{i=1}^{l} \operatorname{dist}(F_i(x) \mid C_i).$$

Local direction-finding approximation:
$$\min_{x \in X} J_0(x) := g^T x + \tfrac{1}{2} x^T H x + \sum_{i=1}^{l} \operatorname{dist}(A_i x + b_i \mid C_i).$$

$\varepsilon$-approximate Re-Weighted Least Squares

Let $P_{C_i}(y) \in C_i$ be the projection of $y$ onto $C_i$, i.e. $\|y - P_{C_i}(y)\|_2 = \operatorname{dist}(y \mid C_i)$. At $(x^k, \varepsilon^k)$ approximate
$$\operatorname{dist}(A_i x + b_i \mid C_i) = \|A_i x + b_i - P_{C_i}(A_i x + b_i)\|_2$$
by
$$\operatorname{dist}(A_i x + b_i \mid C_i) \approx \frac{\operatorname{dist}^2(A_i x + b_i \mid C_i)}{\sqrt{\operatorname{dist}^2(A_i x^k + b_i \mid C_i) + (\varepsilon_i^k)^2}} = \frac{\|A_i x + b_i - P_{C_i}(A_i x + b_i)\|_2^2}{\sqrt{\operatorname{dist}^2(A_i x^k + b_i \mid C_i) + (\varepsilon_i^k)^2}} \approx \frac{\|A_i x + b_i - P_{C_i}(A_i x^k + b_i)\|_2^2}{\sqrt{\operatorname{dist}^2(A_i x^k + b_i \mid C_i) + (\varepsilon_i^k)^2}}.$$

The full approximation
$$J_0(x) = g^T x + \tfrac{1}{2} x^T H x + \sum_{i=1}^{l} \operatorname{dist}(A_i x + b_i \mid C_i) \approx g^T x + \tfrac{1}{2} x^T H x + \sum_{i=1}^{l} w_i(x^k, \varepsilon^k)\, \|A_i x + b_i - P_{C_i}(A_i x^k + b_i)\|_2^2,$$
where $w_i(x^k, \varepsilon^k) = 1\Big/\sqrt{\operatorname{dist}^2(A_i x^k + b_i \mid C_i) + (\varepsilon_i^k)^2}$.

For simplicity, we drop the term $g^T x + \tfrac{1}{2} x^T H x$ from future consideration. This can be done with (almost) no loss in generality.

The Iterative Re-Weighting Algorithm (IRWA)

IRWA: $x^{k+1} \in \arg\min_{x \in X} \sum_{i=1}^{l} w_i(x^k, \varepsilon^k)\, \|A_i x + b_i - P_{C_i}(A_i x^k + b_i)\|_2^2$ with $\varepsilon^k \downarrow 0$.

1. Initialize $x^0$, $\varepsilon^0$, $M, \nu > 0$, $\eta, \sigma, \sigma' \in (0, 1)$.
2. Solve the re-weighted subproblem for $x^{k+1}$.
3. Set $q_i^k := A_i(x^{k+1} - x^k)$. If
   $$\|q_i^k\|_2 \le M \left[ \operatorname{dist}^2(A_i x^k + b_i \mid C_i) + (\varepsilon_i^k)^2 \right]^{\frac{1}{2} + \nu}, \quad i = 1, \ldots, l,$$
   choose $\varepsilon^{k+1} \in (0, \eta \varepsilon^k]$; otherwise, set $\varepsilon^{k+1} := \varepsilon^k$.
4. If $\|x^{k+1} - x^k\|_2 \le \sigma$ and $\|\varepsilon^k\|_2 \le \sigma'$, stop; else set $k := k + 1$ and go to Step 2.
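A hedged NumPy sketch of the IRWA loop as stated above, for the case $X = \mathbb{R}^n$ with the quadratic term dropped. The default parameter values and the dense solve for the subproblem are illustrative choices, not the authors' implementation.

```python
import numpy as np

def irwa(A_blocks, b_blocks, proj_blocks, x0, eps0=1.0,
         M=1e2, nu=0.5, eta=0.7, sigma=1e-8, sigma_p=1e-8, max_iter=500):
    """IRWA sketch for min_x sum_i dist(A_i x + b_i | C_i) with X = R^n.

    A_blocks[i] is (m_i, n), b_blocks[i] is (m_i,), proj_blocks[i] is P_{C_i}.
    """
    x = x0.astype(float).copy()
    eps = np.full(len(A_blocks), eps0, dtype=float)
    for _ in range(max_iter):
        # distances and weights at (x^k, eps^k)
        r = [A @ x + b for A, b in zip(A_blocks, b_blocks)]
        p = [P(ri) for P, ri in zip(proj_blocks, r)]
        d = np.array([np.linalg.norm(ri - pi) for ri, pi in zip(r, p)])
        w = 1.0 / np.sqrt(d**2 + eps**2)
        # Step 2: weighted least-squares subproblem
        #   min_x sum_i w_i ||A_i x + b_i - P_{C_i}(A_i x^k + b_i)||^2
        lhs = sum(wi * A.T @ A for wi, A in zip(w, A_blocks))
        rhs = -sum(wi * A.T @ (b - pi)
                   for wi, A, b, pi in zip(w, A_blocks, b_blocks, p))
        x_new = np.linalg.solve(lhs, rhs)           # assumes ker(A) = {0}
        # Step 3: relaxation-parameter update
        q = np.array([np.linalg.norm(A @ (x_new - x)) for A in A_blocks])
        if np.all(q <= M * (d**2 + eps**2) ** (0.5 + nu)):
            eps = eta * eps
        # Step 4: stopping test
        if np.linalg.norm(x_new - x) <= sigma and np.linalg.norm(eps) <= sigma_p:
            return x_new
        x = x_new
    return x
```

A production version would replace the dense solve with CG applied to the weighted normal equations, which is how effort is measured in the experiment that follows.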

Convergence

Let $(x^0, \varepsilon^0) \in X \times \mathbb{R}^l_{++}$. Suppose that the sequence $\{(x^k, \varepsilon^k)\}_{k=0}^{\infty}$ is generated by IRWA with $\sigma = \sigma' = 0$. Let $S := \{ k \mid \varepsilon^{k+1} \le \eta \varepsilon^k \}$. If either $\ker(A) = \{0\}$ or $X = \mathbb{R}^n$, then any cluster point $\bar{x}$ of the subsequence $\{x^k\}_{k \in S}$ satisfies $0 \in \partial J_0(\bar{x}) + N(\bar{x} \mid X)$.

If it is assumed that $\ker(A) = \{0\}$, then $x^k \to \bar{x}$, the unique global solution, and in at most $O(1/\varepsilon^2)$ iterations $x^k$ is an $\varepsilon$-optimal solution, i.e. $J_0(x^k) \le \inf_{x \in X} J_0(x) + \varepsilon$.

Support Vector Machine Experiment

Consider the exact penalty form of the $\ell_1$-SVM problem:
$$\min_{\beta \in \mathbb{R}^n} \sum_{i=1}^{m} \Big( 1 - y_i \sum_{j=1}^{n} x_{ij} \beta_j \Big)_+ + \lambda \|\beta\|_1,$$
where $\{(x_i, y_i)\}_{i=1}^{m}$ are the training data points with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$ for each $i = 1, \ldots, m$, and $\lambda$ is the penalty parameter.

For purposes of numerical comparison, we randomly generate 40 problems, with data matrices $X$ ranging from roughly a hundred rows and 40 columns up to a few thousand rows and 500 columns, where the sparsity of the true solution is always a fixed percentage (on the order of 10–20%) of the number of columns. We compare the performance of IRWA with an ADMM implementation whose parameters are pre-optimized for performance on this data set. The least-squares subproblems for both methods are solved using the same CG solver. The effort is measured in terms of the number of CG solves.
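To see how this objective fits the $\sum_i \operatorname{dist}(A_i \beta + b_i \mid C_i)$ template, note that each hinge term $(1 - y_i x_i^T \beta)_+$ equals $\operatorname{dist}(1 - y_i x_i^T \beta \mid (-\infty, 0])$ and each penalty term $\lambda |\beta_j|$ equals $\operatorname{dist}(\lambda \beta_j \mid \{0\})$. The sketch below performs this reformulation; it is illustrative code, not the experimental setup used for the figures.

```python
import numpy as np

def l1_svm_blocks(X, y, lam):
    """Cast  sum_i (1 - y_i x_i^T beta)_+  +  lam * ||beta||_1
    into the form  sum_i dist(A_i beta + b_i | C_i).

    Hinge pieces use C_i = (-inf, 0]; regularization pieces use C_j = {0}.
    """
    m, n = X.shape
    A_blocks, b_blocks, proj_blocks = [], [], []
    for i in range(m):                         # one hinge piece per sample
        A_blocks.append(-(y[i] * X[i]).reshape(1, n))
        b_blocks.append(np.array([1.0]))
        proj_blocks.append(lambda z: np.minimum(z, 0.0))
    for j in range(n):                         # one l1 piece per coefficient
        e = np.zeros((1, n))
        e[0, j] = lam
        A_blocks.append(e)
        b_blocks.append(np.zeros(1))
        proj_blocks.append(lambda z: np.zeros_like(z))
    return A_blocks, b_blocks, proj_blocks
```

These blocks can then be fed to an IRWA-style solver such as the sketch given after the algorithm statement above.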

ADAL - IRWA Numerical Comparison: Number of CGs (with Nesterov acceleration)

[Figure: CG counts and final objective values for ADAL and IRWA, plotted per problem.]

Figure: In all 40 problems, IRWA obtains smaller objective function values with fewer CG steps.

ADAL - IRWA Comparison: Number of False Positives

[Figure: number of false positives (err) for ADAL and IRWA at zero-thresholds 1e-05, 1e-04, and 0.001.]

Figure: For both thresholds $10^{-4}$ and $10^{-5}$, IRWA yields fewer false positives in terms of the number of zero values computed; the numbers of false positives are similar at the threshold $10^{-3}$. At the threshold $10^{-5}$ the difference in recovery is dramatic, with IRWA always having only a handful of false positives while ADAL has a median on the order of a thousand.

Least-Squares and Sums of Convex Functions

Goals:
$$\min_{x} \varphi(x) := f(Ax + b) = \sum_{i=1}^{l} f_i(A_i x + b_i)$$
- Minimize $\varphi$ based on properties of the individual $f_i$ only, with minimal linkage between the $f_i$'s (splitting). For example, apply Prox to the individual $f_i$'s, but not to the function $\varphi$ as a whole.
- The $A_i$'s only enter through weighted least-squares:
$$\min_{x} \sum_{i=1}^{l} w_i(x, p)\, \|A_i x + b_i - s_i(p)\|_2^2,$$
  where $p \in E_0$ is a parameter vector and $s : E_0 \to W$.

The procedures we consider are iterative, with $x^c$ and $y^c := A x^c + b$ representing the current best approximate solutions, and $N(c)$ the number of iterations to obtain $x^c$.

Separate the Roles of the $f_i$'s and the $A_i$'s
$$\min_{x} \varphi(x) := f(Ax + b) = \sum_{i=1}^{l} f_i(A_i x + b_i)$$
$$\min_{y} \sum_{i=1}^{l} f_i(y_i) \quad \text{s.t.} \quad y \in b + \operatorname{Ran} A$$
$$\operatorname{Prox}_{\gamma_i, f_i}(y) := \operatorname*{argmin}_{z} \left[ f_i(z) + \frac{1}{2\gamma_i} \|z - y\|_2^2 \right]$$
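For a concrete instance of the Prox map just defined: when $f_i$ is the absolute value, the prox is the familiar soft-thresholding operator. A tiny illustrative sketch:

```python
import numpy as np

def prox_abs(y, gamma):
    """Prox of f(z) = |z| with parameter gamma: soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - gamma, 0.0)

# Prox_{1,|.|} shrinks each entry toward 0 by 1:
print(prox_abs(np.array([-3.0, 0.5, 2.0]), 1.0))   # [-2.  0.  1.]
```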

Least-Squares Based Algorithms

Projected Subgradient:
$$\min_{x \in X} \sum_{i=1}^{l} \|A_i(x - x^c) + t_i^c g_i^c\|_2^2, \qquad g_i^c \in \partial f_i(y_i^c), \quad t_i^c \equiv t^c := \frac{R}{L\sqrt{N(c)}}, \qquad O(1/\varepsilon^2) \text{ (no acc.)}$$

Projected Prox-Gradient:
$$\min_{x \in X} \sum_{i=1}^{l} \|A_i(x - x^c) + t_i^c (I - \operatorname{Prox}_{\gamma_i, f_i})(A_i x^c + b_i)\|_2^2, \qquad t_j^c := \frac{\gamma_j}{\sum_{i=1}^{l} \gamma_i}, \qquad O(1/\varepsilon) \text{ (acc.)}$$

ADAL:
$$\min_{x \in X} \sum_{i=1}^{l} \frac{1}{\gamma_i} \|A_i(x - x^c) + (I - \operatorname{Prox}_{\gamma_i, f_i})(A_i x^c + b_i + \gamma_i u_i^c)\|_2^2, \qquad O(1/\varepsilon) \text{ (acc.)}$$
$$u_i^+ := u_i^c + \frac{1}{\gamma_i}\big(A_i x^+ + b_i - \operatorname{Prox}_{\gamma_i, f_i}(A_i x^c + b_i + \gamma_i u_i^c)\big).$$
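A minimal sketch of one step of the projected prox-gradient template above, for $X = \mathbb{R}^n$ so that no projection onto $X$ is needed; the weight rule $t_j = \gamma_j / \sum_i \gamma_i$ follows the reconstruction above and should be treated as an assumption rather than the authors' exact choice.

```python
import numpy as np

def prox_gradient_step(A_blocks, b_blocks, prox_blocks, gammas, xc):
    """One step of the projected prox-gradient template with X = R^n:
        min_x sum_i ||A_i (x - xc) + t_i (I - Prox_{gamma_i, f_i})(A_i xc + b_i)||^2,
    with t_i = gamma_i / sum_j gamma_j.
    prox_blocks[i](y, gamma_i) evaluates Prox_{gamma_i, f_i}(y).
    """
    gammas = np.asarray(gammas, dtype=float)
    t = gammas / gammas.sum()
    lhs = sum(A.T @ A for A in A_blocks)            # assumes ker(A) = {0}
    rhs = np.zeros_like(xc, dtype=float)
    for A, b, prox, ti, gi in zip(A_blocks, b_blocks, prox_blocks, t, gammas):
        yc = A @ xc + b
        step = ti * (yc - prox(yc, gi))             # t_i (I - Prox)(A_i xc + b_i)
        rhs += A.T @ (A @ xc - step)
    return np.linalg.solve(lhs, rhs)
```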

General Methods Applied to $J(x, \varepsilon) = \sum_{i=1}^{l} \sqrt{\operatorname{dist}^2(A_i x + b_i \mid C_i) + \varepsilon_i^2}$

Each method solves subproblems built from terms of the form $\|A_i(x - x^c) + t_i^c (I - P_{C_i})(A_i x^c + b_i)\|_2^2$, with the weights $t_i^c$ chosen per method.

Projected Subgradient for $J(x, 0)$:
$$t_i^c := \begin{cases} \dfrac{\operatorname{dist}(x^0 \mid \Sigma)}{l \operatorname{dist}(A_i x^c + b_i \mid C_i) \sqrt{N(c)}}, & \text{if } A_i x^c + b_i \notin C_i, \\ 0, & \text{otherwise} \end{cases} \qquad (\Sigma \text{ the solution set})$$

Projected Gradient for $J(x, \varepsilon)$:
$$t_j^c := \frac{\varepsilon_{\min}}{\sqrt{\operatorname{dist}^2(A_j x^c + b_j \mid C_j) + \varepsilon_j^2}}$$

Projected Prox-Gradient for $J(x, 0)$:
$$t_j^c := \begin{cases} \dfrac{\gamma_j}{\sum_{i=1}^{l} \gamma_i}, & \operatorname{dist}(A_j x^c + b_j \mid C_j) \le \gamma_j, \\[1ex] \dfrac{\gamma_j^2}{\operatorname{dist}(A_j x^c + b_j \mid C_j) \sum_{i=1}^{l} \gamma_i}, & \operatorname{dist}(A_j x^c + b_j \mid C_j) > \gamma_j. \end{cases}$$

ADAL for $J(x, 0)$: same as projected prox-gradient but with $t_j^c := 1$, and we include the shifts $\gamma_i u_i^c$.

Comparison with IRWA

General Methods:
$$\sum_{i=1}^{l} \|A_i(x - x^k) + t_i^c (I - P_{C_i})(A_i x^k + b_i)\|_2^2 \qquad O(1/\varepsilon) \text{ (acc.)}$$

IRWA:
$$\sum_{i=1}^{l} \frac{\|A_i(x - x^k) + (I - P_{C_i})(A_i x^k + b_i)\|_2^2}{\sqrt{\operatorname{dist}^2(A_i x^k + b_i \mid C_i) + \varepsilon_i^2}} \qquad O(1/\varepsilon^2) \text{ (no acc.)}$$

New Improved IRWA:
$$\sum_{i=1}^{l} \frac{1}{\varepsilon_i} \left\| A_i(x - x^k) + \frac{\varepsilon_i}{\sqrt{\operatorname{dist}^2(A_i x^k + b_i \mid C_i) + \varepsilon_i^2}} (I - P_{C_i})(A_i x^k + b_i) \right\|_2^2 \qquad O(1/\varepsilon)\,!!\ \text{(acc.)}$$

Comparison of Projected Gradient and New IRWA

Projected Gradient for $J(x, \varepsilon)$:
$$\sum_{i=1}^{l} \left\| A_i(x - x^k) + \frac{\varepsilon_{\min}}{\sqrt{\operatorname{dist}^2(A_i x^k + b_i \mid C_i) + \varepsilon_i^2}} (I - P_{C_i})(A_i x^k + b_i) \right\|_2^2$$

New Improved IRWA:
$$\sum_{i=1}^{l} \frac{1}{\varepsilon_i} \left\| A_i(x - x^k) + \frac{\varepsilon_i}{\sqrt{\operatorname{dist}^2(A_i x^k + b_i \mid C_i) + \varepsilon_i^2}} (I - P_{C_i})(A_i x^k + b_i) \right\|_2^2$$

When all $\varepsilon_i$ have the same value $\varepsilon$, they are the same algorithm!

Unaccelerated versions with $\varepsilon$ fixed on the order of $10^{-2}$.

Figure: In all of the test problems, the old IRWA with iterative re-weighting obtains similar objective function values with fewer CG steps.

Accelerated versions with $\varepsilon$ fixed on the order of $10^{-2}$.

Figure: In all of the test problems, the old IRWA with iterative re-weighting obtains similar objective function values with many fewer CG steps.

Reference

Thank you!

"Iteratively Reweighted Linear Least Squares for Exact Penalty Subproblems on Product Sets," with F. Curtis, H. Wang and J. Wang. SIAM J. Optim. 25 (2015): 261-294.

The Euclidean Huber Distance to a Convex Set

Let $C \subset E$ be non-empty, closed, and convex, and let $h := \operatorname{dist}(\cdot \mid C)$ be the Euclidean distance to $C$. The Euclidean Huber distance to $C$ is just the Moreau-Yosida envelope of the distance to $C$:
$$e_{\gamma} h(y) = \begin{cases} \dfrac{1}{2\gamma} \operatorname{dist}^2(y \mid C), & \text{if } \operatorname{dist}(y \mid C) \le \gamma, \\[1ex] \operatorname{dist}(y \mid C) - \dfrac{\gamma}{2}, & \text{if } \operatorname{dist}(y \mid C) > \gamma, \end{cases}$$
$$\operatorname{Prox}_{\gamma, h}(y) = \begin{cases} P_C(y), & \text{if } \operatorname{dist}(y \mid C) \le \gamma, \\[1ex] y - \dfrac{\gamma}{\operatorname{dist}(y \mid C)}\,(y - P_C(y)), & \text{if } \operatorname{dist}(y \mid C) > \gamma. \end{cases}$$
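As a quick sanity check of the two closed forms above, one can compare the envelope formula with a brute-force minimization of $h(z) + \|z - y\|^2/(2\gamma)$; $C$ is taken to be an interval purely for illustration and all names below are made up.

```python
import numpy as np

def huber_dist_env(y, gamma, l=-1.0, u=1.0):
    """Closed-form Moreau envelope of h = dist(. | [l, u]) at a scalar y."""
    d = abs(y - np.clip(y, l, u))
    return d**2 / (2 * gamma) if d <= gamma else d - gamma / 2

def env_by_minimization(y, gamma, l=-1.0, u=1.0):
    """Brute force: e_gamma h(y) = min_z  dist(z | [l, u]) + (z - y)^2 / (2 gamma)."""
    z = np.linspace(y - 10.0, y + 10.0, 200001)
    vals = np.abs(z - np.clip(z, l, u)) + (z - y) ** 2 / (2 * gamma)
    return vals.min()

for y in (0.3, 1.4, 5.0):
    assert abs(huber_dist_env(y, 0.5) - env_by_minimization(y, 0.5)) < 1e-3
```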