Composite Self-concordant Minimization

1 Composite Self-concordant Minimization. Volkan Cevher, Laboratory for Information and Inference Systems (LIONS), École Polytechnique Fédérale de Lausanne (EPFL). Paris 6, December 11, 2013. Joint work with Quoc Tran-Dinh, Anastasios Kyrillidis, and Yen-Huan Li.

2 Composite minimization (P): f convex and smooth, g convex and possibly nonsmooth. Motivation: problem (P) covers many practical problems:
- unconstrained basic LASSO / logistic regression
- graphical model selection / latent variable graphical model selection
- Poisson imaging reconstruction / LASSO with unknown variance
- low-rank recovery / clustering
- atomic norm regularization / off-the-grid array processing

3 Moreover, all of these are LARGE-SCALE problems: we need scalable algorithms.
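For reference, the composite template referred to throughout is the standard one:

\min_{x \in \mathbb{R}^n} \; F(x) := f(x) + g(x),        (P)

with f convex and smooth and g convex, possibly nonsmooth, but proximally tractable.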

4 Composite minimization (P): modus operandi. With f convex and smooth and g convex and possibly nonsmooth, algorithms are organized by the classes of smooth functions f.

6 Composite minimization (P): modus operandi. Beyond the class of f, the key requirement is that the following prox computation is tractable: prox_g(v) := arg min_u { g(u) + (1/2) ||u - v||^2 }.

7 The same requirement, illustrated with a worked example of a tractable prox (see the sketch below).
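A minimal illustration of a tractable prox computation, assuming g(x) = rho*||x||_1, which is the most common choice in this setting:

```python
import numpy as np

def prox_l1(v, t):
    """Soft-thresholding: argmin_u  t*||u||_1 + 0.5*||u - v||^2,
    computed elementwise in closed form."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Shrink a small vector with threshold 0.5
print(prox_l1(np.array([1.5, -0.2, 0.7, -3.0]), 0.5))  # [ 1.  -0.   0.2 -2.5]
```

Many other regularizers used later in the talk (group norms, the nuclear norm, TV) have similarly cheap or well-studied prox operators.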

8 Composite minimization (P): modus operandi. With f convex and smooth (in the standard function classes) and g convex with a tractable prox, the setting is well understood: fast gradient schemes (Nesterov's methods) and Newton / quasi-Newton schemes apply.

9 Composite minimization (P): an uncharted region. For other classes of smooth convex f (still with g convex and prox-tractable), the scalability of fast gradient schemes (Nesterov's methods) and Newton / quasi-Newton schemes is NOT great.

10 Composite self-concordant minimization (P): f convex and self-concordant, g convex and possibly nonsmooth with a tractable prox. Self-concordance is the key structure behind the interior point method.
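For reference, the standard definition: a convex function f is self-concordant if, along every line, the restriction \varphi(t) = f(x + t\,d) satisfies

|\varphi'''(t)| \;\le\; 2\,\varphi''(t)^{3/2}.

Logarithmic barriers such as -\log x and -\log\det X satisfy this inequality, which is precisely why self-concordance is the key structure of interior-point methods.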

11 Example: the log-determinant function for linear matrix inequalities (LMIs). Application: graphical model selection; the optimization problem is given below.
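In standard notation, the graphical model selection (sparse inverse covariance) problem referred to here reads

\min_{\Theta \succ 0} \; -\log\det\Theta + \mathrm{tr}(S\Theta) + \rho\,\|\mathrm{vec}(\Theta)\|_1,

where S is the empirical covariance matrix and \rho > 0 a regularization parameter; the smooth part -\log\det\Theta + \mathrm{tr}(S\Theta) is self-concordant, and the weighted \ell_1 term plays the role of g.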

12 Further examples: log-barriers for linear/quadratic inequalities; Poisson imaging reconstruction via TV regularization; the basis pursuit denoising problem (BPDN) in barrier formulation; the LASSO problem with unknown variance.
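As one concrete instance (a standard formulation of the Poisson model; the slide's exact notation may differ), Poisson imaging reconstruction via TV regularization takes the form

\min_{x \ge 0} \; \sum_{i=1}^{m}\Big(a_i^{\top}x - y_i\log\big(a_i^{\top}x + b_i\big)\Big) \;+\; \rho\,\|x\|_{\mathrm{TV}},

where the a_i encode the measurement operator, y_i the photon counts, and b_i a background term; the negative Poisson log-likelihood has a log-barrier flavour in x, which is what places it in the self-concordant setting rather than the Lipschitz-gradient one.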

13 Composite self-concordant minimization (P): f convex and self-concordant, g convex and possibly nonsmooth with a tractable prox. Our contributions:
i) a variable metric (path-following) forward-backward framework;
ii) a convergence theory without the Lipschitz-gradient assumption;
iii) novel variants and extensions for several applications, plus the SCOPT software package.

14 Basic algorithmic framework

15 A basic composite minimization framework, built on three surrogate ingredients for f: a lower surrogate, an upper surrogate, and Hessian surrogates.

18 Instantiating the framework with the standard quadratic (Lipschitz-gradient) surrogates yields ISTA.

20 Acceleration of ISTA is possible, yielding FISTA; a sketch of both appears below.
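For concreteness, a minimal sketch of both baselines, assuming a known Lipschitz constant L of the gradient of f and a prox-tractable g (this is the classical setting, not yet the self-concordant one developed below):

```python
import numpy as np

def ista(grad_f, prox_g, x0, L, n_iter=200):
    """Proximal gradient (ISTA): x+ = prox_{g, 1/L}(x - grad_f(x)/L)."""
    x = x0.copy()
    for _ in range(n_iter):
        x = prox_g(x - grad_f(x) / L, 1.0 / L)
    return x

def fista(grad_f, prox_g, x0, L, n_iter=200):
    """Accelerated proximal gradient (FISTA) with the usual momentum rule."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iter):
        x_next = prox_g(y - grad_f(y) / L, 1.0 / L)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t = x_next, t_next
    return x
```

Here prox_g(v, gamma) returns the prox of gamma*g at v; with the soft-thresholding operator above, prox_g = lambda v, gamma: prox_l1(v, rho * gamma) recovers the classical l1-regularized solvers.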

21 To adapt or not to adapt? The Lipschitz constant L used above is a global, worst-case constant.

24 To adapt or not to adapt? The adaptive alternative is built on the variable metric proximal point operator.

26 A basic variable metric minimization framework: a proximal point scheme with variable metric [Bonnans, 1993], built on the variable metric proximal point operator; it introduces additional accuracy vs. computation trade-offs.
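In standard notation, the variable metric proximal point operator replaces the Euclidean ball in the prox by an ellipsoid induced by a positive definite matrix H:

\mathrm{prox}^{H}_{g}(v) := \arg\min_{u}\Big\{ g(u) + \tfrac{1}{2}\|u - v\|_{H}^{2} \Big\}, \qquad \|u - v\|_{H}^{2} := (u - v)^{\top} H (u - v).

Choosing H = L\,I recovers the classical prox-gradient step, while H = \nabla^2 f(x) (or a quasi-Newton approximation) yields the variable metric schemes below; the price is that the subproblem generally has no closed form, hence the accuracy vs. computation trade-off.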

27 Self-concordance vs. Lipschitz gradient + strong convexity (SC): comparing the lower surrogate, upper surrogate, and Hessian surrogates that each assumption provides.

28 Under the Lipschitz gradient + strong convexity assumption, these are global Hessian surrogates.

29 Under self-concordance, the surrogates are instead expressed through the local norm and the utility functions ω and ω*.

30 The corresponding Hessian surrogates are local: they hold in the local norm at the current iterate (the standard bounds are sketched below).
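A sketch of the standard bounds in the usual notation (the slides' exact statements are not reproduced here): define the local norm \|d\|_{x} := \big(d^{\top}\nabla^2 f(x)\,d\big)^{1/2} and the utility functions \omega(t) = t - \ln(1+t) and \omega^{*}(t) = -t - \ln(1-t). For self-concordant f,

f(x) + \nabla f(x)^{\top}(y - x) + \omega\big(\|y - x\|_{x}\big) \;\le\; f(y) \;\le\; f(x) + \nabla f(x)^{\top}(y - x) + \omega^{*}\big(\|y - x\|_{x}\big),

the upper bound holding for \|y - x\|_{x} < 1. Unlike the global quadratic bounds of the Lipschitz/strong-convexity setting, these surrogates bend with the Hessian at the current iterate, which is exactly what the variable metric framework exploits.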

31 Self-concordance: a mathematical tool. The lower, upper, and Hessian surrogates above support a new variable metric framework with rigorous convergence guarantees, covering several algorithms: Newton, quasi-Newton, and gradient methods.

32 Our composite self-concordant minimization framework: the proximal Newton scheme.

33 How do we compute the proximal Newton direction?

34 The direction is the solution of a composite subproblem, which can itself be computed with fast gradient schemes (Nesterov's methods) or Newton / quasi-Newton schemes.

35 Key contribution: an explicit step-size selection procedure for the proximal Newton scheme; the direction is still computed with fast gradient or Newton / quasi-Newton schemes on the subproblem.
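A minimal sketch of a damped proximal Newton iteration of this type (illustrative only: the damping rule alpha = 1/(1 + lambda) is the classical self-concordant choice and the subproblem solver scaled_prox is left abstract; this is not claimed to be the slides' exact procedure):

```python
import numpy as np

def proximal_newton(grad_f, hess_f, scaled_prox, x0, n_iter=50, tol=1e-8):
    """Damped proximal Newton for F = f + g with self-concordant f.

    scaled_prox(v, H) must return argmin_u g(u) + 0.5*(u - v)^T H (u - v),
    i.e. the variable metric prox of g; in large-scale problems it is
    itself solved inexactly by a fast first-order method."""
    x = x0.copy()
    for _ in range(n_iter):
        H = hess_f(x)
        # Proximal Newton direction from the local composite model.
        s = scaled_prox(x - np.linalg.solve(H, grad_f(x)), H)
        d = s - x
        # Proximal Newton decrement: length of d in the local norm.
        lam = float(np.sqrt(d @ (H @ d)))
        if lam < tol:
            break
        # Damped step far from the solution, full step once lam is small
        # (0.2 is an illustrative threshold for the quadratic region).
        alpha = 1.0 if lam < 0.2 else 1.0 / (1.0 + lam)
        x = x + alpha * d
    return x
```

The point of the explicit rule is that no backtracking on F is needed: the step size comes for free from the decrement lam.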

36 How do we compute the step size? The derivation combines the upper surrogate of f with the convexity of g and the optimality condition of the subproblem.

38 Evaluating these bounds at the chosen step size yields an explicit, guaranteed decrease per iteration.
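A typical form of the resulting guarantee (stated here in its standard self-concordant form, not necessarily with the slides' exact constants): with \lambda_k the proximal Newton decrement and the damped step \alpha_k = 1/(1 + \lambda_k),

F(x^{k+1}) \;\le\; F(x^{k}) - \omega(\lambda_k), \qquad \omega(t) = t - \ln(1 + t),

so every damped step decreases the objective by an explicitly computable amount, without any Lipschitz constant or backtracking line search.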

39 Analytic complexity: the worst-case complexity to obtain an ε-approximate solution.

40 The analysis splits into a global convergence phase and a local convergence phase, and the quadratic convergence region can be calculated explicitly.

41 In addition, line search can accelerate the convergence: the framework admits a line-search enhancement on top of the explicit step size.

42 Application: graphical model selection. The objective is the sparse inverse covariance (log-determinant) problem stated earlier.

43 Its gradient and Hessian are large-scale but carry special structure.

44 How do we compute the proximal Newton direction for this objective?

45 A dual approach for solving the subproblem (SP): the primal subproblem is passed to its dual (SPGL), which takes the form of an unconstrained LASSO problem.

46 The dual approach requires no Cholesky decomposition and no matrix inversion.

47 Remaining question: how do we compute the proximal Newton decrement? The log-determinant structure can be exploited, as sketched below.
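For reference, the structure being exploited: for f(\Theta) = -\log\det\Theta + \mathrm{tr}(S\Theta),

\nabla f(\Theta) = S - \Theta^{-1}, \qquad \nabla^2 f(\Theta)[D] = \Theta^{-1} D\,\Theta^{-1},

so the local norm of a direction D, and hence the proximal Newton decrement, reduces to a trace of matrix products, \|D\|_{\Theta}^{2} = \mathrm{tr}\big(\Theta^{-1} D\,\Theta^{-1} D\big); how these quantities are obtained without Cholesky factorizations or explicit inversions is exactly the role of the dual subproblem approach above.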

48 Graphical model selection: numerical examples. Our method vs. QUIC [Hsieh et al., 2011]: the QUIC subproblem solver is a specialized block-coordinate descent, whereas our subproblem solver uses general proximal algorithms. Convergence behaviour [rho = 0.5]: Lymph [p = 587] (left), Leukemia [p = 1255] (right). Step-size selection strategies: Arabidopsis [p = 834], Leukemia [p = 1255], Hereditary [p = 1869]. *Details: A proximal Newton framework for composite minimization: Graph learning without Cholesky decompositions and matrix inversions, ICML 2013, and lions.epfl.ch/publications.

49 Result: on average a 5x acceleration (up to 15x) over the Matlab implementation of QUIC.

50 Composite minimization (P): alternatives? Existing numerical approaches:
- splitting methods
- forward-backward: applicable if f has a Lipschitz gradient
- Douglas-Rachford decomposition: requires tractable proximity operators for both f and g
- augmented Lagrangian methods (e.g., Douglas-Rachford again)
The catch: the prox operator of a self-concordant f is costly!

51 Our cheaper variable metric strategies (no Lipschitz assumption):
- a proximal gradient scheme* and how to compute its search direction
- a new predictor-corrector scheme (with local linear convergence*)
- a proximal quasi-Newton scheme (BFGS updates)*
*Details: Composite Self-Concordant Minimization, arXiv and lions.epfl.ch/publications.

52 New theory: local linear convergence of the proximal gradient (PG) method. Illustration: graph learning on the Lymph dataset [p = 587].

53 The comparison also includes a "lucky" BB (Barzilai-Borwein) step-size run.

55 The theory is based on the notion of restricted strong convexity.

56 A first analysis makes the convergence rate depend on the full condition number.

58 Refining the analysis with restricted strong convexity, the rate depends only on the restricted condition number.

60 A further illustration: heteroscedastic LASSO [rho decreases from left to right].

61 Proximal gradient scheme: new engineering. A greedy enhancement of the prox step.

63 By itself, however, the greedy enhancement slows down convergence!

64 The remedy is a simple decision rule for when to apply it.

65 The cost of this rule is practically none if implemented carefully.

66 Poisson imaging reconstruction via TV regularization: our method vs. SPIRAL-TAP [Harmany et al., 2012].

67 Result: on average a 10x acceleration (up to 250x) over SPIRAL-TAP, with better accuracy.

68 Barrier extensions

69 Constrained convex problems

70 Constrained convex problems: examples.

71 Constrained convex problems. Main idea: solve a sequence of composite self-concordant problems (a generic barrier form is sketched below).
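In generic form (a standard barrier reformulation, assuming a self-concordant barrier \varphi for the feasible set \Omega; the slides' exact parametrization may differ), the constrained problem and the sequence of composite subproblems are

\min_{x \in \Omega} g(x) \quad\longrightarrow\quad x^{*}(t) := \arg\min_{x}\big\{ g(x) + t\,\varphi(x) \big\}, \qquad t \downarrow 0,

so each subproblem has exactly the composite form (P) with the scaled barrier t\,\varphi as its smooth part, and the path-following scheme tracks x^{*}(t) while the penalty parameter t is decreased.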

72 How does it work? Main idea: solve a sequence of composite self-concordant problems, as opposed to DCO. One iteration k requires two simultaneous updates:
- update the penalty parameter;
- update the iterate (the subproblem can be solved approximately).
*Details: An Inexact Proximal Path-Following Algorithm for Constrained Convex Minimization, optimization-online and lions.epfl.ch/publications.

73 How does it converge? The algorithm consists of two phases:
- Phase I: find a starting point satisfying the required condition.
- Phase II: perform the path-following iterations.
Tracking properties in Phase II: the penalty parameter decreases by at least a fixed factor at each iteration k, and the tracking error of the iterate sequence remains controlled.
Worst-case complexity: Phase I needs at most a bounded number of iterations to produce a starting point for Phase II, and the Phase II worst-case complexity to reach an ε-solution matches that of standard path-following methods [see Nesterov, 2004].

74 Proximal path-following, upshot: no heavy lifting! Proximal path-following handles conic programming with rigorous guarantees. Example: low-rank SDP matrix approximation, where the regularization parameter lies in (0, 1) and M is the given input matrix; the example is reported in terms of #variables and #constraints.

75 Proximal path-following, upshot: the desired scaling! Example: max-norm clustering, compared against a DCO approach based on splitting.

76 Proximal path-following, upshot: automatic regularization! Main idea, as before: solve a sequence of composite self-concordant problems, as opposed to DCO. Examples: max-norm clustering and graph selection.

77 Conclusions

78 Conclusions: a new variable metric proximal-point framework for composite self-concordant minimization, plus extensions.

79 Conclusions. Highlights:
- Globalization: a new strategy for choosing the step size explicitly; the strategy motivates a forward-looking line search.
- Search direction: efficient to compute (a strongly convex program).
- Local convergence: quadratic convergence without boundedness of the Hessian, with an analytic quadratic convergence region.

80 Practical contributions (this talk): the SCOPT package offers quasi-Newton and first- and second-order variants, leverages fast proximal solvers for g(x) (structured norms, etc.), and is robust to the subproblem solver accuracy. SCOPT FTW.
