Conditional gradient algorithms for machine learning

Zaid Harchaoui (LEAR-LJK, INRIA Grenoble), Anatoli Juditsky (LJK, Université de Grenoble), Arkadi Nemirovski (Georgia Tech)

Abstract

We consider penalized formulations of machine learning problems whose regularization penalty has a conic structure. For several important learning problems, state-of-the-art optimization approaches such as proximal gradient algorithms are difficult to apply and computationally expensive, which prevents their use for large-scale learning. We present a conditional gradient algorithm with theoretical guarantees and show promising experimental results on two large-scale real-world datasets.

1 Introduction

We consider statistical learning problems that can be framed as convex optimization problems of the form

    min_{x ∈ K}  f(x) + λ‖x‖,

where f is a convex function with Lipschitz-continuous gradient and K is a closed convex cone of a Euclidean space. This encompasses supervised classification, regression, ranking, etc., with various choices of regularization penalty, including the usual ℓ_p-norms as well as more structured penalties such as the group-lasso penalty or the trace norm [20].

A wealth of convex optimization algorithms has been proposed to tackle such problems; see [20] for a recent overview. Among them, the celebrated Nesterov optimal gradient methods for smooth and composite minimization [15, 16], and their stochastic-approximation counterparts [14], are now state-of-the-art in machine learning. These algorithms enjoy the best possible theoretical complexity estimates. For instance, Nesterov's algorithm reaches a solution with accuracy ε in O(D √(L/ε)) iterations, where L is the Lipschitz constant of the gradient of the smooth component of the objective and D is the problem-domain parameter measured in the norm ‖·‖. However, these algorithms solve at each iteration a sub-problem involving the minimization of the sum of a linear form and a distance-generating function over the problem domain. Most often, exactly solving this sub-problem is computationally cheap [3], which explains the popularity of this family of algorithms over the last decade. Yet in high-dimensional problems such as multi-task learning with many tasks and features, solving this sub-problem becomes computationally challenging, for instance when trace-norm regularization is used [13, 17].

These limitations recently motivated alternative approaches that do not require solving hard sub-problems at each iteration, and triggered a renewed interest in conditional gradient algorithms. The conditional gradient algorithm [8, 6] (a.k.a. the Frank-Wolfe algorithm) minimizes a linear form over the problem domain at each iteration, a simpler sub-problem than the one involved in proximal gradient algorithms; see [12, 18, 19, 7, 11] for recent works. However, the conditional gradient algorithm only handles constrained formulations of learning problems, whereas penalized formulations are more widespread in applications. We propose a new algorithm suitable for penalized formulations, reminiscent of the conditional gradient algorithm, with theoretical guarantees.
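To make the difference in per-iteration cost concrete, the following sketch (illustrative only, not taken from the paper) contrasts the two sub-problems for trace-norm regularization: the proximal step requires a full SVD, whereas the linear minimization oracle used by conditional-gradient-type methods only needs the leading singular pair, computed here with scipy.sparse.linalg.svds.

```python
# Illustrative sketch, not from the paper: per-iteration sub-problems for
# trace-norm regularization.  The proximal step needs a full SVD, while the
# conditional-gradient linearization step only needs the leading singular pair.
import numpy as np
from scipy.sparse.linalg import svds

def trace_norm_prox(X, tau):
    """Proximal operator of tau * ||X||_*: full SVD plus soft-thresholding."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def trace_norm_lmo(G):
    """argmin of <G, S> over the nuclear-norm unit ball: a rank-one matrix."""
    u, s, vt = svds(G, k=1)            # leading singular pair only
    return -np.outer(u[:, 0], vt[0])

if __name__ == "__main__":
    G = np.random.randn(200, 300)      # stands in for a gradient matrix
    print(trace_norm_prox(G, 1.0).shape, trace_norm_lmo(G).shape)
```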

2 Penalized learning formulation

We first review the three main types of formulations of machine learning problems, then detail our assumptions, before presenting our approach.

Learning formulations. Depending on the task at hand, some formulations may be preferable to others for posing the learning problem. The three main formulations of machine learning problems are summarized in Figure 1, where f(x) plays the role of the empirical risk and ‖x‖ the role of the regularization penalty/constraint (see [10] for more details).

Figure 1: Three main formulations of machine learning optimization problems: constrained formulation I (minimize f(x) subject to ‖x‖ ≤ ρ), the penalized formulation (minimize f(x) + λ‖x‖), and constrained formulation II (minimize ‖x‖ subject to f(x) ≤ δ).

Common wisdom recommends the formulation into which prior knowledge can most easily be incorporated. When prior knowledge of the noise level is available, constrained formulation II is more convenient, as one can directly encode this knowledge in the constraint. Similarly, when prior knowledge or pilot estimates of the learning parameter (weight vector/matrix) are available, constrained formulation I is more appropriate. Finally, when no prior knowledge is available, it is common to consider the penalized formulation as the most appropriate option.

Penalized formulation. We focus on the penalized formulation of learning problems in this paper, that is,

    min_{x ∈ K}  f(x) + λ‖x‖        (2.1)

where K ⊆ E is a closed convex cone of a Euclidean space E. Without loss of generality, we may assume that K linearly spans E. We assume that ‖·‖ is a norm on E and that f is a convex function with Lipschitz-continuous gradient, that is,

    ‖f′(x) − f′(y)‖_* ≤ L_f ‖x − y‖   for all x, y ∈ K,

where ‖·‖_* denotes the norm dual to ‖·‖. Using the fact that ‖x‖ is also the gauge (Minkowski functional) γ_B(x) = inf{r ≥ 0 : x ∈ rB} of the unit ball B = {x ∈ E : ‖x‖ ≤ 1} [7], we rewrite the problem as

    min_{[x; r]}  F([x; r])   subject to   [x; r] ∈ K^+,        (2.2)

where F([x; r]) and K^+ are respectively defined as

    F([x; r]) := f(x) + λr   and   K^+ := {[x; r] : x ∈ K, ‖x‖ ≤ r}.

We use the shorthand notations x = x[x; r] and r = r[x; r] to retrieve, respectively, the x-part and the r-part of [x; r]. Note that the x-part x[x_ε; r_ε] of an ε-solution to problem (2.2) is also an ε-solution to problem (2.1).

3 Conditional-gradient-type algorithm for penalized learning formulations

We now describe the proposed algorithm for solving problem (2.2), which we simply rewrite, using z := [x; r], as

    min_z  F(z)   subject to   z ∈ K^+.        (3.3)

Although different, the proposed method is closely related to the regular conditional gradient algorithm: it only requires a kind of linearization oracle in addition to objective and gradient calls.
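Before describing that oracle, here is a small hedged Python sketch of the lifted representation used in (2.2); the names f, lam and norm are placeholders, not from the paper, and K = E is taken for simplicity. Reading off the x-part of an ε-solution of (2.2) then gives an ε-solution of (2.1).

```python
# Minimal sketch of the lifted variable z = [x; r] in problem (2.2); f, lam and
# norm are placeholder names (assumptions), and K = E for simplicity.
import numpy as np

def make_lifted_objective(f, lam):
    """Return F(z) = f(x) + lam * r for z = (x, r)."""
    def F(z):
        x, r = z
        return f(x) + lam * r
    return F

def in_K_plus(z, norm, tol=1e-12):
    """Membership test for K+ = {[x; r] : x in K, ||x|| <= r} (here K = E)."""
    x, r = z
    return norm(x) <= r + tol

if __name__ == "__main__":
    f = lambda x: 0.5 * float(np.sum(x ** 2))        # toy smooth loss
    F = make_lifted_objective(f, lam=0.1)
    x = np.ones(5)
    z = (x, np.linalg.norm(x, 1))                    # r = ||x||_1: boundary of K+
    print(F(z), in_K_plus(z, lambda v: np.linalg.norm(v, 1)))
```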

Minimization oracle. The way the proposed algorithm proceeds is reminiscent of the regular conditional gradient algorithm. Note, however, that the proposed algorithm is not an instance of the traditional conditional gradient algorithm. The conditional gradient algorithm was designed exclusively for constrained formulations of type I (see Figure 1) and only requires a so-called linearization oracle in addition to objective and gradient evaluations: in the regular conditional gradient algorithm, the linearization oracle computes min_{x ∈ X} ⟨f′(x), x⟩, where X is a closed convex set. Similarly, in addition to objective and gradient calls¹, the proposed algorithm makes calls to a minimization oracle, which writes as

    x(g) := argmin_{x ∈ K_1} ⟨g, x⟩,        (3.4)

where K_1 := {x ∈ K : ‖x‖ ≤ 1} and g ∈ E. We sacrifice nothing by assuming that, for every g, x(g) is either zero or a vector of ‖·‖-norm equal to 1. Ensuring this requires no additional computation: it suffices to inspect ⟨g, x(g)⟩. If this inner product is zero, we can reset x(g) = 0; otherwise it is negative, and the minimizer can be taken of unit ‖·‖-norm, since scaling x(g) up within K_1 only decreases ⟨g, x(g)⟩.

Algorithm. Let z_1, ..., z_t, ... be the sequence of iterates of the proposed algorithm. At each iteration, we solve a simple two-dimensional sub-problem with bound constraints in order to find the next iterate. Solving the sub-problem only requires objective and gradient calls, the minimization oracle (3.4), and an a priori bound D^+ on ‖x‖. The bound D^+ plays only an instrumental role in the algorithm and can be set to a quite loose value. Indeed, as the theoretical guarantee shows, the rate of convergence is independent of D^+; it depends instead on the true bound D induced by the problem, which is unknown and not required by the algorithm. The (t+1)-th iterate z_{t+1} is given as the solution of the sub-problem

    z_{t+1} ∈ Argmin_z { F(z) : z ∈ Conv{0, z_t, D^+ ẑ_t} },        (3.5)

where ẑ_t := [x(f′(x(z_t))); 1] is obtained by a call to the minimization oracle. The algorithm proceeds by solving a sequence of such sub-problems.

Description of the sub-problem. We now give more details on the sub-problem and explain the geometric intuition behind it. Define K^+[ρ] := {[x; ρ] ∈ K^+ : ‖x‖ ≤ ρ}, and consider the linear form (ξ, ρ) ↦ ⟨f′(x), ξ⟩ + λρ. For any ρ, the minimum of this linear form over K^+[ρ] is attained at [ρ x(f′(x)); ρ]. Therefore, as ρ varies, the minimizers of the linear form span a ray. This is where the loose a priori bound D^+ comes in: consider the segment of this ray corresponding to the minimizers for 0 ≤ ρ ≤ D^+,

    S(z) = { ρ [x(f′(x)); 1] : 0 ≤ ρ ≤ D^+ }.

For any z, the segment S(z) can be identified by a single call to the first-order oracle (objective and gradient evaluation) and a single call to the minimization oracle (3.4) with g = f′(x) for (K, ‖·‖). Minimizing F(·) on the convex hull of S(z_t) and z_t corresponds to minimizing F(·) on the convex hull of the origin and the points z_t and D^+ ẑ_t, with ẑ_t := [x(f′(x(z_t))); 1]. Getting back to (3.5), z_{t+1} = [x_{t+1}; r_{t+1}] is given by

    z_{t+1} = α_{t+1} z_t + β_{t+1} D^+ ẑ_t,        (3.6)

    (α_{t+1}, β_{t+1}) ∈ argmin_{α, β} { F(α z_t + β D^+ ẑ_t) : α + β ≤ 1, α ≥ 0, β ≥ 0 }.        (3.7)

To solve the two-dimensional sub-problem one can use, for instance, the ellipsoid method, or any optimization algorithm using only first-order information.

Proposition 3.1. Let {z_t} be the sequence generated by recursion (3.5) with z_1 = 0. Let D < ∞ be such that f(x) + λr ≤ f(0) and ‖x‖ ≤ r imply r ≤ D. Then the algorithm based on recursion (3.5) is a descent algorithm.
In addition, for all t = 2, 3, ...,

    F(z_t) − F_* ≤ 8 L_f D² / (t + 1),

where F_* denotes the optimal value of (3.3).

¹ A close inspection of the proofs of the results below reveals that the discussed algorithms are robust with respect to errors in the observation of f′(x): a bound ν on the error in f′(x) results in an extra term O(1)ν in the corresponding accuracy bounds. However, to streamline the presentation, we do not discuss this modification here.
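The recursion (3.5)-(3.7) can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the authors' implementation: F and grad_f are the composite objective and the gradient of its smooth part, lmo implements the minimization oracle (3.4), and D_plus is the loose a priori bound D^+. The two-dimensional sub-problem is solved here by a crude grid search over the triangle {α + β ≤ 1, α ≥ 0, β ≥ 0}; in practice one would use the ellipsoid method or any first-order scheme, as suggested above.

```python
# Hedged sketch of recursion (3.5)-(3.7); not the authors' code.
# F((x, r))  : composite objective F([x; r]) = f(x) + lambda * r
# grad_f(x)  : gradient of the smooth part f
# lmo(g)     : minimization oracle (3.4); returns a zero or unit-norm point of K_1
# D_plus     : loose a priori bound D^+ on ||x||
import numpy as np

def conditional_gradient_penalized(F, grad_f, lmo, x0, D_plus,
                                   n_iters=100, grid=101):
    x, r = x0, 0.0                        # z_1 = 0 (take x0 = 0)
    ts = np.linspace(0.0, 1.0, grid)
    for _ in range(n_iters):
        x_hat = lmo(grad_f(x))            # x-part of \hat z_t = [x(f'(x_t)); 1]
        best_val, best_z = F((x, r)), (x, r)
        # Crude solver for the 2-D sub-problem over Conv{0, z_t, D^+ \hat z_t}.
        for a in ts:
            for b in ts:
                if a + b > 1.0:
                    continue
                cand = (a * x + b * D_plus * x_hat, a * r + b * D_plus)
                val = F(cand)
                if val < best_val:
                    best_val, best_z = val, cand
        x, r = best_z
    return x, r
```

For the trace-norm setting, lmo can be instantiated with the leading-singular-pair computation from the earlier sketch, and F is the loss plus λ times the r-component.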

Table 1: Performance of the accelerated proximal gradient algorithm (Prox-Grad) and of the proposed algorithm (ours). Performance is measured in RMSE on the MovieLens10M dataset and in misclassification error on the ILSVRC2010 ImageNet dataset. Time is measured in seconds on MovieLens10M and in hours on ILSVRC2010 ImageNet.

    Dataset        Algo.        Time         Training error   Test error
    MovieLens10M   Prox-Grad    42.37 s      –                –
    MovieLens10M   ours         19.15 s      –                –
    ImageNet       Prox-Grad    – hrs        –                –
    ImageNet       ours         47.94 hrs    –                –

The result holds when only a loose instrumental upper bound D^+ is available; the algorithm does not require knowledge of the upper bound D (which always exists but is unknown to the algorithm), on which the convergence rate relies (see [10] for details).

Acceleration. In place of sub-problem (3.5), one can also leverage the information gathered in previous iterations and optimize over a convex hull built from previous iterates, an acceleration similar to the so-called restricted simplicial decomposition in the conditional gradient literature [4]. Let M ∈ N_+; then sub-problem (3.5) can be replaced by

    z_{t+1} ∈ Argmin_z { F(z) : z ∈ C_t },

where

    C_t = Conv{0; D^+ ẑ_0, ..., D^+ ẑ_t}                                   for t ≤ M,
    C_t = Conv{0; z_{t−M+1}, ..., z_t; D^+ ẑ_{t−M+1}, ..., D^+ ẑ_t}        for t > M.        (3.8)

Then the results of Proposition 3.1 still hold for {z_t} defined by (3.8).

4 Experiments

We conducted experiments on two examples of penalized learning formulations: matrix completion with Gaussian noise and a trace-norm regularization penalty, and multi-class classification with the logistic loss and a trace-norm regularization penalty [1, 2, 9]. We worked on the two corresponding datasets: i) MovieLens10M and ii) ImageNet. We compared the proposed algorithm to the accelerated proximal gradient algorithm, denoted Prox-Grad (cf. Table 1).

MovieLens10M. We run the proposed algorithm on the MovieLens10M dataset, which comprises 10^7 ratings of movies by users. As in [12], we use half of the sample for training. Note that, because of the inherent sparsity of the problem, the matrix-vector computations can be handled very efficiently. Yet, experiments on this dataset are interesting, as they show that our approach is competitive with the state-of-the-art proximal gradient algorithm.

ImageNet. We test the algorithm on an image categorization application. We consider the Pascal ILSVRC2010 ImageNet dataset and focus on the Vertebrate-craniate subset, yielding 2943 classes with an average of 1000 examples each. We compute 65,000-dimensional visual descriptors for each example using the Fisher vector representation [9], a state-of-the-art feature representation in image categorization. Note that this representation yields dense feature vectors. In order to leave enough RAM for the matrix-vector computations at each iteration, we set the size of the mini-batches to b = . Note that the raw batch proximal gradient algorithm is inappropriate for such a large dataset, and the straightforward stochastic gradient version of the proximal algorithm is out of scope, as it would involve an SVD per example. Thus we use as a contender the mini-batch version of the proximal gradient algorithm [5], and we use the same reasoning to set its mini-batch size. Results in Table 1 show that the proposed algorithm has an edge over the mini-batch proximal gradient algorithm and reaches a similar value of the objective significantly faster.
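The efficiency of the matrix-vector computations mentioned for MovieLens10M can be illustrated with a hypothetical sketch (not the authors' implementation): for matrix completion with squared loss on the observed entries, the gradient is itself a sparse matrix supported on those entries, so products with the thin factors of a low-rank iterate cost O(nnz · k) rather than O(m · n · k).

```python
# Hypothetical sketch, not the authors' implementation: the gradient of the
# squared loss on observed entries is sparse, so sparse-dense products with the
# low-rank factors U (m x k) and V (n x k) of the current iterate are cheap.
import numpy as np
import scipy.sparse as sp

def completion_gradient(rows, cols, vals, U, V):
    """Sparse gradient of 0.5 * sum_{(i,j) observed} (X_ij - M_ij)^2 at X = U V^T."""
    preds = np.einsum("ij,ij->i", U[rows], V[cols])    # X_ij on observed entries only
    m, n = U.shape[0], V.shape[0]
    return sp.coo_matrix((preds - vals, (rows, cols)), shape=(m, n)).tocsr()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, k, nnz = 1000, 800, 10, 5000
    rows, cols = rng.integers(0, m, nnz), rng.integers(0, n, nnz)
    vals = rng.standard_normal(nnz)
    U, V = rng.standard_normal((m, k)), rng.standard_normal((n, k))
    G = completion_gradient(rows, cols, vals, U, V)
    print(G.shape, G.nnz, (G @ V).shape)               # sparse-dense product is cheap
```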

References

[1] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In ICML.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3).
[3] F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning.
[4] D. Bertsekas. Nonlinear Programming. Athena Scientific.
[5] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In NIPS.
[6] V. Demyanov and A. Rubinov. Approximate Methods in Optimization Problems. American Elsevier.
[7] M. Dudik, Z. Harchaoui, and J. Malick. Lifted coordinate descent for learning with trace-norm regularization. In AISTATS.
[8] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95-110.
[9] Z. Harchaoui, M. Douze, M. Paulin, M. Dudik, and J. Malick. Large-scale image classification with trace-norm regularization. In CVPR.
[10] Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional gradient algorithms for machine learning. Technical report, HAL.
[11] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML.
[12] M. Jaggi and M. Sulovský. A simple algorithm for nuclear norm regularized problems. In ICML.
[13] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In ICML.
[14] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming.
[15] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer.
[16] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, CORE Discussion Paper.
[17] T. K. Pong, P. Tseng, S. Ji, and J. Ye. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6).
[18] S. Shalev-Shwartz, A. Gonen, and O. Shamir. Large-scale convex minimization with a low-rank constraint. In ICML.
[19] C. Shen, J. Kim, L. Wang, and A. van den Hengel. Positive semidefinite metric learning using boosting-like algorithms. JMLR.
[20] S. Sra, S. Nowozin, and S. J. Wright. Optimization for Machine Learning. MIT Press.
