Optimization for Machine Learning

Size: px

Start display at page:

Download "Optimization for Machine Learning"

Eunice Parks
6 years ago
Views:

1 with a focus on proximal gradient descent algorithm Department of Computer Science and Engineering

2 Outline 1 History & Trends 2 Proximal Gradient Descent 3 Three Applications

3 A Brief History A. Convex optimization: before 1980 Develop more efficient algorithms for convex problems B. Convex optimization for machine learning: Apply mature convex optimization algorithms on machine learning models C. Large scale machine learning: Convexity may not longer be required Algorithms become more specific to machine learning models D. Huge scale machine learning: 2010-now Convexity is longer a problem for optimization Parallel & distributed optimization

4 Convex Optimization: Before 1980 Linear least squares regression 1 min x 2N N n=1 Gradient descent ( ) 2 y i ai λ x + 2 x 2 2 Conjugate gradient descent Quasi-Newton method: L-BFGS

5 Convex : Support vector machine min 1 N ε i + λ N 2 x 2 2, n=1 ) s.t. y i (a i x + b 1 ε i Sequential minimal optimization

6 Large Scale Machine Learning Support vector machine Core-set selection: CVM Coordinate descent: LibLinear Stochastic gradient descent (SGD): Pegasos Sparse coding and matrix completion Proximal gradient descent (PG) Frank-Wolfe algorithm (FW) Alternating Direction Method of Multipliers (ADMM) Neural networks Forward and Backward-propagation

7 Huge Scale Machine Learning: 2010-now Deep Neural networks Tailored-made SGD (Resprop, Adam) Making algorithms (PG, FW, ADMM) more efficient & applicable Parallel and distributed version Stochastic version Convergence analysis without convexity

8 Three Interesting Trends Trends General propose to tailer-made for model, even for data Convex to nonconvex Easy implementation to systematic approach Requirements Deep understanding of optimization algorithms Deep understanding of machine learning models Experienced with coding & computer systems Domain knowledge from specific areas

9 Why? Big model (nonconvex) and big data (distributed) Smaller model error, but increase the difficulty on optimization which leads to large optimization error Good algorithms Keep small optimization error

10 Gradient Descent Minimization problem: min F (x) F is a smooth function we can take gradient F Gradient descent x t+1 = x t 1 L F (x t) Example: logistic regression 1 min x N N n=1 ( ( )) log 1 + exp y i (ai x) + λ 2 x 2 2

11 Proximal Gradient Descent F is always not smooth in machine learnings. min F (x) = f (x) }{{} + g(x) }{{} loss function regularizer f : loss function - usually smooth: e.g. square loss, logistic loss g: regularizer - usually nonsmooth: e.g. sparse/low-rank regularizers Examples: logistic regression + feature selection 1 min x N N n=1 ( ( )) log 1 + exp y i (ai x) + λ x 1

12 Matrix Completion: Recommender system

13 Matrix Completion: Recommender system Birds of a feather flock together Rating matrix is low-rank U: users latent factors V : items latent factors

14 Matrix Completion: Recommender system Optimization problem: min X 1 2 P Ω(X O) 2 F } {{ } f :loss + λ X }{{} g:regularizer where observed positions are indicated by Ω, and ratings are in O. f : square loss is used - smooth g: the nuclear norm regularizer - X = i σ i(x ), sum of singular values, nonsmooth

15 Topic Model: Tree-structured lasso

16 Topic Model: Tree-structured lasso Topics naturally have a hierarchical structure

17 Topic Model: Tree-structured lasso Bag of words (BoW) representation of a document

18 Topic Model: Tree-structured lasso Optimization problems 1 min x 2 y Dx λ x Ig 2 g=1 where y is the BoW representation of a document, and different groups are indicated by I g f : square loss is used - smooth g: tree-structured lasso regularizer - nonsmooth

19 PGD Algorithm: Overview Gradient Descent x t+1 = x t 1 L f (x t) 1 = arg min x 2 x (x t 1 L f (x t)) 2 2 = arg min x f (x t ) + (x x t ) f (x t ) + L 2 x x t 2 2 Proximal Gradient Descent x t+1 = prox λ L g ( x t 1 L f (x t) ) : proximal step 1 = arg min x 2 x (x t 1 L f (x t)) λg(x) = arg min x f (x t ) + (x x t ) f (x t ) + L 2 x x t λg(x) }{{} nonsmooth term

20 PGD Algorithm: Overview Proximal step prox λ L g ( ) Cheap closed-form solution Fast O(1/T 2 ) convergence in convex problems y t = x t + t 1 t + 2 (x t x t 1 ), ( x t+1 = prox λ L g y t 1 ) L f (y t). Sound convergence even when f and g are not convex Not for ADMM, SGD and FW algorithms

21 PGD Algorithm: Matrix completion Optimization problem: min X 1 2 P Ω(X O) 2 F + λ X Proximal step X t+1 = prox λ L (Z t), Z t = X t P Ω (X t O) SVD is needed, expensive for large matrices O(m 2 n) for X R m n (m n) How to decrease iteration time complexity while guarantee convergence?

22 PGD Algorithm: Matrix completion Key observation: Z t is sparse + low-rank Z t = P Ω (X t O) + }{{} X t, }{{} X t+1 = prox λ (Z t ) sparse low-rank Let X t = U t Σ t V t. For any u R n, Z t u = P Ω (O X t )u + }{{} U t Σ t (V t u) }{{} sparse:o( Ω 1 ) low rank:o((m+n)k) Rank-k SVD takes O( Ω 1 k + (m + n)k 2 ) time, instead of O(mnk) (similarly, for Z t v) k is much smaller than m and n Ω 1 much smaller than mn Use approximate SVD (power method) instead of exact SVD

23 PGD Algorithm: Matrix completion Proposed AIS-Impute [IJCAI-2015] is in black (a) MovieLens-100K. (b) MovieLens-10M. Small datasets

24 PGD Algorithm: Parallelization Intel Xeon, 12 cores. FaNCL [ICDM-2015, TKDE-2017] is advanced version of AIS-Impute. (a) Yahoo: 12 threads. (b) Speedup v.s. threads. Large datasets

25 PGD Algorithm: Tree-structured lasso Optimization problems: min x 1 2 y Dx λ g=1 x I g 2 Proximal step x t+1 = prox λ L g=1 (zt) Ig (z t), z 2 t = x t 1 L D (Dx t y) Cheap cheap-closed form exists complex, but convex one pass of all groups, O(log(d)d) time for x R d Performance of convex regularizers is not good enough.

26 PGD Algorithm: Tree-structured lasso Convex v.s. nonconvex regularizers regularizer convex nonconvex optimization performance Examples regularizers GP: Geman penalty Laplace penalty LSP: log-sum penalty MCP: minimax concave penalty SCAD: smoothly clipped absolute deviation penalty

27 PGD Algorithm: Tree-structured lasso Optimization problems 1 min x 2 y Dx λ κ ( ) x Ig 2 g=1 where κ is the nonconvex penalty function Proximal step x t+1 = prox λ L g=1 κ ( (z t) Ig 2 ) (z t ), z t = x t 1 L D (Dx t y) No closed-form solution Needs to be iteratively solved with other algorithms O(T p d) is required, T p is number of iterations: expensive Better performance, much less efficient optimization

28 PGD Algorithm: Tree-structured lasso Problem transformation [ICML-2016]: 1 2 y Dx λ [ ( ) ] κ xig 2 κ0 x Ig 2 + λκ 0 x Ig 2 g=1 g=1 }{{}}{{} f :augmented loss ğ:convex regularizer Move the nonconexity from regularizer to the loss augmented loss: f is still smooth convex regularizer: ğ is standard tree-structured regularizer

29 PGD Algorithm: Tree-structured lasso Optimizing transformed problem: min x f (x) + λκ 0 g=1 x I g 2 Proximal step x t+1 = prox λκ 0 L Cheap closed-form solution g=1 (zt) Ig 2 (z t), z t = x t 1 L f (x) Same convergence guarantee on original/transform problems Better performance, efficient optimization

30 PGD Algorithm: Tree-structured lasso Table: Results on tree-structured group lasso. non-accelerated accelerated convex SCP GIST GD-PAN nmapg N2C FISTA accuracy (%) 99.6± ± ± ± ± ±1.8 sparsity (%) 5.5± ± ± ± ± ±0.2 CPU time(sec) 7.1± ± ± ± ± ±0.4 (a) Objective v.s. CPU time. (b) Objective v.s. iterations. N2C is PG algorithm on the transformed problem

31 PGD Algorithm: Binary nets Example: Small Alex-net convolution layers: 2M parameters fully connect layers: 60M parameters Much lager (10x & 100x) networks are always prefer for big data Not available for mobile devices

32 PGD Algorithm: Binary nets Binary nets: min x f (x) : s.t. x = {±1} n. Neural network with binary weights Float (32 bits) to binary (1 bit) - 32x compression Multiplication to addition - 4 times faster Binary constraint act as regularization - prevent overfitting

33 PGD Algorithm: Binary nets During training weights are first binarized ˆx t = sign(x t ); then ˆx t propagates to next layer Take x {±1} as the regularizer g, using PG x t+1 = prox x {±1} n (x t f (x t )) = sign(x t f (x t ) ) }{{} loss-aware loss is considered in binarization [ICLR-2016]

34 PGD Algorithm: Binary nets BPN is proposed method, loss-aware is used with Adam

35 Thanks & QA

Proximal operator and methods

Proximal operator and methods Master 2 Data Science, Univ. Paris Saclay Robert M. Gower Optimization Sum of Terms A Datum Function Finite Sum Training Problem The Training Problem Convergence GD I Theorem