A Technical Report about SAG, SAGA, SVRG


Tao Hu
Department of Computer Science, Peking University
No.5 Yiheyuan Road, Haidian District, Beijing, P.R. China
taohu@pku.edu.cn

Notice: this is a course project for <Algorithms for Big Data Analysis>, taught by Zaiwen Wen at Peking University, and for <Deep Learning>, taught by Zhihua Zhang at Peking University. For <Algorithms for Big Data Analysis> I implemented the SAG, SAGA, and SVRG algorithms; for <Deep Learning> I carried out the ablation study over different gradient methods. I write the two parts up as a single report for completeness and continuity.

Abstract

This project report mainly presents ablation studies over the regularization term, the stochastic gradient method, and related choices. The source code is provided at https://github.com/dongzhuoyao/gd.git.

1 Overview

Many gradient methods exist. The Full Gradient (FG) method was proposed first; as more and more data are generated, a lighter-weight method, Stochastic Gradient (SG), appeared. Based on SG, several stochastic methods emerged: Stochastic Gradient Descent (SGD) [1], Momentum [6], AdaDelta [10], RMSProp [9], and SDCA [8]. Recently, N. L. Roux et al. [7] proposed a variance-reduction method, Stochastic Average Gradient (SAG), that achieves linear convergence, and many variance-reduction variants such as SVRG [3] and SAGA [2] have appeared since. In this report I compare several aspects of some of these methods.

Let x ∈ R^d and y ∈ {-1, 1}. The logistic regression model [5] is

    P(y \mid x) = \frac{1}{1 + \exp(-y w^T x)}.

The problem I experiment on is the regularized average logistic regression loss (derived via maximum likelihood estimation):

    \min_w P(w) = \frac{1}{n} \sum_{i=1}^{n} \ln\big(1 + \exp(-y_i w^T x_i)\big) + \lambda L(w),

where w is the weight vector and L(w) is the regularization term, which can be a ridge (l2) or lasso (l1) penalty. Since we use gradient methods, we need the gradient:

    \nabla P(w) = \frac{1}{n} \sum_{i=1}^{n} \frac{\exp(-y_i w^T x_i)}{1 + \exp(-y_i w^T x_i)} (-y_i x_i) + \lambda L'(w).
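As a concrete reference for later sections, here is a minimal NumPy sketch of this objective and its (sub)gradient; the function and variable names are mine for illustration and are not taken from the released code.

import numpy as np

def loss_and_grad(w, X, y, lam, reg="l2"):
    # X: (n, d) data matrix, y: (n,) labels in {-1, +1}, w: (d,) weights
    z = -y * (X @ w)                          # z_i = -y_i w^T x_i
    loss = np.mean(np.logaddexp(0.0, z))      # (1/n) sum ln(1 + exp(z_i)), computed stably
    sigma = 1.0 / (1.0 + np.exp(-z))          # exp(z_i) / (1 + exp(z_i))
    grad = X.T @ (sigma * (-y)) / len(y)      # (1/n) sum sigma_i (-y_i x_i)
    if reg == "l2":                           # L(w) = ||w||^2, so L'(w) = 2w
        return loss + lam * (w @ w), grad + lam * 2.0 * w
    else:                                     # l1: subgradient L'(w) = sign(w)
        return loss + lam * np.abs(w).sum(), grad + lam * np.sign(w)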

If I use l1-regularized logistic regression, the l1 norm is not differentiable, so its gradient does not exist everywhere; l1 regularization is therefore much more challenging than l2. Despite the computational challenge, several methods exist to tackle l1 regularization. The first strategy is very simple: treat the problem as non-smooth optimization and apply a subgradient-based algorithm. The second strategy is to replace the l1 norm with a smooth approximation, which can then be solved by smooth optimization methods. In this report I use the subgradient L'(w) = sign(w) for convenience whenever l1 regularization is chosen. If I use l2-regularized logistic regression, the whole objective is differentiable and the gradient is easily obtained, with L'(w) = 2w.

The dataset is MNIST [4], so each x_i above is a 785-by-1 vector, y_i is a scalar, and w is a 785-dimensional weight vector: each MNIST image is 28 x 28 = 784 pixels, plus one bias entry, giving 785. Each MNIST image is normalized for ease of training. Since logistic regression is a binary classifier, I split the MNIST digits into even and odd classes; the final binary digit classifier is y = sign(w^T x), where y = 1 means odd and y = -1 means even.

2 Experiment

Because SVRG requires that the data not be shuffled, I do not reshuffle the data after each epoch is done. As a result, in some of the following figures you will observe curves that repeat the same tendency as time goes on.

2.1 How to obtain the optimum

In the following experiments I report a quantity named "Objective minus Optimum", which requires the optimum. I estimate the optimal value by running SGD with an exponentially decayed learning rate,

    \mu_t = \eta \cdot a^{\lfloor t/\text{step} \rfloor},

where \eta and a are grid-search parameters and step = 3, i.e. the learning rate decays exponentially every 3 epochs. 1 epochs were processed in the end. The grid-search result is in Table 1. As the result shows, the minimal loss is 0.377. Keeping the same grid-search parameters (\eta = 1, a = 0.9) and increasing the number of epochs further reduces the minimal value to 0.3638, so I use 0.3638 as the optimum in the following experiments.

Table 1: Optimum grid search over a (rows) and \eta (columns). Each cell reports loss / test accuracy, and each result is selected by the highest test accuracy within 1 epochs. The best optimum is achieved at a = 0.9, \eta = 1.
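The decay schedule above can be written as a small helper; this is a sketch of \mu_t = \eta \cdot a^{\lfloor t/\text{step} \rfloor} only, not the actual tuning script behind Table 1.

def exp_decay_lr(eta, a, epoch, step=3):
    # mu_t = eta * a ** floor(epoch / step): decay once every `step` epochs
    return eta * a ** (epoch // step)

# e.g. with eta = 1 and a = 0.9: epochs 0-2 use 1.0, epochs 3-5 use 0.9, and so on.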

2.2 Ablation Study

2.2.1 Regularization

Different regularization terms exist, including the l1 norm and the l2 norm. Here I compare different regularization factors λ in Figure 1; the influence of the regularization type is compared in Figure 2. In these figures, every method is run for 25 epochs (effective passes; one effective pass goes through all of the finite data once).

Figure 1: Regularization value experiment; all runs use l2-regularized logistic regression. Comparison of different regularization values λ with (a) SGD, (b) SGD with Momentum, (c) SAG, (d) SVRG. This figure is best viewed in colour.

In Figure 1, I compare regularization values ranging from 1e1 to 1e-4; for simplicity only l2 regularization is tested. For all four SG methods, λ = 1e-3 or 1e-4 works best. With λ = 1e1 the test accuracy rises very slowly and stops improving over time; as λ becomes smaller and smaller the test accuracy improves, and at λ = 1e-3 or 1e-4 it no longer fluctuates dramatically.

Figure 2: Regularization type experiment. Comparison of l1- and l2-regularized logistic regression with (a) SGD, (b) SGD with Momentum, (c) SAG, (d) SVRG. This figure is best viewed in colour.
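The sweep behind Figures 1 and 2 is just a grid loop over methods and λ values; the sketch below assumes a user-supplied train_fn callable, a hypothetical stand-in for the actual training routine, that returns whatever metric is being plotted, and the λ grid shown is illustrative.

def sweep_lambda(train_fn,
                 methods=("sgd", "momentum", "sag", "svrg"),
                 lambdas=(1e1, 1e-1, 1e-2, 1e-3, 1e-4),
                 epochs=25):
    # train_fn(method, lam=..., epochs=...) -> metric for one training run
    return {(m, lam): train_fn(m, lam=lam, epochs=epochs)
            for m in methods for lam in lambdas}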

I also compared the two regularization types (Figure 2), lasso and ridge; they show no significant difference. Furthermore, I visualize the learned weights in Figure 3, where the y-axis is the absolute value of each weight. If you observe carefully, you will find that l1 regularization tends to produce a sparser solution.

Figure 3: Weight sparsity experiment; all runs use a regularization factor of 1e-3. Weight distributions for (a) l1-SGD, (b) l2-SGD, (c) l1-SAG, (d) l2-SAG. This figure is best viewed in colour.

2.2.2 Step Size Strategy

I mainly compare three step-size strategies: fixed, exponential decay (step_size = 3), and backtracking. The training loss and test accuracy under each strategy are displayed in Figure 4; 1 epochs are processed for every method.
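The exponential-decay schedule was sketched in Section 2.1 and the fixed step size is trivial; backtracking is the only strategy not written out, so here is a standard Armijo backtracking sketch on the full objective, reusing the loss_and_grad helper from the earlier sketch. This is a generic rule, not necessarily the exact one used to produce Figure 4.

def backtracking_step(w, X, y, lam, t0=1.0, beta=0.5, c=1e-4):
    # Shrink t until the Armijo condition P(w - t g) <= P(w) - c * t * ||g||^2 holds.
    loss, grad = loss_and_grad(w, X, y, lam)
    t = t0
    while loss_and_grad(w - t * grad, X, y, lam)[0] > loss - c * t * (grad @ grad):
        t *= beta
    return t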

Figure 4: Step size strategy experiment. Training loss and test accuracy under the three step-size strategies with (a, b) SGD, (c, d) SGD with Momentum, (e, f) SAG, (g, h) SVRG. This figure is best viewed in colour.

2.2.3 Different Methods

Finally, different methods, including Stochastic Gradient Descent (SGD), SGD with Momentum, SGD with Nesterov-accelerated Momentum, SAG, SAGA, and SVRG, are compared in Figure 5, where the training loss, validation loss, and test accuracy tendencies are listed.
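As a reminder of what the variance-reduced methods in this comparison do, here is a compact sketch of the SVRG update of [3]; grad_i (per-sample gradient) and full_grad are assumed to be callables built from the objective in Section 1, and the step size eta follows the setup above.

import numpy as np

def svrg(w, grad_i, full_grad, n, eta, outer_iters, inner_iters):
    # SVRG [3]: full-gradient snapshot plus variance-corrected inner SGD steps
    for _ in range(outer_iters):
        w_snap = w.copy()
        mu = full_grad(w_snap)                            # full gradient at the snapshot
        for _ in range(inner_iters):
            i = np.random.randint(n)
            g = grad_i(w, i) - grad_i(w_snap, i) + mu     # variance-reduced gradient
            w = w - eta * g
    return w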

Figure 5: SG method experiment. (a) Training loss, (b) validation loss, and (c) test accuracy of the different methods. This figure is best viewed in colour.

The running time is nearly the same for all methods except SVRG, which is much slower because it repeatedly recomputes the full gradient over the whole dataset.

2.2.4 Vectorized Programming

Vectorized programming is very important in numerical computation, and overusing explicit "for" loops is a bad habit. Below, the gradient computation is written in a non-vectorized and a vectorized style. Vectorization is especially necessary in algorithms such as SAG and SVRG, where gradients are computed very frequently.

Algorithm 1 Non-vectorized gradient
Require: initial value w, data x_1, ..., x_n, y_1, ..., y_n
g <- λ L'(w)
for i = 1 to n do
    g <- g + (1/n) * [exp(-y_i w^T x_i) / (1 + exp(-y_i w^T x_i))] * (-y_i x_i)
end for
Output: ∇P(w) = g

Algorithm 2 Vectorized gradient
Require: initial value w (d x 1), X (n x d), y (n x 1)
tmp <- exp(-y ∘ (X w)) / (1 + exp(-y ∘ (X w)))        (element-wise)
∇P(w) <- (1/n) X^T (tmp ∘ (-y)) + λ L'(w)
Output: ∇P(w)
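In NumPy the two algorithms above translate directly: the loop version evaluates the per-sample expression n times, while the vectorized version issues a single pair of matrix products. This is a sketch in my own notation, assuming the l2 regularizer with L'(w) = 2w.

import numpy as np

def grad_loop(w, X, y, lam):
    # Algorithm 1: accumulate per-sample gradients in a Python loop (slow)
    n, d = X.shape
    g = np.zeros(d)
    for i in range(n):
        s = np.exp(-y[i] * (X[i] @ w))
        g += (s / (1.0 + s)) * (-y[i] * X[i])
    return g / n + lam * 2.0 * w

def grad_vectorized(w, X, y, lam):
    # Algorithm 2: the same gradient as one pass of matrix algebra
    z = -y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(-z))
    return X.T @ (sigma * (-y)) / len(y) + lam * 2.0 * w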

3 Conclusion

This technical report compares the regularization type, the regularization value, the step-size strategy, and different gradient methods in tables and figures. With more time, a more detailed comparison could be presented.

Acknowledgments

Some algorithm implementation details are borrowed from https://github.com/hiroyuki-kasai/SGDLibrary. This is only a technical report; I will be glad if it helps you.

References

[1] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

[2] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[3] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[4] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

[5] Jun Liu, Jianhui Chen, and Jieping Ye. Large-scale sparse logistic regression. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547–556. ACM, 2009.

[6] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.

[7] Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.

[8] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.

[9] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.

[10] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.