Optimization in the Big Data Regime 5: Parallelization?
Sham M. Kakade
Machine Learning for Big Data, CSE547/STAT548
University of Washington

Announcements...
HW 3 posted.
Projects: the term is approaching the end... (summer is coming!)
Today:
Review: adaptive gradient methods.
Parallelization: how do we parallelize in the big data regime?

Parallelization Overview
Basic question: how can we do more operations simultaneously so that our overall task finishes more quickly?
Issues:
1. Break up computations on a single machine or across a cluster?
2. Break up the model or the data: model parallelization or data parallelization?
3. Asynchrony?
What are good models to study these issues? Communication complexity, serial complexity, total computation, ...?

1: One machine or a cluster?
One machine:
Certain operations are much faster when specified without for loops: matrix multiplications, convolutions, Fourier transforms.
Parallelize by structuring computations to take advantage of this, e.g. by forming larger matrix multiplies (a small example follows below). GPUs!!!
Why one machine? Shared memory/communication is fast! Try to take advantage of fast simultaneous operations.
Cluster:
Why? One machine can only do so much. Try to (truly) break up the computations to be done.
Drawbacks: communication is costly!
Simple method: run multiple jobs with different parameters.
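
To make the "avoid for loops" point concrete, here is a small sketch (assuming NumPy is available; the matrix size and timing code are illustrative only) comparing a pure-Python triple-loop matrix multiply with a single vectorized call that dispatches to optimized BLAS kernels.

import time
import numpy as np

n = 200
A = np.random.randn(n, n)
B = np.random.randn(n, n)

# Explicit triple loop: one scalar multiply-add at a time in the interpreter.
t0 = time.time()
C_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            C_loop[i, j] += A[i, k] * B[k, j]
t_loop = time.time() - t0

# Single vectorized call: the whole product is handled by one optimized kernel.
t0 = time.time()
C_fast = A @ B
t_fast = time.time() - t0

print(f"loop: {t_loop:.2f}s   vectorized: {t_fast:.4f}s")
print("max abs difference:", np.abs(C_loop - C_fast).max())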

2: Data parallelization vs. model parallelization
Data parallelization: break the data into smaller chunks to process.
Mini-batching, batch gradient descent.
Averaging (a one-shot averaging scheme is sketched below).
Model parallelization: break up your model; try to update parts of the model in a distributed manner.
Coordinate ascent methods.
Update layers of a neural net on different machines.
Other issues: asynchrony.
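
To make the "Averaging" item concrete, here is a minimal one-shot parameter-averaging sketch, my own illustration rather than the course's code: split the data into chunks, run SGD independently on each chunk (conceptually one chunk per machine), and average the resulting parameters once at the end.

import numpy as np

def sgd_on_chunk(Xc, yc, eta=0.01, num_epochs=5, seed=0):
    # Plain SGD on the square loss for one data chunk.
    rng = np.random.default_rng(seed)
    w = np.zeros(Xc.shape[1])
    for _ in range(num_epochs):
        for i in rng.permutation(len(yc)):
            w -= eta * (Xc[i] @ w - yc[i]) * Xc[i]
    return w

rng = np.random.default_rng(0)
n, d, num_workers = 20_000, 10, 4
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

# Each worker handles one chunk; run sequentially here for simplicity.
chunks = zip(np.array_split(X, num_workers), np.array_split(y, num_workers))
w_avg = np.mean([sgd_on_chunk(Xc, yc, seed=k)
                 for k, (Xc, yc) in enumerate(chunks)], axis=0)
print("distance to w*:", np.linalg.norm(w_avg - w_star))

Averaging independent runs reduces the variance of the final estimate but not the bias of each run, which is one reason the rest of the lecture focuses on mini-batching.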

Issues to consider...
Work: the overall compute time.
Depth: the serial runtime.
Communication: how much between machines?
Error: the terminal (final) error.
Memory: how much do we need?

Mini-batching (Data Parallelization)
Stochastic gradient computation: at step t, using w_t, compute
    w_{t+1} \leftarrow w_t - \eta \nabla \ell(w_t), where \mathbb{E}[\nabla \ell(w)] = \nabla L(w).
Mini-batch SGD, using batch size b:
    w_{t+1} \leftarrow w_t - \eta_b \frac{1}{b} \sum_{j=1}^{b} \nabla \ell_j(w_t),
where \eta_b is our learning rate (a runnable sketch of this update follows below).
How much does this help? It clearly reduces the variance.
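
A minimal runnable sketch of the mini-batch update above, on synthetic least-squares data; the data generation, step size, and batch size here are my own illustrative choices, not values from the lecture.

import numpy as np

def minibatch_sgd(X, y, b, eta_b, num_steps, rng):
    # w <- w - eta_b * (1/b) * sum_j grad l_j(w), with l_j the square loss on example j.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        idx = rng.integers(0, n, size=b)      # sample a mini-batch
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / b       # average gradient over the batch
        w -= eta_b * grad
    return w

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

w_hat = minibatch_sgd(X, y, b=20, eta_b=0.1, num_steps=2000, rng=rng)
print("excess loss:",
      0.5 * np.mean((X @ w_hat - y) ** 2) - 0.5 * np.mean((X @ w_star - y) ** 2))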

How much does mini-batching help?
Let's consider the regression / square loss case.
Let \bar{w}_t = \mathbb{E}[w_t] be the expected iterate at iteration t.
Loss decomposition:
    L(w_t) - L(w_*) = \underbrace{L(w_t) - L(\bar{w}_t)}_{\text{Variance}} + \underbrace{L(\bar{w}_t) - L(w_*)}_{\text{Bias}}
(a square-loss verification is given below).
Mini-batch SGD, using batch size b:
    w_{t+1} \leftarrow w_t - \eta_b \frac{1}{b} \sum_{j=1}^{b} \nabla \ell_j(w_t),
where \eta_b is our learning rate.
How much does this help? The Bias? The Variance?
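
For the square loss, the Bias/Variance labels can be justified directly. The following short derivation is a sketch under the standard least-squares setup (with \Sigma = \mathbb{E}[xx^\top] and w_* the minimizer of L), not something spelled out on the slide.

% Square loss: L(w) = (1/2) E (y - w . x)^2, Sigma = E[x x^T], w_* = argmin L,
% and \bar{w}_t = E[w_t] (expectation over the randomness of SGD).
\begin{align*}
L(w) - L(w_*)
  &= \tfrac{1}{2}\,(w - w_*)^\top \Sigma\,(w - w_*)
  && \text{(since $\nabla L(w_*) = 0$ and $L$ is quadratic)} \\
\mathbb{E}\,[L(w_t)] - L(w_*)
  &= \underbrace{\tfrac{1}{2}\,\mathbb{E}\big[(w_t - \bar{w}_t)^\top \Sigma\,(w_t - \bar{w}_t)\big]}_{\text{Variance}}
   + \underbrace{\tfrac{1}{2}\,(\bar{w}_t - w_*)^\top \Sigma\,(\bar{w}_t - w_*)}_{\text{Bias}}
\end{align*}
% The cross term vanishes because E[w_t - \bar{w}_t] = 0.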

Variance reduction with b?
Initially: as long as the Bias dominates the Variance, there is no point in variance reduction.
Suppose you use b samples per SGD update (as opposed to b = 1), and you run for the same number of updates as with b = 1. How much does this help?
Error: the Variance is always b times lower (a one-line check follows below).
Depth: the same.
Work: b times more.
As n \to \infty (with appropriate learning rate settings), the Bias goes to 0 faster than the Variance. So mini-batching, in the limit, is helpful.
How much does mini-batching help the Bias?
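
A one-line check of the "b times lower" claim, sketched under the assumption that the b per-example gradients in a batch are i.i.d. with covariance C at a fixed w:

\mathrm{Cov}\!\left(\frac{1}{b}\sum_{j=1}^{b} \nabla \ell_j(w)\right)
  = \frac{1}{b^2}\sum_{j=1}^{b} \mathrm{Cov}\big(\nabla \ell_j(w)\big)
  = \frac{C}{b}.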

Bias reduction with b?
As we crank up b, do we continue to expect improvements to the Bias?
For large enough b, our update rule is effectively batch gradient descent:
    w_{t+1} \leftarrow w_t - \eta \nabla L(w_t)
What happens in between b = 1 and b \to \infty?
Question: let \eta_b^* be the maximal learning rate (again, for the square loss case) that you can use before divergence. As you turn up b, how would we like \eta_b^* to change?

Example: sparse case
[Figures illustrating the sparse example; plots not reproduced in this transcription.]

Bias reduction with b
For b = 1, \eta_1^* \approx 1 / \mathbb{E}[\|x\|^2] (it could be smaller for heavy-tailed x).
Lemma: as b increases, \eta_b^* increases and saturates at
    \eta_b^* \approx 1 / \lambda_{\max},
where \lambda_{\max} is the largest eigenvalue of \Sigma = \mathbb{E}[xx^\top].
Let b^* be this "critical point": the smallest b where \eta_b^* \approx 1 / \lambda_{\max}.
(Informal) Theorem: suppose you use b \le b^* samples per SGD update (as opposed to b = 1), and you run for the same number of updates as with b = 1. \eta_b^* increases linearly up to b^*.
Error: the Bias contracts b times as fast (per update).
Depth: the same.
Work: b times more.
So you can get to the same Bias in b times less depth (and the same amount of total work). A numerical illustration of the lemma follows below.
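
The saturation described in the lemma can be probed numerically. Below is a rough sketch (my own illustration on synthetic least-squares data, not code from the course): for several batch sizes, grid-search the largest step size for which mini-batch SGD stays bounded.

import numpy as np

def diverges(X, y, b, eta, num_steps, rng):
    # Run mini-batch SGD and report whether the iterates blew up.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        idx = rng.integers(0, n, size=b)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / b
        w -= eta * grad
        if not np.all(np.isfinite(w)) or np.linalg.norm(w) > 1e6:
            return True
    return False

rng = np.random.default_rng(1)
n, d = 20_000, 50
X = rng.standard_normal((n, d))     # E||x||^2 ~ d = 50, lambda_max(Sigma) ~ 1
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

etas = np.geomspace(1e-3, 3.0, 40)
for b in [1, 2, 5, 10, 20, 50, 100]:
    stable = [eta for eta in etas
              if not diverges(X, y, b, eta, num_steps=2000, rng=rng)]
    # Expect: grows roughly linearly in b, then levels off (the lemma's saturation, up to constants).
    print(f"b = {b:4d}   largest stable eta ~ {max(stable):.3f}")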

Putting it together: what is an optimal scheme?
Initially: there is a critical batch size b^* up to which mini-batching will help (both the bias and the variance).
Informal theorem: you can get to the same error in b times less depth and with the same total work (as compared to using b = 1).
Practice: just crank up b until the error stops decreasing (as a function of the depth).
Asymptotically (once the Bias becomes smaller than the Variance), you could try doubling tricks to increase the mini-batch size (a schedule sketch follows below).
Punchline: unfortunately, b^* usually will not be all that large for natural problems. This strongly favors the GPU model on one machine.
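
One way to read the "doubling tricks" remark is the following schedule sketch; it is my own paraphrase with illustrative thresholds, not an algorithm from the lecture. The callables step_fn and eval_fn (supplied by the user) run a fixed number of SGD updates at a given batch size and return a validation loss, respectively.

def train_with_doubling(step_fn, eval_fn, b0=1, b_max=1024,
                        patience=5, max_rounds=100):
    # Keep the batch size fixed while progress per unit depth continues;
    # once the validation loss stalls for `patience` rounds, double b.
    b = b0
    best = float("inf")
    stall = 0
    for _ in range(max_rounds):
        step_fn(b)                 # run some fixed number of SGD updates at batch size b
        loss = eval_fn()           # e.g. validation loss
        if loss < best - 1e-4:
            best, stall = loss, 0
        else:
            stall += 1
        if stall >= patience and b < b_max:
            b, stall = 2 * b, 0    # progress has stalled: double the mini-batch size
    return b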

Is it general?
These claims were only made for SGD in the square loss case. How general are these ideas? What about the non-convex case?
Empirically, we often go with a GPU model where we max out our mini-batch size.

Mini-batching: Neural Net training on MNIST
[Figure: classification error (%) vs. depth (x-axis in units of 10^5 serial iterations), with one curve per mini-batch size: 1, 2, 5, 10, 20, 100, 200.]
Depth: the number of serial iterations.
Work: Depth times the mini-batch size.

Neural Net: learning rates vs. batch size
[Figures: the maximal learning rate as a function of the batch size, found with a crude grid search.]

Neural Net: test error vs. batch size
[Figure: test classification error (%) vs. depth (x-axis in units of 10^5 serial iterations), with one curve per mini-batch size: 1, 2, 5, 10, 20, 100, 200.]
There are subtle issues in non-convex optimization. More overfitting is seen here with larger batch sizes. It is unclear how general this is; for this case, we were too aggressive with the learning rates.

Hogwild
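
For context, Hogwild! (Niu, Recht, Ré, and Wright, 2011) runs SGD from many workers that read and write a single shared parameter vector without any locking. Below is a minimal threaded sketch on synthetic least-squares data; note that CPython threads will not give a real speedup for this toy loop, so this only illustrates the lock-free update pattern, not a performance claim.

import threading
import numpy as np

rng = np.random.default_rng(0)
n, d = 5_000, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.01 * rng.standard_normal(n)

w = np.zeros(d)                       # shared parameters, updated without locks

def worker(seed, num_steps=20_000, eta=0.01):
    local_rng = np.random.default_rng(seed)
    for _ in range(num_steps):
        i = local_rng.integers(n)
        g = (X[i] @ w - y[i]) * X[i]  # stochastic gradient at the current shared w
        w[:] = w - eta * g            # lock-free write; occasional races are tolerated

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("distance to w*:", np.linalg.norm(w - w_star))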