Parallelization in the Big Data Regime 5: Data Parallelization?
Sham M. Kakade
Machine Learning for Big Data, CSE547/STAT548, University of Washington

Announcements...
- HW 3 posted.
- Projects: the term is approaching the end... (summer is coming!)
Today:
- Review: adaptive gradient methods
- Parallelization: how do we parallelize in the big data regime?

Parallelization Overview
Basic question: how can we do more operations simultaneously so that our overall task finishes more quickly?
Issues:
1. Break up computations on: a single machine vs. a cluster.
2. Break up: models or data? Model parallelization or data parallelization?
3. Asynchrony?
What are good models to study these issues? Communication complexity, serial complexity, total computation, ...?

1: One machine or a cluster?
One machine:
- Certain operations are much faster when specified without for loops: matrix multiplications, convolutions, Fourier transforms.
- Parallelize by structuring computations to take advantage of this, e.g. as larger matrix multiplies. GPUs!!!
- Why one machine? Shared memory/communication is fast! Try to take advantage of fast simultaneous operations.
Cluster:
- Why? One machine can only do so much. Try to (truly) break up the computations to be done.
- Drawbacks: communication is costly!
- Simple method: run multiple jobs with different parameters.
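To make the "avoid for loops" point concrete, here is a small, hypothetical timing sketch (the sizes and function names are my own illustrative choices, not from the lecture): the same matrix product computed with explicit Python loops and with one vectorized NumPy call, which dispatches to parallel BLAS code (and, on a GPU, to an even faster kernel).

```python
import time
import numpy as np

def matmul_loops(A, B):
    """Naive triple-loop matrix multiply (for illustration only)."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for t in range(k):
                C[i, j] += A[i, t] * B[t, j]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
B = rng.standard_normal((200, 200))

start = time.perf_counter()
C_slow = matmul_loops(A, B)
t_loops = time.perf_counter() - start

start = time.perf_counter()
C_fast = A @ B  # one structured call; the library parallelizes internally
t_vec = time.perf_counter() - start

print(f"loops: {t_loops:.3f}s   vectorized: {t_vec:.5f}s")
print("same answer:", np.allclose(C_slow, C_fast))
```

The vectorized version is typically orders of magnitude faster, which is exactly why structuring computations as large matrix multiplies pays off on a single machine.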

2: Data Parallelization vs. Model Parallelization
Data parallelization: break the data into smaller chunks to process.
- Mini-batching, batch gradient descent
- Averaging
Model parallelization: break up your model; try to update parts of the model in a distributed manner.
- Coordinate ascent methods
- Update layers of a neural net on different machines.
Other issues: asynchrony.

Issues to consider...
- Work: the overall compute time.
- Depth: the serial runtime.
- Communication: between machines?
- Error: the final (terminal) error.
- Memory: how much do we need?

Mini-batching (Data Parallelization)
Stochastic gradient computation: at iteration $t$, compute
$$w_{t+1} \leftarrow w_t - \eta \, \nabla \ell(w_t),$$
where $\mathbb{E}[\nabla \ell(w)] = \nabla L(w)$.
Mini-batch SGD, using batch size $b$:
$$w_{t+1} \leftarrow w_t - \eta_b \, \frac{1}{b} \sum_{j=1}^{b} \nabla \ell_j(w_t),$$
where $\eta_b$ is our learning rate.
How much does this help? It clearly reduces the variance.
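As a concrete reference point, here is a minimal mini-batch SGD sketch for the square loss (linear regression). The data, batch size, and step size are illustrative choices of mine, not values from the lecture.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.05, iters=2000, seed=0):
    """Mini-batch SGD on the square loss L(w) = E[(x.w - y)^2] / 2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        idx = rng.integers(0, n, size=batch_size)     # sample a mini-batch
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch_size      # averaged stochastic gradient
        w -= lr * grad
    return w

# Synthetic check: recover a planted weight vector.
rng = np.random.default_rng(1)
X = rng.standard_normal((10_000, 20))
w_star = rng.standard_normal(20)
y = X @ w_star + 0.1 * rng.standard_normal(10_000)

w_hat = minibatch_sgd(X, y, batch_size=64, lr=0.05)
print("||w_hat - w*|| =", np.linalg.norm(w_hat - w_star))
```

Increasing `batch_size` averages more per-example gradients into each step, which is exactly the variance reduction discussed next.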

How much does mini-batching help?
Let's consider the regression/square loss case. Let $\bar{w}_t = \mathbb{E}[w_t]$ be the expected iterate at iteration $t$.
Parameter decomposition:
$$w_t - w_* = \underbrace{(w_t - \bar{w}_t)}_{\text{``Variance''}} + \underbrace{(\bar{w}_t - w_*)}_{\text{Bias}}$$
Mini-batch SGD, using batch size $b$:
$$w_{t+1} \leftarrow w_t - \eta_b \, \frac{1}{b} \sum_{j=1}^{b} \nabla \ell_j(w_t),$$
where $\eta_b$ is our learning rate.
How much does this help? The Bias? The Variance?
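The reason this decomposition is useful (a standard step not spelled out on the slide): the cross term vanishes in expectation, so the mean squared error splits cleanly into the two pieces,
$$\mathbb{E}\big[\|w_t - w_*\|^2\big] = \underbrace{\mathbb{E}\big[\|w_t - \bar{w}_t\|^2\big]}_{\text{Variance}} + \underbrace{\|\bar{w}_t - w_*\|^2}_{\text{Bias}^2},$$
since $\mathbb{E}[w_t - \bar{w}_t] = 0$ by the definition of $\bar{w}_t$. Mini-batching can therefore be analyzed by asking how it affects each term separately.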

Variance reduction with b?
Variance term (as a function of the mini-batch size):
$$\mathbb{E}\big[\|w_t - \bar{w}_t\|^2\big] \le O(1/b)$$
This means that with a mini-batch size of $b$, this contribution to the error is $O(b)$ times less, as compared to using $b = 1$ (with the same number of iterations).
However: as long as the Bias $\gg$ Variance, there is no point in trying to make the Variance term smaller.
How much does mini-batching help the Bias?
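The $1/b$ rate comes from averaging: the mini-batch gradient is a mean of $b$ stochastic gradients, and (assuming the $\nabla \ell_j(w)$ are independent, unbiased estimates of $\nabla L(w)$) the variance of a mean shrinks linearly in $b$:
$$\mathrm{Var}\!\left(\frac{1}{b} \sum_{j=1}^{b} \nabla \ell_j(w)\right) = \frac{1}{b^2} \sum_{j=1}^{b} \mathrm{Var}\big(\nabla \ell_j(w)\big) = \frac{\sigma^2}{b},$$
where $\sigma^2$ denotes the variance of a single stochastic gradient. For a fixed learning rate, this factor of $1/b$ is what propagates into the $O(1/b)$ bound on $\mathbb{E}\|w_t - \bar{w}_t\|^2$.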

Bias reduction with b?
As $b$ grows, do we continue to expect improvements to the Bias?
For large enough $b$, we are just doing batch/exact gradient descent:
$$w_{t+1} \leftarrow w_t - \eta \, \nabla L(w_t)$$
What happens in between $b = 1$ and $b = \infty$?
Define $\eta_b$ as the maximal learning rate (again, for the square loss case) that you can use before divergence.
Theorem: The maximal learning rate $\eta_b$ strictly increases with $b$.

The critical batch size $b^*$
Let $b^*$ be the critical batch size at which $\eta_b$ equals $\eta_\infty / 2$.
Theorem: For every square loss problem, in comparison to $b = 1$ (SGD), we have:
$$\|\bar{w}_{t+1} - w_*\| \le \exp(-\gamma_b) \, \|\bar{w}_t - w_*\|$$
- When $b \le b^*$, the Bias contraction rate $\gamma_b$ increases linearly with $b$.
- When $b \ge b^*$, $\gamma_b$ is within a factor of 2 of $\gamma_\infty$.

How much can you mini-batch?
Let $b^*$ be the critical batch size at which $\eta_b$ equals $\eta_\infty / 2$.
Theorem: Suppose $\|x\|^2 \le L$ (almost surely) and $\lambda_{\max}$ is the maximal eigenvalue of $\mathbb{E}[x x^\top]$. Then:
$$\frac{\mathbb{E}[\|x\|^2]}{\lambda_{\max}} \le b^* \le \frac{L}{\lambda_{\max}}$$
(for not very kurtotic distributions, $L \approx \mathbb{E}[\|x\|^2]$).
How big is $b^*$ in practice? (The above are really exact expressions for every problem.)
In one dimension:
$$b^* = \frac{\mathbb{E}[x^4]}{(\mathbb{E}[x^2])^2}$$
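To get a feel for the scale of $b^*$, here is a small numerical check of the one-dimensional formula $b^* = \mathbb{E}[x^4] / (\mathbb{E}[x^2])^2$ on a few synthetic distributions (my own choices, for illustration only):

```python
import numpy as np

def critical_batch_size_1d(x):
    """Estimate b* = E[x^4] / (E[x^2])^2 from samples of a 1-d feature."""
    return np.mean(x**4) / np.mean(x**2) ** 2

rng = np.random.default_rng(0)
n = 1_000_000
samples = {
    "gaussian":         rng.standard_normal(n),   # b* -> 3
    "uniform[-1, 1]":   rng.uniform(-1, 1, n),    # b* -> 9/5 = 1.8
    "student-t (df=5)": rng.standard_t(5, n),     # heavier tails -> larger b*
}
for name, x in samples.items():
    print(f"{name:18s} b* ~ {critical_batch_size_1d(x):.2f}")
```

For distributions like these, $b^*$ is in the single digits, consistent with the punchline on the next slide that $b^*$ is often not very large for natural problems.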

Putting it all together
Comparing to $b = 1$, with mini-batching you can get the same error with:
- Depth: reduce the serial run-time by a factor of $b$.
- Work: keep the overall compute time the same.
- Communication: need to average $b$ gradient contributions per update.
Initially, mini-batching above $b^*$ is useless. Asymptotically (for large enough iterations $t$, when the Bias becomes smaller than the Variance), more mini-batching is always helpful.
Practice/punchline?? $b^*$ is often not all that large for natural problems. This favors the GPU model on one machine.

Is it general?
The claims were only made for SGD in the square loss case. How general are these ideas? What about the non-convex case?
Empirically, we often go with a GPU model where we max out our mini-batch size.

Mini-batching: Neural Net training on MNIST
[Figure: classification error (%) vs. Depth ($\times 10^5$), with one curve per mini-batch size: 1, 2, 5, 10, 20, 100, 200.]
Depth: the number of serial iterations. Work: Depth $\times$ the mini-batch size.

Neural Net: learning rates vs. batch size
[Figures, two slides: the maximal learning rate as a function of the batch size, found with a crude grid search.]

Neural Net: test error vs. batch size
[Figure: classification error (%) vs. Depth ($\times 10^5$), with one curve per mini-batch size: 1, 2, 5, 10, 20, 100, 200.]
Subtle issues arise in non-convex optimization: more overfitting is seen here with larger batch sizes. It is unclear how general this is; for this case, we were too aggressive with the learning rates.

Averaging
With mini-batching, our presentation suggests you might as well just use a single machine. What can we do with multiple machines?
In the convex case:
- Break up your data across multiple machines (you must still have enough data on each machine).
- Run (mini-batch) SGD separately on each machine.
- Communicate each machine's answer to a central parameter server and (by convexity) average the final answers.
- Then we can repeat.
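A minimal single-process simulation of one round of this scheme for the square loss; the shard count, step size, and helper names are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def local_sgd(X, y, lr=0.01, seed=0):
    """One 'machine': a single pass of SGD over its local shard (square loss)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for i in rng.permutation(n):
        w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

def averaged_sgd(X, y, num_machines=8, lr=0.01):
    """Split the data into shards, run SGD independently on each, average the answers."""
    shards = zip(np.array_split(X, num_machines), np.array_split(y, num_machines))
    local_models = [local_sgd(Xs, ys, lr=lr, seed=m) for m, (Xs, ys) in enumerate(shards)]
    return np.mean(local_models, axis=0)   # the "parameter server" averaging step

# Synthetic check: the averaged model is close to the planted w_star.
rng = np.random.default_rng(2)
X = rng.standard_normal((40_000, 10))
w_star = rng.standard_normal(10)
y = X @ w_star + 0.1 * rng.standard_normal(40_000)

w_avg = averaged_sgd(X, y)
print("error of averaged model:", np.linalg.norm(w_avg - w_star))
```

In a real cluster each machine would run `local_sgd` on its own shard and only the final weight vectors would be communicated; the `np.mean` call stands in for the parameter server.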

Averaging: How good is it?
Question: What if there isn't enough data on each machine? How much data do we need on each machine? Roughly, we need $\kappa$ data points per machine.
Theorem: With enough data on each machine, one can just run one pass of SGD separately on each machine and then average the answers. This will be statistically optimal (e.g. in terms of generalization).

Hogwild!
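For reference, Hogwild!-style training is asynchronous SGD in which multiple workers update a shared parameter vector without any locking, tolerating occasional conflicting writes. Below is a minimal, hypothetical sketch using Python threads and a shared NumPy array; it illustrates the lock-free update pattern only (in CPython the GIL limits true parallelism, and real implementations use lower-level shared memory):

```python
import threading
import numpy as np

def hogwild_worker(w, X, y, lr, iters, seed):
    """Each thread runs SGD steps on the SHARED vector w with no locks."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(iters):
        i = rng.integers(n)
        grad = (X[i] @ w - y[i]) * X[i]   # square-loss stochastic gradient
        w -= lr * grad                    # in-place update; races are tolerated

rng = np.random.default_rng(3)
X = rng.standard_normal((20_000, 10))
w_star = rng.standard_normal(10)
y = X @ w_star + 0.1 * rng.standard_normal(20_000)

w = np.zeros(10)  # shared parameters, updated concurrently by all workers
workers = [threading.Thread(target=hogwild_worker, args=(w, X, y, 0.01, 20_000, t))
           for t in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
print("error:", np.linalg.norm(w - w_star))
```

Sparse problems are where this works best: when most updates touch disjoint coordinates, the unsynchronized writes rarely collide.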