Parallel Stochastic Gradient Descent

University of Montreal August 11th, 2007 CIAR Summer School - Toronto

Stochastic Gradient Descent

Cost to optimize: $E_z[C(\theta, z)]$, with $\theta$ the parameters and $z$ a training point.

Stochastic gradient: $\theta_{t+1} \leftarrow \theta_t - \epsilon_t \frac{\partial C(\theta_t, z_t)}{\partial \theta}$

Batch gradient: $\theta_{k+1} \leftarrow \theta_k - \epsilon_k \sum_{t=1}^{n} \frac{\partial C(\theta_k, z_t)}{\partial \theta}$

Conjugate gradient: similar to batch, except that the descent direction is not the gradient itself and the step $\epsilon_k$ is optimized by line search.
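A minimal NumPy sketch of the stochastic and batch update rules, assuming a squared-error cost $C(\theta, z) = \frac{1}{2}(x^\top\theta - y)^2$ on an illustrative linear model with synthetic data (an assumption for illustration, not the network or data from these slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))            # n illustrative training inputs
y = X @ rng.normal(size=16)                # illustrative targets

def grad(theta, x, y_t):
    """Per-sample gradient dC(theta, z_t)/dtheta for the squared-error cost."""
    return (x @ theta - y_t) * x

eps = 0.01

# Stochastic gradient: one training point z_t per update.
theta = np.zeros(16)
for t in range(1000):
    i = rng.integers(len(X))
    theta -= eps * grad(theta, X[i], y[i])

# Batch gradient: sum of all per-sample gradients per update.
theta = np.zeros(16)
for k in range(100):
    g = sum(grad(theta, X[t], y[t]) for t in range(len(X)))
    theta -= (eps / len(X)) * g            # step scaled by n to keep it stable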

Stochastic vs. Batch

Batch is good because:
- The gradient is less noisy (averaged over a large number of samples).
- Optimized matrix operations can be used to efficiently compute the output and gradient over the whole dataset.

Stochastic is good because:
- For large or infinite datasets, batch and conjugate gradient are impractical.
- More parameter updates → faster convergence.
- Noisy updates can actually help escape local minima.

Mini-batch stochastic gradient

Trade-off between batch and stochastic:

$\theta_{k+1} \leftarrow \theta_k - \epsilon_k \sum_{t=s_k}^{s_k+b} \frac{\partial C(\theta_k, z_t)}{\partial \theta}$

Typical size of mini-batches: on the order of 10s or 100s.
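A short sketch under the same illustrative linear least-squares setup as above (again an assumption, not the slides' network): each update sums the per-sample gradients over one chunk of b consecutive points.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))            # illustrative synthetic data
y = X @ rng.normal(size=16)

theta, eps, b = np.zeros(16), 0.01, 32     # mini-batch size on the order of 10s-100s
for k in range(len(X) // b):
    s_k = k * b                            # start index of the k-th mini-batch
    xb, yb = X[s_k:s_k + b], y[s_k:s_k + b]
    g = ((xb @ theta - yb)[:, None] * xb).sum(axis=0)   # summed per-sample gradients
    theta -= (eps / b) * g                 # averaged step (a common, stable choice)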

Parallelization

Note: we are using a single 16-CPU computer, not 100s of computers on a cluster.

- Use existing parallel implementations of BLAS to speed up matrix-matrix multiplications: low speed-up unless using very large mini-batches (essentially equivalent to batch).
- Split the data into c chunks (each of the c CPUs sees one chunk of the data) and perform mini-batch stochastic gradient descent with the parameters stored in shared memory:
  - If all CPUs update the parameters simultaneously: poor performance (time is wasted waiting for memory write locks).
  - Proposed solution: at any time, only one CPU is allowed to update the parameters. The index of the next CPU to update is stored in shared memory and incremented after each update (sketched below).

The proposed method gives a good speed-up in terms of raw samples/s (e.g. 13x with 16 CPUs).
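A minimal multiprocessing sketch of this round-robin scheme, assuming an illustrative linear least-squares model (not the slides' network): the parameters live in shared memory, each worker computes gradients on its own data chunk without locking, and a shared turn counter decides which worker may apply its update.

import numpy as np
from multiprocessing import Array, Process, Value

D, C_WORKERS, B, EPS = 16, 4, 32, 0.01     # input dim, CPUs c, mini-batch size, learning rate

def worker(rank, xs, ys, theta, turn, n_updates):
    th = np.frombuffer(theta.get_obj())    # writable view on the shared parameters
    for u in range(n_updates):
        s = (u * B) % (len(xs) - B)        # this worker's next mini-batch in its chunk
        xb, yb = xs[s:s + B], ys[s:s + B]
        g = ((xb @ th - yb)[:, None] * xb).sum(axis=0)   # gradient: no lock needed
        while turn.value % C_WORKERS != rank:            # wait for our turn
            pass                                         # (busy wait, sketch only)
        th -= (EPS / B) * g                # only one CPU writes at any time
        turn.value += 1                    # hand the token to the next CPU

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4000, D))
    y = X @ rng.normal(size=D)
    theta = Array("d", D)                  # shared parameters, initialised to 0
    turn = Value("i", 0)                   # index of the next CPU allowed to update
    chunks = np.array_split(np.arange(len(X)), C_WORKERS)
    procs = [Process(target=worker, args=(r, X[idx], y[idx], theta, turn, 50))
             for r, idx in enumerate(chunks)]
    for p in procs: p.start()
    for p in procs: p.join()

The busy wait and the fixed number of updates per worker are simplifications for the sketch; the point is that gradient computation is fully parallel and only the parameter write is serialized.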

Forking

Forking a program creates multiple copies of the program. Memory is duplicated, except for shared memory.
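A tiny illustration of that statement, assuming a POSIX system (os.fork is not available on Windows): ordinary memory changes in the child stay in the child, while writes to an anonymous shared mmap are visible to the parent.

import mmap
import os
import struct

shared = mmap.mmap(-1, 8)                  # 8 bytes of anonymous shared memory
shared.write(struct.pack("d", 0.0))
private = [0.0]                            # ordinary memory: duplicated at fork

pid = os.fork()
if pid == 0:                               # child process
    private[0] = 1.0                       # modifies only the child's copy
    shared.seek(0)
    shared.write(struct.pack("d", 1.0))    # visible to the parent
    os._exit(0)

os.waitpid(pid, 0)
shared.seek(0)
print(struct.unpack("d", shared.read(8))[0])   # 1.0: the shared write is seen
print(private[0])                              # 0.0: the private copy is unchanged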

The Big Picture

Virtual Speed-Up

The speed at which training examples are processed increases (about) linearly with the number of CPUs.

True Speed-Up

In practice, the frequency of parameter updates decreases, which lowers the speed of convergence. Suggested experimental setting:

- Identify a target training error e from simple experiments.
- For each number of CPUs c, measure the amount of time needed to reach error e.
- Note that the hyper-parameters (the learning rate and mini-batch size in particular) need to be re-optimized for each value of c.

Experiment

- Letter recognition dataset from the UCI machine learning repository.
- 15000 training samples, 16-dimensional input, 26 classes.
- Target NLL: 0.11.
- Constant network architecture (300 hidden neurons).
- Find the optimal fixed learning rate and mini-batch size (a sketch of this search follows the list).
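A minimal sketch of that search, assuming a hypothetical train_until helper (not part of the original code) that trains the 300-hidden-unit network with the given settings and returns the wall-clock time needed to reach the target NLL of 0.11; the grids of candidate values are also illustrative assumptions.

import itertools

TARGET_NLL = 0.11

def tune(n_cpus, train_until,
         lrs=(0.3, 0.1, 0.03, 0.01), batch_sizes=(16, 32, 64, 128)):
    """Return the (time, lr, batch_size) that reaches TARGET_NLL fastest for this c."""
    best = None
    for lr, b in itertools.product(lrs, batch_sizes):
        seconds = train_until(n_cpus=n_cpus, lr=lr, batch_size=b,
                              target_nll=TARGET_NLL)   # hypothetical helper
        if best is None or seconds < best[0]:
            best = (seconds, lr, b)
    return best

# Redo the search for each value of c, e.g.:
# results = {c: tune(c, train_until) for c in (1, 2, 4, 8, 16)}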

Joke

Discussion, Conclusion, Future Work and *Clap Clap*

- My code is likely buggy!
- Funny things can happen in parallel code.
- The optimization is sensitive to noise, so different random seeds should be used.
- A learning rate decrease constant should also be introduced and optimized.
- Use more light-weight parallelization? (pthreads?)
- What about large clusters? (MPI?)