ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 4, 2016

Outline: Multi-core vs. multi-processor; Parallel Gradient Descent; Parallel Stochastic Gradient; Parallel Coordinate Descent

Parallel Programming. Parallel algorithms differ between the following two settings: the Shared Memory Model (multiple cores / multiple processors; independent L1 caches, shared or independent L2 cache, shared main memory) and the Distributed Memory Model (multiple computers).

Shared Memory Model (Multiple cores). Shared memory model: every CPU core can access the same memory space. Programming tools: C/C++: OpenMP, C++ threads, pthreads, Intel TBB, ...; Python: the threading module, ...; MATLAB: parfor, ...

Parallel for loop in OpenMP
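
The OpenMP snippet on this slide is not in the transcription, so here is a minimal illustrative sketch of a parallel for loop with a reduction (the array contents and size are made up):

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<double> x(n, 1.0);
    double sum = 0.0;

    // The loop iterations are divided among the available threads; the
    // reduction clause merges the per-thread partial sums at the end.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += 2.0 * x[i];
    }

    std::printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}

Compile with the compiler's OpenMP flag (e.g., g++ -fopenmp).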

Shared Memory Model

Shared Memory Model. Two types of shared memory models: (1) Uniform Memory Access (UMA), (2) Non-Uniform Memory Access (NUMA).

Distributed Memory Model. Programming tools: MPI, Hadoop, Spark, ... (Figure from http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/computer-architecture-2013/lec09-smp.html)

Parallel Gradient Descent

Parallel Gradient Descent. Gradient descent: $x \leftarrow x - \alpha \nabla f(x)$. The gradient computation is usually embarrassingly parallel. Example: empirical risk minimization can be written as $\arg\min_w \frac{1}{n}\sum_{i=1}^{n} f_i(w)$. Partition the dataset into $k$ subsets $S_1, \dots, S_k$; each machine or CPU computes $\sum_{i \in S_j} \nabla f_i(w)$; aggregate the local gradients to get the global gradient (communication): $\nabla f(w) = \frac{1}{n}\big(\sum_{i \in S_1} \nabla f_i(w) + \cdots + \sum_{i \in S_k} \nabla f_i(w)\big)$.
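
As a concrete sketch of this data-parallel scheme on one shared-memory machine (not code from the slides): OpenMP threads stand in for the $k$ workers, and a squared loss $f_i(w) = \frac{1}{2}(x_i^\top w - y_i)^2$ is assumed purely for illustration.

#include <omp.h>
#include <vector>

// Full gradient of (1/n) * sum_i 0.5*(x_i^T w - y_i)^2, computed in parallel.
// Each thread accumulates the gradient over its own subset of samples; the
// partial gradients are then summed, which is the "communication" step when
// the workers are separate machines.
std::vector<double> full_gradient(const std::vector<std::vector<double>>& X,
                                  const std::vector<double>& y,
                                  const std::vector<double>& w) {
    const int n = (int)X.size(), d = (int)w.size();
    const int k = omp_get_max_threads();
    std::vector<std::vector<double>> partial(k, std::vector<double>(d, 0.0));

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int tid = omp_get_thread_num();
        double pred = 0.0;
        for (int j = 0; j < d; j++) pred += X[i][j] * w[j];
        double resid = pred - y[i];                       // grad f_i(w) = resid * x_i
        for (int j = 0; j < d; j++) partial[tid][j] += resid * X[i][j];
    }

    std::vector<double> grad(d, 0.0);                     // aggregate local gradients
    for (int t = 0; t < k; t++)
        for (int j = 0; j < d; j++) grad[j] += partial[t][j] / n;
    return grad;
}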

Parallel Stochastic Gradient

Parallel Stochastic Gradient in Shared Memory Systems. Stochastic Gradient (SG): For $t = 1, 2, \dots$: randomly pick an index $i$, then $w^{t+1} \leftarrow w^{t} - \eta_t \nabla f_i(w^{t})$. The computation of $\nabla f_i(w^{t})$ depends only on the $i$-th sample, so a single update usually cannot be parallelized. Parallelizing SG is a hard research problem.

Mini-batch SG. Mini-batch SG with batch size $b$: For $t = 1, 2, \dots$: randomly pick a subset $S \subseteq \{1, \dots, n\}$ with $|S| = b$, then $w^{t+1} \leftarrow w^{t} - \eta_t \frac{1}{b}\sum_{i \in S} \nabla f_i(w^{t})$. Equivalent to gradient descent when $b = n$; equivalent to stochastic gradient when $b = 1$. Parallelization with $k$ processors: let $S = S_1 \cup S_2 \cup \cdots \cup S_k$; then $\sum_{i \in S} \nabla f_i(w^{t}) = \sum_{i \in S_1} \nabla f_i(w^{t}) + \sum_{i \in S_2} \nabla f_i(w^{t}) + \cdots + \sum_{i \in S_k} \nabla f_i(w^{t})$ can be computed in parallel. Other versions: divide-and-average (Mann et al., 2009; Zinkevich et al., 2010).
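
A sketch of one parallel mini-batch step under the same illustrative squared loss as before: the sum over the batch $S$ is split across OpenMP threads and the partial sums are combined before the single parameter update (the batch is sampled with replacement here only to keep the sketch short).

#include <omp.h>
#include <random>
#include <vector>

// One mini-batch SGD step: w <- w - eta * (1/b) * sum_{i in S} grad f_i(w).
// The sum over the batch is split across threads, mirroring the slide's
// decomposition S = S_1 u S_2 u ... u S_k.
void minibatch_sgd_step(const std::vector<std::vector<double>>& X,
                        const std::vector<double>& y,
                        std::vector<double>& w,
                        int b, double eta, std::mt19937& rng) {
    const int n = (int)X.size(), d = (int)w.size();
    std::uniform_int_distribution<int> pick(0, n - 1);
    std::vector<int> S(b);
    for (int s = 0; s < b; s++) S[s] = pick(rng);   // sample the batch

    std::vector<double> grad(d, 0.0);
    #pragma omp parallel
    {
        std::vector<double> local(d, 0.0);          // per-thread partial sum
        #pragma omp for nowait
        for (int s = 0; s < b; s++) {
            int i = S[s];
            double pred = 0.0;
            for (int j = 0; j < d; j++) pred += X[i][j] * w[j];
            double resid = pred - y[i];
            for (int j = 0; j < d; j++) local[j] += resid * X[i][j];
        }
        #pragma omp critical                        // combine the partial sums
        for (int j = 0; j < d; j++) grad[j] += local[j];
    }
    for (int j = 0; j < d; j++) w[j] -= eta * grad[j] / b;
}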

Mini-batch SG How to choose batch size b?

Mini-batch SG on distributed systems. Can we avoid wasting time on communication? Use non-blocking network I/O: keep computing updates while the gradient is being aggregated. See Dekel et al., Optimal Distributed Online Prediction Using Mini-Batches, JMLR 2012.
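
The overlap idea can be sketched with MPI's non-blocking collective MPI_Iallreduce (available since MPI-3): start aggregating one local gradient, keep computing while the network works, then wait before applying the averaged update. This only illustrates overlapping communication with computation; it is not the algorithm of Dekel et al.

#include <mpi.h>
#include <vector>

// Sketch: launch a non-blocking all-reduce for the gradient just computed,
// do more local work while the communication is in flight, then wait and
// apply the averaged update.
void overlapped_step(std::vector<double>& w,
                     std::vector<double>& local_grad,    // gradient just computed
                     std::vector<double>& global_grad,   // buffer for the aggregate
                     double eta, int world_size) {
    MPI_Request req;
    MPI_Iallreduce(local_grad.data(), global_grad.data(), (int)local_grad.size(),
                   MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    // ... compute the next local gradient here, overlapping with communication ...

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    for (size_t j = 0; j < w.size(); j++)
        w[j] -= eta * global_grad[j] / world_size;       // average over workers
}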

Asynchronous Stochastic Gradient. Synchronized algorithms: all machines have to stop and synchronize at certain points, which leads to longer waiting time.

Asynchronous Stochastic Gradient (shared memory). The original SG: For $t = 1, 2, \dots$: randomly pick an index $i$, then $w \leftarrow w - \eta \nabla f_i(w)$.

Asynchronous Stochastic Gradient (shared memory). The asynchronous parallel SG: each thread repeatedly performs the following updates: For $t = 1, 2, \dots$: randomly pick an index $i$, then $w \leftarrow w - \eta \nabla f_i(w)$. Main trick: in shared memory systems, every thread can access the same parameter $w$. First discussed in Langford et al., Slow Learners are Fast, NIPS 2009. Proposed in Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, NIPS 2011.
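
A minimal lock-free sketch in the spirit of Hogwild!: every OpenMP thread samples its own indices and updates the shared vector w with no locks at all. The squared loss, step size, and update count are illustrative assumptions, not details from the papers above.

#include <omp.h>
#include <random>
#include <vector>

// Asynchronous SGD: all threads update the shared parameter vector w
// concurrently, with no synchronization between updates (so reads may see
// a w that mixes other threads' partial updates).
void async_sgd(const std::vector<std::vector<double>>& X,
               const std::vector<double>& y,
               std::vector<double>& w, double eta, int updates_per_thread) {
    const int n = (int)X.size(), d = (int)w.size();

    #pragma omp parallel
    {
        std::mt19937 rng(12345 + omp_get_thread_num());  // per-thread RNG
        std::uniform_int_distribution<int> pick(0, n - 1);

        for (int t = 0; t < updates_per_thread; t++) {
            int i = pick(rng);
            double pred = 0.0;
            for (int j = 0; j < d; j++) pred += X[i][j] * w[j];  // reads shared w
            double resid = pred - y[i];
            for (int j = 0; j < d; j++)
                w[j] -= eta * resid * X[i][j];                   // lock-free write
        }
    }
}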

Asynchronous Stochastic Gradient (shared memory). For convex functions, it converges to the global optimum under certain conditions: (1) bounded delay, (2) small conflict rate. A general framework proving convergence rates of asynchronous SGD and coordinate descent to stationary points: Lian et al., A Comprehensive Linear Speedup Analysis for Asynchronous Stochastic Parallel Optimization from Zeroth-Order to First-Order, NIPS 2016 (proves linear speedup for asynchronous algorithms).

Asynchronous Stochastic Gradient (distributed memory). Use a parameter server to update the parameters. See Dean et al., Large Scale Distributed Deep Networks, NIPS 2012.
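
A toy shared-memory analogue of the parameter-server pattern (the class and method names are invented for illustration; this is not the Dean et al. system): each worker pulls a snapshot of w, computes a gradient on its own data shard, and pushes the update back, all at its own pace.

#include <mutex>
#include <vector>

// Toy parameter server: holds the global w and applies updates as they
// arrive, independently of how far each worker has progressed.
struct ParameterServer {
    std::vector<double> w;
    std::mutex mtx;

    std::vector<double> pull() {                  // workers read a (possibly stale) copy
        std::lock_guard<std::mutex> lock(mtx);
        return w;
    }
    void push(const std::vector<double>& grad, double eta) {
        std::lock_guard<std::mutex> lock(mtx);    // apply the update asynchronously
        for (size_t j = 0; j < w.size(); j++) w[j] -= eta * grad[j];
    }
};

void worker(ParameterServer& ps, int steps, double eta) {
    for (int t = 0; t < steps; t++) {
        std::vector<double> w = ps.pull();        // 1. fetch current parameters
        std::vector<double> grad(w.size(), 0.0);  // 2. compute a gradient on the local
                                                  //    data shard (elided in this sketch)
        ps.push(grad, eta);                       // 3. send the update to the server
    }
}

Workers would run as separate threads or processes, e.g. std::thread t(worker, std::ref(ps), 1000, 0.01); in a real distributed system the pull/push calls become network messages.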

Parallel Coordinate Descent

Parallel Coordinate Descent. (Stochastic) Coordinate Descent (CD): For $t = 1, 2, \dots$: randomly pick an index $i$, then $w^{t+1}_i \leftarrow w^{t}_i + \delta^{*}$, where $\delta^{*} = \arg\min_{\delta} f(w^{t} + \delta e_i)$. A simplified version updates each coordinate by a gradient step: For $t = 1, 2, \dots$: randomly pick an index $i$, then $w^{t+1}_i \leftarrow w^{t}_i - \eta \nabla_i f(w^{t})$. How to parallelize it?

Synchronized Parallel Coordinate Descent. For $t = 1, 2, \dots$: randomly pick a subset $S \subseteq \{1, \dots, n\}$ with $|S| = b$, then $w^{t+1}_i \leftarrow w^{t}_i - \eta \nabla_i f(w^{t})$ for all $i \in S$. Parallelization: let $S = S_1 \cup S_2 \cup \cdots \cup S_k$; the $j$-th machine updates the variables in $S_j$. Will it converge? Yes, if $\eta$ is small enough. First discussed in Bradley et al., Parallel Coordinate Descent for L1-Regularized Loss Minimization, ICML 2011. Theoretical guarantees in Richtárik & Takáč, Parallel Coordinate Descent Methods for Big Data Optimization, Mathematical Programming, 2012.
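
A sketch of one synchronized round with OpenMP threads standing in for the $k$ machines. The coordinate-gradient oracle grad_i (returning $\nabla_i f(w)$) is an assumed callable, not something defined on the slides.

#include <omp.h>
#include <algorithm>
#include <functional>
#include <random>
#include <vector>

// One round of synchronized parallel coordinate descent: pick a random set S
// of b coordinates, compute all partial derivatives against the *current*
// iterate in parallel, then apply the b updates before the next round.
void sync_parallel_cd_round(std::vector<double>& w,
                            const std::function<double(const std::vector<double>&, int)>& grad_i,
                            int b, double eta, std::mt19937& rng) {
    const int d = (int)w.size();
    std::vector<int> idx(d);
    for (int j = 0; j < d; j++) idx[j] = j;
    std::shuffle(idx.begin(), idx.end(), rng);     // S = first b coordinates after shuffling

    std::vector<double> g(b);
    #pragma omp parallel for
    for (int s = 0; s < b; s++)
        g[s] = grad_i(w, idx[s]);                  // every thread reads the same w^t

    for (int s = 0; s < b; s++)                    // synchronized update, then next round
        w[idx[s]] -= eta * g[s];
}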

Asynchronous Parallel Coordinate Descent. The asynchronous parallel coordinate descent: each thread repeatedly performs the following updates: For $t = 1, 2, \dots$: randomly pick an index $i$, then $w_i \leftarrow w_i - \eta \nabla_i f(w)$. Main trick: in shared memory systems, every thread can access the same parameter $w$. First implemented in Bradley et al., Parallel Coordinate Descent for L1-Regularized Loss Minimization, ICML 2011. Formally analyzed in Liu & Wright, An Asynchronous Parallel Stochastic Coordinate Descent Algorithm, ICML 2014.
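
A lock-free sketch of the asynchronous variant, again with an assumed coordinate-gradient oracle grad_i: every thread keeps picking coordinates and writing its updates into the shared w without waiting for the others.

#include <omp.h>
#include <functional>
#include <random>
#include <vector>

// Asynchronous coordinate descent: every thread loops independently, reading
// a possibly inconsistent w and writing its coordinate update back with no
// synchronization.
void async_cd(std::vector<double>& w,
              const std::function<double(const std::vector<double>&, int)>& grad_i,
              double eta, int updates_per_thread) {
    const int d = (int)w.size();
    #pragma omp parallel
    {
        std::mt19937 rng(777 + omp_get_thread_num());   // per-thread RNG
        std::uniform_int_distribution<int> pick(0, d - 1);
        for (int t = 0; t < updates_per_thread; t++) {
            int i = pick(rng);
            w[i] -= eta * grad_i(w, i);   // lock-free update of the shared parameter
        }
    }
}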

Coming up. Next class: Support Vector Machines (SVM). Questions?