CS 6453: Parameter Server. Soumya Basu March 7, 2017

What is a Parameter Server? A server for large-scale machine learning problems. Machine learning tasks in a nutshell: feature extraction turns raw data into feature vectors (e.g., (1, 1, 1), (2, -1, 3), (5, 6, 7)), and training fits a model to those vectors. The goal is to design a server that makes this pipeline fast!

Why Now? Machine learning is important (read the news to see why). Feature extraction fits nicely into Map-Reduce, and many systems already take care of that problem, so the parameter server focuses on training models.

Training in ML. Training consists of the following steps:
1. Initialize the model with small random values.
2. Try to guess the right answer for your input set.
3. Adjust the model.
4. Repeat steps 2-3 until your error is small enough.
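
A minimal sketch of these four steps as plain stochastic gradient descent on a linear model; the synthetic data, learning rate, and tolerance are illustrative and not taken from the paper.

    import numpy as np

    # Steps 1-4 for a linear model trained with SGD (illustrative values only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))                  # input set
    y = X @ rng.normal(size=10) + 0.01 * rng.normal(size=1000)

    w = 0.01 * rng.normal(size=10)                   # 1. small random initial model
    lr, tol = 0.1, 1e-3

    for epoch in range(100):                         # 4. repeat until error is small
        for i in range(len(X)):
            pred = X[i] @ w                          # 2. guess the answer for one input
            grad = (pred - y[i]) * X[i]              # error signal for that guess
            w -= lr * grad                           # 3. adjust the model
        if np.mean((X @ w - y) ** 2) < tol:
            break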

Systems view of Training. Initialize the model with small random values: paid once, and fairly trivial to parallelize. Try to guess the right answer for your input set: iterate through the input set many, many times. Adjust the model: send a small update to the model parameters.
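
Seen from the system side, each worker repeatedly reads the shared parameters, computes a small update from its shard of the input set, and sends that update back. The ParameterServer stub below is a hypothetical in-process stand-in, not the paper's API.

    import numpy as np

    # Toy in-process stand-in for the shared parameter store.
    class ParameterServer:
        def __init__(self, dim):
            self.w = 0.01 * np.random.randn(dim)     # initialized once
        def pull(self):
            return self.w.copy()                     # read current parameters
        def push(self, delta):
            self.w += delta                          # apply a small update

    def worker_pass(server, X_shard, y_shard, lr=0.1):
        w = server.pull()                            # fetch the model
        grad = X_shard.T @ (X_shard @ w - y_shard) / len(X_shard)
        server.push(-lr * grad)                      # send only the adjustment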

Key Challenges. There are three main challenges in implementing a parameter server: accessing parameters requires lots of network bandwidth; training is sequential and synchronization is hard to scale; and fault tolerance at scale (~25% failure rate for 10k machine-hour jobs).

First Attempts. The first attempts used memcached for synchronization [VLDB 2010]. Key-value stores have very large overheads, and synchronization costs are expensive and not always necessary.

Second Generation. The second generation of attempts were application-specific parameter servers [WSDM 2012, NIPS 2012, NIPS 2013]. This approach fails to factor out the difficulties common to many different types of problems, and it is difficult to deploy multiple algorithms in parallel.

General Purpose ML. General-purpose machine-learning frameworks exist, but many have synchronization points, which makes them difficult to scale. Key observation: cache state between iterations.

GraphLab. Distributed GraphLab [PVLDB 2012]: asynchronous task scheduling is the main contribution, but it uses coarse-grained snapshots for fault tolerance, impeding scalability, and it doesn't scale elastically like map-reduce frameworks.

Piccolo. Piccolo [OSDI 2010] is the system most similar to this paper, but it is not optimized for machine learning.

Technical Contribution. Recall the three main challenges: accessing parameters requires lots of network bandwidth; training is sequential and synchronization is hard to scale; and fault tolerance at scale.

Dealing with Parameters. What are the parameters of an ML model? Usually elements of a vector, matrix, or other linear algebra object, and training needs to do lots of linear algebra operations on them. The parameter server therefore introduces a new constraint: ordered keys, typically an index into a linear algebra structure.
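
To make the ordered-key constraint concrete, here is a sketch of my own (not the paper's interface) in which keys are integer indices into a parameter vector, so a sorted key range is just a contiguous slice:

    import numpy as np

    # With ordered integer keys, the range [lo, hi) maps to a contiguous slice
    # of the underlying vector, so a server can store and ship dense segments
    # instead of individual key-value pairs. Names here are illustrative.
    params = np.zeros(1_000_000)

    def pull_range(lo, hi):
        return params[lo:hi].copy()      # one contiguous read, no per-key lookup

    def push_range(lo, hi, delta):
        params[lo:hi] += delta           # one vectorized update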

Dealing with Parameters. High model complexity leads to overfitting, so in practice updates don't touch many parameters. Range push-and-pull: update a range of values in a row instead of a single key, and use compression when sending ranges.
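
A sketch of a compressed range push, assuming most entries of the delta within the range are zero because an update touches few parameters; the wire encoding (zlib over raw float32 bytes) is my illustration, not the paper's format.

    import zlib
    import numpy as np

    # Compress the delta for a key range before sending it; sparse deltas
    # (mostly zeros) compress very well. Encoding is illustrative only.
    def encode_range_push(lo, hi, delta):
        assert len(delta) == hi - lo
        return lo, hi, zlib.compress(delta.astype(np.float32).tobytes())

    def apply_range_push(params, lo, hi, payload):
        delta = np.frombuffer(zlib.decompress(payload), dtype=np.float32)
        params[lo:hi] += delta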

Synchronization. ML models try to find a good local min/max, so updates only need to be generally in the right direction; it is not important to have strong consistency guarantees all the time. The parameter server introduces Bounded Delay.
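
A sketch of the bounded-delay condition, assuming each worker tags its updates with an iteration number: a worker may start iteration t only when every iteration older than t - tau has been acknowledged, so tau = 0 behaves synchronously and a very large tau behaves asynchronously. The bookkeeping below is my own illustration.

    # Bounded-delay consistency: a worker may run ahead of its slowest
    # outstanding iteration by at most `tau`. finished_iter is the newest
    # iteration whose updates the servers have acknowledged.
    class BoundedDelayClock:
        def __init__(self, tau):
            self.tau = tau
            self.finished_iter = -1          # nothing acknowledged yet

        def may_start(self, t):
            # Iteration t may start once everything older than t - tau is done.
            return t - self.finished_iter - 1 <= self.tau

        def ack(self, t):
            self.finished_iter = max(self.finished_iter, t)

    clock = BoundedDelayClock(tau=3)
    assert clock.may_start(3)                # within the delay bound
    assert not clock.may_start(4)            # must wait for iteration 0's ack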

Fault Tolerance. The servers store all state and the workers are stateless; however, workers cache state across iterations. Keys are replicated across servers for fault tolerance, and jobs are rerun if a worker fails.
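
A sketch of server-side key replication, assuming each key is stored on a primary server plus k replicas and an update is acknowledged only after every copy applies it; the toy in-memory "servers" and the placement rule are illustrative, not the paper's exact scheme.

    # Each key lives on a primary server and K_REPLICAS backups; a push is
    # acknowledged only after all copies apply it, so a single server failure
    # loses no parameter state. The placement rule below is illustrative.
    K_REPLICAS = 2
    servers = [dict() for _ in range(4)]         # each dict maps key -> value

    def owners(key):
        primary = hash(key) % len(servers)
        return [(primary + i) % len(servers) for i in range(K_REPLICAS + 1)]

    def replicated_push(key, delta):
        for s in owners(key):                    # primary plus replicas
            servers[s][key] = servers[s].get(key, 0.0) + delta
        return True                              # ack after every copy applied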

Evaluation

Limitations. The evaluation was done on specially designed ML algorithms: distributed regression and distributed gradient descent. How fast is it on a sequential algorithm? Count-Min Sketch is trivially parallelizable. And no neural networks were evaluated?

Future Work. What happens to sequential ML algorithms? The synchronization cost is ignored rather than resolved: where are the bottlenecks of synchronization? There is lots of waiting time, but on what resource(s)?