Scaling Distributed Machine Learning with the Parameter Server


Scaling Distributed Machine Learning with the Parameter Server
Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su
Presented by: Liang Gong, CS294-110 Class Presentation, Fall 2015

Machine Learning in Industry
- Large training datasets (1 TB to 1 PB)
- Complex models (10^9 to 10^12 parameters)
- ML must be done in a distributed environment
- Challenges:
  - Many machine learning algorithms are proposed for sequential execution
  - Machines can fail and jobs can be preempted

Motivation
- Balance the need for performance, flexibility, and generality of machine learning algorithms against the simplicity of systems design.
- How to:
  - Distribute the workload
  - Share the model among all machines
  - Parallelize sequential algorithms
  - Reduce communication cost

Main Idea of the Parameter Server
- Server nodes manage the parameters.
- Worker nodes are responsible for computing updates (training) for the parameters based on their part of the training dataset.
- Parameter updates derived from each worker node are pushed to and aggregated on the server.

A Simple Example
- Server node + worker nodes
- Server node: holds all parameters w
- Worker node: owns part of the training data
- Operates in iterations:
  - Worker nodes pull the updated w from the server
  - Each worker node computes updates to w on its local data (local training)
  - Each worker node pushes its updates to the server node
  - The server node updates w
- Parameter access is sparse: with 100 nodes, a single node uses 7.8% of w on average; with 1000 nodes, only 0.15%.
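To make the iteration concrete, here is a minimal single-process sketch of the pull / compute / push loop described above. It is illustrative only: the class names, the squared-loss gradient, and the immediate application of each pushed update are assumptions, not the paper's actual distributed C++ implementation.

```python
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)        # server node holds all parameters w
        self.lr = lr

    def pull(self):
        return self.w.copy()          # workers pull the current w

    def push(self, grad):
        self.w -= self.lr * grad      # server applies the pushed update

class Worker:
    def __init__(self, X, y):
        self.X, self.y = X, y         # each worker owns a shard of the data

    def compute_update(self, w):
        # local training: gradient of squared loss on this worker's shard
        return self.X.T @ (self.X @ w - self.y) / len(self.y)

def iterate(server, workers):
    # one pass in which every worker pulls w, computes its update, and pushes it
    for worker in workers:
        w = server.pull()
        server.push(worker.compute_update(w))

# Toy usage: two workers, each holding a random shard of a regression problem
rng = np.random.default_rng(0)
server = ParameterServer(dim=10)
workers = [Worker(rng.normal(size=(50, 10)), rng.normal(size=50)) for _ in range(2)]
for _ in range(100):
    iterate(server, workers)
```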

Architecture
- Server manager: tracks the liveness and the parameter partition of the server nodes.
- Server nodes partition the parameter keys with consistent hashing (sketched below).
- Worker nodes communicate only with the server nodes, not with each other.
- Updates are replicated to slave server nodes synchronously.
- Optimization: replicate after aggregation.
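The key partition can be pictured with a small consistent-hashing sketch. This illustrates the general technique under assumed details (md5-based hashing, virtual nodes per server); the actual system assigns contiguous key ranges rather than hashing individual keys on lookup.

```python
import hashlib
from bisect import bisect_right

def _hash(x):
    # stable hash so the assignment does not change between runs
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, replicas=32):
        # place several virtual points per server on the ring
        self.ring = sorted((_hash(f"{s}#{i}"), s)
                           for s in servers for i in range(replicas))
        self.points = [h for h, _ in self.ring]

    def owner(self, key):
        # a key is owned by the server whose point follows the key's hash
        i = bisect_right(self.points, _hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["server0", "server1", "server2"])
print(ring.owner("feature:12345"))   # maps this parameter key to a server
```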

Data Transmission / Calling
- The shared parameters are represented as <key, value> vectors.
- Data is sent by pushing and pulling key ranges.
- Tasks are issued by RPC and executed asynchronously: the caller continues without waiting for a return from the callee.
- The caller can specify dependencies between callees, which is how the consistency models are expressed (see the sketch below):
  - Sequential Consistency
  - Eventual Consistency
  - 1-Bounded-Delay Consistency
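A small sketch of how dependencies between asynchronously issued tasks encode the three consistency models. The function name and the representation of iterations as integers are illustrative assumptions, not an API from the paper's system.

```python
def dependencies_for(t, model, tau=1):
    """Return the earlier iterations that iteration t must wait for."""
    if model == "sequential":      # wait for every previous task (tau = 0)
        return list(range(t))
    if model == "eventual":        # never wait (tau = infinity)
        return []
    if model == "bounded_delay":   # wait for tasks older than t - tau
        return list(range(max(0, t - tau)))
    raise ValueError(model)

# With tau = 1, iteration 5 may start once iterations 0..3 have finished,
# so it can overlap with iteration 4.
print(dependencies_for(5, "sequential"))     # [0, 1, 2, 3, 4]
print(dependencies_for(5, "eventual"))       # []
print(dependencies_for(5, "bounded_delay"))  # [0, 1, 2, 3]
```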

Trade-off: Asynchronous Calls
- Setup: 1000 machines (800 workers, 200 parameter servers); 16 physical cores, 192 GB DRAM, 10 Gb Ethernet per machine.
- Asynchronous updates require more iterations to achieve the same objective value.

Assumptions
- It is OK to lose part of the training dataset: it is not urgent to recover a failed worker node.
- Recovering a failed server node is critical.
- An approximate solution is good enough: limited inaccuracy is tolerable.
- Relaxed consistency is acceptable (as long as training converges).

Optimizations
- Message compression to save bandwidth.
- Aggregate parameter changes before synchronous replication on the server node.
- Key lists for parameter updates are likely to be the same as in the last iteration, e.g. the keys in <1, 3>, <2, 4>, <6, 7.5>, <7, 4.5>: cache the list and send a hash of it instead (see the sketch below).
- Filter before transmission: drop gradient updates smaller than a threshold.
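A rough sketch of the last two optimizations, key-list caching and threshold filtering. The message format, function names, and the md5-based signature are assumptions for illustration; they are not the wire format of the actual system.

```python
import hashlib
import numpy as np

def key_list_signature(keys):
    # compact signature of a key list, so repeated lists can be sent as a hash
    return hashlib.md5(np.asarray(keys, dtype=np.int64).tobytes()).hexdigest()

def build_push_message(keys, values, cached_signature=None, threshold=1e-5):
    keys, values = np.asarray(keys), np.asarray(values, dtype=float)

    # filter: drop updates whose magnitude is below the threshold
    keep = np.abs(values) >= threshold
    keys, values = keys[keep], values[keep]

    sig = key_list_signature(keys)
    if sig == cached_signature:
        # receiver already caches this key list: send only its hash and the values
        return {"key_hash": sig, "values": values}
    return {"keys": keys, "values": values, "key_hash": sig}

msg = build_push_message([1, 2, 6, 7], [3.0, 4.0, 7.5, 4.5])
```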

Network Saving
- Setup: 1000 machines (800 workers, 200 parameter servers); 16 physical cores, 192 GB DRAM, 10 Gb Ethernet per machine.

Trade-offs
- Consistency model vs. computing time + waiting time:
  - Sequential Consistency (τ = 0)
  - Eventual Consistency (τ = ∞)
  - 1-Bounded-Delay Consistency (τ = 1)

Discussions
- Feature selection? Sampling?
- Trillions of features and trillions of examples in the training dataset: overfitting?
- Should each worker do multiple local iterations before pushing?
- Diversify the labels each node is assigned (better than random assignment)?
- If a worker only pushes trivial parameter changes, its training data is probably not very useful: remove it and re-partition.
- A hierarchy of server nodes?

Assumption / Problem
- Fact: the total size of the parameters (features) may exceed the capacity of a single machine.
- Assumption: each instance in the training set only contains a small portion of all features.
- Problem: what if one example contains 90% of the features (out of trillions of features in total)?

Sketch-Based Machine Learning Algorithms
- Sketches are a class of data-stream summaries.
- Problem: an unbounded stream of data items arrives continuously, while memory capacity is bounded by a small size, and every item is seen only once.
- Approach: sketches are typically formed by linear projections of the source data with appropriate (pseudo-)random vectors.
- Goal: use a small amount of memory to answer interesting queries with strong precision guarantees.
- Reference: http://web.engr.illinois.edu/~vvnktrm2/talks/sketch.pdf
- One concrete example is sketched below.
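As one concrete example of such a summary (assumed here, not taken from the slides), a Count-Min sketch keeps a small fixed-size table of counters and answers approximate frequency queries over a stream with one-sided error:

```python
import numpy as np

class CountMinSketch:
    def __init__(self, width=1024, depth=4, seed=0):
        rng = np.random.default_rng(seed)
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width), dtype=np.int64)
        # one random hash seed per row
        self.seeds = rng.integers(1, 2**31 - 1, size=depth)

    def _index(self, row, item):
        return hash((int(self.seeds[row]), item)) % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.table[r, self._index(r, item)] += count

    def estimate(self, item):
        # never underestimates; overestimation is bounded with high probability
        return min(self.table[r, self._index(r, item)] for r in range(self.depth))

cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.estimate("a"))   # >= 3, usually exactly 3
```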

Assumption / Problem
- Assumption: it is OK to compute updates for the model on each portion of the data separately and aggregate the updates.
- Problem: what about clustering and other ML/DM algorithms?