More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
1 More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server Q. Ho, J. Cipar, H. Cui, J.K. Kim, S. Lee, *P.B. Gibbons, G.A. Gibson, G.R. Ganger, E.P. Xing Carnegie Mellon University *Intel Labs
2 Distributed ML: one machine to many Setting: have an iterative, parallel ML algorithm E.g. optimization, MCMC algorithms For topic models, regression, matrix factorization, SVMs, DNNs, etc. Critical updates executed on one machine, in parallel Worker threads share global model parameters θ via RAM for (t = 1 to T) { dothings() parallelupdate(x,θ) dootherthings() } Parallelize over worker threads; share global model parameters θ via RAM
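The single-machine pattern above can be sketched in runnable form. This is an illustrative toy (the names, the lock, and the update rule are mine, not from the talk): worker threads share the global parameters theta via RAM and apply additive updates in parallel.

```python
# Toy, runnable version of the loop on this slide: worker threads share the
# global model parameters theta in RAM and update them in parallel.
import threading

theta = [0.0] * 8             # global model parameters, shared via RAM
lock = threading.Lock()       # serializes the += in this toy example

def worker(data_shard, num_iters):
    for t in range(num_iters):            # for (t = 1 to T)
        for x in data_shard:
            delta = 0.1 * x               # stand-in for parallelupdate(x, theta)
            with lock:
                theta[x % len(theta)] += delta

threads = [threading.Thread(target=worker, args=([i, i + 1], 3)) for i in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

With a real algorithm, delta would come from a gradient step or a sampler; the point is only that every thread reads and writes the same theta through shared memory.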
3 Distributed ML: one machine to many Want: scale up by distributing the ML algorithm Must now share parameters over a network Seems like a simple task Many distributed tools available, so just pick one and go? [Figure: single machine with multiple threads vs. multiple machines communicating over network switches]
4 Distributed ML Challenges Not quite that easy Two distributed challenges: (1) networks are slow (low bandwidth, high delay); (2) identical machines rarely perform equally (unequal performance)
5 Networks are (relatively) slow Low network bandwidth: 0.1-1 GB/s (inter-machine) vs 10 GB/s (CPU-RAM) Fewer parameters transmitted per second High network latency (messaging time): 10,000-100,000 ns (inter-machine) vs 100 ns (CPU-RAM) Wait much longer to receive parameters
6 Networks are (relatively) slow Parallel ML requires frequent synchronization Exchange 10-1000K scalars per second, per thread Parameters not shared quickly enough → communication bottleneck Significant bottleneck over a network!
7 Networks are (relatively) slow [Chart: time breakdown, compute time vs network waiting time, for BSP; LDA on 32 machines (256 cores), 10% data per iter] For a clean setting with full control over machines and full network capacity Real clusters with many users have even worse network:compute ratios!
8 Machines don't perform equally Even when configured identically Variety of reasons: Vibrating hard drive Background programs; part of a distributed filesystem Other users Machine is a VM/cloud service Occasional, random slowdowns in different machines
9 Consequence: Scaling up ML is hard! Going from 1 to N machines: Naïve implementations rarely yield N-fold speedup Slower convergence due to machine slowdowns, network bottlenecks If not careful, even worse than a single machine! Algorithm diverges due to errors from slowdowns!
10 Existing general-purpose scalable ML Theory-oriented: Focus on algorithm correctness/convergence Examples: Cyclic fixed-delay schemes (Langford et al., Agarwal & Duchi) Single-machine asynchronous (Niu et al.) Naively-parallel SGD (Zinkevich et al.) Partitioned SGD (Gemulla et al.) May oversimplify systems issues e.g. need machines to perform consistently e.g. need lots of synchronization e.g. or even try not to communicate at all Systems-oriented: Focus on high iteration throughput Examples: MapReduce: Hadoop and Mahout Spark Graph-based: GraphLab, Pregel May oversimplify ML issues e.g. assume algorithms just work in distributed setting, without proof e.g. must convert programs to new programming model; nontrivial effort
11 Existing general-purpose scalable ML Theory-oriented Focus on algorithm correctness/convergence Examples: Cyclic fixed-delay schemes (Langford et al., Agarwal & Duchi) Single-machine asynchronous (Niu et al.) Naively-parallel SGD (Zinkevich et al.) Partitioned SGD (Gemulla et al.) May oversimplify systems issues e.g. need machines to perform consistently e.g. need lots of synchronization e.g. or even try not to communicate at all Systems-oriented Focus on high iteration throughput Examples: MapReduce: Hadoop and Mahout Spark Graph-based: GraphLab, Pregel May oversimplify ML issues e.g. assume algorithms just work in distributed setting, without proof e.g. must convert programs to new programming model; nontrivial effort Can we take both sides into account?
12 Middle of the road approach Want: ML algorithms converge quickly under imperfect systems conditions e.g. slow network performance e.g. random machine slowdowns Parameters are not communicated consistently Existing work: mostly use one of two communication models Bulk Synchronous Parallel (BSP) Asynchronous (Async) First, understand pros and cons of BSP and Async
13 Bulk Synchronous Parallel [Figure: threads advancing through iterations, meeting at synchronization barriers where parameters are read/updated] Threads synchronize (wait for each other) every iteration Threads all on same iteration # Parameters read/updated at synchronization barriers
14 The cost of synchronicity (a) Machines perform unequally (b) Algorithmic workload imbalanced So threads must wait for each other End-of-iteration sync gets longer with larger clusters (due to slow network)
15 The cost of synchronicity Threads must wait for each other End-of-iteration sync gets longer with larger clusters Precious computing time wasted!
16 Asynchronous Parameters read/updated at any time Threads proceed to next iteration without waiting Threads not on same iteration #
17 Slowdowns and Async Machine suddenly slows down (hard drive, background process, etc.) Causing iteration difference between threads Leading to error in parameters Difference in iterations → parameter error
18 Async worst-case situation Large clusters have arbitrarily large slowdowns! Machines become inaccessible for extended periods Difference in iterations → parameter error Error becomes unbounded!
19 What we really want Partial synchronicity: Spread network comms evenly (don't sync unless needed) Threads usually shouldn't wait, but mustn't drift too far apart! Straggler tolerance: Slow threads must somehow catch up Is there a middle ground between BSP and Async?
20 That middle ground Partial synchronicity: Spread network comms evenly (don't sync unless needed) Threads usually shouldn't wait, but mustn't drift too far apart! Straggler tolerance: Slow threads must somehow catch up [Figure: force threads to sync up; make a slow thread catch up]
21 That middle ground How do we realize this? [Figure: force threads to sync up; make a slow thread catch up]
22 Stale Synchronous Parallel [Figure: staleness threshold 3; Thread 1 waits until Thread 2 has reached iter 4] Note: x-axis is now iteration count, not time! Allow threads to usually run at own pace Fastest/slowest threads not allowed to drift >S iterations apart Threads cache local (stale) versions of the parameters, to reduce network syncing
23 Stale Synchronous Parallel A thread at iter T sees all parameter updates before iter T-S; updates within the staleness window may not be seen (possible error) Protocol: check cache first; if too old, get latest version from network Consequence: fast threads must check network every iteration Slow threads only check every S iterations → fewer network accesses, so they catch up!
24 SSP provides best-of-both-worlds SSP combines best properties of BSP and Async BSP-like convergence guarantees: Threads cannot drift more than S iterations apart Every thread sees all updates before iteration T-S Asynchronous-like speed: Threads usually don't wait (unless there is drift) Slower threads read from network less often, thus catching up SSP is a spectrum of choices: Can be fully synchronous (S = 0) or very asynchronous (S → ∞) Or just take the middle ground, and benefit from both!
25 Why does SSP converge? Instead of x_true, SSP sees x_stale = x_true + error The error caused by staleness is bounded Over many iterations, average error goes to zero
26 Why does SSP converge? SSP approximates sequential execution Compare actual update order to ideal sequential execution
27 Why does SSP converge? SSP approximates sequential execution Possible error windows for this update: SSP may lose up to S iterations of updates to the left...
28 Why does SSP converge? SSP approximates sequential execution Possible error windows for this update: ...as well as gain up to S iterations of updates to the right
29 Why does SSP converge? SSP approximates sequential execution Error window: 2S-1 = 5 iters (for staleness S = 3) Thus, at most 2S-1 iterations of erroneous updates Hence numeric error in parameters is also bounded Partial, but bounded, loss of serializability
30 Convergence Theorem Want: minimize convex f(x) = Σ_t f_t(x) (Example: Stochastic Gradient) Components f_t are L-Lipschitz, problem diameter bounded by F² Staleness s, using P threads across all machines Use step size η_t = σ/√t with σ = F/(L√(2(s+1)P))
31 Convergence Theorem Want: minimize convex f(x) = Σ_t f_t(x) (Example: Stochastic Gradient) Components f_t are L-Lipschitz, problem diameter bounded by F² Staleness s, using P threads across all machines Use step size η_t = σ/√t with σ = F/(L√(2(s+1)P)) Then the difference between the SSP estimate and the true optimum vanishes as the number of iterations T grows Note: the RHS bound contains both (L, F) and (s, P): the interaction between theory and systems parameters
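The equations lost in transcription can be reconstructed; the following is my reading of the SGD-under-SSP result in the accompanying NIPS 2013 paper (Ho et al., Theorem 1), not text recovered from the slide itself:

```latex
% Setting: minimize convex f(x) = \sum_{t=1}^{T} f_t(x), each component f_t
% L-Lipschitz, problem diameter bounded by F^2, staleness s, P threads.
\[
  \eta_t = \frac{\sigma}{\sqrt{t}}, \qquad
  \sigma = \frac{F}{L\sqrt{2(s+1)P}}
\]
% Regret of the SSP iterates \tilde{x}_t against the true optimum x^*:
\[
  R[X] \;:=\; \sum_{t=1}^{T} f_t(\tilde{x}_t) - f(x^\ast)
  \;\le\; 4FL\sqrt{2(s+1)P\,T},
  \qquad\text{so}\qquad
  \frac{R[X]}{T} \;=\; O\!\left(\sqrt{\frac{(s+1)P}{T}}\right)
  \;\longrightarrow\; 0 .
\]
```

This matches the slide's remark: the bound couples the theory parameters (L, F) with the systems parameters (s, P), and degrades gracefully as staleness or thread count grows.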
32 SSP solves Distributed ML challenges SSP is a synchronization model for fast and correct distributed ML For abelian parameter updates of the form θ_new = θ_old + Δ SSP reduces network traffic: Threads use a stale local cache whenever possible Addresses slow network and occasional machine slowdowns
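The abelian (additive) requirement is what makes stale application of updates safe: increments commute, so the order in which delayed updates arrive does not change the final parameter. A toy check, with invented delta values:

```python
# Additive updates theta_new = theta_old + delta commute: every arrival order
# of the workers' deltas yields the same final parameter value.
import itertools

updates = [0.5, -1.25, 2.0, 0.75]     # deltas from different workers (invented)

finals = set()
for order in itertools.permutations(updates):
    theta = 1.0                        # starting value of the parameter
    for delta in order:
        theta += delta                 # theta_new = theta_old + delta
    finals.add(theta)                  # values chosen to be exact in binary

# all 24 application orders agree on the final value
```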
33 SSP + Parameter Server = Easy Distributed ML We implement SSP as a parameter server (PS), called SSPTable Provides all machines with convenient access to global model parameters Can be run on multiple machines → reduces load per machine SSPTable allows easy conversion of single-machine parallel ML algorithms Distributed shared memory programming style No need for complicated message passing Replace local memory access with PS access (cf. Ahmed et al. (WSDM 2012), Power and Li (OSDI 2010)) Single Machine Parallel: UpdateVar(i) { old = y[i]; delta = f(old); y[i] += delta } Distributed with SSPTable: UpdateVar(i) { old = PS.read(y,i); delta = f(old); PS.inc(y,i,delta) }
34 SSPTable Programming Easy, table-based programming: just 3 commands! No message passing, barriers, locks, etc. read_row(table,row,s): Retrieve a table row with staleness s inc(table,row,col,value): Increment table's (row,col) by value clock(): Inform PS that this thread is advancing to the next iteration
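A sketch of how a worker loop might exercise these three commands. The Python binding below (class name PS, in-process dict tables, keyword arguments) is hypothetical; the real SSPTable client is not shown in the talk.

```python
# Hypothetical in-process stand-in for an SSPTable client, exposing the three
# commands described above: read_row, inc, and clock.
class PS:
    def __init__(self):
        self.tables = {}       # table -> row -> {col: value}
        self.clock_val = 0     # this thread's iteration counter

    def read_row(self, table, row, s):
        # A real client would return a cached copy no staler than s clocks,
        # fetching from the server only when the cache is too old.
        return self.tables.setdefault(table, {}).setdefault(row, {})

    def inc(self, table, row, col, value):
        r = self.tables.setdefault(table, {}).setdefault(row, {})
        r[col] = r.get(col, 0.0) + value    # additive (abelian) update

    def clock(self):
        self.clock_val += 1    # tell the PS we advance to the next iteration

ps = PS()
for it in range(3):                         # a worker's iterations
    w = ps.read_row("weights", 0, s=2)      # read with staleness bound s
    ps.inc("weights", 0, col=7, value=0.5)  # increment one entry
    ps.clock()
```

Note how little changes relative to shared-memory code: reads and writes to a local array become read_row/inc calls, plus one clock() per iteration.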
35 SSPTable Programming Just put global parameters in SSPTable! Examples: Topic Modeling (MCMC): topic-word table Matrix Factorization (SGD): factor matrices L, R Lasso Regression (CD): coefficients β SSPTable supports generic classes of algorithms, with these models as examples
36 SSPTable uses networks efficiently [Chart: time breakdown, compute time vs network waiting time, for BSP; LDA on 32 machines (256 cores), 10% data per iter]
37 SSPTable uses networks efficiently [Chart: time breakdown, compute time vs network waiting time, BSP vs increasing staleness; LDA on 32 machines (256 cores), 10% data per iter] Network communication is a huge bottleneck with many machines SSP balances network and compute time
38 SSPTable vs BSP and Async [Chart: log-likelihood vs seconds; LDA on NYtimes dataset, 32 machines (256 cores), 10% docs per iter; curves: BSP (stale 0), async] NYtimes data: N = 100M tokens, K = 100 topics, V = 100K terms BSP has strong convergence guarantees but is slow Asynchronous is fast but has weak convergence guarantees
39 SSPTable vs BSP and Async [Chart: log-likelihood vs seconds; LDA on NYtimes dataset, 32 machines (256 cores), 10% docs per iter; curves: BSP (stale 0), SSP (stale > 0), async] NYtimes data: N = 100M tokens, K = 100 topics, V = 100K terms BSP has strong convergence guarantees but is slow Asynchronous is fast but has weak convergence guarantees SSPTable is fast and has strong convergence guarantees
40 The Quality vs Quantity tradeoff Quantity: iterations versus time [chart: iterations vs seconds] Quality: objective versus iterations [chart: log-likelihood vs iterations] LDA, 32 machines, 10% data; curves: BSP (stale 0), stale 16, stale 24, stale 48 Progress per time is (iters/sec) * (progress/iter) High staleness yields more iters/sec, but lowers progress/iter Find the sweet spot (staleness > 0) for maximum progress per second
41 The Quality vs Quantity tradeoff [Figure: effect of increasing staleness on the two factors] Progress per time is (iters/sec) * (progress/iter) High staleness yields more iters/sec, but lowers progress/iter Find the sweet spot (staleness > 0) for maximum progress per second
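The sweet-spot claim is just the product of the two factors. A toy calculation with invented numbers (not the measured values from these figures):

```python
# Progress per second = (iterations/sec) * (progress/iteration). Higher
# staleness raises the first factor but lowers the second, so the product
# peaks at an intermediate staleness. All numbers below are invented.
profiles = {
    # staleness: (iterations per second, progress per iteration)
    0:  (1.0, 1.00),
    8:  (1.8, 0.80),
    16: (2.2, 0.60),
    48: (2.5, 0.30),
}
progress_per_sec = {s: ips * ppi for s, (ips, ppi) in profiles.items()}
best = max(progress_per_sec, key=progress_per_sec.get)
```

With these made-up profiles the product is maximized at an intermediate staleness, not at BSP (stale 0) or at very high staleness, which is exactly the shape of the tradeoff the slide describes.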
42 Matrix Factorization (Netflix) [Chart: objective function versus time; MF on 32 machines (256 threads); curves: BSP (stale 0), SSP (stale > 0)] Netflix data: 100M nonzeros, 480K rows, 18K columns, rank 100
43 LASSO (Synthetic) [Chart: objective function versus time; Lasso on 16 machines (128 threads); curves: BSP (stale 0), stale 10, stale 20, stale 40, stale 80] Synthetic data: N = 500 samples, P = 400K features
44 SSPTable scaling with # machines [Charts: log-likelihood vs seconds, and inverse time to convergence vs # machines; LDA on NYtimes dataset; configurations: 32 machines (256 cores), 16 machines (128 cores), 8 machines (64 cores), 4 machines (32 cores), 2 machines (16 cores), 1 machine (8 cores); ideal scaling vs SSP] Double # machines: 78% speedup, converge in 56% of the time The SSP computational model scales with increasing # machines (given a fixed dataset)
45 Recent Results Using 8 machines * 16 cores = 128 threads, 8GB RAM per machine Latent Dirichlet Allocation: NYTimes dataset (100M tokens, 100K words, 10K topics): SSP 00K tokens/s, GraphLab 80K tokens/s PubMed dataset (7.5B tokens, 141K words, 100 topics): SSP .M tokens/s, GraphLab .8M tokens/s Network latent space role modeling: Friendster network sample (9M nodes, 80M edges), 50 roles: SSP takes 4h to converge (vs 5 days on one machine)
46 Future work Theory: SSP for MCMC Automatic staleness tuning Average-case analysis for better bounds Systems: Load balancing Fault tolerance Prefetching Other consistency schemes Applications: Hard-to-parallelize ML models (DNNs, Regularized Bayes, Network Analysis models)
47 Coauthors James Cipar, Henggang Cui, Jin Kyu Kim, Seunghak Lee, Phillip B. Gibbons, Garth A. Gibson, Gregory R. Ganger, Eric P. Xing
48 Workshop Demo SSP is part of a bigger system: Petuum SSP parameter server STRADS dynamic variable scheduler More features in the works We have a demo! Topic modeling (8.2M docs, 7.5B tokens, 141K words, 10K topics) Lasso regression (100K samples, 100M dimensions, 5 billion nonzeros) Network latent space modeling (9M nodes, 80M edges, 50 roles) At the BigLearning 2013 workshop (Monday)
49 Summary Distributed ML is nontrivial: Slow network Unequal machine performance SSP addresses those problems: Efficiently uses network resources; reduces waiting time Allows slow machines to catch up Fast like Async, converges like BSP The SSPTable parameter server provides an easy table interface: Quickly convert single-machine parallel ML algorithms to distributed Slides:
More informationScaled Machine Learning at Matroid
Scaled Machine Learning at Matroid Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Machine Learning Pipeline Learning Algorithm Replicate model Data Trained Model Serve Model Repeat entire pipeline Scaling
More informationNoSQL systems: sharding, replication and consistency. Riccardo Torlone Università Roma Tre
NoSQL systems: sharding, replication and consistency Riccardo Torlone Università Roma Tre Data distribution NoSQL systems: data distributed over large clusters Aggregate is a natural unit to use for data
More informationRACS: Extended Version in Java Gary Zibrat gdz4
RACS: Extended Version in Java Gary Zibrat gdz4 Abstract Cloud storage is becoming increasingly popular and cheap. It is convenient for companies to simply store their data online so that they don t have
More informationTowards the world s fastest k-means algorithm
Greg Hamerly Associate Professor Computer Science Department Baylor University Joint work with Jonathan Drake May 15, 2014 Objective function and optimization Lloyd s algorithm 1 The k-means clustering
More informationAuthors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G.
Authors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G. Speaker: Chong Li Department: Applied Health Science Program: Master of Health Informatics 1 Term
More informationFlat Datacenter Storage. Edmund B. Nightingale, Jeremy Elson, et al. 6.S897
Flat Datacenter Storage Edmund B. Nightingale, Jeremy Elson, et al. 6.S897 Motivation Imagine a world with flat data storage Simple, Centralized, and easy to program Unfortunately, datacenter networks
More informationCase Study 1: Estimating Click Probabilities
Case Study 1: Estimating Click Probabilities SGD cont d AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade March 31, 2015 1 Support/Resources Office Hours Yao Lu:
More informationarxiv: v1 [stat.ml] 31 Dec 2015
Strategies and Principles of Distributed Machine Learning on Big Data Eric P. Xing, Qirong Ho, Pengtao Xie, Wei Dai School of Computer Science, Carnegie Mellon University January 1, 2016 arxiv:1512.09295v1
More informationThe Future of High Performance Computing
The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer
More informationDistributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability
Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Janis Keuper Itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern,
More informationOptimizing Network Performance in Distributed Machine Learning. Luo Mai Chuntao Hong Paolo Costa
Optimizing Network Performance in Distributed Machine Learning Luo Mai Chuntao Hong Paolo Costa Machine Learning Successful in many fields Online advertisement Spam filtering Fraud detection Image recognition
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationYCSB++ Benchmarking Tool Performance Debugging Advanced Features of Scalable Table Stores
YCSB++ Benchmarking Tool Performance Debugging Advanced Features of Scalable Table Stores Swapnil Patil Milo Polte, Wittawat Tantisiriroj, Kai Ren, Lin Xiao, Julio Lopez, Garth Gibson, Adam Fuchs *, Billie
More informationResilient Distributed Datasets
Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,
More informationNative Offload of Haskell Repa Programs to Integrated GPUs
Native Offload of Haskell Repa Programs to Integrated GPUs Hai (Paul) Liu with Laurence Day, Neal Glew, Todd Anderson, Rajkishore Barik Intel Labs. September 28, 2016 General purpose computing on integrated
More informationMachine Learning at the Limit
Machine Learning at the Limit John Canny*^ * Computer Science Division University of California, Berkeley ^ Yahoo Research Labs @GTC, March, 2015 My Other Job(s) Yahoo [Chen, Pavlov, Canny, KDD 2009]*
More informationThe Stratosphere Platform for Big Data Analytics
The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015 Stratosphere Stratosphere Stratosphere Big Data Analytics BIG Data Heterogeneous datasets: structured / unstructured
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationGraphLab: A New Framework for Parallel Machine Learning
GraphLab: A New Framework for Parallel Machine Learning Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein Presented by Guozhang Wang DB Lunch, Nov.8, 2010 Overview
More informationHyperparameter optimization. CS6787 Lecture 6 Fall 2017
Hyperparameter optimization CS6787 Lecture 6 Fall 2017 Review We ve covered many methods Stochastic gradient descent Step size/learning rate, how long to run Mini-batching Batch size Momentum Momentum
More informationCISC 7610 Lecture 2b The beginnings of NoSQL
CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone
More informationAsynchronous Stochastic Gradient Descent on GPU: Is It Really Better than CPU?
Asynchronous Stochastic Gradient Descent on GPU: Is It Really Better than CPU? Florin Rusu Yujing Ma, Martin Torres (Ph.D. students) University of California Merced Machine Learning (ML) Boom Two SIGMOD
More informationMap-Reduce. Marco Mura 2010 March, 31th
Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of
More informationGradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz
Gradient Descent Wed Sept 20th, 2017 James McInenrey Adapted from slides by Francisco J. R. Ruiz Housekeeping A few clarifications of and adjustments to the course schedule: No more breaks at the midpoint
More informationCS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.
CS535 Big Data - Fall 2017 Week 8-A-1 CS535 BIG DATA FAQs Term project proposal New deadline: Tomorrow PA1 demo PART 1. BATCH COMPUTING MODELS FOR BIG DATA ANALYTICS 5. ADVANCED DATA ANALYTICS WITH APACHE
More informationBIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE
BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE BRETT WENINGER, MANAGING DIRECTOR 10/21/2014 ADURANT APPROACH TO BIG DATA Align to Un/Semi-structured Data Instead of Big Scale out will become Big Greatest
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data
More informationMySQL Database Scalability
MySQL Database Scalability Nextcloud Conference 2016 TU Berlin Oli Sennhauser Senior MySQL Consultant at FromDual GmbH oli.sennhauser@fromdual.com 1 / 14 About FromDual GmbH Support Consulting remote-dba
More informationPartitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization
Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization Hung Nghiep Tran University of Information Technology VNU-HCMC Vietnam Email: nghiepth@uit.edu.vn Atsuhiro Takasu National
More informationFLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568
FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected
More informationLitz: Elastic Framework for High-Performance Distributed Machine Learning
Litz: Elastic Framework for High-Performance Distributed Machine Learning Aurick Qiao 1,2, Abutalib Aghayev 2, Weiren Yu 1,3, Haoyang Chen 1, Qirong Ho 1, Garth A. Gibson 2,4, Eric P. Xing 1,2, 1 Petuum,
More informationMEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS
MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS INSTRUCTOR: Dr. MUHAMMAD SHAABAN PRESENTED BY: MOHIT SATHAWANE AKSHAY YEMBARWAR WHAT IS MULTICORE SYSTEMS? Multi-core processor architecture means placing
More informationApache Flink. Alessandro Margara
Apache Flink Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Recap: scenario Big Data Volume and velocity Process large volumes of data possibly produced at high rate
More informationCOMP6237 Data Mining Data Mining & Machine Learning with Big Data. Jonathon Hare
COMP6237 Data Mining Data Mining & Machine Learning with Big Data Jonathon Hare jsh2@ecs.soton.ac.uk Contents Going to look at two case-studies looking at how we can make machine-learning algorithms work
More informationI/O CANNOT BE IGNORED
LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system
More information