Optimizing Network Performance in Distributed Machine Learning. Luo Mai, Chuntao Hong, Paolo Costa

1 Optimizing Network Performance in Distributed Machine Learning Luo Mai Chuntao Hong Paolo Costa

2 Machine Learning. Successful in many fields: online advertisement, spam filtering, fraud detection, image recognition. One of the most important workloads in data centers.

3 Industry-Scale Machine Learning. More data, higher accuracy. Scale of industry problems: 100 billion samples, 1 TB to 1 PB of data; 10 billion parameters, 1 GB to 1 TB of data. Distributed execution: 100s to 1000s of machines.

4 Distributed Machine Learning. [Figure: workers, each with a data partition and a model replica; parameters labeled W1, W2, W3, W4.]

5 Distributed Machine Learning. [Figure: each worker computes a gradient on its model replica from its data partition.]

6 Distributed Machine Learning. 1. Workers push gradients to the parameter server. 2. The parameter server aggregates the gradient for each parameter.

7 Distributed Machine Learning. 3. The parameter server adds the gradients to the parameters (W1 + g1, W2 + g2, W3 + g3, W4 + g4). 4. Workers pull the new parameters.
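The four steps on slides 6 and 7 can be sketched as a toy, in-process parameter server. This is only an illustration under stated assumptions: the class `ParameterServer` and the helper `run_round` are hypothetical names, there is no networking, and each gradient is assumed to be already scaled (for example by a negative learning rate) so that it can simply be added to the parameter, as the slide writes W + g.

```python
from typing import Dict, List

Gradient = Dict[str, float]  # parameter name -> gradient value


class ParameterServer:
    """Toy in-process parameter server (hypothetical, for illustration only)."""

    def __init__(self, params: Dict[str, float]) -> None:
        self.params = dict(params)
        self.pending: List[Gradient] = []

    def push(self, grad: Gradient) -> None:
        # Step 1: a worker pushes its gradient.
        self.pending.append(grad)

    def aggregate_and_update(self) -> None:
        # Step 2: aggregate the gradient for each parameter.
        total = {name: 0.0 for name in self.params}
        for grad in self.pending:
            for name, value in grad.items():
                total[name] += value
        # Step 3: add the aggregated gradients to the parameters.
        for name in self.params:
            self.params[name] += total[name]
        self.pending.clear()

    def pull(self) -> Dict[str, float]:
        # Step 4: workers pull a copy of the new parameters.
        return dict(self.params)


def run_round(server: ParameterServer, worker_grads: List[Gradient]) -> Dict[str, float]:
    """Run one synchronous round: push, aggregate, update, pull."""
    for grad in worker_grads:
        server.push(grad)
    server.aggregate_and_update()
    return server.pull()
```

With two workers pushing `{"w1": 0.5, "w2": -1.0}` and `{"w1": 0.25, "w2": 0.5}` against initial parameters `{"w1": 1.0, "w2": 2.0}`, every push arrives at the same server, which is the traffic pattern the following slides identify as a bottleneck.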

8 Distributed Machine Learning. Use multiple parameter servers (PS) to avoid a bottleneck. [Figure: parameters W1 to W4 spread over several parameter servers, above workers with model replicas and data partitions.]

9 Distributed Machine Learning. [Figure: parameter servers above workers with model replicas and data partitions; label: Bottleneck.]
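One way to picture the use of multiple parameter servers is to split each worker's gradient into one push message per server. The sketch below assumes round-robin placement by parameter index; the slides do not say how parameters are placed, and `shard_for` and `split_gradient` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List


def shard_for(param_index: int, num_servers: int) -> int:
    # Assumption: round-robin placement of parameters by index.
    return param_index % num_servers


def split_gradient(grad: List[float], num_servers: int) -> Dict[int, Dict[int, float]]:
    """Split one worker's gradient into per-server push messages.

    Returns a mapping: server id -> {parameter index: gradient value}.
    """
    messages: Dict[int, Dict[int, float]] = defaultdict(dict)
    for index, value in enumerate(grad):
        messages[shard_for(index, num_servers)][index] = value
    return dict(messages)
```

Each server then receives only its share of every push, which spreads the inbound load across more machines.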

10 Inbound Congestion. [Figure: network core; label: Inbound congestion.]

11 Outbound Congestion. [Figure: network core; label: Outbound congestion.]

12 Network Core Congestion. Congestion in the core in the case of over-subscribed networks. [Figure: over-subscribed network core.]

13 Existing Approaches. Over-provisioning the network: expensive, limited deployment scale, not available in public clouds. [Figure: stack with Training algorithm on top of Fast network H/W, e.g., InfiniBand and RoCE.]

14 Existing Approaches. Over-provisioning the network: expensive, limited deployment scale, not available in public clouds. Asynchronous training algorithm: training efficiency, might not converge. [Figure: stack with Asynchronous training algorithm on top of Network H/W.]

15 Rethinking the Network Design. MLNet is a communication layer designed for distributed machine learning systems. It improves communication efficiency and is orthogonal to existing approaches. [Figure: stack with Training algorithm, MLNet, Network H/W.]

16 Rethinking the Network Design. MLNet is a communication layer designed for distributed machine learning systems. It improves communication efficiency and is orthogonal to existing approaches. Optimizations: traffic reduction and flow prioritization. [Figure: stack with Training algorithm, MLNet, Network H/W.]

17 Traffic Reduction

18 Traffic Reduction: Key Insight. The parameter server aggregates the gradients from 6 workers: $g_1 = g_{11} + g_{12} + g_{13} + g_{14} + g_{15} + g_{16}$. Aggregation is commutative and associative.

19 Traffic Reduction: Key Insight. Aggregate the gradients from 6 workers as two partial sums, $g_{11} + g_{12} + g_{13}$ and $g_{14} + g_{15} + g_{16}$. Aggregating gradients incrementally does not change the final result.
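The key insight can be checked with a small sketch: summing six gradients directly gives the same vector as summing two partial sums of three. The gradient values below are made up for illustration and chosen to be exactly representable in floating point; with arbitrary floats the two orders may differ by rounding error, so the comparison uses `math.isclose`.

```python
import math
from typing import List, Sequence


def add_vectors(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Element-wise sum of two gradient vectors."""
    return [x + y for x, y in zip(a, b)]


def aggregate(grads: Sequence[Sequence[float]]) -> List[float]:
    """Sum a list of gradient vectors element-wise."""
    total = [0.0] * len(grads[0])
    for g in grads:
        total = add_vectors(total, g)
    return total


# Six hypothetical worker gradients (g11 .. g16), two parameters each.
grads = [[1.0, 2.0], [0.5, -1.0], [0.25, 0.0], [2.0, 1.5], [-1.0, 0.5], [0.75, 3.0]]

direct = aggregate(grads)           # g11 + g12 + ... + g16 in one place
partial_a = aggregate(grads[:3])    # g11 + g12 + g13
partial_b = aggregate(grads[3:])    # g14 + g15 + g16
incremental = add_vectors(partial_a, partial_b)

same = all(math.isclose(x, y) for x, y in zip(direct, incremental))
```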

20 Traffic Reduction: Design. Intercept the push message from the worker to the PS.

21 Traffic Reduction: Design. Redirect the messages to a local worker for partial aggregation.

22 Traffic Reduction: Design. Send the partial results to the PS for final aggregation.
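The three design steps can be sketched by comparing how many messages reach the PS with and without a local aggregator. This is a simplified model, not MLNet's implementation: workers are grouped by rack in a plain dictionary, one aggregator per rack is assumed, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

RackGrads = Dict[str, List[List[float]]]  # rack name -> gradients of its workers


def aggregate(grads: List[List[float]]) -> List[float]:
    """Sum a list of gradient vectors element-wise."""
    total = [0.0] * len(grads[0])
    for g in grads:
        total = [x + y for x, y in zip(total, g)]
    return total


def push_direct(racks: RackGrads) -> Tuple[List[float], int]:
    """Baseline: every worker pushes to the PS. Returns (sum, messages into PS)."""
    all_grads = [g for rack in racks.values() for g in rack]
    return aggregate(all_grads), len(all_grads)


def push_with_rack_aggregation(racks: RackGrads) -> Tuple[List[float], int]:
    """Pushes are redirected to one local worker per rack for partial aggregation;
    only the partial results are sent to the PS for final aggregation."""
    partials = [aggregate(rack) for rack in racks.values()]
    return aggregate(partials), len(partials)
```

For two racks of three workers each, the PS receives 2 messages instead of 6 while the aggregated gradient is unchanged.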

23 More details in the paper: 1. Traffic reduction in pull requests. 2. Asynchronous communication.

24 Traffic Prioritization

25 Traffic Prioritization: Key Insight. These four TCP flows share a bottleneck link, and each of them gets 25% of its bandwidth. [Figure: flows of Job 1, Job 2, Job 3, and Job 4 sharing one link.]

26 Traffic Prioritization: Key Insight. All flows are delayed! TCP per-flow fairness is harmful in distributed machine learning. [Figure: flow completion time (FCT) of Job 1 to Job 4, training Model 1 to Model 4; average completion time is 4.]

27 Traffic Prioritization: Key Insight. MLNet prioritizes the competing flows to minimize the average training time. [Figure: flows of Job 1, Job 2, Job 3, and Job 4.]

28 Traffic Prioritization: Key Insight. Shortening the average FCT can greatly improve the average training time. [Figure: flow completion time (FCT) of Job 1 to Job 4, training Model 1 to Model 4; average completion time is 2.]
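The effect of fairness versus prioritization on completion time can be sketched with an idealized single-link model. The assumptions are loud ones: flows are fluid, fair sharing splits the link equally among active flows, and prioritization serves flows one at a time, shortest first. The slides do not state MLNet's actual priority policy, and the numbers produced here come from this simplified model rather than from the figures on the slides.

```python
from typing import List


def fair_share_completion_times(sizes: List[float], capacity: float = 1.0) -> List[float]:
    """Completion time of each flow when all active flows share the link equally."""
    order = sorted((size, i) for i, size in enumerate(sizes))
    finish = [0.0] * len(sizes)
    now = 0.0
    sent = 0.0  # amount every still-active flow has transferred so far
    active = len(sizes)
    for size, i in order:
        # Each active flow progresses at capacity / active until the smallest one finishes.
        now += (size - sent) * active / capacity
        sent = size
        finish[i] = now
        active -= 1
    return finish


def prioritized_completion_times(sizes: List[float], capacity: float = 1.0) -> List[float]:
    """Completion times when flows get the full link one at a time, shortest first."""
    finish = [0.0] * len(sizes)
    now = 0.0
    for i in sorted(range(len(sizes)), key=lambda k: sizes[k]):
        now += sizes[i] / capacity
        finish[i] = now
    return finish


def average(values: List[float]) -> float:
    return sum(values) / len(values)
```

For four equal flows of size 1 on a unit-capacity link, fair sharing finishes every flow at time 4, while serving them one at a time finishes them at 1, 2, 3, and 4: the last flow is no later, and the average drops.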

29 Evaluation. Simulate a common data center network topology: classic 10 Gbps, 1024-node data center topology [Fat-Tree, SIGCOMM 08]. Train large-scale logistic regression: 65B parameters, 141 TB dataset [Parameter Server, OSDI 14]; 800 workers [Parameter Server, OSDI 14]. With production trace: data processing rate uniform(100, 200) MBps; synchronize every 30 seconds.

30 Traffic Reduction (Non-oversubscribed Net.). [Figure: training time (hours) vs. number of parameter servers for Rack and Baseline; axes annotated Worse/Better and Cost-effective/Expensive.]

31 Traffic Reduction (Non-oversubscribed Net.). Same figure. Rack reduces completion time by 48%.

32 Traffic Reduction (Non-oversubscribed Net.). Same figure. Deploying more parameter servers resolves edge network bottlenecks.

33 Traffic Reduction (Non-oversubscribed Net.). Same figure. Deploying more parameter servers to reduce training time (1) uses more machines and (2) is only possible with non-oversubscribed networks.

34 Traffic Reduction (1:4 Oversubscribed Net.). [Figure: training time (hours) vs. number of parameter servers for Rack and Baseline; axes annotated Worse/Better and Cost-effective/Expensive.] MLNet reduces congestion in the network core and reduces training time by more than 70%.

35 Traffic Prioritization. 20 jobs running in the same cluster. [Figure: CDF of training time (hours) for Baseline and Prioritization.] Everyone finishes at (almost) the same time.

36 Traffic Prioritization. [Figure: CDF of training time (hours) for Baseline and Prioritization; axis annotated Better/Worse.] Prioritization improves the median by 25% and delays the tail by 2%.

37 Traffic Prioritization + Traffic Reduction. [Figure: CDF of training time (hours) for Baseline, Reduction, and Priori. + Red.; axis annotated Better/Worse.] Improves the median by 60% and the tail by 54%.

38 More details in the paper: 1. Binary tree aggregation. 2. More analysis.

39 Summary. MLNet can significantly improve the network performance of distributed machine learning: traffic reduction, flow prioritization, drop-in solution.

40 Thanks!

41 Discussion. Relaxed fault tolerance? When a worker fails, drop that portion of the data. Adaptive communication: reduce synchronization when the network is busy? Hybrid network infrastructure? Some with 10GE, some with 40GE RoCE, etc. Degree of the tree?

42 Traffic Reduction: Design. Is the local aggregator a new bottleneck? Example: 15 workers in a rack.

43 Traffic Reduction: Design. Build a balanced aggregation structure such as a binary tree. Example: 15 workers in a rack with binary tree aggregation.
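The 15-worker example can be sketched as a binary tree laid out heap-style over the workers, so that no worker receives more than two partial sums, compared with 14 for a single rack aggregator. The layout (worker i has children 2i + 1 and 2i + 2) and the function name `tree_aggregate` are assumptions made for illustration; the slides do not describe how MLNet builds the tree.

```python
from typing import List, Tuple


def tree_aggregate(grads: List[List[float]]) -> Tuple[List[float], int]:
    """Aggregate gradients up a binary tree laid out heap-style over the workers.

    Worker i receives partial sums from workers 2*i + 1 and 2*i + 2 (if present),
    adds them to its own gradient, and forwards the result to its parent.
    Returns the root's total and the largest number of messages any worker receives.
    """
    n = len(grads)
    partial = [list(g) for g in grads]
    max_fan_in = 0
    # Visit workers from last to first so children are complete before their parent.
    for i in range(n - 1, -1, -1):
        children = [c for c in (2 * i + 1, 2 * i + 2) if c < n]
        for c in children:
            partial[i] = [x + y for x, y in zip(partial[i], partial[c])]
        max_fan_in = max(max_fan_in, len(children))
    return partial[0], max_fan_in
```

The root's total is what would be sent to the PS. The tree trades a lower fan-in per worker for more hops inside the rack.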

44 Traffic Reduction. [Figure: training time (hours) vs. number of parameter servers for Rack, Binary, and Baseline; axes annotated Worse/Better and Cost-effective/Expensive.]

45 Traffic Reduction (Non-oversubscribed Net.). Same figure.

46 Traffic Reduction (Non-oversubscribed Net.). Same figure. Binary Tree and Rack reduce completion time by 78% and 48%, respectively.

47 Traffic Reduction (Non-oversubscribed Net.). Same figure. Deploying more parameter servers resolves edge network bottlenecks.

48 Traffic Reduction (Non-oversubscribed Net.). Same figure. Deploying more parameter servers to reduce training time (1) needs more machines and (2) is only possible with non-oversubscribed networks.

49 Traffic Reduction (1:4 Oversubscribed Net.). [Figure: training time (hours) vs. number of parameter servers for Rack, Binary, and Baseline; axes annotated Worse/Better and Cost-effective/Expensive.]

50 Traffic Reduction (1:4 Oversubscribed Net.). Same figure. MLNet reduces congestion in the network core.

51 Traffic Reduction (1:4 Oversubscribed Net.). Same figure. Binary consistently consumes more bandwidth than Rack.

52 Example: Training a Neural Network. Steps: random init weights, calculate error/gradient, update weights. [Figure labels: W: {w1, w2, w3, w4}; G: {g1, g2, g3, g4}; W': {w1', w2', w3', w4'}; Truth: {cat, dog, cat, }.]
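The three steps on this slide (random init, calculate error/gradient, update weights) can be sketched as a plain gradient-descent loop. To stay short, the sketch swaps the neural network for a single linear unit trained with mean squared error; the function name `train`, the learning rate, the step count, and the sample data are all assumptions for illustration.

```python
import random
from typing import List, Tuple

Sample = Tuple[List[float], float]  # (input features, target value)


def train(samples: List[Sample], steps: int = 200, lr: float = 0.1, seed: int = 0) -> List[float]:
    """Fit a single linear unit y = w . x by gradient descent on mean squared error."""
    rng = random.Random(seed)
    dim = len(samples[0][0])
    # Random init weights.
    w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    for _ in range(steps):
        # Calculate the error and the gradient G of the mean squared error.
        g = [0.0] * dim
        for x, y in samples:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(dim):
                g[j] += 2.0 * err * x[j] / len(samples)
        # Update weights: W' = W - lr * G.
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w
```

In the distributed setting of the earlier slides, the gradient computation inside the loop is what each worker runs on its own data partition, and the weight update is what the parameter server performs.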

53 Example: Neural Network Model. [Figure: a model with weights W1 to W4 is trained and then applied to an input; output: Dog: 99%, Cat: 1%.]

54 Model Training. [Figure: a randomly initialized model with weights W1 to W4 is refined step by step until it converges to the final model.]
