Scalable Distributed Training with Parameter Hub: a whirlwind tour
Transcription
1 Scalable Distributed Training with Parameter Hub: a whirlwind tour
2 TVM Stack [diagram: Optimization layers (High-Level Differentiable IR, Tensor Expression IR, AutoTVM) over backends (LLVM, CUDA, Metal; VTA with AutoVTA) targeting a hardware fleet (Edge FPGA, Cloud FPGA, ASIC)]
3 [The same TVM stack diagram, extended with Your Cloud and Active Topology Probing.] Groundwork for bringing TVM to the distributed world for training and inference, on commercial cloud, or in your own cluster.
4 Parameter Hub: an optimized, topology-aware, and dynamic mechanism for inter-machine communication, in the cloud-based training context. Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, and Arvind Krishnamurthy.
6 Deep learning constitutes an important workload in the cloud today; major cloud providers all have an ecosystem for cloud learning.
8 Server demand for DL inference across data centers nearly quadrupled in less than 2 years. (Source: Facebook)
10 EC2 reclaims your GPU instances as it runs out of capacity.
12 Distributed Training: independent forward/backward passes + coordinated parameter exchange. [Timeline: the parameter server runs A1 O1 A2 O2 while Worker 1 and Worker 2 each run F1 B1 F2 B2 F3 B3.] Legend: (F)orward pass, (B)ackward pass, (A)ggregation, (O)ptimization.
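To make the exchange concrete, below is a minimal single-process sketch of one synchronous data-parallel step. The F/B/A/O structure follows the slide; the parameter count, learning rate, and stand-in gradients are illustrative assumptions, not PHub's API.

    // Synchronous data-parallel training with a central parameter server:
    // workers produce gradients (F+B), the server aggregates (A) and optimizes (O).
    #include <cstdio>
    #include <vector>

    int main() {
        const int kWorkers = 2, kParams = 4;           // toy sizes (assumed)
        std::vector<float> weights(kParams, 1.0f);     // replicated on every worker
        const float lr = 0.1f;

        for (int step = 0; step < 3; ++step) {
            // Each worker: (F)orward + (B)ackward on its own minibatch.
            std::vector<std::vector<float>> grads(kWorkers, std::vector<float>(kParams));
            for (int w = 0; w < kWorkers; ++w)
                for (int p = 0; p < kParams; ++p)
                    grads[w][p] = 0.01f * (w + 1);     // stand-in for real backprop

            // Parameter server: (A)ggregate the gradients from all workers...
            std::vector<float> agg(kParams, 0.0f);
            for (int w = 0; w < kWorkers; ++w)
                for (int p = 0; p < kParams; ++p)
                    agg[p] += grads[w][p];

            // ...then (O)ptimize and broadcast the new weights to the workers.
            for (int p = 0; p < kParams; ++p)
                weights[p] -= lr * agg[p] / kWorkers;
            std::printf("step %d: weights[0] = %f\n", step, weights[0]);
        }
    }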
14 Distributed Training Today, in the context of the cloud. [Diagram: machines with GPUs and plain machines attached to the network core.]
15 Distributed Training Today: forward and backward passes run in the workers. [Diagram: Worker 1, Worker 2, PS 1, PS 2 around the network core.]
16 Distributed Training Today: aggregation and optimization run in the parameter servers. [Same diagram.]
17 Distributed training is communication bound, and the problem gets worse over time: the bottleneck shifts to the network. With modern GPUs most of the time is spent on communication, so making GPUs faster will do little to increase throughput; it only wastes compute resources. [Chart: per-iteration seconds for ResNet 269 on GRID 520, K80, M60, and V100, split into GPU idle waiting on network vs. GPU and network active.]
18 Distributed training is communication bound. [Chart: the same breakdown for AlexNet, ResNet 269, Inception V3, and GoogleNet.]
19 Bottlenecks in DDNN training: the mapping of the training workload to the cloud is inefficient. [Diagram: workers and parameter servers around the network core.]
20 Bottlenecks in DDNN training: framework bottlenecks. [Diagram: inside each worker, the path GPU → training framework → network.]
22 Bottlenecks in DDNN training: framework bottlenecks. [Chart: per-iteration seconds for ResNet 269, Inception, GoogleNet, and AlexNet, broken down into compute; data copy and communication; aggregator; optimizer; synchronization and other overheads.]
24 Bottlenecks in DDNN training: the mapping of the training workload to the cloud is inefficient. [Same diagram as slide 19.]
25 Bottlenecks in DDNN training: bandwidth bottleneck. [Same diagram.]
26 Bottlenecks in cloud-based DDNN training: insufficient bandwidth. What is the minimum bandwidth required for each popular NN so that communication does not bottleneck computation? (Measured with 8 workers, GTX 1080 Ti, central parameter servers, MXNet.) GoogleNet / Inception: 40 Gbps. ResNet: 100 Gbps. AlexNet: 1200 Gbps. Cloud bandwidth today: 10-25 Gbps.
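The arithmetic behind these figures is straightforward: with central parameter servers, each worker pushes its gradients and pulls fresh weights every iteration, roughly twice the model size on the wire per worker per step. A back-of-envelope sketch; the model sizes and per-iteration compute times below are rough placeholder assumptions chosen to be consistent with the slide (the talk's exact numbers come from measured MXNet runs):

    // required_bandwidth ~= 2 * model_size / per-iteration compute time
    #include <cstdio>

    int main() {
        struct Net { const char* name; double model_mb; double compute_ms; };
        // model_mb: parameter/gradient size; compute_ms: one F+B pass on a
        // GTX 1080 Ti. Both columns are placeholders, not measurements.
        Net nets[] = { {"GoogleNet", 27.0, 11.0},
                       {"ResNet-269", 390.0, 62.0},
                       {"AlexNet", 240.0, 3.2} };
        for (const Net& n : nets) {
            double gbits_per_iter = 2.0 * n.model_mb * 8.0 / 1000.0;  // push + pull
            double gbps = gbits_per_iter / (n.compute_ms / 1000.0);
            std::printf("%-10s needs ~%.0f Gbps per worker\n", n.name, gbps);
        }
    }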
31 Bottlenecks in cloud-based DDNN training: the mapping of the training workload to the cloud is inefficient. [Same diagram.]
32 Bottlenecks in cloud-based DDNN training: deployment-related overhead. Transient congestion, or oversubscription by design, makes cross-rack communication more expensive than intra-rack communication. [Chart: measured bandwidth between hosts in Cluster 1 and Cluster 2, dropping to about 4 Gbps across racks.]
34 Parameter Hub optimizations: codesigning software, hardware, and cluster configuration for efficient cloud-based DDNN training.
35 Eliminating framework bottlenecks. PHub optimizations: streamlining the DDNN training pipeline (GPU → data copy → aggregation → optimization → network).
37 Software optimizations: gradient memory. [Diagram: gradient buffers in CPU memory on the PS side, workers and PSs around the network core.]
39 Software optimizations: gradient aggregation and optimization. Four ways to organize the cores (contrasted in the sketch below):
- Each core reads the input queue from a different worker and writes to a different location in the output queue. Requires synchronization.
- Wide aggregation (used in MXNet): for each input queue, launch a series of threads for aggregation. Too much coherence and synchronization.
- Tall aggregation: sequentially aggregate the same portion of gradients within each queue. Great locality, no synchronization.
- NUMA-aware tree reduction: organize processors into a hierarchy (NUMA 0 / NUMA 1). Great locality, no synchronization.
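A minimal sketch contrasting wide vs. tall for W worker buffers summed on C cores; the sizes and thread layout are illustrative, not PHub's implementation:

    #include <thread>
    #include <vector>

    constexpr int W = 4, C = 4, N = 1 << 20;   // workers, cores, floats per gradient

    // Wide: all C threads cooperate on one worker's buffer at a time; threads
    // must rendezvous (join) after every worker, and the shared output buffer
    // is touched by every core.
    void wide(const std::vector<std::vector<float>>& g, std::vector<float>& out) {
        for (int w = 0; w < W; ++w) {
            std::vector<std::thread> ts;
            for (int c = 0; c < C; ++c)
                ts.emplace_back([&, c] {
                    for (int i = c * (N / C); i < (c + 1) * (N / C); ++i)
                        out[i] += g[w][i];
                });
            for (auto& t : ts) t.join();       // synchronization point per worker
        }
    }

    // Tall: each core owns a fixed chunk and sums that chunk across all
    // workers by itself -- no sharing between cores, one join at the end.
    void tall(const std::vector<std::vector<float>>& g, std::vector<float>& out) {
        std::vector<std::thread> ts;
        for (int c = 0; c < C; ++c)
            ts.emplace_back([&, c] {
                for (int w = 0; w < W; ++w)
                    for (int i = c * (N / C); i < (c + 1) * (N / C); ++i)
                        out[i] += g[w][i];
            });
        for (auto& t : ts) t.join();
    }

    int main() {
        std::vector<std::vector<float>> g(W, std::vector<float>(N, 1.0f));
        std::vector<float> out(N, 0.0f);
        tall(g, out);                          // swap in wide(g, out) to compare
    }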
41 Software optimizations: tall aggregation and optimization, with core mappings. [Diagram: the gradient array for key 0 from 8 workers, chunked and aggregated.]
- Chunk a gradient into a series of virtual gradients deterministically.
- Each virtual gradient is mapped to a particular core on the server.
- Virtual gradients are transferred independently.
- A chunk is only processed by a single core, maintaining maximum locality.
A minimal mapping sketch follows the list.
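A simple hash suffices for the deterministic mapping: every machine computes the same (key, chunk) → core assignment locally, so a virtual gradient can travel independently and always lands on its owning core. The hash here is an assumption for illustration, not PHub's actual scheme:

    #include <cstdint>
    #include <cstdio>

    struct VGrad { uint32_t key; uint32_t chunk; };  // a "virtual gradient"

    // Deterministic: the same (key, chunk) maps to the same core everywhere.
    uint32_t core_of(VGrad v, uint32_t num_cores) {
        return (v.key * 2654435761u + v.chunk) % num_cores;
    }

    int main() {
        for (uint32_t c = 0; c < 4; ++c)             // chunks of key 0
            std::printf("key 0, chunk %u -> core %u\n", c, core_of({0, c}, 10));
    }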
45 Software optimizations: tall aggregation and optimization. When aggregation of a chunk is done:
- PHub optimizes the chunk with the same core that aggregated it.
- FP32-level streaming aggregation and optimization hides communication latency.
[Diagram: the gradient array for key 0 from 8 workers, now aggregated and optimized.] A sketch of this per-chunk flow appears below.
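A sketch of the per-chunk flow, under the assumption of a plain SGD optimizer and a per-chunk arrival counter (PHub's real pipeline interleaves this with the network at finer granularity):

    #include <vector>

    struct Chunk {
        std::vector<float> agg;   // running sum for this chunk
        int arrived = 0;          // workers that have contributed so far
    };

    // Called on the chunk's owning core whenever a worker's piece arrives.
    void on_gradient(Chunk& c, const float* grad, size_t n,
                     std::vector<float>& weights, size_t offset,
                     int num_workers, float lr) {
        for (size_t i = 0; i < n; ++i) c.agg[i] += grad[i];   // streaming aggregate
        if (++c.arrived == num_workers) {
            // Last piece is in: optimize on the SAME core, data still in cache.
            for (size_t i = 0; i < n; ++i)
                weights[offset + i] -= lr * c.agg[i] / num_workers;
            c.agg.assign(c.agg.size(), 0.0f);                 // reset for next step
            c.arrived = 0;
        }
    }

    int main() {
        const int workers = 8; const size_t n = 4;
        Chunk c{std::vector<float>(n, 0.0f)};
        std::vector<float> w(n, 1.0f), g(n, 0.1f);
        for (int k = 0; k < workers; ++k)
            on_gradient(c, g.data(), n, w, 0, workers, 0.5f);
    }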
48 Eliminating deployment bottlenecks. PHub hierarchical reduction: reducing cross-rack traffic.
50 Two-phase hierarchical aggregation: rack-scale parameter service. [Diagram: in each rack, worker/PS nodes connect to a PBox aggregator; racks connect through the cluster network.]
52 Two-phase hierarchical aggregation: adapting to the datacenter network topology. 1. Intra-rack central aggregation at each rack's aggregator. 2. Inter-rack aggregation between aggregators. Result: N times traffic reduction on cross-rack links. (Sketched below.)
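A minimal sketch of the two phases; with N nodes per rack, only one aggregated gradient leaves each rack instead of N, which is where the N-times reduction comes from. The rack layout and values are illustrative:

    #include <cstdio>
    #include <vector>

    using Grad = std::vector<float>;

    Grad reduce(const std::vector<Grad>& gs) {       // elementwise sum
        Grad out(gs[0].size(), 0.0f);
        for (const Grad& g : gs)
            for (size_t i = 0; i < g.size(); ++i) out[i] += g[i];
        return out;
    }

    int main() {
        // racks[r][n] = gradient from node n in rack r (toy values).
        std::vector<std::vector<Grad>> racks = {
            {Grad{1, 1}, Grad{2, 2}, Grad{3, 3}},
            {Grad{4, 4}, Grad{5, 5}, Grad{6, 6}},
        };
        // Phase 1: intra-rack central aggregation (traffic stays in the rack).
        std::vector<Grad> per_rack;
        for (const auto& r : racks) per_rack.push_back(reduce(r));
        // Phase 2: inter-rack aggregation -- one gradient per rack crosses
        // the oversubscribed core links.
        Grad global = reduce(per_rack);
        std::printf("global[0] = %.0f\n", global[0]);  // 21
    }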
55 Efficient DDNN training in commercial cloud: active topology probing. Pipeline: a DPDK-based latency probe runs across the VMs (Azure/EC2) → distance matrix → clustering algorithms → inferred network topology* → automagic schedule generation → hierarchical reduction plan.
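A sketch of what the probing stage computes: measure pairwise latencies, cluster VMs whose mutual latency falls below a threshold (near VMs likely share a rack), and feed the clusters into the reduction plan. The latencies, the threshold, and the greedy clustering are all assumptions for illustration:

    #include <cstdio>
    #include <vector>

    int main() {
        // Distance matrix from the latency probe, in microseconds (made up).
        std::vector<std::vector<double>> d = {
            {0, 20, 90, 95},
            {20, 0, 92, 88},
            {90, 92, 0, 25},
            {95, 88, 25, 0}};
        const double same_rack_us = 50.0;            // clustering threshold (assumed)

        int n = (int)d.size(), racks = 0;
        std::vector<int> rack(n, -1);
        for (int i = 0; i < n; ++i) {                // greedy single-link clustering
            if (rack[i] >= 0) continue;
            rack[i] = racks++;
            for (int j = i + 1; j < n; ++j)
                if (rack[j] < 0 && d[i][j] < same_rack_us) rack[j] = rack[i];
        }
        // The inferred grouping drives the schedule: phase 1 reduces within
        // each cluster, phase 2 across the cluster leaders.
        for (int i = 0; i < n; ++i)
            std::printf("VM %d -> inferred rack %d\n", i, rack[i]);
    }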
63 Performance in commercial cloud with PHub, vs. Facebook Gloo and vs. ring reduction. [Charts: results on Azure (Standard NC6) and EC2 (P3.2xlarge).] Setup: Windows Azure and Amazon EC2, 32 instances, up to 10 Gbps. Standard_NC6: Nvidia K80, batch size 512. P3.2xlarge: Nvidia V100, batch size 512. Facebook Caffe2/PyTorch, ResNet.
64 Framework integration: support for MXNet/PyTorch/Caffe2.
    // Create a PHub instance from the cluster and application configuration.
    auto phub = std::make_shared<phub>(cfg.redisip, nmap, keysize, appaddrs,
                                       cntr, sizeof(float), cfg.rank, plp);
    phub->toggleuseschedule(pschedule); // opt in to the reduction schedule
    phub->reduce();                     // run a gradient reduction
65 [Closing slide: the TVM stack diagram again, with Your Cloud and Active Topology Probing.] Groundwork for bringing TVM to the distributed world for training and inference, on commercial cloud, or in your own cluster.
68 Hardware Parameter Hub
70 Hardware Parameter Hub: balanced computation and communication resources. [Diagram: ConnectX-3 cards providing 560+ Gbps of network bandwidth against 800 Gbps of PCIe.] Fully supported by the software Parameter Hub.
71 Hardware Parameter Hub: 35 GB/s aggregation throughput; supports 100+ ResNet-50 training nodes with a single machine. [Chart: aggregation throughput of Gloo HD, Gloo Ring, PS-Lite, and PHub SW.]
72 Hardware Parameter Hub: 25% better training throughput/$. ResNet-50; see paper for detailed estimates.
Paper: Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, and Arvind Krishnamurthy. Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training. University of Washington and Microsoft Research. arXiv cs.DC, 21 May 2018.