Evaluating On-Node GPU Interconnects for Deep Learning Workloads
Nathan Tallent, Nitin Gawande, Charles Siegel, Abhinav Vishnu, Adolfy Hoisie
Pacific Northwest National Laboratory
PMBS 2017 (@ SC), November 13, 2017

Acknowledgements
- Center for Advanced Technology Evaluation (ASCR)
- Performance Prediction & Diagnosis for Extreme Scientific Workflows (ASCR)

Scaling Deep Learning: Increasingly Important
- Scaling some workloads requires a high-performance interconnect.
- Motivating example: KNL/Omni-Path vs. DGX-1 (NVLink 1.0). What is the scaling behavior given workload and interconnect?
- Single-KNL and single-GPU performance are very similar, despite the GPU's higher peak!
- DGX-1: better absolute performance, but the scaling behavior is quite different:
  - With Omni-Path, CifarNet scales better than AlexNet.
  - With NVLink, AlexNet scales better than CifarNet.
- AlexNet's much larger all-to-all reduction operations stress interconnect bandwidth.
[Figure: iterations/second vs. KNL nodes or DGX-1 GPUs (1, 2, 4, 8) for CifarNet/CIFAR-10 (batch size 192) and AlexNet/ImageNet (batch size 256)]

Which On-Node GPU Interconnect Is Best for Me?
Our focus: scaling deep learning across on-node GPUs.
- Is a high-performance interconnect (e.g., NVIDIA NVLink) required?
- Are PCIe-based interconnects adequate?
- How dependent is the answer on my workload?
The answers are not obvious!
[Figure: topologies of the NVIDIA DGX-1 (NVLink 1.0: 20 GB/s each direction per link, connecting GPUs 0-7, with PCIe/InfiniBand to CPUs 0-1) and the Cirrascale GX8 (two Cirrascale SR3615 PCIe risers; PCIe 3.0: 16 GB/s each direction)]

On-Node GPU Networks: DGX-1 vs. GX8
DGX-1 appears to offer much higher performance.
- DGX-1 (NVLink 1.0: 20 GB/s each direction, per link):
  - Hybrid cube mesh: two (fully connected) 4-GPU meshes
  - Each GPU: 4 links = 80 GB/s (unidirectional)
- GX8 (two-level PCIe tree):
  - Two (fully connected) 4-GPU clusters, one per Cirrascale SR3615 PCIe riser
  - Each GPU: 16 GB/s (unidirectional); PCIe upstream: 16 GB/s
[Figure: DGX-1 topology (stylized to avoid crossing links: GPU 0-GPU 4) and GX8 topology with two SR3615 risers; PCIe 3.0: 16 GB/s each direction]
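To make the hop-count structure concrete, here is a minimal sketch (an illustration, not the authors' code) of the DGX-1 hybrid cube mesh as described above: two fully connected quads plus one inter-quad link per GPU. The pairing 0-4, 1-5, 2-6, 3-7 is assumed from the stylized figure.

```python
# Sketch of the DGX-1 NVLink 1.0 hybrid cube mesh: two fully connected
# 4-GPU quads, plus one assumed inter-quad link per GPU (0-4, 1-5, 2-6, 3-7),
# giving each GPU 4 links of 20 GB/s each direction.
QUADS = [{0, 1, 2, 3}, {4, 5, 6, 7}]
links = {g: set() for g in range(8)}
for quad in QUADS:
    for g in quad:
        links[g] |= quad - {g}           # fully connected within a quad
for g in range(4):
    links[g].add(g + 4)                  # inter-quad (cube) edges
    links[g + 4].add(g)

assert all(len(peers) == 4 for peers in links.values())  # 4 links per GPU
print(links[0])  # {1, 2, 3, 4}; e.g., GPU 0 -> GPU 6 requires 2 hops
```

This is why the DGX-1 memcopy results later cluster into exactly two groups: every GPU pair is either 1 hop (direct NVLink) or 2 hops apart.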

Outline of Deep Learning Workload
Outline of the deep-learning training algorithm (a data-parallel sketch follows below):
- Replicate the neural-network architecture on each GPU.
- For each batch in the image data set:
  - Distribute images among GPUs (data parallel).
  - Process images → activations → parameters (per GPU); activations: floating-point operations.
  - Synchronize parameters: all-to-all reduction (allreduce).
- Use NCCL for GPU collectives:
  - NCCL: NVIDIA Collective Communications Library
  - Topology-aware and interconnect-aware rings, optimized for throughput (pipelined)
Train on the ImageNet dataset:
- ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
- Well-known benchmark for object classification and detection
Workloads:
- AlexNet (high communication)
- GoogLeNet (high compute)
- ResNet/x: everything in between & more
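A minimal sketch of that loop, assuming PyTorch's torch.distributed with the NCCL backend (the talk does not name a framework; the function and variable names here are illustrative):

```python
# Data-parallel training step: per-GPU forward/backward, then an NCCL
# allreduce to synchronize gradients across replicas. Assumes
# dist.init_process_group("nccl") has been called, one process per GPU,
# and that the loader yields tensors already on this rank's GPU.
import torch
import torch.distributed as dist

def train_one_epoch(model, loader, optimizer, loss_fn):
    world = dist.get_world_size()          # number of model replicas (GPUs)
    for images, labels in loader:          # each rank gets its shard of the batch
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)  # forward: activations (FLOPs)
        loss.backward()                        # backward: per-GPU gradients
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # all-to-all reduction
            p.grad /= world                                # average across GPUs
        optimizer.step()
```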

Parameterize ResNet: Control Compute Intensity
- Parameterized workload ResNet/x: replicate a ResNet block x times, where x ∈ {1, 2, 4, 8, 16, 32}.
- Systematically represents a range of neural-network depths and batch sizes.
- Intensity (work/communication): activations per parameter per batch; communication is the allreduce of parameters.
- GoogLeNet's work and intensity are sandwiched among the ResNet/x variants; AlexNet's intensity is far lower (5.9 and 11.9 across the batch categories).
- Intensity when strong-scaled: with a fixed batch, activations/parameter per GPU halves as GPUs double: 5.93, 2.96, 1.48, 0.74 for 1, 2, 4, 8 GPUs. (See the arithmetic sketch below.)
[Figures: work (FLOPS and activations/batch) and intensity (activations/parameter per batch) vs. batch category for ResNet/1-ResNet/32, GoogLeNet, and AlexNet; activations/parameter per GPU vs. number of GPUs]
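As a quick check of that strong-scaling arithmetic (illustrative only; the starting intensity 5.93 is read off the slide):

```python
# Strong scaling keeps the global batch fixed, so per-GPU work halves as the
# GPU count doubles, while the allreduce payload (the parameter count) does
# not shrink; intensity per GPU therefore halves at each step.
base = 5.93  # activations/parameter per GPU on 1 GPU (from the slide)
for gpus in (1, 2, 4, 8):
    print(gpus, round(base / gpus, 2))   # -> 5.93, 2.96, 1.48, 0.74
```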

GPU-to-GPU Memory Copy: Bandwidth
- MGBench: unidirectional GPU-to-GPU copies, pipelined using CUDA's async memcopy. (A probe sketch follows below.)
- Group results by value clusters:
  - GX8 has three groups: Intra-SR (within switch), Inter-SR (between switches), Inter-SR* (anomaly).
  - DGX-1 has two groups (expected): inter-GPU 1 hop, inter-GPU 2 hops.
- Within 4-GPU clusters (1 hop; intra-switch): NVLink wins (85% of 1 link). (Uses only 1 NVLink; software has to manage routing, etc.)
- Between 4-GPU clusters (2 hops; inter-switch): depends on payload size; PCIe can win on midsize transfers.
- PCIe anomaly (see latency plots).
[Figure: bandwidth (GB/s, 0-20) vs. data size (bytes, 1e0-1e8) for the five groups]
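A hedged sketch of such a unidirectional peer-to-peer probe, written with PyTorch rather than MGBench itself (an assumption for brevity); it times repeated asynchronous device-to-device copies:

```python
# Unidirectional GPU-to-GPU copy bandwidth, in the spirit of MGBench's test.
# Requires at least two CUDA GPUs; results will differ from the paper's.
import time
import torch

def p2p_bandwidth_GBps(src_dev, dst_dev, nbytes, iters=20):
    src = torch.empty(nbytes, dtype=torch.uint8, device=f"cuda:{src_dev}")
    dst = torch.empty(nbytes, dtype=torch.uint8, device=f"cuda:{dst_dev}")
    torch.cuda.synchronize(src_dev); torch.cuda.synchronize(dst_dev)
    t0 = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src, non_blocking=True)   # async copy; driver pipelines these
    torch.cuda.synchronize(src_dev); torch.cuda.synchronize(dst_dev)
    return nbytes * iters / (time.perf_counter() - t0) / 1e9

if torch.cuda.device_count() >= 2:
    for size in (1 << 10, 1 << 20, 1 << 26):    # 1 KB, 1 MB, 64 MB
        print(size, round(p2p_bandwidth_GBps(0, 1, size), 1), "GB/s")
```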

GPU-to-GPU Memory Copy: Latency Latency, μs 5 4 3 2 1 3 25 2 15 1 5 goal PCIe anomaly Data size: 1 KB DGX 1 DGX 1 GX8 GX8 DGX 1 GX8 Data -1 size: -2 4 Byte -3-4 -5-6 -7-1 -2-3 -4-5 -6-7 Peer-to-peer access between GPUs NVink wins Details at four different data sizes Latency, ms 16 Data size: 1MB 14 12 1 8 6 4 2 DGX 1 NVink wins for large payloads PCIe wins: bandwidth saturates more quickly w.r.t payload x-y means GPU x sent data to GPU y PCIe anomaly GX8-1 -2-3 -4-5 -6-7 Peer-to-peer access between GPUs NVLink: 2 groups, independent of data size PCIe: 1 3 groups, dependent on data size PCIe Anomaly Cirrascale SR, 2nd slot (GPU5) has longer signal paths; delays 9

NCCL: NVIDIA Collective Communications Library
- NCCL uses topology-aware and interconnect-aware rings (image: Sylvain Jeaugey):
  - PCIe/QPI: 1 unidirectional ring
  - DGX-1: 4 unidirectional rings
- NCCL is optimized for throughput (pipelined):
  - Small payload: ring latency exposed; time = hops × link latency
  - Large payload: ring latency hidden; time ≈ payload / bandwidth
(A small cost-model sketch follows below.)
[Figure: ring orderings over the PCIe/QPI and DGX-1 NVLink topologies]
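The two regimes on the slide suggest a simple max-of-terms cost model (an illustration; the constants below are hypothetical, not measurements from the talk):

```python
# Rough cost model for a pipelined ring collective: small payloads are
# dominated by per-hop latency, large payloads by payload/bandwidth.
def ring_time_s(payload_bytes, hops, hop_latency_s, link_bw_Bps):
    return max(hops * hop_latency_s,            # latency-exposed regime
               payload_bytes / link_bw_Bps)     # bandwidth-bound regime

# Hypothetical 8-GPU ring: 7 hops, 10 us/hop, 20 GB/s links (NVLink 1.0 rate)
print(ring_time_s(4 * 1024, 7, 10e-6, 20e9))   # tiny payload: latency wins
print(ring_time_s(256e6,   7, 10e-6, 20e9))    # large payload: bandwidth wins
```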

NCCL Allreduce: Effective Bandwidth
- Effective BW: bandwidth relative to a single GPU's payload; the maximum is the bandwidth of a memcopy. (See the sketch below.)
- 4 GPUs (within a cluster; ideal allreduce is 1 step): NVLink wins by 40% (60% of max).
- 8 GPUs (between clusters; ideal allreduce is 2 steps): PCIe wins by 3%!
- 8 GPUs: PCIe wins by 10% on midsize messages.
- PCIe bandwidth saturates more quickly with respect to payload size. More hardware for switching and flow control?
- Broadcast: performance differs with the collective. On 8-GPU broadcast, NVLink has a slight advantage: single-root has less synchronization vs. all-to-all.
[Figure: allreduce effective bandwidth (GB/s, 0-5) vs. data size (5e5-5e8 bytes) for DGX-1 and GX8 on 8 GPUs; AlexNet's payload size marked]
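In code, the slide's effective-bandwidth metric reads as payload over wall time (my interpretation of "relative to a single GPU's payload"; the numbers below are made up for illustration):

```python
# Effective bandwidth of an allreduce: a single GPU's payload divided by the
# collective's wall time, so its ceiling is a plain memcopy's bandwidth.
def effective_bw_GBps(payload_bytes_per_gpu, allreduce_seconds):
    return payload_bytes_per_gpu / allreduce_seconds / 1e9

print(effective_bw_GBps(50e6, 0.012))  # 50 MB in 12 ms -> ~4.2 GB/s
```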

Strong Scaling (ImageNet): AlexNet & GoogLeNet
- Same single-GPU performance: power-cap the GPUs to equalize the slightly different SM frequencies.
- Expected: NVLink becomes less important as batch size increases (more computation).
- Expected: NVLink is important for AlexNet (NVLink has a 36% advantage).
- Expected: PCIe is close to NVLink for GoogLeNet; GoogLeNet is more compute-intensive than AlexNet by roughly 10× (activations/parameter/batch): AlexNet 5.9 and 11.9 vs. GoogLeNet 50 and 140.
- Unexpected: although AlexNet is communication-intensive, the GX8 has slightly higher 8-GPU allreduce performance!
- Gripe: GPUs have very poor performance tools.
[Figure: iterations/second vs. number of GPUs (1, 2, 4, 8) on DGX-1 and GX8 for AlexNet/ImageNet (batch sizes 256 and 512) and GoogLeNet/ImageNet (batch size 512)]

Strong Scaling (ImageNet): ResNet/x
- Shown: ResNet blocks x = 1, 8, 32 at batch sizes 16, 32, 64 (x = 2, 4, 16 in the paper).
- Performance expectation: GPU work is identical, so the NVLink/PCIe win/loss should track the allreduce win/loss fraction.
- Single-GPU performance is slightly different! It converges as batch size increases. But why? CPU-based overheads on smaller batch sizes?
- Expect a DGX-1 win for 2 and 4 GPUs: holds.
- Expect a GX8 win for 8 GPUs: explains the knee at batch size 16. Why no more knees?
- GX8 is competitive for ResNet-style workloads: smaller batch sizes (vs. AlexNet, GoogLeNet). Comports with ResNet's deeper network and fewer parameters, which highlight the interconnect.
[Figure: iterations/s vs. GPUs (1, 2, 4, 8) for ResNet/1, ResNet/8, and ResNet/32 at batch sizes 16, 32, and 64, DGX-1 vs. GX8]

Conclusions
- Scaling ML across multiple on-node GPUs is increasingly important.
- Workload intensity helps explain scaling performance:
  - Parameterized ResNet captures a large space of workload intensities.
  - Systematically characterize and specify neural-network workloads.
  - Workload intensity reflects computation/communication.
- DGX-1 typically has superior performance: more links than GX8's PCIe bus, and higher bandwidth per link.
- GX8 is very competitive for all ResNet-style workloads; on 8 GPUs, the GX8 can slightly outperform.
  - Unexpected: GX8's PCIe bandwidth saturates more quickly w.r.t. payload size.
  - For medium-sized messages, GX8 has better memory-copy latency and, on average, 10% better allreduce performance.
  - ResNet is currently more popular than AlexNet (which has a large allreduce).
- GX8 may be especially attractive if cost is considered.
- Hiring!