Evaluating On-Node GPU Interconnects for Deep Learning Workloads
Nathan Tallent, Nitin Gawande, Charles Siegel, Abhinav Vishnu, Adolfy Hoisie
Pacific Northwest National Laboratory
PMBS 2017 (@ SC), November 13, 2017

Acknowledgements
- Center for Advanced Technology Evaluation (ASCR)
- Performance Prediction & Diagnosis for Extreme Scientific Workflows (ASCR)

Scaling Deep Learning: Increasingly Important
- Scaling some workloads requires a high-performance interconnect.
- Motivating example: KNL/Omni-Path vs. DGX-1 (NVLink 1.0). What is the scaling behavior given workload and interconnect?
- Single-KNL and single-GPU performance are very similar, despite the GPU's higher peak!
- DGX-1: better absolute performance, but the scaling behavior is quite different:
  - With Omni-Path, CifarNet scales better than AlexNet.
  - With NVLink, AlexNet scales better than CifarNet.
- AlexNet's much larger all-to-all reduction operations stress interconnect bandwidth.
[Figure: iterations/second vs. KNL nodes or DGX-1 GPUs (1, 2, 4, 8) for CifarNet/CIFAR-10 (batch size 192) and AlexNet/ImageNet (batch size 256)]

Which On-Node GPU Interconnect Is Best for Me?
Our focus: scaling deep learning across on-node GPUs.
- Is a high-performance interconnect (e.g., NVIDIA NVLink) required?
- Are PCIe-based interconnects adequate?
- How dependent is the answer on my workload?
The answers are not obvious!
[Figure: topologies of the NVIDIA DGX-1 (NVLink 1.0: 20 GB/s each direction per link, connecting GPUs 0-7, with PCIe/InfiniBand to CPUs 0-1) and the Cirrascale GX8 (two Cirrascale SR3615 PCIe risers; PCIe 3.0: 16 GB/s each direction)]

On-Node GPU Networks: DGX-1 vs. GX8
DGX-1 appears to offer much higher performance.
- DGX-1 (NVLink 1.0: 20 GB/s each direction, per link):
  - Hybrid cube mesh: two (fully connected) 4-GPU meshes
  - Each GPU: 4 links = 80 GB/s (unidirectional)
- GX8 (two-level PCIe tree):
  - Two (fully connected) 4-GPU clusters, one per Cirrascale SR3615 PCIe riser
  - Each GPU: 16 GB/s (unidirectional); PCIe upstream: 16 GB/s
[Figure: DGX-1 topology (stylized to avoid crossing links: GPU 0-GPU 4) and GX8 topology with two SR3615 risers; PCIe 3.0: 16 GB/s each direction]
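To make the hop-count structure concrete, here is a minimal sketch (an illustration, not the authors' code) of the DGX-1 hybrid cube mesh as described above: two fully connected quads plus one inter-quad link per GPU. The pairing 0-4, 1-5, 2-6, 3-7 is assumed from the stylized figure.

```python
# Sketch of the DGX-1 NVLink 1.0 hybrid cube mesh: two fully connected
# 4-GPU quads, plus one assumed inter-quad link per GPU (0-4, 1-5, 2-6, 3-7),
# giving each GPU 4 links of 20 GB/s each direction.
QUADS = [{0, 1, 2, 3}, {4, 5, 6, 7}]
links = {g: set() for g in range(8)}
for quad in QUADS:
    for g in quad:
        links[g] |= quad - {g}           # fully connected within a quad
for g in range(4):
    links[g].add(g + 4)                  # inter-quad (cube) edges
    links[g + 4].add(g)

assert all(len(peers) == 4 for peers in links.values())  # 4 links per GPU
print(links[0])  # {1, 2, 3, 4}; e.g., GPU 0 -> GPU 6 requires 2 hops
```

This is why the DGX-1 memcopy results later cluster into exactly two groups: every GPU pair is either 1 hop (direct NVLink) or 2 hops apart.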

Outline of Deep Learning Workload
Outline of the deep-learning training algorithm (a data-parallel sketch follows below):
- Replicate the neural-network architecture on each GPU.
- For each batch in the image data set:
  - Distribute images among GPUs (data parallel).
  - Process images → activations → parameters (per GPU); activations: floating-point operations.
  - Synchronize parameters: all-to-all reduction (allreduce).
- Use NCCL for GPU collectives:
  - NCCL: NVIDIA Collective Communications Library
  - Topology-aware and interconnect-aware rings, optimized for throughput (pipelined)
Train on the ImageNet dataset:
- ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
- Well-known benchmark for object classification and detection
Workloads:
- AlexNet (high communication)
- GoogLeNet (high compute)
- ResNet/x: everything in between & more
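A minimal sketch of that loop, assuming PyTorch's torch.distributed with the NCCL backend (the talk does not name a framework; the function and variable names here are illustrative):

```python
# Data-parallel training step: per-GPU forward/backward, then an NCCL
# allreduce to synchronize gradients across replicas. Assumes
# dist.init_process_group("nccl") has been called, one process per GPU,
# and that the loader yields tensors already on this rank's GPU.
import torch
import torch.distributed as dist

def train_one_epoch(model, loader, optimizer, loss_fn):
    world = dist.get_world_size()          # number of model replicas (GPUs)
    for images, labels in loader:          # each rank gets its shard of the batch
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)  # forward: activations (FLOPs)
        loss.backward()                        # backward: per-GPU gradients
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # all-to-all reduction
            p.grad /= world                                # average across GPUs
        optimizer.step()
```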

Parameterize ResNet: Control Compute Intensity
- Parameterized workload ResNet/x: replicate a ResNet block x times, where x ∈ {1, 2, 4, 8, 16, 32}.
- Systematically represents a range of neural-network depths and batch sizes.
- Intensity (work/communication): activations per parameter per batch; communication is the allreduce of parameters.
- GoogLeNet's work and intensity are sandwiched among the ResNet/x variants; AlexNet's intensity is far lower (5.9 and 11.9 across the batch categories).
- Intensity when strong-scaled: with a fixed batch, activations/parameter per GPU halves as GPUs double: 5.93, 2.96, 1.48, 0.74 for 1, 2, 4, 8 GPUs. (See the arithmetic sketch below.)
[Figures: work (FLOPS and activations/batch) and intensity (activations/parameter per batch) vs. batch category for ResNet/1-ResNet/32, GoogLeNet, and AlexNet; activations/parameter per GPU vs. number of GPUs]
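As a quick check of that strong-scaling arithmetic (illustrative only; the starting intensity 5.93 is read off the slide):

```python
# Strong scaling keeps the global batch fixed, so per-GPU work halves as the
# GPU count doubles, while the allreduce payload (the parameter count) does
# not shrink; intensity per GPU therefore halves at each step.
base = 5.93  # activations/parameter per GPU on 1 GPU (from the slide)
for gpus in (1, 2, 4, 8):
    print(gpus, round(base / gpus, 2))   # -> 5.93, 2.96, 1.48, 0.74
```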

GPU-to-GPU Memory Copy: Bandwidth
- MGBench: unidirectional GPU-to-GPU copies, pipelined using CUDA's async memcopy. (A probe sketch follows below.)
- Group results by value clusters:
  - GX8 has three groups: Intra-SR (within switch), Inter-SR (between switches), Inter-SR* (anomaly).
  - DGX-1 has two groups (expected): inter-GPU 1 hop, inter-GPU 2 hops.
- Within 4-GPU clusters (1 hop; intra-switch): NVLink wins (85% of 1 link). (Uses only 1 NVLink; software has to manage routing, etc.)
- Between 4-GPU clusters (2 hops; inter-switch): depends on payload size; PCIe can win on midsize transfers.
- PCIe anomaly (see latency plots).
[Figure: bandwidth (GB/s, 0-20) vs. data size (bytes, 1e0-1e8) for the five groups]
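A hedged sketch of such a unidirectional peer-to-peer probe, written with PyTorch rather than MGBench itself (an assumption for brevity); it times repeated asynchronous device-to-device copies:

```python
# Unidirectional GPU-to-GPU copy bandwidth, in the spirit of MGBench's test.
# Requires at least two CUDA GPUs; results will differ from the paper's.
import time
import torch

def p2p_bandwidth_GBps(src_dev, dst_dev, nbytes, iters=20):
    src = torch.empty(nbytes, dtype=torch.uint8, device=f"cuda:{src_dev}")
    dst = torch.empty(nbytes, dtype=torch.uint8, device=f"cuda:{dst_dev}")
    torch.cuda.synchronize(src_dev); torch.cuda.synchronize(dst_dev)
    t0 = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src, non_blocking=True)   # async copy; driver pipelines these
    torch.cuda.synchronize(src_dev); torch.cuda.synchronize(dst_dev)
    return nbytes * iters / (time.perf_counter() - t0) / 1e9

if torch.cuda.device_count() >= 2:
    for size in (1 << 10, 1 << 20, 1 << 26):    # 1 KB, 1 MB, 64 MB
        print(size, round(p2p_bandwidth_GBps(0, 1, size), 1), "GB/s")
```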

GPU-to-GPU Memory Copy: Latency Latency, μs 5 4 3 2 1 3 25 2 15 1 5 goal PCIe anomaly Data size: 1 KB DGX 1 DGX 1 GX8 GX8 DGX 1 GX8 Data -1 size: -2 4 Byte -3-4 -5-6 -7-1 -2-3 -4-5 -6-7 Peer-to-peer access between GPUs NVink wins Details at four different data sizes Latency, ms 16 Data size: 1MB 14 12 1 8 6 4 2 DGX 1 NVink wins for large payloads PCIe wins: bandwidth saturates more quickly w.r.t payload x-y means GPU x sent data to GPU y PCIe anomaly GX8-1 -2-3 -4-5 -6-7 Peer-to-peer access between GPUs NVLink: 2 groups, independent of data size PCIe: 1 3 groups, dependent on data size PCIe Anomaly Cirrascale SR, 2nd slot (GPU5) has longer signal paths; delays 9

NCCL: NVIDIA Collective Communications Library
- NCCL uses topology-aware and interconnect-aware rings (image: Sylvain Jeaugey):
  - PCIe/QPI: 1 unidirectional ring
  - DGX-1: 4 unidirectional rings
- NCCL is optimized for throughput (pipelined):
  - Small payload: ring latency exposed; time = hops × link latency
  - Large payload: ring latency hidden; time ≈ payload / bandwidth
(A small cost-model sketch follows below.)
[Figure: ring orderings over the PCIe/QPI and DGX-1 NVLink topologies]
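The two regimes on the slide suggest a simple max-of-terms cost model (an illustration; the constants below are hypothetical, not measurements from the talk):

```python
# Rough cost model for a pipelined ring collective: small payloads are
# dominated by per-hop latency, large payloads by payload/bandwidth.
def ring_time_s(payload_bytes, hops, hop_latency_s, link_bw_Bps):
    return max(hops * hop_latency_s,            # latency-exposed regime
               payload_bytes / link_bw_Bps)     # bandwidth-bound regime

# Hypothetical 8-GPU ring: 7 hops, 10 us/hop, 20 GB/s links (NVLink 1.0 rate)
print(ring_time_s(4 * 1024, 7, 10e-6, 20e9))   # tiny payload: latency wins
print(ring_time_s(256e6,   7, 10e-6, 20e9))    # large payload: bandwidth wins
```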

NCCL Allreduce: Effective Bandwidth
- Effective BW: bandwidth relative to a single GPU's payload; the maximum is the bandwidth of a memcopy. (See the sketch below.)
- 4 GPUs (within a cluster; ideal allreduce is 1 step): NVLink wins by 40% (60% of max).
- 8 GPUs (between clusters; ideal allreduce is 2 steps): PCIe wins by 3%!
- 8 GPUs: PCIe wins by 10% on midsize messages.
- PCIe bandwidth saturates more quickly with respect to payload size. More hardware for switching and flow control?
- Broadcast: performance differs with the collective. On 8-GPU broadcast, NVLink has a slight advantage: single-root has less synchronization vs. all-to-all.
[Figure: allreduce effective bandwidth (GB/s, 0-5) vs. data size (5e5-5e8 bytes) for DGX-1 and GX8 on 8 GPUs; AlexNet's payload size marked]
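In code, the slide's effective-bandwidth metric reads as payload over wall time (my interpretation of "relative to a single GPU's payload"; the numbers below are made up for illustration):

```python
# Effective bandwidth of an allreduce: a single GPU's payload divided by the
# collective's wall time, so its ceiling is a plain memcopy's bandwidth.
def effective_bw_GBps(payload_bytes_per_gpu, allreduce_seconds):
    return payload_bytes_per_gpu / allreduce_seconds / 1e9

print(effective_bw_GBps(50e6, 0.012))  # 50 MB in 12 ms -> ~4.2 GB/s
```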

Strong Scaling (ImageNet): AlexNet & GoogLeNet
- Same single-GPU performance: power-cap the GPUs to equalize the slightly different SM frequencies.
- Expected: NVLink becomes less important as batch size increases (more computation).
- Expected: NVLink is important for AlexNet (NVLink has a 36% advantage).
- Expected: PCIe is close to NVLink for GoogLeNet; GoogLeNet is more compute-intensive than AlexNet by roughly 10× (activations/parameter/batch): AlexNet 5.9 and 11.9 vs. GoogLeNet 50 and 140.
- Unexpected: although AlexNet is communication-intensive, the GX8 has slightly higher 8-GPU allreduce performance!
- Gripe: GPUs have very poor performance tools.
[Figure: iterations/second vs. number of GPUs (1, 2, 4, 8) on DGX-1 and GX8 for AlexNet/ImageNet (batch sizes 256 and 512) and GoogLeNet/ImageNet (batch size 512)]

Strong Scaling (ImageNet): ResNet/x
- Shown: ResNet blocks x = 1, 8, 32 at batch sizes 16, 32, 64 (x = 2, 4, 16 in the paper).
- Performance expectation: GPU work is identical, so the NVLink/PCIe win/loss should track the allreduce win/loss fraction.
- Single-GPU performance is slightly different! It converges as batch size increases. But why? CPU-based overheads on smaller batch sizes?
- Expect a DGX-1 win for 2 and 4 GPUs: holds.
- Expect a GX8 win for 8 GPUs: explains the knee at batch size 16. Why no more knees?
- GX8 is competitive for ResNet-style workloads: smaller batch sizes (vs. AlexNet, GoogLeNet). Comports with ResNet's deeper network and fewer parameters, which highlight the interconnect.
[Figure: iterations/s vs. GPUs (1, 2, 4, 8) for ResNet/1, ResNet/8, and ResNet/32 at batch sizes 16, 32, and 64, DGX-1 vs. GX8]

Conclusions
- Scaling ML across multiple on-node GPUs is increasingly important.
- Workload intensity helps explain scaling performance:
  - Parameterized ResNet captures a large space of workload intensities.
  - Systematically characterize and specify neural-network workloads.
  - Workload intensity reflects computation/communication.
- DGX-1 typically has superior performance: more links than GX8's PCIe bus, and higher bandwidth per link.
- GX8 is very competitive for all ResNet-style workloads; on 8 GPUs, the GX8 can slightly outperform.
  - Unexpected: GX8's PCIe bandwidth saturates more quickly w.r.t. payload size.
  - For medium-sized messages, GX8 has better memory-copy latency and, on average, 10% better allreduce performance.
  - ResNet is currently more popular than AlexNet (which has a large allreduce).
- GX8 may be especially attractive if cost is considered.
- Hiring!