High-Performance Training for Deep Learning and Computer Vision HPC


1 High-Performance Training for Deep Learning and Computer Vision HPC. Panel at CVPR-ECV '18 by Dhabaleswar K. (DK) Panda, The Ohio State University.

2 Scale-up and Scale-out. Scale-up (intra-node communication): many improvements such as NVIDIA cuDNN, cuBLAS, NCCL, and CUDA 9 co-operative groups. Scale-out (inter-node communication): most DL frameworks are optimized for single-node execution only; distributed (parallel) training is an emerging trend, e.g., OSU-Caffe (MPI-based), Microsoft CNTK (MPI/NCCL2), Google TensorFlow (gRPC-based/MPI/NCCL2), and Facebook Caffe2 (hybrid NCCL2/Gloo/MPI). [Figure: communication libraries positioned by scale-up vs. scale-out performance -- cuDNN, NCCL1, and NCCL2 rank high on scale-up; MPI, gRPC, Hadoop, and MKL-DNN rank higher on scale-out; the desired design achieves both.]

3 Holistic Evaluation is Important!! "My framework is faster than your framework!" -- such claims need to be understood in a holistic way: performance depends on the entire execution environment (the full stack), and an isolated view of performance is not helpful. A. A. Awan, H. Subramoni, and D. K. Panda, "An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures," in Proceedings of the Machine Learning on HPC Environments (MLHPC'17), ACM, New York, NY, USA, Article 8.

4 Interdependency of HPC, Big Data, and Deep Learning: HPC (MPI, RDMA, Lustre, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.) -- a convergence of HPC, Big Data, and Deep Learning!

5 Drivers of Modern HPC Cluster and Data Center Architecture: multi-/many-core processors; high-performance interconnects such as InfiniBand (with SR-IOV) offering <1 usec latency and 200 Gbps bandwidth; Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE); Single Root I/O Virtualization (SR-IOV); accelerators (NVIDIA GPGPUs and FPGAs) with high compute density and high performance/watt (>1 TFlop DP on a chip); and SSDs, NVMe-SSDs, NVRAM, parallel filesystems, and object storage. Examples: SDSC Comet, TACC Stampede2, and cloud platforms.

6 Research Challenges to Exploit HPC Technologies. 1. What are the fundamental issues in designing DL frameworks? Memory requirements, computation requirements, and communication overhead. 2. Why do we need to support distributed training? To overcome the limits of single-node training and to better utilize hundreds of existing HPC clusters. [Figure: deep learning and machine learning frameworks (Caffe/OSU-Caffe, Caffe2, CNTK, TensorFlow, MXNet) layered over the major computation and communication phases in DL frameworks (model propagation, forward, backward, gradient aggregation), communication runtimes to support distributed training, and HPC platforms (CPU, GPU, InfiniBand).]

7 Research Challenges to Exploit HPC Technologies (Cont'd). 3. What are the new design challenges brought forward by DL frameworks for communication runtimes? Large-message collective communication and reductions, and GPU buffers (CUDA-awareness). 4. Can a co-design approach help in achieving scale-up and scale-out efficiently? Co-design the support at the runtime level and exploit it at the DL framework level: what performance benefits can be observed, and what needs to be fixed at the communication runtime layer? [Figure: the same framework/phase/platform stack as above, highlighting point-to-point operations, CUDA-awareness, large-message collectives, and co-design opportunities in the communication runtimes (MPI/NCCL/Gloo/MLSL).]

8 Multiple Approaches Taken Up by OSU: MPI-driven deep learning; co-designing deep learning stacks with high-performance MPI; accelerating TensorFlow on HPC systems; accelerating Big Data stacks; and efficient deep learning over Big Data.

9 Data Parallel Deep Learning and MPI Collectives. Major MPI collectives involved in designing distributed frameworks: MPI_Bcast, required for DNN parameter exchange; MPI_Reduce, needed for gradient accumulation from multiple solvers; and MPI_Allreduce, which uses just one Allreduce instead of a Reduce followed by a Broadcast. [Figure: the data-parallel training loop across GPUs -- 1. data propagation (MPI_Bcast of parameters to all GPUs), 2. forward/backward pass on each GPU, 3. gradient aggregation (MPI_Reduce of the packed gradient buffers), followed by applying updates.] A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters," in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).
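For readers who want to map these collectives to code, below is a minimal data-parallel training-step sketch using mpi4py and NumPy. The packed parameter/gradient buffers and the update rule are illustrative placeholders; this is not the OSU-Caffe implementation, which operates directly on GPU buffers.

```python
# Minimal data-parallel sketch of the pattern above (mpi4py, host buffers).
# Buffer sizes, the "training" computation, and the update rule are placeholders.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

params = np.zeros(1 << 20, dtype=np.float32)   # packed model parameters
if rank == 0:
    params[:] = np.random.rand(params.size)    # root initializes the model

# 1. Data propagation: broadcast parameters to every worker (MPI_Bcast)
comm.Bcast(params, root=0)

for step in range(10):
    # 2. Forward/backward pass: each rank computes gradients on its data shard
    #    (stand-in computation here)
    local_grad = np.random.rand(params.size).astype(np.float32)

    # 3. Gradient aggregation: one Allreduce replaces Reduce + Bcast
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)

    # Apply updates with the averaged gradient
    params -= 0.01 * (global_grad / size)
```

Such a script would be launched with the MPI launcher of the installed library, e.g. `mpirun -np 4 python train_sketch.py`.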

10 Overview of the MVAPICH2 MPI Project. High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE). MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1): started in 2001, first version available in 2002. MVAPICH2-X (MPI + PGAS), available since 2011. Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014. Support for virtualization (MVAPICH2-Virt), available since 2015. Support for energy-awareness (MVAPICH2-EA), available since 2015. Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015. Used by more than 2,900 organizations in 86 countries. More than 474,000 (>0.47 million) downloads from the OSU site directly. Empowering many TOP500 clusters (Nov '17 ranking): 1st, 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China; 9th, 556,104-core Oakforest-PACS in Japan; 12th, 368,928-core Stampede2 at TACC; 17th, 241,108-core Pleiades at NASA; 48th, 76,320-core Tsubame 2.5 at Tokyo Institute of Technology. Available with the software stacks of many vendors and Linux distros (RedHat and SuSE). Empowering TOP500 systems for over a decade.

11 Optimized MVAPICH2-GDR Design. [Figure: OSU micro-benchmark results for MV2 (no GDR) vs. MV2-GDR 2.3a across message sizes -- GPU-GPU inter-node latency down to 1.88 us (about 10x better), GPU-GPU inter-node bandwidth up to 9x better, and GPU-GPU inter-node bi-bandwidth up to 11x better.] Platform: MVAPICH2-GDR 2.3a, Intel Haswell (3.1 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, and Mellanox OFED 4.0 with GPUDirect RDMA.
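The numbers above come from OSU micro-benchmark style latency/bandwidth tests. As a rough illustration of the methodology only, here is a minimal host-buffer ping-pong latency sketch in mpi4py; the published GDR results use the C OSU micro-benchmarks with GPU-resident buffers.

```python
# Minimal ping-pong latency sketch between rank 0 and rank 1 (host buffers).
# Illustrative only; not the OSU micro-benchmark code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
iters, skip = 1000, 100

for size in (1, 1024, 4096, 8192):
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    for i in range(iters + skip):
        if i == skip:
            start = MPI.Wtime()          # exclude warm-up iterations
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    if rank == 0:
        lat_us = (MPI.Wtime() - start) * 1e6 / (2 * iters)
        print(f"{size} bytes: {lat_us:.2f} us one-way latency")
```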

12 Exploiting CUDA-Aware MPI for TensorFlow (Horovod). MVAPICH2-GDR offers excellent performance via advanced designs for MPI_Allreduce: up to 22% better performance than plain MVAPICH2 on the Wilkes2 cluster (16 GPUs, 4 GPUs/node). [Figure: images/sec (higher is better) vs. number of GPUs for MVAPICH2 and MVAPICH2-GDR.]
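For context, a minimal sketch of the Horovod-over-MPI pattern used in such TensorFlow runs is shown below. The ResNet50 model and synthetic data are placeholders, and it assumes Horovod (horovod.tensorflow.keras) built against the desired CUDA-aware MPI library, e.g. MVAPICH2-GDR; it is not the exact benchmark script behind the figure.

```python
# Minimal Horovod/TensorFlow data-parallel sketch; launched as e.g.
#   mpirun -np 16 python horovod_sketch.py
# Model and data are placeholders; assumes Horovod built with a CUDA-aware MPI.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each MPI rank to one local GPU
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

# Synthetic stand-in for an ImageNet shard
x = tf.random.uniform((32, 224, 224, 3))
y = tf.random.uniform((32,), maxval=1000, dtype=tf.int32)

# Broadcast initial weights from rank 0 so all workers start identically
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x, y, batch_size=32, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

The allreduce that Horovod issues for every gradient tensor is what the MVAPICH2 vs. MVAPICH2-GDR comparison above exercises.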

13 MVAPICH2-GDR: Allreduce Comparison with Baidu-allreduce and OpenMPI on 16 GPUs (4 nodes). [Figure: Allreduce latency (us) vs. message size for MVAPICH2-GDR, Baidu-allreduce, and OpenMPI 3.0 -- MVAPICH2-GDR is ~3X better for small messages and ~4X better in the 512K-4M range; for large messages, OpenMPI is ~5X slower than Baidu, while MVAPICH2-GDR is ~2X better than Baidu.] *Available with MVAPICH2-GDR 2.3a.
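As an illustration of how such an Allreduce comparison is measured, a minimal host-buffer latency loop in mpi4py is sketched below; the published comparison exchanges GPU buffers via the C OSU micro-benchmarks and Baidu's allreduce benchmark.

```python
# Minimal Allreduce latency sketch (host buffers), in the spirit of osu_allreduce.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
iters, skip = 100, 10

for nbytes in (4, 64 * 1024, 4 * 1024 * 1024):
    send = np.ones(max(nbytes // 4, 1), dtype=np.float32)
    recv = np.empty_like(send)
    comm.Barrier()
    for i in range(iters + skip):
        if i == skip:
            start = MPI.Wtime()       # skip warm-up iterations
        comm.Allreduce(send, recv, op=MPI.SUM)
    elapsed = MPI.Wtime() - start
    if rank == 0:
        print(f"{send.nbytes} bytes: {elapsed * 1e6 / iters:.2f} us per Allreduce")
```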

14 MVAPICH2-GDR vs. NCCL2: Reduce Operation. Optimized designs in MVAPICH2-GDR 2.3b* offer better or comparable performance for most cases. [Figure: latency (us) of MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) on 16 GPUs across message sizes -- up to ~2.5X and ~5X better depending on message size.] *Will be available with the upcoming MVAPICH2-GDR 2.3b. Platform: Intel Xeon (Broadwell) nodes, each with a dual-socket CPU, one K-80 GPU, and an EDR InfiniBand interconnect.

15 MVAPICH2-GDR vs. NCCL2: Allreduce Operation. Optimized designs in MVAPICH2-GDR 2.3b* offer better or comparable performance for most cases. [Figure: latency (us) of MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllReduce (NCCL2) on 16 GPUs across message sizes -- up to ~3X and ~1.2X better depending on message size.] *Will be available with the upcoming MVAPICH2-GDR 2.3b. Platform: Intel Xeon (Broadwell) nodes, each with a dual-socket CPU, one K-80 GPU, and an EDR InfiniBand interconnect.

16 OSU-Caffe: Proposed Co-Design Overview. To address the limitations of Caffe and existing MPI runtimes, we propose the OSU-Caffe (S-Caffe) framework. At the application (DL framework) level: develop a fine-grained workflow, i.e., layer-wise communication instead of communicating the entire model. At the runtime (MPI) level: develop support to perform reductions of very large GPU buffers, and perform the reduction using GPU kernels. OSU-Caffe is available from the HiDL project page.
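A minimal concept sketch of the layer-wise communication idea, assuming mpi4py with MPI-3 non-blocking collectives: each layer's gradient reduction is started as soon as the backward pass produces it, overlapping communication with the remaining computation. This is an illustration only, not the OSU-Caffe code, which reduces very large GPU buffers with CUDA kernels.

```python
# Layer-wise gradient reduction overlapped with backpropagation: a concept
# sketch with mpi4py (host buffers); not the OSU-Caffe implementation.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def backward_layer(layer_id):
    """Placeholder for computing one layer's gradient."""
    return np.random.rand(1 << 18).astype(np.float32)

num_layers = 8
requests, results = [], []

# Backward pass runs from the last layer to the first; start a non-blocking
# Iallreduce as soon as each layer's gradient is ready.
for layer in reversed(range(num_layers)):
    grad = backward_layer(layer)
    reduced = np.empty_like(grad)
    requests.append(comm.Iallreduce(grad, reduced, op=MPI.SUM))
    results.append((layer, reduced))

# Wait for the outstanding reductions, then apply updates layer by layer.
MPI.Request.Waitall(requests)
for layer, reduced in results:
    pass  # apply the averaged gradient of `layer` here
```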

17 S-Caffe vs. Inspur-Caffe and Microsoft CNTK. AlexNet is notoriously hard to scale out on multiple nodes due to communication overhead: a large number of parameters, ~64 million (communication buffer size = 256 MB). GoogLeNet is a popular DNN with 13 million parameters (communication buffer size = ~50 MB). Up to 14% improvement (scale-up). S-Caffe delivers better or comparable performance relative to other multi-node-capable DL frameworks. [Figure: scaling results and the impact of HR.]

18 Multiple Approaches Taken Up by OSU: MPI-driven deep learning; co-designing deep learning stacks with high-performance MPI; accelerating TensorFlow on HPC systems; accelerating Big Data stacks; and efficient deep learning over Big Data.

19 Performance Benefits for AR-gRPC with Micro-Benchmarks. Point-to-point latency of AR-gRPC (OSU design) vs. default gRPC, measured on SDSC-Comet-FDR. [Figure: latency (us) vs. payload size for default gRPC (IPoIB, 56 Gbps) and AR-gRPC (RDMA, 56 Gbps) across small, medium, and large payloads.] Up to 2.7x latency speedup over default gRPC (IPoIB) for small messages, up to 2.8x for medium messages, and up to 2.5x for large messages. R. Biswas, X. Lu, and D. K. Panda, "Accelerating TensorFlow with Adaptive RDMA-based gRPC" (under review).

20 Performance Benefit for TensorFlow (ResNet50). TensorFlow ResNet50 performance evaluation on an IB EDR cluster. [Figure: images/second vs. batch size per GPU on 4, 8, and 12 nodes for default gRPC (IPoIB, 100 Gbps), Verbs (RDMA, 100 Gbps), MPI (RDMA, 100 Gbps), and AR-gRPC (RDMA, 100 Gbps).] Up to 26% speedup over default gRPC (IPoIB) on 4 nodes, up to 127% on 8 nodes, and up to 133% on 12 nodes.

21 Performance Benefit for TensorFlow (Inception3). TensorFlow Inception3 performance evaluation on an IB EDR cluster. [Figure: images/second vs. batch size per GPU on 4, 8, and 12 nodes for default gRPC (IPoIB, 100 Gbps), Verbs (RDMA, 100 Gbps), MPI (RDMA, 100 Gbps), and AR-gRPC (RDMA, 100 Gbps).] Up to 47% speedup over default gRPC (IPoIB) on 4 nodes, up to 116% on 8 nodes, and up to 153% on 12 nodes.

22 Multiple Approaches Taken Up by OSU: MPI-driven deep learning; co-designing deep learning stacks with high-performance MPI; accelerating TensorFlow on HPC systems; accelerating Big Data stacks; and efficient deep learning over Big Data.

23 The High-Performance Big Data (HiBD) Project: RDMA for Apache Spark; RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x), with plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions; RDMA for Apache HBase; RDMA for Memcached (RDMA-Memcached); RDMA for Apache Hadoop 1.x (RDMA-Hadoop); and OSU HiBD-Benchmarks (OHB) with HDFS, Memcached, HBase, and Spark micro-benchmarks. User base: 285 organizations from 34 countries; more than 26,700 downloads from the project site. Available for InfiniBand and RoCE (also runs on Ethernet), for x86 and OpenPOWER, with support for Singularity and Docker.

24 High-Performance Deep Learning over Big Data (DLoBD) Stacks. Challenges of DLoBD: can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs? What are the performance characteristics of representative DLoBD stacks on RDMA networks? Characterization of DLoBD stacks: CaffeOnSpark, TensorFlowOnSpark, and BigDL; IPoIB vs. RDMA; in-band vs. out-of-band communication; CPU vs. GPU; performance, accuracy, scalability, and resource utilization. RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB-based scheme while maintaining similar accuracy. [Figure: epoch time (secs) and accuracy (%) vs. epoch number for IPoIB and RDMA.] X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, "Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks," HotI 2017.

25 Thank You! Network-Based Computing Laboratory; The High-Performance MPI/PGAS Project; The High-Performance Big Data Project; The High-Performance Deep Learning Project.
