Efficient Communication Library for Large-Scale Deep Learning

Size: px

Start display at page:

Download "Efficient Communication Library for Large-Scale Deep Learning"

Debra Neal
5 years ago
Views:

1 IBM Research AI Efficient Communication Library for Large-Scale Deep Learning Mar 26, 2018 Minsik Cho

2 Deep Learning changing Our Life Automotive/transportation Security/public safety Medicine and Biology Media and Entertainment Consumer Web, Mobile, Retail 2

Error IBM Deep Learning Workflow Latency to model: Typically days to train complex models Limited by training compute throughput This is my focus Training Training

3 Error IBM Deep Learning Workflow Latency to model: Typically days to train complex models Limited by training compute throughput This is my focus Training Training Data (grouped in large minibatches) Next Minibatch Next Epoch Forward Backward Trained Model Conversion/Retraining Needed if training/inference precisions differ Inference Model Input Data Batching Smaller, varied batch size: Application-dependent Individual E.g., from microservices Inference Forward Action Applicationdependent Latency to action: Typically ms to complete full inference workflow Limited by latency of batching (to enable efficient inference) + inference compute+ resultant action 3 3

4 Advance in Computation for Deep Learning [P. Goldsborough] [MichaelGalloy.com] TFLOPS Very good scaling for last 15 years /FPGA 4

5 Motivation: Ok, ever-fast computation. Is this enough? ImageNet1K : 1.2M images, 1K classes, Resnet101 Batch-size = 32 (limited by GPU memory) Iteration time = 300ms #iterations per epoch = Total training time for 100 epoch = 13.2 days ImageNet22K : 7.5M images, 22K classes, Resnet101 Total training time for 100 epoch = 35.2 days No, it is NOT 1.2M samples are still at toy scale Computation scaling is not fast enough the data explosion/model complexity Innovation will take too long, or even stop at some point I cannot wait for days to get my model trained! 5

6 Faster Training Time with Distributed Deep Learning 9Days Recognition Recognition What will you do? Iterate more and create more accurate models? Create more models? Both? 54x Learning runs with Power 8 6

7 Distributed Deep Learning [P. Goldsborough] Gradient/weight (10MB-1GB) Anything Model parallelism (complex partitioning) Data parallelism : Parm-Server Data parallelism : Allreduce 7

8 Communication : Overhead sync sync 32 images 32 images 32 images 32 images 32 images 32 images 32 images 32 images 32 images 32 images In weak-scaling Computation cost remains constant Communication cost increases with more learners/gpus Computation /Communication is the key for large-scale deep learning Increase Computation Faster Communication 8

9 Advance in Communication for Deep Learning Still scaling, but not fast enough Computation is still ahead Data perhaps grows much faster 9

10 Designing Large-scale Deep Learning Communication Gradient count Network BW Faster algorithm Good balance Computation Model depth GPU throughput Faster algorithm Mini-batch size Model/Application Deeper/wider model to increase compute time Smaller gradient count to reduce communication time System Balance network and computing resources Select mini-batch size to adjust the ratio Larger mini-batch size to lower the ratio Too big mini-batch size can hurt convergence and accuracy Network-topology aware communication 10

11 IBM PowerAI DDL (Distributed Deep Learning Library) Collaborative communication library for Distributed Deep Learning MPI-like interface for easy-integration Enables deep learning software to scale to 100s severs with CPU/GPUs Works across variety of system sizes Works with variety of network types, switch topologies DDL orchestrates the data communication Plan efficient communication pattern on a hierarchal network environment Actual point-point data transfer via NCCL or MPI Currently integrated into Supported: Caffe, Tensorflow, Chainer, Torch Ongoing : Caffe2, PyTorch, Keras (TF-backend) Currently US patent-pending 11

12 DDL : Topology-aware communication Max bandwidth: 10 Gbytes/sec switch0 switch1 Max sustained bandwidth: 100 Gbytes/sec A B C D Example A, B, C, D broadcast to all others Suffers from congestion Suffers from lower BW A->B B->C C->D B->C C->D D->A C->D D->A A->B D->A A->B B->C 12

13 DDL : Topology-aware communication switch0 switch1 box0 box1 box2 box3 A B C D A->B B->C C->D B->C C->D D->A C->D D->A A->B D->A A->B B->C suboptimal It s a mapping problem System-specific network Application-specific traffic DDL does differently To minimize bus contention To minimize crossing lower BW A->B B->A C->D D->C B->C A->D D->A C->B C->D D->C A->B B->A Optimal 13

14 DDL : Problem Definition and Solution Assumption network topology with various bandwidths Problem Definition min-cost multi-commodity flow problem NP-hard problem but can be solved easily if graph size is small (ie 4 vertices) DDL solves a typical case/topology offline if the cluster/cloud has provide such topology, it performs very well 14

15 DDL : How well it performs on Caffe2 DDL DDL 48 IBM S822LC with PPC64LE RHEL 3 racks and 16 hosts on each, connected though 10GB/s IB Each host has 4 P100-SXM2 with CUDA8, CUDNN5 Comparing algorithms on Resnet50 + Imagenet1K (preloaded to RAMDisk) mbs=32 MPI_Allreduce Ring (all-reduce from Baidu in Feb 2017) GLOO (from Facebook) : NCCL+ib_verb 15

16 Comparison with NCCL 2.1.x Allreduce (POWER) Exploiting in-system topology Exploiting in/cross-system topology IBM P9 Newell Systems (NVLink) with V100s 100Gbps InfiniBand 16

17 Comparison with NCCL 2.1.x Allreduce (X86) NO in-system topology Exploiting cross-system topology X86 Systems (PCIe) with P100s 10Gbps Ethernet 17

18 Conclusion DDL is a topology-aware communication library in PowerAI DDL delivers the industry-best performance with Network hierarchy Multi-tier bandwidth DDL is suitable for common distributed training on cloud environment 18

19 BACKUP 19

Deep Learning mit PowerAI - Ein Überblick

Deep Learning mit PowerAI - Ein Überblick Stephen Lutz Deep Learning mit PowerAI - Open Group Master Certified IT Specialist Technical Sales IBM Cognitive Infrastructure IBM Germany Ein Überblick Stephen.Lutz@de.ibm.com What s that? and what s