PRACTICAL SCALING TECHNIQUES
Ujval Kapasi, Dec 9, 2017

1 PRACTICAL SCALING TECHNIQUES Ujval Kapasi Dec 9, 2017

2 DNN TRAINING ON MULTIPLE GPUS: Making DL training times shorter

3-6 DNN TRAINING ON MULTIPLE GPUS: Making DL training times shorter. Data parallelism: split the batch across multiple GPUs. Each GPU computes gradients on its local portion of the batch; an allreduce then sums the gradients across all GPUs.
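To make the data-parallel pattern concrete, here is a minimal sketch using PyTorch's torch.distributed, one of several frameworks that wrap NCCL; the talk itself is framework-agnostic, and the model, loss function, and process-group setup here are assumptions for illustration.

```python
# Minimal data-parallel step: each process drives one GPU, computes
# gradients on its local shard of the batch, then sums gradients across
# GPUs with an allreduce so every replica applies the same update.
import torch
import torch.distributed as dist

# Assumes the process group was initialized elsewhere, e.g.:
#   dist.init_process_group(backend="nccl")

def data_parallel_step(model, loss_fn, inputs, targets):
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)   # local forward pass
    loss.backward()                          # local gradients
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is None:
            continue
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum across GPUs
        p.grad /= world    # average -> equivalent to one big-batch gradient
    # optimizer.step() would follow, identically on every rank
```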

7 OPTIMIZE THE INPUT PIPELINE

8 TYPICAL TRAINING PIPELINE. Host: read & decode, then augment; Device: train. Depending on which stage dominates, a run is either device/training limited or host/I/O limited.

9 EXAMPLE: CNTK ON DGX-1. [Timeline plots: ResNet-50 is device/training limited; AlexNet is host/I/O limited, with visible sync overhead.]

10 THROUGHPUT COMPARISON (images/second, DGX-1 V100)

Stage           | Throughput (images/s)                    | Notes
File I/O        | ~10,000                                  | 290x550 images; ~1600 MB/s with LMDB
Image decoding  | ~10,000-15,000                           | 290x550 images; libjpeg-turbo, OMP_PROC_BIND=true
Training        | >6,000 (ResNet-50), >14,000 (ResNet-18)  | synthetic dataset
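For reference, a minimal sketch of streaming records out of an LMDB database with the Python lmdb bindings; the database path and key/value layout are assumptions.

```python
# Sketch: sequential scan of an LMDB database. LMDB is memory-mapped,
# so reads avoid per-file open/close overhead, which is how the ~1600 MB/s
# file I/O figure above is reached far more easily than with loose files.
import lmdb

def iter_records(db_path):
    env = lmdb.open(db_path, readonly=True, lock=False)
    with env.begin(buffers=True) as txn:     # zero-copy read buffers
        for key, value in txn.cursor():      # sequential cursor scan
            yield bytes(key), bytes(value)   # e.g. (image id, encoded JPEG)
```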

11 ROOFLINE ANALYSIS. [Plot: achieved images/s vs. training-only images/s for ResNet variants (RN-18/34/50/101), Inception v3, and VGG, measured with the Caffe2 framework. Horizontal lines show I/O pipeline throughputs: 352: 9,800 images/s; 480: 4,900 images/s. Networks above the lines are I/O bound; the rest are compute bound.]

12 OPTIMIZE IMAGE I/O: RECOMMENDATIONS
- Use fast data and file loading mechanisms such as LMDB or RecordIO.
- When loading from files, consider mmap instead of fopen/fread.
- Use fast image decoding libraries such as libjpeg-turbo. Using OpenCV's imread relinquishes control over these optimizations and sacrifices performance.
- Optimize augmentation: allow augmentation on the GPU for I/O-limited networks.
[Plot: JPEG read and decoding time in seconds for 1024 files (290x550) vs. number of threads; series: decoding time, mmap+decode, OpenCV imread.]
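As a sketch of the mmap recommendation: map the file and hand the bytes to a libjpeg-turbo binding. PyTurboJPEG is assumed here as the binding; any libjpeg-turbo wrapper works the same way.

```python
# Sketch: read a JPEG via mmap (instead of fopen/fread) and decode it
# with libjpeg-turbo through the PyTurboJPEG binding (an assumption;
# substitute your preferred libjpeg-turbo wrapper).
import mmap
from turbojpeg import TurboJPEG  # pip install PyTurboJPEG

jpeg = TurboJPEG()

def read_and_decode(path):
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        return jpeg.decode(m[:])  # decoded image as a BGR numpy array
```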

13 OPTIMIZE COMMUNICATION

14 DATA PARALLEL AND NCCL. NCCL uses rings to move data across all GPUs and perform reductions. Ring allreduce is bandwidth-optimal and adapts to many topologies. DGX-1: 4 unidirectional rings.
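To show why rings are bandwidth-optimal, here is an illustrative single-process simulation of ring allreduce; this is pure Python for clarity, not how NCCL implements it. A reduce-scatter phase is followed by an allgather, and each step moves only 1/n of the data.

```python
# Illustrative ring allreduce over n simulated ranks. Each rank holds a
# vector split into n chunks. Phase 1 (reduce-scatter): after n-1 steps
# every rank owns one fully summed chunk. Phase 2 (allgather): n-1 more
# steps circulate the summed chunks. Every step sends only 1/n of the
# data, which is what makes the ring bandwidth-optimal.
def ring_allreduce(data):
    n = len(data)  # data[r][c] = rank r's chunk c (a list of numbers)
    for s in range(n - 1):  # reduce-scatter
        sends = [(r, (r - s) % n, list(data[r][(r - s) % n])) for r in range(n)]
        for r, c, chunk in sends:  # rank r sends chunk c to rank r+1
            dst = (r + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], chunk)]
    for s in range(n - 1):  # allgather: same ring, overwrite instead of add
        sends = [(r, (r + 1 - s) % n, list(data[r][(r + 1 - s) % n])) for r in range(n)]
        for r, c, chunk in sends:
            data[(r + 1) % n][c] = chunk

ranks = [[[1.0], [2.0]], [[10.0], [20.0]]]  # 2 ranks, 2 chunks each
ring_allreduce(ranks)
assert ranks == [[[11.0], [22.0]], [[11.0], [22.0]]]  # all ranks hold the sums
```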

15 NCCL 2: MULTI-NODE SCALING


17 TRAIN WITH LARGE BATCHES

18 DIFFICULTIES OF LARGE-BATCH TRAINING. It is difficult to keep the test accuracy while increasing the batch size. Recipe from [Goyal, 2017]:
- linear scaling of the learning rate γ as a function of batch size B
- a learning-rate warm-up to prevent divergence during the initial training phase
For ResNet-50, optimization is not a problem if you get the right hyperparameters. (Priya Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour")
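A sketch of that recipe as a schedule function. The base LR of 0.1 at batch size 256 matches the ResNet-50 setting in [Goyal, 2017], as does the 5-epoch warm-up; using epoch granularity here is a simplification of their per-iteration warm-up.

```python
# Linear scaling + gradual warm-up from [Goyal, 2017], as a function of
# epoch and batch size. base_lr is the LR known to work at base_batch;
# the peak LR scales linearly with batch size, and the first
# warmup_epochs ramp from base_lr up to the peak to avoid early divergence.
def scaled_lr(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    peak = base_lr * batch_size / base_batch        # linear scaling rule
    if epoch < warmup_epochs:                       # gradual warm-up
        return base_lr + (peak - base_lr) * epoch / warmup_epochs
    return peak
```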

19 LARGE-BATCH TRAINING
- Smith et al., "Don't Decay the Learning Rate, Increase the Batch Size"
- Keskar et al., "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima"
- Akiba et al., "Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes"

20 LAYER-WISE ADAPTIVE RATE SCALING (LARS). Use a local LR $\lambda^l$ for each layer $l$:

$$\Delta w_t^l = \gamma \, \lambda^l \, \nabla L(w_t^l)$$

where $\gamma$ is the global LR, $\nabla L(w_t^l)$ is the stochastic gradient of the loss function $L$ w.r.t. $w_t^l$, and $\lambda^l$ is the local LR for layer $l$.
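A minimal sketch of a LARS update following the formula above. The norm-ratio form of the local LR and the trust coefficient η come from the LARS paper (You et al., "Large Batch Training of Convolutional Networks"); momentum and weight decay are omitted for brevity.

```python
# Minimal LARS step (sketch; momentum and weight decay omitted).
# Per layer l the local LR is lambda_l = eta * ||w_l|| / ||grad_l||,
# so the global LR gamma is rescaled to keep each layer's update small
# relative to its weight norm:  w_l <- w_l - gamma * lambda_l * grad_l
import torch

def lars_step(params, gamma, eta=1e-3, eps=1e-9):
    with torch.no_grad():
        for w in params:                                  # one tensor per layer
            lam = eta * w.norm() / (w.grad.norm() + eps)  # local LR for layer l
            w -= gamma * lam * w.grad
```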

21 RESNET-50 WITH LARS: B = 32K. More details on LARS:

22 SUMMARY
1) Larger batches allow scaling to a larger number of nodes while maintaining high utilization of each GPU.
2) The key difficulty in large-batch training is numerical optimization.
3) The existing approach, based on large learning rates, can lead to divergence, especially during the initial phase, even with warm-up.
4) With Layer-wise Adaptive Rate Scaling (LARS), we scaled ResNet-50 up to B = 16K.

23 FUTURE CHALLENGES
- Hybrid model parallelism and data parallelism
- Disk I/O for large datasets that can't fit in system memory or on-node SSDs
