IBM Deep Learning Solutions Reference Architecture for Deep Learning on POWER8, P100, and NVLink October, 2016
How do you teach a computer to perceive?
Deep Learning: teaching Siri to recognize a bicycle
Deep Learning: a tale of two infrastructures

Deep Learning / Training
- Focused on perceptive tasks
- Teach a computer to recognize and categorize images (cross industry)
- Develop a model for natural language processing, real-time translation, and/or interactive voice response (cross industry)
- Discover patterns of behavior and potential preferences (retail, entertainment)
- Focus: datacenter infrastructure

Deep Learning / Inference
- Act upon a trained model
- Autonomous vehicle: move through the physical world
- System of engagement: simplify access for users and provide a more familiar human/computer interface
- Better meet client needs by delivering recommendations, suggestions, or alternatives
- Focus: system on chip, low-power devices, partner solutions
Accelerated AI / Deep Learning Strategy for Power
- Modify open-source DL frameworks to add innovations, optimizations, and new algorithms
- Add system-level optimizations, for example: take advantage of NVLink on the Power platform, and improve network (scale-out) performance
- Build differentiated GPU-accelerated system solutions using NVLink
[Diagram: Deep Learning frameworks + new algorithmic techniques and optimizations + Power/NVLink-specific optimizations = IBM version of DL frameworks]
Introducing Deep Learning on POWER8 and NVLink
A software distribution of Deep Learning frameworks optimized for the POWER8 S822LC for HPC server and for large-scale cluster scaling, enabling much faster training of deep learning models.
- Software frameworks are made available at:
  - launchpad.net: stabilized and ported versions of Deep Learning frameworks and supporting libraries, in open source and binary distribution
  - ibm.com: binary distribution of optimized packages containing neural network optimizations from IBM Research
- Systems are available through IBM direct and Business Partner channels globally, and are provided as a recommended configuration (reference architecture)
- Binary distribution of the optimized Deep Learning frameworks will be supported through IBM Technical Support Services in the coming months
- Targeted availability for the initial software frameworks: October 31, 2016
Simplify Access and Installation
- Tested, binary builds of common Deep Learning frameworks for ease of implementation
- Simple, complete installation process documented on the IBM OpenPOWER blog (http://openpowerfoundation.org/blogs/ and search "Deep Learning")
- Future focus on optimizing specific packages for POWER: NVIDIA Caffe, TensorFlow, and Torch

                 | Already ported | Future focus
OS               | Ubuntu 14.04   | Ubuntu 16.04
CUDA             | 7.5            | 8.0
cuDNN            | 5.1            | 5.1
Built w/ MASS    | Yes            | Yes
OpenBLAS         | 0.2.18         | Optimize
Caffe            | 1.0 rc3        |
NVIDIA Caffe     | 0.14.5         | Optimize
NVIDIA DIGITS    | 3.2            |
Torch            | 7              | Optimize
Theano           | 0.8.2          |
TensorFlow       | 0.9 (*)        | Optimize
CNTK             | Nov 2015 (*)   |
DL4J             | 0.5.0 (*)      |
Chainer          |                |
GPU              | 2x K80         | 4x P100
Base System      | 822LC          | Minsky
(*) Ported; not released as binary
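The two software levels in the table above can be captured as a small compatibility map. The sketch below is illustrative only: the `STACKS` table and `cuda_level` helper are hypothetical names, not part of the IBM distribution; the data is simply transcribed from the table.

```python
# Illustrative only: encodes the "already ported" vs. "future focus"
# software levels from the table. STACKS and cuda_level are hypothetical
# names, not part of the IBM distribution.
STACKS = {
    "already_ported": {
        "os": "Ubuntu 14.04",
        "cuda": "7.5",
        "cudnn": "5.1",
        "frameworks": {
            "Caffe": "1.0 rc3",
            "NVIDIA Caffe": "0.14.5",
            "NVIDIA DIGITS": "3.2",
            "Torch": "7",
            "Theano": "0.8.2",
            "TensorFlow": "0.9",   # ported; not released as binary
            "CNTK": "Nov 2015",    # ported; not released as binary
            "DL4J": "0.5.0",       # ported; not released as binary
        },
        "gpu": "2x K80",
        "base_system": "822LC",
    },
    "future_focus": {
        "os": "Ubuntu 16.04",
        "cuda": "8.0",
        "cudnn": "5.1",
        "optimize": ["OpenBLAS", "NVIDIA Caffe", "Torch", "TensorFlow"],
        "gpu": "4x P100",
        "base_system": "Minsky",
    },
}

def cuda_level(stack_name):
    """Return the CUDA toolkit level for a named software stack."""
    return STACKS[stack_name]["cuda"]

print(cuda_level("future_focus"))  # the P100/Minsky stack moves to CUDA 8.0
```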
POWER8 + P100 + NVLink increases system bandwidth
- NVLink between CPUs and GPUs enables fast memory access to large data sets in system memory
- Two NVLink connections between each GPU pair, and between each CPU and GPU, lead to faster data exchange
- First to market: volume shipments starting September 2016
[Diagram: two POWER8 CPUs, each linked to two P100 GPUs; NVLink at 80 GB/s between CPU and GPU and between GPU pairs; 115 GB/s between each CPU and system memory]
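To put the link speeds in perspective, a back-of-the-envelope calculation using the 80 GB/s NVLink figure from the diagram. The ~16 GB/s figure for a PCIe 3.0 x16 link and the 64 GB working-set size are assumptions for illustration, not values from the slide; protocol overhead and latency are ignored.

```python
def transfer_seconds(gigabytes, link_gbps):
    """Time to move `gigabytes` of data over a link with `link_gbps` GB/s
    of peak bandwidth (ignores protocol overhead and latency)."""
    return gigabytes / link_gbps

DATASET_GB = 64          # hypothetical working set held in system memory
NVLINK_GBPS = 80         # CPU<->GPU NVLink bandwidth from the slide
PCIE3_X16_GBPS = 16      # approximate PCIe 3.0 x16 peak (assumption)

t_nvlink = transfer_seconds(DATASET_GB, NVLINK_GBPS)   # 0.8 s
t_pcie = transfer_seconds(DATASET_GB, PCIE3_X16_GBPS)  # 4.0 s
print(f"NVLink: {t_nvlink:.1f} s, PCIe 3.0 x16: {t_pcie:.1f} s "
      f"({t_pcie / t_nvlink:.0f}x)")
```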
Improve Performance: 2.2X faster training time
Training time compared (minutes): AlexNet with Caffe to top-1, 50% accuracy (shorter is better)
[Chart: 4x M40 / PCIe vs. 4x P100 / NVLink, y-axis 0 to 140 minutes]
AlexNet trained in under 1 hour (57 minutes)
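The 2.2X claim can be sanity-checked against the chart's endpoints. The baseline time below is inferred from the stated speedup rather than read directly off the slide:

```python
P100_NVLINK_MIN = 57   # 4x P100 / NVLink training time, from the slide
SPEEDUP = 2.2          # claimed speedup over 4x M40 / PCIe

# Implied baseline: ~125 minutes, consistent with the chart's 120-140 range.
m40_pcie_min = P100_NVLINK_MIN * SPEEDUP
print(f"Implied 4x M40 / PCIe training time: {m40_pcie_min:.0f} minutes")
```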
Improve Performance: Reduce Communication Overhead
- NVLink reduces communication time and overhead
- Data moves GPU-to-GPU and memory-to-GPU faster, for shorter training times
NVLink advantage, data communication time per minibatch: POWER8+P100+NVLink: 78 ms; DIGITS DevBox: 170 ms
ImageNet / AlexNet: minibatch size = 128
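The per-minibatch times above translate directly into throughput. The quick calculation below treats the 78 ms and 170 ms figures as full per-minibatch times, a simplifying assumption since the slide labels them as communication time:

```python
BATCH = 128  # ImageNet / AlexNet minibatch size from the slide

def images_per_second(batch, minibatch_ms):
    """Throughput implied by processing one minibatch per interval."""
    return batch / (minibatch_ms / 1000.0)

nvlink = images_per_second(BATCH, 78)    # POWER8 + P100 + NVLink
devbox = images_per_second(BATCH, 170)   # DIGITS DevBox
print(f"{nvlink:.0f} vs {devbox:.0f} images/s ({nvlink / devbox:.2f}x)")
```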
Business Value Message
Deep Learning:
- Increased efficiency for the data scientist: reduced training time allows the scientist to iterate and improve models
- Improved inference (end product): higher performance allows for more training runs, which improves accuracy
Why IBM?
- Cost effective: two S822LC for HPC servers cost less than the NVIDIA DGX-1 offering, and are fully configurable
- OpenPOWER Deep Learning software distribution
- Ease of implementation: tested and built frameworks, with an IBM-documented installation process
- Single source for applications, libraries, and other system components, for faster time to compute
- First to market: competitors are selling last year's model; OpenPOWER is first to market with the P100, the fastest GPU for Deep Learning
S822LC for HPC: System Requirements for Deep Learning
2-socket, 4-GPU system with NVLink
Required:
- 2 POWER8 10-core CPUs
- 4 NVIDIA P100 Pascal GPUs
- 256 GB system memory
- 2 SSD storage devices
- High-speed interconnect (InfiniBand or Ethernet, depending on infrastructure)
Optional:
- Up to 1 TB system memory
- PCIe-attached NVMe storage
Backup
Why Power
POWER8:
- 8 threads per core
- Fast inter-CPU connection
- Wide memory bus
- CAPI
- Vector intrinsics
- Library support
GPUs:
- NVLink: fast CPU-to-GPU connection (exclusive)
- Fast inter-GPU connectivity through NVLink
- First to market with Pascal
Data:
- Parallel access to data through Spectrum Scale (GPFS): secure, extendable
- High performance, single namespace across local storage, shared flash, and deep storage on disk
Future proof:
- 2 Tier-0 systems: Summit & Sierra
- Clear HPC roadmap driven by key technical investments: POWER9, NVIDIA Volta, Mellanox interconnect
Model details
- AlexNet model; batch size 1024; step size 20K
- Remaining hyperparameters at defaults (base_lr: 0.01, weight decay: 0.0005)
- ImageNet 2012 dataset
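The training length implied by these hyperparameters can be derived directly. The ~1.28M ImageNet 2012 training-image count is background knowledge, not stated on the slide:

```python
BATCH_SIZE = 1024            # from the slide
STEPS = 20_000               # step size 20K, from the slide
IMAGENET_TRAIN = 1_281_167   # approx. ImageNet 2012 training set (assumption)

images_seen = BATCH_SIZE * STEPS          # 20,480,000 images
epochs = images_seen / IMAGENET_TRAIN     # ~16 passes over the data
print(f"{images_seen:,} images seen, about {epochs:.1f} epochs")
```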
System configuration details

Minsky (g217l): 16 cores (8 cores/socket), 4.025 GHz; 512 GB memory; Ubuntu 16.04.1, little endian; kernel 4.4.0-34-generic
TurboTrainer (t1): 20 cores (10 cores/socket), 3.694 GHz; 512 GB memory; Ubuntu 16.04, little endian; kernel 4.4.0-36-generic
Intel (Haswell): 24 cores (12 cores/socket), 2.60 GHz; 252 GB memory; Ubuntu 14.04, little endian; kernel 3.13.0-74-generic

SW details (Minsky and TurboTrainer): g++ 5.3.1; gfortran 5.3.1; OpenBLAS 0.2.18; Boost 1.58.0; CUDA 8.0 Toolkit; LAPACK 3.6.0; HDF5 1.8.16; OpenCV 2.4.9; NVCaffe 0.14.5; BVLC-Caffe commit f28f5ae2f2453f42b5824723efc326a04dd16d85
SW details (Intel): g++ 4.8.4; gfortran 4.8.4; MKL 11.3.1 (2016 update 1); ATLAS 3.10.2; Boost 1.58.0; CUDA 7.5 Toolkit; LAPACK 3.5.0; HDF5 1.8.14; OpenCV 2.4.11; Gmock 1.7.0; BVLC-Caffe commit f28f5ae2f2453f42b5824723efc326a04dd16d85
IBM Power Systems server (codename Minsky) with NVIDIA Tesla P100
- 2 POWER8 CPUs
- Up to 1 TB DDR4 memory
- Up to 4 Tesla P100 GPUs (2x the density)
- First server with POWER8 with NVLink technology
- Only architecture with CPU-GPU NVLink