IBM Deep Learning Solutions


Stuart Harris

1 IBM Deep Learning Solutions: Reference Architecture for Deep Learning on POWER8, P100, and NVLink. October 2016

2 How do you teach a computer to perceive?

3 Deep Learning: teaching Siri to recognize a bicycle

4 Deep Learning: a tale of two infrastructures
Deep Learning / Training:
- Focused on perceptive tasks
- Teach a computer to recognize and categorize images (cross industry)
- Develop a model for natural language processing, real-time translation, and/or interactive voice response (cross industry)
- Discover patterns of behavior and potential preferences (retail, entertainment)
- Focus: datacenter infrastructure
Deep Learning / Inference:
- Act upon a trained model
- Autonomous vehicle: move through the physical world
- System of engagement: simplify access for users and provide a more familiar human/computer interface
- Better meet client needs by delivering recommendations, suggestions, or alternatives
- Focus: System on Chip, low-power device, partner solutions

5 Accelerated AI / Deep Learning Strategy for Power
- Modify open-source DL frameworks to add innovations, optimizations, and new algorithms
- Add system-level optimizations, for example taking advantage of NVLink on the Power platform and better network (scale-out) performance
- Build differentiated GPU-accelerated system solutions using NVLink
[Diagram: Deep Learning frameworks; new algorithmic techniques and optimizations; Power/NVLink-specific optimizations; IBM version of DL frameworks]

6 Introducing Deep Learning on POWER8 and NVLink
- A software distribution of Deep Learning frameworks optimized for the POWER8 S822LC for HPC server and for large-scale cluster scaling, enabling much faster training of deep learning models
- Software frameworks are made available at:
  - launchpad.net, for stabilized and ported versions of Deep Learning frameworks and supporting libraries, in open source and binary distribution
  - ibm.com, for binary distribution of optimized packages containing neural network optimizations from IBM Research
- Systems are available through IBM direct and Business Partner channels globally, and are provided as a recommended configuration (reference architecture)
- Binary distribution of optimized Deep Learning frameworks will be supported through IBM Technical Support Services in the coming months
- Targeted availability for the initial software frameworks is October 31,

7 Simplify Access and Installation
- Tested, binary builds of common Deep Learning frameworks for ease of implementation
- Simple, complete installation process documented on IBM OpenPOWER and search Deep Learning
- Future focus on optimizing specific packages for POWER: NVIDIA Caffe, TensorFlow, and Torch
Already ported vs. future focus:
- OS: Ubuntu (already ported); Ubuntu (future focus)
- CUDA
- cudnn
- Built w/ MASS: Yes (already ported); Yes (future focus)
- OpenBLAS (future focus: optimize)
- Caffe 1.0 rc3
- NVIDIA Caffe (future focus: optimize)
- NVIDIA DIGITS 3.2
- Torch 7 (future focus: optimize)
- Theano
- TensorFlow 0.9(*) (future focus: optimize)
- CNTK Nov 2015(*)
- DL4J 0.5.0(*)
- Chainer
- GPU: 2 x K80 (already ported); 4 x P100 (future focus)
- Base system: 822LC (already ported); Minsky (future focus)
(*) Ported; not released as binary

8 POWER8+P100+NVLink for increased system bandwidth
- NVLink between CPUs and GPUs enables fast memory access to large data sets in system memory
- Two NVLink connections on each GPU-GPU and CPU-GPU path lead to faster data exchange
- First to market: volume shipments starting September 2016
[Diagram: two POWER8 CPUs, each with system memory and two P100 GPUs with GPU memory, connected by NVLink; bandwidth labels: 80 GB/s and 115 GB/s]
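To put the bandwidth labels on this slide in perspective, the following is a minimal back-of-the-envelope sketch. It assumes that the 80 GB/s label refers to the CPU-GPU NVLink path, and it compares against a PCIe 3.0 x16 figure of roughly 32 GB/s (bidirectional) that is a commonly quoted number, not a value from the slide. The function name and the 16 GB payload are hypothetical, and the estimate ignores latency and protocol overhead.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal time to move size_gb at a sustained bandwidth.

    Ignores latency, protocol overhead, and contention, so real
    transfers will be slower than this estimate.
    """
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


# From the slide's diagram; assumed here to be the CPU-GPU NVLink figure.
NVLINK_GB_PER_S = 80.0
# Assumption, not from the slide: approximate PCIe 3.0 x16 bidirectional peak.
PCIE3_X16_GB_PER_S = 32.0

# Hypothetical payload: 16 GB of training data staged from system memory.
payload_gb = 16.0
nvlink_time = transfer_seconds(payload_gb, NVLINK_GB_PER_S)
pcie_time = transfer_seconds(payload_gb, PCIE3_X16_GB_PER_S)

print(f"NVLink (assumed 80 GB/s): {nvlink_time:.2f} s")
print(f"PCIe 3.0 x16 (assumed 32 GB/s): {pcie_time:.2f} s")
print(f"Ideal ratio: {pcie_time / nvlink_time:.1f}x")
```

The ratio only reflects peak link bandwidth under these assumptions; measured training speedups depend on how much of each iteration is spent moving data.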

9 Improve performance: 2.2X faster training time
- Training time compared (minutes): AlexNet with Caffe to 50% top-1 accuracy (shorter is better)
- [Chart: 4 x M40 / PCIe vs. 4 x P100 / NVLink]
- AlexNet trained in under 1 hour (57 mins)

10 Improve Performance: Reduce Communication Overhead
- NVLink reduces communication time and overhead
- Data gets from GPU to GPU and from memory to GPU faster, for shorter training times
NVLink advantage: data communication
- POWER8+P100+NVLink: 78 ms
- Digits devbox: 170 ms
ImageNet / AlexNet: Minibatch size =
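The two communication times quoted on this slide can be turned into a ratio and a percentage reduction. This is a small arithmetic sketch using only the 78 ms and 170 ms figures; the function names are illustrative, and it says nothing about end-to-end training time, which also includes compute.

```python
def speedup(baseline_ms: float, improved_ms: float) -> float:
    """How many times faster the improved system is than the baseline."""
    return baseline_ms / improved_ms


def reduction_fraction(baseline_ms: float, improved_ms: float) -> float:
    """Fraction of the baseline time that the improved system removes."""
    return (baseline_ms - improved_ms) / baseline_ms


# Figures from the slide.
DEVBOX_MS = 170.0          # Digits devbox
POWER8_NVLINK_MS = 78.0    # POWER8+P100+NVLink

ratio = speedup(DEVBOX_MS, POWER8_NVLINK_MS)
saved = reduction_fraction(DEVBOX_MS, POWER8_NVLINK_MS)

print(f"Communication speedup: {ratio:.2f}x")
print(f"Communication time reduced by: {saved:.1%}")
```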

11 Business Value Message
Deep Learning:
- Increased efficiency of the data scientist: reduced training time allows the scientist to iterate and improve models
- Improved inference (end product): higher performance allows for more training runs, which improve accuracy
Why IBM?
- Cost effective: two S822LC for HPC servers cost less than the NVIDIA DGX-1 offering, and are fully configurable
- OpenPOWER Deep Learning Software Distribution: ease of implementation, with tested and built frameworks and an IBM-documented installation process
- Single source for applications, libraries, and other system components, for faster time to compute
- First to market: competitors are selling last year's model; OpenPOWER is first to market with P100 GPUs, the fastest GPU for Deep Learning

12 S822LC for HPC: System Requirements for Deep Learning
2-socket, 4-GPU system with NVLink
Required:
- 2 POWER8 10-core CPUs
- 4 NVIDIA P100 Pascal GPUs
- 256 GB system memory
- 2 SSD storage devices
- High-speed interconnect (IB or Ethernet, depending on infrastructure)
Optional:
- Up to 1 TB system memory
- PCIe-attached NVMe storage
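The required configuration above can be expressed as a simple checklist. The sketch below is illustrative only: the `NodeSpec` type, its field names, and the `unmet_requirements` helper are hypothetical, and treating the listed counts as minimums (with 1 TB as the memory ceiling) is an assumption rather than something the slide states.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical description of one candidate server."""
    power8_cpus: int
    cores_per_cpu: int
    p100_gpus: int
    memory_gb: int
    ssd_devices: int
    high_speed_interconnect: bool  # InfiniBand or Ethernet


def unmet_requirements(spec: NodeSpec) -> List[str]:
    """Return the slide's required items that the spec does not satisfy.

    Assumption: listed counts are treated as minimums, and system
    memory may range from 256 GB up to the optional 1 TB.
    """
    problems = []
    if spec.power8_cpus < 2:
        problems.append("needs 2 POWER8 CPUs")
    if spec.cores_per_cpu < 10:
        problems.append("needs 10-core CPUs")
    if spec.p100_gpus < 4:
        problems.append("needs 4 NVIDIA P100 GPUs")
    if not 256 <= spec.memory_gb <= 1024:
        problems.append("needs 256 GB to 1 TB system memory")
    if spec.ssd_devices < 2:
        problems.append("needs 2 SSD storage devices")
    if not spec.high_speed_interconnect:
        problems.append("needs a high-speed interconnect")
    return problems


reference = NodeSpec(2, 10, 4, 256, 2, True)
undersized = NodeSpec(2, 10, 2, 128, 2, True)

print(unmet_requirements(reference))
print(unmet_requirements(undersized))
```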

13 Backup

14 Why Power
Power8:
- 8 threads per core
- Fast inter-CPU connection
- Wide memory bus
- CAPI
- Vector intrinsics
- Library support
GPUs:
- NVLink
- Fast CPU-to-GPU connection through NVLink (exclusive)
- Fast inter-GPU connectivity through NVLink
- First to market with Pascal
Data:
- Parallel access to data through Spectrum Scale (GPFS)
- Secure
- Extendable
- High performance, single namespace across local to shared flash to deep storage on disk
Future Proof:
- 2 Tier0 systems
- Clear HPC roadmap driven by key technical investments: Summit & Sierra, Power9, NVIDIA Volta, Mellanox interconnect

15 Model details
- AlexNet model
- Batch size
- Step size: 20K
- Rest of the hyperparameters remain default (base_lr: 0.01, wd )
- ImageNet 2012 dataset
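If the "step size" here is the `stepsize` of Caffe's `step` learning-rate policy, the schedule is `base_lr * gamma ^ floor(iter / stepsize)`. The sketch below illustrates that formula with the slide's `base_lr` of 0.01 and step size of 20K. The `gamma` value of 0.1 is an assumption (the slide does not give it), and the function name is illustrative.

```python
import math


def step_lr(iteration: int,
            base_lr: float = 0.01,      # from the slide
            stepsize: int = 20_000,     # from the slide ("Step size 20K")
            gamma: float = 0.1) -> float:  # assumption: not given on the slide
    """Learning rate under a step policy: base_lr * gamma ** floor(iter / stepsize)."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    if stepsize <= 0:
        raise ValueError("stepsize must be positive")
    return base_lr * gamma ** (iteration // stepsize)


# Print the rate at a few iterations around the step boundaries.
for it in (0, 19_999, 20_000, 40_000, 60_000):
    print(f"iteration {it:>6}: lr = {step_lr(it):.6g}")
```

Under these assumptions the rate stays constant within each 20K-iteration window and drops by a factor of `gamma` at each boundary.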

16 System configuration details
Minsky (g217l):
- 16 cores (8 cores/socket), GHz
- 512 GB memory
- OS: Ubuntu
- Endian: LE
- Kernel version: generic
- SW details: G, Gfortran, OpenBlas, Boost, CUDA 8.0 Toolkit, Lapack, Hdf, Opencv, NVCaffe, BVLC-Caffe: f28f5ae2f2453f42b efc326a04d d16d85
TurboTrainer (t1):
- 20 cores (10 cores/socket), GHz
- 512 GB memory
- OS: Ubuntu
- Endian: LE
- Kernel version: generic
- SW details: G, Gfortran, OpenBlas, Boost, CUDA 8.0 Toolkit, Lapack, Hdf, Opencv, NVCaffe, BVLC-Caffe: f28f5ae2f2453f42b efc326a04 dd16d85
Intel (Haswell):
- 24 cores (12 cores/socket), 2.60 GHz
- 252 GB memory
- OS: Ubuntu
- Endian: LE
- Kernel version: generic
- SW details: G, Gfortran, MKL (2016 update 1), ATLAS, Boost, CUDA 7.5 Toolkit, Lapack, Hdf, Opencv, Gmock, BVLC-Caffe: f28f5ae2f2453f42b efc326a0 4dd16d85

17 IBM Power Systems Server Codename Minsky & NVIDIA Tesla P100
- 2 POWER8 CPUs
- Up to 1 TB DDR4 memory
- Up to 4 Tesla P100 GPUs (2x the density)
- First server with POWER8 with NVLink technology
- Only architecture with CPU:GPU NVLink
