IBM Deep Learning Solutions

IBM Deep Learning Solutions: Reference Architecture for Deep Learning on POWER8, P100, and NVLink. October 2016

How do you teach a computer to perceive?

Deep Learning: teaching Siri to recognize a bicycle

Deep Learning: a tale of two infrastructures

Deep Learning / Training
Focused on perceptive tasks:
Teach a computer to recognize and categorize images (cross-industry)
Develop a model for natural language processing, real-time translation, and/or interactive voice response (cross-industry)
Discover patterns of behavior and potential preferences (retail, entertainment)
Focus: datacenter infrastructure

Deep Learning / Inference
Act upon a trained model:
Autonomous vehicle: move through the physical world
System of engagement: simplify access for users and provide a more familiar human/computer interface
Better meet client needs by delivering recommendations, suggestions, or alternatives
Focus: system-on-chip, low-power devices, partner solutions

Accelerated AI / Deep Learning Strategy for Power
Modify open-source DL frameworks to add innovations, optimizations, and new algorithms
Add system-level optimizations, for example taking advantage of NVLink on the Power platform and better network (scale-out) performance
Build differentiated GPU-accelerated system solutions using NVLink
Pipeline: open-source Deep Learning frameworks + new algorithmic techniques and optimizations + Power/NVLink-specific optimizations = IBM version of the DL frameworks

Introducing Deep Learning on POWER8 and NVLink
A software distribution of Deep Learning frameworks optimized for the POWER8 S822LC for HPC server and for large-scale cluster scaling, enabling much faster training of deep learning models
Software frameworks are made available at:
launchpad.net for stabilized and ported versions of Deep Learning frameworks and supporting libraries, in open-source and binary distribution
ibm.com for binary distribution of optimized packages containing neural network optimizations from IBM Research
Systems are available through IBM direct and Business Partner channels globally, and are provided as a recommended configuration (reference architecture)
Binary distribution of the optimized Deep Learning frameworks will be supported through IBM Technical Support Services in the coming months
Targeted availability for the initial software frameworks is October 31, 2016

Simplify Access and Installation
Tested, binary builds of common Deep Learning frameworks for ease of implementation
Simple, complete installation process documented on the IBM OpenPOWER blog (http://openpowerfoundation.org/blogs/, search for "Deep Learning")
Future focus on optimizing specific packages for POWER: NVIDIA Caffe, TensorFlow, and Torch

                 Already ported        Future focus
OS               Ubuntu 14.04          Ubuntu 16.04
CUDA             7.5                   8.0
cuDNN            5.1                   5.1
Built w/ MASS    Yes                   Yes
OpenBLAS         0.2.18                Optimize
Caffe            1.0 rc3
NVIDIA Caffe     0.14.5                Optimize
NVIDIA DIGITS    3.2
Torch            7                     Optimize
Theano           0.8.2
TensorFlow       0.9 (*)               Optimize
CNTK             Nov 2015 (*)
DL4J             0.5.0 (*)
Chainer
GPU              2x K80                4x P100
Base system      822LC                 Minsky
(*) Ported; not released as binary
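As a quick post-install check, a minimal Python sketch like the one below (not part of the distribution; the module names are the usual upstream ones and may differ from the package names in the table, and Torch 7 is Lua-based so it is not imported here) can confirm which frameworks load and that the CUDA driver sees the expected GPUs:

```python
# Post-install sanity check -- an illustrative sketch, not from the IBM distribution.
import importlib
import subprocess

# Usual Python module names for the frameworks listed above; the binary
# distribution's package names may differ. Torch 7 (Lua) is checked separately.
for name in ["caffe", "tensorflow", "theano", "chainer"]:
    try:
        mod = importlib.import_module(name)
        print("{}: {}".format(name, getattr(mod, "__version__", "version unknown")))
    except ImportError:
        print("{}: not installed".format(name))

# Confirm the CUDA driver enumerates the expected GPUs (2x K80 or 4x P100).
print(subprocess.check_output(["nvidia-smi", "-L"]).decode())
```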

POWER8 + P100 + NVLink for increased system bandwidth
NVLink between CPUs and GPUs enables fast memory access to large data sets in system memory
Two NVLink connections between each GPU and the CPU, and between the GPUs on each socket, lead to faster data exchange
First to market: volume shipments starting September 2016
Diagram: each POWER8 CPU connects to its system memory at 115 GB/s and to its pair of P100 GPUs (each with local GPU memory) over NVLink at 80 GB/s
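The bandwidth difference is straightforward to probe. The sketch below (not from the deck) uses PyCUDA to time a pinned host-to-device copy; on an NVLink-attached P100 the measured rate should sit well above the roughly 12-16 GB/s achievable over a PCIe Gen3 x16 link, approaching the per-CPU NVLink figure quoted above:

```python
# Rough host-to-device copy bandwidth probe (illustrative sketch).
import numpy as np
import pycuda.autoinit          # creates a CUDA context on GPU 0
import pycuda.driver as drv

n_bytes = 1 << 30                                        # 1 GiB transfer
host = drv.pagelocked_empty(n_bytes // 4, np.float32)    # pinned host buffer
dev = drv.mem_alloc(n_bytes)                             # device buffer

start, stop = drv.Event(), drv.Event()
start.record()
drv.memcpy_htod(dev, host)      # host -> device over NVLink (or PCIe)
stop.record()
stop.synchronize()

seconds = start.time_till(stop) / 1e3                    # time_till() returns ms
print("H2D bandwidth: {:.1f} GB/s".format(n_bytes / seconds / 1e9))
```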

Improve performance: 2.2x faster training time
Training time compared (minutes): AlexNet with Caffe to top-1, 50% accuracy (shorter is better), 4x M40 / PCIe vs. 4x P100 / NVLink
AlexNet trained in under 1 hour (57 minutes)

Improve performance: reduce communication overhead
NVLink reduces communication time and overhead
Data moves GPU-to-GPU and memory-to-GPU faster, for shorter training times
NVLink advantage in data communication (ImageNet / AlexNet, minibatch size = 128): 78 ms on POWER8+P100+NVLink vs. 170 ms on a DIGITS devbox (170 / 78 ≈ 2.2x)

Business value message: Deep Learning
Increased efficiency of the data scientist: reduced training time allows the scientist to iterate and improve models
Improved inference (the end product): higher performance allows more training runs, which improves accuracy
Why IBM?
Cost effective: two S822LC for HPC servers cost less than the NVIDIA DGX-1 offering, and are fully configurable
OpenPOWER Deep Learning software distribution
Ease of implementation: tested and built frameworks, IBM-documented installation process
Single source for applications, libraries, and other system components, for faster time to compute
First to market: competitors are selling last year's model; OpenPOWER is first to market with P100 GPUs, the fastest GPU for Deep Learning

S822LC for HPC: system requirements for Deep Learning
2-socket, 4-GPU system with NVLink
Required:
2 POWER8 10-core CPUs
4 NVIDIA P100 Pascal GPUs
256 GB system memory
2 SSD storage devices
High-speed interconnect (InfiniBand or Ethernet, depending on infrastructure)
Optional:
Up to 1 TB system memory
PCIe-attached NVMe storage
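A node can be checked against this recommended configuration with a short script; the sketch below (hypothetical, not part of the reference architecture) verifies the GPU count via nvidia-smi and the installed system memory via /proc/meminfo:

```python
# Check a Linux node against the recommended minimums: 4 GPUs, 256 GB memory.
import subprocess

gpus = subprocess.check_output(["nvidia-smi", "-L"]).decode().strip().splitlines()
with open("/proc/meminfo") as f:
    mem_kb = int(next(line.split()[1] for line in f if line.startswith("MemTotal")))

print("GPUs found: {}".format(len(gpus)))
print("System memory: {:.0f} GB".format(mem_kb / 1e6))
assert len(gpus) >= 4 and mem_kb >= 256e6, "below recommended configuration"
```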

Backup

Why Power
POWER8: 8 threads per core, fast inter-CPU connection, wide memory bus, CAPI, vector intrinsics, library support
GPUs: fast CPU-to-GPU connection through NVLink (exclusive), fast inter-GPU connectivity through NVLink, first to market with Pascal
Data: parallel access to data through Spectrum Scale (GPFS); secure, extendable, high-performance, single namespace spanning local storage, shared flash, and deep storage on disk; 2 Tier-0 systems
Future proof: clear HPC roadmap driven by key technical investments: Summit & Sierra, POWER9, NVIDIA Volta, Mellanox interconnect

Model details
AlexNet model, batch size 1024, step size 20K
Remaining hyperparameters left at their defaults (base_lr: 0.01, weight decay: 0.0005)
ImageNet 2012 dataset
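For illustration, these settings could be written out as a Caffe solver through pycaffe's protobuf bindings. The sketch below is a hypothetical reconstruction, not the file used for the benchmark; it reads "step size 20K" as the solver's stepsize field (an assumption), and the batch size of 1024 belongs in the data layer of the net prototxt rather than in the solver:

```python
# Hypothetical solver reconstruction of the settings quoted above (illustrative only).
from caffe.proto import caffe_pb2

solver = caffe_pb2.SolverParameter()
solver.net = "models/alexnet/train_val.prototxt"   # hypothetical path; batch size 1024 set there
solver.base_lr = 0.01                              # default base learning rate (per slide)
solver.weight_decay = 0.0005                       # "wd 0.0005"
solver.lr_policy = "step"
solver.stepsize = 20000                            # "step size 20K" (assumption: lr-decay step)
solver.solver_mode = caffe_pb2.SolverParameter.GPU

with open("solver.prototxt", "w") as f:
    f.write(str(solver))                           # protobuf text format read by caffe train
```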

System configuration details

Minsky (g217l): 16 cores (8 cores/socket), 4.025 GHz, 512 GB memory, Ubuntu 16.04.1, little endian, kernel 4.4.0-34-generic
TurboTrainer (t1): 20 cores (10 cores/socket), 3.694 GHz, 512 GB memory, Ubuntu 16.04, little endian, kernel 4.4.0-36-generic
Intel (Haswell): 24 cores (12 cores/socket), 2.60 GHz, 252 GB memory, Ubuntu 14.04, little endian, kernel 3.13.0-74-generic

SW details (Minsky and TurboTrainer, identical stacks): G++ 5.3.1, gfortran 5.3.1, OpenBLAS 0.2.18, Boost 1.58.0, CUDA 8.0 Toolkit, LAPACK 3.6.0, HDF5 1.8.16, OpenCV 2.4.9, NVCaffe 0.14.5, BVLC-Caffe commit f28f5ae2f2453f42b5824723efc326a04dd16d85

SW details (Intel Haswell): G++ 4.8.4, gfortran 4.8.4, MKL 11.3.1 (2016 update 1), ATLAS 3.10.2, Boost 1.58.0, CUDA 7.5 Toolkit, LAPACK 3.5.0, HDF5 1.8.14, OpenCV 2.4.11, Gmock 1.7.0, BVLC-Caffe commit f28f5ae2f2453f42b5824723efc326a04dd16d85

IBM Power Systems server (codename Minsky) & NVIDIA Tesla P100
2 POWER8 CPUs
Up to 1 TB DDR4 memory
Up to 4 Tesla P100 GPUs (2x the density)
First server with POWER8 with NVLink technology
Only architecture with CPU:GPU NVLink