S7750 - THE MAKING OF DGX SATURNV: BREAKING THE BARRIERS TO AI SCALE
Presenter: Louis Capps, Solution Architect, NVIDIA, lcapps@nvidia.com

A TALE OF ENLIGHTENMENT
BASIC (1 FPS):
OK
LIST
10 for x = 1 to 3
20 print x
30 next x
RUN
1
2
3
OK

DEEP LEARNING - A NEW COMPUTING PLATFORM
Assembly (30 FPS):
     LDX #$00
dec: INX
     JSR printx
     CPX #$03
     BNE dec
     BRK
DL -->

SATURNV PURPOSE - 124-node supercomputing cluster
Innovation is fueled by the right engine!
- Deep Learning scalability - move outside the box
- Drive research and Deep Learning applications
- Partner with university research, government, and industry collaborations
- Enable data science in HPC

NVIDIA DGX SATURNV ARCHITECTURE - 124-node cluster (nvidia.com/dgx1)
- 124 NVIDIA DGX-1 nodes, 992 Tesla P100 GPUs
- Per node: 8x NVIDIA Tesla P100 SXM GPUs in an NVLink cube mesh
- 2x Intel Xeon 20-core CPUs, 512 GB DDR4 system memory
- SSD: 7 TB scratch + 0.5 TB OS
- Mellanox 36-port EDR L1 and L2 switches, 4 ports per system, fat-tree topology
- Ubuntu 14.04, CUDA 8, OpenMPI 1.10.5a1, Docker, DL frameworks
- NVIDIA GPU BLAS + Intel MKL (NVIDIA GPU HPL)
- Deep Learning applied research: many users, frameworks, algorithms, networks, new approaches
- Embedded, robotics, automotive, hyperscale, HPC

SATURNV STACK

DGX-1 MULTI-SYSTEM

NVIDIA DGX SATURNV - Greenest Supercomputer

NVIDIA DGX-1 SATURNV HPL RUN - 124-node supercomputing cluster
HPL setup:
- Problem contained mainly in GPU memory (~16 GB/GPU): 124 nodes * 8 GPUs/node * 16 GB/GPU = 15,872 GB --> N = 1,419,552
Measurement:
- PDU input power time-stamped during the full run
- All cluster hardware: nodes, switches, storage
Performance:
- HPL Rpeak: 4,896 TF
- HPL Rmax: 3,307 TF
- Power, full-run average: 321.2 kW
- Power, core average: 349.5 kW (~15 kW sustained per rack)
- 9.4 GF/W - 40% better than the nearest competing technology
SATURNV produced a groundbreaking 9.4 GF/W at full scale --> sets the stage for future exascale-class computing
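A quick way to see how these figures fit together: the HPL matrix is double precision (8 bytes per element), so N is chosen to nearly fill aggregate GPU memory, and the Green500 number is Rmax divided by average power. A minimal sketch of that arithmetic, using only the values on this slide (the ~95% fill factor is an inference, not stated in the talk):

```python
# Back-of-the-envelope check of the HPL sizing and efficiency quoted above;
# all inputs are the numbers on this slide.
nodes, gpus_per_node, gib_per_gpu = 124, 8, 16
gpu_mem_bytes = nodes * gpus_per_node * gib_per_gpu * 2**30        # 15,872 GiB aggregate

N = 1_419_552                            # HPL problem size used for the run
matrix_bytes = 8 * N**2                  # one double-precision N x N matrix
print(f"matrix fills {matrix_bytes / gpu_mem_bytes:.0%} of GPU memory")  # ~95%

rmax_gflops = 3_307_000                  # Rmax = 3,307 TF
core_power_watts = 349_500               # 349.5 kW core average
print(f"{rmax_gflops / core_power_watts:.2f} GF/W")                      # ~9.46
```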

NOV 2016 TOP GREEN500 SYSTEM (green500.org, top500.org)
SATURNV produced a groundbreaking 9.4 GF/W at full scale --> sets the stage for future exascale-class computing

WHAT IS HPL, TOP500, GREEN500?
HPL - High Performance Linpack:
- Multi-system benchmark that measures optimized double-precision floating-point performance
- Solves a dense system of linear equations
- One system or many connected in a cluster, usually over Ethernet or InfiniBand
- A single problem is split across many systems, yielding a single final performance number
- Well designed to scale across large clusters and push their limits
Top500 (top500.org):
- List of the fastest HPL clusters in the world
- Updated twice a year, June and November, published at the ISC and SC conferences
Green500 (green500.org):
- Same HPL clusters, but ranked by power used during the HPL run
- Published at the same time as the Top500
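For reference, the standard HPL accounting behind these rankings (well-known definitions, not spelled out on the slide): HPL factors and solves a dense system Ax = b, Rmax is the conventional operation count divided by wall-clock time, and the Green500 metric normalizes that by average power:

$$
\mathrm{FLOPs}(N) = \tfrac{2}{3}N^{3} + 2N^{2},\qquad
R_{\max} = \frac{\mathrm{FLOPs}(N)}{t_{\mathrm{wall}}},\qquad
\mathrm{GF/W} = \frac{R_{\max}}{P_{\mathrm{avg}}}
$$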

DGX-1 SUPERCOMPUTER CHALLENGES - Giant Leap Towards Exascale AI
Compute:
- Significant math performance: FP32, FP16, INT8
- Highly optimized frameworks: training, inference
Interconnect:
- Multiple compute units inside a node
- Multiple systems
Storage:
- Low latency, high bandwidth
- Equal performance to all systems
- Local caching for DL workloads
Facilities:
- Sufficient for bursts; maintain inlet air temperature at all times
- High power density

NVIDIA DGX-1 COMPUTE - NCCL Collective Library

DGX-1 COMPUTE AND MULTI-SYSTEM
DGX-1 single-system considerations:
- Higher performance per system (27x to 58x faster): ingests data faster and provides faster results, but also more power and heat
- High data ingest for DL workloads: more storage and I/O into a single system; cache data locally (NFS cache on local SSD for training data)
- Higher power/thermal density - example: 32 racks @ 750 kW vs. 200 racks @ 1,000 kW
- Ambient temperature is very important: silicon uses more power at higher temperatures, clocks will gate at thermal and power limits, and variability lowers overall performance of multi-GPU and multi-system runs

DGX-1 COMPUTE CONSIDERATIONS
#1 recommendation - use containers:
- Improves performance
- Access to the latest NVIDIA-tuned codes
- Latest NCCL libraries
Clocking:
- Set CPUs to performance mode to improve memory/I/O bandwidth
- Leave GPU clocks at default; if you do set them, use the base clock or slightly higher
- Running pinned at max clocks can cause extreme variation and reduced performance depending on workload
- Monitor with nvidia-smi dmon
[Chart: "Effects of Clocking" - clock frequency (1320-1500 MHz) over time]
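One way to watch for the clock variation described above, alongside nvidia-smi dmon, is a small polling loop. This is my illustration rather than anything from the talk; it only assumes nvidia-smi is on the PATH:

```python
import subprocess
import time

# Fields follow nvidia-smi's --query-gpu interface (see nvidia-smi --help-query-gpu).
QUERY = ["nvidia-smi",
         "--query-gpu=index,clocks.sm,temperature.gpu,power.draw",
         "--format=csv,noheader,nounits"]

while True:
    for line in subprocess.check_output(QUERY, text=True).strip().splitlines():
        idx, sm_mhz, temp_c, power_w = (v.strip() for v in line.split(","))
        print(f"gpu{idx}: {sm_mhz} MHz, {temp_c} C, {power_w} W")
    time.sleep(1)
```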

DGX-1 COMPUTE CONSIDERATIONS
Affinity:
- Best performance when CPU/GPU/memory/IB affinity are aligned, e.g. CPU socket 0 <-> GPU0/1 <-> mlx5_0
Interrupts:
- Interrupt traffic can be high; keep core 0 and core 20 free for interrupts
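As an illustration of aligning affinity in practice (my sketch, not from the talk): a hypothetical bind_to_gpu helper pins the calling process to the cores serving its GPU, reusing the physcpubind ranges from the numactl example on the next slide and leaving cores 0 and 20 free for interrupts. The GPU-to-core table is an assumption to verify on the actual system:

```python
import os

# Assumed GPU -> core mapping; mirrors the physcpubind ranges in the numactl
# example on the following slide. Verify the real layout with `nvidia-smi topo -m`.
CORES_FOR_GPU = {
    0: range(1, 5),   1: range(6, 10),    # CPU socket 0
    2: range(10, 14), 3: range(15, 19),
    4: range(21, 25), 5: range(25, 29),   # CPU socket 1
    6: range(30, 34), 7: range(35, 39),
}

def bind_to_gpu(gpu_id: int) -> None:
    """Pin this process to the cores serving gpu_id; cores 0 and 20 stay free."""
    # Must run before the CUDA runtime initializes for the device mask to stick.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    os.sched_setaffinity(0, set(CORES_FOR_GPU[gpu_id]))  # pid 0 = current process

bind_to_gpu(3)  # e.g. the worker process driving GPU 3
```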

DGX-1 MULTI-SYSTEM CONSIDERATIONS
[Diagram: IB leaf switch above the node; per PCIe tree, CPU0/MEM0 serves GPU0/1 with mlx5_0 and GPU2/3 with mlx5_1, CPU1/MEM1 serves GPU4/5 with mlx5_2 and GPU6/7 with mlx5_3]
Example affinity with numactl:
mpirun \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_0 -x CUDA_VISIBLE_DEVICES=0 numactl --physcpubind=1-4 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_0 -x CUDA_VISIBLE_DEVICES=1 numactl --physcpubind=6-9 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_1 -x CUDA_VISIBLE_DEVICES=2 numactl --physcpubind=10-13 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_1 -x CUDA_VISIBLE_DEVICES=3 numactl --physcpubind=15-18 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_2 -x CUDA_VISIBLE_DEVICES=4 numactl --physcpubind=21-24 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_2 -x CUDA_VISIBLE_DEVICES=5 numactl --physcpubind=25-28 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_3 -x CUDA_VISIBLE_DEVICES=6 numactl --physcpubind=30-33 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_3 -x CUDA_VISIBLE_DEVICES=7 numactl --physcpubind=35-38 ./mycode

DGX-1 MULTI-NODE INTERCONNECT DESIGN
[Diagram: multi-node fat-tree fabric - 6,012 GB/s]
- Design topologies that reduce latency and improve total bandwidth, fat-tree topologies for instance: equal bandwidth from each system all the way up to the top-level switch
- Ensure GPUDirect RDMA is enabled
- DL and many computational workloads rely on fast synchronization: collectives, consistent iteration times
- System hierarchy:
  - CPU0 <-> GPU0/1 <-> mlx5_0
  - CPU0 <-> GPU2/3 <-> mlx5_1
  - CPU1 <-> GPU4/5 <-> mlx5_2
  - CPU1 <-> GPU6/7 <-> mlx5_3
- If designing with only two IB ports, hook up mlx5_0 and mlx5_2
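Two quick checks (my suggestion, not from the talk) that the hierarchy above matches what the software sees and that GPUDirect RDMA is available. It assumes nvidia-smi and lsmod are on the PATH and that the Mellanox GPUDirect module is the usual nv_peer_mem:

```python
import subprocess

# GPU/NIC affinity matrix: confirms which mlx5_* HCA sits under which CPU/GPU pair.
print(subprocess.check_output(["nvidia-smi", "topo", "-m"], text=True))

# GPUDirect RDMA over InfiniBand requires the nv_peer_mem kernel module.
lsmod = subprocess.check_output(["lsmod"], text=True)
print("nv_peer_mem loaded:", "nv_peer_mem" in lsmod)
```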

DGX-1 MULTI-SYSTEM INTERCONNECT
DGX-1 multi-system considerations:
- High node-to-node communication for DL and HPC workloads
- Dropping from 4 IB ports to 2: up to 5% loss for DL, up to 18% loss for compute
- 1 IB port per system gives low performance: significant contention for many workloads, and GPUDirect RDMA can't reach across the full system
- Switch hierarchy is critical: low bandwidth at the second level causes the same issues as lowering ports per system (contention, lower bandwidth, variability)

DGX-1 STORAGE CONSIDERATIONS
Storage needs:
- HPC needs are well known: parallel file systems like Lustre and Spectrum Scale are well suited
- DL workloads are just being understood: read-dominated, input data rarely changes, can be raw or formatted in a DB (like LMDB), large groups of random reads, then the same data is reread later
Approaches:
- Local caching helps significantly; caches can be many GB (>16 GB, for instance)
- Another approach is to keep full datasets local (>100 GB for ImageNet) on a local SSD RAID
- Alternately, copy all data to the nodes at the beginning of the job
Reference designs:
- 10Gb-attached central NFS with local caching
- Spectrum Scale, IB-attached (still evaluating)
- Lustre, IB-attached (still evaluating)
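A minimal sketch of the "copy all data to the nodes at the beginning of the job" approach described above. The dataset name and paths are placeholders, and the node-local SSD RAID is assumed to be mounted at /raid:

```python
import shutil
from pathlib import Path

# Placeholder paths: a read-mostly dataset on shared NFS and the node-local
# SSD RAID scratch (assumed mounted at /raid on a DGX-1).
NFS_DATASET = Path("/mnt/nfs/datasets/imagenet")
LOCAL_CACHE = Path("/raid/cache/imagenet")

def stage_dataset() -> Path:
    """Copy the dataset to local SSD once; subsequent epochs reread it locally."""
    if not LOCAL_CACHE.exists():
        LOCAL_CACHE.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(NFS_DATASET, LOCAL_CACHE)
    return LOCAL_CACHE

train_dir = stage_dataset()   # point the DL framework's data loader at this path
```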

AI GRAND CHALLENGES
- CANDLE - accelerate cancer research
- Energy / fusion - the future of low-cost energy
- Weather and climate - disaster preparedness
- Astrophysics - our future?
- Autonomous cars

DGX-1 DL SCALABILITY SUMMARY
- DGX-1 is crafted for AI and computational workloads: high compute density, but also high power and thermal density - watch ambient temperature, which can cause large variability
- A single system has large demands for data ingest and GPU-to-GPU communication
- Multi-DGX-1 systems have large demands on inter-node communication for most workloads: need at least two IB rails per system (1 EDR IB for every 2 GPUs)
- DL storage needs are very high, but read-dominated (vs. writes with HPC)
- Many codes benefit significantly from watching affinity: align CPU/memory with GPUs and IB cards, and avoid cores handling interrupts
- NVIDIA pre-made containers significantly reduce user work: affinity is already handled, and they provide technologies like NCCL plus the latest tuned code and frameworks

DGX-1 DL SCALABILITY SUMMARY - Thanks!!!
More info in the NVIDIA DGX-1 System Architecture whitepaper: http://www.nvidia.com/object/dgx-1-system-architecture-whitepaper.html
CANDLE sessions (http://www.gputechconf.com/agenda/schedule):
- S7788 - CANDLE: PREDICTING TUMOR CELL RESPONSE TO DRUG TREATMENTS
- S7782 - THE DOE AND NCI PARTNERSHIP ON PRECISION ONCOLOGY AND THE CANCER MOONSHOT
- S7792 - BUILDING EXASCALE DEEP LEARNING TOOLS TO HELP UNDERSTAND CANCER BIOLOGY AT THE MOLECULAR SCALE
- S7780 - BUILDING EXASCALE DEEP TEXT COMPREHENSION TOOLS FOR EFFECTIVE CANCER SURVEILLANCE
S7754 - WHAT'S NEXT IN DGX SERVER SOLUTIONS FOR DEEP LEARNING - Thursday, May 11, 10:00 AM - 10:50 AM, Room 210B