TESLA V100 PERFORMANCE GUIDE Life Sciences Applications NOVEMBER 2017

Modern high performance computing (HPC) data centers are key to solving some of the world's most important scientific and engineering challenges. The NVIDIA Tesla accelerated computing platform powers these modern data centers with industry-leading applications to accelerate HPC and AI workloads. The intersection of AI and HPC is extending the reach of science and accelerating the pace of scientific innovation like never before.

The Tesla V100 GPU is the engine of the modern data center, delivering breakthrough performance with fewer servers, resulting in faster insights and dramatically lower costs. Improved performance and time-to-solution can also have significant favorable impacts on revenue and productivity.

Every HPC data center can benefit from the Tesla platform. Over 500 HPC applications in a broad range of domains are optimized for GPUs, including all 15 of the top 15 HPC applications and every major deep learning framework.

RESEARCH DOMAINS WITH GPU-ACCELERATED APPLICATIONS INCLUDE:
> Molecular Dynamics
> Quantum Chemistry
> Deep Learning

Over 500 HPC applications and all deep learning frameworks are GPU-accelerated.
> To get the latest catalog of GPU-accelerated applications, visit: www.nvidia.com/teslaapps
> To get up and running fast on GPUs with a simple set of instructions for a wide range of accelerated applications, visit: www.nvidia.com/gpu-ready-apps

APPLICATION PERFORMANCE GUIDE

MOLECULAR DYNAMICS

Molecular dynamics (MD) represents a large share of the workload in an HPC data center. 100% of the top MD applications are GPU-accelerated, enabling scientists to run simulations they couldn't perform before with traditional CPU-only versions of these applications. When running MD applications, a data center with Tesla V100 GPUs can save up to 80% in server and infrastructure acquisition costs.

KEY FEATURES OF THE TESLA PLATFORM AND V100 FOR MD
> Servers with V100 replace up to 54 CPU servers for applications such as HOOMD-Blue and Amber
> 100% of the top MD applications are GPU-accelerated
> Key math libraries like FFT and BLAS are GPU-accelerated
> Up to 15.7 TFLOPS of single precision performance per GPU
> Up to 900 GB per second of memory bandwidth per GPU

View all related applications at: www.nvidia.com/molecular-dynamics-apps
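The peak-throughput figures quoted above can be sanity-checked from the V100's published hardware parameters. A minimal sketch follows; the core counts (5120 FP32, 2560 FP64) and the ~1530 MHz boost clock of the NVLink (SXM2) variant come from NVIDIA's V100 architecture whitepaper, not from this guide.

```python
# Sanity check of the V100 (SXM2) peak-throughput figures.
# Core counts and the ~1530 MHz boost clock are taken from NVIDIA's
# V100 architecture whitepaper; they are not stated in this guide.

def peak_tflops(cores: int, flops_per_core_per_clock: int, boost_ghz: float) -> float:
    """Peak throughput in TFLOPS: cores x FLOPs per clock x clock rate."""
    return cores * flops_per_core_per_clock * boost_ghz * 1e9 / 1e12

# Each CUDA core retires one fused multiply-add (2 FLOPs) per clock.
fp32 = peak_tflops(cores=5120, flops_per_core_per_clock=2, boost_ghz=1.53)
fp64 = peak_tflops(cores=2560, flops_per_core_per_clock=2, boost_ghz=1.53)

print(f"FP32 peak: {fp32:.1f} TFLOPS")  # ~15.7, matching the figure above
print(f"FP64 peak: {fp64:.1f} TFLOPS")  # ~7.8
```

The same arithmetic with the PCIe variant's lower boost clock (~1380 MHz) yields the "up to 14 TFLOPS" and "up to 7 TFLOPS" figures in the product specifications.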

HOOMD-BLUE
Performance equivalence, single GPU server vs. multiple CPU-only servers. CPU-only servers required to match one server with V100 GPUs:
> 2X V100: 34 CPU-only servers
> 4X V100: 43 CPU-only servers
> 8X V100: 54 CPU-only servers

HOOMD-Blue is a particle dynamics package written from the ground up for GPUs.
VERSION: 2.1.6
ACCELERATED FEATURES: CPU & GPU versions available
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://codeblue.umich.edu/hoomdblue/index.html

CPU Server: Dual Xeon E5-2690 v4 @ 2.6 GHz. GPU Servers: same as CPU server with NVIDIA Tesla V100 for PCIe. NVIDIA CUDA version: 9.0.145. Dataset: Microsphere. To arrive at CPU node equivalence, we use a measured benchmark with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.

AMBER
Performance equivalence, single GPU server vs. multiple CPU-only servers. CPU-only servers required to match one server with V100 GPUs:
> 2X V100: 46 CPU-only servers

AMBER is a suite of programs to simulate molecular dynamics on biomolecules.
VERSION: 16.8
ACCELERATED FEATURES: PMEMD Explicit Solvent & GB; Explicit & Implicit Solvent, REMD, aMD
SCALABILITY: Multi-GPU and Single-Node
MORE INFORMATION: http://ambermd.org/gpus

CPU Server: Dual Xeon E5-2690 v4 @ 2.6 GHz. GPU Servers: same as CPU server with NVIDIA Tesla V100 for PCIe. NVIDIA CUDA version: 9.0.103. Dataset: PME-Cellulose_NVE. To arrive at CPU node equivalence, we use a measured benchmark with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.
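The CPU-node-equivalence methodology described in the benchmark footnotes (measure clusters of up to 8 CPU nodes, then extrapolate with the observed per-node efficiency) can be sketched as follows. The HOOMD-Blue numbers are the ones given in this guide's Assumptions and Disclaimers section; the function name is illustrative.

```python
import math

def cpu_nodes_to_match_gpu_node(gpu_speedup: float,
                                measured_nodes: int,
                                measured_output: float) -> int:
    """CPU-only nodes needed to match one GPU node.

    gpu_speedup:     measured GPU-node speed-up over a single CPU node.
    measured_nodes:  size of the largest CPU cluster actually benchmarked.
    measured_output: total output of that cluster relative to one CPU node.

    CPU scaling is sub-linear, so each added node contributes less than 1x;
    the scaling factor (nodes / output) inflates the node count accordingly.
    """
    scaling_factor = measured_nodes / measured_output
    return math.ceil(gpu_speedup * scaling_factor)

# HOOMD-Blue example from the Assumptions and Disclaimers:
# 37.9X GPU speed-up; 8 CPU nodes deliver 7.1X the output of one node.
print(cpu_nodes_to_match_gpu_node(37.9, 8, 7.1))  # 43
```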

QUANTUM CHEMISTRY

Quantum chemistry (QC) simulations are key to the discovery of new drugs and materials, and consume a large part of the HPC data center's workload. 60% of the top QC applications are accelerated with GPUs today. When running QC applications, a data center with Tesla V100 GPUs can save over 30% in server and infrastructure acquisition costs.

KEY FEATURES OF THE TESLA PLATFORM AND V100 FOR QC
> Servers with V100 replace up to 5 CPU servers for applications such as VASP
> 60% of the top QC applications are GPU-accelerated
> Key math libraries like FFT and BLAS are GPU-accelerated
> Up to 7.8 TFLOPS of double precision performance per GPU
> Up to 16 GB of memory capacity for large datasets

View all related applications at: www.nvidia.com/quantum-chemistry-apps

VASP
Performance equivalence, single GPU server vs. multiple CPU-only servers. CPU-only servers required to match one server with V100 GPUs:
> 2X V100: 3 CPU-only servers
> 4X V100: 5 CPU-only servers

VASP is a package for performing ab-initio quantum-mechanical molecular dynamics (MD) simulations.
VERSION: 5.4.4
ACCELERATED FEATURES: RMM-DIIS, Blocked Davidson, K-points, and exact-exchange
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: www.nvidia.com/vasp

CPU Server: Dual Xeon E5-2690 v4 @ 2.6 GHz. GPU Servers: same as CPU server with NVIDIA Tesla V100 for PCIe. NVIDIA CUDA version: 9.0.103. Dataset: Si-Huge. To arrive at CPU node equivalence, we use a measured benchmark with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.

DEEP LEARNING

Deep learning is solving important scientific, enterprise, and consumer problems that seemed beyond our reach just a few years back. Every major deep learning framework is optimized for NVIDIA GPUs, enabling data scientists and researchers to leverage artificial intelligence for their work. When running deep learning training and inference frameworks, a data center with Tesla V100 GPUs can save up to 85% in server and infrastructure acquisition costs.

KEY FEATURES OF THE TESLA PLATFORM AND V100 FOR DEEP LEARNING TRAINING
> Caffe, TensorFlow, and CNTK are up to 3x faster with Tesla V100 compared to P100
> 100% of the top deep learning frameworks are GPU-accelerated
> Up to 125 TFLOPS of Tensor operations per GPU
> Up to 16 GB of memory capacity with up to 900 GB/s memory bandwidth

View all related applications at: www.nvidia.com/deep-learning-apps
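The 125 TFLOPS deep learning figure comes from the V100's Tensor Cores rather than its CUDA cores. A minimal derivation follows; the Tensor Core count, per-clock operation count, and ~1530 MHz boost clock are assumptions taken from NVIDIA's V100 architecture whitepaper, not from this guide.

```python
# Where the "up to 125 TFLOPS" deep learning figure comes from (V100 SXM2).
# Hardware parameters below are from NVIDIA's V100 architecture whitepaper;
# they are assumptions here, not stated in this guide.

TENSOR_CORES = 640            # 8 Tensor Cores per SM x 80 SMs
FMA_PER_CORE_PER_CLOCK = 64   # each Tensor Core does a 4x4x4 matrix FMA per clock
FLOPS_PER_FMA = 2             # a fused multiply-add counts as two FLOPs
BOOST_CLOCK_GHZ = 1.53

tensor_tflops = (TENSOR_CORES * FMA_PER_CORE_PER_CLOCK * FLOPS_PER_FMA
                 * BOOST_CLOCK_GHZ * 1e9) / 1e12
print(f"Tensor peak: {tensor_tflops:.0f} TFLOPS")  # ~125
```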

CAFFE
Deep learning framework training: server with 8X V100 GPUs vs. server with 8X P100 SXM2 GPUs, measured on GoogLeNet, Inception V3, ResNet-50, and VGG16:
> 1 server with 8X V100 PCIe (16 GB): 2.6X average speedup
> 1 server with 8X V100 SXM2 (16 GB): 2.9X average speedup
Per-network speedups range from 1.9X to 3X.

Caffe is a popular, GPU-accelerated deep learning framework developed at UC Berkeley.
VERSION: 1.0
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU
MORE INFORMATION: caffe.berkeleyvision.org

CPU Server: Dual Xeon E5-2698 v4 @ 3.6 GHz; GPU servers as shown. Ubuntu 14.04.5, CUDA 9.0.176, NCCL 2.0.5, cuDNN 7.0.2.43, driver 384.66. Dataset: ImageNet. Batch sizes: GoogLeNet 192, Inception V3 96, ResNet-50 64 for P100 SXM2 and 128 for Tesla P100, VGG16 96.

LOW-LATENCY CNN INFERENCE PERFORMANCE
Massive throughput and efficiency at low latency. CNN throughput at low latency (ResNet-50), target latency 7 ms; throughput measured in thousands of images per second:
> Xeon CPU: 14 ms
> V100 FP16: 7 ms

System config: single-socket Xeon E5-2690 v4 @ 3.5 GHz, and a single NVIDIA Tesla V100 GPU running TensorRT 3 RC, vs. Intel DL SDK beta 2 on the CPU. Ubuntu 14.04.5, CUDA 9.0.176, NCCL 2.0.5, cuDNN 7.0.2.43, driver 384.66. Precision: CPU FP32, NVIDIA Tesla V100 FP16.

LOW-LATENCY RNN INFERENCE PERFORMANCE
Massive throughput and efficiency at low latency. RNN throughput at low latency (OpenNMT), target latency 200 ms; throughput measured in sentences per second:
> Xeon CPU: 280 ms
> V100 FP16: 117 ms

System config: single-socket Xeon E5-2690 v4 @ 3.5 GHz, and a single NVIDIA Tesla V100 GPU running TensorRT 3 RC, vs. Intel DL SDK beta 2 on the CPU. Ubuntu 14.04.5, CUDA 9.0.176, NCCL 2.0.5, cuDNN 7.0.2.43, driver 384.66. Precision: CPU FP32, NVIDIA Tesla V100 FP16.

TESLA V100 PRODUCT SPECIFICATIONS

NVIDIA Tesla V100 for PCIe-Based Servers:
> Double-Precision Performance: up to 7 TFLOPS
> Single-Precision Performance: up to 14 TFLOPS
> Deep Learning: up to 112 TFLOPS
> NVIDIA NVLink Interconnect Bandwidth: -
> PCIe x16 Interconnect Bandwidth: 32 GB/s
> CoWoS HBM2 Stacked Memory Capacity: 16 GB
> CoWoS HBM2 Stacked Memory Bandwidth: 900 GB/s

NVIDIA Tesla V100 for NVLink-Optimized Servers:
> Double-Precision Performance: up to 7.8 TFLOPS
> Single-Precision Performance: up to 15.7 TFLOPS
> Deep Learning: up to 125 TFLOPS
> NVIDIA NVLink Interconnect Bandwidth: 300 GB/s
> PCIe x16 Interconnect Bandwidth: 32 GB/s
> CoWoS HBM2 Stacked Memory Capacity: 16 GB
> CoWoS HBM2 Stacked Memory Bandwidth: 900 GB/s

ASSUMPTIONS AND DISCLAIMERS

The percentage of top applications that are GPU-accelerated is from the top 50 app list in the i360 report, HPC Support for GPU Computing: http://www.intersect360.com/industry/reports.php?id=131. Calculation of throughput and cost savings assumes a workload profile where the applications benchmarked in the domain take equal compute cycles.

The number of CPU nodes required to match a single GPU node is calculated using lab performance results of the GPU node application speed-up and the multi-CPU node scaling performance. For example, the molecular dynamics application HOOMD-Blue has a GPU node application speed-up of 37.9X. When scaling CPU nodes to an 8-node cluster, the total system output is 7.1X, so the scaling factor is 8 divided by 7.1 (or 1.13).
To calculate the number of CPU nodes required to match the performance of a single GPU node, multiply 37.9 (the GPU node application speed-up) by 1.13 (the CPU node scaling factor), which gives 43 nodes.

© 2017 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, and Tesla are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated. Dec17