World's most advanced data center accelerator for PCIe-based servers


NVIDIA TESLA P100 GPU ACCELERATOR

World's most advanced data center accelerator for PCIe-based servers

HPC data centers need to support the ever-growing demands of scientists and researchers while staying within a tight budget. The old approach of deploying many commodity compute nodes requires huge interconnect overhead that substantially increases costs without proportionally increasing performance. NVIDIA Tesla P100 GPU accelerators are the most advanced ever built, powered by the breakthrough NVIDIA Pascal architecture and designed to boost throughput and save money for HPC and hyperscale data centers. The newest addition to this family, Tesla P100 for PCIe, enables a single node to replace half a rack of commodity CPU nodes by delivering lightning-fast performance across a broad range of HPC applications.

MASSIVE LEAP IN PERFORMANCE

[Chart: application speedup, up to 30X, for 2X K80, 2X P100 (PCIe), and 4X P100 (PCIe) configurations on NAMD, VASP, MILC, HOOMD-Blue, AMBER, and Caffe/AlexNet. Baseline: dual-CPU server, Intel E5-2698 v3 @ 2.3 GHz, 256 GB system memory; pre-production Tesla P100.]

SPECIFICATIONS
GPU Architecture: NVIDIA Pascal
NVIDIA CUDA Cores: 3584
Double-Precision Performance: 4.7 TeraFLOPS
Single-Precision Performance: 9.3 TeraFLOPS
Half-Precision Performance: 18.7 TeraFLOPS
GPU Memory: 16 GB CoWoS HBM2 at 732 GB/s, or 12 GB CoWoS HBM2 at 549 GB/s
System Interface: PCIe Gen3
Max Power Consumption: 250 W
ECC: Yes
Thermal Solution: Passive
Form Factor: PCIe Full Height/Length
Compute APIs: CUDA, DirectCompute, OpenCL, OpenACC

TeraFLOPS measurements with NVIDIA GPU Boost technology.

Tesla P100 PCIe Data Sheet, Oct16

A GIANT LEAP IN PERFORMANCE

Tesla P100 for PCIe is reimagined from silicon to software, crafted with innovation at every level. Each groundbreaking technology delivers a dramatic jump in performance to substantially boost data center throughput.

PASCAL ARCHITECTURE: Exponential HPC and hyperscale performance
More than 18.7 TeraFLOPS of FP16, 9.3 TeraFLOPS of single-precision, and 4.7 TeraFLOPS of double-precision performance power new possibilities in deep learning and HPC workloads.

COWOS HBM2: 3X memory boost
Compute and data are integrated on the same package using Chip-on-Wafer-on-Substrate with HBM2 technology, delivering 3X the memory performance of the previous-generation architecture.

PAGE MIGRATION ENGINE: Virtually limitless memory scalability
Simpler programming and less performance tuning: with Unified Memory, applications can now scale beyond the GPU's physical memory size to virtually limitless levels.

[Charts: FP32/FP16 TeraFLOPS for K40, M40, P100 (FP32), and P100 (FP16); bi-directional bandwidth in GB/s for K40, M40, and P100; addressable memory in GB, log scale, for K40, M40, and P100.]

To learn more about the Tesla P100 for PCIe, visit www.nvidia.com/tesla

2016 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, Tesla, NVIDIA GPU Boost, CUDA, and NVIDIA Pascal are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc. All other trademarks and copyrights are the property of their respective owners. OCT16
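The three throughput figures above follow directly from the core count and clock: on GP100, each CUDA core retires one FP32 fused multiply-add (two floating-point operations) per cycle, FP16 runs at twice the FP32 rate, and FP64 at half. A back-of-the-envelope check (the ~1.30 GHz boost clock is an assumption, not stated in this datasheet):

```python
# Back-of-the-envelope peak throughput for Tesla P100 (PCIe).
# Assumption: ~1.30 GHz GPU Boost clock (not given in the datasheet above).
cuda_cores = 3584
boost_clock_hz = 1.30e9

# One FP32 FMA per core per cycle = 2 floating-point operations.
fp32_tflops = cuda_cores * boost_clock_hz * 2 / 1e12
fp64_tflops = fp32_tflops / 2   # GP100 runs FP64 at half the FP32 rate
fp16_tflops = fp32_tflops * 2   # ...and FP16 at twice the FP32 rate

print(f"FP32: {fp32_tflops:.1f} TFLOPS")  # ~9.3, matching the datasheet
print(f"FP64: {fp64_tflops:.1f} TFLOPS")  # ~4.7
print(f"FP16: {fp16_tflops:.1f} TFLOPS")  # ~18.6
```

The same accounting explains why the FP16:FP32:FP64 figures sit in an almost exact 4:2:1 ratio.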

NVIDIA TESLA P40 INFERENCING ACCELERATOR

EXPERIENCE MAXIMUM INFERENCE THROUGHPUT

In the new era of AI and intelligent machines, deep learning is shaping our world like no other computing model in history. GPUs powered by the revolutionary NVIDIA Pascal architecture provide the computational engine for the new era of artificial intelligence, enabling amazing user experiences by accelerating deep learning applications at scale.

The NVIDIA Tesla P40 is purpose-built to deliver maximum throughput for deep learning deployment. With 47 TOPS (Tera-Operations Per Second) of INT8 inference performance per GPU, a single server with eight Tesla P40s delivers the performance of over 140 CPU servers. As models increase in accuracy and complexity, CPUs are no longer capable of delivering an interactive user experience. The Tesla P40 delivers over 30X lower latency than a CPU for real-time responsiveness on even the most complex models.

[Chart: Reduce application latency by over 30X. Deep learning inference latency in milliseconds (VGG19, batch size 4): Tesla P40, 5.6 ms; Tesla M40, 24 ms; CPU, 160 ms. CPU: 22-core Intel Xeon E5-2699 v4, MKL 2017, IntelCaffe. GPU: TensorRT with FP32 on M40 and INT8 on P40, nvcaffe.]

[Chart: Achieve over 4X the inference throughput, in images per second: 8X Tesla M40, 12,100 (GoogLeNet) and 28,900 (AlexNet); 8X Tesla P40, 51,900 (GoogLeNet) and 88,800 (AlexNet). GPU: TensorRT with FP32 on M40 and INT8 on P40, nvcaffe, batch size 128.]

FEATURES
The world's fastest processor for inference workloads
47 TOPS of INT8 for maximum inference throughput and responsiveness
Hardware-decode engine capable of transcoding and inferencing 35 HD video streams in real time

SPECIFICATIONS
GPU Architecture: NVIDIA Pascal
Single-Precision Performance: 12 TeraFLOPS*
Integer Operations (INT8): 47 TOPS* (Tera-Operations per Second)
GPU Memory: 24 GB
Memory Bandwidth: 346 GB/s
System Interface: PCI Express 3.0 x16
Form Factor: 4.4" H x 10.5" L, Dual Slot, Full Height
Max Power: 250 W
Enhanced Programmability with Page Migration Engine: Yes
ECC Protection: Yes
Server-Optimized for Data Center Deployment: Yes
Hardware-Accelerated Video Engine: 1x Decode Engine, 2x Encode Engine
* With Boost Clock Enabled

Tesla P40 Data Sheet, Sep16

NVIDIA TESLA P40 ACCELERATOR FEATURES AND BENEFITS

The Tesla P40 is purpose-built to deliver maximum throughput for deep learning workloads.

140X HIGHER THROUGHPUT TO KEEP UP WITH EXPLODING DATA
The Tesla P40 is powered by the new Pascal architecture and delivers over 47 TOPS of deep learning inference performance. A single server with eight Tesla P40s can replace up to 140 CPU-only servers for deep learning workloads, resulting in substantially higher throughput with lower acquisition cost.

REAL-TIME INFERENCE
The Tesla P40 delivers up to 30X faster inference performance with INT8 operations for real-time responsiveness on even the most complex deep learning models.

SIMPLIFIED OPERATIONS WITH A SINGLE TRAINING AND INFERENCE PLATFORM
Today, deep learning models are trained on GPU servers but deployed on CPU servers for inference. The Tesla P40 offers a drastically simplified workflow, so organizations can use the same servers to iterate and deploy.

FASTER DEPLOYMENT WITH NVIDIA DEEP LEARNING SDK
TensorRT, included with the NVIDIA Deep Learning SDK, and the DeepStream SDK help customers seamlessly leverage inference capabilities like the new INT8 operations and video transcoding.

To learn more about the NVIDIA Tesla P40, visit www.nvidia.com/tesla. Sep16
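The 4:1 ratio between the P40's 47 INT8 TOPS and its 12 FP32 TFLOPS comes from Pascal's dp4a instruction, which computes a four-element INT8 dot product with 32-bit accumulation (four multiplies plus four adds, eight operations) in the issue slot where a single FP32 fused multiply-add (two operations) would run. A sketch of that conventional accounting:

```python
# Why INT8 TOPS ~ 4x FP32 TFLOPS on Pascal inference GPUs: a sketch based
# on the dp4a instruction, using the conventional per-instruction op counts.
fp32_tflops = 12.0   # Tesla P40 single-precision peak, from the datasheet

ops_per_fma = 2      # FP32 fused multiply-add: 1 multiply + 1 add
ops_per_dp4a = 8     # 4-way INT8 dot product: 4 multiplies + 4 adds

int8_tops = fp32_tflops * ops_per_dp4a / ops_per_fma
print(f"INT8 peak: {int8_tops:.0f} TOPS")  # 48, in line with the quoted 47 TOPS
```

The small gap between 48 and the quoted 47 TOPS reflects rounding of the underlying clock figures, not a different instruction mix.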

NVIDIA TESLA P4 INFERENCING ACCELERATOR

ULTRA-EFFICIENT DEEP LEARNING IN SCALE-OUT SERVERS

In the new era of AI and intelligent machines, deep learning is shaping our world like no other computing model in history. Interactive speech, visual search, and video recommendations are a few of the many AI-based services that we use every day. Accuracy and responsiveness are key to user adoption of these services. As deep learning models increase in accuracy and complexity, CPUs are no longer capable of delivering a responsive user experience.

The NVIDIA Tesla P4 is powered by the revolutionary NVIDIA Pascal architecture and purpose-built to boost efficiency for scale-out servers running deep learning workloads, enabling smart, responsive AI-based services. It slashes inference latency by 15X in any hyperscale infrastructure and provides an incredible 60X better energy efficiency than CPUs. This unlocks a new wave of AI services previously impossible due to latency limitations.

[Chart: Reduce application latency by over 15X. Deep learning inference latency in milliseconds (VGG19, batch size 4): Tesla P4, 11 ms; Tesla M4, 82 ms; CPU, 160 ms. CPU: 22-core Intel Xeon E5-2699 v4, MKL 2017, IntelCaffe. GPU: TensorRT with FP32 on M4 and INT8 on P4, nvcaffe.]

[Chart: Achieve over 60X the inference efficiency, in images per second per watt for GoogLeNet and AlexNet (batch size 128): CPU (Intel Xeon E5-2690 v4, MKL 2017, IntelCaffe) vs. Tesla M4 (TensorRT, FP32) vs. Tesla P4 (TensorRT, INT8), nvcaffe.]

[Chart: Video transcode and inference on H.264 streams, concurrent streams handled in real time: Tesla P4, 35; Tesla M4, 14; CPU, 2. Dual CPU Xeon E5-2650 v4, Ubuntu 14.04; H.264 benchmark with FFmpeg slow preset; HD = 720p at 30 frames per second.]

FEATURES
Small form-factor, 50/75-watt design fits any scale-out server
INT8 operations slash latency by 15X
Hardware-decode engine capable of transcoding and inferencing 35 HD video streams in real time

SPECIFICATIONS
GPU Architecture: NVIDIA Pascal
Single-Precision Performance: 5.5 TeraFLOPS*
Integer Operations (INT8): 22 TOPS* (Tera-Operations per Second)
GPU Memory: 8 GB
Memory Bandwidth: 192 GB/s
System Interface: Low-Profile PCI Express Form Factor
Max Power: 50 W / 75 W
Enhanced Programmability with Page Migration Engine: Yes
ECC Protection: Yes
Server-Optimized for Data Center Deployment: Yes
Hardware-Accelerated Video Engine: 1x Decode Engine, 2x Encode Engine
* With Boost Clock Enabled

NVIDIA Tesla P4 Data Sheet, Sep16

NVIDIA TESLA P4 ACCELERATOR FEATURES AND BENEFITS

The Tesla P4 is engineered to deliver real-time inference performance and enable smart user experiences in scale-out servers.

RESPONSIVE EXPERIENCE WITH REAL-TIME INFERENCE
Responsiveness is key to user engagement for services such as interactive speech, visual search, and video recommendations. As models increase in accuracy and complexity, CPUs are no longer capable of delivering a responsive user experience. The Tesla P4 delivers 22 TOPS of inference performance with INT8 operations to slash latency by 15X.

UNPRECEDENTED EFFICIENCY FOR LOW-POWER SCALE-OUT SERVERS
The Tesla P4's small form factor and 50 W/75 W power footprint accelerate density-optimized, scale-out servers. It also provides an incredible 60X better energy efficiency than CPUs for deep learning inference workloads, letting hyperscale customers meet the exponential growth in demand for AI applications.

UNLOCK NEW AI-BASED VIDEO SERVICES WITH A DEDICATED DECODE ENGINE
The Tesla P4 can transcode and infer up to 35 HD video streams in real time, powered by a dedicated hardware-accelerated decode engine that works in parallel with the GPU doing inference. By integrating deep learning into the video pipeline, customers can offer smart, innovative video services that were previously impossible.

FASTER DEPLOYMENT WITH TensorRT AND DEEPSTREAM SDK
TensorRT is a library created for optimizing deep learning models for production deployment. It takes trained neural nets, usually in 32-bit or 16-bit data, and optimizes them for reduced-precision INT8 operations. The NVIDIA DeepStream SDK taps into the power of Pascal GPUs to simultaneously decode and analyze video streams.

To learn more about the NVIDIA Tesla P4, visit www.nvidia.com/tesla. Sep16
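The reduced-precision conversion described above can be illustrated with the simplest scheme, symmetric linear quantization: choose a scale so the tensor's largest magnitude maps to 127, round to int8, and carry the scale along to interpret the integer results. This is a minimal sketch of the general technique, not TensorRT's actual calibration algorithm, which picks saturation thresholds more carefully; the weights and activations here are synthetic.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization: map [-max|x|, +max|x|] onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Synthetic FP32 weights and activations standing in for a trained layer.
rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
a = rng.standard_normal(256).astype(np.float32)

qw, sw = quantize_int8(w)
qa, sa = quantize_int8(a)

# INT8 dot product accumulated in int32 (conceptually what dp4a does in
# hardware), then rescaled back to floating point via the two scales.
y_int8 = int(qw.astype(np.int32) @ qa.astype(np.int32)) * sw * sa
y_fp32 = float(w @ a)
print(f"fp32={y_fp32:.3f}  int8={y_int8:.3f}")  # close, small quantization error
```

The payoff is that the inner loop runs entirely on 8-bit integers, which is what lets the INT8 path reach roughly 4X the FP32 operation rate on Pascal.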