April 4-7, 2016 Silicon Valley INSIDE PASCAL. Mark Harris, October 27,

1 April 4-7, 2016 Silicon Valley INSIDE PASCAL Mark Harris, October 27,

2 INTRODUCING TESLA P100
New GPU architecture to enable the world's fastest compute node
Pascal Architecture: highest compute performance
NVLink: GPU interconnect for maximum scalability
HBM2 Stacked Memory: unifying compute and memory in a single package
Page Migration Engine: simple parallel programming with 512 TB of virtual memory
(Diagram labels: CPU to CPU, PCIe Switch, CPU, Tesla P100, Unified Memory.)

3 GIANT LEAPS IN EVERYTHING
3x compute, 5x GPU-GPU bandwidth, 3x GPU memory bandwidth
(Charts: teraflops (FP32/FP16) and bandwidth (GB/s) for K40, M40, and P100.)

4 TESLA P100 PERFORMANCE DELIVERED
NVLink for max scalability: over 45x faster with 8x P100
(Chart: speed-up vs. a dual-socket Broadwell CPU for Caffe/AlexNet, VASP, HOOMD-Blue, COSMO, MILC, Amber, and HACC, comparing 2x K80 (M40 for AlexNet), 2x P100, 4x P100, and 8x P100.)

5 TESLA PASCAL ACCELERATORS
Products: Tesla P100 (SXM2+NVLink or PCIe), Tesla P40, Tesla P4
Compute: P100 5.3 TF DP, 10.6 TF SP, 21.2 TF HP; P40 11.8 TF SP, 47 TOP/s INT8; P4 5.4 TF SP, 21.8 TOP/s INT8
Memory: P100 HBM2, 720 GB/s, 16 GB; P40 GDDR5, 346 GB/s, 24 GB; P4 GDDR5, 192 GB/s, 8 GB
Interconnect: P100 NVLink + PCIe Gen3; P40 PCIe Gen3; P4 PCIe Gen3
Programmability: Page Migration Engine and Unified Memory on all three
Power: P100 300 W (SXM2) or 250 W (PCIe); P40 250 W; P4 50 W and 75 W
(Compute numbers based on boost clocks; P100 compute numbers given for the SXM2 version.)

6 PASCAL ARCHITECTURE

7 TESLA P100 GPU: GP100
56 SMs
3584 CUDA cores
5.3 TF double precision
10.6 TF single precision
21.2 TF half precision
16 GB HBM2
720 GB/s bandwidth
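The three peak figures on this slide are consistent with a simple "cores x 2 x clock" estimate. The sketch below works through that arithmetic in host-side C++. The 1.48 GHz boost clock and the one-fused-multiply-add-per-core-per-clock model are assumptions that do not appear in the deck, and the function names are illustrative only.

```cpp
// Peak-throughput estimate for Tesla P100 from the figures on the slides.
// Assumptions (not stated in the deck): a boost clock of about 1.48 GHz and
// one fused multiply-add (2 floating-point operations) per core per clock.
constexpr int kCudaCores = 3584;           // from the slide
constexpr int kCoresPerSM = 64;            // from the GP100 SM slide
constexpr double kAssumedBoostGHz = 1.48;  // assumption

// Number of SMs implied by the core counts.
constexpr int sm_count() { return kCudaCores / kCoresPerSM; }

// relative_issue_rate is the issue rate relative to FP32:
// 1.0 for FP32, 0.5 for FP64 (1 every 2 clocks), 2.0 for FP16 (pair every clock).
constexpr double peak_tflops(double relative_issue_rate) {
    return kCudaCores * 2.0 * kAssumedBoostGHz * relative_issue_rate / 1000.0;
}
```

Under these assumptions, peak_tflops(1.0), peak_tflops(0.5), and peak_tflops(2.0) land close to the 10.6, 5.3, and 21.2 TF quoted for single, double, and half precision.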

8 GP100 SM
CUDA cores: 64
Register file: 256 KB
Shared memory: 64 KB
Active threads: 2048
Active blocks: 32
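These per-SM limits determine how many threads and blocks can be resident at once. The following host-side C++ sketch turns the slide's numbers into two simple bounds. It assumes 32-bit registers and ignores the allocation granularity real hardware applies, so it is an upper-bound estimate rather than an occupancy calculator; all names are illustrative.

```cpp
// Per-SM resource limits for GP100, taken from the slide.
constexpr int kRegisterFileBytes = 256 * 1024;
constexpr int kSharedMemBytes = 64 * 1024;
constexpr int kMaxThreadsPerSM = 2048;
constexpr int kMaxBlocksPerSM = 32;
constexpr int kBytesPerRegister = 4;  // assumption: 32-bit registers

constexpr int min_int(int a, int b) { return a < b ? a : b; }

constexpr int registers_per_sm() { return kRegisterFileBytes / kBytesPerRegister; }

// Upper bound on resident threads for a kernel using regs_per_thread registers
// (regs_per_thread must be positive; allocation granularity is ignored).
constexpr int threads_limited_by_registers(int regs_per_thread) {
    return min_int(kMaxThreadsPerSM, registers_per_sm() / regs_per_thread);
}

// Upper bound on resident blocks for a kernel using bytes_per_block of shared memory.
constexpr int blocks_limited_by_shared_memory(int bytes_per_block) {
    return bytes_per_block == 0
               ? kMaxBlocksPerSM
               : min_int(kMaxBlocksPerSM, kSharedMemBytes / bytes_per_block);
}
```

With these numbers, a kernel can use up to 32 registers per thread and 2 KB of shared memory per block before either resource, rather than the thread or block limit, caps residency.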

9 P100 SM
More resources per core than the Maxwell SM:
2x registers
1.33x shared memory capacity
2x shared memory bandwidth
2x warps
Higher instruction throughput
(Diagram: P100 SM and Maxwell SM blocks showing cores, FP64 units, LD/ST units, SFUs, warps, registers, and shared memory.)

10 IEEE 754 FLOATING POINT ON GP100
3 sizes, 3 speeds, all fast
Feature            Half precision     Single precision   Double precision
Layout             s5.10              s8.23              s11.52
Issue rate         pair every clock   1 every clock      1 every 2 clocks
Subnormal support  Yes                Yes                Yes
Atomic addition    Yes                Yes                Yes

11 HIGH THROUGHPUT FOR SMALL INTEGERS
Tesla P4 and P40: 8-bit integer vector operations
4x 8-bit vector dot product with 32-bit accumulate
CUDA intrinsics: int __dp4a(char4 a, char4 b, int c)
(Diagram: int8 vectors A and B and an int32 value C produce an int32 result D.)
Powerful for deep learning inference (Tesla P4 results): 40x efficiency vs CPU, 8x efficiency vs FPGA
(Chart: images/sec/watt on AlexNet for CPU, FPGA, 1x M4 (FP32), and 1x P4 (INT8).)
CUDA library support: cuBLAS GemmEx(), cuDNN v6 Int8 inference convolutions, TensorRT
Footnote: AlexNet, batch size = 128, CPU: Intel E5-2690v4 using Intel MKL 2017, FPGA is Arria x M4/P4 in node, P4 board power at 56W, P4 GPU power at 36W, M4 board power at 57W, M4 GPU power at 39W, Perf/W chart using GPU power
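The dot-product-with-accumulate operation on this slide can be emulated on the host to check expected results. This is a reference sketch in plain C++, not the hardware instruction. Char4 and dp4a_reference are hypothetical names standing in for CUDA's char4 type and the device intrinsic.

```cpp
#include <cstdint>

// Hypothetical stand-in for CUDA's char4: four signed 8-bit lanes.
struct Char4 {
    std::int8_t x, y, z, w;
};

// Reference semantics of a 4-way int8 dot product with 32-bit accumulate:
// D = C + a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w
// Each product is computed in int, so the 8-bit inputs cannot overflow it.
constexpr int dp4a_reference(Char4 a, Char4 b, int c) {
    return c + a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}
```

For example, dp4a_reference(Char4{1, 2, 3, 4}, Char4{5, 6, 7, 8}, 10) evaluates 10 + 5 + 12 + 21 + 32 = 80.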

12 NVLink

13 NVLINK
P100 supports 4 NVLinks
Up to 94% bandwidth efficiency
Supports reads/writes/atomics to peer GPU
Supports read/write access to NVLink-enabled CPU
Links can be ganged for higher bandwidth
(Diagram: NVLink on Tesla P100, four links at 40 GB/s each.)

14 NVLINK - GPU CLUSTER
Two fully connected quads, connected at corners
160 GB/s per GPU bidirectional to peers
Load/store access to peer memory
Full atomics to peer GPUs
High-speed copy engines for bulk data copy
PCIe to/from CPU

15 NVLINK TO CPU
IBM Power Systems Server S822LC (codename "Minsky")
2x IBM Power8+ CPUs and 4x P100 GPUs
80 GB/s per GPU bidirectional for peer traffic
80 GB/s per GPU bidirectional to CPU
115 GB/s CPU memory bandwidth
Direct load/store access to CPU memory
High-speed copy engines for bulk data movement
(Diagram labels: DDR4 at 115 GB/s, IB, P8+ CPU, P100.)
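The per-GPU bandwidth figures on the NVLink slides all follow from ganging the four 40 GB/s links in different ways. A minimal C++ sketch of that bookkeeping follows; the LinkSplit type and function names are illustrative and not part of any NVIDIA API.

```cpp
// NVLink figures from the slides: 4 links per P100, 40 GB/s bidirectional each.
constexpr int kLinksPerGPU = 4;
constexpr int kGBsPerLink = 40;

// How one GPU's links are divided between peer GPUs and an NVLink-enabled CPU.
struct LinkSplit {
    int peer_links;
    int cpu_links;
};

constexpr bool is_valid(LinkSplit s) {
    return s.peer_links >= 0 && s.cpu_links >= 0 &&
           s.peer_links + s.cpu_links <= kLinksPerGPU;
}

// Aggregate bidirectional bandwidth in GB/s for each kind of traffic.
constexpr int peer_gbs(LinkSplit s) { return s.peer_links * kGBsPerLink; }
constexpr int cpu_gbs(LinkSplit s) { return s.cpu_links * kGBsPerLink; }
```

A split of {4, 0} gives the 160 GB/s peer figure of the GPU cluster slide, {2, 2} gives the 80 GB/s + 80 GB/s of the Power8+ system, and {3, 1} gives the 120 GB/s + 40 GB/s of the fully connected quad in the backup slides.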

16 UNIFIED MEMORY

17 PAGE MIGRATION ENGINE
Supports virtual memory demand paging
49-bit virtual addresses: sufficient to cover 48-bit CPU address + all GPU memory
GPU page faulting capability: can handle thousands of simultaneous page faults
Up to 2 MB page size: better TLB coverage of GPU memory
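The address-space and page-size claims here reduce to powers of two. The C++ sketch below checks that 49 bits correspond to the 512 TB of virtual memory quoted earlier in the deck, and compares how many page mappings cover 16 GB at 2 MB versus 4 KB pages. The 4 KB comparison page size is an assumption chosen for contrast, and the helper names are illustrative.

```cpp
#include <cstdint>

constexpr std::uint64_t kKiB = std::uint64_t{1} << 10;
constexpr std::uint64_t kMiB = std::uint64_t{1} << 20;
constexpr std::uint64_t kGiB = std::uint64_t{1} << 30;
constexpr std::uint64_t kTiB = std::uint64_t{1} << 40;

// Size in TiB of an address space with the given number of address bits
// (valid for 40 <= bits < 64).
constexpr std::uint64_t address_space_tib(int bits) {
    return (std::uint64_t{1} << bits) / kTiB;
}

// Number of pages (and so page mappings) needed to cover `bytes` of memory.
constexpr std::uint64_t pages_to_cover(std::uint64_t bytes, std::uint64_t page_bytes) {
    return (bytes + page_bytes - 1) / page_bytes;
}
```

A 49-bit space is 512 TiB, twice the 256 TiB of a 48-bit CPU address space, which leaves room for all GPU memory. Covering 16 GB takes 8192 mappings with 2 MB pages but 4194304 with 4 KB pages, which is the TLB coverage benefit the slide refers to.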

18 UNIFIED MEMORY
Dramatically lower developer effort (CUDA 6+, Kepler GPU)
Simpler programming and memory model: single allocation, single pointer, accessible anywhere; eliminates the need for explicit copies; greatly simplifies code porting
Performance through data locality: migrate data to the accessing processor; guarantee global coherence; still allows explicit hand tuning
Allocate up to GPU memory size
(Diagram labels: Kepler GPU, CPU, Unified Memory.)

19 CUDA 8: UNIFIED MEMORY
Large datasets, simple programming, high performance (CUDA 8, Pascal GPU)
Enable large data models: oversubscribe GPU memory; allocate up to system memory size
Simpler data access: CPU/GPU data coherence; Unified Memory atomic operations
Tune Unified Memory performance: usage hints via the cudaMemAdvise API; explicit prefetching API
Allocate beyond GPU memory size
(Diagram labels: Pascal GPU, CPU, Unified Memory.)

20 OUT-OF-CORE AMR COMPUTATIONS WITH UNIFIED MEMORY ON P100
(Chart: application throughput (MDOF/s, millions) vs. application working set (GB) for P100 (x86 PCI-E), P100 + user hints (x86 PCI-E), P100 (P8 NVLINK), and P100 + user hints (P8 NVLINK). Annotations mark the P100 memory size (16 GB) and the regions where all 5 levels fit in GPU memory, only 2 levels fit, and only 1 level fits.)
x86 CPU: Intel E v3, 2 sockets of 10 cores each with HT on (40 threads)

21 INTRODUCING TESLA P100
New GPU architecture to enable the world's fastest compute node
Pascal Architecture: highest compute performance
NVLink: GPU interconnect for maximum scalability
HBM2 Stacked Memory: unifying compute and memory in a single package
Page Migration Engine: simple parallel programming with 512 TB of virtual memory
More P100 features: compute preemption, new instructions, larger L2 cache, and more

22 BACKUP

23 END-TO-END PRODUCT FAMILY
Hyperscale HPC: deep learning training and inference
Strong-scale HPC: HPC and DL data centers scaling to multiple GPUs
Mixed-apps HPC: HPC data centers with a mix of CPU and GPU workloads
Fully integrated DL supercomputer: fully integrated deep learning solution
Products shown: Tesla P100 (training), Tesla P40 and P4 (inference), Tesla P100 with NVLink, Tesla P100 with PCI-E, Tesla K80, DGX-1

24 HALF-PRECISION FLOATING POINT (FP16)
16 bits: 1 sign bit, 5 exponent bits, 10 fraction bits
2^40 dynamic range
Normalized values: 1024 values for each power of 2, from 2^-14 to 2^15
Subnormals at full speed: 1024 values from 2^-24 to 2^-15
Special values: +/- infinity, not-a-number
Use cases: deep learning training, radio astronomy, sensor data, image processing
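The sign, exponent, and fraction layout described here can be decoded by hand. The sketch below is a host-side C++ decoder for finite FP16 bit patterns, written for illustration. It uses the IEEE 754 exponent bias of 15, which the slide does not state explicitly, and the function names are hypothetical.

```cpp
#include <cstdint>

// 2^e for small integer exponents, computed exactly by repeated scaling.
constexpr double pow2(int e) {
    double r = 1.0;
    for (; e > 0; --e) r *= 2.0;
    for (; e < 0; ++e) r /= 2.0;
    return r;
}

// True when the exponent field is all ones (infinity or not-a-number).
constexpr bool is_inf_or_nan(std::uint16_t bits) {
    return ((bits >> 10) & 0x1F) == 0x1F;
}

// Decode a finite half-precision bit pattern (1 sign, 5 exponent, 10 fraction bits).
// Precondition: !is_inf_or_nan(bits).
constexpr double half_bits_to_double(std::uint16_t bits) {
    const int sign = (bits >> 15) & 0x1;
    const int exponent = (bits >> 10) & 0x1F;
    const int fraction = bits & 0x3FF;
    const double magnitude =
        exponent == 0
            ? fraction * pow2(-24)                                 // subnormal
            : (1.0 + fraction / 1024.0) * pow2(exponent - 15);     // normalized
    return sign ? -magnitude : magnitude;
}
```

The pattern 0x0001 decodes to 2^-24 (smallest subnormal), 0x0400 to 2^-14 (smallest normalized value), and 0x7BFF to 65504 (largest finite value), matching the ranges listed on the slide.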

25 RADIO ASTRONOMY AND PASCAL
Radio telescopes:
Often remote: power efficiency crucial
Low-precision data: FP32 not needed
GPUs (FP32) already in use
Pascal INT8 arithmetic: x higher efficiency than FP
(Chart: cross-correlation benchmark in GOPS/watt for M2090 Fermi (fp32), K40 Kepler (fp32), M40 Maxwell (fp32), P40 Pascal (fp32), P40 Pascal (int8), and P40 Pascal (int8 + clock capping).)

26 HBM2 STACKED MEMORY

27 HBM2: 720 GB/SEC BANDWIDTH
And ECC is free
(Diagram labels: spacer, 4-high HBM2 stack, bumps, GPU, silicon carrier, substrate.)

28 UNIFIED MEMORY EXAMPLE
On-demand paging

    __global__ void setValue(char *ptr, int index, char val) {
        ptr[index] = val;
    }

    void foo(int size) {
        char *data;
        cudaMallocManaged(&data, size);        // Unified Memory allocation
        memset(data, 0, size);                 // Access all values on CPU
        setValue<<<...>>>(data, size/2, 5);    // Access one value on GPU
        cudaDeviceSynchronize();
        useData(data);
        cudaFree(data);
    }

29 HOW UNIFIED MEMORY WORKS IN CUDA 6
Servicing CPU page faults
GPU code:

    __global__ void setValue(char *ptr, int index, char val) {
        ptr[index] = val;
    }

CPU code:

    cudaMallocManaged(&array, size);
    memset(array, 0, size);
    setValue<<<...>>>(array, size/2, 5);

(Diagram labels: GPU memory mapping, CPU memory mapping, array, page fault, interconnect.)

30 HOW UNIFIED MEMORY WORKS ON PASCAL
Servicing CPU and GPU page faults
GPU code:

    __global__ void setValue(char *ptr, int index, char val) {
        ptr[index] = val;
    }

CPU code:

    cudaMallocManaged(&array, size);
    memset(array, 0, size);
    setValue<<<...>>>(array, size/2, 5);

(Diagram labels: GPU memory mapping with page fault, CPU memory mapping with page fault, array, interconnect.)

31 UNIFIED MEMORY ON PASCAL
GPU memory oversubscription

    void foo() {
        // Assume GPU has 16 GB memory
        // Allocate 32 GB
        char *data;
        size_t size = 32ULL*1024*1024*1024;   // 64-bit arithmetic avoids int overflow
        cudaMallocManaged(&data, size);       // 32 GB allocation
    }

Pascal supports allocations where only a subset of pages reside on the GPU. Pages can be migrated to the GPU when hot.
Fails on Kepler/Maxwell.
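Two details of this example are easy to check on the host. First, a 32 GB size has to be computed in 64-bit arithmetic, because 32*1024*1024*1024 overflows a 32-bit int. Second, with a 16 GB GPU only part of the allocation can be resident at once. The C++ sketch below illustrates both. It assumes, for simplicity, that residency is tracked in 2 MB pages and that the whole GPU memory is available to this one allocation; the helper names are illustrative.

```cpp
#include <cstdint>

// Size in bytes of n GiB, computed in 64-bit arithmetic.
constexpr std::uint64_t gib(std::uint64_t n) { return n << 30; }

constexpr std::uint64_t kPageBytes = std::uint64_t{2} << 20;  // assumed 2 MB pages

// Total pages backing an allocation, rounding up to whole pages.
constexpr std::uint64_t total_pages(std::uint64_t alloc_bytes) {
    return (alloc_bytes + kPageBytes - 1) / kPageBytes;
}

// Upper bound on how many of those pages can reside on the GPU at once.
constexpr std::uint64_t max_resident_pages(std::uint64_t alloc_bytes,
                                           std::uint64_t gpu_bytes) {
    return total_pages(alloc_bytes) < gpu_bytes / kPageBytes
               ? total_pages(alloc_bytes)
               : gpu_bytes / kPageBytes;
}
```

For the slide's case, total_pages(gib(32)) is 16384 and max_resident_pages(gib(32), gib(16)) is 8192, so at least half of the pages must live in system memory at any moment and migrate on demand.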

32 GPU OVERSUBSCRIPTION
Now possible with Pascal
Many domains can benefit from GPU memory oversubscription:
Combustion: many species to solve for
Quantum chemistry: larger systems
Ray tracing: larger scenes to render
Data / graph analysis: large graphs, data sets

33 GPU PERFORMANCE COMPARISON
Columns: P100, P40, M40, K40
Rows: Double Precision TFlop/s, Single Precision TFlop/s, Half Precision TFlop/s, Memory Bandwidth (GB/s), Memory Size
Two entries are marked NA.
Memory size: 16 GB (P100), 24 GB (P40), 12 GB or 24 GB (M40), 12 GB (K40)

34 NVLINK TO CPU
Fully connected quad
120 GB/s per GPU bidirectional for peer traffic
40 GB/s per GPU bidirectional to CPU
Direct load/store access to CPU memory
High-speed copy engines for bulk data movement

35 SIMPLIFIED MEMORY MANAGEMENT CODE
CPU code:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);
        fread(data, 1, N, fp);
        qsort(data, N, 1, compare);
        use_data(data);
        free(data);
    }

CUDA 6 code with Unified Memory:

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);
        fread(data, 1, N, fp);
        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();
        use_data(data);
        cudaFree(data);
    }

36 GREAT PERFORMANCE WITH UNIFIED MEMORY
RAJA: portable C++ framework for parallel-for style programming
RAJA uses Unified Memory for heterogeneous array allocations
Parallel forall loops run on device
"Excellent performance considering this is a 'generic' version of LULESH with no architecture-specific tuning." -Jeff Keasler, LLNL
(Chart: LULESH throughput in million elements per second, CPU (10-core Haswell) vs. GPU (Tesla K40), with GPU speedups of 1.5x, 1.9x, and 2.0x at mesh sizes 45^3, 100^3, and 150^3.)
GPU: NVIDIA Tesla K40, CPU: Intel Haswell E GHz, single socket 10-core

37 UNIFIED MEMORY ON PASCAL
Concurrent CPU/GPU access to managed memory

    __global__ void mykernel(char *data) {
        data[1] = 'g';
    }

    void foo() {
        char *data;
        cudaMallocManaged(&data, 2);
        mykernel<<<...>>>(data);
        // no synchronize here
        data[0] = 'c';
        cudaFree(data);
    }

OK on Pascal: just a page fault.
Concurrent CPU access to data on previous GPUs caused a fatal segmentation fault.

38 UNIFIED MEMORY ON PASCAL
System-wide atomics

    __global__ void mykernel(int *addr) {
        atomicAdd_system(addr, 10);
    }

    void foo() {
        int *addr;
        cudaMallocManaged(&addr, 4);
        *addr = 0;
        mykernel<<<...>>>(addr);
        __sync_fetch_and_add(addr, 10);
    }

Pascal enables system-wide atomics:
Direct support of atomics over NVLink
Software-assisted over PCIe
System-wide atomics are not available on Kepler / Maxwell.

39 PERFORMANCE TUNING ON PASCAL
Explicit memory hints and prefetching
Advise the runtime on known memory access behaviors with cudaMemAdvise():
cudaMemAdviseSetReadMostly: specify read duplication
cudaMemAdviseSetPreferredLocation: suggest best location
cudaMemAdviseSetAccessedBy: initialize a mapping
Explicit prefetching with cudaMemPrefetchAsync(ptr, length, destDevice, stream):
Unified Memory alternative to cudaMemcpyAsync
Asynchronous operation that follows CUDA stream semantics

40 USE CASE: ON-DEMAND PAGING
Graph algorithms on a large data set
(Chart: performance over GPU directly accessing host memory (zero-copy). Baseline: migrate on first touch. Optimized: best placement in memory.)

LECTURE ON PASCAL GPU ARCHITECTURE. Jiri Kraus, November 14 th 2016

LECTURE ON PASCAL GPU ARCHITECTURE. Jiri Kraus, November 14 th 2016 LECTURE ON PASCAL GPU ARCHITECTURE Jiri Kraus, November 14 th 2016 ACCELERATED COMPUTING CPU Optimized for Serial Tasks GPU Accelerator Optimized for Parallel Tasks 2 ACCELERATED COMPUTING CPU Optimized

More information

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve Larger Problems** Critical Path Analysis * HOOMD Blue v1.3.3 Lennard-Jones liquid

More information

RECENT TRENDS IN GPU ARCHITECTURES. Perspectives of GPU computing in Science, 26 th Sept 2016

RECENT TRENDS IN GPU ARCHITECTURES. Perspectives of GPU computing in Science, 26 th Sept 2016 RECENT TRENDS IN GPU ARCHITECTURES Perspectives of GPU computing in Science, 26 th Sept 2016 NVIDIA THE AI COMPUTING COMPANY GPU Computing Computer Graphics Artificial Intelligence 2 NVIDIA POWERS WORLD

More information

NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY. Peter Messmer

NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY. Peter Messmer NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY Peter Messmer pmessmer@nvidia.com COMPUTATIONAL CHALLENGES IN HEP Low-Level Trigger High-Level Trigger Monte Carlo Analysis Lattice QCD 2 COMPUTATIONAL

More information

World s most advanced data center accelerator for PCIe-based servers

World s most advanced data center accelerator for PCIe-based servers NVIDIA TESLA P100 GPU ACCELERATOR World s most advanced data center accelerator for PCIe-based servers HPC data centers need to support the ever-growing demands of scientists and researchers while staying

More information

Unified memory. GPGPU 2015: High Performance Computing with CUDA University of Cape Town (South Africa), April, 20th-24th, 2015

Unified memory. GPGPU 2015: High Performance Computing with CUDA University of Cape Town (South Africa), April, 20th-24th, 2015 Unified memory GPGPU 2015: High Performance Computing with CUDA University of Cape Town (South Africa), April, 20th-24th, 2015 Manuel Ujaldón Associate Professor @ Univ. of Malaga (Spain) Conjoint Senior

More information

Unified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association

Unified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association Unified Memory Notes on GPU Data Transfers Andreas Herten, Forschungszentrum Jülich, 24 April 2017 Handout Version Overview, Outline Overview Unified Memory enables easy access to GPU development But some

More information

NVIDIA TESLA V100 GPU ARCHITECTURE THE WORLD S MOST ADVANCED DATA CENTER GPU

NVIDIA TESLA V100 GPU ARCHITECTURE THE WORLD S MOST ADVANCED DATA CENTER GPU NVIDIA TESLA V100 GPU ARCHITECTURE THE WORLD S MOST ADVANCED DATA CENTER GPU WP-08608-001_v1.1 August 2017 WP-08608-001_v1.1 TABLE OF CONTENTS Introduction to the NVIDIA Tesla V100 GPU Architecture...

More information

May 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND. Mark Harris, May 10, 2017

May 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND. Mark Harris, May 10, 2017 May 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND Mark Harris, May 10, 2017 INTRODUCING CUDA 9 BUILT FOR VOLTA FASTER LIBRARIES Tesla V100 New GPU Architecture Tensor Cores NVLink Independent Thread Scheduling

More information

MACHINE LEARNING WITH NVIDIA AND IBM POWER AI

MACHINE LEARNING WITH NVIDIA AND IBM POWER AI MACHINE LEARNING WITH NVIDIA AND IBM POWER AI July 2017 Joerg Krall Sr. Business Ddevelopment Manager MFG EMEA jkrall@nvidia.com A NEW ERA OF COMPUTING AI & IOT Deep Learning, GPU 100s of billions of devices

More information

NVIDIA Tesla P100. Whitepaper. The Most Advanced Datacenter Accelerator Ever Built. Featuring Pascal GP100, the World s Fastest GPU

NVIDIA Tesla P100. Whitepaper. The Most Advanced Datacenter Accelerator Ever Built. Featuring Pascal GP100, the World s Fastest GPU Whitepaper NVIDIA Tesla P100 The Most Advanced Datacenter Accelerator Ever Built Featuring Pascal GP100, the World s Fastest GPU NVIDIA Tesla P100 WP-08019-001_v01.2 1 Table of Contents Introduction...

More information

S8688 : INSIDE DGX-2. Glenn Dearth, Vyas Venkataraman Mar 28, 2018

S8688 : INSIDE DGX-2. Glenn Dearth, Vyas Venkataraman Mar 28, 2018 S8688 : INSIDE DGX-2 Glenn Dearth, Vyas Venkataraman Mar 28, 2018 Why was DGX-2 created Agenda DGX-2 internal architecture Software programming model Simple application Results 2 DEEP LEARNING TRENDS Application

More information

HPC with the NVIDIA Accelerated Computing Toolkit Mark Harris, November 16, 2015

HPC with the NVIDIA Accelerated Computing Toolkit Mark Harris, November 16, 2015 HPC with the NVIDIA Accelerated Computing Toolkit Mark Harris, November 16, 2015 Accelerators Surge in World s Top Supercomputers 125 100 75 Top500: # of Accelerated Supercomputers 100+ accelerated systems

More information

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017 VOLTA: PROGRAMMABILITY AND PERFORMANCE Jack Choquette NVIDIA Hot Chips 2017 1 TESLA V100 21B transistors 815 mm 2 80 SM 5120 CUDA Cores 640 Tensor Cores 16 GB HBM2 900 GB/s HBM2 300 GB/s NVLink *full GV100

More information

May 8-11, 2017 Silicon Valley. CUDA 9 AND BEYOND Mark Harris, May 10, 2017

May 8-11, 2017 Silicon Valley. CUDA 9 AND BEYOND Mark Harris, May 10, 2017 May 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND Mark Harris, May 10, 2017 INTRODUCING CUDA 9 BUILT FOR VOLTA FASTER LIBRARIES Tesla V100 New GPU Architecture Tensor Cores NVLink Independent Thread Scheduling

More information

TESLA V100 PERFORMANCE GUIDE. Life Sciences Applications

TESLA V100 PERFORMANCE GUIDE. Life Sciences Applications TESLA V100 PERFORMANCE GUIDE Life Sciences Applications NOVEMBER 2017 TESLA V100 PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important

More information

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018 S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific

More information

IBM Deep Learning Solutions

IBM Deep Learning Solutions IBM Deep Learning Solutions Reference Architecture for Deep Learning on POWER8, P100, and NVLink October, 2016 How do you teach a computer to Perceive? 2 Deep Learning: teaching Siri to recognize a bicycle

More information

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture and CUDA 10 New Features Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture New SM Architecture Multi-Precision Tensor Core RT Core Turing MPS Inference Accelerated,

More information

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS

More information

PERFORMANCE ANALYSIS AND DEBUGGING FOR VOLTA. Felix Schmitt 11 th Parallel Tools Workshop September 11-12, 2017

PERFORMANCE ANALYSIS AND DEBUGGING FOR VOLTA. Felix Schmitt 11 th Parallel Tools Workshop September 11-12, 2017 PERFORMANCE ANALYSIS AND DEBUGGING FOR VOLTA Felix Schmitt 11 th Parallel Tools Workshop September 11-12, 2017 INTRODUCING TESLA V100 Volta Architecture Improved NVLink & HBM2 Volta MPS Improved SIMT Model

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

The Visual Computing Company

The Visual Computing Company The Visual Computing Company Update NVIDIA GPU Ecosystem Axel Koehler, Senior Solutions Architect HPC, NVIDIA Outline Tesla K40 and GPU Boost Jetson TK-1 Development Board for Embedded HPC Pascal GPU 3D

More information

ENDURING DIFFERENTIATION. Timothy Lanfear

ENDURING DIFFERENTIATION. Timothy Lanfear ENDURING DIFFERENTIATION Timothy Lanfear WHERE ARE WE? 2 LIFE AFTER DENNARD SCALING 10 7 40 Years of Microprocessor Trend Data 10 6 10 5 10 4 Transistors (thousands) 1.1X per year 10 3 10 2 Single-threaded

More information

ENDURING DIFFERENTIATION Timothy Lanfear

ENDURING DIFFERENTIATION Timothy Lanfear ENDURING DIFFERENTIATION Timothy Lanfear WHERE ARE WE? 2 LIFE AFTER DENNARD SCALING GPU-ACCELERATED PERFORMANCE 10 7 40 Years of Microprocessor Trend Data 10 6 10 5 10 4 10 3 10 2 Single-threaded perf

More information

Preparing GPU-Accelerated Applications for the Summit Supercomputer

Preparing GPU-Accelerated Applications for the Summit Supercomputer Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership

More information

GPUS FOR NGVLA. M Clark, April 2015

GPUS FOR NGVLA. M Clark, April 2015 S FOR NGVLA M Clark, April 2015 GAMING DESIGN ENTERPRISE VIRTUALIZATION HPC & CLOUD SERVICE PROVIDERS AUTONOMOUS MACHINES PC DATA CENTER MOBILE The World Leader in Visual Computing 2 What is a? Tesla K40

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

THE FUTURE OF GPU DATA MANAGEMENT. Michael Wolfe, May 9, 2017

THE FUTURE OF GPU DATA MANAGEMENT. Michael Wolfe, May 9, 2017 THE FUTURE OF GPU DATA MANAGEMENT Michael Wolfe, May 9, 2017 CPU CACHE Hardware managed What data to cache? Where to store the cached data? What data to evict when the cache fills up? When to store data

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

CafeGPI. Single-Sided Communication for Scalable Deep Learning

CafeGPI. Single-Sided Communication for Scalable Deep Learning CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS

NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS 1 Unified Memory CUDA 6 2 3 XT and Drop-in Libraries GPUDirect RDMA in MPI 4 Developer Tools 1 Unified Memory CUDA 6 2 3 XT and Drop-in Libraries

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

TESLA P100 PERFORMANCE GUIDE. HPC and Deep Learning Applications

TESLA P100 PERFORMANCE GUIDE. HPC and Deep Learning Applications TESLA P PERFORMANCE GUIDE HPC and Deep Learning Applications MAY 217 TESLA P PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Towards Automatic Heterogeneous Computing Performance Analysis. Carl Pearson Adviser: Wen-Mei Hwu

Towards Automatic Heterogeneous Computing Performance Analysis. Carl Pearson Adviser: Wen-Mei Hwu Towards Automatic Heterogeneous Computing Performance Analysis Carl Pearson pearson@illinois.edu Adviser: Wen-Mei Hwu 2018 03 30 1 Outline High Performance Computing Challenges Vision CUDA Allocation and

More information

NVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas

NVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas NVidia s GPU Microarchitectures By Stephen Lucas and Gerald Kotas Intro Discussion Points - Difference between CPU and GPU - Use s of GPUS - Brie f History - Te sla Archite cture - Fermi Architecture -

More information

Accelerating High Performance Computing.

Accelerating High Performance Computing. Accelerating High Performance Computing http://www.nvidia.com/tesla Computing The 3 rd Pillar of Science Drug Design Molecular Dynamics Seismic Imaging Reverse Time Migration Automotive Design Computational

More information

IBM Power AC922 Server

IBM Power AC922 Server IBM Power AC922 Server The Best Server for Enterprise AI Highlights More accuracy - GPUs access system RAM for larger models Faster insights - significant deep learning speedups Rapid deployment - integrated

More information

IBM CORAL HPC System Solution

IBM CORAL HPC System Solution IBM CORAL HPC System Solution HPC and HPDA towards Cognitive, AI and Deep Learning Deep Learning AI / Deep Learning Strategy for Power Power AI Platform High Performance Data Analytics Big Data Strategy

More information

Bigger GPUs and Bigger Nodes. Carl Pearson PhD Candidate, advised by Professor Wen-Mei Hwu

Bigger GPUs and Bigger Nodes. Carl Pearson PhD Candidate, advised by Professor Wen-Mei Hwu Bigger GPUs and Bigger Nodes Carl Pearson (pearson@illinois.edu) PhD Candidate, advised by Professor Wen-Mei Hwu 1 Outline Experiences from working with domain experts to develop GPU codes on Blue Waters

More information

SYNERGIE VON HPC UND DEEP LEARNING MIT NVIDIA GPUS

SYNERGIE VON HPC UND DEEP LEARNING MIT NVIDIA GPUS SYNERGIE VON HPC UND DEEP LEARNING MIT NVIDIA S Axel Koehler, Principal Solution Architect HPCN%Workshop%Goettingen,%14.%Mai%2018 NVIDIA - AI COMPUTING COMPANY Computer Graphics Computing Artificial Intelligence

More information

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana

More information

Mapping MPI+X Applications to Multi-GPU Architectures

Mapping MPI+X Applications to Multi-GPU Architectures Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under

More information

CST STUDIO SUITE R Supported GPU Hardware

CST STUDIO SUITE R Supported GPU Hardware CST STUDIO SUITE R 2017 Supported GPU Hardware 1 Supported Hardware CST STUDIO SUITE currently supports up to 8 GPU devices in a single host system, meaning each number of GPU devices between 1 and 8 is

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter

More information

More on CUDA and. graphics processing unit computing CHAPTER. Mark Harris and Isaac Gelado CHAPTER OUTLINE

More on CUDA and. graphics processing unit computing CHAPTER. Mark Harris and Isaac Gelado CHAPTER OUTLINE CHAPTER More on CUDA and 20 graphics processing unit computing Mark Harris and Isaac Gelado CHAPTER OUTLINE 20.1 Model of Host/Device Interaction...444 20.2 Kernel Execution Control...449 20.3 Memory Bandwidth

More information

Parallel Accelerators

Parallel Accelerators Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

GPU ARCHITECTURE Chris Schultz, June 2017

GPU ARCHITECTURE Chris Schultz, June 2017 GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA

More information

Mathematical computations with GPUs

Master Educational Program: Information technology in applications. GPU architecture. Alexey A. Romanenko, arom@ccfit.nsu.ru, Novosibirsk State University. GPU Graphical Processing

Power Systems AC922 Overview. Chris Mann, IBM Distinguished Engineer, Chief System Architect, Power HPC Systems. December 11, 2017

IBM POWER HPC Platform Strategy. High-performance computer and high-performance

TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS. Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23rd, 2016

MULTI GPU PROGRAMMING [diagram: Node 0, Node 1, ..., Node N-1, each with MEM blocks]

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink

Rajesh Bordawekar, IBM T. J. Watson Research Center, bordaw@us.ibm.com. Pidad D'Souza, IBM Systems, pidsouza@in.ibm.com. Outline

CS 179: GPU Computing. LECTURE 4: GPU MEMORY SYSTEMS

Last time: Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

NVIDIA Fermi Architecture

Patrick Cozzi, University of Pennsylvania, CIS 565 - Spring 2011. Administrivia: Assignment 4 grades returned. Project checkpoint on Monday. Post an update on your blog beforehand. Poster

NVIDIA Update and Directions on GPU Acceleration for Earth System Models

Stan Posey, HPC Program Manager, ESM and CFD, NVIDIA, Santa Clara, CA, USA. Carl Ponder, PhD, Applications Software Engineer, NVIDIA,

nci.org.au @NCInews

NCI and Raijin. National Computational Infrastructure. Our Partners. General purpose, highly parallel processors. High FLOPs/watt and FLOPs/$. Unit of execution: Kernel. Separate memory subsystem. GPGPU

CUDA OPTIMIZATIONS. ISC 2011 Tutorial

Tim C. Schroeder, NVIDIA Corporation. Outline: Kernel optimizations; Launch configuration; Global memory throughput; Shared memory access; Instruction throughput / control

S7750 - THE MAKING OF DGX SATURNV: BREAKING THE BARRIERS TO AI SCALE. Presenter: Louis Capps, Solution Architect, NVIDIA, lcapps@nvidia.com

A TALE OF ENLIGHTENMENT. Basic OK List 10 for x = 1 to 3 20 print

OpenPOWER Performance

Alex Mericas, Chief Engineer, OpenPOWER Performance, IBM. Delivering the Linux ecosystem for Power. SOLUTIONS OpenPOWER IBM SOFTWARE LINUX ECOSYSTEM OPEN SOURCE Solutions with full stack

CUDA Experiences: Over-Optimization and Future HPC

Carl Pearson (1), Simon Garcia De Gonzalo (2). Ph.D. candidates, Electrical and Computer Engineering (1) / Computer Science (2), University of Illinois Urbana-Champaign

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA

Power-constrained Computers: ~1 W, ~3 W, ~100 W, ~30 W, 1 kW, 100 kW, 20 MW. EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS. First-principles

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń. AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

ACCELERATED COMPUTING: THE PATH FORWARD. Jen-Hsun Huang, Co-Founder and CEO, NVIDIA. SC15, Nov. 16, 2015

COMMODITY DISRUPTS CUSTOM (SOURCE: Top500). It's time to start

General Purpose GPU Computing in Partial Wave Analysis

JLAB at 12 GeV - INT. Hrayr Matevosyan - NTC, Indiana University. November 18/2009. Computational Challenges in PWA: Rapid Increase in Available Data

TESLA P100 PERFORMANCE GUIDE. Deep Learning and HPC Applications

SEPTEMBER 2017. Modern high performance computing (HPC) data centers are key to solving some of the world's most important

S8765 Performance Optimization for Deep-Learning on the Latest POWER Systems

Khoa Huynh, Senior Technical Staff Member (STSM), IBM. Jonathan Samn, Software Engineer, IBM. Evolving from compute systems to

EXTENDING THE REACH OF PARALLEL COMPUTING WITH CUDA

Mark Harris, NVIDIA, @harrism, #NVSC14. EXTENDING THE REACH OF CUDA: 1 Machine Learning, 2 Higher Performance, 3 New Platforms, 4 New Languages. GPUS: THE

Deep Learning on Modern Architectures. Keren Zhou, 4/17/2017

HPC Software Stack: Application, Algorithm, Data Layout, CPU, GPU, MIC, Others. HPC Software Stack: Deep Learning, Algorithm, Data Layout, CPU, GPU, MIC, Others

High Performance Computing with Accelerators

Volodymyr Kindratenko, Innovative Systems Laboratory @ NCSA, Institute for Advanced Computing Applications and Technologies (IACAT), National Center for Supercomputing

GPU ARCHITECTURE. Chris Schultz, June 2017

MISC: All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA. OUTLINE: Problems Solved Over Time; CPU versus GPU; Why are they different?; Complex

Parallel Programming Principle and Practice. Lecture 9: Introduction to GPGPUs and CUDA Programming Model

Outline: Introduction to GPGPUs and CUDA Programming Model; The CUDA Thread Hierarchy / Memory Hierarchy

Acceleration of HPC applications on hybrid CPU-GPU systems: When can Multi-Process Service (MPS) help?

GTC 2018, March 28, 2018. Olga Pearce (Lawrence Livermore National Laboratory), http://people.llnl.gov/olga

Optimizing Efficiency of Deep Learning Workloads through GPU Virtualization

Presenters: Tim Kaldewey, Performance Architect, Watson Group; Michael Gschwind, Chief Engineer ML & DL, Systems Group; David K.

NVIDIA GPU TECHNOLOGY UPDATE

May 2015. Axel Koehler, Senior Solutions Architect, NVIDIA. NVIDIA: The VISUAL Computing Company. GAMING, DESIGN, ENTERPRISE VIRTUALIZATION, HPC & CLOUD SERVICE PROVIDERS, AUTONOMOUS

Lecture 1: an introduction to CUDA

Mike Giles, mike.giles@maths.ox.ac.uk, Oxford University Mathematical Institute, Oxford e-Research Centre. Overview: hardware view, software view, CUDA programming

FUJITSU Server PRIMERGY CX400 M4: Workload-specific power in a modular form factor

Copyright 2018 FUJITSU LIMITED. FUJITSU Server PRIMERGY CX400 M4: Workload-specific power in a compact and modular form

EE382N (20): Computer Architecture - Parallelism and Locality. Spring 2015, Lecture 09: GPUs (II). Mattan Erez, The University of Texas at Austin

Recap: Streaming model. 1. Use many slimmed down cores to run in parallel

Building NVLink for Developers

Unleashing programmatic, architectural and performance capabilities for accelerated computing. Why NVLink™? Simpler, Better and Faster. Simplified Programming: No specialized

MICROWAY'S NVIDIA TESLA V100 GPU SOLUTIONS GUIDE

LEVERAGE OUR EXPERTISE. sales@microway.com, http://microway.com/tesla. NUMBERSMASHER TESLA 4-GPU SERVER/WORKSTATION: Flexible form factor; 4 PCI-E GPUs + 3 additional

GpuWrapper: A Portable API for Heterogeneous Programming at CGG

Victor Arslan, Jean-Yves Blanc, Gina Sitaraman, Marc Tchiboukdjian, Guillaume Thomas-Collignon. March 2nd, 2016. GpuWrapper: Objectives &

S8901 Quadro for AI, VR and Simulation

Carl Flygare, PNY Quadro Product Marketing Manager. Allen Bourgoyne, NVIDIA Senior Product Marketing Manager. The question of whether a computer can think is no more

Vector Engine Processor of SX-Aurora TSUBASA

Shintaro Momose, Ph.D., NEC Deutschland GmbH. 9th October, 2018, WSSP. NEC Corporation 2018. Contents: 1) Introduction 2) VE Processor Architecture 3) Performance

NVIDIA Accelerators. Models: HPE NVIDIA GV100 Nvlink Bridge Kit; HPE NVIDIA Tesla V100 FHHL 16GB Computational Accelerator

Overview: Hewlett Packard supports, on select HPE ProLiant servers, computational accelerator modules based on NVIDIA Tesla, NVIDIA GRID, and NVIDIA Quadro Graphical Processing Unit (GPU) technology. The

CUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation

Outline: Linpack benchmark; CUDA Acceleration Strategy; Fermi DGEMM Optimization / Performance; Linpack Results; Conclusions. LINPACK Benchmark

1. Introduction 2. Methods for I/O Operations 3. Buses 4. Liquid Crystal Displays 5. Other Types of Displays 6. Graphics Adapters 7. Optical Discs

Structure of a Graphics Adapter: Video Memory, Graphics

Selecting the right Tesla/GTX GPU from a Drunken Baker's Dozen

GPU Computing Applications. Here's what Nvidia says its Tesla K20(X) card excels at doing: Seismic processing, CFD, CAE, Financial computing,

ACCELERATED COMPUTING: THE PATH FORWARD. Jensen Huang, Founder & CEO. SC17, Nov. 13, 2017

COMPUTING AFTER MOORE'S LAW [chart residue: Tech Walker; 40 Years of CPU Trend Data; GPU-Accelerated Computing; 1.1X per year]

An Introduction to GPU Architecture and CUDA C/C++ Programming. Bin Chen, April 4, 2018, Research Computing Center

Outline: Introduction to GPU architecture; Introduction to CUDA programming model; Using the

CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming. Lecture 2: Hardware Perspective of GPUs

Mohamed Zahran (aka Z), mzahran@cs.nyu.edu, http://www.mzahran.com. History of GPUs

GPU Architecture. Alan Gray, EPCC, The University of Edinburgh

Outline: Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages. Latest GPU Products. From

CUDA Architecture & Programming Model

Course on Multi-core Architectures & Programming. Oliver Taubmann, May 9, 2012. Outline: Introduction; Architecture Generation Fermi; A Brief Look Back At Tesla; What's New

TUNING CUDA APPLICATIONS FOR MAXWELL

DA-07173-001_v6.5, August 2014, Application Note. TABLE OF CONTENTS: Chapter 1. Maxwell Tuning Guide; 1.1. NVIDIA Maxwell Compute Architecture; 1.2. CUDA Best Practices

IBM Power Advanced Compute (AC) AC922 Server

The Best Server for Enterprise AI. Highlights: IBM Power Systems Accelerated Compute (AC922) server is an acceleration superhighway to enterprise-class AI. A