GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA


POWER-CONSTRAINED COMPUTERS. Every class of computer is power-constrained, across scales of roughly 1 W, 3 W, 30 W, 100 W, 1 kW, 100 kW, and 20 MW.

EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS. First-principles simulation of combustion for new high-efficiency, low-emission engines. Predictive calculations for thermonuclear and core-collapse supernovae, allowing confirmation of theoretical models. A comprehensive Earth System Model at 1 km scale, enabling modeling of cloud convection and ocean eddies. Coupled simulation of entire cells at the molecular, genetic, chemical and biological levels.

EXAFLOP EXPECTATIONS. [Chart: peak performance from 1 GF to 1 EF versus year, 1994 to 2020, showing the CM5 at ~200 kW, Titan at 8.2 MW, and a projected first exaflop computer; growing size, cost and power.]

HPC'S BIGGEST CHALLENGE: POWER. The power for a CPU-only exaflop supercomputer would equal the power of the Bay Area, CA (San Francisco + San Jose).

MOORE'S LAW IS ONLY PART OF THE STORY. Transistors per chip: 1993: 3M; 1997: 7.5M; 2001: 42M; 2004: 275M; 2007: 580M; 2010: 3B; 2013: 7B.

ROBERT DENNARD, IBM. 1968: invented DRAM. 1974: postulated that all key figures of merit of MOSFETs improve provided geometric dimensions, voltages, and doping concentrations are consistently scaled to maintain the same electric field.

CLASSIC DENNARD SCALING: 2.8x chip capability in the same power. Each generation brings 2x more transistors, 1.4x faster transistors, 0.7x capacitance, and 0.7x voltage. [Chart: chip power versus chip capability.]
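A back-of-the-envelope check of these factors, assuming the usual dynamic-power relation (the relation itself is not stated on the slide):

  P \propto N\,C\,V^{2} f
  \;\Rightarrow\;
  \frac{P'}{P} = \underbrace{2}_{N} \times \underbrace{0.7}_{C} \times \underbrace{0.7^{2}}_{V^{2}} \times \underbrace{1.4}_{f} \approx 0.96 \approx 1,
  \qquad
  \text{capability} \propto N f = 2 \times 1.4 = 2.8\times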

POST-DENNARD SCALING: 2x chip capability at 1.4x power, or 1.4x chip capability at the same power. Transistors are no faster, and static leakage limits reduction in Vth, so Vdd stays constant. Each generation still brings 2x more transistors and 0.7x capacitance, but the 0.7x voltage scaling is gone. [Chart: chip power versus chip capability.]
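Repeating the same rough estimate with voltage and transistor speed held flat (again assuming P \propto N C V^2 f):

  \frac{P'}{P} = \underbrace{2}_{N} \times \underbrace{0.7}_{C} \times \underbrace{1}_{V^{2}} \times \underbrace{1}_{f} = 1.4,
  \qquad
  \text{capability} \propto N = 2\times \text{ at } 1.4\times \text{ power}

At constant power only about 1/1.4 ≈ 0.7 of the new transistors can be active at once, which is where the 1.4x-capability-at-same-power figure comes from.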

THE HIGH COST OF DATA MOVEMENT. Fetching operands costs more than computing on them. Approximate energies for a 20 mm die in a 28 nm process: 64-bit DP operation, 20 pJ; 256-bit access to an 8 kB SRAM, 50 pJ; moving 256 bits across the chip, 26 pJ to 256 pJ depending on distance; efficient off-chip link, 500 pJ to 1 nJ; DRAM read/write, 16 nJ. The relative cost of data movement grows with each generation, because wire delay (ps/mm) is not improving.
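To make the gap concrete, simple arithmetic on the slide's own figures:

  \frac{16\ \text{nJ (DRAM read/write)}}{20\ \text{pJ (64-bit DP op)}} = 800

so one operand fetched from DRAM costs roughly as much energy as 800 double-precision operations.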

SO, WHAT TO DO? 1) Stop making it worse: multicore CPUs, but still only a tiny fraction of CPU power is spent on flops. 2) Unwind all the complexity we threw at single-thread performance.

HPC IS GOING HYBRID. An x86 CPU provides fast single threads for serial work (Sandy Bridge, 32 nm, 690 pJ/flop); a GPU attached over PCIe provides extreme power efficiency for throughput work (Kepler, 28 nm, 132 pJ/flop). Do most of the work on cores optimized for extreme energy efficiency, but still keep a few cores optimized for fast serial work. The same pattern appears with Xeon plus Intel MIC over PCIe, and with AMD Fusion.
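As a minimal illustration of this hybrid model (serial setup on the CPU, throughput work on the GPU over PCIe), here is a hedged CUDA C++ sketch; the kernel name, problem size, and launch configuration are illustrative assumptions, not taken from the slides:

  #include <cuda_runtime.h>
  #include <cstdio>
  #include <vector>

  // Throughput work: one GPU thread per element, y = a*x + y.
  __global__ void saxpy(int n, float a, const float *x, float *y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = a * x[i] + y[i];
  }

  int main() {
      const int n = 1 << 20;
      std::vector<float> hx(n, 1.0f), hy(n, 2.0f);      // serial setup on the CPU

      float *dx, *dy;
      cudaMalloc(&dx, n * sizeof(float));
      cudaMalloc(&dy, n * sizeof(float));
      cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);  // over PCIe
      cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

      saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);  // throughput work on the GPU
      cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);

      printf("y[0] = %f\n", hy[0]);                      // expect 4.0
      cudaFree(dx); cudaFree(dy);
      return 0;
  }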

EXPLOSIVE GROWTH OF GPU COMPUTING. 2008 vs. 2012: CUDA downloads, 150K vs. 1.5M; supercomputers, 1 vs. 52; universities, 60 vs. 560; academic papers, 4,000 vs. 22,500.

Hundreds of GPU-Accelerated Applications: www.nvidia.com/appscatalog

BREAKTHROUGH EFFICIENCY. On the June 2014 Green500 list, Tesla powers 15 of the most energy-efficient supercomputers, the first sweep since IBM BlueGene. Tsubame-KFC: 4.3 GFLOPS/watt.

OVERARCHING GOALS FOR TESLA: power efficiency; ease of programming and portability; application space coverage.

KEPLER GK110 GPU: THE WORLD'S FASTEST, MOST EFFICIENT HPC ACCELERATOR. SMX (power efficiency); Hyper-Q and Dynamic Parallelism (programmability and application coverage).

TESLA K80: THE WORLD'S FASTEST ACCELERATOR FOR DATA ANALYTICS AND SCIENTIFIC COMPUTING. Dual-GPU accelerator for maximum throughput: 2x faster, 2.9 TF, 4992 cores, 480 GB/s. Double the memory, designed for big-data apps: 24 GB vs. 12 GB on the K40. Dynamically maximizes performance for every application. [Chart: speedup over CPU (0x to 25x scale) for CPU, Tesla K40, and Tesla K80 on deep learning (Caffe), oil & gas, HPC viz, and data analytics.] Caffe benchmark: AlexNet training throughput based on 20 iterations; CPU: E5-2697v2 @ 2.70 GHz, 64 GB system memory, CentOS 6.2.

PERFORMANCE LEAD CONTINUES TO GROW. [Charts, 2008 to 2014: peak double-precision GFLOPS and peak memory bandwidth (GB/s) for NVIDIA GPUs (M1060, M2090, K20, K40, K80) versus x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell); the GPU curves reach roughly 2.9 TF and 480 GB/s with the K80, well above the CPU curves.]

TESLA K80: 10X FASTER ON REAL-WORLD APPS. [Chart: K80 speedup over CPU (0x to 15x scale) across benchmarks in molecular dynamics, quantum chemistry, and physics.] CPU: 12 cores, E5-2697v2 @ 2.70 GHz, 64 GB system memory, CentOS 6.2. GPU: single Tesla K80, boost enabled.

WHAT DOES THE FUTURE HOLD?

FAST-PACED CUDA GPU ROADMAP. [Chart: normalized SGEMM/W, 2008 to 2016: Tesla (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (DX12), Pascal (Unified Memory, 3D Memory, NVLink).]
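Of the roadmap items above, Unified Memory is the one that changes day-to-day CUDA code most directly: a single allocation is visible to both CPU and GPU, so the explicit copies in the earlier SAXPY sketch disappear. A minimal hedged sketch, with the kernel and sizes again being illustrative assumptions:

  #include <cuda_runtime.h>
  #include <cstdio>

  __global__ void scale(int n, float a, float *x) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] *= a;
  }

  int main() {
      const int n = 1 << 20;
      float *x;
      cudaMallocManaged(&x, n * sizeof(float));   // one allocation, visible to CPU and GPU
      for (int i = 0; i < n; ++i) x[i] = 1.0f;    // CPU writes directly, no cudaMemcpy

      scale<<<(n + 255) / 256, 256>>>(n, 3.0f, x);
      cudaDeviceSynchronize();                    // wait before the CPU reads the result

      printf("x[0] = %f\n", x[0]);                // expect 3.0
      cudaFree(x);
      return 0;
  }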

PASCAL GPU FEATURES NVLINK AND STACKED MEMORY. NVLink: GPU high-speed interconnect, 80-200 GB/s. 3D stacked memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy-efficient per bit.

NVLINK: HIGH-SPEED GPU INTERCONNECT. [Diagram: in 2014, a Kepler GPU attaches to an x86, ARM64, or POWER CPU over PCIe; in 2016, Pascal GPUs link to each other and to a POWER CPU over NVLink, with PCIe still connecting to x86 and ARM64 CPUs.]
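NVLink is largely transparent to application code: multi-GPU programs written against CUDA's existing peer-to-peer API simply see higher GPU-to-GPU bandwidth when the link is present. A hedged sketch of that API (device numbering and buffer size are assumptions for illustration):

  #include <cuda_runtime.h>
  #include <cstdio>

  int main() {
      int ndev = 0;
      cudaGetDeviceCount(&ndev);
      if (ndev < 2) { printf("need two GPUs for this sketch\n"); return 0; }

      int canAccess = 0;
      cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can GPU 0 address GPU 1's memory?

      size_t bytes = 1 << 26;                      // 64 MB test buffer
      float *buf0, *buf1;
      cudaSetDevice(0); cudaMalloc(&buf0, bytes);
      cudaSetDevice(1); cudaMalloc(&buf1, bytes);

      if (canAccess) {
          cudaSetDevice(0);
          cudaDeviceEnablePeerAccess(1, 0);        // map GPU 1's memory into GPU 0's address space
      }

      // GPU-to-GPU copy; runs over NVLink when present, otherwise over PCIe.
      cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
      cudaDeviceSynchronize();

      cudaSetDevice(0); cudaFree(buf0);
      cudaSetDevice(1); cudaFree(buf1);
      return 0;
  }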

3D MEMORY. 3D chip-on-wafer integration: many times the bandwidth, 2.5x the capacity, 4x the energy efficiency. [Chart: memory bandwidth versus year, 2008 to 2016.]

PASCAL. NVLink: 5 to 12x PCIe 3.0. 3D memory: 2 to 4x memory bandwidth and size. Module: 1/3 the size of a PCIe card, carrying the GPU chip, power regulation, and memory stacks.

PARALLELISM IN MAINSTREAM LANGUAGES. Enable more programmers to write parallel software. Give programmers the choice of language to use. GPU support in key languages.

C++ PARALLEL ALGORITHMS LIBRARY.

  std::vector<int> vec = ...

  // previous standard sequential loop
  std::for_each(vec.begin(), vec.end(), f);

  // explicitly sequential loop
  std::for_each(std::seq, vec.begin(), vec.end(), f);

  // permitting parallel execution
  std::for_each(std::par, vec.begin(), vec.end(), f);

A complete set of parallel primitives: for_each, sort, reduce, scan, etc. The ISO C++ committee voted unanimously to accept this as an official technical specification working draft (N3960). Technical Specification Working Draft: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf Prototype: https://github.com/n3554/n3554

LINUX GCC COMPILER TO SUPPORT GPU. Open source, free to all Linux users, and the most widely used HPC compiler. "Incorporating OpenACC into GCC is an excellent example of open source and open standards working together to make accelerated computing broadly accessible to all Linux developers." Oscar Hernandez, Oak Ridge National Laboratory

NUMBA PYTHON COMPILER. A free and open-source compiler for array-oriented Python. The new numba.cuda module integrates CUDA directly into Python:

  @cuda.jit('void(float32[:], float32, float32[:], float32[:])')
  def saxpy(out, a, x, y):
      i = cuda.grid(1)
      out[i] = a * x[i] + y[i]

  # Launch saxpy kernel
  saxpy[griddim, blockdim](out, a, x, y)

http://numba.pydata.org/

COMPILE JAVA FOR GPUS. Approach: apply a closure to a set of arrays; the foreach iterations are parallelized over GPU threads.

  // vector addition
  float[] X = {1.0f, 2.0f, 3.0f, 4.0f, };
  float[] Y = {9.0f, 8.1f, 7.2f, 6.3f, };
  float[] Z = {0.0f, 0.0f, 0.0f, 0.0f, };
  jog.foreach(X, Y, Z, new jogcontext(),
      new jogclosureret<jogcontext>() {
          public float execute(float x, float y) {
              return x + y;
          }
      });

[Chart: Java Black-Scholes options pricing speedup versus sequential Java, for 1 to 20 million options.] Learn more: S4939, Vinod Grover, Accelerating Java on GPUs, Wednesday 17:30-17:55, Room LL20C.

THE FUTURE OF HPC IS GREEN. Power is the constraint, so the vast majority of work must be done by cores designed for efficiency. GPU computing has a sustainable model: it is aligned with technology trends and supported by consumer markets. Future evolution will focus on integration (CPU, network, memory) and on increased generality, becoming efficient on any code with high parallelism. This is simply how computers will be built.