The Future of GPU Computing


The Future of GPU Computing
Bill Dally, Chief Scientist & Sr. VP of Research, NVIDIA; Bell Professor of Engineering, Stanford University
November 18, 2009


Outline
- Single-thread performance is no longer scaling
- Performance = Parallelism
- Efficiency = Locality
- Applications have lots of both
- Machines need lots of cores (parallelism) and an exposed storage hierarchy (locality)
- A programming system must abstract this
- The future is even more parallel

Single-threaded processor performance is no longer scaling

Moore's Law: In 1965 Gordon Moore predicted that the number of transistors on an integrated circuit would double every year (later revised to every 18 months). He also predicted L^3 power scaling for constant function. He made no prediction about processor performance. Moore, Electronics 38(8), April 19, 1965

More Transistors → (Architecture) → More Performance → (Applications) → More Value

The End of ILP Scaling
[Figure: single-thread performance (ps/inst) vs. year, 1980-2020, on a log scale]
Dally et al., "The Last Classical Computer", ISAT Study, 2001

Explicit Parallelism is Now Attractive
[Figure: achieved performance (ps/inst) vs. a linear extrapolation of historical scaling, 1980-2020; the gap grows from 30:1 to 1,000:1 to 30,000:1]
Dally et al., "The Last Classical Computer", ISAT Study, 2001

Single-Thread Processor Performance vs. Calendar Year
[Figure: performance relative to the VAX-11/780, 1978-2006; growth of 25%/year, then 52%/year, then about 20%/year]
Source: Hennessy & Patterson, CAAQA, 4th Edition

Single-threaded processor performance is no longer scaling Performance = Parallelism

Chips are power limited and most power is spent moving data

CMOS Chip is our Canvas
[Figure: a 20mm x 20mm die]

4,000 64b FPUs fit on a chip: each 64b FPU occupies 0.1 mm2, costs 50 pJ/op, and runs at 1.5 GHz on a 20mm die.
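As a quick check of that count (my arithmetic, not on the slide):

    \[ \frac{(20\,\text{mm})^2}{0.1\,\text{mm}^2 \text{ per FPU}} = \frac{400\,\text{mm}^2}{0.1\,\text{mm}^2} = 4{,}000 \text{ FPUs} \]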

Energy costs of computation vs. communication: a 64b FPU (0.1 mm2, 1.5 GHz) costs 50 pJ per op; a 64b 1mm on-chip channel costs 25 pJ per word (10mm across the die: 250 pJ, 4 cycles); a 64b off-chip channel costs 1 nJ per word. Moving a word across the die = 10 FMAs; moving a word off chip = 20 FMAs.
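The FMA equivalences follow from the per-millimeter wire energy and the FPU energy quoted above (my arithmetic, spelled out):

    \[ 20\,\text{mm} \times 25\,\text{pJ/mm} = 500\,\text{pJ} = 10 \times 50\,\text{pJ} \quad (\text{10 FMAs}) \]
    \[ 1\,\text{nJ} = 1000\,\text{pJ} = 20 \times 50\,\text{pJ} \quad (\text{20 FMAs}) \]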

Chips are power limited, and most power is spent moving data. Efficiency = Locality

Performance = Parallelism Efficiency = Locality

Scientific Applications
- Large data sets
- Lots of parallelism
- Increasingly irregular (AMR): irregular and dynamic data structures require efficient gather/scatter
- Increasingly complex models
- Lots of locality
- Global solution sometimes bandwidth limited, with less locality in these phases

Performance = Parallelism. Efficiency = Locality. Fortunately, most applications have lots of both. Amdahl's law doesn't apply to most future applications.

Exploiting parallelism and locality requires: many efficient processors (to exploit parallelism), an exposed storage hierarchy (to exploit locality), and a programming system that abstracts this.

Tree-structured machines
[Figure: a tree of storage and interconnect: processors (P), each with an L1, connected by a network to a shared L2, and by another network to a shared L3]

Optimize use of scarce bandwidth:
- Provide a rich, exposed storage hierarchy
- Explicitly manage data movement on this hierarchy
- Reduces demand, increases utilization
[Figure: dataflow for an unstructured-mesh solver: kernels (Compute Flux States, Compute Numerical Flux, Gather Cell, Compute Cell Interior, Advance Cell) pass data (Element Faces, Face Geometry, Numerical Flux, Cell Geometry, Elements (Current), Elements (New), Gathered Elements, Cell Orientations) through the hierarchy, with read-only table-lookup data for the master element]
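As a concrete illustration of explicitly managed data movement (a minimal sketch of my own, not code from the talk; the kernel and array names are hypothetical), a CUDA block can stage a tile of cell data from DRAM into on-chip shared memory once and let every thread in the block reuse it, reducing off-chip traffic:

    #include <cuda_runtime.h>

    #define TILE 256   // launch with TILE threads per block

    // Hypothetical kernel: each block stages TILE cell values from DRAM into
    // shared memory once, then every thread reuses its neighbours' values
    // without touching DRAM again.
    __global__ void advance_cells(const float* cell_in, float* cell_out, int n)
    {
        __shared__ float tile[TILE];                      // explicitly managed on-chip storage

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? cell_in[i] : 0.0f;  // one DRAM read per element
        __syncthreads();                                  // the tile is now resident on chip

        if (i < n) {
            // Reuse staged data: blend each value with its in-tile neighbours.
            float left  = (threadIdx.x > 0)        ? tile[threadIdx.x - 1] : tile[threadIdx.x];
            float right = (threadIdx.x < TILE - 1) ? tile[threadIdx.x + 1] : tile[threadIdx.x];
            cell_out[i] = 0.25f * left + 0.5f * tile[threadIdx.x] + 0.25f * right;
        }
    }

Launched as advance_cells<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n), each element is read from DRAM once instead of up to three times; the programmer, not a cache, decides what lives in the on-chip level.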

Fermi is a throughput computer: 512 efficient cores and a rich storage hierarchy (shared memory, L1, L2, GDDR5 DRAM).
[Figure: Fermi die block diagram: GigaThread scheduler, host interface, shared L2, and six DRAM interfaces surrounding the core array]

Fermi

Avoid Denial Architecture. Single-thread processors are in denial about parallelism and locality. They provide two illusions:
- Serial execution, which denies parallelism: they try to exploit parallelism with ILP, which is inefficient and has limited scalability.
- Flat memory, which denies locality: they try to provide the illusion with caches, which are very inefficient when the working set doesn't fit in the cache.
These illusions inhibit performance and efficiency.

CUDA Abstracts the GPU Architecture: the programmer sees many cores and an exposed storage hierarchy, but is isolated from the details.

CUDA as a Stream Language
- Launch a cooperative thread array: foo<<<nblocks, nthreads>>>(x, y, z);
- Explicit control of the memory hierarchy: __shared__ float a[SIZE]; (also enables communication between threads of a CTA)
- Allows access to arbitrary data within a kernel
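Putting the two constructs together (a minimal sketch of my own, not from the slides; foo is kept as the hypothetical kernel name and its arguments are adapted), threads of one CTA can cooperate through __shared__ storage: each thread stages a value, the block synchronizes, and the threads combine their results:

    #include <cuda_runtime.h>
    #include <cstdio>

    #define NTHREADS 256

    // Hypothetical kernel: each CTA (thread block) sums NTHREADS elements of x
    // into one element of y, communicating through explicitly declared shared memory.
    __global__ void foo(const float* x, float* y, int n)
    {
        __shared__ float a[NTHREADS];              // per-CTA on-chip storage

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[threadIdx.x] = (i < n) ? x[i] : 0.0f;    // each thread stages one value
        __syncthreads();                           // make all writes visible to the CTA

        // Tree reduction inside the CTA: threads cooperate via shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                a[threadIdx.x] += a[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            y[blockIdx.x] = a[0];                  // one result per CTA
    }

    int main()
    {
        const int n = 1 << 20;
        const int nblocks = (n + NTHREADS - 1) / NTHREADS;

        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, nblocks * sizeof(float));
        for (int i = 0; i < n; ++i) x[i] = 1.0f;

        foo<<<nblocks, NTHREADS>>>(x, y, n);       // launch one CTA per block of data
        cudaDeviceSynchronize();

        printf("block 0 partial sum = %f\n", y[0]); // expect 256.0
        cudaFree(x); cudaFree(y);
        return 0;
    }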

Examples of CUDA speedups:
- 146X: Interactive visualization of volumetric white matter connectivity
- 36X: Ionic placement for molecular dynamics simulation on GPU
- 19X: Transcoding HD video stream to H.264
- 17X: Fluid mechanics in Matlab using .mex file CUDA function
- 100X: Astrophysics N-body simulation
- 149X: Financial simulation of LIBOR model with swaptions
- 47X: GLAME@lab, an M-script API for GPU linear algebra
- 20X: Ultrasound medical imaging for cancer diagnostics
- 24X: Highly optimized object oriented molecular dynamics
- 30X: Cmatch exact string matching to find similar proteins and gene sequences

Current CUDA Ecosystem
- Over 200 universities teaching CUDA: UIUC, MIT, Harvard, Berkeley, Cambridge, Oxford, IIT Delhi, Tsinghua, Dortmund, ETH Zurich, Moscow, NTU
- Languages: C, C++, DirectX, Fortran, OpenCL, Python
- Compilers: PGI Fortran, CAPS HMPP, MCUDA, MPI, NOAA Fortran2C, OpenMP
- Applications: Oil & Gas, Finance, Medical, Biophysics, Numerics, DSP, CFD, Imaging, EDA
- Libraries: FFT, BLAS, LAPACK, image processing, video processing, signal processing, vision
- Consultants: ANEO, GPU Tech
- OEMs

Ease of Programming Source: Nicolas Pinto, MIT

The future is even more parallel

CPU scaling ends, GPU continues Source: Hennessy & Patterson, CAAQA, 4th Edition 2017

DARPA Study Identifies Four Challenges for ExaScale Computing. Report published September 28, 2008; available at www.darpa.mil/ipto/personnel/docs/exascale_study_initial.pdf
Four major challenges:
- Energy and Power challenge
- Memory and Storage challenge
- Concurrency and Locality challenge
- Resiliency challenge
The number one issue is power: extrapolations of current architectures and technology indicate over 100MW for an Exaflop! Power also constrains what we can put on a chip.

Energy and Power Challenge
- Heterogeneous architecture: a few latency-optimized processors plus many (100s-1,000s) throughput-optimized processors, which are optimized for ops/J
- Efficient processor architecture: simple control (in-order, multi-threaded), SIMT execution to amortize overhead
- Agile memory system to capture locality: keeps data and instruction access local
- Optimized circuit design: minimize energy/op, minimize the cost of data movement
* This section is a projection based on Moore's Law and does not represent a committed roadmap
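To make the division of labor concrete (an illustrative sketch of today's CPU+GPU model, not the projected machine; saxpy is just a stand-in kernel), the latency-optimized host runs the serial control flow while the throughput-optimized device runs the wide SIMT work:

    #include <cuda_runtime.h>

    // Throughput side: thousands of lightweight threads, one element each (SIMT).
    __global__ void saxpy(int n, float a, const float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 24;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));

        // Latency side: the CPU handles setup, control flow, and anything serial.
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // Offload the data-parallel bulk to the throughput-optimized processor.
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        cudaFree(x); cudaFree(y);
        return 0;
    }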

An NVIDIA ExaScale Machine in 2017
- 2017 GPU Node: 300W (GPU + memory + supply); 2,400 throughput cores (7,200 FPUs) and 16 CPUs on a single chip; 40 TFLOPS (SP), 13 TFLOPS (DP); deep, explicit on-chip storage hierarchy; fast communication and synchronization
- Node memory: 128GB DRAM at 2TB/s bandwidth; 512GB phase-change/flash for checkpoint and scratch
- Cabinet: ~100kW; 384 nodes; 15.7 PFLOPS (SP), 50TB DRAM; Dragonfly network with 1TB/s per-node bandwidth
- System: ~10MW; 128 cabinets; 2 EFLOPS (SP), 6.4 PB DRAM; distributed EB disk array for the file system; Dragonfly network with active optical links
- RAS: ECC on all memory and links; option to pair cores for self-checking (or use application-level checking); fast local checkpoint
* This section is a projection based on Moore's Law and does not represent a committed roadmap
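The cabinet and system figures follow roughly from the node figures (my arithmetic; the slide's 15.7 PFLOPS suggests a node rating slightly above 40 TFLOPS):

    \[ 384 \times 40\,\text{TFLOPS} \approx 15.4\,\text{PFLOPS}, \qquad 384 \times 128\,\text{GB} \approx 49\,\text{TB} \]
    \[ 128 \times 15.7\,\text{PFLOPS} \approx 2\,\text{EFLOPS}, \qquad 128 \times 384 \times 128\,\text{GB} \approx 6.3\,\text{PB} \]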

Conclusion

Performance = Parallelism Efficiency = Locality Applications have lots of both. GPUs have lots of cores (parallelism) and an exposed storage hierarchy (locality)