Efficiency and Programmability: Enablers for ExaScale. Bill Dally, Chief Scientist and SVP of Research, NVIDIA; Professor (Research), EE&CS, Stanford

Scientific Discovery and Business Analytics Driving an Insatiable Demand for More Computing Performance

HPC and Analytics: Compute, Communication, Memory & Storage

The End of Historic Scaling. Source: C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

Moore's Law gives us more transistors; Dennard scaling made them useful. Bob Colwell, DAC 2013, June 4, 2013

TITAN: 18,688 NVIDIA Tesla K20X GPUs. 27 Petaflops peak (90% of performance from GPUs). 17.59 Petaflops sustained on Linpack. Numerous real science applications. 2.14 GF/W, most efficient accelerator.

You Are Here
2013: 20 PF, 18,000 GPUs, 10 MW, 2 GFLOPS/W, ~10^7 threads
2020: 1,000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPS/W (25x), ~10^10 threads (1000x)
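
A quick back-of-the-envelope check of the efficiency figures in the table above; a small Python sketch that uses only the slide's own peak-performance and power numbers.

pflops_2013, megawatts_2013 = 20, 10       # 2013 system from the slide
pflops_2020, megawatts_2020 = 1000, 20     # 2020 exascale target from the slide

def gflops_per_watt(pflops, megawatts):
    # 1 PFLOPS = 1e6 GFLOPS and 1 MW = 1e6 W, so the 1e6 factors cancel.
    return (pflops * 1e6) / (megawatts * 1e6)

eff_2013 = gflops_per_watt(pflops_2013, megawatts_2013)   # 2.0 GFLOPS/W
eff_2020 = gflops_per_watt(pflops_2020, megawatts_2020)   # 50.0 GFLOPS/W
print(eff_2013, eff_2020, eff_2020 / eff_2013)            # 2.0 50.0 25.0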

Efficiency Gap [chart: GFLOPS/W (0-50) vs. year (2013-2020), comparing the efficiency Needed with what Process scaling alone delivers]

[chart: GFLOPS/W (1-10) vs. year (2013-2020), Needed vs. Process; CIRCUITS 3X stacked on PROCESS 2.2X]

Simpler Cores = Energy Efficiency Source: Azizi [PhD 2010]

CPU (Westmere, 32 nm): 1690 pJ/flop, optimized for latency, caches. GPU (Kepler, 28 nm): 140 pJ/flop, optimized for throughput, explicit management of on-chip memory.

[chart: GFLOPS/W (1-10) vs. year (2013-2020), Needed vs. Process; ARCHITECTURE 4X stacked on CIRCUITS 3X and PROCESS 2.2X]
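
Multiplying the three contributions labeled in the chart gives roughly the required 25x; a small illustrative calculation, not from the talk itself.

process, circuits, architecture = 2.2, 3.0, 4.0   # gains labeled in the chart
combined = process * circuits * architecture
print(combined)             # 26.4, just above the ~25x needed by 2020
print(combined / process)   # 12.0, the share that must come from circuits and architecture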

Programmers, Tools, and Architecture Need to Play Their Positions Programmer Tools Architecture

Programmers, Tools, and Architecture Need to Play Their Positions. Programmer: algorithm, all of the parallelism, abstract locality. Tools: combinatorial optimization, mapping, selection of mechanisms. Architecture: fast mechanisms, exposed costs.

[node block diagram: eight LOCs (latency-optimized cores), eight throughput lanes, L2 banks, XBAR (crossbar), multiple DRAM I/O interfaces, NW (network) I/O]

An Enabling HPC Network: <1 ms latency, scalable bandwidth, small messages (50% @ 32 B), global adaptive routing, PGAS, collectives & atomics, MPI offload.

An Open HPC Network Ecosystem. Common: software-NIC API, NIC-router channel. Participants: processor/NIC vendors, system vendors, networking vendors.

Power: 25x efficiency with 2.2x from process. Programming: parallelism, heterogeneity, hierarchy. Programmer, Tools, Architecture.

Super Computing: From Super Computers to Super Phones

Backup

In The Past, Demand Was Fueled by Moore's Law. Source: Moore, Electronics 38(8), April 19, 1965

ILP Was Mined Out in 2001 [chart: Perf (ps/inst) vs. a Linear (ps/inst) extrapolation, 1980-2020, log scale; the gap grows from 30:1 to 1,000:1 to 30,000:1]. Source: Dally et al., The Last Classical Computer, ISAT Study, 2001

Voltage Scaling Ended in 2005 Source: Moore, ISSCC Keynote, 2003

Summary: Moore's law is alive and well, but instruction-level parallelism (ILP) was mined out in 2001, voltage scaling (Dennard scaling) ended in 2005, and most power is spent on communication. What does this mean to you?

The End of Historic Scaling. Source: C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

In the Future: all performance is from parallelism; machines are power limited (efficiency IS performance); machines are communication limited (locality IS performance).

Two Major Challenges. Energy efficiency: 25x in 7 years (~2.2x from process). Programming: parallel (~10^10 threads), hierarchical, heterogeneous.

[chart: GFLOPS/W (1-10) vs. year (2013-2020); efficiency Needed vs. Process scaling alone]

How is Power Spent in a CPU? In-order embedded CPU (Dally [2008]): instruction supply 42%, clock + control logic 24%, data supply 17%, register file 11%, ALU 6%. Out-of-order high-performance CPU, Alpha 21264 (Natarajan [2003]): clock + pins 45%, RF 14%, issue 11%, fetch 11%, rename 10%, data supply 5%, ALU 4%.

Energy Shopping List
Processor technology (40 nm vs. 10 nm): Vdd (nominal) 0.9 V vs. 0.7 V; DFMA energy 50 pJ vs. 7.6 pJ; 64b 8 KB SRAM read 14 pJ vs. 2.1 pJ; wire energy (256 bits, 10 mm) 310 pJ vs. 174 pJ. (FP op lower bound = 4 pJ)
Memory technology (45 nm vs. 16 nm): DRAM interface pin bandwidth 4 Gbps vs. 50 Gbps; DRAM interface energy 20-30 pJ/bit vs. 2 pJ/bit; DRAM access energy 8-15 pJ/bit vs. 2.5 pJ/bit.
Sources: Keckler [Micro 2011], Vogelsang [Micro 2010]

[chart: GFLOPS/W (1-10) vs. year (2013-2020), Needed vs. Process; CIRCUITS 3X stacked on PROCESS 2.2X]

[core block diagrams] Latency-Optimized Core (LOC): PC, branch predict, I$, register rename, instruction window, select, register file, ALUs 1-4, reorder buffer. Throughput-Optimized Core (TOC): PCs, I$, register file, ALUs 1-4.

Streaming Multiprocessor (SM): main register file (32 banks, 15% of energy), warp scheduler, SIMT lanes (ALU, SFU, MEM, TEX), shared memory (32 KB).

Hierarchical Register File [charts: percent of all values produced, broken down by read count (read 0, 1, 2, or >2 times) and by lifetime (1, 2, 3, or >3)]

Register File Caching (RFC) [block diagram: MRF with 4x128-bit banks (1R1W), operand buffering, operand routing, RFC with 4x32-bit banks (3R1W), feeding the SFU, MEM, TEX, and ALU units]

Energy Savings from RF Hierarchy: 54% energy reduction. Source: Gebhart et al. (Micro 2011)
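
As a rough illustration of why a small register file cache (RFC) saves energy when most values are read at most once and die quickly, here is a minimal Python sketch of a two-level register file; the access energies and RFC capacity are assumed placeholders, not the measured figures behind the 54% result above.

from collections import OrderedDict

E_MRF, E_RFC = 10.0, 1.0   # assumed pJ per access (placeholders, not measured values)
RFC_ENTRIES = 6            # assumed small per-thread RFC capacity

class TwoLevelRegisterFile:
    def __init__(self):
        self.rfc = OrderedDict()   # register name -> value, kept in LRU order
        self.energy = 0.0

    def write(self, reg, value):
        # Results are written into the RFC; an evicted value spills to the MRF.
        self.energy += E_RFC
        self.rfc[reg] = value
        self.rfc.move_to_end(reg)
        if len(self.rfc) > RFC_ENTRIES:
            self.rfc.popitem(last=False)   # evict least-recently-used entry
            self.energy += E_MRF           # write-back into the main register file

    def read(self, reg):
        if reg in self.rfc:        # hit: cheap RFC access
            self.energy += E_RFC
            self.rfc.move_to_end(reg)
        else:                      # miss: expensive MRF access
            self.energy += E_MRF

# Values consumed shortly after being produced hit in the RFC, so most
# register traffic never touches the big, power-hungry main register file.
rf = TwoLevelRegisterFile()
for i in range(1000):
    rf.write("r%d" % (i % 4), i)   # producer writes a value
    rf.read("r%d" % (i % 4))       # consumer reads it right back
baseline = 2000 * E_MRF            # every access going to the MRF instead
print(rf.energy, baseline, 1 - rf.energy / baseline)   # 2000.0 20000.0 0.9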

Two Major Challenges. Energy efficiency: 25x in 7 years (~2.2x from process). Programming: parallel (~10^10 threads), hierarchical, heterogeneous.

Skills on LinkedIn (size approx., growth rel.)
Mainstream programming: C++ 1,000,000 (-8%), JavaScript 1,000,000 (-1%), Python 429,000 (+7%), Fortran 90,000 (-11%)
Parallel and assembly programming: MPI 21,000 (-3%), x86 Assembly 17,000 (-8%), CUDA 14,000 (+9%), parallel programming 13,000 (+3%), OpenMP 8,000 (+2%), TBB 389 (+10%), 6502 Assembly 256 (-13%)
Source: linkedin.com/skills (as of Jun 11, 2013)

Parallel Programming is Easy

forall molecule in set:                        # 1E6 molecules
  forall neighbor in molecule.neighbors:       # 1E2 neighbors each
    forall force in forces:                    # several forces
      molecule.force += force(molecule, neighbor)   # reduction
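
The same forall nest written as plain Python so it can run; Molecule and the spring force below are illustrative stand-ins, not code from the talk.

from dataclasses import dataclass, field

@dataclass
class Molecule:
    position: float
    force: float = 0.0
    neighbors: list = field(default_factory=list)

def spring(m, n):
    # One of the "several forces"; purely a placeholder pair-wise force.
    return 0.1 * (n.position - m.position)

forces = [spring]

def accumulate(molecules):
    # Every (molecule, neighbor, force) combination is independent work;
    # only the += reduction needs coordination, so the nest exposes
    # ~1E6 x 1E2 parallel operations for the tools to map onto hardware.
    for molecule in molecules:                  # forall molecule in set
        for neighbor in molecule.neighbors:     # forall neighbor
            for f in forces:                    # forall force
                molecule.force += f(molecule, neighbor)   # reduction

m1, m2 = Molecule(0.0), Molecule(1.0)
m1.neighbors, m2.neighbors = [m2], [m1]
accumulate([m1, m2])
print(m1.force, m2.force)   # 0.1 -0.1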

We Can Make It Hard

pid = fork();                  // explicitly managing threads

lock(struct.lock);             // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock);

code = send(pid, tag, &msg);   // partition across nodes

Programmers, Tools, and Architecture Need to Play Their Positions Programmer Tools Architecture

Programmers, Tools, and Architecture Need to Play Their Positions. Programmer: algorithm, all of the parallelism, abstract locality. Tools: combinatorial optimization, mapping, selection of mechanisms. Architecture: fast mechanisms, exposed costs.

OpenACC: Easy and Portable

Serial code (SAXPY):
do i = 1, 20*128
  do j = 1, 5000000
    fa(i) = a * fa(i) + fb(i)
  end do
end do

With OpenACC:
!$acc parallel loop
do i = 1, 20*128
!dir$ unroll 1000
  do j = 1, 5000000
    fa(i) = a * fa(i) + fb(i)
  end do
end do

Conclusion

The End of Historic Scaling. Source: C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

Parallelism is the source of all performance. Power limits all computing. Communication dominates power.

Two Challenges. Power: 25x efficiency with 2.2x from process. Programming: parallelism, heterogeneity, hierarchy.

Power: 25x efficiency with 2.2x from process. Programming: parallelism, heterogeneity, hierarchy. Programmer, Tools, Architecture.

Super Computing: From Super Computers to Super Phones

Communication Takes More Energy Than Arithmetic [annotated die diagram: 64-bit DP op 20 pJ; 256-bit access to an 8 kB SRAM 50 pJ; 256-bit buses 26 pJ to 256 pJ over up to 20 mm; efficient off-chip link 500 pJ; ~1 nJ for off-chip transfers; DRAM Rd/Wr 16 nJ]

Key to Parallelism: independent operations on independent data.

sum(map(multiply, x, x))

Every pair-wise multiply is independent, so parallelism is permitted.
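
The slide's expression runs unchanged in Python, and because each pair-wise multiply is independent the same map can be handed to a worker pool; a sketch of one possible mapping, not the talk's implementation.

from operator import mul as multiply
from multiprocessing import Pool

x = [1.0, 2.0, 3.0, 4.0]

# Sequential form, exactly as written on the slide:
total = sum(map(multiply, x, x))            # 1 + 4 + 9 + 16 = 30.0

# One possible parallel mapping of the same independent multiplies:
if __name__ == "__main__":
    with Pool(processes=2) as pool:
        total_parallel = sum(pool.starmap(multiply, zip(x, x)))
    print(total, total_parallel)            # 30.0 30.0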

Key to Locality: data decomposition should drive mapping.

Flat computation:
total = sum(x)

vs.

tiles = split(x)
partials = map(sum, tiles)
total = sum(partials)

Key to Locality: data decomposition should drive mapping.

total = sum(x)

vs. explicit decomposition:

tiles = split(x)
partials = map(sum, tiles)
total = sum(partials)
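
Spelled out in Python, with split as a hypothetical helper added for illustration: the explicit decomposition computes the same total as the flat reduction while keeping each partial sum local to its tile.

def split(x, tile_size=4):
    # Decompose x into contiguous tiles so each tile can live near one core.
    return [x[i:i + tile_size] for i in range(0, len(x), tile_size)]

x = list(range(16))

# Flat computation: one global reduction over all of x.
total_flat = sum(x)

# Explicit decomposition: local per-tile sums, then a small global reduction.
tiles = split(x)
partials = list(map(sum, tiles))
total = sum(partials)

print(total_flat, total)   # 120 120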