THE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research

Size: px

Start display at page:

Download "THE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research"

Branden Alexander
5 years ago
Views:

1 THE PATH TO EXASCALE COMPUTING Bill Dally Chief Scientist and Senior Vice President of Research

2 The Goal: Sustained ExaFLOPs on problems of interest 2

3 Exascale Challenges Energy efficiency Programmability Resilience Sustained performance on real applications Scalability 3

4 NVIDIA s ExaScale Vision Energy efficiency Hybrid architecture, efficient architecture, aggressive circuits, data locality Programmability Target-independent programming, adaptation layer, agile network, hardware support Resilience Containment domains, low SDC Sustained performance on real applications Scalability 4

5 NVIDIA s ExaScale Vision Energy efficiency Hybrid architecture, efficient architecture, aggressive circuits, data locality Programmability Target-independent programming, adaptation layer, agile network, hardware support Resilience Containment domains, low SDC Sustained performance on real applications Scalability 5

6 PF 18,000 GPUs 10MW 2 GFLOPs/W ~10 7 Threads You Are Here 1,000PF (50x) 72,000HCNs (4x) 20MW (2x) 50 GFLOPs/W (25x) ~10 10 Threads (1000x) 6

7 PF 18,000 GPUs 10MW 2 GFLOPs/W ~10 7 Threads CORAL PF (5-10x) 11MW (1.1x) GFLOPs/W (7-14x) Lots of Threads You Are Here ,000PF (50x) 72,000HCNs (4x) 20MW (2x) 50 GFLOPs/W (25x) ~10 10 Threads (1000x) 7

8 Energy Efficiency 8

9 Its not about the FLOPs DFMA 0.01mm 2 10pJ/OP 2GFLOPs A chip with 10 4 FPUs: 100mm 2 200W 20TFLOPS Pack 50,000 of these in racks 1EFLOPS 10MW 16nm chip, 10mm on a side, 200W 9

10 Overhead Locality 10

11 LOC 0 LOC 7 Heterogeneous Node System Interconnect DRAM Stacks MC NIC NoC L2 0 1MB L2 1 1MB L2 2 1MB L2 3 1MB Lane 0 Lane 15 TOC 1 TOC 2 TOC 3 TPC 127 NVLink LLC MC DRAM DIMMs NVLink NoC NV RAM TOC 0 TPC 0 11

12 CPU 130 pj/flop (Vector SP) Optimized for Latency Deep Cache Hierarchy GPU 30 pj/flop (SP) Optimized for Throughput Explicit Management of On-chip Memory Haswell 22 nm Maxwell 28 nm 12

13 CPU 2nJ/flop (Scalar SP) Optimized for Latency Deep Cache Hierarchy GPU 30 pj/flop (SP) Optimized for Throughput Explicit Management of On-chip Memory Haswell 22 nm Maxwell 28 nm 13

Control Logic 24% Instruction Supply 42% Issue 11% RF 14% OOO Hi-perf ALU

14 How is Power Spent in a CPU? ALU 6% Register File 11% In-order Embedded Data Supply 17% Clock + Control Logic 24% Instruction Supply 42% Issue 11% RF 14% OOO Hi-perf ALU 4% Rename 10% Data Supply 5% Fetch 11% Clock + Pins 45% Dally [2008] (Embedded in-order CPU) Natarajan [2003] (Alpha 21264) 14

15 Payload Arithmetic 20pJ Overhead 980pJ 15

16 16

Control Path Data Path Net Thread PCs RF L0Addr L1Addr Net LM Bank 0 RF L0Addr L1Addr Net LM Bank 3 To LD/ST To LD/ST Active PCs Scheduler L0 I$ ORF ORF ORF Inst

17 Control Path Data Path Net Thread PCs RF L0Addr L1Addr Net LM Bank 0 RF L0Addr L1Addr Net LM Bank 3 To LD/ST To LD/ST Active PCs Scheduler L0 I$ ORF ORF ORF Inst 64 threads 4 active threads 2 DFMAs (4 FLOPS/clock) ORF bank: 16 entries (128 Bytes) L0 I$: 64 instructions (1KByte) LM Bank: 8KB (32KB total) FP/Int FP/Int LS/BR 17

18 Payload Arithmetic 20pJ Overhead 20pJ 18

19 Energy-Efficient Architecture See Steve Keckler s Booth Talk Wednesday 2:30PM How to reduce energy 10x when process gives 2x Do Less Work Eliminate redundancy, waste, and overhead Move fewer bits over less distance Move data more efficiently 19

20 Communication Energy 64-bit DP 20pJ 20mm 26 pj 256 pj 16 nj DRAM Rd/Wr 256-bit buses 256-bit access 8 kb SRAM 50 pj 500 pj Efficient off-chip link 1 nj 20

Reduces on-chip signaling energy by 4x 1 2 3

capacitance Swizzlers (Re-time & Level-Shift

21 External Sense: Vmid, Vdd & Gnd Clock Generation Charge-Recycled Signaling (CRS) Reduces on-chip signaling energy by 4x Repeaters Bus wires over bypass capacitance Swizzlers (Re-time & Level-Shift or Bypass) Pattern generators/checkers and configuration logic 21

22 Ground-Referenced Signaling (GRS) Test Chip #1 on Board Probe Station Test Chip #2 fabricated on production GPU Eye Diagram from Probe Poulton et al. ISSCC 2013, JSSCC Dec

23 Programmability 23

24 Target-Independent Programming Mapping Directives Profiling & Visualization Target- Independent Source Mapping Tools Target- Dependent Adaptation Compile Target- Dependent Executable 24

25 Legion Programming Model Enabling Powerful Program Analysis Legion Program Legion Tasks + Data Model = Powerful Programming Analysis Machine-Independent Specification Tasks: decouple control from machine Logical regions: decouple program data from machine Analysis! Why it matters Reduce programmer pain Extract ALL parallelism Easily transform and remap programs for new machines Sequential semantics 25

26 Comparison with MPI+OpenACC The power of program analysis Weak scaling results on Titan out to 8K nodes 1.75X 2.85X As application and machine complexity increases, the performance gap will grow. 26

27 Scalability 27

28 System Sketch System Interconnect DRAM DIMMs NV RAM NoC L2 0 1MB L2 1 1MB L2 2 1MB L2 3 1MB Lane 0 Lane 15 TOC 1 TOC 2 TOC 3 TPC 127 LOC 0 LOC 7 DRAM Stacks NW on-/ off-ramp MC MC MC NoC TOC 0 TPC 0 Node 0: 16.4 TF, 2 TB/s, 512+ GB Node 383 Cabinet 0: 6.3 PF, 128 TB System: up to 1 EFlop Cabinet

29 Heterogenous Network Requirements GPUs present unique requirements on network threads initiating transactions Can saturate 150GB/s NVLINK bandwidth In addition to HPC requirements not met by commodity networks Scalable BW up to 200GB/s per endpoint <1us end-to-end latency at 16K endpoints Scale to 128K endpoints Load balanced routing Congestion control Operations: Load/Store, Atomics, Messages, Collectives 29

1MB TOC0 1MB 1MB 1MB LOC 0 LOC 7 System (up to 1 EF) Conclusion Energy

Enhanced Locality Mapping Directives Profiling & Visualization Programming 10

Independent Source Mapping Tools Target- Dependent Executable System Sketch

NIC GPU-Centric network L2 0 Lane 0 L2 1 Lane 15 TOC1 L2 2 L2 3 TOC2 TOC3 TPC

30 1MB TOC0 1MB 1MB 1MB LOC 0 LOC 7 System (up to 1 EF) Conclusion Energy Efficiency Reduce overhead with Throughput cores Efficient Signaling Circuits Enhanced Locality Mapping Directives Profiling & Visualization Programming Threads Target-independent programming mapping via tools Target- Independent Source Mapping Tools Target- Dependent Executable System Sketch NV RAM System Interconnect DRAM DIMMs Efficient Nodes DRAM Stacks MC NoC MC NIC GPU-Centric network L2 0 Lane 0 L2 1 Lane 15 TOC1 L2 2 L2 3 TOC2 TOC3 TPC 127 TPC 0 Node 0: 16.4 TF, 2 TB/s, 512+ GB Node Cabinet 0: 6.3 PF, 128 TB Cabinet 176

2013 20PF 18,000 GPUs 10MW 2 GFLOPs/W ~10 7 Threads CORAL 150-300PF (5-10x) 11MW (1.

31 PF 18,000 GPUs 10MW 2 GFLOPs/W ~10 7 Threads CORAL PF (5-10x) 11MW (1.1x) GFLOPs/W (7-14x) Lots of Threads You Are Here ,000PF (50x) 72,000HCNs (4x) 20MW (2x) 50 GFLOPs/W (25x) ~10 10 Threads (1000x) 31

32 32

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford

Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable