Roadmap and Change. Presentation to 2004 Workshop on Extreme Supercomputing Panel: How Much and How Fast. Thomas Sterling


1 Presentation to 2004 Workshop on Extreme Supercomputing
Panel: Roadmap and Change
How Much and How Fast
Thomas Sterling
California Institute of Technology and NASA Jet Propulsion Laboratory
October 12, 2004
October 12, 2004 Thomas Sterling - Caltech & JPL

2 29 Years Ago Today

3 [Figure: Linpack performance chart titled "Linpack Zettaflops in" (rest of title lost in extraction). Vertical axis: performance on a log scale from 100 Mflops to Zflops. Series labels include SUM and N=1. Courtesy of Thomas Sterling.]


5 The Way We Were: 1974
- IBM 370: market mainstream, approx. 1 Mflops
- DEC PDP-11: geeks' delight
- Seymour Cray started working on Cray-1, approx. 100 Mflops
- 2nd-generation microprocessor, e.g. Intel 8008
- Core memory
- Kx1 DRAM chips
- Punch cards, paper tapes, teletypes, Selectrics


6 What Will Be Different
- Moore's Law will have flatlined
- Nano-scale, atomic-level devices
  - Assuming we solve the lithography problem
- Local clock rates ~100 GHz
  - Fastest today is > 700 GHz
- Local actions strongly preferential to global actions
- Non-conventional technologies may be employed
  - Optical
  - Quantum dots
  - Rapid Single Flux Quantum (RSFQ) gates
[Figure: circuit diagram; labels JJ1, JJ2]


7 What We Will Need
- 1 nanowatt per Megaflops
  - Energy received from Tau Ceti (per m^2)
- Approximately 1 square meter for 1 Zettaflops ALUs
- 10 billion execution sites
  - > 10 billion-way parallelism
- Including memory and communications: 2000 m^2
  - 3-D packaging: (4 m)^3
- Global latency of ~10,000 cycles
- Including average latency => 1 trillion-way parallelism
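The numbers on this slide can be cross-checked with a few lines of arithmetic. The sketch below is not the author's calculation. It assumes one floating-point operation per cycle per execution site and uses the ~100 GHz local clock rate from the "What Will Be Different" slide. The function names are illustrative. Dividing the slide's 1 trillion-way parallelism by its 10 billion sites gives roughly 100 operations in flight per site; that factor is inferred here and is not stated on the slide.

```python
# Back-of-envelope check of the slide's figures (a sketch under stated assumptions).

TARGET_FLOPS = 10**21          # 1 Zettaflops
CLOCK_HZ = 100 * 10**9         # ~100 GHz local clock (from the earlier slide)
WATTS_PER_MFLOPS = 1e-9        # 1 nanowatt per Megaflops


def execution_sites(target_flops, clock_hz, flops_per_cycle=1):
    """Sites needed if each retires `flops_per_cycle` operations per clock (assumption)."""
    return target_flops // (clock_hz * flops_per_cycle)


def total_power_watts(target_flops, watts_per_mflops):
    """Total power at the stated energy budget."""
    return (target_flops / 1e6) * watts_per_mflops


def required_parallelism(sites, avg_latency_cycles):
    """Operations that must be in flight to keep every site busy (Little's law)."""
    return sites * avg_latency_cycles


sites = execution_sites(TARGET_FLOPS, CLOCK_HZ)            # 10 billion
power = total_power_watts(TARGET_FLOPS, WATTS_PER_MFLOPS)  # about 1 megawatt
implied_avg_latency = 10**12 // sites                      # cycles implied by 1 trillion-way

print(f"execution sites:      {sites:,}")
print(f"total power:          {power:,.0f} W")
print(f"implied avg latency:  {implied_avg_latency} cycles")
```

Under these assumptions the 10 billion execution sites follow directly from the clock rate, and the 1 nanowatt per Megaflops budget corresponds to about one megawatt for the whole machine.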

8 Parcel Simulation: Latency Hiding Experiment
[Figure: diagram of nodes on a flat network. Labels: Control Experiment, Test Experiment, Process Driven Node, Parcel Driven Node, Local Memory, Remote Memory Requests, Input Parcels, Output Parcels.]

9 Latency Hiding with Parcels with respect to System Diameter in cycles
[Figure: Sensitivity to Remote Latency and Remote Access Fraction, 16 nodes. Vertical axis: total transactional work done / total process work done. Horizontal axis: remote memory latency (cycles). Curves for remote access fractions including 1/2%, 1%, 2% and 4%. Degree of parallelism (pending at t=0 per node) shown in red.]

10 Latency Hiding with Parcels: Idle Time with respect to Degree of Parallelism
[Figure: Idle time per node (cycles) versus parallelism level (parcels per node at time = 0). Series: Process, Transaction. Number of nodes shown in black.]
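The relationship these two plots explore, idle time falling as more parcels are pending per node, can be illustrated with a simple saturating model. This is a textbook-style approximation, not the simulator used for the experiment. The function names, and the assumption that every parcel alternates a fixed burst of local work with one remote request of fixed latency, are illustrative only.

```python
# Toy latency-hiding model (a sketch; not the simulation behind the slides).
# Assumption: each parcel does `work_cycles` of local work, then waits
# `remote_latency_cycles` for a remote request before it can run again.


def utilization(parcels_per_node, work_cycles, remote_latency_cycles):
    """Fraction of cycles a node spends on useful work.

    With P parcels the node can overlap waiting with the work of the others,
    so utilization grows linearly in P until it saturates at 1.
    """
    if parcels_per_node < 1 or work_cycles <= 0 or remote_latency_cycles < 0:
        raise ValueError("need P >= 1, work > 0, latency >= 0")
    round_trip = work_cycles + remote_latency_cycles
    return min(1.0, parcels_per_node * work_cycles / round_trip)


def idle_fraction(parcels_per_node, work_cycles, remote_latency_cycles):
    """Fraction of cycles a node sits idle waiting on remote requests."""
    return 1.0 - utilization(parcels_per_node, work_cycles, remote_latency_cycles)


def parcels_to_hide(work_cycles, remote_latency_cycles):
    """Smallest number of parcels per node that fully hides the latency."""
    total = work_cycles + remote_latency_cycles
    return -(-total // work_cycles)  # ceiling division


if __name__ == "__main__":
    for p in (1, 2, 4, 8, 16):
        print(p, round(idle_fraction(p, 100, 900), 3))
```

In this model a single blocking process (one parcel per node) is idle for most of each round trip, while enough pending parcels drive idle time to zero.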

11 Architecture Innovation
- Extreme memory bandwidth
- Active latency hiding
- Extreme parallelism
- Message-driven split-transaction computations (parcels)
- PIM (e.g. Kogge, Draper, Sterling, ...)
  - Very high memory bandwidth
  - Lower memory latency (on chip)
  - Higher execution parallelism (banks and row-wide)
- Streaming (Dally, Keckler, ...)
  - Very high functional parallelism
  - Low latency (between functional units)
  - Higher execution parallelism (high density)

12 Continuum Computer Architecture
- Merges state, logic, and communication in a single building block
- Parcel-driven computation
  - Fine-grain split-transaction computing
  - Move data through vectors of instructions in store
  - Move instruction stream through vector of data
  - Gather-scatter an intrinsic
  - Very efficient
  - Futures for producer/multi-consumer computing
- Combines strengths of PIM and Streaming
- All-register architecture (fully associative)
- Functional units within a cycle of neighbors
- Extreme parallelism
- Intrinsic latency hiding
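As a rough illustration of parcel-driven, split-transaction computing, the sketch below moves small messages (parcels) to the node that owns the data. That node performs the action locally and, for a read, sends a reply parcel to a continuation address instead of making the requester block. The `Parcel` fields, the action names, and the `Node` class are hypothetical, chosen for this example. They do not describe the Continuum Computer Architecture's actual parcel format.

```python
# Minimal parcel-driven sketch (hypothetical names; not the real parcel format).
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Parcel:
    dest_node: int                              # node that owns the target data
    action: str                                 # "add", "read", or "store"
    address: str                                # variable name on the destination
    operand: int = 0
    reply_to: Optional[Tuple[int, str]] = None  # continuation: (node, address)


class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.memory = {}
        self.inbox = deque()

    def step(self, network):
        """Handle one pending parcel; the node never blocks on a remote reply."""
        if not self.inbox:
            return False
        p = self.inbox.popleft()
        if p.action == "add":
            self.memory[p.address] = self.memory.get(p.address, 0) + p.operand
        elif p.action == "read":
            # Second half of the split transaction: reply with a new parcel.
            node, addr = p.reply_to
            network[node].inbox.append(
                Parcel(node, "store", addr, self.memory[p.address]))
        elif p.action == "store":
            self.memory[p.address] = p.operand
        else:
            raise ValueError(f"unknown action: {p.action}")
        return True


def run(network):
    """Step every node each round until no parcels remain anywhere."""
    while any([node.step(network) for node in network]):
        pass


def demo():
    network = [Node(0), Node(1)]
    network[1].memory["x"] = 41
    # The work travels to the data: increment x where it lives, then read it back.
    network[1].inbox.append(Parcel(1, "add", "x", 1))
    network[1].inbox.append(Parcel(1, "read", "x", reply_to=(0, "x_copy")))
    run(network)
    return network[0].memory["x_copy"]
```

Here node 0 never waits for the read to complete. The value simply arrives later as a "store" parcel, which is the split-transaction behavior the slide refers to.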

13 [Figure: block diagram; labels: Inst. Reg., Control, Assoc. Memory]

14 Conclusions
- Zettaflops at nano-scale technology is possible
  - Size requirements tolerable
  - But packaging is a challenge
  - Latency challenge does not sink the idea
- Major obstacles
  - Power
  - Latency
  - Parallelism
  - Reliability
  - Programming
- Architecture can address many of these
- Continuum Computing Architecture
  - Combines advantages of PIM and streaming
  - Strong candidate for future Zettaflops computer

