Dataflow: The Road Less Complex

Size: px

Start display at page:

Download "Dataflow: The Road Less Complex"

Kory Montgomery
5 years ago
Views:

1 Dataflow: The Road Less Complex Steven Swanson Ken Michelson Andrew Schwerin Mark Oskin University of Washington Sponsored by NSF and Intel

2 Things to keep you up at night (~2016) Opportunities 8 billion transistors; 28Ghz One DRAM chip will exhaust a 32bit address space 120 P4s OR 200,000 RISC-1 will fit on a die. Challenges It will take 36 cycles to cross the die. For reasonable yields, only 1 transistor in 24 billion may be broken, if one flaw breaks a chip. 7yrs and people Chips are networks Fault tolerance is required Simpler designs Better tools 2

3 Outline Monolithic von Neumann processing WaveScalar Results Future work and Conclusions 3

4 Monolithic Processing 4 Von Neumann is simple. We know how to build them. 2016? Communication Fault tolerance Complexity Performance

5 Decentralized Processing Communication Fault tolerance Complexity Performance 5

6 The Problem with Von Neumann Fundamentally centralized. Fetch is the key. There is only one program counter. There is no parallelism in the model. The alternative is dataflow 6

7 Dataflow has been done before.. Dataflow is not new Operations fire when data is available No program counter No false control dependencies Exposes massive parallelism But... 7

8 ...it had issues It never executed mainstream code Special languages No mutable data structures No aliasing Functional Strange memory semantics There are scalability concerns Large slow token stores 8

9 The WaveScalar Model WaveScalar is memory-centric Dataflow Compared to Von Neumann There is no fetch. Compared to traditional dataflow. Memory ordering is a first-class citizen. Normal memory semantics. No I-structures or special languages. We run Spec. 9

10 What is a wave? Maximal loop-free sections of the dataflow graph. May contain branches and joins. They are bigger than hyperblocks. Each dynamic wave has a Wave number. Every value has a wave number. 10

11 Maintaining Memory Order Loads and stores can issue requests to memory in any order: Wave number. Operation sequence number. Ordering information (predecessor and successor sequence numbers). The memory systems reconstructs the correct order. Wave number+sequence numbers provide a total order Your favorite speculative memory system. Or a store buffer. 11

12 WaveScalar benefits Expose everything about the program Data dependencies Memory order Instructions manipulate wave numbers. Multiple, parallel sequences of operations are possible. Synchronization Concurrency Communication 12

13 The WaveCache The I-Cache I is the processor. L2 Cache 13

14 WaveCache Processing PE Cluster Domain Element Long distance communication FLOW CONTROL Dynamic routing Grid-based INPUTS network 1-2 cycle/domain. DECODE Traditional D$ + Store FU Buffer cache coherence. CONFIG. LOGIC Normal OUTPUTS memory hierarchy. FLOW CONTROL 16K instructions. L2 Cache 14

15 Current results Compiled SPEC/mediabench DEC cc compiler (-O4 -unroll 16) Binary translator/compiler From Alpha AXP to WaveScalar Timing/execution-based simulation. Results in alpha instructions per cycle (AIPC) 15

16 Comparison architectures Superscalar 16-wide, 16 ported cache, 1024 issue window, 1024 regs, gshare branch predictor 15 stage pipeline. Perfect cache. WaveCache ~2000 processing elements 16 elements/domain Perfect cache. 16

17 WaveScalar vs Superscalar x faster Not counting clock rate improvements. AIPC 3 WaveScalar superscalar vpr tw olf mcf equake art adpcm mpeg fft 17

18 Cache replacement Not all the instructions will fit. WaveCache miss Destination instruction is not present Evict/Load an instruction (flush/load queues) Instructions volunteer for removal Location is important Normal hashing won t work 18

19 Cache size Normalized performance vpr tw olf mcf equake art adpcm mpeg fft Thrashing is dangerous Dynamic mapping is a big win Cache size (log) 19

20 Speculation Speculation helps AIPC WaveScalar Perfect branch Perfect mem. Disambig Both 2.4x on average for both 3 2 This is gravy!! 1 0 vpr tw olf mcf equake art adpcm mpeg fft 20

21 Future work Hardware implementation A la the Bathysphere Compiler issues Memory parallelism More than von Neumann emulation Vector Streaming WaveScalar is an ISA for writing architectures. Operating system issues What is a context switch? What is a system call? 21

22 Future work Online placement optimization Simulated annealing Defect tolerance Hard and soft faults WaveCache as a computer system. WaveScalar everything (graphics, IO, CPU, Keyboard, hard drive) Uniform namespace for a computer. Adaptation at load time. 22

23 Conclusion Decentralized computing will let you rest easy in 2016! WaveScalar and the WaveCache Dataflow with normal memory!! Outperforms an OOO superscalar by 2.8x Feasible now and in 2016 Enormous opportunities for future research 23

WaveScalar. Winter 2006 CSE WaveScalar 1

WaveScalar. Winter 2006 CSE WaveScalar 1 WaveScalar Dataflow machine good at exploiting ILP dataflow parallelism traditional coarser-grain parallelism cheap thread management memory ordering enforced through wave-ordered memory Winter 2006 CSE