
1 EE382N (20): Computer Architecture - Parallelism and Locality. Lecture 4: Parallelism in Hardware. Mattan Erez, The University of Texas at Austin

2 Outline: Principles of parallel execution; Pipelining; Parallel HW (multiple ALUs)

3 Parallel Execution: Concurrency - what are the multiple resources? Communication and storage. Synchronization - implicit or explicit? What is being shared? What is being partitioned?

4 Parallelism - Circuits: Circuits operate concurrently (always); they communicate through wires and registers and synchronize by clock signals (including async), by design (wave-pipelined), or more?

5 Parallelism - Components: Collections of circuits - memories, branch predictors, caches, schedulers (this also includes ALUs, memory channels, and cores, but those are discussed later). Components operate concurrently, communicate by wires, registers, and memory, synchronize with clocks and signals, and are often pipelined.

6 Reminders: This is not a microarchitecture class; we will discuss the microarchitecture of various stream processors, but details are deferred to later in the semester. This class is also not a replacement for a Parallel Computer Architecture class; we only superficially cover many details of parallel architectures, and focus on parallelism and locality at the same time.

7 Outline: Cache-oblivious algorithms sub-talk; Principles of parallel execution; Pipelining; Parallel HW (multiple ALUs), analyzed by shared resources and by synchronization/communication mechanisms; ILP, DLP, and TLP organizations

8 Simplified view of a processor: fetch, decode, dispatch (issue), reg. access, execute, write-back, commit, backed by the memory hierarchy

9 Simplified view of a pipelined processor: the same stages - fetch, decode, dispatch (issue), reg. access, execute, write-back, commit - now overlapped in time, backed by the memory hierarchy

10-13 Simplified view of a pipelined processor [pipeline timing diagram]: the independent sequence 1: add r4, r1, r2; 2: add r5, r1, r3; 3: add r6, r2, r3 flows through the stages F D I R E E E W C, with a new instruction entering fetch each cycle so the instructions overlap. What are the parallel resources?
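A minimal Python sketch of the diagram above, added here for illustration (the stage letters F D I R E E E W C and the one-instruction-per-cycle fetch rate are taken from the slide; everything else is assumed): it prints the staircase of overlapped instructions for the independent add sequence.

    # Toy pipeline chart: each instruction enters fetch one cycle after the
    # previous one and advances one stage per cycle (no stalls, no dependences).
    STAGES = ["F", "D", "I", "R", "E", "E", "E", "W", "C"]

    def timing_chart(instructions):
        for start, name in enumerate(instructions):
            row = ["."] * start + STAGES
            print(f"{name:20s}" + " ".join(row))

    timing_chart(["1: add r4, r1, r2", "2: add r5, r1, r3", "3: add r6, r2, r3"])

Each printed row is shifted right by one cycle relative to the row above it, which is exactly the overlap the slide is pointing at.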

14-15 Simplified view of a pipelined processor [pipeline timing diagram]: the dependent sequence 1: add r4, r1, r2; 2: add r5, r1, r4; 3: add r6, r5, r3 goes through the same stages, but now instruction 2 needs r4 from instruction 1 and instruction 3 needs r5 from instruction 2.

16 Simplified view of a pipelined processor [pipeline timing diagram, independent sequence as above]. Communication and synchronization mechanisms?

17 Simplified view of a pipelined processor [pipeline timing diagram]: 1: add r4, r1, r2; 2: ld r5, r4; 3: add r6, r5, r3; 4: add r7, r1, r3. The load spends many cycles in the memory stage (L) before write-back, so dependent instruction 3 (and, in an in-order pipeline, the independent instruction 4 behind it) is delayed.

18 Simplified view of an OOO pipelined processor: fetch, decode, rename, dispatch (issue), scheduler, issue (dispatch), reg. access, execute, write-back, commit. Same sequence: 1: add r4, r1, r2; 2: ld r5, r4; 3: add r6, r5, r3; 4: add r7, r1, r3. Communication and synchronization mechanisms?
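To make the contrast between the in-order pipeline of slide 17 and the OOO pipeline of slide 18 concrete, here is a small sketch (my own illustration; the 3-cycle add latency and 9-cycle load latency are assumptions chosen to match the shape of the diagrams). It computes when each instruction can begin executing under the two issue policies.

    # Each instruction: (text, destination register, source registers, latency).
    insts = [
        ("1: add r4, r1, r2", "r4", ["r1", "r2"], 3),
        ("2: ld  r5, r4",     "r5", ["r4"],       9),
        ("3: add r6, r5, r3", "r6", ["r5", "r3"], 3),
        ("4: add r7, r1, r3", "r7", ["r1", "r3"], 3),
    ]

    def schedule(in_order):
        ready = {}            # register -> cycle its value becomes available
        last_issue = -1
        for text, dst, srcs, lat in insts:
            start = max([ready.get(s, 0) for s in srcs] + [0])
            if in_order:
                # an in-order machine cannot issue past a stalled older instruction
                start = max(start, last_issue + 1)
            last_issue = start
            ready[dst] = start + lat
            print(f"  {text:20s} issues at cycle {start:2d}")

    print("in-order issue:")
    schedule(True)
    print("out-of-order issue:")
    schedule(False)

Under the out-of-order policy, instruction 4, which does not depend on the load, no longer waits behind it.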

19 Pipelining Summary: Pipelining is using parallelism to hide latency - do useful work while waiting for other work to finish. It uses multiple parallel components, not multiple instances of the same component. Examples: execution pipelines; memory pipelines (issue multiple requests to memory without waiting for previous requests to complete); software pipelines (overlap different software blocks to hide latency, e.g., computation/communication).
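A minimal sketch of the software-pipelines bullet (my own example, not from the slides): while the current block is being processed, a helper thread fetches the next block, so the fetch latency hides behind the computation.

    import threading

    def fetch(block_id):          # stands in for a long-latency memory/communication op
        return [block_id * 100 + i for i in range(4)]

    def compute(block):
        return sum(block)

    def pipelined(num_blocks):
        results, current = [], fetch(0)
        for b in range(num_blocks):
            nxt, t = {}, None
            if b + 1 < num_blocks:
                # start fetching the next block before computing on the current one
                t = threading.Thread(target=lambda b=b: nxt.update(data=fetch(b + 1)))
                t.start()
            results.append(compute(current))   # overlapped with the fetch above
            if t is not None:
                t.join()
                current = nxt["data"]
        return results

    print(pipelined(3))   # [6, 406, 806]

The same double-buffering pattern shows up whether the "fetch" is a DMA transfer, a prefetch, or a message from another node.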

20 Resources in a parallel processor/system: Execution - ALUs, cores/processors; Control - sequencers, instructions, OOO schedulers; State - registers, memories; Networks.

21 Communication and synchronization: Synchronization - clock plus explicit compiler ordering, explicit signals (e.g., dependences), or implicit signals (e.g., flush/stall); this matters more for pipelining than for multiple ALUs. Communication - bypass networks, registers, memory, or explicit messages (over some network).

22 Organizations for ILP (for multiple ALUs)

23 Superscalar (ILP for multiple ALUs) [diagram: one scheduler issuing to multiple ALUs over the memory hierarchy]. How many ALUs?

24 SMT/TLS (ILP for multiple ALUs)

25 SMT/TLS (ILP for multiple ALUs) [diagram: scheduler over the memory hierarchy]. Why is this ILP? How many threads?
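One way to see why SMT still targets the ILP resources of a single core (a sketch I am adding, with made-up instruction streams and a made-up 4-cycle miss penalty): each cycle the shared issue logic picks an instruction from any thread that is not stalled, so one thread's long-latency miss does not leave the ALUs idle.

    # Toy SMT issue loop: one issue slot per cycle, shared by two threads.
    threads = {
        "T0": ["add", "ld (cache miss)", "add", "add"],
        "T1": ["mul", "add", "mul", "add"],
    }
    pc = {t: 0 for t in threads}
    stalled_until = {t: 0 for t in threads}

    cycle = 0
    while any(pc[t] < len(ops) for t, ops in threads.items()):
        for t, ops in threads.items():
            if pc[t] < len(ops) and cycle >= stalled_until[t]:
                op = ops[pc[t]]
                print(f"cycle {cycle}: {t} issues {op}")
                if "miss" in op:
                    stalled_until[t] = cycle + 4   # simplification: this thread waits for the miss
                pc[t] += 1
                break                              # only one issue slot in this sketch
        cycle += 1

While T0 waits on its miss, T1's instructions fill the issue slots that would otherwise be wasted.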

26 VLIW (ILP for multiple ALUs)

27 VLIW (ILP for multiple ALUs) [diagram: multiple ALUs over the memory hierarchy]. How many ALUs?
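Because a VLIW machine relies on the compiler rather than hardware scheduling (the clock + compiler entry in the summary table below), the compiler must pack independent operations into each wide instruction itself. A minimal, hypothetical sketch of that packing for a 2-wide machine (single-cycle operation latency assumed; the instruction sequence is made up):

    # Greedy packing of independent ops into 2-wide VLIW bundles.
    # Each op: (text, destination, sources).
    ops = [
        ("add r4, r1, r2", "r4", ["r1", "r2"]),
        ("add r5, r1, r3", "r5", ["r1", "r3"]),
        ("add r6, r4, r5", "r6", ["r4", "r5"]),
        ("add r7, r2, r3", "r7", ["r2", "r3"]),
    ]
    produced_by = {dst for _, dst, _ in ops}

    def is_ready(op, done):
        # a source is ready if it is a live-in value or was computed in an earlier bundle
        return all(src not in produced_by or src in done for src in op[2])

    pending, done, cycle = list(ops), set(), 0
    while pending:
        bundle = [op for op in pending if is_ready(op, done)][:2]   # issue width = 2
        print(f"cycle {cycle}: " + "  ||  ".join(op[0] for op in bundle))
        done |= {op[1] for op in bundle}
        pending = [op for op in pending if op not in bundle]
        cycle += 1

The hardware then simply executes one bundle per cycle; all the dependence checking happened at compile time.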

28 Explicit Dataflow (ILP for multiple ALUs)

29 Explicit Dataflow (ILP for multiple ALUs) [diagram: a scheduler per ALU, over the memory hierarchy]
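In the explicit-dataflow organization, an operation fires when its input tokens arrive rather than when a program counter reaches it; the per-operation schedulers in the diagram implement exactly that firing rule. A small sketch of the rule (my own illustration; the operation graph and values are made up):

    import operator

    # Each node: (function, names of its inputs). No program counter anywhere:
    # a node fires as soon as all of its inputs carry a token (a value).
    graph = {
        "t1": (operator.add, ["a", "b"]),
        "t2": (operator.mul, ["a", "c"]),
        "t3": (operator.sub, ["t1", "t2"]),
    }
    tokens = {"a": 4, "b": 5, "c": 2}      # live-in values

    fired = set()
    while len(fired) < len(graph):
        for name, (fn, srcs) in graph.items():
            if name not in fired and all(s in tokens for s in srcs):
                tokens[name] = fn(*(tokens[s] for s in srcs))
                fired.add(name)
                print(f"{name} fires with result {tokens[name]}")

Here t1 and t2 can fire in either order (or in parallel); only t3 must wait, because its tokens are produced by the other two.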

30 DLP for multiple ALUs - from the HW perspective this is SIMD

31 SIMD (DLP for multiple ALUs) [diagram: a local memory per ALU, over the memory hierarchy]

32 Vectors (DLP for multiple ALUs) [diagram: a local memory per ALU, over the memory hierarchy]. Vectors: memory addresses are part of the single instruction and not part of the multiple data.
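A sketch of the distinction called out on this slide (added for illustration; the function names and 4-element length are my own): in the vector style, a single instruction carries base addresses, a length, and a stride, and the hardware generates every element address; in the SIMD style, the instruction names no addresses, and each lane combines operands that already sit in its local memory.

    memory = list(range(100))     # a flat shared memory

    def vector_add(base_a, base_b, base_dst, length, stride=1):
        # vector style: addresses are part of the (single) instruction
        for i in range(length):
            memory[base_dst + i * stride] = (memory[base_a + i * stride]
                                             + memory[base_b + i * stride])

    def simd_add(lane_a, lane_b):
        # SIMD style: no addresses, each lane adds its own local operands
        return [a + b for a, b in zip(lane_a, lane_b)]

    vector_add(base_a=0, base_b=10, base_dst=20, length=4)
    print(memory[20:24])                                # [10, 12, 14, 16]
    print(simd_add([0, 1, 2, 3], [10, 11, 12, 13]))     # [10, 12, 14, 16]

Both produce the same result; the difference is where the addressing work lives.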

33 TLP for multiple ALUs

34 MIMD shared memory (TLP for multiple ALUs) [diagram: a scheduler per ALU, all sharing one memory hierarchy]

35 MIMD distributed memory [diagram: a scheduler and a private memory hierarchy per ALU]
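To contrast the two MIMD organizations on slides 34 and 35, a small sketch of both (my own example; the data and chunk sizes are made up): in the shared-memory version, threads communicate through a common variable and synchronize with a lock, while in the distributed-memory version each worker sees only its own chunk and sends its result as an explicit message (a queue stands in for the network).

    import queue
    import threading

    data = [list(range(i * 4, i * 4 + 4)) for i in range(4)]   # four data chunks

    # Shared-memory MIMD: communicate through memory, synchronize with a lock.
    total = 0
    lock = threading.Lock()

    def shared_worker(chunk):
        global total
        partial = sum(chunk)
        with lock:
            total += partial

    workers = [threading.Thread(target=shared_worker, args=(c,)) for c in data]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("shared-memory result:", total)                      # 120

    # Distributed-memory MIMD: no shared variable; results travel as messages.
    outbox = queue.Queue()

    def distributed_worker(chunk, out):
        out.put(sum(chunk))

    workers = [threading.Thread(target=distributed_worker, args=(c, outbox)) for c in data]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("distributed-memory result:", sum(outbox.get() for _ in data))   # 120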

36 Summary of communication and synchronization
Style        Synchronization          Communication
Superscalar  explicit signals (RS)    registers + bypass
VLIW         clock + compiler         registers (bypass?)
Dataflow     explicit signals         registers + explicit
SIMD         clock + compiler         explicit
MIMD         explicit signals         memory + explicit

37 Summary of sharing in ILP HW
Style        Seq  Inst  OOO  Regs  Mem  ALUs  Net
Superscalar  S    P     S    S     S    S     S
SMT/TLS      P    P     S    S     S    S     S
VLIW         S    P     N/A  S     S    P     S
Dataflow     B    P     P    B     S    P     S

38 Summary of sharing in DLP and TLP
Style   Seq  Inst  OOO  Regs  Mem  ALUs  Net
Vector  S    S     N/A  P     s    P     B
SIMD    S    S     N/A  P     p    P     B
MIMD    P    P     P    P     S/P  P     B
