Computer Architecture: Multiple Issue. Berk Sunar and Thomas Eisenbarth ECE 505


1 Computer Architecture: Multiple Issue Berk Sunar and Thomas Eisenbarth ECE 505

2 Outline
5 stages of RISC
Type of hazards
Static and Dynamic Branch Prediction
Pipelining with Exceptions
Pipelining with Floating-Point Operations
Loop Unrolling
Correlating and Tournament Branch Prediction
Dynamic Scheduling: Scoreboard
Dynamic Scheduling: Tomasulo
Hardware-Based Speculation
Multiple Issue: VLIW (3.7)
Multiple Issue: Superscalar, speculative (3.8)
Branch Target Buffer: Principles (3.9)

3 Multiple Issue Processors
Data and control stalls are eliminated with dynamic scheduling and speculation, and performance gets close to 1 IPC. To go beyond 1 IPC, more than 1 instruction must be issued (and completed) per cycle. Three major flavors of multiple issue:
1. Statically scheduled superscalar processors
2. VLIW (very long instruction word) processors
3. Dynamically scheduled superscalar processors

4 Superscalar Processors
Issue a varying number of instructions per clock, either in-order (statically scheduled) or out-of-order (dynamically scheduled).
VLIW Processors
Issue a fixed number of instructions, formatted as one large instruction with the parallelism explicitly indicated by the instruction. Inherently statically scheduled by the compiler. High similarity to superscalar!

5 Overview of Multiple Issue Processors
Common name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristic | Examples
Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Mostly embedded: MIPS, ARM (e.g. Cortex-A8)
Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution, but no speculation | None so far
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Out-of-order execution with speculation | Intel Core ix, AMD Phenom, IBM POWER7
VLIW/LIW | Static | Primarily software | Static | All hazards determined by compiler (often implicitly) | Mostly signal processing, e.g. TI C6x
EPIC | Primarily static | Primarily software | Mostly static | All hazards determined and indicated explicitly by compiler | Itanium

6 Basic concept of VLIW
VLIW uses multiple independent functional units (as a superscalar does), but VLIW packages the operations for all FUs into one very large instruction. Overheads grow with the amount of parallelism. Example: a VLIW with 1x integer FU, 2x load/store FU and 2x FP FU, at 16 to 24 opcode bits per FU: an 80 to 120 bit instruction word.
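The 80-to-120-bit figure is just five per-FU opcode fields concatenated. A minimal sketch of that packing, with purely illustrative field widths (real VLIW encodings differ per machine):

```python
# Hypothetical packing: five 24-bit operation slots concatenated into one
# 120-bit VLIW instruction word. SLOT_BITS and the slot order are
# illustrative assumptions, not a real encoding.
SLOT_BITS = 24

def pack_vliw(slots):
    """Pack per-FU opcodes (ints < 2**SLOT_BITS) into one wide word."""
    word = 0
    for op in slots:
        assert 0 <= op < 1 << SLOT_BITS
        word = (word << SLOT_BITS) | op
    return word

# int FU + 2x load/store FU + 2x FP FU -> 5 slots, 120 bits total
word = pack_vliw([0x1, 0x2, 0x3, 0x4, 0x5])
print(word.bit_length() <= 5 * SLOT_BITS)  # True
```

An empty slot still consumes its full field, which is the root of the code-size disadvantage discussed below.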

7 VLIW: Example
Program with loop (same as last week):
Loop: L.D F0,0(R1)
      MUL.D F4,F0,F2
      S.D F4,0(R1)
      DADDIU R1,R1,#-8
      BNE R1,R2,Loop
VLIW requires heavy unrolling to be efficient. Q: How many unrolls to prevent stalls? Assuming these latencies:
Source instruction | Destination instruction | Latency (cycles)
FP ALU op | FP ALU op | 3
FP ALU op | Store double | 2
Load double | FP ALU op | 1
Load double | Store double | 0
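The latency table determines how far apart dependent instructions must issue: with a latency of l stall cycles, a consumer can issue no earlier than l + 1 cycles after its producer. A small sketch of that arithmetic for one iteration's dependence chain (the encoding of operation kinds is ours, not the book's):

```python
# Latency table from the slide: extra stall cycles between a producer
# and a dependent consumer. Keys are (producer_kind, consumer_kind).
latency = {
    ("FP_ALU", "FP_ALU"): 3,
    ("FP_ALU", "STORE"): 2,
    ("LOAD", "FP_ALU"): 1,
    ("LOAD", "STORE"): 0,
}

def consumer_cycle(producer_cycle, producer_kind, consumer_kind):
    """Earliest issue cycle for a dependent instruction."""
    return producer_cycle + 1 + latency[(producer_kind, consumer_kind)]

ld = 1                                       # L.D issues in cycle 1
fp = consumer_cycle(ld, "LOAD", "FP_ALU")    # FP op: cycle 3
sd = consumer_cycle(fp, "FP_ALU", "STORE")   # S.D: cycle 6
print(ld, fp, sd)  # 1 3 6
```

One iteration's chain thus spans 6 cycles while occupying only 3 slots, so heavy unrolling is needed to fill the remaining VLIW slots.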

8 VLIW: How many unrolls?
Slots per VLIW instruction: Memory unit 1 | Memory unit 2 | FP unit 1 | FP unit 2 | Int unit
Instructions to schedule: L.D F0,0(R1); ADD.D F4,F0,F2; S.D F4,0(R1); DADDUI R1,R1,#-56; BNE R1,R2,Loop

9 VLIW: Book proposes 7 unrolls
Cycle | Memory unit 1 | Memory unit 2 | FP unit 1 | FP unit 2 | Int unit
1 | L.D F0,0(R1) | L.D F6,-8(R1) | | |
2 | L.D F10,-16(R1) | L.D F14,-24(R1) | | |
3 | L.D F18,-32(R1) | L.D F22,-40(R1) | ADD.D F4,F0,F2 | ADD.D F8,F6,F2 |
4 | L.D F26,-48(R1) | | ADD.D F12,F10,F2 | ADD.D F16,F14,F2 |
5 | | | ADD.D F20,F18,F2 | ADD.D F24,F22,F2 |
6 | S.D F4,0(R1) | S.D F8,-8(R1) | ADD.D F28,F26,F2 | |
7 | S.D F12,-16(R1) | S.D F16,-24(R1) | | |
8 | S.D F20,+24(R1) | S.D F24,+16(R1) | | | DADDUI R1,R1,#-56
9 | S.D F28,+8(R1) | | | | BNE R1,R2,Loop
23 instructions in 9 cycles: 2.5 IPC
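The IPC claim is easy to sanity-check: 7 unrolled iterations contribute a load, an FP add, and a store each, plus the shared loop overhead.

```python
# Quick check of the slide's IPC figure for the 7-times-unrolled schedule.
loads, adds, stores, overhead = 7, 7, 7, 2   # overhead = DADDUI + BNE
instructions = loads + adds + stores + overhead
cycles = 9
print(instructions, instructions / cycles)   # 23 instructions, ~2.56 IPC
```

So roughly 2.5 IPC, against a theoretical peak of 5 for this five-slot machine: barely half the slots are filled even with heavy unrolling.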

10 VLIW: Disadvantages
Increase in code size: non-fully filled instruction words result in excess code, and VLIW requires heavy unrolling (further increasing code size).
Operation in lockstep: shared components like caches can cause the whole processor to stall.
Code compatibility: if the number/delay of units varies across processor families, code needs to be recompiled for each machine.

11 Outline
5 stages of RISC
Type of hazards
Static and Dynamic Branch Prediction
Pipelining with Exceptions
Pipelining with Floating-Point Operations
Loop Unrolling
Correlating and Tournament Branch Prediction
Dynamic Scheduling: Scoreboard
Dynamic Scheduling: Tomasulo
Hardware-Based Speculation
Multiple Issue: VLIW (3.7)
Multiple Issue: Superscalar, speculative (3.8)
Branch Target Buffer: Principles (3.9)

12 Putting it all together
Goal: combine multiple issue with dynamic scheduling and speculation.
Example: a Tomasulo machine extended to issue 2 instructions per cycle. Separate load/store unit, integer unit and FP unit, each able to initiate 1 instruction per cycle. Instructions are issued in-order to prevent violations of program semantics.

13 Multiple Issue Tomasulo + Speculation
The architecture stays essentially the same. However, components must be made redundant, e.g. the CDB must broadcast up to 2 (N) results per cycle. Issue and completion logic become more complex.

14 Multiple Issue with Dynamic Scheduling
Problem: instructions scheduled in parallel may depend on each other. Tables have to be updated in parallel, either by pipelining the table updates (issuing logic) or by widening the issue logic (or both). → The issuing step becomes a bottleneck, as its complexity grows with N² (for N IPC). The back-end of the pipeline must be able to complete and commit multiple instructions per clock. This is easier, since dependences were already resolved during issue.
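The N² growth comes from pairwise comparisons: every instruction in an N-wide bundle must be checked against every earlier instruction in the same bundle. A one-line model of that count:

```python
# Number of (earlier, later) instruction pairs to compare within one
# N-wide issue bundle -- the source of the ~N^2 issue-logic growth.
def intra_bundle_checks(n):
    return n * (n - 1) // 2

for n in (1, 2, 4, 8):
    print(n, intra_bundle_checks(n))  # 0, 1, 6, 28 comparisons
```

Each comparison is itself replicated per source operand and per table port, so the hardware cost grows even faster than this raw pair count suggests.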

15 Multiple Issue: Steps
All steps must be performed in one clock cycle:
1. Assign a reservation station and a ROB entry for every instruction that may issue next. (Possible without knowing the instructions, by limiting the number of ROB and RS entries assignable per unit class.) Instructions that cannot be issued (no FU available) are delayed in-order.
2. Analyze the dependences of the instructions in the issued instruction bundle.
3. If an instruction in the bundle depends on an earlier one in the same bundle, use the ROB number to update the reservation-station table of the dependent instruction. Otherwise, proceed just as before.
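Steps 2 and 3 can be sketched in a few lines. This is an illustrative model, not real issue hardware: instructions are reduced to (destination, sources) pairs and ROB tags are hypothetical integers.

```python
# Sketch of bundle issue: pre-assign ROB tags (step 1), then for each
# source operand decide whether it comes from the register file or from
# the ROB entry of an earlier instruction in the same bundle (steps 2-3).
def analyze_bundle(bundle, next_rob_tag):
    plans = []
    writer = {}  # reg -> ROB tag of the latest in-bundle writer
    for i, (dest, srcs) in enumerate(bundle):
        tag = next_rob_tag + i               # step 1: pre-assigned ROB entry
        src_from = {s: writer.get(s, "regfile") for s in srcs}
        plans.append((tag, dest, src_from))
        writer[dest] = tag                   # later bundle members wait on us
    return plans

# LD R2,0(R1) and DADDIU R2,R2,#1 issued as one bundle:
bundle = [("R2", ["R1"]), ("R2", ["R2"])]
print(analyze_bundle(bundle, 10))
# [(10, 'R2', {'R1': 'regfile'}), (11, 'R2', {'R2': 10})]
```

Note the second instruction reads R2 from ROB entry 10 rather than the register file, exactly the intra-bundle forwarding that step 3 describes.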

16 Multiple Issue + Speculation: Example
Example program:
LOOP: LD R2,0(R1)
      DADDIU R2,R2,#1
      SD R2,0(R1)
      DADDIU R1,R1,#8
      BNE R2,R3,LOOP
Q: Multiple issue performance with and without speculation?
Without speculation, the 2nd LD must wait for BNE to execute. With speculation, the 2nd LD can execute as soon as R1 is updated. → Data-dependent branches limit performance without speculation.

17-32 Example: without speculation, cycles 1-19
[Slides 17-32 fill in a timing table (Iteration, Instruction, Issue, Execute, Mem access, Write CDB) cycle by cycle for three iterations of the loop. Two instructions issue per cycle: iteration 1 starts issuing in cycle 1, iteration 2 in cycle 4, iteration 3 in cycle 7. Because each LD must wait for the preceding BNE to execute, the iterations serialize, and the BNE of the third iteration does not execute until cycle 19.]

33-43 Example: with speculation, cycles 4-14
[Slides 33-43 fill in the same timing table, now with an additional Commit column. Nothing changes up to cycle 4. Speculative execution starts in cycle 5: iteration 2's LD executes before its BNE has resolved, which is why a ROB is needed. Executing the next iteration early is OK because of register renaming, and the SD writes memory only on commit. The BNE of the third iteration commits in cycle 14; after the first iteration, the processor commits almost 2 IPC.]

44 Example: with speculation
With speculation, the example code completes after 14 cycles instead of 19 cycles. → An advantage for data-dependent branches. However, speculation can also result in worse performance and much higher power consumption when branches are mispredicted.

45 Outline
5 stages of RISC
Type of hazards
Static and Dynamic Branch Prediction
Pipelining with Exceptions
Pipelining with Floating-Point Operations
Loop Unrolling
Correlating and Tournament Branch Prediction
Dynamic Scheduling: Scoreboard
Dynamic Scheduling: Tomasulo
Hardware-Based Speculation
Multiple Issue: VLIW (3.7)
Multiple Issue: Superscalar, speculative (3.8)
Branch Target Buffer: Principles (3.9)

46 Branch Target Buffer (BTB)
Multiple issue requires high IF bandwidth. Branch prediction predicts the branch outcome only. Even with the best possible prediction of the branch outcome, we still have to wait for the branch target address to be determined before IF can proceed.

47 Branch Target Buffer (BTB)
The BTB decides whether an undecoded instruction is a branch, and if so, predicts the following PC.

48 Branch Target Buffer (BTB)
The BTB decides whether an undecoded instruction is a branch, and if so, predicts the following PC. The BTB only contains info for control instructions (jumps and branches). For all other instructions (and predicted not-taken branches), the next PC is PC+4. How to update the BTB?
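The lookup-by-PC and update-on-resolution behavior can be sketched as a small direct-mapped table. The size, indexing, and eviction policy below are illustrative assumptions, not any particular processor's design:

```python
# Minimal direct-mapped BTB sketch: indexed by PC, tagged with the full
# PC, storing the predicted next PC for taken branches.
ENTRIES = 1024

class BTB:
    def __init__(self):
        self.table = [None] * ENTRIES  # each entry: (tag, predicted_next_pc)

    def _index(self, pc):
        return (pc >> 2) % ENTRIES     # word-aligned PCs

    def predict(self, pc):
        """Prediction for an undecoded fetch: target on hit, PC+4 on miss."""
        entry = self.table[self._index(pc)]
        if entry and entry[0] == pc:
            return entry[1]            # hit: predict taken, fetch from target
        return pc + 4                  # miss: not a (taken) branch

    def update(self, pc, taken, target):
        """After the branch resolves: insert taken branches, evict not-taken."""
        idx = self._index(pc)
        if taken:
            self.table[idx] = (pc, target)
        elif self.table[idx] and self.table[idx][0] == pc:
            self.table[idx] = None

btb = BTB()
print(hex(btb.predict(0x400)))       # 0x404: miss -> PC+4
btb.update(0x400, True, 0x100)
print(hex(btb.predict(0x400)))       # 0x100: hit -> predicted target
```

Evicting not-taken branches, as `update` does here, is one common answer to the slide's "How to update the BTB?" question: only taken branches need an entry, since a miss already predicts fall-through.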

49 BTB explained


More information

Question 1 (5 points) Consider a cache with the following specifications Address space is 1024 words. The memory is word addressable The size of the

Question 1 (5 points) Consider a cache with the following specifications Address space is 1024 words. The memory is word addressable The size of the Question 1 (5 points) Consider a cache with the following specifications Address space is 1024 words. he memory is word addressable he size of the cache is 8 blocks; each block is 4 words (32 words cache).

More information

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise

More information

Chapter 4 The Processor 1. Chapter 4D. The Processor

Chapter 4 The Processor 1. Chapter 4D. The Processor Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes CS433 Midterm Prof Josep Torrellas October 16, 2014 Time: 1 hour + 15 minutes Name: Alias: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your

More information

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Computer Architecture ECE 568 Part 10 Compiler Techniques / VLIW Israel Koren ECE568/Koren Part.10.1 FP Loop Example Add a scalar

More information

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

EITF20: Computer Architecture Part3.2.1: Pipeline - 3 EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 09

More information

DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD

DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,

More information

HY425 Lecture 09: Software to exploit ILP

HY425 Lecture 09: Software to exploit ILP HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 ILP techniques Hardware Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Metodologie di Progettazione Hardware-Software

Metodologie di Progettazione Hardware-Software Metodologie di Progettazione Hardware-Software Advanced Pipelining and Instruction-Level Paralelism Metodologie di Progettazione Hardware/Software LS Ing. Informatica 1 ILP Instruction-level Parallelism

More information

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the

More information

HY425 Lecture 09: Software to exploit ILP

HY425 Lecture 09: Software to exploit ILP HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit ILP 1 / 44 ILP techniques

More information

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture,

More information

Lecture: Pipeline Wrap-Up and Static ILP

Lecture: Pipeline Wrap-Up and Static ILP Lecture: Pipeline Wrap-Up and Static ILP Topics: multi-cycle instructions, precise exceptions, deep pipelines, compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2) 1 Multicycle

More information

Lecture 9: Multiple Issue (Superscalar and VLIW)

Lecture 9: Multiple Issue (Superscalar and VLIW) Lecture 9: Multiple Issue (Superscalar and VLIW) Iakovos Mavroidis Computer Science Department University of Crete Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

More information

COSC4201 Instruction Level Parallelism Dynamic Scheduling

COSC4201 Instruction Level Parallelism Dynamic Scheduling COSC4201 Instruction Level Parallelism Dynamic Scheduling Prof. Mokhtar Aboelaze Parts of these slides are taken from Notes by Prof. David Patterson (UCB) Outline Data dependence and hazards Exposing parallelism

More information

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University,

More information

TDT 4260 TDT ILP Chap 2, App. C

TDT 4260 TDT ILP Chap 2, App. C TDT 4260 ILP Chap 2, App. C Intro Ian Bratt (ianbra@idi.ntnu.no) ntnu no) Instruction level parallelism (ILP) A program is sequence of instructions typically written to be executed one after the other

More information

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson

More information

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly

More information

Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques,

Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques, Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques, ARM Cortex-A53, and Intel Core i7 CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

More information

計算機結構 Chapter 4 Exploiting Instruction-Level Parallelism with Software Approaches

計算機結構 Chapter 4 Exploiting Instruction-Level Parallelism with Software Approaches 4.1 Basic Compiler Techniques for Exposing ILP 計算機結構 Chapter 4 Exploiting Instruction-Level Parallelism with Software Approaches 吳俊興高雄大學資訊工程學系 To avoid a pipeline stall, a dependent instruction must be

More information

Computer Architecture Practical 1 Pipelining

Computer Architecture Practical 1 Pipelining Computer Architecture Issued: Monday 28 January 2008 Due: Friday 15 February 2008 at 4.30pm (at the ITO) This is the first of two practicals for the Computer Architecture module of CS3. Together the practicals

More information

CSE 502 Graduate Computer Architecture

CSE 502 Graduate Computer Architecture Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 CSE 502 Graduate Computer Architecture Lec 15-19 Inst. Lvl. Parallelism Instruction-Level Parallelism and Its Exploitation Larry Wittie

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches

More information

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012 Advanced Computer Architecture CMSC 611 Homework 3 Due in class Oct 17 th, 2012 (Show your work to receive partial credit) 1) For the following code snippet list the data dependencies and rewrite the code

More information

ESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW

ESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW Computer Architecture ESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW 1 Review from Last Lecture Leverage Implicit

More information

Lecture: Static ILP. Topics: predication, speculation (Sections C.5, 3.2)

Lecture: Static ILP. Topics: predication, speculation (Sections C.5, 3.2) Lecture: Static ILP Topics: predication, speculation (Sections C.5, 3.2) 1 Scheduled and Unrolled Loop Loop: L.D F0, 0(R1) L.D F6, -8(R1) L.D F10,-16(R1) L.D F14, -24(R1) ADD.D F4, F0, F2 ADD.D F8, F6,

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

CS433 Homework 3 (Chapter 3)

CS433 Homework 3 (Chapter 3) CS433 Homework 3 (Chapter 3) Assigned on 10/3/2017 Due in class on 10/17/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 14 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 15 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS Homework 2 (Chapter ) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration..

More information

Advanced Computer Architecture

Advanced Computer Architecture ECE 563 Advanced Computer Architecture Fall 2010 Lecture 6: VLIW 563 L06.1 Fall 2010 Little s Law Number of Instructions in the pipeline (parallelism) = Throughput * Latency or N T L Throughput per Cycle

More information

Scoreboard information (3 tables) Four stages of scoreboard control

Scoreboard information (3 tables) Four stages of scoreboard control Scoreboard information (3 tables) Instruction : issued, read operands and started execution (dispatched), completed execution or wrote result, Functional unit (assuming non-pipelined units) busy/not busy

More information

CMSC411 Fall 2013 Midterm 2 Solutions

CMSC411 Fall 2013 Midterm 2 Solutions CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has

More information

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated

More information

/ : Computer Architecture and Design Fall Midterm Exam October 16, Name: ID #:

/ : Computer Architecture and Design Fall Midterm Exam October 16, Name: ID #: 16.482 / 16.561: Computer Architecture and Design Fall 2014 Midterm Exam October 16, 2014 Name: ID #: For this exam, you may use a calculator and two 8.5 x 11 double-sided page of notes. All other electronic

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

CS 2410 Mid term (fall 2018)

CS 2410 Mid term (fall 2018) CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data

More information

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example CS252 Graduate Computer Architecture Lecture 6 Tomasulo, Implicit Register Renaming, Loop-Level Parallelism Extraction Explicit Register Renaming John Kubiatowicz Electrical Engineering and Computer Sciences

More information

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 4A: Instruction Level Parallelism - Static Scheduling Avinash Kodi, kodi@ohio.edu Agenda 2 Dependences RAW, WAR, WAW Static Scheduling Loop-carried Dependence

More information