CS Digital Systems Project Laboratory. Lecture 10: Advanced Processors II

CS 194-6 Digital Systems Project Laboratory Lecture 10: Advanced Processors II 2008-11-24 John Lazzaro (www.cs.berkeley.edu/~lazzaro) Thanks to Krste Asanovic... TA: Greg Gibeling www-inst.eecs.berkeley.edu/~cs194-6/ 1

Beyond static pipelines and single threads Dynamic scheduling. Exception handling. Multi-threading and multi-core. Cache coherency. 2

Dynamic Scheduling 3

Recall: Out of Order Execution Seconds Program Instructions Program Cycles Instruction Seconds Cycle Goal: Issue instructions out of program order Example:... so let ADDD go first!"#$%&'!" #$%& '()*$+ (!" #(% (,)*'+!*%+ -.!/" #0% #(% #$, ADDD 1.2" #3% #$% #$ ( MULTD waiting on F4 to load... 4

Dynamic Scheduling: Enables Out-of-Order Goal: Enable out-of-order by breaking pipeline in two: fetch and execution. Example: IBM Power 5: Branch redirects Out-of-order processing Instruction fetch IF IC BP Branch MP ISS RF EX pipeline Load/store WB Xfer pipeline MP ISS RF EA DC Fmt WB Xfer CP D0 D1 D2 D3 Xfer GD Group formation and instruction decode MP ISS RF EX Fixed-point WB Xfer pipeline MP ISS RF Interrupts and flushes F6 Floatingpoint WB pipeline Xfer I-fetch and decode: like static pipelines Today s focus: execution unit 5

90 nm, 58 M transistors L1 (64K Instruction) 512K L2 CS 152 L14: Cache I L1 (32K Data) PowerPC UC Regents Spring 970 2005 UCB FX 6

Recall: WAR and WAW hazards... Write After Read (WAR) hazards. Instruction I2 expects to write over a data value after an earlier instruction I1 reads it. But instead, I2 writes too early, and I1 sees the new value. Write After Write (WAW) hazards. Instruction I2 writes over data an earlier instruction I1 also writes. But instead, I1 writes after I2, and the final data value is incorrect. Dynamic scheduling eliminates WAR and WAW hazards, making out-of-order execution tractable 7

Dynamic Scheduling: A mix of 3 ideas Imagine: an endless supply of registers... Top-down idea: Registers that may be written only once (but may be read many times) eliminate WAW and WAR hazards. Mid-level idea: An instruction waiting for an operand to execute may trigger on the (single) write to the associated register. Bottom-up idea: To support snooping on register writes, attach all machine elements to a common bus. Robert Tomasulo, IBM, 1967. FP unit for IBM 360/91 (eliminates RAW hazards) 8

Consider this simple loop... Loop: LD F0,0(R1) ;F0=vector array element ADDD F4,F0,F2 ;add scalar from F2 SD F4,0(R1) 0(R1),F4 ;store result SUBI R1,R1,8 ;decrement pointer 8B (DW) BNEZ R1,Loop ;branch R1!=zero NOP ;delayed branch slot Every pass through the loop introduces the potential for WAW and/or WAR hazards for F0, F4, and R1. (Note: F registers are floating point registers. F0 is not equal to the constant 0, but instead is a normal register just like F1, F2,...). 11

Given an endless supply of registers... Rename architected registers (Ri, Fi) to new physical registers (PRi, PFi) on each write. ADDI R1,R0,64 Loop: LD F0,0(R1) ADDD SD SUBI BNEZ NOP F4,F0,F2 0(R1),F4 F4,0(R1) R1,R1,8 R1,Loop What was gained? An instruction may execute once all of its source registers have been written. R1 PR01 F0 PF00 ADDI PR01,PR00,64 LD PF00 0(PR01) ADDD PF04, PF00, PF02 SD PF04, 0(PR01) SUBI PR11, PR01, 8 BEQZ PR11 ENDLOOP ITER2: LD PF10 0(PR11) ADDD PF14, PF10, PF02 SD PF14, 0(PR11) SUBI PR21, PR11, 8 BEQZ PR21 ENDLOOP ITER3: LD PF20 O(PR21) [...] 12

Bus-Based CPUs 13

A common bus == long wires == slow? Pipelines in theory Long wires are the price we paid to avoid stalls Pipelines in practice Wires are short, so clock periods can be short. wiring by abutment Conjecture: If processor speed is limited by long wires, lets do a design that fully uses the semantics of long wires by using a bus. 14

A bus-based multi-cycle computer From Memory Load Unit If we add too many functional units, one bus is too long, too slow. Solutions: more buses, faster electrical signalling Register File ALU #1 ALU #2... Common Data Bus <data id#, data value> (1) Only one unit writes at a time (one source). (2) All units may read the written values (many destinations), if interested in id#. Store Unit To Memory 15

Data-Driven Execution (Associative Control) Caveat: In comparison to static pipelines, there is great diversity in dynamic scheduling implementations. Presentation that follows is a composite, and does not reflect any specific machine. 16

Recall: IBM Power 5 block diagram... Queues between instruction fetch and execution. Branch redirects Out-of-order processing Instruction fetch IF IC BP Branch MP ISS RF EX pipeline Load/store WB Xfer pipeline MP ISS RF EA DC Fmt WB Xfer CP D0 D1 D2 D3 Xfer GD Group formation and instruction decode MP ISS RF EX Fixed-point WB Xfer pipeline MP ISS RF Interrupts and flushes F6 Floatingpoint WB pipeline Xfer MP = Mapping from architected registers to physical registers (renaming). ISS = Instruction Issue 17

Instructions placed in Reorder Buffer Each line holds physical <src1, src2, dest> registers for an instruction, and controls when it executes From Memory Load Unit Reorder Buffer Inst # [...] src1 # src1 val src2 # src2 val dest # dest val 6 7 [...] ALU #1 ALU #2 Store Unit Common Data Bus: <reg #, reg val> To Memory Execution engine works on the physical registers, not the architecture registers. 18

Circular Reorder Buffer: A closer look Next instr to commit, (complete). Tail of List Head of List Inst # Op U E #1 #2 #d P1 P2 Pd P1 value 8 9 10 ADD OR SUB Add next inst, in program order. 0 0 1 1 1 0 0 Instruction opcode Use bit (1 if line is in use) Execute bit (0 if waiting...) Physical register numbers Valid bits for values P2 value Pd value Copies of physical register values 19

Example: The life of ADD R3,R1,R2 Issue: R1 renamed to PR21, whose value (13) was set by an earlier instruction. R2 renamed to PR22; it has not been written. R3 renamed to PR23. Inst# Op U E #1 #2 #d P1 P2 Pd P1 value P2 value Pd value 9 Add 1 0 21 22 23 1 0 0 13 - - A write to PR22 appears on the bus, value 87. Both operands are now known, so 13 and 87 sent to ALU. Inst# Op U E #1 #2 #d P1 P2 Pd P1 value P2 value Pd value 9 Add 1 1 21 22 23 1 1 0 13 87 - ALU does the add, writing < PR23, 100 > onto the bus. Inst# Op U E #1 #2 #d P1 P2 Pd P1 value P2 value Pd value 9 Add 1 1 21 22 23 1 1 1 13 87 100 20

More details (many are still overlooked) Issue logic monitors bus to maintain a physical register file, so that it can fill in <val> fields during issue. Reorder buffer: a state machine triggered by reg# bus comparisons Q. Why are we storing each physical register value several times in the reorder buffer? Quick access. From Memory Load Unit Example: Load/Store Disambiguation Inst # [...] src1 # src1 val src2 # src2 val dest # dest val 6 7 [...] ALU #1 ALU #2 Store Unit Common Data Bus: <reg #, reg val> To Memory 21

Exceptions and Interrupts Exception: An unusual event happens to an instruction during its execution. Examples: divide by zero, undefined opcode. Interrupt: Hardware signal to switch the processor to a new instruction stream. Example: a sound card interrupts when it needs more audio output samples (an audio click happens if it is left waiting). 22

Challenge: Precise Interrupt / Exception Definition: (or exception),$'-.)$'(//+(%'()'0*'(#'0#$+%%./$'0)'$(1+#'2+$3++#' $3"'0#)$%.4$0"#)!"#$%& ' #()%& '*+, -./0%01102.%31%#44%'(".562.'3("%67%.3%#()%'(246)'(8%& ' '".3.#44$%239740.0 - (3%01102.%31%#($%'(".562.'3(%#1.05%& ' /#"%.#:0(%74#20 ;/0%'(.05567.%/#()405%0'./05%#<35."%./0%75385#9%35% 50".#5."%'.%#.%& '*+% = Follows from the contract between the architect and the programmer... 23

Precise Exceptions in Static Pipelines 8?(%>$@:;*A';BC@;D 120$!'()'*3/4)$5#67)*8/-)./0)9!"##$%& '"$(% P9 ";<@R' 0?J!!?G:>? K T 0!H@H' 0?J S G/:/1*0 L98.:/-0 6C 6C0A..-/440 D>1/3*7(84 KFG!!::/<9:0 I31(./ KFG K IJ/-,:(= KFG 0 F9*90A..-0 E7::0 D>1/3* K-7*/?91@ C9)4/ E7::0H0 G*9</ P9! E7::0F0 G*9</ P9 K E7::0D0 G*9</ P9 0 A4B81;-(8()40!8*/--)3*4 D6C C D:E>'?FG?B@=:;'$EHI<'=;'B=B?E=;?'A;@=E'G:JJ=@'B:=;@',0'<@HI?/ Key observation: architected state only change in memory and register write stages. 24

Dynamic scheduling and exceptions...?@(%e$>9<# )*G.,0#, H%&G.8G.,0#, )*G.,0#, '&(#% H&#:;& 5&:-;&-)*+,,&- 8:FF2( I-66 )*J#$&(4/*06#,(KB I-66 I-66 =>&#+(& =>#&I(2:/J Key observation: Only the architected state needs to be precise, not the physical register state. So, we delay removing instructions from the reorder buffer until we are ready to commit to that state changing the architected registers. 9!/.(-+#(2:/.),&(#%&;)$/;);&#:;&;)2/(:)2/.(-+#(2:/ 25

Add completion logic to data path... To sustain CPI < 1, must be able to do multiple issues, commits, and reorder buffer execution launches and writes per cycle. From Memory Reorder Buffer Inst # [...] src1 # src1 val src2 # src2 val dest # dest val 6 7 [...] Commit ISA Registers Load Unit ALU #1 ALU #2 Store Unit To Memory Not surprising design and validation teams are so large. 26

Power 5: By the numbers... Fetch up to 8 instructions per cycle. Dispatch up to 5 instructions per cycle Execute up to 8 instructions per cycle Branch redirects Out-of-order processing Instruction fetch IF IC BP Branch MP ISS RF EX pipeline Load/store WB Xfer pipeline MP ISS RF EA DC Fmt WB Xfer CP D0 D1 D2 D3 Xfer GD Group formation and instruction decode MP ISS RF EX Fixed-point WB Xfer pipeline MP ISS RF Interrupts and flushes Up to 200 instructions in flight. 240 physical registers (120 int + 120 FP) F6 Floatingpoint WB pipeline Xfer A thread may commit up to 5 instructions per cycle. 27

Note: Good branch prediction required Because so many stages between predict and result! Branch redirects Out-of-order processing Instruction fetch IF IC BP Branch MP ISS RF EX pipeline Load/store WB Xfer pipeline MP ISS RF EA DC Fmt WB Xfer CP D0 D1 D2 D3 Xfer GD Group formation and instruction decode MP ISS RF EX Fixed-point WB Xfer pipeline MP ISS RF Interrupts and flushes F6 Floatingpoint WB pipeline Xfer BP = Branch prediction. On IBM Power 5, quite complex... uses a predictor to predict the best branch prediction algorithm! 28

Recap: Dynamic Scheduling Three big ideas: register renaming, data-driven detection of RAW resolution, bus-based architecture. Very complex, but enables many things: out-of-order execution, multiple issue, loop unrolling, etc. Has saved architectures that have a small number of registers: IBM 360 floating-point ISA, Intel x86 ISA. 29

Throughput Computing 30

Recall: Throughput and multiple threads Goal: Use multiple instruction streams to improve (1) throughput of machines that run many programs (2) multi-threaded program execution time. Example: Sun Niagara (32 instruction streams on a chip). Difficulties: Gaining full advantage requires rewriting applications, OS, libraries. Ultimate limiter: Amdahl s law (application dependent). Memory system performance. 31

Throughput Computing Multithreading: Interleave instructions from separate threads on the same hardware. Seen by OS as several CPUs. Multi-core: Integrating several processors that (partially) share a memory system on the same chip 32

Multi-Threading (Static Pipelines) 33

Recall: Bypass network prevents stalls Instead of bypass: Interleave threads on the pipeline to prevent stalls... IR ID (Decode) IR EX IR MEM WB IR WE, MemToReg Mux,Logic From WB 32 op rs1 rs2 RegFile rd1 A 32 A L U 32 Y Data Memory Addr Dout Din WE MemToReg R ws wd WE rd2 M M Ext B 34

Introduced in 1964 by Seymour Cray Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe T1: LW r1, 0(r2) T2: ADD r7, r1, r4 T3: XORI r5, r4, #12 T4: SW 0(r7), r5 T1: LW r5, 12(r1) t0 t1 t2 t3 t4 t5 t6 t7 t8 F D X M W F D X M W F D X M W F D X M W F D X M W t9 Last instruction in a thread always completes writeback before next instruction in same thread reads regfile 4 CPUs, each run at 1/4 clock PC PC PC 1 PC 1 1 1 I$ IR GPR1 GPR1 GPR1 GPR1 X Y D$ +1 2 Thread select 2 Many variants... 35

Multi-Threading (Dynamic Scheduling) 36

Power 4 (predates Power 5 shown earlier) Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle. Branch redirects Out-of-order processing Instruction fetch IF IC BP MP ISS RF EX BR WB LD/ST MP ISS RF EA DC Fmt WB Xfer Xfer CP D0 D1 D2 D3 Xfer GD MP ISS RF EX FX WB Xfer Instruction crack and group formation MP ISS RF FP F6 WB Xfer Interrupts and flushes 37

For most apps, most execution units lie idle Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware? Percent of Total Issue Cycles 100 90 80 70 60 50 40 30 20 10 0 alvinn doduc eqntott espresso fpppp hydro2d li mdljdp2 mdljsp2 nasa7 ora Applications su2cor swm tomcatv composite For an 8-way superscalar. memory conflict long fp short fp long integer short integer load delays control hazards branch misprediction dcache miss icache miss dtlb miss itlb miss processor busy From: Tullsen, Eggers, and Levy, Simultaneous Multithreading: Maximizing Onchip Parallelism, ISCA 1995. 38

Simultaneous Multi-threading... One thread, 8 units Cycle M M FX FX FP FP BR CC 1 2 3 4 5 6 7 8 9 Two threads, 8 units Cycle M M FX FX FP FP BR CC 1 2 3 4 5 6 7 8 9 M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes 39

Branch redirects Power 4 Out-of-order processing Instruction fetch IF IC BP MP ISS RF EX BR WB LD/ST MP ISS RF EA DC Fmt WB Xfer Xfer CP D0 D1 D2 D3 Xfer GD MP ISS RF EX FX WB Xfer Instruction crack and group formation MP ISS RF FP F6 WB Xfer Interrupts and flushes Branch redirects Instruction fetch IF IC BP D0 D1 D2 D3 Xfer GD Interrupts and flushes Power 5 Group formation and instruction decode 2 fetch (PC), 2 initial decodes Out-of-order processing Branch MP ISS RF EX pipeline Load/store WB Xfer pipeline MP ISS RF EA DC Fmt WB Xfer MP ISS RF EX Fixed-point WB Xfer pipeline MP ISS RF F6 Floatingpoint WB pipeline 2 commits (architected register sets) Xfer CP 40

Power 5 data flow... Program counter Instruction cache Instruction translation Alternate Branch history tables Instruction buffer 0 Instruction buffer 1 Branch prediction Return stack Thread priority Target cache Group formation Instruction decode Dispatch Sharedregister mappers Dynamic instruction selection Shared issue queues Read sharedregister files Shared execution units LSU0 FXU0 LSU1 FXU1 FPU0 FPU1 BXU CRL Write sharedregister files Data Translation Group completion Data translation Data Cache Store queue Data cache L2 cache Shared by two threads Thread 0 resources Thread 1 resources Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck. 41

Power 5 thread performance... Relative priority of each thread controllable in hardware. For balanced operation, both threads run slower than if they owned the machine. Instructions per cycle (IPC) Single-thread mode 0,7 2,7 1,6 4,7 3,6 2,5 1,4 6,7 7,7 7,6 5,6 4,5 3,4 2,3 2,1 6,6 5,5 4,4 3,3 2,2 6,5 5,4 4,3 3,2 2,1 7,4 6,3 5,2 4,1 7,2 6,1 Thread 0 priority, thread 1 priority 7,0 1,1 0,1 1,0 Power save mode Thread 0 IPC Thread 1 IPC 42

Multi-Core 43

Recall: Superscalar utilization by a thread Percent of Total Issue Cycles 100 90 80 70 60 50 40 30 20 10 0 alvinn doduc eqntott espresso fpppp hydro2d li mdljdp2 mdljsp2 nasa7 ora Applications su2cor swm tomcatv composite For an 8-way superscalar. memory conflict long fp short fp long integer short integer load delays control hazards branch misprediction dcache miss icache miss dtlb miss itlb miss processor busy Observation: In many cases, the on-chip cache and DRAM I/O bandwidth is also underutilized by one CPU. So, let 2 cores share them. 44

Most of Power 5 die is shared hardware Core #1 Shared Components L2 Cache L3 Cache Control Core #2 DRAM Controller 45

Core-to-core interactions stay on chip (1) Threads on two cores that use shared libraries conserve L2 memory. (2) Threads on two cores share memory via L2 cache operations. Much faster than 2 CPUs on 2 chips. 46

Sun Niagara 47

The case for Sun s Niagara... Percent of Total Issue Cycles 100 90 80 70 60 50 40 30 20 10 0 alvinn doduc eqntott espresso fpppp hydro2d li mdljdp2 mdljsp2 nasa7 ora Applications su2cor swm tomcatv composite For an 8-way superscalar. memory conflict long fp short fp long integer short integer load delays control hazards branch misprediction dcache miss icache miss dtlb miss itlb miss processor busy Observation: Some apps struggle to reach a CPI == 1. For throughput on these apps, a large number of single-issue cores is better than a few superscalars. 48

Niagara (original): 32 threads on one chip 8 cores: Single-issue, 1.2 GHz 6-stage pipeline 4-way multi-threaded Fast crypto support Die size: 340 mm² in 90 nm. Power: 50-60 W Shared resources: 3MB on-chip cache 4 DDR2 interfaces 32G DRAM, 20 Gb/s 1 shared FP unit GB Ethernet ports Sources: Hot Chips, via EE Times, Infoworld. J Schwartz weblog (Sun COO) 49

The board that booted Niagara first-silicon Source: J Schwartz weblog (then Sun COO, now CEO) 50

Used in Sun Fire T2000: Coolthreads Claim: server uses 1/3 the power of competing servers. Web server benchmarks used to position the T2000 in the market. 51

Project Blackbox A data center in a 20-ft shipping container. Servers, CS 194-6 L10: air-conditioners, Advanced Processors II power distribution. 52

Just hook up network, power, and water... 53

Holds 250 T1000 servers. 2000 CPU cores, 8000 threads. 55

Cache Coherency 56

Two CPUs, two caches, shared DRAM... CPU0 Cache CPU1 Cache Addr Value Addr Value 16 5 16 5 0 Shared Main Memory Addr 16 Value 5 Write-through caches 0 CPU0: LW R2, 16(R0) CPU1: LW R2, 16(R0) CPU1: SW R0,16(R0) View of memory no longer coherent. Loads of location 16 from CPU0 and CPU1 see different values! What to do? 57

The simplest solution... one cache! CPU0 CPU1 Memory Switch Shared Multi-Bank Cache Shared Main Memory CPUs do not have internal caches. Only one cache, so different values for a memory address cannot appear in 2 caches! Multiple caches banks support read/writes by both CPUs in a switch epoch, unless both target same bank. In that case, one request is stalled. 58

Not a complete solution... good for L2. CPU0 CPU1 For modern clock rates, access to shared cache through switch takes 10+ cycles. Memory Switch Shared Multi-Bank Cache Using shared cache as the L1 data cache is tantamount to slowing down clock 10X for LWs. Not good. Shared Main Memory This approach was a complete solution in the days when DRAM row access time and CPU clock period were well matched. 59

Modified form: Private L1s, shared L2 CPU0 CPU1 L1 Caches L1 Caches Memory Switch or Bus Shared Multi-Bank L2 Cache Shared Main Memory Thus, we need to solve the cache coherency problem for L1 cache. Advantages of shared L2 over private L2s: Processors communicate at cache speed, not DRAM speed. Constructive interference, if both CPUs need same data/instr. Disadvantage: CPUs share BW to L2 cache... 60

Cache coherency goals... CPU0 Cache CPU1 Cache 1. Only one processor at a time has write permission for a memory location. Addr Value Shared Memory Hierarchy Addr 16 Addr Value Value 16 5 16 5 0 5 0 2. No processor can load a stale copy of a location after a write. 61

Implementation: Snoopy Caches CPU0 CPU1 Cache Snooper Cache Memory bus Snooper Shared Main Memory Hierarchy Each cache has the ability to snoop on memory bus transactions of other CPUs. The bus also has mechanisms to let a CPU intervene to stop a bus transaction, and to invalidate cache lines of other CPUs. 62

Writes from 10,000 feet... For write-thru caches... Cache CPU0 Snooper Cache Memory bus CPU1 Shared Main Memory Hierarchy Snooper To a first-order, reads will just work if write-thru caches implement this policy. A two-state protocol (cache lines are valid or invalid ). 1. Writing CPU takes control of bus. 2. Address to be written is invalidated in all other caches. Reads will no longer hit in cache and get stale data. 3. Write is sent to main memory. Reads will cache miss, retrieve new value from main memory 63

Limitations of the write-thru approach CPU0 CPU1 Every write goes to the bus. Cache Snooper Shared Main Memory Hierarchy Cache Memory bus Snooper Write-back big trick: keep track of whether other caches also contain a cached line. If not, a cache has an exclusive on the line, and can read and write the line as if it were the only CPU. For details, take CS 152 and CS 252... Total bus write bandwidth does not support more than 2 CPUs, in modern practice. To scale further, we need to use write-back caches. 64

Recap: Throughput processing Simultaneous Multithreading: Instructions streams can share an out-of-order engine economically. Multi-core: Once instruction-level parallelism run dry, thread-level parallelism is a good use of die area. 65

Next Monday: This Friday: Thanksgiving Holiday, No Meeting Final Presentation Fri, Dec 5 66