CS 152 Computer Architecture and Engineering

Size: px

Start display at page:

Download "CS 152 Computer Architecture and Engineering"

Tracey Dean
5 years ago
Views:

1 CS 152 Computer Architecture and Engineering Lecture 22 Advanced Processors III John Lazzaro ( TAs: Ted Hong and David Marquardt www-inst.eecs.berkeley.edu/~cs152/

2 Last Time: Dynamic Scheduling Each line holds physical <src1, src2, dest> registers for an instruction, and controls when it executes From Memory Load Unit Reorder Buffer Inst # [...] src1 # src1 val src2 # src2 val dest # dest val 6 7 [...] ALU #1 ALU #2 Store Unit Common Data Bus: <dest #, dest val> To Memory Execution engine works on the physical registers, not the architecture registers.

3 Today: Throughput and multiple threads Goal: Use multiple instruction streams to improve (1) throughput of machines that run many programs (2) execution time of multithreaded programs. Example: Sun Niagara (32 instruction streams on a chip). Difficulties: Gaining full advantage requires rewriting applications, OS, libraries. Ultimate limiter: Amdahl s law (application dependent). Memory system performance.

4 Throughput Computing Multithreading: Interleave instructions from separate threads on the same hardware. Seen by OS as several CPUs. Multi-core: Integrating several processors that (partially) share a memory system on the same chip

5 Multi-Threading (static pipelines)

6 Recall: Bypass network prevents stalls Instead of bypass: interleave threads on the pipeline to prevent stalls... IR ID (Decode) IR EX IR MEM WB IR WE, MemToReg Mux,Logic From WB 32 op rs1 rs2 RegFile rd1 A 32 A L U 32 Y Data Memory Addr Dout Din WE MemToReg R ws wd WE rd2 M M Ext B

7 Introduced in 1964 by Seymour Cray Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe T1: LW r1, 0(r2) T2: ADD r7, r1, r4 T3: XORI r5, r4, #12 T4: SW 0(r7), r5 T1: LW r5, 12(r1) t0 t1 t2 t3 t4 t5 t6 t7 t8 F D X M W F D X M W F D X M W F D X M W F D X M W t9 Last instruction in a thread always completes writeback before next instruction in same thread reads regfile 4 CPUs, each run at 1/4 clock PC PC PC 1 PC I$ IR GPR1 GPR1 GPR1 GPR1 X Y D$ +1 2 Thread 2 Many variants...

8 Multi-Threading (dynamic scheduling)

Power 4 (predates Power 5 shown Tuesday) Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

9 Power 4 (predates Power 5 shown Tuesday) Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle. Branch redirects Out-of-order processing Instruction fetch IF IC BP MP ISS RF EX BR WB LD/ST MP ISS RF EA DC Fmt WB Xfer Xfer CP D0 D1 D2 D3 Xfer GD MP ISS RF EX FX WB Xfer Instruction crack and group formation MP ISS RF FP F6 WB Xfer Interrupts and flushes

10 For most apps, most execution units lie idle Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several Percent of Total Issue Cycles For an 8-way superscalar. memory conflict long fp short fp long integer short integer load delays control hazards branch misprediction dcache miss icache miss dtlb miss itlb miss processor busy instruction threads share this hardware? alvinn doduc eqntott espresso fpppp hydro2d li mdljdp2 mdljsp2 nasa7 ora Applications su2cor swm tomcatv composite From: Tullsen, Eggers, and Levy, Simultaneous Multithreading: Maximizing Onchip Parallelism, ISCA 1995.

11 Simultaneous Multi-threading... One thread, 8 units Cycle M M FX FX FP FP BR CC Two threads, 8 units Cycle M M FX FX FP FP BR CC M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

12 Branch redirects Power 4 Out-of-order processing Instruction fetch IF IC BP MP ISS RF EX BR WB LD/ST MP ISS RF EA DC Fmt WB Xfer Xfer CP D0 D1 D2 D3 Xfer GD MP ISS RF EX FX WB Xfer Instruction crack and group formation MP ISS RF FP F6 WB Xfer Interrupts and flushes Branch redirects Instruction fetch IF IC BP D0 D1 D2 D3 Xfer GD Interrupts and flushes Power 5 Group formation and instruction decode 2 fetch (PC), 2 initial decodes Out-of-order processing Branch MP ISS RF EX pipeline Load/store WB Xfer pipeline MP ISS RF EA DC Fmt WB Xfer MP ISS RF EX Fixed-point WB Xfer pipeline MP ISS RF F6 Floatingpoint WB pipeline 2 commits (architected register sets) Xfer CP

13 Power 5 data flow... Program counter Instruction cache Instruction translation Alternate Branch history tables Instruction buffer 0 Instruction buffer 1 Branch prediction Return stack Thread priority Target cache Group formation Instruction decode Dispatch Sharedregister mappers Dynamic instruction selection Shared issue queues Read sharedregister files Shared execution units LSU0 FXU0 LSU1 FXU1 FPU0 FPU1 BXU CRL Write sharedregister files Data Translation Group completion Data translation Data Cache Store queue Data cache L2 cache Shared by two threads Thread 0 resources Thread 1 resources Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to botteneck.

14 Power 5 thread performance... Relative priority of each thread controllable in hardware. For balanced operation, both threads run slower than if they owned the machine. Instructions per cycle (IPC) Single-thread mode 0,7 2,7 1,6 4,7 3,6 2,5 1,4 6,7 7,7 7,6 5,6 4,5 3,4 2,3 2,1 6,6 5,5 4,4 3,3 2,2 6,5 5,4 4,3 3,2 2,1 7,4 6,3 5,2 4,1 7,2 6,1 Thread 0 priority, thread 1 priority 7,0 1,1 0,1 1,0 Power save mode Thread 0 IPC Thread 1 IPC

15 Reminder: Friday DRAM controller checkoff Run your test vector suite on the Calinx board, display results on LEDs. T e s t IM Bus D R A M Test ongoing transactions on both buses, with randomized start times. Load, store, and verify different data word patterns. V e c t o r s DM Bus C o n t r o l l e r DRAM

16 Multi-Core

17 Recall: Superscalar utilization by a thread Percent of Total Issue Cycles For an 8-way superscalar. memory conflict long fp short fp long integer short integer load delays control hazards branch misprediction dcache miss icache miss dtlb miss itlb miss processor busy Observation: In many cases, the on-chip cache and DRAM I/O bandwidth is also underutilized 20 by one CPU. 10 So, let 2 cores 0 alvinn doduc eqntott espresso fpppp hydro2d li mdljdp2 mdljsp2 nasa7 ora Applications su2cor swm tomcatv composite share them.

18 Most of Power 5 die is shared hardware Core #1 Shared Components L2 Cache L3 Cache Control Core #2 DRAM Controller

19 Core-to-core interactions stay on chip (1) Threads on two cores that use shared libraries conserve L2 memory. (2) Threads on two cores share memory via L2 cache operations. Much faster than 2 CPUs on 2 chips.

20 Sun Niagara

21 Percent of Total Issue Cycles The case for Sun s Niagara... alvinn doduc eqntott espresso fpppp hydro2d li mdljdp2 mdljsp2 nasa7 ora Applications su2cor swm tomcatv composite For an 8-way superscalar. memory conflict long fp short fp long integer short integer load delays control hazards branch misprediction dcache miss icache miss dtlb miss itlb miss processor busy Note: This equation is correct. Observation: Some apps struggle to reach a CPI <= 1. For throughput on these apps, a large number of single-issue cores is better than a few superscalars.

22 Niagara: 32 threads on one chip 8 cores: Single-issue 6-stage pipeline 4-way multi-threaded Fast crypto support Die size: 340 mm² in 90 nm. Power: W Shared resources: 3MB on-chip cache 4 DDR2 interfaces 32G DRAM, 20 Gb/s 1 shared FP unit GB Ethernet ports Sources: Hot Chips, via EE Times, Infoworld. J Schwartz weblog (Sun COO)

23 Niagara status: First motherboard runs Source: J Schwartz weblog (Sun COO)

24 Cell: The PS3 chip

25 Synergistic Processing Units (SPUs) 8 cores using local memory, not traditional caches

One Synergistic Processing Unit (SPU) Programmers manage caching explicitly 256 KB Local Store -- 128 128-bit

26 One Synergistic Processing Unit (SPU) Programmers manage caching explicitly 256 KB Local Store bit Registers SPU issues 2 inst/cycle (in order) to 7 execution units SPU fills Local Store using DMA to DRAM and network

27 Conclusions: Throughput processing Simultaneous Multithreading: Instructions streams can share an out-of-order engine economically. Multi-core: Once instruction-level parallelism run dry, thread-level parallelism is a good use of die area.

CS 152 Computer Architecture and Engineering

CS 152 Computer Architecture and Engineering Lecture 19 Advanced Processors III 2006-11-2 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/ 1 Last