CS 152 Computer Architecture and Engineering


1 CS 152 Computer Architecture and Engineering Lecture 22 Advanced Processors III Dave Patterson, John Lazzaro www-inst.eecs.berkeley.edu/~cs152/

2 Last Time: Dynamic Scheduling. Each reorder buffer line holds physical <src1, src2, dest> registers for an instruction, and controls when it executes. [Figure: a load unit (from memory) feeds the reorder buffer, whose lines hold the fields Inst #, src1 #, src1 val, src2 #, src2 val, dest #, dest val. The buffer feeds ALU #1, ALU #2, and the store unit (to memory). Results return on the common data bus as <dest #, dest val>.] Execution engine works on the physical registers, not the architecture registers.

3 Recall: Throughput and multiple threads. Goal: Use multiple instruction streams to improve (1) throughput of machines that run many programs and (2) execution time of multithreaded programs. Example: Sun Niagara (32 instruction streams on a chip). Difficulties: Gaining full advantage requires rewriting applications, OS, libraries. Ultimate limiter: Amdahl's law (application dependent). Memory system performance.
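The Amdahl's law limit mentioned on this slide can be made concrete with a short calculation. This is a minimal sketch, not part of the lecture: the function name `amdahl_speedup` and the parallel fractions tried below are illustrative assumptions. Only the count of 32 instruction streams comes from the slide.

```python
def amdahl_speedup(parallel_fraction: float, n_streams: int) -> float:
    """Speedup over one stream when only `parallel_fraction` of the work scales.

    Amdahl's law: speedup = 1 / ((1 - p) + p / n).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_streams < 1:
        raise ValueError("n_streams must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_streams)


if __name__ == "__main__":
    # The parallel fractions here are hypothetical; 32 streams matches Niagara.
    for p in (0.5, 0.9, 0.99):
        print(f"p = {p:.2f}: 32 streams give {amdahl_speedup(p, 32):.2f}x")
```

Even a small serial fraction caps the benefit of 32 streams, which is why the limit is application dependent.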

4 This Time: Throughput Computing. Multithreading: Interleave instructions from separate threads on the same hardware. Seen by OS as several CPUs. Multi-core: Integrating several processors that (partially) share a memory system on the same chip. Also: A town meeting discussion on lessons learned from Lab 4.

5 Multi-Threading

6 Power 4 (predates Power 5 shown Tuesday). Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each of which may issue an instruction each cycle. [Figure: Power 4 pipeline. Instruction fetch (IF, IC, BP), then instruction crack and group formation (D0, D1, D2, D3, Xfer, GD), then out-of-order processing in four pipelines: branch (BR: MP, ISS, RF, EX, WB, Xfer), load/store (LD/ST: MP, ISS, RF, EA, DC, Fmt, WB, Xfer), fixed-point (FX: MP, ISS, RF, EX, WB, Xfer), and floating-point (FP: MP, ISS, RF, F6, WB, Xfer), ending at CP. Arrows mark branch redirects and interrupts and flushes.]

7 For most apps, most execution units lie idle. Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware? [Figure: Percent of total issue cycles, by application (alvinn, doduc, eqntott, espresso, fpppp, hydro2d, li, mdljdp2, mdljsp2, nasa7, ora, su2cor, swm, tomcatv, composite), for an 8-way superscalar. Categories: memory conflict, long fp, short fp, long integer, short integer, load delays, control hazards, branch misprediction, dcache miss, icache miss, dtlb miss, itlb miss, processor busy.] From: Tullsen, Eggers, and Levy, Simultaneous Multithreading: Maximizing On-chip Parallelism, ISCA

8 Simultaneous Multi-threading... [Figure: two issue-slot charts, "One thread, 8 units" and "Two threads, 8 units". Rows are cycles; columns are the units M, M, FX, FX, FP, FP, BR, CC.] M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
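The idea behind the two charts can be sketched as a toy issue-slot model: each cycle, ready instructions from every thread compete for the eight units named on the slide (M, M, FX, FX, FP, FP, BR, CC). Everything else here is an assumption for illustration: the function names, the fixed thread order, and the hand-written ready lists. The model ignores dependencies, register pressure, and thread priority.

```python
from collections import Counter

# Unit counts taken from the slide's column labels: 8 units in total.
UNITS = {"M": 2, "FX": 2, "FP": 2, "BR": 1, "CC": 1}


def issue_cycle(ready_per_thread):
    """Return how many units are used in one cycle.

    `ready_per_thread` is a list with one entry per thread; each entry is a
    list of unit kinds that the thread's ready instructions need. Threads are
    served in list order until the matching units run out.
    """
    free = Counter(UNITS)
    used = 0
    for ready in ready_per_thread:
        for kind in ready:
            if free[kind] > 0:
                free[kind] -= 1
                used += 1
    return used


def utilization(trace):
    """Fraction of unit-cycles used over a trace (a list of cycles)."""
    total_used = sum(issue_cycle(cycle) for cycle in trace)
    return total_used / (len(trace) * sum(UNITS.values()))


# Hypothetical ready lists for two threads over four cycles.
thread_a = [["M", "FX"], ["FP"], [], ["M", "BR"]]
thread_b = [["FX", "FP"], ["M", "M"], ["FX", "CC"], ["FP"]]

one_thread = [[a] for a in thread_a]
two_threads = [[a, b] for a, b in zip(thread_a, thread_b)]

if __name__ == "__main__":
    print("one thread: ", utilization(one_thread))
    print("two threads:", utilization(two_threads))
```

In this made-up trace the second thread fills slots the first leaves empty, including the cycle where the first thread has nothing ready at all.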

9 Administrivia: Big Game -- Go Cal! Thursday 11/18: Preliminary design document due, by 9 PM. Friday 11/19: Review design document with TAs in lab section. Sunday 11/21: Revised design document due in , by 11:59 PM. Friday 12/3: Demo deep pipeline in lab section.

10 Administrivia: Mid-term and Field Trip. Mid-Term II Review Session: Sunday, 11/21, 7-9 PM, 306 Soda. (no lecture Tuesday) Mid-Term II: Tuesday, 11/23, 5:30 to 8:30 PM, 101 Morgan. LaVal 9 PM! Xilinx field trip: Tuesday 11/30, bus leaves at 8:30 AM, from 4th floor Soda. Thursday 12/2: Advice on Presentations. Prepare you for your final project talk. Send Doug RSVP by 5 PM today!

11 Multi-Threading (continued)

12 [Figure: Power 4 and Power 5 pipelines, one above the other. Each shows instruction fetch (IF, IC, BP), group formation and instruction decode (D0, D1, D2, D3, Xfer, GD), and out-of-order processing in the branch (MP, ISS, RF, EX, WB, Xfer), load/store (MP, ISS, RF, EA, DC, Fmt, WB, Xfer), fixed-point (MP, ISS, RF, EX, WB, Xfer), and floating-point (MP, ISS, RF, F6, WB, Xfer) pipelines, ending at CP, with arrows for branch redirects and for interrupts and flushes.] Power 5 annotations: 2 fetch (PC), 2 initial decodes; 2 commits (architected register sets).

13 Power 5 data flow... [Figure: program counter, instruction cache, instruction translation, alternate, branch history tables, branch prediction, return stack, target cache, instruction buffer 0, instruction buffer 1, thread priority, group formation, instruction decode, dispatch, shared-register mappers, dynamic instruction selection, shared issue queues, read shared-register files, shared execution units (LSU0, FXU0, LSU1, FXU1, FPU0, FPU1, BXU, CRL), write shared-register files, data translation, data cache, store queue, group completion, L2 cache. Legend: shared by two threads, thread 0 resources, thread 1 resources.] Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to become a bottleneck.

14 Power 5 thread performance... Relative priority of each thread controllable in hardware. For balanced operation, both threads run slower than if they owned the machine. [Figure: Instructions per cycle (IPC) for thread 0 and thread 1, plotted against the (thread 0 priority, thread 1 priority) setting, from 0,7 through 7,0. Single-thread mode and power save mode are marked.]

15 Multi-Core

16 Recall: Superscalar utilization by a thread. [Figure: Percent of total issue cycles, by application, for an 8-way superscalar, as on slide 7.] Observation: In many cases, the on-chip cache and DRAM I/O bandwidth are also underutilized by one CPU. So, let 2 cores share them.

17 Most of Power 5 die is shared hardware. [Figure: die photo labeled Core #1, Core #2, and Shared Components: L2 Cache, L3 Cache Control, DRAM Controller.]

18 Core-to-core interactions stay on chip. (1) Threads on two cores that use shared libraries conserve L2 memory. (2) Threads on two cores share memory via L2 cache operations. Much faster than 2 CPUs on 2 chips.

19 The case for Sun's Niagara... [Figure: Percent of total issue cycles, by application, for an 8-way superscalar, as on slide 7.] Observation: Some apps struggle to reach a CPI <= 1. For throughput on these apps, a large number of single-issue cores is better than a few superscalars.
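The observation above can be illustrated with a back-of-the-envelope throughput bound. This is a toy model under loudly stated assumptions, not data from the lecture: the function `aggregate_ipc`, the per-thread IPC of 0.5, and the choice of two 8-way cores for comparison are all hypothetical, and the model ignores clock rate, die area, and memory contention.

```python
def aggregate_ipc(n_cores: int, issue_width: int,
                  threads_per_core: int, per_thread_ipc: float) -> float:
    """Upper bound on chip-wide instructions per cycle.

    Each core can issue at most `issue_width` instructions per cycle, and its
    threads together can supply at most `threads_per_core * per_thread_ipc`.
    """
    per_core = min(float(issue_width), threads_per_core * per_thread_ipc)
    return n_cores * per_core


if __name__ == "__main__":
    # Hypothetical app whose threads each sustain only IPC = 0.5 (CPI = 2).
    few_wide = aggregate_ipc(n_cores=2, issue_width=8,
                             threads_per_core=1, per_thread_ipc=0.5)
    many_narrow = aggregate_ipc(n_cores=8, issue_width=1,
                                threads_per_core=4, per_thread_ipc=0.5)
    print("2 cores, 8-way, 1 thread each:  ", few_wide)
    print("8 cores, 1-way, 4 threads each: ", many_narrow)
```

When a thread cannot fill even one issue slot per cycle, extra issue width goes unused, while extra cores and threads add throughput directly.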

20 Niagara: 32 threads on one chip. 8 cores: single-issue, 6-stage pipeline, 4-way multi-threaded, fast crypto support. Die size: 340 mm² in 90 nm. Power: W. Shared resources: 3MB on-chip cache, 4 DDR2 interfaces, 32G DRAM, 20 Gb/s, 1 shared FP unit, GB Ethernet ports. Sources: Hot Chips, via EE Times, Infoworld. J Schwartz weblog (Sun COO)

21 Niagara status: First motherboard runs. Source: J Schwartz weblog (Sun COO)

22 Lab 4 Town Meeting

23 Lab 4: Reflections from the TAs. Everyone worked hard. Only in retrospect did most students realize they also had to work smart. Example: Only one group member knows how to download to board. Once this member falls asleep, the group can't go on working... Solution: Actually use the Lab Notebook to document processes. An example of working smart.

24 Lab 4: Reflections from the TAs. Example: Comprehensive test rigs seen as a checkoff item for Lab report, done last. Actual debugging proceeds in a haphazard, painful way. A Better Way: One group spent 10 hours up front writing a cache test module. Brandon: "The best cache testing I've ever seen." They finished on time. An example of working smart.
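The general shape of an up-front cache test rig can be sketched as randomized traffic checked against a trivially correct reference model. This is not the group's module, which the slides do not show: the toy `DirectMappedCache` (one-word lines, write-back, write-allocate) stands in for the design under test, and all names and parameters are hypothetical.

```python
import random


class DirectMappedCache:
    """Toy design under test: direct-mapped, write-back, one word per line."""

    def __init__(self, n_lines, backing):
        self.n_lines = n_lines
        self.backing = backing            # dict: address -> value (main memory)
        self.lines = [None] * n_lines     # each entry: [addr, value, dirty]

    def _fill(self, addr):
        """Return the line for `addr`, evicting and writing back if needed."""
        idx = addr % self.n_lines
        line = self.lines[idx]
        if line is None or line[0] != addr:
            if line is not None and line[2]:
                self.backing[line[0]] = line[1]   # write back dirty victim
            line = [addr, self.backing.get(addr, 0), False]
            self.lines[idx] = line
        return line

    def read(self, addr):
        return self._fill(addr)[1]

    def write(self, addr, value):
        line = self._fill(addr)
        line[1] = value
        line[2] = True


def random_test(n_ops=10_000, n_lines=8, addr_space=64, seed=0):
    """Drive random reads and writes; compare every read to a golden dict."""
    rng = random.Random(seed)             # fixed seed makes failures repeatable
    cache = DirectMappedCache(n_lines, backing={})
    golden = {}
    for step in range(n_ops):
        addr = rng.randrange(addr_space)
        if rng.random() < 0.5:
            value = rng.randrange(2**32)
            cache.write(addr, value)
            golden[addr] = value
        else:
            got, want = cache.read(addr), golden.get(addr, 0)
            assert got == want, f"step {step}: addr {addr:#x} got {got}, want {want}"
    return n_ops


if __name__ == "__main__":
    print("operations checked:", random_test())
```

Keeping the address space larger than the cache forces frequent evictions, so the write-back path is exercised as well as hits.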

25 Lab 4: Reflections from the TAs. Example: Group has a long design meeting at start of project. Little is documented about signal names, state machine semantics. Members design incompatible modules, suffer. A Better Way: Carry notebooks (silicon or paper) to meetings, and force documentation of the decisions on details.

26 Lab 4: Discussion...

27 Conclusions: Throughput processing. Simultaneous Multithreading: Instruction streams can share an out-of-order engine economically. Multi-core: Once instruction-level parallelism runs dry, thread-level parallelism is a good use of die area. Lab 4: Hard work is admirable, but even reasonable deadlines are hard to meet if you don't also work smart.
