CS 152 Computer Architecture and Engineering
Lecture -- Advanced Processors III
Dave Patterson (www.cs.berkeley.edu/~patterson)
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
www-inst.eecs.berkeley.edu/~cs152/

Last Time: Dynamic Scheduling

Each line holds physical <src1, src2, dest> registers for an instruction, and controls when it executes.

[Figure: reorder buffer between a load unit (from memory) and a store unit (to memory); each entry holds the instruction #, source register #s and values, and destination register # and value, feeding the ALUs. Results broadcast on the common data bus as <dest #, dest val>.]

The execution engine works on the physical registers, not the architected registers.

Recall: Throughput and multiple threads

Goal: Use multiple instruction streams to improve (1) throughput of machines that run many programs, and (2) execution time of multithreaded programs.

Example: Sun Niagara (32 instruction streams on a chip).

Difficulties: Gaining full advantage requires rewriting applications, OS, and libraries. Ultimate limiter: Amdahl's law (application dependent). Memory system performance.

This Time: Throughput Computing

Multithreading: Interleave instructions from separate threads on the same hardware. Seen by the OS as several CPUs.

Multi-core: Integrate several processors that (partially) share a memory system on the same chip.

Also: A town meeting discussion on lessons learned from Lab.

Multi-Threading

Power 4 (predates the Power 5 shown Tuesday)

Single-threaded predecessor to the Power 5. 8 execution units in the out-of-order engine, each of which may issue an instruction each cycle.

[Figure: Power 4 pipeline -- instruction fetch (IF, IC, BP), decode/crack and group formation (D0-D3, GD), then per-unit execution pipes: BR, LD/ST (EA, DC, Fmt), FX (EX), FP (F6).]
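Amdahl's law, named above as the ultimate limiter, puts a hard ceiling on what extra instruction streams can buy. A minimal Python sketch (the function name and the 90%/32-thread workload numbers are illustrative, not from the lecture):

```python
def amdahl_speedup(parallel_fraction: float, n_threads: int) -> float:
    """Speedup when `parallel_fraction` of the work scales perfectly
    across n_threads and the remainder stays strictly serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_threads)

# A program that is 90% parallelizable gains less than 8x even on
# 32 hardware threads (one Niagara chip's worth):
print(amdahl_speedup(0.90, 32))   # ~7.8
```

The serial 10% dominates long before the thread count does, which is why the slide calls the limit "application dependent".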
For most apps, most execution units lie idle. Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?

[Figure: issue-slot utilization of an 8-way superscalar on SPEC92 applications (hydro2d, mdljdp2, mdljsp2, nasa7, su2cor); most slots go unused, largely lost to d-cache and i-cache misses. From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism", ISCA 1995.]

Simultaneous Multi-threading...

[Figure: cycle-by-cycle issue slots, one thread vs. two threads sharing the same units. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes. With one thread, many slots go empty each cycle; with two threads, slots one thread cannot use are filled by the other.]

Administrivia: Big Game -- Go Cal!

Thursday /: Preliminary design document due, by PM.
Friday /: Review design document with TAs in lab section.
Sunday /: Revised design document due in email, by PM.
Friday /: Demo in lab section.

Administrivia: Mid-term and Field Trip

Mid-Term II review session: Sunday /, in Soda Hall (no lecture Tuesday).
Mid-Term II: Tuesday /, in Morgan Hall. LaVal's afterwards!
Xilinx field trip: Tuesday /; the bus leaves in the morning from Soda Hall. Send Doug an RSVP by PM today!
Thursday /: Advice on Presentations, to prepare you for your final project talk.

Power 5 Multi-Threading (continued)

[Figure: Power 5 instruction pipeline -- instruction fetch (PC, initial decodes), group formation and instruction decode, then branch, load/store, fixed-point, and floating-point execution pipes; group commit updates the architected register sets. (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0-D3 = decode stages, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data cache, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit.)]
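The two-thread issue diagram can be mimicked with a toy slot-filling model. The unit widths and the per-cycle instruction counts below are made up for illustration (the real Power issue rules are far more involved); only the unit legend comes from the slide:

```python
# Issue slots per cycle, using the diagram's unit legend:
# M = load/store, FX = fixed point, FP = floating point,
# BR = branch, CC = condition codes.
SLOTS = {"M": 2, "FX": 2, "FP": 2, "BR": 1, "CC": 1}   # 8 slots total

def slots_filled(threads):
    """Greedily fill each unit's slots from all threads combined."""
    filled = 0
    for unit, width in SLOTS.items():
        ready = sum(t.get(unit, 0) for t in threads)
        filled += min(width, ready)
    return filled

t0 = {"M": 1, "FX": 2, "BR": 1}   # one thread alone fills 4 of 8 slots
t1 = {"M": 2, "FP": 2, "CC": 1}   # a second thread with different needs
print(slots_filled([t0]))         # 4
print(slots_filled([t0, t1]))     # 8 -- together the threads fill every slot
```

The point of SMT is exactly this: slots that one thread cannot use in a given cycle are not wasted, because another thread's ready instructions can claim them.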
Power 5 data flow...

[Figure: Power 5 data flow -- per-thread program counters with alternate fetch, thread priority, and branch prediction (branch history tables, return stack, target cache); group formation and instruction decode; dispatch into shared issue queues; dynamic instruction selection; register mappers onto shared physical register files; shared execution units (LSU, FXU, LSU, FXU, FPU, FPU, BXU, CRL); store queue, L2 cache, and address translation; group completion writes the shared register files. Most resources are shared by the two threads; only a few are per-thread.]

Why only 2 threads? With 4, one of the shared resources (physical registers, caches, memory bandwidth) would be prone to bottleneck.

Power 5 thread performance...

The relative priority of each thread is controllable in hardware. For balanced operation, both threads run slower than if they owned the machine.

[Figure: instructions per cycle (IPC) of each thread as a function of relative thread priority, from single-thread mode at one extreme to power-save mode at the other; balanced priorities split the machine's throughput roughly evenly between the two threads.]

Multi-Core

Most of the Power 5 die is shared hardware.

Chip overview

The figure shows the Power 5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance. Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130-nm lithography, the chip uses eight metal levels and measures 389 mm².

The Power 5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with congruence classes of 128-byte lines. The data's real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller. We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power 5 chip.
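The slice-selection rule above ("the data's real address determines which L2 slice the data is cached in") can be sketched in a few lines. The article does not give the actual selection hash, so the modulo-3 mapping and the function name here are assumptions; only the 128-byte line size and the three-slice organization come from the text:

```python
LINE_BYTES = 128   # Power 5 L2 congruence classes hold 128-byte lines
N_SLICES = 3       # three identical slices, each with its own controller

def l2_slice(real_addr: int) -> int:
    """Map a real address to one of the three L2 slices.
    (Modulo over the line address is a stand-in for IBM's
    undocumented selection function.)"""
    return (real_addr // LINE_BYTES) % N_SLICES

# Every byte of one 128-byte line lands in the same slice;
# the next line can go to a different slice/controller:
print(l2_slice(0x0000), l2_slice(0x007F))   # same slice
print(l2_slice(0x0080))                     # next line
```

Spreading consecutive lines across slices lets the two cores drive the three L2 controllers independently instead of serializing on one.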
Having the L3 directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Recall: Superscalar utilization by a thread

[Figure: issue-slot utilization of an 8-way superscalar on SPEC92 applications (hydro2d, mdljdp2, mdljsp2, nasa7, su2cor), shown again; most slots are lost to d-cache and i-cache misses.]

Processor core

We designed the Power 5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. The Power 5's instruction pipeline is identical to the Power 4's. All latencies in the Power 5, including the branch misprediction penalty and the load-to-use latency with an L1 data cache hit, are the same as in the Power 4. The identical pipeline structure lets optimizations designed for Power 4-based systems perform equally well on Power 5-based systems. The article's figure shows the Power 5's instruction flow diagram.

In SMT mode, the Power 5 uses two separate instruction fetch address registers to store the program counters for the two threads.
Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power 5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.

In many cases, the on-chip caches and DRAM I/O bandwidth are also underutilized by one CPU. So, let two cores share them.

Core-to-core interactions stay on chip.

[Figure: Core #1 and Core #2 sharing the on-chip L2.]

The Power 5 supports a 1.875-Mbyte on-chip L2. Power 4 and Power 4+ systems both have 32-Mbyte L3 caches, whereas Power 5 systems have a 36-Mbyte L3. The L3 operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power 4 and Power 4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power 5's 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power 5 also have the significant side benefits of reducing latency to the L3 and main memory, as well as reducing the number of chips necessary to build a system.

[Figure (from Hot Chips): Power 5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).]
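The fetch policy just described -- strict alternation between the two program counters in SMT mode, the lone thread every cycle in ST mode -- reduces to a one-line rule. The function and mode names below are mine, not IBM's:

```python
def fetch_owner(mode: str, cycle: int) -> int:
    """Which thread the IF stage serves on a given cycle.
    SMT mode alternates between the two threads' instruction fetch
    address registers; ST mode always fetches for thread 0.  Either
    way, all instructions fetched in a cycle (up to eight) belong
    to a single thread."""
    return cycle % 2 if mode == "SMT" else 0

print([fetch_owner("SMT", c) for c in range(6)])  # [0, 1, 0, 1, 0, 1]
print([fetch_owner("ST", c) for c in range(6)])   # [0, 0, 0, 0, 0, 0]
```

Fetching for only one thread per cycle keeps the shared instruction cache and translation hardware simple, while still giving each SMT thread half the front-end bandwidth.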
Components shared by the two cores: L2 cache, L3 cache control, DRAM controller.

(1) Threads on two cores that use shared libraries conserve L2 memory.
(2) Threads on two cores share memory via L2 cache operations. Much faster than CPUs on separate chips.
The case for Sun's Niagara...

[Figure: issue-slot utilization of an 8-way superscalar on SPEC92 applications (hydro2d, mdljdp2, mdljsp2, nasa7, su2cor), shown again.]

Some apps struggle to reach an IPC of even 1. For throughput on these apps, a large number of single-issue cores is better than a few superscalars.

Niagara: 32 threads on one chip

8 cores: single-issue, 6-stage pipeline, 4-way multi-threaded, with fast crypto support.
Shared resources: 3 MB on-chip L2 cache, DDR2 DRAM interfaces, a shared FP unit, Gigabit Ethernet ports.
Die size: 378 mm² in 90 nm! Power: roughly 60 W.
Sources: Hot Chips (via EE Times, Infoworld); J. Schwartz weblog (Sun COO).

Niagara status: First motherboard runs. Source: J. Schwartz weblog (Sun COO).

Lab Town Meeting

Lab: Reflections from the TAs

Everyone worked hard. Only in retrospect did most students realize they also had to work smart.

Example: Only one group member knows how to download to the board. Once this member falls asleep, the group can't go on working...
Solution: Actually use the Lab Notebook to document processes. An example of working smart.

Example: Comprehensive test rigs are seen as a checkoff item for the Lab report, done last. Actual debugging proceeds in a haphazard, painful way.
A Better Way: One group spent hours up front writing a test module. Brandon: "The best testing I've ever seen." They finished on time. An example of working smart.
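A back-of-the-envelope model shows why many simple multithreaded cores win on these low-IPC workloads. The 0.3-IPC memory-bound thread and the saturation model are my assumptions for illustration; the core and thread counts are Niagara's:

```python
def chip_throughput(n_cores, threads_per_core, issue_width, thread_ipc):
    """Toy model: a memory-bound thread manages only `thread_ipc`
    instructions/cycle on its own, no matter how wide the core is;
    multithreading overlaps the stalls of several threads until the
    core's issue width saturates."""
    core_ipc = min(issue_width, threads_per_core * thread_ipc)
    return n_cores * core_ipc

# A memory-bound thread that averages 0.3 IPC (assumed workload):
print(chip_throughput(1, 1, 4, 0.3))   # one 4-issue superscalar: 0.3
print(chip_throughput(8, 4, 1, 0.3))   # Niagara, 8 cores x 4 threads: 8.0
```

The wide core's extra issue slots buy nothing while the single thread waits on memory; Niagara instead spends its die area on threads whose stalls overlap, so each single-issue core stays busy nearly every cycle.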
Lab: Discussion...

Example: A group has a long design meeting at the start of the project. Little is documented about signal names or state machine semantics. Members design incompatible modules, and suffer.
A Better Way: Carry notebooks (silicon or paper) to meetings, and force documentation of the decisions on details.

Conclusions: Throughput processing

Simultaneous multithreading: Instruction streams can share an out-of-order engine economically.
Multi-core: Once instruction-level parallelism runs dry, thread-level parallelism is a good use of die area.
Lab: Hard work is admirable, but even reasonable deadlines are hard to meet if you don't also work smart.