Multi-Threading. Last Time: Dynamic Scheduling. Recall: Throughput and multiple threads. This Time: Throughput Computing

Transcription:

CS 152 Computer Architecture and Engineering. Lecture: Advanced Processors III. Dave Patterson (www.cs.berkeley.edu/~patterson), John Lazzaro (www.cs.berkeley.edu/~lazzaro). www-inst.eecs.berkeley.edu/~cs152/

Last Time: Dynamic Scheduling. Each line of the reorder buffer holds physical <src1, src2, dest> registers for an instruction, and controls when it executes. [Figure: reorder buffer datapath. Load and store units sit between memory and the reorder buffer; each entry records an inst #, <src #, src val> pairs, and <dest #, dest val>; the ALUs broadcast results on a common bus as <dest #, dest val>.] The execution engine works on the physical registers, not the architected registers.

Recall: Throughput and multiple threads. Goal: use multiple instruction streams to improve (1) the throughput of machines that run many programs and (2) the execution time of multithreaded programs. Example: Sun Niagara (32 instruction streams on a chip). Difficulties: gaining the full advantage requires rewriting applications, OS, and libraries. Ultimate limiter: Amdahl's law (application dependent; a worked example follows below). Also: memory system performance.

This Time: Throughput Computing. Multithreading: interleave instructions from separate threads on the same hardware; seen by the OS as several CPUs. Multi-core: integrate several processors that (partially) share a memory system on the same chip. Also: a town meeting discussion on lessons learned from the lab.

Multi-Threading. Power 4 (predates the Power 5 shown Tuesday). Single-threaded predecessor to the Power 5. Eight execution units in the out-of-order engine, each of which may issue an instruction every cycle. [Figure: Power 4 pipeline -- instruction fetch (IF, IC, BP), decode and crack/group formation (D0-D3, GD), then branch (BR, EX), load/store (EA, DC, Fmt), fixed-point (EX), and floating-point (F) execution pipes.]
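The reorder-buffer line just described maps naturally onto a small data structure. A minimal C sketch (field names and widths are illustrative, not the actual Power 4 design):

    #include <stdbool.h>
    #include <stdint.h>

    /* One reorder-buffer line: physical-register tags plus captured values.
       An instruction may execute once both source values have arrived on the
       common bus; the entry retires in program order once 'done' is set. */
    typedef struct {
        uint16_t inst_num;             /* program-order index of the instruction */
        uint8_t  src1_tag, src2_tag;   /* physical source-register numbers       */
        int64_t  src1_val, src2_val;   /* captured operand values                */
        bool     src1_ready, src2_ready;
        uint8_t  dest_tag;             /* physical destination-register number   */
        int64_t  dest_val;             /* result, broadcast as <dest #, dest val>*/
        bool     done;                 /* result written; safe to commit         */
    } RobEntry;

    /* Snoop the common bus: any waiting source with a matching tag grabs
       the broadcast value, which is what lets the entry decide when to run. */
    static void snoop(RobEntry *e, uint8_t bus_tag, int64_t bus_val) {
        if (!e->src1_ready && e->src1_tag == bus_tag) { e->src1_val = bus_val; e->src1_ready = true; }
        if (!e->src2_ready && e->src2_tag == bus_tag) { e->src2_val = bus_val; e->src2_ready = true; }
    }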

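Amdahl's law, named above as the ultimate limiter, as a short worked instance in LaTeX (the parallel fraction p = 0.9 is an illustrative assumption, not a measured value):

    % Speedup from running fraction p of a program on N hardware threads:
    S(N) = \frac{1}{(1-p) + p/N}
    % Example with p = 0.9 and N = 32 threads (Niagara-sized):
    S(32) = \frac{1}{0.1 + 0.9/32} = \frac{1}{0.128} \approx 7.8
    % Even 32 threads cap out near 8x: S(\infty) = 1/(1-p) = 10.

This is why the slide calls the limiter "application dependent": the achievable speedup is set by the serial fraction of each workload, not by the thread count.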
For most apps, most execution units lie idle. Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware? [Figure: issue-slot utilization of an 8-way superscalar by a single thread, per application (hydro2d, mdljdp2, mdljsp2, nasa7, su2cor, ...), with wasted slots broken down by cause such as d-cache and i-cache misses. From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.]

Simultaneous Multi-threading... [Figure: issue-slot diagrams, one thread vs. two threads on 8 units; each cycle offers the slots M M FX FX FP FP BR CC. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes. Slots left empty by one thread can be filled by the other; see the toy model below.]

Administrivia: Big Game -- Go Cal! Thursday: preliminary design document due (by PM). Friday: review the design document with TAs in lab section. Sunday: revised design document due by email (by PM). Friday: demo in lab section.

Administrivia: Mid-term and Field Trip. Mid-Term II review session: Sunday, in Soda (no lecture Tuesday). Mid-Term II: Tuesday, in Morgan. LaVal's afterwards! Xilinx field trip: Tuesday, bus leaves from Soda Hall; send Doug your RSVP by PM today! Thursday: advice on presentations, to prepare you for your final project talk.

Multi-Threading (continued). Power 5. [Figure: Power 5 instruction pipeline, with the same structure as the Power 4's: instruction fetch (IF, IC, BP), decode and group formation (D0-D3, GD), then separate branch, load/store, fixed-point, and floating-point pipes, ending in group commit to the architected register sets. IF = instruction fetch, IC = instruction cache, BP = branch predict, D0-D3 = decode stages, XFER = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data cache, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, CP = group commit.]
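A toy C model of the two-threads-share-the-slots picture above. The per-slot occupancy probability is an illustrative stand-in for real per-application ILP limits, not data from the Tullsen study:

    #include <stdio.h>
    #include <stdlib.h>

    #define SLOTS 8   /* M M FX FX FP FP BR CC */

    /* Pretend a thread offers an instruction for each slot with probability
       'occupancy'; returns a bitmask of the slots it could fill this cycle. */
    static int offers(double occupancy) {
        int mask = 0;
        for (int s = 0; s < SLOTS; s++)
            if ((double)rand() / RAND_MAX < occupancy) mask |= 1 << s;
        return mask;
    }

    int main(void) {
        const int cycles = 1000000;
        long one_thread = 0, two_threads = 0;
        srand(152);
        for (int c = 0; c < cycles; c++) {
            int a = offers(0.35), b = offers(0.35);
            for (int s = 0; s < SLOTS; s++) {
                one_thread  += (a >> s) & 1;        /* thread A alone        */
                two_threads += ((a | b) >> s) & 1;  /* B fills A's idle slots */
            }
        }
        printf("avg slots filled per cycle: 1 thread %.2f, 2 threads %.2f (of %d)\n",
               (double)one_thread / cycles, (double)two_threads / cycles, SLOTS);
        return 0;
    }

With these made-up occupancies the second thread lifts utilization from about 2.8 to about 4.6 of 8 slots, which is the whole argument for SMT: the extra slots were being wasted anyway.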

Power 5 data flow... [Figure: Power 5 instruction and data flow. Two program counters alternate instruction fetch; branch history tables, a return stack, and a target cache drive branch prediction; group formation and decode feed dispatch; register mappers place instructions in shared issue queues; dynamic instruction selection reads the shared register files and issues to the execution units (LSU, FXU, LSU, FXU, FPU, FPU, BXU, CRL); stores pass through the store queue and caches to the L2; group completion retires in order. Some resources are shared by the two threads, others are replicated per thread, and thread priority steers the shared ones.]

Why only two threads? With four, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

Power 5 thread performance... The relative priority of each thread is controllable in hardware. For balanced operation, both threads run slower than if each owned the machine. [Figure: instructions per cycle (IPC) of thread 0 and thread 1 across the range of relative thread priorities, from single-thread mode through balanced operation down to power-save mode.]

Multi-Core. Most of the Power 5 die is shared hardware.

Chip overview. Figure shows the Power 5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance. Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130-nm lithography, the chip uses eight metal levels and measures 389 mm². The Power 5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data's real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller. We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power 5 chip. Having the L3 directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Recall: superscalar utilization by a thread. [Figure repeated from above: issue-slot utilization of an 8-way superscalar, per application, with d-cache and i-cache miss breakdown.]

Processor core. We designed the Power 5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure shows the Power 5's instruction pipeline, which is identical to the Power 4's. All latencies in the Power 5, including the branch misprediction penalty and the load-to-use latency with an L1 data cache hit, are the same as in the Power 4. The identical pipeline structure lets optimizations designed for Power 4-based systems perform equally well on Power 5-based systems. Figure shows the Power 5's instruction flow diagram.

In SMT mode, the Power 5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power 5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.

In many cases, the on-chip cache and DRAM I/O bandwidth are also underutilized by one CPU. So, let cores share them. Core-to-core interactions stay on chip. [Figure: Power 5 chip photo with core #1 and core #2 (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, MC = memory controller), plus the shared L2 cache, L3 cache control, and DRAM controller.] (1) Threads on two cores that use shared libraries conserve L2 memory. (2) Threads on two cores can share memory via L2 cache operations, much faster than CPUs on separate chips.

The Power 5 supports a 36-Mbyte L3 cache. Power 4 and Power 4+ systems both have 32-Mbyte L3 caches, whereas Power 5 systems have a 36-Mbyte L3. The L3 operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power 4 and Power 4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power 5's 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power 5 also have the significant side benefits of reducing latency to the L3 and main memory, as well as reducing the number of chips necessary to build a system.
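The SMT fetch policy described above (two fetch address registers, alternating per cycle, all fetched instructions from one thread) in a small C sketch. FetchStage and icache_fetch are illustrative names, not IBM's implementation:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        uint64_t pc[2];     /* two instruction fetch address registers      */
        bool     smt_mode;  /* SMT: alternate threads; ST: thread 0 only    */
        int      turn;      /* which thread fetches this cycle (SMT mode)   */
    } FetchStage;

    /* Stub standing in for the shared instruction cache (illustrative). */
    static int icache_fetch(uint64_t pc, uint32_t out[8], int max) {
        for (int i = 0; i < max; i++) out[i] = (uint32_t)(pc >> 2) + (uint32_t)i;
        return max;
    }

    /* One IF-stage cycle: pick the fetching thread, grab up to eight
       instructions for that thread alone, then hand the slot over. */
    static int fetch_cycle(FetchStage *f, uint32_t out[8]) {
        int t = f->smt_mode ? f->turn : 0;
        int n = icache_fetch(f->pc[t], out, 8);
        f->pc[t] += 4u * (uint32_t)n;   /* advance that thread's PC (no branch) */
        if (f->smt_mode) f->turn ^= 1;  /* alternate threads next cycle         */
        return n;
    }

    int main(void) {
        FetchStage f = { {0x1000, 0x2000}, true, 0 };
        uint32_t buf[8];
        for (int c = 0; c < 4; c++) {
            int t = f.turn;
            int n = fetch_cycle(&f, buf);
            printf("cycle %d: fetched %d instructions for thread %d\n", c, n, t);
        }
        return 0;
    }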

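One more sketch for the multi-core points above, this time point (2): two threads, which the OS may place on the chip's two cores, hand data off through ordinary shared memory, and on a Power 5-style chip that hand-off can be satisfied from the shared on-chip L2. The program itself is generic pthreads C, not Power-specific:

    #include <pthread.h>
    #include <stdio.h>

    /* Shared state: with both threads on one chip, the producer-to-consumer
       transfer can stay in the shared L2 instead of going off chip. */
    static int shared_value;
    static int ready;
    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg) {
        (void)arg;
        pthread_mutex_lock(&m);
        shared_value = 42;              /* written on one core...            */
        ready = 1;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_mutex_lock(&m);
        while (!ready) pthread_cond_wait(&cv, &m);
        printf("consumer read %d\n", shared_value);  /* ...read on the other */
        pthread_mutex_unlock(&m);
        pthread_join(t, NULL);
        return 0;
    }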
The case for Sun's Niagara... [Figure repeated from above: issue-slot utilization of an 8-way superscalar, per application, with d-cache and i-cache miss breakdown.] Some apps struggle to reach an IPC of 1. For throughput on these apps, a large number of single-issue cores is better than a few superscalars.

Niagara: 32 threads on one chip. 8 cores, each single-issue, 6-stage, 4-way multi-threaded, with fast crypto support. Shared resources: 3 MB of on-chip cache, DDR2 interfaces to DRAM, 1 shared FP unit, and Gigabit Ethernet ports. Fabricated in a 90-nm process. Sources: Hot Chips presentations, via EE Times and Infoworld; J. Schwartz weblog (Sun COO). Niagara status: first motherboard runs. Source: J. Schwartz weblog (Sun COO).

Lab Town Meeting. Lab: Reflections from the TAs. Everyone worked hard; only in retrospect did most students realize they also had to work smart. Example: only one group member knows how to download designs to the board. Once this member falls asleep, the group can't go on working... Solution: actually use the lab notebook to document processes. An example of working smart. Example: comprehensive test rigs are seen as a checkoff item for the lab report, done last, so actual debugging proceeds in a haphazard, painful way. A better way: one group spent hours up front writing a test module. Brandon: "The best testing I've ever seen." They finished on time. Another example of working smart.
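That "test rig first" discipline is easy to illustrate. A minimal self-checking harness in C; the unit under test here (saturating_add8) is a hypothetical stand-in, since the actual labs would exercise a Verilog module with an analogous testbench:

    #include <stdio.h>

    /* Unit under test -- stand-in for the module being debugged. */
    static int saturating_add8(int a, int b) {
        int s = a + b;
        if (s > 127)  return 127;
        if (s < -128) return -128;
        return s;
    }

    /* Self-checking rig: a table of inputs and expected outputs,
       written before debugging starts and run in one shot. */
    int main(void) {
        struct { int a, b, expect; } cases[] = {
            {1, 2, 3}, {100, 100, 127}, {-100, -100, -128},
            {127, 0, 127}, {-128, -1, -128}, {0, 0, 0},
        };
        int failures = 0;
        for (unsigned i = 0; i < sizeof cases / sizeof cases[0]; i++) {
            int got = saturating_add8(cases[i].a, cases[i].b);
            if (got != cases[i].expect) {
                printf("FAIL case %u: add(%d,%d) = %d, expected %d\n",
                       i, cases[i].a, cases[i].b, got, cases[i].expect);
                failures++;
            }
        }
        printf("%s (%d failures)\n", failures ? "BROKEN" : "ALL PASS", failures);
        return failures != 0;
    }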

Lab: Reflections from the TAs. Lab: Discussion... Example: a group holds a long design meeting at the start of the project, but little is documented about signal names or state machine semantics. Members design incompatible modules and suffer. A better way: carry notebooks (silicon or paper) to meetings, and force documentation of the decisions on details.

Conclusions: Throughput processing. Simultaneous multithreading: instruction streams can share an out-of-order engine economically. Multi-core: once instruction-level parallelism runs dry, thread-level parallelism is a good use of die area. Lab: hard work is admirable, but even reasonable deadlines are hard to meet if you don't also work smart.