
CS 152 Computer Architecture and Engineering
Lecture 19: Advanced Processors III
2006-11-2
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
TAs: Udam Saini and Jue Sun
www-inst.eecs.berkeley.edu/~cs152/

Last Time: Dynamic Scheduling
(Pipeline diagram of the out-of-order machine shown Tuesday, the IBM Power 5: instruction fetch, group formation and instruction decode, and branch, load/store, fixed-point, and floating-point execution pipelines, with branch redirects and interrupt/flush paths.)
Fetch up to 8 instructions per cycle. Dispatch up to 5 instructions per cycle. Execute up to 8 instructions per cycle.
Up to 200 instructions in flight. 240 physical registers (120 int + 120 FP).
A thread may commit up to 5 instructions per cycle.

Today: Throughput and multiple threads
Goal: Use multiple instruction streams to improve (1) the throughput of machines that run many programs and (2) multi-threaded program execution time.
Example: Sun Niagara (32 instruction streams on a chip).
Difficulties: Gaining the full advantage requires rewriting applications, OS, and libraries.
Ultimate limiters: Amdahl's law (application dependent) and memory system performance.
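
Since Amdahl's law is named as the ultimate limiter, here is a quick worked example in C; the parallel fraction and thread counts below are illustrative choices, not figures from the lecture.

    /* Amdahl's law: speedup on N threads when only a fraction p of the
     * work can be spread across threads.  Numbers are illustrative. */
    #include <stdio.h>

    static double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void) {
        /* Even at 95% parallel work, 32 threads (one Niagara chip)
         * deliver far less than a 32x speedup. */
        printf("p=0.95, N=8  -> %.2fx\n", amdahl(0.95, 8));   /* ~5.93x  */
        printf("p=0.95, N=32 -> %.2fx\n", amdahl(0.95, 32));  /* ~12.55x */
        return 0;
    }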

Throughput Computing
Multithreading: Interleave instructions from separate threads on the same hardware; seen by the OS as several CPUs.
Multi-core: Integrate several processors that (partially) share a memory system on the same chip.

Multi-Threading (Static Pipelines)

Recall: The bypass network prevents stalls.
Instead of bypassing: interleave threads on the pipeline to prevent stalls...
(Datapath diagram: the 5-stage pipeline with register file, ALU, data memory, and the bypass muxes and write-back path.)

Introduced in 1964 by Seymour Cray: interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe.
T1: LW r1, 0(r2)
T2: ADD r7, r1, r4
T3: XORI r5, r4, #12
T4: SW 0(r7), r5
T1: LW r5, 12(r1)
(Pipeline diagram: each instruction proceeds F D X M W, with successive instructions, each from a different thread, starting one cycle apart.)
The last instruction in a thread always completes writeback before the next instruction in the same thread reads the regfile. The machine behaves like 4 CPUs, each running at 1/4 of the clock. (Datapath: 4 PCs and 4 register files, chosen each cycle by a thread-select counter.) Many variants...
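
A minimal software sketch of that thread-select idea, assuming a simple fixed round-robin policy; the thread count and loop below are illustrative, not taken from the slide. Each cycle the counter picks which thread's PC feeds fetch, so adjacent instructions in the pipe always belong to different threads.

    /* Fine-grained multithreading sketch: round-robin thread select.
     * With NTHREADS >= (pipeline depth - 1), an instruction writes back
     * before the same thread's next instruction reads the register file,
     * so no bypassing or interlocks are needed.  Illustrative only. */
    #include <stdio.h>

    #define NTHREADS 4
    #define CYCLES   12

    int main(void) {
        unsigned pc[NTHREADS] = {0};      /* one PC per thread */
        for (int cycle = 0; cycle < CYCLES; cycle++) {
            int t = cycle % NTHREADS;     /* thread-select counter */
            printf("cycle %2d: fetch thread T%d, pc=0x%04x\n",
                   cycle, t + 1, pc[t]);
            pc[t] += 4;                   /* next instruction for that thread */
        }
        return 0;
    }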

Multi-Threading (Dynamic Scheduling)

Power 4 (predates the Power 5 shown Tuesday)
Single-threaded predecessor to the Power 5. 8 execution units in the out-of-order engine; each may issue an instruction each cycle.
(Pipeline diagram: instruction fetch, instruction crack and group formation, and branch, load/store, fixed-point, and floating-point execution pipelines, with branch redirects and interrupt/flush paths.)

For most apps, most execution units lie idle
(Figure: percent of total issue cycles for an 8-way superscalar across SPEC applications, broken down into processor-busy time and losses from icache/dcache/TLB misses, branch mispredictions, control hazards, load delays, integer and FP latencies, and memory conflicts.)
Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?
From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.

Simultaneous Multi-threading...
(Figure: issue slots over nine cycles for a machine with 8 units, one thread vs. two threads sharing the same units; slots a single thread leaves empty can be filled by the second thread.)
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
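
A minimal sketch of the issue-slot idea, assuming a toy two-thread, 8-wide machine; the ready counts and the fill-thread-0-first policy are illustrative, not the Power 5's actual heuristics. Each cycle, slots not used by one thread are offered to the other.

    /* SMT issue sketch: fill up to WIDTH issue slots per cycle from any
     * thread that has ready instructions.  ready[] stands in for the
     * per-thread issue queues; filling thread 0 first is just one simple
     * policy for the sketch. */
    #include <stdio.h>

    #define WIDTH    8
    #define NTHREADS 2

    int main(void) {
        int ready[NTHREADS] = {3, 6};       /* ready instructions this cycle */
        int issued[NTHREADS] = {0};
        int slots = WIDTH;

        for (int t = 0; slots > 0 && t < NTHREADS; t++) {
            int n = ready[t] < slots ? ready[t] : slots;
            issued[t] = n;
            slots -= n;
        }
        for (int t = 0; t < NTHREADS; t++)
            printf("thread %d issues %d instruction(s)\n", t, issued[t]);
        printf("empty slots this cycle: %d\n", slots);
        return 0;
    }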

Power 4 vs. Power 5
(Pipeline diagrams: both share the same structure of instruction fetch, group formation and decode, and branch, load/store, fixed-point, and floating-point execution pipelines.)
Power 5 additions for two threads: 2 fetch streams (2 PCs) and 2 initial decodes at the front end, and 2 commits (2 architected register sets) at the back end.

Power 5 data flow...
(Block diagram: per-thread program counters, instruction buffers, and branch prediction (branch history tables, return stack, target cache) feed shared group formation, instruction decode, dispatch, shared register mappers, shared issue queues, shared register files, and shared execution units (LSU0/1, FXU0/1, FPU0/1, BXU, CRL), plus the store queue, data cache, and L2 cache, with a thread-priority control. Some resources are per-thread; the rest are shared by the two threads.)
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

Power 5 thread performance...
Relative priority of each thread is controllable in hardware. For balanced operation, both threads run slower than if they owned the machine.
(Plot: instructions per cycle for thread 0 and thread 1 as the hardware priority pair is swept from single-thread mode, through settings such as 7,7 down to 1,1, to power-save mode.)

This Friday: Memory System Checkoff
Run your test vector suite on the Calinx board and display the results on the LEDs.
(Block diagram: test vectors drive the instruction cache and data cache, which connect over the IM and DM buses to the DRAM controller and DRAM.)

Multi-Core

Recall: Superscalar utilization by a thread
(Figure: the same 8-way superscalar issue-cycle breakdown shown earlier.)
Observation: In many cases, the on-chip cache and DRAM I/O bandwidth are also underutilized by one CPU. So, let 2 cores share them.

Most of the Power 5 die is shared hardware
(Die photo: Core #1 and Core #2, surrounded by shared components: the L2 cache, L3 cache control, and DRAM controller.)

Core-to-core interactions stay on chip
(1) Threads on two cores that use shared libraries conserve L2 memory.
(2) Threads on two cores share memory via L2 cache operations, much faster than 2 CPUs on 2 chips.

Coming in 2007: 4 cores per die...
Current products from Intel and AMD use 2 CPU cores. Both are planning 4-core designs.

Sun Niagara

The case for Sun's Niagara...
(Figure: the same 8-way superscalar issue-cycle breakdown.)
Observation: Some apps struggle to reach CPI == 1. For throughput on these apps, a large number of single-issue cores is better than a few superscalars.

Niagara: 32 threads on one chip
8 cores: single-issue, 1.2 GHz, 6-stage pipeline, 4-way multi-threaded, fast crypto support.
Die size: 340 mm² in 90 nm. Power: 50-60 W.
Shared resources: 3 MB on-chip cache, 4 DDR2 interfaces (32 GB DRAM, 20 Gb/s), 1 shared FP unit, GB Ethernet ports.
Sources: Hot Chips (via EE Times), Infoworld, J. Schwartz weblog (Sun COO).
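
A toy latency-hiding model of why each single-issue core carries 4 threads; the busy/miss cycle counts below are illustrative assumptions, not Niagara measurements.

    /* Each thread alternates C busy cycles with an M-cycle memory stall.
     * With T threads interleaved on one single-issue core, the core stays
     * busy while any thread has work, up to full utilization. */
    #include <stdio.h>

    static double utilization(int threads, double c, double m) {
        double u = threads * c / (c + m);
        return u > 1.0 ? 1.0 : u;
    }

    int main(void) {
        double c = 20.0, m = 60.0;   /* 20 busy cycles, then a 60-cycle miss */
        for (int t = 1; t <= 4; t++)
            printf("%d thread(s): core utilization = %.0f%%\n",
                   t, 100.0 * utilization(t, c, m));
        return 0;
    }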

The board that booted Niagara first silicon.
Source: J. Schwartz weblog (then Sun COO, now CEO).

Used in the Sun Fire T2000: Coolthreads
Claim: the server uses 1/3 the power of competing servers. Web server benchmarks were used to position the T2000 in the market.

Project Blackbox
A data center in a 20-ft shipping container: servers, air-conditioners, power distribution.

Just hook up network, power, and water...


Holds 250 T1000 servers: 2000 CPU cores, 8000 threads.


Cell: The PS3 chip

(Die photo: the 512 KB L2 cache, the PowerPC core, and the Synergistic Processing Units (SPUs).)
The PowerPC manages the 8 SPUs and also runs serial code. 2X the area of a Pentium 4; 4 GHz+ cycle time.


Synergistic Processing Units (SPUs)
8 cores using local memory, not traditional caches.

One Synergistic Processing Unit (SPU)
Programmers manage caching explicitly: 256 KB Local Store and 128 128-bit registers.
The SPU issues 2 instructions/cycle (in order) to 7 execution units.
The SPU fills its Local Store using DMA to DRAM and the network.
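
A sketch of that explicit-caching style, assuming a double-buffered streaming loop; dma_get() and dma_wait() are placeholder names for an asynchronous DMA interface, not the actual Cell SDK calls, and CHUNK and process() are likewise made up for illustration.

    /* Stream a large DRAM array through the small Local Store in
     * CHUNK-sized pieces, overlapping the next DMA with processing
     * of the current buffer (double buffering).  Illustrative sketch. */
    #include <stddef.h>

    #define CHUNK 4096                     /* bytes per DMA transfer */

    extern void dma_get(void *ls_dst, unsigned long dram_src,
                        size_t bytes, int tag);   /* start async copy */
    extern void dma_wait(int tag);                /* block until tag done */
    extern void process(const char *buf, size_t bytes);

    void stream_from_dram(unsigned long dram_base, size_t total)
    {
        static char buf[2][CHUNK];         /* two Local Store buffers */
        int cur = 0;

        dma_get(buf[cur], dram_base, CHUNK, cur);     /* prefetch first chunk */
        for (size_t off = 0; off < total; off += CHUNK) {
            int next = cur ^ 1;
            if (off + CHUNK < total)                  /* overlap the next copy */
                dma_get(buf[next], dram_base + off + CHUNK, CHUNK, next);
            dma_wait(cur);                            /* current chunk arrived */
            process(buf[cur], CHUNK);
            cur = next;
        }
    }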


(Die photo: the L2 cache and PowerPC core highlighted.)

Example: Using Cell to Decode HDTV


Conclusions: Throughput processing
Simultaneous multithreading: Instruction streams can share an out-of-order engine economically.
Multi-core: Once instruction-level parallelism runs dry, thread-level parallelism is a good use of die area.