CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 32: Pipeline Parallelism 3


Instructor: Dan Garcia (inst.eecs.berkeley.edu/~cs61c)

Computing in the News
At a laboratory in São Paulo, a Duke University neuroscientist is in his own race with the World Cup clock. He is rushing to finish work on a mind-controlled exoskeleton that he says a paralyzed Brazilian volunteer will don, navigate across a soccer pitch using his or her thoughts, and use to make the ceremonial opening kick of the tournament on June 12. www.technologyreview.com/news/526336/world-cup-mind-control-demo-faces-deadlines-critics

You Are Here! Harness Parallelism & Achieve High Performance
- Parallel Requests, assigned to computer (e.g., search "Katz") - Warehouse Scale Computer
- Parallel Threads, assigned to core (e.g., lookup, ads)
- Parallel Instructions: >1 instruction at one time (e.g., 5 pipelined instructions) - Today's Lecture
- Parallel Data: >1 data item at one time (e.g., add of 4 pairs of words: A3+B3, A2+B2, A1+B1, A0+B0)
- Hardware descriptions: all gates functioning in parallel at the same time
[Diagram: software/hardware levels from Warehouse Scale Computer down through Computer, Core (Instruction Unit(s), Functional Unit(s)), Memory (Cache), Main Memory, Input/Output, Smart Phone, Logic Gates]

P&H Figure 4.50

P&H Figure 4.51: Pipelined Control

Hazards: situations that prevent starting the next logical instruction in the next clock cycle
1. Structural hazards: required resource is busy (e.g., roommate studying)
2. Data hazard: need to wait for a previous instruction to complete its data read/write (e.g., pair of socks in different loads; see the MIPS example below)
3. Control hazard: deciding on a control action depends on a previous instruction (e.g., how much detergent, based on how clean the prior load turns out)
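For instance (a minimal MIPS sketch; the registers are chosen just for illustration), a data hazard arises whenever an instruction reads a register that an earlier instruction still in the pipeline has not yet written back:

    add $t0, $t1, $t2   # writes $t0 in the WB stage (cycle 5)
    sub $t3, $t0, $t4   # needs $t0 in the EX stage (cycle 4): data hazard
                        # resolved by forwarding the ALU result, or by stalling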

3. Control Hazards
- Branch determines flow of control: fetching the next instruction depends on the branch outcome
- Pipeline can't always fetch the correct instruction: still working on the ID stage of the branch (BEQ, BNE in the MIPS pipeline)
- Simple solution, Option 1: stall on every branch until we have the new PC value
- Would add 2 bubbles/clock cycles for every branch (~20% of instructions executed)! The cost is worked out below.
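To put a number on it (a back-of-the-envelope check using the slide's own figures), with a 2-cycle stall on every branch and branches making up about 20% of instructions:

    CPI = 1 + 0.20 × 2 = 1.4

so the machine runs roughly 40% slower than the ideal CPI of 1.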

Stall => 2 Bubbles/Clocks
[Pipeline diagram: beq followed by Instr 1-4, each flowing through I$, Reg, ALU, D$, Reg; the two instructions after beq are bubbled.] Where do we do the compare for the branch?

Control Hazard: Branching
- Optimization #1: insert a special branch comparator in Stage 2. As soon as the instruction is decoded (the opcode identifies it as a branch), immediately make a decision and set the new value of the PC
- Benefit: since the branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed
- Side note: this means branches are idle in Stages 3, 4 and 5
- Question: what's an efficient way to implement the equality comparison?

One Clock Cycle Stall
[Pipeline diagram: beq followed by Instr 1-4; only one bubble after beq.] Branch comparator moved to Decode stage.

Control Hazards: Branching
- Option 2: predict the outcome of a branch, fix up if the guess is wrong
- Must cancel all instructions in the pipeline that depended on the wrong guess: this is called flushing the pipeline
- Simplest hardware if we predict that all branches are NOT taken. Why? (See the sketch below.)
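As a sketch (the instructions and registers here are made up for illustration), predict-not-taken simply keeps fetching sequentially, and the speculatively fetched instructions are flushed if the branch turns out to be taken:

            beq  $t0, $t1, Target   # predicted not taken: keep fetching at PC+4
            addu $t2, $t3, $t4      # speculative; flushed if the branch is taken
            ...
    Target: sub  $t5, $t6, $t7     # fetched only once a taken branch resolves

Predict-not-taken is the simplest choice because the pipeline already fetches from PC+4 by default, so a correct guess costs no extra hardware at all.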

Control Hazards: Branching
- Option #3: redefine branches
- Old definition: if we take the branch, none of the instructions after the branch get executed by accident
- New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (the branch-delay slot)
- Delayed branch means we always execute the instruction after the branch
- This optimization is used with MIPS

Example: Nondelayed vs. Delayed Branch

    Nondelayed Branch         Delayed Branch
    or   $8, $9, $10          add  $1, $2, $3
    add  $1, $2, $3           sub  $4, $5, $6
    sub  $4, $5, $6           beq  $1, $4, Exit
    beq  $1, $4, Exit         or   $8, $9, $10
    xor  $10, $1, $11         xor  $10, $1, $11
    Exit:                     Exit:

Control Hazards: Branching
Notes on the Branch-Delay Slot
- Worst-case scenario: put a no-op in the branch-delay slot
- Better case: place some instruction preceding the branch in the branch-delay slot, as long as the change doesn't affect the logic of the program
- Re-ordering instructions is a common way to speed up programs; the compiler usually finds such an instruction about 50% of the time
- Jumps also have a delay slot

Greater Instruction-Level Parallelism (ILP) (P&H 4.10: Parallelism and Advanced Instruction-Level Parallelism)
- Deeper pipeline (5 => 10 => 15 stages): less work per stage => shorter clock cycle
- Multiple issue ("superscalar"): replicate pipeline stages => multiple pipelines, start multiple instructions per clock cycle
- CPI < 1, so use Instructions Per Cycle (IPC) instead
- E.g., a 4 GHz 4-way multiple-issue machine peaks at 4 × 4 = 16 billion instructions per second (16 BIPS): peak CPI = 0.25, peak IPC = 4
- But dependencies reduce this in practice

Multiple Issue
- Static multiple issue: compiler groups instructions to be issued together and packages them into issue slots; the compiler detects and avoids hazards
- Dynamic multiple issue: CPU examines the instruction stream and chooses instructions to issue each cycle; the compiler can help by reordering instructions; the CPU resolves hazards using advanced techniques at runtime

Superscalar Laundry: Parallel per stage
[Laundry diagram: tasks A-F (light, dark, very dirty, light, dark, very dirty clothing) run through the stages in parallel, 30 min per stage.] More resources, HW to match the mix of parallel tasks?

Pipeline Depth and Issue Width: Intel Processors over Time

    Microprocessor     Year  Clock Rate  Pipeline Stages  Issue Width  Cores  Power
    i486               1989    25 MHz          5               1         1      5 W
    Pentium            1993    66 MHz          5               2         1     10 W
    Pentium Pro        1997   200 MHz         10               3         1     29 W
    P4 Willamette      2001  2000 MHz         22               3         1     75 W
    P4 Prescott        2004  3600 MHz         31               3         1    103 W
    Core 2 Conroe      2006  2930 MHz         14               4         2     75 W
    Core 2 Yorkfield   2008  2930 MHz         16               4         4     95 W
    Core i7 Gulftown   2010  3460 MHz         16               4         6    130 W

Pipeline Depth and Issue Width
[Chart: clock rate, power, pipeline stages, issue width, and cores for the processors above, plotted on a log scale from 1 to 10000 across 1989-2010.]

Static Multiple Issue
- Compiler groups instructions into issue packets: a group of instructions that can be issued in a single cycle, determined by the pipeline resources required
- Think of an issue packet as a very long instruction specifying multiple concurrent operations

Scheduling Static Multiple Issue
- Compiler must remove some/all hazards: reorder instructions into issue packets, with no dependencies within a packet
- Possibly some dependencies between packets: varies between ISAs; the compiler must know!
- Pad the issue packet with nop if necessary

MIPS with Static Dual Issue
- Two-issue packets: one ALU/branch instruction plus one load/store instruction
- 64-bit aligned: ALU/branch first, then load/store; pad an unused slot with nop

    Address  Instruction type  Pipeline stages
    n        ALU/branch        IF ID EX MEM WB
    n + 4    Load/store        IF ID EX MEM WB
    n + 8    ALU/branch           IF ID EX MEM WB
    n + 12   Load/store           IF ID EX MEM WB
    n + 16   ALU/branch              IF ID EX MEM WB
    n + 20   Load/store              IF ID EX MEM WB

Hazards in the Dual-Issue MIPS
- More instructions executing in parallel
- EX data hazard: forwarding avoided stalls with single issue, but now we can't use an ALU result in a load/store in the same packet:

      add $t0, $s0, $s1
      lw  $s2, 0($t0)

  must be split into two packets, effectively a stall (see the sketch below)
- Load-use hazard: still one cycle of use latency, but it now covers two instructions
- More aggressive scheduling required
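Concretely (a sketch; the nop padding follows the packet format above), the dependent pair occupies two issue packets, wasting both companion slots:

    ALU/branch slot        Load/store slot
    add $t0, $s0, $s1      nop                 # packet 1
    nop                    lw  $s2, 0($t0)     # packet 2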

Scheduling Example
Schedule this for dual-issue MIPS:

    Loop: lw   $t0, 0($s1)       # $t0 = array element
          addu $t0, $t0, $s2     # add scalar in $s2
          sw   $t0, 0($s1)       # store result
          addi $s1, $s1, -4      # decrement pointer
          bne  $s1, $zero, Loop  # branch if $s1 != 0

Scheduled result:

    ALU/branch                  Load/store        Cycle
    Loop: nop                   lw $t0, 0($s1)      1
          addi $s1, $s1, -4     nop                 2
          addu $t0, $t0, $s2    nop                 3
          bne  $s1, $zero, Loop sw $t0, 4($s1)      4

IPC = 5/4 = 1.25 (c.f. peak IPC = 2)

Loop Unrolling
- Replicate the loop body to expose more parallelism; reduces loop-control overhead
- Use different registers per replication: called register renaming
- Avoids loop-carried anti-dependencies: a store followed by a load of the same register; aka name dependence (reuse of a register name)
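Before scheduling, the unrolled source looks roughly like this (a sketch assuming the loop above, unrolled 4× as the array pointer steps down in 4-byte increments; renaming gives each copy its own temporary $t0-$t3 so the copies are independent):

    Loop: lw   $t0, 0($s1)
          addu $t0, $t0, $s2
          sw   $t0, 0($s1)
          lw   $t1, -4($s1)
          addu $t1, $t1, $s2
          sw   $t1, -4($s1)
          ...                    # copies 3 and 4 use $t2, $t3 at -8($s1), -12($s1)
          addi $s1, $s1, -16     # one pointer update for all 4 copies
          bne  $s1, $zero, Loop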

Loop Unrolling Example

    ALU/branch                  Load/store        Cycle
    Loop: addi $s1, $s1, -16    lw $t0, 0($s1)      1
          nop                   lw $t1, 12($s1)     2
          addu $t0, $t0, $s2    lw $t2, 8($s1)      3
          addu $t1, $t1, $s2    lw $t3, 4($s1)      4
          addu $t2, $t2, $s2    sw $t0, 16($s1)     5
          addu $t3, $t3, $s2    sw $t1, 12($s1)     6
          nop                   sw $t2, 8($s1)      7
          bne  $s1, $zero, Loop sw $t3, 4($s1)      8

IPC = 14/8 = 1.75: closer to 2, but at the cost of registers and code size

Dynamic Multiple Issue
- "Superscalar" processors
- CPU decides whether to issue 0, 1, 2, ... instructions each cycle, avoiding structural and data hazards
- Avoids the need for compiler scheduling (though it may still help)
- Code semantics are ensured by the CPU

Dynamic Pipeline Scheduling
- Allow the CPU to execute instructions out of order to avoid stalls, but commit results to registers in order
- Example:

      lw   $t0, 20($s2)
      addu $t1, $t0, $t2
      subu $s4, $s4, $t3
      slti $t5, $s4, 20

  Can start subu while addu is waiting for lw
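To make the reordering concrete (a sketch of one possible dynamic schedule for the example above; commit still happens in program order):

    Program order           One possible execute order
    lw   $t0, 20($s2)       lw   $t0, 20($s2)    # long latency (e.g., cache miss)
    addu $t1, $t0, $t2      subu $s4, $s4, $t3   # independent: runs under the lw
    subu $s4, $s4, $t3      slti $t5, $s4, 20    # depends only on subu
    slti $t5, $s4, 20       addu $t1, $t0, $t2   # must wait for the lw result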

Why Do Dynamic Scheduling?
- Why not just let the compiler schedule code?
- Not all stalls are predictable (e.g., cache misses)
- Can't always schedule around branches: branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards

Speculation
- Guess what to do with an instruction: start the operation as soon as possible, then check whether the guess was right
- If so, complete the operation; if not, roll back and do the right thing
- Common to static and dynamic multiple issue
- Examples: speculate on branch outcome (branch prediction), roll back if the path taken is different; speculate on a load, roll back if the location is updated

Pipeline Hazard: Matching socks in later load
[Laundry diagram: tasks A-F, 30 min per stage, with a bubble in the pipeline.] A depends on D; stall since the folder is tied up.

Out-of-Order Laundry: Don't Wait
[Laundry diagram: task A's dependent stage bubbles while the other tasks keep flowing.] A depends on D; the rest continue; need more resources to allow out-of-order execution.

Out-of-Order Intel: all use OOO since 2001

    Microprocessor     Year  Clock Rate  Pipeline Stages  Issue Width  Out-of-order/Speculation  Cores  Power
    i486               1989    25 MHz          5               1                No                 1      5 W
    Pentium            1993    66 MHz          5               2                No                 1     10 W
    Pentium Pro        1997   200 MHz         10               3                Yes                1     29 W
    P4 Willamette      2001  2000 MHz         22               3                Yes                1     75 W
    P4 Prescott        2004  3600 MHz         31               3                Yes                1    103 W
    Core 2 Conroe      2006  2930 MHz         14               4                Yes                2     75 W
    Core 2 Yorkfield   2008  2930 MHz         16               4                Yes                4     95 W
    Core i7 Gulftown   2010  3460 MHz         16               4                Yes                6    130 W

Does Multiple Issue Work? The BIG Picture
- Yes, but not as much as we'd like
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate (e.g., pointer aliasing)
- Some parallelism is hard to expose: limited window size during instruction issue
- Memory delays and limited bandwidth make it hard to keep pipelines full
- Speculation can help if done well

And in Conclusion...
- Pipelining is an important form of ILP
- The challenge is (are?) hazards: forwarding helps with many data hazards; the delayed branch helps with the control hazard in the 5-stage pipeline; a load delay slot / interlock is necessary
- More aggressive performance: longer pipelines, superscalar, out-of-order execution, speculation