CSE 502: Computer Architecture

CSE 502: Computer Architecture Instruction Commit

The End of the Road (um, Pipe). Commit is typically the last stage of the pipeline. Anything an insn. does at this point is irrevocable. Only actions that follow sequential execution are allowed; e.g., wrong-path instructions may not commit, because they do not exist in the sequential execution.

Everything Should Appear In-Order. The ISA defines program execution in sequential order. To the outside, the CPU must appear to execute in order. When is someone looking?

Looking at CPU State. When the OS swaps contexts: the OS saves the current program state (which requires "looking"), so that the state can be restored later. When a program has a fault (e.g., a page fault): the OS steps in and looks at the current CPU state.

Implementation in the CPU. The ARF keeps state corresponding to committed insns. Commit from the ROB happens in order. The ARF always contains some RF state of the sequential execution. Whoever wants to look should look in the ARF. What about insns. that executed out of order?

Only the Sequential Part Matters. [Figure: the sequential view of the processor (PC, RF, Memory) next to the state of the superscalar out-of-order processor (fpc, RS, ROB, LSQ, PRF, PC, RF, Memory).] What if there's no ARF?

View of the Unified Register File. If you need to see a register, you go through the aRAT first. [Figure labels: sRAT, PRF, ARF, aRAT.]

View of Branch Mispredictions. Wrong-path instructions are flushed; the architected state has never been touched. Fetch correct-path instructions, which can update the architected state when they commit. [Figure: fpc, RS, LSQ, ROB, PRF, ARF, PC, and Memory, with a mispredicted branch in the ROB.]

Committing Instructions (1/2). Retire vs. Commit: sometimes people use these to mean the same thing, sometimes they mean different things. Check the context! An insn. commits by making its effects visible: (A)RF, memory/$, PC.

Committing Instructions (2/2). When an insn. executes, it modifies processor state: update a register, update memory, update the PC. To make the effects visible, the core copies values: from physical register to architected register, from LSQ to memory/cache, from ROB to architected PC.
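The three copies listed above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `RobEntry`, `commit`, and the dictionary-based `prf`, `arf`, and `memory` are hypothetical names, not part of the slides, and a real core does this in hardware with separate structures.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:                       # hypothetical ROB entry layout
    pc: int                           # address of this insn.
    next_pc: int                      # architected PC once it commits
    done: bool = False                # set at writeback
    dest_areg: Optional[str] = None   # architected destination, if any
    dest_preg: Optional[int] = None   # physical register holding the result
    store: Optional[tuple] = None     # (addr, value) buffered in the LSQ


def commit(rob, prf, arf, memory, state, width=4):
    """Commit up to `width` completed insns. from the ROB head, in order."""
    committed = 0
    while rob and committed < width and rob[0].done:
        entry = rob.popleft()
        if entry.dest_areg is not None:
            arf[entry.dest_areg] = prf[entry.dest_preg]  # PRF -> ARF
        if entry.store is not None:
            addr, value = entry.store
            memory[addr] = value                         # LSQ -> memory/cache
        state["pc"] = entry.next_pc                      # ROB -> architected PC
        committed += 1
    return committed


rob = deque([
    RobEntry(pc=0, next_pc=4, done=True, dest_areg="EAX", dest_preg=7),
    RobEntry(pc=4, next_pc=8, done=False, dest_areg="EBX", dest_preg=9),
    RobEntry(pc=8, next_pc=12, done=True, store=(0x40, 99)),
])
prf = {7: 5, 9: 0}
arf = {"EAX": 0, "EBX": 0}
memory = {}
state = {"pc": 0}
n_committed = commit(rob, prf, arf, memory, state)
```

In this example the third entry has finished executing, but it stays in the ROB because the second entry ahead of it is not done: commit is in order even when execution is not.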

x86 Commit (1/2). The ROB contains uops; the outside world knows insns. [Figure: ROB contents in program order. ADD EAX, EBX: uop 1 (ADD). SUB EBX, ECX: uop 1 (SUB). ADD EDX, [EAX]: uop 1 (LD), uop 2 (ADD). POPA: uop 1 (POP) through uop 8 (POP). Two commit groups are marked, and the second one ends partway through POPA.] If we take an interrupt right now, we'll see a half-executed instruction!

x86 Commit (2/2). Works when uop-flow length ≤ commit width. What to do with long flows? In all cases: can't commit until all uops in a flow have completed. Just commit N uops per cycle... but make commit uninterruptible. [Figure: the ROB holds the eight POPA uops, which commit in two groups. A timer interrupt arrives after the first group. Defer: can't act on this yet... After the second group commits: now do something about the interrupt.]
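A minimal sketch of "commit N uops per cycle, but make commit uninterruptible", assuming every uop in the ROB has already completed. The function `run_commit` and the `(insn, is_last_uop)` tuple format are hypothetical, chosen only for illustration.

```python
from collections import deque


def run_commit(rob, width, interrupt_cycle):
    """Commit `width` uops per cycle; take a pending interrupt only between insns.

    rob: deque of (insn_name, is_last_uop) tuples in program order.
    Returns (cycle at which the interrupt is taken or None, committed insns).
    """
    cycle = 0
    mid_flow = False      # True while an insn. is only partially committed
    pending = False
    committed = []
    while rob:
        if cycle == interrupt_cycle:
            pending = True
        if pending and not mid_flow:
            return cycle, committed          # clean instruction boundary
        for _ in range(min(width, len(rob))):
            insn, is_last = rob.popleft()
            mid_flow = not is_last
            if is_last:
                committed.append(insn)
        cycle += 1
    return (cycle if pending else None), committed


rob = deque(
    [("ADD EAX, EBX", True), ("SUB EBX, ECX", True),
     ("ADD EDX, [EAX]", False), ("ADD EDX, [EAX]", True)]
    + [("POPA", i == 7) for i in range(8)]
)
# The timer interrupt arrives in cycle 2, when POPA is half committed.
taken_at, committed = run_commit(rob, width=4, interrupt_cycle=2)
```

The interrupt that arrives in cycle 2 is deferred until cycle 3, after the last POPA uop has committed, so the handler never observes a half-executed POPA.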

Handling REP-prefixed Instructions (1/2). Ex.: REP STOSB (memset: store AL, the low byte of EAX, ECX times). The entire sequence is one x86 instruction. What if it REPs for 1,000,000,000 iterations? Can't lock up for a billion cycles while uops commit. Can't wait to commit until all uops are done. Can't even fetch the entire instruction: not enough space in the ROB. At the ISA level, REP iterations are interruptible... so treat each iteration as a separate macro-op.

Handling REP-prefixed Instructions (2/2).

x86 code:
MOV EDI, <pointer> ; the array we want to memset
SUB EAX, EAX ; zero
CLD ; clear direction flag (REP fwd)
MOV ECX, 4 ; do for 4 iterations
REP STOSB ; memset!
ADD EBX, EDX ; unrelated instruction

Uop flow:
MOV EDI, xxx
SUB EAX, EAX
CLD
MOV ECX, 4
ucmp ECX, 0 / ujcz ; check for zero iterations (could happen with MOV ECX, 0)
STA tmp, EDI / STD EAX, tmp ; MOVS flow, 1st iteration
ADD EDI, 1 / SUB ECX, 1 / ucmp ECX, 0 / ujcz ; REP overhead for 1st iteration
A:
STA tmp, EDI / STD EAX, tmp ; MOVS flow, 2nd iteration
ADD EDI, 1 / SUB ECX, 1 / ucmp ECX, 0 / ujcz ; REP overhead for 2nd iteration
B:
STA tmp, EDI / STD EAX, tmp ; MOVS flow, 3rd iteration
ADD EDI, 1 / SUB ECX, 1 / ucmp ECX, 0 / ujcz ; REP overhead for 3rd iteration
C:
STA tmp, EDI / STD EAX, tmp ; MOVS flow, 4th iteration
ADD EDI, 1 / SUB ECX, 1 / ucmp ECX, 0 / ujcz ; REP overhead for 4th iteration
D:
ADD EBX, EDX

All of these (A through D) are interruptible points (commit can stop and the effects be seen by the outside world), since they all have well-defined ISA-level states:
A: ECX=3, EDI = ptr+1
B: ECX=2, EDI = ptr+2
C: ECX=1, EDI = ptr+3
D: ECX=0, EDI = ptr+4
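The per-iteration states can be checked with a small Python model of REP STOSB. The sketch assumes the direction flag is clear and models memory as a dictionary; `rep_stosb` and its `interrupt_after` parameter are hypothetical names used only for illustration.

```python
def rep_stosb(memory, regs, interrupt_after=None):
    """Run REP STOSB one iteration at a time.

    Returns True if the instruction finished, or False if it stopped at an
    interruptible point after `interrupt_after` iterations of this call.
    """
    iterations = 0
    while regs["ECX"] != 0:                        # ucmp ECX, 0 / ujcz
        memory[regs["EDI"]] = regs["EAX"] & 0xFF   # STA/STD: store the byte in AL
        regs["EDI"] += 1                           # ADD EDI, 1 (direction flag clear)
        regs["ECX"] -= 1                           # SUB ECX, 1
        iterations += 1
        if iterations == interrupt_after and regs["ECX"] != 0:
            return False                           # well-defined ISA-level state
    return True


ptr = 0x1000                                       # assumed array address
memory = {}
regs = {"EAX": 0, "ECX": 4, "EDI": ptr}

finished_first = rep_stosb(memory, regs, interrupt_after=2)
state_b = dict(regs)                               # snapshot at point B

finished_second = rep_stosb(memory, regs)          # resume after the interrupt
state_d = dict(regs)                               # snapshot at point D
```

Because ECX and EDI are architected registers updated every iteration, resuming after the interrupt needs no extra bookkeeping: re-running the same instruction continues from where it stopped.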

Blocked Commit. To commit N insns. per cycle, the ROB needs N read ports (in addition to the ports for dispatch, issue, exec, and WB). [Figure: a ROB with four read ports for four commits (inst 1 to inst 4), next to a ROB with one wide read port covering a block of four.] With one wide port: can't reuse ROB entries until all in the block have committed, and can't commit across blocks. Reduces cost, but lowers IPC due to the constraints.
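The IPC cost of the "can't commit across blocks" rule shows up in a short sketch. Both functions below are hypothetical and model the ROB as a list of booleans (True means the insn. has completed), with `head` as the index of the oldest uncommitted entry.

```python
def unconstrained_commit(done, head, width=4):
    """N read ports: commit up to `width` completed entries starting at head."""
    limit = min(head + width, len(done))
    while head < limit and done[head]:
        head += 1
    return head


def blocked_commit(done, head, block_size=4):
    """One wide read port: commit within the current block only."""
    block_end = (head // block_size + 1) * block_size
    limit = min(block_end, len(done))
    while head < limit and done[head]:
        head += 1
    return head


def reusable_entries(head, block_size=4):
    """Entries are reusable only once their whole block has committed."""
    return (head // block_size) * block_size


done = [True] * 8        # eight completed insns.
head = 2                 # two of them committed in an earlier cycle

head_ports = unconstrained_commit(done, head)   # may cross the block boundary
head_blocked = blocked_commit(done, head)       # stops at the block boundary
```

With N ports the core commits four insns. this cycle; with blocked commit it commits only the two that remain in the current block, even though six are ready.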

Faults. Divide-by-zero, overflow, page fault: all occur at a specific point in execution (precise). [Figure: an instruction stream hits DBZ, traps, and later resumes execution.] The divide may have executed before other instructions due to OoO scheduling! Trap? (When?) The CPU maintains the appearance of sequential execution.

Timing of DBZ Fault. Need to hold on to your faults. On a fault, flush the machine and switch to the kernel, but not at the moment the fault is detected. At Exec (DBZ): just make note of the fault, but don't do anything (yet). Let earlier instructions commit. The arch. state is then the same as just before the divide executed in the sequential order. Now raise the DBZ fault, and when you switch to the kernel, everything appears as it should.

Speculative Faults. Faults might not be faults. [Figure: a DBZ sits in the ROB behind a mispredicted branch; flushing the wrong path removes it.] The fault goes away, which is what we want, since in a sequential execution the wrong-path divide would not have executed (and faulted). Buffer faults until commit to avoid speculative faults.
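Buffering a fault in the ROB and raising it only at commit can be sketched as follows. `Entry`, `squash_younger_than`, and `commit_one` are hypothetical names, and the model assumes the fault is recorded in the ROB entry at execute.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    name: str
    done: bool = True
    fault: Optional[str] = None    # noted at execute, not raised yet


def squash_younger_than(rob, branch_name):
    """Flush the wrong-path entries that follow a mispredicted branch."""
    while rob and rob[-1].name != branch_name:
        rob.pop()


def commit_one(rob, committed):
    """Commit the ROB head, or raise its buffered fault.

    Returns the fault name if one is raised, otherwise None.
    """
    if not rob or not rob[0].done:
        return None
    if rob[0].fault is not None:
        fault = rob[0].fault
        rob.clear()                # flush the machine, then switch to the kernel
        return fault
    committed.append(rob.popleft().name)
    return None


# Case 1: a real fault. Earlier insns. commit first, then the fault is raised.
rob1 = deque([Entry("add"), Entry("div", fault="DBZ"), Entry("sub")])
committed1 = []
first = commit_one(rob1, committed1)
second = commit_one(rob1, committed1)

# Case 2: a wrong-path fault. The flush removes it before it reaches the head.
rob2 = deque([Entry("branch"), Entry("div", fault="DBZ")])
committed2 = []
squash_younger_than(rob2, "branch")
third = commit_one(rob2, committed2)
fourth = commit_one(rob2, committed2)
```

In the first case the younger `sub` never commits, so the architected state matches the point just before the divide. In the second case no fault is ever reported.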

Timing of TLB Miss. A store that misses in the TLB cannot leave the ROB. Trap, walk the page table (may find a page fault), then resume execution. The store must re-execute (or re-commit). A store TLB miss can stall the core.

Load Faults are Similar. The load issues and misses in the TLB. When the load is oldest, switch to the kernel for the page-table walk. This could be painful; there are lots of loads. Modern processors use hardware page-table walkers: the OS loads a few registers with PT information (pointers), and simple logic fetches the mapping info from memory. This requires that the page-table format be specified by the ISA.

Asynchronous Interrupts. Some interrupts are not associated with insns.: timer interrupt; I/O interrupt (disk, network, etc.); low battery, UPS shutdown. When the CPU notices doesn't matter (too much). [Figure: "Key Pressed" events.]

Two Options for Handling Async Interrupts. Handle immediately: use the current architected state and flush the pipeline. Deferred: stop fetching, let the processor drain, then switch to the handler. What if the CPU takes a fault in the meantime? Which came first, the async. interrupt or the fault?

Store Retirement (1/2). Stores forward to later loads (for the same address). Normally, the LSQ provides this facility. At commit, the store updates the cache. After the store has left the LSQ, the D$ can provide the correct value. [Figure: snapshots of the LSQ and D$ with stores (st 33, st 17) and later loads.]

Store Retirement (2/2). Can't free the LSQ store entry until the write is done; this enables forwarding until loads can get the value from the cache. Have to re-check the TLB when doing the write. (Don't we check the DTLB during store-address computation anyway? Do we need to do it again here? Yes: TLB contents at Execute were speculative.) The store may stall commit for a long time: if there's a cache miss, or if there's a TLB miss (with HW TLB walk). All instructions may have successfully executed, but none can commit!

Writeback Buffer (1/2). Want to get stores out of the way quickly. Even if the store misses in the cache, entering the WB buffer counts as committing, which allows other insns. to commit. The WB buffer is part of the cache hierarchy and may need to provide values to later loads. Eventually the cache update occurs and the WB buffer entry is emptied; the cache can now provide the correct value. Usually fast, but a potential structural hazard. [Figure: stores and loads, the D$, and the WB buffer.]

Writeback Buffer (2/2). Stores enter the WB buffer in program order. Multiple stores can exist to the same address; only the last store is visible.

Addr  Value
42    1234   (oldest; next to write to cache)
13    -1
8     90901
42    5678   (youngest)

A Load 42 sees 5678. No one can see the older store to 42 anymore!
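A minimal sketch of this lookup rule, using the four entries from the slide. The class `WritebackBuffer` and its methods are hypothetical names; the cache is modeled as a plain dictionary.

```python
from collections import deque


class WritebackBuffer:
    """Stores enter in program order; loads see the youngest matching store."""

    def __init__(self):
        self.entries = deque()               # oldest entry at the left

    def store(self, addr, value):
        self.entries.append((addr, value))

    def load(self, addr, cache):
        for entry_addr, value in reversed(self.entries):   # youngest first
            if entry_addr == addr:
                return value
        return cache.get(addr)               # no match: the cache provides it

    def drain_one(self, cache):
        """Write the oldest entry to the cache and free it."""
        if self.entries:
            addr, value = self.entries.popleft()
            cache[addr] = value


cache = {}
wbb = WritebackBuffer()
for addr, value in [(42, 1234), (13, -1), (8, 90901), (42, 5678)]:
    wbb.store(addr, value)

before_drain = wbb.load(42, cache)           # youngest store to 42 wins
wbb.drain_one(cache)                         # cache[42] is now the old 1234
after_one_drain = wbb.load(42, cache)        # still served from the buffer
while wbb.entries:
    wbb.drain_one(cache)
after_full_drain = wbb.load(42, cache)       # now served from the cache
```

Searching youngest-first matters after the first drain: the cache briefly holds the stale value 1234, but loads still get 5678 from the buffer.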

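Write Combining Buffer (1/2). Augment the WBB to combine writes together. If stores go to the same address, combine the writes. [Figure: a single entry for Addr 42 whose value 1234 is overwritten by 5678 from a second Store 42; Load 42 reads the combined entry.] Only one writeback now instead of two.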

Write Combining Buffer (2/2). Can combine stores to the same cache line. [Figure: one entry with $-line addr 80 whose cache line data holds 1234 and, after Store 84, 5678.] One cache write can serve multiple original store instructions. Benefit: reduces cache traffic and reduces pressure on store buffers. The aggressiveness of write combining may be limited by the memory ordering model. The writeback/combining buffer can be implemented in, or integrated with, the MSHRs. Only certain memory regions may be write-combinable (e.g., USWC in x86).
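Combining by cache line can be sketched as below. The 16-byte line size is an assumption picked so that addresses 80 and 84 share the line at address 80, as in the figure; `WriteCombiningBuffer` is a hypothetical class, and the sketch ignores store widths and memory-ordering constraints.

```python
LINE_BYTES = 16   # assumed line size, chosen so 80 and 84 share a line


class WriteCombiningBuffer:
    """One entry per cache line; stores to the same line merge into it."""

    def __init__(self):
        self.lines = {}          # line address -> {offset: value}
        self.cache_writes = 0    # number of line writes sent to the cache

    def store(self, addr, value):
        line = addr - addr % LINE_BYTES
        self.lines.setdefault(line, {})[addr - line] = value

    def drain(self, cache):
        """Write every buffered line to the cache, one write per line."""
        for line, data in self.lines.items():
            for offset, value in data.items():
                cache[line + offset] = value
            self.cache_writes += 1
        self.lines.clear()


cache = {}
wcb = WriteCombiningBuffer()
wcb.store(80, 1234)
wcb.store(84, 5678)                  # same line as address 80: combined
entries_in_use = len(wcb.lines)
wcb.drain(cache)
```

Two store instructions occupy one buffer entry and cost one cache write. A stricter memory ordering model might forbid merging in cases where it would reorder stores as seen by other cores, which this sketch does not model.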

Senior Store Queue. Use the STQ as the WBB (not necessarily write combining). While stores are completing, other accesses (loads, etc.) can continue getting their values from the senior STQ. New stores cannot allocate into senior STQ entries until those stores complete. No WBB and no stall on store commit. [Figure: an STQ with head and tail pointers at successive points in time, with senior stores draining to the DL1 and L2.]

Retire. An insn. retires by cleaning up all related state. Besides updating the architected state, it needs to deallocate resources: ROB/LSQ entries, physical registers, "colors" of various sorts, RAT checkpoints. Most are FIFOs or queues, so alloc/dealloc is usually just inc/dec of head/tail pointers. A unified PRF requires a little more work: have to return the old mapping to the free list.
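Returning the old mapping to the free list can be sketched as follows, assuming a unified PRF with a speculative RAT (`srat`), an architected RAT (`arat`), and a FIFO free list. The function names `rename` and `retire` are hypothetical.

```python
from collections import deque


def rename(dest, srat, free_list):
    """Allocate a new physical register for `dest`.

    Returns (new_preg, old_preg); the old mapping must be remembered
    (e.g., in the ROB entry) so that retire can free it.
    """
    old_preg = srat[dest]
    new_preg = free_list.popleft()
    srat[dest] = new_preg
    return new_preg, old_preg


def retire(dest, new_preg, old_preg, arat, free_list):
    """At retire, the new mapping becomes architected and the old one is freed."""
    arat[dest] = new_preg
    free_list.append(old_preg)


srat = {"EAX": 0}
arat = {"EAX": 0}
free_list = deque([1, 2])

new_preg, old_preg = rename("EAX", srat, free_list)
free_after_rename = list(free_list)
retire("EAX", new_preg, old_preg, arat, free_list)
```

The old physical register can be freed only at retire, not at rename, because until then an older insn. (or a recovery to the architected state) may still need its value.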