Current Microprocessors: Pipeline, Efficient Utilization of Hardware Blocks
- Lynne Douglas
Efficient Utilization of Hardware Blocks

Execution steps for an instruction:
1. Send instruction address
2. Instruction fetch
3. Store instruction
4. Decode instruction, fetch operands
5. Address computation
6. Memory access (ME)
7. Execution
8. Write back

ADD R, R, # : only one block is used every cycle.

for (i = ; i < ; i++) {
  a[i] = a[i] +
}

LOOP LDR R, R, #
     ADD R, R, #
     STR R, R, #
8    ADD R, R, #
9    ADD R, R, R
A    BRn LOOP

With the pipeline, all blocks are used and one instruction terminates each cycle.
[Pipeline timing diagram: stage occupancy of each instruction, cycle by cycle]
Pipeline

All blocks used: the successive instructions of the loop (LDR, ADD, STR, ADD, ADD, BRn, then the LDR and ADD of the next iteration) occupy different stages in the same cycle.
[Pipeline timing diagram]

Program execution is up to 8 times faster; the execution time of a single instruction is barely changed.

Pipeline Implementation

PC, IR, memory, register file, address computation and ALU, with pipeline registers between the stages. All information (data and control) is stored in the pipeline registers: PC, instruction, opcode, register contents, data and result propagate from stage to stage.
[Datapath figure: pipeline registers between the stages]
Trends

The Pentium IV pipeline has stages plus 8 conversion stages (x86 instructions are converted into µinstructions). The initial motivation for the pipeline was to exploit the architecture blocks efficiently and so increase the instruction execution rate. The current motivation is increasing the clock frequency: stages are split into sub-stages; the maximum duration of a sub-stage is reduced, which enables clock frequency increases; sustained performance is difficult to reach.

Pipeline Hazards

It is not always possible to issue/commit one instruction per cycle. When an instruction cannot proceed, there is a pipeline hazard. Three types of pipeline hazards:
- Resource hazard
- Data hazard
- Control hazard
A hazard induces a pipeline stall: the control circuit injects one or several bubbles. Performance metric: IPC (Instructions Per Cycle), below the optimum of 1 when hazards occur.
[Pipeline diagram: bubbles injected between instructions increase the total cycle count]
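The IPC metric can be illustrated with a minimal Python sketch; the instruction and bubble counts below are invented, not taken from the slides:

```python
# Effective IPC of a scalar pipeline with stalls (hypothetical numbers).
# Each hazard injects bubble cycles, so total cycles exceed instruction count.

def effective_ipc(n_instructions, bubbles):
    """IPC = instructions / total cycles; the ideal scalar pipeline has IPC = 1."""
    cycles = n_instructions + bubbles  # one cycle per instruction, plus bubbles
    return n_instructions / cycles

print(effective_ipc(100, 0))   # no hazards: 1.0
print(effective_ipc(100, 25))  # 25 bubbles: 0.8
```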
Resource Hazards

Two instructions access the same block in the same cycle. Solutions:
- Accept the pipeline stall (SPARC)
- Increase resources (e.g., memory ports)
[Pipeline diagram: the LDR/ADD/STR/ADD/ADD sequence with a resource conflict]

With no memory access, there is no resource hazard.

Data Hazards

Data dependence between two instructions:
LDR R, R, #
ADD R, R, #
An instruction can fetch its operands only when they are available: R must contain the value expected from the LDR.
[Pipeline diagram: the dependent ADD waits for the LDR]
Forwarding

Data is often available in the processor before it is written to the register file: immediately pass the data to the block expecting it (forwarding). The pipeline stalls only when the data is absolutely necessary.
[Pipeline diagram: data available, pipeline stall avoided]

Implementing forwarding requires:
- Additional data paths
- Adding muxes, or increasing their size
- Modifying the control circuit (to detect and activate forwarding)

Forwarding cannot avoid all pipeline stalls:
LDR R, R, #
ADD R, R, #
STR R, R, #
[Pipeline diagram: an instruction immediately dependent on a load still stalls despite forwarding]
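The residual load-use stall can be modeled with a small sketch. This is a simplification assuming the classic pipeline where an ALU result can be forwarded from the end of its execute stage, while a load's result only exists after its memory stage:

```python
# Stall cycles a consumer needs despite forwarding (simplified model).
# "distance" = number of cycles between producer issue and consumer issue.

def stall_cycles(producer, distance):
    """ALU results forward one cycle after issue; load results two cycles after."""
    ready = 2 if producer == 'load' else 1
    return max(0, ready - distance)

assert stall_cycles('alu', 1) == 0    # ADD then dependent ADD: forwarded, no stall
assert stall_cycles('load', 1) == 1   # LDR then dependent ADD next cycle: 1 bubble
assert stall_cycles('load', 2) == 0   # one instruction in between: forwarded in time
```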
Multi-Cycle Instructions

Example: floating-point instructions. FPADD and FPMUL each take several execution cycles. There is a resource hazard if FPADD and FPMUL need the same block in the same cycle.
[Pipeline diagram: FPMUL F, F, F / FPADD F, F, F / FPADD F, F, F]

New data dependences: registers can be written out of order. New resource conflicts (register bank ports).
[Pipeline diagram: out-of-order register writes]

Pipeline and Exceptions

The pipeline makes exception management harder. Example: an LDR has a page fault in one stage while a later ADD has a page fault in an earlier stage.

Precise exception on instruction I:
- All instructions before I finish normally
- All instructions after I can be interrupted, then re-executed from the start after the exception is handled
It is necessary to implement precise exceptions.

An exception vector is attached to each pipeline register. After an exception, no further state modification occurs; exceptions are dealt with in the same order as instructions.
[Datapath figure: exception vectors alongside the pipeline registers]
Multi-Cycle Instructions and Exceptions

Example: FPADD raises a NaN exception. The processor state would be modified by instructions issued after FPADD before the exception is detected: out-of-order state modification must be forbidden.
[Pipeline diagram: FPMUL F, F, F / FPADD F, F, F]

Control Hazards

Branch: the destination (and possibly the condition value) must be known before fetching the next instruction.
LDR R, R, #
BRn LOOP
The condition (bits n, p, z) is known at the end of one stage; the branch destination address is available at the end of another.
[Pipeline diagram: fetching the instruction after BRn LOOP is delayed]

Current Microprocessors: Branch Prediction
Branch Prediction

Usually, the branch destination address is constant (except for RET and indirect branches). Predict the destination address:
- Store destination addresses in a table, filled at each branch execution
- The table is indexed by the branch instruction PC
- When the PC is sent to memory, it is also sent to the table
- The table says whether this is a branch, and gives the destination address
[Figure: table of known branch instructions with destination address, "is/is not a branch" flag, and predicted address]

Address Prediction

If the PC corresponds to a branch, update the PC: PC = destination address.
Example: conditional branch JMPR LABEL. Without address prediction, fetch waits until the destination address is computed; with address prediction, fetch continues from the predicted destination address.

Prediction Error

- Detect the error (the destination address is always computed)
- Squash the speculatively fetched instructions
- Speculated instructions only modify the machine state after the check
Branch squashing costs cycles; the speculated instructions did not modify the machine state.
[Pipeline diagram: JMPR followed by wrongly fetched instructions, squashed when the prediction error is detected]
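The destination-address table can be sketched as a small branch target buffer; the addresses and the 4-byte sequential step below are invented for illustration:

```python
# Minimal branch target buffer (BTB) sketch: a table indexed by branch PC
# that records the last destination address seen at branch resolution.

btb = {}

def fetch_predict(pc):
    """Predicted next PC: the BTB target if pc is a known branch, else pc + 4."""
    return btb.get(pc, pc + 4)

def resolve(pc, target):
    """At branch resolution, record/update the destination address."""
    btb[pc] = target

resolve(0x1000, 0x2000)                  # a branch at 0x1000 jumped to 0x2000
assert fetch_predict(0x1000) == 0x2000   # predicted from the table
assert fetch_predict(0x1004) == 0x1008   # not a known branch: sequential fetch
```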
Recovery After Misprediction

[Figure: pipeline state cycle by cycle while the wrongly fetched instructions after JMPR Label are squashed and fetch restarts at the correct target]

Condition Prediction

For conditional branches, the condition value must also be predicted. The condition value can change often from one branch execution to another, which makes prediction difficult.
Example: branch BR LABEL taken. With address prediction only, fetch waits until the condition is computed; with condition and address prediction, fetch continues immediately.

Prediction Strategies

Static prediction:
- Always taken: works well with loops
- Compile-time analysis (EPIC); limited by what static analysis can see
- Hit rate: from % to 9%

LOOP LDR R, R, #
     ADD R, R, #
     STR R, R, #
8    ADD R, R, #
9    ADD R, R, R
A    BRn LOOP
Dynamic prediction:
- Commonplace in processors
- Recent mechanisms: hit rates up to 99% for certain applications
- Principle: learn the behavior of each individual branch
- A first mechanism, local history: one 4-state automaton per branch

The table holds one n-bit entry per tracked branch; it is indexed by the PC and updated with the branch condition. The four states are Taken, Weakly Taken, Weakly Not Taken, Not Taken: a taken branch moves the automaton toward Taken, a not-taken branch toward Not Taken, and the prediction is read from the current state.
[Figure: 4-state automaton and prediction table indexed by the PC]

Example: for the loop branch BRn LOOP, the branch is taken on every iteration and not taken on the last one (i = 99). After the first iterations train the automaton, the prediction is "taken" until the loop exit.
[State trace: prediction per iteration for the BRn LOOP branch]

Improving Dynamic Prediction

A small hit-rate increase can have a significant impact on overall processor performance.

if (a == ) a = ;   /* branch B */
if (b == ) b = ;   /* branch B */
if (a == b) ;      /* branch B */

To improve prediction accuracy, use the behavior of the preceding branches: global history.
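The 4-state automaton is a 2-bit saturating counter. A minimal Python simulation of one such counter on the loop-branch pattern from the slides (100 iterations assumed for concreteness):

```python
# Two-bit saturating counter for one branch: states 0-1 predict not taken,
# states 2-3 predict taken; the counter saturates at 0 and 3.

def simulate(outcomes, state=0):
    """Count mispredictions of a single 2-bit counter over a branch history."""
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions

# Loop branch: taken 99 times, then not taken on loop exit.
outcomes = [True] * 99 + [False]
print(simulate(outcomes))  # 3: two warm-up mispredictions, plus the loop exit
```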
Global History

Keep the history of the last p branches in a shift register. The prediction table is indexed by p + n bits: the p history bits concatenated with n bits of the PC (2^(p+n) entries). The table is updated with each branch condition, and the outcome (taken / not taken) is shifted into the history.
[Figure: history register and PC forming the table index, illustrated on the three branches B above]

Impact of Branch Prediction on Processor Performance

On average, about one instruction out of every few is a branch. In current pipelines the misprediction cost is several cycles: the cycles between fetch and branch resolution, plus the cycles needed to restart the pipeline.
[Worked examples: total cycle counts for several misprediction rates, at one and at several instructions per cycle]

Current Microprocessors: Caches
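The misprediction-cost computation follows a simple model. The numbers below (branch fraction, misprediction rate, penalty) are illustrative assumptions, since the slide's values were lost in transcription:

```python
# Rough cost model: each mispredicted branch adds a fixed cycle penalty
# on top of the base execution time.

def total_cycles(n_inst, branch_frac, mispredict_rate, penalty, ipc=1.0):
    """Base cycles plus the cycles lost to branch mispredictions."""
    base = n_inst / ipc
    return base + n_inst * branch_frac * mispredict_rate * penalty

# 1000 instructions, 20% branches, 10% of them mispredicted, 10-cycle penalty:
print(total_cycles(1000, 0.20, 0.10, 10))  # 1200.0 cycles instead of 1000
```

Note that at higher IPC the base term shrinks while the penalty term does not, which is why misprediction cost matters more on wider processors.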
Performance: Memory Access Latency

[Figure: processor performance (Moore's law) grows much faster than DRAM performance; the processor/DRAM gap widens every year]
Processor cycle time << memory access time.

Cache Memory

A fast (SRAM) but small (for cost reasons) memory, accessed in a few cycles (pipelined). The processor sends its memory requests to the cache. Data present in the cache: hit. Data absent: miss. Performance metrics: hit rate and average memory access time.

1-bit SRAM Cell

SRAM = Static Random Access Memory.
- Writing: drive the bit and bit-bar lines to the value and its complement, with the selection line asserted
- Reading: precharge bit and bit-bar to V_DD, assert the selection line; the voltage decreases on the line opposite to the stored value, and an amplifier senses the difference
[Figure: SRAM cell with bit/bit-bar lines, selection line W, and sense amplifier]
Caches and Locality Properties

Most programs have strong locality properties:
- Temporal locality: an address A referenced at time t has a strong probability of being referenced again within a short time interval
- Spatial locality: when an address A is referenced at time t, there is a strong probability that a neighboring address is referenced within a short time interval

for (i = ; i < n; i++) {
  for (j = ; j < n; j++) {
    y[i] = y[i] + a[i][j] * x[j]
  }
}

y[i]: spatial and temporal locality; a[i][j]: spatial locality; x[j]: spatial and temporal locality.

Data and Instruction Locality

Locality properties hold for instructions as well:
- Temporal locality: just keep the address in the cache
- Spatial locality: load addresses by blocks
The loop reuses its instructions (temporal locality); consecutive instructions give spatial locality.

Cache Architecture

[Figure: the processor request (memory address) is compared with the tags of the addresses stored in the cache; on a match, the cache block (or line) is returned; otherwise the request goes to memory]
Hardware Mapping

The programmer does not manage the cache; it is transparent. Mapping is a simple function of the address: a line number in the cache and a byte number in the line.

Address Mapping

For a cache of C_S bytes with lines of L_S bytes:
- byte number in the line: the log2(L_S) least significant bits of the address
- line number in the cache: the next log2(C_S / L_S) bits
- tag: the remaining most significant bits

Cache Line

Data is fetched by blocks. Within a block, the most significant bits of the byte addresses are identical; only the least significant bits vary. Example with 8-byte blocks: consecutive addresses in the same block share a cache line; addresses in different blocks map to distinct lines.

Reading Data

Example: C_S bytes, L_S = 8 bytes. The requested address is split into a line number and a byte number in the line. A request can have a variable size: byte, half-word, word. A request = an address + a number of bytes, where the address is that of the first byte; the requested bytes are sent to the processor.
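The address split can be sketched directly in Python. The sizes below (C_S = 256 bytes, L_S = 8 bytes, so 32 lines) are example values, not the slide's:

```python
# Splitting an address into (tag, line index, byte offset) for a
# direct-mapped cache: 256-byte cache, 8-byte lines -> 32 lines.

C_S, L_S = 256, 8
offset_bits = (L_S).bit_length() - 1           # log2(8)  = 3
index_bits = (C_S // L_S).bit_length() - 1     # log2(32) = 5

def split(addr):
    offset = addr & (L_S - 1)                      # 3 least significant bits
    index = (addr >> offset_bits) & (C_S // L_S - 1)  # next 5 bits
    tag = addr >> (offset_bits + index_bits)       # remaining high bits
    return tag, index, offset

print(split(0x1234))  # (18, 6, 4)
```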
Associativity

Physical memory size >> cache size, so the mapping function can create data conflicts. Conflicts are reduced by increasing cache associativity.

for (i = ; i < n; i++) {
  for (j = ; j < ; j++) {
    x[j] = y[j]
  }
}

[Figure: x and y map to the same cache lines, causing a cache conflict]

Associative Cache Structure

[Figure: the cache is split into banks; the address is compared with the tag in each bank in parallel, and a multiplexer selects the data from the matching bank]

Associative Cache Operations

Degree of associativity n: a data item can be stored in n different entries. Upon a cache miss, choose the block/bank to replace within the set of possible blocks (the set):
- LRU: Least Recently Used
- Random
- Pseudo-LRU: the most recently used line is not replaced; random among the others
- FIFO: First In, First Out
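LRU replacement within one set can be sketched as an ordered list (front = most recently used); the 4-way associativity below is an example value:

```python
# LRU replacement within one set of an n-way associative cache.
# The set is an ordered list of tags: front = most recently used.

def access(lru_set, tag, ways=4):
    """Return True on hit; update the LRU order, evicting on a miss if full."""
    if tag in lru_set:
        lru_set.remove(tag)
        lru_set.insert(0, tag)     # hit: move to most-recently-used position
        return True
    if len(lru_set) == ways:
        lru_set.pop()              # miss in a full set: evict the LRU tag
    lru_set.insert(0, tag)
    return False

s = []
for t in [1, 2, 3, 4]:
    access(s, t)                   # fill the set; LRU order is now [4, 3, 2, 1]
assert access(s, 1) is True        # tag 1 still resident
assert access(s, 5) is False       # miss: evicts tag 2, the least recently used
assert 2 not in s
```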
Writing a Data (Write-Through)

[Figure: on a write hit, the data is written both to the matching cache line and to memory]

Writing a Data (Write-Back)

[Figure: on a write hit, the data is written only to the cache line; memory is updated later, when the line is evicted on a cache miss]

Virtual Memory / Physical Memory: TLB

The processor uses virtual addresses, but data have addresses in physical memory, so a virtual-to-physical address translation is needed. The TLB (Translation Lookaside Buffer) is a cache of address translations; a TLB entry covers a page. The TLB is often fully associative (n = number of lines).
[Figure: the virtual address is compared with every TLB entry in parallel to produce the physical address]
Summary

Instruction cache & TLB, data cache & TLB. Done simultaneously: the virtual/physical address translation and the cache access using the virtual address.

Several Cache Levels

Example: Alpha processor.
- Instruction cache: 8 KB, n-way
- Data cache: 8 KB, n-way
- Shared on-chip cache: n-way
- Off-chip shared cache: several MB, n-way

Impact of Cache Misses on Processor Performance

On average, a large fraction of instructions are loads/stores, and instruction fetches also miss in the cache. Cache hierarchies reduce the average memory latency.
[Worked examples: total cycle counts for several cache miss rates, for a GHz processor with memory access measured in ns]
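The average memory access time (AMAT) mentioned above composes across cache levels. The latencies and miss rates below are illustrative assumptions, not the slide's lost values:

```python
# Average memory access time for a hierarchy: each level contributes its
# hit time plus its miss rate times the cost of going one level down.

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Assumed numbers: L1 hits in 2 cycles; 5% of accesses miss to an L2 that
# hits in 12 cycles; 10% of those miss to a 200-cycle memory.
l2_cost = amat(12, 0.10, 200)   # 32.0 cycles seen by an L1 miss
print(amat(2, 0.05, l2_cost))   # 3.6 cycles on average
```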
Current Microprocessors: Superscalar Execution

Superscalar Processor

A pipeline completes at most one instruction per cycle. A superscalar processor of degree n completes up to n instructions per cycle (in practice, n is small). Requirements for a superscalar processor:
- An uninterrupted flow of instructions
- Determining which instructions can execute in parallel
- Propagating data among instructions (the result of instruction i is an operand of instruction j)
- Several functional units
Constraint: precise interrupts. Superscalar implementations share a lot of features.

Superscalar Processor Architecture

[Figure: architecture and pipeline of a superscalar processor. Source: Proceedings of the IEEE]
Instruction-Level Parallelism (ILP)

(Fine-grain parallelism.)

Instruction Fetch

Instruction flow disruptions: branches and instruction cache misses. To avoid disruptions:
- Fetch several instructions per cycle, possibly from several cache lines (multi-port cache)
- Use a buffer to store the pre-fetched instructions
[Figure: fetched cache lines around a branch predicted taken feeding the instruction buffer]

Dependences

Find the dependences between instructions, and avoid the false dependences due to register aliasing:
- RAW (Read After Write): true dependence
- WAW (Write After Write): out-of-order write (false dependence)
- WAR (Write After Read): too-early write (false dependence)
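The three dependence kinds can be classified mechanically from the destination and source registers of an instruction pair; the register names below are invented:

```python
# Classify the dependences between instruction i (earlier) and j (later)
# from their destination register and source registers.

def dependences(i_dest, i_srcs, j_dest, j_srcs):
    kinds = set()
    if i_dest in j_srcs:
        kinds.add('RAW')   # j reads what i writes: true dependence
    if i_dest == j_dest:
        kinds.add('WAW')   # both write the same register: false dependence
    if j_dest in i_srcs:
        kinds.add('WAR')   # j overwrites a source of i: false dependence
    return kinds

# move r1, r2  then  lw r8, 0(r1): lw truly depends on the move.
assert dependences('r1', ['r2'], 'r8', ['r1']) == {'RAW'}
# lw r8, 0(r1)  then  add r1, r1, 1: only a false (WAR) dependence... on r8/r1.
assert dependences('r8', ['r1'], 'r1', ['r1']) == {'WAR'}
```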
Register Aliasing: Renaming

Binary compatibility limits the number of logical registers, while technology allows more physical registers. Physical registers + ReOrder Buffer (ROB):
- Each instruction is mapped to a ROB entry
- A table maps each logical register either to a physical register or to a ROB entry
This eliminates register aliasing and exposes the true dependences.

L: move r, r
   lw r8, (r)
   add r, r,

[Figure: mapping table from logical registers to physical registers or ROB entries; the ROB records which value is produced by the move, lw, and add]

Dispatch

After dependence analysis, instructions are dispatched to reservation stations (a buffer in front of each functional unit). An instruction is executed when:
- all of its operands are available
- the functional unit is available
(Tomasulo's algorithm.)
[Figure: reservation station entries for the ALU, holding the operation, its sources (data, or the producing ROB entry), and the result tag]

Issue

Ready instructions are sent to the functional units. Results are propagated through the common data bus:
- to physical storage (the ROB entries)
- to the reservation stations
Pending instructions can then issue immediately. This internal model is closer to dataflow than to von Neumann.
[Figure: reservation stations, ROB, and functional units connected by the common data bus]
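The renaming table can be sketched in a few lines of Python; the structure is simplified (no commit or free list) and the register names are those of the slide's example:

```python
# Register renaming sketch: a map from each logical register to the ROB
# entry that will produce its next value. None = value already committed.

rename = {}   # logical register -> producing ROB entry
rob = []      # rob[k] = destination register of in-flight instruction k

def dispatch(dest, sources):
    """Allocate a ROB entry; sources read the current mapping (true deps only)."""
    deps = [rename.get(s) for s in sources]  # None: read the register file
    entry = len(rob)
    rob.append(dest)
    if dest is not None:
        rename[dest] = entry                 # later readers see this entry
    return entry, deps

# move r1, r2 ; lw r8, 0(r1) ; add r1, r1, 1  -- WAW/WAR on r1 disappear:
e0, d0 = dispatch('r1', ['r2'])
e1, d1 = dispatch('r8', ['r1'])
e2, d2 = dispatch('r1', ['r1'])
assert d1 == [e0]   # lw truly depends on the move
assert d2 == [e0]   # add reads the move's value, not its own new mapping
```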
Commit

An instruction can only commit when it reaches the end of the ROB: commit order = program order. The logical architectural state (registers and memory) is modified only at commit, which makes precise interrupts possible in out-of-order processors. When an instruction leaves the ROB, its result is written into a register, or its data is sent to memory. A superscalar processor of degree n can commit n instructions simultaneously. If the instruction at the head of the ROB has not completed, the processor stalls.

Pentium IV

[Figure: Pentium IV block diagram. Source: Tom's Hardware]

Trace Cache

An ordinary instruction cache stores instructions in memory order (A B C D); the trace cache stores them in predicted execution order (for example, A B D when the branch in B is predicted taken), so the predictor can make several predictions per fetch.
[Figure: instruction cache vs. trace cache, with the predictor driving fetch]
Pentium IV

[Figure: Pentium IV pipeline. Source: Tom's Hardware]

Ideally: Instruction Issues per Cycle

[Chart: ideal number of instruction issues per cycle for gcc, espresso, li, fpppp, doducd, tomcatv]

Factoring in the Main Constraints

[Chart: instructions per cycle for the same programs once the main constraints are taken into account]
In Reality

The achievable number of instructions per cycle is small at its maximum, and lower still on average.
More informationExploitation of instruction level parallelism
Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering
More informationChapter 06: Instruction Pipelining and Parallel Processing
Chapter 06: Instruction Pipelining and Parallel Processing Lesson 09: Superscalar Processors and Parallel Computer Systems Objective To understand parallel pipelines and multiple execution units Instruction
More informationAnnouncements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)
Announcements EE382A Lecture 6: Register Renaming Project proposal due on Wed 10/14 2-3 pages submitted through email List the group members Describe the topic including why it is important and your thesis
More informationILP: Instruction Level Parallelism
ILP: Instruction Level Parallelism Tassadaq Hussain Riphah International University Barcelona Supercomputing Center Universitat Politècnica de Catalunya Introduction Introduction Pipelining become universal
More informationCase Study IBM PowerPC 620
Case Study IBM PowerPC 620 year shipped: 1995 allowing out-of-order execution (dynamic scheduling) and in-order commit (hardware speculation). using a reorder buffer to track when instruction can commit,
More informationCS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory
More informationPage 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson
More informationComplex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar
Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationComputer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue
More informationLecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue
Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue 1 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction
More informationHardware Speculation Support
Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification
More informationTi Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr
Ti5317000 Parallel Computing PIPELINING Michał Roziecki, Tomáš Cipr 2005-2006 Introduction to pipelining What is this What is pipelining? Pipelining is an implementation technique in which multiple instructions
More informationComplex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units
6823, L14--1 Complex Pipelining: Out-of-order Execution & Register Renaming Laboratory for Computer Science MIT http://wwwcsglcsmitedu/6823 Multiple Function Units 6823, L14--2 ALU Mem IF ID Issue WB Fadd
More informationCPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo
More informationLoad1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1
Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]
More informationProcessor: Superscalars Dynamic Scheduling
Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),
More informationPhoto David Wright STEVEN R. BAGLEY PIPELINES AND ILP
Photo David Wright https://www.flickr.com/photos/dhwright/3312563248 STEVEN R. BAGLEY PIPELINES AND ILP INTRODUCTION Been considering what makes the CPU run at a particular speed Spent the last two weeks
More informationLocality. Cache. Direct Mapped Cache. Direct Mapped Cache
Locality A principle that makes having a memory hierarchy a good idea If an item is referenced, temporal locality: it will tend to be referenced again soon spatial locality: nearby items will tend to be
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationMemory Hierarchies 2009 DAT105
Memory Hierarchies Cache performance issues (5.1) Virtual memory (C.4) Cache performance improvement techniques (5.2) Hit-time improvement techniques Miss-rate improvement techniques Miss-penalty improvement
More information250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019
250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr
More informationCOSC 6385 Computer Architecture - Memory Hierarchy Design (III)
COSC 6385 Computer Architecture - Memory Hierarchy Design (III) Fall 2006 Reducing cache miss penalty Five techniques Multilevel caches Critical word first and early restart Giving priority to read misses
More informationAdvanced Computer Architecture
Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I
More informationPerformance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.
Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution
More informationMinimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline
Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 02: Introduction II Shuai Wang Department of Computer Science and Technology Nanjing University Pipeline Hazards Major hurdle to pipelining: hazards prevent the
More informationLecture 19: Instruction Level Parallelism
Lecture 19: Instruction Level Parallelism Administrative: Homework #5 due Homework #6 handed out today Last Time: DRAM organization and implementation Today Static and Dynamic ILP Instruction windows Register
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationMulti-cycle Instructions in the Pipeline (Floating Point)
Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining
More informationFour Steps of Speculative Tomasulo cycle 0
HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationHardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.
Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)
More informationLecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )
Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14) 1 1-Bit Prediction For each branch, keep track of what happened last time and use
More information