Lecture 9 Pipeline and Cache


Lecture 9: Pipeline and Cache
Peng Liu, liupeng@zju.edu.cn

Pipelining Review
What makes pipelining easy:
- All instructions are the same length
- Just a few instruction formats
- Memory operands appear only in loads and stores
Hazards limit performance:
- Structural: need more hardware resources
- Data: need forwarding and compiler scheduling; data hazards must be handled carefully
The MIPS I instruction set architecture made the pipeline visible (delayed branch, delayed load).

MIPS Pipeline Data / Control Paths (fast)
[Datapath diagram: PC, instruction memory, register file, sign extend, ALU, and data memory, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and control signals PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemWrite, MemRead, MemtoReg.]

MIPS Pipeline Data / Control Paths (debug)
[Same datapath, with each instruction's control information carried down the pipeline in the ID/EX, EX/MEM, and MEM/WB registers.]

MIPS Pipeline Control (pipelined debug)
[Datapath diagram: the EX, MEM, and WB control fields are latched into the ID/EX, EX/MEM, and MEM/WB pipeline registers and travel with each instruction.]

Control Hazards
A control hazard occurs when the flow of instruction addresses is not what the pipeline expects; it is caused by change-of-flow instructions:
- Conditional branches (beq, bne)
- Unconditional branches (j)
Possible solutions:
- Stall
- Move the decision point earlier in the pipeline
- Predict
- Delay the decision (requires compiler support)
Control hazards occur less frequently than data hazards, and there is nothing as effective against control hazards as forwarding is for data hazards.

The Last Type of Pipeline Hazard: Control Hazards
- Just like a data hazard on the program counter (PC): we cannot fetch the next instruction if we don't know its address
- Only worse: the PC is used at the very beginning of the pipeline, which makes forwarding very difficult
Calculating the next PC for MIPS:
- Most instructions: PC + 4; this can be calculated in the IF stage and forwarded to the next instruction
- Branches: must first compare the registers and calculate PC + 4 + offset, so the instruction must be fetched, decoded, and executed
- Jumps: must calculate the new PC from the offset, so the instruction must be fetched and decoded
A small sketch of this next-PC selection follows.
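A minimal C sketch of the next-PC selection just described, assuming standard MIPS encodings; the branch_taken input and the software framing are illustrative simplifications, not the actual IF-stage hardware:

    #include <stdint.h>

    uint32_t next_pc(uint32_t pc, uint32_t instr, int branch_taken) {
        uint32_t opcode = instr >> 26;
        if (opcode == 0x02) {                       /* j: jump              */
            uint32_t target = instr & 0x03FFFFFFu;  /* 26-bit target field  */
            return ((pc + 4) & 0xF0000000u) | (target << 2);
        }
        if ((opcode == 0x04 || opcode == 0x05) && branch_taken) { /* beq/bne */
            int32_t offset = (int16_t)(instr & 0xFFFF);   /* sign-extend     */
            return pc + 4 + ((uint32_t)offset << 2);      /* PC+4+offset*4   */
        }
        return pc + 4;                              /* most instructions    */
    }

Note how much later the branch and jump cases are resolved: the jump needs the decoded instruction, and the branch additionally needs the register comparison.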

Datapath Branch and Jump Hardware
[Datapath diagram adding jump support: the jump target is formed from PC+4[31-28] concatenated with the instruction's 26-bit field shifted left 2, selected by a Jump mux alongside the branch PCSrc mux; a forwarding unit feeds the ALU.]

Jumps Incur One Stall
Jumps are not decoded until ID, so one stall is needed.
[Pipeline diagram: j, a one-cycle stall, then lw and and proceeding through the IM, Reg, ALU, DM, Reg stages.]
Fortunately, jumps are very infrequent: only about 2% of the SPECint instruction mix.

Review: Branches Incur Three Stalls
[Pipeline diagram: beq followed by three stall cycles before lw and and can proceed.]
We can fix the branch hazard by waiting (stalling), but that hurts throughput.

Moving Branch Decisions Earlier in the Pipe
Move the branch decision hardware back to the EX stage:
- Reduces the number of stall cycles to two
- Adds an AND gate and a 2x1 mux to the EX timing path
Add hardware to compute the branch target address and evaluate the branch decision in the ID stage:
- Reduces the number of stall cycles to one (as with jumps)
- Computing the branch target address can be done in parallel with the RegFile read (done for all instructions, only used when needed)
- Comparing the registers can't be done until after the RegFile read, so comparing and updating the PC adds a comparator, an AND gate, and a 3x1 mux to the ID timing path
- Requires forwarding hardware in the ID stage
For longer pipelines, the decision point is later in the pipeline, incurring more stalls, so we need a better solution.

Early Branch Forwarding Issues
Bypass source operands from the EX/MEM register:

    if (IDcontrol.Branch
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == IF/ID.RegisterRs))
            ForwardC = 1
    if (IDcontrol.Branch
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == IF/ID.RegisterRt))
            ForwardD = 1

This forwards the result of the second-previous instruction to either input of the compare. MEM/WB forwarding is taken care of by the normal RegFile write-before-read operation. If the instruction immediately before the branch produces one of the branch compare source operands, then a stall is still required, since the EX-stage ALU operation occurs at the same time as the ID-stage branch compare.

Supporting ID Stage Branches
[Datapath diagram: a hazard unit and an IF.Flush control are added; the branch target adder, register compare, and forwarding units are moved so the branch is resolved in the ID stage.]

Branch Prediction
Resolve branch hazards by assuming a given outcome and proceeding without waiting to see the actual branch outcome.
1. Predict not taken: always predict that branches will not be taken and continue to fetch from the sequential instruction stream; the pipeline stalls only when a branch is taken.
- If taken, flush the instructions in the pipeline after the branch: in IF, ID, and EX if the branch logic is in MEM (three stalls); in IF if the branch logic is in ID (one stall)
- Ensure that the flushed instructions haven't changed machine state; this is automatic in the MIPS pipeline, since the state-changing operations (MemWrite or RegWrite) are at the tail end of the pipeline
- Restart the pipeline at the branch destination

Flushing with Misprediction (Not Taken)
[Pipeline diagram: beq $1,$2,2 at address 4, with the misfetched sub $4,$1,$5 flushed, followed by and $6,$1,$7 and or $8,$1,$9.]
To flush the IF-stage instruction, add an IF.Flush control line that zeros the instruction field of the IF/ID pipeline register, transforming the instruction into a noop.

Branch Prediction, cont'd
Resolve branch hazards by statically assuming a given outcome and proceeding.
2. Predict taken: always predict that branches will be taken. Predict taken always incurs a stall (even when the branch destination hardware has been moved to the ID stage).
As the branch penalty increases (for deeper pipelines), a simple static prediction scheme hurts performance. With more hardware, it is possible to predict branch behavior dynamically during program execution.
3. Dynamic branch prediction: predict branches at run time using run-time information.

Dynamic Branch Prediction
A branch prediction buffer, also called a branch history table (BHT), sits in the IF stage. It is addressed by the lower bits of the PC and contains a bit that says whether the branch was taken the last time it was executed.
- The bit may predict incorrectly (it may come from a different branch with the same low-order PC bits, or it may be a wrong prediction for this branch), but that doesn't affect correctness, just performance
- If the prediction is wrong, flush the incorrect instructions in the pipeline, restart the pipeline with the right instructions, and invert the prediction bit
The BHT predicts whether a branch is taken, but does not tell where it is taken to! A branch target buffer (BTB) in the IF stage can cache the branch target address (or even the branch target instruction) so that a stall can be avoided. A small sketch of a BHT follows.
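A minimal C sketch of a 1-bit BHT lookup and update, assuming a 1024-entry table indexed by the low-order bits of a word-aligned PC; the sizes and names are illustrative, not from the slides:

    #include <stdint.h>

    #define BHT_ENTRIES 1024
    static uint8_t bht[BHT_ENTRIES];   /* 1 bit per entry: 1 = predict taken */

    /* Index by low-order PC bits; different branches can alias to the
     * same entry, which hurts only accuracy, never correctness. */
    static unsigned bht_index(uint32_t pc) { return (pc >> 2) % BHT_ENTRIES; }

    int predict_taken(uint32_t pc) { return bht[bht_index(pc)]; }

    void update_predictor(uint32_t pc, int taken) {
        bht[bht_index(pc)] = taken ? 1 : 0;   /* remember the last outcome */
    }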

1-bit Prediction Accuracy
A 1-bit predictor is incorrect twice per loop execution. Assume predict_bit = 0 to start (indicating branch not taken) and loop control at the bottom of the loop:

    Loop: 1st loop instr
          2nd loop instr
          ...
          last loop instr
          bne $1,$2,Loop
          fall out instr

1. The first time through the loop, the predictor mispredicts the branch, since the branch is taken back to the top of the loop; invert the prediction bit (predict_bit = 1)
2. As long as the branch is taken (looping), the prediction is correct
3. Exiting the loop, the predictor again mispredicts the branch, since this time the branch is not taken, falling out of the loop; invert the prediction bit (predict_bit = 0)
For 10 iterations of the loop we get 80% prediction accuracy (2 mispredictions out of 10) for a branch that is taken 90% of the time.

2-bit Predictors
A 2-bit scheme can give 90% accuracy on this loop, since a prediction must be wrong twice before the prediction is changed: the predictor is right 9 times (including the first iteration) and wrong only on the loop exit.
[State diagram: four states, Predict Taken (strong and weak) and Predict Not Taken (weak and strong); each taken branch moves one step toward strongly taken, each not-taken branch one step toward strongly not taken.]
A small sketch of this saturating counter follows.
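A minimal C sketch of the 2-bit saturating counter; the state encoding (0-3) is an illustrative assumption:

    #include <stdint.h>

    /* 2-bit saturating counter: 0,1 predict not taken; 2,3 predict taken.
     * The prediction flips only after two consecutive mispredictions. */
    typedef uint8_t counter2;

    int predict2_taken(counter2 c) { return c >= 2; }

    counter2 update2(counter2 c, int taken) {
        if (taken) return c < 3 ? c + 1 : 3;  /* saturate at strongly taken     */
        else       return c > 0 ? c - 1 : 0;  /* saturate at strongly not taken */
    }

With this counter, the single not-taken outcome at loop exit moves the state from strongly taken to weakly taken, so the next loop entry is still predicted taken correctly.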

Delayed Decision
First, move the branch decision hardware and target address calculation to the ID pipeline stage. A delayed branch always executes the next sequential instruction; the branch takes effect after that next instruction. MIPS software moves an instruction that is not affected by the branch (a safe instruction) to immediately after the branch, thereby hiding the branch delay.
As processors go to deeper pipelines and multiple issue, the branch delay grows and needs more than one delay slot. Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches; growth in available transistors has made dynamic approaches relatively cheaper.

Scheduling Branch Delay Slots
A. From before the branch:

    add $1,$2,$3        becomes    if $2=0 then
    if $2=0 then                     add $1,$2,$3
      delay slot

B. From the branch target:

    sub $4,$5,$6        becomes    add $1,$2,$3
    add $1,$2,$3                   if $1=0 then
    if $1=0 then                     sub $4,$5,$6
      delay slot

C. From fall-through:

    add $1,$2,$3        becomes    add $1,$2,$3
    if $1=0 then                   if $1=0 then
      delay slot                     sub $4,$5,$6
    sub $4,$5,$6

A is the best choice: it fills the delay slot and reduces instruction count (IC). In B, the sub instruction may need to be copied, increasing IC. In B and C, it must be okay to execute sub when the branch fails.

Compiler Optimizations & Data Hazards
Compilers rearrange code to try to avoid load-use stalls (also known as filling the load delay slot). The idea is to find an independent instruction to space apart the load and its use; when successful, no stall is introduced dynamically. The independent instruction may come from before or after the load-use pair, provided it has:
- No data dependence on the load-use instructions
- No control dependence on the load-use instructions
- No exception issues
When the slot can't be filled, leave it empty and the hardware will introduce the stall. In the original MIPS (no hardware stalls), a nop was needed in the slot.

Code Scheduling to Avoid Stalls
Reorder code to avoid using a load result in the next instruction. C code:

    A = B + E;
    C = B + F;

A worked instance of this reordering follows.
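A hedged sketch of the reordering, written as C with explicit temporaries standing in for registers (a real compiler would perform this on the lw/add/sw instructions themselves); comments mark where a MIPS pipeline would stall on a load-use pair:

    void naive(int *B, int *E, int *F, int *A, int *C) {
        int t1 = *B;
        int t2 = *E;
        *A = t1 + t2;   /* load-use stall: t2 was loaded by the previous instr */
        int t3 = *F;
        *C = t1 + t3;   /* load-use stall: t3 was loaded by the previous instr */
    }

    void scheduled(int *B, int *E, int *F, int *A, int *C) {
        int t1 = *B;
        int t2 = *E;
        int t3 = *F;    /* independent load hoisted into the slot */
        *A = t1 + t2;   /* no stall: t2 was loaded two instructions ago */
        *C = t1 + t3;   /* no stall */
    }

Hoisting the third load costs nothing and removes both stalls, since neither add now immediately follows the load it depends on.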

Control Hazard Solutions
Stall on branches:
- Wait until you know the answer (branch CPI becomes 3)
- Control inserted in the ID stage of the pipeline
Predict not taken:
- Assume the branch is not taken and continue with PC + 4
- If the branch is actually not taken, all is good; otherwise, nullify the misfetched instructions and restart from PC + 4 + offset
Predict taken:
- Assume the branch is taken
- More complex, because you also need to predict PC + 4 + offset
- Still need to nullify instructions if the assumption is wrong
Most machines do some type of prediction (taken, not taken).

Exceptions and Interrupts
Unexpected events requiring changes in the flow of control; different ISAs use the terms differently.
- Exception: arises within the CPU, e.g., undefined opcode, overflow, syscall, ...
- Interrupt: arises externally, e.g., from an external I/O controller (network card)
Dealing with them without sacrificing performance is hard.

Handling Exceptions
Something bad happens to an instruction. We need to stop the machine, fix up the problem (with privileged code such as the operating system), and start the machine up again. In MIPS, exceptions are managed by a System Control Coprocessor (CP0):
- Save the PC of the offending (or interrupted) instruction in the Exception Program Counter (EPC)
- Save an indication of the problem in the Cause register; we'll assume 1 bit for the examples (0 for undefined opcode, 1 for overflow)
- Jump to handler code at a hardwired address (e.g., 8000_0180)

An Alternate Mechanism: Vectored Interrupts
The handler address is determined by the cause, which avoids the software overhead of selecting the right code to run based on the Cause register. Example:
- Undefined opcode: C000_0000
- Overflow: C000_0020
- ...: C000_0040
The instructions at each vector either deal with the interrupt or jump to the real handler.

Handler Actions
Read the cause and transfer to the relevant handler (a software switch statement); this may be necessary even if vectored interrupts are available. Then determine the action required:
- If restartable: take corrective action, then use the EPC to return to the program
- Otherwise: terminate the program (e.g., segfault) and report the error using the EPC, Cause, etc.
A small sketch of such a dispatch follows.
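A minimal C sketch of the software dispatch on the Cause register; the cause codes match the 1-bit example above, and the handler bodies are illustrative stubs:

    #include <stdint.h>
    #include <stdio.h>

    enum cause { CAUSE_UNDEF_OPCODE = 0, CAUSE_OVERFLOW = 1 };

    static void handle_undef(uint32_t epc) {
        printf("undefined opcode at 0x%08x\n", (unsigned)epc);
    }
    static void handle_overflow(uint32_t epc) {
        printf("overflow at 0x%08x\n", (unsigned)epc);
    }

    /* The "software switch statement": read Cause, run the right handler. */
    void exception_dispatch(uint32_t cause, uint32_t epc) {
        switch (cause) {
        case CAUSE_UNDEF_OPCODE: handle_undef(epc);    break;
        case CAUSE_OVERFLOW:     handle_overflow(epc); break;
        default:  /* unknown cause: report and terminate the program */
            printf("fatal: cause %u at 0x%08x\n", (unsigned)cause, (unsigned)epc);
        }
    }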

Precise Exceptions
Definition: for a precise exception,
- All previous instructions have completed
- The faulting instruction has not started: no changes to the architectural state (registers, memory)
- None of the following instructions have started
Why are precise exceptions desirable for OS developers? With a single-cycle machine, precise exceptions are easy. Why?

Exceptions in a Pipeline
An exception is another form of control hazard. Consider overflow on an add in the EX stage, add $1, $2, $1:
- Prevent $1 from being clobbered by the add
- Complete the previous instructions
- Nullify the subsequent instructions
- Set the Cause and EPC register values
- Transfer control to the handler
This is similar to a mispredicted branch and uses much of the same hardware.

Nullifying Instructions
Nullify = turn an instruction into a nop (or pipeline bubble):
- Reset its RegWrite and MemWrite signals, so it does not affect the state of the system
- Reset its branch and jump signals, so it does not cause unexpected control flow
- Mark that it should not raise any exceptions of its own
- Let it flow down the pipeline
A small sketch of this signal clearing follows.
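A minimal C sketch of nullification as clearing the control signals carried in a pipeline register; the struct layout is an illustrative assumption, though the signal names follow the slides:

    #include <stdbool.h>

    /* Per-instruction control signals carried in a pipeline register. */
    struct ctrl {
        bool reg_write;     /* writes a register      */
        bool mem_write;     /* writes memory          */
        bool branch, jump;  /* changes control flow   */
        bool can_except;    /* may raise an exception */
    };

    /* Turn the instruction into a bubble: it still flows down the
     * pipeline, but can no longer change state or control flow. */
    void nullify(struct ctrl *c) {
        c->reg_write  = false;
        c->mem_write  = false;
        c->branch     = false;
        c->jump       = false;
        c->can_except = false;
    }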

Dealing with Multiple Exceptions
Pipelining overlaps multiple instructions, so we could have multiple exceptions at once. The simple approach: deal with the exception from the earliest instruction and flush the subsequent instructions; this is necessary for precise exceptions.
In more complex pipelines, with multiple instructions issued per cycle and out-of-order completion, maintaining precise exceptions is difficult!

Imprecise Exceptions
Imprecise exceptions are actually available in several processors: just stop the pipeline and save the pipeline state, including the exception causes, and let the handler work out
- Which instruction(s) had exceptions
- Which to complete or flush (may require manual completion)
This simplifies the hardware but requires more complex handler software. It is not feasible for complex multiple-issue, out-of-order pipelines.

Beyond the 5-stage Pipeline
Convert transistors to performance: use transistors to exploit parallelism, or to create it (speculate). Processor generations:
- Simple machine: reuses hardware
- Pipelined: separate hardware for each stage
- Superscalar: multi-ported memories, more function units
- Out-of-order: mega-ports, complex scheduling
- Speculation
- Multi-core processors
Each design has more logic to accomplish the same task (but faster).

Memory System Outline
- Memory system basics
- Memory technologies: SRAM, DRAM, disk
- Principle of locality: spatial and temporal locality
- Memory hierarchies: exploiting locality

Random Access Memory (RAM)
Dynamic Random Access Memory (DRAM):
- High density, low power, cheap, but slow
- "Dynamic" since the data must be refreshed regularly ("leaky buckets")
- Contents are lost when power is lost
Static Random Access Memory (SRAM):
- Lower density (about 1/10 the density of DRAM), higher cost
- "Static" since the data is held without refresh as long as power is on
- Fast access time, often 2 to 10 times faster than DRAM
Flash memory:
- Holds contents without power
- Data written in blocks, generally slow
- Very cheap

Principle of Locality
Programs have locality: they access a relatively small portion of the address space at any given time, and we can tell what memory locations a program will reference in the future by looking at what it referenced recently in the past. Two types of locality:
- Temporal locality: if an item has been referenced recently, it will tend to be referenced again soon
- Spatial locality: if an item has been referenced recently, nearby items will tend to be referenced soon ("nearby" refers to memory addresses)

Locality Examples
Spatial locality: likely to reference data near recent references.
Example: for (i=0; i<n; i++) a[i] = ...
Temporal locality: likely to reference the same data that was referenced recently.
Example: for (i=0; i<n; i++) a[i] = f(a[i-1]);

Taking Advantage of Locality
Memory hierarchy:
- Store everything on disk (or flash in some systems)
- Copy recently accessed (and nearby) data to a smaller DRAM memory; the DRAM is called main memory
- Copy more recently accessed (and nearby) data to a smaller SRAM memory: cache memory, attached or close to the CPU

Typical Memory Hierarchy: Everything is a Cache for Something Else

    Level         Access time        Capacity   Managed by
    Registers     1 cycle            ~500B      software/compiler
    L1 cache      1-3 cycles         ~64KB      hardware
    L2 cache      5-10 cycles        1-10MB     hardware
    Main memory   ~100 cycles        ~10GB      software/OS
    Disk          10^6-10^7 cycles   ~100GB     software/OS

Caches
A cache is an interim storage component that functions as a buffer for larger, slower storage components. It exploits the principles of locality to:
- Provide as much inexpensive storage space as possible
- Offer access speed equivalent to the fastest memory, for data that is in the cache
The key is to have the right data cached. Computer systems often use multiple caches, and cache ideas are not limited to hardware designers: for example, web caches are widely used on the Internet.

Memory Hierarchy Levels
Block (aka line): the unit of copying; may be multiple words.
If the accessed data is present in the upper level:
- Hit: access satisfied by the upper level
- Hit ratio: hits/accesses
If the accessed data is absent:
- Miss: the block is copied from the lower level; the time taken is the miss penalty
- Miss ratio: misses/accesses = 1 - hit ratio
- The data is then supplied to the CPU from the upper level

Average Memory Access Times
We need to define an average access time, since some accesses will be fast and some slow:

    Access time = hit time + miss rate x miss penalty

The hope is that both the hit time and the miss rate will be low, since the miss penalty is so much larger than the hit time. The Average Memory Access Time (AMAT) formula can be applied to any level of the hierarchy (the access time for that level) and can be generalized for the entire hierarchy (the average access time that the processor sees for a reference).
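As a quick worked example (the numbers are hypothetical): with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, AMAT = 1 + 0.05 x 100 = 6 cycles. Even a 95% hit rate leaves the average access six times slower than a hit, which is why miss rate and miss penalty both matter so much.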

How the Processor Handles a Miss
Assume that a cache access occurs in 1 cycle. A hit is great, and the basic pipeline is fine; the CPI penalty is miss rate x miss penalty. A miss (on an instruction or data access) stalls the pipeline:
- Stall the pipeline (it doesn't have the data it needs)
- Send the address that missed to the memory
- Instruct main memory to perform a read and wait
- When the access completes, return the data to the processor
- Restart the instruction

How to Build a Cache?
The big question is locating data in the cache: we need to map a large address space into a small memory. How do we do that? We can build a fully associative lookup in hardware, but it is complex; we need a simple but effective solution. Two common techniques:
- Direct mapped
- Set associative
Further questions:
- Block size (crucial for spatial locality)
- Replacement policy (crucial for temporal locality)
- Write policy (writes are always more complex)

Terminology
- Block: the minimum unit of information transfer between levels of the hierarchy. Block addressing varies by technology at each level; blocks are moved one level at a time.
- Hit: the data appears in a block in the lower-numbered level.
- Hit rate: the percentage of accesses found.
- Hit time: the time to access the lower-numbered level. Hit time = cache access time + time to determine hit/miss.
- Miss: the data was not in the lower-numbered level and had to be fetched from a higher-numbered level.
- Miss rate: the percentage of misses (1 - hit rate).
- Miss penalty: the overhead of getting data from a higher-numbered level. Miss penalty = higher-level access time + time to deliver to the lower level + cache replacement/forward-to-processor time. The miss penalty is usually much larger than the hit time.

Direct Mapped Cache
The location in the cache is determined by the (main) memory address. Direct mapped: only one choice,

    (Block address) modulo (#Blocks in cache)
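For instance, with the 8-block cache used in the examples below, block address 22 (10110 in binary) maps to index 22 mod 8 = 6 = 110. When the number of blocks is a power of two, the modulo is free: the index is just the low-order bits of the block address.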

Tags and Valid Bits
How do we know which particular block is stored in a cache location? Store the block address as well as the data; actually, only the high-order bits are needed, and these are called the tag. What if there is no data in a location? A valid bit indicates this: 1 = present, 0 = not present, initially 0.

Cache Example
8 blocks, 1 word/block, direct mapped. Initial state:

    Index  V  Tag  Data
    000    N
    001    N
    010    N
    011    N
    100    N
    101    N
    110    N
    111    N

Cache Example: Compulsory/Cold Miss

    Word addr  Binary addr  Hit/Miss  Cache block
    22         10 110       Miss      110

    Index  V  Tag  Data
    000    N
    001    N
    010    N
    011    N
    100    N
    101    N
    110    Y  10   Mem[10110]
    111    N

Cache Example: Compulsory/Cold Miss

    Word addr  Binary addr  Hit/Miss  Cache block
    26         11 010       Miss      010

    Index  V  Tag  Data
    000    N
    001    N
    010    Y  11   Mem[11010]
    011    N
    100    N
    101    N
    110    Y  10   Mem[10110]
    111    N

Cache Example: Hit

    Word addr  Binary addr  Hit/Miss  Cache block
    22         10 110       Hit       110
    26         11 010       Hit       010

    Index  V  Tag  Data
    000    N
    001    N
    010    Y  11   Mem[11010]
    011    N
    100    N
    101    N
    110    Y  10   Mem[10110]
    111    N

Cache Example

    Word addr  Binary addr  Hit/Miss  Cache block
    16         10 000       Miss      000
    3          00 011       Miss      011
    16         10 000       Hit       000

    Index  V  Tag  Data
    000    Y  10   Mem[10000]
    001    N
    010    Y  11   Mem[11010]
    011    Y  00   Mem[00011]
    100    N
    101    N
    110    Y  10   Mem[10110]
    111    N

Cache Example: Replacement

    Word addr  Binary addr  Hit/Miss  Cache block
    18         10 010       Miss      010

    Index  V  Tag  Data
    000    Y  10   Mem[10000]
    001    N
    010    Y  10   Mem[10010]
    011    Y  00   Mem[00011]
    100    N
    101    N
    110    Y  10   Mem[10110]
    111    N

(Block 18 = 10010 replaces block 26 = 11010 at index 010.)

Cache Organization & Access
Assumptions: 32-bit address, 4 Kbyte cache with 1024 blocks of 1 word each. Steps:
1. Use the index to read V and the tag from the cache
2. Compare the tag read from the cache with the tag from the address
3. If they match (and V is set), return the data and a hit signal; otherwise, return a miss

Cache Block Example
Assume a 2^n byte direct mapped cache with 2^m byte blocks:
- Byte select: the lower m bits of the memory address
- Cache index: the next (n-m) bits, just above the byte select
- Cache tag: the upper (32-n) bits of the memory address
A small sketch of this address decomposition and of the lookup steps above follows.
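A minimal C sketch of the decomposition and the hit check, assuming the 4KB direct mapped cache with 4-byte (1-word) blocks from the slide above, i.e. n = 12 and m = 2; the structure and names are illustrative, not a real implementation:

    #include <stdint.h>
    #include <stdbool.h>

    #define N 12                        /* 2^12 = 4KB cache     */
    #define M 2                         /* 2^2  = 4-byte blocks */
    #define NUM_BLOCKS (1u << (N - M))  /* 1024 blocks          */

    struct line { bool valid; uint32_t tag; uint32_t data; };
    static struct line cache[NUM_BLOCKS];

    /* Split a 32-bit byte address into byte select, index, and tag. */
    static uint32_t byte_sel(uint32_t a) { return a & ((1u << M) - 1); }
    static uint32_t index_of(uint32_t a) { return (a >> M) & (NUM_BLOCKS - 1); }
    static uint32_t tag_of(uint32_t a)   { return a >> N; }

    /* Direct mapped lookup: exactly one candidate line;
     * hit iff that line is valid and its tag matches. */
    bool cache_hit(uint32_t addr, uint32_t *data_out) {
        struct line *l = &cache[index_of(addr)];
        if (l->valid && l->tag == tag_of(addr)) {
            *data_out = l->data;  /* byte_sel(addr) would pick the byte */
            return true;
        }
        return false;  /* miss: fetch the block from memory, fill the line */
    }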

Block Sizes
Larger block sizes take advantage of spatial locality, but they also incur a larger miss penalty, since it takes longer to transfer the block into the cache; too large a block can even increase the average access time or the miss rate. There is a tradeoff in selecting the block size, visible in the AMAT formula:

    Average access time = hit time + miss rate x miss penalty
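A hypothetical illustration of the tradeoff (all numbers invented for the arithmetic): with a 1-cycle hit time, 16-byte blocks at a 5% miss rate and 80-cycle penalty give 1 + 0.05 x 80 = 5 cycles; 64-byte blocks that cut the miss rate to 3% but raise the penalty to 120 cycles give 1 + 0.03 x 120 = 4.6 cycles; 256-byte blocks that only reach 2.8% with a 280-cycle penalty give 1 + 0.028 x 280 = 8.8 cycles. The best block size sits in the middle.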

Direct Mapped Problems
Conflict misses:
- Two blocks that are used concurrently map to the same index
- Only one can fit in the cache, regardless of cache size; there is no flexibility in placing the 2nd block elsewhere
Thrashing:
- If accesses alternate, each block will replace the other before reuse, so there is no benefit from caching
Conflicts and thrashing can happen quite often.
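For example, in the 8-block cache above, block addresses 10110 and 00110 both map to index 110. A loop that touches both on every iteration would miss on every access, even though six other cache blocks sit idle.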

Fully Associative Cache
The opposite extreme, in that it has no cache index to hash: any available entry can store a memory block. There are no conflict misses, only capacity misses, but the cache tags of all entries must be compared to find the desired one.

Acknowledgements
These slides contain material from the following courses: UCB CS152, Stanford EE108B.