COMP 212 Computer Organization & Architecture
COMP 212 Fall 2008 Lecture 12: RISC & Superscalar
Comp 212 Computer Org & Arch 1 Z. Li, 2008

Pipeline Re-Cap
Pipelining is ILP - Instruction Level Parallelism
Divide the instruction cycle into stages, with overlapped execution
Could potentially achieve a k-fold speedup for a k-stage pipeline
Pipeline Hazards:
Structural: two micro-ops require the same circuits in the same cycle
Control: the branch target PC is not known until execution
Data: successive instructions read the output of a previous instruction

Instruction Micro-Operations
Execution takes longer than fetch
Break execution up into sub-cycles, i.e., DI, CO, FO, EI, WO
Allow overlapping, or prefetch the next instruction

Instruction Pipeline (no hazard)
A 6-stage pipeline
Branch: may have to refetch the correct instruction
Speedup: 9x6 = 54 (no pipeline) vs 14 (pipelined) time slots
Alternative Pipeline View: Conditional Branching
The correct PC address is runtime dependent
Here the correct PC is found to be I15 only when the branch resolves
Branch: flush out I3-I6

Speedup (perfect case)
k-stage pipeline, n instructions; execution-time speedup:
speedup = n*k / (k + (n - 1))

Dealing with Branches
Pipeline efficiency depends on a steady stream of instructions that fills up the pipeline
Conditional branching is a major drawback for efficiency
It can be dealt with by:
Multiple streams
Prefetching the branch target
Loop buffer
Branch prediction
Delayed branching
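The perfect-case speedup formula can be checked against the earlier example of 9 instructions on a 6-stage pipeline; a minimal sketch:

```python
def pipeline_speedup(n, k):
    """Ideal speedup of a k-stage pipeline over n instructions:
    n*k cycles unpipelined vs k + (n - 1) cycles pipelined."""
    return (n * k) / (k + n - 1)

# 9 instructions, 6 stages: 54 unpipelined time slots vs 14 pipelined
print(pipeline_speedup(9, 6))   # 54/14, about 3.86
```

As n grows large the speedup approaches k, which is why branch flushes, which drain the pipeline and restart the instruction stream, are so costly.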
Branch Prediction: Static Solutions
Predict never taken: assume the jump will not happen; always fetch the next instruction (68020 & VAX 11/780)
Predict always taken: assume the jump will happen; always fetch the target instruction
Predict by opcode: collect statistics on different opcodes w.r.t. branching; correct rate > 75%

Branch Prediction: Dynamic, Runtime Based
Taken/not-taken switch: use 1 or 2 bits to record taken/not-taken history; good for loops
Branch history table: predict based on previous history; good for loops

Branch Prediction State Diagram

RISC
Reduced Instruction Set Computer
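The 2-bit taken/not-taken scheme behind the state diagram can be sketched as a saturating counter; this is a generic illustration (the initial state chosen here is an assumption), not the slides' exact diagram:

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken,
    states 2-3 predict taken. Two consecutive mispredictions are
    needed to flip the prediction, which is why it works well
    for loop branches."""
    def __init__(self):
        self.state = 3  # start at strongly-taken (an assumption)

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

# A loop branch: taken 5 times, then not-taken once at loop exit.
p = TwoBitPredictor()
hits = 0
for taken in [True] * 5 + [False]:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits)  # 5 of 6 correct: only the loop-exit branch mispredicts
```

A single not-taken outcome moves the counter to the weakly-taken state without flipping the prediction, so the next iteration of the loop is still predicted correctly.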
Motivation of RISC
Improve pipeline efficiency
Fixed instruction format and a small number of instructions: make the operations more predictable and manageable
Large register files: avoid data dependencies and hazards
Both compile-time and run-time pipeline optimization: register renaming, out-of-order execution

A little bit of history
The computer family concept: IBM System/360 (1964), DEC PDP-8
Separates architecture from implementation
Microprogrammed control unit: idea by Wilkes (1951), produced in the IBM S/360 (1964)
Flexibility and extensibility in CPU control implementation
Cache memory: IBM S/360 Model 85 (1969)

A bit of history (cont.)
Solid-state RAM (see memory notes)
Microprocessors: Intel 4004 (1971)
Pipelining: introduces parallelism into the fetch-execute cycle
Multiple processors

The Next Step - RISC
Reduced Instruction Set Computer
Key features:
Large number of general-purpose registers, or use of compiler technology to optimize register use
Limited and simple instruction set
Emphasis on optimising the instruction pipeline
Instruction Characteristics
Operations performed: functions to be performed, how the instruction interacts with memory
Operands used: types of operands, memory organization and addressing modes
Execution sequencing: control and pipeline operations

Operations
Assignments: movement of data
Conditional statements (IF, THEN, FOR, WHILE): sequence control
Procedure call/return is very time consuming
Some HLL statements lead to many machine-code operations

Operation Statistics
In a High Level Language (HLL) like C or Pascal, assignment is the dominating operation
Weighted by number of machine instructions / memory references

Operands
Mainly local scalar variables
Optimisation should concentrate on accessing local variables

Operand type        Pascal   C     Average
Integer constant     16%    23%     20%
Scalar variable      58%    53%     55%
Array/Structure      26%    24%     25%
Procedure Calls
Time consuming; cost depends on the number of parameters passed and on the level of nesting
Most programs do not do a lot of calls followed by lots of returns
Most variables are local

Implications
Best support is given by optimising the most used and most time-consuming features:
Large number of registers for operand referencing
Careful design of pipelines: branch prediction etc.
Simplified (reduced) instruction set

Large Register File
Software solution: require the compiler to allocate registers
Allocate based on the most-used variables in a given time window
Requires sophisticated program analysis
Hardware solution: have more registers, so more variables will reside in registers

Why CISC?
Software costs far exceed hardware costs
Increasingly complex high-level languages (HLL)
Semantic gap: machine instructions vs HLL statements
Leads to:
Large instruction sets
More addressing modes
Hardware implementations of HLL statements, e.g. CASE (switch) on VAX
Intention of CISC
Ease compiler writing
Improve execution efficiency: complex operations in microcode/micro-ops
Support more complex HLLs
However, CISC instructions are complex, and hard to predict and optimize.

Registers for Local Variables
Registers are the fastest storage, better than cache and memory
Variable accesses are highly localized
Keeping data assignments in registers is good for performance
Software approach: the compiler figures out variable-to-register assignment at compile time
Hardware approach: register windows

Register Windows
Most operations reference a few local variables of the current function, along with a couple of globals
Function calls change the local variable set
Function calls also involve parameters to be passed
So, instead of using the stack to save local variables and pass parameters, partition the register file into sets,
and select a different window to access according to program execution.
Register Windows cont.
Three areas within a register set:
Parameter registers
Local registers
Temporary registers
Temporary registers of one set overlap the parameter registers of the next
This allows parameter passing without moving data
Example: Berkeley RISC uses 8 windows of 16 registers each

Overlapping Register Windows / Circular Buffer diagram

Managing the register windows
When a call is made, the current window pointer is moved to the now-active register window
If all windows are in use, an interrupt is generated and the oldest window (the one furthest back in the call nesting) is saved to memory
A saved window pointer indicates where the next saved window should be restored to

Global Variables
Allocated by the compiler to memory: inefficient for frequently accessed variables
Alternatively, have a set of registers for global variables, e.g. require R0-R7 to be used for storing globals
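The call/return bookkeeping above can be sketched as a circular buffer of windows. This is a schematic illustration under the Berkeley RISC figure of 8 windows, with the spill-to-memory machinery reduced to a Python list standing in for memory:

```python
class RegisterWindows:
    """Circular buffer of register windows. The current window pointer
    (CWP) advances on a call; when all windows are in use, the oldest
    window is spilled to memory (the interrupt case on the slides),
    and a return restores the most recently spilled window."""
    def __init__(self, n_windows=8):
        self.n = n_windows
        self.cwp = 0          # current window pointer
        self.depth = 1        # windows currently in use
        self.spilled = []     # stand-in for windows saved to memory

    def call(self):
        self.cwp = (self.cwp + 1) % self.n
        if self.depth == self.n:
            # all windows in use: spill the oldest one
            self.spilled.append((self.cwp - self.depth) % self.n)
        else:
            self.depth += 1

    def ret(self):
        self.cwp = (self.cwp - 1) % self.n
        if self.spilled:
            self.spilled.pop()  # restore via the saved window pointer
        else:
            self.depth -= 1

w = RegisterWindows()
for _ in range(10):       # nest 10 calls deep with only 8 windows
    w.call()
print(len(w.spilled))     # 3: three oldest windows went to memory
```

Only call depths beyond the window count touch memory at all; shallow call/return sequences, which the slides note are the common case, run entirely in registers.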
Referencing a variable in a windowed register file vs referencing a variable in cache

Registers vs Cache
Windowed register file: stores all variables of the N-1 most recent procedure calls; faster; handles globals well
Cache: stores a selection of recently used variables; more efficient use of memory

Compiler-Based Register Optimization
Assume a small number of registers (16-32)
Optimization is up to the compiler; HLL programs have no explicit references to registers
Assign a symbolic (virtual) register to each candidate variable
Map the (unlimited) symbolic registers onto the real registers
Symbolic registers that do not overlap can share a real register
If you run out of real registers, some variables use memory
How to assign variables to registers?
Graph Coloring Algorithm:
Build a register interference graph: if two variables are live at the same time (i.e., they interfere with each other), draw an edge between them
Find the smallest number of colors for the nodes such that interfering nodes do not share a color
Each color is assigned to a different register

RISC Pipelining
Most instructions are register to register
Two phases of execution:
I: Instruction fetch
E: Execute (ALU operation with register input and output)
For load and store:
I: Instruction fetch
E: Execute (calculate the memory address)
D: Memory (register-to-memory or memory-to-register operation)

Effects of Pipelining
Sequential: 13 cycles; pipelined with 1 memory port: 10 cycles; with out-of-order execution and 2 memory ports: 8 cycles

Optimization of Pipelining
Insertion of NOOPs avoids the need for circuits to clear the pipeline
Out-of-order execution
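The coloring steps above can be sketched with a simple greedy pass (a minimal illustration; real allocators, such as Chaitin-style allocators, also handle spilling when the colors run out):

```python
def color_interference_graph(edges, variables):
    """Greedy graph coloring: each color is a register; variables
    that are live at the same time (share an edge) must receive
    different colors."""
    neighbors = {v: set() for v in variables}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    color = {}
    for v in variables:
        used = {color[n] for n in neighbors[v] if n in color}
        c = 0
        while c in used:          # pick lowest color not used by a neighbor
            c += 1
        color[v] = c
    return color

# a-b and b-c interfere, but a and c never live at once:
# a and c can share a register.
print(color_interference_graph([("a", "b"), ("b", "c")], ["a", "b", "c"]))
# {'a': 0, 'b': 1, 'c': 0}
```

Here two colors (registers) suffice for three variables, which is exactly the payoff the slides describe: non-overlapping symbolic registers share a real one.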
Comparison of CISC/RISC processors
CISC vs RISC: a summary

CISC vs RISC
Compiler simplification? Complex machine instructions are harder to exploit; optimization is more difficult
Smaller programs? A program takes up less memory, but memory is now cheap; it may not occupy fewer bits, it just looks shorter in symbolic form
More instructions require longer opcodes; register references require fewer bits
Faster programs? There is a bias towards the use of simpler instructions; a more complex control unit and a larger microprogram control store mean even simple instructions take longer to execute
It is far from clear that CISC is the appropriate solution
RISC Characteristics
Simple instructions, one instruction per cycle
Register-to-register operations
Few, simple addressing modes
Few, simple, fixed instruction formats
Hardwired design (no microcode)
More compile-time optimization effort

No conclusive comparison
Quantitative: compare program sizes and execution speeds
Qualitative: examine issues of high-level language support and use of VLSI real estate
Problems:
No pair of RISC and CISC machines that are directly comparable
No definitive set of test programs
Difficult to separate hardware effects from compiler effects
Most comparisons done on toy rather than production machines
Most commercial devices are a mixture: register renaming, out-of-order execution

What is Superscalar?
Scalar computer: handles one instruction, one data item at a time
Vector computer: handles multiple data items at a time

Superscalar Architecture
Superscalar computer: multiple independent pipelines (e.g. 2 integer, 2 floating point, 1 memory) are implemented
Each pipeline has stages, and each stage can also handle multiple instructions
Superscalar vs Superpipeline
Superpipeline: many pipeline stages need less than half a clock cycle; doubling the internal clock speed gets two tasks done per external clock cycle
Superscalar: allows parallel fetch and execute

Limitations
Instruction-level parallelism, exploited via compiler-based optimisation and hardware techniques
Limited by:
True data dependency
Procedural dependency
Resource conflicts
Output dependency
Antidependency

True Data Dependency
ADD r1, r2 (r1 := r1 + r2;)
MOVE r3, r1 (r3 := r1;)
Can fetch and decode the second instruction in parallel with the first
Cannot execute the second instruction until the first is finished

Procedural Dependency
Cannot execute instructions after a branch in parallel with instructions before the branch
Also, if the instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed
This prevents simultaneous fetches
Resource Conflict
Two or more instructions require access to the same resource at the same time, e.g. two arithmetic instructions
Solution: duplicate resources, e.g. have two arithmetic units

Effect of Dependency
Illustration of data dependency, procedural (branch) dependency, and resource dependency

Design Issues
Instruction-level parallelism: instructions in a sequence are independent, so execution can be overlapped; governed by data and procedural dependency
Machine parallelism: the ability to take advantage of instruction-level parallelism; governed by the number of parallel pipelines
Instruction issue policy:
The order in which instructions are fetched
The order in which instructions are executed
The order in which instructions change registers and memory
In-Order Issue, In-Order Execute
Issue instructions in the order they occur; not very efficient
May fetch more than one instruction; instructions must stall if necessary
(Example: 2 fetch units, 3 execute units, 2 memory ports)

In-Order Issue, Out-of-Order Execute
Output dependency:
R3 := R3 + R5; (I1)
R4 := R3 + 1; (I2)
R3 := R5 + 1; (I3)
I2 depends on the result of I1 - data dependency
If I3 completes before I1, the value left in R3 will be wrong - output (write-write) dependency

Out-of-Order Issue, Out-of-Order Execute
Decouple the decode pipeline from the execution pipeline
Can continue to fetch and decode until the instruction window is full
When a functional unit becomes available, an instruction can be executed
Since instructions have been decoded, the processor can look ahead

Antidependency
Read-write dependency:
R3 := R3 + R5; (I1)
R4 := R3 + 1; (I2)
R3 := R5 + 1; (I3)
R7 := R3 + R4; (I4)
I3 cannot complete before I2 starts, as I2 needs a value in R3 and I3 changes R3
Register Renaming
Output dependencies and antidependencies occur because register contents may not reflect the correct ordering from the program
They may result in a pipeline stall
Solution: registers are allocated dynamically, i.e., registers are not specifically named

Register Renaming example
R3b := R3a + R5a (I1)
R4b := R3b + 1 (I2)
R3c := R5a + 1 (I3)
R7b := R3c + R4b (I4)
Without the subscript: the logical register in the instruction
With the subscript: the hardware register actually allocated
Note R3a, R3b, R3c: three different hardware registers for logical R3

Machine Parallelism Performance
Duplication of resources, out-of-order issue, renaming
Not worth duplicating functional units without register renaming
Need the instruction window to be large enough (more than 8)
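The renaming example can be reproduced with a small table that hands out a fresh hardware register name on every write. The a/b/c suffix scheme mirrors the slide's subscripts; the class itself is an illustrative sketch, not a real processor's rename logic:

```python
NAMES = "abcdefgh"

class Renamer:
    """Map logical registers to hardware registers. Reads return the
    current version ('a' is the initial value); every write allocates
    a fresh version, removing output and anti-dependencies."""
    def __init__(self):
        self.version = {}  # logical register -> current version index

    def read(self, reg):
        v = self.version.setdefault(reg, 0)   # first read sees e.g. R3a
        return reg + NAMES[v]

    def write(self, reg):
        v = self.version.get(reg, 0) + 1      # fresh name for each write
        self.version[reg] = v
        return reg + NAMES[v]

r = Renamer()
# I1: R3 := R3 + R5   ->  R3b := R3a + R5a
# I2: R4 := R3 + 1    ->  R4b := R3b + 1
# I3: R3 := R5 + 1    ->  R3c := R5a + 1
# I4: R7 := R3 + R4   ->  R7b := R3c + R4b
i1 = (r.read("R3"), r.read("R5"), r.write("R3"))
i2 = (r.read("R3"), r.write("R4"))
i3 = (r.read("R5"), r.write("R3"))
i4 = (r.read("R3"), r.read("R4"), r.write("R7"))
print(i1, i2, i3, i4)
```

After renaming, I3 writes R3c while I2 still reads R3b, so I3 no longer has to wait for I2: the antidependency on logical R3 is gone.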
Branch Prediction in Superscalar
The 80486 fetches both the next sequential instruction after the branch and the branch target instruction
Gives a two-cycle delay if the branch is taken

RISC - Delayed Branch
Calculate the result of the branch before unusable instructions are prefetched
Always execute the single instruction immediately following the branch
Keeps the pipeline full while fetching the new instruction stream
Not as good for superscalar: multiple instructions would need to execute in the delay slot, and instruction dependence problems arise
Superscalar machines revert to branch prediction

Superscalar Implementation
Simultaneously fetch multiple instructions
Logic to determine true dependencies involving register values
Mechanisms to communicate these values
Mechanisms to initiate multiple instructions in parallel
Resources for parallel execution of multiple instructions
Mechanisms for committing process state in correct order

PowerPC
Direct descendant of the IBM 801, RT PC and RS/6000; all are RISC
RS/6000: first superscalar
PowerPC 601: superscalar design similar to the RS/6000
Later versions extend the superscalar concept
PowerPC 601 General View
PowerPC 601 Pipeline

Summary
RISC
A design that simplifies elementary instruction processing
Allows optimization and efficiency improvements by the compiler and by run-time circuits
Mainstream solution now: MIPS, SPARC, PowerPC, etc.
Superscalar
Multiple fetch, execution and memory-port units
An additional dimension for achieving ILP
Brings more complex issues for the consistency and correctness of execution