Advanced Computer Architecture, Chapter 1: Introduction to Sequential and Pipelined Instruction Execution. Martin Milata
What is a Processor's Architecture?
- Instruction Set Architecture (ISA): describes the class of ISA, memory addressing and access, operands and operations, control flow of instructions, and instruction encoding.
- Design and organization: defines the interface (or boundary) between software and hardware; describes the high-level aspects of the processor's design and the logical blocks or units the processor contains (ALU, FPU, general-purpose registers, MMU, controller, ...).
- Hardware: includes the detailed logic design of the processor, the low-level aspects (expected process technology).
Evolution of the Microprocessor Architecture - ISA
- CISC and RISC architectures. Today, CISC instructions are usually translated internally into RISC-like instructions in most x86 architectures.
- Two critical performance techniques:
  - Instruction Level Parallelism (ILP): pipelining and, later, multiple instruction issue
  - Cache memory: from a simple cache to hierarchical cache memory with sophisticated organization and optimization
- Growth of processor performance: ILP, cache memory and specialized processor units brought roughly 50% performance growth per year for about 16 years, until 2002. Today uniprocessor performance grows by roughly 20% per year. Future growth is focused on the use of TLP and DLP techniques.
Evolution of the Microprocessor Architecture - Design
- Design and organization: from simple processors with a single ALU to today's complex multicore processors; CPU off-loading to peripheral devices (for example graphics or network cards); integration of peripheral devices on chip (northbridge, graphics card).
- Hardware evolution: from the first fully electronic digital computing device, the Atanasoff-Berry Computer, built from vacuum tubes, to the modern process technologies (32 nm and 22 nm) used in production today.
A Reference RISC processor (1)
A RISC architecture will be used to illustrate the basic concepts; the introduced ideas are of course applicable to other processor architectures. A typical RISC processor has:
- 32-bit fixed-format instructions
- 32-bit general-purpose registers
- memory access only via load and store instructions, with a single addressing mode (base + displacement)
- simple branch conditions
- delayed branches
A Reference RISC processor (2)
Instruction types: Register-Register, Register-Immediate, Branch, Jump/Call.
The 5 stages of RISC instruction execution:
- Instruction fetch cycle (IF)
- Instruction decode/register fetch cycle (ID)
- Execution/effective address cycle (EX)
- Memory access (MEM)
- Write-back cycle (WB)
No stage of an instruction uses resources shared with another stage (except for MEM and IF).
5 Stages of the MIPS Datapath (figure)
Sequential Unpipelined Instruction Execution
- It was used in the first computers; this processor architecture model is called subscalar.
- Instructions are executed one by one, in sequential order, without any overlap.
- The processing time does not need to be the same for each instruction stage.
- Some parts of the processor may be idle in some clock cycles.
- The time needed to execute the whole program is the sum of the execution times of all instructions: Tc = T * N * CPI, where Tc is the total execution time, T the clock period, N the total number of instructions, and CPI the clocks per instruction.
- It is still used in very simple computers (like simple calculators).
Q: How many cycles does the execution of these two instructions require?
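The formula above can be checked with a short sketch; the clock period, instruction count and CPI values below are hypothetical, chosen only for illustration:

```python
def total_cycles(n_instructions, cpi):
    """Total clock cycles needed: N * CPI."""
    return n_instructions * cpi

def total_execution_time_ns(clock_period_ns, n_instructions, cpi):
    """Tc = T * N * CPI for sequential (unpipelined) execution."""
    return clock_period_ns * total_cycles(n_instructions, cpi)

# Example: 1 GHz clock (T = 1 ns), one million instructions, CPI = 5
print(total_execution_time_ns(1, 1_000_000, 5))  # 5000000 ns = 5 ms
```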
Instruction Pipeline Execution
- Overlapping the execution of instructions brings better usage of the processor units; this is the scalar processor architecture model.
- It does not reduce the execution time of a single instruction, but it increases the CPU instruction throughput.
- Some hardware modifications are needed for pipelined execution: bypassing and/or short-circuiting, and pipeline registers.
- Pipeline overhead: the setup time of the pipeline registers.
Q: Will it work correctly?
MIPS Pipeline Datapath (figure)
Pipeline Timing, Throughput and Latency
- Pipeline timing: unpipelined vs. pipelined execution. In a balanced pipeline every stage takes the same time; in general the time of each stage can differ.
- Pipeline throughput: the number of completed instructions per second.
- Pipeline latency: how long it takes to execute a single instruction in the pipeline.
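As a rough sketch of these definitions, the following compares unpipelined execution time with the ideal time of a balanced pipeline; the stage times and instruction count are made-up illustrative values:

```python
def unpipelined_time(stage_times, n_instructions):
    # Each instruction must pass through all stages before the next one starts.
    return sum(stage_times) * n_instructions

def pipelined_time(stage_times, n_instructions):
    # The clock period is set by the slowest stage; after (stages - 1) fill
    # cycles, one instruction completes per cycle.
    cycle = max(stage_times)
    return (len(stage_times) - 1 + n_instructions) * cycle

stages = [1.0, 1.0, 1.0, 1.0, 1.0]   # balanced 5-stage pipeline, 1 ns per stage
n = 1000
print(unpipelined_time(stages, n))   # 5000.0 ns
print(pipelined_time(stages, n))     # 1004.0 ns -> speedup close to 5
```

Note that the latency of a single instruction is unchanged (5 ns); only the throughput improves.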
Pipeline Hazards
Hazards prevent the execution of the next instruction in its designated clock cycle, and they reduce the speedup of pipelined execution. There are three classes of hazards:
- Structural hazards: arise from resource conflicts; the hardware may not support all combinations of instructions in overlapped execution.
- Data hazards: an instruction depends on the result of a previous instruction, which is not available when it is needed.
- Control hazards: may arise when the program counter is changed by a flow-control instruction (branch).
Structural Hazards
Example 1: the instruction fetch and a data memory access collide on a processor with a single memory access channel. The instruction OR cannot be loaded in cycle 4 due to a structural hazard (there is no free memory port for fetching a new instruction).
Structural Hazards
Possible solutions to the structural hazard from Example 1:
- separate instruction and data memories (brings a higher CPU cost)
- stall the pipeline for one cycle (decreases the pipeline execution speedup)
Data Hazard
RAW data hazard on register R1: the hazard affects the sub, and, and or instructions. The result of the first instruction (add) is not available in time.
Three Generic Classes of Data Hazards
Read After Write (RAW): the following instruction tries to use a result of a previous instruction that is not yet available at that moment:
  add r1,r2,r3
  sub r6,r1,r4
The impact of this type of hazard can be reduced with a hardware technique called forwarding: the ALU result from the EX/MEM and MEM/WB pipeline registers is fed back to the ALU inputs. The hardware must be able to detect RAW hazards. This is called a dependence.
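The RAW dependence between two instructions can be sketched as a simple register-overlap check; the (destination, sources) tuple encoding below is an assumption made for illustration, not a real MIPS instruction format:

```python
def raw_hazard(first, second):
    """True if `second` reads a register that `first` writes (RAW)."""
    dest, _srcs = first
    _dest, srcs = second
    return dest in srcs

add_i = ("r1", ("r2", "r3"))     # add r1, r2, r3
sub_i = ("r6", ("r1", "r4"))     # sub r6, r1, r4
print(raw_hazard(add_i, sub_i))  # True: sub reads r1 before add writes it back
```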
Three Generic Classes of Data Hazards
Write After Read (WAR): the following instruction writes its result into a register before the previous instruction can read the original value:
  mul r2,r1,r3
  add r1,r4,r5
The hardware must not allow this data hazard to occur. It cannot happen in the simple MIPS 5-stage pipeline, but it is common in processors with complex pipelining. Called an anti-dependence.
Three Generic Classes of Data Hazards
Write After Write (WAW): the following instruction writes its result into a register before the previous one does:
  sub r1,r2,r3
  add r1,r4,r3
The hardware must not allow this data hazard to occur. It cannot happen in the simple MIPS 5-stage pipeline. Called an output dependence.
Q: Why can the last two types of data hazards not happen in this type of processor?
HW Changes for Forwarding
- forwarding paths
- modified multiplexors that select where the ALU operands come from
- logic in the ID stage for determining the sources of the ALU operands
Unsolved Data Hazards
Not all data hazards can be solved with the forwarding technique. The data from a load (lw) instruction is available only after the fourth clock cycle; even forwarding cannot deliver it to the ALU at the end of the third clock cycle.
How To Avoid Load Hazards
- Hardware-assisted solution: the hardware must be able to recognize load hazards; the pipeline is stalled when a load hazard occurs.
- Software scheduling by the compiler: during compilation, the order of instructions can be modified, with the help of independent instructions, into an acceptable sequence without load hazards.
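The hardware-assisted solution can be sketched as a pass that inserts a one-cycle stall (nop) after a load whose result is used by the immediately following instruction; the (op, destination, sources) tuples are an illustrative assumption:

```python
def insert_load_stalls(program):
    """Insert a nop after each lw whose result is used by the next instruction."""
    out = []
    for i, instr in enumerate(program):
        out.append(instr)
        op, dest, _srcs = instr
        if op == "lw" and i + 1 < len(program):
            _op, _d, next_srcs = program[i + 1]
            if dest in next_srcs:            # load-use hazard: stall one cycle
                out.append(("nop", None, ()))
    return out

prog = [("lw", "r1", ("r0",)),        # lw  r1, 0(r0)
        ("add", "r2", ("r1", "r3"))]  # add r2, r1, r3
print(insert_load_stalls(prog))
# [('lw', 'r1', ('r0',)), ('nop', None, ()), ('add', 'r2', ('r1', 'r3'))]
```

A scheduling compiler would instead try to move an independent instruction into the slot occupied by the nop.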
Control Hazards (Branch Hazards)
They can cause a greater performance loss than data hazards. When a branch is executed, the new value of the PC may or may not be something other than PC + 4.
Branch Stall Impact
The branch penalty of the 5-stage MIPS pipeline is 3 clock cycles: the branch target is known only at the end of the execute/address-calculation stage (3rd clock cycle).
Minimization of the branch penalty with a HW modification: already in the instruction decode stage, determine whether the branch is taken and calculate its target address. The branch penalty is minimized but still exists.
Branch prediction schemes: simple compile-time schemes and dynamic branch prediction.
Reducing Pipeline Branch Penalties with a Simple HW Modification
Early branch determination: the target of the branch is known already in the ID stage of the branch instruction.
Reducing Pipeline Branch Penalties with the Static Scheduling Strategy
- Freeze or flush the pipeline: stall the pipeline until the branch destination is known. The simplest scheme to handle branches; it wastes CPU time (in the case of the 5-stage MIPS pipeline, the next instruction cannot be fetched until the branch destination is known).
- Predicted not taken: the next instruction is fetched as if the branch were not executed (PC = PC + 4). Instructions must be removed from the pipeline if the branch is taken, and the processor state must not change until the branch outcome is known. Approximately 47% of MIPS branches are not taken.
- Predicted taken: the inverse of the previous case.
Reducing Pipeline Branch Penalties with the Static Scheduling Strategy
Delayed branch: the branch takes effect only after the instruction(s) following it. The delay slot is the number of clock cycles needed to obtain the branch target address. The instructions in the delay slot following the branch are executed whether the branch is taken or not:
  branch instruction
  sequential successor1
  sequential successor2
  ...
  sequential successorn
  branch target if taken
(branch delay slot of length n)
Q: What is the length of the branch delay slot in the 5-stage MIPS pipeline with early branch determination?
Delayed Branch
Where do we get instructions to fill the branch delay slot? Canceling instructions in the branch delay slot: the second two cases may require canceling the instructions in the delay slot; the instruction is executed, but its write-back must be disabled until the outcome of the branch is known.
Dynamic Branch Prediction
The simplest branch prediction scheme is a branch-prediction buffer, or branch history table: a small cache indexed by the lower bits of the branch address, with a limited number of entries, each containing a bit that says whether the branch was taken or not.
- The prediction may not be correct: another branch with the same lower address bits can modify the prediction bit, or the branch condition may simply not be predictable.
- The scope of the prediction bit is local.
- 2-bit prediction schemes are often used.
Two-Bit Dynamic Branch Prediction
The scope of the predictors is local (per-branch prediction). The misprediction rate of a 2-bit branch predictor with 4K entries ranges from 1% to 18%.
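A 2-bit predictor of this kind keeps a saturating counter per table entry, so two mispredictions in a row are needed before the prediction flips. A minimal sketch, where the table size and PC-based indexing are simplified assumptions:

```python
class TwoBitPredictor:
    """2-bit saturating counters: states 0-1 predict not taken, 2-3 predict taken."""
    def __init__(self, entries=4096):
        self.table = [0] * entries    # all counters start at strongly not-taken
        self.mask = entries - 1       # entries must be a power of two

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = TwoBitPredictor()
for _ in range(3):
    p.update(0x40, True)     # branch taken three times: counter saturates at 3
p.update(0x40, False)        # a single not-taken outcome: counter drops to 2
print(p.predict(0x40))       # True: still predicts taken (hysteresis)
```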
Advanced Branch Prediction Schemes
- Correlating branch predictors (two-level adaptive predictors): look first at the recent behavior of other branches; the behavior of the predicted branch itself has lower priority.
- Tournament predictors, adaptively combining local and global predictors: multiple predictors are used (a local-history base, a global-history base, and a selector that combines both). Effective at medium prediction cache sizes (8K-32K bits).
Extending the MIPS Pipeline to Handle Multi-cycle Operations
4 separate functional units:
- integer unit (1 clock): integer ALU, load, store, branches
- FP and integer multiplier (7 clocks)
- FP adder (4 clocks): FP add, subtract, conversion
- FP and integer divider, unpipelined (24 clocks)
This extension brings the possibility of WAW and RAW data hazards and structural hazards.
Structural Hazards and Multi-cycle Operation Extension
- The probability of structural hazards increases with the presence of unpipelined units: only one instruction can be in the EX stage of such a unit at a time (an issue problem).
- The number of register writes required in one cycle can be larger than one, because of the varying running times of instructions. Possible solutions:
  - implement multiple write ports, or separate integer and floating-point register files
  - a simple shift register that indicates when already-issued instructions will use the register file, stalling the issue of new instructions
  - a mechanism that stalls a conflicting instruction when it tries to enter the MEM or WB stage
Data Hazards and Multi-cycle Operation Extension
- Stalls for RAW hazards will be more frequent (varying running times of instructions).
- WAW hazards become possible: instructions do not reach the WB stage in the order in which they were issued, and the processor does not know the length of the pipeline when issuing an instruction.
- A simple solution: if an instruction wants to write into the same register as an already-issued instruction, it is stalled.
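The simple WAW-avoidance rule above amounts to a destination-register check at issue time; the in-flight bookkeeping below is an illustrative assumption:

```python
def can_issue(dest, in_flight_dests):
    """Return False (stall) if `dest` is also written by an already-issued,
    uncompleted instruction, which would risk a WAW hazard."""
    return dest not in in_flight_dests

in_flight = {"r1"}                 # e.g. a long-latency FP operation writing r1
print(can_issue("r1", in_flight))  # False: issuing would create a WAW hazard
print(can_issue("r2", in_flight))  # True: no conflict, issue proceeds
```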
Super-scalar processors
Issuing one instruction every clock cycle is not enough for effective usage of all functional units of the processor. Multiple instructions are issued, decoded and forwarded to execution:
- the processor must have multiple functional units for each pipeline stage
- multiple read/write ports into the register file have to be present
- the processor must have multiple read ports into the instruction cache (or memory)
Super-scalar processors
A lot of independent instructions have to be present in the application code: instructions issued together must not contain any dependences, to avoid data hazards. Independent instructions can be scheduled statically by the compiler, or dynamically with the help of hardware before their execution.
Superscalar processors
Types of superscalar processors:
- Superscalar with dynamic issue structure: some instructions can be executed out of order; no speculation during execution; not used today.
- Superscalar with dynamic issue and speculative scheduling: out-of-order execution with speculation (Pentium 4, Core, Core 2, ..., IBM POWER5).
- VLIW/LIW: statically issued instructions with static scheduling; hazard detection and other preparation at compilation time; the embedded space.
- EPIC: primarily statically issued instructions with mostly static scheduling; hazard detection at compilation time (Intel Itanium).
Advantages of Dynamic Scheduling
- It allows code that was compiled for one pipeline to run efficiently on a different pipeline.
- It simplifies the compiler: the compiled code is not platform-dependent, and optimization for efficient execution is not necessary.
- It can handle dependences that are not known at compile time.
Dynamic Scheduling
The simple idea of dynamic scheduling is to allow out-of-order execution of instructions after their dynamic issue. Out-of-order execution implies out-of-order completion, and it introduces the possibility of WAR and WAW hazards. Both of these hazards are avoided by the use of register renaming (which will be discussed in a future chapter). To allow out-of-order execution, it is essential to split the ID stage of the simple five-stage pipeline into two stages:
- Issue (II): decode the instruction and check for structural hazards (all in order)
- Read operands (RO): wait until no data hazards remain, then read the operands (out of order, for all waiting instructions)
Dynamic Scheduling
The issue window is the number of instructions that can be forwarded to out-of-order execution without being completed. Out-of-order completion has to be managed with a scoreboard, or, more usually today, with the help of Tomasulo's algorithm.
Static Scheduling at Compilation Time
VLIW/LIW and EPIC processors schedule instructions at compilation time.
Advantages of compile-time scheduling:
- the compiler sees the code of the whole application when looking for independent instructions (instructions without dependences)
- the long instruction word with mostly statically scheduled instructions can be created very effectively
- effective usage of all processor units (in the best case)
Disadvantages:
- not all potential hazard states can be resolved at compilation time; the compiler detects potential hazards, but hardware help is necessary to resolve them
- the processor itself has only a limited ability to solve hazard problems
Conclusion
- Introduction to the description of computer architecture with the help of the ISA
- Unpipelined instruction execution on processors with independent 5-stage execution
- Pipelined instruction execution on a reference processor with the simple 5-stage MIPS pipeline
- Classes of hazards and their possible solutions
- Extension of the 5-stage MIPS pipeline for multi-cycle instruction execution
- Introduction to super-scalar processors and the dynamic and static scheduling of independent instructions
Literature
- John L. Hennessy, David A. Patterson: Computer Architecture: A Quantitative Approach, 4th Edition
- Paul H. J. Kelly: Advanced Computer Architecture, lecture notes (course 332)
- Andrew S. Tanenbaum: Operating Systems: Design and Implementation