MIPS ISA AND PIPELINING OVERVIEW Appendix A and C

OUTLINE
- Review of MIPS ISA
- Review of pipelining

READING ASSIGNMENT
- Read Appendix A
- Read Appendix C

THE MIPS ISA (A.9)
- First MIPS in 1985
- General-purpose RISC
- Load-store architecture
- MIPS provides a good architectural model for study
  - Popular and easy
- MIPS emphasizes
  - A simple load-store instruction set
  - Design for pipelining efficiency
  - Efficiency as a compiler target
- Several models
  - MIPS64

THE MIPS ISA (A.9)
- Fixed instruction encoding
- Supports these addressing modes
  - Displacement (offset of 12-16 bits)
  - Immediate (8-16 bits)
  - Register indirect
- Supports these data sizes and types
  - 8-, 16-, 32-, and 64-bit integers
  - 32-bit and 64-bit IEEE 754 floating-point numbers
- Supports these simple instructions
  - Load, store, add, subtract, move register-register, shift
  - Compare equal, compare not equal, compare less, branch (with a PC-relative address at least 8 bits long), jump, call, and return
- Registers
  - Integer registers
  - Floating-point registers

THE MIPS ISA (A.9)
- Registers for MIPS64
  - 32 64-bit GPRs, or integer registers (R0, ..., R31)
    - R0 is always 0
  - 32 64-bit FPRs (F0, ..., F31)
    - Each holds a single 32-bit single-precision value, or
    - a single 64-bit double-precision value
  - Special registers: exception, PC, FP status
- Data types for MIPS64
  - Integer data: 8-bit bytes, 16-bit half words, 32-bit words, and 64-bit double words
  - Floating point: 32-bit single precision and 64-bit double precision
  - Data can be loaded into registers with sign or zero extension
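
The difference between sign and zero extension can be sketched in a few lines of Python. This is only an illustration: the helper names `zero_extend` and `sign_extend` are hypothetical, and the pairing with the unsigned and signed load mnemonics (LBU/LHU/LWU versus LB/LH/LW) comes from the standard MIPS64 instruction set rather than from the slide.

```python
MASK64 = (1 << 64) - 1  # a MIPS64 GPR holds 64 bits


def zero_extend(value: int, bits: int) -> int:
    """Zero-extend a `bits`-wide value to 64 bits (unsigned loads such as LBU)."""
    return value & ((1 << bits) - 1)


def sign_extend(value: int, bits: int) -> int:
    """Sign-extend a `bits`-wide value to 64 bits (signed loads such as LB)."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    # Flipping and then subtracting the sign bit copies it into all upper bits.
    return ((value ^ sign_bit) - sign_bit) & MASK64


if __name__ == "__main__":
    print(hex(zero_extend(0x80, 8)))  # 0x80
    print(hex(sign_extend(0x80, 8)))  # 0xffffffffffffff80
```

The same byte 0x80 therefore ends up as 128 in the register after a zero-extending load and as -128 (all upper bits set) after a sign-extending load.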

THE MIPS ISA (A.9)
- Addressing modes for MIPS data transfers
  - Register addressing
  - Immediate
  - Displacement
  - Direct
- Memory addressing
  - 64-bit byte addresses
  - Big endian or little endian (mode bit)
  - Aligned memory addressing!
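
A small Python sketch of the two memory-addressing points, byte order and alignment. The function names are hypothetical, and the alignment rule used here (an access of `size` bytes is aligned when its address is a multiple of `size`) is the usual definition, assumed rather than quoted from the slide.

```python
import struct


def is_aligned(address: int, size: int) -> bool:
    """True when an access of `size` bytes starts on a multiple of `size`."""
    return address % size == 0


def word_in_memory(value: int, big_endian: bool) -> bytes:
    """Return the 4 bytes of a 32-bit word in increasing address order."""
    fmt = ">I" if big_endian else "<I"  # '>' big endian, '<' little endian
    return struct.pack(fmt, value)


if __name__ == "__main__":
    print(is_aligned(1024, 8), is_aligned(1028, 8))
    print(word_in_memory(0x11223344, big_endian=True).hex())
    print(word_in_memory(0x11223344, big_endian=False).hex())
```

In big-endian mode the most significant byte sits at the lowest address; in little-endian mode the least significant byte does.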

THE MIPS ISA (A.9)
MIPS Instruction Format
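
The fixed 32-bit encoding can be illustrated by packing instruction fields into a word. The field widths below (6-bit opcode, 5-bit register fields, 16-bit immediate, 5-bit shift amount, 6-bit function code) are the standard MIPS I-type and R-type layouts, not something spelled out on the slide. The opcode 0x19 for DADDIU and function code 0x2D for DADDU are quoted from the MIPS64 opcode map and are worth checking against the manual. The function names are hypothetical.

```python
def encode_i_type(opcode: int, rs: int, rt: int, imm: int) -> int:
    """Pack an I-type word: opcode(6) | rs(5) | rt(5) | immediate(16)."""
    return ((opcode & 0x3F) << 26) | ((rs & 0x1F) << 21) | ((rt & 0x1F) << 16) | (imm & 0xFFFF)


def encode_r_type(opcode: int, rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack an R-type word: opcode(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)."""
    return (
        ((opcode & 0x3F) << 26)
        | ((rs & 0x1F) << 21)
        | ((rt & 0x1F) << 16)
        | ((rd & 0x1F) << 11)
        | ((shamt & 0x1F) << 6)
        | (funct & 0x3F)
    )


if __name__ == "__main__":
    # DADDIU R4, R0, 54 (assumed opcode 0x19; rs = R0, rt = R4)
    print(hex(encode_i_type(0x19, 0, 4, 54)))
    # DADDU R2, R0, R0 (assumed opcode 0, funct 0x2D; rd = R2)
    print(hex(encode_r_type(0, 0, 0, 2, 0, 0x2D)))
```

Because every instruction is one 32-bit word with the opcode and register fields in fixed positions, the decode stage can read the register numbers before it knows which instruction it is looking at.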

THE MIPS ISA (A.9)
- MIPS operations
  - Loads and stores
  - ALU operations
  - Branches and jumps
  - Floating-point operations

THE MIPS ISA (A.9)
MIPS Operations

THE MIPS ISA (A.9)
MIPS Control Flow Instructions

THE MIPS ISA (A.9)
- MIPS floating-point operations
  - Manipulate the floating-point registers in either single or double precision
  - Single-precision operations: ADD.S, SUB.S, MUL.S, DIV.S
  - Double-precision operations: ADD.D, SUB.D, MUL.D, DIV.D
  - Floating-point compares set a bit in the special floating-point status register, which can be tested with a pair of branches: BC1T and BC1F

THE MIPS ISA (A.9)
Example

    unsigned long x[32];
    int k;
    for (k = 0; k < 32; k++) {
        x[k] = x[k] * 54;
    }

    =====================================================

        LD      R1, 1024(R0)    // address of x is in Mem[1024]
        DADDU   R2, R0, R0      // k = 0
        DADDIU  R4, R0, 54      // constant 54 in R4
    L:  LD      R3, 0(R1)       // load x[k]
        DMULU   R3, R3, R4      // x[k] * 54 (unsigned, low 64 bits)
        SD      R3, 0(R1)       // update x[k]
        DADDIU  R1, R1, 8       // next element (8 bytes per double word)
        DADDIU  R2, R2, 1       // increment k
        SLTIU   R5, R2, 32      // R5 = 1 if k < 32, else 0
        BNE     R5, R0, L       // loop while k < 32
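
To check the assembly against the C loop, the instructions can be mirrored one by one in Python. This is a rough sketch, not a MIPS simulator: the memory is a dictionary, the address 2048 chosen for x is a made-up value, and `run_loop` is a hypothetical name. It assumes x has exactly 32 elements, as in the example.

```python
MASK64 = (1 << 64) - 1  # registers wrap at 64 bits


def run_loop(x, x_addr=2048):
    """Mirror the assembly loop register by register and return the updated x."""
    # Assumed layout: Mem[1024] holds the address of x; x is 32 double words.
    mem = {1024: x_addr}
    for i, value in enumerate(x):
        mem[x_addr + 8 * i] = value & MASK64

    r = [0] * 32                         # R0..R31; R0 is never written
    r[1] = mem[1024 + r[0]]              # LD R1, 1024(R0)
    r[2] = (r[0] + r[0]) & MASK64        # DADDU R2, R0, R0
    r[4] = (r[0] + 54) & MASK64          # DADDIU R4, R0, 54
    while True:
        r[3] = mem[r[1] + 0]             # L: LD R3, 0(R1)
        r[3] = (r[3] * r[4]) & MASK64    # multiply, keep the low 64 bits
        mem[r[1] + 0] = r[3]             # SD R3, 0(R1)
        r[1] = (r[1] + 8) & MASK64       # DADDIU R1, R1, 8
        r[2] = (r[2] + 1) & MASK64       # DADDIU R2, R2, 1
        r[5] = 1 if r[2] < 32 else 0     # SLTIU R5, R2, 32
        if r[5] == r[0]:                 # BNE R5, R0, L falls through when equal
            break
    return [mem[x_addr + 8 * i] for i in range(len(x))]


if __name__ == "__main__":
    print(run_loop(list(range(32)))[:4])  # [0, 54, 108, 162]
```

The branch at the bottom of the loop is taken 31 times and falls through once, which is the pattern a branch predictor would see for this code.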

PIPELINING
- Key implementation technique used to make fast CPUs
- Pipelining is an implementation technique whereby multiple instructions are overlapped in execution to take advantage of parallelism among the actions (steps) needed to execute an instruction
- Invisible to the programmer!
- Instruction execution is split into steps (stages)
  - Each stage finishes part of the instruction
  - All stages are given the same time to finish (one processor cycle!)
- Improves throughput, not latency!
- The depth of the pipeline determines the speedup
  - Ideally, speedup equals the number of pipeline stages
  - A deeper pipeline gives a higher ideal speedup, but is more complex and expensive!

SIMPLE IMPLEMENTATION OF A RISC ISA
- Properties that make it easy
  - All operations on data apply to data in registers and typically change the entire register
  - The only operations that affect memory are load and store operations
  - Few, fixed-size instruction formats
- 5-stage implementation
  - Instruction fetch cycle (IF)
  - Instruction decode/register fetch cycle (ID)
  - Execution/effective address cycle (EX)
  - Memory access (MEM)
  - Write-back cycle (WB)
- Cycles per instruction type
  - Load: 5
  - ALU and store: 4
  - Branch and jump: 2

SIMPLE IMPLEMENTATION OF A RISC ISA
- For N instructions on the 5-stage implementation
  - Cycles for unpipelined = 5N
  - Cycles for pipelined = 5 + N - 1
  - Speedup = 5N / (5 + N - 1), which approaches 5 as N grows
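
The cycle counts above translate directly into a short function. This is a minimal sketch that assumes every instruction takes all five stages and that nothing stalls; `pipeline_speedup` is a hypothetical name.

```python
def pipeline_speedup(n: int, stages: int = 5) -> float:
    """Speedup of an ideal `stages`-deep pipeline over unpipelined execution."""
    unpipelined_cycles = stages * n        # each instruction uses every stage in turn
    pipelined_cycles = stages + (n - 1)    # fill the pipe once, then one per cycle
    return unpipelined_cycles / pipelined_cycles


if __name__ == "__main__":
    for n in (1, 5, 100, 1_000_000):
        print(n, round(pipeline_speedup(n), 3))
```

For a single instruction there is no gain at all, and the speedup only gets close to 5 once N is much larger than the pipeline depth.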

SIMPLE IMPLEMENTATION OF A RISC ISA
- Pipeline registers between stages
- Separate instruction and data memories

SIMPLE IMPLEMENTATION OF A RISC ISA
Pipelining performance

                            Single     Unpipelined                   Pipelined
    Cycle time              400 ns     100 ns                        100 ns
    Cycles per instruction  1          4 ALU, 2 branch, 5 memory     1

- Instruction mix: 50% ALU, 30% MEM, 20% branch
- Time (in ns, for IC instructions)
  - Time Single = IC x 1 x 400 = 400 IC
  - Time Unpipe = IC x (0.5 x 4 + 0.2 x 2 + 0.3 x 5) x 100 = 390 IC
  - Time Pipe = IC x 1 x 100 = 100 IC
- Speedup? 390 IC / 100 IC = 3.9 over the unpipelined design
- Ideally, speedup is 5, but
  - Unbalanced stages
  - Instruction mix
  - Time to fill/empty
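
The arithmetic on this slide can be reproduced with a few lines of Python. The numbers are the ones given above; the dictionary and function names are hypothetical.

```python
CYCLE_NS = {"single": 400, "unpipelined": 100, "pipelined": 100}
MIX = {"alu": 0.5, "mem": 0.3, "branch": 0.2}          # instruction mix
UNPIPELINED_CYCLES = {"alu": 4, "mem": 5, "branch": 2}  # cycles per instruction type


def average_cpi(mix, cycles):
    """Weighted average of cycles per instruction over the instruction mix."""
    return sum(mix[kind] * cycles[kind] for kind in mix)


def time_per_instruction():
    """Average time per instruction in ns for each of the three designs."""
    return {
        "single": 1 * CYCLE_NS["single"],
        "unpipelined": average_cpi(MIX, UNPIPELINED_CYCLES) * CYCLE_NS["unpipelined"],
        "pipelined": 1 * CYCLE_NS["pipelined"],
    }


if __name__ == "__main__":
    t = time_per_instruction()
    print(t)
    print("speedup over unpipelined:", t["unpipelined"] / t["pipelined"])
    print("speedup over single-cycle:", t["single"] / t["pipelined"])
```

With this mix the pipelined design is about 3.9 times faster than the unpipelined one and 4 times faster than the single-cycle one, both short of the ideal factor of 5.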

PIPELINING HAZARDS
- Hazards are the occasions in which the next instruction in the stream is prevented from executing during its designated clock cycle
- Types
  - Structural: more than one instruction needs access to one hardware unit in the same cycle
  - Data: one instruction requires the result of previous instruction(s) that are still in the pipeline
  - Control: instructions that change the PC
- Solving hazards can be done by stalling the pipeline
  - Conflicting instructions are paused
  - Earlier instructions proceed
- They reduce the performance of the pipeline!
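
How much stalls cost can be estimated with the usual textbook relation: pipelined CPI = ideal CPI + stall cycles per instruction, and speedup = pipeline depth / pipelined CPI. The sketch below assumes an ideal CPI of 1 and perfectly balanced stages, neither of which is stated on the slide; the function names are hypothetical.

```python
def pipelined_cpi(stalls_per_instruction: float, ideal_cpi: float = 1.0) -> float:
    """CPI of the pipeline once average stall cycles per instruction are added."""
    return ideal_cpi + stalls_per_instruction


def speedup_with_stalls(depth: int, stalls_per_instruction: float) -> float:
    """Speedup over the unpipelined design, assuming ideal CPI 1 and balanced stages."""
    return depth / pipelined_cpi(stalls_per_instruction)


if __name__ == "__main__":
    for stalls in (0.0, 0.25, 0.5, 1.0):
        print(stalls, speedup_with_stalls(5, stalls))
```

An average of only a quarter of a stall cycle per instruction already reduces a 5-stage pipeline from a speedup of 5 to a speedup of 4.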

PIPELINING HAZARDS
- Structural hazards
  - Solution
    - Duplicate units
    - Pipeline units
  - Cost vs. improvement!

PIPELINING HAZARDS
- Data hazards
  - Forwarding?
  - Register file?

PIPELINING HAZARDS
- Data hazards
  - Forwarding does not solve all hazards!
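
The standard case that forwarding cannot cover in the classic 5-stage pipeline is a load followed immediately by an instruction that uses the loaded register: the data only arrives at the end of MEM, one cycle too late for the next instruction's EX stage. The sketch below counts those load-use stalls in a straight-line instruction sequence. It assumes full forwarding everywhere else, and the `Instr` tuple and `load_use_stalls` function are hypothetical names.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                    # mnemonic, e.g. "LD"
    dest: Optional[str]        # destination register, or None (stores, branches)
    srcs: Tuple[str, ...]      # source registers


def load_use_stalls(program: List[Instr]) -> int:
    """Count one stall for each load whose result is used by the very next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        loads_real_register = prev.op == "LD" and prev.dest not in (None, "R0")
        if loads_real_register and prev.dest in cur.srcs:
            stalls += 1
    return stalls


if __name__ == "__main__":
    program = [
        Instr("LD", "R1", ("R2",)),          # LD   R1, 0(R2)
        Instr("DSUB", "R4", ("R1", "R5")),   # DSUB R4, R1, R5  needs R1 right away
        Instr("AND", "R6", ("R1", "R7")),    # AND  R6, R1, R7  can use forwarding
    ]
    print(load_use_stalls(program))  # 1
```

A compiler can often remove such a stall by moving an independent instruction between the load and its first use.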

PIPELINING HAZARDS
- Control hazards
  - They have a greater impact on performance than data hazards!
  - Recall that if a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken, or untaken.
  - Should we fetch the following instruction or the one at the branch target?
  - Wait for the branch decision! Stall! Expensive!

PIPELINING HAZARDS
- Control hazards
  - Treat the branch as not taken!
  - Treat the branch as taken!
  - Static prediction!
  - If the prediction is wrong, flush the pipeline!
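
The two static schemes can be compared on a trace of branch outcomes. The sketch below simply counts mispredictions and multiplies by a flush penalty. The one-cycle default penalty is an assumption that depends on where the pipeline resolves branches, and `flush_cycles` is a hypothetical name.

```python
from typing import Iterable


def flush_cycles(outcomes: Iterable[bool], predict_taken: bool, penalty: int = 1) -> int:
    """Cycles lost to flushes when every branch gets the same static prediction.

    `outcomes` holds one bool per executed branch (True = taken).
    `penalty` is the assumed number of cycles flushed per misprediction.
    """
    mispredictions = sum(1 for taken in outcomes if taken != predict_taken)
    return mispredictions * penalty


if __name__ == "__main__":
    # A loop-closing branch that iterates 32 times: taken 31 times, then falls through.
    trace = [True] * 31 + [False]
    print("predict not taken:", flush_cycles(trace, predict_taken=False))
    print("predict taken:    ", flush_cycles(trace, predict_taken=True))
```

For a backward loop branch like this one, predicting taken is wrong only once, while predicting not taken is wrong on almost every iteration. Predict-taken only helps if the target address is available early enough to fetch from it.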

PIPELINING HAZARDS
- Control hazards
  - Delayed branch!
  - Compiler!

PIPELINING HAZARDS
- Control hazards
  - Dynamic branch prediction!
  - Use a branch-prediction buffer, or branch history table (BHT)
    - The buffer is addressed by the lower portion of the branch instruction's address
    - Each location contains bit(s) that give the prediction for the branch (1 or 2 bits)
    - The buffer is essentially a small cache shared by all branch instructions
    - Accuracy depends on how often the entry belongs to the branch of interest and on how accurate the prediction is
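
A branch history table with 1-bit or 2-bit entries can be sketched as a small array of saturating counters indexed by the low bits of the branch address. Several details here are assumptions for illustration: the table size of 16 entries, counters that start at 0 (predict not taken), and dropping the two byte-offset bits of the address before indexing. The class and function names are hypothetical.

```python
class BranchHistoryTable:
    """Table of saturating counters indexed by low-order branch address bits."""

    def __init__(self, entries: int = 16, bits: int = 2):
        self.entries = entries
        self.max_count = (1 << bits) - 1      # 1 for 1-bit, 3 for 2-bit counters
        self.threshold = 1 << (bits - 1)      # predict taken at or above this value
        self.counters = [0] * entries         # assumed initial state: not taken

    def _index(self, pc: int) -> int:
        # Drop the byte offset of a 4-byte instruction, keep the low-order bits.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= self.threshold

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(bht: BranchHistoryTable, pc: int, outcomes) -> float:
    """Fraction of correct predictions for one branch over a trace of outcomes."""
    correct = 0
    for taken in outcomes:
        if bht.predict(pc) == taken:
            correct += 1
        bht.update(pc, taken)
    return correct / len(outcomes)


if __name__ == "__main__":
    loop = [True] * 31 + [False]              # one full run of a 32-iteration loop
    for bits in (1, 2):
        bht = BranchHistoryTable(bits=bits)
        first = accuracy(bht, 0x40, loop)
        second = accuracy(bht, 0x40, loop)    # the same loop executed again
        print(bits, first, second)
```

When the loop is run a second time, the 1-bit entry mispredicts twice (the first and last iterations) while the 2-bit entry mispredicts only the loop exit. Two branches whose addresses share the same low-order bits use the same entry, which is why the buffer behaves like a small shared cache without tags.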

PIPELINING HAZARDS
Control Hazards
