ECE 154A Introduction to Computer Architecture Fall 2012 Dmitri Strukov Lecture 10 Floating point review Pipelined design
IEEE Floating Point Format
Fields: S (1 bit) | Exponent (single: 8 bits, double: 11 bits) | Fraction (single: 23 bits, double: 52 bits)
$x = (-1)^{S} \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}$
S: sign bit (0 = non-negative, 1 = negative)
Normalized significand: $1.0 \le |\text{significand}| < 2.0$
  Always has a leading pre-binary-point 1 bit, so there is no need to represent it explicitly (hidden bit)
  Significand is the Fraction with the "1." restored
Exponent: excess representation: actual exponent + Bias
  Ensures the stored exponent is unsigned
  Single: Bias = 127; Double: Bias = 1023
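As a rough illustration of the field layout, the sketch below splits a single-precision value into S, Exponent, and Fraction and then rebuilds the value from the formula. The helper names `decode_float32` and `normalized_value` are made up for this example, and only normalized numbers are handled (no zeros, denormals, infinities, or NaNs).

```python
import struct

def decode_float32(value):
    """Split a number, stored as IEEE 754 single precision, into (S, Exponent, Fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with Bias = 127
    fraction = bits & 0x7FFFFF       # 23 bits, hidden leading 1 not stored
    return sign, exponent, fraction

def normalized_value(sign, exponent, fraction, bias=127):
    """x = (-1)^S * (1 + Fraction) * 2^(Exponent - Bias), normalized numbers only."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - bias)

fields = decode_float32(-0.75)
print(fields)                     # (1, 126, 4194304)
print(normalized_value(*fields))  # -0.75
```

Here -0.75 is -1.5 x 2^-1, so the stored exponent is -1 + 127 = 126 and the fraction field encodes 0.5.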
Floating Point Addition
Consider a 4-digit decimal example: $9.999 \times 10^{1} + 1.610 \times 10^{-1}$
1. Align decimal points
   Shift the number with the smaller exponent
   $9.999 \times 10^{1} + 0.016 \times 10^{1}$
2. Add significands
   $9.999 \times 10^{1} + 0.016 \times 10^{1} = 10.015 \times 10^{1}$
3. Normalize result & check for over/underflow
   $1.0015 \times 10^{2}$
4. Round and renormalize if necessary
   $1.002 \times 10^{2}$
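The four steps can be traced with a small decimal model. This is only a sketch: `fp_add_4digit` is a hypothetical helper, the shifted operand is truncated to four digits during alignment (which matches the 0.016 on the slide), and the final rounding is round-half-up. Real hardware works in binary and keeps extra guard bits.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

Q = Decimal("0.001")  # four significant digits: d.ddd

def fp_add_4digit(sig_a, exp_a, sig_b, exp_b):
    """Add sig_a * 10^exp_a and sig_b * 10^exp_b; return (significand, exponent)."""
    if exp_a < exp_b:  # make operand a the one with the larger exponent
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    # Step 1: align by shifting the number with the smaller exponent (truncating)
    shifted = sig_b.scaleb(exp_b - exp_a).quantize(Q, rounding=ROUND_DOWN)
    # Step 2: add significands
    total, exp = sig_a + shifted, exp_a
    # Step 3: normalize so that 1 <= |significand| < 10
    while abs(total) >= 10:
        total, exp = total.scaleb(-1), exp + 1
    while total != 0 and abs(total) < 1:
        total, exp = total.scaleb(1), exp - 1
    # Step 4: round, then renormalize if rounding carried out to 10.000
    total = total.quantize(Q, rounding=ROUND_HALF_UP)
    if abs(total) >= 10:
        total, exp = total.scaleb(-1).quantize(Q, rounding=ROUND_HALF_UP), exp + 1
    return total, exp

print(fp_add_4digit(Decimal("9.999"), 1, Decimal("1.610"), -1))  # (Decimal('1.002'), 2)
```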
FP Adder Hardware
[Figure: FP adder datapath, annotated with Step 1 through Step 4 of the addition algorithm]
Floating Point Multiplication
Consider a 4-digit decimal example: $1.110 \times 10^{10} \times 9.200 \times 10^{-5}$
1. Add exponents
   For biased exponents, subtract bias from sum
   New exponent = 10 + (-5) = 5
2. Multiply significands
   $1.110 \times 9.200 = 10.212 \Rightarrow 10.212 \times 10^{5}$
3. Normalize result & check for over/underflow
   $1.0212 \times 10^{6}$
4. Round and renormalize if necessary
   $1.021 \times 10^{6}$
5. Determine sign of result from signs of operands
   $+1.021 \times 10^{6}$
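A minimal decimal sketch of the five multiplication steps follows. The function names are hypothetical, signs are passed as 0/1 bits, and rounding is round-half-up. The second helper illustrates the biased-exponent rule using the single-precision bias of 127, chosen purely for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

Q = Decimal("0.001")  # four significant digits: d.ddd

def fp_mul_4digit(sign_a, sig_a, exp_a, sign_b, sig_b, exp_b):
    """Multiply two sign/significand/exponent triples; significands are positive."""
    exp = exp_a + exp_b                     # Step 1: add exponents
    product = sig_a * sig_b                 # Step 2: multiply significands
    while product >= 10:                    # Step 3: normalize
        product, exp = product.scaleb(-1), exp + 1
    product = product.quantize(Q, rounding=ROUND_HALF_UP)  # Step 4: round
    if product >= 10:                       # renormalize if rounding carried out
        product, exp = product.scaleb(-1).quantize(Q, rounding=ROUND_HALF_UP), exp + 1
    sign = sign_a ^ sign_b                  # Step 5: sign from operand signs
    return sign, product, exp

def biased_exponent_sum(biased_a, biased_b, bias=127):
    """Adding two biased exponents counts the bias twice, so subtract it once."""
    return biased_a + biased_b - bias

print(fp_mul_4digit(0, Decimal("1.110"), 10, 0, Decimal("9.200"), -5))  # (0, Decimal('1.021'), 6)
print(biased_exponent_sum(10 + 127, -5 + 127))                          # 132, i.e. 5 + 127
```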
Accurate Arithmetic
IEEE Std 754 specifies additional rounding control
  Extra bits of precision (guard, round, sticky)
  Choice of rounding modes
  Allows programmer to fine-tune numerical behavior of a computation
Not all FP units implement all options
  Most programming languages and FP libraries just use defaults
Trade-off between hardware complexity, performance, and market requirements
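To see what a choice of rounding mode changes, the sketch below rounds the same value to three decimal places under four directed modes. It uses Python's `decimal` module as a stand-in, so it models the idea in decimal rather than the binary rounding an FP unit performs. The mode labels and the helper `round_all_modes` are this example's own.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR

MODES = {
    "nearest, ties to even": ROUND_HALF_EVEN,
    "toward zero": ROUND_DOWN,
    "toward +infinity": ROUND_CEILING,
    "toward -infinity": ROUND_FLOOR,
}

def round_all_modes(text, places="0.001"):
    """Round the decimal string `text` to `places` under each mode."""
    value = Decimal(text)
    return {name: value.quantize(Decimal(places), rounding=mode)
            for name, mode in MODES.items()}

for name, result in round_all_modes("-1.0025").items():
    print(f"{name:22s} {result}")
```

For -1.0025 only the toward -infinity mode gives -1.003. The other three give -1.002, with ties-to-even picking the neighbor whose last digit is even.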
Interpretation of Data (The BIG Picture)
Bits have no inherent meaning
  Interpretation depends on the instructions applied
Computer representations of numbers
  Finite range and precision
  Need to account for this in programs
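A quick way to see that bits have no inherent meaning is to read the same four bytes three ways. The bit pattern 0xC0000000 below is an arbitrary choice for this sketch.

```python
import struct

raw = bytes.fromhex("C0000000")              # one 32-bit pattern, big-endian

as_unsigned = struct.unpack(">I", raw)[0]    # unsigned integer
as_signed = struct.unpack(">i", raw)[0]      # two's-complement integer
as_float = struct.unpack(">f", raw)[0]       # IEEE 754 single precision

print(as_unsigned, as_signed, as_float)      # 3221225472 -1073741824 -2.0
```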
Associativity
Parallel programs may interleave operations in unexpected orders
  Assumptions of associativity may fail
x = -1.50E+38, y = 1.50E+38, z = 1.0
  (x+y)+z: x+y = 0.00E+00, then 0.00E+00 + 1.0 = 1.00E+00
  x+(y+z): y+z = 1.50E+38, then -1.50E+38 + 1.50E+38 = 0.00E+00
Need to validate parallel programs under varying degrees of parallelism
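The same failure can be reproduced directly. Python floats are double precision rather than single, but 1.0 is still far below the spacing between representable values near 1.5E+38, so the effect is identical.

```python
x, y, z = -1.5e38, 1.5e38, 1.0

left = (x + y) + z    # x + y cancels exactly, then z survives
right = x + (y + z)   # z is absorbed into y before the cancellation

print(left, right)    # 1.0 0.0
```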
Pipelined datapath
Review: MIPS (RISC) Design Principles
Simplicity favors regularity
  fixed-size instructions
  small number of instruction formats
  opcode always the first 6 bits
Smaller is faster
  limited instruction set
  limited number of registers in register file
  limited number of addressing modes
Make the common case fast
  arithmetic operands from the register file (load-store machine)
  allow instructions to contain immediate operands
Good design demands good compromises
  three instruction formats
Pipelining Analogy
Pipelined laundry: overlapping execution
  Parallelism improves performance
Four loads: Speedup = $8 / 3.5 \approx 2.3$
Non-stop: Speedup = $2n / (0.5n + 1.5) \approx 4$ = number of stages
4.5 An Overview of Pipelining (Chapter 4 The Processor)
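The two speedup figures follow from a simple timing model. The sketch assumes four laundry stages of 0.5 hours each, which is what the terms 2n and 0.5n + 1.5 imply. The names are this example's own.

```python
STAGES = 4          # assumed number of laundry stages
STAGE_TIME = 0.5    # assumed hours per stage

def sequential_time(n):
    """Each load finishes all stages before the next one starts: 2n hours."""
    return n * STAGES * STAGE_TIME

def pipelined_time(n):
    """First load takes all stages, each later load adds one stage time: 0.5n + 1.5 hours."""
    return (STAGES + n - 1) * STAGE_TIME

def speedup(n):
    return sequential_time(n) / pipelined_time(n)

print(round(speedup(4), 2))          # 2.29, the slide's 8/3.5
print(round(speedup(1_000_000), 2))  # 4.0, approaching the number of stages
```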
The Five Stages of Load Instruction
       Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
lw     IFetch   Dec      Exec     Mem      WB
IFetch: Instruction Fetch and Update PC
Dec: Registers Fetch and Instruction Decode
Exec: Execute R-type; calculate memory address
Mem: Read/write the data from/to the Data Memory
WB: Write the result data into the register file
A Pipelined MIPS Processor
Start the next instruction before the current one has completed
  improves throughput: total amount of work done in a given time
  instruction latency (execution time, delay time, response time: time from the start of an instruction to its completion) is not reduced

         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
lw       IFetch   Dec      Exec     Mem      WB
sw                IFetch   Dec      Exec     Mem      WB
R-type                     IFetch   Dec      Exec     Mem      WB

clock cycle (pipeline stage time) is limited by the slowest stage
  some stages don't need the whole clock cycle (e.g., WB)
  for some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction)
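The staggered diagram can be generated mechanically, which is handy for checking by hand which stage an instruction occupies in a given cycle. This is a sketch for an ideal pipeline with no stalls, and `pipeline_diagram` is a made-up helper name.

```python
STAGES = ["IFetch", "Dec", "Exec", "Mem", "WB"]

def pipeline_diagram(instructions):
    """Return (name, row) pairs; row[c] is the stage occupied in cycle c + 1, or ""."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage      # instruction i starts one cycle after instruction i - 1
        rows.append((name, row))
    return rows

for name, row in pipeline_diagram(["lw", "sw", "R-type"]):
    print(f"{name:8s}" + "".join(f"{stage:8s}" for stage in row))
```

Three instructions finish in 3 + 5 - 1 = 7 cycles, even though each one still takes five cycles from fetch to writeback.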
Pipeline Performance
Single-cycle ($T_c$ = 800 ps)
Pipelined ($T_c$ = 200 ps)
[Figure: instruction timing for the single-cycle and pipelined datapaths]
Pipeline Speedup
If all stages are balanced (i.e., all take the same time):
  $\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
If not balanced, speedup is less
Speedup due to increased throughput
  Latency (time for each instruction) does not decrease
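Plugging in the clock periods from the Pipeline Performance slide (800 ps single-cycle, 200 ps pipelined) shows the unbalanced case. The sketch assumes a five-stage pipeline with no stalls, and the function names are illustrative only.

```python
SINGLE_CYCLE_TC = 800   # ps, from the Pipeline Performance slide
PIPELINED_TC = 200      # ps, set by the slowest stage
STAGES = 5              # assumed five-stage pipeline

def single_cycle_time(n):
    return n * SINGLE_CYCLE_TC

def pipelined_time(n):
    # The first instruction needs STAGES cycles; each later one completes a cycle after the last.
    return (STAGES + n - 1) * PIPELINED_TC

def speedup(n):
    return single_cycle_time(n) / pipelined_time(n)

print(single_cycle_time(3), pipelined_time(3))   # 2400 1400
print(round(speedup(1_000_000), 3))              # 4.0
```

The long-run speedup tends to 800/200 = 4 rather than 5 because the stages are not balanced. The latency of one instruction is 5 x 200 = 1000 ps, which is no better than the 800 ps single-cycle latency.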
Single Cycle vs. Multicycle vs. Pipelined
[Figure: clock timing for Instr 1 through Instr 4. Single-cycle: time needed vs. time allotted per instruction. Multicycle: the instructions take 3, 5, 3, and 4 cycles, with the time saved marked.]
[Figure: (a) task-time diagram (instruction vs. cycle) and (b) space-time diagram (pipeline stage vs. cycle), with the start-up and drainage regions marked. Legend: f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback.]
MIPS Pipeline
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

       Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
lw     IFetch   Dec      Exec     Mem      WB
Pipelining and ISA Design
MIPS ISA designed for pipelining
All instructions are 32 bits
  Easier to fetch and decode in one cycle
  cf. x86: 1- to 17-byte instructions
Few and regular instruction formats
  Can decode and read registers in one step
Load/store addressing
  Can calculate address in 3rd stage, access memory in 4th stage
Alignment of memory operands
  Memory access takes only one cycle
Graphically Representing MIPS Pipeline
Can help with answering questions like:
  How many cycles does it take to execute this code?
  What is the ALU doing during cycle 4?
  Is there a hazard, why does it occur, and how can it be fixed?
Why Pipeline? For Performance!
[Figure: pipeline diagram of Inst 0 through Inst 4 in instruction order against time (clock cycles), with the time to fill the pipeline marked]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
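The fill time is why the measured CPI only approaches 1. A small sketch, assuming a five-stage pipeline with no stalls (`effective_cpi` is a name invented here):

```python
STAGES = 5  # assumed five-stage pipeline

def total_cycles(n):
    """STAGES - 1 cycles to fill the pipeline, then one completion per cycle."""
    return (STAGES - 1) + n

def effective_cpi(n):
    return total_cycles(n) / n

print(effective_cpi(5))      # 1.8
print(effective_cpi(1000))   # 1.004
```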
Hazards
Situations that prevent starting the next instruction in the next cycle
Structure hazards
  A required resource is busy
Data hazard
  Need to wait for previous instruction to complete its data read/write
Control hazard
  Deciding on control action depends on previous instruction
Structure Hazards
Conflict for use of a resource
In MIPS pipeline with a single memory
  Load/store requires data access
  Instruction fetch would have to stall for that cycle
    Would cause a pipeline "bubble"
Hence, pipelined datapaths require separate instruction/data memories
  Or separate instruction/data caches
A Single Memory Would Be a Structural Hazard
[Figure: pipeline diagram of lw and Inst 1 through Inst 4 in instruction order against time (clock cycles); in one cycle, lw is reading data from memory while a later instruction is being read from memory]
Fix with separate instr and data memories (I$ and D$)
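The conflicting cycle can be found by checking, for each load or store, which instruction is being fetched while it is in its memory stage. The sketch assumes a five-stage pipeline in which instruction i (counting from 0) is fetched in cycle i + 1 and reaches MEM three cycles later. `memory_conflicts` is a hypothetical helper.

```python
MEM_OFFSET = 3  # MEM is the 4th stage, three cycles after IF

def memory_conflicts(program):
    """Return (cycle, mem_index, fetch_index) for each single-memory conflict."""
    conflicts = []
    for i, op in enumerate(program):
        if op not in ("lw", "sw"):
            continue                      # only loads and stores access data memory
        mem_cycle = (i + 1) + MEM_OFFSET  # cycle in which instruction i is in MEM
        fetch_index = mem_cycle - 1       # instruction being fetched in that same cycle
        if fetch_index < len(program):
            conflicts.append((mem_cycle, i, fetch_index))
    return conflicts

program = ["lw", "inst1", "inst2", "inst3", "inst4"]
print(memory_conflicts(program))   # [(4, 0, 3)]
```

With one shared memory, the lw data access in cycle 4 collides with the fetch of the instruction three slots behind it. Separate instruction and data memories remove the conflict.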
Data Hazards
An instruction depends on completion of data access by a previous instruction
  add $s0, $t0, $t1
  sub $t2, $s0, $t3
Register Usage Can Cause Data Hazards
Dependencies backward in time cause hazards
Instructions, in program order:
  add $1,
  sub $4,$1,$5
  and $6,$1,$7
  or  $8,$1,$9
  xor $4,$1,$5
Read before write data hazard
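Which of these reads actually arrive too early depends on the pipeline. The sketch below assumes a five-stage pipeline without forwarding and a register file that writes in the first half of a cycle and reads in the second half. Under those assumptions only the two instructions immediately after the producer are affected. The source operands of `add` are elided on the slide, so the `$2`, `$3` used here are placeholders.

```python
def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for reads that come too early.

    program: list of (opcode, dest, sources). `window` is how many following
    instructions read the register file before the producer's writeback.
    """
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if dest in program[j][2]:
                hazards.append((i, j, dest))
    return hazards

program = [
    ("add", "$1", ["$2", "$3"]),   # placeholder sources; elided on the slide
    ("sub", "$4", ["$1", "$5"]),
    ("and", "$6", ["$1", "$7"]),
    ("or",  "$8", ["$1", "$9"]),
    ("xor", "$4", ["$1", "$5"]),
]
print(raw_hazards(program))   # [(0, 1, '$1'), (0, 2, '$1')]
```

This simple check ignores cases such as a register being overwritten between producer and consumer.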
Loads Can Cause Data Hazards
Dependencies backward in time cause hazards
Instructions, in program order:
  lw  $1,4($2)
  sub $4,$1,$5
  and $6,$1,$7
  or  $8,$1,$9
  xor $4,$1,$5
Load-use data hazard
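If the only remedy is to stall, the cost of the load-use dependence can be counted with a small in-order model. It assumes a five-stage pipeline with no forwarding, where a value written in WB can be read by an instruction in ID during the same cycle (write in the first half, read in the second half). `id_cycles` is a hypothetical helper.

```python
def id_cycles(program):
    """Return the cycle in which each instruction occupies the ID stage."""
    ready = {}     # register -> first cycle in which ID can read the new value
    cycles = []
    for _, dest, srcs in program:
        earliest = cycles[-1] + 1 if cycles else 2   # in order, one ID per cycle
        for reg in srcs:
            earliest = max(earliest, ready.get(reg, 0))
        cycles.append(earliest)
        if dest is not None:
            ready[dest] = earliest + 3               # WB is three cycles after ID
    return cycles

program = [
    ("lw",  "$1", ["$2"]),
    ("sub", "$4", ["$1", "$5"]),
    ("and", "$6", ["$1", "$7"]),
    ("or",  "$8", ["$1", "$9"]),
    ("xor", "$4", ["$1", "$5"]),
]
cycles = id_cycles(program)
stalls = cycles[-1] - (len(program) + 1)   # delay relative to a stall-free pipeline
print(cycles, stalls)   # [2, 5, 6, 7, 8] 2
```

Delaying `sub` by two cycles also pushes back every later instruction, so no further stalls are needed for `and`, `or`, or `xor`.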
How About Register File Access?
[Figure: pipeline diagram against time (clock cycles) for add $1, / Inst 1 / Inst 2 / add $2,$1, in instruction order, marking the clock edge that controls register writing and the clock edge that controls loading of pipeline state registers]
Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half