CPU Organization Datapath Design:

Size: px

Start display at page:

Download "CPU Organization Datapath Design:"

Anna Rice
5 years ago
Views:

1 CPU Organization Datapath Design: Capabilities & performance characteristics of principal Functional Units (FUs): (e.g., Registers, ALU, Shifters, Logic Units,...) Ways in which these components are interconnected (buses connections, multiplexors, etc.). How information flows between components. Control Unit Design: Logic and means by which such information flow is controlled. Control and coordination of FUs operation to realize the targeted Instruction Set Architecture to be implemented (can either be implemented using a finite state machine or a microprogram). Hardware description with a suitable language, possibly using Register Transfer Notation (RTN). #1 Final Review Summer

2 CPU Execution Time: The CPU Equation A program is comprised of a number of instructions Measured in: instructions/program The average instruction takes a number of cycles per instruction (CPI) to be completed. Measured in: cycles/instruction CPU has a fixed clock cycle time = 1/clock rate Measured in: seconds/cycle CPU execution time is the product of the above three parameters as follows: CPU CPU time time = Seconds = Instructions x Cycles Cycles x Seconds Program Program Instruction Cycle Cycle #2 Final Review Summer

3 Instruction Types & CPI: An Example An instruction set has three instruction classes: Instruction class CPI A 1 B 2 C 3 Two code sequences have the following instruction counts: Instruction counts for instruction class Code Sequence A B C CPU cycles for sequence 1 = 2 x x x 3 = 10 cycles CPI for sequence 1 = clock cycles / instruction count = 10 /5 = 2 CPU cycles for sequence 2 = 4 x x x 3 = 9 cycles CPI for sequence 2 = 9 / 6 = 1.5 #3 Final Review Summer

4 Instruction Frequency & CPI Given a program with n types or classes of instructions with the following characteristics: C i = Count of instructions of type i CPI i = Average cycles per instruction of type i F i = Frequency of instruction type i = C i / total instruction count Then: CPI = n ( ) i= 1 CPI i F i #4 Final Review Summer

5 Instruction Type Frequency & CPI: A RISC Example Base Machine (Reg / Reg) Op Freq Cycles CPI(i) % Time ALU 50% % Load 20% % Store 10% % Branch 20% % CPI = Typical Mix n ( ) i = 1 CPI i F i CPI =.5 x x x x 2 = 2.2 #5 Final Review Summer

6 Performance Enhancement Calculations: Amdahl's Law The performance enhancement possible due to a given design improvement is limited by the amount that the improved feature is used Amdahl s Law: Performance improvement or speedup due to enhancement E: Execution Time without E Performance with E Speedup(E) = = Execution Time with E Performance without E Suppose that enhancement E accelerates a fraction F of the execution time by a factor S and the remainder of the time is unaffected then: Execution Time with E = ((1-F) + F/S) X Execution Time without E Hence speedup is given by: Execution Time without E 1 Speedup(E) = = ((1 - F) + F/S) X Execution Time without E (1 - F) + F/S Note: All fractions here refer to original execution time. #6 Final Review Summer

7 Pictorial Depiction of Amdahl s Law Enhancement E accelerates fraction F of execution time by a factor of S Before: Execution Time without enhancement E: Unaffected, fraction: (1- F) Affected fraction: F Unchanged Unaffected, fraction: (1- F) F/S After: Execution Time with enhancement E: Execution Time without enhancement E 1 Speedup(E) = = Execution Time with enhancement E (1 - F) + F/S #7 Final Review Summer

8 Amdahl's Law With Multiple Enhancements: Example Three CPU performance enhancements are proposed with the following speedups and percentage of the code execution time affected: Speedup 1 = S 1 = 10 Percentage 1 = F 1 = 20% Speedup 2 = S 2 = 15 Percentage 1 = F 2 = 15% Speedup 3 = S 3 = 30 Percentage 1 = F 3 = 10% While all three enhancements are in place in the new design, each enhancement affects a different portion of the code and only one enhancement can be used at a time. What is the resulting overall speedup? Speedup = ( ( 1 ) i F 1 i + i F S i i ) Speedup = 1 / [( ) +.2/ /15 +.1/30)] = 1 / [ ] = 1 /.5833 = 1.71 #8 Final Review Summer

9 A Single Cycle MIPS Datapath Inst Memory Adr Rs <21:25> Rt <16:20> Rd <11:15> <0:15> Imm16 Instruction<31:0> imm16 4 PC Ext Adder Adder npc_sel PC Mux Clk 00 RegDst RegWr busw 32 Clk 1 imm16 Rd 0 Rt 5 5 Rs 5 Rw Ra Rb bit Registers 16 Rt busa busb 32 Extender Equal Mux 1 ALUctr = ALU MemWr WrEn Adr Data In Data Memory Clk MemtoReg 0 Mux 1 ExtOp ALUSrc #9 Final Review Summer

10 Partitioning The Single Cycle Datapath Add registers between smallest steps Next PC PC Instruction Fetch npc_sel Operand Fetch ExtOp ALUSrc ALUctr MemRd MemWr RegDst RegWr MemWr Exec Mem Access Reg. File Data Mem Result Store #10 Final Review Summer

11 npc_sel Next PC PC Example Multi-cycle Datapath Instruction Fetch IR Reg File Operand Fetch A B ExtOp ALUSrc ALUctr Ext ALU R MemRd MemWr Mem Access M MemToReg Data Mem RegDst RegWr Reg. File Registers added: IR: Instruction register A, B: Two registers to hold operands read from register file. R: or ALUOut, holds the output of the ALU M: or Memory data register (MDR) to hold data read from data memory Result Store Equal #11 Final Review Summer

12 Operations In Each Cycle R-Type Logic Immediate Load Store Branch Instruction Fetch IR Mem[PC] IR Mem[PC] IR Mem[PC] IR Mem[PC] IR Mem[PC] Instruction Decode A R[rs] B R[rt] A R[rs] A R[rs] A R[rs] B R[rt] A R[rs] B R[rt] If Equal = 1 Execution R A + B R A OR ZeroExt[imm16] R A + SignEx(Im16) R A + SignEx(Im16) PC PC (SignExt(imm16) x4) else PC PC + 4 Memory M Mem[R] Mem[R] B PC PC + 4 Write Back R[rd] R PC PC + 4 R[rt] R PC PC + 4 R[rd] M PC PC + 4 #12 Final Review Summer

13 Mapping RTNs To Control Points Examples & State Assignments IR MEM[PC] instruction fetch 0000 imem_rd, IRen Execute ALUfun, Sen R-type R A fun B 0100 Aen, Ben ORi R A or ZX 0110 A R[rs] B R[rt] 0001 LW R A + SX 1000 SW decode / operand fetch R A + SX 1011 BEQ & ~Equal PC PC BEQ & Equal PC PC + SX Memory RegDst, RegWr, PCen M MEM[S] 1001 MEM[S] B PC PC To instruction fetch state 0000 R[rd] R PC PC R[rt] R PC PC R[rt] M PC PC Write-back To instruction fetch state 0000 To instruction fetch state 0000 #13 Final Review Summer

14 Alternative Multiple Cycle Datapath (In Textbook) Shared instruction/data memory unit A single ALU shared among instructions Shared units require additional or widened multiplexors Temporary registers to hold data between clock cycles of the instruction: Additional registers: Instruction Register (IR), Memory Data Register (MDR), A, B, ALUOut #14 Final Review Summer

15 Operations In Each Cycle R-Type Logic Immediate Load Store Branch Instruction Fetch IR Mem[PC] PC PC + 4 IR Mem[PC] PC PC + 4 IR Mem[PC] PC PC + 4 IR Mem[PC] PC PC + 4 IR Mem[PC] PC PC + 4 Instruction Decode A R[rs] B R[rt] ALUout PC + (SignExt(imm16) x4) A R[rs] B R[rt] ALUout PC + (SignExt(imm16) x4) A R[rs] B R[rt] ALUout PC + (SignExt(imm16) x4) A R[rs] B R[rt] ALUout PC + (SignExt(imm16) x4) A B R[rs] R[rt] ALUout PC + (SignExt(imm16) x4) Execution ALUout A + B ALUout A OR ZeroExt[imm16] ALUout A + SignEx(Im16) ALUout A + SignEx(Im16) If Equal = 1 PC ALUout Memory M Mem[ALUout] Mem[ALUout] B Write Back R[rd] ALUout R[rt] ALUout R[rd] Mem #15 Final Review Summer

16 Finite State Machine (FSM) Specification IR MEM[PC] PC PC instruction fetch A R[rs] B R[rt] ALUout PC +SX 0001 decode Execute R-type ALUout A fun B 0100 ORi ALUout A op ZX 0110 LW ALUout A + SX 1000 SW ALUout A + SX 1011 BEQ If A = B then PC ALUout 0010 Memory R[rd] ALUout 0101 To instruction fetch R[rt] ALUout 0111 M MEM[ALUout] 1001 R[rt] M 1010 To instruction fetch MEM[ALUout] B 1100 To instruction fetch Write-back #16 Final Review Summer

17 MIPS Multi-cycle Datapath Performance Evaluation What is the average CPI? State diagram gives CPI for each instruction type Workload below gives frequency of each type Type CPI i for type Frequency CPI i x freqi i Arith/Logic 4 40% 1.6 Load 5 30% 1.5 Store 4 10% 0.4 branch 3 20% 0.6 Average CPI: 4.1 Better than CPI = 5 if all instructions took the same number of clock cycles (5). #17 Final Review Summer

18 Microprogrammed Control Unit Microprogram Storage ROM/PLA Outputs Control Signal Fields Multicycle Datapath 1 Adder Microinstruction Address Inputs State Reg Sequencing Control Field Types of branching Set state to 0 (fetch) Dispatch i (state 1) Use incremented address (seq) state number 2 Address Select Logic Microprogram Counter, MicroPC Opcode #18 Final Review Summer

19 Microprogram for The Control Unit Label ALU SRC1 SRC2 Dest. Memory Mem. Reg. PC Write Sequencing Fetch: Add PC 4 Read PC IR ALU Seq Add PC Extshft Dispatch Lw: Add rs Extend Seq Read ALU Seq rt MEM Fetch Sw: Add rs Extend Seq Write ALU Fetch Rtype: Func rs rt Seq rd ALU Fetch Beq: Subt. rs rt ALUoutCond. Fetch Ori: Or rs Extend0 Seq rt ALU Fetch #19 Final Review Summer

20 MIPS Integer ALU Requirements (1) Functional Specification: inputs: 2 x 32-bit operands A, B, 4-bit mode outputs: 32-bit result S, 1-bit carry, 1 bit overflow, 1 bit zero operations: add, addu, sub, subu, and, or, xor, nor, slt, sltu (2) Block Diagram: c A zero ovf ALU S 32 B m 10 operations thus 4 control bits 4 00 add 01 addu 02 sub 03 subu 04 and 05 or 06 xor 07 nor 12 slt 13 sltu #20 Final Review Summer

21 Building Block: 1-bit ALU Performs: AND, OR, addition on A, B or A, B inverted A invertb CarryIn and Operation or Mux Result B 1-bit Full Adder add CarryOut #21 Final Review Summer

22 32-Bit ALU Using 32 1-Bit ALUs A0 B0 A31 B31 CarryIn0 CarryIn1 A1 B1 CarryIn2 A2 B2 CarryIn3 CarryIn31 1-bit ALU 1-bit ALU 1-bit ALU : : 1-bit ALU CarryOut0 CarryOut1 CarryOut30 Result0 Result1 Result2 Result31 32-bit rippled-carry adder (operation/invertb lines not shown) Addition/Subtraction Performance: Assume gate delay = T Total delay = 32 x (1-Bit ALU Delay) = 32 x 2 x gate delay = 64 x gate delay = 64 T C CarryOut31 #22 Final Review Summer

23 MIPS ALU With SLT Support Added CarryIn0 Less A0 B0 A1 B1 Less = 0 A2 B2 Less = 0 1-bit Result0 ALU CarryIn1 CarryOut0 1-bit Result1 ALU CarryIn2 CarryOut1 CarryIn3 CarryIn31 1-bit ALU A31 B31 1-bit Less = 0 ALU CarryOut31 C : : CarryOut30 Result2 : : : : Result31 Zero Overflow #23 Final Review Summer

24 Improving ALU Performance: Carry Look Ahead (CLA) A0 B1 A B A B A B G P G P G P G P Cin S S S S C1 =G0 + C0 P0 C2 = G1 + G0 P1 + C0 P0 P1 A B C-out kill 0 1 C-in propagate 1 0 C-in propagate generate G = A and B P = A xor B C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2 G P #24 Final Review Summer

25 C L A 4-bit Adder 4-bit Adder 4-bit Adder Cascaded Carry Look-ahead C0 G0 P0 C1 =G0 + C0 P0 C2 = G1 + G0 P1 + C0 P0 P1 16-Bit Example Delay = = 5 gate delays = 5T { C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2 G P Assuming all gates have equal delay T C4 =... #25 Final Review Summer

26 Unsigned Multiplication Example Paper and pencil example (unsigned): Multiplicand 1000 Multiplier Product m bits x n bits = m + n bit product, m = 32, n = 32, 64 bit product. The binary number system simplifies multiplication: 0 => place 0 ( 0 x multiplicand). 1 => place a copy ( 1 x multiplicand). We will examine 4 versions of multiplication hardware & algorithm: Successive refinement of design. #26 Final Review Summer

27 Operation of Combinational Multiplier A 3 A 2 A 1 A 0 B 0 A 3 A 2 A 1 A 0 B 1 A 3 A 2 A 1 A 0 B 2 A 3 A 2 A 1 A 0 B 3 P 7 P 6 P 5 P 4 P 3 P 2 P 1 P 0 At each stage shift A left ( x 2). Use next bit of B to determine whether to add in shifted multiplicand. Accumulate 2n bit partial product at each stage. #27 Final Review Summer

28 MULTIPLY HARDWARE Version 3 Combine Multiplier register and Product register: 32-bit Multiplicand register. 32 -bit ALU. 64-bit Product register, (0-bit Multiplier register). Multiplicand 32 bits 32-bit ALU Product (Multiplier) 64 bits Shift Right Write Control #28 Final Review Summer

29 Multiply Algorithm Version 3 Product0 = 1 Start 1. Test Product0 Product0 = 0 1a. Add multiplicand to the left half of product & place the result in the left half of Product register 2. Shift the Product register right 1 bit. 32nd repetition? No: < 32 repetitions Yes: 32 repetitions Done #29 Final Review Summer

30 Combinational Shifter from MUXes Basic Building Block sel A B bit right shifter D A 7 A 6 A 5 A 4 A 3 A 2 A 1 A 0 S 2 S 1 S R 7 R 6 R 5 R 4 R 3 R 2 R 1 R 0 What comes in the MSBs? How many levels for 32-bit shifter? #30 Final Review Summer

31 Division 1001 Quotient Divisor Dividend Remainder (or Modulo result) See how big a number can be subtracted, creating quotient bit on each step: Binary => 1 * divisor or 0 * divisor Dividend = Quotient x Divisor + Remainder => Dividend = Quotient + Divisor 3 versions of divide, successive refinement #31 Final Review Summer

32 DIVIDE HARDWARE Version 3 32-bit Divisor register. 32 -bit ALU. 64-bit Remainder regegister (0-bit Quotient register). Divisor 32 bits 32-bit ALU HI LO Remainder (Quotient) 64 bits Shift Left Write Control #32 Final Review Summer

33 Divide Algorithm Version 3 Start: Place Dividend in Remainder 1. Shift the Remainder register left 1 bit. 2. Subtract the Divisor register from the left half of the Remainder register, & place the result in the left half of the Remainder register. Remainder >= 0 Test Remainder Remainder < 0 3a. Shift the Remainder register to the left setting the new rightmost bit to 1. 3b. Restore the original value by adding the Divisor register to the left half of the Remainderregister, &place the sum in the left half of the Remainder register. Also shift the Remainder register to the left, setting the new least significant bit to 0. nth repetition? No: < n repetitions Yes: n repetitions (n = 4 here) Done. Shift left half of Remainder right 1 bit. #33 Final Review Summer

34 Representation of Floating Point Numbers in Single Precision IEEE 754 Standard Value = N = (-1) S X 2 E-127 X (1.M) 0 < E < 255 Actual exponent is: e = E sign S E M exponent: excess 127 binary integer added mantissa: sign + magnitude, normalized binary significand with a hidden integer bit: 1.M Example: 0 = = Magnitude of numbers that can be represented is in the range: (1.0) to ( ) Which is approximately: 1.8 x to 3.40 x #34 Final Review Summer

35 Floating Point Conversion Example The decimal number is to be represented in the IEEE bit single precision format: = (converted to binary) Hidden = x 2 11 (normalized binary) The mantissa is negative so the sign S is given by: S = 1 The biased exponent E is given by E = e E = = = Fractional part of mantissa M: M = (in 23 bits) The IEEE 754 single precision representation is given by: S E M 1 bit 8 bits 23 bits #35 Final Review Summer

36 (1) Start Compare the exponents of the two numbers shift the smaller number to the right until its exponent matches the larger exponent Floating Point Addition Flowchart (2) Add the significands (mantissas) (3) Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent No Still normalized? Done yes (4) Overflow or Underflow? Generate exception or return error Round the significand to the appropriate number of bits If mantissa = 0, set exponent to 0 (5) No Yes #36 Final Review Summer

37 Floating Point Addition Hardware #37 Final Review Summer

38 Floating Point Multiplication Flowchart (1) Start Is one/both operands =0? Set the result to zero: exponent = 0 (2) (3) (4) (5) Compute exponent: biased exp.(x) + biased exp.(y) - bias Compute sign of result: Xs XOR Ys Multiply the mantissas Normalize mantissa if needed Generate exception or return error Yes Overflow or Underflow? (6) No Still Normalized? Yes No Round or truncate the result mantissa Done (7) #38 Final Review Summer

39 Rounding Digits Normalized result, but some non-zero digits to the right of the significand --> the number should be rounded E.g., B = 10, p = 3: = * 10 2-bias 2-bias = * 10 2-bias = * 10 One round digit must be carried to the right of the guard digit so that after a normalizing left shift, the result can be rounded, according to the value of the round digit. IEEE Standard: four rounding modes: round to nearest (default) round towards plus infinity round towards minus infinity round towards 0 round to nearest: round digit < B/2 then truncate > B/2 then round up (add 1 to ULP: unit in last place) = B/2 then round to nearest even digit it can be shown that this strategy minimizes the mean error introduced by rounding. #39 Final Review Summer

40 Sticky Bit Additional bit to the right of the round digit to better fine tune rounding d0. d1 d2 d3... dp X... X X X S X X S d0. d1 d2 d3... dp X... X X X 0 Rounding Summary: Sticky bit: set to 1 if any 1 bits fall off the end of the round digit d0. d1 d2 d3... dp X... X X X 1 generates a borrow Radix 2 minimizes wobble in precision. Normal operations in +,-,*,/ require one carry/borrow bit + one guard digit. One round digit needed for correct rounding. Sticky bit needed when round digit is B/2 for max accuracy. Rounding to nearest has mean error = 0 if uniform distribution of digits are assumed. #40 Final Review Summer

41 Pipelining: Design Goals The length of the machine clock cycle is determined by the time required for the slowest pipeline stage. An important pipeline design consideration is to balance the length of each pipeline stage. If all stages are perfectly balanced, then the time per instruction on a pipelined machine (assuming ideal conditions with no stalls): Time per instruction on unpipelined machine Number of pipe stages Under these ideal conditions: Speedup from pipelining = the number of pipeline stages = k One instruction is completed every cycle: CPI = 1. #41 Final Review Summer

42 Pipelined Instruction Processing Representation Clock cycle Number Time in clock cycles Instruction Number Instruction I IF ID EX MEM WB Instruction I+1 IF ID EX MEM WB Instruction I+2 IF ID EX MEM WB Instruction I+3 IF ID EX MEM WB Instruction I +4 IF ID EX MEM WB Time to fill the pipeline Pipeline Stages: IF = Instruction Fetch ID = Instruction Decode EX = Execution MEM = Memory Access WB = Write Back First instruction, I Completed Last instruction, I+4 completed #42 Final Review Summer

43 Single Cycle, Multi-Cycle, Pipeline: Performance Comparison Example For 1000 instructions, execution time: Single Cycle Machine: 8 ns/cycle x 1 CPI x 1000 inst = 8000 ns Multicycle Machine: 2 ns/cycle x 4.6 CPI (due to inst mix) x 1000 inst = 9200 ns Ideal pipelined machine, 5-stages: 2 ns/cycle x (1 CPI x 1000 inst + 4 cycle fill) = 2008 ns #43 Final Review Summer

44 Clk Single Cycle, Multi-Cycle, Vs. Pipeline Single Cycle Implementation: Cycle 1 Cycle 2 8 ns Load Store Waste 2ns Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk Multiple Cycle Implementation: Load IF ID EX MEM WB Store IF ID EX MEM R-type IF Pipeline Implementation: Load IF ID EX MEM WB Store IF ID EX MEM WB R-type IF ID EX MEM WB #44 Final Review Summer

45 MIPS: A Pipelined Datapath 0 M u x 1 IF/ID ID/EX EX/MEM MEM/WB Add 4 Shift left 2 Add result Add PC Address Instruction memory Instruction Read register 1 Read data 1 Read register 2 Registers Read data 2 Write register Write data 0 M ux 1 Zero ALU ALU result Address Write data Data memory Read data 1 M ux 0 16 Sign extend 32 IF ID EX MEM WB Instruction Fetch Instruction Decode Execution Memory Write Back #45 Final Review Summer

46 Pipeline Control Pass needed control signals along from one stage to the next as the instruction travels through the pipeline just like the data Execution/Address Calculation stage control lines Memory access stage control lines Write-back stage control lines Instruction Reg Dst ALU Op1 ALU Op0 ALU Src Branch Mem Read Mem Write Reg write Mem to Reg R-format lw sw X X beq X X WB Instruction Control M WB EX M WB IF/ID ID/EX EX/MEM MEM/WB #46 Final Review Summer

47 Basic Performance Issues In Pipelining Pipelining increases the CPU instruction throughput: The number of instructions completed per unit time. Under ideal condition instruction throughput is one instruction per machine cycle, or CPI = 1 Pipelining does not reduce the execution time of an individual instruction: The time needed to complete all processing steps of an instruction (also called instruction completion latency). It usually slightly increases the execution time of each instruction over unpipelined implementations due to the increased control overhead of the pipeline and pipeline stage registers delays. #47 Final Review Summer

48 Pipelining Performance Example Example: For an unpipelined machine: Clock cycle = 10ns, 4 cycles for ALU operations and branches and 5 cycles for memory operations with instruction frequencies of 40%, 20% and 40%, respectively. If pipelining adds 1ns to the machine clock cycle then the speedup in instruction execution from pipelining is: Non-pipelined Average instruction execution time = Clock cycle x Average CPI = 10 ns x ((40% + 20%) x %x 5) = 10 ns x 4.4 = 44 ns In the pipelined five implementation five stages are used with an average instruction execution time of: 10 ns + 1 ns = 11 ns Speedup from pipelining = Instruction time unpipelined Instruction time pipelined = 44 ns / 11 ns = 4 times #48 Final Review Summer

49 Pipeline Hazards Hazards are situations in pipelining which prevent the next instruction in the instruction stream from executing during the designated clock cycle resulting in one or more stall cycles. Hazards reduce the ideal speedup gained from pipelining and are classified into three classes: Structural hazards: Arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions. Data hazards: Arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline. Control hazards: Arise from the pipelining of conditional branches and other instructions that change the PC. #49 Final Review Summer

50 Structural hazard Example: Single Memory For Instructions & Data Time (clock cycles) I n s t r. O r d e r Load Instr 1 Instr 2 Instr 3 Instr 4 ALU Mem Reg Mem Reg Mem Reg Mem Reg ALU Mem ALU Mem Reg Mem Reg ALU Reg Mem Reg ALU Mem Reg Mem Reg Detection is easy in this case (right half highlight means read, left half write) #50 Final Review Summer

51 Data Hazards Example Problem with starting next instruction before first is finished Data dependencies here that go backward in time create data hazards. sub $2, $1, $3 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) Time (in clock cycles) Value of register $2: Program execution order (in instructions) sub $2, $1, $3 CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 IM Reg CC 7 CC 8 CC / DM Reg and $12, $2, $5 IM Reg DM Reg or $13, $6, $2 IM Reg DM Reg add $14, $2, $2 IM Reg DM Reg sw $15, 100($2) IM Reg DM Reg #51 Final Review Summer

52 Data Hazard Resolution: Stall Cycles Stall the pipeline by a number of cycles. The control unit must detect the need to insert stall cycles. In this case two stall cycles are needed. Time (in clock cycles) Value of register $2: Program execution order (in instructions) sub $2, $1, $3 CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 IM Reg CC 7 CC / DM Reg CC 9 20 CC CC and $12, $2, $5 IM STALL STALL Reg DM Reg or $13, $6, $2 STALL STALL IM Reg DM Reg add $14, $2, $2 IM Reg DM Reg sw $15, 100($2) IM Reg DM Reg #52 Final Review Summer

53 Performance of Pipelines with Stalls Hazards in pipelines may make it necessary to stall the pipeline by one or more cycles and thus degrading performance from the ideal CPI of 1. CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction If pipelining overhead is ignored and we assume that the stages are perfectly balanced then: Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction) When all instructions take the same number of cycles and is equal to the number of pipeline stages then: Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction) #53 Final Review Summer

54 Data Hazard Resolution: Forwarding Register file forwarding to handle read/write to same register ALU forwarding #54 Final Review Summer

55 Data Hazard Example With Forwarding Value of register $2 : Value of EX/MEM : Value of MEM/WB : Time (in clock cycles) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC / X X X 20 X X X X X X X X X 20 X X X X Program execution order (in instructions) sub $2, $1, $3 IM Reg DM Reg and $12, $2, $5 IM Reg DM Reg or $13, $6, $2 IM Reg DM Reg add $14, $2, $2 IM Reg DM Reg sw $15, 100($2) IM Reg DM Reg #55 Final Review Summer

56 A Data Hazard Requiring A Stall A load followed by an R-type instruction that uses the loaded value Program execution order (in instructions) lw $2, 20($1) Time (in clock cycles) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 IM Reg DM Reg CC 7 CC 8 CC 9 and $4, $2, $5 IM Reg DM Reg or $8, $2, $6 IM Reg DM Reg add $9, $4, $2 IM Reg DM Reg slt $1, $6, $7 IM Reg DM Reg Even with forwarding in place a stall cycle is needed This condition must be detected by hardware #56 Final Review Summer

57 Compiler Scheduling Example Reorder the instructions to avoid as many pipeline stalls as possible: lw $15, 0($2) lw $16, 4($2) add $14, $5, $16 sw $16, 4($2) The data hazard occurs on register $16 between the second lw and the add resulting in a stall cycle With forwarding we need to find only one independent instructions to place between them, swapping the lw instructions works: lw $16, 4($2) lw $15, 0($2) add $14, $5, $16 sw $16, 4($2) Without forwarding we need two independent instructions to place between them, so in addition a nop is added. lw $16, 4($2) lw $15, 0($2) nop add $14, $5, $16 sw $16, 4($2) #57 Final Review Summer

58 Control Hazards: Example Three other instructions are in the pipeline before branch instruction target decision is made when BEQ is in MEM stage. Program execution order (in instructions) Time (in clock cycles) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 40 beq $1, $3, 7 IM Reg DM Reg 44 and $12, $2, $5 IM Reg DM Reg 48 or $13, $6, $2 IM Reg DM Reg 52 add $14, $2, $2 IM Reg DM Reg 72 lw $4, 50($7) IM Reg DM Reg In the above diagram, we are predicting branch not taken Need to add hardware for flushing the three following instructions if we are wrong losing three cycles. #58 Final Review Summer

59 Reducing Delay of Taken Branches Next PC of a branch known in MEM stage: Costs three lost cycles if taken. If next PC is known in EX stage, one cycle is saved. Branch address calculation can be moved to ID stage using a register comparator, costing only one cycle if branch is taken. IF.Flush Hazard detection unit M u x ID/EX WB EX/MEM Control 0 M u x M WB MEM/WB IF/ID EX M WB PC 4 Instruction memory Shift left 2 Registers = M u x M u x ALU Data memory M u x Sign extend M u x Forwarding unit #59 Final Review Summer

60 Pipeline Performance Example Assume the following MIPS instruction mix: Type Frequency Arith/Logic 40% Load 30% of which 25% are followed immediately by an instruction using the loaded value Store 10% branch 20% of which 45% are taken What is the resulting CPI for the pipelined MIPS with forwarding and branch address calculation in ID stage? CPI = Ideal CPI + Pipeline stall clock cycles per instruction = 1 + stalls by loads + stalls by branches = 1 +.3x.25x1 +.2 x.45x1 = = #60 Final Review Summer

61 Memory Hierarchy: Motivation Processor-Memory (DRAM) Performance Gap 1000 Performance CPU Processor-Memory Performance Gap: (grows 50% / year) DRAM µproc 60%/yr. DRAM 7%/yr. #61 Final Review Summer

62 Memory Hierarchy: Motivation The Principle Of Locality Programs usually access a relatively small portion of their address space (instructions/data) at any instant of time (program working set). Two Types of locality: Temporal Locality: If an item is referenced, it will tend to be referenced again soon. Spatial locality: If an item is referenced, items whose addresses are close will tend to be referenced soon. The presence of locality in program behavior, makes it possible to satisfy a large percentage of program access needs using memory levels with much less capacity than program address space. #62 Final Review Summer

63 Levels of The Memory Hierarchy Part of The On-chip CPU Datapath Registers One or more levels (Static RAM): Level 1: On-chip 16-64K Level 2: On or Off-chip K Level 3: Off-chip 128K-8M Dynamic RAM (DRAM) 16M-16G Interface: SCSI, RAID, IDE, G-100G Registers Cache Main Memory Magnetic Disc Farther away from The CPU Lower Cost/Bit Higher Capacity Increased Access Time/Latency Lower Throughput Optical Disk or Magnetic Tape #63 Final Review Summer

64 Cache Design & Operation Issues Q1: Where can a block be placed cache? (Block placement strategy & Cache organization) Fully Associative, Set Associative, Direct Mapped Q2: How is a block found if it is in cache? (Block identification) Tag/Block Q3: Which block should be replaced on a miss? (Block replacement) Random, LRU Q4: What happens on a write? (Cache write policy) Write through, write back #64 Final Review Summer

65 Cache Organization & Placement Strategies Placement strategies or mapping of a main memory data block onto cache block frame addresses divide cache into three organizations: 1 Direct mapped cache: A block can be placed in one location only, given by: (Block address) MOD (Number of blocks in cache) 2 Fully associative cache: A block can be placed anywhere in cache. 3 Set associative cache: A block can be placed in a restricted set of places, or cache block frames. A set is a group of block frames in the cache. A block is first mapped onto the set and then it can be placed anywhere within the set. The set in this case is chosen by: (Block address) MOD (Number of sets in cache) If there are n blocks in a set the cache placement is called n-way set-associative. #65 Final Review Summer

66 Cache Organization Example #66 Final Review Summer

67 Locating A Data Block in Cache Each block frame in cache has an address tag. The tags of every cache block that might contain the required data are checked or searched in parallel. A valid bit is added to the tag to indicate whether this entry contains a valid address. The address from the CPU to cache is divided into: A block address, further divided into: An index field to choose a block set in cache. (no index field when fully associative). A tag field to search and match addresses in the selected set. A block offset to select the data from the block. Tag Block Address Index Block Offset #67 Final Review Summer

68 Address Field Sizes Physical Address Generated by CPU Tag Block Address Index Block Offset Block offset size = log2(block size) Index size = log2(total number of blocks/associativity) Tag size = address size - index size - offset size #68 Final Review Summer

69 Four-Way Set Associative Cache: MIPS Implementation Example Tag Field Address Index Field Index V Tag Data V Tag Data V Tag Data V Tag Data sets 1024 block frames 4-to-1 m ultiplexo r Hit Da ta #69 Final Review Summer

70 Cache Organization/Addressing Example Given the following: A single-level L 1 cache with 128 cache block frames Each block frame contains four words (16 bytes) 16-bit memory addresses to be cached (64K bytes main memory or 4096 memory blocks) Show the cache organization/mapping and cache address fields for: Fully Associative cache. Direct mapped cache. 2-way set-associative cache. #70 Final Review Summer

71 Cache Example: Fully Associative Case V V All 128 tags must be checked in parallel by hardware to locate a data block V Valid bit Block Address = 12 bits Tag = 12 bits Block offset = 4 bits #71 Final Review Summer

72 Cache Example: Direct Mapped Case V V V Only a single tag must be checked in parallel to locate a data block V Valid bit Block Address = 12 bits Tag = 5 bits Index = 7 bits Block offset = 4 bits Main Memory #72 Final Review Summer

73 Cache Example: 2-Way Set-Associative Two tags in a set must be checked in parallel to locate a data block Valid bits not shown Block Address = 12 bits Tag = 6 bits Index = 6 bits Block offset = 4 bits Main Memory #73 Final Review Summer

74 Calculating Number of Cache Bits Needed How many total bits are needed for a direct- mapped cache with 64 KBytes of data and one word blocks, assuming a 32-bit address? 64 Kbytes = 16 K words = 2 14 words = 2 14 blocks Block size = 4 bytes => offset size = 2 bits, #sets = #blocks = 2 14 => index size = 14 bits Tag size = address size - index size - offset size = = 16 bits Bits/block = data bits + tag bits + valid bit = = 49 Bits in cache = #blocks x bits/block = 2 14 x 49 = 98 Kbytes How many total bits would be needed for a 4-way set associative cache to store the same amount of data? Block size and #blocks does not change. #sets = #blocks/4 = (2 14 )/4 = 2 12 => index size = 12 bits Tag size = address size - index size - offset = = 18 bits Bits/block = data bits + tag bits + valid bit = = 51 Bits in cache = #blocks x bits/block = 2 14 x 51 = 102 Kbytes Increase associativity => increase bits in cache #74 Final Review Summer

75 Cache Replacement Policy When a cache miss occurs the cache controller may have to select a block of cache data to be removed from a cache block frame and replaced with the requested data, such a block is selected by one of two methods: Random: Any block is randomly selected for replacement providing uniform allocation. Simple to build in hardware. The most widely used cache replacement strategy. Least-recently used (LRU): Accesses to blocks are recorded and and the block replaced is the one that was not used for the longest period of time. LRU is expensive to implement, as the number of blocks to be tracked increases, and is usually approximated. #75 Final Review Summer

76 Miss Rates for Caches with Different Size, Associativity & Replacement Algorithm Sample Data Associativity: 2-way 4-way 8-way Size LRU Random LRU Random LRU Random 16 KB 5.18% 5.69% 4.67% 5.29% 4.39% 4.96% 64 KB 1.88% 2.01% 1.54% 1.66% 1.39% 1.53% 256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12% #76 Final Review Summer

77 Cache Performance Princeton Memory Architecture For a CPU with a single level (L1) of cache for both instructions and data (Princeton memory architecture) and no stalls for cache hits: CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty) If write and read miss penalties are the same: With ideal memory Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty #77 Final Review Summer

78 Cache Performance Princeton Memory Architecture CPUtime = Instruction count x CPI x Clock cycle time CPI execution = CPI with ideal memory CPI = CPI execution + Mem Stall cycles per instruction CPUtime = Instruction Count x (CPI execution + Mem Stall cycles per instruction) x Clock cycle time Mem Stall cycles per instruction = Mem accesses per instruction x Miss rate x Miss penalty CPUtime = IC x (CPI execution + Mem accesses per instruction x Miss rate x Miss penalty) x Clock cycle time Misses per instruction = Memory accesses per instruction x Miss rate CPUtime = IC x (CPI execution + Misses per instruction x Miss penalty) x Clock cycle time #78 Final Review Summer

79 Cache Performance Harvard Memory Architecture For a CPU with separate level one (L1) caches for instructions and data (Harvard memory architecture) and no stalls for cache hits: CPUtime = Instruction count x CPI x Clock cycle time CPI = CPI execution + Mem Stall cycles per instruction CPUtime = Instruction Count x (CPI execution + Mem Stall cycles per instruction) x Clock cycle time Mem Stall cycles per instruction = Instruction Fetch Miss rate x Miss Penalty + Data Memory Accesses Per Instruction x Data Miss Rate x Miss Penalty #79 Final Review Summer

80 Cache Performance Example Suppose a CPU executes at Clock Rate = 200 MHz (5 ns per cycle) with a single level of cache. CPI execution = 1.1 Instruction mix: 50% arith/logic, 30% load/store, 20% control Assume a cache miss rate of 1.5% and a miss penalty of 50 cycles. CPI = CPI execution + mem stalls per instruction Mem Stalls per instruction = Mem accesses per instruction x Miss rate x Miss penalty Mem accesses per instruction = = 1.3 Instruction fetch Load/store Mem Stalls per instruction = 1.3 x.015 x 50 = CPI = = The ideal CPU with no misses is 2.075/1.1 = 1.88 times faster #80 Final Review Summer

81 Cache Performance Example Suppose for the previous example we double the clock rate to 400 MHZ, how much faster is this machine, assuming similar miss rate, instruction mix? Since memory speed is not changed, the miss penalty takes more CPU cycles: Miss penalty = 50 x 2 = 100 cycles. CPI = x.015 x 100 = = 3.05 Speedup = (CPI old x C old )/ (CPI new x C new ) = x 2 / 3.05 = 1.36 The new machine is only 1.36 times faster rather than 2 times faster due to the increased effect of cache misses. CPUs with higher clock rate, have more cycles per cache miss and more memory impact on CPI #81 Final Review Summer

82 2 Levels of Cache: L 1, L 2 CPU L 1 Cache Hit Rate= H 1, Hit time = 1 cycle (No Stall) L 2 Cache Hit Rate= H 2, Hit time = T 2 cycles Main Memory Memory access penalty, M #82 Final Review Summer

83 2-Level Cache Performance Memory Access Tree CPU Stall Cycles Per Memory Access CPU Memory Access L 1 L1 Hit: Stalls= H1 x 0 = 0 (No Stall) L1 Miss: % = (1-H1) L 2 L2 Hit: (1-H1) x H2 x T2 L2 Miss: Stalls= (1-H1)(1-H2) x M Stall cycles per memory access = (1-H1) x H2 x T2 + (1-H1)(1-H2) x M CPUtime = IC x (CPI execution + Mem Stall cycles per instruction) x C Mem Stall cycles per instruction = Mem accesses per instruction x Stall cycles per access #83 Final Review Summer

84 Two-Level Cache Example CPU with CPI execution = 1.1 running at clock rate = 500 MHZ 1.3 memory accesses per instruction. L 1 cache operates at 500 MHZ with a miss rate of 5% L 2 cache operates at 250 MHZ with miss rate 3%, (T 2 = 2 cycles) Memory access penalty, M = 100 cycles. Find CPI. CPI = CPI execution + Mem Stall cycles per instruction With No Cache, CPI = x 100 = With single L 1, CPI = x.05 x 100 = 7.6 Mem Stall cycles per instruction = Mem accesses per instruction x Stall cycles per access Stall cycles per memory access = (1-H1) x H2 x T2 + (1-H1)(1-H2) x M =.05 x.97 x x.03 x 100 = =.247 Mem Stall cycles per instruction = Mem accesses per instruction x Stall cycles per access =.247 x 1.3 =.32 CPI = = 1.42 Speedup = 7.6/1.42 = 5.35 #84 Final Review Summer

85 Memory Bandwidth Improvement Techniques Wider Main Memory: Memory width is increased to a number of words (usually the size of a second level cache block). Memory bandwidth is proportional to memory width. e.g Doubling the width of cache and memory doubles memory bandwidth Simple Interleaved Memory: Memory is organized as a number of banks each one word wide. Simultaneous multiple word memory reads or writes are accomplished by sending memory addresses to several memory banks at once. Interleaving factor: Refers to the mapping of memory addressees to memory banks. e.g. using 4 banks, bank 0 has all words whose address is: (word address) (mod) 4 = 0 #85 Final Review Summer

86 Wider memory, bus and cache Narrow bus and cache with interleaved memory Three examples of bus width, memory width, and memory interleaving to achieve higher memory bandwidth Simplest design: Everything is the width of one word #86 Final Review Summer

87 Memory Interleaving #87 Final Review Summer

88 Memory Width, Interleaving: An Example Given a base system with following parameters: Cache Block size = 1 word, Memory bus width = 1 word, Miss rate = 3% Miss penalty = 32 cycles, broken down as follows: (4 cycles to send address, 24 cycles access time/word, 4 cycles to send a word) Memory access/instruction = 1.2 Ideal execution CPI (ignoring cache misses) = 2 Miss rate (block size=2 word) = 2% Miss rate (block size=4 words) = 1% The CPI of the base machine with 1-word blocks = 2 + (1.2 x.03 x 32) = 3.15 Increasing the block size to two words gives the following CPI: 32-bit bus and memory, no interleaving = 2 + (1.2 x.02 x 2 x 32) = bit bus and memory, interleaved = 2 + (1.2 x.02 x ( ) = bit bus and memory, no interleaving = 2 + (1.2 x.02 x 1 x 32) = 2.77 Increasing the block size to four words; resulting CPI: 32-bit bus and memory, no interleaving = 2 + (1.2 x 1% x 4 x 32) = bit bus and memory, interleaved = 2 + (1.2 x 1% x ( ) = bit bus and memory, no interleaving = 2 + (1.2 x 1% x 2 x 32) = 2.77 #88 Final Review Summer

CPU Organization Datapath Design:

CPU Organization Datapath Design: Capabilities & performance characteristics of principal Functional Units (FUs): (e.g., Registers, ALU, Shifters, Logic Units,...) Ways in which these components are interconnected