CSE 141 Computer Architecture Summer Session Lecture 3 ALU Part 2 Single Cycle CPU Part 1. Pramod V. Argade

CSE 141 Computer Architecture Summer Session 1 2004 Lecture 3 ALU Part 2 Single Cycle CPU Part 1 Pramod V. Argade

Announcements
Reading Assignment
  Chapter 5: The Processor: Datapath and Control, Sec. 5.3-5.4
Homework 3: Due Mon., July 12 in class
  4.14c, 4.27, 4.28, 4.31
  Multiply (-6 * 7) using Booth's algorithm, using 4-bit 2's complement representation for the operands
  5.5, 5.6, 5.8, 5.9, 5.10
Quiz 3
  When: Mon., July 12, first 10 minutes of the class
  Topic: ALU, Chapter 4
  Need: Paper, pen, calculator
Slide 5-2

CSE141 Course Schedule Lecture # Date Time Room Topic Quiz topic Homework Due 1 Mon. 6/28 6-8:50 PM Center 109 Introduction, Ch. 1 ISA, Ch. 3 - - 2 Wed. 6/30 6-8:50 PM Center 109 Performance, Ch. 2 ISA Arithmetic, Ch. 4 Ch. 3 #1 - Mon. 7/5 No Class July 4th Holiday - - Arithmetic, Ch. 4 Cont. Performance 3 Wed. 7/7 6-8:50 PM Center 109 #2 Single-cycle CPU Ch. 5 Ch. 2 Single-cycle CPU Ch. 5 Cont. 4 Mon. 7/12 6-8:50 PM Center 109 Arithmetic, Ch. 4 #3 Multi-cycle CPU Ch. 5 Multi-cycle CPU Ch. 5 Cont. 5 Tue. 7/13 7:30-8:50 PM Center 109 - - (July 5th make up class) Single and Multicycle CPU Examples and Single-cycle CPU 6 Wed. 7/14 6-8:50 PM Center 109 - Review for Midterm Ch. 5 Mid-term Exam 7 Mon. 7/19 6-8:50 PM Center 109 - #4 Exceptions Pipelining Ch. 6 8 Tue. 7/20 7:30-8:50 PM Center 109 - - (July 5th make up class) 9 Wed. 7/21 6-8:50 PM Center 109 Hazards, Ch. 6 - - 10 Mon. 7/26 6-8:50 PM Center 109 Memory Hierarchy & Caches Ch. 7 11 Wed. 7/28 6-8:50 PM Center 109 Virtual Memory, Ch. 7 Course Review Hazards Ch. 6 Cache Ch. 7 12 Sat. 7/31 7-10 PM Center 109 Final Exam - - #5 #6 Slide 5-3

SLT: Set-on-less-than Logic
SLT $1, $2, $3
  if ($2 < $3) $1 = 1; else $1 = 0;
To test A < B, do a subtraction (A - B)
  A < B if (A - B) < 0, i.e. the difference is negative
  Use the sign bit of the difference
  Route the sign bit to bit 0 of the result
  Set bits 1-31 to zero
There is a complication due to overflow
  Work out the solution in Homework problem 4.23
Slide 5-4

Set if Less Than
SLT $m, $n, $p
  if ($n < $p) { $m = 1; } else { $m = 0; }
  $n < $p is tested as ($n - $p) < 0
[Figure a: 1-bit ALU with inputs a, b, Less, Binvert, CarryIn and Operation; a multiplexer with inputs 0-3 selects Result, with Less on input 3; output CarryOut]
[Figure b: 1-bit ALU for the most significant bit, which adds a Set output and overflow detection (Overflow)]
[Figure: 32-bit ALU built from ALU0 through ALU31 with shared Binvert and Operation lines; inputs a0-a31 and b0-b31, outputs Result0-Result31; the Less inputs of ALU1-ALU31 are 0; ALU31 produces Set and Overflow]
Slide 5-5
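A minimal Python sketch of the set-on-less-than idea above, assuming 32-bit two's complement operands passed as integers. The name slt and the particular overflow correction (exclusive-OR of the sign bit with the overflow flag) are illustrative assumptions, not taken from the slide; this is one common way to handle the overflow complication.

```python
MASK32 = 0xFFFFFFFF

def slt(a: int, b: int) -> int:
    """Return 1 if a < b as signed 32-bit values, else 0 (hypothetical helper)."""
    a &= MASK32
    b &= MASK32
    diff = (a - b) & MASK32                 # what the ALU subtraction produces
    sign_a, sign_b, sign_d = a >> 31, b >> 31, diff >> 31
    # a - b overflows when the operand signs differ and the
    # result sign is not the sign of a.
    overflow = sign_a != sign_b and sign_d != sign_a
    # Without overflow the sign bit is the answer; with overflow it is inverted.
    return sign_d ^ int(overflow)

print(slt(2, 3))            # 1
print(slt(3, 2))            # 0
print(slt(0x80000000, 1))   # 1: most negative value is less than 1
```

Only bit 0 of the result can be nonzero, which matches routing the (corrected) sign bit to bit 0 and forcing bits 1-31 to zero.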

Complete 32-bit ALU from last lecture
[Figure: 32-bit ALU built from ALU0 through ALU31; inputs a0-a31, b0-b31, Bnegate and Operation; outputs Result0-Result31, Set, Zero and Overflow]
Functionality provided
  Arithmetic operations: ADD, SUB
  Logical operations: AND, OR
  Compare: SLT
  Support for branch: BEQ, BNE
  Exception detection: Overflow
What is missing?
  Signed multiply
  Unsigned multiply
  Signed division
  Unsigned division
Slide 5-6

Grade school Multiplication algorithm
In general (ignoring sign bits): m bits x n bits = (m+n)-bit product
Binary makes it easy:
  0 => place 0 (0 x multiplicand)
  1 => place multiplicand (1 x multiplicand)
Paper and pencil example of binary multiplication (8 * 10 = 80, 0x8 * 0xa = 0x50):
        1000   (multiplicand)
      x 1010   (multiplier)
        0000
       1000x
      0000xx
     1000xxx
     1010000   (result)
Slide 5-7

Observations about Multiplication More complicated than addition Simple algorithm: Accomplished via shift and add More time delay and more gates (=> silicon area) Let's look at 3 versions based on grade school algorithm Slide 5-8

Multiplication: First Version
Initialization:
  Load 32-bit multiplicand and zero extend to 64 bits
  Load 64-bit product register with zero
Need a state machine to control operation
  32 iterations are required
  Each iteration takes 3 clocks
  Total 96 + 3 = 99 clocks
Hardware: 64-bit Multiplicand register (shift left), 32-bit Multiplier register (shift right), 64-bit ALU, 64-bit Product register (write), Control test
Algorithm:
  Start
  1. Test Multiplier0
     Multiplier0 = 1: 1a. Add multiplicand to product and place the result in Product register
     Multiplier0 = 0: no add
  2. Shift the Multiplicand register left 1 bit
  3. Shift the Multiplier register right 1 bit
  32nd repetition? No (< 32 repetitions): repeat from step 1. Yes (32 repetitions): Done
Observations:
  32 bits in multiplicand are always zero
  64-bit ALU is unnecessary
  Left shifted multiplicand does not affect lower bits of the product
Slide 5-9

Multiplication: Second Version
Initialization:
  Load 32-bit multiplicand to 32-bit register
  Load 64-bit product register with zero
Need a state machine to control operation
Hardware: 32-bit Multiplicand register, 32-bit Multiplier register (shift right), 32-bit ALU, 64-bit Product register (shift right, write), Control test
Algorithm:
  Start
  1. Test Multiplier0
     Multiplier0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register
     Multiplier0 = 0: no add
  2. Shift the Product register right 1 bit
  3. Shift the Multiplier register right 1 bit
  32nd repetition? No (< 32 repetitions): repeat from step 1. Yes (32 repetitions): Done
Observations:
  32 iterations are required
  Each iteration takes 3 clocks
  Total 96 + 3 = 99 clocks
  32-bit ALU is used
Slide 5-10

Multiplication: Third Version
Initialization:
  Load 32-bit multiplicand to 32-bit register
  Load upper 32 bits of product register with zero
  Load lower 32 bits of product register with multiplier
Need a state machine to control operation
Hardware: 32-bit Multiplicand register, 32-bit ALU, 64-bit Product register (shift right, write), Control test
Algorithm:
  Start
  1. Test Product0
     Product0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register
     Product0 = 0: no add
  2. Shift the Product register right 1 bit
  32nd repetition? No (< 32 repetitions): repeat from step 1. Yes (32 repetitions): Done
Observations:
  32 iterations are required
  Each iteration takes 2 clocks
  Total 64 + 3 = 67 clocks
  32-bit ALU is used
  64-bit Product register holds both Product and Multiplier
Slide 5-11
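The third version can be sketched in Python as follows. This is an illustrative model for unsigned operands only, not the hardware itself; the function name multiply_v3 and the bits parameter are assumptions made for the example. Python integers are unbounded, so the carry out of the upper-half add is kept automatically and shifted back in, which real hardware must also arrange.

```python
def multiply_v3(multiplicand: int, multiplier: int, bits: int = 32) -> int:
    """Unsigned shift-and-add multiply with a shared Product/Multiplier register."""
    mask = (1 << bits) - 1
    multiplicand &= mask
    product = multiplier & mask        # upper half = 0, lower half = multiplier
    for _ in range(bits):
        if product & 1:                # 1. test Product0
            # 1a. add the multiplicand to the left (upper) half of the product
            product += multiplicand << bits
        product >>= 1                  # 2. shift the Product register right 1 bit
    return product

print(multiply_v3(8, 10, bits=4))      # 80, i.e. 0b1010000 as in the paper-and-pencil example
```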

Multiplying Signed Numbers Convert all operands to positive Determine sign of the product Sign of the product = sign( op1) ^ sign( op2) Multiply positive operands (only 31 bits) If the sign of the result is negative, negate the result Adds extra logic and delay to multiply Is there a better way? Slide 5-12

Booth's Algorithm
An elegant approach to multiplying signed numbers
With the ability to add, subtract and shift, there are multiple ways to do multiply
Consider signed operands A and B, and define A_{-1} = 0:
  A = (A_{31} * -2^{31}) + (A_{30} * 2^{30}) + (A_{29} * 2^{29}) + ... + (A_1 * 2^1) + (A_0 * 2^0)
    = (-A_{31} * 2^{31}) + (2A_{30} - A_{30}) * 2^{30} + (2A_{29} - A_{29}) * 2^{29} + ... + (2A_0 - A_0) * 2^0
    = (A_{30} - A_{31}) * 2^{31} + (A_{29} - A_{30}) * 2^{30} + ... + (A_0 - A_1) * 2^1 + (A_{-1} - A_0) * 2^0
  A*B = [(A_{30} - A_{31}) * 2^{31} + (A_{29} - A_{30}) * 2^{30} + ... + (A_0 - A_1) * 2^1 + (A_{-1} - A_0) * 2^0] * B
      = (A_{30} - A_{31}) * 2^{31} * B + (A_{29} - A_{30}) * 2^{30} * B + ... + (A_0 - A_1) * 2^1 * B + (A_{-1} - A_0) * 2^0 * B
Recipe: Evaluate (A_{i-1} - A_i) for each bit position i
   0: Do nothing
   1: Add B (shifted left by i bits)
  -1: Subtract B (shifted left by i bits)
Slide 5-13

Booth's algorithm: Signed multiplication
A*B = (A_{30} - A_{31}) * 2^{31} * B + (A_{29} - A_{30}) * 2^{30} * B + ... + (A_0 - A_1) * 2^1 * B + (A_{-1} - A_0) * 2^0 * B
[Figure: bit string 0 1 1 1 1 0 labeled with the beginning of the run, the middle of the run and the end of the run]
  Current Bit  Bit to the Right  Explanation          Example     Op
  1            0                 Begins run of 1s     0001111000  sub
  1            1                 Middle of run of 1s  0001111000  none
  0            1                 End of run of 1s     0001111000  add
  0            0                 Middle of run of 0s  0001111000  none
Originally for speed (when shift was faster than add)
Replace a string of 1s in the multiplier with an initial subtract when we first see a 1, and then later an add for the bit after the last 1
Potential speed-up: the middle of a string of 0s or 1s requires no operation
Slide 5-14

Booth's Algorithm
Recipe for A*B:
  Add a bit A_{-1} = 0 to the right of A_0
  Evaluate (A_{i-1} - A_i)
     0: Do nothing
     1: Add B
    -1: Subtract B
Example: Use Booth's Algorithm for the following multiplication
  2 * (-6) = 0010 * 1010 = -12 = 1111 0100
Slide 5-15
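The recipe can be traced with a short Python sketch. It is a behavioral model under simplifying assumptions: the multiplier A is scanned as a bits-wide two's complement pattern, while B is kept as an ordinary signed Python integer so that "add B" and "subtract B" at bit position i become adding or subtracting B shifted left by i. The name booth_multiply is hypothetical.

```python
def booth_multiply(a: int, b: int, bits: int = 4) -> int:
    """Multiply signed a (multiplier, `bits` wide) by signed b using Booth's recipe."""
    a_bits = a & ((1 << bits) - 1)     # two's complement pattern of A
    prev = 0                           # A_{-1} = 0
    product = 0
    for i in range(bits):
        cur = (a_bits >> i) & 1        # A_i
        step = prev - cur              # A_{i-1} - A_i is -1, 0 or +1
        product += step * (b << i)     # add, subtract or skip B * 2^i
        prev = cur
    return product

result = booth_multiply(2, -6)         # A = 0010, B = -6
print(result)                          # -12
print(format(result & 0xFF, "08b"))    # 11110100
```

For A = 0010 the only nonzero steps are a subtract of B * 2 at bit 1 (start of the run of 1s) and an add of B * 4 at bit 2 (end of the run): 12 - 24 = -12.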

Division
                 1001      Quotient
  Divisor 1000 | 1001010   Dividend
                -1000
                    10
                    101
                    1010
                   -1000
                      10   Remainder (or Modulo result)
See how big a number can be subtracted, creating a quotient bit on each step
  Binary => 1 * divisor or 0 * divisor
Dividend = Quotient x Divisor + Remainder
  => sizeof(Dividend) = sizeof(Quotient) + sizeof(Divisor)
3 versions of divide, successive refinement
Slide 5-16

Division 1.0
Initialization:
  32-bit Quotient register = 0
  64-bit Remainder register = dividend
  64-bit Divisor register = (32-bit divisor << 32)
Hardware: 64-bit Divisor register (shift right), 64-bit ALU, 32-bit Quotient register (shift left), 64-bit Remainder register (write), Control
Slide 5-17

Division 1.0
Start
  1. Subtract the Divisor register from the Remainder register, and place the result in the Remainder register.
  Test Remainder
     Remainder >= 0: 2a. Shift the Quotient register to the left, setting the new rightmost bit to 1.
     Remainder < 0: 2b. Restore the original value by adding the Divisor register to the Remainder register, and place the sum in the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0.
  3. Shift the Divisor register right 1 bit.
  33rd repetition? No (< 33 repetitions): repeat from step 1. Yes (33 repetitions): Done
Slide 5-18
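A minimal Python sketch of this restoring-division flow for unsigned operands. It assumes the Remainder register starts out holding the dividend and the Divisor register holds the divisor in its upper half; the name divide_v1 and the bits parameter are illustrative.

```python
def divide_v1(dividend: int, divisor: int, bits: int = 32) -> tuple[int, int]:
    """Unsigned restoring division; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor is zero")
    mask = (1 << bits) - 1
    remainder = dividend & mask            # Remainder register = dividend
    div = (divisor & mask) << bits         # divisor placed in the upper half
    quotient = 0
    for _ in range(bits + 1):              # bits + 1 repetitions (33 for 32 bits)
        remainder -= div                   # 1. subtract
        if remainder >= 0:
            quotient = ((quotient << 1) | 1) & mask   # 2a. shift in a 1
        else:
            remainder += div               # 2b. restore the old remainder...
            quotient = (quotient << 1) & mask         # ...and shift in a 0
        div >>= 1                          # 3. shift Divisor right
    return quotient, remainder

print(divide_v1(0b1001010, 0b1000, bits=8))   # (9, 2): quotient 1001, remainder 10
```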

Divide Algorithm
Optimizations similar to those for the multiply algorithm can be done
  32-bit Divisor register
  32-bit ALU
  Quotient bits are left shifted into the Remainder register
In case the result of subtraction is negative, the Remainder register has to be restored
  Takes one extra clock cycle
  Non-restoring divide algorithm removes this step
Divide overflow case
  0x80000000 / -1
Slide 5-19

Floating Point: Introduction
We need a way to represent real numbers
  Numbers with fractions, e.g., 3.14159265 (recognize me?)
  Very small numbers, e.g., 0.0000000000000000000000013621
  Very large numbers, e.g., 9,349,398,989,787,762,244,859,087,678
Binary fractions:
  1011_2 = 1 x 2^3 + 0 x 2^2 + 1 x 2^1 + 1 x 2^0
  so...
  101.011_2 = 1 x 2^2 + 0 x 2^1 + 1 x 2^0 + 0 x 2^{-1} + 1 x 2^{-2} + 1 x 2^{-3}
  e.g., 0.75 = 0.5 + 0.25 = 1/2 + 1/4 = 0.11_2
Slide 5-20
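The positional rule for binary fractions can be checked with a few lines of Python. The helper name binary_to_float is hypothetical; it simply sums each bit times its power of two.

```python
def binary_to_float(text: str) -> float:
    """Evaluate a binary string such as '101.011' by positional weights."""
    whole, _, frac = text.partition(".")
    value = float(int(whole, 2)) if whole else 0.0
    for i, bit in enumerate(frac, start=1):
        value += int(bit) * 2.0 ** -i      # bit i after the point weighs 2^-i
    return value

print(binary_to_float("101.011"))   # 5.375
print(binary_to_float(".11"))       # 0.75
```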

Recall Scientific Notation
  6.02 x 10^{23}: mantissa 6.02 (containing the decimal point), radix (base) 10, exponent 23
IEEE Single Precision F.P.
  ± 1.M x 2^{e - 127}
Slide 5-21

IEEE 754 Single-precision Floating-Point
  Field:  S   E   M
  Bits:   1   8   23   (total 32 bits)
  S: sign
  E: exponent, excess-127 binary integer
  M: mantissa, normalized binary significand with hidden integer bit: 1.M
  N = (-1)^S x (1.M) x 2^{E-127}
Example: Convert -325.75 to IEEE Single Precision Floating Point Representation
Slide 5-22

IEEE 754 Double-precision Floating-Point
  Field:  S   E    M    M
  Bits:   1   11   20   32   (total 64 bits; M spans 20 + 32 = 52 bits)
  S: sign
  E: exponent, excess-1023 binary integer
  M: mantissa, normalized binary significand with hidden integer bit: 1.M
  N = (-1)^S x (1.M) x 2^{E-1023}
Example: Convert -325.75 to IEEE Double Precision Floating Point Representation
Slide 5-23
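One way to check a hand conversion is to let Python's standard struct module pack the value and then split the bits into the S, E and M fields. The helper name ieee_fields is hypothetical. For -325.75 the hand steps are: 325.75 = 101000101.11_2 = 1.0100010111_2 x 2^8, so S = 1, the stored exponent is 8 + 127 = 135 (single) or 8 + 1023 = 1031 (double), and M is 0100010111 followed by zeros.

```python
import struct

def ieee_fields(x: float, double: bool = False) -> tuple[int, int, int]:
    """Return (S, E, M) of x in IEEE 754 single or double precision."""
    if double:
        (word,) = struct.unpack(">Q", struct.pack(">d", x))
        exp_bits, man_bits = 11, 52
    else:
        (word,) = struct.unpack(">I", struct.pack(">f", x))
        exp_bits, man_bits = 8, 23
    sign = word >> (exp_bits + man_bits)
    exponent = (word >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = word & ((1 << man_bits) - 1)
    return sign, exponent, mantissa

s, e, m = ieee_fields(-325.75)
print(s, format(e, "08b"), format(m, "023b"))
# 1 10000111 01000101110000000000000
```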

IEEE 754 Single Precision FP
  Field:  S   E   M
  Bits:   1   8   23   (total 32 bits)
  S: sign
  E: exponent, excess-127 binary integer
  M: mantissa, normalized binary significand with hidden integer bit: 1.M
In the rules below, F denotes the fraction bits stored in the M field and V is the value represented:
  If E=255 and F is nonzero, then V=NaN ("Not a number")
  If E=255 and F is zero and S is 1, then V=-Infinity
  If E=255 and F is zero and S is 0, then V=Infinity
  If 0<E<255 then V = (-1)^S * 2^{E-127} * (1.F)
  If E=0 and F is zero and S is 1, then V=-0
  If E=0 and F is zero and S is 0, then V=0
In particular,
  0 00000000 00000000000000000000000 = 0
  1 00000000 00000000000000000000000 = -0
  0 11111111 00000000000000000000000 = Infinity
  1 11111111 00000000000000000000000 = -Infinity
  0 11111111 00000100000000000000000 = NaN
  1 11111111 00100010001001010101010 = NaN
  0 10000000 00000000000000000000000 = +1 * 2^{128-127} * 1.0 = 2
Slide 5-24
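These decoding rules translate almost directly into code. The sketch below uses the hypothetical name decode_single and takes the 32-bit pattern as an integer. The case E=0 with nonzero F (denormalized numbers, V = (-1)^S * 2^{-126} * 0.F) is not listed on the slide; it is included here as an addition so the function covers every bit pattern.

```python
import math

def decode_single(word: int) -> float:
    """Decode a 32-bit IEEE 754 single-precision bit pattern."""
    s = (word >> 31) & 1
    e = (word >> 23) & 0xFF
    f = word & 0x7FFFFF
    sign = -1.0 if s else 1.0
    if e == 255:
        return math.nan if f else sign * math.inf
    if e == 0:
        # F == 0 gives +0 or -0; nonzero F is a denormalized number
        # (this case goes beyond the slide).
        return sign * 2.0 ** -126 * (f / 2 ** 23)
    return sign * 2.0 ** (e - 127) * (1 + f / 2 ** 23)

print(decode_single(0x40000000))   # 2.0  (0 10000000 000...0)
print(decode_single(0xFF800000))   # -inf (1 11111111 000...0)
```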

Floating Point Addition
Start
  1. Compare the exponents of the two numbers. Shift the smaller number to the right until its exponent would match the larger exponent
  2. Add the significands
  3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent
  Overflow or underflow? Yes: Exception. No: continue
  4. Round the significand to the appropriate number of bits
  Still normalized? No: go back to step 3. Yes: Done
Slide 5-25

Floating Point Addition
Example: 0.5 + (-0.4375)
[Figure: floating-point adder datapath. The Sign, Exponent and Significand of both operands enter at the top. A small ALU compares the exponents and sends the exponent difference to Control; the smaller number is shifted right; the big ALU adds the significands; the result is normalized (shift left or right, increment or decrement the exponent) and passed through rounding hardware to give the Sign, Exponent and Significand of the result]
Slide 5-26
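The example can be traced with a simplified Python model of the align, add and normalize steps. Everything about the representation is an assumption made for illustration: each operand is a (sign, exponent, significand) tuple, the significand is an integer holding 1.xxx scaled by 2^frac_bits (3 fraction bits here), bits shifted out during alignment are dropped, and rounding and overflow checks are omitted. The names fp_add and to_float are hypothetical.

```python
def fp_add(a, b, frac_bits=3):
    """Add two (sign, exponent, significand) triples; no rounding or range checks."""
    (sa, ea, ma), (sb, eb, mb) = a, b
    # Step 1: shift the number with the smaller exponent right.
    if ea < eb:
        ma >>= eb - ea
        ea = eb
    else:
        mb >>= ea - eb
    # Step 2: add the signed significands.
    total = (-ma if sa else ma) + (-mb if sb else mb)
    if total == 0:
        return (0, 0, 0)               # illustrative encoding of zero
    sign, mag, exp = int(total < 0), abs(total), ea
    # Step 3: normalize so the significand reads 1.xxx again.
    one = 1 << frac_bits
    while mag >= 2 * one:
        mag >>= 1
        exp += 1
    while mag < one:
        mag <<= 1
        exp -= 1
    return (sign, exp, mag)

def to_float(value, frac_bits=3):
    sign, exp, mag = value
    return (-1) ** sign * (mag / (1 << frac_bits)) * 2.0 ** exp

half = (0, -1, 0b1000)                 # 0.5     =  1.000 x 2^-1
other = (1, -2, 0b1110)                # -0.4375 = -1.110 x 2^-2
result = fp_add(half, other)
print(result, to_float(result))        # (0, -4, 8) 0.0625
```

After alignment the second operand becomes -0.111 x 2^-1, the sum is 0.001 x 2^-1, and normalizing gives 1.000 x 2^-4 = 0.0625.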

IEEE 754 Floating Point
Increasing the size of the significand enhances accuracy
Increasing the size of the exponent increases the range of the numbers that can be represented
  Overflow or underflow can happen
Can do integer compare for greater-than, sign
Single precision: range of about 2 x 10^{-38} to 2 x 10^{38}
Double precision: range of about 2 x 10^{-308} to 2 x 10^{308}
An infinite variety of real numbers exists between, say, 0 and 1
  Not more than 2^{53} can be represented exactly in double precision
Slide 5-27

Floating Point Complexities
Operations are somewhat more complicated
In addition to overflow we can have underflow
Accuracy can be a big problem
  IEEE 754 keeps two extra bits, guard and round
  Four rounding modes
  Positive number divided by zero yields infinity
  Zero divided by zero yields not a number
Implementing the standard can be tricky
Not using the standard can be even worse
  See text for description of 80x86 and Pentium bug!
Slide 5-28

Summary Multiplication and division take much longer than addition, requiring multiple addition steps. Floating Point extends the range of numbers that can be represented, at the expense of precision (accuracy). FP operations are very similar to integer, but with pre- and postprocessing. Slide 5-29

CSE 141 Computer Architecture Fall 2003 Lecture 3 The Processor: Datapath and Control Pramod V. Argade

Datapath and Control Design
The Five Classic Components of a Computer
[Figure: Processor (containing Control and Datapath), Memory, Input, Output]
Slide 5-32

Single Cycle Implementation
Datapath and Control steps: Instruction Fetch, Instruction Decode, Operand Fetch, Execute, Result Store
[Figure: one Clock Cycle covers I. Fetch, Decode, Op. Fetch, Execute, Store and Next PC, i.e. the complete execution of a single instruction, followed by the Next Instruction]
Slide 5-33

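Datapath
Abstract / Simplified View:
[Figure: PC supplies the Address to the Instruction memory; the Instruction supplies the Register # inputs of the Registers; the Registers feed the ALU; the ALU supplies the Address to the Data memory; Data moves between the Data memory and the Registers]
Slide 5-34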

Two Types of Logic Components
Combinational
  Elements that operate on data values
  Produce the same output if given the same inputs
  [Figure: inputs A and B into Combinational Logic, output C = f(A, B)]
State Elements
  Contain internal storage
  State elements can be read at any time
  Clock is used to determine when a state element should be written
  [Figure: inputs A, B and clk into State Element, output C = f(A, B, state)]
Slide 5-35

Clock
Clock is a free running signal
  Fixed cycle time (period)
  Frequency = 1/(cycle time)
  Duty cycle: (% high)/(% low), e.g. the 50/50 duty cycle in the figure
  Jitter: uncertainty in rising or falling edge
[Figure: clock waveform marking the Rising Edge, the Falling Edge and one Clock Cycle (Period)]
Slide 5-36

Edge-triggered Clocking
Values stored in the machine are updated on a clock edge
The clock edge can be either rising or falling
[Figure: State element 1, Combinational logic, State element 2 within one clock cycle; a second diagram shows State element 1 feeding combinational logic whose output returns to State element 1]
By default a state element is written on every clock edge
  An explicit write control signal is required otherwise
Edge-triggered methodology allows, in the same clock cycle, to:
  read the contents of a register
  send the value through some combinational logic, and
  write the contents of the same or another register
Possible to have the same state element as input and output
Slide 5-37

Storage Elements
D Latch
  Two inputs: the data value to be stored (D), and the clock signal (C) indicating when to read and store D
  Two outputs: the value of the internal state (Q) and its complement
  [Figure: D latch with inputs C and D and outputs Q and its complement]
Falling edge triggered D flip-flop
  Output changes only on the clock edge
  [Figure: D flip-flop built from two D latches, with inputs D and C and outputs Q and its complement]
Slide 5-38

CPU: Clocking
[Figure: Clk timing diagram with Setup and Hold windows around each clock edge and Don't Care regions in between; storage elements driven by CLK]
All storage elements are clocked by the same clock edge
Slide 5-39

Register: A Storage Element
Similar to the D flip-flop except:
  N-bit input and output
  Write Enable input
Write Enable:
  0: Data Out will not change
  1: Data Out will become Data In (on the clock edge)
[Figure: register with N-bit Data In, N-bit Data Out, Write Enable and Clk]
Slide 5-40

Register File
Register file consists of 32 registers:
  Two 32-bit output busses: busa and busb
  One 32-bit input bus: busw
Register is selected by:
  RA selects the register to put on busa
  RB selects the register to put on busb
  RW selects the register to be written via busw when Write Enable is 1
Clock input (CLK)
[Figure: block of 32 32-bit registers with 5-bit RW, RA and RB inputs, Write Enable, Clk, 32-bit busw input and 32-bit busa and busb outputs]
Slide 5-41
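A small behavioral model of this register file in Python, offered as a sketch rather than a hardware description. The class and method names (RegisterFile, read, clock_edge) are hypothetical; reads are modeled as combinational lookups and writes take effect only when clock_edge is called with Write Enable set.

```python
class RegisterFile:
    """32 registers of 32 bits, two read ports and one write port."""

    def __init__(self) -> None:
        self.regs = [0] * 32

    def read(self, ra: int, rb: int) -> tuple[int, int]:
        # RA selects the register driven onto busa, RB the one onto busb.
        return self.regs[ra & 31], self.regs[rb & 31]

    def clock_edge(self, rw: int, busw: int, write_enable: bool) -> None:
        # On the clock edge, register RW takes the value on busw
        # only when Write Enable is 1.
        if write_enable:
            self.regs[rw & 31] = busw & 0xFFFFFFFF

rf = RegisterFile()
rf.clock_edge(rw=5, busw=42, write_enable=False)
print(rf.read(5, 0))    # (0, 0): nothing was written
rf.clock_edge(rw=5, busw=42, write_enable=True)
print(rf.read(5, 0))    # (42, 0)
```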

Memory
One input bus: Data In
One output bus: Data Out
Memory word is selected by:
  Address selects the word to put on Data Out
  Write Enable = 1: Address selects the memory word to be written via the Data In bus
Clock input (CLK)
  The CLK input is a factor ONLY during a write operation
  During a read operation, memory behaves as a combinational logic block:
    Address valid => Data Out valid after access time
[Figure: memory block with Write Enable, Address, 32-bit Data In, 32-bit Data Out and Clk]
Slide 5-42

Basic 4 x 2 Static RAM
[Figure: four rows (0-3) of two D latches each, with Enable-controlled outputs. Inputs: Din[1], Din[0], Write enable and Address, which drives a 2-to-4 decoder selecting the row. Outputs: Dout[1], Dout[0]]
Slide 5-43

A Simple Implementation of MIPS CPU
Simplified to contain only:
  Memory-reference instructions: lw, sw
  Arithmetic-logical instructions: add, sub, and, or, slt
  Control flow instructions: beq, j
Execution Time = Instructions * CPI * Cycle Time
Processor design (datapath and control) will determine:
  Clock cycle time
  Clock cycles per instruction
We will design a single cycle processor:
  Advantage: one clock cycle per instruction
  Disadvantage: long cycle time
Slide 5-44

Arithmetic Instructions (R-Type)
ADD, SUB, AND, OR, SLT
Example: add rd, rs, rt
  Bits:   31-26   25-21   20-16   15-11   10-6    5-0
  Field:  op      rs      rt      rd      shamt   funct
  Width:  6 bits  5 bits  5 bits  5 bits  5 bits  6 bits
e.g. add $t3, $s0, $s5
  REG[$t3] = REG[$s0] + REG[$s5]
Slide 5-45
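Splitting an instruction word into the R-type fields is a matter of shifts and masks. The sketch below uses the hypothetical name decode_r_type. The example word 0x02155820 is the standard MIPS encoding of add $t3, $s0, $s5, using the conventional register numbers $t3 = 11, $s0 = 16, $s5 = 21 and funct 0x20 for add; these numbers come from the MIPS ISA rather than from the slide.

```python
def decode_r_type(word: int) -> dict[str, int]:
    """Split a 32-bit MIPS instruction word into R-type fields."""
    return {
        "op":    (word >> 26) & 0x3F,   # bits 31-26
        "rs":    (word >> 21) & 0x1F,   # bits 25-21
        "rt":    (word >> 16) & 0x1F,   # bits 20-16
        "rd":    (word >> 11) & 0x1F,   # bits 15-11
        "shamt": (word >> 6) & 0x1F,    # bits 10-6
        "funct": word & 0x3F,           # bits 5-0
    }

print(decode_r_type(0x02155820))
# {'op': 0, 'rs': 16, 'rt': 21, 'rd': 11, 'shamt': 0, 'funct': 32}
```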

Load/Store Instructions (I-Type)
LW, SW
Examples: lw rt, rs, imm16
          sw rt, rs, imm16
  Bits:   31-26   25-21   20-16   15-0
  Field:  op      rs      rt      immediate
  Width:  6 bits  5 bits  5 bits  16 bits
e.g. lw $s3, -4($s2)
  REG[$s3] = D-MEM[ REG[$s2] - 4 ]
Slide 5-46

Branch (I-Type)
Beq
Example: beq rs, rt, imm16
  Bits:   31-26   25-21   20-16   15-0
  Field:  op      rs      rt      displacement
  Width:  6 bits  5 bits  5 bits  16 bits
e.g. at address 0x4c: beq $s1, $t3, -12
  if( REG[$s1] == REG[$t3] ) {
    new_pc = old_pc + 4 - 12   # new_pc = 0x44
  } else {
    new_pc = old_pc + 4        # new_pc = 0x50
  }
Slide 5-47

Jump (J-Type)
J
Example: j Label
  Bits:   31-26   25-0
  Field:  op      target address
  Width:  6 bits  26 bits
e.g. at address 0x8000 0000: j 0x111 1111
  new_pc = 0x8444 4444
Slide 5-48
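The jump example can be reproduced with the usual MIPS rule, stated here as an assumption since the slide gives only the result: the 26-bit target is shifted left by 2 and combined with the upper 4 bits of PC + 4. The function name jump_target is hypothetical.

```python
def jump_target(pc: int, target26: int) -> int:
    """New PC for a MIPS j instruction located at address pc."""
    upper = (pc + 4) & 0xF0000000              # top 4 bits of PC + 4
    return upper | ((target26 & 0x3FFFFFF) << 2)

print(hex(jump_target(0x80000000, 0x1111111)))   # 0x84444444
```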

Components Required to implement the ISA
Next PC generation
  Add 4 or extended 16-bit immediate to PC
Memory
  Instruction read
  Data read/write
Registers (32 x 32-bit)
  Read register rs
  Read register rt
  Write register rt or rd
Sign extend immediate operand
ALU to operate on the operands
Slide 5-49

CPU: Instruction Fetch
RTL version of the instruction fetch step:
  Fetch the instruction: mem[PC]
  Update the program counter:
    Sequential code: PC <- PC + 4
    Branch and jump: PC <- something else
[Figure: PC (clocked by Clk) with Next Address Logic; the PC supplies the Address to the Instruction Memory, which outputs the 32-bit Instruction Word]
Slide 5-50

CPU: Register-Register Operations (Add, Subtract etc.)
R[rd] <- R[rs] op R[rt]   Example: addu rd, rs, rt
  Ra, Rb, and Rw come from the instruction's rs, rt, and rd fields
  ALUctr and RegWr: generated by control logic after decoding the instruction
  Bits:   31-26   25-21   20-16   15-11   10-6    5-0
  Field:  op      rs      rt      rd      shamt   funct
  Width:  6 bits  5 bits  5 bits  5 bits  5 bits  6 bits
[Figure: register file of 32 32-bit registers with RegWr, Clk, 5-bit Rw, Ra and Rb inputs (from Rd, Rs and Rt) and 32-bit busw; busa and busb feed the ALU, controlled by ALUctr, which produces the 32-bit Result]
Slide 5-51

CPU: Load Operations
R[rt] <- Mem[R[rs] + SignExt[imm16]]   Example: lw rt, rs, imm16
  Bits:   31-26   25-21   20-16   15-0
  Field:  op      rs      rt      immediate
  Width:  6 bits  5 bits  5 bits  16 bits
[Figure: load datapath. Blocks and signals shown: RegDst mux choosing Rd or Rt; register file of 32 32-bit registers (RegWr, Clk, Rw, Ra, Rb, busw, busa, busb); Extender on imm16 (ExtOp); ALUSrc mux; ALU (ALUctr); Data Memory (MemWr, WrEn, Adr, Data In, Clk); W_Src mux]
Slide 5-52

CPU: Store Operations
Mem[R[rs] + SignExt[imm16]] <- R[rt]   Example: sw rt, rs, imm16
  Bits:   31-26   25-21   20-16   15-0
  Field:  op      rs      rt      immediate
  Width:  6 bits  5 bits  5 bits  16 bits
[Figure: store datapath. Blocks and signals shown: RegDst mux choosing Rd or Rt; register file of 32 32-bit registers (RegWr, Clk, Rw, Ra, Rb, busw, busa, busb); Extender on imm16 (ExtOp); ALUSrc mux; ALU (ALUctr); Data Memory (MemWr, WrEn, Adr, Data In, Clk); W_Src mux]
Slide 5-53

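CPU: Datapath for Branching
beq rs, rt, imm16
  Datapath generates condition (equal)
  Bits:   31-26   25-21   20-16   15-0
  Field:  op      rs      rt      immediate
  Width:  6 bits  5 bits  5 bits  16 bits
[Figure: branch datapath. Next-PC logic: PC (Clk), two adders, the constant 4, PC Ext on imm16 (sign extend to 32 bits and left shift by 2), mux controlled by npc_sel, Cond input, and the Instruction Address formed from the PC with 00 appended. Register file of 32 32-bit registers (RegWr, Clk, Rs, Rt, Rw, Ra, Rb, busw); busa and busb feed the Equal? comparison]
Slide 5-54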

CPU: Binary arithmetic for PC
In theory, the PC is a 32-bit byte address into the instruction memory:
  Sequential operation: PC<31:0> = PC<31:0> + 4
  Branch operation: PC<31:0> = PC<31:0> + 4 + SignExt[Imm16] * 4
The magic number 4 always comes up because:
  The 32-bit PC is a byte address
  And all our instructions are 4 bytes (32 bits) long
In other words:
  The 2 LSBs of the 32-bit PC are always zeros
  There is no reason to have hardware to keep the 2 LSBs
In practice, we can simplify the hardware by using a 30-bit PC<31:2>:
  Sequential operation: PC<31:2> = PC<31:2> + 1
  Branch operation: PC<31:2> = PC<31:2> + 1 + SignExt[Imm16]
  In either case: Instruction Memory Address = PC<31:2> concat 00
Slide 5-55
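The equivalence of the 32-bit and 30-bit formulations can be checked numerically. The function names below are hypothetical; imm16 is the raw 16-bit field, counted in instructions (words), so a branch back by 12 bytes is encoded as -3, i.e. 0xFFFD.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def branch_pc32(pc: int, imm16: int) -> int:
    # PC<31:0> = PC<31:0> + 4 + SignExt[Imm16] * 4
    return (pc + 4 + sign_extend16(imm16) * 4) & 0xFFFFFFFF

def branch_pc30(pc30: int, imm16: int) -> int:
    # PC<31:2> = PC<31:2> + 1 + SignExt[Imm16]
    return (pc30 + 1 + sign_extend16(imm16)) & 0x3FFFFFFF

pc = 0x4C
imm16 = 0xFFFD                                     # -3 words = -12 bytes
print(hex(branch_pc32(pc, imm16)))                 # 0x44
print(hex(branch_pc30(pc >> 2, imm16) << 2))       # 0x44: 30-bit PC concat 00
```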

Single Cycle Implementation
Putting it all together
[Figure: complete single-cycle datapath. Blocks: PC; Instruction memory (Read address, Instruction [31-0]); Registers (Read register 1, Read register 2, Write register, Write data, Read data 1, Read data 2); Sign extend (16 to 32 bits); Shift left 2; two adders (one adding 4, one producing an ALU result for the branch target); ALU (Zero, ALU result); ALU control; Data memory (Address, Write data, Read data); four 2-input muxes. Instruction fields used: [25-21], [20-16], [15-11], [15-0], [5-0]. Control signals: PCSrc, RegDst, RegWrite, ALUSrc, ALUOp, MemWrite, MemRead, MemtoReg]
Slide 5-56