T insn-mem T regfile T ALU T data-mem T regfile

This Unit: rithmetic CI 371 Computer Organization and Design Unit 3: rithmetic pp pp pp ystem software Mem CPU I/O! little review! Binary + 2s complement! Ripple-carry addition (RC)! Fast integer addition! Carry-select (Ce)! hifters! Integer Multiplication and division! Floating point arithmetic CI371 (Roth/Martin): rithmetic 1 CI371 (Roth/Martin): rithmetic 2 Readings The Importance of Fast ddition! P&H! Chapter 3 + 4 PC Insn Mem Register File s1 s2 d Data Mem T insn-mem T regfile T LU T data-mem T regfile! ddition of two numbers is most common operation! Programs use addition frequently! Loads and stores use addition for address calculation! Branches use addition to test conditions and calculate targets! ll insns use addition to calculate default next PC! Fast addition critical to high performance CI371 (Roth/Martin): rithmetic 3 CI371 (Roth/Martin): rithmetic 4

Review: Binary Integers! Computers represent integers in binary (base2) 3 = 11, 4 = 1, 5 = 11, 3 = 1111 +! Natural since only two values are represented! ddition, etc. take place as usual (carry the 1, etc.) 17 = 11 +5 = 11 22 = 111! ome old machines use decimal (base1) with only /1 3 = 11! Unnatural for digial logic, implementation complicated & slow Fixed Width! On pencil and paper, integers have infinite width! In hardware, integers have fixed width! N bits: 16, 32 or 64! LB is 2, MB is 2 N-1! Range: to 2 N 1! Numbers >2 N represented using multiple fixed-width integers! In software CI371 (Roth/Martin): rithmetic 5 CI371 (Roth/Martin): rithmetic 6 What bout Negative Integers?! ign/magnitude! Unsigned plus one bit for sign 1 = 11, -1 = 111 +! Matches our intuition from by hand decimal arithmetic! Both and! ddition is difficult! Range: (2 N-1 1) to 2 N-1 1! Option II: two s complement (2C)! Leading s mean positive number, leading 1s negative 1 = 11, -1 = 111111 +! One representation for +! Easy addition! Range: (2 N-1 ) to 2 N-1 1 The Tao of 2C! How did 2C come about?! Let s design a representation that makes addition easy! Think of subtracting 1 from by hand! Have to borrow 1s from some imaginary leading 1 = 1-1 = 11-1 = 111111! Now, add the conventional way -1 = 111111 +1 = 11 = 1 CI371 (Roth/Martin): rithmetic 7 CI371 (Roth/Martin): rithmetic 8

till More On 2C! What is the interpretation of 2C?! ame as binary, except MB represents 2 N 1, not 2 N 1! 1 = 111111 = 2 7 +2 6 +2 5 +2 4 +2 2 +2 1 +! Extends to any width! 1 = 1111 = 2 5 +2 4 +2 2 +2 1! Why? 2 N = 2*2 N 1! 2 5 +2 4 +2 2 +2 1 = ( 2 6 +2*2 5 ) 2 5 +2 4 +2 2 +2 1 = 2 6 +2 5 +2 4 +2 2 +2 1! Trick to negating a number quickly: B = B + 1! (1) = (1) +1 = 111+1 = 1111 = 1! ( 1) = (1111) +1 = +1 = 1 = 1! () = () +1 = 1111+1 = =! Think about why this works 1st Grade: Decimal ddition 1 43 +29 72! Repeat N times! dd least significant digits and any overflow from previous add! Carry overflow to next addition! Overflow: any digit other than least significant of sum! hift two addends and sum one digit to the right! um of two N-digit numbers can yield an N+1 digit number CI371 (Roth/Martin): rithmetic 9 CI371 (Roth/Martin): rithmetic 1 Binary ddition: Works the ame Way The Half dder 1 111111 43 = 1111 +29 = 1111 72 = 11! Repeat N times! dd least significant bits and any overflow from previous add! Carry the overflow to next addition! hift two addends and sum one bit to the right! um of two N-bit numbers can yield an N+1 bit number! More steps (smaller base) +! Each one is simpler (adding just 1 and )! o simple we can do it in hardware! How to add two binary integers in hardware?! tart with adding two bits! When all else fails... look at truth table B = C = 1 = 1 1 = 1 1 1 = 1! = ^B! (carry out) = B! This is called a half adder B B H CI371 (Roth/Martin): rithmetic 11 CI371 (Roth/Martin): rithmetic 12

The Other Half! We could chain half adders together, but to do that! Need to incorporate a carry out from previous adder C B = C CI = 1 = 1 1 = 1 CI 1 1 = 1 1 = 1 B F 1 1 = 1 B 1 1 = 1 1 1 1 = 1 1! = C B + C B + C B + CB = C ^ ^ B! = C B + C B + CB + CB = C + CB + B! This is called a full adder CI371 (Roth/Martin): rithmetic 13 Ripple-Carry dder! N-bit ripple-carry adder! N 1-bit full adders chained together! = CI 1, 1 = CI 2, etc.! CI =! N 1 is carry-out of entire adder! N 1 = 1! overflow! Example: 16-bit ripple carry adder! How fast is this?! How fast is an N-bit ripple-carry adder? F B 1 F 1 B 1 2 F 2 B 2 15 F 15 B 15 CI371 (Roth/Martin): rithmetic 14 Quantifying dder Delay! Combinational logic dominated by gate (transistor) delays! rray storage dominated by wire delays! Longest delay or critical path is what matters! Can implement any combinational function in 2 logic levels! 1 level of ND + 1 level of OR (PL)! NOTs are free : push to input (DeMorgan s) or read from latch! Example: delay(fulldder) = 2! d(carryout) = delay(b + C + BC)! d(um) = d( ^ B ^ C) = d(b C + BC + BC + BC) = 2! Note ^ means Xor (just like in C & Java)! Caveat: 2 assumes gates have few (<8?) inputs Ripple-Carry dder Delay! Longest path is to 15 (or 15 )! d( 15 ) = 2 + MX(d( 15 ),d(b 15 ),d(ci 15 ))! d( 15 ) = d(b 15 ) =, d(ci 15 ) = d( 14 )! d( 15 ) = 2 + d( 14 ) = 2 + 2 + d( 13 )! d( 15 ) = 32! D( N 1 ) = 2N! Too slow!! Linear in number of bits! Number of gates is also linear F B 1 F 1 B 1 2 F 2 B 2 15 F 15 B 15 CI371 (Roth/Martin): rithmetic 15 CI371 (Roth/Martin): rithmetic 16

Bad idea: a PL-based dder?! If any function can be expressed as two-level logic! why not use a PL for an entire 8-bit adder?! Not small! pprox. 2 15 ND gates, each with 2 16 inputs! Then, 2 16 OR gates, each with 2 16 inputs! Number of gates exponential in bit width!! Not that fast, either! n ND gate with 65 thousand inputs!= 2-input ND gate! Many-input gates made a tree of, say, 4-input gates! 16-input gates would have at least 8 logic levels! o, at least 16 levels of logic for a 16-bit PL! Even so, delay is still logarithmic in number of bits! There are better (faster, smaller) ways Theme: Hardware!= oftware! Hardware can do things that software fundamentally can t! nd vice versa (of course)! In hardware, it s easier to trade resources for latency! One example of this: speculation! low computation is waiting for some slow input?! Input one of two things?! Compute with both (slow), choose right one later (fast)! Does this make sense in software? Not on a uni-processor! Difference? hardware is parallel, software is sequential CI371 (Roth/Martin): rithmetic 17 CI371 (Roth/Martin): rithmetic 18 Carry-elect dder! Carry-select adder! Do 15-8 +B 15-8 twice, once assuming C 8 ( 7 ) =, once = 1! Choose the correct one when 7 finally becomes available +! Effectively cuts carry chain in half (break critical path)! But adds mux 15- B 15-! Delay? 15-7- 8+ B 7-7- 16 15-8 15-8 8+ B 15-8 1 16 15-8 15-8 8+ B 15-8 15-8 18 CI371 (Roth/Martin): rithmetic 19 Multi-egment Carry-elect dder! Multiple segments! Example: 5, 5, 6 bit = 16 bit! Hardware cost! till mostly linear (~2x)! Compute each segment with and 1 carry-in! erial mux chain! Delay! 5-bit adder (1) + Two muxes (4) = 14 4-5+ B 4-4- 1 9-5 9-5 5+ B 9-5 1 1 8-5 9-5 5+ B 9-5 15-1 15-1 6+ B 15-1 1 15-1 15-1 6+ B 15-1 8-5 CI371 (Roth/Martin): rithmetic 2 12 12 15-1 14

Carry-elect dder Delay! What is carry-select adder delay (two segment)?! d( 15 ) = MX(d( 15-8 ), d( 7- )) + 2! d( 15 ) = MX(2*8, 2*8) + 2 = 18! In general: 2*(N/2) + 2 = N+2 (vs 2N for RC)! What if we cut adder into 4 equal pieces?! Would it be 2*(N/4) + 2 = 1? Not quite! d( 15 ) = MX(d( 15-12 ),d( 11- )) + 2! d( 15 ) = MX(2*4, MX(d( 11-8 ),d( 7- )) + 2) + 2! d( 15 ) = MX(2*4,MX(2*4,MX(d( 7-4 ),d( 3- )) + 2) + 2) + 2! d( 15 ) = MX(2*4,MX(2*4,MX(2*4,2*4) + 2) + 2) + 2! d( 15 ) = 2*4 + 3*2 = 14 nother Option: Carry Lookahead! Is carry-select adder as fast as we can go?! Nope! nother approach to using additional resources! Instead of redundantly computing sums assuming different carries! Use redundancy to compute carries more quickly! This approach is called carry lookahead (CL)! N-bit adder in M equal pieces: 2*(N/M) + (M 1)*2! 16-bit adder in 8 parts: 2*(16/8) + 7*2 = 18 CI371 (Roth/Martin): rithmetic 21 CI371 (Roth/Martin): rithmetic 22 Carry Lookahead dder (CL) dders In Real Processors! Calculate propagate and generate based on, B! Not based on carry in! Combine with tree structure! Prior years: CL covered in great detail! Dozen slides or so! Not this year! Take aways! Tree gives logarithmic delay! Reasonable area C C 1 C 2 C 3 B 1 B 1 2 B 2 3 B 3 G P G 1- G 1 P 1 G 2 P 2 G 3 P 3 P 1- C 1 G 3-2 P 3-2 C 3 G 3- P 3- C 2! Real processors super-optimize their adders! Ten or so different versions of CL! Highly optimized versions of carry-select! Other gate techniques: carry-skip, conditional-sum! ub-gate (transistor) techniques: Manchester carry chain! Combinations of different techniques! lpha 21264 uses CL+Ce+RippleC! Used a different levels! Even more optimizations for incrementers! Why? C 4 C 4 CI371 (Roth/Martin): rithmetic 23 CI371 (Roth/Martin): rithmetic 24

ubtraction: ddition s Tricky Pal! ign/magnitude subtraction is mental reverse addition! 2C subtraction is addition! How to subtract using an adder?! sub B = add -B! Negate B before adding (fast negation trick: B = B + 1)! Isn t a subtraction then a negation and two additions? +! No, an adder can implement +B+1 by setting the carry-in to 1 B 1 CI371 (Roth/Martin): rithmetic 25 CI371 (Roth/Martin): rithmetic 26 ~ 16-bit LU! Build an LU with functions: add/sub, and, or, not,xor! ll of these already in CL adder/subtracter! add B, sub B check! not B is needed for subtraction! and B are first level Gs! or B are first level Ps! xor B?! i = i^b i^c i! What is still missing? B CI371 (Roth/Martin): rithmetic 27 ~ 1 & G P ^ dd -sum ^ hift and Rotation Instructions! Left/right shifts are useful! Fast multiplication/division by small constants (next)! Bit manipulation: extracting and setting individual bits in words! Right shifts! Can be logical (shift in s) or arithmetic (shift in copies of MB) srl 1111, 2 = 11 sra 1111, 2 = 1111! Caveat: sra is not equal to division by 2 of negative numbers! Rotations are less useful! But almost free if shifter is there! MIP and LC4 have only shifts, x86 has shifts and rotations CI371 (Roth/Martin): rithmetic 28

imple hifter! The simplest 16-bit shifter: can only shift left by 1! Implement using wires (no logic!)! lightly more complicated: can shift left by 1 or! Implement using wires and a multiplexor (mux16_2to1) Barrel hifter! What about shifting left by any amount 15?! 16 consecutive left-shift-by-1-or- blocks?! Would take too long (how long?)! Barrel shifter: 4 shift-left-by-x-or- blocks (X = 1,2,4,8)! What is the delay? O <<8 <<4 <<2 <<1 O 15 <<1 O <<1 O shift shift[3] shift[2] shift[1] shift[]! imilar barrel designs for right shifts and rotations CI371 (Roth/Martin): rithmetic 29 CI371 (Roth/Martin): rithmetic 3 3rd Grade: Decimal Multiplication 19 // multiplicand * 12 // multiplier 38 + 19 228 // product! tart with product, repeat steps until no multiplier digits! Multiply multiplicand by least significant multiplier digit! dd to product! hift multiplicand one digit to the left (multiply by 1)! hift multiplier one digit to the right (divide by 1)! Product of N-digit, M-digit numbers may have N+M digits Binary Multiplication: ame Refrain 19 = 111 // multiplicand * 12 = 11 // multiplier = = 76 = 111 152 = 111 = + = 228 = 1111 // product ±! maller base! more steps, each is simpler! Multiply multiplicand by least significant multiplier digit +! or 1! no actual multiplication, add multiplicand or not! dd to total: we know how to do that! hift multiplicand left, multiplier right by one digit CI371 (Roth/Martin): rithmetic 31 CI371 (Roth/Martin): rithmetic 32

oftware Multiplication! Can implement this algorithm in software! Inputs: md (multiplicand) and mr (multiplier) int pd = ; // product int i = ; for (i = ; i < 16 && mr!= ; i++) { if (mr & 1) { pd = pd + md; } md = md << 1; // shift left mr = mr >> 1; // shift right } CI371 (Roth/Martin): rithmetic 33 Hardware Multiply: Iterative Multiplicand Multiplier << 1 >> 1 (32 bit) (16 bit) 32 Product (32 bit) 32+! Control: repeat 16 times! If least significant bit of multiplier is 1! Then add multiplicand to product! hift multiplicand left by 1! hift multiplier right by 1 CI371 (Roth/Martin): rithmetic 34 we lsb==1? Hardware Multiply: Multiple dders C<<4 C<<3 C<<3 P C<<2 C<<2 C<<1 C<<1 C C C<<4 P Consecutive ddition F 3 B 3 2 B 2 1 B 1 F F F F T 3 D 3 T 2 D 2 T 1 D 1 T F F F F 3 2 1 B C B D C D! 2 N-bit RC adders +! 2 + d(add) gate delays! M N-bit RC adders delay! Naïve: O(M*N)! ctual: O(M+N)! Multiply by N bits at a time using N adders! Example: N=5, terms (P=product, C=multiplicand, M=multiplier)! P = (M[]? (C) : ) + (M[1]? (C<<1) : ) + (M[2]? (C<<2) : ) + (M[3]? (C<<3) : ) +! rrange like a tree to reduce gate delay critical path! Delay? N 2 vs N*log N? Not that simple, depends on adder! pprox 2N versus N + log N, with optimization: O(log N) CI371 (Roth/Martin): rithmetic 35 F T 3 3 B 3 D 3 2 B 2 D 2 1 B 1 D 1 B D! M N-bit Carry elect? F F F F! Delay calculation tricky T 2 T 1 T C B C D! Carry ave dder (C) F F F F! 3-to-2 C tree + adder! Delay: O(log M + log N) 3 2 1 CI371 (Roth/Martin): rithmetic 36

Hardware!= oftware: Part Deux! Recall: hardware is parallel, software is sequential! Exploit: evaluate independent sub-expressions in parallel! Example I: = + B + C + D! oftware? 3 steps: (1) 1 = +B, (2) 2 = 1+C, (3) = 2+D +! Hardware? 2 steps: (1) 1 = +B, 2=C+D, (2) = 1+2 side: trength Reduction! trength reduction: compilers will do this (sort of) * 4 = << 2 / 8 = >> 3 * 5 = ( << 2) +! Useful for address calculation: all basic data types are 2 M in size int [1]; &[N] = +(N*sizeof(int)) = +N*4 = +N<<2! Example II: = + B + C! oftware? 2 steps: (1) 1 = +B, (2) = 1+C! Hardware? 2 steps: (1) 1 = +B (2) = 1+C +! ctually hardware can do this in 1.2 steps!! ub-expression parallelism exists below 16-bit addition level CI371 (Roth/Martin): rithmetic 37 CI371 (Roth/Martin): rithmetic 38 4th Grade: Decimal Division 9 // quotient 3 29 // divisor dividend -27 2 // remainder! hift divisor left (multiply by 1) until MB lines up with dividend s! Repeat until remaining dividend (remainder) < divisor! Find largest single digit q such that (q*divisor) < dividend! et LB of quotient to q! ubtract (q*divisor) from dividend! hift quotient left by one digit (multiply by 1)! hift divisor right by one digit (divide by 1) Binary Division 11 = 9 3 29 = 11 1111-24 = - 11 5 = 11-3 = - 11 2 = 1 CI371 (Roth/Martin): rithmetic 39 CI371 (Roth/Martin): rithmetic 4

Binary Division Hardware! ame as decimal division, except (again)! More individual steps (base is smaller) +! Each step is simpler! Find largest bit q such that (q*divisor) < dividend! q = or 1! ubtract (q*divisor) from dividend! q = or 1! no actual multiplication, subtract divisor or not! Complication: largest q such that (q*divisor) < dividend! How do you know if (1*divisor) < dividend?! Human can eyeball this! Computer does not have eyeballs! ubtract and see if result is negative oftware Divide lgorithm! Can implement this algorithm in software! Inputs: dividend and divisor for (int i = ; i < 32; i++) {! ""remainder = (remainder << 1) (dividend >> 31);! ""if (remainder >= divisor) {! """"quotient = (quotient << 1) 1;! """"remainder = remainder - divisor;! ""} else {! quotient = quotient << 1! }! ""dividend = dividend << 1;! }! CI371 (Roth/Martin): rithmetic 41 CI371 (Roth/Martin): rithmetic 42 Divide Example! Input: Divisor = 11, Dividend = 1111 Divider Circuit Divisor tep Remainder Quotient Remainder Dividend 1111 1 1 1 111 2 11 1 11 3 1 1 1 1 4 1 1 1 1 5 11 11 1! Result: Quotient: 11, Remainder: 1 ub >= Remainder msb hift in or 1 Quotient hift in or 1 Dividend hift in! N cycles for n-bit divide CI371 (Roth/Martin): rithmetic 43 CI371 (Roth/Martin): rithmetic 44

rithmetic Latencies! Latency in cycles of common arithmetic operations! ource: oftware Optimization Guide for MD Family 1h Processors, Dec 27! Intel Core 2 chips similar Int 32 Int 64 dd/ubtract 1 1 Multiply 3 5 Divide 14 to 4 23 to 87! Divide is variable latency based on the size of the dividend! Detect number of leading zeros, then divide ummary pp pp pp ystem software Mem CPU I/O! Integer addition! Most timing-critical operation in datapath! Hardware!= software! Exploit sub-addition parallelism! Fast addition! Carry-select: parallelism in sum! Multiplication! Chains and trees of additions! Division! Next up: Floating point CI371 (Roth/Martin): rithmetic 45 CI371 (Roth/Martin): rithmetic 46