Chapter 3. Arithmetic for Computers


Myrtle Wilkerson
5 years ago

COMPUTER ORGANIZATION AND DESIGN: The Hardware/Software Interface, ARM Edition
Chapter 3: Arithmetic for Computers

3.1 Introduction

Arithmetic for Computers
- Operations on integers
  - Addition and subtraction
  - Multiplication and division
  - Dealing with overflow
- Floating-point real numbers
  - Representation and operations

3.2 Addition and Subtraction

Integer Addition
- Overflow occurs if the result is out of range
- Adding a +ve and a -ve operand: no overflow
- Adding two +ve operands: overflow if the result sign is 1
- Adding two -ve operands: overflow if the result sign is 0

Integer Subtraction
- Add the negation of the second operand
- Example: 7 - 6 = 7 + (-6)
    +7: 0000 0000 … 0000 0111
    -6: 1111 1111 … 1111 1010
    +1: 0000 0000 … 0000 0001
- Overflow occurs if the result is out of range
- Subtracting two +ve or two -ve operands: no overflow
- Subtracting a +ve from a -ve operand: overflow if the result sign is 0
- Subtracting a -ve from a +ve operand: overflow if the result sign is 1
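The sign rules listed for integer addition and subtraction can be checked directly in software. The following is a minimal sketch, assuming 64-bit two's-complement operands; the function names `add_overflows` and `sub_overflows` are illustrative and do not come from the slides. The arithmetic is done on unsigned values, where wraparound is well defined in C, and only the sign bits are compared.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a + b does not fit in a signed 64-bit integer.
   Rule from the slides: operands of different sign never overflow;
   operands of the same sign overflow when the result sign differs. */
bool add_overflows(int64_t a, int64_t b) {
    uint64_t sum = (uint64_t)a + (uint64_t)b;   /* wraps modulo 2^64 */
    bool sign_a = a < 0;
    bool sign_b = b < 0;
    bool sign_sum = (sum >> 63) != 0;
    return (sign_a == sign_b) && (sign_sum != sign_a);
}

/* Returns true if a - b does not fit in a signed 64-bit integer.
   Rule from the slides: operands of the same sign never overflow;
   otherwise overflow occurs when the result sign differs from a's. */
bool sub_overflows(int64_t a, int64_t b) {
    uint64_t diff = (uint64_t)a - (uint64_t)b;  /* wraps modulo 2^64 */
    bool sign_a = a < 0;
    bool sign_b = b < 0;
    bool sign_diff = (diff >> 63) != 0;
    return (sign_a != sign_b) && (sign_diff != sign_a);
}
```

For example, `add_overflows(INT64_MAX, 1)` reports overflow because two positive operands produce a result whose sign bit is 1, while `sub_overflows(7, 6)` does not.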

Arithmetic for Multimedia
- Graphics and media processing operates on vectors of 8-bit and 16-bit data
  - Use a 64-bit adder with a partitioned carry chain
  - Operate on 8 × 8-bit, 4 × 16-bit, or 2 × 32-bit vectors
  - SIMD (single-instruction, multiple-data)
- Saturating operations
  - On overflow, the result is the largest representable value
  - c.f. 2s-complement modulo arithmetic
  - E.g., clipping in audio, saturation in video

3.3 Multiplication

Multiplication
- Start with the long-multiplication approach
  (worked example figure: multiplicand, multiplier, product)
- Length of the product is the sum of the operand lengths
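The contrast between saturating and modulo arithmetic described under Arithmetic for Multimedia can be sketched for a single 8-bit lane as follows. The helper names are hypothetical, and the signed version assumes that saturation on the negative side clamps to the most negative value, which the slide does not spell out.

```c
#include <stdint.h>

/* Unsigned 8-bit add that clamps at 255 instead of wrapping. */
uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;   /* computed in a wider type */
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}

/* Signed 8-bit add that clamps to the range [-128, 127]. */
int8_t sat_add_s8(int8_t a, int8_t b) {
    int sum = (int)a + (int)b;                  /* cannot overflow an int */
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}

/* Modulo (wrapping) 8-bit add, shown for comparison. */
uint8_t wrap_add_u8(uint8_t a, uint8_t b) {
    return (uint8_t)((unsigned)a + (unsigned)b);
}
```

With inputs 200 and 100, the saturating add yields 255 (a clipped but plausible pixel or sample value), while the wrapping add yields 44.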

Multiplication Hardware
(figure: multiplication datapath; the product register is initially 0)

Optimized Multiplier
- Perform steps in parallel: add/shift
- One cycle per partial-product addition
  - That's OK if the frequency of multiplications is low
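The add/shift procedure that the multiplication hardware performs can be sketched in C. This is a software model of the first, unoptimized datapath under the assumption of 32-bit unsigned operands and a 64-bit product; the name `shift_add_mul` is illustrative only.

```c
#include <stdint.h>

/* 32 x 32 -> 64-bit unsigned multiply, one multiplier bit per iteration. */
uint64_t shift_add_mul(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = 0;              /* product register starts at 0 */
    uint64_t mcand = multiplicand;     /* 64-bit multiplicand register */

    for (int step = 0; step < 32; step++) {
        if (multiplier & 1u) {
            product += mcand;          /* add partial product if bit is 1 */
        }
        mcand <<= 1;                   /* shift multiplicand left */
        multiplier >>= 1;              /* shift multiplier right */
    }
    return product;
}
```

Each loop iteration corresponds to one cycle of partial-product addition, which is why the slide notes that this cost is acceptable only when multiplications are infrequent.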

Faster Multiplier
- Uses multiple adders
  - Cost/performance tradeoff
- Can be pipelined
  - Several multiplications performed in parallel

LEGv8 Multiplication
- Three multiply instructions:
  - MUL: multiply
    - Gives the lower 64 bits of the product
  - SMULH: signed multiply high
    - Gives the upper 64 bits of the product, assuming the operands are signed
  - UMULH: unsigned multiply high
    - Gives the upper 64 bits of the product, assuming the operands are unsigned
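To make the split between the low and high halves of a 128-bit product concrete, here is a portable C sketch of what MUL, UMULH, and SMULH compute. The function names mirror the instruction mnemonics but are otherwise hypothetical helpers; the final cast in `smulh` assumes the usual two's-complement conversion that mainstream compilers provide.

```c
#include <stdint.h>

/* Lower 64 bits of the product (what MUL returns).
   Unsigned multiplication wraps modulo 2^64. */
uint64_t mul_lo(uint64_t a, uint64_t b) {
    return a * b;
}

/* Upper 64 bits of the unsigned 128-bit product (what UMULH returns),
   built from four 32 x 32 -> 64-bit partial products. */
uint64_t umulh(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* Sum of the terms that land in bits 32..63, to capture the carry. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

/* Upper 64 bits of the signed product (what SMULH returns).
   A negative operand's unsigned view is 2^64 too large, so the other
   operand is subtracted from the high half to compensate. */
int64_t smulh(int64_t a, int64_t b) {
    uint64_t hi = umulh((uint64_t)a, (uint64_t)b);
    if (a < 0) hi -= (uint64_t)b;
    if (b < 0) hi -= (uint64_t)a;
    return (int64_t)hi;
}
```

The low 64 bits are the same for signed and unsigned operands, which is why a single MUL instruction suffices while the high half needs two variants.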

3.4 Division

Division
- Worked long-division example: dividend 1001010 divided by divisor 1000 gives quotient 1001 with remainder 10
- n-bit operands yield an n-bit quotient and remainder
- Check for a 0 divisor
- Long division approach
  - If divisor ≤ dividend bits: 1 bit in quotient, subtract
  - Otherwise: 0 bit in quotient, bring down the next dividend bit
- Restoring division
  - Do the subtract, and if the remainder goes < 0, add the divisor back
- Signed division
  - Divide using absolute values
  - Adjust the sign of the quotient and remainder as required

Division Hardware
(figure: division datapath; the divisor is initially in the left half of its register, and the remainder register initially holds the dividend)
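The restoring-division steps listed above can be sketched in C as follows. This is a software model under the assumption of 32-bit unsigned operands, not a description of the hardware datapath; `restoring_div` and its return convention are hypothetical.

```c
#include <stdint.h>

/* Unsigned restoring division.
   Returns 0 on success, or -1 if the divisor is 0 (the check the
   slides call for), in which case the outputs are left untouched. */
int restoring_div(uint32_t dividend, uint32_t divisor,
                  uint32_t *quotient, uint32_t *remainder) {
    if (divisor == 0) {
        return -1;
    }

    int64_t rem = 0;   /* signed and wide, so a trial subtract can go negative */
    uint32_t q = 0;

    for (int bit = 31; bit >= 0; bit--) {
        /* Bring down the next dividend bit. */
        rem = rem * 2 + (int64_t)((dividend >> bit) & 1u);

        rem -= divisor;                /* trial subtract */
        if (rem < 0) {
            rem += divisor;            /* restore: add the divisor back */
            q = q << 1;                /* quotient bit is 0 */
        } else {
            q = (q << 1) | 1u;         /* quotient bit is 1 */
        }
    }

    *quotient = q;
    *remainder = (uint32_t)rem;
    return 0;
}
```

Calling it with dividend 74 (1001010 in binary) and divisor 8 (1000) produces quotient 9 (1001) and remainder 2 (10), matching the worked example. Signed division could be layered on top by dividing absolute values and fixing up the signs afterwards.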

Optimized Divider
- One cycle per partial-remainder subtraction
- Looks a lot like a multiplier!
  - Same hardware can be used for both

Faster Division
- Can't use parallel hardware as in the multiplier
  - Subtraction is conditional on the sign of the remainder
- Faster dividers (e.g. SRT division) generate multiple quotient bits per step
  - Still require multiple steps

LEGv8 Division
- Two instructions:
  - SDIV (signed)
  - UDIV (unsigned)
- Both instructions ignore overflow and division-by-zero

3.5 Floating Point

Floating Point
- Representation for non-integral numbers
  - Including very small and very large numbers
- Like scientific notation
  - A value may be written in normalized or non-normalized form
- In binary
  - $\pm 1.xxxxxxx_2 \times 2^{yyyy}$
- Types float and double in C

Floating Point Standard
- Defined by IEEE Std 754
- Developed in response to divergence of representations
  - Portability issues for scientific code
- Now almost universally adopted
- Two representations
  - Single precision (32-bit)
  - Double precision (64-bit)

IEEE Floating-Point Format
- Field layout, from most to least significant bit:
    S | Exponent (single: 8 bits, double: 11 bits) | Fraction (single: 23 bits, double: 52 bits)
- $x = (-1)^S \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}$
- S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)
- Normalize significand: 1.0 ≤ |significand| < 2.0
  - Always has a leading pre-binary-point 1 bit, so there is no need to represent it explicitly (hidden bit)
  - Significand is Fraction with the "1." restored
- Exponent: excess representation: actual exponent + Bias
  - Ensures exponent is unsigned
  - Single: Bias = 127; Double: Bias = 1023
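The field layout of the single-precision format can be inspected from C by copying a float's bytes into an integer. The sketch below assumes that `float` is the IEEE 754 32-bit format, which holds on mainstream platforms but is not guaranteed by the C standard; the type `fp32_fields` and function `fp32_decode` are illustrative names.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of a single-precision value. */
typedef struct {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 8 bits, biased by 127 */
    uint32_t fraction;   /* 23 bits, hidden leading 1 not included */
} fp32_fields;

/* Split a float into sign, biased exponent, and fraction. */
fp32_fields fp32_decode(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* well-defined way to view the bits */

    fp32_fields f;
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For instance, decoding `1.0f` gives sign 0, biased exponent 127 (actual exponent 0), and fraction 0, and decoding `-0.75f` gives sign 1, biased exponent 126, and a fraction with only its top bit set.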

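Single-Precision Range
- Exponents 00000000 and 11111111 are reserved
- Smallest value
  - Exponent: 00000001 ⇒ actual exponent = 1 - 127 = -126
  - Fraction: 000…00 ⇒ significand = 1.0
  - $\pm 1.0 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$
- Largest value
  - Exponent: 11111110 ⇒ actual exponent = 254 - 127 = +127
  - Fraction: 111…11 ⇒ significand ≈ 2.0
  - $\pm 2.0 \times 2^{+127} \approx \pm 3.4 \times 10^{+38}$

Double-Precision Range
- Exponents 0000…00 and 1111…11 are reserved
- Smallest value
  - Exponent: 00000000001 ⇒ actual exponent = 1 - 1023 = -1022
  - Fraction: 000…00 ⇒ significand = 1.0
  - $\pm 1.0 \times 2^{-1022} \approx \pm 2.2 \times 10^{-308}$
- Largest value
  - Exponent: 11111111110 ⇒ actual exponent = 2046 - 1023 = +1023
  - Fraction: 111…11 ⇒ significand ≈ 2.0
  - $\pm 2.0 \times 2^{+1023} \approx \pm 1.8 \times 10^{+308}$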

Floating-Point Precision
- Relative precision
  - All fraction bits are significant
  - Single: approx $2^{-23}$
    - Equivalent to $23 \times \log_{10} 2 \approx 6.9$, i.e. roughly 6 to 7 decimal digits of precision
  - Double: approx $2^{-52}$
    - Equivalent to $52 \times \log_{10} 2 \approx 15.7$, i.e. roughly 15 to 16 decimal digits of precision

Floating-Point Example
- Represent … = $(-1)^1 \times … \times 2^{-1}$
  - S = 1
  - Fraction = …
  - Exponent = -1 + Bias
    - Single: -1 + 127 = 126 = 01111110
    - Double: -1 + 1023 = 1022 = 01111111110
- Single: …
- Double: …

Floating-Point Example
- What number is represented by the single-precision float 1 10000001 01000…00?
  - S = 1
  - Fraction = 01000…00
  - Exponent = 10000001 = 129
- $x = (-1)^1 \times (1 + .01_2) \times 2^{(129 - 127)} = (-1) \times 1.25 \times 2^2 = -5.0$

Denormal Numbers
- Exponent = 000...0 ⇒ hidden bit is 0
  - $x = (-1)^S \times (0 + \text{Fraction}) \times 2^{(1 - \text{Bias})}$
- Smaller than normal numbers
  - Allow for gradual underflow, with diminishing precision
- Denormal with fraction = 000...0
  - $x = (-1)^S \times (0 + 0) \times 2^{(1 - \text{Bias})} = \pm 0.0$
  - Two representations of 0.0!
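The decoding example and the denormal rule can both be checked with a small C routine that evaluates a single-precision bit pattern from its fields. This is a sketch under two stated assumptions: denormals follow the IEEE 754 convention of scaling by 2^(1 - Bias), and the reserved exponent 255 (infinities and NaNs) is out of scope. The names `pow2i` and `fp32_value` are hypothetical.

```c
#include <stdint.h>

/* 2 raised to an integer power, computed by repeated doubling or halving.
   Exact in double precision for the exponents used here (-149..127). */
double pow2i(int e) {
    double r = 1.0;
    for (; e > 0; e--) r *= 2.0;
    for (; e < 0; e++) r /= 2.0;
    return r;
}

/* Value of a single-precision bit pattern with exponent field 0..254. */
double fp32_value(uint32_t bits) {
    double sign = (bits >> 31) ? -1.0 : 1.0;
    int exponent = (int)((bits >> 23) & 0xFFu);
    double fraction = (double)(bits & 0x7FFFFFu) / 8388608.0;  /* / 2^23 */

    if (exponent == 0) {
        /* Denormal: hidden bit is 0, exponent fixed at 1 - Bias. */
        return sign * fraction * pow2i(1 - 127);
    }
    /* Normal: hidden bit is 1, exponent is Exponent - Bias. */
    return sign * (1.0 + fraction) * pow2i(exponent - 127);
}
```

The pattern `0xC0A00000` (sign 1, exponent 129, fraction .01 in binary) evaluates to -5.0, and the pattern `0x00000001` gives the smallest positive denormal, 2^-149.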

Infinities and NaNs
- Exponent = 111...1, Fraction = 000...0
  - ±Infinity
  - Can be used in subsequent calculations, avoiding the need for an overflow check
- Exponent = 111...1, Fraction ≠ 000...0
  - Not-a-Number (NaN)
  - Indicates an illegal or undefined result, e.g., 0.0 / 0.0
  - Can be used in subsequent calculations

Floating-Point Addition
- Consider a 4-digit decimal example
  1. Align decimal points: shift the number with the smaller exponent
  2. Add significands
  3. Normalize result and check for over/underflow
  4. Round and renormalize if necessary
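The way infinities and NaNs propagate through later calculations, as described under Infinities and NaNs, can be observed in C. The sketch below assumes IEEE 754 doubles with default (non-trapping) exception handling and no aggressive floating-point optimization flags; the helper names are illustrative, and `volatile` is used only to keep the compiler from folding the operations away.

```c
#include <stdbool.h>

/* NaN is the only value that compares unequal to itself. */
bool is_nan(double x) {
    return x != x;
}

/* Overflow produces +Infinity rather than an error. */
double make_inf(void) {
    volatile double big = 1e308;
    return big * 10.0;
}

/* 0.0 / 0.0 is undefined, so the result is NaN. */
double make_nan(void) {
    volatile double zero = 0.0;
    return zero / zero;
}
```

Infinity stays infinite when finite values are added to it, infinity minus infinity yields NaN, and every ordered comparison involving NaN is false.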

Floating-Point Addition
- Now consider a 4-digit binary example
  1. Align binary points: shift the number with the smaller exponent
  2. Add significands
  3. Normalize result and check for over/underflow (here, no over/underflow)
  4. Round and renormalize if necessary (here, no change)

FP Adder Hardware
- Much more complex than an integer adder
- Doing it in one clock cycle would take too long
  - Much longer than integer operations
  - Slower clock would penalize all instructions
- FP adder usually takes several cycles
  - Can be pipelined

FP Adder Hardware
(figure: FP adder datapath, annotated with Step 1 through Step 4)

Floating-Point Multiplication
- Consider a 4-digit decimal example
  - $1.110 \times 10^{10} \times 9.200 \times 10^{-5}$
  1. Add exponents
     - For biased exponents, subtract the bias from the sum
     - New exponent = 10 + (-5) = 5
  2. Multiply significands
     - $1.110 \times 9.200 = 10.212$ ⇒ $10.212 \times 10^{5}$
  3. Normalize result and check for over/underflow
     - $1.0212 \times 10^{6}$
  4. Round and renormalize if necessary
     - $1.021 \times 10^{6}$
  5. Determine sign of result from signs of operands
     - $+1.021 \times 10^{6}$

Floating-Point Multiplication
- Now consider a 4-digit binary example
  1. Add exponents
     - Unbiased: … = -3
     - Biased: add the biased exponents, then subtract the bias once
  2. Multiply significands
  3. Normalize result and check for over/underflow (no change, with no over/underflow)
  4. Round and renormalize if necessary (no change)
  5. Determine sign: +ve × -ve ⇒ -ve

FP Arithmetic Hardware
- FP multiplier is of similar complexity to FP adder
  - But uses a multiplier for significands instead of an adder
- FP arithmetic hardware usually does
  - Addition, subtraction, multiplication, division, reciprocal, square-root
  - FP ↔ integer conversion
- Operations usually take several cycles
  - Can be pipelined

FP Instructions in LEGv8
- Separate FP registers
  - 32 single-precision: S0, …, S31
  - 32 double-precision: D0, …, D31
  - Sn stored in the lower 32 bits of Dn
- FP instructions operate only on FP registers
  - Programs generally don't do integer ops on FP data, or vice versa
  - More registers with minimal code-size impact
- FP load and store instructions
  - LDURS, LDURD
  - STURS, STURD

FP Instructions in LEGv8
- Single-precision arithmetic
  - FADDS, FSUBS, FMULS, FDIVS
  - e.g., FADDS S2, S4, S6
- Double-precision arithmetic
  - FADDD, FSUBD, FMULD, FDIVD
  - e.g., FADDD D2, D4, D6
- Single- and double-precision comparison
  - FCMPS, FCMPD
  - Sets or clears FP condition-code bits
- Branch on FP condition code true or false
  - B.cond

FP Example: °F to °C
- C code:

    float f2c (float fahr) {
      return ((5.0/9.0)*(fahr - 32.0));
    }

- fahr in S12, result in S0, literals in global memory space
- Compiled LEGv8 code:

    f2c:
      LDURS S16, [X27,const5]    // S16 = 5.0 (5.0 in memory)
      LDURS S18, [X27,const9]    // S18 = 9.0 (9.0 in memory)
      FDIVS S16, S16, S18        // S16 = 5.0 / 9.0
      LDURS S18, [X27,const32]   // S18 = 32.0
      FSUBS S18, S12, S18        // S18 = fahr - 32.0
      FMULS S0, S16, S18         // S0 = (5/9)*(fahr - 32.0)
      BR LR                      // return

FP Example: Array Multiplication
- X = X + Y × Z
  - All 32 × 32 matrices, 64-bit double-precision elements
- C code:

    void mm (double x[][32], double y[][32], double z[][32]) {
      int i, j, k;
      for (i = 0; i != 32; i = i + 1)
        for (j = 0; j != 32; j = j + 1)
          for (k = 0; k != 32; k = k + 1)
            x[i][j] = x[i][j] + y[i][k] * z[k][j];
    }

- Addresses of x, y, z in X0, X1, X2, and i, j, k in X19, X20, X21

FP Example: Array Multiplication
- LEGv8 code:

    mm: ...
        LDI   X10, 32          // X10 = 32 (row size/loop end)
        LDI   X19, 0           // i = 0; initialize 1st for loop
    L1: LDI   X20, 0           // j = 0; restart 2nd for loop
    L2: LDI   X21, 0           // k = 0; restart 3rd for loop
        LSL   X11, X19, 5      // X11 = i * 2^5 (size of row of x)
        ADD   X11, X11, X20    // X11 = i * size(row) + j
        LSL   X11, X11, 3      // X11 = byte offset of [i][j]
        ADD   X11, X0, X11     // X11 = byte address of x[i][j]
        LDURD D4, [X11,#0]     // D4 = 8 bytes of x[i][j]
    L3: LSL   X9, X21, 5       // X9 = k * 2^5 (size of row of z)
        ADD   X9, X9, X20      // X9 = k * size(row) + j
        LSL   X9, X9, 3        // X9 = byte offset of [k][j]
        ADD   X9, X2, X9       // X9 = byte address of z[k][j]
        LDURD D16, [X9,#0]     // D16 = 8 bytes of z[k][j]
        LSL   X9, X19, 5       // X9 = i * 2^5 (size of row of y)
        ADD   X9, X9, X21      // X9 = i * size(row) + k
        LSL   X9, X9, 3        // X9 = byte offset of [i][k]
        ADD   X9, X1, X9       // X9 = byte address of y[i][k]
        LDURD D18, [X9,#0]     // D18 = 8 bytes of y[i][k]
        FMULD D16, D18, D16    // D16 = y[i][k] * z[k][j]
        FADDD D4, D4, D16      // D4 = x[i][j] + y[i][k] * z[k][j]
        ADDI  X21, X21, #1     // k = k + 1
        CMP   X21, X10         // test k vs. 32
        B.LT  L3               // if (k < 32) go to L3
        STURD D4, [X11,#0]     // x[i][j] = D4
        ADDI  X20, X20, #1     // j = j + 1
        CMP   X20, X10         // test j vs. 32
        B.LT  L2               // if (j < 32) go to L2
        ADDI  X19, X19, #1     // i = i + 1
        CMP   X19, X10         // test i vs. 32
        B.LT  L1               // if (i < 32) go to L1

Accurate Arithmetic
- IEEE Std 754 specifies additional rounding control
  - Extra bits of precision (guard, round, sticky)
  - Choice of rounding modes
  - Allows programmer to fine-tune numerical behavior of a computation
- Not all FP units implement all options
  - Most programming languages and FP libraries just use defaults
- Trade-off between hardware complexity, performance, and market requirements

3.6 Parallelism and Computer Arithmetic: Subword Parallelism

Subword Parallelism
- Graphics and audio applications can take advantage of performing simultaneous operations on short vectors
  - Example: 128-bit adder:
    - Sixteen 8-bit adds
    - Eight 16-bit adds
    - Four 32-bit adds
- Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD)

ARMv8 SIMD
- 32 128-bit registers (V0, …, V31)
- Works with integer and FP
- Examples:
  - 16 8-bit integer adds:
    - ADD V1.16B, V2.16B, V3.16B
  - 4 32-bit FP adds:
    - FADD V1.4S, V2.4S, V3.4S

Other ARMv8 Features
- 245 SIMD instructions, including:
  - Square root
  - Fused multiply-add, multiply-subtract
  - Conversion and scalar and vector round-to-integral
  - Structured (strided) vector load/stores
  - Saturating arithmetic

3.7 Real Stuff: Streaming SIMD Extensions and AVX in x86

x86 FP Architecture
- Originally based on the 8087 FP coprocessor
  - 8 × 80-bit extended-precision registers
  - Used as a push-down stack
  - Registers indexed from TOS: ST(0), ST(1), …
- FP values are 32-bit or 64-bit in memory
  - Converted on load/store of memory operand
  - Integer operands can also be converted on load/store
- Very difficult to generate and optimize code
  - Result: poor FP performance

x86 FP Instructions
- Data transfer: FILD mem/ST(i), FISTP mem/ST(i), FLDPI, FLD1, FLDZ
- Arithmetic: FIADDP mem/ST(i), FISUBRP mem/ST(i), FIMULP mem/ST(i), FIDIVRP mem/ST(i), FSQRT, FABS, FRNDINT
- Compare: FICOMP, FIUCOMP, FSTSW AX/mem
- Transcendental: FPATAN, F2XM1, FCOS, FPTAN, FPREM, FSIN, FYL2X
- Optional variations (the I, P, and R letters in the mnemonics above)
  - I: integer operand
  - P: pop operand from stack
  - R: reverse operand order
  - But not all combinations are allowed

Streaming SIMD Extension 2 (SSE2)
- Adds 8 × 128-bit registers
  - Extended to 16 registers in AMD64/EM64T
- Can be used for multiple FP operands
  - 2 × 64-bit double precision
  - 4 × 32-bit single precision
  - Instructions operate on them simultaneously
    - Single-Instruction Multiple-Data

3.8 Going Faster: Subword Parallelism and Matrix Multiply

Matrix Multiply
- Unoptimized code:

    void dgemm (int n, double* A, double* B, double* C)
    {
      for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
          double cij = C[i+j*n];          /* cij = C[i][j] */
          for (int k = 0; k < n; k++)
            cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k]*B[k][j] */
          C[i+j*n] = cij;                 /* C[i][j] = cij */
        }
    }

Matrix Multiply
- x86 assembly code:

    vmovsd (%r10),%xmm0               # Load 1 element of C into %xmm0
    mov    %rsi,%rcx                  # register %rcx = %rsi
    xor    %eax,%eax                  # register %eax = 0
    vmovsd (%rcx),%xmm1               # Load 1 element of B into %xmm1
    add    %r9,%rcx                   # register %rcx = %rcx + %r9
    vmulsd (%r8,%rax,8),%xmm1,%xmm1   # Multiply %xmm1, element of A
    add    $0x1,%rax                  # register %rax = %rax + 1
    cmp    %eax,%edi                  # compare %eax to %edi
    vaddsd %xmm1,%xmm0,%xmm0          # Add %xmm1, %xmm0
    jg     30 <dgemm+0x30>            # jump if %edi > %eax
    add    $0x1,%r11d                 # register %r11 = %r11 + 1
    vmovsd %xmm0,(%r10)               # Store %xmm0 into C element

Matrix Multiply
- Optimized C code:

    #include <x86intrin.h>
    void dgemm (int n, double* A, double* B, double* C)
    {
      for ( int i = 0; i < n; i+=4 )
        for ( int j = 0; j < n; j++ ) {
          __m256d c0 = _mm256_load_pd(C+i+j*n);   /* c0 = C[i][j] */
          for( int k = 0; k < n; k++ )
            c0 = _mm256_add_pd(c0,                /* c0 += A[i][k]*B[k][j] */
                   _mm256_mul_pd(_mm256_load_pd(A+i+k*n),
                                 _mm256_broadcast_sd(B+k+j*n)));
          _mm256_store_pd(C+i+j*n, c0);           /* C[i][j] = c0 */
        }
    }

Matrix Multiply
- Optimized x86 assembly code:

    vmovapd (%r11),%ymm0              # Load 4 elements of C into %ymm0
    mov     %rbx,%rcx                 # register %rcx = %rbx
    xor     %eax,%eax                 # register %eax = 0
    vbroadcastsd (%rax,%r8,1),%ymm1   # Make 4 copies of B element
    add     $0x8,%rax                 # register %rax = %rax + 8
    vmulpd  (%rcx),%ymm1,%ymm1        # Parallel mul %ymm1, 4 A elements
    add     %r9,%rcx                  # register %rcx = %rcx + %r9
    cmp     %r10,%rax                 # compare %r10 to %rax
    vaddpd  %ymm1,%ymm0,%ymm0         # Parallel add %ymm1, %ymm0
    jne     50 <dgemm+0x50>           # jump if %r10 != %rax
    add     $0x1,%esi                 # register %esi = %esi + 1
    vmovapd %ymm0,(%r11)              # Store %ymm0 into 4 C elements

Fallacies and Pitfalls

Right Shift and Division
- Left shift by i places multiplies an integer by $2^i$
- Right shift divides by $2^i$?
  - Only for unsigned integers
- For signed integers
  - Arithmetic right shift: replicate the sign bit
  - e.g., -5 / 4
    - 11111011 >> 2 = 11111110 = -2
    - Rounds toward -∞
  - c.f. 11111011 >>> 2 = 00111110 = +62
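The pitfall in Right Shift and Division can be reproduced in C. Because right-shifting a negative signed value is implementation-defined in C, this sketch models the arithmetic shift with an explicit floor division instead of using `>>` on a negative number; the function names are hypothetical.

```c
#include <stdint.h>

/* C's / operator on signed integers truncates toward zero. */
int32_t div4_trunc(int32_t x) {
    return x / 4;
}

/* Floor division by 4: the result an arithmetic right shift by 2 gives,
   rounding toward negative infinity. */
int32_t div4_floor(int32_t x) {
    int32_t q = x / 4;
    if (x < 0 && x % 4 != 0) {
        q -= 1;                 /* step down past the truncated quotient */
    }
    return q;
}

/* Logical right shift by 2 of an 8-bit pattern: zeros fill from the left. */
uint8_t lsr2_u8(uint8_t bits) {
    return (uint8_t)(bits >> 2);
}
```

For -5, truncating division gives -1 while the shift-style floor gives -2, and the logical shift of the same 8-bit pattern (0xFB) gives 62. A compiler that replaces a signed division with a shift therefore has to add a correction for negative operands.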

Associativity
- Parallel programs may interleave operations in unexpected orders
  - Assumptions of associativity may fail
- Example with x = -1.50E+38, y = 1.50E+38, z = 1.0E+00:
  - (x+y)+z = 0.00E+00 + 1.0E+00 = 1.00E+00
  - x+(y+z) = -1.50E+38 + 1.50E+38 = 0.00E+00
- Need to validate parallel programs under varying degrees of parallelism

Who Cares About FP Accuracy?
- Important for scientific code
  - But for everyday consumer use?
    - "My bank balance is out by …!"
- The Intel Pentium FDIV bug
  - The market expects accuracy
  - See Colwell, The Pentium Chronicles
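The associativity failure described under Associativity can be reproduced in a few lines of C. The sketch assumes IEEE 754 single precision and default compiler settings (no fast-math style reassociation); `sum_left` and `sum_right` are illustrative names, and `volatile` forces each intermediate sum to be rounded to float.

```c
/* (x + y) + z: the large values cancel first, so z survives. */
float sum_left(float x, float y, float z) {
    volatile float t = x + y;
    return t + z;
}

/* x + (y + z): z is absorbed into the large y before the cancellation. */
float sum_right(float x, float y, float z) {
    volatile float t = y + z;
    return x + t;
}
```

With x = -1.5e38f, y = 1.5e38f, and z = 1.0f, the first ordering returns 1.0 and the second returns 0.0, because 1.0 is far below the precision available at magnitude 1.5e38. A parallel reduction that changes the order of additions can therefore change the result.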

3.9 Concluding Remarks

Concluding Remarks
- Bits have no inherent meaning
  - Interpretation depends on the instructions applied
- Computer representations of numbers
  - Finite range and precision
  - Need to account for this in programs

Concluding Remarks
- ISAs support arithmetic
  - Signed and unsigned integers
  - Floating-point approximation to reals
- Bounded range and precision
  - Operations can overflow and underflow
