Using Delayed Addition Techniques to Accelerate Integer and Floating-Point Calculations in Configurable Hardware

Size: px
Start display at page:

Download "Using Delayed Addition Techniques to Accelerate Integer and Floating-Point Calculations in Configurable Hardware"

Transcription

1 Draft submtted for publcaton. Please do not dstrbute Usng Delayed Addton echnques to Accelerate Integer and Floatng-Pont Calculatons n Confgurable Hardware Zhen Luo, Nonmember and Margaret Martonos, Member, IEEE Abstract-- he speed of arthmetc calculatons n confgurable hardware s lmted by carry propagaton even wth the dedcated carry propagaton hardware found n recent FPGAs. hs paper proposes and evaluates an approach called delayed addton that reduces the carrypropagaton bottleneck and mproves the performance of arthmetc calculatons. Our approach employs the dea used n Wallace trees to store the results n an ntermedate form and delay addton untl the end of a repeated calculaton such as accumulaton or dot-product; ths effectvely removes carry propagaton overhead from the calculaton s crtcal path. We present both nteger and floatng-pont desgns that use our technque. Our ppelned nteger multply-accumulate (MAC) desgn s based on a farly tradtonal multpler desgn, but wth delayed addton as well. hs desgn acheves a 66MHz clock rate on an XC4036XL-2 FPGA. Next, we present a 32-bt floatng-pont accumulator based on delayed addton. Here delayed addton requres a novel algnment technque that decouples the ncomng operands from the accumulated result. A conservatve verson of ths desgn acheves a 33 MHz clock rate. We also present a 32-bt floatng-pont accumulator desgn wth compler-managed overflow avodance that acheves a 66MHz clock rate. Fnally, we present an applcaton of delayed addton technques to solve a system of lnear equatons usng conjugate gradent method. hese desgns and applcatons demonstrate the utlty of delayed addton for acceleratng FPGA calculatons n both the nteger and floatng-pont domans. Index erms-- FPGA, delayed addton, Wallace tree, multply-accumulate (MAC) I. INRODUCION When an arthmetc calculaton s carred out n a RISC mcroprocessor, each nstructon typcally has two source operands and one result. In many computatons, however, the result of one arthmetc nstructon s just an ntermedate result n a long seres of calculatons. For example, dot product and other long summatons use a long seres of nteger or floatng-pont operatons to compute a fnal result. Whle FPGA desgns often suffer from much slower clock rates than custom VLSI, confgurable hardware allows us to make specalzed hardware for these he authors are wth the Electrcal Engneerng Department of Prnceton Unversty, Prnceton, NJ 08544, USA. cases; wth ths, we can optmze the ppelnng characterstcs for the partcular computaton. A typcal multpler n a full-custom ntegrated crcut has three stages. Frst, t uses Booth encodng to generate the partal products. Second, t uses one or more levels of Wallace tree compresson to reduce the number of partal products to two. hrd, t uses a fnal adder to add these two numbers and get the result. For such a multpler, the thrd stage, performng the fnal add, generally takes about onethrd of the total multplcaton tme [8, 9]. If mplemented usng FPGAs, stage 3 could become an even greater bottleneck because of the carry propagaton problem. It s hard to apply fast adder technques to speed up carry propagaton wthn the constrants of current FPGAs. In Xlnx 4000-seres chps, for example, the fastest 16-bt adder possble s the hardwred rpple-carry adder [19]. he mnmum delay of such an adder (n a -2 speed grade XC4000xl part) s more than four tmes the delay of an SRAM-based, 4-nput look-up table that forms the core of the confgurable logc blocks. Snce ths carry propagaton s such a bottleneck, t mpedes ppelnng long seres of addtons or multples n confgurable hardware; the carrypropagaton les along the crtcal path, t determnes the ppelned clock rate for the whole computaton. Our work removes ths bottleneck from the crtcal path so that stages 1 and 2 can run at full speed. hs mproves the performance of nner products and other seres calculatons. As an example, consder the summaton C of a vector A: 99 C = A[ ] = 0 Our goal s to accumulate the elements of A wthout payng the prce of 99 seralzed addtons. We observe that n tradtonal multpler desgns (e.g., the multply unts of most recent mcroprocessors [15, 16]), Wallace trees are used to accumulate the partal products. Our work proposes and evaluates ways n whch smlar technques can be used to replace tme-consumng addtons n seres calculatons wth Wallace tree compresson. he technque s applcable to confgurable hardware, because n a dynamcally confgurable system t s practcal to consder buldng specfc hardware for nner products or other repeated calculatons. he technque s effectve for confgurable hardware because t removes addton s carry 1

2 Draft submtted for publcaton. Please do not dstrbute propagaton logc from the crtcal path of these calculatons, thus allowng them to be ppelned at much faster clock rates. By usng Wallace trees to accumulate results wthout carry propagaton overhead, we can greatly accelerate both nteger and floatng-pont calculatons. We demonstrate our deas on three desgns. he frst desgn s an nteger unt that performs ppelned sequences of MAC (multplyaccumulate) operatons; ths ppelned desgn operates at a 37MHz clock rate. he second and thrd desgns perform floatng-pont accumulatons (.e., repeated addtons) on 32-bt IEEE sngle-precson format numbers. One of them uses a conservatve stall technque to respond to possble overflows; t operates at 33MHz. he other sgn reles on compler assstance to avod overflows by breakng calculatons nto, chunks of no more than 512 summaton elements at a tme. hs approach yelds a 66MHz clock rate for 32-bt IEEE sngle-precson summatons. hese clock rates ndcate the sgnfcant promse of ths approach n mplementng hgh-speed ppelned computatons on FPGA-based systems. We demonstrate ths promse on an example applcaton: solvng lnear equatons usng conjugate gradent method. he remander of ths paper s structured as follows. Secton II ntroduces the basc dea of Delayed Addton calculaton, and presents a desgn for a ppelned nteger multply-accumulate unt based on ths approach. Secton III moves nto the floatng-pont doman, presentng a desgn of a ppelned 33MHz 32-bt floatng-pont accumulator wth delayed addton. Buldng on ths basc desgn, Secton IV then presents the 66MHz floatng-pont accumulator wth compler-managed overflow avodance, whch s used to form the MAC unt n our applcaton to solve lnear equatons usng conjugate gradent method n secton V. Secton VI dscusses ssues of roundng and error theory related to these desgns, Secton VII presents related work, and Secton VIII provdes our conclusons. II. DELAYED ADDIION IN A PIPELINED INEGER MULIPLY-ACCUMULAOR A. Overvew A multply-accumulator unt conssts of a multpler and an adder. For adders of 16 bts or less mplemented n Xlnx FPGAs, the hardwred rpple-carry adder s the fastest. For adders more than 16 bts long, a carry-select adder s a good choce for fast addton n FPGA. It uses rpple-carry adders as basc elements and a few multplexers to choose the result. hus t can stll utlze the hardwred rpple-carry logc on Xlnx FPGA to acheve relatvely hgh speed. Most of the multplers that have been mplemented so far n FPGAs are based on bt-seral multplers [2, 14]. hs s because bt-seral multplers take much less area than any other knd of multplers. Snce they have a regular layout, a1[n] a2[n] a3[n] s[n] c[n] a1[n] a2[n] a3[n] a4[n] cout cn s[n] c[n] t s easy to map on a FPGA to acheve very hgh clock rate. However, bt-seral multpler requres a very long latency to produce a result. For two multplcands of M and N bts long, t takes M+N clock cycles to get the product [7]. Although some mplementatons have tred to releve ths problem by multplyng more than one bt per cycle [2], we know of no such mplementatons wth an overall throughput of more than 10MHz. Bt-array multplers also have a regular layout, whch makes t easy to map on FPGA and to acheve hgh clock rates [3]. Unlke bt-seral multplers, they produce one product every cycle. hus they can acheve a very hgh throughput at the prce of large area cost. In the case of a 32-bt nteger MAC wth a 64-bt fnal result, we would expect to have a bt-array multpler of 63 ppelne stages for multplcaton and one for accumulaton. hus we would need a CLB matrx to mplement t [3]. However, CLB matrx of ths sze can barely ft nto the largest Xlnx part avalable (XC40125XV) now (as of Sept. 1998) [20], whch would nvolve a huge cost. Our desgn, as we wll see next, has comparable performance to bt-array multpler for vector MAC and s much more area effcent. B. Background on Wallace rees a1[1] a2[1] a3[1] s[1] c[1] Fg 1. (a) An array of n 3-2 adders. Fg 1. (b) An array of n 4-2 adders. Before contnung on detaled desgns, we wll frst gve a bref revew on some bascs of Wallace tree [10] and ts dervatves [11]. One level of Wallace tree s composed of arrays of 3-2 adders (or compressors). he logc of a 3-2 adder s the same as a full adder except the carry-out from the prevous bt has now become an external nput. For each bt of a 3-2 adder, the logc s: S[] = A1[] A2[] A3[]; C[] = A1[]A2[]+ A2[]A3[] + A3[]A1[]; For the whole array, S+2C = A1 + A1 +A3 a1[1] a2[1] a3[1] a4[1] cout cn s[1] c[1] S and C are partal results that we refer to n ths paper as the pseudo-sum. hey can be combned durng a fnal 2

3 Draft submtted for publcaton. Please do not dstrbute Booth encoders Wallace tree level 1 Wallace tree level 2 Wallace tree level 3 Wallace tree level 4 B B B B B B B B B B B B B B B B B 4-2 adder array 4-2 adder array 4-2 adder array 4-2 adder array FF 4-2 adder array 4-2 adder array FF 4-2 adder array FF 3-2 adder array Compressor 4-2 adder array Fnal Adder (not ppelned) 64-bt adder Fg. 2 Integer MAC wth delayed addton. cycle # Combnatonal Input 1 BH W1 W2 W3 W4 CPR Input 2 BH W1 W2 W3 W4 CPR Input 3 BH W1 W2 W3 W4 CPR Input 4 BH W1 W2 W3 W4 CPR Input 5 BH W1 W2 W3 W4 CPR Input 6 BH W1 W2 W3 W4 CPR Fnal Addton Fg. 3 5 Ppelne dagram of Integer MAC: A [ ] B[ ]. he stages marked: BH (Booth encoders), W1 (Wallace tree level 1), W2 = 0 (Wallace tree level 2), W3 (Wallace tree level 3), W4 (Wallace tree level 4) and CPR (Compressor) refer to the sx ppelne stages shown n Fgure 2. he fnal addton s performed only once per summaton and does not mpact the ppelned clock rate. addton phase to compute a true sum. he total number of nputs across an entre level of a 3-2 adder array s the same as the bt-wdth of the nputs. Fg. 1 (a) shows the layout of such an array example. In some Wallace tree desgns, 4-2 adder arrays have also been used, because they reduce the number of compressor levels requred [11]. Each bt of such an array s composed of a 4-2 adder. he typcal logc s: C out [] = A1[]A2[] + A2[]A3[] + A3[]A1[] ; S[] =A1[] A2[] A3[] A4[] C n [] ; C[] = (A1[] A2[] A3[] A4[])C n [] + (A1[] A2[] A3[] A4[])A4[] ; For the whole array, S + 2C = A1 + A2 + A3 + A4 Fg. 1 (b) shows the layout of an array example usng 4-2 adders. At frst glance, one mght ntally thnk that C n and C out are smlar to the carry-n and carry-out n the rpplecarry adders. he key dfference, however, s that C n does not propagate to C out. he crtcal path of an array of 3-2 or 4-2 adders s n the vertcal, not horzontal drecton. Furthermore, the logc shown maps well to coarse-graned FPGAs. Wth Xlnx 4000-seres parts, we can ft each S or C, for ether a 3-2 or 4-2 adder, nto a sngle CLB usng the F, G, and H functon generators. C. Desgn of Integer MAC wth Delayed Addton For an nteger MAC unt, the mplementaton s straghtforward because ntegers are fxed-pont and are therefore algned. Our desgn looks exactly lke a tradtonal multpler desgn wth Booth encodng and Wallace tree except that a 4-2 adder array s nserted nto the ppelne before the fnal addton. o acheve accumulaton, we repeatedly execute: Pseudo-sum = Pseudo-sum + (the fnal two partal products for each multplcaton) Recall that pseudo-sum refers to the S and C values currently beng computed by a 3-2 or 4-2 adder array, awatng the fnal addton that wll calculate the true result. Fg. 2 shows a block dagram of our mplementaton. 3

4 Draft submtted for publcaton. Please do not dstrbute Desgns Xlnx part number CLB matrx sze CLBs used Flp-flops used Ppelne stages Speed (MHz) Fnal Addton Delay (ns) IMAC (delayed addton) XC4036xlhq radtonal Integer multpler XC4036xlhq N/A able 1: Synthess results for ppelned nteger MAC wth delayed addton and ppelned adder. Each level of a Wallace tree has a smlar delay, and ths delay s also smlar to that of a Booth encoder. hus, as shown n Fg. 2, a natural way to ppelne ths desgn s to let each level of logc (above the dotted lne) be one of the ppelne stages. he well-matched delays make for a very effcent ppelned mplementaton. he fnal compressor, just above the dotted lne, stores and updates the pseudosum every cycle. When the repeated summaton s complete, a fnal add (not part of the ppelne) converts ths ntermedate form to a true sum result. he pseudo-sum s updated each cycle, but the fnal adder s only used when the full accumulaton s done. herefore, t s not one of the ppelne stages, but rather consttutes a post-processng step as shown n Fgure 3. Wth ths structure, the carry propagaton tme for the fnal addton s no longer on the crtcal path that determnes the clock rate of the ppelned MAC desgn. For suffcently long vectors, ths fnal addton tme, done only once per entre summaton rather than once per element, wll be neglgble even compared to the faster vector MAC calculatons of ths desgn. D. Desgn Synthess Results For all the desgns n ths paper, we used the Synopsys fpga_analyzer tool ( ) to generate a.sxnf fle from our VHDL nput and we used Xlnx Foundaton tools (V1.3) for the rest of the synthess. In order to remove the bottleneck at the pad nputs, we added an extra ppelne stage before the booth encoder to buffer the chp nputs. After several ntal attempts, we targeted our desgn at the speed of 66.7 MHz and specfed ths nformaton n the tmng constrant fle (.pcf fle), where we lsted ths requrement for all the crtcal paths. he PAR (placement and routng) worked through successfully and mng Analyzer gves all the tmng nformaton after our desgn s completely placed and routed. he synthess results we get for the above desgn are lsted n able 1. Most notably, our desgn fts n a Xlnx 4036 part and acheves the targetted clock rate of 66.7 MHz. he fnal addton delay, done once per vector as a post-processng step, takes roughly 40ns or nearly 3 cycles. o demonstrate the advantage of delayed addton, we also tred to mplement an nteger MAC composed of a tradtonal nteger multpler and an adder. However, the desgn was too bg to ft on one XC4036 chp, so we bult an nteger multpler on the chp nstead. o trade off between ppelne speed and area cost, we used the carryselect adder for fnal addton and dvded t nto two ppelne stages. In the frst stage, three 32-bt addtons are carred out n parallel, one for the lower 32 bts and two for the upper 32 bts. In the second stage, we select the upper 32 bts between the two results by the carryout from the lower 32-bt addton. he synthess result of ths desgn s also lsted n able 1. mng analyss shows t s exactly the two ppelne stages of the adder that are the bottleneck of the whole desgn. However, further ppelnng the adder wll nvolve a much larger area cost and s not lkely to gve any performance gan due to the long wrng delays n FPGA. From table 1, we can see that by usng the delayed addton algorthm, we have acheved a faster ppelne speed than the tradtonal multpler and accumulator desgn. Accordng to the data above, an IMAC wth the delayed addton would requre 7 + (N-1) + 3 = N + 9 cycles for an nteger nner product of length N to complete, where 7 stands for the number of ppelne stages, 3 stands for the cycle tme for the fnal addton. he overall latency for ths desgn would thus be 15ns (N + 9) = (15N + 135)ns. Snce we could only mplement the multpler on one XC4036 chp n the tradtonal desgn, we have to add another two ppelne stages for the accumulator n our calculaton. hus the overall latency for an nner product of sze N usng tradtonal IMAC would be 10 + (N-1) = N + 9 cycles as well. Snce the cycle tme n the delayed addton desgn s 20% shorter than the tradtonal desgn, the delayed addton desgn has a performance speedup of 120%. III. USING DELAYED ADDIION IN A FLOAING-POIN ACCUMULAOR Multply and accumulaton also appears frequently n floatng-pont applcatons. For example, of the 24 Lvermore Loops, 5 loops (loop 3, 4, 6, 9, 21) are bascally long vector nner-product-lke computaton [17]. In certan applcatons, such as the conjugate gradent example n Secton V, multply and accumulaton domnates the whole computaton process. hus t would be deal f we could also use our delayed addton technques to buld a floatngpont multply and accumulator to speed up ths knd of computatons lke what we dd n the nteger case. However, a floatng-pont MAC unt uses too much area to ft on a sngle FPGA chp. he major reason s that floatng- 4

5 Draft submtted for publcaton. Please do not dstrbute s exponent fracton Fg. 4 IEEE sngle precson format. S s the sgn; exponent s based by 127. If exponent s not 0 (normalzed), mantssa = 1.fracton If exponent s 0 (denormalzed), mantssa = 0.fracton pont accumulaton s a much more complex process than the nteger case, as explaned below. Rather than a MAC unt, we nstead focus here on a floatng-pont accumulator usng delayed addton. We frst gve a bref revew of tradtonal approaches, then descrbe how we have used delayed addton technques to optmze performance. A. radtonal Sngle-Precson Addton Algorthm As shown n Fg.4, a tradtonal floatng-pont adder would frst extract the 1-bt sgn, 8-bt exponent and 23-bt fracton of each ncomng number from the IEEE 754 sngle precson format. By checkng the exponent, the adder determnes f each ncomng number s denormalzed. If the exponent bts are all 0, whch means the number s denormalzed, the mantssa s 0.fracton, otherwse, mantssa s 1.fracton. Next, the adder compares the exponents of the two numbers and shfts the mantssa of the smaller number to get them algned. Sgn-adjustments also occur at ths pont f ether of the ncomng numbers s negatve. Next, t adds the two mantssas; the result needs another sgn-adjustment f t s negatve. Fnally the adder re-normalzes the sum, adjusts the exponent accordngly and truncates the resultng mantssa nto 24 bts by the approprate roundng scheme [2]. he above algorthm s desgned for a sngle addton rather than a seres of addtons. Even more so than n the nteger case, ths straghtforward approach s dffcult to ppelne. One problem les n the fact that the ncomng nextelement-to-be-summed must be algned wth the current accumulated result. hs adds a challenge to our delayed addton technque snce we do not keep the accumulated result n ts fnal form, and thus cannot algn ncomng addends to t. Lkewse, at the end of the computaton, renormalzaton also mpedes a delayed addton approach. For these two problems, we have come up wth two solutons: 1. Mnmze the nteracton between the ncomng number and the accumulated result. o acheve ths, we self-algn the ncomng number on each cycle, rather than algnng t to the Pseudo-sum. Secton 1) wll descrbe self-algnment n more detal. 2. Use the delayed addton for accumulaton only. Postpone roundng and normalzaton untl the end of the entre accumulaton. hs approach s also used when mplementng MAC n some full-custom IC floatng-pont unts [12]. B. Our Delayed Addton Floatng-Pont Accumulaton Algorthm hs secton descrbes our approach for delayed addton accumulaton n floatng-pont numbers. Smlar to what we dd n Integer MAC, we repeatedly execute pseudo-sum = pseudo-sum + ncomng operand. Each ncomng operand s an IEEE sngle-precson floatng-pont number, wth 1- bt sgn, 8-bt exponent (EXP[7-0]) and 23-bt fracton. We consder the exponent bts as three subfelds: hgh-order exponent, a decson bt and low-order exponent for smplcty of dscusson. Hgh-order exponent refers to the EXP[7-6], the decson bt s EXP[5] and low-order exponent refers to EXP[4-0]. We take dfferent actons accordng to the value of these three felds. Lke the tradtonal adder, our desgn frst extends the 23-bt fracton nto 24-bt mantssa. However, we choose not to algn the ncomng operand and the current pseudo-sum drectly because that way the ncomng operand nteracts wth the accumulated pseudo-sum throughout the algnment process, whch makes the further ppelnng mpossble. hus the algnment process could easly become the bottleneck of the whole ppelne f we stll adopt the tradtonal algnment method. Instead, we keep summary nformaton about the hgh-order exponent of the accumulated result, and algn ts mantssa to a fxed boundary accordng the ts low-order exponent. We refer to ths technque as "self-algnment" and descrbe t below. 1) Self-Algnng Incomng Operands here are two ways to algn two floatng-pont numbers. he common way s to shft the mantssa of one number by d bts, where d stands for the dfference of the exponents of the two numbers. Another way s to nstead shft both mantssas to some common boundares. radtonal floatng-pont adders use frst method. In our case, however, the second way s used snce we would lke to mnmze the nteractons of the ncomng number and the accumulated pseudo-sum. We could have fully unrolled the ncomng operand and the accumulated pseudo-sum by left-shftng ther mantssas the number of bts denoted by ther exponents except for the huge area cost nvolved. In that case, the shfted mantssa would be as long as 255(the largest 8-bt exponent possble) + 24(the wdth of sngle precson mantssa) = 279 bts. Because of ths, we only left-shft the mantssa the number of bts denoted by the low-order exponent (EXP[4-0]) n our desgn. Snce low-order exponent s a 5-bt quantty, the largest decmal t can express s 31. hus, by left-shftng to account for low-order bts, we have extended the wdth of our mantssa to 55 bts. Although ths s stll wde, our desgn can ft nto 910 CLBs on a Xlnx 4036 as we wll see later, and ths gves us the ablty to garner truly hgh- 5

6 Draft submtted for publcaton. Please do not dstrbute 55-bt regster for self-algned Incomng number From 64-bt adder adder array adder array 1 GS 0 MUX 1 0 MUX 1 0 MUX 1 P 64-bt regster for pseudo-sum o 64-bt adder 0 E GS s the stall sgnal generated by the control unt. P ndcates f the top 2 bts of the ncomng number s exponent s bgger than that of the accumulated S and C or not. E ndcates f the top 2 bts of the exponent of accumulated S and C s bgger than that of the ncomng number or not, E and other control sgnals together serve as the clock enable sgnal for flp-flops S and C. Fg. 5 Compressor desgn (compressor performance sngle-precson floatng-pont from an FPGAbased desgn. In the above self-algnng process, we dd not take nto consderaton of the hgh-order exponent (EXP[7-6]) and decson bt(exp[5]). hus the shfted mantssa of the ncomng operand s stll not perfectly algned wth that of the current pseudo-sum. We used the fact below to solve ths remanng problem. In sngle-precson IEEE floatngpont, the mantssa s only 24 bts wde. hus, f we try to add two orgnally normalzed numbers that dffer by more than 2 24 tmes, algnment wll cause the smaller of the two numbers to be "rght-shfted" out of the expressble range for ths format. For example, = 2 26 n sngleprecson calculatons. Our algorthm effcently uses ths fact to dentfy the smlar cases and handles them approprately. Once self-algned, the ncomng number can be thought of as m (the 55-bt mantssa) 2 64 EXP[7-5]. Meanwhle, our pseudo-sum s stored as mp (the 64-bt mantssa) 2 64 EXP[7-5]. If the current pseudo-sum and the ncomng operand are dentcal n decson bt (EXP[5]), then f the hghorder exponent (EXP[7-6]) of the ncomng number s bgger than that of the pseudo-sum, the mantssa of the pseudo-sum wll be shfted out of the expressble range as long as t s no more than 64 bt wde. In ths case, we smply replace the current pseudo-sum by the ncomng operand. On the other hand, f the hgh-order exponent of the ncomng number s smaller than that of the pseudosum, the ncomng number wll be shfted out of the expressble range snce m s less than 64 bt wde. hus we smply gnore the ncomng operand. he compresson wll only take place when hgh-order exponent of the pseudo-sum s equal to that of the ncomng number. Note that f the current pseudo-sum and the ncomng operand are not dentcal n EXP[5], then determnng the approprate response would actually requre subtractng the two full exponents to determne by how much they dffer. hs would pose a bottleneck n the ppelne; thus we hope to avod ths scenaro entrely. hs leads to our desgn descrbed n Secton 2) below. 2) Compressor Implementaton Detals In order to avod the undesrable scenaro of unequal decson bts, we actually keep two runnng pseudo-sums. One compressor, referred to as compressor-0, takes care of ncomng operands whose decson bt s 0 and the other compressor (compressor-1) handles those whch has a decson bt of 1. We smply shunt each ncomng operand to the approprate compressor as shown n Fgure 6. In ths way, we can always take operatons correspondng to the hgh-order exponent as descrbed above. he two pseudo-sums from compressor-0 and compressor-1 are both added together durng the fnal add stage as a postprocessng step followng the ppelned computaton. Fgure 5 shows the desgn layout for one of the two compressor unts n the desgn, namely compressor-0. Compressor-1 has essentally dentcal structure, except that t cross-connects wth adder-0 as shown n Fgure 6. he runnng pseudo-sum s stored as the Wallace tree's S and C partal results n the 64-bt regsters shown. Were t not for the possblty of ether pseudo-sum overflowng, the desgn would now be complete. Snce the accumulated result may exceed the regster capacty, we have also devsed a technque for recognzng and respondng to potental pseudo-sum overflows. Snce we are not dong the full carry-propagaton of a tradtonal adder, we cannot use the tradtonal overflow-detecton technque of comparng carry-n and carry-out at the hghest bt. In fact, wthout performng the fnal add to convert the pseudo-sum to the true sum, t s mpossble to precsely know a pror when overflows wll occur. Our approach nstead reles on conservatvely determnng whenever an overflow mght occur, and then stallng the ppelne to respond. We can conservatvely detect possble overflow stuatons by examnng the top three bts of the S 6

7 Draft submtted for publcaton. Please do not dstrbute Ppelne Stage 1 Read n and generate 24-bt mantssa Ppelned Porton Ppelne Stage 2 Shft by EXP[2] - EXP[0] Ppelne Stage 3 Shft by EXP[4]-EXP[3]. Complement the mantssa f s = 1 Ppelne Stage 4 Stall response Fnal Addton and Normalzaton (Not ppelned) Compressor 0 (adder not ncluded) Compressor 1 (adder not ncluded) 64-bt adder 0 64-bt adder 1 Algn the numbers from adder 0 and adder1 Fnal 64-bt adder Normalze the result Overflow Detecton & Handlng GlobalStall Combnatonal logc (Not ppelned) Fg. 6 Floatng-pont accumulator ppelnng scheme Xlnx part number CLB matrx CLB Flpflops Ppelne Speed Fnal Add and sze used stages (MHz) Norm. Delay (ns.) XC4036xlhq able 2: Synthess results for floatng-pont accumulaton wth delayed addton and C portons of the pseudo-sum and the sgn bt from the 55-bt ncomng operand. We have used espresso to form a mnmzed truth table generatng the GlobalStall sgnal (GS n Fg. 4) as a Boolean functon of these 7 bts. As shown n Fg. 5, the GlobalStall sgnal s used as the clock enable sgnal on the frst three ppelne stages; when t s asserted, the ppelne stalls and no new operands are processed untl we respond to the possble overflow. Snce the desgn's two compressors are summng dfferent numbers, they wll of course approach overflow at dfferent tmes snce only one number s added a tme. Our desgn, however, does overflow processng n both compressors whenever ether compressor's GlobalStall sgnal s asserted. hs coordnated effort avods cases where overflow handlng n one compressor s mmedately followed by an overflow n the other compressor and t potentally reduces the number of stalls needed, too, snce we process these two pseudo-sums n parallel durng the stall. When a stall occurs, our response s to sum the S and C portons of each compressors' pseudo-sum usng the 64-bt adders shown n the Stall Response box n Fgure 5. hs s a tradtonal 64-bt addton ncurrng a sgnfcant carry propagaton delay, but snce t occurs durng the stall-tme, t does not le on the crtcal path that determnes the desgn's ppelned clock rate. (As long as stalls are nfrequent, t does not notceably mpact performance.) he decson of what to do wth the newly formed sum depends on ts value,.e., t depends on whether () an overflow truly occurred or () we were overly conservatve n our stall detecton. In cases where an overflow does occur, the value of EXP[5] n the pseudo-sum wll change. Recall that compressor-0 s to handle the accumulaton of ncomng operands whose EXP[5] bt s 0, wth a pseudo-sum whose EXP[5] bt s also 0. If the pseudo-sum overflow causes EXP[5] to change value, then we need to pass the newlycomputed full sum over to the other compressor. hs s why the desgn n Fg. 6 ncludes the cross-coupled connectons of adder-1 to compressor-0 and vce versa. When we are overly conservatve n predctng a stall, EXP[5] wll not change values. In ths case, we retan the pseudo-sum n ts current form. C. Expermental Results Fgure 6 shows the block dagram of ths desgn and able 2 summarzes the synthess results. Because ths s a 7

8 Draft submtted for publcaton. Please do not dstrbute Ppelne Stage 1 Read n and generate 24-bt mantssa Ppelned Porton Ppelne Stage 2 Shft by exp[1] and exp[0] Ppelne Stage 3 Shft by exp[3] and exp[2] Ppelne Stage 4 Shft by exp[4]. Complement the mantssa f s = 1 Ppelne Stage 5 Compressor 0 (adder not ncluded) Compressor 1 (adder not ncluded) Algn the S and C from adder 0 and adder1 Fnal addton and normalzaton 4-2 compressor array take S and C from both compressors Fnal 64-bt adder Normalze the result Combnatonal logc (not ppelned) Fg. 7 Floatng-pont accumulator wth compler-managed overflow avodance ppelnng scheme floatng-pont accumulator rather than a MAC unt, t s actually smaller than the nteger MAC unt dscussed n the prevous secton. Usng 4 ppelne stages, our desgn attans a clock rate of 33MHz. Because of extra bookkeepng requred to renormalze the fnal result, the post-processng delay n ths desgn s larger. At 72.4ns, ths delay corresponds to roughly 2.4 of the ppelned clock cycles. As n the nteger case, ths dfference between the fnal add tme and the ppelned clock cycle tme hghlghts the utlty of delayed addton. By pullng ths delay off the vector computaton s crtcal path, we pay for t only once per vector, not on each clock cycle. Accordng to the data above, usng the same analyss as n secton III and assumng there s no stall durng the computaton, we wll have to wat 4 + (N-1) + 3 = N + 6 cycles for an accumulaton of length N to complete. Snce each stall causes a 3-cycle bubble n the ppelne and too many stalls may eventually ncur expensve system nterrupt, we also want to make sure how frequent stalls mght be when we accumulate N numbers. We dd two smulatons. Smulaton I used 100,000 unformly dstrbuted floatng-pont numbers wth ther absolute values rangng from 2-31 to Because postves and negatves are balanced, we dd not even meet one case of stallng. Smulaton II uses 100,000 unformly dstrbuted postve floatng-pont numbers rangng from 0 to 2 31, and we only found 24 cases of stallng. Summng these 100,000 numbers would need 100,006 cycles, so that 72 stall cycles are neglgble. From ths experment we conclude that overflow and stallng pose lttle problem for most applcatons, as long as we have a reasonably large local buffer for operands. As we wll show and explot n Secton IV, we can prove that for vectors shorter than 512 elements, there s no chance of stallng at all. IV. FLOAING-POIN ACCUMULAOR WIH COMPILER- MANAGED OVERFLOW AVOIDANCE he man reason why we have the overflow detecton and handlng logc n the prevous desgn s because of the possble overflow of the pseudo-sum after a number of operatons. However, the stall-related logc s very complcated and has a bg area cost. Worst of all, t sts on the crtcal path of our desgn and slows down the ppelne speed. o avod the area and speed overhead due to overflow detecton and handlng, we present a dfferent style desgn here. hs desgn omts overflow handlng by relyng on the compler to break a large accumulaton nto smaller peces so that overflow s guaranteed not to occur when each of these peces s executed. Avodng area overhead for stall handlng s desrable, but we wll not have much gan n our desgn f we have to break an accumulaton nto very small peces. Our goal s to determne a bound of how often the stall wll occur. he largest ncomng mantssa that can be fed nto one compressor s (the frst 24 bts are 1 s and 8

9 Draft submtted for publcaton. Please do not dstrbute Xlnx part number CLB matrx CLB Flpflops Ppelne Speed Fnal Addton sze used stages (MHz) Delay (ns) XC4036xlhq able 3: Synthess results of floatng-pont accumulaton wth delayed addton and compler-managed overflow avodance. the rest 31 bts are 0 s). hs s less than hus f the pseudo-sum s stored n an n-bt (n > 55) regster, we may have an overflow every 2 n-55 accumulatons. We can use ths formula to choose a sutable n for specfc applcatons. In ths desgn, we choose the pseudo-sum wdth to be 64 as n the desgn from Secton III. Snce = 9, overflows may occur every 2 9 = 512 accumulatons. We wll rely on the compler to break summatons nto 512-element vectors. However, snce we no longer need the overflow checkng and handlng n ths desgn, all the related components n our prevous desgn can be removed and the two 64-bt adders below the compressors n Fgure 4 are replaced by an array of 4-2 adders. hs also greatly smplfes the control logc on the crtcal path and enables us to further ppelne our desgn. Fgure 7 shows a resultng 5-ppelnestage desgn. Synthess results summarzed n table 3 show that we have doubled the speed of the conservatve desgn and have acheved a hgh clock rate of 66MHz. For an accumulaton of sze N, we wll need 5 + (N-1) + 5 = N + 9 cycles to complete the whole computaton. V. APPLICAIONS OF DELAYED ADDIION ECHNIQUES In prevous sectons, we have examned MACs and accumulatons n solaton. We now evaluate the utlty of our approach wthn a larger applcaton. Our desgn s especally useful for those applcatons n whch nnerproduct computatons domnate the overall executon tme. One such example s to solve a system of lnear equatons usng conjugate gradent method. he system of lnear equatons takes the form: A x = b, (5.1) where A s a N N matrx and x, b are 1 N vectors. Such computatons arse frequently n scentfc matrx-orented code, such as n Space-me Adaptve Processng (SAP) [18, 21]. SAP s a problem for whch confgurable accelerators are often explored, so a hgh-performance, FPGA-based nner product unt would form the core of such an accelerator. A. Conjugate Gradent Method: Background he conjugate gradent method s based on the dea of mnmzng the functon f(x) = 1 2 x Ax bx (5.2) he functon s mnmzed when ts gradent f = Ax b s zero, whch s exactly the soluton of (5.1). N-way p a+b c Dv Sub Eq. 5.4: α N Eq. 5.5: x +1 - N - - Eq. 5.6: g +1 N - - N Eq. 5.7: d +1 N + 1 N 1 - able 4: Number of dfferent operatons n Equaton ( ) p stands for nner product, Dv for dvson, Sub for subtracton. We can use regresson to fnd the mnmum x n (5.2). Set the ntal condtons to be x 0 = 0 (ntal soluton) d 0 = b (ntal drecton) g 0 = -d 0 (ntal gradent) (5.3) and teratvely calculate the followng steps: g α = d d A d (5.4) x +1 = x + α d (5.5) g +1 = A x +1 b (5.6) d +1 = -g +1 + g A d +1 d d A d (5.7) untl x +1 x < err (5.8) where err s a pre-specfed error lmt. As we can see from (5.4) to (5.7), each teraton s dvded nto four steps and these four steps must be seralzed snce the operaton of each step reles on the full knowledge of prevous step s result. Each step has sgnfcant parallelsm, however. he number of dfferent knds of calculatons requred for each step s summarzed n table 4. Wherever N appears n a table entry, the correspondng operaton has N-fold parallelsm. he above analyss shows that usng conjugate gradent method to solve a system of lnear equatons s potentally a good canddate for hghlghtng the utlty of FPGA computaton wth our specally desgned accumulaton unts. Next, we present a confgurable computng system that mplements ths applcaton. B. Proposed Archtecture he current confgurable computng systems can be roughly categorzed nto closely-coupled archtectures and looselycoupled archtectures. For closely-coupled archtectures, the confgurable hardware s part of a mcroprocessor and s generally treated as a specal functonal unt (SFU). he SFU behaves smlarly to other functonal unts on chp; t also takes two operands and produces a result for each 9

10 Draft submtted for publcaton. Please do not dstrbute FP Multpler 33 MHz FP Multpler 33 MHz multplers each clock cycle (Fg 8). In ths way, the MAC unt can run at the full speed of the accumulator, 66MHz. 0 MUX 1 FP Accumulator 66 MHz Fg. 8 A Floatng-Pont MAC: he accumulator takes n operands alternatvely from the two multplers. he whole MAC runs at 66MHz. operaton. hs archtecture, however, s not sutable for our applcaton for several reasons. Frst of all, all state-of-the-art mcroprocessors have floatng-pont multplers and adders. Generally these functonal unts are ppelned and can run at a very hgh speed. Some mcroprocessors can even use them to form two MAC unts, lke HP PA-RISC. Our SFU, on the other hand, has a much slower speed. What s more, as we can see from prevous dscusson, t takes a lot of hardware resource to mplement a MAC unt n FPGA, whch s beyond the capablty of most current closely-coupled confgurable computng system. For those that have ths capablty, the number of SFUs that can be mplemented s very lmted, too. hus the parallelsm n the algorthms can hardly be exploted. Loosely-coupled confgurable computng system, on the other hand, can provde us wth plenty of hardware resource for takng advantage of the mass parallelsm n applcatons. In ths archtecture, we normally have one host machne and multple FPGA boards connected to the I/O bus, wth FPGAs and assocated memory on each board. he host machne confgures and ntalzes each FPGA board and s responsble for communcaton wth other machnes whle the FPGA boards takes care of major calculaton. he host machne starts the operaton on FPGA boards after ntalzaton and FPGA board talks to the host machne when the job s done by nterrupt. For our desgn, each FPGA board wll be used to mplement several MAC unts, and wll also contan some local memory. We use floatng-pont accumulators and multplers to form MAC unts. Snce the sze of the matrx A s generally known before computaton, we choose a floatng-pont accumulator wth compler-managed overflow avodance as descrbed n Secton IV. On a separate FPGA, we wll have an 8-stage ppelned floatngpont multpler; our desgn can run at 33MHz on 2 Xlnx parts. hus we can form a MAC unt wth two floatngpont multplers and a floatng-pont accumulator, wth the floatng-pont accumulator takng n the products alternatvely from each one of the two floatng-pont Local memory on the FPGA board wll help manage the I/O data flow from the host machne to the FPGA board. It holds the data of matrx A and the results for each teraton ncludng α, x +1, g +1, d +1, and also provdes space for ntermedate results. When solvng more than one system of lnear equatons smultaneously, t would be more effcent to have two memory chps. hs would allow us to have one chp dong computaton whle the other one s downloadng a new system of lnear equatons from the host machne. he memores would alternate ther functons once the computaton s done. he communcaton latency between host machne and each FPGA board wll thus be totally hdden except for the frst tme. C. Performance of FPGA-based Conjugate Gradent Accelerator Suppose we have L (L>2) MAC unts on a FPGA board. hs FPGA board s thus capable of processng all the multplcaton and accumulaton n (5.4) (5.8) except for the dvson, whch s handled by CPU va nterrupt. o smplfy the calculaton, we assume the compler guarantees N s less than 512. hus we don t have to worry about breakng long nner products nto small peces. We also assume each MAC unt s fed from local memory at ts optmum bandwdth of two 32-bt operands at 66MHz. Suppose we have L MAC unts on the FPGA board, the requred local memory bandwdth would thus be 528 L Meg bytes per second. Before gong down to calculaton detals, we frst defne K- p tme as the number of clock cycles needed for our MAC unt to compute the nner product of two vectors of length K. In our case, we need (k-1) + 5 = k + 17 cycles to complete such a computaton, where 8 comes from the number of ppelne stages for multpler, 5 comes from the number of ppelne stages for the accumulator, k-1 s the vector length mnus 1, and 5 counts for the combnatonal delay for the fnal addton. We can then calculate the latency of one teraton of computaton as follows: 1) Latency for step 1: Computng α = g d d A d In ths step (5.4), we frst calculate d A. he calculaton conssts of N nner products, whch requres N/L N-p tme. Snce d A s an N 1 vector, ( d A) d s smply one nner product of length N, and so s g d. hus we can compute d A) d and g d ( smultaneously wthn one N- p tme. In all, we need N/L +1 N-p tme for nner 10

11 Draft submtted for publcaton. Please do not dstrbute products as well as one nterrupt for dvson. Snce d A d wll be used later as a dvsor n step 4, we can save one nterrupt by sendng back both α and 1/ d A d L = 2 L = 4 L = 8 2) Latency for step 2: Computng x +1 = x + α d In ths step (5.5), we have to compute each element of x +1 and we have N such computatons. For element j of x +1, we have x +1,j = x,j + α d,j. hs can be computed as an nner product of vector length two. Although nner product of sze two s not economcal at all usng our MAC unt, t s far better than to send these numbers back to host machne for computaton and send the results back, whch would ncur a hgh penalty n communcaton tme. hus we need N/L 2-p tme for ths step. 3) Latency for step 3: Computng g +1 = A x +1 b In ths step (5.6), we also have to compute each element of g +1 and we have N such computatons. Snce b s fxed through teratons, we can store the b n the memory beforehand. hus for element j of g +1, we have g +1,j = (- b j ) + Σ A j,k x,k. hs can be computed as an nner product of vector length N+1, so we need N/L (N+1)-p tme for ths step. 4) Latency for step 4: Computng d +1 = -g +1 + g 1 A d d A d + d In ths step (5.7), we need N/L +1 N-p tme to calculate g A d for the same reason stated n (5.3.1). Snce + 1 1/ d A d s avalable from step 1, we need one multplcaton tme for g + 1A d (1/ d A d ). And we need another N/L 2-p tme to get the result for the same reason stated n In all, we need N/L +1 N-p tme + N/L 2-p tme + 1 multplcaton tme for ths step. 5) Overall Performance By summng up the tme needed n each step, we get the tme for one teraton of the computaton. In all, t takes 2 ( N/L + 1) N-p tme + 1 N/L (N+1)-p tme + 2 N/L 2- p tme +1 multplcaton tme + 1 nterrupt tme for each teraton. If we plug n the N-p tme and 2-p tme, and we take 8 cycles to be one multplcaton tme, and we assume that 1 nterrupt takes 500 cycles, then we need (3N + 90) N/L + 2N cycles for each teraton. Fg. 9 shows the cycle count chart for dfferent problem szes and number of MAC unts. From ths table we can see, wth the ncrease of problem sze, the number of cycles needed becomes nversely proportonal to the number of MAC unts L. hs s because the cost of nterrupt has become trval n overall latency when N gets large. Number of cycles needed Fg Problem sze N Cycle count for one teraton wth regard to problem sze N and number of MAC unts L. It s worth mentonng here that the major computaton n checkng for convergence (5.8), x +1 x, s also an nner product operaton. From (5.5) we know that x +1 x = α d, each element of whch s calculated by the multplcaton process n step 2. hus we just have to save all these elements of α d n the memory and calculate ts norm later. A schedulng that defntely won t put extra latency on the system would be to calculate t durng the last N-p tme n step 1 and check f x +1 x < err durng the nterrupt. VI. DISCUSSION In ths secton we dscuss some of the ssues rased by our delayed addton technque, partcularly wth respect to floatng-pont calculatons. he IEEE floatng-pont standard [1] specfes the format of a floatng-pont number of ether sngle or double precson, as well as roundng operatons and excepton handlng. It provdes us wth both a representaton standard and an operaton standard. he representaton standard s helpful for transportng code from one system to another. he operaton standard, together wth the representaton standard, works to ensure that same result can be expected for floatng pont calculatons on dfferent platforms (f they all choose the same roundng scheme). Our desgns, descrbed n Sectons III, IV, and V, abde by the representaton porton of the IEEE standard. Our approaches mplement sngle-precson IEEE floatng-pont ncludng denormalzed numbers. Our operatons do reorder computatons, however, and make the basc assumpton that 11

12 Draft submtted for publcaton. Please do not dstrbute the addtons performed are commutatve. hs assumpton s also routnely made by most current-generaton mcroprocessors, where out-of-order executon also assumes that floatng-pont operatons on ndependent regsters are commutatve. Smlarly, many compler optmzatons geared at scentfc code also assume commutatvty; optmzatons such as loop nterchange and loop fuson reorder computatons as a matter of course. When users are concerned, error theory n computatons can be used to determne to what extent such reorderng s safe [4,5]. VII. RELAED WORK hs paper touches on areas related to both computer arthmetc and confgurable computng. Hennessy and Patterson provde an overvew of computer arthmetc n Appendx A of [6]. hey concentrate on logc prncples of varous desgns of basc arthmetc components. In addton, nnumerable books and papers go nto more detal on adder and multpler desgns at the transstor level. For example, Weste and Eshraghan [7] provdes a detaled dscusson on varous multpler mplementatons and Wallace trees. Examples of recent full-custom multpler desgn can be found n the Work by Ohkubo et al. [8] and Makno et al. [9]. We can also fnd recent multpler desgns n state-ofthe-art mcroprocessors such as DEC Alpha [15] and SUN Ultrasparc [16]. hese desgns, as wth most, use Booth encoders and Wallace trees Our approach employs the dea used n Wallace trees. C. S. Wallace used 3-2 adders to buld up the frst Wallace tree [10]. here have been many dervatves snce then. he most mportant change s to use 4-2 adders to replace the 3-2 adders n the orgnal mplementaton. In many desgns, pass transstors rather than full CMOS logc gates are used to buld 4-2 adders to mprove the crcut speed, as we can see n Hekes et al. [11]. Early n 1994, Cank et al. [3] dscussed how to map a btarray multpler to Xlnx FPGA. hey bult an 8 8 btarray multpler for nteger multplcaton wth Xlnx 3000 seres and ther fully ppelned mplementaton on XC acheved more than 100 Mhz. Now almost all the major FPGA vendors have provded ther mplementatons of nteger multpler or multply-accumulator of 16 bt or shorter length [22, 24]. A comparson of the speed of these mplementatons can be found n [23]. Most of them are based on bt-seral or bt-array multplers. However, btarray multpler has too bg an area cost for long nteger multplcaton. We know of no mplementatons of 32-btarray multpler, nor have we seen any mplementatons of 32-bt nteger multpler or multply-accumulators based on bt-seral or other algorthms. More recent work has examned mplementng floatngpont unts n FPGAs [2,3,13,14]. Louca et al. [2] present approaches for mplementng IEEE-standard floatng-pont addton and multplcaton n FPGAs. hey used a modfed bt-seral multpler and prototyped ther desgns on Altera FLEX8000s. Lgon et al. [14] also dscuss the mplementaton of IEEE sngle precson floatng-pont multplcaton and addton and they accessed the practcablty of several of ther desgns on XILINX 4000 seres FPGA. Shraz et al. [13] talk about the lmted precson floatng pont arthmetc. hey have adapted IEEE standard for lmted precson computaton lke FF n DSP. At the applcaton level, there are many papers about Space- me Adaptve Processng (SAP). Some background knowledge can be found n Gupta [18]. In addton, Smth et al. [19] talks about usng conjugate gradent method for SAP and Gupta [18] dscussed about the possblty of usng confgurable hardware for SAP. Fnally, several authors have dscussed roundng and error theory [4,5,12]. We can see how people deal wth the error n physcs from aylor [4]. Wlknson [5] and Hekes et al. [12] touch on roundng error n MAC and nner-product unts. VIII. CONCLUSIONS Wthn many current FPGAs, carry propagaton represents a sgnfcant bottleneck that mpedes mplementng truly hgh-performance ppelned adders, multplers, or Multply-accumulate (MAC) unts wthn confgurable desgns. hs paper descrbes a delayed addton technque for mprovng the ppelned clock rate of desgns that perform repeated ppelned calculatons. Our technque draws on Wallace trees to accumulate values wthout performng a full carry-propagaton; Wallace trees are unversally used wthn the multply unts n hghperformance processors. he unque nature of confgurable computng allows us to apply these technques not smply wthn a sngle calculaton, but rather across entre streams of calculatons. We have demonstrated the sgnfcant leverage of our approach by presentng three desgns exemplfyng both nteger and floatng-pont calculatons. he desgns operate at ppelned clock rates between 33 and 66MHz. We have also used our desgns n a real applcaton: solvng system of lnear equatons usng conjugate gradent method. hese technques and applcatons should help to broaden the space of nteger and floatng-pont computatons that can be customzed for hgh-performance executon on current FPGAs. REFERENCES [1] IEEE Standards Board. IEEE Standard for Bnary Floatng-Pont Arthmetc. echncal Report ANSI/IEEE Std , Insttute of Electrcal and Electroncs Engneers, New York,

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

Conditional Speculative Decimal Addition*

Conditional Speculative Decimal Addition* Condtonal Speculatve Decmal Addton Alvaro Vazquez and Elsardo Antelo Dep. of Electronc and Computer Engneerng Unv. of Santago de Compostela, Span Ths work was supported n part by Xunta de Galca under grant

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Lecture 3: Computer Arithmetic: Multiplication and Division

Lecture 3: Computer Arithmetic: Multiplication and Division 8-447 Lecture 3: Computer Arthmetc: Multplcaton and Dvson James C. Hoe Dept of ECE, CMU January 26, 29 S 9 L3- Announcements: Handout survey due Lab partner?? Read P&H Ch 3 Read IEEE 754-985 Handouts:

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

RADIX-10 PARALLEL DECIMAL MULTIPLIER

RADIX-10 PARALLEL DECIMAL MULTIPLIER RADIX-10 PARALLEL DECIMAL MULTIPLIER 1 MRUNALINI E. INGLE & 2 TEJASWINI PANSE 1&2 Electroncs Engneerng, Yeshwantrao Chavan College of Engneerng, Nagpur, Inda E-mal : mrunalngle@gmal.com, tejaswn.deshmukh@gmal.com

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

Floating-Point Division Algorithms for an x86 Microprocessor with a Rectangular Multiplier

Floating-Point Division Algorithms for an x86 Microprocessor with a Rectangular Multiplier Floatng-Pont Dvson Algorthms for an x86 Mcroprocessor wth a Rectangular Multpler Mchael J. Schulte Dmtr Tan Carl E. Lemonds Unversty of Wsconsn Advanced Mcro Devces Advanced Mcro Devces Schulte@engr.wsc.edu

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011 9/8/2 2 Outlne Appendx C: The Bascs of Logc Desgn TDT4255 Computer Desgn Case Study: TDT4255 Communcaton Module Lecture 2 Magnus Jahre 3 4 Dgtal Systems C.2: Gates, Truth Tables and Logc Equatons All sgnals

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some materal adapted from Mohamed Youns, UMBC CMSC 611 Spr 2003 course sldes Some materal adapted from Hennessy & Patterson / 2003 Elsever Scence Performance = 1 Executon tme Speedup = Performance (B)

More information

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach Data Representaton n Dgtal Desgn, a Sngle Converson Equaton and a Formal Languages Approach Hassan Farhat Unversty of Nebraska at Omaha Abstract- In the study of data representaton n dgtal desgn and computer

More information

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory Background EECS. Operatng System Fundamentals No. Vrtual Memory Prof. Hu Jang Department of Electrcal Engneerng and Computer Scence, York Unversty Memory-management methods normally requres the entre process

More information

Assembler. Building a Modern Computer From First Principles.

Assembler. Building a Modern Computer From First Principles. Assembler Buldng a Modern Computer From Frst Prncples www.nand2tetrs.org Elements of Computng Systems, Nsan & Schocken, MIT Press, www.nand2tetrs.org, Chapter 6: Assembler slde Where we are at: Human Thought

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

Mallathahally, Bangalore, India 1 2

Mallathahally, Bangalore, India 1 2 7 IMPLEMENTATION OF HIGH PERFORMANCE BINARY SQUARER PRADEEP M C, RAMESH S, Department of Electroncs and Communcaton Engneerng, Dr. Ambedkar Insttute of Technology, Mallathahally, Bangalore, Inda pradeepmc@gmal.com,

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Area Efficient Self Timed Adders For Low Power Applications in VLSI

Area Efficient Self Timed Adders For Low Power Applications in VLSI ISSN(Onlne): 2319-8753 ISSN (Prnt) :2347-6710 Internatonal Journal of Innovatve Research n Scence, Engneerng and Technology (An ISO 3297: 2007 Certfed Organzaton) Area Effcent Self Tmed Adders For Low

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Array transposition in CUDA shared memory

Array transposition in CUDA shared memory Array transposton n CUDA shared memory Mke Gles February 19, 2014 Abstract Ths short note s nspred by some code wrtten by Jeremy Appleyard for the transposton of data through shared memory. I had some

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

CPE 628 Chapter 2 Design for Testability. Dr. Rhonda Kay Gaede UAH. UAH Chapter Introduction

CPE 628 Chapter 2 Design for Testability. Dr. Rhonda Kay Gaede UAH. UAH Chapter Introduction Chapter 2 Desgn for Testablty Dr Rhonda Kay Gaede UAH 2 Introducton Dffcultes n and the states of sequental crcuts led to provdng drect access for storage elements, whereby selected storage elements are

More information

Newton-Raphson division module via truncated multipliers

Newton-Raphson division module via truncated multipliers Newton-Raphson dvson module va truncated multplers Alexandar Tzakov Department of Electrcal and Computer Engneerng Illnos Insttute of Technology Chcago,IL 60616, USA Abstract Reducton n area and power

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

High level vs Low Level. What is a Computer Program? What does gcc do for you? Program = Instructions + Data. Basic Computer Organization

High level vs Low Level. What is a Computer Program? What does gcc do for you? Program = Instructions + Data. Basic Computer Organization What s a Computer Program? Descrpton of algorthms and data structures to acheve a specfc ojectve Could e done n any language, even a natural language lke Englsh Programmng language: A Standard notaton

More information

GSLM Operations Research II Fall 13/14

GSLM Operations Research II Fall 13/14 GSLM 58 Operatons Research II Fall /4 6. Separable Programmng Consder a general NLP mn f(x) s.t. g j (x) b j j =. m. Defnton 6.. The NLP s a separable program f ts objectve functon and all constrants are

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

USING GRAPHING SKILLS

USING GRAPHING SKILLS Name: BOLOGY: Date: _ Class: USNG GRAPHNG SKLLS NTRODUCTON: Recorded data can be plotted on a graph. A graph s a pctoral representaton of nformaton recorded n a data table. t s used to show a relatonshp

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Brave New World Pseudocode Reference

Brave New World Pseudocode Reference Brave New World Pseudocode Reference Pseudocode s a way to descrbe how to accomplsh tasks usng basc steps lke those a computer mght perform. In ths week s lab, you'll see how a form of pseudocode can be

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning Parallel Inverse Halftonng by Look-Up Table (LUT) Parttonng Umar F. Sddq and Sadq M. Sat umar@ccse.kfupm.edu.sa, sadq@kfupm.edu.sa KFUPM Box: Department of Computer Engneerng, Kng Fahd Unversty of Petroleum

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Simulation Based Analysis of FAST TCP using OMNET++

Simulation Based Analysis of FAST TCP using OMNET++ Smulaton Based Analyss of FAST TCP usng OMNET++ Umar ul Hassan 04030038@lums.edu.pk Md Term Report CS678 Topcs n Internet Research Sprng, 2006 Introducton Internet traffc s doublng roughly every 3 months

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to:

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to: 4.1 4.2 Motvaton EE 457 Unt 4 Computer System Performance An ndvdual user wants to: Mnmze sngle program executon tme A datacenter owner wants to: Maxmze number of Mnmze ( ) http://e-tellgentnternetmarketng.com/webste/frustrated-computer-user-2/

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

VISUAL SELECTION OF SURFACE FEATURES DURING THEIR GEOMETRIC SIMULATION WITH THE HELP OF COMPUTER TECHNOLOGIES

VISUAL SELECTION OF SURFACE FEATURES DURING THEIR GEOMETRIC SIMULATION WITH THE HELP OF COMPUTER TECHNOLOGIES UbCC 2011, Volume 6, 5002981-x manuscrpts OPEN ACCES UbCC Journal ISSN 1992-8424 www.ubcc.org VISUAL SELECTION OF SURFACE FEATURES DURING THEIR GEOMETRIC SIMULATION WITH THE HELP OF COMPUTER TECHNOLOGIES

More information

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access Agenda Cache Performance Samra Khan March 28, 217 Revew from last lecture Cache access Assocatvty Replacement Cache Performance Cache Abstracton and Metrcs Address Tag Store (s the address n the cache?

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

Assembler. Shimon Schocken. Spring Elements of Computing Systems 1 Assembler (Ch. 6) Compiler. abstract interface.

Assembler. Shimon Schocken. Spring Elements of Computing Systems 1 Assembler (Ch. 6) Compiler. abstract interface. IDC Herzlya Shmon Schocken Assembler Shmon Schocken Sprng 2005 Elements of Computng Systems 1 Assembler (Ch. 6) Where we are at: Human Thought Abstract desgn Chapters 9, 12 abstract nterface H.L. Language

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

THE low-density parity-check (LDPC) code is getting

THE low-density parity-check (LDPC) code is getting Implementng the NASA Deep Space LDPC Codes for Defense Applcatons Wley H. Zhao, Jeffrey P. Long 1 Abstract Selected codes from, and extended from, the NASA s deep space low-densty party-check (LDPC) codes

More information

AADL : about scheduling analysis

AADL : about scheduling analysis AADL : about schedulng analyss Schedulng analyss, what s t? Embedded real-tme crtcal systems have temporal constrants to meet (e.g. deadlne). Many systems are bult wth operatng systems provdng multtaskng

More information

Random Kernel Perceptron on ATTiny2313 Microcontroller

Random Kernel Perceptron on ATTiny2313 Microcontroller Random Kernel Perceptron on ATTny233 Mcrocontroller Nemanja Djurc Department of Computer and Informaton Scences, Temple Unversty Phladelpha, PA 922, USA nemanja.djurc@temple.edu Slobodan Vucetc Department

More information

FPGA-based implementation of circular interpolation

FPGA-based implementation of circular interpolation Avalable onlne www.jocpr.com Journal of Chemcal and Pharmaceutcal Research, 04, 6(7):585-593 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 FPGA-based mplementaton of crcular nterpolaton Mngyu Gao,

More information

FPGA IMPLEMENTATION OF RADIX-10 PARALLEL DECIMAL MULTIPLIER

FPGA IMPLEMENTATION OF RADIX-10 PARALLEL DECIMAL MULTIPLIER FPGA IMPLEMENTATION OF RADIX-10 PARALLEL DECIMAL MULTIPLIER A Dssertaton Submtted In Partal Fulflment of the Requred for the Degree of MASTER OF TECHNOLOGY In VLSI Desgn Submtted By GEETA Roll no. 601361009

More information

Harvard University CS 101 Fall 2005, Shimon Schocken. Assembler. Elements of Computing Systems 1 Assembler (Ch. 6)

Harvard University CS 101 Fall 2005, Shimon Schocken. Assembler. Elements of Computing Systems 1 Assembler (Ch. 6) Harvard Unversty CS 101 Fall 2005, Shmon Schocken Assembler Elements of Computng Systems 1 Assembler (Ch. 6) Why care about assemblers? Because Assemblers employ some nfty trcks Assemblers are the frst

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL)

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL) Crcut Analyss I (ENG 405) Chapter Method of Analyss Nodal(KCL) and Mesh(KVL) Nodal Analyss If nstead of focusng on the oltages of the crcut elements, one looks at the oltages at the nodes of the crcut,

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

Specifications in 2001

Specifications in 2001 Specfcatons n 200 MISTY (updated : May 3, 2002) September 27, 200 Mtsubsh Electrc Corporaton Block Cpher Algorthm MISTY Ths document shows a complete descrpton of encrypton algorthm MISTY, whch are secret-key

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc. [Type text] [Type text] [Type text] ISSN : 0974-74 Volume 0 Issue BoTechnology 04 An Indan Journal FULL PAPER BTAIJ 0() 04 [684-689] Revew on Chna s sports ndustry fnancng market based on market -orented

More information

MATHEMATICS FORM ONE SCHEME OF WORK 2004

MATHEMATICS FORM ONE SCHEME OF WORK 2004 MATHEMATICS FORM ONE SCHEME OF WORK 2004 WEEK TOPICS/SUBTOPICS LEARNING OBJECTIVES LEARNING OUTCOMES VALUES CREATIVE & CRITICAL THINKING 1 WHOLE NUMBER Students wll be able to: GENERICS 1 1.1 Concept of

More information

FPGA Based Fixed Width 4 4, 6 6, 8 8 and Bit Multipliers using Spartan-3AN

FPGA Based Fixed Width 4 4, 6 6, 8 8 and Bit Multipliers using Spartan-3AN IJCSNS Internatonal Journal of Computer Scence and Network Securty, VOL.11 No.2, February 211 61 FPGA Based Fxed Wdth 4 4, 6 6, 8 8 and 12 12-Bt Multplers usng Spartan-3AN Muhammad H. Ras and Mohamed H.

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z. TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS Muradalyev AZ Azerbajan Scentfc-Research and Desgn-Prospectng Insttute of Energetc AZ1012, Ave HZardab-94 E-mal:aydn_murad@yahoocom Importance of

More information

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain AMath 483/583 Lecture 21 May 13, 2011 Today: OpenMP and MPI versons of Jacob teraton Gauss-Sedel and SOR teratve methods Next week: More MPI Debuggng and totalvew GPU computng Read: Class notes and references

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

High-Boost Mesh Filtering for 3-D Shape Enhancement

High-Boost Mesh Filtering for 3-D Shape Enhancement Hgh-Boost Mesh Flterng for 3-D Shape Enhancement Hrokazu Yagou Λ Alexander Belyaev y Damng We z Λ y z ; ; Shape Modelng Laboratory, Unversty of Azu, Azu-Wakamatsu 965-8580 Japan y Computer Graphcs Group,

More information

Rapid Development of High Performance Floating-Point Pipelines for Scientific Simulation 1

Rapid Development of High Performance Floating-Point Pipelines for Scientific Simulation 1 Rapd Development of Hgh Performance Floatng-Pont Ppelnes for Scentfc Smulaton 1 G. Lenhart, A. Kugel and R. Männer Dept. for Computer Scence V, Unversty of Mannhem, B6-26B, D-68131 Mannhem, Germany {lenhart,kugel,maenner}@t.un-mannhem.de

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe CSCI 104 Sortng Algorthms Mark Redekopp Davd Kempe Algorthm Effcency SORTING 2 Sortng If we have an unordered lst, sequental search becomes our only choce If we wll perform a lot of searches t may be benefcal

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

CHAPTER 4 PARALLEL PREFIX ADDER

CHAPTER 4 PARALLEL PREFIX ADDER 93 CHAPTER 4 PARALLEL PREFIX ADDER 4.1 INTRODUCTION VLSI Integer adders fnd applcatons n Arthmetc and Logc Unts (ALUs), mcroprocessors and memory addressng unts. Speed of the adder often decdes the mnmum

More information

On Some Entertaining Applications of the Concept of Set in Computer Science Course

On Some Entertaining Applications of the Concept of Set in Computer Science Course On Some Entertanng Applcatons of the Concept of Set n Computer Scence Course Krasmr Yordzhev *, Hrstna Kostadnova ** * Assocate Professor Krasmr Yordzhev, Ph.D., Faculty of Mathematcs and Natural Scences,

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

Wavefront Reconstructor

Wavefront Reconstructor A Dstrbuted Smplex B-Splne Based Wavefront Reconstructor Coen de Vsser and Mchel Verhaegen 14-12-201212 2012 Delft Unversty of Technology Contents Introducton Wavefront reconstructon usng Smplex B-Splnes

More information

ELEC 377 Operating Systems. Week 6 Class 3

ELEC 377 Operating Systems. Week 6 Class 3 ELEC 377 Operatng Systems Week 6 Class 3 Last Class Memory Management Memory Pagng Pagng Structure ELEC 377 Operatng Systems Today Pagng Szes Vrtual Memory Concept Demand Pagng ELEC 377 Operatng Systems

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

Optimal Workload-based Weighted Wavelet Synopses

Optimal Workload-based Weighted Wavelet Synopses Optmal Workload-based Weghted Wavelet Synopses Yoss Matas School of Computer Scence Tel Avv Unversty Tel Avv 69978, Israel matas@tau.ac.l Danel Urel School of Computer Scence Tel Avv Unversty Tel Avv 69978,

More information

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations Improvng Hgh Level Synthess Optmzaton Opportunty Through Polyhedral Transformatons We Zuo 2,5, Yun Lang 1, Peng L 1, Kyle Rupnow 3, Demng Chen 2,3 and Jason Cong 1,4 1 Center for Energy-Effcent Computng

More information

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations*

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations* Confguraton Management n Mult-Context Reconfgurable Systems for Smultaneous Performance and Power Optmzatons* Rafael Maestre, Mlagros Fernandez Departamento de Arqutectura de Computadores y Automátca Unversdad

More information

Efficient Distributed File System (EDFS)

Efficient Distributed File System (EDFS) Effcent Dstrbuted Fle System (EDFS) (Sem-Centralzed) Debessay(Debsh) Fesehaye, Rahul Malk & Klara Naherstedt Unversty of Illnos-Urbana Champagn Contents Problem Statement, Related Work, EDFS Desgn Rate

More information