FPGA IMPLEMENTATION OF RADIX-10 PARALLEL DECIMAL MULTIPLIER


A Dissertation Submitted in Partial Fulfilment of the Requirements for the Degree of MASTER OF TECHNOLOGY in VLSI Design

Submitted by GEETA, Roll no.

Under the guidance of Ms. Sakshi Bajaj, Assistant Professor, ECED, T.U. Patiala

Department of Electronics and Communication Engineering, THAPAR UNIVERSITY (Established under the section 3 of UGC Act, 1956), PATIALA (PUNJAB), July, 2015

ACKNOWLEDGEMENT

First of all, I would like to express my gratitude to Ms. Sakshi Bajaj, Assistant Professor, Electronics and Communication Engineering Department, Thapar University, Patiala, for her patient guidance and support throughout this report. I am truly fortunate to have had the opportunity to work with her, and I found her guidance extremely valuable. I am also thankful to our Head of the Department, Dr. Sanjay Sharma, as well as the PG Coordinator, Dr. Amit Kumar Kohli, Associate Professor, Dr. Alpana Agarwal, Associate Professor, Dr. Anil Arora, Assistant Professor, the entire faculty and staff of the Electronics and Communication Engineering Department, and the friends who devoted their valuable time and helped me in all possible ways towards the successful completion of this work. I thank all those who have contributed directly or indirectly to this work. Lastly, I would like to thank my parents for their years of unyielding love and encouragement. They have always wanted the best for me, and I admire their determination and sacrifice.

GEETA

ABSTRACT

Multipliers are being increasingly used in DSP processors, filters, communication systems, etc. With the rising complexity of technology, high-speed systems are in great demand. Compared with the other operations in an arithmetic logic unit, multiplication consumes more time and power. Hence the demand to design multipliers with optimal speed, power and area has increased. This dissertation covers the implementation of a parallel decimal multiplier, with the aim of reducing delay. The partial products are generated in parallel using a signed-digit (SD) radix-10 recoding of the multiplier and a simplified set of multiplicand multiples. The partial products are reduced in a tree structure built with an algorithm known as decimal multi-operand carry-save addition. This uses unconventional decimal-coded number systems, which largely improves the area and latency with respect to prior designs. The design includes optimized digit recoders, decimal carry-save adders (CSAs) combining different decimal-coded operands, and carry-free adders implemented by specially designed bit counters, together with a design methodology that combines all these techniques to obtain efficient reduction trees with different area and delay trade-offs for any number of partial products generated. Evaluation results for 16-digit operands show that the proposed architectures have interesting area-delay figures compared with conventional Booth radix-4 and radix-8 parallel binary multipliers, and outperform previous alternatives for decimal multiplication. The modules have been designed in Verilog HDL, and simulated and synthesized using Xilinx.

TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS
ABBREVIATIONS

CHAPTER 1: Introduction
    Introduction to multipliers
    Organization of the report

CHAPTER 2: Literature Review

CHAPTER 3: Parallel Decimal Multiplication
    Fixed-point decimal multiplication
    Decimal parallel multiplier
        Radix-10 architecture
        SD radix-10 recoding
    Generation of multiplicand multiples
        Implementation of digit recoders
    Partial product reduction
        Partial product arrays
        Method for carry-save addition
        Decimal digit adders using bit counters
        Decimal P:2 CSA trees for digits coded in (4221)

CHAPTER 4: FPGA Implementation using Xilinx
    Introduction to FPGA
    FPGA technology trends
    Xilinx specification
        Configurable logic blocks
        Input/Output blocks
        RAM blocks
        Programmable routing
    FPGA implementation using Xilinx
        Overview of FPGA design flow
        Design entry
        Behavioral simulation
        Design synthesis
        Design implementation
        Analyzing design using ChipScope Pro

CHAPTER 5: Implementation and Results
    Radix-10 parallel decimal multiplier
    Simulation results for radix-10 parallel decimal multiplier
    Synthesis results for radix-10 parallel decimal multiplier
    Comparison results

CHAPTER 6: Conclusion and Future Scope
    Conclusion
    Future scope

REFERENCES

LIST OF FIGURES

Figure 2.1 Delay-constrained comparison graph
Figure 3.1 Combinational SD radix-10 architecture
Figure 3.2 Partial product generation for SD radix-10
Figure 3.3 Generation of multiplicand multiples for SD radix-10
Figure 3.4 Calculation of 5X for decimal operands coded in (4221)
Figure 3.5 Partial product arrays generated for 16-digit operands, SD radix-10 architecture
Figure 3.6 Calculation of ×2 for decimal operands coded in (4221) and (5211)
Figure 3.7 Proposed decimal 3:2 CSA for operands coded in (4221). (a) Decimal digit (4-bit) 3:2 CSA. (b) Full adder
Figure 3.8 Proposed decimal 3:2 CSA for input operands coded in (5211). (a) Output operands coded in (5211). (b) Mixed (5211)/(4221) coded output operands
Figure 3.9 Gate-level implementation of digit counters. (a) (4221) counter for 9 bits. (b) (4221) counter for 8 bits. (c) 7:3 binary counter
Figure 3.10 :4 reduction of (5211) decimal-coded operands
Figure 3.11 Decimal 6:2 CSA for (4221) decimal-coded operands. (a) Digit column. (b) Implementation of a ×2 block
Figure 3.12 Proposed decimal 17:2 CSAs. (a) Area-optimized tree. (b) Delay-optimized tree
Figure 4.1 Look-up table implemented as (a) memory (b) multiplier and memory
Figure 4.2 FPGA design flow
Figure 4.3 Steps in synthesis process
Figure 4.4 Different files generated in implementation process
Figure 5.1 Simulation results for radix-10 parallel decimal multiplier
Figure 5.2 Area comparison of radix-10 decimal multiplier
Figure 5.3 Delay comparison of radix-10 decimal multiplier

LIST OF TABLES

Table 3.1 Decimal coding
Table 3.2 Selected decimal codes for the recoded digits
Table 5.1 Synthesis report of radix-10 parallel decimal multiplier

LIST OF SYMBOLS

$Z$    Decimal integer operand
$Z_i$    Decimal digit
$X$    Multiplicand
$Y$    Multiplier
$r_j$    Weight of the $j$th bit
$P$    Final product
$ys_{i-1}$    Sign signal
$yb_i$    Hot-one code
$\vee$    OR
$\cdot$    AND
$\oplus$    XOR
$Lm_{shift}$    Left arithmetic binary shift of $m$ bits
$x_{i,j}$    Bits of the BCD representation
$w_{i,j}$    Bits of the (4221) representation
S    Sign encoding
H    Hot-one 10's complement encoding
X    Regular (4221) digit
F    Extra digit position to support the width of the multiplicand
V    Regular (4221) digit
B    Regular (5211) digit
W    Carry vector coded in (5211)
$s_{i,j}$    Sum bit of a full adder
$h_{i,j}$    Carry bit of a full adder

ABBREVIATIONS

DFP    Decimal floating point
DFUs    Decimal floating-point units
DSP    Digital signal processor
FPGA    Field programmable gate array
HDL    Hardware description language
IEEE    Institute of Electrical and Electronics Engineers
ISE    Integrated Software Environment
LSB    Least significant bit
LUT    Look-up table
MAC    Multiply and accumulate
MSB    Most significant bit
RTL    Register transfer level
RAM    Random access memory
VLSI    Very large scale integration
XST    Xilinx synthesis technology
CSA    Carry-save adder
HA    Half adder
PP    Partial product
BCD    Binary coded decimal
SD    Signed digit
CLAs    Carry look-ahead adders
ulp    Unit in the last place
EDIF    Electronic design interchange format
NGC    Netlist Constraints File
ILA    Integrated Logic Analyzer
VIO    Virtual Input/Output
ATC2    Agilent Trace Core 2
IBA    Integrated Bus Analyzer

CHAPTER 1
INTRODUCTION

1.1 Introduction to Multipliers

Multipliers are widely used in many high-performance systems. They appear as small blocks in large systems such as digital signal processors (DSPs), filters, communication systems and microprocessors. DSPs perform operations such as convolution, transforms and filtering, all of which need a multiplication unit, so the multiplier plays an important role in DSPs. In older microprocessors multiplication was performed by repeated addition, whereas recent microprocessors provide multiplication instructions. In communication systems multipliers are used for baseband signal transmission. A filter contains a large number of multipliers, so fast multipliers are necessary for high-speed applications. Multipliers also find application in 3D graphics and in video processing for displaying streamline images. The time a multiplier takes to compute the final result is very important and plays a major role in determining its performance. In DSP chips the multiplication time largely determines the instruction cycle time, so there is a need for high-speed multipliers.

A multiplier is a digital circuit used to multiply two numbers. A digital multiplier multiplies two numbers in the same way as it is done by hand, i.e. by adding shifted partial products. Multiplication of unsigned numbers can be done using AND gates and full adders: the AND gates generate the partial products and the full adders add them. For signed multiplication the negative numbers are first converted into their 2's complement representation. This makes sure that all the partial products are positive.

There are three types of digital multipliers: serial, parallel and serial-parallel. In a serial multiplier both operands are entered serially, so it requires less area, which in turn gives low hardware cost and low power consumption. The speed of the serial multiplier is poor, however, because the operands are entered serially. In a parallel multiplier the operation is carried out in parallel, which increases the speed, but parallel multipliers occupy a larger area and are more complex than serial multipliers. High power consumption is another problem of parallel multipliers. Parallel multipliers can be further classified into array multipliers and tree multipliers. The most widely used array multipliers are the Braun, Booth, Robertson and Baugh-Wooley multipliers. Serial-parallel multipliers combine the high-speed operation of the parallel multiplier with the small area of the serial multiplier: one operand is entered serially while the other is stored in parallel. This requires less area and also enhances the speed of the multiplier.

Hardware support for decimal floating-point (DFP) arithmetic is becoming a topic of interest. Software DFP implementations satisfy the precision requirements but are slower than hardware implementations, and they cannot satisfy the demands of future financial, commercial and user-oriented high-performance applications [1]. The revision of the IEEE 754 standard [2] for floating-point arithmetic includes specifications for DFP arithmetic that can be implemented in software, hardware or a combination of both. The IBM z9 architecture performs the basic operations in dedicated hardware. The IBM Power6 microprocessor [3] is oriented to servers and workstations. For high-performance DFUs the key factor is multiplication. Most binary floating-point units (FPUs) use parallel binary multipliers for high performance [4]. However, the complexity of generating the multiplicand multiples and the inadequacy of representing decimal values in systems that rely on binary signals make decimal multiplication hard to implement.

This complicates the generation and reduction of partial products, so commercial implementations use sequential decimal multipliers [5,6,7]. These implementations rely on iterative algorithms for decimal integer multiplication [7,8] and therefore have low performance. Several techniques have been proposed to improve sequential decimal multipliers [9,10,11]. The first implementation of a parallel decimal multiplier is shown in [12]; some other parallel decimal multiplier architectures are shown in [8]. These architectures are further advanced to support binary multiplication. This work presents a method for the efficient implementation of decimal parallel multiplication using parallel generation of the partial products and their reduction by a carry-save addition tree.

1.2 Organization of Thesis

The report is organized as follows:

Chapter 2 - Describes the applications of the multiplier and gives a brief account of the parallel decimal multiplier.
Chapter 3 - Discusses the IEEE 754 standard, the floating-point multiplication algorithm and the adders used in floating-point multiplication.
Chapter 4 - Discusses the FPGA flow using Xilinx.
Chapter 5 - Describes the implementation results and concludes the report.
Chapter 6 - Discusses the conclusion and future scope of the implemented design.

CHAPTER 2
LITERATURE REVIEW

Liu Han [13]: Decimal multiplication is considered one of the most complicated dyadic operations and requires a high-cost hardware implementation. The processor industry has therefore opted for sequential decimal multipliers to avoid the high cost of parallel architectures. The main drawback of iterative multipliers, however, is their high latency, so the focus has been on reducing the latency of sequential decimal multipliers while keeping the area cost low. The authors propose a high-frequency sequential decimal multiplier whose cycle time is reduced to the latency of a binary half-adder plus that of a decimal multiply-by-two operation, which is less overall than that of a decimal carry-save adder. The multiplier works at a higher clock frequency than the fastest previous decimal multiplier, which in turn gives an overall latency advantage.

Figure 2.1: Delay-constrained comparison graph [13]

Kenney, R.D. [11]: Decimal arithmetic is regaining popularity in the computing community because of the growing importance of commercial, financial and Internet-based applications, which process decimal data. The paper presents an iterative decimal multiplier that operates at high clock frequencies and scales well to large operand sizes. The multiplier uses a new decimal representation for intermediate products, which allows a very fast two-stage iterative multiplier design. Decimal multipliers synthesized using a 0.11 micron CMOS standard cell library operate at clock frequencies close to 2 GHz. The latency of the proposed design for multiplying two n-digit BCD operands is (n + 8) cycles, with a new multiplication able to begin every (n + 1) cycles.

Vazquez, A. and Antelo [12]: Introduce two novel architectures for parallel decimal multipliers. The multipliers are based on a new algorithm for decimal carry-save multioperand addition that uses a novel BCD-4221 recoding for decimal digits. It significantly improves the area and latency of the partial product reduction tree with respect to previous proposals. The paper also presents three schemes for fast and efficient generation of partial products in parallel. Recoding the BCD-8421 multiplier operand into minimally redundant signed-digit radix-10, radix-4 and radix-5 representations using new recoders reduces the complexity of partial product generation. In addition, the SD radix-4 and radix-5 recodings allow the reuse of a conventional parallel binary radix-4 multiplier to perform combined binary/decimal multiplications. Evaluation results show that the proposed architectures have interesting area-delay figures compared with conventional Booth radix-4 and radix-8 parallel binary multipliers and other representative alternatives for decimal multiplication.

Vazquez, A. [14]: High-performance decimal floating-point units (DFUs) demand efficient implementations of parallel decimal multipliers. The parallel generation of partial products is performed using signed-digit radix-10 or radix-5 recodings of the multiplier and a simplified set of multiplicand multiples. The reduction of partial products is implemented in a tree structure based on a decimal multioperand carry-save addition algorithm that uses unconventional (non-BCD) decimal-coded number systems.
The paper presents new improvements that reduce the latency of the previous designs, including optimized digit recoders for the generation of $2^n$-tuples (and 5-tuples), decimal carry-save adders (CSAs) combining different decimal-coded operands, and carry-free adders implemented by specially designed bit counters. It also presents a design methodology that combines all these techniques to obtain efficient reduction trees with different area and delay trade-offs for any number of partial products generated.

M. F. Cowlishaw [15]: Decimal arithmetic is the norm in human calculations, and human-centric applications must use a decimal floating-point arithmetic to achieve the same results. Initial benchmarks indicate that some applications spend 50% to 90% of their time in decimal processing, because software decimal arithmetic suffers a 100× to 1000× performance penalty over hardware. Existing designs, however, either fail to conform to modern standards or are incompatible with the established rules of decimal arithmetic. The paper describes a decimal floating-point arithmetic which not only provides the strict results necessary for commercial applications but also meets the constraints and requirements of the IEEE 854 standard.

M.A. Erle [16]: Proposed a fixed-point decimal multiplier that uses a simple recoding scheme to produce signed-magnitude representations of the operands, thereby greatly simplifying the generation of partial products for each multiplier digit. The partial products are generated using a digit-by-digit multiplier on a word-by-digit basis, first in a signed-digit form with two digits per position, and then combined via a combinational circuit. As the signed-digit partial products are developed one at a time while traversing the recoded multiplier operand from the least significant digit to the most significant digit, each partial product is added to the accumulated sum of previous partial products via a signed-digit adder. This work differs significantly from other work employing digit-by-digit multipliers because of the efficiency gained by restricting the range of digits throughout the multiplication process.

G. Jaberipur [17]: Describes contributions to digit-recurrence decimal division hardware, focusing on techniques for improving the performance of quotient digit selection (QDS), the most complex part. Design D1 uses the digit set for quotient digits. Another design (D2) uses mixed binary/decimal carry-save manipulation of the few most significant digits of the partial remainders. Motivated by successful combined arithmetic algorithms such as hybrid adders, the authors offer a decimal division scheme that takes advantage of the best design options of D1 and D2, with modifications that significantly enhance the division speed. In particular, they configure the architectures of the QDS and partial remainder computation paths in favor of reduced, balanced latencies of both. Furthermore, they remove the rounding cycle by cost-free auto-rounding, which is an exclusive advantage of the digit set.
The authors of D1 and D2 used logical effort (LE) and circuit synthesis, respectively, to evaluate their dividers. For a fair comparison, the authors therefore evaluate the proposed design (D3) with both methods. The LE-based D3/D1 comparison shows 21% more speed at the cost of 6% more area, whereas the synthesis-based D3/D2 comparison shows 46% less latency and 23% less area.

Kaivani, A. [18]: Multiplication, one of the four basic operations embedded in arithmetic processors, is receiving renewed attention from hardware designers involved in the revived field of decimal arithmetic. Decimal hardware units usually employ a sequential implementation of this operation because of the high area cost of parallel decimal multipliers. The main drawback of this iterative method, however, is its high latency. To ameliorate this problem, the paper proposes a high-frequency sequential decimal multiplier. The cycle time of the proposed multiplier is determined by a decimal carry-save adder and is about 22% less than that of the fastest previous design.

Schmookler, M. [19]: Parallel decimal arithmetic capability is becoming increasingly attractive with new applications of computers in a multiprogramming environment. The paper describes several techniques useful in designing decimal adders that are both high speed and economical. One such technique is the direct production of decimal sums without first producing the binary sums. Another is the refinement of carry look-ahead to produce the decimal carries directly. These techniques offer a significant improvement over the well-known method of decimal correction. They also permit the design of a parallel decimal arithmetic unit that is competitive with a binary arithmetic unit in performance and cost.

Tomás Lang and Alberto Nannarelli [20]: Presented a combinational decimal multiply unit which can be pipelined to reach the desired throughput. With respect to previous implementations of decimal multiplication, the proposed unit is combinational (parallel) rather than sequential, has a simpler recoding of the operands which reduces the number of partial product precomputations, and uses counters to eliminate the need for the decimal equivalent of a 4:2 adder. The results of the implementation show that the combinational decimal multiplier offers a good compromise between latency and area when compared with other decimal multiply units and with binary double-precision multipliers.

M.A. Erle [21]: Describes fixed-point decimal multipliers that use decimal carry-save addition to reduce the critical path delay.
First, a multiplier is presented that stores a reduced number of multiplicand multiples and uses decimal carry-save addition in the iterative portion of the design. Then, a second multiplier design is proposed with several notable improvements, including fast generation of multiplicand multiples that do not need to be stored, the use of decimal (4:2) compressors, and a simplified decimal carry-propagate addition to produce the final product. When multiplying two n-digit operands to produce a 2n-digit product, the improved multiplier design has a worst-case latency of n + 4 cycles and an initiation interval of n + 1 cycles. Three data-dependent optimizations, which help reduce the multiplier's average latency, are also described.

R.D. Kenney [22]: Introduced and analyzed three different techniques for performing fast decimal addition on multiple binary coded decimal (BCD) operands. Two of the techniques cover BCD correction values and correct intermediate results while adding the input operands. The first involves a single addition. The second involves two additions. The third technique uses a binary carry-save adder tree and produces a binary sum. Combinational logic is then used to correct the sum and determine the carry into the next more significant digit. Multioperand adder designs are constructed and synthesized for four to 16 input operands. Analyses are performed on the synthesis results and the merits of each technique are discussed. Finally, these techniques are compared with several previous techniques for high-speed decimal addition.

L. Dadda [23]: Addressed the problem of multioperand parallel decimal addition with an approach that uses binary arithmetic, assuming binary-coded decimal (BCD) numbers. This involves either correcting the result in order to obtain the BCD result or a binary-to-decimal (BD) conversion. The author opted for the latter approach, which is significantly more efficient for a large number of addends. The conversion demands a relatively small area and allows fast operation. The BD conversion moreover makes it easy to align the sums of adjacent columns. BCD digit adders are designed using fast carry-free adders, and the conversion problem is solved using a parallel scheme involving elementary conversion cells. Spreadsheets were developed as a design tool for adding several BCD digits and for simulating the BD conversion.

B. Shirazi [24]: Discussed the major advantage of the binary coded decimal (BCD) system, which is that it provides rapid binary-decimal conversion. The shortcoming of the BCD system is that BCD arithmetic operations are slow and require complex hardware.
The authors concluded that the performance of BCD operations can be enhanced using a redundant binary coded decimal (RBCD) representation, which leads to carry-free operations. The paper introduces the VLSI design of an RBCD adder. The design consists of two small PLAs and two 4-bit binary adders for one digit of the RBCD adder. The addition delay is constant for n-digit RBCD addition (no carry propagation delay). The VLSI time and space complexities of the design, as well as its layout, are presented. In addition, BCD to RBCD conversion is carried out in constant time. However, RBCD to BCD conversion requires a carry-ripple operation, which can be accomplished with a complexity equivalent to that of carry look-ahead circuitry.


CHAPTER 3
PARALLEL DECIMAL MULTIPLIER

3.1 Fixed-Point Decimal Multiplication

A decimal integer operand $Z$ can be represented as a sequence of decimal digits, each coded as a positive weighted 4-bit vector:

$$Z = \sum_{i=0}^{d-1} Z_i \, 10^i, \qquad Z_i = \sum_{j=0}^{3} z_{i,j} \, r_j \tag{3.1}$$

where $z_{i,j}$ is the $j$th bit of the $i$th digit, $Z_i \in [0, 9]$ is the $i$th decimal digit and $r_j \geq 1$ is the weight of the $j$th bit. Table 3.1 shows the set of coded decimal number systems, which consists of BCD (with $r_j = 2^j$) and the other codes used to represent the different decimal operands, as required by the methods presented in this work. The codes are referred to by their bit weights as $(r_3 r_2 r_1 r_0)$. The 4-bit vector that represents the digit $Z_i$ in the decimal code $(r_3 r_2 r_1 r_0)$ is denoted by $Z_i(r_3 r_2 r_1 r_0)$.

The multiplicand $X = \sum_{i=0}^{d-1} X_i \, 10^i$ and the multiplier $Y = \sum_{i=0}^{d-1} Y_i \, 10^i$ are unsigned decimal integer d-digit BCD words. Multiplication consists of three stages: generation of partial products, reduction (addition) of partial products, and a final carry-propagate addition (conversion to the non-redundant 2d-digit BCD representation $P = \sum_{i=0}^{2d-1} P_i \, 10^i$). Decimal floating-point multiplication additionally involves exponent addition, rounding of $P = X \times Y$, sign calculation, and exception detection and handling. Decimal multiplication is more complex than binary multiplication because the range of the decimal digits increases the number of multiplicand multiples, and because decimal values are represented inefficiently in systems based on binary logic.
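As an illustration of equation (3.1), the following minimal Python sketch evaluates a digit under any of the weighted 4-bit codes named in Table 3.1 and lists every bit vector that represents a given digit. The function names and the dictionary `CODES` are illustrative assumptions, not part of the dissertation; the weights are simply read off the code names.

```python
from itertools import product

# Weight vectors (r3, r2, r1, r0) of the decimal codes named in Table 3.1.
# Assumption: each code name spells out its own bit weights.
CODES = {
    "BCD": (8, 4, 2, 1),
    "5421": (5, 4, 2, 1),
    "4221": (4, 2, 2, 1),
    "5211": (5, 2, 1, 1),
    "4311": (4, 3, 1, 1),
    "3321": (3, 3, 2, 1),
}

def digit_value(bits, weights):
    """Value of one 4-bit digit: sum of z_j * r_j, bits given as (z3, z2, z1, z0)."""
    return sum(b * r for b, r in zip(bits, weights))

def encodings(digit, weights):
    """All 4-bit vectors that represent `digit` under the given weights."""
    return [bits for bits in product((0, 1), repeat=4)
            if digit_value(bits, weights) == digit]

def operand_value(digits_msd_first, weights):
    """Value of a d-digit operand: sum of Z_i * 10**i (digits listed MSD first)."""
    value = 0
    for bits in digits_msd_first:
        value = value * 10 + digit_value(bits, weights)
    return value
```

For example, `encodings(5, CODES["4221"])` returns two vectors, which is the redundancy of the (4221) code that the later sections rely on, whereas `encodings(5, CODES["BCD"])` returns exactly one.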

Two approaches are used for the generation of decimal partial products. The first approach performs a digit-by-digit multiplication of the inputs using look-up table methods [8], [21]. Recently, a signed-digit radix-10 recoding (from [0, 9] to [-5, 5]) has been recommended to reduce the magnitude of the operand digits. This simplifies and speeds up partial product generation: the signed-digit partial products are then produced using combinational logic and simplified tables. This method is suited only to serial multiplication, since generating the partial products in parallel demands more hardware.

The second approach generates and stores all the multiplicand multiples, which are then selected by multiplexers. This approach requires wide decimal carry-propagate additions to produce the multiplicand multiples (3X, 4X, 5X, 6X, 7X, 8X, 9X). In [6] only the even multiples (2X, 4X, 6X, 8X) are calculated and stored, and the odd multiples (3X, 5X, 7X) are obtained on demand. A reduced set of BCD multiples (X, 2X, 4X, 5X) is produced in [9] without carry propagation, and the remaining multiples are obtained from this set.

The first proposals for decimal partial product reduction used carry-propagate addition (direct decimal addition) [19]. Decimal carry-save addition approaches represent the result either as sum and carry words [22], [11], [8], [7] or as a BCD sum word and a carry bit per digit. The first group implements decimal addition by mixing binary CSAs with combinational logic. In [7] a scheme of 3:2 binary CSAs is used to add the partial products iteratively; correction values of +6 or +12 are added to the digits to obtain the decimal carries and sum digits. Another approach uses a binary CSA tree followed by a single decimal correction [22]. More recently, [23] uses a binary-to-BCD conversion and a binary carry-free adder to add N d-digit BCD operands. A second group of methods [9], [20] uses 4-bit radix-10 carry-propagate adders to perform decimal carry-save addition. In [9] a serial multiplier is implemented using an array of radix-10 carry look-ahead adders (CLAs). A CSA tree built from these radix-10 CLAs is used in the combinational decimal multiplier.
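The +6 correction mentioned above can be illustrated with the classic two-operand BCD digit adder. This is a hedged Python sketch of the textbook correction only, not the specific 3:2 CSA scheme of [7]; the +12 case is not modelled, and the function names are illustrative assumptions.

```python
def bcd_digit_add(a, b, carry_in=0):
    """Add two BCD digits with a binary adder, then apply the +6 correction."""
    assert 0 <= a <= 9 and 0 <= b <= 9 and carry_in in (0, 1)
    s = a + b + carry_in              # plain binary sum, range 0..19
    carry_out = 1 if s > 9 else 0
    if carry_out:
        s = (s + 6) & 0xF             # +6 skips the six unused codes 1010..1111
    return carry_out, s

def bcd_add(x_digits, y_digits):
    """Ripple-carry addition of two equal-length BCD operands (LSD first)."""
    carry, out = 0, []
    for a, b in zip(x_digits, y_digits):
        carry, s = bcd_digit_add(a, b, carry)
        out.append(s)
    return out + [carry]              # final carry becomes the top digit
```

The carry ripples from digit to digit in `bcd_add`, which is exactly the delay that the carry-save schemes surveyed in this section try to avoid.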
Efficient multioperand decimal tree adders are required to reduce all the decimal partial products. Most schemes use parallel networks of full adders arranged as binary CSA trees, because of their simple logic cells and fast operation [22, 23]. In this work the multioperand decimal tree adders are also built from binary CSA trees, but the operands are coded in decimal codings other than BCD. These multioperand CSA trees are explained later in this chapter.

Table 3.1: Decimal coding. Columns: $Z_i$, $Z_i$ (BCD), $Z_i$ (5421), $Z_i$ (4221), $Z_i$ (5211), $Z_i$ (4311), $Z_i$ (3321).

3.2 Decimal Parallel Multipliers

This section gives an overview of the recommended architecture. The design consists of the generation of partial products and their reduction, explained in Sections 3.3 and 3.4. The architecture uses the (4221) and (5211) codes rather than BCD to represent the partial products, which simplifies the reduction of the decimal partial products.

Radix-10 Architecture

Figure 3.1 shows the architecture of the d-digit SD radix-10 multiplier. The multiplier consists of three stages: generation of decimal partial products coded in (4221) (generation of multiplicand multiples and SD radix-10 encoding of the multiplier), reduction of the partial products, and a final BCD carry-propagate addition. The multiplier operand is encoded into d SD radix-10 digits and an additional bit, which generate d+1 partial products. Each SD radix-10 digit controls a level of 5:1 muxes that selects a positive multiplicand multiple (0, X, 2X, 3X, 4X, 5X) coded in (4221). To obtain the partial product, the output bits of the muxes are inverted by XOR gates when the sign of the SD radix-10 digit is negative. Next, the d+1 partial products, coded in (4221) decimal digits, are reduced using P:2 CSA trees, which are explained later. The number of digits to be reduced in each column ranges from p = d+1 to p = 2. Thus, two 2d-digit operands S and H are formed from the d+1 partial products.
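The partial product generation described above can be checked arithmetically with a short Python sketch. It assumes that the sign signal of a digit is set when that digit is 5 or more, as stated for the SD radix-10 recoding; the function names are illustrative assumptions and the code models values only, not the (4221) bit-level hardware.

```python
def sd_radix10_recode(y_digits):
    """Recode BCD digits (LSD first) from [0, 9] into SD radix-10 digits in [-5, 5].

    Returns d + 1 signed digits, LSD first.
    """
    recoded, ys_prev = [], 0
    for y in y_digits:
        ys = 1 if y >= 5 else 0            # sign signal of this digit
        recoded.append(y + ys_prev - 10 * ys)
        ys_prev = ys
    recoded.append(ys_prev)                # extra top digit, 0 or 1
    return recoded

def multiply_sd_radix10(x, y, d):
    """Multiply two d-digit non-negative integers via SD radix-10 partial products."""
    y_digits = [(y // 10**i) % 10 for i in range(d)]
    multiples = [k * x for k in range(6)]  # 0, X, 2X, 3X, 4X, 5X (the mux inputs)
    product = 0
    for i, yb in enumerate(sd_radix10_recode(y_digits)):
        pp = multiples[abs(yb)]            # mux selects the positive multiple
        if yb < 0:
            pp = -pp                       # hardware: XOR inversion plus a hot one
        product += pp * 10**i
    return product
```

For the multiplier 459 the recoded digits, least significant first, are -1, -4, 5 and 0, so only the multiples X, 4X and 5X are needed and no multiple above 5X ever has to be generated.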

The resultant product is a 2d-digit BCD word given by $P = 2H + S$. S and H must be processed before being added. S is recoded from (4221) to BCD excess-6 (BCD plus 6). The $H \times 2$ multiplication is performed in parallel with the recoding of S. The ×2 operation requires a (4221) to (5421) digit recoder and a 1-bit wired left shift to obtain 2H. Finally, the BCD carry-propagate addition is performed using a quaternary tree (Q-T) adder [14].

Figure 3.1: Combinational SD radix-10 architecture

SD radix-10 recoding

Figure 3.2 shows the block diagram of the generation of a partial product using the SD radix-10 recoding. This recoding transforms a BCD digit $Y_i \in \{0, \ldots, 9\}$ into an SD radix-10 digit $Yb_i \in \{-5, \ldots, 5\}$. The value of the recoded digit $Yb_i$ depends on the value of $Y_i$ and on a signal $ys_{i-1}$ (sign signal) that indicates whether $Y_{i-1}$ is greater than or equal to 5. Thus, the d-digit BCD multiplier Y is recoded into the (d+1)-digit SD radix-10 multiplier

$$Yb = \sum_{i=0}^{d} Yb_i \, 10^i, \qquad Yb_d = ys_{d-1} \in \{0, 1\}.$$

Figure 3.2: Partial product generation for SD radix-10 [2]

Each digit $Yb_i$ generates a partial product $PP[i]$ by selecting the proper multiplicand multiple coded in (4221). This is performed in a similar way to a modified Booth recoding: $Yb_i$ is represented as five hot-one code signals $\{y1_i, y2_i, y3_i, y4_i, y5_i\}$ and a sign bit $ys_i$. These signals are obtained directly from the BCD multiplier digits $Y_i$ using the following logical expressions:

$$ys_i = y_{i,3} \vee y_{i,2} \cdot (y_{i,1} \vee y_{i,0}) \tag{3.2(a)}$$

$$y5_i = y_{i,2} \cdot \overline{y_{i,1}} \cdot (y_{i,0} \oplus ys_{i-1}) \tag{3.2(b)}$$

$$y4_i = ys_{i-1} \cdot y_{i,0} \cdot (y_{i,2} \oplus y_{i,1}) \vee \overline{ys_{i-1}} \cdot y_{i,2} \cdot \overline{y_{i,0}} \tag{3.2(c)}$$

$$y3_i = y_{i,1} \cdot (y_{i,0} \oplus ys_{i-1}) \tag{3.2(d)}$$

$$y2_i = \overline{ys_{i-1}} \cdot \overline{y_{i,0}} \cdot (y_{i,3} \vee \overline{y_{i,2}} \cdot y_{i,1}) \vee ys_{i-1} \cdot \overline{y_{i,3}} \cdot y_{i,0} \cdot \overline{(y_{i,2} \oplus y_{i,1})} \tag{3.2(e)}$$

$$y1_i = \overline{(y_{i,2} \vee y_{i,1})} \cdot (y_{i,0} \oplus ys_{i-1}) \tag{3.2(f)}$$

The symbols $\vee$, $\cdot$ and $\oplus$ denote the Boolean operators OR, AND and XOR, respectively, and an overbar denotes complement. The five hot-one code signals are used as control signals for the 5:1 muxes that select the positive (d+1)-digit multiples {0, X, 2X, 3X, 4X, 5X}. To obtain the partial product, the selected positive multiple is 10's complemented if the $ys_i$ bit is one. This is done simply by a bit inversion of the positive (4221) decimal-coded multiple, using a row of XOR gates controlled by $ys_i$. The addition of one ulp (unit in the last place) is performed by appending a tail-encoded bit $ys_i$ (hot one) to the next significant partial product $PP[i+1]$. To avoid sign extension, and thus reduce the complexity of the partial product reduction tree, the partial product sign bits $ys_i$ are encoded in the two leading digits of each partial product as

$$\left(pp[i]_{d+2},\; pp[i]_{d+1}\right) = \begin{cases} \left(\overline{ys_0},\; ys_0\, ys_0\, ys_0\, ys_0\right) & i = 0 \\ \left(0,\; 1\,1\,1\,\overline{ys_i}\right) & 0 < i < d-1 \\ \left(0,\; 0000\right) & i = d-1 \end{cases} \tag{3.3}$$

Therefore, every partial product $PP[i]$ is (d+3) digits long.

3.3 Generation of multiplicand multiples

Figure 3.3 shows the block diagram for the generation of the positive multiplicand multiples (X, 2X, 3X, 4X, 5X) for the SD radix-10 recoding. All these multiples are coded in (4221). The BCD multiplicand X can easily be recoded to (4221) using the logical expressions

$$(w_{i,3},\, w_{i,2},\, w_{i,1},\, w_{i,0}) = (x_{i,3} \vee x_{i,2},\; x_{i,3},\; x_{i,3} \vee x_{i,1},\; x_{i,0}) \tag{3.4}$$

where $x_{i,j}$ and $w_{i,j}$ are the bits of the BCD and (4221) representations of X, respectively. The multiples are generated as follows:

Multiple 2X: Each BCD digit is first recoded to the (5421) decimal coding shown in Table 3.1. An L1 shift is applied to the recoded multiplicand, giving the 2X BCD multiple. Then, using expression (3.4), the 2X BCD multiple is recoded to (4221).

Multiple 4X: It is obtained as 2X × 2, where the 2X multiple is coded in (4221). The second ×2 operation is performed by recoding from (4221) to (5211) followed by an L1 shift.

Figure 3.3: Generation of multiplicand multiples for SD radix-10. [2]

Multiple 5X: Starting from the multiplicand coded in (4221), it is obtained by an L3shift, which produces digits coded in (5211). These digits are then recoded from (5211) to (4221), as shown in Figure 3.4.

Multiple 3X: It is obtained by a carry-propagate addition of the BCD multiples X and 2X in a d-digit BCD adder (implemented as a quaternary-tree decimal adder [21]). The BCD sum digits are then recoded into (4221) using Expression (3.4).

Figure 3.4: Calculation of 5X for decimal operands coded in (4221): X (4221), L3shift, 5X (5211), digit recoding, 5X (4221).
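The generation of the multiples can be checked at the digit level with a short Python sketch. It is a behavioural model only, not the HDL of this work: the function names, the little-endian digit lists and the `TO_5211` table (one valid (5211) vector per digit value) are assumptions made for illustration. For brevity, 2X is computed here with the same (5211) recoding and L1shift used for 4X instead of the (5421) route, and 3X is formed by plain integer addition rather than a BCD carry-propagate adder.

```python
W4221, W5211 = (4, 2, 2, 1), (5, 2, 1, 1)

# One (5211) vector per digit value. Any valid choice works arithmetically;
# this table is an assumption and is not copied from Table 3.2.
TO_5211 = {0: (0, 0, 0, 0), 1: (0, 0, 0, 1), 2: (0, 1, 0, 0), 3: (0, 1, 0, 1),
           4: (0, 1, 1, 1), 5: (1, 0, 0, 0), 6: (1, 0, 0, 1), 7: (1, 1, 0, 0),
           8: (1, 1, 0, 1), 9: (1, 1, 1, 1)}


def value(bits, weights):
    """Decimal value of one 4-bit digit under the given weights."""
    return sum(b * w for b, w in zip(bits, weights))


def number(digits, weights):
    """Value of a little-endian list of coded digits."""
    return sum(value(d, weights) * 10**i for i, d in enumerate(digits))


def bcd_to_4221(digit):
    """Equation (3.4): BCD bits (x3, x2, x1, x0) to (4221) bits (w3, w2, w1, w0)."""
    x3, x2, x1, x0 = (digit >> 3) & 1, (digit >> 2) & 1, (digit >> 1) & 1, digit & 1
    return (x3 | x2, x3, x3 | x1, x0)


def encode_4221(x, d):
    """Multiplicand X as d little-endian (4221) digits."""
    return [bcd_to_4221((x // 10**i) % 10) for i in range(d)]


def times2(digits_4221):
    """x2: recode each digit to (5211), then shift the operand left by one bit."""
    out, carry_in = [], 0
    for h in digits_4221:
        w3, w2, w1, w0 = TO_5211[value(h, W4221)]
        out.append((w2, w1, w0, carry_in))  # weights (2,1,1) become (4,2,2)
        carry_in = w3                       # weight 5 becomes 10: next digit
    return out + [(0, 0, 0, carry_in)]


def times5(digits_4221):
    """x5: a 3-bit left shift of (4221) digits yields digits coded in (5211)."""
    out, low = [], (0, 0, 0, 0)
    for w in digits_4221 + [(0, 0, 0, 0)]:
        # Bit w0 stays in this digit with weight 5; bits w3, w2, w1 of the
        # lower digit arrive with weights 2, 1, 1.
        out.append((w[3], low[0], low[1], low[2]))
        low = w
    return out


def multiples(x, d):
    """Values of X, 2X, 3X, 4X, 5X obtained through the coded operations."""
    x1 = encode_4221(x, d)
    x2 = times2(x1)
    n1, n2 = number(x1, W4221), number(x2, W4221)
    return {1: n1, 2: n2, 3: n1 + n2,
            4: number(times2(x2), W4221), 5: number(times5(x1), W5211)}
```

Calling `multiples(x, d)` for a d-digit multiplicand returns a dictionary keyed by the multiple, which makes it easy to compare each coded result with the plain product `k * x`.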

Table 3.2: Selected decimal codes for the recoded digits (columns: $Z_i$, $Z_i$ (4221s), $Z_i$ (5211s)).

Implementation of Digit Recoders

The design of efficient digit recoders is a key issue, due to their high impact on the area and performance of the entire multiplier. Digit recoders are used to compute the decimal multiplicand multiples (Section 3.3) and, in the reduction of partial products (Section 3.4), to compute $\times 2^n$ ($n > 0$) operations. The logical implementation of digit recoders for BCD, BCD excess-6, and (5421) decimal codes is straightforward, since there is just a one-to-one mapping of decimal digits onto these codes (every decimal digit has a single 4-bit representation). On the other hand, because of the redundancy of the (4221) and (5211) decimal codes, there are several choices for the digit recoding to (4221) or (5211). The sixteen 4-bit vectors of one coding can be mapped (recoded) into different subsets of 4-bit vectors of the other decimal coding that represent the same decimal digit. These subsets of the (4221) and (5211) codes are also decimal codings. Among all of the subsets considered, the non-redundant decimal codes (4221s) and (5211s) (subsets of ten 4-bit vectors), shown in Table 3.2, present interesting properties. Specifically, these codes verify

$2Z_i(4221s) = L1_{shift}[Z_i(5211s)]$ (3.5)

That is, after shifting an operand Z represented in (5211s) 1 bit to the left, the resultant bit vector represents the decimal value of 2Z coded in (4221s). This simplifies the implementation of $\times 2^n$ operations for $n > 1$. Specifically, for a decimal operand Z coded in (4221), $Z \times 2^n$ is implemented by a first level of $Z_i(4221)$ to $Z_i(5211s)$ digit

recoders followed by $n - 1$ levels of $Z_i(4221s)$ to $Z_i(5211s)$ digit recoders. The output of every level of digit recoders is shifted 1 bit to the left, so that the most significant bit of each (5211s) digit (weight 5) is shifted out to the next decimal position (weight 10). In some cases the first x2 can also be simplified. Specifically, the recoding given by Expression (3.4) maps the BCD representation into the subset (4221s). Consequently, the subsequent x2 operations in Figure 3.3 are implemented using a level of simpler (4221s) to (5211s) digit recoders. A (4221) to (5211s) digit recoder has a hardware complexity of around 27 NAND2 gates, and its critical path has roughly the delay of a full adder. The (4221s) to (5211s) digit recoder has a simpler hardware complexity (around 19 NAND2 gates) with 25 percent less latency. Moreover, the inverse digit recoding (from (5211) to (4221)) is efficiently implemented using a single full adder, since

$Z_i(5211) = z_{i,3} \cdot 4 + z_{i,2} \cdot 2 + z^*_{i,1} \cdot 2 + z^*_{i,0}$ (3.6)

with $z^*_{i,1} \cdot 2 + z^*_{i,0} = z_{i,3} + z_{i,1} + z_{i,0}$. This recoder is used to produce the x5 multiple for the (4221) coding and in mixed (4221/5211) multioperand CSAs (Section 5.5) to convert a (5211) decimal-coded operand into the equivalent (4221) coded one.

3.4 Partial Product Reduction

The partial product arrays are produced by the SD radix-10 encoding. Every column of p digits is compressed to two digits by using a decimal digit p:2 CSA tree. This section presents the technique for efficient decimal carry-save addition: the CSAs, or full adders, operate on (4221) and (5211) decimal-coded operands instead of BCD, which requires the operands to be recoded into these decimal codes. Decimal digit adders are also introduced in Section 3.4.2 to reduce the delay. These digit adders reduce 9 digits coded in (4221) or (5211) to 4 digits coded in (4221). Finally, the proposed decimal p:2 CSA trees are applied to the SD radix-10 architecture.

Partial Product Arrays

As described in Section 3.2.1, the SD radix-10 architecture generates d+1 partial products coded in (4221), each d+3 digits long. These partial products $PP[i]$ are aligned according to their decimal weights by 4i-bit left shifts ($PP[i] \times 10^i$). The resultant partial product array for


16-digit input operands is shown in Figure 3.5. In this case, the number of digits to be reduced varies from p = 17 to p = 2. In particular, the highest columns can be reduced with the area-optimized or delay-optimized decimal 17:2 CSA trees presented later in this chapter. Every carry output of a 3:2 CSA is multiplied by 2 before going to the input of another 3:2 CSA. The x2 blocks are connected to the fast inputs of the next 3:2 CSA, because the carry path is longer than the sum path. When the carry outputs are summed before being multiplied by 2, several x2 multiplications are performed in series at the end. Both structures in Figure 3.7 have the same critical path delay, irrespective of wiring issues.

S: Sign encoding
H: Hot-one 10's complement encoding
X: Regular (4221) digit
F: Extra digit position to support the width of the multiplicand multiples

Figure 3.5: Partial product arrays generated for 16-digit operands in the SD radix-10 architecture
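The alignment of the partial products by 4i-bit left shifts can be illustrated with a small Python sketch. It assumes that each (4221) digit is stored as one 4-bit nibble of an integer, least significant digit first; the function names are hypothetical and the packing uses Equation (3.4).

```python
W4221 = (4, 2, 2, 1)


def pack_4221(x, d):
    """Pack the d decimal digits of x into an int, 4 bits per (4221) digit."""
    word = 0
    for i in range(d):
        digit = (x // 10**i) % 10
        x3, x2, x1, x0 = (digit >> 3) & 1, (digit >> 2) & 1, (digit >> 1) & 1, digit & 1
        nibble = ((x3 | x2) << 3) | (x3 << 2) | ((x3 | x1) << 1) | x0
        word |= nibble << (4 * i)
    return word


def unpack_4221(word):
    """Decimal value of a packed (4221) operand."""
    total, i = 0, 0
    while word:
        nibble = word & 0xF
        total += sum(((nibble >> (3 - j)) & 1) * w for j, w in enumerate(W4221)) * 10**i
        word >>= 4
        i += 1
    return total


def align(pp_word, i):
    """A 4i-bit left shift moves every digit up by i decimal positions."""
    return pp_word << (4 * i)
```

Because every digit occupies exactly four bits, `align(word, i)` multiplies the decimal value by $10^i$ without touching the digit codes themselves.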

3.4.2 Method for Carry Save Addition

Among the decimal codes defined by Expression (3.1), there is a family of codes suitable for simple carry-save addition. In this family the sum of the weight bits is 9, that is,

$\sum_{j=0}^{3} r_j = 9$ (3.7)

which includes the (5211), (4221), (3321) and (4311) codes shown in Table 1. Some of these decimal codings have already been used, in a different context, to design components for decimal carry-save addition. They are redundant codes, since two or more different 4-bit vectors may represent the same decimal digit. These codes exhibit the following properties:

All sixteen 4-bit vectors represent a decimal digit $Z_i \in [0, 9]$. Hence, any Boolean function (AND, OR, XOR, ...) operating over the 4-bit vector representations of two or more input digits produces a 4-bit vector that represents a valid decimal digit.

The 9's complement of a digit $Z_i$ can be obtained as its 1's complement, that is, by inverting its bits, since

$9 - Z_i = \sum_{j=0}^{3} r_j - \sum_{j=0}^{3} z_{i,j}\, r_j = \sum_{j=0}^{3} (1 - z_{i,j})\, r_j = \sum_{j=0}^{3} \overline{z_{i,j}}\, r_j$ (3.8)

Negative operands can be obtained by inverting the bits of the positive bit-vector representation and adding 1 ulp, that is,

$-Z(r_3 r_2 r_1 r_0) = \overline{Z}(r_3 r_2 r_1 r_0) + 1$ (3.9)

Hence fast decimal carry-save addition can be performed by using a conventional 4-bit binary 3:2 CSA as

$A_i + B_i + C_i = \sum_{j=0}^{3} (a_{i,j} + b_{i,j} + c_{i,j})\, r_j = \sum_{j=0}^{3} s_{i,j}\, r_j + 2 \sum_{j=0}^{3} h_{i,j}\, r_j = S_i + 2 H_i$ (3.10)

with $(r_3 r_2 r_1 r_0) \in \{(4221), (5211), (4311), (3321)\}$, where $s_{i,j}$ and $h_{i,j}$ are the sum and carry bits of a full adder, and $H_i \in [0, 9]$ and $S_i \in [0, 9]$ are the decimal carry and sum digits at position i. No decimal correction is needed, because the 4-bit vectors of $H_i$ and $S_i$ represent valid decimal digits in the selected coding.

However, a decimal multiplication by 2 is needed before the carry digit $H_i$ can be used in later processing. The analysis of decimal carry-save addition is restricted here to the (5211) and (4221) decimal codes, since the generation of multiples of two for operands coded in (4311) and (3321) is more difficult. Figure 3.6 shows an example of x2 multiplication for decimal operands coded in (4221) and (5211). To simplify the notation, H and W denote the carry vector coded in (4221) and (5211), respectively. Then the following equation holds:

$2H = L1_{shift}[W]$ (3.11)

Figure 3.6: Calculation of x2 for decimal operands coded in (4221) and (5211)


The bit vector that results from shifting W 1 bit to the left represents the double of H. The operand 2H is coded in (4221), since the weight bits of W are multiplied by 2 after the 1-bit left shift. The multiplication 2H is thus performed by a digit recoding of H into W followed by $L1_{shift}[W]$. The bits of $W_i$ are denoted by $w_{i,j}$. The bit shifted out ($w_{i,3}$) represents a decimal carry out (weight 10) to the next digit position, while the bit shifted in ($w_{i-1,3}$) is a decimal carry input (weight 1). In this way, the digits of 2H are given by

$(2H)_i = w_{i,2} \cdot 4 + w_{i,1} \cdot 2 + w_{i,0} \cdot 2 + w_{i-1,3}$ (3.12)

Figure 3.7: Proposed decimal 3:2 CSA for operands coded in (4221). (a) Decimal digit (4-bit) 3:2 CSA. (b) Full adder. [2]

Figure 3.7(a) shows the implementation of a decimal 3:2 CSA for digits coded in (4221) using a 4-bit binary 3:2 CSA. The weight bits in Figure 3.7(a) are placed in parentheses above every column. The 4-bit binary 3:2 CSA adds three decimal digits ($A_i$, $B_i$, $C_i$), coded in (4221), and produces a decimal sum digit ($S_i$) and a carry digit ($H_i$) coded in (4221), such that $A_i + B_i + C_i = S_i + 2 \times H_i$.
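The relation $A_i + B_i + C_i = S_i + 2 \times H_i$ and the 9's complement property of Equation (3.8) can be checked exhaustively for the four codes whose weights sum to 9. The Python sketch below is only a bit-level model with hypothetical names; bit tuples are ordered from the most significant weight to the least significant one.

```python
from itertools import product

CODES = {"4221": (4, 2, 2, 1), "5211": (5, 2, 1, 1),
         "4311": (4, 3, 1, 1), "3321": (3, 3, 2, 1)}


def value(bits, weights):
    """Decimal value of a 4-bit vector under the given weights."""
    return sum(b * w for b, w in zip(bits, weights))


def csa_3to2(a, b, c):
    """Row of four binary full adders: returns (sum bits, carry bits)."""
    s = tuple(x ^ y ^ z for x, y, z in zip(a, b, c))
    h = tuple((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
    return s, h


def check_code(weights):
    """True if Equations (3.8) and (3.10) hold for every 4-bit vector."""
    vectors = list(product((0, 1), repeat=4))
    for a in vectors:
        inverted = tuple(1 - bit for bit in a)
        if value(inverted, weights) != 9 - value(a, weights):
            return False
    for a, b, c in product(vectors, repeat=3):
        s, h = csa_3to2(a, b, c)
        total = value(a, weights) + value(b, weights) + value(c, weights)
        if total != value(s, weights) + 2 * value(h, weights):
            return False
    return True
```

For example, with (4221) operands 5 = (1,0,0,1), 4 = (0,1,1,0) and 9 = (1,1,1,1), the adder row returns S = (0,0,0,0) and H = (1,1,1,1), that is, 18 = 0 + 2 x 9.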


Figure 3.8: Proposed decimal 3:2 CSA for input operands coded in (5211). (a) Output operands coded in (5211). (b) Mixed (5211)/(4221) coded output operands. [2]

Decimal Digit Adders using Bit Counters

This section introduces a group of fast decimal digit adders that reduce 9, 8, or 7 digits coded in (4221) or (5211) into 4 or 3 digits. These digit adders consist of a row of bit counters. Each bit counter sums a column of up to p = 9 bits (of the same weight) and produces a q-bit vector ($q = [\log_2 p] \le 4$) with weights (4221), which represents a decimal digit $Z_i \in [0, 9]$. Figure 3.9(a) shows an implementation of a bit counter that sums up to 9 bits and delivers a (4221) digit, using two levels of binary full adders. The binary weight of every output is shown in parentheses. Depending on the input, the path delay varies roughly from 2 to 4 XOR gate delays for output (1), from 2 to 3 XORs for output (2), and is around 2 XORs for output (4). The 8-bit counter of Figure 3.9(b) only sums up to 8 bits, but has a critical path delay similar to that of a binary 4:2 CSA (3 XOR gate delays for output (1)). Basically, the first two levels of half adders (HA) and the two OR gates perform the

process

$Q_0 = q_{0,2} \cdot 4 + q_{0,1} \cdot 2 + q_{0,0} = \sum_{k=0}^{3} x_k$ (3.13)

$Q_1 = q_{1,2} \cdot 4 + q_{1,1} \cdot 2 + q_{1,0} = \sum_{k=4}^{7} x_k$ (3.14)

Since $Q_0, Q_1 \in [0, 4]$, the total sum $Z_i(4221) = Q_1 + Q_0 \in [0, 8]$ is implemented in a simple way in the final logic level of Figure 3.9(b) as

$z_{i,3} = q_{1,2} \lor q_{0,2} \lor q_{1,1} \cdot q_{0,1}$
$z_{i,2} = q_{1,2} \cdot q_{0,2} \lor q_{1,0} \cdot q_{0,0}$
$z_{i,1} = q_{1,2} \cdot q_{0,2} \lor (q_{1,1} \oplus q_{0,1})$
$z_{i,0} = q_{1,0} \oplus q_{0,0}$ (3.15)

Figure 3.9(c) shows a conventional 7:3 binary counter, which reduces 7 bits into a 3-bit vector with weights (421). These counters can be used to reduce 9 or 8 decimal digits (coded in (4221) or (5211)) into 4, or 7 digits into 3. An example of this procedure is shown in Figure 3.10 for nine input operands coded in (5211). The procedure is similar for (4221) input operands. A row of four (4221) decimal counters of Figure 3.9(a) sums the values of every bit column, producing a (4221) digit per bit column. The four bits of each (4221) digit are placed in a column from the most significant (top) to the least significant (bottom) and aligned in four rows according to the weight bit of their column ($5 \times 10^i$, $2 \times 10^i$, $1 \times 10^i$, $1 \times 10^i$). This produces 4 decimal digits coded in (5211), each of which must be multiplied by a different factor (given by the binary weights of (4221)) before being added. This organization causes all the bits of an output operand to have the same latency. The x2 and x4 blocks are implemented as described in the section on carry-save addition. The 9:4 decimal digit adder is represented by the labeled box in Figure 3.12, where the multiplicative factor of every output is shown in parentheses. The 8:4 and 7:3 decimal digit adders are implemented using a row of four counters of Figures 3.9(b) and 3.9(c), respectively. On the other hand, the 4-bit decimal 3:2 CSA presented above is a 3:2 decimal digit adder implemented using a row of 3-bit counters (a level of full adders) with the carry output multiplied by a factor x2.
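The 9:4 reduction can be modelled at the bit level with a few lines of Python. This is a sketch under stated assumptions: the input digits are (5211) bit tuples ordered from the weight-5 bit to the last weight-1 bit, the counter output is coded with the (4221) mapping of Equation (3.4) (one valid choice among several), and all names are hypothetical.

```python
W4221, W5211 = (4, 2, 2, 1), (5, 2, 1, 1)


def value(bits, weights):
    """Decimal value of a 4-bit vector under the given weights."""
    return sum(b * w for b, w in zip(bits, weights))


def count_to_4221(n):
    """Code a bit count 0..9 as a (4221) digit, using the mapping of Equation (3.4)."""
    x3, x2, x1, x0 = (n >> 3) & 1, (n >> 2) & 1, (n >> 1) & 1, n & 1
    return (x3 | x2, x3, x3 | x1, x0)


def reduce_9_to_4(digits):
    """digits: up to nine (5211) bit tuples.

    Returns four (5211) digits, one per multiplicative factor 4, 2, 2, 1."""
    # One bit counter per bit column (weights 5, 2, 1, 1).
    counts = [count_to_4221(sum(d[j] for d in digits)) for j in range(4)]
    # Output k collects bit k of every column count. The columns keep their
    # (5211) weights, so each output is again a (5211)-coded digit.
    return [tuple(counts[j][k] for j in range(4)) for k in range(4)]


def total(outputs):
    """Sum of the four outputs, each scaled by its (4221) factor."""
    return sum(f * value(o, W5211) for f, o in zip(W4221, outputs))
```

With nine input digits equal to 9, every column count is 9, each output digit is (1,1,1,1), and `total` gives 81 = 9 x (4 + 2 + 2 + 1), which matches the sum of the inputs.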

Figure 3.9: Gate-level implementation of digit counters. (a) (4221) counter for 9 bits. (b) (4221) counter for 8 bits. (c) 7:3 binary counter. [2]



Figure 3.10: 9:4 reduction of (5211) decimal-coded operands. [2]

3.4.4 Decimal p:2 CSA Trees for Digits Coded in (4221)

A decimal digit p:2 CSA tree reduces p input digits $Z_i[l]$ coded in (4221) into two decimal digits $S_i$ and $H_i$ (with weight $10^i$). Moreover, several decimal carry outputs are generated to the next significant decimal position ($10^{i+1}$), and a certain number of decimal carry inputs come from the previous position ($10^{i-1}$). These decimal p:2 CSA trees are designed as follows.

Figure 3.11: Decimal 6:2 CSA for (4221) decimal-coded operands. (a) Digit column. (b) Implementation of a x2 block. [2]

For p < 7, the input digits $Z_i[l]$ are reduced in a first level of binary 3:2 CSAs. Every carry output digit is multiplied by 2 before being reduced in the next level of the binary 3:2 CSA tree. Each x2 operation produces a decimal

carry output to the next significant digit column of the partial product array. The slowest outputs are connected to the fast inputs of the next binary 3:2 CSA level to balance the total delay of the different paths (an F indicates a fast input). Figure 3.11(a) shows an implementation of a decimal 6:2 CSA (the multiplicative factor associated with every signal is given in parentheses). The full adder configuration of Figure 3.7(b) is used to minimize the critical path delay of the CSA tree. The digit blocks labeled x2 consist of a (4221) to (5211s) digit recoder whose outputs (for (4221)-coded operands) are shifted 1 bit to the left, as shown in Figure 3.11(b). The most significant output bit ($w_{i,3}$) represents a decimal carry to the next digit column. To simplify the diagrams of the different decimal p:2 CSA trees, the carries passed between adjacent digit columns ($w_{i,3}$, $w_{i-1,3}$) are not shown. The carry output $H_i$ must be multiplied by 2 before being added to the sum output $S_i$.

For p >= 7, different strategies are followed to obtain area-optimized or delay-optimized implementations. For the area-optimized implementation, the input digits $Z_i[l]$ are reduced in a first level of binary 3:2 CSAs. Every intermediate operand is associated with a multiplicative factor that is a power of 2. Operands with the same factor are reduced in a binary 3:2 CSA before being multiplied by this factor, that is,

$2^n A + 2^n B + 2^n C = 2^n (A + B + C) = 2^n S + 2^{n+1} H$ (3.16)

This reduces the hardware complexity, since the overall number of x2 operations is reduced. An area-optimized decimal 17:2 CSA tree for operands coded in (4221) is shown in Figure 3.12(a). The x4 and x8 digit blocks produce two and three decimal carry outs, respectively, to the next significant digit column of the partial product array. The figure shows two neighboring digit columns to illustrate how decimal carries are passed (parallel connections). For the delay-optimized implementations, the input digits $Z_i[l]$ are reduced in a first level of the decimal digit adders based on bit counters. These adders reduce 9 or 8 digits coded in (4221) or (5211) to 4 digits, and 7 digits to 3. The output digits, coded in (4221), may have multiplicative factors of (x4), (x2), or (x1).
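The grouping idea behind Equation (3.16) can be sketched in Python for a single digit column. The model below is an illustration, not the 17:2 tree of Figure 3.12: it keeps the factor $2^n$ of every operand symbolic (so the decimal carries that the x2 blocks pass between columns are not modelled), it stops when no factor has three operands left, and the function names are hypothetical.

```python
from collections import defaultdict

W4221 = (4, 2, 2, 1)


def value(bits):
    """Decimal value of a (4221)-coded digit."""
    return sum(b * w for b, w in zip(bits, W4221))


def csa_3to2(a, b, c):
    """Row of four full adders: A + B + C = S + 2H for (4221)-coded digits."""
    s = tuple(x ^ y ^ z for x, y, z in zip(a, b, c))
    h = tuple((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
    return s, h


def reduce_column(digits):
    """Compress operands that share the same factor 2**n, as in Equation (3.16).

    Returns ({n: operands with factor 2**n}, number of 3:2 CSAs used)."""
    groups = defaultdict(list)
    groups[0] = list(digits)
    csa_count = 0
    while True:
        ready = [n for n, ops in groups.items() if len(ops) >= 3]
        if not ready:
            break
        n = min(ready)
        a, b, c = (groups[n].pop() for _ in range(3))
        s, h = csa_3to2(a, b, c)
        groups[n].append(s)      # the sum keeps the factor 2**n
        groups[n + 1].append(h)  # the carry takes the factor 2**(n+1)
        csa_count += 1
    return {n: ops for n, ops in groups.items() if ops}, csa_count


def weighted_total(groups):
    """Sum of all remaining operands, each scaled by its factor 2**n."""
    return sum(2**n * value(op) for n, ops in groups.items() for op in ops)
```

Since every 3:2 CSA removes exactly one operand, the number of CSAs used equals the number of input digits minus the number of operands left, and the weighted total is preserved at every step.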

Figure 3.12: Proposed decimal 17:2 CSAs. (a) Area-optimized tree. (b) Delay-optimized tree. [2]

The critical path delay is reduced by balancing the delay of the different paths. For this reason, the intermediate operands with higher multiplicative factors are multiplied in parallel with the reduction of the other intermediate operands using binary 3:2 CSAs. The delay-optimized 17:2 CSA tree in Figure 3.12(b) has a higher hardware complexity (equivalent to two more x2 blocks), but its critical path is slightly faster (by around 1 XOR delay). Its delay is around six levels of binary 3:2 CSAs and three levels of digit recoders. The blocks labeled 9:4 and 8:4 represent the decimal digit adders.
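As a value-level check of the SD radix-10 recoding of Section 3.2, the following Python sketch evaluates the selection signals of Equations (3.2(a)) to (3.2(f)) and verifies that the recoded digits reproduce the multiplier. It is a behavioural model under assumptions: the incoming sign of the lowest digit is taken as $ys_{-1} = 0$, the complemented terms are written so that the signals match the digit set {-5, ..., 5}, and the function names are hypothetical.

```python
def sd_radix10_signals(y, ys_prev):
    """y: BCD digit 0..9, ys_prev: sign bit of the lower digit.

    Returns (ys, (y1, y2, y3, y4, y5)) with at most one hot signal."""
    y3, y2, y1, y0 = (y >> 3) & 1, (y >> 2) & 1, (y >> 1) & 1, y & 1
    inv = lambda b: 1 - b
    ys = y3 | (y2 & (y1 | y0))
    s5 = y2 & inv(y1) & (y0 ^ ys_prev)
    s4 = (ys_prev & y0 & (y2 ^ y1)) | (inv(ys_prev) & y2 & inv(y0))
    s3 = y1 & (y0 ^ ys_prev)
    s2 = (inv(ys_prev) & inv(y0) & (y3 | (inv(y2) & y1))) | \
         (ys_prev & inv(y3) & y0 & inv(y2 ^ y1))
    s1 = inv(y2 | y1) & (y0 ^ ys_prev)
    return ys, (s1, s2, s3, s4, s5)


def sd_recode(y, d):
    """Recode a d-digit multiplier into d+1 signed digits, least significant first."""
    digits, ys_prev = [], 0  # assumption: no incoming sign at the lowest digit
    for i in range(d):
        ys, hot = sd_radix10_signals((y // 10**i) % 10, ys_prev)
        magnitude = sum(k * h for k, h in zip(range(1, 6), hot))
        digits.append(-magnitude if ys else magnitude)
        ys_prev = ys
    digits.append(ys_prev)  # top digit is ys of the most significant BCD digit
    return digits


def multiply(x, y, d):
    """Sum of the d+1 partial products Yb_i * X * 10**i."""
    return sum(yb * x * 10**i for i, yb in enumerate(sd_recode(y, d)))
```

For instance, `sd_recode(95, 2)` gives the digits -5, 0, 1 (least significant first), since 95 = 100 + 0 x 10 - 5.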

CHAPTER 4
FPGA IMPLEMENTATION USING XILINX

This chapter presents FPGA concepts and the FPGA synthesis flow. An FPGA is a device that comprises thousands or even larger numbers of transistors connected to implement logic functions. FPGAs implement functions ranging from simple addition and subtraction to complex digital filtering and error detection and correction.

4.1 Introduction to FPGA

A field programmable gate array (FPGA) is a semiconductor device that can be configured by the designer or the customer after manufacturing, hence the name field programmable. FPGAs combine the benefits of both hardware and software. They are programmed with a logic circuit diagram or with source code in a Hardware Description Language (HDL) that determines how the chip will work. They may be used to perform any logical function that an Application Specific Integrated Circuit (ASIC) might perform, but the capacity to update the functionality after shipping provides advantages for many applications. FPGAs contain programmable logic components, also called logic blocks, and a hierarchy of reconfigurable interconnects that permit the blocks to be wired together like a one-chip programmable breadboard. Like dedicated hardware, FPGAs perform calculations spatially and simultaneously, computing a large number of operations in resources distributed across a silicon chip. Such systems can be much faster than microprocessor-based designs, by up to thousands of times for suitable workloads. However, unlike in ASICs, these computations are programmed into the chip rather than permanently frozen by the manufacturing process. This means that an FPGA-based design can be programmed and reprogrammed a large number of times.

4.2 FPGA Technology Trends

The common trend is bigger and faster.

This is achieved by increasing device density through ever smaller fabrication process technology. New generations of FPGAs are geared towards implementing entire systems on a single device. Features such as RAM, clock management, dedicated arithmetic hardware and transceivers are available in addition to the main programmable logic. FPGAs are also available with embedded processors.

4.3 XILINX Specifics

All Xilinx FPGAs consist of the following basic resources:
1) Configurable logic blocks (CLBs).
2) Input/output blocks (IOBs).
3) RAM blocks.
4) Programmable interconnections (PIs).
5) Other resources such as clock buffers, three-state buffers and boundary scan logic.

Figure 4.1: Look-up table implemented as (a) Memory (b) Multiplexers and Memory

Configurable Logic Blocks

The main building block of Xilinx CLBs is the slice. Spartan III holds four slices per CLB. Each slice contains two 4-input function generators (F/G), two storage elements and carry logic. Each function generator output drives both the CLB output and the D-input of a flip-flop. The look-up tables and storage elements of the CLB have the following characteristics:

1) Look-Up Tables
The way logic functions are implemented in an FPGA is another key feature. The logic blocks that carry out logical functions are look-up tables (LUTs), implemented as memory, or as multiplexers and memory. Figure 4.1 shows these alternatives, together with an example of memory contents for some basic operations. A $2^n \times 1$ ROM can implement any n-bit function. Typical sizes for n are 2, 3, 4, or 5. In Figure 4.1(a), an n-bit LUT is implemented as a $2^n \times 1$ memory; the input address selects one of $2^n$ memory locations. The memory locations are normally loaded with values from the user's configuration bit stream. In Figure 4.1(b), the multiplexer control inputs are the LUT inputs. The result is a general-purpose logic gate. An n-LUT can implement any n-bit function.

2) Storage Elements
The storage elements in a slice can be configured either as edge-triggered D-type flip-flops or as level-sensitive latches. The D-input can be driven either by the function generators within the slice or directly from the slice inputs, bypassing the function generators. As well as clock and clock enable signals, each slice also has synchronous set and reset signals.

Input/output Blocks

The Xilinx IOB includes inputs and outputs that support a wide variety of I/O signaling standards. The IOB storage elements act either as D-type flip-flops or as latches. For each flip-flop, the set/reset (SR) signals can be independently configured as synchronous set, synchronous reset, asynchronous preset, or asynchronous clear. Pull-up and pull-down resistors and an optional weak-keeper circuit can be attached to each pad. IOBs are programmable and can be categorized as follows:

1) Input Path
A buffer in the IOB input path routes the input signals either directly to the internal logic or through an optional input flip-flop.

2) Output Path
The output path includes a 3-state output buffer that drives the output signal onto the pad. The output signal can be routed to the buffer directly from the internal logic or through an optional IOB output flip-flop. The 3-state control of the output can also be routed directly from the internal logic or through a flip-flop that provides synchronous enable and disable signals.

3) Bidirectional Block
This can be any combination of input and output configurations.

RAM Blocks

Xilinx FPGAs incorporate several large RAM memories (block SelectRAM). These memory blocks are organized in columns along the chip. The number of blocks ranges from 8 up to more than 100, depending on the device size and family.

Programmable Routing

Adjacent to each CLB stands a general routing matrix (GRM). The GRM is a switch matrix through which resources are connected; the GRM is also the means by which the CLB gains access to the general-purpose routing. Horizontal and vertical routing resources for each row or column include:
Long Lines: Bidirectional wires that distribute signals across the device. Vertical and horizontal long lines span the full height and width of the device.
Hex Lines: Route signals to every third or sixth block away in all four directions.
Double Lines: Route signals to every first or second block away in all four directions.
Direct Lines: Route signals to neighboring blocks vertically, horizontally and diagonally.
Fast Lines: Internal CLB local interconnections from LUT outputs to LUT inputs.

4.4 FPGA Implementation Using XILINX

The FPGA used for the implementation of the circuit is the Xilinx Spartan 6E (Family), XC3S5000 (Device). The working environment for the design is Xilinx ISE 14.2, which is used for the FPGA design flow of the VHDL code.

Overview of FPGA Design Flow

As the FPGA architecture evolves, its complexity increases. Today, most FPGA vendors provide a fairly complete set of design tools that allows automatic synthesis and compilation from design specifications in hardware specification languages, such as Verilog or VHDL, all the way down to a bit stream to program FPGA chips. A typical FPGA design flow includes the steps and components shown in the following figure.

Figure 4.2: FPGA Design Flow

In some cases, delays between some specific pairs of registers may be constrained. The second design input component is the choice of FPGA device. Each FPGA vendor typically provides a wide range of FPGA devices, with different performance, cost and power tradeoffs. The selection of the target device may be an iterative process. The designer may start with a small (low capacity) device with a nominal speed grade. If the synthesis effort fails to map the design into the target device, the designer has to upgrade to a higher-capacity device. Similarly, if the synthesis result fails to meet the operating frequency, the designer has to upgrade to a device with a higher speed grade. In both cases, the cost of the FPGA device will increase, in some cases by 50% or even by 100%. This clearly underscores the need for better synthesis tools, since their quality directly impacts the performance and cost of FPGA designs [25].

The FPGA implementation of designs is done to check their functionality on actual hardware. The implementation cost and design cycle time of ASICs are large; therefore bigger designs are first checked on an FPGA, and if they give satisfactory results, their ASIC implementation is done. Figure 4.2 shows the different steps involved in the FPGA implementation. It includes design entry through HDL, behavioral simulation, logic synthesis, design implementation,

bit stream generation and finally programming the FPGA. Xilinx provides all the functions necessary for implementing a design on an FPGA. The Xilinx ISE suite includes the ISIM simulator for functional and timing simulation, XST for synthesis, the XPower Analyzer for estimating power and the ChipScope Pro analyzer for debug and verification purposes. The different steps involved in FPGA implementation using Xilinx are described below.

Design Entry

The basic architecture of the system is designed in this step and coded in a hardware description language such as Verilog or VHDL. A design module is split into two parts, each of which is called a design unit in Verilog. The module declaration represents the external interface to the design module. The module internals represent the internal description of the design module: its behavior, its structure, or a mixture of both.

Behavioral Simulation

After the design phase, a test bench waveform containing input stimulus is created to verify the functionality of the Verilog code module using simulation software, i.e. ModelSim ISE, for different inputs. If the generated outputs are verified, the design proceeds further; otherwise the necessary modifications and corrections are made in the HDL code. This is called behavioral simulation.

Design Synthesis

After correct simulation results are obtained, the design is synthesized. During synthesis, the Xilinx ISE tool performs the following operations:

HDL Compilation: The tool compiles all the sub-modules in the main module, if any, and then checks the syntax of the code written for the design.

1) Design Hierarchy Analysis: Analysis of the hierarchy of the design.

2) HDL Synthesis: The process that translates VHDL or Verilog code into a device netlist format, i.e. a complete circuit with logical elements such as multiplexers, adders/subtractors, counters, registers, flip-flops, latches, comparators, XORs, decoders, etc. for the design. If the design contains more than one sub-design, for example when a processor is implemented with a CPU as one design element and RAM as another, the synthesis process generates a netlist for each design element. The synthesis process checks the code syntax and analyzes the hierarchy of the design, which ensures that the design is optimized for the architecture the designer has selected.

The resulting netlist is saved to an NGC (Native Generic Circuit) file (for Xilinx Synthesis Technology (XST)). Figure 4.3 shows the complete process of synthesis from HDL code to NGC file generation.

Figure 4.3: Steps in synthesis process (HDL code, HDL parsing, HDL synthesis, low-level optimization, NGC file)

3) Advanced HDL Synthesis (low-level synthesis): The blocks synthesized in the HDL synthesis step are further defined in terms of low-level blocks such as buffers and look-up tables. The tool then generates a netlist file (NGC file) and optimizes it. The final netlist output file has the extension NGC. This NGC file contains both the design data and the constraints.

Design Implementation

The design implementation process consists of the following sub-processes:

1) Translation: In this process all the input netlists and design constraints are combined and saved in a file called the native generic database file. The ports available in the design are assigned to physical elements of the target device. The timing requirements of the design are also specified in the translate process. Translate properties can also be modified.

2) Mapping: Mapping is done after the translate process. In mapping, the circuit is divided into sub-blocks that fit into the FPGA sub-blocks. A file called the native circuit description file is generated. This file contains the design mapped onto the components of the FPGA.

3) Place and Route: The sub-blocks of the map process are converted into logic blocks and connected in the place and route step. This process takes the NCD file as input and outputs the routed NCD file. The placement and routing of blocks is done here.

Figure 4.4: Different files generated in the implementation process (NGC file, translate, NGD file, mapping, NCD file, place and route, routed NCD file, generate bit file, bit file)

4) Bit Stream Generation: In this process the bit file is generated for the particular Xilinx device from the routed NCD file. The output bit file contains the binary bits necessary to program the device. The generated bit file is used to program the FPGA device.

5) Functional Simulation: Post-translate (functional) simulation can be performed prior to the mapping of the design. This simulation allows the user to verify that the design has been synthesized correctly, and any differences due to the lower level of abstraction can be identified.

4.5 Analyzing Design Using ChipScope Pro

FPGA designs are becoming more complex, due to the need for faster designs and shorter design times. Debugging and verification is an important factor in determining the complete design time; it takes almost 50% of the design time. The Xilinx ChipScope Pro software performs faster debugging and verification and shrinks the overall design time by 25%. It is a powerful tool that is easy to use. It is used for debug, verification and inserting short signal sequences. ChipScope Pro uses FPGA resources such as block RAM for trigger and data storage, and slice logic for trigger comparison. It uses three types of flows, i.e. the core generator, core inserter and PlanAhead flows. The core inserter flow is similar to the PlanAhead flow, which is provided in the PlanAhead software. In the core generator flow the core is instantiated in the source HDL, while in the core inserter flow the core is inserted into the generated file after synthesis. Different types of cores are used by the ChipScope Pro software, such as the ICON core, ILA core, VIO core and IBA core. Some of them are explained below:

ICON core: It is used to control up to 15 capture cores. It acts as the interface between the JTAG interface and the capture cores. The main function of this core is to control the other cores. It can be used in both the core generator and core inserter flows.

VIO core: It defines and generates virtual I/Os. It is used to apply stimulus and to read output transitions on the selected nodes. It provides options for synchronous and asynchronous inputs and outputs, each of which can have a width of 256 bits. There is also a clock option; the clock can be the system clock or the JTAG clock. The outputs of the design under test are connected to the inputs of the VIO core, and the inputs of the design are connected to the outputs of the VIO core; therefore the VIO inputs act as virtual LEDs and the VIO outputs as virtual DIP switches. This core is controlled by the ICON core, so a control port is also provided. The VIO core uses FPGA logic, not RAM. It is only used in the core generator flow.

ILA core: It is a capture core. It can be used to create custom triggers which, when activated, cause data to be stored during circuit operation. Signals can be stored depending on the condition specified by the user. A design can contain up to 15 ILA cores.

Agilent trace core: It is used to store a large amount of data off chip, or when the customer uses an Agilent analyzer. ILA with Agilent trace is similar to ILA, except that data is captured off chip or by the Agilent trace port analyzer.

CHAPTER 5
SIMULATION RESULTS AND SYNTHESIS

This chapter discusses the implementation of the parallel decimal multiplier and its synthesis and simulation results. The synthesis report describes the actual hardware utilization and timing constraints of the design. The simulation results describe the behavioural functionality of the design. Implementation is done to transfer the design onto actual hardware. The software used for simulation, synthesis and implementation is Xilinx ISE 14.5, targeting a Spartan-6E FPGA. The working environment for all the designs is:

Tool version: ISE 14.5
Optimization Goal: Speed
Design Strategy: Balanced
Family: Spartan 6E
Device: XC6SLX45
Speed: 3
Package: CSG324
Simulator: ISIM
Total slices:
Total LUTs:

Radix-10 parallel decimal multiplier

In the radix-10 decimal multiplier design, the partial products are generated by using the SD radix-10 recoding. The computation consists of the generation of decimal partial products coded in (4221) (generation of multiplicand multiples and SD radix-10 encoding of the multiplier),

partial product reduction and, finally, BCD carry-propagate addition. Using this approach, the radix-10 decimal multiplier is designed in VHDL. The multiplier is then simulated and synthesized using Xilinx ISE 14.5, targeting a Spartan 6 FPGA device.

Simulation Results for Radix-10 Parallel Decimal Multiplier

The design is simulated using the ISIM simulator. Figure 5.1 shows the simulation results for the radix-10 parallel decimal multiplier. The inputs, in BCD format, are the following:

Clk, toggling between 1 and 0 (for 10 ns with a period of 1 ns), is the clock.
A (for 10 ns) is the multiplicand.
B (for 10 ns) is the multiplier.

The output (Result) is obtained after the generation of decimal partial products coded in (4221) (generation of multiplicand multiples and SD radix-10 encoding of the multiplier), partial product reduction and, finally, BCD carry-propagate addition. The results obtained on the ISIM simulator match the desired results, as shown in the figure below.


Figure 5.1: Simulation Results for Radix-10 Parallel Decimal Multiplier

5.1.2 Synthesis Results for Radix-10 Parallel Decimal Multiplier

Table 5.1 shows the total delay and the area utilization summary in terms of the number of slices. The 16-digit SD radix-10 (Figure 3.1) combinational multiplier has been synthesized using ModelSim SE 6.5c. For partial product reduction, the SD radix-10 multiplier implements area-optimized decimal p:2 CSA trees, similar to the decimal 17:2 CSA tree.

Table 5.1: Synthesis report of radix-10 parallel decimal multiplier
Columns: Architecture; Delay (min. period); Area (no. of slice LUTs)
Rows: DEC. SD radix-10; Ref; Ref; Ref
