Efficient Diminished-1 Modulo $2^n+1$ Multipliers


IEEE TRANSACTIONS ON COMPUTERS, VOL. 54, NO. 4, APRIL 2005

Costas Efstathiou, Haridimos T. Vergos, Member, IEEE, Giorgos Dimitrakopoulos, and Dimitris Nikolos, Member, IEEE

Abstract: In this work, we propose a new algorithm for designing diminished-1 modulo $2^n+1$ multipliers. The implementation of the proposed algorithm requires $n+3$ partial products that are reduced by a tree architecture into two summands, which are finally added by a diminished-1 modulo $2^n+1$ adder. The proposed multipliers, compared to existing implementations, offer enhanced operation speed and their regular structure allows efficient VLSI implementations.

Index Terms: Modulo $2^n+1$ multipliers, computer arithmetic, residue number system, Fermat number transform, VLSI design.

1 INTRODUCTION

Arithmetic modulo $2^n+1$ has been used in several applications, which include specialized digital signal processors based on Residue Number System (RNS) arithmetic [1], [2], [3], [4], Fermat Number Transform (FNT) for eliminating the roundoff errors in convolution computations [5], [6], [7], [8], and cryptographic algorithms [9]. For the implementation of these applications, several designs for modulo $2^n+1$ arithmetic blocks have been proposed. Efficient modulo $2^n+1$ adders have been presented in [10], [11], [12], multioperand adders and residue generators in [13], and multipliers in [14], [15], [16], [17], [18].

The prime moduli of the form $2^n+1$, apart from being useful for ordinary RNSs, are vital in FNT and useful in cryptography. The Fermat number $2^{16}+1$, by being the only Fermat number of practical interest, was chosen for the implementation of the International Data Encryption Algorithm (IDEA) [9]. Since a number in the range $[0, 2^n]$ requires $n+1$ bits for its representation, the weighted representation of an operand modulo $2^n+1$ is a problem in an RNS that uses the three-moduli set $\{2^n-1,\ 2^n,\ 2^n+1\}$, given that the other two channels operate on $n$-bit quantities.
To overcome this problem and since, in the case of a zero operand, the result can be derived straightforwardly, Leibowitz [5] introduced the diminished-1 representation. Under this representation, each number is represented decremented by 1 modulo $2^n+1$ and all arithmetic operations are inhibited for a zero operand. Zero is represented using a separate zero indication bit. This representation has the advantage that the numbers are represented by $n$ bits and simplifies the basic operations of addition, multiplication, and scaling modulo $2^n+1$. Recently, the benefits of diminished-1 arithmetic have been utilized for the design of low-power convolution architectures [19] and for high speed implementation of the IDEA cryptographic algorithm [20].

C. Efstathiou is with the Department of Informatics, TEI of Athens, Ag. Spyridonos St., 12210 Egaleo, Athens, Greece. E-mail: cefsta@teiath.gr.
H.T. Vergos, G. Dimitrakopoulos, and D. Nikolos are with the Technology and Computer Architecture Lab, Computer Engineering and Informatics Department, University of Patras, 26500 Patras, Greece. E-mail: {vergos, dimitrak}@ceid.upatras.gr, nikolosd@cti.gr.
Manuscript received 31 Oct. 2003; revised 18 June 2004; accepted Nov. 2004; published online 15 Feb. 2005. For information on obtaining reprints of this article, please send e-mail to: tc@computer.org.

We can distinguish the multipliers modulo $2^n+1$ in the following categories, depending on the type of operands that they accept:

1. Both operands use standard representation [14], [15].
2. One input uses a standard representation, while the other utilizes a diminished-1 representation [18].
3. Both inputs use diminished-1 representation [16], [17].

It is important to note that the multipliers presented in [10] also use $n$ bits for their representation, but do not follow the diminished-1 discipline. This representation is specific to the IDEA implementation and requires all operands to be in weighted form, except the operand $2^n$, which is represented as an all-zeros operand.
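The diminished-1 encoding described above can be illustrated with a minimal Python sketch. The function names `to_diminished1` and `from_diminished1` are hypothetical, and storing all-zero value bits alongside a set zero-indication bit is an assumption made here for definiteness; the text only states that zero is flagged by a separate bit.

```python
from typing import Tuple


def to_diminished1(a: int, n: int) -> Tuple[int, int]:
    """Encode 0 <= a <= 2**n as (zero_flag, n-bit value).

    Nonzero operands are stored decremented by one, so they always fit
    in n bits. For a == 0 only the zero-indication bit is meaningful;
    the value bits are set to 0 here (an illustrative assumption).
    """
    if not 0 <= a <= (1 << n):
        raise ValueError("operand must lie in [0, 2**n]")
    if a == 0:
        return 1, 0
    return 0, a - 1


def from_diminished1(zero_flag: int, value: int) -> int:
    """Decode a (zero_flag, value) pair back to the weighted operand."""
    return 0 if zero_flag else value + 1
```

With `n = 8`, the operand 256, which would need nine bits in weighted form, is stored as the eight-bit value 255 with a cleared zero flag.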
In this paper, we present a new algorithm for designing tree multipliers for the third of the above categories, that is, modulo $2^n+1$ multipliers whose inputs are both in diminished-1 representation. We show that the proposed multipliers are more efficient than the multipliers presented in [14], [15], [16], [17], [18]. The new design method is presented in Section 2. An area and delay analysis is given in Section 3 and compared against the previous solutions. Experimental results based on static CMOS implementations are also presented in Section 3. Our conclusions are drawn in the last section.

© 2005 IEEE. Published by the IEEE Computer Society.

2 THE PROPOSED MULTIPLIERS

In this section, a new architecture for modulo $2^n+1$ multiplication for diminished-1 operands is introduced. At first, the derivation of the partial products is explained. Then, the reduction of the partial products into two summands is examined.

Let $A, B$ be two $(n+1)$-bit numbers with $0 \le A, B < 2^n+1$ and suppose that $A_{-1} = a_{n-1}a_{n-2}\ldots a_0$ and $B_{-1} = b_{n-1}b_{n-2}\ldots b_0$ denote their diminished-1 representations such that

$$A_{-1} = A - 1, \qquad B_{-1} = B - 1 \tag{1}$$

and $A_{-1}, B_{-1} \neq 0$. Assume that $Q$ denotes the product of $A$ and $B$ modulo $2^n+1$, that is, $Q = |AB|_{2^n+1}$, where $|x|_m$ denotes the residue of $x$ modulo $m$. Then, according to [16] and [10], for the diminished-1 representation of $Q$, we have that

$$\begin{aligned}
Q_{-1} = |Q-1|_{2^n+1} &= |A \times B - 1|_{2^n+1} = |(A_{-1}+1)(B_{-1}+1) - 1|_{2^n+1} \\
&= |A_{-1}B_{-1} + A_{-1} + B_{-1}|_{2^n+1} = \big|\,|A_{-1}B_{-1}|_{2^n+1} + A_{-1} + B_{-1}\big|_{2^n+1}.
\end{aligned} \tag{2}$$

The term $|A_{-1}B_{-1}|_{2^n+1}$ of (2) can be expressed as

$$|A_{-1}B_{-1}|_{2^n+1} = \left|\sum_{i=0}^{n-1}\sum_{j=0}^{n-1} a_i b_j 2^{i+j}\right|_{2^n+1} = \left|\sum_{i=0}^{n-1}\sum_{j=0}^{n-1} a_i b_j \left|2^{i+j}\right|_{2^n+1}\right|_{2^n+1}. \tag{3}$$

Taking into account that $i + j \le 2n-2$, (3) can be written as

$$|A_{-1}B_{-1}|_{2^n+1} = \left|\sum_{i=0}^{n-1}\sum_{j=0}^{n-1} a_i b_j (-1)^s\, 2^{|i+j|_n}\right|_{2^n+1}, \tag{4}$$

where

$$s = \begin{cases} 0, & \text{if } i+j < n \\ 1, & \text{if } i+j \ge n. \end{cases} \tag{5}$$

For the two cases of (5), relation (4) can be expressed as

$$|A_{-1}B_{-1}|_{2^n+1} = \left|\sum_{i=0}^{n-1}\sum_{j=0}^{n-1-i} a_i b_j 2^{|i+j|_n} + \sum_{i=1}^{n-1}\sum_{j=n-i}^{n-1} (-a_i b_j)\, 2^{|i+j|_n}\right|_{2^n+1}. \tag{6}$$

For $z \in \{0, 1\}$, it holds that

$$|-z|_{2^n+1} = |2^n + 1 - z|_{2^n+1} = |2^n + \bar{z}|_{2^n+1}, \tag{7}$$

where $\bar{z}$ denotes the complement of bit $z$. Then, according to (7), (6) can be rewritten as

$$|A_{-1}B_{-1}|_{2^n+1} = \left|\sum_{i=0}^{n-1}\sum_{j=0}^{n-1-i} a_i b_j 2^{|i+j|_n} + \sum_{i=1}^{n-1}\sum_{j=n-i}^{n-1} \left(2^n + \overline{a_i b_j}\right) 2^{|i+j|_n}\right|_{2^n+1}. \tag{8}$$

Relation (8) indicates that one way to form the partial products is to complement each bit $a_i b_j$ with $i + j \ge n$ and place it at bit position $|i+j|_n$, provided that a correction equal to $\left|2^n 2^{|i+j|_n}\right|_{2^n+1}$ is taken into account for each complementation. Therefore, (8) can be reformulated as

$$|A_{-1}B_{-1}|_{2^n+1} = \left|\sum_{i=0}^{n-1} (PP_i + C_i)\right|_{2^n+1}, \tag{9}$$

where $PP_i$ denotes the $i$th partial product

$$PP_i = \begin{cases} \sum_{j=0}^{n-1} a_0 b_j 2^j, & \text{if } i = 0 \\ \sum_{j=0}^{n-1-i} a_i b_j 2^{|i+j|_n} + \sum_{j=n-i}^{n-1} \overline{a_i b_j}\, 2^{|i+j|_n}, & \text{if } i \neq 0 \end{cases} \tag{10}$$

and $C_i$ is the corresponding correction factor. It should be noted that $PP_0$ does not contain any complemented bits and, thus, $C_0 = 0$. On the other hand, for $i \neq 0$, the value of $C_i$ depends on the number of complemented bits $a_i b_j$ and is given by

$$C_i = \sum_{j=n-i}^{n-1} 2^n 2^{|i+j|_n} = 2^n (2^i - 1). \tag{11}$$

According to (10) and (11), the following partial products and correction factors are derived:

$$\begin{aligned}
PP_0 &= a_0b_{n-1}\;\, a_0b_{n-2}\;\, \ldots\;\, a_0b_1\;\, a_0b_0, & C_0 &= 0 \\
PP_1 &= a_1b_{n-2}\;\, a_1b_{n-3}\;\, \ldots\;\, a_1b_0\;\, \overline{a_1b_{n-1}}, & C_1 &= 2^n(2^1-1) \\
PP_2 &= a_2b_{n-3}\;\, a_2b_{n-4}\;\, \ldots\;\, \overline{a_2b_{n-1}}\;\, \overline{a_2b_{n-2}}, & C_2 &= 2^n(2^2-1) \\
&\;\;\vdots \\
PP_{n-2} &= a_{n-2}b_1\;\, a_{n-2}b_0\;\, \ldots\;\, \overline{a_{n-2}b_3}\;\, \overline{a_{n-2}b_2}, & C_{n-2} &= 2^n(2^{n-2}-1) \\
PP_{n-1} &= a_{n-1}b_0\;\, \overline{a_{n-1}b_{n-1}}\;\, \ldots\;\, \overline{a_{n-1}b_2}\;\, \overline{a_{n-1}b_1}, & C_{n-1} &= 2^n(2^{n-1}-1).
\end{aligned}$$

The total correction, $C_P$, required for the formation of the above partial products is equal to

$$C_P = \sum_{i=0}^{n-1} C_i = C_0 + \sum_{i=1}^{n-1} 2^n(2^i - 1) = 2^n(2^n - n - 1). \tag{12}$$

In the following, we consider the reduction of the partial products into two summands. This can be performed in a variety of ways. In this paper, an FA-based Dadda tree architecture is followed [21].
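The partial-product formation of (10) and the total correction of (12) can be checked numerically with a short Python sketch. The names `partial_products` and `total_correction` are hypothetical, and the operands are modeled as plain integers holding the $n$-bit diminished-1 values; this is a behavioral illustration under those assumptions, not a gate-level description.

```python
from typing import List


def partial_products(a1: int, b1: int, n: int) -> List[int]:
    """Form PP_0 .. PP_{n-1} from the n-bit diminished-1 operands.

    Bit a_i*b_j is placed at position (i + j) mod n and is complemented
    whenever i + j >= n, following Eq. (10).
    """
    pps = []
    for i in range(n):
        a_i = (a1 >> i) & 1
        pp = 0
        for j in range(n):
            bit = a_i & (b1 >> j) & 1
            if i + j >= n:
                bit ^= 1  # complemented bit; costs a 2^n * 2^((i+j) mod n) correction
            pp |= bit << ((i + j) % n)
        pps.append(pp)
    return pps


def total_correction(n: int) -> int:
    """Total correction C_P = 2^n * (2^n - n - 1) of Eq. (12)."""
    return (1 << n) * ((1 << n) - n - 1)


def check_eq9(n: int) -> bool:
    """Exhaustively confirm sum(PP_i) + C_P == A_{-1}*B_{-1} (mod 2^n + 1)."""
    m = (1 << n) + 1
    c_p = total_correction(n)
    return all(
        (sum(partial_products(a1, b1, n)) + c_p) % m == (a1 * b1) % m
        for a1 in range(1 << n)
        for b1 in range(1 << n)
    )
```

For example, with `n = 4` and both operands equal to 15, the four partial products are 15, 14, 12, and 8; adding $C_P = 176$ gives 225, which is exactly $15 \times 15$.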
Although the use of a tree architecture in integer multipliers results in irregular architectures, in our case, the resulting FA array is completely regular and, therefore, well-suited for VLSI implementations. This is due to the fact that the same number of bits participate in every bit position, since the carry output of the most-significant bit position is fed back as a carry input to the least-significant bit position of the next stage. Let $c$ denote a carry output at the most significant bit position, which has a weight of $2^n$. Since

$$|c\,2^n|_{2^n+1} = |-c|_{2^n+1} = |2^n + \bar{c}|_{2^n+1}, \tag{13}$$

$c$ can be complemented and added at the least significant bit position of the next stage, provided that a correction of $2^n$ is taken into account. Since an FA row reduces the number of partial products by one, $n+1$ FA rows are required in order to derive the two final summands from $n+3$ partial products. The FAs at the most significant bit position will then produce $n+1$ carries of weight $2^n$. Therefore, the correction, $C_R$, required during the addition of $n+3$ partial products is

$$C_R = 2^n(n+1). \tag{14}$$

Merging both correction factors of (12) and (14) results in a single factor $C$, which, in modulo $2^n+1$ arithmetic, is equal to

$$C = |C_P + C_R|_{2^n+1} = \left|2^n(2^n - n - 1) + 2^n(n+1)\right|_{2^n+1} = 1. \tag{15}$$

Since $C$ is treated in the proposed architecture as an extra partial product, we have to use its diminished-1 representation in our reduction scheme, i.e., $C_{-1} = |C - 1|_{2^n+1}$, which is equal to the all-0s $n$-bit vector. This vector, along with the $PP_i$s of (9) and the $A_{-1}$, $B_{-1}$ of (2), forms the $n+3$ partial products of the proposed architecture. Although $C_{-1} = 0$, it cannot be ignored during the reduction of the partial products since, in this case, fewer than $n+1$ carries of weight $2^n$ will be produced.
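The carry handling of (13) and the merged correction of (15) can be sketched in Python as follows. The helper names `csa_row`, `reduce_to_two`, and `merged_correction` are hypothetical, and the rows are applied in simple queue order rather than in the Dadda arrangement of the paper; the modular identity being checked does not depend on that order.

```python
from typing import Iterable, Tuple


def csa_row(x: int, y: int, z: int, n: int) -> Tuple[int, int]:
    """One n-bit carry-save row with an inverted end-around carry.

    The carry out of the most significant position (weight 2^n) is
    complemented and re-enters at the least significant position, which
    is valid only if a correction of 2^n is accounted for (Eq. (13)).
    """
    mask = (1 << n) - 1
    s = x ^ y ^ z
    carry = (x & y) | (x & z) | (y & z)
    c_out = (carry >> (n - 1)) & 1
    c = ((carry << 1) & mask) | (c_out ^ 1)
    return s, c


def reduce_to_two(operands: Iterable[int], n: int) -> Tuple[int, int, int]:
    """Reduce n-bit operands to two summands; returns (s, c, rows_used)."""
    ops = list(operands)
    rows = 0
    while len(ops) > 2:
        s, c = csa_row(ops[0], ops[1], ops[2], n)
        ops = ops[3:] + [s, c]
        rows += 1
    return ops[0], ops[1], rows


def merged_correction(n: int) -> int:
    """C = |C_P + C_R| mod (2^n + 1), Eq. (15)."""
    c_p = (1 << n) * ((1 << n) - n - 1)
    c_r = (1 << n) * (n + 1)
    return (c_p + c_r) % ((1 << n) + 1)
```

Reducing $k$ operands uses $k-2$ rows, and the two resulting summands plus a correction of $2^n$ per row are congruent to the plain sum of the operands modulo $2^n+1$, which is the bookkeeping behind (14).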
The above analysis indicates that

$$Q_{-1} = \left|\sum_{i=0}^{n-1} PP_i + A_{-1} + B_{-1} + C_{-1}\right|_{2^n+1}. \tag{16}$$

An implementation of the proposed architecture is composed of AND or NAND gates that form a bit of each partial product, a Dadda tree that reduces the $n+3$ partial products into two summands, and a modulo $2^n+1$ adder for diminished-1 operands [12] that accepts these two summands and produces the required product.

A diminished-1 modulo $2^n+1$ parallel adder is effectively an inverted end-around-carry adder. Since a direct connection of the carry output to the carry input via an inverter leads to an oscillating circuit, dedicated architectures have been proposed that do not suffer from this problem [10], [11], [12]. In this work, the parallel-prefix architecture proposed in [12] is utilized in order to achieve the fastest possible implementation. This architecture was derived by allowing the inverted reentering carry to recirculate at each existing prefix level. The design of these adders is briefly described as follows: At first, the carry-generate bits $g_i$, the carry-propagate bits $p_i$, and the half-sum bits $h_i$, for every $i$, $0 \le i \le n-1$, are computed according to $g_i = a_i \cdot b_i$, $p_i = a_i + b_i$, and $h_i = a_i \oplus b_i$, where $\cdot$, $+$, and $\oplus$ denote the logical AND, OR, and exclusive-OR operations, respectively. Then, using the bits $g_i$ and $p_i$, the carries $c_i$, for $-1 \le i \le n-2$, are computed in $\log_2 n$ prefix levels, according to the following relation:

$$(G_i, P_i) = (g_i, p_i) \circ (g_{i-1}, p_{i-1}) \circ \cdots \circ (g_0, p_0) \circ \overline{(g_{n-1}, p_{n-1}) \circ \cdots \circ (g_{i+1}, p_{i+1})},$$

with $c_i = G_i$. Finally, the sum bits $s_i$ are derived using $s_i = h_i \oplus c_{i-1}$. By definition, $\overline{(g, p)}$ is equal to $(\bar{g}, p)$ and $\circ$ is the prefix operator defined as $(g, p) \circ (g', p') = (g + p \cdot g',\ p \cdot p')$.
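As a behavioral cross-check of (16), the whole data path can be modeled in a few lines of Python. This is a sketch under stated assumptions: `dim1_multiply` is a hypothetical name, the reduction rows are applied in queue order instead of a Dadda schedule, and the final adder is modeled arithmetically as an inverted end-around-carry addition rather than as the parallel-prefix circuit of [12]. Each inverted end-around-carry row, and the final adder, contributes an increment of one modulo $2^n+1$, which is how the $n+3$ operands of (16) add up to $Q_{-1}$. The model covers only nonzero products; zero results need the zero-indication logic, which is not modeled.

```python
def dim1_multiply(a1: int, b1: int, n: int) -> int:
    """Diminished-1 product of two n-bit diminished-1 operands.

    Valid when the product A*B is nonzero modulo 2^n + 1.
    """
    mask = (1 << n) - 1

    # Partial products of Eq. (10): complement bits with i + j >= n.
    operands = []
    for i in range(n):
        pp = 0
        for j in range(n):
            bit = (a1 >> i) & (b1 >> j) & 1
            if i + j >= n:
                bit ^= 1
            pp |= bit << ((i + j) % n)
        operands.append(pp)

    # A_{-1}, B_{-1} and the all-zeros vector C_{-1}: n + 3 operands in total.
    operands += [a1, b1, 0]

    # n + 1 carry-save rows with inverted end-around carry.
    while len(operands) > 2:
        x, y, z = operands[:3]
        carry = (x & y) | (x & z) | (y & z)
        msb = carry >> (n - 1)
        operands = operands[3:] + [x ^ y ^ z, ((carry << 1) & mask) | (msb ^ 1)]

    # Final diminished-1 adder: add the inverted carry-out back in.
    t = operands[0] + operands[1]
    return (t + 1 - (t >> n)) & mask
```

For the modulo 257 case (`n = 8`), multiplying $A = 100$ by $B = 200$ means calling `dim1_multiply(99, 199, 8)`, and the result can be compared against `(100 * 200 - 1) % 257`.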


Fig. 1. Sample simplified-FA (SFA) implementation.

Additional simplifications are possible to the Dadda reduction tree. Consider the partial products $PP_0 = a_{n-1}b_0\; a_{n-2}b_0 \ldots a_1b_0\; a_0b_0$, $PP_n = A_{-1}$, and $PP_{n+1} = B_{-1}$. If these three partial products are driven to the same FA row of the array, then each FA can be simplified significantly. Fig. 1 presents a possible implementation of a block that accepts $a_{n-1}$, $b_{n-1}$, and $b_0$ and performs the addition of the bits $a_{n-1}b_0$, $a_{n-1}$, and $b_{n-1}$. The simplified FA is denoted as SFA. The FA of the same row that accepts $a_0b_0$, $a_0$, and $b_0$ can be further simplified to an HA. Furthermore, since $C_{-1}$ is the all-0s vector, the row of FAs that accepts this operand can be simplified to a row of half-adders (HA).

Example 1. For a modulo 257 multiplier, the derived set of partial products is shown in Fig. 2. Fig. 3 presents a numerical example illustrating the modulo partial-product reduction using the Dadda method. Every three terms are reduced to two using an FA row, which is indicated by a box that surrounds them. The resulting sum and carry vectors are denoted as (S) and (C). The bold and underlined bits of each stage declare the carry bits of weight $2^8$ that are complemented and added at the least-significant bit position. Additionally, Fig. 4 presents the attained FA-based implementation. Note that, in the first level of the tree, only HAs and SFAs have been used for reducing the delay. The circles at the carry output of an HA, FA, or SFA denote the complement operation.

3 COMPARISONS

The multipliers designed according to the methods presented in [14], [15], and [18] require, apart from the partial-products reduction array, a final carry-propagate adder and a modulo correction step with a delay equal to that of an $n$-bit carry-propagate adder. Thus, the proposed design and those of [16] and [17], which require only one $n$-bit carry-propagate addition, are superior to these previous methods.
Additionally, the authors of [16] and [17] have proven their superiority over [14] and [18]. Therefore, in this section, we compare the proposed multipliers (hereafter denoted PROP) against those of [16] (hereafter denoted WANG) and [17] (hereafter denoted MA), both qualitatively and quantitatively.

Fig. 2. The set of partial products for the proposed modulo $2^8+1$ multiplier.

For our qualitative comparisons, we adopt the approximations of the unit-gate model [22], that is, we consider that all 2-input monotonic gates count as one gate equivalent for both area and delay, while a 2-input XOR or XNOR gate counts as two gate equivalents for both area and delay. We denote a Booth encoder by BE, a Booth selector block by BS, and a parallel modulo $2^n+1$ adder by PA. The area of a block Y will be denoted $A_Y$ and its execution latency $T_Y$. The area and delay in equivalent gates of the components used in the comparisons are shown in Table 1.

In the proposed multipliers, $n+3$ partial products are required. Three of them are bits from the input operands, which are added using the SFA cells, while one of them is the all-zeros vector. The rest of the partial products are produced by $n(n-1)$ AND or NAND gates. These partial products are then reduced to two by the use of a Dadda tree. The depth in FA stages of a Dadda tree, denoted $D(k)$, is a function of its number of operands and is listed in Table 2 for all practical values of $k$. Each of the columns of the tree, except the least significant one, is composed of $n-1$ FAs, 1 SFA, and 1 HA. The least significant slice is composed of $n-1$ FAs and 2 HAs. Therefore, the total area of the Dadda tree required by the proposed multipliers is $A_{DT} = n(n-1)A_{FA} + (n-1)A_{SFA} + (n+1)A_{HA}$, while its execution delay is $T_{DT} = D(n+3)\,T_{FA}$. As exemplified in the previous section, in several cases, it is possible to arrange the first level of the Dadda tree so that it is composed only of SFAs or of SFAs and HAs.
This can be achieved in the cases where $(n+2)$ or $(n+1)$ is a Dadda number, i.e., when $n = 4, 5, 7, 8, 11, 12, 17, 18, 26, 27, \ldots$ In these cases, the execution delay of the Dadda tree is $T_{DT} = (D(n+3) - 1)\,T_{FA} + T_{HA}$. Taking into account the approximations of the unit-gate model, we get that

$$A_{PROP} = n(n-1) + A_{DT} + A_{PA} = 8n^2 + \frac{9}{2}\, n \log_2 n + \frac{1}{2}\, n + 4 \ \text{equivalent gates}, \tag{17}$$

$$T_{PROP} = 1 + T_{DT} + T_{PA} = \begin{cases} 4D(n+3) + 2\log_2 n + 2, & \text{if } n = 4, 5, 7, 8, 11, 12, 17, 18, \ldots \\ 4D(n+3) + 2\log_2 n + 4, & \text{otherwise.} \end{cases} \tag{18}$$

Fig. 3. Numerical example in the case of the proposed modulo $2^8+1$ multiplier.

Fig. 4. The proposed modulo $2^8+1$ multiplier.

TABLE 1. Area and Delay of the Basic Components in Equivalent Gates

TABLE 2. FA Stages in a $k$-Operand Dadda Tree

TABLE 3. Area and Delay in Equivalent Gates

The multipliers proposed in [16] follow a structure similar to the proposed ones. However, the following should be noted:

• $n+1$ partial products are utilized. Out of them, $n-1$ are produced using two-input AND gates. However, these AND gates require that one of their input operands be inverted. One partial product is produced by the use of $n$ $2 \to 1$ multiplexors. We consider that a multiplexor has the same complexity as an XOR gate. The final partial product is the inverse of the number of zeros in the $n-1$ bits from $b_1$ to $b_{n-1}$. This number is provided by an $(n-1)$ bits to $\lceil \log_2 (n-1) \rceil$ counter (denoted by CNT).
• In [16], it is proposed to reduce the partial products into two final summands by the use of a Wallace tree. In our comparisons, we assume that this reduction is performed by a Dadda tree. The latter has the same time complexity while also offering reduced area complexity.
• The two final summands are added in a modulo $2^n+1$ parallel adder with a carry input set to 1. Since such a block is not available in the literature, we assume that this is implemented by an HA stage, followed by a modulo $2^n+1$ parallel adder.

The area requirements of the multipliers proposed in [16] are:

• $n(n-1)$ AND and $n$ XOR gates for forming the partial products.
• $(n-1) - \lceil \log_2 (n-1) \rceil$ FAs for the CNT block that forms the last partial product.
• $n(n-1)$ FAs for the Dadda tree and $n$ HAs for producing the two final summands.
• A modulo $2^n+1$ adder PA.

Taking into account the approximations of the unit-gate model, it follows that

$$A_{WANG} = 8n^2 + \frac{9}{2}\, n \log_2 n + \frac{9}{2}\, n - 7\lceil \log_2 (n-1) \rceil - 1. \tag{19}$$

Considering the execution delay, one must note that:

• The terms of the $n-1$ partial products require more than a single gate delay to be produced, since each is the AND of a normal input bit with the other inverted.
• The $n$ multiplexors impose an extra delay for the derivation of this specific partial product compared with the rest. In order to compensate for this extra delay, the output of the multiplexors should be driven to the second or to subsequent stages of the Dadda tree. However, this is not possible when $n+1$ is a Dadda number or, equivalently, when $n = 5, 8, 12, 18, 27, 41, 62, \ldots$
• Finally, the partial product produced by the CNT may also not be ready when needed for a minimum-depth Dadda tree.

Because of this, we cannot provide a closed-form equation for $T_{WANG}$. In our estimation, we consider that the CNT is designed according to [23].
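The Dadda depth $D(k)$ and the delay estimate for the proposed multipliers can be explored with a small Python sketch. Several things here are assumptions rather than statements from the text: the function names are hypothetical, the Dadda sequence is generated with the usual rule $d_{j+1} = \lfloor 1.5\, d_j \rfloor$ starting from 2, the unit-gate delays are taken as $T_{FA} = 4$, $T_{HA} = 2$ and $T_{PA} = 2\log_2 n + 3$ (the table holding these values did not survive in this copy), and $\log_2 n$ is rounded up for widths that are not powers of two.

```python
from typing import List


def dadda_numbers(limit: int) -> List[int]:
    """Dadda sequence 2, 3, 4, 6, 9, 13, 19, 28, ... up to `limit`."""
    seq, d = [], 2
    while d <= limit:
        seq.append(d)
        d = d * 3 // 2
    return seq


def dadda_depth(k: int) -> int:
    """Number of FA stages D(k) needed to reduce k operands to two."""
    if k < 2:
        raise ValueError("at least two operands are required")
    d, stages = 2, 0
    while d < k:
        d = d * 3 // 2
        stages += 1
    return stages


def fast_first_level(n: int) -> bool:
    """True when n + 1 or n + 2 is a Dadda number (HA/SFA-only first level)."""
    dadda = dadda_numbers(n + 2)
    return (n + 1) in dadda or (n + 2) in dadda


def t_prop(n: int) -> int:
    """Delay estimate of the proposed multiplier in equivalent gates.

    Assumes: 1 gate for the AND/NAND level, T_FA = 4, T_HA = 2 and
    T_PA = 2*ceil(log2 n) + 3 for the final adder.
    """
    log2n = (n - 1).bit_length()  # ceil(log2(n)) for n >= 2
    tail = 2 if fast_first_level(n) else 4
    return 4 * dadda_depth(n + 3) + 2 * log2n + tail
```

Under these assumptions, the widths from 4 to 27 for which `fast_first_level` holds are exactly 4, 5, 7, 8, 11, 12, 17, 18, 26, and 27, matching the list quoted in the text.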
The multipliers proposed in [17] use Booth recoding to reduce the number of partial products that should be added. In the following, we consider that $n$ is even. The number of derived partial products in [17] is $n/2 + 1$, each $(n+1)$ bits wide. One of the partial products is a constant, whereas the rest are derived using a Booth encoder for each overlapping triplet of the multiplier and $n+1$ Booth selector blocks. In [17], it is proposed that these partial products are reduced into a carry and a sum vector using a Carry-Save Adder (CSA) array. In the following, we consider that this is performed by a Dadda tree to reduce the delay. The number of FA stages in the Dadda tree is $D(n/2 + 1)$. The sum and carry vectors produced are then fed into two cascaded modulo CSA stages, each contributing $T_{FA}$ of execution delay. The first stage, because of the constants in the high-order bits of the sum and carry vectors, can be implemented with HAs and FAs, whereas the second requires a row of FAs. The two resulting vectors need to be added in a modulo $2^n+1$ parallel adder with a carry input set to 1, as in the case of the multipliers proposed in [16]. Also, in this case, we assume that this is implemented by an HA stage, followed by a modulo $2^n+1$ parallel adder. According to the above analysis, we have that, for even values of $n$:

$$A_{MA} = \frac{n}{2} A_{BE} + \frac{n}{2}(n+1) A_{BS} + (\ldots)\, A_{FA} + (\ldots)\, A_{HA} + A_{PA}, \tag{20}$$

$$T_{MA} = T_{BE} + T_{BS} + D\!\left(\frac{n}{2} + 1\right) T_{FA} + 2T_{FA} + T_{HA} + T_{PA} = 20 + 4D\!\left(\frac{n}{2} + 1\right) + 2\log_2 n. \tag{21}$$

Taking into account the estimates of (17), (18), (19), (20), and (21), and the analysis presented earlier for the delay $T_{WANG}$, we present in Table 3 the delay and area requirements of the multipliers under consideration for several values of $n$. The proposed multipliers offer significant savings in execution time compared to the multipliers proposed in either [16] or [17]. The proposed multipliers are also more area efficient than the multipliers in [16] for $n > 4$. Finally, considering the area × time product as a metric, the proposed multipliers are more efficient than the multipliers proposed in [17] for $n < 24$.

Quantitative comparison results are obtained by implementing the different multiplier architectures in a 0.18 μm CMOS standard cell library. At first, a program was written in C++ that generates structural Verilog descriptions for the proposed multipliers and for the multipliers proposed in [16] and [17]. We used this program to generate Verilog models for multipliers with operand sizes of 4, 8, 16, and 32 bits. Each design, after extensive simulations that verified its correctness, was synthesized and optimized recursively for minimum delay with Synopsys Design Compiler, using the UMC 0.18 μm CMOS standard cell library (five metal layers) under typical conditions (1.8 V, 25 °C). Then, the derived netlists and the design constraints were passed to Cadence Silicon Ensemble to perform the final placement and routing of the design. All design constraints, such as output load, max fanout, and floorplan initialization information, were held constant for each architecture. Final timing analysis was performed using PrimeTime of the Synopsys toolset after all RC parasitic information was extracted from the layout and back-annotated to the gate-level netlist.

TABLE 4. Area (μm²) and Delay (ns) Results of the Diminished-1 Modulo $2^n+1$ Multipliers

TABLE 5. Area (μm²) and Power (mW) Results for the Diminished-1 Modulo $2^n+1$ Multipliers

Table 4 shows the obtained area and delay results. The reported area measurements are performed in the final layout and include both cell and interconnect area. The simulation data indicate that the proposed multipliers offer delay savings between 7 percent and 11 percent over the multipliers in [16] and between 10 percent and 18 percent over the multipliers in [17]. Additionally, in all cases, they are more area efficient than the multipliers of [16], by 6 percent on average.
In order to measure power consumption, all designs were optimized targeting a delay equal to the minimum delay of the Booth modulo $2^n+1$ multipliers proposed by Ma in [17]. The resulting netlists were placed and routed and the parasitics were extracted. All gathered design data were passed to PrimePower of Synopsys and power was estimated after the application of 5,000 random vectors. Experimental results, shown in Table 5, indicate that the proposed multipliers in the majority of the cases require the smallest implementation area, while their power consumption is less than that of the multipliers of [16] and [17] by 13 percent and 3 percent on average.

4 CONCLUSIONS

In this paper, we have proposed a new algorithm for designing diminished-1 modulo $2^n+1$ multipliers. The proposed multipliers offer significant savings in propagation delay compared to the already known ones and they are more area and power efficient for less strict delay constraints.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their constructive comments. G. Dimitrakopoulos has been supported by the D. Maritsas Graduate Scholarship.

REFERENCES

[1] F. Taylor, "A Single Modulus ALU for Signal Processing," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 33.
[2] E. DiClaudio et al., "Fast Combinatorial RNS Processors for DSP Applications," IEEE Trans. Computers, vol. 44.
[3] J. Ramirez et al., "RNS-Enabled Digital Signal Processor Design," IEE Electronics Letters, vol. 38, no. 6, 2002.
[4] R. Chaves and L. Sousa, "RDSP: A RISC DSP Based on Residue Number System," Proc. Euromicro Symp. Digital Systems Design, Sept.
[5] L.M. Leibowitz, "A Simplified Binary Arithmetic for the Fermat Number Transform," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 24.
[6] T.K. Truong et al., "Techniques for Computing the Discrete Fourier Transform Using the Quadratic Residue Fermat Number Systems," IEEE Trans. Computers, vol. 35.
[7] M. Benaissa et al., "Diminished-1 Multiplier for a Fast Convolver and Correlator Using the Fermat Number Transform," IEE Proc. G, vol. 135.
[8] S. Sunder et al., "Area-Efficient Diminished-1 Multiplier for Fermat Number-Theoretic Transform," IEE Proc. G, vol. 140.
[9] R. Zimmermann et al., "A 177 Mb/s VLSI Implementation of the International Data Encryption Algorithm," IEEE J. Solid-State Circuits, vol. 29, no. 3.
[10] R. Zimmermann, "Efficient VLSI Implementation of Modulo $(2^n \pm 1)$ Addition and Multiplication," Proc. IEEE Symp. Computer Arithmetic, Apr.
[11] C. Efstathiou et al., "Modulo $2^n \pm 1$ Adder Design Using Select-Prefix Blocks," IEEE Trans. Computers, vol. 52, 2003.
[12] H.T. Vergos, C. Efstathiou, and D. Nikolos, "Diminished-One Modulo $2^n+1$ Adder Design," IEEE Trans. Computers, vol. 51, 2002.
[13] S.J. Piestrak, "Design of Residue Generators and Multioperand Modular Adders Using Carry-Save Adders," IEEE Trans. Computers, vol. 43.
[14] A.A. Hiasat, "A Memoryless mod$(2^n \pm 1)$ Residue Multiplier," Electronics Letters, vol. 28, no. 3, 1992.
[15] A. Wrzyszcz and D. Milford, "A New Modulo $2^a+1$ Multiplier," Proc. Int'l Conf. Computer Design.
[16] Z. Wang, G.A. Jullien, and W.C. Miller, "An Efficient Tree Architecture for Modulo $2^n+1$ Multiplication," J. VLSI Signal Processing, vol. 14.
[17] Y. Ma, "A Simplified Architecture for Modulo $(2^n+1)$ Multiplication," IEEE Trans. Computers, vol. 47, no. 3, Mar.
[18] A.V. Curiger et al., "Regular VLSI Architectures for Multiplication Modulo $(2^n+1)$," IEEE J. Solid-State Circuits, vol. 26, no. 7.
[19] V. Paliouras, A. Skavantzos, and T. Stouraitis, "Multi-Voltage Low Power Convolvers Using the Polynomial Residue Number System," Proc. ACM Great Lakes Symp. VLSI, pp. 7-11, 2002.
[20] A. Hammalainen, M. Tommiska, and J. Skytta, "6.78 Gigabits per Second Implementation of the IDEA Cryptographic Algorithm," Lecture Notes in Computer Science, vol. 2438, 2002.
[21] L. Dadda, "Some Schemes for Parallel Multipliers," Alta Frequenza, vol. 34.
[22] A. Tyagi, "A Reduced-Area Scheme for Carry-Select Adders," IEEE Trans. Computers, vol. 42, no. 10, Oct.
[23] E.E. Swartzlander, "Parallel Counters," IEEE Trans. Computers, vol. 22, 1973.
