Power Optimization for Universal Hash Function Data Path Using Divide-and-Concatenate Technique


Bo Yang and Ramesh Karri
Dept. of Electrical and Computer Engineering, Polytechnic University, Brooklyn, NY, USA
yangbo@photon.poly.edu, ramesh@india.poly.edu

ABSTRACT

We present an architecture level low power design technique called divide-and-concatenate for universal hash functions based on the following observations: (i) the power consumption of a w-bit array multiplier and the associated universal hash data path decreases as $O(w^4)$ if its clock rate remains constant; (ii) two universal hash functions are equivalent if they have the same collision probability property. In the proposed approach we divide a w-bit data path (with collision probability $2^{-w}$) into two/four w/2-bit data paths (each with collision probability $2^{-w/2}$) and concatenate their results to construct an equivalent w-bit data path (with a collision probability $2^{-w}$). A popular low power technique that uses parallel data paths saves 62.10% dynamic power consumption incurring 102% area overhead. In contrast, the divide-and-concatenate technique saves 55.44% dynamic power consumption with only 16% area overhead.

Categories and Subject Descriptors: B.2.2 [Arithmetic and Logic Structures]: Performance Analysis and Design Aids

General Terms: Design, Performance

Keywords: Universal Hash Function, Power Optimization, Divide-and-Concatenate

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CODES+ISSS 05, Sept , 2005, Jersey City, New Jersey, USA. Copyright 2005 ACM /05/ $5.00.

1. INTRODUCTION

Hash functions have a wide range of applications in computer and network related areas including databases, web search engines, and most importantly network security applications. In databases and web search engines, record lookup can be sped up by using hash values as indexes. In network security, keyed hash functions are used as message authentication codes to assure the integrity of a message. MD5 and SHA-1 are two popular hash algorithms. Since they are iterative algorithms in which the current computation step depends on the result of the previous step, they are not parallelizable and cannot be pipelined. Moreover, these hash functions are not scalable; the message integrity offered by these hash functions cannot be tailored to application requirements.

Recently, universal hash functions such as MMH [1], NH [2] and TMMH [3] have been proposed as an attractive alternative. They use additions and multiplications. Their collision probabilities are determined by the size of these additions and multiplications. They do not have an iterative internal structure and are parallelizable. Most importantly, they are scalable; the message integrity offered by these hash functions can be tailored to application requirements. They are ideal for implementation in hardware.

Hardware implementations of universal hash functions have been used in several applications including high speed routers and wireless cards. In virus detection and content classification applications for high speed routers, each and every packet is hashed and compared with pre-computed signatures [4]. Integrating a large number of universal hash functions improves the accuracy of content classification. For these high performance systems, low power universal hash implementations can improve integration density, reduce packaging cost and improve reliability. More and more wireless cards, PDAs and cell phones are supporting authentication in hardware. For these portable devices, low power universal hash implementations can prolong battery life.

The hardware implementations of universal hash functions are data path dominated with only a small amount of control logic.
Arithmetic operations such as addition and multiplication are at the core of these algorithms. Existing low power design techniques can be applied to universal hash function hardware designs. One straightforward approach to reduce the power consumption of universal hash functions is to use low power implementations of adders and multipliers [5]. Orthogonal to this and other circuit level and logic level low power approaches are architectural level low power design techniques such as glitch reduction [6], dynamic voltage scaling [7], data representation [8] [9], pipelining and parallel data paths [10] [11]. In the parallel data paths technique, the original data path is replicated N times with each replicated data path operating at 1/N of the original clock frequency. Since each of the data paths operates at a lower clock frequency, their supply voltage can be reduced. The dynamic power consumed by the parallel data paths architecture is only $1/N^2$ of the power consumption of the original data path. However, it incurs N times the hardware overhead [10] [11].

In this paper we will develop an architecture level power optimization technique for universal hash functions. This technique can yield savings in power consumption comparable to that of the parallel data paths technique but with significantly less area overhead. Instead of replicating the original hash data path directly, the divide-and-concatenate technique divides the w-bit data path into four w/2-bit data paths and concatenates their outputs to construct an equivalent w-bit data path [12]. Obviously, a straightforward hash data path and its corresponding divide-and-concatenate hash data path are not equivalent in terms of the results that they output. We define two hash data paths to be equivalent if the results that they output satisfy a pre-defined property. For one-way universal hash functions and associated message authentication codes the actual result is not important. Rather, it is the collision probability of the result that is important. Hence, we propose that two hash data paths be considered equivalent if they have the same collision probability. When we discuss equivalent hash data paths and architectures in this paper, it means 1) they can process the same size input every clock cycle and generate the same size output; 2) they have the same collision probability.

In the rest of this paper we will describe the divide-and-concatenate low power design technique for universal hash functions. Specifically, we will introduce universal hash functions and the Linear Congruential Hash (LCH) universal hash function in section 2. The motivation for the proposed low power optimization technique is presented in section 3. We will apply the divide-and-concatenate technique to design low power LCH universal hash hardware in section 4. The experimental results of the proposed low power LCH hash data paths using an IBM 0.18 µm ASIC library are reported in section 5.
We will summarize our contributions in section 6.

2. LINEAR CONGRUENTIAL HASH (LCH)

A hash function (h) converts an input from a large domain (x) into an output in a small range (the hash value y = h(x), often a subset of integers). There are three features for a hash function:

Pre-image resistance: For a given hash value y, it is computationally infeasible to find a message x such that h(x) = y.

Second pre-image resistance: For a given message x, it is computationally infeasible to find a message $x' \neq x$ such that h(x') = h(x).

Collision resistance: It is computationally infeasible to find messages x and x' such that $x' \neq x$ and h(x') = h(x).

A hash function that meets pre-image resistance and second pre-image resistance is a one-way hash function. Further, if a one-way hash function meets the third feature, it is collision resistant. Collision resistance implies second pre-image resistance of a hash function. It fails to guarantee pre-image resistance only if one insists on allowing the degenerate case of hash functions that do not actually compress [13]. So for a one-way hash function, the collision probability, which measures its collision resistance, is the most important parameter.

2.1 LCH Algorithm

Carter and Wegman [14] defined a universal hash function as follows: Let A and B be two sets, and let H be a family of functions from A to B. For every pair $x_1, x_2 \in A$ with $x_1 \neq x_2$, and $h(x_1), h(x_2) \in B$ with $h \in H$, if the collision probability of $h(x_1) = h(x_2)$ equals $1/|B|$, H is a universal family of hash functions. $|B|$ is the size of the set B and $1/|B|$ is the smallest possible value of the probability. In many cases, some hash function families can only achieve a collision probability which is slightly larger than $1/|B|$. Such universal hash functions are called almost universal [14]. In this paper we will focus on LCH, a widely used universal function family which is defined as:

$$H_{m,x}(m) = \left(\sum_{i=1}^{k} (m_i x_i + t)\right) \bmod p \qquad (1)$$

where $m_i$ is the i-th word in a message block m, $x_i$ is the i-th word in key x, and $t \in Z_p$.
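The definition in equation (1) can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `lch_hash`, the word size w = 16, the modulus p = 2^16 + 1 = 65537 (a prime) and the sample words are assumptions chosen for illustration, not values taken from the paper.

```python
def lch_hash(message, key, t, p):
    """LCH of one block: (sum_i (m_i * x_i + t)) mod p, following equation (1)."""
    if len(message) != len(key):
        raise ValueError("message block and key block need the same number of words")
    acc = 0
    for m_i, x_i in zip(message, key):
        acc += m_i * x_i + t  # one multiply-and-accumulate step per word
    return acc % p            # a single modular reduction per block


# Illustrative parameters (assumptions): w = 16, p = 2**16 + 1 = 65537.
W = 16
P = (1 << W) + 1
block = [0x1234, 0xABCD, 0x0F0F]  # k = 3 message words
key = [0x0001, 0x8000, 0x7FFF]    # k = 3 key words
digest = lch_hash(block, key, t=5, p=P)
```

The reduction is applied once per block rather than once per word, which is why a larger k amortizes its cost over more multiply-and-accumulate steps.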
$p = 2^w + s$ is a prime number which is close to $2^w$. Modular reduction of the accumulated result using p generates a w-bit hash value. Since p is close to $2^w$, the collision probability of LCH (1/p) is close to $2^{-w}$. The key block and every message block consist of k w-bit words. Since the modular reduction step can be amortized over k multiply-and-accumulate operations, increasing k can increase the speed of hash computation. However, a large k results in a longer key.

2.2 Modular Reduction

Reducing the 2w-bit result of the multiply-and-accumulate step of the w-bit LCH function modulo $p = 2^w + s$ yields a w-bit hash value. A division-less modular reduction algorithm from [1] that uses 2 multiplications and 3 subtractions can be used to perform the modular reduction. Let $x = 2^w a + b$ be the 2w-bit input where a, b are unsigned w-bit integers:

$$2^w a + b \equiv (2^w a + b) - a(2^w + s) = b - s a \pmod{2^w + s} \qquad (2)$$

Step 1: Since $a, b \in [0, 2^w - 1]$, $b - s a \in [-s(2^w - 1), 2^w - 1]$. If $y = b - s a$ then $y \equiv x \pmod{2^w + s}$ can be represented as a signed 2w-bit integer $y = 2^w c + d$, where $c \in \{-s, \ldots, 0\}$ and d is an unsigned w-bit integer.

Step 2: Repeat step 1 to compute $z = d - s c$. $z \equiv y \equiv x \pmod{2^w + s}$ and $z \in \{0, \ldots, 2^w + s^2\}$.

Step 3: If $z > 2^w + s$ return $z - (2^w + s)$, else return z.

3. MOTIVATION

Towards motivating the proposed divide-and-concatenate low power technique, consider the w-bit LCH universal hash data path shown in Figure 1 that operates on w-bit input message words and w-bit subkey words and has a collision probability of $2^{-w}$. This data path can be implemented as a two-stage pipeline using a combinational adder and a combinational array multiplier that can perform addition and multiplication in a single clock cycle respectively. It uses four w-bit registers (R1, R2, H and L), two 2w-bit registers (R3, R4), two w-bit 2-to-1 multiplexers (mux1 and mux2) and two 2w-bit 2-to-1 multiplexers (mux3 and mux4). The control signals are generated by control logic.
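The three steps of the division-less reduction in section 2.2 can be sketched in Python as follows. The function name `reduce_mod_p` is hypothetical, and the last step compares with `>=` rather than `>` so that the returned value always lies in [0, p); this is a small departure from the wording of Step 3 and is an assumption of this sketch.

```python
def reduce_mod_p(x, w, s):
    """Reduce a 2w-bit value x modulo p = 2**w + s without a division."""
    mask = (1 << w) - 1
    p = (1 << w) + s

    # Step 1: x = 2**w * a + b, so x is congruent to y = b - s*a (mod p).
    a, b = x >> w, x & mask
    y = b - s * a

    # Step 2: split y = 2**w * c + d with d unsigned. Python's >> and & on a
    # negative int behave like a two's complement (floor) split, so c <= 0.
    c, d = y >> w, y & mask
    z = d - s * c

    # Step 3: one conditional subtraction brings z into [0, p).
    return z - p if z >= p else z
```

For w = 32 and s = 15, for instance, `reduce_mod_p((1 << 64) - 1, 32, 15)` follows the path a = b = 2^32 - 1, c = -14, d = 14 and returns 224.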
Equation (1) can be rewritten as follows:

$$h_{m,x}(m) = \left(\left(\sum_{i=1}^{k} m_i x_i\right) + k t\right) \bmod p \qquad (3)$$

At the beginning of every message block, when newblock is set to active, registers R1, R2, R3, L and H and the counter in the control logic are all set to zero while register R4 is initialized to $k \cdot t$ (this is a constant for a specific instance of the LCH function). This is followed by k multiply-and-accumulate steps of the k w-bit message words with the k w-bit key words. In each step, $m_i$ in register R1 is multiplied with $x_i$ in register R2 and the result is stored in register R3. For this purpose, the control signal mulsrc of multiplexers mux1 and mux2 is set to select the lower input. This is then accumulated into register R4 using the adder/subtractor unit. The 2w-bit accumulated result at the end of the k + 1 clock cycles is stored in the (H, L) register pair.

After the accumulation, the modular reduction is applied to the 2w-bit accumulated result in (H, L) to get a w-bit hash. Step 1 of the reduction algorithm consumes two clock cycles. By setting mulsrc to select the upper input of mux1 and mux2 respectively, we will get $s \cdot a$ at the end of cycle 1. By setting the comp signal to select the upper input of mux4 and setting addsub to do subtraction we will get $y = b - s a$ at the end of cycle 2. By repeating these operations in clock cycles 3 and 4 we will get z (step 2 in section 2.2). In the last cycle, we perform $z - p$. If the result is less than zero, the value in register pair (H, L) is the hash value. Otherwise the output of the adder/subtractor is the hash. This is loaded into the output registers (H, L) by setting the load signal to 1. Overall, this architecture generates a w-bit hash for a $k \times w$ message block in k + 6 cycles. Let the maximum clock rate of a w-bit LCH data path be f MHz; this is determined by the w-bit combinational multiplier.

[Figure 1: w-bit hardware architecture for the LCH]
Throughput of the straightforward LCH data path shown in Figure 1 is $f \times w$ Mbps (the coefficient k/(k + 6) is omitted, because all throughputs will be scaled by this factor).

3.1 Analysis of Power Consumption

The dynamic power for CMOS circuits can be computed as:

$$P_{dynamic} = \alpha C_L V_{dd}^2 f \qquad (4)$$

$\alpha$ is the average switching probability of the inputs to the circuit, f is the clock frequency, $C_L$ is the equivalent load capacitance of the circuit (for a target technology library, it is proportional to the area of the circuit) and $V_{dd}$ is the supply voltage of the circuit. $\alpha$ is a function of the statistics of the input signals, the circuit style, the implemented function and the circuit topology. Dynamic power can be reduced by reducing one or more of the factors of equation (4). Reducing $V_{dd}$ obviously has the most impact. Any low power technique that reduces the supply voltage has to deploy additional techniques to maintain the throughput because reducing the supply voltage will increase the circuit delay as follows [11]:

$$delay = l \, C_{criticalpath} \frac{V_{dd}}{(V_{dd} - V_t)^2} \qquad (5)$$

l is a technology parameter that can be omitted when the optimized circuits are targeted to the same technology. $V_t$ is the threshold voltage and is much smaller than $V_{dd}$. The lower the supply voltage $V_{dd}$, the longer the circuit delay. Finally, it is the capacitance in the critical path and not the capacitance of the whole circuit that determines the delay.

[Figure 2: Parallel data path power optimization technique]

The parallel data paths technique shown in Figure 2 reduces the dynamic power by first replicating the original data path N times with each working at 1/N of the original clock frequency. The supply voltage to each of the data paths is then scaled. However, the original throughput is maintained by the replicated data paths. Since the replicated data paths process the same input, the average switching probabilities of every data path are the same. Ignoring the capacitance of the multiplexer and demultiplexer, the capacitance of the parallel data path is $N C_L$.
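A small numeric sketch of equation (4) applied to the parallel arrangement just described. The function names are hypothetical, and the sketch assumes that the supply voltage of each replica is scaled in proportion to its clock (to $V_{dd}/N$), a first-order simplification that ignores $V_t$; the units are arbitrary because only the ratio matters.

```python
from fractions import Fraction


def dynamic_power(alpha, c_load, v_dd, f):
    """Equation (4): P = alpha * C_L * V_dd**2 * f."""
    return alpha * c_load * v_dd ** 2 * f


def parallel_power(alpha, c_load, v_dd, f, n):
    """N replicas: total capacitance N*C_L, clock f/N, supply V_dd/N (assumed scaling)."""
    return dynamic_power(alpha, n * c_load, v_dd / n, f / n)


# Exact arithmetic with arbitrary illustrative values.
original = dynamic_power(Fraction(1, 2), 1, Fraction(162, 100), 142)
doubled = parallel_power(Fraction(1, 2), 1, Fraction(162, 100), Fraction(142), 2)
ratio = doubled / original
```

With N = 2 the ratio evaluates to 1/4: the capacitance doubles while the frequency halves and the squared voltage drops by a factor of four.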
The dynamic power consumed by these parallel data paths is:

$$P_{parallel} \approx \alpha \left(\frac{f}{N}\right) N C_L \left(\frac{V_{dd}}{N}\right)^2 = \frac{\alpha C_L V_{dd}^2 f}{N^2} = \frac{P_{original}}{N^2} \qquad (6)$$

We can apply this technique to the LCH data path shown in Figure 1. If two LCH data paths are used, the power consumption can be reduced to 1/4 of the original value, but it doubles the area.

Let us look at the power consumed by LCH and other universal hash functions from a different angle. Let us consider the power consumption of these universal hash functions as a function of their input word length when the clock frequency is fixed. The hardware complexity of a w-bit array multiplier, and hence that of a w-bit LCH data path, scales as $O(w^2)$. The average capacitance of a circuit is proportional to its area, so the capacitance of a multiplier and of the LCH data path scales as $O(w^2)$. Correspondingly, the delay of a w-bit array multiplier decreases as $O(w)$ when w is reduced [15]. When the same supply voltage is applied to two multipliers, the delay of the smaller multiplier is smaller. By reducing the supply voltage to the small multiplier as described in equation (5), its delay can be made equal to that of the large multiplier. According to equation (5), the supply voltage can be reduced as $O(w)$ to maintain a constant delay when the operand size is reduced. The two multipliers will have the same average switching probability $\alpha$. Overall, the dynamic power consumption of a w-bit multiplier decreases as:

$$P_{multiplier} = \alpha C_L V_{dd}^2 f = \alpha \, O(w^2) \, O(w)^2 = O(w^4) \qquad (7)$$

Since the delay of a 2w-bit adder is generally smaller than that of a w-bit combinational array multiplier [15], the critical path of the LCH data path shown in Figure 1 is determined by the multiplier. Overall, the dynamic power consumption of a w-bit LCH data path decreases as $O(w^4)$ if the critical path delays remain unchanged, so using small size data paths can reduce the power consumption greatly. The proposed divide-and-concatenate technique will use several small data paths to construct an equivalent power efficient LCH data path. The concept of equivalence is crucial. As we discussed in the introduction, we propose that two data paths implementing a hash function be considered equivalent if they have the same collision probability.

3.2 Collision Probability of Universal Hash Functions

Proposition 1: Given two universal hash functions h1 from A to B and h2 from A to C with collision probability p1 and p2 respectively, a new universal hash function h3 can be defined from A to $B \times C$ as an ordered pair h3(x) = (h1(x), h2(x)) with collision probability $p1 \cdot p2$ [16].

When h1 and h2 are two universal hash functions in the same universal hash family with two different keys, the above proposition is simplified as:

Proposition 2: For a universal hash function, its collision probability of $2^{-w}$ can be reduced to $2^{-nw}$ by hashing the same message n times using n different keys and concatenating the results.

However, this solution requires n times the key material. The Toeplitz-extension described in [1] reduces the amount of key material, making this approach practical.
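Proposition 2 can be checked by brute force on a toy instance. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses w = 4 and p = 2^4 + 1 = 17, two-word blocks, t = 0, and fully independent keys for the two hashes (Proposition 2 as stated, without the Toeplitz key reuse). The helper names `lch` and `collision_fraction` are hypothetical.

```python
from itertools import product


def lch(message, key, t, p):
    """(sum_i m_i * x_i + k*t) mod p for one k-word block."""
    return (sum(m * x for m, x in zip(message, key)) + len(message) * t) % p


def collision_fraction(m1, m2, w, s, n_hashes, t=0):
    """Return (colliding key choices, total key choices) for n_hashes
    independently keyed LCH hashes whose outputs are concatenated."""
    p = (1 << w) + s
    keys = list(product(range(1 << w), repeat=len(m1)))  # every k-word key
    colliding = sum(
        all(lch(m1, key, t, p) == lch(m2, key, t, p) for key in choice)
        for choice in product(keys, repeat=n_hashes)
    )
    return colliding, len(keys) ** n_hashes


single = collision_fraction((3, 7), (5, 7), w=4, s=1, n_hashes=1)
double = collision_fraction((3, 7), (5, 7), w=4, s=1, n_hashes=2)
```

For this message pair a single hash collides for 16 of the 256 keys ($2^{-4}$), and the concatenation of two independently keyed hashes collides for 256 of the 65536 key pairs ($2^{-8}$), matching the product rule of Proposition 1.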
In Toeplitz-extension, the keys for the n hashes are not necessarily different; the other n - 1 keys can be obtained by rotating the first key. For example, when we use two copies of the w-bit LCH data path to construct a 2w-bit data path, using Toeplitz-extension the keys for the second w-bit LCH data path can be obtained by rotating the original key by one bit. The proof that Toeplitz extension does not compromise the collision probability of the result is given in [2]. Using two w-bit LCH data paths (with collision probability of $2^{-w}$) and concatenating their outputs can provide a collision probability of $2^{-2w}$; this is the same as the collision probability provided by a 2w-bit data path. Since the area of an LCH data path decreases as $O(w^2)$ and the dynamic power consumption of an LCH data path decreases as $O(w^4)$, we can reduce both area and power consumption.

4. OPTIMIZING POWER BY DIVIDE AND CONCATENATE

Let us now construct an equivalent divide-and-concatenate LCH universal hash architecture that reduces dynamic power consumption with very low area overhead. Consider the straightforward w-bit LCH universal hash data path shown in Figure 1. It takes one w-bit message word every clock cycle and generates a w-bit hash value after the entire message is processed. Using the divide-and-concatenate technique, a w-bit LCH universal hash data path with collision probability of $2^{-w}$ can be constructed using two w/2-bit LCH universal hash data paths, each with collision probability of $2^{-w/2}$, and by concatenating their w/2-bit results to generate a w-bit hash value with collision probability of $2^{-w}$ as shown in Figure 3.

[Figure 3: The divide-and-concatenate architecture using two w/2-bit LCH data paths]

[Figure 4: The equivalent divide-and-concatenate architecture composed of four w/2-bit LCH data paths]
The fixed rotate operation R implements the Toeplitz extension technique to generate additional key material for the replicated data paths. R and the concatenation operation C in Figure 3 do not entail any area overhead as they are implemented as just a renaming of wires in the circuit. The upper w/2 bits and lower w/2 bits of the w-bit word need to be applied one after the other to this divide-and-concatenate data path. Since the power consumed by the w-bit LCH data path is $O(w^4)$, the w/2-bit LCH data path consumes about 1/16 of that consumed by a w-bit LCH data path when they run at the same speed. The divide-and-concatenate LCH data path shown in Figure 3 consumes about 1/8 of that consumed by a w-bit LCH data path when they run at the same speed. Since the area of a w-bit LCH data path is $O(w^2)$, the area of the divide-and-concatenate architecture shown in Figure 3 is about half of that of the w-bit LCH hash data path shown in Figure 1. However, this divide-and-concatenate LCH data path can only process a w/2-bit input every clock cycle, resulting in a throughput of $(f \times w)/2$ Mbps, about half of that of a w-bit LCH data path.

Let us duplicate this LCH divide-and-concatenate architecture once more to yield the LCH data path shown in Figure 4. This divide-and-concatenate data path uses four w/2-bit LCH data paths and processes a w-bit input every cycle. The upper w/2 bits and lower w/2 bits of the w-bit word can be applied in parallel in this divide-and-concatenate data path. This divide-and-concatenate LCH data path yields the same hash value as the divide-and-concatenate LCH data path shown in Figure 3.
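The first-order arithmetic behind these fractions can be tabulated with a short sketch. It assumes only the scaling model used in the text (power proportional to $w^4$, area proportional to $w^2$, identical clock rate) and ignores all linear and constant overheads; the function name `relative_cost` is hypothetical.

```python
from fractions import Fraction


def relative_cost(num_paths, divisor):
    """Power and area of num_paths data paths of width w/divisor,
    relative to a single w-bit data path (first-order model only)."""
    power = Fraction(num_paths, divisor ** 4)  # each path costs 1/divisor**4
    area = Fraction(num_paths, divisor ** 2)   # each path costs 1/divisor**2
    return power, area


two_half = relative_cost(2, 2)   # two w/2-bit data paths
four_half = relative_cost(4, 2)  # four w/2-bit data paths
```

Under this model two w/2-bit data paths cost 1/8 of the power and 1/2 of the area, and four w/2-bit data paths cost 1/4 of the power at the same area as the w-bit data path.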

In the divide-and-concatenate LCH data paths shown in Figure 3 and Figure 4, the message word $m_i$ and key word $x_i$ are w/2-bit. So the original $k \times w$ message block is reorganized as $(2k) \times (w/2)$. Modular reduction has the following property:

$$(a + b) \bmod p = (a \bmod p + b \bmod p) \bmod p \qquad (8)$$

Equation (3) can be rewritten as follows:

$$h_{m,x}(m) = \left[\left(\sum_{i=1,\ i\ \mathrm{odd}}^{2k} m_i x_i + \frac{k t}{2}\right) \bmod p + \left(\sum_{i=1,\ i\ \mathrm{even}}^{2k} m_i x_i + \frac{k t}{2}\right) \bmod p\right] \bmod p \qquad (9)$$

The divide-and-concatenate LCH data path in Figure 3 accumulates the temporary results for 2k cycles and then performs the modular reduction. The divide-and-concatenate LCH data path in Figure 4 accumulates the temporary results (the upper and lower two w/2-bit data paths calculate the first and second term inside the bracket of equation (9) respectively) for k cycles and then performs the modular reduction step. This divide-and-concatenate LCH data path using four w/2-bit data paths is equivalent to the straightforward w-bit LCH data path in two aspects: 1) they can process the same size input every clock cycle (and hence have the same throughput) and generate the same size hash value; 2) they have a collision probability of $2^{-w}$. We call the divide-and-concatenate architecture using four w/2-bit data paths the equivalent divide-and-concatenate architecture.

The area of this equivalent divide-and-concatenate LCH data path is approximately equal to the area of the straightforward w-bit LCH data path. The dynamic power consumed by this equivalent divide-and-concatenate LCH data path is about 1/4 of that consumed by the straightforward w-bit LCH data path. Let us compare this with the parallel data paths approach using two w-bit LCH data paths. This duplicated data paths approach incurs about 100% area overhead and consumes 1/4 of the dynamic power consumed by the straightforward w-bit LCH data path.

The divide-and-concatenate approach can be further extended as follows: Each w/2-bit data path in the equivalent divide-and-concatenate architecture can be replaced by four w/4-bit LCH data paths.
This translates into sixteen w/4-bit LCH data paths to construct an equivalent w-bit LCH data path. While the parallel data paths technique using four w-bit LCH data paths incurs 400% area overhead, the equivalent divide-and-concatenate LCH data path using sixteen w/4-bit LCH data paths can reduce the dynamic power consumption to 1/16 without incurring any area overhead. The smallest data path size that can be used to construct an equivalent divide-and-concatenate data path is 4 bits. Below this value, the keys are so small that Toeplitz extension does not work (a 2-bit key word cannot be rotated four times).

Another advantage of the divide-and-concatenate technique over the parallel data paths technique is that the divide-and-concatenate technique uses a single clock; it does not reduce the clock rate of the duplicated data paths. On the contrary, a higher speed clock is used for the multiplexer and demultiplexer in parallel data paths as shown in Figure 2.

5. EXPERIMENTAL RESULTS

The analysis of the proposed divide-and-concatenate technique in section 4 used simplified models for the area and delay of multipliers and LCH data paths. In this section, we will present a detailed implementation based validation of our claims. In the divide-and-concatenate technique, the number of small size data paths increases quadratically when the size of the data paths decreases. The linear and constant components will incur area overhead. For example, the 32-bit LCH data path consumes 9071 gates while its equivalent divide-and-concatenate architecture composed of four 16-bit LCH data paths consumes 9917 gates when they are targeted to the IBM 0.18 µm ASIC library. We implemented the straightforward 64-bit LCH universal hash data path and the divide-and-concatenate LCH data paths using 32-bit, 16-bit, 8-bit and 4-bit data paths. An LCH data path using two parallel 64-bit LCH data paths is also implemented.
They are modeled using VHDL, simulated using Modeltech Modelsim and synthesized using Synopsys Design Compiler. The power consumption was reported by Synopsys Prime Power based on the netlist and the simulation results from Modelsim. The supply voltage of the targeted library is 1.62V. Since the divide-and-concatenate data paths using different size data paths use different supply voltages, we modified the voltage parameter in the technology library file. The divide-and-concatenate architectures using different size data paths are synthesized on the library with different voltage parameters. The delay and dynamic power consumption are also reported based on the modified library. We summarize the data path width, number of data paths, area, clock frequency, voltage, power consumption, power saving and ratio of power consumption saving percentage to area overhead percentage in Table 1.

The straightforward 64-bit LCH data path uses the original 1.62V power supply provided by the targeted library. Its maximum clock rate can achieve 142MHz. For the divide-and-concatenate LCH architecture using 32-bit data paths, the supply voltage can be scaled down to 1.32V and the clock rate of the design can still achieve 142MHz. The area overhead compared to the straightforward 64-bit LCH data path is in the fifth row. For example, the area overhead of the equivalent divide-and-concatenate architecture using 32-bit LCH data paths is 16%, while the area overhead of the parallel data path technique that uses two 64-bit LCH data paths is 102%. The dynamic power consumption savings are listed in the second to last row. For example, the equivalent divide-and-concatenate architecture using 16-bit data paths can save 75.29% power consumption, while the parallel data path technique that uses two 64-bit LCH data paths saves 62.10%. We use the ratio of power consumption saving to area overhead to evaluate the efficiency of the proposed architectures.
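The efficiency metric is a plain quotient, which the following sketch spells out. The function name `efficiency_ratio` is hypothetical; the percentages are the ones quoted in this paper for the 32-bit and 16-bit divide-and-concatenate designs and for the two-way parallel design.

```python
def efficiency_ratio(power_saving_pct, area_overhead_pct):
    """Power saving percentage divided by area overhead percentage."""
    return power_saving_pct / area_overhead_pct


dc_32 = efficiency_ratio(55.44, 16)       # divide-and-concatenate, 32-bit data paths
dc_16 = efficiency_ratio(75.29, 34)       # divide-and-concatenate, 16-bit data paths
parallel = efficiency_ratio(62.10, 102)   # two parallel 64-bit data paths
```

A larger value means more power saved per unit of added area; with these inputs the 32-bit design scores highest and the parallel design lowest of the three.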
For example, the equivalent divide-and-concatenate architecture using 32-bit LCH data paths saves 55.44% power consumption with 16% area overhead, so its ratio is 3.465 (55.44/16). The ratios are shown in the last row. Except for the divide-and-concatenate architecture using 4-bit data paths, all other divide-and-concatenate architectures are superior to the parallel data path technique. In particular, the divide-and-concatenate architecture using 32-bit data paths has the best ratio. As the width of the data path in the divide-and-concatenate architecture becomes small, the percentage of the area of multipliers in the area of the hardware architecture also becomes small. The quadratic duplication of linear components incurs more area overhead and power consumption, resulting in smaller ratios. In particular, in 4-bit data path based designs the LCH area is dominated by adders. This is because the area of an adder is comparable to that of a multiplier and the number of adders grows quadratically, resulting in an inefficient design.

Table 1: Implementation results for the straightforward 64-bit LCH data path, equivalent divide-and-concatenate data paths and parallel data paths with collision probability of $2^{-64}$

Architectures | Straightforward | Divide-and-Concatenate | Divide-and-Concatenate | Divide-and-Concatenate | Divide-and-Concatenate | Parallel data paths
Data path width (bits) | 64 | 32 | 16 | 8 | 4 | 64
# of data paths | 1 | 4 | 16 | | | 2
Area (Gates) | | | | | |
Area overhead ratio | | 16% | 34% | 57% | 92% | 102%
Clock Rate (MHz) | 142 | 142 | | | |
Voltage (Volts) | 1.62 | 1.32 | | | |
Power consumption (µW) | | | | | |
Power saving ratio | | 55.44% | 75.29% | 75.49% | 74.34% | 62.10%
Ratio (power ratio/area ratio) | | 3.465 | | | |

We found that the power consumption reported by Prime Power using the default average switching probability is almost the same as the power consumption based on the simulation results from Modelsim. This is because of the inherent randomness of hash functions. The average switching probability reported by Modelsim is almost 50%, which is the same as the default value used by Prime Power.

6. CONCLUSIONS

Applying general low power design techniques to universal hash functions yields only moderate improvements. We defined a collision probability equivalent data path and combined it with the divide-and-concatenate technique to design low power architectures with very small area overhead for LCH calculations, based on using multiple small multipliers in place of one larger multiplier. When applied to design the LCH universal hash with collision probability of $2^{-64}$, compared to the 64-bit straightforward implementation, the proposed technique can reduce the power consumption by 55% and 75% with no performance loss and an area overhead of just 16% and 34% by using 32-bit and 16-bit data paths respectively.

7. REFERENCES

[1] S. Halevi and H. Krawczyk. MMH: Software message authentication in the Gbit/second rates. In Workshop on Fast Software Encryption.
[2] J. Black, S. Halevi, H. Krawczyk, T. Krovetz, and P. Rogaway. UMAC: Fast and secure message authentication. In Cryptology Conference on Advances in Cryptology.
[3] D. A. McGrew. The truncated multi-modular hash function (TMMH). In IETF Internet Draft.
[4] S. Dharmapurikar, P. Krishnamurthy, T. Sproull, and J. Lockwood. Deep packet inspection using parallel bloom filters. In High Performance Interconnects.
[5] Z. Huang. High-Level Optimization Techniques for Low-Power Multiplier Design. Ph.D. Thesis, University of California at Los Angeles.
[6] A. Raghunathan, S. Dey, and N. K. Jha. Glitch analysis and reduction in register transfer level power optimization. In Design Automation Conference.
[7] J. Yu, W. Wu, X. Chen, H. Hsieh, J. Yang, and F. Balarin. Assertion-based design exploration of DVS in network processor architectures. In Design Automation and Test in Europe, pages 92-97.
[8] P. Petrov and A. Orailoglu. Low-power instruction bus encoding for embedded processors. IEEE Trans. Very Large Scale Integr. Syst., 12(8).
[9] M. Srivastava and M. Potkonjak. Power optimization in programmable processors and ASIC implementations of linear systems: Transformation-based approach. In Design Automation Conference.
[10] A. Chandrakasan, S. Sheng, and R. Brodersen. Low-power CMOS digital design. IEEE Journal of Solid-State Circuits, 27(4).
[11] J. M. Rabaey. Digital Integrated Circuits: A Design Perspective. Prentice-Hall, Englewood Cliffs, NJ.
[12] B. Yang, R. Karri, and D. A. McGrew. Divide-and-concatenate: An architecture level optimization technique for universal hash functions. In Design Automation Conference, pages 44-52.
[13] P. Rogaway and T. Shrimpton. Cryptographic hash function basics: Definitions, implications, and separations for preimage resistance, second-preimage resistance, and collision resistance. In Fast Software Encryption.
[14] L. Carter and M. Wegman. Universal hash functions. Journal of Computer and System Sciences, 18.
[15] I. Koren. Computer Arithmetic Algorithms. A. K. Peters, Natick, Massachusetts, 2nd Edition.
[16] J. R. Black. Message Authentication Codes. Ph.D. Thesis, University of California at Davis.
