A unified architecture of MD5 and RIPEMD-160 hash algorithms

Title: A unified architecture of MD5 and RIPEMD-160 hash algorithms
Author(s): Ng, CW; Ng, TS; Yip, KW
Citation: The 2004 IEEE International Symposium on Circuits and Systems, Vancouver, BC, 23-26 May 2004. In IEEE International Symposium on Circuits and Systems Proceedings, 2004, v. 2, p. II-889-II-892
Issued Date: 2004
URL: http://hdl.handle.net/10722/46445
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.; 2004 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

A UNIFIED ARCHITECTURE OF MD5 AND RIPEMD-160 HASH ALGORITHMS

Chiu-Wah Ng, Tung-Sang Ng and Kun-Wah Yip
Department of Electrical & Electronic Engineering, The University of Hong Kong
Pokfulam Road, Hong Kong
Email: {cwng, tsng, kwyip}@eee.hku.hk

ABSTRACT

Hash algorithms are important components in many cryptographic applications and security protocol suites. In this paper, a unified architecture for the MD5 and RIPEMD-160 hash algorithms is developed. These two algorithms differ in speed and security level. Therefore, a unified hardware design allows applications to switch from one algorithm to the other based on different requirements. The architecture has been implemented using Altera's EPF10K50SBC356-1, providing a throughput of over 200 Mbits/s for MD5 and 80 Mbits/s for RIPEMD-160 when operated at 26.66 MHz, with a resource utilization of 1964 LC.

I. INTRODUCTION

A hash function computes a fixed-length output, called the message digest, from an input message of arbitrary length. The MD5 message digest algorithm [1], developed by Ron Rivest at MIT, accepts a message input of arbitrary length and produces a 128-bit hash code. It has been one of the most widely used hash algorithms. However, it has been indicated in [2] that there is a security threat in the algorithm. Furthermore, a 128-bit hash output may not offer sufficient security protection in the near future. To provide higher security protection, RIPEMD-160 has been proposed by Dobbertin, Bosselaers and Preneel [2]. RIPEMD-160 accepts the same input format as MD5 and produces a 160-bit output.

It is easy to see that the structures of the two algorithms are quite similar. It follows that they can be combined to give one hardware design that can perform both hash functions. This approach has the following advantages. First, the unified design is a resource-efficient implementation when different hash algorithms are needed in applications.
Applications can switch to either algorithm based on different requirements. Second, since MD5 is still the most widely used hash algorithm, upgrading a current implementation to RIPEMD-160 in the future is much easier with a unified hardware architecture.

Motivated by these observations, in this paper we develop a unified architecture for MD5 and RIPEMD-160. Comparison with other unified architectures indicates that the proposed architecture is area-efficient.

II. REVIEW OF HASH ALGORITHMS

MD5 and RIPEMD-160 share the same flow of operation. First, the input message is padded and divided into data blocks of length 512 bits. Each data block is treated as 16 32-bit words. The algorithms process the data blocks iteratively. For the first data block, an initial value is used to compute an intermediate result. The intermediate result, called the chaining variable, is then updated according to the input data block and the previous result. After all iterations are done, the final chaining variable is the hash value.

A. MD5

Figure 1 shows the basic operation of MD5 [3]. The algorithm consists of 4 main rounds, with each round applying the basic operation 16 times. In the figure, F(B, C, D) is a nonlinear bit-wise function of 3 inputs. There are 4 different functions for the 4 rounds. The index i represents each step, and X[i] represents one message word. The order in which the 16 words are processed differs from round to round. In the figure, K[i] is a 32-bit constant chosen from a fixed table of 64 constants stated in the specification [1], and S[i] is a 5-bit input that controls the left rotation of the input sum. After a total of 64 basic operations, the chaining variable is updated by adding the 4 variables A, B, C and D to the current content of the chaining variable.

B. RIPEMD-160

Figure 2 shows the basic operation of RIPEMD-160. RIPEMD-160 consists of 5 main rounds, with each round applying the basic operation 16 times. There are two parallel lines in each step, and five different nonlinear functions F(B, C, D) corresponding to the 5 rounds. In the figure, X[i] is the input word, K[i] is one of the ten 32-bit constants in the algorithm, S[i] corresponds to the 4-bit control for the left rotation operation, and Rol10 denotes a 10-bit left cyclic shift operation. After a total of 80 basic operations, the chaining variable is updated by adding the five left variables A, B, C, D, E, the five right variables A', B', C', D', E', and the current content of the chaining variable.

III. UNIFIED ARCHITECTURE DESIGN AND ITS FPGA IMPLEMENTATION

It is apparent from Section II that MD5 and RIPEMD-160 share many common parts, but there are also certain differences, as listed below.

1. The input order for the word X[i] and the number of shifts S[i] are different.
2. There are 64 constants for K[i] in MD5 but only 10 constants in RIPEMD-160.
3. RIPEMD-160 has an extra variable (E) for the 160-bit output.
4. The nonlinear functions of the two algorithms are different.
5. MD5 requires a cyclic shifter with a 5-bit shift amount, while RIPEMD-160 needs only a 4-bit one.
6. There is an extra cyclic shift operation (Rol10) in RIPEMD-160.

In order to unify the two algorithms for FPGA implementation, we modify the basic hardware structure of MD5 (Figure 1) with the following modifications and additions.

1. Two separate look-up tables (LUTs) are used to store the constants and the numbers of shifts.
2. All the input orders are stored in a LUT to index the message words in different steps.
3. The 128-bit chaining variable is expanded to 160 bits.
4. A 10-bit cyclic shifter, which is required in RIPEMD-160, is added.
5. Two additional nonlinear functions are added. [Four and five nonlinear functions are required for MD5 and RIPEMD-160, respectively. Out of these nine functions, only six are distinct.]
6. Multiplexers are added.

The resultant unified architecture is shown in Figure 3. The design has been modeled in VHDL and implemented in an FPGA. The target device is an Altera EPF10K50SBC356-1. The aim is to obtain a resource-efficient implementation. Therefore, only one of the two lines of RIPEMD-160 is implemented in the FPGA, which saves nearly half of the logic elements. This arrangement is a trade-off between area and speed.
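
The unified step described by the modifications above can be illustrated with a small software model. This is only a sketch, not the authors' VHDL: the names `unified_step`, `rol` and `inv`, the mode strings and the tuple layout of the state are hypothetical, and the nonlinear functions are taken from the public MD5 and RIPEMD-160 specifications rather than from this paper. The RIPEMD-160 functions are listed in the order used by the left line.

```python
MASK = 0xFFFFFFFF  # all arithmetic is on 32-bit words


def rol(value, count):
    """Rotate a 32-bit word left by `count` bits."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & MASK


def inv(value):
    """Bitwise complement restricted to 32 bits."""
    return ~value & MASK


# Nonlinear functions of (B, C, D), as given in the public specifications.
MD5_FUNCS = [
    lambda b, c, d: (b & c) | (inv(b) & d),   # MD5 round 1
    lambda b, c, d: (b & d) | (c & inv(d)),   # MD5 round 2
    lambda b, c, d: b ^ c ^ d,                # MD5 round 3
    lambda b, c, d: c ^ (b | inv(d)),         # MD5 round 4
]
RIPEMD160_FUNCS = [
    lambda b, c, d: b ^ c ^ d,                # round 1, same as MD5 round 3
    lambda b, c, d: (b & c) | (inv(b) & d),   # round 2, same as MD5 round 1
    lambda b, c, d: (b | inv(c)) ^ d,         # round 3
    lambda b, c, d: (b & d) | (c & inv(d)),   # round 4, same as MD5 round 2
    lambda b, c, d: b ^ (c | inv(d)),         # round 5
]


def unified_step(mode, state, func, word, const, shift):
    """Apply one basic operation to the 5-word state (A, B, C, D, E).

    The adder chain and the variable rotation are shared by both modes;
    only the final register update differs (hypothetical software model).
    """
    a, b, c, d, e = state
    t = rol((a + func(b, c, d) + word + const) & MASK, shift)
    if mode == "md5":
        # E is unused by MD5 and simply passes through.
        return (d, (b + t) & MASK, b, c, e)
    if mode == "ripemd160":
        # Extra variable E and the fixed 10-bit rotation of C.
        return (e, (t + e) & MASK, b, rol(c, 10), d)
    raise ValueError(f"unknown mode: {mode}")
```

Evaluating all nine functions on the same inputs shows that only six distinct results occur, which agrees with the count given in item 5 of the modification list.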
A compact hardware implementation can be obtained by the following approach. Modern FPGAs have numerous memory resources besides logic elements. The hash algorithms take 512-bit data blocks as input. A dual-port RAM in the FPGA can be used to store this block so that the data can be accessed both externally and internally. Memory elements configured as ROM are used to store the constants, numbers of shifts and word-input orders.

For constant lookup, a ROM with a 6-bit input address and a 32-bit output is used for MD5, as the algorithm specifies 64 32-bit constants, and a ROM with a 4-bit input address is required for storing the 10 constants of RIPEMD-160. For the shift-number lookup, since there are 160 steps in RIPEMD-160, a ROM with 8-bit address lines must be used. These shift amounts fall within the lower 192 words of the ROM, leaving the upper 64 words for the variable shifts of MD5. This arrangement allows simple decoding logic to select the appropriate values for the 2 algorithms. In order to accommodate the larger shifts specified in the MD5 algorithm, a ROM with a 5-bit output is used. For the message selection sequence, the same approach is used, with a ROM that has an 8-bit input address and a 4-bit output.

As LUTs are used in the aforementioned manner, a counter can be used to index the appropriate values during each step. This simplifies the control logic design, and a compact implementation results. Table 1 summarizes the memory resource utilization.

Memory lookup is slow, and the message words are accessed by first reading the message selection table and then the dual-port RAM, so the execution time is increased. To speed up the design, the execution of the basic operation is pipelined into two stages. First, an operand fetch is performed, in which the appropriate constants are looked up in memory. Then an execution phase performs the actual operation. Figure 4 shows an example of the execution of MD5.
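
The counter-based lookup and the two-stage pipeline described above can be sketched as follows. Several details are assumptions made for illustration only: the function names are hypothetical, the MD5 shift entries are assumed to start at address 192 (the first of the upper 64 words of the 256-word ROM), and the schedule models nothing beyond the overlap of the fetch for step i with the execution of step i-1.

```python
MD5_SHIFT_BASE = 192  # assumed start of the upper 64 words of the shift ROM


def shift_rom_address(mode, step):
    """Map a step counter to an 8-bit address in the shared shift-amount ROM."""
    if mode == "ripemd160":
        if not 0 <= step < 160:
            raise ValueError("RIPEMD-160 has 160 steps (two lines of 80)")
        return step
    if mode == "md5":
        if not 0 <= step < 64:
            raise ValueError("MD5 has 64 steps")
        return MD5_SHIFT_BASE + step
    raise ValueError(f"unknown mode: {mode}")


def pipeline_schedule(num_steps):
    """Return one (fetch, execute) pair per clock cycle.

    None marks an idle stage: nothing executes in the first cycle and
    nothing is fetched in the last one.
    """
    schedule = []
    for cycle in range(num_steps + 1):
        fetch = cycle if cycle < num_steps else None
        execute = cycle - 1 if cycle >= 1 else None
        schedule.append((fetch, execute))
    return schedule
```

Under this model, n steps occupy n + 1 clock cycles. Table 2 reports 66 and 162 cycles, two more than the respective step counts; the sketch accounts for one of these through pipeline fill and does not attempt to explain the other.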
With additional circuits for initialization and output data processing, the final hardware design is shown in Figure 5.

IV. PERFORMANCE EVALUATION AND COMPARISON

The design has been synthesized, placed and routed on Altera's EPF10K50SBC356-1 using Altera's MaxPlusII 10.0. The maximum frequency is 26.66 MHz. The number of logic cells used is 1964 (68%) and the number of memory bits is 5376 (13%). The throughput of the hash algorithms can be calculated as follows:

Throughput (Mbits/s) = 512 x freq (MHz) / no. of clock cycles.

The throughput performance is shown in Table 2.

Several implementations of hash algorithms can be found in the literature. However, each of them uses a different FPGA device, and hence a comparison of the performance is difficult. As a rough comparison, Table 3 and Table 4 summarize the results. In [4], a universal hardware module is designed for the MD4 family, including MD5, RIPEMD-160, SHA-1 and SHA-256. Although this module can combine 4 different algorithms, it requires multiple clock cycles to execute a step. In the design presented herein, each step can be executed in a single clock cycle. In addition, the module of [4] requires significantly more resources (1004 CLBs) than the proposed design. In [5], only MD5 is implemented. We can compare the performance of our design with the performance results reported in [5]. In [6], the implementation combines MD5, SHA-1 and HAS-160 and thus requires significant resources. It is apparent from Table 4 that the unified architecture requires only slightly more resources than the single MD5 implementation while offering comparable performance.

Table 1. Memory utilization of the proposed design.

  MD5 constants            64 x 32 = 2048 bits
  RIPEMD-160 constants     16 x 32 =  512 bits
  Shift amount             256 x 5 = 1280 bits
  Message selection        256 x 4 = 1024 bits
  Dual-port RAM                       512 bits
  Total                              5376 bits

Table 2. Performance of the proposed design.

                  MD5            RIPEMD-160
  Clock cycles    66             162
  Throughput      206 Mbits/s    84 Mbits/s

Table 3. Performance comparison of different designs (Cyc. = clock cycles; Thru. = throughput in Mbits/s).

                This design     [4]             [5]             [6]
  Algorithm     Cyc.   Thru.    Cyc.   Thru.    Cyc.   Thru.    Cyc.   Thru.
  MD5           66     206      206    107      65     165      65     142
  RIPEMD-160    162    84       337    65       N/A    N/A      N/A    N/A

Table 4. Comparison of resource utilization.

                    This design    [4]         [5]           [6]
  FPGA vendor       Altera         Xilinx      Xilinx        Altera
  Resource util.    1964 LC        1004 CLB    880 slices    10573 LE

V. CONCLUSION

In this paper, a unified architecture for MD5 and RIPEMD-160 has been designed and implemented in an FPGA. It has been shown that the implementation is resource-efficient and preserves the single-clock-cycle structure. With the present FPGA implementation, the hardware design is suitable for applications that require low to medium throughput, such as Ethernet networks.

REFERENCES

[1] R. Rivest, The MD5 Message-Digest Algorithm, RFC 1321, Apr. 1992.
[2] H. Dobbertin, A. Bosselaers and B. Preneel, RIPEMD-160: A Strengthened Version of RIPEMD, Fast Software Encryption, LNCS 1039, pp. 71-92, Springer-Verlag, 1996.
[3] W. Stallings, Cryptography and Network Security, 2nd ed., New York: Prentice-Hall, 1997.
[4] S. Dominikus, A hardware implementation of MD4-family hash algorithms, Proc. 9th Int. Conf. on Electronics, Circuits and Systems, vol. 3, pp. 1143-1146, 2002.
[5] J. Deepakumara, H. M. Heys and R. Venkatesan, FPGA implementation of MD5 hash algorithm, Proc. 2001 Canadian Conf. on Electrical and Computer Engineering, vol. 2, pp. 919-924, 2001.
[6] Y. K. Kang, D. W. Kim, T. W. Kwon and J. R. Choi, An efficient implementation of hash function processor for IPSEC, Proc. 2002 IEEE Asia-Pacific Conf. on ASIC, pp. 93-96, 6-8 Aug. 2002.

Figure 1. Basic MD5 operation.

Figure 2. Basic RIPEMD-160 operation.

Figure 3. Basic operation of the unified architecture.

Figure 4. Execution of the MD5 algorithm: an example. (Schedule shown: Fetch 0; Fetch 1 with Execution 0; Fetch 2 with Execution 1; ...; Fetch 80 with Execution 79; Execution 80.)

Figure 5. Hardware design of the unified architecture. (Blocks shown: IV, H0 to H4, rol, adders, X RAM, ROM 1, ROM 2, data in, ROM, barrel shifter, index ROM.)
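
As a quick check, the throughput formula of Section IV can be evaluated against the clock-cycle counts of Table 2. The function name and the constant `BLOCK_BITS` below are illustrative choices; the frequency and cycle counts are the values reported in the paper.

```python
BLOCK_BITS = 512  # one data block is processed per pass through the algorithm


def throughput_mbits_per_s(freq_mhz, clock_cycles):
    """Throughput (Mbits/s) = 512 x freq (MHz) / no. of clock cycles."""
    return BLOCK_BITS * freq_mhz / clock_cycles


for name, cycles in (("MD5", 66), ("RIPEMD-160", 162)):
    print(f"{name}: {throughput_mbits_per_s(26.66, cycles):.1f} Mbits/s")
```

At 26.66 MHz this gives about 206.8 Mbits/s for MD5 and 84.3 Mbits/s for RIPEMD-160, which Table 2 reports as 206 and 84 Mbits/s.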