Comparing Hardware Performance of Fourteen Round Two SHA-3 Candidates Using FPGAs


Ekawat Homsirikamol, Marcin Rogawski, Kris Gaj
George Mason University
ehomsiri, mrogawsk,

This work has been supported in part by NIST through the Recovery Act Measurement Science and Engineering Research Grant Program, under contract no. 6NANBD4.

Contents

1 Introduction and Motivation
2 Methodology
  2.1 Choice of a Language, FPGA Devices, and Tools
  2.2 Performance Metrics for FPGAs
  2.3 Uniform Interface
  2.4 Assumptions and Simplifications
  2.5 Optimization Target
  2.6 Design Methodology
3 Comprehensive Designs of SHA-3 Candidates
  3.1 Notations and Symbols
  3.2 Basic Component Description
    3.2.1 Multiplication by 2 in the Galois Field GF(2^8)
    3.2.2 Multiplication by n in the Galois Field GF(2^8)
    3.2.3 AES
  3.3 BLAKE (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.4 Blue Midnight Wish (BMW) (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.5 CubeHash (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.6 ECHO (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.7 Fugue (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.8 Groestl (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.9 Hamsi (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.10 JH (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.11 Keccak (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.12 Luffa (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.13 SHA-2 (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.14 Shabal (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.15 SHAvite (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.16 SIMD (Block Diagram Description; 256 vs. 512 Variant Differences)
  3.17 Skein (Block Diagram Description; 256 vs. 512 Variant Differences)
4 Design Summary and Results
  4.1 Design Summary
  4.2 Relative Performance of the 512 and 256-bit Variants of the SHA-3 Candidates
  4.3 Results
5 Results from Other Groups
  5.1 Best Results from Other Groups
  5.2 Best Results
6 Conclusions and Future Work

Abstract

Performance in hardware has been demonstrated to be an important factor in the evaluation of candidates for cryptographic standards. Up to now, no consensus exists on how such an evaluation should be performed in order to make it fair, transparent, practical, and acceptable for the majority of the cryptographic community. In this report, we formulate a proposal for a fair and comprehensive evaluation methodology, and apply it to the comparison of hardware performance of 14 Round 2 SHA-3 candidates. The most important aspects of our methodology include the definition of clear performance metrics, the development of a uniform and practical interface, generation of multiple sets of results for several representative FPGA families from two major vendors, and the application of a simple procedure to convert multiple sets of results into a single ranking. The VHDL codes for 256 and 512-bit variants of all 14 SHA-3 Round 2 candidates and the old standard SHA-2 have been developed and thoroughly verified. These codes have then been used to evaluate the relative performance of all aforementioned algorithms using seven modern families of Field Programmable Gate Arrays (FPGAs) from two major vendors, Xilinx and Altera. All algorithms have been evaluated using four performance measures: the throughput to area ratio, throughput, area, and the execution time for short messages. Based on these results, the 14 Round 2 SHA-3 candidates have been divided into several groups depending on their overall performance in FPGAs.

Chapter 1 Introduction and Motivation

Starting from the Advanced Encryption Standard (AES) contest organized by NIST in 1997-2000 [], open contests have become a method of choice for selecting cryptographic standards in the U.S. and over the world. The AES contest in the U.S. was followed by the NESSIE competition in Europe [2], CRYPTREC in Japan, and eSTREAM in Europe [3].

Four typical criteria taken into account in the evaluation of candidates are: security, performance in software, performance in hardware, and flexibility. While security is commonly recognized as the most important evaluation criterion, it is also a measure that is most difficult to evaluate and quantify, especially during a relatively short period of time reserved for the majority of contests. A typical outcome is that, after eliminating a fraction of candidates based on security flaws, a significant number of remaining candidates fail to demonstrate any easy to identify security weaknesses, and as a result are judged to have adequate security.

Performance in software and hardware are next in line to clearly differentiate among the candidates for a cryptographic standard. Interestingly, the differences among the cryptographic algorithms in terms of hardware performance seem to be particularly large, and often serve as a tiebreaker when other criteria fail to identify a clear winner. For example, in the AES contest, the difference in hardware speed between the two fastest final candidates (Serpent and Rijndael) and the slowest one (Mars) was a factor of seven [][4]; in the eSTREAM competition the spread of results among the eight top candidates qualified to the final round was a factor of 500 in terms of speed (Trivium x64 vs. Pomaranch), and a factor of 30 in terms of area (Grain v1 vs. Edon80) [5][6].

At this point, the focus of the attention of the entire cryptographic community is on the SHA-3 contest for a new hash function standard, organized by NIST [7][8]. The contest is now in its second round, with 14 candidates remaining in the competition. The evaluation is scheduled to continue until the second quarter of 2012.

In spite of the progress made during previous competitions, no clear and commonly accepted methodology exists for comparing hardware performance of cryptographic algorithms [9]. The majority of the reported evaluations have been performed on an ad-hoc basis, and focused on one particular technology and one particular family of hardware devices. Other pitfalls included the lack of a uniform interface, performance metrics, and optimization criteria. These pitfalls are compounded by different skills of designers, the use of two different hardware description languages, and no clear way of compressing multiple results to a single ranking.

In this paper, we address all the aforementioned issues, and propose a clear, fair, and comprehensive methodology for comparing hardware performance of SHA-3 candidates and any future algorithms competing to become a new cryptographic standard. Our methodology is based on the use of FPGA devices from various vendors. The advantages of using FPGAs for comparison include short development time, wide availability of tools, and a limited number of vendors dominating the market.

The hardware evaluation of SHA-3 candidates started shortly after announcing the specifications and reference software implementations of 51 algorithms submitted to the contest [7][8][]. The majority of initial comparisons were limited to less than five candidates, and their results have been published at []. The more comprehensive efforts became feasible only after NIST's announcement of 14 candidates qualified to the second round of the competition in July 2009. Since then, two comprehensive studies have been reported in the Cryptology ePrint Archive [][2]. The first, from the University of Graz, has focused on ASIC technology; the second, from two institutions in Japan, has focused on the use of the FPGA-based SASEBO-GII board from AIST, Japan.

Although both studies generated quite comprehensive results for their respective technologies, they did not quite address the issue of a uniform methodology, which could be accepted and used by a larger number of research teams. Our study is intended to fill this gap, and put forward a proposal that could be evaluated and commented on by a larger cryptographic community.

Chapter 2 Methodology

2.1 Choice of a Language, FPGA Devices, and Tools

Out of two major hardware description languages used in industry, VHDL and Verilog HDL, we choose VHDL. We believe that either of the two languages is perfectly suited for the implementation and comparison of SHA-3 candidates, as long as all candidates are described in the same language. Using two different languages to describe different candidates may introduce an undesired bias to the evaluation.

FPGA devices from two major vendors, Xilinx and Altera, dominate the market with about 90% of the market share. We therefore feel that it is appropriate to focus on FPGA devices from these two companies. In this study, we have chosen to use seven families of FPGA devices from Xilinx and Altera. These families include two major groups, those optimized for minimum cost (Spartan 3 from Xilinx, and Cyclone II and III from Altera) and those optimized for high performance (Virtex 4 and 5 from Xilinx, and Stratix II and III from Altera). Within each family, we use devices with the highest speed grade, and the largest number of pins.

As CAD tools, we have selected tools developed by FPGA vendors themselves: Xilinx ISE Design Suite v. 11.1 (including Xilinx XST, used for synthesis) and Altera Quartus II v. 9.1 Subscription Edition Software.

2.2 Performance Metrics for FPGAs

Choosing proper performance metrics for the implementation of hash functions (or any other cryptographic transformations) using FPGAs is a non-trivial task, and no clear consensus exists so far on how these metrics should be defined. Below we summarize our proposed approach, which we applied in our study.

Speed. In order to characterize the speed of the hardware implementation of a hash function, we suggest using Throughput, understood as a throughput (number of input bits processed per unit of time) for long messages. To be exact, we define Throughput using the following formula:

$$\mathit{Throughput} = \frac{\mathit{block\_size}}{T \cdot (\mathit{HTime}(N+1) - \mathit{HTime}(N))} \qquad (2.1)$$

where block_size is a message block size, characteristic for each hash function, HTime(N) is a total number of clock cycles necessary to hash an N-block message, and T is a clock period, different and characteristic for each hardware implementation of a specific hash function. Throughput defined this way is typically independent of N (and thus the size of the message), as in all hash function architectures we investigated so far, the expression HTime(N+1) - HTime(N) is a constant that corresponds to the number of clock cycles between processing of two subsequent input blocks. The effective throughput for short messages is always smaller, and is expressed by the formula

$$\mathit{Throughput}_{\mathit{eff}} = \frac{N \cdot \mathit{block\_size}}{T \cdot \mathit{HTime}(N)} \qquad (2.2)$$

In this paper, we provide the exact formulas for HTime(N) for each SHA-3 candidate (see Table 4.2), and values of f = 1/T for each algorithm-FPGA device pair (see Tables 4.8 and 4.9). Therefore, we provide sufficient information to calculate and compare values of the effective throughputs for each specific message size, which may be of interest in a given application.

For short messages, it is more important to evaluate the total time required to process a message of a given size (rather than throughput). The size of the message can be chosen depending on the requirements of an application. For example, in the eBASH study of software implementations of hash functions, execution times for all sizes of messages, from 0 bytes (empty message) to 4096 bytes, are reported, and five specific sizes 8, 64, 576, 1536, and 4096 are featured in the tables [3]. The generic formulas we include in this paper (see Table 4.2) allow the calculation of the execution times for any message size.

In order to characterize the capability of a given hash function implementation for processing short messages, we present in this study the comparison of execution times for an empty message (one block of data after padding) and a 100-byte (800-bit) message before padding (which becomes equivalent, for the majority but not all of the investigated functions, to 1024 bits after padding). To be exact, our parameters are defined as follows:

$$T_{\mathit{empty}} = T \cdot \mathit{HTime}(1) \qquad (2.3)$$

$$T_{100B} = T \cdot \mathit{HTime}\left(\frac{\mathit{padlen}(800)}{\mathit{block\_size}}\right) \qquad (2.4)$$

where padlen(800) denotes the size of an 800-bit message after padding.

Resource Utilization/Area. Resource utilization is particularly difficult to compare fairly in FPGAs, and is often a source of various evaluation pitfalls. First, the basic programmable block (such as CLB slice in Xilinx FPGAs) has a different structure and different capabilities for various FPGA families from different vendors. For example, in Virtex 5, a CLB slice includes four 6-input Look-Up-Tables (LUTs); in Spartan 3 and Virtex 4, a CLB slice includes two 4-input LUTs. In Cyclone II and Cyclone III, the basic programmable block is called Logic Element (LE); in Stratix II and III, the basic programmable component has a different structure and is called ALUT (Adaptive Look-Up Table). Taking this issue into account, we suggest avoiding any comparisons across family lines.

Secondly, all modern FPGAs include multiple dedicated resources, which can be used to implement specific functionality. These resources include Block RAMs (BRAMs), multipliers (MULs), and DSP units in Xilinx FPGAs, and memory blocks, multipliers, and DSP units in Altera FPGAs. In order to implement a specific operation, some of these resources may be interchangeable, but there is no clear conversion factor to express one resource in terms of the other. Therefore, we suggest, in the general case, treating resource utilization as a vector, with coordinates specific to a given FPGA family. For example,

$$\mathit{Resource\ Utilization}_{\mathit{Spartan3}} = (\#\mathit{CLB\,slices},\ \#\mathit{BRAMs},\ \#\mathit{MULs}) \qquad (2.5)$$

$$\mathit{Resource\ Utilization}_{\mathit{CycloneIII}} = (\#\mathit{LE},\ \#\mathit{memory\ bits},\ \#\mathit{MULs}) \qquad (2.6)$$

Taking into account that vectors cannot be easily compared to each other, we have decided to opt out of using any dedicated resources in the hash function implementations used for our comparison. Thus, all coordinates of our vectors, other than the first one, have been forced (by choosing appropriate options of the synthesis and implementation tools) to be zero.
This way, our resource utilization (further referred to as Area) is characterized using a single number, specific to the given family of FPGAs, namely the number of CLB slices (#CLBslices) for Xilinx FPGAs, the number of Logic Elements (#LE) for Cyclone II and Cyclone III, and the number of Adaptive Look-Up Tables (#ALUT) in Stratix II and Stratix III.

The resource utilization vector in FPGAs (or even its simplified one-coordinate form, referred to as Area above) cannot be easily translated to an equivalent area or the number of transistors in ASICs. Any attempts to define a resource utilization unit that would apply to both technologies (such as an equivalent logic gate) have been mostly unsuccessful, and of limited value in practice. The only common denominator is cost, but unfortunately the prices of integrated circuits, and FPGAs in particular, are not commonly available, and are affected by multiple non-technical factors (including the number of units ordered, the relationship between companies, etc.).
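As a worked illustration of the speed metrics defined earlier in this section (long-message throughput, effective throughput, and execution time for short messages), the following Python sketch evaluates them for one core. The cycle-count formula `htime_example`, the 10 ns clock period, and the 512-bit block size are hypothetical placeholders chosen only for illustration; they are not results from this study.

```python
def throughput(block_size, T, htime, n=1):
    """Long-message throughput: block_size / (T * (HTime(N+1) - HTime(N)))."""
    return block_size / (T * (htime(n + 1) - htime(n)))


def effective_throughput(block_size, T, htime, n):
    """Effective throughput for an N-block message: N * block_size / (T * HTime(N))."""
    return n * block_size / (T * htime(n))


def execution_time(T, htime, padded_bits, block_size):
    """Execution time T * HTime(padlen / block_size) for a message already padded."""
    n_blocks = padded_bits // block_size
    return T * htime(n_blocks)


# Hypothetical core: 2 cycles of fixed overhead plus 21 cycles per block.
def htime_example(n):
    return 2 + 21 * n


BLOCK_SIZE = 512   # bits, assumed for illustration
T_NS = 10.0        # clock period in ns, assumed for illustration

if __name__ == "__main__":
    # With T in ns, throughput comes out in bits per ns, i.e. Gbit/s.
    print(throughput(BLOCK_SIZE, T_NS, htime_example))
    print(effective_throughput(BLOCK_SIZE, T_NS, htime_example, 1))
    print(execution_time(T_NS, htime_example, 512, BLOCK_SIZE))    # empty message
    print(execution_time(T_NS, htime_example, 1024, BLOCK_SIZE))   # 100-byte message
```

Because the fixed overhead in `htime_example` is amortized over more blocks as N grows, the effective throughput approaches the long-message throughput from below, which matches the remark that the effective throughput for short messages is always smaller.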

2.3 Uniform Interface

In order to remove any ambiguity in the definition of our hardware cores for SHA-3 candidates, and in order to make our implementations as practical as possible, we have developed an interface shown in Fig. 2.1a, and described below. In a typical scenario, the SHA core is assumed to be surrounded by two standard FIFO modules: Input FIFO and Output FIFO, as shown in Fig. 2.1b. In this configuration, the SHA core is an active module, while the surrounding logic (FIFOs) is passive. Passive logic is much easier to implement, and in our case is composed of standard logic components, FIFOs, available in any major library of IP cores. Each FIFO module generates signals empty and full, which indicate that the FIFO is empty and/or full, respectively. Each FIFO accepts control signals write and read, indicating that the FIFO is being written to and/or read from, respectively.

The aforementioned assumptions about the use of FIFOs as surrounding modules are very natural and easy to meet. For example, if a SHA core implemented on an FPGA communicates with the outside world using a PCI, PCI-X, or PCIe interface, the implementations of these interfaces most likely already include Input and Output FIFOs, which can be directly connected to a SHA core. If a SHA core communicates with another core implemented on the same FPGA, then FIFOs are often used on the boundary between the two cores in order to accommodate any differences between the rate of generating data by one core and the rate of accepting data by another core.

Additionally, the inputs and outputs of our proposed SHA core interface do not necessarily need to be generated/consumed by FIFOs. Any circuit that can support control signals src_ready and src_read can be used as a source of data. Any circuit that can support control signals dst_ready and dst_write can be used as a destination for data.
The exact format of an input to the SHA core, for the case of pre-padded messages, is shown in Fig. 2.2. Two scenarios of operation are supported. In the first scenario, the message bitlength after padding is known in advance and is smaller than 2^w. In this scenario, shown in Fig. 2.2a, the first word of input represents message length after padding, expressed in bits. This word has the least significant bit, representing a flag called last, set to one. This word is followed by the message length before padding. This value is required by several SHA-3 algorithms using internal counters (such as BLAKE, ECHO, SHAvite-3, and Skein), even if padding is done outside of the SHA core. These two control words are followed by all words of the message.

[Figure 2.1: a) Input/output interface of a SHA core (ports: clk, rst, din, dout, src_ready, src_read, dst_ready, dst_write). b) A typical configuration of a SHA core connected to two surrounding FIFOs.]

[Figure 2.2: Format of input data for two different operation scenarios: a) with message bitlength known in advance (words, each w bits wide: msg_len | last=1, msg_len_bp, message), and b) with message bitlength unknown in advance (words, each w bits wide: seg_0_len | last=0, seg_0, ..., seg_n-1_len | last=1, seg_n-1_len_bp, seg_n-1). Notation: msg_len: message length after padding, msg_len_bp: message length before padding, seg_i_len: segment i length after padding, seg_i_len_bp: segment i length before padding, last: a one-bit flag denoting the last segment of the message (or one-segment message), |: bitwise OR.]

The second format, shown in Fig. 2.2b, is used when either the message length is not known in advance, or it is greater than 2^w. In this case, the message is processed in segments of data denoted as seg_0, seg_1, ..., seg_n-1. For the ease of processing data by the hash core, the size of the segments from seg_0 to seg_n-2 is required to always be an integer multiple of the block size, and thus also of the word size. The least significant bit of the segment length expressed in bits is thus naturally zero, and this bit, treated as a flag called last, can be used to differentiate between the last segment and all previous segments of the message. The last segment before padding can be of arbitrary length < 2^w. This segment is processed in the same way as the entire message in scenario a). This way there is no need for an additional signal to distinguish between these two scenarios. Scenario a) is a special case of scenario b). In case the SHA core supports padding, the protocol can be even simpler, as explained in [4].
Please note that scenario b) is very similar to the way data is processed by a typical software API for hash functions, such as [5]. The Update function of the software API corresponds to processing segments from seg_0 to seg_n-2. The function Final corresponds to the processing of the last segment of data, seg_n-1.
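The input format described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `to_words` and `format_input` are hypothetical, segments are given as already padded byte strings, bytes are packed into w-bit words in big-endian order, and w defaults to 32. None of these choices is taken from the authors' VHDL code.

```python
def to_words(data, w):
    """Pack bytes into w-bit words (big-endian packing is an assumption)."""
    step = w // 8
    if len(data) % step:
        raise ValueError("segment must be a whole number of words")
    return [int.from_bytes(data[i:i + step], "big")
            for i in range(0, len(data), step)]


def format_input(segments, last_len_before_padding, w=32):
    """Build the word stream for scenario b); one segment gives scenario a).

    segments: padded byte strings; all but the last should be a multiple
    of the block size, so their bit length has a zero least significant bit.
    """
    words = []
    for i, seg in enumerate(segments):
        is_last = (i == len(segments) - 1)
        seg_len_bits = 8 * len(seg)
        if seg_len_bits >= (1 << w):
            raise ValueError("segment length must be smaller than 2^w bits")
        # The length word carries the flag 'last' in its least significant bit.
        words.append(seg_len_bits | int(is_last))
        if is_last:
            # Only the last segment is followed by its length before padding.
            words.append(last_len_before_padding)
        words.extend(to_words(seg, w))
    return words


if __name__ == "__main__":
    # One padded 512-bit block that held a 24-bit message before padding.
    stream = format_input([bytes(64)], 24)
    print(len(stream), stream[0], stream[1])
```

Calling `format_input` once per chunk of a long stream mirrors the Update/Final split of a software API: every call except the last emits a length word with last = 0.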

2.4 Assumptions and Simplifications

Our study is performed using the following assumptions.

Only the SHA-3 candidate variants with the 256-bit and the 512-bit outputs have been implemented and compared at this point. Other variants, treated either independently or as combinations of multiple variants (all-in-one hash cores), may be subjects of future comparisons.

Padding is assumed to be done outside of the hash cores (e.g., in software). All investigated hash functions have very similar padding schemes, which would lead to similar absolute area overhead if implemented as a part of the hardware core. The relative area penalty will be higher for cores with smaller area used for main processing. The complexity of the padding circuit will also depend on the assumptions regarding the smallest unit of a message (bit, byte, or word), which may be different for specific applications.

Only the primary mode of operation is supported for all functions. Special modes, such as tree hashing or MAC mode, are not implemented (their implementation would actually work against the respective candidates, because of the area and speed penalty introduced by these extra features).

There is also no support for providing salt specific to each message. The salt values are fixed to all zeros in all SHA-3 candidates supporting this special input (namely BLAKE, ECHO, SHAvite-3, and Skein).

2.5 Optimization Target

We believe that the choice of the primary optimization target is one of the most important decisions that needs to be made before the start of the comparison. The optimization target should drive the design process of every SHA-3 candidate, and it should also be used as a primary factor in ranking the obtained SHA-3 cores. The most common choices are: Maximum Throughput, Minimum Latency, Minimum Area, Throughput to Area Ratio, Product of Latency times Area, etc.

Our choice is the Throughput to Area Ratio, where Throughput is defined as Throughput for long messages, and Area is expressed in terms of the number of basic programmable logic blocks specific to a given FPGA family. This choice has multiple advantages. First, it is practical, as hardware cores are typically applied in situations where the size of the processed data is significant and the speed of processing is essential. Otherwise, the input/output latency overhead associated with using a hardware accelerator dominates the total processing time, and the cost of using dedicated hardware (FPGA) is not justified. Optimizing for the best ratio provides a good balance between the speed and the cost of the solution. Secondly, this optimization criterion is a very reliable guide throughout the entire design process. At every junction where decisions must be made, starting from the choice of a high-level hardware architecture down to the choice of the particular FPGA tool options, this criterion facilitates the decision process, leaving very few possible paths for further investigation.

On the contrary, optimizing for Throughput alone leads to highly unrolled hash function architectures, in which a relatively minor improvement in speed is associated with a major increase in the circuit area. In hash function cores, latency, defined as a delay between providing an input and obtaining the corresponding output, is a function of the input size. Since various sizes may be most common in specific applications, this parameter is not a well-defined optimization target. Finally, optimizing for area leads to highly sequential designs, resembling small general-purpose microprocessors, and the final product depends highly on the maximum amount of area (e.g., a particular FPGA device) assumed to be available.

2.6 Design Methodology

Our design of all 14 SHA-3 candidates followed an identical design methodology. Each SHA core is composed of the Datapath and the Controller. The Controller is implemented using three main Finite State Machines, working in parallel, and responsible for the Input, Main Processing, and the Output, respectively. As a result, each circuit can simultaneously perform the following three tasks: output the hash value for the previous message, process the current block of data, and read the next block of data. The parameters of the interface are selected in such a way that the time necessary to process one block of data is always larger than or equal to the time necessary to read the next block of data. This way, the processing of long streams of data can happen at full speed, without any visible input interface overhead. The finite state machines responsible for input and output are almost identical for all hash function candidates; the third state machine, responsible for main data processing, is based on a similar template. The similarity of all designs and reuse of common building blocks assures a high fairness of the comparison.

The design of the Datapath starts from the high level architecture.
At this point, the most complex task that can be executed in an iterative fashion, with the minimum overhead associated with multiplexing inputs specific to a given iteration round, is identified. The proper choice of such a task is very important, as it determines both the number of clock cycles per block of the message and the circuit critical path (minimum clock period). It should be stressed that the choice of the most complex task that can be executed in an iterative fashion should not blindly follow the specification of a function. In particular, quite often one round (or one step) from the description of the algorithm is not the most suitable component to be iterated in hardware. Either multiple rounds (steps) or fractions thereof may be more appropriate. In Table 2.1 we summarize our choices of the main iterative tasks of SHA-3 candidates. Each such task is implemented as combinational logic, surrounded by registers.

Table 2.1: Main iterative tasks of the hardware architectures of SHA-3 candidates optimized for the maximum Throughput to Area ratio

Function | Main Iterative Task
BLAKE | G_i..G_i+3
BMW | entire function
CubeHash | one round
ECHO | AES round / AES round / BIG.SHIFTROWS, BIG.MIXCOLUMNS
Fugue | 2 subrounds (ROR3, CMIX, SMIX)
Groestl | Modified AES round
Hamsi | Truncated Non-Linear Permutation P
JH | Round function R_8
Keccak | Round R
Luffa | The Step Function, Step
Shabal | Two iterations of the main loop
SHAvite-3 | AES round
SIMD | 4 steps of the compression function
Skein | 4 rounds of Threefish-

The next step is an efficient implementation of each combinational block within the Datapath. In Table 2.2, we summarize major operations of all SHA-3 candidates that require logic resources in hardware implementations. Fixed shifts, fixed rotations, and other more complex permutations are omitted because they appear in all candidates and require only routing resources (programmable interconnects).

Table 2.2: Major operations of SHA-3 candidates (other than permutations, fixed shifts and fixed rotations). maddn denotes a multioperand addition with n operands. A dash marks an operation that is not used.

Function | NTT | Linear code | S-box | GF MUL | MUL | madd | ADD/SUB | Boolean
BLAKE | - | - | - | - | - | madd3 | ADD | XOR
BMW | - | - | - | - | - | madd7 | ADD,SUB | XOR
CubeHash | - | - | - | - | - | - | ADD | XOR
ECHO | - | - | AES 8x8 | x2, x3 | - | - | - | XOR
Fugue | - | - | AES 8x8 | x4..x7 | - | - | - | XOR
Groestl | - | - | AES 8x8 | x2..x5, x7 | - | - | - | XOR
Hamsi | - | LC | Serpent 4x4 | - | - | - | - | XOR
JH | - | - | 4x4 | x2, x5 | - | - | - | XOR
Keccak | - | - | - | - | - | - | - | NOT,AND,XOR
Luffa | - | - | 4x4 | x2 | - | - | - | XOR
Shabal | - | - | - | - | x3, x5 | - | ADD,SUB | NOT,AND,XOR
SHAvite-3 | - | - | AES 8x8 | x2, x3 | - | - | - | NOT,XOR
SIMD | NTT | - | - | - | x185, x233 | madd3 | ADD | NOT,AND,OR
Skein | - | - | - | - | - | - | ADD | XOR
SHA-256 | - | - | - | - | - | madd5 | - | NOT,AND,XOR

The most complex of the logic operations are the Number Theoretic Transform (NTT) [6] in SIMD, linear code (LC) [7] in Hamsi, basic operations of AES (8x8 AES S-box and multiplication by a constant in the Galois Field GF(2^8)) in ECHO, Fugue, Groestl, and SHAvite-3; and multioperand additions in BLAKE, BMW, SIMD, and SHA-256. For each of these operations we have implemented at least two alternative architectures. NTT was optimized using a Fast Fourier Transform (FFT) [6]. In Hamsi, the linear code was implemented using both logic (matrix by vector multiplications in GF(4)), and using look-up tables. AES 8x8 S-boxes (SubBytes) were implemented using both look-up tables (stored in distributed memories), and using logic only (following the method described in [8], Section.6..3). Multi-operand additions were implemented using the following four methods: carry save adders (CSA), tree of two operand adders, parallel counter, and a "+" in VHDL [9]. Finally, integer multiplications by 3 and 5 in Shabal have been replaced by a fixed shift and addition. All optimized implementations of basic operations have been applied uniformly to all SHA-3 candidates. In case the initial testing did not provide a strong indication of superiority of one of the alternative methods, the entire hash function unit was implemented using two alternative versions of the basic operation code, and the results for the version with the better throughput to area ratio have been listed in the result tables.

All VHDL codes have been thoroughly verified using a universal testbench, capable of testing an arbitrary hash function core that follows the interface described in Section 2.3 [2]. A special padding script was developed in Perl in order to pad messages included in the Known Answer Test (KAT) files distributed as a part of each candidate's submission package.
An output from the script follows a similar format as its input, but includes, apart from padding bits, also the lengths of the message segments, defined in Section 2.3, and shown schematically in Fig. 2.2. The generation of a large number of results was facilitated by an open source tool ATHENa (Automated Tool for Hardware EvaluatioN) [2]. This benchmarking environment was also used to optimize requested synthesis and implementation frequencies and other tool options.
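Two of the arithmetic techniques named in this section, multioperand addition with carry save adders and multiplication by 3 and 5 through a fixed shift and addition, can be modeled in software as follows. This is a behavioral Python sketch, not the authors' VHDL: the function names are hypothetical, the word size defaults to 32 bits as an assumption, and the serial reduction order is chosen for brevity rather than to mirror any particular adder tree.

```python
def csa(a, b, c, w=32):
    """One carry save adder stage: three operands in, sum and carry words out."""
    mask = (1 << w) - 1
    s = a ^ b ^ c                                   # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1      # carries moved one position left
    return s & mask, carry & mask


def madd(operands, w=32):
    """Multioperand addition modulo 2^w using CSA stages and one final adder."""
    mask = (1 << w) - 1
    ops = [x & mask for x in operands]
    while len(ops) > 2:
        s, carry = csa(ops[0], ops[1], ops[2], w)
        ops = ops[3:] + [s, carry]                  # three operands reduced to two
    return sum(ops) & mask                          # single carry-propagate addition


def mul3(x, w=32):
    """x * 3 modulo 2^w as a fixed shift plus an addition."""
    return ((x << 1) + x) & ((1 << w) - 1)


def mul5(x, w=32):
    """x * 5 modulo 2^w as a fixed shift plus an addition."""
    return ((x << 2) + x) & ((1 << w) - 1)


if __name__ == "__main__":
    print(madd([1, 2, 3, 4, 5]))
    print(mul3(7), mul5(7))
```

The point of the CSA form is that only the last step needs a full carry-propagate adder; every earlier stage is a layer of independent bitwise logic.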

Chapter 3 Comprehensive Designs of SHA-3 Candidates

The designs of all 14 SHA-3 candidates followed the same basic design principle, with the core separated into two main units, the Datapath and the Controller. Only the Datapath diagrams are provided in this chapter, as the Controller can be derived from the Datapath and the specification of the function, and described using ASM charts. The full specification of each of the algorithms can be found in [2 35].

3.1 Notations and Symbols

Table 3.1 provides the notation and symbols that are being used throughout this chapter.

Table 3.1: Notations and Symbols

Word | A group of bits used in arithmetic and logic operations, typically of the size of 32 or 64 bits.
Block | A group of words.
X[i] | Refers to an array position i in X.
X_i | Refers to a bit position i in X.
salt | Salt values are always assumed to be zero and as a result they are omitted from the diagrams.
b | Block size in bits.
h | Hash value size in bits.
w | Word size in bits.
IV | Initialization vector.
SEXT | Sign extension.
ZEXT | Zero extension.
<<<R | Rotation left by R positions. If R is a constant: fixed rotation; if R is a variable: variable rotation implemented using a barrel rotator.
>>>R | Rotation right by R positions. If R is a constant: fixed rotation; if R is a variable: variable rotation implemented using a barrel rotator.
<<S | Shift left by S positions. If S is a constant: fixed shift; if S is a variable: variable shift implemented using a barrel shifter.
>>S | Shift right by S positions. If S is a constant: fixed shift; if S is a variable: variable shift implemented using a barrel shifter.
Concatenation | By default, the buses concatenate back to the same arrangement as before the separation (split) occurs.
SIPO | Serial-in-parallel-out unit.
PISO | Parallel-in-serial-out unit.
switch endian word | Wordwise endianness switching.
switch endian byte | Bytewise endianness switching.

3.2 Basic Component Description

This section describes implementations of selected basic components used in more than one algorithm. These components include multiplication by 2 in GF(2^8), SubBytes, MixColumns, and AES Round.

3.2.1 Multiplication by 2 in the Galois Field GF(2^8)

[Figure 3.1: Basic: x2. Input bits X[7] to X[0], output bits Y[7] to Y[0].]

3.2.2 Multiplication by n in the Galois Field GF(2^8)

Galois Field multiplication by n other than 2 is summarized in Table 3.2.

Table 3.2: Galois Field Multiplication by n

$$\begin{aligned}
x3(X) &= x2(X) \oplus X \\
x4(X) &= x2(x2(X)) \\
x5(X) &= x4(X) \oplus X \\
x6(X) &= x4(X) \oplus x2(X) \\
x7(X) &= x4(X) \oplus x3(X)
\end{aligned}$$
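The relations in Table 3.2 can be checked with a short Python model. The sketch assumes the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) for the x2 operation; candidates that use a different field polynomial would need a different constant. The function names simply mirror the table and are not taken from the authors' code.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, assumed reduction polynomial


def x2(x):
    """Multiply a byte by 2 in GF(2^8): shift left, reduce if bit 8 is set."""
    x <<= 1
    if x & 0x100:
        x ^= AES_POLY
    return x & 0xFF


def x3(x):
    return x2(x) ^ x


def x4(x):
    return x2(x2(x))


def x5(x):
    return x4(x) ^ x


def x6(x):
    return x4(x) ^ x2(x)


def x7(x):
    return x4(x) ^ x3(x)


if __name__ == "__main__":
    for f in (x2, x3, x4, x5, x6, x7):
        print(f.__name__, hex(f(0x57)))
```

In hardware, x2 reduces to wiring plus three XOR gates driven by the top input bit, which is why the larger multipliers are built from it rather than implemented directly.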

3.2.3 AES

AES is a basic building block of many SHA-3 candidates. An AES round consists of three basic operations, SubBytes, MixColumns and ShiftRows, shown in Figure 3.2. The SubBytes operation, shown in Figure 3.3, performs direct substitution on all bytes of its input. MixColumns performs matrix multiplication on each word of its input. A word of AES contains 32 bits. Hence, four instances of MixColumns are required to process the entire AES block of 128 bits. The SBOX and ShiftRows operations of AES and their full specifications can be found in [36] and [37].

Figure 3.2: Basic: AES Round (blocks shown: SubBytes, MixColumns, ShiftRows; inputs x and key)

Figure 3.3: Basic: AES SubBytes (one AES SBOX per input byte, x[0] to x[15]). Note: All buses are 8 bits wide.

Figure 3.4: Basic: AES MixColumns

3.3 BLAKE

3.3.1 Block Diagram Description

Figure 3.5 shows the datapath of BLAKE. In this design, the combinational CORE implements one half of BLAKE's round [21]. Thus, two clock cycles are necessary to implement the full round. First, a message block is loaded into SIPO. Once done, the block is stored in a temporary register, which holds the message block until it is fully processed by the CORE. This temporary register allows the next message block to be loaded simultaneously into SIPO. The message block msg and the constant c are then applied as inputs to the function PERMUTE, and the obtained output is passed to the design's CORE. Simultaneously, the chaining value, CV, is initialized with the Initialization Vector, IV, and an input to the CORE, V, is initialized with a value dependent on the chaining value, the counter, t, and the lower half of the constant c. The initial value of V is mixed by the CORE with an output of the block PERMUTE, CM, for twenty clock cycles (10 rounds). Once this operation is completed, an additional clock cycle is required for finalization. The output of Finalization is used as the next chaining value, for intermediate message blocks, or as the final hash value for the last message block.

The Initialization unit performs the following function:

v[0]   v[1]   v[2]   v[3]         h[0]          h[1]          h[2]   h[3]
v[4]   v[5]   v[6]   v[7]    ←    h[4]          h[5]          h[6]   h[7]
v[8]   v[9]   v[10]  v[11]        c[0]          c[1]          c[2]   c[3]
v[12]  v[13]  v[14]  v[15]        t[] ⊕ c[4]    t[] ⊕ c[5]    c[6]   c[7]

The Finalization unit performs the following operation:

h'[0] ← h[0] ⊕ v[0] ⊕ v[8]
h'[1] ← h[1] ⊕ v[1] ⊕ v[9]
h'[2] ← h[2] ⊕ v[2] ⊕ v[10]
h'[3] ← h[3] ⊕ v[3] ⊕ v[11]
h'[4] ← h[4] ⊕ v[4] ⊕ v[12]
h'[5] ← h[5] ⊕ v[5] ⊕ v[13]
h'[6] ← h[6] ⊕ v[6] ⊕ v[14]
h'[7] ← h[7] ⊕ v[7] ⊕ v[15]

In Figure 3.6, the operation of BLAKE's PERMUTE module is presented. A new value of the variable m is selected depending on the round number using a wide multiplexer preceded by constant permutations. A permutation table is shown in Table 3.3. The selection signal of the multiplexer cycles from 0 to 9 (and then again back to 0 for the 512-bit variant of BLAKE) until all of BLAKE's rounds are executed. Each output of the multiplexer is then mixed with the respective constant using the transformation XOR W CROSS (defined in the note to Fig. 3.6) and registered afterwards.
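The Finalization operation of the BLAKE datapath reduces to a single XOR pattern per chaining-value word. The following is a minimal Python sketch of that pattern, assuming zero salt (as stated in Table 3.1) and modelling each word as an integer; the function name blake_finalize is hypothetical.

```python
def blake_finalize(h, v):
    """Return the next chaining value from h (8 words) and the state v (16 words).

    Assumes the salt is zero, so each output word is h[i] XOR v[i] XOR v[i + 8].
    """
    if len(h) != 8 or len(v) != 16:
        raise ValueError("expected 8 chaining words and 16 state words")
    return [h[i] ^ v[i] ^ v[i + 8] for i in range(8)]
```

Because every output word depends on only three input words, the unit fits in one clock cycle, consistent with the single finalization cycle mentioned in the text.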

Figure 3.5: BLAKE: Datapath (256-bit variant: b=512, h=256; 512-bit variant: b=1024, h=512)

Table 3.3: BLAKE: Permutation Constants (rows σ0 to σ9, each split into hi and lo halves)

The CORE unit is shown in Figure 3.7 and represents one half of BLAKE's round. As specified in [21], there are two levels of G functions, and therefore a permutation between the first and the second half-round is required. This permutation is performed wordwise and is shown in Table 3.4. LVL2 permute transforms the state matrix (output of 4 parallel G functions) into a new matrix appropriate for the second half-round. LVL1 permute is a permutation inverse to LVL2 permute.

Table 3.4: BLAKE: Half Round's Permutation (columns: LVL2 (forward), LVL1 (revert))

The G-function in the CORE unit is shown in Figure 3.8. Note that the XOR operations used to calculate input values CM_2i and CM_2i+1, which are normally depicted as a part of the G-function, are omitted in our design. These operations were placed in the PERMUTE unit and are therefore skipped here. R1, R2, R3 and R4 are rotation constants. The values of these constants are shown in Table 3.5.

Table 3.5: BLAKE: Rotation Constants R1, R2, R3, R4 of the 256-bit variant

Figure 3.6: BLAKE: PERMUTE. Note: hi and lo denote the top and bottom halves of the permutation table. XOR_W CROSS: with m = m[0..15] and c = c[0..15], cm_2i = m_2i ⊕ c_2i+1 and cm_2i+1 = m_2i+1 ⊕ c_2i.

3.3.2 256 vs. 512 Variant Differences

The 512-bit variant of BLAKE doubles the word size of the 256-bit variant, thereby increasing the block size as well. Hence, the IV and the constant are changed from 512 bits to 1024 bits. These values can be found in Section 2.2.1 of [21]. The 512-bit variant also increases the number of mixing rounds from 10 to 14. As a result, the number of clock cycles required in our design for processing a single block of message increases from 21 to 29. The multiplexer selection signal in the PERMUTE unit loops back when the round number reaches 10. Hence, after reaching 9, this selection signal goes back to 0. Finally, the rotation constants are adjusted to reflect the increased word size. These values are described in Table 3.6.

Table 3.6: BLAKE: Rotation Constants R1, R2, R3, R4 of the 512-bit variant
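The cycle counts quoted for BLAKE follow from two clock cycles per round (one per half-round) plus one finalization cycle. The Python sketch below reproduces that arithmetic together with the wrap-around of the PERMUTE selection signal; the helper names are hypothetical and the model ignores message loading, which overlaps with processing.

```python
def cycles_per_block(rounds: int) -> int:
    """Clock cycles to process one block: two half-round cycles per round, plus finalization."""
    return 2 * rounds + 1


def permute_select(round_number: int) -> int:
    """Selection signal of the PERMUTE multiplexer; the table has ten rows, so it wraps after 9."""
    return round_number % 10
```

With 10 rounds this gives 21 cycles and with 14 rounds it gives 29, and for 14 rounds the selection signal revisits rows 0 to 3.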

Figure 3.7: BLAKE: CORE. Four parallel G functions take (A, B, C, D) from (V[0], V[4], V[8], V[12]), (V[1], V[5], V[9], V[13]), (V[2], V[6], V[10], V[14]) and (V[3], V[7], V[11], V[15]), with inputs CM[0,1], CM[2,3], CM[4,5] and CM[6,7], followed by LVL2 permute and LVL1 permute.

Figure 3.8: BLAKE: G-function (inputs A, B, C, D, CM_2i and CM_2i+1; rotations by R1, R2, R3 and R4)

3.4 Blue Midnight Wish (BMW)

3.4.1 Block Diagram Description

Our design for Blue Midnight Wish (BMW) hashes a block of data within one clock cycle. Since the number of clock cycles necessary to read a block of a message is greater than the number of clock cycles required to hash it, an additional clock signal is used in the circuit, as shown in Figure 3.9. This faster clock (io_clk) drives the SIPO and PISO units, allowing them to read and write data at a faster rate than the other units in the circuit operate. The rate of reading and writing is determined by the block size and the number of cycles required to process a block. Since only one clock cycle is used to process a message block, the frequency of io_clk is block size/word size times higher than that of the main clock. This ratio is equal to 8 for BMW-256.

BMW requires each message block to go through endianness switching before the start of processing. A message block is then mixed with the chaining value in order to obtain the next chaining value. Once all blocks of the message are processed, a finalization round is initiated. Since there is no incoming message block, the chaining value and the input message block are replaced by the constant and the chaining value, respectively. The descriptions of F0, F1, F2 and AddElements and their associated logical operations can be found in [22] (see Table 1.3 there).

3.4.2 256 vs. 512 Variant Differences

BMW-512 increases the word size of BMW-256 from 32 to 64 bits. As a result, the block size is doubled as well. Since the block size increases, the number of io_clk cycles required to load a message block also increases, from 8 cycles to 16 cycles. Furthermore, logic functions, specifically shifts and rotations, are adjusted to accommodate the increased word size. These changes are shown in Table 1.3 of [22]. All other operations remain the same.
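The relation between io_clk and the main clock in the BMW design can be written as a one-line calculation. The Python sketch below is only an illustration: the block sizes of 512 and 1024 bits come from the BMW specification, and the 64-bit I/O word is an assumption, chosen because it is the width that yields the ratios 8 and 16 quoted in the text. The function name is hypothetical.

```python
def io_clk_ratio(block_bits: int, io_word_bits: int = 64) -> int:
    """Number of io_clk cycles needed per main-clock cycle.

    One main-clock cycle hashes a whole block, so the SIPO must shift in
    block_bits / io_word_bits words in that time. io_word_bits = 64 is an
    assumed interface width, not a value stated in this chapter.
    """
    if block_bits % io_word_bits:
        raise ValueError("block size must be a multiple of the I/O word size")
    return block_bits // io_word_bits
```

Under these assumptions the ratio is 8 for a 512-bit block and 16 for a 1024-bit block.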

Figure 3.9: BMW: Datapath (BMW-256: b=512, h=256; BMW-512: b=1024, h=512)

3.5 CubeHash

3.5.1 Block Diagram Description

A straightforward iterative architecture is used in our design. The datapath of CubeHash is shown in Figure 3.10. Due to an endianness issue, the input message is required to go through endianness switching twice. First, the bytewise endianness switching is applied, which is then followed by the wordwise endianness switching. A word of CubeHash consists of 32 bits. For each message, the chaining value A is initialized to IV. The 256 leftmost bits of the chaining value are XORed with an input message block. The state is then transformed for 16 rounds. A round is described in Figure 3.11. Swaps used inside of the round are described in Figure 3.12. All operations inside the round are performed wordwise. This process repeats until all message blocks are processed. In the last round of the last message block, an integer one is XORed with position zero of the chaining value, rp, by activating the control signal final before the chaining value is inserted back into the state register. Then, the chaining value is transformed for 160 rounds to get the final hash value. The hash value is required to go through the endianness switching process again to reach the correct hash output.

3.5.2 256 vs. 512 Variant Differences

Everything is the same for both variants with the exception of the truncation size. CubeHash16/32-256 truncates the state to 256 bits to obtain the hash value, as opposed to CubeHash16/32-512, which truncates the state to 512 bits.
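As a software reference for the round that the CubeHash datapath iterates, the following Python sketch follows the public CubeHash specification rather than the wiring of Figures 3.11 and 3.12. The indexing convention (words 0 to 15 as the lower half, words 16 to 31 as the upper half) and the function names are assumptions of this sketch, and no mapping of its four swaps onto SwapA to SwapD is implied.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n positions."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def cubehash_round(x):
    """Apply one CubeHash round to a state of 32 words of 32 bits; returns a new list."""
    if len(x) != 32:
        raise ValueError("state must hold 32 words")
    lo, hi = list(x[:16]), list(x[16:])
    # Each half-round: add, swap + rotate, xor, swap. Swaps are index XORs.
    for rot, swap_lo, swap_hi in ((7, 8, 2), (11, 4, 1)):
        hi = [(hi[i] + lo[i]) & MASK32 for i in range(16)]       # add lo into hi
        lo = [rotl32(lo[i ^ swap_lo], rot) for i in range(16)]   # swap within lo, rotate
        lo = [lo[i] ^ hi[i] for i in range(16)]                  # xor hi into lo
        hi = [hi[i ^ swap_hi] for i in range(16)]                # swap within hi
    return lo + hi


def cubehash_rounds(x, rounds: int = 16):
    """Apply the round function repeatedly, 16 times per message block by default."""
    for _ in range(rounds):
        x = cubehash_round(x)
    return x
```

All four swaps are fixed index permutations, so in hardware they cost only wiring; the adders and XOR gates dominate the area of the round.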

Figure 3.10: CubeHash: Datapath (CubeHash16/32-256: h=256; CubeHash16/32-512: h=512)

Figure 3.11: CubeHash: Round. Note: All operations are performed wordwise, with w = 32.

Figure 3.12: CubeHash: Swaps (SwapA, SwapB, SwapC, SwapD). Note: Each index is 32 bits wide.

3.6 ECHO

3.6.1 Block Diagram Description

ECHO's top-level datapath is shown in Figure 3.13. A message block is first concatenated with the chaining value to produce the state matrix. The state matrix is viewed as an array of 16 words, with each word representing 128 bits. The state then goes through rounds of iteration for ECHO-256. Note that c represents the number of bits hashed so far. This value also includes the bits of the currently processed block. Once the state matrix is thoroughly mixed, a new chaining value is computed from the state matrix by the BIG.Final unit. This operation is described as follows:

v'[0] ← v[0] ⊕ m[0] ⊕ m[4] ⊕ m[8] ⊕ w[0] ⊕ w[4] ⊕ w[8] ⊕ w[12]
v'[1] ← v[1] ⊕ m[1] ⊕ m[5] ⊕ m[9] ⊕ w[1] ⊕ w[5] ⊕ w[9] ⊕ w[13]
v'[2] ← v[2] ⊕ m[2] ⊕ m[6] ⊕ m[10] ⊕ w[2] ⊕ w[6] ⊕ w[10] ⊕ w[14]
v'[3] ← v[3] ⊕ m[3] ⊕ m[7] ⊕ m[11] ⊕ w[3] ⊕ w[7] ⊕ w[11] ⊕ w[15]

Figure 3.13: ECHO: Datapath (ECHO-256: p = 1536, q = 512; ECHO-512: p = 1024, q = 1024)

In Figure 3.14, the operations inside the ECHO round are shown. In our design, each ECHO round is executed in three clock cycles. BIG.SubBytes is performed in the first two clock cycles, and BIG.ShiftRows and BIG.MixColumns in the third cycle. The BIG.SubBytes operation is shown in Figure 3.15. The unit takes in the state matrix and the message length counter, C, and produces the next state. In the first clock cycle of the round operation, the key is chosen to be the length counter plus the numbers between 0 and 15. These added values follow the word number. Hence, the fourteenth word gets the key C + 14. In the next cycle, salt is selected as the key. Since salt is not used in our implementation, zero is selected instead.

Figure 3.14: ECHO: Round (BIG.SubBytes, BIG.ShiftRows, BIG.MixColumns)

Figure 3.15: ECHO: BIG.SubBytes (two AES rounds per word; the first keyed with the counter-derived key, the second with zeros)

Two operations are performed in the third cycle of a round. First, BIG.ShiftRows is performed. This operation is equivalent to the word permutation given in Table 3.7. Next, BIG.MixColumns transforms the permuted state to obtain the final value of a round. In Figure 3.16, a diagram of BIG.MixColumns is shown. BIG.MixColumns separates the state into 4 blocks, each block containing four 128-bit words. A byte of data from each word is selected to go through the AES MixColumn. All data is then combined to produce the final state.

Table 3.7: ECHO: BIG.ShiftRows (columns: Word, Permuted Word)

Figure 3.16: ECHO: BIG.MixColumns. Note: Buses are 8 bits wide unless specified otherwise.

3.6.2 256 vs. 512 Variant Differences

ECHO-512 differs from ECHO-256 in its message block and chaining value sizes. The message block is reduced from 1536 bits to 1024. On the other hand, the chaining value is increased from 512 to 1024 bits. This change increases the security of ECHO and therefore a smaller number of rounds is used. Only 8 rounds are used in ECHO-512. Finally, only BIG.Final is altered. ECHO-512's BIG.Final is described as follows:

v'[0] ← v[0] ⊕ m[0] ⊕ w[0] ⊕ w[8]
v'[1] ← v[1] ⊕ m[1] ⊕ w[1] ⊕ w[9]
v'[2] ← v[2] ⊕ m[2] ⊕ w[2] ⊕ w[10]
v'[3] ← v[3] ⊕ m[3] ⊕ w[3] ⊕ w[11]
v'[4] ← v[4] ⊕ m[4] ⊕ w[4] ⊕ w[12]
v'[5] ← v[5] ⊕ m[5] ⊕ w[5] ⊕ w[13]
v'[6] ← v[6] ⊕ m[6] ⊕ w[6] ⊕ w[14]
v'[7] ← v[7] ⊕ m[7] ⊕ w[7] ⊕ w[15]
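Both versions of BIG.Final fold the message words and the final state words column by column into the chaining value, so a single routine parameterized by the number of chaining words covers them. The Python sketch below models 128-bit words as integers. The name w for the final state words follows the ECHO specification and is an assumption here, as is the function name big_final.

```python
def big_final(v, m, w):
    """Compute the new chaining value from chaining words v, message words m and final state w.

    ECHO-256 uses 4 chaining words and 12 message words; ECHO-512 uses 8 and 8.
    Output word i is the XOR of v[i], every m[j] and every w[j] with j congruent
    to i modulo len(v).
    """
    n = len(v)
    if n not in (4, 8) or len(m) != 16 - n or len(w) != 16:
        raise ValueError("expected 4+12 or 8+8 input words and 16 state words")
    out = []
    for i in range(n):
        acc = v[i]
        for j in range(i, len(m), n):    # m[i], m[i+n], ...
            acc ^= m[j]
        for j in range(i, 16, n):        # w[i], w[i+n], ...
            acc ^= w[j]
        out.append(acc)
    return out
```

With four chaining words the loops visit m[i], m[i+4], m[i+8] and w[i], w[i+4], w[i+8], w[i+12]; with eight they visit m[i] and w[i], w[i+8], matching the two sets of equations.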

3.7 Fugue

3.7.1 Block Diagram Description

Figure 3.17: Fugue: Datapath (Fugue-256: b=960, h=256; Fugue-512: b=1152, h=512)

In Figure 3.17, the datapath of Fugue is shown. For every message, the state register is initialized to IV. The state is viewed as a matrix of 4 x X bytes, where X is the column length dependent on the block size of Fugue. For Fugue-256, the block size is equal to 960 bits. Hence, the matrix has dimensions 4 x 30. The state is mixed with input message blocks through the ROUND unit. Once all message blocks are processed, the state goes through Finalization. For Fugue-256, Finalization is described below:

S = S[1..3] || (S[4] ⊕ S[0]) || (S[15] ⊕ S[0]) || S[16..18]

A round of Fugue is shown in Figure 3.18. The path through the ROUND unit is selected based on the sequence of operations as described in the section on F-256 in [25]. TIX operates in parallel as follows:

S'[0] = din
S'[1] = S[1] ⊕ S[24]
S'[8] = S[8] ⊕ din
S'[10] = S[10] ⊕ S[0]
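The parallel form of TIX can be checked against a step-by-step version in software. In the Python sketch below, the sequential ordering is taken from the Fugue-256 specification as understood here and should be read as an assumption; the state is modelled as a list of 30 integer words, and both function names are hypothetical.

```python
def tix_sequential(S, din):
    """TIX as a sequence of in-place updates (ordering assumed from the Fugue-256 specification)."""
    S = list(S)
    S[10] ^= S[0]
    S[0] = din
    S[8] ^= S[0]      # S[0] already holds din at this point
    S[1] ^= S[24]
    return S


def tix_parallel(S, din):
    """TIX with every output computed from the old state, as a combinational unit would."""
    out = list(S)
    out[0] = din
    out[1] = S[1] ^ S[24]
    out[8] = S[8] ^ din
    out[10] = S[10] ^ S[0]
    return out
```

The two forms agree because the only dependency between steps, the updated S[0] used for S[8], can be replaced by din directly, which removes any ordering constraint in hardware.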

Figure 3.18: Fugue: Round (units: TIX, ROR3, CMIX, ADDFn, RORFn, SMIX)

Table 3.8: Fugue: F-256 ADDFn and RORFn Operation

i    ADDFn                                            RORFn
0    S'[4] = S[4] ⊕ S[0],  S'[15] = S[15] ⊕ S[0]      ROR15
1    S'[4] = S[4] ⊕ S[0],  S'[16] = S[16] ⊕ S[0]      ROR14

ROR3 and CMIX are performed consecutively. All RORn operations are bytewise rotations by n bytes. This is equivalent to >>> (n x 8). As such, ROR3 can be considered as >>> 24. CMIX operates as follows:

S'[0] = S[0] ⊕ S[4]
S'[1] = S[1] ⊕ S[5]
S'[2] = S[2] ⊕ S[6]
S'[15] = S[15] ⊕ S[4]
S'[16] = S[16] ⊕ S[5]
S'[17] = S[17] ⊕ S[6]

The ADDFn and RORFn operations are selected by the i control signal. The selection process is described in Table 3.8. Finally, the SMIX operation is described in Figure 3.19. The SMIX operation first splits an input into an array of 128-bit blocks. Then, each block is further split into 16 bytes. These bytes are transformed using the AES SBOX, and the resulting vector of 16 bytes is used as an input to the Matrix Multiplier. The Matrix Multiplier performs multiplication of a constant matrix by an input vector. The value of the constant matrix is shown in Table 3.9. Empty positions in this table correspond to the value zero. All multiplications are defined as multiplications in GF(2^8).

3.7.2 256 vs. 512 Variant Differences

Fugue-512 increases the state size to 4 x 36, which is equivalent to 1152 bits. Additionally, TIX, CMIX, ADDFn, RORFn and Finalization have been modified. TIX is now performed in parallel as follows:

S'[0] = din
S'[1] = S[1] ⊕ S[24]
S'[4] = S[4] ⊕ S[27]
S'[7] = S[7] ⊕ S[30]
S'[8] = S[8] ⊕ din
S'[22] = S[22] ⊕ S[0]


Comparing Hardware Performance of Fourteen Round Two SHA-3 Candidates Using FPGAs
Ekawat Homsirikamol, Marcin Rogawski, Kris Gaj
George Mason University
{ehomsiri, mrogawsk, kgaj}@gmu.edu
Last revised: December