Low-Area Implementations of SHA-3 Candidates

Jens-Peter Cryptographic Engineering Research Group (CERG) http://cryptography.gmu.edu Department of ECE, Volgenau School of IT&E, George Mason University, Fairfax, VA, USA SHA-3 Project Review Meeting

Outline 2 3

Motivation Motivation Assumptions Interface Minimization Technique There have been several comparison of Throughput/Area optimized implementations [Gaj],[Matsuo],[Baldwin],[Guo]. Only few low-area implementations of single SHA-3 algorithms on FPGAs. Not all fully autonomous, Varying interface assumptions. Low-area implementations highlight flexibility of algorithm designs. Goal First comprehensive comparison of low-area implementations of Round 2 SHA-3 Candidates. All use the same standardized interface. All optimized for the same parameters under the same assumptions.

Assumptions Motivation Assumptions Interface Minimization Technique Implementing for minimum area alone can lead to unrealistic run-times. Goal: Achieve the maximum Throughput/Area ratio for a given area budget. Realistic scenario: System on Chip: Certain area only available. Standalone: Smaller Chip, lower cost, but limit to smallest chip available, e.g. 768 slices on smallest Spartan 3 FPGA.

Assumptions Motivation Assumptions Interface Minimization Technique Target Implementing for minimum area alone can lead to unrealistic run-times. Goal: Achieve the maximum Throughput/Area ratio for a given area budget. Realistic scenario: System on Chip: Certain area only available. Standalone: Smaller Chip, lower cost, but limit to smallest chip available, e.g. 768 slices on smallest Spartan 3 FPGA. Xilinx Spartan 3e, low cost FPGA family Budget: 5 slices, Block RAM (BRAM)

Interface Motivation Assumptions Interface Minimization Technique Based on Interface and I/O Protocol from [Gaj], w=6. msg len ap, seq len ap (after padding ) in 32-bit words. msg len bp, seq len bp (before padding) in bits. n 2 msg len bp = seq len ap i 32 + seq len bp n i= n msg len ap = seq len ap i 32 w i= clk clk rst rst SHA Core w din dout src_ready dst_ready src_read dst_write w bits msg_len_ap msg_len_bp message w bits seq_len_ap seq seq_len_ap seq seq_len_ap n seq_len_bp n seq n a)sha Interface b)sha Protocol

Minimization Technique Motivation Assumptions Interface Minimization Technique Datapath Use BRAM to store state, initialization vectors, constants Use BRAM in each clock cycle Avoid temporary storage or use: Free registers, i.e. unused flip-flops after LUTs Shift Registers (x6 bit / Distributed RAM (x6 bit) LUT = 2 Slice Control Logic Small main state machine, up-to 8 states Counter for clock cycles in longest state Stored Program Control within states BRAM addressing must follow regular sequence, can have offset between rounds

Block RAM Motivation Assumptions Interface Minimization Technique BRAM Mode=write_first BRAM Mode=read_first data_in data_out data_in data_out Limits We use exclusively read first mode, i.e. old value is read, new value is written. Saves clock cycles, however, leads to address offset. Control logic might become difficult. Maximum 2 input and 2 output ports, 2 addresses (dual port). Maximum single port w/ 64 bits or dual port w/ 32 bits each.

BLAKE 2 3 4 A B 2 3 4 5 6 7 8 9 2 G G G G2 G G G G G2 G G2 G2 G G2 G G2 G2 G G2 G2 G G2 G G2 G G2 G G G G G G2 G2 G2 G2 Smallest implementation would be 2 G-function BRAM contention. Best result: 2 G-functions, pipelined as shown above. Keeps data in-flight longer, eliminates BRAM contention However, generation of addresses difficult Large control logic.

CubeHash Port A BRAM DRam B 3 Port B 3 6 5 <<< 7 DRam A din Reg <<< 3 6 5 DRam C dout Reg Initialization vector pre-processed stored in BRAM. BRAM contention requires swapping data between BRAM and Distributed RAM (DRAM). This requires additional clock cycles. Leads to simpler control logic. Finalization very costly at 9,296 clock cycles.

ECHO 3 3 3 Port A 3 3 6 BRAM 3 Port B 3 5 3 reg A S Box S Box S Box S Box din 3 65 reg B MixCol key key dout 6 5 Mix-Columns contains 2 Mix-Column units with total 8x8 bit register. 4 logic based S-Boxes with integrated pipeline stage. Faster Mix-Columns would exceed 5 slices. Key Generation uses DRAM, 32 bit adder. Allows store salt. Small control unit with room for improvement. 3 init 2 3 3 2

Fugue Port A BRAM Port B 7 32 5 8 23 32 6 3 24 S Box S Box S Box S Box Reg 32 >>>8 >>>6 >>>24 SMIX SMIX SMIX SMIX 7 5 8 23 6 3 24 2 3 3 6 5 32 dout din 32 Reg 6 5 3 4 logic based S-Boxes followed by pipeline stage. Only fixed rotations. Most time consuming function: SMIX. Challenge to store the S-Mix table.

Grøstl 6 5 dout din 3 dram Reg counter Port A BRAM Port B 7 S Box 7 S Box 3 Reg Reg Reg Reg mix mix Serialized P and Q. Only 2 S-Boxes due to Shift Rows can only use 2x8 bits. Uses single set of registers for two Mix Column units. 6 / 32 bit datapath 7 clock cycles per column.

JH 3 6 5 input 5 5 reg 5 dout 5 3 7 3 3 2 3 7 DRAM DRAM 7 7 Port A BRAM Port B 3 7 7 3 3 7 rc_d 3 3 3 grp_dgrp r_d 3 3 Uses BRAM and two 32x8 DRAMs. Grouping and de-grouping are the most expensive operations. 6 clock cycle operation. multiple narrow memory accesses.

Luffa 3 dout 5 6 Port A BRAM Port B MixCol reg reg reg 2 reg 3 din Scrumb reg DROM 3 6 5 2 2 3 Reg Message injection uses serialized XORs. SubCrumb is implemented as DROM. Constraints: DRAM had to be used to keep control logic simple.

Shabal din Reg 5 3 6 Port B 32 BRAM Port A 32 32 32 32 2 3 6 5 SR2CP SR6 M <<<7 <<<7 dout W 32 32 32 2 SR2 A [] UU new A BXOR [3] [9] [6] SR6 B new B Based on paper by [Detrey]. BRAM contains state register C and initialization vectors. Our I/O is more complex than [Detrey]. BRAM makes controller more complex.

SHAvite-3 dout Port A BRAM Port B 3 5 6 reg reg reg 2 reg 3 S Box S Box S Box S Box dob reg 3 din 6 2 Reg S Box 5 4 ROM based S-Boxes. Regular path of the mds matrix is an advantage. Mix column is realized through a shift register.

SHA-2 Full Datapath 7 slices 65 clock cycles

SHA-2 Full Datapath 7 slices 65 clock cycles Small Datapath 52 slices 595 clock cycles

SHA-2 Full Datapath 7 slices 65 clock cycles Small Datapath 52 slices 595 clock cycles SHA-2 is not well suited for very small implementations.

Performance Equations Performance Equations Implementation Implementation Comparison Block Size (bits) b Clock Cycles to hash N blocks clk = Throughput b (l + p) T Algorithm st +( l + p) N + end BLAKE 52 8 + ( 32+ 48) N + 65 52/( 52 T ) CubeHash 256 2 + ( 6+ 928) N + 932 256/( 928 T ) ECHO 536 6 + ( 96+2449) N + 7 536/(2545 T ) Fugue 32 33 + ( 2+ 6) N + 99 32/( 63 T ) Grøstl 52 2 + ( 32+2) N + 577 52/(52 T ) JH 52 35 + ( 32+574) N + 7 52/(66 T ) Luffa 256 2 + ( 6+ 66) N + 647 256/( 622 T ) Shabal 52 36 + ( 32+ 48) N + 28 52/( 8 T ) SHAvite-3 52 8 + ( 32+ 648) N + 7 52/( 68 T ) SHA-256 52 8 + ( 32+ 563) N + 7 256/( 595 T )

Implementation Performance Equations Implementation Implementation Comparison Area (slices) Block RAMs Maximum Delay (ns) T Throughput (Mbps) Throughput/ Area (Mbps/slice) Throughput (Mbps) Throughput/ Area (Mbps/slice) Algorithm Large m Small m BLAKE 538 9.73 2.7.9 88.4.64 CubeHash 292 9.84 28..9 2.5.8 ECHO 499.87 55.5. 54.8. Fugue 48.52 44..9 2.5.5 Grøstl 474 8.96 49.6. 33.7 JH 37.98 29..8 28.6.77 Luffa 453 9.42 43.6. 2.4.47 Shabal 498 8.84 723.9.45 78.8.359 SHAvite-3 496 6.99 7.7.22 2.4.26 SHA-256 52. 42.9.8 4.6.7

for Large Messages Performance Equations Implementation Implementation Comparison 8 Shabal 6 Throughtput [Mbits/s] 4 2 SHAvite 3 BLAKE Groestl ECHO Luffa CubeHash JH Fugue SHA 2 25 3 35 4 45 5 55 Area [Slices]

for Short Messages Performance Equations Implementation Implementation Comparison 6 x 5 5 4 BLAKE CubeHash ECHO Fugue Groestl JH Luffa Shabal SHAvite 3 CubeHash JH Latency (ns) 3 Fugue Luffa ECHO 2 BLAKE SHAvite 3 2 4 6 8 2 4 6 Message Size (bytes) Groestl Shabal

Performance Equations Implementation Implementation Comparison Control Logic vs. Datapath 6 5 Datapath Controller 4 Area (Slices) 3 2 CubeHash JH Luffa Groestl Fugue SHAvite 3 Shabal ECHO SHA 2 BLAKE

Implementation Comparison Performance Equations Implementation Implementation Comparison Reference Area (slices) Block RAMs Maximum Delay (ns) I/O Width Datapath Width Clock Cycles (l + p) Functionality Throughput (Mbps) Throughput/ Area (Mbps/slice) Algorithm Device BLAKE-32 [?] 24 2 5.2 32 32 86 xc3s5 FA 5..93 BLAKE-32 TW 538 9.73 6 32 52 xc3se FA 2.7.9 BMW [?] 895 26. 32 32 6 xcv3 FA 9.. CubeHash [TW] 292 9.84 6 32 944 xc3se FA 28.. ECHO [TW] 499.87 6 32 2545 xc3se FA 55.5. ECHO [?] 27 2.8 8 8 6593 xc5vlx5-2 FA 72..57 Fugue [TW] 48.52 6 32 63 xc3se FA 44..9 Grøstl [?] 276 6.67 64 64 xc3s5 FA 92..5 Grøstl [TW] 474 8.96 6 32 52 xc3se FA 49.6. JH [TW] 37.98 6 32 64 xc3se FA 29..8 Keccak [?] 444 3.77 64 387 xc5vlx5 FA 7..6 Luffa [TW] 453 9.42 6 32 622 xc3se FA 43.6. Shabal [?] 499.25 32 64 xc3s2 FA 8.6 Shabal [TW] 498 8.84 6 32 8 xc3se FA 723.9.45 Shavite-3 [TW] 496 6.99 6 32 68 xc3se FA 7.7.22

Performance Equations Implementation Implementation Comparison Comparison of Candidate 2 GMU Others.5 V 2 Throughput/Area V 5.5 V 5 V 3 BLAKE 32 BMW Cubehash ECHO Fugue Groestl JH Keccak Luffa Shabal Shavite 3 SHA 2

Best Candidate Performance Equations Implementation Implementation Comparison 2 Virtex Spartan.5 Throughput/Area.5 BMW ECHO Keccak SHA 2 Blake CubeHash Fugue Groestl JH Luffa ShabalSHAvite 3

Performance Equations Implementation Implementation Comparison Thanks for your attention.