Low-Area Implementations of SHA-3 Candidates

Similar documents
Lightweight Implementations of SHA-3 Candidates on FPGAs

Implementation & Benchmarking of Padding Units & HMAC for SHA-3 candidates in FPGAs & ASICs

Fair and Comprehensive Methodology for Comparing Hardware Performance of Fourteen Round Two SHA-3 Candidates using FPGAs

Use of Embedded FPGA Resources in Implementations of Five Round Three SHA-3 Candidates

Use of Embedded FPGA Resources in Implementa:ons of 14 Round 2 SHA- 3 Candidates

ECE 545 Lecture 8b. Hardware Architectures of Secret-Key Block Ciphers and Hash Functions. George Mason University

Vivado HLS Implementation of Round-2 SHA-3 Candidates

Sharing Resources Between AES and the SHA-3 Second Round Candidates Fugue and Grøstl

GMU SHA Core Interface & Hash Function Performance Metrics

A High-Speed Unified Hardware Architecture for AES and the SHA-3 Candidate Grøstl

Lightweight Implementations of SHA-3 Candidates on FPGAs

ECE 545. Digital System Design with VHDL

GMU SHA Core Interface & Hash Function Performance Metrics Interface

Environment for Fair and Comprehensive Performance Evalua7on of Cryptographic Hardware and So=ware. ASIC Status Update

Two Hardware Designs of BLAKE-256 Based on Final Round Tweak

Groestl Tweaks and their Effect on FPGA Results

!"#$%&'()*+%&,-%&.*/.&0"&#%(1.*"0* 2+345*!%(,',%6.7*87'()*9/:37* :."&).*A%7"(*8('B.&7'6=* 8C2C3C*

Evaluation of Hardware Performance for the SHA-3 Candidates Using SASEBO-GII

SHA3 Core Specification. Author: Homer Hsing

Version 2.0, November 11, Stefan Tillich, Martin Feldhofer, Mario Kirschbaum, Thomas Plos, Jörn-Marc Schmidt, and Alexander Szekely

Federal standards NIST FIPS 46-1 DES FIPS 46-2 DES. FIPS 81 Modes of. operation. FIPS 46-3 Triple DES FIPS 197 AES. industry.

ECE 646 Lecture 12. Cryptographic Standards. Secret-key cryptography standards

Efficient Hardware Implementations of High Throughput SHA-3 Candidates Keccak, Luffa and Blue Midnight Wish for Single- and Multi-Message Hashing

SHA-3 interoperability

Compact Hardware Implementations of ChaCha, BLAKE, Threefish, and Skein on FPGA

ECE 545 Fall 2010 Exam 1

Compact implementations of Grøstl, JH and Skein for FPGAs

AES as A Stream Cipher

On Optimized FPGA Implementations of the SHA-3 Candidate Grøstl

ECE 545 Fall 2013 Final Exam

Comparing Hardware Performance of Fourteen Round Two SHA-3 Candidates Using FPGAs

CDA 4253 FPGA System Design Op7miza7on Techniques. Hao Zheng Comp S ci & Eng Univ of South Florida

ECE 645: Lecture 1. Basic Adders and Counters. Implementation of Adders in FPGAs

Implementation and Comparative Analysis of AES as a Stream Cipher

Hardware Benchmarking of Cryptographic Algorithms Using High-Level Synthesis Tools: The SHA-3 Contest Case Study

FPGA Matrix Multiplier

C vs. VHDL: Benchmarking CAESAR Candidates Using High- Level Synthesis and Register- Transfer Level Methodologies

A Zynq-based Testbed for the Experimental Benchmarking of Algorithms Competing in Cryptographic Contests

Hardware Implementation of the Code-based Key Encapsulation Mechanism using Dyadic GS Codes (DAGS)

Compact Implementation of Threefish and Skein on FPGA

Comparison of the Hardware Performance of the AES Candidates Using Reconfigurable Hardware

FPGA Implementations of SHA-3 Candidates: CubeHash, Grøstl, LANE, Shabal and Spectral Hash

Implementation and Analysis of the PRIMATEs Family of Authenticated Ciphers

GMU Hardware API for Authen4cated Ciphers

On the parallelization of slice-based Keccak implementations on Xilinx FPGAs

FPGA Implementation of High Speed AES Algorithm for Improving The System Computing Speed

Low-cost hardware implementations of Salsa20 stream cipher in programmable devices

Use of Embedded FPGA Resources in Implementations of Five Round Three SHA-3 Candidates

INTRODUCTION TO FPGA ARCHITECTURE

A Low-Area yet Performant FPGA Implementation of Shabal

High-Speed Hardware for NTRUEncrypt-SVES: Lessons Learned Malik Umar Sharif, and Kris Gaj George Mason University USA

Pushing the Limits of SHA-3 Hardware Implementations to Fit on RFID

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices

Compact Dual Block AES core on FPGA for CCM Protocol

A Lightweight Implementation of Keccak Hash Function for Radio-Frequency Identification Applications

problem maximum score 1 8pts 2 6pts 3 10pts 4 15pts 5 12pts 6 10pts 7 24pts 8 16pts 9 19pts Total 120pts

Hardware Performance Evaluation of SHA-3 Candidate Algorithms

Introduction to Field Programmable Gate Arrays

EE 459/500 HDL Based Digital Design with Programmable Logic. Lecture 15 Memories

EECS150 - Digital Design Lecture 16 - Memory

Fast implementation and fair comparison of the final candidates for Advanced Encryption Standard using Field Programmable Gate Arrays

Hardware Architectures

Jaap van Ginkel Security of Systems and Networks

Benchmarking of Cryptographic Algorithms in Hardware. Ekawat Homsirikamol & Kris Gaj George Mason University USA

Fast implementations of secret-key block ciphers using mixed inner- and outer-round pipelining

Field Programmable Gate Array (FPGA)

Topics. Midterm Finish Chapter 7

Digital Logic & Computer Design CS Professor Dan Moldovan Spring 2010

ALTERA M9K EMBEDDED MEMORY BLOCKS

EECS 151/251A Spring 2019 Digital Design and Integrated Circuits. Instructor: John Wawrzynek. Lecture 18 EE141

ECE 699: Lecture 9. Programmable Logic Memories

Outline of Presentation

Hardware Implementation of Cryptosystem by AES Algorithm Using FPGA

Topic #6. Processor Design

AES Core Specification. Author: Homer Hsing

HDL Coding Style Xilinx, Inc. All Rights Reserved

Programmable Logic. Simple Programmable Logic Devices

HIGH DATA RATE 8-BIT CRYPTO PROCESSOR

Implementation of Pipelined Architecture Based on the DCT and Quantization For JPEG Image Compression

CHAPTER 4 BLOOM FILTER

Compact FPGA Implementations of the Five SHA-3 Finalists

CAESAR Hardware API. Ekawat Homsirikamol, William Diehl, Ahmed Ferozpuri, Farnoud Farahmand, Panasayya Yalla, Jens-Peter Kaps, and Kris Gaj

Core Facts. Documentation Design File Formats. Verification Instantiation Templates Reference Designs & Application Notes Additional Items

Design and Implementation of Multi-Rate Encryption Unit Based on Customized AES

Today. Comments about assignment Max 1/T (skew = 0) Max clock skew? Comments about assignment 3 ASICs and Programmable logic Others courses

ECE 448 Lecture 13. FPGA Memories. George Mason University

An 80Gbps FPGA Implementation of a Universal Hash Function based Message Authentication Code

Fundamentals of Linux Platform Security

PERFORMANCE ANALYSIS OF HIGH EFFICIENCY LOW DENSITY PARITY-CHECK CODE DECODER FOR LOW POWER APPLICATIONS

Pipelining & Verilog. Sequential Divider. Verilog divider.v. Math Functions in Coregen. Lab #3 due tonight, LPSet 8 Thurs 10/11

Hardware Design with VHDL Design Example: BRAM ECE 443

The Xilinx XC6200 chip, the software tools and the board development tools

ECE 485/585 Microprocessor System Design

Low area implementation of AES ECB on FPGA

Binary Adders. Ripple-Carry Adder

EECS150 - Digital Design Lecture 16 Memory 1

Efficient Hardware Realization of Advanced Encryption Standard Algorithm using Virtex-5 FPGA

Digital Signal Processing for Analog Input

NIST SHA-3 ASIC Datasheet

ECE 448 Lecture 5. FPGA Devices

Transcription:

Jens-Peter Cryptographic Engineering Research Group (CERG) http://cryptography.gmu.edu Department of ECE, Volgenau School of IT&E, George Mason University, Fairfax, VA, USA SHA-3 Project Review Meeting

Outline 2 3

Motivation Motivation Assumptions Interface Minimization Technique There have been several comparison of Throughput/Area optimized implementations [Gaj],[Matsuo],[Baldwin],[Guo]. Only few low-area implementations of single SHA-3 algorithms on FPGAs. Not all fully autonomous, Varying interface assumptions. Low-area implementations highlight flexibility of algorithm designs.

Motivation Motivation Assumptions Interface Minimization Technique There have been several comparison of Throughput/Area optimized implementations [Gaj],[Matsuo],[Baldwin],[Guo]. Only few low-area implementations of single SHA-3 algorithms on FPGAs. Not all fully autonomous, Varying interface assumptions. Low-area implementations highlight flexibility of algorithm designs. Goal First comprehensive comparison of low-area implementations of Round 2 SHA-3 Candidates. All use the same standardized interface. All optimized for the same parameters under the same assumptions.

Assumptions Motivation Assumptions Interface Minimization Technique Implementing for minimum area alone can lead to unrealistic run-times. Goal: Achieve the maximum Throughput/Area ratio for a given area budget. Realistic scenario: System on Chip: Certain area only available. Standalone: Smaller Chip, lower cost, but limit to smallest chip available, e.g. 768 slices on smallest Spartan 3 FPGA.

Assumptions Motivation Assumptions Interface Minimization Technique Target Implementing for minimum area alone can lead to unrealistic run-times. Goal: Achieve the maximum Throughput/Area ratio for a given area budget. Realistic scenario: System on Chip: Certain area only available. Standalone: Smaller Chip, lower cost, but limit to smallest chip available, e.g. 768 slices on smallest Spartan 3 FPGA. Xilinx Spartan 3e, low cost FPGA family Budget: 5 slices, Block RAM (BRAM)

Interface Motivation Assumptions Interface Minimization Technique Based on Interface and I/O Protocol from [Gaj], w=6. msg len ap, seq len ap (after padding ) in 32-bit words. msg len bp, seq len bp (before padding) in bits. n 2 msg len bp = seq len ap i 32 + seq len bp n i= n msg len ap = seq len ap i 32 w i= clk clk rst rst SHA Core w din dout src_ready dst_ready src_read dst_write w bits msg_len_ap msg_len_bp message w bits seq_len_ap seq seq_len_ap seq seq_len_ap n seq_len_bp n seq n a)sha Interface b)sha Protocol

Minimization Technique Motivation Assumptions Interface Minimization Technique Datapath Use BRAM to store state, initialization vectors, constants Use BRAM in each clock cycle Avoid temporary storage or use: Free registers, i.e. unused flip-flops after LUTs Shift Registers (x6 bit / Distributed RAM (x6 bit) LUT = 2 Slice Control Logic Small main state machine, up-to 8 states Counter for clock cycles in longest state Stored Program Control within states BRAM addressing must follow regular sequence, can have offset between rounds

Block RAM Motivation Assumptions Interface Minimization Technique BRAM Mode=write_first BRAM Mode=read_first data_in data_out data_in data_out Limits We use exclusively read first mode, i.e. old value is read, new value is written. Saves clock cycles, however, leads to address offset. Control logic might become difficult. Maximum 2 input and 2 output ports, 2 addresses (dual port). Maximum single port w/ 64 bits or dual port w/ 32 bits each.

BLAKE 2 3 4 A B 2 3 4 5 6 7 8 9 2 G G G G2 G G G G G2 G G2 G2 G G2 G G2 G2 G G2 G2 G G2 G G2 G G2 G G G G G G2 G2 G2 G2 Smallest implementation would be 2 G-function BRAM contention. Best result: 2 G-functions, pipelined as shown above. Keeps data in-flight longer, eliminates BRAM contention However, generation of addresses difficult Large control logic.

CubeHash Port A BRAM DRam B 3 Port B 3 6 5 <<< 7 DRam A din Reg <<< 3 6 5 DRam C dout Reg Initialization vector pre-processed stored in BRAM. BRAM contention requires swapping data between BRAM and Distributed RAM (DRAM). This requires additional clock cycles. Leads to simpler control logic. Finalization very costly at 9,296 clock cycles.

ECHO 3 3 3 Port A 3 3 6 BRAM 3 Port B 3 5 3 reg A S Box S Box S Box S Box din 3 65 reg B MixCol key key dout 6 5 Mix-Columns contains 2 Mix-Column units with total 8x8 bit register. 4 logic based S-Boxes with integrated pipeline stage. Faster Mix-Columns would exceed 5 slices. Key Generation uses DRAM, 32 bit adder. Allows store salt. Small control unit with room for improvement. 3 init 2 3 3 2

Fugue Port A BRAM Port B 7 32 5 8 23 32 6 3 24 S Box S Box S Box S Box Reg 32 >>>8 >>>6 >>>24 SMIX SMIX SMIX SMIX 7 5 8 23 6 3 24 2 3 3 6 5 32 dout din 32 Reg 6 5 3 4 logic based S-Boxes followed by pipeline stage. Only fixed rotations. Most time consuming function: SMIX. Challenge to store the S-Mix table.

Grøstl 6 5 dout din 3 dram Reg counter Port A BRAM Port B 7 S Box 7 S Box 3 Reg Reg Reg Reg mix mix Serialized P and Q. Only 2 S-Boxes due to Shift Rows can only use 2x8 bits. Uses single set of registers for two Mix Column units. 6 / 32 bit datapath 7 clock cycles per column.

JH 3 6 5 input 5 5 reg 5 dout 5 3 7 3 3 2 3 7 DRAM DRAM 7 7 Port A BRAM Port B 3 7 7 3 3 7 rc_d 3 3 3 grp_dgrp r_d 3 3 Uses BRAM and two 32x8 DRAMs. Grouping and de-grouping are the most expensive operations. 6 clock cycle operation. multiple narrow memory accesses.

Luffa 3 dout 5 6 Port A BRAM Port B MixCol reg reg reg 2 reg 3 din Scrumb reg DROM 3 6 5 2 2 3 Reg Message injection uses serialized XORs. SubCrumb is implemented as DROM. Constraints: DRAM had to be used to keep control logic simple.

Shabal din Reg 5 3 6 Port B 32 BRAM Port A 32 32 32 32 2 3 6 5 SR2CP SR6 M <<<7 <<<7 dout W 32 32 32 2 SR2 A [] UU new A BXOR [3] [9] [6] SR6 B new B Based on paper by [Detrey]. BRAM contains state register C and initialization vectors. Our I/O is more complex than [Detrey]. BRAM makes controller more complex.

SHAvite-3 dout Port A BRAM Port B 3 5 6 reg reg reg 2 reg 3 S Box S Box S Box S Box dob reg 3 din 6 2 Reg S Box 5 4 ROM based S-Boxes. Regular path of the mds matrix is an advantage. Mix column is realized through a shift register.

SHA-2 Full Datapath 7 slices 65 clock cycles

SHA-2 Full Datapath 7 slices 65 clock cycles Small Datapath 52 slices 595 clock cycles

SHA-2 Full Datapath 7 slices 65 clock cycles Small Datapath 52 slices 595 clock cycles SHA-2 is not well suited for very small implementations.

Performance Equations Performance Equations Implementation Implementation Comparison Block Size (bits) b Clock Cycles to hash N blocks clk = Throughput b (l + p) T Algorithm st +( l + p) N + end BLAKE 52 8 + ( 32+ 48) N + 65 52/( 52 T ) CubeHash 256 2 + ( 6+ 928) N + 932 256/( 928 T ) ECHO 536 6 + ( 96+2449) N + 7 536/(2545 T ) Fugue 32 33 + ( 2+ 6) N + 99 32/( 63 T ) Grøstl 52 2 + ( 32+2) N + 577 52/(52 T ) JH 52 35 + ( 32+574) N + 7 52/(66 T ) Luffa 256 2 + ( 6+ 66) N + 647 256/( 622 T ) Shabal 52 36 + ( 32+ 48) N + 28 52/( 8 T ) SHAvite-3 52 8 + ( 32+ 648) N + 7 52/( 68 T ) SHA-256 52 8 + ( 32+ 563) N + 7 256/( 595 T )

Implementation Performance Equations Implementation Implementation Comparison Area (slices) Block RAMs Maximum Delay (ns) T Throughput (Mbps) Throughput/ Area (Mbps/slice) Throughput (Mbps) Throughput/ Area (Mbps/slice) Algorithm Large m Small m BLAKE 538 9.73 2.7.9 88.4.64 CubeHash 292 9.84 28..9 2.5.8 ECHO 499.87 55.5. 54.8. Fugue 48.52 44..9 2.5.5 Grøstl 474 8.96 49.6. 33.7 JH 37.98 29..8 28.6.77 Luffa 453 9.42 43.6. 2.4.47 Shabal 498 8.84 723.9.45 78.8.359 SHAvite-3 496 6.99 7.7.22 2.4.26 SHA-256 52. 42.9.8 4.6.7

for Large Messages Performance Equations Implementation Implementation Comparison 8 Shabal 6 Throughtput [Mbits/s] 4 2 SHAvite 3 BLAKE Groestl ECHO Luffa CubeHash JH Fugue SHA 2 25 3 35 4 45 5 55 Area [Slices]

for Short Messages Performance Equations Implementation Implementation Comparison 6 x 5 5 4 BLAKE CubeHash ECHO Fugue Groestl JH Luffa Shabal SHAvite 3 CubeHash JH Latency (ns) 3 Fugue Luffa ECHO 2 BLAKE SHAvite 3 2 4 6 8 2 4 6 Message Size (bytes) Groestl Shabal

Performance Equations Implementation Implementation Comparison Control Logic vs. Datapath 6 5 Datapath Controller 4 Area (Slices) 3 2 CubeHash JH Luffa Groestl Fugue SHAvite 3 Shabal ECHO SHA 2 BLAKE

Implementation Comparison Performance Equations Implementation Implementation Comparison Reference Area (slices) Block RAMs Maximum Delay (ns) I/O Width Datapath Width Clock Cycles (l + p) Functionality Throughput (Mbps) Throughput/ Area (Mbps/slice) Algorithm Device BLAKE-32 [?] 24 2 5.2 32 32 86 xc3s5 FA 5..93 BLAKE-32 TW 538 9.73 6 32 52 xc3se FA 2.7.9 BMW [?] 895 26. 32 32 6 xcv3 FA 9.. CubeHash [TW] 292 9.84 6 32 944 xc3se FA 28.. ECHO [TW] 499.87 6 32 2545 xc3se FA 55.5. ECHO [?] 27 2.8 8 8 6593 xc5vlx5-2 FA 72..57 Fugue [TW] 48.52 6 32 63 xc3se FA 44..9 Grøstl [?] 276 6.67 64 64 xc3s5 FA 92..5 Grøstl [TW] 474 8.96 6 32 52 xc3se FA 49.6. JH [TW] 37.98 6 32 64 xc3se FA 29..8 Keccak [?] 444 3.77 64 387 xc5vlx5 FA 7..6 Luffa [TW] 453 9.42 6 32 622 xc3se FA 43.6. Shabal [?] 499.25 32 64 xc3s2 FA 8.6 Shabal [TW] 498 8.84 6 32 8 xc3se FA 723.9.45 Shavite-3 [TW] 496 6.99 6 32 68 xc3se FA 7.7.22

Performance Equations Implementation Implementation Comparison Comparison of Candidate 2 GMU Others.5 V 2 Throughput/Area V 5.5 V 5 V 3 BLAKE 32 BMW Cubehash ECHO Fugue Groestl JH Keccak Luffa Shabal Shavite 3 SHA 2

Best Candidate Performance Equations Implementation Implementation Comparison 2 Virtex Spartan.5 Throughput/Area.5 BMW ECHO Keccak SHA 2 Blake CubeHash Fugue Groestl JH Luffa ShabalSHAvite 3

Performance Equations Implementation Implementation Comparison Thanks for your attention.