CryptoManiac: Application Specific Architectures for Cryptography


Lisa Wu, Chris Weaver, Todd Austin
{wul,chriswea,taustin}@eecs.umich.edu

Overview
- Goal: fast programmable cryptographic processing
  - Fast: efficient execution of computationally intensive cryptographic workloads
  - Programmable: support for algorithms within existing protocols, support for new algorithms
- Motivation: cipher kernel analyses and characterizations
- Solution: hardware/software co-design
  - Software: crypto-specific ISA
  - Hardware: efficient co-processor implementation
- Results: more than 2 times faster than a high-end general purpose processor, with orders of magnitude less area and power

Cryptography
- Definitions: encryption vs. decryption; public-key cipher vs. private-key cipher
- Public- and secret-key ciphers are used together in most protocol standards

Public-Key Cipher:  plaintext -> f(x) [Public Key] -> ciphertext -> g(x) [Private Key] -> plaintext
Private-Key Cipher: plaintext -> g(x) [Private Key] -> ciphertext -> g(x) [Private Key] -> plaintext

SSL Session Breakdown
- Focus: private-key ciphers
- [Diagram: client/server message sequence (authenticate, private key, https get, https recv, ..., close), with phases labeled public and private.]

[Figure: SSL Characterization by Session Length. x-axis: SSL session length (bytes), 1k to 32k; y-axis: relative contribution to run time, 0% to 100%; series: Public, Other, Private; annotation: size of E*Trade login screen (21k).]
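To make the private-key case concrete, here is a minimal sketch of RC4 (one of the ciphers in the benchmark suite), written from the publicly known algorithm rather than taken from the slides. The function names `rc4_keystream` and `rc4` are illustrative. The point it shows is that the same key and the same routine are used in both directions.

```python
def rc4_keystream(key: bytes):
    """Yield RC4 keystream bytes for the given key (standard KSA + PRGA)."""
    s = list(range(256))
    j = 0
    # Key-scheduling: permute the state table under control of the key.
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    i = j = 0
    # Pseudo-random generation: one keystream byte per iteration.
    while True:
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        yield s[(s[i] + s[j]) & 0xFF]


def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR the data with the keystream."""
    ks = rc4_keystream(key)
    return bytes(b ^ next(ks) for b in data)


secret = b"shared-session-key"  # hypothetical key known to both sides
ciphertext = rc4(secret, b"https get /login")
plaintext = rc4(secret, ciphertext)  # same key, same routine
```

Because encryption and decryption are the same keystream XOR, a single kernel serves both sides of a session, which is part of why the per-byte kernel cost dominates long sessions.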

Cipher Bottleneck Analysis
[Figure: Cipher Bottleneck Analysis. Ciphers: 3DES, Mars, RC4, Rijndael, Twofish; y-axis: 0 to 1; series: Alias, Branch, Issue, Mem, Res, Window, All.]
- Likely bottlenecks: issue width, resources
- Possible bottlenecks: memory aliases, window size
- Not a bottleneck: branch mispredictions, memory latency

Cipher Kernel Characterization
[Figure: Characterization of Cipher Kernel Operations. y-axis: 0% to 100%; operation classes: Branch, Mov, Ld/St, Xbox, Sbox, Mult, Rotates, Logical, Arith.]
- SBOX: substitutions; XBOX: permutations
- IDEA, Mars, RC4, and RC6 rely on arithmetic computations; they benefit from more resources (multiplies) and faster operations (rotates)
- Blowfish, 3DES, Rijndael, and Twofish rely on substitutions; they benefit from increased memory bandwidth

System Architecture
- Request format: id, session, action, data
- Result format: id, session, result
- [Diagram: requests -> In Q -> Req Scheduler -> CM Proc ... CM Proc -> Out Q -> results; the CM Procs share a Keystore.]

Processing Element
- Pipeline stages: IF, ID/RF, EX/MEM, WB
- [Diagram: BTB, IMEM, RF, four FUs, InQ/OutQ interface, Data Mem, Keystore interface.]
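The request and result formats above can be sketched as plain records flowing through an input queue, a scheduler, and an output queue. This is only an illustration under stated assumptions: the round-robin dispatch policy, the `toy_process` XOR "cipher", and all names are hypothetical, since the slides do not specify the scheduling policy or the keystore layout.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    id: int
    session: int
    action: str   # assumed values such as "encrypt" or "decrypt"
    data: bytes


@dataclass
class Result:
    id: int
    session: int
    result: bytes


def toy_process(req: Request, keystore: dict) -> Result:
    """Stand-in for a processing element: XOR with a per-session key byte."""
    key_byte = keystore[req.session]
    return Result(req.id, req.session, bytes(b ^ key_byte for b in req.data))


def run_scheduler(in_q: deque, keystore: dict, num_pes: int = 4):
    """Drain the input queue, handing requests to PEs in round-robin order."""
    out_q = deque()
    assignments = []  # (processing element index, request id)
    next_pe = 0
    while in_q:
        req = in_q.popleft()
        assignments.append((next_pe, req.id))
        out_q.append(toy_process(req, keystore))
        next_pe = (next_pe + 1) % num_pes
    return out_q, assignments


keystore = {7: 0x5A}  # hypothetical session id -> key material
in_q = deque(Request(i, 7, "encrypt", b"abc") for i in range(5))
out_q, assignments = run_scheduler(in_q, keystore)
```

Keeping the session id in both records is what lets independent sessions be spread across processing elements while each element fetches the right key from the shared keystore.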

Combining Functional Unit
- Logical unit (XOR, AND) {tiny}
- {long}: pipelined 32-bit MUL
- {short}: 1K byte SBOX cache, 32-bit adder, 32-bit rotator
- Logical unit (XOR, AND) {tiny}

ISA
  bundle         := <inst><inst><inst><inst>
  inst           := <operation pair><dest><operand 1><operand 2><operand 3>
  operation pair := <short><tiny> | <tiny><short> | <tiny><tiny> | <long><nop>
  tiny           := <xor> | <and> | <signext> | <nop>
  short          := <add> | <addinc> | <sub> | <rot> | <sbox> | <nop>
  long           := <mul> | <mulmod>

Examples:
  Instruction               Expression
  Add-Xor R4, R1, R2, R3    R4 <- (R1 + R2) XOR R3
  And-Rot R4, R1, R2, R3    R4 <- (R1 AND R2) <<< R3
  And-Xor R4, R1, R2, R3    R4 <- (R1 AND R2) XOR R3
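The three example instructions can be modeled in a few lines. This is a sketch of the register-level semantics only, assuming 32-bit wraparound arithmetic and a left rotate for `<<<`; the function names are illustrative and nothing here models timing or encoding.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits (n taken modulo 32)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def add_xor(r1: int, r2: int, r3: int) -> int:
    """Add-Xor: <short=add><tiny=xor>, i.e. (R1 + R2) XOR R3."""
    return ((r1 + r2) & MASK32) ^ r3


def and_rot(r1: int, r2: int, r3: int) -> int:
    """And-Rot: <tiny=and><short=rot>, i.e. (R1 AND R2) <<< R3."""
    return rotl32(r1 & r2, r3)


def and_xor(r1: int, r2: int, r3: int) -> int:
    """And-Xor: <tiny=and><tiny=xor>, i.e. (R1 AND R2) XOR R3."""
    return (r1 & r2) ^ r3
```

Each function consumes three source operands and produces one result, which is why the instruction format carries a destination plus three operand fields.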

Scheduling Example: Blowfish
- [Diagram: Blowfish round dataflow (Load, four SBOX lookups, ADD, XOR, ADD, XOR, XOR, Sign Ext) next to its schedule with combined operations (SBOX x4, Add-XOR, Load, Add, XOR, XOR-SignExt).]
- Takes only 4 cycles per iteration to execute!

Design Methodology
- Kernels were hand-scheduled and then validated with the super optimizer for accuracy
- Physical design estimates were generated using Synopsys synthesis tools and the Cacti 2 cache compiler for 0.25um technology
- Timing estimates are based on EX synthesis estimate + bypass latency
- Area estimates are based on EX synthesis estimate + array sizes + 10% control/datapath overhead
- Power estimates are based on EX synthesis estimate + (array access energy * array access frequency)
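The Blowfish scheduling example can be read against the cipher's round function. The sketch below uses the standard Blowfish F-function structure, with comments noting which steps correspond to SBOX lookups and to a combined Add-XOR. The S-box contents are placeholders (real Blowfish derives them from the key), and the mapping to specific cycles is not taken from the slides.

```python
import random

MASK32 = 0xFFFFFFFF


def make_placeholder_sboxes(seed: int = 0):
    """Four 256-entry tables of 32-bit words. Placeholder values, NOT real Blowfish tables."""
    rng = random.Random(seed)
    return [[rng.getrandbits(32) for _ in range(256)] for _ in range(4)]


def blowfish_f(x: int, s) -> int:
    """Blowfish F: ((S0[a] + S1[b]) XOR S2[c]) + S3[d], with a the top byte of x."""
    a, b = (x >> 24) & 0xFF, (x >> 16) & 0xFF
    c, d = (x >> 8) & 0xFF, x & 0xFF
    # Four independent table lookups: candidates for four parallel SBOX ops.
    s0, s1, s2, s3 = s[0][a], s[1][b], s[2][c], s[3][d]
    t = ((s0 + s1) & MASK32) ^ s2      # one combined Add-XOR
    return (t + s3) & MASK32           # final Add


def feistel_round(left: int, right: int, p_i: int, s):
    """One Blowfish round: mix in the subkey, apply F, swap halves."""
    left ^= p_i                        # XOR with the loaded subkey P[i]
    right ^= blowfish_f(left, s)       # XOR F output into the other half
    return right, left


sboxes = make_placeholder_sboxes()
new_left, new_right = feistel_round(0x01234567, 0x89ABCDEF, 0xDEADBEEF, sboxes)
```

The four lookups have no dependences on each other, so a 4-wide machine can issue them together; the remaining add/XOR chain is where operation combining shortens the critical path.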

Timing and Area Estimates

Timing and Area Estimates for Various Configurations

                  4W Comb           3W Comb           2W Comb           4W NoComb
Timing Estimate   2.78 ns           2.66 ns           2.54 ns           2.76 ns
Area Estimate     1.39mm x 1.39mm   1.33mm x 1.33mm   1.26mm x 1.26mm   1.3mm x 1.3mm
Power Estimate    606.37 mW         593.51 mW         568.50 mW         586.86 mW
Critical Path     byps-lgcadd-lgc   byps-lgcadd-lgc   byps-lgcadd-lgc   byps-rotate

[Figure: Encryption Performance. Ciphers: Blowfish, 3DES, IDEA, MARS, RC4, RC6, Rijndael, Twofish; y-axis: 0.0 to 90.0; series: Alpha, ISA+, ISA++, 4WC, 3WC, 2WC, 4WNC; reference lines: OC-12, OC-3, HDTV, T-3.]
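As a quick worked reading of the table, the sketch below converts each timing estimate to a clock frequency and each die dimension to an area. It assumes the timing estimate is the achievable cycle time (the slides do not state this outright), and the dictionary layout and function names are illustrative.

```python
# (timing in ns, die side in mm, power in mW), copied from the table above.
CONFIGS = {
    "4W Comb":   (2.78, 1.39, 606.37),
    "3W Comb":   (2.66, 1.33, 593.51),
    "2W Comb":   (2.54, 1.26, 568.50),
    "4W NoComb": (2.76, 1.30, 586.86),
}


def clock_mhz(timing_ns: float) -> float:
    """Clock frequency in MHz, ASSUMING cycle time equals the timing estimate."""
    return 1000.0 / timing_ns


def area_mm2(side_mm: float) -> float:
    """Area of a square die with the given side length."""
    return side_mm * side_mm


for name, (timing_ns, side_mm, power_mw) in CONFIGS.items():
    print(f"{name:10s} {clock_mhz(timing_ns):7.1f} MHz "
          f"{area_mm2(side_mm):5.2f} mm^2 {power_mw:7.2f} mW")
```

Under that assumption the 4-wide combining configuration lands near 360 MHz in just under 2 mm^2, and the area figures computed this way can be compared against the x-axis of a performance/area plot.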

Special Case Studies: 3DES and Rijndael
[Figure: Performance/Area Tradeoff. x-axis: area (mm2), 0.00 to 3.00; y-axis: 0.0 to 90.0; series: 3DES, Rijndael; points: 2WC, 3WC, 4WC, 4WNC, 8WC.]
- System-level performance studies (in paper) show that CryptoManiac can service high-bandwidth network and disk I/O traffic with one half as many processing elements (compared to Alpha 21264)

Conclusion and Future Work
- An efficient 4-wide VLIW cryptographic co-processor design called CryptoManiac
- Instruction combining: efficient utilization of the clock cycle
- Rijndael runs 2.25 times faster with 1/100th the area and power of a 600MHz Alpha processor
- Assess the cost of programmability in CryptoManiac by comparing the design and performance of:
  - A dedicated hardware Rijndael implementation (no programmability)
  - An FPGA Rijndael implementation (hardware programmability)
  - CryptoManiac (software programmability)
- These results make a very strong case for application-specific architecture optimization; we are exploring this approach in other program domains

BACK-UP

Related Work
- Hardware-only designs specific to a particular algorithm, in both the public-key and secret-key space
  - Examples are the DES and 3DES implementations by Shiva, IBM, and Hi-Fn
- Research on programmable hardware (FPGAs), in both the public-key and secret-key space
- An architectural extension for general permutation has been added to the PowerPC instruction set by Shi & Lee

My Research Contribution
- Design and implementation of the coprocessor
  - Hardware models of 8WC, 4WC, 3WC, 2WC, and 4WNC
  - ISA and scheduling of kernels
  - Timing, area, power, and performance analyses of the co-processor
- Design and implementation of the super optimizer
  - Instruction combination study
  - Automatic generation of varied-width schedules
- Publication: ISCA 2001

Benchmark Suite

Cipher     Key Size   Blk Size   Rnds/Blk   Author         Application
3DES       112        64         48         CryptSoft      SSL, SSH
Blowfish   128        64         16         CryptSoft      Norton Utilities
IDEA       128        64         8          Ascom          PGP, SSH
Mars       128        128        16         IBM            AES Candidate
RC4        128        8          1          CryptSoft      SSL
RC6        128        128        18         RSA Security   AES Candidate
Rijndael   128        128        10         Rijmen         AES Standard
Twofish    128        128        16         Counterpane    AES Candidate

Cipher Throughput Analysis
[Figure: Cipher Throughput Analysis. Ciphers: Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, Twofish; y-axis: 0.00 to 350.00; series: Alpha 21264, 4W, DF.]
- Alpha 21264 vs. 4W
  - All except Mars and Twofish were within 10% of the actual machine tests
  - Mars 11%, Twofish 15%
- Alpha 21264 vs. DF
  - Blowfish, IDEA, and RC6 are running within 20% of DF performance
  - Mars 29%, Twofish 76%
  - RC4 and Rijndael are outliers

Cipher Relative Run Time Cost
- Focus: kernel loop
[Figure: Cipher Relative Run Time Cost. x-axis: session length (in bytes), 16 to 1M; y-axis: 0 to 100; series: Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, Twofish.]
- 3DES and IDEA are small even for 16 byte sessions
- Mars, RC4, RC6, Rijndael, and Twofish drop well below 10% for 4k+ byte sessions
- Blowfish is an outlier; it drops below 10% only for 64k+ byte sessions

The Processor Maniac #1
- A 4-wide 32-bit VLIW machine with no cache and a simple branch predictor
- Supports a triadic (three input operands) ISA that permits combining of most cryptographic operation pairs for better clock cycle utilization
- Can be combined into chip multiprocessor configurations for improved performance on workloads with inter-session and inter-packet parallelism

The Super Optimizer
- Validates hand-scheduled kernel results
- Automates generation of optimized kernels for the various architectures studied
- Instruction combination studies give insight into which hardware is unnecessary and could possibly be eliminated

Instruction Combination Study
[Figure: Scheduling with Various FU Configurations. x-axis: cryptography kernels (3DES, Blowfish, IDEA, Mars, RC4 w/o, RC4 with, RC6, Rijndael); y-axis: 0 to 16; series: all-comb, nott, STOnly, TSOnly, no-comb.]

Instruction Combining Characteristics
[Figure: Combinations Breakdown by Kernel. x-axis: various combinations per cycle (1ST, 2ST, 3ST, 4ST, 1TS, 2TS, 3TS, 4TS, 1TT, 2TT, 3TT, 4TT); y-axis: 0 to 4.5; series: Rijndael, RC6, RC4, Mars, Blowfish.]

Advanced Computer Architecture Lab

SBOX Instruction Semantics
- The SBOX instruction eliminates address generation
  - All SBOX tables are aligned to a 1k byte boundary
  - Address generation becomes zero-latency bit concatenation
- Stores to SBOX storage are not visible to later SBOX instructions until an SBOXSYNC is executed
- [Diagram: the address is formed from the SBOX table base (bits 63 to 10), a table index (bits 10 to 0) taken from one byte of the source register (byte boundaries at bits 24, 16, 8, 0, selected by the op), and two low-order 00 bits.]

Architectural Extensions
- All instructions are limited to two register input operands and one register output
- ROL and ROR (rotates) for 64-bit and 32-bit data types
- ROLX and RORX support a constant rotate of a register input, followed by an XOR with another register input
- MULMOD computes the modular multiplication of two register values modulo the value 0x10001
- SBOX speeds the accessing of substitution tables with 256-entry tables and 32-bit contents
- XBOX implements a portion of a full 64-bit permutation
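The address concatenation and two of the extensions can be sketched as follows. This is an interpretation of the slide, not a specification: the bit layout in `sbox_address` follows the 1k byte alignment and 256-entry, 32-bit-wide tables described above, `mulmod` is the plain product modulo 0x10001 (IDEA additionally treats an operand of 0 as 2^16, which the slide does not detail and which is not modeled here), and all function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits (n taken modulo 32)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def sbox_address(table_base: int, reg: int, byte_sel: int) -> int:
    """Form an SBOX entry address by bit concatenation, with no adder.

    table_base is assumed to be 1k-byte aligned; byte_sel (0..3) picks
    which byte of reg indexes the 256-entry table of 4-byte words.
    """
    index = (reg >> (8 * byte_sel)) & 0xFF
    return (table_base & ~0x3FF) | (index << 2)


def rolx32(ra: int, rb: int, const: int) -> int:
    """ROLX: rotate ra left by a constant, then XOR with rb."""
    return rotl32(ra, const) ^ rb


def mulmod(a: int, b: int) -> int:
    """MULMOD: product modulo 0x10001 (IDEA's zero convention NOT modeled)."""
    return (a * b) % 0x10001
```

Because the table base has ten zero low-order bits and the scaled index occupies exactly bits 9 to 2, the OR never carries, which is what makes the address generation a zero-latency concatenation.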

Performance of ISA Extensions
[Figure: Performance of ISA Extensions. Ciphers: Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, Twofish; y-axis: 0 to 4.5; series: Orig/4W, Opt/4W, Opt/4W+, Opt/8W+, Opt/DF.]