Application Specific Architectures for Cryptography
Lisa Wu, Chris Weaver, Todd Austin
{wul,chriswea,taustin}@eecs.umich.edu

Overview
- Goal: fast, programmable cryptographic processing
  - Fast: efficient execution of computationally intensive cryptographic workloads
  - Programmable: support for algorithms within existing protocols and for new algorithms
- Motivation
  - Cipher kernel analyses and characterizations
- Solution: hardware/software co-design
  - Software: crypto-specific ISA
  - Hardware: efficient co-processor implementation
- Results
  - More than 2 times faster than a high-end general-purpose processor, with orders of magnitude less area and power

Cryptography
- Definitions: encryption vs. decryption; public-key cipher vs. private-key cipher
- Public-secret key ciphers are used in most protocol standards
- Public-Key Cipher: plaintext -> f(x) [Public Key] -> ciphertext -> g(x) [Private Key] -> plaintext
- Private-Key Cipher: plaintext -> g(x) [Private Key] -> ciphertext -> g(x) [Private Key] -> plaintext

SSL Session Breakdown
- Focus: Private-Key Ciphers
- [Diagram: client/server session timeline with public and private phases marked: authenticate, private key, https get, https recv, ..., close]
- [Figure: SSL Characterization by Session Length. x-axis: SSL Session Length (bytes), 1k to 32k; y-axis: Relative Contribution to Run Time, 0% to 100%; series: Public, Private, Other; annotation: size of E*Trade login screen (21k)]

Cipher Bottleneck Analysis
- [Figure: per-cipher bars for 3DES, Mars, RC4, Rijndael, and Twofish; y-axis: 0 to 1; series: Alias, Branch, Issue, Mem, Res, Window, All]
- Likely bottlenecks: issue width, resources
- Possible bottlenecks: memory aliases, window size
- Not a bottleneck: branch mispredictions, memory latency

Cipher Kernel Characterization
- [Figure: Characterization of Cipher Kernel Operations. y-axis: 0% to 100%; operation classes: Branch, Mov, Ld/St, Xbox, Sbox, Mult, Rotates, Logical, Arith]
- SBOX: substitutions; XBOX: permutations
- IDEA, Mars, RC4, and RC6 rely on arithmetic computations; they benefit from more resources (multiplies) and faster operations (rotates)
- Blowfish, 3DES, Rijndael, and Twofish rely on substitutions; they benefit from increased memory bandwidth

System Architecture
- Request format: id, session, action, data
- Result format: id, session, result
- [Diagram: requests -> In Q -> Req Scheduler -> CM Proc ... CM Proc -> Out Q -> results; the CM Procs connect to a Keystore]

Processing Element
- [Diagram: pipeline stages IF, ID/RF, EX/MEM, WB; blocks: BTB, IMEM, RF, four FUs, InQ/OutQ Interface, Data Mem, Keystore Interface]
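
The request and result formats on the System Architecture slide can be pictured with a small Python sketch. Only the field names (id, session, action, data, result) and the In Q / scheduler / Out Q / Keystore roles come from the slide. The dispatch policy (`session % num_procs`), the `process` callback signature, and the XOR "cipher" are hypothetical stand-ins chosen for illustration; the slides do not state how the scheduler assigns work.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    id: int
    session: int
    action: str   # e.g. "encrypt" or "decrypt" (assumed values)
    data: bytes


@dataclass
class Result:
    id: int
    session: int
    result: bytes


def schedule(in_q, keystore, num_procs, process):
    """Drain the input queue, dispatch each request, and fill the output queue.

    `process(proc_id, action, key, data)` stands in for one processing element.
    """
    out_q = deque()
    while in_q:
        req = in_q.popleft()
        proc_id = req.session % num_procs   # hypothetical assignment policy
        key = keystore[req.session]         # per-session key lookup
        out_q.append(Result(req.id, req.session,
                            process(proc_id, req.action, key, req.data)))
    return out_q


def toy_xor(proc_id, action, key, data):
    """Placeholder 'cipher': XOR every byte with a one-byte key (not secure)."""
    return bytes(b ^ key for b in data)


in_q = deque([Request(1, 7, "encrypt", b"\x00\x01")])
out_q = schedule(in_q, {7: 0xFF}, num_procs=4, process=toy_xor)
```

Pinning a session to one processing element, as assumed here, is only one possible policy; it keeps requests of a session in order while letting independent sessions run in parallel.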

Combining Functional Unit
- [Diagram: Logical Unit (XOR, AND) {tiny}; Pipelined 32-Bit MUL {long}; 1K Byte SBOX Cache, 32-Bit Adder, 32-Bit Rotator {short}; Logical Unit (XOR, AND) {tiny}]

ISA
  bundle         := <inst><inst><inst><inst>
  inst           := <operation pair><dest><operand 1><operand 2><operand 3>
  operation pair := <short><tiny> | <tiny><short> | <tiny><tiny> | <long><nop>
  tiny           := <xor> | <and> | <signext> | <nop>
  short          := <add> | <addinc> | <sub> | <rot> | <sbox> | <nop>
  long           := <mul> | <mulmod>

Examples:
  Instruction               Expression
  Add-Xor R4, R1, R2, R3    R4 <- (R1+R2) ⊕ R3
  And-Rot R4, R1, R2, R3    R4 <- (R1&&R2)<<<R3
  And-Xor R4, R1, R2, R3    R4 <- (R1&&R2) ⊕ R3
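
The three combined instructions in the ISA examples can be modeled in a few lines of Python. This is a sketch under stated assumptions: registers are 32 bits wide, `&&` in the slide is read as bitwise AND, `<<<` is a left rotate, and the rotate amount is taken modulo 32. The function names are illustrative and do not come from the slides.

```python
MASK32 = 0xFFFFFFFF  # registers are modeled as 32-bit values


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits (n taken modulo 32, an assumption)."""
    n &= 31
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def add_xor(r1, r2, r3):
    """Add-Xor: R4 <- (R1 + R2) xor R3, with a wrapping 32-bit add."""
    return ((r1 + r2) & MASK32) ^ (r3 & MASK32)


def and_rot(r1, r2, r3):
    """And-Rot: R4 <- (R1 and R2) <<< R3."""
    return rotl32(r1 & r2, r3)


def and_xor(r1, r2, r3):
    """And-Xor: R4 <- (R1 and R2) xor R3."""
    return ((r1 & r2) ^ r3) & MASK32
```

Each function performs two dependent operations in what the ISA treats as one instruction, which is the point of the <short><tiny> and <tiny><short> operation pairs.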

Scheduling Example: Blowfish
- [Diagram: Blowfish iteration dataflow (four SBOX lookups, ADD, XOR, ADD, XOR, Load, XOR, Sign Ext) and its schedule using SBOX, Add-XOR, Add, XOR, XOR-SignExt, and Load operations]
- Takes only 4 cycles per iteration to execute!

Design Methodology
- Kernels were hand-scheduled and then validated with the super optimizer for accuracy
- Physical design estimates were generated using Synopsys synthesis tools and the Cacti 2 cache compiler for 0.25um technology
- Timing estimates are based on the EX synthesis estimate + bypass latency
- Area estimates are based on the EX synthesis estimate + array sizes + 10% control/datapath overhead
- Power estimates are based on the EX synthesis estimate + (array access energy * array access frequency)
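
For reference, the computation being scheduled in the Blowfish example is the standard Blowfish round: four S-box lookups combined by add, XOR, add, then an XOR into the other half of the block. The Python sketch below shows where an Add-XOR combined operation fits. The S-box contents here are toy values, not the real Blowfish tables, and the mapping of operations to the four cycles is not taken from the slide.

```python
MASK32 = 0xFFFFFFFF

# Toy tables for illustration only: entry i of table k is k*256 + i.
TOY_SBOX = [[(k << 8) | i for i in range(256)] for k in range(4)]


def add_xor(r1, r2, r3):
    """Combined Add-XOR: (r1 + r2) xor r3 with a wrapping 32-bit add."""
    return ((r1 + r2) & MASK32) ^ r3


def blowfish_f(x, sbox):
    """Blowfish F function: ((S0[a] + S1[b]) xor S2[c]) + S3[d] mod 2^32."""
    a = (x >> 24) & 0xFF
    b = (x >> 16) & 0xFF
    c = (x >> 8) & 0xFF
    d = x & 0xFF
    t = add_xor(sbox[0][a], sbox[1][b], sbox[2][c])  # one combined instruction
    return (t + sbox[3][d]) & MASK32


def blowfish_round(left, right, p_i, sbox):
    """One Blowfish round: mix in the subkey, apply F, then swap the halves."""
    left = (left ^ p_i) & MASK32
    right = (right ^ blowfish_f(left, sbox)) & MASK32
    return right, left
```

The four S-box lookups are independent of each other, so a 4-wide machine can issue them together; the add, XOR, add chain that follows is serial, which is where combining two operations into one instruction shortens the schedule.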

Timing and Area Estimates
Timing and Area Estimates for Various Configurations

                   4W Comb           3W Comb           2W Comb           4W NoComb
  Timing Estimate  2.78 ns           2.66 ns           2.54 ns           2.76 ns
  Area Estimate    1.39mm x 1.39mm   1.33mm x 1.33mm   1.26mm x 1.26mm   1.3mm x 1.3mm
  Power Estimate   606.37 mW         593.51 mW         568.50 mW         586.86 mW
  Critical Path    byps-lgcadd-lgc   byps-lgcadd-lgc   byps-lgcadd-lgc   byps-rotate

Encryption Performance
- [Figure: Encryption Performance. Bars per cipher (Blowfish, 3DES, IDEA, MARS, RC4, RC6, Rijndael, Twofish); y-axis: 0.0 to 90.0; series: Alpha, ISA+, ISA++, 2WC, 3WC, 4WC, 4WNC; reference lines: T-3, HDTV, OC-3, OC-12]

Special Case Studies: 3DES and Rijndael
- [Figure: Performance/Area Tradeoff. x-axis: Area (mm2), 0.00 to 3.00; y-axis: 0.0 to 90.0; series: 3DES, Rijndael; points: 2WC, 3WC, 4WC, 4WNC, 8WC]
- System-level performance studies (in paper) show that the co-processor can service high-bandwidth network and disk I/O traffic with one half as many processing elements as the Alpha 21264

Conclusion and Future Work
- An efficient 4-wide VLIW cryptographic co-processor design
  - Instruction combining: efficient utilization of the clock cycle
  - Rijndael runs 2.25 times faster with 1/100th the area and power of a 600MHz Alpha processor
- Assess the cost of programmability in the co-processor by comparing the design and performance of
  - A dedicated hardware Rijndael implementation (no programmability)
  - An FPGA Rijndael implementation (hardware programmability)
  - The co-processor (software programmability)
- These results make a very strong case for application-specific architecture optimization; we are exploring this approach in other program domains

BACK-UP

Related Work
- Hardware-only designs specific to a particular algorithm, in both the public-key and secret-key space
  - Examples are the DES and 3DES implementations by Shiva, IBM, and Hi-Fn
- Research on programmable hardware (FPGAs), in both the public-key and secret-key space
- An architectural extension for general permutation was added to the PowerPC instruction set by Shi & Lee

My Research Contribution
- Design and implementation of the coprocessor
  - Hardware models of 8WC, 4WC, 3WC, 2WC, and 4WNC
  - ISA and scheduling of kernels
  - Timing, area, power, and performance analyses of the co-processor
- Design and implementation of the super optimizer
  - Instruction combination study
  - Automatic generation of varied-width schedules
- Publication: ISCA 2001

Benchmark Suite

  Cipher    Key Size (bits)  Blk Size (bits)  Rnds/Blk  Author        Application
  3DES      112              64               48        CryptSoft     SSL, SSH
  Blowfish  128              64               16        CryptSoft     Norton Utilities
  IDEA      128              64               8         Ascom         PGP, SSH
  Mars      128              128              16        IBM           AES Candidate
  RC4       128              8                1         CryptSoft     SSL
  RC6       128              128              18        RSA Security  AES Candidate
  Rijndael  128              128              10        Rijmen        AES Standard
  Twofish   128              128              16        Counterpane   AES Candidate

Cipher Throughput Analysis
- [Figure: bars per cipher (Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, Twofish); y-axis: 0.00 to 350.00; series: Alpha 21264, 4W, DF]
- Alpha 21264 vs. 4W
  - All except Mars and Twofish were within 10% of the actual machine tests
  - Mars 11%, Twofish 15%
- Alpha 21264 vs. DF
  - Blowfish, IDEA, and RC6 run within 20% of DF performance
  - Mars 29%, Twofish 76%
  - RC4 and Rijndael are outliers

Cipher Relative Run Time Cost
- Focus: Kernel Loop
- [Figure: one curve per cipher (Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, Twofish); x-axis: Session Length (in bytes), 16 to 1M; y-axis: 0 to 100]
- 3DES and IDEA are small even for 16-byte sessions
- Mars, RC4, RC6, Rijndael, and Twofish drop well below 10% for 4k+ byte sessions
- Blowfish is the outlier; it drops below 10% only for 64k+ byte sessions

The Processor Maniac #1
- A 4-wide 32-bit VLIW machine with no cache and a simple branch predictor
- Supports a triadic (three input operands) ISA that permits combining of most cryptographic operation pairs for better clock cycle utilization
- Can be combined into chip multiprocessor configurations for improved performance on workloads with inter-session and inter-packet parallelism

The Super Optimizer
- Validates hand-scheduled kernel results
- Automates generation of optimized kernels for the various architectures studied
- Instruction combination studies give insight into which hardware is unnecessary and could be eliminated

Instruction Combination Study
- [Figure: Scheduling with Various FU Configurations. x-axis: Cryptography Kernels (3DES, Blowfish, IDEA, Mars, RC4 w/o, RC4 with, RC6, Rijndael); y-axis: 0 to 16; series: all-comb, nott, STOnly, TSOnly, no-comb]

Advanced Computer Architecture Lab

Instruction Combining Characteristics
- [Figure: Combinations Breakdown by Kernel. x-axis: Various Combinations per Cycle (1ST, 2ST, 3ST, 4ST, 1TS, 2TS, 3TS, 4TS, 1TT, 2TT, 3TT, 4TT); y-axis: 0 to 4.5; series: Rijndael, RC6, RC4, Mars, Blowfish]

SBOX Instruction Semantics
- The SBOX instruction eliminates address generation
  - All SBOX tables are aligned to a 1k byte boundary
  - Address generation becomes zero-latency bit concatenation
- Stores to SBOX storage are not visible to later SBOX instructions until an SBOXSYNC is executed
- [Diagram: SBOX address formation: Table base (bits 63 to 10), Index byte selected by op from bit offsets 24, 16, 8, or 0, and low bits 00]

Architectural Extensions
- All instructions are limited to two register input operands and one register output
- ROL and ROR (rotates) for 64-bit and 32-bit data types
- ROLX and RORX support a constant rotate of a register input, followed by an XOR with another register input
- MULMOD computes the modular multiplication of two register values modulo the value 0x10001
- SBOX speeds up access to substitution tables with 256-entry tables and 32-bit contents
- XBOX implements a portion of a full 64-bit permutation
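
The SBOX, MULMOD, and ROLX semantics above can be illustrated with a short Python sketch. Several details are assumptions, not statements from the slides: the helper names are hypothetical, `byte_sel` stands in for the op field that picks an index byte, ROLX is modeled on 32-bit values, and `mulmod` follows IDEA's convention that a 16-bit zero operand represents 2^16 (the slide only says the product is taken modulo 0x10001).

```python
MASK32 = 0xFFFFFFFF


def sbox_address(table_base, index_reg, byte_sel):
    """Form an S-box entry address by bit concatenation instead of an add.

    The table is 256 entries x 4 bytes = 1 KB and 1 KB aligned, so the low
    10 bits of the base are zero and can be filled with (index byte, 00).
    """
    if table_base % 1024 != 0:
        raise ValueError("SBOX tables must be aligned to a 1k byte boundary")
    index_byte = (index_reg >> (8 * byte_sel)) & 0xFF
    return table_base | (index_byte << 2)


def mulmod(a, b):
    """Multiply modulo 0x10001 on 16-bit operands (IDEA-style zero handling)."""
    a = (a & 0xFFFF) or 0x10000   # assumption: 0 stands for 2^16
    b = (b & 0xFFFF) or 0x10000
    return ((a * b) % 0x10001) & 0xFFFF


def rolx(ra, amount, rb):
    """ROLX: rotate ra left by a constant amount, then XOR with rb (32-bit)."""
    amount &= 31
    ra &= MASK32
    rotated = ((ra << amount) | (ra >> (32 - amount))) & MASK32
    return rotated ^ (rb & MASK32)
```

Because the base has ten zero low bits, the OR in `sbox_address` never carries, which is why the slide can describe address generation as zero-latency concatenation.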

Performance of ISA Extensions
- [Figure: bars per cipher (Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, Twofish); y-axis: 0 to 4.5; series: Orig/4W, Opt/4W, Opt/4W+, Opt/8W+, Opt/DF]