High-Performance Modular Multiplication on the Cell Broadband Engine

Similar documents
High-Performance Modular Multiplication on the Cell Processor

Montgomery Multiplication Using Vector Instructions

Crypto On the Playstation 3

- 0 - CryptoLib: Cryptography in Software John B. Lacy 1 Donald P. Mitchell 2 William M. Schell 3 AT&T Bell Laboratories ABSTRACT

Breaking ECC2K-130 on Cell processors and GPUs

Neil Costigan School of Computing, Dublin City University PhD student / 2 nd year of research.

Breaking ECC2K-130. May 20, Oberseminar Computer Security, COSEC group, B-IT, Bonn

Introduction to Cryptography and Security Mechanisms: Unit 5. Public-Key Encryption

This chapter continues our overview of public-key cryptography systems (PKCSs), and begins with a description of one of the earliest and simplest

Spring 2011 Prof. Hyesoon Kim

POST-SIEVING ON GPUs

TECHNISCHE UNIVERSITEIT EINDHOVEN Faculty of Mathematics and Computer Science Exam Cryptology, Tuesday 31 October 2017

Understanding Cryptography by Christof Paar and Jan Pelzl. Chapter 9 Elliptic Curve Cryptography

Elliptic Curves over Prime and Binary Fields in Cryptography

0x1A Great Papers in Computer Security

LECTURE NOTES ON PUBLIC- KEY CRYPTOGRAPHY. (One-Way Functions and ElGamal System)

Cell Broadband Engine. Spencer Dennis Nicholas Barlow

A New Attack with Side Channel Leakage during Exponent Recoding Computations

Bipartite Modular Multiplication

Low-Latency Elliptic Curve Scalar Multiplication

Scalable Montgomery Multiplication Algorithm

Parallel Computing: Parallel Architectures Jin, Hai

My 2 hours today: 1. Efficient arithmetic in finite fields minute break 3. Elliptic curves. My 2 hours tomorrow:

Side-Channel Attacks on RSA with CRT. Weakness of RSA Alexander Kozak Jared Vanderbeck

A High-Speed FPGA Implementation of an RSD- Based ECC Processor

Applications of The Montgomery Exponent

NEW MODIFIED LEFT-TO-RIGHT RADIX-R REPRESENTATION FOR INTEGERS. Arash Eghdamian 1*, Azman Samsudin 1

Public Key Encryption

Realizing Arbitrary-Precision Modular Multiplication with a Fixed-Precision Multiplier Datapath

SIMD Acceleration of Modular Arithmetic on Contemporary Embedded Platforms

A SIGNATURE ALGORITHM BASED ON DLP AND COMPUTING SQUARE ROOTS

Provably Secure and Efficient Cryptography

Dr. Jinyuan (Stella) Sun Dept. of Electrical Engineering and Computer Science University of Tennessee Fall 2010

Channel Coding and Cryptography Part II: Introduction to Cryptography

ECE 297:11 Reconfigurable Architectures for Computer Security

Roadrunner. By Diana Lleva Julissa Campos Justina Tandar

Key Exchange. Secure Software Systems

Studies on Modular Arithmetic Hardware Algorithms for Public-key Cryptography

NEON: Faster Elliptic Curve Scalar Multiplications on ARM Processors

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

Algorithms and arithmetic for the implementation of cryptographic pairings

Key Management and Distribution

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley

Fast Arithmetic Modulo 2 x p y ± 1

Sony/Toshiba/IBM (STI) CELL Processor. Scientific Computing for Engineers: Spring 2008

high performance medical reconstruction using stream programming paradigms

Cryptography and Network Security. Sixth Edition by William Stallings

Analysis, demands, and properties of pseudorandom number generators

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm

Scalar Blinding on Elliptic Curves with Special Structure

Multifunction Residue Architectures for Cryptography 1

High-Performance Cryptography in Software

Public Key Cryptography

A NOVEL RNS MONTGOMERY MATHEMATICAL PROCESS ALGORITHM FOR CRYPTOGRAPHY. Telangana, Medak, Telangana

Cell Processor and Playstation 3

Elliptic Curve Cryptosystem

Finite Field Arithmetic Using AVX-512 For Isogeny-Based Cryptography

draft-orman-public-key-lengths-01.txt Determining Strengths For Public Keys Used For Exchanging Symmetric Keys Status of this memo

SIDH on ARM: Faster Modular Multiplications for Faster Post-Quantum Supersingular Isogeny Key Exchange.

RSA. Public Key CryptoSystem

Cryptography and Network Security Chapter 10. Fourth Edition by William Stallings

The Application of Elliptic Curves Cryptography in Embedded Systems

Hardware Architectures

Study Guide to Mideterm Exam

Proposal For C%: A Language For Cryptographic Applications

COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors

Montgomery Multiplication Using Vector Instructions

Collision Search for Elliptic Curve Discrete Logarithm over GF(2 m ) with FPGA

Efficient Implementation of Public Key Cryptosystems on Mote Sensors (Short Paper)

Pomcor JavaScript Cryptographic Library (PJCL)

Cryptography and Network Security

An update on Scalable Implementation of Primitives for Homomorphic EncRyption FPGA implementation using Simulink Abstract

Computer Security. 08. Cryptography Part II. Paul Krzyzanowski. Rutgers University. Spring 2018

Type-II optimal polynomial bases. D. J. Bernstein University of Illinois at Chicago. Joint work with: Tanja Lange Technische Universiteit Eindhoven

Computer Security 3/23/18

Software Engineering Aspects of Elliptic Curve Cryptography. Joppe W. Bos Real World Crypto 2017

Public-Key Cryptography

Introduction to Cryptography Lecture 11

IMPLEMENTATION OF ELLIPTIC CURVE POINT MULTIPLICATION ALGORITHM USING DSP PROCESSOR 1Prof. Renuka H. Korti, 2Dr. Vijaya C.

High Speed Cryptoprocessor for η T Pairing on 128-bit Secure Supersingular Elliptic Curves over Characteristic Two Fields

High-Performance Integer Factoring with Reconfigurable Devices

! Addition! Multiplication! Bigger Example - RSA cryptography

ECDLP course ECC2K-130. Daniel J. Bernstein University of Illinois at Chicago. Tanja Lange Technische Universiteit Eindhoven

Distributed Systems. 26. Cryptographic Systems: An Introduction. Paul Krzyzanowski. Rutgers University. Fall 2015

Vectorized implementations of post-quantum crypto

Cell Programming Tips & Techniques

SIDH on ARM: Faster Modular Multiplications for Faster Post-Quantum Supersingular Isogeny Key Exchange

CS669 Network Security

A Binary Redundant Scalar Point Multiplication in Secure Elliptic Curve Cryptosystems

Public-Key Encryption, Key Exchange, Digital Signatures CMSC 23200/33250, Autumn 2018, Lecture 7

Accelerating SSL using the Vector processors in IBM s Cell Broadband Engine for Sony s Playstation 3

VHDL for RSA Public Key System

The University of Texas at Austin

The Beta Cryptosystem

Recommendation to Protect Your Data in the Future

Cryptography and Network Security. Sixth Edition by William Stallings

Original PlayStation: no vector processing or floating point support. Photorealism at the core of design strategy

Efficient Elliptic Curve Processor Architectures for Field Programmable Logic

Lecture 9. Public Key Cryptography: Algorithms, Key Sizes, & Standards. Public-Key Cryptography

Transcription:

High-Performance Modular Multiplication on the Cell Broadband Engine Joppe W. Bos Laboratory for Cryptologic Algorithms EPFL, Lausanne, Switzerland joppe.bos@epfl.ch 1 / 21

Outline Motivation and previous work Applications for multi-stream modular multiplication Background Fast reduction with special primes The Cell broadband engine Modular multiplications on the Cell Performance results Conclusions 2 / 21

Motivation Modular multiplication is a performance critical operation in many cryptographic applications RSA ElGamal Elliptic curve cryptosystems as well as in cryptanalytic applications computing elliptic curve discrete logarithms (Pollard rho) factoring integers (elliptic curve factorization method) Measure the performance on the Cell. 3 / 21

Previous works Misc. Platforms Lots of performance results for many platforms GNU Multiple Precision (GMP) Arithmetic Library: almost all platforms (multiplicaton + reduction seperately), Bernstein et. al (Eurocrypt 2009): NVIDIA GPUs, Brown et al. (CT-RSA 2001): NIST primes on x 86. On the Cell Broadband Engine The Multi-Precision Math (MPM) Library by IBM, Optimize for one specific bit-size Costigan and Schwabe (Africacrypt 2009): special 255-bit prime, Bernstein et. al (SHARCS 2009): 195-bit generic moduli 4 / 21

Contributions What did I do? Present high-performance multi-stream algorithms Montgomery multiplication, schoolbook multiplication, various special reduction schemes. Implementation details (in C) are presented for a cryptologic interesting range 192 521 bits targeted at the Cell Broadband Engine. 5 / 21

Multi-Stream Modular Multiplication Applications Modular exponentiations using a square-and-multiply algorithm. Cryptography Exponentiations with the same random exponent: ElGamal encryption (ElGamal, Crypto 1984), Double base ElGamal - Damgård ElGamal (Damgård, Crypto 1991), Double hybrid Damgård ElGamal (Kiltz et. al, Eurocrypt 2009). Batch decryption in elliptic curve cryptosystems Cryptanalysis Pollard rho (elliptic curve discrete logarithm problem) Integer factorization (elliptic curve factorization method) 6 / 21

Special Primes Faster reduction exploiting the structure of the special prime. By US National Institute of Standards Five recommended primes in the FIPS 186-3 (Digital Signature Standard) P 192 = 2 192 2 64 1 P 224 = 2 224 2 96 + 1 P 256 = 2 256 2 224 + 2 192 + 2 96 1 P 384 = 2 384 2 128 2 96 + 2 32 1 P 521 = 2 521 1 Prime used in Curve25519 Proposed by Bernstein at PKC 2006 P 255 = 2 255 19 7 / 21

Example: P 192 = 2 192 2 64 1 0 x < P 2 192, 0 x H, x L < 2 192, x = x H 2 192 + x L x x L + x H 2 64 + x H mod P 192 8 / 21

Example: P 192 = 2 192 2 64 1 0 x < P 2 192, 0 x H, x L < 2 192, x = x H 2 192 + x L x x L + x H 2 64 + x H mod P 192 x H 2 64 < 2 256 x H 2 64 x H 2 64 mod 2 192 + xh 2 64 2 64 + xh 2 64 mod P 2 192 2 192 192 8 / 21

Example: P 192 = 2 192 2 64 1 0 x < P 2 192, 0 x H, x L < 2 192, x = x H 2 192 + x L x x L + x H 2 64 + x H mod P 192 x H 2 64 < 2 256 x H 2 64 x H 2 64 mod 2 192 + xh 2 64 2 64 + xh 2 64 mod P 2 192 2 192 192 s 1 = (c 5, c 4, c 3, c 2, c 1, c 0 ), s 2 = (c 11, c 10, c 9, c 8, c 7, c 6 ), s 3 = (c 9, c 8, c 7, c 6, 0, 0), s 4 = (0, 0, c 11, c 10, 0, 0), s 5 = (0, 0, 0, 0, c 11, c 10 ) Return s 1 + s 2 + s 3 + s 4 + s 5 8 / 21

Example: P 192 = 2 192 2 64 1 0 x < P 2 192, 0 x H, x L < 2 192, x = x H 2 192 + x L x x L + x H 2 64 + x H mod P 192 x H 2 64 < 2 256 x H 2 64 x H 2 64 mod 2 192 + xh 2 64 2 64 + xh 2 64 mod P 2 192 2 192 192 s 1 = (c 5, c 4, c 3, c 2, c 1, c 0 ), s 2 = (0, 0, c 7, c 6, c 7, c 6 ), s 3 = (c 9, c 8, c 9, c 8, 0, 0), s 4 = (c 11, c 10, c 11, c 10, c 11, c 10 ) Return s 1 + s 2 + s 3 + s 4 Solinas, technical report 1999 Note: this reduces to [0, 4 P 192 ] 8 / 21

Multiplication algorithms Generic Moduli Montgomery multiplication (with final subtraction) Special Moduli Multiplication + special reduction Size of the modulus: Multiplication method: 192-521 bit schoolbook Investigate other methods (such as Karatsuba) is left as future work. 9 / 21

The Cell Broadband Engine Cell architecture in the PlayStation 3 (@ 3.2 GHz): Broadly available (24.6 million) Relatively cheap (US$ 300) The Cell contains eight Synergistic Processing Elements (SPEs) six available to the user in the PS3 one Power Processor Element (PPE) the Element Interconnect Bus (EIB) a specialized high-bandwidth circular data bus 10 / 21

Cell architecture, the SPEs The SPEs contain a Synergistic Processing Unit (SPU) Access to 128 registers of 128-bit SIMD operations Dual pipeline (odd and even) Rich instruction set In-order processor 256 KB of fast local memory (Local Store) Memory Flow Controller (MFC) 11 / 21

Programming Challenges Memory The executable and all data should fit in the LS Or perform manual DMA requests to the main memory (max. 214 MB) Branching No smart dynamic branch prediction Instead prepare-to-branch instructions to redirect instruction prefetch to branch targets Instruction set limitations 16 16 32 bit multipliers (4-SIMD) Dual pipeline One odd and one even instruction can be dispatched per clock cycle. 12 / 21

Modular Multiplication on the Cell I Four (16 m)-bit integers A, B, C, D represented in m vectors. V [0] =. V [i] =. V [m 1] = X = m 1 i=0 x i 2 16 i 128-bit wide vector {}}{ 16-bit 16-bit }{{}}{{} high low b 0 c 0 d 0. a i b i c i d i a m 1 b m 1. }{{} c m 1 the most significant position of X 1 is located in d m 1 either the lower or higher 16-bit of the 32-bit word 13 / 21

Modular Multiplication on the Cell II Implementation use the multiply-and-add instruction, if 0 a, b, c, d < 2 16, then a b + c + d < 2 32. try to fill both the odd and even pipelines, are branch-free. Do not fully reduce modulo (m-bits) P, Montgomery and special reduction [0, 2 m, These numbers can be used as input again, Reduce to [0, P at the cost of a single comparison + subtraction. 14 / 21

Modular Multiplication on the Cell III Special reduction [0, t P (t Z and small) How to reduce to [0, 2 m? Apply special reduction again Repeated subtraction (t times) For a constant modulus m-bit P Select the four values to subtract simultaneously using select and cmpgt instructions and a look-up table. 15 / 21

Modular Multiplication on the Cell IV For the special primes this can be done even faster. t t P 224 = t (2 224 2 96 + 1 ) = {c 7,..., c 0 } c 7 c 6 c 5 c 4 c 3 c 2 c 1 c 0 0 0 0 0 0 0 0 0 0 1 0 2 32 1 2 32 1 2 32 1 2 32 1 0 0 1 2 1 2 32 1 2 32 1 2 32 1 2 32 2 0 0 2 3 2 2 32 1 2 32 1 2 32 1 2 32 3 0 0 3 4 3 2 32 1 2 32 1 2 32 1 2 32 4 0 0 4 c 0 = t, c 1 = c 2 = 0 and c 3 = (unsigned int) (0 t). If t > 0 then c 4 = c 5 = c 6 = 2 32 1 else c 4 = c 5 = c 6 = 0. Use a single select. 16 / 21

Modular Multiplication on the Cell V P 255 = 2 255 19 Original approach Proposed by Bernstein and implemented on the SPE by Costigan and Schwabe (Africacrypt 2009): 19 Here x F 2 255 19 is represented as x = x i 2 12.75i. Redundant representation Following ideas from Bos, Kaihara and Montgomery (SHARCS 2009), 15 Calculate modulo 2 P 255 = 2 256 38 = x i 2 16, Reduce to [0, 2 256. i=0 i=0 17 / 21

Performance Results 1600 1400 1200 Modular Multiplication Performance Results NIST primes Generic primes Curve25519 prime Number of cycles 1000 800 600 400 200 0 150 200 250 300 350 400 450 500 550 Bitsize of the modulus Montgomery multiplication multiplication + fast reduction 1.4 2.5 18 / 21

Comparison Special Moduli Number of cycles for what? Measurements over millions of multi-stream modular multiplications, Cycles for a single modular multiplication, include benchmark overhead, function call, loading (storing) the input (output), converting from radix-2 32 to radix-2 16. 19 / 21

Comparison Special Moduli Number of cycles for what? Measurements over millions of multi-stream modular multiplications, Cycles for a single modular multiplication, include benchmark overhead, function call, loading (storing) the input (output), converting from radix-2 32 to radix-2 16. Special prime P 255 Costigan and Schwabe (Africacrypt 2009), 255 bit. single-stream: 444 cycles (144 mul, 244 reduction, 56 overhead). multi-stream: 168 cycles. no function call, loading and storing, perfectly scheduled (filled both pipelines) this work: 180 cycles (< 168 + 56), both approaches are comparable in terms of speed (on the Cell). 19 / 21

Comparison Generic Moduli Generic 195-bit moduli Bernstein et al. (SHARCS 2009), multi-stream, 189 cycles, This work: multi-stream, 159 cycles for 192-bit generic moduli, Scaling: ( 195 192 )2 159 = 164 cycles. Generic moduli Bitsize #cycles New MPM umpm 192 159 1,188 877 224 237 1,188 877 256 300 1,188 877 384 719 2,092 1,610 512 1,560 3,275 2,700 20 / 21

Conclusions We presented SIMD algorithms for Montgomery and schoolbook multiplication and fast reduction. Implementations are optimized for the Cell architecture. Implementation results for moduli of size 192 to 521 bits show that special primes are 1.4 to 2.5 times faster compared to generic primes. Future work Try Karatsuba multiplication Further optimize Montgomery multiplication (almost finished) 21 / 21