High-Performance Integer Factoring with Reconfigurable Devices

Size: px

Start display at page:

Download "High-Performance Integer Factoring with Reconfigurable Devices"

Osborne Griffin
5 years ago
Views:

1 FPL 2010, Milan, August 31st September 2nd, 2010 High-Performance Integer Factoring with Reconfigurable Devices Ralf Zimmermann, Tim Güneysu, Christof Paar Horst Görtz Institute for IT-Security Ruhr-University Bochum

2 Outline Introduction Arithmetic using modern s ECM System Layout Results Conclusions & Future Work

3 Outline Introduction Arithmetic using modern s ECM System Layout Results Conclusions & Future Work

4 Introduction to Elliptic Curve Method Why ECM? Factor small numbers Low memory requirements Composite number ~ 200 bits ECM prime factor Nice to know but importance?

5 Introduction to Elliptic Curve Method Why ECM? Factor small numbers Low memory requirements Number Field Sieve RSA modulus 1024 Bits ECM ECM ECM ECM prime factors n = pq Nice to know but useful for co-factorization (NFS)

Existing implementations Software: GMP-ECM (P. Zimmermann et al.

2008) GPUs: GPU-ECM (Bernstein et al. 2008) Simka et al.

micro-controller Proof-of-concept implementation Proof of Concept

(CHES 2006, IEEE Transaction on Computers (Sep 2010))

6 Existing implementations Software: GMP-ECM (P. Zimmermann et al.) Variant: GMP-EECM (Bernstein et al. 2008) GPUs: GPU-ECM (Bernstein et al. 2008) Simka et al. (FPL 2005) Montgomery Curves Virtex-E 2000 w/ embedded micro-controller Proof-of-concept implementation Proof of Concept Gaj et al. (CHES 2006, IEEE Transaction on Computers (Sep 2010)) Implementation on high and low cost s Suggestions to Phase 2 implementation 2nd Phase De Meulenaer et al. (FCCM 2007) Uses DSP hardcores to implement Phase 1 No implementation of Phase 2 (possible?)

7 How to implement ECM? ECM How? Phase 1: scalar multiplication Montgomery Ladder Phase 2: more complex Standard continuation in hardware (rough overview) Precomputations Scalar multiplications jq Storing primes in a memory Main computations point addition accumulation of product d

Design Goals and Decisions Usage of Elliptic Curve Method (ECM) in co-factorization Goals Implement both Phases 1 and 2 Usable on COPACOBANA v4 2nd Phase Host PC + COPACOBANA 1.

8 Design Goals and Decisions Usage of Elliptic Curve Method (ECM) in co-factorization Goals Implement both Phases 1 and 2 Usable on COPACOBANA v4 2nd Phase Host PC + COPACOBANA 1. Pre-Compute Initial values 2. Send Curve parameters, moduli 3. Compute final gcd Curves in Montgomery Form Efficient formulas for hardware Computation/storage of y-coordinate can be omitted

9 Outline Introduction Arithmetic using modern s ECM System Layout Results Conclusions & Future Work

10 Modern s Programmable elements Configurable Logic Blocks (CLB) Many, many (even more ) CLBs Connection via interconnect (switching matrix) Modern s contain hardcores Dedicated memory elements Arithmetic hardcores, e.g., to accelerate integer multiplication and addition Embedded PowerPC processors High-speed I/O Transceivers

11 Embedded Memory Elements (Virtex-4) 18kbit storage element (BlockRAM) 400 MHz with connected output register DINA DINB Flexible BRAM configuration Dual-Port storage Single-port storage RAM or ROM ADDRA WEA ADDRB WEB Cascading possible DOUTA DOUTB

12 Digital Signal Processing (Virtex-4) DSP Slice: two adjacent DSP Fast signed 18x18-bit multiplication and signed 48-bit addition/subtraction (400 MHz) Integrated pipeline register Controllable by an OPMODE signal DSP elements can be cascaded

13 Basic Modular Addition Input: Modulus M Operands A M, B M Output: A+B (mod M) A B M R = A+B S = A+B-M If (b = 1) then Return R Else Return S + 0 R borrow (b) S

14 Implemented Mod. Addition/Subtraction Addition of two b*17-bit numbers: b b clock cycles Two designs: one for Virtex-4 DSPs, one for new DSP types Non Virtex-4 Virtex-4 Flipflops Slices Input LUTs DSP 2 Cycles 2*b + 3 Frequency 400 MHz

15 Basic Modular Multiplication Modular Multiplication with Quotient Pipelining (Orup) Input: Modulus dependent par M (M, k, d), operands A, B Output: S A B R -1 (mod M) block selection of S i,0 shift by k bits block of k bits a 0 b 0 m 0 S i,0 a 1 aa 2 b 1 b 2 m 1 M m 2 a 3 a n b 3 b n m 3 m n S i,1 S i,2 i,j S i,3 S i,n i,4 // n is number of rounds for i = 0 to n do q i = S i (mod 2 k ) scalar multiplication 0 S i+1 = S i /2 k + q i M + b i A Return S n+1 accumulation

16 Multiplication: Parallel vs. Sequential Parallel approach: Use b DSPs to multiply two b*17 bit values (b+1) * 6 cycles Sequential approach: Use 3 DSPs to multiply two b*17 bit values 6 + (b+2)*b cycles Example: b=10 7 DSPs less at cost of 60 clock cycles DSP cycles

17 Implemented Modular Multiplication II Multiplication of two 17*b bit numbers: b * (b + 1) b clock cycles Requirements: 5 b; (b+1)-th block = 0 Memory alignment: 5 b 15 Virtex-4 Flipflops 131 Slices 73 4-Input LUTs 83 DSP 3 Cycles b*(b + 2) + 6 Frequency 400 MHz

18 Outline Introduction Arithmetic using modern s ECM System Layout Results Conclusions & Future Work

19 COPACOBANA: Original Architecture host backplane USB DIMM Module 1 controller card DIMM Module 2 Controller DIMM Module 20 l bus data address Backplane with plug-in slots can host up to 20 DIMM-sized modules 6 x low-cost Xilinx Spartan-3 s (XC3S1000) per module Shared 64-bit data and 16-bit address connection on backplane (bi-directional) Controller connects PC with s in a slow Master-Slave scheme (3 MBit/s)

20 ECM System on COPACOBANA Backplane Gigabit Ethernet Control modul Control DIMM modul 1 Virtex-4 DIMM modul 2 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4... DIMM modul 16 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Virtex-4 Data Address Modified COPACOBANA, planned 2006 Working with Virtex-4 SX35 s: 192 DSPs, 192 BRAM Data transfer using Ethernet Host Computer: performs (simple) pre-computations, i.e., curve generation COBACOBANA: performs cost intensive operations, i.e., elliptic curve arithmetic

21 ECM System Architecture

22 Elliptic Curve Processor

23 Workspace Memory Cell = 16 times 17-bit = 272 bit Block = 2 Cells = one Point Addressed by instruction ROM Fast in/out for arithmetic Units Needs 1 BlockRAM resource

24 Instructions 1 BlockRAM core using both ports 54 bits per instruction ( bits) 326 instructions for Phase 1 + Phase 2 Space for more

25 Outline Introduction Arithmetic using modern s ECM System Layout Results Conclusions & Future Work

26 Implementation of Phase Pre-computation using Addition Chain Reduce expensive Pt. Adds Use cheap Pt. Dbl No expensive Pt. Dbl + Add Gaj et al. This work 0 Pt. Dbl Pt. Add (D+A)* Improved/Fixed main computation Scanning algorithm Considering another special case

27 Scalability Related work - target bit-length? Scalable system Fixed layout Exchangeable ROM

28 Timing details Operation Cycles Factor Frequency (MHz) Total Cycles Top -> ECM P1 Pre: 2P P1 Step: 2P + (P+Q) ,312 P2 Pre: P+Q ,030 P2 Pre: 2P ,038 P2: Start Phase P2: Init Round ,480 P2: " " ,784 P2: " " ,226 P2: P+Q ,266 P2: End Phase sync results ECM -> Top Complete Run 1,969,878 Data transfer Phase 1 Phase 2 Data transfer Results for ECM Phase (b = 10, B1 = 960 and B2 = 57000)

29 Results for ECM Phase type Virtex-4 SX 35 Target bit-length 151 Number of ECM units 24 Maximal clock frequency 200 Number of instructions for ECM Phase ,969,878 Time for ECM Phase ms Number of ECM calculations per second 2424 ECM Phase using B1 = 960, B2 = [1376 bit k] 310,272 ECM calculations / second on COPACOBANA 1 1 Optimal COPACOBANA using 16 modules (= 128 )

implementations Metric: resource utilization?

30 Comparison Apple equals Orange? Different Families Different approaches Incomplete information (speed grade, packaging,..) Synthesis vs Place-and-Route results compare only Virtex 4 implementations Metric: resource utilization? Impossible without designs Let s This normalize does not work! then! Metric: cost-performance? Varing pricing models (EasyPath, volume discounts) Highly non-linear prices (V4LX200 vs V4LX35)

31 Comparison (ECM Phase 1) 198 bits vs 202 bits Arithmetic: Fabric vs. DSPs Improvement by factor 2.24 resp. 37

32 Comparison (ECM Phase 1) 135 bits vs 134 bits Fully pipelined vs non-pipelined DSP Arithmetic: Mult vs Mult+Add Phase 1 vs Phase 1+2: use ~59x parameter size

33 COPACOBANA v4 Results COPACOBANA v4: ECM/s using multiple s COPACOBANA v4: Efficiency using multiple s 80% % 60% % % 30% 20% Expected Achieved 10% 0% Efficiency Bottleneck: COPACOBANA v4 Interface

34 COPACOBANA v4 Bottleneck Bus Architecture Slow single master Bus no interrupts No parallel read/write operations Controller No fast Ethernet No PCI-Express USB with ~40 Mbps Processor Solution: Rivyera Slow CPU Additional host Power consumption, Network latency,

35 Outline Introduction Arithmetic using modern s Implementation ECM System Layout Conclusions & Future Work

36 Conclusions Novel and complete implementation Fastest implementation of phase on s Block parameter b = 10 Place & route results 2nd Phase Generic and scalable system Supports bit integers Exchange instruction ROM Only small changes in fabric/resources Highly parallel architecture for co-factorization Multiple ECM-units per (24 units on Virtex-4 SX 35) Sadly: COPACOBANA cluster (max 128 Virtex-4 SX 35) not suitable

Future Work Improve Phase 1 Sliding Window algorithm Enough memory available Modify the instruction ROM Improvement for extensive Phase 1 Evaluate different technologies: Virtex-4

37 Future Work Improve Phase 1 Sliding Window algorithm Enough memory available Modify the instruction ROM Improvement for extensive Phase 1 Evaluate different technologies: Virtex-4 SX $ 192 DSPs Virtex-6 LX130T 1000 $ 480 DSPs (up to 600 MHz) Spartan-6 LX $ 180 DSPs (up to 250 MHz) COPACOBANA v4 RIVYERA Fast data bus Two ring networks Intel i7 CPU

38 Thank you for your attention! Any Questions? Ralf Zimmermann, Tim Güneysu, Christof Paar

Enhancing COPACOBANA for Advanced Applications in Cryptography and Cryptanalysis

Enhancing COPACOBANA for Advanced Applications in Cryptography and Cryptanalysis Tim Güneysu, Christof Paar, Gerd Pfeiffer, Manfred Schimmler Horst Görtz Institute for IT Security, Ruhr University Bochum,