Reconfigurable Hardware Implementation of Mesh Routing in the Number Field Sieve Factorization

Size: px

Start display at page:

Download "Reconfigurable Hardware Implementation of Mesh Routing in the Number Field Sieve Factorization"

Avice Jones
5 years ago
Views:

1 Reconfigurable Hardware Implementation of Mesh Routing in the Number Field Sieve Factorization Sashisu Bajracharya, Deapesh Misra, Kris Gaj George Mason University Tarek El-Ghazawi The George Washington University

2 Objective & Outline N = P Q factorization P Q P, Q large integers..... FPGA Array. Number Field Sieve (NFS) Factorization 2. Mesh Routing Architecture for the Matrix Step of NFS 3. Implementation. Results 5. Conclusions 6. Reconfigurable computers & future work

3 Number Field Sieve (NFS) Steps Polynomial Selection Sieving (Relation Collection) Matrix (Linear Algebra) Two most computationally intensive steps Square Root

4 Previous work 2 D. J. Bernstein, Circuits for integer factorization: a proposal Mesh approach to the sieving and matrix steps improves asymptotic complexity for NFS performance 22 A. K. Lenstra, A. Shamir, J. Tomlinson, E. Tromer, Analysis of Bernstein's Factorization Circuit, Asiacrypt 22 Detailed design for mesh routing 23 W. Geiselmann, R. Steinwandt, Hardware to solve sparse systems of linear equations over GF(2), CHES 23 Distributing computations among multiple nodes

5 Our Objective Design, describe in RTL VHDL, synthesize & simulate existing theoretical designs for the matrix step of NFS using current generation of FPGA devices. Matrix (Linear Algebra) Focus of this paper

6 Input to the Matrix Step D D Matrix A: D = number of the matrix columns and rows D [ 7, ] d column density (weight) = maximum number of ones per column d << D, e.g., d= for D=

7 Function of the Matrix Step Find linear dependency in the large sparse matrix obtained after sieving step c i c i2 c il c i c i2... c il = D = number of the matrix columns and rows D D

8 Block Wiedemann Algorithm for the Matrix Step of NFS ) Uses multiple matrix-by-vector multiplications of the sparse matrix A with k random vectors v i A v i, A 2 v i,., A k v i where k = 2D/K 2) Post computations leading to the determination of the linear dependence of the matrix columns Most time consuming operation: A [DxD] v [Dx]

9 Mesh Routing Architecture for the Matrix Step

10 Matrix-by-Vector Multiplication v T A v A Sparse Matrix

11 Matrix-by-Vector Multiplication for Sparse Matrices v T A v A Sparse Matrix

12 Mesh Routing m x m mesh where m = D A v T representation of column d = maximum number of non-zero entries per column m D D m

13 Mesh corresponding to the matrix A and vector v Cell representing Column Row indices of non-zero 3 matrix entries Vector bit

14 Routing in the Mesh Fourth cell Each time a packet arrives at the target cell, the packet s vector s bit is xored with the partial result bit on the target cell

15 Packets generated by each cell in the mesh Destination address Cell representing Column Vector bit

16 Only packets with non-zero vector bits need to be routed

17 Mesh Routing with K parallel matrix by vector multiplications Example for K=2 A v and A v 2 computed in parallel A v v mesh

18 Mesh corresponding to the matrix A and vectors v, v 2 Cell representing Column Row indices of non-zero 3 matrix entries 6 7 Vector bits v, v

19 Packets with multiple vector bits generated by each cell in the mesh Cell representing Column Destination address Vector bits v, v

20 Mesh Routing with p multiple columns and K vectors Example for p=2, K=2 A v and A v 2 computed in parallel v v 2 A p columns

21 Mesh with multiple columns and multiple vectors per cell Cell representing Columns,

22 Packets generated by the mesh with multiple columns and multiple vectors per cell Cell representing Columns,

23 Clockwise Transposition Routing Four iterations repeated Cells compareexchange direction

24 Compare-Exchange Cases Between Columns Direction of travel = direction of exchange Result of comparison = Exchange the packets.

25 Compare-Exchange Cases Between Columns Direction of travel direction of exchange Result of comparison = Exchange the packets. Rule for Exchange: Exchange iff the distance to target of the packet which is farthest from its destination gets reduced.

26 Implementation

27 Structure of the Xilinx Virtex FPGA Configurable Logic Block (CLB) slices Block RAMs Block RAMs I/O Blocks Block RAMs

28 Two modes of operation of CLB slices Logic mode Memory mode 8 Combinational logic -bit register -bit register RAM 6x RAM 6x -bit register -bit register CLB slice

29 Target FPGA Device Xilinx Virtex II XC2V8 6,592 CLB slices 93,8 LUTs LookUpTables 93,8 FF (Flip-Flop) Multipliers 8 x 8 Block RAMs Multipliers 8 x 8 Block RAMs Multipliers 8 x 8 Block RAMs FF Carry & Control Logic O LUT CLB slice Carry & Control Logic LUT FF CLB-SLICE CLK Multipliers 8 x 8 Block RAMs I/O Block

30 Mesh parameters for single FPGA m () = 2 m () = m () = 2 p = 6 D () = (m () ) 2 * p = = (2) 2 * 6 = 23 K =5 33 3

31 Synthesis Results on one Virtex II XC2V8 for Improved Mesh Routing Design Matrix Size K CLB slices 23x23 (Mesh 2x2, p=6 ) 6738 (%) LUTs,38 (%) FFs 6,279 (7%) Clock Period (ns) Time for K mult (ns) Time per mult (ns) x23 (Mesh 2x2, p=6 ) 32 29,938 (6%) 5,983 (5%) 9,65 (2%) x23 (Mesh 2x2, p=6 ) 5 3,2 (93%) 7,3 (89%) 27,6 (29%) f CLK-ROUTE = 55.5 MHz T CLK-ROUTE = 8 ns

32 Distributed Computation (Geiselmann, Steinwandt, CHES 23) A v Av A, A,2 A,3 v A A 2, A 2,2 A 2,3 A 3, A 3,2 A 3,3 v 2 v 3 = A 2 A A A, v A v,2 2 A,3 v 3 = A v = s j=, j s A : A j= s, j v v j j

33 Using smaller FPGA arrays to perform the entire computation ) FPGA array performs single sub-matrix by sub-vector multiplication 2) Reuse FPGA array for next sub-computation

34 Parallel Loading & Unloading of Data Vector Non Zero Matrix Entries Result Vector

35 Sub matrix load-compute sequence A, A,2 A,3 A 2, A 2,2 A 2,3 A 3, A 3,2 A 3,3 v v 2 = v 3 A A 2 A 3 A, v + A,2 v 2 + A,3 v 3 = A Time load compute load compute load compute unload load compute load compute load compute unload load compute load compute load compute unload

36 Size of the mesh implemented using f FPGAs Single FPGA m (f) m () FPGA m () Matrix of f FPGAs #cells (f) = #cells () f (m (f) ) 2 = (m () ) 2 f m (f) = m () f m (f)

37 Size of the matrix handled using f FPGAs D () Sub-matrix handled by one FPGA D () A () D (f) A (f) D (f) A D D () = (m () ) 2 * p D (f) = D () * f d (f) = d* D (f) /D Sub-matrix handled by f FPGAs Entire matrix A D d column density of A= maximum number of ones per column of A d (f) column density of A (f)

38 Sizes and the column densities of submatrices handled using f FPGAs D () D =,d= D () A () A (f) D (f) f D (f) d (f) 2,3 D (f) A D 23, ,82 2 2,359,296, 23,, D

39 Mesh Cell Design for Improved Mesh Routing CU matrix data vector data Loading packet matrix data vector/result data annihilate Current packet new packet exchange result status bits Result calculation r c coordinate exchange annihilate row/col Comparator oper eq_packet

40 Matrix Data Status Matrix data Loading Address Routing Address st r L c L r R c R Vector data K k (f) k (f) + k p k(f) k (f) + k p f (#FPGAs) matrix data size (bits) Matrix Data Size = + k (f) + 2 k p = + log 2 m (f) + 2 log 2 p

41 Format of the Packet status row destination column destination local result position st r c l vector bits k (f) k (f) k p K v k (f) = log 2 m (f) k p = log 2 p m (f) = m () f K = 5 # FPGAs packet size

42 Loading Unit Memory of packet destination addresses R[i] p d en decode matrix data en P[i] K p matrix data Memory of vector data vector data comb vector/result data result packet

43 Current Packet Unit packet annihilate annih new packet CP exchange current packet

44 Result Calculation result Memory of the result vector bits P [i] en addr2 Check Dest new packet vector bits new packet excluding vector bits cell coordinate

45 Resource Proportion for LUT Usage.3 RAM Mesh of 2x2 P=6 K= logic %LUT RAM % LUT logic

46 T CLK-IO T CLK-IO T CLK-IO T CLK-IO T CLK-ROUTE = 3 * T CLK-IO m () * packet size 2 2 m () * packet size x x T CLK-ROUTE = 8 ns T CLK-IO = 6 ns m () * packet size 2 2 m () * packet size x x FPGA boundary

47 Slowdown caused by crossing the chips boundaries #bits crossing the boundary x = multiplexing factor = = # of pins per boundary = m () *packet size (f) * 2 2* 73 *2 = (#FPGA_pins / ) 277 = 7 for f=2 Slowdown factor = T CLK-IO + (x-) T CLK-IO + T CLK-ROUTE T CLK-ROUTE T CLK-IO + T CLK-ROUTE = =.33 T CLK-ROUTE

48 Results and Analysis

49 Results for a 52-bit number N K= number of concurrent multiplications=5 p =number of columns handled in one cell D = number of columns in matrix A = 6.7 x 6 d (f) = density of sub-matrix handled by mesh m (f) = mesh dimension n = number of times to repeat sub-multiplications T route = time for K multiplications in the mesh T Load = time for loading and unloading for K multiplications T Total = total time for the Matrix Step = 3 (D/K) n ( T route + T Load ) R = (#FPGAs * T Total )/ ( FPGA * T Total for FPGA) Virtex II chips (f) D p d (f) m (f) n 6.7x x 6 T route (ns) T Load (ns) T Total (days) R 2 6.7x x 5 6. x x x 6.3x x x 6.3 x

50 Results for a 2-bit number N K= number of concurrent multiplications=5 p =number of columns handled in one cell D = number of columns in matrix A = d (f) = density of sub-matrix handled by mesh m (f) = mesh dimension n = number of times to repeat sub-multiplications T route = time for K multiplications in the mesh T Load = time for loading and unloading for K multiplications T Total = total time for the Matrix Step = 3 (D/K) n ( T route + T Load ) R = (#FPGAs * T Total )/ ( FPGA * T Total for FPGA) Virtex II chips (f) D p d (f) m (f) n T route (ns) T Load (ns) T Total (days) R x 3 3,59 3,568.9x x 9.8 x 5.5 x 6.9x x x x.8x x 7.7 x 6.6 x x 8.2

51 Time vs. D Time vs D 2 5 fpga fpgas 256 fpgas 2 fpgas fpgas Time (days) 5 year 6 2, years Matrix Dimension (cols) fpgas fpga fpgas 256 fpgas 2 fpgas fpgas fpgas

52 Time vs. #FPGAs 2 Time (#days) 8 6 D = ~ (#FPGA) 3/ #FPGAs

53 Cost*Time vs. #FPGAs D = #FPGAs * #days 2 $2 bln$year #FPGAs

54 Conclusions

55 Summary & Conclusions () First practical hardware implementation of Mesh Routing for the Number Field Sieve implemented and verified using timing simulation Practical numbers, based on post-placing & routing static timing analysis, obtained for an array of Xilinx Virtex II 8 FPGAs A two-dimensional array of Virtex II chips can perform computations faster than a single FPGA by a factor approximately proportional to (number of FPGAs) 3/2

56 Summary & Conclusions (2) Matrix step for a 52-bit RSA key takes about 5 days on the rectangular array of FPGAs Assuming matrix size D= matrix step for a 2-bit RSA key appears to be prohibitive from the point of view of full (throughput) cost

57 Mapping designs to existing reconfigurable platforms..... Generic Array of FPGAs Existing General Purpose Reconfigurable SuperComputers

58 What is a reconfigurable computer? Microprocessor system FPGA system µp... µp FPGA... FPGA µp memory... µp memory FPGA memory... FPGA memory I/O Interface Interface I/O

59 Example: SRC 6E System Source: [SRC, MAPLD]

60 SRC MAP Reconfigurable Processor Source: [SRC, MAPLD]

61 SRC Hi-Bar TM Based Systems Hi-Bar sustains. GB/s per port with 8 ns latency per tier Up to 256 input and 256 output ports with two tiers of switch Common Memory (CM) has controller with DMA capability Controller can perform other functions such as scatter/gather Up to 8 GB DDR SDRAM supported per CM node SRC Hi-Bar Switch SNAP Memory SNAP Memory MAP MAP Common Memory Common Memory µp PCI-X Gig Ethernet etc. µp PCI-X Chaining GPIO SRC-6 Disk Storage Area Network Local Area Network Customers Existing Networks Wide Area Network Source: [SRC, MAPLD]

62 SRC Programming SRC µp system FPGA system HDL (VHDL) HLL (C) Library Developer Application Programmer

63 Other reconfigurable supercomputers Cray XD (formerly Octiga Bay 2 K) from Cray Inc. SGI Altix 3 from Silicon Graphics Star Bridge Hypercomputer from Star Bridge Systems

64 Advantages of reconfigurable computers general-purpose: cost distributed among multiple users with different needs behave like hardware: - parallel processing - distributed memory - specialized functional units, etc. can be programmed by mathematicians themselves using traditional programming languages or GUI environments encourage innovation and experimentation

65 Our future goal Polynomial Selection Sieving Next target Matrix (Linear Algebra) Square Root

66 Questions?

67 Backup slides

68 Sources of optimism - emergence of new companies supporting reconfigurable supercomputing, including major players in the area of traditional supercomputing, such as Cray Inc. and SGI - constant progress in the capabilities, performance, and flexibility of existing reconfigurable computing platforms - constant progress in the compiler technology, and logic synthesis of high level programming languages.

69 Parameters f = number of FPGAs in the FPGA array m () = mesh dimension for one FPGA = 2 m (f) = mesh dimension for f FPGAs = m () f p = number of columns of matrix handled in one cell of the mesh D (f) = matrix dimension handled by f FPGAs = (m (f) ) 2 p = ( m () f) 2 p = f (m () ) 2 p = f D ()

70 Parameters D D () D (f) = number of columns in matrix A = number of columns of sub-matrix of A handled by one FPGA in the mesh = (m () ) 2 p = number of columns of sub-matrix of A handled by f FPGAs in the mesh = (m (f) ) 2 p = (m () ) 2 f p = D () f d = column density of matrix A d () = density of sub-matrix handled by one FPGA = d D () / D d (f) = density of sub-matrix handled by f FPGAs = d D (f) / D

71 Routing Parameters T CLK_mult = multiplication clock period T CLK_IO = IO clock period x = bits to exchange between FPGAs / (bus size between FPGAs) = 2 ( + 2 k (f) +k p + K) (m () ) 2 )/ 277 T step = Total time needed for transfer of packet between cells across FPGAs = T CLK_IO + (x-) T CLK_IO + T CLK_mult h c = slowdown factor due to limited inter-fpga IO connections = T step / T CLK_mult T routne = routing time for sub-multiplication in the mesh = #entries per cell #steps T CLK_mult h c = p d (f) m (f) T CLK_mult h c

72 Loading Unloading Parameters b () = #pins for data transfer for FPGAs = #maximum FPGA IO/ 2 b (f) = #pins for data transfer for f FPGAs = (#maximum FPGA IO /) f s IO = clock stages between two FPGA connections T = CLK_load loading clock period T load = time for loading and unloading for a sub-multiplication = [( (#matrix entries bits) + (#vector bits to load) + (#vector bits to unload) )/ b (f) + s IO (m (f) -) m (f) * loading packet bits/b (f) ] T CLK_load = (( + k ( f ) + 2 k p ( ) d D f ) ( ) + ( K D b ( f ) f ) ( ) + K D f ) D ( f ) / D + s IO (m (f) -)m (f) ( + k ( f ) + 2 k p + K)/b (f) T CLK_load

73 Parameters n = number of times to repeat sub-multiplications = D 2 /(D (f) ) 2 = D 2 /((m (f) ) 2 p) 2 T Total = total time for a Matrix step = 3 D/K n ( T route + T load )

74 D (f) d (f) entries p columns d (f) p = d p f (m () ) 2 /D p threshold area on FPGA total of d (f) p entries

75 D (f) p entries p m () cells m () cells

76 Total time of routing Total routing takes maximum d m compare-exchange operations, where d matrix density = maximum number of non-zero entries per column m mesh size = D, where D is the matrix size

Implementation of Elliptic Curve Cryptosystems over GF(2 n ) in Optimal Normal Basis on a Reconfigurable Computer

Implementation of Elliptic Curve Cryptosystems over GF(2 n ) in Optimal Normal Basis on a Reconfigurable Computer Sashisu Bajracharya, Chang Shu, Kris Gaj George Mason University Tarek El-Ghazawi The George