EAGL: an Elliptic Curve Arithmetic GPU-based Library for Bilinear Pairing


1 EAGL: an Elliptic Curve Arithmetic GPU-based Library for Bilinear Pairing
Shi (Chris) Pu, Jyh-charn (Steve) Liu
Department of Computer Science and Engineering, Texas A&M University

2 Outline
- Introduction
- Objective and results
- Computing Model and Evaluation of Existing Optimization Techniques
- Key Findings
- EAGL: an Elliptic Curve Arithmetic GPU-based Library for Bilinear Pairing
- Benchmarks
- Identification of Computing Bottlenecks
- Conclusion

3 Introduction
- Bilinear pairings have been widely researched for a broad range of crypto protocols and secure applications, e.g. key agreement [1][2], identity-based encryption [3][4], BLS signature verification [5], and secret handshakes [6].
- Some of these protocols are well suited for decentralized Internet-scale applications.
  - One example [7] of this case is given in the next slide.

4 Privacy-preserving Health-care Cloud
- High throughput and low response time for serving requests in an emergency response situation

5 Computing Performance for Pairings
At the 128-bit security level:
- Naehrig et al. [8]: 669 pairings/sec on a single core of an Intel Q9550 CPU (2.83GHz) (2010)
- Beuchat et al. [9]: 1202 pairings/sec on a single core of an Intel i7 CPU (2.8GHz) (2010)
- Aranha et al. [10]: 1658 pairings/sec on a single core of an Intel i5 CPU (2.8GHz) (2011)
- Mitsunari et al. [11]: 2051 pairings/sec on a single core of an Intel Core i7-4700mq CPU (2.4GHz) (2013)
Reduction of computational complexity
Parallelization of pairings

6 Parallel Computer Architectures
- Multi-core-based clusters: well established
  - Continual increase of the core count will offer good computing resources for pairings
- GPU clusters with hundreds of GPUs: emerging
  - Only public computation of pairings (inputs are either encrypted or public) on GPU
  - Previous studies on point multiplication [12, 13, 14] and bilinear pairings [15, 16] reported performance inferior to that of their multi-core counterparts.

7 Motivation & Results
Motivation: Understand the relationship between the computational structures of pairings and the single instruction multi-thread (SIMT) parallel execution model of the GPU.
Results: Elliptic curve Arithmetic GPU-based Library (EAGL)
- Compute Unified Device Architecture (CUDA) programming model
- Parallelizes Miller's algorithm for the R-ate pairing [17] at the 128-bit security level
- Validated by MIRACL [18]
- Performance: 1408 R-ate pairings in ... ms, or ... pairings/sec as the amortized throughput
Major performance bottlenecks for pairings on the GTX-680 device:
- Fast/slow memory allocated for intermediate results of pairings
- GPU pipeline utilization
- Proper use of the state-of-the-art pairing optimization techniques

8 Highlight of Kepler GPU
- Nvidia Kepler GPU released in ...
- One GPU contains multiple SMXs (streaming multiprocessor extreme); each SMX
  - works as a computing unit
  - processes 32 threads, called a warp, per clock
  - simultaneously runs multiple warps of threads for better utilization of the pipeline inside the SMX
- SIMT: Single Instruction Multi-Thread
  - threads are stalled and synchronization is inserted when the threads cannot all execute the same instruction, e.g., in if-else branches
- Kernel function: the program run by GPU threads

9 GPU Memory Hierarchy
- 64K 32-bit registers and 64KB on-chip shared memory/L1 cache
- ...-bit shared memory banks, one memory interface per bank
  - the interface competition reduces the actual degree of parallelism
- 32 shared memory LD/ST units
  - shared memory interface competition occurs only within one warp of GPU threads

10 Major Design Factors
Operations of ECC are multi-precision integer arithmetic functions
- a[n] × b[n] mod q[n], with n = 8 for our BN curve at the 128-bit security level
  - Similar for + and -
- a[8] represents an integer composed of 8 32-bit integers
Major design factors of the multi-precision integer arithmetic
1. Data storage format
   - Using GPU integer or floating point instruction sets?
2. Number system
   - Residue number system (RNS) or conventional Montgomery? Computational complexity, parallelization suitability
3. Dependency limitation
   - Variable reading/writing race condition
4. Resource competition
   - Shared memory interfaces

11 Number systems option 1: Conventional Montgomery
Modular multiplications c[n] = a[n] × b[n] mod q[n] are selected as the representative code segment of the pairing computation.
Option 1. Conventional Montgomery
- A conversion from (x mod q) to (x R^-1 mod q)
- c = a × b mod q
Complexity (with a single thread per instance)
- T[2n] = a[n] × b[n]: n^2 INT32 multiplications (MUL)
- reduction(T) = n^2 + n MUL
- Overall: 2n^2 + n MUL
T = a × b; c = reduction(T) {
  m = (T mod R) × q' mod R;
  t = (T + m × q) / R;
  return (t ≥ q) ? (t − q) : t; }
where R > q, R is a power of 2 and co-prime to q; R^-1 and q' satisfy R × R^-1 − q × q' = 1.
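The reduction pseudocode on this slide can be checked with a short Python sketch. It uses Python's arbitrary-precision integers rather than 8 × 32-bit limbs, and the modulus is an arbitrary 256-bit odd number picked for illustration (an assumption, not the BN-curve prime used by EAGL). The helper names `to_mont` and `mont_mul` are hypothetical.

```python
# Sketch of conventional Montgomery multiplication, following reduction(T).
N_LIMBS = 8                      # n = 8 limbs of 32 bits each
R = 1 << (32 * N_LIMBS)          # R = 2^256: a power of two with R > q
q = (1 << 255) - 19              # illustrative odd modulus (assumption)

# q' satisfies R*R^-1 - q*q' = 1, i.e. q' = -q^-1 mod R.
q_prime = (-pow(q, -1, R)) % R


def reduction(T):
    """Return T * R^-1 mod q for 0 <= T < q*R."""
    m = ((T % R) * q_prime) % R
    t = (T + m * q) // R         # exact: T + m*q is a multiple of R
    return t - q if t >= q else t


def to_mont(x):
    """Map x to its Montgomery form x*R mod q (hypothetical helper)."""
    return (x * R) % q


def mont_mul(a_m, b_m):
    """Multiply two values in Montgomery form; the result stays in that form."""
    return reduction(a_m * b_m)


a, b = 0x1234567890ABCDEF, 0xFEDCBA0987654321
# One extra reduction leaves Montgomery form and yields a*b mod q.
c = reduction(mont_mul(to_mont(a), to_mont(b)))
```

Because R is a power of two, the `mod R` and `/ R` steps are just limb selection and shifting, which is why the cost is counted only in MUL operations.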

12 Option 2. RNS-based Montgomery [12]
- Based on the Chinese Remainder Theorem, with the modulus M = m_1 × m_2 × ... × m_n, where each m_i is prime.
- SIMT-friendly: an arbitrary integer X < M is represented by independent residues x_i = X mod m_i, 1 ≤ i ≤ n
  - a[n] × b[n] in RNS can be fully parallelized
- But RNS-based integers cannot be directly used in pairings because M is not prime, unless two Base Extensions (BE) are inserted in each reduction.
Complexity (with t threads per instance)
- n^2/t (for a[n] × b[n]) + 2n^2 (for BE) + 3n (other parts in reduction) MUL
- When (n = 8, t = 4): 2n + 2n^2 + 3n = 2n^2 + 5n MUL
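A minimal Python sketch of the RNS idea follows: each residue is multiplied independently, which is what makes the representation SIMT-friendly. The four small prime moduli are assumptions chosen for readability (a real base would use moduli of about 32 bits), and the base extensions needed for reduction modulo q are not shown.

```python
from math import prod

# Illustrative RNS base of pairwise co-prime moduli (assumption).
moduli = [251, 241, 239, 233]
M = prod(moduli)


def to_rns(X):
    """Residues x_i = X mod m_i."""
    return [X % m for m in moduli]


def rns_mul(xs, ys):
    """Channel-wise product; every channel is independent of the others."""
    return [(x * y) % m for x, y, m in zip(xs, ys, moduli)]


def from_rns(xs):
    """Chinese Remainder Theorem reconstruction of X mod M."""
    X = 0
    for x, m in zip(xs, moduli):
        Mi = M // m
        X += x * Mi * pow(Mi, -1, m)
    return X % M


A, B = 12345, 6789               # A * B < M, so the product is exact
product = from_rns(rns_mul(to_rns(A), to_rns(B)))
```

The result is only the product modulo M, so it equals A × B as long as A × B < M. Getting a result modulo a prime q is what requires the two base extensions counted in the complexity above.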


13 Lazy Reduction and Number Systems
- (a × b) mod q requires one multi-precision multiplication and one reduction.
- Lazy reduction replaces (a × b) mod q + (c × d) mod q with (a × b + c × d) mod q
  - Two multi-precision multiplications and one reduction
- A general format for lazy reduction to the highest extension field: (a × b + c × d + e × f + ...) mod q
When the multi-precision multiplications are fully parallelized by t parallel threads:
- Based on the RNS number system
  - Computational complexity: (n^2 + n^2 + n^2 + ...)/t + (2n^2 + 3n) MUL
- Based on the conventional Montgomery
  - Computational complexity: (n^2 + n^2 + n^2 + ...)/t + (n^2 + n) MUL
  - A cheaper option
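The saving from lazy reduction can be sketched in Python with a Montgomery reduction. The modulus is an illustrative 256-bit odd number (an assumption, not EAGL's prime), and the function names `eager` and `lazy` are hypothetical. Each function returns the result together with the number of full reductions it performed.

```python
R = 1 << 256
q = (1 << 255) - 19              # illustrative odd modulus with 2q < R (assumption)
q_prime = (-pow(q, -1, R)) % R   # q' = -q^-1 mod R


def reduction(T):
    """Montgomery reduction: T * R^-1 mod q for 0 <= T < q*R."""
    m = ((T % R) * q_prime) % R
    t = (T + m * q) // R
    return t - q if t >= q else t


def eager(a, b, c, d):
    """(a*b) mod q + (c*d) mod q: two multiplications, two reductions."""
    s = reduction(a * b) + reduction(c * d)
    return (s - q if s >= q else s), 2


def lazy(a, b, c, d):
    """(a*b + c*d) mod q: two multiplications, one reduction.

    Valid because a*b + c*d < 2*q^2 < q*R for this choice of q and R.
    """
    return reduction(a * b + c * d), 1


a, b, c, d = 3**150 % q, 5**100 % q, 7**80 % q, 11**70 % q
eager_value, eager_reductions = eager(a, b, c, d)
lazy_value, lazy_reductions = lazy(a, b, c, d)
```

The lazy variant needs the double-width sum to stay below q × R, which is one reason the intermediate results grow when lazy reduction is pushed to higher extension fields.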

14 8 Parallel Multiplication (PMU) Sequences in One Multi-precision Multiplication
T[16] = a[8] × b[8] can be considered as 8 PMU sequences of T[x] = a[i] × b[j = 0~7], 0 ≤ i ≤ 7, with x = i + j.
In each PMU sequence:
  T[x] = (low 32-bit)(a[i] × b[j = 0,1,...,7] + T[x] + carry);
  carry = (high 32-bit)(a[i] × b[j = 0,1,...,7] + T[x] + carry) if j > 0
The inter-dependency among the 8 PMU sequences is the reading/writing (R/W) of T[0-15].
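The PMU recurrence can be written out as a serial Python sketch over 32-bit limbs. This runs the 8 sequences one after another; it illustrates the low/high 32-bit split and the shared accumulator T, not the GPU scheduling. The helper names are hypothetical.

```python
MASK = 0xFFFFFFFF                # low 32 bits


def to_limbs(x, n):
    """Split x into n little-endian 32-bit limbs."""
    return [(x >> (32 * i)) & MASK for i in range(n)]


def from_limbs(limbs):
    """Recombine little-endian 32-bit limbs into an integer."""
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))


def multiply(a, b):
    """T[2n] = a[n] * b[n] as n PMU sequences, one per limb a[i]."""
    n = len(a)
    T = [0] * (2 * n)
    for i in range(n):                       # PMU sequence i
        carry = 0
        for j in range(n):
            acc = a[i] * b[j] + T[i + j] + carry   # always fits in 64 bits
            T[i + j] = acc & MASK            # low 32 bits
            carry = acc >> 32                # high 32 bits
        T[i + n] = carry                     # top limb of this sequence
    return T


x = (1 << 256) - 189
y = 0x0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF
T = multiply(to_limbs(x, 8), to_limbs(y, 8))
```

Every sequence reads and writes overlapping entries of T, which is the inter-dependency the slide refers to.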

15 Eliminating the Race Condition of R/W T[x]
Assuming T[0-15] is in the shared memory, R/W of T[x] is race-free when
(T[0-15] is partitioned into multiple segments with a constant size) && (each segment is mapped to one thread) && (the R/W address of each thread has a constant offset which is large enough)

16 Parallel Computing Model Candidates
- CI_1/2/4thread model: using 1/2/4 GPU thread(s) to complete one computing instance (CI) of the multi-precision arithmetic operation a[8] × b[8] (as well as +, -)
- Next two slides: two types of shared memory bank conflicts are eliminated in the CI_1/2/4thread models
- The CI_8thread model does not meet the race-free requirement mentioned earlier

17 Elimination of Bank Conflict (CI_2thread model)
- Storage of a 256-bit variable a[8] for a block of GPU threads
  - Each CI owns one a[8]. One block of GPU threads owns a number of a[8]s.
- Their placement in the shared memory: a stripe of 64-bit chaff spacers is inserted to ensure that each access A_i in a warp visits a different bank
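The effect of padding on bank conflicts can be illustrated with a small Python model. Everything about the layout here is an assumption made for the sketch: 32 banks that are each 64 bits wide, one a[8] (four 64-bit units) per thread, and one 64-bit spacer after every variable. The slide does not give EAGL's actual placement, so this shows only the general technique.

```python
NUM_BANKS = 32                   # assumed: 32 banks, each 64 bits wide


def max_conflict(stride_units, unit_index, threads=32):
    """Worst-case number of threads in one warp that hit the same bank.

    Thread t reads 64-bit unit `unit_index` of its own variable, and
    variables are placed `stride_units` 64-bit units apart.
    """
    hits = [0] * NUM_BANKS
    for t in range(threads):
        bank = (t * stride_units + unit_index) % NUM_BANKS
        hits[bank] += 1
    return max(hits)


UNITS_PER_VAR = 4                # a[8] = 256 bits = four 64-bit units

# Contiguous placement: stride of 4 units, so many threads share a bank.
unpadded = max(max_conflict(UNITS_PER_VAR, k) for k in range(UNITS_PER_VAR))

# One 64-bit spacer per variable: stride of 5 units, co-prime to 32.
padded = max(max_conflict(UNITS_PER_VAR + 1, k) for k in range(UNITS_PER_VAR))
```

In this model the spacer works because it makes the stride odd, so the 32 threads of a warp land on 32 distinct banks.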

18 Elimination of Another Bank Conflict (CI_2/4thread models)
- Occurs when reading/writing T[0-15] in different CIs
- Eliminated by caching T[0-11] and T[4-15] in registers, and then accumulating T[0-15] serially
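One way to read the T[0-11] / T[4-15] split is that, in the CI_2thread model, one thread multiplies a[0-3] by b[0-7] and the other multiplies a[4-7] by b[0-7], each keeping its 12-limb partial product private before a serial merge. The Python sketch below follows that reading. The work split is an assumption inferred from the index ranges, and the two "threads" simply run one after the other here.

```python
MASK = 0xFFFFFFFF


def to_limbs(x, n):
    return [(x >> (32 * i)) & MASK for i in range(n)]


def from_limbs(limbs):
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))


def partial_product(a_part, b):
    """Schoolbook product of a 4-limb slice and an 8-limb operand: 12 limbs."""
    P = [0] * (len(a_part) + len(b))
    for i, ai in enumerate(a_part):
        carry = 0
        for j, bj in enumerate(b):
            acc = ai * bj + P[i + j] + carry
            P[i + j] = acc & MASK
            carry = acc >> 32
        P[i + len(b)] = carry
    return P


def multiply_two_threads(a, b):
    lo = partial_product(a[:4], b)   # "thread 0": private copy of T[0-11]
    hi = partial_product(a[4:], b)   # "thread 1": private copy of T[4-15]
    T = lo[:4] + [0] * 12            # T[0-3] come from thread 0 alone
    carry = 0
    for x in range(4, 16):           # serial accumulation of the overlap
        acc = (lo[x] if x < 12 else 0) + hi[x - 4] + carry
        T[x] = acc & MASK
        carry = acc >> 32
    return T


x = (1 << 256) - 189
y = 0xFEDCBA9876543210FEDCBA9876543210FEDCBA9876543210FEDCBA9876543210
T = multiply_two_threads(to_limbs(x, 8), to_limbs(y, 8))
```

Each partial product costs n^2/2 MUL, and the final loop is the serial accumulation step that pushes the total slightly above n^2/2.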

19 Result: Fully parallelized eight PMU sequences
- Bank conflict elimination verified by the nvprof profiling tool of CUDA
  - Zero replay cost caused by bank conflicts
- The total cost of multi-precision multiplication in the CI_2thread model is slightly above n^2/2 MUL due to the serial accumulation step of T.

20 Performance of multiplications and reductions in CI-1/2/4thread Models
Pure multi-precision multiplications and reductions invoked in one bilinear pairing.

models      T=a×b    S-reduct  P-reduct  best mul+reduct  threads per SMX  throughput (/sec)
CI-1thread  57.87ms  39.95ms   N/A       97.82ms          ...              ...
CI-2thread  39.40ms  36.71ms   48.73ms   76.11ms          ...              ...
CI-4thread  28.58ms  51.64ms   83.59ms   80.22ms          ...              ...

- Parallelization of T[16] = a[8] × b[8] works
  - Low gain from parallelization as the thread count per CI increases, due to more synchronization needed for accumulating T[i]
- Bisecting the shared memory usage per CI into two threads usually more than doubles the thread count per SMX
  - Thread count: 160 (CI_1), 352 (CI_2), 738 (CI_4); limited by placing complete warps into the SMX

21 Effects of Reduction

models      T=a×b    S-reduct  P-reduct  best mul+reduct  threads per SMX  throughput (/sec)
CI-1thread  57.87ms  39.95ms   N/A       97.82ms          ...              ...
CI-2thread  39.40ms  36.71ms   48.73ms   76.11ms          ...              ...
CI-4thread  28.58ms  51.64ms   83.59ms   80.22ms          ...              ...

- CI-2thread is adopted for EAGL for the best overall performance
- In CI-2thread, parallelized reduction is slower than its serial counterpart
  - The value of T in T + m × q is a copy in the global memory because the addition only reads T once.
  - m = (T mod R) × q' mod R and m × q took 43% of the execution time of reduction

22 More on Lazy Reduction (LR)

23 LR applied to extension fields F_{q^2}, F_{q^6}, F_{q^12} (Aranha et al. [10])
- When applying LR in F_{q^k}, computing a modular multiplication in F_{q^k} needs k reductions.
  - Doubling intermediate results each time
- The size of temporary data in functions of EAGL increases to 1.4 times when LR in F_{q^2} is changed to LR in F_{q^4}
- Explosive growth in data swapping between shared memory and global memory for LR in F_{q^12}
  - A number of double-sized intermediate results in F_{q^4} need to be maintained before a reduction in F_{q^12} occurs.
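For k = 2, the "k reductions" count can be seen in a Python sketch of multiplication in F_{q^2} = F_q[i]/(i^2 + 1). The prime below is a small illustrative one with q ≡ 3 (mod 4) so that i^2 + 1 is irreducible (an assumption, not EAGL's BN prime). Plain `% q` stands in for a full reduction, and the function names are hypothetical.

```python
q = (1 << 31) - 1                # illustrative prime, q % 4 == 3 (assumption)


def fq2_mul_eager(a, b):
    """Reduce every product separately: 4 full reductions."""
    a0, a1 = a
    b0, b1 = b
    t00 = (a0 * b0) % q
    t11 = (a1 * b1) % q
    t01 = (a0 * b1) % q
    t10 = (a1 * b0) % q
    # The final corrections act on single-width values and are cheap.
    return ((t00 - t11) % q, (t01 + t10) % q), 4


def fq2_mul_lazy(a, b):
    """Accumulate double-width products first: 2 full reductions (k = 2)."""
    a0, a1 = a
    b0, b1 = b
    c0 = (a0 * b0 - a1 * b1) % q     # i^2 = -1
    c1 = (a0 * b1 + a1 * b0) % q
    return (c0, c1), 2


a = (123456789, 987654321)
b = (192837465, 564738291)
eager_value, eager_reductions = fq2_mul_eager(a, b)
lazy_value, lazy_reductions = fq2_mul_lazy(a, b)
```

The lazy version holds double-width sums before reducing, which is the source of the larger intermediate results noted above.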

24 Investigation of Lazy Reduction (LR)
Settings: 1. LR in F_{q^2} or F_{q^4} applied; 2. with software-based prefetching

Execution of 1000 modular multiplications in an extension field F_{q^k}:

Optimization choices                Execution time  Threads per SMX  Shared mem per CI  Throughput (/sec)
lazy reduct in F_{q^2}              ... ms          ...              ... bytes          ...
lazy reduct in F_{q^4}              ... ms          ...              ... bytes          ...
prefetch + lazy reduct in F_{q^4}   ... ms          ...              ... bytes          ...
more fast cache assigned:
  lazy reduct in F_{q^4}            233.7ms         ...              ... bytes          ...
  lazy reduct in F_{q^4}            225.2ms         ...              ... bytes          ...

Observations:
1. The increase of slow global memory accesses dominates the reduction of computational complexity.
2. The next-warp prefetching technique brings no noticeable gain, since the next warp has no scheduling priority without hardware support.
3. Marginal improvement of execution time observed. As the size of working sets is spatially optimized, there is less gain from more fast cache once a threshold is passed.

25 Comparison with Existing CPU/GPU-based solutions

26 EAGL vs. GPU-based solutions

Implementations  Algorithm              Curve type     Security  Exec time (ms)  Device
EAGL             R-ate, prime order     ordinary       128-bit   ...             GTX-680
[15]             η_T, prime order       ordinary       128-bit   3.94            C2050
[16]             Tate, composite order  supersingular  80-bit    23.8            M2050

- With 8 SMX × 352 threads (8 × 176 instances) = 1408 instances, the execution time of 1408 R-ate pairings is ... ms, equivalent to ... pairings/sec as the amortized throughput.
- The peak GFLOPS of the GTX-680 is roughly three times that of the M2050/C2050. After normalization by peak GFLOPS, EAGL is roughly 4.4 times faster than [15].

27 Experimental Result (Bilinear Pairing, vs. CPU-based solutions)

Implementations  Algorithm      Device           Core clock  Throughput
EAGL             R-ate pairing  GTX-680          ... MHz     ...
[8]              Ate pairing    Intel Q9550      2.83GHz     ... (est.)
[9]              Ate pairing    Intel i7         2.8GHz      ... (est.)
[10]             Ate pairing    Intel i5         2.8GHz      ... (est.)
[11]             Ate pairing    Intel i7-4700mq  2.4GHz      ... (est.)

- The estimates adopt the perfect acceleration model for CPU-based solutions on multi-core CPUs.
- EAGL has 40% of the throughput of [11].
- Since EAGL consumes few CPU resources, it can work as a scalable bilinear pairing co-processor, while the CPU remains available for other business logic.

28 Major Computing Bottleneck?
- Low-level multi-precision arithmetic functions based on the CI-2thread model, or high-level computation structures of the pairing computation?
- To determine whether the major bottleneck is the former, we cross-compare the EAGL-based point multiplication with other GPU/CPU-based point multiplication solutions
  - Since few GPU-based pairing solutions are available

29 EAGL vs. GPU/CPU-based Point Multiplication Solutions

With GPU:
Implementations   Key size  Throughput (/sec)  Device    Device peak GFLOPS
[13]              224 bit   5895               GTX ...   ...
[12] (RNS-based)  224 bit   9827               GTX ...   ...
EAGL              256 bit   ...                GTX 680   ...

With CPU:
Implementations            Key size  Elliptic curve                     Throughput (/sec)  Device
Optimized GLS method [19]  256 bit   twisted Edwards curves in F_{q^2}  ...                ... GHz AMD Opteron (single core)
MIRACL (our code base)     224 bit   standardized                       ...                ... GHz AMD Phenom II
EAGL                       256 bit   standardized                       ...                GTX 680

- EAGL has 2.76 times higher throughput than [12], which is more than the increase in peak GFLOPS (1.72-fold).
- EAGL has 2.1 times higher throughput than [19].
- Although EAGL uses the same CI-2thread model for these two algorithms, the performance relationships of EAGL vs. its CPU-based counterparts are different!

30 Computing Latency with Unlimited On-chip Memory
- The experiment on multi-precision multiplications and reductions in a pairing simulated an unlimited on-chip memory scenario for 1408 pairings

models      T=a×b    S-reduct  P-reduct  best mul+reduct  threads per SMX  throughput (/sec)
CI-2thread  39.40ms  36.71ms   48.73ms   76.11ms          ...              ...

- A pairing also needs 20k inexpensive multi-precision additions/subtractions, which take less than 15ms if they fully reside in the shared memory
- Theoretical optimal latency: 76ms + 15ms = 91ms << 420ms (the actual latency)
- What contributes to the remaining execution time?

31 The Sizes of Temporary Variables for Point Multiplications and Pairings in EAGL
- In bilinear pairing, the size of temporary variables in the global memory space fluctuates, and is much greater than that in point multiplication
- The sizes of temporary variables in the shared memory space are stable
[Figure legend. Miller Loop: 1 point calculation, 2 line calculation, 3 merge results of steps 1, 2. Final Exponentiation (FE): 4 inversion, 5 rest of FE]

32 An Experiment for Identifying the Major Bottleneck
We assert that a large proportion of execution time is spent in swapping variables between shared memory and global memory, as follows:
1. We extract profiling information of a copy function copy() that copies an element of F_{q^12} in the global memory, and of a powering function that computes x^y, where x is in F_{q^12} and y is in F_q
   - x^y takes 47ms, copy() takes 35μs
   - nvprof shows that x^y triggers nearly 500 times more global memory hits than copy()
   - Therefore, global memory hits in x^y take roughly 17ms ≈ 500 × 35μs, roughly 35% of the execution time of x^y
2. Global memory hits can weigh more if synchronizations are waiting for them, for example, global memory R/W embedded in branch divergence.

33 Conclusion
EAGL for bilinear pairing
- The conventional-Montgomery-based CI-2thread model is more efficient than RNS-based computing models [12] on Kepler GPU
- On Kepler GPU, the trade-off between computational complexity and (on-chip) memory resource is very different from that on CPU
- Bilinear pairing has exceeded the on-chip memory resource limit of Kepler GPU; point multiplication is far from it
Future directions
- Larger on-chip memory and advanced memory prefetching architecture
- Practical design for large scale systems
- Integrated Development Environment (IDE)
- Mobile environments

34 Many Thanks to the Organizers, the Team, and new friends

35 Questions and Discussion

36 References
1. D. Fiore, R. Gennaro, and N.P. Smart, "Constructing Certificateless Encryption and ID-Based Encryption from ID-Based Key Agreement". Pairing 2010, pp. ...
2. K. Yoneyama, "Strongly Secure Two-Pass Attribute-Based Authenticated Key Exchange". Pairing 2010, pp. ...
3. A.D. Caro, V. Iovino, G. Persiano, "Fully Secure Anonymous HIBE and Secret-Key Anonymous IBE with Short Cipher-texts". Pairing 2010, pp. ...
4. L. Wang, L. Wang, M. Mambo, E. Okamoto, "New Identity-Based Proxy Re-encryption Schemes to Prevent Collusion Attacks". Pairing 2010, pp. ...
5. D. Boneh, B. Lynn, H. Shacham, "Short Signatures from the Weil Pairing". Asiacrypt 2001, pp. ...
6. P. Duan, "Oblivious Handshakes and Computing of Shared Secrets: Pairwise Privacy-preserving Protocols for Internet Applications". Ph.D. Dissertation, available at ...

37
7. J. Pecarina, S. Pu, J.C. Liu, "SAPPHIRE: Anonymity for Enhanced Control and Private Collaboration in Healthcare Clouds". CloudCom 2012, pp. ...
8. M. Naehrig, R. Niederhagen, and P. Schwabe, "New Software Speed Records for Cryptographic Pairings". Latincrypt 2010, pp. ...
9. J.L. Beuchat, J.E.G. Diaz, S. Mitsunari, E. Okamoto, F. Rodriguez-Henriquez, and T. Teruya, "High-Speed Software Implementation of the Optimal Ate Pairing over Barreto-Naehrig Curves". Pairing 2010, pp. ...
10. D.F. Aranha, K. Karabina, P. Longa, C.H. Gebotys, J. Lopez, "Faster Explicit Formulas for Computing Pairing over Ordinary Curves". Eurocrypt 2011, pp. 48-68.
11. S. Mitsunari, "A Fast Implementation of the Optimal Ate Pairing over BN curve on Intel Haswell Processor". In IACR ePrint archive 2013: ...
12. S. Antão, J.C. Bajard, L. Sousa, "RNS-based Elliptic Curve Point Multiplication for Massive Parallel Architectures". In the Computer Journal 2011, vol. 55, issue 5, pp. ..., 2012.

38
13. D. J. Bernstein, T.R. Chen, C.M. Cheng, T. Lange, and B.Y. Yang, "ECM on Graphics Cards". Eurocrypt 2009, pp. ...
14. R. Szerwinski and T. Guneysu, "Exploiting the Power of GPUs for Asymmetric Cryptology". CHES 2008, pp. 79-99.
15. Y. Katoh, Y.J. Huang, C.M. Cheng, T. Takagi, "Efficient Implementation of the eta Pairing on GPU". Cryptology ePrint Archive, ...
16. Y. Zhang, C.J. Xue, D.S. Wong, N. Mamoulis, S.M. Yiu, "Acceleration of Composite Order Bilinear Pairing on Graphics Hardware". Cryptology ePrint Archive, ...
17. E.J. Lee, H.S. Lee and C.M. Park, "Efficient and Generalized Pairing Computation on Abelian Varieties". IEEE Transactions on Information Theory, Vol. 55, Issue 4, pp. ...
18. MIRACL: Multiprecision Integer and Rational Arithmetic Cryptographic Library. Available at ...
19. P. Longa and C. Gebotys, "Analysis of Efficient Techniques for Fast Elliptic Curve Cryptography on x86-64 based Processors". IACR Cryptology ePrint Archive, 335, 1-34, 2010.
