EAGL: an Elliptic Curve Arithmetic GPU-based Library for Bilinear Pairing
1 EAGL: an Elliptic Curve Arithmetic GPU-based Library for Bilinear Pairing
Shi (Chris) Pu, Jyh-charn (Steve) Liu
Department of Computer Science and Engineering, Texas A&M University
2 Outline
- Introduction
- Objective and results
- Computing model and evaluation of existing optimization techniques
- Key findings
- EAGL: an Elliptic Curve Arithmetic GPU-based Library for Bilinear Pairing
- Benchmarks
- Identification of computing bottlenecks
- Conclusion
3 Introduction
- Bilinear pairings have been widely researched for a broad range of crypto protocols and secure applications, e.g., key agreement [1][2], identity-based encryption [3][4], BLS signature verification [5], and secret handshake [6].
- Some of these protocols are well suited for decentralized, Internet-scale applications.
  - One example [7] of this case is given in the next slide.
4 Privacy-preserving Health-care Cloud
- High throughput and low response time for serving requests in an emergency response situation
5 Computing Performance for Pairings
At the 128-bit security level:
- Naehrig et al. [8]: 669 pairings/sec on a single core of an Intel Q9550 CPU (2.83GHz) (2010)
- Beuchat et al. [9]: 1202 pairings/sec on a single core of an Intel i7 CPU (2.8GHz) (2010)
- Aranha et al. [10]: 1658 pairings/sec on a single core of an Intel i5 CPU (2.8GHz) (2011)
- Mitsunari et al. [11]: 2051 pairings/sec on a single core of an Intel Core i7-4700mq CPU (2.4GHz) (2013)
Approaches:
- Reduction of computational complexity
- Parallelization of pairings
6 Parallel Computer Architectures
- Multi-core-based clusters: well established
  - A continual increase of the core count will offer good computing resources for pairings.
- GPU clusters with hundreds of GPUs: emerging
  - Only public computation of pairings (inputs are either encrypted or public) on GPU.
  - Previous studies on point multiplication [12, 13, 14] and bilinear pairings [15, 16] reported performance inferior to that of their multi-core counterparts.
7 Motivation & Results
Motivation:
- Understand the relationship between the computational structures of pairings and the single instruction multi-thread (SIMT) parallel execution model of GPU.
Results:
- Elliptic curve Arithmetic GPU-based Library (EAGL)
  - Compute Unified Device Architecture (CUDA) programming model
  - Parallelizes Miller's algorithm for the R-ate pairing [17] at the 128-bit security level
  - Validated by MIRACL [18]
  - Performance: measured as the execution time of 1408 R-ate pairings, with pairings/sec as the amortized throughput
- Major performance bottlenecks for pairings on the GTX-680 device:
  - Fast/slow memory allocated for intermediate results of pairings
  - GPU pipeline utilization
  - Proper use of the state-of-the-art pairing optimization techniques
8 Highlight of Kepler GPU
- Nvidia Kepler GPU
- One GPU contains multiple SMXs (streaming multiprocessor extreme); each SMX works as a computing unit that
  - processes 32 threads, called a warp, per clock;
  - simultaneously runs multiple warps of threads for better utilization of the pipeline inside the SMX.
- SIMT: Single Instruction Multi-Thread
  - Threads are stalled and synchronization is inserted when not all threads can execute the same instruction, e.g., in if-else branches.
- Kernel function: the program run by GPU threads
9 GPU Memory Hierarchy
- 64K 32-bit registers and 64KB of on-chip shared memory/L1 cache
- Shared memory is divided into banks, with one memory interface per bank
  - competition for an interface reduces the actual degree of parallelism
- 32 shared memory LD/ST units
  - shared memory interface competition occurs only within one warp of GPU threads
10 Major Design Factors
- Operations of ECC are multi-precision integer arithmetic functions
  - a[n] × b[n] mod q[n], with n = 8 for our BN curve at the 128-bit security level
  - Similar for + and -
  - a[8] represents an integer composed of 8 32-bit integers
- Major design factors of the multi-precision integer arithmetic:
  1. Data storage format
     - Use the GPU integer or floating point instruction set?
  2. Number system
     - Residue number system (RNS) or conventional Montgomery? Consider computational complexity and parallelization suitability.
  3. Dependency limitation
     - Race conditions on variable reading/writing
  4. Resource competition
     - Shared memory interfaces
11 Number Systems Option 1: Conventional Montgomery
- Modular multiplications c[n] = a[n] × b[n] mod q[n] are selected as the representative code segment of the pairing computation.
- Option 1. Conventional Montgomery
  - A conversion from (x mod q) to (x R^-1 mod q)
  - c = a × b mod q
- Complexity (with a single thread per instance)
  - T[2n] = a[n] × b[n]: n^2 INT32 multiplications (MUL)
  - reduction(T) = n^2 + n MUL
  - Overall: 2n^2 + n MUL
- Procedure:
    T = a × b; c = reduction(T)
    reduction(T) {
      m = (T mod R) × q' mod R;
      t = (T + m × q) / R;
      return (t ≥ q) ? (t - q) : t;
    }
  where R > q, R is a power of 2 and co-prime to q, and R^-1 and q' satisfy R × R^-1 - q × q' = 1.
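The reduction procedure on this slide can be sketched in a few lines of Python using arbitrary-precision integers. This is only an illustration, not EAGL's CUDA code: the function names are hypothetical, the modulus 2^255 - 19 is a stand-in odd prime rather than the BN-curve prime used in the talk, and operands are assumed to be kept in the common Montgomery form x × R mod q.

```python
def montgomery_setup(q, n_limbs=8, limb_bits=32):
    """Return (R, q_prime) with R = 2^(n_limbs*limb_bits) and q*q_prime = -1 (mod R)."""
    R = 2 ** (n_limbs * limb_bits)
    if q % 2 == 0 or q >= R:
        raise ValueError("q must be odd and smaller than R")
    q_prime = (-pow(q, -1, R)) % R  # modular inverse via pow needs Python 3.8+
    return R, q_prime


def reduction(T, q, R, q_prime):
    """Montgomery reduction: returns T * R^-1 mod q, assuming T is below q*R."""
    m = ((T % R) * q_prime) % R
    t = (T + m * q) // R  # exact division: T + m*q is a multiple of R
    return t - q if t >= q else t


def mont_mul(a_bar, b_bar, q, R, q_prime):
    """Multiply two operands in Montgomery form; the result stays in Montgomery form."""
    return reduction(a_bar * b_bar, q, R, q_prime)


# Stand-in modulus (assumption): any odd q smaller than R works for this sketch.
q = 2 ** 255 - 19
R, q_prime = montgomery_setup(q)
a, b = 123456789, 987654321
a_bar, b_bar = (a * R) % q, (b * R) % q      # convert into Montgomery form
c_bar = mont_mul(a_bar, b_bar, q, R, q_prime)
c = reduction(c_bar, q, R, q_prime)          # convert back to a plain residue
```

Because R is a power of two, the `% R` and `// R` steps correspond to dropping or keeping whole 32-bit limbs, which is why the reduction costs about n^2 + n limb multiplications instead of a long division.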
12 Option 2. RNS-based Montgomery [12]
- Based on the Chinese Remainder Theorem, the modulus is M = m_1 × m_2 × ... × m_n, where each m_i is prime.
- SIMT-friendly: an arbitrary integer X < M is represented by independent residues x_i = X mod m_i, 1 ≤ i ≤ n.
  - a[n] × b[n] in RNS can be fully parallelized.
- But RNS-based integers cannot be directly used in pairings because M is not prime, unless two Base Extensions (BE) are inserted in each reduction.
- Complexity (with t threads per instance):
  - n^2/t (for a[n] × b[n]) + 2n^2 (for BE) + 3n (other parts in reduction) MUL
  - When n = 8 and t = 4: 2n + 2n^2 + 3n = 2n^2 + 5n MUL
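To make the residue representation concrete, here is a small Python sketch of RNS conversion, channel-wise multiplication, and CRT reconstruction. The four small prime moduli and all function names are illustrative assumptions; the base extensions needed for reduction modulo the pairing prime q are deliberately left out.

```python
from math import prod

# Illustrative moduli (assumption): small distinct primes, so they are pairwise co-prime.
MODULI = [251, 241, 239, 233]
M = prod(MODULI)


def to_rns(x):
    """Represent x (0 <= x < M) by its residues x_i = x mod m_i."""
    return [x % m for m in MODULI]


def rns_mul(xs, ys):
    """Channel-wise product; each channel is independent, so one thread could own one channel."""
    return [(x * y) % m for x, y, m in zip(xs, ys, MODULI)]


def from_rns(residues):
    """Reconstruct the integer modulo M with the Chinese Remainder Theorem."""
    total = 0
    for r, m in zip(residues, MODULI):
        M_i = M // m
        total += r * M_i * pow(M_i, -1, m)  # pow(..., -1, m) needs Python 3.8+
    return total % M


X, Y = 12345, 54321                      # chosen so that X * Y stays below M
product = from_rns(rns_mul(to_rns(X), to_rns(Y)))
```

The multiplication itself has no carries between channels, which is what makes it SIMT-friendly; the cost noted on the slide comes from the base extensions that this sketch omits.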
13 Lazy Reduction and Number Systems
- (a × b) mod q needs one multi-precision multiplication and one reduction.
- Lazy reduction: (a × b) mod q + (c × d) mod q becomes (a × b + c × d) mod q
  - Two multi-precision multiplications and one reduction
- A general format for lazy reduction up to the highest extension field: (a × b + c × d + e × f + ...) mod q
- When the multi-precision multiplications are fully parallelized by t parallel threads:
  - Based on the RNS number system: computational complexity (n^2 + n^2 + n^2)/t + (2n^2 + 3n) MUL
  - Based on conventional Montgomery: computational complexity (n^2 + n^2 + n^2)/t + (n^2 + n) MUL
    - A cheaper option
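The saving from lazy reduction can be seen by counting reductions in a short Python sketch. Plain `%` stands in for the Montgomery reduction here, and the function names and the modulus 2^255 - 19 are assumptions made for illustration only.

```python
def eager_sum_of_products(pairs, q):
    """Reduce every product before adding: one reduction per multiplication."""
    acc, reductions = 0, 0
    for a, b in pairs:
        acc += (a * b) % q      # full reduction of a double-width product
        reductions += 1
        if acc >= q:            # modular addition needs only a conditional subtraction
            acc -= q
    return acc, reductions


def lazy_sum_of_products(pairs, q):
    """Accumulate the unreduced double-width products and reduce once at the end."""
    acc = sum(a * b for a, b in pairs)
    return acc % q, 1


q = 2 ** 255 - 19               # stand-in modulus (assumption)
pairs = [(3 ** 150, 5 ** 100), (7 ** 80, 11 ** 70), (13 ** 60, 17 ** 55)]
pairs = [(a % q, b % q) for a, b in pairs]
eager_value, eager_count = eager_sum_of_products(pairs, q)
lazy_value, lazy_count = lazy_sum_of_products(pairs, q)
```

Both routes give the same residue, but the lazy route keeps double-width intermediate sums alive until the single reduction, which is the extra temporary storage the later slides identify as costly on the GPU.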
14 8 Parallel Multiplication (PMU) Sequences in One Multi-precision Multiplication
- T[16] = a[8] × b[8] can be considered as 8 PMU sequences T[x] = a[i] × b[j], j = 0..7, one for each 0 ≤ i ≤ 7, with x = i + j.
- In each PMU sequence:
    T[x] = (low 32 bits of) (a[i] × b[j] + T[x] + carry);
    carry = (high 32 bits of) (a[i] × b[j] + T[x] + carry), where the incoming carry applies if j > 0
- The inter-dependency among the 8 PMU sequences is the reading/writing (R/W) of T[0-15].
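The low/high recurrences above amount to a schoolbook multiplication on 32-bit limbs. The Python sketch below runs the eight sequences one after another in a single thread, so it shows only the arithmetic and not EAGL's distribution of sequences across GPU threads; the helper names are hypothetical.

```python
MASK = 0xFFFFFFFF  # low 32 bits


def to_limbs(x, n=8):
    """Split an integer into n little-endian 32-bit limbs."""
    return [(x >> (32 * i)) & MASK for i in range(n)]


def from_limbs(limbs):
    """Recombine little-endian 32-bit limbs into an integer."""
    return sum(v * 2 ** (32 * i) for i, v in enumerate(limbs))


def mul_limbs(a, b):
    """T[2n] = a[n] * b[n]; the outer loop runs one PMU sequence per a[i]."""
    n = len(a)
    T = [0] * (2 * n)
    for i in range(n):
        carry = 0
        for j in range(n):
            s = a[i] * b[j] + T[i + j] + carry   # always fits in 64 bits
            T[i + j] = s & MASK                  # low 32 bits
            carry = s >> 32                      # high 32 bits
        T[i + n] = carry                         # this limb has not been written before
    return T


x = 2 ** 255 - 19
y = 0x1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF
T = mul_limbs(to_limbs(x), to_limbs(y))
```

Every sequence reads and writes overlapping entries of T, which is the inter-dependency the slide points to when the sequences are executed in parallel.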
15 Eliminating the Race Condition of R/W T[x]
- Assume T[0-15] is in the shared memory.
- R/W of T[x] is race-free when all of the following hold:
  - T[0-15] is partitioned into multiple segments of a constant size;
  - each segment is mapped to one thread;
  - the R/W address of each thread has a constant offset which is large enough.
- Figure: write requests
16 Parallel Computing Model Candidates
- CI_1/2/4thread model: using 1/2/4 GPU thread(s) to complete one computing instance (CI) of the multi-precision arithmetic operations a[8] × b[8] (as well as + and -)
- Next two slides: two types of shared memory bank conflicts are eliminated in the CI_1/2/4thread models
- The CI_8thread model does not meet the race-free requirement mentioned earlier
17 Elimination of Bank Conflict (CI_2thread model)
- Storage of a 256-bit variable a[8] for a block of GPU threads
  - Each CI owns one a[8]; one block of GPU threads owns a number of a[8]s.
- Their placement in the shared memory: a strip of 64-bit chaff spacers is inserted to ensure that each access A_i in a warp visits a different bank.
18 Elimination of Another Bank Conflict (CI_2/4thread models)
- Occurs when reading/writing T[0-15] in different CIs
- To eliminate the bank conflict: cache T[0-11] and T[4-15] in registers, and then accumulate T[0-15] serially
19 Result: Fully Parallelized Eight PMU Sequences
- Bank conflict elimination verified by the nvprof profiling tool of CUDA
  - Zero replay cost caused by bank conflicts
- The total cost of a multi-precision multiplication in the CI_2thread model is slightly above n^2/2 MUL due to the serial accumulation step of T.
20 Performance of Multiplications and Reductions in CI-1/2/4thread Models
Pure multi-precision multiplications and reductions invoked in one bilinear pairing:

models       T=a×b     S-reduct   P-reduct   best mul+reduct   threads per SMX
CI-1thread   57.87ms   39.95ms    N/A        97.82ms           160
CI-2thread   39.40ms   36.71ms    48.73ms    76.11ms           352
CI-4thread   28.58ms   51.64ms    83.59ms    80.22ms           738

- Parallelization of T[16] = a[8] × b[8] works.
- Bisecting the shared memory usage per CI into two threads usually more than doubles the thread count per SMX.
- The gain from parallelization is low as the thread count per CI increases, due to more synchronization needed for accumulating T[i].
- The thread counts per SMX are limited by placing complete warps into the SMX.
21 Effects of Reduction

models       T=a×b     S-reduct   P-reduct   best mul+reduct
CI-1thread   57.87ms   39.95ms    N/A        97.82ms
CI-2thread   39.40ms   36.71ms    48.73ms    76.11ms
CI-4thread   28.58ms   51.64ms    83.59ms    80.22ms

- CI-2thread is adopted for EAGL for the best overall performance.
- In CI-2thread, the parallelized reduction is slower than its serial counterpart.
- The value of T in T + m × q is a copy in the global memory, because the addition only reads T once.
- m = (T mod R) × q' mod R and m × q took 43% of the execution time of the reduction.
22 More on Lazy Reduction (LR)
23 LR Applied to Extension Fields F_{q^2}, F_{q^6}, F_{q^12} (Aranha et al. [10])
- When applying LR in F_{q^k}, computing a modular multiplication in F_{q^k} needs k reductions.
- The size of intermediate results doubles each time.
- The size of temporary data in the functions of EAGL increases to 1.4 times when LR in F_{q^2} is changed to LR in F_{q^4}.
- Explosive growth in data swapping between shared memory and global memory for LR in F_{q^12}
  - A number of double-sized intermediate results in F_{q^4} need to be maintained before a reduction in F_{q^12} occurs.
24 Investigation of Lazy Reduction (LR)
Optimization choices:
1. LR in F_{q^2} or F_{q^4} applied;
2. with software-based prefetching;
3. more fast cache assigned.
Benchmark: execution of 1000 modular multiplications in an extension field of F_q (table columns: optimization choices, execution time, threads per SMX, shared memory per CI, throughput (/sec)). Rows: lazy reduct in F_{q^2}; lazy reduct in F_{q^4}; prefetch + lazy reduct in F_{q^4}; and, with more fast cache assigned, lazy reduct in F_{q^4} at 233.7 ms and 225.2 ms.
Observations:
1. The increase of slow global memory accesses dominates the reduction of computational complexity.
2. The next-warp prefetching technique brings no noticeable gain, since the next warp has no scheduling priority without hardware support.
3. Only a marginal improvement of execution time is observed. As the size of working sets is spatially optimized, there is less gain from more fast cache once a threshold is passed.
25 Comparison with Existing CPU/GPU-based solutions
26 EAGL vs. GPU-based Solutions
Columns: implementation, algorithm, curve type, security, execution time (ms), device
- EAGL: R-ate, prime order; ordinary; 128-bit; GTX-680
- [15]: η_T, prime order; ordinary; 128-bit; 3.94; C2050
- [16]: Tate, composite order; supersingular; 80-bit; 23.8; M2050
Notes:
- 8 SMX × 352 threads (8 × 176 instances) = 1408 instances; the execution time of 1408 R-ate pairings is converted to pairings/sec as the amortized throughput.
- The peak GFLOPS of the GTX-680 is roughly three times larger than that of the M2050/C2050.
- After normalization by peak GFLOPS, EAGL is roughly 4.4 times faster than [15].
27 Experimental Result (Bilinear Pairing, vs. CPU-based Solutions)
Columns: implementation, algorithm, device, core clock, throughput
- EAGL: R-ate pairing, GTX-680
- [8]: Ate pairing, Intel Q9550, 2.83GHz, throughput estimated (est.)
- [9]: Ate pairing, Intel i7, 2.8GHz, throughput estimated (est.)
- [10]: Ate pairing, Intel i5, 2.8GHz, throughput estimated (est.)
- [11]: Ate pairing, Intel i7-4700mq, 2.4GHz, throughput estimated (est.)
Notes:
- The estimates adopt the perfect acceleration model for CPU-based solutions on multi-core CPUs.
- EAGL has 40% of the throughput in [11].
- Since EAGL consumes few CPU resources, it can work as a scalable bilinear pairing co-processor, while the CPU remains available for other business logic.
28 Major Computing Bottleneck?
- Is the major bottleneck the low-level multi-precision arithmetic functions based on the CI-2thread model, or the high-level computation structures of the pairing computation?
- To determine whether it is the former, we cross-compare the EAGL-based point multiplication with other GPU/CPU-based point multiplication solutions, since few GPU-based pairing solutions are available.
29 EAGL vs. GPU/CPU-based Point Multiplication Solutions
With GPU (columns: implementation, key size, throughput (/sec), device, device peak GFLOPS):
- [13]: 224 bit, 5895, GTX
- [12] (RNS-based): 224 bit, 9827, GTX
- EAGL: 256 bit, GTX
With CPU (columns: implementation, key size, elliptic curve, throughput (/sec), device):
- Optimized GLS method [19]: 256 bit, twisted Edwards curves in F_{q^2}, AMD Opteron (single core)
- MIRACL (our code base): 224 bit, standardized, AMD Phenom II
- EAGL: 256 bit, standardized, GTX 680
Notes:
- EAGL has 2.76 times higher throughput than [12], which is more than the increase of peak GFLOPS (1.72-fold).
- EAGL has 2.1 times higher throughput than [19].
- Although EAGL uses the same CI-2thread model for these two algorithms, the performance relationships of EAGL vs. its CPU-based counterparts are different!
30 Computing Latency with Unlimited On-chip Memory
- The experiment on multi-precision multiplications and reductions in a pairing simulated an unlimited on-chip memory scenario for 1408 pairings:

models       T=a×b     S-reduct   P-reduct   best mul+reduct
CI-2thread   39.40ms   36.71ms    48.73ms    76.11ms

- A pairing also needs 20k inexpensive multi-precision additions/subtractions, which take less than 15 ms if they fully reside in the shared memory.
- Theoretical optimal latency: 76 ms + 15 ms = 91 ms, far below the actual latency of 420 ms.
- What contributes to the remaining execution time?
31 The Sizes of Temporary Variables for Point Multiplications and Pairings in EAGL
- In bilinear pairing, the size of temporary variables in the global memory space fluctuates, and is much greater than that in point multiplication.
- The sizes of temporary variables in the shared memory space are stable.
- Figure stages: Miller loop (1 point calculation, 2 line calculation, 3 merge results of steps 1 and 2); final exponentiation (4 inversion, 5 rest of FE)
32 An Experiment for Identifying the Major Bottleneck
We assert that a large proportion of the execution time is spent swapping variables between shared memory and global memory, as follows:
1. We extract profiling information of a copy function copy() that copies an element of F_{q^12} in the global memory, and of a powering function that computes x^y, where x is in F_{q^12} and y is in F_q.
   - x^y takes 47 ms; copy() takes 35 μs.
   - nvprof shows that x^y triggers nearly 500 times more global memory hits than copy().
   - Therefore, global memory hits in x^y take roughly 17 ms (about 500 × 35 μs), roughly 35% of the execution time of x^y.
2. Global memory hits can weigh more if synchronizations are waiting for them, for example, global memory R/W embedded in branch divergence.
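The estimate on this slide is a short piece of arithmetic, reproduced below as a Python sketch. It assumes, as the slide implicitly does, that the time of copy() is dominated by global memory traffic; the variable names are hypothetical, and the ratio of 500 is the slide's rounded "nearly 500".

```python
copy_time_us = 35.0      # measured time of copy() in microseconds
pow_time_ms = 47.0       # measured time of x^y in milliseconds
hit_ratio = 500          # x^y triggers nearly 500 times more global memory hits

# Scale the copy() time by the hit ratio to estimate the memory time inside x^y.
mem_time_ms = hit_ratio * copy_time_us / 1000.0
share = mem_time_ms / pow_time_ms

print(f"estimated global memory time: {mem_time_ms:.1f} ms")
print(f"share of x^y execution time: {share:.0%}")
```

With exactly 500 the estimate lands slightly above the slide's rounded figures of 17 ms and 35%, which is consistent with the ratio being "nearly" 500 rather than exactly 500.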
33 Conclusion
- EAGL for bilinear pairing
  - The conventional-Montgomery-based CI-2thread model is more efficient than RNS-based computing models [12] on Kepler GPU.
  - On Kepler GPU, the trade-off between computational complexity and (on-chip) memory resources is very different from that on CPU.
  - Bilinear pairing exceeds the on-chip memory resource limit of Kepler GPU; point multiplication is far from it.
- Future directions
  - Larger on-chip memory and advanced memory prefetching architecture
  - Practical design for large-scale systems
  - Integrated Development Environment (IDE)
  - Mobile environments
34 Many Thanks to the Organizers, the Team, and New Friends
35 Questions and Discussion
36 References
1. D. Fiore, R. Gennaro, and N.P. Smart, "Constructing Certificateless Encryption and ID-Based Encryption from ID-Based Key Agreement". Pairing 2010.
2. K. Yoneyama, "Strongly Secure Two-Pass Attribute-Based Authenticated Key Exchange". Pairing 2010.
3. A.D. Caro, V. Iovino, G. Persiano, "Fully Secure Anonymous HIBE and Secret-Key Anonymous IBE with Short Ciphertexts". Pairing 2010.
4. L. Wang, L. Wang, M. Mambo, E. Okamoto, "New Identity-Based Proxy Re-encryption Schemes to Prevent Collusion Attacks". Pairing 2010.
5. D. Boneh, B. Lynn, H. Shacham, "Short Signatures from the Weil Pairing". Asiacrypt 2001.
6. P. Duan, "Oblivious Handshakes and Computing of Shared Secrets: Pairwise Privacy-preserving Protocols for Internet Applications". Ph.D. Dissertation.
7. J. Pecarina, S. Pu, J.C. Liu, "SAPPHIRE: Anonymity for Enhanced Control and Private Collaboration in Healthcare Clouds". CloudCom 2012.
8. M. Naehrig, R. Niederhagen, and P. Schwabe, "New Software Speed Records for Cryptographic Pairings". Latincrypt 2010.
9. J.L. Beuchat, J.E.G. Diaz, S. Mitsunari, E. Okamoto, F. Rodriguez-Henriquez, and T. Teruya, "High-Speed Software Implementation of the Optimal Ate Pairing over Barreto-Naehrig Curves". Pairing 2010.
10. D.F. Aranha, K. Karabina, P. Longa, C.H. Gebotys, J. Lopez, "Faster Explicit Formulas for Computing Pairing over Ordinary Curves". Eurocrypt 2011, pp. 48-68.
11. S. Mitsunari, "A Fast Implementation of the Optimal Ate Pairing over BN curve on Intel Haswell Processor". IACR ePrint Archive, 2013.
12. S. Antão, J.C. Bajard, L. Sousa, "RNS-based Elliptic Curve Point Multiplication for Massive Parallel Architectures". The Computer Journal 2011, vol. 55, issue 5, 2012.
13. D.J. Bernstein, T.R. Chen, C.M. Cheng, T. Lange, and B.Y. Yang, "ECM on Graphics Cards". Eurocrypt 2009.
14. R. Szerwinski and T. Guneysu, "Exploiting the Power of GPUs for Asymmetric Cryptology". CHES 2008, pp. 79-99.
15. Y. Katoh, Y.J. Huang, C.M. Cheng, T. Takagi, "Efficient Implementation of the eta Pairing on GPU". Cryptology ePrint Archive.
16. Y. Zhang, C.J. Xue, D.S. Wong, N. Mamoulis, S.M. Yiu, "Acceleration of Composite Order Bilinear Pairing on Graphics Hardware". Cryptology ePrint Archive.
17. E.J. Lee, H.S. Lee, and C.M. Park, "Efficient and Generalized Pairing Computation on Abelian Varieties". IEEE Transactions on Information Theory, vol. 55, issue 4.
18. MIRACL: Multiprecision Integer and Rational Arithmetic Cryptographic Library.
19. P. Longa and C. Gebotys, "Analysis of Efficient Techniques for Fast Elliptic Curve Cryptography on x86-64 based Processors". IACR Cryptology ePrint Archive, 335, pp. 1-34, 2010.
Efficient Implementation of the η T Pairing on GPU
Efficient Implementation of the η T Pairing on GPU Yosuke Katoh 1, Yun-Ju Huang 2, Chen-Mou Cheng 3, and Tsuyoshi Takagi 4 1 Graduate School of Mathematics, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka,
More informationPOST-SIEVING ON GPUs
POST-SIEVING ON GPUs Andrea Miele 1, Joppe W Bos 2, Thorsten Kleinjung 1, Arjen K Lenstra 1 1 LACAL, EPFL, Lausanne, Switzerland 2 NXP Semiconductors, Leuven, Belgium 1/18 NUMBER FIELD SIEVE (NFS) Asymptotically
More informationHigh Speed Cryptoprocessor for η T Pairing on 128-bit Secure Supersingular Elliptic Curves over Characteristic Two Fields
High Speed Cryptoprocessor for η T Pairing on 128-bit Secure Supersingular Elliptic Curves over Characteristic Two Fields Santosh Ghosh, Dipanwita Roy Chowdhury, and Abhijit Das Computer Science and Engineering
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationHigh-Performance Modular Multiplication on the Cell Broadband Engine
High-Performance Modular Multiplication on the Cell Broadband Engine Joppe W. Bos Laboratory for Cryptologic Algorithms EPFL, Lausanne, Switzerland joppe.bos@epfl.ch 1 / 21 Outline Motivation and previous
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationCS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST
CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationHiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.
HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation
More informationNEON: Faster Elliptic Curve Scalar Multiplications on ARM Processors
Four NEON: Faster Elliptic Curve Scalar Multiplications on ARM Processors Selected Areas in Cryptography (SAC 2016) St. Johns, Canada Patrick Longa Microsoft Research Next-generation elliptic curves Recent
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationBreaking ECC2K-130 on Cell processors and GPUs
Breaking ECC2K-130 on Cell processors and GPUs Daniel V. Bailey, Lejla Batina, Daniel J. Bernstein, Peter Birkner, Joppe W. Bos, Hsieh-Chung Chen, Chen-Mou Cheng, Gauthier van Damme, Giacomo de Meulenaer,
More informationLecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4)
Lecture: Storage, GPUs Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) 1 Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disk covered with magnetic recording material
More informationHigh-Performance Packet Classification on GPU
High-Performance Packet Classification on GPU Shijie Zhou, Shreyas G. Singapura, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California 1 Outline Introduction
More informationNumerical Simulation on the GPU
Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationCaches Concepts Review
Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationCS377P Programming for Performance GPU Programming - II
CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationGRAPHICS PROCESSING UNITS
GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
More informationA Cache Hierarchy in a Computer System
A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the
More informationFinite Field Arithmetic Using AVX-512 For Isogeny-Based Cryptography
Finite Field Arithmetic Using AVX-512 For Isogeny-Based Cryptography Gabriell Orisaka 1, Diego F. Aranha 1,2, Julio López 1 1 Institute of Computing, University of Campinas, Brazil 2 Department of Engineering,
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationarxiv: v1 [physics.comp-ph] 4 Nov 2013
arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department
More informationThe Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)
The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationSM9 identity-based cryptographic algorithms Part 2: Digital signature algorithm
SM9 identity-based cryptographic algorithms Part 2: Digital signature algorithm Contents 1 Scope... 1 2 Normative references... 1 3 Terms and definitions... 1 3.1 message... 1 3.2 signed message... 1 3.3
More informationKernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow
Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization
More informationAnalysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs
AlgoPARC Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs 32nd ACM International Conference on Supercomputing June 17, 2018 Ben Karsin 1 karsin@hawaii.edu Volker Weichert 2 weichert@cs.uni-frankfurt.de
More informationBy: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,
More informationECDLP on GPU I. INTRODUCTION
ECDLP on GPU Lei Xu State Key Laboratory of Information Security Institute of Software,Chinese Academy of Sciences Beijing, China Email: xuleimath@gmail.com Dongdai Lin State Key Laboratory of Information
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University