EAGL: an Elliptic Curve Arithmetic GPU-based Library for Bilinear Pairing


1 EAGL: an Elliptic Curve Arithmetic GPU-based Library for Bilinear Pairing
Shi (Chris) Pu, Jyh-charn (Steve) Liu
Department of Computer Science and Engineering, Texas A&M University

2 Outline
- Introduction
- Objective and results
- Computing model and evaluation of existing optimization techniques
- Key findings
- EAGL: an Elliptic Curve Arithmetic GPU-based Library for Bilinear Pairing
- Benchmarks
- Identification of computing bottlenecks
- Conclusion

3 Introduction
- Bilinear pairings have been widely researched for a broad range of crypto protocols and secure applications, e.g., key agreement [1][2], identity-based encryption [3][4], BLS signature verification [5], secret handshakes [6].
- Some of these protocols are well suited for decentralized, Internet-scale applications. We will give one example [7] of this case on the next slide.

4 Privacy-preserving Health-care Cloud
- Requires high throughput and low response time for serving requests in an emergency-response situation.

5 Computing Performance for Pairings
At the 128-bit security level:
- Naehrig et al. [8]: 669 pairings/sec on a single core of an Intel Q9550 CPU (2.83 GHz) (2010)
- Beuchat et al. [9]: 1202 pairings/sec on a single core of an Intel i7 CPU (2.8 GHz) (2010)
- Aranha et al. [10]: 1658 pairings/sec on a single core of an Intel i5 CPU (2.8 GHz) (2011)
- Mitsunari et al. [11]: 2051 pairings/sec on a single core of an Intel Core i7-4700MQ CPU (2.4 GHz) (2013)
Two directions for improvement:
- Reduction of computational complexity
- Parallelization of pairings

6 Parallel Computer Architectures
- Multi-core-based clusters: well established. The continual increase of the core count will offer good computing resources for pairings.
- GPU clusters with hundreds of GPUs: emerging. Only public computation (inputs are either encrypted or public) of pairings runs on the GPU.
- Previous studies on point multiplication [12, 13, 14] and bilinear pairings [15, 16] reported inferior performance compared with their multi-core counterparts.

7 Motivation & Results
Motivation: understand the relationship between the computational structures of pairings and the single-instruction multi-thread (SIMT) parallel execution model of GPUs.
Results: the Elliptic curve Arithmetic GPU-based Library (EAGL)
- Built on the Compute Unified Device Architecture (CUDA) programming model
- Parallelizes Miller's algorithm for the R-ate pairing [17] at the 128-bit security level
- Validated against MIRACL [18]
- Performance: 1408 R-ate pairings in 420 ms, i.e., roughly 3350 pairings/sec as the amortized throughput
Major performance bottlenecks for pairings on the GTX-680 device:
- Fast/slow memory allocated for intermediate results of pairings
- GPU pipeline utilization
- Proper use of state-of-the-art pairing optimization techniques

8 Highlights of the Kepler GPU
- The Nvidia Kepler GPU was released in 2012.
- One GPU contains multiple SMXs (streaming multiprocessor extreme); each SMX works as a computing unit that processes 32 threads, called a warp, per clock, and simultaneously runs multiple warps of threads for better utilization of the pipeline inside the SMX.
- SIMT (Single Instruction Multi-Thread): threads are stalled and synchronization is inserted when the same instruction cannot be issued for all threads, e.g., in if-else branches.
- Kernel function: the program run by the GPU threads.

9 GPU Memory Hierarchy
- 64K 32-bit registers and 64 KB of on-chip shared memory / L1 cache per SMX
- 32 64-bit shared memory banks, with one memory interface per bank; interface competition reduces the actual degree of parallelism
- 32 shared-memory LD/ST units; shared-memory interface competition occurs only within one warp of GPU threads

10 Major Design Factors
Operations of ECC are multi-precision integer arithmetic functions: a[n] × b[n] mod q[n], with n = 8 for our BN curve at the 128-bit security level (similarly for + and −). Here a[8] represents an integer composed of 8 32-bit words.
Major design factors of the multi-precision integer arithmetic:
1. Data storage format — use the GPU integer or floating-point instruction sets?
2. Number system — residue number system (RNS) or conventional Montgomery? (computational complexity vs. parallelization suitability)
3. Dependency limitation — variable reading/writing race conditions
4. Resource competition — shared memory interfaces

11 Number Systems, Option 1: Conventional Montgomery
Modular multiplications c[n] = a[n] × b[n] mod q[n] are selected as the representative code segment of the pairing computation.
Option 1: conventional Montgomery
- The reduction converts (x mod q) to (x·R⁻¹ mod q)
- c = a × b mod q
Complexity (with a single thread per instance):
- T[2n] = a[n] × b[n]: n² INT32 multiplications (MUL)
- reduction(T): n² + n MUL
- Overall: 2n² + n MUL
Pseudocode:
  T = a × b;
  c = reduction(T) {
    m = (T mod R) × q′ mod R;
    t = (T + m × q) / R;
    return (t ≥ q) ? (t − q) : t;
  }
where R > q, R is a power of 2 and coprime to q; R⁻¹ and q′ satisfy R·R⁻¹ − q·q′ = 1.
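The reduction above can be sketched on the host side with Python big integers standing in for the 8×32-bit limb arrays. This is a reference model, not EAGL's kernel code, and the 254-bit modulus below is an arbitrary odd placeholder, not the library's actual BN-curve prime:

```python
# Montgomery reduction as in the slide's pseudocode, with Python big
# integers in place of multi-precision limb arrays.

R_BITS = 256
R = 1 << R_BITS                    # R > q, a power of two, coprime to any odd q

q = (1 << 254) - 189               # placeholder odd modulus (assumption)
q_prime = (-pow(q, -1, R)) % R     # q' satisfying R*R^-1 - q*q' = 1

def reduction(T):
    """Returns T * R^-1 mod q for T < q*R (the slide's reduction(T))."""
    m = ((T & (R - 1)) * q_prime) & (R - 1)   # m = (T mod R) * q' mod R
    t = (T + m * q) >> R_BITS                 # t = (T + m*q) / R, an exact division
    return t - q if t >= q else t             # (t >= q) ? (t - q) : t

def mont_mul(a_bar, b_bar):
    """One Montgomery multiplication; inputs/outputs in Montgomery form x*R mod q."""
    return reduction(a_bar * b_bar)           # T = a*b, then the reduction

# Round trip: convert to Montgomery form, multiply, strip the R factor.
a, b = 0xDEADBEEF, 0x12345678
a_bar, b_bar = (a * R) % q, (b * R) % q
assert reduction(mont_mul(a_bar, b_bar)) == (a * b) % q
```

On the GPU the same dataflow is carried out word by word over the 8-limb arrays, which is where the 2n² + n MUL count on the slide comes from.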

12 Option 2: RNS-based Montgomery [12]
- Based on the Chinese Remainder Theorem: the modulus M = ∏_{i=1..n} m_i, with each m_i prime.
- SIMT-friendly: an arbitrary integer X < M is represented by independent residues x_i = X mod m_i, 1 ≤ i ≤ n, so a[n] × b[n] in RNS can be fully parallelized.
- But RNS-based integers cannot be directly used in pairings because M is not prime, unless two Base Extensions (BE) are inserted in each reduction.
Complexity (with t threads per instance):
- n²/t (for a[n] × b[n]) + 2n² (for the BEs) + 3n (other parts of the reduction) MUL
- When n = 8, t = 4: 2n + 2n² + 3n = 2n² + 5n MUL
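A minimal host-side illustration of the RNS idea, with toy moduli far smaller than anything a pairing library would pick:

```python
# Residue Number System sketch: with M = m1*...*mn pairwise coprime,
# X < M is the vector (X mod m_i), and multiplication splits into n
# independent channels -- the SIMT-friendly property on the slide.
# The Mersenne-prime moduli here are illustrative only.

from math import prod

moduli = [2**13 - 1, 2**17 - 1, 2**19 - 1, 2**31 - 1]
M = prod(moduli)

def to_rns(x):
    return [x % m for m in moduli]

def rns_mul(xs, ys):
    # One GPU thread per channel: no carries cross channel boundaries.
    return [(x * y) % m for x, y, m in zip(xs, ys, moduli)]

def from_rns(xs):
    # CRT reconstruction; on the GPU this is where the costly Base
    # Extension steps come in, since M itself is not prime.
    return sum(x * (M // m) * pow(M // m, -1, m)
               for x, m in zip(xs, moduli)) % M

a, b = 123456789, 987654321
assert from_rns(rns_mul(to_rns(a), to_rns(b))) == a * b   # a*b < M here
```

The channel-wise multiply is what makes RNS attractive for SIMT; the slide's point is that the two Base Extensions eat most of that advantage back.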

13 Lazy Reduction and Number Systems
- (a·b) mod q: one multi-precision multiplication and one reduction.
- Lazy reduction: (a·b) mod q + (c·d) mod q → (a·b + c·d) mod q — two multi-precision multiplications and one reduction.
- A general form of lazy reduction, applied up to the highest extension field: (a·b + c·d + e·f + …) mod q
When the multi-precision multiplications are fully parallelized by t parallel threads:
- RNS-based number system: complexity (n² + n² + n²)/t + (2n² + 3n) MUL
- Conventional Montgomery: complexity (n² + n² + n²)/t + (n² + n) MUL — the cheaper option
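The saving is easy to see with plain integers (same placeholder modulus convention as before; the point is the reduction count, not the parameters):

```python
# Lazy reduction sketch: (a*b + c*d) mod q computed with a single
# reduction over the accumulated double-width sum, instead of one
# reduction per product. The modulus is a placeholder.

q = (1 << 254) - 189

def eager(a, b, c, d):
    return ((a * b) % q + (c * d) % q) % q    # one reduction per product

def lazy(a, b, c, d):
    return (a * b + c * d) % q                # one reduction on the 2n-limb sum

a, c = 3**100, 5**90
b, d = 7**80, 11**70
assert eager(a, b, c, d) == lazy(a, b, c, d)
```

In the library the `%` on the double-width value is the expensive Montgomery reduction, so halving (or better) the reduction count is a real win as long as the double-width intermediates fit in fast memory.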

14 Eight Parallel Multiplication (PMU) Sequences in One Multi-precision Multiplication
T[16] = a[8] × b[8] can be considered as 8 PMU sequences of T[x] = a[i] × b[j], j = 0..7, with 0 ≤ i ≤ 7 and x = i + j.
In each PMU sequence (carry starts at 0 for j = 0):
  T[x] = (low 32 bits of) (a[i] × b[j] + T[x] + carry);
  carry = (high 32 bits of) (a[i] × b[j] + T[x] + carry)
The inter-dependency among the 8 PMU sequences is the reading/writing (R/W) of T[0–15].
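The eight sequences can be checked against Python's big-integer arithmetic. This is a host-side model of the limb-level loop (run serially here; on the GPU each i is a PMU sequence):

```python
# Schoolbook T[16] = a[8] * b[8] written as 8 PMU sequences, mirroring
# the slide's low/high 32-bit split. Limbs are little-endian 32-bit words.

MASK = (1 << 32) - 1

def to_limbs(x, n=8):
    return [(x >> (32 * k)) & MASK for k in range(n)]

def from_limbs(limbs):
    return sum(v << (32 * k) for k, v in enumerate(limbs))

def mp_mul(a, b):
    T = [0] * 16
    for i in range(8):                 # PMU sequence i, writing T[x], x = i + j
        carry = 0
        for j in range(8):
            acc = a[i] * b[j] + T[i + j] + carry
            T[i + j] = acc & MASK      # T[x] = low 32 bits
            carry = acc >> 32          # carry = high 32 bits
        T[i + 8] = carry               # carry-out of the sequence
    return T

x, y = (1 << 255) - 19, (1 << 224) + 12345
assert from_limbs(mp_mul(to_limbs(x), to_limbs(y))) == x * y
```

The shared accumulator T[0–15] is exactly the inter-sequence dependency the next slides set out to make race-free.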

15 Eliminating the Race Condition on R/W of T[x]
Assuming T[0–15] is in shared memory, the R/W of T[x] is race-free when:
- T[0–15] is partitioned into multiple segments of a constant size, and
- each segment is mapped to one thread, and
- the R/W address of each thread has a constant offset that is large enough.

16 Parallel Computing Model Candidates
- CI_1/2/4thread models: using 1/2/4 GPU thread(s) to complete one computing instance (CI) of the multi-precision arithmetic operations a[8] × b[8] (as well as + and −).
- Next two slides: two types of shared-memory bank conflicts are eliminated in the CI_1/2/4thread models.
- The CI_8thread model does not meet the race-free requirement mentioned earlier.

17 Elimination of Bank Conflicts (CI_2thread Model)
Storage of a 256-bit variable a[8] for a block of GPU threads: each CI owns one a[8], and one block of GPU threads owns a number of a[8]s. In their shared-memory placement, a strip of 64-bit chaff spacers is inserted to ensure that each access A_i in a warp visits a different bank.
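The effect of the spacers can be modeled with Kepler's bank mapping (32 banks, 64-bit bank width assumed). One thread touching word 0 of one CI is a simplification of the actual CI_2thread layout:

```python
# Bank-conflict model: each CI stores a 256-bit a[8] as four 64-bit words
# in shared memory; a warp's 32 threads each read word 0 of consecutive
# CIs. bank = (64-bit word address) mod 32 in Kepler's 64-bit bank mode.

BANKS = 32

def distinct_banks(words_per_ci):
    # Bank hit by each of the warp's 32 threads, thread t reading CI t.
    return len({(t * words_per_ci) % BANKS for t in range(32)})

# Without spacers: stride 4 words -> only 8 distinct banks, a 4-way conflict.
assert distinct_banks(4) == 8
# With one 64-bit chaff spacer per CI: stride 5, gcd(5, 32) = 1 -> all 32
# accesses land in different banks, i.e., conflict-free.
assert distinct_banks(5) == 32
```

Any stride coprime to the bank count works; one 64-bit spacer is the cheapest one here, costing 1/5 of the padded footprint.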

18 Elimination of Another Bank Conflict (CI_2/4thread Models)
- Occurs when R/W of T[0–15] spans different CIs.
- To eliminate this bank conflict: cache T[0–11] and T[4–15] in registers, then accumulate T[0–15] serially.

19 Result: Fully Parallelized Eight PMU Sequences
- Bank-conflict elimination verified by the nvprof profiling tool of CUDA: zero replay cost caused by bank conflicts.
- The total cost of a multi-precision multiplication in the CI_2thread model is slightly above n²/2 MUL, due to the serial accumulation step of T.

20 Performance of Multiplications and Reductions in the CI_1/2/4thread Models
Pure multi-precision multiplications and reductions, as invoked in one bilinear pairing:

model        T = a×b    S-reduct   P-reduct   best mul+reduct   threads per SMX
CI_1thread   57.87 ms   39.95 ms   N/A        97.82 ms          160
CI_2thread   39.40 ms   36.71 ms   48.73 ms   76.11 ms          352
CI_4thread   28.58 ms   51.64 ms   83.59 ms   80.22 ms          738

- Bisecting T[16] = a[8] × b[8] per CI into two threads works well; the gain from parallelization is usually less than double as the thread count per CI increases, due to more synchronization needed for accumulating T[i] and higher shared memory usage per CI.
- The thread count per SMX is limited by placing complete warps into an SMX.

21 Effects of Reduction

model        T = a×b    S-reduct   P-reduct   best mul+reduct
CI_1thread   57.87 ms   39.95 ms   N/A        97.82 ms
CI_2thread   39.40 ms   36.71 ms   48.73 ms   76.11 ms
CI_4thread   28.58 ms   51.64 ms   83.59 ms   80.22 ms

- CI_2thread is adopted for EAGL for the best overall performance.
- In CI_2thread, the parallelized reduction is slower than its serial counterpart.
- The value of T in T + m×q is a copy in global memory, because the addition reads T only once.
- Computing m = (T mod R) × q′ mod R and m × q took 43% of the execution time of the reduction.

22 More on Lazy Reduction (LR)

23 LR Applied to Extension Fields F_q^2, F_q^6, F_q^12 (Aranha et al. [10])
- When applying LR in F_q^k, computing a modular multiplication in F_q^k needs k reductions, doubling the intermediate results each time.
- The size of temporary data in EAGL's functions increases 1.4 times when LR in F_q^2 is changed to LR in F_q^4.
- Explosive growth in data swapping between shared memory and global memory for LR in F_q^12: a number of double-sized intermediate results in F_q^4 need to be maintained before a reduction in F_q^12 occurs.
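For the k = 2 case, a host-side sketch, assuming the common F_q^2 = F_q[u]/(u² + 1) construction used with BN curves (the modulus is again a placeholder, not the BN prime):

```python
# Lazy reduction in F_q^2 = F_q[u]/(u^2 + 1): the four double-width
# products are accumulated first, so a full multiplication costs only
# k = 2 reductions instead of one per product.

q = (1 << 254) - 189

def fq2_mul_lazy(a, b):
    a0, a1 = a
    b0, b1 = b
    c0 = (a0 * b0 - a1 * b1) % q   # reduction 1, on an accumulated value
    c1 = (a0 * b1 + a1 * b0) % q   # reduction 2 -- k reductions for F_q^k
    return (c0, c1)

def fq2_mul_eager(a, b):
    # Reference version: reduce after every product (4 reductions).
    a0, a1 = a
    b0, b1 = b
    return (((a0 * b0) % q - (a1 * b1) % q) % q,
            ((a0 * b1) % q + (a1 * b0) % q) % q)

a = (3**100 % q, 5**90 % q)
b = (7**80 % q, 11**70 % q)
assert fq2_mul_lazy(a, b) == fq2_mul_eager(a, b)
```

The slide's trade-off follows directly: pushing LR up the tower to F_q^4 or F_q^12 keeps ever more unreduced double-width values alive at once, which is what overflows the GPU's fast memory.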

24 Investigation of Lazy Reduction (LR)
Workload: execution of 1000 modular multiplications in F_q^12. Optimization choices compared (metrics: execution time, threads per SMX, shared memory per CI, throughput/sec):
1. lazy reduction in F_q^2
2. lazy reduction in F_q^4
3. software-based prefetching + lazy reduction in F_q^4 (233.7 ms)
4. more fast cache assigned + lazy reduction in F_q^4 (225.2 ms)
Observations:
1. Only marginal improvement of execution time is observed: LR in F_q^4 lowers the computational complexity, but the increase of slow global-memory accesses dominates the execution time as the size of the working set grows.
2. The next-warp prefetching technique brings no noticeable gain, since the next warp has no scheduling priority without hardware support.
3. Assigning more fast cache brings smaller gains once a threshold is passed.

25 Comparison with Existing CPU/GPU-based solutions

26 EAGL vs. GPU-based Solutions

implementation   algorithm               curve type      security   exec time (ms)   device
EAGL             R-ate, prime order      ordinary        128-bit    420 (1408 pairings)   GTX-680
[15]             η_T, prime order        ordinary        128-bit    3.94             C2050
[16]             Tate, composite order   supersingular   80-bit     23.8             M2050

- 8 SMXs × 352 threads (8 × 176 instances) = 1408 instances; the execution time of 1408 R-ate pairings is 420 ms, equivalent to roughly 3350 pairings/sec as the amortized throughput.
- The peak GFLOPS of the GTX-680 is roughly three times that of the M2050/C2050. After normalizing by peak GFLOPS, EAGL is roughly 4.4 times faster than [15].

27 Experimental Results (Bilinear Pairing, vs. CPU-based Solutions)

implementation   algorithm       device            core clock   throughput (/sec)
EAGL             R-ate pairing   GTX-680           1006 MHz     ~3350
[8]              Ate pairing     Intel Q9550       2.83 GHz     (est.)
[9]              Ate pairing     Intel i7          2.8 GHz      (est.)
[10]             Ate pairing     Intel i5          2.8 GHz      (est.)
[11]             Ate pairing     Intel i7-4700MQ   2.4 GHz      (est.)

- We adopt the perfect-acceleration model for the CPU-based solutions on multi-core CPUs; their multi-core throughputs are estimated accordingly.
- EAGL achieves 40% of the estimated throughput of [11].
- Since EAGL costs few CPU resources, it can work as a scalable bilinear-pairing co-processor while the CPU remains available for other business logic.

28 Major Computing Bottleneck?
Is it the low-level multi-precision arithmetic functions based on the CI_2thread model, or the high-level computation structures of the pairing computation?
To determine whether the major bottleneck is the former, we cross-compare EAGL-based point multiplication with other GPU/CPU-based point multiplication solutions, since few GPU-based pairing solutions are available.

29 EAGL vs. GPU/CPU-based Point Multiplication Solutions

vs. GPU-based solutions:
implementation     key size   throughput (/sec)   device
[13]               224-bit    5895                GTX
[12] (RNS-based)   224-bit    9827                GTX
EAGL               256-bit                        GTX-680

vs. CPU-based solutions:
implementation              key size   elliptic curve                    device
Optimized GLS method [19]   256-bit    twisted Edwards curves in F_q^2   AMD Opteron (single core)
MIRACL (our code base)      224-bit    standardized                      AMD Phenom II
EAGL                        256-bit    standardized                      GTX-680

- EAGL has 2.76 times higher throughput than [12], higher than the increase in peak GFLOPS (1.72-fold).
- EAGL has 2.1 times higher throughput than [19].
- Although EAGL uses the same CI_2thread model for both algorithms, the performance relationships of EAGL vs. its CPU-based counterparts differ between point multiplication and bilinear pairing!

30 Computing Latency with Unlimited On-chip Memory
The experiment on the multi-precision multiplications and reductions in a pairing simulated an unlimited on-chip memory scenario for 1408 pairings:

model        T = a×b    S-reduct   P-reduct   best mul+reduct
CI_2thread   39.40 ms   36.71 ms   48.73 ms   76.11 ms

- A pairing also needs ~20k inexpensive multi-precision additions/subtractions, which cost less than 15 ms if they fully reside in shared memory.
- Theoretical optimal latency: 76 ms + 15 ms = 91 ms << 420 ms (the actual latency). What contributes to the remaining execution time?

31 The Sizes of Temporary Variables for Point Multiplications and Pairings in EAGL
- In bilinear pairing, the size of the temporary variables in the global memory space fluctuates, and is much greater than that in point multiplication.
- The sizes of the temporary variables in the shared memory space are stable.
- Phases shown in the figure: Miller loop (1. point calculation, 2. line calculation, 3. merge results of steps 1 and 2) and final exponentiation (4. inversion, 5. rest of the FE).

32 An Experiment for Identifying the Major Bottleneck
We assert that a large proportion of the execution time is spent swapping variables between shared memory and global memory:
1. We extract profiling information for a copy function copy() that copies an element of F_q^12 in global memory, and a powering function that computes x^y, where x is in F_q^12 and y is in F_q.
   - x^y takes 47 ms; copy() takes 35 μs.
   - nvprof shows that x^y triggers nearly 500 times more global-memory hits than copy().
   - Therefore, global-memory hits in x^y take roughly 17 ms ≈ 500 × 35 μs, about 35% of the execution time of x^y.
2. Global-memory hits can weigh even more when synchronizations wait for them, e.g., global-memory R/W embedded in branch divergence.

33 Conclusion
- EAGL for bilinear pairing: the conventional-Montgomery-based CI_2thread model is more efficient than RNS-based computing models [12] on the Kepler GPU.
- On the Kepler GPU, the trade-off between computational complexity and (on-chip) memory resources is very different from that on CPUs: bilinear pairing has exceeded the on-chip memory limit of the Kepler GPU, while point multiplication is far from it.
Future directions:
- Larger on-chip memory and advanced memory-prefetching architectures
- Practical designs for large-scale systems
- An Integrated Development Environment (IDE)
- Mobile environments

34 Many Thanks to the Organizers, the Team, and New Friends

35 Questions and Discussion

36 References
1. D. Fiore, R. Gennaro, and N.P. Smart, "Constructing Certificateless Encryption and ID-Based Encryption from ID-Based Key Agreement". Pairing 2010.
2. K. Yoneyama, "Strongly Secure Two-Pass Attribute-Based Authenticated Key Exchange". Pairing 2010.
3. A.D. Caro, V. Iovino, G. Persiano, "Fully Secure Anonymous HIBE and Secret-Key Anonymous IBE with Short Cipher-texts". Pairing 2010.
4. L. Wang, L. Wang, M. Mambo, E. Okamoto, "New Identity-Based Proxy Re-encryption Schemes to Prevent Collusion Attacks". Pairing 2010.
5. D. Boneh, B. Lynn, H. Shacham, "Short Signatures from the Weil Pairing". Asiacrypt 2001.
6. P. Duan, "Oblivious Handshakes and Computing of Shared Secrets: Pairwise Privacy-preserving Protocols for Internet Applications". Ph.D. Dissertation.

37
7. J. Pecarina, S. Pu, J.C. Liu, "SAPPHIRE: Anonymity for Enhanced Control and Private Collaboration in Healthcare Clouds". CloudCom 2012.
8. M. Naehrig, R. Niederhagen, and P. Schwabe, "New Software Speed Records for Cryptographic Pairings". Latincrypt 2010.
9. J.L. Beuchat, J.E.G. Diaz, S. Mitsunari, E. Okamoto, F. Rodriguez-Henriquez, and T. Teruya, "High-Speed Software Implementation of the Optimal Ate Pairing over Barreto-Naehrig Curves". Pairing 2010.
10. D.F. Aranha, K. Karabina, P. Longa, C.H. Gebotys, J. Lopez, "Faster Explicit Formulas for Computing Pairings over Ordinary Curves". Eurocrypt 2011, pp. 48-68.
11. S. Mitsunari, "A Fast Implementation of the Optimal Ate Pairing over BN Curve on Intel Haswell Processor". IACR ePrint Archive, 2013.
12. S. Antão, J.C. Bajard, L. Sousa, "RNS-based Elliptic Curve Point Multiplication for Massive Parallel Architectures". The Computer Journal, vol. 55, issue 5, 2012.

38
13. D.J. Bernstein, T.R. Chen, C.M. Cheng, T. Lange, and B.Y. Yang, "ECM on Graphics Cards". Eurocrypt 2009.
14. R. Szerwinski and T. Guneysu, "Exploiting the Power of GPUs for Asymmetric Cryptography". CHES 2008, pp. 79-99.
15. Y. Katoh, Y.J. Huang, C.M. Cheng, T. Takagi, "Efficient Implementation of the eta Pairing on GPU". Cryptology ePrint Archive.
16. Y. Zhang, C.J. Xue, D.S. Wong, N. Mamoulis, S.M. Yiu, "Acceleration of Composite Order Bilinear Pairing on Graphics Hardware". Cryptology ePrint Archive.
17. E.J. Lee, H.S. Lee and C.M. Park, "Efficient and Generalized Pairing Computation on Abelian Varieties". IEEE Transactions on Information Theory, vol. 55, issue 4.
18. MIRACL: Multiprecision Integer and Rational Arithmetic Cryptographic Library.
19. P. Longa and C. Gebotys, "Analysis of Efficient Techniques for Fast Elliptic Curve Cryptography on x86-64 based Processors". IACR Cryptology ePrint Archive, 2010.


More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Fault-Based Attack of RSA Authentication

Fault-Based Attack of RSA Authentication Fault-Based Attack of RSA Authentication, Valeria Bertacco and Todd Austin 1 Cryptography: Applications 2 Value of Cryptography $2.1 billions 1,300 employees $1.5 billions 4,000 employees $8.7 billions

More information

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel

More information

implementation using GPU architecture is implemented only from the viewpoint of frame level parallel encoding [6]. However, it is obvious that the mot

implementation using GPU architecture is implemented only from the viewpoint of frame level parallel encoding [6]. However, it is obvious that the mot Parallel Implementation Algorithm of Motion Estimation for GPU Applications by Tian Song 1,2*, Masashi Koshino 2, Yuya Matsunohana 2 and Takashi Shimamoto 1,2 Abstract The video coding standard H.264/AVC

More information

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,

More information

A Parameterizable Processor Architecture for Large Characteristic Pairing-Based Cryptography

A Parameterizable Processor Architecture for Large Characteristic Pairing-Based Cryptography A Parameterizable Processor Architecture for Large Characteristic Pairing-Based Cryptography Gary C.T. Chow Department of Computing Imperial College London, UK cchow@doc.ic.ac.uk Ken Eguro Embedded and

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

EEC 483 Computer Organization. Chapter 5.3 Measuring and Improving Cache Performance. Chansu Yu

EEC 483 Computer Organization. Chapter 5.3 Measuring and Improving Cache Performance. Chansu Yu EEC 483 Computer Organization Chapter 5.3 Measuring and Improving Cache Performance Chansu Yu Cache Performance Performance equation execution time = (execution cycles + stall cycles) x cycle time stall

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU

More information

High-Performance Cryptography in Software

High-Performance Cryptography in Software High-Performance Cryptography in Software Peter Schwabe Research Center for Information Technology Innovation Academia Sinica September 3, 2012 ECRYPT Summer School: Challenges in Security Engineering

More information

SM9 identity-based cryptographic algorithms Part 3: Key exchange protocol

SM9 identity-based cryptographic algorithms Part 3: Key exchange protocol SM9 identity-based cryptographic algorithms Part 3: Key exchange protocol Contents 1 Scope... 1 2 Normative references... 1 3 Terms and definitions... 1 3.1 key exchange... 1 3.2 key agreement... 1 3.3

More information

Algorithms and arithmetic for the implementation of cryptographic pairings

Algorithms and arithmetic for the implementation of cryptographic pairings Cairn seminar November 29th, 2013 Algorithms and arithmetic for the implementation of cryptographic pairings Nicolas Estibals CAIRN project-team, IRISA Nicolas.Estibals@irisa.fr What is an elliptic curve?

More information

CUB. collective software primitives. Duane Merrill. NVIDIA Research

CUB. collective software primitives. Duane Merrill. NVIDIA Research CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives

More information

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27 1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution

More information

CSE 160 Lecture 24. Graphical Processing Units

CSE 160 Lecture 24. Graphical Processing Units CSE 160 Lecture 24 Graphical Processing Units Announcements Next week we meet in 1202 on Monday 3/11 only On Weds 3/13 we have a 2 hour session Usual class time at the Rady school final exam review SDSC

More information

PRDSA: Effective Parallel Digital Signature Algorithm for GPUs

PRDSA: Effective Parallel Digital Signature Algorithm for GPUs I.J. Wireless and Microwave Technologies, 2017, 5, 14-21 Published Online September 2017 in MECS(http://www.mecs-press.net) DOI: 10.5815/ijwmt.2017.05.02 Available online at http://www.mecs-press.net/ijwmt

More information

GPU-accelerated Verification of the Collatz Conjecture

GPU-accelerated Verification of the Collatz Conjecture GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,

More information

Lecture 11: OpenCL and Altera OpenCL. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 11: OpenCL and Altera OpenCL. James C. Hoe Department of ECE Carnegie Mellon University 18 643 Lecture 11: OpenCL and Altera OpenCL James C. Hoe Department of ECE Carnegie Mellon University 18 643 F17 L11 S1, James C. Hoe, CMU/ECE/CALCM, 2017 Housekeeping Your goal today: understand Altera

More information

Introduction to CUDA

Introduction to CUDA Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Constructing Pairing-Friendly Elliptic Curves for Cryptography

Constructing Pairing-Friendly Elliptic Curves for Cryptography Constructing Pairing-Friendly Elliptic Curves for Cryptography University of California, Berkeley, USA 2nd KIAS-KMS Summer Workshop on Cryptography Seoul, Korea 30 June 2007 Outline 1 Recent Developments

More information

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) From Shader Code to a Teraflop: How GPU Shader Cores Work Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) 1 This talk Three major ideas that make GPU processing cores run fast Closer look at real

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

Lecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs

Lecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Lecture 27: Multiprocessors Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood programming model

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

CUDA Performance Optimization. Patrick Legresley

CUDA Performance Optimization. Patrick Legresley CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Lecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability

Lecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability Lecture 27: Pot-Pourri Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood

More information

--> Buy True-PDF --> Auto-delivered in 0~10 minutes. GM/T Translated English of Chinese Standard: GM/T0044.

--> Buy True-PDF --> Auto-delivered in 0~10 minutes. GM/T Translated English of Chinese Standard: GM/T0044. Translated English of Chinese Standard: GM/T0044.1-2016 www.chinesestandard.net Buy True-PDF Auto-delivery. Sales@ChineseStandard.net CRYPTOGRAPHY INDUSTRY STANDARD OF THE PEOPLE S REPUBLIC OF CHINA GM

More information

Introduction to Parallel Computing with CUDA. Oswald Haan

Introduction to Parallel Computing with CUDA. Oswald Haan Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries

More information

Performance Optimization Part II: Locality, Communication, and Contention

Performance Optimization Part II: Locality, Communication, and Contention Lecture 7: Performance Optimization Part II: Locality, Communication, and Contention Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Beth Rowley Nobody s Fault but Mine

More information

Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations

Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Nikolai Zamarashkin and Dmitry Zheltkov INM RAS, Gubkina 8, Moscow, Russia {nikolai.zamarashkin,dmitry.zheltkov}@gmail.com

More information

Real-Time Rendering Architectures

Real-Time Rendering Architectures Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand

More information

Spring Prof. Hyesoon Kim

Spring Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim 2 Warp is the basic unit of execution A group of threads (e.g. 32 threads for the Tesla GPU architecture) Warp Execution Inst 1 Inst 2 Inst 3 Sources ready T T T T One warp

More information

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010 Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

Memory Management! Goals of this Lecture!

Memory Management! Goals of this Lecture! Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Why it works: locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware and

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

Memory Management! How the hardware and OS give application pgms:" The illusion of a large contiguous address space" Protection against each other"

Memory Management! How the hardware and OS give application pgms: The illusion of a large contiguous address space Protection against each other Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Spatial and temporal locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware

More information

From Shader Code to a Teraflop: How Shader Cores Work

From Shader Code to a Teraflop: How Shader Cores Work From Shader Code to a Teraflop: How Shader Cores Work Kayvon Fatahalian Stanford University This talk 1. Three major ideas that make GPU processing cores run fast 2. Closer look at real GPU designs NVIDIA

More information

Analysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth

Analysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth Analysis Report v3 Duration 932.612 µs Grid Size [ 1024,1,1 ] Block Size [ 1024,1,1 ] Registers/Thread 32 Shared Memory/Block 28 KiB Shared Memory Requested 64 KiB Shared Memory Executed 64 KiB Shared

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

Source Anonymous Message Authentication and Source Privacy using ECC in Wireless Sensor Network

Source Anonymous Message Authentication and Source Privacy using ECC in Wireless Sensor Network Source Anonymous Message Authentication and Source Privacy using ECC in Wireless Sensor Network 1 Ms.Anisha Viswan, 2 Ms.T.Poongodi, 3 Ms.Ranjima P, 4 Ms.Minimol Mathew 1,3,4 PG Scholar, 2 Assistant Professor,

More information

Software Implementation of Tate Pairing over GF(2 m )

Software Implementation of Tate Pairing over GF(2 m ) Software Implementation of Tate Pairing over GF(2 m ) G. Bertoni 1, L. Breveglieri 2, P. Fragneto 1, G. Pelosi 2 and L. Sportiello 1 ST Microelectronics 1, Politecnico di Milano 2 Via Olivetti, Agrate

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information