GPUMP: a Multiple-Precision Integer Library for GPUs

Kaiyong Zhao and Xiaowen Chu
Department of Computer Science, Hong Kong Baptist University
Hong Kong, P. R. China
Email: {kyzhao, chxw}@comp.hkbu.edu.hk

Abstract

Multiple-precision integer operations are key components of many security applications, but unfortunately they are computationally expensive on contemporary CPUs. In this paper, we present the design and implementation of a multiple-precision integer library for GPUs, implemented in CUDA. Our experimental results show that GPUs can achieve a significant speedup over the GNU MP library running on CPUs.

Keywords: Multiple-precision algorithm, GPU, CUDA

I. INTRODUCTION

Public-key encryption plays a critical role in our daily life. The core component of a public-key system is a set of multiple-precision integer operations. A server that relies on public-key encryption (such as an SSL server) needs to process a large number of multiple-precision integer operations, which require huge computing power.

Recent advances in Graphics Processing Units (GPUs) open a new era of GPU computing. For example, commodity GPUs like NVIDIA's GTX [...] have [...] processing cores and can achieve [...] GFLOPS of computational horsepower. More importantly, the NVIDIA CUDA programming model makes it easier for developers to build non-graphics applications on GPUs. In CUDA, the GPU becomes a dedicated coprocessor to the host CPU and works on the principle of Single-Program Multiple-Data (SPMD), where multiple threads based on the same code can run simultaneously.

We are motivated by the fact that GPUs could be utilized to speed up multiple-precision integer operations. This is of practical importance to end users as well as application servers. However, it is not easy to achieve high performance on GPUs due to the complicated memory architecture and the relatively slow integer operations. In this paper, we present our design, implementation, and experimental results on a highly optimized multiple-precision integer library.
Our library achieves a significant speedup for a number of multiple-precision integer operations.

The rest of the paper is organized as follows. Section II provides background information on GPU architecture and the CUDA programming model. Section III presents our design and implementation of multiple-precision integer arithmetic on GPUs. Experimental results are presented in Section IV, and we conclude the paper in Section V.

Note: The latest Fermi architecture has better support for integer operations, but it is outside the scope of this paper.

II. BACKGROUND AND RELATED WORK

GPUs are dedicated hardware for manipulating computer graphics. Due to the huge computing demand of real-time and high-definition 3D graphics, the GPU has evolved into a highly parallel, multithreaded, many-core processor. The advances of computing power in GPUs have driven the development of general-purpose computing on GPUs (GPGPU). The first generation of GPGPU required that any non-graphics application be mapped through graphics application programming interfaces (APIs). NVIDIA then provided a general-purpose parallel programming model, namely Compute Unified Device Architecture (CUDA) [1] [2], which extends the C programming language for general-purpose application development. Meanwhile, another GPU vendor, AMD, introduced the Close To Metal (CTM) programming model, which provides an assembly language for application development [3]. Intel also unveiled Larrabee, a new many-core GPU architecture specifically designed for the GPU computing market, this year [4].

Since the release of CUDA, it has been used for speeding up a large number of applications [8-12]. Given its popularity, we choose CUDA to implement our multiple-precision integer library.

III. MULTIPLE-PRECISION MODULAR ARITHMETIC

In this section, we present a set of library functions for multiple-precision modular arithmetic implemented on GPUs. In modular arithmetic, all operations are performed in a group Z_m, i.e., the set of integers {0, 1, 2, ..., m-1}. In the following, the modulus m is represented in radix b as (m_n m_{n-1} ... m_1 m_0)_b, where m_n ≠ 0.
Each symbol m_i is referred to as a radix b digit. Non-negative integers x and y, x < m, y < m, are represented in radix b as (x_n x_{n-1} ... x_1 x_0)_b and (y_n y_{n-1} ... y_1 y_0)_b respectively.

We have implemented the following multiple-precision library functions for CUDA:
- Multiple-precision comparison
- Multiple-precision addition and subtraction
- Multiple-precision modular addition and subtraction
- Multiple-precision multiplication and division
- Multiple-precision Montgomery reduction
- Multiple-precision Montgomery multiplication
- Multiple-precision exponentiation
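To make the radix b notation concrete, the following Python sketch converts between an ordinary integer and its digit sequence. It is an illustration only: the helper names to_digits and from_digits and the default radix b = 2**32 are assumptions made for this example and are not taken from the library. Digits are stored least-significant first, so that index i holds the digit written x_i above.

```python
B = 2 ** 32  # assumed radix for illustration: one digit per 32-bit word


def to_digits(value, length, b=B):
    """Return `length` radix-b digits of `value`, least-significant first."""
    if value < 0 or value >= b ** length:
        raise ValueError("value does not fit in the requested number of digits")
    digits = []
    for _ in range(length):
        digits.append(value % b)  # digit x_i
        value //= b
    return digits


def from_digits(digits, b=B):
    """Inverse of to_digits: evaluate the sum of digits[i] * b**i."""
    value = 0
    for d in reversed(digits):
        value = value * b + d
    return value
```

With a small radix the correspondence is easy to check by hand: to_digits(1234, 4, b=10) gives [4, 3, 2, 1], which is (1 2 3 4)_10 read from the most significant digit.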

Our library implements each operation as a single thread. To make full use of a GPU, hundreds to thousands of threads must execute simultaneously. It is also possible to implement a complicated operation with multiple threads, e.g., a block of threads could be used to perform a single operation such as exponentiation. We leave this as future work.

A. Comparison, Addition and Subtraction

The pseudocode of the multiple-precision comparison, addition, and subtraction operations is shown in Algorithms 1, 2, and 3, respectively.

Algorithm 1 Multiple-precision Comparison
INPUT: non-negative integers x and y, each with n + 1 radix b digits.
OUTPUT: 1 if x > y; 0 if x = y; -1 if x < y.
1: i ← n
2: while (x_i == y_i and i > 0)
3:     i ← i - 1
4: end while
5: if (x_i > y_i) then return 1
6: else if (x_i == y_i) then return 0
7: else return -1

Algorithm 2 Multiple-precision Addition
INPUT: non-negative integers x and y, each with n + 1 radix b digits.
OUTPUT: x + y = (z_{n+1} z_n ... z_1 z_0)_b.
1: c ← 0 /* carry digit */
2: for (i from 0 to n) do
3:     z_i ← (x_i + y_i + c) mod b
4:     c ← ⌊(x_i + y_i + c) / b⌋
5: end for
6: z_{n+1} ← c
7: return (z_{n+1} z_n ... z_1 z_0)_b

Algorithm 3 Multiple-precision Subtraction
INPUT: non-negative integers x and y, each with n + 1 radix b digits, x ≥ y.
OUTPUT: x - y = (z_n z_{n-1} ... z_1 z_0)_b.
1: c ← 0 /* carry digit */
2: for (i from 0 to n) do
3:     z_i ← (x_i - y_i + c) mod b
4:     if (x_i - y_i + c ≥ 0) then c ← 0
5:     else c ← -1
6: end for
7: return (z_n z_{n-1} ... z_1 z_0)_b

B. Modular Addition and Subtraction

The pseudocode of the multiple-precision modular addition and subtraction operations is shown in Algorithms 4 and 5, respectively.

Algorithm 4 Multiple-precision Modular Addition
INPUT: non-negative integers m, x, and y, each with n + 1 radix b digits, x < m, y < m.
OUTPUT: (x + y) mod m = (z_n z_{n-1} ... z_1 z_0)_b.
1: c ← 0 /* carry digit */
2: for (i from 0 to n) do
3:     z_i ← (x_i + y_i + c) mod b
4:     if (x_i + y_i + c < b) then c ← 0
5:     else c ← 1
6: end for
7: z_{n+1} ← c, m_{n+1} ← 0
8: if ((z_{n+1} z_n ... z_1 z_0)_b ≥ (m_{n+1} m_n ... m_1 m_0)_b) then
9:     (t_{n+1} t_n ... t_1 t_0)_b ← (z_{n+1} z_n ... z_1 z_0)_b - (m_{n+1} m_n ... m_1 m_0)_b
10:    return (t_n t_{n-1} ... t_1 t_0)_b
11: else return (z_n z_{n-1} ... z_1 z_0)_b

Algorithm 5 Multiple-precision Modular Subtraction
INPUT: non-negative integers m, x, and y, each with n + 1 radix b digits, x < m, y < m.
OUTPUT: (x - y) mod m = (z_n z_{n-1} ... z_1 z_0)_b.
1: if (x ≥ y) then return x - y
2: else
3:     t ← (m - y)
4:     return (x + t) mod m
5: end else

C. Multiplication, Division, and Modular Multiplication

One straightforward method to implement the modular multiplication x · y mod m is to calculate x · y first and then calculate the remainder of x · y divided by m. Hence modular multiplication can be implemented using the multiplication and division operations. Next, we give the pseudocode for multiple-precision multiplication and division in Algorithms 6 and 7, respectively.

Algorithm 6 Multiple-precision Multiplication
INPUT: non-negative integers x and y with n + 1 and s + 1 radix b digits respectively.
OUTPUT: x · y = (z_{n+s+1} z_{n+s} ... z_1 z_0)_b.
1: for (i from 0 to n + s + 1) do
2:     z_i ← 0
3: end for
4: for (i from 0 to s) do
5:     c ← 0 /* carry digit */
6:     for (j from 0 to n) do
7:         (uv)_b ← z_{i+j} + x_j · y_i + c
8:         z_{i+j} ← v, c ← u
9:     end for
10:    z_{n+i+1} ← u
11: end for
12: return (z_{n+s+1} z_{n+s} ... z_1 z_0)_b
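The carry and borrow handling of Algorithms 1 to 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the library's CUDA code: the function names are hypothetical, operands are equal-length lists of radix b digits stored least-significant first, and the default radix 2**32 is assumed for the example.

```python
B = 2 ** 32  # assumed radix; digits are stored least-significant first


def mp_compare(x, y):
    """Algorithm 1: return 1, 0 or -1 for x > y, x == y, x < y."""
    i = len(x) - 1
    while i > 0 and x[i] == y[i]:
        i -= 1  # skip equal high-order digits
    if x[i] > y[i]:
        return 1
    if x[i] == y[i]:
        return 0
    return -1


def mp_add(x, y, b=B):
    """Algorithm 2: return x + y with one extra digit for the final carry."""
    z, c = [], 0
    for xi, yi in zip(x, y):
        s = xi + yi + c
        z.append(s % b)
        c = s // b  # carry is 0 or 1
    z.append(c)
    return z


def mp_sub(x, y, b=B):
    """Algorithm 3: return x - y, assuming x >= y."""
    z, c = [], 0
    for xi, yi in zip(x, y):
        d = xi - yi + c
        z.append(d % b)  # Python's % yields a value in [0, b)
        c = 0 if d >= 0 else -1  # borrow is 0 or -1
    return z


def mp_mod_add(x, y, m, b=B):
    """Algorithm 4: return (x + y) mod m, assuming x < m and y < m."""
    z = mp_add(x, y, b)  # n + 2 digits
    m_ext = m + [0]      # pad m with a zero digit to the same length
    if mp_compare(z, m_ext) >= 0:
        return mp_sub(z, m_ext, b)[:-1]
    return z[:-1]
```

For example, with b = 10, x = 87, y = 95 and m = 97, mp_mod_add([7, 8], [5, 9], [7, 9], b=10) returns [5, 8], i.e. 85 = (87 + 95) mod 97. Because x < m and y < m, a single conditional subtraction is enough and no division is needed.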

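The schoolbook multiplication of Algorithm 6 can be sketched in the same style. The function name mp_mul and the default radix are assumptions for illustration. The key point is that the inner-loop value z_{i+j} + x_j · y_i + c never exceeds b^2 - 1, so it always splits into exactly two radix b digits (u v)_b.

```python
B = 2 ** 32  # assumed radix; digits are stored least-significant first


def mp_mul(x, y, b=B):
    """Schoolbook product of digit lists x (n+1 digits) and y (s+1 digits)."""
    n1, s1 = len(x), len(y)
    z = [0] * (n1 + s1)  # the product has n + s + 2 digits
    for i in range(s1):
        c = 0  # carry digit
        for j in range(n1):
            # uv <= (b-1) + (b-1)**2 + (b-1) = b**2 - 1, i.e. two digits
            uv = z[i + j] + x[j] * y[i] + c
            z[i + j] = uv % b  # low digit v
            c = uv // b        # high digit u becomes the next carry
        z[i + n1] = c          # corresponds to z_{n+i+1} <- u
    return z
```

With b = 10, mp_mul([9, 9], [9, 9], b=10) returns [1, 0, 8, 9], which is 9801 = 99 · 99. On a GPU with 32-bit digits, the two-digit intermediate would need a 64-bit temporary or a separate high-word multiply; that mapping is outside this sketch.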
Algorithm 7 Multiple-precision Division
INPUT: non-negative integers x and y with n + 1 and s + 1 radix b digits respectively, n ≥ s ≥ 1, y_s ≠ 0.
OUTPUT: the quotient q = (q_{n-s} ... q_1 q_0)_b and remainder r = (r_s ... r_1 r_0)_b such that x = q · y + r, 0 ≤ r < y.
1: for (i from 0 to n - s) do
2:     q_i ← 0
3: end for
4: while (x ≥ y · b^{n-s}) do
5:     q_{n-s} ← q_{n-s} + 1
6:     x ← x - y · b^{n-s}
7: end while
8: for (i from n down to s + 1) do
9:     if (x_i == y_s) then q_{i-s-1} ← b - 1
10:    else q_{i-s-1} ← ⌊(x_i · b + x_{i-1}) / y_s⌋
11:    while (q_{i-s-1} · (y_s · b + y_{s-1}) > x_i · b^2 + x_{i-1} · b + x_{i-2}) do
12:        q_{i-s-1} ← q_{i-s-1} - 1
13:    end while
14:    x ← x - q_{i-s-1} · y · b^{i-s-1}
15:    if (x < 0) then
16:        x ← x + y · b^{i-s-1}
17:        q_{i-s-1} ← q_{i-s-1} - 1
18:    end if
19: end for
20: r ← x
21: return (q, r)

The classical modular multiplication is suitable for ordinary operations. However, when performing modular exponentiation, Montgomery multiplication performs much better [6]. Montgomery multiplication makes use of Montgomery reduction. Hence we give the pseudocode of Montgomery reduction and Montgomery multiplication in Algorithms 8 and 9, respectively.

Let m be a positive integer, and let R and A be integers such that R > m, gcd(m, R) = 1, and 0 ≤ A < m · R. The Montgomery reduction of A modulo m with respect to R is defined as A · R^{-1} mod m. In our library, R is chosen as b^n to simplify the calculation.

Algorithm 8 Multiple-precision Montgomery Reduction
INPUT: integer m with n radix b digits and gcd(m, b) = 1, R = b^n, m' = -m^{-1} mod b, and integer A with 2n radix b digits and A < m · R.
OUTPUT: T = A · R^{-1} mod m.
1: T ← A
2: for (i from 0 to n - 1)
3:     u_i ← T_i · m' mod b
4:     T ← T + u_i · m · b^i
5: end for
6: T ← T / b^n
7: if (T ≥ m) then T ← T - m
8: return T

Algorithm 9 Multiple-precision Montgomery Multiplication
INPUT: non-negative integers m, x, y with n radix b digits, x < m, y < m, and gcd(m, b) = 1, R = b^n, m' = -m^{-1} mod b.
OUTPUT: T = x · y · R^{-1} mod m.
1: T ← 0
2: for (i from 0 to n - 1)
3:     u_i ← (T_0 + x_i · y_0) · m' mod b
4:     T ← (T + x_i · y + u_i · m) / b
5: end for
6: if (T ≥ m) then T ← T - m
7: return T

D. Modular Exponentiation

Modular exponentiation has found many applications [7]. There are different ways to implement it. We choose to implement Montgomery exponentiation because it avoids division operations, which are very inefficient on GPUs.
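The way Montgomery arithmetic replaces division by m with exact divisions by the radix can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the names mont_mul and mont_exp are hypothetical, operands are kept as Python integers rather than digit arrays for brevity, and the modular inverse uses the three-argument pow with a negative exponent, which requires Python 3.8 or later.

```python
def mont_mul(x, y, m, n, b, m_prime):
    """Digit-serial Montgomery product x * y * R^{-1} mod m, with R = b**n.

    Assumes 0 <= x, y < m < b**n, gcd(m, b) == 1 and
    m_prime == -m^{-1} mod b.
    """
    t = 0
    y0 = y % b
    for i in range(n):
        xi = (x // b ** i) % b  # digit x_i
        u = ((t % b + xi * y0) * m_prime) % b
        # u is chosen so that the sum below is divisible by b
        t = (t + xi * y + u * m) // b
    if t >= m:
        t -= m
    return t


def mont_exp(x, e, m, n, b):
    """Left-to-right binary exponentiation x**e mod m built on mont_mul.

    Assumes 1 <= x < m, e >= 1, and m odd when b is a power of two.
    """
    r = b ** n
    m_prime = (-pow(m, -1, b)) % b  # needs Python 3.8+
    x_bar = mont_mul(x, (r * r) % m, m, n, b, m_prime)  # x * R mod m
    a = r % m                                           # 1 * R mod m
    for bit in bin(e)[2:]:  # bits e_t ... e_0, most significant first
        a = mont_mul(a, a, m, n, b, m_prime)
        if bit == "1":
            a = mont_mul(a, x_bar, m, n, b, m_prime)
    return mont_mul(a, 1, m, n, b, m_prime)  # leave the Montgomery domain
```

As a small check with b = 10, n = 2 and m = 97 (so R = 100 and m' = 7), mont_mul(5, 1, 97, 2, 10, 7) yields 34, which equals 5 · 100^{-1} mod 97. Inside the exponentiation loop every reduction is a shift by one digit, and the only true modular reductions (R mod m and R^2 mod m) depend on m alone, so they can be precomputed once per modulus.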
The pseudocode of Montgomery exponentiation is shown in Algorithm 10, where Mont(u, v) denotes the Montgomery multiplication of Algorithm 9.

Algorithm 10 Multiple-precision Montgomery Exponentiation
INPUT: integer m with n radix b digits and gcd(m, b) = 1, R = b^n, positive integer x with n radix b digits and x < m, and positive integer e = (e_t ... e_0)_2.
OUTPUT: x^e mod m.
1: x~ ← Mont(x, R^2 mod m)
2: A ← R mod m
3: for (i from t down to 0)
4:     A ← Mont(A, A)
5:     if e_i == 1 then A ← Mont(A, x~)
6: end for
7: A ← Mont(A, 1)
8: return A

IV. IMPLEMENTATION AND EXPERIMENTAL RESULT

In this section, we first briefly discuss the data structure of multiple-precision (MP) integers and the optimization techniques used by our library, and then report our experimental results. More details can be found in [13].

A. Data Structure of Multiple-precision Integer

We represent an MP integer as a sequence of 32-bit integers, since most GPUs support 32-bit integer operations. There are two ways to arrange this sequence of 32-bit integers in memory. One is to put the data of an MP integer in an array, so that a group of MP integers is stored as a two-dimensional array with one integer per row. The second way is to transpose this two-dimensional array, so that each MP integer is stored in a column instead of a row. The transposed layout is used to achieve coalesced memory access on GPUs.
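The difference between the two arrangements can be illustrated by the flat memory offset of digit i of the k-th MP integer. The following Python sketch is only an index calculation under simplifying assumptions (all integers padded to the same length, one thread per integer, hypothetical function names); it is not the library's data structure.

```python
def row_major_index(k, i, max_len):
    """Offset of digit i of integer k when each integer occupies one row."""
    return k * max_len + i


def column_major_index(k, i, count):
    """Offset of digit i of integer k when each integer occupies one column."""
    return i * count + k


def transpose_layout(rows):
    """Flatten equal-length digit lists into the column (transposed) layout."""
    count = len(rows)
    max_len = len(rows[0])
    flat = [0] * (count * max_len)
    for k, digits in enumerate(rows):
        for i, d in enumerate(digits):
            flat[column_major_index(k, i, count)] = d
    return flat
```

If four threads k = 0..3 each read digit i = 1 of their own integer, the row layout with max_len = 8 touches offsets 1, 9, 17, 25, whereas the column layout touches offsets 4, 5, 6, 7. Neighbouring threads then read neighbouring words, which is the access pattern that GPU hardware can coalesce into fewer memory transactions.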

In our implementation, a group of MP integers is organized in two parts. The first part is an array that keeps the length of each MP integer. The second part is a matrix. Suppose the number of MP integers is n and the maximum length of an MP integer is l; then the set of MP integers can be regarded as matrix[(n/w) · l][w], in which w is the number of columns.

B. Optimization Techniques

Using Constant Values with Cache Memory

Most algorithms use the same data multiple times during the calculation. In these cases, serving memory accesses from a cache can increase efficiency. On GPUs, texture and constant memory are cached. Thus, frequently accessed data can be kept in texture or constant memory in order to achieve high read efficiency.

Using Shared Memory for Temporary Values

From the algorithms listed in Section III, we notice that some algorithms (Algorithm [...] to [...]) need temporary variables. Using local or global memory to store these variables causes long read latency, whereas shared memory has much shorter read latency. Hence, we store the temporary variables in shared memory as much as possible.

Balancing the Computing Resources

In the CUDA programming model, the number of registers and the amount of shared memory in a single SM (Streaming Multiprocessor) are limited, and an SM can keep only [...] blocks active simultaneously. Consequently, in order to maximize the number of threads running on a single SM, we need to manage the number of registers and the amount of shared memory used by each block carefully.

C. Experimental Results

We tested our library on an XFX GTX [...] graphics card. It contains an NVIDIA GT[...] GPU which has [...] processing cores running at [...] GHz. For comparison, we also give the results of the GNU MP library running on an i7 CPU ([...] GHz). In the following figures (Figures 1 to 9), the x-axis denotes the number of multiple-precision integers, and the y-axis denotes the achieved number of operations per second.
Figures 1 to 6 and 9 show the operations per second achieved for addition, subtraction, multiplication, division, modular addition, modular subtraction, and Montgomery exponentiation by the GPU MP library running on the GPU and by the GNU MP library running on the CPU. In order to guarantee that the GPU runs at full load, we select five groups of data containing [...], [...], [...], [...], and [...] multiple-precision integers, respectively. In each group, we select multiple-precision integers of three different lengths: [...]-bit, [...]-bit and [...]-bit. Figures 7 and 8 list the results for the Montgomery reduction and Montgomery multiplication algorithms. Since the GNU MP library has no separate routines for Montgomery reduction and Montgomery multiplication, we only present our results on the GPU. All results show that the GPU MP library achieves a significant speedup on the GPU, far better than the GNU MP library running on the CPU.

Figure 1. Multiple-precision Addition running on CPU & GPU

Figure 2. Multiple-precision Subtraction running on CPU & GPU

Figure 3. Multiple-precision Multiplication running on CPU & GPU

Figure 4. Multiple-precision Division running on CPU & GPU

Figure 5. Multiple-precision Modular Addition running on CPU & GPU

Figure 6. Multiple-precision Modular Subtraction running on CPU & GPU

Figure 7. Multiple-precision Montgomery Reduction running on GPU

Figure 8. Multiple-precision Montgomery Multiplication running on GPU

Figure 9. Multiple-precision Montgomery Exponentiation running on CPU & GPU

V. CONCLUSIONS

Multiple-precision integer operations are an important component of public-key cryptography for encrypting and signing digital data. In this paper, we describe the design, implementation and optimization of a multiple-precision integer library for GPUs using CUDA. In the future, we will explore how to make use of the new Fermi architecture to further optimize the performance of our library. We will also port our library to OpenCL.

ACKNOWLEDGMENT

This work is supported by FRG Grant frg9: FRG/-9/9 from Hong Kong Baptist University.

REFERENCES

[1] NVIDIA CUDA. http://developer.nvidia.com/object/cuda.html
[2] NVIDIA CUDA Compute Unified Device Architecture: Programming Guide, Version [...]beta, Jun. [...].
[3] AMD CTM Guide: Technical Reference Manual. [...]. http://ati.amd.com/companyinfo/researcher/documents/ati_ctm_guide.pdf
[4] Seiler, L., et al., 2008. Larrabee: a many-core x86 architecture for visual computing. ACM Transactions on Graphics, 27(3), Aug. 2008.
[5] GNU MP Arithmetic Library. http://gmplib.org/
[6] Montgomery, P., 1985. Modular multiplication without trial division. Math. Computation, vol. 44, 1985, 519-521.
[7] Menezes, A., van Oorschot, P., and Vanstone, S., 1996. Handbook of Applied Cryptography. CRC Press, 1996.
[8] Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and Hwu, W. 2008. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of ACM PPoPP'08, Feb. 2008.
[9] Falcao, G., Sousa, L., and Silva, V. 2008. Massive parallel LDPC decoding in GPU. In Proceedings of ACM PPoPP'08, Feb. 2008.
[10] Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., and Phillips, J. C. 2008. GPU computing. IEEE Proceedings, May 2008, 879-899.
[11] X.-W. Chu, K. Zhao, and M. Wang. Massively Parallel Network Coding on GPUs. In Proceedings of IEEE IPCCC'08, Austin, Texas, USA, Dec. 2008.
[12] X.-W. Chu, K. Zhao, and M. Wang. Practical Random Linear Network Coding on GPUs. In Proceedings of IFIP Networking 2009, Aachen, Germany, May 2009.
[13] K. Zhao and X.-W. Chu. GPUMP: a Multiple-Precision Integer Library for GPUs. Technical Report, Department of Computer Science, Hong Kong Baptist University, [...].