GPUMP: a Multiple-Precision Integer Library for GPUs

Kaiyong Zhao and Xiaowen Chu
Department of Computer Science, Hong Kong Baptist University
Hong Kong, P. R. China
Email: {kyzhao, chxw}@comp.hkbu.edu.hk

Abstract

Multiple-precision integer operations are key components of many security applications, but they are computationally expensive on contemporary CPUs. In this paper, we present the design and implementation of a multiple-precision integer library for GPUs, implemented in CUDA. Our experimental results show that GPUs achieve a significant speedup over the GNU MP library running on CPUs.

Keywords: Multiple-precision algorithm, GPU, CUDA

I. INTRODUCTION

Public-key encryption plays a critical role in our daily life. The core component of a public-key system is a set of multiple-precision integer operations. A server that relies on public-key encryption (such as an SSL server) needs to process a large number of multiple-precision integer operations, which require huge computing power.

Recent advances in Graphics Processing Units (GPUs) have opened a new era of GPU computing. For example, commodity GPUs like NVIDIA's GTX has processing cores and can achieve 9 GFLOPS of computational horsepower. More importantly, the NVIDIA CUDA programming model makes it easier for developers to build non-graphics applications on GPUs. In CUDA, the GPU becomes a dedicated coprocessor to the host CPU and works on the Single-Program Multiple-Data (SPMD) principle, in which multiple threads based on the same code run simultaneously.

We are motivated by the fact that GPUs could be used to speed up multiple-precision integer operations. This is of practical importance to end users as well as application servers. However, it is not easy to achieve high performance on GPUs because of their complicated memory architecture and relatively slow integer operations. In this paper, we present the design, implementation, and experimental results of a highly optimized multiple-precision integer library.
Our library achieves a significant speedup for a number of multiple-precision integer operations.

The rest of the paper is organized as follows. Section II provides background information on the GPU architecture and the CUDA programming model. Section III presents our design and implementation of multiple-precision integer arithmetic on GPUs. Experimental results are presented in Section IV, and we conclude the paper in Section V.

Note: the latest Fermi architecture has better support for integer operations, but it is outside the scope of this paper.

II. BACKGROUND AND RELATED WORK

GPUs are dedicated hardware for manipulating computer graphics. Driven by the huge computing demand of real-time, high-definition 3D graphics, the GPU has evolved into a highly parallel, multithreaded, many-core processor. The growth of computing power in GPUs has driven the development of general-purpose computing on GPUs (GPGPU). The first generation of GPGPU required that any non-graphics application be mapped through graphics application programming interfaces (APIs). NVIDIA provided a general-purpose parallel programming model, namely Compute Unified Device Architecture (CUDA) [1] [2], which extends the C programming language for general-purpose application development. Meanwhile, another GPU vendor, AMD, introduced the Close To Metal (CTM) programming model, which provides an assembly language for application development [3]. Intel also unveiled Larrabee, a new many-core GPU architecture designed specifically for the GPU computing market, this year [4]. Since its release, CUDA has been used to speed up a large number of applications [8-12]. Given its popularity, we choose CUDA to implement our multiple-precision integer library.

III. MULTIPLE-PRECISION MODULAR ARITHMETIC

In this section, we present a set of library functions for multiple-precision modular arithmetic implemented on GPUs. In modular arithmetic, all operations are performed in a group $Z_m$, i.e., the set of integers $\{0, 1, \ldots, m-1\}$. In the following, the modulus $m$ is represented in radix $b$ as $(m_n m_{n-1} \cdots m_1 m_0)_b$, where $m_n \ne 0$.
Each symbol $m_i$ is referred to as a radix $b$ digit. Non-negative integers $x$ and $y$, $x < m$, $y < m$, are represented in radix $b$ as $(x_n x_{n-1} \cdots x_1 x_0)_b$ and $(y_n y_{n-1} \cdots y_1 y_0)_b$ respectively.

We have implemented the following multiple-precision library functions for CUDA:
- Multiple-precision comparison
- Multiple-precision addition and subtraction
- Multiple-precision modular addition and subtraction
- Multiple-precision multiplication and division
- Multiple-precision Montgomery reduction
- Multiple-precision Montgomery multiplication
- Multiple-precision exponentiation
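The radix-$b$ digit notation and the comparison operation listed above can be modeled in a few lines. The following is a minimal reference sketch in Python rather than CUDA, intended only to make the representation concrete. The radix $b = 2^{32}$, the little-endian digit order, and the names `to_digits`, `from_digits`, and `mp_cmp` are assumptions of this sketch, not part of the library.

```python
B = 1 << 32  # radix b; a 32-bit word size is an assumption of this sketch


def to_digits(x, ndigits):
    """Return x as a little-endian list of `ndigits` radix-B digits."""
    if x < 0 or x >= B ** ndigits:
        raise ValueError("x does not fit in the requested number of digits")
    digits = []
    for _ in range(ndigits):
        digits.append(x % B)   # least significant digit first
        x //= B
    return digits


def from_digits(digits):
    """Inverse of to_digits: rebuild the integer from little-endian digits."""
    value = 0
    for d in reversed(digits):
        value = value * B + d
    return value


def mp_cmp(x, y):
    """Compare equal-length digit lists: 1 if x > y, 0 if x == y, -1 if x < y."""
    i = len(x) - 1
    # Scan from the most significant digit down to the first difference.
    while i > 0 and x[i] == y[i]:
        i -= 1
    if x[i] > y[i]:
        return 1
    if x[i] == y[i]:
        return 0
    return -1
```

Storing the least significant digit first keeps the index of a digit equal to its power of $b$, which is convenient for the carry loops used by the arithmetic operations.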

Our library implements each operation as a single thread. To make full use of a GPU, hundreds to thousands of threads must execute simultaneously. It is also possible to implement a complicated operation with multiple threads, e.g., a block of threads could be used to perform a single operation such as exponentiation. We leave this as future work.

A. Comparison, Addition and Subtraction

The pseudocode of the multiple-precision comparison, addition, and subtraction operations is shown in Algorithms 1, 2, and 3, respectively.

Algorithm 1 Multiple-precision Comparison
INPUT: non-negative integers $x$ and $y$, each with $n + 1$ radix $b$ digits.
OUTPUT: 1 if $x > y$; 0 if $x = y$; $-1$ if $x < y$.
    $i \leftarrow n$
    while ($x_i = y_i$ and $i > 0$)
        $i \leftarrow i - 1$
    end while
    if ($x_i > y_i$) then return 1
    else if ($x_i = y_i$) then return 0
    else return $-1$

Algorithm 2 Multiple-precision Addition
INPUT: non-negative integers $x$ and $y$, each with $n + 1$ radix $b$ digits.
OUTPUT: $x + y = (z_{n+1} z_n \cdots z_1 z_0)_b$.
    $c \leftarrow 0$ /* carry digit */
    for ($i$ from 0 to $n$) do
        $z_i \leftarrow (x_i + y_i + c) \bmod b$
        $c \leftarrow \lfloor (x_i + y_i + c) / b \rfloor$
    end for
    $z_{n+1} \leftarrow c$
    return $(z_{n+1} z_n \cdots z_1 z_0)_b$

Algorithm 3 Multiple-precision Subtraction
INPUT: non-negative integers $x$ and $y$, each with $n + 1$ radix $b$ digits, $x \ge y$.
OUTPUT: $x - y = (z_n z_{n-1} \cdots z_1 z_0)_b$.
    $c \leftarrow 0$ /* carry digit */
    for ($i$ from 0 to $n$) do
        $z_i \leftarrow (x_i - y_i + c) \bmod b$
        if ($x_i - y_i + c \ge 0$) then $c \leftarrow 0$
        else $c \leftarrow -1$
    end for
    return $(z_n z_{n-1} \cdots z_1 z_0)_b$

B. Modular Addition and Subtraction

The pseudocode of the multiple-precision modular addition and subtraction operations is shown in Algorithms 4 and 5, respectively.

Algorithm 4 Multiple-precision Modular Addition
INPUT: non-negative integers $x$, $y$ and modulus $m$, each with $n + 1$ radix $b$ digits, $x < m$, $y < m$.
OUTPUT: $(x + y) \bmod m = (z_n z_{n-1} \cdots z_1 z_0)_b$.
    $c \leftarrow 0$ /* carry digit */
    for ($i$ from 0 to $n$) do
        $z_i \leftarrow (x_i + y_i + c) \bmod b$
        if ($x_i + y_i + c < b$) then $c \leftarrow 0$
        else $c \leftarrow 1$
    end for
    $z_{n+1} \leftarrow c$, $m_{n+1} \leftarrow 0$
    if ($(z_{n+1} z_n \cdots z_1 z_0)_b \ge (m_{n+1} m_n \cdots m_1 m_0)_b$) then
        $(t_{n+1} t_n \cdots t_1 t_0)_b \leftarrow (z_{n+1} z_n \cdots z_1 z_0)_b - (m_{n+1} m_n \cdots m_1 m_0)_b$
        return $(t_n t_{n-1} \cdots t_1 t_0)_b$
    else return $(z_n z_{n-1} \cdots z_1 z_0)_b$

Algorithm 5 Multiple-precision Modular Subtraction
INPUT: non-negative integers $x$, $y$ and modulus $m$, each with $n + 1$ radix $b$ digits, $x < m$, $y < m$.
OUTPUT: $(x - y) \bmod m = (z_n z_{n-1} \cdots z_1 z_0)_b$.
    if ($x \ge y$) then return $x - y$
    else
        $t \leftarrow (m - y)$
        return $(x + t) \bmod m$
    end else

C. Multiplication, Division, and Modular Multiplication

One straightforward method to implement the modular multiplication $x \cdot y \bmod m$ is to calculate $x \cdot y$ first and then calculate the remainder of $x \cdot y$ divided by $m$. Hence modular multiplication can be implemented using multiplication and division operations. Next, we give the pseudocode for multiple-precision multiplication and division in Algorithms 6 and 7, respectively.

Algorithm 6 Multiple-precision Multiplication
INPUT: positive integers $x$ and $y$ with $n + 1$ radix $b$ digits and $s + 1$ radix $b$ digits respectively.
OUTPUT: $x \cdot y = (z_{n+s+1} z_{n+s} \cdots z_1 z_0)_b$.
    for ($i$ from 0 to $n + s + 1$) do
        $z_i \leftarrow 0$
    end for
    for ($i$ from 0 to $s$) do
        $c \leftarrow 0$ /* carry digit */
        for ($j$ from 0 to $n$) do
            $(uv)_b \leftarrow z_{i+j} + x_j \cdot y_i + c$
            $z_{i+j} \leftarrow v$, $c \leftarrow u$
        end for
        $z_{n+i+1} \leftarrow u$
    end for
    return $(z_{n+s+1} z_{n+s} \cdots z_1 z_0)_b$
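The carry-propagating addition, borrow-propagating subtraction, modular addition, and schoolbook multiplication described above can be checked against a small reference model. The sketch below is written in Python rather than CUDA and is only an illustration under stated assumptions: radix $b = 2^{32}$, little-endian digit lists, and hypothetical function names (`mp_add`, `mp_sub`, `mp_mod_add`, `mp_mul`, plus the helpers `digits` and `value`). Python's arbitrary-precision integers hide the double-word products that a GPU kernel would have to handle explicitly.

```python
B = 1 << 32  # radix b (a 32-bit word size is assumed for illustration)


def digits(x, k):
    """Helper: x as k little-endian radix-B digits."""
    return [(x // B**i) % B for i in range(k)]


def value(d):
    """Helper: integer value of a little-endian digit list."""
    return sum(di * B**i for i, di in enumerate(d))


def mp_add(x, y):
    """x + y for equal-length digit lists; the result has one extra digit."""
    z, c = [], 0
    for xi, yi in zip(x, y):
        s = xi + yi + c
        z.append(s % B)
        c = s // B          # carry into the next digit (0 or 1)
    z.append(c)
    return z


def mp_sub(x, y):
    """x - y for equal-length digit lists, assuming x >= y."""
    z, c = [], 0            # c is the borrow: 0 or -1
    for xi, yi in zip(x, y):
        d = xi - yi + c
        z.append(d % B)     # Python's % always returns a non-negative digit
        c = 0 if d >= 0 else -1
    return z


def mp_mod_add(x, y, m):
    """(x + y) mod m for x < m and y < m, all with the same digit count."""
    z = mp_add(x, y)        # n + 2 digits including the carry digit
    m_ext = m + [0]         # extend m with a zero top digit
    # Equal-length lists compare correctly once reversed (most significant first).
    if z[::-1] >= m_ext[::-1]:
        z = mp_sub(z, m_ext)
    return z[:-1]           # the top digit is zero after reduction


def mp_mul(x, y):
    """Schoolbook product of two digit lists; result has len(x) + len(y) digits."""
    z = [0] * (len(x) + len(y))
    for i, yi in enumerate(y):
        c = 0
        for j, xj in enumerate(x):
            uv = z[i + j] + xj * yi + c   # fits in two radix-B digits
            z[i + j] = uv % B             # low digit v
            c = uv // B                   # high digit u becomes the carry
        z[i + len(x)] = c
    return z
```

Because each function works on one pair of operands independently, the same loops map naturally onto the one-operation-per-thread scheme described in this section.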

Algorithm 7 Multiple-precision Division
INPUT: positive integers $x$ and $y$ with $n + 1$ radix $b$ digits and $s + 1$ radix $b$ digits respectively, $n \ge s \ge 1$, $y_s \ne 0$.
OUTPUT: the quotient $q = (q_{n-s} \cdots q_1 q_0)_b$ and remainder $r = (r_s \cdots r_1 r_0)_b$ such that $x = q y + r$, $0 \le r < y$.
    for ($i$ from 0 to $n - s$) do
        $q_i \leftarrow 0$
    end for
    while ($x \ge y b^{n-s}$) do
        $q_{n-s} \leftarrow q_{n-s} + 1$
        $x \leftarrow x - y b^{n-s}$
    end while
    for ($i$ from $n$ down to $s + 1$) do
        if ($x_i = y_s$) then $q_{i-s-1} \leftarrow b - 1$
        else $q_{i-s-1} \leftarrow \lfloor (x_i b + x_{i-1}) / y_s \rfloor$
        while ($q_{i-s-1} (y_s b + y_{s-1}) > x_i b^2 + x_{i-1} b + x_{i-2}$) do
            $q_{i-s-1} \leftarrow q_{i-s-1} - 1$
        end while
        $x \leftarrow x - q_{i-s-1} y b^{i-s-1}$
        if ($x < 0$) then
            $x \leftarrow x + y b^{i-s-1}$
            $q_{i-s-1} \leftarrow q_{i-s-1} - 1$
        end if
    end for
    $r \leftarrow x$
    return $(q, r)$

Classical modular multiplication is suitable for ordinary operations. However, when performing modular exponentiation, Montgomery multiplication performs much better [6]. Montgomery multiplication makes use of Montgomery reduction. We therefore give the pseudocode of Montgomery reduction and Montgomery multiplication in Algorithms 8 and 9, respectively.

Let $m$ be a positive integer, and let $R$ and $A$ be integers such that $R > m$, $\gcd(m, R) = 1$, and $0 \le A < mR$. The Montgomery reduction of $A$ modulo $m$ with respect to $R$ is defined as $A R^{-1} \bmod m$. In our library, $R$ is chosen as $b^n$ to simplify the calculation.

Algorithm 8 Multiple-precision Montgomery Reduction
INPUT: integer $m$ with $n$ radix $b$ digits and $\gcd(m, b) = 1$, $R = b^n$, $m' = -m^{-1} \bmod b$, and integer $A$ with $2n$ radix $b$ digits and $A < mR$.
OUTPUT: $T = A R^{-1} \bmod m$.
    $T \leftarrow A$
    for ($i$ from 0 to $n - 1$)
        $u_i \leftarrow T_i m' \bmod b$
        $T \leftarrow T + u_i m b^i$
    end for
    $T \leftarrow T / b^n$
    if ($T \ge m$) then $T \leftarrow T - m$
    return $T$

Algorithm 9 Multiple-precision Montgomery Multiplication
INPUT: non-negative integers $m$, $x$, $y$ with $n$ radix $b$ digits, $x < m$, $y < m$, and $\gcd(m, b) = 1$, $R = b^n$, $m' = -m^{-1} \bmod b$.
OUTPUT: $T = x y R^{-1} \bmod m$.
    $T \leftarrow 0$
    for ($i$ from 0 to $n - 1$)
        $u_i \leftarrow (T_0 + x_i y_0) m' \bmod b$
        $T \leftarrow (T + x_i y + u_i m) / b$
    end for
    if ($T \ge m$) then $T \leftarrow T - m$
    return $T$

D. Modular Exponentiation

Modular exponentiation has many applications [7]. There are different ways to implement it. We choose to implement Montgomery exponentiation because it avoids division operations, which are very inefficient on GPUs.
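To illustrate why Montgomery arithmetic avoids division by the modulus, the sketch below models digit-serial Montgomery multiplication and the left-to-right exponentiation built on it. It is a Python reference model, not the library's CUDA code. The radix $b = 2^{32}$, the function names `mont_mul` and `mont_exp`, and the use of Python integers to hold the accumulator $T$ (instead of a digit array) are all assumptions made for brevity. The three-argument `pow` with exponent `-1` requires Python 3.8 or later.

```python
WORD_BITS = 32            # assumed word size for this sketch
B = 1 << WORD_BITS        # radix b


def mont_mul(x, y, m, n, m_prime):
    """Return x * y * R^{-1} mod m with R = B**n, for odd m < B**n and x, y < m.

    m_prime must equal -m^{-1} mod B.
    """
    t = 0
    y0 = y % B
    for i in range(n):
        xi = (x >> (WORD_BITS * i)) % B            # i-th radix-B digit of x
        u = ((t % B + xi * y0) * m_prime) % B      # makes the low digit vanish
        t = (t + xi * y + u * m) // B              # exact division by b
    if t >= m:                                     # single conditional subtraction
        t -= m
    return t


def mont_exp(x, e, m, n):
    """Return x**e mod m using only Montgomery products (m odd, x < m, e > 0)."""
    r = B ** n
    m_prime = (-pow(m % B, -1, B)) % B             # -m^{-1} mod b
    x_bar = mont_mul(x, (r * r) % m, m, n, m_prime)  # x in Montgomery form
    a = r % m                                      # 1 in Montgomery form
    for i in range(e.bit_length() - 1, -1, -1):    # scan exponent bits, MSB first
        a = mont_mul(a, a, m, n, m_prime)
        if (e >> i) & 1:
            a = mont_mul(a, x_bar, m, n, m_prime)
    return mont_mul(a, 1, m, n, m_prime)           # leave Montgomery form
```

The only divisions inside the loops are by the radix $b$, which reduce to dropping the lowest word. The two reductions modulo $m$ in `mont_exp` (`R*R % m` and `R % m`) depend only on the modulus, so in this sketch's model they could be precomputed once per modulus.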
The pseudocode of Montgomery exponentiation is shown in Algorithm 10.

Algorithm 10 Multiple-precision Montgomery Exponentiation
INPUT: integer $m$ with $n$ radix $b$ digits and $\gcd(m, b) = 1$, $R = b^n$, positive integer $x$ with $n$ radix $b$ digits and $x < m$, and positive integer $e = (e_t \cdots e_0)_2$.
OUTPUT: $x^e \bmod m$.
    $\tilde{x} \leftarrow \mathrm{Mont}(x, R^2 \bmod m)$
    $A \leftarrow R \bmod m$
    for ($i$ from $t$ down to 0)
        $A \leftarrow \mathrm{Mont}(A, A)$
        if $e_i = 1$ then $A \leftarrow \mathrm{Mont}(A, \tilde{x})$
    end for
    $A \leftarrow \mathrm{Mont}(A, 1)$
    return $A$

IV. IMPLEMENTATION AND EXPERIMENTAL RESULT

In this section, we first briefly discuss the data structure used for multiple-precision (MP) integers and the optimization techniques used by our library, and then report our experimental results. More details can be found in [13].

A. Data Structure of Multiple-precision Integer

We represent an MP integer as a sequence of -bit integers, since most GPUs support -bit integer operations. There are two ways to arrange this sequence of -bit integers in memory. One is to put the data of an MP integer in an array, so that a group of MP integers is stored as a two-dimensional array with one integer per row. The second way is to transpose this two-dimensional array, so that each MP integer is stored in a column instead of a row. The transposed layout allows coalesced memory access on GPUs.
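The difference between the two layouts comes down to how a (integer, digit) pair is mapped to a flat memory offset. The following Python sketch illustrates that mapping only; the function names and the flat-list model of device memory are assumptions of this example, not the library's API.

```python
def row_major_index(k, j, length):
    """Offset of digit j of integer k when each integer occupies one row."""
    return k * length + j


def column_major_index(k, j, count):
    """Offset of digit j of integer k when each integer occupies one column."""
    return j * count + k


def pack_transposed(numbers, length):
    """Store `numbers` (equal-length digit lists) in the transposed layout."""
    count = len(numbers)
    flat = [0] * (count * length)
    for k, digit_list in enumerate(numbers):
        for j, d in enumerate(digit_list):
            flat[column_major_index(k, j, count)] = d
    return flat


# Thread k handles integer k. When all threads read digit j = 0 at the same
# time, the transposed layout touches consecutive offsets, while the row
# layout touches offsets that are `length` words apart.
transposed_offsets = [column_major_index(k, 0, 4) for k in range(4)]
row_offsets = [row_major_index(k, 0, 3) for k in range(4)]
```

With four integers of three digits each, `transposed_offsets` is `[0, 1, 2, 3]` and `row_offsets` is `[0, 3, 6, 9]`. Consecutive offsets across neighbouring threads are the access pattern that the hardware can merge into fewer memory transactions.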

In our implementation, a group of MP integers is organized in two parts. The first is an array that keeps the length of each MP integer. The second is a matrix. Suppose the number of MP integers is $n$ and the maximum length of an MP integer is $l$; the set of MP integers can then be regarded as matrix[$(n/w) \times l$][$w$], in which $w$ is the number of columns.

B. Optimization Techniques

Using Constant Value with Cache Memory

Most algorithms use the same data multiple times during a calculation. In these cases, serving the data through a cache increases efficiency. On GPUs, texture memory and constant memory are cached. Frequently accessed data can therefore be kept in texture or constant memory to achieve high read efficiency.

Using Shared Memory for Temp Value

Some of the algorithms listed in Section III need temporary variables. Storing these variables in local or global memory causes long read latency, whereas shared memory offers much shorter latency. Hence, we store the temporary variables in shared memory as much as possible.

Balancing the Computing Resource

In the CUDA programming model, the number of registers and the amount of shared memory in a single SM (Streaming Multiprocessor) are limited, which restricts how many blocks can be active simultaneously. Consequently, to maximize the number of threads running on a single SM, we need to manage the number of registers and the amount of shared memory used by each block carefully.

C. Experimental Results

We tested our library on an XFX GTX graphics card. It contains an NVIDIA GT GPU which has processing cores working at. GHz. We also give the results of the GNU MP library running on an i7 CPU (.GHz) for comparison. In the following figures (Figures 1 to 9), the x-axis denotes the number of multiple-precision integers, and the y-axis denotes the achieved number of operations per second.
Figures 1 to 6 and 9 show the operations per second achieved for addition, subtraction, multiplication, division, modular addition, modular subtraction, and Montgomery exponentiation, respectively, by our library running on the GPU and by the GNU MP library running on the CPU. To guarantee that the GPU runs at full load, we select five groups of data, containing 9,, 7,, and 7 multiple-precision integers, respectively. In each group, we select multiple-precision integers of three different lengths: -bit, -bit and -bit. Figures 7 and 8 show the results for the Montgomery reduction and Montgomery multiplication algorithms. Since the GNU MP library has no separate routines for Montgomery reduction and Montgomery multiplication, we only present our results on the GPU. All results show that our library achieves a significant speedup on the GPU, far better than the GNU MP library running on the CPU.

Figure 1. Multiple-precision Addition running on CPU & GPU

Figure 2. Multiple-precision Subtraction running on CPU & GPU

Figure 3. Multiple-precision Multiplication running on CPU & GPU

Figure 4. Multiple-precision Division running on CPU & GPU

Figure 5. Multiple-precision Modular Addition running on CPU & GPU

Figure 6. Multiple-precision Modular Subtraction running on CPU & GPU

Figure 7. Multiple-precision Montgomery Reduction running on GPU

Figure 8. Multiple-precision Montgomery Multiplication running on GPU

Figure 9. Multiple-precision Montgomery Exponentiation running on CPU & GPU

V. CONCLUSIONS

Multiple-precision integer operations are an important component of public-key cryptography for encrypting and signing digital data. In this paper, we describe the design, implementation, and optimization of a multiple-precision integer library for GPUs using CUDA. In the future, we will explore how to use the new Fermi architecture to further optimize the performance of our library. We will also port our library to OpenCL.

ACKNOWLEDGMENT

This work is supported by FRG Grant frg9: FRG/-9/9 from Hong Kong Baptist University.

REFERENCES

[1] NVIDIA CUDA. http://developer.nvidia.com/object/cuda.html
[2] NVIDIA CUDA Compute Unified Device Architecture: Programming Guide, Version.beta, Jun..
[3] AMD CTM Guide: Technical Reference Manual. http://ati.amd.com/companyinfo/researcher/documents/ati_ctm_guide.pdf
[4] Seiler, L., et al. Larrabee: a many-core x86 architecture for visual computing. ACM Transactions on Graphics, 27(3), Aug. 2008.
[5] GNU MP Arithmetic Library. http://gmplib.org/
[6] Montgomery, P., 1985. Modular multiplication without trial division. Math. Computation, vol. 44, 1985, 519-521.
[7] Menezes, A., van Oorschot, P., and Vanstone, S., 1996. Handbook of Applied Cryptography. CRC Press, 1996.
[8] Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and Hwu, W. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of ACM PPoPP, Feb. 2008.
[9] Falcao, G., Sousa, L., and Silva, V. Massively parallel LDPC decoding on GPU. In Proceedings of ACM PPoPP, Feb. 2008.
[10] Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., and Phillips, J. C. GPU computing. Proceedings of the IEEE, May 2008, 879-899.
[11] X.-W. Chu, K. Zhao, and M. Wang. Massively Parallel Network Coding on GPUs. In Proceedings of IEEE IPCCC, Austin, Texas, USA, Dec. 2008.
[12] X.-W. Chu, K. Zhao, and M. Wang. Practical Random Linear Network Coding on GPUs. In Proceedings of IFIP Networking 2009, Aachen, Germany, May 2009.
[13] K. Zhao and X.-W. Chu. GPUMP: a Multiple-Precision Integer Library for GPUs. Technical Report, Department of Computer Science, Hong Kong Baptist University,.