Load-Balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *


Soonwook Hwang*, Kyusik Chung**, and Dongseung Kim*

*Department of Electrical Engineering, Korea University, Seoul, Rep. of Korea; e-mail: dkimku@korea.ac.kr
**Department of Computer Engineering & Info. Technology, Soongsil University, Seoul, Rep. of Korea; e-mail: kchung@soongsil.ac.kr

Abstract

Algorithms for finding prime numbers up to some integer n by the Sieve of Eratosthenes are simple and fast. However, even with a time complexity no greater than O(n ln ln n), the computation may take weeks or even years of CPU time if n is large. No shortcut has been reported yet. We develop efficient parallel algorithms to balance the workload of each computer, and to extend main memory with disk storage to accommodate gigabytes of data. Our experiments show that a complete set of up to -digit prime numbers can be found in a week of computing time (CPU time) using our .GHz Pentium- PCs with Linux and the MPI library. Also, we think it is very unlikely that the sieve of Eratosthenes can compute all primes up to 0 digits even using the fastest computers in the world.

Key words: prime number, sieve of Eratosthenes, load balancing, MPI, cluster computer

1. Introduction

Prime numbers get people's attention due to their important role in data security using encryption/decryption with public keys, such as in the RSA algorithm [6]. There are many mathematical theories on primes, but the computation cost for big prime numbers is usually huge, even with today's high-performance computers. The Sieve of Eratosthenes (abbreviated SOE later) [3,4] is a classic but straightforward way of finding all primes in a given range. Let's review the method. The sieving of numbers starts with an array A[1:n] whose initial values are set to zero. Beginning with 2, which is the smallest prime, all multiples of the prime up to n are marked with 1, i.e. A[4] = A[6] = ... = A[2k] = ... = 1 (k is an integer). Now, the next smallest unmarked integer, 3, is used to mark its multiples in the same way: A[6] = A[9] = ... = A[3k'] = ... = 1.
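For reference, the marking procedure just described can be sketched in C. This is a minimal serial illustration only, not the paper's cluster code; the function name soe_count and the byte array are our own choices.

```c
#include <stdlib.h>

/* Serial Sieve of Eratosthenes on A[0:n].
 * A[i] == 0 means i is still considered prime, as in the text.
 * Returns the count of primes <= n. */
long soe_count(long n) {
    if (n < 2) return 0;
    char *A = calloc(n + 1, 1);
    for (long p = 2; p * p <= n; p++) {       /* stop once p > sqrt(n) */
        if (A[p]) continue;                   /* p already marked composite */
        for (long x = p + p; x <= n; x += p)  /* mark 2p, 3p, ... by addition */
            A[x] = 1;
    }
    long cnt = 0;
    for (long i = 2; i <= n; i++)
        if (!A[i]) cnt++;
    free(A);
    return cnt;
}
```

For example, soe_count(100) yields 25, the number of primes up to 100.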
This procedure continues until we find a new smallest prime x such that x > sqrt(n). Then, all unmarked integers with A[·] = 0 are gathered as primes. The algorithm seems serial in nature, since the prime numbers are found one by one in increasing order, and all primes lying between sqrt(X) and X can be found only after we find all prime numbers up to sqrt(X). Let S denote the set that contains all prime numbers from 2 to Q needed to generate the primes up to n, where Q is the greatest prime less than or equal to sqrt(n). The prime numbers in S are called sieving primes. For simplicity, let's assume that S is indexed in increasing order; then S = {2, 3, 5, 7, ..., Q}. Let [X:Y] denote an interval (or range) that contains all the integers between X and Y inclusive. Sieving the interval is the process of marking all multiples of each sieving prime in the interval, and then collecting only the unmarked numbers as primes. In the literature, there are a few sequential algorithms of prime generation reported to reduce the

* This research was supported by KOSEF Grant (R0-00-000--0) and by a Korea University Research Grant.

complexity close to O(n) [1,4,5]. It is reported that the k-th prime is greater than k(ln k + ln ln k − 1) for k ≥ 2 [2]. Many big prime numbers have been reported [7,8]. This paper presents a parallel prime generation method that extends the work of Bokhari [1] with a combined (static and dynamic) load balancing algorithm: we apply SOE to distributed-memory parallel computers (clusters) and expand the range of integers beyond the main memory capacity using hard disk storage. We develop methods of partitioning the workload, and devise ways of efficiently representing and storing big integers. The proposed algorithms are expected to deliver the best performance if they provide an even distribution of the load to all computers.

Void Iterate_Sieving(n, M) {
  Find all sieving primes up to sqrt(n);
  for (I = 0; I < c; I++) {   /* c = n/M; covers the array A[I*M+1:(I+1)*M] */
    1) Read in the working prime set, up to sqrt((I+1)*M), from the disk;
    2) Mark the entries of A[ ] that are multiples of the numbers in the working prime set;
    3) Write A[I*M+1:(I+1)*M] to disk;
  }
}

2. Load Model of Sieve of Eratosthenes

Let's find the computational load of the work. The major computational load of sieving the range [1:n] is marking (writing) the array entries whose index matches some multiple of a sieving prime. The number of write operations for the multiples of the first prime, 2, is n/2. Similarly, it is n/3 for the second prime. Thus,

  L[1:n] = n/2 + n/3 + n/5 + ... + n/Q = Σ_{p∈S} n/p.

This load is bounded by n ln ln n + O(n) [3]. If a number x is a multiple of some prime p, the remainder after integer division is zero; and if x is a multiple, then x+p, x+2p, ... are also multiples. Thus, multiples can be found by addition instead of division, which saves computational cost. It is known that there are about X/ln X prime numbers up to X [3,8]; thus, if we set X as a -digit number, there are 0 prime numbers, which easily exceed 000 GB of hard disk capacity if they are stored in -bit format. To compute big integers using SOE, we first need to overcome the limit of the main memory capacity.
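A single pass of the iteration above, sieving one memory-sized interval [X:Y] with sieving primes up to sqrt(Y), can be sketched as follows. This is an illustrative C sketch, not the authors' implementation; the helper name sieve_interval and its internal small sieve are our assumptions.

```c
#include <stdlib.h>

/* Count primes in [X:Y]: first find the sieving primes up to sqrt(Y)
 * with a small sieve, then mark their multiples inside the interval.
 * mark[i] corresponds to the integer X+i; unmarked entries are prime. */
long sieve_interval(long X, long Y) {
    long r = 0;
    while ((r + 1) * (r + 1) <= Y) r++;        /* r = floor(sqrt(Y)) */
    char *small = calloc(r + 1, 1);            /* small[p] == 0 -> p prime */
    char *mark  = calloc(Y - X + 1, 1);
    for (long p = 2; p <= r; p++) {
        if (small[p]) continue;                /* skip composites */
        for (long q = p * p; q <= r; q += p) small[q] = 1;
        long start = ((X + p - 1) / p) * p;    /* first multiple of p >= X */
        if (start < p * p) start = p * p;      /* keep the prime p unmarked */
        for (long x = start; x <= Y; x += p)   /* stride by addition, no division */
            mark[x - X] = 1;
    }
    long cnt = 0;
    for (long i = 0; i <= Y - X; i++)
        if (X + i >= 2 && !mark[i]) cnt++;
    free(small); free(mark);
    return cnt;
}
```

Chaining such interval passes, with the marks written out to disk after each pass, is the structure that Iterate_Sieving repeats per memory-sized chunk.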
Only M integers are loaded into main memory at a time and the sieving computation is done on them, where M is the capacity of the main memory. Thus, we have to repeat the marking at least n/M times, as shown in Iterate_Sieving(). We assume that the set of sieving primes is available before the iteration, or that it can be computed very quickly. Cost-wise, throughout the iterations, lines 2) and 3) of Iterate_Sieving() need O(n ln ln n) and O(n) time, respectively. However, the time complexity of the above algorithm increases, since it has a term Ω(n^(3/2)/(M ln n)) because of line 1), as explained below. When n is large, we cannot load the whole set of sieving primes into memory (or into a cache block); instead we may read in only the necessary primes, called the working prime set, from the disk (or from main memory). At iteration i, the greatest integer in main memory for sieving is Mi, so the number of sieving primes needed is about sqrt(Mi)/ln sqrt(Mi). The overall time complexity t[1:n] of loading the working prime sets into main memory is obtained as

  t[1:n] = Σ_{i=1}^{n/M} sqrt(Mi)/ln sqrt(Mi)
         ≥ (2/ln n) Σ_{i=1}^{n/M} sqrt(Mi)        (since ln sqrt(Mi) ≤ (1/2) ln n)
         ≈ (2/(M ln n)) ∫_0^n sqrt(x) dx
         = (4/3) n^(3/2)/(M ln n).

Hence t[1:n] = Ω(n^(3/2)/(M ln n)); for a fixed memory capacity M this grows as Ω(n^(3/2)/ln n).

3. Parallel Algorithms

For parallel execution, the sieving computation can be partitioned among P processors in two ways. Either the marking space of the array A[1:n] is partitioned into P segments (subarrays), and all numbers that are multiples of any of the sieving primes are marked in each segment simultaneously; or the entire set of sieving primes is divided into P disjoint subsets, each processor marks on its local copy of A[1:n] the multiples of the sieving primes in its own subset, and then only those numbers are collected as primes whose P copies have never been marked by any processor. The latter can be implemented with the reduce operation of the MPI library. The two schemes are called array partition and set partition, respectively. For simplicity, assume here that P = 2^K for some integer K. Division into P equal loads is done in a recursive manner, as explained below. Our parallel computation adopts a master/slave configuration: the master is in charge of finding the sieving prime set S, while the slaves sieve the numbers in the ranges allocated to them. Another role of the master in this configuration is to supervise dynamic load distribution, as explained later. For set partition we should know how to partition the sieving prime set S into P disjoint subsets that give equal load to all processors, while in array partition we must divide the interval [1:n] into P segments each of which carries an even load. We start from two equal partitions. The half-load set boundary is the location that divides the sieving prime set into two subsets (left and right) such that the load of marking with the left subset equals the load of marking with the right; in other words, each demands half the overall load of the original set. The total load, denoted L[1:n], is approximated as

  L[1:n] = Σ_{p∈S} n/p ≈ ∫ (n/x) dx over the range of sieving primes.

We also use the relation ∫_u^v dx/x = ln v − ln u = ln(v/u). The approximation may not be as tight as the bound given in [3], but it will be applied in the computation below. Let (u~v) represent a subset of S that includes all primes between u and v.
Now, we want to know the point z (with u ≤ z ≤ v) where the number of sieving operations in the interval [1:n] with the primes in (u~z) is half the number with the primes in (u~v); in other words, the load of sieving with the primes in (u~z) equals the load with those in (z'~v), where z' is the immediate next prime after z:

  Σ_{p=u..z} n/p = Σ_{p=z'..v} n/p.

The summations are converted to integrations by approximation:

  ∫_u^z (n/x) dx = ∫_z^v (n/x) dx,
  ln z − ln u = ln v − ln z,
  ln z = (ln u + ln v)/2,
  z = sqrt(uv).

Now Z = sqrt(uv), which means that the sieving computation with the primes in (u ~ sqrt(uv)) needs half the load of the computation with the primes in (u~v). If the half-load boundary formula is applied again to both sets (u~z) and (z~v), we get four equal-load partitions. The method can be applied recursively to divide the original load into P = 2^K equal-load partitions, as described in Algorithm Set_Partition.

Void Set_Partition(L, R, P) {
  if (R-L > Min_interval) AND (P >= 2) {
    /* Min_interval is the threshold for parallel computation. */
    mid = sqrt(L*R);
    Set_Partition(L, mid, P/2);
    Set_Partition(mid, R, P/2);
  } else {
    Allocate all (L~R) to any one of the P processors;  /* (L~R) is a set of primes. */
    exit;
  }
}

For the other way of load partitioning, let us find the half-load point, i.e. the boundary dividing the interval [1:n] into two equal-load segments [1:y] and [y+1:n]. The half-load point y can be found as below, where S' is the subset of S that includes all primes less than y:

  L[1:y] = Σ_{p∈S'} y/p ≈ y ln ln y = (1/2) n ln ln n.

Finding a y that satisfies the above relationship exactly is not easy, so we instead find an approximation Y. If n is

very large, such as 10^20, the approximation is obtained as Y = n/2 with very small error. The relative error

  f = (L[1:y] − L[1:Y]) / L[1:y] = (ln ln n − ln(ln n − ln 2)) / (ln ln n)

is a non-increasing function of n and is very close to 0 (0 < f < 0.01), as shown below for n ≥ 10^20:

  f = (ln ln n − ln(ln n − ln 2)) / (ln ln n) ≤ (ln ln 10^20 − ln(ln 10^20 − ln 2)) / (ln ln 10^20) < 0.01.

The approximation is thus valid for big n.

4. Experiments and Discussion

The prime number generator has been tested on cluster computers. One of the clusters consists of .GHz AMD processors, each with a 0GB local disk and GB of main memory. All nodes are fully connected by Fast Ethernet and Myrinet switches. Another cluster planned for use is the -node Myrinet cluster at KISTI [9]. Our experiments are performed on the -node cluster, and the maximum execution time is currently limited to hours. Under this constraint, we can compute the primes up to n = 0. To reduce the overheads in the execution, we modify the sieving process as explained below. Firstly, to save storage and disk/memory access time, big integers/indexes of over ten digits (normally a 32-bit integer representation covers only up to ten digits) are grouped by a common prefix of binary bits, and only the lower bits/digits are represented explicitly. Secondly, the order of marking is adjusted to overcome the cache limitation. Finding an integer that is a multiple of some prime (actually the index i of A[i]) is usually done by an addition, since if x is a multiple of some prime p, then x+p, x+2p, ... are also multiples. Since we have a limited size of cache (or cache line), the next multiple may not be in the cache when p is greater than the cache line size; in such a case, every marking attempt generates a cache miss. In this respect, a certain number of adjacent sieving primes are grouped, and marking is done for all primes of a group together, instead of one prime at a time. As introduced earlier, set partitioning is simple to implement; however, it requires synchronization (a reduce operation) over the whole array A[1:n] from the P processors after the individual marking operations are complete.
It adds a communication cost of O(n log P), which may not be negligible compared to the sieving cost of n ln ln n. Thus, the algorithm is practical for small n, and our experiments focus on set partitioning. Since the half-load boundary is an approximate value, there can be some imbalance among the partitions. To achieve complete load balancing throughout the computation, static load distribution is combined with dynamic load balancing. Static distribution allocates only a fraction q (0 < q ≤ 1) of the interval [1:n] to the P processors. The remaining interval is equally divided into R (R >> P) small segments for dynamic load distribution; processors that finish sieving their statically assigned intervals receive small segments from the master for execution until none are left. Figure 1 shows the predicted and observed times of sequential execution for various ranges, where the upper limit (= n) is chosen as 0 for the convenience of the experiments. The plot reveals that the running times are close to O(n^(3/2)/ln n); the reason is the limited main memory/cache capacity, as explained in Section 2. Figures 2 to 4 plot the speedups of parallel computation with three job partition schemes: static only, static and dynamic combined, and dynamic only. Static simply divides the intervals equally, which gives much inferior results; the reason is that even though the counts of marking operations are equal, the real load due to cache misses is not reflected at all. The problem is relieved in dynamic and combined partitioning, since those schemes control the load during run time. When the problem size is big, parallel computation with the dynamic and combined partition schemes has the best performance, since it gets the smallest workload per job, which favors cache hits. We observe super-linear speedups in the plots; this is because the sequential execution suffers saturated operation, due to heavy disk traffic and high cache-miss rates from overloaded resource utilization, compared with the parallel runs.
Also, parallel computing is suitable for this computation since there is very little communication overhead for master/slave job distribution. As the range grows, in addition to huge CPU time consumption, the sieving also pushes other resources to their limits, such as hard disk capacity. To cover -digit prime numbers, we expect the hard disk would need to be as large as TB. We predict that it needs days of

computing time to get -digit primes with our -node cluster, while -digit primes need a year of computation.

5. Concluding Remarks

This research focuses on efficient parallel computation of prime numbers in a given range. The load is distributed to each processor statically and dynamically, so that eventually all processors spend the same amount of computation time, achieving the minimum execution time. The partitioning of load among processors was based on division of the computing interval and on division of the prime numbers in the set S of sieving primes. Saturation of performance is observed, due to full load on the disk and cache misses, in both the sequential and parallel prime generators. We expect that the complete set of prime numbers up to 0 can be obtained in days of execution, and up to 0 in a year, although the times could vary depending on the speed of the parallel computers.

Figure 1. Sequential time complexity (measured and predicted execution time in seconds vs. data size)
Figure 2. Speedup with static load distribution
Figure 3. Speedup with static & dynamic combined load distribution
Figure 4. Speedup with dynamic load distribution

References
[1] S. H. Bokhari, "Multiprocessing the sieve of Eratosthenes," IEEE Computer 20(4), April 1987.
[2] P. Dusart, "The k-th prime is greater than k(ln k + ln ln k − 1) for k ≥ 2," Mathematics of Computation, Vol. 68, No. 225, pp. 411-415, Jan. 1999.
[3] G. Hardy and E. Wright, An Introduction to the Theory of Numbers, Clarendon Press.
[4] X. Luo, "A practical sieve algorithm for finding prime numbers," Communications of the ACM, Vol. 32, No. 3, March 1989.
[5] G. Paillard, C. Lavault, and R. França, "A distributed prime sieving algorithm based on Scheduling by Multiple Edge Reversal," Int'l Symp. on Parallel and Distributed Computing, Univ. Lille, France, July 2005.
[6] R. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Comm. ACM 21(2), pp. 120-126, Feb. 1978.
[7] http://www.prime-numbers.org
[8] http://primes.utm.edu/howmany.shtml, "How many primes are there?"
[9] http://www.ksc.re.kr, KISTI Supercomputing Center.