Load Balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers*

Soonwook Hwang*, Kyusik Chung**, and Dongseung Kim*
*Department of Electrical Engineering, Korea University, Seoul, Rep. of Korea, e-mail: dkimku@korea.ac.kr
**Department of Computer Engineering & Info. Technology, Soongsil University, Seoul, Rep. of Korea, e-mail: kchung@soongsil.ac.kr

Abstract

Algorithms for finding all prime numbers up to some integer N by the Sieve of Eratosthenes are simple and fast. However, even with a time complexity no greater than O(N ln ln N), the computation may take weeks or even years of CPU time if N is large. No shortcut has been reported yet. We develop efficient parallel algorithms to balance the workload of each computer and to extend main memory with disk storage so as to accommodate gigabytes of data. Our experiments show that a complete set of prime numbers over a very large range can be found in a week of computing time (CPU time) using our Pentium-4 PCs with Linux and the MPI library. Also, with the Sieve of Eratosthenes, we think it very unlikely that all primes up to some tens of digits can be computed using the fastest computers in the world.

Key words: prime number, sieve of Eratosthenes, load balancing, MPI, cluster computer

1. Introduction

Prime numbers get people's attention due to their important role in data security using encryption/decryption by public keys, such as with the RSA algorithm [6]. There are many mathematical theories on primes, but the computation cost of big prime numbers is usually huge even with today's high-performance computers. The Sieve of Eratosthenes (abbreviated SOE later) [3,4] is a classic but straightforward way of finding all primes in a given range. Let's review the method. The sieving of numbers starts with an array A[2:N] whose initial value is set to zero. Beginning with 2, which is the smallest prime, all multiples of the prime up to N are marked with 1, i.e. A[4] = A[6] = ... = A[2k] = ... = 1 (k is an integer). Now, the next smallest unmarked integer, 3, is used to mark its multiples in the same way: A[6] = A[9] = ... = A[3k'] = ... = 1.
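The marking procedure just described can be sketched directly; the following is a minimal Python illustration of the classic sieve, not the paper's optimized implementation:

```python
def sieve_of_eratosthenes(n):
    """Classic SOE: A[i] == 0 means i is unmarked; multiples of each newly
    found prime are marked with 1, exactly as described above."""
    A = [0] * (n + 1)
    x = 2
    while x * x <= n:            # the procedure stops once x * x exceeds n
        if A[x] == 0:            # x is the next smallest unmarked integer
            for m in range(2 * x, n + 1, x):
                A[m] = 1         # mark 2x, 3x, 4x, ... by stepping in increments of x
        x += 1
    # gather all unmarked integers as primes
    return [i for i in range(2, n + 1) if A[i] == 0]

print(sieve_of_eratosthenes(30))   # → [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Note that the outer loop only needs to run while x² ≤ n, which is exactly the termination condition discussed next.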
This procedure continues until we find a new smallest prime x such that x² > N. Then, all unmarked integers with A[·] = 0 are gathered as primes. The algorithm seems serial in nature, since prime numbers are found one by one in increasing order, and all primes lying between X and X² can be found only after we find all prime numbers up to X. Let S denote the set that contains all prime numbers from 2 to Q needed to generate primes up to N, where Q is the greatest prime less than or equal to √N. The prime numbers in S are called sieving primes. For simplicity let's assume that S is indexed in increasing order. Then, S = {2, 3, 5, 7, ..., Q}. Let [X:Y] denote an interval (or a range) that contains all the integers between X and Y inclusive. Sieving in the interval is the process of marking all multiples of each sieving prime in the interval, and then collecting only the unmarked numbers as primes.

* This research was supported by a KOSEF Grant (R0-00-000--0) and by a Korea University Research Grant.

In the literature, there are a few sequential algorithms of prime generation reported to reduce the
complexity close to O(N) [3,4,5]. It is reported that the k-th prime is greater than k(ln k + ln ln k − 1) for k ≥ 2 [2]. Many big prime numbers are reported in [7,8]. This paper presents a parallel prime generation method that extends the work of Bokhari [1] with a combined (static and dynamic) load balancing algorithm: we apply SOE to distributed-memory parallel computers (clusters) and expand the range of integers beyond the main-memory capacity using hard disk storage. We develop methods of partitioning the workload, and devise a way of efficiently representing and storing big integers. The proposed algorithms are expected to deliver the best performance if they provide an even distribution of the load to all computers.

    Void Iterate_Sieving(N, M) {
        Find all sieving primes up to sqrt(N);
        for (I = 0; I < n; I++) {        /* n = N/M */
            /* covers the array A[I*M+1 : (I+1)*M] */
            1) Read in the working prime set up to sqrt((I+1)*M) from the disk;
            2) Mark A[.] = 1 for entries that are multiples of a number in the working prime set;
            3) Write A[I*M+1 : (I+1)*M] to disk;
        }
    }

2. Load Model of Sieve of Eratosthenes

Let's find the computational load of the work. The major computational load of sieving in the range [2:N] is to mark (or write) on the array entries whose index matches some multiple of a sieving prime. The number of write operations for the multiples of the first prime 2 is N/2. Similarly, it is N/3 for the second prime 3. Thus,

    L[2:N] = N/2 + N/3 + N/5 + ... + N/Q = Σ_{p∈S} N/p.

This load is bounded by N ln ln N + O(N) [3]. If a number x is a multiple of some prime p, the remainder after an integer division should be zero. If x is a multiple, then x+p, x+2p, ... are also multiples. Thus, multiples can be found by addition instead of division, which saves computational cost. It is known that there are about X/ln X prime numbers up to X [3,8]; thus, if we set X as a large multi-digit number, there are so many prime numbers that they easily exceed hard disk capacities of hundreds to thousands of GB if they are represented in 64-bit format. To compute big integers using SOE, we first need to overcome the limit of main memory capacity.
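The load formula L[2:N] = Σ_{p∈S} N/p and its N ln ln N bound can be checked numerically. A small sketch follows; plain trial division suffices here since only the primes up to √N are needed:

```python
import math

def sieving_primes(n):
    """All sieving primes, i.e. primes up to sqrt(n), by trial division."""
    limit = math.isqrt(n)
    return [p for p in range(2, limit + 1)
            if all(p % q != 0 for q in range(2, math.isqrt(p) + 1))]

def sieving_load(n):
    """L[2:n] = n/2 + n/3 + n/5 + ... + n/Q, the count of marking (write) ops."""
    return sum(n // p for p in sieving_primes(n))

n = 10**6
print(sieving_load(n))             # roughly 2.2 million write operations
print(n * math.log(math.log(n)))   # the n*ln(ln(n)) bound, about 2.6 million
```

The measured count stays below the n ln ln n bound, consistent with the analysis above.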
Only M integers are loaded into main memory at a time and the sieving computation is done on them, where M is the capacity of the main memory. Thus, we have to repeat the marking at least N/M times, as shown in Iterate_Sieving() above. We assume that the set of sieving primes is available before the iteration, or that it can be computed very quickly. Cost-wise, over all the iterations, line 2) and line 3) in Iterate_Sieving() need O(N ln ln N) and O(N) time, respectively. However, the time complexity of the above algorithm increases, since it contains a term Ω(N√N / (M ln N)) because of line 1), as explained below. When N is large, we cannot load the whole set of sieving primes into memory (or into a cache block); instead we may read in only the necessary prime numbers, called the working prime set, from the disk (or from main memory). At iteration i, the greatest integer in main memory for sieving is Mi, so the number of sieving primes needed is about √(Mi) / ln √(Mi). The overall time t[2:N] of loading the working prime sets into main memory is obtained as

    t[2:N] = Σ_{i=1}^{n} √(Mi) / ln √(Mi)  ≥  (√M / ln √N) Σ_{i=1}^{n} √i,

    t[2:N] ≥ (√M / ln √N) ∫_0^n √x dx = (√M / ln √N) · (2/3) n^{3/2}.

By substituting n = N/M, the above relation can be rewritten as

    t[2:N] = Ω(N√N / (M ln N)).
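Iterate_Sieving() above can be mimicked in a few lines. In the sketch below, one chunk of M integers is "in memory" at a time, and each chunk uses only the working prime set up to √((I+1)M); Python lists stand in for the disk I/O, which is an assumption for illustration only:

```python
import math

def simple_sieve(n):
    """Small in-memory sieve used only to prepare the sieving primes."""
    a = [0] * (n + 1)
    for x in range(2, n + 1):
        if a[x] == 0:
            for m in range(2 * x, n + 1, x):
                a[m] = 1
    return [i for i in range(2, n + 1) if a[i] == 0]

def iterate_sieving(n, m):
    """Sieve [2:n] in chunks of m integers, one chunk resident at a time."""
    primes_out = []
    sieving = simple_sieve(math.isqrt(n))         # all sieving primes up to sqrt(n)
    for lo in range(2, n + 1, m):
        hi = min(lo + m - 1, n)
        chunk = [0] * (hi - lo + 1)               # stands in for A[lo:hi] on disk
        for p in sieving:
            if p * p > hi:                        # working prime set: only p <= sqrt(hi)
                break
            start = max(p * p, ((lo + p - 1) // p) * p)  # first multiple of p in the chunk
            for x in range(start, hi + 1, p):     # mark by repeated addition of p
                chunk[x - lo] = 1
        primes_out.extend(i for i in range(lo, hi + 1) if chunk[i - lo] == 0)
    return primes_out

print(len(iterate_sieving(10**4, 1000)))   # → 1229 primes below 10^4
```

In the real algorithm each chunk would be read from and written back to disk; here the per-chunk working prime set cutoff (p² ≤ hi) matches line 1) of the pseudocode.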
3. Parallel Algorithms

For parallel execution, the sieving computation can be partitioned among P processors in two ways. Either the marking space of array A[2:N] is partitioned into P segments (subarrays), and all numbers that are multiples of any of the sieving primes are marked in each segment simultaneously; or the entire set of sieving primes is divided into P disjoint subsets, and each processor marks its own local copy of A[2:N] for the multiples of the sieving primes in its own subset, after which only those numbers are collected as primes whose P copies have never been marked by any processor. The latter can be implemented by a reduce operation in the MPI library. The two schemes are called array partition and set partition, respectively. For simplicity, assume here that P = 2^K for some integer K. Division into P equal loads is done in a recursive manner as explained below. Our parallel computing adopts a master/slave configuration, so that the master is in charge of finding the sieving prime set S, while the slaves sieve numbers in the ranges allocated to them. Another role of the master in this configuration is to supervise dynamic load distribution, as explained later. For set partition we should know how to partition the sieving prime set S into P disjoint subsets that give equal load to all processors, while for array partition we must divide the interval [2:N] into P segments, each of which has an even share of the load. We start from two equal partitions. The half-load set boundary is the location that divides the sieving prime set into two subsets (a left subset and a right subset) so that the load of marking with the left subset is equal to the load of marking with the right; in other words, each demands half the overall load of the original set. The total load, denoted by L[2:N], is approximated as

    L[2:N] = Σ_{p∈S} N/p ≈ N ∫ dx/x,

where the integral runs over the range of the sieving primes. We also use the relation

    ∫_2^{√N} dx/x = ln √N − ln 2 ≈ (1/2) ln N.

This approximation may not be as tight as the bound given in [3], but it will be applied in the later computation. Let (u~v) represent a subset of S that includes all primes between u and v.
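The set-partition scheme (local marking followed by a reduce) can be illustrated without MPI. In the sketch below, Python's functools.reduce with an elementwise OR stands in for MPI_Reduce, and the sieving primes are split round-robin rather than by the paper's half-load rule; both are illustrative assumptions:

```python
from functools import reduce
from math import isqrt

def mark_multiples(n, primes_subset):
    """One slave's local copy of A[2:n]: mark multiples of its own sieving primes."""
    a = [0] * (n + 1)
    for p in primes_subset:
        for m in range(2 * p, n + 1, p):
            a[m] = 1
    return a

def set_partition_sieve(n, workers):
    """Split S into disjoint subsets, mark locally, then OR-combine the P copies;
    numbers never marked by any processor are collected as primes."""
    sieving = [p for p in range(2, isqrt(n) + 1)
               if all(p % q for q in range(2, p))]          # sieving primes up to sqrt(n)
    subsets = [sieving[w::workers] for w in range(workers)]  # round-robin split of S
    local_copies = [mark_multiples(n, s) for s in subsets]   # "parallel" marking phase
    merged = reduce(lambda x, y: [a | b for a, b in zip(x, y)], local_copies)
    return [i for i in range(2, n + 1) if merged[i] == 0]

print(len(set_partition_sieve(100, 3)))   # → 25
```

The result is independent of the number of workers, since the union of the disjoint subsets always covers S; only the load balance depends on how S is split.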
Now, we want to know the point z (u ≤ z ≤ v) where the number of sieving operations in the interval [2:N] with the primes in (u~z) is half the number with the primes in (u~v); in other words, the load of sieving with the prime numbers in (u~z) is equal to the load with those in (z⁺~v), where z⁺ is the prime immediately following z:

    Σ_{p=u}^{z} N/p = Σ_{p=z⁺}^{v} N/p.

Then, the summations are converted to integrals by approximation as

    ∫_u^z dx/x = ∫_z^v dx/x,

so ln z − ln u = ln v − ln z, i.e. 2 ln z = ln u + ln v, and z = √(uv). Now Z = √(uv), which means that the sieving computation with the primes in (u ~ √(uv)) needs half the load of the computation with the primes in (u~v). If the half-load boundary formula is applied again to both sets (u~z) and (z~v), we get four equal-load partitions. The method can be applied recursively to divide the original load into P = 2^K equal-load partitions, as described in Algorithm Set_Partition:

    Void Set_Partition(L, R, P) {
        if ((R - L > Min_interval) AND (P >= 2)) {
            /* Min_interval is the threshold for parallel computation. */
            mid = sqrt(L * R);
            Set_Partition(L, mid, P/2);
            Set_Partition(mid, R, P/2);
        } else {
            Allocate all of (L~R) to one of the P processors;  /* (L~R) is a set of primes. */
            exit;
        }
    }

For the other way of load partitioning, let us find the half-load point that is the boundary dividing the interval [2:N] into two equal-load segments, [2:y] and [y+1:N]. The half-load point y can be found as below, where S' is the subset of S that includes all primes less than √y:

    L[2:y] = Σ_{p∈S'} y/p ≈ y ln ln y = (1/2) L[2:N] ≈ (1/2) N ln ln N.

Finding a y that satisfies the above relationship exactly is not easy, so we instead find an approximation Y. If N is
very large, the approximation is obtained as y ≈ Y = N/2 with very small error. The relative error f is a non-increasing function of N and is very close to 0 for large N (0 < f < 0.01), as shown below:

    f = (L[2:y] − L[2:Y]) / L[2:y] ≈ (ln ln N − ln(ln N − ln 2)) / ln ln N < 0.01.

The approximation is valid for a big N.

4. Experiments and Discussion

The prime number generator has been tested on cluster computers. One of the clusters consists of GHz-class AMD processors, each with a local disk of tens of GB and gigabytes of main memory. All nodes are considered fully connected by Fast Ethernet and Myrinet switches. Another cluster planned for use is the multi-node Myrinet cluster at KISTI [9]. Our experiments are performed on the former cluster, and the maximum execution time is currently limited to a fixed number of hours; under this constraint, there is a bound on the N up to which we can compute the primes. To reduce the overheads in the execution, we modify the sieving process as explained below. Firstly, to save storage and disk/memory access time, big integers/indexes, like those over ten digits (normally a 32-bit integer representation covers only up to ten digits), are grouped by a common prefix of binary bits, and only the lower bits/digits are represented explicitly. Secondly, the order of marking is adjusted to overcome the cache limitation. Finding an integer that is a multiple of some prime (actually the index i of A[i]) is usually done by an addition, since if x is a multiple of some prime p, then x+p, x+2p, ... are also multiples. Since we have a limited size of cache (or cache line), the next multiple may not be in the cache when p is greater than the cache size. In such a case, every marking attempt generates a cache miss. In this respect, a certain number of sieving primes adjacent to each other are grouped, and marking is done for all primes in the group together, instead of one prime at a time. As introduced earlier, set partitioning is simple to implement; however, it requires synchronization (a reduce operation) of the whole array A[2:N] from the P processors after the individual marking operations are complete.
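The cache-oriented reordering described above, where a cache-sized block is marked with a whole group of sieving primes before moving on, can be sketched as follows; the block size and the single prime group are illustrative stand-ins for the real cache-line tuning:

```python
from math import isqrt

def blockwise_marking(n, block):
    """Mark composites block by block: every sieving prime sweeps the current
    block before any prime touches the next one, improving locality."""
    primes = [p for p in range(2, isqrt(n) + 1)
              if all(p % q for q in range(2, p))]
    a = bytearray(n + 1)                  # a[i] == 1 marks i as composite
    next_multiple = {p: 2 * p for p in primes}
    for lo in range(0, n + 1, block):
        hi = min(lo + block - 1, n)
        for p in primes:                  # the whole prime group works on this block
            m = next_multiple[p]
            while m <= hi:
                a[m] = 1
                m += p                    # next multiple by addition, as in the text
            next_multiple[p] = m          # resume point for the next block
    return [i for i in range(2, n + 1) if a[i] == 0]

print(len(blockwise_marking(10**4, 256)))   # → 1229
```

The output is identical for any block size; only the memory access pattern changes, which is exactly what the cache optimization exploits.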
It adds a communication cost of O(N log P), which may not be negligible compared to the sieving cost of N ln ln N. Thus, the algorithm is practical only for small N, and our experiments focus on the partitioning of intervals. Since the half-load boundary is an approximate value, there can be some imbalance among different partitions. To achieve complete load balance throughout the computation, static load distribution is combined with dynamic load balancing. Static distribution allocates only a fraction q (0 < q ≤ 1) of the interval [2:N] to the P processors. The remaining interval is divided equally into R (R >> P) small segments for dynamic load distribution; processors that finish sieving their statically assigned intervals then receive small segments from the master for execution until none are left. Figure 1 shows the predicted and observed times of sequential execution over various ranges, where the upper limit (= N) is chosen as a power of ten for the convenience of the experiments. The plot reveals that the running times are close to O(N√N / ln N), which, as explained in Section 2, is due to the limited main memory/cache capacity. Figures 2 to 4 plot the speedups of parallel computation with three job partition schemes: static only, static and dynamic combined, and dynamic only. Static partitioning simply divides the intervals equally, which gives far inferior results. The reason is that even though the counts of marking operations are equal, the real load due to cache misses is not reflected at all. The problem is relieved in the dynamic and combined partitionings, since those schemes control the load during run time. When the problem size is big, parallel computation under the dynamic and combined partition schemes shows the best performance, since it gets the smallest workload per job, which is favorable to cache hits. We observe in the plots that super-linear speedups are obtained; this is because the sequential execution suffers saturated operation, due to heavier disk traffic and higher cache miss rates from overloaded resource utilization than in the parallel runs.
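The combined scheme can be simulated to see why it evens out finishing times. The sketch below assigns a fraction q statically, cuts the rest into R segments, and always hands the next segment to the earliest-finishing processor; the cost function is an arbitrary stand-in for the real, cache- and disk-dependent sieving time:

```python
import heapq

def combined_makespan(total, p, q, r, cost):
    """Return the parallel finishing time when a q-fraction of the work is split
    statically among p processors and the rest is served dynamically in r segments."""
    static_share = total * q / p
    finish = [(cost(static_share), i) for i in range(p)]   # per-processor finish times
    heapq.heapify(finish)
    segment = total * (1 - q) / r
    for _ in range(r):
        t, i = heapq.heappop(finish)       # earliest-finishing processor asks the master
        heapq.heappush(finish, (t + cost(segment), i))
    return max(t for t, _ in finish)

# With a uniform cost per unit of work, the makespan is simply total/p:
print(combined_makespan(1000.0, 4, 0.5, 32, lambda w: w))   # → 250.0
# In the paper's setting the real cost varies with cache and disk behavior;
# the small dynamic segments are what absorb that imbalance at run time.
```

Choosing R >> P keeps each dynamic segment small, so no processor is left idle for longer than one segment's worth of work at the end.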
Also, parallel computing is well suited to this computation, since there is very little communication overhead for the master/slave job distribution. As the range grows, in addition to the huge CPU time consumption, the sieving also pushes other resources to their limits, such as hard disk capacity; to cover still larger prime numbers, we expect the hard disk would need to be terabytes in size. We predict that it needs days of
computing time to obtain the primes of the next digit range with our cluster, while one more digit would need a year of computation.

5. Concluding Remarks

This research focuses on the efficient parallel computation of prime numbers in a given range. The load is distributed to each processor statically and dynamically, so that eventually all processors spend the same amount of computation time, achieving the minimum execution time. The partitioning of the load among processors was based both on the division of the computing interval and on the division of the prime numbers in the set S of sieving primes. Saturation of performance is observed, due to full load on the disk and cache misses, in both the sequential and the parallel prime generators. We expect that the complete set of prime numbers for our target range can be obtained in days of execution, and for the next range in a year, although the times could vary depending on the speed of the parallel computers.

[Figure 1: Sequential time complexity — execution time in seconds vs. data size, measured and predicted]
[Figure 2: Speedup with static load distribution]
[Figure 3: Speedup with static & dynamic combined load distribution]
[Figure 4: Speedup with dynamic load distribution]

References

[1] S.H. Bokhari, "Multiprocessing the sieve of Eratosthenes," IEEE Computer, Vol. 20, No. 4, April 1987.
[2] P. Dusart, "The k-th prime is greater than k(ln k + ln ln k − 1) for k ≥ 2," Mathematics of Computation, Vol. 68, No. 225, pp. 411-415, Jan. 1999.
[3] G. Hardy and E. Wright, An Introduction to the Theory of Numbers, Clarendon Press.
[4] X. Luo, "A practical sieve algorithm for finding prime numbers," Communications of the ACM, Vol. 32, No. 3, March 1989.
[5] G. Paillard, C. Lavault, and R. França, "A distributed prime sieving algorithm based on Scheduling by Multiple Edge Reversal," Int'l Symp. on Parallel and Distributed Computing (ISPDC), Univ. Lille, France, July 2005.
[6] R. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public key cryptosystems," Comm. ACM, Vol. 21, No. 2, pp. 120-126, Feb. 1978.
[7] http://www.prime-numbers.org
[8] http://primes.utm.edu/howmany.shtml, "How many primes are there?"
[9] http://www.ksc.re.kr, KISTI Supercomputing Center.