
Performance Considerations of Shared Virtual Memory Machines

Xian-He Sun, Department of Computer Science, Louisiana State University, Baton Rouge, LA
Jianping Zhu, NSF Engineering Research Center, Dept. of Math. and Stat., Mississippi State University, Mississippi State, MS

Abstract

Generalized speedup is defined as parallel speed over sequential speed. In this paper the generalized speedup and its relation with other existing performance metrics, such as traditional speedup, efficiency, scalability, etc., are carefully studied. In terms of the introduced asymptotic speed, we show that the difference between the generalized speedup and the traditional speedup lies in the definition of the efficiency of uniprocessor processing, which is a very important issue in shared virtual memory machines. A scientific application has been implemented on a KSR-1 parallel computer. Experimental and theoretical results show that the generalized speedup is distinct from the traditional speedup and provides a more reasonable measurement. In the study of different speedups, an interesting relation between fixed-time and memory-bounded speedup is revealed. Various causes of superlinear speedup are also presented.

Manuscript received March 5, 1994; revised Nov. 14, 1994 and March 14. This research was supported in part by the National Aeronautics and Space Administration under NASA contract NAS and NAS1-1672/MMP.

Index Terms: High Performance Computing, Parallel Processing, Performance Evaluation, Performance Metrics, Scalability, Speedup, Shared Virtual Memory

Address for correspondence: Xian-He Sun, Department of Computer Science, Louisiana State University, Baton Rouge, LA, (504)

1 Introduction

In recent years parallel processing has enjoyed unprecedented attention from researchers, government agencies, and industries. This attention is mainly due to the fact that, with the current circuit technology, parallel processing seems to be the only remaining way to achieve higher performance. However, while various parallel computers and algorithms have been developed, their performance evaluation is still elusive. In fact, the more advanced the hardware and software, the more difficult it is to evaluate the parallel performance. In this paper, targeting the recent development of shared virtual memory machines, we study the generalized speedup [1] performance metric, its relation with other existing performance metrics, and the implementation issues.

Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as the Kendall Square KSR-1, Intel Paragon, TMC CM-5, and IBM SP2, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. From the viewpoint of processes, there are two basic process synchronization and communication models. One is the shared-memory model, in which processes communicate through shared variables. The other is the message-passing model, in which processes communicate through explicit message passing. The shared-memory model provides a sequential-like programming paradigm. Virtual address space separates the user logical memory from physical memory. This separation allows an extremely large virtual memory to be provided on a sequential machine when only a small physical memory is available. Shared virtual address space combines the private virtual address spaces distributed over the nodes of a parallel computer into a globally shared virtual memory [2]. With shared virtual address space, the shared-memory model supports shared virtual memory, but requires sophisticated hardware and system support. An example of a distributed-memory machine which supports shared virtual address space is the Kendall Square KSR-1. (Traditionally, the message-passing model is bounded by the local memory of the processing processors. With recent technology advancement, the message-passing model has extended its ability to support shared virtual memory.)

Shared virtual memory simplifies the software development and porting process by enabling even extremely large programs to run on a single processor before being partitioned and distributed across multiple processors. However, memory access in a shared virtual memory system is non-uniform [2]. The access times of local memory and remote memory are different. Running a large program on a small number of processors is possible but could be very inefficient. The inefficient sequential processing will lead to a misleadingly high performance in terms of speedup or efficiency. Generalized speedup, defined as parallel speed over sequential speed, is a newly proposed performance metric [1]. In this paper, through both theoretical proofs and experimental results, we show that generalized speedup provides a more reasonable measurement than traditional speedup. In the process of studying generalized speedup, the relations between the generalized speedup and many other metrics, such as efficiency, scaled speedup, and scalability, are also studied. The relation between fixed-time and memory-bounded scaled speedup is analyzed. Various reasons for superlinearity in different speedups are also discussed. Results show that the main difference between the traditional speedup and the generalized speedup is how to evaluate the efficiency of the sequential processing on a single processor.

The paper is organized as follows. In Section 2 we study traditional speedup, including the scaled speedup concept, and introduce some terminology. Analysis shows that the traditional speedup, fixed-size or scaled-size, may achieve superlinearity on shared virtual memory machines. Furthermore, with the traditional speedup metric, the slower the remote memory access is, the larger the speedup. Generalized speedup is studied in Section 3. The term asymptotic speed is introduced for the measurement of generalized speedup. Analysis shows the differences and the similarities between the generalized speedup and the traditional speedup. Relations between different performance metrics are also discussed. Experimental results of a production application on a Kendall Square KSR-1 parallel computer are given in Section 4. Section 5 contains a summary.

2 The Traditional Speedup

One of the most frequently used performance metrics in parallel processing is speedup. It is defined as sequential execution time over parallel execution time. Parallel algorithms often exploit parallelism by sacrificing mathematical efficiency. To measure the true parallel processing gain, the sequential execution time should be based on a commonly used sequential algorithm. To distinguish it from other interpretations of speedup, the speedup measured with a commonly used sequential algorithm has been called absolute speedup [3]. Another widely used interpretation is the relative speedup [3], which uses the uniprocessor execution time of the parallel algorithm as the sequential time. There are several reasons to use the relative speedup. First, the performance of an algorithm varies with the number of processors. Relative speedup measures the variation. Second, relative speedup avoids the difficulty of choosing the practical sequential algorithm, implementing the sequential algorithm, and matching the implementation/programming skill between the sequential algorithm and the parallel algorithm. Also, when problem size is fixed, the time ratio of the chosen sequential algorithm and the uniprocessor execution of the parallel algorithm is fixed. Therefore, the relative speedup is proportional to the absolute speedup. Relative speedup is the speedup commonly used in performance studies. In this study we will focus on relative speedup and reserve the terms traditional speedup and speedup for relative speedup. The concepts and results of this study can be extended to absolute speedup.

From the problem size point of view, speedup can be divided into the fixed-size speedup and the scaled speedup. Fixed-size speedup emphasizes how much execution time can be reduced with parallel processing. Amdahl's law [4] is based on the fixed-size speedup. The scaled speedup is concentrated on exploring the computational power of parallel computers for solving otherwise intractable large problems. Depending on the scaling restrictions of the problem size, the scaled speedup can be classified as the fixed-time speedup [5] and the memory-bounded speedup [6]. As the number of processors increases, fixed-time speedup scales problem size to meet the fixed execution time. Then the scaled problem is also solved on a uniprocessor to get the speedup. As the number of processors increases, memory-bounded speedup scales problem size to utilize the associated memory increase. A detailed study of the memory-bounded speedup can be found in [6].

Let p and S_p be the number of processors and the speedup with p processors.

Definition 1:
  Superlinear speedup: S_p > p.
  Unitary speedup: S_p = p.
  Linear speedup: S_p = a·p, for some constant a > 0.

It is debatable whether any machine-algorithm pair can achieve "truly" superlinear speedup. Seven possible causes of superlinear speedup are listed in Fig. 1. The first four causes in Fig. 1 are patterned from [7].

1. cache size increased in parallel processing
2. overhead reduced in parallel processing
3. latency hidden in parallel processing
4. randomized algorithms
5. mathematical inefficiency of the serial algorithm
6. higher memory access latency in the sequential processing
7. profile shifting

Figure 1. Causes of Superlinear Speedup.

Cause 1 is unlikely to be applicable for scaled speedup, since when problem size scales up, by memory or by time constraint, the cache hit ratio is unlikely to increase. Cause 2 in Fig. 1 can be considered theoretically [8]; there is no measured superlinear speedup ever attributed to it. Cause 3 does not exist for relative speedup, since both the sequential and parallel execution use the same algorithm. Since parallel algorithms are often mathematically inefficient, cause 5 is a likely source of superlinear speedup of relative speedup. A good example of superlinear speedup based on cause 5 can be found in [9]. Cause 7 will be explained at the end of Section 3, after the generalized speedup is introduced. With the virtual memory and shared virtual memory architecture, cause 6 can lead to an extremely high speedup, especially for scaled speedup where an extremely large problem has to be run on a single processor. Figure 5 shows a measured superlinear speedup on a KSR-1 machine. The measured superlinear speedup is due to the inherent deficiency of the traditional speedup metric. To analyze the deficiency of the traditional speedup, we need to introduce the following definition.

Definition 2: The cost of parallelism i is the ratio of the total number of processor cycles consumed in order to perform one unit operation of work when i processors are active to the machine clock rate.

The sequential execution time can be written in terms of work:

  Sequential execution time = (Amount of work x Processor cycles per unit of work) / Machine clock rate.   (1)

The ratio on the right hand side of Eq. (1), processor cycles per unit of work over machine clock rate, is the cost of sequential processing. Work can be defined as arithmetic operations, instructions, transitions, or whatever is needed to complete the application. In scientific computing the number of floating-point operations (FLOPS) is commonly used to measure work. In general, work may be of different types, and units of different operations may require different numbers of instruction cycles to finish. (For example, the times consumed by one division and one multiplication may be different depending on the underlying machine, and the operation and memory reference ratio may be different for different computations.) The influence of work type on the performance is one of the topics studied in [1]. In this paper, we study the influence of inefficient memory access on the performance. We assume that there is only one work type and that any increase in the number of processor cycles is due to inefficient memory access. In a shared virtual memory environment, the memory available depends on the system size. Let W_i be the amount of work executed when i processors are active (work performed in all steps that use i processors), and let W = Σ_{i=1}^{p} W_i represent the total work. The cost of parallelism i in a p-processor system, denoted as c_p(i, W), is the elapsed time for one unit operation of work when i processors are active. Then, W_i c_p(i, W) gives the accumulated elapsed time where i processors are active. c_p(i, W) contains both computation time and remote memory access time. The uniprocessor execution time can be represented in terms of uniprocessor cost,

  t(1) = Σ_{i=1}^{p} W_i c_p(s, W),

where c_p(s, W) is the cost of sequential processing on a parallel system with p processors. It is different from c_p(1, W), which is the cost of the sequential portion of the parallel processing. Parallel execution time can be represented in terms of parallel cost,

  t(p) = Σ_{i=1}^{p} (W_i / i) c_p(i, W).

The traditional speedup is defined as

  S_p = t(1) / t(p) = Σ_{i=1}^{p} W_i c_p(s, W) / Σ_{i=1}^{p} (W_i / i) c_p(i, W).   (2)

Depending on the architecture memory hierarchy, in general c_p(i, W) may not equal c_p(j, W) for i ≠ j [10]. If c_p(i, W) = c_p(p, W) for 1 ≤ i ≤ p, then

  S_p = [c_p(s, W) / c_p(p, W)] · [W / Σ_{i=1}^{p} (W_i / i)].   (3)

The first ratio of Eq. (3) is the cost ratio, which gives the influence of memory access delay. The second ratio,

  W / Σ_{i=1}^{p} (W_i / i),   (4)

is the simple analytic model based on degree of parallelism [6]. It assumes that memory access time is constant as problem size and system size vary. The cost ratio distinguishes the different performance analysis methods with or without consideration of the memory influence. In general, the cost ratio depends on memory miss ratio, page replacement policy, data reference pattern, etc. Let the remote access ratio be the quotient of the number of remote memory accesses and the number of local memory accesses. For a simple case, if we assume there is no remote access in parallel processing and the remote access ratio of the sequential processing is (p-1)/p, then

  c_p(s, W) / c_p(p, W) = 1/p + [(p-1)/p] · (time per remote access / time per local access).   (5)

Equation (5) approximately equals the time per remote access over the time per local access. Since the remote memory access is much slower than the local memory access under the current technology, the speedup given by Eq. (3) could be considerably larger than the simple analytic model (4). In fact, the slower the remote access is, the larger the difference. For the KSR-1, the time ratio of remote and local access is about 7.5 (see Section 4). Therefore, for p = 32, the cost ratio is 7.3. For any W / (p · Σ_{i=1}^{p} W_i / i) > 0.14, that is, whenever the efficiency predicted by the simple model (4) exceeds 0.14, under the assumed remote access ratio we will have a superlinear speedup.
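To make the cost-ratio argument concrete, the following small Python sketch evaluates Eqs. (3)-(5) for the KSR-1 figures quoted above (a remote/local access time ratio of about 7.5 and p = 32). It is an illustration only; the function names and the sample workload profile are ours and are not part of the original measurements.

    # Illustrative sketch of Eqs. (3)-(5); names and sample workload are ours.

    def cost_ratio(p, remote_over_local):
        # Eq. (5): sequential cost over parallel cost, assuming no remote
        # access in parallel processing and a remote access ratio of
        # (p-1)/p for the sequential run.
        return 1.0 / p + (p - 1.0) / p * remote_over_local

    def traditional_speedup(work, p, remote_over_local):
        # Eq. (3): cost ratio times the simple degree-of-parallelism model,
        # where work[i-1] is the work W_i executed with i processors active.
        W = sum(work)
        simple_model = W / sum(W_i / i for i, W_i in enumerate(work, start=1))  # Eq. (4)
        return cost_ratio(p, remote_over_local) * simple_model

    p = 32
    ratio = cost_ratio(p, 7.5)                # about 7.3 on the KSR-1
    perfect = [0.0] * (p - 1) + [1.0]         # hypothetical fully parallel workload
    print(ratio, traditional_speedup(perfect, p, 7.5))

With an all-parallel workload the sketch reports a cost ratio of about 7.3 and a speedup well above p, which is exactly the superlinear effect discussed above.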

3 The Generalized Speedup

While parallel computers are designed for solving large problems, a single processor of a parallel computer is not designed to solve a very large problem. A uniprocessor does not have the computing power that the parallel system has. While solving a small problem is inappropriate on a parallel system, solving a large problem on a single processor is not appropriate either. To create a useful comparison, we need a metric that can vary problem sizes for the uniprocessor and multiple processors. Generalized speedup [1] is one such metric:

  Generalized Speedup = Parallel Speed / Sequential Speed.   (6)

Speed is defined as the quotient of work and elapsed time. Parallel speed might be based on scaled parallel work. Sequential speed might be based on the unscaled uniprocessor work. By definition, generalized speedup measures the speed improvement of parallel processing over sequential processing. In contrast, the traditional speedup (2) measures the time reduction of parallel processing. If the problem size (work) for both parallel and sequential processing is the same, the generalized speedup is the same as the traditional speedup. From this point of view, the traditional speedup is a special case of the generalized speedup. For this and for historical reasons, we sometimes call the traditional speedup the speedup, and call the speedup given in Eq. (6) the generalized speedup. Like the traditional speedup, the generalized speedup can also be further divided into fixed-size, fixed-time, and memory-bounded speedup. Unlike the traditional speedup, for the generalized speedup, the scaled problem is solved only on multiple processors. The fixed-time generalized speedup is sizeup [1]. The fixed-time benchmark SLALOM [11] is based on sizeup.

If memory access time is fixed, one might always assume that the uniprocessor cost c_p(s, W) will stabilize after some initial decrease (due to initialization, loop overhead, etc.), assuming the memory is large enough. When cache and remote memory access are considered, cost will increase when a slower memory has to be accessed. Figure 2 depicts the typical cost changing pattern. From Eq. (1), we can see that uniprocessor speed is the reciprocal of uniprocessor cost. When the cost reaches its lowest value, the speed reaches its highest value. The uniprocessor speed corresponding to the stabilized main memory cost is called the asymptotic speed (of the uniprocessor). Asymptotic speed represents the performance of the sequential processing with efficient memory access. The asymptotic speed is the appropriate sequential speed for Eq. (6). For memory-bounded speedup, the appropriate memory bound is the largest problem size which can maintain the asymptotic speed. After choosing the asymptotic speed as the sequential speed, the corresponding asymptotic cost has only local access and is independent of the problem size. We use c(s, W_0) to denote the corresponding asymptotic cost, where W_0 is a problem size which achieves the asymptotic speed. If there is no remote access in parallel processing, as assumed in Section 2, then c(s, W_0)/c_p(p, W_0) = 1. By Eq. (3), the corresponding speedup equals the simple speedup which does not consider the influence of memory access time.

Figure 2. Cost Variation Pattern (sequential cost versus problem size, with regions where the data fits in cache, in main memory, and in remote memory; insufficient memory increases sequential execution time).

In general, parallel work W is not the same as W_0, and c_p(i, W) may not equal c_p(p, W) for 1 ≤ i ≤ p. So, in general, we have

  Generalized Speedup = [W / Σ_{i=1}^{p} (W_i/i) c_p(i, W)] / [1 / c(s, W_0)] = W c(s, W_0) / Σ_{i=1}^{p} (W_i/i) c_p(i, W).   (7)

Equation (7) is another form of the generalized speedup. It is a quotient of sequential and parallel time, as is the traditional speedup (2). The difference is that, in Eq. (7), the sequential time is based on the asymptotic speed. When remote memory is needed for sequential processing, c(s, W_0) is smaller than c_p(s, W). Therefore, the generalized speedup gives a smaller speedup than the traditional speedup.

Parallel efficiency is defined as

  Efficiency = speedup / number of processors.   (8)

The generalized efficiency can be defined similarly as

  Generalized Efficiency = generalized speedup / number of processors.   (9)

By definition,

  Efficiency = W c_p(s, W) / [p Σ_{i=1}^{p} (W_i/i) c_p(i, W)],   (10)

and

  Generalized Efficiency = W c(s, W_0) / [p Σ_{i=1}^{p} (W_i/i) c_p(i, W)].   (11)
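As a hedged illustration of how Eqs. (6), (9) and (11) are used in practice, the sketch below computes the generalized speedup and generalized efficiency from a measured asymptotic sequential speed and a timed parallel run on a scaled problem; all numbers and helper names are hypothetical.

    # Illustrative use of Eqs. (6) and (9); numbers and names are hypothetical.

    def speed(work, elapsed_time):
        # Speed is the quotient of work and elapsed time (e.g., flops/s).
        return work / elapsed_time

    def generalized_speedup(parallel_work, parallel_time, asymptotic_speed):
        # Eq. (6): parallel speed over the asymptotic sequential speed.
        return speed(parallel_work, parallel_time) / asymptotic_speed

    def generalized_efficiency(parallel_work, parallel_time, asymptotic_speed, p):
        # Eq. (9): generalized speedup over the number of processors.
        return generalized_speedup(parallel_work, parallel_time, asymptotic_speed) / p

    asymptotic_speed = 5.5e6                  # flops/s, stabilized uniprocessor speed
    p, scaled_work, t_p = 16, 4.0e9, 50.0     # hypothetical scaled run on p processors
    print(generalized_speedup(scaled_work, t_p, asymptotic_speed))
    print(generalized_efficiency(scaled_work, t_p, asymptotic_speed, p))

Note that the scaled problem is only ever timed on the p processors; the sequential side of the comparison is the asymptotic speed, never a large run on one processor.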

Equations (10) and (11) show the difference between the two efficiencies. Traditional speedup compares parallel processing with the measured sequential processing. Generalized speedup compares parallel processing with the sequential processing based on the asymptotic cost. From this point of view, generalized speedup is a reform of traditional speedup. The following lemmas are direct results of Eq. (7).

Lemma 1: If c_p(s, W) is independent of problem size, traditional speedup is the same as generalized speedup.

Lemma 2: If the parallel work, W, achieves the asymptotic speed, that is W = W_0, then the fixed-size traditional speedup is the same as the fixed-size generalized speedup.

By Lemma 1, if the simple analytic model (4) is used to analyze performance, there is no difference between the traditional and the generalized speedup. If the problem size W is larger than the suggested initial problem size W_0, then the single processor speedup S_1 may not equal one. S_1 measures the sequential inefficiency due to the difference in memory access.

The generalized speedup is also closely related to the scalability study. Isospeed scalability has been proposed recently in [12]. The isospeed scalability measures the ability of an algorithm-machine combination to maintain the average (unit) speed, where the average speed is defined as the speed over the number of processors. When the system size is increased, the problem size is scaled up accordingly to maintain the average speed. If the average speed can be maintained, we say the algorithm-machine combination is scalable, and the scalability is

  ψ(p, p') = p' W / (p W'),   (12)

where W' is the amount of work needed to maintain the average speed when the system size has been changed from p to p', and W is the problem size solved when p processors were used. By definition,

  Average Speed = W / [p Σ_{i=1}^{p} (W_i/i) c_p(i, W)].

Since the sequential cost is fixed in Eq. (11), fixing the average speed is equivalent to fixing the generalized efficiency. Therefore the isospeed scalability can be seen as the iso-generalized-efficiency scalability. When the memory influence is not considered, i.e. c_p(s, W) is independent of the problem size, the iso-generalized-efficiency will be the same as the iso-traditional-efficiency. In this case, the isospeed scalability is the same as the isoefficiency scalability proposed by Kumar [13, 2].

Lemma 3: If the sequential cost c_p(s, W) is independent of problem size or if the simple analysis model (4) is used for speedup, the isoefficiency and isospeed scalability are equivalent to each other.

The following theorem gives the relation between the scalability and the fixed-time speedup.
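The following minimal sketch, with invented numbers, shows how the isospeed scalability of Eq. (12) is evaluated from two measurements of the work needed to hold the average speed.

    # Illustrative computation of the isospeed scalability psi(p, p') of Eq. (12).

    def average_speed(work, p, elapsed_time):
        # Speed per processor: work over (number of processors times elapsed time).
        return work / (p * elapsed_time)

    def isospeed_scalability(p, work_p, p_prime, work_p_prime):
        # Eq. (12): psi(p, p') = p' * W / (p * W'), where W' is the work needed
        # on p' processors to keep the average speed attained with W on p.
        return (p_prime * work_p) / (p * work_p_prime)

    # Hypothetical example: going from 4 to 8 processors required 2.5x the work
    # to hold the average speed, so psi(4, 8) = 8*W / (4*2.5*W) = 0.8.
    print(isospeed_scalability(4, 1.0, 8, 2.5))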

Theorem 1: Scalability (12) equals one if and only if the fixed-time generalized speedup is unitary.

Proof: Let c(s, W_0), c_p(i, W), W_i, and W be as defined in Eq. (7). If scalability (12) equals 1, let W' and p' be as defined in Eq. (12), and define W'_i similarly to W_i. We have

  W/p = W'/p'   (13)

for any number of processors p and p'. By definition, the generalized speedup

  G_S_{p'} = W' c(s, W_0) / Σ_{i=1}^{p'} (W'_i/i) c_{p'}(i, W').

With some arithmetic manipulation, we have

  W' = (G_S_{p'}/p') · p' Σ_{i=1}^{p'} (W'_i/i) c_{p'}(i, W') / c(s, W_0).

Similarly, we have

  W = (G_S_p/p) · p Σ_{i=1}^{p} (W_i/i) c_p(i, W) / c(s, W_0).

By Eq. (13) and the above two equations,

  (G_S_{p'}/p') Σ_{i=1}^{p'} (W'_i/i) c_{p'}(i, W') / c(s, W_0) = (G_S_p/p) Σ_{i=1}^{p} (W_i/i) c_p(i, W) / c(s, W_0).   (14)

For fixed speed,

  W' / [p' Σ_{i=1}^{p'} (W'_i/i) c_{p'}(i, W')] = W / [p Σ_{i=1}^{p} (W_i/i) c_p(i, W)].

By equation (13),

  Σ_{i=1}^{p'} (W'_i/i) c_{p'}(i, W') = Σ_{i=1}^{p} (W_i/i) c_p(i, W).   (15)

Substituting Eq. (15) into Eq. (14), we have

  G_S_{p'}/p' = G_S_p/p.

For p = 1,

  G_S_{p'} = p' G_S_1.   (16)

Equation (16) is the corresponding unitary speedup when G_S_1 is not equal to one. If the work W equals W_0, then G_S_1 = 1 and Eq. (16) becomes

  G_S_{p'} = p',

which is the unitary speedup defined in Definition 1.

If the fixed-time generalized speedup is unitary, then for any numbers of processors p and p' and the corresponding problem sizes W and W', where W' is the scaled problem size under the fixed-time constraint, we have

  W c(s, W_0) / Σ_{i=1}^{p} (W_i/i) c_p(i, W) = p,

and

  W' c(s, W_0) / Σ_{i=1}^{p'} (W'_i/i) c_{p'}(i, W') = p'.

Therefore,

  W / [p Σ_{i=1}^{p} (W_i/i) c_p(i, W)] = W' / [p' Σ_{i=1}^{p'} (W'_i/i) c_{p'}(i, W')].

The average speed is maintained. Also, since we have the fixed-time equality

  Σ_{i=1}^{p} (W_i/i) c_p(i, W) = Σ_{i=1}^{p'} (W'_i/i) c_{p'}(i, W'),

it follows that W/p = W'/p'. The scalability (12) equals one. □

The following theorem gives the relation between memory-bounded speedup and fixed-time speedup. The theorem is for generalized speedup. However, based on Lemma 1, the result is true for traditional speedup when the uniprocessor cost is fixed or the simple analysis model is used.

Theorem 2: If problem size increases proportionally to the number of processors in memory-bounded scaleup, then memory-bounded generalized speedup is linear if and only if fixed-time generalized speedup is linear.

Proof: Let c(s, W_0), c_p(i, W), and W be as defined in Theorem 1. Let W' and W* be the scaled problem sizes of fixed-time and memory-bounded scaleup respectively, and let W'_i and W*_i be defined accordingly. If memory-bounded speedup is linear, we have

  W c(s, W_0) / Σ_{i=1}^{p} (W_i/i) c_p(i, W) = a p,

and

  W* c(s, W_0) / Σ_{i=1}^{p'} (W*_i/i) c_{p'}(i, W*) = a p',

for some constant a > 0. Combining the two equations, we have the equation

  W / [p Σ_{i=1}^{p} (W_i/i) c_p(i, W)] = W* / [p' Σ_{i=1}^{p'} (W*_i/i) c_{p'}(i, W*)].   (17)

By assumption, W* is proportional to the number of processors available,

  W* = (p'/p) W.   (18)

Substituting Eq. (18) into Eq. (17), we get the fixed-time equality:

  Σ_{i=1}^{p'} (W*_i/i) c_{p'}(i, W*) = Σ_{i=1}^{p} (W_i/i) c_p(i, W).   (19)

That is W' = W*, and the fixed-time generalized speedup is linear.

If fixed-time speedup is linear, then, following similar deductions as used for Eq. (17), we have

  W / [p Σ_{i=1}^{p} (W_i/i) c_p(i, W)] = W' / [p' Σ_{i=1}^{p'} (W'_i/i) c_{p'}(i, W')].   (20)

Applying the fixed-time equality Eq. (19) (with W' in place of W*) to Eq. (20), we have the reduced equation

  W'/W = p'/p.   (21)

With the assumption Eq. (18), Eq. (21) leads to W* = W', and memory-bounded generalized speedup is linear. □

The assumption of Theorem 2 is that problem size (work) increases proportionally to the number of processors. The assumption is true for many applications. However, it is not true for dense matrix computation, where the memory requirement is a square function of the order of the matrix and the computation is a cubic function of the order of the matrix. For this kind of computation-intensive application, in general, memory-bounded speedup will lead to a larger speedup. The following corollaries are direct results of Theorem 1 and Theorem 2.

Corollary 1: If problem size increases proportionally to the number of processors in memory-bounded scaleup, then memory-bounded generalized speedup is unitary if and only if fixed-time
generalized speedup is unitary.

Corollary 2: If work increases proportionally with the number of processors, then scalability (12) equals one if and only if the memory-bounded generalized speedup is unitary.

Since uniprocessor cost varies on shared virtual memory machines, the above theoretical results are not applicable to traditional speedup on shared virtual memory machines.

Finally, to complete our discussion on the superlinear speedup, there is a new cause of superlinearity for generalized speedup. The new source of superlinear speedup is called profile shifting [11], and is due to the problem size difference between sequential and parallel processing (see Figure 1). An application may contain different work types. While problem size increases, some work types may increase faster than the others. When the work types with lower costs increase faster, superlinear speedup may occur. A superlinear speedup due to profile shifting was studied in [11].

4 Experimental Results

In this section, we discuss the timing results for solving a scientific application on KSR-1 parallel computers. We first give a brief description of the architecture and the application, and then present the timing results and analyses.

4.1 The Machine

The KSR-1 computer discussed here is a representative of parallel computers with shared virtual memory. Figure 3 shows the architecture of the KSR-1 parallel computer [14]. Each processor on the KSR-1 has 32 Mbytes of local memory. The CPU is a super-scalar processor with a peak performance of 40 Mflops in double precision. Processors are organized into different rings. The local ring (ring:0) can connect up to 32 processors, and a higher level ring of rings (ring:1) can contain up to 34 local rings with a maximum of 1088 processors. If a non-local data element is needed, the local search engine (SE:0) will search the processors in the local ring (ring:0). If the search engine SE:0 cannot locate the data element within the local ring, the request will be passed to the search engine at the next level (SE:1) to locate the data. This is done automatically by a hierarchy of search engines connected in a fat-tree-like structure [14, 15].

Figure 3. Configuration of KSR-1 parallel computers (ring:1 connects up to 34 ring:0's; each ring:0 connects up to 32 processors; P: processor, M: 32 Mbytes of local memory).

The memory hierarchy of the KSR-1 is shown in Fig. 4. Each processor has 512 Kbytes of fast subcache, which is similar to the normal cache on other parallel computers. This subcache is divided into two equal parts: an instruction subcache and a data subcache. The 32 Mbytes of local memory on each processor is called a local cache. A local ring (ring:0) with up to 32 processors can have 1 Gbyte total of local cache, which is called Group:0 cache. Access to the Group:0 cache is provided by Search Engine:0. Finally, a higher level ring of rings (ring:1) connects up to 34 local rings with 34 Gbytes of total local cache, which is called Group:1 cache. Access to the Group:1 cache is provided by Search Engine:1. The entire memory hierarchy is called ALLCACHE memory by Kendall Square Research. Access by a processor to the ALLCACHE memory system is accomplished by going through different Search Engines as shown in Fig. 4. The latencies for different memory locations [16] are: 2 cycles for subcache, 20 cycles for local cache, 150 cycles for Group:0 cache, and 570 cycles for Group:1 cache.

Figure 4. Memory hierarchy of KSR-1 (processor, 512 KB subcache, 32 MB local cache, 1 GB Group:0 cache through Search Engine:0, 34 GB Group:1 cache through Search Engine:1).
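As a rough, hedged illustration of how these latencies feed the cost ratios of Section 2, the sketch below estimates an average memory access cost from the cycle counts quoted above and an assumed distribution of accesses over the hierarchy; the access fractions are invented for illustration and are not measurements from this study.

    # Hypothetical estimate of average access cost on the KSR-1 hierarchy.
    # Latencies (in cycles) are those quoted from [16]; the access fractions
    # are assumed for illustration only.

    LATENCY = {"subcache": 2, "local_cache": 20, "group0": 150, "group1": 570}

    def average_access_cycles(fractions):
        # fractions maps each memory level to the share of accesses it serves.
        assert abs(sum(fractions.values()) - 1.0) < 1e-9
        return sum(fractions[level] * LATENCY[level] for level in fractions)

    # A run whose working set fits in local cache versus one that spills into
    # Group:0 cache (remote memory within a ring:0).
    local_run  = {"subcache": 0.90, "local_cache": 0.10, "group0": 0.00, "group1": 0.00}
    remote_run = {"subcache": 0.90, "local_cache": 0.02, "group0": 0.08, "group1": 0.00}
    print(average_access_cycles(local_run), average_access_cycles(remote_run))

Even a small fraction of Group:0 accesses raises the average access cost severalfold, which is the effect behind the degraded uniprocessor speeds reported in Section 4.3.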

4.2 The Application

Regularized least squares problems (RLSP) [17] are frequently encountered in scientific and engineering applications [18]. The major work is to solve the equation

  (A^T A + α I) x = A^T b   (22)

by orthogonal factorization schemes (Householder Transformations and Givens rotations). Efficient Householder algorithms have been discussed in [19] for shared memory supercomputers, and in [20] for distributed memory parallel computers. Note that Eq. (22) can also be written as

  (A^T, √α I) [A; √α I] x = (A^T, √α I) [b; 0],   (23)

or

  B^T B x = B^T b_0,   (24)

where [X; Y] denotes X stacked on top of Y, B = [A; √α I], and b_0 = [b; 0], so that the major task is to carry out the QR factorization of the matrix B, which is neither a completely full matrix nor a sparse matrix. The upper part is full and the lower part is sparse (in diagonal form). Because of the special structure of B, not all elements in the matrix are affected in a particular step. Only a submatrix of B will be transformed in each step. If the columns of the submatrix B_i at step i are denoted by B_i = [b_i, b_{i+1}, ..., b_n], then the Householder Transformation can be described as:

  Householder Transformation
    Initialize matrix B
    for i = 1, n
      1: α_i = -sign(a_ii^(i)) (b_i^T b_i)^(1/2)
      2: w_i = b_i - α_i e_1
      3: β_j = w_i^T b_j / (α_i^2 - a_ii^(i) α_i),   j = i+1, ..., n
      4: b_j = b_j - β_j w_i,                        j = i+1, ..., n
    end for

The calculation of the β_j's and the updating of the b_j's can be done in parallel for different indices j.
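For readers who wish to experiment, the following minimal NumPy sketch forms the stacked matrix B of Eqs. (23)-(24) and applies the column-oriented Householder updates of steps 1-4 above. It is a dense illustration under our own naming; it ignores the sparse diagonal structure of the lower block and is not the KSR Fortran code timed in Section 4.3.

    import numpy as np

    def regularized_ls_householder(A, b, alpha):
        # Solve (A^T A + alpha I) x = A^T b via Householder QR of B = [A; sqrt(alpha) I].
        # Dense illustrative sketch of steps 1-4; the sparsity of the lower block
        # is not exploited here.
        m, n = A.shape
        B = np.vstack([A, np.sqrt(alpha) * np.eye(n)])   # Eqs. (23)-(24)
        rhs = np.concatenate([b, np.zeros(n)])           # b_0 = [b; 0]
        for i in range(n):
            bi = B[i:, i]
            alpha_i = -np.copysign(np.linalg.norm(bi), B[i, i])   # step 1
            w = bi.copy()
            w[0] -= alpha_i                                        # step 2: w = b_i - alpha_i e_1
            gamma = alpha_i**2 - B[i, i] * alpha_i                 # equals w^T w / 2
            for j in range(i + 1, n):                              # steps 3-4 (parallel over j)
                beta_j = w @ B[i:, j] / gamma
                B[i:, j] -= beta_j * w
            rhs[i:] -= (w @ rhs[i:] / gamma) * w                   # apply the reflector to b_0
            B[i:, i] = 0.0
            B[i, i] = alpha_i
        # Back substitution on the triangular factor R = B[:n, :n].
        return np.linalg.solve(np.triu(B[:n, :n]), rhs[:n])

    # Usage: compare with the normal-equations solution of Eq. (22).
    rng = np.random.default_rng(0)
    A, b, alpha = rng.standard_normal((200, 50)), rng.standard_normal(200), 1e-2
    x = regularized_ls_householder(A, b, alpha)
    x_ref = np.linalg.solve(A.T @ A + alpha * np.eye(50), A.T @ b)
    print(np.allclose(x, x_ref))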

4.3 Timing Results

The numerical experiments reported here were conducted on the KSR-1 parallel computer installed at the Cornell Theory Center. There are 128 processors altogether on the machine. During the period when our experiments were performed, however, the computer was configured as two stand-alone machines with 64 processors each. Therefore, the numerical results were obtained using fewer than 64 processors.

Figure 5 shows the traditional fixed-size speedup curves obtained by solving the regularized least squares problem with different matrix sizes n. The matrix is of dimensions 2n x n. We can see clearly that as the matrix size n increases, the speedup gets better and better. For the case when n = 2048, the speedup is 76 on 56 processors. Although it is well known that on most parallel computers the speedup improves as the problem size increases, what is shown in Fig. 5 is certainly too good to be a reasonable measurement of the real performance of the KSR-1. The problem with the traditional speedup is that it is defined as the ratio of the sequential time to the parallel time used for solving the same fixed-size problem. The complex memory hierarchy on the KSR-1 makes the computational speed of a single processor highly dependent on the problem size. When the problem is so big that not all data of the matrix can be put in the local memory (32 Mbytes) of the single computing processor, part of the data must be put in the local memory of other processors on the system. These data are accessed by the computing processor through Search Engine:0. As a result, the computational speed on a single processor slows down significantly due to the high latency of Group:0 cache. The sustained computational speed on a single processor is 5.5 Mflops, 4.5 Mflops and 2.7 Mflops for problem sizes 1024, 1600 and 2048 respectively. On the other hand, with multiple processors, most of the data needed are in the local memory of each processor, so the computational speed suffers less from the high Group:0 cache latency. Therefore, the excellent speedups shown in Fig. 5 are the results of significant uniprocessor performance degradation when a large problem is solved on a single processor.

Figure 5. Fixed-size (Traditional) Speedup on KSR-1 (ideal speedup and measured speedups for n = 1024, 1600, and 2048 versus the number of processors).

Figure 6 shows the measured single processor speed as a function of problem size n. The Householder Transformation algorithm given before was implemented in KSR Fortran. The algorithm has a numerical complexity of W = 2n^3 + 8.5n^2 + lower-order terms in n, and the speed is calculated using s = W/t, where t is the CPU time used to finish the computation. As can be seen from Fig. 6, the three segments represent significantly different speeds for different matrix sizes. When the whole matrix can fit into the subcache, the performance is close to 7 Mflops. The speed decreases to around 5.5 Mflops when the matrix cannot fit into the subcache, but still can be accommodated in the local cache. Note, however, that when the matrix is so big that access to Group:0 cache through Search Engine:0 is needed, the performance degrades significantly and there is no clear stable performance level as can be observed in the other two segments. This is largely due to the high Group:0 cache latency and the contention for the Search Engine, which is used by all processors on the machine. Therefore, the access time of Group:0 cache is less uniform as compared to that of the subcache and local cache.

Figure 6. Speed Variation of Uniprocessor Processing on KSR-1 (speed versus the order of the matrices, with segments where the data fits in subcache, local cache, and Group:0 cache).

To take the difference of single-processor speeds for different problem sizes into consideration, we have to use the generalized speedup to measure the performance of multiple processors on the KSR-1. As can be seen from the definition of Eq. (6), the generalized speedup is defined as the ratio of the parallel speed to the asymptotic sequential speed, where the parallel speed is based on a scaled problem. In our numerical tests, the parallel problem was scaled in a memory-bounded fashion as the number of processors increases. The initial problem was selected based on the asymptotic speed (5.5 Mflops from Fig. 6) and then scaled proportionally according to the number of processors, i.e. with p processors, the problem is scaled to a size that will fill M·p Mbytes of memory, where M is the memory required by the unscaled problem.

Figure 7 shows the comparison of the traditional scaled speedup and the generalized speedup. For the traditional scaled speedup, the scaled problem is solved on both one and p processors, and the value of the speedup is calculated as the ratio of the time on one processor to that on p processors. For the generalized speedup, the scaled problem is solved only on multiple processors, not on a single processor. The value of the speedup is calculated using Eq. (6), where the asymptotic speed is used for the sequential speed. Figure 7 clearly shows that the generalized speedup gives a much more reasonable performance measurement on the KSR-1 than does the traditional scaled speedup. With the traditional scaled speedup, the speedup is above 20 with only 10 processors. This excellent superlinear speedup is a result of the severely degraded single-processor speed, rather than of perfect scalability of the machine and the algorithm.

Figure 7. Comparison of Generalized and Traditional Speedup on KSR-1 (ideal, generalized, and traditional speedup versus the number of processors).

Finally, Table 1 gives the measured isospeed scalability (see Eq. (12)) of solving the regularized least squares problem on a KSR-1 computer. The speed to be maintained on different numbers of processors is 3.25 Mflops, which is 60% of the asymptotic speed of 5.5 Mflops. The size of the 2n x n matrix is increased as the number of processors increases. It starts at n = 27 on one processor and increases to n = 2773 on 56 processors. One may notice that ψ(2,4) > ψ(1,2) in Table 1, which means that the machine-algorithm pair scales better from 2 processors to 4 processors than it does from one processor to two processors. This can be explained by the fact that on one processor, the matrix is small enough that all data can be accommodated in the subcache. Once all the data is loaded into the subcache, the whole computation process does not need data from local cache and Group:0 cache. Therefore, the data access time on one processor is significantly shorter than that on two processors, which involves subcache, local cache and Group:0 cache to pass messages. As a result, a significant increase in the work is necessary in the case of two processors to offset the extra data access time involving different memory hierarchies. This is the major reason for the low ψ(1,2) value. When the number of processors increases from 2 to 4, the data access pattern is the same for both cases, with subcache, local cache and Group:0 cache all involved, so that the work does not need to be increased significantly to offset the extra communication cost when going from 2 processors to 4 processors.

It is interesting to notice that, while the scalability of the RLSP-KSR1 combination is relatively low, the data in Table 1 have a decreasing pattern similar to the measured and computed scalability of the Burg-nCUBE, SLALOM-nCUBE, Burg-MasPar and SLALOM-MasPar combinations [12]. The scalabilities are all decreasing along columns and have some irregular behavior at ψ(1,2) and ψ(2,4).

Interested readers may wonder how the measured scalability is related to the measured generalized speedup given in Fig. 7. While Fig. 7 demonstrates a nearly linear generalized speedup, the corresponding scalability given in Table 1 is far from ideal (the ideal scalability would be unity). The low scalability is expected. Recall that the scaled speedup given in Fig. 7 is memory-bounded speedup [6]. That is, when the number of processors is doubled, the usage of memory is also doubled.
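The asymptotic speed used above (5.5 Mflops) is the stabilized uniprocessor speed observed while the data still fit in local memory. The hedged sketch below shows one way such a plateau can be extracted from measurements shaped like Fig. 6; the sample readings are invented for illustration.

    # Illustrative selection of the asymptotic uniprocessor speed from
    # (problem size, measured Mflops) pairs; the sample data are made up.

    def asymptotic_speed(measurements, window=3, tolerance=0.05):
        # Scan increasing problem sizes for runs of `window` consecutive
        # measurements that agree to within `tolerance`, and return the average
        # of the last such plateau.  For data shaped like Fig. 6 this is the
        # stabilized local-memory speed, i.e. the asymptotic speed of Section 3.
        speeds = [s for _, s in sorted(measurements)]
        plateau = None
        for k in range(len(speeds) - window + 1):
            chunk = speeds[k:k + window]
            if max(chunk) - min(chunk) <= tolerance * max(chunk):
                plateau = sum(chunk) / window
        return plateau

    samples = [(128, 7.0), (256, 6.9), (512, 5.6), (768, 5.5), (1024, 5.5),
               (1600, 4.5), (2048, 2.7)]        # hypothetical Mflops readings
    print(asymptotic_speed(samples))             # about 5.5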

As a result, the number of elements in the matrix is increased by a factor of 2. Corollary 2 shows that if work increases linearly with the number of processors, then unitary memory-bounded speedup will lead to ideal scalability. For the regularized least squares application, however, the work is a cubic function of the matrix size n. When the memory usage is doubled, the number of floating point operations is increased by a factor of eight. If a perfect generalized speedup is achieved from p to p' = 2p, the average speed at p and p' should be the same. By Eq. (12) we have

  ψ(p, p') = 2p / (8p) = 1/4.

With the measured speedup being a little lower than unitary as shown in Fig. 7, a scalability of less than 0.25 is expected. Table 1 confirms this relation, except at ψ(2,4) for the reason pointed out earlier. The scalability in the last column is noticeably lower than in the other columns. It is because when 56 nodes are involved in computations, communication has to pass through ring:1, which slows down the communication significantly.

Table 1. Measured Scalability ψ(N, N') of the RLSP-KSR1 combination.

Computation-intensive applications have often been used to achieve high flops. The RLSP application is a computation-intensive application. Table 1 shows that isospeed scalability does not give credit for computation-intensive applications. The computation-intensive applications may achieve a high speed on multiple processors, but the initial speed is also high. The isospeed scalability measures the ability to maintain the speed, rather than to achieve a particular speed.

The implementation was conducted on a KSR-1 shared virtual memory machine. The theoretical and analytical results given in Section 2 and Section 3, however, are general and can be applied on different parallel platforms. For instance, for Intel Paragon parallel computers, where virtual memory is supported to swap data in and out from memory to disk, we expect that inefficient sequential processing will cause similar superlinear (traditional) speedup as demonstrated on the KSR-1. For distributed-memory machines which do not support virtual memory, such as the CM-5, traditional speedup has another drawback. Due to the memory constraint, scaled problems often cannot be solved on a single processor. Therefore, scaled speedup is unmeasurable. Defining asymptotic speed similarly as given in Section 3, the generalized speedup can be applied to this kind of distributed-memory machine to measure scalable computations. Generalized speedup is defined as parallel speed over sequential speed. Given a reasonable initial sequential speed, it can be used on any parallel platform to measure the performance of scalable computations.

5 Conclusion

Since the scaled-up principle was proposed in 1988 by Gustafson and other researchers at Sandia National Laboratory [21], the principle has been widely used in performance measurement of parallel algorithms and architectures. One difficulty of measuring scaled speedup is that very large problems have to be solved on a uniprocessor, which is very inefficient if virtual memory is supported, or is impossible otherwise. To overcome this shortcoming, generalized speedup was proposed [1]. Generalized speedup is defined as parallel speed over sequential speed and does not require solving large problems on a uniprocessor. The study [1] emphasized the fixed-time generalized speedup, sizeup. To meet the need of the emerging shared virtual memory machines, the generalized speedup, particularly its implementation issues, has been carefully studied in the current research. It has been shown that traditional speedup is a special case of generalized speedup, and, on the other hand, generalized speedup is a reform of traditional speedup. The main difference between generalized speedup and traditional speedup is how to define the uniprocessor efficiency. When the uniprocessor speed is fixed, these two speedups are the same. Extending these results to scalability study, we have found that the difference between isospeed scalability [12] and isoefficiency scalability [13] is also due to the uniprocessor efficiency. When the uniprocessor speed is independent of the problem size, these two proposed scalabilities are the same. As part of the performance study, we have shown that an algorithm-machine combination achieves a perfect scalability if and only if it achieves a perfect speedup. An interesting relation between fixed-time and memory-bounded speedups is revealed. Seven causes of superlinear speedup are also listed. A scientific application has been implemented on a Kendall Square KSR-1 shared virtual memory machine. Experimental results show that uniprocessor efficiency is an important issue for virtual memory machines, and that the asymptotic speed provides a reasonable way to define the uniprocessor efficiency. The results in this paper on shared virtual memory can be extended to general parallel computers. Since uniprocessor efficiency is directly related to parallel execution time, scalability, and benchmark evaluations, the range of applicability of the uniprocessor efficiency study is wider than speedups. The uniprocessor efficiency might be explored further in a number of contexts.

Acknowledgement

The authors are grateful to the Cornell Theory Center for providing access to its KSR-1 parallel computer, and to the referees for their helpful comments on the revision of this paper.

References

[1] X.-H. Sun and J. Gustafson, "Toward a better parallel performance metric," Parallel Computing, vol. 17, pp. 1093-1109, Dec.
[2] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill Book Co.
[3] J. Ortega and R. Voigt, "Solution of partial differential equations on vector and parallel computers," SIAM Review, pp. 149-240, June.
[4] G. Amdahl, "Validity of the single-processor approach to achieving large scale computing capabilities," in Proc. AFIPS Conf., pp. 483-485.
[5] J. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, pp. 532-533, May.
[6] X.-H. Sun and L. Ni, "Scalable problems and memory-bounded speedup," J. of Parallel and Distributed Computing, vol. 19, pp. 27-37, Sept.
[7] D. Helmbold and C. McDowell, "Modeling speedup(n) greater than n," IEEE Trans. on Parallel and Distributed Sys., pp. 250-256, Apr.
[8] D. Parkinson, "Parallel efficiency can be greater than unity," Parallel Computing, vol. 3, pp. 261-262.
[9] D. Nicol, "Inflated speedups in parallel simulations via malloc()," International Journal on Simulation, vol. 2, pp. 413-426, Dec.
[10] X.-H. Sun and J. Zhu, "Performance prediction of scalable computing: A case study," in Proc. of the 28th Hawaii International Conference on System Sciences, pp. 456-465, Jan.
[11] J. Gustafson, D. Rover, S. Elbert, and M. Carter, "The design of a scalable, fixed-time computer benchmark," J. of Parallel and Distributed Computing, vol. 12, no. 4, pp. 388-401.
[12] X.-H. Sun and D. Rover, "Scalability of parallel algorithm-machine combinations," IEEE Transactions on Parallel and Distributed Systems, pp. 599-613, June.
[13] A. Y. Grama, A. Gupta, and V. Kumar, "Isoefficiency: Measuring the scalability of parallel algorithms and architectures," IEEE Parallel & Distributed Technology, vol. 1, pp. 12-21, Aug.
[14] Kendall Square Research, "KSR parallel programming." Waltham, USA.
[15] C. Leiserson, "Fat-trees: Universal networks for hardware-efficient supercomputing," IEEE Transactions on Computers, vol. 34, no. 10, pp. 892-901.
[16] Kendall Square Research, "KSR technical summary." Waltham, USA.
[17] A. N. Tikhonov and V. Arsenin, Solution of Ill-posed Problems. John Wiley and Sons.
[18] Y. M. Chen, J. P. Zhu, W. H. Chen, and M. L. Wasserman, "GPST inversion algorithm for history matching in 3-d 2-phase simulators," in IMACS Trans. on Scientific Computing I, pp. 369-374.
[19] J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers. Philadelphia: SIAM.
[20] A. Pothen and P. Raghavan, "Distributed orthogonal factorization: Givens and Householder algorithms," SIAM J. of Sci. and Stat. Computing, vol. 10, pp. 1113-1135.
[21] J. Gustafson, G. Montry, and R. Benner, "Development of parallel methods for a 1024-processor hypercube," SIAM J. of Sci. and Stat. Computing, vol. 9, pp. 609-638, July.


More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data Malaysan Journal of Mathematcal Scences 11(S) Aprl : 35 46 (2017) Specal Issue: The 2nd Internatonal Conference and Workshop on Mathematcal Analyss (ICWOMA 2016) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Computer models of motion: Iterative calculations

Computer models of motion: Iterative calculations Computer models o moton: Iteratve calculatons OBJECTIVES In ths actvty you wll learn how to: Create 3D box objects Update the poston o an object teratvely (repeatedly) to anmate ts moton Update the momentum

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

ELEC 377 Operating Systems. Week 6 Class 3

ELEC 377 Operating Systems. Week 6 Class 3 ELEC 377 Operatng Systems Week 6 Class 3 Last Class Memory Management Memory Pagng Pagng Structure ELEC 377 Operatng Systems Today Pagng Szes Vrtual Memory Concept Demand Pagng ELEC 377 Operatng Systems

More information

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7 Optmzed Regonal Cachng for On-Demand Data Delvery Derek L. Eager Mchael C. Ferrs Mary K. Vernon Unversty of Saskatchewan Unversty of Wsconsn Madson Saskatoon, SK Canada S7N 5A9 Madson, WI 5376 eager@cs.usask.ca

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Maintaining temporal validity of real-time data on non-continuously executing resources

Maintaining temporal validity of real-time data on non-continuously executing resources Mantanng temporal valdty of real-tme data on non-contnuously executng resources Tan Ba, Hong Lu and Juan Yang Hunan Insttute of Scence and Technology, College of Computer Scence, 44, Yueyang, Chna Wuhan

More information

Solitary and Traveling Wave Solutions to a Model. of Long Range Diffusion Involving Flux with. Stability Analysis

Solitary and Traveling Wave Solutions to a Model. of Long Range Diffusion Involving Flux with. Stability Analysis Internatonal Mathematcal Forum, Vol. 6,, no. 7, 8 Soltary and Travelng Wave Solutons to a Model of Long Range ffuson Involvng Flux wth Stablty Analyss Manar A. Al-Qudah Math epartment, Rabgh Faculty of

More information

Preconditioning Parallel Sparse Iterative Solvers for Circuit Simulation

Preconditioning Parallel Sparse Iterative Solvers for Circuit Simulation Precondtonng Parallel Sparse Iteratve Solvers for Crcut Smulaton A. Basermann, U. Jaekel, and K. Hachya 1 Introducton One mportant mathematcal problem n smulaton of large electrcal crcuts s the soluton

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach Data Representaton n Dgtal Desgn, a Sngle Converson Equaton and a Formal Languages Approach Hassan Farhat Unversty of Nebraska at Omaha Abstract- In the study of data representaton n dgtal desgn and computer

More information

Assembler. Building a Modern Computer From First Principles.

Assembler. Building a Modern Computer From First Principles. Assembler Buldng a Modern Computer From Frst Prncples www.nand2tetrs.org Elements of Computng Systems, Nsan & Schocken, MIT Press, www.nand2tetrs.org, Chapter 6: Assembler slde Where we are at: Human Thought

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Video Proxy System for a Large-scale VOD System (DINA)

Video Proxy System for a Large-scale VOD System (DINA) Vdeo Proxy System for a Large-scale VOD System (DINA) KWUN-CHUNG CHAN #, KWOK-WAI CHEUNG *# #Department of Informaton Engneerng *Centre of Innovaton and Technology The Chnese Unversty of Hong Kong SHATIN,

More information

Efficient Distributed File System (EDFS)

Efficient Distributed File System (EDFS) Effcent Dstrbuted Fle System (EDFS) (Sem-Centralzed) Debessay(Debsh) Fesehaye, Rahul Malk & Klara Naherstedt Unversty of Illnos-Urbana Champagn Contents Problem Statement, Related Work, EDFS Desgn Rate

More information

Scheduling Remote Access to Scientific Instruments in Cyberinfrastructure for Education and Research

Scheduling Remote Access to Scientific Instruments in Cyberinfrastructure for Education and Research Schedulng Remote Access to Scentfc Instruments n Cybernfrastructure for Educaton and Research Je Yn 1, Junwe Cao 2,3,*, Yuexuan Wang 4, Lanchen Lu 1,3 and Cheng Wu 1,3 1 Natonal CIMS Engneerng and Research

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements Module 3: Element Propertes Lecture : Lagrange and Serendpty Elements 5 In last lecture note, the nterpolaton functons are derved on the bass of assumed polynomal from Pascal s trangle for the fled varable.

More information

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility IJCSNS Internatonal Journal of Computer Scence and Network Securty, VOL.15 No.10, October 2015 1 Evaluaton of an Enhanced Scheme for Hgh-level Nested Network Moblty Mohammed Babker Al Mohammed, Asha Hassan.

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

Simulation Based Analysis of FAST TCP using OMNET++

Simulation Based Analysis of FAST TCP using OMNET++ Smulaton Based Analyss of FAST TCP usng OMNET++ Umar ul Hassan 04030038@lums.edu.pk Md Term Report CS678 Topcs n Internet Research Sprng, 2006 Introducton Internet traffc s doublng roughly every 3 months

More information

An Accurate Evaluation of Integrals in Convex and Non convex Polygonal Domain by Twelve Node Quadrilateral Finite Element Method

An Accurate Evaluation of Integrals in Convex and Non convex Polygonal Domain by Twelve Node Quadrilateral Finite Element Method Internatonal Journal of Computatonal and Appled Mathematcs. ISSN 89-4966 Volume, Number (07), pp. 33-4 Research Inda Publcatons http://www.rpublcaton.com An Accurate Evaluaton of Integrals n Convex and

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

Today s Outline. Sorting: The Big Picture. Why Sort? Selection Sort: Idea. Insertion Sort: Idea. Sorting Chapter 7 in Weiss.

Today s Outline. Sorting: The Big Picture. Why Sort? Selection Sort: Idea. Insertion Sort: Idea. Sorting Chapter 7 in Weiss. Today s Outlne Sortng Chapter 7 n Wess CSE 26 Data Structures Ruth Anderson Announcements Wrtten Homework #6 due Frday 2/26 at the begnnng of lecture Proect Code due Mon March 1 by 11pm Today s Topcs:

More information

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation Intellgent Informaton Management, 013, 5, 191-195 Publshed Onlne November 013 (http://www.scrp.org/journal/m) http://dx.do.org/10.36/m.013.5601 Qualty Improvement Algorthm for Tetrahedral Mesh Based on

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

Speedup of Type-1 Fuzzy Logic Systems on Graphics Processing Units Using CUDA

Speedup of Type-1 Fuzzy Logic Systems on Graphics Processing Units Using CUDA Speedup of Type-1 Fuzzy Logc Systems on Graphcs Processng Unts Usng CUDA Durlabh Chauhan 1, Satvr Sngh 2, Sarabjeet Sngh 3 and Vjay Kumar Banga 4 1,2 Department of Electroncs & Communcaton Engneerng, SBS

More information

Fast Computation of Shortest Path for Visiting Segments in the Plane

Fast Computation of Shortest Path for Visiting Segments in the Plane Send Orders for Reprnts to reprnts@benthamscence.ae 4 The Open Cybernetcs & Systemcs Journal, 04, 8, 4-9 Open Access Fast Computaton of Shortest Path for Vstng Segments n the Plane Ljuan Wang,, Bo Jang

More information

A SYSTOLIC APPROACH TO LOOP PARTITIONING AND MAPPING INTO FIXED SIZE DISTRIBUTED MEMORY ARCHITECTURES

A SYSTOLIC APPROACH TO LOOP PARTITIONING AND MAPPING INTO FIXED SIZE DISTRIBUTED MEMORY ARCHITECTURES A SYSOLIC APPROACH O LOOP PARIIONING AND MAPPING INO FIXED SIZE DISRIBUED MEMORY ARCHIECURES Ioanns Drosts, Nektaros Kozrs, George Papakonstantnou and Panayots sanakas Natonal echncal Unversty of Athens

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introducton 1.1 Parallel Processng There s a contnual demand for greater computatonal speed from a computer system than s currently possble (.e. sequental systems). Areas need great computatonal

More information

Intra-procedural Inference of Static Types for Java Bytecode 1

Intra-procedural Inference of Static Types for Java Bytecode 1 McGll Unversty School of Computer Scence Sable Research Group Intra-procedural Inference of Statc Types for Java Bytecode 1 Sable Techncal Report No. 5 Etenne Gagnon Laure Hendren October 14, 1998 w w

More information

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated.

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated. Some Advanced SP Tools 1. umulatve Sum ontrol (usum) hart For the data shown n Table 9-1, the x chart can be generated. However, the shft taken place at sample #21 s not apparent. 92 For ths set samples,

More information

A Facet Generation Procedure. for solving 0/1 integer programs

A Facet Generation Procedure. for solving 0/1 integer programs A Facet Generaton Procedure for solvng 0/ nteger programs by Gyana R. Parja IBM Corporaton, Poughkeepse, NY 260 Radu Gaddov Emery Worldwde Arlnes, Vandala, Oho 45377 and Wlbert E. Wlhelm Teas A&M Unversty,

More information

Chapter 1. Comparison of an O(N ) and an O(N log N ) N -body solver. Abstract

Chapter 1. Comparison of an O(N ) and an O(N log N ) N -body solver. Abstract Chapter 1 Comparson of an O(N ) and an O(N log N ) N -body solver Gavn J. Prngle Abstract In ths paper we compare the performance characterstcs of two 3-dmensonal herarchcal N-body solvers an O(N) and

More information

Constructing Minimum Connected Dominating Set: Algorithmic approach

Constructing Minimum Connected Dominating Set: Algorithmic approach Constructng Mnmum Connected Domnatng Set: Algorthmc approach G.N. Puroht and Usha Sharma Centre for Mathematcal Scences, Banasthal Unversty, Rajasthan 304022 usha.sharma94@yahoo.com Abstract: Connected

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Positive Semi-definite Programming Localization in Wireless Sensor Networks

Positive Semi-definite Programming Localization in Wireless Sensor Networks Postve Sem-defnte Programmng Localzaton n Wreless Sensor etworks Shengdong Xe 1,, Jn Wang, Aqun Hu 1, Yunl Gu, Jang Xu, 1 School of Informaton Scence and Engneerng, Southeast Unversty, 10096, anjng Computer

More information

The Shortest Path of Touring Lines given in the Plane

The Shortest Path of Touring Lines given in the Plane Send Orders for Reprnts to reprnts@benthamscence.ae 262 The Open Cybernetcs & Systemcs Journal, 2015, 9, 262-267 The Shortest Path of Tourng Lnes gven n the Plane Open Access Ljuan Wang 1,2, Dandan He

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Communication-Minimal Partitioning and Data Alignment for Af"ne Nested Loops

Communication-Minimal Partitioning and Data Alignment for Afne Nested Loops Communcaton-Mnmal Parttonng and Data Algnment for Af"ne Nested Loops HYUK-JAE LEE 1 AND JOSÉ A. B. FORTES 2 1 Department of Computer Scence, Lousana Tech Unversty, Ruston, LA 71272, USA 2 School of Electrcal

More information

Loop Transformations for Parallelism & Locality. Review. Scalar Expansion. Scalar Expansion: Motivation

Loop Transformations for Parallelism & Locality. Review. Scalar Expansion. Scalar Expansion: Motivation Loop Transformatons for Parallelsm & Localty Last week Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Scalar expanson for removng false dependences Loop nterchange Loop

More information

Finite Element Analysis of Rubber Sealing Ring Resilience Behavior Qu Jia 1,a, Chen Geng 1,b and Yang Yuwei 2,c

Finite Element Analysis of Rubber Sealing Ring Resilience Behavior Qu Jia 1,a, Chen Geng 1,b and Yang Yuwei 2,c Advanced Materals Research Onlne: 03-06-3 ISSN: 66-8985, Vol. 705, pp 40-44 do:0.408/www.scentfc.net/amr.705.40 03 Trans Tech Publcatons, Swtzerland Fnte Element Analyss of Rubber Sealng Rng Reslence Behavor

More information