Shared Virtual Memory and Generalized Speedup

Xian-He Sun
ICASE, Mail Stop 132C
NASA Langley Research Center
Hampton, VA 23681-0001

Jianping Zhu
NSF Engineering Research Center
Dept. of Math. and Stat.
Mississippi State University
Mississippi State, MS 39762

Abstract

Generalized speedup is defined as parallel speed over sequential speed. In this paper the generalized speedup and its relation with other existing performance metrics, such as traditional speedup, efficiency, scalability, etc., are carefully studied. In terms of the introduced asymptotic speed, we show that the difference between the generalized speedup and the traditional speedup lies in the definition of the efficiency of uniprocessor processing, which is a very important issue in shared virtual memory machines. A scientific application has been implemented on a KSR-1 parallel computer. Experimental and theoretical results show that the generalized speedup is distinct from the traditional speedup and provides a more reasonable measurement. In the study of different speedups, various causes of superlinear speedup are also presented.

This research was supported by the National Aeronautics and Space Administration under NASA contract NAS1-19480 while the first author was in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001.

1 Introduction

In recent years parallel processing has enjoyed unprecedented attention from researchers, government agencies, and industries. This attention is mainly due to the fact that, with the current circuit technology, parallel processing seems to be the only remaining way to achieve higher performance. However, while various parallel computers and algorithms have been developed, their performance evaluation is still elusive. In fact, the more advanced the hardware and software, the more difficult it is to evaluate the parallel performance. In this paper we target the recent development of shared virtual memory machines and revisit the generalized speedup [17] performance metric.

Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as the Kendall Square KSR-1, Intel Paragon, and TMC CM-5, have successfully delivered high performance computing power for solving certain of the so-called "grand-challenge" problems. From the viewpoint of processes, there are two basic process synchronization and communication models. One is the shared-memory model, in which processes communicate through shared variables. The other is the message-passing model, in which processes communicate through explicit message passing. The shared-memory model provides a sequential program paradigm. With shared virtual address space, the shared-memory model supports shared virtual memory, but requires sophisticated hardware and system support. An example of a distributed-memory machine which supports shared virtual address space is the Kendall Square KSR-1. Traditionally, the message-passing model is bounded by the local memory of the processing processors. With recent technology advancement, the message-passing model has extended its ability to support shared virtual memory. Shared virtual memory simplifies the software development and porting process by enabling even extremely large programs to run on a single processor before being partitioned and distributed across multiple processors. However, the memory access of shared virtual memory is non-uniform [8]. The access times of local memory and remote memory are different. Running a large program on a small number of processors is possible but could be very inefficient. The inefficient sequential processing will lead to a misleadingly high performance in terms of speedup or efficiency.

Generalized speedup, defined as parallel speed over sequential speed, is a new performance metric proposed in [17]. In this paper, we revisit generalized speedup and address the measurement issues. Through both theoretical proofs and experimental results, we show that generalized speedup provides a more reasonable measurement than traditional speedup. In the process of studying generalized speedup, the relations between the generalized speedup and many other metrics, such as efficiency, scaled speedup, and scalability, are also studied. Various reasons for superlinearity in different speedups are also discussed. Results show that the main difference between the traditional speedup and the generalized speedup is how to evaluate the efficiency of the sequential processing on a single processor.

The paper is organized as follows. In Section 2 we study traditional speedup, including the scaled speedup concept, and introduce some terminology. Analysis shows that the traditional speedup, fixed-size or scaled, may achieve superlinearity on shared virtual memory machines. Furthermore, with the traditional speedup metric, the slower the remote memory access is, the larger the speedup. Generalized speedup is studied in Section 3. The term asymptotic speed is introduced for the measurement of generalized speedup. Analysis shows the differences and the similarities between the generalized speedup and the traditional speedup. Efficiency and scalability issues are also discussed. Experimental results of a production application on a Kendall Square KSR-1 parallel computer are given in Section 4. Section 5 contains a summary.

2 The Traditional Speedup

One of the best accepted and most frequently used performance metrics in parallel processing is speedup. It measures the parallel processing gain over sequential processing and is defined as sequential execution time over parallel execution time. Parallel algorithms often exploit parallelism by sacrificing mathematical efficiency. To measure the true parallel processing gain, the sequential execution time should be based on a commonly used sequential algorithm. To distinguish it from other interpretations of speedup, the speedup measured with a commonly used sequential algorithm has been called absolute speedup [14]. Absolute speedup is an important metric, especially when new parallel algorithms are introduced. Another widely used interpretation is the relative speedup [14], which uses the uniprocessor execution time of the parallel algorithm as the sequential time. There are several reasons to use relative speedup. First, the performance of an algorithm varies with the number of processors. Relative speedup measures the variation. Second, relative speedup avoids the difficulty of choosing the practical sequential algorithm, implementing the sequential algorithm, and matching the implementation/programming skill between the sequential algorithm and the parallel algorithm. Also, when the problem size is fixed, the time ratio of the chosen sequential algorithm and the uniprocessor execution of the parallel algorithm is fixed. Therefore, the relative speedup is proportional to the absolute speedup. Relative speedup is the speedup commonly used in performance studies. The well-known Amdahl's law [1] and Gustafson's scaled speedup [4] are both based on relative speedup. In this study we will focus on relative speedup and reserve the terms traditional speedup and speedup for relative speedup. The concepts and results of this study can be extended to absolute speedup. The absolute speedup and the relative speedup are distinguished by the sequential algorithm.

After a sequential algorithm is chosen, from the problem-size point of view, speedup can be further divided into fixed-size speedup and scaled speedup. Fixed-size speedup emphasizes how much execution time can be reduced with parallel processing. Amdahl's law is based on the fixed-size speedup. With one parameter, the sequential processing ratio, Amdahl's law gives the limitation of the fixed-size speedup.

The scaled speedup concentrates on exploring the computational power of parallel computers for solving otherwise intractable large problems. Depending on the scaling restrictions on the problem size, the scaled speedup can be classified as the fixed-time speedup and the memory-bounded speedup [18]. When p processors are used, fixed-time speedup scales the problem size to meet the fixed execution time. Then the scaled problem is also solved on a uniprocessor to get the speedup. Corresponding to Amdahl's law, Gustafson has given a simple fixed-time speedup formula [5]. The memory-bounded speedup [18] is another practically used scaled speedup. It is defined in a similar way to the fixed-time speedup. The difference is that in memory-bounded speedup the problem size is scaled based on the available memory, while in fixed-time speedup the problem size is scaled up to meet the fixed execution time. A detailed study of the memory-bounded speedup can be found in [18].

Speedup can also be classified based on the achieved performance. Let p and S_p be the number of processors and the speedup with p processors. The following terms were used in [7].

Definition 1
    Super-linear speedup: \lim_{p \to \infty} S_p / p = \infty.
    Linear super-unitary speedup: p < S_p < c p for some constant c > 1.
    Unitary speedup: S_p = p.
    Linear sub-unitary speedup: \sigma p < S_p < p for some positive constant \sigma < 1.
    Sub-linear speedup: \lim_{p \to \infty} S_p / p = 0.

We say a speedup is a superlinear speedup if it is either super-linear or linear super-unitary. It is debatable whether any machine-algorithm pair can achieve "truly" superlinear speedup. Four possible causes of superlinear speedup given in [7] are listed in Fig. 1.

1. cache size increased in parallel processing
2. overhead reduced in parallel processing
3. latency hidden in parallel processing
4. randomized algorithms

Figure 1. Causes of Superlinear Speedup: part 1

While cause 2 in Fig. 1 can be considered theoretically [15], there is no measured superlinear speedup ever attributed to it. Cause 3 does not exist for relative speedup, since both the sequential and

parallel executions use the same algorithm. Cause 1 is unlikely to apply to scaled speedup, since when the problem size scales up, by memory or by time constraint, the cache hit ratio is unlikely to increase. Two other causes of superlinear relative speedup and scaled speedup are listed in Fig. 2.

5. mathematical inefficiency of the serial algorithm
6. higher memory access latency in the sequential processing

Figure 2. Causes of Superlinear Speedup: part 2

Since parallel algorithms are often mathematically inefficient, cause 5 is a likely source of superlinear relative speedup. A good example of superlinear speedup based on cause 5 can be found in [13]. With the virtual memory and shared virtual memory architecture, cause 6 can lead to an extremely high speedup, especially for scaled speedup, where an extremely large problem has to be run on a single processor. Figure 7 shows a measured superlinear speedup on a KSR-1 machine. The measured superlinear speedup is due to the inherent deficiency of the traditional speedup metric. To analyze this deficiency, we need to introduce the following definition.

Definition 2 The cost of parallelism i is the ratio of the total number of processor cycles consumed in order to perform one unit operation of work when i processors are active to the machine clock rate.

The sequential execution time can be written in terms of work:

    Sequential execution time = Amount of work \times \frac{\text{Processor cycles per unit of work}}{\text{Machine clock rate}}.    (1)

The ratio on the right-hand side of Eq. (1), processor cycles per unit of work over machine clock rate, is the cost of sequential processing. Work can be defined as arithmetic operations, instructions, transitions, or whatever is needed to complete the application. In scientific computing the number of floating-point operations (FLOPS) is commonly used to measure work. In general, work may be of different types, and units of different operations may require different numbers of instruction cycles to finish. (For example, the times consumed by one division and one multiplication may be different depending on the underlying machine, and the operation and memory reference ratio may be different for different computations.) The influence of work type on the performance is one of the topics studied in [17].
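As a concrete instance of Eq. (1) (the numbers here are hypothetical, not taken from the paper's experiments), assume 10^9 floating-point operations of work, 2 processor cycles per operation, and a 20 MHz clock:

    t(1) = 10^9 \text{ flops} \times \frac{2 \text{ cycles/flop}}{2 \times 10^{7} \text{ cycles/s}} = 100 \text{ s},

so the cost of sequential processing in this example is 10^{-7} seconds per unit of work.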

In this paper, we study the influence of inefficient memory access on the performance. We assume that there is only one work type and that any increase in the number of processor cycles is due to inefficient memory access. In a shared virtual memory environment, the memory available depends on the system size. Let W_i be the amount of work executed when i processors are active, and let W = \sum_{i=1}^{p} W_i represent the total work. The cost of parallelism i in a p-processor system, denoted as c_p(i, W), is the elapsed time for one unit operation of work when i processors are active. Then, W_i c_p(i, W) gives the accumulated elapsed time where i processors are active. c_p(i, W) contains both computation time and remote memory access time.

The uniprocessor execution time can be represented in terms of uniprocessor cost,

    t(1) = \sum_{i=1}^{p} W_i \, c_p(s, W),

where c_p(s, W) is the cost of sequential processing on a parallel system with p processors. It is different from c_p(1, W), which is the cost of the sequential portion of the parallel processing. Parallel execution time can be represented in terms of parallel cost,

    t(p) = \sum_{i=1}^{p} W_i \, c_p(i, W).

The traditional speedup is defined as

    S_p = \frac{t(1)}{t(p)} = \frac{\sum_{i=1}^{p} W_i \, c_p(s, W)}{\sum_{i=1}^{p} W_i \, c_p(i, W)}.    (2)

If c_p(i, W) = c_p(p, W)/i for 1 \le i \le p, then

    S_p = \frac{c_p(s, W)}{c_p(p, W)} \cdot \frac{W}{\sum_{i=1}^{p} W_i / i}.    (3)

The first ratio of Eq. (3) is the cost ratio, which gives the influence of memory access delay. The second ratio,

    \frac{W}{\sum_{i=1}^{p} W_i / i},    (4)

is the simple analytic model based on degree of parallelism [18]. It assumes that memory access time is constant as problem size and system size vary. The cost ratio distinguishes the different performance analysis methods with or without consideration of the memory influence. In general, the cost ratio depends on the memory miss ratio, page replacement policy, data reference pattern, etc. For a simple case, if we assume there is no remote access in parallel processing and the remote access ratio of the sequential processing is (p-1)/p, then

    \frac{c_p(s, W)}{c_p(p, W)} = \frac{1}{p} + \frac{p-1}{p} \cdot \frac{\text{time per remote access}}{\text{time per local access}}.    (5)
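The following short Python sketch (ours; the workload split is illustrative, while the remote/local time ratio of 7.5 is the KSR-1 figure quoted below) evaluates Eqs. (3)-(5) and shows how slow remote memory inflates the traditional speedup:

    # Sketch: traditional speedup under the cost model of Eqs. (3)-(5).
    # Assumptions: no remote access in parallel processing; the sequential
    # run has a remote access ratio of (p-1)/p; workload W_i is illustrative.

    def cost_ratio(p, remote_over_local):
        # Eq. (5): c_p(s,W) / c_p(p,W)
        return 1.0 / p + (p - 1.0) / p * remote_over_local

    def simple_model(W_list):
        # Eq. (4): W / sum_i (W_i / i), the degree-of-parallelism model
        W = sum(W_list)
        return W / sum(Wi / i for i, Wi in enumerate(W_list, start=1))

    p = 32
    # Illustrative workload: 5% of the work sequential, the rest fully parallel.
    W_list = [0.0] * p
    W_list[0], W_list[p - 1] = 5.0, 95.0

    model = simple_model(W_list)               # memory-blind speedup, Eq. (4)
    S_p = cost_ratio(p, 7.5) * model           # Eq. (3)
    print(f"cost ratio = {cost_ratio(p, 7.5):.1f}")   # about 7.3 for p = 32
    print(f"model = {model:.1f}, traditional S_p = {S_p:.1f}")
    print("superlinear" if S_p > p else "not superlinear")

With these numbers the memory-blind model predicts a speedup of about 12.6, while the measured traditional speedup would be about 92 on 32 processors, which is superlinear purely because of the degraded sequential run.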

Equation (5) approximately equals the time per remote access over the time per local access. Since remote memory access is much slower than local memory access under current technology, the speedup given by Eq. (3) could be considerably larger than the simple analytic model (4). In fact, the slower the remote access is, the larger the difference. For the KSR-1, the time ratio of remote and local access is about 7.5 (see Section 4). Therefore, for p = 32, the cost ratio is 7.3. For any W / \sum_{i=1}^{p} (W_i / i) > 0.14 p, under the assumed remote access ratio, we will have a superlinear speedup.

3 The Generalized Speedup

While parallel computers are designed for solving large problems, a single processor of a parallel computer is not designed to solve a very large problem. A uniprocessor does not have the computing power that the parallel system has. While solving a small problem is inappropriate on a parallel system, solving a large problem on a single processor is not appropriate either. To create a useful comparison, we need a metric that can vary problem sizes for the uniprocessor and for multiple processors. Generalized speedup [17] is one such metric:

    Generalized Speedup = \frac{\text{Parallel Speed}}{\text{Sequential Speed}}.    (6)

Speed is defined as the quotient of work and elapsed time. Parallel speed might be based on scaled parallel work. Sequential speed might be based on the unscaled uniprocessor work. By definition, generalized speedup measures the speed improvement of parallel processing over sequential processing. In contrast, the traditional speedup (2) measures the time reduction of parallel processing. If the problem size (work) for both parallel and sequential processing is the same, the generalized speedup is the same as the traditional speedup. From this point of view, the traditional speedup is a special case of the generalized speedup. For this and for historical reasons, we sometimes call the traditional speedup the speedup, and call the speedup given in Eq. (6) the generalized speedup.

Like the traditional speedup, the generalized speedup can also be further divided into fixed-size, fixed-time, and memory-bounded speedup. Unlike the traditional speedup, for the generalized speedup the scaled problem is solved only on multiple processors. The fixed-time generalized speedup is sizeup [17]. The fixed-time benchmark SLALOM [6] is based on sizeup.

If memory access time is fixed, one might always assume that the uniprocessor cost c_p(s, W) will stabilize after some initial decrease (due to initialization, loop overhead, etc.), assuming the memory is large enough. When cache and remote memory access are considered, the cost will increase when a slower memory has to be accessed. Figure 3 depicts the typical cost variation pattern.

Figure 3. Cost Variation Pattern. (Uniprocessor cost versus problem size: the cost is lowest while the problem fits in cache, stabilizes while it fits in main memory, rises when it only fits in remote memory, and the sequential execution time increases sharply once memory is insufficient.)

From Eq. (1), we can see that uniprocessor speed is the reciprocal of uniprocessor cost. When the cost reaches its lowest value, the speed reaches its highest value. The uniprocessor speed corresponding to the stabilized main memory cost is called the asymptotic speed (of the uniprocessor). Asymptotic speed represents the performance of the sequential processing with efficient memory access. The asymptotic speed is the appropriate sequential speed for Eq. (6). For memory-bounded speedup, the appropriate memory bound is the largest problem size which can maintain the asymptotic speed. After choosing the asymptotic speed as the sequential speed, the corresponding asymptotic cost has only local access and is independent of the problem size. We use c(s, \bar{W}) to denote the corresponding asymptotic cost, where \bar{W} is a problem size which achieves the asymptotic speed. If there is no remote access in parallel processing, as assumed in Section 2, then c(s, \bar{W}) / c_p(p, W) = 1. By Eq. (3), the corresponding speedup equals the simple speedup (4), which does not consider the influence of memory access time. In general, the parallel work W' is not the same as \bar{W}. So we have

    Generalized Speedup = \frac{W' / \sum_{i=1}^{p} W'_i \, c_p(i, W')}{1 / c(s, \bar{W})} = \frac{W' \, c(s, \bar{W})}{\sum_{i=1}^{p} W'_i \, c_p(i, W')}.    (7)

Equation (7) is another form of the generalized speedup. It is a quotient of sequential and parallel time, as is the traditional speedup (2). The difference is that, in Eq. (7), the sequential time is based on the asymptotic speed. When remote memory is needed for sequential processing, c(s, \bar{W}) is smaller than c_p(s, W'). Therefore, the generalized speedup gives a smaller speedup than the traditional speedup.
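In measurement terms, Eq. (6) can be computed directly from observed rates. The sketch below (ours; the speeds are placeholders, not the paper's data) takes the asymptotic speed to be the stabilized main-memory speed of the uniprocessor:

    # Sketch: generalized vs. traditional speedup from measured rates.
    # All speeds in Mflops; the numbers below are illustrative placeholders.

    uni_speed_by_size = {
        "fits_in_cache": 7.0,     # above asymptotic: cache regime
        "fits_in_memory": 5.5,    # stabilized regime: the asymptotic speed
        "needs_remote": 2.7,      # degraded regime: remote accesses dominate
    }
    asymptotic_speed = uni_speed_by_size["fits_in_memory"]

    work_scaled = 4000.0          # Mflop of the scaled (parallel) problem
    t_parallel = 40.0             # seconds on p processors
    parallel_speed = work_scaled / t_parallel            # 100 Mflops

    # Generalized speedup, Eq. (6): no uniprocessor run of the scaled problem.
    gen_speedup = parallel_speed / asymptotic_speed       # about 18.2

    # Traditional speedup divides by the measured (degraded) uniprocessor
    # speed for the scaled problem, and so comes out much larger:
    trad_speedup = parallel_speed / uni_speed_by_size["needs_remote"]  # ~37
    print(gen_speedup, trad_speedup)

The design point is exactly the one argued above: the two metrics differ only in which uniprocessor speed is credited as "100% efficient".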

Parallel efficiency is defined as

    Efficiency = \frac{\text{speedup}}{\text{number of processors}}.    (8)

The generalized efficiency can be defined similarly as

    Generalized Efficiency = \frac{\text{generalized speedup}}{\text{number of processors}}.    (9)

By definition,

    Efficiency = \frac{W \, c_p(s, W)}{p \sum_{i=1}^{p} W_i \, c_p(i, W)}    (10)

and

    Generalized Efficiency = \frac{W' \, c(s, \bar{W})}{p \sum_{i=1}^{p} W'_i \, c_p(i, W')}.    (11)

Equations (10) and (11) show the difference between the two efficiencies. The traditional efficiency assumes that the measured sequential processing achieves one hundred percent efficiency. The generalized efficiency assumes that the sequential processing based on the asymptotic cost achieves one hundred percent efficiency. Traditional speedup compares parallel processing with the measured sequential processing. Generalized speedup compares parallel processing with the sequential processing based on the asymptotic cost. From this point of view, generalized speedup is a reform of traditional speedup. The following propositions are direct results of Eq. (7).

Proposition 1 If c_p(s, W) is independent of problem size, traditional speedup is the same as generalized speedup.

Proposition 2 If the parallel work, W', achieves the asymptotic speed, that is, c_p(s, W') = c(s, \bar{W}), then the fixed-size traditional speedup is the same as the fixed-size generalized speedup.

By Proposition 1, if the simple analytic model (4) is used to analyze performance, there is no difference between the traditional and the generalized speedup. If the problem size is larger than the suggested initial problem size, then the single-processor speedup S_1 may not equal one. S_1 measures the sequential inefficiency due to the difference in memory access.

The generalized speedup is also closely related to the scalability study. Isospeed scalability has been proposed recently in [19]. The isospeed scalability measures the ability of an algorithm-machine combination to maintain the average (unit) speed, where the average speed is defined as the speed over the number of processors. When the system size is increased, the problem size is scaled up accordingly to maintain the average speed. If the average speed can be maintained, we say the algorithm-machine combination is scalable, and the scalability is

    \psi(p, p') = \frac{p' W}{p W'},    (12)

where W' is the amount of work needed to maintain the average speed when the system size has been changed from p to p', and W is the problem size solved when p processors were used.

By definition,

    Average Speed = \frac{W}{p \sum_{i=1}^{p} W_i \, c_p(i, W)}.

Since the sequential cost is fixed in Eq. (11), fixing the average speed is equivalent to fixing the generalized efficiency. Therefore the isospeed scalability can be seen as the iso-generalized-efficiency scalability. When the memory influence is not considered, i.e., c_p(s, W) is independent of the problem size, iso-generalized-efficiency is the same as iso-traditional-efficiency. In this case, the isospeed scalability is the same as the isoefficiency scalability proposed by Kumar [11, 8].

Proposition 3 If the sequential cost c_p(s, W) is independent of problem size, or if the simple analysis model (4) is used for speedup, the isoefficiency and isospeed scalabilities are equivalent to each other.

The following theorem gives the relation between the scalability and the fixed-time speedup.

Theorem 1 Scalability (12) equals one if and only if the fixed-time generalized speedup is unitary.

Proof: Let c(s, \bar{W}), c_p(i, W), W, and W_i be as defined in Eq. (7). If scalability (12) equals 1, let W' and p' be as defined in Eq. (12) and define W'_i similarly to W_i. We have

    p' W = p W'    (13)

for any numbers of processors p and p'. By the definition of generalized speedup,

    G\!S_p = \frac{W \, c(s, \bar{W})}{\sum_{i=1}^{p} W_i \, c_p(i, W)}.

With some arithmetic manipulation, we have

    p' W = G\!S_p \cdot p' \cdot \frac{\sum_{i=1}^{p} W_i \, c_p(i, W)}{c(s, \bar{W})}.

Similarly, we have

    p W' = G\!S_{p'} \cdot p \cdot \frac{\sum_{i=1}^{p'} W'_i \, c_{p'}(i, W')}{c(s, \bar{W})}.

By Eq. (13) and the above two equations,

    G\!S_p \cdot p' \cdot \frac{\sum_{i=1}^{p} W_i \, c_p(i, W)}{c(s, \bar{W})} = G\!S_{p'} \cdot p \cdot \frac{\sum_{i=1}^{p'} W'_i \, c_{p'}(i, W')}{c(s, \bar{W})}.

For fixed-time speedup,

    \sum_{i=1}^{p'} W'_i \, c_{p'}(i, W') = \sum_{i=1}^{p} W_i \, c_p(i, W).

Thus,

    \frac{G\!S_p}{p} = \frac{G\!S_{p'}}{p'}.

For p = 1,

    G\!S_{p'} = p' \, G\!S_1.    (14)

Equation (14) is the corresponding unitary speedup when G\!S_1 is not equal to one. If the work W equals \bar{W}, then G\!S_1 = 1 and Eq. (14) becomes G\!S_{p'} = p', which is the unitary speedup defined in Definition 1.

If the fixed-time generalized speedup is unitary, then for any numbers of processors p and p', and the corresponding problem sizes W and W', where W' is the scaled problem size under the fixed-time constraint, we have

    \frac{W \, c(s, \bar{W})}{\sum_{i=1}^{p} W_i \, c_p(i, W)} = p    and    \frac{W' \, c(s, \bar{W})}{\sum_{i=1}^{p'} W'_i \, c_{p'}(i, W')} = p'.

Therefore,

    \frac{W}{p \sum_{i=1}^{p} W_i \, c_p(i, W)} = \frac{W'}{p' \sum_{i=1}^{p'} W'_i \, c_{p'}(i, W')}.

The average speed is maintained. Also, since

    \sum_{i=1}^{p'} W'_i \, c_{p'}(i, W') = \sum_{i=1}^{p} W_i \, c_p(i, W),

we have the equality

    p' W = p W'.

The scalability (12) equals one. \square
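A small numeric sanity check (our sketch, with a synthetic, fully parallel workload; the constants are arbitrary) illustrates the forward direction of Theorem 1: when work grows linearly with p and the per-unit cost stays at the asymptotic value, the fixed-time generalized speedup is unitary and \psi(p, p') = 1.

    # Sanity check of Theorem 1 on a synthetic workload (ours): all W work
    # units run with p processors active, the elapsed cost per unit is c/p,
    # and the asymptotic sequential cost is c, so G_S_p = p (unitary).

    c = 2.0e-8                    # asymptotic cost: seconds per work unit

    def t_parallel(W, p):
        # t(p) = W * c_p(p, W) with c_p(p, W) = c / p for this workload
        return W * (c / p)

    def gen_speedup(W, p):
        return W * c / t_parallel(W, p)       # Eq. (7) for this workload

    p, W = 4, 1.0e9
    p2 = 16
    W2 = W * p2 / p                           # fixed-time scaling of the work
    assert abs(t_parallel(W, p) - t_parallel(W2, p2)) < 1e-12  # time is fixed
    print(gen_speedup(W, p), gen_speedup(W2, p2))   # ~4.0 and ~16.0: unitary
    psi = (p2 * W) / (p * W2)                 # Eq. (12)
    print(psi)                                # 1.0: scalability equals one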

The following theorem gives the relation between the memory-bounded speedup and the fixed-time speedup. The theorem is for generalized speedup. However, based on Proposition 1, the result is true for traditional speedup when the uniprocessor cost is fixed or the simple analysis model is used.

Theorem 2 If the problem size increases proportionally to the number of processors in memory-bounded scaleup, then the memory-bounded generalized speedup is unitary if and only if the fixed-time generalized speedup is unitary.

Proof: Let c(s, \bar{W}), c_p(i, W), W, and W_i be as defined in Theorem 1. Let W' and W^* be the scaled problem sizes of fixed-time and memory-bounded scaleup respectively, and let W'_i and W^*_i be defined accordingly. If the memory-bounded speedup is unitary, we have

    \frac{W \, c(s, \bar{W})}{\sum_{i=1}^{p} W_i \, c_p(i, W)} = p    and    \frac{W^* \, c(s, \bar{W})}{\sum_{i=1}^{p'} W^*_i \, c_{p'}(i, W^*)} = p'.

Combining the two equations, we have

    \frac{W}{p \sum_{i=1}^{p} W_i \, c_p(i, W)} = \frac{W^*}{p' \sum_{i=1}^{p'} W^*_i \, c_{p'}(i, W^*)}.    (15)

By assumption, W^* is proportional to the number of processors available:

    W^* = \frac{p'}{p} \, W.    (16)

Substituting Eq. (16) into Eq. (15), we get the fixed-time equality:

    \sum_{i=1}^{p'} W^*_i \, c_{p'}(i, W^*) = \sum_{i=1}^{p} W_i \, c_p(i, W).    (17)

That is, W^* = W', and the fixed-time generalized speedup is unitary.

If the fixed-time speedup is unitary, then, following deductions similar to those used for Eq. (15), we have

    \frac{W}{p \sum_{i=1}^{p} W_i \, c_p(i, W)} = \frac{W'}{p' \sum_{i=1}^{p'} W'_i \, c_{p'}(i, W')}.    (18)

Applying the fixed-time equality, Eq. (17), to Eq. (18), we have the reduced equation

    W' = \frac{p'}{p} \, W.    (19)

With the assumption Eq. (16), Eq. (19) leads to

    W' = W^*,

and the memory-bounded generalized speedup is unitary. \square

The following corollary is a direct result of Theorem 1 and Theorem 2.

Corollary 1 If work increases proportionally with the number of processors, then scalability (12)

equals one if and only if the memory-bounded generalized speedup is unitary.

Finally, to complete our discussion of superlinear speedup, there is a new cause of superlinearity for generalized speedup. The new source of superlinear speedup is called profile shifting [6], and is due to the problem size difference between sequential and parallel processing. An application may contain different work types. While the problem size increases, some work types may increase faster than others. When the work types with lower costs increase faster, superlinear speedup may occur. A superlinear speedup due to profile shifting was studied in [6].

7. profile shifting

Figure 4. Causes of Superlinear Speedup: part 3

4 Experimental Results

In this section, we discuss the timing results for solving an application problem on the KSR-1 parallel computer. We first give brief descriptions of the architecture and the application problem, and then present the timing results and analyses.

4.1 The Machine

The machine to be discussed here can be viewed as a combination of (or a compromise between) the distributed- and shared-memory parallel architectures. Their hybrid is called the shared virtual memory architecture. A representative of this category is the new KSR-1 parallel computer from Kendall Square Research. It has distributed physical memory, which makes the system scalable to a large number of processors, and a shared address space, which provides users a shared-memory-like programming environment.

Figure 5 shows the architecture of the KSR-1 parallel computer [9]. Each processor on the KSR-1 has 32 Mbytes of local memory. The CPU is a super-scalar processor with a peak performance of 40 Mflops in double precision. Processors are organized into different rings. The local ring (ring:0) can connect up to 32 processors, and a higher-level ring of rings (ring:1) can contain up to 34 local rings with a maximum of 1088 processors. If a non-local data element is needed, the local search engine (SE:0) will search the processors in the local ring (ring:0). If the search engine SE:0 cannot locate the data element within the local ring, the request will be passed to the search engine at the next level (SE:1) to locate the data.

Figure 5. Configuration of KSR-1 parallel computers. (ring:1 connects up to 34 ring:0's; each ring:0 connects up to 32 processors. P: processor; M: 32 Mbytes of local memory.)

This is done automatically by a hierarchy of search engines connected in a fat-tree-like structure [9, 12]. The memory hierarchy of the KSR-1 is shown in Fig. 6. Each processor has 512 Kbytes of fast subcache, which is similar to the normal cache on other parallel computers. This subcache is divided into two equal parts: an instruction subcache and a data subcache. The 32 Mbytes of local memory on each processor is called a local cache. A local ring (ring:0) with up to 32 processors can have 1 Gbyte total of local cache, which is called Group:0 cache. Access to the Group:0 cache is provided by Search Engine:0. Finally, a higher-level ring of rings (ring:1) connects up to 34 local rings with 34 Gbytes of total local cache, which is called Group:1 cache. Access to the Group:1 cache is provided by Search Engine:1. The entire memory hierarchy is called ALLCACHE memory by Kendall Square Research. Access by a processor to the ALLCACHE memory system is accomplished by going through different Search Engines, as shown in Fig. 6.

Figure 6. Memory hierarchy of KSR-1. (Processor: 512 KB subcache; 32 MB local cache; Search Engine:0: 1 GB Group:0 cache; Search Engine:1: 34 GB Group:1 cache.)

The latencies for the different memory locations [10] are: 2 cycles for the subcache, 20 cycles for the local cache, 150 cycles for Group:0 cache, and 570 cycles for Group:1 cache.
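To see how these latencies translate into the uniprocessor cost of Section 2, a back-of-the-envelope sketch (ours; the reference mixes are hypothetical) weights each level's latency by the fraction of references it serves:

    # Sketch: average memory cost per reference on a KSR-1-like hierarchy.
    # Latencies (cycles) from the text; the reference mixes are hypothetical.

    LAT = {"subcache": 2, "local": 20, "group0": 150, "group1": 570}

    def avg_cycles(mix):
        # mix: fraction of references served at each level (sums to 1)
        return sum(LAT[level] * frac for level, frac in mix.items())

    in_core = {"subcache": 0.95, "local": 0.05, "group0": 0.0, "group1": 0.0}
    spilled = {"subcache": 0.95, "local": 0.02, "group0": 0.03, "group1": 0.0}

    print(avg_cycles(in_core))    # 2.9 cycles: problem fits in local memory
    print(avg_cycles(spilled))    # 6.8 cycles: a little data is remote

Even a 3% spill into Group:0 cache more than doubles the average cost per reference in this model, which is precisely the uniprocessor degradation that inflates the traditional speedup. Note also that the 150/20 ratio of Group:0 to local-cache latency is the remote-to-local factor of 7.5 used in Section 2.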

4.2 The Application

Least squares problems are frequently encountered in scientific and engineering applications. The major work in solving a least squares problem is to solve the normal equation

    A^T A x = A^T b    (20)

by orthogonal factorization schemes (Householder transformations and Givens rotations). Efficient Householder algorithms have been discussed in [3] for shared-memory supercomputers, and in [16] for distributed-memory parallel computers.

In many cases, for instance the inverse problem of partial differential equations [2], the normal equation system resulting from the discretization is too ill-conditioned to be solved directly. Tikhonov's regularization method [20] is frequently used in this case to increase numerical stability. The key step in this process is to introduce a regularization factor \mu > 0. Instead of solving (20) directly, we solve the following system for x:

    (A^T A + \mu I) x = A^T b.    (21)

Eq. (21) can also be written as

    \begin{pmatrix} A^T & \sqrt{\mu} I \end{pmatrix} \begin{pmatrix} A \\ \sqrt{\mu} I \end{pmatrix} x = \begin{pmatrix} A^T & \sqrt{\mu} I \end{pmatrix} \begin{pmatrix} b \\ 0 \end{pmatrix},    (22)

or

    B^T B x = B^T \begin{pmatrix} b \\ 0 \end{pmatrix}.    (23)
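The equivalence of (21) and (23), with B obtained by stacking A on top of \sqrt{\mu} I as in the structure (24) below, can be checked in a few lines (our NumPy sketch; the small random A, b and the value of \mu are arbitrary test data):

    # Sketch: verify that B = [A; sqrt(mu)*I] turns Eq. (21) into the
    # ordinary normal equation (23). A, b, mu below are arbitrary test data.
    import numpy as np

    m, n, mu = 12, 6, 1e-2          # m of the same order as n, mu > 0
    rng = np.random.default_rng(0)
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    B = np.vstack([A, np.sqrt(mu) * np.eye(n)])       # structure (24)
    rhs = np.concatenate([b, np.zeros(n)])

    # B^T B == A^T A + mu*I, so (23) reproduces the regularized system (21)
    assert np.allclose(B.T @ B, A.T @ A + mu * np.eye(n))

    x = np.linalg.lstsq(B, rhs, rcond=None)[0]        # QR-based solve of (23)
    assert np.allclose((A.T @ A + mu * np.eye(n)) @ x, A.T @ b)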

The major task is thus to carry out the QR factorization of the matrix B, which has the structure

    B = \begin{pmatrix}
          a^{(1)}_{11} & a^{(1)}_{12} & \cdots & a^{(1)}_{1n} \\
          \vdots       & \vdots       &        & \vdots       \\
          a^{(1)}_{m1} & a^{(1)}_{m2} & \cdots & a^{(1)}_{mn} \\
          \sqrt{\mu}   &              &        &              \\
                       & \sqrt{\mu}   &        &              \\
                       &              & \ddots &              \\
                       &              &        & \sqrt{\mu}
        \end{pmatrix},    (24)

where we usually have m \ge n, with m of the same order as n. Matrix B is neither a completely full matrix nor a sparse matrix. The upper part is full and the lower part is sparse (in diagonal form). Because of the special structure in (24), not all elements in the matrix are affected in a particular transformation step. In the first step, all elements within the frame in matrix (24) will be affected. In each new step, the frame in (24) will shift downwards one row, with the leftmost column out of the game. Therefore, at the i-th step, the submatrix B_i affected in the transformation has the form

    B_i = \begin{pmatrix}
            a^{(i)}_{i,i}     & \cdots & a^{(i)}_{i,n}     \\
            \vdots            &        & \vdots            \\
            a^{(i)}_{m+i-1,i} & \cdots & a^{(i)}_{m+i-1,n} \\
            \sqrt{\mu}        &        &
          \end{pmatrix}.    (25)

If the columns of matrix B_i of (25) are denoted by b_j, i.e.,

    B_i = [\, b_i \; b_{i+1} \; \cdots \; b_n \,],    (26)

then the Householder transformation can be described as:

Householder Transformation
    initialize matrix B
    for i = 1, n
        1:  \alpha_i = -\mathrm{sign}(a^{(i)}_{ii}) \, (b_i^T b_i)^{1/2}
        2:  w_i = b_i - \alpha_i e_1
        3:  \beta_j = w_i^T b_j / (\alpha_i^2 - a^{(i)}_{ii} \alpha_i),   j = i+1, ..., n
        4:  b_j = b_j - \beta_j w_i,   j = i+1, ..., n
    end for

The calculation of the \beta_j's and the updating of the b_j's can be done in parallel for different indices j.
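A compact runnable rendering of steps 1-4 is given below (our Python sketch; unlike the paper's KSR Fortran implementation it works on a dense copy of B and takes no advantage of the \sqrt{\mu} diagonal block, so it only illustrates the arithmetic):

    # Sketch of steps 1-4 above, applied to a dense copy of B (structure (24)).
    import numpy as np

    def householder_qr(B):
        R = B.astype(float).copy()
        m2, n = R.shape                       # m2 = m + n for structure (24)
        for i in range(n):
            b_i = R[i:, i]
            alpha = -np.copysign(np.linalg.norm(b_i), b_i[0])   # step 1
            w = b_i.copy()
            w[0] -= alpha                     # step 2: w = b_i - alpha * e_1
            denom = alpha * alpha - R[i, i] * alpha
            for j in range(i + 1, n):         # steps 3-4; parallel over j
                beta = (w @ R[i:, j]) / denom
                R[i:, j] -= beta * w
            R[i, i], R[i + 1:, i] = alpha, 0.0   # column i is now eliminated
        return R                              # the triangular factor of B = QR

    A = np.arange(1.0, 13.0).reshape(4, 3); mu = 0.1
    B = np.vstack([A, np.sqrt(mu) * np.eye(3)])
    R = householder_qr(B)
    assert np.allclose(R.T @ R, B.T @ B)      # R^T R = B^T B up to roundoff

The inner loop over j is the parallel dimension the paper refers to: each processor can update its own block of columns independently.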

4.3 Timing Results

The numerical experiments reported here were conducted on the KSR-1 parallel computer installed at the Cornell Theory Center. There are 128 processors altogether on the machine. During the period when our experiments were performed, however, the computer was configured as two standalone machines with 64 processors each. Therefore, the numerical results were obtained using fewer than 64 processors.

Figure 7 shows the traditional fixed-size speedup curves obtained by solving the regularized least squares problem with different matrix sizes n. The matrix is of dimensions 2n \times n. We can see clearly that as the matrix size n increases, the speedup gets better and better. For the case when n = 2048, the speedup is 76 on 56 processors. Although it is well known that on most parallel computers the speedup improves as the problem size increases, what is shown in Fig. 7 is certainly too good to be a reasonable measurement of the real performance of the KSR-1.

The problem with the traditional speedup is that it is defined as the ratio of the sequential time to the parallel time used for solving the same fixed-size problem. The complex memory hierarchy on the KSR-1 makes the computational speed of a single processor highly dependent on the problem size. When the problem is so big that not all data of the matrix can be put in the local memory (32 Mbytes) of the single computing processor, part of the data must be put in the local memory of other processors on the system. These data are accessed by the computing processor through Search Engine:0. As a result, the computational speed on a single processor slows down significantly due to the high latency of Group:0 cache. The sustained computational speed on a single processor is 5.5 Mflops, 4.5 Mflops, and 2.7 Mflops for problem sizes 1024, 1600, and 2048, respectively. On the other hand, with multiple processors, most of the data needed are in the local memory of each processor, so the computational speed suffers less from the high Group:0 cache latency.

Figure 7. Fixed-size (Traditional) Speedup on KSR-1. (Measured speedup versus number of processors, up to 56, for n = 1024, 1600, and 2048, with the ideal speedup shown for reference.)

Therefore, the excellent speedups shown in Fig. 7 are the result of significant uniprocessor performance degradation when a large problem is solved on a single processor. Figure 8 shows the measured single-processor speed as a function of the problem size n. The Householder transformation algorithm given before was implemented in KSR Fortran. The algorithm has a numerical complexity of W = 2n^3 + 8.5n^2 + 26.5n, and the speed is calculated using s = W/t, where t is the CPU time used to finish the computation. As can be seen from Fig. 8, the three segments represent significantly different speeds for different matrix sizes. When the whole matrix can fit into the subcache, the performance is close to 7 Mflops. The speed decreases to around 5.5 Mflops when the matrix cannot fit into the subcache but can still be accommodated in the local cache. Note, however, that when the matrix is so big that access to Group:0 cache through Search Engine:0 is needed, the performance degrades significantly, and there is no clear stable performance level as can be observed in the other two segments. This is largely due to the high Group:0 cache latency and the contention for the Search Engine, which is used by all processors on the machine. Therefore, the access time of Group:0 cache is less uniform compared to that of the subcache and local cache.

To take the difference of single-processor speeds for different problem sizes into consideration, we have to use the generalized speedup to measure the performance of multiple processors on the KSR-1. As can be seen from the definition in Eq. (6), the generalized speedup is defined as the ratio of the parallel speed to the asymptotic sequential speed, where the parallel speed is based on a scaled problem.

Figure 8. Speed Variation of Uniprocessor Processing on KSR-1. (Measured single-processor speed in Mflops versus the order of the matrices, with three segments: subcache, all cache, and remote memory.)

In our numerical tests, the parallel problem was scaled in a memory-bounded fashion as the number of processors increases. The initial problem size was selected based on the asymptotic speed (5.5 Mflops from Fig. 8) and then scaled proportionally according to the number of processors; i.e., with p processors, the problem is scaled to a size that will fill M p Mbytes of memory, where M is the memory required by the unscaled problem.

Figure 9 shows the comparison of the traditional scaled speedup and the generalized speedup. For the traditional scaled speedup, the scaled problem is solved on both one and p processors, and the value of the speedup is calculated as the ratio of the time on one processor to that on p processors. For the generalized speedup, the scaled problem is solved only on multiple processors, not on a single processor. The value of the speedup is calculated using Eq. (6), where the asymptotic speed is used for the sequential speed.

Figure 9. Comparison of Generalized and Traditional Speedup on KSR-1. (Speedup versus number of processors, with the ideal speedup shown for reference.)

It is clear from Fig. 9 that the generalized speedup gives a much more reasonable performance measurement on the KSR-1 than does the traditional scaled speedup. With the traditional scaled speedup, the speedup is above 20 with only 10 processors. This excellent superlinear speedup is a result of the severely degraded single-processor speed, rather than of perfect scalability of the machine and the algorithm.
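The memory-bounded scaling rule above is easy to make concrete (our sketch; the storage model and the base size are illustrative assumptions, not the paper's exact accounting):

    # Sketch: memory-bounded scaling of the 2n-by-n least squares problem.
    # Assumed storage model: the dominant term is the 2n x n matrix of
    # 8-byte doubles, i.e. 16*n**2 bytes, so memory grows like n**2.

    def scaled_size(n0, p):
        # Filling p times the memory of the unscaled problem:
        # 16*n**2 = p * 16*n0**2  =>  n = n0 * sqrt(p)
        return int(n0 * p ** 0.5)

    n0 = 1024                     # unscaled size, in the asymptotic regime
    for p in (1, 4, 16, 64):
        print(p, scaled_size(n0, p))   # 1024, 2048, 4096, 8192

Under this model the work W ~ 2n^3 grows like p^{3/2} while memory grows like p, the usual behavior of memory-bounded scaleup for dense O(n^3) algorithms [18].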

5 Conclusion

Since the scaled-up principle was proposed in 1988 by Gustafson and other researchers at Sandia National Laboratory [5], the principle has been widely used in performance measurement of parallel algorithms and architectures. One difficulty of measuring scaled speedup is that very large problems have to be solved on a uniprocessor, which is very inefficient if virtual memory is supported, and impossible otherwise. To overcome this shortcoming, generalized speedup was proposed and studied by Gustafson and Sun [17]. Generalized speedup is defined as parallel speed over sequential speed and does not require solving large problems on a uniprocessor. The study [17] emphasized the fixed-time generalized speedup, sizeup. To meet the needs of the emerging shared virtual memory machines, the generalized speedup, and particularly its measurement issues, has been carefully studied in the current research. It has been shown that traditional speedup is a special case of generalized speedup and, on the other hand, that generalized speedup is a reform of traditional speedup. The main difference between generalized speedup and traditional speedup is how to define the uniprocessor efficiency. When the uniprocessor speed is fixed, these two speedups are the same. Extending these results to the scalability study, we have found that the difference between isospeed scalability [19] and isoefficiency scalability [11] is also due to the uniprocessor efficiency. When the uniprocessor speed is independent of the problem size, these two proposed scalabilities are the same. As part of the performance study, we have shown that an algorithm-machine combination achieves a perfect scalability if and only if it achieves a perfect speedup. Seven causes of superlinear speedup are also listed.

A scientific application has been implemented on a Kendall Square KSR-1 shared virtual memory machine. Experimental results show that uniprocessor efficiency is an important issue for virtual memory machines, and that the asymptotic speed provides a reasonable way to define the uniprocessor efficiency. The results in this paper on shared virtual memory can be extended to general parallel computers. Since uniprocessor efficiency is directly related to parallel execution time, scalability, and benchmark evaluations, the range of applicability of the uniprocessor efficiency study is wider than speedups. The uniprocessor efficiency might be explored further in a number of contexts.

Acknowledgement

The authors are grateful to the Cornell Theory Center for providing access to its KSR-1 parallel computer.

References

[1] Amdahl, G. Validity of the single-processor approach to achieving large scale computing capabilities. In Proc. AFIPS Conf. (1967), pp. 483-485.

[2] Chen, Y. M., Zhu, J. P., Chen, W. H., and Wasserman, M. L. GPST inversion algorithm for history matching in 3-D 2-phase simulators. In IMACS Trans. on Scientific Computing I (1989), pp. 369-374.

[3] Dongarra, J., Duff, I. S., Sorensen, D. C., and van der Vorst, H. A. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, 1991.

[4] Gustafson, J. Reevaluating Amdahl's law. Communications of the ACM 31 (May 1988), 532-533.

[5] Gustafson, J., Montry, G., and Benner, R. Development of parallel methods for a 1024-processor hypercube. SIAM J. of Sci. and Stat. Computing 9, 4 (July 1988), 609-638.

[6] Gustafson, J., Rover, D., Elbert, S., and Carter, M. The design of a scalable, fixed-time computer benchmark. J. of Parallel and Distributed Computing 12, 4 (1991), 388-401.

[7] Helmbold, D., and McDowell, C. Modeling speedup(n) greater than n. In Proc. of the 1989 Int'l Conf. on Parallel Processing, Vol. III (1989), pp. 219-225.

[8] Hwang, K. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill Book Co., 1993.

[9] Kendall Square Research. KSR parallel programming. Waltham, USA, 1991.

[10] Kendall Square Research. KSR technical summary. Waltham, USA, 1991.

[11] Kumar, V., and Gupta, A. Analysis of scalability of parallel algorithms and architectures: A survey. In Proc. of 1991 Int'l Conf. on Supercomputing (June 1991), pp. 396-405.

[12] Leiserson, C. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Transactions on Computers 34, 10 (1985), 892-901.

[13] Nicol, D. Inflated speedups in parallel simulations via malloc(). International Journal on Simulation 2 (Dec. 1992), 413-426.

[14] Ortega, J., and Voigt, R. Solution of partial differential equations on vector and parallel computers. SIAM Review (June 1985), 149-240.

puters. Snce unprocessor ecency s drectly related to parallel executon tme, scalablty, and benchmark evaluatons, the range of applcablty of the unprocessor ecency study s wder than speedups. The unprocessor ecency mght be explored further n a number of contexts. Acknowledgement The authors are grateful to the Cornell Theory Center for provdng access to ts KSR-1 parallel computer. References [1] Amdahl, G. Valdty of the sngle-processor approach to achevng large scale computng capabltes. In Proc. AFIPS Conf. (1967), pp. 483{485. [2] Chen, Y. M., Zhu, J. P., Chen,. H., and asserman, M. L. GPST nverson algorthm for hstory matchng n 3-d 2-phase smulators. In IMACS Trans. on Scentc Computng I (1989), pp. 369{374. [3] Dongarra, J., Duff, I. S., Sorensen, D. C., and van der Vorst, H. A. Solvng Lnear Systems on Vector and Shared Memory Computers. SIAM, Phladelpha, 1991. [4] Gustafson, J. Reevaluatng Amdahl's law. Communcatons of the ACM 31 (May 1988), 532{533. [5] Gustafson, J., Montry, G., and Benner, R. Development of parallel methods for a 124-processor hypercube. SIAM J. of Sc. and Stat. Computng 9, 4 (July 1988), 69{638. [6] Gustafson, J., Rover, D., Elbert, S., and Carter, M. The desgn of a scalable, xedtme computer benchmark. J. of Parallel and Dstrbuted Computng 12, 4 (1991), 388{41. [7] Helmbold, D., and McDowell, C. Modelng speedup(n) greater than n. In Proc. of the 1989 Int'l Conf. on Parallel Processng, Vol. III (1989), pp. 219{225. [8] Hwang, K. Advanced Computer Archtecture: Parallelsm, Scalablty, Programmablty. McGraw-Hll Book Co., 1993. [9] Kendall Square Research. KSR parallel programmng. altham, USA, 1991. [1] Kendall Square Research. KSR techncal summary. altham, USA, 1991. [11] Kumar, V., and Gupta, A. Analyss of scalablty of parallel algorthms and archtectures: A survey. In Proc. of 1991 Int'l Conf. on Supercomputng (June 1991), pp. 396{45. [12] Leserson, C. Fat-trees: Unversal networks for hardware-ecent supercomputng. IEEE Transactons on Computng 34, 1 (1985), 892{91. [13] Ncol, D. Inated speedups n parallel smulatons va malloc(). Internatonal Journal on Smulaton 2 (Dec. 1992), 413{426. [14] Ortega, J., and Vogt, R. Soluton of partal derental equatons on vector and parallel computers. SIAM Revew (June 1985), 149{24. 2

[15] Parknson, D. Parallel ecency can be greater than unty. Parallel Computng 3 (1986), 261{262. [16] Pothen, A., and Raghavan, P. Dstrbuted orthogonal factorzaton: Gvens and Householder algorthms. SIAM J. of Sc. and Stat. Computng 1 (1989), 1113{1135. [17] Sun, X.-H., and Gustafson, J. Toward a better parallel performance metrc. Parallel Computng 17 (Dec 1991), 193{119. [18] Sun, X.-H., and N, L. Scalable problems and memory-bounded speedup. J. of Parallel and Dstrbuted Computng 19 (Sept. 1993), 27{37. [19] Sun, X.-H., and Rover, D. Scalablty of parallel algorthm-machne combnatons. IEEE Transactons on Parallel and Dstrbuted Systems (1994). to appear. [2] Tkhnov, A. N., and Arsenn, V. Soluton of Ill-posed Problems. John ley and Sons, 1977. 21