Parametric Micro-level Performance Models for Parallel Computing


Computer Science Technical Reports, Computer Science, 12-5-1994

Youngtae Kim, Iowa State University
Mark Fienup, Iowa State University
Jeffrey S. Clary, Iowa State University
Suresh C. Kothari, Iowa State University

Part of the Systems Architecture Commons, and the Theory and Algorithms Commons

Recommended Citation
Kim, Youngtae; Fienup, Mark; Clary, Jeffrey S.; and Kothari, Suresh C., "Parametric Micro-level Performance Models for Parallel Computing" (1994). Computer Science Technical Reports.

This Article is brought to you for free and open access by the Computer Science at Iowa State University Digital Repository. It has been accepted for inclusion in Computer Science Technical Reports by an authorized administrator of Iowa State University Digital Repository. For more information, please contact digirep@iastate.edu.


TR94-23
December 5, 1994
Iowa State University of Science and Technology
Department of Computer Science
226 Atanasoff
Ames, IA 50011

Youngtae Kim, Mark Fienup, Jeffrey S. Clary, Suresh C. Kothari
Department of Computer Science
Iowa State University
Ames, Iowa

Abstract

Parametric micro-level (PM) performance models are introduced to address the important issue of how to realistically model parallel performance. These models can be used to predict execution time, identify performance bottlenecks, and compare machines. The accurate prediction and analysis of execution time is achieved by incorporating precise details of interprocessor communication, memory operations, auxiliary instructions, and effects of communication and computation schedules. Parameters are used for flexibility to study various algorithmic and architectural issues. The development and verification process, parameters and the scope of applicability of these models are discussed. A coherent view of performance is obtained from the execution profiles generated by PM models. The models are targeted at a large class of numerical algorithms commonly implemented on both SIMD and MIMD machines. Specific models are presented for matrix multiplication, LU decomposition, and FFT on a 2-D processor array with distributed memory. A case study is done on MasPar MP-1 and MP-2 machines to validate PM models and demonstrate their utility.

Keywords: Performance Models, Parallel Computing, Numerical Algorithms, Memory Access Optimization

1 Introduction

How to model parallel computation has been an important topic of research in high-performance computing. Performance models have been extensively investigated through theoretical and empirical studies. One important issue is how to make models realistic. The papers [19, 3] discuss shortcomings of earlier theoretical research, and propose new models called BSP and LogP for parallel computation. An important aspect of both models is the incorporation of communication parameters which were ignored in earlier theoretical research. The studies [2, 7, 8, 18] address several pragmatic issues and provide insight into important attributes of parallel performance. A good introduction to performance and scalability of parallel systems is provided in recent books [10, 12].

This paper is about parametric micro-level (PM) performance models for parallel computation. While BSP and LogP models [19, 3] focus on what is a realistic abstraction for modeling parallel performance, our emphasis is on pragmatic models to accurately predict and analyze execution time. Our goal is to develop performance models that can be actually used to predict performance on existing and future generation machines, compare machines, and facilitate efficient implementation of algorithms by identifying performance bottlenecks. To develop such models, we adopt a micro-level approach which incorporates precise details of interprocessor communication, memory operations, miscellaneous overheads due to auxiliary instructions, and effects of communication and computation schedules.

Execution time can be predicted by fitting timing curves to experimental data, as discussed in [8]. The basic approach is to determine an algebraic expression for the fitting formula by analysis of algorithms and then determine the coefficients by experiments. This approach is closely aligned with our goal; it can accurately predict execution time. A fitting formula expresses execution time as a function of problem size and number of processors. It does not describe how architectural parameters affect performance. Also, it is not possible to identify performance bottlenecks using the fitting formula. We address these shortcomings with PM models. First, instead of predicting execution time as a scalar quantity, PM models predict a vector that represents significant components of execution time. This is useful for analysis of performance. Secondly, the formulas are parametric. Architectural and algorithmic parameters are incorporated as variables. The parameters provide flexibility to study a variety of architectural and algorithmic issues. For example, the impact of changing processor speed, communication speed, or memory access speed can be studied by varying the parameters of the model.

A tradeoff is to be expected between realistic modeling and its applicability in absence of specific information about the parallel algorithm or the architecture. It is desirable that performance models are not unnecessarily specific with respect to algorithms and architectures. Models need to be designed with a set of parameters applicable to a wide class of parallel algorithms and architectures. Specifics enter into the picture when parameter values have to be determined. There is an example in [3] where two implementations of FFT are considered. The experimental results show a dramatic difference in communication cost of those two implementations. If a model is to predict the difference, it is inevitable that details of the implementation of algorithms have to be considered. In order to accommodate these conflicting requirements, our approach is to design the parameters and the process of model development with general applicability in mind, and follow it up with complete examples of models which get into specifics.

We consider execution time as the principal measure of performance. The models generate execution profiles to provide a picture of how computation, memory operations, communication, and miscellaneous overheads together account for the total execution time. The execution profiles can be used to view the performance in different ways. Other metrics such as speedup, efficiency, and MFLOPS are defined on the basis of execution profiles. It is well known that performance metrics can provide different and sometimes misleading views of performance [8, 9]. We correlate various performance metrics to provide a coherent view of parallel performance.

PM models are appropriate for a large class of data parallel numerical algorithms, described later in the paper. This class is of interest since it includes a large number of algorithms encompassing many of the scientific and engineering applications. These algorithms are typically implemented on MIMD machines, but many of them can also be implemented quite efficiently on SIMD machines.

Separate PM models are needed for different algorithms. Each model includes a complete representation of the parallel algorithm, determined by key parts of the algorithm. As concrete illustrations, we present models for matrix multiplication, LU decomposition, and fast Fourier transform (FFT), all implemented on a 2-D processor array. These algorithms are of considerable interest in practice; individually, they have been used as examples in many empirical and theoretical studies [19, 3, 1, 16, 6]. Together, the algorithms represent varying degrees of computation, communication, and memory requirements, and serve well as test cases.

PM models are validated, and their utility is demonstrated in a case study on MasPar MP-1 and MP-2. Two implementations of each algorithm are studied to illustrate the analysis and impact of memory operations. The case study provides interesting examples of how architectural differences affect performance. For example, the choice between a small number of powerful processors or a large number of less powerful processors is often a point of debate in parallel computing. To study this issue, we present a concrete example of performance comparison of 16K processor MP-1 and 4K processor MP-2 using three algorithms with different computational characteristics.

The models are discussed in Section 2, the performance analysis is described in Section 3, a case study is presented in Section 4, and conclusions are in Section 5.

2 Parametric Micro-level Models

In this section, the model development and verification process is described using the examples of three parallel algorithms on a 2-D processor array. Parameters of the model and its applicability are also discussed.

2.1 Model Development

Each PM model is based on a precise analytical formula that captures essential operations of a given parallel algorithm. The formula has four components to predict the execution time as a vector. These components are computation time, communication time, memory access time, and the time for auxiliary instructions. Architectural parameters of the model are determined by experimental measurements. In hypothetical cases such as the study of a futuristic machine, the parameters are extrapolated. We will first provide an overview of model development and follow it with details.

2.1.1 Overview

The development process can be described as follows:

Step 1: Derive analytical formulas $f_{comp}$, $f_{comm}$, and $f_{mem}$ for parts of the execution time for computation, communication, and memory operations respectively.

Step 2: Do experimental measurements of sample cases to determine model parameters and also the time for computation, communication, accessing the memory, and the miscellaneous time for auxiliary instructions.

Step 3: Select the template for regression analysis to estimate the miscellaneous overhead time. Determine the regression coefficients based on experimentally measured values. The regression formula for miscellaneous overhead time is denoted by $f_{misc}$.

Step 4: Based on the experimental measurements, modify the analytical expressions $f_{mem}$ and $f_{comm}$ so that the predictions match with experimental timings determined in Step 2. The modifications to $f_{mem}$ are done to take into account cache effects and overlap of memory accesses with other operations. The modifications to $f_{comm}$ are done to take into account overlap of communication with computation.

Step 5: Finally, the following formula is obtained to predict the execution time:

$$f_{comp} + f_{comm} + f_{misc} + f_{mem}$$

2.1.2 Details

The analytical formulas are given for the three parallel algorithms in Appendix A. In analyzing practical scenarios for parallel machines, the lower order terms can be significant. These formulas are carefully derived by examining the parallel algorithm to capture all its essential details. The formulas are complex, but the advantage is that the performance predictions are very accurate.

The three algorithms used in the study are well-known. The LU decomposition is described in [5]. The details of the FFT algorithm can be found in [4]. Cannon's parallel algorithm is described in [12]. The LU decomposition uses a 2-D scattered data layout for the coefficient matrix, and it includes partial pivoting.

Different communication patterns are used by the three algorithms. The matrix multiplication uses nearest-neighbor communication where elements are shifted from one processor to the next along either a row or a column with wrap-around at the end. In case of the LU decomposition, communication is needed for pivoting and for broadcasting a pivot row and a multiplier column. A one-to-all broadcast is used along either a row or a column of processors. To implement butterfly operations, the FFT algorithm requires communication between processors in a row or a column where the distance between the communicating processors is a power of two.

Depending on whether the routing is pipelined or non-pipelined, the cost of a communication operation varies. Table 1 summarizes different communication schemes and their costs, and it also lists costs on MasPar machines. The Xnet[d] primitive on MasPar is a version of non-pipelined routing, and the Xnetp[d] and Xnetc[d] are for pipelined routing, where d is distance. Typically, to send a large message from one processor to another, multiple individual messages may be required. There may also be a limit on the number of messages that can be pipelined together. On MasPar, each message has to be either one, four or eight bytes, and it has to be loaded in a register first. The pipelining is done at the bit level for each message. In our case study, single precision arithmetic is used, and the messages are four bytes each. The communication cost formulas in Table 1 are simplified in accordance with [15] to show the costs on MasPar when the message size is four bytes.

Table 1: Communication Costs

General description:
    Routing scheme     Communication cost
    Pipelined          $T_X + d\,T_{Xp} + k\,T_{Xt}$
    Non-pipelined      $T_X + d\,k\,T_{Xt}$

MasPar specific description (communication cost for a 32-bit message):
    Primitive          Scheme           MP-1      MP-2
    Xnetp[d]           Pipelined        58 + d    48 + d
    Xnetc[d] (Copy)    Pipelined        84 + d    48 + d
    Xnet[d]            Non-pipelined    [two rows by distance d; conditions and costs not legible in this copy]

$T_X$: startup time; $T_{Xp}$: time to fill the pipeline; $T_{Xt}$: transmission time; d: distance; k: number of messages.

Examples are cited in [8] to point out that simple overhead-type operations should not be neglected, no matter how trivial they may seem. PM models consider miscellaneous overheads arising from auxiliary instructions to implement loops in the machine language, register moves, etc. A regression formula is used to predict the miscellaneous overhead time. The template for the regression formula is determined by examining the loop structure of the parallel program. The templates for the three algorithms are listed in Table 2. A simple algebraic manipulation of templates shows that miscellaneous overhead is a function of two variables: the local problem size and the number of processors. The five regression coefficients shown in Table 2 (indexed 0 through 4) are determined on the basis of experimental measurements of sample cases with different local sizes of problems and using 1K and 4K processors.

Table 2: Regression Templates for Miscellaneous Overhead

[The regression templates $f^{MM}_{misc}$, $f^{LU}_{misc}$, and $f^{FFT}_{misc}$ and their coefficient values for MP-1 and MP-2 are not legible in this copy.]

For $f^{MM}_{misc}$ and $f^{LU}_{misc}$: P × P is the processor array size, N × N is the matrix size, and M × M is the local problem size per processor (M = N/P).
For $f^{FFT}_{misc}$: P × P is the processor array size, N is the number of elements, and M is the local problem size per processor (M = N/P^2).

The architecture parameters include individual timings for floating point instructions, communication primitives, and LOAD and STORE operations. It is assumed that memory accesses are only through LOAD and STORE instructions. The architecture parameters can be obtained from the machine manuals, but it is a good idea to actually measure these timings. The architecture parameters are listed in Table 3 along with the values for the MasPar MP-1 and MP-2 machines. The other parameters include problem size, PE array size, and the timing for algorithm specific primitives such as computing the twiddle factor for FFT.

Table 3: Architecture Parameters

[The MP-1 and MP-2 cycle counts are not legible in this copy.]

    Parameter        Operation
    $T_{load}$       Load
    $T_{store}$      Store
    $T_{mult}$       Floating Point Multiply
    $T_{div}$        Floating Point Division
    $T_{add}$        Floating Point Addition
    $T_{neg}$        Floating Point Negation
    $T_{cmp}$        Floating Point Comparison
    $T_{twiddle}$    Twiddle Factor Calculation for FFT

2.2 Verification of Models

A number of features are built into the models to ensure that the execution times are predicted accurately. First, the precise details of computation, communication, memory operations, and miscellaneous overheads are included in the models. Secondly, the model parameters are carefully determined by experiments. However, PM models are complex, and it is important to verify each model systematically. The procedure for such a verification is described here. This procedure was used in our case study to verify the models on the MasPar MP-1 and MP-2 machines.

We describe the necessary experimental measurements to be obtained by running the parallel program for sample problem sizes. The experimental measurements include: (i) total execution time ($T_{exe}$), (ii) computation time ($T_{comp}$), (iii) communication time ($T_{comm}$), (iv) miscellaneous overhead time ($T_{misc}$), and (v) the time for memory operations ($T_{mem}$). The experimental measurements for (ii), (iii), and (iv) were done after deleting appropriate instructions from the compiler generated assembly code. First, $T_{misc}$ is measured by deleting all the computation, communication plus the associated LOAD and STORE instructions. Next, only the communication and the memory instructions are omitted, and the computation time ($T_{comp}$) is determined by subtracting $T_{misc}$ from the resulting execution time. Finally, only the memory instructions are omitted, and the communication time ($T_{comm}$) is determined by subtracting $T_{comp} + T_{misc}$ from the resulting execution time. The time for memory operations is based on the previous measurements using the equation

$$T_{mem} = T_{exe} - T_{comp} - T_{comm} - T_{misc}.$$

The accuracy of models is based on the following observations:

- The computation and communication timings predicted by the analytical formulas $f_{comp}$ and $f_{comm}$ are checked individually with experimental values $T_{comp}$ and $T_{comm}$.
- Only a part of the experimental data is used to determine the regression coefficients, and the remaining data is used as the test data to verify the regression formula.
- The memory model is checked separately.

Experimental measurements are sometimes tricky, especially due to the fact that overlaps have to be taken into account. In some cases, we had to modify the assembly code to get the experimental data since the compiler introduced major transformations into the code and making changes in the high-level language did not produce the effect we wanted. For example, this was the case in an instance where we wanted to selectively omit certain instructions to measure their effect. There may be problems arising from data dependencies where omitting certain instructions can have side effects. For example, omitting a LOAD can cause the subsequent division instruction to raise a division-by-zero exception. These issues have to be addressed in experimental procedures. Our experience is that a LOAD-STORE architecture makes experimental procedures simpler; it at least avoids complications resulting from complex addressing modes where it is not possible to separate memory accesses. A systematic development of experimental procedures is an important and complex topic by itself. For example, a timing procedure suitable for programs that use message passing is described in [11]. To do complete justice to it is beyond the scope of this paper.

2.3 Scope and Applicability

PM models are applicable to a class of numerical algorithms described as follows. First, the work done by the algorithm is characterizable as a set of floating point operations. Secondly, the parallel execution proceeds as a succession of steps with synchronization points in between. Each step consists of computation followed by communication. The same program is executed by all processors, but different data is processed. Within each step, some processors in a MIMD machine may finish their computation earlier and remain partly idle till the next synchronization point. The concept of tight synchronization is inherent in the BSP model [19]. The BSP model considers an algorithm as a sequence of supersteps. Each superstep combines computation and communication.

Many of the numerical algorithms from scientific and engineering applications fall in the category to which PM models can be applied. There are also important exceptions; for example, sorting algorithms where it is the data movement and not the floating point operations that characterize work.

The parallel algorithms considered in this paper are used on both SIMD and MIMD machines. We have implemented these algorithms on MasPar, a SIMD architecture, and nCUBE, a MIMD architecture. PM models with some changes can be applied to different machines. Experimental measurements may pose a problem on some machines. For example, in some cases it may not be possible to arrive at a cycle time for an individual instruction because it may vary depending on the adjacent instructions. This was observed to be the case on nCUBE. We have found it is easier to make experimental measurements on machines that have processors with LOAD-STORE architecture where the only instructions to access memory are LOAD and STORE operations. Fortunately, this is the case with several recent parallel machines including MasPar MP-1 and MP-2, Intel Paragon, IBM SP-1 and SP-2.

A PM model is useful in many ways. The case study in later sections provides an illustration of how it is useful to identify performance bottlenecks, analyze performance, and compare machines. Constants are important in practice. For example, a better design that increases performance by 50% is not something that a computer manufacturer can afford to ignore. In such situations, PM models provide a viable tool to accurately analyze performance of different designs. For a new generation of machines, an important consideration is cost effective improvement in performance. The alternatives could be either faster processors, faster communication hardware or faster memory. Such alternatives can be evaluated by PM models.
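The kind of what-if evaluation described in the last paragraph can be sketched in a few lines of Python. This is a minimal illustration and not the authors' tool: the `Profile` class, the `scale` helper, and every timing number are hypothetical placeholders. It assumes that each component of the execution-time vector scales inversely with the speed of the corresponding hardware, and it ignores the overlap adjustments of Step 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Execution-time vector predicted by a PM model (seconds)."""
    comp: float   # f_comp: floating point computation
    comm: float   # f_comm: interprocessor communication
    misc: float   # f_misc: auxiliary instructions (loops, register moves)
    mem: float    # f_mem: LOAD/STORE time

    def total(self) -> float:
        # Step 5: predicted time is the sum of the four components.
        return self.comp + self.comm + self.misc + self.mem

    def shares(self) -> dict:
        # Execution profile: percentage of total time per component.
        t = self.total()
        return {name: 100.0 * value / t for name, value in vars(self).items()}


def scale(p: Profile, cpu: float = 1.0, net: float = 1.0,
          memory: float = 1.0) -> Profile:
    """Hypothetical what-if: divide each component by a hardware speedup.

    Assumption (not from the paper): miscellaneous overhead runs on the
    processor, so it scales with processor speed.
    """
    return Profile(p.comp / cpu, p.comm / net, p.misc / cpu, p.mem / memory)


# Made-up baseline in which memory operations dominate.
base = Profile(comp=4.0, comm=0.5, misc=0.5, mem=5.0)

for label, variant in [("2x processor", scale(base, cpu=2.0)),
                       ("2x network", scale(base, net=2.0)),
                       ("2x memory", scale(base, memory=2.0))]:
    print(f"{label}: {variant.total():.2f} s (baseline {base.total():.2f} s)")
```

Keeping the prediction as a vector rather than a single number is what makes this comparison possible: with a scalar fitting formula there would be nothing to scale separately.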

3 Performance Analysis

The execution profiles generated by models are used as the basis for performance analysis. We derive quantitative relationships that are useful for a class of algorithms discussed in Section 2.3.

3.1 Execution Profiles

PM models predict the execution time as a sum of four components corresponding to computation, communication, miscellaneous overheads and memory operations. The models can be used to predict the total execution time, and each of its components separately. The execution profile for an algorithm is presented in the form of a table that shows percentages attributed to each component of the execution time for a range of problem sizes. The computation component represents the useful work, and the other three components should be as small as possible. It becomes clear from the execution profile how significant communication, memory operations, or miscellaneous overheads are as performance bottlenecks.

Performance can be viewed in different ways using various metrics. Execution profiles provide a basis to correlate different views in order to provide a coherent picture of parallel performance. Speedup, efficiency, and MFLOPS are defined on the basis of execution profiles in ways that reveal precisely the role of key factors such as load balance.

3.2 Load Balance

Load balance is an important attribute of performance in parallel computing. For the class of algorithms considered in this analysis, load balance can be thought of as the degree of utilization of processors averaged over all "compute only" steps after the memory and miscellaneous overheads are factored out. The following definition of the Load Balance Factor ($LB_f$) is such that the range for $LB_f$ is between zero and one, with one corresponding to the best utilization of processors.

$$LB_f(N) = \frac{nflop(N)\, t_{flop}}{P^2\, f_{comp}(N)}$$

where
    $nflop(N)$: number of normalized floating point operations for sequential computation (P = 1)
    $t_{flop}$: time for a single normalized floating point operation
    $f_{comp}(N)$: total time for floating point operations done in parallel
    N: problem size parameter
    P × P: processor array size

To deal with the mixture of fast and slow floating point operations, normalized floating point operations are used in this paper. For example, on MasPar MP-1 where the ADD operation takes 127 cycles, and the MULT operation takes 225 cycles, the normalized FLOPs for these operations are counted as 1 and 1.77 respectively.

3.3 Efficiency Based on Work

Traditionally, efficiency is calculated based on the work done. However, in parallel computing, efficiency is commonly defined as the speedup divided by the number of processors. The isoefficiency analysis [12, 13] is based on this definition. It has been argued in [2] that instead of relying on time as a measure of work, efficiency should be defined by using unit counts based on the size of an indivisible task as the measure of work. The ratio of work accomplished (wa) to the work expended (we) is proposed in [2] as the alternative definition of efficiency.

Following these ideas, consider a normalized FLOP as the unit of work. There are some objections to using FLOP as a unit of work in general [8]. In our case, however, we are considering numerical algorithms and taking into account memory and other operations separately. Another objection is that operation count is an imperfect measure of computational work since it does not standardize across computers [8]. We agree and address this point later in the context of comparing two machines. With a normalized FLOP as the unit of work, wa is proportional to MFLOPS and we is proportional to peak MFLOPS. Assuming a normalization is used, the ratio of MFLOPS to peak MFLOPS can be considered as the alternate definition for efficiency ($Eff(N)$). As shown below, the inefficiency resulting from communication and other overheads is captured by the fractional term, and the inefficiency due to idle processors is represented by the load balance factor.

$$Eff(N) = \frac{f_{comp}(N)}{f_{comp}(N) + f_{comm}(N) + f_{misc}(N) + f_{mem}(N)} \times LB_f(N)$$

Interestingly, for examples provided in [2], the commonly used definition and the alternate definition of efficiency both led to the same results. The following observation may explain why it is so. On resubstituting for $LB_f$ and using the traditional definition of speedup, it becomes clear that both definitions of efficiency lead to the same formula. This can be verified by using the following formula for speedup as the ratio of the sequential execution time to the parallel execution time.

$$Speedup(N) = \frac{nflop(N)\, t_{flop}}{f_{comp}(N) + f_{comm}(N) + f_{misc}(N) + f_{mem}(N)}$$

The overheads due to memory operations and miscellaneous operations are also present in sequential processing. We have not factored those out and are in effect measuring the overall efficiency by accounting for all sources of inefficiency.

3.4 MFLOPS, Efficiency and Execution Time

First, consider the MFLOPS measure. The normalized MFLOPS are given by:

$$MFLOPS(N) = \frac{nflop(N) \times 10^{-6}}{T_{exe}}$$

where $T_{exe}$ is the experimentally measured parallel computation time for size N. Based on our earlier discussion, MFLOPS can also be calculated by:

$$MFLOPS(N) = \text{Peak MFLOPS} \times Eff(N)$$

$$\text{Peak MFLOPS} = \frac{\text{clock rate}}{\text{number of cycles per normalized flop}} \times P^2$$
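The relationships of Sections 3.2 to 3.4 can be checked numerically. The following Python sketch is an illustration under stated assumptions: the function names and the component timings are hypothetical, the model total is used in place of a measured $T_{exe}$, and only the 127-cycle ADD, the 225-cycle MULT, and the 12.5 MHz MasPar clock rate come from the paper. It shows that substituting $LB_f$ into $Eff(N)$ cancels $f_{comp}$ and leaves $Speedup(N)/P^2$, and that both MFLOPS formulas agree.

```python
def load_balance(nflop, t_flop, p, f_comp):
    """LB_f(N) = nflop * t_flop / (P^2 * f_comp)."""
    return nflop * t_flop / (p * p * f_comp)


def efficiency(nflop, t_flop, p, f_comp, f_comm, f_misc, f_mem):
    """Eff(N) = f_comp / (f_comp + f_comm + f_misc + f_mem) * LB_f(N)."""
    total = f_comp + f_comm + f_misc + f_mem
    return f_comp / total * load_balance(nflop, t_flop, p, f_comp)


def speedup(nflop, t_flop, f_comp, f_comm, f_misc, f_mem):
    """Sequential execution time over parallel execution time."""
    return nflop * t_flop / (f_comp + f_comm + f_misc + f_mem)


def peak_mflops(clock_hz, cycles_per_flop, p):
    """Clock rate over cycles per normalized flop, times P^2 processors."""
    return clock_hz * 1e-6 / cycles_per_flop * p * p


# Normalized FLOP weights from the text: ADD = 127 cycles, MULT = 225 cycles.
ADD_CYCLES, MULT_CYCLES = 127, 225
mult_weight = MULT_CYCLES / ADD_CYCLES   # normalized cost of one MULT

# Inputs below are hypothetical, not measurements from the paper.
CLOCK_HZ = 12.5e6                 # MasPar clock rate reported in the case study
P = 128                           # 128 x 128 array, i.e. 16K processors
t_flop = ADD_CYCLES / CLOCK_HZ    # seconds per normalized flop
nflop = 1.0e9
f_comp, f_comm, f_misc, f_mem = 0.65, 0.05, 0.05, 0.40

lb = load_balance(nflop, t_flop, P, f_comp)
eff = efficiency(nflop, t_flop, P, f_comp, f_comm, f_misc, f_mem)
sp = speedup(nflop, t_flop, f_comp, f_comm, f_misc, f_mem)
t_total = f_comp + f_comm + f_misc + f_mem
mflops_direct = nflop * 1e-6 / t_total
mflops_via_eff = peak_mflops(CLOCK_HZ, ADD_CYCLES, P) * eff

print(f"MULT weight      = {mult_weight:.2f}")
print(f"LB_f             = {lb:.3f}")
print(f"Eff              = {eff:.4f}")
print(f"Speedup / P^2    = {sp / (P * P):.4f}")
print(f"MFLOPS (direct)  = {mflops_direct:.1f}")
print(f"MFLOPS (via Eff) = {mflops_via_eff:.1f}")
```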

The question is what is a good measure of performance to compare different machines based on a given algorithm. A reasonable way is to interpret higher performance as accomplishing more useful work in the same amount of time. Intuitively, one may think that the efficiency could serve the purpose. However, efficiency can be a misleading measure for comparison of different machines. A machine may be less efficient, but could still perform more work because it is faster than the other machine. This suggests that one should really consider the product of the efficiency and the rate of work of a machine. If normalized FLOP is considered as the unit of work, then MFLOPS is such a measure.

MFLOPS also has a problem. The difficulty lies in using a normalized FLOP as a unit of work across different machines. In spite of normalization, the same work (for example the multiplication of two matrices of a given size) can translate into different FLOPs on different machines. This problem can be addressed in a couple of different ways. One solution is to consider a unit of work that depends on the application, not on the machine. For example, an addition plus a multiplication is a viable unit of work to compare performance of matrix multiplication on different machines. Another solution can be to require a conversion rate to convert a FLOP from one machine to another machine. The conversion is done so that the number of FLOPs corresponding to the same work is unchanged in going from one machine to another machine. The definition of work and the conversion rate depend on the algorithm. Thus, in comparing different machines with respect to a given algorithm, there are really three issues: the efficiency, the rate of work, and the unit of work. The bottom line is always the execution time, assuming accuracy of calculations is satisfactory. With the use of an appropriate conversion rate, a higher MFLOPS number indeed means lower execution time.

As a concrete example, for matrix multiplication one normalized FLOP on MasPar MP-1 should be converted to $2.58/2.77$ normalized FLOP on MP-2. This is because the number of normalized FLOPs for an addition plus multiplication is 2.58 on MP-2 and 2.77 on MP-1. The LU decomposition kernel uses the same floating point operations as matrix multiplication, thus the same conversion rate is applicable for both. An analysis of the FFT kernel shows that the conversion rate is one for that algorithm. Incidentally, without the conversion the MFLOP numbers on MP-1 are inflated when used for measuring performance of matrix multiplication and LU decomposition.

4 Case Study

This study was done on a 16K processor MasPar MP-1 with 16K bytes of memory per processor and a 4K processor MP-2 machine with 64K bytes of memory per processor. PM models of matrix multiplication, LU decomposition, and FFT are considered. Two implementations of each algorithm were studied to illustrate the analysis and impact of memory operations. The second implementation included software pipelining to reduce the time for memory operations. The highest level of compiler optimization was used with both implementations. A pre-analysis was done assuming the memory overlap ratio to be zero in the models. Secondly, a post-analysis was done by including a non-zero overlap ratio based on the experimental data from the second implementation, which introduced significant memory overlap as a result of software pipelining.

4.1 Parallel Machines

MasPar MP-1 and MP-2 machines are based on a single-instruction stream, multiple data stream (SIMD) architecture with processors arranged in a two dimensional toroidal grid. A parallel program runs on the array control unit (ACU) which broadcasts instructions to the processors. The communication operations on MasPar and their costs are discussed earlier. The MP-1 and MP-2 machines have a clock rate of 12.5 MHz, and identical instruction sets. However, the MP-1 uses 4-bit processors while the MP-2 uses 32-bit processors. The MP-2 processor can perform floating point operations four to five times faster than the MP-1 processor. Measured cycle times for several instructions are shown in Table 3. There is no cache memory on either machine, and each processor has forty 32-bit registers. Memory accesses are done only through LOAD and STORE instructions. Other instructions, including interprocessor communication, are all register based.

4.2 Validation of Models

To validate PM models, their predictions are compared with experimental results on MasPar MP-1 and MP-2 machines. We did compare the models and the experimental results for the four parts of the execution time separately. Instead of presenting the individual comparisons for each part, the comparison of the total execution time is presented in Table 4. The results show that in all cases, the models are very accurate.

Table 4: Accuracy of Execution Time Predictions

[For each problem size N, the table lists the model time (sec), the experimental time (sec), and the difference (%) on the 16K MP-1 and on the 4K MP-2, in three groups: Matrix Multiplication, LU Decomposition, and Fast Fourier Transform. The numeric entries are not legible in this copy.]

4.3 Pre-Analysis: Identifying Performance Bottlenecks

PM models yield execution profiles that can provide clues for improving performance. The profiles for the three algorithms are shown in Tables 5, 6, and 7. The execution profiles include the total execution time, and its break-up based on computation, communication, miscellaneous overheads, and memory operations. Note that the components other than the computation should be as small as possible for high performance.

The pre-analysis Tables 5 and 6 show that memory operations account for a significant portion of the execution time. For matrix multiplication, miscellaneous overheads and interprocessor communication together constitute only a small part (10% or less) of the execution time, but memory operations account for as much as 37% on MP-1 and 52% on MP-2. It gets worse with LU decomposition as it is a more memory-access intensive algorithm. Miscellaneous overheads for LU decomposition are significant for smaller problem sizes, but they decrease for larger problems. The performance profile for FFT (Table 7) is quite different. It is clear that memory operation is not the problem. The performance loss with FFT is mainly due to interprocessor communication. The pre-analysis suggests that the performance of matrix multiplication and LU decomposition could be significantly improved by using techniques that minimize the time for memory operations.

4.4 Memory Access Optimization

The performance loss due to memory operations can be minimized by exploiting the organization of the memory and how it works. We used blocking and software pipelining. Since there is no cache memory on MasPar, blocking was implemented using the registers. Software pipelining was found to be more critical for performance improvement on MasPar machines.
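Blocking with registers, as mentioned above, can be illustrated with a small Python sketch. The paper does not show its blocked code, so everything here is an assumption made for illustration: the 2 × 2 block size, the function names, and the use of Python local variables as stand-ins for machine registers. The idea is that each element loaded from A or B is reused for two multiply-adds, so the number of loads per floating point operation is halved relative to the basic loop.

```python
def matmul_basic(A, B, M):
    """Reference M x M product: two loads per multiply-add."""
    C = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(M):
            c = 0.0
            for k in range(M):
                c += A[i][k] * B[k][j]
            C[i][j] = c
    return C


def matmul_blocked_2x2(A, B, M):
    """Same product with a 2 x 2 block of C held in local 'registers'.

    Assumes M is even. In each inner iteration four loads feed four
    multiply-adds, i.e. one load per multiply-add instead of two.
    """
    if M % 2 != 0:
        raise ValueError("this sketch requires an even M")
    C = [[0.0] * M for _ in range(M)]
    for i in range(0, M, 2):
        for j in range(0, M, 2):
            c00 = c01 = c10 = c11 = 0.0
            for k in range(M):
                a0, a1 = A[i][k], A[i + 1][k]      # two loads from A
                b0, b1 = B[k][j], B[k][j + 1]      # two loads from B
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            C[i][j], C[i][j + 1] = c00, c01
            C[i + 1][j], C[i + 1][j + 1] = c10, c11
    return C


# Small made-up operands to compare the two versions.
M = 4
A = [[float(i + j) for j in range(M)] for i in range(M)]
B = [[float(i * j + 1) for j in range(M)] for i in range(M)]
print(matmul_blocked_2x2(A, B, M) == matmul_basic(A, B, M))
```

Larger blocks reuse each load more often but need more registers; with forty 32-bit registers per processor the usable block size on such a machine would be limited accordingly.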

Table 5: Pre-Analysis by Model: Matrix Multiplication
Columns: N; for MP-1: comp %, comm %, mem access %, misc %; for MP-2: comp %, comm %, mem access %, misc %.

Table 6: Pre-Analysis by Model: LU Decomposition
Columns: N; for MP-1: comp %, comm %, mem access %, misc %; for MP-2: comp %, comm %, mem access %, misc %.

Table 7: Pre-Analysis by Model: Fast Fourier Transform
Columns: N; for MP-1: comp %, comm %, mem access %, misc %; for MP-2: comp %, comm %, mem access %, misc %.
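The break-up reported in an execution profile amounts to simple arithmetic on the four modeled time components. The following Python sketch shows that computation; the function name, the dictionary keys, and the sample times are hypothetical and are not taken from the tables.

```python
def execution_profile(comp, comm, mem, misc):
    """Return the total time and the percentage share of each component.

    All arguments are times in seconds for one run: computation,
    interprocessor communication, memory operations, miscellaneous overhead.
    """
    total = comp + comm + mem + misc
    if total <= 0:
        raise ValueError("total execution time must be positive")
    return {
        "total": total,
        "comp %": 100.0 * comp / total,
        "comm %": 100.0 * comm / total,
        "mem access %": 100.0 * mem / total,
        "misc %": 100.0 * misc / total,
    }


# Hypothetical component times, for illustration only.
profile = execution_profile(comp=5.0, comm=1.0, mem=3.0, misc=1.0)
```

A profile in which the memory access share is large relative to the communication share points to memory operations as the bottleneck, which is the pattern described for matrix multiplication and LU decomposition.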

    register a, b, c;
    for i = 0 to M-1 begin
      for j = 0 to M-1 begin
        c = C(i,j);
        for k = 0 to M-1 begin
          a = A(i,k);
          b = B(k,j);
          c += a * b;
        end
        C(i,j) = c;
      end
    end
    (basic version)

    register a0, a1, b0, b1, c;
    for i = 0 to M-1 begin
      for j = 0 to M-1 begin
        c = 0.0;
        a0 = A(i,0);
        b0 = B(0,j);
        for k = 0 to M-2 begin
    (1)   a1 = A(i,k+1);
    (2)   b1 = B(k+1,j);
    (3)   c += a0 * b0;
          a0 = a1;
          b0 = b1;
        end
        c += a0 * b0;
        C(i,j) += c;
      end
    end
    (software pipelined version)

Figure 1: An example of software pipelining applied to matrix multiplication

Software Pipelining Technique

Software pipelining is used to reduce the overhead of accessing the memory. This technique has been previously studied [14, 17] for VLIW and other architectures. The technique is commonly used on RISC workstations. On MasPar, we had to apply the technique by hand to source level programs to change the order of operations in successive iterations of a loop so that data could be prefetched. Software pipelining helps if the hardware can overlap prefetching of data with computation and communication. We applied software pipelining to computation loops with floating point operations and also to communication loops that move a block of data from the local memory of one processor to another processor.

The software pipelining technique is illustrated in Figure 1 by the example of matrix multiplication. In the basic version of the matrix multiplication loop, elements of the A and B arrays are used for floating point operations immediately after they are accessed. As a result, floating point operations cannot start until the memory accesses are complete. In the pipelined loop, on the other hand, the array elements get prefetched in lines (1) and (2). This prefetching is overlapped with the floating point computation done in line (3). Software pipelining can be combined with loop unrolling for further improvement in performance.

Measurement of Memory Overlap

This section illustrates how memory access optimization is accounted for, and how significant its impact on performance is. The impact of overlapping memory operations is measured by the overlap ratio ($O_r$), based on the equation $f_{mem}(1 - O_r) = T_{mem}$. As defined originally, $f_{mem}$ gives the time for memory operations in the absence of overlap, and $T_{mem}$ is the experimentally measured time for memory operations in the presence of software pipelining. The overlap ratio plays a role similar to the hit ratio for analyzing cache memory performance. Similar to the cache hit ratio, the overlap ratio has a value between 0 and 1, and the closer it is to 1, the higher the performance.

The overlap resulting from software pipelining is expected to increase up to a point with an increasing number of pipelined iterations of the for loop. In a pipelined operation, the efficiency increases with the number of jobs until it levels off at a maximum value. The same trend is observed for the overlap ratio. The overlap ratio depends on the algorithm, the architecture of the machine, and the problem size. It increases with the local problem size until it levels off, as shown in Figures 2 and 3. Note that each figure refers to the total problem size and not the local size at each processor. The corresponding local problem sizes are larger on MP-2, as it has only 4K processors compared to 16K processors on MP-1.
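Solving the overlap equation $f_{mem}(1 - O_r) = T_{mem}$ for $O_r$ gives $O_r = 1 - T_{mem}/f_{mem}$. A minimal Python sketch of both directions of that relation follows; the function names and the sample times are hypothetical and are not measurements from the report.

```python
def overlap_ratio(f_mem, t_mem):
    """Overlap ratio O_r from f_mem * (1 - O_r) = T_mem.

    f_mem: modeled memory time without overlap (seconds).
    t_mem: measured memory time with software pipelining (seconds).
    """
    if f_mem <= 0:
        raise ValueError("f_mem must be positive")
    if not 0 <= t_mem <= f_mem:
        raise ValueError("t_mem must lie between 0 and f_mem")
    return 1.0 - t_mem / f_mem


def overlapped_memory_time(f_mem, o_r):
    """Memory time remaining after overlap: f_mem * (1 - O_r)."""
    if not 0.0 <= o_r <= 1.0:
        raise ValueError("overlap ratio must lie between 0 and 1")
    return f_mem * (1.0 - o_r)


# Hypothetical times, for illustration only.
o_r = overlap_ratio(f_mem=10.0, t_mem=2.5)
remaining = overlapped_memory_time(f_mem=10.0, o_r=o_r)
```

An overlap ratio of 0 means that none of the memory time is hidden, and a ratio of 1 means that all of it is hidden behind other operations.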
LU decomposition shows higher overlap than matrix multiplication on MP-1, and it is the other way around on MP-2. The LU decomposition kernel is more memory intensive; it requires an additional STORE operation compared to the matrix multiplication kernel. We verified that if an additional STORE operation is included (redundantly) in the matrix multiplication kernel, then its overlap ratio matches closely with that of LU decomposition.

Figure 2: Memory Overlap Ratio on 16K MP-1 (overlap ratio versus problem size N; curves for Matrix Multiplication, LU Decomposition, and Fast Fourier Transform)

Figure 3: Memory Overlap Ratio on 4K MP-2 (overlap ratio versus problem size N; curves for Matrix Multiplication, LU Decomposition, and Fast Fourier Transform)

Memory accesses need to be factored into realistic models of parallel computing. Memory accesses can have a significant impact on performance even in parallel computing. For two out of the three algorithms in our study, the memory access cost in fact turns out to be substantially higher than the interprocessor communication cost. Memory access time can vary significantly due to memory hierarchy and overlap of memory accesses with other operations. We have addressed memory overlap, which is the relevant issue on MasPar machines, where there is no cache memory but the overlap is a significant factor. As future research, it will be worthwhile to do case studies on other machines with cache memory. There is extensive literature on performance analysis of cache memory which needs to be explored in the context of realistic modeling for parallel machines with distributed memory.

4.5 Post-Analysis

A "post-analysis" was done to study performance after it was improved by software pipelining. To account for the memory overlap, $f_{mem}$ is replaced by $f_{mem}(1 - O_r)$ in the post-analysis. A comparison of execution times between pre-analysis and post-analysis (Table 8) shows that a significant improvement in performance is possible on MasPar machines by overlapping memory operations with other operations.

Table 8: Improvement by Overlapping of Memory Operations
Columns, for each of the 16K MP-1 and the 4K MP-2: N, pre-anal. (sec), post-anal. (sec), improvement (%). Rows are grouped by algorithm: Matrix Multiplication, LU Decomposition, Fast Fourier Transform.

The post-analysis Tables 9, 10 and 11 provide a quantitative picture of how different overheads impact performance. The following trends are observed for the three algorithms when overheads are considered as percentages of the total execution time. For matrix multiplication, memory is the dominant overhead. For LU decomposition, miscellaneous and communication overheads are also high, but only for smaller problems. As the problem size increases, the other two overheads diminish and memory becomes the dominant overhead. Software pipelining helps substantially, and more so in the case of MP-2, but the cost of memory operations still remains relatively high. For FFT, the communication overhead is the most significant, followed by the miscellaneous overhead; memory overhead is very low.

Table 9: Post-Analysis by Model (a): Matrix Multiplication
Columns: N; for MP-1: comp %, comm %, mem access %, misc %; for MP-2: comp %, comm %, mem access %, misc %.

Table 10: Post-Analysis by Model (a): LU Decomposition
Columns: N; for MP-1: comp %, comm %, mem access %, misc %; for MP-2: comp %, comm %, mem access %, misc %.

Table 11: Post-Analysis by Model (a): Fast Fourier Transform
Columns: N; for MP-1: comp %, comm %, mem access %, misc %; for MP-2: comp %, comm %, mem access %, misc %.

(a) Post-Analysis shows performance after memory access optimization is done.

Next, we analyze efficiency, which is affected by overheads and the load balance. The efficiency curves on MP-1 and MP-2 are shown in Figures 4 and 5. Matrix multiplication has the least overhead plus the best possible load balance, thus it achieves the highest efficiency among the three algorithms. After software pipelining, the overall overhead for LU decomposition becomes smaller compared to FFT, especially for large problems. The load balance for LU decomposition is low for small size problems, but it improves due to 2-D scattered decomposition as the problem size increases. Using the formula from Section 3.2, it was checked that the load balance factor ($LB_f$) for LU decomposition changed from 0.64 to 0.94 on MP-1 and from 0.76 to 0.96 on MP-2. The net result is that the efficiency curve for LU decomposition eventually takes off, and is much higher than the FFT curve. For matrix multiplication and FFT, it is easy to see from the parallel algorithm itself that $LB_f = 1$, i.e., processors are fully utilized when the problem size is a multiple of the PE array size.

Figure 4: Efficiency on 16K MP-1 (efficiency versus problem size N; curves for Matrix Multiplication, LU Decomposition, and Fast Fourier Transform)

Figure 5: Efficiency on 4K MP-2 (efficiency versus problem size N; curves for Matrix Multiplication, LU Decomposition, and Fast Fourier Transform)

4.6 Comparison of Two Machines

We will use PM models to compare two machines. This comparison provides a concrete example to study an important issue in parallel computing, namely, "which choice is better: a small number of powerful processors or a large number of less powerful processors?" MP-1 has 16K simple 4-bit processors whereas MP-2 has 4K 32-bit processors. Each MP-2 processor is four to five times faster than an MP-1 processor in terms of floating point computation. The peak ratings of the 16K processor MP-1 and the 4K processor MP-2 are respectively 1613 and 1969 normalized MFLOPS. The two machines have the same amount of total memory, thus it is possible to compare problems of the same size on both machines. The three algorithms used for the comparison are useful to get different perspectives.

We will compare the two machines in terms of overheads, efficiency, MFLOPS, and execution time. The impact of overheads turns out to be significantly different on MP-1 and MP-2. For each algorithm, we compare the data for the same size problem on MP-1 and MP-2. As seen from the post-analysis Tables 9, 10 and 11, all overheads, including memory, communication, and miscellaneous, are significantly higher on MP-2. This can be understood on the basis of two factors related to differences in architectural parameters. First, only certain operations are faster on MP-2, and those are also not in the same proportion.
For example, floating point operations are four to five times faster, but memory operations are only twice as fast compared to MP-1. The communication operations and auxiliary instructions leading to miscellaneous overheads are not faster at all. Second, MP-1 is a larger machine where more processors are connected to each other, thus its communication bandwidth is higher.

Figure 6: Performance Comparison based on MFLOPS (difference in % versus problem size N; curves for Matrix Multiplication, LU Decomposition, and Fast Fourier Transform; the two sides of the 0% line are marked MP-1 > MP-2 and MP-2 > MP-1)

Figure 7: Performance Comparison based on Execution time (difference in % versus problem size N; curves for Matrix Multiplication, LU Decomposition, and Fast Fourier Transform; the two sides of the 0% line are marked MP-1 > MP-2 and MP-2 > MP-1)

Next, we compare the two machines in terms of efficiency and MFLOPS. In all cases, the efficiency on MP-2 is lower (compare Figures 4 and 5). A smaller machine can achieve higher load balance, which helps efficiency. In this case, however, the main factor is overheads, which are significantly higher on MP-2. Although MP-2 is less efficient, it is faster and has a higher peak MFLOPS rating than MP-1. So we compare the two machines in terms of MFLOPS. The 4K processor MP-2, in all instances, achieves lower MFLOPS than the 16K processor MP-1 (see Figure 6).

The comparison based on overheads, efficiency, and MFLOPS implies that MP-2 is a worse machine than MP-1. Before we accept that conclusion, let us compare execution times. A comparison of execution times is shown in Figure 7. It is seen that MP-2 is better than MP-1 for matrix multiplication in all cases; it is also better for LU decomposition with smaller problems. These results are not surprising based on the earlier discussion in Section 3.4. It was pointed out that the MFLOPS numbers on MP-1 are inflated for matrix multiplication and LU decomposition. To get a coherent picture of performance, we need to convert MFLOPS from one machine to another. The conversion rates are given in Section 3.4. If the MFLOPS comparison is redone using proper conversion, then it turns out to be exactly the same as the execution time comparison. In fact, for FFT, the conversion rate is one, which is consistent with the observation that both the MFLOPS and the execution time comparison curves are almost identical for that algorithm (compare Figures 6 and 7).

4.7 Prediction for a Future Machine

We illustrate how PM models can be used to make performance predictions for a future generation machine. For a new machine, many different alternatives may be of interest. For example, it may be necessary to consider the impact of increasing processor speed, improving memory access time, enhancing communication hardware, or increasing the number of processors. We use PM models to predict performance when the number of processors is increased from 4K to 16K in a future MP-2 machine. The speedup predictions are given in Table 12. Execution profiles are provided in Table 13 to give an idea of how overheads due to interprocessor communication, memory accesses, and auxiliary instructions are expected to change. If Tables 9, 10, 11 and Table 13 are compared, it is seen that overheads increase. The increase is most significant in the case of FFT.

Table 12: Speedup Prediction for 16K MP-2 over 4K MP-2
Columns: N, 4K MP-2 time (sec), 16K MP-2 time (sec), relative speedup. Rows are grouped by algorithm: Matrix Multiplication, LU Decomposition, Fast Fourier Transform.
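The relative speedup column of Table 12 can be read as a ratio of the two predicted execution times. The Python sketch below assumes that the speedup is the 4K time divided by the 16K time for the same problem size; the text does not state the definition explicitly. The scaling efficiency helper is an added illustration and does not come from the report, and all names and sample times are hypothetical.

```python
def relative_speedup(time_4k, time_16k):
    """Assumed definition: predicted 4K time divided by predicted 16K time."""
    if time_4k <= 0 or time_16k <= 0:
        raise ValueError("execution times must be positive")
    return time_4k / time_16k


def scaling_efficiency(speedup, processor_ratio=4.0):
    """Fraction of the ideal speedup achieved when the processor count
    grows by processor_ratio (4.0 for 4K -> 16K). Illustrative helper."""
    if processor_ratio <= 0:
        raise ValueError("processor_ratio must be positive")
    return speedup / processor_ratio


# Hypothetical predicted times in seconds, for illustration only.
s = relative_speedup(time_4k=8.0, time_16k=2.5)
e = scaling_efficiency(s)
```

A speedup well below 4 for a given problem size indicates that the overheads grow when the same problem is spread over four times as many processors, consistent with the increase in overheads noted above.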

Table 13: Prediction of Execution Profile on 16K MP-2
Columns: N, comp %, comm %, memory %, misc %. Rows are grouped by algorithm: Matrix Multiplication, LU Decomposition, Fast Fourier Transform.

It is difficult to check the validity of future predictions, but it may be possible to check the validity of the approach. To do so, hypothetical predictions were made for the 16K processor MP-1, and they were checked using the real machine. A couple of things are worth mentioning about the validation. The memory overlap ratio depends on the local size of the problem, and we verified that it is fairly accurate to extrapolate the overlap ratio on that basis. The regression formulas for predicting miscellaneous overheads were developed using test cases on 1K and 4K processors on MP-1, and their validity was checked on 16K processors.
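One way to picture extrapolation of the overlap ratio by local problem size is sketched below in Python. It rests on two assumptions that are not stated in the text: the local size of an N x N matrix problem on P processors is taken as N*N/P elements per processor, and overlap ratios between measured local sizes are obtained by linear interpolation. The function names and sample ratios are hypothetical.

```python
def local_problem_size(n, processors):
    """Assumed local size: elements of an n x n matrix per processor."""
    if n <= 0 or processors <= 0:
        raise ValueError("n and processors must be positive")
    return (n * n) / processors


def extrapolate_overlap(samples, local_size):
    """Estimate the overlap ratio at local_size from measured samples.

    samples: (local_size, overlap_ratio) pairs with distinct local sizes,
    measured on a smaller machine. Linear interpolation is an assumption;
    values outside the measured range are clamped to the nearest sample.
    """
    pts = sorted(samples)
    if not pts:
        raise ValueError("at least one sample is required")
    if local_size <= pts[0][0]:
        return pts[0][1]
    if local_size >= pts[-1][0]:
        return pts[-1][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= local_size <= x1:
            return y0 + (y1 - y0) * (local_size - x0) / (x1 - x0)


# Hypothetical measurements on a 4K machine, for illustration only.
measured = [(16.0, 0.70), (64.0, 0.80), (256.0, 0.85)]
# N = 1024 on 16K processors has the same local size as N = 512 on 4K.
ratio = extrapolate_overlap(measured, local_problem_size(1024, 16 * 1024))
```

Under these assumptions, a problem on the larger machine reuses the overlap ratio measured on the smaller machine for the problem with the same per-processor data size.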
