Evaluation of Benchmark Performance Estimation for Parallel Fortran Programs on Massively Parallel SIMD and MIMD Computers

Thomas Fahringer
Dept. of Software Technology and Parallel Systems
University of Vienna
Bruennerstr. 72, A-1210 Vienna, Austria

To be published in: IEEE Proceedings of the 2nd Euromicro Workshop on Parallel and Distributed Processing, Malaga/Spain, Jan. 1994

Abstract

A potential problem encountered when parallelizing programs for massively parallel systems is how to guide the parallelization effort through performance prediction. Estimating the performance of parallel programs based on benchmarking has become increasingly popular in recent years. However, little research has been done so far to evaluate this approach, mainly due to the lack of actual implementations. This paper discusses the advantages and disadvantages of benchmark performance estimation for SIMD and MIMD machines. The design and implementation of a benchmark performance estimator is presented. Even though benchmark performance estimations have been demonstrated to be very useful, experiments based on the described prototype uncover several severe problems of this approach. These include time effort, portability, measurement complexity, performance influence of target machine and compiler, pattern matching of kernels, and prediction accuracy. Concrete experiments for the MasPar MP-1 and the iPSC/860 hypercube are presented.

1 Introduction

Common use of massively parallel SIMD and MIMD systems has been hindered by the difficulty of programming such machines. Even though the development of explicit parallel languages ([19, 8]) and parallelizing compilers ([3, 13, 1]) has been put forth in recent years to alleviate the programming task, the user is still responsible for making most of the strategic program transformation decisions. It is widely accepted that a performance estimator is a key component to guide the parallelization and optimization effort. For some time there has been great hope in the so-called benchmark performance estimation approach.¹ This involves
pre-measurement of kernels to cover both sequential and parallel program sections. Parallel kernels are commonly measured for varying data sizes and processor numbers. The measured runtimes are stored in a kernel library. In order to obtain estimated runtimes, a parallel program is parsed to detect existing library kernels by means of pattern matching. For each kernel discovered, the pre-measured runtime is accumulated, which finally yields the overall runtime of the program.

MacDonald ([14]) describes an approach to predict the runtime of Fortran 77 programs by mapping language constructs to time formulas. He achieves good results for small kernels with trivial control flow. V. Balasundaram et al. ([2]) built a training set tool to help validate different data layout schemes for parallel programs based on the loosely synchronous communication model. They do not handle procedure calls and rely on guessing to model control flow. They achieve fairly accurate estimates.

This paper tries to address the practical issue of using benchmarking to obtain performance estimates. The corresponding research started with the creation of a benchmark kernel library for a SIMD machine, namely the MasPar MP-1. The runtime of parallel MasPar Fortran programs was derived by hand using pre-measured kernel runtimes. This work initiated the design of an automatic benchmark performance estimator for message passing Fortran programs, to be integrated in the Vienna Fortran Compilation System (VFCS, [3]), which is a compiler that automatically translates Fortran programs into message passing programs for distributed memory parallel architectures. The target programs for the benchmark estimator are based on the single-program multiple-data (SPMD) programming model ([3]), where each processor executes the same program on a different data domain. The target machine is the iPSC/860 hypercube.

This paper reports on the experience with the manually obtained performance estimates for the MP-1 and the values derived automatically by the described prototype estimator for the iPSC/860 hypercube. The benchmark approach is evaluated with respect to the following issues:

1. time effort to build and maintain a benchmark performance estimator,
2. portability,
3. measurement complexity,
4. local and global performance influence of target machine and compiler,
5. pattern matching of kernels, and
6. prediction accuracy.

The paper is organized as follows. In Section 2 the design and classification of a kernel library is presented. The problems that arose during the pre-measurement of kernels for different target architectures are discussed. Then the overall performance estimation concepts and techniques are outlined. A variety of problems discovered during development of the prototype estimator are analyzed. In Section 3 experiments for both the MasPar MP-1 and the iPSC/860 hypercube are presented. The paper concludes with a summary of important observations about this research and future work.

¹ For the remainder of this paper, performance estimation refers to benchmark performance estimation.

2 A Benchmark Performance Estimator

2.1 Benchmark Kernel Library

The benchmark kernels of the prototype implementation are stored in a benchmark kernel library. As different kernels may require different measurement, pattern matching and evaluation techniques, a classification is naturally imposed on them. This includes four different kernel classes:

1. primitive operations: basic operations (+, -, *, /), logical operations (<, >, ==, etc.), array access kernels (e.g.
A(I+1)), etc.
2. primitive statements: DO loop header statements, subroutine and function calls, conditional and unconditional statements, assignment statements, GOTO statements, atomic communication statements (send and receive), etc. This kernel class also contains Fortran 90 array operations as used in MasPar Fortran ([15]) on the MP-1.
3. intrinsic functions: SIN, COS, MOD, LOG, etc. This kernel class also contains implicit reduction functions included in the Fortran 77 language specification such as MIN, MAX, INDEX, etc. Other reduction functions are machine specific implementations ([15, 11]) such as DOT_PRODUCT, SHIFT, MAXLOC, COUNT, TRANSPOSE, and a variety of collective communication statements (e.g. broadcast).
4. code patterns: This kernel class includes standard code patterns amenable to recognition, such as elementary operations of linear algebra (matrix multiplication, matrix inversion, determinant computation, etc.) and commonly used stencils such as the Jacobi relaxation, LU decomposition, Gauss-Jordan and others.

Moreover, each kernel class is divided into machine specific and machine independent kernels. In the framework of this project, extensive work and implementation for the first 3 kernel classes has been done. About 15 different kernels were collected across these three classes. The current implementation of the described performance tool handles only a few larger code patterns, mainly because of the difficulty of detecting them in a program. Nevertheless, there exist a variety of conceptual ideas ([12, 4]) to approach the problem of pattern matching for more complicated kernels. This will be addressed in future research.

It is frequently assumed that building and maintaining such a kernel library can be done with little time effort. On the one hand, about 1 1/2 man-years were required to develop and implement a different performance estimator at the University of Vienna, namely the P³T² ([7, 5]). This estimator covers a wide class of parallel programs, going way beyond the capabilities of the described benchmark performance estimator. On the other hand, building the benchmark
estimator and, in particular, the kernel library has been an ongoing effort for more than 2 man-years.

² The P³T is based on an analytical model, which computes a set of parallel program parameters to relate to the parallel program's performance.

A multitude of problems were encountered:

The underlying target architecture and compiler version force the designer of a kernel library to add many different variations of even primitive kernels. E.g. on the Intel i860 it is vital to analyze the number, data types and dimensionality (in case of arrays) of procedure parameters. A difference in the number of array dimensions of actual and formal parameters may increase a procedure call runtime by up to 5 %.

Contrary to what is claimed in [2], a large portion of kernels, including primitive operations and statements, are portable neither across a variety of target architectures nor across different compiler releases for the same machine. The kernel set had to be modified even for different compiler versions of the iPSC/860 hypercube. E.g. in order to reflect the significant performance difference between kernel accesses to main memory and to a register, it is a prerequisite to model the register allocation policy of the underlying target compiler. This effect is referred to as the local kernel measurement effect, because it is frequently local to individual kernels.

The kernels are measured individually on different architectures. However, they occur in the global domain of a parallel program and may strongly influence each other's performance. E.g. for the PSC Fortran Compiler Release 3 on the iPSC/860 hypercube, depending on the problem size³ of a program, a different number of assembly code statements is generated for array and scalar accesses by the target compiler. For small problem sizes only two, otherwise four assembly code statements are generated. Ignoring this fact would induce an estimation error of 4 to 5 % for the corresponding data access runtime. Furthermore, if a primitive operation is detected by the target compiler to be part of a common subexpression ([1]), then its runtime can be significantly reduced. Besides the problem of uncovering how the target compiler handles common subexpressions, it is necessary to add additional kernels which measure the effect of primitive
operations as part of common subexpressions or as individual kernels. This effect is called the global kernel measurement effect, because a kernel may influence the performance of another kernel.

³ This refers to the size of allocated arrays.

In order to fine tune the kernel library it was therefore necessary to analyze the target architecture, target compiler releases, and in particular the resulting assembly code of a parallel program. This strongly speaks against portable kernel libraries, not to mention the intense time effort for this task. Not surprisingly, only a small subset of the kernel library described in this paper could be used for both the iPSC/860 hypercube and the MasPar MP-1. Larger code patterns are more likely to allow modeling of both local and global kernel measurement effects. This, however, confronts the designer of a kernel library with the difficult task of pattern matching for more complicated and larger kernels (cf. [12, 4]).

2.2 Execute kernels on different target architectures

The kernel library is executed on every target architecture for which runtimes are to be estimated. Primitive operations and most primitive statements (except Fortran 90 array operations) are measured for different data types and for constant and variable operand values. Communication statements, Fortran 90 operations and intrinsic functions are designed for different data layout schemes and measured for varying numbers of processors and problem sizes. Similar to [2], the chi-square fit method [17] is used to fit the measured runtime information into piece-wise linear functions, modeling both fixed and variable stepsizes between these functions. Unfortunately, fixed stepsizes cannot be assumed. Even trivial kernels display non-constant stepsizes between piece-wise linear performance functions, which may even have different shapes. Undulations and runaway behavior of performance functions are other frequently observed anomalies.

Fig. 1 shows the benchmarking results of DOT_PRODUCT, which is an intrinsic function in MasPar Fortran implementing the dot product of vectors. First, this
function was benchmarked for two vectors given as rows of an array M. The associated performance curve displays step-wise linear curves of different shapes. Sometimes a step is skipped (2 25). For 4 52 the runaway function behavior of a negative performance step was observed. Even the stepsizes vary. The steps between the piece-wise linear functions are due to the wrap-around memory hierarchy of the MasPar MP-1 system ([16]). If a data vector does not fit onto a specific memory hierarchy level, it is wrapped around to the next higher memory level. Data access costs increase suddenly each time the next higher memory hierarchy level has to be accessed. This is visualized by the steps in the performance function.

[Figure 1: Irregular performance behavior of benchmark kernels. Panels: DOT_PRODUCT(M(I,2:N-1), M(I,2:N-1)) and DOT_PRODUCT(V(1:N), V(1:N)); runtime axis in 10^-3 sec.]

Second, the same function is evaluated for a one-dimensional vector of size N. Using one-dimensional arrays instead of rows of a two-dimensional array cuts the runtime overhead drastically. This is due to the reduced memory address computation, which has to be done by the frontend processor of the MP-1. The overall performance behavior is described by a cascade function with discontinuities at multiples of 1024 for N. At these multiples all processors of the utilized MP-1 configuration (1024 processors) are employed in the computation of DOT_PRODUCT. In all other cases some processors have to be disabled for the associated computation cycles, while others are actually computing. This depends on the data layout scheme. It seems that disabling some of the processors does not come for free, which may explain this benchmark behavior.

The described performance estimator uses the arithmetic mean across all stepsizes to compensate for non-constant stepsizes of performance functions. In some cases the stepsize consistently increases for larger problem sizes (see Fig. 4). In that case a linear function is computed by chi-square fitting to model these non-constant stepsizes. The memory required to store every single piece-wise linear function together with the associated stepsize would be infeasible. In general it was observed that more advanced interpolation techniques are required for the described performance function fitting method to cover larger classes of parallel programs. This will be addressed in future research.

When organizing the benchmark kernels into several executables, an awkward difficulty was discovered. As the kernel library contains about 15 different kernels, they are combined in several large executables instead of putting every single kernel in a distinct executable. On the Intel i860 it turned out that the
measured performance for most specific kernels in a large executable was very different from that measured in single executables. The reason is obvious: the CPU pipeline and cache behavior differs between the two measurement variants. Deviations of one order of magnitude were observed in the worst case and at least 1 to 2 % in the best case. The same problem is ubiquitous when estimating the performance of a real parallel program by a set of pre-measured kernels. Again, going beyond primitive kernels might help to alleviate this drawback.

2.3 Deriving performance estimates

Fig. 2 illustrates the structure and components of the described benchmark performance estimator. The parallel program to be evaluated by this tool is attributed with concrete values for program unknowns: loop iteration counts, branching probabilities, and statement execution counts. These program unknowns are referred to as sequential program parameters, as they relate to the control flow of an SPMD program, which is equal for all processors. The sequential program parameters are derived by a single preceding profile run of the input program as initiated by the Weight Finder ([6]), which is an advanced profiler for Fortran programs integrated in the VFCS.

1. The parallel program attributed with the sequential program parameters is parsed using its syntax tree and control flowgraph representation under the VFCS. The VFCS ([3]) provides a variety of routines which allow the syntax tree and control flowgraph to be traversed conveniently. For each syntax tree node a kernel pattern matching algorithm and subsequently a performance evaluation algorithm is initiated, as explained in the following.

2. Depending on the class of benchmark kernels to be matched, a different pattern matching strategy is applied. Primitive operations, primitive statements and intrinsic functions are simply detected by their syntax tree node representation. The underlying compilation system strongly supports this pattern matching task by normalizing expressions across the entire parallel program. Furthermore, an expression simplifier statically evaluates expressions containing symbolics and constants to reduce them to essentials. For entire code patterns (e.g. matrix multiplication) more advanced techniques are required, such as those mentioned in [12, 4]. In its current implementation status the pattern matcher handles all kernels in the kernel library except code patterns.

3. Based on the pre-measured runtimes of the kernels in the kernel library and the sequential program parameters, it is straightforward to obtain the estimated runtime for arbitrary program segments. At the lowest level the runtime for a specific primitive statement S (or the sum of all primitive operations contained in S, if S is a non-primitive statement) is multiplied by the corresponding statement execution count. In case of a conditional statement this figure needs to be further weighted by the associated branching probability, which then yields the estimated runtime for S. In order to compute the estimated runtime for an arbitrary program segment, the estimated runtimes of all of its statements are summed up. The only problem arising with this approach is to estimate the runtime for procedure (Fortran subroutine or function) calls. A major assumption is that the runtime of a procedure call is independent of the call site. As a consequence, the runtime at a particular call site is the same as the runtime of the procedure averaged over all call sites. This assumption is commonly made ([1, 9]) and avoids expensive, time and memory consuming simulation efforts to evaluate a more precise program behavior. The estimated runtime for a specific procedure call instantiation is therefore obtained by dividing the accumulated estimated runtime for the associated procedure by the sum of the statement execution counts across all call sites of this procedure. Multiplying this value by the statement execution count of a specific procedure call statement allows the importance of different call sites to be weighted with respect to their runtime overhead.

4. Optionally the parallel program's internal representation (syntax tree) can be annotated with the estimated runtime values derived in the performance evaluation phase. This supports a clean interface to other parallelization and optimization phases under the VFCS which require performance estimates, such as the selection of program transformations ([5, 3]) and automatic data distribution strategies ([4]).

5. The estimated runtime figures can be selectively visualized together with the associated program statements. This is fully implemented using an X11/Motif window under the VFCS ([5, 7]).

Note that only the absence of a parser for MasPar Fortran programs prevents the automatic performance prediction of MasPar Fortran programs. All other tool components can be equally applied to both SIMD and MIMD programs. Future research will extend the benchmark estimator in this direction.

[Figure 2: A performance estimator based on benchmarking. Tool: Benchmarking Performance Estimator (parsing, pattern matching, performance evaluation, performance visualization). Program: parallel program + sequential program parameters, annotated parallel program. Kernels: primitive operations, intrinsic functions, primitive statements, code patterns.]
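The accumulation rule of step 3 and the call-site weighting of procedure runtimes can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the VFCS implementation: the names `Statement`, `estimate_statement`, `estimate_segment` and `estimate_call_site` are hypothetical, and the runtimes and counts in the usage note are invented values.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Statement:
    """One statement matched against the kernel library (hypothetical type)."""
    kernel_runtime: float     # pre-measured runtime per execution, in seconds
    exec_count: float         # statement execution count from the profile run
    branch_prob: float = 1.0  # branching probability; 1.0 if unconditional


def estimate_statement(s: Statement) -> float:
    # Kernel runtime times execution count, weighted by branching probability.
    return s.kernel_runtime * s.exec_count * s.branch_prob


def estimate_segment(statements: Iterable[Statement]) -> float:
    # A program segment is estimated as the sum over its statements.
    return sum(estimate_statement(s) for s in statements)


def estimate_call_site(proc_runtime_total: float,
                       call_counts: Dict[str, float],
                       site: str) -> float:
    # Assumes the runtime of a call is independent of the call site:
    # divide the accumulated procedure runtime by the execution counts
    # summed over all call sites, then weight by this site's count.
    total_calls = sum(call_counts.values())
    if total_calls == 0:
        return 0.0
    return proc_runtime_total / total_calls * call_counts[site]
```

For example, with an accumulated procedure runtime of 0.9 s and call-site execution counts of 30 and 60, `estimate_call_site` attributes one third of the runtime to the first call site and two thirds to the second.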

3 Experiments

This section discusses two experiments to validate the described benchmark estimator approach. The first experiment examines the Red-Black checkerboard algorithm ([17]), which is a pointwise relaxation method, on the iPSC/860 hypercube. The second experiment evaluates the JACOBI relaxation iterative method⁴ ([17]) on the MasPar MP-1.

⁴ This method is used to approximate the solution of a partial differential equation discretized on a grid.

The current benchmark estimator prototype handles only sequential programs for a single Intel i860 processor. Therefore, the first experiment relates to the automatic estimation of purely sequential kernels. Fig. 3 shows the measured and estimated runtimes of the Red-Black Relaxation code. The benchmark estimator was used to automatically estimate the runtime of this code for varying problem sizes. It can be clearly seen that the estimation accuracy consistently deteriorates with increasing problem size. For a problem size of N = 1024 the error rate is about 27 %. The main reason for this result is that the pre-measured runtimes of the rather small kernels in the kernel library do not model the global kernel measurement effects (cf. Section 2.1). In particular, the inaccurate modeling of the Intel i860 cache and CPU pipeline behavior induces this deviation. Using larger kernels may improve these results.

[Figure 3: Measured versus predicted runtimes for the Red-Black Relaxation code. Curves: estimated, measured; runtime axis in sec.]

The second experiment was obtained on the MasPar MP-1 with 1024 processors for the JACOBI relaxation iterative method. The performance of a parallel JACOBI program written in MasPar Fortran is measured and predicted for varying problem sizes. Based on the described estimation approach it is possible to predict the runtime of the JACOBI program within 1 % of the actual result.

Table 1: Various kernels and program segments

nr  Kernel
1   F(2:N-1,2:N-1) = U(2:N-1,2:N-1)
2   F(2:N-1,2:N-1) = OMEGA * U(2:N-1,2:N-1)
3   F(2:N-1,2:N-1) = U(2:N-1,2:N-1) + U(1:N-2,2:N-1) + U(3:N,2:N-1) + U(2:N-1,1:N-2) + U(2:N-1,3:N)
4   F(2:N-1,2:N-1) = (1-OMEGA) * U(2:N-1,2:N-1) + OMEGA*0.25*(F(2:N-1,2:N-1) + U(1:N-2,2:N-1) + U(3:N,2:N-1) + U(2:N-1,3:N) + U(2:N-1,1:N-2))
5   JACOBI program

The first three entries in Table 1 illustrate library kernels as measured for the MasPar MP-1. The fourth entry displays the code pattern of the main JACOBI relaxation statement. The last entry represents the entire JACOBI program. Fig. 4 illustrates the measured versus predicted runtimes for each of the kernels in Table 1 in the same order, where Kernel-1 corresponds to Fig. 4a, ..., and Kernel-5 to Fig. 4e.

The measured runtime of Kernel-1, which is a Fortran 90 array assignment operation, is plotted as a quadratic function. This behavior is approximated by a step-wise linear function (see dashed function). Kernel-2 is very similar to Kernel-1, but includes a scalar multiplication. This additional operation implies a doubling of the runtime, because it is processed on the frontend processor of the MP-1. Kernel-3 represents a frequently found neighbor computation stencil. The corresponding cascade runtime function in Fig. 4c is due to the wrap-around memory hierarchy of the MasPar MP-1. By approximating this function with a step-wise linear function using the chi-square fitting technique, an estimation accuracy of more than 95 % is achieved. The non-constant stepsize between the step-wise linear functions is modeled by a linear function as outlined in Section 2.2. Kernel-4 is a combination of Kernel-1, 2 and 3. The resulting performance is therefore modeled as a linear combination of these sub-kernels plus two additional scalar operation kernels. The difference between actual and predicted runtime is surprisingly small (within 5 % in the worst case). In Fig. 4e the performance of the entire JACOBI program is visualized, showing a worst case deviation of less than 6 % for the largest data size measured. From this plot it can be clearly seen that the estimation accuracy worsens with increasing problem size. However, for this example the errors of the pre-measured kernel runtime functions compensate each other: Fig. 4c shows an under-estimation while all others over-estimate the actual result. There are other experiments where such good estimation results could not be achieved. This is just another sign of how difficult it is to fine tune the kernel library for a given architecture.

[Figure 4: Measured versus predicted runtimes for the JACOBI program. Panels a to e; curves: measured, estimated; runtime axes in 10^-3 sec.]

4 Conclusion

This paper analyzes the popular benchmark performance estimation approach by evaluating a prototype implementation. Based on experiments done on both the MasPar MP-1 and the Intel i860, the following observations were made:

portability: Contrary to popular belief, there is only little evidence that larger portions of the kernel library are portable across a variety of architectures. In many cases it was necessary to investigate precisely the target compiler's code restructuring policies, register allocation, common subexpression elimination and other strategies down to the assembly code level in order to tune the kernel library.

local and global kernel measurement effects: Two major problems were detected when building a benchmark kernel library. First, there is the performance effect purely local to a specific kernel, such as the register allocation impact for complex processors. Second, pre-measured kernels do not sufficiently reflect their occurrence in a real-world program because of the different global cache and CPU pipeline behavior. This is a particularly serious disadvantage of the benchmark method if used for MIMD machines with complex processors.

pattern matching: In order to build a reasonably accurate benchmark performance estimator it is vital to go beyond primitive kernels. This requires larger kernels to be detected in a parallel program, which in turn raises the question of pattern matching for such kernels. This
appears to be an open research topic.

time effort: More than two man-years of work were required to create a kernel library covering a small set of application programs. In order to achieve reasonably accurate performance estimates it was necessary to study the assembly code of kernels and programs.

estimation accuracy: The estimation accuracy depends severely on the target architecture, the target compiler and the quality of the kernel library. This holds in particular for complex parallel systems whose processors include CPU pipelines and caches, such as the Intel i860. However, it seems that SIMD machines with relatively simple processors are reasonably well suited for this approach. The experiments done on the MasPar MP-1 (cf. Section 3) demonstrate that

Even though a good part of this paper reports on the disadvantages of performance estimation based on benchmarking, this method seems to be the best known way of deriving concrete and realistic estimated runtime information. In theory it greatly simplifies the task of modeling target compilers and architectures by simply measuring the kernels without worrying about details of the underlying system. In practice this method requires substantial research before being applicable to larger classes of real-world programs.

The current prototype implementation will be extended by a pattern matcher for larger code patterns in future work. Furthermore, the kernel library will be extended and validated for additional architectures.

References

[1] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques and Tools. Series in Computer Science. Addison Wesley, 1988.
[2] V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer. A Static Performance Estimator to Guide Data Partitioning Decisions. In 3rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Williamsburg, VA, April.
[3] B. Chapman, S. Benkner, R. Blasko, P. Brezany, M. Egg, T. Fahringer, H.M. Gerndt, J. Hulman, B. Knaus, P. Kutschera, H. Moritsch, A. Schwald, V. Sipkova, and H.P. Zima. VIENNA FORTRAN Compilation System - Version 1 - User's Guide, January 1993.
[4] B. Chapman, T. Fahringer, and H. Zima. Automatic Support for Data Distribution. In Proc. of the Sixth Annual Workshop on Languages and Compilers for Parallel Computing, Portland, Oregon, Aug. 1993.
[5] T. Fahringer. Automatic Performance Prediction for Parallel Programs on Massively Parallel Computers. PhD thesis, University of Vienna, Department of Software Technology and Parallel Systems, October 1993.
[6] T. Fahringer. The Weight Finder, An Advanced Profiler for Fortran Programs. In Automatic Parallelization, New Approaches to Code Generation, Data Distribution, and Performance Prediction. Vieweg Advanced Studies in Computer Science, Verlag Vieweg, Wiesbaden, Germany, March 1993.
[7] T. Fahringer and H. Zima. A Static Parameter based Performance Prediction Tool for Parallel Programs. Invited paper, in Proc. of the 7th ACM International Conference on Supercomputing 1993, Tokyo, Japan, July 1993.
[8] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu. Fortran D Language Specification. Technical Report TR90-141, Dept. of Computer Science, Rice University, December 1990.
[9] S.L. Graham, P.B. Kessler, and M.K. McKusick. gprof: A Call Graph Execution Profiler. In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, pages 120-126, June 1982. SIGPLAN Notices, Vol. 17, No. 6.
[10] S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, and C. Tseng. An overview of the Fortran D programming system. In Proc. of the 4th Workshop on Languages and Compilers for Parallel Computing, Santa Clara, CA, Aug. 1991.
[11] Intel Supercomputer Systems Division, Beaverton, OR. iPSC/860 Fortran Compiler User's Guide, March 1992.
[12] C.W. Keßler. Knowledge-Based Automatic Parallelization by Pattern Recognition. In Proceedings (Preprint), International Workshop on Automatic Parallelization 1993. Universität des Saarlandes, Saarbrücken, Germany, March 1993.
[13] D. Loveman. High Performance Fortran: Proposal, January 1992.
[14] B. MacDonald. Predicting the Execution Time of Sequential Scientific Codes. In Proceedings (Preprint), International Workshop on Automatic Parallelization 1993. Universität des Saarlandes, Saarbrücken, Germany, March 1993.
[15] MasPar Computer Corporation, Sunnyvale, CA. MasPar Fortran Reference Manual, July 1992. Software Version 2, Document Part Number 933-, Revision A5.
[16] MasPar Computer Corporation, Sunnyvale, CA. MasPar System Overview, July 1992. Document Part Number 933-1, Revision A5.
[17] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1988.
[18] V. Sarkar. Partitioning and Scheduling Parallel Programs for Multiprocessors. The MIT Press, Cambridge, Massachusetts, 1989.
[19] H. Zima, P. Brezany, B. Chapman, P. Mehrotra, and A. Schwald. Vienna Fortran - a language specification. Technical report, ICASE, Hampton, VA, 1992. ICASE Internal Report 21.
