An Evaluation of Automatic and Interactive Parallel Programming Tools

Doreen Y. Cheng
Computer Sciences Corp., NASA Ames Research Center, Moffett Field, CA 94035

Douglas M. Pase
Formerly at NASA (CSC); Cray Research, Inc., 655F Lone Oak Dr., Eagan, MN

1991 ACM

Abstract

We have evaluated two automatic tools and one interactive tool using typical NAS applications on a CRAY Y-MP. We found that automatic tools produce insufficient performance improvement. Interactive tools can produce better performance because they help users find and eliminate false dependencies. However, simple-minded code transformation has resulted in a significant performance degradation which cancels the speedup obtainable by parallelization. Therefore, tools must perform machine-specific optimizations. The benchmarks contain a large number of small to medium size loops, which limits the performance achievable by parallelizing only loops. Features to assess whether a section of code should be parallelized, vectorized, or left sequential are also necessary.

1 Introduction

By the year 2000, the Numerical Aerodynamics Simulation (NAS) program at NASA Ames Research Center will provide scientists with supercomputers having parallel architectures. To make the power of parallel processing available to scientists, it will be necessary to provide a programming environment that will enable them to focus on physical and mathematical modeling. In addition, it will be necessary to provide tools that will help scientists produce correct and efficient code for a variety of supercomputers.

Many parallel programming tools have been developed by research institutes and industry [1]. Previously reported evaluations of parallel programming languages and tools have used small synthetic benchmarks [2]. The benchmarks and hardware used have quite different characteristics from those at NAS. Parallelization depends on characteristics of both applications and target systems. We have evaluated several current approaches to parallel programming tools by using typical NAS applications and hardware, to gain insight into the proper direction for research into, and design and development of, future NAS parallel programming environments.

The functions of existing tools fall into three categories: tools that convert a sequential program into a parallel one, tools that assist in the creation of a parallel program, and tools that aid parallel debugging and performance optimization. Two approaches have been taken in building tools that convert sequential programs to parallel: automatic and interactive. Automatic tools rely on compilers to convert a sequential program into a parallel one in a batch processing fashion. Directives can be inserted manually into a program before compilation, but the tools do not help users find false dependencies. A false dependency is a dependence that will not actually occur during program execution for a given data set. Interactive tools require user interaction to guide the compiler during the course of parallelizing each section of the program. They provide facilities for analyzing dependencies.

One requirement for the NAS parallel programming environment is to minimize the number of changes in the current programming practices of scientists. If either automatic or interactive tools could parallelize existing programs with a reasonable amount of user effort and deliver satisfactory performance, scientists would be able to continue writing sequential applications and use the tools to convert them into parallel ones. We planned to evaluate a number of existing tools in order of increasing user effort: automatic tools, interactive tools, and tools for writing new programs. This paper reports the results of evaluating tools in the first two categories.

The evaluation was performed on a CRAY Y-MP using benchmarks that are representative of current NAS applications. The CRAY Y-MP was chosen because the balance between its processor speed and interconnection performance is similar to what we expect from future NAS computers. Table 1 lists the system software used in the experiment. The CRAY Autotasking facility, fpp, and KAP/CRAY [4] from Kuck & Associates were chosen for the evaluation of automatic tools. The Forge program from Pacific Sierra Research was the most interactive conversion tool we could find that ran on the Y-MP.

We have found that automatic tools produce insufficient performance improvement for our applications. Interactive tools allow users to conveniently access a set of integrated tools. In some cases, interactive tools can produce better performance because they help users find and eliminate false dependencies. The current version of Forge flags a converted loop as parallel by simply using a DO ALL compiler directive. For our applications, this approach has resulted in a significant performance degradation which cancels the potential speedup obtainable by parallelization. For programs optimized for vector machines the degradation is the most severe. To achieve high performance, it is necessary for tools to perform optimization specific to a target machine in addition to parallelizing the program.

We have also found that typical NAS programs contain a large number of small to medium size loops, which limits the performance achievable by parallelizing only loops. Based on these results, we believe that tools are needed for designing parallel algorithms and writing parallel programs. Features to assess whether a section of code should be parallelized, vectorized, or left sequential are necessary to reduce the effort of user-directed optimization. The current version of Forge has been significantly improved based on these results.

Section 2 of this paper describes the evaluation of CRAY fpp and KAP/CRAY. Section 3 presents the results of the Forge evaluation. Section 4 explains the observed phenomena, and Section 5 presents conclusions and directions for future work.

2 Evaluation of Automatic Tools

At present, both fpp and KAP can only parallelize program loops. The first experiment was designed to test the quality of automatic parallelization. For this reason, no compiler directives were inserted into the programs. We selected twenty-five programs from typical NAS private codes and public-domain benchmarks for the evaluation. The objectives of the evaluation were to find out how much performance improvement could be obtained by automatic parallelization tools and how much extra compilation time was required. To compare parallel performance with the best sequential performance, the enhanced vectorization capability of these tools was studied as well.

2.1 Benchmarks

We used twenty-five benchmarks: thirteen from the Perfect Benchmarks [6], ten NAS private programs, the Livermore Loops, and the NAS kernels. They represent typical applications currently running on NAS supercomputers. The characteristics of the programs are listed in Table 2. Memory requirements varied from 11 Kwords to 54 Mwords (8 bytes/word). The total number of floating point operations required by these programs varied from 58 million to 78 billion. The I/O requirements of the benchmarks were very low, with only a few exceptions. A large solid state storage device was used to reduce the time spent doing I/O.

Each benchmark was first compiled using the default vectorization provided by the CRAY compiler. This version was the base for comparison. To study vectorization, three more versions of each benchmark were generated using different preprocessors. The first version used fpp. The second version used KAP. The last version used both fpp and KAP. The performance of these versions was measured using a single processor on a dedicated Y-MP. Three more versions of each program were generated to study parallelization: using fpp, using KAP, and using both preprocessors. The performance was measured on a dedicated Y-MP using 1, 4, and 8 processors.

2.2 Results

Figures 1-8 summarize the results of evaluating the automatic tools. Each benchmark is represented by a number, which is the same number that appears in the first column of Table 2. The data for each benchmark was plotted in Figures 1-7 according to this numbering (the X-axis). Figure 1 shows the performance of default vectorization in MFLOPs. Figure 2 shows the corresponding compilation time in seconds.

Figure 3 is a graph of the speedup obtained by enhanced vectorization. The speedup is calculated by dividing the elapsed time for the default vectorization version by the execution time of the corresponding enhanced vectorization version. Figure 4 shows the extra compilation time required by enhanced vectorization. In Figure 4, $C_v$ is the compilation time used by enhanced vectorization and $C_d$ is the compilation time used by default vectorization. Figures 3 and 4 show that only a few benchmarks benefit significantly from the extra analysis and transformations provided by enhanced vectorization. For many benchmarks, the default vectorization offered by the compiler provides as good or better performance than the enhanced vectorization offered by the preprocessors, and with less compilation overhead. This is because the user-performed optimization for the CRAY vector unit before the evaluation has made most loops vectorizable by the simple analysis and transformations performed by the CRAY compiler.

The performance of a parallel program was compared to the best sequential counterpart implementing the same algorithm. For this reason, the default vectorization version was used as the base instead of the parallel version running on a single processor. Figures 5 and 6 display the speedup of the parallelized versions running on 4 and 8 processors. The speedup is calculated by dividing the elapsed time required by the parallel version ($T_n$, where $n$ is the number of processors used) into the time used by the default vector version ($T_v$). Figure 7 compares the compilation costs ($C_p$ is the compilation time used by parallelization and $C_d$ is the time used by default vectorization).

More programs were improved by parallelization than by enhanced vectorization. On the other hand, the compilation time was lengthened by a factor of 3 on average. The majority of the programs ran less than 5% faster; a few even slowed down. Executing on 8 processors did lead to higher speedup. However, the best efficiency dropped to 5% from 75% on 4 processors. (The efficiency is defined as the speedup divided by the number of processors used.) The performance is poor because the programs do not contain enough large grain loops that are parallelizable by automatic tools. Loops parallelizable by these tools are also vectorizable. The potential performance improvement is largely consumed by vectorization, and not much is left for parallel execution.

To further illustrate the relation between vectorization and parallelization, Figure 8 plots the speedup obtained by parallel execution on 4 processors against the percentage of vectorization in the benchmarks. (Executions on 8 processors have similar characteristics.) The results seem to indicate that programs that do not vectorize well do not parallelize well either. Furthermore, higher than 7% vectorization is required to obtain reasonable speedup on multiple processors. (This is one reason why the programs which improved significantly under enhanced vectorization did not improve as much as might be expected under parallelization.) The dependence of parallelization on the percentage of vectorization arises partly because the tools only try to parallelize loops and the analysis used for parallelization is similar to that for vectorization. As a result, programs that contain large vectorizable loops obtain higher speedup on multiple processors than those with small vectorizable loops.

This study shows that current state-of-the-art automatic tools deliver unsatisfactory performance on multiple processors for these NAS benchmarks. One possible reason is the lack of knowledge at the algorithm level that can be used to create large grain parallelism. A natural question is whether interactive tools can do better. This led to the second experiment: the evaluation of Forge.

3 Evaluation of Interactive Tools

Just like the automatic tools we studied, Forge only parallelizes loops, although it can be used to analyze dependencies between any two parts of a program. The objectives of the evaluation were to find out how much performance improvement a state-of-the-art interactive tool could produce for the NAS applications, how much user effort would be required, and how to improve such tools if they are not adequate. The results are biased by the benchmark programs and the hardware platform used, and represent characteristics of current NAS applications and facilities.

3.1 Benchmarks

The evaluation used five programs. Four of them, NAS4, NAS7, NAS8, and NAS10, were selected from the ten NAS private programs used in evaluating the automatic tools. ARC3D was from the Perfect Benchmarks. NAS4 and NAS7 show a low MFLOP rate using default vectorization (53 and 37 MFLOPs). Only a low percentage of the code is vectorizable (63% and 59%). Even using enhanced vectorization or automatic parallelization (running on 4 processors), the improvement is less than 3%. NAS8 and NAS10, on the other hand, show a high MFLOP rate (179 and 134 MFLOPs). A high percentage of these programs is vectorizable (both 96%). Automatic parallelization speeds up the programs by a factor of 2.5 (the largest speedup) on 4 processors.

During the evaluation, Forge was first used to generate the run time profile of a program. Then all loops in the program were analyzed. When Forge pointed out that dependencies had prohibited parallelization, its database query facility was used to analyze the dependencies and determine whether they could be ignored. The false dependencies were removed. Finally, Forge was used to generate a program which parallelized all loops that were parallelizable.
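To make the notion of a false dependency concrete, the following is a minimal sketch in Python rather than the Fortran of the benchmarks. It is not taken from the benchmarks or from Forge; the loop, the function names `scatter` and `dependence_is_false`, and the sample data are all hypothetical. The loop writes through an index array, so a compiler that cannot see the index values must assume that two iterations may write the same element. If the index values happen to be distinct for the data set at hand, the iterations are independent and may run in any order.

```python
def scatter(y, idx, n_out, order):
    """Run the loop body x[idx[i]] = y[i] for iterations i in the given order."""
    x = [0.0] * n_out
    for i in order:
        x[idx[i]] = y[i]
    return x


def dependence_is_false(idx):
    """Return True when no two iterations write the same element of x.

    With distinct write indices, the output dependence a compiler must
    assume never occurs for this data set, so iteration order is irrelevant.
    """
    return len(set(idx)) == len(idx)


# Hypothetical data: one index array is a permutation, the other repeats 2.
y = [1.0, 2.0, 3.0, 4.0]
forward = [0, 1, 2, 3]
backward = forward[::-1]
perm_idx = [2, 0, 3, 1]
dup_idx = [2, 0, 2, 1]

# Any execution order gives the same result when the dependence is false.
same_for_perm = scatter(y, perm_idx, 4, forward) == scatter(y, perm_idx, 4, backward)
# With a repeated index the result depends on which iteration runs last.
same_for_dup = scatter(y, dup_idx, 4, forward) == scatter(y, dup_idx, 4, backward)
```

Only the user, or a run time check, can know that the index array holds distinct values, which is the kind of application-level knowledge an interactive tool has to obtain from the user.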

3.2 Results

The results are summarized in Tables 3-12. Each row of a table presents the performance of one version of the program. The first column lists the name of the version. Each entry in the other columns of Tables 3-7 contains three numbers. The first number is the elapsed time in seconds. The second is the speedup of a program running on multiple processors relative to the performance of the same version running on one processor. The third number is the multiprocessor efficiency.

The performance obtained by the automatic parallelization tool, fpp (the entries named "fpp" in Tables 3-12), was used as the base to measure the improvement made by Forge. This data was re-measured since the system software had been changed (Table 1). Up to six additional versions of each benchmark were studied to understand the effect of code transformations and granularity on performance under the condition of fixed data set size.

Version 1 (named "Forge") was the parallel program generated by Forge and compiled without invoking fpp. All parallelizable loops in this version were parallelized. The other versions were manually derived from this version. Version 2 (named "Forge No Dep") parallelized the same loops that were parallelized by fpp. The next four versions were generated for improving performance. Version 3 (named "Forge T > 1%") parallelized only the loops whose execution time was greater than 1% of the total elapsed time. Only loops which consumed more than 1% of the total execution time were parallelized in the fourth version (named "Forge T > 1%"). These three versions were compiled without using fpp. In the next two versions, Forge was used to optimize the loops with false dependencies while fpp was used for the rest. The two versions differ in how the loops with false dependencies were optimized. Version 5 (named "Forge + fpp Parallel") parallelizes the outer loops and vectorizes the inner loops when possible, whereas Version 6 (named "Forge + fpp Vector") vectorizes the loops. No versions 5 or 6 were produced for NAS8 and NAS10 because they cannot be further improved by Forge, as will be shown.

The execution time of the versions of NAS8 and NAS10 generated by Forge is intolerably long. To reduce the time needed on a dedicated machine, the data set size was reduced (indicated by the label "Small" in the tables). The effect of data size is reported in a separate paper [7]. The performance of the version optimized by fpp using full size data is included for comparison. The performance was measured on a dedicated Y-MP using 1, 4, and 8 processors. Tables 8-12 tabulate the performance improvement of different versions over the version optimized by fpp.

Based on the data presented in Tables 3-12, we have drawn the following conclusions. The first is that, in addition to parallelizing code, tools must perform optimizations specific to a target machine in order to produce reasonable performance. The second is that interactive tools must provide more assistance than is presently available to ease user-performed optimization. The following paragraphs elaborate on these two points.

Table 13 compares the single processor performance of the fpp Version and the Forge No Dep Version to the performance obtained by using only the default vectorization provided by the CRAY compiler. Table 13 shows that for ARC3D, NAS4, and NAS7, both preprocessors degrade the performance; Forge degrades it more than fpp. For NAS8 and NAS10, fpp improves the performance while Forge degrades it by a factor of 5 to 9. The data in this table provides a reference for the data shown in Table 14.

Table 14 compares the performance of the Forge No Dep Version to the performance of the fpp Version. It shows that when the same loops are parallelized, the Forge No Dep Version runs significantly slower on a single processor than the fpp Version. For highly vectorized programs, such as NAS8 and NAS10, the degradation is as high as a factor of 7 to 10. Since both versions parallelize the same loops, the performance difference reflects the quality of code generation. Table 15 gives a few examples of the difference in code generation. When Forge parallelizes a loop, it simply inserts "DO ALL" directives. It does not optimize for the CRAY architecture, especially for the vector units. The simple-minded code generation significantly decreases the performance.

Table 16 compares the performance of the Forge Version, which parallelizes all the parallelizable loops, to the performance of the fpp Version. Table 17 compares the best performance of each benchmark (obtained using tools) to the fpp Version. The results indicate that the performance of the Forge Version of ARC3D, NAS4, and NAS7 exceeds the performance of the respective fpp Version when multiple processors are used. Only after additional user effort in optimization does the single processor performance of these three programs exceed the performance of the fpp counterparts. However, the degradation in NAS8 and NAS10 is hardly reduced even in the best versions.

The degradation caused by simple-minded code generation cancels the speedup obtainable through parallelization. Only when the grain size of the loops with false dependencies is large enough can the speedup obtained by the extra parallelization overcome the degradation caused by poor code generation. In NAS4 and NAS7, most parallelizable loops contain false dependencies. In addition, these loops have a much larger grain size than the loops with no dependencies. Forge aids the user in parallelizing loops with false dependencies, which are loops not parallelized by automatic tools. For these two programs the performance is dominated by the speedup obtainable by parallelization. Therefore, Forge obtains better performance on multiple processors than fpp. In the case of ARC3D, the two types of loops and their granularity are comparable. The performance improvement from using Forge is thus less pronounced for this program. More than 95% of the parallelizable loops in programs NAS8 and NAS10 are dependence free. The few loops with false dependencies in these programs are very small (less than 4% of execution time). For these programs, the degradation introduced by code generation dominates the performance, and Forge cannot improve the performance over fpp. The degradation due to simple code generation is therefore most pronounced for these programs.

The above analysis shows that parallelization alone is insufficient; parallel tools must perform machine-specific optimizations. One possible solution is to let interactive tools insert directives that convey only dependency information and to let vendor-provided preprocessors and/or compilers perform optimization.

Forge provides an environment in which users can conveniently access tools that guide and assist parallelization. However, even with all the help, it is still quite difficult to discover false dependencies and to improve performance. For this kind of tool to be useful for scientists, more functions are needed.

Improving the performance of a program can be difficult and tedious because there is no easy way to choose the loops to be parallelized. If no overhead were involved, the more loops that were parallelized, the better the performance would be. With overhead, however, parallelizing small loops degrades the performance. In the experiment, many different versions of each benchmark had to be derived in search of better performance. The following two paragraphs show the importance of tools that help the user make the tradeoffs between parallelism and granularity.

One attempt in searching for better performance was to find a simple measure which allows us to find loops with large enough granularity. Forge uses the percentage of execution time consumed by a loop, combined with the average loop length, to help users select the loops to be parallelized. The percentage of time is defined as $100 \times T_{loop} / T_{total}$, where $T_{loop}$ is the execution time of a loop and $T_{total}$ is the elapsed time of the program. Table 18 shows the ratio of the execution time of the versions with different granularity (measured by percentage of time) to the execution time of the Forge Version, which parallelizes all the parallelizable loops. It would be expected that when only larger grain loops are parallelized, the performance should be better. This is the case for NAS4 and NAS8. The rest of the benchmarks, however, show little difference. The reason is that the percentage of execution time measures the amount of work contained in a loop relative to the work of the entire program. It is not a good measure of the granularity referred to in parallelization. A better measure of granularity is the ratio of the number of operations in a loop that perform useful work to the number of operations added in order to execute the loop concurrently (overhead). The overhead can be introduced by both the system and code transformation. The system overhead includes the time needed to create, suspend, schedule, and terminate concurrent tasks, and the time spent in task communication and synchronization. The overhead introduced by code transformation is due to the sequential code added to a loop when it is parallelized. Since overhead is system and transformation dependent, it is difficult to measure at user level; tools must help.

The next attempt at optimizing the programs was to use the strengths of both Forge and fpp. In this approach, Forge was used to analyze a program and fpp was used to generate the code. Again, the results were different for different programs. Two of them obtained better performance by vectorizing the loops with false dependencies, and one of them by parallelizing such loops. Better performance can be obtained only after considering the tradeoffs between grain size and parallel overhead. For this reason, it is important for interactive tools to help users make the tradeoffs. Users should be able to query whether a section of code should be parallelized, vectorized, or left sequential. At the very least, a tool should give a performance estimate for execution on the target machine.

The database query facilities provided by Forge are quite useful in discovering false dependencies. However, it is rather difficult for people who are not familiar with the concepts and terminology of dependency analysis to use the facility. Messages such as "use - use conflict" are not very useful to users at the algorithm level. Questions like "Do I+N and J+M have overlapped range?" and "Is the value of M set at each iteration of the loop before it is used?" make it much easier for users to discover false dependencies. Furthermore, use-def chain analysis may lead to the conclusion that a dependency exists where it does not. Examples of this kind can be found when values are read in from a file or are indexed by the values of an array. In these cases, asking questions at the application level may lead to quick discovery of a false dependence.

4 Discussion

To further understand the behavior of the benchmarks, the characteristics of the loops they contain are summarized in Tables 19 and 20. Table 19 lists the total number of loops, the number of loops that executed using the given input, and the number of the executed loops that can be parallelized. The granularity of the loops is roughly characterized by the percentage of execution time. Also listed in the table are the number of parallelized loops that have no dependencies and the number of loops that have false dependencies. Table 20 shows another view of the data in Table 19.

Although the programs have been optimized for the CRAY vector unit, only 56% to 90% of the executed loops are parallelizable. Except for NAS4, more than 2/3 of the loops are very small - execution time is less than 1% of the total time. Over 95% of the loops consume less than 1% of the elapsed time. Except for NAS7, the largest loop consumes less than 31% of the time. The small to medium grain size of most loops is the reason why parallelization by both automatic and interactive tools does not achieve high speedup even on 4 processors.

Further study using different problem sizes [1] shows that all benchmarks contain serial loops whose bounds are proportional to problem size. When the grain size of these loops is large enough, the speedup on multiple processors remains constant in spite of increasing problem size. Therefore, parallelizing only loops is not sufficient; tools should support parallel algorithm design and program construction.

5 Conclusions and Future Work

To gain insight into the proper direction for research into, and design and development of, future NAS parallel programming environments, we have performed two experiments. In the first experiment, we used twenty-five typical NAS applications to evaluate two state-of-the-art automatic tools: CRAY fpp and KAP/CRAY. We found that most benchmarks are not significantly improved by automatic tools because they do not spend enough time executing the large grain loops that are parallelizable by these tools. Loops parallelizable by these tools are also vectorizable. It is possible that the potential performance improvement is largely consumed by vectorization and not much is left for parallel execution.

In the second experiment, we used five typical NAS applications to evaluate a state-of-the-art interactive tool: Forge. We found that interactive tools can produce better performance in some cases because they help users find and eliminate false dependencies. The current version of Forge flags a converted loop as parallel by simply using a DO ALL compiler directive. For our applications, this approach has resulted in a significant performance degradation which cancels the potential speedup obtainable by parallelization. For programs optimized for vector machines the degradation is the most severe. To achieve high performance, it is necessary for tools to perform optimization specific to a target machine in addition to parallelizing the program.

Forge provides users with a convenient environment to parallelize existing programs. Integrating the following features can make such tools more useful. First, tools should generate code that takes advantage of the target machine architecture. Second, tools should be provided to evaluate the tradeoffs between the granularity of a section of code and the overhead introduced by either vectorization or parallelization. Third, users should be able to query whether a section of code should be parallelized, vectorized, or left sequential on particular hardware. Fourth, tools should be provided for developing new parallel algorithms and programs, since parallelizing only loops is not sufficient. In addition, the messages generated during the interactions between a tool and a user should be understandable by application scientists.

Our analysis shows that the benchmarks contain a large number of small to medium size loops, which limits the performance achievable by parallelizing only loops. Based on these results, we believe that tools are needed for designing parallel algorithms and writing parallel programs. The data obtained depends on the system characteristics of the CRAY Y-MP. The behavior may be quite different on machines that introduce very small overhead to parallel execution.
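The two granularity measures discussed in Section 3.2 can be contrasted with a small sketch. This is an illustration under stated assumptions, not a description of how Forge or fpp works: the `LoopProfile` record, the function names, the thresholds, and all numbers are hypothetical, and the operation counts are assumed to come from some profiling tool.

```python
from dataclasses import dataclass


@dataclass
class LoopProfile:
    """Hypothetical per-loop profile data."""
    name: str
    exec_time: float    # T_loop: execution time of the loop, in seconds
    useful_ops: int     # operations in the loop that perform useful work
    overhead_ops: int   # operations added to execute the loop concurrently


def percent_of_time(loop, total_time):
    """Percentage-of-time measure: 100 * T_loop / T_total."""
    return 100.0 * loop.exec_time / total_time


def work_to_overhead(loop):
    """Granularity as the ratio of useful work to parallel overhead."""
    if loop.overhead_ops == 0:
        return float("inf")
    return loop.useful_ops / loop.overhead_ops


def select_by_percent(loops, total_time, threshold_pct):
    """Names of loops consuming more than threshold_pct of elapsed time."""
    return [l.name for l in loops if percent_of_time(l, total_time) > threshold_pct]


def select_by_ratio(loops, min_ratio):
    """Names of loops whose work-to-overhead ratio is at least min_ratio."""
    return [l.name for l in loops if work_to_overhead(l) >= min_ratio]


# Hypothetical program with T_total = 100 s. Loop "a" takes much of the
# run time but is entered many times with little work per entry, so its
# parallel overhead is large relative to its useful work.
T_TOTAL = 100.0
loops = [
    LoopProfile("a", exec_time=30.0, useful_ops=3_000_000, overhead_ops=2_000_000),
    LoopProfile("b", exec_time=5.0, useful_ops=500_000, overhead_ops=5_000),
]

by_percent = select_by_percent(loops, T_TOTAL, threshold_pct=10.0)
by_ratio = select_by_ratio(loops, min_ratio=10.0)
```

In this made-up case the two measures choose different loops, which is consistent with the observation that percentage of time says how much work a loop holds relative to the program, not whether that work outweighs the cost of running it concurrently.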

The evaluation of parallel tools and environments is an ongoing process at NAS. Evaluation of tools for different machines, such as the Intel iPSC/860, and evaluation of tools for writing new programs have been scheduled. Benchmarks that reflect the characteristics of future NAS applications will be used in future evaluations. Based on the experience gained, the NAS parallel programming environment will be designed and developed.

Acknowledgements

The authors would like to express sincere appreciation to Pacific Sierra Research and Kuck and Associates, Inc. for their generous support of this work and prompt response to suggestions and comments. We would like to thank Katherine E. Fletcher for the data she collected in studying automatic tools and Dr. Jeffrey T. Deutsch for his critical review of the paper and his valuable suggestions for future work.

References

[1] Doreen Y. Cheng, "A Survey of Parallel Programming Tools," NASA Report RND-91-5, NASA Ames Research Center, Feb. 22, 1991.
[2] Alan H. Karp and Robert G. Babb II, "A Comparison of 12 Parallel Fortran Dialects," IEEE Software, Sept. 1988.
[3] "CF77 Compiling System, Volume 4: Parallel Processing Guide," SG-97, CRAY Research, Inc., Mendota Heights, MN, 1990.
[4] "KAP/CRAY User's Guide," Kuck & Associates, Inc., Champaign, IL, 1989.
[5] "The Forge User's Guide Version 71," Pacific-Sierra Research, Dec. 1990.
[6] L. Kipp, "Perfect Benchmarks Documentation, Suite 1," CSRD, University of Illinois at Urbana-Champaign, IL, 1990.
[7] Doreen Y. Cheng, "Forge Evaluation and An Ideal Parallel Programming Environment," submitted to the IEEE, ACM Workshop on Parallel Programming Tools, Hawaii, 1991.

[Figure 1 - Default Vectorization Performance (MFLOPS per benchmark)]
[Figure 2 - Compile Times with Default Vectorization (seconds per benchmark)]
[Figure 3 - Speedup from Enhanced Vectorization (series: fpp, KAP, KAP+fpp)]
[Figure 4 - Enhanced Vector Compilation Time Expense, Cv/Cd (series: fpp, KAP, KAP+fpp)]

[Figure 5 - 4 CPU Speedups, Tv/T (series: fpp, KAP, KAP+fpp)]
[Figure 6 - 8 CPU Speedups, Tv/T (series: fpp, KAP, KAP+fpp)]
[Figure 7 - Parallel Compilation Expense, Cp/Cd (series: fpp, KAP, KAP+fpp)]
[Figure 8 - 4 CPU Speedup vs. % Vectorization (series: fpp, KAP, KAP+fpp)]

[Table 1 - System Software Used. Columns: Automatic, Interactive. Rows: UNICOS, CF77, fmp, fpp (options -Wd-e46ijt in both experiments), KAP (not used in the interactive experiment), Forge (not used in the automatic experiment)]

[Table 3 - Elapsed time (seconds), speedup, and multiprocessor efficiency of ARC3D. Columns: 1P, 4P, 8P. Rows: fpp, Forge, Forge No Dep, Forge T > 1%, Forge T > 1%, Forge + fpp Vector, Forge + fpp Parallel]

[Table 2 - Benchmark Characteristics. One row per benchmark (NAS1 through NAS10, the Livermore Loops, the NAS kernels, and the thirteen Perfect Benchmarks), followed by summary rows: Average, Minimum, Maximum, Standard Dev., Total. Columns: No., Name, Source Lines, Size (MW), Floating Point Adds, Floating Point Multiplies, Floating Point Reciprocals, Floating Point Operations, Data Transferred (Mbytes)]

Tables 4 through 7 have columns 1P, 4P, 8P. Elapsed times and speedups are not recoverable; the surviving efficiency fragments are listed per row, in source order.

Table 4 - Elapsed time, speedup, and multiprocessor efficiency of NAS4
fpp: % 27% 14%
Forge: % 83% 5%
Forge No Dep: % 33% 17%
Forge T > 1%: % 47% 28%
Forge + fpp Vector: % 55% 24%
Forge + fpp Parallel: % 85% 54%

Table 5 - Elapsed time, speedup, and multiprocessor efficiency of NAS7
fpp: % 25% 13%
Forge: % 51% 27%
Forge No Dep: % 25% 13%
Forge T > 1%: % 52% 27%
Forge + fpp Vector: % 24% 12%
Forge + fpp Parallel: % 5% 27%

Table 6 - Elapsed time, speedup, and multiprocessor efficiency of NAS8
fpp: % 68% 47%
fpp Small: % 66% 41%
Forge Small: % 68% 41%
Forge T > 1% Small: 1% 72% %
Forge T > 1% Small: 1% 77% 52%
Forge T > 1%: (no entries survive)

Table 7 - Elapsed time, speedup, and multiprocessor efficiency of NAS10
fpp: % 63% 38%
fpp Small: % 51% 29%
Forge Small: % 75% 48%
Forge T > 1% Small: 1% 78% 51%
Forge T > 1%: (no entries survive)
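The speedup and multiprocessor efficiency columns in the tables above are conventionally derived from elapsed times. The following is a minimal sketch under the standard definitions (speedup = T1 / TP, efficiency = speedup / P). Whether the paper computes its columns exactly this way is an assumption, and the function names and sample timings are hypothetical; they are not values from the tables.

```python
def speedup(t_one_cpu, t_p_cpus):
    """Speedup of a P-processor run relative to the 1-processor run."""
    if t_one_cpu <= 0 or t_p_cpus <= 0:
        raise ValueError("elapsed times must be positive")
    return t_one_cpu / t_p_cpus


def efficiency(t_one_cpu, t_p_cpus, processors):
    """Multiprocessor efficiency: speedup divided by processor count."""
    if processors < 1:
        raise ValueError("processor count must be at least 1")
    return speedup(t_one_cpu, t_p_cpus) / processors


if __name__ == "__main__":
    # Hypothetical elapsed times in seconds, keyed by processor count.
    # These are illustrative only and do not come from the paper.
    elapsed = {1: 100.0, 4: 40.0, 8: 30.0}
    for p, t in elapsed.items():
        s = speedup(elapsed[1], t)
        e = efficiency(elapsed[1], t, p)
        print(f"{p}P: elapsed={t:.1f}s speedup={s:.2f} efficiency={e:.0%}")
```

Under these definitions a 1P run always has 100% efficiency, and efficiency falls as processors are added unless elapsed time shrinks in proportion.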

Tables 8 through 12 have columns 1P, 4P, 8P. Their numeric entries are not recoverable; captions and row labels follow.

Table 8 - Elapsed time of ARC3D normalized wrt fpp Version
Rows: fpp, Forge, Forge No Dep, Forge T > 1%, Forge T > 1%, Forge + fpp Vector, Forge + fpp Parallel.

Table 9 - Elapsed time of NAS4 normalized wrt fpp Version
Rows: fpp, Forge, Forge No Dep, Forge T > 1%, Forge + fpp Vector, Forge + fpp Parallel.

Table 10 - Elapsed time of NAS7 normalized wrt fpp Version
Rows: fpp, Forge, Forge No Dep, Forge T > 1%, Forge + fpp Vector, Forge + fpp Parallel.

Table 11 - Elapsed time of NAS8 (small data set) normalized wrt fpp Version
Rows: fpp, Forge, Forge T > 1%, Forge T > 1%.

Table 12 - Elapsed time of NAS10 (small data set) normalized wrt fpp Version
Rows: fpp, Forge, Forge T > 1%.

Table 13 - Single processor performance of the fpp Version, Forge No Dep Version and the version using only default vectorization. The top number of each entry is the elapsed time. The bottom number is the elapsed time normalized wrt the default time.
Columns: Default Vectorization, fpp, Forge No Dep. Rows: ARC3D, NAS4, NAS7, NAS8, NAS10. Numeric entries are not recoverable.

Table 14 - Normalized execution time wrt fpp Version (same loops parallelized)
Columns: Forge/fpp (1P), Forge/fpp (4P), Forge/fpp (8P). Rows: ARC3D, NAS4, NAS7, NAS8, NAS10. Numeric entries are not recoverable.

Table 15 - Examples of differences in code generation
                                          fpp   Forge
Insert DO ALL                             Y     Y
Use Options Besides SHARED & PRIVATE      Y     N
Check Run Time Loop Length                Y     N
Parallel Outer, Vector Inner              Y     N
Take Advantage of CRAY Vector Unit        Y     N

Table 16 - Normalized execution time wrt fpp Version (all loops parallelized in the Forge Version)
Columns: Forge/fpp (1P), Forge/fpp (4P), Forge/fpp (8P). Rows: ARC3D, NAS4, NAS7, NAS8, NAS10. Numeric entries are not recoverable.

Table 17 - Best performance normalized wrt the fpp Version
Columns: Forge/fpp (1P), Forge/fpp (4P), Forge/fpp (8P). Rows: ARC3D, NAS4, NAS7, NAS8, NAS10. Numeric entries are not recoverable.

Table 18 - Effect of using percentage of time as a measure of granularity on performance
Columns: two T > 1% thresholds at each of 1P, 4P, and 8P, normalized as 1/Forge. Rows: ARC3D, NAS4, NAS7, NAS8, NAS10. Numeric entries are not recoverable.

Table 19 - Loop characteristics
Columns (%): ARC3D, NAS4, NAS7, NAS8, NAS10. Rows: Parallelizable Loops, T < 1% (three thresholds), MAXT, No Depend, False Depend. Numeric entries are not recoverable.

Table 20 - Number and granularity (measured by percentage of time) of the serial loops whose bounds are proportional to the problem size
Columns: ARC3D, NAS4, NAS7, NAS8, NAS10.
Rows: Scaling Variables (LMAX, KMAX, M, NEQ, NNX, JDIM, JMAX, NNY, KDIM; the assignment to benchmarks is not recoverable), Total Number of Loops, Num of Loops with True Dep & Scalable Bnd, Largest Percentage of Time.
Largest Percentage of Time, as it survives in source order: 2%, 22%, 3%, 3%, 65%.
