Modeling and Predicting the Efficiency of Application Execution in Distributed Environments

Size: px

Start display at page:

Download "Modeling and Predicting the Efficiency of Application Execution in Distributed Environments"

Camilla Pitts
6 years ago
Views:

1 Proceedigs of the 1th WSEAS Iteratioal Coferece o COMPUTERS, Vouliagmei, Athes, Greece, July 13-15, 26 (pp19-26) Modelig ad Predictig the Efficiecy of Applicatio Executio i Distributed Eviromets AHMED M. AL KINDI a,*, KHAMIS N. AL GHARBI b, GRAHAM NUDD c a Departmet of Computer Sciece, College of Sciece b Iformatio Systems Departmet, College of Commerce ad Ecoomics Sulta Qaboos Uiversity, P.O. Box 36, PC 123, Oma c High Performace Systems Laboratory, Departmet of Computer Sciece, Uiversity of Warwick, Covetry CV4 7AL, UK Abstract: Oe of the mai objectives i the study of performace predictio is to give researchers the optio to select from a variety of distributed systems/resources based o what would optimise the performace of the applicatio. This paper itroduces a optimizatio scheme for modellig ad predictig the efficiecy of a applicatio executio whe ru i a distributed eviromet. The results obtaied from this research show that the suggested modellig ad predictio techique ca provide the same quality of iformatio as a measuremet process for applicatio optimizatio, but i a fractio of the time. Keywords: Performace Optimizatio, Dyamic Performace Predictio/Modellig, DFT, FFT, FFTW, PACE. 1 Itroductio The promise of Iformatio Power Grid (IPG) is based o the cocept of selectig distributed resources to perform a sigle or multiple computatioal tasks with performace optimisatio i mid [1]. It relies o the availability of systems ad the utilisatio of performace iformatio to guide applicatio executio (i.e. commuicatio ad sychroizatio overhead of a applicatio executio whe ru i a distributed eviromet [2]). Istead of writig optimized codes by had, there have bee several examples of automated code geerators geared towards obtaiig the maximum performace o differet platforms. Oe such library is the Portable High Performace ANSI-C (PHiPAC), developed at Berkeley ad iteded for producig high performace matrix-vector libraries i the Basic Liear Algebraic Subprograms (BLAS) suite [3], [4], ad [5]. I a similar way, a adaptive applicatio called The Fastest Furrier Trasform i the West (FFTW) [6] uses automatically geerated blocks of highly optimized C codes called codelets, to calculate a FFT i a miimum amout of time o ay system [7]. This paper utilizes ad models this adaptive applicatio ad shows that efficiecy of applicatio executio i distributed eviromets ca be predicted with reasoable quality of iformatio. This research uses a performace evaluatio tool developed by the Uiversity of Warwick 1 called Performace Aalysis ad Characterizatio Eviromet (PACE), which was developed as a modellig toolset for high performace ad distributed applicatios. The work is a extesio of the sequetial performace evaluatio techiques preseted i [8] ad [9] which led to further ehacemet of the FFTW. The experimets ad evaluatios were coducted usig machies with equal CPU ad memory sizes (Itel Petium 2.4 GHz ad 512 MB RAM ruig uder the Red Hat LINUX operatig system). The ext sectio itroduces PACE while Sectio 3 discuses the FFTW package. These overviews are deliberately brief due 1 This work was coducted at the Uiversity of Warwick (UK) ad fuded i part by the Sulta Qaboos Uiversity, Oma

2 Proceedigs of the 1th WSEAS Iteratioal Coferece o COMPUTERS, Vouliagmei, Athes, Greece, July 13-15, 26 (pp19-26) i part to space costrait. Iterested researchers are advised to seek further iformatio i literatures refereced at the ed of this paper. Sectios 4 through 7 preset the implemetatio ad results. 2 Performace Aalysis ad Characterizatio Eviromet (PACE) PACE was developed as a modelig toolset for high performace ad distributed applicatios. It icludes tools for the defiitio ad creatio of the applicatio model ad applicatio performace aalysis [9]. It uses associative objects orgaized i a layered framework as a basis for represetig each of a system s compoets [1]. PACE supports ay applicatio that depeds o message passig usig MPI. I fact, ay hardware or platform that uses these programmig tools ca be aalyzed withi PACE [11]. A key feature of the object orgaizatio is the idepedet represetatio of computatio, parallelizatio, ad hardware i a layered framework. The fuctios of the layers are: Applicatio Layer: describes a applicatio i terms of a sequece of subtasks. It acts as the etry poit to the performace study, ad icludes a iterface that ca be used to modify parameters of a performace sceario. Applicatio Subtask Layer: describes the sequetial part of every subtask withi a applicatio that ca be executed i parallel. A applicatio might cosist of a umber of differet subtasks, each represetig the applicatio s parallel kerel. A combiatio of these subtasks by the applicatio object forms a fuctioality model of the whole applicatio. Parallel Template Layer: describes the parallel characteristics of subtasks i terms of expected computatio/commuicatio iteractios betwee processors. This is doe through the use of models i the hardware layer. Hardware Layer: collects system specificatio parameters, micro-bechmark results (e.g. atomic laguage operatios), statistical models (e.g. regressio commuicatio models), aalytical models (e.g. cache cotetio), ad heuristics that characterize the commuicatio ad computatio abilities of a particular system. Whe applicatio source code is available, PACE performs a static aalysis of the code to produce the cotrol flow of the applicatio, operatio couts, ad the commuicatio structure. A compiler traslates C code to scripts that are the liked with a evaluatio egie ad the hardware models. The fial output is a biary file that ca be executed rapidly. The user determies the system/applicatio cofiguratio ad the type of output that is required as commad lie argumets. The model biary performs all the ecessary model evaluatios ad produces the requested results. A MPI versio of PACE is also available where it has the ability to measure ad evaluate MPI-Sed, MPI-Receive, ad other MPI commads. MPI-PACE takes ito accout additioal overhead that is icurred whe sedig ad receivig the data from ad to the processors. This commuicatio/sychroizatio overhead is also modeled ad predicted ad the added to the overall computatioal cost of the evaluated applicatio. 3 The Fastest Fourier Trasform i the West (FFTW) The FFTW is based o the divide-ad-coquer approach of the recursive FFT. To illustrate how a DFT is computed usig the recursive FFT, let us assume that the followig polyomial: 1 = j A ( x) a j x (1) j= is to be evaluated at the complex roots of uity of degree : ω, ω, ω,..., ω ( is assured to be a power of 2). Polyomial A ca be used to represet ay complex valued vector a where: a = (a,a 1,,a -1 ) (2) The the results ca be defied as follows: y k 1 = A( ω k ) kj = aiω (3) j= where k =,1,2,., -1. The DFT of vector a is the vector: y = (y, y 1,, y -1 ) (4) To compute the DFT of the vector a, the array is divided ito a eve-idex sub-array, ad a oddidex sub-array. The two polyomials represetig the divided arrays are hece as follows: A [] (x) = a + a 2 x + a 4 x a -2 x /2-1 (5)

3 Proceedigs of the 1th WSEAS Iteratioal Coferece o COMPUTERS, Vouliagmei, Athes, Greece, July 13-15, 26 (pp19-26) A [1] (x) = a 1 + a 3 x + a 5 x a -1 x /2-1 (6) This divide-ad-coquer method is based o the followig formula: A(x) = A [] (x 2 ) + A [1] (x 2 ) * x (7) Therefore, the computatio of the DFT of a is ow reduced to two steps: Evaluatig the polyomials: A [] (x) ad A [1] (x) of degree /2 at the /2 complex roots of uity of degree /2: ω, ω, ω,..., ω 2 4 2, ad the applyig formula (7). Sice each root occurs twice, the equatios i (5 ad 6) may ow be recursively computed at the /2 complex roots of uity of degree /2. Therefore, a elemet DFT computatio is successfully divided ito two /2 elemet DFT computatios. This forms the basis of the portable C package called the FFTW which computes oe or multi-dimesioal complex DFT. It is claimed that due to its self-cofigurig ad dyamic ature, the FFTW ca compute the trasform faster tha ay other software. It requires a sequece of measuremets to be made for its software compoets (codelets) o the target system, ad the uses a combiatio of these codelets to determie how best to compute the trasform of ay give size. There are three essetial compoets that make up the FFTW: Codelets: These are a collectio of highly optimized blocks of C code used to compute the trasform. The codelets are geerated automatically. There are two types of codelets: No-Twiddle: used to compute small-sized trasforms (N <= 64), ad Twiddle: used to compute larger trasforms by the combiatio of smaller oes. Plaer: The plaer determies the most efficiet way i which the codelets ca be combied together to calculate the required trasform. Each idividual combiatio of codelets is called a pla. Typically dozes of plas are available for a specific FFT size. Executor: The executio of each pla is performed ad measured by the executor. This results i the ru-time cost of each pla o a specific system. From this, the best pla ca be determied i.e. the pla with the miimum executio time. The iitial stage of the FFTW operatio ivolves the plaer, which fids out the fastest way to compute the DFT of a give complex array. To illustrate the operatio of the plaer, a example of performig a 16-poit FFT is cosidered. The plaer uses the divide-ad-coquer approach to partitio the data set ad to costruct all possible combiatios of codelets. The resultig istructios cotaied withi the pla may look as follows: FFT (16) = Divide & Coquer (16,8) >> 8. FFT (2) or Divide & Coquer (16,4) >> 4. FFT (4) or Divide & Coquer (16,2) >> 2. FFT (8) or Solve (16) I this example, the plaer has selected four types of istructios or sequeces to sed to the executor. Oe is to solve usig a o-twiddle codelet 16, ad the rest are to solve usig a combiatio of twiddle ad o-twiddle codelets (e.g. twiddle codelet 8 ad a o-twiddle codelet 2 for the first istructio). To beefit from the recursive ature of the algorithm, solutios to idividual istructios could also be stored i tables durig this real-time computatio for reusability. The differet alteratives that are available for the calculatio usig this approach i the FFTW result i differet performace due, i part, to the use of the memory hierarchy icludig caches. For istace, the data required for computig a FFT of a small size may fit ito the system s primary cache, resultig i a better processig time, whereas the FFT of a larger size may ot. Thus combiatios of smaller FFTs may out-perform larger FFTs. By iitially measurig all these combiatios, a best pla ca be determied, which will be differet o differet systems. The FFTW provides two parallelizatio techiques: Multithread FFTW ad MPI-FFTW. Because PACE relies o MPI for its predictio aalysis of parallel eviromets, MPI-FFTW is selected for use i this research. Further detail o the FFTW may be foud at 4 Modelig the FFTW The measuremet procedure required by the FFTW to select the best pla for a specific system ca be replaced by a set of performace models. To

4 Proceedigs of the 1th WSEAS Iteratioal Coferece o COMPUTERS, Vouliagmei, Athes, Greece, July 13-15, 26 (pp19-26) overcome the costly measuremets of its iitial stage, separate models are created to represet all idividual FFTW codelets. The codelets are dyamically profiled ad aalyzed, ad the models for the idividual codelets are geerated. The operatio of the FFTW plaer is the simulated i order to select the miimum predicted executio time of a appropriate set of codelet models. Fially, the predicted optimum pla is passed to the FFTW executor to compute the trasformatio. The evaluatio of each pla ca be performed for a rage of systems ad problem sizes. The evaluatio time for this process is faster tha the correspodig real-time FFTW measuremets. The validity of this optimizatio methodology depeds o the selectio patter of the optimum solutio (or best pla). To achieve this, the performace models require accurate predictio of every pla s executio time. The performaces of six selected o-twiddle codelets are illustrated i Table 1 which shows their modeled performace compared to their real-time executio o a sigle Petium machie (P64 stads for pla for FFT of size 64). Table 1. Executio vs Predictio for six o-twiddle codelets Plas PACE (µs) FFTW (µs) PACE / FFTW (%) P P P P P P Table 2 holds the predicted times for the modeled twiddle ad o-twiddle codelets. Table 2. Mode ls table for all the c odelets Twiddle No-twiddle Time (µs) Codelets Codelets Time (µs) Oce the models have bee created ad depedig o the required FFT size, a performace p redictio ca be extracted. The ext phase i the aalysis requires the creatio of the commuicatio ad sychroizatio models which predict the overhead cost of trasferrig ad receivig the FFT data amog processors. The MPI- FFTW uses a umber of kow MPI calls (such as MPI-Sed, MPI-Receive, MPI-Gather, etc) to commuicate ad sychroize the parallelizatio of the trasform process. For example, it uses MPI_Bcast to broadcast a message (such as trasform size) to all used processors, ad uses MPI_Sed to sed data. These MPI calls form the mai commuicatio/sychroizatio overhead of the MPI-FFTW. The calls are evaluated usig PACE s parallel template where the umber of processors to be used is also declared. The applicatio object (Fig.1) is cofigured, compiled ad liked with the subtask object (Fig.2) alog with the parallel template (Fig.3). The parallel template passes the umber of processors through variable Nproc. The resultig biary would ow produce the commuicatio cost estimates depedig o the required umber of processors. applicatio foo_appl { iclude foo_subt; iclude hardware; var umeric: N = N, Nproc = 1; lik { hardware: Nproc = Nproc; foo_subt: N = N/Nproc; optio { hrduse="itelp4_24"; proc exec iit { call foo_subt; Fig.1. Applicatio object (foo_appl) used i MPI- PACE Table 3 shows the predicted commuicatio / sychroizatio costs for the MPI-FFTW o the give umber of machies. The ability to evaluate ad predict the cost of commuicatio whe usig parallel processig, gives researchers a cosiderable advatage sice there eed ot be ay supportig applicatio executio. This is especially true if the applicatio at had is big ad takes a log time to execute.

5 Proceedigs of the 1th WSEAS Iteratioal Coferece o COMPUTERS, Vouliagmei, Athes, Greece, July 13-15, 26 (pp19-26) subtask foo_subt { iclude asyc; iclude hardware; var umeric: N; lik { asyc: Tx = mai(), N = N; (* * CHIP3S * Applicatio Characterizatio Tool * Source : foo_subt * RUV Type: clc *) proc cflow mai { compute <is clc, FCAL>; Fig.2. Applicatio subtask (foo_subt) used i MPI- PACE #i clude <mpidefs.h> (* * asyc.la - Parallel template*) partmp asyc { iclude hardware; var umeric: N; var compute: Tx; (*executio time *) optio { stage = 1, seval = 1; proc exec iit { var umeric :i; step mpicom { for (i = 2; i <= hardware.nproc; i = i + 1) { call MPI_Barrier (MPI_COMM_WORLD); call MPI_Bcast (4,1,MPI_COMM_WORLD); call MPI_Gather (4,4,1,MPI_COMM_WORLD); cofdev 1, i, 4, MPI_COMM_WORLD; (* prit "asyc: Tx=", Tx; *) step cpu { cofdev Tx; Fig.3. Parallel template object used i MPI-PACE Oe additioal advatage is that a user ca also evaluate the parallelizatio of more tha the umber of machies curretly available. For example, i this study, up to 256 processors (which is more the the available resources) were defied i the evaluatio models. This helps i tryig to foresee the outcome of the parallelizatio of computig certai FFT ad parallelizatio sizes where it was impossible to do because of either huge commuicatio overhead that icur uacceptable bottleeck, or the fact that there were t eough machies to do the experimets o. Table 3. Predicted commuicatio/sychroizatio costs for MPI-FFTW Number of Processors Commuicatio Overhead (µs) e+6 After creatig the models for the commuicatio / sychroizatio overheads ad the codelets, a aalysis ad performace evaluatio may ow be coducted for ay FFT size ad for ay umber of processors i a parallelized eviromet. Give the fact that all machies used i this study are of the same performace characterizatio, the it is safe to assume that if the predicted executio time for a poit FFT equals Tseq (predicted sequetial time), the the predicted executio time for a /p poit FFT is (Tseq / p). If this is the applied to a distributed memory eviromet where p represets the umber of processors, ad add the predicted overhead cost, the the predicted computatio is as follows: where Tpar is the predicted time for computig a trasform by the FFTW i a parallel eviromet, T seq is the predicted sequetial time for the same trasform, p is the umber of processors used, ad T (c ommu.+ syc) is the overhead for commuicatio / sychroizatio. MPI-FFTW uses a speedup measuremet to evaluate its performace ehacemet through parallelism. Therefore, the experimets coducted here are also measured accordigly so that comparisos could be made i terms of speedups as well as executio times. This measuremet is give by the factor of sequetial executio time over the time obtaied whe ru i a distributed eviromet: o r: Predicted Tpar (p) = (Tseq / p) + T (commu.+syc) (8) Speed up = Tseq / Tpar (9) Speedup = Tseq / [(Tseq / p) + T (commu.+syc)] (1)

Proceedigs of the 1th WSEAS Iteratioal Coferece o COMPUTERS, Vouliagmei, Athes, Greece, July 13-15, 26 (pp19-26) 5 A Parallelized Performace Predictio ad Evaluatio of the MPI-FFTW Up to 16 cocurret

Results show that for small FFTs, both the predicted pe rformace as well as the FFTW real- time executio degrades istatly whe distributed memory is used.

6 Proceedigs of the 1th WSEAS Iteratioal Coferece o COMPUTERS, Vouliagmei, Athes, Greece, July 13-15, 26 (pp19-26) 5 A Parallelized Performace Predictio ad Evaluatio of the MPI-FFTW Up to 16 cocurret processors were used to predict ad measure i real-time the ehacemet gaied by usig the MPI-FFTW to compute small trasforms. Results show that for small FFTs, both the predicted pe rformace as well as the FFTW real- time executio degrades istatly whe distributed memory is used. This is because the iitial cost of commuicatio overhead is greater tha the total computatio cost of the FFT i a sigle machie. Fig.4 shows the predicted speedup for computig a 64 poit FFT o 2, 4, 8, ad 16 processors i compariso to the real-time speedup gaied by the MPI-FFTW. The speedups for oe processor (which is 1 for all FFT sizes) are ot plotted so that the rest of the speedups ca be clearly viewed. eedup Sp FFTW PACE2 Fig.4. Speedup comparisos for a 64 poit FFT Speedup Fig.5. Predicted Speedups 16 N=1K N=2K N=4K Other FFT sizes (256, 512, K, 2K, ad 4K) were used i order to compare the MPI-FFTW with the MPI-PACE. The trasforms were computed usig 16 distributed machies. Fig.5 shows the predicted speedups, while Fig.6 shows the real-time measuremets. The behavior of the MPI-FFTW i the begiig is similar to the predicted oe. The iitial icrease i the umber of processors results i the degradatio of the performace. The behavior chages a little as the real-time speedups idicate a better performace tha predicted whe 8 processors are used for sizes 2K ad 4K, but the overall patter remais the same. While the real-time speedups deped o the availability of the machies (which is ot always possible), ad the possibility of affectig other users due to commuicatio overhead, it is much easier to coduct performace evaluatios ad produce reliable predictios. To illustrate this poit further, a umber of evaluatio experimets were coducted usig 1 to 256 processors ad for FFT sizes ragig from 32 to 128K. As Fig.7 shows, it was possible to coduct a predictio aalysis for up to 256 machies. The figure shows that for the user to beefit from the available distributed resources i the machies, the trasform must be at least of the size 16K. It also idicates that for that size, oly usig 2 processors would be deemed successful i terms of speedup. Trasforms of sizes 32K, 64K, ad 256K may be computed faster whe 4 machies are used i parallel. Otherwise the commuicatio/sychroizatio overhead is provig much costlier ad the speedups start to degrade agai as the umb er of processors icrease. However, although the performace predictio of the MPI- FFTW is close to that of the real-time executio whe the size of the FFT is small, a differet sceario occurs whe a compariso is made for larger trasforms. A umber of factors cotribute to this discrepacy. The FFTW is a adaptive package ad chages its behavior accordig to the circumstaces of the computatio, the fial outcome of the performace of the FFTW depeds also o the FFT size, speed i which idividual/combied trasforms (plas) are computed, ad the load o the system. As these factors icrease i size ad umber, so would the differece betwee the predicted ad the real-time performace. A icrease i the umber of plas meas a icrease i the variatios of best plas chose for each ru ad hece the possibility of sedig a differet pla to compute the trasform the the oe beig aalyzed ad predicted. Furthermore, the load affectig the machies i the real-time eviromet also cotributes to the overall performace of the FFTW, ad to have multiple machies ruig at the same time while beig affected by exteral variables meas that the chace of predictig precisely what the performace would be is more difficult whe the size of the problem is larger. Aother factor is the probability of havig the commuicatio/sychroizatio cost overestimated, which may idicate a eed for further research towards improvig the MPI aalysis especially whe

7 Proceedigs of the 1th WSEAS Iteratioal Coferece o COMPUTERS, Vouliagmei, Athes, Greece, July 13-15, 26 (pp19-26) dyamically optimized applicatios are studied. However, for most practical sizes of FFTs, the user ca greatly beefit from the modelig techiques as well as the dyamic performace evaluatio preseted here. Speedup Speedup Fig.6. MPI-FFTW Speedups K 2 K 4 K N=32 N=64 N=128 N=1K N=2K N=4K N=8K N=16K N=32K N=64K N=128K Fig.7. Speedup comparisos for various FFT sizes usig up to 256 machies 6 Modelig ad Predictig the Efficiecy The efficiecy of parallel computatio is calculated as the ratio of speedup resultig from distributig a give problem over a umber of processors. Efficiecy iformatio i this case study is helpful i determiig if the predicted speedups are efficiet eough for the size of the give problem. The efficiecy of usig p umber of processors equals the speedup gaied whe usig p processors i a parallel eviromet divided by p, ad it is writte as follows: Efficiecy(p) = Speedup (p) / p (11) For example, if the speedup of usig 4 machies i parallel to compute a give FFT is 2, the the efficiecy of this computatio process is 2/4 = 1/2. That is, the utilisatio power of such eviromet is oly 5% of the total computatioal resources available. Therefore, while speedups i Fig.7 are up to 2.5 faster o certai sizes, this is still ot the best performace the machies ca offer. Uless the machies are beig used i solvig other problems, 1/2 of the resources remai idle while the comput atio is goig o. Fig.8 illustrates that the efficiecy predicted is similar to that gaied by the MPI-FFTW. The figure represets predictio ad executio compariso for the efficiecy of computig a 64 poit trasform i a distributed eviromet cosistig of up to 16 processors. iciecy Eff PACE FFTW Fig.8. Efficiecy comparisos for computig a 64 poit FFT Efficiecy 1.2 N= N=64 N=128 N=1K N=2K N=4K N=8K N=16K N=32K N=64K N=128K Fig.9. Predicted efficiecy for computig various FFT sizes usig up to 256 machies Fig.9 predicts that usig oe machie to solve up to 128K FFT is more efficiet tha parallelisig the computatio eve if there are as may as 256 powerful processors. Oe reaso for this is the large commuicatio overhead cost that is ivolved whe computig this size of FFT. The fact that the data must be divided ad the distributed amog processors for the computatio the they must be reallocated back cotributes to this large overhead. For compariso purposes, the efficiecy of the parallelisatio process of the real-time MPI-FFTW is also measured (although less machies used). As predicted ad as illustrated by Fig.1, there is t much efficiecy i tryig to parallelise the computatio of the FFTs usig a fast machie like the Petium. The decrease i efficiecy is as sharp as the oe predicted by the previous figure.

Proceedigs of the 1th WSEAS Iteratioal Coferece o COMPUTERS, Vouliagmei, Athes, Greece, July 13-15, 26 (pp19-26) Efficiecy.25.2.15.1.5 N=32 N=64 N=128 N=1 K Fig.1. MPI-FFTW efficiecy for computig various FFT sizes Efficiecy.

8 Proceedigs of the 1th WSEAS Iteratioal Coferece o COMPUTERS, Vouliagmei, Athes, Greece, July 13-15, 26 (pp19-26) Efficiecy N=32 N=64 N=128 N=1 K Fig.1. MPI-FFTW efficiecy for computig various FFT sizes Efficiecy K 32 K 64 K Fig.11. MPI-FFTW efficiecy for large FFT sizes usig up to 16 machies However, ad as show i Fig.11, the efficiecy for computig a 64K poit FFT starts to icrease whe more tha 8 processors are used. This last observatio cofirms the previously oted fact that for large FFTs, predictig the overall performace behaviour of the FFTW is more difficult. 7 Summary This paper preseted a modelig techique for evaluatig the performace ad efficiecy of a applicatio executio i distributed eviromet. Besides predictig the behaviour patter of the parallelised computatios, the paper also aalysed ad predicted speedups as well as efficiecy gaied through multiprocessig. Results showed a reliable predictio process with a huge reductio i the overhead cost of the learig process. Predicted speedups suggest a performace ehacemet that is ot efficiet because the commuicatio ad sychroisatio costs are too big for the computatios at had. This is especially true whe large umber of processors is used. These predictios ad overall behaviour patters are matched by the real-time implemetatio. 2 K 4 K 8 K Experimets carried out here also show that parallelizatio schemes ca be aalyzed ad their performaces ca be predicted where either the cost of doig real-time studies is expesive ad/or whe there are t eough resources available. The results also highlight MPI-PACE overestimatio of the commuicatio / sychroizatio i distributed eviromets ad the eed to further re-examie it so that a more precise evaluatio is coducted. Refereces: [1]. Foster, C Kesselma, J. Nick. ad S. Tuecke. Grid Services for Distributed Systems Itegratio. IEEE Computer, 35 (6). 22. [2]. A. M. Alkidi, K. N. Al Gharbi, G. R. Nudd. Modelig ad Predictig the Commuicatio ad Sychroizatio Overhead of a Distributed Eviromet. Proceedigs of the 3 rd Iteratioal Symposium o Telecommuicatios, September 25. Volume 1 of IST25, pp [3]. C. Lawso, R. Haso, D. Kicaid, ad F. Krogh. Basic Liear Algebra Subprograms (BLAS) for FORTRAN usage. ACM Trasactios o Mathematical Software, 5:38-325, [4]. J. Dogarra, W. Getzsch: Computer Bechmarks, Volume 8, [5]. L. Blackford, J. Demmel, J. Dogarra, I. Duff, S. Hammarlig, G. Hery, M. Heroux, L. Kaufma, A. Lumsdaie, A. Petitet, R. Pozo, K. Remigto, R. Whaley. A Updated Set of Basic Liear Algebra Subprograms (BLAS), ACM TOMS, Volume 28, Issue 2, Jue 22, pp , ISSN [6]. M. Frigo, S. Johso. FFTW: A adaptive software architecture for FFT, i Proc. of the IEEE It. Cof. o Acoustics Speech, ad Sigal Processig, 3, Seattle (1998) [7]. M. Frigo. A fast Fourier Trasform Compiler, i Proc. of the ACM SIGPLAN Cof. o Programmig Laguage Desig ad Implemetatio (PLDI 99), Atlata (1999). [8]. A. Alkidi, D. Kerbyso, E. Papaefstathiou, G. Nudd. Ru-Time Optimizatio Usig Dyamic Performace Predictio, Proceedigs of the 8th iteratioal coferece, High-Performace Computig ad Networkig, Amsterdam, May 2. Volume 1823 of LNCS, Spriger, pp [9]. A. Alkidi, D. Kerbyso, E. Papaefstathiou, G. Nudd. Optimizatio of applicatio executio o dyamic systems, Future Geeratio Computer Systems 17 (8) (21), pp [1]. D. Kerbyso, E. Papaefstathiou, J. Harper, S. Perry, ad G. Nudd: Is predictive tracig too late for HPC users, I High-Performace Computig, pp 57-67, Pleum Press, [11]. G. Nudd, D. Kerbyso, E. Papaefstathiou, S. Perry, J. Harper, D. Wilcox: PACE-A Toolset for the performace predictio of parallel ad distributed systems, The Iteratioal Joural of High Performace Computig Applicatios, vol 14, fall 2, pp [12]. G. Nudd, E. Papaefstathiou, Y. Papay, T. Atherto, C. Clarke, D. Kerbyso, A. Stratto, R. Ziai, ad M. Zemerly: A layered approach to the characterizatio of parallel systems for performace predictio, i Proc. of the Performace Evaluatio of Parallel Systems Workshop (PEPS 93), Covetry, U.K. (1993) pp

Chapter 3 Classification of FFT Processor Algorithms

Chapter 3 Classification of FFT Processor Algorithms Chapter Classificatio of FFT Processor Algorithms The computatioal complexity of the Discrete Fourier trasform (DFT) is very high. It requires () 2 complex multiplicatios ad () complex additios [5]. As