IMPLEMENTATION OF UNSTRUCTURED GRID GMRES+LU-SGS METHOD ON SHARED-MEMORY, CACHE-BASED PARALLEL COMPUTERS


AIAA-97

Dmitri Sharov, Hong Luo, Joseph D. Baum
Science Applications International Corporation
1710 Goodridge Drive, MS 2-6-9, McLean, VA 22102, USA

and

Rainald Löhner
Institute for Computational Sciences and Informatics
George Mason University, Fairfax, VA 22030, USA

ABSTRACT

The implementation of an unstructured grid matrix-free GMRES+LU-SGS scheme on shared-memory, cache-based parallel machines is described. A special grid renumbering technique is used for the parallelization rather than the traditional method of partitioning the computational domain. The renumbering technique helps to avoid inter-processor data dependencies, cache-misses, and cache-line overwrite while allowing pipelining. The resulting source code can be used with maximum efficiency and without modifications on traditional (scalar) computers, vector supercomputers, and shared-memory parallel systems. Special attention has been paid to developing an optimally parallelized preconditioner for the GMRES scheme.

1. INTRODUCTION

Considerable progress has recently been made in the development of implicit schemes for unstructured grids. Implicit methods are widely used to accelerate convergence of steady-state problems as well as to improve the efficiency of unsteady solvers by advancing the solution with substantially larger time steps. The GMRES+LU-SGS implicit scheme proposed by Luo, Baum, and Löhner for steady-state solutions [1] as well as for unsteady problems [2] can improve the efficiency of traditional explicit methods by one to more than two orders of magnitude. The scheme uses the Lower-Upper Symmetric Gauss-Seidel (LU-SGS) scheme as a preconditioner for the Generalized Minimal Residual (GMRES) method [3]. The LU-SGS scheme was originally proposed by Jameson and Yoon [4] on structured grids, and has been successfully generalized and extended to unstructured meshes [5-7].

Copyright by the authors. Published by the American Institute of Aeronautics and Astronautics, Inc. with permission.
Another way to reduce turn-around time is to use multiple processors. With the advent of massively parallel machines, i.e. machines in excess of 500 nodes, the exploitation of parallelism in solvers has become a major focus of attention. Most of the applications ported successfully to parallel machines to date have followed the Single Program Multiple Data (SPMD) paradigm. For grid-based solvers, a spatial subdomain was stored and updated in each processor. For obvious reasons, load balancing [8-12] has been a major focus of activity. Despite the striking successes reported to date, only the simplest of all solvers (explicit timestepping or implicit iterative schemes, perhaps with multigrid added on) have been ported without major changes and/or problems to massively parallel computers with distributed memory. Many code options that are essential for realistic simulations are not easy to parallelize on this type of machine. Among these, we mention local remeshing, repeated h-refinement such as required for transient problems [13], contact detection and force evaluation [14], some preconditioners [15], applications where particles, flow, and chemistry interact, and applications with rapidly varying load imbalances. Even if 99% of all operations required by these codes can be parallelized, the maximum achievable gain will be restricted to 1:100, since the remaining 1% of serial work bounds the speedup at 1/0.01 = 100 regardless of the number of processors. If we accept as a fact that for most large-scale codes we may not be able to parallelize more than 99% of all operations, the shared-memory paradigm, discarded for a while as non-scalable, makes a comeback. It is far easier to parallelize some of the more complex algorithms, as well as cases with large load imbalance, on a shared-memory machine (such as the SGI Origin 2000).

The objective of the present research effort is to implement the GMRES+LU-SGS scheme on shared-memory parallel computers. Here we use the shared-memory parallelization technique originally proposed by Löhner and implemented for explicit schemes [16]. This method is based on extensive mesh renumbering, which provides proper load balancing and avoids cache-misses and cache-line overwrite while allowing pipelining. The advantage of the method over the traditional approach, which is based on domain partitioning, is its ability to be easily used with repeated local mesh refinement and local or global remeshing.

The matrix-free GMRES+LU-SGS implicit scheme uses the LU-SGS approximate factorization as a preconditioner. The parallelization of the LU-SGS algorithm is not an obvious task, because of its inherent data dependency. The LU-SGS algorithm can be vectorized for vector processors by using planes i+j+k=const for structured meshes, or by using hyperplane edge reordering for unstructured meshes [7]. Unfortunately, for the intended shared-memory parallelization approach, there are very severe penalties to start a loop [16]. Hence, a loop can be efficiently parallelized only if its vector length is large enough. For an explicit scheme, if scalability to even 16 processors is to be achieved, the vector loop lengths should be at least 16x1,000. For typical tetrahedral grids there are approximately 22 vector-length groups, indicating that we would need at least 22x16x1,000=352,000 edges to run efficiently. For the LU-SGS scheme this restriction is much more severe.
Since we usually have several hundreds of hyperplanes, many times more edges are required to run the code efficiently. Moreover, since the LU-SGS scheme is used as a preconditioner for the GMRES method, even a small inefficiency in the parallelization of the LU-SGS scheme may result in severe degradation of overall performance. Thus a comparison of different types of matrix-free parallelized preconditioners has been performed as part of the present effort.

2. GOVERNING EQUATIONS AND THEIR DISCRETIZATION

The Euler equations governing unsteady compressible inviscid flows can be expressed in the conservative form as

$$ \frac{\partial Q}{\partial t} + \frac{\partial F_j}{\partial x_j} = 0, \tag{2.1} $$

where the summation convention has been employed. The unknown vector $Q$ and inviscid flux vector $F_j$ are defined by

$$ Q = \begin{pmatrix} \rho \\ \rho u_i \\ \rho e \end{pmatrix}, \qquad F_j = \begin{pmatrix} \rho u_j \\ \rho u_i u_j + p\,\delta_{ij} \\ u_j(\rho e + p) \end{pmatrix}. \tag{2.2} $$

Here $\rho$, $p$, $e$ denote the density, pressure, and specific total energy of the fluid respectively, and $u_i$ is the velocity of the flow in the coordinate direction $x_i$. This set of equations is completed by the addition of an equation of state.

The governing equations are discretized by the finite volume method based on dual mesh cells associated with the nodes of the mesh, where the control volumes are non-overlapping dual cells constructed by the median planes of the tetrahedra. In the present study the numerical flux functions for inviscid fluxes at the dual mesh cell interface are computed using the AUSM+ (Advection Upwind Splitting Method) scheme [17]. Linear reconstruction of primitive variables is used with the Van Albada limiter. Equation (2.1) can be rewritten in a semi-discrete form as

$$ V_i \frac{\partial Q_i}{\partial t} = R_i, \tag{2.3} $$

where $V_i$ is the volume of the dual mesh cell, and $R_i$ is the right-hand side residual, which equals zero for a steady-state solution.

3. SHARED MEMORY PARALLELIZATION TECHNIQUE

The parallelization technique for explicit schemes [16] will be generalized for implicit computations in this paper.
The method requires no explicit domain decomposition, but is based on the combination of several renumbering and data regrouping techniques developed to avoid or considerably minimize cache-misses, cache-line overwrite, and memory contention.
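The regrouping idea can be illustrated with a small Python sketch (the paper's own code is Fortran). It sorts edges by their smaller point number and then greedily splits them into groups in which no point occurs twice, so that the point updates inside one group cannot collide. The function names `renumber_edges`, `split_into_passes`, and `accumulate` are hypothetical, and the greedy grouping is an assumed stand-in for the renumbering algorithms cited in the paper, not a reproduction of them.

```python
def renumber_edges(edges):
    """Order edges by their smaller point number (assumed cache-friendly order)."""
    return sorted(edges, key=lambda e: (min(e), max(e)))


def split_into_passes(edges):
    """Greedily group edges so that no point appears twice within a group."""
    passes = []   # passes[k] is the list of edges in group k
    touched = []  # touched[k] is the set of points already used in group k
    for a, b in edges:
        for k, used in enumerate(touched):
            if a not in used and b not in used:
                passes[k].append((a, b))
                used.update((a, b))
                break
        else:
            # No existing group can take this edge without contention.
            passes.append([(a, b)])
            touched.append({a, b})
    return passes


def accumulate(passes, npoin, flux):
    """Edge-to-point accumulation; updates inside one group are independent."""
    rhs = [0.0] * npoin
    for group in passes:
        for a, b in group:
            r = flux(a, b)
            rhs[a] += r
            rhs[b] -= r
    return rhs
```

Because every point is written at most once per group, the inner loop over a group is the part that could be pipelined or handed to several processors; the outer loop over groups stays sequential.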

Fig. 1. Edge and node renumbering. (a) Renumbering to minimize cache-misses. (b) Renumbering to avoid memory contention. (c) Renumbering for 2-processor machine.

Renumbering to minimize cache-misses

All unstructured CFD codes contain basic loops over nodes, edges, and elements. If a loop over the edges is considered and cache-misses are a concern, then the storage locations for the required point information should be as close as possible in memory when required by an edge. At the same time, as the loop progresses through the edges, the point information should be accessed as uniformly as possible. This may be achieved by first renumbering the points using a bandwidth-minimization technique such as the reverse Cuthill-McKee [18], wavefront [19], or Peano-Hilbert type space-filling curves [20], and subsequently renumbering the edges according to the minimum point number on each edge [19]. Figure 1a shows an example of the reordered edges and points. The same type of renumbering is done for all entities that serve as basic loops in the code (e.g. elements, boundary faces, etc.). All of these algorithms are of complexity O(N) or at most O(N log N), and well worth the effort.

Data and loop rearrangements to avoid memory contention

In order to achieve pipelining or vectorization, memory contention issues must be avoided. Memory contention can arise, for instance, in a loop over the edges while writing to the corresponding points. The following example is a typical simplified loop:

Loop 1

      DO 1600 IEDGE=1,NEDGE
        IPOI1=LNOED(1,IEDGE)
        IPOI2=LNOED(2,IEDGE)
        REDGE=F(IPOI1,IPOI2)
        RHSPO(IPOI1)=RHSPO(IPOI1)+REDGE
        RHSPO(IPOI2)=RHSPO(IPOI2)-REDGE
 1600 CONTINUE

Since one and the same point can be accessed more than once from different edges, the information in RHSPO may be corrupted in the pipeline. To make sure that no point is accessed more than once, the loop can be split into several contention-free loops over renumbered edges, see Fig. 1b:

Loop 2

      DO 1400 IPASS=1,NPASS
        NEDG0=EDPAS(IPASS)+1
        NEDG1=EDPAS(IPASS+1)
C       PIPELINING DIRECTIVE
C$DIR IVDEP
        DO 1600 IEDGE=NEDG0,NEDG1
          IPOI1=LNOED(1,IEDGE)
          IPOI2=LNOED(2,IEDGE)
          REDGE=F(IPOI1,IPOI2)
          RHSPO(IPOI1)=RHSPO(IPOI1)+REDGE
          RHSPO(IPOI2)=RHSPO(IPOI2)-REDGE
 1600   CONTINUE
 1400 CONTINUE

Data and loop rearrangements to avoid cache-line overwrite

An auto-parallelizing compiler can parallelize the inner loop in Loop 2. However, as has been mentioned in Ref. 16, such parallelization is not efficient, because of both start-up penalties and cache-line overwrite. The start-up penalties are associated with the launching of a parallel loop. To minimize the penalties, the number of passes NPASS should be as small as possible and therefore the vector-length should be large. However, when large vector-lengths are used, the probability that different processors access the same cache-line is increased. If a cache-line overwrite takes place, all processors must update this line, leading to a large increase of interprocessor communication, severe performance degradation, and non-scalability. To keep vector-lengths short and enjoy small start-up cost, a special edge renumbering has been proposed in Ref. 16. Figure 1c illustrates the idea of this renumbering for the case of two processors. The actual loop may look like:

Loop 3

      DO 1000 IMACG=1,NPASG,NPROC
        IMAC0=IMACG
        IMAC1=MIN(NPASG,IMAC0+NPROC-1)
C       PARALLELIZATION DIRECTIVE
C$DOACROSS LOCAL(IPASG)
        DO 1200 IPASG=IMAC0,IMAC1
          CALL LOOP3P(IPASG)
 1200   CONTINUE
 1000 CONTINUE

LOOP3P becomes a subroutine of the form:

      SUBROUTINE LOOP3P(IPASG)
      NPAS0=EDPAG(IPASG)+1
      NPAS1=EDPAG(IPASG+1)
      DO 1400 IPASS=NPAS0,NPAS1
        NEDG0=EDPAS(IPASS)+1
        NEDG1=EDPAS(IPASS+1)
C       PIPELINING DIRECTIVE
C$DIR IVDEP
        DO 1600 IEDGE=NEDG0,NEDG1
          IPOI1=LNOED(1,IEDGE)
          IPOI2=LNOED(2,IEDGE)
          REDGE=F(IPOI1,IPOI2)
          RHSPO(IPOI1)=RHSPO(IPOI1)+REDGE
          RHSPO(IPOI2)=RHSPO(IPOI2)-REDGE
 1600   CONTINUE
 1400 CONTINUE
      RETURN
      END

This algorithm can be applied to any explicit CFD solver. In the sequel, we consider the extension of this method to an implicit scheme.

4. IMPLICIT TIME INTEGRATION

In order to obtain a steady-state solution, the spatially discretized equations must be integrated in time. Using Euler implicit time-integration, Eq. (2.1) can be written in discrete form as

$$ V_i \frac{\Delta Q_i^n}{\Delta t} = R_i^{n+1}, \tag{4.1} $$
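where $\Delta t$ is the time increment and $\Delta Q_i^n$ is the difference of the unknown vector between time levels $n$ and $n+1$; i.e.,

$$ \Delta Q_i^n = Q_i^{n+1} - Q_i^n. \tag{4.2} $$

Equation (4.1) can be linearized in time as

$$ V_i \frac{\Delta Q_i^n}{\Delta t} = R_i^n + \frac{\partial R_i^n}{\partial Q}\,\Delta Q^n, \tag{4.3} $$

where $R_i$ is the right-hand side residual and equals zero for a steady-state solution. Writing the equation for all nodes leads to the delta form of the backward Euler scheme

$$ A\,\Delta Q^n = R^n, \tag{4.4} $$

where

$$ A = \frac{V}{\Delta t}\,I - \frac{\partial R^n}{\partial Q}. \tag{4.5} $$

We use a simplified flux function to obtain the left-hand side Jacobian matrix,

$$ R_i = -\sum_j \frac{1}{2}\left[ F(Q_i, n_{ij}) + F(Q_j, n_{ij}) - |\lambda_{ij}|\,(Q_j - Q_i) \right] s_{ij}, \tag{4.6} $$

where $\lambda_{ij}$ is the spectral radius,

$$ |\lambda_{ij}| = |v_{ij} \cdot n_{ij}| + c_{ij}, \tag{4.7} $$

where $n_{ij}$ is the unit vector normal to the cell interface, $v_{ij}$ is the velocity vector, and $c_{ij}$ is the speed of sound.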
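As an illustration of Eqs. (4.6) and (4.7), the following Python sketch evaluates the simplified edge flux for a pair of states. Several details are assumptions not stated in the text: an ideal gas with ratio of specific heats 1.4, a state vector stored as (ρ, ρu, ρv, ρw, ρe), and interface velocity and sound speed taken as arithmetic averages of the two end states. All function names are hypothetical.

```python
import math

GAMMA = 1.4  # assumed ideal-gas ratio of specific heats


def pressure(q):
    """Pressure from conserved variables (rho, rho*u, rho*v, rho*w, rho*e)."""
    rho, mx, my, mz, rho_e = q
    kinetic = 0.5 * (mx * mx + my * my + mz * mz) / rho
    return (GAMMA - 1.0) * (rho_e - kinetic)


def euler_flux(q, n):
    """Inviscid flux F(Q, n) through a face with unit normal n."""
    rho, mx, my, mz, rho_e = q
    p = pressure(q)
    vn = (mx * n[0] + my * n[1] + mz * n[2]) / rho
    return [rho * vn,
            mx * vn + p * n[0],
            my * vn + p * n[1],
            mz * vn + p * n[2],
            (rho_e + p) * vn]


def spectral_radius(qi, qj, n):
    """|v.n| + c with face values averaged from both states (an assumption)."""
    def vn_and_c(q):
        vn = (q[1] * n[0] + q[2] * n[1] + q[3] * n[2]) / q[0]
        return vn, math.sqrt(GAMMA * pressure(q) / q[0])
    vni, ci = vn_and_c(qi)
    vnj, cj = vn_and_c(qj)
    return abs(0.5 * (vni + vnj)) + 0.5 * (ci + cj)


def edge_flux(qi, qj, n):
    """Bracketed term of Eq. (4.6): 0.5*[F(Qi,n) + F(Qj,n) - |lambda|*(Qj - Qi)]."""
    lam = spectral_radius(qi, qj, n)
    fi, fj = euler_flux(qi, n), euler_flux(qj, n)
    return [0.5 * (a + b - lam * (y - x)) for a, b, x, y in zip(fi, fj, qi, qj)]
```

Two properties are easy to check with this sketch: for identical states the dissipation term vanishes and the edge flux reduces to F(Q, n), and swapping the two states together with the sign of the normal flips the sign of the flux, which is what makes the edge-based accumulation conservative.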

Using an edge-based data structure, the left-hand side Jacobian matrix $A$ is stored in lower, upper, and diagonal forms, which can be expressed as

$$ A = L + U + D, \tag{4.8} $$

where

$$ L = \frac{1}{2}\left[ -\frac{\partial F(Q_i, n_{ij})}{\partial Q_i} - |\lambda_{ij}|\,I \right] s_{ij}, \tag{4.9} $$

$$ U = \frac{1}{2}\left[ \frac{\partial F(Q_j, n_{ij})}{\partial Q_j} - |\lambda_{ij}|\,I \right] s_{ij}, \tag{4.10} $$

$$ D = \frac{V}{\Delta t}\,I + \sum_j \frac{1}{2}\left[ \frac{\partial F(Q_i, n_{ij})}{\partial Q_i} + |\lambda_{ij}|\,I \right] s_{ij}. \tag{4.11} $$

Equation (4.4) represents a system of linear simultaneous algebraic equations and needs to be solved at each time step. The most widely used methods to solve this linear system are iterative solution methods and approximate factorization methods. In Ref. 1 it has been shown that the matrix-free GMRES+LU-SGS method results in very good convergence for unstructured meshes. Since our goal is not to solve the system entirely by the LU-SGS approximate factorization but rather to use GMRES with an appropriate preconditioner, the preconditioner must be very fast, and at the same time it should resemble the original Jacobian matrix $A$ as closely as possible. Preconditioning will be cost-effective only if the additional computational work incurred for each subiteration is compensated for by a reduction in the total number of iterations to convergence. Thus, even a moderate inefficiency in the parallelization of the preconditioner can be critical.

Next, the following matrix-free methods are considered as candidates for the GMRES preconditioner:

1. The LU-SGS;
2. Data-Parallel Lower-Upper Relaxation (DP-LUR) method, which by its nature is a Jacobi iterative method;
3. Symmetric Gauss-Seidel (SGS) relaxation method. The LU-SGS approximate factorization scheme is just a subset of the SGS method and corresponds to the SGS scheme with k=1.

1. The LU-SGS approximate factorization is described as follows:

$$ (D + L)\,D^{-1}\,(D + U)\,\Delta Q = R + (L D^{-1} U)\,\Delta Q. \tag{4.12} $$

Neglecting the last term on the right-hand side of Eq. (4.12), and assuming that

$$ \frac{\partial F}{\partial Q}\,\Delta Q \approx \Delta F = F(Q + \Delta Q) - F(Q), \tag{4.13} $$

the system can be solved in two steps.
First, a lower (forward) sweep:

$$ (D + L)\,\Delta Q^* = R, \tag{4.14} $$

or, in matrix-free form:

$$ \Delta Q_i^* = D^{-1}\left[ R_i - \frac{1}{2}\sum_{j:\,j \in L(i)} \left( \Delta F_j^* - |\lambda_{ij}|\,\Delta Q_j^* \right) s_{ij} \right]; \tag{4.15} $$

and second, an upper (backward) sweep:

$$ (D + U)\,\Delta Q = D\,\Delta Q^*, \tag{4.16} $$

or:

$$ \Delta Q_i = \Delta Q_i^* - D^{-1}\,\frac{1}{2}\sum_{j:\,j \in U(i)} \left( \Delta F_j - |\lambda_{ij}|\,\Delta Q_j \right) s_{ij}. \tag{4.17} $$

The most remarkable feature of this approximation is that there is no need to store the upper and lower matrices $U$ and $L$, which substantially reduces the memory requirements. It is found that this approximation does not compromise any numerical accuracy, and the extra computational cost is negligible. These sweeps can be vectorized with long vector lengths by using a special ordering technique [7], but parallelization of the LU-SGS algorithm is not straightforward due to inherent data dependencies.

2. The DP-LUR method has been successfully used elsewhere as a substitute for the LU-SGS method for massively parallel computer implementation. The method has no inherent data dependencies, so it can be easily parallelized in the same way as an explicit scheme. The method can be described in the following way. The first subiteration:

$$ \Delta Q^{0} = D^{-1} R. \tag{4.18} $$

Then the $k_{max}$ subiterations are made using

$$ \Delta Q^{k+1} = D^{-1}\left( R - (U + L)\,\Delta Q^{k} \right), \tag{4.19} $$

which can be written in matrix-free form as


$$ \Delta Q_i^{k+1} = D^{-1}\left[ R_i - \frac{1}{2}\sum_{j} \left( \Delta F_j^{k} - |\lambda_{ij}|\,\Delta Q_j^{k} \right) s_{ij} \right], \tag{4.20} $$

where $k$ is the subiteration number. With this approach, the data required for each subiteration has already been computed during the previous subiteration. Therefore, the entire subiteration may be performed simultaneously, and there are no data dependencies.

3. Symmetric Gauss-Seidel relaxation. First, zero the array:

$$ \Delta Q^{0} = 0. \tag{4.21} $$

Then the $k_{max}$ subiterations are made using a forward sweep:

$$ (D + L)\,\Delta Q^{k+1/2} = R - U\,\Delta Q^{k}, \tag{4.22} $$

and then a backward sweep:

$$ (D + U)\,\Delta Q^{k+1} = R - L\,\Delta Q^{k+1/2}, \tag{4.23} $$

which can be written in matrix-free form as a forward sweep:

$$ \Delta Q_i^{k+1/2} = D^{-1}\left[ R_i - \frac{1}{2}\sum_{j:\,j \in L(i)} \left( \Delta F_j^{k+1/2} - |\lambda_{ij}|\,\Delta Q_j^{k+1/2} \right) s_{ij} - \frac{1}{2}\sum_{j:\,j \in U(i)} \left( \Delta F_j^{k} - |\lambda_{ij}|\,\Delta Q_j^{k} \right) s_{ij} \right], \tag{4.24} $$

and a backward sweep:

$$ \Delta Q_i^{k+1} = D^{-1}\left[ R_i - \frac{1}{2}\sum_{j:\,j \in U(i)} \left( \Delta F_j^{k+1} - |\lambda_{ij}|\,\Delta Q_j^{k+1} \right) s_{ij} - \frac{1}{2}\sum_{j:\,j \in L(i)} \left( \Delta F_j^{k+1/2} - |\lambda_{ij}|\,\Delta Q_j^{k+1/2} \right) s_{ij} \right]. \tag{4.25} $$

For one subiteration ($k_{max}=1$), the SGS method is equivalent to the LU-SGS approximate factorization method.

5. PARALLELIZATION OF THE PRECONDITIONER

Since parallelization of the DP-LUR method is straightforward, we discuss only the parallelization of the Symmetric Gauss-Seidel methods. There are two approaches to the solution of the problem:

1. Use a special scheduling algorithm which enables data parallelism by regrouping edges. This method has the advantage of producing exactly the same result as the single processor case, but it suffers from severe overhead penalties for parallel loop initiation, heavy interprocessor communications, and poor load balance.

2. Split the computational domain into several non-overlapping regions according to the number of processors, and apply the SGS method inside each region with (or without) some special interprocessor boundary treatment [37]. This approach may suffer from convergence degradation but takes advantage of minimal parallelization overhead and good load balance.

Our experience with the shared-memory SGI Origin 2000 computer has shown that the first method does not provide good scalability, so we consider the second approach here.

For testing purposes we computed a transonic flow in a channel with a 10% circular bump on the lower wall. The length of the channel is 3, its height is 1, and its width is 0.5. The inlet Mach number is 0.675. This is a three-dimensional simulation of a two-dimensional flow. The tetrahedral mesh was automatically generated by the advancing front technique and contains 3,56 grid points, 64,595 elements, and 8,756 boundary triangles. The mesh and computed pressure contours are shown in Figs. 2a and 2b. All computations were run with an essentially infinite time step (CFL = 10^4).

Fig. 2. Flow in channel with circular bump. (a) Surface mesh. (b) Computed pressure contours on the channel surface.
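The three candidate preconditioners of Section 4 can be compared on a small dense system. The Python sketch below applies them to an explicitly stored matrix, which is an assumption made for clarity: the paper's scheme is matrix-free and works with flux differences instead of stored L and U. The function names `lusgs`, `sgs`, and `dplur` are hypothetical.

```python
def lusgs(A, r):
    """Two-step LU-SGS: (D+L) x* = r, then (D+U) x = D x* (Eqs. 4.14, 4.16)."""
    n = len(r)
    xs = [0.0] * n
    for i in range(n):
        s = sum(A[i][j] * xs[j] for j in range(i))
        xs[i] = (r[i] - s) / A[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = xs[i] - s / A[i][i]
    return x


def sgs(A, r, kmax):
    """Symmetric Gauss-Seidel from a zero start (Eqs. 4.21 to 4.23)."""
    n = len(r)
    x = [0.0] * n
    for _ in range(kmax):
        for i in range(n):            # forward sweep, updates in place
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (r[i] - s) / A[i][i]
        for i in reversed(range(n)):  # backward sweep, updates in place
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (r[i] - s) / A[i][i]
    return x


def dplur(A, r, kmax):
    """DP-LUR: Jacobi-type subiterations started from D^-1 r (Eqs. 4.18, 4.19)."""
    n = len(r)
    x = [r[i] / A[i][i] for i in range(n)]
    for _ in range(kmax):
        x = [(r[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x
```

In this sketch `sgs(A, r, 1)` reproduces `lusgs(A, r)` up to rounding, and `dplur(A, r, 0)` is the diagonal preconditioner. Every entry of a DP-LUR subiteration depends only on the previous iterate, which is why it parallelizes like an explicit scheme, while the in-place sweeps of `sgs` carry the data dependency discussed in this section.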

Fig. 3. Blocks with the wavefront renumbering. (a) blocks. (b) 5 blocks.

Fig. 4. (a) Peano-Hilbert-Morton space-filling curve. (b) blocks obtained with the Peano-Hilbert renumbering. (c) 5 blocks obtained with the Peano-Hilbert renumbering.

There are several methods to obtain a good partitioning of the computational domain into blocks [8]. In our case we use the fact that the grid nodes are already renumbered to minimize bandwidth, so we cut the entire array of nodes into equally sized pieces corresponding to the number of processors. This technique is very simple and provides perfect load balancing. Though the method does not provide good control over minimization of the interprocessor boundary, it will be shown that this issue can be addressed by using alternative node renumbering techniques. In addition, since shared-memory platforms are considered, the interprocessor communication overhead is not tightly connected to the area of the interprocessor boundaries. The partitioning into blocks using the wavefront [19] renumbering is shown in Fig. 3a. Figure 3b shows a similar partitioning into 5 blocks. The wavefront renumbering results in very narrow slices, thus a Peano-Hilbert type space-filling curve was also considered to renumber the points. An example of such a curve is shown in Fig. 4a. This curve was obtained using Morton's algorithm. The blocks partitioning corresponding to the Peano-Hilbert renumbering is shown in Fig. 4b, while the 5 blocks partitioning is shown in Fig. 4c.

Next, the implementation of the LU-SGS scheme on parallel non-overlapped blocks is considered. Figure 5a shows an example of a grid point surrounded by nodes belonging to the same block. All surrounding nodes are divided into two groups, L and U, for lower and upper matrix computations correspondingly (see Eqs. (4.24-25)). At first, the SGS is used locally on each processor without any contribution from interprocessor boundaries. Consider point i, which has neighbors belonging to different blocks (Fig. 5b). If there is no exchange between the blocks, the L and U sets will look as shown in Fig. 5b, and contributions from the three gray-colored nodes of processor A are not computed. This approach has been tested using the LU-SGS scheme (without the GMRES) on and 5 blocks, with the wavefront node renumbering. The test computation was performed on a single processor Pentium III PC. Figure 6 demonstrates that convergence severely degrades for the -block case and stalls for the 5-block case.

The second approach for parallelization is the so-called hybrid LU-SGS or HLU-SGS. A similar scheme was used in Ref. 7 for structured grids. This scheme uses the DP-LUR for interprocessor edges, and the regular SGS scheme for edges internal to each block. It is easier to consider Eqs. (4.24-25) to understand the method. Schematically, the method is shown in Figs. 5c and 5d. Figure 5c corresponds to the forward sweep, and Fig. 5d corresponds to the backward sweep of the SGS procedure. In the SGS scheme, when the forward sweep is performed, the upper matrix computation has no data dependency. Conversely, when the backward sweep is performed, the lower matrix computation has no data dependency.

Fig. 5. Stencil for Gauss-Seidel scheme. (a) Internal point. (b) Interface point without interprocessor communications. (c) Hybrid SGS forward sweep. (d) Hybrid SGS backward sweep.

The results of the HLU-SGS computation (with k=1) on and 5 blocks are shown in Fig. 6. It is demonstrated that the hybrid scheme has some advantages over the LU-SGS scheme.

Fig. 6. Convergence history (versus CPU time in seconds) for LU-SGS and hybrid LU-SGS schemes without GMRES on ,, and 5 blocks with wavefront renumbering.

Next, consider how the LU-SGS, HLU-SGS, and DP-LUR schemes work as a preconditioner for the GMRES method. In our computations we used the same version of GMRES as in the earlier work, with a fixed number of search directions and iterations and a fixed solution tolerance. Results of the DP-LUR scheme as preconditioner are shown in Fig. 7a. The advantages of this method are its easy parallelization and lack of dependency on the number of processors. The influence of the number of subiterations $k_{max}$ is also represented in Fig. 7a. Note that $k_{max}=0$ is equivalent to a diagonal preconditioner. It is inefficient to use more than one subiteration, since an increase of $k_{max}$ yields no improvement in the convergence rate. When the DP-LUR scheme is not used as a preconditioner, the result is reversed: an increase in the number of subiterations improves performance.

Figure 7b illustrates the influence of the number of subiterations in the SGS preconditioner on convergence. These computations were performed using one single block. The test with k=1, which is equivalent to the LU-SGS preconditioner, gives the best performance overall, in contrast with results obtained with the SGS scheme alone, which converges better when more subiterations are used. This can be explained by the fact that the GMRES iterative procedure is more efficient than the SGS iterations.

Next, increasing the number of blocks is considered. Previously, some authors [7] used several preconditioner subiterations to reduce its degradation. Our results using 5 blocks (Fig. 7c) show that increasing the number of subiterations actually leads to slower convergence.

It was demonstrated that with an increasing number of blocks, the LU-SGS scheme suffers from performance degradation. Let us check how this fact influences the behavior of the GMRES scheme. Figure 8a shows convergence rate comparisons for ,,, and 5 blocks using a simple LU-SGS preconditioner without hybrid treatment of interprocessor boundaries and with the wavefront node renumbering. The diagonal preconditioner result is also shown because it represents the worst scenario of the LU-SGS scheme, when the number of blocks is equal to the number of grid points. Figure 8b shows the corresponding results for the hybrid LU-SGS scheme. The hybrid scheme is a better choice for a large number of blocks. The worst case for the hybrid scheme corresponds to the DP-LUR preconditioner with $k_{max}=1$.

Fig. 7. Convergence history (versus CPU time in seconds). (a) GMRES+DP-LUR scheme, single block. (b) GMRES+SGS scheme, single block. (c) GMRES+Hybrid SGS scheme, 5 blocks.
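The two block-wise variants discussed above, and their stated worst cases, can be reproduced on a small dense system. The following Python sketch is one reading of the text, not the authors' code: the matrix is stored explicitly, blocks are contiguous slices of the node array, the block-local variant simply drops couplings that cross a block boundary, and the hybrid variant lets such couplings see only values from the previous sweep. The names `block_ids`, `local_lusgs`, and `hybrid_lusgs` are hypothetical.

```python
def block_ids(n, nblocks):
    """Cut the (already renumbered) node array into contiguous, near-equal blocks."""
    return [i * nblocks // n for i in range(n)]


def local_lusgs(A, r, nblocks):
    """LU-SGS applied inside each block; couplings across blocks are ignored."""
    n = len(r)
    blk = block_ids(n, nblocks)
    xs = [0.0] * n
    for i in range(n):
        s = sum(A[i][j] * xs[j] for j in range(i) if blk[j] == blk[i])
        xs[i] = (r[i] - s) / A[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n) if blk[j] == blk[i])
        x[i] = xs[i] - s / A[i][i]
    return x


def hybrid_lusgs(A, r, nblocks):
    """Gauss-Seidel inside a block, previous-sweep values across blocks."""
    n = len(r)
    blk = block_ids(n, nblocks)
    half = [0.0] * n
    for i in range(n):
        # Forward sweep from a zero start: only in-block lower neighbours
        # already hold new values; all other neighbours still hold zero.
        s = sum(A[i][j] * half[j] for j in range(i) if blk[j] == blk[i])
        half[i] = (r[i] - s) / A[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = 0.0
        for j in range(n):
            if j == i:
                continue
            if blk[j] == blk[i] and j > i:
                s += A[i][j] * x[j]      # updated earlier in this backward sweep
            else:
                s += A[i][j] * half[j]   # forward-sweep value (Jacobi-like)
        x[i] = (r[i] - s) / A[i][i]
    return x
```

With a single block both functions reduce to LU-SGS. With as many blocks as nodes, `local_lusgs` degenerates to the diagonal preconditioner and `hybrid_lusgs` to one Jacobi step started from D^-1 R, i.e. DP-LUR with one subiteration, matching the worst cases named in the text.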

Fig. 8. Convergence history (versus CPU time in seconds). (a) GMRES+LU-SGS with wavefront renumbering. (b) GMRES+hybrid LU-SGS with wavefront renumbering. (c) GMRES+LU-SGS with Peano-Hilbert renumbering. (d) GMRES+hybrid LU-SGS with Peano-Hilbert renumbering.

Figures 8c and 8d show the results of computations with the Peano-Hilbert renumbering. Comparison with the results shown in Figs. 8a and 8b shows that the gain from a good renumbering technique is much more important than the gain from the hybrid SGS scheme.

We conclude that the LU-SGS scheme used as a preconditioner for GMRES is not very sensitive to partitioning into blocks. Both the LU-SGS scheme and the HLU-SGS scheme can be used as a preconditioner. When a large number of processors is required, it is better to pay attention to the domain-splitting technique; in our case the Peano-Hilbert reordering gives good results. If the number of processors is moderate, it is not important how the domain is divided into blocks.
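A space-filling-curve renumbering followed by equal cuts can be sketched in a few lines of Python. The version below uses a plain Morton (Z-order) key obtained by bit interleaving, which is an assumed simplification: the text only says the curve was obtained using Morton's algorithm, and a true Peano-Hilbert curve additionally rotates sub-cells. The names `morton_key`, `morton_order`, and `cut_into_blocks`, and the 10-bit resolution, are hypothetical choices.

```python
def morton_key(ix, iy, iz, bits=10):
    """Interleave the bits of three integer cell indices (Z-order key)."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key


def morton_order(points, bits=10):
    """Return point indices sorted along the Morton curve of their bounding box."""
    lo = [min(p[d] for p in points) for d in range(3)]
    hi = [max(p[d] for p in points) for d in range(3)]
    scale = (1 << bits) - 1

    def cell(p):
        # Map each coordinate to an integer cell index in [0, 2**bits - 1].
        return [int((p[d] - lo[d]) / (hi[d] - lo[d]) * scale) if hi[d] > lo[d] else 0
                for d in range(3)]

    return sorted(range(len(points)),
                  key=lambda i: morton_key(*cell(points[i]), bits=bits))


def cut_into_blocks(order, nblocks):
    """Cut the renumbered node list into contiguous blocks of near-equal size."""
    n = len(order)
    bounds = [k * n // nblocks for k in range(nblocks + 1)]
    return [order[bounds[k]:bounds[k + 1]] for k in range(nblocks)]
```

Points that are close along the curve tend to be close in space, so contiguous cuts of the reordered list give compact blocks, whereas cutting a wavefront ordering gives the narrow slices mentioned earlier. Block sizes differ by at most one node, which is the load-balance property the text relies on.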


Fig. 9. ONERA M6 wing. Absolute velocity contours. M=0.84, angle of attack 3.06°.

Fig. 10. Residual convergence history versus time steps for ONERA M6.

Fig. 11. ONERA M6. Cp profile at 20% semispan (a) and 44% semispan (b). Computation compared with experiment; axes are X/C and -Cp.

6. APPLICATION OF THE GMRES+LU-SGS SCHEME TO LARGE-SCALE COMPUTATIONS

The computations were performed on an SGI Origin 2000 computer with R10000 processors.

ONERA M6 Wing Configuration

The first application is an inviscid transonic flow over an ONERA M6 wing. The M6 wing has a leading-edge sweep angle of 30°, an aspect ratio of 3.8, and a taper ratio of 0.562. The airfoil section of the wing is the ONERA D airfoil, which is a 10% maximum thickness-to-chord ratio conventional section. The flow solutions are presented at a Mach number of 0.84 and an angle of attack of 3.06°. The mesh used in the computation consists of 74,95 elements, 36,5 grid points, and ,76 boundary points. The computed absolute velocity contours on the wing surface are displayed in Fig. 9. The upper surface contours clearly show the sharply captured lambda-type shock structure formed by the two inboard shock waves, which merge near the 80% semispan to form the single strong shock wave in the outboard region of the wing. The computed pressure coefficient distributions are compared with experimental data [3] in Fig. 11. We can observe that there is only one grid point within the shock structure; this demonstrates the sharp shock-capturing ability of the AUSM+ scheme. The results obtained compare closely with the experimental data. The convergence history for 1, 2, 4, 6, 8, 10, 12, and 16 processors is shown in Fig. 10. No serious convergence degradation is observed.

Wing/Pylon/Finned-Store (Eglin) Configuration

Another test case was conducted for the wing/pylon/finned-store configuration reported in Ref. 3, which consists of a clipped delta wing with 45° sweep, composed of a constant NACA 64 symmetric airfoil section. The wing has a root chord of 6 in, a semispan of 3 in, and a taper ratio of .34. The pylon is located at the midspan station. The width of the pylon is .94 in. A constant NACA8 airfoil section with a leading-edge sweep of 45° and a truncated tip defines the four fins of the store. The mesh used in the computation is shown in Fig. 12a. It contains ,39,694 elements, 39,547 grid points, and 7,359 boundary points. The flow solutions are presented at a Mach number of 0.95 and an angle of attack of 0°. Figures 12b and 12c show the pressure contours on the upper and lower wing surfaces, respectively. The convergence history for the computation with 6 processors is shown in Fig. 13.

The resulting speedups for the bump case, the ONERA M6 case, and the Eglin case are shown in Fig. 14. The speedup was measured by timing the CPU time of one time step on different numbers of processors. The performance degrades with the number of processors. This is to be expected, as the increasing number of passes results in higher relative loop costs. The resulting speedup is very similar to the one obtained for the explicit scheme, see Ref. 16.

Time Accurate Simulation of Aircraft Canopy Trajectory

For the time accurate implicit computation we use the same method applied in Ref. 2. The method is based on pseudo-timesteps. In this method, Eq. (2.3) is transformed to the following form:

$$ \frac{\partial (VQ)^{n+1}}{\partial \tau} + \frac{V^{n+1} Q^{n+1} - V^{n} Q^{n}}{\Delta t} = \alpha R^{n+1} + (1 - \alpha) R^{n}, \tag{6.1} $$

where $\tau$ is the pseudo time variable and $n$ denotes the time level. If $\alpha=1$, the scheme is the backward Euler method.
If $\alpha=0.5$, the resulting scheme, known as the Crank-Nicolson method, is second-order in time. This yields the following system of linear equations

$$ \left[ V^{n+1}\left( \frac{1}{\alpha \Delta t} + \frac{1}{\Delta \tau} \right) I - \frac{\partial R^{m}}{\partial Q} \right] \Delta Q^{m} = -\frac{V^{n+1} Q^{m} - V^{n} Q^{n}}{\alpha \Delta t} + R^{m} + \frac{1 - \alpha}{\alpha} R^{n}, \tag{6.2} $$

where $m$ is the pseudo-time iteration index. It has been shown in Ref. 2 that the fully implicit scheme is more accurate than its linearized counterpart, since it requires several subiterations to achieve convergence on each time step.

The new method was used to compute an F/A-18C/D fighter canopy ejection. During the initial opening of the canopy, a number of topological changes occur in the geometry. The problem was computed with Mach number 0.76. The computation was started at t=6 ms, when the canopy just started to move (see Fig. 15), and ended at t=5 ms, when the canopy had moved to the tail part of the plane. At the initial stage the canopy was hinged to the plane. After rotating 45 degrees (at t=4 ms), the canopy was released and allowed to move in response to the forces exerted on it. The mesh at several time instances as well as the velocity field is shown in Fig. 16. The average size of the mesh was 5, points and ,3, elements. The mesh size varied a little during the computation. More detailed data on this problem can be found in Ref. 33.

A time step corresponding to the CFL number of 5 was used in the computation. The simulation required approximately 4 CPU hours on 8 processors of the SGI Origin 2000. The plot of CPU time vs. problem time is shown in Fig. 17. The curve is not straight but rather has 8 steps attributed to the global remeshing required by the algorithm. The remeshing was done by a new parallel algorithm described in Ref. 34. This computation is substantially faster than our explicit computation, see Ref. 33. Only 84 time steps were required (compare with approximately 3, time steps with the explicit scheme). This significant reduction in CPU requirements is attributed to the implicit GMRES scheme and parallel remeshing.

Figure 18 shows the speedup of the time accurate computation. This speedup was computed by measuring the CPU time for a single timestep.
The time accurate speedup result is very similar to those obtained for the steady-state cases.
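The pseudo-time iteration of Eq. (6.2) can be tried out on a scalar model problem. The Python sketch below is a deliberately reduced illustration under stated assumptions: a single unknown, a constant cell volume V, and a linear residual R(Q) = -aQ, so the Jacobian is the constant -a. The function name `advance_one_step` and its default parameters are hypothetical.

```python
def advance_one_step(q_n, dt, a, alpha=0.5, dtau=0.1, n_sub=40, vol=1.0):
    """Advance dQ/dt = R(Q) = -a*Q by one physical step via pseudo-time iteration.

    Implements the scalar analogue of Eq. (6.2) with constant volume `vol`.
    """
    def residual(q):
        return -a * q

    drdq = -a          # Jacobian of the linear model residual
    q_m = q_n          # pseudo-time iterate, started from the old time level
    for _ in range(n_sub):
        lhs = vol * (1.0 / (alpha * dt) + 1.0 / dtau) - drdq
        rhs = (-vol * (q_m - q_n) / (alpha * dt)
               + residual(q_m)
               + (1.0 - alpha) / alpha * residual(q_n))
        q_m += rhs / lhs   # Q^{m+1} = Q^m + dQ^m
    return q_m
```

When the pseudo-time iteration has converged, the right-hand side vanishes and the result satisfies the discrete scheme of Eq. (6.1) exactly; for this linear model that gives Q^{n+1} = Q^n (1 - (1 - α) a Δt) / (1 + α a Δt), which reduces to the backward Euler update for α = 1 and to the Crank-Nicolson update for α = 0.5.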


Fig. 12. (a) Surface mesh used for Wing/Pylon/Finned-Store configuration. (b) Computed pressure contours on the upper surface. (c) Computed pressure contours on the lower surface.

Fig. 13. Residual convergence history versus time steps for Wing/Pylon/Finned-Store configuration on 6 processors.

Fig. 14. Speedups (versus number of processors, with the perfect case for reference) in computations of the channel with circular bump, ONERA M6, and Wing/Pylon/Finned-Store (Eglin) configurations.


Fig. 5. F/A-18 C/D fighter canopy ejection (panel labels: 5 ms, 9 ms, 5 ms, 7 ms, 6 ms).

CONCLUSIONS

A parallelization technique for the matrix-free GMRES+LU-SGS unstructured grid method on shared-memory machines is proposed. The method requires no direct domain partitioning and can easily be combined with mesh refinement and remeshing procedures. Special attention is given to the parallel implementation of the GMRES preconditioner. It is shown that for a moderate number of processors, the LU-SGS method without interprocessor data exchange is a good choice. The hybrid LU-SGS scheme works slightly better for a higher number of processors. Proper node renumbering is critical to the efficiency of the method. For parallelization of the implicit scheme, the Peano-Hilbert type renumbering demonstrated the best results. Even though the method's efficiency degrades as the number of processors increases, the degradation is proven to be small, and the method always maintains its stability, since the worst case corresponds to the GMRES scheme with diagonal preconditioning, which is proven to be stable for the Euler computations. The method has been successfully applied to several steady-state and time-accurate 3-D simulations. Significant savings in CPU time are achieved as compared to the previous version of the code, which utilized explicit Runge-Kutta time integration.
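The remark that the worst case reduces to diagonal preconditioning can be illustrated with a small sketch. The code below is not the matrix-free implementation of the paper. It assumes an explicitly stored dense matrix `A` and a hypothetical `owner` list that assigns each node to a processor, and it applies a symmetric Gauss-Seidel preconditioner M = (D+L) D^-1 (D+U) in which couplings between nodes owned by different processors are simply dropped, mimicking LU-SGS without interprocessor data exchange.

```python
def local_sgs_apply(A, r, owner):
    """Return z = M^-1 r for a processor-local symmetric Gauss-Seidel M.

    A     : dense square matrix as a list of rows (assumed, for illustration)
    r     : right-hand side vector
    owner : owner[i] is the (hypothetical) processor id of node i;
            off-diagonal entries coupling different owners are ignored.
    """
    n = len(r)

    # Forward sweep: solve (D + L_local) y = r.
    y = [0.0] * n
    for i in range(n):
        s = r[i]
        for j in range(i):
            if owner[j] == owner[i]:
                s -= A[i][j] * y[j]
        y[i] = s / A[i][i]

    # Backward sweep: solve (D + U_local) z = D y.
    z = [0.0] * n
    for i in reversed(range(n)):
        s = A[i][i] * y[i]
        for j in range(i + 1, n):
            if owner[j] == owner[i]:
                s -= A[i][j] * z[j]
        z[i] = s / A[i][i]
    return z


if __name__ == "__main__":
    A = [[4.0, -1.0, 0.0, -1.0],
         [-1.0, 4.0, -1.0, 0.0],
         [0.0, -1.0, 4.0, -1.0],
         [-1.0, 0.0, -1.0, 4.0]]
    r = [1.0, 2.0, 3.0, 4.0]
    print("one processor   :", local_sgs_apply(A, r, [0, 0, 0, 0]))
    print("two processors  :", local_sgs_apply(A, r, [0, 0, 1, 1]))
    print("one node per cpu:", local_sgs_apply(A, r, [0, 1, 2, 3]))
```

With a single owner the routine is the full symmetric Gauss-Seidel preconditioner. When every node has its own owner, all off-diagonal terms are dropped and the result is r[i] / A[i][i], that is, diagonal preconditioning. Intermediate partitions fall between these two limits, which is why a renumbering that keeps strongly coupled nodes on the same processor matters.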

Fig. 6. F/A-18 C/D fighter canopy ejection. Surface mesh and absolute velocity contours.

Fig. 7. Canopy ejection. CPU time on 8 processors versus problem time. Axes: CPU time (h) versus problem time (ms).

Fig. 8. Speedup for the canopy ejection case. Axes: speedup versus number of processors.

REFERENCES

1. Luo, H., Baum, J.D., and Löhner, R., A Fast, Matrix-Free Implicit Method for Compressible Flows on Unstructured Grids, Journal of Computational Physics, Vol. 46, pp ,998.
2. Luo, H., Baum, J.D., and Löhner, R., An Accurate, Fast Matrix-Free Implicit Method for Computing Unsteady Flows on Unstructured Grids, AIAA ,
3. Saad, Y., and Schultz, M.H., GMRES: a Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems, SIAM J. Sci. Stat. Comp., Vol. 7, No 3 (988), pp
4. Jameson, A., and Yoon, S., Lower-Upper Implicit Schemes with Multiple Grids for the Euler Equations, AIAA J., Vol. 5, No7, pp ,
5. Soetrisno, M., Imlay, S.T., and Roberts, D.W., A Zonal Implicit Procedure for Hybrid Structured-Unstructured Grids, AIAA ,
6. Men'shov, I., Nakamura, Y., An Implicit Advection Upwind Splitting Scheme for Hypersonic Air Flows in Thermochemical Nonequilibrium, 6th Int. Symp. on CFD, pp.85-8,
7. Sharov, D., Nakahashi, K., Reordering of Hybrid Unstructured Grids for Lower-Upper Symmetric Gauss-Seidel Computations, AIAA J., vol.36, No 3, pp.48486,
8. Williams, D., Performance of Dynamic Load Balancing Algorithms for Unstructured Grid Calculations; CalTech Rep. C3P93 (99).
9. Simon, H., Partitioning of Unstructured Problems for Parallel Processing; NASA Ames Tech. Rep. RNR-9-8 (99).
10. Mehrotra, P., Saltz, J., Voigt, R. (eds.), Unstructured Scientific Computation on Scalable Multiprocessors; MIT Press (99).
11. Vidwans, A., Kallinderis, Y., Venkatakrishnan, V., A Parallel Load Balancing Algorithm for 3-D Adaptive Unstructured Grids; AIAA-9333-CP (993).

12. Löhner, R., Three-Dimensional Fluid-Structure Interaction Using a Finite Element Solver and Adaptive Remeshing; Computer Systems in Engineering,, 577 (99).
13. Löhner, R., and Baum, J.D., Adaptive H-Refinement on 3-D Unstructured Grids for Transient Problems; Int. J. Num. Meth. Fluids, 4, pp.479 (99).
14. Haug, E., Charlier, H., et al., Recent Trends and Developments of Crashworthiness Simulation Methodologies and their Integration into the Industrial Vehicle Design Cycle; Proc. Third European Cars/Trucks Simulation Symposium (ASIMUTH), Oct.8 (99).
15. Ramamurti, R., and Löhner, R., Simulation of Flow Past Complex Geometries Using a Parallel Implicit Incompressible Flow Solver; pp.49-5, Proc. th AIAA CFD Conf., Orlando, FL, July (993).
16. Löhner, R., Renumbering Strategies for Unstructured-Grid Solvers Operating on Shared-Memory, Cache-Based Parallel Machines, AIAA 9745, 997, pp
17. Liou, M.S., Progress towards an Improved CFD Method: AUSM+, AIAA 95-7, (995).
18. Cuthill, E., and McKee, J., Reducing the Bandwidth of Sparse Symmetric Matrices; Proc. ACM Nat. Conf., New York 969, pp.57-7, (969).
19. Löhner, R., Some Useful Renumbering Strategies for Unstructured Grids; Int. J. Num. Meth. Eng., 36, pp.3597, (993).
20. Sagan, H., Space-Filling Curves, Springer Verlag, New York,
21. Candler, G.V., and Wright, M.J., Data-Parallel Lower-Upper Relaxation Method for Reacting Flows, AIAA Journal, 3, No, pp38386,
22. Povitsky, A., Morris, P.J., Parallel Compact Multi-Dimensional Numerical Algorithm with Application to Aeroacoustics, AIAA 997, (999).
23. Jenssen, C.B., Implicit Multiblock Euler and Navier-Stokes Calculations, AIAA Journal, 3, No 9, pp.88-84,
24. Sheng, C., Hyams, D. et al., Three-Dimensional Incompressible Navier-Stokes Flow Computations About Complete Configurations Using a Multiblock Unstructured Grid Approach, AIAA , (999).
25. Stoll, P., Gerlinger, P., Bruggemann, D., Domain Decomposition for an Implicit LU-SGS Scheme using Overlapping Grids, AIAA , pp , (997).
26. Wissink, A.W., Lyrintzis, A.S., and Strawn, R.C., Parallelization of a Three-Dimensional Flow Solver for Euler Rotorcraft Aerodynamics Predictions, AIAA Journal, 34, No., pp.76-83,
27. Wissink, A.W., Lyrintzis, A.S., Chronopoulos, A.T., A Parallel Newton-Krylov Method for Rotorcraft Flowfield Calculations, AIAA-97-49, pp.6-7,
28. Flower, J., Otto, S., Salama, M., Optimal Mapping of Irregular Finite Element Domains to Parallel Processors; pp.395 (99).
29. Venkatakrishnan, V., Simon, H.D., Barth, T.J., A MIMD Implementation of a Parallel Euler Solver for Unstructured Grids; NASA Ames Tech. Rep. RNR-94 (99).
30. Löhner, R., Ramamurti, R., A Load Balancing Algorithm for Unstructured Grids; Comp. Fluid Dyn., 5, pp (995).
31. Schmitt, V., Charpin, F., Pressure Distributions on the ONERA M6 Wing at Transonic Mach Numbers, Experiment Data Base for Computer Program Assessment, AGARD AR8,
32. Heim, E.R., CFD Wing/Pylon/Finned Store Mutual Interference Wind Tunnel Experiment, AEDC-TSR-9-P4, Arnold Engineering Development Center, Arnold AFB, TN, Jan.,
33. Baum, J.D., Löhner, R., Marquette, T.J., Luo, H., Numerical Simulation of Aircraft Canopy Trajectory, AIAA , (997).
34. Löhner, R., A Parallel Advancing Front Grid Generation Scheme, AIAA-5, ().
