
Helsinki University of Technology
Laboratory of Applied Thermodynamics

Parallelization of a Multi-block Flow Solver

Patrik Rautaheimo¹, Esa Salminen² and Timo Siikonen³
Helsinki University of Technology, Espoo, Finland

Report No. 98
January 30, 1997
Otaniemi

ISBN    ISSN

¹ Research Scientist, Laboratory of Applied Thermodynamics
² Research Scientist, Laboratory of Applied Thermodynamics
³ Associate Professor, Laboratory of Applied Thermodynamics

Abstract

The parallelization of a Navier-Stokes solver is presented. The flow solver is based on a multi-block structured grid. The parallelization is performed over the blocks, and the data on the block boundaries is exchanged using the MPI Standard. The parallelized code can also be run in a single-processor mode or on a shared-memory machine. In order to facilitate the pre- and post-processing, a separate program for the domain decomposition has been written. The first tests indicate good scalability of the parallelization approach.

Contents

Nomenclature
1 Introduction
2 Basic Features of the Flow Solver
  2.1 Numerical method
  2.2 Treatment of boundary conditions
  2.3 Computation and data arrangement
3 Parallelization
  3.1 Main principles
  3.2 Parallelization using MPI
  3.3 Communication during iteration
4 Automatic Grid Block Splitting
  4.1 Main principles
  4.2 Grid splitting
  4.3 Redefinition of the boundary conditions
  4.4 Sample case
5 Test Runs
  5.1 Scaling
  5.2 Blocking
  5.3 Single-processor performance
6 Conclusions

Nomenclature

T          total time of the calculation
F, G, H    flux vectors in the x-, y- and z-directions
\hat{F}    flux in a given direction in space
M          Mach number
N          number of processes
n          number of grid points on one edge
Q          source term
Re         Reynolds number
U          vector of the conservative variables
c          total communication per total calculation time (= T_c/T)
T_c        total communication time
t          time
x, y, z    Cartesian coordinates
α          angle of attack
c_0        constant
η          efficiency of parallelization

Subscripts

i          i-index; summation index
v          viscous
x, y, z    coordinate directions
∞          free-stream value

1 Introduction

For over a decade, parallelization has been used to enhance the efficiency of flow solvers. The simplest method of parallelization, which can be used with shared-memory machines, takes place on the DO-loop level. DO-loop level parallelization is ineffective for a large number of processors. Better performance from a large number of processors can be obtained by dividing the space into smaller sub-domains. With a shared-memory machine like the Cray C94, the parallelization over the sub-domains is a trivial task, but with a massively parallel system like the Cray T3E things get more complicated. A common approach, applied e.g. in [1] and [2], is to divide the computational domain into equally sized blocks and to apply message passing between the blocks.

In this paper, the parallelization of a multi-block Navier-Stokes software is described. The parallelization is based on the Message Passing Interface (MPI) Standard [3]. The computational domain is divided into blocks and the block boundaries are updated using MPI. In order to obtain a balance between the processes, the blocks should be of equal size. However, the code can handle several (smaller) blocks in one process. This property can be utilized especially with small cases and with a small number of processors, when a good load balance is not so critical.

In addition to the changes in the flow solver, a separate preprocessor has been written to make the domain decomposition. This is because during the pre- and post-processing the grid and the results can be more easily handled in a few larger blocks. In the domain decomposition the most difficult task is to handle the definition of the boundary conditions automatically from the original boundary condition file.

In the following, the flow solver and the changes required for the parallelization are briefly described. Next, the principles of the domain decomposition are given. Test runs have been performed with the T3E and T3D machines, and with a cluster of SGI Indigo workstations.

2 Basic Features of the Flow Solver

2.1 Numerical method

The flow simulation is based on the solution of the Reynolds-averaged Navier-Stokes equations

\frac{\partial U}{\partial t} + \frac{\partial (F - F_v)}{\partial x} + \frac{\partial (G - G_v)}{\partial y} + \frac{\partial (H - H_v)}{\partial z} = Q    (2.1)

where U is the vector of dependent variables, F, G, H and F_v, G_v, H_v represent the inviscid and viscous parts of the fluxes, and Q is a possible source term. The flow solver utilizes a structured multi-block grid. For the solution, Eq. (2.1) is written in a finite-volume form

\frac{d}{dt}(V_i U_i) = -\sum_{\mathrm{faces}} S (\hat{F} - \hat{F}_v) + V_i Q_i    (2.2)

where V_i is a cell volume, and \hat{F} and \hat{F}_v are the inviscid and viscous parts of the flux on the cell surface. The sum is taken over the faces of the computational cell. The solution proceeds blockwise, with explicitly defined boundary conditions. The boundary conditions between the blocks are defined only on the highest grid level. In each block an implicit LU-factored solution with a multigrid acceleration of convergence is performed [4]. The underlying solution method is based on either flux-difference [5] or flux-vector [6] splitting. The flux calculation utilizes MUSCL-type differencing with second- or third-order accuracy. The code has been applied to external [7] and internal [8] flows. Turbulence is modelled either by an algebraic model or by a two-equation model. A Reynolds-stress model is under development and has been applied to simple test cases [9].

2.2 Treatment of boundary conditions

The boundary conditions are handled using two layers of ghost cells on the block boundaries to preserve second-order accuracy, as seen in Fig. 2.1. The list of boundary conditions is given in Table 2.1. For the connectivity, cyclic and sliding-mesh boundary conditions, information is exchanged between the blocks. Since the blocks can be connected in an arbitrary way, the block boundaries are divided into patches. A patch is formed by the common boundary shared by two adjacent blocks.

Fig. 2.1. The blocks are surrounded by two layers of ghost cells.

Fig. 2.2. Different boundary conditions can be applied on the patches of the block surfaces.

Thus the block surface can contain several patches with different boundary conditions, as shown in Fig. 2.2. The definition of the patches and the corresponding boundary condition types are given as input data in a specific boundary condition file. With a shared-memory machine the patch data is written into a boundary array after an iteration cycle. The receiving block reads the information from this array at a given position. This procedure is performed block by block: first, every block writes the values of the dependent variables on all its patches into the array, and then the data is substituted from the array into the ghost cells of the appropriate blocks. Only the central memory of the machine is utilized in this approach.

Table 2.1. Boundary condition types.

Boundary condition    Exchange of data
Connectivity          yes
External              no
Inlet                 no
Mirror                no
Outlet                no
Cyclic                yes
Singularity           no
Solid                 no
Rotating solid        no
Moving solid          no
Sliding mesh          yes

2.3 Computation and data arrangement

The computer code is programmed using standard Fortran 77. The variables are stored in one-dimensional arrays. Starting addresses are utilized in the calling routines to separate the data for different blocks and for different multigrid levels. The computation is performed using three kinds of DO-loops. For some variables, e.g. for the calculation of the equation of state, there is a single loop over the entire block including the ghost cells. In order to exclude the ghost cells, a three-level loop over the i-, j- and k-directions is utilized. Most of the calculation, including the evaluation of the fluxes and the implicit sweeps, is done slabwise as shown in Fig. 2.3. In this approach the ghost cells at the beginning and at the end of the row are included in the calculation, which increases the amount of calculation typically by a few per cent depending on the grid size. This treatment was originally designed for a vector computer, where the increased amount of computation is more than compensated by the enhanced efficiency owing to a longer vector. The treatment has been retained in the parallelization in order to maintain the original structure of the code and to facilitate portability between vector and RISC machines.

Fig. 2.3. Computation proceeds slabwise, including two parts of the ghost cell layers.
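The use of a single one-dimensional work array with explicit starting addresses can be illustrated with a small sketch. The code below is not the solver's own routine; it is a minimal free-form Fortran example with invented names (WORK, ISTART, NGL) of how a slice of a one-dimensional array is passed to a routine that sees it as a three-dimensional block surrounded by two layers of ghost cells.

    ! A minimal sketch of 1-D storage with starting addresses, as described
    ! above. All names and sizes are illustrative; the actual solver is Fortran 77.
    program block_addressing
       implicit none
       integer, parameter :: ngl  = 2                  ! ghost-cell layers per side
       integer, parameter :: imax = 32, jmax = 32, kmax = 32
       integer, parameter :: itot = imax + 2*ngl
       integer, parameter :: jtot = jmax + 2*ngl
       integer, parameter :: ktot = kmax + 2*ngl
       real    :: work(2*itot*jtot*ktot)               ! storage for two equal blocks
       integer :: istart(2)

       istart(1) = 1                                   ! starting address of block 1
       istart(2) = istart(1) + itot*jtot*ktot          ! starting address of block 2

       call init_block(work(istart(1)), itot, jtot, ktot)
       call init_block(work(istart(2)), itot, jtot, ktot)

    contains

       subroutine init_block(u, it, jt, kt)
          ! The called routine sees its slice of WORK as a 3-D block
          ! that includes the ghost cells.
          integer, intent(in)    :: it, jt, kt
          real,    intent(inout) :: u(it, jt, kt)
          u = 0.0
       end subroutine init_block

    end program block_addressing

In the same spirit, each block and each multigrid level simply gets its own starting address, so the computational routines do not need to know how many blocks a process holds.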

3 Parallelization

3.1 Main principles

The code structure forms an ideal base for the parallelization. All the essential procedures are treated block by block, including the updating of the boundary conditions. The block sizes are chosen so that several multigrid levels can be used in each block. If all the blocks are of an equal size, the work between the processors is balanced. With the current RISC processors and such block sizes, the calculation takes of the order of 10 seconds per iteration cycle. Since the majority of the calculation takes place slabwise, it is impractical to use very small block sizes owing to the useless computation of the ghost cells. Even more important is to obtain a suitable balance between the times spent on the computation and the communication, which with current fast processors requires that the blocks have a sufficient size.

There are some general requirements for the parallelization. Firstly, we wish that the same software can be used in a single-processor mode or with a shared-memory multiprocessor machine like the C94. In practice, it is difficult to always use the specified block sizes. Because of this, the possibility to compute several blocks per processor has to be retained. This property is important especially in small cases, where a good load balance is not so critical and which can be calculated using a few processors. In large cases with complicated geometries, the division into equally sized blocks may also be difficult or impractical. Then small blocks can be situated in the same processor or, if the number of small blocks is small in comparison with the standard equally sized blocks, the idle time of the processors occupied by the small blocks does not decrease the overall efficiency significantly.

3.2 Parallelization using MPI

The parallelization is based on the Message Passing Interface (MPI) Standard [3]. The MPI routines are implemented in such a way that the program also runs in an environment where MPI is not available. The updating of the boundaries between different processes is done using the basic MPI_SSEND and MPI_RECV commands instead of using the array for the boundary data, as in the case of a shared-memory calculation. Also MPI_BCAST and MPI_GATHER are used to give the input parameters to the processes and to gather the convergence histories.

Fig. 3.1. A flow diagram of the parallelized code. Communication between the processors is depicted by horizontal dashed lines.

The computational cycle is described in Fig. 3.1. One processing element (PE 0) is the master and the others are slaves. The parallelization is done so that only the master process reads the input, but all the processes write the output files. The master process reads the input parameters, including the files where the boundary conditions are specified and the grid is defined, and sends the desired input parameters and the appropriate parts of the grid to the slaves. After every iteration cycle the slave processes send the convergence parameters (global residuals etc.) to the master. The convergence parameters are printed on the screen and stored in a convergence-monitoring file. This is accomplished using the MPI_GATHER command. Because the processes are highly independent of each other, the memory requirement per process comes from the size of the block(s) that the process simulates. Since the possibility to calculate a different number of differently sized blocks has been retained, a dynamic memory allocation is performed in each process separately. Inside a process the communication can be done by using the central memory of the machine (default), or the MPI subroutines can be utilized for possible debugging.
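As a rough illustration of the master/slave pattern described above, the sketch below broadcasts one input parameter from the master process and gathers one convergence value per process back to it. It is written in free-form Fortran for readability (the solver itself is Fortran 77), and the variable names are invented; only the MPI calls correspond to those mentioned in the text.

    ! Sketch of the master/slave input broadcast and residual gathering.
    ! Illustrative only; the variable names are not those of the actual code.
    program master_slave_sketch
       implicit none
       include 'mpif.h'
       integer :: ierr, rank, nprocs
       integer :: niter                       ! an input parameter known to the master
       real(8) :: resl                        ! local residual of this process
       real(8), allocatable :: resg(:)        ! residuals gathered on the master

       call mpi_init(ierr)
       call mpi_comm_rank(mpi_comm_world, rank, ierr)
       call mpi_comm_size(mpi_comm_world, nprocs, ierr)

       if (rank == 0) niter = 50              ! the master "reads" the input
       call mpi_bcast(niter, 1, mpi_integer, 0, mpi_comm_world, ierr)

       resl = 1.0d0/dble(rank + 1)            ! stand-in for a computed residual
       allocate(resg(nprocs))
       call mpi_gather(resl, 1, mpi_double_precision, resg, 1, &
                       mpi_double_precision, 0, mpi_comm_world, ierr)

       if (rank == 0) print *, 'iterations:', niter, '  residuals:', resg

       call mpi_finalize(ierr)
    end program master_slave_sketch

The boundary data itself is exchanged with MPI_SSEND and MPI_RECV, as described in the next section.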

3.3 Communication during iteration

At the beginning of the calculation the connective patches are gone through and the order of the communication is decided. The exchanged data is rearranged into vectors that are sent to the other processes, where the data is rearranged back into the desired order.

Fig. 3.2. An example of the parallel communication.

By deciding the order of the communication, the communication is also made parallel. This means that most of the processes can communicate at the same time. As an example, consider four processes that are connected to each other in the way presented in Fig. 3.2. If the order of the connections is from smaller to larger face numbers, where the faces are numbered from the left counter-clockwise, we need four different communication time levels: level 1, processes 0-1 (2 would also like to communicate with 0, and 3 with 2); level 2, 0-2 (1 would like to communicate with 3, and 3 with 2); level 3, 2-3 (0 is done, 1 would like to communicate with 3); and finally level 4, 1-3. In a more complicated case a deadlock situation could also happen, meaning that every process is waiting for some other one. Clearly a better way to do the communication is in two time levels: level 1, 0-1 and 2-3; level 2, 0-2 and 1-3. Then every process is doing work at the same time. When one process is sending the data, another one is receiving, and no buffer is needed. This is achieved by numbering the connections at the beginning of the computation in the master process. During the iteration all the processes communicate from the smallest number to the largest one. The numbering of the connections is done so that every process can have only one connection in one communication time level. Some future development could be made to the numbering, for example by giving high numbers to the processor with the highest work load, so that the others can complete their communication regardless of the situation in the most heavily loaded processor.

It was found that a synchronous send is faster than the standard one. Also a non-blocking receive was tested, but no advantages were found. The best MPI commands were found to be MPI_RECV and MPI_SSEND. For the performance testing, extra communication is subtracted from the calculation; this consists of an interruption subroutine and the collection of the global residuals as well as the global forces during the iteration. These do not have an effect on the calculation or on the final result. Due to the development history of the code, in the test runs with the T3D the communication between the boundaries is performed for each variable to be solved separately, instead of using a single pair of the MPI_SEND and MPI_RECV commands. These commands are performed in each block simultaneously using the standard communication mode of MPI. For the T3D the global residuals are also gathered. Thus the parallelization is not fully comparable with that of the T3E.
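The two-level schedule of the example above (0-1 and 2-3 first, then 0-2 and 1-3) can be sketched as follows. This is an illustrative free-form Fortran fragment for exactly four processes, not the solver's own routine: in each pair the lower-ranked partner sends first with MPI_SSEND and then receives, while the higher-ranked one does the opposite, so every synchronous send has a matching receive and no deadlock can occur.

    ! Sketch of the deadlock-free two-level pairwise exchange of Fig. 3.2.
    ! Run with exactly four processes; names and message sizes are illustrative.
    program pairwise_exchange
       implicit none
       include 'mpif.h'
       integer, parameter :: npts = 1000
       integer :: ierr, rank, nprocs, level, partner
       integer :: stat(mpi_status_size)
       ! sched(level, rank) = communication partner of 'rank' at 'level'
       integer :: sched(2, 0:3) = reshape([1, 2,  0, 3,  3, 0,  2, 1], [2, 4])
       real(8) :: sendbuf(npts), recvbuf(npts)

       call mpi_init(ierr)
       call mpi_comm_rank(mpi_comm_world, rank, ierr)
       call mpi_comm_size(mpi_comm_world, nprocs, ierr)
       if (nprocs /= 4) call mpi_abort(mpi_comm_world, 1, ierr)

       sendbuf = dble(rank)                          ! stand-in for patch data

       do level = 1, 2
          partner = sched(level, rank)
          if (rank < partner) then                   ! lower rank sends first
             call mpi_ssend(sendbuf, npts, mpi_double_precision, partner, &
                            level, mpi_comm_world, ierr)
             call mpi_recv(recvbuf, npts, mpi_double_precision, partner, &
                           level, mpi_comm_world, stat, ierr)
          else                                       ! higher rank receives first
             call mpi_recv(recvbuf, npts, mpi_double_precision, partner, &
                           level, mpi_comm_world, stat, ierr)
             call mpi_ssend(sendbuf, npts, mpi_double_precision, partner, &
                            level, mpi_comm_world, ierr)
          end if
       end do

       call mpi_finalize(ierr)
    end program pairwise_exchange

In the solver itself the schedule is of course built by the master process from the actual patch connectivity, but the pairing of one send with one receive at each level is the essential point.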

4 Automatic Grid Block Splitting

4.1 Main principles

In order to utilize the computing power of a massively parallel system, a program utilizing a simple domain decomposition algorithm has been developed for dividing large grid blocks into smaller ones. In addition to the grid splitting, the program also rewrites the boundary condition file and the computation control file.

A good balance between the processes is desirable. This can be achieved by dividing the space into equally sized sub-domains. However, from the point of view of grid generation, the above requirement increases the amount of manual work. This can be avoided by generating the original grid without considering too much the requirements of the parallel processing. With a separate tool, the grid can then be divided into sub-blocks suitable for parallel computing. The goal is that the user does not need to work at all with the small blocks during the pre- and post-processing stages. In order to simplify the splitting algorithm, it is assumed that the original blocks are directly divided into sub-blocks, i.e. a sub-block can occupy only a single original block, as shown in Fig. 4.1. Because of this, efficiency requires that the desired sub-block size is taken into account in the grid generation. However, since the number of computational blocks per processor is arbitrary, the user can also deviate from the requirement of equal sub-blocks. The grid-splitting software keeps a record of the grid division so that after the simulation the sub-blocks can be merged into the original form. This is done for the grid itself and for the solution files in order to facilitate post-processing.

Fig. 4.1. An example of the division of the original grid blocks into equally sized sub-blocks.

4.2 Grid splitting

The splitting program can be run in two different modes. In the first mode, the user explicitly defines the grid sub-block boundaries. Then the only task of the program is to rewrite the boundary condition file. In the second mode, the program splits the blocks automatically. The only information required from the user in the latter case is the desired block edge dimension. In the automatic mode, the splitting strategy is as follows. The block is always fully cut. If the block edge dimension is smaller than the desired one, no cutting will take place. If the block edge dimension is larger than the desired one, but smaller than twice the desired size, a cutting line at the middle of the block face will be chosen, as shown in Fig. 4.2a. In the future this could be improved by trying to leave a larger block on the possible solid wall side. This will make things more complicated, but improves the behaviour of the turbulence models, which require a wall correction. If the block edge dimension is larger than twice the desired size, but cannot be equally distributed, the smaller block will be cut from the middle of the original block, i.e. the block next to the possible solid wall is always as large as possible. The resulting division is shown in Fig. 4.2b.

Fig. 4.2. An example of the division in two different cases.
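One possible reading of the splitting rules above is sketched below as a free-form Fortran routine that returns the piece sizes along one block edge. The routine name and the exact placement of the remainder piece are assumptions for illustration, not the actual splitter.

    ! Sketch of the edge-splitting rules described above (one possible reading).
    module edge_split
       implicit none
    contains
       subroutine split_edge(n, ndes, pieces, npieces)
          integer, intent(in)  :: n, ndes       ! edge dimension and desired size
          integer, intent(out) :: pieces(:)     ! resulting piece sizes
          integer, intent(out) :: npieces
          integer :: nfull, nrem, mid

          if (n <= ndes) then                   ! small enough: no cut
             npieces   = 1
             pieces(1) = n
          else if (n < 2*ndes) then             ! one cut, in the middle
             npieces   = 2
             pieces(1) = n/2
             pieces(2) = n - n/2
          else                                  ! full pieces, remainder in the middle
             nfull = n/ndes
             nrem  = n - nfull*ndes
             npieces = nfull
             pieces(1:nfull) = ndes
             if (nrem > 0) then
                npieces = nfull + 1
                mid     = nfull/2 + 1
                pieces(mid+1:npieces) = pieces(mid:nfull)
                pieces(mid) = nrem              ! the small piece sits in the middle
             end if
          end if
       end subroutine split_edge
    end module edge_split

Under this reading, an edge of 70 cells with a desired size of 32 gives pieces of 32, 6 and 32, so that the outermost pieces, next to a possible solid wall, stay as large as possible.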

4.3 Redefinition of the boundary conditions

The boundary condition patch splitting is a complex task in comparison with the grid block splitting described above. This is especially true in cases where two connected blocks are not cut at the same location. That is why the algorithm divides the boundary patches using the cutting lines from both blocks. First the limits of the new boundary condition patches are computed. As a separate step, the connectivity information is updated. The program does not assume anything about the grid topology; that is, the connections in C-type or O-type blocks do not need any special treatment.

The most challenging task in the BC patch splitting process is the determination of the additional cutting lines coming from the connected patches. If the original blocks 1 and 2 are divided as shown in Fig. 4.3, the boundary patches in the neighbouring blocks will be cut correspondingly. When these lines are calculated, we must consider which block faces are connected and what the orientation is between the blocks. An additional difficulty comes from the relative position of the blocks. Since there are six faces on both blocks and four possible orientations, we have 6 x 6 x 4 = 144 combinations. The right case can be found by computing a magic number (Eq. 4.1) from the two face numbers, which can have values from one to six, and from the orientation, which can have values from zero to three.

Fig. 4.3. Additional BC patch cuttings caused by the connection between the blocks.
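The magic-number formula itself (Eq. 4.1) has not survived in this transcription, so the fragment below only illustrates, with an assumed encoding, how two face numbers (1 to 6) and an orientation (0 to 3) can be packed into one of 144 unique values.

    ! An assumed encoding of the face pair and orientation into a unique index;
    ! this is not the report's actual formula (4.1), only an illustration.
    integer function magic_number(iface1, iface2, iorient)
       implicit none
       integer, intent(in) :: iface1, iface2   ! face numbers, 1..6
       integer, intent(in) :: iorient          ! relative orientation, 0..3
       magic_number = (iface1 - 1)*24 + (iface2 - 1)*4 + iorient + 1
    end function magic_number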

4.4 Sample case

This example illustrates the division of the boundary patches in a complex connection between the blocks. The two-block grid is purely fictitious and does not represent any reasonable CFD geometry. The original grid and the split grid are shown in Fig. 4.4. Note that, in order to increase the complexity of the splitting task, the larger block is rotated about one coordinate axis so that one of its index directions coincides with a different index direction of the smaller block.

Fig. 4.4. The original two-block grid and the split four-block grid.

Both blocks are divided in one index direction at one pre-determined location. In the larger block this location is the 9th node and in the smaller block the 3rd node. However, due to the rotation of the larger block, these cuttings are not parallel but cross perpendicularly, similar to the case shown in Fig. 4.3. Further, the cutting at the 9th node of the larger block coincides with the 5th node of the smaller block, and the cutting at the 3rd node of the smaller block coincides with the 5th node of the larger block. This is why the connective patch on both blocks has to be divided into four pieces.

5 Test Runs

There are two commonly used test methods in the parallelization. One is to keep the size of the computation in one process constant, so-called scaling. This means that the total problem size rises as the number of processes increases. The other is to keep the total problem size constant and divide it between the processes. Hereafter this is called blocking. The theoretical speed-up is different for the two cases. In the first case, if the total time of the calculation is T and the total communication time is T_c, then the ideal time spent on the computation is

t_{ideal} = T/N    (5.1)

where N is the number of processes. When the communication is taken into account, we get the true computation time

t = (T/N)(1 + c)    (5.2)

and the speed-up

Speed-up = T/t = N/(1 + c)    (5.3)

where c = T_c/T is the total communication time per total calculation time. This is roughly equal to the communication per calculation time in one process. Assuming that all processes have their own bandwidth for the message passing, the amount of communication in scaling is the same for every process, and thus c is constant. This means that the speed-up should be more or less linear in this kind of test.

If the total size of the problem is kept constant (blocking), the factor c is not constant but a function of the number of processes used and the original problem size. The calculation time in one process is roughly

t_{calc} \propto n^3/N    (5.4)

where n is the length of one edge. For the communication time we get

t_{comm} \propto (n/N^{1/3})^2 = n^2/N^{2/3}    (5.5)

and the ratio is

c \propto N^{1/3}/n    (5.6)

Finally we obtain the speed-up for a constant-size problem as

Speed-up = N/(1 + c_0 N^{1/3})    (5.7)

Note that in both cases the commonly used Amdahl's law

Speed-up = 1/((1 - P) + P/N)    (5.8)

is not valid.
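The behaviour of the two estimates, together with Amdahl's law, can be tabulated with a few lines of Fortran. The sketch below uses the forms written out above; the communication-to-calculation ratio and the parallel fraction are illustrative values, not measured ones.

    ! Sketch evaluating the speed-up estimates of Eqs. (5.3), (5.7) and (5.8)
    ! for illustrative parameter values.
    program speedup_models
       implicit none
       integer :: n
       real(8) :: c0, p, s_scaled, s_fixed, s_amdahl

       c0 = 0.02d0          ! assumed communication/calculation ratio, one process
       p  = 0.98d0          ! assumed parallel fraction for Amdahl's law

       print '(a)', '   N    scaling   blocking     Amdahl'
       do n = 1, 64
          s_scaled = dble(n)/(1.0d0 + c0)                         ! Eq. (5.3)
          s_fixed  = dble(n)/(1.0d0 + c0*dble(n)**(1.0d0/3.0d0))  ! Eq. (5.7)
          s_amdahl = 1.0d0/((1.0d0 - p) + p/dble(n))              ! Eq. (5.8)
          if (n == 1 .or. mod(n, 16) == 0) then
             print '(i4,3f11.2)', n, s_scaled, s_fixed, s_amdahl
          end if
       end do
    end program speedup_models

With these illustrative values the scaled speed-up stays essentially linear, while the fixed-size speed-up flattens as the blocks shrink, which is the qualitative behaviour discussed below.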

Fig. 5.1. Delta wing. The symmetry plane of the grid is shown only partly.

The program was run on the Cray T3E and T3D machines and also on a cluster of SGI Indigo (MIPS R4400SC) workstations. In the latter case the communication between the workstations was made through a standard, low-speed Ethernet, and the calculations with the workstation cluster were performed while there was other traffic in the Ethernet. More information on the T3D and workstation cluster tests can be found in Rautaheimo et al. [10]. With the T3E, the test runs were made with a 3-dimensional torus-type topology. Every block is connected to six other blocks. This means that all processors have an equal amount of work and perfect load balancing is achieved. In a real case perfect load balancing cannot be achieved, because the boundary conditions are different: for example, the calculation of a wall boundary condition takes more time than that of a symmetry boundary condition. The test runs with the T3D and the cluster of workstations were performed with a delta wing; this case has also been calculated in [7].

5.1 Scaling

The grids were generated so that all the blocks had a size of 32 x 32 x 32 grid points. The computational domain was split into 1-64 blocks and each block was calculated in a different process. Thus the coarsest grid has 32,768 and the densest grid 2,097,152 grid points. A coarse level of the delta wing grid is shown in Fig. 5.1, and a calculated pressure distribution on the wing surface in Fig. 5.2.

Fig. 5.2. Pressure distribution on the delta wing.

Table 5.1. Performance of the parallelization in scaling.

NPS    N. of cells    T3E    T3D    SGI

The computation time was measured over 50 iteration cycles. The time spent on the initialization and termination of the run was not included. The efficiency is obtained directly from the absolute time spent on the calculation. The results are presented in Table 5.1. It can be seen that scaling is achieved in these test runs with the T3E up to 64 processes. With the T3D and the SGI cluster the global residuals have been collected during the iteration, and consequently the parallelization is not as good as in the case of the T3E. The speed-up obtained on the different platforms can be seen in Fig. 5.3. One must keep in mind that the results with the T3E and T3D are not directly comparable because of changes in the code, and also because the test case is better balanced for the T3E. The speed-up is linear for the T3E, as it should be according to Eq. (5.3).

Fig. 5.3. Speed-up of the parallelization in scaling.

Fig. 5.4. Time spent in communication in each process in scaling.

For every block face there are data points that must be exchanged during the iteration steps. Every point contains nine flow variables: the density, the three momentum components, the total energy, the turbulent viscosity, the turbulent viscosity coefficient, the pressure and the pressure difference. Every block has six faces, and from these figures the total amount of data to be sent to the other processes and received from the others can be computed. The time spent in the communication can be seen in Fig. 5.4; the largest communication time is obtained with the largest number of processes. From the measured communication times and the amount of transferred data, the bandwidth per processor can be estimated. Compared with what the authors have achieved with a very simple MPI communication, this is a relatively good performance for a complex communication.
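The amount of data exchanged per block can be estimated along the lines of the paragraph above. The sketch below assumes a 32 x 32 x 32 block, two ghost-cell layers per face, nine variables per point and 8-byte reals; the exact figures of the original runs have not survived in this transcription, so the result is only indicative.

    ! Rough estimate of the data exchanged per block, using assumed values
    ! (32^3 block, 2 ghost-cell layers, 9 variables per point, 8-byte reals).
    program message_volume
       implicit none
       integer, parameter :: nedge = 32        ! points per block edge (assumed)
       integer, parameter :: ngl   = 2         ! ghost-cell layers per face
       integer, parameter :: nvar  = 9         ! variables exchanged per point
       integer, parameter :: nface = 6         ! faces per block
       integer :: npoints
       real(8) :: mbytes

       npoints = nface*nedge*nedge*ngl         ! points updated on all faces
       mbytes  = dble(npoints*nvar)*8.0d0/1.0d6
       print '(a,i8,a,f8.3,a)', 'points per block:', npoints, &
             '   data per block:', mbytes, ' Mbytes'
    end program message_volume

Multiplying this per-block figure by the number of blocks gives the total traffic per iteration cycle.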

5.2 Blocking

Another way of testing the parallelization is to divide a big problem into smaller pieces. With the T3E, the grid size used in this test was limited by the processor memory size. The block sizes for the different cases can be seen in Table 5.2.

Table 5.2. Performance of the parallelization in blocking.

NPS    N. of cells/block    T3E

Fig. 5.5. Speed-up of the parallelization in blocking.

As can be seen from Table 5.2, the scaling is not as good in this case. This follows directly from Eq. (5.7): the smaller the block is, the larger is the ratio between the communication and the calculation per processor. With a very small block size, the fact that unnecessary ghost cells are calculated in each block also decreases the efficiency. In addition, the faces of a block are not of an equal size, and the communication times between different processors are not in balance. The speed-up is closer to Eq. (5.7) than to Amdahl's law, Eq. (5.8), as can be seen in Fig. 5.5, where one curve is the estimate of Eq. (5.7) and the other is Amdahl's law.

Since the boundary conditions are treated explicitly, splitting the computational domain into smaller parts will also reduce the performance of the implicit stage. This was tested with the delta wing by dividing the original grid into smaller and smaller pieces.

The iteration histories of the norm of a momentum residual, as calculated with different block sizes, can be seen in Fig. 5.6. It can be seen that the convergence is not much affected by the block size. However, it should be noted that in this case the smallest block size is still relatively large.

Fig. 5.6. Norm of the momentum residual.

5.3 Single-processor performance

Some effort has been put into getting the program to run efficiently in the single-process mode on the T3E. A summary of the tested compiler directives can be seen in Table 5.3. The compiler directives -Ounroll2 -Oscalar3 were found to be the best choice, although no big difference was found between the directives in Table 5.3.

Table 5.3. Effect of the compiler directives on the performance (combinations of -Osplit2, -O2, -O3, -Oaggress, -Ounroll2 and -Oscalar3).

By using the T3E memory streams the code runs clearly faster; the level (1, 2 or 3) of the streams did not seem to have any effect. Also some recoding was done. The recoding was done in a conservative manner, so that no vector loop was removed. Eventually the program also runs a bit faster on the C94.

In addition, one part of the program was inlined using SGI's compiler, because the compiler of the T3E does not include inlining. Because of these optimization actions the time of an iteration cycle decreased and the speed in Mflops increased correspondingly. The single-processor performance on different platforms can be seen in Table 5.4. From the single-processor performance of the T3E, the overall performance for the biggest scaling case can be estimated. Note also that the performance of the C94 will be better if a larger grid is used.

Table 5.4. Single-processor performance on different platforms.

Platform                                                             Speed (Mflops)
C94 (240 MHz vector processor)
R4000 (200 MHz IP22 with 1 Mbyte secondary cache)
R8000 (75 MHz IP21 with 4 Mbyte secondary cache)
R10000 (195 MHz IP28 with 1 Mbyte secondary cache)
Digital AlphaServer (440 MHz EV56 with 4 Mbyte secondary cache)
T3D (150 MHz Digital Alpha processor with 8 kbyte secondary cache)
T3E (300 MHz EV5 with 96 kbyte secondary cache), unmodified code

6 Conclusions

The parallelization of a multi-block flow solver has been presented. The parallelization takes place over the blocks, and the communication between the blocks utilizes the MPI Standard. The computer code is portable between different types of machines.

The first test runs were made on a cluster of SGI workstations. The performance curve obtained is almost linear up to 16 processes, in spite of the fact that the workstations were connected by a standard, low-speed Ethernet. With the Cray T3D machine, test runs have been made with up to 64 processors. The performance of the code is satisfactory, but not excellent: with a constant block size, the efficiency with 64 processors is about 91%. However, the obtained speed indicates that even this machine can be utilized effectively with a large number of processors at this efficiency. With the Cray T3E machine, the test runs indicate excellent parallelization. With scaling the parallelization is almost linear, and with blocking the efficiency is still reasonable with the largest number of processors used. The increased performance is due to the optimized communication, and also because the calculation with the T3E was done with a torus-type 3-dimensional grid that has a good load balance. Hence the T3D and T3E results are not directly comparable. It is also shown that Amdahl's law gives too poor performance estimates for both test cases.

If the size of the case is kept constant and the parallelization is performed by dividing the grid into smaller and smaller blocks, the efficiency of the code decreases as the number of processors is increased. This is caused by the larger ratio between the communication and the calculation as the blocks are getting smaller, and also by the extra time spent on the calculation of the ghost cells. Because of this property, and because of the requirement of the multigrid, the block size should not be made too small. This limits the effective use of the parallelization on the T3E to cases where the number of grid points is of the order of half a million or more. In practice this is not a limitation if one has access to a vector computer with a sufficient memory size. The vector computer will be faster than the T3E in medium-sized jobs. Although the benefits of a massively parallel computation can also be achieved with smaller cases, in CFD that is a waste of resources: smaller cases can be calculated with sufficiently small computational times on vector machines or even on workstations.

In this work, the computation and also the communication are made parallel. In the test cases the work balance is very good and also the performance is good. With a reasonable block size the ratio between the communication and the calculation is small, and thus no further development of the communication is needed at this point. With real applications, the problem is how to divide the work between the processors so that a good work load balance is obtained.

Acknowledgments

We would like to thank Sami Saarinen for his advice. Access to the parallel computer systems, the T3E at the Center for Scientific Computing (CSC) in Finland and the T3D of Cray Research Inc. in Eagan, is also gratefully acknowledged.

Bibliography

[1] Bärwolff, G., Ketelsen, K., and Thiele, F., Parallelization of a Finite-Volume Navier-Stokes Solver on a T3D Massively Parallel System, in Sixth International Symposium on Computational Fluid Dynamics, (Nevada).

[2] Sawley, M. and Tegner, J., A Comparison of Parallel Programming Models for Multiblock Flow Computations, Journal of Computational Physics, Vol. 122, 1995.

[3] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Computer Science Dept., University of Tennessee, Knoxville, TN.

[4] Siikonen, T., Hoffren, J., and Laine, S., A Multigrid Factorization Scheme for the Thin-Layer Navier-Stokes Equations, in Proceedings of the 17th ICAS Congress, (Stockholm), Sept. 1990. ICAS Paper.

[5] Roe, P., Approximate Riemann Solvers, Parameter Vectors, and Difference Schemes, Journal of Computational Physics, Vol. 43, 1981, pp. 357-372.

[6] Van Leer, B., Flux-Vector Splitting for the Euler Equations, in Proceedings of the 8th International Conference on Numerical Methods in Fluid Dynamics, (Aachen), 1982 (also Lecture Notes in Physics, Vol. 170, 1982).

[7] Siikonen, T., Kaurinkoski, P., and Laine, S., Transonic Flow over a Delta Wing Using a Turbulence Model, in Proceedings of the 19th ICAS Congress, (Anaheim), Sept. 1994. ICAS Paper.

[8] Siikonen, T. and Pan, H., Application of Roe's Method for the Simulation of Viscous Flow in Turbomachinery, in Proceedings of the First European Computational Fluid Dynamics Conference (Hirsch, C. et al., eds.), (Brussels, Belgium), Elsevier Science Publishers B.V., Sept. 1992.

[9] Rautaheimo, P. and Siikonen, T., Implementation of the Reynolds-Stress Turbulence Model, in Proceedings of the ECCOMAS Congress, (Paris), Sept. 1996.

[10] Rautaheimo, P., Salminen, E., and Siikonen, T., Parallelization of a Multi-Block Navier-Stokes Solver, in Proceedings of the ECCOMAS Congress, (Paris), Sept. 1996.


More information

Optimizing Bio-Inspired Flow Channel Design on Bipolar Plates of PEM Fuel Cells

Optimizing Bio-Inspired Flow Channel Design on Bipolar Plates of PEM Fuel Cells Excerpt from the Proceedings of the COMSOL Conference 2010 Boston Optimizing Bio-Inspired Flow Channel Design on Bipolar Plates of PEM Fuel Cells James A. Peitzmeier *1, Steven Kapturowski 2 and Xia Wang

More information

Calculate a solution using the pressure-based coupled solver.

Calculate a solution using the pressure-based coupled solver. Tutorial 19. Modeling Cavitation Introduction This tutorial examines the pressure-driven cavitating flow of water through a sharpedged orifice. This is a typical configuration in fuel injectors, and brings

More information

Domain Decomposition: Computational Fluid Dynamics

Domain Decomposition: Computational Fluid Dynamics Domain Decomposition: Computational Fluid Dynamics July 11, 2016 1 Introduction and Aims This exercise takes an example from one of the most common applications of HPC resources: Fluid Dynamics. We will

More information

µ = Pa s m 3 The Reynolds number based on hydraulic diameter, D h = 2W h/(w + h) = 3.2 mm for the main inlet duct is = 359

µ = Pa s m 3 The Reynolds number based on hydraulic diameter, D h = 2W h/(w + h) = 3.2 mm for the main inlet duct is = 359 Laminar Mixer Tutorial for STAR-CCM+ ME 448/548 March 30, 2014 Gerald Recktenwald gerry@pdx.edu 1 Overview Imagine that you are part of a team developing a medical diagnostic device. The device has a millimeter

More information

Transactions on Information and Communications Technologies vol 3, 1993 WIT Press, ISSN

Transactions on Information and Communications Technologies vol 3, 1993 WIT Press,   ISSN The implementation of a general purpose FORTRAN harness for an arbitrary network of transputers for computational fluid dynamics J. Mushtaq, A.J. Davies D.J. Morgan ABSTRACT Many Computational Fluid Dynamics

More information

Multigrid Algorithms for Three-Dimensional RANS Calculations - The SUmb Solver

Multigrid Algorithms for Three-Dimensional RANS Calculations - The SUmb Solver Multigrid Algorithms for Three-Dimensional RANS Calculations - The SUmb Solver Juan J. Alonso Department of Aeronautics & Astronautics Stanford University CME342 Lecture 14 May 26, 2014 Outline Non-linear

More information

SHOCK WAVES IN A CHANNEL WITH A CENTRAL BODY

SHOCK WAVES IN A CHANNEL WITH A CENTRAL BODY SHOCK WAVES IN A CHANNEL WITH A CENTRAL BODY A. N. Ryabinin Department of Hydroaeromechanics, Faculty of Mathematics and Mechanics, Saint-Petersburg State University, St. Petersburg, Russia E-Mail: a.ryabinin@spbu.ru

More information

McNair Scholars Research Journal

McNair Scholars Research Journal McNair Scholars Research Journal Volume 2 Article 1 2015 Benchmarking of Computational Models against Experimental Data for Velocity Profile Effects on CFD Analysis of Adiabatic Film-Cooling Effectiveness

More information

Network Bandwidth & Minimum Efficient Problem Size

Network Bandwidth & Minimum Efficient Problem Size Network Bandwidth & Minimum Efficient Problem Size Paul R. Woodward Laboratory for Computational Science & Engineering (LCSE), University of Minnesota April 21, 2004 Build 3 virtual computers with Intel

More information

Investigation of cross flow over a circular cylinder at low Re using the Immersed Boundary Method (IBM)

Investigation of cross flow over a circular cylinder at low Re using the Immersed Boundary Method (IBM) Computational Methods and Experimental Measurements XVII 235 Investigation of cross flow over a circular cylinder at low Re using the Immersed Boundary Method (IBM) K. Rehman Department of Mechanical Engineering,

More information

Non-Newtonian Transitional Flow in an Eccentric Annulus

Non-Newtonian Transitional Flow in an Eccentric Annulus Tutorial 8. Non-Newtonian Transitional Flow in an Eccentric Annulus Introduction The purpose of this tutorial is to illustrate the setup and solution of a 3D, turbulent flow of a non-newtonian fluid. Turbulent

More information

Domain Decomposition: Computational Fluid Dynamics

Domain Decomposition: Computational Fluid Dynamics Domain Decomposition: Computational Fluid Dynamics May 24, 2015 1 Introduction and Aims This exercise takes an example from one of the most common applications of HPC resources: Fluid Dynamics. We will

More information

The Spalart Allmaras turbulence model

The Spalart Allmaras turbulence model The Spalart Allmaras turbulence model The main equation The Spallart Allmaras turbulence model is a one equation model designed especially for aerospace applications; it solves a modelled transport equation

More information

IMAGE ANALYSIS DEDICATED TO POLYMER INJECTION MOLDING

IMAGE ANALYSIS DEDICATED TO POLYMER INJECTION MOLDING Image Anal Stereol 2001;20:143-148 Original Research Paper IMAGE ANALYSIS DEDICATED TO POLYMER INJECTION MOLDING DAVID GARCIA 1, GUY COURBEBAISSE 2 AND MICHEL JOURLIN 3 1 European Polymer Institute (PEP),

More information

Domain Decomposition: Computational Fluid Dynamics

Domain Decomposition: Computational Fluid Dynamics Domain Decomposition: Computational Fluid Dynamics December 0, 0 Introduction and Aims This exercise takes an example from one of the most common applications of HPC resources: Fluid Dynamics. We will

More information

Available online at ScienceDirect. Procedia Engineering 99 (2015 )

Available online at   ScienceDirect. Procedia Engineering 99 (2015 ) Available online at www.sciencedirect.com ScienceDirect Procedia Engineering 99 (2015 ) 575 580 APISAT2014, 2014 Asia-Pacific International Symposium on Aerospace Technology, APISAT2014 A 3D Anisotropic

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Team 194: Aerodynamic Study of Airflow around an Airfoil in the EGI Cloud

Team 194: Aerodynamic Study of Airflow around an Airfoil in the EGI Cloud Team 194: Aerodynamic Study of Airflow around an Airfoil in the EGI Cloud CFD Support s OpenFOAM and UberCloud Containers enable efficient, effective, and easy access and use of MEET THE TEAM End-User/CFD

More information

CFD Modeling of a Radiator Axial Fan for Air Flow Distribution

CFD Modeling of a Radiator Axial Fan for Air Flow Distribution CFD Modeling of a Radiator Axial Fan for Air Flow Distribution S. Jain, and Y. Deshpande Abstract The fluid mechanics principle is used extensively in designing axial flow fans and their associated equipment.

More information

CFD modelling of thickened tailings Final project report

CFD modelling of thickened tailings Final project report 26.11.2018 RESEM Remote sensing supporting surveillance and operation of mines CFD modelling of thickened tailings Final project report Lic.Sc.(Tech.) Reeta Tolonen and Docent Esa Muurinen University of

More information

Express Introductory Training in ANSYS Fluent Workshop 04 Fluid Flow Around the NACA0012 Airfoil

Express Introductory Training in ANSYS Fluent Workshop 04 Fluid Flow Around the NACA0012 Airfoil Express Introductory Training in ANSYS Fluent Workshop 04 Fluid Flow Around the NACA0012 Airfoil Dimitrios Sofialidis Technical Manager, SimTec Ltd. Mechanical Engineer, PhD PRACE Autumn School 2013 -

More information

A NURBS-BASED APPROACH FOR SHAPE AND TOPOLOGY OPTIMIZATION OF FLOW DOMAINS

A NURBS-BASED APPROACH FOR SHAPE AND TOPOLOGY OPTIMIZATION OF FLOW DOMAINS 6th European Conference on Computational Mechanics (ECCM 6) 7th European Conference on Computational Fluid Dynamics (ECFD 7) 11 15 June 2018, Glasgow, UK A NURBS-BASED APPROACH FOR SHAPE AND TOPOLOGY OPTIMIZATION

More information

An Object-Oriented Serial and Parallel DSMC Simulation Package

An Object-Oriented Serial and Parallel DSMC Simulation Package An Object-Oriented Serial and Parallel DSMC Simulation Package Hongli Liu and Chunpei Cai Department of Mechanical and Aerospace Engineering, New Mexico State University, Las Cruces, New Mexico, 88, USA

More information

Parallel Mesh Multiplication for Code_Saturne

Parallel Mesh Multiplication for Code_Saturne Parallel Mesh Multiplication for Code_Saturne Pavla Kabelikova, Ales Ronovsky, Vit Vondrak a Dept. of Applied Mathematics, VSB-Technical University of Ostrava, Tr. 17. listopadu 15, 708 00 Ostrava, Czech

More information

PARALLEL SIMULATION OF A FLUID FLOW BY MEANS OF THE SPH METHOD: OPENMP VS. MPI COMPARISON. Pawe l Wróblewski, Krzysztof Boryczko

PARALLEL SIMULATION OF A FLUID FLOW BY MEANS OF THE SPH METHOD: OPENMP VS. MPI COMPARISON. Pawe l Wróblewski, Krzysztof Boryczko Computing and Informatics, Vol. 28, 2009, 139 150 PARALLEL SIMULATION OF A FLUID FLOW BY MEANS OF THE SPH METHOD: OPENMP VS. MPI COMPARISON Pawe l Wróblewski, Krzysztof Boryczko Department of Computer

More information

Free Convection Cookbook for StarCCM+

Free Convection Cookbook for StarCCM+ ME 448/548 February 28, 2012 Free Convection Cookbook for StarCCM+ Gerald Recktenwald gerry@me.pdx.edu 1 Overview Figure 1 depicts a two-dimensional fluid domain bounded by a cylinder of diameter D. Inside

More information

ELSA Performance Analysis

ELSA Performance Analysis ELSA Performance Analysis Xavier Saez and José María Cela Barcelona Supercomputing Center Technical Report TR/CASE-08-1 2008 1 ELSA Performance Analysis Xavier Saez 1 and José María Cela 2 1 Computer Application

More information

Numerical studies for Flow Around a Sphere regarding different flow regimes caused by various Reynolds numbers

Numerical studies for Flow Around a Sphere regarding different flow regimes caused by various Reynolds numbers Numerical studies for Flow Around a Sphere regarding different flow regimes caused by various Reynolds numbers R. Jendrny, H. Damanik, O. Mierka, S. Turek Institute of Applied Mathematics (LS III), TU

More information

On the high order FV schemes for compressible flows

On the high order FV schemes for compressible flows Applied and Computational Mechanics 1 (2007) 453-460 On the high order FV schemes for compressible flows J. Fürst a, a Faculty of Mechanical Engineering, CTU in Prague, Karlovo nám. 13, 121 35 Praha, Czech

More information

LATTICE-BOLTZMANN METHOD FOR THE SIMULATION OF LAMINAR MIXERS

LATTICE-BOLTZMANN METHOD FOR THE SIMULATION OF LAMINAR MIXERS 14 th European Conference on Mixing Warszawa, 10-13 September 2012 LATTICE-BOLTZMANN METHOD FOR THE SIMULATION OF LAMINAR MIXERS Felix Muggli a, Laurent Chatagny a, Jonas Lätt b a Sulzer Markets & Technology

More information

Keywords: CFD, aerofoil, URANS modeling, flapping, reciprocating movement

Keywords: CFD, aerofoil, URANS modeling, flapping, reciprocating movement L.I. Garipova *, A.N. Kusyumov *, G. Barakos ** * Kazan National Research Technical University n.a. A.N.Tupolev, ** School of Engineering - The University of Liverpool Keywords: CFD, aerofoil, URANS modeling,

More information

Workbench Tutorial Flow Over an Airfoil, Page 1 ANSYS Workbench Tutorial Flow Over an Airfoil

Workbench Tutorial Flow Over an Airfoil, Page 1 ANSYS Workbench Tutorial Flow Over an Airfoil Workbench Tutorial Flow Over an Airfoil, Page 1 ANSYS Workbench Tutorial Flow Over an Airfoil Authors: Scott Richards, Keith Martin, and John M. Cimbala, Penn State University Latest revision: 17 January

More information