Transactions on Information and Communications Technologies vol 3, 1993, WIT Press

The implementation of a general purpose FORTRAN harness for an arbitrary network of transputers for computational fluid dynamics

J. Mushtaq, A.J. Davies and D.J. Morgan

ABSTRACT

Many Computational Fluid Dynamics (CFD) problems of interest require far greater computational power than is available on any sequential machine. In CFD problems, where a large number of similar operations are performed, a parallel machine can be utilised to exploit the inherent parallelism of the algorithm. Distributed memory machines, although requiring extra programming effort, can provide truly scalable performance at a fraction of the cost of current vector supercomputers. At the University of Hertfordshire, part of the current work programme involves the parallel implementation of a sequential 2-D Navier-Stokes multiblock aerofoil code. In order to utilise an arbitrary network of transputers, it is necessary to have software which can effect the communication of data between processors and also schedule this data for processing. This paper is concerned with the development and implementation of a general purpose FORTRAN harness for a distributed memory machine with an arbitrary number of processors and an arbitrary hardware configuration. The harness is therefore not confined to the CFD work, but applies to any problem where a large number of similar operations is performed. The harness, which comprises many concurrently executing processes, is replicated to all the transputers in the network. The data is sorted into order and distributed to the network so that, as nearly as possible, each transputer is responsible for performing the same amount of work. This ensures that the distribution of computational load is even, thereby preventing the transputer with the most work from holding up the others.

To illustrate the design and performance of the harness, a simple five-point solution to the potential problem is considered in this paper.

INTRODUCTION

The harness, written in Parallel 3L Fortran [1], is ultimately intended for porting the multiblock, two-dimensional, Navier-Stokes aerofoil code onto a network of transputers. For the purposes of debugging and verifying the harness independently of the aerofoil code, a simple model which simulates the data processed by the aerofoil code was constructed. Laplace's potential equation, which simply uses the mean value of the four adjacent cells to update a cell, is used for the 'worker' process. This paper describes the design and performance of a multiblock Laplace solver on a network of T800 transputers.

It is essentially the multiblock algorithm that makes the problem suitable for parallelisation. The discretised domain of computation is divided into subdomains, called blocks, thus creating internal boundaries between blocks. Before a block can be updated, a transfer of data, called halo data, across internal boundaries is necessary. This in turn necessitates the communication of halo data between transputers. The purpose of the harness is to schedule blocks for processing and to effect the communication of halo data.

The harness, which comprises many concurrently executing processes, is replicated to all the transputers in the network. The blocks are sorted into order on the basis of their size and then distributed to the network so that, as nearly as possible, each transputer is responsible for updating the same number of cells. This ensures that the distribution of the computational load is even, thereby preventing the transputer with the most work from holding up all the others.
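The distribution strategy just described, sort the blocks by size and hand each to the least-loaded transputer, is easy to state in code. The following is a minimal sketch in modern Fortran; the block sizes, the counts and all names are illustrative, not taken from the harness itself.

```fortran
! Sketch of the block distribution strategy: blocks are sorted by size
! and each is assigned to the least-loaded transputer (a standard
! largest-processing-time heuristic). Illustrative only.
program distribute_blocks
  implicit none
  integer, parameter :: nblocks = 12, nprocs = 3
  integer :: cells(nblocks)     ! cells per block (assumed values)
  integer :: owner(nblocks)     ! transputer assigned to each block
  integer :: load(nprocs)       ! running cell count per transputer
  integer :: order(nblocks)
  integer :: i, j, k, p

  cells = (/100, 80, 100, 60, 100, 100, 80, 60, 100, 100, 80, 60/)

  ! Sort block indices by size, largest first (insertion sort).
  do i = 1, nblocks
     order(i) = i
  end do
  do i = 2, nblocks
     k = order(i)
     j = i - 1
     do while (j >= 1)
        if (cells(order(j)) >= cells(k)) exit
        order(j+1) = order(j)
        j = j - 1
     end do
     order(j+1) = k
  end do

  ! Assign each block to the currently least-loaded transputer.
  load = 0
  do i = 1, nblocks
     p = minloc(load, dim=1)
     owner(order(i)) = p
     load(p) = load(p) + cells(order(i))
  end do

  do p = 1, nprocs
     print '(a,i2,a,i6,a)', 'transputer ', p, ' updates ', load(p), ' cells'
  end do
end program distribute_blocks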

TRANSPUTERS AND MIMD ARCHITECTURES

Transputers belong to a class of parallel machines known as Multiple Instruction Multiple Data (MIMD). Each transputer is directly connected to its own local memory, holding both data and source code. Thus different transputers in the network may hold different data and perform quite different operations on that data. In CFD problems, however, the transputer network is usually programmed as Single Instruction Multiple Data (SIMD), where the same governing equations are solved over each block. Moreover, in the case of the harness, where the same code is replicated to all processors in the network, the transputer is coded as Single Program Multiple Data (SPMD).

The T800 transputer is a high performance chip with four communication links, a 32-bit Central Processing Unit (CPU) and a 64-bit Floating Point Unit (FPU). The most important feature of the T800 is that the CPU, FPU and each communication link can operate in parallel within each transputer. This allows the message-passing overhead to be hidden to a large extent, since communication can be overlapped with local computation. All transputers in the network execute locally resident source code and process locally resident data. During communication between processors, data is explicitly sent from one processor to another via the transputers' serial links. If the processors are not directly connected then intermediate processors are necessary to store and forward the message through the network towards its destination transputer.

PARALLELISM VIA GEOMETRIC DECOMPOSITION

Geometric or data parallelism is the most natural sub-division of the workload for calculations over a region of space. The data is divided into sub-domains, called blocks. The shape of the blocks has an important effect on the communication-to-computation ratio. For example, a rectangular block with p.q points has 2(p+q) boundary points, while a square block with the same p.q points has 4√(p.q) boundary points. Now 2(p+q) ≥ 4√(p.q), so square blocks are favoured, since there is less halo data to communicate.
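The inequality is an instance of the arithmetic-geometric mean (AM-GM) inequality; a one-line check (not in the original, but standard):

```latex
% A p-by-q block has 2(p+q) boundary points; a square block with the
% same p.q points has 4*sqrt(pq). By AM-GM,
\[
\frac{p+q}{2} \;\ge\; \sqrt{pq}
\quad\Longrightarrow\quad
2(p+q) \;\ge\; 4\sqrt{pq},
\]
% with equality only when p = q, so square blocks minimise halo data.
```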

BLOCK HALO DATA

Once a multiblock domain has been established, calculations on each block can begin in parallel provided the block's boundary conditions are known. A boundary may be either a physical boundary or an internal boundary arising from the domain decomposition. Physical boundaries are handled by the source code. An internal boundary requires boundary data from its neighbour, which may reside on a different processor. This data is provided by allocating a buffer on the boundary of each block which stores a copy of the corresponding overlap, or halo, data.

[Figure 1: buffer cells for halo data; a ring of buffer cells for boundary conditions surrounds the internal cells of the block]

Once halo data has been received for all sides of the block, an update of the block, using the sequential algorithm, can commence. On completion of the algorithm, the current block's boundary data is sent to the neighbouring blocks. The current block then waits for its halo data to be refreshed so that the next update can commence. The communication of the halo data and the scheduling of the blocks for processing are effected by the harness (a sketch of one such update sweep is given at the end of this section, after the list of harness processes).

STRUCTURE OF THE HARNESS

The harness [6] comprises many concurrent processes, namely:

1. the control process,
2. the worker process,
3. the first-in first-out buffers (FIFOs),
4. the transporter processes,
5. the dynamic memory allocator.
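As promised above, a minimal sketch of one block update in modern Fortran: the block carries a one-cell halo, and each interior cell is replaced by the mean of its four neighbours. The array names and the Jacobi-style temporary are assumptions of this sketch, not the harness's actual worker code.

```fortran
! One update sweep of a block stored with a one-cell halo: interior
! cells are updated from the mean of their four neighbours once the
! halo has been refreshed.
subroutine update_block(phi, p, q)
  implicit none
  integer, intent(in) :: p, q
  real, intent(inout) :: phi(0:p+1, 0:q+1)   ! rows 0 and p+1, columns 0 and q+1 are halo
  real :: new(p, q)
  integer :: i, j

  ! Jacobi sweep: every interior cell becomes the mean of its neighbours.
  do j = 1, q
     do i = 1, p
        new(i, j) = 0.25 * (phi(i-1, j) + phi(i+1, j) &
                          + phi(i, j-1) + phi(i, j+1))
     end do
  end do
  phi(1:p, 1:q) = new

  ! After this sweep the edge rows and columns phi(1,1:q), phi(p,1:q),
  ! phi(1:p,1) and phi(1:p,q) would be sent to the neighbouring blocks
  ! as their halo data.
end subroutine update_block
```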

[Figure 2: FIFO and Control interconnection on a transputer; legend: hard link channel, memory mapped channel]

[Figure 3: detail structure diagram for one FIFO, link guardian, worker, memory allocator and control, showing memory mapped channels and the direction of data flow; allocator traffic carries free/alloc requests with a pointer and a slice length n]

Figure 2 illustrates the process interconnection within the harness package, showing how the transporter processes input and output halo data, via the FIFOs, to and from the control process. Figure 3 shows a structure graph, which illustrates the actual access connections between tasks, together with the data flow between them. As well as giving some information about the sequencing of interactions, the structure graph represents a static picture of the structure of the system, including both control and data interactions.

The structure graph shows how the transporter, worker and control processes may communicate with the memory allocator to request or to free heap space. The channels connecting the worker and control tasks allow the worker to request from control a block number to process, and control to acknowledge with the number of a block that has had all the necessary halo data communicated to it. Before a block can be updated, halo data from physically adjacent blocks, at an appropriate time level, is required. This necessitates the communication of halo data between transputers, which is performed by the control process and the transporter tasks, via the FIFOs.

The harness is implemented for an arbitrary network of transputers. This is achieved by allowing the transputers to investigate their interconnection: for the efficient communication of halo data, each transputer needs to know the shortest path to all other transputers.
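The paper does not give the routing algorithm itself, so purely as an illustration, here is one standard way a table of shortest hop counts could be computed from the link topology: a Floyd-Warshall pass over an adjacency matrix. The 4-transputer chain is an assumed example, not the network used in the paper.

```fortran
! Derive shortest paths through an arbitrary link topology by relaxing
! routes through every intermediate transputer (Floyd-Warshall).
program shortest_paths
  implicit none
  integer, parameter :: n = 4, inf = 999999
  integer :: dist(n, n)
  integer :: i, j, k

  ! Hop counts: 0 on the diagonal, 1 for a direct link, "infinite" otherwise.
  dist = inf
  do i = 1, n
     dist(i, i) = 0
  end do
  do i = 1, n - 1            ! assumed chain topology: 1-2-3-4
     dist(i, i+1) = 1
     dist(i+1, i) = 1
  end do

  ! Relax paths through every intermediate transputer k.
  do k = 1, n
     do i = 1, n
        do j = 1, n
           if (dist(i, k) + dist(k, j) < dist(i, j)) &
              dist(i, j) = dist(i, k) + dist(k, j)
        end do
     end do
  end do

  print '(4i4)', ((dist(i, j), j = 1, n), i = 1, n)
end program shortest_paths
```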

DISTRIBUTION OF INITIAL DATA

The initial data distributed to the network of transputers at start-up consists mainly of two types:

1. the data required by the worker process to perform an iteration of Laplace's potential equation (block data). This consists of grid data, flow data and control data, which are separated into 'packets' representing the data for each block of the multiblock computational domain;

2. the local and global data required by the harness (in the form of lists). These lists are essentially look-up tables which describe the distribution strategy and the topology of the network.

MEMORY ALLOCATOR

The memory allocator allows concurrent tasks to use the same array, the heap, in a checked manner. 3L FORTRAN has nothing to compare directly with the pointer types of PASCAL or ADA. Therefore, in order to make use of dynamic structures in FORTRAN, the best one can do is to simulate dynamic storage by assigning large arrays for the purpose and to use INTEGER values for links [5]. Each element of the array 'points' to the next: recorded with each element is an 'arrow' in the LINK array giving where the next element can be found. The heap is stored in a COMMON block, so that all concurrent processes within the harness package have access to this memory block. This requires that control over this shared memory be handed over very carefully from task to task. Utilising the memory allocator as a concurrent task, as opposed to a package, enforces mutual exclusion by providing coordinated sharing.

The allocator, whose structure graph is shown below, performs services in response to calls from a number of user tasks. The allocator never calls, nor has control over, other tasks and always accepts calls immediately, subject only to the usual constraint of accepting one caller at a time.

[Figure 4: structure graph illustrating the memory allocator (entries guarded by spacefull = FALSE)]

Control, the worker and the transporters are each provided with a channel link for communication with the memory allocator. Any task requiring a slice of the heap in order to store data may obtain one by sending a request command and the slice length required; the memory allocator then acknowledges with a pointer to the start of the slice. Similarly, a process may free a slice of heap by sending a free command and the pointer to the beginning of the slice.
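A minimal sketch of the LINK-array idea in modern Fortran, assuming a free list threaded through the array; exhaustion checks, the acknowledgement protocol and the COMMON-block packaging of the real harness are omitted, and all names are illustrative.

```fortran
! Dynamic storage simulated with a large array and INTEGER links:
! HEAP holds the data, LINK(i) points to the next element of a slice,
! and FREEHEAD threads the free list through the array.
module heap_sim
  implicit none
  integer, parameter :: heapsize = 1000
  real    :: heap(heapsize)     ! the shared data store
  integer :: link(heapsize)
  integer :: freehead
contains
  subroutine heap_init()
    integer :: i
    do i = 1, heapsize - 1
       link(i) = i + 1          ! each free element points to the next
    end do
    link(heapsize) = 0
    freehead = 1
  end subroutine heap_init

  ! Allocate a slice of n elements; returns the index of the first one.
  function heap_alloc(n) result(ptr)
    integer, intent(in) :: n
    integer :: ptr, cur, i
    ptr = freehead
    cur = ptr
    do i = 1, n - 1             ! walk n elements off the free list
       cur = link(cur)
    end do
    freehead = link(cur)
    link(cur) = 0               ! terminate the slice
  end function heap_alloc

  ! Free a slice by splicing it back onto the front of the free list.
  subroutine heap_free(ptr)
    integer, intent(in) :: ptr
    integer :: cur
    cur = ptr
    do while (link(cur) /= 0)
       cur = link(cur)
    end do
    link(cur) = freehead
    freehead = ptr
  end subroutine heap_free
end module heap_sim
```

Note that after repeated allocation and freeing, a slice need not occupy consecutive array positions; this is exactly the interleaved-list overhead discussed in the conclusions.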

TRANSPORTERS

The transporters, which run at high priority, input or output data via the transputer serial links. Since input and output can be performed in parallel with computation, having four separate input and four separate output transporter tasks helps to streamline communication. All communication to and from the transporter tasks takes the form of an INTEGER value specifying the number of words in the slice that follows, and then the slice itself. Input transporters receive data from other transputers, which is copied to the heap via communication with the memory allocator. Output transporters send data to neighbouring transputers and then free the heap of this slice of data, also via communication with the memory allocator. The transporter tasks and the memory allocator highlight the efficiency of using dynamic memory allocation, since only the pointer and the slice length need to be communicated between tasks within the harness, as opposed to the entire slice.

FIRST-IN FIRST-OUT BUFFERS (FIFOs)

For systems with high levels of interaction activity, a buffer task is provided between the sender and the target which can absorb any temporary excess of items produced by the sender. This prevents temporary communication deadlocks and the consequent poor data throughput. The FIFOs, located between control and each transporter, decouple the control and transporter tasks by accepting communication even if the target task is not ready to receive, thus preventing congestion and latency delays. The FIFOs are clearly vital to an efficient communication system. The structure graph is shown in figure 5, and the bookkeeping is sketched below.
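The FIFO tasks need only queue (pointer, slice length) pairs, as noted above. A minimal circular-buffer sketch in modern Fortran follows; the depth and the names are illustrative, and in the harness each FIFO would be a separate task communicating over channels rather than a module.

```fortran
! Circular buffer of (pointer, slice-length) pairs: the sender's item
! is accepted even when the target task is busy, up to a fixed depth.
module fifo_buf
  implicit none
  integer, parameter :: depth = 32
  integer :: ptrs(depth), lens(depth)
  integer :: head = 1, tail = 1, count = 0
contains
  logical function fifo_put(ptr, slicelength)
    integer, intent(in) :: ptr, slicelength
    fifo_put = count < depth        ! refuse only when completely full
    if (fifo_put) then
       ptrs(tail) = ptr
       lens(tail) = slicelength
       tail = mod(tail, depth) + 1
       count = count + 1
    end if
  end function fifo_put

  logical function fifo_get(ptr, slicelength)
    integer, intent(out) :: ptr, slicelength
    ptr = 0
    slicelength = 0
    fifo_get = count > 0
    if (fifo_get) then
       ptr = ptrs(head)
       slicelength = lens(head)
       head = mod(head, depth) + 1
       count = count - 1
    end if
  end function fifo_get
end module fifo_buf
```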

[Figure 5: structure graph illustrating the FIFO buffer (guard spaceFull = FALSE; items are ptr, slicelength pairs)]

CONTROL

The purpose of the control process is to direct the communication of halo data to its destination transputers and to schedule blocks for processing by the worker process. The control process acts as an acceptor rather than a caller in its interactions with other tasks, because a call to another task risks a congestion delay. Control has five guarded inputs: four from the input transporters, and a request channel from the worker process.

Control holds two local lists, an active list and an inactive list. The active list holds the numbers of blocks that have had all the necessary halo data communicated to them and which therefore await processing by the worker process. The distribution strategy may be such that there are many blocks per transputer, in which case the required halo data may reside on the same transputer. For physically adjacent blocks on different transputers, it is possible for one block to be one iteration ahead of the other. The inactive list therefore has halo data entries for two time levels, t and t+1, which allows control to hold halo data pointers received for a block which is already active.
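A sketch of the activation bookkeeping implied by this description: a block becomes active once the expected number of halo packets has arrived, and worker requests pop block numbers off the active list. This simplification ignores the two-time-level inactive list, and all names are illustrative.

```fortran
! Control-process bookkeeping: count halo arrivals per block and
! activate a block when all its halo data is present.
module control_lists
  implicit none
  integer, parameter :: maxblocks = 48
  integer :: needed(maxblocks)       ! halo packets required per block,
                                     ! filled at start-up from the topology lists
  integer :: arrived(maxblocks) = 0  ! halo packets received so far
  integer :: active(maxblocks)       ! simple stack of runnable blocks
  integer :: nactive = 0
contains
  subroutine halo_arrived(blockno)
    integer, intent(in) :: blockno
    arrived(blockno) = arrived(blockno) + 1
    if (arrived(blockno) == needed(blockno)) then
       nactive = nactive + 1         ! all halo data present: activate
       active(nactive) = blockno
       arrived(blockno) = 0          ! reset for the next time level
    end if
  end subroutine halo_arrived

  ! Serve a worker request; returns 0 if no block is currently runnable.
  function next_block() result(blockno)
    integer :: blockno
    if (nactive > 0) then
       blockno = active(nactive)
       nactive = nactive - 1
    else
       blockno = 0
    end if
  end function next_block
end module control_lists
```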

[Figure 6: structure graph illustrating the control process (guard activelistempty = FALSE; inputs of ptr, slicelength pairs via the FIFOs; block numbers exchanged with the worker; outputs to the allocator and via the FIFOs)]

WORKER

The worker process performs an iteration of Laplace's potential equation on a block, viz.

u(i,j) = [ u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) ] / 4

The worker initiates a rendezvous with the control process by requesting an active block number to process. When acknowledged with a block, the worker extracts the main block data from the heap. The main block data contains pointers to where the halo data is located in the heap, and this too is extracted. Since the heap is a large one-dimensional array, the extracted data must be converted to multi-dimensional arrays, containing both block data and halo data, in the form required by the Laplace solver; this unpacking step is sketched at the end of this section. On completion of an iteration, the worker process sets up the halo data for communication to the destination transputers. The iteration count is incremented by one and the next active block is requested, at which point the halo data is communicated to the destination transputers by control, on de-activation.

CONFIDENCE TESTING

A simple test case, the multiblock grid shown in figure 7, was constructed which, although geometrically uncomplicated, contains most of the essential features of the general case with regard to halo data communication. The data corresponding to figure 7 is distributed to the network, where each processor holds a copy of the harness, and the functionality of the harness is then examined.
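Returning to the worker's unpacking step: a minimal sketch of copying a heap slice into the two-dimensional array form used by the solver. It assumes the slice is contiguous and packed column by column; as the conclusions note, a slice in the real heap may be interleaved and must then be followed through the LINK array instead.

```fortran
! Copy a slice of the one-dimensional heap into the two-dimensional
! block array (including its halo frame) required by the Laplace solver.
subroutine unpack_block(heap, ptr, p, q, phi)
  implicit none
  real,    intent(in)  :: heap(*)
  integer, intent(in)  :: ptr, p, q       ! start of slice, block extents
  real,    intent(out) :: phi(0:p+1, 0:q+1)
  integer :: i, j, k

  k = ptr
  do j = 0, q + 1
     do i = 0, p + 1
        phi(i, j) = heap(k)               ! consecutive heap elements
        k = k + 1                         ! fill the block column by column
     end do
  end do
end subroutine unpack_block
```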

The results from the test case on a single transputer verified the functionality of the communication and control processes. The same test case then required verification on a network of transputers, as well as confirmation of performance capabilities, by allowing parameters of interest, such as the number of blocks per processor, the block shape and the block size, to be varied. It was found that as the worker process becomes computationally intensive, the communication of the halo data becomes less critical.

PERFORMANCE OF THE COMPLETE PROGRAM

The computation time was determined for the multiblock grid shown in figure 7 (i.e. 12 blocks) for one, three, six and twelve transputers, whilst varying the computation per block. The worker process was made computationally more intensive by repeating the calculation on each block before the halo data is exchanged. Execution times for 25 iterations are tabulated below.

[Table 1: Harness execution times for figure 7(a). Columns: loop, no. of procs., CPU time (seconds), speed-up, efficiency. The numeric entries have not survived transcription.]

By increasing the 'loop' variable in table 1, the worker process was made computationally more intensive, i.e. the ratio of computation to communication was increased. The results indicate that for a computationally intensive algorithm, because the harness exchanges halo data less frequently relative to the computation, the network runs more efficiently. For the case where loop = 10000, the harness is 95.3% efficient on twelve processors.
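Since the numeric entries of table 1 are lost, it is worth recording the (presumed standard) definitions behind the speed-up and efficiency columns:

```latex
% With T_1 the time on one transputer and T_n the time on n transputers:
\[
S_n = \frac{T_1}{T_n}, \qquad E_n = \frac{S_n}{n}.
\]
% The quoted 95.3% efficiency on twelve transputers thus corresponds to
% a speed-up of S_{12} = 0.953 \times 12 \approx 11.4.
```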

The effect of altering the shape of the blocks, and of sub-dividing the blocks further whilst maintaining the same number of cells per block, was also investigated. Each block of figure 7(a) was modified to the shapes shown in figures 7(b) and 7(c). Table 2 gives the CPU times (and efficiencies) obtained for the different block shapes of figures 7(a), 7(b) and 7(c).

[Table 2: CPU times in seconds, with efficiency in parentheses, for block shape and number of blocks (loop = 250, 25 iterations). Only one row survives transcription: figure 7(a) (square): (0.889); figure 7(b) (rectangular): 32.2 (0.875); figure 7(c) (sub-divided): 31.0 (0.952). The processor counts and the remaining CPU times are lost.]

Square blocks have less halo data to communicate than rectangular blocks, so square blocks were expected to produce faster CPU times. However, the results in table 2 show that there is no significant advantage in having square blocks. Sub-dividing each block into four blocks, as shown in figure 7(c), did however give a notable increase in efficiency, to 95.2%. By allowing many blocks per processor it is possible to reduce the transputer idle time, i.e. to update one block while another block's halo data has not yet arrived. This observation is also noted in table 1.

MULTIBLOCK NAVIER-STOKES AEROFOIL CODE

The current work programme of the research is to implement a parallel version of the British Aerospace aerofoil code MB2DV14 [7], programming the algorithm on a network of transputers using the communications harness. The code, initially developed for a Cray Y-MP, is used to simulate the viscous flow around a two-dimensional aerofoil, based on the Jameson algorithm [8], and consists of approximately [number lost in transcription] lines of FORTRAN.

Essentially, the Laplace potential problem in the worker process must be replaced by the multiblock Navier-Stokes code. For the harness to run efficiently, one requirement is that the algorithm be computationally intensive, which is satisfied by the multiblock Navier-Stokes code. Furthermore, the test cases of figure 7 are such that each processor is perfectly load balanced, i.e. each processor is responsible for updating the same number of cells. Typically, a multiblock aerofoil mesh used to solve a viscous flow will not have blocks with equal numbers of cells, and the blocks may also be of irregular shape. Table 2 shows that irregular block shapes do not significantly affect the CPU time. The network will not be perfectly load balanced, however, and all processors will be held up by the one with the most work to perform. The data will therefore be distributed such that all processors have a similar work load.

CONCLUSIONS

The multiblock Laplace solver is a highly parallel application and is amenable to an efficient implementation on a network of transputers. Maximum efficiency is achieved by ensuring perfect load balancing, i.e. distributing the blocks to the network of transputers such that each transputer is responsible for updating the same number of cells.

An overhead may be expected from use of the heap, which is a simulation of dynamic memory storage. This is because the lists within the heap are essentially interleaved lists, which means that once a list is deleted, the 'simply linked' structure is destroyed. The next logical item in a list may therefore not reside in the next position of the array, so the elements of a list from the heap must be scanned using the LINK array before the list can be used.

Many small blocks increase the congestion delay due to entry queuing. Fewer, larger blocks tend to increase the structural delays resulting from either a fixed order of acceptance or conditional acceptance. It follows that there is an optimum block size.

Performance evaluations were carried out for 12 and 48 block topologies, on one, three, six and twelve transputers. A speed-up of up to 11.4 (corresponding to the quoted efficiency of 95.3% on twelve transputers) was attained with increasing computation.

Block shape was shown not to affect the CPU time significantly, whereas increasing the number of blocks per processor reduced the processor idle time and therefore increased efficiency. Since the Navier-Stokes algorithm is computationally intensive, it is expected that the results from the Laplace solver will carry over to the parallel Navier-Stokes solver, provided the load balancing requirement is satisfied.

ACKNOWLEDGEMENT

This research was supported by the Science and Engineering Research Council (SERC) and by British Aerospace.

REFERENCES

[1] 3L Ltd, Parallel FORTRAN Compiler, Reference Manual, V2.1.3.
[2] Notes for the Short Course, Parallel Processing Using 3L Parallel FORTRAN, National Transputer Support Centre, Sheffield City Polytechnic.
[3] D. Pountain and D. May, A Tutorial Introduction to OCCAM Programming, Inmos.
[4] R.J.A. Buhr, System Design with Ada, Prentice-Hall, 1984.
[5] P.D. Terry, FORTRAN from PASCAL, Addison-Wesley.
[6] Multiblock Euler Solver on an Array of Transputers, British Aerospace.
[7] J. Benton, Two Dimensional, Multiblock Navier-Stokes Aerofoil Code, MB2DV14, British Aerospace.
[8] A. Jameson, W. Schmidt and E. Turkel, Numerical Solution of the Euler Equations by Finite Volume Methods Using Runge-Kutta Time Stepping Schemes, AIAA Paper 81-1259, 1981.

[Figure 7: the 12-block test topology; (a) blocks of 10x10 cells (square), (b) blocks of 5x20 cells (rectangular), (c) each block sub-divided into four 5x5 blocks]
