PARTI Procedures for Realistic Loops

J. Saltz, R. Das, R. Ponnusamy, D. Mavriplis, H. Berryman and J. Wu
ICASE, NASA Langley Research Center, Hampton, VA

Abstract

This paper describes a set of primitives (PARTI) developed to efficiently execute unstructured problems on distributed memory machines. These primitives have been incorporated into actual programs and kernels by hand. The primitives were developed so that the sequential code and the parallel code on each node would look identical. We have introduced a mechanism by which duplicate fetches of off-processor data between different loops are eliminated. This elimination of duplicate fetches appears to make a considerable impact on communications overheads.

1 Introduction

In many algorithms, data produced or input during a program's initialization plays a large role in determining the nature of the subsequent computation. When the data structures that define a computation have been initialized, a preprocessing phase follows. Vital elements of the strategy used by the rest of the algorithm are determined by this preprocessing phase. To effectively exploit many multiprocessor architectures, we may also have to carry out run time preprocessing. This preprocessing will be referred to as runtime compilation [22]. The purpose of runtime compilation is not to determine which computations are to be performed, but instead to determine how a multiprocessor machine will schedule the algorithm's work, how to map the data structures, and how data movement within the multiprocessor is to be scheduled.

In distributed memory MIMD architectures, there is typically a non-trivial communications latency or startup cost. For efficiency reasons, information to be transmitted should be collected into relatively large messages. The cost of fetching array elements can be reduced by precomputing what data each processor needs to send and to receive. In irregular problems, such as solving PDEs on unstructured meshes and sparse matrix algorithms, the communications pattern depends on the input data. This typically arises due to some level of indirection in the code. In this case, it is not possible to predict at compile time what data must be prefetched.

Only recently have methods been developed to integrate the kinds of runtime optimizations mentioned above into compilers and programming environments [22]. The lack of compile-time information is dealt with by transforming the original parallel loop into two constructs called an inspector and an executor [19]. During program execution, the inspector examines the data references made by a processor and calculates what off-processor data needs to be fetched and where that data will be stored once it is received. The executor loop then uses the information from the inspector to implement the actual computation.

We have developed a suite of primitives that can be used directly by programmers to generate inspector/executor pairs. The primitives incorporate a number of new insights we have had about sparse and unstructured computations, and they differ in a number of ways from those described earlier [22]. Our new primitives carry out preprocessing that makes it straightforward to produce parallelized loops that are virtually identical in form to the original sequential loops. The importance of this is that it will be possible to generate the same quality object code on the nodes of the distributed memory machine as could be produced by the sequential program running on a single node.
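The following is a minimal single-node sketch of this inspector/executor split (our own illustration, not PARTI code; the names edge, loc and nonproc are hypothetical). The inspector assigns each indirect reference a local location, reserving buffer slots past the owned range for off-processor data; the executor then runs the original loop body through that localized index list:

c     A single-node mock of the inspector/executor split; an
c     illustration only, not PARTI code.  "Off-processor" here simply
c     means a global index beyond the owned range 1..nonproc.
      program insexe
      integer nedge, nonproc
      parameter (nedge = 4, nonproc = 5)
      integer edge(nedge), loc(nedge)
      real*8 x(nonproc + nedge), ysum
      integer i, nfetch
      data edge /2, 7, 9, 4/
      data x /1.d0, 2.d0, 3.d0, 4.d0, 5.d0, 4*0.d0/
c     inspector: decide where each reference will live locally and
c     reserve buffer slots, past x(nonproc), for off-processor data
      nfetch = 0
      do i = 1, nedge
         if (edge(i) .le. nonproc) then
            loc(i) = edge(i)
         else
            nfetch = nfetch + 1
            loc(i) = nonproc + nfetch
         endif
      end do
c     communication would go here: gather the nfetch off-processor
c     values into x(nonproc+1 .. nonproc+nfetch); we fake it
      do i = 1, nfetch
         x(nonproc + i) = 99.d0
      end do
c     executor: the original loop body, indexed through loc()
      ysum = 0.d0
      do i = 1, nedge
         ysum = ysum + x(loc(i))
      end do
      write (*,*) 'fetched', nfetch, ' values, sum =', ysum
      end

This naive inspector reserves one buffer slot per off-processor reference; the hash tables discussed next avoid fetching the same element twice.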
Our primitives make use of hash tables [14] to allow us to recognize and exploit a number of situations in which a single off-processor distributed array reference is used several times. In such situations, the primitives only fetch a single copy of each unique off-processor distributed array reference.
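A small open-addressing table is enough to picture this duplicate-elimination step (a simplified sketch of the idea under our own assumptions about table size and probing, not the PARTI data structure):

c     Duplicate-reference filter via an open-addressing hash table;
c     a simplified sketch of the idea, not the PARTI implementation.
      program dedup
      integer tsize, nref
      parameter (tsize = 31, nref = 8)
      integer table(tsize), ref(nref), uniq(nref)
      integer i, h, nuniq
      data ref /12, 40, 12, 7, 40, 40, 93, 7/
      do i = 1, tsize
         table(i) = 0
      end do
      nuniq = 0
      do i = 1, nref
c        linear-probe until we hit this key or an empty slot (0)
         h = mod(ref(i), tsize) + 1
 10      if (table(h) .ne. 0 .and. table(h) .ne. ref(i)) then
            h = mod(h, tsize) + 1
            goto 10
         endif
         if (table(h) .eq. 0) then
c           first occurrence: record it; only this copy is fetched
            table(h) = ref(i)
            nuniq = nuniq + 1
            uniq(nuniq) = ref(i)
         endif
      end do
      write (*,*) 'distinct off-processor references:', nuniq
      end

Each reference is probed once; only first occurrences would be added to the communication schedule.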

1.1 Distributed Data Access using PARTI

These primitives are named PARTI (Parallel Automated Runtime Toolkit at ICASE) [7], [5]; they carry out the distribution and retrieval of globally indexed but irregularly distributed data-sets over the numerous local processor memories. Each inspector produces a schedule, which is essentially a pattern of communication for gathering or scattering data. In order to avoid duplicate data accesses, a list of off-processor data references is stored locally (for each processor) in a hash table [14], [28]. For each new off-processor data reference required, a quick search through the hash table is performed to determine whether this reference has already been accessed. If the reference has not previously been accessed, it is stored in the hash table; otherwise it is discarded. The primitives thus only fetch a single copy of each unique off-processor distributed data-set reference. This idea has also been extended to allow us to produce incremental schedules. For example, if two loops require different but overlapping data references, then by simply preserving the hash table formed during the generation of the schedule for the first loop, we may generate a schedule for the second loop that obtains only those off-processor elements which have not been previously encountered.

1.2 Distributed Translation Tables

In distributed memory machines, large data arrays need to be partitioned between the local memories of processors. These partitioned data arrays are called distributed arrays. Long term storage of distributed array data is assigned to specific memory locations in the distributed machine. Each element in a distributed array is assigned to a particular processor, and in order for another processor to be able to access a given element of the array, we must know the processor on which it resides and its local address in that processor's memory. We therefore build a translation table which, for each array element, lists the host processor address. For a one-dimensional array of N elements, the translation table also contains N elements, and must therefore itself be distributed over the local memories of the processors. This is accomplished by putting the first N/NP elements on the first processor, the second N/NP elements on the second processor, and so on, where NP is the number of processors. Thus, if we are required to access the mth element of the array, we look up its address in the distributed translation table; that table entry is held by processor (m-1)/(N/NP) + 1 (using integer division). One of the PARTI primitives handles initialization of distributed translation tables, and other primitives are used to access them. In Section 2.2, we will give examples of PARTI procedures that initialize and access distributed translation tables.
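A tiny sketch of the lookup arithmetic just described, assuming as above that NP evenly divides N (the names are ours):

c     Which processor holds translation-table entry m?  A sketch of
c     the block layout described above; assumes NP divides N evenly.
      program ttloc
      integer n, np, blk, m, owner, off
      parameter (n = 1000, np = 8)
      blk = n / np
      m = 637
c     1-based owner of entry m, and its offset there (integer division)
      owner = (m - 1) / blk + 1
      off = m - (owner - 1) * blk
      write (*,*) 'entry', m, ': processor', owner, ', offset', off
      end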
The PARTI primitives have been used to solve a variety of realistic applications, such as a 3-D unstructured mesh multigrid Euler solver [18]. PARTI has also been distributed to a variety of universities and laboratories.

2 The PARTI Primitives

In this section we present a running example in order to illustrate the way in which PARTI procedure calls are used and to describe the optimizations carried out by these procedures. In Figure 1, we depict a set of loops which roughly mimics loops frequently encountered in unstructured mesh fluids codes. Loop L1 sweeps over the edges of a mesh; the mesh edges may define a three dimensional object such as an aircraft. The reference pattern is determined by integer array edge_list. Note that indirection appears in S1 and S2 on both the left and the right sides of the expressions. Loop L2 sweeps over a set of faces; these faces may define the surface of a three dimensional object. The reference pattern is determined by integer array face_list. Indirection again appears on both the left and the right sides of expressions S3 and S4. Note also that both loop L1 and loop L2 read from array x and that neither loop writes to x.

2.1 PARTI Executor

Figure 2 depicts the executor code with embedded Fortran-callable PARTI procedures dfmgather, dfscatter_add and dfscatter_addnc. Before this code is run, we have to carry out a preprocessing phase, to be described in Section 2.2. The arrays x and y are partitioned between processors; each processor is responsible for the long term storage of specified elements of each of these arrays. The way in which x and y are to be partitioned between processors is determined by the inspector. In this example, elements of x and y are partitioned between processors in exactly the same way; each processor is responsible for n_on_proc elements of x and y. It should be noted that, except for the procedure calls, the structure of the loops in Figure 2 is identical to that of the loops in Figure 1.

L1    do i = 1, n_edge
         n1 = edge_list(i)
         n2 = edge_list(n_edge + i)
S1       y(n1) = y(n1) + ... x(n1) ... x(n2)
S2       y(n2) = y(n2) + ... x(n1) ... x(n2)
      end do

L2    do i = 1, n_face
         m1 = face_list(i)
         m2 = face_list(n_face + i)
         m3 = face_list(2*n_face + i)
         m4 = face_list(3*n_face + i)
S3       y(m1) = y(m1) + ... x(m1) ... x(m2) ... x(m3) ... x(m4)
S4       y(m2) = y(m2) + ... x(m1) ... x(m2) ... x(m3) ... x(m4)
      end do

Figure 1: Example Code

In Figure 2 we again use arrays named x and y, but x and y now represent arrays defined on a single processor of a distributed memory multiprocessor. On each processor P, arrays x and y are declared to be larger than would be needed to store the number of array elements for which P is responsible. We store copies of off-processor array elements beginning with local array elements x(n_on_proc+1) and y(n_on_proc+1).

The PARTI subroutine calls depicted in Figure 2 move data between processors using a precomputed communication pattern. The communication pattern is specified by either a single schedule or by an array of schedules. dfmgather uses communication schedules to fetch off-processor data that will be needed either by loop L1 or by loop L2. The schedules specify the locations in distributed memory from which data is to be obtained. In Figure 2, off-processor data is obtained from the array x defined on each processor. Copies of the off-processor data are placed in a buffer area beginning with x(n_on_proc+1). The PARTI procedures dfscatter_add and dfscatter_addnc, in statements S2 and S3 of Figure 2, accumulate data to off-processor memory locations. Both dfscatter_add and dfscatter_addnc obtain the data to be accumulated to off-processor locations from a buffer area that begins with y(n_on_proc+1). Off-processor data is accumulated to locations of y between indices 1 and n_on_proc. The distinction between dfscatter_add and dfscatter_addnc will be described in Section 2.3. In Figure 2, several data may be accumulated to a given off-processor location in loop L1 or in loop L2.

2.2 PARTI Inspector

In this section, we will outline how we carry out the preprocessing needed to generate the arguments needed by the code in Figure 2. This preprocessing is depicted in Figure 3. The way in which the nodes of an irregular mesh are numbered frequently does not have a useful correspondence to the connectivity pattern of the mesh. When we partition such a mesh in a way that minimizes interprocessor communication, we may need to be able to assign arbitrary mesh points to each processor. The PARTI procedure ifbuild_translation_table (S1 in Figure 3) allows us to map a globally indexed distributed array onto processors in an arbitrary fashion. Each processor passes ifbuild_translation_table a list of the array elements for which it will be responsible (myvals in S1, Figure 3). If a given processor needs to obtain a datum that corresponds to a particular global index i for a specific distributed array, the processor can consult the distributed translation table (Section 1.2) to find the location of that datum in distributed memory.
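To picture what such a table records, the following single-node mock-up builds the owner and local-address maps from each processor's ownership list. It is our illustration only; the actual PARTI table is itself block-distributed across processors as described in Section 1.2, and the names here are hypothetical:

c     Single-node mock of building a translation table from each
c     processor's list of owned global indices; the real PARTI table
c     is itself distributed, as described in Section 1.2.
      program ttbld
      integer n, np, nown
      parameter (n = 8, np = 2, nown = 4)
      integer myvals(nown, np), owner(n), local(n)
      integer p, j, g
c     column p lists the nown global indices owned by processor p
      data myvals /1, 4, 6, 8, 2, 3, 5, 7/
      do p = 1, np
         do j = 1, nown
            g = myvals(j, p)
c           record the owning processor and local address of g
            owner(g) = p
            local(g) = j
         end do
      end do
      write (*,*) 'global index, owner, local address:'
      do g = 1, n
         write (*,*) g, owner(g), local(g)
      end do
      end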

      real*8 x(n_on_proc + n_off_proc)
      real*8 y(n_on_proc + n_off_proc)
S1    dfmgather(sched_array, 2, x(n_on_proc+1), x)
L1    do i = 1, n_edge
         n1 = local_edge_list(i)
         n2 = local_edge_list(n_edge + i)
         y(n1) = y(n1) + ... x(n1) ... x(n2)
         y(n2) = y(n2) + ... x(n1) ... x(n2)
      end do
S2    dfscatter_add(edge_sched, y(n_on_proc+1), y)
L2    do i = 1, n_face
         m1 = local_face_list(i)
         m2 = local_face_list(n_face + i)
         m3 = local_face_list(2*n_face + i)
         m4 = local_face_list(3*n_face + i)
         y(m1) = y(m1) + ... x(m1) ... x(m2) ... x(m3) ... x(m4)
         y(m2) = y(m2) + ... x(m1) ... x(m2) ... x(m3) ... x(m4)
      end do
S3    dfscatter_addnc(face_sched, y(n_on_proc+1), buffer_mapping, y)

Figure 2: Parallelized Code for Each Processor

S1    translation_table = ifbuild_translation_table(1, myvals, n_on_proc)
S2    call flocalize(translation_table, edge_sched, edge_list,
         local_edge_list, 2*n_edge, n_off_proc)
S3    sched_array(1) = edge_sched
S4    call fmlocalize(translation_table, face_sched,
         incremental_face_sched, face_list, local_face_list,
         4*n_face, n_off_proc_face, n_new_off_proc_face,
         buffer_mapping, 1, sched_array)
S5    sched_array(2) = incremental_face_sched

Figure 3: Inspector Code for Each Processor

The PARTI procedures flocalize and fmlocalize carry out the bulk of the preprocessing needed to produce the executor code depicted in Figure 2. We will first describe flocalize (S2 in Figure 3). On each processor, flocalize is passed:

1. a pointer to a distributed translation table (translation_table in S2),
2. a list of globally indexed distributed array references (edge_list in S2), and
3. the number of globally indexed distributed array references (2*n_edge in S2).

Flocalize returns:

1. a schedule that can be used in PARTI gather and scatter procedures (edge_sched in S2),
2. a list of integers that can be used to specify the pattern of indirection in the executor code (local_edge_list in S2), and
3. the number of distinct off-processor references found in edge_list (n_off_proc in S2).

There are a variety of situations in which the same data need to be accessed by multiple loops (Section 1.1). In Figure 1, no assignments to x are carried out. At the beginning of Figure 2, each processor can gather a single copy of every distinct off-processor value of x referenced by loops L1 or L2.
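The effect of this localization step can be imitated on a single node as follows (a hypothetical simplification, not the PARTI implementation: the real flocalize also consults the distributed translation table and builds the communication schedule, and it searches with a hash table where this sketch scans linearly):

c     Single-node mock of a flocalize-style localization step: it
c     returns the unique off-processor indices to gather and a
c     localized index list the executor loop can use unchanged.
      program floc
      integer nonproc, nref
      parameter (nonproc = 5, nref = 6)
      integer glob(nref), local(nref), fetch(nref)
      integer i, j, nfetch
      data glob /3, 9, 1, 9, 12, 3/
      nfetch = 0
      do i = 1, nref
         if (glob(i) .le. nonproc) then
c           locally owned: the local index is the global index here
            local(i) = glob(i)
         else
c           off-processor: reuse an existing buffer slot if this
c           index was seen before (PARTI does this search with a
c           hash table; a linear scan keeps the sketch short)
            local(i) = 0
            do j = 1, nfetch
               if (fetch(j) .eq. glob(i)) local(i) = nonproc + j
            end do
            if (local(i) .eq. 0) then
               nfetch = nfetch + 1
               fetch(nfetch) = glob(i)
               local(i) = nonproc + nfetch
            endif
         endif
      end do
      write (*,*) 'gather', nfetch, ' distinct values'
      write (*,*) 'localized list:', (local(i), i = 1, nref)
      end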

The PARTI procedure fmlocalize (S4 in Figure 3) makes it simple to remove such duplicate references: it makes it possible to obtain only those off-processor data not requested by a given set of pre-existing schedules. The procedure dfmgather in the executor in Figure 2 obtains off-processor data using two schedules: edge_sched, produced by flocalize (S2, Figure 3), and incremental_face_sched, produced by fmlocalize (S4, Figure 3).

To review the work carried out by fmlocalize, we will summarize the significance of all but one of the arguments of this PARTI procedure. On each processor, fmlocalize is passed:

1. a pointer to a distributed translation table (translation_table in S4),
2. a list of globally indexed distributed array references (face_list in S4),
3. the number of globally indexed distributed array references (4*n_face in S4),
4. the number of pre-existing schedules that need to be taken into account when removing duplicates (1 in S4), and
5. an array of pointers to pre-existing schedules (sched_array in S4).

Fmlocalize returns:

1. a schedule that can be used in PARTI gather and scatter procedures; this schedule does not take any pre-existing schedules into account (face_sched in S4),
2. an incremental schedule that includes only off-processor data accesses not included in the pre-existing schedules (incremental_face_sched in S4),
3. a list of integers that can be used to specify the pattern of indirection in the executor code (local_face_list in S4),
4. the number of distinct off-processor references in face_list (n_off_proc_face in S4),
5. the number of distinct off-processor references not encountered in any other schedule (n_new_off_proc_face in S4), and
6. buffer_mapping, to be discussed in Section 2.3.

2.3 A Return to the Executor

We have already discussed dfmgather in Section 2.1, but we have not yet said anything about the distinction between dfscatter_add and dfscatter_addnc. When we make use of incremental schedules, we assign a single buffer location to each off-processor distributed array element. In our example, we carry out separate off-processor accumulations after loops L1 and L2. As we describe below, in this situation our off-processor accumulation procedures may no longer reference consecutive elements of a buffer.

In S2, Figure 2, we can assign copies of distinct off-processor elements of y to consecutive buffer locations. We can then use a schedule (edge_sched) to specify where in distributed memory each consecutive value in the buffer is to be accumulated. The PARTI procedure dfscatter_add can be employed here; this procedure uses schedule edge_sched to accumulate to off-processor locations from consecutive buffer locations beginning with y(n_on_proc + 1). By the time we get to L2, however, some of the off-processor copies may already be associated with buffer locations. Consequently, in S3, Figure 2, our schedule (face_sched) must access buffer locations in an irregular manner. The pattern of buffer locations accessed is specified by the integer array buffer_mapping passed to dfscatter_addnc in S3, Figure 2 (dfscatter_addnc stands for dfscatter_add, non-contiguous).
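The buffer indexing behind dfscatter_addnc can be sketched as follows; this is our illustration of the access pattern only, with the schedule and the interprocessor transport omitted and all names hypothetical:

c     Sketch of the non-contiguous buffer access behind
c     dfscatter_addnc: a mapping array names the buffer slot that
c     feeds each accumulation, since the slots assigned during an
c     earlier loop need not be consecutive for this one.
      program scatnc
      integer nonproc, nbuf, nacc
      parameter (nonproc = 4, nbuf = 5, nacc = 3)
      real*8 y(nonproc + nbuf)
      integer bufmap(nacc), dest(nacc)
      integer i
c     buffer slot (past y(nonproc)) feeding each accumulation
      data bufmap /2, 5, 3/
c     owned location of y receiving each accumulated value
      data dest /1, 4, 2/
      do i = 1, nonproc + nbuf
         y(i) = 1.d0
      end do
c     accumulate from non-consecutive buffer slots into owned y
      do i = 1, nacc
         y(dest(i)) = y(dest(i)) + y(nonproc + bufmap(i))
      end do
      write (*,*) (y(i), i = 1, nonproc)
      end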
3 Status of PARTI Primitives

The PARTI procedures described in this paper have been used to port a 3-D unstructured mesh Euler solver [18]. This work has spurred many improvements in the optimizations carried out by our primitives. In Mavriplis' Euler code, for a problem solved on a 53,921 node grid, we have seen a reduction in communication time from 288 seconds to 82 seconds on 32 processors of an Intel iPSC/860 (compared to a computation time of 151 seconds). Approximately a factor of two improvement in communication overhead was seen when we employed incremental schedules. (Recall from Section 2.2 that the use of incremental schedules allows us to obtain only those non-updated off-processor elements which have not been previously encountered in earlier loops.) In this code, the total cost of all preprocessing was less than 2 seconds. The computational rate of this mesh run on 32 processors of an iPSC/860 was 54 Mflops, while the same code run on 64 processors of an iPSC/860 ran at 87 Mflops. The unstructured mesh was partitioned by the method described in [25].

Since the form of the sequential code and the parallelized code was virtually identical, we did not expect the parallelization process to introduce any new inefficiencies beyond those exacted by the preprocessing and by the calls to the primitives. On a smaller problem, we compared the parallel code running on a single node with the sequential code and found only a 2 percent performance degradation.

4 PARTI Compiler

We have developed a prototype compiler which takes as input a Fortran 77 program enhanced with specifications for distributing data [28]. The compiler outputs a distributed memory program with embedded PARTI procedures. The PARTI procedures embedded by this compiler were an earlier version of the procedures described here; those procedures are described in [22]. One of the inputs that must be supplied to this compiler-generated program is information about how arrays and loop iterations are to be distributed between processors. This compiler allows arrays and loop iterations to be partitioned in an arbitrary manner; this flexibility is of practical importance in unstructured and sparse computations [23], [5]. The PARTI compiler [28] employed a set of language extensions designed to specify regular and irregular array distributions. A set of Fortran 77 language extensions (Fortran D) has been proposed [11]; these extensions subsume the language extensions described in [28]. This compiler was tested on several NASA kernels, and the performance of the resulting codes was benchmarked on the iPSC/860 [22].

We are currently constructing a Parascope based compiler that will embed the version of PARTI discussed in this paper. This new compiler will be an extension of a Parascope based distributed memory compiler targeted towards regular problems, described in [13]. This new compiler is being designed to incorporate partitioners in an integral manner. Customized partitioners decompose problems based on a programmer's understanding of the computationally important dependency relations. For instance, unstructured mesh Euler or Navier-Stokes solvers contain a variety of loops over either mesh edges or tetrahedra, and each loop may exhibit a different data dependency pattern. A programmer who is familiar with a given application will generally know which portions of a code need to be taken into account when calculating data partitions; the need for such insight may or may not be reduced when we link partitioners to compilers. We plan to develop directives to allow users to specify which loops are to be taken into account when determining array partitioning. Note that the user does not specify a partition directly. Partitioners such as those described in [9], [10] and [25] can partition arrays based on connectivity graphs that originate from loop dependence relations. We intend to design a compiler that is able to embed a PARTI primitive designed to translate execution-time loop dependency relations into a distributed representation of a connectivity graph. A partitioner coupled to PARTI will be written so that it inputs this connectivity graph information. After partitioning is completed, information concerning the chosen partitioning will be returned and a distributed translation table will be initialized. This mechanism of linking a runtime partitioner to a compiler was initially outlined in [19] and is closely related to [16].
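The first step of that mechanism, turning a loop's reference pattern into a connectivity graph that a partitioner can consume, can be sketched on a single node as follows. This is our illustration, using a compressed sparse row adjacency structure built from an edge list like that of Figure 1; PARTI's distributed representation and its actual interface to partitioners are not shown:

c     Build a connectivity graph (compressed sparse row form) from
c     the kind of edge list that drives loop L1 of Figure 1; an
c     illustration only of the structure a partitioner consumes.
      program congrf
      integer nnode, nedge
      parameter (nnode = 5, nedge = 4)
      integer e1(nedge), e2(nedge)
      integer deg(nnode), ptr(nnode + 1), adj(2 * nedge)
      integer i, n1, n2
      data e1 /1, 2, 2, 4/
      data e2 /2, 3, 4, 5/
c     count the degree of each node
      do i = 1, nnode
         deg(i) = 0
      end do
      do i = 1, nedge
         deg(e1(i)) = deg(e1(i)) + 1
         deg(e2(i)) = deg(e2(i)) + 1
      end do
c     prefix sums give the row pointers; reuse deg() as cursors
      ptr(1) = 1
      do i = 1, nnode
         ptr(i + 1) = ptr(i) + deg(i)
         deg(i) = ptr(i)
      end do
c     neighbors of node i end up in adj(ptr(i) .. ptr(i+1)-1)
      do i = 1, nedge
         n1 = e1(i)
         n2 = e2(i)
         adj(deg(n1)) = n2
         deg(n1) = deg(n1) + 1
         adj(deg(n2)) = n1
         deg(n2) = deg(n2) + 1
      end do
      write (*,*) 'ptr:', (ptr(i), i = 1, nnode + 1)
      write (*,*) 'adj:', (adj(i), i = 1, 2 * nedge)
      end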
5 Relation to Other Work

Programs designed to carry out a range of irregular computations, including sparse direct and iterative methods, require many of the optimizations described in this paper. Some examples of such programs are described in [2], [17], [4], [27] and [12]. Several researchers have developed programming environments that are targeted towards particular classes of irregular or adaptive problems. Williams [27] describes a programming environment (DIME) for calculations with unstructured triangular meshes using distributed memory machines. Baden [3] has developed a programming environment targeted towards particle computations; this programming environment provides facilities that support dynamic load balancing. There are a variety of compiler projects targeted at distributed memory multiprocessors [29], [6], [21], [20], [1], [26]. With the exception of the Kali project [15] and the PARTI work described here and in [24], [19], and [23], these compilers do not attempt to efficiently deal with loops that arise in sparse or unstructured scientific computations.

The PARTI runtime support procedures and the compilers described in this paper are qualitatively different from the efforts cited above in a number of important respects. We have developed and demonstrated mechanisms that allow us to support irregularly distributed arrays. Irregularly distributed arrays must be supported to make it possible to map data and computational work in an arbitrary manner. Support for arbitrary distributions was proposed in [19] and [23], but to our knowledge, this is the first implementation of a compiler based distributed translation table mechanism for irregular scientific problems.

We find that many unstructured NASA codes must carry out data accumulations to off-processor memory locations. We chose one of our kernels to demonstrate this, and designed our primitives and compiler to be able to handle this situation. To our knowledge, our compiler effort is unique in its ability to efficiently carry out irregular patterns of off-processor data accumulations. We augment our primitives with a hash table designed to eliminate duplicate data accesses; other researchers have used different data structures for management of off-processor data copies [15]. We have also developed a mechanism for producing incremental schedules. The use of incremental schedules allows us to obtain only those non-updated off-processor elements which have not been previously encountered in earlier loops.

6 Summary and Conclusions

We have shown that PARTI primitives can be used to port actual unstructured codes to distributed memory machines. These primitives are highly optimized and require very little overhead. Duplicate off-processor data accesses are removed using hash tables during the formation of both total schedules and incremental schedules. Primitives for inter-processor data movement using multiple schedules have been presented; using these reduces the overall data transfer. A compiler has been implemented which takes in an extended Fortran 77 and produces node code with embedded PARTI primitives to be run on the Intel iPSC/860. The PARTI primitives are available for public distribution and can be obtained from netlib or from the anonymous ftp site ra.cs.yale.edu.

Acknowledgements

The authors would like to thank Horst Simon for the use of his unstructured mesh partitioning software and Venkatakrishnan for useful suggestions on low-level communications scheduling. We would also like to acknowledge support from NASA contract NAS while the authors were in residence at ICASE, NASA Langley Research Center, along with support from NSF grant ASC for authors Saltz and Berryman.

References

[1] F. André, J.-L. Pazat, and H. Thomas. PANDORE: A system to manage data distribution. In International Conference on Supercomputing, June.

[2] C. Ashcraft, S. C. Eisenstat, and J. W. H. Liu. A fan-in algorithm for distributed sparse numerical factorization. SISSC, 11(3).

[3] S. Baden. Programming abstractions for dynamically partitioning and coordinating localized scientific calculations running on multiprocessors. To appear, SIAM J. Sci. and Stat. Computation.

[4] D. Baxter, J. Saltz, M. Schultz, S. Eisenstat, and K. Crowley. An experimental study of methods for parallel preconditioned Krylov methods. In Proceedings of the 1988 Hypercube Multiprocessor Conference, Pasadena, CA, pages 1698-1711, January 1988.

[5] H. Berryman, J. Saltz, and J. Scroggs. Execution time support for adaptive scientific algorithms on distributed memory machines. To appear in Concurrency: Practice and Experience. Report 90-41, ICASE, May 1990.

[6] A. Cheung and A. P. Reeves. The Paragon multicomputer environment: A first implementation. Technical Report EE-CEG-89-9, Cornell University Computer Engineering Group, Cornell University School of Electrical Engineering, July 1989.

[7] R. Das, J. Saltz, and H. Berryman. A manual for PARTI runtime primitives, revision 1 (document and PARTI software available through netlib). Interim Report 91-17, ICASE, 1991.

[8] R. Das, J. Saltz, D. Mavriplis, J. Wu, and H. Berryman. Unstructured mesh problems, PARTI primitives and the ARF compiler. In Parallel Processing for Scientific Computing: Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, Houston, TX, April 1991.

[9] G. Fox. A graphical approach to load balancing and sparse matrix vector multiplication on the hypercube. In The IMA Volumes in Mathematics and its Applications, Volume 13: Numerical Algorithms for Modern Parallel Computer Architectures, Martin Schultz, editor. Springer-Verlag.

[10] G. Fox. A review of automatic load balancing and decomposition methods for the hypercube. In The IMA Volumes in Mathematics and its Applications, Volume 13: Numerical Algorithms for Modern Parallel Computer Architectures, Martin Schultz, editor. Springer-Verlag.

[11] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu. Fortran D language specification. Technical Report Rice COMP TR90-141, Department of Computer Science, Rice University, December 1990.

[12] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors. Prentice-Hall, Englewood Cliffs, New Jersey.

[13] S. Hiranandani, K. Kennedy, and C. Tseng. Compiler support for machine-independent parallel programming in Fortran D. In Compilers and Runtime Software for Scalable Multiprocessors, J. Saltz and P. Mehrotra, editors. Elsevier, Amsterdam, The Netherlands, to appear.

[14] S. Hiranandani, J. Saltz, P. Mehrotra, and H. Berryman. Performance of hashed cache data migration schemes on multicomputers. Journal of Parallel and Distributed Computing, 12, to appear.

[15] C. Koelbel, P. Mehrotra, and J. Van Rosendale. Supporting shared data structures on distributed memory architectures. In 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM SIGPLAN, March.

[16] M. Lam and M. C. Rinard. Coarse grain parallel programming in Jade. In Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Williamsburg, VA. ACM Press.

[17] J. W. Liu. Computational models and task scheduling for parallel sparse Cholesky factorization. Parallel Computing, 3.

[18] D. J. Mavriplis. Three dimensional unstructured multigrid for the Euler equations. In AIAA 10th Computational Fluid Dynamics Conference, June.

[19] R. Mirchandaney, J. H. Saltz, R. M. Smith, D. M. Nicol, and K. Crowley. Principles of runtime support for parallel processors. In Proceedings of the 1988 ACM International Conference on Supercomputing, St. Malo, France, July 1988.

[20] A. Rogers and K. Pingali. Process decomposition through locality of reference. In Conference on Programming Language Design and Implementation. ACM SIGPLAN, June.

[21] M. Rosing, R. W. Schnabel, and R. P. Weaver. Expressing complex parallel algorithms in Dino. In Proceedings of the 4th Conference on Hypercubes, Concurrent Computers and Applications.

[22] J. Saltz, H. Berryman, and J. Wu. Runtime compilation for multiprocessors. To appear in Concurrency: Practice and Experience. Report 90-59, ICASE, 1990.

[23] J. Saltz, K. Crowley, R. Mirchandaney, and H. Berryman. Run-time scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, 8.

[24] J. Saltz and M. C. Chen. Automated problem mapping: the Crystal runtime system. In Proceedings of the Hypercube Multiprocessors Conference, Knoxville, TN, September.

[25] H. Simon. Partitioning of unstructured mesh problems for parallel processing. In Proceedings of the Conference on Parallel Methods on Large Scale Structural Analysis and Physics Applications. Pergamon Press.

[26] P. S. Tseng. A Parallelizing Compiler for Distributed Memory Parallel Computers. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, May.

[27] R. D. Williams and R. Glowinski. Distributed irregular finite elements. Technical Report C3P 715, Caltech Concurrent Computation Program, February.

[28] J. Wu, J. Saltz, S. Hiranandani, and H. Berryman. Runtime compilation for multicomputers. In Proceedings of the ICPP.

[29] H. Zima, H. Bast, and M. Gerndt. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing, 6:1-18.


More information

Programming as Successive Refinement. Partitioning for Performance

Programming as Successive Refinement. Partitioning for Performance Programming as Successive Refinement Not all issues dealt with up front Partitioning often independent of architecture, and done first View machine as a collection of communicating processors balancing

More information

Run-time Reordering. Transformation 2. Important Irregular Science & Engineering Applications. Lackluster Performance in Irregular Applications

Run-time Reordering. Transformation 2. Important Irregular Science & Engineering Applications. Lackluster Performance in Irregular Applications Reordering s Important Irregular Science & Engineering Applications Molecular Dynamics Finite Element Analysis Michelle Mills Strout December 5, 2005 Sparse Matrix Computations 2 Lacluster Performance

More information

Semi-automatic domain decomposition based on potential theory

Semi-automatic domain decomposition based on potential theory Semi-automatic domain decomposition based on potential theory S.P. Spekreijse and J.C. Kok Nationaal Lucht- en Ruimtevaartlaboratorium National Aerospace Laboratory NLR Semi-automatic domain decomposition

More information

A Performance Study of Parallel FFT in Clos and Mesh Networks

A Performance Study of Parallel FFT in Clos and Mesh Networks A Performance Study of Parallel FFT in Clos and Mesh Networks Rajkumar Kettimuthu 1 and Sankara Muthukrishnan 2 1 Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439,

More information

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:

More information

Transactions on Information and Communications Technologies vol 9, 1995 WIT Press, ISSN

Transactions on Information and Communications Technologies vol 9, 1995 WIT Press,   ISSN Parallelization of software for coastal hydraulic simulations for distributed memory parallel computers using FORGE 90 Z.W. Song, D. Roose, C.S. Yu, J. Berlamont B-3001 Heverlee, Belgium 2, Abstract Due

More information

PASSION Runtime Library for Parallel I/O. Rajeev Thakur Rajesh Bordawekar Alok Choudhary. Ravi Ponnusamy Tarvinder Singh

PASSION Runtime Library for Parallel I/O. Rajeev Thakur Rajesh Bordawekar Alok Choudhary. Ravi Ponnusamy Tarvinder Singh Scalable Parallel Libraries Conference, Oct. 1994 PASSION Runtime Library for Parallel I/O Rajeev Thakur Rajesh Bordawekar Alok Choudhary Ravi Ponnusamy Tarvinder Singh Dept. of Electrical and Computer

More information

An Experimental Assessment of Express Parallel Programming Environment

An Experimental Assessment of Express Parallel Programming Environment An Experimental Assessment of Express Parallel Programming Environment Abstract shfaq Ahmad*, Min-You Wu**, Jaehyung Yang*** and Arif Ghafoor*** *Hong Kong University of Science and Technology, Hong Kong

More information

THE application of advanced computer architecture and

THE application of advanced computer architecture and 544 IEEE TRANSACTIONS ON ANTENNAS AND PROPAGATION, VOL. 45, NO. 3, MARCH 1997 Scalable Solutions to Integral-Equation and Finite-Element Simulations Tom Cwik, Senior Member, IEEE, Daniel S. Katz, Member,

More information

Cross-Layer Memory Management to Reduce DRAM Power Consumption

Cross-Layer Memory Management to Reduce DRAM Power Consumption Cross-Layer Memory Management to Reduce DRAM Power Consumption Michael Jantz Assistant Professor University of Tennessee, Knoxville 1 Introduction Assistant Professor at UT since August 2014 Before UT

More information

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao

More information

Transactions on Information and Communications Technologies vol 3, 1993 WIT Press, ISSN

Transactions on Information and Communications Technologies vol 3, 1993 WIT Press,   ISSN Toward an automatic mapping of DSP algorithms onto parallel processors M. Razaz, K.A. Marlow University of East Anglia, School of Information Systems, Norwich, UK ABSTRACT With ever increasing computational

More information

An Inspector-Executor Algorithm for Irregular Assignment Parallelization

An Inspector-Executor Algorithm for Irregular Assignment Parallelization An Inspector-Executor Algorithm for Irregular Assignment Parallelization Manuel Arenaz, Juan Touriño, Ramón Doallo Computer Architecture Group Dep. Electronics and Systems, University of A Coruña, Spain

More information

An In-place Algorithm for Irregular All-to-All Communication with Limited Memory

An In-place Algorithm for Irregular All-to-All Communication with Limited Memory An In-place Algorithm for Irregular All-to-All Communication with Limited Memory Michael Hofmann and Gudula Rünger Department of Computer Science Chemnitz University of Technology, Germany {mhofma,ruenger}@cs.tu-chemnitz.de

More information

An Initial Comparison of Implicit and Explicit Programming Styles for Distributed Memory Multiprocessors

An Initial Comparison of Implicit and Explicit Programming Styles for Distributed Memory Multiprocessors Proceedings of the 28th Annual Hmvaii Intemottonol Conference on System Sciences - 1995 An Initial Comparison of Implicit and Explicit Programming Styles for Distributed Memory Multiprocessors Matthew

More information

Interoperability of Data Parallel Runtime Libraries

Interoperability of Data Parallel Runtime Libraries Interoperability of Data Parallel Runtime Libraries Guy Edjlali, Alan Sussman and Joel Saltz Department of Computer Science University of Maryland College Park, MD 2742 fedjlali,als,saltzg@cs.umd.edu Abstract

More information

A Beginner s Guide to Programming Logic, Introductory. Chapter 6 Arrays

A Beginner s Guide to Programming Logic, Introductory. Chapter 6 Arrays A Beginner s Guide to Programming Logic, Introductory Chapter 6 Arrays Objectives In this chapter, you will learn about: Arrays and how they occupy computer memory Manipulating an array to replace nested

More information

Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System

Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System The Harvard community has made this article openly available. Please share how this

More information

SVM Support in the Vienna Fortran Compilation System. Michael Gerndt. Research Centre Julich(KFA)

SVM Support in the Vienna Fortran Compilation System. Michael Gerndt. Research Centre Julich(KFA) SVM Support in the Vienna Fortran Compilation System Peter Brezany University of Vienna brezany@par.univie.ac.at Michael Gerndt Research Centre Julich(KFA) m.gerndt@kfa-juelich.de Viera Sipkova University

More information

An Optimization Method Based On B-spline Shape Functions & the Knot Insertion Algorithm

An Optimization Method Based On B-spline Shape Functions & the Knot Insertion Algorithm An Optimization Method Based On B-spline Shape Functions & the Knot Insertion Algorithm P.A. Sherar, C.P. Thompson, B. Xu, B. Zhong Abstract A new method is presented to deal with shape optimization problems.

More information

C ICASE INTERIM REPORT 17. Raja Das Joel Saltz Hwry Berryman. NASA Contract No. NAS May 1991

C ICASE INTERIM REPORT 17. Raja Das Joel Saltz Hwry Berryman. NASA Contract No. NAS May 1991 AD-A237 262 NASA Conr actor "Is Reit DTIC ls ELECTE C ICASE INTERIM REPORT 17 A MAN UAL FOR PARTI RUINIME PRTIV Revisi 1 Raja Das Joel Saltz Hwry Berryman. NASA Contract No. NAS1-18605 May 1991 INSTITUTE

More information

6. Parallel Volume Rendering Algorithms

6. Parallel Volume Rendering Algorithms 6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks

More information