PARTI Procedures for Realistic Loops

J. Saltz, R. Das, R. Ponnusamy, D. Mavriplis, H. Berryman and J. Wu
ICASE, NASA Langley Research Center, Hampton, VA

Abstract

This paper describes a set of primitives (PARTI) developed to efficiently execute unstructured problems on distributed memory machines. These primitives have been incorporated into actual programs and kernels by hand. The primitives were developed so that the sequential code and the parallel code on each node would look identical. We have introduced a mechanism by which duplicate fetches of off-processor data between different loops are eliminated. This elimination of duplicate fetches appears to make a considerable impact on communications overheads.

1 Introduction

In many algorithms, data produced or input during a program's initialization plays a large role in determining the nature of the subsequent computation. When the data structures that define a computation have been initialized, a preprocessing phase follows. Vital elements of the strategy used by the rest of the algorithm are determined by this preprocessing phase. To effectively exploit many multiprocessor architectures, we may also have to carry out run time preprocessing. This preprocessing will be referred to as runtime compilation [22]. The purpose of runtime compilation is not to determine which computations are to be performed, but instead to determine how a multiprocessor machine will schedule the algorithm's work, how to map the data structures, and how data movement within the multiprocessor is to be scheduled.

In distributed memory MIMD architectures, there is typically a non-trivial communications latency or startup cost. For efficiency reasons, information to be transmitted should be collected into relatively large messages. The cost of fetching array elements can be reduced by precomputing what data each processor needs to send and to receive. In irregular problems, such as solving PDEs on unstructured meshes and sparse matrix algorithms, the communications pattern depends on the input data. This typically arises due to some level of indirection in the code. In this case, it is not possible to predict at compile time what data must be prefetched.

Only recently have methods been developed to integrate the kinds of runtime optimizations mentioned above into compilers and programming environments [22]. The lack of compile-time information is dealt with by transforming the original parallel loop into two constructs called an inspector and an executor [19]. During program execution, the inspector examines the data references made by a processor and calculates what off-processor data needs to be fetched and where that data will be stored once it is received. The executor loop then uses the information from the inspector to implement the actual computation.

We have developed a suite of primitives that can be used directly by programmers to generate inspector/executor pairs. The primitives incorporate a number of new insights we have had about sparse and unstructured computations, and they differ in a number of ways from those described earlier [22]. Our new primitives carry out preprocessing that makes it straightforward to produce parallelized loops that are virtually identical in form to the original sequential loops. The importance of this is that it will be possible to generate the same quality object code on the nodes of the distributed memory machine as could be produced by the sequential program running on a single node.
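The following is a minimal single-node sketch of this inspector/executor split (our own illustration, not PARTI code; the names edge, loc and nonproc are hypothetical). The inspector assigns each indirect reference a local location, reserving buffer slots past the owned range for off-processor data; the executor then runs the original loop body through that localized index list:

c     A single-node mock of the inspector/executor split; an
c     illustration only, not PARTI code.  "Off-processor" here simply
c     means a global index beyond the owned range 1..nonproc.
      program insexe
      integer nedge, nonproc
      parameter (nedge = 4, nonproc = 5)
      integer edge(nedge), loc(nedge)
      real*8 x(nonproc + nedge), ysum
      integer i, nfetch
      data edge /2, 7, 9, 4/
      data x /1.d0, 2.d0, 3.d0, 4.d0, 5.d0, 4*0.d0/
c     inspector: decide where each reference will live locally and
c     reserve buffer slots, past x(nonproc), for off-processor data
      nfetch = 0
      do i = 1, nedge
         if (edge(i) .le. nonproc) then
            loc(i) = edge(i)
         else
            nfetch = nfetch + 1
            loc(i) = nonproc + nfetch
         endif
      end do
c     communication would go here: gather the nfetch off-processor
c     values into x(nonproc+1 .. nonproc+nfetch); we fake it
      do i = 1, nfetch
         x(nonproc + i) = 99.d0
      end do
c     executor: the original loop body, indexed through loc()
      ysum = 0.d0
      do i = 1, nedge
         ysum = ysum + x(loc(i))
      end do
      write (*,*) 'fetched', nfetch, ' values, sum =', ysum
      end

This naive inspector reserves one buffer slot per off-processor reference; the hash tables discussed next avoid fetching the same element twice.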
Our primitives make use of hash tables [14] to allow us to recognize and exploit a number of situations in which a single off-processor distributed array reference is used several times. In such situations, the primitives only fetch a single copy of each unique off-processor distributed array reference.
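A small open-addressing table is enough to picture this duplicate-elimination step (a simplified sketch of the idea under our own assumptions about table size and probing, not the PARTI data structure):

c     Duplicate-reference filter via an open-addressing hash table;
c     a simplified sketch of the idea, not the PARTI implementation.
      program dedup
      integer tsize, nref
      parameter (tsize = 31, nref = 8)
      integer table(tsize), ref(nref), uniq(nref)
      integer i, h, nuniq
      data ref /12, 40, 12, 7, 40, 40, 93, 7/
      do i = 1, tsize
         table(i) = 0
      end do
      nuniq = 0
      do i = 1, nref
c        linear-probe until we hit this key or an empty slot (0)
         h = mod(ref(i), tsize) + 1
 10      if (table(h) .ne. 0 .and. table(h) .ne. ref(i)) then
            h = mod(h, tsize) + 1
            goto 10
         endif
         if (table(h) .eq. 0) then
c           first occurrence: record it; only this copy is fetched
            table(h) = ref(i)
            nuniq = nuniq + 1
            uniq(nuniq) = ref(i)
         endif
      end do
      write (*,*) 'distinct off-processor references:', nuniq
      end

Each reference is probed once; only first occurrences would be added to the communication schedule.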

1.1 Distributed Data Access using PARTI

These primitives are named PARTI (Parallel Automated Runtime Toolkit at ICASE) [7], [5]; they carry out the distribution and retrieval of globally indexed but irregularly distributed data-sets over the numerous local processor memories. Each inspector produces a schedule, which is essentially a pattern of communication for gathering or scattering data. In order to avoid duplicate data accesses, a list of off-processor data references is stored locally (for each processor) in a hash table [14], [28]. For each new off-processor data reference required, a quick search through the hash table is performed to determine whether this reference has already been accessed. If the reference has not previously been accessed, it is stored in the hash table; otherwise it is discarded. The primitives thus only fetch a single copy of each unique off-processor distributed data-set reference. This idea has also been extended to allow us to produce incremental schedules. For example, if two loops require different but overlapping data references, then by simply preserving the hash table formed during the generation of the schedule for the first loop, we may generate a schedule for the second loop that obtains only those off-processor elements which have not been previously encountered.

1.2 Distributed Translation Tables

In distributed memory machines, large data arrays need to be partitioned between the local memories of processors. These partitioned data arrays are called distributed arrays. Long term storage of distributed array data is assigned to specific memory locations in the distributed machine. Each element in a distributed array is assigned to a particular processor, and in order for another processor to be able to access a given element of the array, we must know the processor on which it resides and its local address in that processor's memory. We therefore build a translation table which, for each array element, lists the host processor address. For a one-dimensional array of N elements, the translation table also contains N elements, and must therefore itself be distributed over the local memories of the processors. This is accomplished by putting the first N/NP elements on the first processor, the second N/NP elements on the second processor, and so on, where NP is the number of processors. Thus, if we are required to access the mth element of the array, we look up its address in the distributed translation table; that table entry is held by processor (m-1)/(N/NP) + 1 (using integer division). One of the PARTI primitives handles initialization of distributed translation tables, and other primitives are used to access them. In Section 2.2, we will give examples of PARTI procedures that initialize and access distributed translation tables.
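A tiny sketch of the lookup arithmetic just described, assuming as above that NP evenly divides N (the names are ours):

c     Which processor holds translation-table entry m?  A sketch of
c     the block layout described above; assumes NP divides N evenly.
      program ttloc
      integer n, np, blk, m, owner, off
      parameter (n = 1000, np = 8)
      blk = n / np
      m = 637
c     1-based owner of entry m, and its offset there (integer division)
      owner = (m - 1) / blk + 1
      off = m - (owner - 1) * blk
      write (*,*) 'entry', m, ': processor', owner, ', offset', off
      end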
The PARTI primitives have been used to solve a variety of realistic applications, such as a 3-D unstructured mesh multigrid Euler solver [18]. PARTI has also been distributed to a variety of universities and laboratories.

2 The PARTI Primitives

In this section we present a running example in order to illustrate the way in which PARTI procedure calls are used and to describe the optimizations carried out by these procedures. In Figure 1, we depict a set of loops which roughly mimics loops frequently encountered in unstructured mesh fluids codes. Loop L1 sweeps over the edges of a mesh; the mesh edges may define a three dimensional object such as an aircraft. The reference pattern is determined by integer array edge_list. Note that indirection appears in S1 and S2 on both the left and the right sides of the expressions. Loop L2 sweeps over a set of faces; these faces may define the surface of a three dimensional object. The reference pattern is determined by integer array face_list. Indirection again appears on both the left and the right sides of expressions S3 and S4. Note also that both loop L1 and loop L2 read from array x and that neither loop writes to x.

2.1 PARTI Executor

Figure 2 depicts the executor code with embedded Fortran-callable PARTI procedures dfmgather, dfscatter_add and dfscatter_addnc. Before this code is run, we have to carry out a preprocessing phase, to be described in Section 2.2. The arrays x and y are partitioned between processors; each processor is responsible for the long term storage of specified elements of each of these arrays. The way in which x and y are to be partitioned between processors is determined by the inspector. In this example, elements of x and y are partitioned between processors in exactly the same way; each processor is responsible for n_on_proc elements of x and y. It should be noted that, except for the procedure calls, the structure of the loops in Figure 2 is identical to that of the loops in Figure 1.

L1    do i = 1, n_edge
         n1 = edge_list(i)
         n2 = edge_list(n_edge + i)
S1       y(n1) = y(n1) + ... x(n1) ... x(n2)
S2       y(n2) = y(n2) + ... x(n1) ... x(n2)
      end do

L2    do i = 1, n_face
         m1 = face_list(i)
         m2 = face_list(n_face + i)
         m3 = face_list(2*n_face + i)
         m4 = face_list(3*n_face + i)
S3       y(m1) = y(m1) + ... x(m1) ... x(m2) ... x(m3) ... x(m4)
S4       y(m2) = y(m2) + ... x(m1) ... x(m2) ... x(m3) ... x(m4)
      end do

Figure 1: Example Code

In Figure 2 we again use arrays named x and y, but x and y now represent arrays defined on a single processor of a distributed memory multiprocessor. On each processor P, arrays x and y are declared to be larger than would be needed to store the number of array elements for which P is responsible. We store copies of off-processor array elements beginning with local array elements x(n_on_proc+1) and y(n_on_proc+1).

The PARTI subroutine calls depicted in Figure 2 move data between processors using a precomputed communication pattern. The communication pattern is specified by either a single schedule or by an array of schedules. dfmgather uses communication schedules to fetch off-processor data that will be needed either by loop L1 or by loop L2. The schedules specify the locations in distributed memory from which data is to be obtained. In Figure 2, off-processor data is obtained from the array x defined on each processor. Copies of the off-processor data are placed in a buffer area beginning with x(n_on_proc+1). The PARTI procedures dfscatter_add and dfscatter_addnc, in statements S2 and S3 of Figure 2, accumulate data to off-processor memory locations. Both dfscatter_add and dfscatter_addnc obtain the data to be accumulated to off-processor locations from a buffer area that begins with y(n_on_proc+1). Off-processor data is accumulated to locations of y between indices 1 and n_on_proc. The distinction between dfscatter_add and dfscatter_addnc will be described in Section 2.3. In Figure 2, several data may be accumulated to a given off-processor location in loop L1 or in loop L2.

2.2 PARTI Inspector

In this section, we will outline how we carry out the preprocessing needed to generate the arguments needed by the code in Figure 2. This preprocessing is depicted in Figure 3. The way in which the nodes of an irregular mesh are numbered frequently does not have a useful correspondence to the connectivity pattern of the mesh. When we partition such a mesh in a way that minimizes interprocessor communication, we may need to be able to assign arbitrary mesh points to each processor. The PARTI procedure ifbuild_translation_table (S1 in Figure 3) allows us to map a globally indexed distributed array onto processors in an arbitrary fashion. Each processor passes ifbuild_translation_table a list of the array elements for which it will be responsible (myvals in S1, Figure 3). If a given processor needs to obtain a datum that corresponds to a particular global index i for a specific distributed array, the processor can consult the distributed translation table (Section 1.2) to find the location of that datum in distributed memory.
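To picture what such a table records, the following single-node mock-up builds the owner and local-address maps from each processor's ownership list. It is our illustration only; the actual PARTI table is itself block-distributed across processors as described in Section 1.2, and the names here are hypothetical:

c     Single-node mock of building a translation table from each
c     processor's list of owned global indices; the real PARTI table
c     is itself distributed, as described in Section 1.2.
      program ttbld
      integer n, np, nown
      parameter (n = 8, np = 2, nown = 4)
      integer myvals(nown, np), owner(n), local(n)
      integer p, j, g
c     column p lists the nown global indices owned by processor p
      data myvals /1, 4, 6, 8, 2, 3, 5, 7/
      do p = 1, np
         do j = 1, nown
            g = myvals(j, p)
c           record the owning processor and local address of g
            owner(g) = p
            local(g) = j
         end do
      end do
      write (*,*) 'global index, owner, local address:'
      do g = 1, n
         write (*,*) g, owner(g), local(g)
      end do
      end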

      real*8 x(n_on_proc + n_off_proc)
      real*8 y(n_on_proc + n_off_proc)
S1    dfmgather(sched_array, 2, x(n_on_proc+1), x)
L1    do i = 1, n_edge
         n1 = local_edge_list(i)
         n2 = local_edge_list(n_edge + i)
         y(n1) = y(n1) + ... x(n1) ... x(n2)
         y(n2) = y(n2) + ... x(n1) ... x(n2)
      end do
S2    dfscatter_add(edge_sched, y(n_on_proc+1), y)
L2    do i = 1, n_face
         m1 = local_face_list(i)
         m2 = local_face_list(n_face + i)
         m3 = local_face_list(2*n_face + i)
         m4 = local_face_list(3*n_face + i)
         y(m1) = y(m1) + ... x(m1) ... x(m2) ... x(m3) ... x(m4)
         y(m2) = y(m2) + ... x(m1) ... x(m2) ... x(m3) ... x(m4)
      end do
S3    dfscatter_addnc(face_sched, y(n_on_proc+1), buffer_mapping, y)

Figure 2: Parallelized Code for Each Processor

S1    translation_table = ifbuild_translation_table(1, myvals, n_on_proc)
S2    call flocalize(translation_table, edge_sched, edge_list,
         local_edge_list, 2*n_edge, n_off_proc)
S3    sched_array(1) = edge_sched
S4    call fmlocalize(translation_table, face_sched,
         incremental_face_sched, face_list, local_face_list,
         4*n_face, n_off_proc_face, n_new_off_proc_face,
         buffer_mapping, 1, sched_array)
S5    sched_array(2) = incremental_face_sched

Figure 3: Inspector Code for Each Processor

The PARTI procedures flocalize and fmlocalize carry out the bulk of the preprocessing needed to produce the executor code depicted in Figure 2. We will first describe flocalize (S2 in Figure 3). On each processor, flocalize is passed:

1. a pointer to a distributed translation table (translation_table in S2),
2. a list of globally indexed distributed array references (edge_list in S2), and
3. the number of globally indexed distributed array references (2*n_edge in S2).

Flocalize returns:

1. a schedule that can be used in PARTI gather and scatter procedures (edge_sched in S2),
2. a list of integers that can be used to specify the pattern of indirection in the executor code (local_edge_list in S2), and
3. the number of distinct off-processor references found in edge_list (n_off_proc in S2).

There are a variety of situations in which the same data need to be accessed by multiple loops (Section 1.1). In Figure 1, no assignments to x are carried out. At the beginning of Figure 2, each processor can gather a single copy of every distinct off-processor value of x referenced by loops L1 or L2.
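The effect of this localization step can be imitated on a single node as follows (a hypothetical simplification, not the PARTI implementation: the real flocalize also consults the distributed translation table and builds the communication schedule, and it searches with a hash table where this sketch scans linearly):

c     Single-node mock of a flocalize-style localization step: it
c     returns the unique off-processor indices to gather and a
c     localized index list the executor loop can use unchanged.
      program floc
      integer nonproc, nref
      parameter (nonproc = 5, nref = 6)
      integer glob(nref), local(nref), fetch(nref)
      integer i, j, nfetch
      data glob /3, 9, 1, 9, 12, 3/
      nfetch = 0
      do i = 1, nref
         if (glob(i) .le. nonproc) then
c           locally owned: the local index is the global index here
            local(i) = glob(i)
         else
c           off-processor: reuse an existing buffer slot if this
c           index was seen before (PARTI does this search with a
c           hash table; a linear scan keeps the sketch short)
            local(i) = 0
            do j = 1, nfetch
               if (fetch(j) .eq. glob(i)) local(i) = nonproc + j
            end do
            if (local(i) .eq. 0) then
               nfetch = nfetch + 1
               fetch(nfetch) = glob(i)
               local(i) = nonproc + nfetch
            endif
         endif
      end do
      write (*,*) 'gather', nfetch, ' distinct values'
      write (*,*) 'localized list:', (local(i), i = 1, nref)
      end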

The PARTI procedure fmlocalize (S4 in Figure 3) makes it simple to remove such duplicate references: it makes it possible to obtain only those off-processor data not requested by a given set of pre-existing schedules. The procedure dfmgather in the executor in Figure 2 obtains off-processor data using two schedules: edge_sched, produced by flocalize (S2, Figure 3), and incremental_face_sched, produced by fmlocalize (S4, Figure 3).

To review the work carried out by fmlocalize, we will summarize the significance of all but one of the arguments of this PARTI procedure. On each processor, fmlocalize is passed:

1. a pointer to a distributed translation table (translation_table in S4),
2. a list of globally indexed distributed array references (face_list in S4),
3. the number of globally indexed distributed array references (4*n_face in S4),
4. the number of pre-existing schedules that need to be taken into account when removing duplicates (1 in S4), and
5. an array of pointers to pre-existing schedules (sched_array in S4).

Fmlocalize returns:

1. a schedule that can be used in PARTI gather and scatter procedures; this schedule does not take any pre-existing schedules into account (face_sched in S4),
2. an incremental schedule that includes only off-processor data accesses not included in the pre-existing schedules (incremental_face_sched in S4),
3. a list of integers that can be used to specify the pattern of indirection in the executor code (local_face_list in S4),
4. the number of distinct off-processor references in face_list (n_off_proc_face in S4),
5. the number of distinct off-processor references not encountered in any other schedule (n_new_off_proc_face in S4), and
6. buffer_mapping, to be discussed in Section 2.3.

2.3 A Return to the Executor

We have already discussed dfmgather in Section 2.1, but we have not yet said anything about the distinction between dfscatter_add and dfscatter_addnc. When we make use of incremental schedules, we assign a single buffer location to each off-processor distributed array element. In our example, we carry out separate off-processor accumulations after loops L1 and L2. As we describe below, in this situation our off-processor accumulation procedures may no longer reference consecutive elements of a buffer.

In S2, Figure 2, we can assign copies of distinct off-processor elements of y to consecutive buffer locations. We can then use a schedule (edge_sched) to specify where in distributed memory each consecutive value in the buffer is to be accumulated. The PARTI procedure dfscatter_add can be employed here; this procedure uses schedule edge_sched to accumulate to off-processor locations from consecutive buffer locations beginning with y(n_on_proc + 1). By the time we get to L2, however, some of the off-processor copies may already be associated with buffer locations. Consequently, in S3, Figure 2, our schedule (face_sched) must access buffer locations in an irregular manner. The pattern of buffer locations accessed is specified by the integer array buffer_mapping passed to dfscatter_addnc in S3, Figure 2 (dfscatter_addnc stands for dfscatter_add, non-contiguous).
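The buffer indexing behind dfscatter_addnc can be sketched as follows; this is our illustration of the access pattern only, with the schedule and the interprocessor transport omitted and all names hypothetical:

c     Sketch of the non-contiguous buffer access behind
c     dfscatter_addnc: a mapping array names the buffer slot that
c     feeds each accumulation, since the slots assigned during an
c     earlier loop need not be consecutive for this one.
      program scatnc
      integer nonproc, nbuf, nacc
      parameter (nonproc = 4, nbuf = 5, nacc = 3)
      real*8 y(nonproc + nbuf)
      integer bufmap(nacc), dest(nacc)
      integer i
c     buffer slot (past y(nonproc)) feeding each accumulation
      data bufmap /2, 5, 3/
c     owned location of y receiving each accumulated value
      data dest /1, 4, 2/
      do i = 1, nonproc + nbuf
         y(i) = 1.d0
      end do
c     accumulate from non-consecutive buffer slots into owned y
      do i = 1, nacc
         y(dest(i)) = y(dest(i)) + y(nonproc + bufmap(i))
      end do
      write (*,*) (y(i), i = 1, nonproc)
      end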
3 Status of PARTI Primitives

The PARTI procedures described in this paper have been used to port a 3-D unstructured mesh Euler solver [18]. This work has spurred many improvements in the optimizations carried out by our primitives. In Mavriplis' Euler code, for a problem solved on a 53,921 node grid, we have seen a reduction in communication time from 288 seconds to 82 seconds on 32 processors of an Intel iPSC/860 (compared to a computation time of 151 seconds). Approximately a factor of two improvement in communication overhead was seen when we employed incremental schedules. (Recall from Section 2.2 that the use of incremental schedules allows us to obtain only those non-updated off-processor elements which have not been previously encountered in earlier loops.) In this code, the total cost of all preprocessing was less than 2 seconds. The computational rate of this mesh run on 32 processors of an iPSC/860 was 54 Mflops, while the same code run on 64 processors of an iPSC/860 ran at 87 Mflops. The unstructured mesh was partitioned by the method described in [25].

Since the form of the sequential code and the parallelized code was virtually identical, we did not expect the parallelization process to introduce any new inefficiencies beyond those exacted by the preprocessing and by the calls to the primitives. On a smaller problem, we compared the parallel code running on a single node with the sequential code and found only a 2 percent performance degradation.

4 PARTI Compiler

We have developed a prototype compiler which takes as input a Fortran 77 program enhanced with specifications for distributing data [28]. The compiler outputs a distributed memory program with embedded PARTI procedures. The PARTI procedures embedded by this compiler were an earlier version of the procedures described here; those procedures are described in [22]. One of the inputs that must be supplied to this compiler-generated program is information about how arrays and loop iterations are to be distributed between processors. This compiler allows arrays and loop iterations to be partitioned in an arbitrary manner; this flexibility is of practical importance in unstructured and sparse computations [23], [5]. The PARTI compiler [28] employed a set of language extensions designed to specify regular and irregular array distributions. A set of Fortran 77 language extensions (Fortran D) has been proposed [11]; these extensions subsume the language extensions described in [28]. This compiler was tested on several NASA kernels, and the performance of the resulting codes was benchmarked on the iPSC/860 [22].

We are currently constructing a Parascope based compiler that will embed the version of PARTI discussed in this paper. This new compiler will be an extension of a Parascope based distributed memory compiler targeted towards regular problems, described in [13]. This new compiler is being designed to incorporate partitioners in an integral manner. Customized partitioners decompose problems based on a programmer's understanding of the computationally important dependency relations. For instance, unstructured mesh Euler or Navier-Stokes solvers contain a variety of loops over either mesh edges or tetrahedra, and each loop may exhibit a different data dependency pattern. A programmer who is familiar with a given application will generally know which portions of a code need to be taken into account when calculating data partitions; the need for such insight may or may not be reduced when we link partitioners to compilers. We plan to develop directives to allow users to specify which loops are to be taken into account when determining array partitioning. Note that the user does not specify a partition directly. Partitioners such as those described in [9], [10] and [25] can partition arrays based on connectivity graphs that originate from loop dependence relations. We intend to design a compiler that is able to embed a PARTI primitive designed to translate execution-time loop dependency relations into a distributed representation of a connectivity graph. A partitioner coupled to PARTI will be written so that it inputs this connectivity graph information. After partitioning is completed, information concerning the chosen partitioning will be returned and a distributed translation table will be initialized. This mechanism of linking a runtime partitioner to a compiler was initially outlined in [19] and is closely related to [16].
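The first step of that mechanism, turning a loop's reference pattern into a connectivity graph that a partitioner can consume, can be sketched on a single node as follows. This is our illustration, using a compressed sparse row adjacency structure built from an edge list like that of Figure 1; PARTI's distributed representation and its actual interface to partitioners are not shown:

c     Build a connectivity graph (compressed sparse row form) from
c     the kind of edge list that drives loop L1 of Figure 1; an
c     illustration only of the structure a partitioner consumes.
      program congrf
      integer nnode, nedge
      parameter (nnode = 5, nedge = 4)
      integer e1(nedge), e2(nedge)
      integer deg(nnode), ptr(nnode + 1), adj(2 * nedge)
      integer i, n1, n2
      data e1 /1, 2, 2, 4/
      data e2 /2, 3, 4, 5/
c     count the degree of each node
      do i = 1, nnode
         deg(i) = 0
      end do
      do i = 1, nedge
         deg(e1(i)) = deg(e1(i)) + 1
         deg(e2(i)) = deg(e2(i)) + 1
      end do
c     prefix sums give the row pointers; reuse deg() as cursors
      ptr(1) = 1
      do i = 1, nnode
         ptr(i + 1) = ptr(i) + deg(i)
         deg(i) = ptr(i)
      end do
c     neighbors of node i end up in adj(ptr(i) .. ptr(i+1)-1)
      do i = 1, nedge
         n1 = e1(i)
         n2 = e2(i)
         adj(deg(n1)) = n2
         deg(n1) = deg(n1) + 1
         adj(deg(n2)) = n1
         deg(n2) = deg(n2) + 1
      end do
      write (*,*) 'ptr:', (ptr(i), i = 1, nnode + 1)
      write (*,*) 'adj:', (adj(i), i = 1, 2 * nedge)
      end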
5 Relation to Other Work

Programs designed to carry out a range of irregular computations, including sparse direct and iterative methods, require many of the optimizations described in this paper. Some examples of such programs are described in [2], [17], [4], [27] and [12]. Several researchers have developed programming environments that are targeted towards particular classes of irregular or adaptive problems. Williams [27] describes a programming environment (DIME) for calculations with unstructured triangular meshes using distributed memory machines. Baden [3] has developed a programming environment targeted towards particle computations; this programming environment provides facilities that support dynamic load balancing. There are a variety of compiler projects targeted at distributed memory multiprocessors [29], [6], [21], [20], [1], [26]. With the exception of the Kali project [15] and the PARTI work described here and in [24], [19], and [23], these compilers do not attempt to efficiently deal with loops that arise in sparse or unstructured scientific computations.

The PARTI runtime support procedures and the compilers described in this paper are qualitatively different from the efforts cited above in a number of important respects. We have developed and demonstrated mechanisms that allow us to support irregularly distributed arrays. Irregularly distributed arrays must be supported to make it possible to map data and computational work in an arbitrary manner. Support for arbitrary distributions was proposed in [19] and [23], but to our knowledge, this is the first implementation of a compiler based distributed translation table mechanism for irregular scientific problems.

We find that many unstructured NASA codes must carry out data accumulations to off-processor memory locations. We chose one of our kernels to demonstrate this, and designed our primitives and compiler to be able to handle this situation. To our knowledge, our compiler effort is unique in its ability to efficiently carry out irregular patterns of off-processor data accumulations. We augment our primitives with a hash table designed to eliminate duplicate data accesses; other researchers have used different data structures for management of off-processor data copies [15]. We have also developed a mechanism for producing incremental schedules. The use of incremental schedules allows us to obtain only those non-updated off-processor elements which have not been previously encountered in earlier loops.

6 Summary and Conclusions

We have shown that PARTI primitives can be used to port actual unstructured codes to distributed memory machines. These primitives are highly optimized and require very little overhead. Duplicate off-processor data accesses are removed using hash tables during the formation of both total schedules and incremental schedules. Primitives for inter-processor data movement using multiple schedules have been presented; using these reduces the overall data transfer. A compiler has been implemented which takes in an extended Fortran 77 and produces node code with embedded PARTI primitives to be run on the Intel iPSC/860. The PARTI primitives are available for public distribution and can be obtained from netlib or from the anonymous ftp site ra.cs.yale.edu.

Acknowledgements

The authors would like to thank Horst Simon for the use of his unstructured mesh partitioning software and Venkatakrishnan for useful suggestions on low-level communications scheduling. We would also like to acknowledge support from NASA contract NAS while the authors were in residence at ICASE, NASA Langley Research Center, along with support from NSF grant ASC for authors Saltz and Berryman.

References

[1] F. André, J.-L. Pazat, and H. Thomas. PANDORE: A system to manage data distribution. In International Conference on Supercomputing, June.

[2] C. Ashcraft, S. C. Eisenstat, and J. W. H. Liu. A fan-in algorithm for distributed sparse numerical factorization. SISSC, 11(3).

[3] S. Baden. Programming abstractions for dynamically partitioning and coordinating localized scientific calculations running on multiprocessors. To appear, SIAM J. Sci. and Stat. Computation.

[4] D. Baxter, J. Saltz, M. Schultz, S. Eisenstat, and K. Crowley. An experimental study of methods for parallel preconditioned Krylov methods. In Proceedings of the 1988 Hypercube Multiprocessor Conference, Pasadena, CA, pages 1698-1711, January 1988.

[5] H. Berryman, J. Saltz, and J. Scroggs. Execution time support for adaptive scientific algorithms on distributed memory machines. To appear in Concurrency: Practice and Experience. Report 90-41, ICASE, May 1990.

[6] A. Cheung and A. P. Reeves. The Paragon multicomputer environment: A first implementation. Technical Report EE-CEG-89-9, Cornell University Computer Engineering Group, Cornell University School of Electrical Engineering, July 1989.

[7] R. Das, J. Saltz, and H. Berryman. A manual for PARTI runtime primitives, revision 1 (document and PARTI software available through netlib). Interim Report 91-17, ICASE, 1991.

[8] R. Das, J. Saltz, D. Mavriplis, J. Wu, and H. Berryman. Unstructured mesh problems, PARTI primitives and the ARF compiler. In Parallel Processing for Scientific Computing: Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, Houston, TX, April 1991.

[9] G. Fox. A graphical approach to load balancing and sparse matrix vector multiplication on the hypercube. In The IMA Volumes in Mathematics and its Applications, Volume 13: Numerical Algorithms for Modern Parallel Computer Architectures, Martin Schultz, editor. Springer-Verlag.

[10] G. Fox. A review of automatic load balancing and decomposition methods for the hypercube. In The IMA Volumes in Mathematics and its Applications, Volume 13: Numerical Algorithms for Modern Parallel Computer Architectures, Martin Schultz, editor. Springer-Verlag.

[11] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu. Fortran D language specification. Technical Report Rice COMP TR90-141, Department of Computer Science, Rice University, December 1990.

[12] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors. Prentice-Hall, Englewood Cliffs, New Jersey.

[13] S. Hiranandani, K. Kennedy, and C. Tseng. Compiler support for machine-independent parallel programming in Fortran D. In Compilers and Runtime Software for Scalable Multiprocessors, J. Saltz and P. Mehrotra, editors. Elsevier, Amsterdam, The Netherlands, to appear.

[14] S. Hiranandani, J. Saltz, P. Mehrotra, and H. Berryman. Performance of hashed cache data migration schemes on multicomputers. Journal of Parallel and Distributed Computing, 12, to appear.

[15] C. Koelbel, P. Mehrotra, and J. Van Rosendale. Supporting shared data structures on distributed memory architectures. In 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM SIGPLAN, March.

[16] M. Lam and M. C. Rinard. Coarse grain parallel programming in Jade. In Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Williamsburg, VA. ACM Press.

[17] J. W. Liu. Computational models and task scheduling for parallel sparse Cholesky factorization. Parallel Computing, 3.

[18] D. J. Mavriplis. Three dimensional unstructured multigrid for the Euler equations. In AIAA 10th Computational Fluid Dynamics Conference, June.

[19] R. Mirchandaney, J. H. Saltz, R. M. Smith, D. M. Nicol, and K. Crowley. Principles of runtime support for parallel processors. In Proceedings of the 1988 ACM International Conference on Supercomputing, St. Malo, France, July 1988.

[20] A. Rogers and K. Pingali. Process decomposition through locality of reference. In Conference on Programming Language Design and Implementation. ACM SIGPLAN, June.

[21] M. Rosing, R. W. Schnabel, and R. P. Weaver. Expressing complex parallel algorithms in Dino. In Proceedings of the 4th Conference on Hypercubes, Concurrent Computers and Applications.

[22] J. Saltz, H. Berryman, and J. Wu. Runtime compilation for multiprocessors. To appear in Concurrency: Practice and Experience. Report 90-59, ICASE, 1990.

[23] J. Saltz, K. Crowley, R. Mirchandaney, and H. Berryman. Run-time scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, 8.

[24] J. Saltz and M. C. Chen. Automated problem mapping: the Crystal runtime system. In Proceedings of the Hypercube Multiprocessors Conference, Knoxville, TN, September.

[25] H. Simon. Partitioning of unstructured mesh problems for parallel processing. In Proceedings of the Conference on Parallel Methods on Large Scale Structural Analysis and Physics Applications. Pergamon Press.

[26] P. S. Tseng. A Parallelizing Compiler for Distributed Memory Parallel Computers. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, May.

[27] R. D. Williams and R. Glowinski. Distributed irregular finite elements. Technical Report C3P 715, Caltech Concurrent Computation Program, February.

[28] J. Wu, J. Saltz, S. Hiranandani, and H. Berryman. Runtime compilation for multicomputers. In Proceedings of the ICPP.

[29] H. Zima, H. Bast, and M. Gerndt. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing, 6:1-18.


More information

Programming as Successive Refinement. Partitioning for Performance

Programming as Successive Refinement. Partitioning for Performance Programming as Successive Refinement Not all issues dealt with up front Partitioning often independent of architecture, and done first View machine as a collection of communicating processors balancing

More information

Run-time Reordering. Transformation 2. Important Irregular Science & Engineering Applications. Lackluster Performance in Irregular Applications

Run-time Reordering. Transformation 2. Important Irregular Science & Engineering Applications. Lackluster Performance in Irregular Applications Reordering s Important Irregular Science & Engineering Applications Molecular Dynamics Finite Element Analysis Michelle Mills Strout December 5, 2005 Sparse Matrix Computations 2 Lacluster Performance

More information

Semi-automatic domain decomposition based on potential theory

Semi-automatic domain decomposition based on potential theory Semi-automatic domain decomposition based on potential theory S.P. Spekreijse and J.C. Kok Nationaal Lucht- en Ruimtevaartlaboratorium National Aerospace Laboratory NLR Semi-automatic domain decomposition

More information

A Performance Study of Parallel FFT in Clos and Mesh Networks

A Performance Study of Parallel FFT in Clos and Mesh Networks A Performance Study of Parallel FFT in Clos and Mesh Networks Rajkumar Kettimuthu 1 and Sankara Muthukrishnan 2 1 Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439,

More information

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:

More information

Transactions on Information and Communications Technologies vol 9, 1995 WIT Press, ISSN

Transactions on Information and Communications Technologies vol 9, 1995 WIT Press,   ISSN Parallelization of software for coastal hydraulic simulations for distributed memory parallel computers using FORGE 90 Z.W. Song, D. Roose, C.S. Yu, J. Berlamont B-3001 Heverlee, Belgium 2, Abstract Due

More information

PASSION Runtime Library for Parallel I/O. Rajeev Thakur Rajesh Bordawekar Alok Choudhary. Ravi Ponnusamy Tarvinder Singh

PASSION Runtime Library for Parallel I/O. Rajeev Thakur Rajesh Bordawekar Alok Choudhary. Ravi Ponnusamy Tarvinder Singh Scalable Parallel Libraries Conference, Oct. 1994 PASSION Runtime Library for Parallel I/O Rajeev Thakur Rajesh Bordawekar Alok Choudhary Ravi Ponnusamy Tarvinder Singh Dept. of Electrical and Computer

More information

An Experimental Assessment of Express Parallel Programming Environment

An Experimental Assessment of Express Parallel Programming Environment An Experimental Assessment of Express Parallel Programming Environment Abstract shfaq Ahmad*, Min-You Wu**, Jaehyung Yang*** and Arif Ghafoor*** *Hong Kong University of Science and Technology, Hong Kong

More information

THE application of advanced computer architecture and

THE application of advanced computer architecture and 544 IEEE TRANSACTIONS ON ANTENNAS AND PROPAGATION, VOL. 45, NO. 3, MARCH 1997 Scalable Solutions to Integral-Equation and Finite-Element Simulations Tom Cwik, Senior Member, IEEE, Daniel S. Katz, Member,

More information

Cross-Layer Memory Management to Reduce DRAM Power Consumption

Cross-Layer Memory Management to Reduce DRAM Power Consumption Cross-Layer Memory Management to Reduce DRAM Power Consumption Michael Jantz Assistant Professor University of Tennessee, Knoxville 1 Introduction Assistant Professor at UT since August 2014 Before UT

More information

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao

More information

Transactions on Information and Communications Technologies vol 3, 1993 WIT Press, ISSN

Transactions on Information and Communications Technologies vol 3, 1993 WIT Press,   ISSN Toward an automatic mapping of DSP algorithms onto parallel processors M. Razaz, K.A. Marlow University of East Anglia, School of Information Systems, Norwich, UK ABSTRACT With ever increasing computational

More information

An Inspector-Executor Algorithm for Irregular Assignment Parallelization

An Inspector-Executor Algorithm for Irregular Assignment Parallelization An Inspector-Executor Algorithm for Irregular Assignment Parallelization Manuel Arenaz, Juan Touriño, Ramón Doallo Computer Architecture Group Dep. Electronics and Systems, University of A Coruña, Spain

More information

An In-place Algorithm for Irregular All-to-All Communication with Limited Memory

An In-place Algorithm for Irregular All-to-All Communication with Limited Memory An In-place Algorithm for Irregular All-to-All Communication with Limited Memory Michael Hofmann and Gudula Rünger Department of Computer Science Chemnitz University of Technology, Germany {mhofma,ruenger}@cs.tu-chemnitz.de

More information

An Initial Comparison of Implicit and Explicit Programming Styles for Distributed Memory Multiprocessors

An Initial Comparison of Implicit and Explicit Programming Styles for Distributed Memory Multiprocessors Proceedings of the 28th Annual Hmvaii Intemottonol Conference on System Sciences - 1995 An Initial Comparison of Implicit and Explicit Programming Styles for Distributed Memory Multiprocessors Matthew

More information

Interoperability of Data Parallel Runtime Libraries

Interoperability of Data Parallel Runtime Libraries Interoperability of Data Parallel Runtime Libraries Guy Edjlali, Alan Sussman and Joel Saltz Department of Computer Science University of Maryland College Park, MD 2742 fedjlali,als,saltzg@cs.umd.edu Abstract

More information

A Beginner s Guide to Programming Logic, Introductory. Chapter 6 Arrays

A Beginner s Guide to Programming Logic, Introductory. Chapter 6 Arrays A Beginner s Guide to Programming Logic, Introductory Chapter 6 Arrays Objectives In this chapter, you will learn about: Arrays and how they occupy computer memory Manipulating an array to replace nested

More information

Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System

Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System The Harvard community has made this article openly available. Please share how this

More information

SVM Support in the Vienna Fortran Compilation System. Michael Gerndt. Research Centre Julich(KFA)

SVM Support in the Vienna Fortran Compilation System. Michael Gerndt. Research Centre Julich(KFA) SVM Support in the Vienna Fortran Compilation System Peter Brezany University of Vienna brezany@par.univie.ac.at Michael Gerndt Research Centre Julich(KFA) m.gerndt@kfa-juelich.de Viera Sipkova University

More information

An Optimization Method Based On B-spline Shape Functions & the Knot Insertion Algorithm

An Optimization Method Based On B-spline Shape Functions & the Knot Insertion Algorithm An Optimization Method Based On B-spline Shape Functions & the Knot Insertion Algorithm P.A. Sherar, C.P. Thompson, B. Xu, B. Zhong Abstract A new method is presented to deal with shape optimization problems.

More information

C ICASE INTERIM REPORT 17. Raja Das Joel Saltz Hwry Berryman. NASA Contract No. NAS May 1991

C ICASE INTERIM REPORT 17. Raja Das Joel Saltz Hwry Berryman. NASA Contract No. NAS May 1991 AD-A237 262 NASA Conr actor "Is Reit DTIC ls ELECTE C ICASE INTERIM REPORT 17 A MAN UAL FOR PARTI RUINIME PRTIV Revisi 1 Raja Das Joel Saltz Hwry Berryman. NASA Contract No. NAS1-18605 May 1991 INSTITUTE

More information

6. Parallel Volume Rendering Algorithms

6. Parallel Volume Rendering Algorithms 6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks

More information