PARTI Procedures for Realistic Loops
J. Saltz, R. Das, R. Ponnusamy, D. Mavriplis, H. Berryman and J. Wu
ICASE, NASA Langley Research Center, Hampton VA

Abstract

This paper describes a set of primitives (PARTI) developed to efficiently execute unstructured problems on distributed memory machines. These primitives have been incorporated into actual programs and kernels by hand. The primitives were designed so that the sequential code and the parallel code on each node would look identical. We have introduced a mechanism by which duplicate fetches of off-processor data between different loops are eliminated. This elimination of duplicate fetches appears to have a considerable impact on communications overheads.

1 Introduction

In many algorithms, data produced or input during a program's initialization plays a large role in determining the nature of the subsequent computation. When the data structures that define a computation have been initialized, a preprocessing phase follows. Vital elements of the strategy used by the rest of the algorithm are determined by this preprocessing phase. To effectively exploit many multiprocessor architectures, we may also have to carry out run-time preprocessing, which we refer to as runtime compilation [22]. The purpose of runtime compilation is not to determine which computations are to be performed but instead to determine how a multiprocessor machine will schedule the algorithm's work, how to map the data structures, and how data movement within the multiprocessor is to be scheduled. In distributed memory MIMD architectures, there is typically a non-trivial communications latency or startup cost. For efficiency, information to be transmitted should be collected into relatively large messages. The cost of fetching array elements can be reduced by precomputing what data each processor needs to send and to receive.
In irregular problems, such as solving PDEs on unstructured meshes and sparse matrix algorithms, the communications pattern depends on the input data. This typically arises from some level of indirection in the code, so it is not possible to predict at compile time what data must be prefetched. Only recently have methods been developed to integrate the kinds of runtime optimizations mentioned above into compilers and programming environments [22]. The lack of compile-time information is dealt with by transforming the original parallel loop into two constructs called an inspector and an executor [19]. During program execution, the inspector examines the data references made by a processor and calculates what off-processor data needs to be fetched and where that data will be stored once it is received. The executor loop then uses the information from the inspector to implement the actual computation.

We have developed a suite of primitives that can be used directly by programmers to generate inspector/executor pairs. The primitives incorporate a number of new insights we have had about sparse and unstructured computations, and they differ in a number of ways from those described earlier [22]. Our new primitives carry out preprocessing that makes it straightforward to produce parallelized loops that are virtually identical in form to the original sequential loops. The importance of this is that it becomes possible to generate the same quality object code on the nodes of the distributed memory machine as could be produced by the sequential program running on a single node. Our primitives make use of hash tables [14] to recognize and exploit situations in which a single off-processor distributed array reference is used several times. In such situations, the primitives fetch only a single copy of each unique off-processor distributed array reference.
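The inspector/executor transformation described above can be sketched in a few lines. This is an illustrative Python sketch under assumed conventions (block distribution, a single reduction loop), not the PARTI interface; the names `inspector`, `executor`, and `owner` are hypothetical.

```python
# Conceptual inspector/executor sketch for a loop computing s += x[ia[i]],
# where x is block-distributed and the indirection array ia is known only
# at run time. Illustrative only -- not the PARTI API.

def owner(g, block):
    """Block distribution: global index g lives on processor g // block."""
    return g // block

def inspector(ia, my_rank, block):
    """Examine the references this processor will make, record each distinct
    off-processor index once (duplicate elimination), and rewrite ia into
    local terms for the executor."""
    fetch = []       # distinct off-processor global indices: the "schedule"
    slot = {}        # hash table: global index -> assigned buffer slot
    local_ia = []    # localized reference pattern
    for g in ia:
        if owner(g, block) == my_rank:
            local_ia.append(('local', g - my_rank * block))
        else:
            if g not in slot:               # seen before? fetch only once
                slot[g] = len(fetch)
                fetch.append(g)
            local_ia.append(('buf', slot[g]))
    return fetch, local_ia

def executor(x_local, buf, local_ia):
    """Run the actual computation; buf holds gathered off-processor copies,
    so the loop body performs no communication."""
    s = 0.0
    for kind, j in local_ia:
        s += x_local[j] if kind == 'local' else buf[j]
    return s

# Single-process demo: global x = [0, 10, 20, 30], block size 2, rank 0.
x_global = [0.0, 10.0, 20.0, 30.0]
fetch, local_ia = inspector([0, 3, 3, 1], my_rank=0, block=2)
buf = [x_global[g] for g in fetch]   # stands in for the communicated gather
print(fetch, executor(x_global[0:2], buf, local_ia))  # -> [3] 70.0
```

Note that index 3 is referenced twice but appears in the schedule only once; this is the duplicate elimination the hash table provides.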
1.1 Distributed Data Access using PARTI

These primitives are named PARTI (Parallel Automated Runtime Toolkit at ICASE) [7], [5]; they carry out the distribution and retrieval of globally indexed but irregularly distributed data sets over the numerous local processor memories. Each inspector produces a schedule, which is essentially a pattern of communication for gathering or scattering data. In order to avoid duplicate data accesses, a list of off-processor data references is stored locally (for each processor) in a hash table [14], [28]. For each new off-processor data reference required, a quick search through the hash table is performed to determine whether this reference has already been accessed. If the reference has not previously been accessed, it is stored in the hash table; otherwise it is discarded. The primitives thus fetch only a single copy of each unique off-processor distributed data reference. This idea has also been extended to allow us to produce incremental schedules. For example, if two loops require different but overlapping data references, then by simply preserving the hash table formed during the generation of the schedule for the first loop, we may generate a schedule for the second loop that obtains only those off-processor elements which have not been previously encountered.

1.2 Distributed Translation Tables

In distributed memory machines, large data arrays need to be partitioned between the local memories of processors. These partitioned data arrays are called distributed arrays. Long term storage of distributed array data is assigned to specific memory locations in the distributed machine. Each element in a distributed array is assigned to a particular processor, and in order for another processor to be able to access a given element of the array, we must know the processor in which it resides and its local address in that processor's memory.
We thus build a translation table which, for each array element, lists the host processor and address. For a one-dimensional array of N elements, the translation table also contains N elements, and therefore must itself be distributed over the local memories of the processors. This is accomplished by putting the first N/NP elements on the first processor, the second N/NP elements on the second processor, and so on, where NP is the number of processors. Thus, if we are required to access the mth element of the array, we look up its address in the distributed translation table, whose entry for m we know can be found on processor ((m-1)/(N/NP)) + 1, using integer division. One of the PARTI primitives handles initialization of distributed translation tables, and other primitives are used to access them. In Section 2.2, we will give examples of PARTI procedures that initialize and access distributed translation tables. The PARTI primitives have been used to solve a variety of realistic applications such as a 3-D unstructured mesh multigrid Euler solver [18]. PARTI has also been distributed to a variety of universities and laboratories.

2 The PARTI Primitives

In this section we present a running example in order to illustrate the way in which PARTI procedure calls are used and to describe the optimizations carried out by these procedures. In Figure 1, we depict a set of loops which roughly mimics loops frequently encountered in unstructured mesh fluids codes. Loop L1 sweeps over the edges of a mesh. The mesh edges may define a three dimensional object such as an aircraft. The reference pattern is determined by the integer array edge_list. Note that indirection appears in S1 and S2 on both the left and the right sides of the expressions. Loop L2 sweeps over a set of faces. These faces may define the surface of a three dimensional object. The reference pattern is determined by the integer array face_list. Indirection again appears on both the left and the right sides of expressions S3 and S4.
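Looking back at Section 1.2, the placement rule for translation-table entries (the first N/NP entries on processor 1, the next N/NP on processor 2, and so on) can be sketched as follows. This is an illustrative Python sketch assuming 1-based indexing and N divisible by NP; the function name is hypothetical.

```python
# Sketch of the distributed translation table layout: which processor holds
# the table entry for global element m, and at what local offset.
# Illustrative only -- assumes 1-based indices and N divisible by NP.

def translation_entry_home(m, N, NP):
    """Return (processor, local offset), both 1-based, for element m's
    translation-table entry under the block layout described in the text."""
    block = N // NP                  # N/NP entries per processor
    proc = (m - 1) // block + 1      # ((m-1)/(N/NP)) + 1, integer division
    offset = (m - 1) % block + 1     # position within that processor's block
    return proc, offset

# N = 8 elements over NP = 4 processors: entries 1-2 on processor 1,
# 3-4 on processor 2, etc.
print(translation_entry_home(3, 8, 4))  # -> (2, 1)
```

A lookup for element m thus costs one message to that processor, which replies with the true owner and local address recorded in the table.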
Note also that both loop L1 and loop L2 read from array x and that neither loop writes to x.

L1    do i = 1, n_edge
         n1 = edge_list(i)
         n2 = edge_list(n_edge+i)
S1       y(n1) = y(n1) + ... x(n1) ... x(n2)
S2       y(n2) = y(n2) + ... x(n1) ... x(n2)
      end do

L2    do i = 1, n_face
         m1 = face_list(i)
         m2 = face_list(n_face+i)
         m3 = face_list(2*n_face+i)
         m4 = face_list(3*n_face+i)
S3       y(m1) = y(m1) + ... x(m1) ... x(m2) ... x(m3) ... x(m4)
S4       y(m2) = y(m2) + ... x(m1) ... x(m2) ... x(m3) ... x(m4)
      end do

Figure 1: Example Code

2.1 PARTI Executor

Figure 2 depicts the executor code with embedded Fortran-callable PARTI procedures dfmgather, dfscatter_add and dfscatter_addnc. Before this code is run, we have to carry out a preprocessing phase, to be described in Section 2.2. The arrays x and y are partitioned between processors; each processor is responsible for the long term storage of specified elements of each of these arrays. The way in which x and y are to be partitioned between processors is determined by the inspector. In this example, elements of x and y are partitioned between processors in exactly the same way. Each processor is responsible for n_on_proc elements of x and y. It should be noted that, except for the procedure calls, the structure of the loops in Figure 2 is identical to that of the loops in Figure 1. In Figure 2, we again use arrays named x and y, but they now represent arrays defined on a single processor of a distributed memory multiprocessor. On each processor P, arrays x and y are declared to be larger than would be needed to store the number of array elements for which P is responsible. We store copies of off-processor array elements beginning with local array elements x(n_on_proc+1) and y(n_on_proc+1). The PARTI subroutine calls depicted in Figure 2 move data between processors using a precomputed communication pattern. The communication pattern is specified by either a single schedule or by an array of schedules. dfmgather uses communication schedules to fetch off-processor data that will be needed either by loop L1 or by loop L2. The schedules specify the locations in distributed memory from which data is to be obtained. In Figure 2, off-processor data is obtained from array x defined on each processor. Copies of the off-processor data are placed in a buffer area beginning with x(n_on_proc+1). The PARTI procedures dfscatter_add and dfscatter_addnc, in statements S2 and S3 of Figure 2, accumulate data to off-processor memory locations. Both obtain the data to be accumulated to off-processor locations from a buffer area that begins with y(n_on_proc+1). Off-processor data is accumulated to locations of y between indices 1 and n_on_proc. The distinction between dfscatter_add and dfscatter_addnc will be described in Section 2.3.
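The multiple-schedule gather described above can be sketched as follows. This is an illustrative Python sketch in the spirit of dfmgather, not the Fortran interface; the function name and data layout are assumptions.

```python
# Sketch of a multiple-schedule gather: off-processor copies for several
# schedules are placed back to back in the buffer region that begins just
# past the locally owned elements. Illustrative only -- not the PARTI API.

def multi_gather(schedules, x_remote, x):
    """Copy the elements named by each schedule, in order, into the buffer
    area x[n_on_proc:], so every loop finds its off-processor data there."""
    n_on_proc = len(x) - sum(len(s) for s in schedules)
    pos = n_on_proc
    for sched in schedules:
        for g in sched:
            x[pos] = x_remote[g]   # stands in for an interprocessor fetch
            pos += 1
    return x

# Two schedules (say, one per loop) totalling three off-processor elements,
# appended after the two locally owned elements of x.
x_remote = {4: 40.0, 6: 60.0, 9: 90.0}
x = [1.0, 2.0, 0.0, 0.0, 0.0]
print(multi_gather([[4, 9], [6]], x_remote, x))  # -> [1.0, 2.0, 40.0, 90.0, 60.0]
```

Because the copies land at known consecutive positions, the localized index lists produced by the inspector can refer to them exactly as if they were ordinary local elements.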
In Figure 2, several data items may be accumulated to a given off-processor location in loop L1 or in loop L2.

2.2 PARTI Inspector

In this section, we outline how we carry out the preprocessing needed to generate the arguments required by the code in Figure 2. This preprocessing is depicted in Figure 3. The way in which the nodes of an irregular mesh are numbered frequently does not have a useful correspondence to the connectivity pattern of the mesh. When we partition such a mesh in a way that minimizes interprocessor communication, we may need to be able to assign arbitrary mesh points to each processor. The PARTI procedure ifbuild_translation_table (S1 in Figure 3) allows us to map a globally indexed distributed array onto processors in an arbitrary fashion. Each processor passes ifbuild_translation_table a list of the array elements for which it will be responsible (myvals in S1, Figure 3). If a given processor needs to obtain a datum that corresponds to a particular global index i for a specific distributed array, the processor can consult the distributed translation table (Section 1.2) to find the location of that datum in distributed memory.

      real*8 x(n_on_proc + n_off_proc)
      real*8 y(n_on_proc + n_off_proc)

S1    call dfmgather(sched_array, 2, x(n_on_proc+1), x)

L1    do i = 1, n_edge
         n1 = local_edge_list(i)
         n2 = local_edge_list(n_edge+i)
         y(n1) = y(n1) + ... x(n1) ... x(n2)
         y(n2) = y(n2) + ... x(n1) ... x(n2)
      end do

S2    call dfscatter_add(edge_sched, y(n_on_proc+1), y)

L2    do i = 1, n_face
         m1 = local_face_list(i)
         m2 = local_face_list(n_face+i)
         m3 = local_face_list(2*n_face+i)
         m4 = local_face_list(3*n_face+i)
         y(m1) = y(m1) + ... x(m1) ... x(m2) ... x(m3) ... x(m4)
         y(m2) = y(m2) + ... x(m1) ... x(m2) ... x(m3) ... x(m4)
      end do

S3    call dfscatter_addnc(face_sched, y(n_on_proc+1), buffer_mapping, y)

Figure 2: Parallelized Code for Each Processor

S1    translation_table = ifbuild_translation_table(1, myvals, n_on_proc)

S2    call flocalize(translation_table, edge_sched, edge_list,
     &     local_edge_list, 2*n_edge, n_off_proc)

S3    sched_array(1) = edge_sched

S4    call fmlocalize(translation_table, face_sched,
     &     incremental_face_sched, face_list, local_face_list,
     &     4*n_face, n_off_proc_face, n_new_off_proc_face,
     &     buffer_mapping, 1, sched_array)

S5    sched_array(2) = incremental_face_sched

Figure 3: Inspector Code for Each Processor

The PARTI procedures flocalize and fmlocalize carry out the bulk of the preprocessing needed to produce the executor code depicted in Figure 2. We will first describe flocalize (S2 in Figure 3). On each processor, flocalize is passed:

1. a pointer to a distributed translation table (translation_table in S2),
2. a list of globally indexed distributed array references (edge_list in S2), and
3. the number of globally indexed distributed array references (2*n_edge in S2).

Flocalize returns:

1. a schedule that can be used in PARTI gather and scatter procedures (edge_sched in S2),
2. a list of integers that can be used to specify the pattern of indirection in the executor code (local_edge_list in S2), and
3.
the number of distinct off-processor references found in edge_list (n_off_proc in S2).

There are a variety of situations in which the same data need to be accessed by multiple loops (Section 1.1). In Figure 1, no assignments to x are carried out. At the beginning of Figure 2, each processor
can gather a single copy of every distinct off-processor value of x referenced by loops L1 or L2. The PARTI procedure fmlocalize (S4 in Figure 3) makes it simple to remove these duplicate references: it obtains only those off-processor data not requested by a given set of pre-existing schedules. The procedure dfmgather in the executor in Figure 2 obtains off-processor data using two schedules: edge_sched, produced by flocalize (S2, Figure 3), and incremental_face_sched, produced by fmlocalize (S4, Figure 3). To review the work carried out by fmlocalize, we will summarize the significance of all but one of the arguments of this PARTI procedure. On each processor, fmlocalize is passed:

1. a pointer to a distributed translation table (translation_table in S4),
2. a list of globally indexed distributed array references (face_list in S4),
3. the number of globally indexed distributed array references (4*n_face in S4),
4. the number of pre-existing schedules that need to be taken into account when removing duplicates (1 in S4), and
5. an array of pointers to pre-existing schedules (sched_array in S4).

Fmlocalize returns:

1. a schedule that can be used in PARTI gather and scatter procedures; this schedule does not take any pre-existing schedules into account (face_sched in S4),
2. an incremental schedule that includes only off-processor data accesses not included in the pre-existing schedules (incremental_face_sched in S4),
3. a list of integers that can be used to specify the pattern of indirection in the executor code (local_face_list in S4),
4. the number of distinct off-processor references in face_list (n_off_proc_face in S4),
5. the number of distinct off-processor references not encountered in any other schedule (n_new_off_proc_face in S4), and
6. buffer_mapping, to be discussed in Section 2.3.

2.3 A Return to the Executor

We have already discussed dfmgather in Section 2.1, but we have not yet said anything about the distinction between dfscatter_add and dfscatter_addnc.
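The incremental-schedule behavior that fmlocalize provides can be sketched as follows. This is an illustrative Python sketch of the idea only (a preserved hash table filters a second loop's references); the function name and the set standing in for the hash table are assumptions, not the PARTI implementation.

```python
# Sketch of incremental schedule generation: by preserving the hash table
# of off-processor references recorded for the first loop, the schedule for
# a second loop fetches only elements not already requested.
# Illustrative only -- not the PARTI API.

def build_schedule(refs, seen):
    """Append each off-processor reference to the new schedule only if it is
    absent from `seen` (the hash table preserved across loops)."""
    sched = []
    for g in refs:
        if g not in seen:
            seen.add(g)
            sched.append(g)
    return sched

seen = set()                                        # preserved across loops
edge_sched = build_schedule([7, 9, 7, 12], seen)    # first loop's off-proc refs
face_sched = build_schedule([9, 12, 15], seen)      # second loop: only 15 is new
print(edge_sched, face_sched)  # -> [7, 9, 12] [15]
```

The second schedule is the incremental one: elements 9 and 12 are already resident from the first gather, so only element 15 is communicated again.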
When we make use of incremental schedules, we assign a single buffer location to each off-processor distributed array element. In our example, we carry out separate off-processor accumulations after loops L1 and L2. As we describe below, in this situation our off-processor accumulation procedures may no longer reference consecutive elements of a buffer. In S2, Figure 2, we can assign copies of distinct off-processor elements of y to buffer locations. We can then use a schedule (edge_sched) to specify where in distributed memory each consecutive value in the buffer is to be accumulated. The PARTI procedure dfscatter_add can be employed; this procedure uses schedule edge_sched to accumulate, to off-processor locations, consecutive buffer locations beginning with y(n_on_proc+1). When we get to L2, some of the off-processor copies may already be associated with buffer locations. Consequently, in S3, Figure 2, our schedule (face_sched) must access buffer locations in an irregular manner. The pattern of buffer locations accessed is specified by the integer array buffer_mapping passed to dfscatter_addnc in S3, Figure 2. (dfscatter_addnc stands for dfscatter_add non-contiguous.)

3 Status of PARTI Primitives

The PARTI procedures described in this paper have been used to port a 3-D unstructured mesh Euler solver [18]. This work has spurred many improvements in the optimizations carried out by our primitives. In Mavriplis' Euler code, we have seen a reduction in communication time on a problem solved on a 53,921-node grid from 288 seconds to 82 seconds on 32 processors of an Intel iPSC/860 (compared to a computation time of 151 seconds). Approximately a factor of two improvement in communication overhead was seen when we employed incremental schedules. (Recall from Section 2.2 that the use of incremental schedules allows us to obtain only those non-updated off-processor elements which have not been previously encountered in earlier loops.)
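The contrast between the two accumulation procedures described above can be sketched as follows. This is an illustrative Python sketch, not the Fortran interface: the contiguous variant drains consecutive buffer slots, while the non-contiguous variant takes an explicit mapping because, with incremental schedules, a loop's off-processor copies need not occupy consecutive buffer locations. Function names mirror the text but the signatures are assumptions.

```python
# Sketch of dfscatter_add (contiguous buffer) vs dfscatter_addnc
# (non-contiguous buffer via an explicit mapping). Illustrative only.

def scatter_add(sched, buf, y):
    """Accumulate consecutive buffer entries buf[0], buf[1], ... into the
    local y locations named by the schedule."""
    for k, dest in enumerate(sched):
        y[dest] += buf[k]

def scatter_addnc(sched, buf, mapping, y):
    """Same accumulation, but entry k of the schedule draws from the
    possibly non-consecutive buffer slot mapping[k]."""
    for k, dest in enumerate(sched):
        y[dest] += buf[mapping[k]]

y = [0.0, 0.0, 0.0]
scatter_add([2, 0], [5.0, 1.0], y)            # y[2] += 5, y[0] += 1
scatter_addnc([1], [5.0, 1.0, 7.0], [2], y)   # y[1] += buf[2]
print(y)  # -> [1.0, 7.0, 5.0]
```

In the executor of Figure 2, buf corresponds to the region of y starting at y(n_on_proc+1), and the accumulations into local y stand in for the interprocessor scatter-with-add.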
In this code, the total cost of all preprocessing was less than 2 seconds. The computational rate of this mesh run on 32 processors of an iPSC/860 was 54 Mflops, while the same code run on 64 processors of an iPSC/860 ran at 87 Mflops. The unstructured mesh was partitioned by the method described in [25].
Since the form of the sequential code and the parallelized code was virtually identical, we did not expect the parallelization process to introduce any new inefficiencies beyond those exacted by the preprocessing and by the calls to the primitives. On a smaller problem, we compared the parallel code running on a single node with the sequential code and found only a 2 percent performance degradation.

4 PARTI Compiler

We have developed a prototype compiler which takes as input a Fortran 77 program enhanced with specifications for distributing data [28]. The compiler outputs a distributed memory program with embedded PARTI procedures. The PARTI procedures embedded by this compiler were an earlier version of the procedures described here; those procedures are described in [22]. One of the inputs that must be supplied to this compiler-generated program is information about how arrays and loop iterations are to be distributed between processors. This compiler allows arrays and loop iterations to be partitioned in an arbitrary manner; this flexibility is of practical importance in unstructured and sparse computations [23], [5]. The PARTI compiler [28] employed a set of language extensions designed to specify regular and irregular array distributions. A set of Fortran 77 language extensions (Fortran D) has been proposed [11]; these extensions subsume the language extensions described in [28]. This compiler was tested on several NASA kernels, and the performance of the resulting codes was benchmarked on the iPSC/860 [22]. We are currently constructing a Parascope-based compiler that will embed the version of PARTI discussed in this paper. This new compiler will be an extension of a Parascope-based distributed memory compiler targeted towards regular problems, described in [13]. This new compiler is being designed to incorporate partitioners in an integral manner.
Customized partitioners decompose problems based on a programmer's understanding of the computationally important dependency relations. For instance, unstructured mesh Euler or Navier-Stokes solvers contain a variety of loops over either mesh edges or tetrahedra. Each loop may exhibit a different data dependency pattern. A programmer who is familiar with a given application will generally know which portions of a code need to be taken into account when calculating data partitions; the need for such insight may or may not be reduced when we link partitioners to compilers. We plan to develop directives to allow users to specify which loops are to be taken into account when determining array partitioning. Note that the user does not specify a partition directly. Partitioners such as those described in [9], [10] and [25] can partition arrays based on connectivity graphs that originate from loop dependence relations. We intend to design a compiler that is able to embed a PARTI primitive designed to translate execution-time loop dependency relations into a distributed representation of a connectivity graph. A partitioner coupled to PARTI will be written so that it inputs this connectivity graph information. After partitioning is completed, information concerning the chosen partitioning will be returned and a distributed translation table will be initialized. This mechanism of linking a runtime partitioner to a compiler was initially outlined in [19] and is closely related to [16].

5 Relation to Other Work

Programs designed to carry out a range of irregular computations, including sparse direct and iterative methods, require many of the optimizations described in this paper. Some examples of such programs are described in [2], [17], [4], [27] and [12]. Several researchers have developed programming environments that are targeted towards particular classes of irregular or adaptive problems.
Williams [27] describes a programming environment (DIME) for calculations with unstructured triangular meshes using distributed memory machines. Baden [3] has developed a programming environment targeted towards particle computations; this programming environment provides facilities that support dynamic load balancing. There are a variety of compiler projects targeted at distributed memory multiprocessors [29], [6], [21], [20], [1], [26]. With the exception of the Kali project [15] and the PARTI work described here and in [24], [19], and [23], these compilers do not attempt to efficiently deal with loops that arise in sparse or unstructured scientific computations. The PARTI runtime support procedures and the compiler described in this paper are qualitatively different from the efforts cited above in a number of important respects. We have developed and demonstrated mechanisms that allow us to support irregularly distributed arrays. Irregularly distributed arrays must be supported to make it possible to map data and computational work in an arbitrary manner. Support for arbitrary distributions was proposed in [19] and
[23], but to our knowledge, this is the first implementation of a compiler-based distributed translation table mechanism for irregular scientific problems. We find that many unstructured NASA codes must carry out data accumulations to off-processor memory locations. We chose one of our kernels to demonstrate this, and designed our primitives and compiler to be able to handle this situation. To our knowledge, our compiler effort is unique in its ability to efficiently carry out irregular patterns of off-processor data accumulations. We augment our primitives with a hash table designed to eliminate duplicate data accesses. Other researchers have used different data structures for management of off-processor data copies [15]. We have also developed a mechanism for producing incremental schedules. The use of incremental schedules allows us to obtain only those non-updated off-processor elements which have not been previously encountered in earlier loops.

6 Summary and Conclusions

We have shown that PARTI primitives can be used to port actual unstructured code to distributed memory machines. These primitives are highly optimized and require very little overhead. Duplicate off-processor data access is removed using hash tables during the formation of both total schedules and incremental schedules. Primitives for inter-processor data movement using multiple schedules have been presented, and using these reduces the overall data transfer. A compiler has been implemented which takes in extended Fortran 77 and produces node code with embedded PARTI primitives to be run on the Intel iPSC/860. The PARTI primitives are available for public distribution and can be obtained from netlib or from the anonymous ftp site ra.cs.yale.edu.

Acknowledgements

The authors would like to thank Horst Simon for the use of his unstructured mesh partitioning software and Venkatakrishnan for useful suggestions for low level communications scheduling.
We would also like to acknowledge support from NASA contract NAS while the authors were in residence at ICASE, NASA Langley Research Center, along with support from NSF grant ASC for authors Saltz and Berryman.

References

[1] F. André, J.-L. Pazat, and H. Thomas. PANDORE: A system to manage data distribution. In International Conference on Supercomputing, June.
[2] C. Ashcraft, S. C. Eisenstat, and J. W. H. Liu. A fan-in algorithm for distributed sparse numerical factorization. SISSC, 11(3).
[3] S. Baden. Programming abstractions for dynamically partitioning and coordinating localized scientific calculations running on multiprocessors. To appear, SIAM J. Sci. and Stat. Computation.
[4] D. Baxter, J. Saltz, M. Schultz, S. Eisenstat, and K. Crowley. An experimental study of methods for parallel preconditioned Krylov methods. In Proceedings of the 1988 Hypercube Multiprocessor Conference, Pasadena, CA, pages 1698-1711, January.
[5] H. Berryman, J. Saltz, and J. Scroggs. Execution time support for adaptive scientific algorithms on distributed memory machines. To appear in Concurrency: Practice and Experience. Report 90-41, ICASE, May 1990.
[6] A. Cheung and A. P. Reeves. The Paragon multicomputer environment: A first implementation. Technical Report EE-CEG-89-9, Cornell University Computer Engineering Group, Cornell University School of Electrical Engineering, July 1989.
[7] R. Das, J. Saltz, and H. Berryman. A manual for PARTI runtime primitives, revision 1 (document and PARTI software available through netlib). Interim Report 91-17, ICASE, 1991.
[8] R. Das, J. Saltz, D. Mavriplis, J. Wu, and H. Berryman. Unstructured mesh problems, PARTI primitives and the ARF compiler. In Parallel Processing for Scientific Computation: Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, Houston, TX, April 1991.
[9] G. Fox. A graphical approach to load balancing and sparse matrix vector multiplication on the hypercube. In The IMA Volumes in Mathematics and its Applications, Volume 13: Numerical Algorithms for Modern Parallel Computer Architectures, Martin Schultz, Editor. Springer-Verlag.
[10] G. Fox. A review of automatic load balancing and decomposition methods for the hypercube. In The IMA Volumes in Mathematics and its Applications, Volume 13: Numerical Algorithms for Modern Parallel Computer Architectures, Martin Schultz, Editor. Springer-Verlag.
[11] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu. Fortran D language specification. Rice COMP TR90-141, Department of Computer Science, Rice University, December 1990.
[12] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors. Prentice-Hall, Englewood Cliffs, New Jersey.
[13] S. Hiranandani, K. Kennedy, and C. Tseng. Compiler support for machine-independent parallel programming in Fortran D. In Compilers and Runtime Software for Scalable Multiprocessors, J. Saltz and P. Mehrotra, Editors. Elsevier, Amsterdam, The Netherlands, to appear.
[14] S. Hiranandani, J. Saltz, P. Mehrotra, and H. Berryman. Performance of hashed cache data migration schemes on multicomputers. Journal of Parallel and Distributed Computing, 12, to appear, August.
[15] C. Koelbel, P. Mehrotra, and J. Van Rosendale. Supporting shared data structures on distributed memory architectures. In 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM SIGPLAN, March.
[16] M. Lam and M. C. Rinard. Coarse grain parallel programming in Jade. In Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Williamsburg, VA. ACM Press.
[17] J. W. Liu. Computational models and task scheduling for parallel sparse Cholesky factorization. Parallel Computing, 3.
[18] D. J. Mavriplis. Three dimensional unstructured multigrid for the Euler equations. In AIAA 10th Computational Fluid Dynamics Conference, June.
[19] R. Mirchandaney, J. H. Saltz, R. M. Smith, D. M. Nicol, and K. Crowley. Principles of runtime support for parallel processors. In Proceedings of the 1988 ACM International Conference on Supercomputing, St. Malo, France, July.
[20] A. Rogers and K. Pingali. Process decomposition through locality of reference. In Conference on Programming Language Design and Implementation. ACM SIGPLAN, June.
[21] M. Rosing, R. W. Schnabel, and R. P. Weaver. Expressing complex parallel algorithms in DINO. In Proceedings of the 4th Conference on Hypercubes, Concurrent Computers and Applications.
[22] J. Saltz, H. Berryman, and J. Wu. Runtime compilation for multiprocessors. To appear, Concurrency: Practice and Experience. Report 90-59, ICASE, 1990.
[23] J. Saltz, K. Crowley, R. Mirchandaney, and H. Berryman. Run-time scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, 8.
[24] J. Saltz and M. C. Chen. Automated problem mapping: the Crystal runtime system. In Proceedings of the Hypercube Microprocessors Conference, Knoxville, TN, September.
[25] H. Simon. Partitioning of unstructured mesh problems for parallel processing. In Proceedings of the Conference on Parallel Methods on Large Scale Structural Analysis and Physics Applications. Pergamon Press.
[26] P. S. Tseng. A Parallelizing Compiler for Distributed Memory Parallel Computers. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, May.
[27] R. D. Williams and R. Glowinski. Distributed irregular finite elements. Technical Report C3P 715, Caltech Concurrent Computation Program, February.
[28] J. Wu, J. Saltz, S. Hiranandani, and H. Berryman. Runtime compilation for multicomputers. In Proceedings of the ICPP.
[29] H. Zima, H. Bast, and M. Gerndt. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing, 6:1-18.
More informationy(b)-- Y[a,b]y(a). EQUATIONS ON AN INTEL HYPERCUBE*
SIAM J. ScI. STAT. COMPUT. Vol. 12, No. 6, pp. 1480-1485, November 1991 ()1991 Society for Industrial and Applied Mathematics 015 SOLUTION OF LINEAR SYSTEMS OF ORDINARY DIFFERENTIAL EQUATIONS ON AN INTEL
More informationImproving Performance of Sparse Matrix-Vector Multiplication
Improving Performance of Sparse Matrix-Vector Multiplication Ali Pınar Michael T. Heath Department of Computer Science and Center of Simulation of Advanced Rockets University of Illinois at Urbana-Champaign
More informationStrategies for Parallelizing a Navier-Stokes Code on the Intel Touchstone Machines
Strategies for Parallelizing a Navier-Stokes Code on the Intel Touchstone Machines Jochem Häuser European Space Agency and Roy Williams California Institute of Technology Abstract The purpose of this paper
More informationCHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song
CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed
More informationtask object task queue
Optimizations for Parallel Computing Using Data Access Information Martin C. Rinard Department of Computer Science University of California, Santa Barbara Santa Barbara, California 9316 martin@cs.ucsb.edu
More informationCompiling FORTRAN for Massively Parallel Architectures. Peter Brezany. University of Vienna
Compiling FORTRAN for Massively Parallel Architectures Peter Brezany University of Vienna Institute for Software Technology and Parallel Systems Brunnerstrasse 72, A-1210 Vienna, Austria 1 Introduction
More informationWei Shu and Min-You Wu. Abstract. partitioning patterns, and communication optimization to achieve a speedup.
Sparse Implementation of Revised Simplex Algorithms on Parallel Computers Wei Shu and Min-You Wu Abstract Parallelizing sparse simplex algorithms is one of the most challenging problems. Because of very
More informationStatic and Runtime Algorithms for All-to-Many Personalized Communication on Permutation Networks
Syracuse University SURFACE College of Engineering and Computer Science - Former Departments, Centers, Institutes and Projects College of Engineering and Computer Science 199 Static and Runtime Algorithms
More informationIdentifying Parallelism in Construction Operations of Cyclic Pointer-Linked Data Structures 1
Identifying Parallelism in Construction Operations of Cyclic Pointer-Linked Data Structures 1 Yuan-Shin Hwang Department of Computer Science National Taiwan Ocean University Keelung 20224 Taiwan shin@cs.ntou.edu.tw
More informationLow Latency Messages on Distributed Memory Multiprocessors
Low Latency Messages on Distributed Memory Multiprocessors MATT ROSING 1 AND JOEL SALTZ 2 1 Pacific Northwest Laboratory, Richland, WA 99352 2 University of Maryland ABSTRACT This article describes many
More informationA Test Suite for High-Performance Parallel Java
page 1 A Test Suite for High-Performance Parallel Java Jochem Häuser, Thorsten Ludewig, Roy D. Williams, Ralf Winkelmann, Torsten Gollnick, Sharon Brunett, Jean Muylaert presented at 5th National Symposium
More informationApplication Programmer. Vienna Fortran Out-of-Core Program
Mass Storage Support for a Parallelizing Compilation System b a Peter Brezany a, Thomas A. Mueck b, Erich Schikuta c Institute for Software Technology and Parallel Systems, University of Vienna, Liechtensteinstrasse
More informationARRAY DATA STRUCTURE
ARRAY DATA STRUCTURE Isha Batra, Divya Raheja Information Technology Dronacharya College Of Engineering, Farukhnagar,Gurgaon Abstract- In computer science, an array data structure or simply an array is
More informationI ICASE. NASA Dun U - N!i n'ral Anrnn,- Iir'j nnd! jvwn Admiri,;ftinj()r RUN-TIME PARALLELIZATION AND SCHEDULING OF LOOPS
( NASA Contractor Report 182039 ~ ICASE Report No. 90-34 I ICASE RUN-TIME PARALLELIZATION AND SCHEDULING OF LOOPS Joel H. Saltz Ravi Mirchandaney -' T '4 Kay Crowley JUN 2 6 1990 Contract No. NASI-18605
More informationFast Primitives for Irregular Computations on the NEC SX-4
To appear: Crosscuts 6 (4) Dec 1997 (http://www.cscs.ch/official/pubcrosscuts6-4.pdf) Fast Primitives for Irregular Computations on the NEC SX-4 J.F. Prins, University of North Carolina at Chapel Hill,
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationModelling and implementation of algorithms in applied mathematics using MPI
Modelling and implementation of algorithms in applied mathematics using MPI Lecture 1: Basics of Parallel Computing G. Rapin Brazil March 2011 Outline 1 Structure of Lecture 2 Introduction 3 Parallel Performance
More informationEcient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines
Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines B. B. Zhou, R. P. Brent and A. Tridgell Computer Sciences Laboratory The Australian National University Canberra,
More informationTiling Multidimensional Iteration Spaces for Multicomputers
1 Tiling Multidimensional Iteration Spaces for Multicomputers J. Ramanujam Dept. of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 080 901, USA. Email: jxr@max.ee.lsu.edu
More informationImage-Space-Parallel Direct Volume Rendering on a Cluster of PCs
Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr
More informationObject-oriented Design for Sparse Direct Solvers
NASA/CR-1999-208978 ICASE Report No. 99-2 Object-oriented Design for Sparse Direct Solvers Florin Dobrian Old Dominion University, Norfolk, Virginia Gary Kumfert and Alex Pothen Old Dominion University,
More informationAdaptive-Mesh-Refinement Pattern
Adaptive-Mesh-Refinement Pattern I. Problem Data-parallelism is exposed on a geometric mesh structure (either irregular or regular), where each point iteratively communicates with nearby neighboring points
More informationA Compiler for Parallel Finite Element Methods. with Domain-Decomposed Unstructured Meshes JONATHAN RICHARD SHEWCHUK AND OMAR GHATTAS
Contemporary Mathematics Volume 00, 0000 A Compiler for Parallel Finite Element Methods with Domain-Decomposed Unstructured Meshes JONATHAN RICHARD SHEWCHUK AND OMAR GHATTAS December 11, 1993 Abstract.
More informationUMIACS-TR December, CS-TR-3192 Revised April, William Pugh. Dept. of Computer Science. Univ. of Maryland, College Park, MD 20742
UMIACS-TR-93-133 December, 1992 CS-TR-3192 Revised April, 1993 Denitions of Dependence Distance William Pugh Institute for Advanced Computer Studies Dept. of Computer Science Univ. of Maryland, College
More informationTechnische Universitat Munchen. Institut fur Informatik. D Munchen.
Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationY. Han* B. Narahari** H-A. Choi** University of Kentucky. The George Washington University
Mapping a Chain Task to Chained Processors Y. Han* B. Narahari** H-A. Choi** *Department of Computer Science University of Kentucky Lexington, KY 40506 **Department of Electrical Engineering and Computer
More informationRuntime Support and Compilation Methods for User-Specified Irregular Data Distributions
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 6, NO. a, AUGUST 1995 al5 Runtime Support and Compilation Methods for User-Specified Irregular Data Distributions Ravi Ponnusamy, Joel Saltz,
More informationCOMMUNICATION IN HYPERCUBES
PARALLEL AND DISTRIBUTED ALGORITHMS BY DEBDEEP MUKHOPADHYAY AND ABHISHEK SOMANI http://cse.iitkgp.ac.in/~debdeep/courses_iitkgp/palgo/index.htm COMMUNICATION IN HYPERCUBES 2 1 OVERVIEW Parallel Sum (Reduction)
More informationLarge-scale Structural Analysis Using General Sparse Matrix Technique
Large-scale Structural Analysis Using General Sparse Matrix Technique Yuan-Sen Yang 1), Shang-Hsien Hsieh 1), Kuang-Wu Chou 1), and I-Chau Tsai 1) 1) Department of Civil Engineering, National Taiwan University,
More informationParallel Unstructured Mesh Generation by an Advancing Front Method
MASCOT04-IMACS/ISGG Workshop University of Florence, Italy Parallel Unstructured Mesh Generation by an Advancing Front Method Yasushi Ito, Alan M. Shih, Anil K. Erukala, and Bharat K. Soni Dept. of Mechanical
More informationDynamic Load Partitioning Strategies for Managing Data of Space and Time Heterogeneity in Parallel SAMR Applications
Dynamic Load Partitioning Strategies for Managing Data of Space and Time Heterogeneity in Parallel SAMR Applications Xiaolin Li and Manish Parashar The Applied Software Systems Laboratory Department of
More informationA NEW MIXED PRECONDITIONING METHOD BASED ON THE CLUSTERED ELEMENT -BY -ELEMENT PRECONDITIONERS
Contemporary Mathematics Volume 157, 1994 A NEW MIXED PRECONDITIONING METHOD BASED ON THE CLUSTERED ELEMENT -BY -ELEMENT PRECONDITIONERS T.E. Tezduyar, M. Behr, S.K. Aliabadi, S. Mittal and S.E. Ray ABSTRACT.
More informationDYNAMIC DATA DISTRIBUTIONS IN VIENNA FORTRAN. Hans Zima a. Institute for Software Technology and Parallel Systems,
DYNAMIC DATA DISTRIBUTIONS IN VIENNA FORTRAN Barbara Chapman a Piyush Mehrotra b Hans Moritsch a Hans Zima a a Institute for Software Technology and Parallel Systems, University of Vienna, Brunner Strasse
More informationCompiler and Runtime Support for Programming in Adaptive. Parallel Environments 1. Guy Edjlali, Gagan Agrawal, and Joel Saltz
Compiler and Runtime Support for Programming in Adaptive Parallel Environments 1 Guy Edjlali, Gagan Agrawal, Alan Sussman, Jim Humphries, and Joel Saltz UMIACS and Dept. of Computer Science University
More informationImplementation and Evaluation of Prefetching in the Intel Paragon Parallel File System
Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:
More informationAccelerated Library Framework for Hybrid-x86
Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit
More informationEvaluating the Performance of Software Distributed Shared Memory as a Target for Parallelizing Compilers
Evaluating the Performance of Software Distributed Shared Memory as a Target for Parallelizing Compilers Alan L. Cox y, Sandhya Dwarkadas z, Honghui Lu y and Willy Zwaenepoel y y Rice University Houston,
More informationNavier-Stokes Computations on Commodity Computers
Navier-Stokes Computations on Commodity Computers By Veer N. Vatsa NASA Langley Research Center, Hampton, VA v.n.vatsa@larc.nasa.gov And Thomas R. Faulkner MRJ Technology Solutions, Moffett Field, CA faulkner@nas.nasa.gov
More informationSeminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm
Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of
More informationTask Parallelism in a High Performance Fortran Framework
IEEE Parallel & Distributed Technology, Volume 2, Number 3, Fall, 1994, pp. 16-26 Task Parallelism in a High Performance Fortran Framework T. Gross, D. O Hallaron, and J. Subhlok School of Computer Science
More informationChapter 8 : Multiprocessors
Chapter 8 Multiprocessors 8.1 Characteristics of multiprocessors A multiprocessor system is an interconnection of two or more CPUs with memory and input-output equipment. The term processor in multiprocessor
More informationA High Performance Parallel Strassen Implementation. Brian Grayson. The University of Texas at Austin.
A High Performance Parallel Strassen Implementation Brian Grayson Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 787 bgrayson@pineeceutexasedu Ajay Pankaj
More informationMultiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University
A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor
More informationAutomatic Counterflow Pipeline Synthesis
Automatic Counterflow Pipeline Synthesis Bruce R. Childers, Jack W. Davidson Computer Science Department University of Virginia Charlottesville, Virginia 22901 {brc2m, jwd}@cs.virginia.edu Abstract The
More informationCase Studies on Cache Performance and Optimization of Programs with Unit Strides
SOFTWARE PRACTICE AND EXPERIENCE, VOL. 27(2), 167 172 (FEBRUARY 1997) Case Studies on Cache Performance and Optimization of Programs with Unit Strides pei-chi wu and kuo-chan huang Department of Computer
More informationIntroduction to Multigrid and its Parallelization
Introduction to Multigrid and its Parallelization! Thomas D. Economon Lecture 14a May 28, 2014 Announcements 2 HW 1 & 2 have been returned. Any questions? Final projects are due June 11, 5 pm. If you are
More informationThe Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor
IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994, pp. 573-584.. The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor David J.
More informationTarek S. Abdelrahman and Thomas N. Wong. University oftoronto. Toronto, Ontario, M5S 1A4. Canada
Distributed Array Data Management on NUMA Multiprocessors Tarek S. Abdelrahman and Thomas N. Wong Department of Electrical and Computer Engineering University oftoronto Toronto, Ontario, M5S 1A Canada
More informationTransactions on Information and Communications Technologies vol 3, 1993 WIT Press, ISSN
The implementation of a general purpose FORTRAN harness for an arbitrary network of transputers for computational fluid dynamics J. Mushtaq, A.J. Davies D.J. Morgan ABSTRACT Many Computational Fluid Dynamics
More informationSystem Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries
System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries Yevgen Voronenko, Franz Franchetti, Frédéric de Mesmay, and Markus Püschel Department of Electrical and Computer
More informationData Partitioning. Figure 1-31: Communication Topologies. Regular Partitions
Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy
More informationExploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors
Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,
More informationControl Flow Analysis with SAT Solvers
Control Flow Analysis with SAT Solvers Steven Lyde, Matthew Might University of Utah, Salt Lake City, Utah, USA Abstract. Control flow analyses statically determine the control flow of programs. This is
More informationImproving Locality For Adaptive Irregular Scientific Codes
Improving Locality For Adaptive Irregular Scientific Codes Hwansoo Han, Chau-Wen Tseng Department of Computer Science University of Maryland College Park, MD 7 fhshan, tsengg@cs.umd.edu Abstract Irregular
More information6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP
LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃLPHÃIRUÃDÃSDFHLPH $GDSWLYHÃURFHVVLQJÃ$OJRULWKPÃRQÃDÃDUDOOHOÃ(PEHGGHG \VWHP Jack M. West and John K. Antonio Department of Computer Science, P.O. Box, Texas Tech University,
More informationThe Architecture of a Homogeneous Vector Supercomputer
The Architecture of a Homogeneous Vector Supercomputer John L. Gustafson, Stuart Hawkinson, and Ken Scott Floating Point Systems, Inc. Beaverton, Oregon 97005 Abstract A new homogeneous computer architecture
More informationAdvanced Topics UNIT 2 PERFORMANCE EVALUATIONS
Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors
More informationMultigrid Solvers in CFD. David Emerson. Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK
Multigrid Solvers in CFD David Emerson Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK david.emerson@stfc.ac.uk 1 Outline Multigrid: general comments Incompressible
More informationData Access Reorganizations in Compiling Out-of-Core Data Parallel Programs on Distributed Memory Machines
1063 7133/97 $10 1997 IEEE Proceedings of the 11th International Parallel Processing Symposium (IPPS '97) 1063-7133/97 $10 1997 IEEE Data Access Reorganizations in Compiling Out-of-Core Data Parallel Programs
More informationPIPELINE AND VECTOR PROCESSING
PIPELINE AND VECTOR PROCESSING PIPELINING: Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates
More informationpc++/streams: a Library for I/O on Complex Distributed Data-Structures
pc++/streams: a Library for I/O on Complex Distributed Data-Structures Jacob Gotwals Suresh Srinivas Dennis Gannon Department of Computer Science, Lindley Hall 215, Indiana University, Bloomington, IN
More informationParallel Implementations of Gaussian Elimination
s of Western Michigan University vasilije.perovic@wmich.edu January 27, 2012 CS 6260: in Parallel Linear systems of equations General form of a linear system of equations is given by a 11 x 1 + + a 1n
More informationScalable GPU Graph Traversal!
Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang
More informationSupport for Distributed Dynamic Data Structures in C++ Chialin Chang Alan Sussman. Joel Saltz. University of Maryland, College Park, MD 20742
Support for Distributed Dynamic Data Structures in C++ Chialin Chang Alan Sussman Joel Saltz Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College Park,
More informationADR and DataCutter. Sergey Koren CMSC818S. Thursday March 4 th, 2004
ADR and DataCutter Sergey Koren CMSC818S Thursday March 4 th, 2004 Active Data Repository Used for building parallel databases from multidimensional data sets Integrates storage, retrieval, and processing
More informationSpace-filling curves for 2-simplicial meshes created with bisections and reflections
Space-filling curves for 2-simplicial meshes created with bisections and reflections Dr. Joseph M. Maubach Department of Mathematics Eindhoven University of Technology Eindhoven, The Netherlands j.m.l.maubach@tue.nl
More informationProgramming as Successive Refinement. Partitioning for Performance
Programming as Successive Refinement Not all issues dealt with up front Partitioning often independent of architecture, and done first View machine as a collection of communicating processors balancing
More informationRun-time Reordering. Transformation 2. Important Irregular Science & Engineering Applications. Lackluster Performance in Irregular Applications
Reordering s Important Irregular Science & Engineering Applications Molecular Dynamics Finite Element Analysis Michelle Mills Strout December 5, 2005 Sparse Matrix Computations 2 Lacluster Performance
More informationSemi-automatic domain decomposition based on potential theory
Semi-automatic domain decomposition based on potential theory S.P. Spekreijse and J.C. Kok Nationaal Lucht- en Ruimtevaartlaboratorium National Aerospace Laboratory NLR Semi-automatic domain decomposition
More informationA Performance Study of Parallel FFT in Clos and Mesh Networks
A Performance Study of Parallel FFT in Clos and Mesh Networks Rajkumar Kettimuthu 1 and Sankara Muthukrishnan 2 1 Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439,
More informationOptimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology
Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:
More informationTransactions on Information and Communications Technologies vol 9, 1995 WIT Press, ISSN
Parallelization of software for coastal hydraulic simulations for distributed memory parallel computers using FORGE 90 Z.W. Song, D. Roose, C.S. Yu, J. Berlamont B-3001 Heverlee, Belgium 2, Abstract Due
More informationPASSION Runtime Library for Parallel I/O. Rajeev Thakur Rajesh Bordawekar Alok Choudhary. Ravi Ponnusamy Tarvinder Singh
Scalable Parallel Libraries Conference, Oct. 1994 PASSION Runtime Library for Parallel I/O Rajeev Thakur Rajesh Bordawekar Alok Choudhary Ravi Ponnusamy Tarvinder Singh Dept. of Electrical and Computer
More informationAn Experimental Assessment of Express Parallel Programming Environment
An Experimental Assessment of Express Parallel Programming Environment Abstract shfaq Ahmad*, Min-You Wu**, Jaehyung Yang*** and Arif Ghafoor*** *Hong Kong University of Science and Technology, Hong Kong
More informationTHE application of advanced computer architecture and
544 IEEE TRANSACTIONS ON ANTENNAS AND PROPAGATION, VOL. 45, NO. 3, MARCH 1997 Scalable Solutions to Integral-Equation and Finite-Element Simulations Tom Cwik, Senior Member, IEEE, Daniel S. Katz, Member,
More informationCross-Layer Memory Management to Reduce DRAM Power Consumption
Cross-Layer Memory Management to Reduce DRAM Power Consumption Michael Jantz Assistant Professor University of Tennessee, Knoxville 1 Introduction Assistant Professor at UT since August 2014 Before UT
More informationA Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function
A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao
More informationTransactions on Information and Communications Technologies vol 3, 1993 WIT Press, ISSN
Toward an automatic mapping of DSP algorithms onto parallel processors M. Razaz, K.A. Marlow University of East Anglia, School of Information Systems, Norwich, UK ABSTRACT With ever increasing computational
More informationAn Inspector-Executor Algorithm for Irregular Assignment Parallelization
An Inspector-Executor Algorithm for Irregular Assignment Parallelization Manuel Arenaz, Juan Touriño, Ramón Doallo Computer Architecture Group Dep. Electronics and Systems, University of A Coruña, Spain
More informationAn In-place Algorithm for Irregular All-to-All Communication with Limited Memory
An In-place Algorithm for Irregular All-to-All Communication with Limited Memory Michael Hofmann and Gudula Rünger Department of Computer Science Chemnitz University of Technology, Germany {mhofma,ruenger}@cs.tu-chemnitz.de
More informationAn Initial Comparison of Implicit and Explicit Programming Styles for Distributed Memory Multiprocessors
Proceedings of the 28th Annual Hmvaii Intemottonol Conference on System Sciences - 1995 An Initial Comparison of Implicit and Explicit Programming Styles for Distributed Memory Multiprocessors Matthew
More informationInteroperability of Data Parallel Runtime Libraries
Interoperability of Data Parallel Runtime Libraries Guy Edjlali, Alan Sussman and Joel Saltz Department of Computer Science University of Maryland College Park, MD 2742 fedjlali,als,saltzg@cs.umd.edu Abstract
More informationA Beginner s Guide to Programming Logic, Introductory. Chapter 6 Arrays
A Beginner s Guide to Programming Logic, Introductory Chapter 6 Arrays Objectives In this chapter, you will learn about: Arrays and how they occupy computer memory Manipulating an array to replace nested
More informationMesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System
Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System The Harvard community has made this article openly available. Please share how this
More informationSVM Support in the Vienna Fortran Compilation System. Michael Gerndt. Research Centre Julich(KFA)
SVM Support in the Vienna Fortran Compilation System Peter Brezany University of Vienna brezany@par.univie.ac.at Michael Gerndt Research Centre Julich(KFA) m.gerndt@kfa-juelich.de Viera Sipkova University
More informationAn Optimization Method Based On B-spline Shape Functions & the Knot Insertion Algorithm
An Optimization Method Based On B-spline Shape Functions & the Knot Insertion Algorithm P.A. Sherar, C.P. Thompson, B. Xu, B. Zhong Abstract A new method is presented to deal with shape optimization problems.
More informationC ICASE INTERIM REPORT 17. Raja Das Joel Saltz Hwry Berryman. NASA Contract No. NAS May 1991
AD-A237 262 NASA Conr actor "Is Reit DTIC ls ELECTE C ICASE INTERIM REPORT 17 A MAN UAL FOR PARTI RUINIME PRTIV Revisi 1 Raja Das Joel Saltz Hwry Berryman. NASA Contract No. NAS1-18605 May 1991 INSTITUTE
More information6. Parallel Volume Rendering Algorithms
6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks
More information