An Initial Comparison of Implicit and Explicit Programming Styles for Distributed Memory Multiprocessors

Proceedings of the 28th Annual Hawaii International Conference on System Sciences, 1995

Matthew Haines* (ICASE, NASA Langley Research Center, Hampton, VA) and Wim Böhm+ (Computer Science Department, Colorado State University, Fort Collins, CO)

Abstract

Management of parallel tasks and distributed data are the essence of parallel programming on distributed memory multiprocessors, and can be expressed explicitly in the programming language, or provided implicitly through some combination of language and runtime support. Functional languages are designed to provide implicit support for both task and data management, but are often less efficient than explicit approaches. This is the classical tension between performance and ease of programming. This paper provides an initial study which attempts to quantify this tradeoff. While our quantitative results are accurate at capturing the scales for programming effort and efficiency of these programming methods, our results are based on two small parallel programs, and should be weighed accordingly.

1 Introduction

Programming today's large-scale distributed memory multiprocessors requires the management of both parallel tasks and distributed data structures. Due to the lack of sophisticated language support for these machines, these chores are typically done in an explicit manner, using machine-specific constructs for spawning and synchronizing tasks and for message passing. The resulting programs are difficult and time-consuming to write, and contain a large amount of machine-dependent housekeeping code not germane to the specification of the problem.

* Supported in part by a grant from Sandia National Laboratories while at Colorado State University, and by the National Aeronautics and Space Administration under NASA Contract No. NASA-19480 while in residence at ICASE.
+ Supported in part by NSF grant MIP.

This situation has led to the initiation of many research projects [9, 10, 19, 1, 12] whose goal is to raise the level of abstraction for the programmer, while still providing performance that (hopefully) comes close to machine-specific code. All of these approaches employ an imperative programming language paradigm (typically Fortran), whose side effects can have a detrimental effect on the ability of a compiler to perform dependence analysis. An alternative approach is to employ a functional language, such as Sisal [14], Haskell [11], or Id [15], to expose the natural parallelism in an application, and then utilize compiler and runtime support to provide for implicit management of both parallel tasks and distributed data structures.

VISA [8] is a runtime system designed to provide implicit task and memory management on distributed memory multiprocessors. The compiler (user) is provided with a shared memory abstraction and a set of primitives for allocating and accessing shared data structures within the virtual address space. Data structures are allocated using a variety of data decompositions specified by a set of predefined or user-defined mapping functions. We have used VISA to provide a distributed memory implementation of Sisal, and in [6] we present an outline of the distributed Sisal system as well as the performance for a large set of programs.
In this paper, we present an overview of the VISA system (to gain an understanding as to how implicit data distribution and addressing in a distributed memory environment might be done) and a comparison of implicit and explicit programming methods for distributed memory multiprocessors. Our goal is to quantify the tradeoff between programming effort and efficiency for implicit and explicit programming styles on distributed memory multiprocessors.

To compare implicit and explicit programming methods, we have selected two programs, one whose performance is heavily influenced by task management and the other whose performance is heavily influenced by data management. We encode these programs using three combinations of implicit and explicit task and data management: (1) Sisal with VISA, (2) explicit parallel C with VISA, and (3) explicit parallel C with message passing. We then measure the programming effort for these three methods using a combination of programming time and total lines of code. Although current state-of-the-art software engineering metrics may provide a more robust analysis, we feel that our metrics do provide a good measure of the relative complexity of a program and the effort required to create the program. We then compare programming effort with the measured efficiency of each paradigm, in terms of both execution time and space. The results of our experiment help to quantify the tension between programming effort and efficiency in distributed memory programming, providing goals for both implicit methods (to improve efficiency) and explicit methods (to reduce programming effort). It should be noted that this experiment is only an initial study of this tradeoff, and fails to take into consideration the effects of large-scale programs and additional paradigms, such as object-oriented programming. Clearly, such comparisons would be useful.

The next section provides an overview of VISA, a system for implicit data management. Section 3 describes the programs used in evaluating the three programming methods, and a description of the programming effort of each approach. Section 4 provides the performance of each of the programs and an analysis of the results. Section 5 provides a brief description of related research projects and a comparison with previous work.

2 The Design and Operation of VISA

Address translation is the process of translating a global address into a processor-local address (<processor, offset>), and is required in order to maintain a single addressing space in a distributed memory multiprocessor. VISA (Virtual Shared Addressing) is a distributed memory runtime system that provides a single addressing space and general data decomposition functions to a programmer or compiler. Although other approaches exist which provide language support for a single addressing space [10, 19], they are typically limited to supporting only shared arrays (i.e., a global index space, not a global address space), and must pre-process all parallel loops to extract the runtime values necessary to compute the processor-local addresses. This pre-processing can occur at compile time or at run time, with the latter being similar in runtime cost to the VISA approach. VISA, on the other hand, is designed as a true shared addressing space capable of supporting both scalar and aggregate data structures, and performing address translation on the fly. This eliminates the need to pre-compute communication schedules, which, when done at runtime, can incur a significant overhead [16]. VISA also eliminates the need to pre-fetch and store all remote values that will be needed during a parallel loop computation. In contrast, most compiler-based systems must either pre-fetch all remote data needed for a parallel loop computation (expensive in space), or perform strip-mining analysis to divide the parallel loop into smaller sections so that the amount of incoming data can be tolerated (expensive in time).
Though we acknowledge that it can be more expensive to satisfy remote references on the fly, we implement several optimizations to reduce the burden, including split-phase transactions and multithreading.

To use VISA, a compiler (or user) augments a parallel program with VISA primitives for allocating and accessing the data structures to be kept in the single addressing space. Any variables not placed in the VISA space are unaffected by the system, and remain local to each processor. The augmented program is then compiled using the native language compiler of choice, and linked with the VISA library to create the object program, which can then be executed on a distributed memory multiprocessor.

2.1 Message Passing

All message passing required for accessing remote values is handled implicitly by the VISA system through the use of a message passing abstraction, supporting both synchronous (blocking) and asynchronous (non-blocking) operations. Since these operations are provided by most host operating systems for distributed memory multiprocessors, VISA can be easily ported to other distributed memory multiprocessors by modifying the message passing abstraction to make the proper native calls.
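To illustrate what such a portable abstraction might look like, the following minimal sketch wraps a handful of send and receive operations behind a fixed interface. The function names are invented, and MPI is used here only as a stand-in for the native calls of the host system; porting would then amount to rewriting these few wrappers.

    /* Hypothetical sketch of VISA's message passing abstraction.  None of
     * these names come from the paper; MPI merely stands in for the native
     * calls, so that porting VISA means re-implementing these wrappers. */
    #include <mpi.h>

    /* Blocking send: deliver 'len' bytes of 'buf' to 'dest' as type 'tag'. */
    static int visa_msg_send(int dest, int tag, void *buf, int len)
    {
        return MPI_Send(buf, len, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
    }

    /* Blocking receive: wait for any message of type 'tag'; returns the sender. */
    static int visa_msg_recv(int tag, void *buf, int maxlen)
    {
        MPI_Status status;
        MPI_Recv(buf, maxlen, MPI_BYTE, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &status);
        return status.MPI_SOURCE;
    }

    /* Non-blocking send: initiate delivery and return immediately. */
    static int visa_msg_send_async(int dest, int tag, void *buf, int len,
                                   MPI_Request *req)
    {
        return MPI_Isend(buf, len, MPI_BYTE, dest, tag, MPI_COMM_WORLD, req);
    }

    /* Non-blocking poll: has a message of type 'tag' arrived yet? */
    static int visa_msg_poll(int tag)
    {
        int arrived = 0;
        MPI_Iprobe(MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &arrived, MPI_STATUS_IGNORE);
        return arrived;
    }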

2.2 Data Distribution

As depicted in Figure 1, the VISA address space is a virtual space spanning portions of the local memory from each participating node. This creates two types of address spaces for each participating node in the system: a shared virtual addressing space that spans all of the processors, and a local address space for data visible only to the local processor.

Figure 1: The VISA addressing space

Each data structure allocated to the VISA space receives a contiguous set of virtual addresses which are shared among the participating nodes in accordance with the data distribution (or mapping) function. The data distribution scheme determines how the physical storage for a global data structure is to be divided among the participating nodes. The goal is to divide the data structure among the nodes so as to minimize the number of remote references caused by the distribution. This means that the distribution of data must be tied to the access pattern of the parallel computation, and therefore data distribution needs to be flexible enough to support a wide variety of access patterns.

For VISA, data distribution is accomplished by dividing a data structure into a set of blocks, where each block contains blocksize elements. The blocks are then allocated to the physical memories of the nodes in round-robin fashion. This is similar to the block-cyclic distribution scheme proposed for HPF [9] and used in the Fortran D compiler [10]. By dividing the data into blocks, we can minimize the amount of storage required for the translation table (the table that stores the information necessary to translate a virtual address into a processor-local address). Rather than one entry in the table for each element, we need only one entry for the entire data structure, since we can compute the remaining element positions from the data distribution parameters (blocksize, stride, etc.). By reducing the size of the translation table to one entry per structure, we can efficiently replicate it onto the participating processors rather than having to store it as a distributed structure itself. The result is that any virtual address reference will result in at most one remote reference, rather than the two that might occur when the translation table is distributed. The disadvantage of this approach is that fully general (irregular) distributions cannot be supported, since this would require one translation table entry for each element of the distributed data structure.

The block-cyclic distribution scheme can be represented with a few control parameters for each data structure, including the size of each block (blocksize), the node to which the first block is assigned (start_node), and the processor stride at which the blocks are distributed (stride). A fourth control parameter specifies whether or not a data structure is to be replicated. Table 1 shows how these parameters can be modified to achieve a variety of one-dimensional mapping functions.

Table 1: Control parameter settings for various 1D mapping functions (columns: Mapping Func, Blocksize, Start PE, Replicate)

For example, cyclic distribution (where each processor gets one element in turn until all elements have been assigned) can be implemented with a blocksize of 1 and a stride of 1. The starting processor can be specified with an argument to the mapping function (maparg). For the general case of block-cyclic, the maparg can be used to specify the blocksize. The stride is always 1 for these one-dimensional mapping functions, but varies for multi-dimensional mapping functions (see [3] for more details on multi-dimensional data structures in VISA).
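The following sketch restates these control parameters as a C structure and shows how the one-dimensional mappings discussed above can be expressed with them. The structure and function names are illustrative only, and the block case assumes blocksize = ceil(n/P), a conventional choice that the paper does not spell out.

    /* Illustrative only: a C rendering of the four control parameters and
     * of the 1D mapping functions described in the text.  Names are not
     * taken from VISA; the block case assumes blocksize = ceil(n/P). */
    typedef struct {
        long blocksize;   /* elements per block                     */
        int  start_node;  /* node that receives the first block     */
        int  stride;      /* processor stride between blocks        */
        int  replicate;   /* non-zero: full copy on every node      */
    } dist_params_t;

    /* Cyclic: one element per node in turn, starting at node 'start'. */
    static dist_params_t map_cyclic(int start)
    {
        dist_params_t d = { 1, start, 1, 0 };
        return d;
    }

    /* Block-cyclic: blocks of 'maparg' elements dealt out round-robin. */
    static dist_params_t map_block_cyclic(long maparg)
    {
        dist_params_t d = { maparg, 0, 1, 0 };
        return d;
    }

    /* Block: one contiguous block per node (assumed blocksize). */
    static dist_params_t map_block(long n, int P)
    {
        dist_params_t d = { (n + P - 1) / P, 0, 1, 0 };
        return d;
    }

    /* Replicated: every node holds the entire structure. */
    static dist_params_t map_replicated(long n)
    {
        dist_params_t d = { n, 0, 1, 1 };
        return d;
    }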
The data distribution function is passed as an argument to the VISA data allocation routine, visaslalloc, which updates the translation tables and allocates the space required to store the data. VISA will also accept a user-defined mapping function, so long as it establishes values for all of the mapping parameters.
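As a concrete, though hypothetical, picture of this usage model, the sketch below allocates an array into the shared space with one of the mapping functions from the previous sketch and then accesses it through global indices. The primitive names (visa_alloc, visa_read, visa_write) and their signatures are invented for illustration; only the workflow of allocating with a mapping and then accessing through the shared space is taken from the text.

    /* Illustration only: the primitives below are invented stand-ins for
     * the actual VISA allocation and access routines.  dist_params_t and
     * map_block() come from the previous sketch. */
    #include <stdio.h>

    typedef long visa_addr_t;   /* a global (virtual shared) address */

    visa_addr_t visa_alloc(long n, dist_params_t mapping);              /* allocate n doubles */
    double      visa_read(visa_addr_t base, long index);                /* shared-space read  */
    void        visa_write(visa_addr_t base, long index, double value); /* shared-space write */

    void scale_shared_array(long n, int P)
    {
        /* A distributed array placed in the VISA space, block distributed. */
        visa_addr_t a = visa_alloc(n, map_block(n, P));

        /* An ordinary local variable: not placed in the VISA space, so it
         * remains private to this processor and is untouched by the system. */
        double local_sum = 0.0;

        for (long i = 0; i < n; i++) {
            double v = visa_read(a, i);   /* may resolve locally or remotely */
            visa_write(a, i, 2.0 * v);    /* address translation is implicit */
            local_sum += v;
        }
        printf("local contribution: %f\n", local_sum);
    }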

2.3 Address Translation

Address translation is the process of obtaining the physical address of a datum given its virtual address. For a distributed memory multiprocessor, a physical address consists of the doublet (node, pa), where node is a node designator and pa is the physical address within that node. Since VISA employs a block-cyclic addressing scheme, where the blocksize, starting node, and stride may all vary, it is necessary to store these control parameters, along with other information about each data structure, in a descriptor called a range-map entry. The entire VISA space is therefore described by the collection of these entries, called the range-map table, or translation table. The term range refers to the fact that, since all data structures are assigned contiguous addresses in both virtual and physical spaces, the range (low, high) is sufficient to represent all of the addresses within a data structure. In addition to the control parameters, each range-map entry contains three address ranges for each data structure:

- the visa_base, representing the range of virtual (global) addresses for this data structure,
- the local_base, representing the range of physical (local) addresses for the blocks allocated to this data structure, and
- the optimized_base, representing the optimized range of global addresses. (This allows VISA to perform an important optimization of not having to translate virtual addresses that would result in a local physical address.)

General address translation then proceeds as follows:

- The range-map entry for the desired data structure is fetched by the find_rm() routine. This routine is exposed to the compiler so that the range-map entry for a data structure that is to be accessed many times need only be fetched once.

- From a virtual address, the relative element position within the data structure (element), the block containing the desired element (block), and the offset of the desired element within this block (block_offset) are computed:

      element      = address - low_range
      block        = element / blocksize
      block_offset = element mod blocksize

- Next, the node which owns the block (node), the block number within that node (node_block), and the relative offset of the actual datum within the node (rel_offset) are computed, where P is the number of participating nodes:

      node       = (start_node + (block * stride)) mod P
      node_block = block / P
      rel_offset = node_block * blocksize + block_offset

- If the access is local, rel_offset is incremented by the local_base from the range-map entry to produce the actual offset in local physical memory:

      offset = rel_offset + local_base

  Otherwise (the access is remote), a message is sent to node, requesting that the desired datum be fetched and returned.

An alternative to this address translation scheme is to have a fixed blocksize, start_node, and stride for every data structure. Address translation calculations can then proceed directly from the virtual address bits. However, we have implemented this fixed addressing scheme and found that, although the actual translation process is faster, the fixed control parameters often cause misalignment with the parallel loops which access the data structures, resulting in an excessive number of remote references and severely degrading the overall performance of the application [7].

We can, however, provide a significant optimization to the above translation scheme by eliminating the address translation step altogether for local references (which usually dominate the total number of references). The idea is to provide a set of water-mark values for each data structure which indicate the range of virtual addresses that will result in a local reference. These values can be stored in registers when the range-map entry for a data structure is retrieved (indicating that we are about to access the data structure). Then, a simple register comparison of the water-mark values with the virtual address is all that is necessary to establish whether the reference is local. If so, the pre-computed optimized_base is added to the virtual address to obtain the local physical address. Thus, for local references from the global address space, the overhead imposed by VISA is two compares and an add. For remote references, the overhead is larger, but compared with the cost of a remote fetch, the address translation time is minimal.
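The translation steps above transcribe almost directly into code. The following sketch does so in C, with the water-mark fast path for local references included; the range_map_t layout, function names, and the treatment of the remote case are invented for illustration, while the arithmetic follows the formulas given in the text.

    /* Sketch of general VISA address translation.  The struct layout and
     * names are illustrative; the arithmetic mirrors the formulas above. */
    typedef struct {
        long low_range;       /* first virtual address of the structure  */
        long blocksize;       /* elements per block                      */
        int  start_node;      /* node holding block 0                    */
        int  stride;          /* processor stride between blocks         */
        long local_base;      /* base of the locally held blocks         */
        long optimized_base;  /* pre-computed offset for local addresses */
        long low_water;       /* water marks: virtual addresses in       */
        long high_water;      /*   [low_water, high_water] are local     */
    } range_map_t;

    extern int P;        /* number of participating nodes */
    extern int my_node;  /* this processor's node number  */

    /* Translate a virtual address.  Returns the processor-local address for
     * a local reference; for a remote reference, a request would be sent to
     * the owning node (indicated here by returning -1). */
    long visa_translate(const range_map_t *rm, long address)
    {
        /* Water-mark fast path: two compares and an add for local data. */
        if (address >= rm->low_water && address <= rm->high_water)
            return address + rm->optimized_base;

        long element      = address - rm->low_range;
        long block        = element / rm->blocksize;
        long block_offset = element % rm->blocksize;

        int  node       = (int)((rm->start_node + block * rm->stride) % P);
        long node_block = block / P;
        long rel_offset = node_block * rm->blocksize + block_offset;

        if (node == my_node)
            return rel_offset + rm->local_base;

        /* Remote: send a fetch request to 'node' (for example through the
         * message passing abstraction sketched in Section 2.1) and wait
         * for the split-phase reply. */
        return -1;
    }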
Having outlined how an implicit data management system (VISA) works, we can now discuss how such a system is used to relieve the programmer of these details.

3 Evaluation

We implement a variety of sample programs in Sisal and execute them on the nCUBE/2 distributed memory multiprocessor using the VISA runtime system to provide the necessary task and data management support, and in [8] we report on the efficiency of these codes.

In this paper, we wish to perform an experiment which attempts to quantify the tradeoff between programming effort and efficiency. To do this, we compare implicit (low programmer effort) and explicit (high efficiency) approaches to task and data management. In general, a parallel program executing on a distributed memory multiprocessor must address two issues, either explicitly or implicitly:

Task management. Parallel execution is achieved by dividing the independent portions of the program into parallel tasks, distributing these tasks among the participating nodes for parallel execution, and synchronizing their results so that the computation remains determinate.

Memory management. Global data structures need to be distributed among the participating nodes in such a way as to minimize the number of remote references generated by the execution of the parallel tasks. Once a distribution is agreed upon, the program must identify those references that fall outside of the local distribution (i.e., remote references), and communicate each request to the node which contains the value.

Figure 2: Parallel programming style combinations

Given these two orthogonal programming issues (task and data management), either of which may be handled explicitly or implicitly, we can construct a simple classification of the programming space (depicted in Figure 2):

1. Explicit task management using parallel C and explicit memory management using message passing primitives. This style represents the assembly language of distributed memory computing: fast and dirty. It offers the lowest level of abstraction, but by exposing all aspects of task and data management to the user, high levels of performance can be achieved.

2. Explicit task management using parallel C and implicit memory management provided by the VISA runtime system. This style provides a higher level of abstraction for the program data (shared memory rather than message passing), but does nothing to increase the task management abstraction.

3. Implicit task management using Sisal and explicit memory management using message passing primitives. This style could be represented by a machine-dependent Sisal compiler that has been modified to generate explicit message passing code. However, such a modification to the compiler has only recently been undertaken, and thus we cannot expand on results for this style in our analysis.

4. Implicit task management using Sisal and implicit memory management provided by the VISA runtime system. This represents the opposite end of the (programming effort) spectrum from explicit parallel C with message passing. The user sees a single address space and sequential program semantics, and the underlying systems perform all of the necessary transformations to achieve parallel execution on a distributed memory machine.

We acknowledge that these (Sisal and VISA) are not the only systems that could be combined to cover this classification, and further investigation of these styles using other options remains an open area of research. To measure the relative merits of each style, in terms of programming effort and execution time, we encode two programs in three of the programming styles (1, 2, and 4) specified above. The first program, Lawrence Livermore Loop #7, is selected to highlight the effects of explicit and implicit data management, and the other program, SOR, is selected to highlight the effects of implicit and explicit task management.
As a practical consideration, these programs are relatively simple to allow for the various encoding methods of each program in a reasonable amount of time. A more detailed study would encode at least one real application and consider the effects of algorithm complexity and large-scale programming on the tradeoff of programming effort versus efficiency. However, due to practical time and space considerations for this paper, we limit our initial study to the following programs:

- Lawrence Livermore Loop #7. This program creates an array A from an input array B and constants R, T, C1, and C2, where A_i is defined as:

      A_i = B_i + R*C1 + R^2*C2
            + T*B_{i+3} + T*R*B_{i+2} + T*R^2*B_{i+1}
            + T^2*B_{i+6} + T^2*R*B_{i+5} + T^2*R^2*B_{i+4}

  With very little task management required, this problem highlights the differences between the implicit and explicit memory management styles.

- Successive Over-Relaxation (SOR). This program performs a smoothing operation on an array by iterating over the array and computing each new A_i as the average of the previous iteration's A_{i-1}, A_i, and A_{i+1}. The access pattern is fixed over all of the iterations, and the array is distributed among the nodes in equal-sized blocks, matching the distribution of the parallel (inner) loop to minimize the remote references. The iteration loop in this program provides a method of controlling the amount of synchronization required, thus highlighting the differences between implicit and explicit task management.

Both of these programs were encoded by the same person (who was not any more adept at functional programming than imperative programming), using the three programming styles as follows:

- Sisal with VISA. Both codes were transformed into Sisal directly from their mathematical descriptions. The code specifies only what is to be computed, not how the computations are to proceed. The result is a machine-independent specification of the problem that runs on any machine Sisal supports, and one which is not biased by the structure of imperative programming languages.

- Explicit parallel C with VISA. Moving into explicit task management, the codes have to specify how the parallel loop is to be divided among the workers, and how explicit synchronization is to be performed. Memory management is handled by the VISA system; however, for the Livermore Loop #7 code, special registers were employed to cache the values of the B array so that multiple remote references to retrieve the same value were eliminated. This is a prime example of how optimizations can be exploited once the abstract covers of the system have been removed. However, optimizations such as this also add to the difficulty of understanding and maintaining codes at this level of abstraction.

- Explicit parallel C with message passing. Moving away from the VISA system, the explicit task management code is augmented with explicit message passing. The program is designed to optimize the number of remote references required and to perform all remote references before the computation loop is initiated, pre-fetching remote values into local overlap regions. Also, the communication model is changed from the interrupt-driven request/reply model used in VISA to a synchronous read/write model so that the overhead of the interrupt handler can be avoided. The computation (inner) loop now runs completely without remote references or interrupts, thus improving its cache behavior. Special buffers are used to hold the pre-fetched values, and synchronous communication phases are necessary to avoid deadlock. The distribution of data among the processors is also explicitly stated, and altering this distribution would require re-coding both the explicit communication and computation phases. The result is a program that is capable of exploiting, and controlling, all aspects of the memory and processor hierarchies. (A sketch of this overlap-region structure for SOR appears after this list.)
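The sketch below illustrates the communication and computation structure just described for the SOR program: an equal block of the array on each node, one-element overlap regions on either side, a synchronous boundary exchange before each iteration, and a pointer swap afterwards. MPI is used as a stand-in for the native message passing calls, and the boundary treatment at the ends of the global array is an assumed choice; none of this code is taken from the paper.

    /* Illustrative sketch (not the paper's code) of SOR with explicit
     * message passing: block distribution, one-element overlap (ghost)
     * cells, a synchronous exchange each iteration, then a purely local
     * computation loop and a pointer swap. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    void sor_block(double *block, long n_local, int iters,
                   int me, int P, MPI_Comm comm)
    {
        /* prev/next hold n_local elements plus one ghost cell per side. */
        double *prev = malloc((n_local + 2) * sizeof(double));
        double *next = malloc((n_local + 2) * sizeof(double));
        memcpy(prev + 1, block, n_local * sizeof(double));

        int left  = (me > 0)     ? me - 1 : MPI_PROC_NULL;
        int right = (me < P - 1) ? me + 1 : MPI_PROC_NULL;

        for (int it = 0; it < iters; it++) {
            /* Synchronous boundary exchange: fill the overlap regions. */
            MPI_Sendrecv(&prev[1], 1, MPI_DOUBLE, left, 0,
                         &prev[n_local + 1], 1, MPI_DOUBLE, right, 0,
                         comm, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&prev[n_local], 1, MPI_DOUBLE, right, 1,
                         &prev[0], 1, MPI_DOUBLE, left, 1,
                         comm, MPI_STATUS_IGNORE);
            /* No neighbor at the global ends: replicate the edge value
             * into the ghost cell (an assumed boundary treatment). */
            if (left  == MPI_PROC_NULL) prev[0]           = prev[1];
            if (right == MPI_PROC_NULL) prev[n_local + 1] = prev[n_local];

            /* Computation loop: runs without any remote references. */
            for (long i = 1; i <= n_local; i++)
                next[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3.0;

            double *tmp = prev; prev = next; next = tmp;  /* pointer swap */
        }

        memcpy(block, prev + 1, n_local * sizeof(double));
        free(prev);
        free(next);
    }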
4 Results and Analysis

The goal of our experiment is to evaluate and quantify the tension that exists between programming effort and efficiency. In particular, we wish to quantify these qualities for our test programs. Quantifying efficiency is simple: we can measure the time and space usage of each program for each programming style. Quantifying programming effort, however, is much more nebulous. We have settled on a combination of two measures to quantify programming effort: lines of code and the time required to write the programs. By combining these two, we capture the effort required in writing terse, efficient programs, as well as longer programs written in a shorter amount of time because less care was taken with algorithms and optimizations. While these metrics may not represent the state of the art in parallel software engineering, most programmers would agree that they provide a reasonable approximation of programming effort.

Table 2 displays the programming effort in terms of the lines of code that the user is responsible for writing, and the approximate time it took to code and debug each of the programs. For brevity, we will represent the three programming styles in our tables and figures as follows: SISAL represents Sisal with VISA; C+VISA represents explicit parallel C with VISA; and C+MP represents explicit parallel C with message passing.

Table 2: Comparison of programming effort, in both time and space (rows: lines of code and time to encode in hours; columns: SISAL, C+VISA, and C+MP, for both LLNL Loop #7 and SOR)

Table 3: Execution times for LLNL Loop #7 (columns: PEs, array size, and the time in seconds for SISAL, C+VISA, and C+MP, with speedups Sp1 and Sp2 and their averages)

The claim that implicit parallel languages ease the problem of programming distributed memory multiprocessors is clearly supported by these numbers. As we move from Sisal to explicit C with VISA, and then to explicit C with message passing, the code becomes increasingly more complex, requiring increasingly more lines of code, and becoming more machine-dependent. The question, then, is whether the increased performance justifies the additional programming effort.

Table 3 (also depicted in Figure 3) gives the execution results for LLNL Loop #7 on the nCUBE/2, where Array Size represents the total size of the A and B arrays; Sp1 represents the speedup in going from SISAL to C+VISA (T_SISAL / T_C+VISA); and Sp2 represents the speedup in going from C+VISA to C+MP (T_C+VISA / T_C+MP). In order to highlight the performance gain achieved by explicit memory management, the blocksize (number of array elements per processor) was kept constant at 65,536 (2^16) double-precision elements. The data reveals that an average speedup of 1.47 is achieved when going from SISAL to C+VISA, which is due more to the memory caching optimization than to the explicit control of tasks. Additionally, an average speedup of 1.83 is achieved when moving from C+VISA to C+MP, demonstrating in part the overhead of the VISA system, but moreover the effectiveness of the pre-fetching optimization and the improved cache behavior. Thus, the improvements in efficiency that are achieved by lowering the programming abstraction can be attributed to the indirect overhead of not being able to take advantage of certain optimizations, rather than to the direct overhead of maintaining such abstractions.

In terms of space requirements, SISAL uses the minimum: two arrays of size n, one for A and one for B. C+VISA allocates an additional 7 double-precision locations per array to cache the values of B_i through B_{i+6} so that they need only be retrieved once. C+MP also allocates an additional block of 7 elements to store the values of B that reside on the neighboring node. Thus, both C+VISA and C+MP employ additional memory (on the order of the size of the overlap area) to achieve their optimizations. Whether this overhead is acceptable or not depends largely on the application.
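To make the caching optimization concrete, the following sketch (not the paper's code) shows one way the C+VISA version of Loop #7 can keep a sliding window of seven B values in locals, so that each element of B is fetched through the shared address space only once rather than up to seven times. It reuses the hypothetical visa_read/visa_write primitives sketched in Section 2 and the kernel definition from Section 3.

    /* Illustration of the B-value caching described above: a sliding
     * window b[0..6] holds B[i]..B[i+6], so each B element is read from
     * the VISA space exactly once.  The primitives are the hypothetical
     * ones sketched earlier; B is assumed to hold hi+6 elements. */
    void loop7_cached(visa_addr_t A, visa_addr_t B, long lo, long hi,
                      double R, double T, double C1, double C2)
    {
        double R2 = R * R, T2 = T * T;
        double b[7];

        /* Prime the window with B[lo] .. B[lo+6]. */
        for (int k = 0; k < 7; k++)
            b[k] = visa_read(B, lo + k);

        for (long i = lo; i < hi; i++) {
            double a = b[0] + R * C1 + R2 * C2
                     + T  * b[3] + T  * R * b[2] + T  * R2 * b[1]
                     + T2 * b[6] + T2 * R * b[5] + T2 * R2 * b[4];
            visa_write(A, i, a);

            /* Slide the window: drop B[i], fetch B[i+7] once. */
            if (i + 1 < hi) {
                for (int k = 0; k < 6; k++)
                    b[k] = b[k + 1];
                b[6] = visa_read(B, i + 7);
            }
        }
    }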

Figure 3: Execution times for LLNL Loop #7

Figure 4: Execution times for SOR

Table 4: Execution times for SOR (columns: PEs, blocksize, the ratio of iterations to blocksize, and the time in seconds for SISAL, C+VISA, and C+MP, with speedups Sp1 and Sp2)

Table 4 (also depicted in Figure 4) gives the execution results for SOR, where a constant, power-of-two array size of double-precision elements and 128 iterations are used to highlight the performance gain achieved by explicit task management. By holding the array size constant, we cause the blocksize to decrease as the number of processors increases, and thus the ratio of iterations to blocksize increases as the number of processors increases. This ratio of iterations to blocksize represents the increasing emphasis being placed on task management. In moving from SISAL to C+VISA, there is an average speedup of 1.20, which starts as a performance decrease and improves as the ratio of iterations to blocksize increases, placing greater emphasis on task management in the total execution time. The initial loss in performance is due to the ability of the Sisal compiler to generate code that is highly optimized, which sometimes outperforms normal hand-coded C [2]. However, this small gain is quickly lost as the complex Sisal task management system is outperformed by the hand-coded C task management. In moving from C+VISA to C+MP, there is an average speedup of 1.64, again representing the overhead of VISA plus the effectiveness of pre-fetching remote references and the improved cache behavior. The single-processor time of explicit C with message passing shows the enormous overhead of synchronization that this problem creates, which is not as visible in the other two approaches due to the overhead of VISA.

In terms of space requirements, C+VISA uses two arrays of size n, one for the previous iteration and one for the current iteration, and swaps pointers at the end of each iteration. This is the minimal space requirement for SOR. The Sisal compiler also recognizes this optimization, but generates the two swap arrays only after generating an array to hold the initial values, resulting in a space overhead of n elements. C+MP uses only the two necessary arrays, but allocates an additional two elements per processor to hold the pre-fetched remote values from the neighboring nodes.

5 Related Research

In [2], shared memory implementations of Sisal and Fortran are compared, and the shared memory implementation of Sisal compares favorably with Fortran on a wide variety of benchmarks. By providing a virtual shared memory runtime system, we have taken this shared memory implementation to distributed memory machines. In [4] we introduced the design of our task management system; in [5] we quantify the characteristics of software multithreading; and in [6] we outline the distributed Sisal runtime system, of which VISA is a part.

Another area of research that offers a language-independent shared memory paradigm is Distributed Shared Memory [13]. However, the inability to couple parallel tasks tightly with the distribution of data, which is controlled implicitly by the operating system, can result in misalignment, causing excessive message passing. Also, since the granularity of sharing data in these systems is often very large (typically a page), contention, or false sharing, can occur, in which two unrelated data items exist on the same sharable unit, prohibiting simultaneous access.

The most common alternatives to programming distributed memory multiprocessors with an explicit parallel language and message passing are distributed memory language compilers, such as Fortran D [10] and Vienna Fortran [19], both of which closely resemble the proposed HPF standard [9]. These systems offer the advantage of implicit management for both tasks and memory, and allow the programmer to use the familiar sequential shared memory programming paradigm. Although these systems have had success in implementing certain applications, they have not been uniformly accepted as the solution to distributed memory programming for various reasons, including their difficulties with exact dependence analysis [17] and their inability to efficiently handle irregular and adaptive codes without incurring large runtime overheads [18].

6 Conclusions

Our goal in this paper is to shed light on the tension, in terms of programming effort and efficiency, that exists between implicit and explicit programming styles for distributed memory multiprocessors. Towards this goal, we have conducted an initial study that compares the performance and programming effort of two programs using three programming styles: implicit task and data management, explicit task management with implicit data management, and explicit task and data management. Implicit task management is provided by the Sisal language compiler, which is designed for execution on shared memory multiprocessors [2]. Implicit data management is provided by the VISA runtime system, whose design and operation we have outlined in this paper. Explicit task and data management are provided through a native C compiler and parallel support library. We feel that our quantitative results are accurate at capturing the scales for programming effort and efficiency of these programming methods. However, it is important to note that our quantitative results are based on two small parallel programs, and should be weighed accordingly.
Sisal with VISA provides implicit management of both tasks and data, and offers reasonable performance while relieving the programmer of the implementation details of an architecture. The result is efficient machine-independent code that is portable among a wide range of architectures [2]. Furthermore, since the current Sisal compiler is unaware of distributed memory and the costs associated with accessing remote data, we expect a performance gain when such information is exploited by the compiler.

Explicit parallel C with VISA offers the ability to increase the performance of an application, but at the cost of increased size, programming effort, and machine-dependence. For our simple programs, an average speedup of 1.34 over Sisal is achieved, but at the cost of increasing the code size by an average factor of 7, and increasing the time required to encode and debug the programs by an average factor of 11.

Explicit parallel C with message passing offers the ability to exploit the problem and machine details to obtain the highest performance for a particular machine. In particular, we are able to better control both the memory hierarchy (by pre-fetching values) and the cache (by using overlap regions), both of which are necessary to achieve high performance. For our programs, average speedups of 1.74 over C with VISA and 2.34 over Sisal are achieved.

Once again, this increase in performance is obtained at the cost of increasing program sizes by an average factor of 2 over C+VISA and by an average factor of 15 over Sisal, while increasing the time required to encode and debug the programs by an average factor of 4 over C+VISA and by an average factor of 40 over Sisal.

References

[1] Bruno Achauer. The DOWL distributed object-oriented language. Communications of the ACM, 36(9):48-55, September.
[2] David Cann. Retire Fortran? A debate rekindled. Communications of the ACM, 35(8):81-89, August.
[3] Matthew Haines. Distributed Runtime Support for Task and Data Management. PhD thesis, Computer Science Department, Colorado State University, Fort Collins, Colorado, June. Also appears as a technical report of the Computer Science Department, Colorado State University.
[4] Matthew Haines and Wim Böhm. Towards a distributed memory implementation of Sisal. In Scalable High Performance Computing Conference. IEEE, April. Also appears as a technical report of the Computer Science Department, Colorado State University.
[5] Matthew Haines and Wim Böhm. An evaluation of software multithreading in a conventional distributed memory multiprocessor. In IEEE Symposium on Parallel and Distributed Processing, December.
[6] Matthew Haines and Wim Böhm. On the design of distributed memory Sisal. Journal of Programming Languages, 1.
[7] Matthew Haines and Wim Böhm. Task management, virtual shared memory, and multithreading in a distributed memory implementation of Sisal. In Arndt Bode, Mike Reeve, and Gottfried Wolf, editors, Parallel Architectures and Languages Europe. Springer-Verlag Lecture Notes in Computer Science, June.
[8] Matthew Haines and Wim Böhm. A virtual shared addressing system for distributed memory Sisal. In John T. Feo, editor, Proceedings of Sisal '93, San Diego, CA, October.
[9] High Performance Fortran Forum. High Performance Fortran Language Specification, version 1.0 edition, May.
[10] Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, August.
[11] Paul Hudak and Philip Wadler. Report on the functional programming language Haskell. Technical report, Yale University, New Haven, CT, December.
[12] Rodger Lea, Christian Jacquemot, and Eric Pillevesse. COOL: System support for distributed programming. Communications of the ACM, 36(9):37-46, September.
[13] Kai Li. Shared Virtual Memory on Loosely Coupled Multiprocessors. PhD thesis, Yale University, September.
[14] J. R. McGraw, S. K. Skedzielewski, S. J. Allan, R. R. Oldehoeft, J. Glauert, C. Kirkham, W. Noyce, and R. Thomas. SISAL: Streams and iteration in a single assignment language: Reference manual version 1.2. Manual M-146, Rev. 1, Lawrence Livermore National Laboratory, Livermore, CA, March.
[15] R. S. Nikhil. Id (Version 90.0) Reference Manual. Technical Report CSG Memo 284-1, MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139, USA, July. Supersedes: Id/83s (July 1985), Id Nouveau (July 1986), Id 88.0 (March 1988), Id 88.1 (August 1988).
[16] Joel Saltz, Ravi Mirchandaney, and Kay Crowley. Runtime parallelization and scheduling of loops. IEEE Transactions on Computers, 40(5), May. Also appears as an ICASE report.
[17] Zhiyu Shen, Zhiyuan Li, and Pen-Chung Yew. An empirical study of Fortran programs for parallelizing compilers. IEEE Transactions on Parallel and Distributed Systems, 1(3), July.
[18] Alan Sussman and Joel Saltz. A manual for the multiblock PARTI runtime primitives, revision 4. Technical Report CS-TR-3070 and UMIACS-TR-93-36, University of Maryland, Department of Computer Science and UMIACS, May.
[19] Hans Zima, Peter Brezany, Barbara Chapman, Piyush Mehrotra, and Andreas Schwald. Vienna Fortran: A language specification, version 1.1. Interim Report 21, Institute for Computer Applications in Science and Engineering, Hampton, VA, March.


More information

Software Architecture

Software Architecture Software Architecture Does software architecture global design?, architect designer? Overview What is it, why bother? Architecture Design Viewpoints and view models Architectural styles Architecture asssessment

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language

An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language Martin C. Rinard (martin@cs.ucsb.edu) Department of Computer Science University

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

Pull based Migration of Real-Time Tasks in Multi-Core Processors

Pull based Migration of Real-Time Tasks in Multi-Core Processors Pull based Migration of Real-Time Tasks in Multi-Core Processors 1. Problem Description The complexity of uniprocessor design attempting to extract instruction level parallelism has motivated the computer

More information

Reflective Java and A Reflective Component-Based Transaction Architecture

Reflective Java and A Reflective Component-Based Transaction Architecture Reflective Java and A Reflective Component-Based Transaction Architecture Zhixue Wu APM Ltd., Poseidon House, Castle Park, Cambridge CB3 0RD UK +44 1223 568930 zhixue.wu@citrix.com ABSTRACT In this paper,

More information

CS4961 Parallel Programming. Lecture 5: Data and Task Parallelism, cont. 9/8/09. Administrative. Mary Hall September 8, 2009.

CS4961 Parallel Programming. Lecture 5: Data and Task Parallelism, cont. 9/8/09. Administrative. Mary Hall September 8, 2009. CS4961 Parallel Programming Lecture 5: Data and Task Parallelism, cont. Administrative Homework 2 posted, due September 10 before class - Use the handin program on the CADE machines - Use the following

More information

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model Bulk Synchronous and SPMD Programming The Bulk Synchronous Model CS315B Lecture 2 Prof. Aiken CS 315B Lecture 2 1 Prof. Aiken CS 315B Lecture 2 2 Bulk Synchronous Model The Machine A model An idealized

More information

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction

More information

A Comparison of Unified Parallel C, Titanium and Co-Array Fortran. The purpose of this paper is to compare Unified Parallel C, Titanium and Co-

A Comparison of Unified Parallel C, Titanium and Co-Array Fortran. The purpose of this paper is to compare Unified Parallel C, Titanium and Co- Shaun Lindsay CS425 A Comparison of Unified Parallel C, Titanium and Co-Array Fortran The purpose of this paper is to compare Unified Parallel C, Titanium and Co- Array Fortran s methods of parallelism

More information

SYSTEMS MEMO #12. A Synchronization Library for ASIM. Beng-Hong Lim Laboratory for Computer Science.

SYSTEMS MEMO #12. A Synchronization Library for ASIM. Beng-Hong Lim Laboratory for Computer Science. ALEWIFE SYSTEMS MEMO #12 A Synchronization Library for ASIM Beng-Hong Lim (bhlim@masala.lcs.mit.edu) Laboratory for Computer Science Room NE43-633 January 9, 1992 Abstract This memo describes the functions

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

CS604 - Operating System Solved Subjective Midterm Papers For Midterm Exam Preparation

CS604 - Operating System Solved Subjective Midterm Papers For Midterm Exam Preparation CS604 - Operating System Solved Subjective Midterm Papers For Midterm Exam Preparation The given code is as following; boolean flag[2]; int turn; do { flag[i]=true; turn=j; while(flag[j] && turn==j); critical

More information

Hardware Memory Models: x86-tso

Hardware Memory Models: x86-tso Hardware Memory Models: x86-tso John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 9 20 September 2016 Agenda So far hardware organization multithreading

More information

Parallel Processors. Session 1 Introduction

Parallel Processors. Session 1 Introduction Parallel Processors Session 1 Introduction Applications of Parallel Processors Structural Analysis Weather Forecasting Petroleum Exploration Fusion Energy Research Medical Diagnosis Aerodynamics Simulations

More information

f %. School of Computer Science Carnegie Mellon University Pittsburgh, Pennsylvania 15213

f %. School of Computer Science Carnegie Mellon University Pittsburgh, Pennsylvania 15213 r CLEARED,E F.EVIEW,.F I? rn*-i:s.!-.)c: NOT I PTLY Ty!:"'o cr~~~~ S~~. l',-,r -. D~TEPAP~rMEN U' 7EEN E...... :,_ OCT 2 1991 12 D T I(, Program Translation Tools for Systolic Arrays N00014-87-K-0385 Final

More information

Processor Architecture and Interconnect

Processor Architecture and Interconnect Processor Architecture and Interconnect What is Parallelism? Parallel processing is a term used to denote simultaneous computation in CPU for the purpose of measuring its computation speeds. Parallel Processing

More information

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes:

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes: BIT 325 PARALLEL PROCESSING ASSESSMENT CA 40% TESTS 30% PRESENTATIONS 10% EXAM 60% CLASS TIME TABLE SYLLUBUS & RECOMMENDED BOOKS Parallel processing Overview Clarification of parallel machines Some General

More information

Xinu on the Transputer

Xinu on the Transputer Purdue University Purdue e-pubs Department of Computer Science Technical Reports Department of Computer Science 1990 Xinu on the Transputer Douglas E. Comer Purdue University, comer@cs.purdue.edu Victor

More information

Optimizing Irregular HPF Applications Using Halos Siegfried Benkner C&C Research Laboratories NEC Europe Ltd. Rathausallee 10, D St. Augustin, G

Optimizing Irregular HPF Applications Using Halos Siegfried Benkner C&C Research Laboratories NEC Europe Ltd. Rathausallee 10, D St. Augustin, G Optimizing Irregular HPF Applications Using Halos Siegfried Benkner C&C Research Laboratories NEC Europe Ltd. Rathausallee 10, D-53757 St. Augustin, Germany Abstract. This paper presents language features

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University 740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University Readings: Memory Consistency Required Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess

More information

One-Sided Append: A New Communication Paradigm For PGAS Models

One-Sided Append: A New Communication Paradigm For PGAS Models One-Sided Append: A New Communication Paradigm For PGAS Models James Dinan and Mario Flajslik Intel Corporation {james.dinan, mario.flajslik}@intel.com ABSTRACT One-sided append represents a new class

More information

Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming

Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Fabiana Leibovich, Laura De Giusti, and Marcelo Naiouf Instituto de Investigación en Informática LIDI (III-LIDI),

More information

Petri-net-based Workflow Management Software

Petri-net-based Workflow Management Software Petri-net-based Workflow Management Software W.M.P. van der Aalst Department of Mathematics and Computing Science, Eindhoven University of Technology, P.O. Box 513, NL-5600 MB, Eindhoven, The Netherlands,

More information

A Comparison of Relativistic and Reader-Writer Locking Approaches to Shared Data Access

A Comparison of Relativistic and Reader-Writer Locking Approaches to Shared Data Access A Comparison of Relativistic and Reader-Writer Locking Approaches to Shared Data Access Philip W. Howard, Josh Triplett, and Jonathan Walpole Portland State University Abstract. This paper explores the

More information

A Framework for Space and Time Efficient Scheduling of Parallelism

A Framework for Space and Time Efficient Scheduling of Parallelism A Framework for Space and Time Efficient Scheduling of Parallelism Girija J. Narlikar Guy E. Blelloch December 996 CMU-CS-96-97 School of Computer Science Carnegie Mellon University Pittsburgh, PA 523

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Python in the Cling World

Python in the Cling World Journal of Physics: Conference Series PAPER OPEN ACCESS Python in the Cling World To cite this article: W Lavrijsen 2015 J. Phys.: Conf. Ser. 664 062029 Recent citations - Giving pandas ROOT to chew on:

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Datenbanksysteme II: Caching and File Structures. Ulf Leser

Datenbanksysteme II: Caching and File Structures. Ulf Leser Datenbanksysteme II: Caching and File Structures Ulf Leser Content of this Lecture Caching Overview Accessing data Cache replacement strategies Prefetching File structure Index Files Ulf Leser: Implementation

More information

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore

More information

Implementing Sequential Consistency In Cache-Based Systems

Implementing Sequential Consistency In Cache-Based Systems To appear in the Proceedings of the 1990 International Conference on Parallel Processing Implementing Sequential Consistency In Cache-Based Systems Sarita V. Adve Mark D. Hill Computer Sciences Department

More information

Principles in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008

Principles in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008 Principles in Computer Architecture I CSE 240A (Section 631684) CSE 240A Homework Three November 18, 2008 Only Problem Set Two will be graded. Turn in only Problem Set Two before December 4, 2008, 11:00am.

More information

Evaluating the Performance of Software Distributed Shared Memory as a Target for Parallelizing Compilers

Evaluating the Performance of Software Distributed Shared Memory as a Target for Parallelizing Compilers Evaluating the Performance of Software Distributed Shared Memory as a Target for Parallelizing Compilers Alan L. Cox y, Sandhya Dwarkadas z, Honghui Lu y and Willy Zwaenepoel y y Rice University Houston,

More information

Free upgrade of computer power with Java, web-base technology and parallel computing

Free upgrade of computer power with Java, web-base technology and parallel computing Free upgrade of computer power with Java, web-base technology and parallel computing Alfred Loo\ Y.K. Choi * and Chris Bloor* *Lingnan University, Hong Kong *City University of Hong Kong, Hong Kong ^University

More information

CS4230 Parallel Programming. Lecture 7: Loop Scheduling cont., and Data Dependences 9/6/12. Administrative. Mary Hall.

CS4230 Parallel Programming. Lecture 7: Loop Scheduling cont., and Data Dependences 9/6/12. Administrative. Mary Hall. CS4230 Parallel Programming Lecture 7: Loop Scheduling cont., and Data Dependences Mary Hall September 7, 2012 Administrative I will be on travel on Tuesday, September 11 There will be no lecture Instead,

More information

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose Department of Electrical and Computer Engineering University of California,

More information

NOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem.

NOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem. Memory Consistency Model Background for Debate on Memory Consistency Models CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley for a SAS specifies constraints on the order in which

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

A Compiler-Directed Cache Coherence Scheme Using Data Prefetching

A Compiler-Directed Cache Coherence Scheme Using Data Prefetching A Compiler-Directed Cache Coherence Scheme Using Data Prefetching Hock-Beng Lim Center for Supercomputing R & D University of Illinois Urbana, IL 61801 hblim@csrd.uiuc.edu Pen-Chung Yew Dept. of Computer

More information

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1]) EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,

More information

Metaprogrammable Toolkit for Model-Integrated Computing

Metaprogrammable Toolkit for Model-Integrated Computing Metaprogrammable Toolkit for Model-Integrated Computing Akos Ledeczi, Miklos Maroti, Gabor Karsai and Greg Nordstrom Institute for Software Integrated Systems Vanderbilt University Abstract Model-Integrated

More information

Computer Architecture Today (I)

Computer Architecture Today (I) Fundamental Concepts and ISA Computer Architecture Today (I) Today is a very exciting time to study computer architecture Industry is in a large paradigm shift (to multi-core and beyond) many different

More information