An Initial Comparison of Implicit and Explicit Programming Styles for Distributed Memory Multiprocessors

Proceedings of the 28th Annual Hawaii International Conference on System Sciences, 1995

Matthew Haines* (ICASE, NASA Langley Research Center, Hampton, VA) and Wim Böhm+ (Computer Science Department, Colorado State University, Fort Collins, CO)

Abstract

Management of parallel tasks and distributed data are the essence of parallel programming on distributed memory multiprocessors, and can be expressed explicitly in the programming language, or provided implicitly through some combination of language and runtime support. Functional languages are designed to provide implicit support for both task and data management, but are often less efficient than explicit approaches. This is the classical tension between performance and ease of programming. This paper provides an initial study which attempts to quantify this tradeoff. While our quantitative results are accurate at capturing the scales for programming effort and efficiency of these programming methods, our results are based on two small parallel programs, and should be weighed accordingly.

1 Introduction

Programming today's large-scale distributed memory multiprocessors requires the management of both parallel tasks and distributed data structures. Due to the lack of sophisticated language support for these machines, these chores are typically done in an explicit manner, using machine-specific constructs for spawning and synchronizing tasks and for message passing. The resulting programs are difficult and time-consuming to write, and contain a large amount of machine-dependent housekeeping code not germane to the specification of the problem.

* Supported in part by a grant from Sandia National Laboratories while at Colorado State University, and by the National Aeronautics and Space Administration under NASA Contract No. NASA-19480 while in residence at ICASE.
+ Supported in part by NSF grant MIP.

This situation has led to the initiation of many research projects [9, 10, 19, 1, 12] whose goal is to raise the level of abstraction for the programmer, while still providing performance that (hopefully) comes close to machine-specific code. All of these approaches employ an imperative programming language paradigm (typically Fortran), whose side effects can have a detrimental effect on the ability of a compiler to perform dependence analysis. An alternative approach is to employ a functional language, such as Sisal [14], Haskell [11], or Id [15], to expose the natural parallelism in an application, and then utilize compiler and runtime support to provide for implicit management of both parallel tasks and distributed data structures.

VISA [8] is a runtime system designed to provide implicit task and memory management on distributed memory multiprocessors. The compiler (user) is provided with a shared memory abstraction and a set of primitives for allocating and accessing shared data structures within the virtual address space. Data structures are allocated using a variety of data decompositions specified by a set of predefined or user-defined mapping functions. We have used VISA to provide a distributed memory implementation of Sisal, and in [6] we present an outline of the distributed Sisal system as well as the performance for a large set of programs.
In this paper, we present an overview of the VISA system (to gain an understanding as to how implicit data distribution and addressing in a distributed memory environment might be done) and a comparison of implicit and explicit programming methods for distributed memory multiprocessors. Our goal is to quantify the tradeoff between programming effort and efficiency for implicit and explicit programming styles on distributed memory multiprocessors.

To compare implicit and explicit programming methods, we have selected two programs, one whose performance is heavily influenced by task management and the other whose performance is heavily influenced by data management. We encode these programs using three combinations of implicit and explicit task and data management: (1) Sisal with VISA, (2) explicit parallel C with VISA, and (3) explicit parallel C with message passing. We then measure the programming effort for these three methods using a combination of programming time and total lines of code. Although current state-of-the-art software engineering metrics may provide a more robust analysis, we feel that our metrics do provide a good measure of the relative complexity of a program and the effort required to create the program. We then compare programming effort with the measured efficiency of each paradigm, in terms of both execution time and space. The results of our experiment help to quantify the tension between programming effort and efficiency in distributed memory programming, providing goals for both implicit methods (to improve efficiency) and explicit methods (to reduce programming effort). It should be noted that this experiment is only an initial study of this tradeoff, and fails to take into consideration the effects of large-scale programs and additional paradigms, such as object-oriented programming. Clearly, such comparisons would be useful.

The next section provides an overview of VISA, a system for implicit data management. Section 3 describes the programs used in evaluating the three programming methods, and a description of the programming effort of each approach. Section 4 provides the performance of each of the programs and an analysis of the results. Section 5 provides a brief description of related research projects and a comparison with previous work.

2 The Design and Operation of VISA

Address translation is the process of translating a global address into a processor-local address (<processor, offset>), and is required in order to maintain a single addressing space in a distributed memory multiprocessor. VISA (Virtual Shared Addressing) is a distributed memory runtime system that provides a single addressing space and general data decomposition functions to a programmer or compiler. Although other approaches exist which provide language support for a single addressing space [10, 19], they are typically limited to supporting only shared arrays (i.e., a global index space, not a global address space), and must pre-process all parallel loops to extract the runtime values necessary to compute the processor-local addresses. This pre-processing can occur at compile time or at run time, with the latter being similar in runtime cost to the VISA approach. VISA, on the other hand, is designed as a true shared addressing space capable of supporting both scalar and aggregate data structures, and performing address translation on the fly. This eliminates the need to pre-compute communication schedules, which, when done at runtime, can incur a significant overhead [16]. VISA also eliminates the need to pre-fetch and store all remote values that will be needed during a parallel loop computation. In contrast, most compiler-based systems must either pre-fetch all remote data needed for a parallel loop computation (expensive in space), or perform strip-mining analysis to divide the parallel loop into smaller sections so that the amount of incoming data can be tolerated (expensive in time).
Though we acknowledge that it can be more expensive to satisfy remote references on the fly, we implement several optimizations to reduce the burden, including split-phase transactions and multithreading.

To use VISA, a compiler (or user) augments a parallel program with VISA primitives for allocating and accessing the data structures to be kept in the single addressing space. Any variables not placed in the VISA space are unaffected by the system, and remain local to each processor. The augmented program is then compiled using the native language compiler of choice, and linked with the VISA library to create the object program, which can then be executed on a distributed memory multiprocessor.

2.1 Message Passing

All message passing required for accessing remote values is handled implicitly by the VISA system through the use of a message passing abstraction, supporting both synchronous (blocking) and asynchronous (non-blocking) operations. Since these operations are provided by most host operating systems for distributed memory multiprocessors, VISA can be easily ported to other distributed memory multiprocessors by modifying the message passing abstraction to make the proper native calls.
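To illustrate what such a portable abstraction might look like, the following minimal sketch wraps a handful of send and receive operations behind a fixed interface. The function names are invented, and MPI is used here only as a stand-in for the native calls of the host system; porting would then amount to rewriting these few wrappers.

    /* Hypothetical sketch of VISA's message passing abstraction.  None of
     * these names come from the paper; MPI merely stands in for the native
     * calls, so that porting VISA means re-implementing these wrappers. */
    #include <mpi.h>

    /* Blocking send: deliver 'len' bytes of 'buf' to 'dest' as type 'tag'. */
    static int visa_msg_send(int dest, int tag, void *buf, int len)
    {
        return MPI_Send(buf, len, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
    }

    /* Blocking receive: wait for any message of type 'tag'; returns the sender. */
    static int visa_msg_recv(int tag, void *buf, int maxlen)
    {
        MPI_Status status;
        MPI_Recv(buf, maxlen, MPI_BYTE, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &status);
        return status.MPI_SOURCE;
    }

    /* Non-blocking send: initiate delivery and return immediately. */
    static int visa_msg_send_async(int dest, int tag, void *buf, int len,
                                   MPI_Request *req)
    {
        return MPI_Isend(buf, len, MPI_BYTE, dest, tag, MPI_COMM_WORLD, req);
    }

    /* Non-blocking poll: has a message of type 'tag' arrived yet? */
    static int visa_msg_poll(int tag)
    {
        int arrived = 0;
        MPI_Iprobe(MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &arrived, MPI_STATUS_IGNORE);
        return arrived;
    }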

2.2 Data Distribution

As depicted in Figure 1, the VISA address space is a virtual space spanning portions of the local memory from each participating node. This creates two types of address spaces for each participating node in the system: a shared virtual addressing space that spans all of the processors, and a local address space for data visible only to the local processor.

Figure 1: The VISA addressing space

Each data structure allocated to the VISA space receives a contiguous set of virtual addresses which are shared among the participating nodes in accordance with the data distribution (or mapping) function. The data distribution scheme determines how the physical storage for a global data structure is to be divided among the participating nodes. The goal is to divide the data structure among the nodes so as to minimize the number of remote references caused by the distribution. This means that the distribution of data must be tied to the access pattern of the parallel computation, and therefore data distribution needs to be flexible enough to support a wide variety of access patterns.

For VISA, data distribution is accomplished by dividing a data structure into a set of blocks, where each block contains blocksize elements. The blocks are then allocated to the physical memories of the nodes in round-robin fashion. This is similar to the block-cyclic distribution scheme proposed for HPF [9] and used in the Fortran D compiler [10]. By dividing the data into blocks, we can minimize the amount of storage required for the translation table (the table that stores the information necessary to translate a virtual address into a processor-local address). Rather than one entry in the table for each element, we need only one entry for the entire data structure, since we can compute the remaining element positions from the data distribution parameters (blocksize, stride, etc.). By reducing the size of the translation table to one entry per structure, we can efficiently replicate it onto the participating processors rather than having to store it as a distributed structure itself. The result is that any virtual address reference will result in at most one remote reference, rather than the two that might occur when the translation table is distributed. The disadvantage of this approach is that fully general (irregular) distributions cannot be supported, since this would require one translation table entry for each element of the distributed data structure.

The block-cyclic distribution scheme can be represented with a few control parameters for each data structure, including the size of each block (blocksize), the node to which the first block is assigned (start_node), and the processor stride at which the blocks are distributed (stride). A fourth control parameter specifies whether or not a data structure is to be replicated. Table 1 shows how these parameters can be modified to achieve a variety of one-dimensional mapping functions.

Table 1: Control parameter settings for various 1D mapping functions (columns: Mapping Func, Blocksize, Start PE, Replicate)

For example, cyclic distribution (where each processor gets one element in turn until all elements have been assigned) can be implemented with a blocksize of 1 and a stride of 1. The starting processor can be specified with an argument to the mapping function (maparg). For the general case of block-cyclic, the maparg can be used to specify the blocksize. The stride is always 1 for these one-dimensional mapping functions, but varies for multi-dimensional mapping functions (see [3] for more details on multi-dimensional data structures in VISA).
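The following sketch restates these control parameters as a C structure and shows how the one-dimensional mappings discussed above can be expressed with them. The structure and function names are illustrative only, and the block case assumes blocksize = ceil(n/P), a conventional choice that the paper does not spell out.

    /* Illustrative only: a C rendering of the four control parameters and
     * of the 1D mapping functions described in the text.  Names are not
     * taken from VISA; the block case assumes blocksize = ceil(n/P). */
    typedef struct {
        long blocksize;   /* elements per block                     */
        int  start_node;  /* node that receives the first block     */
        int  stride;      /* processor stride between blocks        */
        int  replicate;   /* non-zero: full copy on every node      */
    } dist_params_t;

    /* Cyclic: one element per node in turn, starting at node 'start'. */
    static dist_params_t map_cyclic(int start)
    {
        dist_params_t d = { 1, start, 1, 0 };
        return d;
    }

    /* Block-cyclic: blocks of 'maparg' elements dealt out round-robin. */
    static dist_params_t map_block_cyclic(long maparg)
    {
        dist_params_t d = { maparg, 0, 1, 0 };
        return d;
    }

    /* Block: one contiguous block per node (assumed blocksize). */
    static dist_params_t map_block(long n, int P)
    {
        dist_params_t d = { (n + P - 1) / P, 0, 1, 0 };
        return d;
    }

    /* Replicated: every node holds the entire structure. */
    static dist_params_t map_replicated(long n)
    {
        dist_params_t d = { n, 0, 1, 1 };
        return d;
    }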
The data distribution function is passed as an argument to the VISA data allocation routine, visaslalloc, which updates the translation tables and allocates the space required to store the data. VISA will also accept a user-defined mapping function, so long as it establishes values for all of the mapping parameters.
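As a concrete, though hypothetical, picture of this usage model, the sketch below allocates an array into the shared space with one of the mapping functions from the previous sketch and then accesses it through global indices. The primitive names (visa_alloc, visa_read, visa_write) and their signatures are invented for illustration; only the workflow of allocating with a mapping and then accessing through the shared space is taken from the text.

    /* Illustration only: the primitives below are invented stand-ins for
     * the actual VISA allocation and access routines.  dist_params_t and
     * map_block() come from the previous sketch. */
    #include <stdio.h>

    typedef long visa_addr_t;   /* a global (virtual shared) address */

    visa_addr_t visa_alloc(long n, dist_params_t mapping);              /* allocate n doubles */
    double      visa_read(visa_addr_t base, long index);                /* shared-space read  */
    void        visa_write(visa_addr_t base, long index, double value); /* shared-space write */

    void scale_shared_array(long n, int P)
    {
        /* A distributed array placed in the VISA space, block distributed. */
        visa_addr_t a = visa_alloc(n, map_block(n, P));

        /* An ordinary local variable: not placed in the VISA space, so it
         * remains private to this processor and is untouched by the system. */
        double local_sum = 0.0;

        for (long i = 0; i < n; i++) {
            double v = visa_read(a, i);   /* may resolve locally or remotely */
            visa_write(a, i, 2.0 * v);    /* address translation is implicit */
            local_sum += v;
        }
        printf("local contribution: %f\n", local_sum);
    }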

2.3 Address Translation

Address translation is the process of obtaining the physical address of a datum given its virtual address. For a distributed memory multiprocessor, a physical address consists of the doublet (node, pa), where node is a node designator and pa is the physical address within that node. Since VISA employs a block-cyclic addressing scheme, where the blocksize, starting node, and stride may all vary, it is necessary to store these control parameters, along with other information about each data structure, in a descriptor called a range-map entry. The entire VISA space is therefore described by the collection of these entries, called the range-map table, or translation table. The term range refers to the fact that, since all data structures are assigned contiguous addresses in both virtual and physical spaces, the range (low, high) is sufficient to represent all of the addresses within a data structure. In addition to the control parameters, each range-map entry contains three address ranges for each data structure:

- the visa_base, representing the range of virtual (global) addresses for this data structure,
- the local_base, representing the range of physical (local) addresses for the blocks allocated to this data structure, and
- the optimized_base, representing the optimized range of global addresses. (This allows VISA to perform an important optimization of not having to translate virtual addresses that would result in a local physical address.)

General address translation then proceeds as follows:

- The range-map entry for the desired data structure is fetched by the find_rm() routine. This routine is exposed to the compiler so that the range-map entry for a data structure that is to be accessed many times need only be fetched once.

- From a virtual address, the relative element position within the data structure (element), the block containing the desired element (block), and the offset of the desired element within this block (block_offset) are computed:

      element      = address - low_range
      block        = element / blocksize
      block_offset = element mod blocksize

- Next, the node which owns the block (node), the block number within that node (node_block), and the relative offset of the actual datum within the node (rel_offset) are computed, where P is the number of participating nodes:

      node       = (start_node + (block * stride)) mod P
      node_block = block / P
      rel_offset = node_block * blocksize + block_offset

- If the access is local, rel_offset is incremented by the local_base from the range-map entry to produce the actual offset in local physical memory:

      offset = rel_offset + local_base

  Otherwise (the access is remote), a message is sent to node, requesting that the desired datum be fetched and returned.

An alternative to this address translation scheme is to have a fixed blocksize, start_node, and stride for every data structure. Address translation calculations can then proceed directly from the virtual address bits. However, we have implemented this fixed addressing scheme and found that, although the actual translation process is faster, the fixed control parameters often cause misalignment with the parallel loops which access the data structures, resulting in an excessive number of remote references and severely degrading the overall performance of the application [7].

We can, however, provide a significant optimization to the above translation scheme by eliminating the address translation step altogether for local references (which usually dominate the total number of references). The idea is to provide a set of water-mark values for each data structure which indicate the range of virtual addresses that will result in a local reference. These values can be stored in registers when the range-map entry for a data structure is retrieved (indicating that we are about to access the data structure). Then, a simple register comparison of the water-mark values with the virtual address is all that is necessary to establish whether the reference is local. If so, the pre-computed optimized_base is added to the virtual address to obtain the local physical address. Thus, for local references from the global address space, the overhead imposed by VISA is two compares and an add. For remote references, the overhead is larger, but compared with the cost of a remote fetch, the address translation time is minimal.
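The translation steps above transcribe almost directly into code. The following sketch does so in C, with the water-mark fast path for local references included; the range_map_t layout, function names, and the treatment of the remote case are invented for illustration, while the arithmetic follows the formulas given in the text.

    /* Sketch of general VISA address translation.  The struct layout and
     * names are illustrative; the arithmetic mirrors the formulas above. */
    typedef struct {
        long low_range;       /* first virtual address of the structure  */
        long blocksize;       /* elements per block                      */
        int  start_node;      /* node holding block 0                    */
        int  stride;          /* processor stride between blocks         */
        long local_base;      /* base of the locally held blocks         */
        long optimized_base;  /* pre-computed offset for local addresses */
        long low_water;       /* water marks: virtual addresses in       */
        long high_water;      /*   [low_water, high_water] are local     */
    } range_map_t;

    extern int P;        /* number of participating nodes */
    extern int my_node;  /* this processor's node number  */

    /* Translate a virtual address.  Returns the processor-local address for
     * a local reference; for a remote reference, a request would be sent to
     * the owning node (indicated here by returning -1). */
    long visa_translate(const range_map_t *rm, long address)
    {
        /* Water-mark fast path: two compares and an add for local data. */
        if (address >= rm->low_water && address <= rm->high_water)
            return address + rm->optimized_base;

        long element      = address - rm->low_range;
        long block        = element / rm->blocksize;
        long block_offset = element % rm->blocksize;

        int  node       = (int)((rm->start_node + block * rm->stride) % P);
        long node_block = block / P;
        long rel_offset = node_block * rm->blocksize + block_offset;

        if (node == my_node)
            return rel_offset + rm->local_base;

        /* Remote: send a fetch request to 'node' (for example through the
         * message passing abstraction sketched in Section 2.1) and wait
         * for the split-phase reply. */
        return -1;
    }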
Having outlined how an implicit data management system (VISA) works, we can now discuss how such a system is used to relieve the programmer of these details.

3 Evaluation

We implement a variety of sample programs in Sisal and execute them on the nCUBE/2 distributed memory multiprocessor using the VISA runtime system to provide the necessary task and data management support, and in [8] we report on the efficiency of these codes.

In this paper, we wish to perform an experiment which attempts to quantify the tradeoff between programming effort and efficiency. To do this, we compare implicit (low programmer effort) and explicit (high efficiency) approaches to task and data management. In general, a parallel program executing on a distributed memory multiprocessor must address two issues, either explicitly or implicitly:

Task management. Parallel execution is achieved by dividing the independent portions of the program into parallel tasks, distributing these tasks among the participating nodes for parallel execution, and synchronizing their results so that the computation remains determinate.

Memory management. Global data structures need to be distributed among the participating nodes in such a way as to minimize the number of remote references generated by the execution of the parallel tasks. Once a distribution is agreed upon, the program must identify those references that fall outside of the local distribution (i.e., remote references), and communicate each request to the node which contains the value.

Figure 2: Parallel programming style combinations

Given these two orthogonal programming issues (task and data management), either of which may be handled explicitly or implicitly, we can construct a simple classification of the programming space (depicted in Figure 2):

1. Explicit task management using parallel C and explicit memory management using message passing primitives. This style represents the assembly language of distributed memory computing: fast and dirty. It offers the lowest level of abstraction, but by exposing all aspects of task and data management to the user, high levels of performance can be achieved.

2. Explicit task management using parallel C and implicit memory management provided by the VISA runtime system. This style provides a higher level of abstraction for the program data (shared memory rather than message passing), but does nothing to increase the task management abstraction.

3. Implicit task management using Sisal and explicit memory management using message passing primitives. This style could be represented by a machine-dependent Sisal compiler that has been modified to generate explicit message passing code. However, such a modification to the compiler has only recently been undertaken, and thus we cannot expand on results for this style in our analysis.

4. Implicit task management using Sisal and implicit memory management provided by the VISA runtime system. This represents the opposite end of the (programming effort) spectrum from explicit parallel C with message passing. The user sees a single address space and sequential program semantics, and the underlying systems perform all of the necessary transformations to achieve parallel execution on a distributed memory machine.

We acknowledge that these (Sisal and VISA) are not the only systems that could be combined to cover this classification, and further investigation of these styles using other options remains an open area of research. To measure the relative merits of each style, in terms of programming effort and execution time, we encode two programs in three of the programming styles (1, 2, and 4) specified above. The first program, Lawrence Livermore Loop #7, is selected to highlight the effects of explicit and implicit data management, and the other program, SOR, is selected to highlight the effects of implicit and explicit task management.
As a practical consideration, these programs are relatively simple to allow for the various encoding methods of each program in a reasonable amount of time. A more detailed study would encode at least one real application and consider the effects of algorithm complexity and large-scale programming on the tradeoff of programming effort versus efficiency. However, due to practical time and space considerations for this paper, we limit our initial study to the following programs:

- Lawrence Livermore Loop #7. This program creates an array A from an input array B and constants R, T, C1, and C2, where A_i is defined as:

      A_i = B_i + R*C1 + R^2*C2
            + T*B_{i+3} + T*R*B_{i+2} + T*R^2*B_{i+1}
            + T^2*B_{i+6} + T^2*R*B_{i+5} + T^2*R^2*B_{i+4}

  With very little task management required, this problem highlights the differences between the implicit and explicit memory management styles.

- Successive Over-Relaxation (SOR). This program performs a smoothing operation on an array by iterating over the array and computing each new A_i as the average of the previous iteration's A_{i-1}, A_i, and A_{i+1}. The access pattern is fixed over all of the iterations, and the array is distributed among the nodes in equal-sized blocks, matching the distribution of the parallel (inner) loop to minimize the remote references. The iteration loop in this program provides a method of controlling the amount of synchronization required, thus highlighting the differences between implicit and explicit task management.

Both of these programs were encoded by the same person (who was not any more adept at functional programming than imperative programming), using the three programming styles as follows:

- Sisal with VISA. Both codes were transformed into Sisal directly from their mathematical descriptions. The code specifies only what is to be computed, not how the computations are to proceed. The result is a machine-independent specification of the problem that runs on any machine Sisal supports, and one which is not biased by the structure of imperative programming languages.

- Explicit parallel C with VISA. Moving into explicit task management, the codes have to specify how the parallel loop is to be divided among the workers, and how explicit synchronization is to be performed. Memory management is handled by the VISA system; however, for the Livermore Loop #7 code, special registers were employed to cache the values of the B array so that multiple remote references to retrieve the same value were eliminated. This is a prime example of how optimizations can be exploited once the abstract covers of the system have been removed. However, optimizations such as this also add to the difficulty of understanding and maintaining codes at this level of abstraction.

- Explicit parallel C with message passing. Moving away from the VISA system, the explicit task management code is augmented with explicit message passing. The program is designed to optimize the number of remote references required and to perform all remote references before the computation loop is initiated, pre-fetching remote values into local overlap regions. Also, the communication model is changed from the interrupt-driven request/reply model used in VISA to a synchronous read/write model so that the overhead of the interrupt handler can be avoided. The computation (inner) loop now runs completely without remote references or interrupts, thus improving its cache behavior. Special buffers are used to hold the pre-fetched values, and synchronous communication phases are necessary to avoid deadlock. The distribution of data among the processors is also explicitly stated, and altering this distribution would require re-coding both the explicit communication and computation phases. The result is a program that is capable of exploiting, and controlling, all aspects of the memory and processor hierarchies. (A sketch of this overlap-region structure for SOR appears after this list.)
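The sketch below illustrates the communication and computation structure just described for the SOR program: an equal block of the array on each node, one-element overlap regions on either side, a synchronous boundary exchange before each iteration, and a pointer swap afterwards. MPI is used as a stand-in for the native message passing calls, and the boundary treatment at the ends of the global array is an assumed choice; none of this code is taken from the paper.

    /* Illustrative sketch (not the paper's code) of SOR with explicit
     * message passing: block distribution, one-element overlap (ghost)
     * cells, a synchronous exchange each iteration, then a purely local
     * computation loop and a pointer swap. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    void sor_block(double *block, long n_local, int iters,
                   int me, int P, MPI_Comm comm)
    {
        /* prev/next hold n_local elements plus one ghost cell per side. */
        double *prev = malloc((n_local + 2) * sizeof(double));
        double *next = malloc((n_local + 2) * sizeof(double));
        memcpy(prev + 1, block, n_local * sizeof(double));

        int left  = (me > 0)     ? me - 1 : MPI_PROC_NULL;
        int right = (me < P - 1) ? me + 1 : MPI_PROC_NULL;

        for (int it = 0; it < iters; it++) {
            /* Synchronous boundary exchange: fill the overlap regions. */
            MPI_Sendrecv(&prev[1], 1, MPI_DOUBLE, left, 0,
                         &prev[n_local + 1], 1, MPI_DOUBLE, right, 0,
                         comm, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&prev[n_local], 1, MPI_DOUBLE, right, 1,
                         &prev[0], 1, MPI_DOUBLE, left, 1,
                         comm, MPI_STATUS_IGNORE);
            /* No neighbor at the global ends: replicate the edge value
             * into the ghost cell (an assumed boundary treatment). */
            if (left  == MPI_PROC_NULL) prev[0]           = prev[1];
            if (right == MPI_PROC_NULL) prev[n_local + 1] = prev[n_local];

            /* Computation loop: runs without any remote references. */
            for (long i = 1; i <= n_local; i++)
                next[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3.0;

            double *tmp = prev; prev = next; next = tmp;  /* pointer swap */
        }

        memcpy(block, prev + 1, n_local * sizeof(double));
        free(prev);
        free(next);
    }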
4 Results and Analysis

The goal of our experiment is to evaluate and quantify the tension that exists between programming effort and efficiency. In particular, we wish to quantify these qualities for our test programs. Quantifying efficiency is simple: we can measure the time and space usage of each program for each programming style. Quantifying programming effort, however, is much more nebulous. We have settled on a combination of two measures to quantify programming effort: lines of code and the time required to write the programs. By combining these two, we capture the effort required in writing terse, efficient programs, as well as longer programs written in a shorter amount of time because less care was taken with algorithms and optimizations. While these metrics may not represent the state of the art in parallel software engineering, most programmers would agree that they provide a reasonable approximation of programming effort.

Table 2 displays the programming effort in terms of the lines of code that the user is responsible for writing, and the approximate time it took to code and debug each of the programs. For brevity, we will represent the three programming styles in our tables and figures as follows: SISAL represents Sisal with VISA; C+VISA represents explicit parallel C with VISA; and C+MP represents explicit parallel C with message passing.

Table 2: Comparison of programming effort, in both time and space (rows: lines of code and time to encode in hours; columns: SISAL, C+VISA, and C+MP, for both LLNL Loop #7 and SOR)

Table 3: Execution times for LLNL Loop #7 (columns: PEs, array size, and the time in seconds for SISAL, C+VISA, and C+MP, with speedups Sp1 and Sp2 and their averages)

The claim that implicit parallel languages ease the problem of programming distributed memory multiprocessors is clearly supported by these numbers. As we move from Sisal to explicit C with VISA, and then to explicit C with message passing, the code becomes increasingly more complex, requiring increasingly more lines of code, and becoming more machine-dependent. The question, then, is whether the increased performance justifies the additional programming effort.

Table 3 (also depicted in Figure 3) gives the execution results for LLNL Loop #7 on the nCUBE/2, where Array Size represents the total size of the A and B arrays; Sp1 represents the speedup in going from SISAL to C+VISA (T_SISAL / T_C+VISA); and Sp2 represents the speedup in going from C+VISA to C+MP (T_C+VISA / T_C+MP). In order to highlight the performance gain achieved by explicit memory management, the blocksize (number of array elements per processor) was kept constant at 65,536 (2^16) double-precision elements. The data reveals that an average speedup of 1.47 is achieved when going from SISAL to C+VISA, which is due more to the memory caching optimization than to the explicit control of tasks. Additionally, an average speedup of 1.83 is achieved when moving from C+VISA to C+MP, demonstrating in part the overhead of the VISA system, but moreover the effectiveness of the pre-fetching optimization and the improved cache behavior. Thus, the improvements in efficiency that are achieved by lowering the programming abstraction can be attributed to the indirect overhead of not being able to take advantage of certain optimizations, rather than to the direct overhead of maintaining such abstractions.

In terms of space requirements, SISAL uses the minimum: two arrays of size n, one for A and one for B. C+VISA allocates an additional 7 double-precision locations per array to cache the values of B_i through B_{i+6} so that they need only be retrieved once. C+MP also allocates an additional block of 7 elements to store the values of B that reside on the neighboring node. Thus, both C+VISA and C+MP employ additional memory (on the order of the size of the overlap area) to achieve their optimizations. Whether this overhead is acceptable or not depends largely on the application.
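To make the caching optimization concrete, the following sketch (not the paper's code) shows one way the C+VISA version of Loop #7 can keep a sliding window of seven B values in locals, so that each element of B is fetched through the shared address space only once rather than up to seven times. It reuses the hypothetical visa_read/visa_write primitives sketched in Section 2 and the kernel definition from Section 3.

    /* Illustration of the B-value caching described above: a sliding
     * window b[0..6] holds B[i]..B[i+6], so each B element is read from
     * the VISA space exactly once.  The primitives are the hypothetical
     * ones sketched earlier; B is assumed to hold hi+6 elements. */
    void loop7_cached(visa_addr_t A, visa_addr_t B, long lo, long hi,
                      double R, double T, double C1, double C2)
    {
        double R2 = R * R, T2 = T * T;
        double b[7];

        /* Prime the window with B[lo] .. B[lo+6]. */
        for (int k = 0; k < 7; k++)
            b[k] = visa_read(B, lo + k);

        for (long i = lo; i < hi; i++) {
            double a = b[0] + R * C1 + R2 * C2
                     + T  * b[3] + T  * R * b[2] + T  * R2 * b[1]
                     + T2 * b[6] + T2 * R * b[5] + T2 * R2 * b[4];
            visa_write(A, i, a);

            /* Slide the window: drop B[i], fetch B[i+7] once. */
            if (i + 1 < hi) {
                for (int k = 0; k < 6; k++)
                    b[k] = b[k + 1];
                b[6] = visa_read(B, i + 7);
            }
        }
    }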

Figure 3: Execution times for LLNL Loop #7

Figure 4: Execution times for SOR

Table 4: Execution times for SOR (columns: PEs, blocksize, the ratio of iterations to blocksize, and the time in seconds for SISAL, C+VISA, and C+MP, with speedups Sp1 and Sp2)

Table 4 (also depicted in Figure 4) gives the execution results for SOR, where a constant, power-of-two array size of double-precision elements and 128 iterations are used to highlight the performance gain achieved by explicit task management. By holding the array size constant, we cause the blocksize to decrease as the number of processors increases, and thus the ratio of iterations to blocksize increases as the number of processors increases. This ratio of iterations to blocksize represents the increasing emphasis being placed on task management. In moving from SISAL to C+VISA, there is an average speedup of 1.20, which starts as a performance decrease and improves as the ratio of iterations to blocksize increases, placing greater emphasis on task management in the total execution time. The initial loss in performance is due to the ability of the Sisal compiler to generate code that is highly optimized, which sometimes outperforms normal hand-coded C [2]. However, this small gain is quickly lost as the complex Sisal task management system is outperformed by the hand-coded C task management. In moving from C+VISA to C+MP, there is an average speedup of 1.64, again representing the overhead of VISA plus the effectiveness of pre-fetching remote references and the improved cache behavior. The single-processor time of explicit C with message passing shows the enormous overhead of synchronization that this problem creates, which is not as visible in the other two approaches due to the overhead of VISA.

In terms of space requirements, C+VISA uses two arrays of size n, one for the previous iteration and one for the current iteration, and swaps pointers at the end of each iteration. This is the minimal space requirement for SOR. The Sisal compiler also recognizes this optimization, but generates the two swap arrays only after generating an array to hold the initial values, resulting in a space overhead of n elements. C+MP uses only the two necessary arrays, but allocates an additional two elements per processor to hold the pre-fetched remote values from the neighboring nodes.

5 Related Research

In [2], shared memory implementations of Sisal and Fortran are compared, and the shared memory implementation of Sisal compares favorably with Fortran on a wide variety of benchmarks. By providing a virtual shared memory runtime system, we have taken this shared memory implementation to distributed memory machines. In [4] we introduced the design of our task management system; in [5] we quantify the characteristics of software multithreading; and in [6] we outline the distributed Sisal runtime system, of which VISA is a part.

Another area of research that offers a language-independent shared memory paradigm is Distributed Shared Memory [13]. However, the inability to couple parallel tasks tightly with the distribution of data, which is controlled implicitly by the operating system, can result in misalignment, causing excessive message passing. Also, since the granularity of sharing data in these systems is often very large (typically a page), contention, or false sharing, can occur, in which two unrelated data items exist on the same sharable unit, prohibiting simultaneous access.

The most common alternatives to programming distributed memory multiprocessors with an explicit parallel language and message passing are distributed memory language compilers, such as Fortran D [10] and Vienna Fortran [19], both of which closely resemble the proposed HPF standard [9]. These systems offer the advantage of implicit management for both tasks and memory, and allow the programmer to use the familiar sequential shared memory programming paradigm. Although these systems have had success in implementing certain applications, they have not been uniformly accepted as the solution to distributed memory programming for various reasons, including their difficulties with exact dependence analysis [17] and their inability to efficiently handle irregular and adaptive codes without incurring large runtime overheads [18].

6 Conclusions

Our goal in this paper is to shed light on the tension, in terms of programming effort and efficiency, that exists between implicit and explicit programming styles for distributed memory multiprocessors. Towards this goal, we have conducted an initial study that compares the performance and programming effort of two programs using three programming styles: implicit task and data management, explicit task management with implicit data management, and explicit task and data management. Implicit task management is provided by the Sisal language compiler, which is designed for execution on shared memory multiprocessors [2]. Implicit data management is provided by the VISA runtime system, whose design and operation we have outlined in this paper. Explicit task and data management are provided through a native C compiler and parallel support library. We feel that our quantitative results are accurate at capturing the scales for programming effort and efficiency of these programming methods. However, it is important to note that our quantitative results are based on two small parallel programs, and should be weighed accordingly.
Sisal with VISA provides implicit management of both tasks and data, and offers reasonable performance while relieving the programmer of the implementation details of an architecture. The result is efficient machine-independent code that is portable among a wide range of architectures [2]. Furthermore, since the current Sisal compiler is unaware of distributed memory and the costs associated with accessing remote data, we expect a performance gain when such information is exploited by the compiler.

Explicit parallel C with VISA offers the ability to increase the performance of an application, but at the cost of increased size, programming effort, and machine-dependence. For our simple programs, an average speedup of 1.34 over Sisal is achieved, but at the cost of increasing the code size by an average factor of 7, and increasing the time required to encode and debug the programs by an average factor of 11.

Explicit parallel C with message passing offers the ability to exploit the problem and machine details to obtain the highest performance for a particular machine. In particular, we are able to better control both the memory hierarchy (by pre-fetching values) and the cache (by using overlap regions), both of which are necessary to achieve high performance. For our programs, average speedups of 1.74 over C with VISA and 2.34 over Sisal are achieved.

Once again, this increase in performance is obtained at the cost of increasing program sizes by an average factor of 2 over C+VISA and by an average factor of 15 over Sisal, while increasing the time required to encode and debug the programs by an average factor of 4 over C+VISA and by an average factor of 40 over Sisal.

References

[1] Bruno Achauer. The DOWL distributed object-oriented language. Communications of the ACM, 36(9):48-55, September.
[2] David Cann. Retire Fortran? A debate rekindled. Communications of the ACM, 35(8):81-89, August.
[3] Matthew Haines. Distributed Runtime Support for Task and Data Management. PhD thesis, Computer Science Department, Colorado State University, Fort Collins, Colorado, June. Also appears as a technical report of the Computer Science Department, Colorado State University.
[4] Matthew Haines and Wim Böhm. Towards a distributed memory implementation of Sisal. In Scalable High Performance Computing Conference. IEEE, April. Also appears as a technical report of the Computer Science Department, Colorado State University.
[5] Matthew Haines and Wim Böhm. An evaluation of software multithreading in a conventional distributed memory multiprocessor. In IEEE Symposium on Parallel and Distributed Processing, December.
[6] Matthew Haines and Wim Böhm. On the design of distributed memory Sisal. Journal of Programming Languages, 1.
[7] Matthew Haines and Wim Böhm. Task management, virtual shared memory, and multithreading in a distributed memory implementation of Sisal. In Arndt Bode, Mike Reeve, and Gottfried Wolf, editors, Parallel Architectures and Languages Europe. Springer-Verlag Lecture Notes in Computer Science, June.
[8] Matthew Haines and Wim Böhm. A virtual shared addressing system for distributed memory Sisal. In John T. Feo, editor, Proceedings of Sisal '93, San Diego, CA, October.
[9] High Performance Fortran Forum. High Performance Fortran Language Specification, version 1.0 edition, May.
[10] Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, August.
[11] Paul Hudak and Philip Wadler. Report on the functional programming language Haskell. Technical report, Yale University, New Haven, CT, December.
[12] Rodger Lea, Christian Jacquemot, and Eric Pillevesse. COOL: System support for distributed programming. Communications of the ACM, 36(9):37-46, September.
[13] Kai Li. Shared Virtual Memory on Loosely Coupled Multiprocessors. PhD thesis, Yale University, September.
[14] J. R. McGraw, S. K. Skedzielewski, S. J. Allan, R. R. Oldehoeft, J. Glauert, C. Kirkham, W. Noyce, and R. Thomas. SISAL: Streams and iteration in a single assignment language: Reference manual version 1.2. Manual M-146, Rev. 1, Lawrence Livermore National Laboratory, Livermore, CA, March.
[15] R. S. Nikhil. Id (Version 90.0) Reference Manual. Technical Report CSG Memo 284-1, MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139, USA, July. Supersedes: Id/83s (July 1985), Id Nouveau (July 1986), Id 88.0 (March 1988), Id 88.1 (August 1988).
[16] Joel Saltz, Ravi Mirchandaney, and Kay Crowley. Runtime parallelization and scheduling of loops. IEEE Transactions on Computers, 40(5), May. Also appears as an ICASE report.
[17] Zhiyu Shen, Zhiyuan Li, and Pen-Chung Yew. An empirical study of Fortran programs for parallelizing compilers. IEEE Transactions on Parallel and Distributed Systems, 1(3), July.
[18] Alan Sussman and Joel Saltz. A manual for the multiblock PARTI runtime primitives, revision 4. Technical Report CS-TR-3070 and UMIACS-TR-93-36, University of Maryland, Department of Computer Science and UMIACS, May.
[19] Hans Zima, Peter Brezany, Barbara Chapman, Piyush Mehrotra, and Andreas Schwald. Vienna Fortran: A language specification, version 1.1. Interim Report 21, Institute for Computer Applications in Science and Engineering, Hampton, VA, March.


More information

Software Architecture

Software Architecture Software Architecture Does software architecture global design?, architect designer? Overview What is it, why bother? Architecture Design Viewpoints and view models Architectural styles Architecture asssessment

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language

An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language Martin C. Rinard (martin@cs.ucsb.edu) Department of Computer Science University

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

Pull based Migration of Real-Time Tasks in Multi-Core Processors

Pull based Migration of Real-Time Tasks in Multi-Core Processors Pull based Migration of Real-Time Tasks in Multi-Core Processors 1. Problem Description The complexity of uniprocessor design attempting to extract instruction level parallelism has motivated the computer

More information

Reflective Java and A Reflective Component-Based Transaction Architecture

Reflective Java and A Reflective Component-Based Transaction Architecture Reflective Java and A Reflective Component-Based Transaction Architecture Zhixue Wu APM Ltd., Poseidon House, Castle Park, Cambridge CB3 0RD UK +44 1223 568930 zhixue.wu@citrix.com ABSTRACT In this paper,

More information

CS4961 Parallel Programming. Lecture 5: Data and Task Parallelism, cont. 9/8/09. Administrative. Mary Hall September 8, 2009.

CS4961 Parallel Programming. Lecture 5: Data and Task Parallelism, cont. 9/8/09. Administrative. Mary Hall September 8, 2009. CS4961 Parallel Programming Lecture 5: Data and Task Parallelism, cont. Administrative Homework 2 posted, due September 10 before class - Use the handin program on the CADE machines - Use the following

More information

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model Bulk Synchronous and SPMD Programming The Bulk Synchronous Model CS315B Lecture 2 Prof. Aiken CS 315B Lecture 2 1 Prof. Aiken CS 315B Lecture 2 2 Bulk Synchronous Model The Machine A model An idealized

More information

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction

More information

A Comparison of Unified Parallel C, Titanium and Co-Array Fortran. The purpose of this paper is to compare Unified Parallel C, Titanium and Co-

A Comparison of Unified Parallel C, Titanium and Co-Array Fortran. The purpose of this paper is to compare Unified Parallel C, Titanium and Co- Shaun Lindsay CS425 A Comparison of Unified Parallel C, Titanium and Co-Array Fortran The purpose of this paper is to compare Unified Parallel C, Titanium and Co- Array Fortran s methods of parallelism

More information

SYSTEMS MEMO #12. A Synchronization Library for ASIM. Beng-Hong Lim Laboratory for Computer Science.

SYSTEMS MEMO #12. A Synchronization Library for ASIM. Beng-Hong Lim Laboratory for Computer Science. ALEWIFE SYSTEMS MEMO #12 A Synchronization Library for ASIM Beng-Hong Lim (bhlim@masala.lcs.mit.edu) Laboratory for Computer Science Room NE43-633 January 9, 1992 Abstract This memo describes the functions

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

CS604 - Operating System Solved Subjective Midterm Papers For Midterm Exam Preparation

CS604 - Operating System Solved Subjective Midterm Papers For Midterm Exam Preparation CS604 - Operating System Solved Subjective Midterm Papers For Midterm Exam Preparation The given code is as following; boolean flag[2]; int turn; do { flag[i]=true; turn=j; while(flag[j] && turn==j); critical

More information

Hardware Memory Models: x86-tso

Hardware Memory Models: x86-tso Hardware Memory Models: x86-tso John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 9 20 September 2016 Agenda So far hardware organization multithreading

More information

Parallel Processors. Session 1 Introduction

Parallel Processors. Session 1 Introduction Parallel Processors Session 1 Introduction Applications of Parallel Processors Structural Analysis Weather Forecasting Petroleum Exploration Fusion Energy Research Medical Diagnosis Aerodynamics Simulations

More information

f %. School of Computer Science Carnegie Mellon University Pittsburgh, Pennsylvania 15213

f %. School of Computer Science Carnegie Mellon University Pittsburgh, Pennsylvania 15213 r CLEARED,E F.EVIEW,.F I? rn*-i:s.!-.)c: NOT I PTLY Ty!:"'o cr~~~~ S~~. l',-,r -. D~TEPAP~rMEN U' 7EEN E...... :,_ OCT 2 1991 12 D T I(, Program Translation Tools for Systolic Arrays N00014-87-K-0385 Final

More information

Processor Architecture and Interconnect

Processor Architecture and Interconnect Processor Architecture and Interconnect What is Parallelism? Parallel processing is a term used to denote simultaneous computation in CPU for the purpose of measuring its computation speeds. Parallel Processing

More information

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes:

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes: BIT 325 PARALLEL PROCESSING ASSESSMENT CA 40% TESTS 30% PRESENTATIONS 10% EXAM 60% CLASS TIME TABLE SYLLUBUS & RECOMMENDED BOOKS Parallel processing Overview Clarification of parallel machines Some General

More information

Xinu on the Transputer

Xinu on the Transputer Purdue University Purdue e-pubs Department of Computer Science Technical Reports Department of Computer Science 1990 Xinu on the Transputer Douglas E. Comer Purdue University, comer@cs.purdue.edu Victor

More information

Optimizing Irregular HPF Applications Using Halos Siegfried Benkner C&C Research Laboratories NEC Europe Ltd. Rathausallee 10, D St. Augustin, G

Optimizing Irregular HPF Applications Using Halos Siegfried Benkner C&C Research Laboratories NEC Europe Ltd. Rathausallee 10, D St. Augustin, G Optimizing Irregular HPF Applications Using Halos Siegfried Benkner C&C Research Laboratories NEC Europe Ltd. Rathausallee 10, D-53757 St. Augustin, Germany Abstract. This paper presents language features

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University 740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University Readings: Memory Consistency Required Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess

More information

One-Sided Append: A New Communication Paradigm For PGAS Models

One-Sided Append: A New Communication Paradigm For PGAS Models One-Sided Append: A New Communication Paradigm For PGAS Models James Dinan and Mario Flajslik Intel Corporation {james.dinan, mario.flajslik}@intel.com ABSTRACT One-sided append represents a new class

More information

Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming

Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Fabiana Leibovich, Laura De Giusti, and Marcelo Naiouf Instituto de Investigación en Informática LIDI (III-LIDI),

More information

Petri-net-based Workflow Management Software

Petri-net-based Workflow Management Software Petri-net-based Workflow Management Software W.M.P. van der Aalst Department of Mathematics and Computing Science, Eindhoven University of Technology, P.O. Box 513, NL-5600 MB, Eindhoven, The Netherlands,

More information

A Comparison of Relativistic and Reader-Writer Locking Approaches to Shared Data Access

A Comparison of Relativistic and Reader-Writer Locking Approaches to Shared Data Access A Comparison of Relativistic and Reader-Writer Locking Approaches to Shared Data Access Philip W. Howard, Josh Triplett, and Jonathan Walpole Portland State University Abstract. This paper explores the

More information

A Framework for Space and Time Efficient Scheduling of Parallelism

A Framework for Space and Time Efficient Scheduling of Parallelism A Framework for Space and Time Efficient Scheduling of Parallelism Girija J. Narlikar Guy E. Blelloch December 996 CMU-CS-96-97 School of Computer Science Carnegie Mellon University Pittsburgh, PA 523

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Python in the Cling World

Python in the Cling World Journal of Physics: Conference Series PAPER OPEN ACCESS Python in the Cling World To cite this article: W Lavrijsen 2015 J. Phys.: Conf. Ser. 664 062029 Recent citations - Giving pandas ROOT to chew on:

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Datenbanksysteme II: Caching and File Structures. Ulf Leser

Datenbanksysteme II: Caching and File Structures. Ulf Leser Datenbanksysteme II: Caching and File Structures Ulf Leser Content of this Lecture Caching Overview Accessing data Cache replacement strategies Prefetching File structure Index Files Ulf Leser: Implementation

More information

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore

More information

Implementing Sequential Consistency In Cache-Based Systems

Implementing Sequential Consistency In Cache-Based Systems To appear in the Proceedings of the 1990 International Conference on Parallel Processing Implementing Sequential Consistency In Cache-Based Systems Sarita V. Adve Mark D. Hill Computer Sciences Department

More information

Principles in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008

Principles in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008 Principles in Computer Architecture I CSE 240A (Section 631684) CSE 240A Homework Three November 18, 2008 Only Problem Set Two will be graded. Turn in only Problem Set Two before December 4, 2008, 11:00am.

More information

Evaluating the Performance of Software Distributed Shared Memory as a Target for Parallelizing Compilers

Evaluating the Performance of Software Distributed Shared Memory as a Target for Parallelizing Compilers Evaluating the Performance of Software Distributed Shared Memory as a Target for Parallelizing Compilers Alan L. Cox y, Sandhya Dwarkadas z, Honghui Lu y and Willy Zwaenepoel y y Rice University Houston,

More information

Free upgrade of computer power with Java, web-base technology and parallel computing

Free upgrade of computer power with Java, web-base technology and parallel computing Free upgrade of computer power with Java, web-base technology and parallel computing Alfred Loo\ Y.K. Choi * and Chris Bloor* *Lingnan University, Hong Kong *City University of Hong Kong, Hong Kong ^University

More information

CS4230 Parallel Programming. Lecture 7: Loop Scheduling cont., and Data Dependences 9/6/12. Administrative. Mary Hall.

CS4230 Parallel Programming. Lecture 7: Loop Scheduling cont., and Data Dependences 9/6/12. Administrative. Mary Hall. CS4230 Parallel Programming Lecture 7: Loop Scheduling cont., and Data Dependences Mary Hall September 7, 2012 Administrative I will be on travel on Tuesday, September 11 There will be no lecture Instead,

More information

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose Department of Electrical and Computer Engineering University of California,

More information

NOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem.

NOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem. Memory Consistency Model Background for Debate on Memory Consistency Models CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley for a SAS specifies constraints on the order in which

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

A Compiler-Directed Cache Coherence Scheme Using Data Prefetching

A Compiler-Directed Cache Coherence Scheme Using Data Prefetching A Compiler-Directed Cache Coherence Scheme Using Data Prefetching Hock-Beng Lim Center for Supercomputing R & D University of Illinois Urbana, IL 61801 hblim@csrd.uiuc.edu Pen-Chung Yew Dept. of Computer

More information

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1]) EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,

More information

Metaprogrammable Toolkit for Model-Integrated Computing

Metaprogrammable Toolkit for Model-Integrated Computing Metaprogrammable Toolkit for Model-Integrated Computing Akos Ledeczi, Miklos Maroti, Gabor Karsai and Greg Nordstrom Institute for Software Integrated Systems Vanderbilt University Abstract Model-Integrated

More information

Computer Architecture Today (I)

Computer Architecture Today (I) Fundamental Concepts and ISA Computer Architecture Today (I) Today is a very exciting time to study computer architecture Industry is in a large paradigm shift (to multi-core and beyond) many different

More information