On the Performance of Functional Hybrid Programming Models for Matrix Multiplication on Virtual Shared Memory and Distributed Memory Clusters.

On the Performance of Functional Hybrid Programming Models for Matrix Multiplication on Virtual Shared Memory and Distributed Memory Clusters

Mahesh Kandegedara, D.N. Ranasinghe
University of Colombo School of Computing, Sri Lanka

Abstract: The recent enhancements in processor architectures have given rise to multi-threaded, multi-core and multi-processor based clusters for high performance computing. To exploit the variety of parallelism available in these current and future computer systems, programmers must use appropriate parallel programming approaches. Though conventional programming models exist for parallel programming, none of them has sufficiently addressed the emerging processor technologies. This paper evaluates how functional programming can be used with distributed memory and shared memory languages to exploit the scalability, heterogeneity and flexibility of clusters in solving the recursive Strassen's matrix multiplication problem. The results show that the functional language Erlang is more efficient than the virtual shared memory approach and can be made more scalable than distributed memory programming approaches when incorporated with OpenMP.

Index Terms: functional, matrix multiplication, multi-threaded, multi-core, multi-processor, MPI, OpenMP, Erlang

I. INTRODUCTION

The emergent processor architectures have transformed uniprocessor systems into multi-processor, multi-core [2] and multi-threaded clusters of high performance workstations. Therefore, systems with multiple levels of parallelism exist. For example, a current high performance system can be a cluster with multiple nodes (node level parallelism), each node having multiple shared memory CPUs, each CPU consisting of multiple cores (core level parallelism), and each core supporting multi-threading (thread level parallelism). The conventional programming models [1] available to cater to this hierarchy of parallel levels can be categorized as distributed memory, shared memory and virtual shared memory programming models.

The main objective of the paper is to propose a hybrid programming model that exploits the memory hierarchies available in a cluster with respect to functional programming, based on an identified set of metrics. The paper analyzes the following scenarios, which represent the most common and emerging programming languages under each programming model. Under distributed memory programming models, pure MPI and pure Erlang have been evaluated. Pure OpenMP is the model selected under shared memory programming models. As a virtual shared memory programming model, OpenMP on clusters has been evaluated. MPI with Pthreads and Erlang with OpenMP are the hybrid programming models being evaluated. The standard matrix multiplication problem using the recursive Strassen's algorithm has been chosen as the reference problem for the evaluation.

The rest of the paper is arranged as follows. Sect. II provides background knowledge for the evaluation, describing the recursive Strassen's algorithm for matrix multiplication. Sect. III introduces the design of the model on which the implementations are based. The experience gained in implementing the above scenarios, their individual evaluations and the programming environment used for the evaluations are presented in Sect. IV. Based on the benefits and constraints identified from the evaluations, the conclusions are provided in Sect. V.

II. BACKGROUND

A. Matrix Multiplication
The multiplication of two matrices is one of the most basic operations of linear algebra and scientific computing and has provided an important focus in the search for methods to speed up scientific computation. It is also a primitive operation in most real-world computer graphics and digital signal processing applications.

B. Strassen's Algorithm: An O(n^lg 7) Matrix Multiplication Algorithm

A more enhanced matrix multiplication algorithm, Strassen's algorithm [13], was introduced in 1969 with a complexity of O(n^lg 7), where lg 7 is approximately 2.807. The algorithm is more efficient than the naive algorithm in execution time as well as memory usage. The algorithm is as follows.

1) If the matrices to be multiplied are A and B and the result is C of dimension n, the A and B matrices are each divided into 4 quadrants, creating 8 sub-matrices of dimension n/2.

2) The 7 Strassen equations, equations 1 to 7, are applied to the above sub-matrices, creating 7 temporary sub-matrices of dimension n/2.

P1 = (A11 + A22)(B11 + B22)    (1)
P2 = (A21 + A22) B11           (2)
P3 = A11 (B12 - B22)           (3)
P4 = A22 (B21 - B11)           (4)
P5 = (A11 + A12) B22           (5)
P6 = (A21 - A11)(B11 + B12)    (6)
P7 = (A12 - A22)(B21 + B22)    (7)

3) The temporary sub-matrices are used to calculate the 4 sub-matrices of the result C using equations 8 to 11.

C11 = P1 + P4 - P5 + P7        (8)
C12 = P3 + P5                  (9)
C21 = P2 + P4                  (10)
C22 = P1 + P3 - P2 + P6        (11)

When compared to other approaches, Strassen's algorithm uses the cache efficiently, as the calculations are performed on sub-matrices. Also, the algorithm contains only 7 sub-matrix multiplications. However, the algorithm does not perform well for matrices of smaller dimensions, due to the overhead of pre- and post-processing of the matrices.

C. Recursive Strassen's Algorithm

Not only has the sequential Strassen's algorithm provided improved performance over conventional multiplication, but the ability to apply the algorithm recursively to the sub-matrix multiplications has made it ideally suited to exploit the multiple levels of parallelism available on computer clusters. This heterogeneity is one of the metrics we use in our evaluations. A careful analysis of equations 1 to 7 shows 7 multiplications performed on added and subtracted input sub-matrices. By applying Strassen's algorithm again on each of these embedded multiplications, the algorithm becomes recursive. Therefore the recursive Strassen's algorithm can be scaled to any number of recursive levels, which supports the scalability metric of our evaluations. The algorithm can be implemented with multiple recursive levels to exploit the parallelism available at multiple memory levels.
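To make the recursion concrete, the following serial C routine is a minimal sketch of one Strassen step as given by equations 1 to 11. It is not the paper's code; the names (strassen, add, sub, the quadrant helpers) and the cutoff BASE are illustrative assumptions. Below the cutoff the naive O(n^3) product is used, reflecting the remark above that Strassen's algorithm does not pay off for small dimensions. The parallel variants evaluated later distribute exactly these seven recursive calls.

#include <stdlib.h>
#include <string.h>

/* Illustrative serial sketch of one level of Strassen recursion.
 * Matrices are n x n, row-major, with n a power of two.
 * Below the cutoff BASE the naive O(n^3) product is used. */
#define BASE 64

static void add(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n * n; i++) c[i] = a[i] + b[i];
}
static void sub(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n * n; i++) c[i] = a[i] - b[i];
}
static void naive(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++) s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}
/* Copy quadrant (qi,qj) of the n x n matrix src into the h x h matrix dst. */
static void get_quad(const double *src, double *dst, int n, int qi, int qj) {
    int h = n / 2;
    for (int i = 0; i < h; i++)
        memcpy(dst + i * h, src + (qi * h + i) * n + qj * h, h * sizeof(double));
}
static void put_quad(double *dst, const double *src, int n, int qi, int qj) {
    int h = n / 2;
    for (int i = 0; i < h; i++)
        memcpy(dst + (qi * h + i) * n + qj * h, src + i * h, h * sizeof(double));
}

void strassen(const double *A, const double *B, double *C, int n) {
    if (n <= BASE) { naive(A, B, C, n); return; }
    int h = n / 2;
    size_t sz = (size_t)h * h * sizeof(double);
    /* 8 input quadrants, 7 products and 2 scratch operands */
    double *q[8], *P[7], *t1 = malloc(sz), *t2 = malloc(sz);
    for (int i = 0; i < 8; i++) q[i] = malloc(sz);
    for (int i = 0; i < 7; i++) P[i] = malloc(sz);
    double *A11 = q[0], *A12 = q[1], *A21 = q[2], *A22 = q[3];
    double *B11 = q[4], *B12 = q[5], *B21 = q[6], *B22 = q[7];
    get_quad(A, A11, n, 0, 0); get_quad(A, A12, n, 0, 1);
    get_quad(A, A21, n, 1, 0); get_quad(A, A22, n, 1, 1);
    get_quad(B, B11, n, 0, 0); get_quad(B, B12, n, 0, 1);
    get_quad(B, B21, n, 1, 0); get_quad(B, B22, n, 1, 1);
    /* Equations (1)-(7): each product recurses on dimension n/2. */
    add(A11, A22, t1, h); add(B11, B22, t2, h); strassen(t1, t2, P[0], h);
    add(A21, A22, t1, h);                       strassen(t1, B11, P[1], h);
    sub(B12, B22, t2, h);                       strassen(A11, t2, P[2], h);
    sub(B21, B11, t2, h);                       strassen(A22, t2, P[3], h);
    add(A11, A12, t1, h);                       strassen(t1, B22, P[4], h);
    sub(A21, A11, t1, h); add(B11, B12, t2, h); strassen(t1, t2, P[5], h);
    sub(A12, A22, t1, h); add(B21, B22, t2, h); strassen(t1, t2, P[6], h);
    /* Equations (8)-(11): combine the products into the quadrants of C. */
    add(P[0], P[3], t1, h); sub(t1, P[4], t1, h); add(t1, P[6], t1, h);
    put_quad(C, t1, n, 0, 0);
    add(P[2], P[4], t1, h); put_quad(C, t1, n, 0, 1);
    add(P[1], P[3], t1, h); put_quad(C, t1, n, 1, 0);
    add(P[0], P[2], t1, h); sub(t1, P[1], t1, h); add(t1, P[5], t1, h);
    put_quad(C, t1, n, 1, 1);
    for (int i = 0; i < 8; i++) free(q[i]);
    for (int i = 0; i < 7; i++) free(P[i]);
    free(t1); free(t2);
}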

Fig. 1. Implementation model for the recursive Strassen's algorithm using 7 nodes in the cluster. The model is scalable to any number of recursive levels.

III. DESIGN OF THE IMPLEMENTATION MODEL

All six implementations of the recursive Strassen's algorithm based on the MPI, OpenMP and Erlang languages follow a similar model. The objectives in designing the model were providing maximum scalability, extraction of multiple memory levels and flexibility across the language based implementations. The model is based on the divide and conquer topology [14]. As shown in Figure 1, the matrices to be multiplied, of dimension N, are given as input to the master processor P0. In recursive level 1, P0 scatters the matrices into sub-matrices and performs only the additions and subtractions of the 7 Strassen equations (equations 1 to 7), creating 7 x 2 sub-matrices of dimension N/2. P0 sends 6 chunks to other processors, while keeping one chunk to be computed locally, adding performance by overlapping communication with computation. After the multiplications of the sub-matrices are completed in P0 and all the slaves, the results are gathered to P0. P0 then applies the last 4 equations (equations 8 to 11) and merges the results to form the output matrix C. In recursive level 2, when multiplying the sub-matrices of dimension N/2, all the nodes recursively apply Strassen's algorithm. Each node then interacts with another 6 nodes, which requires altogether 49 nodes. Likewise, each node with its slaves applies the equations and sends back the results to P0. Therefore the model can be scaled to n recursive levels, which uses 7^n nodes. The performance should increase as the number of nodes increases, since this reduces the dimension of the matrices to be multiplied. Once all the nodes in the environment have been utilized, to get the advantage of core level parallelism the algorithm can be recursively spanned onto the processor cores in each node. Finally, to get the advantage of thread level parallelism, in each core of each node it is possible to create threads or processes to recursively apply the algorithm. Therefore the model is powerful enough to exploit the parallelism available at all levels of the processing hierarchy.

Due to the resource limitations of our testing environment, the evaluations are restricted to one node level. The implementations accept only 7 nodes from the cluster; this limitation has been imposed to avoid the extra overhead of making the program capable of managing an arbitrary number of nodes. As the cluster consists of single-core processors, core level parallelism cannot be directly evaluated. However, by creating multiple threads in each node, thread level parallelism can be exploited. The evaluations have been done only for matrix dimensions that are powers of 2, due to the nature of Strassen's algorithm.

IV. EVALUATION OF THE IMPLEMENTATIONS

A. Programming environment

All the executions have been done on the Upplanka cluster at the University of Colombo School of Computing, which has 14 64-bit single-core nodes, each with 2 MB of memory.

B. Distributed Memory Programming Models

1) Pure MPI:

MPI language characteristics: The message passing programming model is a distributed memory MIMD model with explicit control parallelism. In designing MPI, the MPI Forum identified some critical shortcomings of existing message-passing systems, in areas such as complex data layouts or support for modularity and safe communication. This led to the introduction of new features in MPI-1 [3]. A major advantage of MPI is the capability to embed it in sequential base languages like Fortran, C and C++. The disadvantage of the MPI model is that dynamic load balancing is often difficult and the granularity of the code often has to be large to ensure that the communication-to-computation ratio remains low. Therefore, based on its language characteristics, MPI is more applicable to parallel problems with coarse grain parallelism.

MPI based implementation: We have used C as the sequential language for embedding MPI. Initially, we implemented a recursive version of Strassen's algorithm in pure MPI, which was scaled up to only one recursive level. In message passing, asynchronous messaging has been used, with loop fusion applied in scattering and merging the matrices into sub-matrices. As shown in Figure 2, our MPI based implementation shows a considerable performance gain over the other high performance MPI implementation used for comparison.

Fig. 2. Variation of execution time with matrix dimension between the Strassen implementation in MPI and another enhanced approach in MPI. The programs have been evaluated with 7 MPI nodes on the cluster.
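The message layout of the paper's MPI code is not reproduced here; the fragment below is a hedged sketch, with assumed names (multiply, left, right, prod, the tags and the dimension N), of how the master rank might post non-blocking sends of the six outgoing operand pairs, overlap them with the locally kept seventh product, and then gather the results, following the design model of Sect. III.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical sketch of recursive level 1 of the design model:
 * rank 0 forms the seven Strassen operand pairs of dimension h = N/2,
 * ships six of them to ranks 1..6 with non-blocking sends, multiplies
 * the seventh locally, and gathers the seven products back.
 * multiply() stands in for the next recursion level (Strassen or naive).
 * Run with exactly 7 ranks, e.g. mpirun -np 7. */
#define N 1024

static void multiply(const double *a, const double *b, double *c, int h) {
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            double s = 0.0;
            for (int k = 0; k < h; k++) s += a[i * h + k] * b[k * h + j];
            c[i * h + j] = s;
        }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int h = N / 2, cnt = h * h;

    if (rank == 0) {
        /* left[p]/right[p] hold the operands of product p (equations 1-7),
         * already assembled by the pre-processing additions/subtractions. */
        double *left[7], *right[7], *prod[7];
        for (int p = 0; p < 7; p++) {
            left[p]  = calloc(cnt, sizeof(double));
            right[p] = calloc(cnt, sizeof(double));
            prod[p]  = calloc(cnt, sizeof(double));
        }
        MPI_Request reqs[12];
        for (int p = 1; p < 7; p++) {   /* six chunks go out asynchronously */
            MPI_Isend(left[p],  cnt, MPI_DOUBLE, p, 0, MPI_COMM_WORLD, &reqs[2 * (p - 1)]);
            MPI_Isend(right[p], cnt, MPI_DOUBLE, p, 1, MPI_COMM_WORLD, &reqs[2 * (p - 1) + 1]);
        }
        multiply(left[0], right[0], prod[0], h);   /* overlap: product 1 stays local */
        MPI_Waitall(12, reqs, MPI_STATUSES_IGNORE);
        for (int p = 1; p < 7; p++)
            MPI_Recv(prod[p], cnt, MPI_DOUBLE, p, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ...post-processing (equations 8-11) would combine prod[0..6] into C... */
        for (int p = 0; p < 7; p++) { free(left[p]); free(right[p]); free(prod[p]); }
    } else if (rank < 7) {
        double *a = malloc(cnt * sizeof(double)), *b = malloc(cnt * sizeof(double));
        double *c = malloc(cnt * sizeof(double));
        MPI_Recv(a, cnt, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(b, cnt, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        multiply(a, b, c, h);
        MPI_Send(c, cnt, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
        free(a); free(b); free(c);
    }
    MPI_Finalize();
    return 0;
}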
Evaluation of the MPI based implementation:

Scalability: The implementation was capable of executing with a maximum matrix dimension of . Message passing in MPI involves considerable overhead in the total execution time.

Support for multiple process levels: As MPI is a distributed memory model, we have exploited only node level parallelism. It is also possible to exploit core level parallelism by enhancing the implementation to create MPI processes on the cores of a node. To exploit thread level parallelism, a shared memory approach should be incorporated with MPI.

Expressiveness of the language: As MPI is not an independent language but a set of libraries, it inherits features such as fault tolerance and garbage collection from the base language. Also, any parallelism must be explicitly handled by the programmer. Due to the divide and conquer nature of the problem, the implementation should be capable of efficiently transporting matrix blocks among the nodes; but due to limitations on MPI message size, scalability was also limited.

2) Pure Erlang:

Functional languages are based on the theoretical foundation of the lambda calculus [8]. In functional languages, a program is a collection of functions. The high level abstraction of parallel constructs encourages experimentation with

alternative parallelizations, which often leads to improved solutions for novel parallel problems.

Erlang language characteristics [9]: Erlang has a declarative syntax with dynamic typing. It was developed with concurrency in mind, with lightweight processes and only asynchronous message passing. Since each process has a separate heap that is garbage collected individually, there is little interruption of program execution for memory management. Erlang is also ideal for soft real time application development with its robust programming techniques [11]. Erlang provides a rich set of external interfaces for foreign code [20].

Erlang implementation: The programming approach in functional programming is much different from that of the other models [10]. The dynamic typing of Erlang allowed an easy and quick implementation of the algorithm, letting us concentrate on the programming logic rather than on language constructs. In our implementation of the algorithm, a matrix was represented as a collection of lists corresponding to the rows of the matrix. Therefore, when creating the matrices, when scattering and merging them, and in the matrix operations (addition, subtraction and multiplication), the operations are performed recursively on each list to build the final result. As Erlang does not allow globally shared data, matrices had to be passed along on every recursive call. The implementation is scalable to an arbitrary number of recursive levels. After the first recursive level, which uses the 7 nodes of the cluster, each node spawns another 6 processes on each recursive iteration. After a process is spawned, it communicates with the master process via asynchronous messages. If a process fails in the middle of its execution, the master process traps the termination and spawns a new process for the task.

Evaluation of the Erlang based implementation:

Scalability: As shown in Figure 3, the implementation was highly scalable (the evaluations have been done only up to 3 levels). The reason is the light weight of Erlang processes and the fast context switching among them. However, the implementation was unable to execute matrix dimensions of more than 2048 on the cluster. Based on the evaluations done with eprof (an Erlang profiling tool [18]), it was found that most of the execution time was devoted to matrix scattering, multiplication and merging. This reveals that the recursion which comes with functional programming reduces performance when applied to these types of coarse-grain parallel problems.

Fig. 3. Variation of execution time with matrix dimension in the Erlang based implementation for 3 recursive levels.

Support for multiple process levels: Node level parallelism was present in the implementation to exploit the 7 nodes of the cluster. However, Erlang does not allow scheduling of programs on different cores, as it is up to the Erlang emulator to manage the processes; there is not much support for core level parallelism in pure Erlang. In the evaluation, after the first recursive level (node level), another 2 recursive levels were tested, creating 49 processes in each node. Therefore Erlang processes can be considered ideal for exploiting thread level parallelism.

Expressiveness of the language: Because of the abstract nature of functional languages, the program consisted of fewer lines of code.
The robustness provided by Erlang maintains the safety of the program in distributed environments. As the matrix multiplication problem is memory intensive, efficient use of memory leads to efficient programs. In the implementation, the maximum matrix dimension was limited by the recursive behavior of the code, which consumed much memory. Therefore functional languages have a negative impact on the performance of highly memory consuming problems. In Erlang, static load balancing among the nodes can be explicitly handled by the master nodes: when a master node traps a process on another node, after its successful termination the master can assign another task to that node.

C. Shared Memory Programming Models

1) OpenMP:

The key strengths of shared memory programming are simplicity and scalability. The most popular shared memory programming library, OpenMP [6], is based on the use of multiple threads.

OpenMP characteristics [4]: OpenMP has a minimalist set of control structures for adding parallel behavior to source code. The simplicity of OpenMP comes from its use of directives. In the execution of an OpenMP program, a shared memory process consists of multiple threads. OpenMP provides both implicit and explicit synchronization. Implicit synchronization is provided by the fork-join model of parallel execution; the user specifies explicit synchronization to manage ordering or data dependencies.

OpenMP implementation: Because of the simplicity provided by OpenMP, porting a serial code to a parallel code is a relatively easy task and no significant code changes are required. Data parallel directives are used to parallelize matrix scattering, merging and the multiplications. As the Strassen equations are independent of each other, functional parallel directives were used in solving the equations; a sketch of this functional decomposition is given at the end of this subsection. To add the recursive behavior to the OpenMP implementation, Intel's work-queuing model has been used [19]. As a result, the implementation was capable of extending to any number of recursive levels.

Evaluation of the OpenMP based implementation:

Scalability: According to the evaluation results shown in Figure 4, the OpenMP based implementation shows a significant performance improvement when the number of recursive levels is increased. Therefore the implementation is very scalable. As the evaluations are obtained with only one node, a comparison with MPI is not relevant in this scenario.

Support for multiple process levels: OpenMP provides only explicit thread level parallelization to the programmer; core level parallelism is hidden at the operating system level. To exploit node level parallelism, a distributed shared memory approach like OpenMP on clusters is required.

Expressiveness of the language: The directive based programming model allows sequential codes to be migrated easily to parallel codes. With added features like the work-queuing model, an Intel extension to OpenMP 2.0, the language provides better support for implementing recursive problems.

Fig. 4. Variation of execution time with matrix dimension in the sequential implementation and the OpenMP based implementation. The OpenMP based implementation was evaluated up to 3 recursive levels. Executions were done with only one node of the cluster.
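Because the seven products are mutually independent, the functional parallelism described above maps directly onto OpenMP worksharing. The fragment below is an illustrative sketch using the standard parallel sections construct; the paper's actual implementation additionally relies on Intel's work-queuing extension for the recursive levels, which is not shown here. The operand arrays L, R, P and the mul callback are assumed names.

/* The seven independent Strassen products (equations 1-7) executed with
 * OpenMP functional parallelism.  L[p] and R[p] are the pre-processed
 * operand pairs of dimension h, P[p] receives product p, and mul() stands
 * in for the multiplication of the next recursive level. */
void strassen_products_omp(double *L[7], double *R[7], double *P[7], int h,
                           void (*mul)(const double *, const double *,
                                       double *, int))
{
    #pragma omp parallel sections
    {
        #pragma omp section
        { mul(L[0], R[0], P[0], h); }   /* P1 = (A11+A22)(B11+B22) */
        #pragma omp section
        { mul(L[1], R[1], P[1], h); }   /* P2 = (A21+A22) B11      */
        #pragma omp section
        { mul(L[2], R[2], P[2], h); }   /* P3 = A11 (B12-B22)      */
        #pragma omp section
        { mul(L[3], R[3], P[3], h); }   /* P4 = A22 (B21-B11)      */
        #pragma omp section
        { mul(L[4], R[4], P[4], h); }   /* P5 = (A11+A12) B22      */
        #pragma omp section
        { mul(L[5], R[5], P[5], h); }   /* P6 = (A21-A11)(B11+B12) */
        #pragma omp section
        { mul(L[6], R[6], P[6], h); }   /* P7 = (A12-A22)(B21+B22) */
    }
    /* The scatter/merge loops and the additions of equations 8-11 use
     * data-parallel "omp for" worksharing in the same way. */
}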
D. Virtual Shared Memory Programming Models

1) OpenMP on Clusters:

OpenMP on Clusters characteristics [4]: Cluster OpenMP [17] provides a shared memory layer over a distributed memory system. Cluster OpenMP extends OpenMP with one additional directive, the sharable directive. The sharable directive identifies variables that are referenced by more than one thread. The task of keeping shared variables consistent across distributed nodes is handled by the Cluster OpenMP run-time library. Sharable variables are grouped together on certain pages in memory.

OpenMP on Clusters implementation: Any pure OpenMP source implementation can be extended to execute on Cluster OpenMP by analyzing and sharing the variables and data structures that will be accessed by multiple threads. Therefore, in our pure OpenMP implementation, the memory allocation and de-allocation functions (for creating and freeing matrices) have been replaced with kmp_sharable_malloc and kmp_sharable_free, as sketched below.
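A minimal sketch of the porting step just described: matrix buffers touched by threads on more than one node are taken from the Cluster OpenMP sharable heap instead of the ordinary heap. The prototypes are written out by hand because the paper does not name the Intel header, so they should be read as assumptions about the toolkit's interface rather than verified declarations.

#include <stddef.h>

/* Cluster OpenMP sharable-heap allocation, as used when porting the pure
 * OpenMP code: buffers accessed by threads on several nodes come from the
 * sharable heap.  The prototypes below are reproduced by hand for this
 * sketch; the real declarations ship with the Intel Cluster OpenMP toolkit. */
void *kmp_sharable_malloc(size_t size);
void  kmp_sharable_free(void *ptr);

double *alloc_matrix(int n)          /* was: malloc(n * n * sizeof(double)) */
{
    return (double *)kmp_sharable_malloc((size_t)n * n * sizeof(double));
}

void free_matrix(double *m)          /* was: free(m) */
{
    kmp_sharable_free(m);
}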

Evaluation of the OpenMP on Clusters implementation:

Scalability: According to Figure 5, a performance increase can be seen at a matrix dimension of 1024 with 1 recursive level.

Fig. 5. Variation of execution time with matrix dimension in the OpenMP based implementation on Intel Cluster OpenMP at one recursive level. Executions were done with 7 nodes on the Upplanka cluster, allowing a maximum of 7 threads on each.

But with a matrix dimension of 2048, the compiler complained of insufficient RAM space on the cluster when enlarging the shared space. The main reason is the nature of the problem: when multiple threads access a matrix, the whole data structure must be kept consistent in each node's shared space. Therefore, keeping the memories consistent is much more expensive in Cluster OpenMP. Calculations with high data locality would reduce this overhead to some extent. In the MPI based implementation the maximum message buffer size acted as a limitation; in a Cluster OpenMP implementation this translates into limited shared space.

Support for multiple process levels: The implementation adds the capability of exploiting node level parallelism to the pure OpenMP approach, which had only thread level parallelism.

Expressiveness of the language: In porting OpenMP programs to Cluster OpenMP, all that needs to be done is to identify the sharable variables. Therefore the flexibility provided by OpenMP has been extended by Cluster OpenMP.

E. Hybrid Programming Models

As discussed above, when comparing the language features, each language has limitations due to the programming paradigm it is based on. We now explore several ways of overcoming such limitations with mixed models, to exploit parallelism at various processing levels.

1) MPI with Pthreads:

Hybrid implementation: As concluded in the pure MPI scenario, to cater to thread level parallelism and the limited scalability of MPI, a shared memory model should be incorporated with MPI. Therefore a hybrid implementation that scales beyond the first recursive level using Pthreads with MPI was carried out; a sketch of the thread level is given at the end of this subsection. We then improved the implementation to be capable of easily spanning any number of thread levels. But when executing with 2 thread levels, the program crashed due to memory limitations when creating 49 threads in each node.

Evaluation: As shown in Figure 6, we achieved a significant performance improvement with the threaded version compared to the pure MPI version. But due to the overhead involved in creating and switching between threads, the executions were limited to only one thread level.

Fig. 6. Variation of execution time with matrix dimension between the pure MPI and MPI + Pthreads implementations. The programs were evaluated with 7 MPI nodes on the cluster. The MPI + Pthreads implementation adds another recursive level using threads.
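The added thread level can be pictured as follows: inside each MPI rank, the seven products of the next recursion level are handed to seven POSIX threads and joined before the post-processing equations are applied. This is a hedged sketch with assumed names (task_t, multiply, run_thread_level), not the paper's code, and it ignores the memory pressure that limited the real runs to a single thread level.

#include <pthread.h>

/* Sketch of the thread level inside one MPI rank: the seven Strassen
 * products of dimension h are computed by seven POSIX threads.
 * task_t, multiply() and run_thread_level() are assumed names. */
typedef struct {
    const double *left, *right;   /* pre-processed operands of one product */
    double *prod;                 /* output buffer for this product        */
    int h;                        /* sub-matrix dimension                  */
} task_t;

static void multiply(const double *a, const double *b, double *c, int h) {
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            double s = 0.0;
            for (int k = 0; k < h; k++) s += a[i * h + k] * b[k * h + j];
            c[i * h + j] = s;
        }
}

static void *worker(void *arg) {
    task_t *t = (task_t *)arg;
    multiply(t->left, t->right, t->prod, t->h);
    return NULL;
}

/* Called by an MPI rank after it has built its seven operand pairs;
 * equations 8-11 are applied once all threads have joined. */
void run_thread_level(task_t tasks[7]) {
    pthread_t tid[7];
    for (int p = 0; p < 7; p++)
        pthread_create(&tid[p], NULL, worker, &tasks[p]);
    for (int p = 0; p < 7; p++)
        pthread_join(tid[p], NULL);
}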

2) Erlang with OpenMP:

Hybrid implementation: As a result of the light weight of Erlang processes and the fast context switching among them, the Erlang implementation was capable of scaling to more than 3 recursive levels, providing a significant speedup. But unlike the other approaches, the high memory usage due to the recursive nature of functional programming has a negative impact, and therefore most of the execution time of the program was consumed in pre- and post-processing. We therefore conjecture that a hybrid implementation of Erlang and OpenMP would be able to extract the advantages of both approaches. In this hybrid implementation, Erlang nodes have been used at the roots of the divide and conquer based design model to create and manage slaves, and the most costly calculations (pre-processing, post-processing and multiplications) were done with OpenMP embedded in C programs, which were used as the leaf nodes of the design model. The interoperability mechanism used in interfacing Erlang with OpenMP is distributed Erlang.
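On the C side of this hybrid, each leaf node reduces to a parallel multiplication kernel. The routine below is a minimal OpenMP sketch of such a leaf kernel under an assumed name (leaf_multiply); the distributed-Erlang transport and the marshaling/unmarshaling of matrices between the Erlang master and the C node are deliberately omitted.

#include <stdlib.h>

/* Sketch of the C/OpenMP leaf worker in the Erlang + OpenMP hybrid: it
 * receives two unmarshaled operand matrices of dimension n from the Erlang
 * master and returns their product.  The distributed-Erlang/C-node
 * transport and the marshaling layer are not shown. */
double *leaf_multiply(const double *a, const double *b, int n) {
    double *c = malloc((size_t)n * n * sizeof(double));
    #pragma omp parallel for
    for (int i = 0; i < n; i++)            /* data-parallel multiplication */
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
    return c;
}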
Evaluation: As shown in Figure 7, the implementation is more scalable than pure Erlang or Cluster OpenMP. However, when transferring data between Erlang nodes and C nodes, the data marshaling and unmarshaling mechanisms used have acted as an overhead for scaling with the matrix dimension. As this is a hybrid of Erlang and OpenMP, the flexibility of both approaches is present. Therefore the Erlang with OpenMP scenario is more flexible than the MPI with Pthreads scenario discussed earlier.

Fig. 7. Variation of execution time with matrix dimension in the Erlang with OpenMP hybrid implementation on the cluster, up to two recursive levels. Executions were done with 7 nodes on the Upplanka cluster.

V. CONCLUSION

When comparing the most conventional shared and distributed memory parallel programming paradigms, each paradigm has its own specific capabilities and constraints in implementing the recursive Strassen's matrix multiplication problem. These are summarized in Table I.

TABLE I: OVERALL EVALUATION

Language           | Scalability                                    | Support for Multiple Parallel Levels | Expressiveness of Language
OpenMP on Clusters | Limited by memory constraints                  | Node and thread level                | Flexible in usage, but depends on base language
MPI with Pthreads  | Limited, but more scalable than pure MPI       | Node, core and thread level          | Depends on base language
Erlang with OpenMP | More scalable than pure Erlang, Cluster OpenMP | Node, core and thread level          | Highly flexible

Pure MPI was more scalable in matrix dimension than the other two approaches, but the restrictions placed on message buffers limit its scalability. The hybrid of MPI with Pthreads makes the pure MPI approach recursively scalable with thread level parallelism. Pure OpenMP was the most flexible compared to MPI and Erlang, but it focuses only on thread level parallelism. The distributed memory extension, Cluster OpenMP, has not shown much performance when used with the memory intensive parallel problem of matrix multiplication. Pure Erlang was the most recursively scalable approach; however, the recursion reduces performance when used with memory intensive problems. To overcome individual paradigm limitations, an Erlang and OpenMP hybrid approach has been proposed, implemented and evaluated. The implementation was more scalable than either pure Erlang or Cluster OpenMP. However, it was not as scalable as MPI with Pthreads, due to the inefficiency of the data marshaling and unmarshaling mechanisms.

In much scientific research, executing sequential codes on high performance clusters is a major concern. An Erlang based framework that utilizes the concurrency, robustness and interoperability of Erlang and the expressiveness of OpenMP would address this problem.

VI. ACKNOWLEDGMENT

I gratefully acknowledge all the support given by Mr. Malik Silva and Mr. K. Ganeshamoorthi in setting up the evaluation environments.

REFERENCES

[1] L. M. E. Silva and R. Buyya, High Performance Cluster Computing: Programming and Applications, 2nd ed., vol. 2. New Jersey: Prentice Hall, 2000, ch. 1.
[2] D. Geer, "Chip Makers Turn to Multi-core Processors," IEEE Computer, vol. 38, no. 5, May 2005.
[3] Message Passing Interface (MPI) [Online]. 13 Sept 2007 [Accessed 05 May 2008]. Available: computing.llnl.gov/tutorials/mpi/
[4] D. E. Lenoski and W. D. Weber, "Scalable Shared Memory Processing," IEEE Transactions on Computers, vol. 48, Nov. 1999.

[5] OpenMP [Online]. 13 Sept 2007 [Accessed 05 May 2008]. Available: computing.llnl.gov/tutorials/openMP/
[6] L. Dagum and R. Menon, "OpenMP: An Industry-Standard API for Shared-Memory Programming," IEEE Computational Science & Engineering, vol. 5, no. 1, pp. 46-55, Jan.-Mar. 1998.
[7] G. Jost and H. Jin, "Comparing the OpenMP, MPI, and Hybrid Programming Paradigms on an SMP Cluster," NASA Advanced Supercomputing Division, California, Tech. Rep., Nov. 2003.
[8] R. Rojas, "A Tutorial Introduction to the Lambda Calculus," Berlin, May.
[9] P. Hedqvist, "A Parallel and Multithreaded ERLANG Implementation," M.S. thesis, Uppsala University, Sweden, June.
[10] J. Barklund and R. Virding, Erlang Reference Manual [Online]. Feb 1999 [Accessed 05 Jan 2008]. Available: manual/part frame.html
[11] J. Armstrong, "Making reliable distributed systems in the presence of software errors," Ph.D. dissertation, The Royal Institute of Technology, Sweden, Dec. 2003.
[12] Y. P. Mong, Justin, "Faster Matrix Multiplication Algorithms," University of Manchester, May.
[13] V. Strassen, "Gaussian elimination is not optimal," Numerische Mathematik, vol. 13, pp. 354-356, 1969.
[14] B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, 2nd ed. Prentice Hall, 2003, ch. 4.
[15] R. Rabenseifner, "Hybrid Parallel Programming: Performance Problems and Chances," in Proceedings of the 45th Cray User Group (CUG) Conference, Columbus, Ohio, May 2003.
[16] R. Rabenseifner and G. Wellein, "Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures," International Journal of High Performance Computing Applications, vol. 17, no. 1, pp. 49-62, 2003.
[17] J. P. Hoeflinger, "Extending OpenMP to Clusters," White Paper, Intel Corporation.
[18] C. Wikström and G. Hugosson, Eprof [Online] [Accessed April 2008]. Available: 5.2/lib/tools2.1/doc/html/eprof.html
[19] S. Shah, G. Haab, P. Petersen, and J. Throop, "Flexible control structures for parallelism in OpenMP," in Proceedings of the Fourth European Workshop on OpenMP (EWOMP '02), Lund, Sweden, Sept. 2002.
[20] Erlang documentation [Online] [Accessed April 2008].
[21] Sample Matrix Multiplication code [Online] [Accessed August 2008]. Available: tyang/class/240b99/homework/sam mm.c


More information

Solving Traveling Salesman Problem on High Performance Computing using Message Passing Interface

Solving Traveling Salesman Problem on High Performance Computing using Message Passing Interface Solving Traveling Salesman Problem on High Performance Computing using Message Passing Interface IZZATDIN A. AZIZ, NAZLEENI HARON, MAZLINA MEHAT, LOW TAN JUNG, AISYAH NABILAH Computer and Information Sciences

More information

Lecture 25: Board Notes: Threads and GPUs

Lecture 25: Board Notes: Threads and GPUs Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

CS516 Programming Languages and Compilers II

CS516 Programming Languages and Compilers II CS516 Programming Languages and Compilers II Zheng Zhang Spring 2015 Mar 12 Parallelism and Shared Memory Hierarchy I Rutgers University Review: Classical Three-pass Compiler Front End IR Middle End IR

More information

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987 Extra-High Speed Matrix Multiplication on the Cray-2 David H. Bailey September 2, 1987 Ref: SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3, (May 1988), pg. 603{607 Abstract The Cray-2 is

More information

Parallelization of Graph Isomorphism using OpenMP

Parallelization of Graph Isomorphism using OpenMP Parallelization of Graph Isomorphism using OpenMP Vijaya Balpande Research Scholar GHRCE, Nagpur Priyadarshini J L College of Engineering, Nagpur ABSTRACT Advancement in computer architecture leads to

More information

Performance Estimation of Parallel Face Detection Algorithm on Multi-Core Platforms

Performance Estimation of Parallel Face Detection Algorithm on Multi-Core Platforms Performance Estimation of Parallel Face Detection Algorithm on Multi-Core Platforms Subhi A. Bahudaila and Adel Sallam M. Haider Information Technology Department, Faculty of Engineering, Aden University.

More information

An Introduction to OpenMP

An Introduction to OpenMP An Introduction to OpenMP U N C L A S S I F I E D Slide 1 What Is OpenMP? OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel

More information

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy Why serial is not enough Computing architectures Parallel paradigms Message Passing Interface How

More information

Comparing Gang Scheduling with Dynamic Space Sharing on Symmetric Multiprocessors Using Automatic Self-Allocating Threads (ASAT)

Comparing Gang Scheduling with Dynamic Space Sharing on Symmetric Multiprocessors Using Automatic Self-Allocating Threads (ASAT) Comparing Scheduling with Dynamic Space Sharing on Symmetric Multiprocessors Using Automatic Self-Allocating Threads (ASAT) Abstract Charles Severance Michigan State University East Lansing, Michigan,

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

Chapter 4 Threads, SMP, and

Chapter 4 Threads, SMP, and Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 4 Threads, SMP, and Microkernels Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Roadmap Threads: Resource ownership

More information

Multithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa

Multithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa CS4460 Advanced d Algorithms Batch 08, L4S2 Lecture 11 Multithreaded Algorithms Part 1 N. H. N. D. de Silva Dept. of Computer Science & Eng University of Moratuwa Announcements Last topic discussed is

More information

L21: Putting it together: Tree Search (Ch. 6)!

L21: Putting it together: Tree Search (Ch. 6)! Administrative CUDA project due Wednesday, Nov. 28 L21: Putting it together: Tree Search (Ch. 6)! Poster dry run on Dec. 4, final presentations on Dec. 6 Optional final report (4-6 pages) due on Dec. 14

More information

ECE519 Advanced Operating Systems

ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor

More information