On the Performance of Functional Hybrid Programming Models for Matrix Multiplication on Virtual Shared Memory and Distributed Memory Clusters.

On the Performance of Functional Hybrid Programming Models for Matrix Multiplication on Virtual Shared Memory and Distributed Memory Clusters

Mahesh Kandegedara, D.N. Ranasinghe
University of Colombo School of Computing, Sri Lanka

Abstract: The recent enhancements in processor architectures have given rise to multi-threaded, multi-core and multi-processor based clusters for high performance computing. To exploit the variety of parallelism available in these current and future computer systems, programmers must use appropriate parallel programming approaches. Though conventional programming models exist for parallel programming, none of them has sufficiently addressed the emerging processor technologies. This paper evaluates how functional programming can be used with distributed memory and shared memory languages to exploit the scalability, heterogeneity and flexibility of clusters in solving the recursive Strassen's matrix multiplication problem. The results show that the functional language Erlang is more efficient than the virtual shared memory approach and can be made more scalable than distributed memory programming approaches when incorporated with OpenMP.

Index Terms: functional, matrix multiplication, multi-threaded, multi-core, multi-processor, MPI, OpenMP, Erlang

I. INTRODUCTION

The emergent processor architectures have transformed uniprocessor systems into multi-processor, multi-core [2] and multi-threaded clusters of high performance workstations. Therefore, systems with multiple levels of parallelism exist. For example, a current high performance system can be a cluster with multiple nodes (node level parallelism), each node having multiple shared memory CPUs, each CPU consisting of multiple cores (core level parallelism), and each core supporting multi-threading (thread level parallelism). The conventional programming models [1] available to cater to this hierarchy of parallel levels can be categorized as distributed memory, shared memory and virtual shared memory programming models.

The main objective of the paper is to propose a hybrid programming model that exploits the memory hierarchies available in a cluster with respect to functional programming, based on an identified set of metrics. The paper analyzes the following scenarios, which represent the most common and emerging programming languages under each programming model. Under distributed memory programming models, pure MPI and pure Erlang have been evaluated. Pure OpenMP is the model selected under shared memory programming models. As a virtual shared memory programming model, OpenMP on clusters has been evaluated. MPI with Pthreads and Erlang with OpenMP are the hybrid programming models being evaluated. The standard matrix multiplication problem using the recursive Strassen's algorithm has been chosen as the reference problem for the evaluation.

The rest of the paper is arranged as follows. Sect. II provides background knowledge for the evaluation, describing the recursive Strassen's algorithm for matrix multiplication. Sect. III introduces the design of the model on which the implementations are based. The experience gained in implementing the above scenarios, their individual evaluations and the programming environment used for the evaluations are presented in Sect. IV. Based on the benefits and constraints identified from the evaluations, the conclusions are provided in Sect. V.

II. BACKGROUND

A. Matrix Multiplication
The multiplication of two matrices is one of the most basic operations of linear algebra and scientific computing and has provided an important focus in the search for methods to speed up scientific computation. It is also a primitive operation in most real-world computer graphics and digital signal processing applications.

B. Strassen's Algorithm: An O(n^lg 7) Matrix Multiplication Algorithm

A more enhanced matrix multiplication algorithm, Strassen's algorithm [13], was introduced in 1969 with a complexity of O(n^lg 7), where lg 7 is approximately 2.807. The algorithm is more efficient than the naive algorithm in execution time as well as memory usage. The algorithm is as follows.

1) If the matrices to be multiplied are A and B and the result is C of dimension n, the A and B matrices are each divided into 4 quadrants, creating 8 sub-matrices of dimension n/2.

2) The 7 Strassen equations, equations 1 to 7, are applied to the above sub-matrices, creating 7 temporary sub-matrices of dimension n/2.

P1 = (A11 + A22)(B11 + B22)    (1)
P2 = (A21 + A22) B11           (2)
P3 = A11 (B12 - B22)           (3)
P4 = A22 (B21 - B11)           (4)
P5 = (A11 + A12) B22           (5)
P6 = (A21 - A11)(B11 + B12)    (6)
P7 = (A12 - A22)(B21 + B22)    (7)

3) The temporary sub-matrices are used to calculate the 4 sub-matrices of the result C using equations 8 to 11.

C11 = P1 + P4 - P5 + P7        (8)
C12 = P3 + P5                  (9)
C21 = P2 + P4                  (10)
C22 = P1 + P3 - P2 + P6        (11)

When compared to other approaches, Strassen's algorithm uses the cache efficiently, as the calculations are performed on sub-matrices. Also, the algorithm contains only 7 sub-matrix multiplications. However, the algorithm does not perform well for matrices of smaller dimensions, due to the overhead of pre- and post-processing of the matrices.

C. Recursive Strassen's Algorithm

Not only has the sequential Strassen's algorithm provided improved performance over conventional multiplication, but the ability to apply the algorithm recursively to the sub-matrix multiplications has made it ideally suited to exploit the multiple levels of parallelism available on computer clusters. This heterogeneity is one of the metrics we use in our evaluations. A careful analysis of equations 1 to 7 shows 7 multiplications performed on added and subtracted input sub-matrices. By applying Strassen's algorithm again on each of these embedded multiplications, the algorithm becomes recursive. Therefore the recursive Strassen's algorithm can be scaled to any number of recursive levels, which supports the scalability metric of our evaluations. The algorithm can be implemented with multiple recursive levels to exploit the parallelism available at multiple memory levels.
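To make the recursion concrete, the following serial C routine is a minimal sketch of one Strassen step as given by equations 1 to 11. It is not the paper's code; the names (strassen, add, sub, the quadrant helpers) and the cutoff BASE are illustrative assumptions. Below the cutoff the naive O(n^3) product is used, reflecting the remark above that Strassen's algorithm does not pay off for small dimensions. The parallel variants evaluated later distribute exactly these seven recursive calls.

#include <stdlib.h>
#include <string.h>

/* Illustrative serial sketch of one level of Strassen recursion.
 * Matrices are n x n, row-major, with n a power of two.
 * Below the cutoff BASE the naive O(n^3) product is used. */
#define BASE 64

static void add(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n * n; i++) c[i] = a[i] + b[i];
}
static void sub(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n * n; i++) c[i] = a[i] - b[i];
}
static void naive(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++) s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}
/* Copy quadrant (qi,qj) of the n x n matrix src into the h x h matrix dst. */
static void get_quad(const double *src, double *dst, int n, int qi, int qj) {
    int h = n / 2;
    for (int i = 0; i < h; i++)
        memcpy(dst + i * h, src + (qi * h + i) * n + qj * h, h * sizeof(double));
}
static void put_quad(double *dst, const double *src, int n, int qi, int qj) {
    int h = n / 2;
    for (int i = 0; i < h; i++)
        memcpy(dst + (qi * h + i) * n + qj * h, src + i * h, h * sizeof(double));
}

void strassen(const double *A, const double *B, double *C, int n) {
    if (n <= BASE) { naive(A, B, C, n); return; }
    int h = n / 2;
    size_t sz = (size_t)h * h * sizeof(double);
    /* 8 input quadrants, 7 products and 2 scratch operands */
    double *q[8], *P[7], *t1 = malloc(sz), *t2 = malloc(sz);
    for (int i = 0; i < 8; i++) q[i] = malloc(sz);
    for (int i = 0; i < 7; i++) P[i] = malloc(sz);
    double *A11 = q[0], *A12 = q[1], *A21 = q[2], *A22 = q[3];
    double *B11 = q[4], *B12 = q[5], *B21 = q[6], *B22 = q[7];
    get_quad(A, A11, n, 0, 0); get_quad(A, A12, n, 0, 1);
    get_quad(A, A21, n, 1, 0); get_quad(A, A22, n, 1, 1);
    get_quad(B, B11, n, 0, 0); get_quad(B, B12, n, 0, 1);
    get_quad(B, B21, n, 1, 0); get_quad(B, B22, n, 1, 1);
    /* Equations (1)-(7): each product recurses on dimension n/2. */
    add(A11, A22, t1, h); add(B11, B22, t2, h); strassen(t1, t2, P[0], h);
    add(A21, A22, t1, h);                       strassen(t1, B11, P[1], h);
    sub(B12, B22, t2, h);                       strassen(A11, t2, P[2], h);
    sub(B21, B11, t2, h);                       strassen(A22, t2, P[3], h);
    add(A11, A12, t1, h);                       strassen(t1, B22, P[4], h);
    sub(A21, A11, t1, h); add(B11, B12, t2, h); strassen(t1, t2, P[5], h);
    sub(A12, A22, t1, h); add(B21, B22, t2, h); strassen(t1, t2, P[6], h);
    /* Equations (8)-(11): combine the products into the quadrants of C. */
    add(P[0], P[3], t1, h); sub(t1, P[4], t1, h); add(t1, P[6], t1, h);
    put_quad(C, t1, n, 0, 0);
    add(P[2], P[4], t1, h); put_quad(C, t1, n, 0, 1);
    add(P[1], P[3], t1, h); put_quad(C, t1, n, 1, 0);
    add(P[0], P[2], t1, h); sub(t1, P[1], t1, h); add(t1, P[5], t1, h);
    put_quad(C, t1, n, 1, 1);
    for (int i = 0; i < 8; i++) free(q[i]);
    for (int i = 0; i < 7; i++) free(P[i]);
    free(t1); free(t2);
}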

Fig. 1. Implementation model for the recursive Strassen's algorithm using 7 nodes in the cluster. The model is scalable to any number of recursive levels.

III. DESIGN OF THE IMPLEMENTATION MODEL

All six implementations of the recursive Strassen's algorithm based on the MPI, OpenMP and Erlang languages follow a similar model. The objectives in designing the model were providing maximum scalability, extraction of multiple memory levels and flexibility across the language based implementations. The model is based on the divide and conquer topology [14]. As shown in Figure 1, the matrices to be multiplied, of dimension N, are given as input to the master processor P0. In recursive level 1, P0 scatters the matrices into sub-matrices and performs only the additions and subtractions of the 7 Strassen equations (equations 1 to 7), creating 7 x 2 sub-matrices of dimension N/2. P0 sends 6 chunks to other processors, while keeping one chunk to be computed locally, adding performance by overlapping communication with computation. After the multiplications of the sub-matrices are completed in P0 and all the slaves, the results are gathered to P0. P0 then applies the last 4 equations (equations 8 to 11) and merges the results to form the output matrix C. In recursive level 2, when multiplying the sub-matrices of dimension N/2, all the nodes recursively apply Strassen's algorithm. Each node then interacts with another 6 nodes, which requires altogether 49 nodes. Likewise, each node with its slaves applies the equations and sends back the results to P0. Therefore the model can be scaled to n recursive levels, which uses 7^n nodes. The performance should increase as the number of nodes increases, since this reduces the dimension of the matrices to be multiplied. Once all the nodes in the environment have been utilized, to get the advantage of core level parallelism the algorithm can be recursively spanned onto the processor cores in each node. Finally, to get the advantage of thread level parallelism, in each core of each node it is possible to create threads or processes to recursively apply the algorithm. Therefore the model is powerful enough to exploit the parallelism available at all levels of the processing hierarchy.

Due to the resource limitations of our testing environment, the evaluations are restricted to one node level. The implementations accept only 7 nodes from the cluster; this limitation has been imposed to avoid the extra overhead of making the program capable of managing an arbitrary number of nodes. As the cluster consists of single-core processors, core level parallelism cannot be directly evaluated. However, by creating multiple threads in each node, thread level parallelism can be exploited. The evaluations have been done only for matrix dimensions that are powers of 2, due to the nature of Strassen's algorithm.

IV. EVALUATION OF THE IMPLEMENTATIONS

A. Programming environment

All the executions have been done on the Upplanka cluster at the University of Colombo School of Computing, which has 14 64-bit single-core nodes, each with 2 MB of memory.

B. Distributed Memory Programming Models

1) Pure MPI:

MPI language characteristics: The message passing programming model is a distributed memory MIMD model with explicit control parallelism. In designing MPI, the MPI Forum identified some critical shortcomings of existing message-passing systems, in areas such as complex data layouts or support for modularity and safe communication. This led to the introduction of new features in MPI-1 [3]. A major advantage of MPI is the capability to embed it in sequential base languages like Fortran, C and C++. The disadvantage of the MPI model is that dynamic load balancing is often difficult and the granularity of the code often has to be large to ensure that the communication-to-computation ratio remains low. Therefore, based on its language characteristics, MPI is more applicable to parallel problems with coarse grain parallelism.

MPI based implementation: We have used C as the sequential language for embedding MPI. Initially, we implemented a recursive version of Strassen's algorithm in pure MPI, which was scaled up to only one recursive level. In message passing, asynchronous messaging has been used, with loop fusion applied in scattering and merging the matrices into sub-matrices. As shown in Figure 2, our MPI based implementation shows a considerable performance gain over the other high performance MPI implementation used for comparison.

Fig. 2. Variation of execution time with matrix dimension between the Strassen implementation in MPI and another enhanced approach in MPI. The programs have been evaluated with 7 MPI nodes on the cluster.
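The message layout of the paper's MPI code is not reproduced here; the fragment below is a hedged sketch, with assumed names (multiply, left, right, prod, the tags and the dimension N), of how the master rank might post non-blocking sends of the six outgoing operand pairs, overlap them with the locally kept seventh product, and then gather the results, following the design model of Sect. III.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical sketch of recursive level 1 of the design model:
 * rank 0 forms the seven Strassen operand pairs of dimension h = N/2,
 * ships six of them to ranks 1..6 with non-blocking sends, multiplies
 * the seventh locally, and gathers the seven products back.
 * multiply() stands in for the next recursion level (Strassen or naive).
 * Run with exactly 7 ranks, e.g. mpirun -np 7. */
#define N 1024

static void multiply(const double *a, const double *b, double *c, int h) {
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            double s = 0.0;
            for (int k = 0; k < h; k++) s += a[i * h + k] * b[k * h + j];
            c[i * h + j] = s;
        }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int h = N / 2, cnt = h * h;

    if (rank == 0) {
        /* left[p]/right[p] hold the operands of product p (equations 1-7),
         * already assembled by the pre-processing additions/subtractions. */
        double *left[7], *right[7], *prod[7];
        for (int p = 0; p < 7; p++) {
            left[p]  = calloc(cnt, sizeof(double));
            right[p] = calloc(cnt, sizeof(double));
            prod[p]  = calloc(cnt, sizeof(double));
        }
        MPI_Request reqs[12];
        for (int p = 1; p < 7; p++) {   /* six chunks go out asynchronously */
            MPI_Isend(left[p],  cnt, MPI_DOUBLE, p, 0, MPI_COMM_WORLD, &reqs[2 * (p - 1)]);
            MPI_Isend(right[p], cnt, MPI_DOUBLE, p, 1, MPI_COMM_WORLD, &reqs[2 * (p - 1) + 1]);
        }
        multiply(left[0], right[0], prod[0], h);   /* overlap: product 1 stays local */
        MPI_Waitall(12, reqs, MPI_STATUSES_IGNORE);
        for (int p = 1; p < 7; p++)
            MPI_Recv(prod[p], cnt, MPI_DOUBLE, p, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ...post-processing (equations 8-11) would combine prod[0..6] into C... */
        for (int p = 0; p < 7; p++) { free(left[p]); free(right[p]); free(prod[p]); }
    } else if (rank < 7) {
        double *a = malloc(cnt * sizeof(double)), *b = malloc(cnt * sizeof(double));
        double *c = malloc(cnt * sizeof(double));
        MPI_Recv(a, cnt, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(b, cnt, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        multiply(a, b, c, h);
        MPI_Send(c, cnt, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
        free(a); free(b); free(c);
    }
    MPI_Finalize();
    return 0;
}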
Evaluation of the MPI based implementation:

Scalability: The implementation was capable of executing with a maximum matrix dimension of . Message passing in MPI involves considerable overhead in the total execution time.

Support for multiple process levels: As MPI is a distributed memory model, we have exploited only node level parallelism. It is also possible to exploit core level parallelism by enhancing the implementation to create MPI processes on the cores of a node. To exploit thread level parallelism, a shared memory approach should be incorporated with MPI.

Expressiveness of the language: As MPI is not an independent language but a set of libraries, it inherits features such as fault tolerance and garbage collection from the base language. Also, any parallelism must be explicitly handled by the programmer. Due to the divide and conquer nature of the problem, the implementation should be capable of efficiently transporting matrix blocks among the nodes; but due to limitations on MPI message size, scalability was also limited.

2) Pure Erlang:

Functional languages are based on the theoretical foundation of the lambda calculus [8]. In functional languages, a program is a collection of functions. The high level abstraction of parallel constructs encourages experimentation with

alternative parallelizations, which often leads to improved solutions for novel parallel problems.

Erlang language characteristics [9]: Erlang has a declarative syntax with dynamic typing. It was developed with concurrency in mind, with lightweight processes and only asynchronous message passing. Since each process has a separate heap that is garbage collected individually, there is little interruption of program execution for memory management. Erlang is also ideal for soft real time application development with its robust programming techniques [11]. Erlang provides a rich set of external interfaces for foreign code [20].

Erlang implementation: The programming approach in functional programming is much different from that of the other models [10]. The dynamic typing of Erlang allowed an easy and quick implementation of the algorithm, letting us concentrate on the programming logic rather than on language constructs. In our implementation of the algorithm, a matrix was represented as a collection of lists corresponding to the rows of the matrix. Therefore, when creating the matrices, when scattering and merging them, and in the matrix operations (addition, subtraction and multiplication), the operations are performed recursively on each list to build the final result. As Erlang does not allow globally shared data, matrices had to be passed along on every recursive call. The implementation is scalable to an arbitrary number of recursive levels. After the first recursive level, which uses the 7 nodes of the cluster, each node spawns another 6 processes on each recursive iteration. After a process is spawned, it communicates with the master process via asynchronous messages. If a process fails in the middle of its execution, the master process traps the termination and spawns a new process for the task.

Evaluation of the Erlang based implementation:

Scalability: As shown in Figure 3, the implementation was highly scalable (the evaluations have been done only up to 3 levels). The reason is the light weight of Erlang processes and the fast context switching among them. However, the implementation was unable to execute matrix dimensions of more than 2048 on the cluster. Based on the evaluations done with eprof (an Erlang profiling tool [18]), it was found that most of the execution time was devoted to matrix scattering, multiplication and merging. This reveals that the recursion which comes with functional programming reduces performance when applied to these types of coarse-grain parallel problems.

Fig. 3. Variation of execution time with matrix dimension in the Erlang based implementation for 3 recursive levels.

Support for multiple process levels: Node level parallelism was present in the implementation to exploit the 7 nodes of the cluster. However, Erlang does not allow scheduling of programs on different cores, as it is up to the Erlang emulator to manage the processes; there is not much support for core level parallelism in pure Erlang. In the evaluation, after the first recursive level (node level), another 2 recursive levels were tested, creating 49 processes in each node. Therefore Erlang processes can be considered ideal for exploiting thread level parallelism.

Expressiveness of the language: Because of the abstract nature of functional languages, the program consisted of fewer lines of code.
The robustness provided by Erlang maintains the safety of the program in distributed environments. As the matrix multiplication problem is memory intensive, efficient use of memory leads to efficient programs. In the implementation, the maximum matrix dimension was limited by the recursive behavior of the code, which consumed much memory. Therefore functional languages have a negative impact on the performance of highly memory consuming problems. In Erlang, static load balancing among the nodes can be explicitly handled by the master nodes: when a master node traps a process on another node, after its successful termination the master can assign another task to that node.

C. Shared Memory Programming Models

1) OpenMP:

The key strengths of shared memory programming are simplicity and scalability. The most popular shared memory programming library, OpenMP [6], is based on the use of multiple threads.

OpenMP characteristics [4]: OpenMP has a minimalist set of control structures for adding parallel behavior to source code. The simplicity of OpenMP comes from its use of directives. In the execution of an OpenMP program, a shared memory process consists of multiple threads. OpenMP provides both implicit and explicit synchronization. Implicit synchronization is provided by the fork-join model of parallel execution; the user specifies explicit synchronization to manage ordering or data dependencies.

OpenMP implementation: Because of the simplicity provided by OpenMP, porting a serial code to a parallel code is a relatively easy task and no significant code changes are required. Data parallel directives are used to parallelize matrix scattering, merging and the multiplications. As the Strassen equations are independent of each other, functional parallel directives were used in solving the equations; a sketch of this functional decomposition is given at the end of this subsection. To add the recursive behavior to the OpenMP implementation, Intel's work-queuing model has been used [19]. As a result, the implementation was capable of extending to any number of recursive levels.

Evaluation of the OpenMP based implementation:

Scalability: According to the evaluation results shown in Figure 4, the OpenMP based implementation shows a significant performance improvement when the number of recursive levels is increased. Therefore the implementation is very scalable. As the evaluations are obtained with only one node, a comparison with MPI is not relevant in this scenario.

Support for multiple process levels: OpenMP provides only explicit thread level parallelization to the programmer; core level parallelism is hidden at the operating system level. To exploit node level parallelism, a distributed shared memory approach like OpenMP on clusters is required.

Expressiveness of the language: The directive based programming model allows sequential codes to be migrated easily to parallel codes. With added features like the work-queuing model, an Intel extension to OpenMP 2.0, the language provides better support for implementing recursive problems.

Fig. 4. Variation of execution time with matrix dimension in the sequential implementation and the OpenMP based implementation. The OpenMP based implementation was evaluated up to 3 recursive levels. Executions were done with only one node of the cluster.
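Because the seven products are mutually independent, the functional parallelism described above maps directly onto OpenMP worksharing. The fragment below is an illustrative sketch using the standard parallel sections construct; the paper's actual implementation additionally relies on Intel's work-queuing extension for the recursive levels, which is not shown here. The operand arrays L, R, P and the mul callback are assumed names.

/* The seven independent Strassen products (equations 1-7) executed with
 * OpenMP functional parallelism.  L[p] and R[p] are the pre-processed
 * operand pairs of dimension h, P[p] receives product p, and mul() stands
 * in for the multiplication of the next recursive level. */
void strassen_products_omp(double *L[7], double *R[7], double *P[7], int h,
                           void (*mul)(const double *, const double *,
                                       double *, int))
{
    #pragma omp parallel sections
    {
        #pragma omp section
        { mul(L[0], R[0], P[0], h); }   /* P1 = (A11+A22)(B11+B22) */
        #pragma omp section
        { mul(L[1], R[1], P[1], h); }   /* P2 = (A21+A22) B11      */
        #pragma omp section
        { mul(L[2], R[2], P[2], h); }   /* P3 = A11 (B12-B22)      */
        #pragma omp section
        { mul(L[3], R[3], P[3], h); }   /* P4 = A22 (B21-B11)      */
        #pragma omp section
        { mul(L[4], R[4], P[4], h); }   /* P5 = (A11+A12) B22      */
        #pragma omp section
        { mul(L[5], R[5], P[5], h); }   /* P6 = (A21-A11)(B11+B12) */
        #pragma omp section
        { mul(L[6], R[6], P[6], h); }   /* P7 = (A12-A22)(B21+B22) */
    }
    /* The scatter/merge loops and the additions of equations 8-11 use
     * data-parallel "omp for" worksharing in the same way. */
}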
D. Virtual Shared Memory Programming Models

1) OpenMP on Clusters:

OpenMP on Clusters characteristics [4]: Cluster OpenMP [17] provides a shared memory layer over a distributed memory system. Cluster OpenMP extends OpenMP with one additional directive, the sharable directive. The sharable directive identifies variables that are referenced by more than one thread. The task of keeping shared variables consistent across distributed nodes is handled by the Cluster OpenMP run-time library. Sharable variables are grouped together on certain pages in memory.

OpenMP on Clusters implementation: Any pure OpenMP source implementation can be extended to execute on Cluster OpenMP by analyzing and sharing the variables and data structures that will be accessed by multiple threads. Therefore, in our pure OpenMP implementation, the memory allocation and de-allocation functions (for creating and freeing matrices) have been replaced with kmp_sharable_malloc and kmp_sharable_free, as sketched below.
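A minimal sketch of the porting step just described: matrix buffers touched by threads on more than one node are taken from the Cluster OpenMP sharable heap instead of the ordinary heap. The prototypes are written out by hand because the paper does not name the Intel header, so they should be read as assumptions about the toolkit's interface rather than verified declarations.

#include <stddef.h>

/* Cluster OpenMP sharable-heap allocation, as used when porting the pure
 * OpenMP code: buffers accessed by threads on several nodes come from the
 * sharable heap.  The prototypes below are reproduced by hand for this
 * sketch; the real declarations ship with the Intel Cluster OpenMP toolkit. */
void *kmp_sharable_malloc(size_t size);
void  kmp_sharable_free(void *ptr);

double *alloc_matrix(int n)          /* was: malloc(n * n * sizeof(double)) */
{
    return (double *)kmp_sharable_malloc((size_t)n * n * sizeof(double));
}

void free_matrix(double *m)          /* was: free(m) */
{
    kmp_sharable_free(m);
}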

Evaluation of the OpenMP on Clusters implementation:

Scalability: According to Figure 5, a performance increase can be seen at a matrix dimension of 1024 with 1 recursive level.

Fig. 5. Variation of execution time with matrix dimension in the OpenMP based implementation on Intel Cluster OpenMP at one recursive level. Executions were done with 7 nodes on the Upplanka cluster, allowing a maximum of 7 threads on each.

But with a matrix dimension of 2048, the compiler complained of insufficient RAM space on the cluster when enlarging the shared space. The main reason is the nature of the problem: when multiple threads access a matrix, the whole data structure must be kept consistent in each node's shared space. Therefore, keeping the memories consistent is much more expensive in Cluster OpenMP. Calculations with high data locality would reduce this overhead to some extent. In the MPI based implementation the maximum message buffer size acted as a limitation; in a Cluster OpenMP implementation this translates into limited shared space.

Support for multiple process levels: The implementation adds the capability of exploiting node level parallelism to the pure OpenMP approach, which had only thread level parallelism.

Expressiveness of the language: In porting OpenMP programs to Cluster OpenMP, all that needs to be done is to identify the sharable variables. Therefore the flexibility provided by OpenMP has been extended by Cluster OpenMP.

E. Hybrid Programming Models

As discussed above, when comparing the language features, each language has limitations due to the programming paradigm it is based on. We now explore several ways of overcoming such limitations with mixed models, to exploit parallelism at various processing levels.

1) MPI with Pthreads:

Hybrid implementation: As concluded in the pure MPI scenario, to cater to thread level parallelism and the limited scalability of MPI, a shared memory model should be incorporated with MPI. Therefore a hybrid implementation that scales beyond the first recursive level using Pthreads with MPI was carried out; a sketch of the thread level is given at the end of this subsection. We then improved the implementation to be capable of easily spanning any number of thread levels. But when executing with 2 thread levels, the program crashed due to memory limitations when creating 49 threads in each node.

Evaluation: As shown in Figure 6, we achieved a significant performance improvement with the threaded version compared to the pure MPI version. But due to the overhead involved in creating and switching between threads, the executions were limited to only one thread level.

Fig. 6. Variation of execution time with matrix dimension between the pure MPI and MPI + Pthreads implementations. The programs were evaluated with 7 MPI nodes on the cluster. The MPI + Pthreads implementation adds another recursive level using threads.
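The added thread level can be pictured as follows: inside each MPI rank, the seven products of the next recursion level are handed to seven POSIX threads and joined before the post-processing equations are applied. This is a hedged sketch with assumed names (task_t, multiply, run_thread_level), not the paper's code, and it ignores the memory pressure that limited the real runs to a single thread level.

#include <pthread.h>

/* Sketch of the thread level inside one MPI rank: the seven Strassen
 * products of dimension h are computed by seven POSIX threads.
 * task_t, multiply() and run_thread_level() are assumed names. */
typedef struct {
    const double *left, *right;   /* pre-processed operands of one product */
    double *prod;                 /* output buffer for this product        */
    int h;                        /* sub-matrix dimension                  */
} task_t;

static void multiply(const double *a, const double *b, double *c, int h) {
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            double s = 0.0;
            for (int k = 0; k < h; k++) s += a[i * h + k] * b[k * h + j];
            c[i * h + j] = s;
        }
}

static void *worker(void *arg) {
    task_t *t = (task_t *)arg;
    multiply(t->left, t->right, t->prod, t->h);
    return NULL;
}

/* Called by an MPI rank after it has built its seven operand pairs;
 * equations 8-11 are applied once all threads have joined. */
void run_thread_level(task_t tasks[7]) {
    pthread_t tid[7];
    for (int p = 0; p < 7; p++)
        pthread_create(&tid[p], NULL, worker, &tasks[p]);
    for (int p = 0; p < 7; p++)
        pthread_join(tid[p], NULL);
}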

2) Erlang with OpenMP:

Hybrid implementation: As a result of the light weight of Erlang processes and the fast context switching among them, the Erlang implementation was capable of scaling to more than 3 recursive levels, providing a significant speedup. But unlike the other approaches, the high memory usage due to the recursive nature of functional programming has a negative impact, and therefore most of the execution time of the program was consumed in pre- and post-processing. We therefore conjecture that a hybrid implementation of Erlang and OpenMP would be able to extract the advantages of both approaches. In this hybrid implementation, Erlang nodes have been used at the roots of the divide and conquer based design model to create and manage slaves, and the most costly calculations (pre-processing, post-processing and multiplications) were done with OpenMP embedded in C programs, which were used as the leaf nodes of the design model. The interoperability mechanism used in interfacing Erlang with OpenMP is distributed Erlang.
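On the C side of this hybrid, each leaf node reduces to a parallel multiplication kernel. The routine below is a minimal OpenMP sketch of such a leaf kernel under an assumed name (leaf_multiply); the distributed-Erlang transport and the marshaling/unmarshaling of matrices between the Erlang master and the C node are deliberately omitted.

#include <stdlib.h>

/* Sketch of the C/OpenMP leaf worker in the Erlang + OpenMP hybrid: it
 * receives two unmarshaled operand matrices of dimension n from the Erlang
 * master and returns their product.  The distributed-Erlang/C-node
 * transport and the marshaling layer are not shown. */
double *leaf_multiply(const double *a, const double *b, int n) {
    double *c = malloc((size_t)n * n * sizeof(double));
    #pragma omp parallel for
    for (int i = 0; i < n; i++)            /* data-parallel multiplication */
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
    return c;
}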
Evaluation: As shown in Figure 7, the implementation is more scalable than pure Erlang or Cluster OpenMP. However, when transferring data between Erlang nodes and C nodes, the data marshaling and unmarshaling mechanisms used have acted as an overhead for scaling with the matrix dimension. As this is a hybrid of Erlang and OpenMP, the flexibility of both approaches is present. Therefore the Erlang with OpenMP scenario is more flexible than the MPI with Pthreads scenario discussed earlier.

Fig. 7. Variation of execution time with matrix dimension in the Erlang with OpenMP hybrid implementation on the cluster, up to two recursive levels. Executions were done with 7 nodes on the Upplanka cluster.

V. CONCLUSION

When comparing the most conventional shared and distributed memory parallel programming paradigms, each paradigm has its own specific capabilities and constraints in implementing the recursive Strassen's matrix multiplication problem. These are summarized in Table I.

TABLE I: OVERALL EVALUATION

Language           | Scalability                                    | Support for Multiple Parallel Levels | Expressiveness of Language
OpenMP on Clusters | Limited by memory constraints                  | Node and thread level                | Flexible in usage, but depends on base language
MPI with Pthreads  | Limited, but more scalable than pure MPI       | Node, core and thread level          | Depends on base language
Erlang with OpenMP | More scalable than pure Erlang, Cluster OpenMP | Node, core and thread level          | Highly flexible

Pure MPI was more scalable in matrix dimension than the other two approaches, but the restrictions placed on message buffers limit its scalability. The hybrid of MPI with Pthreads makes the pure MPI approach recursively scalable with thread level parallelism. Pure OpenMP was the most flexible compared to MPI and Erlang, but it focuses only on thread level parallelism. The distributed memory extension, Cluster OpenMP, has not shown much performance when used with the memory intensive parallel problem of matrix multiplication. Pure Erlang was the most recursively scalable approach; however, the recursion reduces performance when used with memory intensive problems. To overcome individual paradigm limitations, an Erlang and OpenMP hybrid approach has been proposed, implemented and evaluated. The implementation was more scalable than either pure Erlang or Cluster OpenMP. However, it was not as scalable as MPI with Pthreads, due to the inefficiency of the data marshaling and unmarshaling mechanisms.

In much scientific research, executing sequential codes on high performance clusters is a major concern. An Erlang based framework that utilizes the concurrency, robustness and interoperability of Erlang and the expressiveness of OpenMP would address this problem.

VI. ACKNOWLEDGMENT

I gratefully acknowledge all the support given by Mr. Malik Silva and Mr. K. Ganeshamoorthi in setting up the evaluation environments.

REFERENCES

[1] L. M. E. Silva and R. Buyya, High Performance Cluster Computing: Programming and Applications, 2nd ed., vol. 2. New Jersey: Prentice Hall, 2000, ch. 1.
[2] D. Geer, "Chip Makers Turn to Multi-core Processors," IEEE Computer, vol. 38, no. 5, May 2005.
[3] Message Passing Interface (MPI) [Online]. 13 Sept 2007 [Accessed 05 May 2008]. Available: computing.llnl.gov/tutorials/mpi/
[4] D. E. Lenoski and W. D. Weber, "Scalable Shared Memory Processing," IEEE Transactions on Computers, vol. 48, Nov. 1999.

[5] OpenMP [Online]. 13 Sept 2007 [Accessed 05 May 2008]. Available: computing.llnl.gov/tutorials/openMP/
[6] L. Dagum and R. Menon, "OpenMP: An Industry-Standard API for Shared-Memory Programming," IEEE Computational Science & Engineering, vol. 5, no. 1, pp. 46-55, Jan.-Mar. 1998.
[7] G. Jost and H. Jin, "Comparing the OpenMP, MPI, and Hybrid Programming Paradigms on an SMP Cluster," NASA Advanced Supercomputing Division, California, Tech. Rep., Nov. 2003.
[8] R. Rojas, "A Tutorial Introduction to the Lambda Calculus," Berlin, May.
[9] P. Hedqvist, "A Parallel and Multithreaded ERLANG Implementation," M.S. thesis, Uppsala University, Sweden, June.
[10] J. Barklund and R. Virding, Erlang Reference Manual [Online]. Feb 1999 [Accessed 05 Jan 2008]. Available: manual/part frame.html
[11] J. Armstrong, "Making reliable distributed systems in the presence of software errors," Ph.D. dissertation, The Royal Institute of Technology, Sweden, Dec. 2003.
[12] Y. P. Mong, Justin, "Faster Matrix Multiplication Algorithms," University of Manchester, May.
[13] V. Strassen, "Gaussian elimination is not optimal," Numerische Mathematik, vol. 13, pp. 354-356, 1969.
[14] B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, 2nd ed. Prentice Hall, 2003, ch. 4.
[15] R. Rabenseifner, "Hybrid Parallel Programming: Performance Problems and Chances," in Proceedings of the 45th Cray User Group (CUG) Conference, Columbus, Ohio, May 2003.
[16] R. Rabenseifner and G. Wellein, "Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures," International Journal of High Performance Computing Applications, vol. 17, no. 1, pp. 49-62, 2003.
[17] J. P. Hoeflinger, "Extending OpenMP to Clusters," White Paper, Intel Corporation.
[18] C. Wikström and G. Hugosson, Eprof [Online] [Accessed April 2008]. Available: 5.2/lib/tools2.1/doc/html/eprof.html
[19] S. Shah, G. Haab, P. Petersen, and J. Throop, "Flexible control structures for parallelism in OpenMP," in Proceedings of the Fourth European Workshop on OpenMP (EWOMP '02), Lund, Sweden, Sept. 2002.
[20] Erlang documentation [Online] [Accessed April 2008].
[21] Sample Matrix Multiplication code [Online] [Accessed August 2008]. Available: tyang/class/240b99/homework/sam mm.c


More information

Solving Traveling Salesman Problem on High Performance Computing using Message Passing Interface

Solving Traveling Salesman Problem on High Performance Computing using Message Passing Interface Solving Traveling Salesman Problem on High Performance Computing using Message Passing Interface IZZATDIN A. AZIZ, NAZLEENI HARON, MAZLINA MEHAT, LOW TAN JUNG, AISYAH NABILAH Computer and Information Sciences

More information

Lecture 25: Board Notes: Threads and GPUs

Lecture 25: Board Notes: Threads and GPUs Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

CS516 Programming Languages and Compilers II

CS516 Programming Languages and Compilers II CS516 Programming Languages and Compilers II Zheng Zhang Spring 2015 Mar 12 Parallelism and Shared Memory Hierarchy I Rutgers University Review: Classical Three-pass Compiler Front End IR Middle End IR

More information

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987 Extra-High Speed Matrix Multiplication on the Cray-2 David H. Bailey September 2, 1987 Ref: SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3, (May 1988), pg. 603{607 Abstract The Cray-2 is

More information

Parallelization of Graph Isomorphism using OpenMP

Parallelization of Graph Isomorphism using OpenMP Parallelization of Graph Isomorphism using OpenMP Vijaya Balpande Research Scholar GHRCE, Nagpur Priyadarshini J L College of Engineering, Nagpur ABSTRACT Advancement in computer architecture leads to

More information

Performance Estimation of Parallel Face Detection Algorithm on Multi-Core Platforms

Performance Estimation of Parallel Face Detection Algorithm on Multi-Core Platforms Performance Estimation of Parallel Face Detection Algorithm on Multi-Core Platforms Subhi A. Bahudaila and Adel Sallam M. Haider Information Technology Department, Faculty of Engineering, Aden University.

More information

An Introduction to OpenMP

An Introduction to OpenMP An Introduction to OpenMP U N C L A S S I F I E D Slide 1 What Is OpenMP? OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel

More information

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy Why serial is not enough Computing architectures Parallel paradigms Message Passing Interface How

More information

Comparing Gang Scheduling with Dynamic Space Sharing on Symmetric Multiprocessors Using Automatic Self-Allocating Threads (ASAT)

Comparing Gang Scheduling with Dynamic Space Sharing on Symmetric Multiprocessors Using Automatic Self-Allocating Threads (ASAT) Comparing Scheduling with Dynamic Space Sharing on Symmetric Multiprocessors Using Automatic Self-Allocating Threads (ASAT) Abstract Charles Severance Michigan State University East Lansing, Michigan,

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

Chapter 4 Threads, SMP, and

Chapter 4 Threads, SMP, and Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 4 Threads, SMP, and Microkernels Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Roadmap Threads: Resource ownership

More information

Multithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa

Multithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa CS4460 Advanced d Algorithms Batch 08, L4S2 Lecture 11 Multithreaded Algorithms Part 1 N. H. N. D. de Silva Dept. of Computer Science & Eng University of Moratuwa Announcements Last topic discussed is

More information

L21: Putting it together: Tree Search (Ch. 6)!

L21: Putting it together: Tree Search (Ch. 6)! Administrative CUDA project due Wednesday, Nov. 28 L21: Putting it together: Tree Search (Ch. 6)! Poster dry run on Dec. 4, final presentations on Dec. 6 Optional final report (4-6 pages) due on Dec. 14

More information

ECE519 Advanced Operating Systems

ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor

More information