A Comparison of Five Parallel Programming Models for C++


A Comparison of Five Parallel Programming Models for C++

Ensar Ajkunic, Hana Fatkic, Emina Omerovic, Kristina Talic and Novica Nosovic
Faculty of Electrical Engineering, University of Sarajevo, Sarajevo, Bosnia and Herzegovina

Abstract

Multi-core processors offer a growing potential for parallelism but pose the challenge of developing programs that achieve high performance. This paper presents a comparison of five parallel programming models for implementing parallel programs in C++ on multi-core computer systems. The models under consideration are Intel's Threading Building Blocks (TBB), OpenMPI, Intel's Cilk Plus, OpenMP and Pthreads. For demonstration purposes, multiple parallel implementations of a matrix multiplication algorithm, which is well suited to parallelization, were created. The main goal of this paper is a comprehensive comparison of the chosen models with respect to two criteria: performance and the coding effort required.

Index Terms: Parallel programming, Pthreads, OpenMP, TBB, Cilk++, OpenMPI

I. INTRODUCTION

Multiprocessor computer systems are now widely available, with virtually all processors shipped by manufacturers based on multi-core designs [1]. The current trend towards more processor cores instead of higher clock speeds can be attributed to the physical limitations of modern processor designs [2], [3]. This has serious implications for the design and development of applications, as writing and debugging parallel software is difficult and requires more expertise than sequential programming [3]. However, it is vital that programmers become proficient in writing parallel code so that the parallel power available in multi-core computer systems can be harnessed. Various tools and libraries that help programmers make the transition from sequential to parallel programming are available. These include the compiler-directive-based OpenMP [9] and Threading Building Blocks [12], a high-level library providing parallel constructs and optimized parallel algorithms [4], [1], [2], [5]. Lower-level libraries such as Pthreads provide more control but require the programmer to manage threads explicitly [6], [7]. Other parallel and performance tools include Intel Integrated Performance Primitives (IPP), the Intel Math Kernel Library (MKL) and the Microsoft Parallel Patterns Library (PPL). This paper evaluates the performance and code characteristics of the five selected parallel libraries when used to parallelize a matrix multiplication algorithm. The parallel implementations are compared on elapsed program time and speedup. Code characteristics such as the effort required (measured in lines of code added or changed) and the total number of lines are also compared.

II. RELATED WORK

To exploit the parallelism available from multiple CPU cores, software has to be able to spread its workload across multiple processors. On shared-memory multiprocessor systems, such as those based on multi-core CPUs, this is typically achieved using multi-threading, although other techniques such as message passing can be employed. It might seem that if a little threading is good, then a lot must be better. In fact, having too many threads can bog down a program. The impact of having too many threads comes in two ways. First, partitioning a fixed amount of work among too many threads gives each thread so little work that the overhead of starting and terminating threads swamps the useful work.
Second, having too many threads running incurs overhead from the way they share finite hardware resources. A good solution is to limit the number of runnable threads to the number of hardware threads. The algorithms developed in this work used two threads.

A. Shared memory based parallel programming models

A diverse range of shared-memory-based parallel programming models has been developed to date. They can be classified into three main types, as described below [8].

1) Threading models: These models are based on a thread library that provides low-level routines for parallelizing the application. They use mutual exclusion locks and condition variables to establish communication and synchronization between threads. Threading models are suitable for applications based on a multiplicity of data, and they give the programmer a very high degree of flexibility.

2) Directive based models: These models use high-level compiler directives to parallelize applications and can be seen as an extension of the thread-based models. A directive-based model takes care of low-level features like partitioning, worker management, synchronization and communication among the threads. The main advantage of directive models is that it is easy to write parallel applications, and the programmer does not need to consider issues such as data races, false sharing and deadlocks.

3) Tasking models: These models are based on the concept of specifying tasks instead of threads, as done by the other models.

Tasks have a short life span and are more lightweight than threads. One difference between tasks and threads is that tasks are always implemented in user mode.

B. Programming Models Evaluated

This section describes the parallel programming models that are evaluated in this paper. The models evaluated are: Pthreads as a threading model; OpenMP as a directive-based model; TBB and Cilk++ as task-based models; and MPI as both a distributed- and shared-memory model.

1) Pthreads: Portable Operating System Interface (POSIX) threads are an interface consisting of a set of C language procedures and extensions used for creating and managing threads [6]. They extend easily to multiprocessor platforms and are capable of realizing the potential performance gains of parallel programs. Pthreads is a raw threading model that resides on a shared-memory platform and leaves most of the implementation details of a parallel program to the developer, which makes it very flexible. Pthreads has a very low level of abstraction, and hence developing an application with this model is hard from the developer's perspective. With Pthreads the application developer has more responsibilities, such as workload partitioning, worker management, communication and synchronization, and task mapping. The interface defines a wide variety of library routines categorized according to these responsibilities.

2) OpenMP: Open specification for Multi-Processing is an application program interface that defines a set of program directives, run-time library routines and environment variables used to explicitly express multi-threaded, shared-memory parallelism [9]. It is specified for C, C++ and Fortran. OpenMP stands at a high level of abstraction, which eases the development of parallel applications from the developer's perspective. OpenMP hides and handles by itself details such as workload partitioning, worker management, communication and synchronization; the developer only needs to specify directives in order to parallelize the application. OpenMP is not as widely used as Pthreads and has not emerged as a standard to the same degree; the flexibility of this model is also lower than that of Pthreads.

3) TBB: Threading Building Blocks is a parallel programming library developed by Intel Corporation [12]. It offers a highly sophisticated set of parallel primitives for parallelizing applications and enhancing their performance on many cores. TBB is high-level and supports task-based parallelism; it not only replaces threading libraries but also hides the details of the threading mechanisms used for performance and scalability. TBB focuses on scalable data-parallel programming, in which performance grows with the number of processor cores, something that is much harder to achieve with raw threads. It supports scalable parallel programming using plain C++ code and does not require any special languages or compilers. The underlying library is responsible for mapping tasks onto threads in an efficient manner. As a result, TBB lets the programmer express parallelism more conveniently than other models for scalable data-parallel programming. On the other hand, as one goes deeper into the TBB programming model it becomes harder to understand, and in some cases development time increases, because TBB stands at a high level of abstraction.
4) Cilk++: Cilk++ is a task-based parallel library [11]. It enables very fast development of parallel applications using just three Cilk++ keywords and a runtime system, and it is based on C++. It is well suited to problems based on a divide-and-conquer strategy; recursive functions in particular map well onto the Cilk++ language. The Cilk++ keywords identify function calls and loops that can run in parallel, and the Intel Cilk++ runtime schedules the resulting tasks to run efficiently on the available processors. Cilk++ eases development by only requiring the developer to create tasks, and it also provides a facility for detecting races in the program.

5) MPI: The Message Passing Interface (MPI) is a message-passing library standard designed to function on a wide variety of parallel computers [10]. Interface specifications have been defined for C/C++ and Fortran programs. While all the other evaluated models work only on symmetric multiprocessing (SMP) systems, MPI works on both SMP and distributed-memory systems. It is portable: there is no need to modify source code when porting an application to a different platform that supports (and is compliant with) the MPI standard. MPI uses objects called communicators and groups to define which collections of processes may communicate with each other, and most MPI routines require a communicator as an argument. Within a communicator, every process has its own unique integer identifier, called its rank, assigned by the system when the process initializes. Ranks are used by the programmer to specify the source and destination of messages.

Table I compares the five tested parallel programming approaches on several criteria, such as the type of parallelism they support, their complexity, and the compiler and environment support they require.

TABLE I. COMPARISON OF FIVE PARALLEL PROGRAMMING LANGUAGES

III. PROBLEM DEFINITION

The approach focuses on experimentation and benchmarking. The matrix multiplication algorithm was chosen for this simple demonstration of the tools because it provides a best-case scenario for parallelization. Since the workload is easily distributed, any overheads introduced by the parallel tools should become apparent when comparing their performance.

IV. MATRIX MULTIPLICATION PARALLELIZATION

Matrix multiplication is a binary operation that takes a pair of matrices and produces another matrix. If A is an m-by-n matrix and B is an n-by-p matrix, then their matrix product AB is given by

$(AB)_{i,j} = \sum_{r=1}^{n} A_{i,r} B_{r,j} = A_{i,1}B_{1,j} + A_{i,2}B_{2,j} + \dots + A_{i,n}B_{n,j}$, where $1 \le i \le m$ and $1 \le j \le p$.

An examination of the sequential algorithm for matrix multiplication makes it clear that the main part of the work is carried out within three for loops: the first two loops initialize the matrices, and the third, a nested loop, performs the multiplication. As such, the primary focus of the parallel optimization effort is on making the loop iterations run in parallel. However, care has to be taken to ensure that this is safe and that no race conditions are introduced in the parallel implementations of the algorithm. It is evident from the structure of the initialization loop and the outer multiplication loop that there are no data dependencies between their iterations, unlike the inner loops, which have several dependencies. This allows an implementation without the use of locks, so there is no chance of deadlock. The first parallel implementation was carried out using Pthreads; the OpenMP, TBB, Cilk++ and MPI implementations were devised and tested afterwards.
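As a point of reference for the parallel versions discussed below, the sequential kernel just described can be pictured roughly as follows. This is only an illustrative sketch: the Matrix type, the function name and the omission of the initialization loops are assumptions, not the paper's original code.

    #include <vector>

    typedef std::vector<std::vector<double>> Matrix;

    // Illustrative sequential kernel: the triply nested multiplication loop
    // discussed in the text (the two matrix-initialization loops are omitted).
    void multiply_seq(const Matrix& A, const Matrix& B, Matrix& C)
    {
        const std::size_t n = A.size();   // assume square n-by-n matrices
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < n; ++k)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }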
A. Pthreads

The Pthreads implementation is somewhat harder than the OpenMP, TBB and Cilk++ implementations, as it is the programmer's job to manage thread creation and scheduling. When implementing parallelism using Pthreads, four threads were used, and the iteration space was then distributed evenly between the threads (Listing 1).

Listing 1: Matrix multiplication source code using Pthreads
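A minimal sketch of such a Pthreads partitioning, in which each of four worker threads multiplies a contiguous band of rows, might look as follows. The globals, the thread count handling and the function names are illustrative assumptions rather than the paper's original listing.

    #include <pthread.h>
    #include <vector>

    // Illustrative global data: square N-by-N matrices shared by all threads.
    const int N = 1000;
    const int NUM_THREADS = 4;
    std::vector<std::vector<double>> A(N, std::vector<double>(N, 1.0)),
                                     B(N, std::vector<double>(N, 1.0)),
                                     C(N, std::vector<double>(N, 0.0));

    // Each worker multiplies a contiguous band of rows of the result matrix.
    void* multiply_rows(void* arg)
    {
        long id = reinterpret_cast<long>(arg);
        int chunk = N / NUM_THREADS;
        int begin = static_cast<int>(id) * chunk;
        int end   = (id == NUM_THREADS - 1) ? N : begin + chunk;
        for (int i = begin; i < end; ++i)
            for (int j = 0; j < N; ++j) {
                double sum = 0.0;
                for (int k = 0; k < N; ++k)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
        return nullptr;
    }

    void multiply_pthreads()
    {
        pthread_t threads[NUM_THREADS];
        for (long t = 0; t < NUM_THREADS; ++t)
            pthread_create(&threads[t], nullptr, multiply_rows,
                           reinterpret_cast<void*>(t));
        for (int t = 0; t < NUM_THREADS; ++t)
            pthread_join(threads[t], nullptr);
    }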
B. OpenMP

The OpenMP implementation uses simple pragma directives to instruct the compiler. Listing 2 shows how the omp parallel for directive has been used to parallelize the execution of the loop (a sketch of such a loop is given after the TBB description below).

Listing 2: Matrix multiplication source code using OpenMP

C. TBB

The Threading Building Blocks implementation requires more extensive modification of the original code to implement parallelism. Because TBB follows an object-oriented model, the multiplication work was implemented in its own class (Listing 3). A blocked range represents the range over which to iterate, and the call to the multiply function is replaced with a call to the TBB parallel_for template.

Listing 3: Matrix multiplication source code using TBB
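For the OpenMP case referenced above, a minimal sketch of how the directive might be applied to the outer loop follows; the function and variable names are illustrative assumptions, not the paper's listing.

    #include <vector>

    typedef std::vector<std::vector<double>> Matrix;

    // Illustrative OpenMP version: a single directive on the outer loop is
    // enough, since its iterations carry no data dependencies.
    void multiply_openmp(const Matrix& A, const Matrix& B, Matrix& C)
    {
        const int n = static_cast<int>(A.size());
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int k = 0; k < n; ++k)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }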
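For the TBB case, a minimal sketch of the pattern just described, with the multiplication work wrapped in a function object and handed to tbb::parallel_for over a blocked_range; the class name Multiply and the Matrix typedef are illustrative assumptions.

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <vector>

    typedef std::vector<std::vector<double>> Matrix;

    // Function object holding references to the matrices; operator() multiplies
    // the band of rows covered by the sub-range TBB hands to it.
    class Multiply {
        const Matrix& A;
        const Matrix& B;
        Matrix& C;
    public:
        Multiply(const Matrix& a, const Matrix& b, Matrix& c) : A(a), B(b), C(c) {}
        void operator()(const tbb::blocked_range<std::size_t>& r) const {
            const std::size_t n = A.size();
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                for (std::size_t j = 0; j < n; ++j) {
                    double sum = 0.0;
                    for (std::size_t k = 0; k < n; ++k)
                        sum += A[i][k] * B[k][j];
                    C[i][j] = sum;
                }
        }
    };

    void multiply_tbb(const Matrix& A, const Matrix& B, Matrix& C)
    {
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, A.size()),
                          Multiply(A, B, C));
    }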
D. Cilk++

Similar to the TBB implementation, in the Cilk++ implementation the outer loop was parallelized using the cilk_for keyword, and no other modifications were needed (Listing 4; a sketch is given after the MPI description below). The cilk_for loop is a replacement for the normal C/C++ for loop that permits the loop iterations to run in parallel: the compiler converts the loop body into a function that is called recursively using a divide-and-conquer strategy.

Listing 4: Matrix multiplication source code using Cilk++

E. MPI

The MPI implementation of the matrix multiplication algorithm requires significant code modification, since data transfer to the workers and work partitioning have to be done manually. In addition, the many index calculations and the handling of the remaining work when the amount of work is not divisible by the number of workers bring further difficulties for the programmer.

Listing 5: Matrix multiplication source code using MPI
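For the Cilk++ case, a minimal sketch in which swapping the outer for loop for cilk_for is essentially the only change; the include shown follows the Intel Cilk Plus convention and, like the names, is an assumption, since the Cilk Arts toolchain used in the paper may differ.

    #include <cilk/cilk.h>   // Intel Cilk Plus convention; Cilk++ builds may differ
    #include <vector>

    typedef std::vector<std::vector<double>> Matrix;

    // Illustrative Cilk version: cilk_for lets the iterations of the outer loop
    // run in parallel; the runtime schedules them with work stealing.
    void multiply_cilk(const Matrix& A, const Matrix& B, Matrix& C)
    {
        const std::size_t n = A.size();
        cilk_for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < n; ++k)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }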
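For the MPI case, a minimal sketch of one possible row-band decomposition. It assumes flat row-major matrices, a matrix size divisible by the number of ranks, and that MPI_Init has already been called; the paper's own listing additionally handles the non-divisible remainder, which is omitted here.

    #include <mpi.h>
    #include <vector>

    // Illustrative MPI version. Matrices are stored as flat row-major arrays of
    // size N*N, and N is assumed to be divisible by the number of ranks.
    void multiply_mpi(std::vector<double>& A, std::vector<double>& B,
                      std::vector<double>& C, int N)
    {
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Rank 0 is assumed to hold the input matrices; share them with everyone.
        MPI_Bcast(A.data(), N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Bcast(B.data(), N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        const int rows  = N / size;        // band of rows owned by this rank
        const int first = rank * rows;
        std::vector<double> local(rows * N, 0.0);
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < N; ++j)
                for (int k = 0; k < N; ++k)
                    local[i * N + j] += A[(first + i) * N + k] * B[k * N + j];

        // Collect the row bands back into C on rank 0.
        MPI_Gather(local.data(), rows * N, MPI_DOUBLE,
                   C.data(), rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }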
V. RESULTS

To compare Pthreads, OpenMP, TBB, Cilk++ and MPI, multiple implementations of the matrix multiplication algorithm were developed. These implementations were then tested, and information about execution time and speedup was collected. These results are elaborated in Section V-A; Section V-B elaborates the code characteristics (measured in lines of code added or changed) for each model. Testing was performed on a dual-core system (Intel Core 2 Duo processor T6500, 2 MB L2 cache, 2.10 GHz, 800 MHz FSB) with 4 GB of DDR2 800 MHz RAM, running Ubuntu. Software versions are as follows: GCC (Cilk Arts build 8503), TBB 4.0, OpenMPI 1.4.4, Boost 1.42 and OpenMP 3.0.

A. Matrix Multiplication Performance Analysis

For benchmarking purposes, 11 data sets were created, increasing the matrix size from 1000 to 2000 in steps of 100. These data sets were then applied to each algorithm. To ensure that the benchmarking process is consistent, each algorithm was rerun 30 times and the average execution time calculated. The first performance comparison of interest is time consumption. Fig. 1 demonstrates that all tested models significantly decrease execution time compared to sequential execution. While it may appear that OpenMP performs poorly compared to Cilk++ and TBB, it must be noted that the matrix multiplication algorithm is an embarrassingly task-parallel problem that is better suited to the task-parallel models Cilk++ and TBB, whereas OpenMP is better suited to data-parallel problems, for which it may prove easier to implement and perform better. Fig. 2 shows the speedup of the various implementations and confirms the previous observation that task-based models outperform the directive- and thread-based ones.

Fig. 1: Performance of Sequential vs. Parallel Matrix Multiplication using OpenMP, TBB, Pthreads, Cilk++ and MPI

Fig. 2: Speedup trends of Parallel Matrix Multiplication using OpenMP, TBB, Pthreads, Cilk++ and MPI over the Sequential Implementation

B. Matrix Multiplication Code Analysis

For code characteristic comparisons, the number of lines of code (LOC) for each implementation, as well as the number of lines modified with respect to the original sequential version, were compared; the results are shown in Table II. OpenMP and Cilk++ require only a single additional line. TBB requires significantly more code modification, but not nearly as much as Pthreads, since the TBB library handles thread management. As expected, the Pthreads implementation required the most code modification, since it is the programmer's responsibility to create, assign and synchronize threads. Significant code modification is also present in the MPI version, because data transfer to the workers and work partitioning have to be done explicitly.

TABLE II. CODE MODIFICATION DATA

VI. CONCLUSION

Choosing the best model when developing a parallel software application depends on multiple factors; the key factors are the development environment and the complexity of the problem. The Pthreads programming model introduces much more complexity within the code than OpenMP, TBB, Cilk++ and even MPI, making it more challenging to develop with. One of the benefits of using TBB, OpenMP or Cilk++, where appropriate, is that creating and managing the threads is handled automatically. Even though OpenMP's trivially simple implementation makes it a reasonable choice for implementing parallelism, this paper showed that it is better to consider other models when the problem is of a task-parallel nature. TBB requires significantly more effort to implement, but it provides more control and is better equipped to handle other problems, such as task-parallel processes, while still delivering a respectable performance improvement when parallel_for is used on data-parallel loops. The MPI implementation requires the most effort and expertise to implement effectively. One advantage of the MPI model over the other four is that MPI can be used on both shared-memory and distributed-memory systems.

REFERENCES

[1] R. Merritt, "Chip industry confronts software gap between multicore, programming," EETimes, Apr. 3, 2008. [Online].
[2] K. Carlson, "SD West: Parallel or Bust," Dr. Dobb's Journal, Mar. 7, 2008. [Online].
[3] B. Hayes, "Computing in a Parallel Universe," American Scientist, vol. 95.
[4] M. Sato, "OpenMP: parallel programming API for shared memory multiprocessors and on-chip multiprocessors," in ISSS '02: Proceedings of the 15th International Symposium on System Synthesis. New York, NY, USA: ACM, 2002.
[5] J. Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O'Reilly, 2007.
[6] B. Nichols, D. Buttlar, and J. P. Farrell, Pthreads Programming. Sebastopol, CA, USA: O'Reilly & Associates, Inc.
[7] B. Kempf, "The Boost.Threads Library," Dr. Dobb's Journal, May 1, 2002. [Online].
[8] P. Pacheco, An Introduction to Parallel Programming. Morgan Kaufmann, 2011.
[9] R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, and J. McDonald, Parallel Programming in OpenMP. Morgan Kaufmann, 2000.
[10] M. S. Müller, M. M. Resch, A. Schulz, and W. E. Nagel (Eds.), Tools for High Performance Computing: Proceedings of the 3rd International Workshop on Parallel Tools for High Performance Computing, September 2009, ZIH, Dresden.
[11] F. Gebali, Algorithms and Parallel Computing. John Wiley & Sons, 2011.
[12] J. Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O'Reilly Media, 2007.
