Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming

Fabiana Leibovich, Laura De Giusti, and Marcelo Naiouf
Instituto de Investigación en Informática LIDI (III-LIDI), Facultad de Informática, Universidad Nacional de La Plata, 50 y 120 2do piso, La Plata, Argentina.
{fleibovich, ldgiusti, mnaiouf}@lidi.info.unlp.edu.ar

Abstract - Given the technological progress of current processors and the appearance of the multi-core cluster architecture, it becomes important to assess the parallel programming techniques that exploit the new memory hierarchy provided by this architecture. The purpose of this paper is to carry out a comparative analysis of two parallel programming paradigms: message passing and hybrid programming (where message passing and shared memory are combined). The testing architecture used for the experimental analysis is a multi-core cluster formed by 16 nodes (blades), each blade with 2 quad-core processors (128 cores in total). The study case chosen was square matrix multiplication, analyzing scalability by increasing both the size of the problem and the number of processing cores used.

Keywords: hybrid programming, cluster, multi-core, message passing, shared memory, parallel architectures.

1 Introduction

Parallel architectures have evolved to offer better response times for applications. As part of this evolution, clusters, then multi-cores, and currently multi-core cluster architectures can be mentioned. The latter are basically a collection of multi-core processors interconnected through a network. Multi-core clusters combine the most distinctive features of clusters (message passing over distributed memory) and multi-cores (shared memory). They also modify the memory hierarchy and further increase the capacity and power of computer systems. Given the popularity of this architecture, it is important to study new programming techniques for parallel algorithms that efficiently exploit its power, considering hybrid systems in which shared memory and distributed memory are combined [1].

As previously mentioned, a multi-core cluster is a set of multi-core processors interconnected through a network, working cooperatively as a single computational resource. That is, it is similar to a traditional cluster, but each node has a processor with several cores instead of a monoprocessor.

When it comes to implementing a parallel algorithm, it is very important to consider the memory hierarchy available, since it directly affects algorithm performance. Memory hierarchy performance is determined by two hardware parameters: memory latency (the time elapsed between the moment a piece of data is requested and the moment it becomes available) and memory bandwidth (the speed at which data are sent from memory to the processor). Figure 1 shows a representation of the memory hierarchy in the different architectures.

Figure (1). Memory hierarchy

In the case of traditional clusters (both homogeneous and heterogeneous), there are memory levels in each processor (processor registers and cache levels L1 and L2), but a new level is also included: network-distributed memory.
When considering a multi-core architecture, there are, in addition to the registers and L1 cache corresponding to each core, two further memory levels: cache memory shared by pairs of cores (L2) and the memory shared among all cores of the multi-core processor [2]. Multi-core clusters, in particular, add one more level to this traditional memory hierarchy: besides the cache shared between pairs of cores and the memory shared among all cores within the same physical processor, there is the distributed memory accessed through the network.

There is a large number of parallel applications in different areas. One of the most traditional and widely studied in parallel computing, and the one used in this paper, is matrix multiplication. The reason for choosing this application (widely tested and assessed) is that it exploits data parallelism and allows analyzing algorithm scalability by increasing matrix size [3]. Thus, solutions using only message passing can be compared with solutions that combine shared memory and message passing (hybrid).

This paper is organized as follows: Section 2 details the contribution of this paper; Section 3 describes the features of hybrid programming; Section 4 details the study case; Section 5 analyzes the solutions implemented and the architecture used to generate the results shown in Section 6; finally, Section 7 presents the conclusions and future lines of work.

2 Contribution

The main contribution of this paper is a comparative analysis of the performance that can be achieved with hybrid programming on a multi-core cluster architecture versus a traditional parallel programming model (distributed memory). The analysis is based on the running time and efficiency of the hybrid solution as the size of the problem and the number of cores used increase, and the results are compared with those of solutions that use only message passing.
3 Hybrid Programming

Traditionally, parallel processing has been divided into two large models: shared memory and message passing [1][4].

Shared memory: the data accessed by the application reside in a global memory accessible to all parallel processors. This means that processors can fetch and store data from any memory position independently of one another. The model is characterized by the need for synchronization to preserve the integrity of shared data structures.

Message passing: data are seen as being associated with a particular processor, so message communication among processors is required to access remote data. In this model, the send and receive primitives are responsible for handling synchronization.

With the appearance of multi-core cluster architectures, a new hybrid programming model emerges, which combines both strategies: communication among processes running on the same physical processor can be done through shared memory (micro level), whereas communication among physical processors (macro level) can be done by message passing. The purpose of the hybrid model is to exploit the advantages of each strategy, based on the needs of the application. This is an area of current research interest; among the libraries used for hybrid programming are Pthreads for shared memory and MPI for message passing.

Pthreads is a library that implements the POSIX (Portable Operating System Interface) standard defined by the IEEE. It consists of a set of types and procedure calls for the C programming language, including a header file and a thread library that is part of, for example, the libc library, among others. It is used for programming parallel applications that use shared memory [5]. On the other hand, MPI (Message Passing Interface) is a message passing interface created to provide portability. It is a library for developing programs that use message passing (distributed memory) from the C or Fortran programming languages. The MPI standard defines both the syntax and the semantics of the set of routines that can be used in the implementation of message passing programs [6].

4 Study Case

Given two square matrices A and B, matrix multiplication consists of obtaining matrix C, as indicated in equation 1:

C = A \times B    (1)

If matrix A has m \times p elements and matrix B has p \times n elements, matrix C will have m \times n elements. Each position of matrix C is calculated by applying equation 2:

C_{i,j} = \sum_{k=1}^{p} A_{i,k} B_{k,j}    (2)
5 Implemented Solutions and Architecture Used

Experimental tests were carried out based on the implementation of the classical matrix multiplication algorithm, both sequentially and using different parallel programming models: message passing, and hybrid (a combination of message passing and shared memory). All three solutions, sequential and parallel, were developed in the C language. The parallel solution that uses message passing as its process communication mechanism uses the Open MPI library [6]. The hybrid solution uses the Pthreads library [5] for shared memory and Open MPI for message passing.

This initial phase of the investigation consists of an experimental analysis of the behavior of a hybrid application on a multi-core cluster architecture from the point of view of programming models [7][8][9]. The results shown are focused on analyzing the hybrid solution in two aspects:

1. Its behavior when the size of the problem and the number of cores increase (scalability) [7][8]. In this case, square matrices of 1024, 2048, 4096 and 8192 rows and columns were processed.
2. Its running times and efficiency compared with those obtained with the message passing solution.

The hardware used to carry out the tests was a blade server with 16 blades. Each blade has two quad-core Intel Xeon e5405 2.0 GHz processors, 2 GB of RAM (shared between both processors), and 2 x 6 MB of L2 cache shared between each pair of cores per processor. The operating system used is Fedora 12, 64 bits [10][11].

In the following paragraphs, the solutions implemented are described. In all cases, matrix multiplication is carried out by storing matrix A by rows and matrix B by columns in order to make good use of the local cache memory during data access and take advantage of the architecture on which the algorithms were run.

5.1 Sequential Solution

Each position of C is calculated as established in equation 2.
5.2 Message Passing Solution

In this case, processing is divided into blocks of rows, which are assigned equally to each process. If p is the number of processes and n x n is the dimension of matrices A and B, each process calculates n/p rows of matrix C.

The algorithm uses a hierarchical master/worker structure. There is a general master that divides the rows to be processed among the blades and sends the corresponding rows to the master in each blade; it then behaves as one of the second-level workers described below. Finally, it receives the results obtained by all the workers of the application. In addition, there is one master in each blade (the second-level masters), responsible for receiving the rows to be solved by the processes in its blade and distributing them among its workers, after which it processes its own share, also acting as a worker. It should be noted that each process must store the rows from matrix A to be processed, all of matrix B, and the rows of matrix C that it generates as a result.

5.3 Hybrid Solution

In this solution, there is one process per blade that internally generates 7 threads to carry out the processing, in addition to the processing done by the process itself, which also acts as a worker (one thread per core). A master/worker structure is used, with one of the processes acting as master and dividing the rows equally among all processes. Once this is done, it generates the corresponding processing threads (acting as a worker itself). The other worker processes act in a similar way and send their results to the master process. The algorithm can be summarized as follows:

Master process:
1. It divides the matrix into blocks of n / (number of blades) rows.
2. It communicates the corresponding rows of matrix A and all of matrix B to the worker processes.
3. It generates the threads and processes its own block.
4. It receives the results from the worker processes.

Worker processes:
1.
They receive the corresponding rows of matrix A and all of matrix B.
2. They generate the threads to process the data.
3. They communicate the results to the master process.

6 Results obtained

In the following paragraphs, the results obtained in the experimental tests are presented. Table 1 shows the running times of the sequential solution (Seq.), the message passing solution using 16, 32, and 64 cores (MP16, MP32 and MP64), and the hybrid solution with 16, 32, and 64 cores (H16, H32 and H64). For the tests, both the dimension of the matrices and the number of cores are scaled. Figure 2 shows the speedups obtained for these tests.

It can be seen that the running times obtained by the hybrid solution are always lower than those obtained by the message passing solution. Also, as the problem size increases, the time difference between the two solutions also increases in favor of the hybrid solution.
Size   Seq.      MP16      H16      MP32     H32      MP64     H64
1024   12.47     0.88      0.86     0.52     0.50     0.73     0.39
2048   101.21    6.72      6.63     3.66     3.58     2.51     2.36
4096   808.57    52.54     51.89    27.42    26.91    17.51    16.06
8192   6479.43   1059.52   410.36   638.87   209.36   752.07   124.89

Table (1). Running times (in seconds)

Figure (2). Speedup

6.1 Comparison of Results

Table 2 shows the efficiency achieved by the different alternatives tested, and Figure 3 presents a comparative chart of that information. Based on the results obtained, two observations can be made: the efficiency achieved by the hybrid solution is in all cases higher than that achieved by the message passing solution, and, as the problem size increases (for the same number of processing units), efficiency also increases. However, as is to be expected, when the number of processing units increases, efficiency decreases due to the increased volume of communications and synchronization among processes.

It should also be mentioned that the efficiency achieved by the message passing solution for 8192 x 8192 elements is significantly degraded in comparison with the other sizes. This is due to the limited main memory available in each blade which, for this size, causes the necessary data structures to be swapped out.

Size   MP16   H16    MP32   H32    MP64   H64
1024   0.87   0.90   0.73   0.77   0.26   0.49
2048   0.94   0.95   0.86   0.88   0.62   0.66
4096   0.96   0.97   0.92   0.93   0.72   0.78
8192   0.38   0.98   0.31   0.96   0.13   0.81

Table (2). Efficiency

Figure (3). Efficiency

7 Conclusions and future work

As regards scalability, the results obtained show that the hybrid solution is scalable and that an increase in problem size also increases the efficiency achieved by the algorithm. On the other hand, when comparing the message passing solution with the hybrid solution, it can be seen that the latter offers better running times. In this regard, the improvement introduced by the hybrid solution comes from taking advantage of the characteristics of the problem and of the architecture used.
The possibility of using shared memory makes it unnecessary to replicate data within each blade. In the problem chosen as study case, matrix B does not have to be replicated in each of the workers. This is not the case in the message passing solution, since each worker handles its own memory space and therefore requires its own copy of matrix B (with 8-byte elements, for instance, an 8192 x 8192 matrix occupies 512 MB, so the per-blade copies of B alone would exceed the 2 GB of RAM available in each blade). This is reflected in the running times obtained in the tests using matrices of 8192 x 8192 elements: in the message passing solution, running time and efficiency are significantly degraded because, due to the replication mentioned above, the memory available in the testing architecture becomes insufficient, swapping the required structures to disk and thus significantly degrading algorithm performance.

In the future, the behavior with even larger matrix sizes will be studied, together with other parallelization strategies that mainly avoid data replication.

8 References

[1] Dongarra J., Foster I., Fox G., Gropp W., Kennedy K., Torczon L., White A. Sourcebook of Parallel Computing. Morgan Kaufmann Publishers, 2002. ISBN 1558608710 (Chapter 3).

[2] Burger T. Intel Multi-Core Processors: Quick Reference Guide. http://cachewww.intel.com/cd/00/00/23/19/231912_231912.pdf (2010).
[3] Andrews G. Foundations of Multithreaded, Parallel and Distributed Programming. Addison Wesley Higher Education, 2000. ISBN-13: 9780201357523.

[4] Grama A., Gupta A., Karypis G., Kumar V. Introduction to Parallel Computing. Second Edition. Pearson Addison Wesley, 2003. ISBN: 0201648652 (Chapter 3).

[5] https://computing.llnl.gov/tutorials/pthreads (2010).

[6] http://www.open-mpi.org (2010).

[7] Kumar V., Gupta A. Analyzing Scalability of Parallel Algorithms and Architectures. Journal of Parallel and Distributed Computing, Vol. 22, No. 1, pp. 60-79, 1994.

[8] Leopold C. Parallel and Distributed Computing. A Survey of Models, Paradigms and Approaches. Wiley, 2001. ISBN: 0471358312 (Chapters 1, 2 and 3).

[9] Chapman B. The Multicore Programming Challenge. Advanced Parallel Processing Technologies: 7th International Symposium (APPT'07), Lecture Notes in Computer Science (LNCS), Vol. 4847, p. 3, Springer-Verlag (New York), November 2007.

[10] HP, "HP BladeSystem". http://h18004.www1.hp.com/products/blades/components/c-class.html (2011).

[11] HP, "HP BladeSystem c-class architecture". http://h20000.www2.hp.com/bc/docs/support/supportmanual/c00810839/c00810839.pdf (2011).