Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming


Fabiana Leibovich, Laura De Giusti, and Marcelo Naiouf
Instituto de Investigación en Informática LIDI (III-LIDI), Facultad de Informática, Universidad Nacional de La Plata, 50 y 120 2do piso, La Plata, Argentina. {fleibovich, ldgiusti, mnaiouf}@lidi.info.unlp.edu.ar

Abstract - Given the technological progress of current processors and the appearance of the multi-core cluster architecture, it becomes important to assess the different parallel programming techniques that can exploit the new memory hierarchy provided by this architecture. The purpose of this paper is to carry out a comparative analysis of two parallel programming paradigms: message passing and hybrid programming (where message passing and shared memory are combined). The testing architecture used for the experimental analysis is a multi-core cluster formed by 16 nodes (blades), each blade with 2 quad-core processors (128 cores total). The study case chosen was square matrix multiplication, analyzing scalability by increasing the size of the problem and the number of processing cores used.

Keywords: hybrid programming, cluster, multi-core, message passing, shared memory, parallel architectures.

1 Introduction

Parallel architectures have evolved to offer better response times for applications. As part of this evolution, clusters, then multi-cores, and currently multi-core cluster architectures can be mentioned. The latter are basically a collection of multi-core processors interconnected through a network. Multi-core clusters allow combining the most distinctive features of clusters (use of message passing in distributed memory) and multi-cores (use of shared memory). They also introduce modifications in the memory hierarchy and further increase computer system capacity and power. Taking into account the popularity of this architecture, it is important to study new programming techniques for parallel algorithms that efficiently exploit its power, considering the hybrid systems in which shared memory and distributed memory are combined [1].

As previously mentioned, a multi-core cluster is a set of multi-core processors interconnected through a network, where they work cooperatively as a single computational resource. That is, it is similar to a traditional cluster, but each node has a processor with several cores instead of a monoprocessor.

When it comes to implementing a parallel algorithm, it is very important to consider the memory hierarchy available, since this directly affects algorithm performance. Memory hierarchy performance is determined by two hardware parameters: memory latency (the time elapsed from the moment a piece of data is required to the moment it becomes available) and memory bandwidth (the speed with which data are sent from the memory to the processor). Figure 1 shows a representation of the memory hierarchy in the different architectures.

Figure (1). Memory hierarchy

In the case of traditional clusters (both homogeneous and heterogeneous), there are memory levels in each processor (processor registers and cache levels L1 and L2), but a new level is also included: network-distributed memory.

When considering a multi-core architecture, there are, in addition to the registers and the L1 level corresponding to each core, two further memory levels: the cache memory shared by pairs of cores (L2) and the memory shared among all the cores of the multi-core processor [2]. In particular, multi-core clusters introduce one additional level into the traditional memory hierarchy: in addition to the cache memory shared between pairs of cores and the memory shared among all cores within the same physical processor, there is the distributed memory that is accessed through the network.

There is a large number of parallel applications in different areas. One of the most traditional and widely studied of these in parallel computing, and the one used in this paper, is matrix multiplication. The reason for using this application (widely tested and assessed) is that it allows using and exploiting data parallelism, as well as analyzing algorithm scalability by increasing matrix size [3]. Thus, the solutions using message passing can be compared with those using shared memory and message passing (hybrid).

This paper is organized as follows: In Section 2, the contribution of this paper is detailed, whereas in Section 3, the features of hybrid programming are described. In Section 4, the study case is detailed, and in Section 5, the solutions implemented and the architecture used to generate the results shown in Section 6 are analyzed. Finally, in Section 7, the conclusions and future lines of work are presented.

2 Contribution

The main contribution of this paper is a comparative analysis of the performance that can be achieved with hybrid programming on a multi-core cluster architecture versus a traditional parallel programming model (distributed memory). The analysis is carried out based on the running time and efficiency of the hybrid solution as the size of the problem and the number of cores used increase, and the results are compared with those of solutions that use only message passing.

3 Hybrid Programming

Traditionally, parallel processing has been divided into two large models: shared memory and message passing [1][4].

Shared memory: the data accessed by the application are in a global memory that is accessible to the parallel processors. This means that processors can fetch and store data from any memory position independently from each other. This model is characterized by the need for synchronization in order to preserve the integrity of shared data structures.

Message passing: data are seen as being associated to a specific processor, so message communication among processors is required to access remote data. In this model, the sending and receiving primitives are responsible for handling synchronization.

With the appearance of multi-core cluster architectures, a new hybrid programming model comes into existence, which combines both strategies. Communication among processes belonging to the same physical processor can be done by using shared memory (micro level), whereas communication among physical processors (macro level) can be done by message passing. The purpose of using the hybrid model is to exploit the advantages of each strategy, based on the needs of the application. This is an area of current research interest; among the libraries used for hybrid programming are Pthreads for shared memory and MPI for message passing.

Pthreads is a library that implements the POSIX (Portable Operating System Interface) standard defined by IEEE. It is composed of a set of types and calls to procedures in the C programming language, including a header file and a thread library that is part, for example, of the libc library, among others. It is used for programming parallel applications that use shared memory [5]. On the other hand, MPI is a message passing interface created to provide portability. It is a library that can be used to develop programs that use message passing (distributed memory) from the C or Fortran programming languages. The MPI standard defines both the syntax and the semantics of the set of routines that can be used in the implementation of programs that use message passing [6].
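As a loose illustration of the hybrid model just described (this is not the code used in the paper), the sketch below shows one MPI process combining message passing at the macro level with Pthreads at the micro level; the thread count, function names and the broadcast value are illustrative assumptions.

```c
/* Minimal sketch of the hybrid model: message passing between processes (MPI)
 * and shared memory inside each process (Pthreads).
 * Built, for example, with: mpicc -pthread hybrid_skel.c */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4                /* illustrative value, one thread per core */

static void *thread_work(void *arg)
{
    long id = (long)arg;
    /* micro level: threads of the same process share its address space */
    printf("thread %ld working on shared data\n", id);
    return NULL;
}

int main(int argc, char *argv[])
{
    int provided, rank, size;

    /* request an MPI threading level that tolerates threads in the process */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* macro level: data distributed among physical processors via messages */
    int n = 1024;                              /* e.g. the problem size */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* micro level: parallelism inside the node via shared-memory threads */
    pthread_t th[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, thread_work, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);

    MPI_Finalize();
    return 0;
}
```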
4 Study Case

Given two square matrixes A and B, matrix multiplication consists in obtaining matrix C, as indicated in equation 1.

C = A × B    (1)

If matrix A has m × p elements and matrix B has p × n elements, matrix C will have m × n elements. Each position of matrix C is calculated by applying equation 2.

C_{i,j} = \sum_{k=1}^{p} A_{i,k} B_{k,j}    (2)
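As a reference for the implementations discussed in Section 5, a minimal sequential kernel for equation (2) might look as follows; the storage convention (A by rows, B by columns) matches the one described in Section 5, while the function and parameter names are illustrative, not taken from the paper's source code.

```c
/* Illustrative sequential kernel for equation (2): each element of C is the
 * dot product of a row of A and a column of B. A (m x p) is stored by rows,
 * B (p x n) is stored by columns, so the inner loop reads both operands
 * sequentially in memory; C (m x n) is stored by rows. */
void matmul_seq(const double *A, const double *Bcols, double *C,
                int m, int p, int n)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < p; k++)
                sum += A[i * p + k] * Bcols[j * p + k];
            C[i * n + j] = sum;   /* C(i,j) as defined by equation (2) */
        }
    }
}
```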

5 Implemented Solutions and Architecture Used

The experimental tests were carried out based on the implementation of the classic matrix multiplication algorithm, both sequentially and using different parallel programming models: message passing and hybrid (a combination of message passing and shared memory). All three solutions were developed in the C language. The parallel solution that uses message passing as its process communication mechanism uses the OpenMPI library [6]. The hybrid solution uses the Pthreads library [5] for shared memory and OpenMPI for message passing.

This initial phase of the investigation consists in carrying out an experimental analysis of the behavior of a hybrid application on a multi-core cluster architecture from the point of view of programming models [7][8][9]. The results shown focus on analyzing the hybrid solution in two aspects:

1. Analyzing its behavior when the size of the problem and the number of cores increase (scalability) [7][8]. In this case, square matrixes of 1024, 2048, 4096 and 8192 rows and columns were processed.
2. Comparing its running times and efficiency with those obtained with the message passing solution.

The hardware used to carry out the tests was a Blade with 16 servers (blades). Each blade has 2 quad-core Intel Xeon e5405 2.0 GHz processors, 2 GB of RAM (shared between both processors), and 2 x 6 MB of L2 cache per processor, each shared between a pair of cores. The operating system used is Fedora 12, 64 bits [10][11].

In the following paragraphs, the solutions implemented are described. In all cases, matrix multiplication is carried out by storing matrix A by rows and matrix B by columns in order to make good use of the local cache memory for data access and take advantage of the architecture on which the algorithms were run.

5.1 Sequential Solution

Each position of C is calculated as established in equation 2.

5.2 Message Passing Solution

In this case, processing is divided into blocks of rows, which are assigned equally to each process. If p is the number of processes and n * n is the dimension of matrixes A and B, the number of rows of matrix C calculated by each process is n/p. The algorithm uses a hierarchical master/worker structure. There is a general master that divides the rows to be processed among the blades and sends the corresponding rows to the master of each blade; it then behaves as one of the second-level workers described below. Finally, it receives the results obtained by all the application workers. On the other hand, there is one master in each blade (a second-level master), responsible for receiving the rows that will be solved by the processes in its blade and distributing them among its workers, to then process its own share, also acting as a worker. It should be noted that each process must store the rows from matrix A to be processed, all of matrix B, and the rows of matrix C that it generates as a result.

5.3 Hybrid Solution

In this solution, there is one process per blade that internally creates 7 threads to carry out the processing, in addition to the processing done by the process itself, which also acts as a worker (one thread per core). A master/worker structure is used, with one of the processes acting as master and dividing the rows equally among all processes. Once this is done, it generates the corresponding processing threads (acting as a worker itself). The other worker processes act in a similar way and send their results to the master process.
The algorithm can be summarized as follows (a code sketch of the thread-level processing follows the lists):

Master process:
1. It divides the matrix into blocks of n rows / number of blades used for processing.
2. It communicates the corresponding rows of matrix A and all of matrix B to the worker processes.
3. It generates the threads and processes its own block.
4. It receives the results from the worker processes.

Worker processes:
1. They receive the corresponding rows of matrix A and all of matrix B.
2. They generate the threads to process the data.
3. They communicate the results to the master process.
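The following is a hedged sketch of the thread-level step above (worker step 2, and master step 3), i.e., how a process might split its block of rows among Pthreads. It is not the authors' code: the function names, the slice structure, and the assumption that the thread count evenly divides the rows are illustrative, and for simplicity it creates all worker threads, whereas in the paper the process's own main thread also acts as one of the workers.

```c
/* Sketch of the thread-level processing inside one process of the hybrid
 * solution (illustrative names, not the paper's source code). The process
 * already holds its rows of A (stored by rows), all of B (stored by columns)
 * and space for its rows of C. */
#include <pthread.h>

#define NTHREADS 8                   /* one thread per core of the blade */

struct slice {
    const double *A, *Bcols;         /* local rows of A; all of B, by columns */
    double *C;                       /* local rows of C                       */
    int row_begin, row_end;          /* half-open range of local row indices  */
    int n;                           /* dimension of the square matrices      */
};

static void *mult_slice(void *arg)
{
    struct slice *s = arg;
    for (int i = s->row_begin; i < s->row_end; i++)
        for (int j = 0; j < s->n; j++) {
            double sum = 0.0;
            for (int k = 0; k < s->n; k++)
                sum += s->A[i * s->n + k] * s->Bcols[j * s->n + k];
            s->C[i * s->n + j] = sum;
        }
    return NULL;
}

/* Called by each process after receiving its rows of A and all of B. */
void process_block(const double *A, const double *Bcols, double *C,
                   int local_rows, int n)
{
    pthread_t th[NTHREADS];
    struct slice s[NTHREADS];
    int chunk = local_rows / NTHREADS;   /* assumes NTHREADS divides the rows */

    for (int t = 0; t < NTHREADS; t++) {
        s[t] = (struct slice){ A, Bcols, C, t * chunk, (t + 1) * chunk, n };
        pthread_create(&th[t], NULL, mult_slice, &s[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);
}
```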

6 Results obtained

In the following paragraphs, the results obtained in the experimental tests are presented. Table 1 shows the running times of the sequential solution (Seq.), the message passing solution using 16, 32 and 64 cores (MP16, MP32 and MP64), and the hybrid solution with 16, 32 and 64 cores (H16, H32 and H64). For the tests, both the dimension of the matrix and the number of cores are scaled. Figure 2 shows the speedups obtained for these tests. It can be seen that the running times obtained by the hybrid solution are always lower than those obtained by the message passing solution. Also, as problem size increases, the time difference between both solutions also increases in favor of the hybrid solution.

Size   Seq.      MP16      H16      MP32     H32      MP64     H64
1024   12.47     0.88      0.86     0.52     0.50     0.73     0.39
2048   101.21    6.72      6.63     3.66     3.58     2.51     2.36
4096   808.57    52.54     51.89    27.42    26.91    17.51    16.06
8192   6479.43   1059.52   410.36   638.87   209.36   752.07   124.89

Table (1). Running times

Figure (2). Speedup

6.1 Comparison of Results

Table 2 shows the efficiency achieved by the different testing alternatives, whereas Figure 3 presents a comparative chart of that information. Based on the results obtained, two observations can be made: the efficiency achieved by the hybrid solution is in all cases higher than that achieved by the message passing solution, and, as problem size increases (for the same number of processing units), efficiency also increases. However, as is to be expected, when the number of processing units increases, efficiency decreases due to the increased volume of communications and synchronization among processes. It should also be mentioned that the efficiency achieved by the message passing solution for 8192 * 8192 elements is significantly degraded in comparison with the other sizes. This is due to the limited main memory available in each blade, which, for large sizes, causes the necessary data structures to be swapped out.

Size   MP16   H16    MP32   H32    MP64   H64
1024   0.87   0.90   0.73   0.77   0.26   0.49
2048   0.94   0.95   0.86   0.88   0.62   0.66
4096   0.96   0.97   0.92   0.93   0.72   0.78
8192   0.38   0.98   0.31   0.96   0.13   0.81

Table (2). Efficiency

Figure (3). Efficiency
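As a check on how Table 2 follows from Table 1, taking speedup as the sequential time divided by the parallel time and efficiency as speedup divided by the number of cores (the usual definitions, consistent with the reported figures): for 8192 * 8192 matrixes on 64 cores, the hybrid solution yields a speedup of 6479.43 / 124.89 ≈ 51.9 and an efficiency of 51.9 / 64 ≈ 0.81, whereas the message passing solution yields 6479.43 / 752.07 ≈ 8.6 and 8.6 / 64 ≈ 0.13, matching the last row of Table 2.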

7 Conclusions and future work

As regards scalability, the results obtained show that the hybrid solution is scalable and that an increase in problem size also increases the efficiency achieved by the algorithm. On the other hand, when comparing the message passing solution with the hybrid solution, it can be seen that the latter offers better running times. The improvement introduced by the hybrid solution comes from taking advantage of the characteristics of the problem and of the architecture used: the possibility of using shared memory makes it unnecessary to replicate data within each blade. In the case of the problem chosen as study case, matrix B does not have to be replicated in each of the workers. This is not the case with the message passing solution, since each worker handles its own memory space and therefore requires a copy of matrix B. This is reflected in the running times obtained in the tests using matrixes of 8192 * 8192 elements: in the message passing solution, running time and efficiency are significantly degraded because, due to the replication mentioned above, the memory available in the testing architecture becomes insufficient, swapping the required structures to disk and thus significantly degrading algorithm performance.

In the future, the behavior with even larger matrix sizes will be studied, together with other parallelization strategies that mainly avoid data replication.

8 References

[1] Dongarra J., Foster I., Fox G., Gropp W., Kennedy K., Torczon L., White A. Sourcebook of Parallel Computing. Morgan Kaufmann Publishers, 2002. ISBN 1558608710 (Chapter 3).
[2] Burger T. Intel Multi-Core Processors: Quick Reference Guide. http://cachewww.intel.com/cd/00/00/23/19/231912_231912.pdf (2010).
[3] Andrews G. Foundations of Multithreaded, Parallel and Distributed Programming. Addison Wesley Higher Education, 2000. ISBN-13: 9780201357523.
[4] Grama A., Gupta A., Karypis G., Kumar V. Introduction to Parallel Computing. Second Edition. Pearson Addison Wesley, 2003. ISBN: 0201648652 (Chapter 3).
[5] https://computing.llnl.gov/tutorials/pthreads (2010).
[6] http://www.open-mpi.org (2010).
[7] Kumar V., Gupta A. Analyzing Scalability of Parallel Algorithms and Architectures. Journal of Parallel and Distributed Computing, Vol. 22, No. 1, pp. 60-79, 1994.
[8] Leopold C. Parallel and Distributed Computing. A Survey of Models, Paradigms and Approaches. Wiley, 2001. ISBN: 0471358312 (Chapters 1, 2 and 3).
[9] Chapman B. The Multicore Programming Challenge. Advanced Parallel Processing Technologies, 7th International Symposium (APPT'07), Lecture Notes in Computer Science (LNCS), Vol. 4847, p. 3, Springer-Verlag (New York), November 2007.
[10] HP, "HP BladeSystem". http://h18004.www1.hp.com/products/blades/components/cclass.html (2011).
[11] HP, "HP BladeSystem c-class architecture". http://h20000.www2.hp.com/bc/docs/support/supportmanual/c00810839/c00810839.pdf (2011).