Analysis of Matrix Multiplication Computational Methods

European Journal of Scientific Research
ISSN 1450-216X / 1450-202X, Vol. 121 No. 3, 2014, pp. 258-266
http://www.europeanjournalofscientificresearch.com

Khaled Matrouk
Corresponding Author, Department of Computer Engineering, Faculty of Engineering
E-mail: khaled.matrouk@ahu.edu.jo
Tel: +962-3-2179000 (ext. 8503), Fax: +962-3-2179050

Abdullah Al-Hasanat
Department of Computer Engineering, Faculty of Engineering

Haitham Alasha'ary
Department of Computer Engineering, Faculty of Engineering

Ziad Al-Qadi
Prof., Department of Computer Engineering, Faculty of Engineering

Hasan Al-Shalabi
Prof., Department of Computer Engineering, Faculty of Engineering

Abstract

Matrix multiplication is a basic operation used in engineering applications such as digital image processing, digital signal processing and graph problem solving. Multiplication of huge matrices requires substantial computation time, as its complexity is O(n^3). Because most engineering applications require high computational throughput with minimum time, many sequential and parallel algorithms have been developed. In this paper, several matrix multiplication methods are chosen, implemented, and analyzed. A performance analysis is carried out, and recommendations are given for when to use the OpenMP and MPI methods of parallel computing.

Keywords: OpenMP, MPI, Processing Time, Speedup, Efficiency

1. Introduction

With the advent of parallel hardware and software technologies, users are faced with the challenge of choosing the programming paradigm best suited for the underlying computer architecture (Alqadi and Abu-Jazzar, 2005a; Alqadi and Abu-Jazzar, 2005b; Alqadi et al, 2008). With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors (SMP), parallel programming techniques have evolved to support parallelism beyond a single level (Choi et al, 1994).

Parallel programming within one SMP node can take advantage of the globally shared address space. Compilers for shared memory architectures usually support multi-threaded execution of a program. Loop-level parallelism can be exploited by using compiler directives such as those defined in the OpenMP standard (Dongarra et al, 1994; Alpatov et al, 1997). OpenMP provides a fork-and-join execution model in which a program begins execution as a single thread. This thread executes sequentially until a parallelization directive for a parallel region is encountered (Alpatov et al, 1997; Anderson et al, 1987). At that point, the thread creates a team of threads and becomes the master thread of the new team (Chtchelkanova et al, 1995; Barnett et al, 1994; Choi et al, 1992). All threads execute the statements up to the end of the parallel region, work-sharing directives are provided to divide the execution of the enclosed code region among the threads, and all threads synchronize at the end of parallel constructs. The advantage of OpenMP (web ref.) is that existing code can be parallelized easily by placing OpenMP directives around time-consuming loops that do not contain data dependences, leaving the rest of the source code unchanged. The disadvantage is that it is not easy for the user to optimize workflow and memory access.

On an SMP cluster the message passing programming paradigm can be employed within and across several nodes. MPI (web ref.) is a widely accepted standard for writing message passing programs (web ref.; Rabenseifner, 2003). MPI provides the user with a programming model in which processes communicate with other processes by calling library routines to send and receive messages. The advantage of the MPI programming model is that the user has complete control over data distribution and process synchronization, permitting the optimization of data locality and workflow distribution. The disadvantage is that existing sequential applications require a fair amount of restructuring for a parallelization based on MPI.

1.1. Serial Matrix Multiplication

Matrix multiplication involves two matrices A and B such that the number of columns of A equals the number of rows of B. Carried out sequentially, it takes O(n^3) time. The algorithm for ordinary matrix multiplication is:

for i = 1 to n
    for j = 1 to n
        c(i,j) = 0
        for k = 1 to n
            c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end
    end
end

1.2. Parallel Matrix Multiplication Using OpenMP

The master thread forks the outer loop among the slave threads, so each thread multiplies its own block of rows of the first matrix by the second matrix. When the threads have finished, the master thread joins the partial results into the complete product (a minimal C sketch of this scheme is given after the MPI implementation steps below).

1.3. Parallel Matrix Multiplication Using MPI

The procedure for implementing the sequential algorithm in parallel using MPI can be divided into the following steps:
- The master processor splits the first matrix row-wise among the different processors.
- The second matrix is broadcast to all processors.
- Each processor multiplies its part of the first matrix by the second matrix.
- Each processor sends its partial product back to the master processor.

Implementation:
- Master (processor 0) reads the data.
- Master sends the size of the data to the slaves.
- Slaves allocate memory.
- Master broadcasts the second matrix to all other processors.
- Master sends the respective parts of the first matrix to all other processors.
- Every processor performs its local multiplication.
- All slave processors send their results back to the master.
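The paper does not list its source code, so the following is only a minimal C sketch of the row-partitioned OpenMP scheme described in Section 1.2; the function and variable names (matmul_openmp, a, b, c, n) are illustrative and assume square n x n matrices stored in row-major order.

/* Illustrative sketch (not the authors' code): C = A * B for n x n matrices.
   The outer row loop is divided among the OpenMP threads, matching the
   fork/join description in Section 1.2. */
void matmul_openmp(int n, const double *a, const double *b, double *c)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

The number of threads (the first column of Table 2 below) can be selected with omp_set_num_threads() or the OMP_NUM_THREADS environment variable; with one thread the loop reduces to the serial algorithm of Section 1.1.

The MPI steps listed above can be sketched in the same spirit. MPI_Scatter and MPI_Gather are used here as the collective equivalents of the point-to-point sends and receives in the implementation list, and n is assumed to be divisible by the number of processes; again, this is an assumed illustration rather than the authors' implementation.

#include <mpi.h>
#include <stdlib.h>

/* Illustrative sketch of the scheme of Section 1.3: the master scatters the
   rows of A, broadcasts B, and gathers the partial products of C.
   a and c need to be valid only on the master (rank 0); b must hold n*n
   doubles on every rank. n is assumed to be divisible by the process count. */
void matmul_mpi(int n, double *a, double *b, double *c)
{
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = n / nprocs;                       /* rows of A per process */
    double *a_part = malloc((size_t)rows * n * sizeof(double));
    double *c_part = malloc((size_t)rows * n * sizeof(double));

    /* Master sends each process its block of rows of A; B goes to everyone. */
    MPI_Scatter(a, rows * n, MPI_DOUBLE, a_part, rows * n, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Bcast(b, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Local multiplication of the row block by the full matrix B. */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a_part[i * n + k] * b[k * n + j];
            c_part[i * n + j] = sum;
        }

    /* Partial products are collected back on the master. */
    MPI_Gather(c_part, rows * n, MPI_DOUBLE, c, rows * n, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    free(a_part);
    free(c_part);
}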

2. Methods and Tools

One station with a Pentium i5 processor running at 2.5 GHz with 4 GB of memory is used to implement serial matrix multiplication. Visual Studio 2008 with the OpenMP library is used as the environment for building, executing and testing the matrix multiplication program, and the program is tested on a Pentium i5 processor with 2.5 GHz and 4 GB of memory. For the MPI experiments a distributed processing system with a varying number of processors is used; each processor is a 4-core processor with 2.5 GHz and 4 GB of memory, and the processors are connected through Visual Studio 2008 with the MPI environment.

3. Experimental Part

Different sets of two matrices are chosen (differing in size and data type), each pair of matrices is multiplied serially and in parallel using both the OpenMP and MPI environments, and the average multiplication time is taken (a sketch of how such an average can be measured is given after Table 4).

3.1. Experiment 1

The sequential matrix multiplication program is tested using matrices of different sizes. Matrices with different data types (integer, float, double, and complex) are chosen; 100 matrices of different data types and sizes are multiplied. Table 1 shows the average results obtained in this experiment.

Table 1: Experiment 1 Results

Matrix size      Multiplication time (seconds)
10x10            0.00641199
20x20            0.00735038
40x40            0.0063971
100x100          0.0142716
200x200          0.0386879
1000x1000        6.75335
1200x1200        11.889
5000x5000        2007
10000x10000      13000

3.2. Experiment 2

The matrix multiplication program is tested using small matrices. Matrices with different data types (integer, float, double, and complex) are chosen; 200 matrices of different data types and sizes are multiplied in the OpenMP environment. Table 2 shows the average results obtained in this experiment.

Table 2: Experiment 2 Results (times in seconds)

# of threads   10x10        20x20        40x40       100x100     200x200
1              0.00641199   0.00735038   0.0063971   0.0142716   0.0386879
2              0.03675360   0.06866570   0.0370589   0.0373609   0.0615986
3              0.06271470   0.06311360   0.0701701   0.0978940   0.0787245
4              0.07273020   0.06979990   0.0710032   0.0706766   0.079643
5              0.06772930   0.07232620   0.0673493   0.0699920   0.051531
6              0.06918620   0.07037430   0.0707350   0.0724863   0.0837632
7              0.07124480   0.07204210   0.0727263   0.0727355   0.0820219
8              0.74631600   0.07348000   0.0677064   0.0762404   0.0820226

3.3. Experiment 3

The matrix multiplication program is tested using large matrices. Matrices with different data types (integer, float, double, and complex) are chosen; 200 matrices of different data types and sizes are multiplied in the OpenMP environment with 8 threads. Table 3 shows the average results obtained in this experiment.

Table 3: Experiment 3 Results

Matrix size     Multiplication time (seconds)
1000x1000       1.8377
1200x1200       3.19091
2000x2000       18.0225
5000x5000       508.1
10000x10000     333.3

3.4. Experiment 4

The matrix multiplication program is tested using large matrices. Matrices with different data types (integer, float, double, and complex) are chosen; 200 matrices of different data types and sizes are multiplied in the MPI environment with different numbers of processors. Table 4 shows the average results obtained in this experiment.

Table 4: Experiment 4 Results (multiplication time in seconds)

Number of processors   1000x1000   5000x5000   10000x10000
1                      6.96        2007        13000
2                      5.9         1055        7090
4                      3.3         525         3290
5                      2.8         431         2965
8                      2.1         260         1920
10                     1.5         235         1600
20                     0.8         119         900
25                     0.6         91          830
50                     0.55        52          292
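The tables above report average multiplication times, but the paper does not show how the timings were taken. Assuming wall-clock measurements, the following sketch illustrates the kind of loop that could produce such averages for the OpenMP runs (matmul_openmp is the illustrative routine sketched in Section 1; for the MPI runs MPI_Wtime() would play the same role as omp_get_wtime()).

#include <omp.h>

/* Illustrative timing loop (not from the paper): averages the wall-clock time
   of 'runs' repetitions of the multiplication sketched earlier. */
double average_matmul_time(int n, const double *a, const double *b,
                           double *c, int runs)
{
    double total = 0.0;
    for (int r = 0; r < runs; r++) {
        double t0 = omp_get_wtime();
        matmul_openmp(n, a, b, c);
        total += omp_get_wtime() - t0;
    }
    return total / runs;          /* average time in seconds */
}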

4. Results Discussion

From the results obtained in the previous section we can categorize the matrices into three groups:
- small matrices, smaller than 1000x1000;
- mid-size matrices, from 1000x1000 up to 5000x5000;
- huge matrices, larger than 5000x5000.

The following recommendations can be made:
- For small matrices, it is preferable to use sequential matrix multiplication.
- For mid-size matrices, it is preferable to use parallel matrix multiplication with OpenMP.
- For huge matrices, it is preferable to use parallel matrix multiplication with MPI.
- It is also recommended to use hybrid parallel systems (MPI with OpenMP) to multiply huge matrices.

From the results obtained in Table 2 we can see that the speedup of using OpenMP is limited by the number of physical cores actually available in the computer system, as shown in Table 5 and Fig. 1. Speedup is defined as (a short code sketch applying this definition is given below):

Speedup (times) = execution time with 1 thread / parallel execution time

Table 5: Speedup results of using OpenMP

Matrix size   # of threads = 1   # of threads = 8   Speedup
300x300       0.110188           0.097704           1.1278
400x400       0.314468           0.170006           1.8497
500x500       0.601031           0.277821           2.1634
600x600       1.14773            0.64882            1.7689
700x700       2.17295            0.704228           3.0856
800x800       3.16512            0.963983           3.2834
900x900       4.93736            1.37456            3.5920
1000x1000     6.69186            1.8377             3.6414
1024x1024     7.18151            1.97027            3.6449
1200x1200     12.0819            3.19091            3.7863
2000x2000     72.8571            18.0996            4.0253
2048x2048     74.7383            18.8406            3.9669

Figure 1: Maximum performance of using OpenMP
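As a concrete check of this definition (an illustrative calculation, not code from the paper), the 1000x1000 row of Table 5 gives 6.69186 s with one thread and 1.8377 s with eight threads, which reproduces the tabulated speedup of 3.6414:

#include <stdio.h>

/* Illustrative only: applies the speedup definition to the 1000x1000 entry
   of Table 5. */
int main(void)
{
    double t_one_thread = 6.69186;     /* 1 thread, Table 5 */
    double t_parallel   = 1.8377;      /* 8 threads, Table 5 */
    printf("speedup = %.4f times\n", t_one_thread / t_parallel);  /* 3.6414 */
    return 0;
}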

From the results obtained in Tables 1 and 2 we can see that the matrix multiplication time increases rapidly as the matrix size increases, as shown in Figs 2 and 3.

Figure 2: Comparison between the 1-thread and 8-thread results (multiplication time in seconds versus matrix size n, for n x n matrices)

Figure 3: Relationship between the speedup and the matrix size (speedup versus matrix size n x n; the curve levels off near the maximum number of cores)

From the results obtained in Table 4 we can calculate the speedup of using MPI and the system efficiency:

Efficiency = speedup / number of processors

The calculation results are shown in Table 6.

Table 6: Speedup and efficiency of using MPI

                       1000x1000 matrices        5000x5000 matrices        10000x10000 matrices
Number of processors   Speedup     Efficiency    Speedup     Efficiency    Speedup     Efficiency
1                      1           1             1           1             1           1
2                      1.17        0.585         1.9         0.38          1.83        0.92
4                      2.9         0.725         3.8         0.95          3.9         0.98
5                      2.46        0.492         4.66        0.93          4.4         0.88
8                      3.29        0.411         7.72        0.96          6.77        0.85
10                     4.6         0.46          8.54        0.85          8.13        0.81
20                     8.63        0.43          16.87       0.84          14.44       0.72
25                     11.5        0.45          22.05       0.88          15.66       0.63
50                     12.55       0.251         38.6        0.77          44.52       0.89

From Table 6 we can see that increasing the number of processors in an MPI environment enhances the speedup of matrix multiplication but also degrades the system efficiency, as shown in Figs 4, 5 and 6.

Figure 4: Running times for parallel matrix multiplication of two 10000x10000 matrices (time in seconds versus number of processors)

Figure 5: Speedup of matrix multiplication for 10000x10000 matrices (speedup versus number of processors)

Figure 6: System efficiency of matrix multiplication for the 1000x1000, 5000x5000 and 10000x10000 cases (efficiency versus number of processors)

5. Conclusions

Based on the results obtained and shown above, the following conclusions can be drawn:
- Sequential matrix multiplication is preferable for small matrices.
- OpenMP is a good environment for parallel multiplication of mid-size matrices; here the speedup is limited by the number of available physical cores.

- MPI is a good environment for parallel multiplication of huge matrices; here the speedup can be increased further, but at the cost of lower system efficiency.
- To avoid the limitations noted in the two previous conclusions, a hybrid parallel system (MPI with OpenMP) can be recommended.

References

[1] Alqadi Z., and Abu-Jazzar A., 2005. CNRS-NSF Workshop on Environments and Program Methods Used for Optimizing Matrix Tools for Parallel Scientific Computing, Saint Hilaire Multiplication, Journal of Engineering 15(1), pp. 73-78, du Touvet, France, Sept. 7-8, Elsevier Sci. Publishers.
[2] Alqadi Z., and Abu-Jazzar A., 2005. "Analysis of Program Methods Used for Optimizing Matrix Multiplication", Journal of Engineering 15(1), pp. 73-78.
[3] Alqadi Z., Aqel M., and El Emary I. M. M., 2008. "Performance Analysis and Evaluation of Parallel Matrix Multiplication Algorithms", World Applied Sciences Journal 5(2).
[4] Dongarra J. J., R. A. Van de Geijn, and D. W. Walker, 1994. "Scalability Issues Affecting the Design of a Linear Algebra Library, Parallel Linear Algebra Package Design", Distributed Computing 22(3), Proceedings of SC 97, pp. 523-537.
[5] Alpatov P., G. Baker, C. Edwards, J. Gunnels, and P. Geng, 1997. "Parallel Matrix Distributions: Parallel Linear Algebra Package", Tech. Report TR-95-40, Proceedings of the SIAM Parallel Processing Computer Sciences Conference, The University of Texas, Austin.
[6] Choi J., J. J. Dongarra, and D. W. Walker, 1994. "A High-Performance Matrix Multiplication Algorithm Pumma: Parallel Universal Matrix Multiplication on a Distributed Memory Parallel Computer Using Algorithms on Distributed Memory Concurrent Overlapped Communication", IBM J. Res. Develop., Computers, Concurrency: Practice and Experience 6(7), pp. 543-570.
[7] Chtchelkanova A., J. Gunnels, and G. Morrow, 1986. "IEEE Implementation of BLAS: General Techniques for Level 3 BLAS", Proceedings of the 1986 International Conference on Parallel Processing, pp. 640-648, TR-95-40, Department of Computer Sciences, University of Texas.
[8] Barnett M., S. Gupta, D. Payne, and L. Shuler, 1994. "Using MPI: Communication Library (InterCom), Scalable High Portable Programming with the Message-Passing Performance", Computing Conference, pp. 17-31.
[9] Anderson E., Z. Bai, C. Bischof, and J. Demmel, 1987. "Solving Problems on Concurrent Processors", Proceedings of Matrix Algorithms Supercomputing '90, IEEE 1, pp. 1-10.
[10] Choi J., J. J. Dongarra, R. Pozo, and D. W. Walker, 1992. "ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers", Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, IEEE Comput. Soc. Press, pp. 120-127.
[11] MPI 1.1 Standard, http://www-unix.mcs.anl.gov/mpi/mpich.
[12] OpenMP Fortran Application Program Interface, http://www.openmp.org/.
[13] Rabenseifner R., 2003. "Hybrid Parallel Programming: Performance Problems and Chances", Proceedings of the 45th Cray User Group Conference, Ohio, May 12-16, 2003.