Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors

Mostafa I. Soliman and Fatma S. Ahmed
Computers and Systems Section, Electrical Engineering Department, Aswan Faculty of Engineering, Aswan University, Aswan 81542, Egypt
mossol@yahoo.com / mossol@ieee.org and fatma.sayed@aswu.edu.eg

Abstract: This paper exploits the capabilities of distributed multi-core processors (a cluster of quad-core Intel Xeon E5410 processors running at 2.33 GHz) to accelerate basic linear algebra subprograms (BLAS). On one node, the SIMD, multi-threading, and multi-threading SIMD techniques improve the performance of Level-1 BLAS from 0.8-2.1 GFLOPS to 2.93-8.08, 1.43-3.0, and 1.83-3.65 GFLOPS, respectively. However, the use of MPI on multiple nodes degrades the performance because the network overhead dominates the overall execution time. For the same reason, the performance of Level-2 BLAS with the multi-threading, SIMD, and blocking techniques degrades from 2.33-3.32 GFLOPS to 4.37E-02-4.88E-02 GFLOPS on eight nodes. For Level-3 BLAS, the speedups of the traditional/Strassen matrix-matrix multiplication on a single node over the sequential execution are 3.76/3.50, 3.91/3.73, 7.37/7.25, and 12.65/11.86 when applying the SIMD, multi-threading, multi-threading SIMD, and multi-threading SIMD blocking techniques, respectively. On seven nodes, the performances of the traditional/Strassen implementations are 68.67/50.73 GFLOPS. Moreover, on ten nodes, the performance of the traditional matrix-matrix multiplication reaches 99.73 GFLOPS.

Keywords: BLAS; SIMD; multi-threading; matrix blocking; MPI; performance evaluation.

I. INTRODUCTION
Scientific and engineering research is becoming increasingly dependent on the development and implementation of efficient parallel algorithms on modern high-performance computers; see [1] for more details. Linear algebra, in particular the solution of linear systems of equations, lies at the heart of most calculations in scientific computing [2]. The three common levels of linear algebra operations are Level-1 (vector-vector operations), Level-2 (matrix-vector operations), and Level-3 (matrix-matrix operations). For vectors of length n, Level-1 BLAS (basic linear algebra subprograms) involve O(n) data and O(n) operations [3]. They perform important operations that are used in many applications. For example, applying a Givens rotation is used in QR decomposition, in solving linear systems of equations, and in the singular value decomposition (SVD) [4], while the dot-product and SAXPY (scalar a times vector x plus vector y) subroutines are the basis of matrix-matrix multiplication, as discussed in Section 4. Level-2 BLAS involve O(n^2) operations on O(n^2) data and include matrix-vector multiplication, rank-one update, etc. [5]. Note that Level-1 and Level-2 subroutines cannot gain a good performance improvement on multi-core processors using the multi-threading technique, or on a cluster of multi-core processors, because they have a low ratio of arithmetic operations to load/store operations, O(1). In addition, because they reuse loaded data very little, they cannot exploit the memory hierarchy to speed up execution; see [6] for more details. Level-3 BLAS involve O(n^3) operations on O(n^2) data; matrix-matrix multiplication is an example [7]. In this paper, some BLAS kernels are implemented and evaluated on a cluster of Fujitsu Siemens CELSIUS R550 workstations [8].
In this cluster, each compute node has a quad-core Intel Xeon E5410 processor running at 2.33 GHz, an L1 data cache of 32 KB and an L1 instruction cache of 32 KB per core, a shared L2 cache of 12 MB, and 4 GB of main memory [9]. Moreover, the target processor includes data prefetch logic and thirty functional execution units in eleven groups: six general-purpose ALUs, two integer units, one shift unit, four data cache units, six multimedia units, two parallel shift units, one parallel multiply unit, two 82-bit floating-point multiply-accumulate units, two SIMD floating-point multiply-accumulate units, and three branch units. In addition, it includes sixteen 128-bit wide registers (XMM0-XMM15) for SIMD instructions.
Therefore, the Intel Xeon E5410 can exploit instruction-level parallelism (ILP), data-level parallelism (DLP), and thread-level parallelism (TLP) in applications to accelerate their execution. ILP is exploited using pipelining, superscalar execution, speculation, and dynamic scheduling [10]. Since it is a multi-core processor, multi-threading is the technique used to run multiple threads and exploit TLP. To exploit DLP, the streaming SIMD extensions (SSE, SSE2, SSE3, SSSE3, and SSE4.1) can be used to process multiple data elements with a single instruction. To gain further performance improvements, a cluster of computers can be used to distribute the execution of an application among computers connected through a network. The communication between the computers is done via message passing. The message passing interface (MPI) is a standard library used to establish efficient and portable communication between the computers of a cluster [11]. From a software point of view, each computer in the cluster runs the Microsoft Windows 7 operating system, and Microsoft Visual Studio 2012 and the MPICH2-1.2.1p1 library are used for MPI programming.
The organization of this paper is as follows. Section 2 describes and evaluates Level-1 BLAS on the target system.

Level-2 BLAS are evaluated in Section 3. The traditional and Strassen algorithms for matrix-matrix multiplication are evaluated in Section 4 as examples of Level-3 BLAS. Finally, Section 5 concludes this paper.

II. LEVEL-1 BLAS
A. Description
Level-1 BLAS include subroutines that perform vector-scalar and vector-vector operations [3]. They involve O(n) floating-point operations (FLOPs) and create O(n) data movement (load/store operations), where n is the vector length. In this paper, five subroutines from Level-1 BLAS are selected for applying the parallel processing techniques: apply Givens rotation (Givens), dot-product (Dot), SAXPY, Norm-2, and scalar-vector multiplication (SVmul). These subroutines perform different numbers of FLOPs: 6n, 2n, 2n, 2n, and n, respectively. Moreover, they perform 2n/2n, 2n/1, 2n/n, n/1, and n/n load/store operations, respectively; see [12] for more details.

B. Superscalar Performance of Level-1 BLAS
Figure 1 shows the superscalar (sequential) performance of Level-1 BLAS on the target processor (the out-of-order, superscalar Intel Xeon E5410) when the vectors fit in the L2 cache (see Figure 1a) and when they are too large to fit in the L2 cache (see Figure 1b). For large vectors that still fit in the L2 cache, performances of 2.10, 1.83, 1.65, 1.20, and 0.80 GFLOPS are achieved for Givens, Norm-2, Dot, SAXPY, and SVmul, respectively. However, increasing the vector length further decreases the performances to 1.01, 0.73, 0.63, 0.60, and 0.27 GFLOPS, respectively, because of the increasing cache miss rate. Since applying a Givens rotation is more computationally intensive than the other Level-1 subroutines, its performance is the best. The Norm-2 and dot-product subroutines have the same FLOP count (2n) and the same number of store operations (one), but they differ in load operations (n for Norm-2 versus 2n for dot-product); thus, the performance of Norm-2 is better than that of the dot-product. Although the SAXPY subroutine has 2n FLOPs like the dot-product, its performance is lower because SAXPY performs n more store operations than the dot-product. Finally, scalar-vector multiplication has the smallest ratio of FLOPs to memory references (load/store operations) and hence the lowest performance.
On the other hand, the use of MPI on a cluster of computers slows down Level-1 BLAS (see Figure 2a) because the network overhead for sending/receiving data and results dominates the overall execution time. The performances of Givens, Norm-2, Dot, SAXPY, and SVmul are 1.19E-02, 4.41E-02, 6.53E-03, 5.43E-03, and 3.82E-03 GFLOPS on two nodes; 2.13E-02, 5.55E-02, 1.40E-02, 8.95E-03, and 7.05E-03 GFLOPS on four nodes; and 3.77E-02, 7.08E-02, 2.38E-02, 1.67E-02, and 1.17E-02 GFLOPS on eight nodes, respectively.

C. DLP Exploitation to Improve Level-1 BLAS Performance
The performance of Level-1 BLAS can be improved using the SIMD technique, as shown in Figure 1. SIMD instructions can process four single-precision elements (4 x 32-bit) in parallel with a single instruction. The use of the SIMD technique improves the performance of Givens, Norm-2, Dot, SAXPY, and SVmul from 2.10, 1.83, 1.65, 1.20, and 0.80 GFLOPS to 8.08, 6.76, 5.70, 3.94, and 2.93 GFLOPS, respectively, on a single node. Therefore, speedups of 3.84, 3.71, 3.46, 3.29, and 3.66, respectively, are achieved by applying SIMD to vectors that fit in the L2 cache. Further increasing the vector length reduces the performances to 3.63, 2.59, 2.10, 1.85, and 0.99 GFLOPS and the speedups to 3.58, 3.54, 3.31, 3.08, and 3.60, respectively.
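To make the DLP discussion concrete, the sketch below (illustrative, not the authors' code) shows a scalar SAXPY next to an SSE version that processes four single-precision elements per instruction using the XMM registers mentioned in Section 1.

/* Illustrative sketch: scalar and SSE versions of SAXPY, y = a*x + y,
 * for single-precision vectors. */
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_* */
#include <stddef.h>

void saxpy_scalar(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];            /* 2n FLOPs, 2n loads, n stores */
}

void saxpy_sse(size_t n, float a, const float *x, float *y)
{
    __m128 va = _mm_set1_ps(a);      /* broadcast the scalar a */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {     /* four 32-bit elements per iteration */
        __m128 vx = _mm_loadu_ps(&x[i]);
        __m128 vy = _mm_loadu_ps(&y[i]);
        vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
        _mm_storeu_ps(&y[i], vy);
    }
    for (; i < n; i++)               /* remainder elements */
        y[i] += a * x[i];
}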
The MPI SIMD performance of Level-1 BLAS on a cluster of computers also decreases. The performances of apply Givens rotation, Norm-2, dot-product, SAXPY, and scalar-vector multiplication are reduced to 4.27E-02, 1.66E-01, 2.30E-02, 1.87E-02, and 1.41E-02 GFLOPS on two nodes; 6.43E-02, 2.09E-01, 4.92E-02, 3.08E-02, and 2.61E-02 GFLOPS on four nodes; and 1.43E-01, 2.67E-01, 8.36E-02, 5.74E-02, and 4.33E-02 GFLOPS on eight nodes, respectively.

D. Multi-threaded Performance of Level-1 BLAS
To exploit TLP, the given vectors are divided across parallel threads. On the target processor (Intel Xeon E5410), each vector is divided into four sub-vectors, and each sub-vector is assigned to a thread for processing. For small vector lengths, the multi-threading technique degrades the performance because the overhead of thread creation is larger than the cost of the floating-point operations themselves. However, as the vector length increases, the performance improves. Performances of 3.0, 3.83, 2.35, 1.98, and 1.43 GFLOPS, respectively, are achieved when the vectors fit in the L2 cache. When the vectors do not fit in the L2 cache, the performances are reduced to 1.34, 1.49, 0.84, 0.91, and 0.48 GFLOPS, respectively.

Figure 1: Performance of the parallel implementations of Level-1 BLAS on a single computer: (a) vectors fitted in the L2 cache; (b) vectors not fitted in the L2 cache.
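The paper does not list the sub-vector decomposition used in these multi-threaded runs; a minimal sketch of the idea for SAXPY, written here with POSIX threads (the actual implementation on Windows/Visual Studio presumably used a different threading API), is:

/* Hypothetical sketch of the four-way sub-vector partitioning described above. */
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4            /* one thread per core of the Xeon E5410 */

typedef struct { const float *x; float *y; float a; size_t lo, hi; } chunk_t;

static void *saxpy_chunk(void *arg)  /* y[lo..hi) += a * x[lo..hi) */
{
    chunk_t *c = (chunk_t *)arg;
    for (size_t i = c->lo; i < c->hi; i++)
        c->y[i] += c->a * c->x[i];
    return NULL;
}

void saxpy_threaded(size_t n, float a, const float *x, float *y)
{
    pthread_t tid[NUM_THREADS];
    chunk_t   c[NUM_THREADS];
    size_t    len = n / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        c[t].x = x;  c[t].y = y;  c[t].a = a;
        c[t].lo = (size_t)t * len;
        c[t].hi = (t == NUM_THREADS - 1) ? n : c[t].lo + len;
        pthread_create(&tid[t], NULL, saxpy_chunk, &c[t]);  /* thread-creation overhead paid here */
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
}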

Therefore, speedups of 1.43/1.32, 2.10/2.04, 1.42/1.32, 1.65/1.52, and 1.78/1.76, respectively, are achieved on vectors fitted/not fitted in the cache.
Like the SIMD performance on multiple nodes, the MPI multi-threaded performance of Level-1 BLAS on a cluster of computers decreases: to 2.38E-02, 9.45E-02, 9.68E-03, 9.55E-03, and 6.71E-03 GFLOPS on two nodes; 4.26E-02, 1.19E-01, 2.07E-02, 1.57E-02, and 1.24E-02 GFLOPS on four nodes; and 7.53E-02, 1.52E-01, 3.52E-02, 2.93E-02, and 2.05E-02 GFLOPS on eight nodes.

E. Multi-threaded SIMD Performance of Level-1 BLAS
When each thread also uses the SIMD technique, the performance of Level-1 BLAS increases further, as shown in Figure 1. Performances of 3.65, 4.60, 3.18, 2.48, and 1.83 GFLOPS are achieved on a single node when exploiting both the SIMD and multi-threading techniques on vectors that fit in the cache. Thus, the speedups increase from 1.43, 2.10, 1.42, 1.65, and 1.78 using the multi-threading technique alone to 1.73, 2.52, 1.93, 2.07, and 2.28 using the multi-threading and SIMD techniques together. Note that the speedup is better when the ratio of FLOPs to memory references is higher. The use of MPI in addition to the multi-threading and SIMD techniques degrades the performance to 2.57E-02, 1.15E-01, 1.27E-02, 1.15E-02, and 8.72E-03 GFLOPS on two nodes; 4.61E-02, 1.45E-01, 2.72E-02, 1.90E-02, and 1.61E-02 GFLOPS on four nodes; and 8.16E-02, 1.85E-01, 4.63E-02, 3.54E-02, and 2.67E-02 GFLOPS on eight nodes.

III. LEVEL-2 BLAS
A. Description
Level-2 BLAS include routines that operate on a higher-level data structure, matrix-vector, instead of vector-vector or vector-scalar as in Level-1 BLAS [5]. These routines take O(n^2) FLOPs and create O(n^2) memory references, where the matrix involved has dimension n x n. Matrix-vector multiplication, rank-1 and rank-2 updates, and the solution of triangular systems of equations are the main operations covered by Level-2 BLAS. A large part of the computation in numerical linear algebra algorithms is done using Level-2 BLAS. In this paper, the matrix-vector multiplication (MVmul: y(i) = y(i) + A(i,j)x(j), where 0 <= i, j < n) and rank-1 update (Rank-1: A(i,j) = A(i,j) + x(i)y(j), where 0 <= i, j < n) subroutines are selected to explore the effect of the different forms of parallel processing on Level-2 BLAS.

B. Superscalar Performance of Level-2 BLAS
The matrix-vector multiplication and rank-1 update subroutines are implemented and evaluated on small matrices (from 100 x 100 to 1000 x 1000, which fit in the L2 cache) and large matrices (from 1000 x 1000 to 10,000 x 10,000, which do not fit in the L2 cache); see Figure 3. Note that the L2 cache can hold up to three million single-precision elements. The contents of the matrices and vectors are generated randomly in the interval [1, 10]. Figure 3 shows that the performances of MVmul/Rank-1 improve at small matrix sizes that fit in the L2 cache, reaching 1.63/1.11 GFLOPS at matrix size 1000 x 1000. When the matrices are too large to fit in the cache, starting from a matrix size of 2000 x 2000, the performance begins to degrade (see Figure 3b) because of the increasing cache miss rate. Therefore, the performance of Level-2 BLAS depends on the cache being large enough to reuse the loaded vector data. The superscalar performance of the matrix-vector multiplication is better than that of the rank-1 update because the ratio of FLOPs to memory references is higher in the former than in the latter.

Figure 2: Performance of the parallel implementations of Level-1 BLAS on a cluster of computers: (a) sequential performance; (b) SIMD performance; (c) multi-threaded performance; (d) multi-threaded SIMD performance.
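For reference, the two selected Level-2 kernels correspond to the following straightforward loops (illustrative, not the paper's code), which also make the difference in memory traffic visible; A is stored in row-major order.

#include <stddef.h>

/* MVmul: y(i) = y(i) + A(i,j) * x(j), 0 <= i, j < n  ->  2n^2 FLOPs */
void mvmul(size_t n, const float *A, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++) {
        float sum = 0.0f;
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];   /* x is reused across all rows */
        y[i] += sum;
    }
}

/* Rank-1: A(i,j) = A(i,j) + x(i) * y(j), 0 <= i, j < n  ->  2n^2 FLOPs,
 * but every element of A is both loaded and stored, hence the lower
 * FLOPs-to-memory-reference ratio noted above. */
void rank1(size_t n, float *A, const float *x, const float *y)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            A[i * n + j] += x[i] * y[j];
}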

C. SIMD Performance of Level-2 BLAS
With the SIMD technique, the performances of the matrix-vector multiplication and rank-1 update improve as the matrix size increases while the matrices still fit in the L2 cache; see Figure 3. On large matrices (1000 x 1000), the performances of MVmul/Rank-1 reach 6.0/3.93 GFLOPS, as shown in Figure 3a, giving speedups of 3.67/3.54. However, for larger matrix sizes (above 2000 x 2000), the performances of MVmul/Rank-1 degrade because of the escalating cache miss rate, as shown in Figure 3b. The performance can be improved further using the matrix blocking technique; see [13] for details. With blocking, the performances of MVmul/Rank-1 are enhanced to 6.31/4.19 GFLOPS, as shown in Figure 3a ("SIMDB" bars), and the speedups improve to 3.86/3.78, respectively. Since the performance depends on the size of the cache memory, it decreases at large matrix sizes (see Figure 3b) because of the growing cache miss penalty, as the loaded matrix is used only once. In conclusion, the performance of Level-2 BLAS depends on the size of the cache memory and on the number of temporary SIMD registers; using more of them gives better performance.

D. Multi-threaded SIMD Performance of Level-2 BLAS
Since Level-2 BLAS operate on a higher-level data structure (matrix-vector), the multi-threading technique can improve their performance. A large matrix is partitioned among multiple threads, where each thread operates on a sub-matrix, which promises performance improvements. However, using the multi-threading technique with a small matrix may cause performance degradation, because each thread operates on a sub-matrix that is too small to cover the overhead of thread creation. Figure 3 shows the performances in GFLOPS of the matrix-vector multiplication and rank-1 update when using the multi-threading (Thrd), multi-threading SIMD (ThrdSIMD), and multi-threading SIMD blocking (ThrdSIMDB) techniques. As expected, the performance drops for small matrix sizes. As the matrix size increases, the performance improves, because the effect of the thread creation overhead decreases. The performance keeps growing until the matrix size reaches 1000 x 1000 elements, the maximum size that fits in the L2 cache. The performances of the multi-threading, multi-threading SIMD, and multi-threading SIMD blocking techniques on MVmul/Rank-1 reach 2.03/1.41, 2.52/1.73, and 3.32/2.33 GFLOPS, respectively, as shown in Figure 3a. The corresponding speedups are 1.25/1.28, 1.54/1.56, and 2.03/2.10, respectively.

E. MPI Exploitation for Improving Level-2 BLAS
MPI can be used to compute Level-2 BLAS, with their matrix-vector data structure, on more than one computer. The MPI implementation of Level-2 BLAS partitions the data of the given matrix and vector between the computers. The amount of transmitted/received data for the rank-1 update is higher than for the matrix-vector multiplication: for the former, a sub-matrix is sent to and another sub-matrix is received from each node, whereas for the latter a sub-matrix is sent and only a sub-vector is received. On two nodes and a matrix size of 10,000 x 10,000, the sending/receiving time is 33.80 seconds for the former and 16.91 seconds for the latter. Therefore, the sending/receiving time dominates the overall time.
The processing times of the sequential, SIMD, multi-threaded, and multi-threaded SIMD implementations of MVmul/Rank-1 on two nodes are 8.71E-02/1.01E-01, 2.03E-02/2.71E-02, 4.84E-02/5.78E-02, and 3.74E-02/4.36E-02 seconds, respectively. As with Level-1, Level-2 BLAS do not gain a speedup on multiple computers because of the high overhead of the sending/receiving time between them. For the matrix-vector multiplication, the performances of the sequential, SIMD, multi-threaded, and multi-threaded SIMD implementations on 2/4/8 nodes drop from 1.63, 6.00, 2.03, and 2.52 GFLOPS at matrix size 1000 x 1000 on one computer to 1.18E-02/1.95E-02/2.88E-02, 4.36E-02/7.20E-02/1.07E-01, 1.55E-02/2.56E-02/3.79E-02, and 1.99E-02/3.30E-02/4.88E-02 GFLOPS, respectively; see Figure 4. The performance of the rank-1 update drops as well. Its performance on one node for the sequential, SIMD, multi-threaded, and multi-threaded SIMD implementations (at matrix size 1000 x 1000) was 1.11, 3.93, 1.41, and 1.73 GFLOPS, respectively. When executing these implementations on 2/4/8 computers, the performance decreases to 5.91E-03/1.16E-02/2.26E-02 GFLOPS for the sequential implementation, as shown in Figure 4a. For the SIMD implementation, the performance drops to 2.14E-02/4.20E-02/8.18E-02 GFLOPS, respectively, as shown in Figure 4b. Moreover, the performance of the multi-threaded implementation degrades to 8.62E-03/1.96E-02/3.30E-02 GFLOPS, respectively, as shown in Figure 4c.

Figure 3: Performance of the parallel implementations of Level-2 BLAS on a single computer: (a) matrices/vectors fitted in the L2 cache; (b) matrices/vectors not fitted in the L2 cache.
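The paper does not show its MPI decomposition of the Level-2 kernels; the following is a minimal sketch of one plausible scheme for the matrix-vector multiplication, in which the root scatters row blocks of A, broadcasts x, and gathers the pieces of y. It assumes MPI_Init has already been called and that n is divisible by the number of processes; it is an illustration, not the authors' implementation.

#include <mpi.h>
#include <stdlib.h>

/* y = A*x distributed over the ranks of MPI_COMM_WORLD; A, x, y are
 * significant only on rank 0, which holds the full data. */
void mvmul_mpi(int n, float *A, float *x, float *y)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = n / nprocs;                       /* rows of A per node */
    float *Aloc = malloc((size_t)rows * n * sizeof *Aloc);
    float *xloc = malloc((size_t)n * sizeof *xloc);
    float *yloc = calloc((size_t)rows, sizeof *yloc);

    /* distribute the matrix rows and broadcast the input vector */
    MPI_Scatter(A, rows * n, MPI_FLOAT, Aloc, rows * n, MPI_FLOAT, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (int j = 0; j < n; j++) xloc[j] = x[j];
    MPI_Bcast(xloc, n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* local computation; each node could also use the SIMD or threaded kernels here */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++)
            yloc[i] += Aloc[i * n + j] * xloc[j];

    /* collect the result sub-vectors on the root */
    MPI_Gather(yloc, rows, MPI_FLOAT, y, rows, MPI_FLOAT, 0, MPI_COMM_WORLD);

    free(Aloc); free(xloc); free(yloc);
}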

Finally, the performance of the multi-threaded SIMD implementation drops to 1.14E-02/2.24E-02/4.37E-02 GFLOPS, as shown in Figure 4d. Although the performance decreases when the MPI technique is applied to Level-2 BLAS, the performance of the SIMD implementation on multiple nodes is higher than that of the multi-threaded and multi-threaded SIMD implementations. This is because of the thread creation overhead and the competition for the cache between threads. For example, the performances of the SIMD, multi-threaded, and multi-threaded SIMD implementations of MVmul/Rank-1 on eight nodes at the large matrix size are 1.07E-01/8.18E-02, 3.79E-02/3.30E-02, and 4.88E-02/4.37E-02 GFLOPS, respectively.

IV. LEVEL-3 BLAS
A. Description
Level-3 BLAS include subroutines that operate on a higher-level data structure (matrix-matrix) [7]. They are considered computationally intensive kernels, as they involve O(n^3) FLOPs while creating only O(n^2) memory references, where n x n is the size of the matrices involved. Thus, they permit efficient reuse of data residing in the cache memory and demand the exploitation of all available forms of parallelism to improve their performance. Level-3 BLAS exhibit what is often called the surface-to-volume effect in the ratio of memory references (load/store operations) to computations (arithmetic operations) [14]. In contrast to Level-1 and Level-2 BLAS, the total time of broadcasting, sending, and receiving data between multiple nodes is negligible compared with the processing time of Level-3 BLAS at large matrix sizes. Level-3 BLAS cover a small set of operations, including matrix-matrix multiplication, multiplication of a matrix by a triangular matrix, solving triangular systems of equations, and rank-k and rank-2k updates [7]. In this paper, matrix-matrix multiplication (MMmul) is selected as the example of Level-3 BLAS for applying the parallel processing techniques, since it is one of the most fundamental problems in algorithmic linear algebra. The arithmetic complexity of matrix multiplication can be reduced from O(n^3) to O(n^2.807) by applying Strassen's algorithm [13, 15].

B. Traditional Algorithm of Matrix-Matrix Multiplication
As shown in Figure 5a, the superscalar performance of MMmul improves with increasing matrix size; on a 1000 x 1000 matrix it reaches 1.91 GFLOPS. However, for larger matrices, the performance degrades because such sizes cannot fit in the L2 cache. On the other hand, the performance of MMmul improves on a cluster of computers; see Figure 5b. At 10,000 x 10,000, the performance increases from 0.96 GFLOPS on one node to 1.82, 3.51, 6.65, and 7.89 GFLOPS on 2, 4, 8, and 10 nodes, respectively, as shown in Figure 5b. These improvements provide speedups of 1.89, 3.66, 6.94, and 8.24, respectively, over the sequential implementation on one node. Note that as the number of nodes increases, the speedup falls further from the ideal value, because more network overhead is added while the amount of data processed per node becomes smaller. At a small matrix size (100 x 100), the SIMD performance is 5.38 GFLOPS, compared with 1.84 GFLOPS for the sequential superscalar implementation. As the matrix size increases, the SIMD performance improves to 7.18 GFLOPS at 1000 x 1000; see Figure 5a (SIMD). Applying the SIMD technique thus achieves a maximum speedup of 3.76 at the maximum size that fits in the L2 cache, which is 94% of the ideal speedup.
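As a baseline for the measurements above, a straightforward form of the traditional algorithm is sketched below (illustrative, not the paper's code). The i-k-j loop order keeps the innermost accesses to B and C unit-strided, which also makes the loop convenient to vectorize with SSE and to split across threads by rows of C.

#include <stddef.h>

/* Traditional O(n^3) matrix-matrix multiplication, C = C + A*B,
 * with all matrices stored in row-major order (2n^3 FLOPs in total). */
void mmmul(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            float a = A[i * n + k];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}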
Because of the growing cache miss rate at large matrix sizes that cannot fit in the L2 cache, the SIMD performance of the matrix-matrix multiplication begins to decrease, as shown in Figure 5a.

Figure 4: Performance of the parallel implementations of Level-2 BLAS on a cluster of computers: (a) sequential performance; (b) SIMD performance; (c) multi-threaded performance; (d) multi-threaded SIMD performance.

Moreover, Figure 5c shows how the performance improves further on multiple nodes when the SIMD technique is applied in addition to MPI. At a matrix size of 10,000 x 10,000, the performance reaches 2.60, 4.99, 9.75, and 12.79 GFLOPS on 2, 4, 8, and 10 nodes, respectively. Speedups of 1.86, 3.58, 6.99, and 9.17 are achieved over the SIMD implementation on one node, and the speedups over the sequential implementation on one node improve to 2.71, 5.21, 10.18, and 13.34, respectively.
At small matrix sizes, the multi-threaded performance improves with increasing matrix size until it reaches 7.48 GFLOPS at a matrix size of 1000 x 1000 elements, achieving a speedup of 3.91, which is 97.75% of the ideal value, as shown in Figure 5a (Thrd). Moreover, each thread can perform its operations in parallel using SIMD instructions. Figure 5a (Thrd_SIMD) shows how the combination of the two techniques improves the performance to 14.07 GFLOPS at matrix size 1000 x 1000, a speedup of 7.37. Figure 5d shows that the performance of the multi-threaded implementation on 2/4/8/10 nodes improves to 4.13/7.92/15.70/19.17 GFLOPS at 10,000 x 10,000, giving speedups of 1.96/3.76/7.45/9.09 over the multi-threading technique on a single node; in addition, speedups of 4.31/8.26/16.38/20.0 are achieved over the sequential implementation on one node. On multiple nodes, applying both the multi-threading and SIMD techniques leads to performances of 7.15 (2 nodes), 14.25 (4 nodes), 28.74 (8 nodes), and 36.16 (10 nodes) GFLOPS; see Figure 5e. Accordingly, speedups of 1.75, 3.48, 7.02, and 8.84, respectively, are achieved over the multi-threaded SIMD implementation on a single node, and the speedups over the sequential implementation on the single node are 7.46, 14.86, 29.99, and 37.73, respectively.
Since the cache memory is shared between the parallel threads, the performance of the multi-threaded and multi-threaded SIMD implementations degrades at large matrix sizes because of the increasing cache miss rate; see Figure 5a (Thrd and Thrd_SIMD curves). To reduce the cache miss rate, each thread exploits the memory hierarchy using the matrix blocking technique; see Figure 5a (Thrd_SIMD_Blocking curve). At a matrix size of 10,000 x 10,000, the combination of the multi-threading, SIMD, and matrix blocking techniques leads to performances of 11.51/17.49/37.58/78.99/99.73 GFLOPS on 1/2/4/8/10 nodes, respectively, as shown in Figure 5f, achieving speedups of 12.0/18.25/39.20/82.41/104.04 over the single-threaded sequential implementation, respectively.

C. Strassen's Algorithm for Matrix-Matrix Multiplication
Strassen's algorithm is a divide-and-conquer algorithm that is asymptotically faster, as it reduces the number of arithmetic operations to O(n^log2(7)). The traditional method of multiplying two n x n matrices by 2 x 2 blocks takes eight multiplications and four additions on sub-matrices. Strassen showed how two n x n matrices can be multiplied using only seven multiplications and eighteen additions/subtractions on sub-matrices; see [13, 15] for more details. Reducing the number of floating-point operations to O(n^2.807) instead of O(n^3) makes Strassen's matrix-matrix multiplication run faster than the traditional algorithm. At matrix sizes of 1000 x 1000 and 10,000 x 10,000, the execution times of the Strassen/traditional algorithms are 0.49/1.05 and 1060/2090 seconds, respectively.
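For reference, the standard one-level Strassen step [13, 15] partitions each matrix into four n/2 x n/2 quadrants (A11, A12, A21, A22 and likewise for B and C) and forms

    M1 = (A11 + A22)(B11 + B22)
    M2 = (A21 + A22) B11
    M3 = A11 (B12 - B22)
    M4 = A22 (B21 - B11)
    M5 = (A11 + A12) B22
    M6 = (A21 - A11)(B11 + B12)
    M7 = (A12 - A22)(B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6

Forming the seven products takes ten additions/subtractions of sub-matrices and combining them takes eight more, which accounts for the eighteen additions/subtractions mentioned above.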
A performance of 2.53 GFLOPS is achieved for the sequential implementation of Strassen's algorithm at matrix size 1000 x 1000, as shown in Figure 6a; this is the maximum size at which the A, B, and C matrices fit in the L2 cache. Increasing the matrix size beyond 1000 x 1000 increases the cache miss rate and degrades the performance, as shown in Figure 6a.

Figure 5: Performance of the MPI implementations of the traditional matrix multiplication algorithm on 2 to 10 nodes: (a) traditional matrix multiplication on 1 node; (b) sequential on multiple nodes; (c) SIMD on multiple nodes; (d) multi-threaded on multiple nodes; (e) multi-threaded SIMD on multiple nodes; (f) multi-threaded SIMD blocking on multiple nodes.
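The matrix blocking technique used with the threaded SIMD kernels can be sketched as follows; the tile size BS is an illustrative assumption, not a value reported in the paper.

#include <stddef.h>

#define BS 128   /* hypothetical tile size, chosen so three BS x BS tiles stay cache-resident */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Cache-blocked C = C + A*B: the computation is done on BS x BS tiles so that
 * loaded data are reused many times before being evicted.  Each tile update
 * could further be vectorized with SSE and assigned to a different thread. */
void mmmul_blocked(size_t n, const float *A, const float *B, float *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B */
                for (size_t i = ii; i < min_sz(ii + BS, n); i++)
                    for (size_t k = kk; k < min_sz(kk + BS, n); k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < min_sz(jj + BS, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}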

Figure 6b shows that, at the large matrix size of 10,000 x 10,000, the performance improves to 1.34, 2.05, 2.66, 3.25, 3.79, and 4.39 GFLOPS on 2, 3, 4, 5, 6, and 7 nodes, respectively, compared with 0.75 GFLOPS on one node. Thus, speedups of 1.79, 2.74, 3.55, 4.34, 5.06, and 5.86 are achieved over the sequential implementation on one node, which are 89.5%, 91.3%, 88.75%, 86.8%, 84.3%, and 83.7% of the ideal values.
Applying the SIMD and SIMD blocking techniques improves the performance of the sequential superscalar implementation at matrix size 1000 x 1000 from 2.53 GFLOPS to 8.83 and 13.14 GFLOPS, respectively, as shown in Figure 6a, giving speedups of 3.49 and 5.19. For large matrix sizes, average performances of 2.65 and 3.88 GFLOPS are achieved with the SIMD and SIMD blocking techniques, respectively, as shown in Figure 6a. Note that at large matrix sizes a slight degradation in the speedup occurs, with the average speedups decreasing to 3.32 and 4.87 for the SIMD and SIMD blocking techniques, respectively.
In addition to the MPI technique, the SIMD technique is used on each computer to process multiple data elements with a single instruction and obtain more performance. As a result, the performance of Strassen's algorithm at matrix size 10,000 x 10,000 improves from 2.37 GFLOPS on a single node to 4.12, 6.33, 8.19, 10.52, 11.68, and 13.90 GFLOPS on 2, 3, 4, 5, 6, and 7 nodes, respectively; see Figure 6c. This improvement provides speedups over the sequential implementation on one node of 5.49, 8.44, 10.94, 14.04, 15.58, and 18.55, respectively, while the speedups over the SIMD implementation on one node are 1.74, 2.67, 3.46, 4.44, 4.93, and 5.87, respectively, at matrix size 10,000 x 10,000.
Employing the multi-threading technique by running four threads concurrently on the four cores enhances the performance of the sequential implementation of Strassen's algorithm even at small matrix sizes, and the performance improves with increasing matrix size: from 1.81 to 4.42 GFLOPS at matrix size 100 x 100 and from 2.53 to 9.43 GFLOPS at 1000 x 1000; see Figure 6a (Thrd curve). A maximum speedup of 3.73, which is 93.25% of the ideal value, is achieved at the maximum size that fits in the L2 cache (1000 x 1000). Beyond the multi-threading technique, the performance improves further by also applying the SIMD and matrix blocking techniques: it increases to 18.33 and 30.00 GFLOPS, respectively, at matrix size 1000 x 1000 (see Figure 6a, Thrd_SIMD and Thrd_SIMD_Blocking curves), achieving speedups of 7.25 (45.31% of the ideal) and 11.86 (74.13% of the ideal), respectively. At large matrix sizes, the performances based on the multi-threading and SIMD techniques decrease because of cache effects. However, by better exploiting the memory hierarchy and reusing the loaded data in the cache through the matrix blocking technique, the performance can be improved; thus, the speedup due to using the blocking technique in addition to the multi-threading and SIMD techniques increases at large matrix sizes, as illustrated in Figure 6a (Thrd_SIMD_Blocking curve).
Not only the SIMD technique can be used with MPI; the multi-threading technique can be applied as well. Each node can create four threads to compute its intermediate matrices in parallel.
Therefore, the tasks are partitioned between the threads based on Strassen's algorithm. A higher performance is achieved by employing the multi-threading and MPI techniques together.

Figure 6: Performance of the MPI implementations of Strassen's algorithm on 2 to 10 nodes: (a) Strassen matrix multiplication on 1 node; (b) sequential on multiple nodes; (c) SIMD on multiple nodes; (d) multi-threaded on multiple nodes; (e) multi-threaded SIMD on multiple nodes; (f) multi-threaded SIMD blocking on multiple nodes.
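The paper does not spell out how the work is split across nodes for Strassen's algorithm; since up to seven nodes are used, one plausible scheme, sketched below purely as an illustration, assigns each of the seven products M1..M7 listed in Section IV.C to one MPI rank (reusing the mmmul_blocked kernel sketched earlier; each rank could apply the threaded SIMD version) and lets the root combine the results.

#include <mpi.h>
#include <stdlib.h>

extern void mmmul_blocked(size_t n, const float *A, const float *B, float *C);

/* Hypothetical one-level Strassen step over exactly seven MPI ranks.
 * h = n/2 is the quadrant size; Aq[0..3] = A11,A12,A21,A22, likewise Bq, Cq.
 * All ranks must pass allocated h*h buffers; quadrant data are valid on rank 0. */
void strassen_mpi_step(size_t h, float *Aq[4], float *Bq[4], float *Cq[4])
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t e = h * h;
    for (int q = 0; q < 4; q++) {              /* every rank gets all quadrants */
        MPI_Bcast(Aq[q], (int)e, MPI_FLOAT, 0, MPI_COMM_WORLD);
        MPI_Bcast(Bq[q], (int)e, MPI_FLOAT, 0, MPI_COMM_WORLD);
    }

    /* operand tables for M1..M7: (A11+A22)(B11+B22), (A21+A22)B11, A11(B12-B22),
       A22(B21-B11), (A11+A12)B22, (A21-A11)(B11+B12), (A12-A22)(B21+B22)        */
    static const int aL[7] = {0, 2, 0, 3, 0, 2, 1}, aR[7] = {3, 3, -1, -1, 1, 0, 3};
    static const int aS[7] = {+1, +1, 0, 0, +1, -1, -1};
    static const int bL[7] = {0, 0, 1, 2, 3, 0, 2}, bR[7] = {3, -1, 3, 0, -1, 1, 3};
    static const int bS[7] = {+1, 0, -1, -1, 0, +1, +1};

    float *Ta = malloc(e * sizeof *Ta), *Tb = malloc(e * sizeof *Tb);
    float *M  = calloc(e, sizeof *M);
    for (size_t i = 0; i < e; i++) {
        Ta[i] = Aq[aL[rank]][i] + (aR[rank] >= 0 ? aS[rank] * Aq[aR[rank]][i] : 0.0f);
        Tb[i] = Bq[bL[rank]][i] + (bR[rank] >= 0 ? bS[rank] * Bq[bR[rank]][i] : 0.0f);
    }
    mmmul_blocked(h, Ta, Tb, M);               /* this rank's product M(rank+1) */

    /* gather M1..M7 on the root and combine:
       C11 = M1+M4-M5+M7, C12 = M3+M5, C21 = M2+M4, C22 = M1-M2+M3+M6 */
    float *allM = (rank == 0) ? malloc(7 * e * sizeof *allM) : NULL;
    MPI_Gather(M, (int)e, MPI_FLOAT, allM, (int)e, MPI_FLOAT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        #define Mk(k) (allM + (size_t)((k) - 1) * e)
        for (size_t i = 0; i < e; i++) {
            Cq[0][i] = Mk(1)[i] + Mk(4)[i] - Mk(5)[i] + Mk(7)[i];
            Cq[1][i] = Mk(3)[i] + Mk(5)[i];
            Cq[2][i] = Mk(2)[i] + Mk(4)[i];
            Cq[3][i] = Mk(1)[i] - Mk(2)[i] + Mk(3)[i] + Mk(6)[i];
        }
        free(allM);
    }
    free(Ta); free(Tb); free(M);
}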

The performance at the large matrix size of 10,000 x 10,000 increases to 4.51, 6.74, 9.00, 11.17, 13.20, and 15.95 GFLOPS on 2, 3, 4, 5, 6, and 7 nodes, respectively; see Figure 6d. The speedups over the sequential implementation on one node are thus enhanced to 6.02, 8.99, 12.00, 14.89, 17.61, and 21.29, respectively, while the speedups over the multi-threaded implementation on one node are 1.76, 2.62, 3.50, 4.34, 5.14, and 6.21, respectively.
A better performance can be achieved by combining the SIMD, multi-threading, and MPI techniques. The performance improves to 7.35, 10.88, 14.76, 18.35, 22.49, and 25.39 GFLOPS, respectively; see Figure 6e. The speedups over the sequential implementation on one node increase to 9.81, 14.52, 19.69, 24.48, 30.01, and 33.89, respectively, while the speedups over the multi-threaded SIMD implementation on one node are 1.67, 2.47, 3.35, 4.16, 5.10, and 5.76, respectively.
Finally, a much better performance is achieved by exploiting all forms of parallelism on multiple nodes in addition to exploiting the memory hierarchy. The performance is enhanced to 12.97, 18.99, 26.84, 34.57, 41.85, and 50.73 GFLOPS, respectively; see Figure 6f. As a result, speedups of 17.30, 25.34, 35.82, 46.13, 55.84, and 67.69, respectively, are achieved over the sequential implementation on one node. Moreover, the speedups over the multi-threaded SIMD blocking implementation on one node are 1.47, 2.16, 3.05, 3.93, 4.76, and 5.77, respectively, at the large matrix size of 10,000 x 10,000.

V. CONCLUSION
On Intel multi-core processors, three common forms of parallelism (ILP, TLP, and DLP) can be exploited to increase performance. In addition, the use of MPI can improve performance by running parallel tasks on computers connected through a network. In this paper, all of these forms of parallelism have been exploited to improve the performance of the BLAS. Level-1 and Level-2 BLAS did not gain performance from the multi-threading and MPI techniques because of the thread creation overhead and the send/receive overhead, respectively. However, they achieved good performance when using the SIMD technique at large vector lengths and matrix sizes that fit in the cache. Speedups of 3.84, 3.71, 3.46, 3.29, and 3.66 over the sequential implementation were achieved for the apply Givens rotation, Norm-2, dot-product, SAXPY, and scalar-vector multiplication subroutines, respectively, representing 96%, 92.75%, 86.5%, 82.25%, and 91.5% of the ideal value, which equals four. The SIMD technique also improved the performance of Level-2 BLAS (matrix-vector multiplication and rank-1 update): their performances were enhanced to 6.0 and 3.93 GFLOPS, respectively, achieving speedups of 3.67 and 3.54 over the sequential implementation of each one. Furthermore, the performance improved to 6.31 and 4.19 GFLOPS by exploiting the memory hierarchy and reusing the loaded data in the cache memory.
On a single computer, the traditional/Strassen algorithms for matrix-matrix multiplication (the Level-3 BLAS example) achieved speedups of 3.76/3.49, 3.91/3.73, 5.40/5.19, 7.37/7.25, and 12.65/11.86 over the sequential implementation when applying the SIMD, multi-threading, SIMD blocking, multi-threading SIMD, and multi-threading SIMD blocking techniques, respectively, at the large matrix size (1000 x 1000) that fits in the L2 cache. Furthermore, the performance improved when more than one computer was used for distributed processing. On ten computers at the large matrix size of 10,000 x 10,000, the speedup of the traditional algorithm over the sequential implementation increased to 8.24, 13.34, 20.00, 37.73, and 104.04 when applying the MPI, SIMD MPI, multi-threading MPI, multi-threading SIMD MPI, and multi-threading SIMD blocking MPI techniques, respectively. On seven computers at the large matrix size of 10,000 x 10,000, the speedups of Strassen's algorithm improved to 5.86, 18.55, 21.29, 33.89, and 67.69, respectively.

VI. REFERENCES
[1] K. Gallivan, R. Plemmons, and A. Sameh, "Parallel Algorithms for Dense Linear Algebra Computations," SIAM Review, Society for Industrial and Applied Mathematics, Vol. 32, No. 1, pp. 54-135, March 1990.
[2] J. Dongarra and V. Eijkhout, "Numerical Linear Algebra Algorithms and Software," Journal of Computational and Applied Mathematics, Vol. 123, No. 1-2, pp. 489-514, November 2000.
[3] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, "Basic Linear Algebra Subprograms for Fortran Usage," ACM Transactions on Mathematical Software, Vol. 5, No. 3, pp. 308-323, September 1979.
[4] F. Van Zee, R. van de Geijn, and G. Quintana, "Restructuring the QR Algorithm for High-Performance Application of Givens Rotations," FLAME Working Note #60, October 2011. http://www.cs.utexas.edu/users/flame/pubs/flawn60.pdf.
[5] J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, "An Extended Set of Fortran Basic Linear Algebra Subprograms," ACM Transactions on Mathematical Software, Vol. 14, No. 1, pp. 1-17, March 1988.
[6] M. Soliman, "Exploiting the Capabilities of Intel Multi-Core Processors for Block QR Decomposition Algorithm," Neural, Parallel and Scientific Computations, Dynamic Publishers, Atlanta, GA, USA, ISSN 1061-5369, Vol. 21, No. 1, pp. 111-128, March 2013.
[7] J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff, "A Set of Level 3 Basic Linear Algebra Subprograms," ACM Transactions on Mathematical Software, Vol. 16, No. 1, pp. 1-17, March 1990.
[8] Fujitsu, Data Sheet of CELSIUS R550, November 2009. http://sp.ts.fujitsu.com/dmsp/publications/public/ds-celsius-r550.pdf.
[9] Intel, Quad-Core Intel Xeon Processor 5400 Series Datasheet, August 2008. http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/quad-core-xeon-5400-datasheet.pdf.
[10] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 5th Edition, Morgan Kaufmann, September 2011.
[11] B. Barney, "Message Passing Interface (MPI)," Lawrence Livermore National Laboratory, UCRL-MI-133316, 2014. https://computing.llnl.gov/tutorials/mpi/.
[12] M. Soliman, "Performance Evaluation of Multi-Core Intel Xeon Processors on Basic Linear Algebra Subprograms," Parallel Processing Letters (PPL), World Scientific Publishing Company, ISSN 0129-6264, Vol. 19, No. 1, pp. 159-174, March 2009.
[13] G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins Series in the Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD, 1989.
[14] W. Gropp, K. Kennedy, L. Torczon, A. White, J. Dongarra, I. Foster, and G. C. Fox, Sourcebook of Parallel Computing, 2009.
[15] M. Thottethodi, S. Chatterjee, and A. Lebeck, "Tuning Strassen's Matrix Multiplication for Memory Efficiency," Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, IEEE Computer Society, pp. 1-14, November 1998.