Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors

Mostafa I. Soliman and Fatma S. Ahmed
Computers and Systems Section, Electrical Engineering Department, Aswan Faculty of Engineering, Aswan University, Aswan 81542, Egypt
mossol@yahoo.com / mossol@ieee.org and fatma.sayed@aswu.edu.eg

Abstract

This paper exploits the capabilities of distributed multi-core processors (a cluster of quad-core Intel Xeon E5410 processors running at 2.33 GHz) to accelerate the basic linear algebra subprograms (BLAS). On one node, the SIMD, multi-threading, and combined multi-threading SIMD techniques improve the performance of Level-1 BLAS over the sequential implementation. However, the use of MPI on multiple nodes degrades the performance because the network overhead dominates the overall execution time. For the same reason, the performance of Level-2 BLAS using the multi-threading, SIMD, and blocking techniques degrades to 4.37E-02 GFLOPS on eight nodes. In contrast, the speedups of the traditional/Strassen matrix-matrix multiplication on a single node over the sequential execution when applying the SIMD, multi-threading, multi-threading SIMD, and multi-threading SIMD blocking techniques are 3.76/3.50, 3.91/3.73, 7.37/7.25, and 12.65/11.86, respectively. On seven nodes, the performances of the traditional/Strassen algorithms are 68.67/50.73 GFLOPS. Moreover, on ten nodes, the performance of the traditional matrix-matrix multiplication reaches 99.73 GFLOPS.

Keywords: BLAS; SIMD; multi-threading; matrix blocking; MPI; performance evaluation.

I. INTRODUCTION

Scientific and engineering research is becoming increasingly dependent on the development and implementation of efficient parallel algorithms on modern high-performance computers; see [1] for more details. Linear algebra (in particular, the solution of linear systems of equations) lies at the heart of most calculations in scientific computing [2]. The three common levels of linear algebra operations are Level-1 (vector-vector operations), Level-2 (matrix-vector operations), and Level-3 (matrix-matrix operations). For vectors of length n, Level-1 BLAS (basic linear algebra subprograms) involve O(n) data and O(n) operations [3]. They perform important operations that are used in many applications. For example, applying Givens rotations is used in QR decomposition, in solving linear systems of equations, and in singular value decomposition (SVD) [4], while the dot-product and SAXPY (scalar a times vector x plus vector y) subroutines are the basis of matrix-matrix multiplication, as discussed in Section 4. Level-2 BLAS involve O(n²) operations on O(n²) data and include matrix-vector multiplication, rank-one update, etc. [5]. Note that Level-1 and Level-2 subroutines cannot gain much performance from multi-threading on multi-core processors or on clusters of multi-core processors because their ratio of arithmetic operations to load/store operations is low, O(1). In addition, since the loaded data are rarely reused, they cannot exploit the memory hierarchy to speed up execution; see [6] for more detail. Level-3 BLAS involve O(n³) operations on O(n²) data; matrix-matrix multiplication is an example [7]. In this paper, some BLAS kernels are implemented and evaluated on a cluster of Fujitsu Siemens CELSIUS R550 workstations [8]. In this cluster, each computer node has a quad-core Intel Xeon E5410 processor running at 2.33 GHz, a 32 KB L1 data cache and a 32 KB L1 instruction cache per core, a shared 12 MB L2 cache, and 4 GB of main memory [9].
Moreover, the target processor includes data prefetch logic and thirty functional execution units in eleven groups: six general-purpose ALUs, two integer units, one shift unit, four data cache units, six multimedia units, two parallel shift units, one parallel multiply unit, two 82-bit floating-point multiply-accumulate units, two SIMD floating-point multiply-accumulate units, and three branch units. In addition, it includes sixteen 128-bit wide registers (XMM0–XMM15) for SIMD instructions. Therefore, the Intel Xeon E5410 can exploit instruction-level parallelism (ILP), data-level parallelism (DLP), and thread-level parallelism (TLP) in applications to accelerate their execution. ILP is exploited using pipelining, superscalar execution, speculation, and dynamic scheduling [10]. Since it is a multi-core processor, multi-threading is the technique used to execute multiple threads and exploit TLP. To exploit DLP, the streaming SIMD extensions (SSE, SSE2, SSE3, SSSE3, and SSE4.1) can be used to process multiple data elements with a single instruction.

To gain further performance improvements, a cluster of computers can be used to distribute the execution of an application among computers connected through a network. The communication between the computers is done via message passing. The Message Passing Interface (MPI) is a standard library used to establish efficient and portable communication between the computers of a cluster [11]. From a software point of view, each computer in the cluster runs the Microsoft Windows 7 operating system, and Microsoft Visual Studio 2012 and the MPICH library are used for MPI programming.

The organization of this paper is as follows. Section 2 describes and evaluates Level-1 BLAS on the target system.

Level-2 BLAS are evaluated in Section 3. The traditional and Strassen algorithms for matrix-matrix multiplication are evaluated in Section 4 as examples of Level-3 BLAS. Finally, Section 5 concludes this paper.

II. LEVEL-1 BLAS

A. Description

Level-1 BLAS include subroutines that perform vector-scalar and vector-vector operations [3]. They involve O(n) floating-point operations (FLOPs) and create O(n) data movement (load/store operations), where n is the vector length. In this paper, five subroutines from Level-1 BLAS are selected for applying the parallel processing techniques: apply Givens rotation (Givens), dot-product (Dot), SAXPY, Norm-2, and scalar-vector multiplication (SVmul). These subroutines have different numbers of FLOPs: 6n, 2n, 2n, 2n, and n, respectively. Moreover, they perform 2n/2n, 2n/1, 2n/n, n/1, and n/n load/store operations, respectively; see [12] for more details.

B. Superscalar Performance of Level-1 BLAS

Figure 1 shows the superscalar (sequential) performance of Level-1 BLAS on the target processor (the out-of-order, superscalar Intel Xeon E5410) when the vectors fit in the L2 cache (see Figure 1a) and when they are too large to fit in the L2 cache (see Figure 1b). At the largest vector length that fits in the L2 cache, performances of 2.10, 1.83, 1.65, 1.20, and 0.80 GFLOPS are achieved on Givens, Norm-2, Dot, SAXPY, and SVmul, respectively. Increasing the vector length further decreases the performances to 1.01, 0.73, 0.63, 0.60, and 0.27 GFLOPS, respectively, due to the increasing cache-miss rate. Since applying Givens rotations is more computationally intensive than the other Level-1 subroutines, its performance is the best. The Norm-2 and dot-product subroutines have the same number of FLOPs (2n) and the same number of store operations (one), but they differ in the load operations (n/2n for Norm-2/dot-product); thus, the performance of Norm-2 is better than that of the dot-product. Although the SAXPY subroutine has 2n FLOPs like the dot-product, its performance is lower because SAXPY performs n more store operations than the dot-product. Finally, scalar-vector multiplication has the smallest ratio of FLOPs to memory references (load/store operations), and thus the smallest performance.

On the other hand, the use of MPI on a cluster of computers slows down Level-1 BLAS (see Figure 2a) because the network overhead for sending/receiving data/results dominates the overall execution time. The performances of Givens, Norm-2, Dot, SAXPY, and SVmul are 1.19E-02, 4.41E-02, 6.53E-03, 5.43E-03, and 3.82E-03 on two nodes, 2.13E-02, 5.55E-02, 1.40E-02, 8.95E-03, and 7.05E-03 on four nodes, and 3.77E-02, 7.08E-02, 2.38E-02, 1.67E-02, and 1.17E-02 GFLOPS on eight nodes, respectively.

C. DLP Exploitation to Improve Level-1 BLAS Performance

The performance of Level-1 BLAS can be improved using the SIMD technique, as shown in Figure 1. SIMD instructions process four single-precision elements (4 x 32-bit) in parallel with a single instruction. Using the SIMD technique improves the performance of Level-1 BLAS from 2.10, 1.83, 1.65, 1.20, and 0.80 to 8.08, 6.76, 5.70, 3.94, and 2.93 GFLOPS, respectively, on a single node. Therefore, speedups of 3.84, 3.71, 3.46, 3.29, and 3.66, respectively, are achieved by applying SIMD on vectors that fit in the L2 cache. Further increasing the vector length reduces the performances to 3.63, 2.59, 2.10, 1.85, and 0.99 GFLOPS and the speedups to 3.58, 3.54, 3.31, 3.08, and 3.60, respectively.
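As an illustration of the kind of SIMD kernel measured here, the following is a minimal sketch of an SSE SAXPY routine written with compiler intrinsics. It is not the authors' code: the function name is hypothetical, unaligned loads are assumed for simplicity, and a tuned kernel would also consider alignment, loop unrolling, and prefetching.

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* y = a*x + y on single-precision vectors, four elements per SSE instruction. */
    void saxpy_sse(int n, float a, const float *x, float *y)
    {
        __m128 va = _mm_set1_ps(a);           /* broadcast the scalar a */
        int i = 0;
        for (; i <= n - 4; i += 4) {
            __m128 vx = _mm_loadu_ps(&x[i]);
            __m128 vy = _mm_loadu_ps(&y[i]);
            vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);
            _mm_storeu_ps(&y[i], vy);
        }
        for (; i < n; i++)                    /* scalar tail for n not divisible by 4 */
            y[i] = a * x[i] + y[i];
    }

The other Level-1 kernels follow the same pattern, with the reduction kernels (dot-product and Norm-2) accumulating partial sums in an XMM register that is summed at the end.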
On a cluster of computers, the MPI SIMD performance of Level-1 BLAS also decreases. The performances of apply Givens rotation, Norm-2, dot-product, SAXPY, and scalar-vector multiplication are reduced to 4.27E-02, 1.66E-01, 2.30E-02, 1.87E-02, and 1.41E-02 on two nodes, 6.43E-02, 2.09E-01, 4.92E-02, 3.08E-02, and 2.61E-02 on four nodes, and 1.43E-01, 2.67E-01, 8.36E-02, 5.74E-02, and 4.33E-02 GFLOPS on eight nodes, respectively.

D. Multi-threaded Performance of Level-1 BLAS

To employ TLP, the given vectors are divided across parallel threads. On the target processor (Intel Xeon E5410), each vector is divided into four sub-vectors, and each sub-vector is assigned to a thread for processing. At small vector lengths, the multi-threading technique degrades the performance because the thread-creation overhead is larger than the cost of the floating-point operations themselves. However, as the vector length increases, multi-threading improves the performance. Performances of 3.0, 3.83, 2.35, 1.98, and 1.43 GFLOPS, respectively, are achieved when the given vectors fit in the L2 cache; when they do not fit, the performances are reduced to 1.34, 1.49, 0.84, 0.91, and 0.48 GFLOPS, respectively. Therefore, speedups of 1.43/1.32, 2.10/2.04, 1.42/1.32, 1.65/1.52, and 1.78/1.76, respectively, are achieved for vectors fitted/not fitted in the cache.

Figure 1: Performance of the parallel implementations of Level-1 BLAS on a single computer: (a) vectors fitted in the L2 cache; (b) vectors not fitted in the L2 cache.
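The per-thread partitioning described above can be sketched as follows. POSIX threads are used here only for illustration; the thread API, the helper names, and the fixed thread count are assumptions, and the Windows implementation evaluated in the paper would use the equivalent native threading calls.

    #include <pthread.h>

    #define NUM_THREADS 4   /* one thread per core of the quad-core Xeon E5410 */

    /* Arguments describing one thread's contiguous sub-vector of the SAXPY operation. */
    typedef struct { int n; float a; const float *x; float *y; } chunk_t;

    static void *saxpy_chunk(void *arg)
    {
        chunk_t *c = (chunk_t *)arg;
        for (int i = 0; i < c->n; i++)
            c->y[i] = c->a * c->x[i] + c->y[i];
        return NULL;
    }

    /* y = a*x + y computed by four threads, each owning one sub-vector. */
    void saxpy_threads(int n, float a, const float *x, float *y)
    {
        pthread_t tid[NUM_THREADS];
        chunk_t   arg[NUM_THREADS];
        int base = n / NUM_THREADS;

        for (int t = 0; t < NUM_THREADS; t++) {
            int lo = t * base;
            int hi = (t == NUM_THREADS - 1) ? n : lo + base;  /* last thread takes the remainder */
            arg[t] = (chunk_t){ hi - lo, a, x + lo, y + lo };
            pthread_create(&tid[t], NULL, saxpy_chunk, &arg[t]);
        }
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);
    }

For short vectors, the thread creation and join calls cost more than the loop itself, which is the thread-creation overhead referred to above; combining this partitioning with the SIMD kernel inside each thread gives the multi-threaded SIMD variant.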

Like the SIMD performance on multiple nodes, the MPI multi-threaded performance of Level-1 BLAS on a cluster of computers decreases to 2.38E-02, 9.45E-02, 9.68E-03, 9.55E-03, and 6.71E-03 on two nodes, 4.26E-02, 1.19E-01, 2.07E-02, 1.57E-02, and 1.24E-02 on four nodes, and 7.53E-02, 1.52E-01, 3.52E-02, 2.93E-02, and 2.05E-02 GFLOPS on eight nodes.

E. Multi-threaded SIMD Performance of Level-1 BLAS

When each thread uses the SIMD technique, the performance of Level-1 BLAS increases further, as shown in Figure 1. Performances of 3.65, 4.60, 3.18, 2.48, and 1.83 GFLOPS are achieved on a single node when both the SIMD and multi-threading techniques are exploited on vectors that fit in the cache. Thus, the speedups increase from 1.43, 2.10, 1.42, 1.65, and 1.78 using the multi-threading technique alone to 1.73, 2.52, 1.93, 2.07, and 2.28 using the multi-threading and SIMD techniques together. Note that the speedup is better when the ratio of FLOPs to memory references is higher. Using MPI in addition to the multi-threading and SIMD techniques degrades the performance to 2.57E-02, 1.15E-01, 1.27E-02, 1.15E-02, and 8.72E-03 on two nodes, 4.61E-02, 1.45E-01, 2.72E-02, 1.90E-02, and 1.61E-02 on four nodes, and 8.16E-02, 1.85E-01, 4.63E-02, 3.54E-02, and 2.67E-02 GFLOPS on eight nodes.

Figure 2: Performance of the parallel implementations of Level-1 BLAS on a cluster of computers: (a) sequential performance; (b) SIMD performance; (c) multi-threaded performance; (d) multi-threaded SIMD performance.

III. LEVEL-2 BLAS

A. Description

Level-2 BLAS include routines that operate on a higher-level data structure (matrix-vector) instead of the vector-vector or vector-scalar structures of Level-1 BLAS [5]. These routines take O(n²) FLOPs and create O(n²) memory references, where the dimension of the matrix involved is n × n. Matrix-vector multiplication, rank-1 and rank-2 updates, and the solution of linear systems of equations are the main operations covered by Level-2 BLAS, and a large part of the computation in numerical linear algebra algorithms is done using them. In this paper, the matrix-vector multiplication (MVmul: y(i) = y(i) + A(i,j)*x(j), where 0 ≤ i, j < n) and rank-1 update (Rank-1: A(i,j) = A(i,j) + x(i)*y(j), where 0 ≤ i, j < n) subroutines are selected to explore the effect of the different forms of parallel processing on Level-2 BLAS.

B. Superscalar Performance of Level-2 BLAS

The matrix-vector multiplication and rank-1 update subroutines are implemented and evaluated on small matrix sizes that fit in the L2 cache and on large matrix sizes (up to 10,000 × 10,000) that do not; see Figure 3. Note that the L2 cache can hold about three million single-precision elements. The contents of the matrix and vectors are generated randomly in the interval [1, 10]. Figure 3 shows that the performances of MVmul/Rank-1 improve with matrix size while the data fit in the L2 cache, reaching 1.63/1.11 GFLOPS. When the matrices become too large to fit in the cache, the performance begins to degrade (see Figure 3b) because of the increasing cache-miss rate. Therefore, the performance of Level-2 BLAS depends on the size of the cache, which determines how much of the loaded vector data can be reused. The superscalar performance of the matrix-vector multiplication is better than that of the rank-1 update because the ratio of FLOPs to memory references is higher in the former than in the latter.
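For reference, the two selected kernels reduce to the following scalar loop nests (single precision and row-major storage are assumed here; this is a sketch rather than the evaluated implementation). The matrix-vector product reuses the vector x across all rows, whereas the rank-1 update reads and writes every matrix element exactly once, which is why its ratio of FLOPs to memory references is lower.

    /* y = y + A*x for an n-by-n row-major matrix A: 2*n*n FLOPs on about n*n data. */
    void mvmul(int n, const float *A, const float *x, float *y)
    {
        for (int i = 0; i < n; i++) {
            float sum = y[i];
            for (int j = 0; j < n; j++)
                sum += A[i * n + j] * x[j];      /* x is reused for every row of A */
            y[i] = sum;
        }
    }

    /* A = A + x*y^T (rank-1 update): 2*n*n FLOPs, and A is both read and written. */
    void rank1(int n, float *A, const float *x, const float *y)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                A[i * n + j] += x[i] * y[j];
    }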

C. SIMD Performance of Level-2 BLAS

Using the SIMD technique, the performances of the matrix-vector multiplication and rank-1 update improve with increasing matrix size as long as the data fit in the L2 cache; see Figure 3. At the largest matrices that fit in the cache, the performances of MVmul/Rank-1 reach 6.0/3.93 GFLOPS, as shown in Figure 3a, resulting in speedups of 3.67/3.54. However, at matrix sizes too large for the cache, the performances of MVmul/Rank-1 degrade because of the escalating cache-miss rate, as shown in Figure 3b. The performance can be improved further using the matrix blocking technique; see [13] for details. With blocking, the performances of MVmul/Rank-1 are enhanced to 6.31/4.19 GFLOPS, as shown in Figure 3a ("SIMDB" bars), and the speedups improve to 3.86/3.78, respectively. Since the performance depends on the size of the cache memory, it still decreases at large matrix sizes (see Figure 3b) because of the growing cache-miss penalty: the loaded matrix is used only once. In conclusion, the performance of Level-2 BLAS depends on the size of the cache memory and on the number of temporary SIMD registers; the larger these are, the better the performance.

D. Multi-threaded SIMD Performance of Level-2 BLAS

Since Level-2 BLAS operate on a higher-level data structure (matrix-vector), the multi-threading technique can improve their performance. A large matrix is partitioned among multiple threads, with each thread operating on a sub-matrix, which promises performance improvements. However, using multi-threading with a small matrix may cause performance degradation, because each thread then operates on a sub-matrix too small to amortize the thread-creation overhead. Figure 3 shows the performances in GFLOPS of the matrix-vector multiplication and rank-1 update when using the multi-threading (Thrd), multi-threading SIMD (ThrdSIMD), and multi-threading SIMD blocking (ThrdSIMDB) techniques. As expected, the performance degrades for small matrix sizes. As the matrix size increases, the performance improves, because the effect of the thread-creation overhead decreases. The performance keeps growing until the matrix reaches the maximum size that fits in the L2 cache. The performances of the multi-threading, multi-threading SIMD, and multi-threading SIMD blocking techniques on MVmul/Rank-1 reach 2.03/1.41, 2.52/1.73, and 3.32/2.33 GFLOPS, respectively, as shown in Figure 3a. The corresponding speedups are 1.25/1.28, 1.54/1.56, and 2.03/2.10, respectively.

E. MPI Exploitation for Improving Level-2 BLAS

MPI can be used to compute Level-2 BLAS, with their matrix-vector data structures, on more than one computer. The MPI implementation of Level-2 BLAS is based on partitioning the matrix and vector data among the computers. The amount of transmitted/received data is higher for the rank-1 update than for the matrix-vector multiplication: for the former, a sub-matrix is sent to and another sub-matrix is received from each node, whereas for the latter, a sub-matrix is sent and only a sub-vector is received. On two nodes and a matrix size of 10,000 × 10,000, the measured sending/receiving time is accordingly higher for the former than for the latter, and in both cases it dominates the overall execution time.
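A minimal sketch of this kind of partitioning is shown below, assuming the matrix rows are scattered evenly from the root process and the whole vector x is broadcast; the exact data distribution, counts, and communicator handling in the paper's implementation may differ.

    #include <mpi.h>
    #include <stdlib.h>

    /* Distributed y = A*x: each rank receives n/size rows of A and all of x,
     * computes its slice of y, and the root gathers the slices.
     * Assumes n is divisible by the number of ranks (a simplification). */
    void mvmul_mpi(int n, const float *A, const float *x, float *y, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int rows = n / size;
        float *A_loc = malloc((size_t)rows * n * sizeof(float));
        float *x_loc = malloc((size_t)n * sizeof(float));
        float *y_loc = malloc((size_t)rows * sizeof(float));

        if (rank == 0)
            for (int i = 0; i < n; i++) x_loc[i] = x[i];

        MPI_Scatter(A, rows * n, MPI_FLOAT, A_loc, rows * n, MPI_FLOAT, 0, comm);
        MPI_Bcast(x_loc, n, MPI_FLOAT, 0, comm);

        for (int i = 0; i < rows; i++) {       /* local matrix-vector product */
            float sum = 0.0f;
            for (int j = 0; j < n; j++)
                sum += A_loc[i * n + j] * x_loc[j];
            y_loc[i] = sum;
        }

        MPI_Gather(y_loc, rows, MPI_FLOAT, y, rows, MPI_FLOAT, 0, comm);

        free(A_loc); free(x_loc); free(y_loc);
    }

For the rank-1 update, the gather at the end returns rows × n updated matrix elements per node instead of rows vector elements, which is the larger receive volume noted above.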
The processing times of the sequential, SIMD, multi-threaded, and multi-threaded SIMD implementations of MVmul/Rank-1 on two nodes are 8.71E-02/1.01E-01, 2.03E-02/2.71E-02, 4.84E-02/5.78E-02, and 3.74E-02/4.36E-02 seconds, respectively. As with Level-1, Level-2 BLAS do not gain a speedup on multiple computers because of the high sending/receiving overhead between them. For matrix-vector multiplication, the performances of the sequential, SIMD, multi-threaded, and multi-threaded SIMD implementations drop from 1.63, 6.00, 2.03, and 2.52 GFLOPS on one computer (at the largest matrix size that fits in the cache) to 1.18E-02/1.95E-02/2.88E-02, 4.36E-02/7.20E-02/1.07E-01, 1.55E-02/2.56E-02/3.79E-02, and 1.99E-02/3.30E-02/4.88E-02 GFLOPS on 2/4/8 nodes, respectively; see Figure 4. The rank-1 update slows down in the same way. Its performance on one node for the sequential, SIMD, multi-threaded, and multi-threaded SIMD implementations (at the largest in-cache matrix size) was 1.11, 3.93, 1.41, and 1.73 GFLOPS, respectively. When these implementations are executed on 2/4/8 computers, the performance decreases to 5.91E-03/1.16E-02/2.26E-02 GFLOPS for the sequential implementation, as shown in Figure 4a. For the SIMD implementation, the performance drops to 2.14E-02/4.20E-02/8.18E-02 GFLOPS, respectively, as shown in Figure 4b. The performance of the multi-threaded implementation degrades to 8.62E-03/1.96E-02/3.30E-02 GFLOPS, respectively, as shown in Figure 4c. Finally, the performance of the multi-threaded SIMD implementation drops to 1.14E-02/2.24E-02/4.37E-02 GFLOPS, as shown in Figure 4d.

Figure 3: Performance of the parallel implementations of Level-2 BLAS on a single computer: (a) matrices/vectors fitted in the L2 cache; (b) matrices/vectors not fitted in the L2 cache.

Although the performance decreases when the MPI technique is applied to Level-2 BLAS, the performance of the SIMD implementation on multiple nodes is higher than that of the multi-threaded and multi-threaded SIMD implementations. This is because of the thread-creation overhead and the competition between threads for the shared cache. For example, the performances of the SIMD, multi-threaded, and multi-threaded SIMD implementations of MVmul/Rank-1 on 8 nodes at a large matrix size are 1.07E-01/8.18E-02, 3.79E-02/3.30E-02, and 4.88E-02/4.37E-02 GFLOPS, respectively.

IV. LEVEL-3 BLAS

A. Description

Level-3 BLAS include subroutines that operate on a higher-level data structure (matrix-matrix) [7]. They are considered computationally intensive kernels, as they involve O(n³) FLOPs while creating only O(n²) memory references, where n × n is the size of the matrices involved. Thus, they permit efficient reuse of data residing in the cache memory, and they demand the exploitation of all available forms of parallelism to improve their performance. Level-3 BLAS exhibit what is often called the surface-to-volume effect in the ratio of memory references (load/store operations) to computations (arithmetic operations) [14]. In contrast to Level-1 and Level-2 BLAS, the total time of broadcasting, sending, and receiving data between multiple nodes is negligible compared with the processing time of Level-3 BLAS on large matrices. Level-3 BLAS cover a small set of operations, including matrix-matrix multiplication, multiplication of a matrix by a triangular matrix, solving triangular systems of equations, and rank-k and rank-2k updates [7]. In this paper, matrix-matrix multiplication (MMmul) is selected as an example of Level-3 BLAS for applying the parallel processing techniques, since it is one of the most fundamental problems in algorithmic linear algebra. The arithmetic complexity of matrix multiplication can be reduced from O(n³) to O(n^2.807) by applying Strassen's algorithm [13, 15].

B. Traditional Algorithm of Matrix-Matrix Multiplication

As shown in Figure 5a, the superscalar performance of MMmul improves with increasing matrix size, reaching 1.91 GFLOPS at the largest size that fits in the L2 cache. For larger matrices, the performance degrades because the data no longer fit in the L2 cache. On the other hand, the performance of MMmul improves on a cluster of computers, as shown in Figure 5b. On 10,000 × 10,000 matrices, the performance increases from 0.96 GFLOPS on one node to 1.82, 3.51, 6.65, and 7.89 GFLOPS on 2, 4, 8, and 10 nodes, respectively. These improvements provide speedups of 1.89, 3.66, 6.94, and 8.24, respectively, over the sequential implementation on one node. Note that as the number of nodes increases, the speedup falls further below the ideal value because more network overhead is added while the data processed per node become smaller.

On small matrices, the SIMD performance is 5.38 GFLOPS, compared with 1.84 GFLOPS for the sequential superscalar implementation. As the matrix size increases, the SIMD performance improves to 7.18 GFLOPS; see Figure 5a (SIMD). Applying the SIMD technique thus achieves a maximum speedup of 3.76 at the maximum size that fits in the L2 cache, which represents 94% of the ideal speedup.
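For reference, the traditional algorithm evaluated in this section is the standard triple loop sketched below (scalar form only; the evaluated kernels additionally apply SIMD, multi-threading, and blocking as described in the text). Each element of C is the dot-product of a row of A and a column of B, so 2n³ FLOPs are performed on O(n²) data.

    /* C = C + A*B for n-by-n row-major matrices: 2*n^3 FLOPs on O(n^2) data. */
    void mmmul(int n, const float *A, const float *B, float *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                float sum = C[i * n + j];
                for (int k = 0; k < n; k++)   /* dot-product of row i of A and column j of B */
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }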
Because of the growing cache-miss rate at large matrix sizes that cannot fit in the L2 cache, the SIMD performance of matrix-matrix multiplication begins to decrease, as shown in Figure 5a.

Figure 4: Performance of the parallel implementations of Level-2 BLAS on a cluster of computers: (a) sequential performance; (b) SIMD performance; (c) multi-threaded performance; (d) multi-threaded SIMD performance.
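One standard way to counter this growth in cache misses, and the idea behind the matrix blocking technique applied later in this section, is to tile the triple loop so that sub-blocks of A, B, and C are reused from the cache before they are evicted. A minimal sketch follows; the tile size BS is only a placeholder, not the tuned value used in the paper.

    #define BS 128   /* tile size: a placeholder to be tuned to the cache */

    static int min_int(int a, int b) { return a < b ? a : b; }

    /* Cache-blocked C = C + A*B for n-by-n row-major matrices. */
    void mmmul_blocked(int n, const float *A, const float *B, float *C)
    {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    /* multiply one BS-by-BS tile of A with one tile of B */
                    for (int i = ii; i < min_int(ii + BS, n); i++)
                        for (int k = kk; k < min_int(kk + BS, n); k++) {
                            float a = A[i * n + k];   /* reused across the whole j loop */
                            for (int j = jj; j < min_int(jj + BS, n); j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

The Thrd_SIMD_Blocking results below combine this kind of tiling with the SIMD kernel inside each thread.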

Moreover, Figure 5c shows how the performance improves further on multiple nodes when the SIMD technique is applied in addition to MPI. On a matrix size of 10,000 × 10,000, the performance reaches 2.60, 4.99, and 9.75 GFLOPS on 2, 4, and 8 nodes, respectively, and grows further on 10 nodes. Speedups of 1.86, 3.58, 6.99, and 9.17 are achieved on 2, 4, 8, and 10 nodes over the SIMD implementation on one node, and the speedups over the sequential implementation on one node improve to 2.71, 5.21, 10.18, and 13.34, respectively.

On small matrices, the multi-threaded performance improves with increasing matrix size until it reaches 7.48 GFLOPS at the largest size that fits in the L2 cache, achieving a speedup of 3.91, which represents 97.75% of the ideal value, as shown in Figure 5a (Thrd). Moreover, each thread can compute its operations in parallel using SIMD instructions. Figure 5a (Thrd_SIMD) shows how combining the two techniques improves the performance further at the largest in-cache matrix size, resulting in a speedup of 7.37. Figure 5d shows that the performance of the multi-threaded implementation on 2/4/8/10 nodes improves to 4.13/7.92/15.70/19.17 GFLOPS on 10,000 × 10,000 matrices, giving speedups of 1.96/3.76/7.45/9.09 over the multi-threading technique on a single node and 4.31/8.26/16.38/20.0 over the sequential implementation on one node. On multiple nodes, applying both the multi-threading and SIMD techniques leads to a performance of 7.15 GFLOPS on 2 nodes and correspondingly higher values on 4, 8, and 10 nodes; see Figure 5e. Speedups of 1.75, 3.48, 7.02, and 8.84, respectively, are achieved over the multi-threaded SIMD implementation on a single node, and speedups of 7.46, 14.86, 29.99, and 37.73 over the sequential implementation on a single node.

Since the cache memory is shared between the parallel threads, the performance of the multi-threaded and multi-threaded SIMD implementations degrades at large matrix sizes because of the increasing cache-miss rate; see Figure 5a (Thrd and Thrd_SIMD curves). To reduce the cache-miss rate, each thread exploits the memory hierarchy using the matrix blocking technique; see Figure 5a (Thrd_SIMD_Blocking curve). On a matrix size of 10,000 × 10,000, the combination of the multi-threading, SIMD, and matrix blocking techniques leads to performances of 11.51/17.49/37.58/78.99/99.73 GFLOPS on 1/2/4/8/10 nodes, respectively, as shown in Figure 5f, with speedups of 12.0/18.25/39.20/82.41 on 1/2/4/8 nodes, and higher still on 10 nodes, over the single-threaded sequential implementation.

C. Strassen's Algorithm for Matrix-Matrix Multiplication

Strassen's algorithm is a divide-and-conquer algorithm that is asymptotically faster, as it reduces the number of arithmetic operations to O(n^(log2 7)) ≈ O(n^2.807). The traditional method of multiplying two n × n matrices partitioned into 2 × 2 blocks takes eight multiplications and four additions of sub-matrices. Strassen showed how two n × n matrices can be multiplied using only seven multiplications and eighteen additions/subtractions of sub-matrices; see [13, 15] for more details. Reducing the number of floating-point operations from O(n³) to O(n^2.807) makes Strassen's matrix-matrix multiplication run faster than the traditional algorithm. At a smaller matrix size and at 10,000 × 10,000, the execution times of the Strassen/traditional algorithms are 0.49/1.05 and 1060/2090 seconds, respectively.
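One level of Strassen's recursion can be sketched as below for the four half-size blocks of A and B. The traditional kernel mmmul from the earlier sketch is used for the seven block products; a full implementation would recurse on each product and switch to the traditional kernel below a cut-off size, and the helper and buffer handling here are illustrative assumptions rather than the paper's code.

    #include <stdlib.h>

    void mmmul(int n, const float *A, const float *B, float *C);   /* traditional kernel (see above) */

    /* Block add/subtract on h-by-h row-major blocks. */
    static void badd(int h, const float *X, const float *Y, float *Z)
    { for (int i = 0; i < h * h; i++) Z[i] = X[i] + Y[i]; }
    static void bsub(int h, const float *X, const float *Y, float *Z)
    { for (int i = 0; i < h * h; i++) Z[i] = X[i] - Y[i]; }

    /* One Strassen level: C = A*B where each n-by-n matrix (n = 2h) is given as
     * four h-by-h blocks: seven products M1..M7 and eighteen block additions/subtractions. */
    void strassen_step(int h,
                       const float *A11, const float *A12, const float *A21, const float *A22,
                       const float *B11, const float *B12, const float *B21, const float *B22,
                       float *C11, float *C12, float *C21, float *C22)
    {
        size_t bytes = (size_t)h * h * sizeof(float);
        float *T1 = malloc(bytes), *T2 = malloc(bytes), *M[8];
        for (int k = 1; k <= 7; k++) M[k] = calloc((size_t)h * h, sizeof(float));

        badd(h, A11, A22, T1); badd(h, B11, B22, T2); mmmul(h, T1, T2, M[1]);  /* M1 = (A11+A22)(B11+B22) */
        badd(h, A21, A22, T1);                        mmmul(h, T1, B11, M[2]); /* M2 = (A21+A22)B11       */
        bsub(h, B12, B22, T2);                        mmmul(h, A11, T2, M[3]); /* M3 = A11(B12-B22)       */
        bsub(h, B21, B11, T2);                        mmmul(h, A22, T2, M[4]); /* M4 = A22(B21-B11)       */
        badd(h, A11, A12, T1);                        mmmul(h, T1, B22, M[5]); /* M5 = (A11+A12)B22       */
        bsub(h, A21, A11, T1); badd(h, B11, B12, T2); mmmul(h, T1, T2, M[6]);  /* M6 = (A21-A11)(B11+B12) */
        bsub(h, A12, A22, T1); badd(h, B21, B22, T2); mmmul(h, T1, T2, M[7]);  /* M7 = (A12-A22)(B21+B22) */

        for (int i = 0; i < h * h; i++) {
            C11[i] = M[1][i] + M[4][i] - M[5][i] + M[7][i];
            C12[i] = M[3][i] + M[5][i];
            C21[i] = M[2][i] + M[4][i];
            C22[i] = M[1][i] - M[2][i] + M[3][i] + M[6][i];
        }
        free(T1); free(T2);
        for (int k = 1; k <= 7; k++) free(M[k]);
    }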
A performance of 2.53 GFLOPS is achieved by the sequential implementation of Strassen's algorithm at the maximum matrix size for which the A, B, and C matrices fit in the L2 cache, as shown in Figure 6a. Increasing the matrix size beyond this point results in an increasing rate of cache misses and degrades the performance.

Figure 5: Performance of the MPI implementations of the traditional matrix multiplication algorithm on 2 to 10 nodes: (a) traditional matrix multiplication on 1 node; (b) sequential on multiple nodes; (c) SIMD on multiple nodes; (d) multi-threaded on multiple nodes; (e) multi-threaded SIMD on multiple nodes; (f) multi-threaded SIMD blocking on multiple nodes.

Figure 6b shows that, for a large matrix size of 10,000 × 10,000, the performance improves to 1.34, 2.05, 2.66, 3.25, 3.79, and 4.39 GFLOPS on 2, 3, 4, 5, 6, and 7 nodes, respectively, compared with 0.75 GFLOPS on one node. Thus, speedups of 1.79, 2.74, 3.55, 4.34, 5.06, and 5.86 are achieved over the sequential implementation on one node, which are 89.5%, 91.3%, 88.75%, 86.8%, 84.3%, and 83.7% of the ideal values.

Applying the SIMD and SIMD blocking techniques improves the performance of the sequential superscalar implementation from 2.53 GFLOPS to 8.83 GFLOPS for SIMD and higher still with blocking at the largest in-cache matrix size, achieving speedups of 3.49 and 5.19, respectively, as shown in Figure 6a. For large matrix sizes, average performances of 2.65 and 3.88 GFLOPS are achieved with the SIMD and SIMD blocking techniques, respectively, as shown in Figure 6a. Note that at large matrix sizes a slight degradation in the speedup occurs, with the average speedups decreasing to 3.32 and 4.87 for the SIMD and SIMD blocking techniques, respectively.

In addition to the MPI technique, the SIMD technique is used on each computer to process multiple data with a single instruction and obtain more performance. As a result, the performance of Strassen's algorithm on a matrix size of 10,000 × 10,000 improves from 2.37 GFLOPS on a single node to 4.12, 6.33, 8.19, 10.52, and 11.68 GFLOPS on 2, 3, 4, 5, and 6 nodes, respectively, and grows further on 7 nodes; see Figure 6c. This improvement provides speedups over the sequential implementation on one node of 5.49, 8.44, 10.94, 14.04, 15.58, and 18.55 on 2 to 7 nodes, respectively, while the speedups over the SIMD implementation on one node are 1.74, 2.67, 3.46, 4.44, 4.93, and 5.87, respectively.

Employing the multi-threading technique, with four threads running concurrently on the four cores, enhances the performance of the sequential implementation of Strassen's algorithm at small matrix sizes, and the performance improves further with increasing matrix size: it increases from 1.81 to 4.42 GFLOPS at a smaller matrix size and from 2.53 to 9.43 GFLOPS at the largest in-cache size; see Figure 6a (Thrd curve). A maximum speedup of 3.73, which is 93.25% of the ideal value, is achieved at the maximum size that fits in the L2 cache. Besides multi-threading, the performance improves further by applying the SIMD and matrix blocking techniques (see Figure 6a, Thrd_SIMD and Thrd_SIMD_Blocking curves), achieving speedups of 7.25 (45.31% of the ideal value) and 11.86 (74.13% of the ideal value), respectively, at the largest in-cache matrix size. At large matrix sizes, the performances based on the multi-threading and SIMD techniques decrease because of cache effects. However, by exploiting the memory hierarchy and reusing the loaded data in cache through the matrix blocking technique, the performance can be improved; thus, the speedup from adding the blocking technique to the multi-threading and SIMD techniques increases at large matrix sizes, as illustrated in Figure 6a (Thrd_SIMD_Blocking curve).

Not only the SIMD technique but also the multi-threading technique can be combined with MPI. Each node creates four threads to compute its intermediate matrices in parallel, with the tasks partitioned among the threads according to Strassen's algorithm.
A higher performance is achieved by employing the multi-threading and MPI techniques together.

Figure 6: Performance of the MPI implementations of Strassen's algorithm on 2 to 10 nodes: (a) Strassen matrix multiplication on 1 node; (b) sequential on multiple nodes; (c) SIMD on multiple nodes; (d) multi-threaded on multiple nodes; (e) multi-threaded SIMD on multiple nodes; (f) multi-threaded SIMD blocking on multiple nodes.

For a large matrix size of 10,000 × 10,000, the performance increases to 4.51, 6.74, 9.00, 11.17, and 13.20 GFLOPS on 2, 3, 4, 5, and 6 nodes, respectively, and grows further on 7 nodes; see Figure 6d. The speedups over the sequential implementation on one node are thereby enhanced to 6.02, 8.99, 12.00, 14.89, 17.61, and 21.29 on 2 to 7 nodes, respectively, while the speedups over the multi-threaded implementation on one node increase to 1.76, 2.62, 3.50, 4.34, 5.14, and 6.21, respectively. A better performance is achieved by combining the SIMD, multi-threading, and MPI techniques: the performance improves to 7.35, 10.88, 14.76, 18.35, and 22.49 GFLOPS on 2 to 6 nodes, respectively, and grows further on 7 nodes; see Figure 6e. The speedups over the sequential implementation on one node increase to 9.81, 14.52, 19.69, 24.48, 30.01, and 33.89, respectively, while the speedups over the multi-threaded SIMD implementation on one node are 1.67, 2.47, 3.35, 4.16, 5.10, and 5.76, respectively. Finally, a much better performance is achieved by exploiting all forms of parallelism on multiple nodes in addition to exploiting the memory hierarchy: the performance increases to 12.97, 18.99, 26.84, 34.57, 41.85, and 50.73 GFLOPS on 2 to 7 nodes, respectively; see Figure 6f. As a result, speedups of 17.30, 25.34, 35.82, 46.13, 55.84, and 67.69, respectively, are achieved over the sequential implementation on one node, and the speedups over the multi-threaded SIMD blocking implementation on one node are 1.47, 2.16, 3.05, 3.93, 4.76, and 5.77, respectively, at the large matrix size of 10,000 × 10,000.

V. CONCLUSION

On Intel multi-core processors, three common forms of parallelism (ILP, TLP, and DLP) can be exploited to increase performance. In addition, the use of MPI improves the performance by running parallel tasks on computers connected through a network. In this paper, all of these forms of parallelism have been exploited to improve the performance of the BLAS. Level-1 and Level-2 BLAS did not gain performance from the multi-threading and MPI techniques because of the thread-creation overhead and the send/receive overhead, respectively. However, they achieved good performance with the SIMD technique at large vector lengths and matrix sizes that fit in the cache. Speedups of 3.84, 3.71, 3.46, 3.29, and 3.66 over the sequential implementation were achieved for the apply Givens rotation, Norm-2, dot-product, SAXPY, and scalar-vector multiplication subroutines, respectively, representing 96%, 92.75%, 86.5%, 82.25%, and 91.5% of the ideal value of four. The SIMD technique likewise improved the performance of Level-2 BLAS (matrix-vector multiplication and rank-1 update) to 6.0 and 3.93 GFLOPS, respectively, giving speedups of 3.67 and 3.54 over the corresponding sequential implementations. The performance improved further, to 6.31 and 4.19 GFLOPS, by exploiting the memory hierarchy and reusing the loaded data in the cache memory.
On a single computer, the traditional/Strassen matrix-matrix multiplication algorithms (examples of Level-3 BLAS) achieved speedups of 3.76/3.49, 3.91/3.73, 5.40/5.19, 7.37/7.25, and 12.65/11.86 over the sequential implementation by applying the SIMD, multi-threading, SIMD blocking, multi-threading SIMD, and multi-threading SIMD blocking techniques, respectively, at the largest matrix size that fits in the L2 cache. Furthermore, the performance improved when more than one computer was used for distributed processing. On ten computers at a large matrix size of 10,000 × 10,000, the speedup of the traditional algorithm over the sequential implementation increased to 8.24, 13.34, 20.00, and 37.73 when applying the MPI, SIMD MPI, multi-threading MPI, and multi-threading SIMD MPI techniques, respectively, and rose even higher with multi-threading SIMD blocking MPI. On seven computers at a large matrix size of 10,000 × 10,000, the speedup of Strassen's algorithm improved to 5.86, 18.55, 21.29, 33.89, and 67.69, respectively.

VI. REFERENCES

[1] K. Gallivan, R. Plemmons, and A. Sameh, Parallel Algorithms for Dense Linear Algebra Computations, SIAM Review, Society for Industrial and Applied Mathematics, Vol. 32, No. 1, March 1990.
[2] J. Dongarra and V. Eijkhout, Numerical Linear Algebra Algorithms and Software, Journal of Computational and Applied Mathematics, Vol. 123, No. 1-2, November 2000.
[3] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, Vol. 5, No. 3, September 1979.
[4] F. Van Zee, R. van de Geijn, and G. Quintana, Restructuring the QR Algorithm for High-Performance Application of Givens Rotations, FLAME Working Note #60, October.
[5] J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, An Extended Set of Fortran Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, Vol. 14, No. 1, pp. 1-17, March 1988.
[6] M. Soliman, Exploiting the Capabilities of Intel Multi-Core Processors for Block QR Decomposition Algorithm, Neural, Parallel and Scientific Computations, Dynamic Publishers, Atlanta, GA, USA, Vol. 21, No. 1, March.
[7] J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, Vol. 16, No. 1, pp. 1-17, March 1990.
[8] Fujitsu, data sheet of CELSIUS R550, November.
[9] Quad-Core Intel Xeon Processor 5400 Series, August, eets/quad-core-xeon-5400-datasheet.pdf.
[10] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 5th Edition, September 2011.
[11] B. Barney, Message Passing Interface (MPI), Lawrence Livermore National Laboratory, UCRL-MI, 2014.
[12] M. Soliman, Performance Evaluation of Multi-Core Intel Xeon Processors on Basic Linear Algebra Subprograms, Parallel Processing Letters (PPL), World Scientific Publishing Company, Vol. 19, No. 1, March.
[13] G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins Series in the Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD.
[14] W. Gropp, K. Kennedy, L. Torczon, A. White, J. Dongarra, I. Foster, and G. C. Fox, Sourcebook of Parallel Computing.
[15] M. Thottethodi, S. Chatterjee, and A. Lebeck, Tuning Strassen's Matrix Multiplication for Memory Efficiency, Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, IEEE Computer Society, pp. 1-14, November 1998.


More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

F02WUF NAG Fortran Library Routine Document

F02WUF NAG Fortran Library Routine Document F02 Eigenvalues and Eigenvectors F02WUF NAG Fortran Library Routine Document Note. Before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

Sparse Direct Solvers for Extreme-Scale Computing

Sparse Direct Solvers for Extreme-Scale Computing Sparse Direct Solvers for Extreme-Scale Computing Iain Duff Joint work with Florent Lopez and Jonathan Hogg STFC Rutherford Appleton Laboratory SIAM Conference on Computational Science and Engineering

More information

Parallel Computing Ideas

Parallel Computing Ideas Parallel Computing Ideas K. 1 1 Department of Mathematics 2018 Why When to go for speed Historically: Production code Code takes a long time to run Code runs many times Code is not end in itself 2010:

More information

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory COMP 633 - Parallel Computing Lecture 6 September 6, 2018 SMM (1) Memory Hierarchies and Shared Memory 1 Topics Memory systems organization caches and the memory hierarchy influence of the memory hierarchy

More information

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst Ali Khajeh-Saeed Software Engineer CD-adapco J. Blair Perot Mechanical Engineering UMASS, Amherst Supercomputers Optimization Stream Benchmark Stag++ (3D Incompressible Flow Code) Matrix Multiply Function

More information

PARALLEL COMPUTATION OF THE SINGULAR VALUE DECOMPOSITION ON TREE ARCHITECTURES

PARALLEL COMPUTATION OF THE SINGULAR VALUE DECOMPOSITION ON TREE ARCHITECTURES PARALLEL COMPUTATION OF THE SINGULAR VALUE DECOMPOSITION ON TREE ARCHITECTURES Zhou B. B. and Brent R. P. Computer Sciences Laboratory Australian National University Canberra, ACT 000 Abstract We describe

More information

BLAS. Christoph Ortner Stef Salvini

BLAS. Christoph Ortner Stef Salvini BLAS Christoph Ortner Stef Salvini The BLASics Basic Linear Algebra Subroutines Building blocks for more complex computations Very widely used Level means number of operations Level 1: vector-vector operations

More information

COMPUTATIONAL LINEAR ALGEBRA

COMPUTATIONAL LINEAR ALGEBRA COMPUTATIONAL LINEAR ALGEBRA Matrix Vector Multiplication Matrix matrix Multiplication Slides from UCSD and USB Directed Acyclic Graph Approach Jack Dongarra A new approach using Strassen`s algorithm Jim

More information

Linear Algebra for Modern Computers. Jack Dongarra

Linear Algebra for Modern Computers. Jack Dongarra Linear Algebra for Modern Computers Jack Dongarra Tuning for Caches 1. Preserve locality. 2. Reduce cache thrashing. 3. Loop blocking when out of cache. 4. Software pipelining. 2 Indirect Addressing d

More information

Mapping Vector Codes to a Stream Processor (Imagine)

Mapping Vector Codes to a Stream Processor (Imagine) Mapping Vector Codes to a Stream Processor (Imagine) Mehdi Baradaran Tahoori and Paul Wang Lee {mtahoori,paulwlee}@stanford.edu Abstract: We examined some basic problems in mapping vector codes to stream

More information

ADVANCED TECHNIQUES FOR IMPROVING PROCESSOR PERFORMANCE IN A SUPERSCALAR ARCHITECTURE

ADVANCED TECHNIQUES FOR IMPROVING PROCESSOR PERFORMANCE IN A SUPERSCALAR ARCHITECTURE ADVANCED TECHNIQUES FOR IMPROVING PROCESSOR PERFORMANCE IN A SUPERSCALAR ARCHITECTURE Gordon B. STEVEN, University of Hertfordshire, Department of Computer Science, UK, E-mail: comqgbs@herts.ac.uk Lucian

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

OPTIMIZATION OF THE CODE OF THE NUMERICAL MAGNETOSHEATH-MAGNETOSPHERE MODEL

OPTIMIZATION OF THE CODE OF THE NUMERICAL MAGNETOSHEATH-MAGNETOSPHERE MODEL Journal of Theoretical and Applied Mechanics, Sofia, 2013, vol. 43, No. 2, pp. 77 82 OPTIMIZATION OF THE CODE OF THE NUMERICAL MAGNETOSHEATH-MAGNETOSPHERE MODEL P. Dobreva Institute of Mechanics, Bulgarian

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Dense matrix algebra and libraries (and dealing with Fortran)

Dense matrix algebra and libraries (and dealing with Fortran) Dense matrix algebra and libraries (and dealing with Fortran) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Dense matrix algebra and libraries (and dealing with Fortran)

More information

Review of previous examinations TMA4280 Introduction to Supercomputing

Review of previous examinations TMA4280 Introduction to Supercomputing Review of previous examinations TMA4280 Introduction to Supercomputing NTNU, IMF April 24. 2017 1 Examination The examination is usually comprised of: one problem related to linear algebra operations with

More information

A Few Numerical Libraries for HPC

A Few Numerical Libraries for HPC A Few Numerical Libraries for HPC CPS343 Parallel and High Performance Computing Spring 2016 CPS343 (Parallel and HPC) A Few Numerical Libraries for HPC Spring 2016 1 / 37 Outline 1 HPC == numerical linear

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Towards Realistic Performance Bounds for Implicit CFD Codes

Towards Realistic Performance Bounds for Implicit CFD Codes Towards Realistic Performance Bounds for Implicit CFD Codes W. D. Gropp, a D. K. Kaushik, b D. E. Keyes, c and B. F. Smith d a Mathematics and Computer Science Division, Argonne National Laboratory, Argonne,

More information

Multicore Hardware and Parallelism

Multicore Hardware and Parallelism Multicore Hardware and Parallelism Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3

More information

AMS 148: Chapter 1: What is Parallelization

AMS 148: Chapter 1: What is Parallelization AMS 148: Chapter 1: What is Parallelization 1 Introduction of Parallelism Traditional codes and algorithms, much of what you, the student, have experienced previously has been computation in a sequential

More information

PIPELINE AND VECTOR PROCESSING

PIPELINE AND VECTOR PROCESSING PIPELINE AND VECTOR PROCESSING PIPELINING: Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates

More information

Parallelized Progressive Network Coding with Hardware Acceleration

Parallelized Progressive Network Coding with Hardware Acceleration Parallelized Progressive Network Coding with Hardware Acceleration Hassan Shojania, Baochun Li Department of Electrical and Computer Engineering University of Toronto Network coding Information is coded

More information

Programming Dense Linear Algebra Kernels on Vectorized Architectures

Programming Dense Linear Algebra Kernels on Vectorized Architectures University of Tennessee, Knoxville Trace: Tennessee Research and Creative Exchange Masters Theses Graduate School 5-2013 Programming Dense Linear Algebra Kernels on Vectorized Architectures Jonathan Lawrence

More information

Online Course Evaluation. What we will do in the last week?

Online Course Evaluation. What we will do in the last week? Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do

More information

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

Level-3 BLAS on the TI C6678 multi-core DSP

Level-3 BLAS on the TI C6678 multi-core DSP Level-3 BLAS on the TI C6678 multi-core DSP Murtaza Ali, Eric Stotzer Texas Instruments {mali,estotzer}@ti.com Francisco D. Igual Dept. Arquitectura de Computadores y Automática Univ. Complutense de Madrid

More information

A High Performance Parallel Strassen Implementation. Brian Grayson. The University of Texas at Austin.

A High Performance Parallel Strassen Implementation. Brian Grayson. The University of Texas at Austin. A High Performance Parallel Strassen Implementation Brian Grayson Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 787 bgrayson@pineeceutexasedu Ajay Pankaj

More information

Project Proposals. 1 Project 1: On-chip Support for ILP, DLP, and TLP in an Imagine-like Stream Processor

Project Proposals. 1 Project 1: On-chip Support for ILP, DLP, and TLP in an Imagine-like Stream Processor EE482C: Advanced Computer Organization Lecture #12 Stream Processor Architecture Stanford University Tuesday, 14 May 2002 Project Proposals Lecture #12: Tuesday, 14 May 2002 Lecturer: Students of the class

More information

The LINPACK Benchmark on the Fujitsu AP 1000

The LINPACK Benchmark on the Fujitsu AP 1000 The LINPACK Benchmark on the Fujitsu AP 1000 Richard P. Brent Computer Sciences Laboratory Australian National University Canberra, Australia Abstract We describe an implementation of the LINPACK Benchmark

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

Linear Algebra libraries in Debian. DebConf 10 New York 05/08/2010 Sylvestre

Linear Algebra libraries in Debian. DebConf 10 New York 05/08/2010 Sylvestre Linear Algebra libraries in Debian Who I am? Core developer of Scilab (daily job) Debian Developer Involved in Debian mainly in Science and Java aspects sylvestre.ledru@scilab.org / sylvestre@debian.org

More information

Parallel Numerics, WT 2013/ Introduction

Parallel Numerics, WT 2013/ Introduction Parallel Numerics, WT 2013/2014 1 Introduction page 1 of 122 Scope Revise standard numerical methods considering parallel computations! Required knowledge Numerics Parallel Programming Graphs Literature

More information