Performance Analysis of a Matrix Diagonalization Algorithm with Jacobi Method on a Multicore Architecture


Victoria Sanz 1,2, Armando De Giusti 1,3,4, Marcelo Naiouf 1,4

1 III-LIDI, Facultad de Informática, Universidad Nacional de La Plata, La Plata, Buenos Aires, Argentina. 2 CONICET PhD Scholar. 3 CONICET Main Researcher. 4 Full-time Chair Professor.

Abstract - In this paper, a performance analysis of a matrix diagonalization algorithm with the Jacobi method on a multicore architecture is presented. First, a block-based sequential algorithm was implemented using the CBLAS library, which improves on the classic sequential algorithms; then, an algorithm for the parallel resolution of the problem with the shared-memory programming tool OpenMP is studied. Next, the experimental work is shown and an improvement in response time is observed as more threads/cores are used. Finally, a performance analysis (speedup and efficiency) as the dimensions of the input matrix and the number of threads/cores increase is presented.

Keywords: multicore architecture, parallel programming, matrix diagonalization, Jacobi, OpenMP.

1 Introduction

The demand for computational power by a large number of scientific applications has increased so much that the use of parallel platforms for their resolution has become essential. Even though single-core cluster architectures have become usual due to their cost/performance ratio versus large shared-memory multiprocessor systems, nowadays machines with more than one processor are quite common. Multicore architectures appear as a response to the limitations for increasing speed in single-core processors due to thermal and energy issues. A multicore processor integrates two or more cores in a single chip, which means that applications have to be adapted to exploit the thread-level parallelism provided by this architecture [1]. Similarly, several multicore machines can be connected through a network, which opens the possibility of having clusters with a large number of processors.

The Jacobi method to diagonalize symmetric matrices has applications in fields such as biometrics, artificial vision, digital signal processing, and so on [2][3][4]. As the volume of input data increases, the required computation time increases significantly. The combination of linear algebra libraries optimized for the underlying architecture, the power provided by a multicore machine, and a parallel programming tool suited to such an architecture allows execution time to be reduced.

Two of the main aspects in the performance analysis of a parallel system are the speedup factor (Sp) [5][6] and the efficiency (E), which relates the speedup to the number of processors (P) used [7]. Scalability is a third, very significant factor in parallel applications: problems usually scale, i.e., the volume of work to be done increases, and the architectures used can also scale by increasing the number of processors. The effect of scaling the workload and/or the processors on the performance of parallel algorithms, considering Sp and E, is of interest. A system is said to be scalable if it can maintain a constant efficiency with increasing work and processors [5][8].

The purpose of this work is to implement a parallel algorithm to diagonalize matrices with the Jacobi method, using the shared-memory programming tool OpenMP [9] and the CBLAS linear algebra library [10], in order to exploit the computation power of a multicore machine.
Lastly, experimental tests are presented and an analysis of the performance obtained as the size of the matrix and the number of cores scale is carried out.

This paper is organized as follows: Section 2 includes a description of the Jacobi method for real symmetric matrices; Section 3 presents various implementations of the classic sequential algorithm and two implementations of the block-based algorithm (one using the CBLAS library), and proposes the parallelization of the block-based algorithm using OpenMP; Section 4 includes a comparative analysis of the execution times of the various sequential implementations with matrices of different sizes, concluding that the block-based algorithm that uses CBLAS performs better, and it presents experimental evidence and an analysis of the performance (speedup, efficiency) obtained with the parallel algorithm as the input matrix size and the number of threads/cores used are increased.

2 Description of Jacobi Diagonalization Method

The problem of diagonalizing a symmetric square matrix S consists in finding an orthogonal matrix X that reduces S to diagonal form. The Jacobi method allows finding a matrix X such that X^T*S*X = D. The elements in the diagonal of matrix D are the eigenvalues of S, and the columns of matrix X are the eigenvectors of S.

2.1 Specific Case: Diagonalization of a 2x2 Symmetric Matrix

If S is a 2x2 symmetric matrix (Fig. 1.a), it can be diagonalized by means of the orthogonal matrix X shown in Fig. 1.b. This matrix is known as a rotation matrix.

Fig. 1. a) 2x2 symmetric matrix S. b) Orthogonal (rotation) matrix X.

Since the purpose of the method is obtaining D, the values of cosα and sinα must be such that the elements outside the diagonal of D are cancelled out. These values are calculated as shown in Fig. 2.

Fig. 2. Calculation of the values sinα and cosα.
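
For reference, one standard, numerically stable way of obtaining these values for the 2x2 case is sketched below; the function name and interface are illustrative assumptions and not necessarily the exact expressions of Fig. 2:

    #include <math.h>

    /* Sketch: rotation angle that cancels S_{p,q} in a 2x2 symmetric block.
       One common formulation; not necessarily the exact expression of Fig. 2. */
    void rotation_angle(double spp, double spq, double sqq, double *c, double *s)
    {
        if (spq == 0.0) {               /* already diagonal: identity rotation */
            *c = 1.0;
            *s = 0.0;
            return;
        }
        double theta = (sqq - spp) / (2.0 * spq);
        double t = (theta >= 0.0 ? 1.0 : -1.0)
                   / (fabs(theta) + sqrt(theta * theta + 1.0));
        *c = 1.0 / sqrt(t * t + 1.0);   /* cos(alpha) */
        *s = t * (*c);                  /* sin(alpha) */
    }

With these values, the rotation built from cosα and sinα cancels the off-diagonal element when applied as X^T*S*X.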

2.2 General Case: Diagonalization of an nxn Symmetric Matrix

The general method [11] keeps a matrix X, initialized as the identity matrix, where the eigenvectors of S will remain, and in each stage performs a series of rotations that update S and X until the maximum element in the upper triangle of S is lower than a selected threshold. In each stage, for each element S_{p,q} in the upper triangle, a rotation that cancels out that element is performed. The rotation matrix J (Fig. 3) has the values cosα, sinα, -sinα and cosα at positions J_{p,p}, J_{p,q}, J_{q,p} and J_{q,q}, respectively; it also has a value of 1 on the rest of its diagonal, and 0 in all remaining elements.

Fig. 3. Rotation matrix J for the general diagonalization case.

The steps to follow for each rotation are:

1. Calculate the elements S_{p,p} and S_{q,q} as follows:
   S_{p,p} = S_{p,p}*cos²α - 2*S_{p,q}*sinα*cosα + S_{q,q}*sin²α
   S_{q,q} = S_{p,p}*sin²α + 2*S_{p,q}*sinα*cosα + S_{q,q}*cos²α
   Notice that S_{p,p}, S_{p,q} and S_{q,q} on the right-hand side are the values in S before this step. This notation will be used from here on.
2. Cancel out S_{p,q} and S_{q,p}.
3. Replace the values in matrix S with the result of the operation J^T*S*J (omitting the positions updated in the previous steps).
4. Replace the values in matrix X with the result of the operation X*J.

Thus, matrices J_i^T and J_i are multiplied by S until the diagonal form is reached. Matrix X is the result of the product of the J_i matrices, and X^T is the result of the product of the J_i^T matrices (Fig. 4).

Fig. 4. Multiplication of S and the rotation matrices to obtain the diagonal matrix.

3 Implementations of the Algorithm for Jacobi Diagonalization Method

3.1 Classic Sequential Algorithm

The algorithm that solves the diagonalization problem does not need to store matrix J. The updates carried out on S and X during steps 3 and 4 of the method described in Section 2.2 are replaced by the following operations:

   S_{i,p} = S_{p,i} = S_{i,p}*cosα - S_{i,q}*sinα   for i=1..n, i≠p, i≠q
   S_{i,q} = S_{q,i} = S_{i,q}*cosα + S_{i,p}*sinα   for i=1..n, i≠p, i≠q
   X_{i,p} = X_{i,p}*cosα - X_{i,q}*sinα             for i=1..n
   X_{i,q} = X_{i,q}*cosα + X_{i,p}*sinα             for i=1..n

That is, during step 3, rows p and q and columns p and q in matrix S are updated, while during step 4, columns p and q in matrix X are updated.
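
A minimal sketch of these in-place updates is shown below, assuming full row-major storage of S and X and 0-based indexes (the function name is illustrative, not the paper's code):

    /* Sketch of one classic Jacobi rotation over full (row-major) n x n matrices.
       c = cos(alpha), s = sin(alpha) for the pair (p, q); indexes are 0-based. */
    void apply_rotation(double *S, double *X, int n, int p, int q, double c, double s)
    {
        for (int i = 0; i < n; i++) {
            if (i != p && i != q) {
                double sip = S[i * n + p], siq = S[i * n + q];   /* values before the step */
                S[i * n + p] = S[p * n + i] = sip * c - siq * s;
                S[i * n + q] = S[q * n + i] = siq * c + sip * s;
            }
            double xip = X[i * n + p], xiq = X[i * n + q];
            X[i * n + p] = xip * c - xiq * s;
            X[i * n + q] = xiq * c + xip * s;
        }
        /* The diagonal elements S_{p,p}, S_{q,q} and the cancelled S_{p,q}, S_{q,p}
           are updated separately (steps 1 and 2 of Section 2.2). */
    }

Temporary variables keep the values prior to the update, so that both formulas of each pair use the old values, as required by the notation introduced in Section 2.2.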

Two optimizations can be carried out in relation to the storage of matrices S and X in memory:

- Since S is a symmetric matrix, it is possible to store only n*(n+1)/2 elements, n being the row and column dimension of S, which allows avoiding unnecessary operations due to symmetry.
- Since X is always accessed through its columns, a better performance is obtained if its elements are stored in memory in column-major order. This is because every time memory is accessed to look for an element of X, a block of elements that will be used in the short term is moved to cache memory. This is known as optimization by spatial locality.

3.2 Block-Based Sequential Algorithm

The development of new architectures requires the implementation of new techniques so that algorithms adapt to the underlying platform and thus obtain a better performance. The classic sequential algorithm described in Section 3.1 can be adapted to use optimized linear algebra libraries. In order to use the matrix operations available in the Level 3 CBLAS library, a sequential algorithm was implemented for the diagonalization of matrices using the block-based Jacobi method, which will be the basis for the parallel algorithm; for this reason, matrix S is stored in full.

This algorithm considers that the dimension of matrix S is n = N*r (Fig. 5), where N is the number of blocks in each dimension and r is the dimension of each block. The algorithm is similar to the classic one. Each element S_{P,Q}, with P,Q = 0..N-1, denotes a block. Within a block, the elements are referenced with indexes (p,q), where p,q = 0..r-1 [12].

Fig. 5. Matrix S, whose dimension is n = N*r.

The method keeps a matrix X, initialized as the identity matrix and organized by blocks the same as S, where the eigenvectors of S will remain. Each stage of the algorithm performs a series of rotations that update S and X, until the maximum element in the upper triangle of S is lower than a previously selected threshold. The operations in any given stage consist in cancelling out each block S_{P,Q} in the upper block triangle of S. The steps to follow for each rotation are:

1. Calculate Jacobi with the classic method over the 2x2-block matrix formed by S_{P,P}, S_{P,Q}, S_{Q,P} and S_{Q,Q}. The eigenvectors are left in a 2x2-block matrix called auxX.
2. Calculate the transpose of auxX, transpX.
3. Apply rotations to matrix S:
   a. S_{I,P} = S_{I,P}*auxX_{0,0} + S_{I,Q}*auxX_{1,0}         for I=0..N-1, I≠P, I≠Q
   b. S_{I,Q} = S_{I,P}*auxX_{0,1} + S_{I,Q}*auxX_{1,1}         for I=0..N-1, I≠P, I≠Q
   c. S_{P,I} = transpX_{0,0}*S_{P,I} + transpX_{0,1}*S_{Q,I}   for I=0..N-1, I≠P, I≠Q
   d. S_{Q,I} = transpX_{1,0}*S_{P,I} + transpX_{1,1}*S_{Q,I}   for I=0..N-1, I≠P, I≠Q
4. Apply rotations to matrix X:
   a. X_{I,P} = X_{I,P}*auxX_{0,0} + X_{I,Q}*auxX_{1,0}         for I=0..N-1
   b. X_{I,Q} = X_{I,P}*auxX_{0,1} + X_{I,Q}*auxX_{1,1}         for I=0..N-1

Implementation with CBLAS. In the block-based algorithm, the updates to matrices S and X (steps 3.a, 3.b, 3.c, 3.d, 4.a and 4.b) involve a sequence of algebraic operations over rxr blocks, as well as copies of blocks to keep auxiliary data, operations that are not carried out in the classic algorithm. The use of existing linear algebra libraries allows performing these matrix operations (Level 3 CBLAS) with the best performance for a given architecture [10]. The function cblas_dgemm was used to reduce block multiplication and addition time, and cblas_dcopy was used to optimize block copying.
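
As an illustration, the sketch below shows how one of these block updates (steps 3.a and 3.b for a single block row I) can be expressed with cblas_dgemm and cblas_dcopy; the block layout assumed here (each rxr block stored contiguously in row-major order) and the function name are assumptions, not the paper's actual code:

    #include <cblas.h>

    /* Sketch: update the pair of blocks S_{I,P}, S_{I,Q} with the rxr blocks of auxX. */
    void update_block_pair(double *S_IP, double *S_IQ,
                           const double *aux00, const double *aux01,
                           const double *aux10, const double *aux11,
                           double *tmpP, double *tmpQ, int r)
    {
        /* Keep the old values of both blocks, since each new block uses both. */
        cblas_dcopy(r * r, S_IP, 1, tmpP, 1);
        cblas_dcopy(r * r, S_IQ, 1, tmpQ, 1);

        /* S_{I,P} = tmpP * aux00 + tmpQ * aux10 */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, r, r, r,
                    1.0, tmpP, r, aux00, r, 0.0, S_IP, r);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, r, r, r,
                    1.0, tmpQ, r, aux10, r, 1.0, S_IP, r);

        /* S_{I,Q} = tmpP * aux01 + tmpQ * aux11 */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, r, r, r,
                    1.0, tmpP, r, aux01, r, 0.0, S_IQ, r);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, r, r, r,
                    1.0, tmpQ, r, aux11, r, 1.0, S_IQ, r);
    }

The copies are needed because the new S_{I,P} and S_{I,Q} both depend on the old values of the two blocks.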

3.3 Block-Based Parallel Algorithm with CBLAS

From Section 3.2, it can be observed that a rotation carried out to cancel out a block S_{P,Q} updates the block rows and columns with index P and Q in S, and the block columns P and Q in X. Also, if we have two coordinate pairs (P,Q) and (P',Q') whose indexes are all different, then J_{P,Q}*J_{P',Q'} = J_{P',Q'}*J_{P,Q}. Based on these two premises, it is concluded that two or more positions can be cancelled out in parallel, provided that their indexes are disjoint [13][14].

The parallelization strategy consists in carrying out the rotations of each algorithm stage simultaneously. The number of parallel tasks to be run in each stage is N/2, since their coordinates must be disjoint, and their indexes are calculated following the Chess Tournament strategy [15]. If there are more tasks than threads, the tasks are distributed equally among the threads. A task whose coordinate is (P,Q) consists in calculating, independently, the eigenvectors and eigenvalues of the 2x2-block submatrix formed by S_{P,P}, S_{P,Q}, S_{Q,P} and S_{Q,Q} with the classic sequential algorithm. Then, matrices S and X are updated as follows:

- Rotations in S: since there may be several threads in this update stage, they must wait before starting (barrier synchronization); then, each thread updates its columns (possible because each thread holds coordinates whose indexes are disjoint). Then, after a second barrier synchronization, the rows of S are updated. The first synchronization prevents a clash between delayed threads that are still updating rows of S and other threads that are about to update the columns corresponding to the next tasks. The second synchronization prevents conflicts in case one thread is still modifying its columns while another thread is already updating its rows.
- Rotations in X: these rotations modify only columns and, since all tasks have different coordinates, they can be done in parallel.

After carrying out these rotations, the threads can move on to their following task, and so forth until the current stage is finished. Next, the maximum of the upper triangle of matrix S is calculated in parallel, and then one of the threads calculates the set of indexes for the parallel tasks of the next stage (following the Chess Tournament strategy). This last step must be done sequentially because of data dependencies. Finally, each thread independently verifies the termination condition using the maximum value of S obtained before and, based on the result, the algorithm ends or a new stage is started.
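
The Chess Tournament (round-robin) strategy can be sketched as follows; this is an illustrative implementation for even N, not the authors' code. Each call produces one group of N/2 pairwise-disjoint coordinates, and successive rounds cover all block pairs:

    /* Sketch of the round-robin ("Chess Tournament") pairing used to pick
       N/2 disjoint (P,Q) block coordinates for one round. Assumes N is even. */
    void chess_tournament_round(int N, int round, int *P, int *Q)
    {
        for (int i = 0; i < N / 2; i++) {
            /* Player 0 stays fixed; the remaining N-1 players rotate by 'round'. */
            int a = (i == 0) ? 0 : 1 + (i - 1 + round) % (N - 1);
            int b = 1 + (N - 2 - i + round) % (N - 1);
            if (a > b) { int t = a; a = b; b = t; }
            P[i] = a;
            Q[i] = b;
        }
    }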

Implementation with OpenMP. OpenMP [9] is an API for the C, C++ and Fortran languages that allows writing and running shared-memory parallel programs, and offers the possibility of creating a set of threads that work concurrently to exploit the advantages of multicore architectures. The implementation of the parallel Jacobi algorithm with OpenMP loads the input matrix S and initializes matrix X sequentially. Next, a set of threads is created, as many as there are cores available in the multicore architecture, and they initialize in parallel the structure with the indexes of the tasks to be run in the first stage. In each stage of the algorithm, each thread takes a consecutive set of parallel task indexes and follows the steps described in Section 3.3. When all the tasks in a stage have been executed, the threads calculate in parallel the maximum of the upper triangle of S by applying a reduction operation. Then, one of them updates the set of parallel task indexes for the next stage, following the Chess Tournament strategy.
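
A minimal OpenMP sketch of this structure is shown below. The helper functions (solve_2x2_block, rotate_cols_S, rotate_rows_S, rotate_cols_X, chess_tournament_round, max_upper_triangle_row) are hypothetical names rather than the paper's code, and a stage is modelled here, for simplicity, as N-1 successive groups of N/2 disjoint tasks:

    #include <omp.h>

    /* Hypothetical helpers, not part of the paper's code: */
    void solve_2x2_block(double *S, double *X, int P, int Q);  /* classic Jacobi on the 2x2-block submatrix */
    void rotate_cols_S(double *S, int P, int Q);
    void rotate_rows_S(double *S, int P, int Q);
    void rotate_cols_X(double *X, int P, int Q);
    void chess_tournament_round(int N, int round, int *P, int *Q);
    double max_upper_triangle_row(const double *S, int n, int i);

    /* One stage of the parallel block-based Jacobi: groups of N/2 disjoint
       tasks, barriers between the column and row updates of S, a parallel
       reduction for the maximum of the upper triangle, and a sequential
       update of the task indexes. Returns the maximum off-diagonal value. */
    double jacobi_stage(double *S, double *X, int n, int N, int *Pv, int *Qv)
    {
        double max_off = 0.0;
        #pragma omp parallel shared(S, X, Pv, Qv)
        {
            for (int round = 0; round < N - 1; round++) {
                #pragma omp for
                for (int t = 0; t < N / 2; t++)
                    solve_2x2_block(S, X, Pv[t], Qv[t]);
                #pragma omp for   /* implicit barrier: previous phase done before columns start */
                for (int t = 0; t < N / 2; t++)
                    rotate_cols_S(S, Pv[t], Qv[t]);
                #pragma omp for   /* implicit barrier: columns done before rows start */
                for (int t = 0; t < N / 2; t++)
                    rotate_rows_S(S, Pv[t], Qv[t]);
                #pragma omp for   /* column rotations in X touch disjoint columns */
                for (int t = 0; t < N / 2; t++)
                    rotate_cols_X(X, Pv[t], Qv[t]);
                #pragma omp single   /* sequential: indexes of the next group of tasks */
                chess_tournament_round(N, round + 1, Pv, Qv);
            }
            #pragma omp for reduction(max : max_off)   /* parallel maximum of the upper triangle */
            for (int i = 0; i < n; i++) {
                double m = max_upper_triangle_row(S, n, i);
                if (m > max_off) max_off = m;
            }
        }
        return max_off;   /* compared against the threshold to decide whether to start a new stage */
    }

The implicit barriers at the end of the omp for constructs play the role of the two synchronizations described in Section 3.3, and the single construct isolates the sequential update of the task indexes; the reduction(max:...) clause requires OpenMP 3.1 or later.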

4 Results

Tests were carried out on a machine with two Intel quad-core Xeon E5405 processors at 2.0 GHz. Each core has a 32-KB L1 data cache and a 32-KB L1 instruction cache, and each pair of cores shares a 6-MB L2 cache. The available memory is 10 GB of RAM. All times were measured with the function omp_get_wtime from the omp.h library. The tests compared in Sections 4.1, 4.2 and 4.3 were carried out over matrices with identical data and identical precision (0.0001), since convergence times depend on the input data and the precision selected. The same holds for Section 4.4, which assesses the scalability of the parallel algorithm.

4.1 Classic Sequential Algorithm

Four versions with different methods for storing matrix S and matrix X were implemented:

- Version 1: stores the entire S by rows and X by rows.
- Version 2: stores the entire S by rows and X by columns (optimization by spatial locality in X).
- Version 3: stores only the upper triangle of S, and X by rows (storage optimization for S and a lower number of operations due to symmetry).
- Version 4: stores only the upper triangle of S, and X by columns.

Table 1 shows the execution times for each version of the classic algorithm, for matrices with n = 100, 200, 300, 400, 500, 600, 700, 800, 900 and 1000. Versions 2 and 3 improve the times of version 1, the optimization implemented in version 3 being the one with the greatest impact. For this reason, when both optimizations are incorporated in version 4, execution times are reduced even more.

Table 1. Sequential algorithm times with classic Jacobi for various optimizations (columns: Dimension / Version 1 / Version 2 / Version 3 / Version 4).

4.2 Block-Based Sequential Algorithm (without CBLAS)

For the tests with the block-based sequential algorithms, the same matrices from Section 4.1 were used, but their data were organized in consecutive blocks, whose possible sizes depend on the size of the matrix.

The block-based algorithm that does not use CBLAS did not improve the times obtained by version 4 of the classic algorithm. As the tests included in Table 2 show, the best response time is obtained with the maximum possible block size. In those cases, the block-based algorithm behaves very much like the classic algorithm, since it carries out only one stage: it obtains the eigenvalues and eigenvectors of the 2x2-block matrix formed by S_{0,0}, S_{0,1}, S_{1,0} and S_{1,1} by applying the classic algorithm, thereby obtaining the eigenvalues and eigenvectors of the entire matrix. The reason the block-based algorithm without CBLAS did not improve response times is the extra time spent multiplying and copying blocks; as the block size decreases, a larger number of these operations is required.

Table 2. Times obtained with the block-based sequential algorithm without CBLAS (columns: Dimension / Block Size).

4.3 Block-Based Sequential Algorithm with CBLAS

The block-based sequential algorithm that uses the CBLAS library improved the response times of the sequential algorithms assessed in the previous sections. In most cases, the best time was obtained with a block size of 10, as shown in Table 3.

To analyze the cause of this improvement, the number of stages carried out by the two versions of the block-based algorithm and the average time needed to perform a rotation were measured. These tests were done for the matrix of size 1000 with block sizes 10 and 500. Both algorithms performed a single stage for block size 500, and 10 stages when using block size 10. This shows that, when the block size is reduced, the algorithm needs more iterations to converge. However, the average time needed for each rotation is different. Taking into account that in each stage N*(N+1)/2 rotations are done, it is observed that:

- In the case of the block-based algorithm without CBLAS with block size 10, the average rotation time multiplied by the 5050 rotations of each stage gives approximately 56.4 seconds per stage, which explains the time measured during the test. With block size 500, on the other hand, there is only one stage with a single rotation, so the average rotation time is similar to the final time obtained, and also to the time obtained with version 1 of the classic algorithm.
- In the case of the block-based algorithm with CBLAS and block size 10, the average rotation time is much lower; multiplied by the 5050 rotations of each stage, it again explains the time measured during the test. When using block size 500, there is again a single stage with one rotation.

The improvement achieved with the block-based algorithm that uses CBLAS is due to the use of the functions cblas_dgemm and cblas_dcopy when calculating the rotations. The block size that optimizes times must be such that calculating the classic Jacobi over a 2x2-block matrix is not too expensive; otherwise, times similar to those of version 1 of the classic algorithm are obtained. The right balance must be found, since very small blocks reduce the average rotation time but increase the number of rotations. For example, for the block-based algorithm with CBLAS, the test with the matrix of size 1000 and block size 5 carried out 10 stages with N*(N+1)/2 = 20100 rotations in each stage; each stage lasted 24.3 seconds on average, which ended up increasing the total time.

Table 3. Times obtained with the block-based sequential algorithm with CBLAS (columns: Dimension / Block Size).

4.4 Block-Based Parallel Algorithm with CBLAS

Since the scalability analysis was carried out with 2, 4 and 8 threads/cores, matrices whose size is a power of 2 were used, so that the tasks were distributed proportionally among the threads. Table 4 shows the execution times of the block-based sequential algorithm that uses CBLAS for those matrices, varying the block size within the possible values; the test that optimized the execution time for each matrix size is highlighted.

Table 4. Times obtained with the block-based sequential algorithm that uses CBLAS for matrices whose dimension n is a power of 2 (columns: Matrix Dimension / Block Size).

Table 5 shows the speedup (Sp = Sequential Time / Parallel Time) and the efficiency (E = Sp / Number of processors) obtained in the parallel tests. These results show that, if the matrix size is kept the same and the number of processors is increased, the speedup obtained is better, that is, the problem is solved in less time. This improvement, however, does not keep a constant efficiency. The decrease in efficiency is due to the overhead of thread creation, synchronization (barriers), and the sequential portion of the problem (the structure with the indexes of the parallel tasks must be updated in each stage, following the Chess Tournament strategy, in a sequential manner). When scaling the problem and keeping the same number of processors, as shown in Table 5 and Fig. 6, both speedup and efficiency improve in general, since the overhead mentioned above becomes less significant within the total processing time.

Table 5. Speedup and efficiency values as matrix size and the number of threads/cores are increased (rows: Dimension (r) / Parallel Tasks).
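
As a small illustration of how these figures are derived from the measured times (using omp_get_wtime, as in the tests above), assuming hypothetical entry points run_sequential() and run_parallel():

    #include <stdio.h>
    #include <omp.h>

    /* Hypothetical entry points for the CBLAS block-based solvers: */
    void run_sequential(void);
    void run_parallel(int threads);

    int main(void)
    {
        int threads = 8;                     /* e.g. 2, 4 or 8 cores */
        double t0 = omp_get_wtime();
        run_sequential();
        double t_seq = omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        run_parallel(threads);
        double t_par = omp_get_wtime() - t0;

        double sp = t_seq / t_par;           /* Sp = Sequential Time / Parallel Time */
        double e  = sp / threads;            /* E  = Sp / Number of processors */
        printf("Sp = %.2f  E = %.2f\n", sp, e);
        return 0;
    }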

Fig. 6. a) Speedup as the matrix size and the number of threads/cores are increased. b) Efficiency as the matrix size and the number of threads/cores are increased.

The behavior described is typical of a scalable parallel system, where efficiency can be maintained at a constant value by simultaneously increasing the number of cores/processors and the size of the problem.

5 Conclusions and Future Work

Various sequential algorithms for the resolution of the matrix diagonalization problem were presented, and a parallel implementation that exploits the power offered by multicore architectures was introduced. The resulting execution times were analyzed for each implementation, and it was observed that the best performance corresponds to the algorithms that use libraries optimized for linear algebra (Level 3 CBLAS). It was also observed that performance improves when the algorithm is parallelized on a multicore architecture.

In recent years, GPUs (Graphics Processing Units) [16] have gained significance due to the high performance achieved in general-purpose applications. One future line of work is the migration of the Jacobi diagonalization algorithm to run on a GPU, followed by a systematic study of the performance achieved as the size of the problem and the number of threads are increased. An energy consumption analysis for the execution of this parallel algorithm on various multicore architectures is also proposed [17].

6 References

[1] Olukotun K., Hammond L., Laudon J. Chip Multiprocessor Architecture. Synthesis Lectures on Computer Science, Morgan & Claypool.
[2] Turk M., Pentland A. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, Vol. 3, No. 1.
[3] Bravo Muñoz I. Arquitectura basada en FPGAs para la detección de objetos en movimiento, utilizando visión computacional y técnicas PCA. Doctoral Dissertation, Universidad de Alcalá.
[4] Brent R. Parallel Algorithms for Digital Signal Processing. In: Numerical Linear Algebra, Digital Signal Processing and Parallel Algorithms. Springer, Heidelberg.
[5] Grama A., Gupta A., Karypis G., Kumar V. Introduction to Parallel Computing. Second Edition. Addison Wesley.
[6] Leopold C. Parallel and Distributed Computing: A Survey of Models, Paradigms, and Approaches. Wiley.
[7] Quinn M. J. Parallel Computing: Theory and Practice. McGraw-Hill.
[8] Hwang K. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill.
[9] The OpenMP API specification.
[10] BLAS (Basic Linear Algebra Subprograms).
[11] Rutishauser H. The Jacobi Method for Real Symmetric Matrices. Numer. Math. 9, 1-10.
[12] Hansen E. On Jacobi Methods and Block-Jacobi Methods for Computing Matrix Eigenvalues. Doctoral Dissertation, Stanford; 1960.
[13] Sameh A. On Jacobi and Jacobi-like Algorithms for a Parallel Computer. Mathematics of Computation, Vol. 25, No. 115.
[14] Shroff G. A Parallel Algorithm for the Eigenvalues and Eigenvectors of a General Complex Matrix. Research Institute for Advanced Computer Science, NASA Ames Research Center, Technical Report; 1989.
[15] Bischof C., Van Loan C. Computing the Singular Value Decomposition on a Ring of Array Processors. Proceedings of the IBM Europe Institute Workshop on Large Scale Eigenvalue Problems, Vol. 127, pp. 51-66.
[16] General-Purpose Computation on Graphics Hardware.
[17] Wu-chun Feng, Xizhou Feng, Rong Ge. Green Supercomputing Comes of Age. IT Professional, Vol. 10, No. 1, pp. 17-23; 2008.
