Parallel Reduction from Block Hessenberg to Hessenberg using MPI


Parallel Reduction from Block Hessenberg to Hessenberg using MPI

Viktor Jonsson

May 24, 2013

Master's Thesis in Computing Science, 30 credits
Supervisor at CS-UmU: Lars Karlsson
Examiner: Fredrik Georgsson

Umeå University, Department of Computing Science, SE UMEÅ, SWEDEN


Abstract

In many scientific applications, eigenvalues of a matrix have to be computed. By first reducing a matrix from fully dense to Hessenberg form, eigenvalue computations with the QR algorithm become more efficient. Previous work on shared memory architectures has shown that the Hessenberg reduction is in some cases most efficient when performed in two stages: First reduce the matrix to block Hessenberg form, then reduce the block Hessenberg matrix to true Hessenberg form. This Thesis concerns the adaptation of an existing parallel reduction algorithm implementing the second stage to a distributed memory architecture. Two algorithms were designed using the Message Passing Interface (MPI) for communication between processes. The algorithms have been evaluated by an analysis of traces and run-times for different problem sizes and process counts. Results show that the two adaptations are not efficient compared to a shared memory algorithm, but possibilities for further improvement have been identified. We found that an uneven distribution of work, a large sequential part, and significant communication overhead are the main bottlenecks in the distributed algorithms. Suggested further improvements are dynamic load balancing, sequential computation overlap, and hidden communication.


Contents

1 Introduction
1.1 Memory Hierarchies
1.2 Parallel Systems
1.3 Linear Algebra
1.4 Hessenberg Reduction Algorithms
1.5 Problem statement

2 Methods
2.1 Question 1
2.2 Questions 2 and 3
2.3 Question 4

3 Literature Review
3.1 Householder reflections
3.2 Givens Rotations
3.3 Hessenberg reduction algorithm
3.4 WY Representation
3.5 Blocking and Data Distribution
3.6 Block Reduction
3.7 Parallel blocked algorithm
3.7.1 QR Phase
3.7.2 Givens Phase
3.8 Successive Band Reduction
3.9 Two-staged algorithm
3.9.1 Stage 1
3.9.2 Stage 2
3.10 MPI Functions
3.11 Storage

4 Results
4.1 Root-Scatter algorithm
4.2 DistBlocks algorithm
4.3 Performance

5 Conclusions
5.1 Future work

Acknowledgements

References

Chapter 1

Introduction

This chapter is an introduction to topics in matrix computations and parallel systems that are related to this Thesis.

1.1 Memory Hierarchies

The hierarchy of memory inside a computer has changed the way efficient algorithms are designed. Central processing unit (CPU) caches exploit temporal and spatial locality. Data that have been used recently (temporal locality) and data that are close, in memory, to recently used data (spatial locality) have shorter access times. Figure 1.1 illustrates a common memory architecture for modern computers. Fast and small memory is located close to the CPU. A memory reference m is accessed from the slow and large RAM and loaded into the fast cache memory, along with data located in the same cache line (typically B) as m. If a subsequent memory access is to m or to data near m, it will be satisfied from the fast cache memory unless the corresponding cache line has been evicted. To avoid costly memory communication, many efficient algorithms are designed for data reuse. By often reusing data that has been brought into the cache, an algorithm can amortize the high cost of the initial main memory communication.

Figure 1.1: Memory hierarchy of many single-core modern computers. The CPU works on small and fast registers. These registers are loaded with data from RAM. Data is cached in the fast L1, L2, and L3 caches. If the data stays in cache, the access time will be shorter the next time the data is accessed.

Basic Linear Algebra Subprograms (BLAS) is a standard interface for linear algebra operations. BLAS Level 1 and 2 operations (see Figures 1.2(a) and 1.2(b) for examples) feature little data reuse and are therefore bounded by the memory bandwidth. Data reuse and

locality are very important in order to achieve efficiency on modern computer architectures. BLAS Level 3 operations involve many more arithmetic operations than data accesses. For example, a matrix-matrix multiplication (see Figure 1.2(c)) involves O(n^3) arithmetic operations and O(n^2) data, for matrices of order n. The amount of data reuse is high for Level 3 operations. Many modern linear algebra algorithms try to minimize the amount of Level 1 (see Figure 1.2(a)) and Level 2 operations and maximize the amount of Level 3 operations [3].

Figure 1.2: Types of BLAS operations, exemplified with matrix/vector multiplication. (a) BLAS Level 1, vector-vector. (b) BLAS Level 2, vector-matrix/matrix-vector. (c) BLAS Level 3, matrix-matrix.

1.2 Parallel Systems

There are two main types of system architectures used for parallel computations. The first one is the shared memory architecture (SM). SM is a computer architecture where all processing nodes share the same memory space. Figure 1.3 illustrates a shared memory architecture with four processing nodes. All processing nodes can perform computations on the same memory, which requires each node to have access to the memory.

Figure 1.3: Shared memory architecture. Processing nodes P0 to P3 share the same memory space.

The second type of architecture is the distributed memory architecture (DM). DM is an architecture where the processing nodes do not share the same memory space. In a DM, the nodes work on local memory and communicate with other nodes through an interconnection network. Figure 1.4 illustrates a distributed memory architecture with four processing nodes, each with its own local memory. Designing an algorithm for distributed memory has the advantage that the algorithm could scale to larger problems than an analogous SM algorithm. The reason for this is that SM is bound to the performance and capacity of its shared memory. A disadvantage with designing an algorithm for a DM is communication, which needs to be done explicitly through message passing. Communication in a DM is often costly, and algorithms designed for this type of architecture have to avoid excessive communication. DM also has the property of explicit data distribution.

Figure 1.4: Distributed memory architecture. Processing nodes P0 to P3 have separate memory spaces and interact through the interconnection network.

Explicit data distribution can be difficult to utilize in some problems, but when the data is distributed, the programmer does not have to deal with race conditions. Memory is scalable in a DM system: when more processes are added, the total memory size increases. Local memory can be accessed efficiently by each process of the DM system, without interference from other processes. Another advantage with systems based on DM is economy: they can be built using low-price computers connected through a cheap network. MPI is a large set of operations that are based on message passing. Message passing is the de facto standard way of working with a DM; memory is not shared between processes, and interaction is done by explicitly passing messages.

1.3 Linear Algebra

The subjects of linear algebra and matrix computations contain many terms. The ones related to this Thesis are described in this section. A square matrix is a matrix A ∈ R^{m×n} with the same number of rows as columns, n = m. The main diagonal of A is all elements a_{ij} where i = j ([a_{11}, a_{22}, ..., a_{nn}]), and a symmetric matrix is a matrix A where a_{ij} = a_{ji} for every element a_{ij}. One type of square matrix is the triangular matrix, which has all elements below (upper triangular) or above (lower triangular) the main diagonal equal to zero. An (upper/lower) Hessenberg matrix is an (upper/lower) triangular matrix with an extra subdiagonal (upper Hessenberg) or superdiagonal (lower Hessenberg). An eigenvalue λ is defined by Ax = λx, where x is a non-zero eigenvector. The two variants of a Hessenberg matrix are illustrated in Figure 1.5. By reducing full matrices to Hessenberg matrices, some matrix computations require less computational effort. The most important scientific application of Hessenberg reductions is the QR algorithm for finding eigenvalues of a non-symmetric matrix.

Figure 1.5: Examples of Hessenberg matrices. (a) Upper Hessenberg. (b) Lower Hessenberg.
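As a minimal, hypothetical illustration of this structure (zero-based indices and row-major storage are assumptions made here, not conventions from the thesis), an n × n matrix is upper Hessenberg exactly when every entry below the first subdiagonal is zero:

#include <stdbool.h>
#include <stddef.h>

/* Returns true if the n x n row-major matrix A is upper Hessenberg,
   i.e. A(i, j) == 0 whenever i > j + 1 (zero-based indices). */
bool is_upper_hessenberg(const double *A, int n)
{
    for (int i = 2; i < n; ++i)
        for (int j = 0; j < i - 1; ++j)
            if (A[(size_t)i * n + j] != 0.0)
                return false;
    return true;
}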

Hessenberg reduction is the process of transforming a full matrix to Hessenberg form by means of an (orthogonal) similarity transformation H = Q^T A Q, where Q is an orthogonal matrix. Orthogonal matrices are invertible, with their inverse being the transpose (Q^{-1} = Q^T for an orthogonal matrix Q). A similarity transformation A → P^{-1}AP is a transformation where a square matrix A is multiplied from the left with P^{-1} and from the right with P, for an invertible matrix P. A similarity transformation preserves the eigenvalues of A [8, (p.312)]. Let B = P^{-1}AP be a similarity transformation of A. Then

B = P^{-1}AP  ⇔  PB = AP  ⇔  PBP^{-1} = A,

which can be substituted into the definition of eigenvalues:

Ax = λx
PBP^{-1}x = λx          (substitute A)
BP^{-1}x = P^{-1}λx      (multiply with P^{-1} from the left)
B(P^{-1}x) = λ(P^{-1}x)  (factor out the vector P^{-1}x).

This shows that if λ is an eigenvalue of A corresponding to an eigenvector x, then λ is also an eigenvalue of B corresponding to the eigenvector P^{-1}x.

1.4 Hessenberg Reduction Algorithms

There are several known algorithms for reducing a full matrix to Hessenberg form. The following section is a brief introduction required to state the questions in Section 1.5. The details of the following algorithm will be presented in Chapter 3. The unblocked Hessenberg reduction algorithm for a matrix A is described in Algorithm 1. The underlying idea is to reduce the matrix one column at a time from left to right using Householder reflections. For each reduced column, a trailing submatrix is fully updated before the algorithm proceeds with the next column. Figure 1.6 illustrates which elements of an 8 × 8 matrix are modified by the second iteration of Algorithm 1. The figure shows how the algorithm reduces column c = 2 (Figure 1.6(a)) and applies the reflection P_c from the left (Figure 1.6(b)) and from the right (Figure 1.6(c)). The unblocked Hessenberg reduction algorithm requires 10n^3/3 floating point operations (flops) [8, p. 345] for a matrix of size n × n. Flops are used as a measure of work, and flops/s is a measure of performance for computers and programs, where Gflops/s (10^9 flops/s) is most common today.

Algorithm 1 Unblocked Hessenberg reduction for an input matrix A of order n. When terminated, the algorithm has overwritten A with the Hessenberg form of A. The transformation matrix Q = P_1 P_2 ··· P_{n−2} is updated in each iteration by a multiplication with P_c from the right.
Q = I ∈ R^{n×n}
for c = 1, 2, ..., n − 2 do
  Generate a Householder reflection P_c that reduces column c of A
  Q(:, c + 1 : n) = Q(:, c + 1 : n) P_c
  A(c + 1 : n, c : n) = P_c^T A(c + 1 : n, c : n)   // Left update
  A(1 : n, c + 1 : n) = A(1 : n, c + 1 : n) P_c     // Right update
end for

Figure 1.6: Unblocked Hessenberg reduction. Gray squares are the elements used for the reduction. This example shows the second iteration of an 8 × 8 matrix reduction. (a) Step 1: Generate the Householder reflection. (b) Step 2: Apply the reflection from the left. (c) Step 3: Apply the reflection from the right.

The problem with the basic Algorithm 1 is that, in each iteration, it executes only a few operations on a large amount of data, in the form of BLAS Level 2 operations. A low number of operations on a large set of data leads to low data reuse, which is not suitable for the memory hierarchy of modern computers.
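To make the structure of Algorithm 1 concrete, a plain C sketch is given below (zero-based indexing, row-major storage, no accumulation of Q, and no blocking are assumptions made for brevity; this is an illustrative sketch, not the thesis implementation):

#include <math.h>
#include <stdlib.h>

/* Overwrites the n x n row-major matrix A with its upper Hessenberg form. */
void hessenberg_unblocked(double *A, int n)
{
    double *u = malloc((size_t)n * sizeof(double));
    for (int c = 0; c < n - 2; ++c) {
        /* Generate a Householder vector u that zeros A(c+2:n, c). */
        double norm = 0.0;
        for (int i = c + 1; i < n; ++i)
            norm += A[(size_t)i*n + c] * A[(size_t)i*n + c];
        norm = sqrt(norm);
        if (norm == 0.0) continue;
        double alpha = (A[(size_t)(c+1)*n + c] >= 0.0) ? -norm : norm;
        double unorm = 0.0;
        for (int i = c + 1; i < n; ++i) {
            u[i] = A[(size_t)i*n + c];
            if (i == c + 1) u[i] -= alpha;
            unorm += u[i] * u[i];
        }
        unorm = sqrt(unorm);
        for (int i = c + 1; i < n; ++i) u[i] /= unorm;   /* now ||u||_2 = 1 */

        /* Left update: A(c+1:n, c:n) = (I - 2uu^T) A(c+1:n, c:n). */
        for (int j = c; j < n; ++j) {
            double s = 0.0;
            for (int i = c + 1; i < n; ++i) s += u[i] * A[(size_t)i*n + j];
            for (int i = c + 1; i < n; ++i) A[(size_t)i*n + j] -= 2.0 * u[i] * s;
        }
        /* Right update: A(1:n, c+1:n) = A(1:n, c+1:n) (I - 2uu^T). */
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int j = c + 1; j < n; ++j) s += A[(size_t)i*n + j] * u[j];
            for (int j = c + 1; j < n; ++j) A[(size_t)i*n + j] -= 2.0 * s * u[j];
        }
    }
    free(u);
}

Every column pass sweeps over the whole trailing part of the matrix with matrix-vector style work, which is exactly the Level 2 behavior criticized above.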

Reducing a full matrix to block Hessenberg form has proven to be more efficient than the direct reduction of Algorithm 1. Block (upper/lower) Hessenberg form is a matrix with more than one sub/superdiagonal. A block is a contiguous submatrix, and in block (upper/lower) Hessenberg form, each sub/superdiagonal block is in upper/lower triangular form. The block Hessenberg form is illustrated in the middle of Figure 1.8. An improved adaptation of a known two-staged Hessenberg reduction algorithm has been proposed by researchers at Umeå University [11]. The outline of the two-staged algorithm is: 1. Reduce a full matrix A to block Hessenberg form. 2. Reduce the block Hessenberg matrix to true Hessenberg form. The first stage of the two-staged algorithm works on blocks, where each block is reduced to upper trapezoidal form. A trapezoidal matrix (see Figure 1.7 for an illustration) is a rectangular matrix with zeros below or above a diagonal (there can be several diagonals in a rectangular matrix). The second stage uses band reduction methods for reducing the matrix to true Hessenberg form. Band reduction is the process of reducing the number of sub- or superdiagonals. In Figure 1.8 the full 6 × 6 matrix is first reduced to block upper Hessenberg form. The block upper Hessenberg matrix is then further reduced to true Hessenberg form by reducing the number of subdiagonals.

Figure 1.7: Example of a trapezoidal matrix. The gray squares represent zero elements. Zero elements do not have to be explicitly stored in memory.

Figure 1.8: Outline of the two-staged Hessenberg reduction algorithm.

The two-staged method requires 10n^3/3 flops for the first stage and 2n^3 flops for the second stage [11], assuming the number of subdiagonals in the block Hessenberg form is small relative to n and constant. This is 2n^3 more flops than the unblocked algorithm, which is significantly more work (+60%). In return, the two-staged algorithm has better data reuse than the unblocked algorithm, and as a consequence the two-staged algorithm has the potential of being faster on a modern computer. In some environments, the two-staged algorithm has proven to be faster. This has been shown in [11], where the two algorithms were compared on a shared memory architecture.

1.5 Problem statement

The main goal of this project is to design and implement an algorithm with better scalability than previous designs. Whether or not the main goal is reached, the project could generate knowledge about limitations and difficulties with this type of problem. There are several questions that should be answered in this project. The questions are:

1. Is it possible to adapt or redesign the second stage of the two-stage Hessenberg reduction algorithm for a DM while preserving efficiency?
2. How does the DM implementation compare with the existing SM implementation?
3. How scalable is the DM implementation?
4. What are the main factors that ultimately limit the performance and scalability of the DM implementation?

The work presented in this report has been done in collaboration with the Parallel and High Performance Computing research group at Umeå University. The rest of the report is structured as follows. Chapter 2 describes the planning of the work and how it was accomplished. Chapter 3 contains a literature review that covers previous research related to this subject. The results of the work are then presented in Chapter 4, followed by conclusions in Chapter 5.

Chapter 2

Methods

In order to answer the questions raised in Section 1.5, one or several algorithms have to be designed and evaluated. Before design and implementation, previous designs have been studied in a literature review. The literature review spans several weeks, so that enough knowledge can be collected in order to design an algorithm. Notes from the literature review are used for the literature review chapter. When sufficient knowledge has been collected, algorithms can be designed and implemented. The implementations are developed systematically through careful design decisions and tests, in order to see which solution candidates are useful and which are not. By making intermediate performance tests and comparing the numerical errors with a stable algorithm, bad solutions can be discarded. Through a series of tests on a large parallel system, the resulting solution candidates are evaluated. The tests generate data that is visualized and analyzed. The results from all these steps are documented in this report.

2.1 Question 1

Algorithms have to be designed and implemented for a DM in order to evaluate a solution. Question 1 will be answered by evaluating the performance results. Performance is measured by running tests on a system and comparing the achieved Gflop/s with the theoretical peak performance of the system. A successful implementation can reach a large fraction of the theoretical peak performance. The results from the SM adaptation [11] will be used as a reference.

2.2 Questions 2 and 3

Questions 2 and 3 will be answered by a series of carefully designed experiments. The experiments will contain tests that are similar to those made for the SM adaptation [12] in order to compare the implementations. Scalability will be evaluated with the speedup measure for parallel algorithms.

2.3 Question 4

Question 4 will be answered by an introspective analysis of the implementation in order to identify bottlenecks. The time spent in the different operations of the iterations will be analyzed in order to explain the behavior of the implementations.


Chapter 3

Literature Review

In order to solve the problems stated in Chapter 1, several subjects need to be studied. The chosen subjects for the literature review are listed in Table 3.1, with a corresponding motivation of why each subject is relevant.

Table 3.1: Subjects for the literature review.
- Householder reflections: the cornerstone of many reduction algorithms. It is required to know the Householder reflection's properties in order to use them in an algorithm.
- Hessenberg reduction algorithm: a basic reduction to Hessenberg form. Many Hessenberg reduction algorithms are based on this one.
- WY representation: efficient application of aggregated Householder reflections.
- Block reduction: an efficient way of reducing to block Hessenberg form.
- Parallel blocked algorithm: a DM algorithm for reduction to block Hessenberg form. Important because it deals with distributed memory and the problems that this includes.
- Successive band reduction: similar to the second stage of the two-staged algorithm.
- Two-staged algorithm: the main subject of this Thesis.

3.1 Householder reflections

In matrix computations, the factorizations of a matrix A listed in Table 3.2 all include orthogonal transformations applied to a non-symmetric (QR, Hessenberg, bidiagonal) or symmetric (tridiagonal) matrix. The transformations zero the elements after position i in a vector x. QR, Hessenberg, tridiagonal, and bidiagonal are important factorizations that are used in eigenvalue and singular value problems. A method [10] of reducing the columns of a full matrix one by one to Hessenberg form is to apply Householder reflections P = I − 2uu^T,

where ‖u‖_2 = 1.

Table 3.2: Different types of factorizations (QR, Hessenberg, tridiagonal, and bidiagonal), with example illustrations on a 6 × 6 matrix A. Squares represent non-zero values.

Given a vector x, a Householder reflection P, defined by a vector u, can be generated such that all but one of the elements in the vector P^T x are zero. Figure 3.1 illustrates the procedure of applying a Householder reflection. The reflection transforms a vector x ∈ R^m by reflecting it into a subspace R_i ⊂ R^m. R_i is spanned by the basis vector e_i = [0, 0, ..., 0, 1, 0, ..., 0, 0]^T, where i is the position of the value one (1). R_i is chosen to be the subspace that only spans the first dimension of the vector x. When x is reflected into R_i, the component x_i is the only element of x that survives the reflection; every other element in x is annihilated. The unit vector u is chosen to be orthogonal to the reflection plane M.

Example. If the last two elements of x = [1, 4, 3]^T (e_1 = [1, 0, 0]^T) are to be annihilated, then u is chosen as u = y/‖y‖_2, where y = x − αe_1 and α = ±‖x‖_2. This u constructs a reflection matrix P such that P^T x = (I − 2uu^T)x = αe_1, that is, all elements of x except the first are annihilated.
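Working this example through numerically, with the sign choice α = −‖x‖_2 (the choice that avoids cancellation; α = +‖x‖_2 is equally valid):

‖x‖_2 = √(1^2 + 4^2 + 3^2) = √26,  α = −√26,
y = x − αe_1 = [1 + √26, 4, 3]^T,  u = y/‖y‖_2,
P^T x = (I − 2uu^T)x = αe_1 = [−√26, 0, 0]^T ≈ [−5.10, 0, 0]^T,

so the last two elements are indeed annihilated while the norm of x is preserved.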

Two reasons why Householder reflections are used for the factorizations in Table 3.2 are: 1. Householder reflections are numerically stable, because the reflections are orthogonal transformations. The transformations do not change the norm of the vector x, so a numerical error in x does not grow when the reflections are applied. 2. A Householder reflection can introduce multiple zeros at once, as opposed to Givens rotations, where each element of x below position i is zeroed one at a time.

Figure 3.1: By applying a reflection constructed from the vector u, which reflects the vector x into the subspace R_i (e is a basis vector of R_i), all components of x below x_i are set to zero.

3.2 Givens Rotations

Givens rotations are another type of orthogonal transformation used in reductions. A 2 × 2 rotation has the form

G = [ cos(θ)   sin(θ) ]
    [ −sin(θ)  cos(θ) ].

In Givens rotations, the rotation angle θ is chosen such that an element in a vector can be annihilated. The rotations can only annihilate one element at a time and require more flops than Householder reflections, but Givens rotations are better for selectively annihilating elements. The updates required by a rotation are applied to only a few rows or columns. They are therefore preferable for selective annihilation in some applications that require low data dependency [8, (p.216)].

Example. If x = [√3, 1]^T, then the angle θ = 30° is chosen such that

G = [ cos(30°)   sin(30°) ] = [ √3/2   1/2  ]
    [ −sin(30°)  cos(30°) ]   [ −1/2   √3/2 ],

which transforms the vector x into Gx = [2, 0]^T.

3.3 Hessenberg reduction algorithm

Table 3.2 shows the Hessenberg reduction, which is a series of similarity transformations applied to a matrix in order to create a Hessenberg matrix. Since the reduction is a similarity transformation, the eigenvalues cannot change during the process. Algorithm 1 performs this transformation by applying one Householder reflection at a time. The transformations reduce one column of the initially full matrix at a time, from left to right. Figure 3.2 shows an example on a full 4 × 4 matrix A. After two similarity transformations (four matrix multiplications), the matrix is reduced to upper Hessenberg form. Similarity transformations (described in Section 1.3) do not preserve the eigenvectors. While in many applications only the eigenvalues are required, eigenvectors are necessary in some. To recover the eigenvectors, the transformations Q = P_1 P_2 ··· P_i ··· P_{n−2} applied to A have to be stored. The transformations can be stored in the lower triangle of A, as described in [13]. After the transformation, most of the lower triangular part of A is zero

(the subdiagonal is not zero). The zero values can be considered implicit after the reduction, and instead of the implicit zero values, a normalized version v_i of u_i is stored below the subdiagonal (see Figure 3.3). The vector v_i is a normalization of u_i such that v_i(1) = 1. Because v_i(1) = 1 for all i, the first element of v_i can implicitly be considered as 1 and does not have to be stored. The normalization changes the norm of u_i and therefore requires extra storage for a scaling factor τ_i (one real value per i). From v_i(2 : n) and τ_i, P_i can be reconstructed as

P_i = I − τ_i [1; v_i(2 : n)] [1, v_i(2 : n)^T].

By storing the reflection vectors in place, no extra memory except for τ_1, τ_2, ..., τ_{n−2} is required to preserve the eigenvectors.

Figure 3.2: Example of the basic Hessenberg reduction algorithm. Gray squares represent the elements of A that are not used in the multiplication. (a) Generate the reflection vector u_1 that reduces A(1 : n, 1), and apply the reflection from the left. (b) Apply the reflection from the right. (c) Generate the reflection vector u_2 that reduces A(2 : n, 2), and apply the reflection from the left. (d) Apply the reflection from the right.

3.4 WY Representation

The WY representation of a product of Householder reflections enables the efficient application of several reflections at once. Let Q_k = P_1 P_2 ··· P_k, where Q_k ∈ R^{m×m}, be a product of k reflections. Then Q_k can be rewritten as

Q_k = I + W_k Y_k^T,  W_k ∈ R^{m×k},  Y_k ∈ R^{m×k}.

Figure 3.3: Storage of transformations in a 6 × 6 Hessenberg matrix.

The matrices W_k and Y_k are defined by the following recurrence. For k = 1, the transformation matrix is Q_1 = P_1 = I + w_1 u_1^T = I + W Y^T, where w_1 = −2u_1. Here, W is chosen as w_1 and Y as u_1. For k > 1, W_k and Y_k are calculated by multiplying Q_{k−1} with the k-th Householder reflection P_k. The multiplication is carried out as Q_k = Q_{k−1} P_k = (I + W_{k−1} Y_{k−1}^T)(I + w_k u_k^T), where P_k = I + w_k u_k^T, u_k is the k-th Householder reflection vector, and w_k = −2u_k. Q_{k−1} P_k can be rewritten as

Q_{k−1} P_k = (I + W_{k−1} Y_{k−1}^T)(I + w_k u_k^T) = I + [W_{k−1}, Q_{k−1} w_k][Y_{k−1}, u_k]^T,

where the updated matrices W_k = [W_{k−1}, Q_{k−1} w_k] and Y_k = [Y_{k−1}, u_k] are W_{k−1} and Y_{k−1} appended with the vectors Q_{k−1} w_k and u_k, respectively. Figure 3.4 shows the form of the W and Y matrices.

Figure 3.4: Shape of the W and Y matrices when k = 2 and m = 6.

Aggregated reflections in the form of a WY representation involve a larger amount of BLAS Level 3 operations than reflections applied one by one. However, the WY representation requires extra flops compared to non-aggregated transformations, both in the formation of the WY representation and in its application. In comparison with the LINPACK QR reduction (at the time [6] was written, 1985), the WY representation performs more work by a factor of (1 + 2/N) [6] when applied in a blocked manner with N blocks. By choosing an appropriate value of N, the WY representation can perform better than reflections that are applied one by one. If N is too large, the representation loses its advantage of applying several reflections at once. If N is too small, the increased amount of flops can become significant. Y can be stored in the lower triangular part of A as in the previous section. Y is the only matrix needed in order to yield the eigenvectors, because it contains all reflection vectors u. The compact WY representation is a more storage-efficient variant of the WY representation. Aggregating the compact WY representation Q = I + Y T Y^T requires fewer flops than the WY representation, and the two require almost the same amount for applying Q [15].
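To see why the WY form pays off, note that applying Q^T = (I + W Y^T)^T = I + Y W^T to a block of columns C reduces to two matrix-matrix products. The following is a small sketch using CBLAS (column-major storage; the routine name and the workspace handling are assumptions made for illustration, not the thesis code):

#include <cblas.h>
#include <stdlib.h>

/* C := Q^T C = C + Y * (W^T * C), where W and Y are m x k and C is m x n,
   all column-major with leading dimension m. */
void apply_wy_left(int m, int n, int k,
                   const double *W, const double *Y, double *C)
{
    double *T = malloc((size_t)k * n * sizeof(double)); /* k x n scratch */

    /* T = W^T * C (Level 3) */
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                k, n, m, 1.0, W, m, C, m, 0.0, T, k);
    /* C = C + Y * T (Level 3) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, Y, m, T, k, 1.0, C, m);

    free(T);
}

Both calls are Level 3 operations, which is the data-reuse property that motivates aggregating reflections in the first place.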

The compact WY representation Q = I + Y T Y^T is illustrated in Figure 3.5. Y is a trapezoidal matrix of size m × k and T is an upper triangular matrix of size k × k for k aggregated transformations. The sparse and small structure of Y and T enables the compact storage, where Y^T can be derived from Y.

Figure 3.5: Compact WY representation. Black areas represent non-zero elements.

3.5 Blocking and Data Distribution

Blocking is a proven way of increasing the performance of dense matrix algorithms [7]. The concept of blocking is that a matrix is divided into smaller blocks. If a block is small, it can fit in the fast cache memory, as opposed to the whole matrix. If the blocks are reused often, blocking gives better performance. Blocking can also be used to exploit parallelism in two ways: 1. operations on distinct blocks may be done in parallel, and 2. operations within a block may be parallelized. One difficulty with blocking is to find a suitable block layout (see Figure 3.6 for examples) and block size. Blocking can also be difficult because of dependencies in the computations. Another disadvantage of blocking is that it introduces a concept that is not related to the original problem; it is not always intuitive and therefore reduces code readability.

Figure 3.6: Examples of blocking of an 8 × 8 matrix into blocks B. (a) Row blocking. (b) Column blocking. (c) 2D blocking.

Blocking is also used for partitioning data in a DM system. Data can be distributed as in Figure 3.6, where each process stores one block, and the blocking is along rows, along columns, or square shaped (2D). Another type of distribution is the block cyclic partitioning of data, where the matrix is partitioned into 2D blocks and arranged into a logical two-dimensional mesh in both directions. Figure 3.7 describes the layout for a matrix divided into 4

pieces of 4 × 4 blocks. The block cyclic distribution is preferable for an algorithm that works on both columns and rows. A disadvantage of a block row, block column, or basic 2D blocking is that, if some parts of the matrix require more work than others, the distribution can give an uneven load; the block cyclic distribution avoids this. An even load is important in parallel computations. If one process has a lot more work to do than other processes of the same speed, the program will be close to sequential and nothing is gained by parallelization. Another advantage of the block cyclic distribution is that data can be divided into very small blocks. This is not possible with row cyclic or column cyclic distribution (the matrix is divided into block rows or block columns and distributed to the processes in a round-robin fashion). In row or column cyclic distribution, the minimum size of each block is bounded by the number of rows or columns. The block cyclic distribution, on the other hand, is bounded by the size of an element.

Figure 3.7: A block cyclic distribution. Every block B_ij consists of 4 × 4 elements distributed to a process p_ij.

3.6 Block Reduction

The block reduction algorithms presented in [9] reduce a matrix to any of the forms listed in Table 3.2 by working on blocks. This is done in a similar way as the WY representation. For the Hessenberg reduction, reflections can be aggregated and applied in a blocked manner, as described in [9]. The aggregation of reflections is outlined as follows. The Hessenberg form of a matrix A ∈ R^{n×n} is defined as

H = P_{n−2}^T ··· P_2^T P_1^T A P_1 P_2 ··· P_{n−2}.

Every update AP_i = A(I − 2uu^T) = A − A(2uu^T) = A − 2(Au)u^T is a rank-1 update (outer product update) of A. When the update is made from both left and right, the update has rank 2. The rank-2 update of step i + 1 can be rewritten as

A_{i+1} = A_i − 2u_i v_i^T − 2w_i u_i^T,

where y_i = A_i^T u_i, z_i = 2A_i u_i, v_i = y_i − (z_i^T u_i)u_i, and w_i = z_i − (y_i^T u_i)u_i.
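As a hedged illustration (not the thesis code), such a two-sided update maps directly onto two Level 2 BLAS rank-1 updates; the point of the blocked algorithm, described next, is to aggregate many of them into a single rank-2k update that can use Level 3 BLAS instead:

#include <cblas.h>

/* One rank-2 update A := A - 2*u*v^T - 2*w*u^T on an n x n column-major
   matrix, written with Level 2 BLAS rank-1 updates (cblas_dger). */
void rank2_update(int n, double *A, const double *u,
                  const double *v, const double *w)
{
    cblas_dger(CblasColMajor, n, n, -2.0, u, 1, v, 1, A, n);  /* A -= 2 u v^T */
    cblas_dger(CblasColMajor, n, n, -2.0, w, 1, u, 1, A, n);  /* A -= 2 w u^T */
}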

The transformations at column k + 1 are aggregated as

A_{k+1} = A_1 − 2UV^T − 2WU^T.

The important part of this rewriting scheme is that the update A_{k+1} = A − 2UV^T − 2WU^T is a rank-2k update that is rich in BLAS Level 3 operations and can be performed in blocks. U, V, and W are trapezoidal and, due to their sparse structure, they can be stored efficiently. Algorithm 2 shows an outline of the blocked reduction for N − 1 blocks. For every block k, b aggregated transformations are generated. The transformations are applied as a rank-2b update on the trailing submatrix after column (k − 1)b of A.

Algorithm 2 Blocked Hessenberg reduction of matrix A. A is overwritten, one column block of size b at a time, with the Hessenberg form of A.
N = (n − 2)/b
for k = 1, 2, ..., N − 1 do
  s = (k − 1)b + 1
  for j = s, s + 1, ..., s + b − 1 do
    Generate U_j, V_j, and W_j
    Aggregate U_j, V_j, and W_j into U, V, and W respectively
  end for
  // Perform rank-2b update on the trailing submatrix
  A(1 : n, s + b : n) = (A − 2UV^T − 2WU^T)(1 : n, s + b : n)
end for

The first stage of the two-staged algorithm (see Algorithm 3) uses a similar approach for reducing a matrix to block upper Hessenberg form [11]. The algorithm applies aggregated reflections from left to right. The reflections are generated by a recursive QR factorization and aggregated in compact WY form. When the algorithm is finished, matrix A has been overwritten by a block upper Hessenberg matrix with r subdiagonals below the main diagonal.

3.7 Parallel blocked algorithm

As described in Chapter 1, the two-staged reduction requires two stages for reducing a matrix to Hessenberg form. Two stages are preferred in some cases because the blocked updates in the first stage are more memory efficient than those of the direct reduction. The first stage reduces the full matrix to block (upper) Hessenberg form. An efficient way of doing this is by dividing a matrix A into four blocks:

A = ( A_11  A_12 )
    ( A_21  A_22 ),

where A ∈ R^{n×n} and A_11 ∈ R^{b×b}. Reduction of the block A_21 to upper triangular form R_1 = Q̃_1^T A_21, computed with a QR factorization, can be done without using any of the elements in the other blocks. When A_21 has been reduced, all blocks are updated as

Q_1^T A Q_1 = ( A_11   A_12 Q̃_1        )        Q_1 = ( I_b   0   )
              ( R_1    Q̃_1^T A_22 Q̃_1 ),              ( 0     Q̃_1 ),

where I_b is the identity matrix of size b. This procedure is repeated from left to right for N − 1 iterations k = 1, ..., N − 1, reducing the corresponding subdiagonal block column of A in every iteration.

Algorithm 3 First stage of the two-staged algorithm. The outer block size is given by b and the resulting block upper Hessenberg matrix is given a lower bandwidth of r.
for j_1 = 1 : b : n − r − 1 do
  b̂ = min(b, n − r − j_1)
  Y ∈ R^{n×b̂}, V ∈ R^{n×b̂}, T ∈ R^{b̂×b̂}
  for j_2 = j_1 : r : j_1 + b̂ − 1 do
    r̂ = min(r, j_1 + b̂ − 1)
    i_4 = j_2 + r : n
    i_5 = 1 : j_2 − j_1
    i_6 = j_2 − j_1 + 1 : j_2 + r̂ − j_1
    i_7 = j_2 : j_2 + r̂ − 1
    A(i_2, i_7) = A(i_2, i_7) − Y(i_2, i_5)V(i_7, i_5)^T
    A(i_2, i_7) = (I − V(i_2, i_5)T(i_5, i_5)V(i_2, i_5)^T)^T A(i_2, i_7)
    QR-factorize the block as A(i_4, i_7) = (I − V̂ T̂ V̂^T)R
    Aggregate reflections:
      V(i_4, i_6) = V̂
      T(i_6, i_6) = T̂
      T(i_5, i_6) = −V(i_4, i_5)^T V(i_4, i_6)T(i_6, i_6)
      Y(i_2, i_6) = A(i_2, i_4)V(i_4, i_6)T(i_6, i_6) − Y(i_2, i_5)T(i_5, i_6)
      T(i_5, i_6) = T(i_5, i_5)T(i_5, i_6)
  end for
  Y(i_1, :) = A(i_1, i_2)V(i_2, :)T
  Apply the compact WY transformations:
    A(i_1, i_2) = A(i_1, i_2) − Y(i_1, :)V(i_2, :)^T
    A(i_2, i_3) = A(i_2, i_3) − Y(i_2, :)V(i_3, :)^T
    A(i_2, i_3) = (I − V(i_2, :)T V(i_2, :)^T)^T A(i_2, i_3)
end for

When N − 1 subdiagonal blocks have been reduced, a block upper Hessenberg matrix develops:

H = Q^T A Q = ( H_11  H_12  ···  H_1N        )
              ( H_21  H_22  ···  H_2N        )
              ( 0     H_32  ···  H_3N        )
              (             ···              )
              ( 0     ···   H_{N,N−1}  H_NN  ).

Each transformation Q_i is represented in WY form for efficient application. The algorithm described above for block upper Hessenberg form can be parallelized, and this has been done for a SM architecture in [11]. Algorithm 3 is parallelized such that the operations are divided into coarse-grained tasks. The tasks are scheduled on a parallel SM system using threads, where each thread calls a sequential implementation of the BLAS operations. In [4], another parallel algorithm for reducing a full matrix to block upper Hessenberg form is described. It is an adaptation of the block Hessenberg reduction to a DM architecture. The algorithm performs operations on both columns and rows and therefore uses a block cyclic partitioning of data. Each block column k = 1, 2, ..., N − 1 of size b, for N − 1 block columns, is reduced to block upper Hessenberg form, and the trailing submatrix beginning at column kb is updated. After N − 1 iterations, the full matrix is reduced to block upper Hessenberg form. In each iteration k, two phases are executed: a QR phase and a Givens phase. In the QR phase, all blocks below the diagonal in block column k are reduced to

upper triangular form. In the Givens phase, all blocks below the diagonal in block column k, except the block closest to the diagonal, are annihilated with Givens rotations.

3.7.1 QR Phase

In the QR phase of the algorithm, every block B_ik (belonging to a process P_ij) below the subdiagonal in the column to be reduced (iteration k) is reduced to upper triangular form. This is carried out individually (without communication) using a QR reduction. The QR reduction generates transformation matrices Q_ik. Figure 3.8(a) shows the form of the process blocks after this reduction. When the processes below the subdiagonal in column k have finished their reduction, the rest of the matrix has to be updated, as seen in the previous section. The update is carried out by broadcasting each Q_ik to all other blocks, which can then apply it individually. Process P_ij broadcasts Q_ij in an efficient way by sending to all processes on the same block row and block column. Figure 3.8(b) shows the paths of the messages sent in this phase. The matrix is now in block upper Hessenberg form, except for the upper triangular blocks below the subdiagonal in block column k (and in all block columns > k). The upper triangular blocks below the subdiagonal have to be eliminated in order to complete the iteration.

Figure 3.8: The QR phase of the parallel block Hessenberg reduction algorithm. (a) Individual reduction to upper triangular form in block column k. (b) Message paths when broadcasting Q_ik.
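The row-and-column broadcast of Q_ik used here (and again in the Givens phase below) could be realized in MPI roughly as follows. This is a hypothetical sketch based on communicator splitting, not the implementation from [4]; the grid layout and buffer handling are assumptions:

#include <mpi.h>

/* Broadcast the local block Q (len doubles) from its owner, located at
   (owner_row, owner_col) in a process grid with pc columns (ranks laid
   out row-major in 'grid'), to every process in the owner's block row
   and block column. */
void bcast_row_col(double *Q, int len, int owner_row, int owner_col,
                   int pc, MPI_Comm grid)
{
    int rank;
    MPI_Comm_rank(grid, &rank);
    int my_row = rank / pc, my_col = rank % pc;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(grid, my_row, my_col, &row_comm); /* processes sharing a row    */
    MPI_Comm_split(grid, my_col, my_row, &col_comm); /* processes sharing a column */

    if (my_row == owner_row)
        MPI_Bcast(Q, len, MPI_DOUBLE, owner_col, row_comm); /* along the block row    */
    if (my_col == owner_col)
        MPI_Bcast(Q, len, MPI_DOUBLE, owner_row, col_comm); /* along the block column */

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}

In a real code the row and column communicators would be created once and reused, rather than split in every call.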

3.7.2 Givens Phase

As seen in Section 3.2, Givens rotations are suited for applications where elements should be selectively annihilated with low data dependency, and this is why Givens rotations are used in the second phase of the algorithm. The Givens phase consists of two steps, local annihilation and global annihilation. In the local annihilation step, each process owning blocks below the subdiagonal in block column k reduces its local blocks without communication. The outline of this step is: 1. Each process P_ij picks its upper triangular block B_d that is closest to the diagonal. 2. For every other block owned by P_ij, reduce that block with Givens rotations, using the B_d block as a pivot block.

The pivot block is used as a reference for eliminating the other local blocks. Figure 3.9(a) shows an outline of the local annihilation step. In Figure 3.9(a), the example is too small to show the general case, but for a larger matrix the concept is the same. When a process has computed a transformation (Givens rotation), it broadcasts the transformation matrices to all block columns and rows. The broadcast and application of each transformation are executed in the same way as in the QR phase. The result of the local annihilation step is illustrated in Figure 3.9(b).

Figure 3.9: Local annihilation step of the Givens phase. (a) Local annihilation using the blocks closest to the diagonal as pivot blocks. This example is extended with gray blocks in order to illustrate the general idea. (b) Result after local annihilation.

The global annihilation step reduces the remaining upper triangular blocks by using the subdiagonal block as a pivot block. This is executed in log_2(n_p) steps, where n_p is the number of processes involved in the reduction. Figures 3.10(a) and 3.10(b) illustrate the procedure of reducing the blocks below the subdiagonal in block column k. The result of the second step is illustrated in Figure 3.10(c).

3.8 Successive Band Reduction

In the previous section, a full matrix was reduced to block upper Hessenberg form. The analog of block Hessenberg form for symmetric matrices is banded form (or block tridiagonal form). The resulting symmetric banded matrix has to be further reduced to true tridiagonal form in order to be used in eigenvalue computations. If the bandwidth (the number of non-zero diagonals) is low compared to the full matrix size, there are more efficient approaches than the full reduction to Hessenberg form described previously. The Successive Band Reduction Toolbox (SBR) [5] is an implementation of a framework for algorithms that peel off diagonals from a symmetric banded matrix. These algorithms have the same structure:

- Annihilate one or several elements close to the diagonal.
- Bulge chasing in order to restore the banded form.

Figure 3.10: Global annihilation step of the Givens phase. (a) Global annihilation step, iteration one. (b) Global annihilation step, iteration two. (c) Result after global annihilation.

When one or several elements are annihilated, the rest of the matrix has to be updated with the transformation. This introduces a bulge, a sort of lump on the banded matrix. If the introduced bulge is not taken into account and the algorithm continues to annihilate elements, the bandwidth will rapidly increase. If the bandwidth increases too much, the algorithm becomes as inefficient as a full reduction algorithm (see Algorithm 1). Between each annihilation in the reduction, a bulge chasing step is executed. Bulge chasing has the purpose of chasing bulges off down the diagonal, in order to prepare for the next iteration. When a bulge has been chased off, the next iteration can begin without further increase of the bandwidth. Figure 3.11 explains the concept of bulge chasing. If all lower diagonals except the subdiagonal are eliminated, tridiagonal form is achieved. This direct reduction only requires 6bn^2 flops [5] for a symmetric matrix A ∈ R^{n×n} with half bandwidth b. If b ≪ n, this is a huge improvement over Algorithm 1, which can also be used for tridiagonal reduction. If the matrix A is a full symmetric matrix, the reduction to banded form as in the previous algorithms contributes additional work. This additional work reduces the performance gap between Algorithm 1 and the reduction from a full symmetric matrix to banded form followed by a band reduction.

Figure 3.11: Bulge chasing. (a) Banded matrix with semi-bandwidth 3 (non-zeros below the main diagonal). (b) The first column is reduced and the rest of the matrix is updated. (c) This introduces a bulge that has to be chased off the diagonal before the second column can be reduced. (d) If the first column of the bulge is reduced, the next iteration can begin. (e) The bulge moves down the diagonal.

An iteration, with annihilation of elements and the bulge chasing that follows, is called a sweep. What can be varied in the process of the

SBR is the number of elements d that are eliminated per iteration. By changing d, the SBR can have three optimal implementations, with 1. minimum algorithmic complexity (fewest flops), 2. minimum algorithmic complexity subject to limited storage, or 3. better support for using Level 3 BLAS kernels (sub-programs). The last implementation type requires an idea used in previous sections: by aggregating several transformations in the annihilation step, the updates can be represented in WY form and involve more Level 3 BLAS operations. Figure 3.12 shows how the bulge chasing works with aggregated updates. Aggregated updates can be executed in parallel and give high performance. In the band reduction, the reflection vectors cannot be stored in the lower part of the matrix, as in previous methods: the large number of reflections required in the bulge chasing does not fit in the zeroed part of the matrix.

Figure 3.12: Bulge chasing with aggregated transformations. (a) The lower diagonals. (b) The first q = 3 columns are reduced with a QR reduction step. A bulge is introduced. (c) The first q columns of the bulge are reduced with a QR reduction. (d) Result after one reduced bulge. Only the lower diagonals are shown, because they are the only diagonals that need to be updated for a symmetric matrix; the updates are identical on the upper diagonals.

3.9 Two-staged algorithm

The two-staged algorithm, as proposed in [11], is a parallel algorithm for a SM architecture that reduces a full matrix A to Hessenberg form in two stages. The first stage is to reduce A to block upper Hessenberg form with r subdiagonals. The second stage is to reduce the block upper Hessenberg matrix to true upper Hessenberg form by trimming the lower bandwidth down to r = 1 (see Figure 1.8).

3.9.1 Stage 1

The first stage reduces a full matrix to block upper Hessenberg form, as described in Algorithm 3 and parallelized for a SM architecture in [11] and Section 3.7. The implementation of the blocked algorithm in [11] requires 10n^3/3 flops. 80% of the flops are BLAS Level 3 operations (matrix-matrix multiplications) and the remaining 20% are BLAS Level 2 operations (matrix-vector multiplications).
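Combining this with the roughly 2n^3 flops attributed to the second stage in [11] recovers the overhead figure quoted in Chapter 1:

10n^3/3 + 2n^3 = 16n^3/3,  and  (16n^3/3) / (10n^3/3) = 1.6,

i.e. the two-stage approach performs about 60% more arithmetic than the unblocked reduction, which is worthwhile only because of its better data reuse.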

3.9.2 Stage 2

The second stage is highly related to the Successive Band Reduction described in Section 3.8. By adapting SBR to the unsymmetric case (Hessenberg reduction), the lower bandwidth r can be reduced. The basic reduction from block Hessenberg to true Hessenberg form is shown in Algorithm 4 (described and improved in [11] and [12]). In this algorithm, one column at a time is reduced. When one column has been reduced, the reflections are applied from the left and right. This introduces a bulge that is chased off the diagonal in the same way as the column reduction was made.

Algorithm 4 Unblocked reduction from a block Hessenberg matrix A with lower bandwidth r. Notice that this algorithm uses zero-based indexation.
for j = 1 : n − 2 do
  k_1 = 1 + (n − j − 2)/r
  for k = 0 : k_1 − 1 do
    l = j + max(0, (k − 1)r + 1)
    i_1 = j + kr + 1 : min(j + (k + 1)r, n)
    i_2 = l : n
    i_3 = 1 : min(j + (k + 2)r, n)
    Reduce column A(i_1, l) with Householder reflection Q_k^j
    Apply the reflection from the left: A(i_1, i_2) = (Q_k^j)^T A(i_1, i_2)
    Apply the reflection from the right: A(i_3, i_1) = A(i_3, i_1) Q_k^j
  end for
end for

Figure 3.13: One iteration of the inner loop of Algorithm 4. (a) A block upper Hessenberg matrix. (b) The first column is reduced with a Householder reflection. (c) The reflection is applied from the left. (d) The reflection is applied from the right. (e) A bulge is introduced.

As seen in the last figure, a bulge is introduced. The inner for-loop of Algorithm 4 both reduces a column and chases the introduced bulge down the diagonal. An example execution of Algorithm 4 is illustrated in Figure 3.13. This algorithm has the same disadvantage as Algorithm 1: it performs a low number of operations relative to the amount of data it touches. In order to improve this, the bulge-sweep process is reordered and the reduction is divided into two steps per iteration, as described in [12].

Step 1, Generate

The first step is to generate transformations from several sweeps. This procedure is similar to Algorithm 4 except that it only reads and updates values close to the diagonal. Transformations for q consecutive sweeps are generated along the diagonal. Before a column is

reduced, it must be brought up to date by applying the reflections from previous sweeps. Algorithm 5 gives a detailed description of the generate step. Because of data dependencies and fine-grained tasks, the generate step is not suitable for parallelization.

Algorithm 5 Generates reflections Q^j for q columns and updates matrix A close to the diagonal. Notice that this algorithm uses zero-based indexation.
for j = j_1 : j_1 + q − 1 do
  k_1 = 2 + (n − j − 1)/r
  for k = 0 : k_1 − 1 do
    α_1 = j + kr + 1
    α_2 = min(α_1 + r − 1, n)
    β = j + max(0, (k − 1)r + 1)
    for ĵ = j_1 : j − 1 do
      α̂_1 = ĵ + kr + 1
      α̂_2 = min(α̂_1 + r − 1, n)
      if α̂_2 ≥ α̂_1 then
        Bring A(α̂_1 : α̂_2, β) up to date by A(α̂_1 : α̂_2, β) = (Q_k^ĵ)^T A(α̂_1 : α̂_2, β)
      end if
    end for
    if α_2 ≥ α_1 then
      Reduce A(α_1 : α_2, β) with reflection Q_k^j
      Update column A(α_1 : α_2, β) = (Q_k^j)^T A(α_1 : α_2, β)
      γ_1 = j_1 + max(0, (k + j − j_1 − q + 2)r)
      γ_2 = min(j + (k + 2)r, n)
      Introduce bulge A(γ_1 : γ_2, α_1 : α_2) = A(γ_1 : γ_2, α_1 : α_2) Q_k^j
    end if
  end for
end for

Step 2, Update

When Algorithm 5 has completed, the rest of the matrix blocks can be updated with the reflections Q^j. Using threads, the updates are made in parallel on a shared memory architecture. Algorithm 6 describes how the updates are applied. Each process has its own range of rows (r_1 : r_2) and columns (c_1 : c_2) and can execute its updates on these ranges individually. Between variants R and L, the processes have to synchronize in order to avoid conflicts. In [12], the two steps are further optimized by dividing variant R into P_R, U_R and variant L into P_L, U_L (P and U stand for Prepare and Update). By dividing the updates in this manner, P_R and P_L update the values close to the diagonal, which are required for the next generate step. This allows the sequential computations involved in the generate step to be performed on one process at the same time as the other processes apply the U_R and U_L variants. Another improvement made in [12] is that the processes are given row and column ranges that match their performance. These ranges are recalculated between iterations in order to minimize idle time at the processes. The ranges r_1 : r_2 and c_1 : c_2 depend on how fast the processes executed their updates in the previous iterations. Using all these optimization techniques, the Generate-and-Update algorithm runs 13 times faster than Algorithm 4 in the test environment used in [12] (8 cores, r = 12, q = 16, and A ∈ R^{n×n}, n ≈ 2200).
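The performance-matched ranges r_1 : r_2 and c_1 : c_2 could, for example, be obtained by splitting the rows in proportion to the update rate each process achieved in the previous iteration. The helper below is a hypothetical sketch of such a scheme (the function name, the rate measure, and the rounding policy are assumptions, not taken from [12]):

/* Split rows r1..r2 (inclusive) among np processes in proportion to the
   rate (e.g. rows updated per second) each process achieved in the
   previous iteration. start[p]..end[p] is the range given to process p;
   a range with end[p] < start[p] is empty. */
void balance_ranges(int r1, int r2, int np, const double *rate,
                    int *start, int *end)
{
    double total = 0.0;
    for (int p = 0; p < np; ++p)
        total += rate[p];

    int rows = r2 - r1 + 1;
    int next = r1;
    double acc = 0.0;
    for (int p = 0; p < np; ++p) {
        acc += rate[p];
        int last = r1 + (int)((double)rows * acc / total + 0.5) - 1;
        if (p == np - 1)
            last = r2;            /* the last process takes what is left */
        start[p] = next;
        end[p] = last;
        next = last + 1;
    }
}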

Algorithm 6 Updates the remaining part of matrix A with the reflections generated in Algorithm 5. The updates come in two variants (R and L). The variant specifies whether the reflection should be applied to the block from the right or from the left.
k_1 = 1 + (n − j_1 − 2)/r
for k = k_1 − 1 : −1 : 0 do
  for j = j_1 : j_1 + q − 1 do
    α_1 = j + kr + 1
    α_2 = min(α_1 + r − 1, n)
    if α_2 − α_1 ≥ 2 then
      if variant is R then
        γ_1 = max(r_1, 1)
        γ_2 = max(r_2, j_1 + max(0, (k + j − j_1 − q + 2)r))
        A(γ_1 : γ_2, α_1 : α_2) = A(γ_1 : γ_2, α_1 : α_2) Q_k^j
      end if
      if variant is L then
        β_1 = max(c_1, j_1 + q, j_1 + q + q(k − 1)r + 1)
        β_2 = min(c_2, n)
        A(α_1 : α_2, β_1 : β_2) = (Q_k^j)^T A(α_1 : α_2, β_1 : β_2)
      end if
    end if
  end for
end for

Figure 3.14: The broadcast operation.

3.10 MPI Functions

For communication in the distributed algorithms, the following operations are used: broadcast, scatter, gather, and all-to-all. Broadcast is an operation where one process sends a message to all other processes (also called one-to-all broadcast) [1]. Figure 3.14 illustrates the broadcast operation. Scatter, also called one-to-all personalized communication [1], is an operation where one process sends a message M_i to process p_i (see Figure 3.15). A customization of the MPI_Scatterv function is used in the distributed algorithms. The MPI_Scatterv function is used for sending messages from one process to all other processes, with a different size and displacement for each message (see Figure 3.16 for a comparison). This allows the root process to send a variety of messages. (Footnote: A limitation of the MPI_Scatterv function is that it can only scatter data using one data type. Because of this, a customized scatter function has been implemented, which can send blocks of different sizes to each process.) In the implemented scatter, the root process sends one message M_i to process p_i. Process p_i receives M_i, and the root process can continue to send message M_{i+1} to process p_{i+1}. When all processes p = 1, ..., n_p − 1 have received a message, the scatter of the data is finished. Gather is the dual of the scatter operation. In this operation, each process p_i sends a

personalized message M_i to the root process. When the operation is finished, root holds all messages M_0, M_1, ..., M_{n_p−1}.

Figure 3.15: Scatter and gather operations.

Figure 3.16: Comparison between scatter and MPI_Scatterv. (a) Scatter of memory from the root process; every process gets an equal amount of data. (b) MPI_Scatterv; blocks of different sizes are scattered, starting at different positions. Gray memory is not sent.

In the all-to-all operation (also called all-to-all personalized communication), all processes send a distinct message to every other process (see Figure 3.17). This operation is implemented as MPI_Alltoall in MPI. MPI_Alltoall has a related MPI_Alltoallv function that can work with messages of different sizes. MPI_Alltoallv cannot send messages with different data types. A custom all-to-all function has therefore been implemented that can send messages with different data types. (Footnote: The custom all-to-all function behaves like MPI_Alltoallw, which was unknown to the implementor at the time. No performance comparison has been done between the custom implementation and MPI_Alltoallw; this is a topic for further studies.)

Many functions in MPI have a non-blocking counterpart, which does not wait for the communication to finish. Non-blocking functions can be used to overlap communication with computation and hide communication overhead. In the custom implementation of all-to-all, each process p_i sends its messages M_{i,k} to the corresponding receivers k = 0, 1, ..., n_p − 1. The sending is done using non-blocking sends, so that the next send can begin immediately. This is also the case for the receiving side: receiving is done non-blocking. When all non-blocking sends and receives have been initiated, all processes wait for their receives and then for their sends.

3.11 Storage

For storage in the distributed algorithms, four data types are used: full matrix, block row, block column, and banded matrix. The block row and block column data types are generated at all processes using MPI_Type_create_subarray, a function for creating sub-blocks of a larger matrix. The blocking is done in rows and columns, similarly to the

row and column blocking described in Section 3.5. When used for Hessenberg reduction, the j_1 already reduced columns can be excluded from the distribution. Figure 3.18 shows an example of how the rows and columns could be blocked.

Figure 3.17: The all-to-all operation. In total n_p × n_p messages are exchanged between the n_p processes.

Figure 3.18: Row and column blocking with j_1 reduced columns and 4 processors. (a) Row blocking with j_1 = 5. (b) Column blocking with j_1 = 5.

The generate step (see Algorithm 5) only operates around the diagonal of A, and the root process only needs some diagonals of A. When only some diagonals are used, a banded matrix [2] is a storage-efficient data type. When stored as a banded matrix, only kl subdiagonals, ku superdiagonals, and the main diagonal have to be stored. Figure 3.19 illustrates this storage. In order to access an element in the banded matrix with a position based on the full matrix, the following conversion is made: an element at position (i, j) in the original matrix A is located at position (ku + i − j, j) in the banded matrix.

Figure 3.19: Banded matrix storage. ku = 3 superdiagonals and kl = 3 subdiagonals from the full matrix on the left are stored in the banded matrix on the right.
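A small helper for this index conversion might look as follows (zero-based indices and column-major banded storage with leading dimension kl + ku + 1, in the style of LAPACK band storage, are assumptions made for illustration; this is not the thesis code):

#include <stddef.h>

/* Pointer to element A(i, j) of the original matrix inside a banded
   storage array with kl subdiagonals and ku superdiagonals. The band is
   stored column by column with leading dimension ldab = kl + ku + 1 and
   row index ku + i - j within column j. Valid only when
   -ku <= i - j <= kl. */
static inline double *band_elem(double *banded, int kl, int ku, int i, int j)
{
    int ldab = kl + ku + 1;
    return &banded[(size_t)j * ldab + (ku + i - j)];
}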

Chapter 4

Results

The second stage of the two-staged algorithm (see Section 3.9) has been adapted to a distributed memory architecture. The result is two implementations that use MPI for communication.

4.1 Root-Scatter algorithm

Our Root-Scatter algorithm is a basic adaptation of the algorithm described in Section 3.9. This algorithm uses the same algorithms as the SM implementation [12], combined with message passing communication. The algorithm works similarly to the SM implementation described in Section 3.9.2. One difference is that before each update, the matrix A(:, j_1 : n) is scattered in blocks from the root process, which always holds the whole matrix A. For the R update (see Algorithm 6), A(:, j_1 : n) is row blocked, and for the L update, A(:, j_1 : n) is column blocked. Each process p_i updates its block A_i, and the root process gathers the results. Algorithm 7 describes the procedure, where q columns are reduced in each iteration, from left to right. A problem with this approach is that the root process stores the whole matrix, consisting of n^2 elements. For a large matrix, the root process could run out of memory. Another problem is that the scatter-update-gather, scatter-update-gather pattern makes the root process a bottleneck (see Figure 4.1). All communication has to go through the root process. For many processes, this solution should not scale well, but the algorithm is implemented for comparison with other solutions. All blocks have to be gathered to the root process before the next update can occur. One row block consists of (n − j_1 + 1) · n/n_p elements and one column block of ((n − j_1 + 1)/n_p) · n elements (they have the same size). In each update, the root process sends n_p − 1 blocks and gathers the same number of blocks. In one iteration, sending blocks of matrix A gives a total communication volume of 4(n_p − 1)(n − j_1 + 1) · n/n_p elements, with the additional latency of 4(n_p − 1) message transfers. The complexity is O(n_p), which could be improved to logarithmic complexity by using existing collective communication algorithms. This has not been done for the Root-Scatter algorithm because of the limited amount of time.

Figure 4.1: Communication of A in the Root-Scatter algorithm.

Algorithm 7 Root-Scatter. Matrix A ∈ R^{n×n} is held at root and is updated in n/q iterations, where q is the number of consecutive sweeps per iteration.
for j_1 = 1 : q : n do
  if root process then
    Run Algorithm 5 with A and j_1 in order to generate transformations Q
  end if
  Broadcast Q
  Scatter A(:, j_1 : n) as row blocks A_i
  Update block A_i with Algorithm 6 (variant = R)
  Gather blocks to A(:, j_1 : n)
  Scatter A(:, j_1 : n) as column blocks A_i
  Update block A_i with Algorithm 6 (variant = L)
  Gather blocks to A(:, j_1 : n)
end for
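In MPI terms, one iteration of Algorithm 7 is a broadcast of the generated reflections followed by two scatter/update/gather round trips. The skeleton below is a hypothetical sketch of that communication pattern only (equal-sized contiguous blocks and double precision are assumptions; the generate and update routines are left as placeholders, and the real implementation uses the customized scatter described in Section 3.10):

#include <mpi.h>

/* One iteration of Algorithm 7, communication skeleton only.
   A              : the full matrix, significant only at the root
   Ai             : local block buffer of block_elems doubles
   Q              : reflections generated at the root, q_elems doubles
   counts, displs : per-process element counts and offsets into A */
void root_scatter_iteration(double *A, double *Ai, double *Q,
                            int q_elems, int block_elems,
                            const int *counts, const int *displs,
                            MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0) {
        /* generate_reflections(A, Q);  -- Algorithm 5 (placeholder) */
    }
    MPI_Bcast(Q, q_elems, MPI_DOUBLE, 0, comm);

    /* Row-blocked R update. */
    MPI_Scatterv(A, counts, displs, MPI_DOUBLE,
                 Ai, block_elems, MPI_DOUBLE, 0, comm);
    /* update_block(Ai, Q, 'R');  -- Algorithm 6, variant R (placeholder) */
    MPI_Gatherv(Ai, block_elems, MPI_DOUBLE,
                A, counts, displs, MPI_DOUBLE, 0, comm);

    /* Column-blocked L update: shown with the same contiguous layout for
       brevity; the real code scatters column blocks with a customized
       scatter (see Section 3.10). */
    MPI_Scatterv(A, counts, displs, MPI_DOUBLE,
                 Ai, block_elems, MPI_DOUBLE, 0, comm);
    /* update_block(Ai, Q, 'L');  -- Algorithm 6, variant L (placeholder) */
    MPI_Gatherv(Ai, block_elems, MPI_DOUBLE,
                A, counts, displs, MPI_DOUBLE, 0, comm);
}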

4.2 DistBlocks algorithm

Our DistBlocks algorithm is an improvement of the Root-Scatter algorithm. By using a more distributed communication model and partitioning, the scalability issues can be avoided. Like the Root-Scatter algorithm, the DistBlocks algorithm uses algorithms from the SM implementation. The key differences are the communication between the updates and the storage of A at the root process.

First, the (qr - 1) superdiagonals, the (2r - 1) subdiagonals, and the main diagonal of matrix A are copied to a banded matrix, banded, as seen in Figure 3.19. The root process needs only these diagonals for executing the generate step. The root process scatters matrix A in equally sized row blocks. When the generate step is finished, the root process broadcasts the transformations Q and scatters the banded matrix to the processes that hold diagonal elements. All processes run the R variant of Algorithm 6 on their row range r_1 : r_2. The transition from row blocks to column blocks is done through the all-to-all operation. Each process can then update its columns by running the L variant of Algorithm 6 on its column range c_1 : c_2. A transition back to row blocks is then made. Before the next iteration can begin, the root process gathers the updated entries near the diagonal back into the banded matrix. The root process holds the full matrix A in the beginning, but uses it only to store the result of the q columns reduced in each iteration. A description can be found in Algorithm 8, where q columns are reduced per iteration. The transition from row blocks to column blocks is illustrated in Figure 4.2. The scatter and gather of the banded matrix are implemented using MPI_Type_indexed, which can describe complex data layouts such as the banded form. The scatter of the banded matrix is illustrated in Figure 4.3.

The transition from row blocks to column blocks requires every process to send blocks with a total of (n - j_1 + 1) * n/n_p elements. Per iteration, the transition executes twice, which leads to a total communication cost of 2(n - j_1 + 1) * n/n_p per process. This is an improvement over the previous algorithm: the per-process communication volume of the all-to-all operation decreases as the number of processes grows, which is not the case for the communication in the Root-Scatter algorithm.

Figure 4.2: Transition from block rows to block columns, after j_1 = 5 columns have been reduced. The reverse procedure can be used to go from block columns to block rows.
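A derived datatype for the band can be built roughly as follows. This is a sketch under stated assumptions (0-based indices, column-major full matrix, and the hypothetical helper name make_band_type), not the thesis code:

    #include <mpi.h>
    #include <vector>
    #include <algorithm>

    // Build a datatype that selects, for each column in [c1, c2), the entries of
    // an n-by-n column-major matrix lying within a band of ku superdiagonals and
    // kl subdiagonals. Such a type lets the band be scattered or gathered
    // without packing it by hand.
    MPI_Datatype make_band_type(int n, int ku, int kl, int c1, int c2) {
        std::vector<int> blocklens, displs;
        for (int j = c1; j < c2; ++j) {
            int first_row = std::max(0, j - ku);      // topmost band entry in column j
            int last_row  = std::min(n - 1, j + kl);  // bottommost band entry in column j
            blocklens.push_back(last_row - first_row + 1);
            displs.push_back(j * n + first_row);      // offset into column-major storage
        }
        MPI_Datatype band_type;
        MPI_Type_indexed((int)blocklens.size(), blocklens.data(), displs.data(),
                         MPI_DOUBLE, &band_type);
        MPI_Type_commit(&band_type);
        return band_type;
    }

The root process could then send the band entries for each process's column range with one send per process, and reuse the same type in the reverse direction for the gather.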

Algorithm 8 DistBlocks. The root process (rank = 0) uses a banded matrix, banded, for the generate step. When the L updates have been made, the root process holds at least q fully updated columns, which allows it to store the result from each iteration in A without extra communication.
    if root process then
        Copy bands from A to banded matrix banded
    end if
    Scatter A as row blocks A_i
    for j_1 = 1 : q : n do
        if root process then
            Run Algorithm 5 with banded and j_1 in order to generate transformations Q
        end if
        Broadcast Q
        Scatter banded(:, j_1 : n) to the row blocks A_i
        Update block A_i with Algorithm 6 (variant = R)
        Transit to column blocks with all-to-all
        Update block A_i with Algorithm 6 (variant = L)
        if root process then
            Copy the first q columns of A_0 to matrix A(:, 1 : j_1 + q)
        end if
        Transit to row blocks with all-to-all
        Gather diagonals from blocks to banded
    end for

Figure 4.3: Scatter of the updated diagonals from the root process to all processes. The reverse can be used as the gather.

4.3 Performance

In order to test the performance of the two distributed algorithms, tests are made on the Abisko high performance computing (HPC) cluster at High Performance Computing Center North (HPC2N). Abisko is a cluster of 322 nodes, each comprising 4 AMD Opteron 12-core 2.6 GHz processors. Each core has a theoretical peak performance (TPP) of 10.4 Gflop/s. Two ranges of tests are designed in order to answer the questions stated in Section 1.5:

1. Test the performance of the Root-Scatter algorithm and the DistBlocks algorithm compared to the unblocked band reduction algorithm. Tests are made for sizes n = 800 : 200 : 6000 with q = 16, r = 12, on 8 processes. The tests are similar to the ones done in [12] for the SM implementation and are used to compare the SM implementation with the DM implementation. The result from these tests will be used to answer question 2.

2. The DM implementation can run on a large number of cores. In order to test its efficiency, a scaled speedup test is executed. Speedup is a measure of parallel scalability, defined as S = T_1 / T_p, where T_1 is the execution time for the sequential algorithm (in this case, Algorithm 4) and T_p is the time for the parallel algorithm executed on p processes [1, p. 130]. Scaled speedup is a measurement where the problem size increases with the number of processes in the test.

In this case, the problem size W is defined as C*n^3, and as the number of processes increases by factors 1, 2, ..., 8, the order n of matrix A increases by the cube roots of these factors, 1^(1/3), 2^(1/3), ..., 8^(1/3). Scaled speedup is used in order to analyze the efficiency of a parallel program; ideally, the scaled speedup should be linear in the problem size W. Table 4.1 presents the parameters used in the scaled speedup test. The scaled matrix order is rounded down to the nearest multiple of the number of processes, n_p, due to a limitation in the implementation. All tests run with q = 16, r = 12. The results from these tests will be used to answer questions 1 and 3.

Table 4.1: Tests created for scaled speedup, listing the normalized problem size W, the matrix order n, and the number of processes n_p (up to 64 processes).

Besides this, tests are made on 4 processes with n = 4000 (q = 16, r = 12) in order to make an introspective analysis of some iterations. The introspective analysis is made by time stamping certain key passages in the code. This produces trace data that is analyzed in order to answer question 4.

The implementations have been made in C++ using the OpenMPI implementation of MPI [14], and the code is compiled with GCC. Each test is executed three times. Because of the startup time of the program (setting up the MPI environment), the first execution is disregarded and the reported result is the average of the last two runs.

Figure 4.4 shows the performance of the two distributed implementations of the algorithm described in Section 3.9. The algorithms have an approximate flop count of W_n = 2n^3, which is used to calculate the performance as W_n / T_p, where T_p is the parallel execution time and W_n is the number of flops required for a problem of size n. Ideally, this test would reach the TPP of 83.8 Gflop/s (8 cores). In Figure 4.4, the highest performance, 2.5 Gflop/s, is reached at n = 6000, which is 3% of the TPP. For similar tests of the SM implementation in [12], another cluster (2.5 GHz processors, TPP = 80 Gflop/s) reaches approximately 12.5% of its theoretical peak for sufficiently large n. The performance of the unblocked algorithm declines for n > 1000. The reason is that for large problems, larger parts of the matrix are involved in each computation; when these parts do not fit in the fast cache memory, they are mostly served from the slower caches. The unblocked algorithm therefore scales worse on computers with caches than cache-efficient algorithms do.

Figure 4.5 shows the result of the scaled speedup test. The distributed algorithms outperform the sequential unblocked algorithm at small problem sizes and with few processes, because the distributed algorithms apply their updates in a more cache-efficient way than the unblocked one. For larger problem sizes, the distributed algorithms perform poorly, even though more processes take part in the computation. For large problem sizes, the DistBlocks algorithm performs better than the Root-Scatter algorithm; its more efficient communication model could be the reason for the performance difference.
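As a small illustration of how these numbers are derived, the sketch below (not the thesis code; n_base and the scaling factor k are illustrative parameters) computes the Gflop/s figure from the assumed flop count 2n^3 and the scaled matrix order used in the speedup test:

    #include <cmath>

    // Performance in Gflop/s from the assumed flop count W_n = 2*n^3 and a
    // measured parallel run time t_p in seconds.
    double gflops(int n, double t_p) {
        double w_n = 2.0 * std::pow(static_cast<double>(n), 3);
        return w_n / t_p / 1.0e9;
    }

    // Matrix order for the scaled speedup test: when the process count (and
    // hence the problem size W = C*n^3) is scaled by a factor k relative to a
    // base configuration of order n_base, the order is scaled by k^(1/3) and
    // rounded down to the nearest multiple of the process count n_p, matching
    // the limitation mentioned in the text.
    int scaled_order(int n_base, int k, int np) {
        int n = static_cast<int>(n_base * std::cbrt(static_cast<double>(k)));
        return n - (n % np);
    }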

Figures 4.6(a) and 4.6(b) show traces of the two distributed algorithms for n = 4000. What they demonstrate is that a substantial amount of time is spent waiting for other processes. The three main factors behind the poor performance are:

1. Bad load balancing. The processes have very different loads in each iteration; the equally partitioned columns do not give an equal amount of work.

2. Sequential generate step. The sequential generate step requires much time, during which the other processes are waiting for process 0 to finish.

3. Communication overhead. Figure 4.6(b) illustrates that much more time (about six times more) is spent on communication than on the generate step.

Three theoretical scenarios are constructed in order to see how much the three factors above affect the performance:

1. Perfect load balancing (time t_load). This is simulated by summing the time spent in updates and dividing the sum evenly by the number of processes.

2. Hidden generate step (time t_gen). This is simulated by taking the maximum of t_gentot and t_uptot / (n_p - 1), where t_gentot is the total time spent in the generate step and t_uptot is the total time for the updates. This simulates that the root process can perform the generate step at the same time as the other processes perform the updates.

3. No communication overhead (time t_comm). This is simulated by subtracting all time spent on communication.

Table 4.2 shows the results of combinations of these simulated implementations for a problem of order n = 4000. The results show that removing communication gives the largest improvement, but this is a very unrealistic case. The results also show how badly the load is distributed over the processes.

To see how much the sequential generate step slows down the algorithm, a calculation has been made for a theoretically perfectly load balanced case. The total time spent in the update steps is summed into t_R and t_L over all processes. t_R and t_L are then divided by the number of processes, which simulates perfect load balancing, as in the previous test. The t_R and t_L values are compared with t_G, the total amount of time spent in the generate step. Table 4.3 shows the calculations for both implementations. The last column is the fraction of time spent in the generate step (under these ideal conditions). The two implementations spend 15% and 26%, respectively, of the time in the generate step, even though the executions are theoretically perfectly load balanced and have no communication overhead.

The DistBlocks algorithm performs worse than the Root-Scatter algorithm here. The difference is not large for the updates, but in the generation of transformations the gap is larger. This could be an effect of the banded form used in the DistBlocks algorithm: in order to use a banded matrix, every index is converted with (ku + i - j, j), and this calculation could slow down the DistBlocks algorithm. The kernels used in the algorithms are not optimized for banded matrices.
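The simulated scenarios can be reproduced from per-process trace data roughly as follows. This is a sketch, not the analysis code used for Table 4.2; the Trace structure and the accounting for the no-communication case are assumptions.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Per-process totals extracted from a trace, in seconds.
    struct Trace {
        std::vector<double> update;   // time in the R and L updates, per process
        std::vector<double> comm;     // time in communication, per process
        double generate;              // total time in the sequential generate step
        double runtime;               // measured wall-clock time of the run
    };

    // Scenario 1: perfect load balancing -- the total update work divided
    // evenly over the processes.
    double t_load(const Trace& tr) {
        double up_tot = std::accumulate(tr.update.begin(), tr.update.end(), 0.0);
        return up_tot / tr.update.size();
    }

    // Scenario 2: hidden generate step -- the root overlaps the generate step
    // with the updates performed by the remaining n_p - 1 processes.
    double t_gen(const Trace& tr) {
        double up_tot = std::accumulate(tr.update.begin(), tr.update.end(), 0.0);
        return std::max(tr.generate, up_tot / (tr.update.size() - 1));
    }

    // Scenario 3: no communication overhead -- here approximated by removing
    // the largest per-process communication time from the measured runtime.
    double t_comm(const Trace& tr) {
        return tr.runtime - *std::max_element(tr.comm.begin(), tr.comm.end());
    }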

Figure 4.4: Performance of the two distributed memory algorithms compared with the unblocked sequential algorithm.

Figure 4.5: Scaled speedup S for normalized problem sizes W = 1 : 8.

Figure 4.6: Traces of the two distributed algorithms, running on 4 processes with n = 4000, q = 16, r = 12. (a) Trace of the Root-Scatter implementation. (b) Trace of the DistBlocks implementation. Iterations 1 to 4 out of a total of 250 iterations are shown. The time for broadcasting the transformations Q is too short to be visible in the figures.

Table 4.2: Combinations of the different ideal scenarios, running on 4 processes with n = 4000, q = 16, r = 12. For both the Root-Scatter and the DistBlocks implementation, the rows give the actual runtime and the simulated runtimes under t_load, t_gen, t_comm, and their combinations (t_load and t_gen; t_load and t_comm; t_gen and t_comm; t_load, t_gen and t_comm), all in seconds.

Table 4.3: Sum of the time spent in each computation step (under ideal conditions), running on 4 processes with n = 4000, q = 16, r = 12. For both implementations, the columns give t_G, t_Rp, t_Lp (in seconds) and the fraction t_G / (t_G + t_Rp + t_Lp).
