Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators

Size: px
Start display at page:

Download "Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators"

Transcription

1 Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Gregorio Quintana-Ortí Francisco D. Igual Enrique S. Quintana-Ortí Departamento de Ingeniería y Ciencia de Computadores Universidad Jaume I 2.7 Castellón, Spain {gquintan,figual,quintana@icc.uji.es Robert van de Geijn Department of Computer Sciences The University of Texas at Austin Austin, Texas 7872 rvdg@cs.utexas.edu Abstract In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further evidence that this approach solves the programmability problem for this domain by targeting a more complex architecture, composed of a multicore processor and multiple hardware accelerators (GPUs, Cell B.E., etc.), each with its own local memory, resulting in a platform more reminiscent of a heterogeneous distributed-memory system. In particular, we show that the FLAME programming model accommodates this new situation effortlessly so that no significant change needs to be made to the codebase. All complexity is hidden inside the SuperMatrix runtime scheduling mechanism, which incorporates software implementations of standard cache/memory coherence techniques in computer architecture to improve the performance. Our experimental evaluation on a Intel Xeon 8-core host linked to an NVIDIA Tesla S87 platform with four GPUs delivers peak performances around 55 and 45 (single-precision) for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations. Categories and Subject Descriptors D.m [Software]: Miscellaneous General Terms Algorithms, Performance Keywords GPUs, algorithms-by-blocks, dependency analysis, dynamic scheduling, out-of-order execution. Introduction The limitations of current VLSI technology and the desire to transform the ever-increasing number of transistors on a chip dictated by Moore s Law into faster computers has led most hardware manufacturers to design multicore processors and/or specialized hardware accelerators [9]. In response, the computer science community is beginning to embrace (explicit) parallel programming as the means to exploit the potential of the new architectures []. How Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PPoPP 9, February 4 8, 29, Raleigh, North Carolina, USA. Copyright c 29 ACM /9/2... $5. to program these new architectures easily and efficiently is the key that will determine their success or failure. Given that architectures have recently bifurcated and no vendor can even predict what design will dominate five to ten years from now, the design of flexible programming solutions is as important as it ever has been. Dense linear algebra has been traditionally used as a pioneering area to conduct research on the performance of new architectures and this continues with the recent advent of multicore processors and hardware accelerators like GPUs and Cell B.E. 
The traditional approach in this problem domain, inherited from the solutions adopted for shared-memory multiprocessors years ago, is based on the use of multithreaded implementations of the Basic Linear Algebra Subprograms (BLAS) [26, 6, 5]. Code for operations constructed in terms of the BLAS (e.g, for solving a linear system or a linear least-squares problem) extract all the parallelism at the BLAS level. Thus, the intricacies of efficiently utilizing the target architecture are hidden inside the BLAS, and the burden of its parallelization lies in the hands of a few experts with a deep knowledge of the architecture. More recently, the FLAME, PLASMA, and SMPSs projects [, 2, 3, 3, 3, 32, 33, 9, 8, 4] have advocated for a different approach, extracting the parallelism at a higher level, so that only a sequential tuned implementation of the BLAS is necessary and more parallelism is detected and exploited. Cilk [27] is a precursor of these projects that suffered from not being able to deal with dependencies well. FLAME and PLASMA both focus on dense linear algebra, with the former working at a higher level of abstraction (much in the spirit of object-oriented programming), while the target domain for SMPSs is more general. In the last years, specialized hardware accelerators such as graphics processors (GPUs), Field Programmable Gate Arrays (FPGAs), and the Cell B.E. have also attracted the interest of the developers of dense linear algebra libraries [25, 24, 8, 3, 2,, 37, 38]. Squeezing these architectures for performance is revealing itself as a task of complexity similar to that of developing a highly tuned implementation of the BLAS for a sequential processor, which typically requires very low-level coding. The next evolutionary step has been the construction and use of systems with multiple accelerators: NVIDIA offers nodes with multiple Tesla processors (GPUs) in the Tesla series which can be connected via PCI-Express to a workstation and AMD/ATI has recently built a similar system, IBM Cell B.E. processors are currently available in the form of blades or PCI-Express accelerator boards, and ClearSpeed PCI-Express boards are furnished with 2 CSX6 processors. The natural question that arises at this point is how to program these multi-accelerator platforms. For systems with multiple GPUs, a possibility explored in [37, 38] is to distribute the data among the video memory of the GPUs 2

2 and code in a message-passing style similar to that of the libraries ScaLAPACK and PLAPACK [4, 36]. We identify two hurdles for this approach: While the state-of-the-art numerical methods have not changed, following this approach will require a complete rewrite of dense linear algebra libraries (alike the redesign of LAPACK for parallel distributed-memory architectures that was done in the ScaLAPACK and PLAPACK projects). Therefore, a large programming effort and a considerable amount of funding will be necessary to cover a functionality like that of LAPACK. Note that coding at such low level can be quite complex and experts are in short supply. The product that is obtained as a result of this style of programming will likely exhibit a parallelism similar to that of libraries based on multithreaded implementations of the BLAS and far from that demonstrated by the dynamic scheduling techniques in the FLAME, PLASMA, and SMPSs projects. While lookahead techniques [34] can increase the scalability of this solution to a certain extent, they do so at the cost of a much added complexity. Our approach in this context is fundamentally different. In previous papers [, 2, 3, 3, 3, 32, 33], we gave an overview of software tools and methods developed as part of the FLAME project, and we show how, when applied to a platform with multiple cores/processors, they provide an out-of-the-box solution that attains high performance relatively effortlessly. The key lies in maintaining a separation of concern between the code and target architecture by leaving the parallel execution of the operation in the hands of a runtime system. The advantages of this approach are twofold: When a new platform appears, it is only the runtime system that needs to be adapted. The routines in the library, which reflect the numerical algorithms, do not need to be modified. The parallelism is orchestrated by a runtime system which can be adapted to exploit different architectures. While still focused on the programmability of the solution, this paper makes the following new contributions: We target a fundamentally different architecture, consisting of a multicore processor connected to multiple hardware accelerators, which features properties of an heterogenous distributedmemory multiprocessor. This architecture model is representative of a platform consisting of workstation connected to multiple NVIDIA or AMD/ATI GPUs, IBM Cell B.E. blades, Clear- Speed boards, etc. We give a practical demonstration that the FLAME programming model easily accommodates for this generic multi-accelerator architecture, while not requiring a significant modification of the contents of the library. We describe how we tailor the runtime system for this generic multi-accelerator architecture by incorporating software implementations of cache/memory coherence techniques from SMP platforms. Altogether, these techniques provide a software distributed-shared memory (DSM) layer, which allows to view the multi-accelerator architecture as a shared-memory multiprocessor. Our experimental results show that the presence of this layer does not decrease performance for large problems. We report high performance on an NVIDIA Tesla multi-gpu platform with four G8 processors: A single-precision peak performance of 55 (= 55 9 floating-point arithmetic operations, or flops, per second) is attained for the matrix-matrix product using four G8 processors. 
Compared with the best implementation of the matrix-matrix product on a single G8 GPU (that of CUBLAS 2., based on the implementation in [37, 38]), and measuring the time to transfer the data in both cases, a super-linear speed-up of 5.5 is obtained for the largest problem size. Overall, a (single-precision) peak performance of 46 is attained on the Tesla platform for a more elaborate matrix operation, the Cholesky factorization, with complex data dependencies. The rest of the paper is structured as follows. Section 2 employs the Cholesky factorization of a dense matrix to offer a brief overview of FLAME, the key to easy development of highperformance dense linear algebra libraries that underlies our approach for multi-accelerator platforms. Section 3 describes how the tools in FLAME accommodate for the parallel execution of dense linear algebra codes on these platforms almost effortlessly. More elaborate techniques are presented in Section 4 together with their corresponding performance results. Finally, a few concluding remarks summarize the results in Section The FLAME Approach to Developing Dense Linear Algebra Libraries Following [], in this paper we will consider the Cholesky factorization of an n n symmetric positive definite matrix A to illustrate our approach. In this operation, the matrix is decomposed into the product A = LL T,whereLis the n n lower triangular Cholesky factor. (Alternatively, A can be decomposed as A = U T U, with U being upper triangular.) In traditional algorithms for this factorization, L overwrites the lower triangular part of A while the strictly upper triangular part remains unmodified. Here, we denote this as A := {L\A = CHOL(A). Key elements of FLAME are the high-level notation for expressing algorithms for dense and banded linear algebra operations, the formal derivation methodology to obtain provably correct algorithms, and the high-level application programming interfaces (APIs) to transform the algorithms into codes. FLASH and SuperMatrix are also important components of FLAME that address storage of matrices by blocks and automatic decomposition of linear algebra codes into tasks and dynamic scheduling of these tasks to multithreaded architectures (basically, SMP and multicore processors). In this section, We briefly review these elements. 2. FLAME: Formal Linear Algebra Methods Environment The fundamental innovation that enabled FLAME is the notation for expressing dense and banded linear algebra algorithms. Figure (left) shows a blocked algorithm for computing the Cholesky factorization using the FLAME notation. The formal derivation methodology consists of a series of steps which, when systematically applied, yield families of algorithms (multiple algorithmic variants) for computing an operation [2, 2, 6]. The significance of this for scientific computing is that often different algorithmic variants deliver higher performance on different platforms and/or problem sizes [7, 29]. This derivation of algorithms has also been made mechanical [5]. The FLAME/C API for the C programming language captures the notation in which we express our algorithms. Using this API, the blocked algorithm on the left of Figure can be transformed into the C code on the right of that figure. Note the close resemblance between algorithm and code. As indentation plays an importing role in making the FLAME/C code look like the algorithm, we recommend the use of a high-level mechanical tool like the SPARK webpage ( Spark/) which automatically yields a code skeleton. 22

3 Algorithm: A := CHOL BLK VAR(A) «ATL A TR Partition A A BL A BR where A TL is while m(a TL ) <m(a) do Determine block size b Repartition «A ATL A A A 2 A A BL A A A 2 A BR A 2 A 2 A 22 where A is b b A := {L\A = CHOL UNB(A ) A 2 := L 2 = A 2 L T A 22 := A 22 L 2 L T 2 = A 22 A 2 A T 2 Continue with ATL A TR A BL A BR endwhile «A A A A A A 2 A A 2 A 2 A 22 FLA_Error FLASH_Chol_by_blocks_var( FLA_Obj A ) { FLA_Obj ATL, ATR, A, A, A2, ABL, ABR, A, A, A2, A2, A2, A22; FLA_Part_2x2( A, &ATL, &ATR, &ABL, &ABR,,, FLA_TL ); while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) { FLA_Repart_2x2_to_3x3( ATL, /**/ ATR, &A, /**/ &A, &A2, /* ************* */ /* ******************** */ &A, /**/ &A, &A2, ABL, /**/ ABR, &A2, /**/ &A2, &A22,,, FLA_BR ); FLA_Chol_unb_var( FLASH_MATRIX_AT( A ) ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, A2 ); FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A2, FLA_ONE, A22 ); FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR, A, A, /**/ A2, A, A, /**/ A2, /* *************** */ /* ****************** */ &ABL, /**/ &ABR, A2, A2, /**/ A22, FLA_TL ); return FLA_SUCCESS; Figure. Blocked algorithm for computing the Cholesky factorization (left) and the corresponding FLAME/C implementation (right). 2.2 Storage-by-blocks using FLASH Algorithms-by-blocks [7] view matrices as collections of submatrices and express their computation in terms of these submatrix blocks. Algorithms are then written as before, except with scalar operations replaced by operations on the blocks. Although a number of solutions have been proposed to solve this problem [22, 35, 39], none of these have yielded a consistent methodology that allows the development of high-performance libraries with functionality that rivals those of LAPACK or FLAME. The problem is primarily one of programmability. Our approach to the problem views the matrix as a matrix of smaller matrices using the FLASH API. This view thus yields a matrix hierarchy, potentially with multiple levels. Code for an algorithm-by-blocks for the Cholesky factorization using the FLASH API is given in Figure 2 (left). It may seem that the complexity of the algorithm is merely hidden in the routines FLASH Trsm and FLASH Syrk. The abbreviated implementation of an algorithm-by-blocks for the former is given in Figure 2 (right) while the latter routine has a similar implementation. The reader can see here that many of the details of the FLASH implementation have been buried within the FLASH-aware FLAME object definition. 2.3 SuperMatrix runtime system SuperMatrix extracts the parallelism at a high level of abstraction, decomposing the operation into tasks, identifying the dependencies among these, scheduling them for execution when ready (all operands available/dependencies fulfilled), and mapping tasks to execution units (cores/accelerators) taking into account the target platform. All of this is done without exposing any of the details of the parallelization to the application programmer. The success of this approach has been previously reported in a number of papers [, 2, 3, 3, 3, 32, 33]. Further details on the operation of SuperMatrix will be illustrated in the next two sections as the strategy to adapt it to a multiaccelerator platform is exposed. 3. Adapting FLAME to Platforms with Multiple Accelerators Much work on NVIDIA G8 graphics processors and the IBM Cell B.E. view these accelerators as multicore architectures [37, 38, 25] and exploit the parallelism at this level. 
Our approach is different in that we view one of these accelerators as the equivalent of a single core, for which a tuned serial implementation of (specific kernels of the level 3) BLAS is available; our analog of a multicore processor is then a system with multiple accelerators. We therefore exploit parallelism at two levels: at a high level, the presence of multiple accelerators (G8 processors or Cell B.E.) is addressed by SuperMatrix. At the low level, parallelism within the 28 microcores of a G8 or the 8 SPUs of a single Cell B.E. is extracted by the BLAS. We hereafter do not pursue further this second level of parallelism and assume the existence of a tuned implementation of the BLAS. Our generic multi-accelerator platform consists of a workstation, possibly (but not necessarily) with a multicore CPU, connected to multiple hardware accelerators through a fast interconnect. Processors in the accelerator boards are passive elements that simply wait to be ordered what to do. The workstation RAM (simply RAM from now on) and the memory in the accelerator boards are independent and no hardware memory coherence mechanism is in place (though having one would certainly benefit our approach, as will be reported in the experiments). Communication between the CPU and the accelerators is done via explicit data copies be- 23

4 FLA_Error FLASH_Chol_by_blocks_var( FLA_Obj A ) { FLA_Obj ATL, ATR, A, A, A2, ABL, ABR, A, A, A2, A2, A2, A22; FLA_Part_2x2( A, &ATL, &ATR, &ABL, &ABR,,, FLA_TL ); while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) { FLA_Repart_2x2_to_3x3( ATL, /**/ ATR, &A, /**/ &A, &A2, /* ************* */ /* ******************** */ &A, /**/ &A, &A2, ABL, /**/ ABR, &A2, /**/ &A2, &A22,,, FLA_BR ); FLA_Chol_unb_var( FLASH_MATRIX_AT( A ) ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, A2 ); FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A2, FLA_ONE, A22 ); FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR, A, A, /**/ A2, A, A, /**/ A2, /* *************** */ /* ****************** */ &ABL, /**/ &ABR, A2, A2, /**/ A22, FLA_TL ); return FLA_SUCCESS; void FLASH_Trsm_rltn( FLA_Obj alpha, FLA_Obj L, FLA_Obj B ) /* Special case with mode parameters FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG,... ) Assumption: L consists of one block and B consists of a column of blocks */ { FLA_Obj BT, B, BB, B, B2; FLA_Part_2x( B, &BT, &BB,, FLA_TOP ); while ( FLA_Obj_length( BT ) < FLA_Obj_length( B ) ) { FLA_Repart_2x_to_3x( BT, &B, /* ** */ /* ** */ &B, BB, &B2,, FLA_BOTTOM ); FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, alpha, FLASH_MATRIX_AT( L ), FLASH_MATRIX_AT( B ) ); FLA_Cont_with_3x_to_2x( &BT, B, B, /* ** */ /* ** */ &BB, B2, FLA_TOP ); Figure 2. FLASH implementation of the Cholesky factorization and the corresponding triangular system solve. tween memories. Communication between two accelerators is only possible through the RAM and is handled by the CPU. This abstract model is general enough to accommodate a workstation connected to a multi-gpu platform or containing multiple boards with Cell B.E. or ClearSpeed processors. The SuperMatrix runtime computes the Cholesky factorization by executing the algorithm-by-blocks in Figure 2 (left) in two stages, both executed at run time. During the analysis stage, asin- gle thread symbolically executes the algorithm code, but instead of computing operations immediately as they are encountered, it simply annotates these in a queue of pending tasks. This happens inside the calls to FLA Chol unb var, FLA Trsm, FLA Syrk, and FLA Gemm encountered in routines FLASH Chol by blocks var, FLASH Trsm, andflash Syrk. As operations are encountered in the code, tasks are enqueued, dependencies are identified, and a DAG (directed acyclic graph) that contains all the dependencies among operations of the overall problem is constructed. To illustrate the outcome of this first stage, the execution of the analysis when the code in Figure 2 is used to factorize the 3 3 blocked matrix A Ā Ā Ā 2 Ā Ā Ā 2 Ā 2 Ā 2 Ā 22 C A, () results in the DAG implicitly contained in Figure 3. Once the DAG is constructed, the dispatch stage commences. In the SuperMatrix runtime for multithreaded architectures, idle threads monitor the queue of pending tasks until they find a task ready for execution (that is, an operation with all operands available), compute it, and upon completion, update the dependency information in the queue. It is the part of the runtime system responsible for the execution of this second stage that we tailor for multiaccelerator platforms as described next, while the part in charge of the analysis remains unmodified. Operation/Result In In/out. Ā := CHOL(Ā) Ā 2. Ā := Ā TRIL(Ā) T Ā Ā 3. Ā2 := Ā2 TRIL(Ā) T Ā Ā 2 4. Ā := Ā ĀĀT Ā Ā 5. Ā2 := Ā2 Ā2ĀT Ā 2 Ā Ā 2 6. 
Ā22 := Ā22 Ā2ĀT 2 Ā 2 Ā Ā := CHOL(Ā ) Ā 8. Ā 2 := Ā 2 TRIL(Ā ) T Ā Ā 2 9. Ā22 := Ā22 Ā2ĀT 2 Ā 2 Ā 22. Ā22 := CHOL(Ā22) Ā 22 Figure 3. An illustration of the DAG resulting from the execution of the SuperMatrix analysis stage for the Cholesky factorization of a 3 3 matrix of blocks in () using the algorithmby-blocks FLASH Chol by blocks var. The -marks denote those operands that are initially available (i.e., those operands that are not dependent upon other operations). Specifically, in our basic implementation we run as many threads in the CPU as accelerators are present in the system. When a thread encounters a ready task, it copies the data associated with the operation to the memory of the accelerator, orders it to compute the operation using the appropriate BLAS kernel, and transfers the results back to RAM. We are exposing here a hybrid model of execution where the CPU is responsible for scheduling tasks to the accelerators while tracking dependencies, and the accelerators per- 24

5 form the actual computations. In this hybrid model, tasks that are considered not suitable for execution in the accelerator (due, e.g., to their low complexity or the lack of the appropriate BLAS kernel) can be executed in the CPU. (Hybrid CPU/GPU computation has been previously explored in [3, 2,, 37, 38].) Given that the major computational cost is performed by the accelerators in this scheme, the existence of multiple cores in the CPU, though advisable, is not necessary. Obviously, this basic implementation incurs an undesirable high amount of data transfers between RAM and the memories of the accelerators so that, unless the cost of communication is negligible, it will surely attain a low practical performance (at this point, we encourage the reader to have a quick glimpse at the line labeled as Basic implementation in Figure 4). In the following section we improve the mechanism by including software cache and memory coherence techniques to reduce the number of transfers. 4. Improving the Performance 4. Cache and memory coherence Standard policies in computer architecture to maintain the coherence between data in the cache of a processor and the main memory are write-through (writes to data are immediately propagated to main memory) and write-back (data in the main memory is updated only when the cache line where the modified data lie is replaced) [23]. On shared-memory multiprocessors, policies to maintain coherence among the caches of the processors are write-update (writes to data by one of the processors are immediately propagated to the copies in the caches of the remaining processors) and writeinvalidate (writes to data by one of the processors invalidate copies of that cache line in the remaining processors) [23]. These policies all aim at reducing the number of data transfers between the cache of the processors and the main memory. Now, at a high level of abstraction, a shared-memory multiprocessor is similar to a workstation connected to multiple accelerators. Each one of the accelerators is the equivalent of one processor with the memory of the accelerator playing the role of the processor cache. The workstation RAM is then the analog of the sharedmemory in the multiprocessor. It is not surprising then that we can employ software implementations of standard coherence policies to reduce the number of data transfers between the memory of the accelerators and the RAM of the workstation. 4.2 Application to the multi-accelerator platform The target platform used in the experiments was an NVIDIA Tesla S87 computing system with four NVIDIA G8 GPUs and 6 GBytes of DDR3 memory (.5 GBytes per GPU), which exhibits a theoretical peak performance close to 4 in singleprecision. The Tesla system is connected to a workstation with two Intel Xeon QuadCore E545 processors executing at 2. GHz with 9 GBytes of DDR2 RAM. The Intel 54 chipset provides two PCI- Express Gen2 interfaces, for a peak bandwidth of 48 Gbits/second on each interface, to connect with the Tesla. NVIDIA CUBLAS (version 2.) built on top of the CUDA API (version 2.) together with NVIDIA driver (7.5) were used in our tests. MKL.. was employed for all computations performed in the Intel Xeon cores. Single precision was employed in all experiments. When reporting the rate of computation, we consider the cost of the matrix-matrix product and the Cholesky factorization to be the standard 2n 3 and n 3 /3 flops, respectively, for square matrices of order n. 
The rate is computed as the number of flops divided by t 9,wheret equals the elapsed time in seconds. The cost of all data transfers between RAM and GPU memories is included in the timings Performance of runtime systems for the matrix-matrix product D. Write-back C. Cache + write-invalidate B. 2D + write-through A. Basic implementation Performance of runtime systems for the Cholesky factorization D. Write-back C. Cache + write-invalidate B. 2D + write-through A. Basic implementation Figure 4. Performance of blocked algorithms for the matrixmatrix product (top) and the Cholesky factorization (bottom) using variants A, B, C, and D of the runtime system and the four G8 processors of the Tesla S87. The top and bottom plots in Figure 4 respectively report the performance of a blocked implementation of the matrix-matrix product and the blocked algorithm for the Cholesky factorization in Figure 2 (right), using several variants of the SuperMatrix runtime system and all four G8 processors of the Tesla. Unless otherwise stated, the enhancements described next are incremental so that a variant includes a new strategy plus those of all previous ones. Although we describe the differences between variants using mostly examples from the Cholesky factorization, the same holds for the matrix-matrix product. Four variants are evaluated in the figure: A. Basic implementation: This variant corresponds to the implementation of the runtime system described in Section 3. In the matrix-matrix product all operations are performed in the G8 processors. For the Cholesky factorization, the diagonal blocks are factorized by the Xeon cores of the CPU while all remaining 25

6 Ā Ā Ā Ā 2 Ā 2 Ā 22 Ā 3 Ā 3 Ā 32 Ā 33 C A G G G G G G G G G G Figure 5. Cyclic 2-D mapping of the blocks in the lower triangular part of a 4 4 blocked mapping to the four G8 processors: G, G, G,andG. computations (matrix-matrix products, symmetric rank-k updates, and triangular system solves) are performed in the G8 processors. FLASH provides transparent storage-by-blocks for the data matrix with one level of hierarchy. The block size is adjusted experimentally. B. 2-D + write-through: In order to improve data locality (and therefore reduce the costly data transfers between the memory of the GPUs), workload is distributed following a cyclic 2- D mapping of the data matrix to a 2 2 logical grid of the G8s; see Figure 5. (bidimensional workload distribution in the context of shared-memory multiprocessors has been previously investigated in [28].) In this scheme all operations that compute results which overwrite a given block are mapped to the same G8 processor. Thus, e.g., in the Cholesky factorization the updates Ā2 := Ā2 Ā2ĀT and Ā2 := Ā2TRIL(Ā) T are both performed in G. Blocks are thus classified from the viewpoint of a G8 processor into proprietary (owned = written by it; owner-computes rule) and non-proprietary. Initially all data blocks reside in the RAM and the memory of the GPUs is empty. When a task is to be computed in a G8 processor, blocks which are not already there are copied to the GPU memory. Proprietary blocks remain in that memory for the rest of the execution of the algorithm while non-proprietary blocks are discarded as soon as the operation is completed. A write-through policy is implemented in software to maintain the coherence between the proprietary blocks in the memory of the GPU and the RAM so that any update of a proprietary block is immediately propagated to the RAM. There is no need to maintain the coherence between the memory of the GPUs and the RAM for non-proprietary blocks as these are readonly blocks. Following the previous example for the Cholesky factorization, when the task which performs the update Ā2 := Ā 2 Ā 2 Ā T is to be computed at G, blocksā 2, Ā 2, and Ā are copied to the memory of this GPU; the update is computed and the new contents of Ā 2 are propagated to RAM. Block Ā 2 then remains in the GPU memory while the contents of Ā 2 and Ā are discarded. Latter, when Ā 2 := Ā 2TRIL(Ā) T is to be computed, only Ā is copied to the GPU memory as Ā2 is already there. Once this second update is computed, following the write-through policy the updated contents of Ā2 are sent back to RAM and Ā is discarded. Other workload distributions (block row-wise, block columnwise and cyclic variants) are easily supported by the runtime system and, more important, are transparent to the developer of the algorithms. In our experiments, no major differences were found for the matrix-matrix product and the Cholesky factorization between the performance of the (cyclic) 2-D workload distribution reported in the figure and those of cyclic block rowwise/column-wise layouts. C. Cache + write-invalidate: The previous strategy reduces the number of transfers from RAM to GPU memory of blocks that are modified, but still produces a large amount of transfers of read-only blocks. In this variant we implement a software cache of read-only blocks in each GPU memory to maintain recently used blocks. 
With this mechanism in place for the Cholesky factorization, e.g., when G solves the linear systems Ā := Ā TRIL(Ā) T and Ā3 := Ā3TRIL(Ā) T,acopyof Ā is transferred from RAM to the cache in the GPU memory before the first linear system is solved and remains there for the solution of the second linear system, saving a second transfer. To complement the cache system, when a task which updates a given block is completed, the thread in the CPU in charge of its execution invalidates all read-only copies of that block in the memory of the remaining GPUs (write-invalidate policy). The replacement policy, currently LRU (least recently used first), and the number of blocks per cache can be easily modified in the runtime system. D. Write-back: The purpose now is to reduce the number of transfers from the memory of the GPUs to RAM that occur when (proprietary) blocks are updated by the G8 processors. For this, write-through is abandoned in favor of a write-back policy which allows inconsistencies between proprietary blocks in the memory of the GPUs and the RAM. Thus, blocks written by a G8 processor are updated in the RAM only when a different G8 (or the GPU) is to compute with them. (Software cache for read-only blocks and the write-invalidate policy are still in place.) When the execution of the complete algorithm is terminated, the data matrix in the RAM must be updated with the contents of the blocks that have been updated in the memory of the GPU. In summary, the coherence policies and use of cache implemented in variants B, C, and D all aim at reducing the number of data transfers among the different memories present in the system (minimize communication time) while the 2-D distribution pursues a balanced distribution of the work load. A trace of the execution reveals that, from variant A to D, (the number of) block transfers from RAM to the memory of GPUs for the largest problem sizes is reduced from 496/3276 to 256/243 for the matrix-matrix product/cholesky factorization, while the block transfers in the opposite direction are reduced from 2288/44 to 28/89. Hereafter all results will refer to variant D of the runtime system. In Figure 6 we compare the performances of the algorithms-byblocks for the matrix-matrix product and Cholesky factorization with those of optimized implementations of these operations on current high-performance platforms: Algorithms-by-blocks on four G8 processors: Our algorithmsby-blocks for the two operations, combined with variant D of the runtime system, and executed on the Tesla platform using the four G8 processors. Algorithms-by-blocks on a single G8 processor: Our algorithms-by-blocks for the matrix-matrix product and Cholesky factorization executed on a single G8 processor of the Tesla platform. CUBLAS sgemm on a single G8 processor: Implementation of this routine in CUBLAS 2. and executed on a single G8 processor. To be consistent with the previous two algorithms, the time to transfer the data from RAM to the memory of the GPUs and retrieve the results is included. MKL sgemm/spotrf on two Intel Xeon QuadCore: Multithreaded MKL.. implementation of the corresponding BLAS/LAPACK routines executed on all eight cores of a workstation with two Xeon Quad-Core processors (details given at the beginning of the section). 26

7 Performance of the matrix-matrix Product on GPU/CPU Algorithm-by-blocks on four G8 processors Algorithm-by-blocks on a single G8 processor CUBLAS sgemm on a single G8 processor MKL sgemm on two Intel Xeon QuadCore (8 cores) 6 5 n=892 n=644 n=496 n=248 Scalability of the matrix-matrix product Number of G8 processors Performance of the Cholesky factorization on GPU/CPU Algorithm-by-blocks on four G8 processors Algorithm-by-blocks on a single G8 processor MKL spotrf on two Intel Xeon QuadCore (8 cores) Figure 6. Performance of the algorithms-by-blocks for the matrixmatrix product (top) and the Cholesky factorization (bottom) using four G8 processors, the implementation of the matrix-matrix product in CUBLAS 2. on one G8 processor, and the implementation of both operations in MKL.. using all eight Xeon cores. The results show that the Tesla S87 combined with the algorithmby-blocks offers a notable rate when compared with the multicore architecture. Figures 7 and 8 evaluate the scalability and report the speedup of the algorithm-by-blocks. No bottlenecks are revealed in the scalability experiment: the performance of the system steadily improves as the number of G8 processors is increased and larger problem sizes report higher execution rates. The speed-ups are calculated comparing the performance attained by the algorithms-byblocks using 2 4 G8 processors with that of executing same algorithm on a single G8 processor. For the largest problem size of the matrix-matrix product, speed-ups of.83, 2.5, and 3.2 are attained using 2, 3, and 4 G8 processors. Compared with the implementation of the matrix-matrix product in CUBLAS 2., and n=248 n=6384 n=2288 n=892 n=496 Scalability of the Cholesky factorization Number of G8 processors Figure 7. Scalability of the algorithms-by-blocks for the matrixmatrix product (top) and the Cholesky factorization (bottom) using, 2, 3, and 4 G8 processors. including the time of data transfer between RAM and GPU, the corresponding super-linear speed-ups are 3.4, 4.3, and 5.5. For the largest Cholesky factorization, the results show speed-ups of.83, 2.55, and 3.25 using respectively 2, 3, and 4 G8 processors. 5. Conclusions In this paper we have shown how separation of concerns leads to great flexibility while reducing complexity when porting representative dense linear algebra algorithms to novel architectures. By separating the API for coding algorithms-by-blocks, the part of the runtime system that builds a DAG of operations and tracks the dependencies, and the architecture-aware part of the runtime system that executes operations with blocks, different scheduling heuristics were shown to be easy to implement, allowing customization to what otherwise would have been a hostile environment: a workstation connected to a multi-gpu accelerator. The particular difficulty 27

8 Speed-up G8 processors 3 G8 processors 2 G8 processors Speed-up of the matrix-matrix product Additional information For additional information on FLAME visit utexas.edu/users/flame/. Acknowledgments The researchers at the Universidad Jaime I were supported by projects CICYT TIN C2-2 and FEDER, and PB and PB of the Fundación Caixa-Castellón/Bancaixa and UJI. We thank NVIDIA for the generous donation of equipment that was used in the experiments. This research was partially sponsored by NSF grants CCF and CCF Additional support came from the J. Tinsley Oden Faculty Fellowship Research Program of the Institute for Computational Engineering and Sciences (ICES) at UT-Austin. Speed-up G8 processors 3 G8 processors 2 G8 processors Speed-up of the Cholesky factorization Figure 8. Speed-up of the algorithms-by-blocks for the matrixmatrix product (top) and the Cholesky factorization (bottom) using 2, 3, and 4 G8 processors. of the setting is the fact that the local memory of the GPU is not shared with the host making it necessary to carefully amortize the cost of data transfers. While the experiments on the paper discuss specifically the multi-gpu NVIDIA Tesla system, the techniques clearly are also applicable to a similar setting where a standard workstation is connected via a fast network to multiple ClearSpeed boards, IBM Cell B.E. accelerators, AMD/ATI GPUs, etc. Remarkable rates of execution are demonstrated for the matrixmatrix product and the Cholesky factorization operation. Similar results have been obtained for other important BLAS operations as the solution of triangular linear systems and the symmetric rank-k update. References [] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from berkeley. Technical Report UCB/EECS-26-83, EECS Department, University of California, Berkeley, Dec 26. [2] S. Barrachina, M. Castillo, F. Igual, and R. Mayo adn E. S. Quintana- Ortí. Solving dense linear systems on graphics processors. In Eds. E. Luque et al., editor, Proceedings of Euro-Par 28, number 568 in LNCS, pages Springer-Verlag Berlin Heidelberg, 28. [3] S. Barrachina, M. Castillo, F. D. Igual, R. Mayo, and E. S. Quintana- Ortí. Evaluation and tuning of the level 3 CUBLAS for graphics processors. In 9th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing PDSEC 8, 28. (CD-ROM). [4] Pieter Bellens, Josep M. Pérez, Rosa M. Badía, and Jesús Labarta. CellSs: a programming model for the Cell BE architecture. In Proceedings of the 26 ACM/IEEE conference on Supercomputing SC26, page 86, New York, NY, USA, 26. ACM Press. [5] Paolo Bientinesi. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms. PhD thesis, 26. [6] Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique S. Quintana-Ortí, and Robert A. van de Geijn. The science of deriving dense linear algebra algorithms. ACM Trans. Math. Soft., 3(): 26, March 25. [7] Paolo Bientinesi, Brian Gunter, and Robert A. van de Geijn. Families of algorithms related to the inversion of a symmetric positive definite matrix. ACM Transactions on Mathematical Software, 35():3, July 28. Article 3, 22 pages. [8] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. 
LAPACK Working Note 9 UT-CS-7-6, University of Tennessee, September 27. [9] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. Parallel tiled QR factorization for multicore architectures. LAPACK Working Note 9 UT-CS-7-598, University of Tennessee, July 27. [] Maribel Castillo, Ernie Chan, Francisco D. Igual, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, Robert van de Geijn, and Field G. Van Zee. Making parallel programming synonymous with programming for linear algebra libraries. FLAME Working Note #3 TR-8-2, The University of Texas at Austin, Department of Computer Sciences, April 29. [] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA 7: Proceedings of the Nineteenth ACM Symposium on Parallelism in 28

9 Algorithms and Architectures, pages 6 25, San Diego, CA, USA, June ACM. [2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana- Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In ACM SIGPLAN 28 symposium on Principles and Practices of Parallel Programming PPoPP 28, pages 23 32, 28. [3] Ernie Chan, Field G. Van Zee, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. Satisfying your dependencies with SuperMatrix. In Proceedings of IEEE Cluster Computing 27, pages 9 99, September 27. [4] J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, pages IEEE Comput. Soc. Press, 992. [5] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft., 6(): 7, March 99. [6] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft., 4(): 7, March 988. [7] Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kagstrom. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46():3 45, 24. [8] Nico Galoppo, Naga K. Govindaraju, Michael Henson, and Dinesh Manocha. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In Proceedings of the 25 ACM/IEEE conference on Supercomputing SC25, page 3, Washington, DC, USA, 25. IEEE Computer Society. [9] David Geer. Chip makers turn to multicore processors. Computer, 38(5): 3, 25. [2] John A. Gunnels. A Systematic Approach to the Design and Analysis of Parallel Dense Linear Algebra Algorithms. PhD thesis, Department of Computer Sciences, The University of Texas, December 2. [2] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal linear algebra methods environment. ACM Trans. Math. Soft., 27(4): , December 2. [22] Jia Guo, Ganesh Bikshandi, Basilio Fraguela, Maria Garzaran, and David Padua. Programming with tiles. In PPoPP 8: The 3th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Salt Lake City, UT, USA, 28. [23] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufman, 3rd edition, 23. [24] Jin Hyuk Junk and Dianne P. O Leary. Cholesky decomposition and linear programming on a GPU. Master s thesis, University of Maryland, College Park. [25] Jakub Kurzak, Alfredo Buttari, and Jack Dongarra. Solving systems of linear equations on the CELL processor using Cholesky factorization. LAPACK Working Note 84 UT-CS-7-596, University of Tennessee, May 27. [26] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Soft., 5(3):38 323, Sept [27] C. Leiserson and A. Plaat. Programming parallel applications in Cilk. SINEWS: SIAM News, 998. [28] Bryan A. Marker, Field G. Van Zee, Kazushige Goto, Gregorio Quintana-Ortí, and Robert A. van de Geijn. Toward scalable matrix multiply on multithreaded architectures. In European Conference on Parallel and Distributed Computing Euro-Par 27, pages , February 27. [29] Enrique S. Quintana-Ortí and Robert A. van de Geijn. Formal derivation of algorithms: The triangular Sylvester equation. ACM Trans. Math. 
Soft., 29(2):28 243, June 23. [3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert van de Geijn, and Field G. Van Zee. Design and scheduling of an algorithm-by-blocks for LU factorization on multithreaded architectures. FLAME Working Note #26 TR-7-5, The University of Texas at Austin, Department of Computer Sciences, September 27. [3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert van de Geijn, and Field G. Van Zee. Design of scalable dense linear algebra libraries for multithreaded architectures: the LU factorization. In Workshop on Multithreaded Architectures and Applications MTAAP 28, 28. CD-ROM. [32] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Field G. Van Zee, and Robert A. van de Geijn. Scheduling of QR factorization algorithms on SMP and multi-core architectures. In F. Spies D. El Baz, J. Bourgeois, editor, 6th Euromicro International Conference on Parallel, Distributed and Network-based Processing PDP 28, pages 3 3, 28. [33] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Alfredo Remón, and Robert van de Geijn. Supermatrix for the factorization of band matrices. FLAME Working Note #27 TR-7-5, The University of Texas at Austin, Department of Computer Sciences, September 27. [34] Peter Strazdins. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Technical Report TR- CS-98-7, Department of Computer Science, The Australian National University, Canberra 2 ACT, Australia, 998. [35] Vinod Valsalam and Anthony Skjellum. A framework for highperformance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concurrency and Computation: Practice and Experience, 4():85 84, 22. [36] Robert A. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press, 997. [37] Vasily Volkov and James Demmel. LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical Report UCB/EECS-28-49, EECS Department, University of California, Berkeley, May 28. [38] Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In SC 8: Proceedings of the 28 ACM/IEEE conference on Supercomputing, pages, Piscataway, NJ, USA, 28. IEEE Press. [39] David S. Wise, Jeremy D. Frens, Yuhong Gu, and Gregory A. Alexander. Language support for Morton-order matrices. In Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming PPoPP 2, pages 24 33, New York, NY, USA, 2. ACM Press. 29

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators. FLAME Working Note #32

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators. FLAME Working Note #32 Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators FLAME Working Note #32 Gregorio Quintana-Ortí Francisco D. Igual Enrique S. Quintana-Ortí Robert van de Geijn Abstract In a

More information

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Francisco D. Igual Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Universidad Jaime I de Castellón (Spain) Robert A. van de

More information

Scheduling of QR Factorization Algorithms on SMP and Multi-core Architectures

Scheduling of QR Factorization Algorithms on SMP and Multi-core Architectures Scheduling of Algorithms on SMP and Multi-core Architectures Gregorio Quintana-Ortí Enrique S. Quintana-Ortí Ernie Chan Robert A. van de Geijn Field G. Van Zee quintana@icc.uji.es Universidad Jaime I de

More information

Making Programming Synonymous with Programming for. Linear Algebra Libraries

Making Programming Synonymous with Programming for. Linear Algebra Libraries Making Programming Synonymous with Programming for Linear Algebra Libraries Maribel Castillo Ernie Chan Francisco D. Igual Rafael Mayo Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Robert van de Geijn

More information

Automatic Development of Linear Algebra Libraries for the Tesla Series

Automatic Development of Linear Algebra Libraries for the Tesla Series Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

Solving Large Dense Matrix Problems on Multi-Core Processors

Solving Large Dense Matrix Problems on Multi-Core Processors Solving Large Dense Matrix Problems on Multi-Core Processors Mercedes Marqués Gregorio Quintana-Ortí Enrique S. Quintana-Ortí Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071

More information

Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems

Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems Mercedes Marqués Gregorio Quintana-Ortí Enrique S. Quintana-Ortí Depto. de Ingeniería y Ciencia de Computadores Universidad

More information

Level-3 BLAS on a GPU: Picking the Low Hanging Fruit

Level-3 BLAS on a GPU: Picking the Low Hanging Fruit Level-3 BLAS on a GPU: Picking the Low Hanging Fruit Francisco D. Igual Gregorio Quintana-Ortí Depto. de Ingeniería y Ciencia de Computadores Universidad Jaume I 12.071 Castellón Spain {figualgquintan}@icc.uji.es

More information

FLAME Working Note #30

FLAME Working Note #30 FLAG@lab: An M-script API for Linear Algebra Operations on Graphics Processors Sergio Barrachina Maribel Castillo Francisco D. Igual Rafael Mayo Enrique S. Quintana-Ortí Depto. de Ingeniería y Ciencia

More information

Beautiful Parallel Code: Evolution vs. Intelligent Design

Beautiful Parallel Code: Evolution vs. Intelligent Design Beautiful Parallel Code: Evolution vs. Intelligent Design FLAME Working Note #34 Robert A. van de Geijn Department of Computer Sciences The University of Texas at Austin Austin, Texas 7872 rvdg@cs.utexas.edu

More information

FLAME Working Note #23

FLAME Working Note #23 SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures FLAME Working Note #23 Ernie Chan Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Robert van de Geijn Abstract

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

High performance matrix inversion of SPD matrices on graphics processors

High performance matrix inversion of SPD matrices on graphics processors High performance matrix inversion of SPD matrices on graphics processors Peter Benner, Pablo Ezzatti, Enrique S. Quintana-Ortí and Alfredo Remón Max-Planck-Institute for Dynamics of Complex Technical Systems

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Informe Técnico ICC 2-2-28 Solving Dense Linear Systems on Graphics Processors Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí Febrero de 28 Departamento

More information

Evaluation and Tuning of the Level 3 CUBLAS for Graphics Processors

Evaluation and Tuning of the Level 3 CUBLAS for Graphics Processors Evaluation and Tuning of the Level 3 CUBLAS for Graphics Processors Sergio Barrachina Maribel Castillo Francisco D. Igual Rafael Mayo Enrique S. Quintana-Ortí Depto. de Ingeniería y Ciencia de Computadores

More information

Transforming Linear Algebra Libraries: From Abstraction to Parallelism
