Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators

Size: px
Start display at page:

Download "Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators"

Transcription

1 Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Gregorio Quintana-Ortí Francisco D. Igual Enrique S. Quintana-Ortí Departamento de Ingeniería y Ciencia de Computadores Universidad Jaume I 2.7 Castellón, Spain {gquintan,figual,quintana@icc.uji.es Robert van de Geijn Department of Computer Sciences The University of Texas at Austin Austin, Texas 7872 rvdg@cs.utexas.edu Abstract In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further evidence that this approach solves the programmability problem for this domain by targeting a more complex architecture, composed of a multicore processor and multiple hardware accelerators (GPUs, Cell B.E., etc.), each with its own local memory, resulting in a platform more reminiscent of a heterogeneous distributed-memory system. In particular, we show that the FLAME programming model accommodates this new situation effortlessly so that no significant change needs to be made to the codebase. All complexity is hidden inside the SuperMatrix runtime scheduling mechanism, which incorporates software implementations of standard cache/memory coherence techniques in computer architecture to improve the performance. Our experimental evaluation on a Intel Xeon 8-core host linked to an NVIDIA Tesla S87 platform with four GPUs delivers peak performances around 55 and 45 (single-precision) for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations. Categories and Subject Descriptors D.m [Software]: Miscellaneous General Terms Algorithms, Performance Keywords GPUs, algorithms-by-blocks, dependency analysis, dynamic scheduling, out-of-order execution. Introduction The limitations of current VLSI technology and the desire to transform the ever-increasing number of transistors on a chip dictated by Moore s Law into faster computers has led most hardware manufacturers to design multicore processors and/or specialized hardware accelerators [9]. In response, the computer science community is beginning to embrace (explicit) parallel programming as the means to exploit the potential of the new architectures []. How Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PPoPP 9, February 4 8, 29, Raleigh, North Carolina, USA. Copyright c 29 ACM /9/2... $5. to program these new architectures easily and efficiently is the key that will determine their success or failure. Given that architectures have recently bifurcated and no vendor can even predict what design will dominate five to ten years from now, the design of flexible programming solutions is as important as it ever has been. Dense linear algebra has been traditionally used as a pioneering area to conduct research on the performance of new architectures and this continues with the recent advent of multicore processors and hardware accelerators like GPUs and Cell B.E. 
The traditional approach in this problem domain, inherited from the solutions adopted for shared-memory multiprocessors years ago, is based on the use of multithreaded implementations of the Basic Linear Algebra Subprograms (BLAS) [26, 6, 5]. Code for operations constructed in terms of the BLAS (e.g, for solving a linear system or a linear least-squares problem) extract all the parallelism at the BLAS level. Thus, the intricacies of efficiently utilizing the target architecture are hidden inside the BLAS, and the burden of its parallelization lies in the hands of a few experts with a deep knowledge of the architecture. More recently, the FLAME, PLASMA, and SMPSs projects [, 2, 3, 3, 3, 32, 33, 9, 8, 4] have advocated for a different approach, extracting the parallelism at a higher level, so that only a sequential tuned implementation of the BLAS is necessary and more parallelism is detected and exploited. Cilk [27] is a precursor of these projects that suffered from not being able to deal with dependencies well. FLAME and PLASMA both focus on dense linear algebra, with the former working at a higher level of abstraction (much in the spirit of object-oriented programming), while the target domain for SMPSs is more general. In the last years, specialized hardware accelerators such as graphics processors (GPUs), Field Programmable Gate Arrays (FPGAs), and the Cell B.E. have also attracted the interest of the developers of dense linear algebra libraries [25, 24, 8, 3, 2,, 37, 38]. Squeezing these architectures for performance is revealing itself as a task of complexity similar to that of developing a highly tuned implementation of the BLAS for a sequential processor, which typically requires very low-level coding. The next evolutionary step has been the construction and use of systems with multiple accelerators: NVIDIA offers nodes with multiple Tesla processors (GPUs) in the Tesla series which can be connected via PCI-Express to a workstation and AMD/ATI has recently built a similar system, IBM Cell B.E. processors are currently available in the form of blades or PCI-Express accelerator boards, and ClearSpeed PCI-Express boards are furnished with 2 CSX6 processors. The natural question that arises at this point is how to program these multi-accelerator platforms. For systems with multiple GPUs, a possibility explored in [37, 38] is to distribute the data among the video memory of the GPUs 2

2 and code in a message-passing style similar to that of the libraries ScaLAPACK and PLAPACK [4, 36]. We identify two hurdles for this approach: While the state-of-the-art numerical methods have not changed, following this approach will require a complete rewrite of dense linear algebra libraries (alike the redesign of LAPACK for parallel distributed-memory architectures that was done in the ScaLAPACK and PLAPACK projects). Therefore, a large programming effort and a considerable amount of funding will be necessary to cover a functionality like that of LAPACK. Note that coding at such low level can be quite complex and experts are in short supply. The product that is obtained as a result of this style of programming will likely exhibit a parallelism similar to that of libraries based on multithreaded implementations of the BLAS and far from that demonstrated by the dynamic scheduling techniques in the FLAME, PLASMA, and SMPSs projects. While lookahead techniques [34] can increase the scalability of this solution to a certain extent, they do so at the cost of a much added complexity. Our approach in this context is fundamentally different. In previous papers [, 2, 3, 3, 3, 32, 33], we gave an overview of software tools and methods developed as part of the FLAME project, and we show how, when applied to a platform with multiple cores/processors, they provide an out-of-the-box solution that attains high performance relatively effortlessly. The key lies in maintaining a separation of concern between the code and target architecture by leaving the parallel execution of the operation in the hands of a runtime system. The advantages of this approach are twofold: When a new platform appears, it is only the runtime system that needs to be adapted. The routines in the library, which reflect the numerical algorithms, do not need to be modified. The parallelism is orchestrated by a runtime system which can be adapted to exploit different architectures. While still focused on the programmability of the solution, this paper makes the following new contributions: We target a fundamentally different architecture, consisting of a multicore processor connected to multiple hardware accelerators, which features properties of an heterogenous distributedmemory multiprocessor. This architecture model is representative of a platform consisting of workstation connected to multiple NVIDIA or AMD/ATI GPUs, IBM Cell B.E. blades, Clear- Speed boards, etc. We give a practical demonstration that the FLAME programming model easily accommodates for this generic multi-accelerator architecture, while not requiring a significant modification of the contents of the library. We describe how we tailor the runtime system for this generic multi-accelerator architecture by incorporating software implementations of cache/memory coherence techniques from SMP platforms. Altogether, these techniques provide a software distributed-shared memory (DSM) layer, which allows to view the multi-accelerator architecture as a shared-memory multiprocessor. Our experimental results show that the presence of this layer does not decrease performance for large problems. We report high performance on an NVIDIA Tesla multi-gpu platform with four G8 processors: A single-precision peak performance of 55 (= 55 9 floating-point arithmetic operations, or flops, per second) is attained for the matrix-matrix product using four G8 processors. 
Compared with the best implementation of the matrix-matrix product on a single G8 GPU (that of CUBLAS 2., based on the implementation in [37, 38]), and measuring the time to transfer the data in both cases, a super-linear speed-up of 5.5 is obtained for the largest problem size. Overall, a (single-precision) peak performance of 46 is attained on the Tesla platform for a more elaborate matrix operation, the Cholesky factorization, with complex data dependencies. The rest of the paper is structured as follows. Section 2 employs the Cholesky factorization of a dense matrix to offer a brief overview of FLAME, the key to easy development of highperformance dense linear algebra libraries that underlies our approach for multi-accelerator platforms. Section 3 describes how the tools in FLAME accommodate for the parallel execution of dense linear algebra codes on these platforms almost effortlessly. More elaborate techniques are presented in Section 4 together with their corresponding performance results. Finally, a few concluding remarks summarize the results in Section The FLAME Approach to Developing Dense Linear Algebra Libraries Following [], in this paper we will consider the Cholesky factorization of an n n symmetric positive definite matrix A to illustrate our approach. In this operation, the matrix is decomposed into the product A = LL T,whereLis the n n lower triangular Cholesky factor. (Alternatively, A can be decomposed as A = U T U, with U being upper triangular.) In traditional algorithms for this factorization, L overwrites the lower triangular part of A while the strictly upper triangular part remains unmodified. Here, we denote this as A := {L\A = CHOL(A). Key elements of FLAME are the high-level notation for expressing algorithms for dense and banded linear algebra operations, the formal derivation methodology to obtain provably correct algorithms, and the high-level application programming interfaces (APIs) to transform the algorithms into codes. FLASH and SuperMatrix are also important components of FLAME that address storage of matrices by blocks and automatic decomposition of linear algebra codes into tasks and dynamic scheduling of these tasks to multithreaded architectures (basically, SMP and multicore processors). In this section, We briefly review these elements. 2. FLAME: Formal Linear Algebra Methods Environment The fundamental innovation that enabled FLAME is the notation for expressing dense and banded linear algebra algorithms. Figure (left) shows a blocked algorithm for computing the Cholesky factorization using the FLAME notation. The formal derivation methodology consists of a series of steps which, when systematically applied, yield families of algorithms (multiple algorithmic variants) for computing an operation [2, 2, 6]. The significance of this for scientific computing is that often different algorithmic variants deliver higher performance on different platforms and/or problem sizes [7, 29]. This derivation of algorithms has also been made mechanical [5]. The FLAME/C API for the C programming language captures the notation in which we express our algorithms. Using this API, the blocked algorithm on the left of Figure can be transformed into the C code on the right of that figure. Note the close resemblance between algorithm and code. As indentation plays an importing role in making the FLAME/C code look like the algorithm, we recommend the use of a high-level mechanical tool like the SPARK webpage ( Spark/) which automatically yields a code skeleton. 22

3 Algorithm: A := CHOL BLK VAR(A) «ATL A TR Partition A A BL A BR where A TL is while m(a TL ) <m(a) do Determine block size b Repartition «A ATL A A A 2 A A BL A A A 2 A BR A 2 A 2 A 22 where A is b b A := {L\A = CHOL UNB(A ) A 2 := L 2 = A 2 L T A 22 := A 22 L 2 L T 2 = A 22 A 2 A T 2 Continue with ATL A TR A BL A BR endwhile «A A A A A A 2 A A 2 A 2 A 22 FLA_Error FLASH_Chol_by_blocks_var( FLA_Obj A ) { FLA_Obj ATL, ATR, A, A, A2, ABL, ABR, A, A, A2, A2, A2, A22; FLA_Part_2x2( A, &ATL, &ATR, &ABL, &ABR,,, FLA_TL ); while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) { FLA_Repart_2x2_to_3x3( ATL, /**/ ATR, &A, /**/ &A, &A2, /* ************* */ /* ******************** */ &A, /**/ &A, &A2, ABL, /**/ ABR, &A2, /**/ &A2, &A22,,, FLA_BR ); FLA_Chol_unb_var( FLASH_MATRIX_AT( A ) ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, A2 ); FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A2, FLA_ONE, A22 ); FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR, A, A, /**/ A2, A, A, /**/ A2, /* *************** */ /* ****************** */ &ABL, /**/ &ABR, A2, A2, /**/ A22, FLA_TL ); return FLA_SUCCESS; Figure. Blocked algorithm for computing the Cholesky factorization (left) and the corresponding FLAME/C implementation (right). 2.2 Storage-by-blocks using FLASH Algorithms-by-blocks [7] view matrices as collections of submatrices and express their computation in terms of these submatrix blocks. Algorithms are then written as before, except with scalar operations replaced by operations on the blocks. Although a number of solutions have been proposed to solve this problem [22, 35, 39], none of these have yielded a consistent methodology that allows the development of high-performance libraries with functionality that rivals those of LAPACK or FLAME. The problem is primarily one of programmability. Our approach to the problem views the matrix as a matrix of smaller matrices using the FLASH API. This view thus yields a matrix hierarchy, potentially with multiple levels. Code for an algorithm-by-blocks for the Cholesky factorization using the FLASH API is given in Figure 2 (left). It may seem that the complexity of the algorithm is merely hidden in the routines FLASH Trsm and FLASH Syrk. The abbreviated implementation of an algorithm-by-blocks for the former is given in Figure 2 (right) while the latter routine has a similar implementation. The reader can see here that many of the details of the FLASH implementation have been buried within the FLASH-aware FLAME object definition. 2.3 SuperMatrix runtime system SuperMatrix extracts the parallelism at a high level of abstraction, decomposing the operation into tasks, identifying the dependencies among these, scheduling them for execution when ready (all operands available/dependencies fulfilled), and mapping tasks to execution units (cores/accelerators) taking into account the target platform. All of this is done without exposing any of the details of the parallelization to the application programmer. The success of this approach has been previously reported in a number of papers [, 2, 3, 3, 3, 32, 33]. Further details on the operation of SuperMatrix will be illustrated in the next two sections as the strategy to adapt it to a multiaccelerator platform is exposed. 3. Adapting FLAME to Platforms with Multiple Accelerators Much work on NVIDIA G8 graphics processors and the IBM Cell B.E. view these accelerators as multicore architectures [37, 38, 25] and exploit the parallelism at this level. 
Our approach is different in that we view one of these accelerators as the equivalent of a single core, for which a tuned serial implementation of (specific kernels of the level 3) BLAS is available; our analog of a multicore processor is then a system with multiple accelerators. We therefore exploit parallelism at two levels: at a high level, the presence of multiple accelerators (G8 processors or Cell B.E.) is addressed by SuperMatrix. At the low level, parallelism within the 28 microcores of a G8 or the 8 SPUs of a single Cell B.E. is extracted by the BLAS. We hereafter do not pursue further this second level of parallelism and assume the existence of a tuned implementation of the BLAS. Our generic multi-accelerator platform consists of a workstation, possibly (but not necessarily) with a multicore CPU, connected to multiple hardware accelerators through a fast interconnect. Processors in the accelerator boards are passive elements that simply wait to be ordered what to do. The workstation RAM (simply RAM from now on) and the memory in the accelerator boards are independent and no hardware memory coherence mechanism is in place (though having one would certainly benefit our approach, as will be reported in the experiments). Communication between the CPU and the accelerators is done via explicit data copies be- 23

4 FLA_Error FLASH_Chol_by_blocks_var( FLA_Obj A ) { FLA_Obj ATL, ATR, A, A, A2, ABL, ABR, A, A, A2, A2, A2, A22; FLA_Part_2x2( A, &ATL, &ATR, &ABL, &ABR,,, FLA_TL ); while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) { FLA_Repart_2x2_to_3x3( ATL, /**/ ATR, &A, /**/ &A, &A2, /* ************* */ /* ******************** */ &A, /**/ &A, &A2, ABL, /**/ ABR, &A2, /**/ &A2, &A22,,, FLA_BR ); FLA_Chol_unb_var( FLASH_MATRIX_AT( A ) ); FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, A2 ); FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A2, FLA_ONE, A22 ); FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR, A, A, /**/ A2, A, A, /**/ A2, /* *************** */ /* ****************** */ &ABL, /**/ &ABR, A2, A2, /**/ A22, FLA_TL ); return FLA_SUCCESS; void FLASH_Trsm_rltn( FLA_Obj alpha, FLA_Obj L, FLA_Obj B ) /* Special case with mode parameters FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG,... ) Assumption: L consists of one block and B consists of a column of blocks */ { FLA_Obj BT, B, BB, B, B2; FLA_Part_2x( B, &BT, &BB,, FLA_TOP ); while ( FLA_Obj_length( BT ) < FLA_Obj_length( B ) ) { FLA_Repart_2x_to_3x( BT, &B, /* ** */ /* ** */ &B, BB, &B2,, FLA_BOTTOM ); FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, alpha, FLASH_MATRIX_AT( L ), FLASH_MATRIX_AT( B ) ); FLA_Cont_with_3x_to_2x( &BT, B, B, /* ** */ /* ** */ &BB, B2, FLA_TOP ); Figure 2. FLASH implementation of the Cholesky factorization and the corresponding triangular system solve. tween memories. Communication between two accelerators is only possible through the RAM and is handled by the CPU. This abstract model is general enough to accommodate a workstation connected to a multi-gpu platform or containing multiple boards with Cell B.E. or ClearSpeed processors. The SuperMatrix runtime computes the Cholesky factorization by executing the algorithm-by-blocks in Figure 2 (left) in two stages, both executed at run time. During the analysis stage, asin- gle thread symbolically executes the algorithm code, but instead of computing operations immediately as they are encountered, it simply annotates these in a queue of pending tasks. This happens inside the calls to FLA Chol unb var, FLA Trsm, FLA Syrk, and FLA Gemm encountered in routines FLASH Chol by blocks var, FLASH Trsm, andflash Syrk. As operations are encountered in the code, tasks are enqueued, dependencies are identified, and a DAG (directed acyclic graph) that contains all the dependencies among operations of the overall problem is constructed. To illustrate the outcome of this first stage, the execution of the analysis when the code in Figure 2 is used to factorize the 3 3 blocked matrix A Ā Ā Ā 2 Ā Ā Ā 2 Ā 2 Ā 2 Ā 22 C A, () results in the DAG implicitly contained in Figure 3. Once the DAG is constructed, the dispatch stage commences. In the SuperMatrix runtime for multithreaded architectures, idle threads monitor the queue of pending tasks until they find a task ready for execution (that is, an operation with all operands available), compute it, and upon completion, update the dependency information in the queue. It is the part of the runtime system responsible for the execution of this second stage that we tailor for multiaccelerator platforms as described next, while the part in charge of the analysis remains unmodified. Operation/Result In In/out. Ā := CHOL(Ā) Ā 2. Ā := Ā TRIL(Ā) T Ā Ā 3. Ā2 := Ā2 TRIL(Ā) T Ā Ā 2 4. Ā := Ā ĀĀT Ā Ā 5. Ā2 := Ā2 Ā2ĀT Ā 2 Ā Ā 2 6. 
Ā22 := Ā22 Ā2ĀT 2 Ā 2 Ā Ā := CHOL(Ā ) Ā 8. Ā 2 := Ā 2 TRIL(Ā ) T Ā Ā 2 9. Ā22 := Ā22 Ā2ĀT 2 Ā 2 Ā 22. Ā22 := CHOL(Ā22) Ā 22 Figure 3. An illustration of the DAG resulting from the execution of the SuperMatrix analysis stage for the Cholesky factorization of a 3 3 matrix of blocks in () using the algorithmby-blocks FLASH Chol by blocks var. The -marks denote those operands that are initially available (i.e., those operands that are not dependent upon other operations). Specifically, in our basic implementation we run as many threads in the CPU as accelerators are present in the system. When a thread encounters a ready task, it copies the data associated with the operation to the memory of the accelerator, orders it to compute the operation using the appropriate BLAS kernel, and transfers the results back to RAM. We are exposing here a hybrid model of execution where the CPU is responsible for scheduling tasks to the accelerators while tracking dependencies, and the accelerators per- 24

5 form the actual computations. In this hybrid model, tasks that are considered not suitable for execution in the accelerator (due, e.g., to their low complexity or the lack of the appropriate BLAS kernel) can be executed in the CPU. (Hybrid CPU/GPU computation has been previously explored in [3, 2,, 37, 38].) Given that the major computational cost is performed by the accelerators in this scheme, the existence of multiple cores in the CPU, though advisable, is not necessary. Obviously, this basic implementation incurs an undesirable high amount of data transfers between RAM and the memories of the accelerators so that, unless the cost of communication is negligible, it will surely attain a low practical performance (at this point, we encourage the reader to have a quick glimpse at the line labeled as Basic implementation in Figure 4). In the following section we improve the mechanism by including software cache and memory coherence techniques to reduce the number of transfers. 4. Improving the Performance 4. Cache and memory coherence Standard policies in computer architecture to maintain the coherence between data in the cache of a processor and the main memory are write-through (writes to data are immediately propagated to main memory) and write-back (data in the main memory is updated only when the cache line where the modified data lie is replaced) [23]. On shared-memory multiprocessors, policies to maintain coherence among the caches of the processors are write-update (writes to data by one of the processors are immediately propagated to the copies in the caches of the remaining processors) and writeinvalidate (writes to data by one of the processors invalidate copies of that cache line in the remaining processors) [23]. These policies all aim at reducing the number of data transfers between the cache of the processors and the main memory. Now, at a high level of abstraction, a shared-memory multiprocessor is similar to a workstation connected to multiple accelerators. Each one of the accelerators is the equivalent of one processor with the memory of the accelerator playing the role of the processor cache. The workstation RAM is then the analog of the sharedmemory in the multiprocessor. It is not surprising then that we can employ software implementations of standard coherence policies to reduce the number of data transfers between the memory of the accelerators and the RAM of the workstation. 4.2 Application to the multi-accelerator platform The target platform used in the experiments was an NVIDIA Tesla S87 computing system with four NVIDIA G8 GPUs and 6 GBytes of DDR3 memory (.5 GBytes per GPU), which exhibits a theoretical peak performance close to 4 in singleprecision. The Tesla system is connected to a workstation with two Intel Xeon QuadCore E545 processors executing at 2. GHz with 9 GBytes of DDR2 RAM. The Intel 54 chipset provides two PCI- Express Gen2 interfaces, for a peak bandwidth of 48 Gbits/second on each interface, to connect with the Tesla. NVIDIA CUBLAS (version 2.) built on top of the CUDA API (version 2.) together with NVIDIA driver (7.5) were used in our tests. MKL.. was employed for all computations performed in the Intel Xeon cores. Single precision was employed in all experiments. When reporting the rate of computation, we consider the cost of the matrix-matrix product and the Cholesky factorization to be the standard 2n 3 and n 3 /3 flops, respectively, for square matrices of order n. 
The rate is computed as the number of flops divided by t 9,wheret equals the elapsed time in seconds. The cost of all data transfers between RAM and GPU memories is included in the timings Performance of runtime systems for the matrix-matrix product D. Write-back C. Cache + write-invalidate B. 2D + write-through A. Basic implementation Performance of runtime systems for the Cholesky factorization D. Write-back C. Cache + write-invalidate B. 2D + write-through A. Basic implementation Figure 4. Performance of blocked algorithms for the matrixmatrix product (top) and the Cholesky factorization (bottom) using variants A, B, C, and D of the runtime system and the four G8 processors of the Tesla S87. The top and bottom plots in Figure 4 respectively report the performance of a blocked implementation of the matrix-matrix product and the blocked algorithm for the Cholesky factorization in Figure 2 (right), using several variants of the SuperMatrix runtime system and all four G8 processors of the Tesla. Unless otherwise stated, the enhancements described next are incremental so that a variant includes a new strategy plus those of all previous ones. Although we describe the differences between variants using mostly examples from the Cholesky factorization, the same holds for the matrix-matrix product. Four variants are evaluated in the figure: A. Basic implementation: This variant corresponds to the implementation of the runtime system described in Section 3. In the matrix-matrix product all operations are performed in the G8 processors. For the Cholesky factorization, the diagonal blocks are factorized by the Xeon cores of the CPU while all remaining 25

6 Ā Ā Ā Ā 2 Ā 2 Ā 22 Ā 3 Ā 3 Ā 32 Ā 33 C A G G G G G G G G G G Figure 5. Cyclic 2-D mapping of the blocks in the lower triangular part of a 4 4 blocked mapping to the four G8 processors: G, G, G,andG. computations (matrix-matrix products, symmetric rank-k updates, and triangular system solves) are performed in the G8 processors. FLASH provides transparent storage-by-blocks for the data matrix with one level of hierarchy. The block size is adjusted experimentally. B. 2-D + write-through: In order to improve data locality (and therefore reduce the costly data transfers between the memory of the GPUs), workload is distributed following a cyclic 2- D mapping of the data matrix to a 2 2 logical grid of the G8s; see Figure 5. (bidimensional workload distribution in the context of shared-memory multiprocessors has been previously investigated in [28].) In this scheme all operations that compute results which overwrite a given block are mapped to the same G8 processor. Thus, e.g., in the Cholesky factorization the updates Ā2 := Ā2 Ā2ĀT and Ā2 := Ā2TRIL(Ā) T are both performed in G. Blocks are thus classified from the viewpoint of a G8 processor into proprietary (owned = written by it; owner-computes rule) and non-proprietary. Initially all data blocks reside in the RAM and the memory of the GPUs is empty. When a task is to be computed in a G8 processor, blocks which are not already there are copied to the GPU memory. Proprietary blocks remain in that memory for the rest of the execution of the algorithm while non-proprietary blocks are discarded as soon as the operation is completed. A write-through policy is implemented in software to maintain the coherence between the proprietary blocks in the memory of the GPU and the RAM so that any update of a proprietary block is immediately propagated to the RAM. There is no need to maintain the coherence between the memory of the GPUs and the RAM for non-proprietary blocks as these are readonly blocks. Following the previous example for the Cholesky factorization, when the task which performs the update Ā2 := Ā 2 Ā 2 Ā T is to be computed at G, blocksā 2, Ā 2, and Ā are copied to the memory of this GPU; the update is computed and the new contents of Ā 2 are propagated to RAM. Block Ā 2 then remains in the GPU memory while the contents of Ā 2 and Ā are discarded. Latter, when Ā 2 := Ā 2TRIL(Ā) T is to be computed, only Ā is copied to the GPU memory as Ā2 is already there. Once this second update is computed, following the write-through policy the updated contents of Ā2 are sent back to RAM and Ā is discarded. Other workload distributions (block row-wise, block columnwise and cyclic variants) are easily supported by the runtime system and, more important, are transparent to the developer of the algorithms. In our experiments, no major differences were found for the matrix-matrix product and the Cholesky factorization between the performance of the (cyclic) 2-D workload distribution reported in the figure and those of cyclic block rowwise/column-wise layouts. C. Cache + write-invalidate: The previous strategy reduces the number of transfers from RAM to GPU memory of blocks that are modified, but still produces a large amount of transfers of read-only blocks. In this variant we implement a software cache of read-only blocks in each GPU memory to maintain recently used blocks. 
With this mechanism in place for the Cholesky factorization, e.g., when G solves the linear systems Ā := Ā TRIL(Ā) T and Ā3 := Ā3TRIL(Ā) T,acopyof Ā is transferred from RAM to the cache in the GPU memory before the first linear system is solved and remains there for the solution of the second linear system, saving a second transfer. To complement the cache system, when a task which updates a given block is completed, the thread in the CPU in charge of its execution invalidates all read-only copies of that block in the memory of the remaining GPUs (write-invalidate policy). The replacement policy, currently LRU (least recently used first), and the number of blocks per cache can be easily modified in the runtime system. D. Write-back: The purpose now is to reduce the number of transfers from the memory of the GPUs to RAM that occur when (proprietary) blocks are updated by the G8 processors. For this, write-through is abandoned in favor of a write-back policy which allows inconsistencies between proprietary blocks in the memory of the GPUs and the RAM. Thus, blocks written by a G8 processor are updated in the RAM only when a different G8 (or the GPU) is to compute with them. (Software cache for read-only blocks and the write-invalidate policy are still in place.) When the execution of the complete algorithm is terminated, the data matrix in the RAM must be updated with the contents of the blocks that have been updated in the memory of the GPU. In summary, the coherence policies and use of cache implemented in variants B, C, and D all aim at reducing the number of data transfers among the different memories present in the system (minimize communication time) while the 2-D distribution pursues a balanced distribution of the work load. A trace of the execution reveals that, from variant A to D, (the number of) block transfers from RAM to the memory of GPUs for the largest problem sizes is reduced from 496/3276 to 256/243 for the matrix-matrix product/cholesky factorization, while the block transfers in the opposite direction are reduced from 2288/44 to 28/89. Hereafter all results will refer to variant D of the runtime system. In Figure 6 we compare the performances of the algorithms-byblocks for the matrix-matrix product and Cholesky factorization with those of optimized implementations of these operations on current high-performance platforms: Algorithms-by-blocks on four G8 processors: Our algorithmsby-blocks for the two operations, combined with variant D of the runtime system, and executed on the Tesla platform using the four G8 processors. Algorithms-by-blocks on a single G8 processor: Our algorithms-by-blocks for the matrix-matrix product and Cholesky factorization executed on a single G8 processor of the Tesla platform. CUBLAS sgemm on a single G8 processor: Implementation of this routine in CUBLAS 2. and executed on a single G8 processor. To be consistent with the previous two algorithms, the time to transfer the data from RAM to the memory of the GPUs and retrieve the results is included. MKL sgemm/spotrf on two Intel Xeon QuadCore: Multithreaded MKL.. implementation of the corresponding BLAS/LAPACK routines executed on all eight cores of a workstation with two Xeon Quad-Core processors (details given at the beginning of the section). 26

7 Performance of the matrix-matrix Product on GPU/CPU Algorithm-by-blocks on four G8 processors Algorithm-by-blocks on a single G8 processor CUBLAS sgemm on a single G8 processor MKL sgemm on two Intel Xeon QuadCore (8 cores) 6 5 n=892 n=644 n=496 n=248 Scalability of the matrix-matrix product Number of G8 processors Performance of the Cholesky factorization on GPU/CPU Algorithm-by-blocks on four G8 processors Algorithm-by-blocks on a single G8 processor MKL spotrf on two Intel Xeon QuadCore (8 cores) Figure 6. Performance of the algorithms-by-blocks for the matrixmatrix product (top) and the Cholesky factorization (bottom) using four G8 processors, the implementation of the matrix-matrix product in CUBLAS 2. on one G8 processor, and the implementation of both operations in MKL.. using all eight Xeon cores. The results show that the Tesla S87 combined with the algorithmby-blocks offers a notable rate when compared with the multicore architecture. Figures 7 and 8 evaluate the scalability and report the speedup of the algorithm-by-blocks. No bottlenecks are revealed in the scalability experiment: the performance of the system steadily improves as the number of G8 processors is increased and larger problem sizes report higher execution rates. The speed-ups are calculated comparing the performance attained by the algorithms-byblocks using 2 4 G8 processors with that of executing same algorithm on a single G8 processor. For the largest problem size of the matrix-matrix product, speed-ups of.83, 2.5, and 3.2 are attained using 2, 3, and 4 G8 processors. Compared with the implementation of the matrix-matrix product in CUBLAS 2., and n=248 n=6384 n=2288 n=892 n=496 Scalability of the Cholesky factorization Number of G8 processors Figure 7. Scalability of the algorithms-by-blocks for the matrixmatrix product (top) and the Cholesky factorization (bottom) using, 2, 3, and 4 G8 processors. including the time of data transfer between RAM and GPU, the corresponding super-linear speed-ups are 3.4, 4.3, and 5.5. For the largest Cholesky factorization, the results show speed-ups of.83, 2.55, and 3.25 using respectively 2, 3, and 4 G8 processors. 5. Conclusions In this paper we have shown how separation of concerns leads to great flexibility while reducing complexity when porting representative dense linear algebra algorithms to novel architectures. By separating the API for coding algorithms-by-blocks, the part of the runtime system that builds a DAG of operations and tracks the dependencies, and the architecture-aware part of the runtime system that executes operations with blocks, different scheduling heuristics were shown to be easy to implement, allowing customization to what otherwise would have been a hostile environment: a workstation connected to a multi-gpu accelerator. The particular difficulty 27

8 Speed-up G8 processors 3 G8 processors 2 G8 processors Speed-up of the matrix-matrix product Additional information For additional information on FLAME visit utexas.edu/users/flame/. Acknowledgments The researchers at the Universidad Jaime I were supported by projects CICYT TIN C2-2 and FEDER, and PB and PB of the Fundación Caixa-Castellón/Bancaixa and UJI. We thank NVIDIA for the generous donation of equipment that was used in the experiments. This research was partially sponsored by NSF grants CCF and CCF Additional support came from the J. Tinsley Oden Faculty Fellowship Research Program of the Institute for Computational Engineering and Sciences (ICES) at UT-Austin. Speed-up G8 processors 3 G8 processors 2 G8 processors Speed-up of the Cholesky factorization Figure 8. Speed-up of the algorithms-by-blocks for the matrixmatrix product (top) and the Cholesky factorization (bottom) using 2, 3, and 4 G8 processors. of the setting is the fact that the local memory of the GPU is not shared with the host making it necessary to carefully amortize the cost of data transfers. While the experiments on the paper discuss specifically the multi-gpu NVIDIA Tesla system, the techniques clearly are also applicable to a similar setting where a standard workstation is connected via a fast network to multiple ClearSpeed boards, IBM Cell B.E. accelerators, AMD/ATI GPUs, etc. Remarkable rates of execution are demonstrated for the matrixmatrix product and the Cholesky factorization operation. Similar results have been obtained for other important BLAS operations as the solution of triangular linear systems and the symmetric rank-k update. References [] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from berkeley. Technical Report UCB/EECS-26-83, EECS Department, University of California, Berkeley, Dec 26. [2] S. Barrachina, M. Castillo, F. Igual, and R. Mayo adn E. S. Quintana- Ortí. Solving dense linear systems on graphics processors. In Eds. E. Luque et al., editor, Proceedings of Euro-Par 28, number 568 in LNCS, pages Springer-Verlag Berlin Heidelberg, 28. [3] S. Barrachina, M. Castillo, F. D. Igual, R. Mayo, and E. S. Quintana- Ortí. Evaluation and tuning of the level 3 CUBLAS for graphics processors. In 9th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing PDSEC 8, 28. (CD-ROM). [4] Pieter Bellens, Josep M. Pérez, Rosa M. Badía, and Jesús Labarta. CellSs: a programming model for the Cell BE architecture. In Proceedings of the 26 ACM/IEEE conference on Supercomputing SC26, page 86, New York, NY, USA, 26. ACM Press. [5] Paolo Bientinesi. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms. PhD thesis, 26. [6] Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique S. Quintana-Ortí, and Robert A. van de Geijn. The science of deriving dense linear algebra algorithms. ACM Trans. Math. Soft., 3(): 26, March 25. [7] Paolo Bientinesi, Brian Gunter, and Robert A. van de Geijn. Families of algorithms related to the inversion of a symmetric positive definite matrix. ACM Transactions on Mathematical Software, 35():3, July 28. Article 3, 22 pages. [8] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. 
LAPACK Working Note 9 UT-CS-7-6, University of Tennessee, September 27. [9] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. Parallel tiled QR factorization for multicore architectures. LAPACK Working Note 9 UT-CS-7-598, University of Tennessee, July 27. [] Maribel Castillo, Ernie Chan, Francisco D. Igual, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, Robert van de Geijn, and Field G. Van Zee. Making parallel programming synonymous with programming for linear algebra libraries. FLAME Working Note #3 TR-8-2, The University of Texas at Austin, Department of Computer Sciences, April 29. [] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA 7: Proceedings of the Nineteenth ACM Symposium on Parallelism in 28

9 Algorithms and Architectures, pages 6 25, San Diego, CA, USA, June ACM. [2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana- Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In ACM SIGPLAN 28 symposium on Principles and Practices of Parallel Programming PPoPP 28, pages 23 32, 28. [3] Ernie Chan, Field G. Van Zee, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. Satisfying your dependencies with SuperMatrix. In Proceedings of IEEE Cluster Computing 27, pages 9 99, September 27. [4] J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, pages IEEE Comput. Soc. Press, 992. [5] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft., 6(): 7, March 99. [6] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft., 4(): 7, March 988. [7] Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kagstrom. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46():3 45, 24. [8] Nico Galoppo, Naga K. Govindaraju, Michael Henson, and Dinesh Manocha. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In Proceedings of the 25 ACM/IEEE conference on Supercomputing SC25, page 3, Washington, DC, USA, 25. IEEE Computer Society. [9] David Geer. Chip makers turn to multicore processors. Computer, 38(5): 3, 25. [2] John A. Gunnels. A Systematic Approach to the Design and Analysis of Parallel Dense Linear Algebra Algorithms. PhD thesis, Department of Computer Sciences, The University of Texas, December 2. [2] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal linear algebra methods environment. ACM Trans. Math. Soft., 27(4): , December 2. [22] Jia Guo, Ganesh Bikshandi, Basilio Fraguela, Maria Garzaran, and David Padua. Programming with tiles. In PPoPP 8: The 3th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Salt Lake City, UT, USA, 28. [23] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufman, 3rd edition, 23. [24] Jin Hyuk Junk and Dianne P. O Leary. Cholesky decomposition and linear programming on a GPU. Master s thesis, University of Maryland, College Park. [25] Jakub Kurzak, Alfredo Buttari, and Jack Dongarra. Solving systems of linear equations on the CELL processor using Cholesky factorization. LAPACK Working Note 84 UT-CS-7-596, University of Tennessee, May 27. [26] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Soft., 5(3):38 323, Sept [27] C. Leiserson and A. Plaat. Programming parallel applications in Cilk. SINEWS: SIAM News, 998. [28] Bryan A. Marker, Field G. Van Zee, Kazushige Goto, Gregorio Quintana-Ortí, and Robert A. van de Geijn. Toward scalable matrix multiply on multithreaded architectures. In European Conference on Parallel and Distributed Computing Euro-Par 27, pages , February 27. [29] Enrique S. Quintana-Ortí and Robert A. van de Geijn. Formal derivation of algorithms: The triangular Sylvester equation. ACM Trans. Math. 
Soft., 29(2):28 243, June 23. [3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert van de Geijn, and Field G. Van Zee. Design and scheduling of an algorithm-by-blocks for LU factorization on multithreaded architectures. FLAME Working Note #26 TR-7-5, The University of Texas at Austin, Department of Computer Sciences, September 27. [3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert van de Geijn, and Field G. Van Zee. Design of scalable dense linear algebra libraries for multithreaded architectures: the LU factorization. In Workshop on Multithreaded Architectures and Applications MTAAP 28, 28. CD-ROM. [32] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Field G. Van Zee, and Robert A. van de Geijn. Scheduling of QR factorization algorithms on SMP and multi-core architectures. In F. Spies D. El Baz, J. Bourgeois, editor, 6th Euromicro International Conference on Parallel, Distributed and Network-based Processing PDP 28, pages 3 3, 28. [33] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Alfredo Remón, and Robert van de Geijn. Supermatrix for the factorization of band matrices. FLAME Working Note #27 TR-7-5, The University of Texas at Austin, Department of Computer Sciences, September 27. [34] Peter Strazdins. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Technical Report TR- CS-98-7, Department of Computer Science, The Australian National University, Canberra 2 ACT, Australia, 998. [35] Vinod Valsalam and Anthony Skjellum. A framework for highperformance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concurrency and Computation: Practice and Experience, 4():85 84, 22. [36] Robert A. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press, 997. [37] Vasily Volkov and James Demmel. LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical Report UCB/EECS-28-49, EECS Department, University of California, Berkeley, May 28. [38] Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In SC 8: Proceedings of the 28 ACM/IEEE conference on Supercomputing, pages, Piscataway, NJ, USA, 28. IEEE Press. [39] David S. Wise, Jeremy D. Frens, Yuhong Gu, and Gregory A. Alexander. Language support for Morton-order matrices. In Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming PPoPP 2, pages 24 33, New York, NY, USA, 2. ACM Press. 29

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators. FLAME Working Note #32

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators. FLAME Working Note #32 Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators FLAME Working Note #32 Gregorio Quintana-Ortí Francisco D. Igual Enrique S. Quintana-Ortí Robert van de Geijn Abstract In a

More information

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Francisco D. Igual Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Universidad Jaime I de Castellón (Spain) Robert A. van de

More information

Scheduling of QR Factorization Algorithms on SMP and Multi-core Architectures

Scheduling of QR Factorization Algorithms on SMP and Multi-core Architectures Scheduling of Algorithms on SMP and Multi-core Architectures Gregorio Quintana-Ortí Enrique S. Quintana-Ortí Ernie Chan Robert A. van de Geijn Field G. Van Zee quintana@icc.uji.es Universidad Jaime I de

More information

Making Programming Synonymous with Programming for. Linear Algebra Libraries

Making Programming Synonymous with Programming for. Linear Algebra Libraries Making Programming Synonymous with Programming for Linear Algebra Libraries Maribel Castillo Ernie Chan Francisco D. Igual Rafael Mayo Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Robert van de Geijn

More information

Automatic Development of Linear Algebra Libraries for the Tesla Series

Automatic Development of Linear Algebra Libraries for the Tesla Series Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

Solving Large Dense Matrix Problems on Multi-Core Processors

Solving Large Dense Matrix Problems on Multi-Core Processors Solving Large Dense Matrix Problems on Multi-Core Processors Mercedes Marqués Gregorio Quintana-Ortí Enrique S. Quintana-Ortí Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071

More information

Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems

Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems Mercedes Marqués Gregorio Quintana-Ortí Enrique S. Quintana-Ortí Depto. de Ingeniería y Ciencia de Computadores Universidad

More information

Level-3 BLAS on a GPU: Picking the Low Hanging Fruit

Level-3 BLAS on a GPU: Picking the Low Hanging Fruit Level-3 BLAS on a GPU: Picking the Low Hanging Fruit Francisco D. Igual Gregorio Quintana-Ortí Depto. de Ingeniería y Ciencia de Computadores Universidad Jaume I 12.071 Castellón Spain {figualgquintan}@icc.uji.es

More information

FLAME Working Note #30

FLAME Working Note #30 FLAG@lab: An M-script API for Linear Algebra Operations on Graphics Processors Sergio Barrachina Maribel Castillo Francisco D. Igual Rafael Mayo Enrique S. Quintana-Ortí Depto. de Ingeniería y Ciencia

More information

Beautiful Parallel Code: Evolution vs. Intelligent Design

Beautiful Parallel Code: Evolution vs. Intelligent Design Beautiful Parallel Code: Evolution vs. Intelligent Design FLAME Working Note #34 Robert A. van de Geijn Department of Computer Sciences The University of Texas at Austin Austin, Texas 7872 rvdg@cs.utexas.edu

More information

FLAME Working Note #23

FLAME Working Note #23 SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures FLAME Working Note #23 Ernie Chan Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Robert van de Geijn Abstract

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

High performance matrix inversion of SPD matrices on graphics processors

High performance matrix inversion of SPD matrices on graphics processors High performance matrix inversion of SPD matrices on graphics processors Peter Benner, Pablo Ezzatti, Enrique S. Quintana-Ortí and Alfredo Remón Max-Planck-Institute for Dynamics of Complex Technical Systems

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Informe Técnico ICC 2-2-28 Solving Dense Linear Systems on Graphics Processors Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí Febrero de 28 Departamento

More information

Evaluation and Tuning of the Level 3 CUBLAS for Graphics Processors

Evaluation and Tuning of the Level 3 CUBLAS for Graphics Processors Evaluation and Tuning of the Level 3 CUBLAS for Graphics Processors Sergio Barrachina Maribel Castillo Francisco D. Igual Rafael Mayo Enrique S. Quintana-Ortí Depto. de Ingeniería y Ciencia de Computadores

More information

Transforming Linear Algebra Libraries: From Abstraction to Parallelism
