FLAMES2S: From Abstraction to High Performance


RICHARD VERAS, The University of Texas at Austin
JONATHAN MONETTE, The University of Texas at Austin
FIELD G. VAN ZEE, The University of Texas at Austin
ROBERT A. VAN DE GEIJN, The University of Texas at Austin
ENRIQUE S. QUINTANA-ORTÍ, Universidad Jaume I

This paper discusses how to achieve both portability and high performance for dense matrix libraries by expressing them at a high level of abstraction and mechanizing the transformation to more traditional low-level implementations. In the process of pursuing this goal we encountered a number of deficiencies in existing software layers in this domain, namely the Basic Linear Algebra Subprograms (BLAS) interface. As a result, the paper is as much about the mechanical transformation itself as it is about the modifications that had to be made to these lower-level interfaces in order to facilitate simplicity without sacrificing performance.

Categories and Subject Descriptors: G.4 [Mathematical Software]: Efficiency
General Terms: Algorithms; Performance
Additional Key Words and Phrases: linear algebra, libraries, high performance

Authors' addresses: Richard Veras, Jonathan Monette, Robert A. van de Geijn, Field Van Zee, Department of Computer Science, The University of Texas at Austin, Austin, TX 787, rvdg@cs.utexas.edu. Enrique S. Quintana-Ortí, Departamento de Ingeniería y Ciencia de Computadores, Universidad Jaume I, .7 Castellón, Spain, quintana@icc.uji.es.

1. INTRODUCTION
The present paper is best considered in the context of the overall goals and accomplishments of the FLAME project. That project advocates raising the level of abstraction at which one expresses algorithms and codes for the domain of matrix computations [Van Zee et al. 2009; ?].

Many benefits have been documented:

- A new notation allows algorithms to be presented in a way that captures the pictures that often accompany an explanation [Gunnels et al. 2001; Quintana et al. 2001].
- This notation allows systematic derivation of correct families of loop-based algorithms [Gunnels et al. 2001; Bientinesi et al. 2005; van de Geijn and Quintana-Ortí 2008; Gunnels 2001].
- This derivation process can be made mechanical [Bientinesi 2006].
- Application Programming Interfaces (APIs) can be defined for various languages so that representations in code mirror the algorithms [Bientinesi et al. 2005].
- Numerical error analyses can be made similarly systematic [Bientinesi 2006; Bientinesi and van de Geijn].
- The effort required to develop and maintain libraries is greatly reduced [Van Zee].
- New architectures can be easily accommodated [Quintana-Ortí et al. 2009; Quintana-Ortí et al. 2009].
- A production-grade library, libflame, has been developed [Van Zee et al. 2009; Van Zee].

Many of these results were published in previous ACM TOMS issues. The present paper describes how a source-to-source translator, FLAMES2S, makes the contributions of those papers, which culminate in the libflame library, a viable option for users of dense linear algebra libraries, even those who insist on near-optimal performance. The translated implementations are reminiscent of, and attain performance similar to, corresponding LAPACK implementations.

It is in trying to make the implementation mechanical that unintended consequences of interfaces often come to light. A secondary contribution of the present paper lies with insights into deficiencies that have existed since the Basic Linear Algebra Subprograms (BLAS) interface was standardized. These design flaws came to our attention by virtue of the fact that we were recreating a very large part of the functionality present in LAPACK. In some cases they are merely annoying, while in other cases they are serious, leading to routines in LAPACK not being thread-safe. To overcome this problem, we provide a new interface that is used internally within libflame, the BLAS-Like Interface Subprograms (BLIS). This interface also features an orthogonal but manifestly useful improvement: it supports both column-major and row-major storage as well as general row and column strides, by which we mean the case where elements within both rows and columns are separated by a constant, possibly non-unit, stride.

The paper is organized as follows. We begin in Section 2 with a motivating example and discuss some of the problems with both FLAME code and LAPACK. In Section 3 we introduce a source-to-source translator that translates high-level FLAME code to low-level C with direct calls to the BLAS. Section 4 illustrates some of the limitations we encountered in the original BLAS interface and proposes a set of BLAS-Like Interface Subprograms (BLIS) to address these issues. Performance results are discussed in Section 5. Section 6 summarizes related research, and closing remarks are given in the conclusion.

Fig. 1. Unblocked (left) and blocked (right) algorithms for computing the Cholesky factorization (three variants of each, in FLAME notation).

2. THE PROBLEM
In this section we describe the problem that our approach solves by discussing a representative example, the Cholesky factorization:

Theorem. Let A be an n × n symmetric positive definite (SPD) matrix. Then there exists an n × n lower triangular matrix L such that A = LL^T. If the diagonal elements of L are taken to be positive, L is unique. The matrix L is called the Cholesky factor of A.

2.1 A family of algorithms
In Figure 1 we present three unblocked algorithms and three blocked algorithms for overwriting the lower triangular part of the SPD matrix A with its Cholesky factor, in FLAME notation. How to systematically derive these algorithms is discussed, for example, in [van de Geijn and Quintana-Ortí 2008].
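To see where the updates in these variants come from, here is a brief sketch (ours, in the spirit of the derivations in [van de Geijn and Quintana-Ortí 2008]) for the right-looking case. Partition A and its Cholesky factor L conformally and equate A = LL^T:

    ( α11   ⋆  )   ( λ11    0  ) ( λ11   l21^T )   ( λ11^2              ⋆              )
    ( a21  A22 ) = ( l21   L22 ) (  0    L22^T ) = ( λ11 l21   l21 l21^T + L22 L22^T   )

so that λ11 = sqrt(α11), l21 = a21/λ11, and L22 is the Cholesky factor of A22 - l21 l21^T. Overwriting α11, a21, and A22 with these quantities and repeating the process on the updated A22 yields exactly the updates of unblocked Variant 3 in Figure 1; reordering and deferring parts of this computation yields the other variants.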

#include "FLAME.h"

FLA_Error FLA_Chol_l_unb_var2( FLA_Obj A )
{
  FLA_Obj ATL,   ATR,      A00,  a01,     A02,
          ABL,   ABR,      a10t, alpha11, a12t,
                           A20,  a21,     A22;
  int value = 0;

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00,  /**/ &a01,     &A02,
                        /* ************* */   /* ************************** */
                                                &a10t, /**/ &alpha11, &a12t,
                           ABL, /**/ ABR,       &A20,  /**/ &a21,     &A22,
                           1, 1, FLA_BR );

    /*------------------------------------------------------------------*/

    // alpha11 = alpha11 - a10t * a10t
    FLA_Dotcs( FLA_CONJUGATE, FLA_MINUS_ONE, a10t, a10t, FLA_ONE, alpha11 );

    // a21 = a21 - A20 * a10t
    FLA_Gemvc( FLA_NO_TRANSPOSE, FLA_CONJUGATE, FLA_MINUS_ONE, A20, a10t,
               FLA_ONE, a21 );

    // alpha11 = sqrt( alpha11 )
    value = FLA_Sqrt( alpha11 );

    if ( value != FLA_SUCCESS )
      return ( FLA_Obj_length( A00 ) + 1 );

    // a21 = a21 / alpha11
    FLA_Inv_scal( alpha11, a21 );

    /*------------------------------------------------------------------*/

    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00,  a01,     /**/ A02,
                                                     a10t, alpha11, /**/ a12t,
                            /* *************** */  /* *********************** */
                              &ABL, /**/ &ABR,       A20,  a21,     /**/ A22,
                              FLA_TL );
  }

  return value;
}

Fig. 2. The libflame unblocked Variant 2 Cholesky factorization routine. This implementation supports all four standard floating-point datatypes: single, double, complex, and double complex.

2.2 Representing the algorithms in code
As part of the FLAME project, we have defined APIs for representing algorithms in code that hide details of indexing so that the code closely resembles the algorithms [Bientinesi et al. 2005].

      SUBROUTINE ZPOTF2( UPLO, N, A, LDA, INFO )
*
*  -- LAPACK routine (version 3.) --
*
      < deleted code >
*
*        Compute the Cholesky factorization A = L*L'.
*
         DO 20 J = 1, N
*
*           Compute L(J,J) and test for non-positive-definiteness.
*
            AJJ = DBLE( A( J, J ) ) - ZDOTC( J-1, A( J, 1 ), LDA,
     $            A( J, 1 ), LDA )
            IF( AJJ.LE.ZERO ) THEN
               A( J, J ) = AJJ
               GO TO 30
            END IF
            AJJ = SQRT( AJJ )
            A( J, J ) = AJJ
*
*           Compute elements J+1:N of column J.
*
            IF( J.LT.N ) THEN
               CALL ZLACGV( J-1, A( J, 1 ), LDA )        <<<<<<< a10t := conj( a10t )
               CALL ZGEMV( 'No transpose', N-J, J-1, -CONE, A( J+1, 1 ),
     $                     LDA, A( J, 1 ), LDA, CONE, A( J+1, J ), 1 )
               CALL ZLACGV( J-1, A( J, 1 ), LDA )        <<<<<<< a10t := conj( a10t )
               CALL ZDSCAL( N-J, ONE / AJJ, A( J+1, J ), 1 )
            END IF
   20    CONTINUE
      END IF
      GO TO 40
*
   30 CONTINUE
      INFO = J
*
   40 CONTINUE
      RETURN
*
*     End of ZPOTF2
*
      END

Fig. 3. LAPACK unblocked left-looking complex-valued Cholesky factorization.

Further, our techniques for systematically and mechanically deriving correct algorithms can be leveraged to provide a high level of confidence in the code, since there are few opportunities for the introduction of programming bugs. A FLAME/C code for unblocked Variant 2 that closely resembles the code in the libflame library is given in Figure 2. For comparison, in Figure 3 we present the same code in the LAPACK style of coding. We note that coding the blocked algorithms in the LAPACK style is considerably more challenging due to indexing details.

Fig. 4. Performance of various implementations of Cholesky factorization. Top: all six FLAME/C variants (blocked Variants 1-3 and unblocked Variants 1-3) together with netlib blocked + netlib unblocked. Bottom: FLAME blocked Variant 3 combined with a FLAME unblocked or a netlib unblocked subproblem, compared with netlib blocked + netlib unblocked.

We claim that if one wishes to include several algorithms for all operations supported by LAPACK, then developing and maintaining a library for dense matrix computations coded with an API like that used by libflame is clearly preferred over the style of coding employed by LAPACK. Those who are not at least somewhat open to that view can probably stop reading at this point.

2.3 The problem with (our) abstractions
In Figure 4 (top), we show the performance of all six algorithms coded with the FLAME/C API as well as block-size-tuned implementations that are part of the netlib release of LAPACK (version 3..). In Figure 4 (bottom) we also show, for a select implementation, the performance of the blocked algorithms when the unblocked algorithms, which are used to compute the subproblem A11 := Chol(A11), are coded with the FLAME/C API or in LAPACK style.

The conclusion is that while the FLAME/C API incurs negligible overhead for the blocked algorithms, there is a hefty performance penalty for the unblocked algorithms that also affects the performance of the blocked algorithm. This should come as no surprise: the abstractions hide indices. In lieu of explicit indexing, various data structures are updated when matrix objects are partitioned and repartitioned, which incurs some additional cost that would not be present if the algorithms were coded at a low level. One solution would be to code the unblocked algorithms in the style of LAPACK, except in C. But that would defeat the benefits of the FLAME/C API for a large number of routines that are part of the library.

2.4 The problem with LAPACK
The style of coding used by LAPACK dates back to its predecessor, LINPACK, which emerged in the mid-1970s [Dongarra et al. 1979; Stewart 1977]. As an illustration, a so-called unblocked algorithm for Cholesky factorization is given in Figure 3. At the time of its release, LAPACK represented a significant leap forward in a number of ways. For example, the package merged and extended the functionality of EISPACK and LINPACK, portable performance was achieved through the use of blocked algorithms, and algorithms were coded in terms of standardized level-1, -2, and -3 BLAS subroutines. Still, there are a number of characteristics that by now make the package seem dated:

- To a large degree, LAPACK adheres to the Fortran-77 standard. This means that names of routines and variables cannot use more than six letters. Clearly, this can be easily overcome, but there does not seem to be any inclination on the part of the LAPACK developers to do so.
- A user of LAPACK must have access to a Fortran compiler. While one can call Fortran routines from C and C++, even if one uses a precompiled version of LAPACK, or CLAPACK, which was created with the f2c tool, one must have access to certain Fortran runtime libraries.
- It is assumed that matrices are stored in column-major order. Some libraries that currently build upon LAPACK choose to store matrices by rows [LabView ; ]. This mismatch can be overcome by (implicitly or explicitly) transposing operands when calling LAPACK routines. However, this workaround adds unnecessary complexity.
- Experience has taught us that on different architectures, different algorithmic variants are often superior. For the most part, LAPACK includes one unblocked and one blocked algorithm for each supported operation. Since LAPACK consists of more than a million lines of code already, maintaining a multitude of algorithmic variants would greatly increase the size of the code base, not to mention require an enormous effort in programming.
- The adherence to Fortran-77 prevents the use of recursion to achieve better performance by introducing multiple levels of blocking.
- A very large number of tunable parameters make optimization cumbersome.

- Workspace to be used inside routines must be explicitly passed in as a parameter. While this guarantees that the user knows how much memory will be used, it also creates a considerable burden on the user.

These problems can be overcome by the LAPACK library developer and/or the LAPACK user, but not without a considerable investment in time and effort.

3. FLAMES2S: A SOURCE-TO-SOURCE TRANSFORMER FOR FLAME/C CODE
Conventional wisdom has been that, in scientific computing, we must sacrifice programmability for the sake of performance and that therefore abstractions like the FLAME/C API cannot be afforded. However, we believe that one does not have to choose between low-performing, highly-readable code like that given in Figure 2 and higher-performing code that explicitly exposes indices as in Figure 3. In this section, we discuss a source-to-source transformer that translates code implemented using the FLAME/C API into more traditional (LAPACK-like) code, yielding the best of both worlds. FLAMES2S is remarkably elegant yet powerful: essentially, it consists of a set of rewrite rules that transforms a high-level description of the algorithm to code.

#include "FLAME.h"

#define A( i, j )  buff_a[ (j)*ldim_a + (i) ]

FLA_Error FLA_Chol_l_opt_var2( FLA_Obj A )
{
  FLA_Datatype datatype;
  int          ldim_a, m_a, n_a;

  datatype = FLA_Obj_datatype( A );
  ldim_a   = FLA_Obj_ldim( A );
  m_a      = FLA_Obj_length( A );
  n_a      = FLA_Obj_width( A );

  switch( datatype )
  {
    case FLA_FLOAT:
      < deleted lines >

    case FLA_DOUBLE_COMPLEX:
    {
      dcomplex* buff_a = ( dcomplex* ) FLA_Obj_buffer( A );

      FLA_Chol_l_opz_var2( m_a, n_a, buff_a, ldim_a );

      break;
    }
  }

  return FLA_SUCCESS;
}

Fig. 5. Wrapper routine that separates the call to FLA_Chol_l_unb_var2 into routines for the different datatypes.

FLA_Error FLA_Chol_l_opz_var2( int m_a, dcomplex* buff_a, int rs_a, int cs_a )
{
  int j;

  for( j = 0; j < m_a; j++ )
  {
    dcomplex* a10t    = &A( j,   0 );
    dcomplex* alpha11 = &A( j,   j );
    dcomplex* A20     = &A( j+1, 0 );
    dcomplex* a21     = &A( j+1, j );

    int m_a_min_j_min_one = m_a - j - 1;

    // FLA_Dotcs( FLA_CONJUGATE, FLA_MINUS_ONE, a10t, a10t, FLA_ONE, alpha11 );
    bli_zdots( BLIS_CONJUGATE, j,
               FLA_DOUBLE_COMPLEX_PTR( FLA_MINUS_ONE ), a10t, rs_a, a10t, rs_a,
               FLA_DOUBLE_COMPLEX_PTR( FLA_ONE ), alpha11 );

    // FLA_Gemvc( FLA_NO_TRANSPOSE, FLA_CONJUGATE, FLA_MINUS_ONE, A20, a10t,
    //            FLA_ONE, a21 );
    bli_zgemv( BLIS_NO_TRANSPOSE, BLIS_CONJUGATE, m_a_min_j_min_one, j,
               FLA_DOUBLE_COMPLEX_PTR( FLA_MINUS_ONE ), A20, rs_a, cs_a,
               a10t, rs_a,
               FLA_DOUBLE_COMPLEX_PTR( FLA_ONE ), a21, cs_a );

    // FLA_Sqrt( alpha11 );
    {
      FLA_Error error;
      bli_zsqrte( alpha11, &error );
      if( error != FLA_SUCCESS ) return j;
    }

    // FLA_Inv_scal( alpha11, a21 );
    bli_zinvscalv( BLIS_NO_CONJUGATE, m_a_min_j_min_one, alpha11, a21, cs_a );
  }

  return FLA_SUCCESS;
}

Fig. 6. Output of the FLAMES2S translator when the BLIS interface is used. (Very minor editing was used to slightly shorten the code.)

Fig. 7. The Spark tool for generating code skeletons.

How we write FLAME/C code. We start by very briefly reviewing the process by which we produce a routine like the one in Figure 2. This will help the reader understand why FLAME/C code is easy to parse. Once the algorithm has been derived, a code skeleton is generated using a webpage-based tool, Spark [Spark], depicted in Figure 7. (We encourage the reader to visit this website and to try it before continuing.) The idea is that by filling out a simple form, most of the code can be generated automatically, leaving only the updates between the /*------*/ lines to be filled in manually with subroutine calls that perform the necessary computations. We note that one of the output language options of the tool yields a representation that typesets a skeleton for algorithms with LaTeX, as depicted in Figure 1.

Translating FLAME/C code to LAPACK-like code. The FLAME/C code in libflame is highly structured, and since much of it is automatically generated by tools like Spark, it is easy to parse. As a result, a translator, FLAMES2S, was written that can take code written at a high level of abstraction (e.g., the code in Figure 2) and translate it to an indexed loop and calls to a BLAS-like interface (e.g., the code in Figures 5 and 6).
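To make the notion of rewrite rules concrete, the following sketch (ours, not code emitted by the translator) spells out the correspondence that such rules exploit. At iteration j of an unblocked sweep, the repartitioned FLAME objects of Figure 2 are simply pointers into the matrix buffer together with lengths and strides. Assuming column-major storage with leading dimension ldim_a, so that element (i,j) lives at buff_a[ j*ldim_a + i ], the fragment below (double precision real case, with j, buff_a, and ldim_a in scope as in the loop body of Figure 6) shows where each object points:

  /* Hypothetical illustration of the object-to-pointer correspondence;
     names follow the FLAME convention of Figures 2 and 6.               */
  double* alpha11 = &buff_a[ j*ldim_a + j     ];  /* A( j, j )                                    */
  double* a10t    = &buff_a[ 0*ldim_a + j     ];  /* row j, columns 0..j-1; element stride ldim_a */
  double* a21     = &buff_a[ j*ldim_a + (j+1) ];  /* column j, rows j+1..m-1; element stride 1    */
  double* A20     = &buff_a[ 0*ldim_a + (j+1) ];  /* rows j+1..m-1, columns 0..j-1                */

Each FLAME/C call in the loop body is then rewritten, one for one, into the BLAS-like call that operates on these pointers, with the lengths (here j and m_a - j - 1) and strides passed explicitly; that is precisely the shape of the code in Figure 6.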

4. BLAS AND BLAS-LIKE INTERFACES
The reader will have noticed that the code in Figure 6 calls routines that are reminiscent of the BLAS. In this section we briefly discuss why we created a new interface. We stress that at this moment we merely consider this layer to be for our own use within the libflame library and that it is currently implemented in terms of the conventional BLAS interface.

The problem with traditional (Fortran) BLAS. An initial implementation of the FLAMES2S translator produced output that called the traditional (Fortran) BLAS. Let us focus on how the matrix-vector multiplication in Figure 2 was translated into a call to the traditional BLAS when a double precision complex routine is requested:

    FLA_Gemvc( FLA_NO_TRANSPOSE, FLA_CONJUGATE, FLA_MINUS_ONE, A20, a10t,
               FLA_ONE, a21 );

is translated to

    // FLA_Gemvc( FLA_NO_TRANSPOSE, FLA_CONJUGATE, FLA_MINUS_ONE, A20, a10t,
    //            FLA_ONE, a21 );
    {
      // Note: vector a10t needs to be conjugated.
      dcomplex* tmp_vec;

      tmp_vec = ( dcomplex* ) malloc( j * sizeof( dcomplex ) );

      zcopy_( &j, &A(j, 0), &ldim_a, &tmp_vec[0], &I_ONE );
      conj_vector( j, tmp_vec );

      zgemv_( "No Transpose", &m_a_min_j_min_one, &j,
              FLA_DOUBLE_COMPLEX_PTR( FLA_MINUS_ONE ), &A(j+1, 0), &ldim_a,
              &tmp_vec[0], &I_ONE,
              FLA_DOUBLE_COMPLEX_PTR( FLA_ONE ), &A(j+1, j), &I_ONE );

      free( tmp_vec );
    }

This simple example illustrates some of the problems that are immediately encountered:

- The need for explicit conjugation. Notice that for the Cholesky factorization example it would have been possible to conjugate the vector a10t in place and then unconjugate it after the matrix-vector operation was complete. However, in general this would not be thread-safe if the vector being conjugated were part of an input operand rather than an input/output operand. The tool would either have to know more about the context or it would need to produce the slower, more complicated (but thread-safe) code as it does here. But regardless of whether the application code is thread-safe or not, the inability to express implicit conjugation via the gemv interface means that the application code must explicitly conjugate the contents of a10t at least once.

- The need for a temporary vector and an additional copy operation. If the generated code is to be thread-safe, the contents of the vector a10t must be copied to a temporary buffer where they may be safely conjugated. The dynamic allocation and eventual freeing of the temporary vector undoubtedly incurs a small cost. Further, the copy operation incurs an additional cost proportional to the length of a10t. It may be argued, or even shown, that these costs are relatively small. However, it cannot be shown that these costs are necessary; the BLAS could have easily incorporated an option to operate with the input vector's conjugate on the fly, at virtually no cost. We also point out that the above code is significantly less readable than a hypothetical zgemv interface that allowed conjugation of the vector.

- The requirement that matrices be stored in column-major order. Column-major storage, while commonplace in numerical applications, is not used universally. Applications which store data in row-major order would be forced to perform intricate transformations on the BLAS parameters in order to induce the correct computation. Such hackery, for most users, would be error-prone and result in code that is even more difficult to read, if not entirely obfuscated.

The code translation shown in Figure 6 is much simpler thanks to the BLIS, which addresses each of the above concerns by providing a more powerful interface.
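To see why operating with the conjugate on the fly costs essentially nothing, consider the following illustrative loop nest. It is not the BLIS implementation, and the routine name and its exact parameters are ours; it merely shows that a conjugation flag and general row/column strides are absorbed into the inner loop, so no temporary vector, extra pass over the data, or storage-format assumption is needed:

#include <complex.h>

/* Sketch of a gemv-like kernel: y := beta*y + alpha * A * op(x), where op(x)
   is x or conj(x).  Element (i,j) of A is read at A[ i*rs_a + j*cs_a ], so
   rs_a = 1, cs_a = ldim gives column-major storage and rs_a = ldim, cs_a = 1
   gives row-major storage.  Illustrative only; not the BLIS code.            */
void zgemv_conjx_sketch( int conjx, int m, int n,
                         double complex alpha,
                         const double complex* A, int rs_a, int cs_a,
                         const double complex* x, int incx,
                         double complex beta,
                         double complex* y, int incy )
{
  for ( int i = 0; i < m; i++ )
  {
    double complex acc = 0.0;

    for ( int j = 0; j < n; j++ )
    {
      double complex chi = x[ j*incx ];
      if ( conjx ) chi = conj( chi );      /* conjugation folded into the loop */
      acc += A[ i*rs_a + j*cs_a ] * chi;   /* rs_a/cs_a admit any storage      */
    }

    y[ i*incy ] = beta * y[ i*incy ] + alpha * acc;
  }
}

The conjugation amounts to a sign flip on the imaginary part of each element as it is consumed, which is why such an interface can expose it as an option without a measurable performance penalty.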

Why not the CBLAS? The CBLAS define C bindings to the traditional BLAS interface. For a number of reasons they do not serve our purposes. First, we find the parameter-passing conventions to be unnecessarily inconsistent and confusing. For instance, real scalars are passed by value, but complex scalars (along with all vectors and matrices) are passed by address. Also, the CBLAS do not define any complex types, and so the API uses the much vaguer void* in place of the actual type for all complex parameters, making the interface more weakly typed and more difficult to read. Finally, while the CBLAS support row-major storage, they require that all matrix operands be stored in row-major (or column-major) order; mixing storage formats is not allowed.

Why not the BLAST Forum extensions? Some of the problems with complex datatypes (e.g., the need to conjugate vectors) were fixed in the proposed new BLAS interface that resulted from the BLAS Technical Forum [BLA ; ]. Unfortunately, that new interface never caught on. Not even LAPACK was changed to use the new standard. Moreover, the C interface was somewhat clunky, not unlike the CBLAS interface to the traditional BLAS.

What is different in the BLIS? The BLAS-Like Interface Subprograms present in libflame provide a set of APIs similar to that of the BLAS while supporting features that were missing from the original BLAS specification. The following is a list of the features, including improvements made over the BLAS:

- Low-level interfaces. The BLIS, like the BLAS, is a set of low-level interfaces. No objects or descriptors are used, and thus all information is passed into the API explicitly. Therefore we retain the potential for maximum performance.

- No return values. The original BLAS supported a handful of operations, such as the dot product, that were implemented as functions. (In Fortran, subroutines and functions are distinguished in that only the latter return values to the caller.) In practice, correctly linking to the routines for complex datatypes proved to be a nightmare due to the competing f2c and GNU return value standards [GNU Fortran]. The BLIS avoids this altogether by always using an extra parameter (passed by address) in lieu of returning numerical values.

- Row and column strides. By including row and column strides in the interface, the BLIS interfaces support both row-major and column-major storage. Furthermore, general row and column storage, whereby both strides are non-unit, is made possible. Further still, the API allows the user to perform computation on operands of different storage layouts.

- Simple, consistent interfaces. All parameters that convey information about the operation, such as whether a transposition is performed, are typed as char. All stride parameters are int. All scalars, vectors, and matrices are passed by address. We feel these conventions provide a more natural interface for users of C.

- Consistent conjugation across complex- and real-typed routines. If a complex-typed routine allows the user to conjugate an operand, then the corresponding real-typed routine contains the same option. For example, the BLIS routine bli_zgemv provides a conjx parameter, but so does bli_dgemv. Thus, the BLIS looks at conjugation in the API type-agnostically, as an argument to the mathematical operation itself. Some would argue that such parameters are redundant; why have a conjugation parameter if it has no effect on the computation? We believe that such parameters have a purpose because they allow users, and tools such as FLAMES2S, to encode critical features of the operation (the conjugation of a vector, for example) into the routine invocation. By looking at a real-typed invocation, one can correctly project the operation out to the complex domain. If we were to leave out such redundant parameters from the interface, this kind of information would be lost when the application or library is coded.

- Operations not present in the BLAS. Perhaps most importantly, the BLIS include interfaces to operations not defined in the BLAS. For example, as alluded to previously, the BLIS implements matrix-vector multiplication (gemv) where the input vector may be optionally conjugated in a thread-safe manner. Another useful operation is gemv where the matrix operand is conjugated. Other extended operations include axpy, copy, dot, scal, ger, hemv, her, her2, trmv, trsv, gemm, trmm, and trsm.

In general, the BLIS attempts to expand upon the original BLAS to provide the most flexible, easy-to-use API possible while still providing low-overhead interfaces to essential linear algebra kernels. This revised interface layer streamlines both the implementation and the output of tools such as FLAMES2S while also lifting much of the programming burden from casual users.

5. IMPACT ON PERFORMANCE
The scientific computing community is typically willing to give up programmability if it means attaining better performance. In this section we show that the FLAMES2S translator produces code whose performance matches that of the netlib LAPACK implementation.

5.1 Platform details
All experiments were performed on a single core of a Dell PowerEdge R900 server consisting of four Intel Dunnington six-core processors. Each core provides a peak performance of 10.64 GFLOPS and has access to 96 GBytes of shared main memory. Performance experiments were gathered under the GNU/Linux .6.8 operating system. Source code was compiled by the Intel C/C++ Compiler, version ..

5.2 Implementations
We demonstrate the impact of translating high-level code using the FLAME/C API to low-level code using indexed loops and calls to the BLIS interface. Performance for three important operations supported by MKL .., LAPACK 3.., and libflame r394 is reported: Cholesky factorization, LU factorization with partial pivoting, and QR factorization via Householder transformations. The results are representative of what is observed for other operations as well. The following implementations were compared:

Unblocked
- FLAME unblocked. The best unblocked algorithmic variant currently supported by libflame, coded using the FLAME/C API.
- optimized unblocked. The unblocked algorithmic variant used for FLAME unblocked, translated to a plain loop with calls to the BLIS interface, as discussed in this paper.
- netlib unblocked. The unblocked implementation in LAPACK distributed on netlib.
- MKL unblocked. The unblocked implementation in MKL.

Blocked
- FLAME blocked + FLAME unblocked. The best blocked algorithmic variant currently supported by libflame, coded using the FLAME/C API. The subproblem is computed via a call to the best unblocked algorithmic variant coded using the FLAME/C API.
- FLAME blocked + optimized unblocked. The best blocked algorithmic variant currently supported by libflame, coded using the FLAME/C API. The subproblem is computed via a call to the best unblocked algorithmic variant coded using the FLAME/C API and translated to a plain loop with calls to the BLIS interface.
- FLAME blocked + netlib unblocked. The best blocked algorithmic variant currently supported by libflame, coded using the FLAME/C API. The subproblem is computed via a call to the LAPACK unblocked algorithmic variant distributed from netlib.
- netlib blocked + netlib unblocked. The LAPACK blocked routine linked to the LAPACK unblocked routine, available from netlib.
- MKL blocked. The blocked implementation in MKL.

For all experiments that involve FLAME or netlib code, sequential level-1, -2, and -3 BLAS were provided by MKL. Also, we use double precision floating-point arithmetic for all experiments.

5.3 Results
We report both the rate of computation that is attained (in GFLOPS) and the speedup relative to the netlib implementation.
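As a guide to reading the graphs, and assuming the customary accounting (which the text does not state explicitly), these quantities are:

    GFLOPS = f(n) / ( 10^9 * t ),        speedup = t_netlib / t,

where t is the measured time in seconds, t_netlib is the time of the corresponding netlib implementation (netlib unblocked for the unblocked comparisons, netlib blocked + netlib unblocked for the blocked ones), and f(n) is the standard flop count for an n × n matrix: n^3/3 for the Cholesky factorization, 2n^3/3 for LU with partial pivoting, and 4n^3/3 for the Householder QR factorization.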

Fig. 8. Performance of various unblocked implementations of Cholesky factorization (double precision real): attained GFLOPS and speedup relative to netlib unblocked, for FLAME unblocked, optimized unblocked, netlib unblocked, and MKL unblocked.

For the absolute performance graphs (on the left), the top of the graph represents the theoretical peak of the architecture. To facilitate easy comparison, the algorithmic block size was set to equal 128 for all experiments. This block size is not necessarily optimal for all problem sizes. But optimality is not the point of this paper: the point is that abstraction does not adversely affect performance. Tuning the implementations is an orthogonal issue.

Cholesky factorization. In Figures 8 and 9 we report the performance attained by various implementations of Cholesky factorization, using double precision real data and updating only the lower triangular part of the matrix. These graphs tell the story as we would expect: Using the FLAME/C API incurs a considerable overhead for the unblocked algorithm, which is overcome when the code is translated to a simple loop with calls to the BLIS interface.

Fig. 9. Performance of various blocked implementations of Cholesky factorization when using different unblocked implementations (double precision real): attained GFLOPS and speedup relative to netlib blocked + netlib unblocked.

Even for the blocked algorithm, the low performance of the suboperation affects how fast the performance ramps up.

We provide results for the Cholesky factorization of double precision complex matrices in Figure 10, since we used the complex instantiation of the operation as a motivating example for the BLIS in Section 4. For nearly all problem sizes, complex FLAME codes that call BLIS-based unblocked routines perform as well as or better than those found in LAPACK, despite the additional copies performed by the BLIS implementation in libflame. For the remaining performance experiments, we will only show performance results for double precision real data.

Fig. 10. Performance of various unblocked (top) and blocked (bottom) implementations of Cholesky factorization (double precision complex).

LU with partial pivoting. In Figure 11 we report the performance attained by various implementations of LU factorization with partial pivoting. Here the comparison is complicated by the fact that not all implementations use the same algorithmic variant.

1. Unblocked.
- FLAME unblocked, optimized unblocked, MKL unblocked. The FLAME unblocked implementations use Variant 4, which is also known as the Crout variant [Crout 1941]. This is an implementation that casts most computation in terms of matrix-vector multiplication. From the performance signature we deduce that MKL uses the same variant.

Fig. 11. Performance of various unblocked (top) and blocked (bottom) implementations of LU factorization with partial pivoting (double precision real).

- netlib unblocked. This implementation uses Variant 5, which is also known as the right-looking variant. It casts most computation in terms of a rank-1 update. The observation is that a rank-1 update generates twice the memory traffic of a matrix-vector multiplication, which is the reason why the netlib unblocked implementation attains worse performance.

The fact that the optimized unblocked implementation attains performance that is close to the commercial MKL implementation supports the claim that the translated code implemented with the BLIS interface preserves performance.
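A back-of-the-envelope count, ours rather than the paper's, makes the traffic argument concrete. For an m × n trailing submatrix A, and ignoring the vectors, which are small by comparison:

    rank-1 update   A := A - y x^T :   2mn flops,   mn reads + mn writes of A   (about 2mn memory accesses)
    gemv            y := y - A x   :   2mn flops,   mn reads of A, A not written   (about mn memory accesses)

Both perform the same number of floating-point operations per element of A, but the rank-1 update must also store every element back, so it moves roughly twice as much data through the memory hierarchy for the same amount of useful computation.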

2. Blocked. Again it is observed that the FLAME blocked + FLAME unblocked implementation ramps up slowly. The FLAME blocked + optimized unblocked implementation outperforms both the FLAME blocked + netlib unblocked and the pure netlib (netlib blocked + netlib unblocked) algorithm because of the better algorithmic variant used for the subproblem. The MKL implementation is more highly tuned. We believe that to a large degree this is due to a better implementation of the pivoting routine that swaps blocks of rows.

Fig. 12. Performance of various unblocked (top) and blocked (bottom) implementations of Householder QR factorization (double precision real).

Householder QR factorization. In Figure 12 we report the performance attained by various implementations of QR factorization based on Householder transformations. The LAPACK and MKL implementations use the compact WY transform to accumulate Householder transformations [Schreiber and Van Loan 1989], while the libflame implementation uses a slight variant of that, which we call the UT transform [Joffrain et al. 2006].

Other than the FLAME unblocked version, which suffers from the usual overhead, the unblocked implementations achieve very similar performance. The slightly lower performance attained by optimized unblocked can be attributed to the fact that that implementation always accumulates an additional triangular matrix needed for the blocked algorithm, while its counterparts do not. The most plausible reason why the MKL QR factorization attains better performance is that it uses a smaller block size when the problem size is relatively small. This is beneficial because accumulating the Householder transformations adds extra computation, and therefore there is a trade-off between the benefits of using a blocked algorithm and the extra computation that is required to support it. Tuning would similarly improve the FLAME implementation.

5.4 Summary
It is important not to get distracted by the performance graphs. The point of the paper is that one can code at a high level of abstraction and, with the help of a simple code-to-code translator, attain performance that rivals that of a more traditional library. However, the performance graphs for LU factorization do show that there is a performance benefit that results from coding at a higher level of abstraction: without requiring unreasonable extra effort, additional algorithmic variants can be included in the library so that the best variant for the situation can be employed.

6. RELATED WORK
It can be argued that there have been many efforts to express linear algebra libraries at a high level of abstraction, starting with the original MATLAB environment developed in the 1970s [Moler 1980]. Mapping from MATLAB-like descriptions to high-performance implementations was already pursued by the FALCON project [DeRose 1996; Marsolf 1997; DeRose and Padua 1996], and MATLAB itself has a compiler that does so. More recently, a project to produce Build to Order Linear Algebra Kernels [Siek et al. 2008] takes an approach similar in philosophy to what we are proposing: the input is a MATLAB-like description of linear algebra algorithms and the output is optimized code. What sets our approach apart from these efforts is that the FLAME project has managed to greatly simplify how algorithms are expressed at a high level of abstraction.

A solution close to ours was proposed in the dissertation of John Gunnels [Gunnels 2001], a participant in what became the FLAME project. He proposed a domain-specific language, PLAWright, for expressing PLAPACK code for distributed-memory architectures, and a translator, implemented with Mathematica, was developed that yielded PLAPACK code and a cost analysis of its execution. In many ways, PLAWright code resembles what later became the FLAME/C and FLAME@lab APIs for the C and M-script (MATLAB) languages, and thus FLAMES2S merely targets a different architecture, namely a single processor. There are, however, differences: the translation of PLAWright to PLAPACK code was simpler by virtue of the fact that PLAPACK code itself still exhibits a level of abstraction very similar to that of PLAWright. In other words, it is a simpler translation from PLAWright to PLAPACK code (like translating the inputs to the Spark tool to FLAME/C code), and there was no need to produce lower-level code because the overhead of the PLAPACK API was minute compared with the cost of starting a communication on a distributed-memory architecture. FLAMES2S instead takes the high-level abstraction all the way down to low-level code.

7. CONCLUSION
On the surface, the central message of this paper seems simple and the results straightforward: the inherent structure in FLAME/C code allows a direct source-to-source transformer to be employed to produce code that attains the same high level of performance as traditional code. This task of translating high-level code is made even simpler when deficiencies in the original BLAS interface are remedied. In addition, we believe that the transformer completes a metamorphosis of dense and banded linear algebra libraries that started a decade ago [Gunnels et al. 2001], because it demonstrates that one can embrace programmability without sacrificing performance.

It is worth noting that, prior to this work, a legitimate objection to abandoning what we often call the LAPACK style of coding has been that coding at a lower level yields high performance, especially for small problem sizes, operations that perform O(n^2) computation with O(n^2) data, and/or computations on banded matrices. The presented prototype source-to-source transformer, FLAMES2S, and the preliminary results neutralize this objection.

Acknowledgements
This research was partially sponsored by NSF grants CCF-5496, OCI-8575, and CCF-9767, including a Research Experience for Undergraduates (REU) supplement that supported Richard Veras and Jonathan Monette, and a grant from Microsoft. Visits by Enrique Quintana-Ortí to UT-Austin were supported by the J. Tinsley Oden Faculty Fellowship Research Program of the Institute for Computational Engineering and Sciences (ICES) at UT-Austin. Enrique S. Quintana-Ortí is also partially supported by the CICYT project TIN8-937-C- and FEDER. We thank the other members of the FLAME team for their support.

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

REFERENCES

2002. Basic Linear Algebra Subprograms Technical Forum Standard. International Journal of High Performance Applications and Supercomputing 16 (Spring).
Bientinesi, P. 2006. Mechanical derivation and systematic analysis of correct linear algebra algorithms. Ph.D. thesis, Department of Computer Sciences, The University of Texas. Technical Report TR, September 2006.
Bientinesi, P., Gunnels, J. A., Myers, M. E., Quintana-Ortí, E. S., and van de Geijn, R. A. 2005. The science of deriving dense linear algebra algorithms. ACM Transactions on Mathematical Software 31, 1 (March), 1-26.
Bientinesi, P., Quintana-Ortí, E. S., and van de Geijn, R. A. 2005. Representing linear algebra algorithms in code: The FLAME application programming interfaces. ACM Trans. Math. Soft. 31, 1 (March), 27-59.
Bientinesi, P. and van de Geijn, R. A. A goal-oriented and modular approach to stability analysis. SIAM J. Matrix Anal. Appl. In review.

BLAS Technical Forum. BLAS Technical Forum.
Crout, P. D. 1941. A short method for evaluating determinants and solving systems of linear equations with real or complex coefficients. Trans. AIEE 60.
DeRose, L. and Padua, D. 1996. A MATLAB to FORTRAN90 translator and its effectiveness. In Proceedings of the 10th ACM International Conference on Supercomputing.
DeRose, L. A. 1996. Compiler techniques for MATLAB programs. Ph.D. thesis, Computer Sciences Department, The University of Illinois at Urbana-Champaign.
Dongarra, J. J., Bunch, J. R., Moler, C. B., and Stewart, G. W. 1979. LINPACK Users' Guide. SIAM, Philadelphia.
GNU Fortran. The GNU Fortran Compiler - Code Gen Options. gcc.gnu.org/onlinedocs/gfortran/code-gen-options.html.
GNU Scientific Library. GNU Scientific Library.
Gunnels, J. A. 2001. A systematic approach to the design and analysis of parallel dense linear algebra algorithms. Ph.D. thesis, Department of Computer Sciences, The University of Texas.
Gunnels, J. A., Gustavson, F. G., Henry, G. M., and van de Geijn, R. A. 2001. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Soft. 27, 4 (December), 422-455.
Joffrain, T., Low, T. M., Quintana-Ortí, E. S., van de Geijn, R., and Van Zee, F. G. 2006. Accumulating Householder transformations, revisited. ACM Trans. Math. Softw. 32, 2.
LabView. LabView.
Marsolf, B. A. 1997. Techniques for the interactive development of numerical linear algebra libraries for scientific computation. Ph.D. thesis, Computer Sciences Department, The University of Illinois at Urbana-Champaign.
Moler, C. B. 1980. MATLAB, an interactive matrix laboratory. Technical Report 369, Department of Mathematics and Statistics, University of New Mexico.
Quintana, E. S., Quintana, G., Sun, X., and van de Geijn, R. 2001. A note on parallel matrix inversion. SIAM J. Sci. Comput. 22, 5.
Quintana-Ortí, G., Igual, F. D., Quintana-Ortí, E. S., and van de Geijn, R. 2009. Solving dense linear systems on platforms with multiple hardware accelerators. In ACM SIGPLAN 2009 Symposium on Principles and Practice of Parallel Programming (PPoPP'09).
Quintana-Ortí, G., Quintana-Ortí, E. S., van de Geijn, R. A., Zee, F. G. V., and Chan, E. 2009. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software 36, 3 (June), 14:1-14:26.
Schreiber, R. and Van Loan, C. 1989. A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput. 10, 1 (Jan.).
Siek, J. G., Karlin, I., and Jessup, E. R. 2008. Build to order linear algebra kernels. In International Symposium on Parallel and Distributed Processing (IPDPS 2008).
Spark. Spark code skeleton generator.
Stewart, G. W. 1977. Research, development, and LINPACK. In Mathematical Software III, J. R. Rice, Ed. Academic Press, New York.
van de Geijn, R. A. and Quintana-Ortí, E. S. 2008. The Science of Programming Matrix Computations.
Van Zee, F. G. libflame: The Complete Reference. In preparation.
Van Zee, F. G., Chan, E., van de Geijn, R. A., Quintana-Ortí, E. S., and Quintana-Ortí, G. 2009. The libflame library for dense matrix computation. IEEE Computing in Science & Engineering 11, 6 (November/December).

Received Month Year; revised Month Year; accepted Month Year


More information

An M-script API for Linear Algebra Operations on Graphics Processors

An M-script API for Linear Algebra Operations on Graphics Processors Informe Técnico ICC 1-2-28 GLAME@lab: An M-script API for Linear Algebra Operations on Graphics Processors Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí

More information

An M-script API for Linear Algebra Operations on Graphics Processors

An M-script API for Linear Algebra Operations on Graphics Processors Informe Técnico ICC 1-2-28 GLAME@lab: An M-script API for Linear Algebra Operations on Graphics Processors Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí

More information

NAG Library Chapter Introduction. F16 Further Linear Algebra Support Routines

NAG Library Chapter Introduction. F16 Further Linear Algebra Support Routines NAG Library Chapter Introduction Contents 1 Scope of the Chapter.... 2 2 Background to the Problems... 2 3 Recommendations on Choice and Use of Available Routines... 2 3.1 Naming Scheme... 2 3.1.1 NAGnames...

More information

BLIS: A Framework for Rapid Instantiation of BLAS Functionality

BLIS: A Framework for Rapid Instantiation of BLAS Functionality 0 BLIS: A Framework for Rapid Instantiation of BLAS Functionality FIELD G. VAN ZEE and ROBERT A. VAN DE GEIJN, The University of Texas at Austin The BLAS Libray Instantiation Software (BLIS) is a new framework

More information

Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors

Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors Mostafa I. Soliman and Fatma S. Ahmed Computers and Systems Section, Electrical Engineering Department Aswan Faculty of Engineering,

More information

Exploiting the capabilities of modern GPUs for dense matrix computations

Exploiting the capabilities of modern GPUs for dense matrix computations Informe Técnico ICC 1-11-28 Exploiting the capabilities of modern GPUs for dense matrix computations Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí, Gregorio

More information

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Gregorio Quintana-Ortí Francisco D. Igual Enrique S. Quintana-Ortí Departamento de Ingeniería y Ciencia de Computadores Universidad

More information

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Andrés Tomás 1, Zhaojun Bai 1, and Vicente Hernández 2 1 Department of Computer

More information

Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix

Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix PAOLO BIENTINESI Duke University and BRIAN GUNTER Delft University of Technology and ROBERT A. VAN DE GEIJN The University

More information

Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures

Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures Gregorio Quintana-Ortí Enrique S Quintana-Ortí Ernie Chan Robert A van de Geijn Field G Van Zee Abstract This paper examines

More information

Algorithm 8xx: SuiteSparseQR, a multifrontal multithreaded sparse QR factorization package

Algorithm 8xx: SuiteSparseQR, a multifrontal multithreaded sparse QR factorization package Algorithm 8xx: SuiteSparseQR, a multifrontal multithreaded sparse QR factorization package TIMOTHY A. DAVIS University of Florida SuiteSparseQR is an implementation of the multifrontal sparse QR factorization

More information

A High Performance C Package for Tridiagonalization of Complex Symmetric Matrices

A High Performance C Package for Tridiagonalization of Complex Symmetric Matrices A High Performance C Package for Tridiagonalization of Complex Symmetric Matrices Guohong Liu and Sanzheng Qiao Department of Computing and Software McMaster University Hamilton, Ontario L8S 4L7, Canada

More information

Strategies for Parallelizing the Solution of Rational Matrix Equations

Strategies for Parallelizing the Solution of Rational Matrix Equations Strategies for Parallelizing the Solution of Rational Matrix Equations José M. Badía 1, Peter Benner, Maribel Castillo 1, Heike Faßbender 3, Rafael Mayo 1, Enrique S. Quintana-Ortí 1, and Gregorio Quintana-Ortí

More information

Toward Scalable Matrix Multiply on Multithreaded Architectures

Toward Scalable Matrix Multiply on Multithreaded Architectures Toward Scalable Matrix Multiply on Multithreaded Architectures Bryan Marker 1, Field G Van Zee 1, Kazushige Goto 1, Gregorio Quintana Ortí 2, and Robert A van de Geijn 1 1 The University of Texas at Austin

More information

Max Planck Institute Magdeburg Preprints

Max Planck Institute Magdeburg Preprints Peter Benner Pablo Ezzatti Enrique S. Quintana-Ortí Alfredo Remón Matrix Inversion on CPU-GPU Platforms with Applications in Control Theory MAX PLANCK INSTITUT FÜR DYNAMIK KOMPLEXER TECHNISCHER SYSTEME

More information

Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments

Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra SIAM Conference on Computational Science and Engineering

More information

Resources for parallel computing

Resources for parallel computing Resources for parallel computing BLAS Basic linear algebra subprograms. Originally published in ACM Toms (1979) (Linpack Blas + Lapack). Implement matrix operations upto matrix-matrix multiplication and

More information

Sparse Direct Solvers for Extreme-Scale Computing

Sparse Direct Solvers for Extreme-Scale Computing Sparse Direct Solvers for Extreme-Scale Computing Iain Duff Joint work with Florent Lopez and Jonathan Hogg STFC Rutherford Appleton Laboratory SIAM Conference on Computational Science and Engineering

More information

Accelerating GPU kernels for dense linear algebra

Accelerating GPU kernels for dense linear algebra Accelerating GPU kernels for dense linear algebra Rajib Nath, Stanimire Tomov, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville {rnath1, tomov,

More information

1 Motivation for Improving Matrix Multiplication

1 Motivation for Improving Matrix Multiplication CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n

More information

BLAS: Basic Linear Algebra Subroutines I

BLAS: Basic Linear Algebra Subroutines I BLAS: Basic Linear Algebra Subroutines I Most numerical programs do similar operations 90% time is at 10% of the code If these 10% of the code is optimized, programs will be fast Frequently used subroutines

More information

An API for Manipulating Matrices Stored by Blocks

An API for Manipulating Matrices Stored by Blocks An API for Manipulating Matrices Stored by Blocks Tze Meng Low Robert A van de Geijn Department of Computer Sciences The University of Texas at Austin 1 University Station, C0500 Austin, TX 78712 {ltm,rvdg}@csutexasedu

More information

Frequency Scaling and Energy Efficiency regarding the Gauss-Jordan Elimination Scheme on OpenPower 8

Frequency Scaling and Energy Efficiency regarding the Gauss-Jordan Elimination Scheme on OpenPower 8 Frequency Scaling and Energy Efficiency regarding the Gauss-Jordan Elimination Scheme on OpenPower 8 Martin Köhler Jens Saak 2 The Gauss-Jordan Elimination scheme is an alternative to the LU decomposition

More information

Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee.

Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee. Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee Outline Pre-intro: BLAS Motivation What is ATLAS Present release How ATLAS works

More information

Matrix Multiplication Specialization in STAPL

Matrix Multiplication Specialization in STAPL Matrix Multiplication Specialization in STAPL Adam Fidel, Lena Olson, Antal Buss, Timmie Smith, Gabriel Tanase, Nathan Thomas, Mauro Bianco, Nancy M. Amato, Lawrence Rauchwerger Parasol Lab, Dept. of Computer

More information

Copyright by Tze Meng Low 2013

Copyright by Tze Meng Low 2013 Copyright by Tze Meng Low 203 The Dissertation Committee for Tze Meng Low certifies that this is the approved version of the following dissertation: A Calculus of Loop Invariants for Dense Linear Algebra

More information

BLAS: Basic Linear Algebra Subroutines I

BLAS: Basic Linear Algebra Subroutines I BLAS: Basic Linear Algebra Subroutines I Most numerical programs do similar operations 90% time is at 10% of the code If these 10% of the code is optimized, programs will be fast Frequently used subroutines

More information

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development M. Serdar Celebi

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

Chapter 24a More Numerics and Parallelism

Chapter 24a More Numerics and Parallelism Chapter 24a More Numerics and Parallelism Nick Maclaren http://www.ucs.cam.ac.uk/docs/course-notes/un ix-courses/cplusplus This was written by me, not Bjarne Stroustrup Numeric Algorithms These are only

More information

Optimizations of BLIS Library for AMD ZEN Core

Optimizations of BLIS Library for AMD ZEN Core Optimizations of BLIS Library for AMD ZEN Core 1 Introduction BLIS [1] is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries [2] The framework was

More information

Using recursion to improve performance of dense linear algebra software. Erik Elmroth Dept of Computing Science & HPC2N Umeå University, Sweden

Using recursion to improve performance of dense linear algebra software. Erik Elmroth Dept of Computing Science & HPC2N Umeå University, Sweden Using recursion to improve performance of dense linear algebra software Erik Elmroth Dept of Computing Science & HPCN Umeå University, Sweden Joint work with Fred Gustavson, Isak Jonsson & Bo Kågström

More information

Runtime Data Flow Graph Scheduling of Matrix Computations with Multiple Hardware Accelerators

Runtime Data Flow Graph Scheduling of Matrix Computations with Multiple Hardware Accelerators Runtime Data Flow Graph Scheduling of Matrix Computations with Multiple Hardware Accelerators FLAME Working Note #5 Ernie Chan Department of Computer Science The University of Texas at Austin Austin, Texas

More information

Parallel Linear Algebra in Julia

Parallel Linear Algebra in Julia Parallel Linear Algebra in Julia Britni Crocker and Donglai Wei 18.337 Parallel Computing 12.17.2012 1 Table of Contents 1. Abstract... 2 2. Introduction... 3 3. Julia Implementation...7 4. Performance...

More information

Matrix Computations on GPUs, multiple GPUs and clusters of GPUs

Matrix Computations on GPUs, multiple GPUs and clusters of GPUs Matrix Computations on GPUs, multiple GPUs and clusters of GPUs Francisco D. Igual Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón (Spain). Matrix Computations on

More information

Intel Math Kernel Library 10.3

Intel Math Kernel Library 10.3 Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)

More information

Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators. Enrique S. Quintana-Ortí

Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators. Enrique S. Quintana-Ortí Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators Enrique S. Quintana-Ortí Disclaimer Not a course on how to program dense linear algebra kernels on s Where have you

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

IMPLEMENTATION OF AN OPTIMAL MATRIX-CHAIN PRODUCT EXPLOITING MATRIX SPECIALIZATION

IMPLEMENTATION OF AN OPTIMAL MATRIX-CHAIN PRODUCT EXPLOITING MATRIX SPECIALIZATION IMPLEMENTATION OF AN OPTIMAL MATRIX-CHAIN PRODUCT EXPLOITING MATRIX SPECIALIZATION Swetha Kukkala and Andrew A. Anda Computer Science Department St Cloud State University St Cloud, MN 56301 kusw0601@stcloudstate.edu

More information

Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra. Mark Gates, Stan Tomov, Azzam Haidar SIAM LA Oct 29, 2015

Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra. Mark Gates, Stan Tomov, Azzam Haidar SIAM LA Oct 29, 2015 Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra Mark Gates, Stan Tomov, Azzam Haidar SIAM LA Oct 29, 2015 Overview Dense linear algebra algorithms Hybrid CPU GPU implementation

More information

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1 LAPACK Linear Algebra PACKage 1 Janice Giudice David Knezevic 1 Motivating Question Recalling from last week... Level 1 BLAS: vectors ops Level 2 BLAS: matrix-vectors ops 2 2 O( n ) flops on O( n ) data

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

MAGMA Library. version 0.1. S. Tomov J. Dongarra V. Volkov J. Demmel

MAGMA Library. version 0.1. S. Tomov J. Dongarra V. Volkov J. Demmel MAGMA Library version 0.1 S. Tomov J. Dongarra V. Volkov J. Demmel 2 -- MAGMA (version 0.1) -- Univ. of Tennessee, Knoxville Univ. of California, Berkeley Univ. of Colorado, Denver June 2009 MAGMA project

More information

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable

More information

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency 1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming

More information

Introducing coop: Fast Covariance, Correlation, and Cosine Operations

Introducing coop: Fast Covariance, Correlation, and Cosine Operations Introducing coop: Fast Covariance, Correlation, and Cosine Operations November 14, 2017 Drew Schmidt wrathematics@gmail.com Version 0.6-1 Disclaimer Any opinions, findings, and conclusions or recommendations

More information

Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition

Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Nicholas J. Higham Pythagoras Papadimitriou Abstract A new method is described for computing the singular value decomposition

More information

Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures

Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures Emmanuel Agullo 1, Henricus Bouwmeester 2, Jack Dongarra 1, Jakub Kurzak 1, Julien Langou 2,andLeeRosenberg

More information

Lecture 27: Fast Laplacian Solvers

Lecture 27: Fast Laplacian Solvers Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall

More information

On the Efficacy of Haskell for High Performance Computational Biology

On the Efficacy of Haskell for High Performance Computational Biology On the Efficacy of Haskell for High Performance Computational Biology Jacqueline Addesa Academic Advisors: Jeremy Archuleta, Wu chun Feng 1. Problem and Motivation Biologists can leverage the power of

More information

Making Dataflow Programming Ubiquitous for Scientific Computing

Making Dataflow Programming Ubiquitous for Scientific Computing Making Dataflow Programming Ubiquitous for Scientific Computing Hatem Ltaief KAUST Supercomputing Lab Synchronization-reducing and Communication-reducing Algorithms and Programming Models for Large-scale

More information

Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures

Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures Azzam Haidar, Hatem Ltaief, Asim YarKhan and Jack Dongarra Department of Electrical Engineering and

More information

A Standard for Batching BLAS Operations

A Standard for Batching BLAS Operations A Standard for Batching BLAS Operations Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 5/8/16 1 API for Batching BLAS Operations We are proposing, as a community

More information

Mixed Data Layout Kernels for Vectorized Complex Arithmetic

Mixed Data Layout Kernels for Vectorized Complex Arithmetic Mixed Data Layout Kernels for Vectorized Complex Arithmetic Doru T. Popovici, Franz Franchetti, Tze Meng Low Department of Electrical and Computer Engineering Carnegie Mellon University Email: {dpopovic,

More information

Intel Math Kernel Library

Intel Math Kernel Library Intel Math Kernel Library Release 7.0 March 2005 Intel MKL Purpose Performance, performance, performance! Intel s scientific and engineering floating point math library Initially only basic linear algebra

More information

Level-3 BLAS on the TI C6678 multi-core DSP

Level-3 BLAS on the TI C6678 multi-core DSP Level-3 BLAS on the TI C6678 multi-core DSP Murtaza Ali, Eric Stotzer Texas Instruments {mali,estotzer}@ti.com Francisco D. Igual Dept. Arquitectura de Computadores y Automática Univ. Complutense de Madrid

More information

Parallel Implementation of QRD Algorithms on the Fujitsu AP1000

Parallel Implementation of QRD Algorithms on the Fujitsu AP1000 Parallel Implementation of QRD Algorithms on the Fujitsu AP1000 Zhou, B. B. and Brent, R. P. Computer Sciences Laboratory Australian National University Canberra, ACT 0200 Abstract This paper addresses

More information

Offloading Java to Graphics Processors

Offloading Java to Graphics Processors Offloading Java to Graphics Processors Peter Calvert (prc33@cam.ac.uk) University of Cambridge, Computer Laboratory Abstract Massively-parallel graphics processors have the potential to offer high performance

More information

Heterogeneity in Computational Environments

Heterogeneity in Computational Environments Heterogeneity in Computational Environments Seongjai Kim Abstract In teaching, learning, or research activities in computational mathematics, one often has to borrow parts of computational codes composed

More information

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager Intel Math Kernel Library (Intel MKL) BLAS Victor Kostin Intel MKL Dense Solvers team manager Intel MKL BLAS/Sparse BLAS Original ( dense ) BLAS available from www.netlib.org Additionally Intel MKL provides

More information

On The Computational Cost of FFT-Based Linear Convolutions. David H. Bailey 07 June 1996 Ref: Not published

On The Computational Cost of FFT-Based Linear Convolutions. David H. Bailey 07 June 1996 Ref: Not published On The Computational Cost of FFT-Based Linear Convolutions David H. Bailey 07 June 1996 Ref: Not published Abstract The linear convolution of two n-long sequences x and y is commonly performed by extending

More information