FLAMES2S: From Abstraction to High Performance


RICHARD VERAS, The University of Texas at Austin
JONATHAN MONETTE, The University of Texas at Austin
FIELD G. VAN ZEE, The University of Texas at Austin
ROBERT A. VAN DE GEIJN, The University of Texas at Austin
ENRIQUE S. QUINTANA-ORTÍ, Universidad Jaume I

This paper discusses how to achieve both portability and high performance for dense matrix libraries by expressing them at a high level of abstraction and mechanizing the transformation to more traditional low-level implementations. In the process of pursuing this goal we encountered a number of deficiencies in existing software layers in this domain, namely the Basic Linear Algebra Subprograms (BLAS) interface. As a result, the paper is as much about the mechanical transformation itself as it is about the modifications that had to be made to these lower-level interfaces in order to facilitate simplicity without sacrificing performance.

Categories and Subject Descriptors: G.4 [Mathematical Software]: Efficiency
General Terms: Algorithms; Performance
Additional Key Words and Phrases: linear algebra, libraries, high performance

Authors' addresses: Richard Veras, Jonathan Monette, Robert A. van de Geijn, Field Van Zee, Department of Computer Science, The University of Texas at Austin, Austin, TX 787, rvdg@cs.utexas.edu. Enrique S. Quintana-Ortí, Departamento de Ingeniería y Ciencia de Computadores, Universidad Jaume I, .7 Castellón, Spain, quintana@icc.uji.es.

1. INTRODUCTION
The present paper is best considered in the context of the overall goals and accomplishments of the FLAME project. That project advocates raising the level of abstraction at which one expresses algorithms and codes for the domain of matrix computations [Van Zee et al. 2009; ?].

Many benefits have been documented:

- A new notation allows algorithms to be presented in a way that captures the pictures that often accompany an explanation [Gunnels et al. 2001; Quintana et al. 2001].
- This notation allows systematic derivation of correct families of loop-based algorithms [Gunnels et al. 2001; Bientinesi et al. 2005; van de Geijn and Quintana-Ortí 2008; Gunnels 2001].
- This derivation process can be made mechanical [Bientinesi 2006].
- Application Programming Interfaces (APIs) can be defined for various languages so that representations in code mirror the algorithms [Bientinesi et al. 2005].
- Numerical error analyses can be made similarly systematic [Bientinesi 2006; Bientinesi and van de Geijn].
- The effort required to develop and maintain libraries is greatly reduced [Van Zee].
- New architectures can be easily accommodated [Quintana-Ortí et al. 2009; Quintana-Ortí et al. 2009].
- A production-grade library, libflame, has been developed [Van Zee et al. 2009; Van Zee].

Many of these results were published in previous ACM TOMS issues. The present paper describes how a source-to-source translator, FLAMES2S, makes the contributions of those papers, which culminate in the libflame library, a viable option for users of dense linear algebra libraries, even those who insist on near-optimal performance. The translated implementations are reminiscent of, and attain performance similar to, corresponding LAPACK implementations.

It is in trying to make the implementation mechanical that unintended consequences of interfaces often come to light. A secondary contribution of the present paper lies with insights into deficiencies that have existed since the Basic Linear Algebra Subprograms (BLAS) interface was standardized. These design flaws came to our attention by virtue of the fact that we were recreating a very large part of the functionality present in LAPACK. In some cases they are merely annoying, while in other cases they are serious, leading to routines in LAPACK not being thread-safe. To overcome this problem, we provide a new interface that is used internally within libflame, the BLAS-Like Interface Subprograms (BLIS). This interface also features an orthogonal but manifestly useful improvement: it supports both column-major and row-major storage as well as general row and column strides, by which we mean the case where elements within both rows and columns are separated by a constant, possibly non-unit, stride.

The paper is organized as follows. We begin in Section 2 with a motivating example and discuss some of the problems with both FLAME code and LAPACK. In Section 3 we introduce a source-to-source translator that translates high-level FLAME code to low-level C with direct calls to the BLAS. Section 4 illustrates some of the limitations we encountered in the original BLAS interface and proposes a set of BLAS-Like Interface Subprograms (BLIS) to address these issues. Performance results are discussed in Section 5. Section 6 summarizes related research, and closing remarks are given in the conclusion.

Fig. 1. Unblocked (left) and blocked (right) algorithms for computing the Cholesky factorization (three variants of each, in FLAME notation).

2. THE PROBLEM
In this section we describe the problem that our approach solves by discussing a representative example, the Cholesky factorization:

Theorem. Let A be an n × n symmetric positive definite (SPD) matrix. Then there exists an n × n lower triangular matrix L such that A = LL^T. If the diagonal elements of L are taken to be positive, L is unique. The matrix L is called the Cholesky factor of A.

2.1 A family of algorithms
In Figure 1 we present three unblocked algorithms and three blocked algorithms for overwriting the lower triangular part of the SPD matrix A with its Cholesky factor, in FLAME notation. How to systematically derive these algorithms is discussed, for example, in [van de Geijn and Quintana-Ortí 2008].
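To see where the updates in these variants come from, here is a brief sketch (ours, in the spirit of the derivations in [van de Geijn and Quintana-Ortí 2008]) for the right-looking case. Partition A and its Cholesky factor L conformally and equate A = LL^T:

    ( α11   ⋆  )   ( λ11    0  ) ( λ11   l21^T )   ( λ11^2              ⋆              )
    ( a21  A22 ) = ( l21   L22 ) (  0    L22^T ) = ( λ11 l21   l21 l21^T + L22 L22^T   )

so that λ11 = sqrt(α11), l21 = a21/λ11, and L22 is the Cholesky factor of A22 - l21 l21^T. Overwriting α11, a21, and A22 with these quantities and repeating the process on the updated A22 yields exactly the updates of unblocked Variant 3 in Figure 1; reordering and deferring parts of this computation yields the other variants.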

#include "FLAME.h"

FLA_Error FLA_Chol_l_unb_var2( FLA_Obj A )
{
  FLA_Obj ATL,   ATR,      A00,  a01,     A02,
          ABL,   ABR,      a10t, alpha11, a12t,
                           A20,  a21,     A22;
  int value = 0;

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00,  /**/ &a01,     &A02,
                        /* ************* */   /* ************************** */
                                                &a10t, /**/ &alpha11, &a12t,
                           ABL, /**/ ABR,       &A20,  /**/ &a21,     &A22,
                           1, 1, FLA_BR );

    /*------------------------------------------------------------------*/

    // alpha11 = alpha11 - a10t * a10t
    FLA_Dotcs( FLA_CONJUGATE, FLA_MINUS_ONE, a10t, a10t, FLA_ONE, alpha11 );

    // a21 = a21 - A20 * a10t
    FLA_Gemvc( FLA_NO_TRANSPOSE, FLA_CONJUGATE, FLA_MINUS_ONE, A20, a10t,
               FLA_ONE, a21 );

    // alpha11 = sqrt( alpha11 )
    value = FLA_Sqrt( alpha11 );

    if ( value != FLA_SUCCESS )
      return ( FLA_Obj_length( A00 ) + 1 );

    // a21 = a21 / alpha11
    FLA_Inv_scal( alpha11, a21 );

    /*------------------------------------------------------------------*/

    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00,  a01,     /**/ A02,
                                                     a10t, alpha11, /**/ a12t,
                            /* *************** */  /* *********************** */
                              &ABL, /**/ &ABR,       A20,  a21,     /**/ A22,
                              FLA_TL );
  }

  return value;
}

Fig. 2. The libflame unblocked Variant 2 Cholesky factorization routine. This implementation supports all four standard floating-point datatypes: single, double, complex, and double complex.

2.2 Representing the algorithms in code
As part of the FLAME project, we have defined APIs for representing algorithms in code that hide details of indexing so that the code closely resembles the algorithms [Bientinesi et al. 2005].

      SUBROUTINE ZPOTF2( UPLO, N, A, LDA, INFO )
*
*  -- LAPACK routine (version 3.) --
*
      < deleted code >
*
*        Compute the Cholesky factorization A = L*L'.
*
         DO 20 J = 1, N
*
*           Compute L(J,J) and test for non-positive-definiteness.
*
            AJJ = DBLE( A( J, J ) ) - ZDOTC( J-1, A( J, 1 ), LDA,
     $            A( J, 1 ), LDA )
            IF( AJJ.LE.ZERO ) THEN
               A( J, J ) = AJJ
               GO TO 30
            END IF
            AJJ = SQRT( AJJ )
            A( J, J ) = AJJ
*
*           Compute elements J+1:N of column J.
*
            IF( J.LT.N ) THEN
               CALL ZLACGV( J-1, A( J, 1 ), LDA )        <<<<<<< a10t := conj( a10t )
               CALL ZGEMV( 'No transpose', N-J, J-1, -CONE, A( J+1, 1 ),
     $                     LDA, A( J, 1 ), LDA, CONE, A( J+1, J ), 1 )
               CALL ZLACGV( J-1, A( J, 1 ), LDA )        <<<<<<< a10t := conj( a10t )
               CALL ZDSCAL( N-J, ONE / AJJ, A( J+1, J ), 1 )
            END IF
   20    CONTINUE
      END IF
      GO TO 40
*
   30 CONTINUE
      INFO = J
*
   40 CONTINUE
      RETURN
*
*     End of ZPOTF2
*
      END

Fig. 3. LAPACK unblocked left-looking complex-valued Cholesky factorization.

Further, our techniques for systematically and mechanically deriving correct algorithms can be leveraged to provide a high level of confidence in the code, since there are few opportunities for the introduction of programming bugs. A FLAME/C code for unblocked Variant 2 that closely resembles the code in the libflame library is given in Figure 2. For comparison, in Figure 3 we present the same code in the LAPACK style of coding. We note that coding the blocked algorithms in the LAPACK style is considerably more challenging due to indexing details.

Fig. 4. Performance of various implementations of Cholesky factorization. Top: all six FLAME/C variants (blocked Variants 1-3 and unblocked Variants 1-3) together with netlib blocked + netlib unblocked. Bottom: FLAME blocked Variant 3 combined with a FLAME unblocked or a netlib unblocked subproblem, compared with netlib blocked + netlib unblocked.

We claim that if one wishes to include several algorithms for all operations supported by LAPACK, then developing and maintaining a library for dense matrix computations coded with an API like that used by libflame is clearly preferred over the style of coding employed by LAPACK. Those who are not at least somewhat open to that view can probably stop reading at this point.

2.3 The problem with (our) abstractions
In Figure 4 (top), we show the performance of all six algorithms coded with the FLAME/C API as well as block-size-tuned implementations that are part of the netlib release of LAPACK (version 3..). In Figure 4 (bottom) we also show, for a select implementation, the performance of the blocked algorithms when the unblocked algorithms, which are used to compute the subproblem A11 := Chol(A11), are coded with the FLAME/C API or in LAPACK style.

The conclusion is that while the FLAME/C API incurs negligible overhead for the blocked algorithms, there is a hefty performance penalty for the unblocked algorithms that also affects the performance of the blocked algorithm. This should come as no surprise: the abstractions hide indices. In lieu of explicit indexing, various data structures are updated when matrix objects are partitioned and repartitioned, which incurs some additional cost that would not be present if the algorithms were coded at a low level. One solution would be to code the unblocked algorithms in the style of LAPACK, except in C. But that would defeat the benefits of the FLAME/C API for a large number of routines that are part of the library.

2.4 The problem with LAPACK
The style of coding used by LAPACK dates back to its predecessor, LINPACK, which emerged in the mid-1970s [Dongarra et al. 1979; Stewart 1977]. As an illustration, a so-called unblocked algorithm for Cholesky factorization is given in Figure 3. At the time of its release, LAPACK represented a significant leap forward in a number of ways. For example, the package merged and extended the functionality of EISPACK and LINPACK, portable performance was achieved through the use of blocked algorithms, and algorithms were coded in terms of standardized level-1, -2, and -3 BLAS subroutines. Still, there are a number of characteristics that by now make the package seem dated:

- To a large degree, LAPACK adheres to the Fortran-77 standard. This means that names of routines and variables cannot use more than six letters. Clearly, this can be easily overcome, but there does not seem to be any inclination on the part of the LAPACK developers to do so.
- A user of LAPACK must have access to a Fortran compiler. While one can call Fortran routines from C and C++, even if one uses a precompiled version of LAPACK, or CLAPACK, which was created with the f2c tool, one must have access to certain Fortran runtime libraries.
- It is assumed that matrices are stored in column-major order. Some libraries that currently build upon LAPACK choose to store matrices by rows [LabView ; ]. This mismatch can be overcome by (implicitly or explicitly) transposing operands when calling LAPACK routines. However, this workaround adds unnecessary complexity.
- Experience has taught us that on different architectures, different algorithmic variants are often superior. For the most part, LAPACK includes one unblocked and one blocked algorithm for each supported operation. Since LAPACK consists of more than a million lines of code already, maintaining a multitude of algorithmic variants would greatly increase the size of the code base, not to mention require an enormous effort in programming.
- The adherence to Fortran-77 prevents the use of recursion to achieve better performance by introducing multiple levels of blocking.
- A very large number of tunable parameters make optimization cumbersome.

- Workspace to be used inside routines must be explicitly passed in as a parameter. While this guarantees that the user knows how much memory will be used, it also creates a considerable burden on the user.

These problems can be overcome by the LAPACK library developer and/or the LAPACK user, but not without a considerable investment in time and effort.

3. FLAMES2S: A SOURCE-TO-SOURCE TRANSFORMER FOR FLAME/C CODE
Conventional wisdom has been that, in scientific computing, we must sacrifice programmability for the sake of performance and that therefore abstractions like the FLAME/C API cannot be afforded. However, we believe that one does not have to choose between low-performing, highly-readable code like that given in Figure 2 and higher-performing code that explicitly exposes indices as in Figure 3. In this section, we discuss a source-to-source transformer that translates code implemented using the FLAME/C API into more traditional (LAPACK-like) code, yielding the best of both worlds. FLAMES2S is remarkably elegant yet powerful: essentially, it consists of a set of rewrite rules that transforms a high-level description of the algorithm to code.

#include "FLAME.h"

#define A( i, j )  buff_a[ (j)*ldim_a + (i) ]

FLA_Error FLA_Chol_l_opt_var2( FLA_Obj A )
{
  FLA_Datatype datatype;
  int          ldim_a, m_a, n_a;

  datatype = FLA_Obj_datatype( A );
  ldim_a   = FLA_Obj_ldim( A );
  m_a      = FLA_Obj_length( A );
  n_a      = FLA_Obj_width( A );

  switch( datatype )
  {
    case FLA_FLOAT:
      < deleted lines >

    case FLA_DOUBLE_COMPLEX:
    {
      dcomplex* buff_a = ( dcomplex* ) FLA_Obj_buffer( A );

      FLA_Chol_l_opz_var2( m_a, n_a, buff_a, ldim_a );

      break;
    }
  }

  return FLA_SUCCESS;
}

Fig. 5. Wrapper routine that separates the call to FLA_Chol_l_unb_var2 into routines for the different datatypes.

FLA_Error FLA_Chol_l_opz_var2( int m_a, dcomplex* buff_a, int rs_a, int cs_a )
{
  int j;

  for( j = 0; j < m_a; j++ )
  {
    dcomplex* a10t    = &A( j,   0 );
    dcomplex* alpha11 = &A( j,   j );
    dcomplex* A20     = &A( j+1, 0 );
    dcomplex* a21     = &A( j+1, j );

    int m_a_min_j_min_one = m_a - j - 1;

    // FLA_Dotcs( FLA_CONJUGATE, FLA_MINUS_ONE, a10t, a10t, FLA_ONE, alpha11 );
    bli_zdots( BLIS_CONJUGATE, j,
               FLA_DOUBLE_COMPLEX_PTR( FLA_MINUS_ONE ), a10t, rs_a, a10t, rs_a,
               FLA_DOUBLE_COMPLEX_PTR( FLA_ONE ), alpha11 );

    // FLA_Gemvc( FLA_NO_TRANSPOSE, FLA_CONJUGATE, FLA_MINUS_ONE, A20, a10t,
    //            FLA_ONE, a21 );
    bli_zgemv( BLIS_NO_TRANSPOSE, BLIS_CONJUGATE, m_a_min_j_min_one, j,
               FLA_DOUBLE_COMPLEX_PTR( FLA_MINUS_ONE ), A20, rs_a, cs_a,
               a10t, rs_a,
               FLA_DOUBLE_COMPLEX_PTR( FLA_ONE ), a21, cs_a );

    // FLA_Sqrt( alpha11 );
    {
      FLA_Error error;
      bli_zsqrte( alpha11, &error );
      if( error != FLA_SUCCESS ) return j;
    }

    // FLA_Inv_scal( alpha11, a21 );
    bli_zinvscalv( BLIS_NO_CONJUGATE, m_a_min_j_min_one, alpha11, a21, cs_a );
  }

  return FLA_SUCCESS;
}

Fig. 6. Output of the FLAMES2S translator when the BLIS interface is used. (Very minor editing was used to slightly shorten the code.)

Fig. 7. The Spark tool for generating code skeletons.

How we write FLAME/C code. We start by very briefly reviewing the process by which we produce a routine like the one in Figure 2. This will help the reader understand why FLAME/C code is easy to parse. Once the algorithm has been derived, a code skeleton is generated using a webpage-based tool, Spark [Spark], depicted in Figure 7. (We encourage the reader to visit this website and to try it before continuing.) The idea is that by filling out a simple form, most of the code can be generated automatically, leaving only the updates between the /*------*/ lines to be filled in manually with subroutine calls that perform the necessary computations. We note that one of the output language options of the tool yields a representation that typesets a skeleton for algorithms with LaTeX, as depicted in Figure 1.

Translating FLAME/C code to LAPACK-like code. The FLAME/C code in libflame is highly structured, and since much of it is automatically generated by tools like Spark, it is easy to parse. As a result, a translator, FLAMES2S, was written that can take code written at a high level of abstraction (e.g., the code in Figure 2) and translate it to an indexed loop and calls to a BLAS-like interface (e.g., the code in Figures 5 and 6).
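To make the notion of rewrite rules concrete, the following sketch (ours, not code emitted by the translator) spells out the correspondence that such rules exploit. At iteration j of an unblocked sweep, the repartitioned FLAME objects of Figure 2 are simply pointers into the matrix buffer together with lengths and strides. Assuming column-major storage with leading dimension ldim_a, so that element (i,j) lives at buff_a[ j*ldim_a + i ], the fragment below (double precision real case, with j, buff_a, and ldim_a in scope as in the loop body of Figure 6) shows where each object points:

  /* Hypothetical illustration of the object-to-pointer correspondence;
     names follow the FLAME convention of Figures 2 and 6.               */
  double* alpha11 = &buff_a[ j*ldim_a + j     ];  /* A( j, j )                                    */
  double* a10t    = &buff_a[ 0*ldim_a + j     ];  /* row j, columns 0..j-1; element stride ldim_a */
  double* a21     = &buff_a[ j*ldim_a + (j+1) ];  /* column j, rows j+1..m-1; element stride 1    */
  double* A20     = &buff_a[ 0*ldim_a + (j+1) ];  /* rows j+1..m-1, columns 0..j-1                */

Each FLAME/C call in the loop body is then rewritten, one for one, into the BLAS-like call that operates on these pointers, with the lengths (here j and m_a - j - 1) and strides passed explicitly; that is precisely the shape of the code in Figure 6.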

4. BLAS AND BLAS-LIKE INTERFACES
The reader will have noticed that the code in Figure 6 calls routines that are reminiscent of the BLAS. In this section we briefly discuss why we created a new interface. We stress that at this moment we merely consider this layer to be for our own use within the libflame library and that it is currently implemented in terms of the conventional BLAS interface.

The problem with traditional (Fortran) BLAS. An initial implementation of the FLAMES2S translator produced output that called the traditional (Fortran) BLAS. Let us focus on how the matrix-vector multiplication in Figure 2 was translated into a call to the traditional BLAS when a double precision complex routine is requested:

    FLA_Gemvc( FLA_NO_TRANSPOSE, FLA_CONJUGATE, FLA_MINUS_ONE, A20, a10t,
               FLA_ONE, a21 );

is translated to

    // FLA_Gemvc( FLA_NO_TRANSPOSE, FLA_CONJUGATE, FLA_MINUS_ONE, A20, a10t,
    //            FLA_ONE, a21 );
    {
      // Note: vector a10t needs to be conjugated.
      dcomplex* tmp_vec;

      tmp_vec = ( dcomplex* ) malloc( j * sizeof( dcomplex ) );

      zcopy_( &j, &A(j, 0), &ldim_a, &tmp_vec[0], &I_ONE );
      conj_vector( j, tmp_vec );

      zgemv_( "No Transpose", &m_a_min_j_min_one, &j,
              FLA_DOUBLE_COMPLEX_PTR( FLA_MINUS_ONE ), &A(j+1, 0), &ldim_a,
              &tmp_vec[0], &I_ONE,
              FLA_DOUBLE_COMPLEX_PTR( FLA_ONE ), &A(j+1, j), &I_ONE );

      free( tmp_vec );
    }

This simple example illustrates some of the problems that are immediately encountered:

- The need for explicit conjugation. Notice that for the Cholesky factorization example it would have been possible to conjugate the vector a10t in place and then unconjugate it after the matrix-vector operation was complete. However, in general this would not be thread-safe if the vector being conjugated were part of an input operand rather than an input/output operand. The tool would either have to know more about the context or it would need to produce the slower, more complicated (but thread-safe) code as it does here. But regardless of whether the application code is thread-safe or not, the inability to express implicit conjugation via the gemv interface means that the application code must explicitly conjugate the contents of a10t at least once.

- The need for a temporary vector and an additional copy operation. If the generated code is to be thread-safe, the contents of the vector a10t must be copied to a temporary buffer where they may be safely conjugated. The dynamic allocation and eventual freeing of the temporary vector undoubtedly incurs a small cost. Further, the copy operation incurs an additional cost proportional to the length of a10t. It may be argued, or even shown, that these costs are relatively small. However, it cannot be shown that these costs are necessary; the BLAS could have easily incorporated an option to operate with the input vector's conjugate on the fly, at virtually no cost. We also point out that the above code is significantly less readable than a hypothetical zgemv interface that allowed conjugation of the vector.

- The requirement that matrices be stored in column-major order. Column-major storage, while commonplace in numerical applications, is not used universally. Applications which store data in row-major order would be forced to perform intricate transformations on the BLAS parameters in order to induce the correct computation. Such hackery, for most users, would be error-prone and result in code that is even more difficult to read, if not entirely obfuscated.

The code translation shown in Figure 6 is much simpler thanks to the BLIS, which addresses each of the above concerns by providing a more powerful interface.
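To see why operating with the conjugate on the fly costs essentially nothing, consider the following illustrative loop nest. It is not the BLIS implementation, and the routine name and its exact parameters are ours; it merely shows that a conjugation flag and general row/column strides are absorbed into the inner loop, so no temporary vector, extra pass over the data, or storage-format assumption is needed:

#include <complex.h>

/* Sketch of a gemv-like kernel: y := beta*y + alpha * A * op(x), where op(x)
   is x or conj(x).  Element (i,j) of A is read at A[ i*rs_a + j*cs_a ], so
   rs_a = 1, cs_a = ldim gives column-major storage and rs_a = ldim, cs_a = 1
   gives row-major storage.  Illustrative only; not the BLIS code.            */
void zgemv_conjx_sketch( int conjx, int m, int n,
                         double complex alpha,
                         const double complex* A, int rs_a, int cs_a,
                         const double complex* x, int incx,
                         double complex beta,
                         double complex* y, int incy )
{
  for ( int i = 0; i < m; i++ )
  {
    double complex acc = 0.0;

    for ( int j = 0; j < n; j++ )
    {
      double complex chi = x[ j*incx ];
      if ( conjx ) chi = conj( chi );      /* conjugation folded into the loop */
      acc += A[ i*rs_a + j*cs_a ] * chi;   /* rs_a/cs_a admit any storage      */
    }

    y[ i*incy ] = beta * y[ i*incy ] + alpha * acc;
  }
}

The conjugation amounts to a sign flip on the imaginary part of each element as it is consumed, which is why such an interface can expose it as an option without a measurable performance penalty.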

Why not the CBLAS? The CBLAS define C bindings to the traditional BLAS interface. For a number of reasons they do not serve our purposes. First, we find the parameter-passing conventions to be unnecessarily inconsistent and confusing. For instance, real scalars are passed by value, but complex scalars (along with all vectors and matrices) are passed by address. Also, the CBLAS do not define any complex types, and so the API uses the much vaguer void* in place of the actual type for all complex parameters, making the interface more weakly typed and more difficult to read. Finally, while the CBLAS support row-major storage, they require that all matrix operands be stored in row-major (or column-major) order; mixing storage formats is not allowed.

Why not the BLAST Forum extensions? Some of the problems with complex datatypes (e.g., the need to conjugate vectors) were fixed in the proposed new BLAS interface that resulted from the BLAS Technical Forum [BLA ; ]. Unfortunately, that new interface never caught on. Not even LAPACK was changed to use the new standard. Moreover, the C interface was somewhat clunky, not unlike the CBLAS interface to the traditional BLAS.

What is different in the BLIS? The BLAS-Like Interface Subprograms present in libflame provide a set of APIs similar to that of the BLAS while supporting features that were missing from the original BLAS specification. The following is a list of the features, including improvements made over the BLAS:

- Low-level interfaces. The BLIS, like the BLAS, is a set of low-level interfaces. No objects or descriptors are used, and thus all information is passed into the API explicitly. Therefore we retain the potential for maximum performance.

- No return values. The original BLAS supported a handful of operations, such as the dot product, that were implemented as functions. (In Fortran, subroutines and functions are distinguished in that only the latter return values to the caller.) In practice, correctly linking to the routines for complex datatypes proved to be a nightmare due to the competing f2c and GNU return value standards [GNU Fortran]. The BLIS avoids this altogether by always using an extra parameter (passed by address) in lieu of returning numerical values.

- Row and column strides. By including row and column strides in the interface, the BLIS interfaces support both row-major and column-major storage. Furthermore, general row and column storage, whereby both strides are non-unit, is made possible. Further still, the API allows the user to perform computation on operands of different storage layouts.

- Simple, consistent interfaces. All parameters that convey information about the operation, such as whether a transposition is performed, are typed as char. All stride parameters are int. All scalars, vectors, and matrices are passed by address. We feel these conventions provide a more natural interface for users of C.

- Consistent conjugation across complex- and real-typed routines. If a complex-typed routine allows the user to conjugate an operand, then the corresponding real-typed routine contains the same option. For example, the BLIS routine bli_zgemv provides a conjx parameter, but so does bli_dgemv. Thus, the BLIS looks at conjugation in the API type-agnostically, as an argument to the mathematical operation itself. Some would argue that such parameters are redundant; why have a conjugation parameter if it has no effect on the computation? We believe that such parameters have a purpose because they allow users, and tools such as FLAMES2S, to encode critical features of the operation (the conjugation of a vector, for example) into the routine invocation. By looking at a real-typed invocation, one can correctly project the operation out to the complex domain. If we were to leave out such redundant parameters from the interface, this kind of information would be lost when the application or library is coded.

- Operations not present in the BLAS. Perhaps most importantly, the BLIS include interfaces to operations not defined in the BLAS. For example, as alluded to previously, the BLIS implements matrix-vector multiplication (gemv) where the input vector may be optionally conjugated in a thread-safe manner. Another useful operation is gemv where the matrix operand is conjugated. Other extended operations include axpy, copy, dot, scal, ger, hemv, her, her2, trmv, trsv, gemm, trmm, and trsm.

In general, the BLIS attempts to expand upon the original BLAS to provide the most flexible, easy-to-use API possible while still providing low-overhead interfaces to essential linear algebra kernels. This revised interface layer streamlines both the implementation and the output of tools such as FLAMES2S while also lifting much of the programming burden from casual users.

5. IMPACT ON PERFORMANCE
The scientific computing community is typically willing to give up programmability if it means attaining better performance. In this section we show that the FLAMES2S translator produces code whose performance matches that of the netlib LAPACK implementation.

5.1 Platform details
All experiments were performed on a single core of a Dell PowerEdge R900 server consisting of four Intel Dunnington six-core processors. Each core provides a peak performance of 10.64 GFLOPS and has access to 96 GBytes of shared main memory. Performance experiments were gathered under the GNU/Linux .6.8 operating system. Source code was compiled by the Intel C/C++ Compiler, version ..

5.2 Implementations
We demonstrate the impact of translating high-level code using the FLAME/C API to low-level code using indexed loops and calls to the BLIS interface. Performance for three important operations supported by MKL .., LAPACK 3.., and libflame r394 is reported: Cholesky factorization, LU factorization with partial pivoting, and QR factorization via Householder transformations. The results are representative of what is observed for other operations as well. The following implementations were compared:

Unblocked
- FLAME unblocked. The best unblocked algorithmic variant currently supported by libflame, coded using the FLAME/C API.
- optimized unblocked. The unblocked algorithmic variant used for FLAME unblocked, translated to a plain loop with calls to the BLIS interface, as discussed in this paper.
- netlib unblocked. The unblocked implementation in LAPACK distributed on netlib.
- MKL unblocked. The unblocked implementation in MKL.

Blocked
- FLAME blocked + FLAME unblocked. The best blocked algorithmic variant currently supported by libflame, coded using the FLAME/C API. The subproblem is computed via a call to the best unblocked algorithmic variant coded using the FLAME/C API.
- FLAME blocked + optimized unblocked. The best blocked algorithmic variant currently supported by libflame, coded using the FLAME/C API. The subproblem is computed via a call to the best unblocked algorithmic variant coded using the FLAME/C API and translated to a plain loop with calls to the BLIS interface.
- FLAME blocked + netlib unblocked. The best blocked algorithmic variant currently supported by libflame, coded using the FLAME/C API. The subproblem is computed via a call to the LAPACK unblocked algorithmic variant distributed from netlib.
- netlib blocked + netlib unblocked. The LAPACK blocked routine linked to the LAPACK unblocked routine, available from netlib.
- MKL blocked. The blocked implementation in MKL.

For all experiments that involve FLAME or netlib code, sequential level-1, -2, and -3 BLAS were provided by MKL. Also, we use double precision floating-point arithmetic for all experiments.

5.3 Results
We report both the rate of computation that is attained (in GFLOPS) and the speedup relative to the netlib implementation.
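As a guide to reading the graphs, and assuming the customary accounting (which the text does not state explicitly), these quantities are:

    GFLOPS = f(n) / ( 10^9 * t ),        speedup = t_netlib / t,

where t is the measured time in seconds, t_netlib is the time of the corresponding netlib implementation (netlib unblocked for the unblocked comparisons, netlib blocked + netlib unblocked for the blocked ones), and f(n) is the standard flop count for an n × n matrix: n^3/3 for the Cholesky factorization, 2n^3/3 for LU with partial pivoting, and 4n^3/3 for the Householder QR factorization.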

Fig. 8. Performance of various unblocked implementations of Cholesky factorization (double precision real): attained GFLOPS and speedup relative to netlib unblocked, for FLAME unblocked, optimized unblocked, netlib unblocked, and MKL unblocked.

For the absolute performance graphs (on the left), the top of the graph represents the theoretical peak of the architecture. To facilitate easy comparison, the algorithmic block size was set to equal 128 for all experiments. This block size is not necessarily optimal for all problem sizes. But optimality is not the point of this paper: the point is that abstraction does not adversely affect performance. Tuning the implementations is an orthogonal issue.

Cholesky factorization. In Figures 8 and 9 we report the performance attained by various implementations of Cholesky factorization, using double precision real data and updating only the lower triangular part of the matrix. These graphs tell the story as we would expect: Using the FLAME/C API incurs a considerable overhead for the unblocked algorithm, which is overcome when the code is translated to a simple loop with calls to the BLIS interface.

Fig. 9. Performance of various blocked implementations of Cholesky factorization when using different unblocked implementations (double precision real): attained GFLOPS and speedup relative to netlib blocked + netlib unblocked.

Even for the blocked algorithm, the low performance of the suboperation affects how fast the performance ramps up.

We provide results for the Cholesky factorization of double precision complex matrices in Figure 10, since we used the complex instantiation of the operation as a motivating example for the BLIS in Section 4. For nearly all problem sizes, complex FLAME codes that call BLIS-based unblocked routines perform as well as or better than those found in LAPACK, despite the additional copies performed by the BLIS implementation in libflame. For the remaining performance experiments, we will only show performance results for double precision real data.

Fig. 10. Performance of various unblocked (top) and blocked (bottom) implementations of Cholesky factorization (double precision complex).

LU with partial pivoting. In Figure 11 we report the performance attained by various implementations of LU factorization with partial pivoting. Here the comparison is complicated by the fact that not all implementations use the same algorithmic variant.

1. Unblocked.
- FLAME unblocked, optimized unblocked, MKL unblocked. The FLAME unblocked implementations use Variant 4, which is also known as the Crout variant [Crout 1941]. This is an implementation that casts most computation in terms of matrix-vector multiplication. From the performance signature we deduce that MKL uses the same variant.

Fig. 11. Performance of various unblocked (top) and blocked (bottom) implementations of LU factorization with partial pivoting (double precision real).

- netlib unblocked. This implementation uses Variant 5, which is also known as the right-looking variant. It casts most computation in terms of a rank-1 update. The observation is that a rank-1 update generates twice the memory traffic of a matrix-vector multiplication, which is the reason why the netlib unblocked implementation attains worse performance.

The fact that the optimized unblocked implementation attains performance that is close to the commercial MKL implementation supports the claim that the translated code implemented with the BLIS interface preserves performance.
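A back-of-the-envelope count, ours rather than the paper's, makes the traffic argument concrete. For an m × n trailing submatrix A, and ignoring the vectors, which are small by comparison:

    rank-1 update   A := A - y x^T :   2mn flops,   mn reads + mn writes of A   (about 2mn memory accesses)
    gemv            y := y - A x   :   2mn flops,   mn reads of A, A not written   (about mn memory accesses)

Both perform the same number of floating-point operations per element of A, but the rank-1 update must also store every element back, so it moves roughly twice as much data through the memory hierarchy for the same amount of useful computation.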

2. Blocked. Again it is observed that the FLAME blocked + FLAME unblocked implementation ramps up slowly. The FLAME blocked + optimized unblocked implementation outperforms both the FLAME blocked + netlib unblocked and the pure netlib (netlib blocked + netlib unblocked) algorithm because of the better algorithmic variant used for the subproblem. The MKL implementation is more highly tuned. We believe that to a large degree this is due to a better implementation of the pivoting routine that swaps blocks of rows.

Fig. 12. Performance of various unblocked (top) and blocked (bottom) implementations of Householder QR factorization (double precision real).

Householder QR factorization. In Figure 12 we report the performance attained by various implementations of QR factorization based on Householder transformations. The LAPACK and MKL implementations use the compact WY transform to accumulate Householder transformations [Schreiber and Van Loan 1989], while the libflame implementation uses a slight variant of that, which we call the UT transform [Joffrain et al. 2006].

Other than the FLAME unblocked version, which suffers from the usual overhead, the unblocked implementations achieve very similar performance. The slightly lower performance attained by optimized unblocked can be attributed to the fact that that implementation always accumulates an additional triangular matrix needed for the blocked algorithm, while its counterparts do not. The most plausible reason why the MKL QR factorization attains better performance is that it uses a smaller block size when the problem size is relatively small. This is beneficial because accumulating the Householder transformations adds extra computation, and therefore there is a trade-off between the benefits of using a blocked algorithm and the extra computation that is required to support it. Tuning would similarly improve the FLAME implementation.

5.4 Summary
It is important not to get distracted by the performance graphs. The point of the paper is that one can code at a high level of abstraction and, with the help of a simple code-to-code translator, attain performance that rivals that of a more traditional library. However, the performance graphs for LU factorization do show that there is a performance benefit that results from coding at a higher level of abstraction: without requiring unreasonable extra effort, additional algorithmic variants can be included in the library so that the best variant for the situation can be employed.

6. RELATED WORK
It can be argued that there have been many efforts to express linear algebra libraries at a high level of abstraction, starting with the original MATLAB environment developed in the 1970s [Moler 1980]. Mapping from MATLAB-like descriptions to high-performance implementations was already pursued by the FALCON project [DeRose 1996; Marsolf 1997; DeRose and Padua 1996], and MATLAB itself has a compiler that does so. More recently, a project to produce Build to Order Linear Algebra Kernels [Siek et al. 2008] takes an approach similar in philosophy to what we are proposing: the input is a MATLAB-like description of linear algebra algorithms and the output is optimized code. What sets our approach apart from these efforts is that the FLAME project has managed to greatly simplify how algorithms are expressed at a high level of abstraction.

A solution close to ours was proposed in the dissertation of John Gunnels [Gunnels 2001], a participant in what became the FLAME project. He proposed a domain-specific language, PLAWright, for expressing PLAPACK code for distributed-memory architectures, and a translator, implemented with Mathematica, was developed that yielded PLAPACK code and a cost analysis of its execution. In many ways, PLAWright code resembles what later became the FLAME/C and FLAME@lab APIs for the C and M-script (MATLAB) languages, and thus FLAMES2S merely targets a different architecture, namely a single processor. There are, however, differences: the translation of PLAWright to PLAPACK code was simpler by virtue of the fact that PLAPACK code itself still exhibits a level of abstraction very similar to that of PLAWright. In other words, it is a simpler translation from PLAWright to PLAPACK code (like translating the inputs to the Spark tool to FLAME/C code), and there was no need to produce lower-level code because the overhead of the PLAPACK API was minute compared with the cost of starting a communication on a distributed-memory architecture. FLAMES2S instead takes the high-level abstraction all the way down to low-level code.

7. CONCLUSION
On the surface, the central message of this paper seems simple and the results straightforward: the inherent structure in FLAME/C code allows a direct source-to-source transformer to be employed to produce code that attains the same high level of performance as traditional code. This task of translating high-level code is made even simpler when deficiencies in the original BLAS interface are remedied. In addition, we believe that the transformer completes a metamorphosis of dense and banded linear algebra libraries that started a decade ago [Gunnels et al. 2001], because it demonstrates that one can embrace programmability without sacrificing performance.

It is worth noting that, prior to this work, a legitimate objection to abandoning what we often call the LAPACK style of coding has been that coding at a lower level yields high performance, especially for small problem sizes, operations that perform O(n^2) computation with O(n^2) data, and/or computations on banded matrices. The presented prototype source-to-source transformer, FLAMES2S, and the preliminary results neutralize this objection.

Acknowledgements
This research was partially sponsored by NSF grants CCF-5496, OCI-8575, and CCF-9767, including a Research Experience for Undergraduates (REU) supplement that supported Richard Veras and Jonathan Monette, and a grant from Microsoft. Visits by Enrique Quintana-Ortí to UT-Austin were supported by the J. Tinsley Oden Faculty Fellowship Research Program of the Institute for Computational Engineering and Sciences (ICES) at UT-Austin. Enrique S. Quintana-Ortí is also partially supported by the CICYT project TIN8-937-C- and FEDER. We thank the other members of the FLAME team for their support.

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

REFERENCES

2002. Basic Linear Algebra Subprograms Technical Forum Standard. International Journal of High Performance Applications and Supercomputing 16 (Spring).
Bientinesi, P. 2006. Mechanical derivation and systematic analysis of correct linear algebra algorithms. Ph.D. thesis, Department of Computer Sciences, The University of Texas. Technical Report TR, September 2006.
Bientinesi, P., Gunnels, J. A., Myers, M. E., Quintana-Ortí, E. S., and van de Geijn, R. A. 2005. The science of deriving dense linear algebra algorithms. ACM Transactions on Mathematical Software 31, 1 (March), 1-26.
Bientinesi, P., Quintana-Ortí, E. S., and van de Geijn, R. A. 2005. Representing linear algebra algorithms in code: The FLAME application programming interfaces. ACM Trans. Math. Soft. 31, 1 (March), 27-59.
Bientinesi, P. and van de Geijn, R. A. A goal-oriented and modular approach to stability analysis. SIAM J. Matrix Anal. Appl. In review.

BLAS Technical Forum. BLAS Technical Forum.
Crout, P. D. 1941. A short method for evaluating determinants and solving systems of linear equations with real or complex coefficients. Trans. AIEE 60.
DeRose, L. and Padua, D. 1996. A MATLAB to FORTRAN90 translator and its effectiveness. In Proceedings of the 10th ACM International Conference on Supercomputing.
DeRose, L. A. 1996. Compiler techniques for MATLAB programs. Ph.D. thesis, Computer Sciences Department, The University of Illinois at Urbana-Champaign.
Dongarra, J. J., Bunch, J. R., Moler, C. B., and Stewart, G. W. 1979. LINPACK Users' Guide. SIAM, Philadelphia.
GNU Fortran. The GNU Fortran Compiler - Code Gen Options. gcc.gnu.org/onlinedocs/gfortran/code-gen-options.html.
GNU Scientific Library. GNU Scientific Library.
Gunnels, J. A. 2001. A systematic approach to the design and analysis of parallel dense linear algebra algorithms. Ph.D. thesis, Department of Computer Sciences, The University of Texas.
Gunnels, J. A., Gustavson, F. G., Henry, G. M., and van de Geijn, R. A. 2001. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Soft. 27, 4 (December), 422-455.
Joffrain, T., Low, T. M., Quintana-Ortí, E. S., van de Geijn, R., and Van Zee, F. G. 2006. Accumulating Householder transformations, revisited. ACM Trans. Math. Softw. 32, 2.
LabView. LabView.
Marsolf, B. A. 1997. Techniques for the interactive development of numerical linear algebra libraries for scientific computation. Ph.D. thesis, Computer Sciences Department, The University of Illinois at Urbana-Champaign.
Moler, C. B. 1980. MATLAB, an interactive matrix laboratory. Technical Report 369, Department of Mathematics and Statistics, University of New Mexico.
Quintana, E. S., Quintana, G., Sun, X., and van de Geijn, R. 2001. A note on parallel matrix inversion. SIAM J. Sci. Comput. 22, 5.
Quintana-Ortí, G., Igual, F. D., Quintana-Ortí, E. S., and van de Geijn, R. 2009. Solving dense linear systems on platforms with multiple hardware accelerators. In ACM SIGPLAN 2009 Symposium on Principles and Practice of Parallel Programming (PPoPP'09).
Quintana-Ortí, G., Quintana-Ortí, E. S., van de Geijn, R. A., Zee, F. G. V., and Chan, E. 2009. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software 36, 3 (June), 14:1-14:26.
Schreiber, R. and Van Loan, C. 1989. A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput. 10, 1 (Jan.).
Siek, J. G., Karlin, I., and Jessup, E. R. 2008. Build to order linear algebra kernels. In International Symposium on Parallel and Distributed Processing (IPDPS 2008).
Spark. Spark code skeleton generator.
Stewart, G. W. 1977. Research, development, and LINPACK. In Mathematical Software III, J. R. Rice, Ed. Academic Press, New York.
van de Geijn, R. A. and Quintana-Ortí, E. S. 2008. The Science of Programming Matrix Computations.
Van Zee, F. G. libflame: The Complete Reference. In preparation.
Van Zee, F. G., Chan, E., van de Geijn, R. A., Quintana-Ortí, E. S., and Quintana-Ortí, G. 2009. The libflame library for dense matrix computation. IEEE Computing in Science & Engineering 11, 6 (November/December).

Received Month Year; revised Month Year; accepted Month Year


More information

An M-script API for Linear Algebra Operations on Graphics Processors

An M-script API for Linear Algebra Operations on Graphics Processors Informe Técnico ICC 1-2-28 GLAME@lab: An M-script API for Linear Algebra Operations on Graphics Processors Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí

More information

An M-script API for Linear Algebra Operations on Graphics Processors

An M-script API for Linear Algebra Operations on Graphics Processors Informe Técnico ICC 1-2-28 GLAME@lab: An M-script API for Linear Algebra Operations on Graphics Processors Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí

More information

NAG Library Chapter Introduction. F16 Further Linear Algebra Support Routines

NAG Library Chapter Introduction. F16 Further Linear Algebra Support Routines NAG Library Chapter Introduction Contents 1 Scope of the Chapter.... 2 2 Background to the Problems... 2 3 Recommendations on Choice and Use of Available Routines... 2 3.1 Naming Scheme... 2 3.1.1 NAGnames...

More information

BLIS: A Framework for Rapid Instantiation of BLAS Functionality

BLIS: A Framework for Rapid Instantiation of BLAS Functionality 0 BLIS: A Framework for Rapid Instantiation of BLAS Functionality FIELD G. VAN ZEE and ROBERT A. VAN DE GEIJN, The University of Texas at Austin The BLAS Libray Instantiation Software (BLIS) is a new framework

More information

Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors

Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors Mostafa I. Soliman and Fatma S. Ahmed Computers and Systems Section, Electrical Engineering Department Aswan Faculty of Engineering,

More information

Exploiting the capabilities of modern GPUs for dense matrix computations

Exploiting the capabilities of modern GPUs for dense matrix computations Informe Técnico ICC 1-11-28 Exploiting the capabilities of modern GPUs for dense matrix computations Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí, Gregorio

More information

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Gregorio Quintana-Ortí Francisco D. Igual Enrique S. Quintana-Ortí Departamento de Ingeniería y Ciencia de Computadores Universidad

More information

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Andrés Tomás 1, Zhaojun Bai 1, and Vicente Hernández 2 1 Department of Computer

More information

Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix

Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix PAOLO BIENTINESI Duke University and BRIAN GUNTER Delft University of Technology and ROBERT A. VAN DE GEIJN The University

More information

Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures

Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures Gregorio Quintana-Ortí Enrique S Quintana-Ortí Ernie Chan Robert A van de Geijn Field G Van Zee Abstract This paper examines

More information

Algorithm 8xx: SuiteSparseQR, a multifrontal multithreaded sparse QR factorization package

Algorithm 8xx: SuiteSparseQR, a multifrontal multithreaded sparse QR factorization package Algorithm 8xx: SuiteSparseQR, a multifrontal multithreaded sparse QR factorization package TIMOTHY A. DAVIS University of Florida SuiteSparseQR is an implementation of the multifrontal sparse QR factorization

More information

A High Performance C Package for Tridiagonalization of Complex Symmetric Matrices

A High Performance C Package for Tridiagonalization of Complex Symmetric Matrices A High Performance C Package for Tridiagonalization of Complex Symmetric Matrices Guohong Liu and Sanzheng Qiao Department of Computing and Software McMaster University Hamilton, Ontario L8S 4L7, Canada

More information

Strategies for Parallelizing the Solution of Rational Matrix Equations

Strategies for Parallelizing the Solution of Rational Matrix Equations Strategies for Parallelizing the Solution of Rational Matrix Equations José M. Badía 1, Peter Benner, Maribel Castillo 1, Heike Faßbender 3, Rafael Mayo 1, Enrique S. Quintana-Ortí 1, and Gregorio Quintana-Ortí

More information

Toward Scalable Matrix Multiply on Multithreaded Architectures

Toward Scalable Matrix Multiply on Multithreaded Architectures Toward Scalable Matrix Multiply on Multithreaded Architectures Bryan Marker 1, Field G Van Zee 1, Kazushige Goto 1, Gregorio Quintana Ortí 2, and Robert A van de Geijn 1 1 The University of Texas at Austin

More information

Max Planck Institute Magdeburg Preprints

Max Planck Institute Magdeburg Preprints Peter Benner Pablo Ezzatti Enrique S. Quintana-Ortí Alfredo Remón Matrix Inversion on CPU-GPU Platforms with Applications in Control Theory MAX PLANCK INSTITUT FÜR DYNAMIK KOMPLEXER TECHNISCHER SYSTEME

More information

Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments

Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra SIAM Conference on Computational Science and Engineering

More information

Resources for parallel computing

Resources for parallel computing Resources for parallel computing BLAS Basic linear algebra subprograms. Originally published in ACM Toms (1979) (Linpack Blas + Lapack). Implement matrix operations upto matrix-matrix multiplication and

More information

Sparse Direct Solvers for Extreme-Scale Computing

Sparse Direct Solvers for Extreme-Scale Computing Sparse Direct Solvers for Extreme-Scale Computing Iain Duff Joint work with Florent Lopez and Jonathan Hogg STFC Rutherford Appleton Laboratory SIAM Conference on Computational Science and Engineering

More information

Accelerating GPU kernels for dense linear algebra

Accelerating GPU kernels for dense linear algebra Accelerating GPU kernels for dense linear algebra Rajib Nath, Stanimire Tomov, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville {rnath1, tomov,

More information

1 Motivation for Improving Matrix Multiplication

1 Motivation for Improving Matrix Multiplication CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n

More information

BLAS: Basic Linear Algebra Subroutines I

BLAS: Basic Linear Algebra Subroutines I BLAS: Basic Linear Algebra Subroutines I Most numerical programs do similar operations 90% time is at 10% of the code If these 10% of the code is optimized, programs will be fast Frequently used subroutines

More information

An API for Manipulating Matrices Stored by Blocks

An API for Manipulating Matrices Stored by Blocks An API for Manipulating Matrices Stored by Blocks Tze Meng Low Robert A van de Geijn Department of Computer Sciences The University of Texas at Austin 1 University Station, C0500 Austin, TX 78712 {ltm,rvdg}@csutexasedu

More information

Frequency Scaling and Energy Efficiency regarding the Gauss-Jordan Elimination Scheme on OpenPower 8

Frequency Scaling and Energy Efficiency regarding the Gauss-Jordan Elimination Scheme on OpenPower 8 Frequency Scaling and Energy Efficiency regarding the Gauss-Jordan Elimination Scheme on OpenPower 8 Martin Köhler Jens Saak 2 The Gauss-Jordan Elimination scheme is an alternative to the LU decomposition

More information

Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee.

Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee. Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee Outline Pre-intro: BLAS Motivation What is ATLAS Present release How ATLAS works

More information

Matrix Multiplication Specialization in STAPL

Matrix Multiplication Specialization in STAPL Matrix Multiplication Specialization in STAPL Adam Fidel, Lena Olson, Antal Buss, Timmie Smith, Gabriel Tanase, Nathan Thomas, Mauro Bianco, Nancy M. Amato, Lawrence Rauchwerger Parasol Lab, Dept. of Computer

More information

Copyright by Tze Meng Low 2013

Copyright by Tze Meng Low 2013 Copyright by Tze Meng Low 203 The Dissertation Committee for Tze Meng Low certifies that this is the approved version of the following dissertation: A Calculus of Loop Invariants for Dense Linear Algebra

More information

BLAS: Basic Linear Algebra Subroutines I

BLAS: Basic Linear Algebra Subroutines I BLAS: Basic Linear Algebra Subroutines I Most numerical programs do similar operations 90% time is at 10% of the code If these 10% of the code is optimized, programs will be fast Frequently used subroutines

More information

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development M. Serdar Celebi

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

Chapter 24a More Numerics and Parallelism

Chapter 24a More Numerics and Parallelism Chapter 24a More Numerics and Parallelism Nick Maclaren http://www.ucs.cam.ac.uk/docs/course-notes/un ix-courses/cplusplus This was written by me, not Bjarne Stroustrup Numeric Algorithms These are only

More information

Optimizations of BLIS Library for AMD ZEN Core

Optimizations of BLIS Library for AMD ZEN Core Optimizations of BLIS Library for AMD ZEN Core 1 Introduction BLIS [1] is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries [2] The framework was

More information

Using recursion to improve performance of dense linear algebra software. Erik Elmroth Dept of Computing Science & HPC2N Umeå University, Sweden

Using recursion to improve performance of dense linear algebra software. Erik Elmroth Dept of Computing Science & HPC2N Umeå University, Sweden Using recursion to improve performance of dense linear algebra software Erik Elmroth Dept of Computing Science & HPCN Umeå University, Sweden Joint work with Fred Gustavson, Isak Jonsson & Bo Kågström

More information

Runtime Data Flow Graph Scheduling of Matrix Computations with Multiple Hardware Accelerators

Runtime Data Flow Graph Scheduling of Matrix Computations with Multiple Hardware Accelerators Runtime Data Flow Graph Scheduling of Matrix Computations with Multiple Hardware Accelerators FLAME Working Note #5 Ernie Chan Department of Computer Science The University of Texas at Austin Austin, Texas

More information

Parallel Linear Algebra in Julia

Parallel Linear Algebra in Julia Parallel Linear Algebra in Julia Britni Crocker and Donglai Wei 18.337 Parallel Computing 12.17.2012 1 Table of Contents 1. Abstract... 2 2. Introduction... 3 3. Julia Implementation...7 4. Performance...

More information

Matrix Computations on GPUs, multiple GPUs and clusters of GPUs

Matrix Computations on GPUs, multiple GPUs and clusters of GPUs Matrix Computations on GPUs, multiple GPUs and clusters of GPUs Francisco D. Igual Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón (Spain). Matrix Computations on

More information

Intel Math Kernel Library 10.3

Intel Math Kernel Library 10.3 Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)

More information

Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators. Enrique S. Quintana-Ortí

Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators. Enrique S. Quintana-Ortí Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators Enrique S. Quintana-Ortí Disclaimer Not a course on how to program dense linear algebra kernels on s Where have you

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

IMPLEMENTATION OF AN OPTIMAL MATRIX-CHAIN PRODUCT EXPLOITING MATRIX SPECIALIZATION

IMPLEMENTATION OF AN OPTIMAL MATRIX-CHAIN PRODUCT EXPLOITING MATRIX SPECIALIZATION IMPLEMENTATION OF AN OPTIMAL MATRIX-CHAIN PRODUCT EXPLOITING MATRIX SPECIALIZATION Swetha Kukkala and Andrew A. Anda Computer Science Department St Cloud State University St Cloud, MN 56301 kusw0601@stcloudstate.edu

More information

Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra. Mark Gates, Stan Tomov, Azzam Haidar SIAM LA Oct 29, 2015

Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra. Mark Gates, Stan Tomov, Azzam Haidar SIAM LA Oct 29, 2015 Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra Mark Gates, Stan Tomov, Azzam Haidar SIAM LA Oct 29, 2015 Overview Dense linear algebra algorithms Hybrid CPU GPU implementation

More information

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1 LAPACK Linear Algebra PACKage 1 Janice Giudice David Knezevic 1 Motivating Question Recalling from last week... Level 1 BLAS: vectors ops Level 2 BLAS: matrix-vectors ops 2 2 O( n ) flops on O( n ) data

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

MAGMA Library. version 0.1. S. Tomov J. Dongarra V. Volkov J. Demmel

MAGMA Library. version 0.1. S. Tomov J. Dongarra V. Volkov J. Demmel MAGMA Library version 0.1 S. Tomov J. Dongarra V. Volkov J. Demmel 2 -- MAGMA (version 0.1) -- Univ. of Tennessee, Knoxville Univ. of California, Berkeley Univ. of Colorado, Denver June 2009 MAGMA project

More information

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable

More information

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency 1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming

More information

Introducing coop: Fast Covariance, Correlation, and Cosine Operations

Introducing coop: Fast Covariance, Correlation, and Cosine Operations Introducing coop: Fast Covariance, Correlation, and Cosine Operations November 14, 2017 Drew Schmidt wrathematics@gmail.com Version 0.6-1 Disclaimer Any opinions, findings, and conclusions or recommendations

More information

Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition

Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Nicholas J. Higham Pythagoras Papadimitriou Abstract A new method is described for computing the singular value decomposition

More information

Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures

Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures Emmanuel Agullo 1, Henricus Bouwmeester 2, Jack Dongarra 1, Jakub Kurzak 1, Julien Langou 2,andLeeRosenberg

More information

Lecture 27: Fast Laplacian Solvers

Lecture 27: Fast Laplacian Solvers Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall

More information

On the Efficacy of Haskell for High Performance Computational Biology

On the Efficacy of Haskell for High Performance Computational Biology On the Efficacy of Haskell for High Performance Computational Biology Jacqueline Addesa Academic Advisors: Jeremy Archuleta, Wu chun Feng 1. Problem and Motivation Biologists can leverage the power of

More information

Making Dataflow Programming Ubiquitous for Scientific Computing

Making Dataflow Programming Ubiquitous for Scientific Computing Making Dataflow Programming Ubiquitous for Scientific Computing Hatem Ltaief KAUST Supercomputing Lab Synchronization-reducing and Communication-reducing Algorithms and Programming Models for Large-scale

More information

Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures

Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures Azzam Haidar, Hatem Ltaief, Asim YarKhan and Jack Dongarra Department of Electrical Engineering and

More information

A Standard for Batching BLAS Operations

A Standard for Batching BLAS Operations A Standard for Batching BLAS Operations Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 5/8/16 1 API for Batching BLAS Operations We are proposing, as a community

More information

Mixed Data Layout Kernels for Vectorized Complex Arithmetic

Mixed Data Layout Kernels for Vectorized Complex Arithmetic Mixed Data Layout Kernels for Vectorized Complex Arithmetic Doru T. Popovici, Franz Franchetti, Tze Meng Low Department of Electrical and Computer Engineering Carnegie Mellon University Email: {dpopovic,

More information

Intel Math Kernel Library

Intel Math Kernel Library Intel Math Kernel Library Release 7.0 March 2005 Intel MKL Purpose Performance, performance, performance! Intel s scientific and engineering floating point math library Initially only basic linear algebra

More information

Level-3 BLAS on the TI C6678 multi-core DSP

Level-3 BLAS on the TI C6678 multi-core DSP Level-3 BLAS on the TI C6678 multi-core DSP Murtaza Ali, Eric Stotzer Texas Instruments {mali,estotzer}@ti.com Francisco D. Igual Dept. Arquitectura de Computadores y Automática Univ. Complutense de Madrid

More information

Parallel Implementation of QRD Algorithms on the Fujitsu AP1000

Parallel Implementation of QRD Algorithms on the Fujitsu AP1000 Parallel Implementation of QRD Algorithms on the Fujitsu AP1000 Zhou, B. B. and Brent, R. P. Computer Sciences Laboratory Australian National University Canberra, ACT 0200 Abstract This paper addresses

More information

Offloading Java to Graphics Processors

Offloading Java to Graphics Processors Offloading Java to Graphics Processors Peter Calvert (prc33@cam.ac.uk) University of Cambridge, Computer Laboratory Abstract Massively-parallel graphics processors have the potential to offer high performance

More information

Heterogeneity in Computational Environments

Heterogeneity in Computational Environments Heterogeneity in Computational Environments Seongjai Kim Abstract In teaching, learning, or research activities in computational mathematics, one often has to borrow parts of computational codes composed

More information

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager Intel Math Kernel Library (Intel MKL) BLAS Victor Kostin Intel MKL Dense Solvers team manager Intel MKL BLAS/Sparse BLAS Original ( dense ) BLAS available from www.netlib.org Additionally Intel MKL provides

More information

On The Computational Cost of FFT-Based Linear Convolutions. David H. Bailey 07 June 1996 Ref: Not published

On The Computational Cost of FFT-Based Linear Convolutions. David H. Bailey 07 June 1996 Ref: Not published On The Computational Cost of FFT-Based Linear Convolutions David H. Bailey 07 June 1996 Ref: Not published Abstract The linear convolution of two n-long sequences x and y is commonly performed by extending

More information