A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm

B. Kumar, C.-H. Huang, and P. Sadayappan
Department of Computer and Information Science, The Ohio State University

R.W. Johnson
Department of Computer Science, St. Cloud State University

Abstract

In this paper, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier Transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated into high-performance parallel/vector code for various architectures. In this paper, we present a non-recursive implementation of Strassen's algorithm for shared-memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared-memory multiprocessor. Performance results on a Cray Y-MP8 are presented.

Keywords: Strassen's matrix multiplication algorithm, Tensor product, Block recursive algorithm, Parallel architecture, Vector processors.

This work was supported in part by ARPA and monitored by NIST.
Contents

1 Introduction
2 An Overview of the Tensor Product Notation
3 A Tensor Product Formulation of Strassen's Algorithm
  3.1 Block Recursive Strassen's Algorithm: Breadth-First Evaluation
  3.2 Block Recursive Strassen's Algorithm: Depth-First Evaluation
  3.3 Combining Breadth-First and Depth-First Evaluations
  3.4 Matrix Storage in Main Memory
4 Code Generation for Vector Processors
  4.1 Block Strassen's Algorithm
  4.2 Memory Management for Depth-First Evaluation
  4.3 Implementation of Winograd's Variation
5 Performance Results on the Cray Y-MP
6 Conclusions
List of Figures

1 2-level block recursive storage
2 Recursion tree of depth 1 for Strassen's algorithm
3 l-level block recursive Strassen's algorithm
4 l-level block recursive Strassen's algorithm with memory reduction
5 Memory requirements for working arrays
List of Tables

1 Comparison of operation counts for Strassen's algorithm and conventional matrix multiplication
2 Execution times for Block Strassen's algorithm with memory reduction
3 Execution times for k = 6 on 4 processors
4 Execution times for k = 6 on 8 processors (percentages of 8-CPU performance obtained are given in parentheses)
1 Introduction

Tensor products (Kronecker products) have been used to model algorithms with a recursive computational structure, which occur in application areas such as digital signal processing [, 1], image processing [1], linear system design [], and statistics []. In recent years, a programming methodology based on tensor products has been successfully used to design and implement high-performance algorithms to compute fast Fourier transforms (FFT) [1, 1] and matrix multiplication [10, 1] for shared-memory vector multiprocessors. A set of multilinear algebra operations such as tensor product and matrix multiplication is used to express block recursive algorithms. These algebraic operations can be systematically translated into high-level programming language constructs such as sequential composition, iteration, and parallel/vector operations. Tensor product formulas representing an algorithm can be algebraically manipulated to restructure the computation to achieve different performance characteristics. In this way, the algorithm can be tuned to match the underlying architecture.

Matrix multiplication is an important core computation in many scientific applications. Conventional matrix multiplication of 2^n × 2^n matrices requires O(8^n) operations. In 1969, V. Strassen proposed an algorithm for matrix multiplication [1] that employs a computationally efficient method to compute the product of 2 × 2 matrices using only seven multiplications. A recursive application of this algorithm for multiplying 2^n × 2^n matrices requires only O(7^n) operations, compared to O(8^n) for conventional matrix multiplication. Efficient parallel implementations of this algorithm have been described in [1, 10]. This algorithm has been used for fast matrix multiplication in implementing level-3 BLAS [9] and linear algebra routines [].
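The recursive structure of the algorithm can be sketched in a few lines. The following is our own illustration (not the paper's synthesized code; the function name and cutoff parameter are ours): seven recursive products per level, falling back to conventional multiplication below a cutoff.

```python
import numpy as np

# Sketch of recursive Strassen multiplication (illustrative only): for
# 2^n x 2^n inputs this performs O(7^n) scalar multiplications instead of
# the O(8^n) of conventional multiplication.
def strassen(A, B, cutoff=2):
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                      # conventional multiplication
    h = n // 2
    a00, a01, a10, a11 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    b00, b01, b10, b11 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    t0 = strassen(a00 + a11, b00 + b11, cutoff)
    t1 = strassen(a10 + a11, b00, cutoff)
    t2 = strassen(a00, b01 - b11, cutoff)
    t3 = strassen(a11, -b00 + b10, cutoff)
    t4 = strassen(a00 + a01, b11, cutoff)
    t5 = strassen(-a00 + a10, b00 + b01, cutoff)
    t6 = strassen(a01 - a11, b10 + b11, cutoff)
    return np.block([[t0 + t3 - t4 + t6, t2 + t4],
                     [t1 + t3, t0 - t1 + t2 + t5]])

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)); B = rng.standard_normal((8, 8))
assert np.allclose(strassen(A, B), A @ B)
```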
In this paper, we describe the tensor product formulation of Strassen's matrix multiplication algorithm, and discuss program generation for shared-memory vector processors such as the Cray Y-MP. Achieving high performance on these architectures requires operating on large vectors and reducing memory bank conflicts, while at the same time exploiting coarse-grained parallelism. We show how the tensor product formula of Strassen's algorithm can be manipulated to operate on full vectors with unit stride. An important feature of the generated code is that it employs no recursion. The initial formulation presented in [10] required a working array of size O(7^n) for the multiplication of 2^n × 2^n matrices. We present a modified formulation that significantly reduces the size of the working array to O(4^n). This reduction is made possible through the reuse of working storage. We describe how this memory reuse can be captured in tensor product formulas with the use of a selection operator. We present a strategy for automatic code synthesis from tensor product formulas containing a selection operator. The modified formulation exhibits sufficient parallelism for efficient implementation on a vector-parallel machine such as the Cray Y-MP. In addition, we express Winograd's variation [] using our notation and describe its translation to program code. Winograd's variation uses the same number of multiplications, but a smaller number of additions, than the original Strassen's algorithm.

This paper is organized as follows. Section 2 contains an overview of the tensor product notation. A
formulation of Strassen's algorithm using this notation is presented in Section 3, along with a discussion of how the formulation can be modified to achieve a reduction in working storage. Section 4 presents a strategy for automatic code generation from a tensor product formula. Winograd's variation of Strassen's algorithm is also presented. Section 5 presents performance results on the Cray Y-MP. Conclusions are presented in Section 6.

2 An Overview of the Tensor Product Notation

In this section, we give a brief overview of the tensor product notation and the properties that are used in this paper. Let A ∈ R^{m×n} and B ∈ R^{p×q}. The tensor product A ⊗ B is the block matrix obtained by replacing each element a_{i,j} by the matrix a_{i,j}B, i.e.,

  A ⊗ B = [ a_{0,0}B    ...  a_{0,n-1}B
            ...              ...
            a_{m-1,0}B  ...  a_{m-1,n-1}B ]

Whenever all the involved matrix products are valid, the following properties hold:

Property 2.1 (Tensor Product)
1. A ⊗ B ⊗ C = A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C
2. (A ⊗ B)(C ⊗ D) = AC ⊗ BD
3. A ⊗ B = (A ⊗ I_n)(I_m ⊗ B) = (I_m ⊗ B)(A ⊗ I_n)
4. (A ⊗ B)^T = A^T ⊗ B^T
5. (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}
6. ⊗_{i=0}^{n-1} A_i B_i = (⊗_{i=0}^{n-1} A_i)(⊗_{i=0}^{n-1} B_i)
7. ∏_{i=0}^{n-1} (A_i ⊗ B_i) = (∏_{i=0}^{n-1} A_i) ⊗ (∏_{i=0}^{n-1} B_i)
8. I_{mn} = I_m ⊗ I_n

where I_n represents the n × n identity matrix, ∏_{i=0}^{n-1} A_i = A_{n-1} A_{n-2} ... A_0, and ⊗_{i=0}^{n-1} A_i = A_{n-1} ⊗ A_{n-2} ⊗ ... ⊗ A_0.

A matrix basis E^{m,n}_{i,j} is an m × n matrix with a one in the i-th row and the j-th column and zeros elsewhere. A vector basis e^m_i is a column vector of length m with a one in the i-th position and zeros elsewhere. If the matrix basis E^{m,n}_{i,j} of an m × n matrix is stored by rows, it is isomorphic to the tensor product of two vector bases e^m_i ⊗ e^n_j. The tensor product of two vector bases e^m_i ⊗ e^n_j is equal to the vector basis e^{mn}_{in+j},
i.e., e^m_i ⊗ e^n_j = e^{mn}_{in+j}. The tensor product of two vector bases e^m_i ⊗ e^n_j is called a tensor basis. If the basis elements are ordered lexicographically, then

  e^{n_1}_{i_1} ⊗ ... ⊗ e^{n_t}_{i_t} = e^{n_1 ... n_t}_{i_1 n_2 ... n_t + ... + i_{t-1} n_t + i_t}

Expressing a vector basis e^M_i as the tensor product of vector bases e^{m_1}_{i_1} ⊗ ... ⊗ e^{m_t}_{i_t}, where M = m_1 ... m_t and i_k = (i div M_k) mod m_k, M_k = ∏_{i=k+1}^{t} m_i, M_t = 1, is called the factorization of the vector basis; e.g., the vector basis e^8_6 can be factorized into the tensor bases e^2_1 ⊗ e^2_1 ⊗ e^2_0 or e^2_1 ⊗ e^4_2. Expressing a tensor basis e^{n_1}_{i_1} ⊗ ... ⊗ e^{n_t}_{i_t} as a vector basis e^{n_1 ... n_t}_{i_1 n_2 ... n_t + ... + i_{t-1} n_t + i_t} is called linearization of the tensor basis. For example, the tensor basis e^2_1 ⊗ e^4_2 can be linearized to give the vector basis e^8_6.

One of the permutations used frequently in the representation of algorithms in tensor product formulas is the stride permutation. The stride permutation L^{mn}_n is defined as

  L^{mn}_n (e^m_i ⊗ e^n_j) = e^n_j ⊗ e^m_i

L^{mn}_n permutes the elements of a vector of size mn with stride distance n. This permutation can be represented as an mn × mn transformation. For example, L^6_2 can be represented by the matrix equation

  L^6_2 [x_0, x_1, x_2, x_3, x_4, x_5]^T = [x_0, x_2, x_4, x_1, x_3, x_5]^T

The stride permutation has the following properties:

Property 2.2 (Stride Permutation)
1. (L^{mn}_n)^{-1} = L^{mn}_m
2. L^{rst}_{st} = L^{rst}_s L^{rst}_t
3. L^{rst}_t = (L^{rt}_t ⊗ I_s)(I_r ⊗ L^{st}_t)

A permutation of the form I_m ⊗ L^{pq}_q ⊗ I_n is called a tensor permutation. The following theorem illustrates how a tensor product of two matrices can be commuted by applying a stride permutation.

Theorem 2.1 (Commutation Theorem) If A is an m × m matrix and B is an n × n matrix, then

  L^{mn}_n (A ⊗ B) = (B ⊗ A) L^{mn}_n
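The notation above can be checked numerically. The following sketch (ours, not from the paper) verifies Property 2.1(2), the stride permutation, and the commutation theorem, using NumPy's Kronecker product for the tensor product:

```python
import numpy as np

# Numerical sketch of the tensor product notation (illustrative only).
def stride_perm(m, n):
    """Return L^{mn}_n as an mn x mn permutation matrix:
    it maps e^m_i (x) e^n_j to e^n_j (x) e^m_i."""
    L = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            L[j * m + i, i * n + j] = 1.0
    return L

rng = np.random.default_rng(0)

# Property 2.1(2): (A (x) B)(C (x) D) = AC (x) BD
A = rng.standard_normal((2, 3)); B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 2)); D = rng.standard_normal((2, 4))
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# L^6_2 gathers a length-6 vector at stride distance 2.
assert np.array_equal(stride_perm(3, 2) @ np.arange(6.0), [0, 2, 4, 1, 3, 5])

# Commutation theorem: L^{mn}_n (A (x) B) = (B (x) A) L^{mn}_n
m, n = 3, 4
A = rng.standard_normal((m, m)); B = rng.standard_normal((n, n))
L = stride_perm(m, n)
assert np.allclose(L @ np.kron(A, B), np.kron(B, A) @ L)
```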
Pairwise multiplication between two vectors implies the product between the corresponding elements of those vectors, e.g.,

  [x_0, x_1, ..., x_{n-1}] * [y_0, y_1, ..., y_{n-1}] = [x_0 y_0, x_1 y_1, ..., x_{n-1} y_{n-1}]

If the elements x_i and y_i are themselves submatrices, then x_i y_i corresponds to matrix multiplication between them.

3 A Tensor Product Formulation of Strassen's Algorithm

Strassen's matrix multiplication algorithm is based on a computationally efficient way of multiplying 2 × 2 matrices using only seven multiplications [1]. Consider the matrix multiplication C = AB, where

  A = [ a_00  a_01    B = [ b_00  b_01    C = [ c_00  c_01
        a_10  a_11 ]        b_10  b_11 ]        c_10  c_11 ]

Strassen's algorithm can then be written as follows. First, the following intermediate values are calculated:

  t_0 = (a_00 + a_11)(b_00 + b_11)
  t_1 = (a_10 + a_11) b_00
  t_2 = a_00 (b_01 - b_11)
  t_3 = a_11 (-b_00 + b_10)
  t_4 = (a_00 + a_01) b_11
  t_5 = (-a_00 + a_10)(b_00 + b_01)
  t_6 = (a_01 - a_11)(b_10 + b_11)

Then the individual elements of C are given by

  c_00 = t_0 + t_3 - t_4 + t_6
  c_10 = t_1 + t_3
  c_01 = t_2 + t_4
  c_11 = t_0 - t_1 + t_2 + t_5

In matrix notation, this can be represented as:

  C̄ = S_c (S_a Ā * S_b B̄)
where

  S_a = [  1  0  0  1     S_b = [  1  0  0  1     S_c = [ 1  0  0  1 -1  0  1
           0  1  0  1              1  0  0  0             0  1  0  1  0  0  0
           1  0  0  0              0  0  1 -1             0  0  1  0  1  0  0
           0  0  0  1             -1  1  0  0             1 -1  1  0  0  1  0 ]
           1  0  1  0              0  0  0  1
          -1  1  0  0              1  0  1  0
           0  0  1 -1 ]            0  1  0  1 ]

and Ā, B̄ and C̄ are vectors of length 4, and represent the storage of matrices A, B, and C in column-major form. The notation Ā corresponds exactly to the vec(A) notation [8]; however, we shall use the former for readability purposes. The matrices S_a, S_b and S_c are termed basic operators; they do not have to be explicitly generated, but specify which operations have to be performed on specific components of the input vectors.

The above formulation can be easily extended to matrices of size 2^n × 2^n by considering a_ij, b_ij, and c_ij to be blocks of size 2^{n-1} × 2^{n-1}. First, we describe the block recursive storage of matrices in memory. Let X be any 2^n × 2^n matrix. At the top level, X can be viewed as:

  X = [ X_00  X_01
        X_10  X_11 ]

A vector X̄ representing an r-level block recursive representation of X is recursively defined as:

  X̄ = [ X̄_00
        X̄_10
        X̄_01
        X̄_11 ]

with the boundary condition that Ȳ is the column-major representation of any 2^{n-r} × 2^{n-r} block Y. An example of block recursive storage is given in Figure 1.

Let Ā, B̄, and C̄ be the 1-level block recursive representations of 2^n × 2^n matrices A, B, and C. Strassen's algorithm for computing C = AB can be written as:

  C̄ = (S_c ⊗ I_{4^{n-1}})[(S_a ⊗ I_{4^{n-1}})Ā *_{n-1} (S_b ⊗ I_{4^{n-1}})B̄]

where *_{n-1} denotes pairwise matrix multiplication between matrices of size 2^{n-1} × 2^{n-1}. We refer to the above as the 1-level block recursive Strassen's algorithm. In this case, the intermediate values t_i are 2^{n-1} × 2^{n-1} block matrices, and block matrix multiplications are performed using conventional matrix multiplication. This algorithm can be conveniently viewed in terms of a recursion tree (Figure 2), where the root node corresponds to the update of C, and the leaf nodes correspond to the evaluation of the intermediate values. The steps marked by † refer to computations that require working memory. Note that all the intermediate
values can be computed in parallel, since there are no data dependences between them. Each intermediate value requires working memory of size O(4^{n-1}). Hence, the 1-level block recursive Strassen's algorithm requires total working storage of size O(7 · 4^{n-1}).

Even though the above formulation has been given for matrix sizes of the form 2^n × 2^n, it is straightforward to generalize the implementation to handle arbitrary dimensions of matrices A and B. A common technique is to pad the matrices with rows and columns of zeros to increase the matrix sizes to the next higher powers of two, compute the extended matrix product, and then extract the desired result [1]. Another approach [] is to drop the last rows and columns from the computation to achieve even dimensions, and compute the partial matrix product. The complete matrix product is then obtained with a rank-k update (k = 1, 2, 3).

3.1 Block Recursive Strassen's Algorithm: Breadth-First Evaluation

In the 1-level application of Strassen's algorithm, the 2^{n-1} × 2^{n-1} block multiplications were computed using conventional matrix multiplication. To get additional savings in the total number of arithmetic operations required to compute the matrix product, Strassen's algorithm can be recursively applied to compute the block multiplications also. Let Ā, B̄, and C̄ be the l-level block recursive representations of 2^n × 2^n matrices A, B, and C. The computation of C̄ is described by the following formulation [10]:

  C̄ = S_c^{n,l} [S_a^{n,l}(Ā) *_{n-l} S_b^{n,l}(B̄)]

where

  S_a^{n,l} = (⊗_{i=0}^{l-1} S_a) ⊗ I_{4^{n-l}} = ∏_{i=0}^{l-1} (I_{7^i} ⊗ S_a ⊗ I_{4^{n-(i+1)}})
  S_b^{n,l} = (⊗_{i=0}^{l-1} S_b) ⊗ I_{4^{n-l}} = ∏_{i=0}^{l-1} (I_{7^i} ⊗ S_b ⊗ I_{4^{n-(i+1)}})
  S_c^{n,l} = (⊗_{i=0}^{l-1} S_c) ⊗ I_{4^{n-l}} = ∏_{i=l-1}^{0} (I_{7^i} ⊗ S_c ⊗ I_{4^{n-(i+1)}})

and *_{n-l} denotes pairwise multiplication between blocks of size 2^{n-l} × 2^{n-l}. This computation can be interpreted as a breadth-first evaluation of the recursion tree shown in Figure 3. Each intermediate block matrix t_i is itself computed using Strassen's algorithm, yielding intermediate sub-blocks t_{i0}, ..., t_{i6}.
This process is recursively applied until blocks of size 2^{n-l} × 2^{n-l} are reached, which are then computed using conventional matrix multiplication. Following our convention, † denotes computation that requires working storage. The working array requirement in this case is O(7^l 4^{n-l}). In the extreme case, Strassen's algorithm can be applied recursively down to blocks of size 2 × 2, and such an (n−1)-level (or n-level) Strassen's algorithm would require working storage of size O(7^n).

Table 1 presents the total number of operations required for multiplying two matrices of size 2^n × 2^n. MM denotes conventional matrix multiplication, STR refers to an n-level block recursive Strassen's algorithm, and BLOCK STR denotes an (n−k)-level block recursive Strassen's algorithm. STR has a lower operation count than MM only for n ≥ 10. The expression for the operation count for BLOCK STR has a minimum
at k = 3 for all integer values of n and k. For k = 3, BLOCK STR has a lower operation count than MM for n ≥ 4. Hence, the block Strassen's algorithm is better than conventional matrix multiplication in terms of total operation count even for small values of n. However, for implementation on a shared-memory vector machine such as the Cray Y-MP, a lower operation count does not imply a smaller execution time, since the effects of vector length and stride also come into play.

3.2 Block Recursive Strassen's Algorithm: Depth-First Evaluation

An l-level Strassen's algorithm requires fewer operations than conventional matrix multiplication, and the operation count decreases as the number of levels l is increased. An optimal value is attained at n − l = 3. However, the working storage requirement for an l-level algorithm is O(7^l 4^{n-l}), and hence increases exponentially with an increase in l. This high storage requirement is due to the breadth-first expansion of the recursion tree, in which all the intermediate values have to be stored.

To achieve a reduction in working storage, we can perform the computation of Strassen's algorithm using a depth-first expansion of the recursion tree. Instead of expanding all the leaves in the recursion tree, we only compute a subtree, and use the results obtained from that subtree to update C. This process is repeatedly applied until all the subtrees are evaluated. It is necessary to ensure that no redundant computation is performed. The memory requirement for the algorithm in this case will be the memory requirement for a single subtree, since the same space can be reused for the evaluation of different subtrees. For the 2 × 2 case, the algorithm is modified as follows. t is a temporary variable that is used to store intermediate values.

  Step 1: t = (a_00 + a_11)(b_00 + b_11);  c_00 = t;         c_11 = t;
  Step 2: t = (a_10 + a_11) b_00;          c_10 = t;         c_11 = c_11 - t;
  Step 3: t = a_00 (b_01 - b_11);          c_01 = t;         c_11 = c_11 + t;
  Step 4: t = a_11 (-b_00 + b_10);         c_00 = c_00 + t;  c_10 = c_10 + t;
  Step 5: t = (a_00 + a_01) b_11;          c_00 = c_00 - t;  c_01 = c_01 + t;
  Step 6: t = (-a_00 + a_10)(b_00 + b_01); c_11 = c_11 + t;
  Step 7: t = (a_01 - a_11)(b_10 + b_11);  c_00 = c_00 + t;

Now the extra memory requirement is only one element, since the same memory location can be reused to evaluate the different t_i's. In the original formulation, seven memory locations are required, since all the intermediate values are calculated before the update of C is performed. The total number of arithmetic operations is unchanged.

We now formulate the concept of memory reduction using the tensor product framework. Define D_j to be the 7 × 7 matrix with d_jj = 1 and zeros elsewhere. Note that Σ_{j=0}^{6} D_j = I_7. Memory reduction for
the 2 × 2 case can be formulated in matrix notation as:

  C̄ = Σ_{j=0}^{6} (S_c D_j)[(D_j S_a)Ā * (D_j S_b)B̄]

D_j is termed a selection operator; it selects the subset of the input vector on which the computation is to be performed. This framework can be extended to multiplying matrices of size 2^n × 2^n. We begin with the 1-level Strassen's algorithm, and assume that the data matrices are stored in 1-level block recursive form. The tensor product formula to compute C = AB can then be written as:

  C̄ = Σ_{j=0}^{6} (S_c D_j ⊗ I_{4^{n-1}})[(D_j S_a ⊗ I_{4^{n-1}})Ā *_{n-1} (D_j S_b ⊗ I_{4^{n-1}})B̄]

We can apply memory reduction at multiple levels by performing the same operation recursively on the smaller blocks. Assuming that the matrices are stored in l-level block recursive form, an l-level Strassen's algorithm with memory reduction can be formulated as:

  C̄ = Σ_{j_0, j_1, ..., j_{l-1} = 0}^{6} [((⊗_{r=0}^{l-1} S_c D_{j_r}) ⊗ I_{4^{n-l}}) (((⊗_{r=0}^{l-1} D_{j_r} S_a) ⊗ I_{4^{n-l}})Ā *_{n-l} ((⊗_{r=0}^{l-1} D_{j_r} S_b) ⊗ I_{4^{n-l}})B̄)]

We refer to the above formulation as the partial evaluation form of Strassen's algorithm. The computation specified in the above formulation can be described using the recursion tree shown in Figure 4. The current intermediate blocks being computed are represented by †. Working storage is required for the intermediate blocks on the path from the leaf node being computed to the root of the recursion tree. Hence, the working storage required is O(Σ_{i=1}^{l} 4^{n-i}) = O(4^n).

3.3 Combining Breadth-First and Depth-First Evaluations

The operator *_{n-l} in the tensor product formula for partial computation refers to pairwise matrix multiplication. Each block matrix multiplication in the pairwise matrix multiplication can itself be performed using complete evaluation. Hence we have a 3-level hierarchy. At the highest level, partial evaluation is performed down to blocks of size 2^{n-l} × 2^{n-l}. Then complete evaluation is performed until blocks of size 2^k × 2^k are reached, after which conventional matrix multiplication is applied.
This can be expressed in the tensor product notation as:

  C̄ = Σ_{j_0, ..., j_{l-1} = 0}^{6} [((⊗_{r=0}^{l-1} S_c D_{j_r}) ⊗ I_{4^{n-l}}) (((⊗_{r=0}^{l-1} D_{j_r} S_a) ⊗ I_{4^{n-l}})Ā *_{CS, n-l} ((⊗_{r=0}^{l-1} D_{j_r} S_b) ⊗ I_{4^{n-l}})B̄)]

  C̄' = S̃_c^{n-l,k} (S̃_a^{n-l,k} Ā' *_{MM,k} S̃_b^{n-l,k} B̄')

where *_{CS, n-l} denotes pairwise matrix multiplication between blocks of size 2^{n-l} × 2^{n-l} using complete evaluation, C̄' corresponds to each block pairwise multiplication during the partial evaluation, which itself is evaluated using an (n−l−k)-level Strassen's algorithm, and *_{MM,k} denotes pairwise matrix multiplication between blocks of size 2^k × 2^k using conventional matrix multiplication.
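The partial-evaluation formula for the 2 × 2 case can be sketched numerically. The following is our own NumPy rendering (illustrative only; the basic operators and D_j are as defined above), accumulating the sum term by term so that only one length-7 temporary is live at a time:

```python
import numpy as np

# Sketch of the 2 x 2 partial-evaluation identity
#   C~ = sum_{j=0}^{6} (S_c D_j)[(D_j S_a)A~ * (D_j S_b)B~]
Sa = np.array([[1,0,0,1],[0,1,0,1],[1,0,0,0],[0,0,0,1],
               [1,0,1,0],[-1,1,0,0],[0,0,1,-1]], dtype=float)
Sb = np.array([[1,0,0,1],[1,0,0,0],[0,0,1,-1],[-1,1,0,0],
               [0,0,0,1],[1,0,1,0],[0,1,0,1]], dtype=float)
Sc = np.array([[1,0,0,1,-1,0,1],[0,1,0,1,0,0,0],
               [0,0,1,0,1,0,0],[1,-1,1,0,0,1,0]], dtype=float)

def D(j):
    """Selection operator: 7 x 7 matrix with a single one at (j, j)."""
    d = np.zeros((7, 7)); d[j, j] = 1.0
    return d

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
a = A.flatten(order='F')               # column-major vectors A~, B~
b = B.flatten(order='F')

c = np.zeros(4)
for j in range(7):
    t = (D(j) @ Sa @ a) * (D(j) @ Sb @ b)   # one intermediate product
    c += Sc @ D(j) @ t                      # immediate update; t is reused
assert np.allclose(c.reshape((2, 2), order='F'), A @ B)
```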
The root of the recursion tree is defined to be at level 0. At level i, a working array of size O(4^{n-i}) is required to store the intermediate results of partial evaluation. The breadth-first expansion of the last (n−l−k) levels requires a working array of size O(7^{n-l-k} 4^k). Hence, the total memory requirement is O(Σ_{i=1}^{l-1} 4^{n-i} + 7^{n-l-k} 4^k) = O(4^n + 7^{n-l-k} 4^k). Even for moderate values of n and small values of l, this represents a significant savings compared to O(7^{n-k} 4^k) for complete evaluation. If the matrices are of size N × N where N is not a power of two, the technique of padding can be used, and the memory requirement with reduction will be O(4^{⌈lg N⌉} + 7^{⌈lg N⌉−l−k} 4^k), compared to O(7^{⌈lg N⌉−k} 4^k) for complete evaluation.

3.4 Matrix Storage in Main Memory

The formulation presented in the previous sections assumes, for simplicity of presentation, that the data matrices are stored in block recursive form. However, when implementing a block recursive algorithm on a shared-memory machine, matrices are usually stored in row-major or column-major form. We have implemented Strassen's algorithm using Fortran on the Cray Y-MP; hence the data matrices are stored in memory in column-major form. The tensor product formula to convert a 2^n × 2^n matrix from column-major form to a k-level block recursive form is given by [11]:

  R^{n,k} = [∏_{i=0}^{n-k-1} (I_{2·4^i} ⊗ L^{2^{n-k-i}}_2 ⊗ I_{2^{n+k-i-1}})] (I_{2^{n-k}} ⊗ L^{2^n}_{2^{n-k}} ⊗ I_{2^k})

R^{n,k} is termed a conversion operator. There are two ways in which storage conversion can be implemented. One way is to perform explicit conversion from row/column-major form to a block recursive form through data movement. However, a more efficient way is to merge the conversion operator into the computation in Strassen's algorithm, which results in a modification of the data array indexing functions.
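A minimal sketch of the storage conversion (ours; one conversion level only, not an implementation of the operator R^{n,k}): a column-major matrix is rearranged so its quadrants appear in the order X00, X10, X01, X11, each still column-major.

```python
import numpy as np

# One level of column-major -> block recursive storage conversion
# (illustrative only).
def block_recursive_1level(X):
    h = X.shape[0] // 2
    X00, X01 = X[:h, :h], X[:h, h:]
    X10, X11 = X[h:, :h], X[h:, h:]
    return np.concatenate([Q.flatten(order='F')
                           for Q in (X00, X10, X01, X11)])

X = np.arange(16.0).reshape((4, 4), order='F')  # column-major 4 x 4
v = block_recursive_1level(X)
# The first 4 entries are the top-left 2 x 2 quadrant in column-major order.
assert np.array_equal(v[:4], [0.0, 1.0, 4.0, 5.0])
```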
The modified tensor formulation for Block Strassen's algorithm is:

  C̄ = (R^{n,k})^{-1} S_c^{n,k} [R^{n,k} S_a^{n,k} Ā *_k R^{n,k} S_b^{n,k} B̄] = S̃_c^{n,k} [S̃_a^{n,k} Ā *_k S̃_b^{n,k} B̄]

where

  S̃_a^{n,k} = ∏_{i=0}^{n-k-1} [(I_{7^i} ⊗ S_a ⊗ I_{4^{n-i-1}})(I_{7^i} ⊗ I_2 ⊗ L^{2^{n-k-i}}_2 ⊗ I_{2^{n+k-i-1}})] (I_{2^{n-k}} ⊗ L^{2^n}_{2^{n-k}} ⊗ I_{2^k})
  S̃_b^{n,k} = ∏_{i=0}^{n-k-1} [(I_{7^i} ⊗ S_b ⊗ I_{4^{n-i-1}})(I_{7^i} ⊗ I_2 ⊗ L^{2^{n-k-i}}_2 ⊗ I_{2^{n+k-i-1}})] (I_{2^{n-k}} ⊗ L^{2^n}_{2^{n-k}} ⊗ I_{2^k})
  S̃_c^{n,k} = (I_{2^{n-k}} ⊗ L^{2^n}_{2^k} ⊗ I_{2^k}) ∏_{i=n-k-1}^{0} [(I_{7^i} ⊗ I_2 ⊗ L^{2^{n-k-i}}_{2^{n-k-i-1}} ⊗ I_{2^{n+k-i-1}})(I_{7^i} ⊗ S_c ⊗ I_{4^{n-i-1}})]

With l-level memory reduction, the above formulation is modified into:

  C̄ = Σ_{j_0, j_1, ..., j_{l-1} = 0}^{6} (S̃_c^{n,k} ((⊗_{r=0}^{l-1} D_{j_r}) ⊗ I_{7^{n-k-l} 4^k})) [(((⊗_{r=0}^{l-1} D_{j_r}) ⊗ I_{7^{n-k-l} 4^k}) S̃_a^{n,k} Ā) *_k (((⊗_{r=0}^{l-1} D_{j_r}) ⊗ I_{7^{n-k-l} 4^k}) S̃_b^{n,k} B̄)]
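The 1-level block recursive formulation of this section can be checked numerically. The following sketch (our own NumPy layout, with n = 2 so that blocks are 2 × 2 and are multiplied conventionally) applies the basic operators to block vectors:

```python
import numpy as np

# Sketch of C~ = (S_c (x) I)[(S_a (x) I)A~ *_{n-1} (S_b (x) I)B~] for a
# 4 x 4 matrix (illustrative only).
Sa = np.array([[1,0,0,1],[0,1,0,1],[1,0,0,0],[0,0,0,1],
               [1,0,1,0],[-1,1,0,0],[0,0,1,-1]], dtype=float)
Sb = np.array([[1,0,0,1],[1,0,0,0],[0,0,1,-1],[-1,1,0,0],
               [0,0,0,1],[1,0,1,0],[0,1,0,1]], dtype=float)
Sc = np.array([[1,0,0,1,-1,0,1],[0,1,0,1,0,0,0],
               [0,0,1,0,1,0,0],[1,-1,1,0,0,1,0]], dtype=float)

def blocks(X):
    """1-level block recursive representation: blocks in order X00,X10,X01,X11."""
    h = X.shape[0] // 2
    return np.array([X[:h, :h], X[h:, :h], X[:h, h:], X[h:, h:]])

A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0, 32.0).reshape(4, 4)
ta = np.einsum('ij,jkl->ikl', Sa, blocks(A))   # (S_a (x) I) A~ : 7 blocks
tb = np.einsum('ij,jkl->ikl', Sb, blocks(B))   # (S_b (x) I) B~ : 7 blocks
t = np.matmul(ta, tb)                          # pairwise block multiplication
cb = np.einsum('ij,jkl->ikl', Sc, t)           # (S_c (x) I) : 4 result blocks
C = np.block([[cb[0], cb[2]], [cb[1], cb[3]]])
assert np.allclose(C, A @ B)
```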
4 Code Generation for Vector Processors

4.1 Block Strassen's Algorithm

Matrix factorizations form the basis of translating tensor product formulas, by mapping the operations implied by the formula to program constructs in a high-level programming language. The translation process starts with the top-level abstraction and generates more refined code as it proceeds to lower-level abstractions. At each level, semantically equivalent program constructs are chosen to replace mathematical operations. Efficient programs can be synthesized from tensor product formulas by exploiting the regular computational structure expressed by such formulas. The tensor product formulation of block recursive algorithms usually involves certain basic computations, such as S_a, S_b, and S_c in the case of Strassen's algorithm. It is sometimes necessary to use manually optimized code for these basic computations to achieve high performance.

We now illustrate the code generation strategy with an example. Let B be an m × n matrix, and X be a vector of size np. Consider the application of (I_p ⊗ B) to X, i.e.,

  [ B           ] [ X[0 : n-1]        ]   [ BX[0 : n-1]        ]
  [   B         ] [ X[n : 2n-1]       ] = [ BX[n : 2n-1]       ]
  [     ...     ] [ ...               ]   [ ...                ]
  [           B ] [ X[(p-1)n : pn-1]  ]   [ BX[(p-1)n : pn-1]  ]

This can be interpreted as p copies of B acting in parallel on p disjoint segments of X, resulting in a vector of size mp. Hence, Y = (I_p ⊗ B)X can be implemented as:

  Code[Y = (I_p ⊗ B)X]
    doall i = 0, p−1
      Code[Y[im : (i+1)m − 1] = B X[in : (i+1)n − 1]]
    enddoall

Once an algorithm is expressed using the tensor product framework, an efficient implementation can be obtained by algebraically manipulating the tensor product formula. For example, consider the implementation of Y = (B ⊗ I_p)X, where B is a matrix and X and Y are vectors as described before. Using the commutation rule, it can be determined that

  (B ⊗ I_p) = L^{mp}_m (I_p ⊗ B) L^{np}_p

Hence, one implementation to compute Y might be to permute X according to L^{np}_p, perform (I_p ⊗ B), and permute the result according to L^{mp}_m.
A more efficient implementation would be to incorporate the stride permutations into the indexing of the input and output data arrays. The above can be written as:

  (L^{mp}_p Y) = (I_p ⊗ B)(L^{np}_p X)
i.e.,

  [ Y[0 : mp-1 : p]   ]   [ B           ] [ X[0 : np-1 : p]   ]
  [ Y[1 : mp-1 : p]   ] = [   B         ] [ X[1 : np-1 : p]   ]
  [ ...               ]   [     ...     ] [ ...               ]
  [ Y[p-1 : mp-1 : p] ]   [           B ] [ X[p-1 : np-1 : p] ]

Hence, the code can be written as

  Code[Y = (B ⊗ I_p)X]
    doall i = 0, p−1
      Code[Y[i : mp−1 : p] = B X[i : np−1 : p]]
    enddoall

Let us consider the code generation for the (n−k)-level block Strassen's algorithm for multiplying 2^n × 2^n matrices. Assume that the matrices are stored in (n−k)-level block recursive format, and that at the lowest level, pairwise multiplication between blocks of size 2^k × 2^k is performed. For simplification, we shall assume that no memory reduction is performed. The tensor product formulation of this algorithm is given by (Section 3.4):

  C̄ = S̃_c^{n,k} [S̃_a^{n,k} Ā *_k S̃_b^{n,k} B̄]

The formula for block Strassen's algorithm contains the operations S̃_a^{n,k}, S̃_b^{n,k}, *_k, and S̃_c^{n,k}. All the operations except *_k are linear operations and hence require one array operand. Operation *_k is a bilinear operation and requires two array operands. Each operation corresponds to an assignment statement that stores its result in an array that may be used as input data for a subsequent assignment or represents the final output. Temporary arrays representing working arrays are denoted by T_i. The above formula then translates to the following high-level code:

  T_0 = S̃_a^{n,k} Ā
  T_1 = S̃_b^{n,k} B̄
  T_0 = T_0 *_k T_1
  C̄ = S̃_c^{n,k} T_0

The assignment statements are composed sequentially to preserve the semantics of the computation. However, the above sequential composition is not unique. For example, the assignment statements for S̃_a^{n,k} Ā and S̃_b^{n,k} B̄ can be in any order, since there is no data dependency between them.

S̃_a^{n,k}, S̃_b^{n,k}, and S̃_c^{n,k} have the form

  ∏_{i=1}^{n} F_i,  where F_i = (I_{r_i} ⊗ OP ⊗ I_{s_i}) or (I_{r_i} ⊗ L^{m_i n_i}_{n_i} ⊗ I_{s_i})

and OP is a basic operator. The generic tensor product formula Y = (I_{r_i} ⊗ OP ⊗ I_{s_i})X can be implemented as a fully parallel doubly nested loop:
  doall i_1 = 0, r_i − 1
    doall i_2 = 0, s_i − 1
      Code[OP, Y, X, i_1, i_2]
    enddoall
  enddoall

Any tensor permutation that may be present results in a modification of the array indexing functions. Different implementations of the above formula are possible by changing the order and/or blocking of the inner loops, as they are fully permutable. However, different orderings of the inner loops result in different data access patterns. These in turn will have different performance characteristics on systems with hierarchical/interleaved memories.

Consider the application of the tensor product formula ∏_{i=1}^{n} F_i to a vector X. The product term corresponds to a sequential outer loop in which the output of the i-th stage is fed as input to the (i+1)-th stage, i = 1, ..., n−1. Only two arrays are required for this operation. The input array for the i-th step can be reused as the output array for the (i+1)-th step. At the end of each iteration, the arrays are swapped (which can be implemented trivially by swapping the pointers to the two arrays), and the resulting pseudocode is:

  T_0 = X
  do i = 1, n
    Code[T_1 = F_i T_0]
    Swap(T_1, T_0)
  enddo

At the end of the last iteration, T_0 contains the result of [∏_{i=1}^{n} F_i]X. S̃_a^{n,k}, S̃_b^{n,k}, and S̃_c^{n,k} have the above form, and code can easily be generated for them.

The pairwise multiplication *_k performs a sequence of 7^{n-k} matrix multiplications of 2^k × 2^k blocks. Let the input vectors be T_0 and T_1, corresponding to the evaluation of S̃_a^{n,k} Ā and S̃_b^{n,k} B̄ respectively. All elements of a given block are stored consecutively in the input arrays. Pseudocode for the operation T = T_0 *_k T_1 is presented below:

  doall i = 0, 7^{n-k} − 1
    T[i·4^k : (i+1)·4^k − 1] = MatrixMultiply(T_0[i·4^k : (i+1)·4^k − 1], T_1[i·4^k : (i+1)·4^k − 1])
  enddoall

where MatrixMultiply refers to conventional matrix multiplication between blocks of size 2^k × 2^k stored in column-major form.
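The translation rules above can be sketched as follows (our Python rendering of the pseudocode, not the generated Fortran): (I_p ⊗ B)X as independent segment products, (B ⊗ I_p)X as the same loop over stride-p sections so no stride permutation is materialized, and a product of stages evaluated with two working arrays swapped after every stage.

```python
import numpy as np

# Illustrative renderings of the code generation rules.
def apply_I_kron_B(B, X, p):
    m, n = B.shape
    Y = np.empty(m * p)
    for i in range(p):                       # doall: segments are independent
        Y[i * m:(i + 1) * m] = B @ X[i * n:(i + 1) * n]
    return Y

def apply_B_kron_I(B, X, p):
    m, n = B.shape
    Y = np.empty(m * p)
    for i in range(p):                       # doall over stride-p sections
        Y[i:m * p:p] = B @ X[i:n * p:p]
    return Y

def apply_stages(factors, X):
    """Evaluate [F_n ... F_2 F_1] X stage by stage with two buffers."""
    T0, T1 = X.copy(), np.empty_like(X)
    for F in factors:                        # sequential composition
        T1[:] = F @ T0
        T0, T1 = T1, T0                      # swap pointers, reuse storage
    return T0.copy()

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 4)); X = rng.standard_normal(4 * 5)
assert np.allclose(apply_I_kron_B(B, X, 5), np.kron(np.identity(5), B) @ X)
assert np.allclose(apply_B_kron_I(B, X, 5), np.kron(B, np.identity(5)) @ X)

F1, F2 = rng.standard_normal((6, 6)), rng.standard_normal((6, 6))
x = rng.standard_normal(6)
assert np.allclose(apply_stages([F1, F2], x), F2 @ F1 @ x)
```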
4.2 Memory Management for Depth-First Evaluation

Consider the tensor product formula:

  C̄ = Σ_{j_0, j_1, ..., j_{l-1} = 0}^{6} [((⊗_{r=0}^{l-1} S_c D_{j_r}) ⊗ I_{4^{n-l}}) (((⊗_{r=0}^{l-1} D_{j_r} S_a) ⊗ I_{4^{n-l}})Ā *_{n-l} ((⊗_{r=0}^{l-1} D_{j_r} S_b) ⊗ I_{4^{n-l}})B̄)]

The summation operator in the formulation of partial evaluation corresponds to a sequential loop nest, with the i-th loop performing a depth-first evaluation of the i-th level in the recursion tree. At each level, there are seven subtrees that need to be evaluated. Evaluation of each subtree is followed by an update of its parent. After the update, working storage used by that subtree can be reused for the computation of the next subtree at that level in the recursion tree. The loop structure hence looks like the following:

  do j_0 = 0, 6
    Code[T^0_a = D_{j_0} S_a Ā]                  /* partial evaluation */
    Code[T^0_b = D_{j_0} S_b B̄]                  /* partial evaluation */
    do j_1 = 0, 6
      Code[T^1_a = D_{j_1} S_a T^0_a]            /* partial evaluation */
      Code[T^1_b = D_{j_1} S_b T^0_b]            /* partial evaluation */
      ...
        do j_{l-1} = 0, 6
          Code[T^{l-1}_a = D_{j_{l-1}} S_a T^{l-2}_a]        /* partial evaluation */
          Code[T^{l-1}_b = D_{j_{l-1}} S_b T^{l-2}_b]        /* partial evaluation */
          Code[T^{l-1}_c = T^{l-1}_a *_{CS, n-l} T^{l-1}_b]  /* complete evaluation */
          Code[T^{l-2}_c = S_c D_{j_{l-1}} T^{l-1}_c]        /* update of parent */
        enddo
      ...
      Code[T^0_c = S_c D_{j_1} T^1_c]            /* update of parent */
    enddo
    Code[C̄ = S_c D_{j_0} T^0_c]                  /* update of parent */
  enddo

4.3 Implementation of Winograd's Variation

Strassen's algorithm uses 18 scalar additions and 7 scalar multiplications to multiply 2 × 2 matrices. Winograd presented a more efficient algorithm which uses 15 scalar additions and 7 scalar multiplications []. The
Winograd's variation is based on the following three matrix operations:

  W_a = [  1  0  0  0     W_b = [  1  0  0  0     W_c = [ 1  1  0  0  0  0  0
           0  0  1  0              0  1  0  0             1  0  0 -1  0  1  1
           1 -1  1 -1              0  0  0  1             1  0  1  0  1  1  0
           0  0  0  1              1 -1 -1  1             1  0  0  0  1  1  1 ]
           0  1  0  1             -1  0  1  0
          -1  1  0  1              1  0 -1  1
           1 -1  0  0 ]            0  0 -1  1 ]

The Winograd variation for multiplying 2 × 2 matrices can then be written as the matrix formula

  C̄ = W_c (W_a Ā * W_b B̄)

with Ā = (a_00, a_10, a_01, a_11)^T, B̄ = (b_00, b_10, b_01, b_11)^T, and C̄ = (c_00, c_10, c_01, c_11)^T.

The generated code for the operations W_a, W_b, and W_c contains some common terms. For example, a_00 − a_10 is evaluated twice in a direct implementation of W_a. The key to reducing the number of additions in Winograd's variation is to evaluate a common term only once. We factorize W_a, W_b, and W_c into products of sparse matrices to eliminate the common terms, so that each common term is computed exactly once.
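A sketch of the variation in scalar form (our own naming, using one standard ordering of Winograd's operations; the paper's W_a, W_b, W_c encode the same computation) can be checked against the 2 × 2 product:

```python
# Winograd's variation: 7 multiplications and 15 additions, with each
# common term computed once (illustrative only).
def winograd2x2(A, B):
    (a00, a01), (a10, a11) = A
    (b00, b01), (b10, b11) = B
    s1 = a10 + a11; s2 = s1 - a00; s3 = a00 - a10; s4 = a01 - s2
    t1 = b01 - b00; t2 = b11 - t1; t3 = b11 - b01; t4 = t2 - b10
    m0 = a00 * b00; m1 = a01 * b10; m2 = s4 * b11; m3 = a11 * t4
    m4 = s1 * t1;   m5 = s2 * t2;   m6 = s3 * t3
    u2 = m0 + m5; u3 = u2 + m6; u4 = u2 + m4
    return [[m0 + m1, u4 + m2], [u3 - m3, u3 + m4]]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
assert winograd2x2(A, B) == [[19.0, 22.0], [43.0, 50.0]]
```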
There are 15 rows containing two non-zero elements in the matrix factorizations of W_a, W_b, and W_c, which correspond to the 15 additions in Winograd's variation of Strassen's algorithm. The rows containing a single one are implemented as data movement, and those containing all zeros are equivalent to null operations. The indices of the input and output array elements of W_a are specified by the permutation operations in a tensor product formula and are computed similarly to those for S_a. Let p_i, 0 \le i < 4, be the indices of the input array elements, and q_i, 0 \le i < 7, be the indices of the output array elements. The computation T = W_a A on a vector A of length 4 is translated to the following sequence of assignments:

Code[T = W_a A]
   T[q_1] = A[p_0]
   T[q_2] = A[p_2]
   T[q_3] = A[p_0] - A[p_1]
   T[q_4] = A[p_1] + A[p_3]
   T[q_6] = A[p_3]
   T[q_0] = -T[q_1] + T[q_4]
   T[q_5] = -T[q_0] + T[q_2]

The implementation of Winograd's variation is simply a replacement of the translated code of W_a, W_b, and W_c for the code of S_a, S_b, and S_c in the corresponding implementation of Strassen's algorithm.

5 Performance Results on the Cray Y-MP

Performance statistics were gathered for different matrix sizes, different block sizes, and different levels of partial evaluation. Table 2 shows performance for execution on a single processor. All execution times are in seconds. The numbers in parentheses display performance in megaflops. Empty fields indicate that the program could not be run due to lack of sufficient memory. The matrix size is 2^n \times 2^n, the level of partial evaluation is l, and the block size at which conventional matrix multiplication is applied is 2^k \times 2^k. The execution times for the Block Strassen's algorithm are compared with the Cray Scientific Library routines SGEMM, which implements conventional matrix multiplication, and SGEMMS, which implements Strassen's matrix multiplication. Since SGEMM and SGEMMS are independent of l and k, the times for those are given only once for each value of n.
SGEMM is used for block matrix multiplication in the Block Strassen's algorithm. For any value of l, the lowest execution time occurs for k = 6 because the vector length on the Cray Y-MP is 64. The megaflops for larger block sizes are higher than those for k = 6 for the same values of n and l, but the execution times are longer since a larger number of arithmetic operations is performed. The execution times and megaflops for k = 6, l = 0 are comparable (slightly better) to those of SGEMMS. There is a performance degradation due to a slight increase in the number of memory operations as l increases for any fixed n and k = 6. However, the difference is quite small, as is evident from the execution times.
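For reference, the Block Strassen evaluation that l and k parameterize can be sketched in a few lines. This is a NumPy sketch of the depth-first recursion, not the generated Fortran; the function name and coefficient tables are ours. At each level, one triple of half-size temporaries is reused across the seven subtrees, which is what reduces working storage from O(7^n) to O(4^n).

```python
import numpy as np

# Coefficients of Strassen's seven operand combinations (Sa, Sb) and output
# combinations (Sc), over the blocks (X11, X12, X21, X22).
SA = [(1,0,0,1), (0,0,1,1), (1,0,0,0), (0,0,0,1), (1,1,0,0), (-1,0,1,0), (0,1,0,-1)]
SB = [(1,0,0,1), (1,0,0,0), (0,1,0,-1), (-1,0,1,0), (0,0,0,1), (1,1,0,0), (0,0,1,1)]
SC = [(1,0,0,1), (0,0,1,-1), (0,1,0,1), (1,0,1,0), (-1,1,0,0), (0,0,0,1), (1,0,0,0)]

def block_strassen(A, B, C, level, max_level):
    """Accumulate A @ B into C, recursing depth-first for max_level levels."""
    m = A.shape[0]
    if level == max_level or m == 1:
        C += A @ B                                   # complete evaluation
        return
    h = m // 2
    a = [A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]]
    b = [B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]]
    c = [C[:h, :h], C[:h, h:], C[h:, :h], C[h:, h:]]
    for j in range(7):                               # seven subtrees in turn
        Ta = sum(s * blk for s, blk in zip(SA[j], a) if s)   # partial evaluation
        Tb = sum(s * blk for s, blk in zip(SB[j], b) if s)
        Tc = np.zeros((h, h))
        block_strassen(Ta, Tb, Tc, level + 1, max_level)
        for i, s in enumerate(SC[j]):                # update of parent; the
            if s:                                    # temporaries are then free
                c[i] += s * Tc

A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0, 32.0).reshape(4, 4)
C = np.zeros((4, 4))
block_strassen(A, B, C, level=0, max_level=2)        # two Strassen levels
assert np.allclose(C, A @ B)
```

Setting max_level to n - k corresponds to applying conventional multiplication on 2^k x 2^k blocks.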
Table 3 gives the performance when the program was run on two processors. A fixed value of k = 6 was chosen since this resulted in the best performance in the single-processor case. Again, the performance when l = 0 is slightly better than that of SGEMMS. For larger values of l, the performance degrades by about 1%. Table 4 shows the performance results for 8 processors. Since the programs were run in non-dedicated mode on the Cray Y-MP, we were unable to get all 8 processors for the entire execution of the program. The numbers in parentheses give the percentage of 8 CPUs available for execution. The amount of extra memory required is given in Figure 5 for different values of n and l. It can easily be seen that there is an order of magnitude improvement even for small values of l. A value of k = 6 was chosen since it is for this block size that the execution times are minimum.

6 Conclusions

We have shown how tensor product formulas expressing Strassen's matrix multiplication algorithm can be translated to efficient parallel programs for shared memory multiprocessors. This translation process is part of a more general programming methodology for designing high-performance block recursive algorithms for shared and distributed memory machines. The methodology uses a mathematical notation based on tensor products for expressing block recursive algorithms. Algebraic manipulation of these formulas yields mathematically equivalent formulas which result in implementations with different performance characteristics. A large number of programs can be generated in order to search for efficient implementations. Tensor products give a powerful method to generate these equivalent implementations automatically. As was illustrated in this paper, programs generated from tensor product formulas compare favorably with the best hand-coded ones. This paper presents an implementation of Strassen's algorithm on a shared memory multiprocessor such as the Cray Y-MP.
In the Y-MP, memory is organized into banks, and in the absence of bank conflicts, all memory accesses take the same amount of time. However, in distributed memory multiprocessors such as the Cray T3D, where each processor has its own local memory, a local memory access can be significantly faster than a remote access. Hence, an efficient implementation on a distributed memory machine requires partitioning the algorithm in such a manner that remote accesses are minimized. Tensor product formulas can also be used to specify regular data distributions for arrays. Given a tensor product formula with a specified distribution of its input and output arrays, the inter-processor communication cost incurred by the implementation can be determined. If the cost of communication is high, it might be more efficient to perform a data redistribution before the computation, to bring the arrays into a form where the computation is local to the processors, provided the overhead of redistribution is lower than the resulting savings in communication cost. We are currently examining these issues and are working on an implementation on the Cray T3D.

Both formula modification and program generation are capable of being automated. We are currently implementing this methodology in an expert system EXTENT (Expert System for Tensor Formula Translation) that assists in the development of parallel programs for numerical algorithms on various computer architectures. Currently, the system generates Fortran programs for the Cray Y-MP. The expert system employs various heuristics to automatically generate alternative tensor product formulas, translate tensor product formulas to programs for various parallel architectures, test the produced programs, and analyze the test results.

References

[1] D.H. Bailey. Extra High Speed Matrix Multiplication on the Cray-2. SIAM Journal on Scientific and Statistical Computing, 9(3):603-607, 1988.

[2] D.H. Bailey, K. Lee, and H.D. Simon. Using Strassen's Algorithm to Accelerate the Solution of Linear Systems. Journal of Supercomputing, 4(4):357-371, Jan. 1991.

[3] A. Borodin and I. Munro. The Computational Complexity of Algebraic and Numeric Problems. American Elsevier Publishing Co., 1975.

[4] R.P. Brent. Algorithms for matrix multiplication. Technical Report CS 157, Computer Science Dept., Stanford University, Palo Alto, Calif., 1970.

[5] J.W. Brewer. Kronecker Products and Matrix Calculus in System Theory. IEEE Transactions on Circuits and Systems, 25(9):772-781, 1978.

[6] J. Granata, M. Conner, and R. Tolimieri. Recursive Fast Algorithms and the Role of Tensor Products. IEEE Transactions on Signal Processing, 40(12):2921-2930, Dec. 1992.

[7] F.A. Graybill. Matrices, with Applications in Statistics. Wadsworth International Group, Belmont, CA, 1983.

[8] H.V. Henderson and S.R. Searle. The Vec-Permutation Matrix, The Vec Operator and Kronecker Products: A Review. Linear and Multilinear Algebra, 9:271-288, 1981.

[9] N.J. Higham. Exploiting fast matrix multiplication within the level 3 BLAS. ACM Transactions on Mathematical Software, 16(4):352-368, Dec. 1990.

[10] C.-H. Huang, J.R. Johnson, and R.W. Johnson. A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm. Applied Mathematics Letters, 3(3):67-71, 1990.

[11] C.-H. Huang, J.R. Johnson, and R.W. Johnson. Generating Parallel Programs from Tensor Product Formulas: A Case Study of Strassen's Matrix Multiplication Algorithm. In International Conference on Parallel Processing, volume III, pages 104-108, 1991.
[12] J.R. Johnson, R.W. Johnson, D. Rodriguez, and R. Tolimieri. A Methodology for Designing, Modifying and Implementing Fourier Transform Algorithms on Various Architectures. Circuits, Systems and Signal Processing, 9(4):449-500, 1990.

[13] B. Kumar, C.-H. Huang, J.R. Johnson, R.W. Johnson, and P. Sadayappan. A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction. In Seventh International Parallel Processing Symposium, pages 582-588, Apr. 1993.

[14] C. Van Loan. Computational Frameworks for the Fast Fourier Transform. SIAM Publications, 1992.

[15] P.A. Regalia and S.K. Mitra. Kronecker Products, Unitary Matrices and Signal Processing Applications. SIAM Review, 31(4):586-613, Dec. 1989.

[16] G.X. Ritter and P.D. Gader. Image Algebra Techniques and Parallel Image Processing. Journal of Parallel and Distributed Computing, 4(1):7-44, 1987.

[17] V. Strassen. Gaussian Elimination is not Optimal. Numerische Mathematik, 13:354-356, 1969.
Figure 1: 1-level block recursive storage (4 blocks of size 2^{n-1}; at level l, 4^l blocks of size 2^{n-l})

Figure 2: Recursion tree of depth 1 for Strassen's algorithm

Figure 3: l-level block recursive Strassen's algorithm
Figure 4: l-level Block Recursive Strassen's Algorithm with Memory Reduction

Figure 5: Memory requirements for working arrays (n = 8, 9, 10)

Table 1: Comparison of operation counts for Strassen's algorithm and conventional matrix multiplication (2^n x 2^n matrices; conventional multiplication applied on 2^k x 2^k blocks, 0 \le k \le n)

   MM:         additions 8^n - 4^n;   multiplications 8^n;   total 2 \cdot 8^n - 4^n
   STR:        additions 6(7^n - 4^n);   multiplications 7^n;   total 7^{n+1} - 6 \cdot 4^n
   BLOCK STR:  additions 6 \cdot 4^k (7^{n-k} - 4^{n-k}) + 7^{n-k} (8^k - 4^k);   multiplications 7^{n-k} \cdot 8^k;   total 7^{n-k} (2 \cdot 8^k + 5 \cdot 4^k) - 6 \cdot 4^n
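The closed forms in Table 1 can be sanity-checked programmatically. The following small sketch (the function names are ours) verifies that Block Strassen degenerates to conventional multiplication at k = n and to full Strassen at k = 0:

```python
# Operation counts for 2^n x 2^n matrices, with conventional multiplication
# applied on 2^k x 2^k blocks. Each function returns (additions, multiplications).
def mm_ops(n):
    return (8**n - 4**n, 8**n)

def strassen_ops(n):
    return (6 * (7**n - 4**n), 7**n)

def block_strassen_ops(n, k):
    adds = 6 * 4**k * (7**(n - k) - 4**(n - k)) + 7**(n - k) * (8**k - 4**k)
    return (adds, 7**(n - k) * 8**k)

# Base case: a 2 x 2 product takes 18 additions and 7 multiplications
# with Strassen, versus 4 additions and 8 multiplications conventionally.
assert strassen_ops(1) == (18, 7)
assert mm_ops(1) == (4, 8)

# Block Strassen interpolates between the two extremes.
assert block_strassen_ops(10, 10) == mm_ops(10)
assert block_strassen_ops(10, 0) == strassen_ops(10)
```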
Table 2: Execution times for Block Strassen's algorithm with memory reduction on a single processor (n = 8, 9, 10; times in seconds, megaflops in parentheses; SGEMM and SGEMMS times given for comparison)

Table 3: Execution times for k = 6 on 2 processors

Table 4: Execution times for k = 6 on 8 processors. Percentages of 8-CPU availability obtained are given in parentheses.
More informationOn the construction of nested orthogonal arrays
isid/ms/2010/06 September 10, 2010 http://wwwisidacin/ statmath/eprints On the construction of nested orthogonal arrays Aloke Dey Indian Statistical Institute, Delhi Centre 7, SJSS Marg, New Delhi 110
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationAn Efficient Ranking Algorithm of t-ary Trees in Gray-code Order
The 9th Workshop on Combinatorial Mathematics and Computation Theory An Efficient Ranking Algorithm of t-ary Trees in Gray-code Order Ro Yu Wu Jou Ming Chang, An Hang Chen Chun Liang Liu Department of
More informationQuery Processing & Optimization
Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction
More informationTreewidth and graph minors
Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under
More informationFast algorithm for generating ascending compositions
manuscript No. (will be inserted by the editor) Fast algorithm for generating ascending compositions Mircea Merca Received: date / Accepted: date Abstract In this paper we give a fast algorithm to generate
More informationLecture 9 Basic Parallelization
Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning
More informationLecture 9 Basic Parallelization
Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning
More informationCS473-Algorithms I. Lecture 10. Dynamic Programming. Cevdet Aykanat - Bilkent University Computer Engineering Department
CS473-Algorithms I Lecture 1 Dynamic Programming 1 Introduction An algorithm design paradigm like divide-and-conquer Programming : A tabular method (not writing computer code) Divide-and-Conquer (DAC):
More information(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX
Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu
More informationOn Covering a Graph Optimally with Induced Subgraphs
On Covering a Graph Optimally with Induced Subgraphs Shripad Thite April 1, 006 Abstract We consider the problem of covering a graph with a given number of induced subgraphs so that the maximum number
More informationA Simplied NP-complete MAXSAT Problem. Abstract. It is shown that the MAX2SAT problem is NP-complete even if every variable
A Simplied NP-complete MAXSAT Problem Venkatesh Raman 1, B. Ravikumar 2 and S. Srinivasa Rao 1 1 The Institute of Mathematical Sciences, C. I. T. Campus, Chennai 600 113. India 2 Department of Computer
More informationAutomatic Array Alignment for. Mitsuru Ikei. Hitachi Chemical Company Ltd. Michael Wolfe. Oregon Graduate Institute of Science & Technology
Automatic Array Alignment for Distributed Memory Multicomputers Mitsuru Ikei Hitachi Chemical Company Ltd. Michael Wolfe Oregon Graduate Institute of Science & Technology P.O. Box 91000 Portland OR 97291
More informationLecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1
CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanfordedu) February 6, 2018 Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1 In the
More informationFFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES
FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES FRANCHETTI Franz, (AUT), KALTENBERGER Florian, (AUT), UEBERHUBER Christoph W. (AUT) Abstract. FFTs are the single most important algorithms in science and
More informationContents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...
Contents Contents...283 Introduction...283 Basic Steps in Query Processing...284 Introduction...285 Transformation of Relational Expressions...287 Equivalence Rules...289 Transformation Example: Pushing
More informationMapping Vector Codes to a Stream Processor (Imagine)
Mapping Vector Codes to a Stream Processor (Imagine) Mehdi Baradaran Tahoori and Paul Wang Lee {mtahoori,paulwlee}@stanford.edu Abstract: We examined some basic problems in mapping vector codes to stream
More informationDepartment of. Computer Science. Remapping Subpartitions of. Hyperspace Using Iterative. Genetic Search. Keith Mathias and Darrell Whitley
Department of Computer Science Remapping Subpartitions of Hyperspace Using Iterative Genetic Search Keith Mathias and Darrell Whitley Technical Report CS-4-11 January 7, 14 Colorado State University Remapping
More informationChapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition
Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Nicholas J. Higham Pythagoras Papadimitriou Abstract A new method is described for computing the singular value decomposition
More informationModel Checking I Binary Decision Diagrams
/42 Model Checking I Binary Decision Diagrams Edmund M. Clarke, Jr. School of Computer Science Carnegie Mellon University Pittsburgh, PA 523 2/42 Binary Decision Diagrams Ordered binary decision diagrams
More informationLocalization in Graphs. Richardson, TX Azriel Rosenfeld. Center for Automation Research. College Park, MD
CAR-TR-728 CS-TR-3326 UMIACS-TR-94-92 Samir Khuller Department of Computer Science Institute for Advanced Computer Studies University of Maryland College Park, MD 20742-3255 Localization in Graphs Azriel
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 5 Vector and Matrix Products Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael T. Heath Parallel
More informationDocument Image Restoration Using Binary Morphological Filters. Jisheng Liang, Robert M. Haralick. Seattle, Washington Ihsin T.
Document Image Restoration Using Binary Morphological Filters Jisheng Liang, Robert M. Haralick University of Washington, Department of Electrical Engineering Seattle, Washington 98195 Ihsin T. Phillips
More informationParallel Algorithms CSE /22/2015. Outline of this lecture: 1 Implementation of cilk for. 2. Parallel Matrix Multiplication
CSE 539 01/22/2015 Parallel Algorithms Lecture 3 Scribe: Angelina Lee Outline of this lecture: 1. Implementation of cilk for 2. Parallel Matrix Multiplication 1 Implementation of cilk for We mentioned
More information