A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm

B. Kumar, C.-H. Huang, and P. Sadayappan
Department of Computer and Information Science, The Ohio State University

R.W. Johnson
Department of Computer Science, St. Cloud State University

Abstract

In this paper, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier Transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated into high-performance parallel/vector code for various architectures. In this paper, we present a non-recursive implementation of Strassen's algorithm for shared-memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared-memory multiprocessor. Performance results on a Cray Y-MP8 are presented.

Keywords: Strassen's matrix multiplication algorithm, Tensor product, Block recursive algorithm, Parallel architecture, Vector processors.

This work was supported in part by ARPA and monitored by NIST.
Contents

1 Introduction
2 An Overview of the Tensor Product Notation
3 A Tensor Product Formulation of Strassen's Algorithm
  3.1 Block Recursive Strassen's Algorithm: Breadth-First Evaluation
  3.2 Block Recursive Strassen's Algorithm: Depth-First Evaluation
  3.3 Combining Breadth-First and Depth-First Evaluations
  3.4 Matrix Storage in Main Memory
4 Code Generation for Vector Processors
  4.1 Block Strassen's Algorithm
  4.2 Memory Management for Depth-First Evaluation
  4.3 Implementation of Winograd's Variation
5 Performance Results on the Cray Y-MP
6 Conclusions
List of Figures

1 2-level block recursive storage
2 Recursion tree of depth 1 for Strassen's algorithm
3 l-level block recursive Strassen's algorithm
4 l-level block recursive Strassen's algorithm with memory reduction
5 Memory requirements for working arrays
List of Tables

1 Comparison of operation counts for Strassen's algorithm and conventional matrix multiplication
2 Execution times for Block Strassen's algorithm with memory reduction
3 Execution times for k = 6 on 4 processors
4 Execution times for k = 6 on 8 processors (percentages of 8-CPU performance obtained are given in parentheses)
1 Introduction

Tensor products (Kronecker products) have been used to model algorithms with a recursive computational structure, which occur in application areas such as digital signal processing [, 1], image processing [1], linear system design [], and statistics []. In recent years, a programming methodology based on tensor products has been successfully used to design and implement high-performance algorithms to compute fast Fourier transforms (FFT) [1, 1] and matrix multiplication [10, 1] for shared-memory vector multiprocessors. A set of multilinear algebra operations such as tensor product and matrix multiplication is used to express block recursive algorithms. These algebraic operations can be systematically translated into high-level programming language constructs such as sequential composition, iteration, and parallel/vector operations. Tensor product formulas representing an algorithm can be algebraically manipulated to restructure the computation to achieve different performance characteristics. In this way, the algorithm can be tuned to match the underlying architecture.

Matrix multiplication is an important core computation in many scientific applications. Conventional matrix multiplication of 2^n × 2^n matrices requires O(8^n) operations. In 1969, V. Strassen proposed an algorithm for matrix multiplication [1] that employs a computationally efficient method to compute the product of 2 × 2 matrices using only seven multiplications. A recursive application of this algorithm for multiplying 2^n × 2^n matrices requires only O(7^n) operations, compared to O(8^n) for conventional matrix multiplication. Efficient parallel implementations of this algorithm have been described in [1, 10]. This algorithm has been used for fast matrix multiplication in implementing level-3 BLAS [9] and linear algebra routines [].
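The recursive structure of the algorithm can be sketched in a few lines. The following is our own illustration (not the paper's synthesized code; the function name and cutoff parameter are ours): seven recursive products per level, falling back to conventional multiplication below a cutoff.

```python
import numpy as np

# Sketch of recursive Strassen multiplication (illustrative only): for
# 2^n x 2^n inputs this performs O(7^n) scalar multiplications instead of
# the O(8^n) of conventional multiplication.
def strassen(A, B, cutoff=2):
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                      # conventional multiplication
    h = n // 2
    a00, a01, a10, a11 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    b00, b01, b10, b11 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    t0 = strassen(a00 + a11, b00 + b11, cutoff)
    t1 = strassen(a10 + a11, b00, cutoff)
    t2 = strassen(a00, b01 - b11, cutoff)
    t3 = strassen(a11, -b00 + b10, cutoff)
    t4 = strassen(a00 + a01, b11, cutoff)
    t5 = strassen(-a00 + a10, b00 + b01, cutoff)
    t6 = strassen(a01 - a11, b10 + b11, cutoff)
    return np.block([[t0 + t3 - t4 + t6, t2 + t4],
                     [t1 + t3, t0 - t1 + t2 + t5]])

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)); B = rng.standard_normal((8, 8))
assert np.allclose(strassen(A, B), A @ B)
```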
In this paper, we describe the tensor product formulation of Strassen's matrix multiplication algorithm, and discuss program generation for shared-memory vector processors such as the Cray Y-MP. Achieving high performance on these architectures requires operating on large vectors and reducing memory bank conflicts, while at the same time exploiting coarse-grained parallelism. We show how the tensor product formula of Strassen's algorithm can be manipulated to operate on full vectors with unit stride. An important feature of the generated code is that it employs no recursion. The initial formulation presented in [10] required a working array of size O(7^n) for the multiplication of 2^n × 2^n matrices. We present a modified formulation that significantly reduces the size of the working array to O(4^n). This reduction is made possible through the reuse of working storage. We describe how this memory reuse can be captured in tensor product formulas with the use of a selection operator. We present a strategy for automatic code synthesis from tensor product formulas containing a selection operator. The modified formulation exhibits sufficient parallelism for efficient implementation on a vector-parallel machine such as the Cray Y-MP. In addition, we express Winograd's variation [] using our notation and describe its translation to program code. Winograd's variation uses the same number of multiplications, but a smaller number of additions, than the original Strassen's algorithm.

This paper is organized as follows. Section 2 contains an overview of the tensor product notation. A
formulation of Strassen's algorithm using this notation is presented in Section 3, along with a discussion of how the formulation can be modified to achieve a reduction in working storage. Section 4 presents a strategy for automatic code generation from a tensor product formula. Winograd's variation of Strassen's algorithm is also presented. Section 5 presents performance results on the Cray Y-MP. Conclusions are presented in Section 6.

2 An Overview of the Tensor Product Notation

In this section, we give a brief overview of the tensor product notation and the properties that are used in this paper. Let A ∈ R^{m×n} and B ∈ R^{p×q}. The tensor product A ⊗ B is the block matrix obtained by replacing each element a_{i,j} by the matrix a_{i,j}B, i.e.,

  A ⊗ B = [ a_{0,0}B    ...  a_{0,n-1}B
            ...              ...
            a_{m-1,0}B  ...  a_{m-1,n-1}B ]

Whenever all the involved matrix products are valid, the following properties hold:

Property 2.1 (Tensor Product)
1. A ⊗ B ⊗ C = A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C
2. (A ⊗ B)(C ⊗ D) = AC ⊗ BD
3. A ⊗ B = (A ⊗ I_n)(I_m ⊗ B) = (I_m ⊗ B)(A ⊗ I_n)
4. (A ⊗ B)^T = A^T ⊗ B^T
5. (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}
6. ⊗_{i=0}^{n-1} A_i B_i = (⊗_{i=0}^{n-1} A_i)(⊗_{i=0}^{n-1} B_i)
7. ∏_{i=0}^{n-1} (A_i ⊗ B_i) = (∏_{i=0}^{n-1} A_i) ⊗ (∏_{i=0}^{n-1} B_i)
8. I_{mn} = I_m ⊗ I_n

where I_n represents the n × n identity matrix, ∏_{i=0}^{n-1} A_i = A_{n-1} A_{n-2} ... A_0, and ⊗_{i=0}^{n-1} A_i = A_{n-1} ⊗ A_{n-2} ⊗ ... ⊗ A_0.

A matrix basis E^{m,n}_{i,j} is an m × n matrix with a one in the i-th row and the j-th column and zeros elsewhere. A vector basis e^m_i is a column vector of length m with a one in the i-th position and zeros elsewhere. If the matrix basis E^{m,n}_{i,j} of an m × n matrix is stored by rows, it is isomorphic to the tensor product of two vector bases e^m_i ⊗ e^n_j. The tensor product of two vector bases e^m_i ⊗ e^n_j is equal to the vector basis e^{mn}_{in+j},
i.e., e^m_i ⊗ e^n_j = e^{mn}_{in+j}. The tensor product of two vector bases e^m_i ⊗ e^n_j is called a tensor basis. If the basis elements are ordered lexicographically, then

  e^{n_1}_{i_1} ⊗ ... ⊗ e^{n_t}_{i_t} = e^{n_1 ... n_t}_{i_1 n_2 ... n_t + ... + i_{t-1} n_t + i_t}

Expressing a vector basis e^M_i as the tensor product of vector bases e^{m_1}_{i_1} ⊗ ... ⊗ e^{m_t}_{i_t}, where M = m_1 ... m_t and i_k = (i div M_k) mod m_k, M_k = ∏_{i=k+1}^{t} m_i, M_t = 1, is called the factorization of the vector basis; e.g., the vector basis e^8_6 can be factorized into the tensor bases e^2_1 ⊗ e^2_1 ⊗ e^2_0 or e^2_1 ⊗ e^4_2. Expressing a tensor basis e^{n_1}_{i_1} ⊗ ... ⊗ e^{n_t}_{i_t} as a vector basis e^{n_1 ... n_t}_{i_1 n_2 ... n_t + ... + i_{t-1} n_t + i_t} is called linearization of the tensor basis. For example, the tensor basis e^2_1 ⊗ e^4_2 can be linearized to give the vector basis e^8_6.

One of the permutations used frequently in the representation of algorithms in tensor product formulas is the stride permutation. The stride permutation L^{mn}_n is defined as

  L^{mn}_n (e^m_i ⊗ e^n_j) = e^n_j ⊗ e^m_i

L^{mn}_n permutes the elements of a vector of size mn with stride distance n. This permutation can be represented as an mn × mn transformation. For example, L^6_2 can be represented by the matrix equation

  L^6_2 [x_0, x_1, x_2, x_3, x_4, x_5]^T = [x_0, x_2, x_4, x_1, x_3, x_5]^T

The stride permutation has the following properties:

Property 2.2 (Stride Permutation)
1. (L^{mn}_n)^{-1} = L^{mn}_m
2. L^{rst}_{st} = L^{rst}_s L^{rst}_t
3. L^{rst}_t = (L^{rt}_t ⊗ I_s)(I_r ⊗ L^{st}_t)

A permutation of the form I_m ⊗ L^{pq}_q ⊗ I_n is called a tensor permutation. The following theorem illustrates how a tensor product of two matrices can be commuted by applying a stride permutation.

Theorem 2.1 (Commutation Theorem) If A is an m × m matrix and B is an n × n matrix, then

  L^{mn}_n (A ⊗ B) = (B ⊗ A) L^{mn}_n
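The notation above can be checked numerically. The following sketch (ours, not from the paper) verifies Property 2.1(2), the stride permutation, and the commutation theorem, using NumPy's Kronecker product for the tensor product:

```python
import numpy as np

# Numerical sketch of the tensor product notation (illustrative only).
def stride_perm(m, n):
    """Return L^{mn}_n as an mn x mn permutation matrix:
    it maps e^m_i (x) e^n_j to e^n_j (x) e^m_i."""
    L = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            L[j * m + i, i * n + j] = 1.0
    return L

rng = np.random.default_rng(0)

# Property 2.1(2): (A (x) B)(C (x) D) = AC (x) BD
A = rng.standard_normal((2, 3)); B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 2)); D = rng.standard_normal((2, 4))
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# L^6_2 gathers a length-6 vector at stride distance 2.
assert np.array_equal(stride_perm(3, 2) @ np.arange(6.0), [0, 2, 4, 1, 3, 5])

# Commutation theorem: L^{mn}_n (A (x) B) = (B (x) A) L^{mn}_n
m, n = 3, 4
A = rng.standard_normal((m, m)); B = rng.standard_normal((n, n))
L = stride_perm(m, n)
assert np.allclose(L @ np.kron(A, B), np.kron(B, A) @ L)
```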
Pairwise multiplication between two vectors implies the product between the corresponding elements of those vectors, e.g.,

  [x_0, x_1, ..., x_{n-1}] * [y_0, y_1, ..., y_{n-1}] = [x_0 y_0, x_1 y_1, ..., x_{n-1} y_{n-1}]

If the elements x_i and y_i are themselves submatrices, then x_i y_i corresponds to matrix multiplication between them.

3 A Tensor Product Formulation of Strassen's Algorithm

Strassen's matrix multiplication algorithm is based on a computationally efficient way of multiplying 2 × 2 matrices using only seven multiplications [1]. Consider the matrix multiplication C = AB, where

  A = [ a_00  a_01    B = [ b_00  b_01    C = [ c_00  c_01
        a_10  a_11 ]        b_10  b_11 ]        c_10  c_11 ]

Strassen's algorithm can then be written as follows. First, the following intermediate values are calculated:

  t_0 = (a_00 + a_11)(b_00 + b_11)
  t_1 = (a_10 + a_11) b_00
  t_2 = a_00 (b_01 - b_11)
  t_3 = a_11 (-b_00 + b_10)
  t_4 = (a_00 + a_01) b_11
  t_5 = (-a_00 + a_10)(b_00 + b_01)
  t_6 = (a_01 - a_11)(b_10 + b_11)

Then the individual elements of C are given by

  c_00 = t_0 + t_3 - t_4 + t_6
  c_10 = t_1 + t_3
  c_01 = t_2 + t_4
  c_11 = t_0 - t_1 + t_2 + t_5

In matrix notation, this can be represented as:

  C̄ = S_c (S_a Ā * S_b B̄)
where

  S_a = [  1  0  0  1     S_b = [  1  0  0  1     S_c = [ 1  0  0  1 -1  0  1
           0  1  0  1              1  0  0  0             0  1  0  1  0  0  0
           1  0  0  0              0  0  1 -1             0  0  1  0  1  0  0
           0  0  0  1             -1  1  0  0             1 -1  1  0  0  1  0 ]
           1  0  1  0              0  0  0  1
          -1  1  0  0              1  0  1  0
           0  0  1 -1 ]            0  1  0  1 ]

and Ā, B̄ and C̄ are vectors of length 4, and represent the storage of matrices A, B, and C in column-major form. The notation Ā corresponds exactly to the vec(A) notation [8]; however, we shall use the former for readability purposes. The matrices S_a, S_b and S_c are termed basic operators; they do not have to be explicitly generated, but specify which operations have to be performed on specific components of the input vectors.

The above formulation can be easily extended to matrices of size 2^n × 2^n by considering a_ij, b_ij, and c_ij to be blocks of size 2^{n-1} × 2^{n-1}. First, we describe the block recursive storage of matrices in memory. Let X be any 2^n × 2^n matrix. At the top level, X can be viewed as:

  X = [ X_00  X_01
        X_10  X_11 ]

A vector X̄ representing an r-level block recursive representation of X is recursively defined as:

  X̄ = [ X̄_00
        X̄_10
        X̄_01
        X̄_11 ]

with the boundary condition that Ȳ is the column-major representation of any 2^{n-r} × 2^{n-r} block Y. An example of block recursive storage is given in Figure 1.

Let Ā, B̄, and C̄ be the 1-level block recursive representations of 2^n × 2^n matrices A, B, and C. Strassen's algorithm for computing C = AB can be written as:

  C̄ = (S_c ⊗ I_{4^{n-1}})[(S_a ⊗ I_{4^{n-1}})Ā *_{n-1} (S_b ⊗ I_{4^{n-1}})B̄]

where *_{n-1} denotes pairwise matrix multiplication between matrices of size 2^{n-1} × 2^{n-1}. We refer to the above as the 1-level block recursive Strassen's algorithm. In this case, the intermediate values t_i are 2^{n-1} × 2^{n-1} block matrices, and block matrix multiplications are performed using conventional matrix multiplication. This algorithm can be conveniently viewed in terms of a recursion tree (Figure 2), where the root node corresponds to the update of C, and the leaf nodes correspond to the evaluation of the intermediate values. The steps marked by † refer to computations that require working memory. Note that all the intermediate
values can be computed in parallel, since there are no data dependences between them. Each intermediate value requires working memory of size O(4^{n-1}). Hence, the 1-level block recursive Strassen's algorithm requires total working storage of size O(7 · 4^{n-1}).

Even though the above formulation has been given for matrix sizes of the form 2^n × 2^n, it is straightforward to generalize the implementation to handle arbitrary dimensions of matrices A and B. A common technique is to pad the matrices with rows and columns of zeros to increase the matrix sizes to the next higher powers of two, compute the extended matrix product, and then extract the desired result [1]. Another approach [] is to drop the last rows and columns from the computation to achieve even dimensions, and compute the partial matrix product. The complete matrix product is then obtained with a rank-k update (k = 1, 2, 3).

3.1 Block Recursive Strassen's Algorithm: Breadth-First Evaluation

In the 1-level application of Strassen's algorithm, the 2^{n-1} × 2^{n-1} block multiplications were computed using conventional matrix multiplication. To get additional savings in the total number of arithmetic operations required to compute the matrix product, Strassen's algorithm can be recursively applied to compute the block multiplications also. Let Ā, B̄, and C̄ be the l-level block recursive representations of 2^n × 2^n matrices A, B, and C. The computation of C̄ is described by the following formulation [10]:

  C̄ = S_c^{n,l} [S_a^{n,l}(Ā) *_{n-l} S_b^{n,l}(B̄)]

where

  S_a^{n,l} = (⊗_{i=0}^{l-1} S_a) ⊗ I_{4^{n-l}} = ∏_{i=0}^{l-1} (I_{7^i} ⊗ S_a ⊗ I_{4^{n-(i+1)}})
  S_b^{n,l} = (⊗_{i=0}^{l-1} S_b) ⊗ I_{4^{n-l}} = ∏_{i=0}^{l-1} (I_{7^i} ⊗ S_b ⊗ I_{4^{n-(i+1)}})
  S_c^{n,l} = (⊗_{i=0}^{l-1} S_c) ⊗ I_{4^{n-l}} = ∏_{i=l-1}^{0} (I_{7^i} ⊗ S_c ⊗ I_{4^{n-(i+1)}})

and *_{n-l} denotes pairwise multiplication between blocks of size 2^{n-l} × 2^{n-l}. This computation can be interpreted as a breadth-first evaluation of the recursion tree shown in Figure 3. Each intermediate block matrix t_i is itself computed using Strassen's algorithm, yielding intermediate sub-blocks t_{i0}, ..., t_{i6}.
This process is recursively applied until blocks of size 2^{n-l} × 2^{n-l} are reached, which are then computed using conventional matrix multiplication. Following our convention, † denotes computation that requires working storage. The working array requirement in this case is O(7^l 4^{n-l}). In the extreme case, Strassen's algorithm can be applied recursively down to blocks of size 2 × 2, and such an (n−1)-level (or n-level) Strassen's algorithm would require working storage of size O(7^n).

Table 1 presents the total number of operations required for multiplying two matrices of size 2^n × 2^n. MM denotes conventional matrix multiplication, STR refers to an n-level block recursive Strassen's algorithm, and BLOCK STR denotes an (n−k)-level block recursive Strassen's algorithm. STR has a lower operation count than MM only for n ≥ 10. The expression for the operation count for BLOCK STR has a minimum
at k = 3 for all integer values of n and k. For k = 3, BLOCK STR has a lower operation count than MM for n ≥ 4. Hence, the block Strassen's algorithm is better than conventional matrix multiplication in terms of total operation count even for small values of n. However, for implementation on a shared-memory vector machine such as the Cray Y-MP, a lower operation count does not imply a smaller execution time, since the effects of vector length and stride also come into play.

3.2 Block Recursive Strassen's Algorithm: Depth-First Evaluation

An l-level Strassen's algorithm requires fewer operations than conventional matrix multiplication, and the operation count decreases as the number of levels l is increased. An optimal value is attained at n − l = 3. However, the working storage requirement for an l-level algorithm is O(7^l 4^{n-l}), and hence increases exponentially with an increase in l. This high storage requirement is due to the breadth-first expansion of the recursion tree, in which all the intermediate values have to be stored.

To achieve a reduction in working storage, we can perform the computation of Strassen's algorithm using a depth-first expansion of the recursion tree. Instead of expanding all the leaves in the recursion tree, we only compute a subtree, and use the results obtained from that subtree to update C. This process is repeatedly applied until all the subtrees are evaluated. It is necessary to ensure that no redundant computation is performed. The memory requirement for the algorithm in this case will be the memory requirement for a single subtree, since the same space can be reused for the evaluation of different subtrees. For the 2 × 2 case, the algorithm is modified as follows. t is a temporary variable that is used to store intermediate values.

  Step 1: t = (a_00 + a_11)(b_00 + b_11);  c_00 = t;         c_11 = t;
  Step 2: t = (a_10 + a_11) b_00;          c_10 = t;         c_11 = c_11 - t;
  Step 3: t = a_00 (b_01 - b_11);          c_01 = t;         c_11 = c_11 + t;
  Step 4: t = a_11 (-b_00 + b_10);         c_00 = c_00 + t;  c_10 = c_10 + t;
  Step 5: t = (a_00 + a_01) b_11;          c_00 = c_00 - t;  c_01 = c_01 + t;
  Step 6: t = (-a_00 + a_10)(b_00 + b_01); c_11 = c_11 + t;
  Step 7: t = (a_01 - a_11)(b_10 + b_11);  c_00 = c_00 + t;

Now the extra memory requirement is only one element, since the same memory location can be reused to evaluate the different t_i's. In the original formulation, seven memory locations are required, since all the intermediate values are calculated before the update of C is performed. The total number of arithmetic operations is unchanged.

We now formulate the concept of memory reduction using the tensor product framework. Define D_j to be the 7 × 7 matrix with d_jj = 1 and zeros elsewhere. Note that Σ_{j=0}^{6} D_j = I_7. Memory reduction for
the 2 × 2 case can be formulated in matrix notation as:

  C̄ = Σ_{j=0}^{6} (S_c D_j)[(D_j S_a)Ā * (D_j S_b)B̄]

D_j is termed a selection operator; it selects the subset of the input vector on which the computation is to be performed. This framework can be extended to multiplying matrices of size 2^n × 2^n. We begin with the 1-level Strassen's algorithm, and assume that the data matrices are stored in 1-level block recursive form. The tensor product formula to compute C = AB can then be written as:

  C̄ = Σ_{j=0}^{6} (S_c D_j ⊗ I_{4^{n-1}})[(D_j S_a ⊗ I_{4^{n-1}})Ā *_{n-1} (D_j S_b ⊗ I_{4^{n-1}})B̄]

We can apply memory reduction at multiple levels by performing the same operation recursively on the smaller blocks. Assuming that the matrices are stored in l-level block recursive form, an l-level Strassen's algorithm with memory reduction can be formulated as:

  C̄ = Σ_{j_0, j_1, ..., j_{l-1} = 0}^{6} [((⊗_{r=0}^{l-1} S_c D_{j_r}) ⊗ I_{4^{n-l}}) (((⊗_{r=0}^{l-1} D_{j_r} S_a) ⊗ I_{4^{n-l}})Ā *_{n-l} ((⊗_{r=0}^{l-1} D_{j_r} S_b) ⊗ I_{4^{n-l}})B̄)]

We refer to the above formulation as the partial evaluation form of Strassen's algorithm. The computation specified in the above formulation can be described using the recursion tree shown in Figure 4. The current intermediate blocks being computed are represented by †. Working storage is required for the intermediate blocks on the path from the leaf node being computed to the root of the recursion tree. Hence, the working storage required is O(Σ_{i=1}^{l} 4^{n-i}) = O(4^n).

3.3 Combining Breadth-First and Depth-First Evaluations

The operator *_{n-l} in the tensor product formula for partial computation refers to pairwise matrix multiplication. Each block matrix multiplication in the pairwise matrix multiplication can itself be performed using complete evaluation. Hence we have a 3-level hierarchy. At the highest level, partial evaluation is performed down to blocks of size 2^{n-l} × 2^{n-l}. Then complete evaluation is performed until blocks of size 2^k × 2^k are reached, after which conventional matrix multiplication is applied.
This can be expressed in the tensor product notation as:

  C̄ = Σ_{j_0, ..., j_{l-1} = 0}^{6} [((⊗_{r=0}^{l-1} S_c D_{j_r}) ⊗ I_{4^{n-l}}) (((⊗_{r=0}^{l-1} D_{j_r} S_a) ⊗ I_{4^{n-l}})Ā *_{CS, n-l} ((⊗_{r=0}^{l-1} D_{j_r} S_b) ⊗ I_{4^{n-l}})B̄)]

  C̄' = S̃_c^{n-l,k} (S̃_a^{n-l,k} Ā' *_{MM,k} S̃_b^{n-l,k} B̄')

where *_{CS, n-l} denotes pairwise matrix multiplication between blocks of size 2^{n-l} × 2^{n-l} using complete evaluation, C̄' corresponds to each block pairwise multiplication during the partial evaluation, which itself is evaluated using an (n−l−k)-level Strassen's algorithm, and *_{MM,k} denotes pairwise matrix multiplication between blocks of size 2^k × 2^k using conventional matrix multiplication.
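The partial-evaluation formula for the 2 × 2 case can be sketched numerically. The following is our own NumPy rendering (illustrative only; the basic operators and D_j are as defined above), accumulating the sum term by term so that only one length-7 temporary is live at a time:

```python
import numpy as np

# Sketch of the 2 x 2 partial-evaluation identity
#   C~ = sum_{j=0}^{6} (S_c D_j)[(D_j S_a)A~ * (D_j S_b)B~]
Sa = np.array([[1,0,0,1],[0,1,0,1],[1,0,0,0],[0,0,0,1],
               [1,0,1,0],[-1,1,0,0],[0,0,1,-1]], dtype=float)
Sb = np.array([[1,0,0,1],[1,0,0,0],[0,0,1,-1],[-1,1,0,0],
               [0,0,0,1],[1,0,1,0],[0,1,0,1]], dtype=float)
Sc = np.array([[1,0,0,1,-1,0,1],[0,1,0,1,0,0,0],
               [0,0,1,0,1,0,0],[1,-1,1,0,0,1,0]], dtype=float)

def D(j):
    """Selection operator: 7 x 7 matrix with a single one at (j, j)."""
    d = np.zeros((7, 7)); d[j, j] = 1.0
    return d

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
a = A.flatten(order='F')               # column-major vectors A~, B~
b = B.flatten(order='F')

c = np.zeros(4)
for j in range(7):
    t = (D(j) @ Sa @ a) * (D(j) @ Sb @ b)   # one intermediate product
    c += Sc @ D(j) @ t                      # immediate update; t is reused
assert np.allclose(c.reshape((2, 2), order='F'), A @ B)
```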
The root of the recursion tree is defined to be at level 0. At level i, a working array of size O(4^{n-i}) is required to store the intermediate results of partial evaluation. The breadth-first expansion of the last (n−l−k) levels requires a working array of size O(7^{n-l-k} 4^k). Hence, the total memory requirement is O(Σ_{i=1}^{l-1} 4^{n-i} + 7^{n-l-k} 4^k) = O(4^n + 7^{n-l-k} 4^k). Even for moderate values of n and small values of l, this represents a significant savings compared to O(7^{n-k} 4^k) for complete evaluation. If the matrices are of size N × N where N is not a power of two, the technique of padding can be used, and the memory requirement with reduction will be O(4^{⌈lg N⌉} + 7^{⌈lg N⌉−l−k} 4^k), compared to O(7^{⌈lg N⌉−k} 4^k) for complete evaluation.

3.4 Matrix Storage in Main Memory

The formulation presented in the previous sections assumes, for simplicity of presentation, that the data matrices are stored in block recursive form. However, when implementing a block recursive algorithm on a shared-memory machine, matrices are usually stored in row-major or column-major form. We have implemented Strassen's algorithm using Fortran on the Cray Y-MP; hence the data matrices are stored in memory in column-major form. The tensor product formula to convert a 2^n × 2^n matrix from column-major form to a k-level block recursive form is given by [11]:

  R^{n,k} = [∏_{i=0}^{n-k-1} (I_{2·4^i} ⊗ L^{2^{n-k-i}}_2 ⊗ I_{2^{n+k-i-1}})] (I_{2^{n-k}} ⊗ L^{2^n}_{2^{n-k}} ⊗ I_{2^k})

R^{n,k} is termed a conversion operator. There are two ways in which storage conversion can be implemented. One way is to perform explicit conversion from row/column-major form to a block recursive form through data movement. However, a more efficient way is to merge the conversion operator into the computation in Strassen's algorithm, which results in a modification of the data array indexing functions.
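A minimal sketch of the storage conversion (ours; one conversion level only, not an implementation of the operator R^{n,k}): a column-major matrix is rearranged so its quadrants appear in the order X00, X10, X01, X11, each still column-major.

```python
import numpy as np

# One level of column-major -> block recursive storage conversion
# (illustrative only).
def block_recursive_1level(X):
    h = X.shape[0] // 2
    X00, X01 = X[:h, :h], X[:h, h:]
    X10, X11 = X[h:, :h], X[h:, h:]
    return np.concatenate([Q.flatten(order='F')
                           for Q in (X00, X10, X01, X11)])

X = np.arange(16.0).reshape((4, 4), order='F')  # column-major 4 x 4
v = block_recursive_1level(X)
# The first 4 entries are the top-left 2 x 2 quadrant in column-major order.
assert np.array_equal(v[:4], [0.0, 1.0, 4.0, 5.0])
```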
The modified tensor formulation for Block Strassen's algorithm is:

  C̄ = (R^{n,k})^{-1} S_c^{n,k} [R^{n,k} S_a^{n,k} Ā *_k R^{n,k} S_b^{n,k} B̄] = S̃_c^{n,k} [S̃_a^{n,k} Ā *_k S̃_b^{n,k} B̄]

where

  S̃_a^{n,k} = ∏_{i=0}^{n-k-1} [(I_{7^i} ⊗ S_a ⊗ I_{4^{n-i-1}})(I_{7^i} ⊗ I_2 ⊗ L^{2^{n-k-i}}_2 ⊗ I_{2^{n+k-i-1}})] (I_{2^{n-k}} ⊗ L^{2^n}_{2^{n-k}} ⊗ I_{2^k})
  S̃_b^{n,k} = ∏_{i=0}^{n-k-1} [(I_{7^i} ⊗ S_b ⊗ I_{4^{n-i-1}})(I_{7^i} ⊗ I_2 ⊗ L^{2^{n-k-i}}_2 ⊗ I_{2^{n+k-i-1}})] (I_{2^{n-k}} ⊗ L^{2^n}_{2^{n-k}} ⊗ I_{2^k})
  S̃_c^{n,k} = (I_{2^{n-k}} ⊗ L^{2^n}_{2^k} ⊗ I_{2^k}) ∏_{i=n-k-1}^{0} [(I_{7^i} ⊗ I_2 ⊗ L^{2^{n-k-i}}_{2^{n-k-i-1}} ⊗ I_{2^{n+k-i-1}})(I_{7^i} ⊗ S_c ⊗ I_{4^{n-i-1}})]

With l-level memory reduction, the above formulation is modified into:

  C̄ = Σ_{j_0, j_1, ..., j_{l-1} = 0}^{6} (S̃_c^{n,k} ((⊗_{r=0}^{l-1} D_{j_r}) ⊗ I_{7^{n-k-l} 4^k})) [(((⊗_{r=0}^{l-1} D_{j_r}) ⊗ I_{7^{n-k-l} 4^k}) S̃_a^{n,k} Ā) *_k (((⊗_{r=0}^{l-1} D_{j_r}) ⊗ I_{7^{n-k-l} 4^k}) S̃_b^{n,k} B̄)]
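The 1-level block recursive formulation of this section can be checked numerically. The following sketch (our own NumPy layout, with n = 2 so that blocks are 2 × 2 and are multiplied conventionally) applies the basic operators to block vectors:

```python
import numpy as np

# Sketch of C~ = (S_c (x) I)[(S_a (x) I)A~ *_{n-1} (S_b (x) I)B~] for a
# 4 x 4 matrix (illustrative only).
Sa = np.array([[1,0,0,1],[0,1,0,1],[1,0,0,0],[0,0,0,1],
               [1,0,1,0],[-1,1,0,0],[0,0,1,-1]], dtype=float)
Sb = np.array([[1,0,0,1],[1,0,0,0],[0,0,1,-1],[-1,1,0,0],
               [0,0,0,1],[1,0,1,0],[0,1,0,1]], dtype=float)
Sc = np.array([[1,0,0,1,-1,0,1],[0,1,0,1,0,0,0],
               [0,0,1,0,1,0,0],[1,-1,1,0,0,1,0]], dtype=float)

def blocks(X):
    """1-level block recursive representation: blocks in order X00,X10,X01,X11."""
    h = X.shape[0] // 2
    return np.array([X[:h, :h], X[h:, :h], X[:h, h:], X[h:, h:]])

A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0, 32.0).reshape(4, 4)
ta = np.einsum('ij,jkl->ikl', Sa, blocks(A))   # (S_a (x) I) A~ : 7 blocks
tb = np.einsum('ij,jkl->ikl', Sb, blocks(B))   # (S_b (x) I) B~ : 7 blocks
t = np.matmul(ta, tb)                          # pairwise block multiplication
cb = np.einsum('ij,jkl->ikl', Sc, t)           # (S_c (x) I) : 4 result blocks
C = np.block([[cb[0], cb[2]], [cb[1], cb[3]]])
assert np.allclose(C, A @ B)
```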
4 Code Generation for Vector Processors

4.1 Block Strassen's Algorithm

Matrix factorizations form the basis of translating tensor product formulas, by mapping the operations implied by the formula to program constructs in a high-level programming language. The translation process starts with the top-level abstraction and generates more refined code as it proceeds to lower-level abstractions. At each level, semantically equivalent program constructs are chosen to replace mathematical operations. Efficient programs can be synthesized from tensor product formulas by exploiting the regular computational structure expressed by such formulas. The tensor product formulation of block recursive algorithms usually involves certain basic computations, such as S_a, S_b, and S_c in the case of Strassen's algorithm. It is sometimes necessary to use manually optimized code for these basic computations to achieve high performance.

We now illustrate the code generation strategy with an example. Let B be an m × n matrix, and X be a vector of size np. Consider the application of (I_p ⊗ B) to X, i.e.,

  [ B           ] [ X[0 : n-1]        ]   [ BX[0 : n-1]        ]
  [   B         ] [ X[n : 2n-1]       ] = [ BX[n : 2n-1]       ]
  [     ...     ] [ ...               ]   [ ...                ]
  [           B ] [ X[(p-1)n : pn-1]  ]   [ BX[(p-1)n : pn-1]  ]

This can be interpreted as p copies of B acting in parallel on p disjoint segments of X, resulting in a vector of size mp. Hence, Y = (I_p ⊗ B)X can be implemented as:

  Code[Y = (I_p ⊗ B)X]
    doall i = 0, p−1
      Code[Y[im : (i+1)m − 1] = B X[in : (i+1)n − 1]]
    enddoall

Once an algorithm is expressed using the tensor product framework, an efficient implementation can be obtained by algebraically manipulating the tensor product formula. For example, consider the implementation of Y = (B ⊗ I_p)X, where B is a matrix and X and Y are vectors as described before. Using the commutation rule, it can be determined that

  (B ⊗ I_p) = L^{mp}_m (I_p ⊗ B) L^{np}_p

Hence, one implementation to compute Y might be to permute X according to L^{np}_p, perform (I_p ⊗ B), and permute the result according to L^{mp}_m.
A more efficient implementation would be to incorporate the stride permutations into the indexing of the input and output data arrays. The above can be written as:

  (L^{mp}_p Y) = (I_p ⊗ B)(L^{np}_p X)
i.e.,

  [ Y[0 : mp-1 : p]   ]   [ B           ] [ X[0 : np-1 : p]   ]
  [ Y[1 : mp-1 : p]   ] = [   B         ] [ X[1 : np-1 : p]   ]
  [ ...               ]   [     ...     ] [ ...               ]
  [ Y[p-1 : mp-1 : p] ]   [           B ] [ X[p-1 : np-1 : p] ]

Hence, the code can be written as

  Code[Y = (B ⊗ I_p)X]
    doall i = 0, p−1
      Code[Y[i : mp−1 : p] = B X[i : np−1 : p]]
    enddoall

Let us consider the code generation for the (n−k)-level block Strassen's algorithm for multiplying 2^n × 2^n matrices. Assume that the matrices are stored in (n−k)-level block recursive format, and that at the lowest level, pairwise multiplication between blocks of size 2^k × 2^k is performed. For simplification, we shall assume that no memory reduction is performed. The tensor product formulation of this algorithm is given by (Section 3.4):

  C̄ = S̃_c^{n,k} [S̃_a^{n,k} Ā *_k S̃_b^{n,k} B̄]

The formula for block Strassen's algorithm contains the operations S̃_a^{n,k}, S̃_b^{n,k}, *_k, and S̃_c^{n,k}. All the operations except *_k are linear operations and hence require one array operand. Operation *_k is a bilinear operation and requires two array operands. Each operation corresponds to an assignment statement that stores its result in an array that may be used as input data for a subsequent assignment or represents the final output. Temporary arrays representing working arrays are denoted by T_i. The above formula then translates to the following high-level code:

  T_0 = S̃_a^{n,k} Ā
  T_1 = S̃_b^{n,k} B̄
  T_0 = T_0 *_k T_1
  C̄ = S̃_c^{n,k} T_0

The assignment statements are composed sequentially to preserve the semantics of the computation. However, the above sequential composition is not unique. For example, the assignment statements for S̃_a^{n,k} Ā and S̃_b^{n,k} B̄ can be in any order, since there is no data dependency between them.

S̃_a^{n,k}, S̃_b^{n,k}, and S̃_c^{n,k} have the form

  ∏_{i=1}^{n} F_i,  where F_i = (I_{r_i} ⊗ OP ⊗ I_{s_i}) or (I_{r_i} ⊗ L^{m_i n_i}_{n_i} ⊗ I_{s_i})

and OP is a basic operator. The generic tensor product formula Y = (I_{r_i} ⊗ OP ⊗ I_{s_i})X can be implemented as a fully parallel doubly nested loop:
  doall i_1 = 0, r_i − 1
    doall i_2 = 0, s_i − 1
      Code[OP, Y, X, i_1, i_2]
    enddoall
  enddoall

Any tensor permutation that may be present results in a modification of the array indexing functions. Different implementations of the above formula are possible by changing the order and/or blocking of the inner loops, as they are fully permutable. However, different orderings of the inner loops result in different data access patterns. These in turn will have different performance characteristics on systems with hierarchical/interleaved memories.

Consider the application of the tensor product formula ∏_{i=1}^{n} F_i to a vector X. The product term corresponds to a sequential outer loop in which the output of the i-th stage is fed as input to the (i+1)-th stage, i = 1, ..., n−1. Only two arrays are required for this operation. The input array for the i-th step can be reused as the output array for the (i+1)-th step. At the end of each iteration, the arrays are swapped (which can be implemented trivially by swapping the pointers to the two arrays), and the resulting pseudocode is:

  T_0 = X
  do i = 1, n
    Code[T_1 = F_i T_0]
    Swap(T_1, T_0)
  enddo

At the end of the last iteration, T_0 contains the result of [∏_{i=1}^{n} F_i]X. S̃_a^{n,k}, S̃_b^{n,k}, and S̃_c^{n,k} have the above form, and code can easily be generated for them.

The pairwise multiplication *_k performs a sequence of 7^{n-k} matrix multiplications of 2^k × 2^k blocks. Let the input vectors be T_0 and T_1, corresponding to the evaluation of S̃_a^{n,k} Ā and S̃_b^{n,k} B̄ respectively. All elements of a given block are stored consecutively in the input arrays. Pseudocode for the operation T = T_0 *_k T_1 is presented below:

  doall i = 0, 7^{n-k} − 1
    T[i·4^k : (i+1)·4^k − 1] = MatrixMultiply(T_0[i·4^k : (i+1)·4^k − 1], T_1[i·4^k : (i+1)·4^k − 1])
  enddoall

where MatrixMultiply refers to conventional matrix multiplication between blocks of size 2^k × 2^k stored in column-major form.
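The translation rules above can be sketched as follows (our Python rendering of the pseudocode, not the generated Fortran): (I_p ⊗ B)X as independent segment products, (B ⊗ I_p)X as the same loop over stride-p sections so no stride permutation is materialized, and a product of stages evaluated with two working arrays swapped after every stage.

```python
import numpy as np

# Illustrative renderings of the code generation rules.
def apply_I_kron_B(B, X, p):
    m, n = B.shape
    Y = np.empty(m * p)
    for i in range(p):                       # doall: segments are independent
        Y[i * m:(i + 1) * m] = B @ X[i * n:(i + 1) * n]
    return Y

def apply_B_kron_I(B, X, p):
    m, n = B.shape
    Y = np.empty(m * p)
    for i in range(p):                       # doall over stride-p sections
        Y[i:m * p:p] = B @ X[i:n * p:p]
    return Y

def apply_stages(factors, X):
    """Evaluate [F_n ... F_2 F_1] X stage by stage with two buffers."""
    T0, T1 = X.copy(), np.empty_like(X)
    for F in factors:                        # sequential composition
        T1[:] = F @ T0
        T0, T1 = T1, T0                      # swap pointers, reuse storage
    return T0.copy()

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 4)); X = rng.standard_normal(4 * 5)
assert np.allclose(apply_I_kron_B(B, X, 5), np.kron(np.identity(5), B) @ X)
assert np.allclose(apply_B_kron_I(B, X, 5), np.kron(B, np.identity(5)) @ X)

F1, F2 = rng.standard_normal((6, 6)), rng.standard_normal((6, 6))
x = rng.standard_normal(6)
assert np.allclose(apply_stages([F1, F2], x), F2 @ F1 @ x)
```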
4.2 Memory Management for Depth-First Evaluation

Consider the tensor product formula:

  C̄ = Σ_{j_0, j_1, ..., j_{l-1} = 0}^{6} [((⊗_{r=0}^{l-1} S_c D_{j_r}) ⊗ I_{4^{n-l}}) (((⊗_{r=0}^{l-1} D_{j_r} S_a) ⊗ I_{4^{n-l}})Ā *_{n-l} ((⊗_{r=0}^{l-1} D_{j_r} S_b) ⊗ I_{4^{n-l}})B̄)]

The summation operator in the formulation of partial evaluation corresponds to a sequential loop nest, with the i-th loop performing a depth-first evaluation of the i-th level in the recursion tree. At each level, there are seven subtrees that need to be evaluated. Evaluation of each subtree is followed by an update of its parent. After the update, working storage used by that subtree can be reused for the computation of the next subtree at that level in the recursion tree. The loop structure hence looks like the following:

  do j_0 = 0, 6
    Code[T^0_a = D_{j_0} S_a Ā]                  /* partial evaluation */
    Code[T^0_b = D_{j_0} S_b B̄]                  /* partial evaluation */
    do j_1 = 0, 6
      Code[T^1_a = D_{j_1} S_a T^0_a]            /* partial evaluation */
      Code[T^1_b = D_{j_1} S_b T^0_b]            /* partial evaluation */
      ...
        do j_{l-1} = 0, 6
          Code[T^{l-1}_a = D_{j_{l-1}} S_a T^{l-2}_a]        /* partial evaluation */
          Code[T^{l-1}_b = D_{j_{l-1}} S_b T^{l-2}_b]        /* partial evaluation */
          Code[T^{l-1}_c = T^{l-1}_a *_{CS, n-l} T^{l-1}_b]  /* complete evaluation */
          Code[T^{l-2}_c = S_c D_{j_{l-1}} T^{l-1}_c]        /* update of parent */
        enddo
      ...
      Code[T^0_c = S_c D_{j_1} T^1_c]            /* update of parent */
    enddo
    Code[C̄ = S_c D_{j_0} T^0_c]                  /* update of parent */
  enddo

4.3 Implementation of Winograd's Variation

Strassen's algorithm uses 18 scalar additions and 7 scalar multiplications to multiply 2 × 2 matrices. Winograd presented a more efficient algorithm which uses 15 scalar additions and 7 scalar multiplications []. The
Winograd's variation is based on the following three matrix operations:

  W_a = [  1  0  0  0     W_b = [  1  0  0  0     W_c = [ 1  1  0  0  0  0  0
           0  0  1  0              0  1  0  0             1  0  0 -1  0  1  1
           1 -1  1 -1              0  0  0  1             1  0  1  0  1  1  0
           0  0  0  1              1 -1 -1  1             1  0  0  0  1  1  1 ]
           0  1  0  1             -1  0  1  0
          -1  1  0  1              1  0 -1  1
           1 -1  0  0 ]            0  0 -1  1 ]

The Winograd variation for multiplying 2 × 2 matrices can then be written as the matrix formula

  C̄ = W_c (W_a Ā * W_b B̄)

with Ā = (a_00, a_10, a_01, a_11)^T, B̄ = (b_00, b_10, b_01, b_11)^T, and C̄ = (c_00, c_10, c_01, c_11)^T.

The generated code for the operations W_a, W_b, and W_c contains some common terms. For example, a_00 − a_10 is evaluated twice in a direct implementation of W_a. The key to reducing the number of additions in Winograd's variation is to evaluate a common term only once. We factorize W_a, W_b, and W_c into products of sparse matrices to eliminate the common terms, so that each common term is computed exactly once.
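A sketch of the variation in scalar form (our own naming, using one standard ordering of Winograd's operations; the paper's W_a, W_b, W_c encode the same computation) can be checked against the 2 × 2 product:

```python
# Winograd's variation: 7 multiplications and 15 additions, with each
# common term computed once (illustrative only).
def winograd2x2(A, B):
    (a00, a01), (a10, a11) = A
    (b00, b01), (b10, b11) = B
    s1 = a10 + a11; s2 = s1 - a00; s3 = a00 - a10; s4 = a01 - s2
    t1 = b01 - b00; t2 = b11 - t1; t3 = b11 - b01; t4 = t2 - b10
    m0 = a00 * b00; m1 = a01 * b10; m2 = s4 * b11; m3 = a11 * t4
    m4 = s1 * t1;   m5 = s2 * t2;   m6 = s3 * t3
    u2 = m0 + m5; u3 = u2 + m6; u4 = u2 + m4
    return [[m0 + m1, u4 + m2], [u3 - m3, u3 + m4]]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
assert winograd2x2(A, B) == [[19.0, 22.0], [43.0, 50.0]]
```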
There are 15 rows containing two non-zero elements in the matrix factorizations of W_a, W_b, and W_c, which correspond to the 15 additions in Winograd's variation of Strassen's algorithm. The rows containing a single one are implemented as data movement, and those containing all zeros are equivalent to null operations. The indices of the input and output array elements of W_a are specified by the permutation operations in a tensor product formula and are computed similarly to those for S_a. Let p_i, 0 \le i < 4, be the indices of the input array elements, and q_i, 0 \le i < 7, be the indices of the output array elements. The computation T = W_a A on a vector A of length 4 is translated to the following sequence of assignments:

Code[T = W_a A]
   T[q_1] = A[p_0]
   T[q_2] = A[p_2]
   T[q_3] = A[p_0] - A[p_1]
   T[q_4] = A[p_1] + A[p_3]
   T[q_6] = A[p_3]
   T[q_0] = -T[q_1] + T[q_4]
   T[q_5] = -T[q_0] + T[q_2]

The implementation of Winograd's variation is simply a replacement of the translated code of W_a, W_b, and W_c for the code of S_a, S_b, and S_c in the corresponding implementation of Strassen's algorithm.

5 Performance Results on the Cray Y-MP

Performance statistics were gathered for different matrix sizes, different block sizes, and different levels of partial evaluation. Table 2 shows performance for execution on a single processor. All execution times are in seconds. The numbers in parentheses display performance in megaflops. Empty fields indicate that the program could not be run due to lack of sufficient memory. The matrix size is 2^n \times 2^n, the level of partial evaluation is l, and the block size at which conventional matrix multiplication is applied is 2^k \times 2^k. The execution times for the Block Strassen's algorithm are compared with the Cray Scientific Library routines SGEMM, which implements conventional matrix multiplication, and SGEMMS, which implements Strassen's matrix multiplication. Since SGEMM and SGEMMS are independent of l and k, the times for those are given only once for each value of n.
SGEMM is used for block matrix multiplication in the Block Strassen's algorithm. For any value of l, the lowest execution time occurs for k = 6 because the vector length on the Cray Y-MP is 64. The megaflops for larger block sizes are higher than those for k = 6 for the same values of n and l, but the execution times are longer since a larger number of arithmetic operations is performed. The execution times and megaflops for k = 6, l = 0 are comparable (slightly better) to those of SGEMMS. There is a performance degradation due to a slight increase in the number of memory operations as l increases for any fixed n and k = 6. However, the difference is quite small, as is evident from the execution times.
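For reference, the Block Strassen evaluation that l and k parameterize can be sketched in a few lines. This is a NumPy sketch of the depth-first recursion, not the generated Fortran; the function name and coefficient tables are ours. At each level, one triple of half-size temporaries is reused across the seven subtrees, which is what reduces working storage from O(7^n) to O(4^n).

```python
import numpy as np

# Coefficients of Strassen's seven operand combinations (Sa, Sb) and output
# combinations (Sc), over the blocks (X11, X12, X21, X22).
SA = [(1,0,0,1), (0,0,1,1), (1,0,0,0), (0,0,0,1), (1,1,0,0), (-1,0,1,0), (0,1,0,-1)]
SB = [(1,0,0,1), (1,0,0,0), (0,1,0,-1), (-1,0,1,0), (0,0,0,1), (1,1,0,0), (0,0,1,1)]
SC = [(1,0,0,1), (0,0,1,-1), (0,1,0,1), (1,0,1,0), (-1,1,0,0), (0,0,0,1), (1,0,0,0)]

def block_strassen(A, B, C, level, max_level):
    """Accumulate A @ B into C, recursing depth-first for max_level levels."""
    m = A.shape[0]
    if level == max_level or m == 1:
        C += A @ B                                   # complete evaluation
        return
    h = m // 2
    a = [A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]]
    b = [B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]]
    c = [C[:h, :h], C[:h, h:], C[h:, :h], C[h:, h:]]
    for j in range(7):                               # seven subtrees in turn
        Ta = sum(s * blk for s, blk in zip(SA[j], a) if s)   # partial evaluation
        Tb = sum(s * blk for s, blk in zip(SB[j], b) if s)
        Tc = np.zeros((h, h))
        block_strassen(Ta, Tb, Tc, level + 1, max_level)
        for i, s in enumerate(SC[j]):                # update of parent; the
            if s:                                    # temporaries are then free
                c[i] += s * Tc

A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0, 32.0).reshape(4, 4)
C = np.zeros((4, 4))
block_strassen(A, B, C, level=0, max_level=2)        # two Strassen levels
assert np.allclose(C, A @ B)
```

Setting max_level to n - k corresponds to applying conventional multiplication on 2^k x 2^k blocks.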
Table 3 gives the performance when the program was run on two processors. A fixed value of k = 6 was chosen since this resulted in the best performance in the single-processor case. Again, the performance when l = 0 is slightly better than that of SGEMMS. For larger values of l, the performance degrades by about 1%. Table 4 shows the performance results for 8 processors. Since the programs were run in non-dedicated mode on the Cray Y-MP, we were unable to get all 8 processors for the entire execution of the program. The numbers in parentheses give the percentage of 8 CPUs available for execution. The amount of extra memory required is given in Figure 5 for different values of n and l. It can easily be seen that there is an order of magnitude improvement even for small values of l. A value of k = 6 was chosen since it is for this block size that the execution times are minimum.

6 Conclusions

We have shown how tensor product formulas expressing Strassen's matrix multiplication algorithm can be translated to efficient parallel programs for shared memory multiprocessors. This translation process is part of a more general programming methodology for designing high-performance block recursive algorithms for shared and distributed memory machines. The methodology uses a mathematical notation based on tensor products for expressing block recursive algorithms. Algebraic manipulation of these formulas yields mathematically equivalent formulas which result in implementations with different performance characteristics. A large number of programs can be generated in order to search for efficient implementations. Tensor products give a powerful method to generate these equivalent implementations automatically. As was illustrated in this paper, programs generated from tensor product formulas compare favorably with the best hand-coded ones. This paper presents an implementation of Strassen's algorithm on a shared memory multiprocessor such as the Cray Y-MP.
In the Y-MP, memory is organized into banks, and in the absence of bank conflicts, all memory accesses take the same amount of time. However, in distributed memory multiprocessors such as the Cray T3D, where each processor has its own local memory, a local memory access can be significantly faster than a remote access. Hence, an efficient implementation on a distributed memory machine requires partitioning the algorithm in such a manner that remote accesses are minimized. Tensor product formulas can also be used to specify regular data distributions for arrays. Given a tensor product formula with a specified distribution of its input and output arrays, the inter-processor communication cost incurred by the implementation can be determined. If the cost of communication is high, it might be more efficient to perform a data redistribution before the computation, to bring the arrays into a form where the computation is local to the processors, provided the overhead of redistribution is lower than the resulting savings in communication cost. We are currently examining these issues and are working on an implementation on the Cray T3D.

Both formula modification and program generation are capable of being automated. We are currently implementing this methodology in an expert system EXTENT (Expert System for Tensor Formula Translation) that assists in the development of parallel programs for numerical algorithms on various computer architectures. Currently, the system generates Fortran programs for the Cray Y-MP. The expert system employs various heuristics to automatically generate alternative tensor product formulas, translate tensor product formulas to programs for various parallel architectures, test the produced programs, and analyze the test results.

References

[1] D.H. Bailey. Extra High Speed Matrix Multiplication on the Cray-2. SIAM Journal on Scientific and Statistical Computing, 9(3):603-607, 1988.

[2] D.H. Bailey, K. Lee, and H.D. Simon. Using Strassen's Algorithm to Accelerate the Solution of Linear Systems. Journal of Supercomputing, 4(4):357-371, Jan. 1991.

[3] A. Borodin and I. Munro. The Computational Complexity of Algebraic and Numeric Problems. American Elsevier Publishing Co., 1975.

[4] R.P. Brent. Algorithms for matrix multiplication. Technical Report CS 157, Computer Science Dept., Stanford University, Palo Alto, Calif., 1970.

[5] J.W. Brewer. Kronecker Products and Matrix Calculus in System Theory. IEEE Transactions on Circuits and Systems, 25(9):772-781, 1978.

[6] J. Granata, M. Conner, and R. Tolimieri. Recursive Fast Algorithms and the Role of Tensor Products. IEEE Transactions on Signal Processing, 40(12):2921-2930, Dec. 1992.

[7] F.A. Graybill. Matrices, with Applications in Statistics. Wadsworth International Group, Belmont, CA, 1983.

[8] H.V. Henderson and S.R. Searle. The Vec-Permutation Matrix, The Vec Operator and Kronecker Products: A Review. Linear and Multilinear Algebra, 9:271-288, 1981.

[9] N.J. Higham. Exploiting fast matrix multiplication within the level 3 BLAS. ACM Transactions on Mathematical Software, 16(4):352-368, Dec. 1990.

[10] C.-H. Huang, J.R. Johnson, and R.W. Johnson. A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm. Applied Mathematics Letters, 3(3):67-71, 1990.

[11] C.-H. Huang, J.R. Johnson, and R.W. Johnson. Generating Parallel Programs from Tensor Product Formulas: A Case Study of Strassen's Matrix Multiplication Algorithm. In International Conference on Parallel Processing, volume III, pages 104-108, 1991.
[12] J.R. Johnson, R.W. Johnson, D. Rodriguez, and R. Tolimieri. A Methodology for Designing, Modifying and Implementing Fourier Transform Algorithms on Various Architectures. Circuits, Systems and Signal Processing, 9(4):449-500, 1990.

[13] B. Kumar, C.-H. Huang, J.R. Johnson, R.W. Johnson, and P. Sadayappan. A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction. In Seventh International Parallel Processing Symposium, pages 582-588, Apr. 1993.

[14] C. Van Loan. Computational Frameworks for the Fast Fourier Transform. SIAM Publications, 1992.

[15] P.A. Regalia and S.K. Mitra. Kronecker Products, Unitary Matrices and Signal Processing Applications. SIAM Review, 31(4):586-613, Dec. 1989.

[16] G.X. Ritter and P.D. Gader. Image Algebra Techniques and Parallel Image Processing. Journal of Parallel and Distributed Computing, 4(1):7-44, 1987.

[17] V. Strassen. Gaussian Elimination is not Optimal. Numerische Mathematik, 13:354-356, 1969.
Figure 1: 1-level block recursive storage (4 blocks of size 2^{n-1}; at level l, 4^l blocks of size 2^{n-l})

Figure 2: Recursion tree of depth 1 for Strassen's algorithm

Figure 3: l-level block recursive Strassen's algorithm
Figure 4: l-level Block Recursive Strassen's Algorithm with Memory Reduction

Figure 5: Memory requirements for working arrays (n = 8, 9, 10)

Table 1: Comparison of operation counts for Strassen's algorithm and conventional matrix multiplication (2^n x 2^n matrices; conventional multiplication applied on 2^k x 2^k blocks, 0 \le k \le n)

   MM:         additions 8^n - 4^n;   multiplications 8^n;   total 2 \cdot 8^n - 4^n
   STR:        additions 6(7^n - 4^n);   multiplications 7^n;   total 7^{n+1} - 6 \cdot 4^n
   BLOCK STR:  additions 6 \cdot 4^k (7^{n-k} - 4^{n-k}) + 7^{n-k} (8^k - 4^k);   multiplications 7^{n-k} \cdot 8^k;   total 7^{n-k} (2 \cdot 8^k + 5 \cdot 4^k) - 6 \cdot 4^n
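The closed forms in Table 1 can be sanity-checked programmatically. The following small sketch (the function names are ours) verifies that Block Strassen degenerates to conventional multiplication at k = n and to full Strassen at k = 0:

```python
# Operation counts for 2^n x 2^n matrices, with conventional multiplication
# applied on 2^k x 2^k blocks. Each function returns (additions, multiplications).
def mm_ops(n):
    return (8**n - 4**n, 8**n)

def strassen_ops(n):
    return (6 * (7**n - 4**n), 7**n)

def block_strassen_ops(n, k):
    adds = 6 * 4**k * (7**(n - k) - 4**(n - k)) + 7**(n - k) * (8**k - 4**k)
    return (adds, 7**(n - k) * 8**k)

# Base case: a 2 x 2 product takes 18 additions and 7 multiplications
# with Strassen, versus 4 additions and 8 multiplications conventionally.
assert strassen_ops(1) == (18, 7)
assert mm_ops(1) == (4, 8)

# Block Strassen interpolates between the two extremes.
assert block_strassen_ops(10, 10) == mm_ops(10)
assert block_strassen_ops(10, 0) == strassen_ops(10)
```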
Table 2: Execution times for Block Strassen's algorithm with memory reduction on a single processor (n = 8, 9, 10; times in seconds, megaflops in parentheses; SGEMM and SGEMMS times given for comparison)

Table 3: Execution times for k = 6 on 2 processors

Table 4: Execution times for k = 6 on 8 processors. Percentages of 8-CPU availability obtained are given in parentheses.
More informationOn the construction of nested orthogonal arrays
isid/ms/2010/06 September 10, 2010 http://wwwisidacin/ statmath/eprints On the construction of nested orthogonal arrays Aloke Dey Indian Statistical Institute, Delhi Centre 7, SJSS Marg, New Delhi 110
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationAn Efficient Ranking Algorithm of t-ary Trees in Gray-code Order
The 9th Workshop on Combinatorial Mathematics and Computation Theory An Efficient Ranking Algorithm of t-ary Trees in Gray-code Order Ro Yu Wu Jou Ming Chang, An Hang Chen Chun Liang Liu Department of
More informationQuery Processing & Optimization
Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction
More informationTreewidth and graph minors
Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under
More informationFast algorithm for generating ascending compositions
manuscript No. (will be inserted by the editor) Fast algorithm for generating ascending compositions Mircea Merca Received: date / Accepted: date Abstract In this paper we give a fast algorithm to generate
More informationLecture 9 Basic Parallelization
Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning
More informationLecture 9 Basic Parallelization
Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning
More informationCS473-Algorithms I. Lecture 10. Dynamic Programming. Cevdet Aykanat - Bilkent University Computer Engineering Department
CS473-Algorithms I Lecture 1 Dynamic Programming 1 Introduction An algorithm design paradigm like divide-and-conquer Programming : A tabular method (not writing computer code) Divide-and-Conquer (DAC):
More information(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX
Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu
More informationOn Covering a Graph Optimally with Induced Subgraphs
On Covering a Graph Optimally with Induced Subgraphs Shripad Thite April 1, 006 Abstract We consider the problem of covering a graph with a given number of induced subgraphs so that the maximum number
More informationA Simplied NP-complete MAXSAT Problem. Abstract. It is shown that the MAX2SAT problem is NP-complete even if every variable
A Simplied NP-complete MAXSAT Problem Venkatesh Raman 1, B. Ravikumar 2 and S. Srinivasa Rao 1 1 The Institute of Mathematical Sciences, C. I. T. Campus, Chennai 600 113. India 2 Department of Computer
More informationAutomatic Array Alignment for. Mitsuru Ikei. Hitachi Chemical Company Ltd. Michael Wolfe. Oregon Graduate Institute of Science & Technology
Automatic Array Alignment for Distributed Memory Multicomputers Mitsuru Ikei Hitachi Chemical Company Ltd. Michael Wolfe Oregon Graduate Institute of Science & Technology P.O. Box 91000 Portland OR 97291
More informationLecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1
CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanfordedu) February 6, 2018 Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1 In the
More informationFFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES
FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES FRANCHETTI Franz, (AUT), KALTENBERGER Florian, (AUT), UEBERHUBER Christoph W. (AUT) Abstract. FFTs are the single most important algorithms in science and
More informationContents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...
Contents Contents...283 Introduction...283 Basic Steps in Query Processing...284 Introduction...285 Transformation of Relational Expressions...287 Equivalence Rules...289 Transformation Example: Pushing
More informationMapping Vector Codes to a Stream Processor (Imagine)
Mapping Vector Codes to a Stream Processor (Imagine) Mehdi Baradaran Tahoori and Paul Wang Lee {mtahoori,paulwlee}@stanford.edu Abstract: We examined some basic problems in mapping vector codes to stream
More informationDepartment of. Computer Science. Remapping Subpartitions of. Hyperspace Using Iterative. Genetic Search. Keith Mathias and Darrell Whitley
Department of Computer Science Remapping Subpartitions of Hyperspace Using Iterative Genetic Search Keith Mathias and Darrell Whitley Technical Report CS-4-11 January 7, 14 Colorado State University Remapping
More informationChapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition
Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Nicholas J. Higham Pythagoras Papadimitriou Abstract A new method is described for computing the singular value decomposition
More informationModel Checking I Binary Decision Diagrams
/42 Model Checking I Binary Decision Diagrams Edmund M. Clarke, Jr. School of Computer Science Carnegie Mellon University Pittsburgh, PA 523 2/42 Binary Decision Diagrams Ordered binary decision diagrams
More informationLocalization in Graphs. Richardson, TX Azriel Rosenfeld. Center for Automation Research. College Park, MD
CAR-TR-728 CS-TR-3326 UMIACS-TR-94-92 Samir Khuller Department of Computer Science Institute for Advanced Computer Studies University of Maryland College Park, MD 20742-3255 Localization in Graphs Azriel
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 5 Vector and Matrix Products Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael T. Heath Parallel
More informationDocument Image Restoration Using Binary Morphological Filters. Jisheng Liang, Robert M. Haralick. Seattle, Washington Ihsin T.
Document Image Restoration Using Binary Morphological Filters Jisheng Liang, Robert M. Haralick University of Washington, Department of Electrical Engineering Seattle, Washington 98195 Ihsin T. Phillips
More informationParallel Algorithms CSE /22/2015. Outline of this lecture: 1 Implementation of cilk for. 2. Parallel Matrix Multiplication
CSE 539 01/22/2015 Parallel Algorithms Lecture 3 Scribe: Angelina Lee Outline of this lecture: 1. Implementation of cilk for 2. Parallel Matrix Multiplication 1 Implementation of cilk for We mentioned
More information