
A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm

B. Kumar, C.-H. Huang, and P. Sadayappan
Department of Computer and Information Science
The Ohio State University

R.W. Johnson
Department of Computer Science
St. Cloud State University

Abstract

In this paper, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated to high-performance parallel/vector codes for various architectures. In this paper, we present a non-recursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared-memory multiprocessor. Performance results on a Cray Y-MP8 are presented.

Keywords: Strassen's matrix multiplication algorithm, Tensor product, Block recursive algorithm, Parallel architecture, Vector processors.

This work was supported in part by ARPA and monitored by NIST.

Contents

1 Introduction
2 An Overview of the Tensor Product Notation
3 A Tensor Product Formulation of Strassen's Algorithm
  3.1 Block Recursive Strassen's Algorithm: Breadth-First Evaluation
  3.2 Block Recursive Strassen's Algorithm: Depth-First Evaluation
  3.3 Combining Breadth-First and Depth-First Evaluations
  3.4 Matrix Storage in Main Memory
4 Code Generation for Vector Processors
  4.1 Block Strassen's Algorithm
  4.2 Memory Management for Depth-First Evaluation
  4.3 Implementation of Winograd's Variation
5 Performance Results on the Cray Y-MP
6 Conclusions

List of Figures

1 2-level block recursive storage
2 Recursion tree of depth 1 for Strassen's algorithm
3 l-level block recursive Strassen's algorithm
4 l-level block recursive Strassen's algorithm with memory reduction
5 Memory requirements for working arrays

List of Tables

1 Comparison of operation counts for Strassen's algorithm and conventional matrix multiplication
2 Execution times for Block Strassen's algorithm with memory reduction
3 Execution times for k = 6 on 2 processors
4 Execution times for k = 6 on 8 processors. Percentages of 8-cpu obtained are given in parentheses.

1 Introduction

Tensor products (Kronecker products) have been used to model algorithms with a recursive computational structure which occur in application areas such as digital signal processing [6, 15], image processing [16], linear system design [5], and statistics [7]. In recent years, a programming methodology based on tensor products has been successfully used to design and implement high-performance algorithms to compute fast Fourier transforms (FFT) [12, 14] and matrix multiplication [10, 13] for shared-memory vector multiprocessors. A set of multilinear algebra operations such as tensor product and matrix multiplication are used to express block recursive algorithms. These algebraic operations can be systematically translated into high-level programming language constructs such as sequential composition, iteration, and parallel/vector operations. Tensor product formulas representing an algorithm can be algebraically manipulated to restructure the computation to achieve different performance characteristics. In this way, the algorithm can be tuned to match the underlying architecture.

Matrix multiplication is an important core computation in many scientific applications. Conventional matrix multiplication of 2^n × 2^n matrices requires O(8^n) operations. In 1969, V. Strassen proposed an algorithm for matrix multiplication [17] that employs a computationally efficient method to compute the product of 2 × 2 matrices using only seven multiplications. A recursive application of this algorithm for multiplying 2^n × 2^n matrices requires only O(7^n) operations, compared to O(8^n) for conventional matrix multiplication. Efficient parallel implementations of this algorithm have been described in [1, 10]. This algorithm has been used for fast matrix multiplication in implementing the level 3 BLAS [9] and linear algebra routines [2].

In this paper, we describe the tensor product formulation of Strassen's matrix multiplication algorithm, and discuss program generation for shared memory vector processors such as the Cray Y-MP. Achieving high performance on these architectures requires operating on large vectors and reducing memory bank conflicts, while at the same time exploiting coarse grained parallelism. We show how the tensor product formula of Strassen's algorithm can be manipulated to operate on full vectors with unit stride. An important feature of the generated code is that it employs no recursion. The initial formulation presented in [10] required a working array of size O(7^n) for the multiplication of 2^n × 2^n matrices. We present a modified formulation that significantly reduces the size of the working array to O(4^n). This reduction is made possible through the reuse of working storage. We describe how this memory reuse can be captured in tensor product formulas with the use of a selection operator. We present a strategy for automatic code synthesis from tensor product formulas containing a selection operator. The modified formulation exhibits sufficient parallelism for efficient implementation on a vector-parallel machine such as the Cray Y-MP. In addition, we express Winograd's variation [4] using our notation and describe its translation to a program. Winograd's variation uses the same number of multiplications, but a smaller number of additions than the original Strassen's algorithm.

This paper is organized as follows. Section 2 contains an overview of the tensor product notation. A

formulation of Strassen's algorithm using this notation is presented in Section 3, along with a discussion of how the formulation can be modified to achieve a reduction in working storage. Section 4 presents a strategy for automatic code generation from a tensor product formula. Winograd's variation of Strassen's algorithm is also presented. Section 5 presents performance results on the Cray Y-MP. Conclusions are presented in Section 6.

2 An Overview of the Tensor Product Notation

In this section, we give a brief overview of the tensor product notation and the properties that are used in this paper. Let A ∈ R^{m×n} and B ∈ R^{p×q}. The tensor product A ⊗ B is the block matrix obtained by replacing each element a_{i,j} by the matrix a_{i,j}B, i.e.,

    A ⊗ B = [ a_{0,0}B    ...  a_{0,n-1}B
              ...              ...
              a_{m-1,0}B  ...  a_{m-1,n-1}B ]

Whenever all the involved matrix products are valid, the following properties hold:

Property 2.1 (Tensor Product)
1. A ⊗ B ⊗ C = A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C
2. (A ⊗ B)(C ⊗ D) = AC ⊗ BD
3. A ⊗ B = (A ⊗ I_n)(I_m ⊗ B) = (I_m ⊗ B)(A ⊗ I_n), for A of order m and B of order n
4. (A ⊗ B)^T = A^T ⊗ B^T
5. (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}
6. ⊗_{i=0}^{n-1} (A_i B_i) = (⊗_{i=0}^{n-1} A_i)(⊗_{i=0}^{n-1} B_i)
7. ∏_{i=0}^{n-1} (A_i ⊗ B_i) = (∏_{i=0}^{n-1} A_i) ⊗ (∏_{i=0}^{n-1} B_i)
8. I_{mn} = I_m ⊗ I_n

where I_n represents the n × n identity matrix, ∏_{i=0}^{n-1} A_i = A_{n-1} A_{n-2} ... A_0, and ⊗_{i=0}^{n-1} A_i = A_{n-1} ⊗ A_{n-2} ⊗ ... ⊗ A_0.

A matrix basis E^{m,n}_{i,j} is an m × n matrix with a one in the i-th row and the j-th column and zeros elsewhere. A vector basis e^m_i is a column vector of length m with a one in the i-th position and zeros elsewhere. If the basis E^{m,n}_{i,j} of an m × n matrix is stored by rows, it is isomorphic to the tensor product of two vector bases e^m_i ⊗ e^n_j. The tensor product of two vector bases e^m_i and e^n_j is equal to the vector basis e^{mn}_{in+j}, i.e., e^m_i ⊗ e^n_j = e^{mn}_{in+j}.
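As a concrete illustration (an addition to the original presentation, not part of it), the following NumPy sketch checks two of these properties numerically; numpy.kron plays the role of ⊗:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((2, 3)); B = rng.standard_normal((4, 5))
    C = rng.standard_normal((3, 2)); D = rng.standard_normal((5, 4))

    # Property 2: (A (x) B)(C (x) D) = AC (x) BD
    assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

    # Property 3 (square A of order m, B of order n):
    # A (x) B = (A (x) I_n)(I_m (x) B)
    m, n = 3, 4
    As = rng.standard_normal((m, m)); Bs = rng.standard_normal((n, n))
    assert np.allclose(np.kron(As, Bs),
                       np.kron(As, np.eye(n)) @ np.kron(np.eye(m), Bs))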

The tensor product of two vector bases e^m_i ⊗ e^n_j is called a tensor basis. If the basis elements are ordered lexicographically, then

    e^{n_1}_{i_1} ⊗ ... ⊗ e^{n_t}_{i_t} = e^{n_1 n_2 ... n_t}_{i_1 n_2 ... n_t + ... + i_{t-1} n_t + i_t}

Expressing a vector basis e^M_i as the tensor product of vector bases e^{m_1}_{i_1} ⊗ ... ⊗ e^{m_t}_{i_t}, where M = m_1 ... m_t and i_k = (i div M_k) mod m_k, with M_k = ∏_{i=k+1}^{t} m_i and M_t = 1, is called the factorization of the vector basis; e.g., the vector basis e^{12}_8 can be factorized into the tensor bases e^2_1 ⊗ e^3_1 ⊗ e^2_0 or e^4_2 ⊗ e^3_2. Expressing a tensor basis e^{n_1}_{i_1} ⊗ ... ⊗ e^{n_t}_{i_t} as a vector basis e^{n_1 ... n_t}_{i_1 n_2 ... n_t + ... + i_{t-1} n_t + i_t} is called linearization of the tensor basis. For example, the tensor basis e^4_2 ⊗ e^3_2 can be linearized to give the vector basis e^{12}_8.

One of the permutations used frequently in the representation of algorithms in tensor product formulas is the stride permutation. The stride permutation L^{mn}_n is defined as

    L^{mn}_n (e^m_i ⊗ e^n_j) = e^n_j ⊗ e^m_i

L^{mn}_n permutes the elements of a vector of size mn with stride distance n. This permutation can be represented as an mn × mn transformation. For example, L^6_2 can be represented by the matrix equation

    L^6_2 (x_0, x_1, x_2, x_3, x_4, x_5)^T = (x_0, x_2, x_4, x_1, x_3, x_5)^T

The stride permutation has the following properties:

Property 2.2 (Stride Permutation)
1. (L^{mn}_n)^{-1} = L^{mn}_m
2. L^{rst}_{st} = L^{rst}_s L^{rst}_t
3. L^{rst}_t = (L^{rt}_t ⊗ I_s)(I_r ⊗ L^{st}_t)

A permutation of the form I_m ⊗ L^{pq}_q ⊗ I_n is called a tensor permutation. The following theorem illustrates how a tensor product of two matrices can be commuted by applying a stride permutation.

Theorem 2.1 (Commutation Theorem) If A is an m × m matrix and B is an n × n matrix, then

    L^{mn}_n (A ⊗ B) = (B ⊗ A) L^{mn}_n
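The stride permutation and the commutation theorem can likewise be checked with a small sketch (our illustration, with L(m, n) standing for L^{mn}_n):

    import numpy as np

    def L(m, n):
        # Stride permutation L^{mn}_n: maps e^m_i (x) e^n_j to e^n_j (x) e^m_i,
        # i.e., entry (j*m + i, i*n + j) is 1.
        P = np.zeros((m * n, m * n))
        for i in range(m):
            for j in range(n):
                P[j * m + i, i * n + j] = 1
        return P

    # L^6_2 gathers a vector of size 6 at stride 2: (x0, x2, x4, x1, x3, x5)
    assert np.array_equal(L(3, 2) @ np.arange(6), [0, 2, 4, 1, 3, 5])

    # Commutation theorem: L^{mn}_n (A (x) B) = (B (x) A) L^{mn}_n
    rng = np.random.default_rng(0)
    m, n = 3, 4
    A = rng.standard_normal((m, m)); B = rng.standard_normal((n, n))
    assert np.allclose(L(m, n) @ np.kron(A, B), np.kron(B, A) @ L(m, n))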

Pairwise multiplication between two vectors, denoted ⊛, is the product between the corresponding elements of those vectors, e.g.,

    (x_0, x_1, ..., x_{n-1})^T ⊛ (y_0, y_1, ..., y_{n-1})^T = (x_0 y_0, x_1 y_1, ..., x_{n-1} y_{n-1})^T

If the elements x_i and y_i are themselves submatrices, then x_i y_i corresponds to matrix multiplication between them.

3 A Tensor Product Formulation of Strassen's Algorithm

Strassen's matrix multiplication algorithm is based on a computationally efficient way of multiplying 2 × 2 matrices using only seven multiplications [17]. Consider the matrix multiplication C = AB, where

    A = [ a_00 a_01      B = [ b_00 b_01      C = [ c_00 c_01
          a_10 a_11 ]          b_10 b_11 ]          c_10 c_11 ]

Strassen's algorithm can then be written as follows. First, the following intermediate values are calculated:

    t_0 = (a_00 + a_11)(b_00 + b_11)
    t_1 = (a_10 + a_11) b_00
    t_2 = a_00 (b_01 - b_11)
    t_3 = a_11 (-b_00 + b_10)
    t_4 = (a_00 + a_01) b_11
    t_5 = (-a_00 + a_10)(b_00 + b_01)
    t_6 = (a_01 - a_11)(b_10 + b_11)

Then the individual elements of C are given by

    c_00 = t_0 + t_3 - t_4 + t_6
    c_10 = t_1 + t_3
    c_01 = t_2 + t_4
    c_11 = t_0 - t_1 + t_2 + t_5

In matrix notation, this can be represented as:

    C̄ = S_c (S_a Ā ⊛ S_b B̄)

where

    S_a = [  1  0  0  1        S_b = [  1  0  0  1        S_c = [ 1  0  0  1 -1  0  1
             0  1  0  1                 1  0  0  0                0  1  0  1  0  0  0
             1  0  0  0                 0  0  1 -1                0  0  1  0  1  0  0
             0  0  0  1                -1  1  0  0                1 -1  1  0  0  1  0 ]
             1  0  1  0                 0  0  0  1
            -1  1  0  0                 1  0  1  0
             0  0  1 -1 ]               0  1  0  1 ]

and Ā, B̄ and C̄ are vectors of length 4, and represent the storage of matrices A, B, and C in column major form. The notation Ā corresponds exactly to the vec(A) notation [8]; however, we shall use the former for readability purposes. The matrices S_a, S_b and S_c are termed basic operators; they do not have to be explicitly generated, but specify which operations have to be performed on specific components of the input vectors.

The above formulation can be easily extended to matrices of size 2^n × 2^n by considering a_ij, b_ij, and c_ij to be blocks of size 2^{n-1} × 2^{n-1}. First, we describe the block recursive storage of matrices in memory. Let X be any 2^n × 2^n matrix. At the top level, X can be viewed as:

    X = [ X_00 X_01
          X_10 X_11 ]

A vector X̄ representing an r-level block recursive representation of X is recursively defined as:

    X̄ = [ X̄_00; X̄_10; X̄_01; X̄_11 ]

with the boundary condition that Ȳ is the column-major representation of any 2^{n-r} × 2^{n-r} block Y. An example of block recursive storage is given in Figure 1.

Let Ā, B̄, and C̄ be the 1-level block recursive representations of 2^n × 2^n matrices A, B, and C. Strassen's algorithm for computing C = AB can be written as:

    C̄ = (S_c ⊗ I_{4^{n-1}}) [ (S_a ⊗ I_{4^{n-1}}) Ā ⊛_{n-1} (S_b ⊗ I_{4^{n-1}}) B̄ ]

where ⊛_{n-1} denotes pairwise matrix multiplication between matrices of size 2^{n-1} × 2^{n-1}. We refer to the above as the 1-level block recursive Strassen's algorithm. In this case, the intermediate values t_i are 2^{n-1} × 2^{n-1} block matrices, and block matrix multiplications are performed using conventional matrix multiplication. This algorithm can be conveniently viewed in terms of a recursion tree (Figure 2), where the root node corresponds to the update of C, and the leaf nodes correspond to the evaluation of the intermediate values. The steps marked by * refer to computations that require working memory. Note that all the intermediate values can be computed in parallel, since there are no data dependences between them.
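The 2 × 2 formulation can be checked directly; the sketch below (ours, not the generated Fortran) builds S_a, S_b, and S_c as written and verifies C̄ = S_c(S_a Ā ⊛ S_b B̄) against an ordinary product, with the elementwise * playing the role of ⊛:

    import numpy as np

    Sa = np.array([[ 1, 0, 0, 1],
                   [ 0, 1, 0, 1],
                   [ 1, 0, 0, 0],
                   [ 0, 0, 0, 1],
                   [ 1, 0, 1, 0],
                   [-1, 1, 0, 0],
                   [ 0, 0, 1,-1]])
    Sb = np.array([[ 1, 0, 0, 1],
                   [ 1, 0, 0, 0],
                   [ 0, 0, 1,-1],
                   [-1, 1, 0, 0],
                   [ 0, 0, 0, 1],
                   [ 1, 0, 1, 0],
                   [ 0, 1, 0, 1]])
    Sc = np.array([[ 1, 0, 0, 1,-1, 0, 1],
                   [ 0, 1, 0, 1, 0, 0, 0],
                   [ 0, 0, 1, 0, 1, 0, 0],
                   [ 1,-1, 1, 0, 0, 1, 0]])

    rng = np.random.default_rng(0)
    A = rng.standard_normal((2, 2)); B = rng.standard_normal((2, 2))
    a = A.flatten(order='F'); b = B.flatten(order='F')  # column-major vectors
    c = Sc @ ((Sa @ a) * (Sb @ b))                      # seven pairwise products
    assert np.allclose(c.reshape((2, 2), order='F'), A @ B)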

Each intermediate value requires working memory of O(4^{n-1}). Hence, a 1-level block recursive Strassen's algorithm requires a total working storage of size O(7 · 4^{n-1}).

Even though the above formulation has been given for matrix sizes of the form 2^n × 2^n, it is straightforward to generalize the implementation to handle arbitrary dimensions of matrices A and B. A common technique is to pad the matrices with rows and columns of zeros to increase the matrix sizes to the next higher powers of two, compute the extended matrix product, and then extract the desired result [1]. Another approach [9] is to drop the last rows and columns from the computation to achieve even dimensions, and compute the partial matrix product. The complete matrix product is then obtained with a rank-k update (k = 1, 2, 3).

3.1 Block Recursive Strassen's Algorithm: Breadth-First Evaluation

In the 1-level application of Strassen's algorithm, 2^{n-1} × 2^{n-1} block multiplications were computed using conventional matrix multiplication. To get additional savings in the total number of arithmetic operations required to compute the matrix product, Strassen's algorithm can be recursively applied to compute the block multiplications also. Let Ā, B̄, and C̄ be the l-level block recursive representations of 2^n × 2^n matrices A, B, and C. The computation of C̄ is described by the following formulation [10]:

    C̄ = S_c^{n,l} [ S_a^{n,l}(Ā) ⊛_{n-l} S_b^{n,l}(B̄) ]

where

    S_a^{n,l} = (⊗_{i=0}^{l-1} S_a) ⊗ I_{4^{n-l}} = ∏_{i=0}^{l-1} (I_{7^i} ⊗ S_a ⊗ I_{4^{n-(i+1)}})
    S_b^{n,l} = (⊗_{i=0}^{l-1} S_b) ⊗ I_{4^{n-l}} = ∏_{i=0}^{l-1} (I_{7^i} ⊗ S_b ⊗ I_{4^{n-(i+1)}})
    S_c^{n,l} = (⊗_{i=0}^{l-1} S_c) ⊗ I_{4^{n-l}} = ∏_{i=l-1}^{0} (I_{7^i} ⊗ S_c ⊗ I_{4^{n-(i+1)}})

and ⊛_{n-l} denotes pairwise multiplication between blocks of size 2^{n-l} × 2^{n-l}. This computation can be interpreted as a breadth-first evaluation of the recursion tree shown in Figure 3. Each intermediate block matrix t_i is itself computed using Strassen's algorithm, yielding intermediate sub-blocks t_{i0}, ..., t_{i6}. This process is recursively applied until blocks of size 2^{n-l} × 2^{n-l} are reached, which are then computed using conventional matrix multiplication. Following our convention, * denotes computation that requires working storage. The working array requirement in this case is O(7^l 4^{n-l}). In the extreme case, Strassen's algorithm can be applied recursively down to blocks of size 2 × 2, and such an (n − 1)-level (or n-level) Strassen's algorithm would require working storage of size O(7^n).

Table 1 presents the total number of operations required for multiplying two matrices of size 2^n × 2^n. MM denotes conventional matrix multiplication, STR refers to an n-level block recursive Strassen's algorithm, and BLOCK STR denotes an (n − k)-level block recursive Strassen's algorithm. STR has a lower operation count than MM only for n ≥ 10. The expression for the operation count for BLOCK STR has a minimum at k = 3 for all integer values of n and k. For k = 3, BLOCK STR has a lower operation count than MM for n ≥ 4.
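These operation-count claims follow from the expressions in Table 1 and can be reproduced with a few lines (a sketch of the arithmetic only; no timing behavior is implied):

    # Total operation counts from Table 1; k is the block level at which
    # conventional multiplication (MM) takes over, and k = 0 is full Strassen (STR).
    def ops_mm(n):
        return 2 * 8**n - 4**n

    def ops_block_str(n, k):
        return 7**(n - k) * (2 * 8**k + 5 * 4**k) - 6 * 4**n

    # STR beats MM only from n = 10 on ...
    assert ops_block_str(9, 0) >= ops_mm(9) and ops_block_str(10, 0) < ops_mm(10)
    # ... the count is minimized at k = 3 ...
    assert min(range(8), key=lambda k: ops_block_str(11, k)) == 3
    # ... and with k = 3 it already beats MM for n >= 4.
    assert all(ops_block_str(n, 3) < ops_mm(n) for n in range(4, 12))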

Hence, block Strassen's algorithm is better than conventional matrix multiplication in terms of total operation count even for small values of n. However, for implementation on a shared memory vector machine such as the Cray Y-MP, a lower operation count does not imply smaller execution time, since the effect of vector length and stride also comes into play.

3.2 Block Recursive Strassen's Algorithm: Depth-First Evaluation

An l-level Strassen's algorithm requires fewer operations than conventional matrix multiplication when the number of levels l is increased. The optimal value is attained at n − l = 3. However, the working storage requirement for an l-level algorithm is O(7^l 4^{n-l}), and hence increases exponentially with an increase in l. This high storage requirement comes from the breadth-first expansion of the recursion tree, in which all the intermediate values have to be stored. To achieve a reduction in working storage, we can perform the computation of Strassen's algorithm using a depth-first expansion of the recursion tree. Instead of expanding all the leaves in the recursion tree, we only compute a subtree, and use the results obtained from that subtree to update C. This process is repeatedly applied until all the subtrees are evaluated. It is necessary to ensure that no redundant computation is performed. The memory requirement for the algorithm in this case will be the memory requirement for a single subtree, since the same space can be reused for the evaluation of different subtrees.

For the 2 × 2 case, the algorithm is modified as follows. t is a temporary variable that is used to store intermediate values.

    Step 1: t = (a_00 + a_11)(b_00 + b_11);  c_00 = t;          c_11 = t;
    Step 2: t = (a_10 + a_11) b_00;          c_10 = t;          c_11 = c_11 - t;
    Step 3: t = a_00 (b_01 - b_11);          c_01 = t;          c_11 = c_11 + t;
    Step 4: t = a_11 (-b_00 + b_10);         c_00 = c_00 + t;   c_10 = c_10 + t;
    Step 5: t = (a_00 + a_01) b_11;          c_00 = c_00 - t;   c_01 = c_01 + t;
    Step 6: t = (-a_00 + a_10)(b_00 + b_01); c_11 = c_11 + t;
    Step 7: t = (a_01 - a_11)(b_10 + b_11);  c_00 = c_00 + t;

Now the extra memory requirement is only one element, since the same memory location can be reused to evaluate the different t_i's. In the original formulation, seven memory locations are required since all the intermediate values are calculated before the update of C is performed. The total number of arithmetic operations is unchanged.
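A direct transcription of these seven steps (ours, for the scalar 2 × 2 case; blocks behave identically) shows that a single temporary suffices:

    import numpy as np

    def strassen_depth_first(A, B):
        # One temporary t; C is updated after each of the seven steps.
        C = np.zeros((2, 2))
        t = (A[0,0] + A[1,1]) * (B[0,0] + B[1,1]); C[0,0] += t; C[1,1] += t
        t = (A[1,0] + A[1,1]) * B[0,0];             C[1,0] += t; C[1,1] -= t
        t = A[0,0] * (B[0,1] - B[1,1]);             C[0,1] += t; C[1,1] += t
        t = A[1,1] * (-B[0,0] + B[1,0]);            C[0,0] += t; C[1,0] += t
        t = (A[0,0] + A[0,1]) * B[1,1];             C[0,0] -= t; C[0,1] += t
        t = (-A[0,0] + A[1,0]) * (B[0,0] + B[0,1]); C[1,1] += t
        t = (A[0,1] - A[1,1]) * (B[1,0] + B[1,1]);  C[0,0] += t
        return C

    rng = np.random.default_rng(0)
    A = rng.standard_normal((2, 2)); B = rng.standard_normal((2, 2))
    assert np.allclose(strassen_depth_first(A, B), A @ B)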

We now formulate the concept of memory reduction using the tensor product framework. Define D_j, 0 ≤ j ≤ 6, to be the 7 × 7 matrix with d_{jj} = 1 and zeros elsewhere. Note that Σ_{j=0}^{6} D_j = I_7. Memory reduction for the 2 × 2 case can be formulated in matrix notation as:

    C̄ = Σ_{j=0}^{6} (S_c D_j) [ (D_j S_a) Ā ⊛ (D_j S_b) B̄ ]

D_j is termed a selection operator and selects subsets of the input vector on which the computation is to be performed. This framework can be extended to multiplying matrices of size 2^n × 2^n. We begin with the 1-level Strassen's algorithm, and assume that the data matrices are stored in a 1-level block recursive form. The tensor product formula to compute C = AB can then be written as:

    C̄ = Σ_{j=0}^{6} (S_c D_j ⊗ I_{4^{n-1}}) [ (D_j S_a ⊗ I_{4^{n-1}}) Ā ⊛_{n-1} (D_j S_b ⊗ I_{4^{n-1}}) B̄ ]

We can apply memory reduction at multiple levels by performing the same operation recursively on the smaller blocks. Assuming that the matrices are stored in an l-level block recursive form, an l-level Strassen's algorithm with memory reduction can be formulated as:

    C̄ = Σ_{j_0, j_1, ..., j_{l-1} = 0}^{6} ((⊗_{r=0}^{l-1} S_c D_{j_r}) ⊗ I_{4^{n-l}}) [ ((⊗_{r=0}^{l-1} D_{j_r} S_a) ⊗ I_{4^{n-l}}) Ā ⊛_{n-l} ((⊗_{r=0}^{l-1} D_{j_r} S_b) ⊗ I_{4^{n-l}}) B̄ ]

We refer to the above formulation as the partial evaluation form of Strassen's algorithm. The computation specified in the above formulation can be described using the recursion tree shown in Figure 4. The current intermediate blocks being computed are represented by *. Working storage is required for the intermediate blocks from the leaf node being computed to the root of the recursion tree. Hence, the working storage required is O(Σ_{i=1}^{l} 4^{n-i}) = O(4^n).

3.3 Combining Breadth-First and Depth-First Evaluations

The ⊛ operator in the tensor product formula for partial computation refers to pairwise matrix multiplication. Each block matrix multiplication in the pairwise matrix multiplication can itself be performed using complete evaluation. Hence we have a 2-level hierarchy. At the highest level, partial evaluation is performed till blocks of size 2^{n-l} × 2^{n-l} are reached. Then complete evaluation is performed till blocks of size 2^k × 2^k are reached, after which conventional matrix multiplication is applied. This can be expressed in the tensor product notation as:

    C̄_PS = Σ_{j_0, ..., j_{l-1} = 0}^{6} ((⊗_{r=0}^{l-1} S_c D_{j_r}) ⊗ I_{4^{n-l}}) [ ((⊗_{r=0}^{l-1} D_{j_r} S_a) ⊗ I_{4^{n-l}}) Ā ⊛_{n-l}^{CS} ((⊗_{r=0}^{l-1} D_{j_r} S_b) ⊗ I_{4^{n-l}}) B̄ ]

    C̄'_CS = S̃_c^{n-l,k} ( S̃_a^{n-l,k} Ā' ⊛_k^{MM} S̃_b^{n-l,k} B̄' )

where ⊛_{n-l}^{CS} denotes pairwise matrix multiplication between blocks of size 2^{n-l} × 2^{n-l} using complete evaluation, C̄'_CS corresponds to each block pairwise multiplication during the partial evaluation, which itself is evaluated using an (n − l − k)-level Strassen's algorithm, and ⊛_k^{MM} denotes pairwise matrix multiplication between blocks of size 2^k × 2^k using conventional matrix multiplication.
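Returning to the 2 × 2 memory-reduction formula above, the selection operators can be exercised with the S_a, S_b, S_c arrays and random A, B from the earlier sketch; summing the seven partial contributions reproduces the one-shot formula:

    import numpy as np

    # D_j = diag(e_j), so that sum_j D_j = I_7.  Sa, Sb, Sc, A, B as before.
    D = [np.diag((np.arange(7) == j).astype(float)) for j in range(7)]
    assert np.allclose(sum(D), np.eye(7))

    a = A.flatten(order='F'); b = B.flatten(order='F')
    c = np.zeros(4)
    for j in range(7):
        # Only component j of the intermediate vector is live in this term, so
        # one scalar (one block, in the 2^n case) of working storage suffices.
        c += (Sc @ D[j]) @ (((D[j] @ Sa) @ a) * ((D[j] @ Sb) @ b))
    assert np.allclose(c.reshape((2, 2), order='F'), A @ B)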

The root of the recursion tree is defined to be at level 0. At level i, a working array of size O(4^{n-i}) is required to store the intermediate results of partial evaluation. The breadth-first expansion of the last (n − l − k) levels requires a working array of O(7^{n-l-k} 4^k). Hence, the total memory requirement is O(Σ_{i=1}^{l} 4^{n-i} + 7^{n-l-k} 4^k) = O(4^n + 7^{n-l-k} 4^k). Even for moderate values of n and small values of l, this represents a significant savings compared to O(7^{n-k} 4^k) for complete evaluation. If the matrices are of size N × N where N is not a power of two, the technique of padding can be used, and the memory requirement with reduction will be O(4^{⌈lg N⌉} + 7^{⌈lg N⌉-l-k} 4^k), compared to O(7^{⌈lg N⌉-k} 4^k) for complete evaluation.

3.4 Matrix Storage in Main Memory

The formulation presented in the previous sections assumes, for simplicity of presentation, that the data matrices are stored in a block recursive form. However, when implementing a block recursive algorithm on a shared memory machine, matrices are usually stored in a row major or column major form. We have implemented Strassen's algorithm in Fortran on the Cray Y-MP, hence the data matrices are stored in memory in column major form. The tensor product formula to convert a 2^n × 2^n matrix from column major form to block recursive form with 2^k × 2^k blocks is given by [11]:

    R^{n,k} = [ ∏_{i=0}^{n-k-1} (I_{2^{2i+1}} ⊗ L^{2^{n-k-i}}_2 ⊗ I_{2^{n+k-i-1}}) ] (I_{2^{n-k}} ⊗ L^{2^n}_{2^{n-k}} ⊗ I_{2^k})

R^{n,k} is termed a conversion operator. There are two ways in which storage conversion can be implemented. One way is to perform explicit conversion from row/column major form to a block recursive form through data movement. However, a more efficient way is to merge the conversion operator into the computation in Strassen's algorithm, which results in a modification of the data array indexing functions. The modified tensor formulation for block Strassen's algorithm is:

    C̄ = (R^{n,k})^{-1} S_c^{n,k} [ R^{n,k} S_a^{n,k} Ā ⊛_k R^{n,k} S_b^{n,k} B̄ ] = S̃_c^{n,k} [ S̃_a^{n,k} Ā ⊛_k S̃_b^{n,k} B̄ ]

where

    S̃_a^{n,k} = ∏_{i=0}^{n-k-1} [ (I_{7^i} ⊗ S_a ⊗ I_{4^{n-i-1}}) (I_{7^i} ⊗ I_2 ⊗ L^{2^{n-k-i}}_2 ⊗ I_{2^{n+k-i-1}}) ] (I_{2^{n-k}} ⊗ L^{2^n}_{2^{n-k}} ⊗ I_{2^k})
    S̃_b^{n,k} = ∏_{i=0}^{n-k-1} [ (I_{7^i} ⊗ S_b ⊗ I_{4^{n-i-1}}) (I_{7^i} ⊗ I_2 ⊗ L^{2^{n-k-i}}_2 ⊗ I_{2^{n+k-i-1}}) ] (I_{2^{n-k}} ⊗ L^{2^n}_{2^{n-k}} ⊗ I_{2^k})
    S̃_c^{n,k} = (I_{2^{n-k}} ⊗ L^{2^n}_{2^k} ⊗ I_{2^k}) ∏_{i=n-k-1}^{0} [ (I_{7^i} ⊗ I_2 ⊗ L^{2^{n-k-i}}_{2^{n-k-i-1}} ⊗ I_{2^{n+k-i-1}}) (I_{7^i} ⊗ S_c ⊗ I_{4^{n-i-1}}) ]

With l-level memory reduction, the above formulation is modified into:

    C̄ = Σ_{j_0, j_1, ..., j_{l-1} = 0}^{6} S̃_c^{n,k} ((⊗_{r=0}^{l-1} D_{j_r}) ⊗ I_{7^{n-k-l} 4^k}) [ ((⊗_{r=0}^{l-1} D_{j_r}) ⊗ I_{7^{n-k-l} 4^k}) S̃_a^{n,k} Ā ⊛_k ((⊗_{r=0}^{l-1} D_{j_r}) ⊗ I_{7^{n-k-l} 4^k}) S̃_b^{n,k} B̄ ]
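The effect of the conversion operator R^{n,k} can be realized by explicit data movement, as in the following sketch (our illustration; the generated code instead folds R^{n,k} into the array indexing):

    import numpy as np

    def to_block_recursive(X, k):
        # Convert a 2^m x 2^m matrix to block recursive form with 2^k x 2^k
        # blocks kept in column-major order; block order is X00, X10, X01, X11.
        if X.shape[0] == 2**k:
            return X.flatten(order='F')
        h = X.shape[0] // 2
        return np.concatenate([to_block_recursive(Xq, k) for Xq in
                               (X[:h, :h], X[h:, :h], X[:h, h:], X[h:, h:])])

    n, k = 3, 1
    X = np.arange(4**n).reshape((2**n, 2**n), order='F')
    xbar = to_block_recursive(X, k)   # equals R^{n,k} applied to vec(X)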

4 Code Generation for Vector Processors

4.1 Block Strassen's Algorithm

Matrix factorizations form the basis of translating tensor product formulas, by mapping the operations implied by the formula to program constructs in a high-level programming language. The translation process starts with the top level abstraction and generates more refined code as it proceeds to lower level abstractions. At each level, semantically equivalent program constructs are chosen to replace mathematical operations. Efficient programs can be synthesized from tensor product formulas by exploiting the regular computational structure expressed by such formulas. The tensor product formulation of block recursive algorithms usually involves certain basic computations, such as S_a, S_b, and S_c in the case of Strassen's algorithm. It is sometimes necessary to use manually optimized codes for these basic computations to achieve high performance.

We now illustrate the code generation strategy with an example. Let B be an m × n matrix, and X be a vector of size np. Consider the application of (I_p ⊗ B) to X, i.e.,

    (I_p ⊗ B) [ X[0 : n-1]; X[n : 2n-1]; ...; X[(p-1)n : pn-1] ] = [ BX[0 : n-1]; BX[n : 2n-1]; ...; BX[(p-1)n : pn-1] ]

This can be interpreted as p copies of B acting in parallel on p disjoint segments of X, resulting in a vector of size mp. Hence, Y = (I_p ⊗ B)X can be implemented as:

    Code[Y = (I_p ⊗ B)X]
      doall i = 0, p-1
        Code[Y[im : (i+1)m - 1] = B X[in : (i+1)n - 1]]
      enddoall

Once an algorithm is expressed using the tensor product framework, an efficient implementation can be obtained by algebraically manipulating the tensor product formula. For example, consider the implementation of Y = (B ⊗ I_p)X, where Y, B, and X are as described before. Using the commutation rule, it can be determined that

    (B ⊗ I_p) = L^{mp}_m (I_p ⊗ B) L^{np}_p

Hence, one implementation to compute Y might be to permute X according to L^{np}_p, perform (I_p ⊗ B), and permute the result according to L^{mp}_m. A more efficient implementation would be to incorporate the stride permutations into the indexing of the input and output data arrays. The above can be written as:

    (L^{mp}_p Y) = (I_p ⊗ B)(L^{np}_p X)

i.e.,

    (I_p ⊗ B) [ X[0 : np-1 : p]; X[1 : np-1 : p]; ...; X[p-1 : np-1 : p] ] = [ Y[0 : mp-1 : p]; Y[1 : mp-1 : p]; ...; Y[p-1 : mp-1 : p] ]

Hence, the code can be written as

    Code[Y = (B ⊗ I_p)X]
      doall i = 0, p-1
        Code[Y[i : i + (m-1)p : p] = B X[i : i + (n-1)p : p]]
      enddoall
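In NumPy, the two generated loops look as follows (a sketch added for illustration; the assertions check them against explicit Kronecker products):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, p = 3, 4, 5
    B = rng.standard_normal((m, n)); X = rng.standard_normal(n * p)

    # Y = (I_p (x) B) X: p independent copies of B on contiguous segments
    Y = np.empty(m * p)
    for i in range(p):
        Y[i*m:(i+1)*m] = B @ X[i*n:(i+1)*n]
    assert np.allclose(Y, np.kron(np.eye(p), B) @ X)

    # Y = (B (x) I_p) X: the same loop body on stride-p sections,
    # with the stride permutations absorbed into the indexing
    Y = np.empty(m * p)
    for i in range(p):
        Y[i::p] = B @ X[i::p]
    assert np.allclose(Y, np.kron(B, np.eye(p)) @ X)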

Let us consider the code generation for the (n − k)-level block Strassen's algorithm for multiplying 2^n × 2^n matrices. Assume that the matrices are stored in an (n − k)-level block recursive format, and that at the lowest level, pairwise multiplication between blocks of size 2^k × 2^k is performed. For simplicity, we shall assume that no memory reduction is performed. The tensor product formulation of this algorithm is given by (Section 3.4):

    C̄ = S̃_c^{n,k} [ S̃_a^{n,k} Ā ⊛_k S̃_b^{n,k} B̄ ]

The formula for block Strassen's algorithm contains the operations S̃_a^{n,k}, S̃_b^{n,k}, ⊛_k, and S̃_c^{n,k}. All the operations except ⊛_k are linear operations and hence require one array operand. Operation ⊛_k is a bilinear operation and requires two array operands. Each operation corresponds to an assignment statement that stores its result in an array that may be used as input data for the subsequent assignment or represents the final output. Temporary arrays representing working arrays are denoted by T_i. The above formula then translates to the following high-level code:

    T_0 = S̃_a^{n,k} Ā
    T_1 = S̃_b^{n,k} B̄
    T_0 = T_0 ⊛_k T_1
    C̄ = S̃_c^{n,k} T_0

The assignment statements are composed sequentially to preserve the semantics of the computation. However, the above sequential composition is not unique. For example, the assignment statements for S̃_a^{n,k} Ā and S̃_b^{n,k} B̄ can be in any order, since there is no data dependency between them.

S̃_a^{n,k}, S̃_b^{n,k}, and S̃_c^{n,k} have the form

    ∏_{i=1}^{n} F_i    where    F_i = (I_{r_i} ⊗ OP ⊗ I_{s_i})  or  F_i = (I_{r_i} ⊗ L^{m_i n_i}_{n_i} ⊗ I_{s_i})

and OP is a basic operator. The generic tensor product formula Y = (I_{r_i} ⊗ OP ⊗ I_{s_i})X can be implemented as a fully parallel doubly nested loop:

    doall i_1 = 0, r_i - 1
      doall i_2 = 0, s_i - 1
        Code[OP, Y, X, i_1, i_2]
      enddoall
    enddoall

Any tensor permutation that may be present results in a modification of the array indexing functions. Different implementations of the above formula are possible by changing the order and/or blocking the inner loops, as they are fully permutable. However, different orderings of the inner loops result in different data access patterns. These in turn will have different performance characteristics on systems with hierarchical/interleaved memories.

Consider the application of the tensor product formula ∏_{i=1}^{n} F_i to a vector X. The product term corresponds to a sequential outer loop in which the output of the i-th stage is fed as input to the (i+1)-th stage, i = 1, ..., n-1. Only two arrays are required for this operation. The input array for the i-th step can be reused as the output array for the (i+1)-th step. At the end of each iteration, the arrays are swapped (which can be implemented trivially by swapping the pointers to the two arrays), and the resulting pseudocode is:

    T_0 = X
    do i = 1, n
      Code[T_1 = F_i T_0]
      Swap(T_1, T_0)
    enddo

At the end of the last iteration, T_0 contains the result of [∏_{i=1}^{n} F_i]X. S̃_a^{n,k}, S̃_b^{n,k}, and S̃_c^{n,k} have the above form, and code can easily be generated for them.

The pairwise multiplication ⊛_k performs a sequence of 7^{n-k} matrix multiplications of 2^k × 2^k blocks. Let the input vectors be T_0 and T_1, corresponding to the evaluation of S̃_a^{n,k} Ā and S̃_b^{n,k} B̄ respectively. All elements of a given block are stored consecutively in the input arrays. Pseudocode for the operation T = T_0 ⊛_k T_1 is presented below:

    doall i = 0, 7^{n-k} - 1
      T[i 4^k : (i+1) 4^k - 1] = MatrixMultiply(T_0[i 4^k : (i+1) 4^k - 1], T_1[i 4^k : (i+1) 4^k - 1])
    enddoall

where MatrixMultiply refers to conventional matrix multiplication between blocks of size 2^k × 2^k stored in column major form.
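A sketch of this pairwise multiplication follows (hypothetical helper name; plain matrix multiplication stands in for the conventional MatrixMultiply routine):

    import numpy as np

    def pairwise_mm(T0, T1, nblocks, bs):
        # T = T0 (*)_k T1: nblocks independent products of bs x bs blocks,
        # each block stored consecutively in column-major order (a doall loop).
        T = np.empty(nblocks * bs * bs)
        for i in range(nblocks):
            s = slice(i * bs * bs, (i + 1) * bs * bs)
            Ai = T0[s].reshape((bs, bs), order='F')
            Bi = T1[s].reshape((bs, bs), order='F')
            T[s] = (Ai @ Bi).flatten(order='F')
        return T

    rng = np.random.default_rng(0)
    T0 = rng.standard_normal(7 * 4); T1 = rng.standard_normal(7 * 4)
    T = pairwise_mm(T0, T1, nblocks=7, bs=2)   # seven 2 x 2 block products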

4.2 Memory Management for Depth-First Evaluation

Consider the tensor product formula:

    C̄ = Σ_{j_0, j_1, ..., j_{l-1} = 0}^{6} ((⊗_{r=0}^{l-1} S_c D_{j_r}) ⊗ I_{4^{n-l}}) [ ((⊗_{r=0}^{l-1} D_{j_r} S_a) ⊗ I_{4^{n-l}}) Ā ⊛_{n-l} ((⊗_{r=0}^{l-1} D_{j_r} S_b) ⊗ I_{4^{n-l}}) B̄ ]

The summation operator in the formulation of partial evaluation corresponds to a sequential loop nest, with the i-th loop performing a depth-first evaluation of the i-th level in the recursion tree. At each level, there are seven subtrees that need to be evaluated. Evaluation of each subtree is followed by an update of its parent. After the update, working storage used by that subtree can be reused for the computation of the next subtree at that level in the recursion tree. The loop structure hence looks like the following:

    do j_0 = 0, 6
      Code[T_a^0 = (D_{j_0} S_a) Ā]                           /* partial evaluation */
      Code[T_b^0 = (D_{j_0} S_b) B̄]                           /* partial evaluation */
      do j_1 = 0, 6
        Code[T_a^1 = (D_{j_1} S_a) T_a^0]                     /* partial evaluation */
        Code[T_b^1 = (D_{j_1} S_b) T_b^0]                     /* partial evaluation */
        ...
          do j_{l-1} = 0, 6
            Code[T_a^{l-1} = (D_{j_{l-1}} S_a) T_a^{l-2}]     /* partial evaluation */
            Code[T_b^{l-1} = (D_{j_{l-1}} S_b) T_b^{l-2}]     /* partial evaluation */
            Code[T_c^{l-1} = T_a^{l-1} ⊛^{CS}_{n-l} T_b^{l-1}]   /* complete evaluation */
            Code[T_c^{l-2} = T_c^{l-2} + (S_c D_{j_{l-1}}) T_c^{l-1}]   /* update of parent */
          enddo
        ...
        Code[T_c^0 = T_c^0 + (S_c D_{j_1}) T_c^1]             /* update of parent */
      enddo
      Code[C̄ = C̄ + (S_c D_{j_0}) T_c^0]                      /* update of parent */
    enddo

4.3 Implementation of Winograd's Variation

Strassen's algorithm uses 18 scalar additions and 7 scalar multiplications to multiply 2 × 2 matrices. Winograd presented a more efficient algorithm which uses 15 scalar additions and 7 scalar multiplications [4].

Winograd's variation is based on the following three matrix operators:

    W_a = [ -1  1  0  1        W_b = [  1  0 -1  1        W_c = [ 0  1  1  0  0  0  0
             1  0  0  0                 1  0  0  0                1  1  0  1  0  0 -1
             0  0  1  0                 0  1  0  0                1  1  0  0  1  1  0
             1 -1  0  0                 0  0 -1  1                1  1  0  1  1  0  0 ]
             0  1  0  1                -1  0  1  0
             1 -1  1 -1                 0  0  0  1
             0  0  0  1 ]               1 -1 -1  1 ]

Winograd's variation for multiplying 2 × 2 matrices can be written as the matrix formula

    (c_00, c_10, c_01, c_11)^T = W_c ( W_a (a_00, a_10, a_01, a_11)^T ⊛ W_b (b_00, b_10, b_01, b_11)^T )

The generated code for the operations W_a, W_b, and W_c contains some common terms. For example, a_00 − a_10 is evaluated more than once in a direct implementation of W_a. The key to reducing the number of additions in Winograd's variation is to evaluate a common term only once. We therefore factorize W_a, W_b, and W_c into products of sparser matrices that eliminate the common terms: each common subexpression is computed by one row of an earlier factor and then reused by the later factors. There are 15 rows containing two non-zero elements in the matrix factorizations of W_a, W_b, and W_c, which correspond to the 15 additions in Winograd's variation of Strassen's algorithm. The rows containing a single one are implemented as data movement, and those containing all zeros are equivalent to null operations.
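The following sketch verifies this 7-multiplication, 15-addition variant numerically (using our reconstruction of W_a, W_b, W_c above, which is consistent with the assignment code for W_a given below; row order of the seven products is an assumption):

    import numpy as np

    Wa = np.array([[-1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0], [1, -1, 0, 0],
                   [0, 1, 0, 1], [1, -1, 1, -1], [0, 0, 0, 1]])
    Wb = np.array([[1, 0, -1, 1], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, -1, 1],
                   [-1, 0, 1, 0], [0, 0, 0, 1], [1, -1, -1, 1]])
    Wc = np.array([[0, 1, 1, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, -1],
                   [1, 1, 0, 0, 1, 1, 0], [1, 1, 0, 1, 1, 0, 0]])

    rng = np.random.default_rng(0)
    A = rng.standard_normal((2, 2)); B = rng.standard_normal((2, 2))
    a = A.flatten(order='F'); b = B.flatten(order='F')   # column-major vectors
    c = Wc @ ((Wa @ a) * (Wb @ b))                       # seven pairwise products
    assert np.allclose(c.reshape((2, 2), order='F'), A @ B)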

The indices of the input and output array elements of W_a are specified by the permutation operations in a tensor product formula and are computed similarly to those for S_a. Let p_i, 0 ≤ i < 4, be the indices of the input array elements, and q_i, 0 ≤ i < 7, be the indices of the output array elements. The computation T = W_a Ā on a vector Ā of length 4 is translated to the following sequence of assignments:

    Code[T = W_a Ā]
      T[q_1] = A[p_0]
      T[q_2] = A[p_2]
      T[q_3] = A[p_0] - A[p_1]
      T[q_4] = A[p_1] + A[p_3]
      T[q_5] = A[p_3]
      T[q_0] = -T[q_1] + T[q_4]
      T[q_6] = -T[q_0] + T[q_2]

The implementation of Winograd's variation is simply a replacement of the translated code of W_a, W_b, and W_c for the code of S_a, S_b, and S_c in the corresponding implementation of Strassen's algorithm.

5 Performance Results on the Cray Y-MP

Performance statistics were gathered for different matrix sizes, different block sizes, and different levels of partial evaluation. Table 2 shows performance for execution on a single processor. All execution times are in seconds. The numbers in parentheses display performance in megaflops. Empty fields indicate that the program could not be run due to lack of sufficient memory. The matrix size is 2^n × 2^n, the level of partial evaluation is l, and the block size at which conventional matrix multiplication is applied is 2^k × 2^k. The execution times for the block Strassen's algorithm are compared with the Cray Scientific Library routines SGEMM, which implements conventional matrix multiplication, and SGEMMS, which implements Strassen's matrix multiplication. Since SGEMM and SGEMMS are independent of l and k, the times for those are given only once for each value of n. SGEMM is used for block matrix multiplication in the block Strassen's algorithm.

For any value of l, the lowest execution time occurs for k = 6 because the vector length on the Cray Y-MP is 64. The megaflops for k = 7 are higher than those for k = 6 for the same value of n and l, but the execution time for k = 7 is longer, since a larger number of arithmetic operations are performed. The execution times and megaflops for k = 6, l = 0 are comparable to (slightly better than) those of SGEMMS. There is a performance degradation due to a slight increase in the number of memory operations as l increases for any fixed n and k = 6. However, the difference is quite small, as is evident from the execution times.

Table 3 gives the performance when the program was run on two processors. A fixed value of k = 6 was chosen, since this resulted in the best performance in the single processor case. Again, the performance when l = 0 is slightly better than that of SGEMMS. For larger values of l, the performance degrades by about 1%. Table 4 shows the performance results for 8 processors. Since the programs were run in a non-dedicated mode on the Cray Y-MP, we were unable to get all 8 processors for the entire execution of the program. The numbers in parentheses give the percentage of 8 CPUs available for execution.

The amount of extra memory required is given in Figure 5 for different values of n and l. It can be easily seen that there is an order of magnitude improvement even for small values of l. A value of k = 6 was chosen, since it is for this block size that the execution times are minimum.

6 Conclusions

We have shown how tensor product formulas expressing Strassen's matrix multiplication algorithm can be translated to efficient parallel programs for shared memory multiprocessors. This translation process is part of a more general programming methodology for designing high-performance block recursive algorithms for shared and distributed memory machines. The methodology uses a mathematical notation based on tensor products for expressing block recursive algorithms. Algebraic manipulation of these formulas yields mathematically equivalent formulas which result in implementations with different performance characteristics. A large number of programs can be generated in order to search for efficient implementations. Tensor products give a powerful method to generate these equivalent implementations automatically. As was illustrated in this paper, programs generated from tensor product formulas compare favorably with the best hand-coded ones.

This paper presents an implementation of Strassen's algorithm on a shared memory multiprocessor such as the Cray Y-MP. In the Y-MP, memory is organized into banks, and in the absence of bank conflicts, all memory accesses take the same amount of time. However, in distributed memory multiprocessors such as the Cray T3D, where each processor has its own local memory, a local memory access can be significantly faster than a remote access. Hence, an efficient implementation on a distributed memory machine requires partitioning the algorithm in such a manner that remote accesses are minimized. Tensor product formulas can also be used to specify regular data distributions for arrays. Given a tensor product formula with a specified distribution of its input and output arrays, the inter-processor communication cost incurred by the implementation can be determined. If the cost of communication is high, it might be more efficient to perform a data redistribution before the computation, to bring the arrays into a form where the computation is local to the processors, provided the overhead of data redistribution is lower than the benefit gained from the reduction in communication cost. We are currently examining these issues and are working on an implementation on the Cray T3D.

Both formula modification and program generation are capable of being automated. We are currently im-

plementing this methodology in an expert system EXTENT (Expert System for Tensor Formula Translation) that assists in the development of parallel programs for numerical algorithms on various computer architectures. Currently, the system generates Fortran programs for the Cray Y-MP. The expert system employs various heuristics to automatically generate alternative tensor product formulas, translate tensor product formulas to programs for various parallel architectures, test the produced programs, and analyze the test results.

References

[1] D.H. Bailey. Extra High Speed Matrix Multiplication on the Cray-2. SIAM Journal on Scientific and Statistical Computing, 9(3):603-607, 1988.

[2] D.H. Bailey, K. Lee, and H.D. Simon. Using Strassen's Algorithm to Accelerate the Solution of Linear Systems. Journal of Supercomputing, 4(4):357-371, Jan. 1991.

[3] A. Borodin and I. Munro. The Computational Complexity of Algebraic and Numeric Problems. American Elsevier Publishing Co., 1975.

[4] R.P. Brent. Algorithms for matrix multiplication. Technical Report CS 157, Computer Science Dept., Stanford University, Palo Alto, Calif., 1970.

[5] J.W. Brewer. Kronecker Products and Matrix Calculus in System Theory. IEEE Transactions on Circuits and Systems, 25(9):772-781, 1978.

[6] J. Granata, M. Conner, and R. Tolimieri. Recursive Fast Algorithms and the Role of Tensor Products. IEEE Transactions on Signal Processing, 40(12):2921-2930, Dec. 1992.

[7] F.A. Graybill. Matrices, with Applications in Statistics. Wadsworth International Group, Belmont, CA, 1983.

[8] H.V. Henderson and S.R. Searle. The Vec-Permutation Matrix, The Vec Operator and Kronecker Products: A Review. Linear and Multilinear Algebra, 9(4):271-288, 1981.

[9] N.J. Higham. Exploiting fast matrix multiplication within the level 3 BLAS. ACM Transactions on Mathematical Software, 16(4):352-368, Dec. 1990.

[10] C.-H. Huang, J.R. Johnson, and R.W. Johnson. A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm. Applied Mathematics Letters, 3(3):67-71, 1990.

[11] C.-H. Huang, J.R. Johnson, and R.W. Johnson. Generating Parallel Programs from Tensor Product Formulas: A Case Study of Strassen's Matrix Multiplication Algorithm. In International Conference on Parallel Processing, volume 3, pages 104-108, 1991.

[12] J.R. Johnson, R.W. Johnson, D. Rodriguez, and R. Tolimieri. A Methodology for Designing, Modifying and Implementing Fourier Transform Algorithms on Various Architectures. Circuits, Systems and Signal Processing, 9(4):449-500, 1990.

[13] B. Kumar, C.-H. Huang, J.R. Johnson, R.W. Johnson, and P. Sadayappan. A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction. In Seventh International Parallel Processing Symposium, pages 582-588, Apr. 1993.

[14] C. Van Loan. Computational Frameworks for the Fast Fourier Transform. SIAM Publications, 1992.

[15] P.A. Regalia and S.K. Mitra. Kronecker Products, Unitary Matrices and Signal Processing Applications. SIAM Review, 31(4):586-613, Dec. 1989.

[16] G.X. Ritter and P.D. Gader. Image Algebra Techniques for Parallel Image Processing. Journal of Parallel and Distributed Computing, 4(1):7-44, 1987.

[17] V. Strassen. Gaussian Elimination is not Optimal. Numerische Mathematik, 13:354-356, 1969.

[Figure 1: 2-level block recursive storage — blocks of size 2^{n-1} at the first level, blocks of size 2^{n-l} at level l.]

[Figure 2: Recursion tree of depth 1 for Strassen's algorithm — the root corresponds to the update of C from the intermediate values t_0, ..., t_6 at the leaves.]

[Figure 3: l-level block recursive Strassen's algorithm — each intermediate value t_i is expanded into sub-blocks t_{i0}, ..., t_{i6}, down to level l.]

[Figure 4: l-level Block Recursive Strassen's Algorithm with Memory Reduction.]

[Figure 5: Memory requirements for working arrays — memory words versus l, for n = 8, 9, 10.]

Table 1: Comparison of operation counts for Strassen's algorithm and conventional matrix multiplication

    Algorithm  | Additions                                         | Multiplications | Total
    MM         | 8^n - 4^n                                         | 8^n             | 2 * 8^n - 4^n
    STR        | 6 (7^n - 4^n)                                     | 7^n             | 7^{n+1} - 6 * 4^n
    BLOCK STR  | 6 * 4^k (7^{n-k} - 4^{n-k}) + 7^{n-k} (8^k - 4^k) | 8^k 7^{n-k}     | 7^{n-k} (2 * 8^k + 5 * 4^k) - 6 * 4^n

[Table 2: Execution times for Block Strassen's algorithm with memory reduction — times in seconds and megaflops for n = 8, 9, 10, levels l = 0 through 4, and several block sizes k on one processor, compared with SGEMM and SGEMMS.]

[Table 3: Execution times for k = 6 on 2 processors — SGEMMS versus Block Strassen for l = 0 through 4.]

[Table 4: Execution times for k = 6 on 8 processors. Percentages of 8-cpu obtained are given in parentheses.]


More information

Dieter Gollmann, Yongfei Han, and Chris J. Mitchell. August 25, Abstract

Dieter Gollmann, Yongfei Han, and Chris J. Mitchell. August 25, Abstract Redundant integer representations and fast exponentiation Dieter Gollmann, Yongfei Han, and Chris J. Mitchell August 25, 1995 Abstract In this paper two modications to the standard square and multiply

More information

Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented wh

Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented wh Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented which, for a large-dimensional exponential family G,

More information

10/24/ Rotations. 2. // s left subtree s right subtree 3. if // link s parent to elseif == else 11. // put x on s left

10/24/ Rotations. 2. // s left subtree s right subtree 3. if // link s parent to elseif == else 11. // put x on s left 13.2 Rotations MAT-72006 AA+DS, Fall 2013 24-Oct-13 368 LEFT-ROTATE(, ) 1. // set 2. // s left subtree s right subtree 3. if 4. 5. // link s parent to 6. if == 7. 8. elseif == 9. 10. else 11. // put x

More information

SCHOOL OF ENGINEERING & BUILT ENVIRONMENT. Mathematics. Numbers & Number Systems

SCHOOL OF ENGINEERING & BUILT ENVIRONMENT. Mathematics. Numbers & Number Systems SCHOOL OF ENGINEERING & BUILT ENVIRONMENT Mathematics Numbers & Number Systems Introduction Numbers and Their Properties Multiples and Factors The Division Algorithm Prime and Composite Numbers Prime Factors

More information

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers. CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 13: Query Processing Basic Steps in Query Processing Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

A High Performance Parallel Strassen Implementation. Brian Grayson. The University of Texas at Austin.

A High Performance Parallel Strassen Implementation. Brian Grayson. The University of Texas at Austin. A High Performance Parallel Strassen Implementation Brian Grayson Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 787 bgrayson@pineeceutexasedu Ajay Pankaj

More information

AN EFFICIENT IMPLEMENTATION OF NESTED LOOP CONTROL INSTRUCTIONS FOR FINE GRAIN PARALLELISM 1

AN EFFICIENT IMPLEMENTATION OF NESTED LOOP CONTROL INSTRUCTIONS FOR FINE GRAIN PARALLELISM 1 AN EFFICIENT IMPLEMENTATION OF NESTED LOOP CONTROL INSTRUCTIONS FOR FINE GRAIN PARALLELISM 1 Virgil Andronache Richard P. Simpson Nelson L. Passos Department of Computer Science Midwestern State University

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

However, m pq is just an approximation of M pq. As it was pointed out by Lin [2], more precise approximation can be obtained by exact integration of t

However, m pq is just an approximation of M pq. As it was pointed out by Lin [2], more precise approximation can be obtained by exact integration of t FAST CALCULATION OF GEOMETRIC MOMENTS OF BINARY IMAGES Jan Flusser Institute of Information Theory and Automation Academy of Sciences of the Czech Republic Pod vodarenskou vez 4, 82 08 Prague 8, Czech

More information

Twiddle Factor Transformation for Pipelined FFT Processing

Twiddle Factor Transformation for Pipelined FFT Processing Twiddle Factor Transformation for Pipelined FFT Processing In-Cheol Park, WonHee Son, and Ji-Hoon Kim School of EECS, Korea Advanced Institute of Science and Technology, Daejeon, Korea icpark@ee.kaist.ac.kr,

More information

IS BINARY ENCODING APPROPRIATE FOR THE PROBLEM-LANGUAGE RELATIONSHIP?

IS BINARY ENCODING APPROPRIATE FOR THE PROBLEM-LANGUAGE RELATIONSHIP? Theoretical Computer Science 19 (1982) 337-341 North-Holland Publishing Company NOTE IS BINARY ENCODING APPROPRIATE FOR THE PROBLEM-LANGUAGE RELATIONSHIP? Nimrod MEGIDDO Statistics Department, Tel Aviv

More information

17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer

17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer Module 2: Divide and Conquer Divide and Conquer Control Abstraction for Divide &Conquer 1 Recurrence equation for Divide and Conquer: If the size of problem p is n and the sizes of the k sub problems are

More information

Multicasting in the Hypercube, Chord and Binomial Graphs

Multicasting in the Hypercube, Chord and Binomial Graphs Multicasting in the Hypercube, Chord and Binomial Graphs Christopher C. Cipriano and Teofilo F. Gonzalez Department of Computer Science University of California, Santa Barbara, CA, 93106 E-mail: {ccc,teo}@cs.ucsb.edu

More information

Unit-5 Dynamic Programming 2016

Unit-5 Dynamic Programming 2016 5 Dynamic programming Overview, Applications - shortest path in graph, matrix multiplication, travelling salesman problem, Fibonacci Series. 20% 12 Origin: Richard Bellman, 1957 Programming referred to

More information

6. Algorithm Design Techniques

6. Algorithm Design Techniques 6. Algorithm Design Techniques 6. Algorithm Design Techniques 6.1 Greedy algorithms 6.2 Divide and conquer 6.3 Dynamic Programming 6.4 Randomized Algorithms 6.5 Backtracking Algorithms Malek Mouhoub, CS340

More information

2 The Service Provision Problem The formulation given here can also be found in Tomasgard et al. [6]. That paper also details the background of the mo

2 The Service Provision Problem The formulation given here can also be found in Tomasgard et al. [6]. That paper also details the background of the mo Two-Stage Service Provision by Branch and Bound Shane Dye Department ofmanagement University of Canterbury Christchurch, New Zealand s.dye@mang.canterbury.ac.nz Asgeir Tomasgard SINTEF, Trondheim, Norway

More information

under Timing Constraints David Filo David Ku Claudionor N. Coelho, Jr. Giovanni De Micheli

under Timing Constraints David Filo David Ku Claudionor N. Coelho, Jr. Giovanni De Micheli Interface Optimization for Concurrent Systems under Timing Constraints David Filo David Ku Claudionor N. Coelho, Jr. Giovanni De Micheli Abstract The scope of most high-level synthesis eorts to date has

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

Null space basis: mxz. zxz I

Null space basis: mxz. zxz I Loop Transformations Linear Locality Enhancement for ache performance can be improved by tiling and permutation Permutation of perfectly nested loop can be modeled as a matrix of the loop nest. dependence

More information

A Brief Description of the NMP ISA and Benchmarks

A Brief Description of the NMP ISA and Benchmarks Report No. UIUCDCS-R-2005-2633 UILU-ENG-2005-1823 A Brief Description of the NMP ISA and Benchmarks by Mingliang Wei, Marc Snir, Josep Torrellas, and R. Brett Tremaine February 2005 A Brief Description

More information

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17 Announcement CompSci 516 Database Systems Lecture 10 Query Evaluation and Join Algorithms Project proposal pdf due on sakai by 5 pm, tomorrow, Thursday 09/27 One per group by any member Instructor: Sudeepa

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

HEAPS ON HEAPS* Downloaded 02/04/13 to Redistribution subject to SIAM license or copyright; see

HEAPS ON HEAPS* Downloaded 02/04/13 to Redistribution subject to SIAM license or copyright; see SIAM J. COMPUT. Vol. 15, No. 4, November 1986 (C) 1986 Society for Industrial and Applied Mathematics OO6 HEAPS ON HEAPS* GASTON H. GONNET" AND J. IAN MUNRO," Abstract. As part of a study of the general

More information

Introduction to Algorithms

Introduction to Algorithms Lecture 1 Introduction to Algorithms 1.1 Overview The purpose of this lecture is to give a brief overview of the topic of Algorithms and the kind of thinking it involves: why we focus on the subjects that

More information

Transistors NODES. Time (sec) No. of Nodes/Transistors

Transistors NODES. Time (sec) No. of Nodes/Transistors Binary Tree Structure for Formal Verication of Combinational ICs FATMA A. EL-LICY, and HODA S. ABDEL-ATY-ZOHDY Microelectronics System Design Laboratory Department of Electrical and Systems Engineering

More information

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the Heap-on-Top Priority Queues Boris V. Cherkassky Central Economics and Mathematics Institute Krasikova St. 32 117418, Moscow, Russia cher@cemi.msk.su Andrew V. Goldberg NEC Research Institute 4 Independence

More information

Hypercubes. (Chapter Nine)

Hypercubes. (Chapter Nine) Hypercubes (Chapter Nine) Mesh Shortcomings: Due to its simplicity and regular structure, the mesh is attractive, both theoretically and practically. A problem with the mesh is that movement of data is

More information

time using O( n log n ) processors on the EREW PRAM. Thus, our algorithm improves on the previous results, either in time complexity or in the model o

time using O( n log n ) processors on the EREW PRAM. Thus, our algorithm improves on the previous results, either in time complexity or in the model o Reconstructing a Binary Tree from its Traversals in Doubly-Logarithmic CREW Time Stephan Olariu Michael Overstreet Department of Computer Science, Old Dominion University, Norfolk, VA 23529 Zhaofang Wen

More information

Null. Example A. Level 0

Null. Example A. Level 0 A Tree Projection Algorithm For Generation of Frequent Itemsets Ramesh C. Agarwal, Charu C. Aggarwal, V.V.V. Prasad IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 E-mail: f agarwal, charu,

More information

1 INTRODUCTION The LMS adaptive algorithm is the most popular algorithm for adaptive ltering because of its simplicity and robustness. However, its ma

1 INTRODUCTION The LMS adaptive algorithm is the most popular algorithm for adaptive ltering because of its simplicity and robustness. However, its ma MULTIPLE SUBSPACE ULV ALGORITHM AND LMS TRACKING S. HOSUR, A. H. TEWFIK, D. BOLEY University of Minnesota 200 Union St. S.E. Minneapolis, MN 55455 U.S.A fhosur@ee,tewk@ee,boley@csg.umn.edu ABSTRACT. The

More information

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join

More information

On the construction of nested orthogonal arrays

On the construction of nested orthogonal arrays isid/ms/2010/06 September 10, 2010 http://wwwisidacin/ statmath/eprints On the construction of nested orthogonal arrays Aloke Dey Indian Statistical Institute, Delhi Centre 7, SJSS Marg, New Delhi 110

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

An Efficient Ranking Algorithm of t-ary Trees in Gray-code Order

An Efficient Ranking Algorithm of t-ary Trees in Gray-code Order The 9th Workshop on Combinatorial Mathematics and Computation Theory An Efficient Ranking Algorithm of t-ary Trees in Gray-code Order Ro Yu Wu Jou Ming Chang, An Hang Chen Chun Liang Liu Department of

More information

Query Processing & Optimization

Query Processing & Optimization Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

Fast algorithm for generating ascending compositions

Fast algorithm for generating ascending compositions manuscript No. (will be inserted by the editor) Fast algorithm for generating ascending compositions Mircea Merca Received: date / Accepted: date Abstract In this paper we give a fast algorithm to generate

More information

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning

More information

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning

More information

CS473-Algorithms I. Lecture 10. Dynamic Programming. Cevdet Aykanat - Bilkent University Computer Engineering Department

CS473-Algorithms I. Lecture 10. Dynamic Programming. Cevdet Aykanat - Bilkent University Computer Engineering Department CS473-Algorithms I Lecture 1 Dynamic Programming 1 Introduction An algorithm design paradigm like divide-and-conquer Programming : A tabular method (not writing computer code) Divide-and-Conquer (DAC):

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

On Covering a Graph Optimally with Induced Subgraphs

On Covering a Graph Optimally with Induced Subgraphs On Covering a Graph Optimally with Induced Subgraphs Shripad Thite April 1, 006 Abstract We consider the problem of covering a graph with a given number of induced subgraphs so that the maximum number

More information

A Simplied NP-complete MAXSAT Problem. Abstract. It is shown that the MAX2SAT problem is NP-complete even if every variable

A Simplied NP-complete MAXSAT Problem. Abstract. It is shown that the MAX2SAT problem is NP-complete even if every variable A Simplied NP-complete MAXSAT Problem Venkatesh Raman 1, B. Ravikumar 2 and S. Srinivasa Rao 1 1 The Institute of Mathematical Sciences, C. I. T. Campus, Chennai 600 113. India 2 Department of Computer

More information

Automatic Array Alignment for. Mitsuru Ikei. Hitachi Chemical Company Ltd. Michael Wolfe. Oregon Graduate Institute of Science & Technology

Automatic Array Alignment for. Mitsuru Ikei. Hitachi Chemical Company Ltd. Michael Wolfe. Oregon Graduate Institute of Science & Technology Automatic Array Alignment for Distributed Memory Multicomputers Mitsuru Ikei Hitachi Chemical Company Ltd. Michael Wolfe Oregon Graduate Institute of Science & Technology P.O. Box 91000 Portland OR 97291

More information

Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1

Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanfordedu) February 6, 2018 Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1 In the

More information

FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES

FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES FRANCHETTI Franz, (AUT), KALTENBERGER Florian, (AUT), UEBERHUBER Christoph W. (AUT) Abstract. FFTs are the single most important algorithms in science and

More information

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions... Contents Contents...283 Introduction...283 Basic Steps in Query Processing...284 Introduction...285 Transformation of Relational Expressions...287 Equivalence Rules...289 Transformation Example: Pushing

More information

Mapping Vector Codes to a Stream Processor (Imagine)

Mapping Vector Codes to a Stream Processor (Imagine) Mapping Vector Codes to a Stream Processor (Imagine) Mehdi Baradaran Tahoori and Paul Wang Lee {mtahoori,paulwlee}@stanford.edu Abstract: We examined some basic problems in mapping vector codes to stream

More information

Department of. Computer Science. Remapping Subpartitions of. Hyperspace Using Iterative. Genetic Search. Keith Mathias and Darrell Whitley

Department of. Computer Science. Remapping Subpartitions of. Hyperspace Using Iterative. Genetic Search. Keith Mathias and Darrell Whitley Department of Computer Science Remapping Subpartitions of Hyperspace Using Iterative Genetic Search Keith Mathias and Darrell Whitley Technical Report CS-4-11 January 7, 14 Colorado State University Remapping

More information

Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition

Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Nicholas J. Higham Pythagoras Papadimitriou Abstract A new method is described for computing the singular value decomposition

More information

Model Checking I Binary Decision Diagrams

Model Checking I Binary Decision Diagrams /42 Model Checking I Binary Decision Diagrams Edmund M. Clarke, Jr. School of Computer Science Carnegie Mellon University Pittsburgh, PA 523 2/42 Binary Decision Diagrams Ordered binary decision diagrams

More information

Localization in Graphs. Richardson, TX Azriel Rosenfeld. Center for Automation Research. College Park, MD

Localization in Graphs. Richardson, TX Azriel Rosenfeld. Center for Automation Research. College Park, MD CAR-TR-728 CS-TR-3326 UMIACS-TR-94-92 Samir Khuller Department of Computer Science Institute for Advanced Computer Studies University of Maryland College Park, MD 20742-3255 Localization in Graphs Azriel

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 5 Vector and Matrix Products Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael T. Heath Parallel

More information

Document Image Restoration Using Binary Morphological Filters. Jisheng Liang, Robert M. Haralick. Seattle, Washington Ihsin T.

Document Image Restoration Using Binary Morphological Filters. Jisheng Liang, Robert M. Haralick. Seattle, Washington Ihsin T. Document Image Restoration Using Binary Morphological Filters Jisheng Liang, Robert M. Haralick University of Washington, Department of Electrical Engineering Seattle, Washington 98195 Ihsin T. Phillips

More information

Parallel Algorithms CSE /22/2015. Outline of this lecture: 1 Implementation of cilk for. 2. Parallel Matrix Multiplication

Parallel Algorithms CSE /22/2015. Outline of this lecture: 1 Implementation of cilk for. 2. Parallel Matrix Multiplication CSE 539 01/22/2015 Parallel Algorithms Lecture 3 Scribe: Angelina Lee Outline of this lecture: 1. Implementation of cilk for 2. Parallel Matrix Multiplication 1 Implementation of cilk for We mentioned

More information