IMPLEMENTATION OF AN OPTIMAL MATRIX-CHAIN PRODUCT EXPLOITING MATRIX SPECIALIZATION


Swetha Kukkala and Andrew A. Anda
Computer Science Department
St. Cloud State University
St. Cloud, MN 56301
kusw0601@stcloudstate.edu
aanda@stcloudstate.edu

Abstract

We discuss our implementation and development of a matrix-chain product evaluation facility in C++. The matrices in the product chain can be any of a set of predefined types. Our implementation consists of several basic components. The first component generates the essential matrix-chain operand characteristics. The second receives those characteristics and then computes an optimal associative ordering using a memoized dynamic programming algorithm. The product can then be performed in that optimal associative ordering. In our implementation, we extend the algorithm to incorporate the scalar multiplication counts for products of various combinations of different matrix classes.

1 Introduction

A matrix product exists only if the pair of matrix operands is conformal, i.e., the number of columns of the left operand matrix must equal the number of rows of the right operand matrix; e.g., for the matrix product A_{i×j} B_{k×l}, we require j = k. The product of a sequence of conformal matrices is a matrix-chain, represented as:

    ∏_{i=1}^{n} A_{k_{i-1} × k_i}

Matrix multiplication is associative. For example, the product of the conformal matrices A, B, and C yields the same value regardless of ordering, i.e., (A B) C = A (B C). However, when the product of a sequence of conformal matrices of differing sizes is performed, the number of scalar multiplication operations required can vary depending on the associative ordering; the operation count is a direct function of the associative ordering of that product. For any matrix-chain, there will be one or more orderings that require the fewest scalar multiplication operations. We term those associative orderings requiring the fewest scalar multiplication operations optimal orderings. The size of the search space as a function of the number of matrices exhibits (exponential) combinatorial growth proportional to the growth of the sequence of Catalan numbers, Ω(4^n / n^{3/2}) [1]. A greedy approach to finding an optimal ordering may fail with this type of problem, but an optimal solution will have the property of optimal substructure, wherein the optimal solution is composed of optimal solutions to subproblems [2, p. 327]. So we employ dynamic programming via memoization (maintaining a table of subproblem solutions) [2, p. 347] to compute an optimal ordering in O(n^3) time and O(n^2) space. The algorithm we used is not novel. In fact, the matrix-chain optimality problem often motivates the pedagogical development and justification of the dynamic programming technique in standard algorithm analysis textbooks [2, p. 331; 9, p. 105; 8, p. 320; 7, p. 457].
Current examples of the matrix-chain product algorithm consider only operands which are general matrices of the same precision [2]. In practice, however, we can also consider specialized matrix types, because matrices can be sub-typed into a myriad of conventional categories, many of which have either differing operation counts for matrix products of the same complexity, or differing complexities for the product of the respective matrix operands [1]. Therefore, we provide a facility by which a class returns an optimal ordering of the matrix-chain product, optionally with the associated operation count. Our objective is to provide an application programmer with classes which perform an optimal matrix-chain product using a minimum number of scalar multiplication operations. At this early stage, we are providing a proof of concept of the utility of

exploiting matrix specialization. The matrix classes that we implemented augment those in the Template Numerical Toolkit (TNT) library. The TNT is an open source collection of interfaces and reference implementations of vector- and matrix-based numerical classes useful for scientific computing in C++ [4]. Our facility evaluates and extracts the matrix-chain operands coded by the programmer, and returns an associative ordering of the matrices along with its scalar multiplication operation count (the measure we are minimizing).

1.1 Computation of scalar multiplication operation count

The product C_{i×j} = A_{i×k} B_{k×j}, where A and B are general matrices, results in a general matrix, C, and requires the following number of scalar multiplication operations:

    i · j · k                                                    (1)

The general matrix-matrix product is an example of a level 3 BLAS (Basic Linear Algebra Subprograms) [10] operation. Changing the matrix type can change the scalar operation count. For example, we can pre- or post-multiply a dense matrix by a diagonal matrix. A diagonal matrix has non-zero entries only on the main diagonal. Since the off-diagonal elements are all zero, there is no need to perform the scalar multiplications and additions involving those zero entries; therefore, those operations will not be counted. Consider the following dense-diagonal matrix product:

    | 1 2 3 |   | 1 0 0 |   | 1·1  2·1  3·1 |   | 1 2 3 |
    | 4 5 6 | × | 0 1 0 | = | 4·1  5·1  6·1 | = | 4 5 6 |       (2)
    | 7 8 9 |   | 0 0 1 |   | 7·1  8·1  9·1 |   | 7 8 9 |

We see in Equation (2) that the additions and multiplications with zero elements can be ignored, so the total number of scalar operations is 9, one third of the 27 operations required by Equation (1). Therefore, multiplying an i×j dense matrix by a j×j diagonal matrix results in an i×j dense matrix, and the number of scalar multiplication operations is:

    i · j                                                        (3)

A triangular matrix is a square matrix where the entries either below or above the main diagonal are zero.
Consider the following lower triangular-dense matrix product:

    | 1 0 0 |   | 1 2 3 |   | 1·1          1·2          1·3          |
    | 2 3 0 | × | 4 5 6 | = | 2·1+3·4      2·2+3·5      2·3+3·6      |   (4)
    | 4 5 6 |   | 7 8 9 |   | 4·1+5·4+6·7  4·2+5·5+6·8  4·3+5·6+6·9  |
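The cost rules above can be collected into a few C++ helper functions; this is our own rendering of Equations (1), (3), and (5), and the function names are ours, not part of the implementation described here:

```cpp
// Scalar-multiplication counts by operand type; names are ours.

// Equation (1): general (dense, i x k) times general (k x j).
long generalCost(long i, long k, long j) { return i * k * j; }

// Equation (3): dense (i x j) times diagonal (j x j) -- one
// scalar multiplication per entry of the i x j result.
long diagonalCost(long i, long j) { return i * j; }

// Equation (5): dense (i x j) times lower-triangular (j x j) --
// column c of the result needs j - c + 1 multiplications per row,
// i.e. i * (1 + 2 + ... + j) in total.
long triangularCost(long i, long j) { return i * j * (j + 1) / 2; }
```

With i = j = k = 3 these give 27, 9, and 18 scalar multiplications respectively, matching Equations (2) and (4).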

We see in Equation (4) that the total number of scalar operations is 18, which is less than the 27 operations required by Equation (1). Therefore, multiplying an i×j dense matrix by a j×j lower triangular matrix results in an i×j dense matrix, and the number of scalar multiplication operations is:

    i · ∑_{a=1}^{j} a                                            (5)

Our intention is to support the set of matrix types supported by the dense BLAS. The dense BLAS has become a de facto computation kernel of many libraries that perform matrix computations. We have not yet resolved how to design non-square patterned matrices.

1.2 Object Oriented Design

C++ supports object-oriented programming, which focuses on encapsulation, polymorphism, modularity, and inheritance. Our matrix classes are based on those in the TNT, following the principles of object-oriented design. The specialized matrix classes that we developed are inherited from a basic matrix class in the TNT. We reused and extended the existing base matrix class via virtual functions, obtaining the hierarchical relationships between the classes shown in Figure 1.

    Matrix
    ├── general
    └── patterned
        ├── banded
        ├── triangular
        │   ├── lower
        │   └── upper
        └── diagonal

Figure 1: Specialization of matrix classes inherited from Matrix [3]
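A minimal sketch of the Figure 1 hierarchy might look as follows; the class and member names here are ours for illustration, not TNT's, and the real implementation derives from TNT's matrix classes:

```cpp
#include <string>

// Minimal sketch of the Figure 1 hierarchy (names are ours).
struct Matrix {
    long rows, cols;
    Matrix(long r, long c) : rows(r), cols(c) {}
    virtual ~Matrix() = default;
    // Each specialization reports its own type tag, which the
    // chain-ordering code can use to pick a cost formula.
    virtual std::string type() const { return "general"; }
};

struct Patterned : Matrix {
    using Matrix::Matrix;
};

struct Diagonal : Patterned {
    explicit Diagonal(long n) : Patterned(n, n) {}
    std::string type() const override { return "diagonal"; }
};

struct Triangular : Patterned {
    explicit Triangular(long n) : Patterned(n, n) {}
};

struct LowerTriangular : Triangular {
    using Triangular::Triangular;
    std::string type() const override { return "lower"; }
};
```

Virtual dispatch on `type()` is one way the ordering routine can select the appropriate operation-count formula at run time without knowing the concrete class of each operand.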

1.3 Modified dynamic programming algorithm for matrix-chain product

The following dynamic programming algorithm demonstrates how we handle different matrix types when computing the optimal associative ordering. A vector of pointers to matrices is passed to this routine. The matrices can be any of a set of predefined types.

    Matrix-Chain-Order(vp)
        n ← length[vp]
        p ← getdimensions(vp)
        op ← 0
        tempop ← 0
        for i ← 1 to n do
            m(f1(i,i)) ← 0
            temps(f1(i,i)) ← 0
        for l ← 2 to n do
            for i ← 1 to n − l + 1 do
                j ← i + l − 1
                m(f1(i,j)) ← ∞
                for k ← i to j − 1 do
                    q ← f2(m(f1(i,k))) + f2(m(f1(k+1,j))) + f3(p[i−1], p[j], p[k])
                    if q < m(f1(i,j)) then
                        m(f1(i,j)) ← q
                        s(f1(i,j)) ← k
                        op ← q
                    end if
                    temps(f1(i,j)) ← k
                    tempop ← q
        return m, s, temps, op, tempop

The complete internal details are not provided here in the pseudocode. The values of m and s are stored in a packed storage format. Packed storage is a space-efficient way to store elements in a one-dimensional array by eliminating zero elements [5]. Function f1 calculates the index at which to store the non-zero elements in the packed array. Function f2 identifies the type of a matrix. Function f3 calculates the number of scalar operations using Equation (1), (3), or (5), as appropriate.

1.4 Results

Testing the algorithm with different combinations of general, dense-diagonal, and dense-triangular matrices, we produced the following results, which demonstrate that the optimal associative matrix-chain product ordering can change for fixed matrix dimensions when matrix specialization is exploited. The result using specialization is compared to the result using no specialization. Consider the sequence of four matrices A_G(1×3), B_D(3×4), C_G(4×3), D_D(3×3).
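The dynamic program of Section 1.3 can be sketched as runnable C++. This is our own simplified rendering, not the paper's implementation: it uses a full two-dimensional table rather than packed storage, distinguishes only the general and diagonal types of Equations (1) and (3), and applies the diagonal rule only when the right factor is a single matrix.

```cpp
#include <climits>
#include <vector>

// Simplified type-aware matrix-chain ordering (our sketch).
enum MatType { GENERAL, DIAGONAL };

struct Chain {
    std::vector<long> p;        // dimensions: matrix i is p[i-1] x p[i]
    std::vector<MatType> type;  // type of matrix i (1-based -> type[i-1])
};

// Cost of an (i x k) by (k x j) product when the right operand is
// diagonal (Eq. 3) or general (Eq. 1).  A fuller implementation
// would also track the type of each partial product.
static long cost(long i, long k, long j, MatType right) {
    return right == DIAGONAL ? i * j : i * k * j;
}

// Returns the minimum scalar-multiplication count for the chain.
long chainOrder(const Chain& c) {
    int n = (int)c.type.size();
    std::vector<std::vector<long>> m(n + 1, std::vector<long>(n + 1, 0));
    for (int l = 2; l <= n; ++l) {
        for (int i = 1; i <= n - l + 1; ++i) {
            int j = i + l - 1;
            m[i][j] = LONG_MAX;
            for (int k = i; k < j; ++k) {
                // The right factor is matrix j alone only when k == j-1.
                MatType rt = (k == j - 1) ? c.type[j - 1] : GENERAL;
                long q = m[i][k] + m[k + 1][j] +
                         cost(c.p[i - 1], c.p[k], c.p[j], rt);
                if (q < m[i][j]) m[i][j] = q;
            }
        }
    }
    return m[1][n];
}
```

For an all-general chain this reduces to the textbook recurrence; adding a diagonal right-hand operand lowers the count for that split, which is what allows the optimal ordering to shift.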

The subscripts G and D denote general dense and diagonal matrix types. The optimal ordering for this sequence is (A(B(CD))), with 10 scalar operations. Using the same sequence with the original version of the algorithm resulted in the same optimal ordering, (A(B(CD))), but with 27 scalar operations.

    Matrix Type  | Sequence                                  | Optimal Ordering | Operation Count
    General      | A_G(2×4) B_G(4×4) C_G(4×4) D_G(4×2)       | ((A B)(C D))     | 80
    Specialized  | A_G(2×4) B_D(4×4) C_D(4×4) D_G(4×2)       | ((A(B C))D)      | 28

Table 1: General vs. specialized matrix types

Table 1 shows different optimal orderings for a fixed set of matrix dimensions in the general and specialized versions. We observe that the matrix-chain product can be performed more efficiently when we can specialize the matrix types.

2 Conclusions

We provided a class method in C++ which implements the standard algorithm for optimizing the matrix-chain product for general matrices. We then augmented that facility to exploit different fundamental dense matrix shapes. We provided examples where the optimal ordering changes when considering specifically shaped rather than general matrices. The implementation is carried out in the following steps. The first step provides the essential matrix-chain operand characteristics to a dynamic-programming-based solver, which accepts a variable number of arguments and computes an optimal associative ordering. The product can then be performed in the order specified by the algorithm.

2.1 Further Developments

At our present stage of development, there are unresolved issues relating to whether some types of shaped matrices (e.g., diagonal and triangular) can be non-square. Our implementation can be extended to the remaining matrix types in the BLAS, and then to the additional matrix types listed in [1]. Additionally, matrix-chain ordering can be performed at compile time by application of the template facility in C++. A programmer could then code the matrix-chain as an expression.
The matrix-chain expression could then be preprocessed at compile time to generate the optimal matrix-chain ordering for the run-time computation of the product, thereby increasing efficiency [3]. This is termed metaprogramming. The matrix-chain expression can be evaluated by a technique called expression templates [11], which is implemented by overloading binary operators [6].
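As a hint of what such compile-time evaluation could look like, the following sketch (ours, not the proposed expression-template implementation) uses constexpr functions so the compiler folds the chain-cost comparison into constants before run time:

```cpp
// Compile-time chain-cost evaluation (our sketch; a full
// expression-template version would encode dimensions and matrix
// types in the operand types themselves, as in [11]).
constexpr long denseCost(long i, long k, long j) { return i * k * j; }

// Costs of (A B)C and A(B C) for a chain with dimensions
// p0 x p1, p1 x p2, p2 x p3.
constexpr long leftAssocCost(long p0, long p1, long p2, long p3) {
    return denseCost(p0, p1, p2) + denseCost(p0, p2, p3);
}
constexpr long rightAssocCost(long p0, long p1, long p2, long p3) {
    return denseCost(p1, p2, p3) + denseCost(p0, p1, p3);
}

// Evaluated entirely at compile time:
static_assert(leftAssocCost(10, 100, 5, 50) < rightAssocCost(10, 100, 5, 50),
              "(AB)C is cheaper for these dimensions");
```

Because the check is a static_assert, a suboptimal ordering could in principle be rejected before the program ever runs.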

References

[1] Anda, A. A., "Generative Programming Considerations for the Matrix-Chain Product Problem," Proceedings of the Midwest Instruction and Computing Symposium (MICS-08), March 2008, pp. 272-279.

[2] Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C., Introduction to Algorithms, Second edition, The MIT Press, Cambridge, MA, USA, 2001.

[3] Czarnecki, Krzysztof, and Eisenecker, Ulrich W., Generative Programming: Methods, Tools, and Applications, Addison-Wesley, Boston, MA, USA, 2000.

[4] Template Numerical Toolkit, http://math.nist.gov/tnt/

[5] Packed Storage Format, http://www.netlib.org/lapack/lug/node123.html

[6] Bassetti, Federico, Davis, Kei, and Quinlan, Dan, "C++ Expression Templates Performance Issues in Scientific Computing," Rice University, Texas, October 1997.

[7] Baase, Sara, and Van Gelder, Allen, Computer Algorithms: Introduction to Design and Analysis, Third edition, Addison-Wesley, Reading, MA, 2000.

[8] Purdom, Paul Walton, Jr., and Brown, Cynthia A., The Analysis of Algorithms, Holt, Rinehart and Winston, New York, NY, 1985.

[9] Neapolitan, Richard, and Naimipour, Kumarss, Foundations of Algorithms, D. C. Heath and Company, Lexington, MA, 1996.

[10] Dongarra, J. J., Du Croz, J., Duff, I. S., and Hammarling, S., "A Set of Level 3 Basic Linear Algebra Subprograms," ACM Transactions on Mathematical Software, 16, 1990.

[11] Veldhuizen, Todd L., "C++ Templates as Partial Evaluation," ACM SIGPLAN Workshop on Partial Evaluation and Semantics-Based Program Manipulation, 1999.