Extra-High Speed Matrix Multiplication on the Cray-2

David H. Bailey

September 2, 1987

Ref: SIAM J. on Scientific and Statistical Computing, vol. 9, no. 3 (May 1988), pp. 603-607

Abstract

The Cray-2 is capable of performing matrix multiplication at very high rates. Using library routines provided by Cray Research, Inc., performance rates of 300 to 425 MFLOPS can be obtained on a single processor, depending on system load. Considerably higher rates can be achieved with all four processors running simultaneously. This article describes how matrix multiplication can be performed even faster, up to twice the above rates. This can be achieved by (1) employing Strassen's matrix multiplication algorithm to reduce the number of floating-point operations performed and (2) utilizing local memory on the Cray-2 to avoid performance losses due to memory bank contention. The numerical stability and potential for parallel application of this procedure are also discussed.

The author is with the Numerical Aerodynamic Simulation Systems Division at NASA Ames Research Center, Moffett Field, CA 94035.

Introduction

Recently a number of high-performance multi-processor vector computers have become available for scientific computation. One of these is the Cray-2, manufactured by Cray Research, Inc. It features over 268 million 64-bit words of main memory, which is between one and two orders of magnitude more than that provided in previous generations of supercomputers. In addition to this large main memory, the Cray-2 features four central processing units (CPUs), each with a hardware clock period of only 4.1 nanoseconds, as compared to 9.5 nanoseconds on the Cray X-MP line. Thus each Cray-2 CPU is potentially about twice as fast as a Cray X-MP CPU. However, performance rates on typical vectorized FORTRAN programs are only roughly on a par with the Cray X-MP, largely because chips commensurate in speed with the fast Cray-2 CPUs were not available when the first few Cray-2 systems were manufactured.

For applications that can effectively utilize certain assembly-coded library routines, however, the higher power of the Cray-2 can be harnessed. In particular, Cray's library routine for matrix multiplication (MXM) is capable of speeds that are close to the peak hardware speeds. For one processor running without memory interference from other processors, a performance rate of 425 million floating-point operations per second (MFLOPS) has been achieved on a matrix multiplication in a stand-alone environment. Even with the other processors busy, as in a normal operating environment, rates near 300 MFLOPS can easily be achieved. Using a four-processor, multi-tasked version of this routine, over 1700 MFLOPS has been achieved.

Can matrices be multiplied on the Cray-2 significantly faster than this? Yes. In fact, large matrices can be multiplied with more than twice the speed of the MXM routine. Clearly such speedups cannot be obtained solely by a more efficient implementation of the usual matrix multiplication scheme, although some improvement can be made in this area. The key to such speedups is to employ an advanced algorithm that produces the matrix product using fewer floating-point operations, while still maintaining a high level of vectorization and functional unit concurrency.

Strassen's Matrix Multiplication Algorithm

The fact that matrix multiplication can be performed with fewer than 2n^3 arithmetic operations has been known since 1969, when V. Strassen published an algorithm that asymptotically requires only about 4.7 n^2.807 operations [1]. Since then other such algorithms have been discovered [2], and currently the best known result is due to Coppersmith and Winograd [3], which reduces the exponent of n to only 2.496. These more recently discovered algorithms are considerably more complicated than Strassen's algorithm and do not significantly improve upon it unless the matrices are quite large (i.e., 1000 x 1000 or so). Thus this article will focus on the implementation of Strassen's algorithm, which is as follows:

" Let the matrices A, B, and C be divided into half-sized blocks: " A 11 A 12 A 21 A 22 B 11 B 12 B 21 B 22 = " C 11 C 12 C 21 C 22 Then the result may be calculated as follows: P 1 = (A 11 + A 22 )(B 11 + B 22 ) P 2 = (A 21 + A 22 )B 11 P 3 = A 11 (B 12 ; B 22 ) P 4 = A 22 (B 21 ; B 11 ) P 5 = (A 11 + A 12 )B 22 P 6 = (A 21 ; A 11 )(B 11 + B 12 ) P 7 = (A 12 ; A 22 )(B 21 + B 22 ) C 11 = P 1 + P 4 ; P 5 + P 7 C 12 = P 3 + P 5 C 21 = P 2 + P 4 C 22 = P 1 + P 3 ; P 2 + P 6 It should be pointed out that the intermediate matrices P 1 P 2 P 7 may all be computed concurrently. The last four lines may also be computed concurrently, but their cost is generally insignicant compared to the previous seven lines. In any event Strassen's algorithm appears to be fairly well suited for multi-processor computation. The computational savings of employing Strassen's algorithm derives from the fact that only seven half-sized matrix multiplications need to be performed, whereas eight are required with the standard algorithm. Thus for fairly large matrices, where the cost of performing this algorithm is dominated by the cost of multiplying the half-sized blocks, a savings of approximately 14% can be realized (in theory) over the traditional method. Strassen's method can of course be recursively employed to multiply the half-sized blocks, and so on down to individual matrix elements if desired. For every level of recursion, an additional 14% savings can be realized. However, in practice this recursion is only performed down to the level at which the losses due to bookkeeping costs and short vector lengths overwhelm any savings due to fewer oating-point operations being performed. On the Cray-2 this crossover point was found to be for matrices of size 128 128. Reducing Memory Bank Contention The Cray-2 has 128 independent, interleaved main memory banks. Thus a vector fetch of 64 contiguous data words obtains each word from a separate bank, provided each of the requested banks is not busy. However, with four CPUs actively accessing main memory, the probability that a requested bank will be busy is fairly high. As a result, the overall performance of the Cray-2 on memory intensive programs such as matrix multiplication is 3

Reducing Memory Bank Contention

The Cray-2 has 128 independent, interleaved main memory banks. Thus a vector fetch of 64 contiguous data words obtains each word from a separate bank, provided each of the requested banks is not busy. However, with four CPUs actively accessing main memory, the probability that a requested bank will be busy is fairly high. As a result, the overall performance of the Cray-2 on memory-intensive programs such as matrix multiplication is reduced by as much as 35% below what it would be in the absence of contention. This loss will be reduced in the future as faster memory chips become available.

One way to avoid this performance loss is to employ the 16,384-word local memory in each CPU to perform block matrix multiplications. In this way main memory traffic is sharply reduced, and the CPU computational units can perform nearly at peak speed, even when the system is busy with jobs on the other CPUs. Don Calahan of the University of Michigan has prepared an assembly-coded matrix multiplication routine based on this principle, and this routine was employed in these calculations.
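
The local memory technique is essentially blocking: each sub-block product is small enough to reside in the 16,384-word local memory, so data brought in from main memory are reused many times before being written back. The sketch below shows the idea schematically in Python/NumPy; it is not Calahan's assembly-coded routine, and the block size of 64 is purely illustrative.

    import numpy as np

    def blocked_matmul(a, b, block=64):
        """Blocked matrix multiplication: each block-sized sub-problem is small
        enough to live in a fast local store, reducing main memory traffic."""
        n = a.shape[0]
        c = np.zeros((n, n), dtype=a.dtype)
        for i in range(0, n, block):
            for j in range(0, n, block):
                for k in range(0, n, block):
                    # Accumulate the contribution of one block triple.
                    c[i:i + block, j:j + block] += (
                        a[i:i + block, k:k + block] @ b[k:k + block, j:j + block]
                    )
        return c
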
Performance Results

For testing purposes, Strassen's algorithm was coded in FORTRAN for square matrices, although it should not be difficult to generalize the implementation to non-square matrices, since the Strassen formulas still hold for that case. Calahan's local memory routine was referenced for matrices smaller than 128 x 128. Recursion was obtained merely by replicating the FORTRAN subroutine and changing the name at each level of recursion. Although little real work was performed in the FORTRAN subroutines, these operations were vectorized by the FORTRAN compiler. No attempt was made to utilize more than one of the four Cray-2 CPUs. However, if this were done, speedups of nearly four times should be possible, because inter-processor memory bank contention is minimized by the use of Calahan's routine.

The usual implementation of Strassen's algorithm requires additional memory. In this program, a scratch array of size 3n^2 was required in addition to the input and result arrays. On a large memory computer such as the Cray-2, memory space is not a serious issue, and in fact trading memory space for increased performance is an effective use of the Cray-2. On systems where memory is more dear, a version proposed by Kreczmar [4] may be preferable; that version reduces the extra storage requirement to 2n^2/3.

Table 1 compares the performance of this implementation with that of the Cray library matrix multiplication routine. Columns headed "Strassen Routine" contain data for the technique described above, while columns headed "Cray Library MXM" contain data for the Cray library matrix multiplication routine. These results are based on ten trials with pseudorandom, normally distributed data. The error statistic reported is the relative root-mean-square error. To be precise, the error statistic E is given by

    E = n^(-3/2) [ sum_{i,j} (A_ij - S_ij)^2 ]^(1/2)

where A is the computed matrix product and S is the exact matrix product. The exact matrix S was computed using double-precision calculations, which are feasible only up to size 1024 x 1024: Cray computers do not have hardware for performing arithmetic on double-precision (128-bit) data, and as a result double-precision operations are two orders of magnitude slower than single-precision arithmetic.
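
For reference, an error statistic of this form can be computed directly from the two products; the helper below is only a sketch of the formula as written above (the function name is illustrative, and NumPy is assumed).

    import numpy as np

    def rel_rms_error(a, s):
        # E = n^(-3/2) * sqrt( sum_{i,j} (A_ij - S_ij)^2 ), where a is the
        # computed product and s is the reference product obtained in
        # higher precision.
        n = a.shape[0]
        diff = np.asarray(a, dtype=float) - np.asarray(s, dtype=float)
        return float(np.sqrt(np.sum(diff * diff)) / n ** 1.5)
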
Matrix      Strassen Routine                 Cray Library MXM                Time
Size        CPU Time (sec)  Error            CPU Time (sec)  Error           Ratio
-----------------------------------------------------------------------------------
   64           0.0014      9.686 x 10^-15       0.0019      9.686 x 10^-15   1.35
  100           0.0057      9.187 x 10^-15       0.0078      1.214 x 10^-14   1.35
  128           0.0112      1.003 x 10^-14       0.0162      1.355 x 10^-14   1.45
  200           0.0474      1.891 x 10^-14       0.0636      1.704 x 10^-14   1.34
  256           0.0881      1.999 x 10^-14       0.1401      1.915 x 10^-14   1.59
  400           0.3548      3.745 x 10^-14       0.4844      2.420 x 10^-14   1.37
  512           0.6452      3.997 x 10^-14       1.0473      2.730 x 10^-14   1.62
  800           2.5878      7.484 x 10^-14       3.7262      3.424 x 10^-14   1.44
 1024           4.7107      7.993 x 10^-14       8.7800      3.882 x 10^-14   1.86
 1600          18.3251                          28.3005                       1.54
 2048          33.1119                          66.6682                       2.01

Table 1: Comparative Matrix Multiplication Performance

It can be seen from the results in the table that the new routine is approximately 35% faster than MXM for small matrices. This speedup is entirely due to Calahan's local memory matrix multiplication routine. Beginning at size 128 x 128, a larger speedup is obtained, indicating the positive effect of the Strassen algorithm. Some irregularity can be seen in the speedup figures from entry to entry, but these figures are monotonic if restricted to powers of two or to non-powers of two. For a 2048 x 2048 matrix multiplication, which was the largest case studied, a speedup factor of 2.01 was obtained. Again, all but the initial 35% of this speedup is due to the use of the Strassen algorithm.

Numerical Stability of Strassen's Algorithm

It can be seen from the figures in the table that the numerical errors of Strassen's algorithm, although slightly larger than those of the ordinary inner product method, appear to be well under control in the cases studied. In general Strassen's algorithm is known to be not as numerically stable as the inner product calculation, although it still satisfies an acceptable stability condition. In 1975 Webb Miller [5] showed that any scheme satisfying a sufficiently strong notion of stability would have to perform at least n^3 multiplications to evaluate an n x n matrix product; thus the ordinary inner product method is optimal for this stability condition. However, Strassen's algorithm is known [5] to satisfy a condition known as simultaneous Brent stability.

For our purposes these two stability conditions can be defined as follows. Let c_ik = sum_j a_ij b_jk denote the usual inner product of two matrices A and B, and let Δc_ik denote the numerical error in calculating c_ik by some particular procedure (this term is more precisely defined in Miller's paper). Then simultaneous Brent stability and simultaneous strong stability are defined as, respectively,

    |Δc_ik| ≤ C ε (max_j |a_ij|) (max_j |b_jk|)    for every i, k

    |Δc_ik| ≤ C ε sum_j |a_ij b_jk|    for every i, k

An example where Strassen's method is not strongly stable is as follows. Let ε denote a value on the order of the machine "epsilon" (i.e., 2^-b, where b is the number of mantissa bits in the floating-point representation). Consider the 2 x 2 matrix product

    [ ε^2  1  ] [ 1    ε ]   [ 2ε^2     1 + ε^3 ]
    [ 1    ε  ] [ ε^2  1 ] = [ 1 + ε^3  2ε      ]

Note that c_11 is of order ε^2. In performing Strassen's algorithm to evaluate this product, c_11 is computed as

    c_11 = 2ε(1 + ε) - ε(1 - ε^2) - (1 + ε^2) + (1 - ε)(1 + ε^2)

Because this calculation adds and subtracts numbers of order unity, the numerical error in calculating c_11 is potentially of order ε. However, both terms of the sum in the strong stability condition for c_11 above are of order ε^2. Thus the strong stability condition can fail by an unbounded factor: the greater the level of machine precision, the greater the potential error factor.

In other words, these results imply that matrix products computed using the Strassen algorithm can only be relied upon to a level of accuracy that is on the order of the machine epsilon times the largest value in the matrix. For most applications this degree of accuracy is completely acceptable; nothing more is required of a linear equation solution, for example. Certainly away from a set of matrices of small measure there is no problem whatsoever, other than a somewhat faster accumulation of error due to the increased number of additions and subtractions that are part of Strassen's algorithm.
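
The failure mode in the example above is easy to observe numerically. The short check below is only an illustration: it uses ordinary double precision and takes ε = 10^-8, roughly the square root of the double-precision machine epsilon, so that the exact value c_11 = 2ε^2 sits near the rounding level of the order-one intermediate quantities appearing in Strassen's formulas.

    import numpy as np

    eps = 1.0e-8  # chosen so that 2*eps**2 lies near the double-precision rounding level
    a = np.array([[eps**2, 1.0], [1.0, eps]])
    b = np.array([[1.0, eps], [eps**2, 1.0]])

    exact_c11 = 2.0 * eps**2

    # Ordinary inner product: both terms are of order eps^2, no cancellation.
    inner_c11 = a[0, 0] * b[0, 0] + a[0, 1] * b[1, 0]

    # Strassen combination c_11 = P1 + P4 - P5 + P7: quantities of order one
    # are added and subtracted, so rounding errors of order the machine
    # epsilon can swamp the order-eps^2 result.
    p1 = (a[0, 0] + a[1, 1]) * (b[0, 0] + b[1, 1])
    p4 = a[1, 1] * (b[1, 0] - b[0, 0])
    p5 = (a[0, 0] + a[0, 1]) * b[1, 1]
    p7 = (a[0, 1] - a[1, 1]) * (b[1, 0] + b[1, 1])
    strassen_c11 = p1 + p4 - p5 + p7

    print("exact   :", exact_c11)
    print("inner   :", inner_c11)
    print("strassen:", strassen_c11)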

Conclusion

Strassen's algorithm appears to be a practical means of accelerating matrix multiplication. It produces a significant speedup of this operation whenever the matrices are sufficiently large to overcome the effects of bookkeeping costs and the shorter vector lengths of block matrix operations at the base level. These requirements should be met for matrices of reasonable size on a variety of currently available scientific computer systems. Thus this algorithm should be considered by anyone wishing to implement a high-performance matrix multiply routine for scientific computation.

References

1. Strassen, V., "Gaussian Elimination Is Not Optimal", Numerische Mathematik, vol. 13 (1969), pp. 354-356.

2. Pan, V., "New Fast Algorithms for Matrix Operations", SIAM Journal on Computing, vol. 9 (1980), pp. 321-342.

3. Coppersmith, D., and Winograd, S., "On the Asymptotic Complexity of Matrix Multiplication", SIAM Journal on Computing, vol. 11 (1982), pp. 472-492.

4. Kreczmar, A., "On Memory Requirements of Strassen Algorithms", in Algorithms and Complexity: New Directions and Recent Results, J. F. Traub, ed., Academic Press, New York, 1976.

5. Miller, Webb, "Computational Complexity and Numerical Stability", SIAM Journal on Computing, vol. 4 (1975), pp. 97-107.

6. Kronsjo, Lydia, Computational Complexity of Sequential and Parallel Algorithms, John Wiley, 1985.