Matrix Multiplication


CPS343 Parallel and High Performance Computing, Spring 2013

Outline

1. Matrix operations: importance; dense and sparse matrices; matrices and arrays
2. Matrix-vector multiplication: row-sweep algorithm; column-sweep algorithm
3. Matrix-matrix multiplication: standard algorithm; ijk-forms

Definition of a matrix

A matrix is a rectangular two-dimensional array of numbers. We say a matrix is m × n if it has m rows and n columns; these values are called the dimensions of the matrix. Note that, in contrast to Cartesian coordinates, we specify the number of rows (the vertical dimension) first and then the number of columns (the horizontal dimension). In most contexts, the rows and columns are numbered starting with 1. We use a_ij to refer to the entry in the i-th row and j-th column of the matrix A.

Matrices are extremely important in HPC

While it's certainly not the case that high performance computing is all about computing with matrices, matrix operations are key to many important HPC applications. Many important applications can be reduced to operations on matrices, including (but not limited to):

1. searching and sorting
2. numerical simulation of physical processes
3. optimization

The list of the top 500 supercomputers in the world (found at http://www.top500.org) is determined by a benchmark program that performs matrix operations. (Like most benchmark programs, this is just one measure, however, and does not predict the relative performance of a supercomputer on non-matrix problems, or even on different matrix-based problems.)

Dense matrices

The m × n matrix A is dense if all or most of its entries are nonzero. Storing a dense matrix (sometimes called a full matrix) requires storing all mn elements of the matrix. Usually an array data structure is used to store a dense matrix.

Dense matrix example

[Figure: a complete graph on the six nodes A, B, C, D, E, F with edge weights 1 through 15.]

Find a matrix to represent this complete graph if the ij entry contains the weight of the edge connecting the node corresponding to row i with the node corresponding to column j. Use the value 0 if a connection is missing.

    0  1  2  3  4  5
    1  0  6  7  8  9
    2  6  0 10 11 12
    3  7 10  0 13 14
    4  8 11 13  0 15
    5  9 12 14 15  0
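
This weight matrix maps directly onto a static two-dimensional C array; a minimal sketch (the identifier names are ours, not from the slides):

    #include <stdio.h>

    #define N 6

    /* Edge-weight matrix of the complete graph on nodes A-F. */
    static const double W[N][N] = {
        { 0,  1,  2,  3,  4,  5},
        { 1,  0,  6,  7,  8,  9},
        { 2,  6,  0, 10, 11, 12},
        { 3,  7, 10,  0, 13, 14},
        { 4,  8, 11, 13,  0, 15},
        { 5,  9, 12, 14, 15,  0}
    };

    int main(void)
    {
        /* Weight of the edge between node B (row 1) and node E (column 4). */
        printf("w(B,E) = %g\n", W[1][4]);
        return 0;
    }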

Dense matrix example

    0  1  2  3  4  5
    1  0  6  7  8  9
    2  6  0 10 11 12
    3  7 10  0 13 14
    4  8 11 13  0 15
    5  9 12 14 15  0

This is considered a dense matrix even though it contains zeros. This matrix is symmetric, meaning that a_ij = a_ji. What would be a good way to store this matrix?
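
One common answer (a sketch of one option, with names of our own choosing) is packed storage: since a_ij = a_ji, it suffices to store the lower triangle, roughly halving the memory. An indexing helper makes the packed array behave like a full matrix:

    #include <stddef.h>

    /* Packed storage for a symmetric n x n matrix: only the lower
     * triangle (i >= j) is kept, row by row, in n*(n+1)/2 doubles. */
    typedef struct {
        int     n;
        double *p;      /* packed entries */
    } sym_matrix;

    /* Offset of a(i,j) in the packed array; 0-based, works for any i, j. */
    static size_t sym_index(int i, int j)
    {
        if (i < j) { int t = i; i = j; j = t; }   /* use symmetry */
        return (size_t)i * (i + 1) / 2 + j;
    }

    static double sym_get(const sym_matrix *a, int i, int j)
    {
        return a->p[sym_index(i, j)];
    }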

Sparse matrices

A matrix is sparse if most of its entries are zero. Here "most" is not usually just a simple majority; rather, we expect the number of zeros to far exceed the number of nonzeros. It is often most efficient to store only the nonzero entries of a sparse matrix, but this requires that location information also be stored. Arrays and lists are most commonly used to store sparse matrices.

Sparse matrix example

[Figure: the same six nodes A-F, now with only the edges A-D (3), A-F (5), B-C (6), B-F (9), D-F (14), and E-F (15).]

Find a matrix to represent this graph if the ij entry contains the weight of the edge connecting the node corresponding to row i with the node corresponding to column j. As before, use the value 0 if a connection is missing.

    0  0  0  3  0  5
    0  0  6  0  0  9
    0  6  0  0  0  0
    3  0  0  0  0 14
    0  0  0  0  0 15
    5  9  0 14 15  0

Sparse matrix example

Sometimes it's helpful to leave out the zeros to better see the structure of the matrix:

    0  0  0  3  0  5                  3     5
    0  0  6  0  0  9               6        9
    0  6  0  0  0  0    =          6
    3  0  0  0  0 14            3          14
    0  0  0  0  0 15                       15
    5  9  0 14 15  0            5  9    14 15

This matrix is also symmetric. How could it be stored efficiently?
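
One standard possibility (sketched here with hypothetical names) is coordinate (COO) storage: three parallel arrays hold the row index, column index, and value of each nonzero. For the matrix above this means storing 12 triples, or just 6 by symmetry, instead of 36 entries:

    /* Coordinate (COO) storage: parallel arrays of (row, col, value)
     * triples, one per nonzero entry. */
    typedef struct {
        int     nnz;    /* number of nonzeros   */
        int    *row;    /* row index of each    */
        int    *col;    /* column index of each */
        double *val;    /* value of each        */
    } coo_matrix;

    /* y = A*x for an m-vector y; touches only the nonzeros. */
    void coo_matvec(const coo_matrix *a, const double *x, double *y, int m)
    {
        for (int i = 0; i < m; i++)
            y[i] = 0.0;
        for (int k = 0; k < a->nnz; k++)
            y[a->row[k]] += a->val[k] * x[a->col[k]];
    }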

Banded matrices

An important type of sparse matrix is the banded matrix, whose nonzeros lie along diagonals close to the main diagonal. Example:

    3  1  6  0  0  0  0         3  1  6
    4  8  5  0  0  0  0         4  8  5  0
    1  2  1  1  3  0  0         1  2  1  1  3
    0  1  0  4  2  6  0    =    1  0  4  2  6
    0  0  6  9  5  2  5         6  9  5  2  5
    0  0  0  7  1  8  7         7  1  8  7
    0  0  0  0  4  4  9         4  4  9

The bandwidth of this matrix is 5.
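
Banded matrices can be stored compactly by keeping only the diagonals inside the band, as the right-hand picture suggests. A sketch of a matrix-vector product that exploits this (our own names; we assume lower and upper half-bandwidths p and q, so the bandwidth is p + q + 1, with p = q = 2 for the example above):

    /* y = A*x for an n x n banded matrix with lower half-bandwidth p and
     * upper half-bandwidth q. ab is a (p+q+1) x n row-major array holding
     * the band: ab[i - j + q][j] = a[i][j] for the in-band entries (0-based).
     * Cost is about 2n(p+q+1) FLOPs instead of 2n^2. */
    void band_matvec(int n, int p, int q,
                     const double *ab, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            int jlo = (i - p > 0) ? i - p : 0;         /* clip band to matrix */
            int jhi = (i + q < n - 1) ? i + q : n - 1;
            y[i] = 0.0;
            for (int j = jlo; j <= jhi; j++)
                y[i] += ab[(i - j + q) * n + j] * x[j];
        }
    }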

Using two-dimensional arrays

It is natural to use a 2D array to store a dense or banded matrix. Unfortunately, there are a couple of significant issues that complicate this seemingly simple approach:

1. The row-major vs. column-major storage pattern is language dependent.
2. It is not possible to dynamically allocate two-dimensional arrays in C and C++, at least not without pointer storage and manipulation overhead.
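
A common workaround, sketched below with names of our own choosing, is to allocate one contiguous block and do the 2D indexing by hand, which avoids the pointer-per-row overhead:

    #include <stdlib.h>

    /* Index an m x n matrix stored as one contiguous row-major block. */
    #define A(i, j) a[(size_t)(i) * n + (j)]

    double *matrix_new(size_t m, size_t n)
    {
        return malloc(m * n * sizeof(double));
    }

    /* Usage sketch: */
    void example(size_t m, size_t n)
    {
        double *a = matrix_new(m, n);
        if (!a) return;
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < n; j++)
                A(i, j) = 0.0;     /* contiguous, row-major traversal */
        free(a);
    }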

Row-major storage

Languages like C and C++ use what is often called a row-major storage pattern for 2D matrices. In C and C++, the last index in a multidimensional array indexes contiguous memory locations; thus a[i][j] and a[i][j+1] are adjacent in memory. Example: the 2 × 5 matrix

    0 1 2 3 4
    5 6 7 8 9

is stored in memory as

    0 1 2 3 4 5 6 7 8 9

The stride between adjacent elements in the same row is 1. The stride between adjacent elements in the same column is the row length (5 in this example).

Column-major storage

In Fortran, 2D matrices are stored in column-major form: the first index in a multidimensional array indexes contiguous memory locations, so a(i,j) and a(i+1,j) are adjacent in memory. Example: the same 2 × 5 matrix

    0 1 2 3 4
    5 6 7 8 9

is stored in memory as

    0 5 1 6 2 7 3 8 4 9

The stride between adjacent elements in the same row is the column length (2 in this example), while the stride between adjacent elements in the same column is 1. Notice that if C is used to read a matrix stored by Fortran (or vice versa), the transpose matrix will be read.
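
When sharing matrices between C and Fortran, one common trick (a sketch; the macro name is ours) is to keep the data in a flat array and index it with Fortran's column-major rule from the C side, so no physical transposition is needed:

    #include <stddef.h>

    /* Column-major indexing for an m x n matrix stored Fortran-style:
     * element (i,j), 0-based, lives at offset j*m + i. */
    #define AF(i, j) a[(size_t)(j) * m + (i)]

    /* Fill a in the layout a Fortran routine expects. */
    void fill_identity(double *a, size_t m, size_t n)
    {
        for (size_t j = 0; j < n; j++)        /* outer loop over columns:  */
            for (size_t i = 0; i < m; i++)    /* consecutive memory writes */
                AF(i, j) = (i == j) ? 1.0 : 0.0;
    }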

Significance of array ordering

There are two main reasons why HPC programmers need to be aware of this issue:

1. Memory access patterns can have a dramatic impact on performance, especially on modern systems with a complicated memory hierarchy. The two code segments below access the same elements of an array, but the order of the accesses is different.

   Access by rows:

       for (i = 0; i < 2; i++)
           for (j = 0; j < 5; j++)
               a[i][j] = ...

   Access by columns:

       for (j = 0; j < 5; j++)
           for (i = 0; i < 2; i++)
               a[i][j] = ...

2. Many important numerical libraries (e.g. LAPACK) are written in Fortran. To use them with C or C++, the programmer must often work with a transposed matrix.

Row-sweep matrix-vector multiplication

Row-major matrix-vector product y = Ax, where A is M × N:

    for (i = 0; i < M; i++) {
        y[i] = 0.0;
        for (j = 0; j < N; j++) {
            y[i] += a[i][j] * x[j];
        }
    }

- matrix elements are accessed in row-major order
- repeated consecutive updates to y[i]... we can usually assume the compiler will optimize this
- also called the inner-product form, since the i-th entry of y is the result of an inner product between the i-th row of A and the vector x.

Column-sweep matrix-vector multiplication

Column-major matrix-vector product y = Ax, where A is M × N:

    for (i = 0; i < M; i++) {
        y[i] = 0.0;
    }
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            y[i] += a[i][j] * x[j];
        }
    }

- matrix elements are accessed in column-major order
- repeated but non-consecutive updates to y[i]
- also called the outer-product form.

Matrix-vector algorithm comparison

Which of these two algorithms will run faster? Why?

Row-sweep form:

    for (i = 0; i < M; i++) {
        y[i] = 0.0;
        for (j = 0; j < N; j++) {
            y[i] += a[i][j] * x[j];
        }
    }

Column-sweep form:

    for (i = 0; i < M; i++) {
        y[i] = 0.0;
    }
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            y[i] += a[i][j] * x[j];
        }
    }

Matrix-vector algorithm comparison

Both algorithms carry out the same operations but do so in a different order; in particular, their memory access patterns are quite different. The first form will typically work better in a language like C or C++, which accesses 2D arrays in row-major form. Since Fortran accesses 2D arrays column by column, it is usually best to use the second form there.

Operation counts

In order to compute the computation rate in MFLOPS or GFLOPS, we need to know the number of floating-point operations carried out and the elapsed time. Division is usually the most expensive of the four basic operations, followed by multiplication; addition and subtraction are equivalent in terms of time. We typically count all four operations on floating-point numbers when calculating FLOPS but ignore integer operations like array-subscript calculations.

In the case of a matrix-vector product, the innermost loop body contains a multiplication and an addition:

    y_i = y_i + a_ij * x_j

The outer loop runs M times and, on each pass, the inner loop runs N times, so there are a total of 2MN FLOPs.
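
As a sketch of how this count is used (the function name is ours, and we assume the POSIX clock_gettime timer is available), one can time the row-sweep product and convert its 2MN FLOPs into a rate:

    #include <stdio.h>
    #include <time.h>

    /* Time y = A*x (row-sweep, flat row-major A) and return MFLOPS,
     * counting 2*M*N floating-point operations. */
    double matvec_mflops(int M, int N, const double *a,
                         const double *x, double *y)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < M; i++) {
            y[i] = 0.0;
            for (int j = 0; j < N; j++)
                y[i] += a[(size_t)i * N + j] * x[j];
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        return 2.0 * M * N / secs / 1e6;
    }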

The textbook algorithm

Consider the problem of multiplying two matrices, C = AB. The standard textbook algorithm to form the product C of the M × P matrix A and the P × N matrix B is based on the inner product: the c_ij entry of the product is the inner product (or dot product) of the i-th row of A and the j-th column of B.

The textbook algorithm: ijk-form

Matrix-matrix product pseudocode:

    for i = 1 to M
        for j = 1 to N
            c(i,j) = 0
            for k = 1 to P
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end
        end
    end

This is known as the ijk-form of the product, after the loop ordering. What is the operation count? The number of FLOPs is 2MNP. For square matrices n = M = N = P, so the number of FLOPs can be written 2n^3.
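
A direct C translation of this pseudocode (0-based, row-major, flat arrays; a minimal sketch, not the slides' own code) looks like:

    #include <stddef.h>

    /* C = A*B, ijk-form: A is MxP, B is PxN, C is MxN, all row-major. */
    void matmul_ijk(int M, int N, int P,
                    const double *a, const double *b, double *c)
    {
        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) {
                double sum = 0.0;               /* accumulates c(i,j) */
                for (int k = 0; k < P; k++)
                    sum += a[(size_t)i * P + k] * b[(size_t)k * N + j];
                c[(size_t)i * N + j] = sum;
            }
        }
    }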

The textbook algorithm: ijk-form

Looking at the same pseudocode again, notice that A is accessed row by row but B is accessed column by column. The column index for C varies faster than the row index, but both are constant with respect to the inner loop, so C's access pattern is much less significant. With row-major storage we thus have an efficient access pattern for A but not for B. Could we improve things by rearranging the order of operations?

The ijk forms

Recall from linear algebra that column j of the matrix-matrix product AB is defined as a linear combination of the columns of A, using the values in the j-th column of B as weights. The pseudocode for this is:

    for j = 1 to N
        for i = 1 to M
            c(i,j) = 0.0
        end
        for k = 1 to P
            for i = 1 to M
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end
        end
    end

- the loop ordering changes but the innermost statement is unchanged
- the initialization of the values in C is done one column at a time
- the operation count is still 2MNP
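
For comparison with the ijk version, here is a C sketch of this jki-form (0-based, row-major; in row-major C this column-oriented order is the cache-unfriendly one, while in Fortran's column-major layout it is the friendly one):

    #include <stddef.h>

    /* C = A*B, jki-form: C is initialized and updated one column at a time. */
    void matmul_jki(int M, int N, int P,
                    const double *a, const double *b, double *c)
    {
        for (int j = 0; j < N; j++) {
            for (int i = 0; i < M; i++)
                c[(size_t)i * N + j] = 0.0;
            for (int k = 0; k < P; k++) {
                double bkj = b[(size_t)k * N + j];   /* constant in i loop */
                for (int i = 0; i < M; i++)
                    c[(size_t)i * N + j] += a[(size_t)i * P + k] * bkj;
            }
        }
    }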

The ijk forms

Other loop orderings are possible. How many? How many ways can i, j, and k be arranged? Recall from discrete math that this is a permutation problem: three ways to choose the first letter, two ways to choose the second, and one way to choose the third, so 3 · 2 · 1 = 6. There are six possible loop orderings: ijk, ikj, jik, jki, kij, and kji. We'll work with all six on Friday.