LINPACK Benchmark Optimizations on a Virtual Processor Grid

Size: px

Start display at page:

Download "LINPACK Benchmark Optimizations on a Virtual Processor Grid"

Tyrone Gibson
6 years ago
Views:

1 Ed Anderson Maynard Brandt (retired) Chao Yang

2 Outline Organization of the Cray LINPACK Benchmark code Kernel optimizations on the CRAY X The virtual processor grid

3 Solve a linear system Ax = b The LINPACK benchmark using Gaussian elimination with partial pivoting. The problem size and implementation are not specified. The algorithm factors A = L U and solves y = L - b; x = U - y Performance results are specified in Gflop/s (billions of floating-point operations per second) or Tflop/s (trillions of floating-point operations per second)

4 Right-looking algorithm: Block LU factorization I II III I. Factor block column into A = LU II. III. Exchange rows and update block row Update the rest of the matrix using matrix multiplication

5 Optimizing the panel factorization Sub- blocking or recursion is used within the block column so that more work is done in the optimized MM kernel.

6 Software choices HPL (High Performance LINPACK) Block-cyclic distribution Column-wise storage of blocks MPI communication Cray s LINPACK Benchmark code Block-cyclic distribution Row-wise storage of blocks MPI or SHMEM communications 6

7 7 -D block cyclic data distribution Example on a x processor grid:

8 Local view: ScaLAPACK/HPL Blocks stored by columns, on processor : A, A, A,7 A, A, A,7 A(mb*nrblks, *ncblks) A, A, A,7 A 7, A 7, A 7,7 Advantage: With abstraction of BLAS and LAPACK routines, we can maintain much of the LAPACK design. Disadvantage: As matrix size gets large, distribution blocks get spread out through memory. 8

9 Local view: Cray LBM code Blocks stored by rows, processor : A, A, A,7 A(mb,,ncblks,) A, A, A,7 A(mb,,ncblks,) A, A, A,7 A(mb,,ncblks,) A 7, A 7, A 7,7 A(mb,,ncblks,) Advantage: Distribution blocks and block rows are contiguous. Disadvantage: More indexing with the -D array. 9

10 Main computation kernel After communication of the column and row blocks, each processor performs a matrix-matrix multiplication of the form: = - x In ScaLAPACK/HPL, one call to SGEMM is required for this operation.

11 Optimizing data layout for SGEMM With row-wise storage of blocks, one call to SGEMM is needed to update each local block row. = - = - = - x x x (all the same block)

12 Internals of SGEMM on TE Basic operation: -by- matrix times -by-n matrix n n x = A B C Innermost kernel: matrix-vector multiply, -element result The -by- block of A is used repeatedly and will reside in cache. The columns of B are streams (if LDB=, B is one long stream). The result vector is held in registers until combined into C.

13 Transition of SGEMM: TE X n n x = A B C TE: Result vector was elements long X: Result vector can be VL elements long (to 6)

14 Transition of SGEMM: TE X n n x = A B C TE: x block constrained by size of set of Scache (96W = 6x6) X: x block needs to fit in MB cache (6KW = x)

15 Transition of SGEMM: TE X n n x = A B C TE: Compute one column of C at a time

16 Transition of SGEMM: TE X n n x = A B C TE: Compute one column of C at a time X: Compute columns of C at a time 6

17 Transition of SGEMM: TE X n n x = A B C TE: Compute x matrix-vector multiply in innermost kernel 7

18 Transition of SGEMM: TE X n n x = A B C TE: Compute x matrix-vector multiply in innermost kernel X: Each SSP computes a 6 x matrix-vector multiply concurrently 8

19 Internals of SGEMM on X Basic operation: -by- matrix times -by-n matrix SSP SSP SSP SSP n n x = A B C The -by- block of A is used repeatedly and will reside in cache. B is read columns at a time and is shared by the SSPs. The result vector is held in registers until combined into C. 9

20 Prototype matrix multiply code subroutine sgemmnn( m, n, k, alpha, a, inra, b, inrb, & beta, c, inrc ) integer m, n, k, inra, inrb, inrc real alpha, beta, a(inra,*), b(inrb,*), c(inrc,*) integer i, js, m cdir$ SSP_PRIVATE dmmnn m = (m+)/ do i =, js = min( m, max( m-i*m, ) ) call dmmnn( js, n, k, alpha, a(+i*m,), inra, & b, inrb, beta, c(+i*m,), inrc ) end do return end Compile with: ftn -Oaggress -O -s default6 c sgemmnn.f

21 Optimizing the row exchanges A row exchange is performed at each step of the block column factorization to put the largest element (in absolute value) on the diagonal. The vector IPIV records the exchanges. Example: IPIV() = IPIV() = 6 IPIV() = 9 IPIV() = IPIV() = IPIV(6) = IPIV(7) = IPIV(8) =

22 Gather/scatter indices We can avoid synchronizing after every exchange by translating IPIV into gather and scatter permutation vectors. scatter gather

23 Optimizing the communication To optimize the communication, rows to be exchanged are first copied locally into a contiguous buffer.

24 When a p q grid isn t enough Shortcomings of existing algorithm: Many processors are idle during column factorization. Not every processor count factors neatly into N = p q. Example: = x = x = x6 = x On some systems it is better to leave one or two processors idle ( N p may not factor neatly). p

25 The Virtual Processor Grid Generalize the -D grid factorization to p q = k N p, where k, p N, and lcm(p,n ) = k N. p Example: 6 processors in a virtual grid p p This becomes the tiling pattern for the distributed -D matrix

26 Data structure for VPG Recall the -D array used for row-wise storage of blocks: A( mb,, ncblks, nrblks ) Now add another dimension for the virtual processor index: A( mb,, ncblks, nrblks, nvpi ) Maintains contiguousness of distribution blocks and row blocks. Need to add a loop over the virtual processor indices. No extra storage except for some extra buffers for each virtual processor. 6

27 Coding issues for VPG Loop doesn t always go from to nvpi. Example: Send second column of following matrix across rows. Send is initiated with virtual processor index on {,}, with v.p. index on {,}. Recv is initiated with virtual processor index on {,,,} and with v.p. index on {,} do i =, nvpi- {work on block numbered + mod(start-+i, nvpi)} end do 7

28 LINPACK Benchmark Performance CRAY X, MSPs Gflop/s N x grid x grid x grid 8

29 Summary Storage order of distribution blocks was optimized for the cache. Leading dimension was padded from 6 to 6 to optimize the matrix-multiply kernel. Main computational kernel uses SSP parallelism. Communication was optimized using SHMEM. No barriers! Communication was extensively overlapped with computation. Virtual processor grid improves parallelism of column and row operations. 9

Benchmark Results. 2006/10/03

Benchmark Results. 2006/10/03 Benchmark Results cychou@nchc.org.tw 2006/10/03 Outline Motivation HPC Challenge Benchmark Suite Software Installation guide Fine Tune Results Analysis Summary 2 Motivation Evaluate, Compare, Characterize