LINPACK Benchmark Optimizations on a Virtual Processor Grid


Edward Anderson, Maynard Brandt and Chao Yang, Cray Inc.

ABSTRACT: The "massively parallel" LINPACK benchmark is a familiar measure of high-performance computing performance and a useful check of the scalability of multi-processor computing systems. The CRAY X1 performs well on this benchmark, achieving in excess of 90% of peak performance per processor and near linear scalability. This paper will outline Cray's implementation of the LINPACK benchmark code and show how it leads to computational kernels that are easier to optimize than those of ScaLAPACK or the High Performance LINPACK code, HPL. Specific areas in which the algorithm has been tuned for the CRAY X1 are in communicating across rows or down columns, in interchanging rows to implement partial pivoting, and in multiplying dense matrices. The Cray LINPACK benchmark code has also been adapted to run on a virtual processor grid, that is, a p-by-q grid where p × q is a multiple of the number of processors. This feature is especially pertinent when the number of processors does not factorize neatly, but it can be used in any situation in which it is desired to assign more processors to row or column operations. The virtual processor grid concept is not specific to the LINPACK benchmark and could be applied to any application that uses a 2-D grid decomposition.

1. Introduction

The LINPACK benchmark attracts a lot of attention from performance analysts because it is the metric used to rank systems for the biannual Top 500 list. It is also a good predictor of performance for certain applications involving the solution of large dense linear systems or eigenvalue problems. The rules for the highly parallel LINPACK benchmark (Table 3 in [5]) allow any problem size and any algorithm to be used as long as it solves a dense 64-bit real linear system, Ax = b. In practice, the algorithm is always some form of Gaussian elimination with partial pivoting, and for maximum efficiency, the problem size is usually the largest problem that will fit in memory.
An excellent implementation of the LINPACK benchmark for distributed memory systems is High Performance LINPACK (HPL) [6], a C code loosely based on ScaLAPACK [3]. In this paper we describe an alternative implementation of the LINPACK benchmark developed at Cray Inc. that has been used on CRAY T3E and CRAY X1 systems. We will refer to this implementation as the Cray LINPACK Benchmark code. The Cray LINPACK Benchmark code is written in a mix of C and Fortran and uses Cray's one-sided message-passing library SHMEM [4] for communication. Like HPL, it assumes a 2-D block cyclic distribution of the matrix A and solves an augmented system using a block right-looking algorithm. Unlike HPL, it stores the distribution blocks by rows, instead of by columns. This storage order optimizes the memory access pattern of the dominant matrix multiply kernel for the highest levels of the memory hierarchy and preserves the logical structure of the distribution blocks. It is also a natural fit for a generalization of the usual 2-D processor grid into a virtual processor grid, that is, a p-by-q grid where p × q is a multiple of the number of processors.

2. Solving linear systems on distributed memory computers

The computation measured in the LINPACK benchmark is the solution of a dense linear system Ax = b, where A is an n-by-n 64-bit real matrix of full rank and b is a vector of length n. The standard technique for this problem uses Gaussian elimination with partial pivoting to compute a factorization A = LU, where L is a product of permutation and unit lower triangular matrices and U is upper triangular, from which the solution can be obtained by solving Ly = b for y and Ux = y for x. When there is only one right-hand side vector b, it is advantageous to solve an augmented system, in which b is stored as the (n+1)st column of the array A. Then the factorization applies L^-1 to b to form y, and the solution can be determined in one additional step by solving the triangular system Ux = y by back substitution.
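The augmented-system approach just described can be sketched in a few lines. The following is our own illustration in plain Python, not the Cray code: Gaussian elimination with partial pivoting is applied to [A | b], so that applying L^-1 to b happens during the factorization and only the back substitution Ux = y remains.

```python
def solve_augmented(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting
    on the augmented matrix [A | b]."""
    n = len(A)
    # Build the augmented matrix: b is stored as column n+1 of A.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        # Partial pivoting: bring the largest |M[i][k]|, i >= k, to row k.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            l = M[i][k] / M[k][k]          # multiplier (an entry of L)
            for j in range(k, n + 1):      # update includes the b column
                M[i][j] -= l * M[k][j]
    # Back substitution on Ux = y, where y now sits in column n+1.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x
```

For example, `solve_augmented([[2.0, 1.0], [1.0, 3.0]], [3.0, 4.0])` returns the solution of 2x + y = 3, x + 3y = 4.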
To solve Ax = b on a distributed memory computer, the matrix A is partitioned among the processors using a block-cyclic distribution on a 2-D processor grid. One possible assignment of blocks to processors on a 2-by-3 processor

grid is shown in Figure 1. The tiling pattern repeats as many times as is necessary, and the last row or column of blocks may be empty on some processors. The distribution block size and the aspect ratio of the processor grid are tunable parameters, although choices for the p-by-q grid are limited by the number of ways to factor the number of processors, N_p.

Figure 1: Block-cyclic distribution on a 2 × 3 processor grid

The right-looking algorithm to compute an LU factorization of a matrix A stored as in Figure 1 is described as follows. First, a panel of column blocks is factored using Gaussian elimination with partial pivoting. Next, information about the sequence of row interchanges that was used is broadcast to the other processors, and the row interchanges are applied to the rest of the matrix. The current block row is then updated by applying the transformations contained in the diagonal block; this operation consists of a triangular solve with multiple right-hand sides and is performed locally by the Level 3 BLAS routine xtrsm. Finally the block column below the diagonal block is communicated across rows and the block row to the right of the diagonal block is communicated down columns, and the rest of the matrix is updated by a rank-b update, performed locally by the Level 3 BLAS routine xgemm.

Block algorithms were first developed for shared memory computers in order to increase the re-use of data at the highest levels of the memory hierarchy and thereby improve performance. In the numerical linear algebra package LAPACK [1], the block size is an internal parameter, used only for tuning; it does not affect how the 2-D matrix is stored. In ScaLAPACK [3], an extension of LAPACK for distributed-memory computers, the distribution block size is analogous to the algorithm block size in LAPACK, and distributed BLAS were developed analogous to the standard BLAS so that much of the algorithmic structure of LAPACK could be preserved.
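The block-cyclic mapping of Figure 1 can be stated compactly: global block (I, J) of the partitioned matrix is owned by processor (I mod p, J mod q) on a p-by-q grid, and is that processor's local block (I div p, J div q). A minimal sketch in Python (our own notation, not code from the paper):

```python
def owner_and_local(I, J, p, q):
    """Return ((prow, pcol), (lrow, lcol)) for global block (I, J),
    0-based, under a 2-D block-cyclic distribution on a p-by-q grid."""
    return (I % p, J % q), (I // p, J // q)

# On the 2-by-3 grid of Figure 1, global block (3, 4) lands on grid
# position (1, 1) and is stored there as local block (1, 1).
grid, local = owner_and_local(3, 4, 2, 3)
```

The same two expressions, run in reverse, recover the global index of any locally stored block, which is all the indexing the factorization needs.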
Since Fortran stores arrays in column-major order, ScaLAPACK stored the distribution blocks column-wise on each processor, making the local portion of the distributed matrix a 2-D matrix as in the shared-memory case. The High Performance LINPACK code (HPL) [6] is a specialized version of the ScaLAPACK routine to solve a dense real linear system, put into a LINPACK Benchmark context with options for tuning. Like ScaLAPACK, HPL stores the distribution blocks column-wise on each processor, but it provides an option for storing the transpose of the matrix for better cache use on some architectures.

3. Row-wise storage of distribution blocks

While the column-wise storage of local distribution blocks is central to the ScaLAPACK design, there are both notational and performance disadvantages to storing these blocks in a 2-D array. First, the blocking factors used in the distribution are lost in this data structure and must be carried along in an array descriptor, separate from the array A. Although a distribution block is a natural piece for the algorithm designer to work with, it exists only as a sub-block of the 2-D array. Elements in the same column of a block of A are accessed with unit stride, but elements in the same row are accessed with a stride that grows with the matrix size. For very large problems, such as those solved in the LINPACK benchmark, even a small sub-block may span several memory pages, increasing the latency to load the sub-block into cache. Also, successively larger row strides will each have a different footprint in the cache, making it difficult to predict the performance of the algorithm for larger sizes based on the performance for smaller problems.

In the Cray LINPACK Benchmark code, the distribution blocks are stored row-wise on the local processors. This organization of blocks is most conveniently described by a 4-D array

   A( b, b, cblks, rblks )

where the first two indices describe a b-by-b distribution block and the last two indices describe the local column index and row index of each block, respectively.
For example, the first element of the last block stored on processor 5 in Figure 1 would be A( 1, 1, 2, 3 ), since it is the (1,1) element in the second column of blocks and the third row of blocks on processor 5.¹ Assuming the blocks are stored one after the other, this data structure can be regarded as rblks block rows of size b-by-(b × cblks) where it is convenient to do so.

¹ In Co-array Fortran notation, we could be even more complete and call the element A( 1, 1, 2, 3; 5 ).
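In Fortran column-major order, the 4-D array A(b, b, cblks, rblks) keeps each b-by-b block contiguous, with the blocks of a block row stored one after another. A small sketch of the resulting linear offset (our own helper, with the paper's 1-based indices):

```python
def offset(i, j, ci, ri, b, cblks):
    """Column-major offset of element (i, j) of local block (ci, ri)
    within A(b, b, cblks, rblks); all indices 1-based."""
    return (i - 1) + (j - 1) * b + (ci - 1) * b * b + (ri - 1) * b * b * cblks

# With b = 48 and cblks = 2, element A(1, 1, 2, 3) starts right after the
# five full blocks that precede it: two block rows of two blocks each,
# plus the first block of the third block row.
o = offset(1, 1, 2, 3, 48, 2)
```

Because the block index contributes whole multiples of b*b, every block occupies one contiguous stretch of b*b words, which is the property the next section exploits.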

Row-wise storage of blocks maintains the natural contiguity of the distribution blocks and facilitates their movement to the highest levels of the memory hierarchy. Because a block is contiguous in memory, its mapping to an associative cache is known a priori and its size can be chosen to fit in a given segment of cache. Also, a block row of local distribution blocks is contiguous, so a rank-b update as in the LINPACK benchmark will result in a unit stride access pattern for the entire block row, as shown in Figure 2. This access pattern can take advantage of stream buffers or hardware prefetch instructions if they are available.

A disadvantage of storing the distribution blocks row-wise is that it requires additional indexing of the local blocks. It also may not be any better than the standard technique unless the library routines have been optimized for it. But it has the potential to be a higher-performing alternative.

4. Optimization of the matrix multiply kernel

In the Cray LINPACK Benchmark code, the main matrix multiply kernel is a b-by-b matrix A times a b-by-n matrix B added to a b-by-n matrix C to produce a b-by-n result. The memory access pattern for this operation is shown in Figure 2. On the CRAY T3E, the innermost kernel was a matrix-vector multiply. The first time the block of A was used, it would be loaded from memory, but subsequently it would likely be found in the on-chip secondary cache (Scache) of the CRAY T3E processor. A would multiply a column of B loaded from memory, and the result vector would be accumulated in registers before being written to C. If the leading dimension of the B array matched the block size b, then the columns of B would be accessed in one continuous stream, activating the stream buffers of the CRAY T3E processor. The matrix C was also accessed as one continuous stream.
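The kernel just described, one matrix-vector product per column of B so that the cached block of A is re-used across all n columns, can be sketched as follows. This is plain Python for illustration only, not the tuned kernel:

```python
def rank_b_update(A, B, C):
    """C = C + A*B, with A b-by-b and B, C b-by-n, computed column by
    column of B so that A is re-used from cache n times."""
    b = len(A)
    n = len(B[0])
    for j in range(n):                 # one column of B at a time
        col = [B[i][j] for i in range(b)]
        for i in range(b):             # accumulate the result vector
            C[i][j] += sum(A[i][k] * col[k] for k in range(b))
    return C
```

Each column of B is touched once per pass over A, which is why matching B's leading dimension to b (one continuous stream) matters for the stream buffers.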
Thus the optimization of the matrix-vector multiply kernel for the CRAY T3E consisted of loading elements of A early enough to hide the Scache latency, spacing the operations far enough apart to hide the floating-point unit latency, and prefetching elements of the B vector to hide its latency from the stream buffers.

Constraints in the CRAY T3E processor design limited the matrix multiply performance to about 75% of peak. First, there were only 31 scalar floating-point registers, which were needed for accumulating the result vector as well as for temporarily holding elements of A and B. The latency of approximately 10 clock periods (CP) to Scache and 4 CP for the floating-point unit made instruction scheduling for cached data a challenge. The compiler and most of the scientific library unrolled loops by 4 or 8; for LINPACK, a special-purpose kernel was created that was unrolled by 12. This design meant that the optimal distribution block size b was a multiple of 12. The b-by-b block had to fit in the Scache of the T3E, which was 96 KB or 12 KW, but because of the random replacement policy of the 3-way set associative Scache, it was better to size the block for only one 4096-word set of the Scache. Thus the biggest possible block of A was 64-by-64, and coupled with the multiple-of-12 requirement and the 8-word Scache line size, the optimal block size for the CRAY T3E was 48. Since the innermost kernel was a 12-by-48 matrix-vector multiply, each vector of B was used four times as it was multiplied by each of the four 12-by-48 sub-blocks of A. The first time, B would be loaded from memory via the stream buffers, but the subsequent uses would hope to find B in the Scache. However, if the vector of B had the same relative cache offset as part of A, the random replacement policy meant that one of them might need to be reloaded from memory.

The CRAY X1 processor design directly addresses many of the limitations in the CRAY T3E memory hierarchy. The CRAY X1 multi-streaming processor (MSP) design is shown in Figure 3.
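The choice b = 48 above follows mechanically from the stated constraints: b must be a multiple of the unrolling factor 12 and of the 8-word Scache line size, and b*b words must fit in one 4096-word set of the Scache. A one-line check of that arithmetic:

```python
# Enumerate block sizes satisfying the T3E constraints stated above:
# multiple of 12 (unrolling), multiple of 8 (Scache line), and b*b words
# fitting in one 4096-word set of the 3-way set associative Scache.
candidates = [b for b in range(1, 65)
              if b % 12 == 0 and b % 8 == 0 and b * b <= 4096]
best = max(candidates)
```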
At the highest level of the hierarchy, there are 32 vector registers, each of length 64, on each of the four single-streaming processors (SSPs) of an MSP. Vector registers are used in the CRAY X1 matrix multiply kernel to hold elements of A and to accumulate results before writing back to C. In fact, there are so many registers available that we can improve the re-use of the A and B elements and keep all the vector functional units busy by computing four columns at a time, instead of just one. Also, the on-chip cache is 2 MB or 256 KW, allowing the block of A to be up to 512-by-512. Finally, the four SSPs on an MSP can perform the computations with the four sub-blocks of A concurrently, accessing elements of B through their shared cache. The matrix multiply kernel on a CRAY X1 processor is shown in Figure 4. Sustained performance of the matrix multiply kernel in the context of the LINPACK benchmark is approximately 95% of peak.

The matrix-multiply routine xgemm and the other Level 3 BLAS kernel, xtrsm, are the only parts of the Cray LINPACK benchmark code that are optimized at the SSP level. A prototype for the matrix multiply kernel showing the parallelization across SSPs is shown in Figure 5. This subroutine is compiled with

   ftn -Oaggress -O3 -s default64 -c sgemm.f

The SSP code, named DMMNN in this prototype, is implemented in assembly language to achieve the best possible instruction scheduling for elements of A that are expected to be found in the cache.

      subroutine sgemm( m, n, k, alpha, a, ira,
     &                  b, irb, beta, c, irc )
      integer m, n, k, ira, irb, irc
      real alpha, beta
      real a(ira,*), b(irb,*), c(irc,*)
      integer i, js, m4
cdir$ SSP_PRIVATE dmmnn

      m4 = (m+3)/4
      do i = 0, 3
         js = min( m4, max( m-i*m4, 0 ) )
         call dmmnn( js, n, k, alpha, a(1+i*m4,1),
     &               ira, b, irb, beta,
     &               c(1+i*m4,1), irc )
      end do
      return
      end

Figure 5: MSP code for matrix multiply, calling SSP subroutine DMMNN

5. Optimization of the row interchanges

The row exchange is a target for optimization because it involves moving data with non-unit stride, all processors participate, and it is all overhead: no floating point computations are performed. The sequence of row interchanges from the panel factorization is described by an index vector IPIV, where IPIV(i) = j indicates that row i was exchanged with row j. The indices are not unique, because a row that is exchanged out of the pivot row block early in the panel factorization may be copied back into the pivot row block in an exchange with a later row.² The challenge is to turn the description of a sequential process (the row exchanges of the panel factorization) into a block update that can be applied to the distributed matrix as a whole. We do this by translating the IPIV vector into index vectors ISCAT and IGATH, where ISCAT contains the destination row indices for the rows of the pivot block row and IGATH contains the indices of rows of the global array that are gathered into the rows of the pivot block row. Row indices in each of these vectors are unique, so the gather or scatter operation can be done as a block.

Since rows of the local array structure are non-unit stride, it is more efficient to copy them to a contiguous buffer before sending them. On processors in the pivot block row, the ISCAT vector is used to copy any rows to be transferred to a remote processor to a send buffer. On the remote processors, the IGATH vector is used to identify local rows involved in the row exchange and copy them to a buffer as well. Each processor participating in the row exchange then synchronizes with the pivot block row, after which each can get their data from the other's buffer.
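The IPIV-to-ISCAT/IGATH translation can be sketched by simulating the sequential swaps on a list that records which original row sits in each position. This is our own reconstruction of the idea, not the Cray routine, with 0-based indices for brevity:

```python
def translate_pivots(ipiv, n):
    """Translate sequential interchanges ipiv (ipiv[i] = j means row i was
    exchanged with row j, applied in order) into block index vectors:
    iscat[i] = final destination of original pivot-block row i, and
    igath[i] = original row gathered into pivot-block row i."""
    nb = len(ipiv)                     # panel width
    pos = list(range(n))               # pos[r] = original row now at row r
    for i, j in enumerate(ipiv):       # apply the exchanges sequentially
        pos[i], pos[j] = pos[j], pos[i]
    igath = pos[:nb]
    where = {row: r for r, row in enumerate(pos)}
    iscat = [where[i] for i in range(nb)]
    return iscat, igath
```

Unlike IPIV, the entries of each returned vector are unique, so the scatter and gather can each be done as a single block operation.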
A final synchronization when the transfer is complete is required to indicate that the buffers can be re-used.

² The most extreme example of row exchanges using partial pivoting is the matrix for which IPIV(1) = ... = IPIV(N) = N, i.e., every row is exchanged with row N.

In the CRAY LINPACK benchmark code, the local data copy of selected rows to a contiguous buffer is called transposeAitoB, and the corresponding routine to copy from columns of the contiguous buffer back to rows of the local array structure is called transposeBtoAi. On the CRAY T3E, these operations were optimized using E-registers, like other transpose operations [2]. On the CRAY X1, the vectorizing compiler handles the indexed transpose without any special effort.

6. The virtual processor grid

One shortcoming of mapping a distributed matrix to a 2-D processor grid is the 2-D grid itself. If the number of processors N_p is a perfect square, then any operations on a block column or block row (such as the panel factorization in the LINPACK benchmark) will involve only sqrt(N_p) processors. Moreover, not every processor count factors neatly into p × q. For example, early CRAY X1 systems were shipped with 32 four-processor nodes, with 31 nodes configured for application use. But the only factorization of 124 is 4 × 31 (or 31 × 4), and 123 = 3 × 41 and 122 = 2 × 61 are no better. Rather than leave 3 processors idle and solve the system on 121 processors, we conceived of using a virtual processor grid, in which each physical processor would handle more than 1 block in the virtual processor grid. More generally, one could solve any 2-D problem using N_p processors on a p × q grid such that p × q = k·N_p, where k ≥ 1, p ≤ N_p, and lcm(p, N_p) = k·N_p. The restriction p ≤ N_p is necessary to prevent assigning more than one block in a column of the virtual processor grid to the same processor, and the constraint lcm(p, N_p) = k·N_p simply guarantees that the virtual processor grid is not a multiple of another smaller virtual processor grid.
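These grid constraints are easy to check mechanically. A small sketch (our own helper; `math.lcm` requires Python 3.9+), exercised on the paper's examples:

```python
from math import lcm

def valid_virtual_grid(p, q, np_):
    """Check the virtual-grid conditions: p*q = k*Np with k >= 1,
    p <= Np (at most one block per grid column on each processor), and
    lcm(p, Np) = k*Np (not a multiple of a smaller virtual grid)."""
    k, rem = divmod(p * q, np_)
    return (rem == 0 and k >= 1
            and p <= np_
            and lcm(p, np_) == k * np_)
```

For the 4 × 3 grid on 6 processors, k = 2 and lcm(4, 6) = 12 = 2·6, so the grid is valid; a 2 × 6 grid on 6 processors fails the lcm test because it merely replicates a smaller grid.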
Using a virtual processor grid is not the same as multithreading, because we create only one process per processor, and the blocks of the virtual processor grid must sometimes be dealt with in a prescribed order, not always from 1 to k. An example using 6 processors on a 4 × 3 virtual processor grid is shown in Figure 6. This tiling pattern has two times the parallelism of a 2 × 3 grid for column operations. The only drawback to the virtual processor grid, other than algorithmic complexity, is a potentially longer synchronization time when one processor must deal with more than one of its blocks at a time. For instance, if column 1 and column 2 of the virtual processor grid in Figure 6 exchanged data, then processor 0 would have to exchange with 4 using its first block before exchanging with 2 using its second block. However, this effect is lessened on larger virtual grids, such as a 32 × 31 grid on 124

processors, where the blocks on the same processor are farther apart.

Figure 6: Six processors in a 4 × 3 virtual processor grid

In the Cray LINPACK benchmark code, the virtual processor indexing is handled by adding a 5th dimension to the local array structure A, which becomes

   A( b, b, cblks, rblks, vpi )

For some operations, such as the rank-b update in the LINPACK benchmark code, the order of processing the distribution blocks is not important, so we can update all the blocks with the same virtual processor index together. In other cases, particularly communication operations, each processor must process all its blocks in the virtual processor grid tile before going on to the next tile. Furthermore, the order of processing the blocks in a tile may change as the algorithm progresses, so it is necessary to maintain the current starting point istart, and process the block numbered 1 + mod( istart + i - 1, vpi ) in a loop from 1 to vpi.

Figure 7 shows the effect of the virtual processor grid algorithm on 124 processors of a CRAY X1. Both the 4 × 31 and 31 × 4 grids show similar performance, while the virtual processor grid is up to 10% faster. In the LINPACK benchmark, which is so dominated by matrix multiply that improvements to other parts of the code are usually in the 1% range, this degree of improvement is significant.

7. Summary

In optimizing the LINPACK benchmark for the CRAY X1, we have found new avenues for performance improvements in the code originally developed for a CRAY T3E. The storage order of the distribution blocks is an often overlooked parameter that affects how the data is mapped to the memory hierarchy in the main computational kernels. By parallelizing these kernels across the SSPs of an MSP, we achieved performance of 95% of peak for matrix multiplication and over 90% for the LINPACK benchmark overall. To further reduce inefficiencies in the row or column operations, which involve only a subset of the processors, we introduce the concept of a virtual processor grid.
A side benefit of this technique is that one can model the solution of a problem on k·N_p processors using only N_p processors if it will fit in memory.

Acknowledgements

The authors wish to acknowledge Jeff Brooks for his insights in the development of the original Cray LINPACK Benchmark code, and Mike Aamodt of the Benchmarking group for his support of the CRAY X1 benchmarking efforts.

References

[1] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, Third Edition, SIAM, Philadelphia.
[2] E. Anderson, J. Brooks, and T. Hewitt, The Benchmarker's Guide to Single-processor Optimization for CRAY T3E Systems, Cray Research technical report. (available at ftp://ftp.arsc.edu/pub/mpp/docs/bmguide.ps.z)
[3] L. S. Blackford et al., ScaLAPACK Users' Guide, SIAM, Philadelphia.
[4] Cray Inc., Man Page Collection: Shared Memory Access (SHMEM).
[5] J. J. Dongarra, Performance of Various Computers Using Standard Linear Equations Software, Technical Report CS-89-85, University of Tennessee.
[6] A. Petitet, R. C. Whaley, J. Dongarra, A. Cleary, HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers, Version 1.0a.

About the Authors

Ed Anderson is a consultant to Cray Inc. working with the Benchmarking group. He worked for Cray Research in the Scientific Libraries and Benchmarking groups. Ed can be reached at eanderson@na-net.ornl.gov. Maynard Brandt retired from Cray Inc. in 2004 after many years in the Benchmarking group. He still lives in the Minneapolis/St. Paul area and can be found at Gopher basketball games. Chao Yang is a Senior Applications Analyst with Cray Inc., specializing in mathematical software development. Chao can be reached at Cray Inc., 1340 Mendota Heights Road, Mendota Heights, MN 55120, cwy@cray.com.

CUG 2004 Proceedings

Figure 2: Matrix-multiply C = A*B on the CRAY T3E, highlighting the matrix-vector kernel

Figure 3: CRAY X1 multi-streaming processor design

Figure 4: Matrix-multiply C = A*B on the CRAY X1, highlighting the SSP parallelism and computation of four vectors concurrently

Figure 7: LINPACK Benchmark performance on a virtual processor grid, compared to a conventional processor grid (Gflop/s vs. N on a CRAY X1 with 124 MSPs; 4x31, 31x4, and 32x31 grids)


More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 19 Query Optimizatio Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Query optimizatio Coducted by a query optimizer i a DBMS Goal:

More information

A Study on the Performance of Cholesky-Factorization using MPI

A Study on the Performance of Cholesky-Factorization using MPI A Study o the Performace of Cholesky-Factorizatio usig MPI Ha S. Kim Scott B. Bade Departmet of Computer Sciece ad Egieerig Uiversity of Califoria Sa Diego {hskim, bade}@cs.ucsd.edu Abstract Cholesky-factorizatio

More information

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects. The

More information

SPIRAL DSP Transform Compiler:

SPIRAL DSP Transform Compiler: SPIRAL DSP Trasform Compiler: Applicatio Specific Hardware Sythesis Peter A. Milder (peter.milder@stoybroo.edu) Fraz Frachetti, James C. Hoe, ad Marus Pueschel Departmet of ECE Caregie Mello Uiversity

More information

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time ( 3.1) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step- by- step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Ruig Time Most algorithms trasform iput objects ito output objects. The

More information

How do we evaluate algorithms?

How do we evaluate algorithms? F2 Readig referece: chapter 2 + slides Algorithm complexity Big O ad big Ω To calculate ruig time Aalysis of recursive Algorithms Next time: Litterature: slides mostly The first Algorithm desig methods:

More information

Multiprocessors. HPC Prof. Robert van Engelen

Multiprocessors. HPC Prof. Robert van Engelen Multiprocessors Prof. Robert va Egele Overview The PMS model Shared memory multiprocessors Basic shared memory systems SMP, Multicore, ad COMA Distributed memory multicomputers MPP systems Network topologies

More information

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1 Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Memory Hierarchy (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Itroductio Programmers wat ulimited amouts

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Ruig Time of a algorithm Ruig Time Upper Bouds Lower Bouds Examples Mathematical facts Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite

More information

Proceedings of the 4th Annual Linux Showcase & Conference, Atlanta

Proceedings of the 4th Annual Linux Showcase & Conference, Atlanta USENIX Associatio Proceedigs of the 4th Aual Liux Showcase & Coferece, Atlata Atlata, Georgia, USA October 1 14, 2 THE ADVANCED COMPUTING SYSTEMS ASSOCIATION 2 by The USENIX Associatio All Rights Reserved

More information

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON Roberto Lopez ad Eugeio Oñate Iteratioal Ceter for Numerical Methods i Egieerig (CIMNE) Edificio C1, Gra Capitá s/, 08034 Barceloa, Spai ABSTRACT I this work

More information

Lecture 18. Optimization in n dimensions

Lecture 18. Optimization in n dimensions Lecture 8 Optimizatio i dimesios Itroductio We ow cosider the problem of miimizig a sigle scalar fuctio of variables, f x, where x=[ x, x,, x ]T. The D case ca be visualized as fidig the lowest poit of

More information

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition.

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition. Computer Architecture A Quatitative Approach, Sixth Editio Chapter 2 Memory Hierarchy Desig 1 Itroductio Programmers wat ulimited amouts of memory with low latecy Fast memory techology is more expesive

More information

Improving Template Based Spike Detection

Improving Template Based Spike Detection Improvig Template Based Spike Detectio Kirk Smith, Member - IEEE Portlad State Uiversity petra@ee.pdx.edu Abstract Template matchig algorithms like SSE, Covolutio ad Maximum Likelihood are well kow for

More information

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence _9.qxd // : AM Page Chapter 9 Sequeces, Series, ad Probability 9. Sequeces ad Series What you should lear Use sequece otatio to write the terms of sequeces. Use factorial otatio. Use summatio otatio to

More information

Course Site: Copyright 2012, Elsevier Inc. All rights reserved.

Course Site:   Copyright 2012, Elsevier Inc. All rights reserved. Course Site: http://cc.sjtu.edu.c/g2s/site/aca.html 1 Computer Architecture A Quatitative Approach, Fifth Editio Chapter 2 Memory Hierarchy Desig 2 Outlie Memory Hierarchy Cache Desig Basic Cache Optimizatios

More information

Pattern Recognition Systems Lab 1 Least Mean Squares

Pattern Recognition Systems Lab 1 Least Mean Squares Patter Recogitio Systems Lab 1 Least Mea Squares 1. Objectives This laboratory work itroduces the OpeCV-based framework used throughout the course. I this assigmet a lie is fitted to a set of poits usig

More information

Message Integrity and Hash Functions. TELE3119: Week4

Message Integrity and Hash Functions. TELE3119: Week4 Message Itegrity ad Hash Fuctios TELE3119: Week4 Outlie Message Itegrity Hash fuctios ad applicatios Hash Structure Popular Hash fuctios 4-2 Message Itegrity Goal: itegrity (ot secrecy) Allows commuicatig

More information

A Note on Least-norm Solution of Global WireWarping

A Note on Least-norm Solution of Global WireWarping A Note o Least-orm Solutio of Global WireWarpig Charlie C. L. Wag Departmet of Mechaical ad Automatio Egieerig The Chiese Uiversity of Hog Kog Shati, N.T., Hog Kog E-mail: cwag@mae.cuhk.edu.hk Abstract

More information

Elementary Educational Computer

Elementary Educational Computer Chapter 5 Elemetary Educatioal Computer. Geeral structure of the Elemetary Educatioal Computer (EEC) The EEC coforms to the 5 uits structure defied by vo Neuma's model (.) All uits are preseted i a simplified

More information

Introduction to Computing Systems: From Bits and Gates to C and Beyond 2 nd Edition

Introduction to Computing Systems: From Bits and Gates to C and Beyond 2 nd Edition Lecture Goals Itroductio to Computig Systems: From Bits ad Gates to C ad Beyod 2 d Editio Yale N. Patt Sajay J. Patel Origial slides from Gregory Byrd, North Carolia State Uiversity Modified slides by

More information

Polynomial Functions and Models. Learning Objectives. Polynomials. P (x) = a n x n + a n 1 x n a 1 x + a 0, a n 0

Polynomial Functions and Models. Learning Objectives. Polynomials. P (x) = a n x n + a n 1 x n a 1 x + a 0, a n 0 Polyomial Fuctios ad Models 1 Learig Objectives 1. Idetify polyomial fuctios ad their degree 2. Graph polyomial fuctios usig trasformatios 3. Idetify the real zeros of a polyomial fuctio ad their multiplicity

More information

Python Programming: An Introduction to Computer Science

Python Programming: An Introduction to Computer Science Pytho Programmig: A Itroductio to Computer Sciece Chapter 6 Defiig Fuctios Pytho Programmig, 2/e 1 Objectives To uderstad why programmers divide programs up ito sets of cooperatig fuctios. To be able to

More information

Lecture 5. Counting Sort / Radix Sort

Lecture 5. Counting Sort / Radix Sort Lecture 5. Coutig Sort / Radix Sort T. H. Corme, C. E. Leiserso ad R. L. Rivest Itroductio to Algorithms, 3rd Editio, MIT Press, 2009 Sugkyukwa Uiversity Hyuseug Choo choo@skku.edu Copyright 2000-2018

More information

6.854J / J Advanced Algorithms Fall 2008

6.854J / J Advanced Algorithms Fall 2008 MIT OpeCourseWare http://ocw.mit.edu 6.854J / 18.415J Advaced Algorithms Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 18.415/6.854 Advaced Algorithms

More information

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1 Switchig Hardware Sprig 208 CS 438 Staff, Uiversity of Illiois Where are we? Uderstad Differet ways to move through a etwork (forwardig) Read sigs at each switch (datagram) Follow a kow path (virtual circuit)

More information

An Efficient Algorithm for Graph Bisection of Triangularizations

An Efficient Algorithm for Graph Bisection of Triangularizations A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045 Oe Brookigs Drive St. Louis, Missouri 63130-4899, USA jaegerg@cse.wustl.edu

More information

SCI Reflective Memory

SCI Reflective Memory Embedded SCI Solutios SCI Reflective Memory (Experimetal) Atle Vesterkjær Dolphi Itercoect Solutios AS Olaf Helsets vei 6, N-0621 Oslo, Norway Phoe: (47) 23 16 71 42 Fax: (47) 23 16 71 80 Mail: atleve@dolphiics.o

More information

Chapter 9. Pointers and Dynamic Arrays. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 9. Pointers and Dynamic Arrays. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 9 Poiters ad Dyamic Arrays Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 9.1 Poiters 9.2 Dyamic Arrays Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Slide 9-3

More information

Math Section 2.2 Polynomial Functions

Math Section 2.2 Polynomial Functions Math 1330 - Sectio. Polyomial Fuctios Our objectives i workig with polyomial fuctios will be, first, to gather iformatio about the graph of the fuctio ad, secod, to use that iformatio to geerate a reasoably

More information

Chapter 24. Sorting. Objectives. 1. To study and analyze time efficiency of various sorting algorithms

Chapter 24. Sorting. Objectives. 1. To study and analyze time efficiency of various sorting algorithms Chapter 4 Sortig 1 Objectives 1. o study ad aalyze time efficiecy of various sortig algorithms 4. 4.7.. o desig, implemet, ad aalyze bubble sort 4.. 3. o desig, implemet, ad aalyze merge sort 4.3. 4. o

More information

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000. 5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator

More information

Computer Architecture ELEC3441

Computer Architecture ELEC3441 CPU-Memory Bottleeck Computer Architecture ELEC44 CPU Memory Lecture 8 Cache Dr. Hayde Kwok-Hay So Departmet of Electrical ad Electroic Egieerig Performace of high-speed computers is usually limited by

More information

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved.

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved. Chapter 11 Frieds, Overloaded Operators, ad Arrays i Classes Copyright 2014 Pearso Addiso-Wesley. All rights reserved. Overview 11.1 Fried Fuctios 11.2 Overloadig Operators 11.3 Arrays ad Classes 11.4

More information

Computers and Scientific Thinking

Computers and Scientific Thinking Computers ad Scietific Thikig David Reed, Creighto Uiversity Chapter 15 JavaScript Strigs 1 Strigs as Objects so far, your iteractive Web pages have maipulated strigs i simple ways use text box to iput

More information

DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO

DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO Sagwo Seo, Trevor Mudge Advaced Computer Architecture Laboratory Uiversity of Michiga at A Arbor {swseo, tm}@umich.edu Yumig Zhu, Chaitali

More information

Numerical Methods Lecture 6 - Curve Fitting Techniques

Numerical Methods Lecture 6 - Curve Fitting Techniques Numerical Methods Lecture 6 - Curve Fittig Techiques Topics motivatio iterpolatio liear regressio higher order polyomial form expoetial form Curve fittig - motivatio For root fidig, we used a give fuctio

More information

Speeding-up dynamic programming in sequence alignment

Speeding-up dynamic programming in sequence alignment Departmet of Computer Sciece Aarhus Uiversity Demark Speedig-up dyamic programmig i sequece aligmet Master s Thesis Dug My Hoa - 443 December, Supervisor: Christia Nørgaard Storm Pederse Implemetatio code

More information

One advantage that SONAR has over any other music-sequencing product I ve worked

One advantage that SONAR has over any other music-sequencing product I ve worked *gajedra* D:/Thomso_Learig_Projects/Garrigus_163132/z_productio/z_3B2_3D_files/Garrigus_163132_ch17.3d, 14/11/08/16:26:39, 16:26, page: 647 17 CAL 101 Oe advatage that SONAR has over ay other music-sequecig

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Pipeliig Sigle-Cycle Disadvatages & Advatages Clk Uses the clock cycle iefficietly the clock cycle must

More information

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS SIAM J. SCI. COMPUT. Vol. 22, No. 6, pp. 2113 2134 c 21 Society for Idustrial ad Applied Mathematics FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS ZHAO ZHANG AND XIAODONG ZHANG

More information

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software Structurig Redudacy for Fault Tolerace CSE 598D: Fault Tolerat Software What do we wat to achieve? Versios Damage Assessmet Versio 1 Error Detectio Iputs Versio 2 Voter Outputs State Restoratio Cotiued

More information

Accuracy Improvement in Camera Calibration

Accuracy Improvement in Camera Calibration Accuracy Improvemet i Camera Calibratio FaJie L Qi Zag ad Reihard Klette CITR, Computer Sciece Departmet The Uiversity of Aucklad Tamaki Campus, Aucklad, New Zealad fli006, qza001@ec.aucklad.ac.z r.klette@aucklad.ac.z

More information

Panel for Adobe Premiere Pro CC Partner Solution

Panel for Adobe Premiere Pro CC Partner Solution Pael for Adobe Premiere Pro CC Itegratio for more efficiecy The makes video editig simple, fast ad coveiet. The itegrated pael gives users immediate access to all medialoopster features iside Adobe Premiere

More information

Solution printed. Do not start the test until instructed to do so! CS 2604 Data Structures Midterm Spring, Instructions:

Solution printed. Do not start the test until instructed to do so! CS 2604 Data Structures Midterm Spring, Instructions: CS 604 Data Structures Midterm Sprig, 00 VIRG INIA POLYTECHNIC INSTITUTE AND STATE U T PROSI M UNI VERSI TY Istructios: Prit your ame i the space provided below. This examiatio is closed book ad closed

More information

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015. Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Hash Tables xkcd. http://xkcd.com/221/. Radom Number. Used with permissio uder Creative

More information

GPUMP: a Multiple-Precision Integer Library for GPUs

GPUMP: a Multiple-Precision Integer Library for GPUs GPUMP: a Multiple-Precisio Iteger Library for GPUs Kaiyog Zhao ad Xiaowe Chu Departmet of Computer Sciece, Hog Kog Baptist Uiversity Hog Kog, P. R. Chia Email: {kyzhao, chxw}@comp.hkbu.edu.hk Abstract

More information

IMP: Superposer Integrated Morphometrics Package Superposition Tool

IMP: Superposer Integrated Morphometrics Package Superposition Tool IMP: Superposer Itegrated Morphometrics Package Superpositio Tool Programmig by: David Lieber ( 03) Caisius College 200 Mai St. Buffalo, NY 4208 Cocept by: H. David Sheets, Dept. of Physics, Caisius College

More information

EE123 Digital Signal Processing

EE123 Digital Signal Processing Last Time EE Digital Sigal Processig Lecture 7 Block Covolutio, Overlap ad Add, FFT Discrete Fourier Trasform Properties of the Liear covolutio through circular Today Liear covolutio with Overlap ad add

More information

ECE4050 Data Structures and Algorithms. Lecture 6: Searching

ECE4050 Data Structures and Algorithms. Lecture 6: Searching ECE4050 Data Structures ad Algorithms Lecture 6: Searchig 1 Search Give: Distict keys k 1, k 2,, k ad collectio L of records of the form (k 1, I 1 ), (k 2, I 2 ),, (k, I ) where I j is the iformatio associated

More information

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a 4. [10] Usig a combiatorial argumet, prove that for 1: = 0 = Let A ad B be disjoit sets of cardiality each ad C = A B. How may subsets of C are there of cardiality. We are selectig elemets for such a subset

More information

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory!

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory! Why Care About the Memory Hierarchy? Memory Virtual Memory -DRAM Memory Gap (latecy) Reasos: Multi process systems (abstractio & memory protectio) Solutio: Tables (holdig per process traslatios) Fast traslatio

More information

CS200: Hash Tables. Prichard Ch CS200 - Hash Tables 1

CS200: Hash Tables. Prichard Ch CS200 - Hash Tables 1 CS200: Hash Tables Prichard Ch. 13.2 CS200 - Hash Tables 1 Table Implemetatios: average cases Search Add Remove Sorted array-based Usorted array-based Balaced Search Trees O(log ) O() O() O() O(1) O()

More information

Computer Science Foundation Exam. August 12, Computer Science. Section 1A. No Calculators! KEY. Solutions and Grading Criteria.

Computer Science Foundation Exam. August 12, Computer Science. Section 1A. No Calculators! KEY. Solutions and Grading Criteria. Computer Sciece Foudatio Exam August, 005 Computer Sciece Sectio A No Calculators! Name: SSN: KEY Solutios ad Gradig Criteria Score: 50 I this sectio of the exam, there are four (4) problems. You must

More information

An Efficient Algorithm for Graph Bisection of Triangularizations

An Efficient Algorithm for Graph Bisection of Triangularizations Applied Mathematical Scieces, Vol. 1, 2007, o. 25, 1203-1215 A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045, Oe

More information

GE FUNDAMENTALS OF COMPUTING AND PROGRAMMING UNIT III

GE FUNDAMENTALS OF COMPUTING AND PROGRAMMING UNIT III GE2112 - FUNDAMENTALS OF COMPUTING AND PROGRAMMING UNIT III PROBLEM SOLVING AND OFFICE APPLICATION SOFTWARE Plaig the Computer Program Purpose Algorithm Flow Charts Pseudocode -Applicatio Software Packages-

More information

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS APPLICATION NOTE PACE175AE BUILT-IN UNCTIONS About This Note This applicatio brief is iteded to explai ad demostrate the use of the special fuctios that are built ito the PACE175AE processor. These powerful

More information

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures COMP 633 - Parallel Computig Lecture 2 August 24, 2017 : The PRAM model ad complexity measures 1 First class summary This course is about parallel computig to achieve high-er performace o idividual problems

More information

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers * Load balaced Parallel Prime umber Geerator with Sieve of Eratosthees o luster omputers * Soowook Hwag*, Kyusik hug**, ad Dogseug Kim* *Departmet of Electrical Egieerig Korea Uiversity Seoul, -, Rep. of

More information

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis Outlie ad Readig Aalysis of Algorithms Iput Algorithm Output Ruig time ( 3.) Pseudo-code ( 3.2) Coutig primitive operatios ( 3.3-3.) Asymptotic otatio ( 3.6) Asymptotic aalysis ( 3.7) Case study Aalysis

More information

top() Applications of Stacks

top() Applications of Stacks CS22 Algorithms ad Data Structures MW :00 am - 2: pm, MSEC 0 Istructor: Xiao Qi Lecture 6: Stacks ad Queues Aoucemets Quiz results Homework 2 is available Due o September 29 th, 2004 www.cs.mt.edu~xqicoursescs22

More information

Performance Plus Software Parameter Definitions

Performance Plus Software Parameter Definitions Performace Plus+ Software Parameter Defiitios/ Performace Plus Software Parameter Defiitios Chapma Techical Note-TG-5 paramete.doc ev-0-03 Performace Plus+ Software Parameter Defiitios/2 Backgroud ad Defiitios

More information

3D Model Retrieval Method Based on Sample Prediction

3D Model Retrieval Method Based on Sample Prediction 20 Iteratioal Coferece o Computer Commuicatio ad Maagemet Proc.of CSIT vol.5 (20) (20) IACSIT Press, Sigapore 3D Model Retrieval Method Based o Sample Predictio Qigche Zhag, Ya Tag* School of Computer

More information

Math 10C Long Range Plans

Math 10C Long Range Plans Math 10C Log Rage Plas Uits: Evaluatio: Homework, projects ad assigmets 10% Uit Tests. 70% Fial Examiatio.. 20% Ay Uit Test may be rewritte for a higher mark. If the retest mark is higher, that mark will

More information

ISSN (Print) Research Article. *Corresponding author Nengfa Hu

ISSN (Print) Research Article. *Corresponding author Nengfa Hu Scholars Joural of Egieerig ad Techology (SJET) Sch. J. Eg. Tech., 2016; 4(5):249-253 Scholars Academic ad Scietific Publisher (A Iteratioal Publisher for Academic ad Scietific Resources) www.saspublisher.com

More information

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 10: Caches Prof. Yajig Li Uiversity of Chicago Midterm Recap Overview ad fudametal cocepts ISA Uarch Datapath, cotrol Sigle cycle, multi cycle Pipeliig Basic idea,

More information

COMMUNICATION-OPTIMAL PARALLEL AND SEQUENTIAL CHOLESKY DECOMPOSITION

COMMUNICATION-OPTIMAL PARALLEL AND SEQUENTIAL CHOLESKY DECOMPOSITION SIA J. SCI. COUT. Vol. 2, No. 6, pp. 495 52 c 2010 Society for Idustrial ad Applied athematics COUNICATION-OTIAL ARALLEL AND SEQUENTIAL CHOLESKY DECOOSITION GREY BALLARD, JAES DEEL, OLGA HOLTZ, AND ODED

More information

Cache-Optimal Methods for Bit-Reversals

Cache-Optimal Methods for Bit-Reversals Proceedigs of the ACM/IEEE Supercomputig Coferece, November 1999, Portlad, Orego, U.S.A. Cache-Optimal Methods for Bit-Reversals Zhao Zhag ad Xiaodog Zhag Departmet of Computer Sciece College of William

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 22 Database Recovery Techiques Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Recovery algorithms Recovery cocepts Write-ahead

More information