Communication-Minimal Partitioning and Data Alignment for Affine Nested Loops

HYUK-JAE LEE (1) AND JOSÉ A. B. FORTES (2)
(1) Department of Computer Science, Louisiana Tech University, Ruston, LA 71272, USA
(2) School of Electrical and Computer Engineering, Purdue University, W. Lafayette, IN 47907, USA
Email: hlee@engr.latech.edu

Data alignment and computation domain partitioning techniques have been widely investigated to reduce communication overheads in distributed memory systems. This paper considers how these two techniques can be combined and applied to affine nested loops. Current data alignment techniques focus on individual entries of data arrays and, in general, cannot be used directly for cases when blocks of entries should be aligned collectively. This paper shows that existing data alignment techniques can be applied to partitioned algorithms if the null space of the data array indexing matrix is a boundary of computation blocks or the intersection of some computation block boundaries. These conditions can be used to generate several different partitionings and time-space transformations for a given target architecture. An example illustrates how it is possible to trade off the number of communications and the memory space. Another example shows partitions of matrix-matrix multiplication that have smaller communication-computation ratios than Cannon's algorithm.

Received November 10, 1996; revised July 25, 1997

1. INTRODUCTION

This paper investigates the problem of data alignment when computation domains are partitioned. Conditions on partitions, data alignments and time-space mappings that minimize communication are provided. These conditions can be used to derive data alignments and time-space mappings for partitioned (i.e. blocked) algorithms.

Interprocessor communication is expensive for distributed memory parallel computers. Hence, extensive research has focused on minimizing communication overheads [1-15]. An important technique is to allocate data to memory modules so that individual processor computations find their operands in local memory. To reach this goal it is necessary to align the operands of any given operation. This problem turns out to be NP-complete and many good heuristics have been proposed [7-11]. Another technique for communication minimization is to partition a computation domain into blocks that do not need data from other blocks. Partitioning techniques can also be used effectively to map a large problem onto a small number of processors. Although a partitioning technique applicable to arbitrary algorithms does not exist, systematic techniques have been developed for optimal partitioning of uniform dependence algorithms [12, 14].

To minimize the communication overheads of a distributed memory program, one must consider data alignment and partitioning techniques together. Current data alignment techniques focus on individual data entries, but it is not clear whether these techniques can be used for blocked data, i.e. when many data entries in a block need to be aligned collectively. In order to be able to use existing alignment techniques, a block of data needs to be treated as a single data entry (just as a computation block can be treated as a single computation). However, this may create significant additional communication overheads (see Example 3.1). Previous partitioning techniques (e.g. [14] and references therein) focused on minimizing the communication-computation ratio for a single block, but did not consider data alignment issues. These techniques focused on uniform dependence algorithms with data array domains which are the same (within a constant offset) as the computation domains. When a computation domain is the same as a data domain, the partitioning of a data domain can also be the same as that of a computation domain.
Thus, a data block can be treated as a single data entry without additional communication overheads. Therefore, previous techniques designed for uniform dependence algorithms do not involve additional overheads for data alignment. However, for algorithms with non-uniform dependences, such as affine nested loops or BLAS-like algorithms [10], it is not trivial to partition data arrays so that each block can be treated as a single data entry. This paper considers the combination of partitioning and data alignment for algorithms described by an affine nested loop, that is, a nested loop with data access patterns which can be described by an affine function (a formal definition is given in Section 2).

For these algorithms, data alignment may cause significant communication overhead when the computation and data domains are partitioned improperly. To leverage existing data alignment techniques for partitioned affine nested loops, this paper investigates the conditions on algorithm partitions that allow data array blocks to be treated as single data entries without additional overheads, so that existing data alignment techniques for individual data entries can be used for the alignment of partitioned algorithms and data. Based on these conditions, many partitionings and time-space mappings can be derived according to the number of processors, the memory per processor and the communication costs.

The rest of this paper is organized as follows. Useful notation, terminology and definitions are provided in Section 2. Section 3 investigates the relation between a data partition and a computation partition. Section 4 examines conditions on partitionings, data alignments and time-space mappings that minimize communication. Section 5 discusses the efficiency of various partitionings and time-space mappings for given target architectures. Section 6 concludes the paper.

2. NOTATION, TERMINOLOGY AND DEFINITIONS

DEFINITION 2.1. (nested loop). A nested loop is a 4-tuple (J, S, V, F) where

1. J is the loop index set (i.e. the set of indices of all iterations). Each computation is indexed by a vector j ∈ J (for a given iteration of a nested loop program, the ith entry of j corresponds to the value of the ith loop counter).
2. S is the set of statements {S_1, S_2, ..., S_ξ} in the loop body.
3. V is the set of data arrays {V_1, V_2, ..., V_η}.
4. F is the set of indexing functions {f_1, ..., f_η} which define how data array V_i is accessed in the loop. The image f_i(J) is called the index set of data array V_i.

In this paper, Y_i denotes the index set of a data array V_i, i.e. Y_i = f_i(J). An affine nested loop is a nested loop whose indexing functions are affine, i.e. every f ∈ F is of the form

f(j) = F j + f_0, for j ∈ J, (1)

where F is a dim(Y) × dim(J) matrix and f_0 is a dim(Y)-dimensional vector. F is called the indexing matrix and f_0 the indexing offset. All algorithms considered in this paper are assumed to be affine nested loops. Throughout this paper, n denotes the dimension of J.

A time-space transformation is a mapping from an index set into the domain of time and space.

DEFINITION 2.2. (time-space transformation, virtual processor array). A time-space transformation of a nested loop is a mapping T : J → Z^n such that for any j ∈ J, T(j) = (t, x)ᵀ where t is the time of execution of the computation indexed by j and x corresponds to the coordinates of the processor where the execution takes place. The projection of T(J) onto space (i.e. the set of all possible values of x) is called the virtual processor array.

DEFINITION 2.3. (data alignment). Let Y be the index set of an array V and X be a virtual processor array. A data alignment of array V is a mapping, p, from Y to X.

A partition of an index set J is a set of non-empty disjoint subsets of J, the union of which equals J. The subsets making up a partition are called the blocks of the partition. The partitions of interest in this paper can be described by an equivalence relation such that j_1 and j_2 are in the same block if and only if q(j_1) = q(j_2) for any j_1, j_2 ∈ J, where q is of the form q(j) = ⌊D j / β⌋, D ∈ Z^{n×n}, β ∈ Z^n (the division and the floor are taken elementwise). The image of J under q is called the index set of the partition and is denoted by Ĵ. The block whose elements are mapped to ĵ ∈ Ĵ is denoted by Q⁻¹(ĵ). Note that Q⁻¹ does not represent the inverse function of q (since q is not a one-to-one mapping, it does not have an inverse).
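The block-index mapping q and the blocks Q⁻¹(ĵ) can be made concrete with a short sketch. The following Python fragment is only an illustration of the definitions above; the function names q and block_of and the use of NumPy are choices made here, not part of the paper.

    import numpy as np

    def q(j, D, beta):
        # block index of iteration j under q(j) = floor(D j / beta), taken elementwise
        return tuple(np.floor(D @ np.asarray(j) / np.asarray(beta)).astype(int))

    def block_of(j_hat, J, D, beta):
        # Q^{-1}(j_hat): all iterations of the index set J whose block index is j_hat
        return [j for j in J if q(j, D, beta) == tuple(j_hat)]

For instance, with D the identity and β = (2, 4, 3)ᵀ, every iteration of an 8 × 16 × 12 index set is assigned to exactly one of 4 × 4 × 4 blocks, which is the partition used in Example 2.1 below.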
Treating each block as a single computational unit, a time-space transformation of a partitioned index set can be defined as a mapping from the index set of the partition to the time-space domain, i.e. T̂ : Ĵ → (t̂, X̂). The transformation matrix and the modulus vector associated with T̂ are denoted by T̂ and m̂ respectively. Note that 'ˆ' appears in the notation used for partitioned index sets and functions defined on them. This is also true for X̂, which denotes the index set of the physical machine, whereas X denotes the index set of the virtual processor array.

EXAMPLE 2.1. Consider the matrix-matrix multiplication algorithm which computes C = A × B where A, B and C are (8 × 12), (12 × 16) and (8 × 16) matrices respectively.

DO i = 0, 7
  DO j = 0, 15
    DO k = 0, 11
      c(i, j) = c(i, j) + a(i, k) × b(k, j)
CONTINUE

Suppose that the loop index set is partitioned by an equivalence relation such that j_1 and j_2 are in the same block if and only if q(j_1) = q(j_2), where q(j) = ⌊j / (2, 4, 3)ᵀ⌋, i.e. D is the 3 × 3 identity matrix and β = (2, 4, 3)ᵀ. Then, the size of a block is 2 × 4 × 3 while the size of the partition is 4 × 4 × 4. Thus, the index set of the partition, Ĵ, also consists of 4 × 4 × 4 elements. Q⁻¹(ĵ) denotes the block indexed by ĵ. For example, Q⁻¹((0, 0, 0)ᵀ) denotes the block {j | (0, 0, 0)ᵀ ≤ j < (2, 4, 3)ᵀ} while Q⁻¹((1, 0, 0)ᵀ) denotes the block {j | (2, 0, 0)ᵀ ≤ j < (4, 4, 3)ᵀ}.
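Treating each block of Example 2.1 as a single computational unit amounts to rewriting the triple loop as a loop over block indices ĵ ∈ Ĵ, with all iterations of Q⁻¹(ĵ) executed together. A minimal sketch of this blocked execution (NumPy-based, with hypothetical random inputs; not code from the paper) is:

    import numpy as np

    A = np.random.rand(8, 12); B = np.random.rand(12, 16); C = np.zeros((8, 16))
    for ih in range(4):                   # j_hat_1
        for jh in range(4):               # j_hat_2
            for kh in range(4):           # j_hat_3: (ih, jh, kh) indexes one block
                i0, j0, k0 = 2 * ih, 4 * jh, 3 * kh      # block sizes (2, 4, 3)
                # all iterations of Q^{-1}((ih, jh, kh)) execute as one unit
                C[i0:i0+2, j0:j0+4] += A[i0:i0+2, k0:k0+3] @ B[k0:k0+3, j0:j0+4]
    assert np.allclose(C, A @ B)          # the blocked loop computes the same product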

FIGURE 1. (a) Individual data alignment and (b) blocked data alignment.

A partition of a data array index set Y can also be defined by an equivalence relation such that y_1 and y_2 are in the same block if and only if r(y_1) = r(y_2) for any y_1, y_2 ∈ Y, where r is of the form r(y) = ⌊R y / β⌋, R ∈ Z^{dim(Y)×dim(Y)}, β ∈ Z^{dim(Y)}. The image of Y under r is called the index set of the data partition and is denoted by Ŷ. The block whose elements are mapped to ŷ ∈ Ŷ is denoted by R⁻¹(ŷ). Data alignments of a partitioned data array are defined on the index set of the data partition, i.e. p̂ : Ŷ → X̂.

EXAMPLE 2.1. (continued). Suppose that matrix A is partitioned by an equivalence relation such that y_1 and y_2 are in the same block if and only if r_a(y_1) = r_a(y_2), where r_a(y) = ⌊y / (2, 3)ᵀ⌋ (R is the 2 × 2 identity matrix). Then, the size of a block is 2 × 3 while the size of the partition is 4 × 4. The index set of the partition, Ŷ_a, also consists of 4 × 4 elements. R⁻¹(ŷ) denotes the block indexed by ŷ. For example, R⁻¹((0, 0)ᵀ) denotes the block {y | (0, 0)ᵀ ≤ y < (2, 3)ᵀ} while R⁻¹((1, 0)ᵀ) denotes the block {y | (2, 0)ᵀ ≤ y < (4, 3)ᵀ}.

3. RELATION BETWEEN A COMPUTATION PARTITION AND A DATA PARTITION

In order to eliminate unnecessary communication, it is desirable that each computation and its operand data be aligned, i.e. allocated to the same processor. For an affine nested loop, the condition of alignment can be described by the following equation:

T(j) = (t, p(f(j)))ᵀ. (2)

This equation implies that the computation indexed by j is mapped at time t onto processor x, the same processor containing the data entry accessed by the computation. Figure 1a illustrates the condition. There are two paths from j to (t, x). One path connects j directly to (t, x), while the other goes through y to reach (t, x). There is a function from j to (t, x) along each of these two paths. In this figure, Equation (2) means that the images of the two functions along these two paths must be the same.

Figure 1b illustrates the condition for the alignment of blocked data and computation. Similarly to Figure 1a, there are two paths from ĵ to (t̂, x̂). Hence, the condition for the alignment is that the two functions along these two paths must be the same:

T̂(q(j)) = (t̂, p̂(r(f(j))))ᵀ. (3)

Extensive research has focused on the derivation of a simple form of condition (2) for the alignment of individual data and computations. With the simple form, many formulations have been developed to derive T and p [3-10]. However, it is not easy to derive T̂ and p̂ that satisfy Equation (3). This is because r and q are both nonlinear functions, and therefore it is necessary to solve a system of nonlinear equations to obtain conditions on T̂ and p̂. This paper proposes to use a new function f̂ which maps a computation block to a data block, as shown in Figure 2. With the new function f̂, the condition of blocked data alignment becomes

T̂(ĵ) = (t̂, p̂(f̂(ĵ)))ᵀ. (4)

The form of Equation (4) is exactly the same as that of Equation (2). Hence, previous techniques developed for the condition of Equation (2) can also be used for the condition of Equation (3). However, it is not clear whether the new function f̂ always exists for arbitrary computation and data partitionings. Therefore, this section investigates the condition that guarantees the existence of f̂.
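Conditions (2) and (4) have the same shape and can both be checked by enumeration when the mappings are available as executable functions. The sketch below is only illustrative; the callables T, p and f and their tuple-returning conventions are assumptions of this fragment, not notation from the paper. Applied to hatted arguments, the same check covers condition (4).

    def aligned(T, p, f, J):
        # check condition (2): the processor x produced by T(j) = (t, x) must be the
        # processor to which the alignment p sends the operand index f(j)
        for j in J:
            t, x = T(j)
            if p(f(j)) != x:
                return False
        return True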

FIGURE 2. Indexing function for a partitioned algorithm.

FIGURE 3. Definition of the indexing function.

The function f̂ is formally defined as follows.

DEFINITION 3.1. Let Ĵ be the index set of a computation partition and Ŷ be the index set of a data partition. A function f̂ : Ĵ → Ŷ is an indexing function of a partitioned data array accessed in a partitioned loop nest if

f(Q⁻¹(ĵ)) = R⁻¹(f̂(ĵ)), (5)

for any ĵ ∈ Ĵ.

Condition (5) can be stated as follows: the data entries accessed by a computation block, Q⁻¹(ĵ), should be the same as the block of the data array indexed by ŷ, the image of ĵ under the indexing function f̂. This condition can be better illustrated with Figure 3. There exist two chains of relations on Ĵ × Y: one is the relation Q⁻¹ on Ĵ × J followed by the mapping f from J to Y, and the other is the mapping f̂ from Ĵ to Ŷ followed by the relation R⁻¹ on Ŷ × Y. For the function f̂ to play the role of the indexing function f, the two chains of relations need to yield relations with an identical projection on Y. This condition is formulated mathematically as (5). In what follows, an indexing function of a partitioned data array accessed in a partitioned loop nest is simply called an indexing function whenever this simplification does not lead to confusion.

EXAMPLE 2.1. (continued). Let the indexing function of the partitioned matrix A be f̂_a(ĵ) = [[1, 0, 0], [0, 0, 1]] ĵ. It is not difficult to check whether this function satisfies Definition 3.1. For example, consider the block Q⁻¹((0, 0, 0)ᵀ) = {j | (0, 0, 0)ᵀ ≤ j < (2, 4, 3)ᵀ}. Since the original indexing function is f_a(j) = [[1, 0, 0], [0, 0, 1]] j, it follows that f_a(Q⁻¹((0, 0, 0)ᵀ)) = {y | (0, 0)ᵀ ≤ y < (2, 3)ᵀ}. On the other hand, f̂_a((0, 0, 0)ᵀ) = (0, 0)ᵀ and R⁻¹(f̂_a((0, 0, 0)ᵀ)) = {y | (0, 0)ᵀ ≤ y < (2, 3)ᵀ}. Therefore, f_a(Q⁻¹((0, 0, 0)ᵀ)) = R⁻¹(f̂_a((0, 0, 0)ᵀ)). Hence, Definition 3.1 is satisfied.

In the above example, the indexing functions of the partitioned data arrays accessed in the partitioned loop nest are the same as the original indexing functions, i.e. f̂_a = f_a; however, this is not true in general. Depending on how loop index sets are partitioned, it is often very complicated to find the correct indexing function. Moreover, in many cases, there do not exist indexing functions that satisfy Definition 3.1. The next example illustrates such a case.

EXAMPLE 3.1. Consider the following computation.

DO i = 0, 9
  DO j = 0, 9
    c(i, j) = a(i) × b(j)
CONTINUE

Suppose that the computation partition is given by the mapping q(j) = ⌊[[1, -1], [0, 1]] j / (3, 3)ᵀ⌋. Figure 4a shows the computation domain. The horizontal and vertical lines intersect at computation index points. Figure 4b shows the partitioned loop index set. The indices of the blocks are also shown in this figure. Dashed lines represent the boundaries of blocks that are not included in the loop index set.

FIGURE 4. Partitioning of a loop index set which does not allow a valid partitioning of a data array index set and an indexing function. (a) Loop index set; (b) partitioned loop index set.

Consider the access pattern of the entries of vector a. For the block of computations indexed by (0, 0), entries a(0), a(1), a(2), a(3) and a(4) are accessed. For the block of computations indexed by (1, 0), entries a(3), a(4), a(5), a(6) and a(7) are accessed. Note that entries a(3) and a(4) are shared by both blocks and would have to be allocated to both of the processors that execute these blocks. Hence, this data 'partition' is not valid in the sense that it is not a partition (the blocks are not disjoint). Consider the entries of vector b which are accessed by the blocks. For the block indexed by (0, 0), entries b(0), b(1) and b(2) are accessed, while b(3), b(4) and b(5) are accessed by the block indexed by (0, 1). In this case, the mapping r_b(y) = ⌊y/3⌋ is a possible valid partitioning of vector b.
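The overlap described in Example 3.1 can be reproduced by enumerating, for each computation block, the entries of a and b that its iterations touch. This is a small illustrative sketch; NumPy and the helper name q are assumptions of the sketch, not the paper's notation.

    import numpy as np

    def q(j, D, beta):
        # block index under q(j) = floor(D j / beta), taken elementwise
        return tuple(np.floor(D @ np.asarray(j) / beta).astype(int))

    D = np.array([[1, -1], [0, 1]])              # computation partition of Example 3.1
    beta = np.array([3, 3])
    a_touched, b_touched = {}, {}
    for i in range(10):
        for j in range(10):
            jh = q((i, j), D, beta)
            a_touched.setdefault(jh, set()).add(i)   # the statement reads a(i)
            b_touched.setdefault(jh, set()).add(j)   # the statement reads b(j)

    print(sorted(a_touched[(0, 0)]))   # [0, 1, 2, 3, 4]
    print(sorted(a_touched[(1, 0)]))   # [3, 4, 5, 6, 7]  -- overlaps block (0, 0)
    print(sorted(b_touched[(0, 0)]))   # [0, 1, 2]
    print(sorted(b_touched[(0, 1)]))   # [3, 4, 5]  -- disjoint: r_b(y) = floor(y/3) is valid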

4. CONDITIONS OF PARTITIONING FOR DATA ALIGNMENT

Existing data alignment techniques are based on relations among indexing functions, time-space mappings and data alignments. To reuse the existing data alignment techniques in partitioned algorithms, one must find the correct indexing functions of the blocked data arrays. Example 3.1 shows that some algorithm partitions do not allow the existence of a valid data partition. Therefore, it is necessary to find conditions on algorithm partitions that guarantee the existence of valid data partitions and to derive techniques to identify them. Here the particularly desirable case in which the indexing functions of the partitioned algorithms are the same as those of the original (i.e. non-partitioned) algorithms is considered. For this case, Proposition 4.1 provides the condition on computation partitions which guarantees that the indexing functions of partitioned nested loops are identical to the indexing functions used in the original (i.e. non-partitioned) nested loops.

Let an indexing function be called Euclidean if every row of the indexing matrix is distinct, with one entry valued unity and all others valued zero. Let u and v be n-dimensional vectors. Let the operator ⊙ denote elementwise multiplication, i.e. u ⊙ v = (u_1 v_1, u_2 v_2, ..., u_n v_n)ᵀ. Let T be an n × n matrix and I be a subset of {1, 2, ..., n}. T_I denotes the submatrix of T whose columns consist of the ith column of T for all i ∈ I.

PROPOSITION 4.1. Let a given algorithm have a Euclidean indexing function f for a data array. Given an algorithm partition defined by q(j) = ⌊D j / β⌋, there exists an indexing function that satisfies Definition 3.1 if there exists a set I ⊂ {1, 2, ..., n} such that the columns of (D⁻¹)_I form a basis of null(F).

Proof. Consider the left-hand side of Equation (5):

Q⁻¹(ĵ) = {j ∈ Z^n | ⌊D j / β⌋ = ĵ}
       = {j ∈ Z^n | ĵ ⊙ β ≤ D j < (ĵ + 1) ⊙ β}
       = {j ∈ Z^n | j = Σ_{i=1}^{n} (ĵ_i (D⁻¹)_i β_i + a_i (D⁻¹)_i β_i), 0 ≤ a_i < 1, i = 1, 2, ..., n}.

Therefore, it follows that

f(Q⁻¹(ĵ)) = {y | y = f(Σ_{i=1}^{n} ĵ_i (D⁻¹)_i β_i) + Σ_{i=1}^{n} a_i F (D⁻¹)_i β_i, 0 ≤ a_i < 1, i = 1, 2, ..., n, y ∈ Z^{dim(Y)}}. (6)

On the other hand, the right-hand side of (5) must be of the form

R⁻¹(f̂(ĵ)) = {y | ⌊R y / β⌋ = f̂(ĵ)}
          = {y | ŷ ⊙ β ≤ R y < (ŷ + 1) ⊙ β}
          = {y | y = Σ_{i=1}^{dim(Y)} (ŷ_i (R⁻¹)_i β_i + b_i (R⁻¹)_i β_i), 0 ≤ b_i < 1, i = 1, 2, ..., dim(Y), y ∈ Z^{dim(Y)}}. (7)

Equation (6) can be in the form of Equation (7) only if n − dim(Y) of the vectors {F (D⁻¹)_i, i = 1, 2, ..., n} are zero. Then, the set {(D⁻¹)_i | F (D⁻¹)_i = 0} forms a basis of null(F). Since f is Euclidean and F (D⁻¹)_i = 0 for i ∈ I, f(Σ_{i=1}^{n} ĵ_i (D⁻¹)_i β_i) = f(Σ_{i∈C} ĵ_i (D⁻¹)_i β_i), where C = {1, 2, ..., n} − I. Thus, Equations (6) and (7) are identical if R⁻¹ is defined to be F (D⁻¹)_C and the block-size vector of the data partition is defined to be F β.

The proof of the above proposition shows that the data partition is defined by r(y) = ⌊(F (D⁻¹)_C)⁻¹ y / β_C⌋, where β_C denotes the subvector of β whose entries are indexed by C. If an algorithm has a data array with an indexing function which is not Euclidean, it is necessary to change the basis to make the indexing function Euclidean. This paper considers only the case when such a change is possible.

A boundary of a block is generated by n − 1 columns of D⁻¹. Hence, there exist (n − dim(null(F))) choose (n − dim(null(F)) − 1) = n − dim(null(F)) boundaries containing null(F). Therefore, the intersection of these boundaries forms null(F). In other words, null(F) should be parallel to the intersection of these boundaries.
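For a Euclidean indexing matrix, the condition of Proposition 4.1 reduces to counting the columns of D⁻¹ that F annihilates: since the columns of an invertible matrix are linearly independent, those lying in null(F) form a basis of null(F) exactly when there are n − dim(Y) of them. A small sketch of such a check (NumPy-based; the function name is this sketch's own), applied to Examples 4.1 and 4.2 below:

    import numpy as np

    def satisfies_prop_4_1(F, D):
        # columns of D^{-1} that lie in null(F); they span null(F) iff there are
        # n - dim(Y) of them (F is assumed Euclidean, hence of full row rank)
        n = D.shape[0]
        D_inv = np.linalg.inv(D)
        in_null = [i for i in range(n) if np.allclose(F @ D_inv[:, i], 0)]
        return len(in_null) == n - F.shape[0], in_null

    # Example 4.1: f_a selects (i, k) and D = I_3
    # -> (True, [1]): the second column of D^{-1} (0-based index 1) spans null(F)
    print(satisfies_prop_4_1(np.array([[1, 0, 0], [0, 0, 1]]), np.eye(3)))
    # Example 4.2: f_a selects i and D = [[1, -1], [0, 1]] -> (False, [])
    print(satisfies_prop_4_1(np.array([[1, 0]]), np.array([[1, -1], [0, 1]])))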

EXAMPLE 4.1. In Example 2.1, q for the computation partition and r_a for the partition of data array A are q(j) = ⌊D j / (2, 4, 3)ᵀ⌋ with D the 3 × 3 identity matrix, and r_a(y) = ⌊y / (2, 3)ᵀ⌋ respectively. Since D⁻¹ is the identity, null(f_a) = span{(0, 1, 0)ᵀ} is generated by the second column of D⁻¹, (D⁻¹)_2. It then follows that [F_a (D⁻¹)_C]⁻¹ = [[1, 0], [0, 1]] with C = {1, 3}. In addition, β_C = (2, 3)ᵀ. Hence, the condition in Proposition 4.1 is satisfied. Thus, f̂_a(ĵ) = f_a(ĵ) = [[1, 0, 0], [0, 0, 1]] ĵ.

EXAMPLE 4.2. In Example 3.1, the mapping q has D = [[1, -1], [0, 1]]. Thus, D⁻¹ is [[1, 1], [0, 1]]. Since the null space of f_a is span{(0, 1)ᵀ}, it is generated by neither (D⁻¹)_1 nor (D⁻¹)_2.

The following proposition proves that condition (2) for the alignment of individual data entries can be used as the condition for the alignment of blocked data (Equation (3)) if there exists an indexing function that satisfies Definition 3.1.

PROPOSITION 4.2. Let J be the index set of a nested loop. Let V be a data array accessed in the nested loop and Y be the index set of array V. Let f be the indexing function of array V. Let Ĵ be the index set of a partition of J and Ŷ be the index set of a partition of Y. Let f̂ : Ĵ → Ŷ be the affine indexing function of the partitioned array V in the partitioned nested loop. Let X̂ be the index set of a physical processor array. Let T̂ : Ĵ → X̂ be a time-space transformation of the partitioned nested loop and p̂ be a data alignment of the partitioned array. For any j ∈ J,

T̂(q(j)) = (t̂, p̂(r(f(j))))ᵀ (8)

if T̂(ĵ) = (t̂, p̂(f̂(ĵ)))ᵀ and there exists f̂ such that f(Q⁻¹(ĵ)) = R⁻¹(f̂(ĵ)).

Proof. Given j ∈ J, there exists ĵ such that j ∈ Q⁻¹(ĵ). Then, f(j) ∈ f(Q⁻¹(ĵ)) = R⁻¹(f̂(ĵ)). Therefore, r(f(j)) = f̂(ĵ). Thus, p̂(r(f(j))) = p̂(f̂(ĵ)). Hence, T̂(q(j)) = T̂(ĵ) = (t̂, p̂(r(f(j))))ᵀ.

TABLE 1. Initial data alignment of the partitioned version of Cannon's algorithm: â_{i,j}, b̂_{i,j} and ĉ_{i,j} denote the (i, j)th blocks of the data array index sets of matrices A, B and C respectively. Rows are indexed by x̂_1 = 0, ..., 3 and columns by x̂_2 = 0, ..., 3.

A:  â_{0,0}  â_{0,1}  â_{0,2}  â_{0,3}
    â_{1,1}  â_{1,2}  â_{1,3}  â_{1,0}
    â_{2,2}  â_{2,3}  â_{2,0}  â_{2,1}
    â_{3,3}  â_{3,0}  â_{3,1}  â_{3,2}

B:  b̂_{0,0}  b̂_{1,1}  b̂_{2,2}  b̂_{3,3}
    b̂_{1,0}  b̂_{2,1}  b̂_{3,2}  b̂_{0,3}
    b̂_{2,0}  b̂_{3,1}  b̂_{0,2}  b̂_{1,3}
    b̂_{3,0}  b̂_{0,1}  b̂_{1,2}  b̂_{2,3}

C:  ĉ_{0,0}  ĉ_{0,1}  ĉ_{0,2}  ĉ_{0,3}
    ĉ_{1,0}  ĉ_{1,1}  ĉ_{1,2}  ĉ_{1,3}
    ĉ_{2,0}  ĉ_{2,1}  ĉ_{2,2}  ĉ_{2,3}
    ĉ_{3,0}  ĉ_{3,1}  ĉ_{3,2}  ĉ_{3,3}

EXAMPLE 4.3. Consider Example 2.1 again. After the algorithm and data array index sets are partitioned, the index set of the computation partition, Ĵ, is (4 × 4 × 4), while the index sets of the data partitions Ŷ_a, Ŷ_b and Ŷ_c are all (4 × 4). Suppose that

T̂_{(4,4,4)}(ĵ) = [[1, 1, -1], [1, 0, 0], [0, 1, 0]] ĵ (mod (4, 4, 4)ᵀ)

is employed as a time-space transformation of this algorithm. Data alignments of matrices A, B and C that satisfy Equation (4) are

p̂_{a,t̂}(ŷ) = [[1, 0], [-1, 1]] ŷ + (0, 1)ᵀ t̂ (mod (4, 4)ᵀ),
p̂_{b,t̂}(ŷ) = [[1, -1], [0, 1]] ŷ + (1, 0)ᵀ t̂ (mod (4, 4)ᵀ) and
p̂_{c,t̂}(ŷ) = [[1, 0], [0, 1]] ŷ + (0, 0)ᵀ t̂ (mod (4, 4)ᵀ)

respectively, where the subscript t̂ denotes that the data alignments depend on the execution time [10].
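Equation (4) for Example 4.3 can be verified by brute force over all 64 block indices. The following sketch assumes the signs reconstructed above for T̂ and for the alignments; it is purely illustrative (NumPy-based) and not code from the paper.

    import numpy as np

    T_hat = np.array([[1, 1, -1],    # time row
                      [1, 0, 0],     # processor coordinate x_1
                      [0, 1, 0]])    # processor coordinate x_2

    def p_a(y, t): return (np.array([[1, 0], [-1, 1]]) @ y + np.array([0, 1]) * t) % 4
    def p_b(y, t): return (np.array([[1, -1], [0, 1]]) @ y + np.array([1, 0]) * t) % 4
    def p_c(y, t): return (np.array([[1, 0], [0, 1]]) @ y + np.array([0, 0]) * t) % 4

    for j1 in range(4):
        for j2 in range(4):
            for j3 in range(4):
                t, x1, x2 = (T_hat @ np.array([j1, j2, j3])) % 4
                x = np.array([x1, x2])
                assert np.array_equal(p_a(np.array([j1, j3]), t), x)  # f_hat_a(j) = (j1, j3)
                assert np.array_equal(p_b(np.array([j3, j2]), t), x)  # f_hat_b(j) = (j3, j2)
                assert np.array_equal(p_c(np.array([j1, j2]), t), x)  # f_hat_c(j) = (j1, j2)
    print("Equation (4) holds for all 64 blocks")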

FIGURE 5. Examples of computation partitions. (a) β = (2, 4, 3); (b) β = (2, 4, 6); (c) β = (2, 4, 12).

Initial distributions of these matrices are shown in Table 1. All the entries stored in the processors are computed and then matrix A shifts left and matrix B shifts up for the next computation. The computation step and the communication step are repeated alternately until all columns of matrix A have been multiplied by all rows of matrix B. This algorithm for matrix multiplication corresponds to the partitioned (blocked) version of Cannon's algorithm.

5. COMMUNICATION-MINIMAL PARTITION FOR A FIXED-SIZE PROCESSOR ARRAY

For most practical problems, the sizes of the computation domains are large and the domains must be partitioned to fit a 'small' fixed-size processor array. To reduce execution time, it is desirable to select, among the possible partitions, the one that requires less communication. Given that the partitions of interest are determined by q(j) = ⌊D j / β⌋, different partitions are possible by choosing D and β. Matrix D is often determined by other constraints such as those in Proposition 4.1 or in [12, 14]. Therefore, this section assumes that D is given and investigates the β that minimizes the communication overheads. Some components of β have their values chosen so that the partition matches an available processor array in size. The values of the other components of β are chosen to satisfy other constraints such as the available memory space.

EXAMPLE 5.1. Consider the matrix-matrix multiplication algorithm in Example 4.3 and a processor array of size (4 × 4). Suppose that the algorithm blocks are mapped to processors along the direction (0, 0, 1)ᵀ. Then β_1 = 2 and β_2 = 4 (in order to accommodate an (8 × 16) projection in (4 × 4) processors). Consider how β_3 affects the communication overheads. If β_3 is chosen to be 6, the size of the blocks becomes twice as large as that in Example 4.3. Thus, the size of the partition is also changed, to (4 × 4 × 2). The shapes of the blocks are shown in Figure 5b. Since the number of blocks along the last coordinate is two, this algorithm requires two time units¹ to complete the computations. Recall that communications are necessary between blocks, but not inside a single block. Hence, only one communication (i.e. one message with several data entries²) is necessary. Note that the partitioned version of Cannon's algorithm in Example 4.3 needs four time units and therefore three communications are required. The trade-off in this example is the size of the block, which is twice as large as that in Example 4.3. Hence, approximately twice as much memory as that of Example 4.3 is required in this example. Suppose that β_3 is chosen as 12 (Figure 5c). Then, the size of a single block is (2 × 4 × 12) and the size of the data partition is (4 × 4 × 1). In this case, the algorithm needs just one time unit and therefore no communication is necessary. However, it requires approximately four times as much storage as that of Example 4.3.

Example 5.1 illustrates the trade-off between memory space and the number of communications. In general, the larger the blocks stored in a single processor, the smaller the number of communications required. However, it is not always possible to store as many data items as desired in a single processor because of the limited available memory space. Moreover, there exist special cases in which the necessary memory space increases even though the size of the blocks decreases. This is because the alignment of a partitioned data array must satisfy the conditions in Proposition 4.2 and, therefore, there are cases in which more than one block must be stored in a single processor [10].
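The time-unit, message and storage counts of Example 5.1 follow directly from the block sizes; the short sketch below simply tabulates them for β_3 ∈ {3, 6, 12}. The names and layout are this sketch's own; the k-extent 12 and β_1 = 2, β_2 = 4 come from Examples 2.1 and 5.1.

    K = 12                                      # extent of the k-loop in Example 2.1
    for beta3 in (3, 6, 12):
        time_units = K // beta3                 # blocks per processor along k
        communications = time_units - 1         # one message between consecutive blocks
        entries_per_block = 2 * 4 + 2 * beta3 + beta3 * 4   # C, A and B sub-blocks
        print(beta3, time_units, communications, entries_per_block)
    # beta3 = 3 gives 4 time units and 3 messages; beta3 = 6 gives 2 and 1;
    # beta3 = 12 gives 1 time unit and no communication, at the cost of larger blocks.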
Consider now the problem in which the components of β are chosen to make the algorithm partition fit an available processor array. Previous research on communication minimization showed that a data array needs to move if (T⁻¹)_1 is not an element of null(F) [11]. Therefore, it is desirable to choose (T⁻¹)_1 (the projection vector [13]), that is, the direction along which computations are mapped into processors, as the basis of null(F).

¹ The time unit is the time required to finish all computations in a single block. The time unit depends on the size of a single block; the computation time to execute a single block increases with the block size.
² Each message contains all data generated and/or used by a block that is needed by another block in the destination processor. Thus, the size of a message increases with the size of the blocks.

FIGURE 6. Examples of computation partitions. (a) β = (2, 4, 6); (b) β = (4, 4, 3); (c) β = (2, 8, 3).

FIGURE 7. Partitions and communication overheads. (a) Number of communications, number of data items in a single communication and number of data items in a single block, as functions of the block size; size of the computation domain, 1024 × 1024 × 1024; size of the processor array, 16 × 16. (b) Number of data items moved for the three choices of stationary matrix (A and C move while B stays; A and B move while C stays; B and C move while A stays), as a function of the problem size (1024/N) × 1024 × (N × 1024); size of the processor array, 16 × 16.

Therefore, the choice of the projection vector is limited by the data arrays used in an algorithm. In general, the number of optimal projection vectors is small enough to examine all the possibilities in a reasonable amount of time. For example, in matrix multiplication, there are six possible choices of the projection vector (Proposition 3 in [11]). Depending on the projection vector, the number of data entries in a single communication can vary. Example 5.2 illustrates how this problem can be solved and how much the communication overheads can be reduced by searching all the possibilities.

EXAMPLE 5.2. Consider the matrix-matrix multiplication algorithm in Example 5.1 again. Suppose that β is (2, 4, 6)ᵀ (Figure 6a). The number of data entries to be stored in a single processor is 44. Since matrices A and B should be moved while matrix C stays at the same processor, the number of data entries to be moved is 36. If β = (4, 4, 3)ᵀ (Figure 6b), matrices A and C should be moved while matrix B stays at the same processor. Hence, the number of data entries to be stored is 40 while the number of data entries to be communicated is 28. If β = (2, 8, 3)ᵀ (Figure 6c), matrices B and C should be moved while matrix A stays at the same processor. Hence, the number of data entries to be stored is 46, while the number of data entries to be communicated is 40. Hence, the amount of communication required when β = (2, 8, 3)ᵀ is 1.43 times that when β = (4, 4, 3)ᵀ and 1.11 times that when β = (2, 4, 6)ᵀ. The difference in communication overheads depends on the shape of the original loop index set. Therefore, it is necessary to investigate all three cases and choose the partition that minimizes communication.
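The storage and communication counts quoted in Example 5.2 can be recomputed from the block shapes. A small sketch (the tuple layout is this sketch's own; the stationary matrix for each β is taken from the example):

    # per-processor entries stored and moved for the three partitions of Example 5.2
    for beta, stays in (((2, 4, 6), "C"), ((4, 4, 3), "B"), ((2, 8, 3), "A")):
        b1, b2, b3 = beta
        sizes = {"A": b1 * b3, "B": b3 * b2, "C": b1 * b2}   # sub-block sizes
        stored = sum(sizes.values())
        moved = stored - sizes[stays]        # the stationary matrix does not move
        print(beta, "stored:", stored, "moved:", moved)
    # (2, 4, 6): stored 44, moved 36; (4, 4, 3): stored 40, moved 28; (2, 8, 3): stored 46, moved 40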

Figure 7 shows the relationship between communication overheads and partitions for various problem sizes. Figure 7a shows the trade-off between the number of communications and the size of the necessary memory modules, depending on the sizes of the blocks, when a (1024 × 1024 × 1024) matrix multiplication is computed on a (16 × 16) processor array. The block sizes considered in this figure are (64 × 64 × 64n), n = 1, 2, 4, 8, 16. As the block size increases, the number of data entries in the block also increases proportionally. However, the number of communication steps decreases with the block size. This figure also shows that the number of data entries moved for a single block increases proportionally to the block size. Figure 7b shows how the communication overheads depend on the choice of coordinates to be mapped into the space domain. Matrix multiplication algorithms of sizes (1024/n) × 1024 × (1024n), n = 0.25, 0.5, 1, 2, 4, are considered. Since the computation domain is three dimensional, there are three possibilities for choosing the coordinates mapped into the space domain. This choice determines which matrices need to move and which matrix can stay at the same processor. The figure shows that the communication overheads depend on the choice of coordinates when the computation domain is not cubic. As the difference in size along the three coordinates increases, the difference in communication overheads also increases.

6. CONCLUSIONS

This paper considers the partitioning and data alignment of affine nested loops. To leverage existing data alignment techniques for partitioned algorithms, computation partitions should satisfy the condition that there exist boundaries of blocks whose intersections are the null spaces of the indexing functions of the data arrays. By using existing systematic data alignment techniques, it is then possible to generate several partitions and time-space mappings (with perfectly aligned data) out of which the optimal one can be chosen. Additional research is needed on techniques to identify such optimal solutions directly.

ACKNOWLEDGEMENTS

This research was partially funded by the National Science Foundation under grants MIP-9500673 and CDA-9015696.

REFERENCES

[1] Hiranandani, S. et al. (1992) Compiler optimizations for Fortran D for MIMD distributed-memory machines. Commun. ACM, 35, 66-80.
[2] Gupta, M. and Banerjee, P. (1992) Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. IEEE Trans. Parallel Distrib. Syst., PDS-3, 179-193.
[3] Bau, D. et al. (1995) Solving alignment using elementary linear algebra. Lecture Notes in Computer Science, 892, 46-60. Springer-Verlag, Berlin.
[4] Shang, W. and Shu, Z. (1994) Data alignment of loop nests without nonlocal communication. In Proc. Int. Conf. Application Specific Array Processors, August, pp. 439-450. IEEE Computer Society Press, Los Alamitos, CA.
[5] Ramanujam, J. and Sadayappan, P. (1991) Compile-time techniques for data distribution in distributed memory machines. IEEE Trans. Parallel Distrib. Syst., PDS-2, 472-482.
[6] Knobe, K., Lukas, J. D. and Steele, G. L. Jr. (1990) Data optimization: allocation of arrays to reduce communication on SIMD machines. J. Parallel Distrib. Comput., 8, 102-118.
[7] Li, J. and Chen, M. (1991) The data alignment phase in compiling programs for distributed-memory machines. J. Parallel Distrib. Comput., 13, 213-221.
[8] Feautrier, P. (1992) Toward Automatic Distribution. Laboratoire MASI, Institut Blaise Pascal, Technical Report 92-95.
[9] Darte, A. and Robert, Y. (1993) A Graph-Theoretic Approach to the Alignment Problem. LIP, Ecole Normale Supérieure de Lyon, France, Technical Report 93-20.
[10] Lee, H.-J. and Fortes, J. A. B. (1995) Data alignments for modular mappings of BLAS-like algorithms. In Proc. Int. Conf. Application-Specific Array Processors, July, pp. 34-41. IEEE Computer Society Press, Los Alamitos, CA.
[11] Lee, H.-J. and Fortes, J. A. B. (1997) Modular mappings and data distribution independent computations. Parallel Processing Lett., 7, 169-180.
[12] Schreiber, R. and Dongarra, J. J. (1990) Automatic Blocking of Nested Loops. Technical Report 90.38, RIACS.
[13] Darte, A. (1991) Regular partitioning for synthesizing fixed-size systolic arrays. Integration, the VLSI J., 12, 293-304.
[14] Boulet, P., Darte, A., Risset, T. and Robert, Y. (1993) (Pen)-Ultimate Tiling? LIP, Ecole Normale Supérieure de Lyon, France, Technical Report 93-36.
[15] Wolf, M. E. and Lam, M. S. (1991) A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans. Parallel Distrib. Syst., PDS-2, 452-471.