GLORE: Generalized Loop Redundancy Elimination upon LER-Notation

YUFEI DING, XIPENG SHEN, North Carolina State University, United States

This paper presents GLORE, a novel approach to enabling the detection and removal of large-scope redundant computations in nested loops. GLORE works on LER-notation, a new representation of computations in both regular and irregular loops. Together with a set of novel algorithms, it makes GLORE able to systematically consider computation reordering at both the expression level and the loop level in a unified manner. GLORE shows an applicability much broader than prior methods have, and frequently lowers the computational complexities of some nested loops that are elusive to prior optimization techniques, producing significantly larger speedups.

CCS Concepts: • Software and its engineering → General programming languages;

Additional Key Words and Phrases: program optimization, loop redundancy elimination, operation minimization

ACM Reference Format: Yufei Ding, Xipeng Shen. 2017. GLORE: Generalized Loop Redundancy Elimination upon LER-Notation. Proc. ACM Program. Lang. 1, OOPSLA, Article 74 (October 2017), 28 pages. https://doi.org/10.1145/3133898

Authors' emails: Yufei Ding, yding8@ncsu.edu; Xipeng Shen, xshen5@ncsu.edu. Authors' address: Department of Computer Science, North Carolina State University, Raleigh, North Carolina, 27606, United States.

1 INTRODUCTION

Removing redundant computations is an effective way to speed up applications. The traditional approach, loop redundancy elimination, detects and removes computations that are invariant across the innermost loop. Many redundancies, however, span a much larger scope and often remain hidden to prior methods. Detecting them requires careful large-scope computation reordering and reassociation at both the expression level and the loop level. They elude traditional methods because of those methods' small analysis scope and their weaknesses in handling irregular loops, complex control flows, and dependences.

For instance, Example 1 in Figure 1 (a) shows code containing a while loop and a nested for loop. If we focus only on the expression in the innermost loop body, we cannot find any redundant computation: variable d gets updated in every iteration of the for loop, and w gets updated in every while loop iteration. However, taking a broader view, we can see that with some large-scope reordering and reassociation of the computations, the entire code is equivalent to the form in Figure 1 (b), in which the reduction loops $\sum_i a[i]$ and $\sum_i b[i]$ are both redundantly recomputed across the while loop. If we take the redundant computations out of the outer-level loop appropriately, we can save many computations and speed up the execution by orders of magnitude. Both the needed large-scope analysis and the presence of the while loop prevent the traditional methods from finding and removing such redundancies.

(a) Example 1

    w = w0;
    while (d > 0.01){
      d = 0;
      for (i = 0; i <= M; i++){
        d += a[i] + b[i] * w;
      }
      w = w - 0.001 * d;
    }

(b) A form equivalent to Example 1

    w = w0;
    while (d > 0.01){
      A = Σ_i a[i]; B = Σ_i b[i];
      d = A + B * w;
      w = w - 0.001 * d;
    }

(c) Example 2

    for (i = 0; i <= M; i++){
      r[i] = 0;
      for (k = 0; k <= i; k++){
        for (j = k; j <= i; j++){
          r[i] += x[i,j] * y[i,k];
        }
      }
    }

(d) A form equivalent to Example 2

    for (i = 0; i <= M; i++){
      temp[i,0] = y[i,0];
    }
    for (i = 1; i <= M; i++){
      for (j = 1; j <= i; j++){
        temp[i,j] = temp[i,j-1] + y[i,j];
      }
    }
    for (i = 0; i <= M; i++){
      r[i] = 0;
      for (j = 0; j <= i; j++){
        r[i] += x[i,j] * temp[i,j];
      }
    }

Fig. 1. Illustration of large-scope loop redundancies.

Example 2 in Figure 1 (c) shows large-scope redundancies in affine for loops. The code computes the products of elements in two arrays, reduces them along two axes (k and j), and then stores them in a new array r. Again, if we focus only on the expressions in the innermost loop, we cannot find any redundant computations, as the expression x[i, j] * y[i, k] computes different values across different loop iterations. But redundancies are exposed in a larger scope: switch the order of loops k and j, and the inner loop computes the product of x[i, j] and the prefix sum of y[i, k] along the k dimension (up to j). We can have a separate loop to compute the prefix sum. If we further notice that temp[i, j] equals temp[i, j − 1] + y[i, j], the prefix sum can be simplified even further into the first two loops in Figure 1 (d). By reusing the prefix sums, the computation of r is simplified into the bottom loop in Figure 1 (d). The computational complexity drops from the original $O(M^3)$ to $O(M^2)$.

In both examples, detecting and removing the redundancies requires large-scope computation reordering (across multiple levels of loops). As noticed in numerous studies [Cooper et al. 2008; Deitz et al. 2001; Ding et al. 2017, 2015; Drake and Hamerly 2012; Elkan 2003; Fahim et al. 2006; Goldberg and Harrelson 2005; Greenspan et al. 2000; Gupta and Rajopadhye 2006; Gutman 2004; Hamerly 2010; Ngai et al. 2006; Wang et al. 2012; Wang 2011], such large-scope loop redundancies are common, especially in applications in computational physics, chemistry, data analytics, and other domains that involve relatively complex formulae or algorithms. When translating those complex formulae or algorithms into computer programs, developers often intuitively follow the formulae or algorithms step by step, producing code that is logically easy to understand and practically easy to maintain rather than trying to minimize redundant computations.

There have been some efforts in extending the scope of traditional loop redundancy elimination [Cooper et al. 2008; Deitz et al. 2001; Gupta and Rajopadhye 2006]. They have made significant contributions towards large-scope redundancy elimination, but they are subject to two major limitations. First, they all have strict requirements on the forms of the loops. None of them can handle mathematical operations (e.g., sin, mod) or irregular loops with complex control flows (e.g., while loops with breaks) and complicated dependences. Second, almost none of them can systematically consider the combination of loop-level computation reordering and expression-level algebraic reordering, and deal with their interplay. Tensor contraction optimizations [Hartono et al. 2006, 2005] have considered both levels of reordering, but they are designed specifically for tensor contraction in regular for loops with constant loop bounds and are inapplicable to common loops. As a result, both examples in Figure 1 are elusive to all prior techniques.
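The Example 2 rewrite can be tried out with a small sketch in C. It is only an illustration under stated assumptions, not GLORE's output: the function names `r_direct` and `r_prefix`, the `long` element type, the flattened row-major (M+1) x (M+1) arrays, and the triangular bounds 0 <= k <= j <= i are all choices made here for the sake of a compilable example.

```c
#include <assert.h>
#include <stdlib.h>

/* Direct form (assumed bounds 0 <= k <= j <= i):
 * r[i] = sum over j in [0,i] and k in [0,j] of x[i][j] * y[i][k].
 * Performs O(M^3) multiply-adds. */
void r_direct(int m, const long *x, const long *y, long *r)
{
    int n = m + 1;                      /* row length of x and y */
    assert(m >= 0);
    for (int i = 0; i <= m; i++) {
        r[i] = 0;
        for (int k = 0; k <= i; k++)
            for (int j = k; j <= i; j++)
                r[i] += x[i * n + j] * y[i * n + k];
    }
}

/* Prefix-sum form: temp[i][j] = y[i][0] + ... + y[i][j] is built once,
 * then reused, for O(M^2) work. Returns 0 on success, -1 if the
 * scratch buffer cannot be allocated. */
int r_prefix(int m, const long *x, const long *y, long *r)
{
    int n = m + 1;
    assert(m >= 0);
    long *temp = malloc((size_t)n * (size_t)n * sizeof *temp);
    if (temp == NULL)
        return -1;
    for (int i = 0; i <= m; i++) {
        temp[i * n] = y[i * n];
        for (int j = 1; j <= i; j++)    /* only j <= i is ever read */
            temp[i * n + j] = temp[i * n + j - 1] + y[i * n + j];
    }
    for (int i = 0; i <= m; i++) {
        r[i] = 0;
        for (int j = 0; j <= i; j++)
            r[i] += x[i * n + j] * temp[i * n + j];
    }
    free(temp);
    return 0;
}
```

Integer elements are used so that the two forms agree exactly; with floating-point data the reassociation may change the result in the last bits.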

(a) Category 1: Loop-invariant expression

    for (i = 0; i <= M; i++){
      result = a * b;
      x[i] = result + y[i];
    }

(b) Category 2: Partially loop-invariant expression

    for (i = 2; i <= M; i++){
      x[i] = y[i-2] + y[i-1] + y[i+1] + y[i+2];
    }

(c) Category 3: Loop-invariant loop

    for (i = 0; i <= M; i++){
      for (j = 0; j <= M; j++){
        for (k = 0; k <= N; k++){
          for (l = 0; l <= N; l++){
            r[i,k] += x[i,l] * y[l,j] * s[j,k];
          }
        }
      }
    }

(d) Category 4: Partially loop-invariant loop

    for (i = 1; i <= M; i++){
      for (j = 1; j <= i; j++){
        y[i] += x[j];
      }
    }

Fig. 2. Examples of the four main categories of loop redundancies.

A key observation made in our work is that the limitations of the prior techniques fundamentally stem from the lack of a proper representation of loops of various forms. For instance, a prior work [Gupta and Rajopadhye 2006] uses a high-level equational form to represent loops. As a result, it cannot accommodate while loops or data dependences. Moreover, it focuses on loop reordering but ignores its interplay with expression-level reordering, causing it to miss many redundancies hidden even in the regular loops that it can represent.

This paper presents a new solution that addresses those challenges through the development of GLORE, which stands for generalized loop redundancy elimination. This new method introduces a notation scheme named loop-reduction notation (LER-notation), which provides the first unified symbolic abstraction for systematically conducting computation reordering across both loops and expressions upon the laws of associativity, commutativity, and distributivity. LER-notation equips GLORE with an applicability much broader than prior methods have, covering both regular and irregular loops and applicable to code with complex control flows, complicated dependences, and math operations. At the same time, LER-notation offers a form more friendly for the exploration of both loop and expression reordering.
To translate the increased flexibility into actual redundancy removal, we further propose a set of novel transformations and algorithms, including operand folding, alternating form generation, the minimum-union algorithm, a linear-time closure-based algorithm, and so on. These techniques allow GLORE to treat the various loop complexities with ease, and to effectively detect loop redundancies by exploring loop and expression reordering and reassociations in a unified, comprehensive manner.

Experiments on 21 benchmarks from four sources show that GLORE excels in both generality and effectiveness. Working as an end-to-end framework, GLORE is able to detect and remove the most common cases in four major categories of loop redundant computations. Those cases include some large-scope redundancies that have been elusive to prior techniques, on which GLORE gives orders-of-magnitude speedups by lowering their computational complexities. On loops that prior methods can handle, GLORE produces similar or significantly higher speedups.

2 FOUR CATEGORIES OF LOOP REDUNDANCY

In general, any computation that occurs in multiple iterations of a loop is a loop redundancy. We classify loop redundancies into four major categories according to their granularities and repetition patterns, and GLORE is designed to tackle programs with these four major loop redundancies.

Category 1: Loop-invariant expressions. If an expression's operands and operations are invariant across all iterations of a loop, the expression is a loop-invariant expression, illustrated by a * b in Figure 2 (a).

Category 2: Partially loop-invariant expressions. If an expression is recomputed across some but not all iterations of a loop, that expression is a partially loop-invariant expression of that loop. Such redundancy could be the outcome of array expressions that appear in some aligned formats. Figure 2 (b) offers such an example, in which y[i + 1] + y[i + 2] in the i-th iteration of the loop executes the same computation as y[i − 2] + y[i − 1] does in the (i + 3)-th iteration.

Category 3: Loop-invariant loops. Let loop $L_2$ be a loop nested in loop $L_1$. If every invocation of $L_2$ contains exactly the same set of computations, we say that $L_2$ is a loop-invariant loop of $L_1$. Figure 2 (c) gives such an example: the reduction over products of x[i, l] and y[l, j] along axis l is repeatedly computed across loop k, and the reduction over products of y[l, j] and s[j, k] is repeatedly computed across loop i.

Category 4: Partially loop-invariant loops. Let $L_2$ be a loop nested in loop $L_1$. If the computations by some invocations of $L_2$ form a subset of the computations by some other invocations of $L_2$, we say that $L_2$ is a partially loop-invariant loop of $L_1$. Figure 2 (d) shows an example. The i-th iteration of the outer loop computes $\sum_{j=1}^{i} x[j]$. That computation is repeated in the first part of every later iteration of the outer loop. It differs from Category 3 in that later iterations have some extra computations.

The examples in Figure 2 are intentionally made simple for understanding. The actual code could involve data dependences, irregular loops, and other complexities, as the motivating examples in Figure 1 show.

Fig. 3. Summary of prior methods.

    Loop invariant removal [Allen and Kennedy 2001]: category 1 redundancies only.
    ASE [Deitz et al. 2001]: for sum-of-products in stencils only (part of category 2), requiring single-rank array references as operands and index expressions in a stringent form.
    ESR [Cooper et al. 2008]: for common array subexpressions in categories 1 and 2.
    REDUCTION [Gupta and Rajopadhye 2006]: for reductions only. No quantitative results reported. No support of while loops, imperfectly nested loops, and complex operations (e.g., sin, cos, etc.) in an expression.
    Paige [Paige and Koenig 1982]: incremental computation across function calls.
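The Category 2 and Category 4 examples of Figure 2 (b) and (d) can be made concrete with a short C sketch. It is a hand-written illustration, not the code GLORE generates; the function names, the `long` element type, and the caller-provided scratch array `pair` are assumptions introduced here.

```c
#include <assert.h>

/* Category 2, direct: x[i] = y[i-2] + y[i-1] + y[i+1] + y[i+2]
 * for 2 <= i <= m. y must hold indices 0..m+2. */
void stencil_direct(int m, const long *y, long *x)
{
    assert(m >= 2);
    for (int i = 2; i <= m; i++)
        x[i] = y[i - 2] + y[i - 1] + y[i + 1] + y[i + 2];
}

/* Category 2, with reuse: pair[p] = y[p] + y[p+1] is computed once.
 * It serves iteration p-1 (as y[i+1] + y[i+2]) and iteration p+2
 * (as y[i-2] + y[i-1]). pair must hold indices 0..m+1. */
void stencil_reuse(int m, const long *y, long *x, long *pair)
{
    assert(m >= 2);
    for (int p = 0; p <= m + 1; p++)
        pair[p] = y[p] + y[p + 1];
    for (int i = 2; i <= m; i++)
        x[i] = pair[i - 2] + pair[i + 1];
}

/* Category 4, direct: y[i] += x[1] + ... + x[i], O(M^2) additions. */
void tri_direct(int m, const long *x, long *y)
{
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= i; j++)
            y[i] += x[j];
}

/* Category 4, incremental: iteration i extends the sum that
 * iteration i-1 already computed, O(M) additions. */
void tri_incremental(int m, const long *x, long *y)
{
    long running = 0;
    for (int i = 1; i <= m; i++) {
        running += x[i];
        y[i] += running;
    }
}
```

In both pairs the second function produces the same values as the first on integer data; the saving comes purely from reusing work across iterations.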
One important ability of GLORE is to systematically conduct computation reordering and reassociation across both loops and expressions, so that these four categories of redundancies, if hidden in the original program, can still be exposed and removed.

3 RELATED WORK

Figure 3 summarizes the main methods developed in prior studies on removing loop redundancies. Traditional loop invariant removal [Allen and Kennedy 2001] is designed for only category-1 redundancy. Efforts to expand the scope have each been designed for a special type of redundancy. Without establishing a general, flexible way to analyze and reorder computations in a large scope, these efforts show limited applicability and effectiveness.

Array Subexpression Elimination (ASE) [Deitz et al. 2001] is designed only for sum-of-products computation stencils, where the operands must be single-rank array references and the index expressions must be of a stringent form.

ESR [Cooper et al. 2008] combines value numbering and scalar replacement to explore common subexpressions made up of array references. It is designed for some loop redundancies in only categories 1 and 2, and misses redundancies that require sophisticated computation reordering.

REDUCTION [Gupta and Rajopadhye 2006] is specially designed for simplifying reductions. It utilizes polyhedral models to explore computations shared among multiple reductions. It might be able to find some redundancies in categories 3 and 4, but cannot handle while loops, imperfectly nested loops, and complex operations (e.g., sin, cos, etc.). Moreover, its description stops at a theoretical level, giving neither an implementation nor guidelines for implementation; no quantitative results have been reported on that technique.

Hartono et al. [Hartono et al. 2006, 2005] try to identify the most cost-effective common subexpressions for tensor contraction in electronic structure calculations so that the total number of operations is minimized. They have strict requirements on the loop type and expression format: they support only for loops with constant loop bounds, the expressions must be products of array references, and the index expressions must be of a particular form.

Paige [Paige and Koenig 1982] studies how finite differencing can be used to optimize incremental computations. The optimization is at the function level, based on a predefined transformation library, without considering sophisticated computation reordering.

CLARITY [Olivo et al. 2015] is a recent work that shows promising results in detecting repeated traversals of arrays. Even though repeated traversals could hint at possible (not necessarily true) redundant computations, they are insufficient for precisely detecting or removing redundant computations. Some tiling techniques for imperfect loops [Song and Li 1999] may expose some possible redundant computations as a result of the loop tiling transformations. But their main purpose and effects are on improving cache performance by restricting data footprint size, rather than detecting loop redundancies.

A recent work [Luporini et al. 2017] manages to reduce the operation count for a class of finite element integration loop nests by exploiting fundamental mathematical properties of finite element operators. It can find some redundancies in category 3, but does not handle imperfectly nested loops or complex operations (e.g., sin, cos, etc.).
Nor does it systematically consider the combination of loop-level computation reordering and expression-level algebraic reordering, and thus it may miss some large-scope optimization opportunities.

Overall, for lack of a general approach to analyzing large-scope redundancy and comprehensive reordering, these methods are each limited to a special type of redundant computations with relatively narrow applicability. Even merging them together still leaves many cases uncovered and opportunities untapped, as Section 9 will show.

4 GLORE OVERVIEW

GLORE overcomes the limitations of the previous studies through two features: an LER-notation to enable flexible analysis and reordering of large-scope computations on both regular and irregular loops, and a series of novel algorithms to effectively determine the appropriate orders that minimize the amount of redundant computations. As a result, GLORE treats the most common cases in all four categories of redundancy (a much broader range of loop redundancies than any prior method covers), finds better computation orders, and is amenable to the aforementioned code complexities: reduction loops, regular loops, and irregular loops that are perfectly or imperfectly nested, carrying dependences or not, involving simple or complex operations (e.g., sin, cos, log).

We next present LER-notation and then explain how the algorithms in GLORE leverage the flexibility of the notation to analyze and reorder computations to remove each of the four categories of redundancies. We describe the conversion between code and LER-notation at the end. (Footnote 1: This paper uses C language terminology as GLORE is currently implemented for C programs, but the principled technique should be extensible to code in some other languages.)

5 LER-NOTATION

With LER-notation, a nested loop can be concisely captured in a few formulae, making symbolic analysis easier to apply. Our survey finds that no existing notation of loops can handle all the complexities mentioned in the previous section, whereas LER-notation solves the problem.

In LER-notation, a nested loop is represented in one or more formulae (called LER-formulae). Although LER-formulae can represent calculations in an arbitrary level of loops, in this work we use them to represent the computations in the innermost loop along with all the levels of loops enclosing the computations, because the innermost computations are typically the most costly part of a nested loop. There could be data dependences flowing into the innermost loop from other levels of loops, which are captured by some subscripts of operands in LER-notation, as described at the end of this section.

The general format of a formula in LER-notation is as follows (footnote 2: the name LER comes from this general form):

$$L\; E \Rightarrow R,$$

where L represents a sequence of loop notations, E represents an expression inside those loops, and $\Rightarrow R$ represents that the computation results are stored into variable R. The LER-representation of a nested loop is a collection of such formulae. We next explain E and L in more detail.

E and Operands Folding. The expression E may contain arbitrary mathematical computations (e.g., $\sin^2(x[i])$) as long as the computations do not alter the value of the operands or other variables (i.e., they are free of side effects). For computations using operators beyond the common basic mathematical operators (+, -, *, /), the computations are folded into a single synthetic operand with a unique ID and with all the loop indices used in the original computations included in the operand's indexing subscript. For instance, sin(a[i] + b[j]) is represented as synthetic_ab1[i, j], where synthetic_ab1 is the unique ID of the created synthetic operand and [i, j] is its indexing subscript. We call this transformation operands folding.
By hiding the detailed complexities in expressions but explicitly exposing the connections with the enclosing loops, operands folding makes it possible for GLORE to handle loops with complex expressions.

L and Dependence Subscripts. The loop sequence L is a combination of $L$, $\Sigma$, $\Pi$, and $W$, each of which represents one kind of loop:

(a) Regular for loops ($L$): $L^{i}_{l,u}$ represents a regular for loop with i as the loop index variable. It is assumed that the loops have already gone through normalization such that the index goes from a lower bound (l) to an upper bound (u) (which are affine expressions of loop index variables) with 1 as the step size. The following code, for instance, is represented as $L^{i}_{1,N}\,L^{j}_{1,M}\,(a[i]\cdot b[j]\cdot c[j]) \Rightarrow x[i,j]$ in LER-notation:

    for (i = 1; i <= N; i++){
      for (j = 1; j <= M; j++){
        x[i,j] = a[i] * b[j] * c[j];
      }
    }

(b) Reduction loops ($\Sigma$, $\Pi$): If the loop conducts a reduction operation (e.g., summation or product) across iterations, the loop is represented as a reduction loop. In the current implementation, GLORE considers just summation ($\Sigma$) and product ($\Pi$), which are the most commonly seen semirings. Other semirings can be handled with minor extensions. The notation of a reduction loop is the same as that of a regular for loop except that $L$ is replaced with either $\Sigma$ or $\Pi$.

Loop normalization and affine loop bounds are also assumed. So, $\Sigma^{i}_{l,u}$ represents a loop in which a summation is done across its iterations with i as the loop index and l and u as the loop bounds. The following code, for example, can be represented as $b + \Sigma^{i}_{1,N}\,\Sigma^{j}_{1,M}\; a[i] \Rightarrow x$:

    x = b;
    for (i = 1; i <= N; i++){
      for (j = 1; j <= M; j++){
        x = x + a[i];
      }
    }

Note that if the innermost loop contains multiple statements, multiple formulae could be created, with each corresponding to one of the statements. When there are values flowing between those statements, some vectorization of scalar variables may be necessary. For instance, if there is a statement c[i,j] = x in the previous example loop right after the x = x + a[i] statement, the LER-notation would be as follows:

$$b + \Sigma^{i}_{1,N}\,\Sigma^{j}_{1,M}\; a[i] \Rightarrow x[i,j]$$
$$L^{i}_{1,N}\,L^{j}_{1,M}\; x[i,j] \Rightarrow c[i,j].$$

(c) While loops and other irregular loops ($W$): Unlike the previous two kinds of loops, a while loop has no loop index variable or lower or upper bound. To help identify a particular while loop in the notation, LER-notation gives a unique identity to each while loop, represented as a subscript of $W$. For instance, $W_t$ represents a while loop whose identity is set as t. Irregular for loops and reduction loops (e.g., with non-affine loop bounds or control flow statements such as break that may cause early termination of the loops) are represented and treated in the same way as the while loops, except that their loop indices are used as their identities.

When the lower bound of a for loop or reduction loop is 1, the lower bound can be omitted.

In LER-notation, the subscript of $L$, $\Sigma$, $\Pi$, or $W$ is called the ID of that loop. If a variable (say x) gets assigned and a loop (either a for or a while loop) with ID i is the innermost loop that contains that assignment, the variable carries a subscript (e.g., $x_i$) in the LER-representation to explicitly indicate possible data dependences caused by the update to that operand.
For instance, the w in Figure 1 (a) gets updated in the while loop, so its subscript shall carry the ID of the while loop to indicate the possible cross-loop dependences. (In this paper, to simplify the representation, we explicitly write out such subscripts only when necessary.) Such dependences propagate: variables whose value comes from calculations involving a variable with such a subscript carry that subscript themselves. To obtain these subscripts, the examination starts from the operands in the outermost loop, and gradually moves to the inner loops and propagates the subscripts throughout the process. A synthetic operand carries subscript i if any of its original operands would carry subscript i.

Examples. In LER-notation, the imperfectly nested loop in Figure 1 (a) is expressed in the following formula:

$$W_t\,\Sigma^{i}_{1,N}\,(a[i] + b[i]\cdot w_t) \Rightarrow d_t. \quad (1)$$

It explicitly represents the statement in the innermost loop, as that is the focus of optimization. It uses the subscript t to capture the dependences of w and d over the iterations of the while loop due to the statements in the outer loop.
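To connect Formula (1) back to code: a[i] and b[i] carry no subscript t, so their reductions do not depend on the while loop, whereas $w_t$ and $d_t$ do. The following C sketch shows the loop of Figure 1 (a) next to a version with the two reductions hoisted. It is an illustration under assumptions: the function names and the explicit initial value `d0` are introduced here, and the caller is assumed to pass inputs for which the iteration converges.

```c
#include <assert.h>

/* Figure 1 (a) style: the reduction over i is redone in every
 * iteration of the while loop. d0 is the initial value of d and is
 * assumed to exceed 0.01 so that the loop is entered. */
double solve_direct(int m, const double *a, const double *b,
                    double w0, double d0)
{
    double w = w0, d = d0;
    assert(m >= 0);
    while (d > 0.01) {
        d = 0;
        for (int i = 0; i <= m; i++)
            d += a[i] + b[i] * w;
        w = w - 0.001 * d;
    }
    return w;
}

/* Hoisted form: A = sum of a[i] and B = sum of b[i] are invariant
 * across the while loop, so they are computed once up front. */
double solve_hoisted(int m, const double *a, const double *b,
                     double w0, double d0)
{
    double A = 0, B = 0;
    assert(m >= 0);
    for (int i = 0; i <= m; i++) {
        A += a[i];
        B += b[i];
    }
    double w = w0, d = d0;
    while (d > 0.01) {
        d = A + B * w;
        w = w - 0.001 * d;
    }
    return w;
}
```

The two functions are equal only up to floating-point reassociation, since distributing w over the sum changes the rounding order.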

The example in Figure 1 (c) is expressed as

$$L^{i}_{1,M}\,\Sigma^{j}_{0,i}\,\Sigma^{k}_{0,j}\; x[i,j]\cdot y[i,k] \Rightarrow r[i]. \quad (2)$$

LER-notation offers a way to concisely represent both regular and irregular loops and to explicitly encode possible data dependences into the representation. These properties prove essential for GLORE to achieve a much broader range of applicability than prior methods and to more effectively explore computation reordering of all scopes to detect and remove large-scope redundancies.

6 GLORE ANALYSIS AND OPTIMIZATIONS

This section describes GLORE and explains how it works on the LER representations of loops to find and remove the four categories of loop redundancies.

6.1 Overview

Fig. 4. GLORE transforms formulae through a series of steps to remove their loop redundancies. A final cross-formula optimization step is omitted. (Flow diagram: F_org → preprocess (operand folding; alternating form generation) → F_alt; for each f in F_alt: operand abstraction → f* → loop encapsulation → f′ → cat-3 opt (minimum union algorithm and closure-based algorithm) → G; for each g in G: loop decapsulation → g′ → cat-4 opt (incremental representation) → H; for each h in H: operand concretization → cat-1 and cat-2 opt (reuse list and groups) → I. Upper case denotes a set of formulae; lower case denotes a formula.)

Figure 4 outlines the main steps of the GLORE algorithm. The input to GLORE is a set of LER-formulae corresponding to a nested loop. Its output is a new set of LER-formulae with all the redundancies that GLORE finds removed.

The algorithm first preprocesses the input formulae. This step includes two main operations. The first is operand folding (described in Section 5), after which operations beyond the basic arithmetic operations are replaced with synthetic operands. The second is alternating form generation, in which each minus is turned into negative signs associated with the relevant operands, and division is folded into operands as an inverse. After that, the formula contains only plus or times, a form we call the alternating form. In such a form, the expression can be regarded as a hierarchy with the levels alternating between PLUS and TIMES, as illustrated in Figure 5.
The hierarchical view allows a divide-and-conquer strategy to be used, with redundancy detected and removed at each level of the hierarchy. Because the composition at each level involves either only plus or only times, it allows free reassociation and commutation of the operations among the children of an arbitrary node in the tree.

GLORE then optimizes each of the original formulae individually to remove the loop redundancies contained in each. After that, it examines the new set of formulae and removes redundancies that exist across the formulae.

Fig. 5. The alternating form of formula $5x - (3d[i] + a[j] - d[i]\cdot(b[j] + c[k]))$ is $5x + (-3d[i]) + (-a[j]) + d[i]\cdot(b[j] + c[k])$, regarded as a hierarchy with the levels alternating between PLUS and TIMES. (Tree diagram: a PLUS root with four TIMES children, namely 5 · x, −3 · d[i], −1 · a[j], and d[i] · (PLUS node over b[j] and c[k]).)

When optimizing an individual formula, GLORE takes a series of steps, and through the process, a formula is transformed into a series of forms. As Figure 4 shows, for a given formula f, GLORE first conducts operand abstraction, which replaces the index of each operand with a set of the IDs of its relevant loops. A loop is relevant to an operand if its ID appears in the index of the operand. We use relloops(x) to denote the set of loops relevant to an operand x. For instance, for the following formula:

$$L^{i}_{1,N}\,\Sigma^{j}_{1,i}\,\Sigma^{k}_{1,N}\; x[i+j]\cdot y[i,j+k]\cdot z[k] \Rightarrow w[i], \quad (3)$$

the relevant loop set is {i, j} for x, {i, j, k} for y, and {k} for z. After operand abstraction, the formula becomes

$$L^{i}_{1,N}\,\Sigma^{j}_{1,i}\,\Sigma^{k}_{1,N}\; x\{i,j\}\cdot y\{i,j,k\}\cdot z\{k\} \Rightarrow w[i]. \quad (4)$$

This step simplifies the removal of loop redundancies of categories 3 and 4, in which what is relevant is the set of loop indices in the indexing expressions of each operand rather than the indexing expressions themselves. We denote the resulting formula with $f^{*}$. (The concrete indexing expressions of each operand are restored in later steps.)

GLORE then applies loop encapsulation on $f^{*}$ to convert it into a new form $f'$, which, through the use of pseudo-bounds, hides the complexities in loop bounds such that every loop in $f'$, other than while loops, has only constant bounds. GLORE then uses the minimum union algorithm to detect and remove category-3 redundancies (loop-invariant loops) from $f'$, yielding a new set of formulae G (Section 6.2). For each formula in G, say g, GLORE decapsulates it to get a form $g'$ with the complexities of the loop bounds restored. GLORE then finds and removes category-4 redundancies (partially loop-invariant loops) by converting $g'$ into an incremental representation, resulting in a new set of formulae H.
GLORE then restores the concrete index expressions of operands (operand concretization), and removes the other two categories of redundancies by building up reuse lists and reuse groups of the expressions in the formulae.

We next give a detailed explanation of the algorithms for the removal of loop-invariant loops (category 3). As the most complex category to handle, it demonstrates how LER-notation facilitates the large-scope analysis and computation reordering for redundancy removal. The treatments of the other categories follow a similar approach; we describe them briefly at the end.
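The operand abstraction step, which maps an index expression to its set of relevant loops, can be sketched in a few lines of C. This is a minimal illustration and not GLORE's implementation: the function name `relloops_of_index`, the bitmask encoding of a loop set, and the restriction to single-letter loop IDs are all assumptions made here.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* True for characters that can be part of a C identifier. */
static int is_ident_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Returns a bitmask in which bit p is set when the single-letter loop
 * ID loop_ids[p] occurs as a standalone identifier in index_expr.
 * Example: "i + j" with loop_ids "ijk" yields binary 011. */
unsigned relloops_of_index(const char *index_expr, const char *loop_ids)
{
    size_t len = strlen(index_expr);
    unsigned mask = 0;

    assert(strlen(loop_ids) <= 32);     /* one bit per loop ID */
    for (size_t p = 0; p < len; p++) {
        const char *hit = strchr(loop_ids, index_expr[p]);
        /* Skip letters that are part of a longer identifier. */
        int starts = (p == 0) || !is_ident_char(index_expr[p - 1]);
        int ends = (p + 1 == len) || !is_ident_char(index_expr[p + 1]);
        if (hit != NULL && starts && ends)
            mask |= 1u << (unsigned)(hit - loop_ids);
    }
    return mask;
}
```

Applied to Formula (3) with loop IDs "ijk", the index of x maps to {i, j}, that of y to {i, j, k}, and that of z to {k}, matching the sets shown in Formula (4).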

6.2 Removal of Loop-Invariant Loops (Category 3)

To ease understanding, we start with the case where every loop bound is constant across the iterations of the nested loop of interest, there are no loop-carried data dependences (except regular reductions), and all loops are interchangeable in order. We discuss the other complexities later in Section 6.2.3. A loop-invariant loop can be either a reduction loop or a for loop. GLORE treats redundant reduction loops first and then treats redundant regular loops.

6.2.1 Loop-Invariant Reduction Loops. We will draw on the example in Figure 2 (c) in this section. In LER-notation, it can be expressed in the following formula:

$$L^{i}_{1,M}\,\Sigma^{j}_{1,M}\,L^{k}_{1,N}\,\Sigma^{l}_{1,N}\; x[i,l]\cdot y[l,j]\cdot s[j,k] \Rightarrow r[i,k]. \quad (5)$$

RelLoops. The removal of redundant reductions is based on the relloops of operands and the relloops of reduction loops. Recall that the relloops of an operand is the set of loop IDs that appear in the indexing expressions of that operand (as defined in Section 6.1). The relloops of a reduction loop R is the union of the relloops of all its operands whose relloops contain R. Formally, it is defined as follows:

$$relloops(R) = \bigcup_{o \,:\, o \in operands(R) \,\wedge\, R \in relloops(o)} relloops(o). \quad (6)$$

For instance, $relloops(\Sigma^{j})$ in Formula 5 is {j, l, k} because j appears in the indexing expressions of operands y[l, j] and s[j, k], and the union of their loop index sets is {j, l, k}. The relevant loops of a reduction tell which loops must be involved when doing the corresponding reduction. Our following discussion assumes that the reduction is a summation. The design principle is the same for multiplication-based reduction.

Formula Simplification. The input to the step for removing cat-3 redundancies is the form after the preprocessing steps and the operand abstraction. Array indexing expressions are already replaced with the relloops of the array access.
For instance, for formula $\Sigma^{i}_{N}\,\Sigma^{j}_{N}\,\Sigma^{k}_{N}\; d[2k] + 2\cdot a[i]\cdot \sin(d[i])\cdot(a[3i+3]\cdot b[j] + c[k]) \Rightarrow r$, its input form to this step is $\Sigma^{i}_{N}\,\Sigma^{j}_{N}\,\Sigma^{k}_{N}\; d\{k\} + 2\cdot a\{i\}\cdot d\_\{i\}\cdot(a\{i\}\cdot b\{j\} + c\{k\}) \Rightarrow r$, where the indexing of each array only indicates the relevance of the loops rather than the exact location of the element to access, and d_ is a synthetic operand.

Before detecting redundant reductions, we further simplify the formula. If the top level of the expression in the formula is a plus node (as Figure 5 shows), the formula is broken into several formulae, with each corresponding to a child of the root node. Moreover, for each of the new formulae (a times node), its operands are grouped to further simplify the representation: the operands represented by its immediate child nodes are grouped into a single synthetic operand if they have the same relloops, and each non-leaf child node turns into a single synthetic operand. For instance, the simplification result of the formula above becomes

$$\Sigma^{k}_{N}\; d\{k\} \Rightarrow tmp1,$$
$$\Sigma^{i}_{N}\,\Sigma^{j}_{N}\,\Sigma^{k}_{N}\; 2\cdot group\_a\_d\{i\}\cdot group\_a\_b\_c\{i,j,k\} \Rightarrow tmp2,$$
$$tmp1 + tmp2 \Rightarrow r,$$

where the first two formulae correspond to the two terms of the original plus expression. The operands in the second formula are further grouped: group_a_d{i} is derived from a{i} · d_{i} for their identical relloops, and group_a_b_c{i, j, k} is derived from the single term (a{i} · b{j} + c{k}). The final formula adds the results of the previous formulae to get the final result. This simplification puts each formula into a product form, which makes the removal of redundant reductions convenient, as shown next.

Detecting Redundancy. Based on the simplified product form and the concept of relloops, the detection of loop-invariant loops becomes easy: under the assumption that all loops in a formula are interchangeable, if loop $i \notin relloops(R)$, where R is a reduction loop, then the reduction loop R is invariant to loop i, which means that we may move out of loop i the reduction R of the subexpression consisting of operands that include R in their relevant loop sets.

An Example. In Formula 5, for instance, loop i is not in $relloops(\Sigma^{j})$, which equals {j, l, k}. The formula is hence equivalent to the following:

$$L^{l}_{1,N}\,L^{k}_{1,N}\,\Sigma^{j}_{1,M}\; y[l,j]\cdot s[j,k] \Rightarrow temp[l,k]$$
$$L^{i}_{1,M}\,L^{k}_{1,N}\,\Sigma^{l}_{1,N}\; x[i,l]\cdot temp[l,k] \Rightarrow r[i,k]. \quad (7)$$

The equivalence is intuitive: because the calculation of the sum of the product of y and s has nothing to do with loop i, it does not need to be repeatedly computed inside loop i. Moving it out removes the redundant computations. The cost of the reduction is reduced from $O(N^2M^2)$ to $O(N^2M)$.

Another way to understand the benefits is that the transformation essentially changes the order of computation involved in the two reductions by leveraging the distributive property of multiplication. Given i and k, the original formula computes r[i, k] as $\sum_{j}\sum_{l} x[i,l]\cdot y[l,j]\cdot s[j,k]$, while the new formula computes it as $\sum_{l} x[i,l]\cdot\big(\sum_{j} y[l,j]\cdot s[j,k]\big)$. With the inner summation moved into a separate formula, the computational complexity decreases.

For that example, an alternative form of the resulting formulae is as follows:

$$L^{i}_{1,M}\,L^{j}_{1,M}\,\Sigma^{l}_{1,N}\; x[i,l]\cdot y[l,j] \Rightarrow temp[i,j]$$
$$L^{i}_{1,M}\,L^{k}_{1,N}\,\Sigma^{j}_{1,M}\; temp[i,j]\cdot s[j,k] \Rightarrow r[i,k]. \quad (8)$$

This form differs from the previous form in the order of computing the reduction loops, and hence in the computational complexity ($O(NM^2)$ vs. $O(N^2M)$).
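The step from Formula (5) to Formula (7) corresponds to the following C sketch. It is a hand-written illustration rather than GLORE's generated code; the function names, 0-based indices, flattened row-major arrays, and the `long` element type are assumptions made here. Both functions accumulate into `r`, which the caller is expected to zero beforehand.

```c
#include <assert.h>
#include <stdlib.h>

/* Formula (5): r[i][k] += x[i][l] * y[l][j] * s[j][k].
 * Shapes: x is m x n, y is n x m, s is m x n, r is m x n.
 * Cost is O(N^2 M^2). */
void contract_direct(int m, int n, const long *x, const long *y,
                     const long *s, long *r)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < n; k++)
                for (int l = 0; l < n; l++)
                    r[i * n + k] += x[i * n + l] * y[l * m + j] * s[j * n + k];
}

/* Formula (7): the reduction over j does not depend on loop i, so it
 * is computed once into temp[l][k] and then reused. Cost is O(N^2 M).
 * Returns 0 on success, -1 if temp cannot be allocated. */
int contract_hoisted(int m, int n, const long *x, const long *y,
                     const long *s, long *r)
{
    assert(m > 0 && n > 0);
    long *temp = calloc((size_t)n * (size_t)n, sizeof *temp);
    if (temp == NULL)
        return -1;
    for (int l = 0; l < n; l++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < m; j++)
                temp[l * n + k] += y[l * m + j] * s[j * n + k];
    for (int i = 0; i < m; i++)
        for (int k = 0; k < n; k++)
            for (int l = 0; l < n; l++)
                r[i * n + k] += x[i * n + l] * temp[l * n + k];
    free(temp);
    return 0;
}
```

The alternative of Formula (8) would instead hoist the reduction over l into a `temp[i][j]` buffer; which of the two is cheaper depends on the relative sizes of M and N.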
The example of Formulae 7 and 8 demonstrates a critical aspect in removing redundant loops: finding the best order of the computations. We solve the problem through an algorithm named the minimum union algorithm.

Minimum Union Algorithm.

When there are many nested loops and operands whose indexing expressions each cover some subsets of the loop indices, finding the best computation order can be a difficult problem. In fact, a previous paper [Chi-Chung et al. 1997] shows that even a much simplified version of the problem³ is already NP-complete.

³ In the simplified problem, only regular for and summation loops with constant bounds are allowed, and the operands must be the product of arrays whose indices must follow some strict form; for instance, a[i, j] is allowed while neither a[i, i] nor a[2, j] is.

We design a heuristic algorithm, called the minimum union algorithm, to help quickly determine a good order of reduction loops (regular loops are discussed later). It produces a forest, which encodes the desired order of the reduction loops. Each node in the forest corresponds to a reduction loop. Loops on separate trees can have an arbitrary order in the produced formulae, while the loops in one tree will follow a post order (children before parent) in the produced formulae. We call the forest an ordering forest. Consider the following example:

\[
\Sigma_{i}^{N}\,\Sigma_{j}^{N}\,\Sigma_{k}^{N}\,\Sigma_{t}^{N}\; a[i][j] \cdot b[j][k] \Rightarrow result
\tag{9}
\]

Figure 6 shows the produced ordering forest. The t loop can be either before or after the other three loops; reduction loop i and loop k should be computed before reduction loop j.

    Tree 1:   t: {t}
    Tree 2:   j: {i, j, k}
                i: {i, j}
                k: {j, k}

Fig. 6. The ordering forest produced by the minimum union algorithm for Formula 9 (the brackets show the relLoops of each reduction).

The minimum union algorithm leverages two insights. First, if the relLoops of reduction loop i is a subset of that of reduction loop j and does not include j, computing loop i first will allow loop j to use its results (rather than recompute its results in every iteration of loop j), lowering the computational complexity. Second, if the relLoops of two reduction loops i and j have no overlap, then the two reduction loops do not need to use the results from each other, and hence their order does not matter to the cost.

Figure 7 outlines our algorithm. Its input is the set of reduction loops in a given LER-formula. GLORE stores with each reduction loop its relLoops, and estimates its cost as the product of the ranges of the index values of all loops in its relLoops (symbolically represented). It produces separate trees in the result forest based on the second insight given in the previous paragraph. It takes a greedy strategy, attempting to maximize the amount of reuse. In each iteration of the while loop in Figure 7, the loop with the minimum cost is selected and added into the forest. It is temporarily put as the child of all the yet-to-process reduction loops that use its results (they are its "temporary parents"), and the algorithm (lines 10-21 in Figure 7) updates those loops' costs by considering the replacement of the (re)computation of that loop with the reuse of its result. The parent of a node is later clarified: The first of its temporary parents that is put into the forest is set as its actual parent (lines 22-31 in Figure 7). Note that we do not need to undo the cost reduction to the other temporary parents because they will happen after this loop and can still use its results thanks to the post order in the generation of new formulae. When several orders could be the best depending on the actual loop bound values, the analysis records all of them and their respective favorable conditions. Application of the algorithm to Formula 9 gives the forest shown in Figure 6.

     1  /* inputs: a set of reduction loops S; each element has an
     2     estimated computation cost (cost) and relevant loops
     3     (relLoops) recorded.
     4     outputs: a forest F that records an optimized order to
     5     compute the reductions. */
     6  workList = createInitialNodes(S);
     7  // a tree node is created for each loop with children and
     8  // parent setting to NULL;
     9  while (workList != ∅) {
    10    thisLoop = the loop with the minimum cost in workList;
    11    remove thisLoop from workList and add it into F;
    12    foreach l in workList {
    13      // add thisLoop into the children list of loops
    14      // that rely on its results
    15      if ((l.ID ∩ thisLoop.relLoops) != ∅) {
    16        add thisLoop into the children list of l;
    17        // update the cost of l
    18        l.cost /= thisLoop.indexRange;
    19      }
    20    }
    21
    22    // confirm the parent relation with its children
    23    foreach c in thisLoop.children {
    24      if (c.parent != NULL) { // c has a parent already
    25        remove c from the children list of thisLoop
    26      }
    27      else {
    28        c.parent = thisLoop;
    29      }
    30    }
    31  }

Fig. 7. Minimum Union Algorithm for selecting an optimized order for reduction loops.

Removing Redundancy through Formula Generation. After getting the ordering forest, GLORE generates new formulae with the redundancy removed. This step involves not just the reduction loops, but also all other loops and all operands in the original formula. The generation works on the trees in the forest one after another; the order makes no difference. We explain the algorithm first and then provide an example.

When starting to work on a tree T, the algorithm fills a list A with all the operands that appear in the original formula. It traverses T in a post order (children before parent).
Consider a node corresponding to reduction loop R in an ordering forest. The algorithm creates a formula L E ⇒ r. L is a sequence of loop representations corresponding to the loops in relLoops(R). E represents the product α · β, where α is the product of the results produced by the children of this node in the ordering forest, and β is the product of all the operands in A that have R in their relLoops. Those operands are then removed from A. For a single-node tree with no relevant operands (e.g., node 1 in Figure 6), E is just 1; the formula is replaced with the computation of the range (ub - lb) of the loop index. The right-hand-side notation r represents a new name created by the algorithm to record the result; if L contains some regular loops, then r has the index [d_1, d_2, ..., d_k], where d_i is the ID of the i-th regular loop in L. After all trees in the forest have been processed, a final formula is generated to get the product of all those results.
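To illustrate how the ordering forest that drives this generation step can be obtained, the following Python sketch mirrors the structure of Figure 7. It is a simplified reading rather than GLORE's implementation: the dictionary-based representation, the function name `minimum_union`, the use of concrete integer ranges in place of symbolic ones, and in particular the tie-breaking rule (lower cost first, then fewer relevant loops, then loop name) are assumptions made here so that the result is deterministic.

```python
from math import prod


def minimum_union(rel_loops, ranges):
    """rel_loops: reduction loop ID -> set of its relevant loop IDs.
    ranges: loop ID -> size of its index range (concrete, for illustration).
    Returns (order, parent): the order in which loops enter the forest and
    the parent of each loop in the ordering forest (None for a root)."""
    cost = {r: prod(ranges[l] for l in rel) for r, rel in rel_loops.items()}
    children = {r: [] for r in rel_loops}
    parent = {r: None for r in rel_loops}
    work_list = set(rel_loops)
    order = []
    while work_list:
        # Greedy choice; the tie-breaking beyond cost is an assumption.
        this = min(work_list, key=lambda r: (cost[r], len(rel_loops[r]), r))
        work_list.remove(this)
        order.append(this)
        for l in work_list:
            if l in rel_loops[this]:          # l relies on this loop's result
                children[l].append(this)      # l is a temporary parent
                cost[l] //= ranges[this]      # l can reuse the result
        for c in list(children[this]):
            if parent[c] is not None:         # c already has an actual parent
                children[this].remove(c)
            else:
                parent[c] = this
    return order, parent


# Formula 9: reductions over i, j, k, t with operands a[i][j] and b[j][k].
formula9 = {"i": {"i", "j"}, "j": {"i", "j", "k"}, "k": {"j", "k"}, "t": {"t"}}
order, parent = minimum_union(formula9, {v: 10 for v in "ijkt"})
```

Under these assumptions, the sketch makes j the parent of both i and k and leaves t as a separate single-node tree, which agrees with Figure 6. Because a loop always enters the forest before its actual parent does, `order` is itself a children-before-parent order.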

Applied to Formula 9, the algorithm generates the following formulae based on the ordering forest shown in Figure 6.

\[
\begin{aligned}
& N \Rightarrow tmp0; \\
& \Gamma_{j}^{N}\,\Sigma_{i}^{N}\; a[i][j] \Rightarrow tmp1[j]; \\
& \Gamma_{j}^{N}\,\Sigma_{k}^{N}\; b[j][k] \Rightarrow tmp2[j]; \\
& \Sigma_{j}^{N}\; (tmp1[j] \cdot tmp2[j]) \Rightarrow tmp3; \\
& tmp0 \cdot tmp3 \Rightarrow result.
\end{aligned}
\]

The first formula corresponds to the reduction loop t, which contains no relevant operand or child. The second formula comes from the left child of the second tree, which corresponds to the reduction loop i. As its relLoops contains only i and j, the formula contains loop j as a regular loop and the reduction loop i. That node has no children and hence the expression in the formula contains only the product of all the operands that have i in their relLoops, which is just a[i][j]. The result is stored into a new name tmp1, whose index contains only j. The third formula comes from the right child of the second tree in a similar manner. The fourth formula comes from the root of the second tree. Because both a[i][j] and b[j][k] have been removed from the operand list A after the generation of the second and third formulae, there is no operand in A that has j in its relLoops. Therefore, the expression of this fourth formula contains only the results from its two children nodes, tmp1[j] and tmp2[j]. The final formula gets the final result by multiplying the results from different trees. The overall computational complexity reduces from O(n^4) to O(n^2).

6.2.2 Loop-Invariant Regular Loops. Redundant regular loops can be detected and removed upon LER-notation in a similar manner, through a different algorithm named the closure-based algorithm. It works on each of the formulae produced in the previous step. We use the following example for our discussion.

\[
\Gamma_{i}^{N}\,\Gamma_{j}^{N}\,\Gamma_{k}^{N}\; x[i] \cdot d[j,k] \cdot c[i,j] \Rightarrow w[i,j,k].
\tag{10}
\]

Such computations are cross products that are commonly seen in computational physics, where each operand represents some values in a lower dimensional space, and the result gives the values in a higher dimensional space. There are two kinds of optimization opportunities for such regular loops. The first is about synthetic operands.
If x[i] in Formula 10, for instance, is a synthetic operand (defined in Section 5) that involves non-trivial computations (e.g., sin^2(t[i])), then computing x[i] across loops j and k would be redundant. It could be avoided if we put the computations of all x[i] (1 ≤ i ≤ N) into a separate formula. Transformations to exploit this kind of opportunity are easy to do: Just put the operand and its relevant loops into a separate formula.

The second is about reuses across subexpressions. When there are multiple operands, their computations could be split into multiple steps (each as a separate formula), such that later steps can reuse, rather than repeatedly compute, the results of earlier steps. For example, the following formulae compute the same results as Formula 10 does, but require only N^2 + N^3 multiplications, rather than the 2N^3 multiplications needed by the original formula.

\[
\begin{aligned}
&\Gamma_{i}^{N}\,\Gamma_{j}^{N}\; x[i] \cdot c[i,j] \Rightarrow temp1[i,j] \\
&\Gamma_{i}^{N}\,\Gamma_{j}^{N}\,\Gamma_{k}^{N}\; temp1[i,j] \cdot d[j,k] \Rightarrow w[i,j,k].
\end{aligned}
\tag{11}
\]

The complexity in exploiting this kind of opportunity is again on ordering: There may be many possible orders in which the expressions could get computed. For Formula 10, a form alternative to Formula 11 is as follows:

\[
\begin{aligned}
&\Gamma_{i}^{N}\,\Gamma_{j}^{N}\,\Gamma_{k}^{N}\; x[i] \cdot d[j,k] \Rightarrow temp1[i,j,k] \\
&\Gamma_{i}^{N}\,\Gamma_{j}^{N}\,\Gamma_{k}^{N}\; temp1[i,j,k] \cdot c[i,j] \Rightarrow w[i,j,k],
\end{aligned}
\tag{12}
\]

which is more costly than Formula 11 due to the order in which it involves the operands in the computation.

For an arbitrary expression, finding the optimal order is NP-complete in general [Chi-Chung et al. 1997]. We design a linear-time closure-based heuristic algorithm to solve the problem. It is based on a concept we introduce, the operand closure tree. Each node in the tree, except the root, corresponds to an operand in the expression of the LER-formula to optimize, and carries the relLoops of that operand in it. The root is an artificially added node to put all nodes into one tree structure; its relLoops consists of all the loop IDs in the formula to optimize. An important property of the tree is that a child's relLoops must be a subset of its parent's, hence the name operand closure tree. This property helps GLORE find good orders. Figure 8 (a) shows the operand closure tree of Formula 10. Figure 8 (b) outlines the closure-based algorithm, which finds a good order through a post-order walk over the closure tree.

(a) [Operand closure tree for Formula 10.]

(b)
    // nodeSet: nodes that share a parent in the closure tree;
    foreach nd in nodeSet {
      nd.remainingIndexSpace = nd.indexSpace;
    }
    seenIndexSet = ∅;
    while (nodeSet is not empty) {
      thisNode = the node having the smallest remainingIndexSpace;
      generateFormula(thisNode);
      remove thisNode from nodeSet;
      extraIndexSet = thisNode.indexSet - seenIndexSet;
      extraIndexSpace = computeSpace(extraIndexSet);
      if (extraIndexSpace > 0) {
        // update the remainingIndexSpace of the other nodes
        foreach nd in nodeSet {
          nd.remainingIndexSpace /= extraIndexSpace;
        }
        seenIndexSet = seenIndexSet ∪ extraIndexSet;
      }
    }

Fig. 8. (a) The operand closure tree for Formula 10. (b) Closure-based algorithm for finding a good order for the operands that share a parent in an operand closure tree.
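The cost difference among Formulae 10, 11, and 12 can be checked directly by counting multiplications. The Python sketch below is only an illustration; the function names and the dictionary used to hold w are assumptions introduced here. Each function returns the computed w together with the number of multiplications it performed, for operands x[i], c[i][j], and d[j][k] over ranges of size N.

```python
def formula10(x, c, d, N):
    """Direct form: x[i] * d[j][k] * c[i][j] in a triple loop."""
    w, mults = {}, 0
    for i in range(N):
        for j in range(N):
            for k in range(N):
                w[i, j, k] = x[i] * d[j][k] * c[i][j]
                mults += 2
    return w, mults


def formula11(x, c, d, N):
    """Form of Formula 11: x[i] * c[i][j] is computed once per (i, j)."""
    mults = 0
    temp1 = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            temp1[i][j] = x[i] * c[i][j]
            mults += 1
    w = {}
    for i in range(N):
        for j in range(N):
            for k in range(N):
                w[i, j, k] = temp1[i][j] * d[j][k]
                mults += 1
    return w, mults


def formula12(x, c, d, N):
    """Form of Formula 12: x[i] * d[j][k] already spans all three loops."""
    mults = 0
    temp1 = {}
    for i in range(N):
        for j in range(N):
            for k in range(N):
                temp1[i, j, k] = x[i] * d[j][k]
                mults += 1
    w = {}
    for i in range(N):
        for j in range(N):
            for k in range(N):
                w[i, j, k] = temp1[i, j, k] * c[i][j]
                mults += 1
    return w, mults
```

For N = 4 the three forms perform 128, 80, and 128 multiplications respectively, matching 2N^3, N^2 + N^3, and 2N^3. Grouping x with c first keeps the intermediate index set at {i, j}, whereas grouping x with d immediately enlarges it to {i, j, k}.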
Before the walk, each non-root node has an indexSpace computed, which equals the product of the ranges of all the loops in its relLoops. Through the post-order walk, the algorithm uses a greedy strategy to iteratively decide the order of the children of each node. Its design tries to make the union of the index sets of the selected operands enlarge slowly, which helps maximize the amount of result reuse and hence effectively avoid unnecessary computations. Through the ordering process, new formulae are generated to incrementally compute the product of the children of a node (putting one more child into each new formula), and then a formula is created to compute the multiplications between that product and the parent node. The search algorithm is for general cases. For loops with only a small number of operands and loops, exhaustive search could be used to find the best order.

6.2.3 Extra Complexities. This subsection describes how GLORE handles non-constant loop bounds and data dependences when it removes category-3 redundancies.

Non-Constant Bounds. GLORE uses loop encapsulation to handle loops with non-constant bounds. Consider the following example.

\[
\Gamma_{i}^{1,N}\,\Sigma_{j}^{1,i}\,\Sigma_{k}^{1,N}\; x[i] \cdot y[j] \cdot z[k] \Rightarrow w[i]
\tag{13}
\]

where the upper bound of loop j is i. The basic idea of loop encapsulation is to use a pseudo-loop with constant bounds to replace a group of loops that may contain non-constant bounds and have dependences among their indices. Its application to Formula 13 gives

\[
\Sigma_{t,\{i\}}^{1,N^2}\,\Sigma_{k}^{1,N}\; x\{t\} \cdot y\{t\} \cdot z\{k\} \Rightarrow w[i],
\]

where loop t is a pseudo-loop for the group of loops {i, j}, and the subscript {i} records the regular loop (loop i) in that group; the upper bound of loop t (N^2) is a simple rough estimation of the size of the combined iteration space; the operands x and y both have t in their relLoops because they are relevant to some (or all) loops in that loop group. After encapsulation, the formulae turn into a form amenable for the previously described optimization algorithms to apply. (The technical report [TR 2017] contains the full algorithm of loop encapsulation.)

Data Dependences. When there are loop-carried data dependences, there may be certain restrictions on loop reordering in the optimizations. Many classic techniques have been developed for detecting loop-carried data dependences and for recognizing the legality of a new order of loops according to the data dependences [Allen and Kennedy 2001]. These techniques can be used to reveal the data dependences in a nested loop. Based on these analysis results, GLORE can ensure that its transformations produce legal formulae.

Specifically, GLORE avoids dependence violations through an annotation scheme and two principled rules. The annotation scheme is the subscripts of operands for specifying that the operands are subject to some data dependence across certain loops. An example is the subscript t in w_t in Formula 1, which indicates the data dependence of w across the while loop in Figure 1(a). Such annotations apply to for and reduction loops as well.
The annotations allow GLORE to follow two conservative rules to prevent any dependence violations:

(1) If a new formula's expression contains no operands that have dependence subscripts, the formula is safe to create. The correctness comes from the classic loop transformation theorem [Allen and Kennedy 2001]: Loop reordering is safe if there are no loop-carried data dependences.

(2) Whenever GLORE tries to create a new formula containing operands with dependence subscripts, it must include in the formula all the loops that any of the operands in the new formula depends on, and at the same time, loop reordering is allowed only among the loops inside the innermost dependence-carrying loop to avoid dependence violations.

Control Flow Statements. Our technique applies regardless of whether the loops contain if, continue, break, or other control flow statements. These statements are not explicitly expressed in our LER notations. But the dependences they induce are kept in the optimizations through dependence subscripts of variables in the notations and some constraints on loop reordering. If the computations in the expression included in an LER-notation have control or data dependences on one of such control flow statements, the variables in those expressions are marked with dependence subscripts of all the loops enclosing that control flow statement, which prevents any reordering that involves those loops and thereby preserves the dependences.
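The two conservative rules can be pictured as a simple legality check on a candidate formula. The sketch below is one possible reading and not GLORE's implementation: the function name `check_new_formula`, the representation of a formula's loops as a list ordered from outermost to innermost, and the mapping from each operand to the set of loops in its dependence subscripts are all assumptions made for illustration.

```python
def check_new_formula(loops, operand_deps):
    """loops: loop IDs of the candidate formula, outermost first (assumed layout).
    operand_deps: operand name -> set of loop IDs in its dependence subscripts.
    Returns (legal, reorderable), where reorderable lists the loops that may
    be freely reordered among themselves."""
    carried = set().union(*operand_deps.values())
    if not carried:
        # Rule (1): no dependence subscripts, so any reordering is safe.
        return True, list(loops)
    if not carried <= set(loops):
        # Rule (2), first part: a dependence-carrying loop is missing.
        return False, []
    # Rule (2), second part: only the loops nested inside the innermost
    # dependence-carrying loop may be reordered.
    innermost = max(loops.index(l) for l in carried)
    return True, list(loops[innermost + 1:])
```

For instance, with loops `["t", "i", "j"]` and an operand w that carries a dependence across t, the check reports the formula as legal and allows reordering only between i and j; dropping t from the loop list makes the same formula illegal.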

\[
\begin{aligned}
&\Gamma_{i}^{N}\,\Sigma_{j}^{1,i}\; x[i] \cdot y[j] \Rightarrow tmp1[i] \\[4pt]
&\text{(top step)}\quad \Gamma_{i}^{N}\,\Sigma_{j}^{1,i}\; y[j] \Rightarrow tmp2[i]; \qquad \Gamma_{i}^{N}\; x[i] \cdot tmp2[i] \Rightarrow tmp1[i] \\[4pt]
&\text{(bottom step)}\quad \Gamma_{i}^{N}\; x[i] \cdot \big(incl[i] : incl[i-1] + y[i],\; incl[1] = y[1]\big) \Rightarrow tmp1[i]
\end{aligned}
\]

In the bottom formula, incl[i] : incl[i-1] + y[i] is the relation between iterations (incl[i] is a new name), and incl[1] = y[1] is the initial condition.

Fig. 9. Example for removing redundancies of category 4.

6.3 Other Categories

This section describes the algorithms for other categories briefly. Readers may refer to our technical report [TR 2017] for details.

Category 4: GLORE first decapsulates the encapsulated loops. It then detects possible category-4 redundancies (partially loop-invariant loops as in Figure 2 (d)) by matching some common patterns (e.g., affine loop bounds) with the formulae. From the innermost to the outermost loops, it moves an operand out of a loop if its relLoops do not contain the loop's index, as the top step in Figure 9 illustrates. It then reformulates the partially loop-invariant loops into an incremental form, as the bottom step in Figure 9 shows. In the code generation step, it removes the original redundant computations by replacing them with incremental computations. When there are multiple reduction loops, GLORE applies the minimum union algorithm discussed in Section 6.2 to decide the reduction order. After that, for each reduction formula it generates, GLORE checks, in the LER-notation, whether the union of the relLoops of all the operands in R is a proper subset of the union of the loop IDs in L. If so, it tries to reformulate the partially loop-invariant loops into an incremental form. This category of redundancy is often seen in cases where so-called incremental computing [Hammer et al. 2015, 2014] has been applied. What GLORE contributes in this part is a mechanism for incorporating the detected incremental computations into the optimization of LER-formulae for automatically removing the redundancies.

Category 2: In all the cases we have discussed, some operands use only a subset of the loop indices, creating the redundancies. When all operands cover all loop indices, there can still be redundant computations.
One typical example is stencil-like computations, such as

\[
\Gamma_{i}\,\Gamma_{j}\; (A[i][j-1] + A[i-1][j] + A[i+1][j] + A[i][j+1]) \Rightarrow C[i,j],
\]

where the first half of the expression (A[i][j-1] + A[i-1][j]) in iteration (i, j) conducts the same computations as the second half of the expression (A[i+1][j] + A[i][j+1]) does in iteration (i-1, j-1). GLORE deals with these redundancies after removing the redundancies of categories 3 and 4. GLORE first takes an operand concretization step to reverse the effects of operand abstraction such that the operands in the formulae now have their concrete indexing expressions. We draw on the following example to explain the algorithm for removing category-2 redundancies: