Run Time Methods for Parallelizing Partially Parallel Loops*

Lawrence Rauchwerger†   Nancy M. Amato‡   David A. Padua†
University of Illinois   Texas A&M University   University of Illinois

Abstract

In this paper we give a new run time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop. In addition, it can implement at run time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). We also describe a new scheme for constructing an optimal parallel execution schedule for the iterations of the loop.

1 Introduction

To achieve a high level of performance for a particular program on today's supercomputers, software developers are often forced to tediously hand-code optimizations tailored to a specific machine. Such hand-coding is difficult, error-prone, and often not portable to different machines. Restructuring, or parallelizing, compilers address these problems by detecting and exploiting parallelism in sequential programs written in conventional languages. Although compiler techniques for the automatic detection of parallelism have been studied extensively over the last two decades [, ], current parallelizing compilers cannot extract a significant fraction of the available parallelism in a loop if it has a complex and/or statically insufficiently defined access pattern. This is an extremely important issue because a large class of complex simulations used in industry today have irregular domains and/or dynamically changing interactions.
Examples include SPICE for circuit simulation, DYNA-3D and PRONTO-3D for structural mechanics modeling, GAUSSIAN and DMOL for quantum mechanical simulation of molecules, CHARMM and DISCOVER for molecular dynamics simulation of organic systems, and FIDAP for modeling complex fluid flows [8].

* Due to space limitations, this paper is an extended abstract of [].
† Center for Supercomputing Research & Development, 1308 W. Main St., Urbana, IL 61801, rwerger,padua@csrd.uiuc.edu. Research supported in part by Intel and NASA Graduate Fellowships, and Army contract #DABT6-9-C-00. This work is not necessarily representative of the positions or policies of the Army or the Government.
‡ Department of Computer Science, Texas A&M University, College Station, TX 77843-3112, amato@cs.tamu.edu. Research supported in part by an AT&T Bell Laboratories Graduate Fellowship, NSF Grant CCR , and the International Computer Science Institute, Berkeley, CA.

Thus, since the available parallelism in these types of applications cannot be determined statically by present parallelizing compilers [6, 8], compile-time analysis must be complemented by new methods capable of automatically extracting parallelism at run time. Run time techniques are needed because the access pattern of some programs cannot be statically determined, either because of limitations of current analysis algorithms or because the access pattern is input data dependent. For example, most dependence analysis algorithms conservatively assume dependences when presented with non-linear or subscripted subscript expressions.

During the past few years, techniques have been developed for the run time analysis and scheduling of loops [5, 9,, 7, 0,, 5, 6, 7, 8, 9, 0,, ]. The majority of this work has concentrated on developing run time methods for constructing execution schedules for partially parallel loops, i.e., loops whose parallelization requires synchronization to ensure that the iterations are executed in the correct order.
Given the original, or source loop, most of these techniques generate inspector code that analyzes, at run time, the cross-iteration dependences in the loop, and scheduler/executor code that schedules and executes the loop iterations using the dependence information extracted by the inspector [0].

Our Results. We give a new inspector/scheduler/executor method for finding an optimal parallel execution schedule for a partially parallel loop. Our inspector is fully parallel, uses no synchronization, and can be applied to any loop (from which an inspector can be extracted). In addition, our inspector can implement at run time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). The ability to identify privatizable and reduction variables is very powerful since it eliminates the data dependences involving these variables and increases the available parallelism in the loop. The scheduler partitions the set of iterations into subsets called wavefronts. Iterations in each wavefront can be executed in parallel, i.e., there are no data dependences between iterations in a wavefront. Although the wavefronts themselves are constructed one after another, the computation of each wavefront is fully parallel and requires no synchronization. The scheduling can be dynamically overlapped with the parallel execution of the loop iterations to utilize the machine more uniformly. Our new method improves on the previous techniques since none of them has all of these properties (a comparison to previous work is contained in Section 4).

2 Preliminaries

In order to guarantee the semantics of a loop, the parallel execution schedule for its iterations must respect the data dependence relations between the statements in the loop body [, 5,,, 5]. There are three possible types of dependences between two statements that access the same memory location: flow (read after write), anti (write after read), and output (write after write). Flow dependences express a fundamental relationship about the data flow in the program.
Anti and output dependences, also known as memory-related dependences, are caused by the reuse of memory, e.g., program variables. If there are flow dependences between accesses in different iterations of a loop, then the semantics of the loop cannot be guaranteed unless those iterations are executed in order of iteration number because values that are computed (produced) in an iteration of the loop are used (consumed) during some later iteration. If there are no flow dependences, but there are anti or output dependences between iterations of a loop, then the loop must be modified to remove all such dependences before these iterations can be executed in parallel. In some cases, even flow dependences can be removed by simple algorithm substitution, e.g., reductions. Unfortunately, not all such situations can be handled efficiently.

(a)
        do i = 1, n/2
    S1:   tmp = A(2*i)
          A(2*i) = A(2*i-1)
    S2:   A(2*i-1) = tmp
        enddo

(b)
        do i = 1, n
          do j = 1, m
    S3:     A(j) = A(j) + exp()
          enddo
        enddo

Figure 1

In order to remove certain types of dependences two transformations can be applied to the loop: privatization and reduction parallelization. Privatization creates, for each processor cooperating on the execution of the loop, private copies of the program variables that give rise to anti or output dependences (see, e.g., [7, 8, 9, ]). The loop shown in Figure 1(a) is an example of a loop that can be executed in parallel by using privatization; the anti dependences between statement S2 of iteration i and statement S1 of iteration i + 1, for 1 ≤ i < n/2, can be removed by privatizing the temporary variable tmp. In this paper, the following criterion is used to determine whether a variable may be privatized.

Privatization Criterion: Let A be a shared array (or array section) that is referenced in a loop L. A can be privatized if and only if every read access to an element of A is preceded by a write access to that same element of A within the same iteration of L.

In general, dependences that are generated by accesses to variables that are only used as workspace (e.g., temporary variables) within an iteration can be eliminated by privatizing the workspace.

Reduction parallelization is another important technique for transforming certain types of data dependent loops for concurrent execution.
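Before turning to reductions, the Privatization Criterion above can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the paper: the trace format (iteration, variable, access kind) and the helper names `swap_loop_trace` and `privatizable` are assumptions introduced here. It replays the accesses of the swap loop in Figure 1(a) and reports which variables have every read preceded by a write in the same iteration.

```python
def swap_loop_trace(n):
    """Accesses made by the loop of Figure 1(a), in program order.

    Each entry is (iteration, variable, kind) with kind 'r' or 'w'.
    The tuple layout is an assumption made for this sketch.
    """
    trace = []
    for i in range(1, n // 2 + 1):
        trace += [
            (i, ("A", 2 * i), "r"), (i, "tmp", "w"),             # S1: tmp = A(2*i)
            (i, ("A", 2 * i - 1), "r"), (i, ("A", 2 * i), "w"),  # A(2*i) = A(2*i-1)
            (i, "tmp", "r"), (i, ("A", 2 * i - 1), "w"),         # S2: A(2*i-1) = tmp
        ]
    return trace


def privatizable(trace):
    """Return the variables whose every read follows a write in the same iteration."""
    written = set()              # (iteration, variable) pairs written so far
    seen, violating = set(), set()
    for iteration, var, kind in trace:
        seen.add(var)
        if kind == "w":
            written.add((iteration, var))
        elif (iteration, var) not in written:
            violating.add(var)   # the read consumes a value from outside this iteration
    return seen - violating


result = privatizable(swap_loop_trace(8))
print(result)
```

Under these assumptions only `tmp` qualifies: each element of A is read before it is written within its iteration, so A itself does not meet the criterion, while `tmp` is always written first.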
Definition: A reduction variable is a variable whose value is used in one associative operation of the form x = x ⊗ exp, where ⊗ is the associative operator and x does not occur in exp or anywhere else in the loop. If the operator is not commutative then the implementation of the parallel equivalent reduction operation is more constrained.

Reduction variables are therefore accessed in a certain specific pattern (which leads to a characteristic data dependence graph). A simple but typical example of a reduction is statement S3 in Figure 1(b). The operator ⊗ is exemplified by the + operator, the access pattern of array A(:) is read, modify, write, and the function performed by the loop is to add a value computed in each iteration to the value stored in A(:). Once reduction variables are identified, methods are known for performing the reduction operation in parallel (see, e.g., [,, 6, 5]).

3 Run Time Analysis of Loops

Given a do loop whose access pattern cannot be statically analyzed, compilers have traditionally generated sequential code. Since compile time data dependence analysis techniques cannot be used on such programs, methods of performing the analysis at run time are required. Several techniques have been developed for the run time analysis and scheduling of loops with cross-iteration dependences [5,9,,7,0,,8,9,0,,]. However, for various reasons, such techniques have not achieved widespread use in current parallelizing compilers.

In the following we describe a new run time scheme for constructing a parallel execution schedule for the iterations of a loop. The general structure of our method is similar to the above cited run time techniques: given the original, or source loop, the compiler generates inspector code that analyzes, at run time, the cross-iteration dependences in the loop, scheduler code that schedules the loop iterations using the dependence information extracted by the inspector, and executor code that executes the loop iterations.
In the previous techniques, the scheduler and the executor are tightly coupled codes which are collectively referred to as the executor, and the inspector and the scheduler/executor codes are usually decoupled [0]. Although our methods can also interleave the scheduler and the executor, we treat them separately since they do tackle distinct tasks.

3.1 The Inspector

In this section we describe a new inspector scheme that processes the memory references in a loop and constructs a data structure which the scheduler can use to efficiently assign iterations to wavefronts. In addition, our inspector can implement at run time two important transformations: (element-wise) array privatization and reduction parallelization (see Section 2). The ability to identify privatizable and reduction variables is very powerful since it eliminates the data dependences involving these variables. In particular, these transformations increase the available parallelism in the loop and also reduce the work required of the scheduler since it need not consider dependences involving such variables when it constructs the parallel execution schedule for the loop iterations.

The basic strategy of our method is for the inspector to preprocess the memory references and determine the data dependences for each memory location accessed. Later, the scheduler uses this memory location dependence information to determine the data dependences between the iterations. We describe the method as applied to a shared array A that is accessed through subscript arrays (see Figure 2(a)). For simplicity, we first consider only the problem of identifying the cross-iteration dependences for each array element (memory location). After describing the inspector, we discuss how the dependence information it discovers can be used to identify the array elements that are read-only, privatizable, or reduction variables.

The inspector has two main tasks.

1. For each array element A[x], the inspector collects all the references to it into an array (or list) R_x and stores them in iteration order.
For each reference it stores the iteration number and access type (i.e., read or write) (see Figure 2(b)).

2. For each array element A[x], the inspector determines the data dependences between all its references and stores them in a data structure H_x for later use by the scheduler.

Below we discuss how the references to each array element can be collected and stored in the array (or list) R_x. Assuming R_x is available, we first describe how the inspector determines the dependences among the references to A[x] and computes the data structure H_x.

The relations between the references to A[x] can be organized (conceptually) into an array element dependence graph D_x. If adjacent references in R_x have different access types, then a flow or anti dependence exists, and if they are both writes, then an output dependence is signaled. These dependences are reflected by parent-child relationships in D_x. If adjacent references are both reads, then there is no dependence between the elements, but they may have a common parent (child) in D_x: the last write preceding (first write following) them in R_x. For example, the dependence graph D_x for the element A[x] used in Figure 2 is shown in Figure 2(c).

(a)
    do i = 1, 8
      A(W(i)) = ...
      ... = A(R(i))
      work(i)
    enddo

[Panels (b)-(d) of Figure 2: (b) the records of R_x, each with iteration, access type, and level; (c) the dependence graph; (d) the hierarchy vector, giving for each level the index in R_x.]

Figure 2: (a) A source loop with subscript arrays W(1:8) and R(1:8), (b) the array R_x for one element A[x], (c) its dependence graph D_x, and (d) its hierarchy vector H_x.

Our goal is to encode the predecessor/successor information of the (conceptual) dependence graph D_x in a hierarchy vector H_x so that the scheduler can easily look up the dependence information for the references to A[x]. First, we add a level field to the records in R_x, and store in it the reference's level in the dependence graph D_x (see Figure 2(b)). Then, for each level, we store in H_x the index (pointer to location) in R_x of the first reference at that level. Specifically, H_x is an array and H_x[i] contains the index in R_x of the first reference at level i, i.e., H_x will serve as a look-up table for the first reference in R_x at any level (see Figure 2(d)). Note that this implies that H_x records the position in R_x of every write access and of the first read access in any run of reads.

We now give an example of how the hierarchy vector serves as a look-up table for the predecessors and successors of all the accesses. Consider the read access to the element A[x] of Figure 2 in the 6th iteration, which appears as the 6th entry in R_x. Its level is 5, and thus it finds its successor by looking at the 5 + 1 = 6th element of the hierarchy vector H_x, which contains the value 8, indicating that its successor is the 8th element in R_x. Similarly, its predecessor is found by looking in the 5 - 1 = 4th element of H_x, which indicates that its predecessor is the 5th element of R_x.

Implementing the Inspector.
We now consider how to collect the accesses to each array element A[x] into the arrays R_x. Regardless of the technique used to construct these arrays, to ensure the scalability of our methods we must process (mark) the references to the shared array A in a doall (see Figure 3(a) and (b)). The computation performed in the marking operations will depend upon the technique used to construct the arrays R_x. In any case, note that since we are interested in cross-iteration data dependences we need only record at most one read and one write access in R_x for any particular iteration, i.e., subsequent reads or writes to A[x] in the same iteration can be ignored.

(b)
    doall p = 1, nproc
      private integer j
      do j = start(p,niter), end(p,niter)
        markwrite(W(j))
        markread(R(j))
      enddo
    enddoall

[Panels (a) and (c) of Figure 3: (a) the source do loop of Figure 2(a); (c) for each of the two processors, its private arrays pR (iteration, type, level) and pH (index in pR), with ? marking entries to be filled in later.]

Figure 3: An example of the private element arrays pR and hierarchy vectors pH (c) when two processors are used in the inspector doall loop (b) for the source do loop (a).

Perhaps the simplest method of constructing the element arrays R_x is to first place a record for each memory reference into an array R_A, and then sort the records lexicographically by array element number (first key) and iteration number (second key). After sorting, each array R_x will occupy a contiguous portion (a subarray) in the array R_A. In this case the marking operations simply record the information about the access into R_A. After the lexicographic sort, the level of each reference in D_x can be computed by a prefix sum computation. However, since the range of the values to be sorted is known in advance (it is given by the dimension of the shared array A), a linear time bucket or bin sort can be used in place of the more general O(n log n) lexicographic sort. Moreover, if the inspector's marking phase is chunked (i.e., statically scheduled), then further optimization is possible.
In this case, processor i will be assigned iterations $i\lceil n/p\rceil$ through $(i+1)\lceil n/p\rceil - 1$, where p is the total number of processors, n is the number of iterations in the loop, and $0 \le i < p$. The basic idea is as follows. First, in a private marking phase, each processor marks the references in its assigned iterations, and constructs element arrays R_x and hierarchy vectors H_x as described above, but only for the references in its assigned iterations. Then, in a cross-processor analysis phase, the hierarchy vectors for the whole iteration space of the loop are formed using the processors' hierarchy (sub)vectors.

The private marking phase proceeds as follows. Let A[1:s] be the shared array under scrutiny, and suppose each processor has a separate array pR[1:s, 1:2n/p] in which to store the records of the references in its set of iterations. Each record contains the iteration, type of reference, and level as described above. (The second dimension of 2n/p follows since at most one read and one write to any element need to be marked in each iteration, and each processor has n/p iterations.) Assuming a processor marks its iterations in order of increasing iteration number, it can immediately place the records for the references into its array pR in sorted order of iteration number. In addition to the array pR, each processor has a separate array pH[1:s, 1:2n/p] used to store the hierarchy vectors for the references in its assigned set of iterations. Again, assuming that iterations are processed in increasing order of iteration number, the hierarchy vectors can be filled in at the same time that the references are recorded in pR (see Figure 3(c)).

In the cross-processor analysis phase we need to find for each array element A[x] the predecessor, if any, of the first reference recorded by each processor, i.e., we need to fill in the value in processor i's hierarchy vector for the reference that immediately precedes (in the dependence graph D_x) the first reference to A[x] that was assigned to processor i. Similarly, we must find the immediate successor of the last reference to A[x] that was assigned to processor i. Processor i can find the predecessors (successors) needed for its hierarchy vectors by scanning the arrays of the processors less than (larger than) i. For example, a ? at the end of a hierarchy vector in pH for the first processor in Figure 3 would be filled in with a pointer to the first element in the corresponding row of pR of the second processor. Hence, the initial and final entries in the hierarchy vectors also need to store the processor number that contains the predecessor and successor. These scans can be made more efficient by maintaining some auxiliary information, e.g., for each array element, each processor computes the total number of accesses it recorded, and the indices in pR of the first and last write to that element. In any case, we note that filling in the processors' hierarchy vectors requires a minimal amount of interprocessor communication, i.e., it requires only a connecting and not a full merging of the different hierarchy vectors.

There are several ways in which the above sketched analysis phase can be optimized. For example, in order to determine which array elements need predecessors and successors (i.e., the elements with non-empty arrays R_x), the processor needs to check each row of its array pR (row i of pR corresponds to the array R_i). This could be a costly operation if the dimension of the original array is large and the processor's assigned iterations have a sparse access pattern. However, the need to check each row in pR can be avoided by maintaining a list of the non-empty rows.
This list can be constructed during the marking phase, and then traversed in the analysis phase. Another source of inefficiency for machines with many processors is the search for a particular predecessor (or successor) since each processor might need to look for a predecessor in all the preceding (succeeding) processors' iterations. The cost of these searches can be reduced from O(p) to O(log p) using a standard parallel divide and conquer pair-wise merging approach [6], where p is the total number of processors.

Privatization and Reduction Recognition. The basic inspector described above can easily be augmented to find the array elements that are independent (i.e., accessed in only one iteration), read-only, privatizable, or reduction variables. We first consider the problem of identifying independent, read-only, and privatizable array elements. During the marking phase, a processor maintains the status of each element referenced in its assigned iterations with respect to only these iterations. In particular, if it finds that an element is written in any of its assigned iterations, then it is not read-only. If an element is accessed in more than one of its assigned iterations, then it is not independent. If an element was read before it was written in any of its assigned iterations, then it is not privatizable. Next, the final status of each element is determined in the cross-processor analysis phase as follows. An element is independent if and only if it was classified as independent by exactly one processor, and was not referenced on any other processor. An element is read-only if and only if it was determined to be read-only by every processor that referenced it. Similarly, an element is privatizable if and only if it was privatizable on every processor that accessed it. Thus, the elements can be categorized by a similar process to the one used to find the predecessors and successors when filling in the processors' hierarchy vectors. Finally, if we maintain a linked list of the non-empty rows of pR as mentioned above, then the rows corresponding to elements that were found to be independent, read-only, or privatizable are removed from the list, i.e., accesses to these elements need not be considered when constructing the parallel execution schedule for the loop iterations.

(a)
        do i = 1, n
    S1:   A(K(i)) = ...
    S2:   ... = A(L(i))
    S3:   A(R(i)) = A(R(i)) + exp()
        enddo

(b)
        doall i = 1, n
          markwrite(K(i))
          markredux(K(i))
    S1:   A(K(i)) = ...
          markread(L(i))
          markredux(L(i))
    S2:   ... = A(L(i))
          markwrite(R(i))
    S3:   A(R(i)) = A(R(i)) + exp()
        enddoall

Figure 4: The transformation of the do loop in (a) is shown in (b). The markwrite (markread) operation adds a record to the processor's array pR (if it is not a duplicate), and updates the hierarchy vector pH appropriately. The markredux operation invalidates the indicated array element as a reduction variable since it is accessed outside the reduction statement S3.

We now consider the problem of verifying that a statement is a reduction using run time data dependence analysis. Recall that potential reduction statements are generally identified by syntactically matching the statement with the generic reduction template x = x ⊗ exp, where x is the reduction variable, and ⊗ is an associative operator. The statement is validated as a reduction if it can be shown that x is neither referenced in exp nor anywhere in the loop body outside the reduction statement. For example, although statement S3 in the loop in Figure 4(a) matches a reduction statement, it is still necessary to prove that the elements of array A referenced in S1 and S2 do not overlap with those accessed in statement S3, i.e., that K(i) ≠ R(j) and L(i) ≠ R(j), for all 1 ≤ i, j ≤ n. It turns out that this condition can be tested in the same way that read-only and privatizable array elements are identified. In particular, during the marking phase, whenever an element is accessed outside the reduction statement the processor invalidates that element as a reduction variable. Again, the final status of each element is determined in the cross-processor analysis phase, i.e., an element is a reduction variable if and only if it was not invalidated as such by any processor.
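The two-phase classification described above can be sketched in a few lines. The Python below is an illustration under stated assumptions, not the paper's implementation: the `Status` record, the functions `mark_chunk` and `merge`, and the access tuples (iteration, element, kind, in_reduction) are all names and layouts invented for this example, and the reduction statement is recorded here as a read followed by a write so that reduction elements are not mistaken for privatizable ones. Each processor forms a verdict from its own iterations, and the verdicts are then combined with the rules stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Status:
    read_only: bool = True
    independent: bool = True
    privatizable: bool = True
    reduction: bool = True


def mark_chunk(accesses):
    """Private marking phase for one processor.

    accesses: (iteration, element, kind, in_reduction) tuples in program
    order for this processor's iterations; kind is 'r' or 'w'.
    """
    status, first_iter, written = {}, {}, set()
    for iteration, x, kind, in_reduction in accesses:
        st = status.setdefault(x, Status())
        if first_iter.setdefault(x, iteration) != iteration:
            st.independent = False      # touched in more than one iteration
        if kind == "w":
            st.read_only = False
            written.add((iteration, x))
        elif (iteration, x) not in written:
            st.privatizable = False     # read before written in this iteration
        if not in_reduction:
            st.reduction = False        # accessed outside the reduction statement
    return status


def merge(per_processor):
    """Cross-processor analysis: combine the per-processor verdicts."""
    final = {}
    for x in set().union(*per_processor):
        seen = [s[x] for s in per_processor if x in s]
        final[x] = Status(
            read_only=all(s.read_only for s in seen),
            independent=len(seen) == 1 and seen[0].independent,
            privatizable=all(s.privatizable for s in seen),
            reduction=all(s.reduction for s in seen),
        )
    return final


# Hypothetical accesses for 4 iterations split over two processors.
proc0 = [(1, 0, "r", False), (1, 2, "w", False), (1, 2, "r", False),
         (1, 3, "r", True), (1, 3, "w", True), (1, 4, "r", True), (1, 4, "w", True),
         (2, 1, "w", False), (2, 3, "r", True), (2, 3, "w", True)]
proc1 = [(3, 0, "r", False), (3, 4, "r", False), (3, 3, "r", True), (3, 3, "w", True),
         (4, 2, "w", False), (4, 2, "r", False), (4, 3, "r", True), (4, 3, "w", True)]
final = merge([mark_chunk(proc0), mark_chunk(proc1)])
for x in sorted(final):
    print(x, final[x])
```

In this made-up trace, element 0 ends up read-only, element 1 independent, element 2 privatizable, and element 3 a valid reduction variable, while element 4 is invalidated as a reduction because the second processor reads it outside the reduction statement.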
This basic strategy can be extended to handle more complex reduction operations (refer to [] for details).

Complexity of the Inspector. The worst case complexity of the inspector is O(a log p), where a is the maximum number of references assigned to each processor and p is the total number of processors. In particular, using the bucket sort implementation, each processor spends constant time on each of its O(a) accesses in the marking phase, and the analysis phase takes time O(a log p) using a parallel divide and conquer pair-wise merging strategy [6]. We remark that since the cost of the analysis phase is proportional to the number of distinct elements accessed (i.e., the number of non-empty rows in the pR array) the complexity of this phase could be significantly less than O(a log p) if there are many repeated references in the loop. Also, if a log p > s, then the merge among the processes can be improved to O(s + log p) time by chunking the pR arrays.
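To make the inspector's two tasks concrete, the following sequential Python sketch buckets the references by array element, assigns each record its level, and fills in the hierarchy vector. It is a simplified illustration, not the parallel implementation described above: the function name `inspect`, the record layout, and the subscript arrays `W` and `R_sub` are assumptions of this example. The subscript values were chosen so that the look-up example from the text (6th entry at level 5, successor at position 8, predecessor at position 5) is reproduced; they are not claimed to be the values in Figure 2. Levels are computed with a plain scan rather than a prefix sum.

```python
def inspect(accesses, size):
    """Build R[x] (records) and H[x] (hierarchy vector) for elements 1..size.

    accesses: (iteration, element, kind) tuples in iteration order.
    Records are [iteration, kind, level]; levels and the indices stored in
    H are 1-based to match the text.
    """
    R = [[] for _ in range(size + 1)]    # one bucket per element (index 0 unused)
    recorded = set()
    for iteration, x, kind in accesses:
        if (iteration, x, kind) in recorded:
            continue                     # at most one read and one write per iteration
        recorded.add((iteration, x, kind))
        R[x].append([iteration, kind, 0])

    H = [[] for _ in range(size + 1)]
    for x in range(1, size + 1):
        level, prev_kind = 0, None
        for index, record in enumerate(R[x], start=1):
            # A write, or the first read after a write, opens a new level.
            if prev_kind is None or record[1] == "w" or prev_kind == "w":
                level += 1
                H[x].append(index)
            record[2] = level
            prev_kind = record[1]
    return R, H


# Hypothetical subscript arrays for a loop shaped like Figure 2(a):
# iteration i writes A(W[i]) and then reads A(R_sub[i]).
W = [1, 3, 2, 4, 3, 5, 6, 3]
R_sub = [3, 7, 3, 3, 8, 3, 3, 3]
accesses = []
for i in range(1, 9):
    accesses += [(i, W[i - 1], "w"), (i, R_sub[i - 1], "r")]

R, H = inspect(accesses, size=8)
print(R[3])   # records for element 3
print(H[3])   # first index in R[3] at each level
```

With these inputs the 6th record of `R[3]` is the read in iteration 6 at level 5; the 6th entry of `H[3]` is 8 (its successor) and the 4th entry is 5 (its predecessor), which mirrors the look-up described in the text.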

3.2 The Scheduler

The scheduler derives the more restrictive iteration-wise dependence relations from the memory location dependence information found by the inspector. A valid parallel execution schedule for a loop is a partition of the set of iterations into ordered subsets called wavefronts, so that all cross-iteration dependences go from an iteration in a lower numbered wavefront to an iteration in a higher numbered wavefront. We say that a valid parallel execution schedule is optimal if it has a minimum number of wavefronts, i.e., it has as many wavefronts as the longest path (the critical path) in the directed acyclic graph (dag) describing the cross-iteration dependences in the loop. We remark that the schedulers described below can be used to construct the full iteration schedule in advance (as described) or they can be interleaved with the executor, i.e., the iterations could be executed as they are found to be ready.

A simple scheduler. A simple scheduler that finds an optimal schedule is sketched in Figure 5(a). In the figure, an array wf(i) stores the wavefront found for iteration i, the global variable done flags if all iterations have been scheduled, rdy(i) signals if iteration i is ready to be executed, lower case letters (a, b) are used for references to array elements, a.iter is the iteration which contains reference a, and Pred(a) is the set of immediate predecessors of a in the array element dependence graphs. The scheduling is performed in phases (line 4) so that in phase i the iterations belonging to the ith wavefront are identified. In each phase, all the references recorded in the pR arrays are processed (lines 7-16), and the predecessors of all references whose iterations have not been scheduled (line 10) are examined. An iteration is not ready if the iterations of any of its references' predecessors were not assigned to previous wavefronts (line 11). After all the references are processed, all the iterations are examined (lines 17-19) to see which can be added to the current wavefront: an iteration i is ready (line 18) if none of its references set rdy(i) to false.
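The phase structure of the simple scheduler can be tried out with a short sequential sketch. The Python below is an illustration only: the two doalls become ordinary loops, the input format (a dictionary from element to its references in iteration order) and the helper `immediate_predecessors` are assumptions of this example, and predecessors belonging to the same iteration are skipped, which is a choice made here so that an iteration never waits on itself.

```python
def immediate_predecessors(refs):
    """refs: [(iteration, kind)] for one element, in iteration order.

    Yields (iteration, predecessor_iterations); the predecessors of a
    reference are all references on the previous level of the element's
    dependence graph.
    """
    previous, current, prev_kind = [], [], None
    for iteration, kind in refs:
        if prev_kind is not None and (kind == "w" or prev_kind == "w"):
            previous, current = current, []   # a new level starts here
        current.append(iteration)
        yield iteration, previous
        prev_kind = kind


def simple_schedule(num_iters, element_refs):
    """Return the wavefront number of iterations 1..num_iters."""
    access = [pair for refs in element_refs.values()
              for pair in immediate_predecessors(refs)]
    wf = [0] * (num_iters + 1)                # 0 means "not scheduled yet"
    cpl, done = 1, False
    while not done:
        rdy = [True] * (num_iters + 1)
        done = True
        for it, preds in access:              # first doall of the scheduler
            if wf[it] == 0:
                for b in preds:
                    # Same-iteration predecessors are skipped (assumption).
                    if b != it and wf[b] == 0:
                        done = False
                        rdy[it] = False
        for i in range(1, num_iters + 1):     # second doall
            if wf[i] == 0 and rdy[i]:
                wf[i] = cpl
        cpl += 1
    return wf[1:]


# Hypothetical reference lists for an 8-iteration loop.
element_refs = {
    3: [(1, "r"), (2, "w"), (3, "r"), (4, "r"), (5, "w"),
        (6, "r"), (7, "r"), (8, "w"), (8, "r")],
    7: [(2, "r")],
    8: [(5, "r")],
}
wavefronts = simple_schedule(8, element_refs)
print(wavefronts)
```

For this input the chain of accesses to element 3 forces six wavefronts, with iterations 3 and 4 sharing one wavefront and iterations 6 and 7 sharing another, since consecutive reads do not depend on each other.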
Advantages of this scheduler are that it is conceptually very simple and quite easy to implement.

     1  wf(1:numiter) = 0
     2  done = .false.
     3  cpl = 1
     4  do while (done .eq. .false.)
     5    rdy(1:numiter) = .true.
     6    done = .true.
     7    doall i = 1, numaccess
     8      a = access(i)
     9      if (wf(a.iter) .eq. 0) then
    10        foreach (b in Pred(a))
    11          if (wf(b.iter) .eq. 0) then
    12            done = .false.
    13            rdy(a.iter) = .false.
    14        endfor
    15      endif
    16    enddoall
    17    doall i = 1, numiter
    18      if (wf(i) .eq. 0 .and. rdy(i) .eq. .true.) wf(i) = cpl
    19    enddoall
    20    cpl = cpl + 1
    21  enddo while

[Panel (b) of Figure 5: the dependence graph D_x for an element A[x], with nodes labeled by iteration and arranged by level.]

Figure 5: A simple scheduler (a), and the dependence graph for one of the memory locations accessed in the loop (b).

Optimizing the simple scheduler. There are some sources of inefficiency in this scheduler. First, since a write access could potentially have many parent read accesses it could prove expensive to require each write to check all its parents (line 10). Fortunately, this problem is easily circumvented by requiring an unscheduled read access to inform its successor's iteration that it is not ready. Then, a write access only needs to check its predecessor if the (single) predecessor is also a write. Another source of inefficiency arises from the fact that each inner doall (lines 7-16) requires time O(n_a/p) to identify unscheduled iterations (line 9), where n_a is the total number of accesses to the shared array and p is the number of processors. Thus, the scheduler takes time O((n_a/p) cpl), where cpl is the length of the critical path. If cpl ≈ p, then it cannot be expected to offer any speedup over sequential execution, and even worse, it could yield slowdowns for longer critical paths. However, note that in any single iteration of the scheduler, the only iterations that could potentially be added to the next wavefront must have all their accesses at the lowest unscheduled level in their respective element-wise dependence graphs. For example, consider the dependence graph shown in Figure 5(b). If the iteration at some level has not been scheduled yet, then none of the iterations with accesses in higher levels could be added to the current wavefront. Thus, in each of the cpl iterations of the do while loop, we would like to examine only those references that are in the topmost unscheduled level of their respective dependence graph.

First note that we can easily identify the accesses on each level of the array element dependence graphs since references are stored in increasing level order in the pR arrays and the pH arrays contain pointers to the first access at each level. To process only the accesses on the lowest unscheduled level it is useful to have a count of the total number of (recorded) accesses in each iteration, which can easily be extracted in the marking phase. Then, in the scheduler, a count of the number of ready accesses for each iteration is computed on a per-processor basis in the first doall (lines 7-16). In the second doall (lines 17-19), the cross-processor sum of the ready access counts for each unscheduled iteration is compared to its total access count, and if they are equal the iteration is added to the current wavefront.

In summary, we would expect the optimized version to outperform the original scheduler if there are multiple levels in the array element dependence graphs. Hence, the determination of which version to use should be made using knowledge gained about the access pattern by the inspector. In [], we discuss ways to reduce scheduling overhead such as overlapping wavefront computation with actual loop execution and using dynamic ready queues [].

4 A Comparison with Previous Methods

We now compare the methods described in this paper to several other techniques that have been proposed for analyzing and scheduling do loops at run time. Most of this work has concentrated on developing inspectors. A high level comparison of the various methods is given in Table 1.

                 obtains    contains   requires   restricts   privat
                 optimal    serial     global     type of     or
    Method       sched      portions   synch      loop        reduct
    New          Yes        No         No         No          P,R
    ZY []        No         No         Yes        No          No
    MP [0]       Yes        No         Yes        No          No
    KS []        No         No         Yes        No          P
    CYT [9]      No         No         Yes        No          No
    SM [8]       No         No         Yes        Yes^5       No
    SMC [0]      Yes        Yes        Yes        Yes^5       No
    LZ [7]       Yes        No         Yes        Yes^5       No
    P []         No         No         No         No          No
    RP [5, 6]    No^6       No         No         No          P,R

Table 1: A comparison of run time parallelization techniques for do loops. In the table entries, P and R show that the method identifies privatizable and reduction variables, respectively. The superscripts have the following meanings: 1, the method serializes all read accesses; 2, performance can degrade significantly in the presence of hotspots; 3, the scheduler/executor is a doacross loop (iterations are started in a wrapped manner) and busy waits are used to enforce certain data dependences; 4, the inspector loop sequentially traverses the access pattern; 5, the method is applicable only to loops without output dependences (i.e., each memory location is written at most once); 6, the method identifies only fully parallel loops.

Methods utilizing critical sections. The method of Zhu and Yew [] computes the wavefronts one after another using a method similar to the simple scheduler described in Section 3.2. During a phase, an iteration is added to the current wavefront if none of the data accessed in that iteration is accessed by any lower unassigned iteration; the lowest unassigned iteration to access any array element is found using atomic compare-and-swap synchronization primitives and a shadow version of the array. Midkiff and Padua [0] extended this method to allow concurrent reads from a memory location in multiple iterations. These methods run the risk of a severe degradation in performance for access patterns containing hot spots (i.e., many accesses to the same memory location). A feature of them is that they use only a shadow version of the shared array whereas all other methods (except [, 5, 6]) unroll the loop and store all accesses to the shared array.

Krothapalli and Sadayappan [] proposed a run time scheme for removing anti and output dependences from loops.
For each memory location, their inspector counts the number of references to it (using critical sections as in []), places them in a dynamically allocated array, and then sorts them by iteration number. After building a dependence graph for each memory location (similar to our arrays R x), the inspector removes all anti and output dependences by redirecting the accesses to dynamically allocated storage (using an additional level of indirection). Flow dependences are enforced using full/empty bits. To our knowledge, this is the only other run time privatization technique except for the one described in [25, 26].

Recently, Chen, Yew, and Torrellas [9] proposed an inspector that first builds (in private storage) access lists for each memory location referenced in a processor's assigned iterations (similar to [] and our inspector's marking phase, except they serialize read accesses), and then links them across processors using a global Zhu/Yew algorithm [34]. Their scheduler/executor uses doacross parallelization [28] (see below). Although this scheme potentially has less communication overhead than [], it is still sensitive to hot spots and there are cases (e.g., doalls) in which it proves inferior to [].

Methods for loops without output dependences. This problem has also been studied extensively by Saltz et al. [5, 28, 29, 30, 33]. Most of their work assumes that there are no output dependences in the source loop. In doacross parallelization [28], an inspector finds the (at most one) iteration in which each variable is written. The scheduler/executor starts iterations in a wrapped manner and processors busy wait until their operands are available. In [30], the inspector constructs wavefronts that respect the flow dependences by performing a sequential topological sort of the accesses in the loop, and the scheduler/executor enforces any anti dependences using old and new versions of each variable (possible since each variable in the source loop is written at most once). The topological sort can be parallelized somewhat using doacross parallelization.
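To make the notion of a wavefront concrete, the following is a minimal sequential sketch in Python that places each iteration in the lowest wavefront consistent with the flow, anti, and output dependences of a recorded access pattern. It is an illustration only and not the algorithm of any of the methods compared here: the function name `wavefronts` and the trace format (one list of `(location, kind)` pairs per iteration, with kind `"r"` or `"w"`) are assumptions made for the example.

```python
from collections import defaultdict


def wavefronts(accesses):
    """Return wf, where wf[i] is the 1-based wavefront of iteration i.

    accesses[i] is the list of (location, kind) pairs issued by
    iteration i in program order; kind is "r" (read) or "w" (write).
    """
    last_write = {}                    # location -> wavefront of latest writer
    readers_since = defaultdict(int)   # location -> highest wavefront among
                                       # readers since the latest write
    wf = []
    for iteration in accesses:
        level = 1
        for loc, kind in iteration:
            # Flow (read after write) or output (write after write) dependence.
            if loc in last_write:
                level = max(level, last_write[loc] + 1)
            # Anti (write after read) dependence; concurrent reads are allowed.
            if kind == "w":
                level = max(level, readers_since[loc] + 1)
        # Record this iteration's accesses only after its level is fixed,
        # so an iteration never depends on itself.
        for loc, kind in iteration:
            if kind == "w":
                last_write[loc] = level
                readers_since[loc] = 0
            else:
                readers_since[loc] = max(readers_since[loc], level)
        wf.append(level)
    return wf


if __name__ == "__main__":
    trace = [
        [("A", "r"), ("B", "w")],
        [("B", "r"), ("C", "w")],
        [("D", "w")],
        [("A", "w")],
    ]
    print(wavefronts(trace))
```

The largest value in the returned list is the critical path length (cpl) of the loop. Because the sketch walks the iterations in order it is itself a sequential inspector, which is the limitation the parallel inspectors discussed in this section try to remove.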
Leung and Zahorjan [17] proposed methods of parallelizing the sequential inspector of [30]. In their sectioning method, the loop is chunked and each processor computes an optimal schedule for its chunk, and then these schedules are concatenated together separated by synchronization barriers. In the bootstrapping technique, the inspector is parallelized (not optimally) using sectioning, but an optimal schedule is produced.

Other methods. In contrast to the above methods which place iterations in the lowest possible wavefront, Polychronopoulos [23] gives a method where wavefronts are maximal sets of contiguous iterations with no cross-iteration dependences. Dependences are detected using shadow versions of the variables, either sequentially, or in parallel with the aid of critical sections as in [].

All of the above mentioned methods attempt to find a valid parallel execution schedule for the source do loop. Recently, we considered a related problem [25, 26]: testing at run time whether the loop is fully parallel, i.e., whether there are any cross-iteration dependences in the loop. Our interest in fully parallel loops is motivated by the observation that they arise frequently in real programs.

5 Implementation and Experimental Results

We present experimental results obtained on two modestly parallel machines with 8 (Alliant FX/80 [1]) and processors (Alliant FX/800 [2]). However, we remark that the results scale with the number of processors and the data size and thus they may be extrapolated for massively parallel processors (MPPs), the actual target of our run time methods. To demonstrate that the new methods can achieve speedups, we applied them to three loops contained in the PERFECT Benchmarks [4]. To analyze the overhead incurred by the methods we applied them to access patterns taken from actual programs and to synthetic access patterns.

The methods were implemented in Cedar Fortran [12]. The inspector was essentially as described in Section .. In particular, we implemented the bucket sort version using separate pr and ph data structures for each processor.
Each processor constructed a linked list of the non-empty rows in its pr array during the marking phase. Checks for independent, read only, and privatizable elements were implemented in the inspector (we have not yet included the test for reduction variables). In the analysis phase, these elements are classified at the same time that the predecessors and successors are found for each row. An optimization that we did not yet implement was the pair wise merge across processors when searching for predecessors or successors in the analysis phase (or when classifying elements as independent, read only, or privatizable). However, this is an important optimization since, as previously noted, without it the analysis phase of the inspector may fail to scale with the number of processors.

Since we implemented the optimized version of the simple scheduler described in Section ., a count of the total number of accesses in each iteration was computed in the marking phase (no inter-processor communication is needed to determine these counts since each iteration is assigned to a single processor). For simplicity, the scheduler and the executor were completely decoupled in the implementation, but better speedups should be obtainable by interleaving these two tasks (see Section .). We remark that there are other issues to be considered when applying these methods in a real application environment such as memory requirements and known bounds on the source loop's available parallelism (refer to [] for more details).

Synthetic Loops

Using synthetic loops, we studied the sensitivity of the overhead of the methods to two characteristics of the source do loop: its average parallelism (#iterations/cpl) and its hotspot degree (the maximum number of repeated accesses to any array element). To simplify the generation of the synthetic workloads, we did not identify independent, read only, or privatizable elements in the analysis phase.

Average parallelism. To isolate the effect of the average parallelism in the source loop on the overhead of the methods, we generated access patterns that were as similar as possible in all aspects except for the average parallelism: each iteration had two accesses (a read followed by a write), and every array element was accessed approximately twice. We would not expect the inspector's execution time to be dependent on the average parallelism in the source loop since it is fully parallel. However, as the scheduler runs in cpl steps, its execution time should be inversely correlated with the average parallelism. In Figures 6 and 7 we display results from a loop with 08 iterations run on 0 processors. The plot shows the overhead incurred for a loop with a critical path length of Step. As expected, the overhead of the inspector is invariant with the length of the critical path, and that of the scheduler grows linearly with this length. We also studied how overhead speedup relates to average parallelism.
The inspector's overhead is independent of the average parallelism since it is fully parallel. Although the scheduler consists of cpl steps, it may still exhibit substantial speedups since each step is fully parallel. In fact, in Figures 8 and 9 we show that almost identical speedups are obtained for sequential, partially parallel, and fully parallel loops for both the inspector and scheduler. The slightly diminished slope of the inspector's speedup curve after about 0 processors is because our implementation did not use a pair wise merge among the processors (Section .).

Hotspots. To isolate the effect of the hotspot degree in the source loop on the overhead of the methods, we generated similar access patterns differing only in hotspot degree: all loops had 08 iterations (each with two accesses), a critical path length of 0, and a loop with hotspot value h contained h references to each of 08/h array elements. We would not expect the methods to be negatively affected by the hot spot degree. In fact, a larger hotspot degree implies fewer non-empty rows in the pr array, and thus we might see improved results in the analysis and scheduling phases. The results in Figure 0 show that in fact the total overhead (inspector + scheduler) is nearly the same for all hotspot degrees.

Loops from the MA28 Solver

We applied the new methods to loops from real applications, both to demonstrate the diversity of partially parallel access patterns and also to reconfirm the conclusions reached above using synthetic loops. For this purpose we chose Loop MA0cd/DO 0 from MA28 (a blocked sparse non-symmetric linear solver [10]). We selected this loop, which performs the forward backward substitution in the final phase of the blocked sparse linear system solver, because it can generate many diverse access patterns when using the Harwell-Boeing matrices as input. Unfortunately, the loop itself is not a good candidate for parallelization since it performs very little work and is highly imbalanced. We discuss two input sets: gemat, which generates 99 iterations, and bp 600, which generates 8 iterations.
After extracting and precomputing the linear recurrences from the source loop (based on the methods in [7]), we generated a parallel inspector and computed an optimal parallel execution schedule for the loop. The parallelism profiles obtained (Figures and ) show the wavefront sizes of the optimal parallel execution schedule and illustrate how the same loop can generate vastly different dependence graphs given different input. Figure shows that most of the iterations of the loop can be executed in the initial wavefronts (cpl = ), which suggests that interleaving the wavefront computation and execution would be more beneficial than overlapping them, so that parallelization can be abandoned when the sequential tail of the profile is reached. Although in Figure most of the iterations are also executed in the initial wavefronts, in this case it appears that some benefit could be gained by overlapping, i.e., we can take advantage of the pauses in parallelism to compute future (hopefully larger) wavefronts. The histograms in Figures and underscore the need for scheduling and execution strategies that can adapt dynamically depending upon the type of parallelism encountered. Figures 5 and 6 show that overhead speedup is invariant with the parallelism profile. Larger speedups were not obtained since the loop is heavily imbalanced due to the blocked nature of the algorithm used in MA28.

Perfect Benchmark Loops

We applied the methods to three loops contained in the PERFECT Benchmarks [4]. In the analysis phase it was found that one of the loops was fully parallel, and that the other two could be transformed into doalls by privatizing the shared array under test. Figures 7 through 9 show the speedup measured for each loop as a function of the number of processors used. As a reference, we give the ideal speedup, which was measured using an optimally parallelized (by hand) version of the loop. These graphs show that the speedup scales with the number of processors and is a significant percentage of the ideal speedup.
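As a rough illustration of what finding a shared array privatizable involves, the sketch below checks a recorded access pattern element by element. It assumes the commonly used criterion that an element is privatizable when, in every iteration that touches it, no read occurs before a write to it in that same iteration. Both the criterion as stated here and the names (`privatizable_locations`, the `(location, kind)` trace format) are assumptions for the example and are not taken from the implementation described in this paper.

```python
def privatizable_locations(accesses):
    """Return the set of locations that pass the privatization check.

    accesses[i] is the list of (location, kind) pairs issued by
    iteration i in program order; kind is "r" (read) or "w" (write).
    A location fails the check if some iteration reads it before
    writing it (the read could then see a value from another iteration).
    """
    seen = set()     # every location touched by the loop
    exposed = set()  # locations with a read not covered by an earlier
                     # write in the same iteration
    for iteration in accesses:
        written = set()  # locations already written in this iteration
        for loc, kind in iteration:
            seen.add(loc)
            if kind == "w":
                written.add(loc)
            elif loc not in written:
                exposed.add(loc)
    return seen - exposed


if __name__ == "__main__":
    trace = [
        [("T", "w"), ("T", "r"), ("A", "r")],
        [("T", "w"), ("T", "r"), ("B", "w")],
        [("B", "r")],
    ]
    print(privatizable_locations(trace))
```

In this hypothetical trace only `T` passes: it is used as a per-iteration temporary, so giving each iteration its own copy removes the anti and output dependences between iterations. `A` is read only, and `B` carries a genuine flow dependence from the second iteration to the third.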
We note that these loops could also be identified by the LRPD test [25, 26], a run time test for identifying fully parallel loops, i.e., loops that can be transformed into doalls using privatization and reduction parallelization. Although the LRPD test has a smaller overhead than the methods presented here, it cannot extract partial parallelism.

In BDNA ACTFOR Loop 0, the shared array under test is accessed through a subscript array computed inside the loop which is found to be privatizable in the analysis phase (Figure 7). In MDG INTERF Loop 000, it is also found that the shared array under test is privatizable in the analysis phase (Figure 8). In OCEAN FTRVMT Loop 09, all accesses to the shared array are found to be unique in the analysis phase. Since this loop is invoked 6,000 times, and accounts for 0% of the sequential execution time of the program, it is an excellent candidate for schedule reuse [30]. The access pattern for each instantiation of the loop is determined by a set of five scalars. In order to apply schedule reuse, we checked whether the current set of scalars matched a previously analyzed set. If not, then we applied the parallelization techniques, and if they did match then we simply executed the loop as a doall. As can be seen in Figure 9, with schedule reuse we obtain scalable speedups that are comparable to the ideal speedup.

6 Conclusion

Parallelizing statically intractable loops at run time is an important task since automatic, compile time parallelization had stopped with regular, well behaved, statically defined programs which represent only a fraction of all applications. We believe that aggressive, dynamic techniques such as those described here can break this barrier and extract much of the available parallelism from even the most complex programs. The scalability of our methods ensures that their run time overhead can be reduced to an insignificant fraction of the program's sequential execution time, which implies that their significance will only increase with the advent of massively parallel processors (MPPs).

Although these new methods illustrate the potential benefits of run time parallelization, there is still much work left to be done. For example, there are many potential scheduling strategies that need to be studied. Another important task is to devise effective, automatable strategies for determining when and how to use run time parallelization. Since speedups obtainable from run time parallelization are upper bounded by the inherent parallelism of the loop, the compiler needs to estimate obtainable parallelism. Such estimates can be produced only through collection and interpretation of valid statistics from programs in different application domains.
The new methods provide a useful tool for such studies since they determine the dependence graph and parallelism profile of the loop. It should be noted that run time overhead could be significantly reduced through architectural support. We view the methods described in this paper as a building block in an evolving framework of run time parallelization as a complement to the existing techniques [25, 26, 27].

Acknowledgment. We would like to thank Paul Petersen for his useful advice, and William Blume and Brett Marsolf for identifying and clarifying applications for our experiments. We are also grateful to Richard Cole for suggestions regarding sorting algorithms.

References

[1] Alliant Computer Systems Corporation. FX/Series Architecture Manual, 1986.
[2] Alliant Computers Systems Corporation. Alliant FX/800 Series System Description, 99.
[3] U. Banerjee. Dependence Analysis for Supercomputing. Kluwer. Boston, MA., 1988.
[4] M. Berry and others. The PERFECT club benchmarks: Effective performance evaluation of supercomputers. TR. 87, Ctr. for Supercomputing R.&D., Univ. of Illinois, Urbana, IL, May 1989.
[5] H. Berryman and J. Saltz. A manual for PARTI runtime primitives. Interim Report 90-, ICASE, 1990.
[6] W. Blume and R. Eigenmann. Performance analysis of parallelizing compilers on the Perfect Benchmarks(TM) Programs. IEEE Trans. on Parallel and Distributed Systems, (6):6 656, Nov. 99.
[7] M. Burke, R. Cytron, J. Ferrante, and W. Hsieh. Automatic generation of nested, fork-join parallelism. J. of Supercomputing, pp. 7 88, 1989.
[8] W. J. Camp, S. J. Plimpton, B. A. Hendrickson, and R. W. Leland. Massively parallel methods for engineering and science problems. Comm. ACM, 7():, April 99.
[9] D. K. Chen, P. C. Yew, and J. Torrellas. An efficient algorithm for the run-time parallelization of doacross loops. In Proc. of Supercomputing 99, pp , Nov. 99.
[10] I. S. Duff. MA28 a set of Fortran subroutines for sparse unsymmetric linear equations. Tech. Rept. AERE R870, HMSO, London, 1977.
[11] R. Eigenmann, J. Hoeflinger, Z. Li, and D. Padua. Experience in the automatic parallelization of four Perfect-Benchmark programs. In Lecture Notes in Comp. Science 589. Proc. of the th Workshop on Languages and Compilers for Parallel Computing, Santa Clara, CA, pp. 65 8, Aug. 99.
[12] M. Guzzi, D. Padua, J. Hoeflinger, and D. Lawrie. Cedar Fortran and other vector and parallel Fortran dialects. J. Supercomput., ():7 6, March 1990.
[13] V. Krothapalli and P. Sadayappan. An approach to synchronization of parallel computing. In Proc. of the 1988 Int. Conf. on Supercomputing, pp , June 1988.
[14] C. Kruskal. Efficient parallel algorithms for graph problems. In Proc. of the 1986 Int. Conf. on Parallel Processing, pp , Aug
[15] D. J. Kuck, R. H. Kuhn, B. Leasure, D. A. Padua, and M. Wolfe. Dependence graphs and compiler optimizations. In Proc. of 8th ACM Symp. Princip. Prog. Lang., pp. 07 8, Jan. 98.
[16] F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 99.
[17] S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In th PPOPP, pp. 8 9, May 99.
[18] Zhiyuan Li. Array privatization for parallel execution of loops. In Proc. of the 9th Int. Symp. on Comput. Arch., pp., 99.
[19] D. E. Maydan, S. P. Amarasinghe, and M. S. Lam. Data dependence and data-flow analysis of arrays. In Proc. 5th Workshop on Programming Languages and Compilers for Parallel Computing, Aug. 99.
[20] S. Midkiff and D. Padua. Compiler algorithms for synchronization. IEEE Trans. Comput., C-6():85 95, 1987.
[21] J. Moreira and C. Polychronopoulos. Autoscheduling in a distributed shared-memory environment. TR. 7, Ctr. for Supercomputing R.&D., Univ. of Illinois, Urbana, June 99.
[22] D. Padua and M. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 9:8 0, Dec
[23] C. Polychronopoulos. Compiler optimizations for enhancing parallelism and their impact on architecture design. IEEE Trans. Comput., C-7(8):99 00, Aug
[24] L. Rauchwerger, N. Amato and D. Padua. Run-time methods for parallelizing partially parallel loops. TR. 00, Ctr. for Supercomputing R.&D., Univ. of Illinois, Urbana, IL, May 1989.
[25] L. Rauchwerger and D. Padua. The privatizing doall test: A run-time technique for doall loop identification and array privatization. In Proc. of the 99 Int. Conf. on Supercomputing, pp., July 99.
[26] L. Rauchwerger and D. Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. In ACM SIGPLAN Conf. on Programming Language Design and Implementation, June, 1995.
[27] L. Rauchwerger and D. Padua. Parallelizing while loops for multiprocessor systems. In 9th Int. Parallel Process. Symp., April, 1995.
[28] J. Saltz and R. Mirchandaney. The preprocessed doacross loop. In Dr. H.D. Schwetman, editor, Proc. of the 99 Int. Conf. on Parallel Processing, pp CRC Press, Inc., 99. Vol. II - Software.
[29] J. Saltz, R. Mirchandaney, and K. Crowley. The doconsider loop. In Proc. of the 1989 Int. Conf. on Supercomputing, pp. 9 0, June 1989.
[30] J. Saltz, R. Mirchandaney, and K. Crowley. Run-time parallelization and scheduling of loops. IEEE Trans. Comput., 0(5), May 99.
[31] P. Tu and D. Padua. Automatic array privatization. In Proc. 6th Annual Workshop on Languages and Compilers for Parallel Computing, Portland, OR, Aug. 99.
[32] M. Wolfe. Optimizing Compilers for Supercomputers. The MIT Press, Boston, MA, 1989.
[33] J. Wu, J. Saltz, S. Hiranandani, and H. Berryman. Runtime compilation methods for multicomputers. In Dr. H.D. Schwetman, editor, Proc. of the 99 Int. Conf. on Parallel Processing, pp CRC Press, Inc., 99. Vol. II - Software.
[34] C. Zhu and P. C. Yew. A scheme to enforce data dependence on large multiprocessor systems. IEEE Trans. Softw. Eng., (6):76 79, 1987.
[35] H. Zima. Supercompilers for Parallel and Vector Computers. ACM Press, New York, NY, 99.
