The Data Locality of Work Stealing


Umut A. Acar
School of Computer Science, Carnegie Mellon University

Guy E. Blelloch
School of Computer Science, Carnegie Mellon University

Robert D. Blumofe
Department of Computer Sciences, University of Texas at Austin
rdb@cs.utexas.edu

Abstract

This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations $G_n$, each member of which requires $\Theta(n)$ total operations (work), for which, when using work stealing, the total number of cache misses on one processor is constant, while even on two processors the total number of cache misses is $\Omega(n)$. For nested-parallel computations, however, we show that on $P$ processors the expected additional number of cache misses beyond those on a single processor is bounded by $O(C \lceil m/s \rceil P T_\infty)$, where $m$ is the execution time of an instruction incurring a cache miss, $s$ is the steal time, $C$ is the size of the cache, and $T_\infty$ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional work loads but improves the performance up to 50% over static partitioning under multiprogrammed work loads. Furthermore, locality-guided work stealing improves the performance of work stealing up to 80%.

1 Introduction

Many of today's parallel applications use sophisticated, adaptive algorithms which are best realized with parallel programming systems that support dynamic, lightweight threads, such as Cilk [8], Nesl [5], Hood [10], and many others [ ].
The core of these systems is a thread scheduler that balances load among the processes. In addition to a good load balance, however, good data locality is essential in obtaining high performance from modern parallel systems. Several researchers have studied techniques to improve the data locality of multithreaded programs. One class of such techniques is based on software-controlled distribution of data among the local memories of a distributed shared-memory system [ ]. Another class of techniques is based on hints supplied by the programmer so that similar tasks might be executed on the same processor [ ]. Both these classes of techniques rely on the programmer or compiler to determine the data access patterns in the program, which may be very difficult when the program has complicated data access patterns. Perhaps the earliest class of techniques was to attempt to execute threads that are close in the computation graph on the same processor [ ]. The work-stealing algorithm is the most studied of these techniques [ ]. Blumofe et al. showed that fully-strict computations achieve a provably good data locality [7] when executed with the work-stealing algorithm on a dag-consistent distributed shared-memory system. In recent work, Narlikar showed that work stealing improves the performance of space-efficient multithreaded applications by increasing the data locality [29]. None of this previous work, however, has studied upper or lower bounds on the data locality of multithreaded computations executed on existing hardware-controlled shared-memory systems.

In this paper, we present theoretical and experimental results on the data locality of work stealing on hardware-controlled shared-memory systems (HSMSs). Our first set of results are upper and lower bounds on the number of cache misses in multithreaded computations executed by the work-stealing algorithm. Let $M_1(C)$ denote the number of cache misses in the uniprocessor execution and $M_P(C)$ denote the number of cache misses in a $P$-processor execution of a multithreaded computation by the work-stealing algorithm on an HSMS with cache size $C$.
Then, for a multithreaded computation with $T_1$ work (total number of instructions) and $T_\infty$ critical path (longest sequence of dependences), we show the following results for the work-stealing algorithm running on an HSMS.

• Lower bounds on the number of cache misses for general computations: We show that there is a family of computations $G_n$ with $T_1 = \Theta(n)$ such that $M_1(C) = 3C$, while even on two processors the number of misses is $M_2(C) = \Omega(n)$.

• Upper bounds on the number of cache misses for nested-parallel computations: For a nested-parallel computation, we show that $M_P(C) \le M_1(C) + 2C\tau$, where $\tau$ is the number of steals in the $P$-processor execution. We then show that the expected number of steals is $O(\lceil m/s \rceil P T_\infty)$, where $m$ is the time for a cache miss and $s$ is the time for a steal.

• Upper bound on the execution time of nested-parallel computations: We show that the expected execution time of a nested-parallel computation on $P$ processors is $O(T_1(C)/P + m \lceil m/s \rceil C T_\infty + (m + s) T_\infty)$, where $T_1(C)$ is the uniprocessor execution time of the computation including cache misses.

Figure 1: The speedup obtained by three different over-relaxation algorithms (work stealing, locality-guided work stealing, static partitioning), compared with linear speedup, as a function of the number of processes.

As in previous work [6, 9], we represent a multithreaded computation as a directed acyclic graph (dag) of instructions. Each node in the dag represents a single instruction and the edges represent ordering constraints. A nested-parallel computation [5, 6] is a race-free computation that can be represented with a series-parallel dag [33]. Nested-parallel computations include computations consisting of parallel loops and forks and joins, and any nesting of them. This class includes most computations that can be expressed in Cilk [8] and all computations that can be expressed in Nesl [5]. Our results show that nested-parallel computations have much better locality characteristics under work stealing than do general computations. We also briefly consider another class of computations, computations with futures [ ], and show that they can be as bad as general computations.

The second part of our results concerns further improving the data locality of multithreaded computations with work stealing. In work stealing, a processor steals a thread from a randomly (with uniform distribution) chosen processor when it runs out of work. In certain applications, such as iterative data-parallel applications, random steals may cause poor data locality. Locality-guided work stealing is a heuristic modification to work stealing that allows a thread to have an affinity for a process. In locality-guided work stealing, when a process obtains work it gives priority to a thread that has affinity for the process. Locality-guided work stealing can be used to implement a number of techniques that researchers suggest to improve data locality.
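The priority rule of locality-guided work stealing described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Thread` class, its `affinity` field, and the function `obtain_work` are hypothetical names, and the sketch assumes that each process keeps all of its ready threads in a single deque.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thread:
    name: str
    # Hypothetical field: id of the process this thread has affinity for.
    affinity: Optional[int] = None


def obtain_work(pid, ready):
    """Pick the next thread for process `pid` from its deque `ready`.

    Priority goes to the most recently pushed thread whose affinity is
    `pid`; otherwise fall back to the bottom (right end) of the deque,
    which is the order plain work stealing would use.
    """
    for i in range(len(ready) - 1, -1, -1):
        if ready[i].affinity == pid:
            chosen = ready[i]
            del ready[i]
            return chosen
    return ready.pop() if ready else None


ready = deque([Thread("a", affinity=1), Thread("b", affinity=0), Thread("c")])
first = obtain_work(0, ready)   # "b": it has affinity for process 0
second = obtain_work(0, ready)  # "c": no affinity match, so bottom of deque
```

Threads without a matching affinity are still executed, so the rule only reorders work and never leaves a process idle while its deque is non-empty.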
By appropriately assigning affinities to threads in the computation, the programmer can, for example, achieve an initial distribution of work among the processes or schedule threads based on hints. Our preliminary experiments with locality-guided work stealing give encouraging results, showing that for certain applications the performance is very close to that of static partitioning in dedicated mode (i.e. when the user can lock down a fixed number of processors), but does not suffer a performance cliff problem [10] in multiprogrammed mode (i.e. when processors might be taken by other users or the OS). Figure 1 shows a graph comparing work stealing, locality-guided work stealing, and static partitioning for a simple over-relaxation algorithm on a 14-processor Sun Ultra Enterprise. The over-relaxation algorithm iterates over a 1-dimensional array performing a 3-point stencil computation on each step. The superlinear speedup for static partitioning and locality-guided work stealing is due to the fact that the data for each run does not fit into the L2 cache of one processor but fits into the collective L2 cache of sufficiently many processors. For this benchmark the following can be seen from the graph.

1. Locality-guided work stealing does significantly better than standard work stealing, since on each step the cache is pre-warmed with the data it needs.
2. Locality-guided work stealing does approximately as well as static partitioning for up to 14 processes.
3. When trying to schedule more than 14 processes on 14 processors, static partitioning has a serious performance drop. The initial drop is due to load imbalance caused by the coarse-grained partitioning. The performance then approaches that of work stealing as the partitioning gets more fine-grained.

We are interested in the performance of work-stealing computations on hardware-controlled shared-memory systems (HSMSs). We model an HSMS as a group of identical processors, each of which has its own cache, together with a single shared memory. Each cache contains $C$ blocks and is managed by the memory subsystem automatically.
We allow for a variety of cache organizations and replacement policies, including both direct-mapped and associative caches. We assign a server process to each processor and associate the cache of a processor with the process assigned to that processor. One limitation of our work is that we assume that there is no false sharing.

2 Related Work

As mentioned in Section 1, there are three main classes of techniques that researchers have suggested to improve the data locality of multithreaded programs. In the first class, the program data is distributed among the nodes of a distributed shared-memory system by the programmer, and a thread in the computation is scheduled on the node that holds the data that the thread accesses [ ]. In the second class, data-locality hints supplied by the programmer are used in thread scheduling [ ]. Techniques from both classes are employed in distributed shared-memory systems such as COOL and Illinois Concert [15, 22] and are also used to improve the data locality of sequential programs [31]. However, the first class of techniques does not apply directly to HSMSs, because HSMSs do not allow software-controlled distribution of data among the caches. Furthermore, both classes of techniques rely on the programmer to determine the data access patterns in the application and thus may not be appropriate for applications with complex data-access patterns.

The third class of techniques, which is based on execution of threads that are close in the computation graph on the same process, is applied in many scheduling algorithms including work stealing [ ]. Blumofe et al. showed bounds on the number of cache misses in a fully-strict computation executed by the work-stealing algorithm under the dag-consistent distributed shared memory of Cilk [7]. Dag consistency is a relaxed memory-consistency model that is employed in the distributed shared-memory implementation of the Cilk language. In a distributed Cilk application, processes maintain dag consistency by means of the BACKER algorithm. In [7], Blumofe et al. bound the number of shared-memory cache misses in a distributed Cilk application for caches that are maintained with the LRU replacement policy. They assumed that accesses to the shared memory are distributed uniformly and independently, which is not generally true because threads may concurrently access the same pages by algorithm design. Furthermore, they assumed that processes do not generate steal attempts frequently, by making processes do additional page transfers before they attempt to steal from another process.

Figure 2: A dag (directed acyclic graph) for a multithreaded computation. Threads are shown as gray rectangles.

3 The Model

In this section, we present a graph-theoretic model for multithreaded computations, describe the work-stealing algorithm, define series-parallel and nested-parallel computations, and introduce our model of an HSMS (Hardware-controlled Shared-Memory System).

As with previous work [6, 9], we represent a multithreaded computation as a directed acyclic graph, a dag, of instructions (see Figure 2). Each node in the dag represents an instruction and the edges represent ordering constraints. There are three types of edges: continuation, spawn, and dependency edges. A thread is a sequential ordering of instructions, and the nodes that correspond to the instructions are linked in a chain by continuation edges. A spawn edge represents the creation of a new thread and goes from the node representing the instruction that spawns the new thread to the node representing the first instruction of the new thread. A dependency edge from instruction $i$ of a thread to instruction $j$ of some other thread represents a synchronization between two instructions such that instruction $j$ must be executed after $i$. We draw spawn edges with thick straight arrows, dependency edges with curly arrows and continuation edges with thick straight arrows throughout this paper. Also, we show paths with wavy lines.

For a computation with an associated dag $G$, we define the computational work $T_1$ as the number of nodes in $G$ and the critical path $T_\infty$ as the number of nodes on the longest path of $G$. Let $u$ and $v$ be any two nodes in a dag. Then we call $u$ an ancestor of $v$, and $v$ a descendant of $u$, if there is a path from $u$ to $v$.
Any node is its own descendant and ancestor. We say that two nodes are relatives if there is a path from one to the other; otherwise we say that the nodes are independent. The children of a node are independent, because otherwise the edge from the node to one child is redundant. We call a common descendant $w$ of $u$ and $v$ a merger of $u$ and $v$ if the paths from $u$ to $w$ and from $v$ to $w$ have only $w$ in common. We define the depth of a node $u$ as the number of edges on the shortest path from the root node to $u$. We define the least common ancestor of $u$ and $v$ as the ancestor of both $u$ and $v$ with maximum depth. Similarly, we define the greatest common descendant of $u$ and $v$ as the descendant of both $u$ and $v$ with minimum depth. An edge $(u, v)$ is redundant if there is a path between $u$ and $v$ that does not contain the edge $(u, v)$. The transitive reduction of a dag is the dag with all the redundant edges removed. In this paper, we are only concerned with the transitive reduction of the computational dags. We also require that the dags have a single node with in-degree 0, the root, and a single node with out-degree 0, the final node.

In a multiprocess execution of a multithreaded computation, independent nodes can execute at the same time. If two independent nodes read or modify the same data, we say that they are RR or WW sharing respectively. If one node is reading and the other is modifying the data, we say they are RW sharing. RW or WW sharing can cause data races, and the output of a computation with such races usually depends on the scheduling of nodes. Such races are typically indicative of a bug [18]. We refer to computations that do not have any RW or WW sharing as race-free computations. In this paper, we consider only race-free computations.

The work-stealing algorithm is a thread scheduling algorithm for multithreaded computations. The idea of work stealing dates back to the research of Burton and Sleep [11] and has been studied extensively since then [ ]. In the work-stealing algorithm, each process maintains a pool of ready threads and obtains work from its pool. When a process spawns a new thread, the process adds the thread into its pool.
When a process runs out of work and finds its pool empty, it chooses a random process as its victim and tries to steal work from the victim's pool.

In our analysis, we imagine the work-stealing algorithm operating on individual nodes in the computation dag rather than on the threads. Consider a multithreaded computation and its execution by the work-stealing algorithm. We divide the execution into discrete time steps such that at each step, each process is either working on a node, which we call the assigned node, or is trying to steal work. The execution of a node takes 1 time step if the node does not incur a cache miss and $m$ steps otherwise. We say that a node is executed at the time step that a process completes executing the node. The execution time of a computation is the number of time steps that elapse between the time step that a process starts executing the root node and the time step that the final node is executed. The execution schedule specifies the activity of each process at each time step.

During the execution, each process maintains a deque (doubly ended queue) of ready nodes; we call the ends of a deque the top and the bottom. When a node $u$ is executed, it enables some other node $v$ if $u$ is the last parent of $v$ that is executed. We call the edge $(u, v)$ an enabling edge and $u$ the designated parent of $v$. When a process executes a node that enables other nodes, one of the enabled nodes becomes the assigned node and the process pushes the rest onto the bottom of its deque. If no node is enabled, then the process obtains work from its deque by removing a node from the bottom of the deque. If a process finds its deque empty, it becomes a thief and steals from a randomly chosen process, the victim. This is a steal attempt and takes at least $s$ and at most $ks$ time steps, for some constant $k \ge 1$, to complete. A thief process might make multiple steal attempts before succeeding, or might never succeed. When a steal succeeds, the thief process starts working on the stolen node at the step following the completion of the steal. We say that a steal attempt occurs at the step it completes.

The work-stealing algorithm can be implemented in various ways.
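As one illustration, the deque discipline described above can be sketched in Python. This is a simplified sketch, not a definitive implementation: the names `Process`, `finish_node`, and `steal_attempt` are hypothetical, timing (the $m$-step misses and the $s$ to $ks$ steps of a steal attempt) is not modeled, the first enabled node is arbitrarily chosen as the assigned node, and the thief is assumed to take a node from the top of the victim's deque, as is common in work-stealing schedulers.

```python
import random
from collections import deque


class Process:
    def __init__(self, pid):
        self.pid = pid
        self.deque = deque()   # left end is the top, right end is the bottom
        self.assigned = None   # node the process is currently working on


def finish_node(proc, enabled):
    """Update `proc` after it executes its assigned node.

    `enabled` lists the nodes enabled by the node just executed.
    """
    if enabled:
        proc.assigned = enabled[0]        # one enabled node is assigned
        proc.deque.extend(enabled[1:])    # the rest go onto the bottom
    elif proc.deque:
        proc.assigned = proc.deque.pop()  # take work from the bottom
    else:
        proc.assigned = None              # deque empty: becomes a thief


def steal_attempt(thief, processes, rng):
    """One steal attempt against a uniformly random victim."""
    victim = rng.choice([p for p in processes if p is not thief])
    if victim.deque:
        # Assumption: the thief removes the node at the top of the deque.
        thief.assigned = victim.deque.popleft()
        return True
    return False


rng = random.Random(0)
p0, p1 = Process(0), Process(1)
finish_node(p0, ["a", "b", "c"])    # p0 works on "a"; deque holds b, c
steal_attempt(p1, [p0, p1], rng)    # p1 steals "b" from the top
finish_node(p0, [])                 # p0 takes "c" from the bottom
```

Always picking `enabled[0]` and pushing the remaining nodes in list order makes this sketch deterministic in the sense used in the next paragraph.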
We say that an implementation of work stealing is deterministic if, whenever a process enables other nodes, the implementation always chooses the same node as the assigned node for the next step on that process, and the remaining nodes are always placed in the deque in the same order. This must be true for both multiprocess and uniprocess executions. We refer to a deterministic implementation of the work-stealing algorithm, together with the HSMS that runs the implementation, as a work stealer. For brevity, we refer to an execution of a multithreaded computation with a work stealer as an execution. We define the total work as the number of steps taken by a uniprocess execution, including the cache misses, and denote it by $T_1(C)$, where $C$ is the cache size. We denote the number of cache misses in a $P$-process execution with $C$-block caches as $M_P(C)$. We define the cache overhead of a $P$-process execution as $M_P(C) - M_1(C)$, where $M_1(C)$ is the number of misses in the uniprocess execution on the same work stealer.

Figure 3: Illustrates the recursive definition for series-parallel dags. Figure (a) is the base case, figure (b) depicts the serial, and figure (c) depicts the parallel composition.

We refer to a multithreaded computation for which the transitive reduction of the corresponding dag is series-parallel [33] as a series-parallel computation. A series-parallel dag $G(V, E)$ is a dag with two distinguished vertices, a source $s \in V$ and a sink $t \in V$, and can be defined recursively as follows (see Figure 3).

• Base: $G$ consists of a single edge connecting $s$ to $t$.

• Series Composition: $G$ consists of two series-parallel dags $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ with disjoint edge sets such that $s$ is the source of $G_1$, $u$ is the sink of $G_1$ and the source of $G_2$, and $t$ is the sink of $G_2$. Moreover, $V_1 \cap V_2 = \{u\}$.

• Parallel Composition: The graph consists of two series-parallel dags $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ with disjoint edge sets such that $s$ and $t$ are the source and the sink of both $G_1$ and $G_2$. Moreover, $V_1 \cap V_2 = \{s, t\}$.

A nested-parallel computation is a race-free series-parallel computation [6].

We also consider multithreaded computations that use futures [ ]. The dag structures of computations with futures are defined elsewhere [4]. This is a superclass of nested-parallel computations, but still much more restrictive than general computations. The work-stealing algorithm for futures is a restricted form of the work-stealing algorithm, where a process starts executing a newly created thread immediately, putting its assigned thread onto its deque.

In our analysis, we consider several cache organizations and replacement policies for an HSMS. We model a cache as a set of (cache) lines, each of which can hold the data belonging to a memory block (a consecutive, typically small, region of memory). One instruction can operate on at most one memory block. We say that an instruction accesses a block, or the line that contains the block, when the instruction reads or modifies the block.
We say that an instruction overwrites a line that contains the block $b$ when the instruction accesses some other block that replaces $b$ in the cache. We say that a cache replacement policy is simple if it satisfies two conditions. First, the policy is deterministic. Second, whenever the policy decides to overwrite a cache line $l$, it makes the decision to overwrite $l$ by only using information pertaining to the accesses that are made after the last access to $l$. We refer to a cache managed with a simple cache-replacement policy as a simple cache. Simple caches and replacement policies are common in practice. For example, the least-recently-used (LRU) replacement policy, direct-mapped caches, and set-associative caches where each set is maintained by a simple cache-replacement policy are simple.

In regards to the definition of RW or WW sharing, we assume that reads and writes pertain to the whole block. This means we do not allow for false sharing, when two processes accessing different portions of a block invalidate the block in each other's caches. In practice, false sharing is an issue, but it can often be avoided by a knowledge of the underlying memory system and by appropriately padding the shared data to prevent two processes from accessing different portions of the same block.

4 General Computations

In this section, we show that the cache overhead of a multiprocess execution of a general computation and of a computation with futures can be large even though the uniprocess execution incurs a small number of misses.

Figure 4: The structure for dag of a computation with a large cache overhead.

Theorem 1 There is a family of computations $\{G_n : n = kC, \text{ for } k \in \mathbb{Z}^+\}$ with $O(n)$ computational work, whose uniprocess execution incurs $3C$ misses while any 2-process execution of the computation incurs $\Omega(n)$ misses on a work stealer with a cache size of $C$, assuming that $S = O(C)$, where $S$ is the maximum steal time.

Proof: Figure 4 shows the structure of a dag $G_n$ for $n = 4C$. Each node except the root node represents a sequence of $C$ instructions accessing a set of $C$ distinct memory blocks. The root node represents $C + S$ instructions that access $C$ distinct memory blocks.
The graph has two symmetric components, which correspond to the left and the right subtree of the root excluding the leaves. We partition the nodes of the dag into three classes, such that all nodes in a class access the same memory blocks while nodes from different classes access mutually disjoint sets of memory blocks. The first class contains the root node only, the second class contains all the nodes in the left component, and the third class contains the rest of the nodes, which are the nodes in the right component and the leaves of the dag. For general $n = kC$, $G_n$ can be partitioned into the two components, the $k$ leaves of $G_n$, and the root similarly. Each of the two components has the structure of a complete binary tree with an additional $k$ leaves at the lowest level. There is a dependency edge from the leaves of both components to the leaves of $G_n$.

Consider a work stealer that executes the nodes of $G_n$ in the order that they are numbered in a uniprocess execution. In the uniprocess execution, no node in the left component incurs a cache miss except its root node, since all nodes in the component access the same memory blocks as its root. The same argument holds for the right component and the $k$ leaves of $G_n$. Hence the execution of the nodes in the two components and the leaves causes $2C$ misses. Since the root node causes $C$ misses, the total number of misses in the uniprocess execution is $3C$.

Now consider a 2-process execution with the same work stealer and call the processes process 0 and process 1. At time step 1, process 0 starts executing the root node, which enables the root of the right component no later than time step $m$. Since process 1 starts stealing immediately and there are no other processes to steal from, process 1 steals and starts working on the root of the right component no later than time step $m + S$. Hence the root of the right component executes before the root of the left component, and thus every node in the right component executes before the corresponding symmetric node in the left component. Therefore, for any leaf of $G_n$, the parent that is in the right component executes before the parent in the left component. Therefore a leaf node of $G_n$ is executed immediately after its parent in the left component and thus causes $C$ cache misses. Thus the total number of cache misses is $\Omega(kC) = \Omega(n)$.

There exist computations, similar to the computation in Figure 4, that generalize Theorem 1 to an arbitrary number of processes by making sure that all the processes but 2 steal throughout any multiprocess execution. Even in the general case, however, where the average parallelism is higher than the number of processes, Theorem 1 can be generalized with the same bound on the expected number of cache misses by exploiting the symmetry in $G_n$ and by assuming a symmetrically distributed steal time. With a symmetrically distributed steal time, for any $\epsilon$, a steal that takes $\epsilon$ steps more than the mean steal time is equally likely to happen as a steal that takes $\epsilon$ fewer steps than the mean.

Theorem 1 holds for computations with futures as well. Multithreaded computing with futures is a fairly restricted form of multithreaded computing compared to computing with events such as synchronization variables. The graph in Figure 5 shows the structure of a dag whose 2-process execution causes a large number of cache misses. In a 2-process execution of this dag, the enabling parents of the leaf nodes in the right subtree of the root are in the left subtree, and therefore the execution of each such leaf node causes $C$ misses.

Figure 5: The structure for dag of a computation with futures that can incur a large cache overhead.
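The gap in Theorem 1 can be illustrated with a small LRU simulation in Python. This is a stylized sketch rather than a replay of the exact schedules in the proof: the function names `lru_misses` and `node`, the class labels, and the parameters `C` and `k` are assumptions for illustration. The first access sequence groups nodes by class, as in the uniprocess order; the second alternates between a node of the second class and a leaf of the third class, which is the pattern one process sees when each leaf runs immediately after its parent in the left component.

```python
from collections import OrderedDict


def lru_misses(accesses, capacity):
    """Count misses of an initially empty LRU cache holding `capacity` blocks."""
    cache = OrderedDict()
    misses = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return misses


def node(cls, C):
    """A node: C instructions touching the C distinct blocks of class `cls`."""
    return [(cls, i) for i in range(C)]


C, k = 4, 5
# Grouped by class: root, then second-class nodes, then third-class nodes.
uniprocess = node("root", C) + k * node("left", C) + 2 * k * node("right", C)
# Alternating: each third-class leaf follows a second-class parent.
interleaved = k * (node("left", C) + node("right", C))

print(lru_misses(uniprocess, C))   # 3 * C misses
print(lru_misses(interleaved, C))  # 2 * k * C misses
```

With the grouped order the miss count is independent of `k`, while in the alternating order every node misses on all `C` of its blocks, so the count grows linearly in `k * C`.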
5 Nested-Parallel Computations

In this section, we show that the cache overhead of an execution of a nested-parallel computation with a work stealer is at most twice the product of the number of steals and the cache size. Our proof has two steps. First, we show that the cache overhead is bounded by the product of the cache size and the number of nodes that are executed out of order with respect to the uniprocess execution order. Second, we prove that the number of such out-of-order executions is at most twice the number of steals.

Consider a computation $G$ and its $P$-process execution $X_P$ with a work stealer, and the uniprocess execution $X_1$ with the same work stealer. Let $v$ be a node in $G$ and let node $u$ be the node that executes immediately before $v$ in $X_1$. Then we say that $v$ is drifted in $X_P$ if node $u$ is not executed immediately before $v$ by the process that executes $v$ in $X_P$.

Lemma 2 establishes a key property of an execution with simple caches.

Lemma 2 Consider a process with a simple cache of $C$ blocks. Let $X_1$ denote the execution of a sequence of instructions on the process starting with cache state $S_1$, and let $X_2$ denote the execution of the same sequence of instructions starting with cache state $S_2$. Then $X_1$ incurs at most $C$ more misses than $X_2$.

Proof: We construct a one-to-one mapping between the cache lines in $X_1$ and $X_2$ such that an instruction that accesses a line $l_1$ in $X_1$ accesses the entry $l_2$ in $X_2$ if and only if $l_1$ is mapped to $l_2$. Consider $X_1$ and let $l_1$ be a cache line. Let $i$ be the first instruction that accesses or overwrites $l_1$. Let $l_2$ be the cache line that the same instruction accesses or overwrites in $X_2$, and map $l_1$ to $l_2$. Since the caches are simple, an instruction that overwrites $l_1$ in $X_1$ overwrites $l_2$ in $X_2$. Therefore the number of misses that overwrite $l_1$ in $X_1$ is equal to the number of misses that overwrite $l_2$ in $X_2$ after instruction $i$. Since $i$ itself can cause 1 miss, the number of misses that overwrite $l_1$ in $X_1$ is at most 1 more than the number of misses that overwrite $l_2$ in $X_2$. We construct the mapping for each cache line in $X_1$ in the same way. Now let us show that the mapping is one-to-one.
For the sake of contradiction, assume that two cache lines $l_1$ and $l_2$ in $X_1$ map to the same line in $X_2$. Let $i_1$ and $i_2$ be the first instructions accessing these cache lines in $X_1$, such that $i_1$ is executed before $i_2$. Since $i_1$ and $i_2$ map to the same line in $X_2$ and the caches are simple, $i_2$ accesses the line that $i_1$ accesses in $X_1$, but then $l_1 = l_2$, a contradiction. Hence the total number of cache misses in $X_1$ is at most $C$ more than the misses in $X_2$.

Theorem 3 Let $D$ denote the total number of drifted nodes in an execution of a nested-parallel computation with a work stealer on $P$ processes, each of which has a simple cache with $C$ words. Then the cache overhead of the execution is at most $CD$.

Proof: Let $X_P$ denote the $P$-process execution and let $X_1$ be the uniprocess execution of the same computation with the same work stealer. We divide the multiprocess computation into $D$ pieces, each of which can incur at most $C$ more misses than in the uniprocess execution. Let $u$ be a drifted node and let $q$ be the process that executes $u$. Let $v$ be the next drifted node executed on $q$ (or the final node of the computation). Let the ordered set $O$ represent the execution order of all the nodes that are executed after $u$ ($u$ is included) and before $v$ ($v$ is excluded if it is drifted, included otherwise) on $q$ in $X_P$. Then the nodes in $O$ are executed on the same process and in the same order in both $X_1$ and $X_P$.

Now consider the number of cache misses during the execution of the nodes in $O$ in $X_1$ and $X_P$. Since the computation is nested-parallel, and therefore race-free, a process that executes in parallel with $q$ does not cause $q$ to incur cache misses due to sharing. Therefore, by Lemma 2, during the execution of the nodes in $O$ the number of cache misses in $X_P$ is at most $C$ more than the number of misses in $X_1$. This bound holds for each of the $D$ sequences of such instructions corresponding to the $D$ drifted nodes. Since the sequence starting at the root node and ending at the first drifted node incurs the same number of misses in $X_1$ and $X_P$, $X_P$ takes at most $CD$ more misses than $X_1$, and the cache overhead is at most $CD$.

Lemma 2 (and thus Theorem 3) does not hold for caches that are not simple.
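For a simple policy such as LRU, the bound of Lemma 2 can be checked empirically with a short Python sketch. The function names `lru_misses_from` and `max_gap`, the block universe, and the sequence length are assumptions chosen for illustration; the check runs the same random access sequence from two random initial cache states and records the largest difference in miss counts.

```python
import random
from collections import OrderedDict


def lru_misses_from(initial, accesses, capacity):
    """Count misses of an LRU cache that initially holds `initial`
    (listed from least to most recently used)."""
    cache = OrderedDict((b, True) for b in initial[-capacity:])
    misses = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return misses


def max_gap(capacity, trials=200, seed=0):
    """Largest miss-count difference seen between two starting states."""
    rng = random.Random(seed)
    blocks = list(range(3 * capacity))
    worst = 0
    for _ in range(trials):
        state1 = rng.sample(blocks, capacity)
        state2 = rng.sample(blocks, capacity)
        seq = [rng.choice(blocks) for _ in range(50)]
        gap = abs(lru_misses_from(state1, seq, capacity)
                  - lru_misses_from(state2, seq, capacity))
        worst = max(worst, gap)
    return worst


print(max_gap(4) <= 4)  # the gap never exceeds the cache size
```

Under LRU the two runs can disagree only on first accesses to blocks that one initial state held and the other did not, which is why the gap stays within the cache size.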
As a counterexample for a cache that is not simple, consider the execution of a sequence of instructions on a cache with a least-frequently-used replacement policy, starting at two cache states. In the first cache state, the blocks that are frequently accessed by the instructions are in the cache with high frequencies, whereas in the second cache state, the blocks that are in the cache are not accessed by the instructions and have low frequencies. The execution with the second cache state therefore incurs many more misses than the size of the cache compared to the execution with the first cache state.

Now we show that the number of drifted nodes in an execution of a series-parallel computation with a work stealer is at most twice the number of steals. The proof is based on the representation of series-parallel computations as sp-dags. We call a node with out-degree of at least 2 a fork node and partition the nodes of an sp-dag, except the root, into three categories: join nodes, stable nodes, and nomadic nodes. We call a node that has an in-degree of at least 2 a join node, and partition all the nodes that have in-degree 1 into two classes: a nomadic node has a parent that is a fork node, and a stable node has a parent that has out-degree 1. The root node has in-degree 0 and it does not belong to any of these categories. Lemma 4 lists two fundamental properties of sp-dags; one can prove both properties by induction on the number of edges in an sp-dag.

Lemma 4 Let $G$ be an sp-dag. Then $G$ has the following properties.

1. The least common ancestor of any two nodes in $G$ is unique.
2. The greatest common descendant of any two nodes in $G$ is unique and is equal to their unique merger.

Figure 6: Children of $s$ and their merger.

Lemma 5 Let $s$ be a fork node. Then no child of $s$ is a join node.

Proof: Let $u$ and $v$ denote two children of $s$ and suppose $u$ is a join node, as in Figure 6. Let $t$ denote some other parent of $u$ and $z$ denote the unique merger of $u$ and $v$. Then both $z$ and $u$ are mergers for $s$ and $t$, which is a contradiction of Lemma 4. Hence $u$ is not a join node.

Corollary 6 Only nomadic nodes can be stolen in an execution of a series-parallel computation by the work-stealing algorithm.

Proof: Let $u$ be a stolen node in an execution. Then $u$ is pushed on a deque and thus the enabling parent of $u$ is a fork node. By Lemma 5, $u$ is not a join node and has in-degree 1. Therefore $u$ is nomadic.

Figure 7: The join embedding of $u$ and $v$.

Consider a series-parallel computation and let $G$ be its sp-dag.
Let $u$ and $v$ be two independent nodes in $G$, and let $s$ and $t$ denote their least common ancestor and greatest common descendant respectively, as shown in Figure 7. Let $G_1$ denote the graph that is induced by the relatives of $u$ that are descendants of $s$ and also ancestors of $t$. Similarly, let $G_2$ denote the graph that is induced by the relatives of $v$ that are descendants of $s$ and ancestors of $t$. Then we call $G_1$ the embedding of $u$ with respect to $v$ and $G_2$ the embedding of $v$ with respect to $u$. We call the graph that is the union of $G_1$ and $G_2$ the join embedding of $u$ and $v$ with source $s$ and sink $t$. Now consider an execution of $G$, and let $w$ and $x$ be the children of $s$ such that $w$ is executed before $x$. Then we call $w$ the leader and $x$ the guard of the join embedding.

Figure 8: The join embedding of two parents $w$ and $x$ of a join node. Nodes $u$ and $v$ are the children of the source $s$.

Lemma 7 Let $G(V, E)$ be an sp-dag and let $w$ and $x$ be two parents of a join node $t$ in $G$. Let $G_1$ denote the embedding of $w$ with respect to $x$ and $G_2$ denote the embedding of $x$ with respect to $w$. Let $s$ denote the source and $t$ denote the sink of the join embedding. Then the parents of any node in $G_1$ except for $s$ and $t$ are in $G_1$, and the parents of any node in $G_2$ except for $s$ and $t$ are in $G_2$.

Proof: Since $w$ and $x$ are independent, both $s$ and $t$ are different from $w$ and $x$ (see Figure 8). First, we show that there is no edge that starts at a node in $G_1$ other than $s$ and ends at a node in $G_2$ other than $t$, and vice versa. For the sake of contradiction, assume there is an edge $(a, b)$ such that $a \ne s$ is in $G_1$ and $b \ne t$ is in $G_2$. Then $a$ is the least common ancestor of $w$ and $x$; hence no such $(a, b)$ exists. A similar argument holds when $a$ is in $G_2$ and $b$ is in $G_1$.

Second, we show that there does not exist an edge that originates from a node outside of $G_1$ or $G_2$ and ends at a node in $G_1$ or $G_2$. For the sake of contradiction, let $(c, d)$ be an edge such that $d$ is in $G_1$ and $c$ is not in $G_1$ or $G_2$. Then $d$ is the unique merger for the two children of the least common ancestor of $c$ and $s$, which we denote by $r$. But then $t$ is also a merger for the children of $r$. The children of $r$ are independent and have a unique merger, hence there is no such edge $(c, d)$. A similar argument holds when $d$ is in $G_2$.
Therefore we conclude that the parents of any node in $G_1$ except $s$ and $t$ are in $G_1$, and the parents of any node in $G_2$ except $s$ and $t$ are in $G_2$.

Lemma 8 Let $G$ be an sp-dag and let $w$ and $z$ be two parents of a join node $t$ in $G$. Consider the join embedding of $w$ and $z$ and let $u$ be the guard node of the embedding. Then $w$ and $z$ are executed in the same respective order in a multiprocess execution as they are executed in the uniprocess execution if the guard node $u$ is not stolen.

Proof: Let $s$ be the source, $t$ the sink, and $v$ the leader of the join embedding. Since $u$ is not stolen, $v$ is not stolen. Hence, by Lemma 7, before it starts working on $u$, the process that executes $s$ has executed $v$ and all its descendants in the embedding except for $t$. Hence $z$ is executed before $u$ and $w$ is executed after $u$, as in the uniprocess execution. Therefore $w$ and $z$ are executed in the same respective order as they execute in the uniprocess execution.
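The node categories used in this section (fork, join, nomadic, stable) depend only on in-degrees and out-degrees, so they can be computed mechanically. The following is a minimal sketch, not taken from the paper: the function name `classify_nodes`, the edge-list representation, and the example dag are assumptions made for illustration, and the input is assumed to be a dag in which every non-root node has at least one parent.

```python
from collections import defaultdict


def classify_nodes(edges, root):
    """Label each non-root node as 'join', 'nomadic', or 'stable'.

    edges: iterable of (parent, child) pairs (hypothetical representation).
    Assumes every non-root node has at least one parent.
    """
    parents = defaultdict(list)
    out_degree = defaultdict(int)
    nodes = {root}
    for src, dst in edges:
        parents[dst].append(src)
        out_degree[src] += 1
        nodes.update((src, dst))

    kinds = {}
    for node in nodes:
        if node == root:
            continue  # the root belongs to none of the categories
        if len(parents[node]) >= 2:
            kinds[node] = "join"      # in-degree at least 2
        elif out_degree[parents[node][0]] >= 2:
            kinds[node] = "nomadic"   # single parent that is a fork node
        else:
            kinds[node] = "stable"    # single parent with out-degree 1
    return kinds


# A fork s with children u and v that join at t, followed by a tail node e.
edges = [("s", "u"), ("s", "v"), ("u", "t"), ("v", "t"), ("t", "e")]
kinds = classify_nodes(edges, "s")
```

In this example the two children of the fork are nomadic and neither is a join node, which is the situation Lemma 5 guarantees for sp-dags.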

Figure 9: Nodes $t_1$ and $t_2$ are two join nodes with the common guard $u$.

Lemma 9 A nomadic node is drifted in an execution only if it is stolen.

Proof: Let $u$ be a nomadic and drifted node. Then, by Lemma 5, $u$ has a single parent $s$ that enables $u$. If $u$ is the first child of $s$ to execute in the uniprocess execution, then $u$ is not drifted in the multiprocess execution. Hence $u$ is not the first child to execute. Let $v$ be the last child of $s$ that is executed before $u$ in the uniprocess execution. Now consider the multiprocess execution and let $q$ be the process that executes $v$. For the sake of contradiction, assume that $u$ is not stolen. Consider the join embedding of $u$ and $v$ as shown in Figure 8. Since all parents of the nodes in $G_2$ except for $s$ and $t$ are in $G_2$ by Lemma 7, $q$ executes all the nodes in $G_2$ before it executes $u$, and thus $z$ precedes $u$ on $q$. But then $u$ is not drifted, because $z$ is the node that is executed immediately before $u$ in the uniprocess computation. Hence $u$ is stolen.

Let us define the cover of a join node $t$ in an execution as the set of all the guard nodes of the join embeddings of all possible pairs of parents of $t$ in the execution. The following lemma shows that a join node is drifted only if a node in its cover is stolen.

Lemma 10 A join node is drifted in an execution only if a node in its cover is stolen in the execution.

Proof: Consider the execution and let $t$ be a join node that is drifted. Assume, for the sake of contradiction, that no node in the cover of $t$ is stolen. Let $w$ and $z$ be any two parents of $t$ as in Figure 8. Then $w$ and $z$ are executed in the same order as in the uniprocess execution by Lemma 8. But then all parents of $t$ execute in the same order as in the uniprocess execution. Hence the enabling parent of $t$ in the execution is the same as in the uniprocess execution. Furthermore, the enabling parent of $t$ has out-degree 1, because otherwise $t$ is not a join node by Lemma 5, and thus the process that enables $t$ executes $t$. Therefore $t$ is not drifted. This is a contradiction; hence a node in the cover of $t$ is stolen.
Lemma 11 The number of drifted nodes in an execution of a series-parallel computation is at most twice the number of steals in the execution.

Proof: We associate each drifted node in the execution with a steal such that no steal has more than 2 drifted nodes associated with it. Consider a drifted node $u$. Then $u$ is not the root node of the computation and it is not stable either. Hence $u$ is either a nomadic or a join node. If $u$ is nomadic, then $u$ is stolen by Lemma 9 and we associate $u$ with the steal that steals $u$. Otherwise, $u$ is a join node and there is a node in its cover that is stolen by Lemma 10. We associate $u$ with the steal that steals a node in its cover. Now assume there are more than 2 nodes associated with a steal that steals node $u$. Then there are at least two join nodes $t_1$ and $t_2$ that are associated with $u$. Therefore node $u$ is in the join embedding of two parents of $t_1$ and also of $t_2$. Let $z_1, w_1$ be these parents of $t_1$ and $z_2, w_2$ be the parents of $t_2$, as shown in Figure 9. But then $u$ has a parent that is a fork node and is a join node, which contradicts Lemma 5. Hence no such $u$ exists.

Theorem 12 The cache overhead of an execution of a nested-parallel computation with simple caches is at most twice the product of the number of steals in the execution and the cache size.

Proof: Follows from Theorem 3 and Lemma 11.

6 An Analysis of Nonblocking Work Stealing

The nonblocking implementation of the work-stealing algorithm delivers provably good performance under traditional and multiprogrammed workloads. A description of the implementation and its analysis is presented in [2]; an experimental evaluation is given in [10]. In this section we extend the analysis of the nonblocking work-stealing algorithm for classical workloads and bound the execution time of a nested-parallel computation with a work stealer to include the number of cache misses, the cache-miss penalty, and the steal time. First, we bound the number of steal attempts in an execution of a general computation by the work-stealing algorithm. Then we bound the execution time of a nested-parallel computation with a work stealer using results from Section 5.
The analysis that we present here is similar to the analysis given in [2] and uses the same potential function technique. We associate a nonnegative potential with nodes in a computation's dag and show that the potential decreases as the execution proceeds. We assume that a node in a computation dag has out-degree at most 2. This is consistent with the assumption that each node represents one instruction. Consider an execution of a computation with its dag $G = (V, E)$ with the work-stealing algorithm. The execution grows a tree, the enabling tree, that contains each node in the computation and its enabling edge. We define the distance of a node $u \in V$, $d(u)$, as $T_\infty - \mathrm{depth}(u)$, where $\mathrm{depth}(u)$ is the depth of $u$ in the enabling tree of the computation. Intuitively, the distance of a node indicates how far the node is from the end of the computation.

We define the potential function in terms of distances. At any given step $i$, we assign a positive potential to each ready node; all other nodes have potential 0. A node is ready if it is enabled and not yet executed to completion. Let $u$ denote a ready node at time step $i$. Then we define $\phi_i(u)$, the potential of $u$ at time step $i$, as

$$\phi_i(u) = \begin{cases} 3^{2d(u)-1} & \text{if } u \text{ is assigned;} \\ 3^{2d(u)} & \text{otherwise.} \end{cases}$$

The potential at step $i$, $\Phi_i$, is the sum of the potentials of the ready nodes at step $i$. When an execution begins, the only ready node is the root node, which has distance $T_\infty$ and is assigned to some process, so we start with $\Phi_0 = 3^{2T_\infty - 1}$. As the execution proceeds, nodes that are deeper in the dag become ready and the potential decreases. There are no ready nodes at the end of an execution and the potential is 0.

Let us give a few more definitions that enable us to associate a potential with each process. Let $R_i(q)$ denote the set of ready nodes that are in the deque of process $q$, along with $q$'s assigned node, if any, at the beginning of step $i$. We say that each node $u \in R_i(q)$ belongs to process $q$. Then we define the potential of $q$'s deque as

$$\Phi_i(q) = \sum_{u \in R_i(q)} \phi_i(u).$$
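As a sanity check on the definitions above, the following sketch evaluates the potential for small cases. It assumes the potential has the form used in the analysis of [2], namely $3^{2d-1}$ for an assigned node at distance $d$ and $3^{2d}$ for a node waiting in a deque; the function names `node_potential` and `process_potential` and the example value of the critical-path length are hypothetical and chosen only for illustration.

```python
def node_potential(distance, assigned):
    """Potential of one ready node at the given distance (distance >= 1)."""
    exponent = 2 * distance - 1 if assigned else 2 * distance
    return 3 ** exponent


def process_potential(deque_distances, assigned_distance=None):
    """Sum of the potentials of the nodes that belong to one process."""
    total = sum(node_potential(d, assigned=False) for d in deque_distances)
    if assigned_distance is not None:
        total += node_potential(assigned_distance, assigned=True)
    return total


# Assumed example: a computation whose critical path has 4 nodes.
t_inf = 4
initial = node_potential(t_inf, assigned=True)  # root is assigned at the start

# Executing the assigned root enables two children at distance t_inf - 1:
# one becomes assigned, the other is pushed onto the deque.
after = process_potential([t_inf - 1], assigned_distance=t_inf - 1)
drop = initial - after
```

With these numbers the drop is exactly 5/9 of the root's potential, and moving a node from a deque to the assigned state removes 2/3 of its potential, which matches the constants that appear in the properties of the potential function.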

In addition, let $A_i$ denote the set of processes whose deque is empty at the beginning of step $i$, and let $D_i$ denote the set of all other processes. We partition the potential $\Phi_i$ into two parts,

$$\Phi_i = \Phi_i(A_i) + \Phi_i(D_i), \quad \text{where} \quad \Phi_i(A_i) = \sum_{q \in A_i} \Phi_i(q) \quad \text{and} \quad \Phi_i(D_i) = \sum_{q \in D_i} \Phi_i(q),$$

and we analyze the two parts separately. Lemma 13 lists four basic properties of the potential that we use frequently. The proofs for these properties are given in [2], and the listed properties hold independently of the time that the execution of a node or a steal takes. Therefore we give only a short proof sketch.

Lemma 13 The potential function satisfies the following properties.
1. Suppose node $u$ is assigned to a process at step $i$. Then the potential decreases by at least $(2/3)\phi_i(u)$.
2. Suppose a node $u$ is executed at step $i$. Then the potential decreases by at least $(5/9)\phi_i(u)$ at step $i$.
3. Consider any step $i$ and any process $q$ in $D_i$. The topmost node $u$ in $q$'s deque contributes at least $3/4$ of the potential associated with $q$. That is, we have $\phi_i(u) \ge (3/4)\Phi_i(q)$.
4. Suppose a process $p$ chooses process $q$ in $D_i$ as its victim at time step $i$ (a steal attempt of $p$ targeting $q$ occurs at step $i$). Then the potential decreases by at least $(1/2)\Phi_i(q)$ due to the assignment or execution of a node belonging to $q$ at the end of step $i$.

Property 1 follows directly from the definition of the potential function. Property 2 holds because a node enables at most two children with smaller potential, one of which becomes assigned. Specifically, the potential after the execution of node $u$ decreases by at least $\phi_i(u)(1 - 1/3 - 1/9) = (5/9)\phi_i(u)$. Property 3 follows from a structural property of the nodes in a deque. The distances of the nodes in a process's deque decrease monotonically from the top of the deque to the bottom. Therefore the potential in the deque is a sum of geometrically decreasing terms and is dominated by the potential of the top node. The last property holds because when a process chooses process $q$ in $D_i$ as its victim, the node $u$ at the top of $q$'s deque is assigned at the next step. Therefore the potential decreases by $(2/3)\phi_i(u)$ by property 1.
Moreover, $\phi_i(u) \ge (3/4)\Phi_i(q)$ by property 3, and the result follows.

Lemma 16 shows that the potential decreases as a computation proceeds. The proof of Lemma 16 uses the balls-and-bins game bound from Lemma 14.

Lemma 14 (Balls and Weighted Bins) Suppose that at least $P$ balls are thrown independently and uniformly at random into $P$ bins, where bin $i$ has a weight $W_i$, for $i = 1, \ldots, P$. The total weight is $W = \sum_{i=1}^{P} W_i$. For each bin $i$, define the random variable $X_i$ as

$$X_i = \begin{cases} W_i & \text{if some ball lands in bin } i; \\ 0 & \text{otherwise.} \end{cases}$$

If $X = \sum_{i=1}^{P} X_i$, then for any $\beta$ in the range $0 < \beta < 1$, we have $\Pr[X \ge \beta W] > 1 - 1/((1-\beta)e)$.

This lemma can be proven with an application of Markov's inequality. The proof of a weaker version of this lemma, for the case of exactly $P$ throws, is similar and given in [2]. Lemma 14 also follows from the weaker lemma because $X$ does not decrease with more throws.

We now show that whenever $P$ or more steal attempts occur, the potential decreases by a constant fraction of $\Phi_i(D_i)$ with constant probability.

Lemma 15 Consider any step $i$ and any later step $j$ such that at least $P$ steal attempts occur at steps from $i$ (inclusive) to $j$ (exclusive). Then we have

$$\Pr\left[\Phi_i - \Phi_j \ge \tfrac{1}{4}\Phi_i(D_i)\right] > \tfrac{1}{4}.$$

Moreover, the potential decrease is because of the execution or assignment of nodes belonging to a process in $D_i$.

Proof: Consider all $P$ processes and $P$ steal attempts that occur at or after step $i$. For each process $q$ in $D_i$, if one or more of the $P$ attempts target $q$ as the victim, then the potential decreases by $(1/2)\Phi_i(q)$ due to the execution or assignment of nodes that belong to $q$, by property 4 in Lemma 13. If we think of each attempt as a ball toss, then we have an instance of the Balls and Weighted Bins Lemma (Lemma 14). For each process $q$ in $D_i$, we assign a weight $W_q = (1/2)\Phi_i(q)$, and for each other process $q$ in $A_i$, we assign a weight $W_q = 0$. The weights sum to $W = (1/2)\Phi_i(D_i)$. Using $\beta = 1/2$ in Lemma 14, we conclude that the potential decreases by at least $\beta W = (1/4)\Phi_i(D_i)$ with probability greater than $1 - 1/((1-\beta)e) > 1/4$ due to the execution or assignment of nodes that belong to a process in $D_i$.
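The balls-and-weighted-bins bound can be checked empirically. The sketch below is a small Monte Carlo experiment, not part of the paper: the weights, the number of trials, the seed, and the helper names `landed_weight` and `estimate_probability` are all assumptions chosen for illustration. It throws exactly as many balls as there are bins and estimates the probability that the bins that were hit carry at least a $\beta$ fraction of the total weight.

```python
import math
import random


def landed_weight(weights, num_balls, rng):
    """Total weight of the bins hit by at least one of num_balls throws."""
    hit = {rng.randrange(len(weights)) for _ in range(num_balls)}
    return sum(weights[i] for i in hit)


def estimate_probability(weights, beta, trials, seed=0):
    """Estimate Pr[X >= beta * W] when len(weights) balls are thrown."""
    rng = random.Random(seed)
    total = sum(weights)
    successes = sum(
        landed_weight(weights, len(weights), rng) >= beta * total
        for _ in range(trials)
    )
    return successes / trials


weights = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical bin weights
beta = 0.5
bound = 1 - 1 / ((1 - beta) * math.e)  # lower bound stated in the lemma
estimate = estimate_probability(weights, beta, trials=2000)
```

For $\beta = 1/2$ the bound evaluates to $1 - 2/e$, slightly above $1/4$, which is the constant used in Lemma 15; the simulated estimate should sit above that bound for any choice of nonnegative weights.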
We now bound the number of steal attempts in a work-stealing computation.

Lemma 16 Consider a $P$-process execution of a multithreaded computation with the work-stealing algorithm. Let $T_1$ and $T_\infty$ denote the computational work and the critical path of the computation. Then the expected number of steal attempts in the execution is $O(\lceil m/s \rceil P T_\infty)$. Moreover, for any $\varepsilon > 0$, the number of steal attempts is $O(\lceil m/s \rceil P (T_\infty + \ln(1/\varepsilon)))$ with probability at least $1 - \varepsilon$.

Proof: We analyze the number of steal attempts by breaking the execution into phases of $\lceil m/s \rceil P$ steal attempts. We show that, with constant probability, a phase causes the potential to drop by a constant factor. The first phase begins at step $t_1 = 1$ and ends at the first step $t_1'$ such that at least $\lceil m/s \rceil P$ steal attempts occur during the interval of steps $[t_1, t_1']$. The second phase begins at step $t_2 = t_1' + 1$, and so on.

Let us first show that there are at least $m$ steps in a phase. A process has at most 1 outstanding steal attempt at any time, and a steal attempt takes at least $s$ steps to complete. Therefore at most $P$ steal attempts occur in a period of $s$ time steps. Hence a phase of steal attempts takes at least $\lceil (\lceil m/s \rceil P)/P \rceil s \ge m$ time units.

Consider a phase beginning at step $i$, and let $j$ be the step at which the next phase begins. Then $i + m \le j$. We will show that $\Pr[\Phi_j \le (3/4)\Phi_i] > 1/4$. Recall that the potential can be partitioned as $\Phi_i = \Phi_i(A_i) + \Phi_i(D_i)$. Since the phase contains $\lceil m/s \rceil P$ steal attempts, $\Pr[\Phi_i - \Phi_j \ge (1/4)\Phi_i(D_i)] > 1/4$ due to the execution or assignment of nodes that belong to a process in $D_i$, by Lemma 15. Now we show that the potential also drops by a constant fraction of $\Phi_i(A_i)$ due to the execution of the nodes that are assigned to the processes in $A_i$. Consider a process, say $q$, in $A_i$. If $q$ does not have an assigned node, then $\Phi_i(q) = 0$. If $q$ has an assigned node $u$, then $\Phi_i(q) = \phi_i(u)$. In this case, process $q$ completes executing node $u$ at step $i + m - 1 < j$ at the latest, and the potential drops by at least $(5/9)\phi_i(u)$ by property 2 of Lemma 13. Summing over each process $q$ in $A_i$, we have $\Phi_i - \Phi_j \ge (5/9)\Phi_i(A_i)$. Thus we have shown that the potential decreases by at least a quarter of $\Phi_i(A_i)$ and of $\Phi_i(D_i)$. Therefore, no matter how the total potential is distributed over $A_i$ and $D_i$, the total potential decreases by a quarter with probability more than $1/4$, that is, $\Pr[\Phi_i - \Phi_j \ge (1/4)\Phi_i] > 1/4$.

We say that a phase is successful if it causes the potential to drop by at least a $1/4$ fraction. A phase is successful with probability at least $1/4$. Since the potential starts at $\Phi_0 = 3^{2T_\infty - 1}$ and ends at 0 (and is always an integer), the number of successful phases is at most $(2T_\infty - 1)\log_{4/3} 3 < 8T_\infty$. The expected number of phases needed to obtain $8T_\infty$ successful phases is at most $32T_\infty$. Thus the expected number of phases is $O(T_\infty)$, and because each phase contains $\lceil m/s \rceil P$ steal attempts, the expected number of steal attempts is $O(\lceil m/s \rceil P T_\infty)$. The high-probability bound follows by an application of the Chernoff bound.

Theorem 17 Let $M_P(C)$ be the number of cache misses in a $P$-process execution of a nested-parallel computation with a work stealer that has simple caches of $C$ blocks each. Let $M_1(C)$ be the number of cache misses in the uniprocess execution. Then

$$M_P(C) = M_1(C) + O\left(\lceil m/s \rceil C P T_\infty + \lceil m/s \rceil C P \ln(1/\varepsilon)\right)$$

with probability at least $1 - \varepsilon$. The expected number of cache misses is $M_1(C) + O(\lceil m/s \rceil C P T_\infty)$.

Proof: Theorem 12 shows that the cache overhead of a nested-parallel computation is at most twice the product of the number of steals and the cache size. Lemma 16 shows that the number of steal attempts is $O(\lceil m/s \rceil P (T_\infty + \ln(1/\varepsilon)))$ with probability at least $1 - \varepsilon$, and that the expected number of steal attempts is $O(\lceil m/s \rceil P T_\infty)$. The number of steals is no greater than the number of steal attempts. Therefore the bounds follow.

Theorem 18 Consider a $P$-process nested-parallel work-stealing computation with simple caches of $C$ blocks. Then, for any $\varepsilon > 0$, the execution time is

$$O\left(\frac{T_1(C)}{P} + m \lceil m/s \rceil C (T_\infty + \ln(1/\varepsilon)) + (m + s)(T_\infty + \ln(1/\varepsilon))\right)$$

with probability at least $1 - \varepsilon$. Moreover, the expected running time is

$$O\left(\frac{T_1(C)}{P} + m \lceil m/s \rceil C T_\infty + (m + s) T_\infty\right).$$

Proof: We use an accounting argument to bound the running time.
At each step in the computation, each process puts a dollar into one of two buckets that matches its activity at that step. We name the two buckets the work bucket and the steal bucket. A process puts a dollar into the work bucket at a step if it is working on a node in the step. The execution of a node in the dag adds either 1 or $m$ dollars to the work bucket. Similarly, a process puts a dollar into the steal bucket for each step that it spends stealing. Each steal attempt takes $O(s)$ steps. Therefore each steal adds $O(s)$ dollars to the steal bucket. The number of dollars in the work bucket at the end of execution is at most $O(T_1 + (m - 1) M_P(C))$, which is $O(T_1(C) + (m - 1) \lceil m/s \rceil C P (T_\infty + \ln(1/\varepsilon)))$ with probability at least $1 - \varepsilon$. The total number of dollars in the steal bucket is the total number of steal attempts multiplied by the number of dollars added to the steal bucket for each steal attempt, which is $O(s)$. Therefore the total number of dollars in the steal bucket is $O(s \lceil m/s \rceil P (T_\infty + \ln(1/\varepsilon)))$ with probability at least $1 - \varepsilon$. Each process adds exactly one dollar to a bucket at each step, so we divide the total number of dollars by $P$ to get the high-probability bound in the theorem. A similar argument holds for the expected time bound.

Figure 10: The tree of threads created in a data-parallel work-stealing application.

7 Locality-Guided Work Stealing

The work-stealing algorithm achieves good data locality by executing nodes that are close in the computation graph on the same process. For certain applications, however, regions of the program that access the same data are not close in the computation graph. As an example, consider an application that takes a sequence of steps, each of which operates in parallel over a set or array of values. We will call such an application an iterative data-parallel application. Such an application can be implemented using work stealing by forking a tree of threads on each step, in which each leaf of the tree updates a region of the data (typically disjoint). Figure 10 shows an example of the trees of threads created in two steps. Each node represents a thread and is labeled with the process that executes it. The gray nodes are the leaves.
The threads synchronize in the same order as they fork. The first and second steps are structurally identical, and each pair of corresponding gray nodes updates the same region, often using much of the same input data. The dashed rectangle in Figure 10, for example, shows a pair of such gray nodes. To get good locality for this application, threads that update the same data on different steps ideally should run on the same processor, even though they are not close in the dag. In work stealing, however, this is highly unlikely to happen due to the random steals. Figure 10, for example, shows an execution where all pairs of corresponding gray nodes run on different processes.

In this section we describe and evaluate locality-guided work stealing, a heuristic modification to work stealing that is designed to allow locality between nodes that are distant in the computation graph. In locality-guided work stealing, each thread can be given an affinity for a process, and when a process obtains work it gives priority to threads with affinity for it. To enable this, in addition to a deque, each process maintains a mailbox: a first-in-first-out (FIFO) queue of pointers to threads that have affinity for the process. There are then two differences between the locality-guided work-stealing and work-stealing algorithms. First, when creating a thread, a process will push the thread onto both the deque, as in normal work stealing, and also onto the tail of the mailbox of the process that the thread has affinity for. Second, a process will first try to obtain work from its mailbox before attempting a steal. Because threads can appear twice, once in a mailbox and once on a deque, there needs to be some form of synchronization between the two copies to make sure the thread is not executed twice.

A number of techniques that have been suggested to improve the data locality of multithreaded programs can be realized by the locality-guided work-stealing algorithm together with an appropriate policy to determine the affinities of threads. For example, an initial distribution of work among processes can be enforced by setting the affinity of a thread to the process to which it will be assigned at the beginning of the computation. We call this locality-guided work stealing with initial placements. Likewise, techniques that rely on hints from the programmer can be realized by setting the affinity of threads based on the hints. In the next section we describe an implementation of locality-guided work stealing for iterative data-parallel applications. The implementation described can be modified easily to implement the other techniques mentioned.

7.1 Implementation

We built locality-guided work stealing into Hood. Hood is a multithreaded programming library with a nonblocking implementation of work stealing that delivers provably good performance under both traditional and multiprogrammed workloads [ ]. In Hood, the programmer defines a thread as a C++ class, which we refer to as the thread definition. A thread definition has a method named run that defines the code that the thread executes. The run method is a C++ function which can call Hood library functions to create and synchronize with other threads. A rope is an object that is an instance of a thread definition class.
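The mailbox mechanism described earlier in this section can be sketched as follows. This is an illustrative model only and is not Hood's API: the class names `Thread` and `Process`, the functions `spawn` and `obtain_work`, the use of a lock as a stand-in for an atomic test-and-set, and the deterministic choice of victim are all assumptions made for the example. It also assumes that a process looks in its own deque first, then in its mailbox, and only then attempts a steal.

```python
import threading
from collections import deque


class Thread:
    """A unit of work that may appear in both a deque and a mailbox."""

    def __init__(self, name, affinity=None):
        self.name = name
        self.affinity = affinity        # index of the preferred process, if any
        self._claimed = False
        self._lock = threading.Lock()   # stand-in for an atomic test-and-set

    def try_claim(self):
        """Return True for exactly one caller; later callers get False."""
        with self._lock:
            if self._claimed:
                return False
            self._claimed = True
            return True


class Process:
    def __init__(self, pid):
        self.pid = pid
        self.deque = deque()     # bottom is the right end, top is the left end
        self.mailbox = deque()   # FIFO queue of threads with affinity for us


def spawn(creator, thread, processes):
    """Push onto the creator's deque and onto the affinity target's mailbox."""
    creator.deque.append(thread)
    if thread.affinity is not None:
        processes[thread.affinity].mailbox.append(thread)


def obtain_work(proc, processes, victim):
    """Own deque first, then mailbox, then a steal from the given victim."""
    while proc.deque:
        t = proc.deque.pop()             # bottom of own deque
        if t.try_claim():
            return t                     # stale copies are simply discarded
    while proc.mailbox:
        t = proc.mailbox.popleft()       # head of the FIFO mailbox
        if t.try_claim():
            return t
    victim_deque = processes[victim].deque
    while victim_deque:
        t = victim_deque.popleft()       # top of the victim's deque
        if t.try_claim():
            return t
    return None


procs = [Process(0), Process(1)]
leaf = Thread("leaf", affinity=1)
spawn(procs[0], leaf, procs)
got = obtain_work(procs[1], procs, victim=0)    # served from the mailbox
again = obtain_work(procs[0], procs, victim=1)  # deque copy is discarded
```

The second call finds the copy left on the creator's deque, sees that it has already been claimed, and drops it, so the thread runs once even though it was enqueued twice.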
Each time the run method of a rope is executed, it creates a new thread. A rope can have an affinity for a process, and when the Hood run-time system executes such a rope, the system passes this affinity to the thread. If the thread does not run on the process for which it has affinity, the affinity of the rope is updated to the new process.

Iterative data-parallel applications can effectively use ropes by making sure all corresponding threads (threads that update the same region across different steps) are generated from the same rope. A thread will therefore always have an affinity for the process on which its corresponding thread ran on the previous step. The dashed rectangle in Figure 10, for example, represents two threads that are generated in two executions of one rope. To initialize the ropes, the programmer needs to create a tree of ropes before the first step. This tree is then used on each step when forking the threads.

To implement locality-guided work stealing in Hood, we use a nonblocking queue for each mailbox. Since a thread is put into a mailbox and into a deque, one issue is making sure that the thread is not executed twice, once from the mailbox and once from the deque. One solution is to remove the other copy of a thread when a process starts executing it. In practice this is not efficient because it has a large synchronization overhead. In our implementation we do this lazily: when a process starts executing a thread, it sets a flag using an atomic update operation such as test-and-set or compare-and-swap to mark the thread. When a process later attempts to execute the other copy, the atomic update identifies the thread as already marked and the process discards it.

The second issue comes up when one wants to reuse the thread data structures, typically those from the previous step. When a thread's structure is reused in a step, the copies from the previous step, which can be in a mailbox or a deque, need to be marked invalid. One can implement this by invalidating all the multiple copies of threads at the end of a step and synchronizing all processes before the next step starts. In multiprogrammed workloads, however, the kernel can swap a process out, preventing it from participating in the current step. Such a swapped-out process prevents all the other processes from proceeding to the next step. In our implementation, to avoid the synchronization at the end of each step, we timestamp thread data structures such that each process closely follows the time of the computation and ignores a thread that is out of date.

Table 1: Measured benchmark characteristics. Columns: Benchmark; Work ($T_1$); Overhead ($T_1/T_s$); Critical Path Length ($T_\infty$); Average Par. ($T_1/T_\infty$). Rows: staticHeat, heat, lgHeat, ipHeat, staticRelax, relax, lgRelax, ipRelax. The numeric entries are illegible in this copy. We compiled all applications with the Sun CC compiler using the -xarch=v8plus -O5 -dalign flags. All times are given in seconds. $T_s$ denotes the execution time of the sequential algorithm for the application; its values for Heat and Relax are illegible in this copy.

7.2 Experimental Results

In this section we present the results of our preliminary experiments with locality-guided work stealing on two small applications. The experiments were run on a 14-processor Sun Ultra Enterprise with 400 MHz processors and 4 Mbyte L2 cache each, running Solaris 2.7. We used the processor_bind system call of Solaris 2.7 to bind processes to processors, to prevent the Solaris kernel from migrating a process among processors and causing the process to lose its cache state. When the number of processes is less than the number of processors, we bind one process to each processor; otherwise we bind processes to processors such that processes are distributed among processors as evenly as possible.

We use the applications Heat and Relax in our evaluation. Heat is a Jacobi over-relaxation that simulates heat propagation on a 2-dimensional grid for a number of steps. This benchmark was derived from similar Cilk [27] and SPLASH [35] benchmarks. The main data structures are two equal-sized arrays.
The algorithm runs in steps, each of which updates the entries in one array using the data in the other array, which was updated in the previous step. Relax is a Gauss-Seidel over-relaxation algorithm that iterates over a 1-dimensional array, updating each element by a weighted average of its value and that of its two neighbors.

We implemented each application with four strategies: static partitioning, work stealing, locality-guided work stealing, and locality-guided work stealing with initial placements. The static partitioning benchmarks divide the total work equally among the processes and make sure that each process accesses the same data elements in all the steps. Static partitioning is implemented directly with Solaris threads. The three work-stealing strategies are all implemented in Hood. The plain work-stealing version uses threads directly, and the two locality-guided versions use ropes by building a tree of ropes at the beginning of the computation. The initial placement strategy assigns initial affinities to the ropes near the top of the tree to achieve a good initial load balance. We use the following prefixes in the names of the benchmarks: static (static partitioning), none (work steal
