The Data Locality of Work Stealing


Umut A. Acar
School of Computer Science, Carnegie Mellon University

Guy E. Blelloch
School of Computer Science, Carnegie Mellon University

Robert D. Blumofe
Department of Computer Science, University of Texas at Austin

Abstract

This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation.

As a lower bound, we show that there is a family of multithreaded computations $G_n$, each member of which requires $\Theta(n)$ total instructions (work), for which when using work stealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is $\Omega(n)$. This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on $P$ processors the expected additional number of cache misses beyond those on a single processor is bounded by $O(C \lceil m/s \rceil P T_\infty)$, where $m$ is the execution time of an instruction incurring a cache miss, $s$ is the steal time, $C$ is the size of the cache, and $T_\infty$ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing.

For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional work loads but improves the performance up to 50% over static partitioning under multiprogrammed work loads. Furthermore, locality-guided work stealing improves the performance of work stealing up to 80%.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. SPAA, Bar Harbor, Maine USA. ACM.

1 Introduction

Many of today's parallel applications use sophisticated, adaptive algorithms which are best realized with parallel programming systems that support dynamic, lightweight threads such as Cilk [8], NESL [5], Hood [], and many others [3, 6, 7,, 3]. The core of these systems is a thread scheduler that balances load among the processes. In addition to a good load balance, however, good data locality is essential in obtaining high performance from modern parallel systems.

Several researchers have studied techniques to improve the data locality of multithreaded programs. One class of such techniques is based on software-controlled distribution of data among the local memories of a distributed shared memory system [5,, 6]. Another class of techniques is based on hints supplied by the programmer so that similar tasks might be executed on the same processor [5, 3, 34]. Both these classes of techniques rely on the programmer or compiler to determine the data access patterns in the program, which may be very difficult when the program has complicated data access patterns. Perhaps the earliest class of techniques was to attempt to execute threads that are close in the computation graph on the same processor [, 9,, 3, 6, 8]. The work-stealing algorithm is the most studied of these techniques [9,, 9,, 4, 36, 37]. Blumofe et al. showed that fully-strict computations achieve a provably good data locality [7] when executed with the work-stealing algorithm on a dag-consistent distributed shared memory system.
In recent work, Narlikar showed that work stealing improves the performance of space-efficient multithreaded applications by increasing the data locality [9]. None of this previous work, however, has studied upper or lower bounds on the data locality of multithreaded computations executed on existing hardware-controlled shared memory systems.

In this paper, we present theoretical and experimental results on the data locality of work stealing on hardware-controlled shared memory systems (HSMSs). Our first set of results are upper and lower bounds on the number of cache misses in multithreaded computations executed by the work-stealing algorithm. Let $M_1(C)$ denote the number of cache misses in the uniprocessor execution and $M_P(C)$ denote the number of cache misses in a $P$-processor execution of a multithreaded computation by the work-stealing algorithm on an HSMS with cache size $C$. Then, for a multithreaded computation with $T_1$ work (total number of instructions) and $T_\infty$ critical path (longest sequence of dependences), we show the following results for the work-stealing algorithm running on an HSMS.

Lower bounds on the number of cache misses for general computations: We show that there is a family of computations $G_n$ with $T_1 = \Theta(n)$ such that $M_1(C) = 3C$ while even on two processors the number of misses $M_2(C) = \Omega(n)$.

Upper bounds on the number of cache misses for nested-parallel computations: For a nested-parallel computation, we show that $M_P(C) \le M_1(C) + 2C\tau$, where $\tau$ is the number of steals in the $P$-processor execution. We then show that the expected number of steals is $O(\lceil m/s \rceil P T_\infty)$, where $m$ is the time for a cache miss and $s$ is the time for a steal.

Upper bound on the execution time of nested-parallel computations: We show that the expected execution time of a nested-parallel computation on $P$ processors is $O(T_1(C)/P + m \lceil m/s \rceil C T_\infty + (m + s) T_\infty)$, where $T_1(C)$ is the uniprocessor execution time of the computation including cache misses.

Figure 1: The speedup obtained by three different over-relaxation algorithms. (Plot of speedup against number of processes; curves: linear, work stealing, locality-guided work stealing, static partitioning.)

As in previous work [6, 9], we represent a multithreaded computation as a directed, acyclic graph (dag) of instructions. Each node in the dag represents a single instruction and the edges represent ordering constraints. A nested-parallel computation [5, 6] is a race-free computation that can be represented with a series-parallel dag [33]. Nested-parallel computations include computations consisting of parallel loops and forks and joins and any nesting of them. This class includes most computations that can be expressed in Cilk [8], and all computations that can be expressed in NESL [5]. Our results show that nested-parallel computations have much better locality characteristics under work stealing than do general computations. We also briefly consider another class of computations, computations with futures [, 3, 4,, 5], and show that they can be as bad as general computations.

The second part of our results is on further improving the data locality of multithreaded computations with work stealing. In work stealing, a processor steals a thread from a randomly (with uniform distribution) chosen processor when it runs out of work. In certain applications, such as iterative data-parallel applications, random steals may cause poor data locality. Locality-guided work stealing is a heuristic modification to work stealing that allows a thread to have an affinity for a process. In locality-guided work stealing, when a process obtains work it gives priority to a thread that has affinity for the process. Locality-guided work stealing can be used to implement a number of techniques that researchers suggest to improve data locality.
For example, the programmer can achieve an initial distribution of work among the processes or schedule threads based on hints by appropriately assigning affinities to threads in the computation.

Our preliminary experiments with locality-guided work stealing give encouraging results, showing that for certain applications the performance is very close to that of static partitioning in dedicated mode (i.e. when the user can lock down a fixed number of processors), but does not suffer a performance cliff problem [] in multiprogrammed mode (i.e. when processors might be taken by other users or the OS). Figure 1 shows a graph comparing work stealing, locality-guided work stealing, and static partitioning for a simple over-relaxation algorithm on a 14 processor Sun Ultra Enterprise. The over-relaxation algorithm iterates over a 1 dimensional array performing a 3-point stencil computation on each step. The superlinear speedup for static partitioning and locality-guided work stealing is due to the fact that the data for each run does not fit into the L2 cache of one processor but fits into the collective L2 cache of 6 or more processors. For this benchmark the following can be seen from the graph.

1. Locality-guided work stealing does significantly better than standard work stealing since on each step the cache is prewarmed with the data it needs.
2. Locality-guided work stealing does approximately as well as static partitioning for up to 14 processes.
3. When trying to schedule more than 14 processes on 14 processors static partitioning has a serious performance drop. The initial drop is due to load imbalance caused by the coarse-grained partitioning. The performance then approaches that of work stealing as the partitioning gets more fine-grained.

We are interested in the performance of work-stealing computations on hardware-controlled shared memory systems (HSMSs). We model an HSMS as a group of identical processors, each of which has its own cache, together with a single shared memory. Each cache contains $C$ blocks and is managed by the memory subsystem automatically.
We allow for a variety of cache organizations and replacement policies, including both direct-mapped and associative caches. We assign a server process to each processor and associate the cache of a processor with the process that the processor is assigned. One limitation of our work is that we assume that there is no false sharing.

2 Related Work

As mentioned in Section 1, there are three main classes of techniques that researchers have suggested to improve the data locality of multithreaded programs. In the first class, the program data is distributed among the nodes of a distributed shared-memory system by the programmer and a thread in the computation is scheduled on the node that holds the data that the thread accesses [5,, 6]. In the second class, data-locality hints supplied by the programmer are used in thread scheduling [5, 3, 34]. Techniques from both classes are employed in distributed shared memory systems such as COOL and Illinois Concert [5, ] and also used to improve the data locality of sequential programs [3]. However, the first class of techniques does not apply directly to HSMSs, because HSMSs do not allow software-controlled distribution of data among the caches. Furthermore, both classes of techniques rely on the programmer to determine the data access patterns in the application and thus may not be appropriate for applications with complex data-access patterns.

The third class of techniques, which is based on execution of threads that are close in the computation graph on the same process, is applied in many scheduling algorithms including work stealing [, 9, 3, 6, 8, 9]. Blumofe et al. showed bounds on the number of cache misses in a fully-strict computation executed by the work-stealing algorithm under the dag-consistent distributed shared memory of Cilk [7]. Dag consistency is a relaxed memory-consistency model that is employed in the distributed shared-memory implementation of the Cilk language. In a distributed Cilk application, processes maintain the dag consistency by means of the BACKER algorithm.
In [7], Blumofe et al. bound the number of shared-memory cache misses in a distributed Cilk application for caches that are maintained with the LRU replacement policy. They assumed that accesses to the shared memory are distributed uniformly and independently, which is not generally true because threads may concurrently access the same pages by algorithm design. Furthermore, they assumed that processes do not generate steal attempts frequently by making processes do additional page transfers before they attempt to steal from another process.

Figure 2: A dag (directed acyclic graph) for a multithreaded computation. Threads are shown as gray rectangles.

3 The Model

In this section, we present a graph-theoretic model for multithreaded computations, describe the work-stealing algorithm, define series-parallel and nested-parallel computations and introduce our model of an HSMS (Hardware-controlled Shared-Memory System).

As with previous work [6, 9] we represent a multithreaded computation as a directed acyclic graph, a dag, of instructions (see Figure 2). Each node in the dag represents an instruction and the edges represent ordering constraints. There are three types of edges: continuation, spawn, and dependency edges. A thread is a sequential ordering of instructions and the nodes that correspond to the instructions are linked in a chain by continuation edges. A spawn edge represents the creation of a new thread and goes from the node representing the instruction that spawns the new thread to the node representing the first instruction of the new thread. A dependency edge from instruction $i$ of a thread to instruction $j$ of some other thread represents a synchronization between two instructions such that instruction $j$ must be executed after $i$. We draw spawn edges with thick straight arrows, dependency edges with curly arrows and continuation edges with thick straight arrows throughout this paper. Also we show paths with wavy lines.

For a computation with an associated dag $G$, we define the computational work, $T_1$, as the number of nodes in $G$ and the critical path, $T_\infty$, as the number of nodes on the longest path of $G$.

Let $u$ and $v$ be any two nodes in a dag. Then we call $u$ an ancestor of $v$, and $v$ a descendant of $u$, if there is a path from $u$ to $v$. Any node is its own descendant and ancestor. We say that two nodes are relatives if there is a path from one to the other; otherwise we say that the nodes are independent.
The children of a node are independent because otherwise the edge from the node to one child is redundant. We call a common descendant $y$ of $u$ and $v$ a merger of $u$ and $v$ if the paths from $u$ to $y$ and $v$ to $y$ have only $y$ in common. We define the depth of a node $u$ as the number of edges on the shortest path from the root node to $u$. We define the least common ancestor of $u$ and $v$ as the ancestor of both $u$ and $v$ with maximum depth. Similarly, we define the greatest common descendant of $u$ and $v$ as the descendant of both $u$ and $v$ with minimum depth. An edge $(u, v)$ is redundant if there is a path between $u$ and $v$ that does not contain the edge $(u, v)$. The transitive reduction of a dag is the dag with all the redundant edges removed. In this paper we are only concerned with the transitive reduction of the computational dags. We also require that the dags have a single node with in-degree 0, the root, and a single node with out-degree 0, the final node.

In a multiprocess execution of a multithreaded computation, independent nodes can execute at the same time. If two independent nodes read or modify the same data, we say that they are RR or WW sharing respectively. If one node is reading and the other is modifying the data we say they are RW sharing. RW or WW sharing can cause data races, and the output of a computation with such races usually depends on the scheduling of nodes. Such races are typically indicative of a bug [8]. We refer to computations that do not have any RW or WW sharing as race-free computations. In this paper we consider only race-free computations.

The work-stealing algorithm is a thread scheduling algorithm for multithreaded computations. The idea of work stealing dates back to the research of Burton and Sleep [] and has been studied extensively since then [, 9, 9,, 4, 36, 37]. In the work-stealing algorithm, each process maintains a pool of ready threads and obtains work from its pool. When a process spawns a new thread the process adds the thread into its pool.
When a process runs out of work and finds its pool empty, it chooses a random process as its victim and tries to steal work from the victim's pool.

In our analysis, we imagine the work-stealing algorithm operating on individual nodes in the computation dag rather than on the threads. Consider a multithreaded computation and its execution by the work-stealing algorithm. We divide the execution into discrete time steps such that at each step, each process is either working on a node, which we call the assigned node, or is trying to steal work. The execution of a node takes 1 time step if the node does not incur a cache miss and $m$ steps otherwise. We say that a node is executed at the time step that a process completes executing the node. The execution time of a computation is the number of time steps that elapse between the time step that a process starts executing the root node and the time step that the final node is executed. The execution schedule specifies the activity of each process at each time step.

During the execution, each process maintains a deque (doubly ended queue) of ready nodes; we call the ends of a deque the top and the bottom. When a node $u$ is executed, it enables some other node $v$ if $u$ is the last parent of $v$ that is executed. We call the edge $(u, v)$ an enabling edge and $u$ the designated parent of $v$. When a process executes a node that enables other nodes, one of the enabled nodes becomes the assigned node and the process pushes the rest onto the bottom of its deque. If no node is enabled, then the process obtains work from its deque by removing a node from the bottom of the deque. If a process finds its deque empty, it becomes a thief and steals from a randomly chosen process, the victim. This is a steal attempt and takes at least $s$ and at most $ks$ time steps for some constant $k$ to complete. A thief process might make multiple steal attempts before succeeding, or might never succeed. When a steal succeeds, the thief process starts working on the stolen node at the step following the completion of the steal. We say that a steal attempt occurs at the step it completes.
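The deque discipline described above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions that are not part of the model: every node takes one step (no cache misses), a steal attempt takes one step, and the processes are visited in index order within a step. The function name `work_steal` and the dictionary encoding of the dag are hypothetical.

```python
import random
from collections import deque

def work_steal(children, root, num_procs, seed=0):
    """Simulate work stealing on a dag (unit-time nodes, one-step steals)."""
    rng = random.Random(seed)
    # pending[v] = number of parents of v that have not executed yet.
    pending = {v: 0 for v in children}
    for u in children:
        for v in children[u]:
            pending[v] += 1
    deques = [deque() for _ in range(num_procs)]  # left = top, right = bottom
    assigned = [None] * num_procs
    assigned[0] = root
    order, steals, done = [], 0, 0
    while done != len(children):
        for p in range(num_procs):
            u = assigned[p]
            if u is None:
                # Thief: choose a random victim and take the top of its deque.
                victim = rng.randrange(num_procs)
                if victim != p and deques[victim]:
                    assigned[p] = deques[victim].popleft()
                    steals += 1
                continue  # a stolen node starts executing on the next step
            order.append((p, u))
            done += 1
            # u enables v when u is the last parent of v to execute.
            enabled = []
            for v in children[u]:
                pending[v] -= 1
                if pending[v] == 0:
                    enabled.append(v)
            if enabled:
                assigned[p] = enabled[0]        # one enabled node is assigned
                deques[p].extend(enabled[1:])   # the rest go on the bottom
            elif deques[p]:
                assigned[p] = deques[p].pop()   # take work from the bottom
            else:
                assigned[p] = None              # become a thief

    return order, steals

# A fork-join diamond: r spawns a and b, which both precede c.
dag = {"r": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
order, steals = work_steal(dag, "r", num_procs=2)
print(order, steals)
```

Because the first enabled child is always assigned and the remaining children are always pushed in the same order, the sketch is deterministic in the sense used below; only the choice of victim is random.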
The work-stealing algorithm can be implemented in various ways. We say that an implementation of work stealing is deterministic if, whenever a process enables other nodes, the implementation always chooses the same node as the assigned node for the next step on that process, and the remaining nodes are always placed in the deque in the same order. This must be true for both multiprocess and uniprocess executions. We refer to a deterministic implementation of the work-stealing algorithm together with the HSMS that runs the implementation as a work stealer. For brevity, we refer to an execution of a multithreaded computation with a work stealer as an execution. We define the total work as the number of steps taken by a uniprocess execution, including the cache misses, and denote it by $T_1(C)$, where $C$ is the cache size. We denote the number of cache misses in a $P$-process execution with $C$-block caches as $M_P(C)$. We define the cache overhead of a $P$-process execution as $M_P(C) - M_1(C)$, where $M_1(C)$ is the number of misses in the uniprocess execution on the same work stealer.

Figure 3: Illustrates the recursive definition for series-parallel dags. Figure (a) is the base case, figure (b) depicts the serial, and figure (c) depicts the parallel composition.

We refer to a multithreaded computation for which the transitive reduction of the corresponding dag is series-parallel [33] as a series-parallel computation. A series-parallel dag $G(V, E)$ is a dag with two distinguished vertices, a source $s \in V$ and a sink $t \in V$, and can be defined recursively as follows (see Figure 3).

Base: $G$ consists of a single edge connecting $s$ to $t$.

Series Composition: $G$ consists of two series-parallel dags $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ with disjoint edge sets such that $s$ is the source of $G_1$, $u$ is the sink of $G_1$ and the source of $G_2$, and $t$ is the sink of $G_2$. Moreover $V_1 \cap V_2 = \{u\}$.

Parallel Composition: The graph consists of two series-parallel dags $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ with disjoint edge sets such that $s$ and $t$ are the source and the sink of both $G_1$ and $G_2$. Moreover $V_1 \cap V_2 = \{s, t\}$.

A nested-parallel computation is a race-free series-parallel computation [6].

We also consider multithreaded computations that use futures [, 3, 4,, 5]. The dag structures of computations with futures are defined elsewhere [4]. This is a superclass of nested-parallel computations, but still much more restrictive than general computations. The work-stealing algorithm for futures is a restricted form of the work-stealing algorithm, where a process starts executing a newly created thread immediately, putting its assigned thread onto its deque.

In our analysis, we consider several cache organizations and replacement policies for an HSMS. We model a cache as a set of (cache) lines, each of which can hold the data belonging to a memory block (a consecutive, typically small, region of memory). One instruction can operate on at most one memory block. We say that an instruction accesses a block or the line that contains the block when the instruction reads or modifies the block. We say that an instruction overwrites a line that contains the block $b$ when the instruction accesses some other block that replaces $b$ in the cache. We say that a cache replacement policy is simple if it satisfies two conditions.
First, the policy is deterministic. Second, whenever the policy decides to overwrite a cache line $l$, it makes the decision to overwrite $l$ by only using information pertaining to the accesses that are made after the last access to $l$. We refer to a cache managed with a simple cache-replacement policy as a simple cache. Simple caches and replacement policies are common in practice. For example, the least-recently-used (LRU) replacement policy, direct-mapped caches and set-associative caches where each set is maintained by a simple cache-replacement policy are simple.

In regard to the definitions of RW or WW sharing, we assume that reads and writes pertain to the whole block. This means we do not allow for false sharing, where two processes accessing different portions of a block invalidate the block in each other's caches. In practice, false sharing is an issue, but can often be avoided by a knowledge of the underlying memory system and appropriately padding the shared data to prevent two processes from accessing different portions of the same block.

Figure 4: The structure for dag of a computation with a large cache overhead.

4 General Computations

In this section, we show that the cache overhead of a multiprocess execution of a general computation and a computation with futures can be large even though the uniprocess execution incurs a small number of misses.

Theorem 1 There is a family of computations $\{G_n : n = kC \text{ for } k \in \mathbb{Z}^+\}$ with $O(n)$ computational work, whose uniprocess execution incurs $3C$ misses while any 2-process execution of the computation incurs $\Omega(n)$ misses on a work stealer with a cache size of $C$, assuming that $S = O(C)$, where $S$ is the maximum steal time.

Proof: Figure 4 shows the structure of a dag, $G_{4C}$, for $n = 4C$. Each node except the root node represents a sequence of $C$ instructions accessing a set of $C$ distinct memory blocks. The root node represents $C + S$ instructions that access $C$ distinct memory blocks. The graph has two symmetric components $L_{4C}$ and $R_{4C}$, which correspond to the left and the right subtree of the root excluding the leaves.
We partition the nodes in $G_{4C}$ into three classes, such that all nodes in a class access the same memory blocks while nodes from different classes access mutually disjoint sets of memory blocks. The first class contains the root node only, the second class contains all the nodes in $L_{4C}$, and the third class contains the rest of the nodes, which are the nodes in $R_{4C}$ and the leaves of $G_{4C}$. For general $n = kC$, $G_n$ can be partitioned into $L_n$, $R_n$, the $k$ leaves of $G_n$ and the root similarly. Each of $L_n$ and $R_n$ contains $\lceil k/2 \rceil - 1$ nodes and has the structure of a complete binary tree with additional $k$ leaves at the lowest level. There is a dependency edge from the leaves of both $L_n$ and $R_n$ to the leaves of $G_n$.

Consider a work stealer that executes the nodes of $G_n$ in the order that they are numbered in a uniprocess execution. In the uniprocess execution, no node in $L_n$ incurs a cache miss except the root node of $L_n$, since all nodes in $L_n$ access the same memory blocks as the root of $L_n$. The same argument holds for $R_n$ and the $k$ leaves of $G_n$. Hence the execution of the nodes in $L_n$, $R_n$, and the leaves causes $2C$ misses. Since the root node causes $C$ misses, the total number of misses in the uniprocess execution is $3C$.

Now, consider a 2-process execution with the same work stealer and call the two processes the first and the second process. At the first time step, the first process starts executing the root node, which enables the root of $R_n$ no later than time step $m$. Since the second process starts stealing immediately and there are no other processes to steal from, the second process steals and starts working on the root of $R_n$ no later than time step $m + S$. Hence, the root of $R_n$ executes before the root of $L_n$ and thus all the nodes in $R_n$ execute before the corresponding symmetric nodes in $L_n$. Therefore, for any leaf of $G_n$, the parent that is in $R_n$ executes before the parent in $L_n$. Therefore a leaf node of $G_n$ is executed immediately after its parent in $L_n$ and thus causes $C$ cache misses. Thus, the total number of cache misses is $\Omega(kC) = \Omega(n)$.
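The two miss counts in the proof can be tallied side by side. The block below only restates the argument above in the symbols $M_1(C)$ and $M_2(C)$ defined earlier; the two-process line counts nothing but the misses at the $k$ leaves, so it is a lower bound rather than an exact count.

```latex
\begin{align*}
M_1(C) &= \underbrace{C}_{\text{root}}
        + \underbrace{C}_{L_n}
        + \underbrace{C}_{R_n \text{ and the } k \text{ leaves}}
        = 3C, \\
M_2(C) &\ge \underbrace{k \cdot C}_{k \text{ leaves, } C \text{ misses each}}
        = n,
        \qquad \text{so } M_2(C) = \Omega(n).
\end{align*}
```

The uniprocess count is independent of $n$, while the two-process count grows linearly in $n$, which is why no useful bound relates the two for general computations.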

There exist computations similar to the computation in Figure 4 that generalize Theorem 1 for an arbitrary number of processes by making sure that all the processes but 2 steal throughout any multiprocess execution. Even in the general case, however, where the average parallelism is higher than the number of processes, Theorem 1 can be generalized with the same bound on the expected number of cache misses by exploiting the symmetry in $G_n$ and by assuming a symmetrically distributed steal time. With a symmetrically distributed steal time, for any $\epsilon$, a steal that takes $\epsilon$ steps more than the mean steal time is equally likely to happen as a steal that takes $\epsilon$ steps less than the mean.

Theorem 1 holds for computations with futures as well. Multithreaded computing with futures is a fairly restricted form of multithreaded computing compared to computing with events such as synchronization variables. The graph $F$ in Figure 5 shows the structure of a dag whose 2-process execution causes a large number of cache misses. In a 2-process execution of $F$, the enabling parents of the leaf nodes in the right subtree of the root are in the left subtree and therefore the execution of each such leaf node causes $C$ misses.

Figure 5: The structure for dag of a computation with futures that can incur a large cache overhead.

5 Nested-Parallel Computations

In this section, we show that the cache overhead of an execution of a nested-parallel computation with a work stealer is at most twice the product of the number of steals and the cache size. Our proof has two steps. First, we show that the cache overhead is bounded by the product of the cache size and the number of nodes that are executed out of order with respect to the uniprocess execution order. Second, we prove that the number of such out-of-order executions is at most twice the number of steals.

Consider a computation $G$ and its $P$-process execution, $X_P$, with a work stealer and the uniprocess execution, $X_1$, with the same work stealer. Let $v$ be a node in $G$ and node $u$ be the node that executes immediately before $v$ in $X_1$.
Then we say that $v$ is drifted in $X_P$ if node $u$ is not executed immediately before $v$ by the process that executes $v$ in $X_P$. Lemma 2 establishes a key property of an execution with simple caches.

Lemma 2 Consider a process with a simple cache of $C$ blocks. Let $X_1$ denote the execution of a sequence of instructions on the process starting with cache state $S_1$ and let $X_2$ denote the execution of the same sequence of instructions starting with cache state $S_2$. Then $X_1$ incurs at most $C$ more misses than $X_2$.

Proof: We construct a one-to-one mapping between the cache lines in $X_1$ and $X_2$ such that an instruction that accesses a line $l_1$ in $X_1$ accesses the entry $l_2$ in $X_2$ if and only if $l_1$ is mapped to $l_2$. Consider $X_1$ and let $l_1$ be a cache line. Let $i$ be the first instruction that accesses or overwrites $l_1$. Let $l_2$ be the cache line that the same instruction accesses or overwrites in $X_2$ and map $l_1$ to $l_2$. Since the caches are simple, an instruction that overwrites $l_1$ in $X_1$ overwrites $l_2$ in $X_2$. Therefore the number of misses that overwrite $l_1$ in $X_1$ is equal to the number of misses that overwrite $l_2$ in $X_2$ after instruction $i$. Since $i$ itself can cause 1 miss, the number of misses that overwrite $l_1$ in $X_1$ is at most 1 more than the number of misses that overwrite $l_2$ in $X_2$. We construct the mapping for each cache line in $X_1$ in the same way.

Now, let us show that the mapping is one-to-one. For the sake of contradiction, assume that two distinct cache lines, $l_1$ and $l_1'$, in $X_1$ map to the same line in $X_2$. Let $i_1$ and $i_2$ be the first instructions accessing these cache lines in $X_1$ such that $i_1$ is executed before $i_2$. Since $i_1$ and $i_2$ map to the same line in $X_2$ and the caches are simple, $i_2$ accesses the line that $i_1$ accesses in $X_1$, but then $l_1 = l_1'$, a contradiction. Hence, the total number of cache misses in $X_1$ is at most $C$ more than the misses in $X_2$.

Theorem 3 Let $D$ denote the total number of drifted nodes in an execution of a nested-parallel computation with a work stealer on $P$ processes, each of which has a simple cache with $C$ blocks. Then the cache overhead of the execution is at most $CD$.
Proof: Let $X_P$ denote the $P$-process execution and let $X_1$ be the uniprocess execution of the same computation with the same work stealer. We divide the multiprocess computation into $D$ pieces, each of which can incur at most $C$ more misses than in the uniprocess execution. Let $u$ be a drifted node and let $q$ be the process that executes $u$. Let $v$ be the next drifted node executed on $q$ (or the final node of the computation). Let the ordered set $O$ represent the execution order of all the nodes that are executed after $u$ ($u$ is included) and before $v$ ($v$ is excluded if it is drifted, included otherwise) on $q$ in $X_P$. Then the nodes in $O$ are executed on the same process and in the same order in both $X_1$ and $X_P$.

Now consider the number of cache misses during the execution of the nodes in $O$ in $X_1$ and $X_P$. Since the computation is nested parallel and therefore race free, a process that executes in parallel with $q$ does not cause $q$ to incur cache misses due to sharing. Therefore, by Lemma 2, during the execution of the nodes in $O$ the number of cache misses in $X_P$ is at most $C$ more than the number of misses in $X_1$. This bound holds for each of the $D$ sequences of such instructions $O$ corresponding to the $D$ drifted nodes. Since the sequence starting at the root node and ending at the first drifted node incurs the same number of misses in $X_1$ and $X_P$, $X_P$ takes at most $CD$ more misses than $X_1$ and the cache overhead is at most $CD$.

Lemma 2 (and thus Theorem 3) does not hold for caches that are not simple. For example, consider the execution of a sequence of instructions on a cache with a least-frequently-used replacement policy starting at two cache states. In the first cache state, the blocks that are frequently accessed by the instructions are in the cache with high frequencies, whereas in the second cache state, the blocks that are in the cache are not accessed by the instructions and have low frequencies. The execution with the second cache state, therefore, incurs many more misses than the size of the cache compared to the execution with the first cache state.
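Lemma 2 can be checked empirically for one simple policy. The sketch below assumes a fully associative LRU cache, which is only one of the simple caches the lemma covers, and the helper name `lru_misses` is hypothetical. It counts the misses of the same access sequence from two different initial cache states and confirms that the counts differ by at most $C$.

```python
import random
from collections import OrderedDict

def lru_misses(accesses, capacity, initial=()):
    """Count misses of a fully associative LRU cache with `capacity` lines.

    `initial` lists the blocks already cached, least recently used first.
    """
    cache = OrderedDict((b, None) for b in list(initial)[-capacity:])
    misses = 0
    for b in accesses:
        if b in cache:
            cache.move_to_end(b)           # hit: b becomes most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[b] = None
    return misses

C = 4
rng = random.Random(0)
for _ in range(200):
    seq = [rng.randrange(10) for _ in range(50)]
    state1 = rng.sample(range(20), C)      # two arbitrary initial cache states
    state2 = rng.sample(range(20), C)
    gap = lru_misses(seq, C, state1) - lru_misses(seq, C, state2)
    assert abs(gap) <= C                   # the bound stated in Lemma 2

# A cold cache against a fully warmed one attains the bound exactly.
print(lru_misses([0, 1, 2, 3] * 2, C), lru_misses([0, 1, 2, 3] * 2, C, [0, 1, 2, 3]))
```

A random trial is not a proof, but it shows what the lemma buys in Theorem 3: each run of in-order nodes that follows a drifted node starts from a possibly different cache state and therefore costs at most $C$ extra misses.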
Now we show that the number of drifted nodes in an execution of a series-parallel computation with a work stealer is at most twice the number of steals. The proof is based on the representation of series-parallel computations as sp-dags. We call a node with out-degree of at least 2 a fork node and partition the nodes of an sp-dag except the root into three categories: join nodes, stable nodes and nomadic nodes. We call a node that has an in-degree of at least 2 a join node and partition all the nodes that have in-degree 1 into two classes: a nomadic node has a parent that is a fork node, and a stable node has a parent that has out-degree 1. The root node has in-degree 0 and it does not belong to any of these categories. Lemma 4 lists two fundamental properties of sp-dags; one can prove both properties by induction on the number of edges in an sp-dag.

Figure 6: Children of $s$ and their merger.

Figure 7: The joint embedding of $u$ and $v$.

Figure 8: The join node $s$ is the least common ancestor of $y$ and $z$. Nodes $u$ and $v$ are the children of $s$.

Lemma 4 Let $G$ be an sp-dag. Then $G$ has the following properties.

1. The least common ancestor of any two nodes in $G$ is unique.
2. The greatest common descendant of any two nodes in $G$ is unique and is equal to their unique merger.

Lemma 5 Let $s$ be a fork node. Then no child of $s$ is a join node.

Proof: Let $u$ and $v$ denote two children of $s$ and suppose $u$ is a join node as in Figure 6. Let $t$ denote some other parent of $u$ and $z$ denote the unique merger of $u$ and $v$. Then both $z$ and $u$ are mergers for $s$ and $t$, which is a contradiction of Lemma 4. Hence $u$ is not a join node.

Corollary 6 Only nomadic nodes can be stolen in an execution of a series-parallel computation by the work-stealing algorithm.

Proof: Let $u$ be a stolen node in an execution. Then $u$ is pushed on a deque and thus the enabling parent of $u$ is a fork node. By Lemma 5, $u$ is not a join node and has an in-degree of 1. Therefore $u$ is nomadic.

Consider a series-parallel computation and let $G$ be its sp-dag. Let $u$ and $v$ be two independent nodes in $G$ and let $s$ and $t$ denote their least common ancestor and greatest common descendant respectively, as shown in Figure 7. Let $G_1$ denote the graph that is induced by the relatives of $u$ that are descendants of $s$ and also ancestors of $t$. Similarly, let $G_2$ denote the graph that is induced by the relatives of $v$ that are descendants of $s$ and ancestors of $t$. Then we call $G_1$ the embedding of $u$ with respect to $v$ and $G_2$ the embedding of $v$ with respect to $u$. We call the graph that is the union of $G_1$ and $G_2$ the joint embedding of $u$ and $v$ with source $s$ and sink $t$.
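The three node categories defined earlier in this section (join, nomadic, stable) depend only on in-degrees and out-degrees, so they can be read off an adjacency list directly. The following is a minimal sketch; the function name `classify_nodes` and the dictionary encoding of the dag are assumptions made for illustration.

```python
def classify_nodes(children):
    """Label each non-root node of a dag as 'join', 'nomadic', or 'stable'."""
    parents = {v: [] for v in children}
    for u, kids in children.items():
        for v in kids:
            parents[v].append(u)
    labels = {}
    for v, ps in parents.items():
        if len(ps) == 0:
            continue                   # the root is in none of the categories
        if len(ps) >= 2:
            labels[v] = "join"         # in-degree at least 2
        elif len(children[ps[0]]) >= 2:
            labels[v] = "nomadic"      # single parent is a fork node
        else:
            labels[v] = "stable"       # single parent has out-degree 1
    return labels

# r forks into a and b, which join at c; d follows c.
dag = {"r": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
print(classify_nodes(dag))
```

In this example only a and b are nomadic, which matches Corollary 6: they are the only nodes that can ever sit on a deque and be stolen.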
Now consider an execution of $G$ and let $y$ and $z$ be the children of $s$ such that $y$ is executed before $z$. Then we call $y$ the leader and $z$ the guard of the joint embedding.

Lemma 7 Let $G(V, E)$ be an sp-dag and let $y$ and $z$ be two parents of a join node $t$ in $G$. Let $G_1$ denote the embedding of $y$ with respect to $z$ and $G_2$ denote the embedding of $z$ with respect to $y$. Let $s$ denote the source and $t$ denote the sink of the joint embedding. Then the parents of any node in $G_1$ except for $s$ and $t$ are in $G_1$, and the parents of any node in $G_2$ except for $s$ and $t$ are in $G_2$.

Proof: Since $y$ and $z$ are independent, both $s$ and $t$ are different from $y$ and $z$ (see Figure 8). First, we show that there is no edge that starts at a node in $G_1$ other than $s$ and ends at a node in $G_2$ other than $t$, and vice versa. For the sake of contradiction, assume there is an edge $(m, n)$ such that $m \neq s$ is in $G_1$ and $n \neq t$ is in $G_2$. Then $m$ is the least common ancestor of $y$ and $z$; hence no such $(m, n)$ exists. A similar argument holds when $m$ is in $G_2$ and $n$ is in $G_1$.

Second, we show that there does not exist an edge that originates from a node outside of $G_1$ or $G_2$ and ends at a node in $G_1$ or $G_2$. For the sake of contradiction, let $(w, x)$ be an edge such that $x$ is in $G_1$ and $w$ is not in $G_1$ or $G_2$. Then $x$ is the unique merger for the two children of the least common ancestor of $w$ and $s$, which we denote by $r$. But then $t$ is also a merger for the children of $r$. The children of $r$ are independent and have a unique merger, hence there is no such edge $(w, x)$. A similar argument holds when $x$ is in $G_2$. Therefore we conclude that the parents of any node in $G_1$ except $s$ and $t$ are in $G_1$ and the parents of any node in $G_2$ except $s$ and $t$ are in $G_2$.

Lemma 8 Let $G$ be an sp-dag and let $y$ and $z$ be two parents of a join node $t$ in $G$. Consider the joint embedding of $y$ and $z$ and let $u$ be the guard node of the embedding. Then $y$ and $z$ are executed in the same respective order in a multiprocess execution as they are executed in the uniprocess execution if the guard node $u$ is not stolen.

Proof: Let $s$ be the source, $t$ the sink, and $v$ the leader of the joint embedding.
Since $u$ is not stolen, $v$ is not stolen. Hence, by Lemma 7, before it starts working on $u$, the process that executes $s$ executed $v$ and all its descendants in the embedding except for $t$. Hence, $z$ is executed before $u$ and $y$ is executed after $u$, as in the uniprocess execution. Therefore, $y$ and $z$ are executed in the same respective order as they execute in the uniprocess execution.

Lemma 9 A nomadic node is drifted in an execution only if it is stolen.

Proof: Let $u$ be a nomadic and drifted node. Then, by Lemma 5, $u$ has a single parent $s$ that enables $u$. If $u$ is the first child of $s$ to execute in the uniprocess execution then $u$ is not drifted in the multiprocess execution. Hence, $u$ is not the first child to execute. Let $v$ be the last child of $s$ that is executed before $u$ in the uniprocess execution. Now, consider the multiprocess execution and let $q$ be the process that executes $v$. For the sake of contradiction, assume that $u$ is not stolen. Consider the joint embedding of $u$ and $v$ as shown in Figure 8. Since all parents of the nodes in $G_2$ except for $s$ and $t$ are in $G_2$ by Lemma 7, $q$ executes all the nodes in $G_2$ before it executes $u$ and thus, $z$ precedes $u$ on $q$. But then $u$ is not drifted, because $z$ is the node that is executed immediately before $u$ in the uniprocess computation. Hence $u$ is stolen.

Let us define the cover of a join node $t$ in an execution as the set of all the guard nodes of the joint embeddings of all possible pairs of parents of $t$ in the execution. The following lemma shows that a join node is drifted only if a node in its cover is stolen.

Lemma 10 A join node is drifted in an execution only if a node in its cover is stolen in the execution.

Proof: Consider the execution and let $t$ be a join node that is drifted. Assume, for the sake of contradiction, that no node in the cover of $t$, $C(t)$, is stolen. Let $y$ and $z$ be any two parents of $t$ as in Figure 8. Then $y$ and $z$ are executed in the same order as in the uniprocess execution by Lemma 8. But then all parents of $t$ execute in the same order as in the uniprocess execution. Hence, the enabling parent of $t$ in the execution is the same as in the uniprocess execution. Furthermore, the enabling parent of $t$ has out-degree 1, because otherwise $t$ is not a join node by Lemma 5, and thus the process that enables $t$ executes $t$. Therefore, $t$ is not drifted. This is a contradiction; hence a node in the cover of $t$ is stolen.

[Figure 9: Nodes $t_1$ and $t_2$ are two join nodes with the common guard $u$.]

Lemma 11 The number of drifted nodes in an execution of a series-parallel computation is at most twice the number of steals in the execution.

Proof: We associate each drifted node in the execution with a steal such that no steal has more than 2 drifted nodes associated with it. Consider a drifted node, $u$. Then $u$ is not the root node of the computation and it is not stable either. Hence, $u$ is either a nomadic or a join node. If $u$ is nomadic, then $u$ is stolen by Lemma 9 and we associate $u$ with the steal that steals $u$.
Otherwise, $u$ is a join node and there is a node in its cover $C(u)$ that is stolen by Lemma 10. We associate $u$ with the steal that steals a node in its cover. Now, assume there are more than 2 nodes associated with a steal that steals node $u$. Then there are at least two join nodes $t_1$ and $t_2$ that are associated with $u$. Therefore, node $u$ is in the joint embedding of two parents of $t_1$ and also of $t_2$. Let $x_1$, $y_1$ be these parents of $t_1$ and $x_2$, $y_2$ be the parents of $t_2$, as shown in Figure 9. But then $u$ has a parent that is a fork node and is a join node, which contradicts Lemma 5. Hence no such $u$ exists.

Theorem 12 The cache overhead of an execution of a nested-parallel computation with simple caches is at most twice the product of the number of steals in the execution and the cache size.

Proof: Follows from Theorem 3 and Lemma 11.

6 An Analysis of Nonblocking Work Stealing

The non-blocking implementation of the work-stealing algorithm delivers provably good performance under traditional and multiprogrammed workloads. A description of the implementation and its analysis is presented in []; an experimental evaluation is given in []. In this section, we extend the analysis of the non-blocking work-stealing algorithm for classical workloads and bound the execution time of a nested-parallel computation with a work stealer to include the number of cache misses, the cache-miss penalty and the steal time. First, we bound the number of steal attempts in an execution of a general computation by the work-stealing algorithm. Then we bound the execution time of a nested-parallel computation with a work stealer using results from Section 5. The analysis that we present here is similar to the analysis given in [] and uses the same potential function technique.

We associate a nonnegative potential with nodes in a computation dag and show that the potential decreases as the execution proceeds. We assume that a node in a computation dag has out-degree at most 2. This is consistent with the assumption that each node represents one instruction.
Consider an execution of a computation with its dag, $G(V, E)$, with the work-stealing algorithm. The execution grows a tree, the enabling tree, that contains each node in the computation and its enabling edge. We define the distance of a node $u \in V$, $d(u)$, as $T_\infty - \mathrm{depth}(u)$, where $\mathrm{depth}(u)$ is the depth of $u$ in the enabling tree of the computation. Intuitively, the distance of a node indicates how far the node is away from the end of the computation. We define the potential function in terms of distances. At any given step $i$, we assign a positive potential to each ready node; all other nodes have potential 0. A node is ready if it is enabled and not yet executed to completion. Let $u$ denote a ready node at time step $i$. Then we define $\phi_i(u)$, the potential of $u$ at time step $i$, as

$$\phi_i(u) = \begin{cases} 3^{2d(u)-1} & \text{if } u \text{ is assigned;} \\ 3^{2d(u)} & \text{otherwise.} \end{cases}$$

The potential at step $i$, $\Phi_i$, is the sum of the potentials of each ready node at step $i$. When an execution begins, the only ready node is the root node, which has distance $T_\infty$ and is assigned to some process, so we start with $\Phi_0 = 3^{2T_\infty - 1}$. As the execution proceeds, nodes that are deeper in the dag become ready and the potential decreases. There are no ready nodes at the end of an execution and the potential is 0.

Let us give a few more definitions that enable us to associate a potential with each process. Let $R_i(q)$ denote the set of ready nodes that are in the deque of process $q$ along with $q$'s assigned node, if any, at the beginning of step $i$. We say that each node $u$ in $R_i(q)$ belongs to process $q$. Then we define the potential of $q$'s deque as

$$\Phi_i(q) = \sum_{u \in R_i(q)} \phi_i(u).$$

In addition, let $A_i$ denote the set of processes whose deque is empty at the beginning of step $i$, and let $D_i$ denote the set of all other processes. We partition the potential $\Phi_i$ into two parts

$$\Phi_i = \Phi_i(A_i) + \Phi_i(D_i), \quad \text{where} \quad \Phi_i(A_i) = \sum_{q \in A_i} \Phi_i(q) \quad \text{and} \quad \Phi_i(D_i) = \sum_{q \in D_i} \Phi_i(q),$$

and we analyze the two parts separately.
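As an illustration of these definitions, the following minimal Python sketch evaluates the potential of a snapshot and splits it into the $A_i$ and $D_i$ parts. The representation of a process as a pair (list of deque distances, assigned distance or None) and the names `phi` and `partition_potential` are assumptions made for this example only; they do not come from the paper.

```python
def phi(d, assigned):
    """Potential of a ready node with distance d: 3^(2d-1) if assigned, else 3^(2d)."""
    return 3 ** (2 * d - 1) if assigned else 3 ** (2 * d)


def partition_potential(processes):
    """Return (Phi(A), Phi(D)) for one snapshot.

    `processes` is a list of (deque_distances, assigned_distance) pairs, a
    hypothetical encoding: deque_distances holds d(u) for nodes in the deque,
    assigned_distance is d(u) of the assigned node or None.
    """
    phi_a = 0  # processes whose deque is empty
    phi_d = 0  # all other processes
    for deque_distances, assigned_distance in processes:
        total = sum(phi(d, assigned=False) for d in deque_distances)
        if assigned_distance is not None:
            total += phi(assigned_distance, assigned=True)
        if deque_distances:
            phi_d += total
        else:
            phi_a += total
    return phi_a, phi_d


# Start of an execution with T_inf = 3: only the root is ready and assigned.
start = [([], 3), ([], None)]
# A later snapshot: process 0 holds two nodes in its deque plus an assigned node.
later = [([3, 2], 1), ([], 2), ([], None)]
```

In this encoding, `partition_potential(start)` gives the initial potential $3^{2T_\infty - 1}$ entirely in the $A$ part, since the root is assigned and every deque is empty.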

Lemma 13 lists four basic properties of the potential that we use frequently. The proofs for these properties are given in [] and the listed properties are correct independent of the time that execution of a node or a steal takes. Therefore, we give a short proof sketch.

Lemma 13 The potential function satisfies the following properties.
1. Suppose node $u$ is assigned to a process at step $i$. Then the potential decreases by at least $(2/3)\phi_i(u)$.
2. Suppose a node $u$ is executed at step $i$. Then the potential decreases by at least $(5/9)\phi_i(u)$ at step $i$.
3. Consider any step $i$ and any process $q$ in $D_i$. The topmost node $u$ in $q$'s deque contributes at least $3/4$ of the potential associated with $q$. That is, we have $\phi_i(u) \geq (3/4)\Phi_i(q)$.
4. Suppose a process $p$ chooses process $q$ in $D_i$ as its victim at time step $i$ (a steal attempt of $p$ targeting $q$ occurs at step $i$). Then the potential decreases by at least $(1/2)\Phi_i(q)$ due to the assignment or execution of a node belonging to $q$ at the end of step $i$.

Property 1 follows directly from the definition of the potential function. Property 2 holds because a node enables at most two children with smaller potential, one of which becomes assigned. Specifically, the potential after the execution of node $u$ decreases by at least $\phi_i(u)(1 - 1/3 - 1/9) = (5/9)\phi_i(u)$. Property 3 follows from a structural property of the nodes in a deque. The distance of the nodes in a process's deque decreases monotonically from the top of the deque to the bottom. Therefore, the potential in the deque is the sum of geometrically decreasing terms and is dominated by the potential of the top node. The last property holds because when a process chooses process $q$ in $D_i$ as its victim, the node at the top of $q$'s deque is assigned at the next step. Therefore, the potential decreases by $(2/3)\phi_i(u)$ by property 1. Moreover, $\phi_i(u) \geq (3/4)\Phi_i(q)$ by property 3 and the result follows.

Lemma 16 shows that the potential decreases as a computation proceeds. The proof for Lemma 16 utilizes the balls and bins game bound from Lemma 14.
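The constants in the first three properties can be checked with exact arithmetic. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the execution case assumes the worst case in which the executed node enables two children at distance $d - 1$ (one assigned, one pushed on the deque), and the deque case assumes strictly decreasing distances from top to bottom with an assigned node no farther than the bottom node.

```python
from fractions import Fraction


def potential(d, assigned):
    """phi(u) as an exact fraction: 3^(2d-1) if assigned, 3^(2d) otherwise."""
    exponent = 2 * d - 1 if assigned else 2 * d
    return Fraction(3) ** exponent


def assignment_drop(d):
    """Fraction of phi(u) lost when an unassigned ready node becomes assigned."""
    before = potential(d, assigned=False)
    after = potential(d, assigned=True)
    return (before - after) / before


def execution_drop(d):
    """Fraction of phi(u) lost when an assigned node enables two children at d - 1."""
    before = potential(d, assigned=True)
    after = potential(d - 1, assigned=True) + potential(d - 1, assigned=False)
    return (before - after) / before


def top_share(deque_distances, assigned_distance=None):
    """Share of a process's potential held by the top node (first list entry)."""
    total = sum(potential(d, assigned=False) for d in deque_distances)
    if assigned_distance is not None:
        total += potential(assigned_distance, assigned=True)
    return potential(deque_distances[0], assigned=False) / total
```

Under these assumptions, `assignment_drop` and `execution_drop` return the fractions 2/3 and 5/9 for any distance, and `top_share([4], assigned_distance=4)` returns exactly 3/4, the tight case of property 3 (a single node in the deque and an assigned node at the same distance).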
Lemma 14 (Balls and Weighted Bins) Suppose that at least $P$ balls are thrown independently and uniformly at random into $P$ bins, where bin $i$ has a weight $W_i$, for $i = 1, \ldots, P$. The total weight is $W = \sum_{i=1}^{P} W_i$. For each bin $i$, define the random variable $X_i$ as

$$X_i = \begin{cases} W_i & \text{if some ball lands in bin } i; \\ 0 & \text{otherwise.} \end{cases}$$

If $X = \sum_{i=1}^{P} X_i$, then for any $\beta$ in the range $0 < \beta < 1$, we have $\Pr\{X \geq \beta W\} > 1 - 1/((1 - \beta)e)$.

This lemma can be proven with an application of Markov's inequality. The proof of a weaker version of this lemma for the case of exactly $P$ throws is similar and given in []. Lemma 14 also follows from the weaker lemma because $X$ does not decrease with more throws.

We now show that whenever $P$ or more steal attempts occur, the potential decreases by a constant fraction of $\Phi_i(D_i)$ with constant probability.

Lemma 15 Consider any step $i$ and any later step $j$ such that at least $P$ steal attempts occur at steps from $i$ (inclusive) to $j$ (exclusive). Then we have

$$\Pr\left\{\Phi_i - \Phi_j \geq \tfrac{1}{4}\Phi_i(D_i)\right\} > \tfrac{1}{4}.$$

Moreover, the potential decrease is because of the execution or assignment of nodes belonging to a process in $D_i$.

Proof: Consider all $P$ processes and $P$ steal attempts that occur at or after step $i$. For each process $q$ in $D_i$, if one or more of the $P$ attempts target $q$ as the victim, then the potential decreases by $(1/2)\Phi_i(q)$ due to the execution or assignment of nodes that belong to $q$, by property 4 in Lemma 13. If we think of each attempt as a ball toss, then we have an instance of the Balls and Weighted Bins Lemma (Lemma 14). For each process $q$ in $D_i$, we assign a weight $W_q = (1/2)\Phi_i(q)$, and for each other process $q$ in $A_i$, we assign a weight $W_q = 0$. The weights sum to $W = (1/2)\Phi_i(D_i)$. Using $\beta = 1/2$ in Lemma 14, we conclude that the potential decreases by at least $\beta W = (1/4)\Phi_i(D_i)$ with probability greater than $1 - 1/((1 - \beta)e) > 1/4$ due to the execution or assignment of nodes that belong to a process in $D_i$.

We now bound the number of steal attempts in a work-stealing computation.

Lemma 16 Consider a $P$-process execution of a multithreaded computation with the work-stealing algorithm.
Let $T_1$ and $T_\infty$ denote the computational work and the critical path of the computation. Then the expected number of steal attempts in the execution is $O(\lceil m/s \rceil P T_\infty)$. Moreover, for any $\varepsilon > 0$, the number of steal attempts is $O(\lceil m/s \rceil P (T_\infty + \lg(1/\varepsilon)))$ with probability at least $1 - \varepsilon$.

Proof: We analyze the number of steal attempts by breaking the execution into phases of $\lceil m/s \rceil P$ steal attempts. We show that with constant probability, a phase causes the potential to drop by a constant factor. The first phase begins at step $t_1 = 1$ and ends at the first step $t_1'$ such that at least $\lceil m/s \rceil P$ steal attempts occur during the interval of steps $[t_1, t_1']$. The second phase begins at step $t_2 = t_1' + 1$, and so on.

Let us first show that there are at least $m$ steps in a phase. A process has at most 1 outstanding steal attempt at any time and a steal attempt takes at least $s$ steps to complete. Therefore, at most $P$ steal attempts occur in a period of $s$ time steps. Hence a phase of steal attempts takes at least $\lceil (\lceil m/s \rceil P)/P \rceil s \geq m$ time units.

Consider a phase beginning at step $i$, and let $j$ be the step at which the next phase begins. Then $i + m \leq j$. We will show that we have $\Pr\{\Phi_j \leq (3/4)\Phi_i\} > 1/4$. Recall that the potential can be partitioned as $\Phi_i = \Phi_i(A_i) + \Phi_i(D_i)$. Since the phase contains $\lceil m/s \rceil P$ steal attempts, $\Pr\{\Phi_i - \Phi_j \geq (1/4)\Phi_i(D_i)\} > 1/4$ due to execution or assignment of nodes that belong to a process in $D_i$, by Lemma 15. Now we show that the potential also drops by a constant fraction of $\Phi_i(A_i)$ due to the execution of assigned nodes that are assigned to the processes in $A_i$. Consider a process, say $q$, in $A_i$. If $q$ does not have an assigned node, then $\Phi_i(q) = 0$. If $q$ has an assigned node $u$, then $\Phi_i(q) = \phi_i(u)$. In this case, process $q$ completes executing node $u$ at step $i + m - 1 < j$ at the latest and the potential drops by at least $(5/9)\phi_i(u)$ by property 2 of Lemma 13. Summing over each process $q$ in $A_i$, we have $\Phi_i - \Phi_j \geq (5/9)\Phi_i(A_i)$. Thus, we have shown that the potential decreases by at least a quarter of $\Phi_i(A_i)$ and of $\Phi_i(D_i)$.
Therefore, no matter how the total potential is distributed over $A_i$ and $D_i$, the total potential decreases by a quarter with probability more than $1/4$, that is, $\Pr\{\Phi_i - \Phi_j \geq (1/4)\Phi_i\} > 1/4$.

We say that a phase is successful if it causes the potential to drop by at least a $1/4$ fraction. A phase is successful with probability at least $1/4$. Since the potential starts at $\Phi_0 = 3^{2T_\infty - 1}$ and ends at 0 (and is always an integer), the number of successful phases is at most $(2T_\infty - 1)\log_{4/3} 3 < 8T_\infty$. The expected number of phases needed to obtain $8T_\infty$ successful phases is at most $32T_\infty$. Thus, the expected number of phases is $O(T_\infty)$, and because each phase contains $\lceil m/s \rceil P$ steal attempts, the expected number of steal attempts is $O(\lceil m/s \rceil P T_\infty)$. The high probability bound follows by an application of the Chernoff bound.
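Two numeric facts used in this argument can be checked directly for small instances: the Balls and Weighted Bins bound with $\beta = 1/2$, and the count of successful phases. The following Python sketch does so by brute-force enumeration; the function names and the example weights are hypothetical choices for illustration, and the enumeration is only practical for very small $P$.

```python
import math
from fractions import Fraction
from itertools import product


def prob_weight_at_least(weights, beta, balls=None):
    """Exact Pr{X >= beta * W} for `balls` uniform throws into len(weights) bins."""
    bins = len(weights)
    balls = bins if balls is None else balls
    threshold = beta * sum(weights)
    good = 0
    for outcome in product(range(bins), repeat=balls):
        hit = set(outcome)  # bins that received at least one ball
        if sum(weights[b] for b in hit) >= threshold:
            good += 1
    return Fraction(good, bins ** balls)


def lemma_lower_bound(beta):
    """The lower bound 1 - 1/((1 - beta) e) stated in the lemma."""
    return 1 - 1 / ((1 - beta) * math.e)


def max_successful_phases(t_inf):
    """(2 T_inf - 1) * log_{4/3} 3: the number of 1/4-fraction drops needed to
    take the starting potential 3^(2 T_inf - 1) below 1."""
    return (2 * t_inf - 1) * math.log(3) / math.log(4 / 3)


half = Fraction(1, 2)
exact = prob_weight_at_least([1, 2, 3, 4], half)
```

For four bins with weights 1 to 4 and four balls, the exact probability is well above the bound $1 - 2/e \approx 0.26$, which is itself just above the $1/4$ used in Lemma 15. Since $\log_{4/3} 3 \approx 3.82$, `max_successful_phases(t)` stays below $8t$, and with success probability at least $1/4$ per phase the expected number of phases needed for $8T_\infty$ successes is at most $4 \cdot 8T_\infty = 32T_\infty$.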

Theorem 17 Let $M_P(C)$ be the number of cache misses in a $P$-process execution of a nested-parallel computation with a work stealer that has simple caches of $C$ blocks each. Let $M_1(C)$ be the number of cache misses in the uniprocess execution. Then

$$M_P(C) = M_1(C) + O\left(\lceil m/s \rceil C P T_\infty + \lceil m/s \rceil C P \ln(1/\varepsilon)\right)$$

with probability at least $1 - \varepsilon$. The expected number of cache misses is

$$M_1(C) + O\left(\lceil m/s \rceil C P T_\infty\right).$$

Proof: Theorem 12 shows that the cache overhead of a nested-parallel computation is at most twice the product of the number of steals and the cache size. Lemma 16 shows that the number of steal attempts is $O(\lceil m/s \rceil P (T_\infty + \ln(1/\varepsilon)))$ with probability at least $1 - \varepsilon$ and the expected number of steals is $O(\lceil m/s \rceil P T_\infty)$. The number of steals is not greater than the number of steal attempts. Therefore the bounds follow.

Theorem 18 Consider a $P$-process, nested-parallel, work-stealing computation with simple caches of $C$ blocks. Then, for any $\varepsilon > 0$, the execution time is

$$O\left(\frac{T_1(C)}{P} + m \lceil m/s \rceil C (T_\infty + \ln(1/\varepsilon)) + (m + s)(T_\infty + \ln(1/\varepsilon))\right)$$

with probability at least $1 - \varepsilon$. Moreover, the expected running time is

$$O\left(\frac{T_1(C)}{P} + m \lceil m/s \rceil C T_\infty + (m + s) T_\infty\right).$$

Proof: We use an accounting argument to bound the running time. At each step in the computation, each process puts a dollar into one of two buckets that matches its activity at that step. We name the two buckets the work and the steal bucket. A process puts a dollar into the work bucket at a step if it is working on a node in the step. The execution of a node in the dag adds either 1 or $m$ dollars to the work bucket. Similarly, a process puts a dollar into the steal bucket for each step that it spends stealing. Each steal attempt takes $O(s)$ steps. Therefore, each steal adds $O(s)$ dollars to the steal bucket. The number of dollars in the work bucket at the end of execution is at most $O(T_1 + (m - 1) M_P(C))$, which is $O(T_1(C) + (m - 1) \lceil m/s \rceil C P (T_\infty + \ln(1/\varepsilon)))$ with probability at least $1 - \varepsilon$. The total number of dollars in the steal bucket is the total number of steal attempts multiplied by the number of dollars added to the steal bucket for each steal attempt, which is $O(s)$.
Therefore the total number of dollars in the steal bucket is $O(s \lceil m/s \rceil P (T_\infty + \ln(1/\varepsilon)))$ with probability at least $1 - \varepsilon$. Each process adds exactly one dollar to a bucket at each step, so we divide the total number of dollars by $P$ to get the high probability bound in the theorem. A similar argument holds for the expected time bound.

[Figure 10: The tree of threads created in a data-parallel work-stealing application.]

7 Locality-Guided Work Stealing

The work-stealing algorithm achieves good data locality by executing nodes that are close in the computation graph on the same process. For certain applications, however, regions of the program that access the same data are not close in the computational graph. As an example, consider an application that takes a sequence of steps, each of which operates in parallel over a set or array of values. We will call such an application an iterative data-parallel application. Such an application can be implemented using work stealing by forking a tree of threads on each step, in which each leaf of the tree updates a region of the data (typically disjoint). Figure 10 shows an example of the trees of threads created in two steps. Each node represents a thread and is labeled with the process that executes it. The gray nodes are the leaves. The threads synchronize in the same order as they fork. The first and second steps are structurally identical, and each pair of corresponding gray nodes update the same region, often using much of the same input data. The dashed rectangle in Figure 10, for example, shows a pair of such gray nodes.

To get good locality for this application, threads that update the same data on different steps ideally should run on the same processor, even though they are not close in the dag. In work stealing, however, this is highly unlikely to happen due to the random steals. Figure 10, for example, shows an execution where all pairs of corresponding gray nodes run on different processes.
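The shape of an iterative data-parallel application can be sketched sequentially as follows. This is only an illustration: it runs on a single process, the binary halving and the `grain` cutoff are assumptions rather than details from the paper, and the names `leaf_regions` and `run_steps` are hypothetical. It shows why corresponding leaves on successive steps touch the same region of the data.

```python
def leaf_regions(lo, hi, grain):
    """Halve [lo, hi) recursively, as a binary fork tree would, and return
    the leaf ranges from left to right."""
    if hi - lo <= grain:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return leaf_regions(lo, mid, grain) + leaf_regions(mid, hi, grain)


def run_steps(data, steps, grain, update):
    """Apply `update` to every element once per step, one leaf region at a time.

    In a work-stealing runtime each leaf would be a separate thread; here the
    leaves are simply visited in order. Returns the leaf regions of each step.
    """
    regions_per_step = []
    for _ in range(steps):
        regions = leaf_regions(0, len(data), grain)
        for lo, hi in regions:
            for i in range(lo, hi):
                data[i] = update(data[i])
        regions_per_step.append(regions)
    return regions_per_step


values = list(range(8))
regions = run_steps(values, steps=2, grain=2, update=lambda x: x + 1)
```

Because the tree is rebuilt identically on every step, `regions[0]` and `regions[1]` are the same list of ranges; a scheduler that ran the k-th leaf of each step on the same processor would reuse whatever part of that region is still cached.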
In this section, we describe and evaluate locality-guided work stealing, a heuristic modification to work stealing which is designed to allow locality between nodes that are distant in the computational graph. In locality-guided work stealing, each thread can be given an affinity for a process, and when a process obtains work it gives priority to threads with affinity for it. To enable this, in addition to a deque each process maintains a mailbox: a first-in-first-out (FIFO) queue of pointers to threads that have affinity for the process. There are then two differences between the locality-guided work-stealing and work-stealing algorithms. First, when creating a thread, a process will push the thread onto both the deque, as in normal work stealing, and also onto the tail of the mailbox of the process that the thread has affinity for. Second, a process will first try to obtain work from its mailbox before attempting a steal. Because threads can appear twice, once in a mailbox and once on a deque, there needs to be some form of synchronization between the two copies to make sure the thread is not executed twice.

A number of techniques that have been suggested to improve the data locality of multithreaded programs can be realized by the locality-guided work-stealing algorithm together with an appropriate policy to determine the affinities of threads.
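The deque-plus-mailbox mechanism described above can be sketched as a small sequential simulation. Everything here is an assumption made for illustration: the class and method names are hypothetical, the `claimed` flag stands in for whatever atomic synchronization a real implementation would use, and the order "local deque, then mailbox, then steal" is one reading of the description rather than the authors' code.

```python
import random
from collections import deque


class Task:
    """A thread of work; `claimed` stands in for an atomic test-and-set flag."""

    def __init__(self, name):
        self.name = name
        self.claimed = False

    def try_claim(self):
        # Sequential stand-in: a concurrent implementation needs an atomic operation.
        if self.claimed:
            return False
        self.claimed = True
        return True


class Process:
    def __init__(self, pid):
        self.pid = pid
        self.deque = deque()    # bottom is the right end, top is the left end
        self.mailbox = deque()  # FIFO: append at the tail, take from the head

    def spawn(self, task, processes, affinity=None):
        # Push onto the local deque as in ordinary work stealing ...
        self.deque.append(task)
        # ... and also onto the mailbox of the process the task has affinity for.
        if affinity is not None:
            processes[affinity].mailbox.append(task)

    @staticmethod
    def _first_unclaimed(pop):
        # Skip copies whose twin was already claimed through the other queue.
        while True:
            try:
                task = pop()
            except IndexError:  # the queue is empty
                return None
            if task.try_claim():
                return task

    def get_work(self, processes, rng):
        # 1. bottom of the local deque, 2. head of the mailbox, 3. random steal.
        task = self._first_unclaimed(self.deque.pop)
        if task is None:
            task = self._first_unclaimed(self.mailbox.popleft)
        if task is None:
            victims = [p for p in processes if p is not self]
            if victims:
                victim = rng.choice(victims)
                task = self._first_unclaimed(victim.deque.popleft)
        return task
```

A short usage example: process 0 spawns task `a` with affinity for process 1 and task `b` with affinity for itself; process 1 then finds `a` in its mailbox without stealing, and the stale copy of `a` left on process 0's deque is skipped because it is already claimed.

```python
procs = [Process(0), Process(1)]
a, b = Task("a"), Task("b")
procs[0].spawn(a, procs, affinity=1)
procs[0].spawn(b, procs, affinity=0)
rng = random.Random(0)
first = procs[1].get_work(procs, rng)   # taken from process 1's mailbox
second = procs[0].get_work(procs, rng)  # taken from the bottom of process 0's deque
third = procs[0].get_work(procs, rng)   # only already-claimed copies remain
```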
