Fast Distributed Process Creation with the XMOS XS1 Architecture


Communicating Process Architectures 2011
P.H. Welch et al. (Eds.)
IOS Press, 2011
© 2011 The authors and IOS Press. All rights reserved.

James HANLON and Simon J. HOLLIS
Department of Computer Science, University of Bristol, UK.
{hanlon,

Abstract. The provision of mechanisms for processor allocation in current distributed parallel programming models is very limited. This makes difficult, or even prohibits, the expression of a large class of programs which require a run-time assessment of their required resources. This includes programs whose structure is irregular, composite or unbounded. Efficient allocation of processors requires a process creation mechanism able to initiate and terminate remote computations quickly. This paper presents the design, demonstration and analysis of an explicit mechanism to do this, implemented on the XMOS XS1 architecture, as a foundation for a more dynamic scheme. It shows that process creation can be made efficient so that it incurs only a fractional overhead of the total runtime and that it can be combined naturally with recursion to enable rapid distribution of computations over a system.

Keywords. distributed process creation, distributed runtime, dynamic task placement, parallel recursion

Introduction

An essential issue in the design of scalable, distributed parallel computers is the rate at which computations can be initiated, and results collected as they terminate [1]. This requires an efficient method of process creation capable of dispatching a program, and the data on which it operates, to a remote processor. This paper presents the design, implementation, demonstration and evaluation of a process creation mechanism for the XMOS XS1 architecture [2]. Parallelism is being employed on an increasingly large scale to improve the performance of computer systems, particularly in high performance systems, but increasingly in other areas such as embedded computing [3].
As current programming models such as MPI (Message Passing Interface) provide limited support for automated management of processing resources, the burden of doing this falls mainly on the programmer. These issues are not relevant to the expression of a program as, in general, a programmer is concerned only with introducing parallelism (execution on multiple processors) to improve performance, and not with how the computation is scheduled on the underlying system. When we consider that future high performance systems will run on the order of $10^9$ threads [4], it is clear that the programming model must provide some means of dynamic processor allocation to remove this burden. This is the situation we have with memory in sequential systems, where allocation and deallocation are performed with varying degrees of automation.

This observation is not new [5,6], but it is only as existing programming models and software struggle to meet the increasing scale of parallelism that the problem is again coming to light. For instance, capabilities for process creation and management were introduced in the MPI-2.0 specification, stating that:

    Reasons for including process management in MPI are both technical and practical. Important classes of message-passing applications require this control. These include task farms, serial applications with parallel modules and problems that require a run-time assessment of the number and type of processes that should be started [7].

Several MPI implementations support process creation and management functionality, but it is pitched as an advanced feature that is difficult to use and problematic with many current job-scheduling systems. More encouragingly, language-level abstractions for dynamic process creation and placement have appeared recently in Chapel [8] and X10 [9], which are being developed by Cray and IBM respectively as part of DARPA's High Productivity Computing Systems program. Both support these concepts as key ingredients in the design of parallel programs, but they are built on software communication libraries and statically-mapped program binaries. Consequently, they are subject to the same communication inefficiencies and inflexibility as single-program approaches.

A run-time assessment of required processing resources concerns a large class of programs whose structure is irregular, such as unstructured-grid algorithms like the Spectral Element Method [10]; unbounded, such as recursively-structured algorithms like Branch-and-Bound search [11] and Adaptive Mesh Refinement [12]; or composite, where a program may be composed of different parallel subroutines that are themselves executed in parallel, possibly each with its own structure. These all require a means of dynamic processor allocation that is able to distribute computations over a set of processors, depending on requirements determined at runtime. The combination of parallelism and recursion is a powerful mechanism for growth which can be used to implement distribution efficiently. This must be supported with a mechanism for process creation with the ability to dispatch, initiate and terminate computations efficiently on remote processors.
This paper presents the design and implementation of an explicit scheme for dynamic process creation in a distributed memory parallel computer. This work is intended to be a key building block for a more automatic scheme. The implementation is on the XMOS XS1 architecture, which has low-level provisions for concurrency, allowing a convincing proof-of-concept implementation. Based on this, the process creation mechanism is evaluated by combining it with controlled recursion in two simple algorithms to demonstrate the rate and granularity at which it is possible to create remote computations. Performance models are developed in each case to interpret the measured results and to make predictions for larger systems and workloads. This analysis highlights the efficiency, scalability and effectiveness of the concept and approach taken.

The rest of this paper is structured as follows. Section 1 describes the XS1 architecture, the experimental platform and the notations and conventions used. Section 2 gives a brief overview of the design and implementation details. Section 3 presents the performance models and experimental and predicted results. Finally, Section 4 concludes and Section 5 discusses possible future extensions to the work.

1. Background

1.1. Platform

The XMOS XS1 processor architecture [2] is general-purpose, multi-threaded, scalable and has been designed from the ground up to support concurrency. It allows systems to be constructed from multiple XCore processors which communicate with each other through fast communication links. The key novel aspect of this architecture with respect to the work in this paper is the instruction set support for processes and communication. Low-level threading and communication are key features, exposed with operations, for example, to provide synchronous and asynchronous fork-join thread-level parallelism and channel-based message passing communication. Provision of these features in hardware allows them to be performed in the same order of magnitude of time as memory references, branches and arithmetic. This allows efficient high-level notations for concurrency to be built.

The system used to demonstrate and evaluate the proposed process creation mechanism is an experimental board called the XK-XMP-64 [13]. It connects together 64 XCore processors in 16 XS1-G4 devices which run at 400MHz. The G4 devices are interconnected in a 4-dimensional hypercube which equivalently can be viewed as a 2-dimensional torus. Mathematically, this is defined in the following way [14]:

Definition 1. A $d$-dimensional hypercube is a graph $G = (N, E)$ where $N$ is the set of $2^d$ nodes and $E$ is the set of edges. Each node is labelled with a $d$-bit identifier. For any $m, n \in N$, an edge exists between $m$ and $n$ if and only if $m \oplus n = 2^k$ for $0 \le k < d$, where $\oplus$ is the bitwise exclusive-or operator. Hence, each node has $d = \log |N|$ edges and $|E| = d2^{d-1}$.

Each core in the G4 package has a private 64kB memory and is interconnected via internal links to an integrated switch. It is convenient to view the whole system as a 6-dimensional hypercube. As each core can run 8 hardware threads, the system is capable of 512-way concurrency with an aggregate 25.6 GIPS performance.

1.2. Notation

For presentation of the algorithms in this paper, a simple imperative, block-structured notation is used. The following points describe the non-standard elements that appear in the examples.

1.2.1. Sequential and Parallel Composition

A set of instructions that are to be executed in sequence are composed with the ; separator. A sequence of instructions comprises a process. For example, the block

    { I_1 ; I_2 ; I_3 }

defines a simple process to perform three instructions, I_1, I_2 and I_3, in sequence. Processes may be executed in parallel by composition within a block with the | separator. Execution of a parallel block initiates the execution of the constituent processes simultaneously.
The parallel block successfully terminates only when all processes have successfully terminated. This is referred to as synchronous fork-join parallelism. For example, the block declaration

    { P_1 | P_2 | P_3 }

denotes the parallel execution of three processes P_1, P_2 and P_3.

1.2.2. Aliasing

The aliases statement is used to create new references to sub-sections of an array. For example, the statement

    A aliases B[i ... j]

sets A to refer to the sub-section of B in the index range i to j.
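
As an illustration only, the effect of aliases can be imitated in Python with a memoryview over an array, so that the alias shares storage with the original rather than copying it. The helper name `aliases` and the choice of inclusive bounds are assumptions made for this sketch; the paper's notation does not state whether index j is included.

```python
from array import array

def aliases(arr, i, j):
    """Return a view of arr[i..j] (bounds assumed inclusive) sharing storage with arr."""
    return memoryview(arr)[i:j + 1]

B = array("i", range(10))   # B = 0, 1, ..., 9 stored as machine integers
A = aliases(B, 2, 5)        # corresponds to: A aliases B[2 ... 5]

A[0] = 99                   # a write through the alias ...
print(B[2])                 # ... is visible in B: prints 99
print(A.tolist())           # prints [99, 3, 4, 5]
```

No data is copied when the alias is created, which is the property the mergesort procedures later rely on when they split an array in half.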

1.2.3. Process Creation

The on statement reveals the process creation mechanism explicitly to the programmer. The statement

    on p do P

is semantically equivalent to executing a call to P, except that process P is transmitted to processor p, which then executes P and communicates back any results using channels, leaving the original processor free to perform other tasks. By composing on in parallel, we can exploit multi-threaded parallelism to offload work while executing another process. For example, the statement

    { P_1 | on p do P_2 }

causes P_1 to be executed while P_2 is offloaded and executed on processor p.

1.3. Measurements

All timing measurements presented were made with hardware timers, which are accessible through the ISA and have 10ns resolution. Constant values were extrapolated from the measurements taken by fitting performance models to the data.

1.4. Conventions

All logarithms are to the base 2. $p$ is defined as the number of processors and is taken to be a positive power of two. A word is taken to be 4 bytes and is a unit of input in the performance models.

2. Implementation

The on statement causes the closure of a process P located at a guest processor to be sent to a remote host processor, the host to execute P and to send back any updated free variables of P stored at the guest. The execution of on is synchronous in this respect. The closure of a process P is a complete description of P allowing it to be executed independently and is defined in the following way:

Definition 2. The closure $C$ of a process $P$ consists of three elements: a set of arguments $A$, which represents the complete variable context of $P$ as we do not consider global variables, a set of procedure indices $I$ and a set of procedures $Q$:

$$C(P) = (A, I, Q)$$

where $|A| \ge 0$ and $|I| = |Q|$. Each argument $a \in A$ is an ordered sequence of one or more integer values. Each process $P \in Q$ is an ordered sequence of one or more instructions. $I_P \in I$ is an integer value denoting the index of procedure $P$.

Each core maintains a fixed-size jump table denoted jump, which records the location of each procedure in memory. Although procedure addresses may differ between cores, the indices are guaranteed to be consistent. This allows relative branches to be expressed in terms of an index which is locally referenced at execution.

Each node in the system is initialised with a minimal binary containing the process creation kernel. The complete program is loaded on node 0, from where parts of it can be copied onto other nodes to be executed.

2.1. Protocol

The process creation mechanism is implemented as a point-to-point protocol between a guest core and a host core. Any running thread is able to spawn the execution of a process on any other core. It consists of the following four phases.

2.1.1. Connection Initialisation

A guest initiates a connection by sending a single byte control token and a word identifying itself. It waits for an acknowledgment from the host indicating that a host thread has been allocated and the connection is properly established. A core may host multiple guest computations, each on a different thread.

2.1.2. Transmission of Closure

$C(P)$ is transmitted in three parts. Firstly, a header is sent containing $|A|$ and $|Q|$. Secondly, each $a \in A$ is sent with a single word header denoting the type of the argument. For referenced arrays, this is followed by length($a$) and the values contained. The host writes these directly into heap-allocated space and the argument value is set to this address. Single-value variables are treated similarly and constant values can be copied directly into the argument value. Lastly, each $P \in Q$ is sent with a two word header denoting $I_P$ and length($P$) in bytes. The host allocates space on the heap and receives the instructions of $P$ from the guest, read from memory in word-chunks from jump[$I_P$] to jump[$I_P$] + length($P$). On completion, the host sets jump[$I_P$] to the address of $P$ on the heap.

2.1.3. Execution/Wait for Completion

Once $C$ has been successfully transmitted, the host initialises the thread's registers and stack with the arguments of $P$ and initiates execution. The connection is left open and the guest thread waits for the host to indicate that $P$ has halted.

2.1.4. Transmission of Results and Teardown

Once $P$ has halted, all referenced array and variable arguments contained in $C$ (now the results) are transmitted back to the guest. The guest writes them back directly to their original locations. Once this has been completed, the connection is terminated. The guest continues execution and the host thread frees the memory allocated to the closure and yields.

2.2. Performance Model

The runtime cost of this mechanism is captured in the following way:

Definition 3. The runtime of process creation $T_c$ is a function of the total size of the argument values $n$, procedure descriptions $m$ and the results $o$ and is given by

$$T_c(n, m, o) = (C_i + C_w n + C_w m + C_w o)\, C_l$$

where $C_i$ and $C_w$ are constants relating to initialisation and termination, and to the overhead per (word) value transmitted, respectively. The value $n$ is inclusive of the size of referenced arrays and hence $o \le n$. As all communication is synchronised, $C_l$ is a constant factor overhead relating to the latency of the path between the guest and host processors. Normalising $C_l = 1$ to a single hop off-chip, the per-word overhead $C_w$ was measured as 50ns. The initialisation overhead $C_i$ is dependent on the size of the closure.

3. Demonstration and Evaluation

The aim of this section is to demonstrate the use of process creation combined with parallel recursion, and to evaluate the performance of the design and its implementation in realising efficient growth. To do this, we develop performance models to combine with experimental results, allowing us to extrapolate to larger systems and inputs. We start with a simple algorithm to demonstrate the fast distribution of parallel computations and then show how this can be applied to a practical problem.
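
Before turning to the algorithms, the closure of Definition 2 and the cost model of Definition 3 can be made concrete with a small sketch. It is not the authors' implementation: the class layout, the `by_ref` flag marking which arguments are returned as results, and the rounding of procedure sizes up to whole words are all assumptions made for illustration, and the constants passed to `creation_time` below are arbitrary rather than the measured values.

```python
from dataclasses import dataclass

WORD_BYTES = 4  # a word is taken to be 4 bytes (Section 1.4)

@dataclass
class Closure:
    """C(P) = (A, I, Q): arguments, procedure indices and procedures."""
    args: list      # A: each argument is a list of integer words
    by_ref: list    # hypothetical flag per argument: True if it is sent back as a result
    indices: list   # I: jump-table index I_P of each procedure
    procs: list     # Q: each procedure as a bytes object holding its instructions

    def __post_init__(self):
        assert len(self.indices) == len(self.procs)   # |I| = |Q|
        assert len(self.args) == len(self.by_ref)

    def sizes(self):
        """Return (n, m, o) in words, as used by Definition 3."""
        n = sum(len(a) for a in self.args)
        m = sum(-(-len(q) // WORD_BYTES) for q in self.procs)  # bytes rounded up to words
        o = sum(len(a) for a, ref in zip(self.args, self.by_ref) if ref)
        return n, m, o

def creation_time(n, m, o, c_i, c_w, c_l=1.0):
    """T_c(n, m, o) = (C_i + C_w n + C_w m + C_w o) C_l."""
    return (c_i + c_w * (n + m + o)) * c_l

# One referenced 4-word array, one constant, and one 10-byte procedure.
closure = Closure(args=[[1, 2, 3, 4], [7]], by_ref=[True, False],
                  indices=[3], procs=[bytes(10)])
n, m, o = closure.sizes()
cost = creation_time(n, m, o, c_i=100.0, c_w=2.0)  # arbitrary illustrative constants
```

Because results are a subset of the arguments, `sizes` always yields `o <= n`, matching the remark in Definition 3.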

    proc distribute (t, n) is
      if n = 1 then node (t)
      else { distribute (t, n/2) |
             on t + n/2 do distribute (t + n/2, n/2) }

Figure 1. A recursive process distribute to rapidly distribute another process node over a set of processors.

3.1. Rapid Process Distribution

The algorithm distribute given in Figure 1 is inspired by [1] and works by spawning a new copy of itself on a remote processor each time it recurses. Each process then itself recurses, continuing this behaviour and hence, each level of the recursion subdivides the set of processors in half, resulting in a doubling of the capacity to initiate computations. This growth follows the structure of a binary tree. When each instance of distribute executes with $n = 1$, the node process is executed and the recursion halted. The parameter $t$ indicates the node identifier and the algorithm is executed from node 0 with $t = 0$ and $n = p$.

3.1.1. Runtime

The hypercube interconnection topology of the XK-XMP-64 provides an optimal transport in terms of hop distance between remote creations; this is established by the following theorem.

Theorem 1. Every copy of distribute is always created on a neighbouring node when executed on a hypercube.

Proof. Let $H = (N, E)$ be a $d$-dimensional hypercube. When distribute is executed with $t = 0$ and $n = |N|$, starting at node 0 on $H$, the recursion follows the structure of a binary tree of depth $d = \log |N|$, where identifiers at level $i$ are multiples of $|N|/2^i$. A node at depth $i$ with identifier $k|N|/2^i$ creates a new remote child node $c$ with identifier $k|N|/2^i + |N|/2^{i+1}$. As $|N| = 2^d$, $c = k2^{d-i} + 2^{d-i-1}$ and hence $k2^{d-i} \oplus c = 2^{d-i-1}$. Parent and child therefore differ in exactly one bit and, by Definition 1, are neighbours.

Given that $m$ and $n$ are fixed, that $o = 0$ (there are no results) and that, from Theorem 1, we can normalise $C_l$ to 1, the runtime $T_c(n, m, o)$ of the on statement in distribute is $\Theta(1)$, which we define as the initialisation overhead $C_j$. Using this, we can express the parallel runtime of distribute $T_d$ on $p$ processors.
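
Theorem 1 can also be checked mechanically for a given system size. The sketch below mirrors the recursion of Figure 1 sequentially, recording a (parent, child) pair wherever the original would execute an on statement, and then tests each pair against the edge condition of Definition 1. The function names and the `spawns` list are introduced here for illustration only.

```python
def distribute(t, n, spawns):
    """Sequential mirror of Figure 1; records (parent, child) for every `on`."""
    if n == 1:
        return  # node(t) would execute here
    spawns.append((t, t + n // 2))          # on t + n/2 do distribute(t + n/2, n/2)
    distribute(t, n // 2, spawns)           # local branch
    distribute(t + n // 2, n // 2, spawns)  # branch that would run remotely

def is_hypercube_edge(a, b):
    """True if a XOR b is a power of two, i.e. a and b differ in exactly one bit."""
    x = a ^ b
    return x != 0 and x & (x - 1) == 0

def check(p):
    """Run distribute over p processors and return all remote creations."""
    spawns = []
    distribute(0, p, spawns)
    return spawns

spawns = check(64)
all_neighbours = all(is_hypercube_edge(parent, child) for parent, child in spawns)
```

For 64 processors this records 63 creations, one for each node other than node 0, and `all_neighbours` is true.
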
In each step, the number of active processes doubles, but we count the runtime at each level of recursion, which terminates when $n/2^i = 1$ or $i = \log n$. Hence,

$$T_d(p) = \sum_{i=1}^{\log p} (T_c + C_o) = (C_j + C_o) \log p \qquad (1)$$

where $C_o$ is the sequential overhead at each level. $C_j$ was measured as 18.4µs and $C_o$ was measured as 60ns.

3.1.2. Results

Figure 2a gives the predicted and measured execution time of distribute as a function of the number of processors. The prediction given by Equation 1 almost exactly matches the measured runtime. Figure 2b shows the inaccuracy between the measured and predicted results more clearly, by giving the measured execution time for each level in the recursion, that is, the difference between consecutive points in Figure 2a. It shows that the assumption made based on Theorem 1 does not hold exactly and that the first two levels take fractionally less time than the last four levels (3.85µs). This is due to the reduced on-chip communication costs. Overall though, each level of recursion completes on average in 18.9µs and it takes only 114.60µs to populate all 64 processors. Moreover, using the performance model given by $T_d$, we can extrapolate to larger $p$ than is possible to measure with the current platform. For example, when $p = 1024$, $T_d(1024) = 190$µs.

Figure 2. Measured execution time of distribute over varying numbers of processors. (a) Measured vs. predicted execution time $T_d$ (time in µs). (b) Execution times for each level of recursion of distribute (time in µs against level); this clearly shows the inter- vs. intra-chip latencies.

3.1.3. Remarks

By using the performance model to make predictions, we have assumed a hypercube topology and efficient support for concurrency. Although other architectures and larger systems cannot make such provisions, the model and results provide a reasonable lower bound on execution time with respect to the approach described.

The hypercube has rich communication properties and supports exponential growth, but it does not scale well due to the number of connections at each node and the length of wires in realistic packagings. Although distribute has optimal single-hop behaviour and we obtain peak performance, it is well known that efficient embeddings of binary trees into lower-degree networks such as meshes and tori exist [14], allowing reasonable dispersion. In this case, the granularity of process creation would have to be chosen to match the capabilities of the architecture.

Provision of efficient ISA-level operations for processes and communications allows fine-grained performance, particularly in terms of short messages.
Many current architectures do not support these operations at such a low level and cannot exploit the full potential of this approach, although again it generalises at a coarser granularity of message size to match the relative performance of these operations.

3.2. Mergesort

Mergesort is a well known sorting algorithm [15] that works by recursively halving a list of unsorted numbers until unit sub-lists are obtained. These are then successively merged together such that each merging step produces a sorted sub-list, which can be performed in time $\Theta(n)$ for sub-lists of size $n/2$. Figure 3a gives the sequential mergesort algorithm seq-msort.

Mergesort's branching recursive structure matches that of distribute, allowing us to combine them to obtain a parallel version. Instead of sequentially evaluating the recursive calls, conditional on some threshold value $C_{th}$, a local recursive call is made in parallel with the second call, which is migrated to a remote core. This threshold is used to control the extent to which the computation is distributed. In each of the experiments, for an input of size $n = 2^k$ and available processors $p = 2^d$, the threshold is set as $C_{th} = 2^k/p$.

    proc seq-msort (A) is
      if |A| > 1 then
      { a aliases A[0 ... |A|/2 - 1] ;
        b aliases A[|A|/2 ... |A| - 1] ;
        seq-msort (a) ;
        seq-msort (b) ;
        merge (A, a, b) }

    (a)

    proc par-msort (t, n, A) is
      if |A| > 1 then
      { a aliases A[0 ... |A|/2 - 1] ;
        b aliases A[|A|/2 ... |A| - 1] ;
        if |A| > C_th then
        { par-msort (t, n/2, a) |
          on t + n/2 do par-msort (t + n/2, n/2, b) }
        else
        { par-msort (t, n/2, a) ;
          par-msort (t + n/2, n/2, b) } ;
        merge (A, a, b) }

    (b)

Figure 3. Sequential and parallel mergesort processes.

The approach taken in distribute is used to control the placements of each of the sub-computations. Initially, the problem is split in half; this will have the greatest benefit to the execution time. Depending on the problem size, further remote branchings of the problem may not be economical, and the remaining steps should be evaluated locally, in sequence. In this case, the algorithm simply reduces to seq-msort. This parallel formulation of mergesort is essentially just distribute with additional work and communication overhead, but it will allow us to quantify the relative costs of process creation more concretely.

The parallel implementation of mergesort, par-msort, is given in Figure 3b. It uses the same sequential merge procedure, and the parameters $t$ and $n$ control the placement of processes in the same way as they were used with distribute. We can now analyse the performance and behaviour of par-msort and the process creation mechanism by looking at the parallel runtime.

3.2.1. Runtime

We first define the runtime of the sequential components of par-msort. This includes the sequential merging and sorting procedures.
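
For readers who want to experiment with these procedures, the following Python sketch emulates Figure 3 on a single machine. It is an approximation under stated assumptions rather than the XS1 implementation: a thread stands in for the on statement, lists are copied instead of aliased, the `placement` list is a hypothetical addition that records which node identifier each remote branch would have been sent to, the sequential branch calls `seq_msort` directly (the text notes that the algorithm reduces to it), and the guard `n > 1` is a defensive addition for inputs that are not powers of two.

```python
import threading

def merge(a, b):
    """Merge two sorted lists into a new sorted list in linear time."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def seq_msort(A):
    """Sequential mergesort, as in Figure 3a (copies rather than aliases)."""
    if len(A) <= 1:
        return list(A)
    mid = len(A) // 2
    return merge(seq_msort(A[:mid]), seq_msort(A[mid:]))

def par_msort(t, n, A, c_th, placement):
    """Emulation of Figure 3b: a thread plays the role of `on t + n/2 do ...`."""
    if len(A) <= 1:
        return list(A)
    mid = len(A) // 2
    a, b = A[:mid], A[mid:]
    if len(A) > c_th and n > 1:
        result = {}
        def remote():
            result["b"] = par_msort(t + n // 2, n // 2, b, c_th, placement)
        placement.append(t + n // 2)          # node that would host the closure
        worker = threading.Thread(target=remote)
        worker.start()                        # offloaded branch
        sorted_a = par_msort(t, n // 2, a, c_th, placement)  # local branch
        worker.join()                         # synchronous fork-join
        return merge(sorted_a, result["b"])
    return merge(seq_msort(a), seq_msort(b))  # below threshold: sort locally

data = [9, 4, 7, 1, 8, 2, 6, 3, 15, 10, 13, 0, 5, 12, 14, 11]
p = 4
placement = []
result = par_msort(0, p, data, c_th=len(data) // p, placement=placement)
```

With 16 elements and $p = 4$ the threshold is 4, so the input is split remotely twice and three remote branches are created, on nodes 1, 2 and 3.
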
The runtime $T_m$ of merge is linear and is defined as

$$T_m(n) = C_a n + C_b$$

for constants $C_a, C_b > 0$, relating to the per-word and per-merge overheads respectively. These were measured as $C_a = 90$ns and $C_b = 830$ns. The runtime $T_s(n, 1)$ of seq-msort is expressed as a recurrence:

$$T_s(n, 1) = 2T_s\left(\frac{n}{2}, 1\right) + T_m(n) \qquad (2)$$

which has the solution

$$T_s(n, 1) = n(C_c \log n + C_d) \qquad (3)$$

for constants $C_c, C_d > 0$. These were measured as $C_c = 200$ns and $C_d = 200$ns.

Based on this we can express the runtime of par-msort as the combination of the costs of creating new processes, moving data, merging and sorting sequentially. The key component of this is the cost $T_c$, relating to the on statement in the parallel formulation, which is defined as

$$T_c(n) = C_i + 2C_w n.$$

This is because we can normalise $C_l$ to 1 (due to Theorem 1), the size of the procedures sent is constant and the number of arguments and results are both $n$. The initialisation overhead $C_i$ was measured as 28µs, larger than that for distribute as the closure contains the descriptions of merge and par-msort.

For the parallel runtime, the base sequential case is given by Equation 2. With two processors, the work and execution time can be split in half at the cost of migrating the procedures and data:

$$T_s(n, 2) = T_c\left(\frac{n}{2}\right) + T_s\left(\frac{n}{2}, 1\right) + T_m(n).$$

With four processors, the work is split in half at a cost of $T_c(n/2)$ and then in quarters at a cost of $T_c(n/4)$. After the data has been sequentially sorted in time $T_s(n/4, 1)$ it must be merged at the two children of the master node in time $T_m(n/2)$, and then again at the master in time $T_m(n)$:

$$T_s(n, 4) = T_c\left(\frac{n}{2}\right) + T_c\left(\frac{n}{4}\right) + T_m\left(\frac{n}{2}\right) + T_m(n) + T_s\left(\frac{n}{4}, 1\right).$$

Hence in general, we have:

$$T_s(n, p) = \sum_{i=1}^{\log p} \left( T_c\left(\frac{n}{2^i}\right) + T_m\left(\frac{n}{2^{i-1}}\right) \right) + T_s\left(\frac{n}{p}, 1\right)$$

for $n \ge p$, as each leaf sub-process of the sorting computation must operate on at least one data item. We can then express this precisely by substituting our definitions for $T_s$, $T_c$ and $T_m$ and simplifying:

$$T_s(n, p) = C_w \frac{2n}{p}(p - 1) + C_i \log p + C_a \frac{2n}{p}(p - 1) + C_b \log p + \frac{n}{p}\left(C_c \log \frac{n}{p} + C_d\right)$$

$$= \frac{2n}{p}(p - 1)(C_w + C_a) + (C_i + C_b) \log p + \frac{n}{p}\left(C_c \log \frac{n}{p} + C_d\right)$$

For $p = 1$, this reduces to Equation 3. This definition allows us to express a lower bound and a minimum for the runtime.

3.2.2. Lower Bound

We can give a lower bound $T_s^m$ on the parallel runtime $T_s(n, p)$ such that $\forall n,\ T_s(n, p) \ge T_s^m$. This is obtained by considering the parallel overhead, that is, the cost of distributing the problem over the system.
In this case it relates to the cost of process creation, including moving processes and their data, the $T_c$ component of $T_s$:

$$T_s^m(n, p) = \sum_{k=1}^{\log p} T_c\left(\frac{n}{2^k}\right) = \sum_{k=1}^{\log p} \left( C_i + 2C_w \frac{n}{2^k} \right) \qquad (4)$$

$$= C_i \log p + C_w \frac{2n}{p}(p - 1). \qquad (5)$$

Equation 5 is then the sum of the costs of process creation and movement of input data. When $n = 0$, $T_s^m$ relates to Equation 1; this is the cost of transmitting and initiating just the computations over the system. For $n > 0$, this includes the cost of moving the data.
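
The runtime model can be evaluated numerically to confirm that the closed form agrees with the summation it was derived from. The sketch below is illustrative only: the `Constants` container and function names are introduced here, and the constant values are placeholders chosen for the example, not the measured values reported in the text.

```python
from math import log2
from typing import NamedTuple

class Constants(NamedTuple):
    """Model constants in microseconds; placeholder values, not measurements."""
    c_i: float = 30.0   # initialisation overhead of process creation
    c_w: float = 0.1    # per-word transmission overhead
    c_a: float = 0.1    # per-word merge overhead
    c_b: float = 1.0    # per-merge overhead
    c_c: float = 0.2    # sequential sort, n log n coefficient
    c_d: float = 0.2    # sequential sort, linear coefficient

def t_m(n, k):
    return k.c_a * n + k.c_b                   # T_m(n) = C_a n + C_b

def t_c(n, k):
    return k.c_i + 2 * k.c_w * n               # T_c(n) = C_i + 2 C_w n

def t_seq(n, k):
    return n * (k.c_c * log2(n) + k.c_d)       # Equation 3

def t_par_sum(n, p, k):
    """T_s(n, p) as the level-by-level summation."""
    total = t_seq(n / p, k)
    for i in range(1, int(log2(p)) + 1):
        total += t_c(n / 2**i, k) + t_m(n / 2**(i - 1), k)
    return total

def t_par_closed(n, p, k):
    """T_s(n, p) in closed form."""
    return ((2 * n / p) * (p - 1) * (k.c_w + k.c_a)
            + (k.c_i + k.c_b) * log2(p)
            + (n / p) * (k.c_c * log2(n / p) + k.c_d))

def t_lower(n, p, k):
    """Lower bound T_s^m(n, p) of Equation 5."""
    return k.c_i * log2(p) + k.c_w * (2 * n / p) * (p - 1)

k = Constants()
rows = [(p, t_par_sum(8192, p, k), t_par_closed(8192, p, k), t_lower(8192, p, k))
        for p in (1, 2, 4, 64, 1024)]
```

Each row holds the processor count, the two evaluations of $T_s$ (which agree up to floating-point rounding) and the lower bound, which never exceeds them because every remaining term in the model is non-negative.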

3.2.3. Minimum

Given an input of length $m \le n$ for some sub-computation of par-msort, creation of a remote branch is beneficial only when the cost of this is less than the local sequential case:

$$T_c\left(\frac{m}{2}\right) + T_s\left(\frac{m}{2}, 1\right) + T_m(m) < T_s(m, 1)$$

$$T_c\left(\frac{m}{2}\right) + T_s\left(\frac{m}{2}, 1\right) + T_m(m) < 2T_s\left(\frac{m}{2}, 1\right) + T_m(m)$$

$$T_c\left(\frac{m}{2}\right) < T_s\left(\frac{m}{2}, 1\right)$$

Hence, initiation of a remote sorting process for an array of length $n$ is beneficial only when $T_c(n) < T_s(n, 1)$. That is, the cost of remotely initiating a process to perform half the work and receiving the results is less than the cost of sequentially sorting $m/2$ elements. Therefore at the inflection point we have

$$T_c(n) = T_s(n, 1). \qquad (6)$$

3.2.4. Results

Figure 4 shows the measured execution time of par-msort as a function of the number of processors used for varying input sizes. Figure 4a shows just three small inputs. The smallest possible input is 256 bytes as the minimum size for any sub-computation is 1 word. The minimum execution time for this size is at $p = 4$ processors, when the array is subdivided twice into 64 byte sections. This is the point given by Equation 6 and indicates directly the total cost incurred in offloading a computation. For $p < 4$, the cost of sorting sequentially dominates the runtime, and for $p > 4$, the cost of creating new processes and transferring the array sections dominates the runtime. With the next input of size 512 bytes, the minimum moves to $p = 8$, where the array is again divided into 64 byte sections. This holds for each input size and in general gives us the minimum size for which creating a new process will further reduce the runtime.

The runtime lower bound $T_s^m(0, p)$ given by Equation 5 is also plotted on Figure 4a. This shows the small and sub-linear cost with respect to $p$ of the overheads incurred with the distribution and management of processes around the system.
Relative to $T_s(64, p)$ this constitutes most of the overall work performed, which is expected as the array is fully decomposed into unit sections. For larger sized inputs, as presented in Figure 4b, this cost becomes just a fraction of the total work performed.

Figure 4. Measured execution time of par-msort as a function of the number of processors (time in ms). (a) Log-linear plot for varying small inputs (256B, 512B and 1kB), highlighting the minimum execution time and the $T_s^m$ lower bound. (b) Log-log plot for larger inputs (256B to 32kB).

Figure 5 shows predicted execution times for par-msort for larger $p$ and $n$. Each plot contains the execution time $T_s$ as defined by Equation 4, and $T_s^m$ with and without the transfer of data. Figure 5a gives results for the smallest input size possible to sort on 1024 cores (4kB) and includes the measurements for $T_s^m(0, p)$ and $T_s$. It reiterates what was shown in Figure 4a and shows that beyond 64 cores, very little penalty is incurred to create up to 1024 sorting instances, with $T_s^m$ accounting for around 23% of the total runtime for larger systems. This is due to the exponential growth of the distribution mechanism. Figure 5b gives results for the largest measured input of 32kB, showing the same trends, where $T_s^m$ this time is around just 3% of the runtime between 64 and 1024 cores.

Figure 5c and Figure 5d present predictions made by the performance model for more realistic workloads of 10MB and 1GB respectively. Figure 5c shows that 10MB could be sorted sequentially in around 7s and in parallel in at least 0.6s. Figure 5d shows that 1GB could be sorted in just under 5m sequentially or at least 1m in parallel. What these results make clear is that the distribution of the input data dominates and bounds the runtime and that the distribution of data constituting the process descriptions is a negligible proportion of the overall runtime for reasonable workloads. The relatively small sequential workload $O((n/p) \log(n/p))$ of mergesort, which decays quickly as $p$ increases, emphasises the cost of data distribution. For heavier workloads, such as $O((n/p)^2)$, we would expect to see a much more dramatic reduction in execution time and the cost of data distribution still eventually to bound runtime, but then by a relatively fractional amount.

4. Conclusions

This paper presents the design, implementation, demonstration and evaluation of an efficient mechanism for dynamically creating computations in a distributed memory parallel computer. It has shown that a computation can be dispatched to a remote processor in just tens of microseconds, and when this mechanism is combined with recursion, it can be used to efficiently implement parallel growth.

The distribute algorithm demonstrates how an empty array of processors can be populated with a computation exponentially quickly. For 64 cores, it takes just 114.60µs and for 1024 cores this will be of the order of 190µs. The par-msort algorithm extends this by performing additional computational work and communication of data, which allowed us to obtain a clearer picture of the cost of process creation with respect to varying problem sizes.
As the cost of transferring and invoking remote computations is related primarily to the size of the closure, this cost grows slowly with system size and is independent of data. With a 10MB input, it represents around just 0.00% of the runtime.

The sorting results also highlight two important issues: the granularity at which it is possible to create new processes and the costs of data movement. They show that the computation can be subdivided to operate on chunks of just 64 bytes and performance still be improved. The cost of data movement is significant, relative to the small amount of work performed at each node; for more intensive tasks, these costs would diminish. However, these results assume a worst case, where all data originates from a single core. In other systems, this cost may be reduced by concurrent access through a parallel file system or from prior data distribution.

The XS1 architecture provides efficient support for concurrency and communications and the XK-XMP-64 provides an optimal transport for the described algorithms, so we expect our lightweight scheme to be fast, relative to the performance of other distributed systems.

Hence, the results provide a convincing proof-of-concept implementation, demonstrating the kind of performance that is possible and, with respect to the topology, establish a reasonable lower bound on the performance of the approach presented. The results generalise to more dynamic schemes where placements are not perfect and other larger architectures such as supercomputers, where interconnection topologies are less well connected and communication is less efficient. In these cases, the approach applies at a coarser granularity with larger problem sizes to match the relative performance.

Figure 5. Predicted performance of par-msort for larger n and up to 1024 processors. (a) n = 64 (256B), with measured results up to 64 cores. (b) n = 8192 (32kB), with measured results up to 64 cores. (c) 10MB input. (d) 1GB input. All plots are log-log, with time (ms) on the vertical axis.

5. Future Work

Having successfully designed and implemented a language and runtime allowing explicit process creation with the on statement, we will continue with our focus on the concept of growth in parallel programs and plan to extend the work in the following ways. Firstly, by looking at how placement of process closures can be determined automatically by the runtime, relieving the programmer of having to specify this. Secondly, by implementing the language and runtime with C and MPI to target a larger platform, which will provide a more scalable demonstration of the concepts and their generality. And lastly, by looking at generic optimisations that can be made to the process creation mechanism to improve overall performance and scalability.

More details about the current implementation are available online, http://

where news of future developments will also be published.

Acknowledgments

The authors would like to thank XMOS for their support, in particular from David May, Henk Muller and Richard Osborne.

References

[1] David May. The Transputer revisited. In Millennial Perspectives in Computer Science: Proceedings of the 1999 Oxford-Microsoft Symposium in Honour of Sir Tony Hoare. Palgrave Macmillan, 1999.
[2] David May. The XMOS XS1 Architecture. XMOS Ltd., October. http:// support/documentation.
[3] Asanovic, Bodik et al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS, EECS Department, University of California, Berkeley, Dec. http://
[4] Dongarra, J., Beckman, P. et al. International Exascale Software Project Roadmap. Technical Report UT-CS-10-654, University of Tennessee EECS Technical Report, May 2010. http://
[5] D. May. The Influence of VLSI Technology on Computer Architecture [and Discussion]. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 326(1591), 1988.
[6] Per Brinch Hansen. The nature of parallel programming. Natural and Artificial Parallel Computation, pages 31–46, 1990.
[7] MPI 2.0. Technical report, Message Passing Interface Forum, November. http://www.mpi-forum.org/docs/.
[8] B.L. Chamberlain, D. Callahan, and H.P. Zima. Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications, 21(3):291–312.
[9] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In OOPSLA '05: Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, New York, NY, USA. ACM.
[10] A. Patera. A spectral element method for fluid dynamics: Laminar flow in a channel expansion. Journal of Computational Physics, 54(3), June 1984.
[11] Bernard Gendron and Teodor Gabriel Crainic. Parallel branch-and-bound algorithms: Survey and synthesis. Operations Research, 42(6), 1994.
[12] Marsha J Berger and Joseph Oliger. Adaptive mesh refinement for hyperbolic partial differential equations. Journal of Computational Physics, 53(3):484–512, 1984.
[13] XMOS. XK-XMP-64 Hardware Manual. XMOS Ltd., February 2010. http:// support/documentation.
[14] F. Thomson Leighton. Introduction to parallel algorithms and architectures: array, trees, hypercubes. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.
[15] D. E. Knuth. The Art of Computer Programming, volume 3, Sorting and Searching, chapter 5.2.4, Sorting by Merging. Reading, MA: Addison-Wesley, 2nd edition, 1998.
