Fast Distributed Process Creation with the XMOS XS1 Architecture

Communicating Process Architectures 2011
P.H. Welch et al. (Eds.)
IOS Press, 2011
© 2011 The authors and IOS Press. All rights reserved.

James HANLON and Simon J. HOLLIS
Department of Computer Science, University of Bristol, UK.
{hanlon,

Abstract. The provision of mechanisms for processor allocation in current distributed parallel programming models is very limited. This makes difficult, or even prohibits, the expression of a large class of programs which require a run-time assessment of their required resources. This includes programs whose structure is irregular, composite or unbounded. Efficient allocation of processors requires a process creation mechanism able to initiate and terminate remote computations quickly. This paper presents the design, demonstration and analysis of an explicit mechanism to do this, implemented on the XMOS XS1 architecture, as a foundation for a more dynamic scheme. It shows that process creation can be made efficient so that it incurs only a fractional overhead of the total runtime and that it can be combined naturally with recursion to enable rapid distribution of computations over a system.

Keywords. distributed process creation, distributed runtime, dynamic task placement, parallel recursion

Introduction

An essential issue in the design of scalable, distributed parallel computers is the rate at which computations can be initiated, and results collected as they terminate [1]. This requires an efficient method of process creation capable of dispatching a program, and the data on which it operates, to a remote processor. This paper presents the design, implementation, demonstration and evaluation of a process creation mechanism for the XMOS XS1 architecture [2].

Parallelism is being employed on an increasingly large scale to improve performance of computer systems, particularly in high performance systems, but increasingly in other areas such as embedded computing [3].
As current programming models such as MPI (Message Passing Interface) provide limited support for automated management of processing resources, the burden of doing this mainly falls on the programmer. These issues are not relevant to the expression of a program as, in general, a programmer is concerned only with introducing parallelism (execution on multiple processors) to improve performance, and not with how the computation is scheduled on the underlying system. When we consider that future high performance systems will run on the order of $10^9$ threads [4], it is clear that the programming model must provide some means of dynamic processor allocation to remove this burden. This is the situation we have with memory in sequential systems, where allocation and deallocation is performed with varying degrees of automaticity.

This observation is not new [5,6], but it is only as existing programming models and software struggle to meet the increasing scale of parallelism that the problem is again coming to light. For instance, capabilities for process creation and management were introduced in the MPI-2.0 specification, stating that:

    Reasons for including process management in MPI are both technical and practical. Important classes of message-passing applications require this control. These include task farms, serial applications with parallel modules and problems that require a run-time assessment of the number and type of processes that should be started [7].

Several MPI implementations support process creation and management functionality, but it is pitched as an advanced feature that is difficult to use and problematic with many current job-scheduling systems.

More encouragingly, language-level abstractions for dynamic process creation and placement have appeared recently in Chapel [8] and X10 [9], which are being developed by Cray and IBM respectively as part of DARPA's High Productivity Computing Systems program. Both support these concepts as key ingredients in the design of parallel programs, but they are built on software communication libraries and statically-mapped program binaries. Consequently, they are subject to the same communication inefficiencies and inflexibility of single-program approaches.

A run-time assessment of required processing resources concerns a large class of programs whose structure is irregular, such as unstructured-grid algorithms like the Spectral Element Method [10]; unbounded, such as recursively-structured algorithms like Branch-and-Bound search [11] and Adaptive Mesh Refinement [12]; or composite, where a program may be composed of different parallel subroutines that are themselves executed in parallel, possibly each with its own structure. These all require a means of dynamic processor allocation that is able to distribute computations over a set of processors, depending on requirements determined at runtime. The combination of parallelism and recursion is a powerful mechanism for growth which can be used to implement distribution efficiently. This must be supported with a mechanism for process creation with the ability to dispatch, initiate and terminate computations efficiently on remote processors.
This paper presents the design and implementation of an explicit scheme for dynamic process creation in a distributed memory parallel computer. This work is intended to be a key building block for a more automatic scheme. The implementation is on the XMOS XS1 architecture, which has low-level provisions for concurrency, allowing a convincing proof-of-concept implementation. Based on this, the process creation mechanism is evaluated by combining it with controlled recursion in two simple algorithms to demonstrate the rate and granularity at which it is possible to create remote computations. Performance models are developed in each case to interpret the measured results and to make predictions for larger systems and workloads. This analysis highlights the efficiency, scalability and effectiveness of the concept and approach taken.

The rest of this paper is structured as follows. Section 1 describes the XS1 architecture, the experimental platform and the notations and conventions used. Section 2 gives a brief overview of the design and implementation details. Section 3 presents the performance models and experimental and predicted results. Finally, Section 4 concludes and Section 5 discusses possible future extensions to the work.

1. Background

1.1. Platform

The XMOS XS1 processor architecture [2] is general-purpose, multi-threaded, scalable and has been designed from the ground up to support concurrency. It allows systems to be constructed from multiple XCore processors which communicate with each other through fast communication links. The key novel aspect of this architecture with respect to the work in this paper is the instruction set support for processes and communication. Low-level threading and communication are key features, exposed with operations, for example, to provide synchronous and asynchronous fork-join thread-level parallelism and channel-based message passing communication. Provision of these features in hardware allows them to be performed in the same order of magnitude of time as memory references, branches and arithmetic. This allows efficient high-level notations for concurrency to be effectively built.

The system used to demonstrate and evaluate the proposed process creation mechanism is an experimental board called the XK-XMP-64 [13]. It connects together 64 XCore processors in 16 XS1-G4 devices which run at 400MHz. The G4 devices are interconnected in a 4-dimensional hypercube which equivalently can be viewed as a 2-dimensional torus. Mathematically, this is defined in the following way [14]:

Definition 1. A $d$-dimensional hypercube is a graph $G = (N, E)$ where $N$ is the set of $2^d$ nodes and $E$ is the set of edges. Each node is labeled with a $d$-bit identifier. For any $m, n \in N$, an edge exists between $m$ and $n$ if and only if $m \oplus n = 2^k$ for $0 \le k < d$, where $\oplus$ is the bitwise exclusive-or operator. Hence, each node has $d = \log |N|$ edges and $|E| = d\,2^{d-1}$.

Each core in the G4 package has a private 64kB memory and is interconnected via internal links to an integrated switch. It is convenient to view the whole system as a 6-dimensional hypercube. As each core can run 8 hardware threads, the system is capable of 512-way concurrency with an aggregate 25.6 GIPS performance.

1.2. Notation

For presentation of the algorithms in this paper, a simple imperative, block-structured notation is used. The following points describe the non-standard elements that appear in the examples.

1.2.1. Sequential and Parallel Composition

A set of instructions that are to be executed in sequence are composed with the ; separator. A sequence of instructions comprises a process. For example, the block { I_1 ; I_2 ; I_3 } defines a simple process to perform three instructions, I_1, I_2 and I_3, in sequence. Processes may be executed in parallel by composition within a block with the | separator. Execution of a parallel block initiates the execution of the constituent processes simultaneously.
The parallel block successfully terminates only when all processes have successfully terminated. This is referred to as synchronous fork-join parallelism. For example, the block declaration { P_1 | P_2 | P_3 } denotes the parallel execution of three processes P_1, P_2 and P_3.

1.2.2. Aliasing

The aliases statement is used to create new references to sub-sections of an array. For example, the statement A aliases B[i...j] sets A to refer to the sub-section of B in the index range i to j.
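The two notational devices above have rough analogues in most languages. The following Python sketch is an illustration only and is not part of the paper's language or runtime: the helper name `par` is hypothetical, threads stand in for hardware processes, and a `memoryview` slice stands in for the aliases statement, on the assumption that the index range i to j is inclusive.

```python
from concurrent.futures import ThreadPoolExecutor


def par(*processes):
    """Run zero-argument callables as one parallel block.

    The call returns only after every process has terminated, which is the
    synchronous fork-join behaviour described in the text.
    """
    with ThreadPoolExecutor(max_workers=len(processes)) as pool:
        futures = [pool.submit(process) for process in processes]
        # result() re-raises any exception, so a failed process fails the block.
        return [future.result() for future in futures]


# Aliasing: a memoryview slice shares storage with the array it was taken from.
B = bytearray(range(8))
A = memoryview(B)[2:5]   # roughly "A aliases B[2...4]"
A[0] = 99                # writes through to B[2]

# A parallel block of three trivial processes.
log = []
par(lambda: log.append("P1"), lambda: log.append("P2"), lambda: log.append("P3"))
```

The order in which the three processes append to `log` is not defined, which mirrors the fact that a parallel block imposes no ordering between its constituent processes.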

1.2.3. Process Creation

The on statement reveals explicitly to the programmer the process creation mechanism. The statement on p do P is semantically equivalent to executing a call to P, except that process P is transmitted to processor p, which then executes P and communicates back any results using channels, leaving the original processor free to perform other tasks. By composing on in parallel, we can exploit multi-threaded parallelism to offload work while executing another process. For example, the statement { P_1 | on p do P_2 } causes P_1 to be executed while P_2 is offloaded and executed on processor p.

1.3. Measurements

All timing measurements presented were made with hardware timers, which are accessible through the ISA and have 10ns resolution. Constant values were extrapolated from the measurements taken by fitting performance models to the data.

1.4. Conventions

All logarithms are to the base 2. $p$ is defined as the number of processors and is taken to be a positive power of two. A word is taken to be 4 bytes and is a unit of input in the performance models.

2. Implementation

The on statement causes the closure of a process P located at a guest processor to be sent to a remote host processor, the host to execute P and to send back any updated free variables of P stored at the guest. The execution of on is synchronous in this respect. The closure of a process P is a complete description of P allowing it to be executed independently and is defined in the following way:

Definition 2. The closure $C$ of a process $P$ consists of three elements: a set of arguments $A$, which represents the complete variable context of $P$ as we don't consider global variables, a set of procedure indices $I$ and a set of procedures $Q$:

$$C(P) = (A, I, Q)$$

where $|A| \ge 0$ and $|I| = |Q|$. Each argument $a \in A$ is an ordered sequence of one or more integer values. Each process $P \in Q$ is an ordered sequence of one or more instructions. $I_P$ is an integer value denoting the index of procedure $P$.

Each core maintains a fixed-size jump table denoted jump, which records the location of each procedure in memory. Although the address of a procedure may differ between cores, its index is guaranteed to be the same on all of them. This allows relative branches to be expressed in terms of an index which is locally referenced at execution.

Each node in the system is initialised with a minimal binary containing the process creation kernel. The complete program is loaded on node 0, from where parts of it can be copied onto other nodes to be executed.

2.1. Protocol

The process creation mechanism is implemented as a point-to-point protocol between a guest core and a host core. Any running thread is able to spawn the execution of a process on any other core. It consists of the following four phases.

2.1.1. Connection Initialisation

A guest initiates a connection by sending a single byte control token and a word identifying itself. It waits for an acknowledgment from the host indicating a host thread has been allocated and the connection is properly established. A core may host multiple guest computations, each on a different thread.

2.1.2. Transmission of Closure

$C(P)$ is transmitted in three parts. Firstly, a header is sent containing $|A|$ and $|Q|$. Secondly, each $a \in A$ is sent with a single word header denoting the type of the argument. For referenced arrays, this is followed by length(a) and the values contained. The host writes these directly into heap-allocated space and the argument value is set to this address. Single-value variables are treated similarly and constant values can be copied directly into the argument value. Lastly, each $P \in Q$ is sent with a two word header denoting $I_P$ and length(P) in bytes. The host allocates space on the heap and receives the instructions of P from the guest, read from memory in word-chunks from jump[I_P] to jump[I_P] + length(P). On completion, the host sets jump[I_P] to the address of P on the heap.

2.1.3. Execution/Wait for Completion

Once $C$ has been successfully transmitted, the host initialises the thread's registers and stack with the arguments of P and initiates execution. The connection is left open and the guest thread waits for the host to indicate P has halted.

2.1.4. Transmission of Results and Teardown

Once P has halted, all referenced array and variable arguments contained in $C$ (now the results) are transmitted back to the guest. The guest writes them back directly to their original locations. Once this has been completed, the connection is terminated. The guest continues execution and the host thread frees the memory allocated to the closure and yields.

2.2. Performance Model

The runtime cost of this mechanism is captured in the following way:

Definition 3. The runtime of process creation $T_c$ is a function of the total size of the argument values $n$, procedure descriptions $m$ and the results $o$ and is given by

$$T_c(n, m, o) = (C_i + C_w n + C_w m + C_w o)\, C_l$$

where $C_i$ and $C_w$ are constants relating to initialisation and termination, and overhead per (word) value transmitted respectively. The value $n$ is inclusive of the size of referenced arrays and hence $o \le n$. As all communication is synchronised, $C_l$ is a constant factor overhead relating to the latency of the path between the guest and host processors. Normalising $C_l = 1$ to a single hop off-chip, the per-word overhead $C_w$ was measured as 50ns. The initialisation overhead $C_i$ is dependent on the size of the closure.

3. Demonstration and Evaluation

The aim of this section is to demonstrate the use of process creation combined with parallel recursion to evaluate the performance of the design and its implementation in realising efficient growth. To do this, we develop performance models to combine with experimental results, allowing us to extrapolate to larger systems and inputs. We start with a simple algorithm to demonstrate the fast distribution of parallel computations and then show how this can be applied to a practical problem.
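As a concrete reading of Definitions 2 and 3, the sketch below models a closure as a small Python data structure and evaluates the cost model on it. It is an illustration under stated assumptions rather than the runtime's actual representation: the names `Closure`, `result_args`, `sizes` and `creation_time` are hypothetical, protocol header words are ignored, and the constants in the usage lines are placeholders rather than measured values.

```python
from dataclasses import dataclass, field

WORD_BYTES = 4  # the paper takes a word to be 4 bytes


@dataclass
class Closure:
    """C(P) = (A, I, Q): arguments, procedure indices and procedure bodies."""
    args: list      # A: each argument is a sequence of integer words
    indices: list   # I: jump-table index of each procedure
    procs: list     # Q: each procedure body as a bytes object
    # Positions in args that are sent back as results (hypothetical field).
    result_args: list = field(default_factory=list)

    def __post_init__(self):
        if len(self.indices) != len(self.procs):
            raise ValueError("|I| must equal |Q|")

    def sizes(self):
        """Return (n, m, o) in words: argument, procedure and result sizes."""
        n = sum(len(a) for a in self.args)
        # Procedure lengths are in bytes; round each up to whole words.
        m = sum(-(-len(q) // WORD_BYTES) for q in self.procs)
        o = sum(len(self.args[i]) for i in self.result_args)
        return n, m, o


def creation_time(n, m, o, c_i, c_w, c_l=1.0):
    """T_c(n, m, o) = (C_i + C_w*n + C_w*m + C_w*o) * C_l."""
    return (c_i + c_w * (n + m + o)) * c_l


# One array argument of four words (returned as a result), one scalar,
# and a single ten-byte procedure body.
closure = Closure(args=[[1, 2, 3, 4], [7]], indices=[0],
                  procs=[bytes(10)], result_args=[0])
n, m, o = closure.sizes()
# Placeholder constants in microseconds, chosen only for illustration.
cost = creation_time(n, m, o, c_i=20.0, c_w=0.05)
```

Because only referenced arguments are returned, `o` can never exceed `n`, which is the constraint $o \le n$ stated in Definition 3.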

    proc distribute (t, n) is
      if n = 1 then node (t)
      else { distribute (t, n/2) | on t + n/2 do distribute (t + n/2, n/2) }

Figure 1. A recursive process distribute to rapidly distribute another process node over a set of processors.

3.1. Rapid Process Distribution

The algorithm distribute given in Figure 1 is inspired by [1] and works by spawning a new copy of itself on a remote processor each time it recurses. Each process then itself recurses, continuing this behaviour and hence, each level of the recursion subdivides the set of processors in half, resulting in a doubling of the capacity to initiate computations. This growth follows the structure of a binary tree. When each instance of distribute executes with $n = 1$, the node process is executed and the recursion halted. The parameter $t$ indicates the node identifier and the algorithm is executed from node 0 with $t = 0$ and $n = p$.

3.1.1. Runtime

The hypercube interconnection topology of the XK-XMP-64 provides an optimal transport in terms of hop distance between remote creations; this is established by the following theorem.

Theorem 1. Every copy of distribute is always created on a neighbouring node when executed on a hypercube.

Proof. Let $H = (N, E)$ be a $d$-dimensional hypercube. When distribute is executed with $t = 0$ and $n = |N|$, starting at node 0 on $H$, the recursion follows the structure of a binary tree of depth $d = \log |N|$, where identifiers at level $i$ are multiples of $|N|/2^i$. A node at depth $i$ with identifier $k|N|/2^i$ creates a new remote child node $c$ with identifier $k|N|/2^i + |N|/2^{i+1}$. As $|N| = 2^d$, $c = k2^{d-i} + 2^{d-i-1}$ and hence $k2^{d-i} \oplus c = 2^{d-i-1}$, so by Definition 1 the parent and child are joined by an edge of $H$.

Given that $m$ and $n$ are fixed, that $o = 0$ (there are no results) and that from Theorem 1 we can normalise $C_l$ to 1, the runtime $T_c(n, m, o)$ of the on statement in distribute is $\Theta(1)$, which we define as the initialisation overhead $C_j$. Using this, we can express the parallel runtime of distribute $T_d$ on $p$ processors.
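Theorem 1 can also be checked mechanically for a given system size. The Python sketch below traces the recursion of Figure 1 without creating any real processes, records every remote creation, and tests each parent and child pair against the edge condition of Definition 1. It also evaluates the per-level runtime expression $(C_j + C_o) \log p$ derived in this subsection. The function names and the returned trace format are assumptions made for this illustration.

```python
def distribute(t, n, level=0, creations=None, leaves=None):
    """Trace Figure 1, recording each remote creation as (parent, child, level)."""
    if creations is None:
        creations, leaves = [], []
    if n == 1:
        leaves.append(t)   # node(t) would execute here
    else:
        half = n // 2
        creations.append((t, t + half, level))   # "on t + n/2 do ..."
        distribute(t, half, level + 1, creations, leaves)
        distribute(t + half, half, level + 1, creations, leaves)
    return creations, leaves


def is_hypercube_edge(m, n):
    """Definition 1: m and n are neighbours iff m XOR n is a power of two."""
    x = m ^ n
    return x != 0 and x & (x - 1) == 0


def distribute_runtime(p, c_j, c_o):
    """T_d(p) = (C_j + C_o) * log2(p), for p a positive power of two."""
    return (c_j + c_o) * (p.bit_length() - 1)


creations, leaves = distribute(0, 64)
# Every creation lands on a hypercube neighbour and every node runs once.
assert all(is_hypercube_edge(parent, child) for parent, child, _ in creations)
assert sorted(leaves) == list(range(64))
depth = 1 + max(level for _, _, level in creations)
```

For 64 nodes the trace contains 63 creations spread over 6 levels, so the number of levels, not the number of creations, determines the runtime.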
In each step, the number of active processes doubles, but we count the runtime at each level of recursion, which terminates when $n/2^i = 1$, that is, when $i = \log n$. Hence,

$$T_d(p) = \sum_{i=1}^{\log p} (T_c + C_o) = (C_j + C_o) \log p \qquad (1)$$

where $C_o$ is the sequential overhead at each level. $C_j$ was measured as 18.4µs and $C_o$ was measured as 60ns.

3.1.2. Results

Figure 2a gives the predicted and measured execution time of distribute as a function of the number of processors. The measured runtime almost exactly matches the prediction given by Equation 1. Figure 2b shows the inaccuracy between the measured and predicted results more clearly, by giving the measured execution time for each level in the recursion, that is, the difference between consecutive points in Figure 2a. It shows that the assumption of uniform latency made on the basis of Theorem 1 does not hold exactly and that the first two levels take fractionally less time than the last four levels (3.85µs). This is due to the reduced on-chip communication costs.

Figure 2. Measured execution time of distribute over varying numbers of processors. (a) Measured vs. predicted execution time $T_d$ (µs). (b) Execution time (µs) for each level of recursion of distribute, which clearly shows the inter- vs. intra-chip latencies.

Overall though, each level of recursion completes on average in 18.9µs and it takes only 114.60µs to populate all 64 processors. Moreover, using the performance model given by $T_d$, we can extrapolate to larger $p$ than is possible to measure with the current platform. For example, when $p = 1024$, $T_d(1024) = 190$µs.

3.1.3. Remarks

By using the performance model to make predictions, we have assumed a hypercube topology and efficient support for concurrency. Although other architectures and larger systems cannot make such provisions, the model and results provide a reasonable lower bound on execution time with respect to the approach described.

The hypercube has rich communication properties and supports exponential growth, but it does not scale well due to the number of connections at each node and length of wires in realistic packagings. Although distribute has optimal single-hop behaviour and we obtain peak performance, it is well known that efficient embeddings of binary trees into lower-degree networks such as meshes and tori exist [14], allowing reasonable dispersion. In this case, the granularity of process creation would have to be chosen to match the capabilities of the architecture.

Provision of efficient ISA-level operations for processes and communications allows fine-grained performance, particularly in terms of short messages.
Many current architectures do not support these operations at such a low level and cannot exploit the full potential of this approach, although again it generalises at a coarser granularity of message size to match the relative performance of these operations.

3.2. Mergesort

Mergesort is a well known sorting algorithm [15] that works by recursively halving a list of unsorted numbers until unit sub-lists are obtained. These are then successively merged together such that each merging step produces a sorted sub-list, which can be performed in time $\Theta(n)$ for sub-lists of size $n/2$. Figure 3a gives the sequential mergesort algorithm seq-msort.

Mergesort's branching recursive structure matches that of distribute, allowing us to combine them to obtain a parallel version. Instead of sequentially evaluating the recursive calls, conditional on some threshold value $C_{th}$, a local recursive call is made in parallel with the second call, which is migrated to a remote core. This threshold is used to control the extent to which the computation is distributed. In each of the experiments for an input of size $2^k$ and available processors $p = 2^d$, the threshold is set as $2^k/p$.

(a)
    proc seq-msort (A) is
      if |A| > 1 then
      { a aliases A[0 ... |A|/2 - 1] ;
        b aliases A[|A|/2 ... |A| - 1] ;
        seq-msort (a) ;
        seq-msort (b) ;
        merge (A, a, b) }

(b)
    proc par-msort (t, n, A) is
      if |A| > 1 then
      { a aliases A[0 ... |A|/2 - 1] ;
        b aliases A[|A|/2 ... |A| - 1] ;
        if |A| > C_th
        then { par-msort (t, n/2, a) | on t + n/2 do par-msort (t + n/2, n/2, b) }
        else { par-msort (t, n/2, a) ; par-msort (t + n/2, n/2, b) } ;
        merge (A, a, b) }

Figure 3. Sequential and parallel mergesort processes.

The approach taken in distribute is used to control the placements of each of the sub-computations. Initially, the problem is split in half; this will have the greatest benefit to the execution time. Depending on the problem size, further remote branchings of the problem may not be economical, and the remaining steps should be evaluated locally, in sequence. In this case, the algorithm simply reduces to seq-msort. This parallel formulation of mergesort is essentially just distribute with additional work and communication overhead, but it will allow us to more concretely quantify the relative costs of process creation.

The parallel implementation of mergesort par-msort is given in Figure 3b. It uses the same sequential merge procedure and the parameters t and n control the placement of processes in the same way as they were used with distribute. We can now analyse the performance and behaviour of par-msort and the process creation mechanism by looking at the parallel runtime.

3.2.1. Runtime

We first define the runtime of the sequential components of par-msort. This includes the sequential merging and sorting procedures.
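Before the cost terms are defined, it may help to see the control structure of par-msort in executable form. The Python sketch below mirrors Figure 3b under several assumptions: a thread stands in for the on statement, so nothing is actually sent to another core; sub-arrays are copied rather than aliased; and the `placements` list, which records the node each remote call would have been sent to, is a hypothetical addition used only to make the placement visible.

```python
import threading


def merge(dst, a, b):
    """Merge sorted lists a and b into dst, which has len(a) + len(b) slots."""
    i = j = 0
    for k in range(len(dst)):
        if j >= len(b) or (i < len(a) and a[i] <= b[j]):
            dst[k] = a[i]
            i += 1
        else:
            dst[k] = b[j]
            j += 1


def par_msort(t, n, data, threshold, placements):
    """Sort data in place; record (node, length) for each would-be remote call."""
    if len(data) <= 1:
        return
    mid = len(data) // 2
    a, b = data[:mid], data[mid:]   # copies; the paper's notation aliases instead
    if len(data) > threshold:
        placements.append((t + n // 2, len(b)))
        # The thread stands in for "on t + n/2 do par-msort(t + n/2, n/2, b)".
        remote = threading.Thread(
            target=par_msort,
            args=(t + n // 2, n // 2, b, threshold, placements))
        remote.start()
        par_msort(t, n // 2, a, threshold, placements)
        remote.join()               # fork-join: wait for the offloaded half
    else:
        par_msort(t, n // 2, a, threshold, placements)
        par_msort(t + n // 2, n // 2, b, threshold, placements)
    merge(data, a, b)


data = [5, 3, 8, 1, 9, 2, 7, 4]
placements = []
# Four notional processors, with the threshold set to input size / processors.
par_msort(0, 4, data, threshold=len(data) // 4, placements=placements)
```

With eight elements and four notional processors the threshold is 2, so the array is split remotely twice and the remaining two-element sections are sorted sequentially, one on each of nodes 0 to 3.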
The runtime $T_m$ of merge is linear and is defined as

$$T_m(n) = C_a n + C_b$$

for constants $C_a, C_b > 0$, relating to the per-word and per-merge overheads respectively. These were measured as $C_a = 90$ns and $C_b = 830$ns. The runtime $T_s(n, 1)$ of seq-msort is expressed as a recurrence:

$$T_s(n, 1) = 2\,T_s\!\left(\frac{n}{2}, 1\right) + T_m(n) \qquad (2)$$

which has the solution

$$T_s(n, 1) = n\,(C_c \log n + C_d) \qquad (3)$$

for constants $C_c, C_d > 0$. These were measured as $C_c = 200$ns and $C_d = 200$ns.

Based on this we can express the runtime of par-msort as the combination of the costs of creating new processes, moving data, merging and sorting sequentially. The key component of this is the cost $T_c$, relating to the on statement in the parallel formulation, which is defined as

$$T_c(n) = C_i + 2 C_w n.$$

This is because we can normalise $C_l$ to 1 (due to Theorem 1), the size of the procedures sent is constant and the number of arguments and results are both $n$. The initialisation overhead $C_i$ was measured as 28µs, larger than that for distribute as the closure contains the descriptions of merge and par-msort.

For the parallel runtime, the base sequential case is given by Equation 2. With two processors, the work and execution time can be split in half at the cost of migrating the procedures and data:

$$T_s(n, 2) = T_c\!\left(\frac{n}{2}\right) + T_s\!\left(\frac{n}{2}, 1\right) + T_m(n).$$

With four processors, the work is split in half at a cost of $T_c(n/2)$ and then in quarters at a cost of $T_c(n/4)$. After the data has been sequentially sorted in time $T_s(n/4, 1)$ it must be merged at the two children of the master node in time $T_m(n/2)$, and then again at the master in time $T_m(n)$:

$$T_s(n, 4) = T_c\!\left(\frac{n}{2}\right) + T_c\!\left(\frac{n}{4}\right) + T_m\!\left(\frac{n}{2}\right) + T_m(n) + T_s\!\left(\frac{n}{4}, 1\right).$$

Hence in general, we have:

$$T_s(n, p) = \sum_{i=1}^{\log p} \left( T_c\!\left(\frac{n}{2^i}\right) + T_m\!\left(\frac{n}{2^{i-1}}\right) \right) + T_s\!\left(\frac{n}{p}, 1\right)$$

for $n \ge p$, as each leaf sub-process of the sorting computation must operate on at least one data item. We can then express this precisely by substituting our definitions for $T_s$, $T_c$ and $T_m$ and simplifying:

$$T_s(n, p) = C_w \frac{2n}{p}(p - 1) + C_i \log p + C_a \frac{2n}{p}(p - 1) + C_b \log p + \frac{n}{p}\left(C_c \log \frac{n}{p} + C_d\right)$$

$$\phantom{T_s(n, p)} = \frac{2n}{p}(p - 1)(C_w + C_a) + (C_i + C_b) \log p + \frac{n}{p}\left(C_c \log \frac{n}{p} + C_d\right)$$

For $p = 1$, this reduces to Equation 3. This definition allows us to express a lower bound and a minimum for the runtime.

3.2.2. Lower Bound

We can give a lower bound $T^m_s$ on the parallel runtime $T_s(n, p)$ such that $\forall n, p:\ T_s(n, p) \ge T^m_s(n, p)$. This is obtained by considering the parallel overhead, that is, the cost of distributing the problem over the system.
In this case it relates to the cost of process creation, including moving processes and their data, the $T_c$ component of $T_s$:

$$T^m_s(n, p) = \sum_{k=1}^{\log p} T_c\!\left(\frac{n}{2^k}\right) = \sum_{k=1}^{\log p} \left( C_i + 2 C_w \frac{n}{2^k} \right) \qquad (4)$$

$$\phantom{T^m_s(n, p)} = C_i \log p + C_w \frac{2n}{p}(p - 1). \qquad (5)$$

Equation 5 is then the sum of the costs of process creation and movement of input data. When $n = 0$, $T^m_s$ relates to Equation 1; this is the cost of transmitting and initiating just the computations over the system. For $n > 0$, this includes the cost of moving the data.
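The runtime model and its lower bound are easy to evaluate numerically, which also gives a check that the level-by-level sum and the simplified closed form agree. The Python sketch below is one way to do this and is not taken from the paper. The function names and the dictionary of constants are assumptions; the constant values are the figures quoted in the text, converted to microseconds and used purely for illustration.

```python
from math import isclose, log2


def t_c(n, k):
    """Process creation: T_c(n) = C_i + 2*C_w*n (arguments out, results back)."""
    return k["c_i"] + 2 * k["c_w"] * n


def t_m(n, k):
    """Merge: T_m(n) = C_a*n + C_b."""
    return k["c_a"] * n + k["c_b"]


def t_s_seq(n, k):
    """Sequential sort: T_s(n, 1) = n*(C_c*log n + C_d), for n >= 1."""
    return n * (k["c_c"] * log2(n) + k["c_d"])


def t_s_sum(n, p, k):
    """Parallel runtime as the sum over recursion levels, for n >= p."""
    levels = int(log2(p))
    total = sum(t_c(n / 2**i, k) + t_m(n / 2**(i - 1), k)
                for i in range(1, levels + 1))
    return total + t_s_seq(n / p, k)


def t_s_closed(n, p, k):
    """The simplified closed form of the same quantity."""
    spread = 2 * n * (p - 1) / p
    return (spread * (k["c_w"] + k["c_a"])
            + (k["c_i"] + k["c_b"]) * log2(p)
            + t_s_seq(n / p, k))


def lower_bound(n, p, k):
    """Equation 5: T^m_s(n, p) = C_i*log p + C_w*(2n/p)*(p - 1)."""
    return k["c_i"] * log2(p) + k["c_w"] * 2 * n * (p - 1) / p


# Illustrative constants in microseconds, as quoted in the text.
K = {"c_i": 28.0, "c_w": 0.05, "c_a": 0.09, "c_b": 0.83, "c_c": 0.2, "c_d": 0.2}

n, p = 8192, 64   # 32kB of 4-byte words on 64 processors
assert isclose(t_s_sum(n, p, K), t_s_closed(n, p, K))
assert lower_bound(n, p, K) <= t_s_closed(n, p, K)
```

Setting `n` to 0 in `lower_bound` leaves only the $C_i \log p$ term, the cost of creating the processes without moving any data.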

3.2.3. Minimum

Given an input of length $m \le n$ for some sub-computation of par-msort, creation of a remote branch is beneficial only when the cost of this is less than the local sequential case:

$$T_c\!\left(\frac{m}{2}\right) + T_s\!\left(\frac{m}{2}, 1\right) + T_m(m) < T_s(m, 1)$$

$$T_c\!\left(\frac{m}{2}\right) + T_s\!\left(\frac{m}{2}, 1\right) + T_m(m) < 2\,T_s\!\left(\frac{m}{2}, 1\right) + T_m(m)$$

$$T_c\!\left(\frac{m}{2}\right) < T_s\!\left(\frac{m}{2}, 1\right)$$

where the second line expands the right-hand side using Equation 2. Hence, initiation of a remote sorting process for a sub-array of length $n = m/2$ is beneficial only when $T_c(n) < T_s(n, 1)$. That is, the cost of remotely initiating a process to perform half the work and receiving the results is less than the cost of sequentially sorting $m/2$ elements. Therefore at the inflection point we have

$$T_c(n) = T_s(n, 1). \qquad (6)$$

3.2.4. Results

Figure 4 shows the measured execution time of par-msort as a function of the number of processors used for varying input sizes. Figure 4a shows just three small inputs. The smallest possible input is 256 bytes as the minimum size for any sub-computation is 1 word. The minimum execution time for this size is at $p = 4$ processors, when the array is subdivided twice into 64 byte sections. This is the point given by Equation 6 and indicates directly the total cost incurred in offloading a computation. For $p < 4$, the cost of sorting sequentially dominates the runtime, and for $p > 4$, the cost of creating new processes and transferring the array sections dominates the runtime. With the next input of size 512 bytes, the minimum moves to $p = 8$, where the array is again divided into 64 byte sections. This holds for each input size and in general gives us the minimum size for which creating a new process will further reduce the runtime.

The runtime lower bound $T^m_s(0, p)$ given by Equation 5 is also plotted on Figure 4a. This shows the small and sub-linear cost with respect to $p$ of the overheads incurred with the distribution and management of processes around the system.
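The break-even condition of Equation 6 can be located numerically once the constants are known. The Python sketch below searches power-of-two sub-array sizes for the first one at which $T_c(n) < T_s(n, 1)$. The function names are hypothetical and the constants in the usage lines are invented placeholders rather than the measured values, so the resulting size illustrates the method only.

```python
from math import log2


def remote_is_beneficial(n, c_i, c_w, c_c, c_d):
    """True when T_c(n) < T_s(n, 1) for a remote sub-array of n words."""
    t_c = c_i + 2 * c_w * n              # create the process, move data both ways
    t_s = n * (c_c * log2(n) + c_d)      # sort the same n words locally
    return t_c < t_s


def smallest_beneficial_size(c_i, c_w, c_c, c_d, limit=2**30):
    """Smallest power-of-two n (in words) worth offloading, or None if none."""
    n = 1
    while n <= limit:
        if remote_is_beneficial(n, c_i, c_w, c_c, c_d):
            return n
        n *= 2
    return None


# Placeholder constants in microseconds, chosen only to exercise the search.
break_even = smallest_beneficial_size(c_i=10.0, c_w=0.1, c_c=0.5, c_d=1.0)
```

With these placeholder values the search returns 8 words: at 4 words the creation cost of 10.8 still exceeds the sequential cost of 8, while at 8 words the creation cost of 11.6 falls below the sequential cost of 20.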
Relative to $T_s(64, p)$, the lower bound $T^m_s(0, p)$ constitutes most of the overall work performed, which is expected as the array is fully decomposed into unit sections. For larger sized inputs, as presented in Figure 4b, this cost becomes just a fraction of the total work performed.

Figure 4. Measured execution time (ms) of par-msort as a function of the number of processors. (a) Log-linear plot for varying small inputs, showing $T^m_s(0, p)$, $T_s(256\mathrm{B}, p)$, $T_s(512\mathrm{B}, p)$ and $T_s(1\mathrm{kB}, p)$; it highlights the minimum execution time and the $T^m_s$ lower bound. (b) Log-log plot for larger inputs, showing $T_s$ for inputs of 256B, 512B, 1kB, 2kB, 4kB, 8kB, 16kB and 32kB.

Figure 5 shows predicted execution times for par-msort for larger $p$ and $n$. Each plot contains the execution time $T_s$ as defined by Equation 4, and $T^m_s$ with and without the transfer of data. Figure 5a gives results for the smallest input size possible to sort on 1024 cores (4kB) and includes the measurements for $T^m_s(0, p)$ and $T_s$. It reiterates what was shown in Figure 4a and shows that beyond 64 cores, very little penalty is incurred to create up to 1024 sorting instances, with $T^m_s$ accounting for around 23% of the total runtime for larger systems. This is due to the exponential growth of the distribution mechanism. Figure 5b gives results for the largest measured input of 32kB, showing the same trends, where $T^m_s$ this time is around just 3% of the runtime between 64 and 1024 cores.

Figure 5c and Figure 5d present predictions made by the performance model for more realistic workloads of 10MB and 1GB respectively. Figure 5c shows that 10MB could be sorted sequentially in around 7s and in parallel in at least 0.6s. Figure 5d shows that 1GB could be sorted in just under 5m sequentially or at least 1m in parallel. What these results make clear is that the distribution of the input data dominates and bounds the runtime and that the distribution of data constituting the process descriptions is a negligible proportion of the overall runtime for reasonable workloads. The relatively small sequential workload $O((n/p) \log(n/p))$ of mergesort, which decays quickly as $p$ increases, emphasises the cost of data distribution. For heavier workloads, such as $O((n/p)^2)$, we would expect to see a much more dramatic reduction in execution time and the cost of data distribution still eventually to bound runtime, but then by a relatively fractional amount.

4. Conclusions

This paper presents the design, implementation, demonstration and evaluation of an efficient mechanism for dynamically creating computations in a distributed memory parallel computer. It has shown that a computation can be dispatched to a remote processor in just tens of microseconds, and when this mechanism is combined with recursion, it can be used to efficiently implement parallel growth.

The distribute algorithm demonstrates how an empty array of processors can be populated with a computation exponentially quickly. For 64 cores, it takes just 114.60µs and for 1024 cores this will be of the order of 190µs. The par-msort algorithm extends this by performing additional computational work and communication of data, which allowed us to obtain a clearer picture of the cost of process creation with respect to varying problem sizes.
As the cost of transferring and invoking remote computations is related primarily to the size of the closure, this cost grows slowly with system size and is independent of data. With a 10MB input, it represents around just 0.00% of the runtime.

The sorting results also highlight two important issues: the granularity at which it is possible to create new processes and the costs of data movement. They show that the computation can be subdivided to operate on just 64 byte chunks while still improving performance. The cost of data movement is significant, relative to the small amount of work performed at each node; for more intensive tasks, these costs would diminish. However, these results assume a worst case, where all data originates from a single core. In other systems, this cost may be reduced by concurrent access through a parallel file system or from prior data distribution.

The XS1 architecture provides efficient support for concurrency and communications and the XK-XMP-64 provides an optimal transport for the described algorithms, so we expect our lightweight scheme to be fast, relative to the performance of other distributed systems.

Figure 5. Predicted performance of par-msort for larger n and p up to 1024. All plots are log-log, with time in ms. (a) n = 64 (256B) with measured results up to 64 cores. (b) n = 8192 (32kB) with measured results up to 64 cores. (c) n = (0MB). (d) n = (GB).

Hence, the results provide a convincing proof-of-concept implementation, demonstrating the kind of performance that is possible and, with respect to the topology, establish a reasonable lower bound on the performance of the approach presented. The results generalise to more dynamic schemes where placements are not perfect and other larger architectures such as supercomputers, where interconnection topologies are less well connected and communication is less efficient. In these cases, the approach applies at a coarser granularity with larger problem sizes to match the relative performance.

5. Future Work

Having successfully designed and implemented a language and runtime allowing explicit process creation with the on statement, we will continue with our focus on the concept of growth in parallel programs and plan to extend the work in the following ways. Firstly, by looking at how placement of process closures can be determined automatically by the runtime, relieving the programmer of having to specify this. Secondly, by implementing the language and runtime with C and MPI to target a larger platform, which will provide a more scalable demonstration of the concepts and their generality. And lastly, by looking at generic optimisations that can be made to the process creation mechanism to improve overall performance and scalability. More details about the current implementation are available online, http://

where news of future developments will also be published.

Acknowledgments

The authors would like to thank XMOS for their support, in particular from David May, Henk Muller and Richard Osborne.

References

[1] David May. The Transputer revisited. In Millennial Perspectives in Computer Science: Proceedings of the 1999 Oxford-Microsoft Symposium in Honour of Sir Tony Hoare. Palgrave Macmillan, 1999.
[2] David May. The XMOS XS1 Architecture. XMOS Ltd., October. http:// support/documentation.
[3] Asanovic, Bodik et al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS, EECS Department, University of California, Berkeley, Dec. http://
[4] Dongarra, J., Beckman, P. et al. International Exascale Software Project Roadmap. Technical Report UT-CS-10-654, University of Tennessee EECS Technical Report, May 2010. http://
[5] D. May. The Influence of VLSI Technology on Computer Architecture [and Discussion]. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 326(1591), 1988.
[6] Per Brinch Hansen. The nature of parallel programming. Natural and Artificial Parallel Computation, pages 31–46, 1990.
[7] MPI 2.0. Technical report, Message Passing Interface Forum, November. http://www.mpi-forum.org/docs/.
[8] B.L. Chamberlain, D. Callahan, and H.P. Zima. Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications, 21(3):291–312.
[9] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In OOPSLA '05: Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, New York, NY, USA. ACM.
[10] A. Patera. A spectral element method for fluid dynamics: Laminar flow in a channel expansion. Journal of Computational Physics, 54(3), June 1984.
[11] Bernard Gendron and Teodor Gabriel Crainic. Parallel branch-and-bound algorithms: Survey and synthesis. Operations Research, 42(6), 1994.
[12] Marsha J Berger and Joseph Oliger. Adaptive mesh refinement for hyperbolic partial differential equations. Journal of Computational Physics, 53(3):484–512, 1984.
[13] XMOS. XK-XMP-64 Hardware Manual. XMOS Ltd., February 2010. http:// support/documentation.
[14] F. Thomson Leighton. Introduction to parallel algorithms and architectures: array, trees, hypercubes. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.
[15] D. E. Knuth. The Art of Computer Programming, volume 3, Sorting and Searching, chapter 5.2.4, Sorting by Merging. Reading, MA: Addison-Wesley, 2nd ed. edition, 1998.


More information

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. This document is downloaded from DR-NTU, Nanyang Technological University Library, Singaore. Title Automatic Robot Taing: Auto-Path Planning and Maniulation Author(s) Citation Yuan, Qilong; Lembono, Teguh

More information

A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing

A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Grah Processing Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto, Matei Rieanu Deartment of Electrical and Comuter Engineering, The University

More information

Modified Bloom filter for high performance hybrid NoSQL systems

Modified Bloom filter for high performance hybrid NoSQL systems odified Bloom filter for high erformance hybrid NoSQL systems A.B.Vavrenyuk, N.P.Vasilyev, V.V.akarov, K.A.atyukhin,..Rovnyagin, A.A.Skitev National Research Nuclear University EPhI (oscow Engineering

More information

Improving Trust Estimates in Planning Domains with Rare Failure Events

Improving Trust Estimates in Planning Domains with Rare Failure Events Imroving Trust Estimates in Planning Domains with Rare Failure Events Colin M. Potts and Kurt D. Krebsbach Det. of Mathematics and Comuter Science Lawrence University Aleton, Wisconsin 54911 USA {colin.m.otts,

More information

Resource Allocation for QoS Provisioning in Wireless Ad Hoc Networks

Resource Allocation for QoS Provisioning in Wireless Ad Hoc Networks Resource Allocation for QoS Provisioning in Wireless Ad Hoc Networks Mung Chiang, Daniel ONeill, David Julian andstehenboyd Electrical Engineering Deartment Stanford University, CA 94305, USA Abstract-

More information

Skip List Based Authenticated Data Structure in DAS Paradigm

Skip List Based Authenticated Data Structure in DAS Paradigm 009 Eighth International Conference on Grid and Cooerative Comuting Ski List Based Authenticated Data Structure in DAS Paradigm Jieing Wang,, Xiaoyong Du,. Key Laboratory of Data Engineering and Knowledge

More information

An empirical analysis of loopy belief propagation in three topologies: grids, small-world networks and random graphs

An empirical analysis of loopy belief propagation in three topologies: grids, small-world networks and random graphs An emirical analysis of looy belief roagation in three toologies: grids, small-world networks and random grahs R. Santana, A. Mendiburu and J. A. Lozano Intelligent Systems Grou Deartment of Comuter Science

More information

I ACCEPT NO RESPONSIBILITY FOR ERRORS ON THIS SHEET. I assume that E = (V ).

I ACCEPT NO RESPONSIBILITY FOR ERRORS ON THIS SHEET. I assume that E = (V ). 1 I ACCEPT NO RESPONSIBILITY FOR ERRORS ON THIS SHEET. I assume that E = (V ). Data structures Sorting Binary heas are imlemented using a hea-ordered balanced binary tree. Binomial heas use a collection

More information

Architecture description languages for programmable embedded systems

Architecture description languages for programmable embedded systems Architecture descrition languages for rogrammable embedded systems P. Mishra and N. Dutt Abstract: Embedded systems resent a tremendous oortunity to customise designs by exloiting the alication behaviour.

More information

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScrit Objects Shiyi Wei and Barbara G. Ryder Deartment of Comuter Science, Virginia Tech, Blacksburg, VA, USA. {wei,ryder}@cs.vt.edu

More information

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal International Journal of Information and Electronics Engineering, Vol. 1, No. 1, July 011 An Efficient VLSI Architecture for Adative Rank Order Filter for Image Noise Removal M. C Hanumantharaju, M. Ravishankar,

More information

GDP: Using Dataflow Properties to Accurately Estimate Interference-Free Performance at Runtime

GDP: Using Dataflow Properties to Accurately Estimate Interference-Free Performance at Runtime GDP: Using Dataflow Proerties to Accurately Estimate Interference-Free Performance at Runtime Magnus Jahre Deartment of Comuter Science Norwegian University of Science and Technology (NTNU) Email: magnus.jahre@ntnu.no

More information

PRO: a Model for Parallel Resource-Optimal Computation

PRO: a Model for Parallel Resource-Optimal Computation PRO: a Model for Parallel Resource-Otimal Comutation Assefaw Hadish Gebremedhin Isabelle Guérin Lassous Jens Gustedt Jan Arne Telle Abstract We resent a new arallel comutation model that enables the design

More information

Information Flow Based Event Distribution Middleware

Information Flow Based Event Distribution Middleware Information Flow Based Event Distribution Middleware Guruduth Banavar 1, Marc Kalan 1, Kelly Shaw 2, Robert E. Strom 1, Daniel C. Sturman 1, and Wei Tao 3 1 IBM T. J. Watson Research Center Hawthorne,

More information

Design Trade-offs in Customized On-chip Crossbar Schedulers

Design Trade-offs in Customized On-chip Crossbar Schedulers J Sign Process Syst () 8:9 8 DOI.7/s-8--x Design Trade-offs in Customized On-chi Crossbar Schedulers Jae Young Hur Stehan Wong Todor Stefanov Received: October 7 / Revised: June 8 / cceted: ugust 8 / Published

More information

A Metaheuristic Scheduler for Time Division Multiplexed Network-on-Chip

A Metaheuristic Scheduler for Time Division Multiplexed Network-on-Chip Downloaded from orbit.dtu.dk on: Jan 25, 2019 A Metaheuristic Scheduler for Time Division Multilexed Network-on-Chi Sørensen, Rasmus Bo; Sarsø, Jens; Pedersen, Mark Ruvald; Højgaard, Jasur Publication

More information

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH Jin Lu, José M. F. Moura, and Urs Niesen Deartment of Electrical and Comuter Engineering Carnegie Mellon University, Pittsburgh, PA 15213 jinlu, moura@ece.cmu.edu

More information

level 0 level 1 level 2 level 3

level 0 level 1 level 2 level 3 Communication-Ecient Deterministic Parallel Algorithms for Planar Point Location and 2d Voronoi Diagram? Mohamadou Diallo 1, Afonso Ferreira 2 and Andrew Rau-Chalin 3 1 LIMOS, IFMA, Camus des C zeaux,

More information

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level,

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level, [9] J. J. Dongarra, R. Hemel, A. J. G. Hey, and D. W. Walker, \A Proosal for a User-Level, Message Passing Interface in a Distributed-Memory Environment," Tech. Re. TM-3, Oak Ridge National Laboratory,

More information

Distributed Algorithms

Distributed Algorithms Course Outline With grateful acknowledgement to Christos Karamanolis for much of the material Jeff Magee & Jeff Kramer Models of distributed comuting Synchronous message-assing distributed systems Algorithms

More information

Building Better Nurse Scheduling Algorithms

Building Better Nurse Scheduling Algorithms Building Better Nurse Scheduling Algorithms Annals of Oerations Research, 128, 159-177, 2004. Dr Uwe Aickelin Dr Paul White School of Comuter Science University of the West of England University of Nottingham

More information