ON IMPLEMENTING THE FARM SKELETON

MICHAEL POLDNER and HERBERT KUCHEN
Department of Information Systems, University of Münster, Münster, Germany

ABSTRACT

Algorithmic skeletons intend to simplify parallel programming by providing a higher level of abstraction compared to the usual message passing. Task and data parallel skeletons can be distinguished. In the present paper, we will consider several approaches to implement one of the most classical task parallel skeletons, namely the farm, and compare them w.r.t. scalability, overhead, potential bottlenecks, and load balancing. We will also investigate several communication modes for the implementation of skeletons. Based on experimental results, the advantages and disadvantages of the different approaches are shown. Moreover, we will show how to terminate the system of processes properly.

Keywords: Parallel programming, algorithmic skeletons, farm, MPI.

1. Introduction

Today, parallel programming of MIMD machines with distributed memory is typically based on message passing. Owing to the availability of standard message passing libraries such as MPI [10] (we assume some familiarity with MPI and C++), the resulting software is platform independent and efficient. However, the programming level is still rather low, and programmers have to fight against low-level communication problems such as deadlocks. Moreover, the program is split into a set of processes which are assigned to the different processors. Like an ant, each process only has a local view of the overall activity. A global view of the overall computation only exists in the programmer's mind, and there is no way to express it more directly on this level.

Many approaches try to increase the level of parallel programming and to overcome the mentioned disadvantages. Here, we will focus on algorithmic skeletons, i.e. typical parallel-programming patterns which are efficiently implemented on the available parallel machine and usually offered to the user as higher-order functions, which get the details of the specific application problem as argument functions (see e.g. [3,4,19]). The skeletal parallelism homepage [6] contains links to virtually all groups and projects working on skeletons. In our framework, a parallel computation consists of a sequence of calls to skeletons.

Several implementations of algorithmic skeletons are available. They differ in the kind of host language used and in the particular set of skeletons offered. Since higher-order functions are taken from functional languages, many approaches use such a language as host language [7,12,20]. In order to increase the efficiency, imperative languages such as C and C++ have been extended by skeletons, too [2,3,9,19]. Depending on the kind of parallelism used, skeletons can be classified into task parallel and data parallel ones. In the first case, a skeleton (dynamically) creates a system of communicating processes by nesting predefined process topologies such as pipeline, farm, parallel composition, divide & conquer, and branch & bound [1,4,5,7,11,19]. In the second case, a skeleton works on a distributed data structure, performing the same operations on some or all elements of this data structure. Data-parallel skeletons, such as map, fold, or rotate, are used in [2,3,7,8,12,19]. Moreover, there are implementations offering skeletons as a library rather than as part of a new programming language [5,14].

The approach described in the sequel is based on the skeleton library introduced in [11,13,14] and on the corresponding C++ language binding. Skeletons can be understood as domain-specific languages for parallel programming. Our library provides task as well as data parallel skeletons, which can be combined based on the two-tier model taken from P3L [19]. In general, a computation consists of nested task parallel constructs where an atomic task parallel computation can be sequential or data parallel. Purely data parallel and purely task parallel computations are special cases of this model. An advantage of the C++ binding is that the three important features needed for skeletons, namely higher-order functions (i.e. functions having functions as arguments), partial applications (i.e. the possibility to apply a function to fewer arguments than it needs and to supply the missing arguments later), and parametric polymorphism, can be implemented elegantly and efficiently in C++ using operator overloading and templates [13,21].

In the present paper, we will focus on task-parallel skeletons in general and on the well-known farm skeleton in particular. Conceptually, a farm consists of a farmer and several workers. The farmer accepts a sequence of tasks from some predecessor process and propagates each task to a worker. The worker executes the task and delivers the result back to the farmer, who propagates it to some successor process (which may be the same as the predecessor). This specification suggests a straightforward implementation leading to the process topology depicted in Fig. 1. The problem with this simple approach is that the farmer may become a bottleneck if the number of workers is large. Another disadvantage is the overhead caused by the propagation of messages. Consequently, it is worth considering different implementation schemes avoiding these disadvantages.

In the present paper, we will consider a variant of the classical farm where the farmer is divided into a dispatcher and a collector, as well as variants where these building blocks have (partly) been omitted. Moreover, we investigate which MPI communication modes are most appropriate for implementing skeletons. Finally, we will present a scheme to shut down the system of processes cleanly, even for nested and possibly cyclic process topologies.
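To make these three C++ mechanisms more concrete, the following small, self-contained sketch (which uses made-up names and is not code taken from our library) shows how an overloaded operator() on a class template yields a partial application, and how a function template acts as a higher-order function:

// A minimal sketch (illustrative names only, not library code) of
// higher-order functions and partial applications in C++.
#include <iostream>

// A "partial application": a binary function with its first argument fixed.
template <class A, class B, class R>
class PartialApp {
  R (*f)(A, B);   // the wrapped binary function
  A a;            // the already supplied first argument
public:
  PartialApp(R (*fct)(A, B), A arg) : f(fct), a(arg) {}
  R operator()(B b) const { return f(a, b); }   // supply the missing argument
};

// A "higher-order function": it takes a unary function-like object as argument.
template <class F, class T>
T applyTwice(F f, T x) { return f(f(x)); }

int add(int x, int y) { return x + y; }

int main() {
  PartialApp<int, int, int> inc(add, 1);          // partially apply add to 1
  std::cout << applyTwice(inc, 40) << std::endl;  // prints 42
  return 0;
}

The library uses the same techniques to pass argument functions or partial applications to the skeletons [13,21].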
The rest of the paper is organized as follows. In Section 2, we show how task-parallel applications can be implemented using the task-parallel skeletons provided by our skeleton library. The considered variants of the farm skeleton are presented in Section 3. Section 4 contains experimental results for the different approaches. In Section 5, we describe how to terminate the system of processes cleanly. Finally, in Section 6 we conclude, discuss related work, and point out future extensions.

Figure 1: Farm. Tasks flow from the predecessor via the farmer to the workers; the results flow back via the farmer to the successor.

2. Task-parallel Skeletons

Our skeleton library offers data parallel and task parallel skeletons. Task parallelism is established by setting up a system of processes which communicate via streams of data. Such a system is not arbitrarily structured but constructed by nesting predefined process topologies such as farms and pipelines. Moreover, there are skeletons for parallel composition, branch & bound, and divide & conquer. Finally, it is possible to nest task and data parallelism according to the mentioned two-tier model of P3L, which allows atomic task parallel processes to use data parallelism inside [19]. Here, we will focus on task-parallel skeletons in general and on the farm skeleton in particular.

In a farm, a farmer process accepts a sequence of inputs and assigns each of them to one of several workers. The parallel composition works similarly to the farm; however, each input is forwarded to every worker. A pipeline allows tasks to be processed by a sequence of stages. Each stage handles one task at a time, and it does this in parallel to all the other stages. Each task parallel skeleton has the same property as an atomic process, namely it accepts a sequence of inputs and produces a sequence of outputs. This allows the task parallel skeletons to be arbitrarily nested. Task parallel skeletons like pipeline and farm are provided by many skeleton systems, see e.g. [5,19].

In the example in Fig. 2, a pipeline of an initial atomic process, a farm of two atomic workers, and a final atomic process is constructed. In the C++ binding, there is a class for every task parallel skeleton. All these classes are subclasses of the abstract class Process. A task parallel application proceeds in two steps. First, a process topology is created by using the constructors of the mentioned classes. This process topology reflects the actual nesting of skeletons. Then, this system of processes is started by applying the method start() to the outermost skeleton. Internally, every atomic process will be assigned to a processor.

#include "Skeleton.h"

static int current = 0;
static const int numworkers = 2;

int* init(){ if (current++ < 100) return &current; else return NULL; }

int foo(int n){
  int result = 1;
  for (int i=1; i<=n%10+1; i++) result *= i;
  for (int i=0; i<result; i++) n++;
  return n;
}

void fin(int n){ cout << "result: " << n << endl; }

int main(int argc, char **argv){
  InitSkeletons(argc,argv);
  // step 1: create a process topology (using C++ constructors)
  Initial<int> p1(init);
  Atomic<int,int> p2(foo,1);
  Farm<int,int> p3(p2,numworkers);
  Final<int> p4(fin);
  Pipe p5(p1,p3,p4);
  // step 2: start the system of processes
  p5.start();
  TerminateSkeletons();
}

Figure 2: Task parallel example application (the resulting process topology is a pipeline of an initial process, a farm with two atomic workers, and a final process).

For an implementation on top of SPMD, this means that every processor will dispatch, depending on its rank, to the code of its assigned process. When constructing an atomic process, the argument function of the constructor tells how each input is transformed into an output value. Again, such a function can be either a C++ function or a partial application. In Fig. 2, each worker applies the function foo to each input and returns the result. The initial and final atomic processes are special, since the initial process does not consume inputs and the final process does not produce outputs.

3. The Farm Skeleton

As pointed out in the introduction, a straightforward implementation of the farm skeleton could be based on the process topology depicted in Fig. 1. It has the advantage that the farmer knows which workers have returned the results of their tasks and are hence idle. Thus, the farmer can forward incoming tasks to the idle workers. However, this approach has the disadvantage that it causes substantial overhead due to the messages which have to be exchanged between farmer and workers. Moreover, the farmer might become a bottleneck if the number of workers is large.
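As an illustration of this classical behavior, the following minimal, self-contained MPI sketch (deliberately simplified and not the implementation used in our library; the task stream is simulated by a local counter, and tags and values are chosen only for this illustration) shows how the farmer learns from each returned result which worker has become idle and can receive the next task:

// Sketch of the classical farmer (rank 0) and its workers in plain MPI.
#include <mpi.h>
#include <cstdio>

const int TAG_TASK = 1, TAG_STOP = 2;

int work(int task) { return task * task; }   // the worker's argument function

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0) {                           // the farmer
    const int numTasks = 20;
    int next = 0, pending = 0, result;
    MPI_Status st;
    // initially, every worker is idle: give each one a first task
    for (int w = 1; w < size && next < numTasks; ++w, ++next, ++pending)
      MPI_Send(&next, 1, MPI_INT, w, TAG_TASK, MPI_COMM_WORLD);
    // whoever returns a result is idle again and gets the next task
    while (pending > 0) {
      MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_TASK, MPI_COMM_WORLD, &st);
      --pending;
      printf("result from worker %d: %d\n", st.MPI_SOURCE, result);
      if (next < numTasks) {
        MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK, MPI_COMM_WORLD);
        ++next; ++pending;
      }
    }
    for (int w = 1; w < size; ++w)           // shut the workers down
      MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
  } else {                                   // a worker
    int task, result;
    MPI_Status st;
    while (true) {
      MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
      if (st.MPI_TAG == TAG_STOP) break;
      result = work(task);
      MPI_Send(&result, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD);
    }
  }
  MPI_Finalize();
  return 0;
}

Note that this simplified farmer blocks on a single receive at a time; as discussed in the following, a real skeleton implementation additionally has to serve its predecessor and successor and therefore has to use non-blocking communication.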

Figure 3: Farm with separate dispatcher and collector.

In this case, the farmer will not be able to keep all workers busy, leading to wasted workers. The number of workers which the farmer can keep busy depends on the sizes of the tasks and the sizes of the messages the farmer has to propagate. For small tasks (in terms of the amount of required computations) and large messages, only very few workers can be kept busy. If, on the other hand, there are few workers (with large tasks), the farmer will partly be idle. This problem can be solved by mapping a worker to the same processor as the farmer; thus, we will not consider it further. In general, the aim is to keep all processors as busy as possible and to avoid a waste of resources.

Let us point out that implementing this apparently simple farm skeleton is not as easy as it might seem. The interested reader may have a look at our implementation [16]. Firstly, the process topology is obviously cyclic (see Fig. 1). Thus, one has to be very careful to avoid deadlocks. On the other hand, one has to make sure that the farmer reacts as quickly as possible to newly arriving tasks and to workers delivering their results. For an implementation based on MPI, this means that the simpler synchronous communication has to be replaced by the significantly more complex non-blocking asynchronous communication using MPI_Wait. Moreover, special control messages are needed besides data messages in order to stop the computation cleanly at the end. This leads to a quite complicated protocol, which has to be supported by every skeleton. Thus, an obvious advantage of using skeletons is that the user does not have to reinvent the wheel again and again, but can readily use the typical patterns for parallel programming without worrying about deadlocks, stopping the computation cleanly, or overlapping computation and communication properly.

3.1. Farm with Dispatcher and Collector

A first approach to reduce the load of an overloaded farmer is to split it into a dispatcher of work and an independent collector of results, as depicted in Fig. 3. This variant is used in P3L [19]. Ideally, distributing tasks needs as much time as collecting results. In this case, the farmer can serve twice as many workers as with the previous approach. If, however, distributing tasks is much more work than collecting results or vice versa, little has been gained, since now the dispatcher or the collector will quickly become the bottleneck, while the other will partly be idle. The amount of required messages is unchanged, and the corresponding overhead is hence preserved.
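The collector, for instance, has to accept results from any worker in whatever order they arrive, which again calls for the non-blocking communication mentioned above. A hedged sketch of this receive pattern (one pending MPI_Irecv per worker, completed with MPI_Waitany; again not the actual library code, and with only a single result per worker for brevity) could look as follows:

// Sketch of a collector using non-blocking receives.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0 && size > 1) {                    // the collector
    int numWorkers = size - 1;
    std::vector<int> results(numWorkers);
    std::vector<MPI_Request> reqs(numWorkers);
    // post one non-blocking receive per worker
    for (int w = 0; w < numWorkers; ++w)
      MPI_Irecv(&results[w], 1, MPI_INT, w + 1, 0, MPI_COMM_WORLD, &reqs[w]);
    // complete the receives in the order in which results actually arrive
    for (int received = 0; received < numWorkers; ++received) {
      int idx;
      MPI_Status st;
      MPI_Waitany(numWorkers, reqs.data(), &idx, &st);
      printf("got result %d from worker %d\n", results[idx], st.MPI_SOURCE);
    }
  } else if (rank > 0) {                          // each worker sends one result
    int result = rank * rank;
    MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}

In a full implementation, such pending receives have to be combined with forwarding the results to the successor of the farm and with the control messages described in Section 5; the sketch only shows the basic pattern.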

Figure 4: Farm with dispatcher; no collector is used.

A disadvantage of this farm variant is that the dispatcher now has no knowledge about the actual load of each worker. Thus, it has to divide the work based on some load-independent scheme, e.g. cyclically or by random selection of a worker. Both may lead to an unbalanced distribution of work. However, for a large number of tasks, one can expect that the load is roughly balanced. The blind distribution of work could be avoided by sending work requests from idle workers to the dispatcher. But then the dispatcher would have to process as many messages as the farmer in the original approach, and the introduction of a separate collector would be rather useless. Alternatively, a load balancing scheme among the workers could be applied as in P3L. In the present paper, we have not considered this alternative.

3.2. Farm with Dispatcher

The previous approach can be improved by omitting the collector and by sending results directly from the workers to the successor of the farm in the overall process topology (see Fig. 4). This eliminates the overhead of propagating results. Moreover, the omitted collector can no longer be a potential bottleneck. (It may, however, happen that the successor now becomes a bottleneck. A corresponding global optimization of the overall process topology remains as future work.)

3.3. Farm without Dispatcher and Collector

After omitting the collector, one can consider omitting the dispatcher as well (see Fig. 5). In fact, this is possible, provided that the predecessors of the farm in the overall process topology assume the responsibility to distribute their tasks directly to the workers. As in the previous subsections, this distribution has to be performed based on a load-independent scheme, e.g. cyclically or by random selection. This approach eliminates the overhead for the propagation of messages completely. Moreover, it omits another potential source of a bottleneck, namely the dispatcher.
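The load-independent selection just mentioned can be as simple as the following small sketch, which merely illustrates the two schemes (the number of workers and the fixed seed are only for the illustration):

// Cyclic and random selection of a destination worker.
#include <cstdio>
#include <cstdlib>

int main() {
  const int numWorkers = 4;
  int nextCyclic = 0;                          // per-sender state of the cyclic scheme
  srand(42);                                   // with several senders, seeds must differ
  for (int task = 0; task < 8; ++task) {
    int cyclicDest = nextCyclic;               // cyclic: 0,1,2,3,0,1,...
    nextCyclic = (nextCyclic + 1) % numWorkers;
    int randomDest = rand() % numWorkers;      // random selection of a worker
    printf("task %d -> cyclic: worker %d, random: worker %d\n",
           task, cyclicDest, randomDest);
  }
  return 0;
}

With several senders feeding the same set of workers, the starting positions or seeds have to differ, as discussed below.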

Figure 5: Pipeline of two farms without dispatcher and collector.

Of course, this new variant of the farm skeleton can be arbitrarily nested with other skeletons. For instance, it is possible to construct a pipeline of such farms, as depicted in Fig. 5. (Such a sequence of farms may, for instance, make sense if each worker of the first farm uses one processor, while each worker of the second farm performs a data parallel computation on several processors.) In such a situation, where n workers of the first farm communicate with m workers of the second, it is important to ensure that not all of them start with the same destination. If a cyclic distribution scheme is used, worker i could, e.g., assign its first task to worker i mod m of the second farm. If the destination is chosen randomly, it has to be ensured that all the random number generators start with a different initial value. A farm without dispatcher but with collector does not seem to make sense, and it is not considered here.

4. Experimental Results

In our experiments, we have tested the different process topologies for implementing the farm as well as the different MPI communication modes.

4.1. Influence of the Process Topology

We have tested the different variants of the farm skeleton for small and big tasks as well as for tasks of varying sizes. A small task simply consists of computing the square of the input, a big task performs two million additions, and the tasks of varying sizes execute n! iterations for some 1 ≤ n ≤ 10 (as in Fig. 2). The experiments have been carried out on the IBM cluster at the University of Münster. This machine [22] has 94 compute nodes, each with a 2.4 GHz Intel Xeon processor, 512 KB L2 cache, and 1 GB memory. The nodes are connected by a Myrinet [18] and run RedHat Linux 7.3 and the MPICH-GM implementation of MPI. All farm variants have been implemented and compared based on the non-blocking standard communication mode using the eager protocol.

In the sequel, we will assume some familiarity with the different MPI communication modes, namely standard mode, synchronous mode, buffered mode, and ready mode, as well as with the corresponding concepts of blocking and non-blocking send operations [17].
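For readers less familiar with these modes, the following short, self-contained program (independent of the skeleton library) merely shows the corresponding MPI calls; rank 0 sends one integer to rank 1 with each mode, and ready mode (MPI_Rsend) is omitted since it additionally requires the matching receive to be already posted:

// Illustration of the MPI point-to-point send modes.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int msg = 42, recv;

  if (rank == 0) {
    // standard mode: MPI decides (eager or rendezvous, typically by message size)
    MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    // synchronous mode: completes only after the receive has started (rendezvous)
    MPI_Ssend(&msg, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    // buffered mode: completes locally, using an explicitly attached buffer
    int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
    void* buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);
    MPI_Bsend(&msg, 1, MPI_INT, 1, 2, MPI_COMM_WORLD);
    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);
    // non-blocking standard mode: initiate now, complete later with MPI_Wait
    MPI_Request req;
    MPI_Isend(&msg, 1, MPI_INT, 1, 3, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
  } else if (rank == 1) {
    for (int tag = 0; tag < 4; ++tag)
      MPI_Recv(&recv, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("received all four messages\n");
  }
  MPI_Finalize();
  return 0;
}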

Moreover, we assume some knowledge of the eager protocol and the rendezvous protocol, which MPICH-GM employs for implementing the different modes.

For large tasks (of the same size), all variants are able to keep the workers busy, as one would expect. Thus, all variants need roughly the same amount of time to execute the tasks, as can be seen in Fig. 6 a). However, just comparing the runtimes is not fair, since the different farms need different numbers of processors (unless farmer, dispatcher, and collector share the same processor with one worker, as mentioned above). Taking this into account, we see that the FarmNDNC is the best, followed by the FarmWDNC and the farm with farmer (see Fig. 6 b).

When considering tasks with (strongly) varying sizes (Fig. 7), we note that all approaches are able to keep a small number of workers (here up to 4) busy. If we add more workers, all approaches reach a point where they are no longer able to employ the additional workers. Beyond this point the runtime remains constant, independently of the number of additional workers. As expected, the original farm reaches this situation more quickly than the FarmWDWC, the FarmWDNC, and the FarmNDNC, in this order. Moreover, farms which produce less overhead need less runtime when reaching the limit. When taking into account the number of processors used (rather than the number of workers), the advantages of the FarmWDNC and, in particular, the FarmNDNC become even more apparent (see Fig. 7 b).

Interestingly, the behavior of the different farm variants does not depend significantly on the distribution scheme. Cyclic distribution and random distribution lead to almost identical runtimes. This is due to the fact that the number of tasks was large and the load of the workers has hence been balanced over time (Fig. 7 b). For small numbers of tasks, the behavior may depend heavily on the actual mapping of tasks to workers.

Figure 6: Farm variants with big tasks. a) runtime depending on the number of workers; b) runtime depending on the number of processors.

Figure 7: Farm variants with tasks of varying sizes. a) runtime depending on the number of workers; b) runtime depending on the number of processors.

It may be surprising that even the FarmNDNC is not able to keep an arbitrary number of workers busy, although it has no farmer or dispatcher which might become a bottleneck. The reason is that the predecessor(s) of the farm are now responsible for the distribution of tasks to the workers. Unless the predecessor is itself a large farm, it will eventually become a bottleneck, as observed in Fig. 7. Similarly, the successor(s) of the farm may now become a bottleneck.

For small tasks, it is not worthwhile to employ any worker. The process delivering the small tasks should rather execute them itself. If we nevertheless use a farm, it is clear that it is not possible to keep even a single worker busy. All farm variants hence require a more or less constant runtime independently of the number of workers (see Fig. 8). This situation corresponds to the one in the previous experiment when reaching the limit of useful workers. The roughly constant runtimes differ between the skeletons depending on the overhead caused by the considered variant.

Figure 8: Farm variants with small tasks. a) runtime depending on the number of workers; b) runtime depending on the number of processors.

4.2. MPI Communication Modes and Distribution Schemes

Figure 9: Comparison of the eager and the rendezvous protocol. a) small tasks; b) big tasks; c) tasks of varying sizes.

In order to answer the question whether and how much the eager protocol performs better than the rendezvous protocol, we have implemented our skeleton library using the standard send mode as well as the synchronous send mode. The standard send mode is based on the eager protocol for messages of less than 32 KB (as in our experiments) and on the rendezvous protocol for larger messages. The synchronous send mode always uses the rendezvous protocol. As above, we have tested the FarmNDNC for small and big tasks as well as for tasks of varying sizes, each of them using a cyclic as well as a random distribution scheme. Just like the previous tests, these experiments have been carried out on our IBM cluster.

For small tasks (see Fig. 9 a), the eager protocol performs much better than the rendezvous protocol due to its lower latency. Here, the distribution scheme has no significant effect, since the tasks are almost immediately finished and there is hence no delay caused by busy workers.

When considering large tasks in connection with the rendezvous protocol, a significant difference between the cyclic and the random distribution scheme can be observed (Fig. 9 b). Whenever the sender selects the same receiver as shortly before, the sender is blocked until the receiver has finished its computation and accepts the message.

Figure 10: Behavior depending on the number of tasks (percentage deviation of the runtime; maximum and minimum for random distribution vs. the average for cyclic distribution).

This causes significant idle times of the sender and the other workers. With a cyclic distribution, such delays are minimized. The eager protocol performs even slightly better. This is not so much caused by the lower latency but by the fact that each worker may buffer several tasks, such that it can directly start processing when finishing its previous task. It does not have to wait for the actual communication to take place. For the same reasons, the eager protocol performs best for tasks of (strongly) varying sizes (Fig. 9 c), and the cyclic distribution is better than the random distribution when using the rendezvous protocol.

In order to compare the different distribution schemes further, we have performed this test several times with different numbers of tasks and farm sizes of up to four workers, all using the eager protocol. While the cyclic distribution scheme provides nearly constant runtimes for each number of tasks, the random distribution scheme causes high variances of the runtimes for smaller numbers of tasks. Fig. 10 shows the runtimes depending on the number of tasks for the cyclic distribution (normalized to 100) as well as the minimal and maximal runtimes for the random distribution. As one might expect, an unlucky random distribution of tasks is more serious for small numbers of tasks than for large numbers, where this effect is averaged out. The same is true for lucky random distributions. We can conclude that the cyclic distribution of tasks is better than the random distribution for implementing the farm skeleton.

5. Terminating the Process System

To perform a task parallel computation, a process topology is created and started. After the computation is completed, the process topology must be terminated properly in order to ensure that no user data are lost and that a possible subsequent skeleton computation with a potentially different topology can be started without problems. In the following, we present a conceptually simple approach for the generation and termination of process topologies. (The actual implementation is not quite as simple [16], in particular in the presence of cyclic process topologies.)

In our skeleton library, task parallel skeletons are implemented as C++ classes. A process topology is created by using their constructors, which can be arbitrarily nested. Each of the skeletons provides a method start() to actually start the computation of the skeleton and all nested skeletons (see Fig. 2). While task parallel skeletons are active, they consume a stream of input values and produce a stream of output values. If a skeleton notices the end of the input stream, it informs its successors that its output stream ends by sending a special message, and it terminates afterwards. Thus, the process system terminates automatically without delegating the responsibility to the programmer. This ensures that no user data are lost.

Messages for controlling the process system must, of course, differ from user data. Hence, the semantics of a message is described by a tag which is sent within the message envelope. Depending on this tag, the receiver decides whether the message is a control message or not. A message with a STOP tag informs the receiver of the end of the data stream and initiates the termination of the subsequent processes. The first STOP is sent by the initial process (Fig. 11). MPI guarantees that messages between the same sender and receiver cannot pass each other. This leads to a simple algorithm for terminating the process topology.

There is no problem if the number of entrances and exits of a skeleton is limited to one. In this case, there is a well-defined successor to which the STOP is sent. When receiving it, the corresponding process forwards it to its successor and stops. For skeletons with an arbitrary number of entrances and exits such as the FarmNDNC, messages can be received from and sent to several processes, and data may arrive after a STOP message. Suppose, for example, there are two FarmNDNCs (as shown in Fig. 11) and there is a message containing user data from a worker A of the first farm to a worker C of the second farm. Furthermore, there is a STOP from the initial process via worker B to C. If C receives the STOP first, C must not terminate until the user data have been received and processed. Thus, it has to wait for STOP messages from all predecessors before terminating. Moreover, each process must propagate a STOP message to all its successors. This algorithm is easy to implement. On the other hand, potentially many STOP messages are sent. However, the corresponding costs can be neglected compared to the previous parallel computation.
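A hedged sketch of this termination logic in plain MPI (not the library implementation; the topology is fixed for the illustration, with rank 0 as initial process, ranks 1 to size-2 as workers, and the last rank as final process with several predecessors) could look as follows:

// Sketch of STOP-based termination: a process with several predecessors
// terminates only after a STOP has arrived from each of them.
#include <mpi.h>
#include <cstdio>

const int TAG_DATA = 1, TAG_STOP = 2;

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  int finalRank = size - 1, numWorkers = size - 2;
  int value = 0, dummy = 0;
  MPI_Status st;

  if (numWorkers < 1) { MPI_Finalize(); return 0; }   // needs at least 3 processes

  if (rank == 0) {                                    // initial: emit data, then STOP
    for (int t = 0; t < 10; ++t)
      MPI_Send(&t, 1, MPI_INT, 1 + t % numWorkers, TAG_DATA, MPI_COMM_WORLD);
    for (int w = 1; w <= numWorkers; ++w)             // STOP to every successor
      MPI_Send(&dummy, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
  } else if (rank < finalRank) {                      // worker: one predecessor, one successor
    while (true) {
      MPI_Recv(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
      if (st.MPI_TAG == TAG_STOP) break;              // end of the input stream
      value *= value;                                 // "process" the task
      MPI_Send(&value, 1, MPI_INT, finalRank, TAG_DATA, MPI_COMM_WORLD);
    }
    MPI_Send(&dummy, 1, MPI_INT, finalRank, TAG_STOP, MPI_COMM_WORLD);
  } else {                                            // final: several predecessors
    int stopsReceived = 0;
    while (stopsReceived < numWorkers) {              // data may still arrive after a STOP
      MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
      if (st.MPI_TAG == TAG_STOP) ++stopsReceived;
      else printf("final result: %d\n", value);
    }
  }
  MPI_Finalize();
  return 0;
}

Since MPI preserves the order of messages between the same sender and receiver, all data sent by a predecessor are received before its STOP, so counting the STOP messages received from the predecessors suffices.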

Figure 11: Process termination with asynchronous communication (data and STOP messages in a pipeline of two FarmNDNCs).

6. Conclusions, Related Work, and Future Extensions

We have considered alternative implementation schemes for the well-known farm skeleton. More precisely, we have investigated several process topologies as well as several communication modes. Concerning the process topologies, we have compared the classical approach, where a farmer distributes work and collects results, with variants where the farmer has been divided into a dispatcher and a collector. Moreover, we have investigated variants where the collector and dispatcher have (partly) been omitted. In case of the variant without dispatcher, the predecessor of the farm in the overall process topology is responsible for the distribution of tasks to the workers. As our experimental results and our analysis show, the farm only consisting of workers is the best in terms of scalability and low overhead. It is clearly superior to the farms with dispatcher (and possibly collector). For a large number of tasks, it is also better than the classical farm, where a farmer distributes work. For a very small number of tasks, the classical farm might have advantages in some cases, since it is the only one which takes the actual load of the workers into account when distributing work.

We are not aware of any previous systematic analysis of different implementation schemes for farms in the literature. In the different skeleton projects, typically one technique has been chosen. In the first version of our skeleton library, there was just the classical farm. P3L [19] uses the FarmWDWC, while eSkel [5] uses the classical farm with farmer.

Concerning the choice of the communication mode for implementing skeletons, we have compared the standard mode (here using the eager protocol) with the synchronous mode based on the rendezvous protocol. The eager protocol turned out to be better, not only due to its lower latency but also due to the fact that the workers can buffer several tasks and can process the next one after finishing the previous one without delay. The cyclic distribution of tasks to the workers turned out to be superior to a random distribution, since it leads to more predictable and on average slightly smaller runtimes. As future work, we intend to investigate alternative implementation schemes for other skeletons, e.g. divide & conquer and branch & bound.

References

[1] E. Alba, F. Almeida et al., MALLBA: A Library of Skeletons for Combinatorial Search, Proc. Euro-Par 2002, LNCS 2400 (Springer, 2002).
[2] G.H. Botorog and H. Kuchen, Efficient Parallel Programming with Algorithmic Skeletons, Proc. Euro-Par '96, LNCS 1123 (Springer, 1996).

[3] G.H. Botorog and H. Kuchen, Efficient High-Level Parallel Programming, Theoretical Computer Science 196 (1998).
[4] M. Cole, Algorithmic Skeletons: Structured Management of Parallel Computation (MIT Press, 1989).
[5] M. Cole, Bringing Skeletons out of the Closet: A Pragmatic Manifesto for Skeletal Parallel Programming, Parallel Computing 30(3) (2004).
[6] M. Cole, The Skeletal Parallelism Web Page.
[7] J. Darlington, A.J. Field, T.G. Harrison et al., Parallel Programming Using Skeleton Functions, Proc. PARLE '93, LNCS 694 (Springer, 1993).
[8] J. Darlington, Y. Guo, H.W. To, J. Yang, Functional Skeletons for Parallel Coordination, Proc. Euro-Par '95, LNCS 966 (Springer, 1995).
[9] I. Foster, R. Olson, S. Tuecke, Productive Parallel Programming: The PCN Approach, Scientific Programming 1(1) (1992).
[10] W. Gropp, E. Lusk, A. Skjellum, Using MPI (MIT Press, 1999).
[11] H. Kuchen, M. Cole, The Integration of Task and Data Parallel Skeletons, Parallel Processing Letters 12(2) (2002).
[12] H. Kuchen, R. Plasmeijer, H. Stoltze, Efficient Distributed Memory Implementation of a Data Parallel Functional Language, Proc. PARLE '94, LNCS 817 (Springer, 1994).
[13] H. Kuchen, J. Striegnitz, Higher-Order Functions and Partial Applications for a C++ Skeleton Library, Proc. Joint ACM Java Grande & ISCOPE Conference (ACM, 2002).
[14] H. Kuchen, A Skeleton Library, Proc. Euro-Par 2002, LNCS 2400 (Springer, 2002).
[15] H. Kuchen, Optimizing Sequences of Skeleton Calls, in Domain-Specific Program Generation, LNCS 3016 (Springer, 2004).
[16] H. Kuchen, The Skeleton Library Web Pages.
[17] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard.
[18] Myricom, The Myricom Homepage.
[19] S. Pelagatti, Task and Data Parallelism in P3L, in Patterns and Skeletons for Parallel and Distributed Computing, eds. F.A. Rabhi and S. Gorlatch (Springer, 2003).
[20] D. Skillicorn, Foundations of Parallel Programming (Cambridge University Press, 1994).
[21] J. Striegnitz, Making C++ Ready for Algorithmic Skeletons, Tech. Report IB-2000-08, Forschungszentrum Jülich.
[22] ZIV-Cluster, University of Münster.


Gossamer Simulator Design Document Gossamer Simulator Design Document Notre Dame PIM Group July 9, Document maintained by Arun Rodrigues (arodrig@cse.nd.edu) Abstract The SST simulator was designed to provide scalable exploration of large

More information

Pattern Mining in Frequent Dynamic Subgraphs

Pattern Mining in Frequent Dynamic Subgraphs Pattern Mining in Frequent Dynamic Subgraphs Karsten M. Borgwardt, Hans-Peter Kriegel, Peter Wackersreuther Institute of Computer Science Ludwig-Maximilians-Universität Munich, Germany kb kriegel wackersr@dbs.ifi.lmu.de

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE Michael Repplinger 1,2, Martin Beyer 1, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken,

More information

The p196 mpi implementation of the reverse-and-add algorithm for the palindrome quest.

The p196 mpi implementation of the reverse-and-add algorithm for the palindrome quest. The p196 mpi implementation of the reverse-and-add algorithm for the palindrome quest. Romain Dolbeau March 24, 2014 1 Introduction To quote John Walker, the first person to brute-force the problem [1]:

More information

Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing

Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing Jose Flich 1,PedroLópez 1, Manuel. P. Malumbres 1, José Duato 1,andTomRokicki 2 1 Dpto.

More information

Evaluating Algorithms for Shared File Pointer Operations in MPI I/O

Evaluating Algorithms for Shared File Pointer Operations in MPI I/O Evaluating Algorithms for Shared File Pointer Operations in MPI I/O Ketan Kulkarni and Edgar Gabriel Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston {knkulkarni,gabriel}@cs.uh.edu

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the

More information

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks Ryan G. Lane Daniels Scott Xin Yuan Department of Computer Science Florida State University Tallahassee, FL 32306 {ryanlane,sdaniels,xyuan}@cs.fsu.edu

More information

y(b)-- Y[a,b]y(a). EQUATIONS ON AN INTEL HYPERCUBE*

y(b)-- Y[a,b]y(a). EQUATIONS ON AN INTEL HYPERCUBE* SIAM J. ScI. STAT. COMPUT. Vol. 12, No. 6, pp. 1480-1485, November 1991 ()1991 Society for Industrial and Applied Mathematics 015 SOLUTION OF LINEAR SYSTEMS OF ORDINARY DIFFERENTIAL EQUATIONS ON AN INTEL

More information

UMCS. Annales UMCS Informatica AI 6 (2007) Fault tolerant control for RP* architecture of Scalable Distributed Data Structures

UMCS. Annales UMCS Informatica AI 6 (2007) Fault tolerant control for RP* architecture of Scalable Distributed Data Structures Annales Informatica AI 6 (2007) 5-13 Annales Informatica Lublin-Polonia Sectio AI http://www.annales.umcs.lublin.pl/ Fault tolerant control for RP* architecture of Scalable Distributed Data Structures

More information

Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář

Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář Department ofradim Computer Bača Science, and Technical David Bednář University of Ostrava Czech

More information

3.3 Hardware Parallel processing

3.3 Hardware Parallel processing Parallel processing is the simultaneous use of more than one CPU to execute a program. Ideally, parallel processing makes a program run faster because there are more CPUs running it. In practice, it is

More information

M301: Software Systems & their Development. Unit 4: Inheritance, Composition and Polymorphism

M301: Software Systems & their Development. Unit 4: Inheritance, Composition and Polymorphism Block 1: Introduction to Java Unit 4: Inheritance, Composition and Polymorphism Aims of the unit: Study and use the Java mechanisms that support reuse, in particular, inheritance and composition; Analyze

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

1 Overview. 2 A Classification of Parallel Hardware. 3 Parallel Programming Languages 4 C+MPI. 5 Parallel Haskell

1 Overview. 2 A Classification of Parallel Hardware. 3 Parallel Programming Languages 4 C+MPI. 5 Parallel Haskell Table of Contents Distributed and Parallel Technology Revision Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University, Edinburgh 1 Overview 2 A Classification of Parallel

More information

Link Service Routines

Link Service Routines Link Service Routines for better interrupt handling Introduction by Ralph Moore smx Architect When I set out to design smx, I wanted to have very low interrupt latency. Therefore, I created a construct

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor

More information