ON IMPLEMENTING THE FARM SKELETON
MICHAEL POLDNER and HERBERT KUCHEN
Department of Information Systems, University of Münster, D Münster, Germany

ABSTRACT
Algorithmic skeletons intend to simplify parallel programming by providing a higher level of abstraction compared to the usual message passing. Task and data parallel skeletons can be distinguished. In the present paper, we will consider several approaches to implementing one of the most classical task parallel skeletons, namely the farm, and compare them w.r.t. scalability, overhead, potential bottlenecks, and load balancing. We will also investigate several communication modes for the implementation of skeletons. Based on experimental results, the advantages and disadvantages of the different approaches are shown. Moreover, we will show how to terminate the system of processes properly.

Keywords: Parallel programming, algorithmic skeletons, farm, MPI.

1. Introduction

Today, parallel programming of MIMD machines with distributed memory is typically based on message passing. Owing to the availability of standard message passing libraries such as MPI (a) [10], the resulting software is platform independent and efficient. However, the programming level is still rather low, and programmers have to fight against low-level communication problems such as deadlocks. Moreover, the program is split into a set of processes which are assigned to the different processors. Like an ant, each process only has a local view of the overall activity. A global view of the overall computation exists only in the programmer's mind, and there is no way to express it more directly on this level. Many approaches try to raise the level of parallel programming and to overcome the mentioned disadvantages. Here, we will focus on algorithmic skeletons, i.e.
typical parallel-programming patterns which are efficiently implemented on the available parallel machine and usually offered to the user as higher-order functions, which get the details of the specific application problem as argument functions (see e.g. [3,4,19]). The skeletal parallelism homepage [6] contains links to virtually all groups and projects working on skeletons. In our framework, a parallel computation consists of a sequence of calls to skeletons.

(a) We assume some familiarity with MPI and C++.

Third Workshop on High-Level Parallel Programming and Applications

Several implementations of algorithmic skeletons are available. They differ in the kind of host language used and in the particular set of skeletons offered. Since higher-order functions are taken from functional languages, many approaches use such a language as host language [7,12,20]. In order to increase the efficiency, imperative languages such as C and C++ have been extended by skeletons, too [2,3,9,19]. Depending on the kind of parallelism used, skeletons can be classified into task parallel and data parallel ones. In the first case, a skeleton (dynamically) creates a system of communicating processes by nesting predefined process topologies such as pipeline, farm, parallel composition, divide & conquer, and branch & bound [1,4,5,7,11,19]. In the second case, a skeleton works on a distributed data structure, performing the same operations on some or all elements of this data structure. Data-parallel skeletons, such as map, fold, or rotate, are used in [2,3,7,8,12,19]. Moreover, there are implementations offering skeletons as a library rather than as part of a new programming language [5,14]. The approach described in the sequel is based on the skeleton library introduced in [11,13,14] and on the corresponding C++ language binding. Skeletons can be understood as domain-specific languages for parallel programming. Our library provides task as well as data parallel skeletons, which can be combined based on the two-tier model taken from P3L [19]. In general, a computation consists of nested task parallel constructs where an atomic task parallel computation can be sequential or data parallel. Purely data parallel and purely task parallel computations are special cases of this model. An advantage of the C++ binding is that the three important features needed for skeletons, namely higher-order functions (i.e. functions having functions as arguments), partial applications (i.e.
the possibility to apply a function to fewer arguments than it needs and to supply the missing arguments later), and parametric polymorphism, can be implemented elegantly and efficiently in C++ using operator overloading and templates, respectively [13,21]. In the present paper, we will focus on task-parallel skeletons in general and on the well-known farm skeleton in particular. Conceptually, a farm consists of a farmer and several workers. The farmer accepts a sequence of tasks from some predecessor process and propagates each task to a worker. The worker executes the task and delivers the result back to the farmer, who propagates it to some successor process (which may be the same as the predecessor). This specification suggests a straightforward implementation leading to the process topology depicted in Fig. 1. The problem with this simple approach is that the farmer may become a bottleneck if the number of workers is large. Another disadvantage is the overhead caused by the propagation of messages. Consequently, it is worth considering different implementation schemes avoiding these disadvantages. In the present paper, we will consider a variant of the classical farm where the farmer is divided into a dispatcher and a collector, as well as variants where these building blocks have (partly) been omitted. Moreover, we investigate which MPI communication modes are most appropriate for implementing skeletons. Finally, we will present a scheme to shut down the system of processes cleanly, even for nested and possibly cyclic process topologies. The rest of the paper is organized as follows. In Section 2, we show how task-parallel applications can be implemented using the task-parallel skeletons provided by our skeleton library. The considered variants of the farm skeleton are presented in Section 3. Section 4 contains experimental results for the different approaches. In Section 5, we describe how to terminate the system of processes cleanly. Finally, in Section 6 we conclude and discuss related work. Moreover, we point out future extensions.

Figure 1: Farm (predecessor, farmer with several workers, successor).

2. Task-parallel Skeletons

Our skeleton library offers data parallel and task parallel skeletons. Task parallelism is established by setting up a system of processes which communicate via streams of data. Such a system is not arbitrarily structured but constructed by nesting predefined process topologies such as farms and pipelines. Moreover, there are skeletons for parallel composition, branch & bound, and divide & conquer. Finally, it is possible to nest task and data parallelism according to the mentioned two-tier model of P3L, which allows atomic task parallel processes to use data parallelism inside [19]. Here, we will focus on task-parallel skeletons in general and on the farm skeleton in particular. In a farm, a farmer process accepts a sequence of inputs and assigns each of them to one of several workers. The parallel composition works similarly to the farm; however, each input is forwarded to every worker. A pipeline allows tasks to be processed by a sequence of stages. Each stage handles one task at a time, and it does this in parallel to all the other stages. Each task parallel skeleton has the same property as an atomic process, namely it accepts a sequence of inputs and produces a sequence of outputs. This allows the task parallel skeletons to be arbitrarily nested. Task parallel skeletons like pipeline and farm are provided by many skeleton systems, see e.g. [5,19]. In the example in Fig.
2, a pipeline of an initial atomic process, a farm of two atomic workers, and a final atomic process is constructed. In the C++ binding, there is a class for every task parallel skeleton. All these classes are subclasses of the abstract class Process. A task parallel application proceeds in two steps. First, a process topology is created by using the constructors of the mentioned class. This process topology reflects the actual nesting of skeletons. Then, this system of processes is started by applying method start() to the outermost skeleton. Internally, every atomic process will be assigned to a processor. For an implementation
on top of SPMD, this means that every processor will dispatch depending on its rank to the code of its assigned process.

#include "Skeleton.h"

static int current = 0;
static const int numworkers = 2;

int* init(){ if (current++ < 1) return &current; else return NULL; }

int foo(int n){
    int result = 1;
    for (int i=1; i<=n%10+1; i++) result *= i;
    for (int i=0; i<result; i++) n++;
    return n;
}

void fin(int n){ cout << "result: " << n << endl; }

int main(int argc, char **argv){
    InitSkeletons(argc,argv);
    // step 1: create a process topology (using C++ constructors)
    Initial<int> p1(init);
    Atomic<int,int> p2(foo,1);
    Farm<int,int> p3(p2,numworkers);
    Final<int> p4(fin);
    Pipe p5(p1,p3,p4);
    // step 2: start the system of processes
    p5.start();
    TerminateSkeletons();}

Figure 2: Task parallel example application (process topology: Initial, Farmer with two Atomic workers, Final).

When constructing an atomic process, the argument function of the constructor tells how each input is transformed into an output value. Again, such a function can be either a C++ function or a partial application. In Fig. 2, each worker applies function foo to each input and returns the result. The initial and final atomic processes are special, since they do not consume inputs and produce outputs, respectively.

3. The Farm Skeleton

As pointed out in the introduction, a straightforward implementation of the farm skeleton could be based on the process topology depicted in Fig. 1. It has the advantage that the farmer knows which workers have returned the results of their tasks and are hence idle. Thus, the farmer can forward incoming tasks to the idle workers. However, this approach has the disadvantage that it causes substantial overhead due to the messages which have to be exchanged between farmer and workers. Moreover, the farmer might become a bottleneck, if the number of workers
is large. In this case, the farmer will not be able to keep all workers busy, leading to wasted workers. The number of workers which the farmer can keep busy depends on the sizes of the tasks and the sizes of the messages the farmer has to propagate. For small tasks (in terms of the amount of required computation) and large messages, only very few workers can be kept busy. If, on the other hand, there are few workers (with large tasks), the farmer will partly be idle. This problem can be solved by mapping a worker to the same processor as the farmer. Thus, we will not consider this problem further. In general, the aim is to keep all processors as busy as possible and to avoid a waste of resources.

Figure 3: Farm with separate dispatcher and collector.

Let us point out that implementing this apparently simple farm skeleton is not as easy as it might seem. The interested reader may have a look at our implementation [16]. Firstly, the process topology is obviously cyclic (see Fig. 1). Thus, one has to be very careful to avoid deadlocks. Secondly, one has to make sure that the farmer reacts as quickly as possible to newly arriving tasks and to workers delivering their results. For an implementation based on MPI, this means that the simpler synchronous communication has to be replaced by the significantly more complex non-blocking asynchronous communication using MPI_Wait. Moreover, special control messages are needed besides data messages in order to stop the computation cleanly at the end. This leads to a quite complicated protocol, which has to be supported by every skeleton.
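The load-dependent assignment that the classical farmer performs can be sketched in plain C++. This is a simulation of the scheduling logic only: class and method names are our own, and the real MPI communication (e.g. completed non-blocking receives detected via MPI_Waitany) is replaced by explicit event methods.

```cpp
#include <cassert>
#include <queue>
#include <set>
#include <utility>
#include <vector>

// Simulation of the classical farmer's bookkeeping: incoming tasks are
// forwarded to idle workers; a returned result marks its worker idle again.
class Farmer {
    std::set<int> idle;        // ranks of currently idle workers
    std::queue<int> pending;   // tasks waiting for an idle worker
public:
    std::vector<std::pair<int,int>> log;  // (task, worker) assignments made
    explicit Farmer(int numWorkers) {
        for (int w = 0; w < numWorkers; ++w) idle.insert(w);
    }
    void newTask(int task) {    // event: a task arrives from the predecessor
        pending.push(task);
        dispatch();
    }
    void resultFrom(int worker) {  // event: a worker delivered its result
        idle.insert(worker);
        dispatch();
    }
private:
    void dispatch() {           // forward tasks while idle workers exist
        while (!pending.empty() && !idle.empty()) {
            int w = *idle.begin();
            idle.erase(idle.begin());
            log.push_back({pending.front(), w});
            pending.pop();
        }
    }
};
```

The sketch captures exactly the property discussed above: the farmer always knows which workers are idle, which is what the dispatcher variants below give up.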
Thus, an obvious advantage of using skeletons is that the user does not have to reinvent the wheel again and again, but can readily use the typical patterns for parallel programming without worrying about deadlocks, stopping the computation cleanly, or overlapping computation and communication properly.

3.1. Farm with Dispatcher and Collector

A first approach to reduce the load of an overloaded farmer is to split it into a dispatcher of work and an independent collector of results, as depicted in Fig. 3. This variant is used in P3L [19]. Ideally, distributing tasks needs as much time as collecting results. In this case, the farmer can serve twice as many workers as with the previous approach. If, however, distributing tasks is much more work than collecting results or vice versa,
little has been gained, since now the dispatcher or the collector will quickly become the bottleneck, while the other will partly be idle. The number of required messages is unchanged, and the corresponding overhead is hence preserved. A disadvantage of this farm variant is that the dispatcher now has no knowledge about the actual load of each worker. Thus, it has to distribute work based on some load-independent scheme, e.g. cyclically or by random selection of a worker. Both may lead to an unbalanced distribution of work. However, for a large number of tasks, one can expect that the load is roughly balanced. The blind distribution of work could be avoided by sending work requests from idle workers to the dispatcher. But then the dispatcher would have to process as many messages as the farmer in the original approach, and the introduction of a separate collector would be rather useless. Alternatively, a load balancing scheme among workers could be applied, as in P3L. In the present paper, we have not considered this alternative.

Figure 4: Farm with dispatcher; no collector is used.

3.2. Farm with Dispatcher

The previous approach can be improved by omitting the collector and by sending results directly from the workers to the successor of the farm in the overall process topology (see Fig. 4). This eliminates the overhead of propagating results. Moreover, the omitted collector can no longer be a potential bottleneck. (b)

3.3. Farm without Dispatcher and Collector

After omitting the collector, one can consider omitting the dispatcher as well (see Fig. 5). In fact, this is possible, provided that the predecessors of the farm in the overall process topology assume the responsibility to distribute their tasks directly to the workers. As in the previous subsections, this distribution has to be performed based on a load-independent scheme, e.g.
cyclically or by random selection. This approach eliminates the overhead for the propagation of messages completely. Moreover, it removes another potential source of a bottleneck, namely the dispatcher. Of course, this new variant of the farm skeleton can be arbitrarily nested with other

(b) It may, however, happen that the successor now becomes a bottleneck. A corresponding global optimization of the overall process topology remains as future work.
skeletons. For instance, it is possible to construct a pipeline of such farms, as depicted in Fig. 5. (c) In such a situation, where n workers of the first farm communicate with m workers of the second, it is important to ensure that not all of them start with the same destination. If using a cyclic distribution scheme, worker i could e.g. assign its first task to worker i/m of the second farm. If the destination is randomly chosen, it has to be ensured that all the random number generators start with a different initial value. A farm without dispatcher but with collector does not seem to make sense, and it is not considered here.

Figure 5: Pipeline of two farms without dispatcher and collector (FarmNDNC 1 and FarmNDNC 2 between predecessor and successor).

4. Experimental Results

In our experiments, we have tested the different process topologies for implementing the farm as well as the different MPI communication modes.

4.1. Influence of the Process Topology

We have tested the different variants of the farm skeleton for small and big tasks as well as for tasks of varying sizes. A small task simply consists of computing the square of the input, a big task performs two million additions, and the tasks of varying sizes execute n! iterations for some 1 ≤ n ≤ 10 (as in Fig. 2). The experiments have been carried out on the IBM cluster at the University of Münster. This machine [22] has 94 compute nodes, each with a 2.4 GHz Intel Xeon processor, 512 KB L2 cache, and 1 GB memory. The nodes are connected by a Myrinet network [18] and run RedHat Linux 7.3 and the MPICH-GM implementation of MPI. All farm variants have been implemented and compared based on the non-blocking standard communication mode using the eager protocol.
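For reference, the three benchmark task kinds can be sketched as C++ functions. These are reconstructions for illustration only; in particular, the factorial bound n % 10 + 1 is an assumption matching the stated range of n, and the function names are our own.

```cpp
#include <cstdint>

// Small task: the square of the input.
int smallTask(int n) { return n * n; }

// Big task: two million additions.
long bigTask(long n) {
    for (int i = 0; i < 2000000; ++i) n += 1;
    return n;
}

// Task of varying size: roughly k! iterations for k = n % 10 + 1 (assumed
// bound), mirroring function foo of Fig. 2.
std::int64_t varyingTask(int n) {
    int k = n % 10 + 1;
    std::int64_t result = 1;
    for (int i = 1; i <= k; ++i) result *= i;   // result = k!
    std::int64_t x = n;
    for (std::int64_t i = 0; i < result; ++i) x++;
    return x;
}
```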
In the sequel, we will assume some familiarity with the different MPI communication modes, namely standard mode, synchronous mode, buffered mode, and ready

(c) Note that such a sequence of farms may, for instance, make sense if each worker of the first farm uses one processor, while each worker of the second farm performs a data parallel computation on several processors.
mode as well as with the corresponding concepts of blocking and non-blocking send operations [17]. Moreover, we assume some knowledge of the eager protocol and the rendezvous protocol, which MPICH-GM employs for implementing the different modes. For large tasks (of the same size), all variants are able to keep the workers busy, as one would expect. Thus, all variants need roughly the same amount of time to execute the tasks, as can be seen in Fig. 6 a). However, just comparing the runtimes is not fair, since the different farms need different numbers of processors (unless farmer, dispatcher, and collector share the same processor with one worker, as mentioned above). Taking this into account, we see that the farmndnc is the best, followed by the farmwdnc and the farm with farmer (see Fig. 6 b). When considering tasks with (strongly) varying sizes (Fig. 7), we note that all approaches are able to keep a small number of workers (here up to 4) busy. If we add more workers, all approaches reach a point where they are no longer able to employ the additional workers. Beyond this point the runtime remains constant, independently of the number of additional workers. As expected, the original farm reaches this situation more quickly than the farmwdwc, the farmwdnc, and the farmndnc, in this order. Moreover, farms which produce less overhead need less runtime when reaching the limit. When taking into account the number of processors used (rather than the number of workers), the advantages of the farmwdnc and, in particular, the farmndnc become even more apparent (see Fig. 7 b). Interestingly, the behavior of the different farm variants does not depend significantly on the distribution scheme. Cyclic distribution and random distribution lead to almost identical runtimes. This is due to the fact that the number of tasks was large and the load of the workers has hence been balanced over time (Fig. 7 b).
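The fairness correction mentioned above amounts to simple accounting, sketched below with helpers of our own (assuming that farmer, dispatcher, and collector each occupy a processor of their own, rather than sharing one with a worker):

```cpp
// Processors occupied by a farm with w workers, per variant: the classical
// farm adds a farmer, the farmwdwc a dispatcher and a collector, the
// farmwdnc only a dispatcher, and the farmndnc consists of workers only.
int procsFarmClassic(int w) { return w + 1; }
int procsFarmWDWC(int w)    { return w + 2; }
int procsFarmWDNC(int w)    { return w + 1; }
int procsFarmNDNC(int w)    { return w; }
```

This is why plotting runtime against the number of processors (Fig. 6 b and 7 b) shifts the comparison in favor of the farmndnc: for the same processor budget it runs more workers.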
Figure 6: Farm variants with big tasks; a) runtime depending on # workers, b) runtime depending on # processors.

For small numbers of tasks, the behavior may depend heavily on the actual mapping of tasks to workers. It may be surprising that even the farmndnc is not able to keep an arbitrary
amount of workers busy, although it has no farmer or dispatcher which might become a bottleneck. The reason is that the predecessor(s) of the farm are now responsible for the distribution of tasks to the workers. Unless the predecessor is itself a large farm, it will eventually become a bottleneck, as observed in Fig. 7. Similarly, the successor(s) of the farm may now become a bottleneck. For small tasks, it is not worthwhile to employ any worker. The process delivering the small tasks should rather execute them itself. If we nevertheless use a farm, it is clear that it is not possible to keep even a single worker busy. All farm variants hence require a more or less constant runtime, independently of the number of workers (see Fig. 8). This situation corresponds to the one in the previous experiment when reaching the limit of useful workers. The roughly constant runtimes differ between skeletons depending on the overhead caused by the considered variant.

Figure 7: Farm variants with tasks of varying sizes; a) runtime depending on # workers, b) runtime depending on # processors.

Figure 8: Farm variants with small tasks; a) runtime depending on # workers, b) runtime depending on # processors.

4.2. MPI Communication Modes and Distribution Schemes

In order to answer the question whether and how much the eager protocol performs better than the rendezvous protocol, we have implemented our skeleton library using the standard send mode as well as the synchronous send mode. The standard send mode is based on the eager protocol for messages less than 32 KB (as in our experiments) and on the rendezvous protocol for larger messages. The synchronous send mode always uses the rendezvous protocol. As above, we have tested the farmndnc for small and big tasks as well as for tasks of varying sizes, each of them using a cyclic as well as a random distribution scheme. Just like the previous tests, these experiments have been carried out on our IBM cluster.

Figure 9: Comparison of the eager and the rendezvous protocol; a) small tasks, b) big tasks, c) tasks of varying sizes.

For small tasks (see Fig. 9 a), the eager protocol performs much better than the rendezvous protocol due to its lower latency. Here, the distribution scheme has no significant effect, since the tasks are finished almost immediately and there is hence no delay caused by busy workers. When considering large tasks in connection with the rendezvous protocol, a significant difference between the cyclic and the random distribution scheme can be observed (Fig. 9 b). Whenever the sender selects the same receiver as shortly before, the sender is blocked until the receiver has finished the computation and
accepts the message. This causes significant idle times of the sender and the other workers. With a cyclic distribution, such delays are minimized. The eager protocol performs even slightly better. This is not so much caused by the lower latency but by the fact that each worker may buffer several tasks, such that it can directly start processing when finishing its previous task. It does not have to wait for the actual communication to take place. For the same reasons, the eager protocol performs best for tasks of (strongly) varying sizes (Fig. 9 c), and the cyclic distribution is better than the random distribution when using the rendezvous protocol. In order to compare the different distribution schemes further, we have performed this test several times with different numbers of tasks, namely 1, 1, 1, 1 and 1, and farm sizes of up to four workers, all using the eager protocol. While the cyclic distribution scheme provides nearly constant runtimes for each number of tasks, the random distribution scheme causes high variances of the runtimes for smaller numbers of tasks. Fig. 10 shows the runtimes depending on the number of tasks for cyclic distribution (normalized to 1) as well as the minimal and maximal runtimes for random distribution. As one might expect, an unlucky random distribution of tasks is more serious for small numbers of tasks than for large numbers, where this effect is averaged out. The same is true for lucky random distributions. We can conclude that the cyclic distribution of tasks is better than the random distribution for implementing the farm skeleton.

Figure 10: Behavior depending on the number of tasks (percentage deviation of the maximal and minimal random runtimes from the cyclic average).

5. Terminating the Process System

To perform a task parallel computation, a process topology is created and started.
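The variance effect can be illustrated with a toy simulation of our own construction (not the paper's benchmark): distribute t equally sized tasks over w workers either cyclically or uniformly at random and compare the maximum load per worker. Cyclic loads differ by at most one task, while random loads can be noticeably unbalanced when t is small.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Maximum worker load when t unit tasks go to w workers cyclically:
// loads differ by at most one task, so the maximum is ceil(t / w).
int maxLoadCyclic(int t, int w) {
    return (t + w - 1) / w;
}

// Maximum worker load when each of t tasks picks a worker uniformly at
// random; for small t this can be far above the cyclic maximum.
int maxLoadRandom(int t, int w, unsigned seed) {
    std::vector<int> load(w, 0);
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> pick(0, w - 1);
    for (int k = 0; k < t; ++k) load[pick(gen)]++;
    return *std::max_element(load.begin(), load.end());
}
```

By the pigeonhole principle, the random maximum can never be below the cyclic one, which matches the observation that random distribution is at best as balanced as cyclic distribution.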
After the computation is completed, the process topology must be terminated properly in order to ensure that no user data are lost and a possible subsequent skeleton computation with a potentially different topology can be started without problems. In the following, we present a conceptually simple approach for
the generation and termination of process topologies. (d) In our skeleton library, task parallel skeletons are implemented as C++ classes. A process topology is created by using their constructors, which can be arbitrarily nested. Each of the skeletons provides a method start() to actually start the computation of the skeleton and all nested skeletons (see Fig. 2). While task parallel skeletons are active, they consume a stream of input values and produce a stream of output values. If a skeleton notices the end of the input stream, it informs its successors that its output stream ends by sending a special message, and terminates afterwards. Thus, the process system terminates automatically, without delegating the responsibility to the programmer. This ensures that no user data are lost. Messages for controlling the process system must, of course, differ from user data. Hence, the semantics of a message is described by a tag which is sent within the message envelope. Depending on this tag, the receiver decides whether the message is a control message or not. A message with a STOP tag informs the receiver of the end of the data stream and initiates the termination of the subsequent processes. The first STOP is sent by the initial process (Fig. 11). MPI guarantees that messages between the same sender and receiver cannot overtake each other. This leads to a simple algorithm for terminating the process topology. There is no problem if the number of entrances and exits of a skeleton is limited to one. In this case, there is a well-defined successor to which the STOP is sent. When receiving it, the corresponding process forwards it to its successor and stops. For skeletons with an arbitrary number of entrances and exits, such as the farmndnc, messages can be received from and sent to several processes, and data may arrive after a STOP message.
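The resulting termination rule (a process may only stop after receiving a STOP from each of its predecessors, and must then forward one STOP to each of its successors) can be sketched as a small state machine. This is a simulation of the protocol logic only; the class is our own, not the library's actual MPI code.

```cpp
// Termination logic of one process in the topology: it counts STOP
// messages and terminates only once every predecessor has sent one,
// since user data may still arrive on the other incoming streams.
class Node {
    int expectedStops;          // number of predecessors
    int receivedStops = 0;
public:
    bool terminated = false;
    explicit Node(int numPredecessors) : expectedStops(numPredecessors) {}
    // Called when a STOP arrives; returns the number of STOPs to forward
    // (one per successor) if the node terminates now, and 0 otherwise.
    int onStop(int numSuccessors) {
        if (++receivedStops < expectedStops) return 0;
        terminated = true;
        return numSuccessors;
    }
};
```

In the scenario of Fig. 11, worker C has two predecessors (A and B), so the STOP arriving via B does not terminate C while a data message from A may still be in flight.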
Suppose, for example, there are two farmndncs (as shown in Fig. 11) and there is a message containing user data from a worker A of the first farm to a worker C of the second farm. Furthermore, there is a STOP from the initial process via worker B to C. If C receives the STOP first, C must not terminate until the user data have been received and processed. Thus, it has to wait for STOP messages from all predecessors before terminating. Moreover, each process must propagate a STOP message to all its successors. This algorithm is easy to implement. On the other hand, potentially many STOP messages are sent. However, the corresponding costs can be neglected compared to the preceding parallel computation.

(d) The actual implementation is not quite as simple [16], in particular in the presence of cyclic process topologies.

6. Conclusions, Related Work, and Future Extensions

We have considered alternative implementation schemes for the well-known farm skeleton. More precisely, we have investigated several process topologies as well as several communication modes. Concerning the process topologies, we have compared the classical approach, where a farmer distributes work and collects results, with variants where the farmer has been divided into a dispatcher and a collector. Moreover, we have investigated variants where the collector and dispatcher have
Figure 11: Process termination with asynchronous communication (two FarmNDNCs between the initial and the final process; data and numbered STOP messages flow between workers A, B, C, and D).

(partly) been omitted. In case of the variant without dispatcher, the predecessor of the farm in the overall process topology is responsible for the distribution of tasks to the workers. As our experimental results and our analysis show, the farm consisting only of workers is the best in terms of scalability and low overhead. It is clearly superior to the farms with dispatcher (and possibly collector). For a large number of tasks, it is also better than the classical farm, where a farmer distributes work. For a very small number of tasks, the classical farm might have advantages in some cases, since it is the only one which takes the actual load of the workers into account when distributing work. We are not aware of any previous systematic analysis of different implementation schemes for farms in the literature. In the different skeleton projects, typically one technique has been chosen. In the first version of our skeleton library, there was just the classical farm. P3L [19] uses the farmwdwc, while eSkel [5] uses the classical farm with farmer. Concerning the choice of the communication mode for implementing skeletons, we have compared the standard mode (here using the eager protocol) with the synchronous mode based on the rendezvous protocol. The eager protocol turned out to be better, not only due to its lower latency but also due to the fact that the workers can buffer several tasks and can process the next one after finishing the previous one without delay. The cyclic distribution of tasks to the workers turned out to be superior to a random distribution, since it leads to more predictable and, on average, slightly smaller runtimes. As future work, we intend to investigate alternative implementation schemes for other skeletons, e.g.
divide & conquer and branch & bound.

References

[1] E. Alba, F. Almeida et al., MALLBA: A Library of Skeletons for Combinatorial Search, Proc. Euro-Par 2002, LNCS 2400 (Springer, 2002).
[2] G.H. Botorog and H. Kuchen, Efficient Parallel Programming with Algorithmic Skeletons, Proc. Euro-Par 96, LNCS 1123 (Springer, 1996).
[3] G.H. Botorog and H. Kuchen, Efficient High-Level Parallel Programming, Theoretical Computer Science 196 (1998).
[4] M. Cole, Algorithmic Skeletons: Structured Management of Parallel Computation (MIT Press, 1989).
[5] M. Cole, Bringing Skeletons out of the Closet: A Pragmatic Manifesto for Skeletal Parallel Programming, Parallel Computing 30(3) (2004).
[6] M. Cole, The Skeletal Parallelism Web Page.
[7] J. Darlington, A.J. Field, T.G. Harrison et al., Parallel Programming Using Skeleton Functions, Proc. PARLE 93, LNCS 694 (Springer, 1993).
[8] J. Darlington, Y. Guo, H.W. To, J. Yang, Functional Skeletons for Parallel Coordination, Proc. Euro-Par 95, LNCS 966 (Springer, 1995).
[9] I. Foster, R. Olson, S. Tuecke, Productive Parallel Programming: The PCN Approach, Scientific Programming 1(1) (1992).
[10] W. Gropp, E. Lusk, A. Skjellum, Using MPI (MIT Press, 1999).
[11] H. Kuchen, M. Cole, The Integration of Task and Data Parallel Skeletons, Parallel Processing Letters 12(2) (2002).
[12] H. Kuchen, R. Plasmeijer, H. Stoltze, Efficient Distributed Memory Implementation of a Data Parallel Functional Language, Proc. PARLE 94, LNCS 817 (Springer, 1994).
[13] H. Kuchen, J. Striegnitz, Higher-Order Functions and Partial Applications for a C++ Skeleton Library, Proc. Joint ACM Java Grande & ISCOPE Conference (ACM, 2002).
[14] H. Kuchen, A Skeleton Library, Proc. Euro-Par 2002, LNCS 2400 (Springer, 2002).
[15] H. Kuchen, Optimizing Sequences of Skeleton Calls, in Domain-Specific Program Generation, LNCS 3016 (Springer, 2004).
[16] H. Kuchen, The Skeleton Library Web Pages.
[17] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard.
[18] Myricom, The Myricom Homepage.
[19] S. Pelagatti, Task and Data Parallelism in P3L, in Patterns and Skeletons for Parallel and Distributed Computing, eds. F.A. Rabhi and S. Gorlatch (Springer, 2003).
[20] D. Skillicorn, Foundations of Parallel Programming (Cambridge U. Press, 1994).
[21] J. Striegnitz, Making C++ Ready for Algorithmic Skeletons, Tech. Report IB-2-8.
[22] ZIV-Cluster, University of Münster.