ON IMPLEMENTING THE FARM SKELETON

MICHAEL POLDNER and HERBERT KUCHEN
Department of Information Systems, University of Münster, Münster, Germany

ABSTRACT

Algorithmic skeletons intend to simplify parallel programming by providing a higher level of abstraction compared to the usual message passing. Task and data parallel skeletons can be distinguished. In the present paper, we will consider several approaches to implement one of the most classical task parallel skeletons, namely the farm, and compare them w.r.t. scalability, overhead, potential bottlenecks, and load balancing. We will also investigate several communication modes for the implementation of skeletons. Based on experimental results, the advantages and disadvantages of the different approaches are shown. Moreover, we will show how to terminate the system of processes properly.

Keywords: Parallel programming, algorithmic skeletons, farm, MPI.

1. Introduction

Today, parallel programming of MIMD machines with distributed memory is typically based on message passing. Owing to the availability of standard message passing libraries such as MPI [10] (we assume some familiarity with MPI and C++), the resulting software is platform independent and efficient. However, the programming level is still rather low, and programmers have to fight against low-level communication problems such as deadlocks. Moreover, the program is split into a set of processes which are assigned to the different processors. Like an ant, each process only has a local view of the overall activity. A global view of the overall computation only exists in the programmer's mind, and there is no way to express it more directly on this level.

Many approaches try to increase the level of parallel programming and to overcome the mentioned disadvantages. Here, we will focus on algorithmic skeletons, i.e. typical parallel-programming patterns which are efficiently implemented on the available parallel machine and usually offered to the user as higher-order functions, which get the details of the specific application problem as argument functions (see e.g. [3,4,19]). The skeletal parallelism homepage [6] contains links to virtually all groups and projects working on skeletons. In our framework, a parallel computation consists of a sequence of calls to skeletons.

Several implementations of algorithmic skeletons are available. They differ in the kind of host language used and in the particular set of skeletons offered. Since higher-order functions are taken from functional languages, many approaches use such a language as host language [7,12,20]. In order to increase the efficiency, imperative languages such as C and C++ have been extended by skeletons, too [2,3,9,19]. Depending on the kind of parallelism used, skeletons can be classified into task parallel and data parallel ones. In the first case, a skeleton (dynamically) creates a system of communicating processes by nesting predefined process topologies such as pipeline, farm, parallel composition, divide & conquer, and branch & bound [1,4,5,7,11,19]. In the second case, a skeleton works on a distributed data structure, performing the same operations on some or all elements of this data structure. Data-parallel skeletons, such as map, fold, or rotate, are used in [2,3,7,8,12,19]. Moreover, there are implementations offering skeletons as a library rather than as part of a new programming language [5,14].

The approach described in the sequel is based on the skeleton library introduced in [11,13,14] and on the corresponding C++ language binding. Skeletons can be understood as domain-specific languages for parallel programming. Our library provides task as well as data parallel skeletons, which can be combined based on the two-tier model taken from P3L [19]. In general, a computation consists of nested task parallel constructs where an atomic task parallel computation can be sequential or data parallel. Purely data parallel and purely task parallel computations are special cases of this model. An advantage of the C++ binding is that the three important features needed for skeletons, namely higher-order functions (i.e. functions having functions as arguments), partial applications (i.e. the possibility to apply a function to fewer arguments than it needs and to supply the missing arguments later), and parametric polymorphism, can be implemented elegantly and efficiently in C++ using operator overloading and templates [13,21].

In the present paper, we will focus on task-parallel skeletons in general and on the well-known farm skeleton in particular. Conceptually, a farm consists of a farmer and several workers. The farmer accepts a sequence of tasks from some predecessor process and propagates each task to a worker. The worker executes the task and delivers the result back to the farmer, who propagates it to some successor process (which may be the same as the predecessor). This specification suggests a straightforward implementation leading to the process topology depicted in Fig. 1. The problem with this simple approach is that the farmer may become a bottleneck if the number of workers is large. Another disadvantage is the overhead caused by the propagation of messages. Consequently, it is worth considering different implementation schemes avoiding these disadvantages.

In the present paper, we will consider a variant of the classical farm where the farmer is divided into a dispatcher and a collector, as well as variants where these building blocks have (partly) been omitted. Moreover, we investigate which MPI communication modes are most appropriate for implementing skeletons. Finally, we will present a scheme to shut down the system of processes cleanly, even for nested and possibly cyclic process topologies.
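To make these three C++ mechanisms more concrete, the following small, self-contained sketch (which uses made-up names and is not code taken from our library) shows how an overloaded operator() on a class template yields a partial application, and how a function template acts as a higher-order function:

// A minimal sketch (illustrative names only, not library code) of
// higher-order functions and partial applications in C++.
#include <iostream>

// A "partial application": a binary function with its first argument fixed.
template <class A, class B, class R>
class PartialApp {
  R (*f)(A, B);   // the wrapped binary function
  A a;            // the already supplied first argument
public:
  PartialApp(R (*fct)(A, B), A arg) : f(fct), a(arg) {}
  R operator()(B b) const { return f(a, b); }   // supply the missing argument
};

// A "higher-order function": it takes a unary function-like object as argument.
template <class F, class T>
T applyTwice(F f, T x) { return f(f(x)); }

int add(int x, int y) { return x + y; }

int main() {
  PartialApp<int, int, int> inc(add, 1);          // partially apply add to 1
  std::cout << applyTwice(inc, 40) << std::endl;  // prints 42
  return 0;
}

The library uses the same techniques to pass argument functions or partial applications to the skeletons [13,21].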
The rest of the paper is organized as follows. In Section 2, we show how task-parallel applications can be implemented using the task-parallel skeletons provided by our skeleton library. The considered variants of the farm skeleton are presented in Section 3. Section 4 contains experimental results for the different approaches. In Section 5, we describe how to terminate the system of processes cleanly. Finally, in Section 6 we conclude, discuss related work, and point out future extensions.

Figure 1: Farm. Tasks flow from the predecessor via the farmer to the workers; the results flow back via the farmer to the successor.

2. Task-parallel Skeletons

Our skeleton library offers data parallel and task parallel skeletons. Task parallelism is established by setting up a system of processes which communicate via streams of data. Such a system is not arbitrarily structured but constructed by nesting predefined process topologies such as farms and pipelines. Moreover, there are skeletons for parallel composition, branch & bound, and divide & conquer. Finally, it is possible to nest task and data parallelism according to the mentioned two-tier model of P3L, which allows atomic task parallel processes to use data parallelism inside [19]. Here, we will focus on task-parallel skeletons in general and on the farm skeleton in particular.

In a farm, a farmer process accepts a sequence of inputs and assigns each of them to one of several workers. The parallel composition works similarly to the farm; however, each input is forwarded to every worker. A pipeline allows tasks to be processed by a sequence of stages. Each stage handles one task at a time, and it does this in parallel to all the other stages. Each task parallel skeleton has the same property as an atomic process, namely it accepts a sequence of inputs and produces a sequence of outputs. This allows the task parallel skeletons to be arbitrarily nested. Task parallel skeletons like pipeline and farm are provided by many skeleton systems, see e.g. [5,19].

In the example in Fig. 2, a pipeline of an initial atomic process, a farm of two atomic workers, and a final atomic process is constructed. In the C++ binding, there is a class for every task parallel skeleton. All these classes are subclasses of the abstract class Process. A task parallel application proceeds in two steps. First, a process topology is created by using the constructors of the mentioned classes. This process topology reflects the actual nesting of skeletons. Then, this system of processes is started by applying the method start() to the outermost skeleton. Internally, every atomic process will be assigned to a processor.

#include "Skeleton.h"

static int current = 0;
static const int numworkers = 2;

int* init(){ if (current++ < 100) return &current; else return NULL; }

int foo(int n){
  int result = 1;
  for (int i=1; i<=n%10+1; i++) result *= i;
  for (int i=0; i<result; i++) n++;
  return n;
}

void fin(int n){ cout << "result: " << n << endl; }

int main(int argc, char **argv){
  InitSkeletons(argc,argv);
  // step 1: create a process topology (using C++ constructors)
  Initial<int> p1(init);
  Atomic<int,int> p2(foo,1);
  Farm<int,int> p3(p2,numworkers);
  Final<int> p4(fin);
  Pipe p5(p1,p3,p4);
  // step 2: start the system of processes
  p5.start();
  TerminateSkeletons();
}

Figure 2: Task parallel example application (the resulting process topology is a pipeline of an initial process, a farm with two atomic workers, and a final process).

For an implementation on top of SPMD, this means that every processor will dispatch, depending on its rank, to the code of its assigned process. When constructing an atomic process, the argument function of the constructor tells how each input is transformed into an output value. Again, such a function can be either a C++ function or a partial application. In Fig. 2, each worker applies the function foo to each input and returns the result. The initial and final atomic processes are special, since the initial process does not consume inputs and the final process does not produce outputs.

3. The Farm Skeleton

As pointed out in the introduction, a straightforward implementation of the farm skeleton could be based on the process topology depicted in Fig. 1. It has the advantage that the farmer knows which workers have returned the results of their tasks and are hence idle. Thus, the farmer can forward incoming tasks to the idle workers. However, this approach has the disadvantage that it causes substantial overhead due to the messages which have to be exchanged between farmer and workers. Moreover, the farmer might become a bottleneck if the number of workers is large.
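As an illustration of this classical behavior, the following minimal, self-contained MPI sketch (deliberately simplified and not the implementation used in our library; the task stream is simulated by a local counter, and tags and values are chosen only for this illustration) shows how the farmer learns from each returned result which worker has become idle and can receive the next task:

// Sketch of the classical farmer (rank 0) and its workers in plain MPI.
#include <mpi.h>
#include <cstdio>

const int TAG_TASK = 1, TAG_STOP = 2;

int work(int task) { return task * task; }   // the worker's argument function

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0) {                           // the farmer
    const int numTasks = 20;
    int next = 0, pending = 0, result;
    MPI_Status st;
    // initially, every worker is idle: give each one a first task
    for (int w = 1; w < size && next < numTasks; ++w, ++next, ++pending)
      MPI_Send(&next, 1, MPI_INT, w, TAG_TASK, MPI_COMM_WORLD);
    // whoever returns a result is idle again and gets the next task
    while (pending > 0) {
      MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_TASK, MPI_COMM_WORLD, &st);
      --pending;
      printf("result from worker %d: %d\n", st.MPI_SOURCE, result);
      if (next < numTasks) {
        MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK, MPI_COMM_WORLD);
        ++next; ++pending;
      }
    }
    for (int w = 1; w < size; ++w)           // shut the workers down
      MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
  } else {                                   // a worker
    int task, result;
    MPI_Status st;
    while (true) {
      MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
      if (st.MPI_TAG == TAG_STOP) break;
      result = work(task);
      MPI_Send(&result, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD);
    }
  }
  MPI_Finalize();
  return 0;
}

Note that this simplified farmer blocks on a single receive at a time; as discussed in the following, a real skeleton implementation additionally has to serve its predecessor and successor and therefore has to use non-blocking communication.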

Figure 3: Farm with separate dispatcher and collector.

In this case, the farmer will not be able to keep all workers busy, leading to wasted workers. The number of workers which the farmer can keep busy depends on the sizes of the tasks and the sizes of the messages the farmer has to propagate. For small tasks (in terms of the amount of required computations) and large messages, only very few workers can be kept busy. If, on the other hand, there are few workers (with large tasks), the farmer will partly be idle. This problem can be solved by mapping a worker to the same processor as the farmer; thus, we will not consider it further. In general, the aim is to keep all processors as busy as possible and to avoid a waste of resources.

Let us point out that implementing this apparently simple farm skeleton is not as easy as it might seem. The interested reader may have a look at our implementation [16]. Firstly, the process topology is obviously cyclic (see Fig. 1). Thus, one has to be very careful to avoid deadlocks. On the other hand, one has to make sure that the farmer reacts as quickly as possible to newly arriving tasks and to workers delivering their results. For an implementation based on MPI, this means that the simpler synchronous communication has to be replaced by the significantly more complex non-blocking asynchronous communication using MPI_Wait. Moreover, special control messages are needed besides data messages in order to stop the computation cleanly at the end. This leads to a quite complicated protocol, which has to be supported by every skeleton. Thus, an obvious advantage of using skeletons is that the user does not have to reinvent the wheel again and again, but can readily use the typical patterns for parallel programming without worrying about deadlocks, stopping the computation cleanly, or overlapping computation and communication properly.

3.1. Farm with Dispatcher and Collector

A first approach to reduce the load of an overloaded farmer is to split it into a dispatcher of work and an independent collector of results, as depicted in Fig. 3. This variant is used in P3L [19]. Ideally, distributing tasks needs as much time as collecting results. In this case, the farmer can serve twice as many workers as with the previous approach. If, however, distributing tasks is much more work than collecting results or vice versa, little has been gained, since now the dispatcher or the collector will quickly become the bottleneck, while the other will partly be idle. The amount of required messages is unchanged, and the corresponding overhead is hence preserved.
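The collector, for instance, has to accept results from any worker in whatever order they arrive, which again calls for the non-blocking communication mentioned above. A hedged sketch of this receive pattern (one pending MPI_Irecv per worker, completed with MPI_Waitany; again not the actual library code, and with only a single result per worker for brevity) could look as follows:

// Sketch of a collector using non-blocking receives.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0 && size > 1) {                    // the collector
    int numWorkers = size - 1;
    std::vector<int> results(numWorkers);
    std::vector<MPI_Request> reqs(numWorkers);
    // post one non-blocking receive per worker
    for (int w = 0; w < numWorkers; ++w)
      MPI_Irecv(&results[w], 1, MPI_INT, w + 1, 0, MPI_COMM_WORLD, &reqs[w]);
    // complete the receives in the order in which results actually arrive
    for (int received = 0; received < numWorkers; ++received) {
      int idx;
      MPI_Status st;
      MPI_Waitany(numWorkers, reqs.data(), &idx, &st);
      printf("got result %d from worker %d\n", results[idx], st.MPI_SOURCE);
    }
  } else if (rank > 0) {                          // each worker sends one result
    int result = rank * rank;
    MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}

In a full implementation, such pending receives have to be combined with forwarding the results to the successor of the farm and with the control messages described in Section 5; the sketch only shows the basic pattern.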

Figure 4: Farm with dispatcher; no collector is used.

A disadvantage of this farm variant is that the dispatcher now has no knowledge about the actual load of each worker. Thus, it has to divide the work based on some load-independent scheme, e.g. cyclically or by random selection of a worker. Both may lead to an unbalanced distribution of work. However, for a large number of tasks, one can expect that the load is roughly balanced. The blind distribution of work could be avoided by sending work requests from idle workers to the dispatcher. But then the dispatcher would have to process as many messages as the farmer in the original approach, and the introduction of a separate collector would be rather useless. Alternatively, a load balancing scheme among the workers could be applied as in P3L. In the present paper, we have not considered this alternative.

3.2. Farm with Dispatcher

The previous approach can be improved by omitting the collector and by sending results directly from the workers to the successor of the farm in the overall process topology (see Fig. 4). This eliminates the overhead of propagating results. Moreover, the omitted collector can no longer be a potential bottleneck. (It may, however, happen that the successor now becomes a bottleneck. A corresponding global optimization of the overall process topology remains as future work.)

3.3. Farm without Dispatcher and Collector

After omitting the collector, one can consider omitting the dispatcher as well (see Fig. 5). In fact, this is possible, provided that the predecessors of the farm in the overall process topology assume the responsibility to distribute their tasks directly to the workers. As in the previous subsections, this distribution has to be performed based on a load-independent scheme, e.g. cyclically or by random selection. This approach eliminates the overhead for the propagation of messages completely. Moreover, it omits another potential source of a bottleneck, namely the dispatcher.
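The load-independent selection just mentioned can be as simple as the following small sketch, which merely illustrates the two schemes (the number of workers and the fixed seed are only for the illustration):

// Cyclic and random selection of a destination worker.
#include <cstdio>
#include <cstdlib>

int main() {
  const int numWorkers = 4;
  int nextCyclic = 0;                          // per-sender state of the cyclic scheme
  srand(42);                                   // with several senders, seeds must differ
  for (int task = 0; task < 8; ++task) {
    int cyclicDest = nextCyclic;               // cyclic: 0,1,2,3,0,1,...
    nextCyclic = (nextCyclic + 1) % numWorkers;
    int randomDest = rand() % numWorkers;      // random selection of a worker
    printf("task %d -> cyclic: worker %d, random: worker %d\n",
           task, cyclicDest, randomDest);
  }
  return 0;
}

With several senders feeding the same set of workers, the starting positions or seeds have to differ, as discussed below.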

Figure 5: Pipeline of two farms without dispatcher and collector.

Of course, this new variant of the farm skeleton can be arbitrarily nested with other skeletons. For instance, it is possible to construct a pipeline of such farms, as depicted in Fig. 5. (Such a sequence of farms may, for instance, make sense if each worker of the first farm uses one processor, while each worker of the second farm performs a data parallel computation on several processors.) In such a situation, where n workers of the first farm communicate with m workers of the second, it is important to ensure that not all of them start with the same destination. If a cyclic distribution scheme is used, worker i could, e.g., assign its first task to worker i mod m of the second farm. If the destination is chosen randomly, it has to be ensured that all the random number generators start with a different initial value. A farm without dispatcher but with collector does not seem to make sense, and it is not considered here.

4. Experimental Results

In our experiments, we have tested the different process topologies for implementing the farm as well as the different MPI communication modes.

4.1. Influence of the Process Topology

We have tested the different variants of the farm skeleton for small and big tasks as well as for tasks of varying sizes. A small task simply consists of computing the square of the input, a big task performs two million additions, and the tasks of varying sizes execute n! iterations for some 1 ≤ n ≤ 10 (as in Fig. 2). The experiments have been carried out on the IBM cluster at the University of Münster. This machine [22] has 94 compute nodes, each with a 2.4 GHz Intel Xeon processor, 512 KB L2 cache, and 1 GB memory. The nodes are connected by a Myrinet [18] and run RedHat Linux 7.3 and the MPICH-GM implementation of MPI. All farm variants have been implemented and compared based on the non-blocking standard communication mode using the eager protocol.

In the sequel, we will assume some familiarity with the different MPI communication modes, namely standard mode, synchronous mode, buffered mode, and ready mode, as well as with the corresponding concepts of blocking and non-blocking send operations [17].
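For readers less familiar with these modes, the following short, self-contained program (independent of the skeleton library) merely shows the corresponding MPI calls; rank 0 sends one integer to rank 1 with each mode, and ready mode (MPI_Rsend) is omitted since it additionally requires the matching receive to be already posted:

// Illustration of the MPI point-to-point send modes.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int msg = 42, recv;

  if (rank == 0) {
    // standard mode: MPI decides (eager or rendezvous, typically by message size)
    MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    // synchronous mode: completes only after the receive has started (rendezvous)
    MPI_Ssend(&msg, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    // buffered mode: completes locally, using an explicitly attached buffer
    int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
    void* buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);
    MPI_Bsend(&msg, 1, MPI_INT, 1, 2, MPI_COMM_WORLD);
    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);
    // non-blocking standard mode: initiate now, complete later with MPI_Wait
    MPI_Request req;
    MPI_Isend(&msg, 1, MPI_INT, 1, 3, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
  } else if (rank == 1) {
    for (int tag = 0; tag < 4; ++tag)
      MPI_Recv(&recv, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("received all four messages\n");
  }
  MPI_Finalize();
  return 0;
}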

Moreover, we assume some knowledge of the eager protocol and the rendezvous protocol, which MPICH-GM employs for implementing the different modes.

For large tasks (of the same size), all variants are able to keep the workers busy, as one would expect. Thus, all variants need roughly the same amount of time to execute the tasks, as can be seen in Fig. 6 a). However, just comparing the runtimes is not fair, since the different farms need different numbers of processors (unless farmer, dispatcher, and collector share the same processor with one worker, as mentioned above). Taking this into account, we see that the FarmNDNC is the best, followed by the FarmWDNC and the farm with farmer (see Fig. 6 b).

When considering tasks with (strongly) varying sizes (Fig. 7), we note that all approaches are able to keep a small number of workers (here up to 4) busy. If we add more workers, all approaches reach a point where they are no longer able to employ the additional workers. Beyond this point the runtime remains constant, independently of the number of additional workers. As expected, the original farm reaches this situation more quickly than the FarmWDWC, the FarmWDNC, and the FarmNDNC, in this order. Moreover, farms which produce less overhead need less runtime when reaching the limit. When taking into account the number of processors used (rather than the number of workers), the advantages of the FarmWDNC and, in particular, the FarmNDNC become even more apparent (see Fig. 7 b).

Interestingly, the behavior of the different farm variants does not depend significantly on the distribution scheme. Cyclic distribution and random distribution lead to almost identical runtimes. This is due to the fact that the number of tasks was large and the load of the workers has hence been balanced over time (Fig. 7 b). For small numbers of tasks, the behavior may depend heavily on the actual mapping of tasks to workers.

Figure 6: Farm variants with big tasks. a) runtime depending on the number of workers; b) runtime depending on the number of processors.

Figure 7: Farm variants with tasks of varying sizes. a) runtime depending on the number of workers; b) runtime depending on the number of processors.

It may be surprising that even the FarmNDNC is not able to keep an arbitrary number of workers busy, although it has no farmer or dispatcher which might become a bottleneck. The reason is that the predecessor(s) of the farm are now responsible for the distribution of tasks to the workers. Unless the predecessor is itself a large farm, it will eventually become a bottleneck, as observed in Fig. 7. Similarly, the successor(s) of the farm may now become a bottleneck.

For small tasks, it is not worthwhile to employ any worker. The process delivering the small tasks should rather execute them itself. If we nevertheless use a farm, it is clear that it is not possible to keep even a single worker busy. All farm variants hence require a more or less constant runtime independently of the number of workers (see Fig. 8). This situation corresponds to the one in the previous experiment when reaching the limit of useful workers. The roughly constant runtimes differ between the skeletons depending on the overhead caused by the considered variant.

Figure 8: Farm variants with small tasks. a) runtime depending on the number of workers; b) runtime depending on the number of processors.

4.2. MPI Communication Modes and Distribution Schemes

Figure 9: Comparison of the eager and the rendezvous protocol. a) small tasks; b) big tasks; c) tasks of varying sizes.

In order to answer the question whether and how much the eager protocol performs better than the rendezvous protocol, we have implemented our skeleton library using the standard send mode as well as the synchronous send mode. The standard send mode is based on the eager protocol for messages of less than 32 KB (as in our experiments) and on the rendezvous protocol for larger messages. The synchronous send mode always uses the rendezvous protocol. As above, we have tested the FarmNDNC for small and big tasks as well as for tasks of varying sizes, each of them using a cyclic as well as a random distribution scheme. Just like the previous tests, these experiments have been carried out on our IBM cluster.

For small tasks (see Fig. 9 a), the eager protocol performs much better than the rendezvous protocol due to its lower latency. Here, the distribution scheme has no significant effect, since the tasks are almost immediately finished and there is hence no delay caused by busy workers.

When considering large tasks in connection with the rendezvous protocol, a significant difference between the cyclic and the random distribution scheme can be observed (Fig. 9 b). Whenever the sender selects the same receiver as shortly before, the sender is blocked until the receiver has finished its computation and accepts the message.

Figure 10: Behavior depending on the number of tasks (percentage deviation of the runtime; maximum and minimum for random distribution vs. the average for cyclic distribution).

This causes significant idle times of the sender and the other workers. With a cyclic distribution, such delays are minimized. The eager protocol performs even slightly better. This is not so much caused by the lower latency but by the fact that each worker may buffer several tasks, such that it can directly start processing when finishing its previous task. It does not have to wait for the actual communication to take place. For the same reasons, the eager protocol performs best for tasks of (strongly) varying sizes (Fig. 9 c), and the cyclic distribution is better than the random distribution when using the rendezvous protocol.

In order to compare the different distribution schemes further, we have performed this test several times with different numbers of tasks and farm sizes of up to four workers, all using the eager protocol. While the cyclic distribution scheme provides nearly constant runtimes for each number of tasks, the random distribution scheme causes high variances of the runtimes for smaller numbers of tasks. Fig. 10 shows the runtimes depending on the number of tasks for the cyclic distribution (normalized to 100) as well as the minimal and maximal runtimes for the random distribution. As one might expect, an unlucky random distribution of tasks is more serious for small numbers of tasks than for large numbers, where this effect is averaged out. The same is true for lucky random distributions. We can conclude that the cyclic distribution of tasks is better than the random distribution for implementing the farm skeleton.

5. Terminating the Process System

To perform a task parallel computation, a process topology is created and started. After the computation is completed, the process topology must be terminated properly in order to ensure that no user data are lost and that a possible subsequent skeleton computation with a potentially different topology can be started without problems. In the following, we present a conceptually simple approach for the generation and termination of process topologies. (The actual implementation is not quite as simple [16], in particular in the presence of cyclic process topologies.)

In our skeleton library, task parallel skeletons are implemented as C++ classes. A process topology is created by using their constructors, which can be arbitrarily nested. Each of the skeletons provides a method start() to actually start the computation of the skeleton and all nested skeletons (see Fig. 2). While task parallel skeletons are active, they consume a stream of input values and produce a stream of output values. If a skeleton notices the end of the input stream, it informs its successors that its output stream ends by sending a special message, and it terminates afterwards. Thus, the process system terminates automatically without delegating the responsibility to the programmer. This ensures that no user data are lost.

Messages for controlling the process system must, of course, differ from user data. Hence, the semantics of a message is described by a tag which is sent within the message envelope. Depending on this tag, the receiver decides whether the message is a control message or not. A message with a STOP tag informs the receiver of the end of the data stream and initiates the termination of the subsequent processes. The first STOP is sent by the initial process (Fig. 11). MPI guarantees that messages between the same sender and receiver cannot pass each other. This leads to a simple algorithm for terminating the process topology.

There is no problem if the number of entrances and exits of a skeleton is limited to one. In this case, there is a well-defined successor to which the STOP is sent. When receiving it, the corresponding process forwards it to its successor and stops. For skeletons with an arbitrary number of entrances and exits such as the FarmNDNC, messages can be received from and sent to several processes, and data may arrive after a STOP message. Suppose, for example, there are two FarmNDNCs (as shown in Fig. 11) and there is a message containing user data from a worker A of the first farm to a worker C of the second farm. Furthermore, there is a STOP from the initial process via worker B to C. If C receives the STOP first, C must not terminate until the user data have been received and processed. Thus, it has to wait for STOP messages from all predecessors before terminating. Moreover, each process must propagate a STOP message to all its successors. This algorithm is easy to implement. On the other hand, potentially many STOP messages are sent. However, the corresponding costs can be neglected compared to the previous parallel computation.
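A hedged sketch of this termination logic in plain MPI (not the library implementation; the topology is fixed for the illustration, with rank 0 as initial process, ranks 1 to size-2 as workers, and the last rank as final process with several predecessors) could look as follows:

// Sketch of STOP-based termination: a process with several predecessors
// terminates only after a STOP has arrived from each of them.
#include <mpi.h>
#include <cstdio>

const int TAG_DATA = 1, TAG_STOP = 2;

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  int finalRank = size - 1, numWorkers = size - 2;
  int value = 0, dummy = 0;
  MPI_Status st;

  if (numWorkers < 1) { MPI_Finalize(); return 0; }   // needs at least 3 processes

  if (rank == 0) {                                    // initial: emit data, then STOP
    for (int t = 0; t < 10; ++t)
      MPI_Send(&t, 1, MPI_INT, 1 + t % numWorkers, TAG_DATA, MPI_COMM_WORLD);
    for (int w = 1; w <= numWorkers; ++w)             // STOP to every successor
      MPI_Send(&dummy, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
  } else if (rank < finalRank) {                      // worker: one predecessor, one successor
    while (true) {
      MPI_Recv(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
      if (st.MPI_TAG == TAG_STOP) break;              // end of the input stream
      value *= value;                                 // "process" the task
      MPI_Send(&value, 1, MPI_INT, finalRank, TAG_DATA, MPI_COMM_WORLD);
    }
    MPI_Send(&dummy, 1, MPI_INT, finalRank, TAG_STOP, MPI_COMM_WORLD);
  } else {                                            // final: several predecessors
    int stopsReceived = 0;
    while (stopsReceived < numWorkers) {              // data may still arrive after a STOP
      MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
      if (st.MPI_TAG == TAG_STOP) ++stopsReceived;
      else printf("final result: %d\n", value);
    }
  }
  MPI_Finalize();
  return 0;
}

Since MPI preserves the order of messages between the same sender and receiver, all data sent by a predecessor are received before its STOP, so counting the STOP messages received from the predecessors suffices.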

Figure 11: Process termination with asynchronous communication (data and STOP messages in a pipeline of two FarmNDNCs).

6. Conclusions, Related Work, and Future Extensions

We have considered alternative implementation schemes for the well-known farm skeleton. More precisely, we have investigated several process topologies as well as several communication modes. Concerning the process topologies, we have compared the classical approach, where a farmer distributes work and collects results, with variants where the farmer has been divided into a dispatcher and a collector. Moreover, we have investigated variants where the collector and dispatcher have (partly) been omitted. In case of the variant without dispatcher, the predecessor of the farm in the overall process topology is responsible for the distribution of tasks to the workers. As our experimental results and our analysis show, the farm only consisting of workers is the best in terms of scalability and low overhead. It is clearly superior to the farms with dispatcher (and possibly collector). For a large number of tasks, it is also better than the classical farm, where a farmer distributes work. For a very small number of tasks, the classical farm might have advantages in some cases, since it is the only one which takes the actual load of the workers into account when distributing work.

We are not aware of any previous systematic analysis of different implementation schemes for farms in the literature. In the different skeleton projects, typically one technique has been chosen. In the first version of our skeleton library, there was just the classical farm. P3L [19] uses the FarmWDWC, while eSkel [5] uses the classical farm with farmer.

Concerning the choice of the communication mode for implementing skeletons, we have compared the standard mode (here using the eager protocol) with the synchronous mode based on the rendezvous protocol. The eager protocol turned out to be better, not only due to its lower latency but also due to the fact that the workers can buffer several tasks and can process the next one after finishing the previous one without delay. The cyclic distribution of tasks to the workers turned out to be superior to a random distribution, since it leads to more predictable and on average slightly smaller runtimes. As future work, we intend to investigate alternative implementation schemes for other skeletons, e.g. divide & conquer and branch & bound.

References

[1] E. Alba, F. Almeida et al., MALLBA: A Library of Skeletons for Combinatorial Search, Proc. Euro-Par 2002, LNCS 2400 (Springer, 2002).
[2] G.H. Botorog and H. Kuchen, Efficient Parallel Programming with Algorithmic Skeletons, Proc. Euro-Par '96, LNCS 1123 (Springer, 1996).

[3] G.H. Botorog and H. Kuchen, Efficient High-Level Parallel Programming, Theoretical Computer Science 196 (1998).
[4] M. Cole, Algorithmic Skeletons: Structured Management of Parallel Computation (MIT Press, 1989).
[5] M. Cole, Bringing Skeletons out of the Closet: A Pragmatic Manifesto for Skeletal Parallel Programming, Parallel Computing 30(3) (2004).
[6] M. Cole, The Skeletal Parallelism Web Page.
[7] J. Darlington, A.J. Field, T.G. Harrison et al., Parallel Programming Using Skeleton Functions, Proc. PARLE '93, LNCS 694 (Springer, 1993).
[8] J. Darlington, Y. Guo, H.W. To, J. Yang, Functional Skeletons for Parallel Coordination, Proc. Euro-Par '95, LNCS 966 (Springer, 1995).
[9] I. Foster, R. Olson, S. Tuecke, Productive Parallel Programming: The PCN Approach, Scientific Programming 1(1) (1992).
[10] W. Gropp, E. Lusk, A. Skjellum, Using MPI (MIT Press, 1999).
[11] H. Kuchen, M. Cole, The Integration of Task and Data Parallel Skeletons, Parallel Processing Letters 12(2) (2002).
[12] H. Kuchen, R. Plasmeijer, H. Stoltze, Efficient Distributed Memory Implementation of a Data Parallel Functional Language, Proc. PARLE '94, LNCS 817 (Springer, 1994).
[13] H. Kuchen, J. Striegnitz, Higher-Order Functions and Partial Applications for a C++ Skeleton Library, Proc. Joint ACM Java Grande & ISCOPE Conference (ACM, 2002).
[14] H. Kuchen, A Skeleton Library, Proc. Euro-Par 2002, LNCS 2400 (Springer, 2002).
[15] H. Kuchen, Optimizing Sequences of Skeleton Calls, in Domain-Specific Program Generation, LNCS 3016 (Springer, 2004).
[16] H. Kuchen, The Skeleton Library Web Pages.
[17] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard.
[18] Myricom, The Myricom Homepage.
[19] S. Pelagatti, Task and Data Parallelism in P3L, in Patterns and Skeletons for Parallel and Distributed Computing, eds. F.A. Rabhi and S. Gorlatch (Springer, 2003).
[20] D. Skillicorn, Foundations of Parallel Programming (Cambridge University Press, 1994).
[21] J. Striegnitz, Making C++ Ready for Algorithmic Skeletons, Tech. Report IB-2000-08, Forschungszentrum Jülich.
[22] ZIV-Cluster, University of Münster.


Gossamer Simulator Design Document Gossamer Simulator Design Document Notre Dame PIM Group July 9, Document maintained by Arun Rodrigues (arodrig@cse.nd.edu) Abstract The SST simulator was designed to provide scalable exploration of large

More information

Pattern Mining in Frequent Dynamic Subgraphs

Pattern Mining in Frequent Dynamic Subgraphs Pattern Mining in Frequent Dynamic Subgraphs Karsten M. Borgwardt, Hans-Peter Kriegel, Peter Wackersreuther Institute of Computer Science Ludwig-Maximilians-Universität Munich, Germany kb kriegel wackersr@dbs.ifi.lmu.de

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE Michael Repplinger 1,2, Martin Beyer 1, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken,

More information

The p196 mpi implementation of the reverse-and-add algorithm for the palindrome quest.

The p196 mpi implementation of the reverse-and-add algorithm for the palindrome quest. The p196 mpi implementation of the reverse-and-add algorithm for the palindrome quest. Romain Dolbeau March 24, 2014 1 Introduction To quote John Walker, the first person to brute-force the problem [1]:

More information

Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing

Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing Jose Flich 1,PedroLópez 1, Manuel. P. Malumbres 1, José Duato 1,andTomRokicki 2 1 Dpto.

More information

Evaluating Algorithms for Shared File Pointer Operations in MPI I/O

Evaluating Algorithms for Shared File Pointer Operations in MPI I/O Evaluating Algorithms for Shared File Pointer Operations in MPI I/O Ketan Kulkarni and Edgar Gabriel Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston {knkulkarni,gabriel}@cs.uh.edu

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the

More information

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks Ryan G. Lane Daniels Scott Xin Yuan Department of Computer Science Florida State University Tallahassee, FL 32306 {ryanlane,sdaniels,xyuan}@cs.fsu.edu

More information

y(b)-- Y[a,b]y(a). EQUATIONS ON AN INTEL HYPERCUBE*

y(b)-- Y[a,b]y(a). EQUATIONS ON AN INTEL HYPERCUBE* SIAM J. ScI. STAT. COMPUT. Vol. 12, No. 6, pp. 1480-1485, November 1991 ()1991 Society for Industrial and Applied Mathematics 015 SOLUTION OF LINEAR SYSTEMS OF ORDINARY DIFFERENTIAL EQUATIONS ON AN INTEL

More information

UMCS. Annales UMCS Informatica AI 6 (2007) Fault tolerant control for RP* architecture of Scalable Distributed Data Structures

UMCS. Annales UMCS Informatica AI 6 (2007) Fault tolerant control for RP* architecture of Scalable Distributed Data Structures Annales Informatica AI 6 (2007) 5-13 Annales Informatica Lublin-Polonia Sectio AI http://www.annales.umcs.lublin.pl/ Fault tolerant control for RP* architecture of Scalable Distributed Data Structures

More information

Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář

Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář Department ofradim Computer Bača Science, and Technical David Bednář University of Ostrava Czech

More information

3.3 Hardware Parallel processing

3.3 Hardware Parallel processing Parallel processing is the simultaneous use of more than one CPU to execute a program. Ideally, parallel processing makes a program run faster because there are more CPUs running it. In practice, it is

More information

M301: Software Systems & their Development. Unit 4: Inheritance, Composition and Polymorphism

M301: Software Systems & their Development. Unit 4: Inheritance, Composition and Polymorphism Block 1: Introduction to Java Unit 4: Inheritance, Composition and Polymorphism Aims of the unit: Study and use the Java mechanisms that support reuse, in particular, inheritance and composition; Analyze

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

1 Overview. 2 A Classification of Parallel Hardware. 3 Parallel Programming Languages 4 C+MPI. 5 Parallel Haskell

1 Overview. 2 A Classification of Parallel Hardware. 3 Parallel Programming Languages 4 C+MPI. 5 Parallel Haskell Table of Contents Distributed and Parallel Technology Revision Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University, Edinburgh 1 Overview 2 A Classification of Parallel

More information

Link Service Routines

Link Service Routines Link Service Routines for better interrupt handling Introduction by Ralph Moore smx Architect When I set out to design smx, I wanted to have very low interrupt latency. Therefore, I created a construct

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor

More information