I/O in the Gardens Non-Dedicated Cluster Computing Environment

Paul Roe and Siu Yuen Chan
School of Computing Science, Queensland University of Technology, Australia
{p.roe, s.chan}@qut.edu.au

Abstract

Gardens is an integrated programming language and system designed to support parallel computing across non-dedicated cluster computers, in particular networks of PCs. To utilise non-dedicated machines a program must adapt to those currently available. In Gardens this is realised by over-decomposing a program into more tasks than processors, and migrating tasks to implement adaptation. Communication in Gardens is achieved via a lightweight form of remote method invocation. Furthermore, I/O may be efficiently achieved by the same mechanism. All that is required is to support stable tasks which are not migrated; these are effectively bound to resources such as file systems. The main contribution of this paper is to show how I/O may be achieved in an adaptive system utilising task migration to harness the power of non-dedicated cluster computers.

1 Introduction

In aggregate, networks of workstations represent a huge and cheap unused computing resource. By their very nature such non-dedicated cluster computers are dynamic. The workstations available to a computation will typically change during the run of a program as workstation users come and go. Thus programs must adapt to the changing availability of workstations. The Gardens system [6] is an integrated programming language and system targeted at non-dedicated cluster computers. The goals of Gardens are: adaptation, safety, abstraction and performance (ASAP!). These are realised in part by a modern object oriented programming language, Mianjin [5], a derivative of Pascal. Gardens utilises task migration to realise adaptation. A program is over-decomposed into more tasks than processors and tasks are migrated in response to changing workstation loads. This adaptation is transparent to the programmer.
Transparent task migration entails location transparent communication between tasks. Communication between Gardens tasks is achieved through a virtual shared object space. Tasks may perform remote method calls on objects belonging to other tasks. Such method calls are lightweight and asynchronous.

Until recently I/O in Gardens has been handled in an ad hoc manner. The difficulty with I/O is that tasks are mobile, and hence it must be performed in a location transparent manner, in a similar way to communication. However, standard file handles etc. are static and cannot be migrated with tasks. Furthermore, it may be desirable to perform strictly local I/O, for example to open a temporary file or to utilise a file which for performance reasons has been replicated across all machines. The main contribution of this paper is to show how efficient I/O may be achieved in a system utilising task migration to harness the power of non-dedicated cluster computers. In keeping with the Gardens philosophy the mechanisms are simple and enable efficient I/O abstractions to be created.

The remainder of this paper is organised as follows: the next section describes the Gardens programming language Mianjin and in particular its support for a virtual shared object space. Section 3 describes the basic mechanisms and techniques for supporting I/O in an adaptive setting. Section 4 describes how data can be locally cached and how strictly local I/O resources can be utilised. Some preliminary performance figures are reported in Section 5. Section 6 presents related work, and the final section discusses the work and future directions.

2 Overview of Mianjin

Gardens is an integrated programming language (Mianjin) and system to support parallel computation across networks of workstations. Programs are over-decomposed into more tasks than processors and task migration is used to implement load balancing. Multiple tasks are supported within a single operating system process.
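The Gardens load balancer itself is not described in this paper; the following Python sketch (all names hypothetical) merely illustrates the over-decomposition idea above: with more tasks than processors, a reclaimed workstation can be evacuated and its tasks spread over the machines that remain available.

```python
# Sketch (not Gardens code): over-decomposition with migration-based
# load balancing. Tasks on an unavailable processor are migrated to
# the least-loaded available processor, then load is levelled.

class Processor:
    def __init__(self, name):
        self.name = name
        self.tasks = []
        self.available = True

def rebalance(processors):
    """Evacuate unavailable processors, then even out the load."""
    pool = [p for p in processors if p.available]
    # Evacuate processors reclaimed by their interactive users.
    for p in processors:
        if not p.available:
            while p.tasks:
                target = min(pool, key=lambda q: len(q.tasks))
                target.tasks.append(p.tasks.pop())
    # Simple levelling: move one task from the most to the least
    # loaded processor while it reduces the imbalance.
    while True:
        hi = max(pool, key=lambda q: len(q.tasks))
        lo = min(pool, key=lambda q: len(q.tasks))
        if len(hi.tasks) - len(lo.tasks) <= 1:
            break
        lo.tasks.append(hi.tasks.pop())

# 8 tasks over-decomposed onto 3 processors; one is then reclaimed.
ps = [Processor("A"), Processor("B"), Processor("C")]
for i in range(8):
    ps[i % 3].tasks.append(f"task{i}")
ps[0].available = False
rebalance(ps)
print([len(p.tasks) for p in ps])  # A evacuated: [0, 4, 4]
```

Because migration is transparent, the program itself never sees this reshuffling; only the scheduler does.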

Gardens supports a virtual shared object space. Each object belongs to exactly one task which manages it; however, tasks may reference remote objects belonging to other tasks. The Mianjin language distinguishes strictly local object references from potentially global ones, by labelling the latter GLOBAL. A simple example demonstrates the key idea of global objects:

  TYPE Acc = POINTER TO RECORD count, sum: INTEGER END;

  (* a remote method *)
  GLOBAL PROCEDURE (self: Acc) Add (s: INTEGER);
  BEGIN
    self.sum := self.sum + s;
    self.count := self.count - 1;
    (* if last result unblock master task *)
    IF self.count = 0 THEN Unblock END
  END Add;

  POLL PROCEDURE Worker (gsum: GLOBAL Acc);
  VAR localval: INTEGER;
  BEGIN
    (* expensive calculation of localval *)
    ...
    (* global method invocation *)
    gsum.Add(localval)
  END Worker;

  POLL PROCEDURE Master;
  VAR acc: Acc; i: INTEGER;
  BEGIN
    NEW(acc);
    acc.sum := 0;
    acc.count := NTasks;
    (* create worker tasks: Worker(acc) *)
    FOR i := 1 TO NTasks DO Fork(Worker, acc) END;
    (* wait for all results *)
    WHILE acc.count # 0 DO Block END;
    ...
  END Master;

The example implements a form of summation in which a master task accumulates the sum of values contributed by a set of worker tasks. A global object (acc), managed by the master task, accumulates the sum. The master task has full access to the object. Worker tasks can only access acc via its global methods, in this case Add. When a worker task invokes the global method Add, the actual parameters and the method index are communicated to the processor holding the object; there the method is invoked locally on the object. The master task owning and managing the global object (acc) blocks waiting for all local values to be contributed to the object. Local objects, the default, are always located in their referring task's heap. Global objects may be either located in a different task or in the referring task's heap.
Furthermore, a global object may be located in a task on the same processor as the referring task, or on a different processor to the referring task. Thus Mianjin supports location transparent communication via global objects and their associated global methods. Global object references are valid across all machines and hence location independent. This is necessary since tasks may be migrated between processors at run time, hence global object references must remain valid.

3 I/O and Processor Bound Tasks

Since I/O is similar to intertask communication, it is natural to try using intertask communication mechanisms, i.e. global objects, to support I/O. However, unlike tasks, I/O is usually bound to some specific resource such as a file server or particular machine. Such resources are not mobile and should not be load balanced through migration. (Note that if it does make sense to migrate a file with a task then this can be easily achieved.) What is required is a special kind of object which is bound to a resource and which is static, i.e. does not get migrated. To achieve this we extend Gardens with processor bound tasks, which are never migrated from their host machine. Only one such task is required per machine. Objects which are allocated in processor bound tasks are standard global objects except that they are not migrated. We term these processor bound objects. Initially each processor bound task is seeded with a root I/O object, a global object, which is used for initiating all I/O. All tasks are given access to these root I/O objects. Root I/O objects are used for creating other processor bound objects corresponding to resource handles for performing I/O. For example, typical root I/O object operations support opening and creating files. These operations result in file objects (processor bound objects) which support read and write operations. More sophisticated root I/O object operations are used for creating custom resource bound, i.e. processor bound, objects.
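A minimal Python model of this mechanism may help fix the idea (this is not Mianjin or the Gardens API; GlobalRef, RootIO, FileObj and their methods are illustrative names). A global reference dispatches method calls to the owning task, here modelled as a direct local call rather than an asynchronous message; the root I/O object hands back references to processor bound file objects:

```python
import os
import tempfile

class GlobalRef:
    """Location-independent reference to an object owned by some task."""
    def __init__(self, owner_task, obj):
        self.owner_task, self.obj = owner_task, obj

    def invoke(self, method, *args):
        # In Gardens this would be an asynchronous message carrying the
        # method index and arguments to the owning processor; here we
        # model it as a direct local dispatch.
        return getattr(self.obj, method)(*args)

class FileObj:
    """Processor bound object wrapping an OS file handle."""
    def __init__(self, handle):
        self.handle = handle

    def read(self, n):
        return self.handle.read(n)

    def close(self):
        self.handle.close()

class RootIO:
    """Root I/O object seeded into each processor bound task."""
    def __init__(self, bound_task):
        self.bound_task = bound_task

    def open(self, name):
        handle = open(name, "rb")  # OS open happens on the host machine
        return GlobalRef(self.bound_task, FileObj(handle))

# Demo: a mobile task opens and reads a file via the root I/O object
# of processor A's processor bound task.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello gardens")
os.close(fd)

root = GlobalRef("proc-A-bound-task", RootIO("proc-A-bound-task"))
fobj = root.invoke("open", path)
data = fobj.invoke("read", 5)
print(data)  # b'hello'
fobj.invoke("close")
os.remove(path)
```

Because the mobile task only ever holds GlobalRef values, nothing in its state ties it to processor A, which is what makes it safe to migrate.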
For example, see Figure 1 for a description of simple file opening and reading. Thus the only extension necessary to the Gardens system to support I/O in an adaptive setting is processor bound tasks. All objects such as root I/O objects and file objects are standard global objects. They are just allocated in a processor bound task and hence are not migrated. If a task is migrated the existing support for global objects ensures that references to processor bound objects remain valid; see Figure 2. The mechanism allows different I/O abstractions to be coded. Typically the standard OS I/O calls are exposed in a special unsafe interface which I/O objects utilise for

actually performing I/O. In general we do not expect programmers to code I/O objects themselves; rather we want to provide a library of such routines for the programmer to use. Nevertheless, an important part of our philosophy is that such abstractions should be programmable and not built into the system in some special way. This is necessary in order for the system to be truly extensible.

[Figure 1. File Opening and Reading: a mobile task on Processor B calls root.open("foo") (1) on the root I/O object in Processor A's processor bound task, which performs an OS open (2), creates a file object (3) and returns an object reference (4); obj.read (5) then performs an OS read (6) and returns the data (7).]

[Figure 2. Task Migration: a task referencing processor bound objects on Processor A is migrated to Processor B; its references to those objects remain valid.]

From the programmer's perspective I/O is performed in the same way as intertask communication; this simplifies programming by economising on concepts. As described, remote access to I/O resources is rather naive. For serious use, caching of data must be employed. It is possible to cache data on a task by task basis; however, caching on a per processor basis is often best. The following section describes how this may be achieved.

4 Caching and Processor Bound Tasks

Accessing remote files and other resources is useful; however, sometimes the reverse is required. For efficiency it may be desirable to cache some data on the local processor and share it between all tasks on the same processor. For example, a file may have been replicated across all processors for efficiency, or it may be desirable to create a temporary file strictly locally. This goes against the idea of location transparency. However, as described in [4], it is possible to safely make use of location information to optimise communications; the same can be done for I/O. For example, it is safe to test a global object reference to determine which processor hosts the object. Using such techniques a task may select, from a number of I/O resources represented by global objects, a local one to use. Since all the resources are represented by global objects such a test is safe and represents purely an optimisation. In particular, there is no way for a task to access a local resource which becomes unavailable if the task is migrated; all that will happen is that the task will access the resource remotely, which will be inefficient but correct. Thus I/O requests can be sent to a local proxy object which will route requests to the local I/O resource. This is shown in Figure 3.

[Figure 3. Local Resource Map: tasks send resource requests to a proxy object (a processor bound object) which uses a resource map to route each request to a local resource.]

The two techniques may be combined so that remote data may be cached locally for access by all local tasks. Once again we are able to build sophisticated I/O abstractions using a few simple primitives. Furthermore, we are able to utilise locality information to optimise I/O in a safe fashion.

5 Performance

We have some preliminary performance figures based on simple comparisons of standard Unix I/O and I/O using our processor bound objects.
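The safe locality test described above can be sketched in a few lines of Python (illustrative names only; hosting_processor stands in for whatever location query Gardens exposes on global references). The key property is that the test only steers the choice of replica; any replica remains correct even after migration:

```python
# Sketch of the locality optimisation: given global references to
# replicated copies of a resource, a task prefers a copy hosted on its
# current processor; a remote copy is slower but still correct.

class GlobalRef:
    def __init__(self, processor, resource):
        self.processor, self.resource = processor, resource

    def hosting_processor(self):
        # Safe query: reveals location but never invalidates the ref.
        return self.processor

def pick_local(replicas, my_processor):
    """Prefer a replica on this processor; any replica is correct."""
    for ref in replicas:
        if ref.hosting_processor() == my_processor:
            return ref            # fast, strictly local access
    return replicas[0]            # remote access: inefficient but valid

replicas = [GlobalRef(p, f"copy-on-{p}") for p in ("P1", "P2", "P3")]
print(pick_local(replicas, "P2").resource)  # copy-on-P2
print(pick_local(replicas, "P9").resource)  # no local copy: copy-on-P1
```

A local proxy object as in Figure 3 is just this selection performed once, behind a processor bound object, on behalf of every task on the processor.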

  op.      native Unix   GO same proc.   GO diff. proc.
  create       161            219             326
  open          99            158             264
  close         69            110             192
  read         268            309             540
  write        163            199             419

All times are average times in microseconds; reading and writing was of a 1000 byte block, and file names were 16 bytes long. The experiments were performed on two Sun SparcStation-4s connected via Myrinet, using a custom version of AM for GO communication [6]. The difference in times between the native Unix I/O and the processor bound object version may be accounted for by task context switch overhead. Note the processor bound object is managed by a separate task from the referring task. We anticipate improving this performance further. Remote access to processor bound objects requires both communication and context switching, leading to increased access times. For larger blocks the communication time dominates context switch times. The performance figures are reasonable; however, for serious use a more sophisticated implementation using caching is required.

6 Related Work

There are many approaches to distributed systems, and most are built using some kind of RPC, e.g. NFS, or a distributed object system. Metacomputing environments such as Legion [2] and Globus [1] also support distributed systems. The work presented here is unique in dealing with I/O in an environment supporting task migration and in supporting locality optimisation. Dual problems exist in mobile computing, where resources are mobile or dynamically configured, e.g. mobile TCP/IP or Jini [8]. However these systems do not have the same efficiency constraints as a parallel system. There are also clustered web servers which support adaptive utilisation of resources through DNS based load balancing. It is possible to construct a similar system using Java RMI [7]; however Java RMI is synchronous and does not support locality optimisations. Java RMI is also rather slow compared with our optimised global method invocation; it uses a costly serialisation protocol.
Like RMI, our system supports communication between heterogeneous platforms; we also support task migration between heterogeneous platforms. Other more remotely related systems include LDAP (Lightweight Directory Access Protocol) [9], which provides a mechanism for connecting to, searching and modifying internet directories, and remote database access APIs such as OLEDB over DCOM. Parallel I/O is about high performance I/O rather than adaptive I/O, the goal of our work. However we were influenced by MPI-IO, part of MPI-2.0 [3], in that we wanted to use a mechanism analogous to communication for I/O.

7 Discussion

The basic ideas presented in the previous sections have been implemented. This has provided a simple and effective means to perform I/O in Gardens. So far we have not implemented sophisticated caching mechanisms. Generally, unless special facilities are required, I/O in a non-dedicated cluster is usually best performed by the existing distributed file system, e.g. NFS if it exists, since such systems have been heavily optimised. Nevertheless, interfacing with such systems still requires the use of processor bound objects, since NFS file handles are not directly migrable. We have not addressed issues of parallel I/O. This is a complex issue for non-dedicated clusters. Since the set of available workstations changes over time, some form of redundancy is required. A simple method is to replicate all files on all processors; however, this is only valid for small data sets. Another issue concerns interactive I/O. Tasking in Gardens is non-preemptive; this is fine for non-interactive I/O. However for interactive I/O the scheduling of tasks becomes more important. Some preliminary investigation has been started in this area. Related to scheduling is the issue of blocking I/O. The OS typically blocks certain I/O requests until they have been completed.
In Gardens, rather than the OS blocking the current OS thread and running another, we really want to perform a Gardens block and reschedule another Gardens task. Finally, since our system is designed for non-dedicated clusters, if an interactive workstation hosts a file then remote accesses will potentially disturb the interactive user. If I/O intensive computation is required then either resource replication is necessary or some kind of dedicated system, e.g. a file server, must be utilised. Providing the programmer with a uniform model for communications and I/O has been an important achievement of this work. In addition, it came as a pleasant surprise to us that little additional infrastructure is necessary to support I/O in Gardens.

Acknowledgements

We would like to thank other Gardeners for their help and useful discussions concerning I/O in Gardens. This study has been supported by the Gardens research project of the Programming Languages and Systems Research Centre at QUT.

References

[1] Globus. http://www.globus.org/.
[2] Legion. http://legion.virginia.com/.
[3] MPI Forum. MPI-2.0. http://www.mpi-forum.org/.
[4] P. Roe. Adaptive synchronisation: Optimising the locality of collective communications in an adaptive setting. To appear in: Sixth Australasian Conference on Parallel and Real-Time Systems (PART 99), Melbourne, Australia, Nov. 1999. Springer.
[5] P. Roe and C. Szyperski. Mianjin is Gardens Point: A parallel language taming asynchronous communication. In Fourth Australasian Conference on Parallel and Real-Time Systems (PART 97), Newcastle, Australia, Sept. 1997. Springer.
[6] P. Roe and C. Szyperski. The Gardens approach to adaptive parallel computing. In R. Buyya, editor, Cluster Computing, volume 1, pages 740-753. Prentice Hall, 1999.
[7] Sun Microsystems. Java RMI. http://java.sun.com/products/jdk/rmi/.
[8] Sun Microsystems. Jini. http://www.sun.com/jini/.
[9] P. Taylor. Introducing LDAP. Windows NT Systems, pages 47-51, Dec. 1998.