Unibus: A Contrarian Approach to Grid Computing


Dawid Kurzyniec, Magdalena Sławińska, Jarosław Sławiński, Vaidy Sunderam
Dept. of Math and Computer Science, Emory University
400 Dowman Drive, Atlanta, GA 30322, USA
{dawidk,magg,jaross,vss}@mathcs.emory.edu

Keywords: resource sharing, virtualization, aggregation, grids, MPI

Abstract

Despite maturing in many ways, heterogeneous distributed computing platforms continue to require substantial effort in software installation and management for efficient use, often necessitating manual intervention by resource providers and end-users. In this paper we propose a novel model of resource sharing that is a viable alternative to the one commonly adopted in the grid community. Our model, termed Unibus, shifts the resource virtualization and aggregation responsibilities to the software at the client side, taking these burdens away from resource providers. Drawing on parallels with operating systems, we argue that distributed resources may be unified and aggregated at the user's end, in a manner similar to ordinary peripheral devices. Running on the user's access device, the overlay system software can virtualize remote resources via dynamically deployed software mediators analogous to device drivers, reconfiguring the resources if necessary via firmware modules. To illustrate the feasibility of the Unibus model, we have prototyped a development toolkit automating the installation, build, run, and post-processing stages of MPI applications. Through the provided console, this toolkit can deploy and configure an MPI execution environment across a set of heterogeneous, isolated distributed resources, turning them into a coherent virtual machine with a single interface point. We conducted a series of experiments with the NAS Parallel Benchmarks. Results indicate that the toolkit preserves the application performance of bare MPI while substantially reducing maintenance and configuration effort. Overall, the results suggest that the envisioned client-side overlay model for resource sharing may be able to address some long-standing obstacles in building heterogeneous HPC systems.

(Research supported in part by U.S. DoE grant DE-FG02-02ER25537 and NSF grant ACI. An earlier version of the material in Section 5 of this paper was submitted to PDP.)

1 Introduction

Computational resource sharing continues to mature as a mainstream paradigm, and its promise of delivering unprecedented amounts of computing power to the most demanding scientific applications is rapidly becoming a reality [1, 2]. Large-scale collaborations are being formed to exploit the potential of resources shared across administrative and geographical boundaries, and new classes of application domains are being deployed [3]. However, the adoption of grid technologies, defined as frameworks for resource sharing across administrative domains, has been slower than expected. Complexities due to heterogeneity, policy issues, and software installation and management prerequisites are widely postulated causes. While grids have consistently attempted to emphasize ease of use for end-users, they have done so at substantial cost to resource providers and administrators, who are required to install and maintain complex software toolkits and coordinate with other providers to supply aggregated services such as single sign-on or centralized job scheduling.
These logistical burdens introduce a high barrier to entry and elevated maintenance costs [4], detracting from the eventual goal of "wall-socket" computing from which the term "grid" is derived. As a result, the cost-effectiveness of resource sharing is dependent upon, and constrained by, long-term agreements between large institutions. This requirement discourages spontaneous collaborations between smaller groups that frequently pool and share resources, or individual scientists who often have access to departmental or collaborators' computers [5]. Administrators of departmental clusters, for example, may be willing to offer users a login account but may be much

more hesitant to install grid software, run unfamiliar daemons, ensure library version matching, and perform the myriad other tasks that are often needed. Moreover, the coordination overhead is detrimental to scalability, hampering the transition from domain-specific grids into the postulated, self-organizing Grid.

Figure 1: Comparison between the founding models of grids (left), H2O (middle), and Unibus (right). In contrast to the grid model, Unibus delegates resource virtualization and aggregation to the client side.

To address the above issues, we propose a paradigm shift in the sharing model, as depicted in Figure 1. Traditional grids, shown at the left, force providers to collectively virtualize resources and to coordinate in order to provide aggregated capabilities. H2O, the authors' earlier project, shown in the middle, pushes the aggregation and multi-site authentication concerns to the user side, thus promoting provider autonomy; however, providers are still partially responsible for resource virtualization. Finally, the proposed Unibus model, shown at the right, pushes the resource virtualization and aggregation responsibilities to the client side, granting providers complete sovereignty and allowing them to expose resources as they see fit. Virtualization is fully delegated to the client side. This transition cannot come at the expense of end-users, however: they should continue to be able to use the shared resources as a convenient virtual aggregate. Consequently, management burdens taken away from providers must be delegated to the client-side software. Indeed, we

postulate that from the user's perspective, a shared remote resource can and should be treated analogously to a peripheral device connected to the user's computer. Peripheral functionality is delivered by: (1) firmware that resides on the device (and can potentially be uploaded or upgraded by the OS); (2) a device driver that resides in the OS; and (3) OS abstractions, in the form of system libraries, that virtualize this and various other peripherals. Similarly, metasystem configuration software could be automatically uploaded from the client side onto multiple computing resources, with interfaces in the form of service drivers at the point of access, aggregated by client overlay software that virtualizes a collection of distributed resources. In this paper, we present the resulting Unibus model and architecture, followed by a prototypical application to MPI-based metacomputing. After discussing related work, we define the model in detail. We analyze relationships between resource providers and clients, describing access interfaces as well as approaches to aggregation and virtualization. We introduce the concept of the client-side overlay, discussing its applicability to shared-resource metacomputing and contrasting it with the traditional model of virtual organizations commonly used in the grid community. Subsequently, we focus specifically on the overlay, describing its architectural decomposition. We show how the layered, service-based composition approach leads to dynamic adaptation and scalability. We describe how components from different architectural layers cooperate to provide applications and tools with consistent services exposed through standard APIs and language bindings. Next, we discuss the Zero-Force MPI toolkit, a concrete application of the Unibus methodology aimed at transparent client-side aggregation of isolated shared machines into a parallel computational resource.
We show how, through its automated deployment and remote configuration capabilities, the toolkit creates a client-side metacomputer abstraction that shields end-users from the complexities of physical resource distribution and heterogeneity during the build, test, and run phases of the MPI application lifecycle. We describe our experiments with the NAS Parallel Benchmarks, evaluating the viability of the proposed toolkit from the usability and performance points of view in the context of scientific computing. Finally, we conclude by discussing future research directions and opportunities for generalizing the current model and implementation, e.g. through self-organizing scheduling or support for a wider variety of resource

access methodologies and semantics.

2 Related Work

The original motivation behind grid systems was to facilitate access to shared computational resources by application scientists. In several fields and disciplines, this vision has been successfully realized through Grid Portals [6, 7], giving field researchers instant access to vast amounts of data (e.g. BIRN [2], NIEHS Environmental Health Science Data Resource Portal [8], GEON [9]) and/or computational power (e.g. Cactus [10], P-GRADE NGS Portal [11], NGS [12]). In enabling this, however, new responsibilities related to the setup and management of grid toolkits, establishing coordinated virtual organizations (VOs), dealing with inter-VO security and access, and adapting to unpredictable resource variations have been imposed on grid administrators and resource providers. Well-known grid systems [13, 14, 15], focusing on institutional end-users, large-scale resources, and long-lasting sharing agreements, have taken a "one size fits all" approach, requiring substantial administrative and coordination effort. However, many potential users of shared resources do not fit this mold; for instance, they might want to collaborate on a more ad-hoc, peer-to-peer basis. Such uncoordinated sharing scenarios motivate more lightweight, dynamic, and adaptive middleware stacks. Our earlier project, H2O [16, 17], aims to address these requirements by decoupling resource providers from each other, from users, and from service deployers. H2O makes clients responsible for resource aggregation, and it integrates with P2P technologies [18] in order to reduce the need for inter-provider coordination. However, providers are still partially responsible for resource virtualization and are required to run a dedicated daemon (the H2O kernel) that involves some initial setup and maintenance. Unibus builds upon H2O's conceptual foundations, but it takes a more radical approach to resource sharing.
It delegates the virtualization concerns to the client side as well, and it increases the role of software in dealing with resource heterogeneity by autonomously deploying and reconfiguring distributed software modules when necessary. From the autonomic computing perspective [19], projects such as Organic Grid [20] have suggested novel ways of deploying self-scheduled computations; however, they are restricted to fully-encapsulated, strongly-mobile agents. A similar approach has been adopted by Personal Grid [21], but again with limitations on the types of applications that may be deployed. C0PE [22] reduces the costs of system administration of personal environments via the concept of self-configured devices, but is not targeted at compute aggregation. A novel remote execution engine has been proposed by REX [23]; however, it is also not oriented towards providing aggregated views of the underlying resources. GridShell [24] extends commonly used login shells to hide the complexity of grids from end-users, but requires a grid toolkit installation. Similarly, Nimrod/G [25] defines a specialized client-side API and allows execution of distributed shell commands, but over a previously installed grid. Eclipse PTP [26] aims to provide an IDE for parallel applications, emphasizing portability, support for integration of parallel tools, and simplification of end-user interaction with parallel systems through integration of tools such as GUI-based parallel debuggers. We believe that our effort is complementary, as it can ultimately provide Eclipse PTP with plugins for resource sharing. There are also initiatives such as SciDAC SSS [27] that focus on standardization of interfaces among software components used in HPC. While the standardization effort brings obvious advantages to the community, it takes a long time for standards to disseminate from the standardization body, through vendors, system administrators, and tool and library developers, to end users. In contrast, the approach presented in this paper can be immediately applied to existing HPC systems. It is estimated that up to 30% of the overall scientific application development effort can be consumed by build, deployment, runtime, and post-processing phases [28, 29].
We believe that our approach to HPC resource sharing can potentially increase development productivity by delegating some of these efforts to the software, and thus (partially) relieving resource providers and administrators of the associated burdens.

3 Unibus Model

In our proposed sharing model (Figure 2), providers make resources available via (server-side) network daemons, which we call access daemons. Well-known, universally available examples are

sshd and ftpd, which expose access to computational resources and storage, respectively. Other examples include Globus GRAM [30] and the H2O kernel. From the usage perspective, an access daemon is fully characterized by its network endpoint and access protocol stack. Resource capabilities are exposed to clients (software agents acting on behalf of users) solely via these endpoints.

Figure 2: The Unibus model.

Each resource belongs to an administrative domain that defines a consistent set of access policies. Each provider controls one or more administrative domains. Typically, prior to accessing a resource, a client must authenticate (on behalf of its user) and obtain appropriate authorization in accordance with the domain policy. Clients may also require daemons to authenticate (on behalf of providers); service authorization is then performed according to the user's policy. Such mutual authentication is especially relevant in peer-to-peer sharing scenarios, where users are simultaneously providers. Authentication is typically performed in one of the layers of the access protocol stack, such as SSL or IPsec, whereas authorization may be performed in the same layer or in some higher layer. For instance, a peer-to-peer system may create an authenticated overlay routing network while implementing services that are subject to authorization in higher communication layers. A user space is defined as the collection of resources that a given user is authorized to access.
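The entities just introduced (endpoints, access protocols, administrative domains, and user spaces) can be captured in a minimal data-model sketch. All names below are our own illustration, not part of any Unibus API, and the authorization predicate merely stands in for the domain-specific authentication and authorization mechanisms discussed above:

```python
from dataclasses import dataclass

# Illustrative model: an access daemon is fully characterized by its
# network endpoint and access protocol stack; each resource belongs to
# an administrative domain governed by its own access policy.

@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int
    protocol_stack: tuple  # e.g. ("tcp", "ssl", "ssh")

@dataclass
class Resource:
    name: str
    endpoint: Endpoint
    domain: str  # administrative domain the resource belongs to

def user_space(resources, authorized):
    """A user space: the resources a given user is authorized to access.

    `authorized(resource)` is a placeholder predicate; Unibus deliberately
    leaves the actual policy mechanism to each provider.
    """
    return [r for r in resources if authorized(r)]

# Example: two domains, only one of which this user may access.
cluster = Resource("node1", Endpoint("node1.cs.example.edu", 22, ("tcp", "ssh")), "cs-dept")
archive = Resource("tape0", Endpoint("ftp.example.org", 21, ("tcp", "ftp")), "library")
space = user_space([cluster, archive], lambda r: r.domain == "cs-dept")
```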

User spaces usually span multiple administrative domains and may exhibit location and temporal dependencies. The user of a distributed system may access its services from a variety of access devices, ranging from workstations to PDAs to specialized embedded devices. We focus on general-purpose access devices, such as laptop or desktop computers, running off-the-shelf operating systems. The proposed model assumes an adaptive middleware layer that runs as an overlay operating system on a user's access device. The system is intended to provide APIs supporting seamless and dynamic aggregation and reconfiguration of both local and remote resources. We once again emphasize the difference between the proposed client-side overlay model and the grid virtual organization model, as illustrated in Figure 1. The VO model requires providers to adhere to common protocols and middleware, and to synchronize authorization policies. The client-side overlay model, driven by the belief that resource sharing for aggregated-use applications will proliferate if providers are relieved of such burdens, accords resource providers full freedom in the choice of sharing technologies and authorization policies. However, this absence of provider-side resource virtualization implies that the overlay system must handle resource heterogeneity at the client side. Our approach to accomplishing this involves dynamically and automatically deployable, resource-specific plugins, or service drivers, that we term mediators. Essentially, mediators play a role identical to device drivers in a traditional operating system: they are intended to provide virtualized resource views to the higher system layers, abstracting away access protocol details. Many access daemons, e.g. sshd, expose computational capabilities, defined as those that enable launching user-specified programs on back-end machine(s). Another, potentially overlapping set of daemons, including e.g.
GridFTP, or again sshd with its SFTP subsystem, yields access to storage. Using a combination of both, Unibus mediators may be able to deploy ("soft-install") software services on remote resources, in a manner analogous to upgrading a device's firmware. We note that those soft-installed services may themselves have communication and/or self-management capabilities; in the extreme, we envision that this scheme may allow the overlay system to self-deploy, as a whole or in part, across computational resources. Consequently,

arbitrarily complex distributed systems can be assembled using this scheme in its most general form. The described model delivers numerous benefits: relieving providers of the need to install and maintain metasystem software (through automatic installation and updates), eliminating the need for inter-provider coordination (by virtualizing aggregation only at the client interface), adapting to resource additions, failures, and disconnects (by isolating events within the drivers and overlay), and retaining client-centricity and transparency (via unifying resource abstractions built into the overlay software, and by integrating with the user's OS interface).

4 Unibus Architecture

As previously mentioned, the Unibus model is realized as a software overlay running on the client access device, optionally complemented by soft-installed remote components, with the goal of virtualizing and aggregating remote resources accessible to the user. The overlay provides operational services to applications via appropriate APIs and language bindings. This section discusses the Unibus architecture, illustrated in Figure 3, in more detail.

Resource abstraction layer: The lowest level of the Unibus overlay consists of mediators, which virtualize resources by translating their access protocols into capabilities that are exposed to clients. For instance, the SSH mediator exposes computational and storage capabilities, implementing them as requests carried over the SSH protocol (with the storage capabilities exposed via the SFTP subsystem). By implementing standardized capabilities, mediators are able to abstract away the heterogeneity of resources and their access protocols. However, the model also allows non-generic resource functionality to be exposed as non-standard capabilities, as the Weather Forecast service example in Figure 3 suggests.
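To make the mediator concept concrete, the following sketch shows one way a mediator could advertise standardized capabilities to higher layers. The class, method, and capability names are hypothetical simplifications of ours, not the actual Unibus interfaces:

```python
# Hypothetical mediator abstraction: a mediator translates a resource's
# access protocol into named capabilities, much like a device driver.

class Mediator:
    """Base class: a mediator exposes a set of named capabilities."""
    def capabilities(self):
        return {name: getattr(self, name)
                for name in getattr(self, "CAPABILITIES", ())}

class SshMediator(Mediator):
    # Standard capabilities carried over the SSH protocol: computation
    # via the exec channel, storage via the SFTP subsystem.
    CAPABILITIES = ("execute", "read_file", "write_file")

    def __init__(self, endpoint):
        self.endpoint = endpoint  # connection details, kept abstract here

    def execute(self, command):
        raise NotImplementedError("would run `command` over an SSH exec channel")

    def read_file(self, path):
        raise NotImplementedError("would fetch `path` via SFTP")

    def write_file(self, path, data):
        raise NotImplementedError("would upload `path` via SFTP")

# Higher layers discover what a resource can do without knowing the
# underlying protocol:
caps = SshMediator("host.example.edu:22").capabilities()
```

A non-standard capability (such as the Weather Forecast example) would simply appear as an additional entry in the capability set, which clients may query for by name.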
Since mediators perceive resources through access protocols, they typically exhibit fewer dependencies on vendor- and instance-specific details than is the case with drivers in traditional operating systems. For instance, the NFS mediator may be able to handle any shared NFS filesystem, regardless of the vendor or platform. Furthermore, by relying

on standard client-side libraries (e.g. sshlib) that provide APIs on top of common protocols, mediators can become relatively thin, lightweight, and typically stateless adapters between these libraries and the Unibus service model.

Core services layer: While mediators provide virtualized access to individual resources, core services focus on aggregation of resources of similar types, and project unified abstractions as higher-level capabilities. For instance, the Process Control service is intended to group computational services and allow clients to stage and run multiple program copies on several machines, as well as to aid in resource allocation. For instance, the service may issue queries (based on service properties) in order to identify resources available to perform the task at hand (e.g. resources with a platform type compatible with the application). The Unified File System (UFS) service is intended to combine storage resources and project them as a single file system abstraction, allowing applications to perform I/O operations across heterogeneous, distributed stores. Additionally, certain higher-level services, such as transparent replication, can be addressed at this level.

Figure 3: The Unibus architecture, illustrating potential services and usage scenarios.

Our envisioned approach to dynamic reconfiguration involves publish-subscribe, topic-based event notification mechanisms. For instance, when a discovery service detects a new file share available in the network, it deploys a mediator to handle it. This causes an event notification to

be delivered to the UFS service, which responds by embracing the new share into the unified file system. Similarly, a network failure detected by a mediator may cause an event to be sent to a status service, which can implement a recovery policy (e.g. attempting to reconnect, or redirecting traffic to another replica), or propagate the failure event and revoke access to the resource, e.g. by removing the share from UFS.

Service binding layer: The functionality of the Unibus overlay is provided by dynamically loadable services; the responsibilities of the core API are confined to managing the hosting environment. Applications and higher-level libraries communicate with hosted services through their specific interfaces, via inter-process calls. To provide applications and libraries with a coherent invocation methodology, we envision the binding layer, responsible for defining representations of the component model in different programming languages. Given that representation, generation of service stubs (either statically or at run time) can be automated by appropriate tools.

Application adaptation layer: This layer includes system software and libraries that provide added value over services supplied by Unibus by implementing or emulating higher-level programming abstractions and environments. A simple example is an I/O library that interfaces to the UFS service and extends standard OS libraries by supporting advanced filesystem features such as scheduled direct resource-to-resource transfer or data query/lookup. As another example, we envision the possibility of building an MPI library that uses Unibus as a process control and communication substrate.

Applications and tools layer: The top-most layer in the Unibus architecture consists of applications that use Unibus APIs either explicitly or via the adaptation-layer software.
Examples of the former may include the Unibus GUI and command-line tools; an example of the latter is an MPI program running on a remote computational resource, using an MPI library implemented on top of Unibus.
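The event-driven reconfiguration described for the core services layer can be illustrated with a minimal topic-based publish-subscribe bus. This is a generic sketch of the mechanism only; the topic names and wiring are our own illustrative assumptions, not the actual Unibus notification service:

```python
from collections import defaultdict

# Minimal topic-based publish-subscribe bus of the kind the overlay
# could use for dynamic reconfiguration.

class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)

# Example wiring: a discovery service announces a new file share, and a
# UFS-like service responds by embracing it into the unified file system.
bus = EventBus()
mounted = []
bus.subscribe("share.discovered", lambda ev: mounted.append(ev["path"]))
bus.publish("share.discovered", {"path": "//fileserver/projects"})
```

A status service handling failure events would subscribe to a different topic (e.g. "resource.failed") on the same bus, keeping services decoupled from the mediators that emit the events.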

5 Example Manifestation: Zero-Force MPI

Particularly interesting applications of the Unibus model are those related to resource-shared metacomputing. Traditionally, heterogeneous shared resources are virtualized in coordination by their providers, who install and configure the agreed-upon middleware stacks. The Unibus model shifts those responsibilities to the client-side overlay, making it possible to transform generic, uncoordinated, individually shared resources into a coherent metacomputing platform, accessed centrally from the user's access device. As a specific proof-of-concept exemplification of this methodology, we have developed the Zero-Force MPI (ZF-MPI) toolkit, aimed at such compute aggregation for MPI-based applications. The toolkit aids the user in the build, deployment, run-time, and post-processing phases through automatic software deployment and data replication capabilities. ZF-MPI is based on FT-MPI [31], which we have chosen for: (1) its small size (the gzipped source is about 800 KB), (2) full MPI 1.2 compliance, (3) a robust autoconf-based installer supporting a wide range of target architectures, and (4) fault tolerance features (especially important in loosely-coupled environments). The following subsections describe the implementation of system components in more detail.

5.1 ZF-MPI Architecture

At the client side, ZF-MPI is manifested as an interactive console, presenting application scientists with a unified and coherent interface for accessing underlying computational resources in a location-transparent manner. Through the console, users can add remote resources to their distributed virtual machine (causing automatic remote deployment of the FT-MPI environment), replicate source files, makefiles, and data, perform (parallel) compilation, and launch the application. Resources are virtualized via Unibus mediators specific to their access protocol stacks. Although the model allows the use of diverse access protocols that interface, e.g.,
with batch scheduling systems, the current ZF-MPI prototype supports only the SSH2 [32] mediator. SSH2 enables remote program execution through its exec channel. This feature can be exploited to (1) install (e.g. via configure/make) the user's software on a remote resource and (2) execute code on that resource.
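These two uses of the exec channel, soft-installing software via configure/make and then executing it, can be sketched over an abstract command channel. The `RemoteBuilder` class and the command sequence below are illustrative assumptions of ours, not the toolkit's actual code; a recording function stands in for a real SSH2 exec channel:

```python
# Sketch of the exec-channel pattern: the `channel` argument is any
# callable mapping a shell command string to an exit status, standing in
# for an SSH2 exec channel provided by an SSH library.

class RemoteBuilder:
    def __init__(self, channel):
        self.channel = channel  # callable: channel(command) -> exit status

    def install(self, src_dir, prefix):
        # (1) build and install the user's software on the remote resource
        for cmd in (f"cd {src_dir} && ./configure --prefix={prefix}",
                    f"cd {src_dir} && make && make install"):
            if self.channel(cmd) != 0:
                raise RuntimeError(f"remote command failed: {cmd}")

    def run(self, executable, *args):
        # (2) execute code on that resource
        return self.channel(" ".join((executable,) + args))

# A recording fake in place of a real SSH connection:
log = []
builder = RemoteBuilder(lambda cmd: (log.append(cmd), 0)[1])
builder.install("~/ftmpi-src", "~/ftmpi")
builder.run("~/ftmpi/bin/startd")
```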

Moreover, via the SFTP subsystem of SSH2, it is possible to transfer files in both directions, i.e., uploading necessary software ("upgrading the device firmware") as well as the user's program and data files, and retrieving application output. The described architecture allows scientists to interact with virtualized, aggregated, heterogeneous resources through the ZF-MPI console as if they worked with a local system. At the same time, resource provider responsibility is confined to granting access (currently via an SSH login account). The system automates cumbersome tasks such as MPI environment configuration, uploading and compiling computational applications, and input data staging / results collection.

5.2 ZF-MPI Console

In its current version, the ZF-MPI console allows users to: (1) assemble the distributed virtual machine (DVM); (2) synchronize files, such as the application sources, makefiles and scripts, or the input data; (3) invoke shell commands on remote resources (in parallel), e.g. to compile the user's applications; and (4) launch compiled applications.

DVM Assembly: The add command can be used to add a computational resource to the DVM. The command requires the user to specify the target host name and supply the standard login credentials (e.g. ssh username/password). Upon successful login, the console automatically uploads (pushes) the FT-MPI distribution to the remote resource, decompresses it, installs it by setting the appropriate environment variables and then calling configure/make, and finally starts it by launching the appropriate daemons. We note that the upload phase may stress the potentially low-bandwidth network connection of the client machine. In the future, we will explore alternative schemes, e.g. instructing the remote host to download the software directly from a remote repository if appropriate enabling software (e.g. wget) is installed. Moreover, the add command must be scalable, i.e.,
adding thousands of resources should be no more difficult than adding a few. This can be achieved by referencing external, separately generated resource lists, and by using patterns in resource address specifications.
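As an illustration of pattern-based resource specification, the sketch below expands bracketed numeric ranges in host names on the client side. The `node[01-03]` syntax is our own assumption for illustration, not a documented ZF-MPI feature:

```python
import re

# Expand a bracketed numeric range in a resource address, preserving
# zero padding, so that one pattern can name many hosts.

def expand(pattern):
    m = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if not m:
        return [pattern]  # plain address: nothing to expand
    lo, hi = m.group(1), m.group(2)
    width = len(lo)  # keep leading zeros, e.g. 01, 02, ...
    return [pattern[:m.start()] + str(i).zfill(width) + pattern[m.end():]
            for i in range(int(lo), int(hi) + 1)]

hosts = expand("node[01-03].cluster.example.edu")
```

With such expansion, a single add command could enroll an entire rack of machines at once.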

Data Synchronization: The ZF-MPI console allows users to synchronize source files across the DVM nodes with the sync command. For example, when the user locally modifies a makefile or application source files, sync can be used to propagate these changes across all hosts enrolled in the DVM. The current prototype implementation simply uses file modification time to resolve synchronization conflicts. In the future, we will investigate more sophisticated schemes; one possibility is to build a complete distributed file system on top of SFTP shares, or to adapt existing solutions such as Coda [33] or Ivy [34].

Compile and Build: The console allows users to invoke standard shell commands and remote tools, such as configure or make, which are then executed in parallel on all DVM nodes. In the future, we are also considering extending this mechanism by integrating with existing toolkits such as Parallel UNIX Commands [35].

Application Launch: The console supports a custom, client-side version of mpirun that can be used to launch the MPI program on (a subset of) nodes in the DVM.

5.3 Example Session

A sample ZF-MPI console session is presented in Figure 4. The scenario involves running one of the NAS Parallel Benchmarks [36] (BT) on a DVM consisting of a collection of Sun Solaris and Linux workstations. In lines 1-2, the heterogeneous DVM is defined. In line 3, the user launches the FT-MPI Name Service on the compute host. The next command (line 4) instantiates the FT-MPI DVM by launching the necessary daemons. Line 5 shows how the current DVM status can be obtained. The next command (line 6) in effect replicates the contents of the local directory across directories on the remote hosts constituting the DVM. The next three shell commands cause the application to be compiled and deployed. Although they look as if they are invoked locally, these shell commands are in fact executed in parallel on all hosts registered in the ZF-MPI DVM.
Finally, the compiled program is launched, the results are written to the output file, and the FT-MPI console is halted.

 1 zf-mpi> add ft_mpi
 2 zf-mpi> add ft_mpi
 3 zf-mpi> ft_mpi setns compute
 4 zf-mpi> ft_mpi add ALL
 5 zf-mpi> ft_mpi console conf
 6 zf-mpi> sync ~/NPB3.2.1/NPB3.2-MPI ~/zf-mpi/
 7 zf-mpi> cd ~/zf-mpi/npb3.2-mpi
 8 zf-mpi> make bt NPROCS=8 CLASS=B
 9 zf-mpi> mv bin/bt.B.8 $HARNESS_BIN_DIR/$HARNESS_ARCH/
10 zf-mpi> ft_mpi ftmpirun compute -np 8 -o bt.B.8 > log
11 zf-mpi> cat log | grep "Time in seconds"
12 zf-mpi> ft_mpi console haltall

Figure 4: A sample session with the ZF-MPI console

Console commands can be divided into four categories: (1) pure ZF-MPI commands (add, sync); (2) commands related to FT-MPI (ft_mpi setns, ft_mpi add); (3) commands related to the FT-MPI console (ft_mpi console); and (4) shell commands (cd, mv, cat). We note that in the absence of Zero-Force MPI or a similar parallel command mechanism, steps 1-9 above would have to be performed on each host separately and by hand.

5.4 Heterogeneity Issues

The prototype ZF-MPI console has been implemented in Java, and thus it can run on any Java-enabled client platform. As the implementation of the SSH2 protocol, we have used the popular JSch [37] library. However, a potential portability issue stems from incompatibilities between file system attributes on different systems. For instance, when the ZF-MPI console is used to transfer files from a Windows-based laptop to UNIX back-end machines, UNIX shell scripts (e.g. configure) must be identified as such in order for their executable flag to be properly set. We have applied a simple heuristic to preserve the executable status across file systems: for instance, when copied from Windows to UNIX, a text file starting with "#!" is considered to be executable. Another interesting issue is related to software and compiler dependency resolution. The compiler suite used to build the application should be the same as the one used to build the MPI system;

If the suites differ, linking problems are likely to occur. Therefore, the choice of compiler suite must be made as early as when the FT-MPI environment itself is compiled. For example, on the Sun workstations we used, both the GNU and the native Solaris compiler suites were available, but the GNU suite did not include a Fortran compiler. To provide Fortran support (needed, e.g., to compile the NPB tests), we configured the console to prefer the native suite by default (e.g. via CC=cc and similar definitions) when interfacing to Solaris. In the future, we envision that the choice may be offered to the user when the console auto-detects multiple alternative compiler suites.

5.5 Experimental Evaluation

Once Zero-Force MPI deploys the FT-MPI DVM, it gets out of the way and does not perturb application execution. We therefore expected no performance impact from using the toolkit rather than bare FT-MPI. To verify that this is indeed the case, and to evaluate the feasibility of the proposed methodology on realistic problems, we performed a series of experiments. For the tests, we chose the NAS Parallel Benchmarks 3.2 for MPI [36] suite because: (1) they cover a representative spectrum of HPC application classes, and (2) they are implemented in languages representative of HPC programming (Fortran and C). We ran experiments on the Sun Solaris platform and in a heterogeneous environment consisting of Solaris and Linux/i86 machines. We chose Solaris as a challenging, non-GNU-based UNIX system with non-standard command syntax, significantly exercising the heterogeneity support in the toolkit.

5.5.1 Performance Tests

An initial series of NPB tests was executed on a homogeneous cluster of Sun Blade 2500 workstations. Each workstation had two 1280 MHz UltraSPARC-IIIi processors with 1 MB of cache per processor, 2 GB of RAM, and a 160 MHz system clock.
All machines were connected directly to 100 Mb/s HP network switches and ran the SunOS 5.10 operating system. Benchmarks were compiled in size class A. All tests were repeated 10 times to obtain reliable results. The BT and SP tests were run on 1, 4, 9, and 16 processors, and the remaining tests on 1, 2, 4, 8, and 16 processors. The FT-MPI virtual machine consisted of the same 19 nodes for both FT-MPI and ZF-MPI, with up to 16 nodes used for the benchmarks. Comparison tests were launched in the same sequence and on the same nodes. During the tests, the FT-MPI Name Service ran on an additional dedicated workstation. Figures 5-12 present the resulting execution times in seconds. The relatively poor scalability (except for the embarrassingly parallel benchmark) can be attributed to the small problem sizes and the slow network [38]; the worst scalability was observed for the throughput-bound (FT) and latency-bound (IS, CG) benchmarks. Since the network was non-dedicated, we observed some variation in execution times from run to run. Nonetheless, the results confirm that ZF-MPI has no measurable influence on application performance, with the small fluctuations for communication-bound and short-running benchmarks (Figures 10, 11, 12) falling within the range of statistical error.

5.5.2 Heterogeneity Tests

To demonstrate that the ZF-MPI toolkit works as expected in heterogeneous environments, we conducted another experiment. We selected 4 general-purpose PCs (Pentium 4 at 2.4 GHz, 2.8 GHz, 2.6 GHz, and 2.4 GHz, with 1 GB, 1 GB, 768 MB, and 1 GB of RAM, respectively) running the Linux Mandriva 2006 operating system (kernel ), and coupled them with 5 Sun Blade 2500 workstations from the previous experiment. The PCs had Gigabit Ethernet network interfaces and were connected to the Sun machines via 100 Mb/s Ethernet switches. None of the nodes had MPI or the benchmark codes preinstalled. The FT-MPI VM was assembled in a consistent order (alphabetically by host name): 2 Linux machines, followed by 5 Sun machines, and then another 2 Linux machines. The tests were compiled in size class B for 8 or 9 processes. Each test was executed 5 times, once through the ZF-MPI toolkit and a second time using bare FT-MPI. The results of this experiment are presented in Figure 13.
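The consistent assembly order mentioned above can be sketched as a simple deterministic sort; the host names below are hypothetical stand-ins that reproduce the 2 Linux / 5 Sun / 2 Linux layout from the experiment:

```python
def assembly_order(hosts):
    # Sort host names alphabetically so that repeated runs assemble the
    # FT-MPI virtual machine with an identical, reproducible node order.
    return sorted(hosts)

# Hypothetical host names: 2 Linux machines sort first, then the 5 Sun
# workstations, then 2 more Linux machines.
PLATFORM = {
    "ajax": "linux", "brutus": "linux",
    "sb01": "solaris", "sb02": "solaris", "sb03": "solaris",
    "sb04": "solaris", "sb05": "solaris",
    "tango": "linux", "viper": "linux",
}
```

Sorting by name rather than by platform interleaves the architectures, which is exactly what makes the run a meaningful heterogeneity test.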
As we expected, and consistent with the results of the previous experiment, ZF-MPI does not affect the performance of MPI applications. From the end-user perspective, however, it significantly reduces the effort required for common tasks such as the deployment and execution of computational applications. A potential source of startup overhead in ZF-MPI is the need to upload and install FT-MPI on the target machines. These phases depend heavily on network and CPU speeds, but typically take seconds for the upload (the transfer of an 800 KB file) and minutes for the installation (about 2 minutes on our testbed Sun Blade workstations). We note, however, that these steps are performed only once per node (and can run in parallel), so their cost is amortized across application executions.

6 Conclusions and Future Work

Existing resource sharing platforms invariably make providers responsible for resource virtualization and aggregation. Providers are required to set up their resources for sharing using specific technologies, commonly via heavyweight software toolkits, and to coordinate among themselves in order to offer users aggregated services. These logistical and coordination burdens discourage providers from sharing, and are cost-effective only in settings based on long-term agreements governing large, static pools of resources. In this paper, we propose to address these issues through a novel model of resource sharing, based on the concept of a client-side overlay, as contrasted with the traditional grid model of virtual organizations. The proposed model, termed Unibus, aims to partially relieve providers of administration burdens by delegating them to client-side software. The overlay software can adapt to resource heterogeneity and temporal variability by autonomously reconfiguring itself and the remote resources through dynamically deployable software modules. In the context of metacomputing, the Unibus model permits clients to aggregate decoupled, individually shared resources into a coherent and conveniently accessible computational platform. The overlay software assumes responsibility for deploying the necessary coordination and communication middleware onto the shared hosts. We have described the Zero-Force MPI toolkit, a prototype exemplification of the Unibus model targeted at such compute aggregation for MPI-based parallel computing.
We have shown that with ZF-MPI, a user who has been granted access to remote resources can efficiently use them to execute MPI applications in a semi-automated fashion, without extra support from resource providers. Our experiments with the NAS Parallel Benchmark suite on a heterogeneous collection of resources show promising results and demonstrate the feasibility of the proposed approach.

Although our current software prototype can interface only to SSH2-accessible resources, the Unibus model is designed (through the mediator abstraction layer) to accommodate other access methods with varying semantics. For instance, when a toolkit such as ZF-MPI interfaces to a batch scheduling system, the mediator can delegate interactive activities (such as program compilation) to service nodes, implement the synchronized filesystem abstraction via site-specific data synchronization commands, and translate execution requests (made by the client-side mpirun in this case) into batch jobs submitted to the queuing system on the server side.

Ultimately, in aiming to make uncoordinated, global-scale resource sharing a reality, the Unibus model is only a first step, and it raises numerous new challenges and pitfalls. Centralized resource scheduling must be replaced by decentralized, self-organizing approaches, perhaps involving compensation [39], in which client agents negotiate access on behalf of the user with multiple uncoordinated provider agents. Traditional resource allocation schemes, based on dedicated information and discovery services, must yield to decentralized, globally scalable mechanisms, perhaps based on peer-to-peer technologies. Single sign-on capabilities, provided via coordinated authentication servers, must be replaced by sophisticated, decentralized credential management mechanisms in order to maintain user-centricity without compromising provider autonomy. Finally, soft-installation approaches must be generalized and further developed to support arbitrary codes and dynamic dependency resolution. Notwithstanding these difficulties, we believe that the Unibus client-side overlay model may, in the long term, overcome some of the fundamental deficiencies of the virtual organization approach.

References

[1] Y. Guo, J. G. Liu, M. Ghanem, K. Mish, V. Curcin, C. Haselwimmer, D. Sotiriou, K. Muraleetharan, and L. Taylor, Bridging the Macro and Micro: A Computing Intensive Earthquake Study Using Discovery Net, ACM/IEEE SC 2005 Conference (SC 05), Nov., Seattle, USA.

[2] Biomedical Informatics Research Network.

[3] K. Yurkiewicz, Sciences On The Grid, Symmetry, vol. 02, Nov., symmetrymag.org/pdfs/200511/sciences_on_the_grid.pdf.

[4] J. C. Werner, How to succeed using grid in High Energy Physics experiments, tech. rep., High Energy Physics, University of Manchester. Available at man.ac.uk/u/jamwer/esci2005.pdf.

[5] J. Chin and P. V. Coveney, Towards Tractable Toolkits for the Grid: A Plea for Lightweight, Usable Middleware, Tech. Rep. UKeS, UK e-Science, nesc.ac.uk/technical_papers/ukes pdf.

[6] Open Grid Portals: Portals, Portlets and the Grid, opengridportals.org/.

[7] G. Fox, D. Gannon, and M. Thomas, Grid Computing, ch. 20: Overview of Grid Computing Environments.

[8] National Institute of Environmental Health Science, U.S. National Institutes of Health.

[9] Enabling Scientific Discoveries and Improving Education in Geosciences through Information Technology Research.

[10] Center for Computation and Technology HPC Portal. edu.

[11] NGS P-GRADE.

[12] National Grid Service.

[13] Globus Alliance.

[14] Legion.

[15] A. S. Grimshaw, M. A. Humphrey, and A. Natrajan, A philosophical and technical comparison of Legion and Globus, IBM Journal of Research and Development, vol. 48, no. 2.

[16] D. Kurzyniec, T. Wrzosek, D. Drzewiecki, and V. Sunderam, Towards self-organizing distributed computing frameworks: The H2O approach, Parallel Processing Letters, vol. 13, no. 2.

[17] The H2O Project.

[18] P. Jurczyk, M. Golenia, M. Malawski, D. Kurzyniec, M. Bubak, and V. S. Sunderam, A System for Distributed Computing Based on H2O and JXTA, in Cracow Grid Workshop 2004, Kraków, Poland.

[19] IBM, Practical Autonomic Computing: Roadmap to Self Managing Technology. Whitepaper.pdf, Jan.

[20] A. J. Chakravarti, G. Baumgartner, and M. Lauria, The Organic Grid: Self-Organizing Computation on a Peer-to-Peer Network, IEEE Trans. on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 35, May.

[21] J. Han and D. Park, A Lightweight Personal Grid Using a Supernode Network, in Proc. of the 3rd International Conference on Peer-to-Peer Computing, Linköping, Sweden, IEEE Computer Society.

[22] P. Yalagandula, L. Alvisi, M. Dahlin, and H. Vin, C0PE: Consistent 0-Administration Personal Environment, in Proc. of the Sixth International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS 01), IEEE Computer Society.

[23] M. Kaminsky et al., REX: Secure, Extensible Remote Execution, in Proc. of the 2004 USENIX Annual Technical Conference (USENIX 04), Boston, Massachusetts, USA, June.

[24] E. Walker, T. Minyard, and J. Boisseau, GridShell: A Login Shell for Orchestrating and Coordinating Applications in a Grid Enabled Environment, in Proc. of The International Conference on Computing, Communications and Control Technologies, Austin, Texas, USA.

[25] D. Abramson, J. Giddy, and L. Kotler, High Performance Parametric Modeling with Nimrod/G: Killer Application for the Global Grid?, in Proc. of the 2000 International Parallel and Distributed Processing Symposium, Cancun, Mexico, IEEE Computer Society.

[26] Eclipse.org.

[27] Scalable Systems Software for Terascale Computer Centers, scidac.org/scalablesystems.

[28] P. F. Dubois, G. K. Kumfert, and T. G. W. Epperly, Why Johnny can't build, Computing in Science and Engineering, vol. 5, Sep/Oct.

[29] G. K. Kumfert and T. G. W. Epperly, Software in the DOE: The hidden overhead of the build, Tech. Rep. UCRL-ID, Lawrence Livermore National Laboratory.

[30] K. Czajkowski et al., A Resource Management Architecture for Metacomputing Systems, in Proc. of the IPPS/SPDP 98 Workshop on Job Scheduling Strategies for Parallel Processing, Orlando, FL, USA.

[31] G. Fagg, E. Gabriel, G. Bosilca, T. Angskun, Z. Chen, J. Pjesivac-Grbovic, K. London, and J. Dongarra, Extending the MPI specification for process fault tolerance on high performance computing systems, in Proceedings of ISC2004, Heidelberg, Germany, June. Available at isc2004-ft-mpi.pdf.

[32] The Internet Engineering Task Force, Network Working Group, The Secure Shell (SSH) Connection Protocol, RFC 4254, Jan. rfc4254.txt.

[33] M. Satyanarayanan, The Evolution of Coda, ACM Transactions on Computer Systems (TOCS), vol. 20, no. 2.

[34] A. Muthitacharoen, R. Morris, T. Gil, and B. Chen, Ivy: A Read/Write Peer-to-Peer File System, in Proc. of the 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI 02), Boston, MA, USA.

[35] E. Ong, E. Lusk, and W. Gropp, Scalable Unix Commands for Parallel Processors: A High-Performance Implementation, in Recent Advances in Parallel Virtual Machine and Message Passing Interface: 8th European PVM/MPI Users' Group Meeting, vol. 2131 of LNCS, Springer Berlin / Heidelberg, Jan.

[36] NASA Advanced Supercomputing (NAS) Division: NAS Parallel Benchmarks.

[37] JCraft, JSch: Java Secure Channel.

[38] F. C. Wong, R. P. Martin, R. H. Arpaci-Dusseau, and D. E. Culler, Architectural requirements and scalability of the NAS Parallel Benchmarks, in Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, ACM Press, New York, NY, USA.

[39] G. Cheliotis, C. Kenyon, and R. Buyya, Grid Economics: 10 Lessons From Finance, Daily News and Information for the Global Grid Community, vol. 2, Jul., gridbus.org/papers/grid_lessons.pdf.

[Plots omitted: Figures 5-12 each show execution time in seconds versus number of nodes, comparing ft-mpi and zf-mpi.]

Figure 5: EP benchmark test. Figure 6: LU benchmark test. Figure 7: BT benchmark test. Figure 8: MG benchmark test. Figure 9: SP benchmark test. Figure 10: IS benchmark test. Figure 11: FT benchmark test. Figure 12: CG benchmark test.

Figure 13: NAS Parallel Benchmarks for 8 or 9 processes in the size class B (tests mg.8, is.8, ep.8, cg.8, lu.8, ft.8, bt.9, sp.9; execution time in seconds for ft-mpi and zf-mpi).


Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/60 Definition Distributed Systems Distributed System is

More information

Report. Middleware Proxy: A Request-Driven Messaging Broker For High Volume Data Distribution

Report. Middleware Proxy: A Request-Driven Messaging Broker For High Volume Data Distribution CERN-ACC-2013-0237 Wojciech.Sliwinski@cern.ch Report Middleware Proxy: A Request-Driven Messaging Broker For High Volume Data Distribution W. Sliwinski, I. Yastrebov, A. Dworak CERN, Geneva, Switzerland

More information

A Comparison of Conventional Distributed Computing Environments and Computational Grids

A Comparison of Conventional Distributed Computing Environments and Computational Grids A Comparison of Conventional Distributed Computing Environments and Computational Grids Zsolt Németh 1, Vaidy Sunderam 2 1 MTA SZTAKI, Computer and Automation Research Institute, Hungarian Academy of Sciences,

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

Design patterns for data-driven research acceleration

Design patterns for data-driven research acceleration Design patterns for data-driven research acceleration Rachana Ananthakrishnan, Kyle Chard, and Ian Foster The University of Chicago and Argonne National Laboratory Contact: rachana@globus.org Introduction

More information

Adaptive Internet Data Centers

Adaptive Internet Data Centers Abstract Adaptive Internet Data Centers Jerome Rolia, Sharad Singhal, Richard Friedrich Hewlett Packard Labs, Palo Alto, CA, USA {jar sharad richf}@hpl.hp.com Trends in Internet infrastructure indicate

More information

CHAPTER 1: OPERATING SYSTEM FUNDAMENTALS

CHAPTER 1: OPERATING SYSTEM FUNDAMENTALS CHAPTER 1: OPERATING SYSTEM FUNDAMENTALS What is an operating system? A collection of software modules to assist programmers in enhancing system efficiency, flexibility, and robustness An Extended Machine

More information

A Federated Grid Environment with Replication Services

A Federated Grid Environment with Replication Services A Federated Grid Environment with Replication Services Vivek Khurana, Max Berger & Michael Sobolewski SORCER Research Group, Texas Tech University Grids can be classified as computational grids, access

More information

What are some common categories of system calls? What are common ways of structuring an OS? What are the principles behind OS design and

What are some common categories of system calls? What are common ways of structuring an OS? What are the principles behind OS design and What are the services provided by an OS? What are system calls? What are some common categories of system calls? What are the principles behind OS design and implementation? What are common ways of structuring

More information

Concepts of Distributed Systems 2006/2007

Concepts of Distributed Systems 2006/2007 Concepts of Distributed Systems 2006/2007 Introduction & overview Johan Lukkien 1 Introduction & overview Communication Distributed OS & Processes Synchronization Security Consistency & replication Programme

More information

Chapter 1: Introduction Operating Systems MSc. Ivan A. Escobar

Chapter 1: Introduction Operating Systems MSc. Ivan A. Escobar Chapter 1: Introduction Operating Systems MSc. Ivan A. Escobar What is an Operating System? A program that acts as an intermediary between a user of a computer and the computer hardware. Operating system

More information

Introduction to Cluster Computing

Introduction to Cluster Computing Introduction to Cluster Computing Prabhaker Mateti Wright State University Dayton, Ohio, USA Overview High performance computing High throughput computing NOW, HPC, and HTC Parallel algorithms Software

More information

Introduction to GT3. Introduction to GT3. What is a Grid? A Story of Evolution. The Globus Project

Introduction to GT3. Introduction to GT3. What is a Grid? A Story of Evolution. The Globus Project Introduction to GT3 The Globus Project Argonne National Laboratory USC Information Sciences Institute Copyright (C) 2003 University of Chicago and The University of Southern California. All Rights Reserved.

More information

A RESOURCE AWARE SOFTWARE ARCHITECTURE FEATURING DEVICE SYNCHRONIZATION AND FAULT TOLERANCE

A RESOURCE AWARE SOFTWARE ARCHITECTURE FEATURING DEVICE SYNCHRONIZATION AND FAULT TOLERANCE A RESOURCE AWARE SOFTWARE ARCHITECTURE FEATURING DEVICE SYNCHRONIZATION AND FAULT TOLERANCE Chris Mattmann University of Southern California University Park Campus, Los Angeles, CA 90007 mattmann@usc.edu

More information

Exploiting peer group concept for adaptive and highly available services

Exploiting peer group concept for adaptive and highly available services Computing in High Energy and Nuclear Physics, 24-28 March 2003 La Jolla California 1 Exploiting peer group concept for adaptive and highly available services Muhammad Asif Jan Centre for European Nuclear

More information

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI Department of Computer Science and Engineering IT6801 - SERVICE ORIENTED ARCHITECTURE Anna University 2 & 16 Mark Questions & Answers Year / Semester: IV /

More information

CHAPTER-1: INTRODUCTION TO OPERATING SYSTEM:

CHAPTER-1: INTRODUCTION TO OPERATING SYSTEM: CHAPTER-1: INTRODUCTION TO OPERATING SYSTEM: TOPICS TO BE COVERED 1.1 Need of Operating System 1.2 Evolution of os 1.3 operating system i. Batch ii. iii. iv. Multiprogramming Time sharing Real time v.

More information

A Capabilities Based Communication Model for High-Performance Distributed Applications: The Open HPC++ Approach

A Capabilities Based Communication Model for High-Performance Distributed Applications: The Open HPC++ Approach A Capabilities Based Communication Model for High-Performance Distributed Applications: The Open HPC++ Approach Shridhar Diwan, Dennis Gannon Department of Computer Science Indiana University Bloomington,

More information

Chapter 14 Operating Systems

Chapter 14 Operating Systems Chapter 14 Systems Ref Page Slide 1/54 Learning Objectives In this chapter you will learn about: Definition and need for operating Main functions of an operating Commonly used mechanisms for: Process management

More information

Request for Comments: 1787 T.J. Watson Research Center, IBM Corp. Category: Informational April 1995

Request for Comments: 1787 T.J. Watson Research Center, IBM Corp. Category: Informational April 1995 Network Working Group Y. Rekhter Request for Comments: 1787 T.J. Watson Research Center, IBM Corp. Category: Informational April 1995 Status of this Memo Routing in a Multi-provider Internet This memo

More information

Service Execution Platform WebOTX To Support Cloud Computing

Service Execution Platform WebOTX To Support Cloud Computing Service Execution Platform WebOTX To Support Cloud Computing KATOU Masayuki Abstract The trend toward reductions in IT investments due to the current economic climate has tended to focus our attention

More information

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 1 Introduction Definition of a Distributed System (1) A distributed system is: A collection of

More information

Migration and Building of Data Centers in IBM SoftLayer

Migration and Building of Data Centers in IBM SoftLayer Migration and Building of Data Centers in IBM SoftLayer Advantages of IBM SoftLayer and RackWare Together IBM SoftLayer offers customers the advantage of migrating and building complex environments into

More information

OPERATING SYSTEMS. Prescribed Text Book Operating System Principles, Seventh Edition By Abraham Silberschatz, Peter Baer Galvin and Greg Gagne

OPERATING SYSTEMS. Prescribed Text Book Operating System Principles, Seventh Edition By Abraham Silberschatz, Peter Baer Galvin and Greg Gagne OPERATING SYSTEMS Prescribed Text Book Operating System Principles, Seventh Edition By Abraham Silberschatz, Peter Baer Galvin and Greg Gagne OVERVIEW An operating system is a program that manages the

More information

EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography

EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography Christopher Crosby, San Diego Supercomputer Center J Ramon Arrowsmith, Arizona State University Chaitan

More information

Distributed OS and Algorithms

Distributed OS and Algorithms Distributed OS and Algorithms Fundamental concepts OS definition in general: OS is a collection of software modules to an extended machine for the users viewpoint, and it is a resource manager from the

More information

Data Model Considerations for Radar Systems

Data Model Considerations for Radar Systems WHITEPAPER Data Model Considerations for Radar Systems Executive Summary The market demands that today s radar systems be designed to keep up with a rapidly changing threat environment, adapt to new technologies,

More information

BRANCH:IT FINAL YEAR SEVENTH SEM SUBJECT: MOBILE COMPUTING UNIT-IV: MOBILE DATA MANAGEMENT

BRANCH:IT FINAL YEAR SEVENTH SEM SUBJECT: MOBILE COMPUTING UNIT-IV: MOBILE DATA MANAGEMENT - 1 Mobile Data Management: Mobile Transactions - Reporting and Co Transactions Kangaroo Transaction Model - Clustering Model Isolation only transaction 2 Tier Transaction Model Semantic based nomadic

More information

IOS: A Middleware for Decentralized Distributed Computing

IOS: A Middleware for Decentralized Distributed Computing IOS: A Middleware for Decentralized Distributed Computing Boleslaw Szymanski Kaoutar El Maghraoui, Carlos Varela Department of Computer Science Rensselaer Polytechnic Institute http://www.cs.rpi.edu/wwc

More information

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those

More information

An Introduction to GPFS

An Introduction to GPFS IBM High Performance Computing July 2006 An Introduction to GPFS gpfsintro072506.doc Page 2 Contents Overview 2 What is GPFS? 3 The file system 3 Application interfaces 4 Performance and scalability 4

More information

Cisco Application Centric Infrastructure (ACI) - Endpoint Groups (EPG) Usage and Design

Cisco Application Centric Infrastructure (ACI) - Endpoint Groups (EPG) Usage and Design White Paper Cisco Application Centric Infrastructure (ACI) - Endpoint Groups (EPG) Usage and Design Emerging IT technologies have brought about a shift from IT as a cost center to IT as a business driver.

More information

WhatÕs New in the Message-Passing Toolkit

WhatÕs New in the Message-Passing Toolkit WhatÕs New in the Message-Passing Toolkit Karl Feind, Message-passing Toolkit Engineering Team, SGI ABSTRACT: SGI message-passing software has been enhanced in the past year to support larger Origin 2

More information

Test Methodology We conducted tests by adding load and measuring the performance of the environment components:

Test Methodology We conducted tests by adding load and measuring the performance of the environment components: Scalability Considerations for Using the XenApp and XenDesktop Service Local Host Cache Feature with Citrix Cloud Connector Author: Jahed Iqbal Overview The local host cache feature in the XenApp and XenDesktop

More information

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Ewa Deelman and Rajive Bagrodia UCLA Computer Science Department deelman@cs.ucla.edu, rajive@cs.ucla.edu http://pcl.cs.ucla.edu

More information

Distributed Systems Operation System Support

Distributed Systems Operation System Support Hajussüsteemid MTAT.08.009 Distributed Systems Operation System Support slides are adopted from: lecture: Operating System(OS) support (years 2016, 2017) book: Distributed Systems: Concepts and Design,

More information

New research on Key Technologies of unstructured data cloud storage

New research on Key Technologies of unstructured data cloud storage 2017 International Conference on Computing, Communications and Automation(I3CA 2017) New research on Key Technologies of unstructured data cloud storage Songqi Peng, Rengkui Liua, *, Futian Wang State

More information

NUSGRID a computational grid at NUS

NUSGRID a computational grid at NUS NUSGRID a computational grid at NUS Grace Foo (SVU/Academic Computing, Computer Centre) SVU is leading an initiative to set up a campus wide computational grid prototype at NUS. The initiative arose out

More information

Grid Computing Systems: A Survey and Taxonomy

Grid Computing Systems: A Survey and Taxonomy Grid Computing Systems: A Survey and Taxonomy Material for this lecture from: A Survey and Taxonomy of Resource Management Systems for Grid Computing Systems, K. Krauter, R. Buyya, M. Maheswaran, CS Technical

More information

Enterprise print management in VMware Horizon

Enterprise print management in VMware Horizon Enterprise print management in VMware Horizon Introduction: Embracing and Extending VMware Horizon Tricerat Simplify Printing enhances the capabilities of VMware Horizon environments by enabling reliable

More information

Vortex Whitepaper. Intelligent Data Sharing for the Business-Critical Internet of Things. Version 1.1 June 2014 Angelo Corsaro Ph.D.

Vortex Whitepaper. Intelligent Data Sharing for the Business-Critical Internet of Things. Version 1.1 June 2014 Angelo Corsaro Ph.D. Vortex Whitepaper Intelligent Data Sharing for the Business-Critical Internet of Things Version 1.1 June 2014 Angelo Corsaro Ph.D., CTO, PrismTech Vortex Whitepaper Version 1.1 June 2014 Table of Contents

More information

Introduction to Grid Technology

Introduction to Grid Technology Introduction to Grid Technology B.Ramamurthy 1 Arthur C Clarke s Laws (two of many) Any sufficiently advanced technology is indistinguishable from magic." "The only way of discovering the limits of the

More information

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Layers of an information system. Design strategies.

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Layers of an information system. Design strategies. Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 2 Distributed Information Systems Architecture Chapter Outline

More information

JXTA for J2ME Extending the Reach of Wireless With JXTA Technology

JXTA for J2ME Extending the Reach of Wireless With JXTA Technology JXTA for J2ME Extending the Reach of Wireless With JXTA Technology Akhil Arora Carl Haywood Kuldip Singh Pabla Sun Microsystems, Inc. 901 San Antonio Road Palo Alto, CA 94303 USA 650 960-1300 The Wireless

More information

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Qi Gao, Weikuan Yu, Wei Huang, Dhabaleswar K. Panda Network-Based Computing Laboratory Department of Computer Science & Engineering

More information

Chapter 14 Operating Systems

Chapter 14 Operating Systems Chapter 14 Operating Systems Ref Page Slide 1/54 Learning Objectives In this chapter you will learn about: Definition and need for operating system Main functions of an operating system Commonly used mechanisms

More information