Heterogeneous parallel and distributed computing

Parallel Computing 25 (1999) 1699–1721

Heterogeneous parallel and distributed computing

V.S. Sunderam a,*, G.A. Geist b

a Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, USA
b Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

* Corresponding author. E-mail address: vss@mathcs.emory.edu (V.S. Sunderam)

Abstract

Heterogeneous network-based distributed and parallel computing is gaining increasing acceptance as an alternative or complementary paradigm to multiprocessor-based parallel processing as well as to conventional supercomputing. While algorithmic and programming aspects of heterogeneous concurrent computing are similar to their parallel processing counterparts, system issues, partitioning and scheduling, and performance aspects are significantly different. In this paper, we discuss the evolution of heterogeneous concurrent computing, in the context of the parallel virtual machine (PVM) system, a widely adopted software system for network computing. In particular, we highlight the system level infrastructures that are required, aspects of parallel algorithm development that most affect performance, system capabilities and limitations, and tools and methodologies for effective computing in heterogeneous networked environments. We also present recent developments and experiences in the PVM project, and comment on ongoing and future work. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Heterogeneous computing; Networked computing; Cluster computing; Message passing interface (MPI); Parallel virtual machine (PVM); NAS parallel benchmark; Parallel I/O; Metacomputing

1. Introduction

We discuss parallel and distributed computing on networked heterogeneous environments. As used in this paper, these terms, as well as ``concurrent'' computing, refer to the simultaneous execution of the components of a single application on multiple processing elements.

While this definition might also apply to most other notions of parallel processing, we make a deliberate distinction, to highlight certain attributes of the methodologies and systems discussed herein, namely loose coupling, physical and logical independence of the processing elements, and heterogeneity. These characteristics distinguish heterogeneous concurrent computing from traditional parallel processing, normally performed on homogeneous, tightly coupled platforms which possess some degree of physical independence but are logically coherent.

Concurrent computing, in various forms, is becoming increasingly popular as a methodology for many classes of applications, particularly those in the high-performance and scientific computing arenas. This is due to numerous benefits that accrue, both from the applications as well as the systems perspectives. However, in order to fully exploit these advantages, a substantial framework is required, in the form of novel programming paradigms and models, systems support, toolkits, and performance analysis and enhancement mechanisms. In this paper, we focus on the latter aspects, namely the systems infrastructures, functionality, and performance issues in concurrent computing.

1.1. Heterogeneous, networked, and cluster computing

One of the major goals of concurrent computing systems is to support heterogeneity. Heterogeneous computing refers to architectures, models, systems, and applications that comprise substantively different components, as well as to techniques and methodologies that address issues that arise when computing in heterogeneous environments. While this definition encompasses numerous systems, including reconfigurable architectures, mixed-mode arithmetic, special purpose hardware, and even vector and input-output units, we restrict ourselves to systems that are comprised of networked, independent, general purpose computers that may be used in a coherent and unified manner. Thus, heterogeneous systems may consist of scalar, vector, parallel, and graphics machines that are interconnected by one or more (types of) networks, and support one or more programming environments/operating systems. In such environments, heterogeneity occurs in several forms:

System architecture - heterogeneous systems may consist of SIMD, MIMD, scalar, and vector computers.

Machine architecture - individual processing elements may differ in their instruction sets and/or data representation.

Machine configurations - even when processing elements are architecturally identical, differences such as clock speeds and memory contribute to heterogeneity.

External influences - as heterogeneous systems are normally built in general purpose environments, external resource demands can (and often do) induce heterogeneity into processing elements that are identical in architecture and configuration, and further, cause dynamic variations in interconnection network capacity.

Interconnection networks - may be optical or electrical, local or wide-area, high or low speed, and may employ several different protocols.

Software - from the infrastructure point of view, the underlying operating systems are often different in heterogeneous systems; from the applications point of view, in addition to operating systems aspects, different programming models, languages, and support libraries are available in heterogeneous systems.

Research in heterogeneous systems is in progress in several areas [1,2] including applications, paradigm development, mapping, scheduling, reconfiguration, etc., but the primary thrust has thus far been in systems, methodologies, and toolkits [22]. This latter thrust has been highly productive and successful, with several systems in production-level use at hundreds of installations worldwide. The body of this paper will discuss the parallel virtual machine (PVM) system that has evolved into a popular and effective methodology for heterogeneous concurrent computing.

1.2. Applications perspective

From the point of view of application development, heterogeneous computing is attractive, since it inherently supports function parallelism, with the added potential of executing subtasks on best-suited architectures. It is well known that different types of algorithms are well matched to different machine architectures and configurations, and at least in the abstract sense, heterogeneous computing permits this matching to be realized, resulting in optimality in application execution as well as in resource utilization. However, in practice, this scenario may be difficult to achieve for reasons of availability, applicability, and the existence of appropriate mapping and scheduling tools. Nevertheless, the concept is an attractive one and several research efforts are in progress in this area [3,4].

In this respect, many classes of applications that would benefit substantively from heterogeneous computing have been identified. For example, a critically important problem which is ideally suited to heterogeneous computing is global climate modeling. Simulation of the global climate is a particularly difficult challenge because of the wide range of time and space scales governing the behavior of the atmosphere, the oceans, and the surface. Parallel GCM codes require distinct component modules representing the atmosphere, ocean and surface, and process modules representing phenomena like radiation and convection. Sampling, updating and manipulating the associated data requires scalar, vector, MIMD and SIMD paradigms, many of which can be performed concurrently.

Another application domain that could exploit heterogeneous computing is computer vision. Vision problems generally require processing at three levels: high, medium and low. Low-level and some medium-level vision tasks often involve regular data flow and iconic operations. This type of computation is well matched to mesh-connected SIMD machines. Medium-grained MIMD machines are more suitable for various high-level and some medium-level vision tasks which are communication-intensive and in which the flow of data is not regular. Coarse-grained MIMD machines are best matched for high-level vision tasks such as image understanding/recognition and symbolic processing.

As previously mentioned however, the above aspect of heterogeneous concurrent computing is still in its infancy.

Proof-of-concept research and experiments have demonstrated the viability of exploiting application heterogeneity, and many others are evolving. On the other hand, the systems aspect has matured significantly, to the extent that robust environments are now available for production execution of traditional parallel applications while providing stable testbeds for the evolving, truly heterogeneous, applications [12,13]. We discuss the systems facet of heterogeneous concurrent computing in the remainder of the paper.

2. The historical perspective

2.1. Heterogeneous concurrent computing systems

Heterogeneous computing systems [5,6] evolved in the late 1980s and shared some common goals and requirements:

To effectively provide access to significant amounts of computing resources in a cost-effective manner, usually by utilizing already available resources.

To exploit the existing software infrastructure and facilities (e.g., editors, compilers, debuggers) that are available on individual computer systems in a cluster.

To provide an effective programming model and interface, generally based on explicit parallelism and the message passing paradigm.

To support transparency in terms of architecture, processor type, task location, network communication, and resource allocation.

To achieve the best possible performance, subject to the inherent limitations of the processors and networks involved; some systems also attempt to be non-intrusive by suspending execution in deference to higher priority activities.

Several of the above goals were met, at least by the most popular network-computing/heterogeneous processing systems. Other goals, such as exploiting heterogeneity, sophisticated job and resource management, automatic parallelization, and graphical interfaces are still being pursued. Since then, PVM gained substantially in popularity, and the MPI standard evolved in the mid 1990s; both are still in widespread use. We briefly outline some of the earlier systems and their salient features before discussing PVM in depth, and commenting on the latest trends in metacomputing.

2.2. The Linda model and system

Linda [7] is a concurrent programming model that has evolved from a Yale University research project. The primary concept in Linda is that of a ``tuple space'', an abstraction via which cooperating processes communicate. This central theme of Linda has been proposed as an alternative paradigm to the two traditional methods of parallel processing, namely, that based on shared-memory, and on message passing. The tuple space concept is essentially an abstraction of distributed shared-memory, with one important difference (tuple spaces are associative), and several minor distinctions (destructive and non-destructive reads, and different coherency semantics are possible). Applications use the Linda model by embedding explicitly, within cooperating sequential programs, constructs that manipulate (insert/retrieve tuples) the tuple space.

From the application point of view Linda [8] is a set of programming language extensions for facilitating parallel programming. The Linda model is a scheme built upon an associative memory referred to as tuple space. It provides a shared-memory abstraction for process communication without requiring the underlying hardware to physically share memory. Tuples are collections of fields logically ``welded'' to form persistent storage items. They are the basic tuple space storage units. Parallel processes exchange data by generating, reading, and consuming them. To update a tuple, the tuple is removed from tuple space, modified, and returned to tuple space. Restricting tuple space modification in this manner creates an implicit locking mechanism ensuring proper synchronization of multiple accesses.

The ``Linda system'' usually refers to a specific (sometimes portable) implementation of software that supports the Linda programming model. System software is provided that establishes and maintains tuple spaces, and that is used in conjunction with libraries that appropriately interpret and execute Linda primitives. Depending on the environment (shared-memory multiprocessors, message passing parallel computers, networks of workstations, etc.), the tuple space mechanism is implemented using different techniques, and with varying degrees of efficiency. Recently, a new system technique has been proposed, at least nominally related to the Linda project. This scheme, termed ``Piranha'', proposes a proactive approach to concurrent computing; the idea being that computational resources (viewed as active agents) seize computational tasks from a well-known location based on availability and suitability. Again, this scheme may be implemented on multiple platforms, and manifested as a ``Piranha system'' or ``Linda-Piranha system''.
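The remove-modify-reinsert idiom described above can be made concrete with a short C-Linda-style fragment. The sketch below is illustrative rather than taken from any particular Linda implementation: out( ) inserts a tuple, in( ) withdraws a matching tuple (blocking until one exists), rd( ) reads without removing, and formal fields are marked with a leading ``?''. The tuple name ``counter'' and the enclosing functions are hypothetical, and some task is assumed to have executed out("counter", 0) once beforehand.

    /* Illustrative C-Linda-style fragment: a shared counter updated via
       the remove-modify-reinsert idiom.  The implicit lock is simply the
       absence of the tuple between the in( ) and the out( ). */

    void bump_counter(void)
    {
        int val;

        in("counter", ?val);        /* withdraw the tuple: others now block */
        out("counter", val + 1);    /* reinsert the updated tuple           */
    }

    int read_counter(void)
    {
        int val;

        rd("counter", ?val);        /* non-destructive, associative read    */
        return val;
    }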

2.3. P4 and Parmacs

P4 is a library of macros and subroutines developed at Argonne National Laboratory for programming a variety of parallel machines. The P4 system [9] supports both the shared-memory model (based on monitors) and the distributed-memory model (using message passing). For the shared-memory model of parallel computation, P4 provides a set of primitives from which monitors can be constructed, as well as a set of useful monitors. For the distributed-memory model, P4 provides typed send and receive operations, and creation of processes according to a text file describing group and process structure.

Process management in the P4 system is based on a configuration file that specifies the host pool, the object file to be executed on each machine, the number of processes to be started on each host (intended primarily for multiprocessor systems) and other auxiliary information. Two issues are noteworthy in regard to the process management mechanism in P4. First, there is the notion of a ``master'' process and ``slave'' processes, and multilevel hierarchies may be formed to implement what is termed a cluster model of computation. Second, the primary mode of process creation is static, via the configuration file; dynamic process creation is possible only by a statically created process that must invoke a special P4 function that spawns a new process on the local machine. However, despite these restrictions, a variety of application paradigms may be implemented in the P4 system in a fairly straightforward manner.

Message passing in the P4 system is achieved through the use of traditional send and recv primitives, parameterized almost exactly as in other message passing systems. Several variants are provided for semantics such as heterogeneous exchange, and blocking or non-blocking transfer. A significant proportion of the burden of buffer allocation and management, however, is left to the user. Apart from basic message passing, P4 also offers a variety of global operations, including broadcast, global maxima and minima, and barrier synchronization. Shared-memory support via monitors is a facility that distinguishes P4 from other systems. However, this feature is not distributed shared-memory; rather, it is a portable mechanism for shared address space programming on true shared-memory multiprocessors.

Parmacs is a project that is closely related to the P4 effort. Essentially, Parmacs is a set of macro extensions to the P4 system developed at GMD [10]. It originated in an effort to provide FORTRAN interfaces to the P4 system, but is now a significantly enhanced package that provides a variety of high-level abstractions, mostly dealing with global operations. Parmacs provides macros for logically configuring a set of P4 processes; for example, the macro torus produces a suitable configuration file for use by P4 that results in a logical process configuration corresponding to a 3-d torus. Other logical topologies, including general graphs, may also be implemented, and Parmacs provides macros used in conjunction with send and recv to achieve topology-specific communications within executing programs.

2.4. Message passing interface (MPI)

In 1992 a group of about 30 people from universities, government laboratories, and industry began meeting to specify a message passing interface. It was felt that the definition of a message passing standard provides vendors with a clearly defined base set of routines that they can implement efficiently, or in some cases provide hardware support for, thereby enhancing performance. In 1994 the MPI-1 specification was published; it defined 128 functions divided into five categories: point-to-point communication, collective communication, groups and context, processor topologies, and profiling interface. While MPI-1 defined a message passing API, it was not portable across heterogeneous clusters of computers because MPI-1 defined no standard way to start processes. Thus, in 1995 the MPI forum began to meet again to define MPI-2. MPI-2 specified an additional 200 functions in several new areas: I/O, one-sided communication, process spawning, and extended collective operations. The MPI-2 specification was published in 1997.

The goal of the MPI specification is to develop a widely used standard for writing message passing programs. MPI-1 is widely supported by the parallel computer vendors and work has begun to implement MPI-2 functions.
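As a point of reference for the MPI-1 point-to-point routines mentioned above, the following minimal C sketch (not drawn from any particular vendor implementation) sends a single integer from rank 0 to rank 1; the tag value 0 is arbitrary and the program assumes it is started with at least two processes.

    /* Minimal MPI-1 point-to-point sketch (illustrative only). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }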

3. The PVM system

3.1. PVM overview

PVM is a software system that permits the utilization of a heterogeneous network of parallel and serial computers as a unified, general, and flexible concurrent computational resource. The PVM system [11] initially supported the message passing, shared-memory, and hybrid paradigms, thus allowing applications to use the most appropriate computing model for the entire application or for individual subalgorithms. However, support for emulated shared-memory was omitted as the system evolved, since the message passing paradigm was the model of choice for most scientific parallel processing applications. Processing elements in PVM may be scalar machines, distributed and shared-memory multiprocessors, vector supercomputers and special purpose graphics engines, thereby permitting the use of the best-suited computing resource for each component of an application.

The PVM system is composed of a suite of user interface primitives and supporting software that together enable concurrent computing on loosely coupled networks of processing elements. PVM may be implemented on a hardware base consisting of different machine architectures, including single CPU systems, vector machines, and multiprocessors. These computing elements may be interconnected by one or more networks, which may themselves be different (e.g., one implementation of PVM operates on Ethernet, the Internet, and a fiber optic network). These computing elements are accessed by applications via a standard interface that supports common concurrent processing paradigms in the form of well-defined primitives that are embedded in procedural host languages. Application programs are composed of components that are subtasks at a moderately large level of granularity. During execution, multiple instances of each component may be initiated. Fig. 1 depicts a simplified architectural overview of the PVM computing model as well as the system.

Application programs view the PVM system as a general and flexible parallel computing resource. A translucent layering permits flexibility while retaining the ability to exploit particular strengths of individual machines on the network. The PVM user interface is strongly typed; support for operating in a heterogeneous environment is provided in the form of special constructs that selectively perform machine dependent data conversions where necessary. Inter-instance communication constructs include those for the exchange of data structures as well as high-level primitives such as broadcast, barrier synchronization, mutual exclusion, and rendezvous. Application programs under PVM may possess arbitrary control and dependency structures. In other words, at any point in the execution of a concurrent application, the processes in existence may have arbitrary relationships between each other and, further, any process may communicate and/or synchronize with any other.

Fig. 1. PVM system overview: (a) PVM computing model; (b) PVM architectural overview.

The PVM system is composed of two parts. The first part is a daemon, called pvmd, that executes on all the computers comprising the virtual machine. PVM is designed so that any user having normal access rights to each host in the pool may install and operate the system. To run a PVM application, the user executes the daemons on a selected host pool, and the set of daemons cooperate via distributed algorithms to initialize the virtual machine. The PVM application can then be started by executing a program on any of these machines. The usual method is for this manually started program to spawn other application processes, using PVM facilities. Multiple users may configure overlapping virtual machines, and each user can execute several PVM applications simultaneously. The second part of the system is a library of PVM interface routines (libpvm.a). This library contains user callable routines for message passing, spawning processes, coordinating tasks, and modifying the virtual machine.

The installation process for PVM is straightforward. PVM does not require special privileges to be installed. Anyone with a valid login on the hosts can do so, by following a simple sequence of steps for obtaining the distribution via the Web or by Ftp, compiling, and installing.
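To make this two-part structure concrete, the following sketch shows the shape of a manually started PVM master that spawns worker tasks and collects one integer from each using libpvm routines. It is an illustrative fragment, not code from the PVM distribution; the executable name ``worker'', the message tag 1, and the omission of error handling are assumptions made for brevity.

    /* Illustrative PVM master sketch: spawn workers on the virtual
       machine and gather one integer result from each. */
    #include <stdio.h>
    #include "pvm3.h"

    #define NWORKERS 4

    int main(void)
    {
        int mytid = pvm_mytid();          /* enroll this process in PVM   */
        int tids[NWORKERS], i, result;

        /* let PVM place the tasks anywhere on the virtual machine */
        pvm_spawn("worker", NULL, PvmTaskDefault, "", NWORKERS, tids);

        for (i = 0; i < NWORKERS; i++) {  /* gather results in any order  */
            pvm_recv(-1, 1);              /* -1 = any task, message tag 1 */
            pvm_upkint(&result, 1, 1);
            printf("master t%x received %d\n", mytid, result);
        }

        pvm_exit();                       /* leave the virtual machine    */
        return 0;
    }

A matching worker would obtain the master's task identifier with pvm_parent( ), pack its result with pvm_initsend( ) and pvm_pkint( ), return it with pvm_send( ), and finally call pvm_exit( ).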

3.2. PVM programming

Developing applications for the PVM system follows, in a general sense at least, the traditional paradigm for programming distributed-memory multiprocessors such as the Intel family of hypercubes. This is true for both the logistical aspects of programming as well as for algorithm development. However, there are significant differences in terms of (a) task management, especially issues concerning dynamic process creation, naming and addressing; (b) initialization phases prior to actual computation; (c) granularity choices; and (d) heterogeneity. These issues must be kept in mind during the general programming process for PVM and attention paid to factors that impact functionality and performance.

In PVM, the issue of workload allocation is of particular importance, subsequent to establishing process structure, because of the heterogeneous and multiprogrammed nature of the underlying hosts that inherently cause load imbalances. Therefore, data decomposition or partitioning should not assume that all processing elements are equally capable or equally available. Function decomposition is better suited, since it divides the work based on different operations or functions. In a sense, the PVM computing model supports function decomposition at the component level (components are fundamentally different programs that perform different operations) and data decomposition at the instance level, i.e., within a component, the same program operates on different portions of the data.

In order to utilize the PVM system, applications must evolve through two stages. The first concerns development of the distributed-memory parallel version of the application algorithm(s); this phase is common to the PVM system as well as to other distributed-memory multiprocessors. The actual parallelization decisions fall into two major categories: those related to structure, and those related to efficiency. For structural decisions in parallelizing applications, the major decisions to be made include the choice of model to be used, i.e., crowd computation (based on peer-to-peer process structures) vs. tree computation (based on hierarchical process structures), and data decomposition vs. function decomposition. Decisions with respect to efficiency when parallelizing for distributed-memory environments are generally oriented towards minimizing the frequency and volume of communications. It is typically in this latter respect that the parallelization process differs for PVM and hardware multiprocessors: for PVM environments based on networks, large granularity generally leads to better performance. With this qualification, the parallelization process is very similar for PVM and for other distributed-memory environments, including hardware multiprocessors.

The parallelization of applications may be done either ab initio, or from existing sequential versions, or from existing parallel versions. In the first two cases, the stages involved are to select an appropriate algorithm for each of the subtasks in the application, usually from published descriptions, or by inventing a parallel algorithm. These algorithms are then coded in the language of choice (C, C++, or FORTRAN77 for PVM) and interfaced with each other as well as with process management and other constructs. Parallelization from existing sequential programs also follows certain general guidelines, primary among which is to decompose loops, beginning with outermost loops and working inward. In this process, the main concern is to detect dependencies and partition loops such that dependencies are preserved while allowing for concurrency.
This parallelization process is described in numerous textbooks and papers on parallel computing, though few textbooks discuss the practical and specific aspects of transforming a sequential program to a parallel one.
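A hypothetical helper illustrates the simplest form of such a loop decomposition: block partitioning of an iteration space among the instances of a component. The function name and parameters are invented for illustration; as noted above, on heterogeneous or multiprogrammed hosts such a static split may need to be weighted by host capability, or replaced by dynamic self-scheduling from a master task.

    /* Hypothetical sketch of data decomposition at the instance level:
       worker `me' out of `nworkers' takes a contiguous block of the
       iteration space [0, n).  The first `n % nworkers' workers receive
       one extra iteration so the whole range is covered. */
    void my_block(int me, int nworkers, int n, int *lo, int *hi)
    {
        int base = n / nworkers;      /* iterations every worker gets      */
        int rem  = n % nworkers;      /* leftovers go to the first workers */

        *lo = me * base + (me < rem ? me : rem);
        *hi = *lo + base + (me < rem ? 1 : 0);   /* half-open: [lo, hi) */
    }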

Existing parallel programs may be based on either the shared-memory or distributed-memory paradigms. Converting existing shared-memory programs to PVM is similar to converting from sequential code, when the shared-memory versions are based on vector or loop level parallelism. In the case of explicit shared-memory programs, the primary task is to locate synchronization points and replace these with message passing. In order to convert existing distributed-memory parallel code to PVM, the main task is to convert from one set of concurrency constructs to another. Typically, existing distributed-memory parallel programs are written either for hardware multiprocessors or for other networked environments such as P4 or Express. In both cases, the major changes required are with regard to process management. For example, in the Intel family of distributed-memory multiprocessors (DMMPs), it is common for processes to be started from an interactive shell command line. Such a paradigm should be replaced for PVM by either a master program or a node program that takes responsibility for process spawning. With regard to interaction, there is, fortunately, a great deal of commonality between the message passing calls in various programming environments. The major differences between PVM and other systems in this context are with regard to (a) process management and process addressing schemes; (b) virtual machine configuration/reconfiguration and its impact on executing applications; (c) heterogeneity in messages as well as the aspect of heterogeneity that deals with different architectures and data representations; and (d) certain unique and specialized features such as signaling, task scheduling methods, etc.

3.3. Fault tolerance issues

Fault tolerance is a critical issue for any large scale scientific computer application. Long-running simulations, which can take days or even weeks to execute, must be given some means to gracefully handle faults in the system or the application tasks. Without fault detection and recovery it is unlikely that such applications will ever complete. For example, consider a large simulation running on dozens of workstations. If one of those many workstations should crash or be rebooted, then tasks critical to the application might disappear. Additionally, if the application hangs or fails, it may not be immediately obvious to the user. Many hours could be wasted before it is discovered that something has gone wrong. Further, there are several types of applications that explicitly require a fault tolerant execution environment, due to safety or level of service requirements. In any case, it is essential that there be some well-defined scheme for identifying system and application faults and automatically responding to them, or at least providing timely notification to the user in the event of failure.

PVM has supported a basic fault notification scheme for some time. Under the control of the user, tasks can register with PVM to be ``notified'' when the status of the virtual machine changes or when a task fails. This notification comes in the form of special event messages that contain information about the particular event.

A task can ``post'' a notify for any of the tasks from which it expects to receive a message. In this scenario, if a task dies, the receiving task will get a notify message in place of any expected message. The notify message allows the task an opportunity to respond to the fault without hanging or failing. Similarly, if a specific host like an I/O server is critical to the application, then the application tasks can post notifies for that host. The tasks will then be informed if that server exits the virtual machine, and they can allocate a new I/O server. This type of virtual machine notification is also useful in controlling computing resources. When a host exits from the virtual machine, tasks can utilize the notify messages to reconfigure themselves to the remaining resources. When a new host computer is added to the virtual machine, tasks can be notified of this as well. This information can be used to redistribute load or expand the computation to utilize the new resource. Several systems have been designed specifically for this purpose, including the WoDi system [21], which uses Condor [20] on top of PVM.

There are several important issues to consider when providing a fault notification scheme. For example, a task might request notification of an event after it has already occurred. PVM immediately generates a notify message in response to any such ``after the fact'' request. For example, if a ``task exit'' notification request is posted for a task that has already exited, a notify message is immediately returned. Similarly, if a ``host exit'' request is posted for a host that is no longer part of the virtual machine, a notify message is immediately returned. It is possible for a ``host add'' notification request to occur simultaneously with the addition of a new host into the virtual machine. To alleviate this race condition, the user must poll the virtual machine after the notify request to obtain the complete virtual machine configuration. Subsequently, PVM can then reliably deliver any new ``host add'' notifies.
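The notification scheme described above is accessed through the pvm_notify( ) routine. The fragment below is a hedged sketch of posting a ``task exit'' notify; the message tag 99 and the handling shown in the commented receive loop are hypothetical choices, not prescribed by PVM.

    /* Sketch of posting a ``task exit'' notify (illustrative).  If any
       watched task dies, PVM delivers a message with tag 99 in place of
       the expected data message, so the receiver can react rather than
       hang. */
    #include "pvm3.h"

    void watch_workers(int *tids, int ntask)
    {
        /* ask PVM to send this task a tag-99 message whenever one of
           the listed tasks exits, normally or through a host failure  */
        pvm_notify(PvmTaskExit, 99, ntask, tids);
    }

    /* later, in the receive loop (hypothetical handling):
         bufid = pvm_recv(-1, -1);
         pvm_bufinfo(bufid, &bytes, &tag, &src);
         if (tag == 99) { ... respawn the task or redistribute its work ... }
    */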

3.4. Current status and outlook

The latest version of PVM (PVM 3.4) works with both Windows NT and Unix hosts. The new features included in PVM 3.4 allow users to develop much more flexible, dynamic, and fault tolerant applications. PVM 3.4 includes 12 new functions. These functions provide the biggest leap in PVM capabilities since PVM 3.0 came out in 1993. The functions provide communication context, message handlers, and a tuple space called message box.

The ability to send messages in different communication contexts is a fundamental requirement for parallel tools and applications that must interact with each other. It is also a requirement for the development of safe parallel libraries. Context is a unique system created tag, which is sent with each message. A matching receive function must match the context, destination, and message tag fields for the message to be received (wild cards are allowed for destination and message tag but not for context). In the past, PVM applications had to divide up the message tag space to mimic context capabilities. With PVM 3.4 there are built-in functions to create, set, and free context values. By defining the context to be system wide unique, PVM continues to allow the dynamic generation and destruction of tasks. And by defining that all PVM tasks have a base context by default, all existing PVM applications continue to work unchanged. The combination of these features allows parallel tools developers to create visualization and monitoring packages that can attach to existing PVM applications, extract information, and detach without concern about interfering with the application. The ability in the future to dynamically plug in middle layer tools and applications is predicated on the existence of a similar if not identical communication context paradigm to PVM 3.4.

PVM has always had message handlers internally, which were used for controlling the virtual machine. In PVM 3.4 the ability to define and delete message handlers has been raised up to the user level. To add a message handler, an application task calls:

    handler_id = pvm_addmhf( src, tag, context, function );

Thereafter, whenever a message arrives at this task with the specified source, message tag, and communication context, the specified function is executed. The function is passed the pointer to the message so that the handler may unpack the message if required. PVM 3.4 places no restrictions on the complexity of the function. It is free to make system calls or other PVM calls.

With the functionality provided by pvm_addmhf( ) it is possible to build one-sided communication, active messages, applications that trigger other applications on certain events, fault recovery tools and schedulers, and so on. For example, instead of an error inside an application printing an error message, the event could be made to invoke a parallel debugger focused on the area of the problem. Another example would be a distributed data mining application that finds an interesting correlation and triggers a response in all the associated searching tasks. The existence of pvm_addmhf( ) allows tasks within an application to dynamically adapt and take on new functionality whenever a message handler is invoked.

In future systems the ability to dynamically add new functionality will have to be extended to include the underlying system as well as the user tasks. One could envision a message handler defined inside the virtual machine daemons that, when triggered by the application, would spawn off intelligent agents to seek out the requested software module from Web repositories. These trusted ``children'' agents could retrieve the module and use another message handler to cause the daemon to load the module, incorporating its new features.
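A hedged sketch of how such a handler might be installed and written is given below. The tag value 42, the handler body, the explicit switch of the receive buffer, and the use of the task's current context (obtained with pvm_getcontext( )) are illustrative assumptions of this sketch rather than requirements of the interface.

    /* Illustrative user-level message handler.  Once registered, the
       handler runs automatically whenever a matching message arrives
       at this task. */
    #include <stdio.h>
    #include "pvm3.h"

    int on_control_msg(int mid)      /* called with the arrived message id */
    {
        int code;

        pvm_setrbuf(mid);            /* make the delivered message the
                                        active receive buffer, defensively */
        pvm_upkint(&code, 1, 1);
        printf("control message, code %d\n", code);
        return 0;
    }

    void install_handler(void)
    {
        /* any source (-1), tag 42, this task's current context */
        int hid = pvm_addmhf(-1, 42, pvm_getcontext(), on_control_msg);
        (void)hid;                   /* keep hid if the handler is later
                                        removed with pvm_delmhf(hid)       */
    }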

In a typical message passing system, messages are transitive and the focus is often on making their existence as short as possible, i.e., decrease latency and increase bandwidth. There are many situations in distributed applications seen today in which programming would be much easier if there was a way to have persistent messages. This is the purpose of the new message box feature in PVM 3.4. The message box is an internal tuple space in the virtual machine. Tasks can use regular PVM pack routines to create an arbitrary message and then use pvm_putinfo( ) to place this message into the message box with an associated name. Copies of this message can be retrieved by any PVM task that knows the name. And if the name is unknown or changing dynamically, then pvm_getmboxinfo( ) can be used to find the list of names active in the message box. The four functions that make up the message box in PVM 3.4 are:

    index = pvm_putinfo( name, msgbuf, flag )
    pvm_recvinfo( name, index, flag )
    pvm_delinfo( name, index, flag )
    pvm_getmboxinfo( pattern, names[], structinfo[] )

The flag defines the properties of the stored message, such as: who is allowed to delete this message, does this name allow multiple instances of messages, does a put overwrite the message? The flag argument also allows extension of this interface as PVM 3.4 users give us feedback on how they use the features of message boxes. While the tuple space could be used as a distributed shared-memory, similar to the Linda system [8], the granularity of the PVM 3.4 implementation is better suited to large grained data storage.

Here are just a few of the many potential message box uses. A visualization tool spontaneously comes to life and finds out where and how to connect to a large distributed simulation. A scheduling tool retrieves information left by a resource monitor. A new team member learns how to connect to an ongoing collaboration. A debugging tool retrieves a message left by a performance monitor that indicates which of the thousands of tasks is most likely a bottleneck. Many of these capabilities are directly applicable to adaptable environments, and some method to have persistent messages will be a part of future virtual machine design.

The addition of communication contexts, message handlers, and message boxes to the PVM environment allows developers to take a big leap forward in the capabilities of their distributed applications. PVM 3.4 is a useful tool for the development of much more dynamic, fault tolerant distributed applications.
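As an illustration of these calls, the sketch below has one task advertise its task identifier under a well-known name and another task retrieve a copy later. The name string ``simulation:contact'', the PvmMboxDefault flag value, and the explicit pvm_setrbuf( ) call are assumptions of this sketch rather than requirements of the interface.

    /* Hedged message box sketch: publish a persistent message and
       retrieve it by name from another task. */
    #include "pvm3.h"

    void publish_contact(int my_tid)
    {
        int bufid = pvm_initsend(PvmDataDefault);
        pvm_pkint(&my_tid, 1, 1);                  /* message body: our tid */
        pvm_putinfo("simulation:contact", bufid, PvmMboxDefault);
    }

    int lookup_contact(void)
    {
        int tid;
        /* index 0 selects the first (here, only) instance of the name */
        int bufid = pvm_recvinfo("simulation:contact", 0, PvmMboxDefault);

        pvm_setrbuf(bufid);           /* unpack from the retrieved copy */
        pvm_upkint(&tid, 1, 1);
        return tid;
    }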

3.5. MPI and its relationship to PVM

PVM is built around the concept of a virtual machine, which is a dynamic collection of (potentially heterogeneous) computational resources managed as a single parallel computer. The virtual machine concept is fundamental to the PVM perspective and provides the basis for heterogeneity, portability, and encapsulation of functions that constitute PVM. PVM provides features like fault tolerance and interoperability which are not a part of MPI. In contrast, MPI has focused on message passing and explicitly states that resource management and the concept of a virtual machine are outside the scope of the MPI (1 and 2) standard.

The PVM API has continuously evolved over the years to satisfy user requests for additional features and to keep up with the fast changing network and computing technology. In contrast to the PVM API, the MPI-1 API was specified by a committee of about 40 high-performance computing experts from research and industry in a series of meetings in 1993-1994, and was defined as a fixed, unchanging standard. The impetus for developing MPI was that each massively parallel processor (MPP) vendor was creating their own proprietary message passing API. In this scenario it was not possible to write a portable parallel application. MPI is intended to be a standard message passing specification that each MPP vendor would implement on their system. The MPP vendors need to be able to deliver high-performance and this became the focus of the MPI design. Given this design focus, MPI is expected to always be faster than PVM on MPP hosts.

MPI-1 contains the following main features:

A large set of point-to-point communication routines (by far the richest set of any library to date).

A large set of collective communication routines for communication among groups of processes.

A communication context that provides support for the design of safe parallel software libraries.

The ability to specify communication topologies.

The ability to create derived datatypes that describe messages of non-contiguous data.

MPI-1 users soon discovered that their applications were not portable across a network of workstations because there was no standard method to start MPI tasks on separate hosts. Different MPI implementations used different methods. In 1995 the MPI committee began meeting to design the MPI-2 specification to correct this problem and to add additional communication functions to MPI, including:

MPI_SPAWN functions to start MPI processes.

One-sided communication functions such as put and get.

MPI-IO.

Language bindings for C++.

The MPI-2 specification was finished in June 1997. The MPI-2 document adds an additional 200 functions to the 128 functions specified in the MPI-1 API. This makes MPI a much richer source of communication methods than PVM.
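For comparison with PVM's pvm_spawn( ), the following minimal MPI-2 sketch (illustrative only, not from the paper) starts four copies of a hypothetical ``worker'' executable with MPI_Comm_spawn and obtains an intercommunicator for talking to them.

    /* Illustrative MPI-2 dynamic process creation sketch. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm workers;

        MPI_Init(&argc, &argv);

        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &workers, MPI_ERRCODES_IGNORE);

        /* ... communicate with the children through `workers' ... */

        MPI_Finalize();
        return 0;
    }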

4. Representative results in network computing

4.1. The NAS parallel benchmarks

The Numerical Aerodynamic Simulation (NAS) program of the National Aeronautics and Space Administration (NASA) has devised and published a suite of benchmarks [14] for the performance analysis of highly parallel computers. These benchmarks are designed to substantially exercise the processor, memory, and communication systems of current generation parallel computers. They are specified only algorithmically; except for a few constraints, implementors are free to select optimal language constructs and implementation techniques. The complete benchmark suite consists of eight applications, five of which are termed kernels because they form the core of many classes of aerodynamic applications, and the remaining three are simulated CFD applications. The five kernels, and their vital characteristics, are listed in Table 1. NASA periodically publishes performance results obtained either from internal experiments or those conducted by third party implementors on various supercomputers and parallel machines. The de facto yardstick used to compare these performance results is a single processor of the Cray Y-MP, executing a sequential version of the same application.

Table 1. NAS parallel benchmarks: kernel characteristics. Columns: benchmark code, problem size, memory (MB), Cray time (s), operation count. Kernels: embarrassingly parallel; V-cycle multigrid; conjugate gradient; 3-d FFT PDE; integer sort.

The five NPB kernels, with the exception of the embarrassingly parallel application, are all highly communication intensive when parallelized for message passing systems. As such, they form a rigorous suite of quasi-real applications that heavily exercise system facilities and also provide insights into bottlenecks and hot-spots for specific distributed-memory architectures. In order to investigate the viability of clusters and heterogeneous concurrent systems for such applications, the NPB kernels were ported to execute on the PVM system. Detailed discussions and analyses are presented in [15,16]; here we describe our experiences with two representative kernels, using Ethernet and FDDI-based clusters.

The V-cycle multigrid kernel involves the solution of a discrete Poisson problem ∇²u = v with periodic boundary conditions on a grid. v is 0 at all coordinates except for 10 specific points which are +1.0, and 10 specific points which are −1.0. The PVM version of this application was derived by substantially modifying an Intel hypercube version, with data partitioning along one dimension of the grid, maintaining shadow boundaries, and performing nearest neighbor communications. Several optimizations were also incorporated, primarily to maximize utilization of network capacity, and to reduce some communication. Results for the multigrid kernel under PVM are shown in Table 2.

Table 2. V-cycle multigrid: PVM timings. Columns: platform, time (s), communication volume (MB), communication time (s), bandwidth (KB/s). Platforms: 4 IBM RS6000 (Ethernet); 4 IBM RS6000 (FDDI); 8 IBM RS6000 (FDDI); Cray Y-MP/1 (54 s, no communication).

From the table it can be seen that the PVM implementation performs at good to excellent levels, despite the large volume of communication, which accounts for up to 35% of the overall execution time. It may also be observed that the communications bandwidths obtained, at the application level, are a significant percentage of the theoretical limit for both types of networks. Finally, the eight processor cluster achieves one-half the speed of a single processor of the Cray Y-MP, at an estimated one-fourth of the cost.
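The shadow-boundary exchange used with the 1-d partition can be sketched as follows. This is an illustrative reconstruction, not the benchmark code itself; the neighbor task identifiers, the message tags 10 and 11, and the plane size parameter are hypothetical.

    /* Hedged sketch of a nearest-neighbor shadow (ghost) boundary
       exchange for a 1-d grid partition.  Each task sends its first and
       last owned planes to its two neighbors and receives their planes
       into shadow slots; PVM sends are buffered, so both sends can be
       issued before the receives. */
    #include "pvm3.h"

    void exchange_shadows(double *first_plane, double *last_plane,
                          double *lo_shadow, double *hi_shadow,
                          int npts, int lo_nbr, int hi_nbr)
    {
        /* send boundary planes to the two neighbors */
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(first_plane, npts, 1);
        pvm_send(lo_nbr, 10);

        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(last_plane, npts, 1);
        pvm_send(hi_nbr, 11);

        /* receive the neighbors' boundary planes into the shadow regions */
        pvm_recv(hi_nbr, 10);
        pvm_upkdouble(hi_shadow, npts, 1);

        pvm_recv(lo_nbr, 11);
        pvm_upkdouble(lo_shadow, npts, 1);
    }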

The conjugate gradient kernel is an application that approximates the smallest eigenvalue of a symmetric positive-definite sparse matrix. The critical portion of the code is a matrix-vector multiplication, requiring the exchange of subvectors in the partitioning scheme used. In this exercise also, the PVM version was implemented for optimal performance, with modifications once again focusing on reducing communication volume and interference. Results from our experiments are shown in Table 3.

Table 3. Conjugate gradient: PVM timings. Columns: platform, time (s), communication volume (MB), communication time (s), bandwidth (KB/s). Platforms: 4 IBM RS6000 (Ethernet); 4 IBM RS6000 (FDDI); 16 Sun Sparc SS (Ethernet); Cray Y-MP/1 (22 s, no communication).

This table also exhibits certain interesting characteristics. Like the multigrid application, the conjugate gradient kernel is able to obtain near theoretical communications bandwidth, particularly on the Ethernet, and a four processor cluster of high-performance workstations performs at one-fourth the speed of a Cray Y-MP/1. Another notable observation is that with an increase in the number of processors, the communication volume increases, thereby resulting in lowered speedups.

Our results from these two NPB kernels indicate both the power and the limitations of concurrent network-based computing, i.e., that with high-speed, high-capacity networks, cluster performance is competitive with that of supercomputers; that it is possible to harness nearly the full potential and capacity of processing elements and networks; but that scaling, load imbalance, and latency limitations are inevitable with the use of the general purpose processors and networks that most cluster environments are built from.

4.2. Polymer chains and scale-invariant phenomena

The particular problem studied in this work is one in which some fundamental aspects of the statistical mechanics of polymer solutions [17] are investigated. In this experiment, we focus on a linear chain which has a restricted interaction with the medium; that is, there are forbidden regions (infinite energy barrier), and the chain is confined to other parts of the medium. If the forbidden regions occur randomly and the minimum length scale of these is much smaller than the size of the polymer in homogeneous media, then this problem can be modeled by a self-avoiding walk (SAW) on a randomly diluted lattice. Thus, we first create a percolation cluster [18] on a finite grid (of size L³) by randomly diluting the grid (i.e., removing sites as inaccessible to the chain) with probability 1 − p. The remaining sites then form connected components called clusters. Above a certain threshold p_c, there exists one cluster that spans the grid from end to end, and this is the cluster of interest. On this disordered cluster, a starting point of a SAW is chosen randomly, and then all SAWs of a predetermined number of steps N are generated by a depth-first type search algorithm. At each of the N steps, various conformational properties of the chain are measured, such as the moments of the end-to-end distance R_N and radius of gyration S_N and those of the number of chains C_N. These are then averaged over the ensemble of SAWs on the particular disorder configuration. This is repeated for a large number of disorder configurations, and finally both linear and logarithmic means are calculated over the disorder ensemble.

The polymer simulation problem is typical of most Monte Carlo problems in that it possesses a simple repetitive structure. The main routine initializes various arrays for statistics and calls a slave routine which computes samples, and periodically communicates them to the monitor(s). A few hours of effort were required in parallelizing the original code using PVM and a related tool called ECLIPSE [23]. In 11 different experiments, each corresponding to a particular network configuration of arbitrarily chosen machines, we used between 16 and 192 geographically dispersed processors; results from these experiments are reported in Table 4.

Table 4. Polymer physics application: PVM/ECLIPSE timings (seconds). Columns: number of processors, Sun SS, IBM RS6000, Sun SS + RS6000, Intel i860, SS + IBM + i860, average, equivalent Cray time.

The parallelization has made an otherwise impossible interactive experimentation process possible. Previous exercises conducted on the CRAY Y-MP were only permissible in batch mode, because of the computing time and cost involved, and allowed for little interactive experimentation. Further, limited CRAY access time made the entire experimentation process difficult; restrictions that many researchers encounter on supercomputers as well as on MPPs. In contrast, our experience is that a computing environment flexibly allowing for computation nodes that are SUN/IBM workstations, Intel i860 nodes, iPSC/2 nodes or Sequent processors improves the entire interactive experimentation process by an order of magnitude. Moreover,


Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

CPS221 Lecture: Operating System Functions

CPS221 Lecture: Operating System Functions CPS221 Lecture: Operating System Functions Objectives last revised 6/23/10 1. To overview key hardware concepts 2. To iintroduce the process concept 3. To discuss the various kinds of functionality of

More information

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

[8] J. J. Dongarra and D. C. Sorensen. SCHEDULE: Programs. In D. B. Gannon L. H. Jamieson {24, August 1988.

[8] J. J. Dongarra and D. C. Sorensen. SCHEDULE: Programs. In D. B. Gannon L. H. Jamieson {24, August 1988. editor, Proceedings of Fifth SIAM Conference on Parallel Processing, Philadelphia, 1991. SIAM. [3] A. Beguelin, J. J. Dongarra, G. A. Geist, R. Manchek, and V. S. Sunderam. A users' guide to PVM parallel

More information

HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT:

HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT: HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms Author: Stan Posey Panasas, Inc. Correspondence: Stan Posey Panasas, Inc. Phone +510 608 4383 Email sposey@panasas.com

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

Three basic multiprocessing issues

Three basic multiprocessing issues Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

LAPI on HPS Evaluating Federation

LAPI on HPS Evaluating Federation LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

Chapter 17: Distributed Systems (DS)

Chapter 17: Distributed Systems (DS) Chapter 17: Distributed Systems (DS) Silberschatz, Galvin and Gagne 2013 Chapter 17: Distributed Systems Advantages of Distributed Systems Types of Network-Based Operating Systems Network Structure Communication

More information

Early Evaluation of the Cray X1 at Oak Ridge National Laboratory

Early Evaluation of the Cray X1 at Oak Ridge National Laboratory Early Evaluation of the Cray X1 at Oak Ridge National Laboratory Patrick H. Worley Thomas H. Dunigan, Jr. Oak Ridge National Laboratory 45th Cray User Group Conference May 13, 2003 Hyatt on Capital Square

More information

IOS: A Middleware for Decentralized Distributed Computing

IOS: A Middleware for Decentralized Distributed Computing IOS: A Middleware for Decentralized Distributed Computing Boleslaw Szymanski Kaoutar El Maghraoui, Carlos Varela Department of Computer Science Rensselaer Polytechnic Institute http://www.cs.rpi.edu/wwc

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Network protocols and. network systems INTRODUCTION CHAPTER

Network protocols and. network systems INTRODUCTION CHAPTER CHAPTER Network protocols and 2 network systems INTRODUCTION The technical area of telecommunications and networking is a mature area of engineering that has experienced significant contributions for more

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

Client Server & Distributed System. A Basic Introduction

Client Server & Distributed System. A Basic Introduction Client Server & Distributed System A Basic Introduction 1 Client Server Architecture A network architecture in which each computer or process on the network is either a client or a server. Source: http://webopedia.lycos.com

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

High Performance Computing. Introduction to Parallel Computing

High Performance Computing. Introduction to Parallel Computing High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials

More information

Distributed Systems. Overview. Distributed Systems September A distributed system is a piece of software that ensures that:

Distributed Systems. Overview. Distributed Systems September A distributed system is a piece of software that ensures that: Distributed Systems Overview Distributed Systems September 2002 1 Distributed System: Definition A distributed system is a piece of software that ensures that: A collection of independent computers that

More information

Adaptive Cluster Computing using JavaSpaces

Adaptive Cluster Computing using JavaSpaces Adaptive Cluster Computing using JavaSpaces Jyoti Batheja and Manish Parashar The Applied Software Systems Lab. ECE Department, Rutgers University Outline Background Introduction Related Work Summary of

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Accelerated Library Framework for Hybrid-x86

Accelerated Library Framework for Hybrid-x86 Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit

More information

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore

More information

Chapter 1: Distributed Systems: What is a distributed system? Fall 2013

Chapter 1: Distributed Systems: What is a distributed system? Fall 2013 Chapter 1: Distributed Systems: What is a distributed system? Fall 2013 Course Goals and Content n Distributed systems and their: n Basic concepts n Main issues, problems, and solutions n Structured and

More information

1 Executive Overview The Benefits and Objectives of BPDM

1 Executive Overview The Benefits and Objectives of BPDM 1 Executive Overview The Benefits and Objectives of BPDM This is an excerpt from the Final Submission BPDM document posted to OMG members on November 13 th 2006. The full version of the specification will

More information

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction

More information

Commission of the European Communities **************** ESPRIT III PROJECT NB 6756 **************** CAMAS

Commission of the European Communities **************** ESPRIT III PROJECT NB 6756 **************** CAMAS Commission of the European Communities **************** ESPRIT III PROJECT NB 6756 **************** CAMAS COMPUTER AIDED MIGRATION OF APPLICATIONS SYSTEM **************** CAMAS-TR-2.3.4 Finalization Report

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores SPARCCenter, SGI Challenge, Cray T3D, Convex Exemplar, KSR-1&2, today s CMPs message

More information

Lecture 3: Intro to parallel machines and models

Lecture 3: Intro to parallel machines and models Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class

More information

The modularity requirement

The modularity requirement The modularity requirement The obvious complexity of an OS and the inherent difficulty of its design lead to quite a few problems: an OS is often not completed on time; It often comes with quite a few

More information

A Comparison of Allocation Policies in Wavelength Routing Networks*

A Comparison of Allocation Policies in Wavelength Routing Networks* Photonic Network Communications, 2:3, 267±295, 2000 # 2000 Kluwer Academic Publishers. Manufactured in The Netherlands. A Comparison of Allocation Policies in Wavelength Routing Networks* Yuhong Zhu, George

More information

Multiprocessors 2007/2008

Multiprocessors 2007/2008 Multiprocessors 2007/2008 Abstractions of parallel machines Johan Lukkien 1 Overview Problem context Abstraction Operating system support Language / middleware support 2 Parallel processing Scope: several

More information

Multimedia Systems 2011/2012

Multimedia Systems 2011/2012 Multimedia Systems 2011/2012 System Architecture Prof. Dr. Paul Müller University of Kaiserslautern Department of Computer Science Integrated Communication Systems ICSY http://www.icsy.de Sitemap 2 Hardware

More information

An Introduction to Software Architecture. David Garlan & Mary Shaw 94

An Introduction to Software Architecture. David Garlan & Mary Shaw 94 An Introduction to Software Architecture David Garlan & Mary Shaw 94 Motivation Motivation An increase in (system) size and complexity structural issues communication (type, protocol) synchronization data

More information

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 1 Introduction Modified by: Dr. Ramzi Saifan Definition of a Distributed System (1) A distributed

More information

Milind Kulkarni Research Statement

Milind Kulkarni Research Statement Milind Kulkarni Research Statement With the increasing ubiquity of multicore processors, interest in parallel programming is again on the upswing. Over the past three decades, languages and compilers researchers

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

Network Bandwidth & Minimum Efficient Problem Size

Network Bandwidth & Minimum Efficient Problem Size Network Bandwidth & Minimum Efficient Problem Size Paul R. Woodward Laboratory for Computational Science & Engineering (LCSE), University of Minnesota April 21, 2004 Build 3 virtual computers with Intel

More information

Operating Systems Overview. Chapter 2

Operating Systems Overview. Chapter 2 1 Operating Systems Overview 2 Chapter 2 3 An operating System: The interface between hardware and the user From the user s perspective: OS is a program that controls the execution of application programs

More information

Introduction to Cluster Computing

Introduction to Cluster Computing Introduction to Cluster Computing Prabhaker Mateti Wright State University Dayton, Ohio, USA Overview High performance computing High throughput computing NOW, HPC, and HTC Parallel algorithms Software

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Benefits of Programming Graphically in NI LabVIEW

Benefits of Programming Graphically in NI LabVIEW Benefits of Programming Graphically in NI LabVIEW Publish Date: Jun 14, 2013 0 Ratings 0.00 out of 5 Overview For more than 20 years, NI LabVIEW has been used by millions of engineers and scientists to

More information

Benefits of Programming Graphically in NI LabVIEW

Benefits of Programming Graphically in NI LabVIEW 1 of 8 12/24/2013 2:22 PM Benefits of Programming Graphically in NI LabVIEW Publish Date: Jun 14, 2013 0 Ratings 0.00 out of 5 Overview For more than 20 years, NI LabVIEW has been used by millions of engineers

More information

WhatÕs New in the Message-Passing Toolkit

WhatÕs New in the Message-Passing Toolkit WhatÕs New in the Message-Passing Toolkit Karl Feind, Message-passing Toolkit Engineering Team, SGI ABSTRACT: SGI message-passing software has been enhanced in the past year to support larger Origin 2

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

THE GLOBUS PROJECT. White Paper. GridFTP. Universal Data Transfer for the Grid

THE GLOBUS PROJECT. White Paper. GridFTP. Universal Data Transfer for the Grid THE GLOBUS PROJECT White Paper GridFTP Universal Data Transfer for the Grid WHITE PAPER GridFTP Universal Data Transfer for the Grid September 5, 2000 Copyright 2000, The University of Chicago and The

More information

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed

More information

Dr e v prasad Dt

Dr e v prasad Dt Dr e v prasad Dt. 12.10.17 Contents Characteristics of Multiprocessors Interconnection Structures Inter Processor Arbitration Inter Processor communication and synchronization Cache Coherence Introduction

More information

Peer-to-Peer Systems. Chapter General Characteristics

Peer-to-Peer Systems. Chapter General Characteristics Chapter 2 Peer-to-Peer Systems Abstract In this chapter, a basic overview is given of P2P systems, architectures, and search strategies in P2P systems. More specific concepts that are outlined include

More information

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware

More information

Lecture V: Introduction to parallel programming with Fortran coarrays

Lecture V: Introduction to parallel programming with Fortran coarrays Lecture V: Introduction to parallel programming with Fortran coarrays What is parallel computing? Serial computing Single processing unit (core) is used for solving a problem One task processed at a time

More information

Parallel Programming Environments. Presented By: Anand Saoji Yogesh Patel

Parallel Programming Environments. Presented By: Anand Saoji Yogesh Patel Parallel Programming Environments Presented By: Anand Saoji Yogesh Patel Outline Introduction How? Parallel Architectures Parallel Programming Models Conclusion References Introduction Recent advancements

More information

Parallel Architectures

Parallel Architectures Parallel Architectures CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Parallel Architectures Spring 2018 1 / 36 Outline 1 Parallel Computer Classification Flynn s

More information

Multi-Processor / Parallel Processing

Multi-Processor / Parallel Processing Parallel Processing: Multi-Processor / Parallel Processing Originally, the computer has been viewed as a sequential machine. Most computer programming languages require the programmer to specify algorithms

More information

Q.1 Explain Computer s Basic Elements

Q.1 Explain Computer s Basic Elements Q.1 Explain Computer s Basic Elements Ans. At a top level, a computer consists of processor, memory, and I/O components, with one or more modules of each type. These components are interconnected in some

More information

Example of a Parallel Algorithm

Example of a Parallel Algorithm -1- Part II Example of a Parallel Algorithm Sieve of Eratosthenes -2- -3- -4- -5- -6- -7- MIMD Advantages Suitable for general-purpose application. Higher flexibility. With the correct hardware and software

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a Asynchronous Checkpointing for PVM Requires Message-Logging Kevin Skadron 18 April 1994 Abstract Distributed computing using networked workstations oers cost-ecient parallel computing, but the higher rate

More information

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification:

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification: Application control : Boundary control : Access Controls: These controls restrict use of computer system resources to authorized users, limit the actions authorized users can taker with these resources,

More information

Chapter 3. Design of Grid Scheduler. 3.1 Introduction

Chapter 3. Design of Grid Scheduler. 3.1 Introduction Chapter 3 Design of Grid Scheduler The scheduler component of the grid is responsible to prepare the job ques for grid resources. The research in design of grid schedulers has given various topologies

More information

Parallel Programming Interfaces

Parallel Programming Interfaces Parallel Programming Interfaces Background Different hardware architectures have led to fundamentally different ways parallel computers are programmed today. There are two basic architectures that general

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008 Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared

More information

Super-Peer Architectures for Distributed Computing

Super-Peer Architectures for Distributed Computing Whitepaper Super-Peer Architectures for Distributed Computing Fiorano Software, Inc. 718 University Avenue, Suite 212 Los Gatos, CA 95032 U.S.A. +1.408.354.3210 email info@fiorano.com www.fiorano.com Entire

More information