Heterogeneous parallel and distributed computing

Parallel Computing 25 (1999) 1699–1721

Heterogeneous parallel and distributed computing

V.S. Sunderam a,*, G.A. Geist b

a Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, USA
b Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

* Corresponding author. E-mail address: vss@mathcs.emory.edu (V.S. Sunderam)

Abstract

Heterogeneous network-based distributed and parallel computing is gaining increasing acceptance as an alternative or complementary paradigm to multiprocessor-based parallel processing as well as to conventional supercomputing. While algorithmic and programming aspects of heterogeneous concurrent computing are similar to their parallel processing counterparts, system issues, partitioning and scheduling, and performance aspects are significantly different. In this paper, we discuss the evolution of heterogeneous concurrent computing, in the context of the parallel virtual machine (PVM) system, a widely adopted software system for network computing. In particular, we highlight the system level infrastructures that are required, aspects of parallel algorithm development that most affect performance, system capabilities and limitations, and tools and methodologies for effective computing in heterogeneous networked environments. We also present recent developments and experiences in the PVM project, and comment on ongoing and future work. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Heterogeneous computing; Networked computing; Cluster computing; Message passing interface (MPI); Parallel virtual machine (PVM); NAS parallel benchmark; Parallel I/O; Metacomputing

1. Introduction

We discuss parallel and distributed computing on networked heterogeneous environments. As used in this paper, these terms, as well as ``concurrent'' computing, refer to the simultaneous execution of the components of a single application on multiple processing elements.

While this definition might also apply to most other notions of parallel processing, we make a deliberate distinction, to highlight certain attributes of the methodologies and systems discussed herein, namely loose coupling, physical and logical independence of the processing elements, and heterogeneity. These characteristics distinguish heterogeneous concurrent computing from traditional parallel processing, normally performed on homogeneous, tightly coupled platforms which possess some degree of physical independence but are logically coherent.

Concurrent computing, in various forms, is becoming increasingly popular as a methodology for many classes of applications, particularly those in the high-performance and scientific computing arenas. This is due to numerous benefits that accrue, both from the applications as well as the systems perspectives. However, in order to fully exploit these advantages, a substantial framework is required, in the form of novel programming paradigms and models, systems support, toolkits, and performance analysis and enhancement mechanisms. In this paper, we focus on the latter aspects, namely the systems infrastructures, functionality, and performance issues in concurrent computing.

1.1. Heterogeneous, networked, and cluster computing

One of the major goals of concurrent computing systems is to support heterogeneity. Heterogeneous computing refers to architectures, models, systems, and applications that comprise substantively different components, as well as to techniques and methodologies that address issues that arise when computing in heterogeneous environments. While this definition encompasses numerous systems, including reconfigurable architectures, mixed-mode arithmetic, special purpose hardware, and even vector and input-output units, we restrict ourselves to systems that are comprised of networked, independent, general purpose computers that may be used in a coherent and unified manner. Thus, heterogeneous systems may consist of scalar, vector, parallel, and graphics machines that are interconnected by one or more (types of) networks, and support one or more programming environments/operating systems. In such environments, heterogeneity occurs in several forms:

System architecture - heterogeneous systems may consist of SIMD, MIMD, scalar, and vector computers.

Machine architecture - individual processing elements may differ in their instruction sets and/or data representation.

Machine configurations - even when processing elements are architecturally identical, differences such as clock speeds and memory contribute to heterogeneity.

External influences - as heterogeneous systems are normally built in general purpose environments, external resource demands can (and often do) induce heterogeneity into processing elements that are identical in architecture and configuration, and further, cause dynamic variations in interconnection network capacity.

Interconnection networks - may be optical or electrical, local or wide-area, high or low speed, and may employ several different protocols.

Software - from the infrastructure point of view, the underlying operating systems are often different in heterogeneous systems; from the applications point of view, in addition to operating systems aspects, different programming models, languages, and support libraries are available in heterogeneous systems.

Research in heterogeneous systems is in progress in several areas [1,2] including applications, paradigm development, mapping, scheduling, reconfiguration, etc., but the primary thrust has thus far been in systems, methodologies, and toolkits [22]. This latter thrust has been highly productive and successful, with several systems in production-level use at hundreds of installations worldwide. The body of this paper will discuss the parallel virtual machine (PVM) system that has evolved into a popular and effective methodology for heterogeneous concurrent computing.

1.2. Applications perspective

From the point of view of application development, heterogeneous computing is attractive, since it inherently supports function parallelism, with the added potential of executing subtasks on best-suited architectures. It is well known that different types of algorithms are well matched to different machine architectures and configurations, and at least in the abstract sense, heterogeneous computing permits this matching to be realized, resulting in optimality in application execution as well as in resource utilization. However, in practice, this scenario may be difficult to achieve for reasons of availability, applicability, and the existence of appropriate mapping and scheduling tools. Nevertheless, the concept is an attractive one and several research efforts are in progress in this area [3,4].

In this respect, many classes of applications that would benefit substantively from heterogeneous computing have been identified. For example, a critically important problem which is ideally suited to heterogeneous computing is global climate modeling. Simulation of the global climate is a particularly difficult challenge because of the wide range of time and space scales governing the behavior of the atmosphere, the oceans, and the surface. Parallel GCM codes require distinct component modules representing the atmosphere, ocean and surface, and process modules representing phenomena like radiation and convection. Sampling, updating and manipulating the associated data requires scalar, vector, MIMD and SIMD paradigms, many of which can be performed concurrently.

Another application domain that could exploit heterogeneous computing is computer vision. Vision problems generally require processing at three levels: high, medium and low. Low-level and some medium-level vision tasks often involve regular data flow and iconic operations. This type of computation is well matched to mesh-connected SIMD machines. Medium-grained MIMD machines are more suitable for various high-level and some medium-level vision tasks which are communication-intensive and in which the flow of data is not regular. Coarse-grained MIMD machines are best matched for high-level vision tasks such as image understanding/recognition and symbolic processing.

As previously mentioned however, the above aspect of heterogeneous concurrent computing is still in its infancy.

Proof-of-concept research and experiments have demonstrated the viability of exploiting application heterogeneity, and many others are evolving. On the other hand, the systems aspect has matured significantly, to the extent that robust environments are now available for production execution of traditional parallel applications while providing stable testbeds for the evolving, truly heterogeneous, applications [12,13]. We discuss the systems facet of heterogeneous concurrent computing in the remainder of the paper.

2. The historical perspective

2.1. Heterogeneous concurrent computing systems

Heterogeneous computing systems [5,6] evolved in the late 1980s and shared some common goals and requirements:

To effectively provide access to significant amounts of computing resources in a cost-effective manner, usually by utilizing already available resources.

To exploit the existing software infrastructure and facilities (e.g., editors, compilers, debuggers) that are available on individual computer systems in a cluster.

To provide an effective programming model and interface, generally based on explicit parallelism and the message passing paradigm.

To support transparency in terms of architecture, processor type, task location, network communication, and resource allocation.

To achieve the best possible performance, subject to the inherent limitations of the processors and networks involved; some systems also attempt to be non-intrusive by suspending execution in deference to higher priority activities.

Several of the above goals were met, at least by the most popular network-computing/heterogeneous processing systems. Other goals, such as exploiting heterogeneity, sophisticated job and resource management, automatic parallelization, and graphical interfaces are still being pursued. Since then, PVM gained substantially in popularity, and the MPI standard evolved in the mid 1990s; both are still in widespread use. We briefly outline some of the earlier systems and their salient features before discussing PVM in depth, and commenting on the latest trends in metacomputing.

2.2. The Linda model and system

Linda [7] is a concurrent programming model that has evolved from a Yale University research project. The primary concept in Linda is that of a ``tuple space'', an abstraction via which cooperating processes communicate. This central theme of Linda has been proposed as an alternative paradigm to the two traditional methods of parallel processing, namely, that based on shared-memory, and on message passing. The tuple space concept is essentially an abstraction of distributed shared-memory, with one important difference (tuple spaces are associative), and several minor distinctions (destructive and non-destructive reads, and different coherency semantics are possible). Applications use the Linda model by embedding explicitly, within cooperating sequential programs, constructs that manipulate (insert/retrieve tuples) the tuple space.

From the application point of view Linda [8] is a set of programming language extensions for facilitating parallel programming. The Linda model is a scheme built upon an associative memory referred to as tuple space. It provides a shared-memory abstraction for process communication without requiring the underlying hardware to physically share memory. Tuples are collections of fields logically ``welded'' to form persistent storage items. They are the basic tuple space storage units. Parallel processes exchange data by generating, reading, and consuming them. To update a tuple, the tuple is removed from tuple space, modified, and returned to tuple space. Restricting tuple space modification in this manner creates an implicit locking mechanism ensuring proper synchronization of multiple accesses.

The ``Linda system'' usually refers to a specific (sometimes portable) implementation of software that supports the Linda programming model. System software is provided that establishes and maintains tuple spaces, and that is used in conjunction with libraries that appropriately interpret and execute Linda primitives. Depending on the environment (shared-memory multiprocessors, message passing parallel computers, networks of workstations, etc.), the tuple space mechanism is implemented using different techniques, and with varying degrees of efficiency. Recently, a new system technique has been proposed, at least nominally related to the Linda project. This scheme, termed ``Piranha'', proposes a proactive approach to concurrent computing; the idea being that computational resources (viewed as active agents) seize computational tasks from a well-known location based on availability and suitability. Again, this scheme may be implemented on multiple platforms, and manifested as a ``Piranha system'' or ``Linda-Piranha system''.
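The remove-modify-reinsert idiom described above can be made concrete with a short C-Linda-style fragment. The sketch below is illustrative rather than taken from any particular Linda implementation: out( ) inserts a tuple, in( ) withdraws a matching tuple (blocking until one exists), rd( ) reads without removing, and formal fields are marked with a leading ``?''. The tuple name ``counter'' and the enclosing functions are hypothetical, and some task is assumed to have executed out("counter", 0) once beforehand.

    /* Illustrative C-Linda-style fragment: a shared counter updated via
       the remove-modify-reinsert idiom.  The implicit lock is simply the
       absence of the tuple between the in( ) and the out( ). */

    void bump_counter(void)
    {
        int val;

        in("counter", ?val);        /* withdraw the tuple: others now block */
        out("counter", val + 1);    /* reinsert the updated tuple           */
    }

    int read_counter(void)
    {
        int val;

        rd("counter", ?val);        /* non-destructive, associative read    */
        return val;
    }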

2.3. P4 and Parmacs

P4 is a library of macros and subroutines developed at Argonne National Laboratory for programming a variety of parallel machines. The P4 system [9] supports both the shared-memory model (based on monitors) and the distributed-memory model (using message passing). For the shared-memory model of parallel computation, P4 provides a set of primitives from which monitors can be constructed, as well as a set of useful monitors. For the distributed-memory model, P4 provides typed send and receive operations, and creation of processes according to a text file describing group and process structure.

Process management in the P4 system is based on a configuration file that specifies the host pool, the object file to be executed on each machine, the number of processes to be started on each host (intended primarily for multiprocessor systems) and other auxiliary information. Two issues are noteworthy in regard to the process management mechanism in P4. First, there is the notion of a ``master'' process and ``slave'' processes, and multilevel hierarchies may be formed to implement what is termed a cluster model of computation. Second, the primary mode of process creation is static, via the configuration file; dynamic process creation is possible only by a statically created process that must invoke a special P4 function that spawns a new process on the local machine. However, despite these restrictions, a variety of application paradigms may be implemented in the P4 system in a fairly straightforward manner.

Message passing in the P4 system is achieved through the use of traditional send and recv primitives, parameterized almost exactly as in other message passing systems. Several variants are provided for semantics such as heterogeneous exchange, and blocking or non-blocking transfer. A significant proportion of the burden of buffer allocation and management, however, is left to the user. Apart from basic message passing, P4 also offers a variety of global operations, including broadcast, global maxima and minima, and barrier synchronization. Shared-memory support via monitors is a facility that distinguishes P4 from other systems. However, this feature is not distributed shared-memory; rather, it is a portable mechanism for shared address space programming on true shared-memory multiprocessors.

Parmacs is a project that is closely related to the P4 effort. Essentially, Parmacs is a set of macro extensions to the P4 system developed at GMD [10]. It originated in an effort to provide FORTRAN interfaces to the P4 system, but is now a significantly enhanced package that provides a variety of high-level abstractions, mostly dealing with global operations. Parmacs provides macros for logically configuring a set of P4 processes; for example, the macro torus produces a suitable configuration file for use by P4 that results in a logical process configuration corresponding to a 3-d torus. Other logical topologies, including general graphs, may also be implemented, and Parmacs provides macros used in conjunction with send and recv to achieve topology-specific communications within executing programs.

2.4. Message passing interface (MPI)

In 1992 a group of about 30 people from universities, government laboratories, and industry began meeting to specify a message passing interface. It was felt that the definition of a message passing standard provides vendors with a clearly defined base set of routines that they can implement efficiently, or in some cases provide hardware support for, thereby enhancing performance. In 1994 the MPI-1 specification was published; it defined 128 functions divided into five categories: point-to-point communication, collective communication, groups and context, processor topologies, and profiling interface. While MPI-1 defined a message passing API, it was not portable across heterogeneous clusters of computers because MPI-1 defined no standard way to start processes. Thus, in 1995 the MPI forum began to meet again to define MPI-2. MPI-2 specified an additional 200 functions in several new areas: I/O, one-sided communication, process spawning, and extended collective operations. The MPI-2 specification was published in 1997.

The goal of the MPI specification is to develop a widely used standard for writing message passing programs. MPI-1 is widely supported by the parallel computer vendors and work has begun to implement MPI-2 functions.
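As a point of reference for the MPI-1 point-to-point routines mentioned above, the following minimal C sketch (not drawn from any particular vendor implementation) sends a single integer from rank 0 to rank 1; the tag value 0 is arbitrary and the program assumes it is started with at least two processes.

    /* Minimal MPI-1 point-to-point sketch (illustrative only). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }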

3. The PVM system

3.1. PVM overview

PVM is a software system that permits the utilization of a heterogeneous network of parallel and serial computers as a unified, general, and flexible concurrent computational resource. The PVM system [11] initially supported the message passing, shared-memory, and hybrid paradigms, thus allowing applications to use the most appropriate computing model for the entire application or for individual subalgorithms. However, support for emulated shared-memory was omitted as the system evolved, since the message passing paradigm was the model of choice for most scientific parallel processing applications. Processing elements in PVM may be scalar machines, distributed and shared-memory multiprocessors, vector supercomputers and special purpose graphics engines, thereby permitting the use of the best-suited computing resource for each component of an application.

The PVM system is composed of a suite of user interface primitives and supporting software that together enable concurrent computing on loosely coupled networks of processing elements. PVM may be implemented on a hardware base consisting of different machine architectures, including single CPU systems, vector machines, and multiprocessors. These computing elements may be interconnected by one or more networks, which may themselves be different (e.g., one implementation of PVM operates on Ethernet, the Internet, and a fiber optic network). These computing elements are accessed by applications via a standard interface that supports common concurrent processing paradigms in the form of well-defined primitives that are embedded in procedural host languages. Application programs are composed of components that are subtasks at a moderately large level of granularity. During execution, multiple instances of each component may be initiated. Fig. 1 depicts a simplified architectural overview of the PVM computing model as well as the system.

Application programs view the PVM system as a general and flexible parallel computing resource. A translucent layering permits flexibility while retaining the ability to exploit particular strengths of individual machines on the network. The PVM user interface is strongly typed; support for operating in a heterogeneous environment is provided in the form of special constructs that selectively perform machine dependent data conversions where necessary. Inter-instance communication constructs include those for the exchange of data structures as well as high-level primitives such as broadcast, barrier synchronization, mutual exclusion, and rendezvous. Application programs under PVM may possess arbitrary control and dependency structures. In other words, at any point in the execution of a concurrent application, the processes in existence may have arbitrary relationships between each other and, further, any process may communicate and/or synchronize with any other.

Fig. 1. PVM system overview: (a) PVM computing model; (b) PVM architectural overview.

The PVM system is composed of two parts. The first part is a daemon, called pvmd, that executes on all the computers comprising the virtual machine. PVM is designed so that any user having normal access rights to each host in the pool may install and operate the system. To run a PVM application, the user executes the daemons on a selected host pool, and the set of daemons cooperate via distributed algorithms to initialize the virtual machine. The PVM application can then be started by executing a program on any of these machines. The usual method is for this manually started program to spawn other application processes, using PVM facilities. Multiple users may configure overlapping virtual machines, and each user can execute several PVM applications simultaneously. The second part of the system is a library of PVM interface routines (libpvm.a). This library contains user callable routines for message passing, spawning processes, coordinating tasks, and modifying the virtual machine.

The installation process for PVM is straightforward. PVM does not require special privileges to be installed. Anyone with a valid login on the hosts can do so, by following a simple sequence of steps for obtaining the distribution via the Web or by Ftp, compiling, and installing.
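To make this two-part structure concrete, the following sketch shows the shape of a manually started PVM master that spawns worker tasks and collects one integer from each using libpvm routines. It is an illustrative fragment, not code from the PVM distribution; the executable name ``worker'', the message tag 1, and the omission of error handling are assumptions made for brevity.

    /* Illustrative PVM master sketch: spawn workers on the virtual
       machine and gather one integer result from each. */
    #include <stdio.h>
    #include "pvm3.h"

    #define NWORKERS 4

    int main(void)
    {
        int mytid = pvm_mytid();          /* enroll this process in PVM   */
        int tids[NWORKERS], i, result;

        /* let PVM place the tasks anywhere on the virtual machine */
        pvm_spawn("worker", NULL, PvmTaskDefault, "", NWORKERS, tids);

        for (i = 0; i < NWORKERS; i++) {  /* gather results in any order  */
            pvm_recv(-1, 1);              /* -1 = any task, message tag 1 */
            pvm_upkint(&result, 1, 1);
            printf("master t%x received %d\n", mytid, result);
        }

        pvm_exit();                       /* leave the virtual machine    */
        return 0;
    }

A matching worker would obtain the master's task identifier with pvm_parent( ), pack its result with pvm_initsend( ) and pvm_pkint( ), return it with pvm_send( ), and finally call pvm_exit( ).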

3.2. PVM programming

Developing applications for the PVM system follows, in a general sense at least, the traditional paradigm for programming distributed-memory multiprocessors such as the Intel family of hypercubes. This is true for both the logistical aspects of programming as well as for algorithm development. However, there are significant differences in terms of (a) task management, especially issues concerning dynamic process creation, naming and addressing; (b) initialization phases prior to actual computation; (c) granularity choices; and (d) heterogeneity. These issues must be kept in mind during the general programming process for PVM and attention paid to factors that impact functionality and performance.

In PVM, the issue of workload allocation is of particular importance, subsequent to establishing process structure, because of the heterogeneous and multiprogrammed nature of the underlying hosts that inherently cause load imbalances. Therefore, data decomposition or partitioning should not assume that all processing elements are equally capable or equally available. Function decomposition is better suited, since it divides the work based on different operations or functions. In a sense, the PVM computing model supports function decomposition at the component level (components are fundamentally different programs that perform different operations) and data decomposition at the instance level, i.e., within a component, the same program operates on different portions of the data.

In order to utilize the PVM system, applications must evolve through two stages. The first concerns development of the distributed-memory parallel version of the application algorithm(s); this phase is common to the PVM system as well as to other distributed-memory multiprocessors. The actual parallelization decisions fall into two major categories: those related to structure, and those related to efficiency. For structural decisions in parallelizing applications, the major decisions to be made include the choice of model to be used, i.e., crowd computation (based on peer-to-peer process structures) vs. tree computation (based on hierarchical process structures), and data decomposition vs. function decomposition. Decisions with respect to efficiency when parallelizing for distributed-memory environments are generally oriented towards minimizing the frequency and volume of communications. It is typically in this latter respect that the parallelization process differs for PVM and hardware multiprocessors: for PVM environments based on networks, large granularity generally leads to better performance. With this qualification, the parallelization process is very similar for PVM and for other distributed-memory environments, including hardware multiprocessors.

The parallelization of applications may be done either ab initio, or from existing sequential versions, or from existing parallel versions. In the first two cases, the stages involved are to select an appropriate algorithm for each of the subtasks in the application, usually from published descriptions, or by inventing a parallel algorithm. These algorithms are then coded in the language of choice (C, C++, or FORTRAN77 for PVM) and interfaced with each other as well as with process management and other constructs. Parallelization from existing sequential programs also follows certain general guidelines, primary among which is to decompose loops, beginning with outermost loops and working inward. In this process, the main concern is to detect dependencies and partition loops such that dependencies are preserved while allowing for concurrency.
This parallelization process is described in numerous textbooks and papers on parallel computing, though few textbooks discuss the practical and specific aspects of transforming a sequential program to a parallel one.
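A hypothetical helper illustrates the simplest form of such a loop decomposition: block partitioning of an iteration space among the instances of a component. The function name and parameters are invented for illustration; as noted above, on heterogeneous or multiprogrammed hosts such a static split may need to be weighted by host capability, or replaced by dynamic self-scheduling from a master task.

    /* Hypothetical sketch of data decomposition at the instance level:
       worker `me' out of `nworkers' takes a contiguous block of the
       iteration space [0, n).  The first `n % nworkers' workers receive
       one extra iteration so the whole range is covered. */
    void my_block(int me, int nworkers, int n, int *lo, int *hi)
    {
        int base = n / nworkers;      /* iterations every worker gets      */
        int rem  = n % nworkers;      /* leftovers go to the first workers */

        *lo = me * base + (me < rem ? me : rem);
        *hi = *lo + base + (me < rem ? 1 : 0);   /* half-open: [lo, hi) */
    }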

Existing parallel programs may be based on either the shared-memory or distributed-memory paradigms. Converting existing shared-memory programs to PVM is similar to converting from sequential code, when the shared-memory versions are based on vector or loop level parallelism. In the case of explicit shared-memory programs, the primary task is to locate synchronization points and replace these with message passing. In order to convert existing distributed-memory parallel code to PVM, the main task is to convert from one set of concurrency constructs to another. Typically, existing distributed-memory parallel programs are written either for hardware multiprocessors or for other networked environments such as P4 or Express. In both cases, the major changes required are with regard to process management. For example, in the Intel family of distributed-memory multiprocessors (DMMPs), it is common for processes to be started from an interactive shell command line. Such a paradigm should be replaced for PVM by either a master program or a node program that takes responsibility for process spawning. With regard to interaction, there is, fortunately, a great deal of commonality between the message passing calls in various programming environments. The major differences between PVM and other systems in this context are with regard to (a) process management and process addressing schemes; (b) virtual machine configuration/reconfiguration and its impact on executing applications; (c) heterogeneity in messages as well as the aspect of heterogeneity that deals with different architectures and data representations; and (d) certain unique and specialized features such as signaling, task scheduling methods, etc.

3.3. Fault tolerance issues

Fault tolerance is a critical issue for any large scale scientific computer application. Long-running simulations, which can take days or even weeks to execute, must be given some means to gracefully handle faults in the system or the application tasks. Without fault detection and recovery it is unlikely that such applications will ever complete. For example, consider a large simulation running on dozens of workstations. If one of those many workstations should crash or be rebooted, then tasks critical to the application might disappear. Additionally, if the application hangs or fails, it may not be immediately obvious to the user. Many hours could be wasted before it is discovered that something has gone wrong. Further, there are several types of applications that explicitly require a fault tolerant execution environment, due to safety or level of service requirements. In any case, it is essential that there be some well-defined scheme for identifying system and application faults and automatically responding to them, or at least providing timely notification to the user in the event of failure.

PVM has supported a basic fault notification scheme for some time. Under the control of the user, tasks can register with PVM to be ``notified'' when the status of the virtual machine changes or when a task fails. This notification comes in the form of special event messages that contain information about the particular event.

A task can ``post'' a notify for any of the tasks from which it expects to receive a message. In this scenario, if a task dies, the receiving task will get a notify message in place of any expected message. The notify message allows the task an opportunity to respond to the fault without hanging or failing. Similarly, if a specific host like an I/O server is critical to the application, then the application tasks can post notifies for that host. The tasks will then be informed if that server exits the virtual machine, and they can allocate a new I/O server. This type of virtual machine notification is also useful in controlling computing resources. When a host exits from the virtual machine, tasks can utilize the notify messages to reconfigure themselves to the remaining resources. When a new host computer is added to the virtual machine, tasks can be notified of this as well. This information can be used to redistribute load or expand the computation to utilize the new resource. Several systems have been designed specifically for this purpose, including the WoDi system [21], which uses Condor [20] on top of PVM.

There are several important issues to consider when providing a fault notification scheme. For example, a task might request notification of an event after it has already occurred. PVM immediately generates a notify message in response to any such ``after the fact'' request. For example, if a ``task exit'' notification request is posted for a task that has already exited, a notify message is immediately returned. Similarly, if a ``host exit'' request is posted for a host that is no longer part of the virtual machine, a notify message is immediately returned. It is possible for a ``host add'' notification request to occur simultaneously with the addition of a new host into the virtual machine. To alleviate this race condition, the user must poll the virtual machine after the notify request to obtain the complete virtual machine configuration. Subsequently, PVM can then reliably deliver any new ``host add'' notifies.
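The notification scheme described above is accessed through the pvm_notify( ) routine. The fragment below is a hedged sketch of posting a ``task exit'' notify; the message tag 99 and the handling shown in the commented receive loop are hypothetical choices, not prescribed by PVM.

    /* Sketch of posting a ``task exit'' notify (illustrative).  If any
       watched task dies, PVM delivers a message with tag 99 in place of
       the expected data message, so the receiver can react rather than
       hang. */
    #include "pvm3.h"

    void watch_workers(int *tids, int ntask)
    {
        /* ask PVM to send this task a tag-99 message whenever one of
           the listed tasks exits, normally or through a host failure  */
        pvm_notify(PvmTaskExit, 99, ntask, tids);
    }

    /* later, in the receive loop (hypothetical handling):
         bufid = pvm_recv(-1, -1);
         pvm_bufinfo(bufid, &bytes, &tag, &src);
         if (tag == 99) { ... respawn the task or redistribute its work ... }
    */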

3.4. Current status and outlook

The latest version of PVM (PVM 3.4) works with both Windows NT and Unix hosts. The new features included in PVM 3.4 allow users to develop much more flexible, dynamic, and fault tolerant applications. PVM 3.4 includes 12 new functions. These functions provide the biggest leap in PVM capabilities since PVM 3.0 came out in 1993. The functions provide communication context, message handlers, and a tuple space called message box.

The ability to send messages in different communication contexts is a fundamental requirement for parallel tools and applications that must interact with each other. It is also a requirement for the development of safe parallel libraries. Context is a unique system created tag, which is sent with each message. A matching receive function must match the context, destination, and message tag fields for the message to be received (wild cards are allowed for destination and message tag but not for context). In the past, PVM applications had to divide up the message tag space to mimic context capabilities. With PVM 3.4 there are built-in functions to create, set, and free context values. By defining the context to be system wide unique, PVM continues to allow the dynamic generation and destruction of tasks. And by defining that all PVM tasks have a base context by default, all existing PVM applications continue to work unchanged. The combination of these features allows parallel tools developers to create visualization and monitoring packages that can attach to existing PVM applications, extract information, and detach without concern about interfering with the application. The ability in the future to dynamically plug in middle layer tools and applications is predicated on the existence of a similar if not identical communication context paradigm to PVM 3.4.

PVM has always had message handlers internally, which were used for controlling the virtual machine. In PVM 3.4 the ability to define and delete message handlers has been raised up to the user level. To add a message handler, an application task calls:

    handler_id = pvm_addmhf( src, tag, context, function );

Thereafter, whenever a message arrives at this task with the specified source, message tag, and communication context, the specified function is executed. The function is passed the pointer to the message so that the handler may unpack the message if required. PVM 3.4 places no restrictions on the complexity of the function. It is free to make system calls or other PVM calls.

With the functionality provided by pvm_addmhf( ) it is possible to build one-sided communication, active messages, applications that trigger other applications on certain events, fault recovery tools and schedulers, and so on. For example, instead of an error inside an application printing an error message, the event could be made to invoke a parallel debugger focused on the area of the problem. Another example would be a distributed data mining application that finds an interesting correlation and triggers a response in all the associated searching tasks. The existence of pvm_addmhf( ) allows tasks within an application to dynamically adapt and take on new functionality whenever a message handler is invoked.

In future systems the ability to dynamically add new functionality will have to be extended to include the underlying system as well as the user tasks. One could envision a message handler defined inside the virtual machine daemons that, when triggered by the application, would spawn off intelligent agents to seek out the requested software module from Web repositories. These trusted ``children'' agents could retrieve the module and use another message handler to cause the daemon to load the module, incorporating its new features.
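A hedged sketch of how such a handler might be installed and written is given below. The tag value 42, the handler body, the explicit switch of the receive buffer, and the use of the task's current context (obtained with pvm_getcontext( )) are illustrative assumptions of this sketch rather than requirements of the interface.

    /* Illustrative user-level message handler.  Once registered, the
       handler runs automatically whenever a matching message arrives
       at this task. */
    #include <stdio.h>
    #include "pvm3.h"

    int on_control_msg(int mid)      /* called with the arrived message id */
    {
        int code;

        pvm_setrbuf(mid);            /* make the delivered message the
                                        active receive buffer, defensively */
        pvm_upkint(&code, 1, 1);
        printf("control message, code %d\n", code);
        return 0;
    }

    void install_handler(void)
    {
        /* any source (-1), tag 42, this task's current context */
        int hid = pvm_addmhf(-1, 42, pvm_getcontext(), on_control_msg);
        (void)hid;                   /* keep hid if the handler is later
                                        removed with pvm_delmhf(hid)       */
    }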

In a typical message passing system, messages are transitive and the focus is often on making their existence as short as possible, i.e., decrease latency and increase bandwidth. There are many situations in distributed applications seen today in which programming would be much easier if there was a way to have persistent messages. This is the purpose of the new message box feature in PVM 3.4. The message box is an internal tuple space in the virtual machine. Tasks can use regular PVM pack routines to create an arbitrary message and then use pvm_putinfo( ) to place this message into the message box with an associated name. Copies of this message can be retrieved by any PVM task that knows the name. And if the name is unknown or changing dynamically, then pvm_getmboxinfo( ) can be used to find the list of names active in the message box. The four functions that make up the message box in PVM 3.4 are:

    index = pvm_putinfo( name, msgbuf, flag )
    pvm_recvinfo( name, index, flag )
    pvm_delinfo( name, index, flag )
    pvm_getmboxinfo( pattern, names[], structinfo[] )

The flag defines the properties of the stored message, such as: who is allowed to delete this message, does this name allow multiple instances of messages, does a put overwrite the message? The flag argument also allows extension of this interface as PVM 3.4 users give us feedback on how they use the features of message boxes. While the tuple space could be used as a distributed shared-memory, similar to the Linda system [8], the granularity of the PVM 3.4 implementation is better suited to large grained data storage.

Here are just a few of the many potential message box uses. A visualization tool spontaneously comes to life and finds out where and how to connect to a large distributed simulation. A scheduling tool retrieves information left by a resource monitor. A new team member learns how to connect to an ongoing collaboration. A debugging tool retrieves a message left by a performance monitor that indicates which of the thousands of tasks is most likely a bottleneck. Many of these capabilities are directly applicable to adaptable environments, and some method to have persistent messages will be a part of future virtual machine design.

The addition of communication contexts, message handlers, and message boxes to the PVM environment allows developers to take a big leap forward in the capabilities of their distributed applications. PVM 3.4 is a useful tool for the development of much more dynamic, fault tolerant distributed applications.
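As an illustration of these calls, the sketch below has one task advertise its task identifier under a well-known name and another task retrieve a copy later. The name string ``simulation:contact'', the PvmMboxDefault flag value, and the explicit pvm_setrbuf( ) call are assumptions of this sketch rather than requirements of the interface.

    /* Hedged message box sketch: publish a persistent message and
       retrieve it by name from another task. */
    #include "pvm3.h"

    void publish_contact(int my_tid)
    {
        int bufid = pvm_initsend(PvmDataDefault);
        pvm_pkint(&my_tid, 1, 1);                  /* message body: our tid */
        pvm_putinfo("simulation:contact", bufid, PvmMboxDefault);
    }

    int lookup_contact(void)
    {
        int tid;
        /* index 0 selects the first (here, only) instance of the name */
        int bufid = pvm_recvinfo("simulation:contact", 0, PvmMboxDefault);

        pvm_setrbuf(bufid);           /* unpack from the retrieved copy */
        pvm_upkint(&tid, 1, 1);
        return tid;
    }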

3.5. MPI and its relationship to PVM

PVM is built around the concept of a virtual machine, which is a dynamic collection of (potentially heterogeneous) computational resources managed as a single parallel computer. The virtual machine concept is fundamental to the PVM perspective and provides the basis for heterogeneity, portability, and encapsulation of functions that constitute PVM. PVM provides features like fault tolerance and interoperability which are not a part of MPI. In contrast, MPI has focused on message passing and explicitly states that resource management and the concept of a virtual machine are outside the scope of the MPI (1 and 2) standard.

The PVM API has continuously evolved over the years to satisfy user requests for additional features and to keep up with the fast changing network and computing technology. In contrast to the PVM API, the MPI-1 API was specified by a committee of about 40 high-performance computing experts from research and industry in a series of meetings in 1993-1994, and was defined as a fixed, unchanging standard. The impetus for developing MPI was that each massively parallel processor (MPP) vendor was creating their own proprietary message passing API. In this scenario it was not possible to write a portable parallel application. MPI is intended to be a standard message passing specification that each MPP vendor would implement on their system. The MPP vendors need to be able to deliver high-performance and this became the focus of the MPI design. Given this design focus, MPI is expected to always be faster than PVM on MPP hosts.

MPI-1 contains the following main features:

A large set of point-to-point communication routines (by far the richest set of any library to date).

A large set of collective communication routines for communication among groups of processes.

A communication context that provides support for the design of safe parallel software libraries.

The ability to specify communication topologies.

The ability to create derived datatypes that describe messages of non-contiguous data.

MPI-1 users soon discovered that their applications were not portable across a network of workstations because there was no standard method to start MPI tasks on separate hosts. Different MPI implementations used different methods. In 1995 the MPI committee began meeting to design the MPI-2 specification to correct this problem and to add additional communication functions to MPI, including:

MPI_SPAWN functions to start MPI processes.

One-sided communication functions such as put and get.

MPI-IO.

Language bindings for C++.

The MPI-2 specification was finished in June 1997. The MPI-2 document adds an additional 200 functions to the 128 functions specified in the MPI-1 API. This makes MPI a much richer source of communication methods than PVM.
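For comparison with PVM's pvm_spawn( ), the following minimal MPI-2 sketch (illustrative only, not from the paper) starts four copies of a hypothetical ``worker'' executable with MPI_Comm_spawn and obtains an intercommunicator for talking to them.

    /* Illustrative MPI-2 dynamic process creation sketch. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm workers;

        MPI_Init(&argc, &argv);

        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &workers, MPI_ERRCODES_IGNORE);

        /* ... communicate with the children through `workers' ... */

        MPI_Finalize();
        return 0;
    }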

4. Representative results in network computing

4.1. The NAS parallel benchmarks

The Numerical Aerodynamic Simulation (NAS) program of the National Aeronautics and Space Administration (NASA) has devised and published a suite of benchmarks [14] for the performance analysis of highly parallel computers. These benchmarks are designed to substantially exercise the processor, memory, and communication systems of current generation parallel computers. They are specified only algorithmically; except for a few constraints, implementors are free to select optimal language constructs and implementation techniques. The complete benchmark suite consists of eight applications, five of which are termed kernels because they form the core of many classes of aerodynamic applications, and the remaining three are simulated CFD applications. The five kernels, and their vital characteristics, are listed in Table 1. NASA periodically publishes performance results obtained either from internal experiments or those conducted by third party implementors on various supercomputers and parallel machines. The de facto yardstick used to compare these performance results is a single processor of the Cray Y-MP, executing a sequential version of the same application.

Table 1. NAS parallel benchmarks: kernel characteristics. Columns: benchmark code, problem size, memory (MB), Cray time (s), operation count. Kernels: embarrassingly parallel; V-cycle multigrid; conjugate gradient; 3-d FFT PDE; integer sort.

The five NPB kernels, with the exception of the embarrassingly parallel application, are all highly communication intensive when parallelized for message passing systems. As such, they form a rigorous suite of quasi-real applications that heavily exercise system facilities and also provide insights into bottlenecks and hot-spots for specific distributed-memory architectures. In order to investigate the viability of clusters and heterogeneous concurrent systems for such applications, the NPB kernels were ported to execute on the PVM system. Detailed discussions and analyses are presented in [15,16]; here we describe our experiences with two representative kernels, using Ethernet and FDDI-based clusters.

The V-cycle multigrid kernel involves the solution of a discrete Poisson problem ∇²u = v with periodic boundary conditions on a grid. v is 0 at all coordinates except for 10 specific points which are +1.0, and 10 specific points which are −1.0. The PVM version of this application was derived by substantially modifying an Intel hypercube version, with data partitioning along one dimension of the grid, maintaining shadow boundaries, and performing nearest neighbor communications. Several optimizations were also incorporated, primarily to maximize utilization of network capacity, and to reduce some communication. Results for the multigrid kernel under PVM are shown in Table 2.

Table 2. V-cycle multigrid: PVM timings. Columns: platform, time (s), communication volume (MB), communication time (s), bandwidth (KB/s). Platforms: 4 IBM RS6000 (Ethernet); 4 IBM RS6000 (FDDI); 8 IBM RS6000 (FDDI); Cray Y-MP/1 (54 s, no communication).

From the table it can be seen that the PVM implementation performs at good to excellent levels, despite the large volume of communication, which accounts for up to 35% of the overall execution time. It may also be observed that the communications bandwidths obtained, at the application level, are a significant percentage of the theoretical limit for both types of networks. Finally, the eight processor cluster achieves one-half the speed of a single processor of the Cray Y-MP, at an estimated one-fourth of the cost.
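The shadow-boundary exchange used with the 1-d partition can be sketched as follows. This is an illustrative reconstruction, not the benchmark code itself; the neighbor task identifiers, the message tags 10 and 11, and the plane size parameter are hypothetical.

    /* Hedged sketch of a nearest-neighbor shadow (ghost) boundary
       exchange for a 1-d grid partition.  Each task sends its first and
       last owned planes to its two neighbors and receives their planes
       into shadow slots; PVM sends are buffered, so both sends can be
       issued before the receives. */
    #include "pvm3.h"

    void exchange_shadows(double *first_plane, double *last_plane,
                          double *lo_shadow, double *hi_shadow,
                          int npts, int lo_nbr, int hi_nbr)
    {
        /* send boundary planes to the two neighbors */
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(first_plane, npts, 1);
        pvm_send(lo_nbr, 10);

        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(last_plane, npts, 1);
        pvm_send(hi_nbr, 11);

        /* receive the neighbors' boundary planes into the shadow regions */
        pvm_recv(hi_nbr, 10);
        pvm_upkdouble(hi_shadow, npts, 1);

        pvm_recv(lo_nbr, 11);
        pvm_upkdouble(lo_shadow, npts, 1);
    }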

The conjugate gradient kernel is an application that approximates the smallest eigenvalue of a symmetric positive-definite sparse matrix. The critical portion of the code is a matrix-vector multiplication, requiring the exchange of subvectors in the partitioning scheme used. In this exercise also, the PVM version was implemented for optimal performance, with modifications once again focusing on reducing communication volume and interference. Results from our experiments are shown in Table 3.

Table 3. Conjugate gradient: PVM timings. Columns: platform, time (s), communication volume (MB), communication time (s), bandwidth (KB/s). Platforms: 4 IBM RS6000 (Ethernet); 4 IBM RS6000 (FDDI); 16 Sun Sparc SS (Ethernet); Cray Y-MP/1 (22 s, no communication).

This table also exhibits certain interesting characteristics. Like the multigrid application, the conjugate gradient kernel is able to obtain near theoretical communications bandwidth, particularly on the Ethernet, and a four processor cluster of high-performance workstations performs at one-fourth the speed of a Cray Y-MP/1. Another notable observation is that with an increase in the number of processors, the communication volume increases, thereby resulting in lowered speedups.

Our results from these two NPB kernels indicate both the power and the limitations of concurrent network-based computing, i.e., that with high-speed, high-capacity networks, cluster performance is competitive with that of supercomputers; that it is possible to harness nearly the full potential and capacity of processing elements and networks; but that scaling, load imbalance, and latency limitations are inevitable with the use of the general purpose processors and networks that most cluster environments are built from.

4.2. Polymer chains and scale-invariant phenomena

The particular problem studied in this work is one in which some fundamental aspects of the statistical mechanics of polymer solutions [17] are investigated. In this experiment, we focus on a linear chain which has a restricted interaction with the medium; that is, there are forbidden regions (infinite energy barrier), and the chain is confined to other parts of the medium. If the forbidden regions occur randomly and the minimum length scale of these is much smaller than the size of the polymer in homogeneous media, then this problem can be modeled by a self-avoiding walk (SAW) on a randomly diluted lattice. Thus, we first create a percolation cluster [18] on a finite grid (of size L³) by randomly diluting the grid (i.e., removing sites as inaccessible to the chain) with probability 1 − p. The remaining sites then form connected components called clusters. Above a certain threshold p_c, there exists one cluster that spans the grid from end to end, and this is the cluster of interest. On this disordered cluster, a starting point of a SAW is chosen randomly, and then all SAWs of a predetermined number of steps N are generated by a depth-first type search algorithm. At each of the N steps, various conformational properties of the chain are measured, such as the moments of the end-to-end distance R_N and radius of gyration S_N and those of the number of chains C_N. These are then averaged over the ensemble of SAWs on the particular disorder configuration. This is repeated for a large number of disorder configurations, and finally both linear and logarithmic means are calculated over the disorder ensemble.

The polymer simulation problem is typical of most Monte Carlo problems in that it possesses a simple repetitive structure. The main routine initializes various arrays for statistics and calls a slave routine which computes samples, and periodically communicates them to the monitor(s). A few hours of effort were required in parallelizing the original code using PVM and a related tool called ECLIPSE [23]. In 11 different experiments, each corresponding to a particular network configuration of arbitrarily chosen machines, we used between 16 and 192 geographically dispersed processors; results from these experiments are reported in Table 4.

Table 4. Polymer physics application: PVM/ECLIPSE timings (seconds). Columns: number of processors, Sun SS, IBM RS6000, Sun SS + RS6000, Intel i860, SS + IBM + i860, average, equivalent Cray time.

The parallelization has made an otherwise impossible interactive experimentation process possible. Previous exercises conducted on the CRAY Y-MP were only permissible in batch mode, because of the computing time and cost involved, and allowed for little interactive experimentation. Further, limited CRAY access time made the entire experimentation process difficult; restrictions that many researchers encounter on supercomputers as well as on MPPs. In contrast, our experience is that a computing environment flexibly allowing for computation nodes that are SUN/IBM workstations, Intel i860 nodes, iPSC/2 nodes or Sequent processors improves the entire interactive experimentation process by an order of magnitude. Moreover,


Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

CPS221 Lecture: Operating System Functions

CPS221 Lecture: Operating System Functions CPS221 Lecture: Operating System Functions Objectives last revised 6/23/10 1. To overview key hardware concepts 2. To iintroduce the process concept 3. To discuss the various kinds of functionality of

More information

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

[8] J. J. Dongarra and D. C. Sorensen. SCHEDULE: Programs. In D. B. Gannon L. H. Jamieson {24, August 1988.

[8] J. J. Dongarra and D. C. Sorensen. SCHEDULE: Programs. In D. B. Gannon L. H. Jamieson {24, August 1988. editor, Proceedings of Fifth SIAM Conference on Parallel Processing, Philadelphia, 1991. SIAM. [3] A. Beguelin, J. J. Dongarra, G. A. Geist, R. Manchek, and V. S. Sunderam. A users' guide to PVM parallel

More information

HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT:

HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT: HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms Author: Stan Posey Panasas, Inc. Correspondence: Stan Posey Panasas, Inc. Phone +510 608 4383 Email sposey@panasas.com

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

Three basic multiprocessing issues

Three basic multiprocessing issues Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

LAPI on HPS Evaluating Federation

LAPI on HPS Evaluating Federation LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

Chapter 17: Distributed Systems (DS)

Chapter 17: Distributed Systems (DS) Chapter 17: Distributed Systems (DS) Silberschatz, Galvin and Gagne 2013 Chapter 17: Distributed Systems Advantages of Distributed Systems Types of Network-Based Operating Systems Network Structure Communication

More information

Early Evaluation of the Cray X1 at Oak Ridge National Laboratory

Early Evaluation of the Cray X1 at Oak Ridge National Laboratory Early Evaluation of the Cray X1 at Oak Ridge National Laboratory Patrick H. Worley Thomas H. Dunigan, Jr. Oak Ridge National Laboratory 45th Cray User Group Conference May 13, 2003 Hyatt on Capital Square

More information

IOS: A Middleware for Decentralized Distributed Computing

IOS: A Middleware for Decentralized Distributed Computing IOS: A Middleware for Decentralized Distributed Computing Boleslaw Szymanski Kaoutar El Maghraoui, Carlos Varela Department of Computer Science Rensselaer Polytechnic Institute http://www.cs.rpi.edu/wwc

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Network protocols and. network systems INTRODUCTION CHAPTER

Network protocols and. network systems INTRODUCTION CHAPTER CHAPTER Network protocols and 2 network systems INTRODUCTION The technical area of telecommunications and networking is a mature area of engineering that has experienced significant contributions for more

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

Client Server & Distributed System. A Basic Introduction

Client Server & Distributed System. A Basic Introduction Client Server & Distributed System A Basic Introduction 1 Client Server Architecture A network architecture in which each computer or process on the network is either a client or a server. Source: http://webopedia.lycos.com

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

High Performance Computing. Introduction to Parallel Computing

High Performance Computing. Introduction to Parallel Computing High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials

More information

Distributed Systems. Overview. Distributed Systems September A distributed system is a piece of software that ensures that:

Distributed Systems. Overview. Distributed Systems September A distributed system is a piece of software that ensures that: Distributed Systems Overview Distributed Systems September 2002 1 Distributed System: Definition A distributed system is a piece of software that ensures that: A collection of independent computers that

More information

Adaptive Cluster Computing using JavaSpaces

Adaptive Cluster Computing using JavaSpaces Adaptive Cluster Computing using JavaSpaces Jyoti Batheja and Manish Parashar The Applied Software Systems Lab. ECE Department, Rutgers University Outline Background Introduction Related Work Summary of

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Accelerated Library Framework for Hybrid-x86

Accelerated Library Framework for Hybrid-x86 Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit

More information

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore

More information

Chapter 1: Distributed Systems: What is a distributed system? Fall 2013

Chapter 1: Distributed Systems: What is a distributed system? Fall 2013 Chapter 1: Distributed Systems: What is a distributed system? Fall 2013 Course Goals and Content n Distributed systems and their: n Basic concepts n Main issues, problems, and solutions n Structured and

More information

1 Executive Overview The Benefits and Objectives of BPDM

1 Executive Overview The Benefits and Objectives of BPDM 1 Executive Overview The Benefits and Objectives of BPDM This is an excerpt from the Final Submission BPDM document posted to OMG members on November 13 th 2006. The full version of the specification will

More information

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction

More information

Commission of the European Communities **************** ESPRIT III PROJECT NB 6756 **************** CAMAS

Commission of the European Communities **************** ESPRIT III PROJECT NB 6756 **************** CAMAS Commission of the European Communities **************** ESPRIT III PROJECT NB 6756 **************** CAMAS COMPUTER AIDED MIGRATION OF APPLICATIONS SYSTEM **************** CAMAS-TR-2.3.4 Finalization Report

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores SPARCCenter, SGI Challenge, Cray T3D, Convex Exemplar, KSR-1&2, today s CMPs message

More information

Lecture 3: Intro to parallel machines and models

Lecture 3: Intro to parallel machines and models Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class

More information

The modularity requirement

The modularity requirement The modularity requirement The obvious complexity of an OS and the inherent difficulty of its design lead to quite a few problems: an OS is often not completed on time; It often comes with quite a few

More information

A Comparison of Allocation Policies in Wavelength Routing Networks*

A Comparison of Allocation Policies in Wavelength Routing Networks* Photonic Network Communications, 2:3, 267±295, 2000 # 2000 Kluwer Academic Publishers. Manufactured in The Netherlands. A Comparison of Allocation Policies in Wavelength Routing Networks* Yuhong Zhu, George

More information

Multiprocessors 2007/2008

Multiprocessors 2007/2008 Multiprocessors 2007/2008 Abstractions of parallel machines Johan Lukkien 1 Overview Problem context Abstraction Operating system support Language / middleware support 2 Parallel processing Scope: several

More information

Multimedia Systems 2011/2012

Multimedia Systems 2011/2012 Multimedia Systems 2011/2012 System Architecture Prof. Dr. Paul Müller University of Kaiserslautern Department of Computer Science Integrated Communication Systems ICSY http://www.icsy.de Sitemap 2 Hardware

More information

An Introduction to Software Architecture. David Garlan & Mary Shaw 94

An Introduction to Software Architecture. David Garlan & Mary Shaw 94 An Introduction to Software Architecture David Garlan & Mary Shaw 94 Motivation Motivation An increase in (system) size and complexity structural issues communication (type, protocol) synchronization data

More information

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 1 Introduction Modified by: Dr. Ramzi Saifan Definition of a Distributed System (1) A distributed

More information

Milind Kulkarni Research Statement

Milind Kulkarni Research Statement Milind Kulkarni Research Statement With the increasing ubiquity of multicore processors, interest in parallel programming is again on the upswing. Over the past three decades, languages and compilers researchers

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

Network Bandwidth & Minimum Efficient Problem Size

Network Bandwidth & Minimum Efficient Problem Size Network Bandwidth & Minimum Efficient Problem Size Paul R. Woodward Laboratory for Computational Science & Engineering (LCSE), University of Minnesota April 21, 2004 Build 3 virtual computers with Intel

More information

Operating Systems Overview. Chapter 2

Operating Systems Overview. Chapter 2 1 Operating Systems Overview 2 Chapter 2 3 An operating System: The interface between hardware and the user From the user s perspective: OS is a program that controls the execution of application programs

More information

Introduction to Cluster Computing

Introduction to Cluster Computing Introduction to Cluster Computing Prabhaker Mateti Wright State University Dayton, Ohio, USA Overview High performance computing High throughput computing NOW, HPC, and HTC Parallel algorithms Software

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Benefits of Programming Graphically in NI LabVIEW

Benefits of Programming Graphically in NI LabVIEW Benefits of Programming Graphically in NI LabVIEW Publish Date: Jun 14, 2013 0 Ratings 0.00 out of 5 Overview For more than 20 years, NI LabVIEW has been used by millions of engineers and scientists to

More information

Benefits of Programming Graphically in NI LabVIEW

Benefits of Programming Graphically in NI LabVIEW 1 of 8 12/24/2013 2:22 PM Benefits of Programming Graphically in NI LabVIEW Publish Date: Jun 14, 2013 0 Ratings 0.00 out of 5 Overview For more than 20 years, NI LabVIEW has been used by millions of engineers

More information

WhatÕs New in the Message-Passing Toolkit

WhatÕs New in the Message-Passing Toolkit WhatÕs New in the Message-Passing Toolkit Karl Feind, Message-passing Toolkit Engineering Team, SGI ABSTRACT: SGI message-passing software has been enhanced in the past year to support larger Origin 2

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

THE GLOBUS PROJECT. White Paper. GridFTP. Universal Data Transfer for the Grid

THE GLOBUS PROJECT. White Paper. GridFTP. Universal Data Transfer for the Grid THE GLOBUS PROJECT White Paper GridFTP Universal Data Transfer for the Grid WHITE PAPER GridFTP Universal Data Transfer for the Grid September 5, 2000 Copyright 2000, The University of Chicago and The

More information

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed

More information

Dr e v prasad Dt

Dr e v prasad Dt Dr e v prasad Dt. 12.10.17 Contents Characteristics of Multiprocessors Interconnection Structures Inter Processor Arbitration Inter Processor communication and synchronization Cache Coherence Introduction

More information

Peer-to-Peer Systems. Chapter General Characteristics

Peer-to-Peer Systems. Chapter General Characteristics Chapter 2 Peer-to-Peer Systems Abstract In this chapter, a basic overview is given of P2P systems, architectures, and search strategies in P2P systems. More specific concepts that are outlined include

More information

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware

More information

Lecture V: Introduction to parallel programming with Fortran coarrays

Lecture V: Introduction to parallel programming with Fortran coarrays Lecture V: Introduction to parallel programming with Fortran coarrays What is parallel computing? Serial computing Single processing unit (core) is used for solving a problem One task processed at a time

More information

Parallel Programming Environments. Presented By: Anand Saoji Yogesh Patel

Parallel Programming Environments. Presented By: Anand Saoji Yogesh Patel Parallel Programming Environments Presented By: Anand Saoji Yogesh Patel Outline Introduction How? Parallel Architectures Parallel Programming Models Conclusion References Introduction Recent advancements

More information

Parallel Architectures

Parallel Architectures Parallel Architectures CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Parallel Architectures Spring 2018 1 / 36 Outline 1 Parallel Computer Classification Flynn s

More information

Multi-Processor / Parallel Processing

Multi-Processor / Parallel Processing Parallel Processing: Multi-Processor / Parallel Processing Originally, the computer has been viewed as a sequential machine. Most computer programming languages require the programmer to specify algorithms

More information

Q.1 Explain Computer s Basic Elements

Q.1 Explain Computer s Basic Elements Q.1 Explain Computer s Basic Elements Ans. At a top level, a computer consists of processor, memory, and I/O components, with one or more modules of each type. These components are interconnected in some

More information

Example of a Parallel Algorithm

Example of a Parallel Algorithm -1- Part II Example of a Parallel Algorithm Sieve of Eratosthenes -2- -3- -4- -5- -6- -7- MIMD Advantages Suitable for general-purpose application. Higher flexibility. With the correct hardware and software

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a Asynchronous Checkpointing for PVM Requires Message-Logging Kevin Skadron 18 April 1994 Abstract Distributed computing using networked workstations oers cost-ecient parallel computing, but the higher rate

More information

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification:

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification: Application control : Boundary control : Access Controls: These controls restrict use of computer system resources to authorized users, limit the actions authorized users can taker with these resources,

More information

Chapter 3. Design of Grid Scheduler. 3.1 Introduction

Chapter 3. Design of Grid Scheduler. 3.1 Introduction Chapter 3 Design of Grid Scheduler The scheduler component of the grid is responsible to prepare the job ques for grid resources. The research in design of grid schedulers has given various topologies

More information

Parallel Programming Interfaces

Parallel Programming Interfaces Parallel Programming Interfaces Background Different hardware architectures have led to fundamentally different ways parallel computers are programmed today. There are two basic architectures that general

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008 Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared

More information

Super-Peer Architectures for Distributed Computing

Super-Peer Architectures for Distributed Computing Whitepaper Super-Peer Architectures for Distributed Computing Fiorano Software, Inc. 718 University Avenue, Suite 212 Los Gatos, CA 95032 U.S.A. +1.408.354.3210 email info@fiorano.com www.fiorano.com Entire

More information