
UNIVERSITY of CALIFORNIA, SAN DIEGO

MPI Process Swapping: Performance Enhancement for Tightly-coupled Iterative Parallel Applications in Shared Computing Environments

A thesis submitted in partial satisfaction of the requirements for the degree of
Master of Science in Computer Science

by

Otto K. Sievert

Committee in charge:

Professor Henri Casanova, Chair
Professor Francine Berman, Co-chair
Professor Scott Baden

2003

Copyright 2003 by Otto K. Sievert. All rights reserved.

The thesis of Otto K. Sievert is approved.

University of California at San Diego
2003

For Aaren.

It is not best to swap horses while crossing the river.

Abraham Lincoln, in a reply to the National Union League, 1864

Contents

Dedication
Epigraph
List of Tables
List of Figures

Chapter 1  Introduction

Chapter 2  Related Work
    Work Replication
    Dynamic Load Balancing
    Checkpoint / Restart
    Process Migration
    Pre-execution Scheduling

Chapter 3  MPI Process Swapping
    The Swap Library
    Run Time Services
    Advantages of the Design
    Limitations

Chapter 4  Experimental Results
    Environment
    Applications
    Swap Policy
    Results
        Cost of over-allocation
        Gross swapping behavior
    Conclusion

Chapter 5  Swap Policies
    Payback Algebra
    Policy Parameters
    Three Swapping Policies

Chapter 6  Simulation
    Environment
    Application
    Performance Enhancing Techniques
    Simulation Architecture
    Experimental Results
        Evaluation of swapping vs. competing approaches
        Evaluation of three swapping policies
        Effect of CPU load distribution
    Summary

Chapter 7  Conclusion

Appendix A  User's Guide
    A.1 Getting Started
        A.1.1 Compiling a swap-enabled application
        A.1.2 Running a swap-enabled application
    A.2 Command Line Options

Bibliography

List of Tables

A.1  Important swapping files
A.2  Run time command line options for swap-enabled applications
A.3  Swap Dispatcher command line options
A.4  Swap Manager command line options
A.5  Swap Handler command line options
A.6  Swap Admin command line options

List of Figures

2.1  Swapping benefits
3.1  Swapping overloads standard MPI communications
3.2  The minimum source changes to swap-enable an application
3.3  The build changes to swap-enable an application
3.4  Swap run-time architecture
3.5  Swap interaction diagram
4.1  Load on experimental machines
Cost of over-allocation: MPI_Init()
Cost of over-allocation: MPI_Barrier()
Cost of over-allocation: MPI_Finalize()
Four process swapping behavior
Sixteen process swapping behavior
Sixty-four process swapping behavior
Payback distance
On-Off CPU load model
ON/OFF CPU load
Hyperexponential CPU load
ON/OFF CPU load run length distribution
Hyperexponential CPU load run length distribution
Simulation architecture
SWAP vs. NOP, DLB, CR
Swap efficacy vs. over-allocation
Swap efficacy at 1MB and 60MB process size
Swap efficacy at 1MB and 1GB process size
Swap vs. environment variability
Swapping with large process size
Hyperexponential load

Acknowledgements

I would like to acknowledge a number of people who have provided support, given assistance, and shown understanding throughout the development of this work. All the Grail members at UCSD, but especially Holly, Shava, and Graziano, provided the best environment anyone could ask for. This work grew out of a desire for a mid-execution rescheduler for the GrADS project; the members of GrADS, especially Ruth, Celso and Rich, engaged me in many interesting discussions that influenced the swapping design. The Hewlett-Packard Company supported me throughout my graduate experience, and folks there like Larry and Craig who have themselves traveled this road were always generous sounding boards. Tiffany and Thomas tolerated my mental and physical absence as I was engaged in thesis activities, often at odd hours of the day. Finally, Fran exposed me to the expanse of thesis possibilities and to the many wonders and personalities in the parallel and distributed computing community, and Henri gave me tremendous attention and (through the use of some techniques not seen since the Inquisition) encouraged me to focus on one small part of this universe. Thank you.

Portions of this work were previously published in the Proceedings of the High Performance Distributed Computing Conference and the International Journal of High Performance Computing Applications.

Abstract

MPI Process Swapping: Performance Enhancement for Tightly-coupled Iterative Parallel Applications in Shared Computing Environments

by

Otto K. Sievert
Master of Science in Computer Science
University of California, San Diego, 2003

Professor Henri Casanova, Chair
Professor Francine Berman, Co-chair

Performance and ease-of-use are difficult to obtain simultaneously for many parallel applications. Despite the enormous amount of research and development work in the area of parallel computing, performance tuning remains a labor-intensive and artful activity. A form of process migration called MPI Process Swapping is presented as one way to achieve higher application performance without sacrificing ease-of-use. Process swapping is a simple add-on to the MPI programming paradigm, and can improve performance in shared computing environments. MPI process swapping is easy to use, requiring as few as three lines of source code change to an existing iterative application, yet it provides performance benefits on par with techniques that require significant application modification.

Chapter 1

Introduction

While parallel computing has been actively pursued for several decades, it remains a daunting proposition for many end users. A number of programming models have been proposed [9, 39, 34] by which users can write applications with well-defined Application Programming Interfaces and use various parallel platforms. This thesis focuses on message passing, and in particular on the Message Passing Interface (MPI) standard [24]. MPI provides the necessary abstractions for writing parallel applications and harnessing multiple processors. However, the primary parallel computing challenges of application scalability and performance remain, especially in a shared computing environment.

While these parallel challenges can be addressed via intensive performance tuning and careful application scheduling, typical end users often lack the time and expertise required. As a result, many end users sacrifice some performance in exchange for ease-of-use (the user is relatively happy as long as parallel application performance is better than serial execution, even if the overall parallel performance is quite inefficient). This is a general trend, as parallel computing often enjoys ease-of-use or high performance, but rarely both at the same time. In this situation, a simple technique that provides a sub-optimal (but still beneficial) performance improvement can be more appealing in practice than a near-optimal solution that requires substantial effort to implement.

MPI Process Swapping is an easy-to-use application performance enhancement that provides benefit in dynamic environments. With careful attention to swapping policies, this thesis shows that process swapping can perform as well as other performance enhancing techniques such as Dynamic Load Balancing and Checkpoint/Restart. Process swapping is useful for applications that execute slowly because of resource contention. Using process swapping, application execution time can be reduced by continually moving the application computation from slow processors to fast processors. Process swapping improves performance by dynamically choosing the best available resources throughout the execution of an application, using MPI process over-allocation and real-time performance measurement.

The basic idea behind MPI process swapping is as follows. Say that a parallel iterative application needs N processes to run, due to memory and/or performance considerations. Process swapping over-allocates N + M processes so that the application only runs on N processes, but has the opportunity to swap any of these processes with any of M spare processes. Process swapping imposes the restriction that data redistribution is not allowed: the application is stuck with the initial data distribution, which limits the ability to adapt to fluctuating resources. Although MPI process swapping will often be sub-optimal, it is a practical solution for practical situations and it can be integrated into existing applications easily.

Process swapping can be thought of as a partial application checkpoint and migration. Of the N active MPI processes, only those that show poor performance are checkpointed and have their work responsibilities transferred to another MPI process. Because the checkpointing used by MPI process swapping is only semi-automated (the user must define quiescent, checkpoint-able states in the code that guarantee no outstanding communication messages; the user also must register process state information that is transferred during a migration), this technique is most applicable to iterative applications.

Iterative applications lend themselves to MPI process swapping in two ways. First, the loop structure of an iterative code is a natural place for a command that checks for a process swap. Modifying only one place in the code gives many swap opportunities during execution.

Second, iterative applications typically have natural barrier points where synchronization occurs. These barrier points often (but not always) provide the communication quiescence required for MPI process swapping. For these reasons, iterative applications are the focus of this work; a minimal sketch of this pattern appears at the end of this chapter.

The target execution environment is heterogeneous time-shared platforms (e.g. networks of desktop workstations) in which the available computing power of each processor varies over time due to external load (e.g. CPU load generated by other users and applications). This type of platform has steadily gained in popularity in arenas such as enterprise computing. In the target environment, it is assumed there are only a few parallel applications running on the platform simultaneously amidst a field of interactive (serial) jobs, creating performance hot spots that should be avoided. If the workload instead consists primarily of parallel applications with no interactive usage, then the platform of choice should be a batch-scheduled cluster. Although our target usage scenario may appear limiting, workloads consisting of a few parallel applications alongside interactive desktop users are not uncommon in academic and commercial environments. Many research labs at the University of California, San Diego have this characteristic, as do many commercial development facilities such as those used by the Hewlett-Packard Company.

The remainder of this thesis is organized as follows:

- Chapter 2 provides background information and describes related approaches to improve application performance.
- Chapter 3 describes process swapping in detail.
- Chapter 4 describes experiments run to validate the implementation.
- Chapter 5 introduces three swapping policies.
- Chapter 6 compares these swapping policies to other performance enhancement techniques using simulation.
- Chapter 7 concludes the work.
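The iterative pattern targeted by this work can be summarized with a short sketch. The C fragment below is a hypothetical illustration, not code from any application studied in this thesis: a single iteration loop with one per-iteration synchronization point at which no messages are outstanding, which is exactly where a swap check would later be inserted.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        for (int iter = 0; iter < 1000; iter++) {
            /* exchange boundary data with neighbors, then run the local
               numerical kernel (MPI_Send()/MPI_Recv() plus computation) */
            MPI_Barrier(MPI_COMM_WORLD);   /* quiescent point: no outstanding
                                              messages; a swap check fits here */
        }
        MPI_Finalize();
        return 0;
    }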

Chapter 2

Related Work

This chapter relates process swapping to other performance enhancing techniques:

- work replication
- Dynamic Load Balancing (DLB)
- Checkpoint / Restart (CR)
- process migration
- pre-execution scheduling

In typical shared environments, the computing power of each individual workstation can vary for many reasons. For example, the resources themselves are usually heterogeneous due to varying requirements or because purchases are staggered in time. Also, because the resources are shared unevenly between individual users and an occasional parallel job, each machine's load can vary. This variation can be dramatic and unpredictable, resulting in an unfriendly environment for parallel computing.

End users tolerate the inconsistency of such a computing environment. One day their applications may run quickly, but another day may bring slow progress. As long as there is some minimal amount of performance benefit to parallelizing an application, typical users today will be most concerned with how easy it is to run their parallel application. As a general group, end users are not likely to invest significant time and effort tuning their application schedules for performance beyond some notion of "good enough." Often they don't have the time, the experience, or even the desire required to perform detailed performance and scheduling analyses.

There are several well-known techniques to harness the power of a heterogeneous computing environment. Five of these techniques are described in this chapter: work replication, dynamic load balancing, checkpoint/restart, process migration, and pre-execution scheduling. It is claimed that, when implementation effort is weighed against performance gain, MPI Process Swapping compares favorably with these techniques. Figure 2.1 illustrates this abstract claim; the remainder of this thesis provides information supporting this hypothesis.

Figure 2.1: Claim: swapping brings potential performance benefits with relatively low effort. (The figure plots execution performance against implementation difficulty for the default application, work replication, checkpoint/restart, pre-execution scheduling, dynamic load balancing, and MPI swapping.)

2.1 Work Replication

Because process swapping intelligently decides which processors actively participate in program execution, the over-allocation technique used by process swapping is better than simply replicating work. The simplest work replication option is to execute the application twice. In a dynamic environment, however, it is likely that at least one processor used by each replicated run will have decreased performance, causing both applications to execute slowly.

In this case, performance will suffer even though twice as many resources are used. Doubling work units within the application, using the first available results, and abandoning the other results can also, in general, be hindered by slow processors. This method also requires significant modification to the application itself. Both of these work replication methods wastefully use twice as much computational power as is required. In contrast, process swapping makes efficient use of the pool of available processors.

2.2 Dynamic Load Balancing

Dynamic Load Balancing (DLB) is one of the best-known methods for achieving good parallel performance in unstable conditions. Dynamic load balancing repartitions application work during execution to balance load and minimize execution time. DLB techniques have been developed and used for scenarios in which the application's computational requirements change over time [10, 14, 18, 26] and scenarios in which the platform changes over time [49, 30, 50]. This work targets the latter scenario, and DLB is thus an attractive approach, but it has limitations.

DLB often requires substantial effort to implement. Support for uneven, dynamic data partitioning adds complexity to an application, and complexity takes time to develop and effort to maintain. An application that supports arbitrary data decomposition is more likely to break down, and is harder to debug when it does fail. It should be noted that this difficulty is recognized; there have been efforts to provide middleware services that hide some of this complexity from the user and the programmer [22, 29, 5].

DLB requires an application that is amenable, in the limit, to arbitrary data partitioning. Many parallel algorithms demand fundamentally rigid data partitioning. For example, ScaLAPACK requires fixed block sizes (although these blocks can be somewhat creatively assigned to processors for performance reasons, as in [7]).

The performance of an application that supports dynamic load balancing is limited by the achievable performance on the processors that are used. A perfectly load-balanced execution can still run slowly if all the processors used operate at a fraction of their peak performance.

It should be noted that a DLB implementation could further improve performance in this case through the use of an over-allocation scheme such as the one used by process swapping.

2.3 Checkpoint / Restart

Another way for an application to adapt to changing conditions is Checkpoint/Restart (CR). Checkpoint/restart halts an entire application during execution, saves important application state information, and restarts the application on possibly new hosts with a possibly new data partitioning. CR is traditionally used for fault-tolerance, but it can be used for performance considerations by choosing appropriate restart hosts [2, 45].

CR does not limit the application to the processors on which execution is started, so it does not have to remain running on a set of processors that have become loaded. It also does not require a sophisticated data partitioning algorithm, and can thus be used with a wider variety of applications/algorithms. Unfortunately, generalized heterogeneous checkpoint/restart of parallel applications is a difficult task; it remains the subject of several active research projects [40, 3, 20, 6] and is not widely supported today.

Checkpointing may incur significant overheads depending on the application and compute platform. For large applications using substantial amounts of memory, running on a set of machines where only one or two nodes are loaded and slow, the overhead required to checkpoint the entire application, possibly to a central remote checkpoint "drydock," is substantial. After the application has been checkpointed, a new application schedule must be calculated. This can take some time, especially if the scheduler needs to cool down for a period of time before the resource performance predictions are accurate enough to compute the new schedule (because of hysteresis effects, CPU-load-based performance estimates often require several minutes to fully dissipate the effect of a running application). Once this new schedule is completed, the entire application must bootstrap itself and read the appropriate data and process state from the checkpoint file. Only then can computation proceed.

Application-level checkpointing can be implemented with limited effort for iterative applications.

The Cactus worm [2] uses standard Cactus checkpointing facilities to checkpoint a Cactus application to a central remote location, allowing the application schedule to be recomputed. The CoCheck project provides a checkpointing tool for PVM and MPI applications, allowing checkpoint/restart [40]. CoCheck is implemented on top of MPI, and provides parallel checkpointing and message forwarding. The SRS library, recently developed to add checkpoint capability to ScaLAPACK, provides similar user-level checkpointing (without the outstanding-message bookkeeping) [44].

2.4 Process Migration

Process swapping is also related to a number of efforts to add migration support to the MPI runtime system. Migration facilities such as those provided by fault-tolerance extensions to MPI provide better-integrated support and more generally improve the capabilities of the MPI system. For example, MPI/FT provides self-checking parallel threads, offering a true migration infrastructure [3]. Similarly, FT-MPI adds fault-tolerance to MPI [20]. MPICH-V [6] is very similar to FT-MPI, arising from the volatility exhibited by global computing platforms. MPICH-V provides uncoordinated checkpointing as part of the MPI core implementation, and tracks undelivered messages through a distributed message logging facility. Designed primarily for very large systems whose mean time between failures is low, these techniques could conceivably be used to migrate for performance reasons.

These migration mechanisms could be combined with the process swapping services and policies developed in this thesis, improving robustness and generality over the current process swapping solution. In particular, a checkpointing facility might allow a better process swapping implementation by (i) removing the restriction of working only with iterative applications; (ii) further reducing the already minimal source code invasiveness; and (iii) reducing or removing the need to over-allocate MPI processes at the beginning of execution.

Combining MPI process swapping policies with the cycle-stealing facilities of high-throughput desktop computing systems like Condor [33], XtremWeb [21], or other commercial systems [19, 27] would yield a powerful system. These systems currently evict application processes when a resource is reclaimed by its owner. By combining swapping policies with this eviction mechanism, a process might also be evicted and migrated for application performance reasons. Such a combined system would provide not only high system throughput but also good individual application performance. One difficulty would be to allow network connections to survive process migration. An approach like the one in MPICH-V [6] could be used.

2.5 Pre-execution Scheduling

MPI process swapping can be categorized as a mid-execution scheduler, and can be compared to pre-execution application schedulers such as those found in the AppLeS [4] and GrADS [28] projects. These projects are also concerned with achieving high performance within dynamic parallel execution environments. Additionally, they strive for ease-of-use, an important attribute for disciplinary scientists. The performance measurement and prediction techniques used in process swapping have much in common with these projects; all use application and environmental measurements (e.g. via the NWS [46], Autopilot [35], or MDS [23]) to improve application performance.

AppLeS schedulers exist to promote application performance (as opposed to system throughput or system performance). Many application-specific AppLeS schedulers have been built, for example [38, 41, 12]. By and large, these schedulers can be categorized as pre-execution schedulers, matching an application to appropriate computing resources before the application is run. Recently, application-class AppLeS schedulers have been pursued, most notably AMWAT and APST, which focus on master/worker and parameter sweep application classes respectively [36, 8].

In related work, the GrADS project seeks grid ease-of-use through performance-aware grid infrastructure from compilation through execution. This effort includes an application-generic pre-execution scheduler that promotes application performance but is modular [11].

Using a rich application requirements description and resource capabilities, this scheduler matches an application and execution resources for a variety of application classes. Also part of this effort is a mid-execution rescheduler, which evaluates the application schedule and alters it during execution to improve performance [13]. The work in this thesis has been integrated into the GrADS system as the first implementation of this rescheduler.

Pre-execution scheduling is advantageous because it requires little or no application modification. The pre-execution scheduler determines resources and application configuration information without access to application source code. While pre-execution scheduling is broadly similar to mid-execution scheduling (for example, when applied in the middle of an application run both schedulers will discard slow processors and acquire fast processors), a mid-execution scheduler has more constraints that make simple application of a pre-execution scheduler difficult. These constraints are categorized below:

- Affinity. After executing for some time on a particular set of processors, an application develops an affinity for these resources. The use of different resources therefore has an additional cost that is not present before the start of execution.
- The Heisenberg Principle. Using traditional pre-execution performance estimation techniques, it is difficult to separate the performance effect of the application itself, as it is currently running, from other performance drains. Blindly using these traditional performance estimates will result in frivolous scheduling decisions, as the currently used resources will appear slow and overloaded, even when the load is caused by the application itself.
- Relative performance is not sufficient. Before execution, the pre-execution scheduler chooses the best resources for the application, often relying on relative performance rankings. However, during execution there is a substantial cost for changes to the schedule. Any mid-execution change must be evaluated using absolute, not relative, performance measures, or one cannot reason about the benefits of the new schedule relative to the cost of implementing it.

Chapter 3

MPI Process Swapping

This chapter presents MPI process swapping in detail, including:

- the swap library, and its impact on application source code;
- the swap run-time services, and how they interact with a swap-enabled application;
- the advantages and limitations of the swapping design and implementation, including performance considerations.

3.1 The Swap Library

In order to minimize the impact on user code, and yet still provide automated swapping functionality, MPI process swapping overloads many of the MPI function calls through a combination of #define macros and function calls. Over-allocation is implemented with two private MPI communicators. An active communicator contains all the MPI processes that are actively participating in the application, and an inactive communicator contains all the inactive processes. To hide this complexity from the user, the swapping library overloads MPI function calls, as shown in Figure 3.1.

The application developer is still required to make a handful of changes to an existing application. This is best illustrated through an example. Figure 3.2(a) contains C-like pseudo code for a typical MPI application. This example shows the communication from an actual MPI application that computes Van der Waals forces between particles in a two-dimensional grid [48].

Figure 3.1: Swapping overloads standard MPI communications. (The figure shows a call to MPI_Send() in user.c being intercepted by Swap_Send() in libswap.a, which in turn calls the real MPI_Send() in libmpi.a.)

In this example, the original C source code includes the mpi.h header file, and makes several MPI function calls throughout the code. To build the application, the user compiles their source code and links to the MPI library, as shown in Figure 3.3(a). To swap-enable this application, the following three lines of code are changed (see Figure 3.2).

1. The user's code includes the header file mpi_swap.h instead of mpi.h.

2. The user must register the iteration variable using the swap_register() function call. This is necessary in order for the swap code to know which iteration a particular MPI process is executing at any given time. The swap_register() function is used to register statically allocated memory that is important to be swapped. All statically allocated memory that must be transferred during a swap must be registered with the swap_register() function call; for example, the iteration variable above is registered this way. The swap_register() call is considered collective and must be issued across all processes.

3. The user must insert a call to MPI_Swap() inside the iteration loop to exercise the swapping test and actuation routines. The MPI_Swap() function call acts like a barrier to active processes. It must be placed inside the application's iteration loop. The current implementation requires that no communication messages be outstanding when MPI_Swap() is called. In theory, outstanding messages could be allowed by forwarding them to the new active process. This enhancement has been designed, but not implemented to date.

Figure 3.3(b) illustrates how to compile a swap-enabled application. The user includes the mpi_swap.h header file provided by the swap package, and links against both the standard MPI library, called libmpi.a here, and the swap library libswap.a that is provided by the swap package.

(a) Standard MPI C source:

    #include "mpi.h"

    main() {
        MPI_Init();
        for (a lot of loops) {
            (MPI_Send() / MPI_Recv());
            MPI_Bcast();
            MPI_Allreduce();
        }
        MPI_Finalize();
    }

(b) Swap-enabled MPI C source:

    #include "mpi_swap.h"

    main() {
        MPI_Init();
        swap_register(iteration variable);
        for (a lot of loops) {
            MPI_Swap();
            (MPI_Send() / MPI_Recv());
            MPI_Bcast();
            MPI_Allreduce();
        }
        MPI_Finalize();
    }

Figure 3.2: The minimum source changes to swap-enable an application.

In a similar way to how the standard MPI calls are overloaded, the standard C library calls to dynamically allocate memory are overloaded. In this way, the swap library tracks memory that needs to be communicated during a swap.

Figure 3.3: The build changes to swap-enable an application. (Standard MPI: user.c and mpi.h are compiled and linked against libmpi.a to produce the executable. Swap-enabled MPI: user.c includes mpi_swap.h, which includes mpi.h, and the executable is linked against both libmpi.a and libswap.a.)

In the rare instance that some local memory is dynamically allocated (and does not need to be communicated during a swap), passthrough functions are provided that allocate memory but do not register the memory for swap communication. This default overloading of dynamic memory allocation can cause lower application performance. However, swapping is much less likely to fail for novice users if all dynamic memory is trapped and registered. Furthermore, the advanced user who is concerned about every little bit of performance will take the time to sort out these local memory pools and prevent them from being registered.

3.2 Run Time Services

The swap run time services, depicted in Figure 3.4, are independent processes that together manage the execution of a parallel application. The swap services monitor the application and the resources on which it runs, and are responsible for determining when and where to swap. Below, each of these services is described in more detail.

Figure 3.4: Swap run-time architecture. (The figure shows the swap dispatcher on processor d, the swap manager on processor m, visualization/logging/external-control utilities on processor x, and one swap handler alongside each application MPI process on processors 0 through N.)

Swap dispatcher. The swap dispatcher service is an always-on service, listening at an advertised host/port. As shown in Figure 3.5, the swap dispatcher is contacted during application bootstrap, in the MPI_Init() call. At this time the dispatcher launches a swap manager on the processor with MPI rank 0. Until the application terminates and the swap manager unregisters itself with the swap dispatcher, the only additional action provided by the dispatcher is to forward communication inquiries to the appropriate swap manager. For example, the swap dispatcher can be contacted for information about a particular application, and this request will be forwarded on to the swap manager responsible for the application.

Swap manager. When it starts, the swap manager reports to the swap dispatcher the host and port number on which it receives messages. This information is passed back to the application. The application communicates this information to all the MPI processes in the parallel application. Each of these processes contacts the swap manager and requests a swap handler. The swap manager remotely starts a swap handler for each application process.

During application execution, the swap manager plays an active role, tracking the performance of active and inactive processors, tracking application progress, evaluating this information, and making process swapping decisions.

Swap handler. A swap handler is started for each MPI process, during each process's call to MPI_Init(). Each MPI process requests a swap handler from the swap manager. The swap manager remotely starts a swap handler on the same processor as the MPI process. This handler returns to the swap manager the host and port on which it communicates. The swap manager gives this information to the MPI process, which then talks only with the swap handler for the remainder of the application. Upon application termination, the swap handler unregisters itself with the swap manager, then exits. The swap handler exists as a separate process (or thread) for several reasons:

- For efficiency, all swapping communication between the application and the swap services is made on the local host. During the bulk of the application run, the overhead of swapping is minimized because all communication is local to the processor.
- The swap manager needs to have an asynchronous dialog with the swap handler, and this could not happen if the swap handler were written inline in the MPI process.
- The swap handler performs some active resource measurements from time to time, and this is very difficult to do on inactive MPI processes where the handler code is inline with the application.

Swap utilities. The swap utilities are designed to improve the usability of the swap environment. Facilities such as swap information logging and swap visualization connect to the swap manager (possibly through the swap dispatcher) and track an application's progress. The swap admin provides a simple interface to manually configure or control the swapping system.

When a swap is determined by the swap manager, it communicates this to the swap handler associated with the MPI process with local rank zero (the "local root"). At the next application iteration, during the call to MPI_Swap(), the swap handler indicates to the local root process that a swap is necessary.

The local root communicates this information to all the MPI processes, and the appropriate processes initiate a swap. The other processes continue on with the application, doing as much work as possible in parallel with the swap.

Figure 3.5: Interaction diagram of a swappable MPI application. (The figure traces the messages exchanged among the user on machine u, the swap dispatcher on machine d, the swap manager on machine m, and the application processes and swap handlers on machines 0 through N, from initialization through swaps to finalization.)

The swap services interact with the MPI application and with each other in a straightforward asynchronous manner, as illustrated in Figure 3.5. Walking through an example application execution will further describe these interactions. For ease of reference, the numbers below are matched in the figure.

1. From machine u, a user launches an MPI application that uses N total processes, a subset of which will be active at any given time.

2. The root process (the process with MPI rank zero) on machine 0 contacts the always-on swap dispatcher (running on machine d) during initialization, and requests swap services.

3. The swap dispatcher launches a swap manager on machine m. The swap dispatcher waits for the swap manager to initialize, then tells the root process how to contact this personalized swap manager. The root process passes this information to all MPI processes in the application. From this point onward, the swap dispatcher plays a minimal role; the swap manager becomes the focal point.

4. For each MPI process, the swap manager starts a swap handler on the same machine. Once the swap handlers are initialized, the application begins execution.

5. While the application is executing, the swap handlers gather application and environment (machine) performance information and feed it to the swap manager. Some of this information is passive, like the CPU load or the amount of computation, communication, and barrier wait time of the application. Other times the performance information is gathered via active probing, which uses significant computational resources for a short period of time but provides more accurate information. The swap manager analyzes all of this information and determines whether or not to initiate a process swap.

6. The active root process, the MPI process that is the root process in the group of active processes, contacts its swap handler periodically (at an interval of some number of iterations, during the call to MPI_Swap()). In this case, the active root starts out as the process on machine 0. The first time this process asks if a swap is needed, the swap handler replies no. The application continues to execute, and information continues to be fed to the swap manager.

7. Eventually, the swap manager decides that processes 0 and 1 should swap, so it sends a message to the swap handler that cohabitates with the active root process.

The next time the application asks if it should swap, the swap handler answers yes. Processes 2 through N continue to execute the application while processes 0 and 1 exchange information and data. The process on machine 0 becomes inactive, while the process on machine 1 becomes active. When the swap is complete, process 1 is the new active root process, so the next swap message from the swap manager is sent to the process on machine 1. This time, process 1 and process N swap. The execution continues in this fashion until it completes.

8. As the MPI application shuts down, each MPI process sends finalization messages to its swap handler before quitting. The swap handler in turn registers a finalization message with the swap manager, then quits. Once all the swap handlers have unregistered with the swap manager, it sends a quit message to the swap dispatcher, and shuts down.

9. In this case, the user monitored the progress of the application throughout its execution. Shortly after the application began to execute, the user started the swap visualization tool. The visualization tool contacted the swap dispatcher, which told it where the swap manager lived. The visualization tool registered itself with the swap manager, and from that time forward the swap manager kept the visualization tool informed directly. After the application shut down, and the swap manager also shut down, the user closed the visualization tool.

3.3 Advantages of the Design

The advantages of this design are briefly discussed below.

Minimal impact on application source code. MPI process swapping works by overloading many of the MPI function calls. The goal is to achieve as much transparency as possible for the application programmer and end-user. There are several modifications to the application source code that must be made by the user. Conceivably these steps could be automated with a compiler, but they are straightforward for a user to implement.

As few as three lines of source code need to be changed in order to swap-enable an existing iterative MPI application.

Support for MPI-1 applications. Process swapping is tightly integrated with MPI, but it does not require advanced MPI-2 features. MPI-1 does not support adding processors to communicators, so process swapping relies on over-allocation of processes at the beginning of execution to get its pool of processors (a sketch of this idea appears at the end of this section). Swapping chooses the best subset of available processors to actively participate in the application execution; the rest remain inactive until needed. These inactive processes utilize very little computational power; aside from periodic active performance measurement, they block on I/O calls and wait to become active. MPI-2 has support for adding and removing processors during execution. However, MPI-2 is not as widely supported as MPI-1. Furthermore, the functionality to add/remove processes is not transparent: the programmer must manage new communicators, requiring significant source code modification for existing MPI-1 applications. By contrast, MPI-1 with process swapping requires minimal source code changes. So while MPI-2 dynamic process management functionality, such as that supported by the latest grid-enabled MPI implementation, MPICH-G2 [43], minimizes the need for over-allocation, it requires significant modification to existing applications.

Profiling support. By using a custom interface to overload MPI functions, the end user may still use any MPI profiling, logging, or debugging instrumentation they are familiar with. MPI does not have language support for multiple levels of interception, which means that (without special arrangement) only one profiling instrumentation can be done. Had the swap library used the standard MPI profiling interface for its own use, this would have precluded the use of another profiling or debugging tool.

Scalable run-time architecture. By organizing the run-time system as services, and minimizing the communication with a central service, the run-time system can scale to a large number of simultaneous swap-enabled applications.
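To make the over-allocation idea concrete, the following fragment is a minimal sketch, not the swap library's actual implementation, of how an over-allocated MPI-1 job can be partitioned into active and spare groups using only standard MPI-1 calls. The names n_active and work_comm are hypothetical.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        int n_active = 8;                 /* assume the application needs N = 8 */
        MPI_Comm work_comm;               /* private communicator for this group */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* ranks below n_active start active; the remaining M = size - n_active
           ranks are spares that block and wait to be swapped in */
        int is_active = (rank < n_active);
        MPI_Comm_split(MPI_COMM_WORLD, is_active, rank, &work_comm);

        if (is_active) {
            /* run the application using work_comm as the active communicator */
        } else {
            /* wait cheaply for a swap request from the swap services */
        }

        MPI_Comm_free(&work_comm);
        MPI_Finalize();
        return 0;
    }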

3.4 Limitations

There are a number of limitations to the current implementation. Some of these limitations are fundamental design limitations, but most are simply features that (for expediency) were not implemented. A laundry list of limitations is given below.

Iterative applications. The swap system was designed to target the class of iterative applications.

Language. The swap library only has C bindings. C++ and Fortran bindings have not been implemented.

Message forwarding. The swap library does not perform message forwarding; in particular, this means that the call to MPI_Swap() must be at a natural application barrier; there can be no outstanding asynchronous communications when this function is called.

User-defined communicators. The swap library does not currently do the bookkeeping necessary to properly handle user-defined communicators.

Overhead of MPI overloading. There is a slight performance overhead in each MPI communication call: the cost of an additional function call is incurred for each call. These overheads, while small, can accumulate in an application with a lot of MPI calls. The function call overloading could have been implemented with macros, but this limits the readability and maintainability of the swap library itself.

Inactive process blocking. Inactive MPI processes block on an MPI communication receive. On most MPI implementations, this means that these inactive processes consume virtually no system resources. This is what allows an application to dramatically over-allocate processors without incurring prohibitive penalties or causing problems for other users. However, it must be noted that some MPI implementations use spinning in their implementation of MPI_Recv(), for example TMPI [42]. On these machines, with the current implementation, the inactive processes would needlessly consume a significant amount of resources.

It is straightforward to implement the inactive processes' pause in a fashion that does not consume resources, for example through the use of the select() facility (a sketch follows at the end of this section). This was not done in this implementation for expediency, and because the systems on which swapping was evaluated did not spin wait, and so did not consume significant resources while processes were inactive.

Scalability. The swap services, written in Python, do not scale well above roughly one hundred processes. Above this level, communication traffic causes dropped messages and reset socket connections. This could be addressed through a focused redesign. Similarly, the mpich system could not launch more than two hundred processes before running out of process space.
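As a point of reference, the select()-based pause mentioned above could look roughly like the sketch below. This is a hypothetical illustration, not code from the swap library; handler_fd stands for the (assumed) local socket descriptor connecting an inactive process to its swap handler.

    #include <sys/select.h>

    /* Block without consuming CPU until the swap handler writes a wake-up
       byte on handler_fd, i.e. until this inactive process should become
       active.  A NULL timeout makes select() sleep in the kernel. */
    static void wait_until_activated(int handler_fd)
    {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(handler_fd, &readfds);
        select(handler_fd + 1, &readfds, NULL, NULL, NULL);
        /* on return, a message is ready to be read from handler_fd */
    }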

Chapter 4

Experimental Results

Swapping has been implemented and verified in a commercial production environment. In this chapter, this validation is presented. The overheads of swapping are measured, and the effectiveness of the system is shown anecdotally (a more rigorous discussion of swap efficacy is presented in Chapter 6).

4.1 Environment

The Hewlett-Packard Company maintains a large network of workstations in San Diego, California. These machines are a heterogeneous collection of HP PA-RISC processors running HP-UX 11.11i, connected through 100baseT Ethernet using multiple sub-nets, routers, and switches. The majority of the HP machines are used as individual workstations for commercial research and development activities, such as Computer Aided Drafting, Very Large Scale Integrated digital circuit development, and software/firmware development. A handful of these machines are not used interactively; instead, these machines are used as a cluster. Three hundred of the HP machines were included in the MPI process swapping machine pool, though due to scalability concerns fewer than 100 participated in experiments at any given time. Figure 4.1 shows an example of the CPU load on this system. The load in this figure is typical of most of the interactively-used processors in the system. The batch processors tend to have far fewer jobs, which run much longer.

Figure 4.1: Example of CPU load in the experimental environment. (The figure plots CPU load against time in seconds.)

4.2 Applications

A number of applications have been swap-enabled. They are described here.

fish. Fish is a simple particle physics parallel application written by Fred Wong and Jim Demmel [48]. Typical of the codes generally found in the field of particle dynamics, this application computes Van der Waals forces between particles in a two-dimensional field. As the particles interact, they move about the field. Because the amount of computation depends on the location and proximity of particles to one another, this application exhibits a dynamic amount of work per processor even when the data partitioning is static and the processors are dedicated. This iterative code uses simple MPI calls and only the MPI_COMM_WORLD communicator. It has a per-iteration barrier point, which is a natural place to insert a call to MPI_Swap(). From the original code, four source lines were added/changed in order to add process swapping capability to this application.

iterative stencil applications. This unified application, written by Holly Dail, performs the computation and communication for (a) a Jacobi relaxation solver for a system of equations, and (b) the game of life algorithm. Both sub-applications are iterative and have natural barrier points suitable for MPI_Swap() insertion. These applications use straightforward synchronous MPI communication calls, and only communicate via MPI_COMM_WORLD.

From the original code, only three lines of code were changed to enable swapping.

synthetic. This application was designed specifically to emulate applications like fish, jacobi, and the game of life. It has the same basic application elements of configuration, iteration, and finalization. Unlike these real applications, the synthetic application does not implement a real algorithm or communicate real data. Instead, this code can be tuned to any computation/communication demands desired. The advantage is that the synthetic application allows quick and efficient exploration of entire classes of applications without actually writing them. The synthetic application does perform actual floating point calculations (additions and multiplies), and it does communicate buffers full of random data. Four lines of source code were altered to make this application swap-enabled.

ring. This well-known introductory MPI application sends a message from process to process in a ring communication; the message finally returns to the initial sending process. This simple application reports the total message travel time, and as such is useful for measuring various swap overheads.

mpi_sleep. This utility application was designed to allow exploration of swapping capabilities without degrading the experience of other users. Instead of performing actual computation and communication, the payload of this iterative application is simply a sleep command. When configured with a particular sleep time and number of sleep iterations, this application swaps in between sleeping. This application was swap-enabled with modifications to three lines of source.

4.3 Swap Policy

The policy used in the experiments took every opportunity to acquire and use faster processors, even when those processors were only marginally faster than the current processor set. This greedy policy periodically evaluated the performance of all the machines in the application's pool (a subset of the entire HP cluster), and swapped


Hierarchical Chubby: A Scalable, Distributed Locking Service Hierarchical Chubby: A Scalable, Distributed Locking Service Zoë Bohn and Emma Dauterman Abstract We describe a scalable, hierarchical version of Google s locking service, Chubby, designed for use by systems

More information

Making Workstations a Friendly Environment for Batch Jobs. Miron Livny Mike Litzkow

Making Workstations a Friendly Environment for Batch Jobs. Miron Livny Mike Litzkow Making Workstations a Friendly Environment for Batch Jobs Miron Livny Mike Litzkow Computer Sciences Department University of Wisconsin - Madison {miron,mike}@cs.wisc.edu 1. Introduction As time-sharing

More information

Real-time grid computing for financial applications

Real-time grid computing for financial applications CNR-INFM Democritos and EGRID project E-mail: cozzini@democritos.it Riccardo di Meo, Ezio Corso EGRID project ICTP E-mail: {dimeo,ecorso}@egrid.it We describe the porting of a test case financial application

More information

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads.

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Poor co-ordination that exists in threads on JVM is bottleneck

More information

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters Amit Karwande, Xin Yuan Dept. of Computer Science Florida State University Tallahassee, FL 32306 {karwande,xyuan}@cs.fsu.edu

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 16: Virtual Machine Monitors Geoffrey M. Voelker Virtual Machine Monitors 2 Virtual Machine Monitors Virtual Machine Monitors (VMMs) are a hot

More information

Why load test your Flex application?

Why load test your Flex application? Why load test your Flex application? Your Flex application is new and exciting, but how well does it perform under load? Abstract As the trend to implement Web 2.0 technologies continues to grow and spread

More information

z/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2

z/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2 z/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2 z/os 1.2 introduced a new heuristic for determining whether it is more efficient in terms

More information

Kernel Korner AEM: A Scalable and Native Event Mechanism for Linux

Kernel Korner AEM: A Scalable and Native Event Mechanism for Linux Kernel Korner AEM: A Scalable and Native Event Mechanism for Linux Give your application the ability to register callbacks with the kernel. by Frédéric Rossi In a previous article [ An Event Mechanism

More information

Cloud Computing For Researchers

Cloud Computing For Researchers Cloud Computing For Researchers August, 2016 Compute Canada is often asked about the potential of outsourcing to commercial clouds. This has been investigated as an alternative or supplement to purchasing

More information

A Comparison of Two Distributed Systems: Amoeba & Sprite. By: Fred Douglis, John K. Ousterhout, M. Frans Kaashock, Andrew Tanenbaum Dec.

A Comparison of Two Distributed Systems: Amoeba & Sprite. By: Fred Douglis, John K. Ousterhout, M. Frans Kaashock, Andrew Tanenbaum Dec. A Comparison of Two Distributed Systems: Amoeba & Sprite By: Fred Douglis, John K. Ousterhout, M. Frans Kaashock, Andrew Tanenbaum Dec. 1991 Introduction shift from time-sharing to multiple processors

More information

Computational Steering

Computational Steering Computational Steering Nate Woody 10/23/2008 www.cac.cornell.edu 1 What is computational steering? Generally, computational steering can be thought of as a method (or set of methods) for providing interactivity

More information

vsan 6.6 Performance Improvements First Published On: Last Updated On:

vsan 6.6 Performance Improvements First Published On: Last Updated On: vsan 6.6 Performance Improvements First Published On: 07-24-2017 Last Updated On: 07-28-2017 1 Table of Contents 1. Overview 1.1.Executive Summary 1.2.Introduction 2. vsan Testing Configuration and Conditions

More information

Web Serving Architectures

Web Serving Architectures Web Serving Architectures Paul Dantzig IBM Global Services 2000 without the express written consent of the IBM Corporation is prohibited Contents Defining the Problem e-business Solutions e-business Architectures

More information

Dell PowerVault MD3600f/MD3620f Remote Replication Functional Guide

Dell PowerVault MD3600f/MD3620f Remote Replication Functional Guide Dell PowerVault MD3600f/MD3620f Remote Replication Functional Guide Page i THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

Hashing. Hashing Procedures

Hashing. Hashing Procedures Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements

More information

Introduction to iscsi

Introduction to iscsi Introduction to iscsi As Ethernet begins to enter into the Storage world a new protocol has been getting a lot of attention. The Internet Small Computer Systems Interface or iscsi, is an end-to-end protocol

More information

Condor and BOINC. Distributed and Volunteer Computing. Presented by Adam Bazinet

Condor and BOINC. Distributed and Volunteer Computing. Presented by Adam Bazinet Condor and BOINC Distributed and Volunteer Computing Presented by Adam Bazinet Condor Developed at the University of Wisconsin-Madison Condor is aimed at High Throughput Computing (HTC) on collections

More information

Engineers can be significantly more productive when ANSYS Mechanical runs on CPUs with a high core count. Executive Summary

Engineers can be significantly more productive when ANSYS Mechanical runs on CPUs with a high core count. Executive Summary white paper Computer-Aided Engineering ANSYS Mechanical on Intel Xeon Processors Engineer Productivity Boosted by Higher-Core CPUs Engineers can be significantly more productive when ANSYS Mechanical runs

More information

Lotus Sametime 3.x for iseries. Performance and Scaling

Lotus Sametime 3.x for iseries. Performance and Scaling Lotus Sametime 3.x for iseries Performance and Scaling Contents Introduction... 1 Sametime Workloads... 2 Instant messaging and awareness.. 3 emeeting (Data only)... 4 emeeting (Data plus A/V)... 8 Sametime

More information

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud 571 Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud T.R.V. Anandharajan 1, Dr. M.A. Bhagyaveni 2 1 Research Scholar, Department of Electronics and Communication,

More information

Middleware. Adapted from Alonso, Casati, Kuno, Machiraju Web Services Springer 2004

Middleware. Adapted from Alonso, Casati, Kuno, Machiraju Web Services Springer 2004 Middleware Adapted from Alonso, Casati, Kuno, Machiraju Web Services Springer 2004 Outline Web Services Goals Where do they come from? Understanding middleware Middleware as infrastructure Communication

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

CSCI 4717 Computer Architecture

CSCI 4717 Computer Architecture CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

VIRTUALIZATION PERFORMANCE: VMWARE VSPHERE 5 VS. RED HAT ENTERPRISE VIRTUALIZATION 3

VIRTUALIZATION PERFORMANCE: VMWARE VSPHERE 5 VS. RED HAT ENTERPRISE VIRTUALIZATION 3 VIRTUALIZATION PERFORMANCE: VMWARE VSPHERE 5 VS. RED HAT ENTERPRISE VIRTUALIZATION 3 When you invest in a virtualization platform, you can maximize the performance of your applications and the overall

More information

Using Virtualization to Reduce Cost and Improve Manageability of J2EE Application Servers

Using Virtualization to Reduce Cost and Improve Manageability of J2EE Application Servers WHITEPAPER JANUARY 2006 Using Virtualization to Reduce Cost and Improve Manageability of J2EE Application Servers J2EE represents the state of the art for developing component-based multi-tier enterprise

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery

Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery Java Message Service (JMS) is a standardized messaging interface that has become a pervasive part of the IT landscape

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

Techniques to improve the scalability of Checkpoint-Restart

Techniques to improve the scalability of Checkpoint-Restart Techniques to improve the scalability of Checkpoint-Restart Bogdan Nicolae Exascale Systems Group IBM Research Ireland 1 Outline A few words about the lab and team Challenges of Exascale A case for Checkpoint-Restart

More information

EMC GREENPLUM MANAGEMENT ENABLED BY AGINITY WORKBENCH

EMC GREENPLUM MANAGEMENT ENABLED BY AGINITY WORKBENCH White Paper EMC GREENPLUM MANAGEMENT ENABLED BY AGINITY WORKBENCH A Detailed Review EMC SOLUTIONS GROUP Abstract This white paper discusses the features, benefits, and use of Aginity Workbench for EMC

More information

SWsoft ADVANCED VIRTUALIZATION AND WORKLOAD MANAGEMENT ON ITANIUM 2-BASED SERVERS

SWsoft ADVANCED VIRTUALIZATION AND WORKLOAD MANAGEMENT ON ITANIUM 2-BASED SERVERS SWsoft ADVANCED VIRTUALIZATION AND WORKLOAD MANAGEMENT ON ITANIUM 2-BASED SERVERS Abstract Virtualization and workload management are essential technologies for maximizing scalability, availability and

More information

Maximum Availability Architecture: Overview. An Oracle White Paper July 2002

Maximum Availability Architecture: Overview. An Oracle White Paper July 2002 Maximum Availability Architecture: Overview An Oracle White Paper July 2002 Maximum Availability Architecture: Overview Abstract...3 Introduction...3 Architecture Overview...4 Application Tier...5 Network

More information

AN EMPIRICAL STUDY OF EFFICIENCY IN DISTRIBUTED PARALLEL PROCESSING

AN EMPIRICAL STUDY OF EFFICIENCY IN DISTRIBUTED PARALLEL PROCESSING AN EMPIRICAL STUDY OF EFFICIENCY IN DISTRIBUTED PARALLEL PROCESSING DR. ROGER EGGEN Department of Computer and Information Sciences University of North Florida Jacksonville, Florida 32224 USA ree@unf.edu

More information

Blue Waters I/O Performance

Blue Waters I/O Performance Blue Waters I/O Performance Mark Swan Performance Group Cray Inc. Saint Paul, Minnesota, USA mswan@cray.com Doug Petesch Performance Group Cray Inc. Saint Paul, Minnesota, USA dpetesch@cray.com Abstract

More information

Grid Application Development Software

Grid Application Development Software Grid Application Development Software Department of Computer Science University of Houston, Houston, Texas GrADS Vision Goals Approach Status http://www.hipersoft.cs.rice.edu/grads GrADS Team (PIs) Ken

More information

Storage Hierarchy Management for Scientific Computing

Storage Hierarchy Management for Scientific Computing Storage Hierarchy Management for Scientific Computing by Ethan Leo Miller Sc. B. (Brown University) 1987 M.S. (University of California at Berkeley) 1990 A dissertation submitted in partial satisfaction

More information

Chapter 18 Parallel Processing

Chapter 18 Parallel Processing Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD

More information

Bringing OpenStack to the Enterprise. An enterprise-class solution ensures you get the required performance, reliability, and security

Bringing OpenStack to the Enterprise. An enterprise-class solution ensures you get the required performance, reliability, and security Bringing OpenStack to the Enterprise An enterprise-class solution ensures you get the required performance, reliability, and security INTRODUCTION Organizations today frequently need to quickly get systems

More information

PARALLEL PROGRAM EXECUTION SUPPORT IN THE JGRID SYSTEM

PARALLEL PROGRAM EXECUTION SUPPORT IN THE JGRID SYSTEM PARALLEL PROGRAM EXECUTION SUPPORT IN THE JGRID SYSTEM Szabolcs Pota 1, Gergely Sipos 2, Zoltan Juhasz 1,3 and Peter Kacsuk 2 1 Department of Information Systems, University of Veszprem, Hungary 2 Laboratory

More information

Distributed Operating Systems

Distributed Operating Systems 2 Distributed Operating Systems System Models, Processor Allocation, Distributed Scheduling, and Fault Tolerance Steve Goddard goddard@cse.unl.edu http://www.cse.unl.edu/~goddard/courses/csce855 System

More information

The MOSIX Scalable Cluster Computing for Linux. mosix.org

The MOSIX Scalable Cluster Computing for Linux.  mosix.org The MOSIX Scalable Cluster Computing for Linux Prof. Amnon Barak Computer Science Hebrew University http://www. mosix.org 1 Presentation overview Part I : Why computing clusters (slide 3-7) Part II : What

More information

MPI versions. MPI History

MPI versions. MPI History MPI versions MPI History Standardization started (1992) MPI-1 completed (1.0) (May 1994) Clarifications (1.1) (June 1995) MPI-2 (started: 1995, finished: 1997) MPI-2 book 1999 MPICH 1.2.4 partial implemention

More information

Programming with MPI

Programming with MPI Programming with MPI p. 1/?? Programming with MPI Miscellaneous Guidelines Nick Maclaren Computing Service nmm1@cam.ac.uk, ext. 34761 March 2010 Programming with MPI p. 2/?? Summary This is a miscellaneous

More information

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers SLAC-PUB-9176 September 2001 Optimizing Parallel Access to the BaBar Database System Using CORBA Servers Jacek Becla 1, Igor Gaponenko 2 1 Stanford Linear Accelerator Center Stanford University, Stanford,

More information

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 Introduction to Parallel Computing CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 1 Definition of Parallel Computing Simultaneous use of multiple compute resources to solve a computational

More information

Lecture 1: January 22

Lecture 1: January 22 CMPSCI 677 Distributed and Operating Systems Spring 2018 Lecture 1: January 22 Lecturer: Prashant Shenoy Scribe: Bin Wang 1.1 Introduction to the course The lecture started by outlining the administrative

More information

CSE 153 Design of Operating Systems

CSE 153 Design of Operating Systems CSE 153 Design of Operating Systems Winter 2018 Lecture 22: File system optimizations and advanced topics There s more to filesystems J Standard Performance improvement techniques Alternative important

More information

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis

More information

TECHNICAL PAPER March Datacenter Virtualization with Windows Server 2008 Hyper-V Technology and 3PAR Utility Storage

TECHNICAL PAPER March Datacenter Virtualization with Windows Server 2008 Hyper-V Technology and 3PAR Utility Storage TECHNICAL PAPER March 2009 center Virtualization with Windows Server 2008 Hyper-V Technology and 3PAR Utility Storage center Virtualization with 3PAR and Windows Server 2008 Hyper-V Technology Table of

More information

The Past, Present and Future of High Performance Computing

The Past, Present and Future of High Performance Computing The Past, Present and Future of High Performance Computing Ruud van der Pas 1 Sun Microsystems, Technical Developer Tools 16 Network Circle, Mailstop MK16-319, Menlo Park, CA 94025, USA ruud.vanderpas@sun.com

More information

Computer-System Organization (cont.)

Computer-System Organization (cont.) Computer-System Organization (cont.) Interrupt time line for a single process doing output. Interrupts are an important part of a computer architecture. Each computer design has its own interrupt mechanism,

More information

Unit 2 : Computer and Operating System Structure

Unit 2 : Computer and Operating System Structure Unit 2 : Computer and Operating System Structure Lesson 1 : Interrupts and I/O Structure 1.1. Learning Objectives On completion of this lesson you will know : what interrupt is the causes of occurring

More information

Advanced Systems Lab

Advanced Systems Lab Advanced Systems Lab Autumn Semester 2017 Gustavo Alonso 1 Course Overview 1.1 Prerequisites This is a project based course operating on a self-study principle. The course relies on the students own initiative,

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

BASIC OPERATIONS. Managing System Resources

BASIC OPERATIONS. Managing System Resources 48 PART 2 BASIC OPERATIONS C H A P T E R 5 Managing System Resources CHAPTER 5 MANAGING SYSTEM RESOURCES 49 THE part of Windows Vista that you see the Vista desktop is just part of the operating system.

More information

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification:

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification: Application control : Boundary control : Access Controls: These controls restrict use of computer system resources to authorized users, limit the actions authorized users can taker with these resources,

More information

packet-switched networks. For example, multimedia applications which process

packet-switched networks. For example, multimedia applications which process Chapter 1 Introduction There are applications which require distributed clock synchronization over packet-switched networks. For example, multimedia applications which process time-sensitive information

More information

Big data and data centers

Big data and data centers Big data and data centers Contents Page 1 Big data and data centers... 3 1.1 Big data, big IT... 3 1.2 The IT organization between day-to-day business and innovation... 4 2 Modern data centers... 5 2.1

More information

hot plug RAID memory technology for fault tolerance and scalability

hot plug RAID memory technology for fault tolerance and scalability hp industry standard servers april 2003 technology brief TC030412TB hot plug RAID memory technology for fault tolerance and scalability table of contents abstract... 2 introduction... 2 memory reliability...

More information

QuickSpecs. Compaq NonStop Transaction Server for Java Solution. Models. Introduction. Creating a state-of-the-art transactional Java environment

QuickSpecs. Compaq NonStop Transaction Server for Java Solution. Models. Introduction. Creating a state-of-the-art transactional Java environment Models Bringing Compaq NonStop Himalaya server reliability and transactional power to enterprise Java environments Compaq enables companies to combine the strengths of Java technology with the reliability

More information

EMC VPLEX Geo with Quantum StorNext

EMC VPLEX Geo with Quantum StorNext White Paper Application Enabled Collaboration Abstract The EMC VPLEX Geo storage federation solution, together with Quantum StorNext file system, enables a global clustered File System solution where remote

More information

Deploying VSaaS and Hosted Solutions using CompleteView

Deploying VSaaS and Hosted Solutions using CompleteView SALIENT SYSTEMS WHITE PAPER Deploying VSaaS and Hosted Solutions using CompleteView Understanding the benefits of CompleteView for hosted solutions and successful deployment architectures. Salient Systems

More information

Rapid Bottleneck Identification A Better Way to do Load Testing. An Oracle White Paper June 2008

Rapid Bottleneck Identification A Better Way to do Load Testing. An Oracle White Paper June 2008 Rapid Bottleneck Identification A Better Way to do Load Testing An Oracle White Paper June 2008 Rapid Bottleneck Identification A Better Way to do Load Testing. RBI combines a comprehensive understanding

More information

TensorFlow: A System for Learning-Scale Machine Learning. Google Brain

TensorFlow: A System for Learning-Scale Machine Learning. Google Brain TensorFlow: A System for Learning-Scale Machine Learning Google Brain The Problem Machine learning is everywhere This is in large part due to: 1. Invention of more sophisticated machine learning models

More information

Introduction to Grid Computing

Introduction to Grid Computing Milestone 2 Include the names of the papers You only have a page be selective about what you include Be specific; summarize the authors contributions, not just what the paper is about. You might be able

More information