
UNIVERSITY of CALIFORNIA, SAN DIEGO

MPI Process Swapping: Performance Enhancement for Tightly-coupled Iterative Parallel Applications in Shared Computing Environments

A thesis submitted in partial satisfaction of the requirements for the degree of
Master of Science in Computer Science

by

Otto K. Sievert

Committee in charge:

Professor Henri Casanova, Chair
Professor Francine Berman, Co-chair
Professor Scott Baden

2003

Copyright 2003 by Otto K. Sievert. All rights reserved.

The thesis of Otto K. Sievert is approved.

University of California at San Diego
2003

For Aaren.

It is not best to swap horses while crossing the river.

Abraham Lincoln, in a reply to the National Union League, 1864

Contents

Dedication
Epigraph
List of Tables
List of Figures

Chapter 1  Introduction

Chapter 2  Related Work
    Work Replication
    Dynamic Load Balancing
    Checkpoint / Restart
    Process Migration
    Pre-execution Scheduling

Chapter 3  MPI Process Swapping
    The Swap Library
    Run Time Services
    Advantages of the Design
    Limitations

Chapter 4  Experimental Results
    Environment
    Applications
    Swap Policy
    Results
        Cost of over-allocation
        Gross swapping behavior
    Conclusion

Chapter 5  Swap Policies
    Payback Algebra
    Policy Parameters
    Three Swapping Policies

Chapter 6  Simulation
    Environment
    Application
    Performance Enhancing Techniques
    Simulation Architecture
    Experimental Results
        Evaluation of swapping vs. competing approaches
        Evaluation of three swapping policies
        Effect of CPU load distribution
    Summary

Chapter 7  Conclusion

Appendix A  User's Guide
    A.1 Getting Started
        A.1.1 Compiling a swap-enabled application
        A.1.2 Running a swap-enabled application
    A.2 Command Line Options

Bibliography

List of Tables

A.1  Important swapping files
A.2  Run time command line options for swap-enabled applications
A.3  Swap Dispatcher command line options
A.4  Swap Manager command line options
A.5  Swap Handler command line options
A.6  Swap Admin command line options

List of Figures

2.1  Swapping benefits
3.1  Swapping overloads standard MPI communications
3.2  The minimum source changes to swap-enable an application
3.3  The build changes to swap-enable an application
3.4  Swap run-time architecture
3.5  Swap interaction diagram
4.1  Load on experimental machines
Cost of over-allocation: MPI_Init()
Cost of over-allocation: MPI_Barrier()
Cost of over-allocation: MPI_Finalize()
Four process swapping behavior
Sixteen process swapping behavior
Sixty-four process swapping behavior
Payback distance
On-Off CPU load model
ON/OFF CPU load
Hyperexponential CPU load
ON/OFF CPU load run length distribution
Hyperexponential CPU load run length distribution
Simulation architecture
SWAP vs. NOP, DLB, CR
Swap efficacy vs. over-allocation
Swap efficacy at 1MB and 60MB process size
Swap efficacy at 1MB and 1GB process size
Swap vs. environment variability
Swapping with large process size
Hyperexponential load

Acknowledgements

I would like to acknowledge a number of people who have provided support, given assistance, and shown understanding throughout the development of this work. All the Grail members at UCSD, but especially Holly, Shava, and Graziano, provided the best environment anyone could ask for. This work grew out of a desire for a mid-execution rescheduler for the GrADS project; the members of GrADS, especially Ruth, Celso and Rich, engaged me in many interesting discussions that influenced the swapping design. The Hewlett-Packard Company supported me throughout my graduate experience, and folks there like Larry and Craig who have themselves traveled this road were always generous sounding boards. Tiffany and Thomas tolerated my mental and physical absence as I was engaged in thesis activities, often at odd hours of the day. Finally, Fran exposed me to the expanse of thesis possibilities and to the many wonders and personalities in the parallel and distributed computing community, and Henri gave me tremendous attention and (through the use of some techniques not seen since the Inquisition) encouraged me to focus on one small part of this universe. Thank you.

Portions of this work were previously published in the Proceedings of the High Performance Distributed Computing Conference and the International Journal of High Performance Computing Applications.

Abstract

MPI Process Swapping: Performance Enhancement for Tightly-coupled Iterative Parallel Applications in Shared Computing Environments

by

Otto K. Sievert
Master of Science in Computer Science
University of California, San Diego, 2003

Professor Henri Casanova, Chair
Professor Francine Berman, Co-chair

Performance and ease-of-use are difficult to obtain simultaneously for many parallel applications. Despite the enormous amount of research and development work in the area of parallel computing, performance tuning remains a labor-intensive and artful activity. A form of process migration called MPI Process Swapping is presented as one way to achieve higher application performance without sacrificing ease-of-use. Process swapping is a simple add-on to the MPI programming paradigm, and can improve performance in shared computing environments. MPI process swapping is easy to use, requiring as few as three lines of source code change to an existing iterative application, yet it provides performance benefits on par with techniques that require significant application modification.

Chapter 1

Introduction

While parallel computing has been actively pursued for several decades, it remains a daunting proposition for many end users. A number of programming models have been proposed [9, 39, 34] by which users can write applications with well-defined Application Programming Interfaces and use various parallel platforms. This thesis focuses on message passing, and in particular on the Message Passing Interface (MPI) standard [24]. MPI provides the necessary abstractions for writing parallel applications and harnessing multiple processors. However, the primary parallel computing challenges of application scalability and performance remain, especially in a shared computing environment.

While these parallel challenges can be addressed via intensive performance tuning and careful application scheduling, typical end users often lack the time and expertise required. As a result, many end users sacrifice some performance in exchange for ease-of-use (the user is relatively happy as long as parallel application performance is better than serial execution, even if the overall parallel performance is quite inefficient). This is a general trend, as parallel computing often enjoys ease-of-use or high performance, but rarely both at the same time. In this situation, a simple technique that provides a sub-optimal (but still beneficial) performance improvement can be more appealing in practice than a near-optimal solution that requires substantial effort to implement.

MPI Process Swapping is an easy-to-use application performance enhancement that provides benefit in dynamic environments. With careful attention to swapping policies, this thesis shows that process swapping can perform as well as other performance enhancing techniques such as Dynamic Load Balancing and Checkpoint/Restart. Process swapping is useful for applications that execute slowly because of resource contention. Using process swapping, application execution time can be reduced by continually moving the application computation from slow processors to fast processors. Process swapping improves performance by dynamically choosing the best available resources throughout the execution of an application, using MPI process over-allocation and real-time performance measurement.

The basic idea behind MPI process swapping is as follows. Say that a parallel iterative application needs N processes to run, due to memory and/or performance considerations. Process swapping over-allocates N + M processes so that the application only runs on N processes, but has the opportunity to swap any of these processes with any of M spare processes. Process swapping imposes the restriction that data redistribution is not allowed: the application is stuck with the initial data distribution, which limits the ability to adapt to fluctuating resources. Although MPI process swapping will often be sub-optimal, it is a practical solution for practical situations and it can be integrated into existing applications easily.

Process swapping can be thought of as a partial application checkpoint and migration. Of the N active MPI processes, only those that show poor performance are checkpointed and have their work responsibilities transferred to another MPI process. Because the checkpointing used by MPI process swapping is only semi-automated (the user must define quiescent, checkpoint-able states in the code that guarantee no outstanding communication messages; the user also must register process state information that is transferred during a migration), this technique is most applicable to iterative applications.

Iterative applications lend themselves to MPI process swapping in two ways. First, the loop structure of an iterative code is a natural place for a command that checks for a process swap. Modifying only one place in the code gives many swap opportunities during execution.

Second, iterative applications typically have natural barrier points where synchronization occurs. These barrier points often (but not always) provide the communication quiescence required for MPI process swapping. For these reasons, iterative applications are the focus of this work; a minimal sketch of this pattern appears at the end of this chapter.

The target execution environment is heterogeneous time-shared platforms (e.g. networks of desktop workstations) in which the available computing power of each processor varies over time due to external load (e.g. CPU load generated by other users and applications). This type of platform has steadily gained in popularity in arenas such as enterprise computing. In the target environment, it is assumed there are only a few parallel applications running on the platform simultaneously amidst a field of interactive (serial) jobs, creating performance hot spots that should be avoided. If the workload instead consists primarily of parallel applications with no interactive usage, then the platform of choice should be a batch-scheduled cluster. Although our target usage scenario may appear limiting, workloads consisting of a few parallel applications alongside interactive desktop users are not uncommon in academic and commercial environments. Many research labs at the University of California, San Diego have this characteristic, as do many commercial development facilities such as those used by the Hewlett-Packard Company.

The remainder of this thesis is organized as follows:

- Chapter 2 provides background information and describes related approaches to improve application performance.
- Chapter 3 describes process swapping in detail.
- Chapter 4 describes experiments run to validate the implementation.
- Chapter 5 introduces three swapping policies.
- Chapter 6 compares these swapping policies to other performance enhancement techniques using simulation.
- Chapter 7 concludes the work.
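The iterative pattern targeted by this work can be summarized with a short sketch. The C fragment below is a hypothetical illustration, not code from any application studied in this thesis: a single iteration loop with one per-iteration synchronization point at which no messages are outstanding, which is exactly where a swap check would later be inserted.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        for (int iter = 0; iter < 1000; iter++) {
            /* exchange boundary data with neighbors, then run the local
               numerical kernel (MPI_Send()/MPI_Recv() plus computation) */
            MPI_Barrier(MPI_COMM_WORLD);   /* quiescent point: no outstanding
                                              messages; a swap check fits here */
        }
        MPI_Finalize();
        return 0;
    }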

Chapter 2

Related Work

This chapter relates process swapping to other performance enhancing techniques:

- work replication
- Dynamic Load Balancing (DLB)
- Checkpoint / Restart (CR)
- process migration
- pre-execution scheduling

In typical shared environments, the computing power of each individual workstation can vary for many reasons. For example, the resources themselves are usually heterogeneous due to varying requirements or because purchases are staggered in time. Also, because the resources are shared unevenly between individual users and an occasional parallel job, each machine's load can vary. This variation can be dramatic and unpredictable, resulting in an unfriendly environment for parallel computing.

End users tolerate the inconsistency of such a computing environment. One day their applications may run quickly, but another day may bring slow progress. As long as there is some minimal amount of performance benefit to parallelizing an application, typical users today will be most concerned with how easy it is to run their parallel application. As a general group, end users are not likely to invest significant time and effort tuning their application schedules for performance beyond some notion of "good enough." Often they don't have the time, the experience, or even the desire required to perform detailed performance and scheduling analyses.

There are several well-known techniques to harness the power of a heterogeneous computing environment. Five of these techniques are described in this chapter: work replication, dynamic load balancing, checkpoint/restart, process migration, and pre-execution scheduling. It is claimed that, when implementation effort is weighed against performance gain, MPI Process Swapping compares favorably with these techniques. Figure 2.1 illustrates this abstract claim; the remainder of this thesis provides information supporting this hypothesis.

Figure 2.1: Claim: swapping brings potential performance benefits with relatively low effort. (The figure plots execution performance against implementation difficulty for the default application, work replication, checkpoint/restart, pre-execution scheduling, dynamic load balancing, and MPI swapping.)

2.1 Work Replication

Because process swapping intelligently decides which processors actively participate in program execution, the over-allocation technique used by process swapping is better than simply replicating work. The simplest work replication option is to execute the application twice. In a dynamic environment, however, it is likely that at least one processor used by each replicated run will have decreased performance, causing both applications to execute slowly.

In this case, performance will suffer even though twice as many resources are used. Doubling work units within the application, using the first available results, and abandoning the other results can also, in general, be hindered by slow processors. This method also requires significant modification to the application itself. Both of these work replication methods wastefully use twice as much computational power as is required. In contrast, process swapping makes efficient use of the pool of available processors.

2.2 Dynamic Load Balancing

Dynamic Load Balancing (DLB) is one of the best-known methods for achieving good parallel performance in unstable conditions. Dynamic load balancing repartitions application work during execution to balance load and minimize execution time. DLB techniques have been developed and used for scenarios in which the application's computational requirements change over time [10, 14, 18, 26] and scenarios in which the platform changes over time [49, 30, 50]. This work targets the latter scenario, and DLB is thus an attractive approach, but it has limitations.

DLB often requires substantial effort to implement. Support for uneven, dynamic data partitioning adds complexity to an application, and complexity takes time to develop and effort to maintain. An application that supports arbitrary data decomposition is more likely to break down, and is harder to debug when it does fail. It should be noted that this difficulty is recognized; there have been efforts to provide middleware services that hide some of this complexity from the user and the programmer [22, 29, 5].

DLB requires an application that is amenable, in the limit, to arbitrary data partitioning. Many parallel algorithms demand fundamentally rigid data partitioning. For example, ScaLAPACK requires fixed block sizes (although these blocks can be somewhat creatively assigned to processors for performance reasons, as in [7]).

The performance of an application that supports dynamic load balancing is limited by the achievable performance on the processors that are used. A perfectly load-balanced execution can still run slowly if all the processors used operate at a fraction of their peak performance.

It should be noted that a DLB implementation could further improve performance in this case through the use of an over-allocation scheme such as the one used by process swapping.

2.3 Checkpoint / Restart

Another way for an application to adapt to changing conditions is Checkpoint/Restart (CR). Checkpoint/restart halts an entire application during execution, saves important application state information, and restarts the application on possibly new hosts with a possibly new data partitioning. CR is traditionally used for fault-tolerance, but it can be used for performance considerations by choosing appropriate restart hosts [2, 45].

CR does not limit the application to the processors on which execution is started, so it does not have to remain running on a set of processors that have become loaded. It also does not require a sophisticated data partitioning algorithm, and can thus be used with a wider variety of applications/algorithms. Unfortunately, generalized heterogeneous checkpoint/restart of parallel applications is a difficult task; it remains the subject of several active research projects [40, 3, 20, 6] and is not widely supported today.

Checkpointing may incur significant overheads depending on the application and compute platform. For large applications using substantial amounts of memory, running on a set of machines where only one or two nodes are loaded and slow, the overhead required to checkpoint the entire application, possibly to a central remote checkpoint "drydock," is substantial. After the application has been checkpointed, a new application schedule must be calculated. This can take some time, especially if the scheduler needs to cool down for a period of time before the resource performance predictions are accurate enough to compute the new schedule (because of hysteresis effects, CPU-load-based performance estimates often require several minutes to fully dissipate the effect of a running application). Once this new schedule is completed, the entire application must bootstrap itself and read the appropriate data and process state from the checkpoint file. Only then can computation proceed.

Application-level checkpointing can be implemented with limited effort for iterative applications.

The Cactus worm [2] uses standard Cactus checkpointing facilities to checkpoint a Cactus application to a central remote location, allowing the application schedule to be recomputed. The CoCheck project provides a checkpointing tool for PVM and MPI applications, allowing checkpoint/restart [40]. CoCheck is implemented on top of MPI, and provides parallel checkpointing and message forwarding. The SRS library, recently developed to add checkpoint capability to ScaLAPACK, provides similar user-level checkpointing (without the outstanding-message bookkeeping) [44].

2.4 Process Migration

Process swapping is also related to a number of efforts to add migration support to the MPI runtime system. Migration facilities such as those provided by fault-tolerance extensions to MPI provide better-integrated support and more generally improve the capabilities of the MPI system. For example, MPI/FT provides self-checking parallel threads, offering a true migration infrastructure [3]. Similarly, FT-MPI adds fault-tolerance to MPI [20]. MPICH-V [6] is very similar to FT-MPI, arising from the volatility exhibited by global computing platforms. MPICH-V provides uncoordinated checkpointing as part of the MPI core implementation, and tracks undelivered messages through a distributed message logging facility. Designed primarily for very large systems whose mean time between failures is low, these techniques could conceivably be used to migrate for performance reasons.

These migration mechanisms could be combined with the process swapping services and policies developed in this thesis, improving robustness and generality over the current process swapping solution. In particular, a checkpointing facility might allow a better process swapping implementation by (i) removing the restriction of working only with iterative applications; (ii) further reducing the already minimal source code invasiveness; and (iii) reducing or removing the need to over-allocate MPI processes at the beginning of execution.

Combining MPI process swapping policies with the cycle-stealing facilities of high-throughput desktop computing systems like Condor [33], XtremWeb [21], or other commercial systems [19, 27] would yield a powerful system. These systems currently evict application processes when a resource is reclaimed by its owner. By combining swapping policies with this eviction mechanism, a process might also be evicted and migrated for application performance reasons. Such a combined system would provide not only high system throughput but also good individual application performance. One difficulty would be to allow network connections to survive process migration. An approach like the one in MPICH-V [6] could be used.

2.5 Pre-execution Scheduling

MPI process swapping can be categorized as a mid-execution scheduler, and can be compared to pre-execution application schedulers such as those found in the AppLeS [4] and GrADS [28] projects. These projects are also concerned with achieving high performance within dynamic parallel execution environments. Additionally, they strive for ease-of-use, an important attribute for disciplinary scientists. The performance measurement and prediction techniques used in process swapping have much in common with these projects; all use application and environmental measurements (e.g. via the NWS [46], Autopilot [35], or MDS [23]) to improve application performance.

AppLeS schedulers exist to promote application performance (as opposed to system throughput or system performance). Many application-specific AppLeS schedulers have been built, for example [38, 41, 12]. By and large, these schedulers can be categorized as pre-execution schedulers, matching an application to appropriate computing resources before the application is run. Recently, application-class AppLeS schedulers have been pursued, most notably AMWAT and APST, which focus on master/worker and parameter sweep application classes respectively [36, 8].

In related work, the GrADS project seeks grid ease-of-use through performance-aware grid infrastructure from compilation through execution. This effort includes an application-generic pre-execution scheduler that promotes application performance but is modular [11].

Using a rich application requirements description and resource capabilities, this scheduler matches an application and execution resources for a variety of application classes. Also part of this effort is a mid-execution rescheduler, which evaluates the application schedule and alters it during execution to improve performance [13]. The work in this thesis has been integrated into the GrADS system as the first implementation of this rescheduler.

Pre-execution scheduling is advantageous because it requires little or no application modification. The pre-execution scheduler determines resources and application configuration information without access to application source code. While pre-execution scheduling is broadly similar to mid-execution scheduling (for example, when applied in the middle of an application run both schedulers will discard slow processors and acquire fast processors), a mid-execution scheduler has more constraints that make simple application of a pre-execution scheduler difficult. These constraints are categorized below:

- Affinity. After executing for some time on a particular set of processors, an application develops an affinity for these resources. The use of different resources therefore has an additional cost that is not present before the start of execution.
- The Heisenberg Principle. Using traditional pre-execution performance estimation techniques, it is difficult to separate the performance effect of the application itself, as it is currently running, from other performance drains. Blindly using these traditional performance estimates will result in frivolous scheduling decisions, as the currently used resources will appear slow and overloaded, even when the load is caused by the application itself.
- Relative performance is not sufficient. Before execution, the pre-execution scheduler chooses the best resources for the application, often relying on relative performance rankings. However, during execution there is a substantial cost for changes to the schedule. Any mid-execution change must be evaluated using absolute, not relative, performance measures, or one cannot reason about the benefits of the new schedule relative to the cost of implementing it.

Chapter 3

MPI Process Swapping

This chapter presents MPI process swapping in detail, including:

- the swap library, and its impact on application source code;
- the swap run-time services, and how they interact with a swap-enabled application;
- the advantages and limitations of the swapping design and implementation, including performance considerations.

3.1 The Swap Library

In order to minimize the impact on user code, and yet still provide automated swapping functionality, MPI process swapping overloads many of the MPI function calls through a combination of #define macros and function calls. Over-allocation is implemented with two private MPI communicators. An active communicator contains all the MPI processes that are actively participating in the application, and an inactive communicator contains all the inactive processes. To hide this complexity from the user, the swapping library overloads MPI function calls, as shown in Figure 3.1.

The application developer is still required to make a handful of changes to an existing application. This is best illustrated through an example. Figure 3.2(a) contains C-like pseudo code for a typical MPI application. This example shows the communication from an actual MPI application that computes Van der Waals forces between particles in a two-dimensional grid [48].

Figure 3.1: Swapping overloads standard MPI communications. (The figure shows a call to MPI_Send() in user.c being intercepted by Swap_Send() in libswap.a, which in turn calls the real MPI_Send() in libmpi.a.)

In this example, the original C source code includes the mpi.h header file, and makes several MPI function calls throughout the code. To build the application, the user compiles their source code and links to the MPI library, as shown in Figure 3.3(a). To swap-enable this application, the following three lines of code are changed (see Figure 3.2).

1. The user's code includes the header file mpi_swap.h instead of mpi.h.

2. The user must register the iteration variable using the swap_register() function call. This is necessary in order for the swap code to know which iteration a particular MPI process is executing at any given time. The swap_register() function is used to register statically allocated memory that is important to be swapped. All statically allocated memory that must be transferred during a swap must be registered with the swap_register() function call; for example, the iteration variable above is registered this way. The swap_register() call is considered collective and must be issued across all processes.

3. The user must insert a call to MPI_Swap() inside the iteration loop to exercise the swapping test and actuation routines. The MPI_Swap() function call acts like a barrier to active processes. It must be placed inside the application's iteration loop. The current implementation requires that no communication messages be outstanding when MPI_Swap() is called. In theory, outstanding messages could be allowed by forwarding them to the new active process. This enhancement has been designed, but not implemented to date.

Figure 3.3(b) illustrates how to compile a swap-enabled application. The user includes the mpi_swap.h header file provided by the swap package, and links against both the standard MPI library, called libmpi.a here, and the swap library libswap.a that is provided by the swap package.

(a) Standard MPI C source:

    #include "mpi.h"

    main() {
        MPI_Init();
        for (a lot of loops) {
            (MPI_Send() / MPI_Recv());
            MPI_Bcast();
            MPI_Allreduce();
        }
        MPI_Finalize();
    }

(b) Swap-enabled MPI C source:

    #include "mpi_swap.h"

    main() {
        MPI_Init();
        swap_register(iteration variable);
        for (a lot of loops) {
            MPI_Swap();
            (MPI_Send() / MPI_Recv());
            MPI_Bcast();
            MPI_Allreduce();
        }
        MPI_Finalize();
    }

Figure 3.2: The minimum source changes to swap-enable an application.

In a similar way to how the standard MPI calls are overloaded, the standard C library calls to dynamically allocate memory are overloaded. In this way, the swap library tracks memory that needs to be communicated during a swap.

Figure 3.3: The build changes to swap-enable an application. (Standard MPI: user.c and mpi.h are compiled and linked against libmpi.a to produce the executable. Swap-enabled MPI: user.c includes mpi_swap.h, which includes mpi.h, and the executable is linked against both libmpi.a and libswap.a.)

In the rare instance that some local memory is dynamically allocated (and does not need to be communicated during a swap), passthrough functions are provided that allocate memory but do not register the memory for swap communication. This default overloading of dynamic memory allocation can cause lower application performance. However, swapping is much less likely to fail for novice users if all dynamic memory is trapped and registered. Furthermore, the advanced user who is concerned about every little bit of performance will take the time to sort out these local memory pools and prevent them from being registered.

3.2 Run Time Services

The swap run time services, depicted in Figure 3.4, are independent processes that together manage the execution of a parallel application. The swap services monitor the application and the resources on which it runs, and are responsible for determining when and where to swap. Below, each of these services is described in more detail.

Figure 3.4: Swap run-time architecture. (The figure shows the swap dispatcher on processor d, the swap manager on processor m, visualization/logging/external-control utilities on processor x, and one swap handler alongside each application MPI process on processors 0 through N.)

Swap dispatcher. The swap dispatcher service is an always-on service, listening at an advertised host/port. As shown in Figure 3.5, the swap dispatcher is contacted during application bootstrap, in the MPI_Init() call. At this time the dispatcher launches a swap manager on the processor with MPI rank 0. Until the application terminates and the swap manager unregisters itself with the swap dispatcher, the only additional action provided by the dispatcher is to forward communication inquiries to the appropriate swap manager. For example, the swap dispatcher can be contacted for information about a particular application, and this request will be forwarded on to the swap manager responsible for the application.

Swap manager. When it starts, the swap manager reports to the swap dispatcher the host and port number on which it receives messages. This information is passed back to the application. The application communicates this information to all the MPI processes in the parallel application. Each of these processes contacts the swap manager and requests a swap handler. The swap manager remotely starts a swap handler for each application process.

During application execution, the swap manager plays an active role, tracking the performance of active and inactive processors, tracking application progress, evaluating this information, and making process swapping decisions.

Swap handler. A swap handler is started for each MPI process, during each process's call to MPI_Init(). Each MPI process requests a swap handler from the swap manager. The swap manager remotely starts a swap handler on the same processor as the MPI process. This handler returns to the swap manager the host and port on which it communicates. The swap manager gives this information to the MPI process, which then talks only with the swap handler for the remainder of the application. Upon application termination, the swap handler unregisters itself with the swap manager, then exits. The swap handler exists as a separate process (or thread) for several reasons:

- For efficiency, all swapping communication between the application and the swap services is made on the local host. During the bulk of the application run, the overhead of swapping is minimized because all communication is local to the processor.
- The swap manager needs to have an asynchronous dialog with the swap handler, and this could not happen if the swap handler were written inline in the MPI process.
- The swap handler performs some active resource measurements from time to time, and this is very difficult to do on inactive MPI processes where the handler code is inline with the application.

Swap utilities. The swap utilities are designed to improve the usability of the swap environment. Facilities such as swap information logging and swap visualization connect to the swap manager (possibly through the swap dispatcher) and track an application's progress. The swap admin provides a simple interface to manually configure or control the swapping system.

When a swap is determined by the swap manager, it communicates this to the swap handler associated with the MPI process with local rank zero (the "local root"). At the next application iteration, during the call to MPI_Swap(), the swap handler indicates to the local root process that a swap is necessary.

The local root communicates this information to all the MPI processes, and the appropriate processes initiate a swap. The other processes continue on with the application, doing as much work as possible in parallel with the swap.

Figure 3.5: Interaction diagram of a swappable MPI application. (The figure traces the messages exchanged among the user on machine u, the swap dispatcher on machine d, the swap manager on machine m, and the application processes and swap handlers on machines 0 through N, from initialization through swaps to finalization.)

The swap services interact with the MPI application and with each other in a straightforward asynchronous manner, as illustrated in Figure 3.5. Walking through an example application execution will further describe these interactions. For ease of reference, the numbers below are matched in the figure.

1. From machine u, a user launches an MPI application that uses N total processes, a subset of which will be active at any given time.

2. The root process (the process with MPI rank zero) on machine 0 contacts the always-on swap dispatcher (running on machine d) during initialization, and requests swap services.

3. The swap dispatcher launches a swap manager on machine m. The swap dispatcher waits for the swap manager to initialize, then tells the root process how to contact this personalized swap manager. The root process passes this information to all MPI processes in the application. From this point onward, the swap dispatcher plays a minimal role; the swap manager becomes the focal point.

4. For each MPI process, the swap manager starts a swap handler on the same machine. Once the swap handlers are initialized, the application begins execution.

5. While the application is executing, the swap handlers gather application and environment (machine) performance information and feed it to the swap manager. Some of this information is passive, like the CPU load or the amount of computation, communication, and barrier wait time of the application. Other times the performance information is gathered via active probing, which uses significant computational resources for a short period of time but provides more accurate information. The swap manager analyzes all of this information and determines whether or not to initiate a process swap.

6. The active root process, the MPI process that is the root process in the group of active processes, contacts its swap handler periodically (at an interval of some number of iterations, during the call to MPI_Swap()). In this case, the active root starts out as the process on machine 0. The first time this process asks if a swap is needed, the swap handler replies no. The application continues to execute, and information continues to be fed to the swap manager.

7. Eventually, the swap manager decides that processes 0 and 1 should swap, so it sends a message to the swap handler that cohabitates with the active root process.

The next time the application asks if it should swap, the swap handler answers yes. Processes 2 through N continue to execute the application while processes 0 and 1 exchange information and data. The process on machine 0 becomes inactive, while the process on machine 1 becomes active. When the swap is complete, process 1 is the new active root process, so the next swap message from the swap manager is sent to the process on machine 1. This time, process 1 and process N swap. The execution continues in this fashion until it completes.

8. As the MPI application shuts down, each MPI process sends finalization messages to its swap handler before quitting. The swap handler in turn registers a finalization message with the swap manager, then quits. Once all the swap handlers have unregistered with the swap manager, it sends a quit message to the swap dispatcher, and shuts down.

9. In this case, the user monitored the progress of the application throughout its execution. Shortly after the application began to execute, the user started the swap visualization tool. The visualization tool contacted the swap dispatcher, which told it where the swap manager lived. The visualization tool registered itself with the swap manager, and from that time forward the swap manager kept the visualization tool informed directly. After the application shut down, and the swap manager also shut down, the user closed the visualization tool.

3.3 Advantages of the Design

The advantages of this design are briefly discussed below.

Minimal impact on application source code. MPI process swapping works by overloading many of the MPI function calls. The goal is to achieve as much transparency as possible for the application programmer and end-user. There are several modifications to the application source code that must be made by the user. Conceivably these steps could be automated with a compiler, but they are straightforward for a user to implement.

As few as three lines of source code need to be changed in order to swap-enable an existing iterative MPI application.

Support for MPI-1 applications. Process swapping is tightly integrated with MPI, but it does not require advanced MPI-2 features. MPI-1 does not support adding processors to communicators, so process swapping relies on over-allocation of processes at the beginning of execution to get its pool of processors (a sketch of this idea appears at the end of this section). Swapping chooses the best subset of available processors to actively participate in the application execution; the rest remain inactive until needed. These inactive processes utilize very little computational power; aside from periodic active performance measurement, they block on I/O calls and wait to become active. MPI-2 has support for adding and removing processors during execution. However, MPI-2 is not as widely supported as MPI-1. Furthermore, the functionality to add/remove processes is not transparent: the programmer must manage new communicators, requiring significant source code modification for existing MPI-1 applications. By contrast, MPI-1 with process swapping requires minimal source code changes. So while MPI-2 dynamic process management functionality, such as that supported by the latest grid-enabled MPI implementation, MPICH-G2 [43], minimizes the need for over-allocation, it requires significant modification to existing applications.

Profiling support. By using a custom interface to overload MPI functions, the end user may still use any MPI profiling, logging, or debugging instrumentation they are familiar with. MPI does not have language support for multiple levels of interception, which means that (without special arrangement) only one profiling instrumentation can be done. Had the swap library used the standard MPI profiling interface for its own use, this would have precluded the use of another profiling or debugging tool.

Scalable run-time architecture. By organizing the run-time system as services, and minimizing the communication with a central service, the run-time system can scale to a large number of simultaneous swap-enabled applications.
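To make the over-allocation idea concrete, the following fragment is a minimal sketch, not the swap library's actual implementation, of how an over-allocated MPI-1 job can be partitioned into active and spare groups using only standard MPI-1 calls. The names n_active and work_comm are hypothetical.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        int n_active = 8;                 /* assume the application needs N = 8 */
        MPI_Comm work_comm;               /* private communicator for this group */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* ranks below n_active start active; the remaining M = size - n_active
           ranks are spares that block and wait to be swapped in */
        int is_active = (rank < n_active);
        MPI_Comm_split(MPI_COMM_WORLD, is_active, rank, &work_comm);

        if (is_active) {
            /* run the application using work_comm as the active communicator */
        } else {
            /* wait cheaply for a swap request from the swap services */
        }

        MPI_Comm_free(&work_comm);
        MPI_Finalize();
        return 0;
    }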

3.4 Limitations

There are a number of limitations to the current implementation. Some of these limitations are fundamental design limitations, but most are simply features that (for expediency) were not implemented. A laundry list of limitations is given below.

Iterative applications. The swap system was designed to target the class of iterative applications.

Language. The swap library only has C bindings. C++ and Fortran bindings have not been implemented.

Message forwarding. The swap library does not perform message forwarding; in particular, this means that the call to MPI_Swap() must be at a natural application barrier; there can be no outstanding asynchronous communications when this function is called.

User-defined communicators. The swap library does not currently do the bookkeeping necessary to properly handle user-defined communicators.

Overhead of MPI overloading. There is a slight performance overhead in each MPI communication call: the cost of an additional function call is incurred for each call. These overheads, while small, can accumulate in an application with a lot of MPI calls. The function call overloading could have been implemented with macros, but this limits the readability and maintainability of the swap library itself.

Inactive process blocking. Inactive MPI processes block on an MPI communication receive. On most MPI implementations, this means that these inactive processes consume virtually no system resources. This is what allows an application to dramatically over-allocate processors without incurring prohibitive penalties or causing problems for other users. However, it must be noted that some MPI implementations use spinning in their implementation of MPI_Recv(), for example TMPI [42]. On these machines, with the current implementation, the inactive processes would needlessly consume a significant amount of resources.

It is straightforward to implement the inactive processes' pause in a fashion that does not consume resources, for example through the use of the select() facility (a sketch follows at the end of this section). This was not done in this implementation for expediency, and because the systems on which swapping was evaluated did not spin wait, and so did not consume significant resources while processes were inactive.

Scalability. The swap services, written in Python, do not scale well above roughly one hundred processes. Above this level, communication traffic causes dropped messages and reset socket connections. This could be addressed through a focused redesign. Similarly, the mpich system could not launch more than two hundred processes before running out of process space.
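As a point of reference, the select()-based pause mentioned above could look roughly like the sketch below. This is a hypothetical illustration, not code from the swap library; handler_fd stands for the (assumed) local socket descriptor connecting an inactive process to its swap handler.

    #include <sys/select.h>

    /* Block without consuming CPU until the swap handler writes a wake-up
       byte on handler_fd, i.e. until this inactive process should become
       active.  A NULL timeout makes select() sleep in the kernel. */
    static void wait_until_activated(int handler_fd)
    {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(handler_fd, &readfds);
        select(handler_fd + 1, &readfds, NULL, NULL, NULL);
        /* on return, a message is ready to be read from handler_fd */
    }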

Chapter 4

Experimental Results

Swapping has been implemented and verified in a commercial production environment. In this chapter, this validation is presented. The overheads of swapping are measured, and the effectiveness of the system is shown anecdotally (a more rigorous discussion of swap efficacy is presented in Chapter 6).

4.1 Environment

The Hewlett-Packard Company maintains a large network of workstations in San Diego, California. These machines are a heterogeneous collection of HP PA-RISC processors running HP-UX 11.11i, connected through 100baseT Ethernet using multiple sub-nets, routers, and switches. The majority of the HP machines are used as individual workstations for commercial research and development activities, such as Computer Aided Drafting, Very Large Scale Integrated digital circuit development, and software/firmware development. A handful of these machines are not used interactively; instead, these machines are used as a cluster. Three hundred of the HP machines were included in the MPI process swapping machine pool, though due to scalability concerns fewer than 100 participated in experiments at any given time. Figure 4.1 shows an example of the CPU load on this system. The load in this figure is typical of most of the interactively-used processors in the system. The batch processors tend to have far fewer jobs, which run much longer.

Figure 4.1: Example of CPU load in the experimental environment. (The figure plots CPU load against time in seconds.)

4.2 Applications

A number of applications have been swap-enabled. They are described here.

fish. Fish is a simple particle physics parallel application written by Fred Wong and Jim Demmel [48]. Typical of the codes generally found in the field of particle dynamics, this application computes Van der Waals forces between particles in a two-dimensional field. As the particles interact, they move about the field. Because the amount of computation depends on the location and proximity of particles to one another, this application exhibits a dynamic amount of work per processor even when the data partitioning is static and the processors are dedicated. This iterative code uses simple MPI calls and only the MPI_COMM_WORLD communicator. It has a per-iteration barrier point, which is a natural place to insert a call to MPI_Swap(). From the original code, four source lines were added/changed in order to add process swapping capability to this application.

iterative stencil applications. This unified application, written by Holly Dail, performs the computation and communication for (a) a Jacobi relaxation solver for a system of equations, and (b) the game of life algorithm. Both sub-applications are iterative and have natural barrier points suitable for MPI_Swap() insertion. These applications use straightforward synchronous MPI communication calls, and only communicate via MPI_COMM_WORLD.

From the original code, only three lines of code were changed to enable swapping.

synthetic. This application was designed specifically to emulate applications like fish, jacobi, and the game of life. It has the same basic application elements of configuration, iteration, and finalization. Unlike these real applications, the synthetic application does not implement a real algorithm or communicate real data. Instead, this code can be tuned to any computation/communication demands desired. The advantage is that the synthetic application allows quick and efficient exploration of entire classes of applications without actually writing them. The synthetic application does perform actual floating point calculations (additions and multiplies), and it does communicate buffers full of random data. Four lines of source code were altered to make this application swap-enabled.

ring. This well-known introductory MPI application sends a message from process to process in a ring communication; the message finally returns to the initial sending process. This simple application reports the total message travel time, and as such is useful for measuring various swap overheads.

mpi_sleep. This utility application was designed to allow exploration of swapping capabilities without degrading the experience of other users. Instead of performing actual computation and communication, the payload of this iterative application is simply a sleep command. When configured with a particular sleep time and number of sleep iterations, this application swaps in between sleeping. This application was swap-enabled with modifications to three lines of source.

4.3 Swap Policy

The policy used in the experiments took every opportunity to acquire and use faster processors, even when those processors were only marginally faster than the current processor set. This greedy policy periodically evaluated the performance of all the machines in the application's pool (a subset of the entire HP cluster), and swapped


Hierarchical Chubby: A Scalable, Distributed Locking Service Hierarchical Chubby: A Scalable, Distributed Locking Service Zoë Bohn and Emma Dauterman Abstract We describe a scalable, hierarchical version of Google s locking service, Chubby, designed for use by systems

More information

Making Workstations a Friendly Environment for Batch Jobs. Miron Livny Mike Litzkow

Making Workstations a Friendly Environment for Batch Jobs. Miron Livny Mike Litzkow Making Workstations a Friendly Environment for Batch Jobs Miron Livny Mike Litzkow Computer Sciences Department University of Wisconsin - Madison {miron,mike}@cs.wisc.edu 1. Introduction As time-sharing

More information

Real-time grid computing for financial applications

Real-time grid computing for financial applications CNR-INFM Democritos and EGRID project E-mail: cozzini@democritos.it Riccardo di Meo, Ezio Corso EGRID project ICTP E-mail: {dimeo,ecorso}@egrid.it We describe the porting of a test case financial application

More information

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads.

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Poor co-ordination that exists in threads on JVM is bottleneck

More information

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters Amit Karwande, Xin Yuan Dept. of Computer Science Florida State University Tallahassee, FL 32306 {karwande,xyuan}@cs.fsu.edu

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 16: Virtual Machine Monitors Geoffrey M. Voelker Virtual Machine Monitors 2 Virtual Machine Monitors Virtual Machine Monitors (VMMs) are a hot

More information

Why load test your Flex application?

Why load test your Flex application? Why load test your Flex application? Your Flex application is new and exciting, but how well does it perform under load? Abstract As the trend to implement Web 2.0 technologies continues to grow and spread

More information

z/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2

z/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2 z/os Heuristic Conversion of CF Operations from Synchronous to Asynchronous Execution (for z/os 1.2 and higher) V2 z/os 1.2 introduced a new heuristic for determining whether it is more efficient in terms

More information

Kernel Korner AEM: A Scalable and Native Event Mechanism for Linux

Kernel Korner AEM: A Scalable and Native Event Mechanism for Linux Kernel Korner AEM: A Scalable and Native Event Mechanism for Linux Give your application the ability to register callbacks with the kernel. by Frédéric Rossi In a previous article [ An Event Mechanism

More information

Cloud Computing For Researchers

Cloud Computing For Researchers Cloud Computing For Researchers August, 2016 Compute Canada is often asked about the potential of outsourcing to commercial clouds. This has been investigated as an alternative or supplement to purchasing

More information

A Comparison of Two Distributed Systems: Amoeba & Sprite. By: Fred Douglis, John K. Ousterhout, M. Frans Kaashock, Andrew Tanenbaum Dec.

A Comparison of Two Distributed Systems: Amoeba & Sprite. By: Fred Douglis, John K. Ousterhout, M. Frans Kaashock, Andrew Tanenbaum Dec. A Comparison of Two Distributed Systems: Amoeba & Sprite By: Fred Douglis, John K. Ousterhout, M. Frans Kaashock, Andrew Tanenbaum Dec. 1991 Introduction shift from time-sharing to multiple processors

More information

Computational Steering

Computational Steering Computational Steering Nate Woody 10/23/2008 www.cac.cornell.edu 1 What is computational steering? Generally, computational steering can be thought of as a method (or set of methods) for providing interactivity

More information

vsan 6.6 Performance Improvements First Published On: Last Updated On:

vsan 6.6 Performance Improvements First Published On: Last Updated On: vsan 6.6 Performance Improvements First Published On: 07-24-2017 Last Updated On: 07-28-2017 1 Table of Contents 1. Overview 1.1.Executive Summary 1.2.Introduction 2. vsan Testing Configuration and Conditions

More information

Web Serving Architectures

Web Serving Architectures Web Serving Architectures Paul Dantzig IBM Global Services 2000 without the express written consent of the IBM Corporation is prohibited Contents Defining the Problem e-business Solutions e-business Architectures

More information

Dell PowerVault MD3600f/MD3620f Remote Replication Functional Guide

Dell PowerVault MD3600f/MD3620f Remote Replication Functional Guide Dell PowerVault MD3600f/MD3620f Remote Replication Functional Guide Page i THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

Hashing. Hashing Procedures

Hashing. Hashing Procedures Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements

More information

Introduction to iscsi

Introduction to iscsi Introduction to iscsi As Ethernet begins to enter into the Storage world a new protocol has been getting a lot of attention. The Internet Small Computer Systems Interface or iscsi, is an end-to-end protocol

More information

Condor and BOINC. Distributed and Volunteer Computing. Presented by Adam Bazinet

Condor and BOINC. Distributed and Volunteer Computing. Presented by Adam Bazinet Condor and BOINC Distributed and Volunteer Computing Presented by Adam Bazinet Condor Developed at the University of Wisconsin-Madison Condor is aimed at High Throughput Computing (HTC) on collections

More information

Engineers can be significantly more productive when ANSYS Mechanical runs on CPUs with a high core count. Executive Summary

Engineers can be significantly more productive when ANSYS Mechanical runs on CPUs with a high core count. Executive Summary white paper Computer-Aided Engineering ANSYS Mechanical on Intel Xeon Processors Engineer Productivity Boosted by Higher-Core CPUs Engineers can be significantly more productive when ANSYS Mechanical runs

More information

Lotus Sametime 3.x for iseries. Performance and Scaling

Lotus Sametime 3.x for iseries. Performance and Scaling Lotus Sametime 3.x for iseries Performance and Scaling Contents Introduction... 1 Sametime Workloads... 2 Instant messaging and awareness.. 3 emeeting (Data only)... 4 emeeting (Data plus A/V)... 8 Sametime

More information

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud 571 Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud T.R.V. Anandharajan 1, Dr. M.A. Bhagyaveni 2 1 Research Scholar, Department of Electronics and Communication,

More information

Middleware. Adapted from Alonso, Casati, Kuno, Machiraju Web Services Springer 2004

Middleware. Adapted from Alonso, Casati, Kuno, Machiraju Web Services Springer 2004 Middleware Adapted from Alonso, Casati, Kuno, Machiraju Web Services Springer 2004 Outline Web Services Goals Where do they come from? Understanding middleware Middleware as infrastructure Communication

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

CSCI 4717 Computer Architecture

CSCI 4717 Computer Architecture CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

VIRTUALIZATION PERFORMANCE: VMWARE VSPHERE 5 VS. RED HAT ENTERPRISE VIRTUALIZATION 3

VIRTUALIZATION PERFORMANCE: VMWARE VSPHERE 5 VS. RED HAT ENTERPRISE VIRTUALIZATION 3 VIRTUALIZATION PERFORMANCE: VMWARE VSPHERE 5 VS. RED HAT ENTERPRISE VIRTUALIZATION 3 When you invest in a virtualization platform, you can maximize the performance of your applications and the overall

More information

Using Virtualization to Reduce Cost and Improve Manageability of J2EE Application Servers

Using Virtualization to Reduce Cost and Improve Manageability of J2EE Application Servers WHITEPAPER JANUARY 2006 Using Virtualization to Reduce Cost and Improve Manageability of J2EE Application Servers J2EE represents the state of the art for developing component-based multi-tier enterprise

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery

Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery Java Message Service (JMS) is a standardized messaging interface that has become a pervasive part of the IT landscape

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

Techniques to improve the scalability of Checkpoint-Restart

Techniques to improve the scalability of Checkpoint-Restart Techniques to improve the scalability of Checkpoint-Restart Bogdan Nicolae Exascale Systems Group IBM Research Ireland 1 Outline A few words about the lab and team Challenges of Exascale A case for Checkpoint-Restart

More information

EMC GREENPLUM MANAGEMENT ENABLED BY AGINITY WORKBENCH

EMC GREENPLUM MANAGEMENT ENABLED BY AGINITY WORKBENCH White Paper EMC GREENPLUM MANAGEMENT ENABLED BY AGINITY WORKBENCH A Detailed Review EMC SOLUTIONS GROUP Abstract This white paper discusses the features, benefits, and use of Aginity Workbench for EMC

More information

SWsoft ADVANCED VIRTUALIZATION AND WORKLOAD MANAGEMENT ON ITANIUM 2-BASED SERVERS

SWsoft ADVANCED VIRTUALIZATION AND WORKLOAD MANAGEMENT ON ITANIUM 2-BASED SERVERS SWsoft ADVANCED VIRTUALIZATION AND WORKLOAD MANAGEMENT ON ITANIUM 2-BASED SERVERS Abstract Virtualization and workload management are essential technologies for maximizing scalability, availability and

More information

Maximum Availability Architecture: Overview. An Oracle White Paper July 2002

Maximum Availability Architecture: Overview. An Oracle White Paper July 2002 Maximum Availability Architecture: Overview An Oracle White Paper July 2002 Maximum Availability Architecture: Overview Abstract...3 Introduction...3 Architecture Overview...4 Application Tier...5 Network

More information

AN EMPIRICAL STUDY OF EFFICIENCY IN DISTRIBUTED PARALLEL PROCESSING

AN EMPIRICAL STUDY OF EFFICIENCY IN DISTRIBUTED PARALLEL PROCESSING AN EMPIRICAL STUDY OF EFFICIENCY IN DISTRIBUTED PARALLEL PROCESSING DR. ROGER EGGEN Department of Computer and Information Sciences University of North Florida Jacksonville, Florida 32224 USA ree@unf.edu

More information

Blue Waters I/O Performance

Blue Waters I/O Performance Blue Waters I/O Performance Mark Swan Performance Group Cray Inc. Saint Paul, Minnesota, USA mswan@cray.com Doug Petesch Performance Group Cray Inc. Saint Paul, Minnesota, USA dpetesch@cray.com Abstract

More information

Grid Application Development Software

Grid Application Development Software Grid Application Development Software Department of Computer Science University of Houston, Houston, Texas GrADS Vision Goals Approach Status http://www.hipersoft.cs.rice.edu/grads GrADS Team (PIs) Ken

More information

Storage Hierarchy Management for Scientific Computing

Storage Hierarchy Management for Scientific Computing Storage Hierarchy Management for Scientific Computing by Ethan Leo Miller Sc. B. (Brown University) 1987 M.S. (University of California at Berkeley) 1990 A dissertation submitted in partial satisfaction

More information

Chapter 18 Parallel Processing

Chapter 18 Parallel Processing Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD

More information

Bringing OpenStack to the Enterprise. An enterprise-class solution ensures you get the required performance, reliability, and security

Bringing OpenStack to the Enterprise. An enterprise-class solution ensures you get the required performance, reliability, and security Bringing OpenStack to the Enterprise An enterprise-class solution ensures you get the required performance, reliability, and security INTRODUCTION Organizations today frequently need to quickly get systems

More information

PARALLEL PROGRAM EXECUTION SUPPORT IN THE JGRID SYSTEM

PARALLEL PROGRAM EXECUTION SUPPORT IN THE JGRID SYSTEM PARALLEL PROGRAM EXECUTION SUPPORT IN THE JGRID SYSTEM Szabolcs Pota 1, Gergely Sipos 2, Zoltan Juhasz 1,3 and Peter Kacsuk 2 1 Department of Information Systems, University of Veszprem, Hungary 2 Laboratory

More information

Distributed Operating Systems

Distributed Operating Systems 2 Distributed Operating Systems System Models, Processor Allocation, Distributed Scheduling, and Fault Tolerance Steve Goddard goddard@cse.unl.edu http://www.cse.unl.edu/~goddard/courses/csce855 System

More information

The MOSIX Scalable Cluster Computing for Linux. mosix.org

The MOSIX Scalable Cluster Computing for Linux.  mosix.org The MOSIX Scalable Cluster Computing for Linux Prof. Amnon Barak Computer Science Hebrew University http://www. mosix.org 1 Presentation overview Part I : Why computing clusters (slide 3-7) Part II : What

More information

MPI versions. MPI History

MPI versions. MPI History MPI versions MPI History Standardization started (1992) MPI-1 completed (1.0) (May 1994) Clarifications (1.1) (June 1995) MPI-2 (started: 1995, finished: 1997) MPI-2 book 1999 MPICH 1.2.4 partial implemention

More information

Programming with MPI

Programming with MPI Programming with MPI p. 1/?? Programming with MPI Miscellaneous Guidelines Nick Maclaren Computing Service nmm1@cam.ac.uk, ext. 34761 March 2010 Programming with MPI p. 2/?? Summary This is a miscellaneous

More information

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers SLAC-PUB-9176 September 2001 Optimizing Parallel Access to the BaBar Database System Using CORBA Servers Jacek Becla 1, Igor Gaponenko 2 1 Stanford Linear Accelerator Center Stanford University, Stanford,

More information

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 Introduction to Parallel Computing CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 1 Definition of Parallel Computing Simultaneous use of multiple compute resources to solve a computational

More information

Lecture 1: January 22

Lecture 1: January 22 CMPSCI 677 Distributed and Operating Systems Spring 2018 Lecture 1: January 22 Lecturer: Prashant Shenoy Scribe: Bin Wang 1.1 Introduction to the course The lecture started by outlining the administrative

More information

CSE 153 Design of Operating Systems

CSE 153 Design of Operating Systems CSE 153 Design of Operating Systems Winter 2018 Lecture 22: File system optimizations and advanced topics There s more to filesystems J Standard Performance improvement techniques Alternative important

More information

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis

More information

TECHNICAL PAPER March Datacenter Virtualization with Windows Server 2008 Hyper-V Technology and 3PAR Utility Storage

TECHNICAL PAPER March Datacenter Virtualization with Windows Server 2008 Hyper-V Technology and 3PAR Utility Storage TECHNICAL PAPER March 2009 center Virtualization with Windows Server 2008 Hyper-V Technology and 3PAR Utility Storage center Virtualization with 3PAR and Windows Server 2008 Hyper-V Technology Table of

More information

The Past, Present and Future of High Performance Computing

The Past, Present and Future of High Performance Computing The Past, Present and Future of High Performance Computing Ruud van der Pas 1 Sun Microsystems, Technical Developer Tools 16 Network Circle, Mailstop MK16-319, Menlo Park, CA 94025, USA ruud.vanderpas@sun.com

More information

Computer-System Organization (cont.)

Computer-System Organization (cont.) Computer-System Organization (cont.) Interrupt time line for a single process doing output. Interrupts are an important part of a computer architecture. Each computer design has its own interrupt mechanism,

More information

Unit 2 : Computer and Operating System Structure

Unit 2 : Computer and Operating System Structure Unit 2 : Computer and Operating System Structure Lesson 1 : Interrupts and I/O Structure 1.1. Learning Objectives On completion of this lesson you will know : what interrupt is the causes of occurring

More information

Advanced Systems Lab

Advanced Systems Lab Advanced Systems Lab Autumn Semester 2017 Gustavo Alonso 1 Course Overview 1.1 Prerequisites This is a project based course operating on a self-study principle. The course relies on the students own initiative,

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

BASIC OPERATIONS. Managing System Resources

BASIC OPERATIONS. Managing System Resources 48 PART 2 BASIC OPERATIONS C H A P T E R 5 Managing System Resources CHAPTER 5 MANAGING SYSTEM RESOURCES 49 THE part of Windows Vista that you see the Vista desktop is just part of the operating system.

More information

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification:

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification: Application control : Boundary control : Access Controls: These controls restrict use of computer system resources to authorized users, limit the actions authorized users can taker with these resources,

More information

packet-switched networks. For example, multimedia applications which process

packet-switched networks. For example, multimedia applications which process Chapter 1 Introduction There are applications which require distributed clock synchronization over packet-switched networks. For example, multimedia applications which process time-sensitive information

More information

Big data and data centers

Big data and data centers Big data and data centers Contents Page 1 Big data and data centers... 3 1.1 Big data, big IT... 3 1.2 The IT organization between day-to-day business and innovation... 4 2 Modern data centers... 5 2.1

More information

hot plug RAID memory technology for fault tolerance and scalability

hot plug RAID memory technology for fault tolerance and scalability hp industry standard servers april 2003 technology brief TC030412TB hot plug RAID memory technology for fault tolerance and scalability table of contents abstract... 2 introduction... 2 memory reliability...

More information

QuickSpecs. Compaq NonStop Transaction Server for Java Solution. Models. Introduction. Creating a state-of-the-art transactional Java environment

QuickSpecs. Compaq NonStop Transaction Server for Java Solution. Models. Introduction. Creating a state-of-the-art transactional Java environment Models Bringing Compaq NonStop Himalaya server reliability and transactional power to enterprise Java environments Compaq enables companies to combine the strengths of Java technology with the reliability

More information

EMC VPLEX Geo with Quantum StorNext

EMC VPLEX Geo with Quantum StorNext White Paper Application Enabled Collaboration Abstract The EMC VPLEX Geo storage federation solution, together with Quantum StorNext file system, enables a global clustered File System solution where remote

More information

Deploying VSaaS and Hosted Solutions using CompleteView

Deploying VSaaS and Hosted Solutions using CompleteView SALIENT SYSTEMS WHITE PAPER Deploying VSaaS and Hosted Solutions using CompleteView Understanding the benefits of CompleteView for hosted solutions and successful deployment architectures. Salient Systems

More information

Rapid Bottleneck Identification A Better Way to do Load Testing. An Oracle White Paper June 2008

Rapid Bottleneck Identification A Better Way to do Load Testing. An Oracle White Paper June 2008 Rapid Bottleneck Identification A Better Way to do Load Testing An Oracle White Paper June 2008 Rapid Bottleneck Identification A Better Way to do Load Testing. RBI combines a comprehensive understanding

More information

TensorFlow: A System for Learning-Scale Machine Learning. Google Brain

TensorFlow: A System for Learning-Scale Machine Learning. Google Brain TensorFlow: A System for Learning-Scale Machine Learning Google Brain The Problem Machine learning is everywhere This is in large part due to: 1. Invention of more sophisticated machine learning models

More information

Introduction to Grid Computing

Introduction to Grid Computing Milestone 2 Include the names of the papers You only have a page be selective about what you include Be specific; summarize the authors contributions, not just what the paper is about. You might be able

More information