Overhead-Free Portable Thread-Stack Checkpoints

Ronald Veldema and Michael Philippsen
University of Erlangen-Nuremberg, Computer Science Department 2, Martensstr Erlangen Germany
{veldema, philippsen}@cs.fau.de

Abstract. Checkpointing is the process of taking a snapshot of a thread's stack, and possibly the objects that it uses, such that the thread can be either restarted (for error recovery) or moved to another machine (to improve load balancing). Current approaches to thread stack checkpointing are either not heterogeneous, as they do not allow a call stack created on architecture X to be restored on a machine with architecture Y, or they introduce large runtime overhead. In general, previous approaches add overhead by instrumenting each function in a program to constantly test whether the current method invocation is for thread restoration purposes or is a normal invocation. The instrumentation costs are incurred even when no checkpointing is performed. Our implementation introduces no runtime overhead during regular execution. Furthermore, our approach supports heterogeneity. We implement this by letting our compiler create extra functions to portably save and rebuild activation records to and from a machine-independent format. Each variable of an activation record is described in terms of its usages in a variable usage descriptor string. As the computed variable usage descriptor strings for a given variable are the same on all architectures, they are used to uniquely identify variables inside activation records across different architectures.

1 Introduction

Checkpointing a thread is the process of taking a snapshot of all activation records that form the thread's call stack, and of the objects on the heap reachable from them, for later restoration, possibly on another machine. There are many uses for checkpointing.
Examples include migrating a thread from one machine to another, directly utilizing resources or special features that a specific machine has, and providing rollback fault tolerance.

Many checkpointing packages (including the system presented here) give a programmer access to a checkpoint_this_thread() function. Invoking this function causes all relevant information of the current thread, such as the thread's stack, active registers, and accessible heap, to be saved to disk. Later, the program can be restarted using a special command line parameter that causes the thread to be completely restored. To the program, it then seems as if control just returned from the checkpoint_this_thread() function as if nothing had happened.

Creating a fully heterogeneous checkpointing package is difficult because of the many architectural differences. First of all, local variables and parameters are stored in different physical locations and binary formats on different processors. Also, one machine can have more registers than another. If this occurs, fewer variables are allocated in memory and more are allocated in registers (but the total number of variables allocated in registers and memory remains the same). Variables that are stored in memory can also be stored in different locations inside the stackframes. We therefore cannot perform a simple bitwise copy of an entire activation record from one machine to another. Likewise, the layout in memory of an object can vary between architectures.

We have implemented our checkpointing algorithm in Jackal [7]. Jackal includes an aggressively optimizing static compiler: the compiler accepts Java source code and generates an executable. Jackal also supports heterogeneous clusters; objects and pointers can already be freely exchanged between different types of machines. Our new checkpointing algorithm solves all the problems mentioned above: a function's activation record can now be converted to a machine-independent format using a novel way to describe the individual variables. This allows a checkpoint to be restored on a different architecture than the one where the checkpoint was taken, while imposing no overhead during normal execution. Although the techniques described here are implemented in a Java system, they are applicable to any programming language that is type-safe and has no function pointers.

2 Implementation

The central problem in heterogeneous checkpointing is how to translate an activation record created on one architecture to the activation record on another architecture without changing the code on either architecture. That is, how to associate variable X of function F, allocated in register or stackframe position P, with the X of F on a different architecture where X is allocated in stackframe position or register Q. For example, on an x86, a source code variable might be spilled to memory at -8(ebp), while on an IA64 the variable is allocated in register loc0. Unfortunately, checkpointing requires such low-level details, as it needs to machine-dependently save and restore a stackframe.
Alas, source code variable names cannot be used to reassociate the variables with their physical locations, as naming information is lost with increasing optimization levels. Compilers that support simultaneous debugging and optimization only provide approximations to the debugger in such cases, which won't suffice for our purposes.

2.1 Stack Checkpointing and Restoration

A machine-independent description of the variables inside an activation record is needed. Our solution is to uniquely identify variables by characterizing how each register or memory location in a stackframe is used. This characterization is portable as it uniquely identifies variables without looking at how the variables are physically stored. A characterization of a variable (either in a register or a memory location) is initially in the form of a usage descriptor string. As checkpointing can only occur at a call instruction (due to calling checkpoint_this_thread directly or indirectly), we only need to create a usage descriptor string for each live variable at each callsite. The rule for usage descriptor strings is then: if on architectures X and Y the descriptor strings for variables v1 and v2 are the same, then the variables represent the same variable. At runtime, the checkpoint file then consists of a series of tuples {usage descriptor string, value of variable in universal binary format}.

Our checkpointing algorithm operates in two passes at compile time. After all the machine-independent optimizations, we create lists of the above tuples for each live variable at each callsite. After code generation and machine-dependent optimizations we generate two helper functions: checkpoint(C in F), which checkpoints function F at callsite C, and restore(C in F), which performs the reverse operation. The original callsites and functions are unaffected (pseudocode for checkpoint(C in F) and restore(C in F) is shown in Figure 3).

At runtime, checkpointing unwinds the stack from checkpoint_this_thread upward toward the thread's run method. For each activation record we locate checkpoint(C in F) for that callsite by hash lookup on the callsite's address. Checkpoint(C in F) then outputs for each live variable V a tuple {descriptor(V), value(V)}. As checkpoint(C in F) and restore(C in F) are machine-dependently generated for that specific callsite, they know where each variable is physically located.

Restoration recursively restores activation records until the whole call chain is restored. The restoration process for a single activation record starts by reading a single callsite descriptor string C in F. Next, restore(C in F) is located and invoked. Restore(C in F) first reads the activation record's complete list of tuples {descriptor(L), value(L)}. For each live variable V to restore, it finds value(V) by searching for a matching descriptor(V). Restore(C in F) then converts value(V) and assigns it to the right location in either memory or a register, as the code generated for F requires. After all formal parameters and live variables have been initialized, the activation record information for the next stack frame is read from the checkpoint (file). This continues until all activation records have been restored.
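The tuple stream described above can be illustrated with a small Java sketch. It is not Jackal's actual file format: the class FrameTuples, its method names, the length-prefixed string encoding, the fixed buffer size, and the choice to store every value as a big-endian 64-bit integer are all assumptions made for illustration. The restorer name and descriptor strings are borrowed from the paper's zoo-in-foo example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of one frame's {descriptor, value} tuples.
final class FrameTuples {
    private static void putString(ByteBuffer buf, String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        buf.putInt(b.length);
        buf.put(b);
    }

    private static String getString(ByteBuffer buf) {
        byte[] b = new byte[buf.getInt()];
        buf.get(b);
        return new String(b, StandardCharsets.UTF_8);
    }

    // Checkpoint side: restorer name first, then one tuple per live variable.
    static ByteBuffer checkpointFrame(String restorerName, Map<String, Long> liveVars) {
        ByteBuffer buf = ByteBuffer.allocate(4096); // big-endian by default; fixed size keeps the sketch short
        putString(buf, restorerName);
        buf.putInt(liveVars.size());
        for (Map.Entry<String, Long> e : liveVars.entrySet()) {
            putString(buf, e.getKey());   // usage descriptor string
            buf.putLong(e.getValue());    // value in an architecture-neutral encoding
        }
        buf.flip();
        return buf;
    }

    // Restore side: the name selects the restorer for this frame.
    static String readRestorerName(ByteBuffer buf) {
        return getString(buf);
    }

    // Restore side: read the frame's complete tuple list, keyed by descriptor.
    static Map<String, Long> readTuples(ByteBuffer buf) {
        int n = buf.getInt();
        Map<String, Long> tuples = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            String descriptor = getString(buf);
            tuples.put(descriptor, buf.getLong());
        }
        return tuples;
    }

    public static void main(String[] args) {
        Map<String, Long> live = new LinkedHashMap<>();
        live.put("+P:1,C:2@B:0", 7L); // variable a
        live.put("C:0@B:1", 3L);      // variable x
        ByteBuffer file = checkpointFrame("restore_zoo_in_foo", live);

        String restorer = readRestorerName(file);
        Map<String, Long> tuples = readTuples(file);
        long a = tuples.get("+P:1,C:2@B:0"); // a generated restorer would place this in loc0 or -8(ebp)
        System.out.println(restorer + ": a = " + a);
    }
}
```

The point of the sketch is that the restore side never needs to know where the checkpointing architecture kept a variable: the descriptor string is the only key.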
After restoration, the call stack will look like a series of restore functions, each calling the next. Transfer of control after a stackframe has been restored is implemented by a jump directly to the position in the function that was checkpointed. However, when that function returns, it will return to the generated restore function of the invoking activation record. That restore function will in turn immediately jump into the originally invoking function, and so on.

Pseudocode for restore(C in F) and checkpoint(C in F) is shown in Figure 3 for an IA64 restorer and a matching x86 checkpointer. As can be seen in the example, the checkpointer creates tuples keyed on fixed descriptor strings that the restore function uses to locate the correct value to put in a given physical location. The next section explains how the arguments to the read/write tuple functions are constructed.

2.2 Variable Descriptor Strings

The key idea of our checkpointing algorithm is the concept of the usage descriptor string. A usage descriptor string describes machine-independently, for each variable, how that variable is used inside a procedure. To ease the construction of these strings, they are created after machine-independent optimizations (see Figure 1). The descriptor string is created by traversing the Control Flow Graph (CFG) to find all usages of a variable. When building a descriptor for a variable A we first search for its definitions. Whenever a usage of A is found, one of the rules below is applied:

[Figure 1 boxes: MACHINE INDEPENDENT OPTIMIZATIONS; MACHINE SPECIFIC OPTIMIZATIONS (x86); ASSEMBLE x86 BINARY; MACHINE SPECIFIC OPTIMIZATIONS (IA64); ASSEMBLE IA64 BINARY; CREATE VARIABLE DESCRIPTORS; CREATE METHODS CHECKPOINT(C in F) AND RESTORE(C in F)]

Fig. 1. The compiler's pipeline.

    // Java code                              // basic block
    void foo(int a, Object b, int c) {        // 0
        a = a + 2;                            // 0
        int y = 0;                            // 0
        do {
            int x = 0;                        // 1
            do {
                zoo(a, b, c);                 // 2
                // live variables = {a,b,c,x,y}  2
            } while (x++ < 10);               // 2
        } while (y++ < 10);                   // 3
    }

Fig. 2. Example: Live Variables and Usage Descriptors.

    // x86: a is in memory at -8(ebp)
    checkpoint_zoo_in_foo(stackframe_info f):
        // allow the correct restorer to be found for this frame:
        write_name_string("restore_zoo_in_foo");
        write_tuple("+P:1,C:2@B:0", f->mem("-8(ebp)"));
        // same for b, c, x, y, etc.

    // IA64: a is in register loc0
    restore_zoo_in_foo:
        tuples[] t = read_tuples();
        loc0 = find_tuple(t, "+P:1,C:2@B:0")->value;
        // ... same as above for b, c, x, y, etc.
        call_next_restorer(read_name_string());
        jump to insn after call to zoo in foo;

Fig. 3. Generated checkpoint and restore functions for Figure 2.

1. Upon encountering an assignment A = constant, modify the descriptor string as follows: string(A) = string(A) + C:<constant>
2. Upon encountering A = B, continue the search for usages of variable B.
3. When A is the return value of a call, A = call, modify the string as follows: string(A) = string(A) + call:<index of call in all calls inside the containing function>
4. When A is assigned the value of a formal parameter, A = param(x), append the parameter's index in the parameter list to the string: string(A) = string(A) + P:<index of param(x)>
5. When A is given the value of an object field, A = object access(expression, field), change the descriptor as follows: string(A) = string(A) + access:field, and continue building the descriptor with the variables in expression.
6. When encountering a generic assignment such as A = B op C, where op is one of the binary operators such as +, -, *, /, append <op>, string(B) and string(C) to the string of A.
7. When making a modification to a usage descriptor string using one of the above rules, add a basic block identifier to the string: string(A) = string(A) + @B:<basic block number>

As an example, let us construct the usage descriptor strings for the live variables in the code given in Figure 2. At zoo in foo, the compiler computes the set of live variables (a, b, c, x, y). The analysis pass takes this set and performs a traversal over the control flow graph of the function.
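The rules above can be sketched in Java over a toy expression representation. This is an illustration, not the Jackal compiler's code: the class UsageDescriptors, the Expr interface and the factory method names are hypothetical, and the comma between the operand strings of a binary operation is an assumption inferred from the string +P:1,C:2@B:0 used in the paper. Rule 2 (copies) is modelled by simply reusing the same Expr, rule 5 (field access) is left out, and only a single definition per variable is handled.

```java
// Hypothetical sketch of usage descriptor construction for single definitions.
final class UsageDescriptors {
    // An expression knows how to describe itself machine-independently.
    interface Expr {
        String describe();
    }

    // Rule 1: A = constant contributes C:<constant>.
    static Expr constant(long value) {
        return () -> "C:" + value;
    }

    // Rule 3: A = call contributes call:<index of the call in the function>.
    static Expr call(int indexInFunction) {
        return () -> "call:" + indexInFunction;
    }

    // Rule 4: A = param(x) contributes P:<index of the parameter>.
    static Expr param(int index) {
        return () -> "P:" + index;
    }

    // Rule 6: A = B op C contributes <op>, string(B), string(C).
    static Expr binary(String op, Expr left, Expr right) {
        return () -> op + left.describe() + "," + right.describe();
    }

    // Rule 7: every modification is tagged with the basic block it occurs in.
    static String define(Expr rhs, int basicBlock) {
        return rhs.describe() + "@B:" + basicBlock;
    }

    public static void main(String[] args) {
        // Definitions from foo(int a, Object b, int c) in Figure 2.
        System.out.println("a: " + define(binary("+", param(1), constant(2)), 0));
        System.out.println("b: " + define(param(2), 0));
        System.out.println("c: " + define(param(3), 0));
        System.out.println("x: " + define(constant(0), 1));
        System.out.println("y: " + define(constant(0), 0));
    }
}
```

Because the input to this construction is the machine-independent intermediate code, running it in the x86 and the IA64 compiler yields the same strings, which is what lets them serve as keys.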

During the traversal, the initialization of x is encountered and the usage descriptor string is updated by appending the string C:0 (rule 1). That happens in basic block 1, thus @B:1 is appended to the descriptor string (rule 7). Likewise, the initialization of y in basic block 0 results in the string C:0@B:0. The assignment to a is more complex. Evaluation of a + 2 delivers the string +P:1,C:2@B:0 (rules 6 and 7): the + from the addition operation, the C:2 from the constant, and the P:1 because a is a formal parameter (rule 4). Variables b and c are formal parameters 2 and 3, thus they receive the strings P:2@B:0 and P:3@B:0 respectively.

During code generation and machine-specific optimization these variables are renamed along with the actual variables. During register allocation, for example, replacing register a with memory location b also replaces the a in the live variable set.

2.3 Compressing Variable Descriptor Strings

One potential performance problem with the above descriptor strings is the length of the strings that need to be saved in the checkpoint (file). For a large procedure with many variables, where each of these variables is used many times, large strings may result. This can cause large checkpoint files, or large amounts of data to be sent over the network when migrating a thread. In high performance computing, where thread migration latency matters, this needs to be avoided.

To combat this effect, the compiler can sort the list of the created descriptor strings for a given callsite and assign each resulting string a 16-bit index in the sorted table. The descriptor strings are then replaced by 16-bit table indexes. Most importantly, the table indexes are emitted to the stream instead of the descriptor strings themselves. This transformation is correct as we are replacing unique strings with unique identifiers. Notice that we need the descriptor strings to generate the integers; we cannot directly construct the integers.
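A minimal Java sketch of this compression step follows. The class DescriptorTable and its methods are hypothetical names, and the use of Java's natural String ordering as the sort order is an assumption; any ordering works as long as every target's compiler uses the same one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: replace descriptor strings by indexes into a sorted table.
final class DescriptorTable {
    // Sort the unique descriptor strings of one callsite.
    static List<String> sortedTable(Iterable<String> descriptors) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String d : descriptors) {
            sorted.add(d);
        }
        if (sorted.size() > 0x10000) {
            throw new IllegalArgumentException("more descriptors than a 16-bit index can address");
        }
        return new ArrayList<>(sorted);
    }

    // Map each descriptor to its position in the sorted table.
    static Map<String, Integer> indexes(List<String> table) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < table.size(); i++) {
            index.put(table.get(i), i);
        }
        return index;
    }

    public static void main(String[] args) {
        // Descriptors of the live variables a, b, c, x, y at zoo in foo.
        List<String> descriptors = List.of(
                "+P:1,C:2@B:0", "P:2@B:0", "P:3@B:0", "C:0@B:1", "C:0@B:0");
        List<String> table = sortedTable(descriptors);
        Map<String, Integer> index = indexes(table);
        for (String d : descriptors) {
            // The checkpoint would carry the 16-bit index, not the string.
            System.out.println(d + " -> " + index.get(d));
        }
    }
}
```

Since both compilers start from identical descriptor strings and sort them the same way, the index emitted by the checkpointer on one architecture denotes the same variable in the restorer on the other.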
3 Performance

Two aspects of performance are important for any checkpointing implementation: the size of the resulting checkpoint (be it a file or data sent over a network for thread migration) and the time it takes to perform checkpointing and to restore from a checkpoint. For three small benchmark applications and two architectures we examine the performance of our checkpointer. We did not modify the applications except for inserting a single call to start our checkpointer. Everything else is fully automated.

All performance tests were run on both a 2.4 GHz Pentium IV (Linux) and a 900 MHz Itanium II (Linux). The x86 had a 160 GByte IDE disk with 2 MByte cache and 1 GByte of RAM. The IA64 had a SCSI RAID0 with 73 GByte per disk and 10 GByte of RAM. Table 1 displays the sizes of the generated checkpoint files and how long it took to write them. As the checkpoints are written to disk, each checkpoint includes a copy of the heap reachable from the thread that requested checkpointing. As the size of the checkpoint file is independent of the architecture, one table suffices.

Table 1. Checkpoint file sizes and checkpointing wall time (seconds).

A) Checkpoint sizes.
    Columns: #checkpoints | total size (KBytes) | total size compressed (KBytes)
    Rows: Fib(17), Matrix, Water

B) Checkpointing run times.
    Columns (x86): checkpoint | compressed | no checkpoint | restore
    Columns (IA64): checkpoint | compressed | no checkpoint | restore
    Rows: Fib(17), Matrix, Water

To ensure that the checkpointing implementation works correctly, each application is given the checkpoint file generated on the alternate architecture when performing restoration. Again, note that the IA64 has many more registers than the x86 and has a different pointer size (64 bit vs 32 bit).

    class Fib {
        public long fib(long n) {
            if (n < 2) {
                // checkpoint in every leaf of the recursion
                checkpoint_this_thread();
                return n;
            }
            return fib(n - 1) + fib(n - 2);
        }

        public static void main(String[] args) {
            System.out.println("fib = " + new Fib().fib(17));
        }
    }

Fig. 4. Fibonacci example.

Fibonacci. Fibonacci recursively computes Fibonacci numbers (see Figure 4). A checkpoint is made in each leaf of the recursion to create a large number of checkpoints. On the x86, a single checkpoint takes 0.18 milliseconds (( seconds)/2584 checkpoints) including file I/O. Restoring a checkpoint takes about 0.7 ms for a 5020 byte checkpoint file. Compression (Section 2.3) reduces the size of the checkpoint by about 15% on average. Checkpoint construction time is minimal compared to the file I/O time. As Fibonacci uses only one object on the heap, heap checkpointing times are zero. On the IA64, checkpointing is faster because of better disk performance. Overall, the IA64 is slower because of its lower clock speed (900 MHz vs 2.4 GHz).
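The figure of 2584 checkpoints can be cross-checked by counting the leaf calls of the recursion, since a checkpoint is taken exactly when n < 2. The following Java sketch does that; the class and method names are hypothetical and no checkpointing is involved.

```java
// Hypothetical cross-check: count the calls of fib(17) that reach the n < 2 branch.
final class FibLeaves {
    // Same recursion shape as the benchmark's fib, but returning the number of leaf calls.
    static long countLeafCalls(long n) {
        if (n < 2) {
            return 1; // this is where the benchmark would checkpoint
        }
        return countLeafCalls(n - 1) + countLeafCalls(n - 2);
    }

    public static void main(String[] args) {
        System.out.println(countLeafCalls(17)); // prints 2584
    }
}
```

The leaf count L(n) satisfies L(n) = L(n-1) + L(n-2) with L(0) = L(1) = 1, so L(17) is the 18th Fibonacci number, 2584.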

Matrix. The 2D array benchmark is designed to test how well the checkpointing routines perform when writing large volumes of data. In total, five arrays of arrays (1024x1024) are written to disk for a total of 40+ MBytes. Compression of the usage descriptor strings does not gain much as there are only 12 variables in total over all activation records. Virtually no computation is performed in this benchmark; all time is spent in the runtime system and kernel. Checkpoint and restore are much slower on the IA64 than on the x86 because of slower clock and memory speeds, which affect the speed with which objects can be allocated.

Water. Water performs an N-body simulation of a number of water molecules. A water molecule is coded as a 4D array of doubles holding position, force, velocity and acceleration. Our input data set contains 1728 water molecules. A checkpoint is made after each time step and contains all molecules, a thread object, a force computation object, and some small arrays to maintain global state. There is very little stack to checkpoint, but the effort to enable checkpointing was zero: no code needed to be written to explicitly write the molecules to the checkpoint file. Two checkpoint files of 4.2 MBytes each are created. Compression of the usage descriptor strings does little to reduce checkpoint file sizes, as the number of live variables when checkpointing is small. On both machines, checkpointing costs are less than 1% of the total runtime.

4 Related Work

There are many packages and systems that offer checkpointing services to applications [1, 2, 5]. Most packages do not support heterogeneity at all. Heterogeneity in this context means that a checkpoint created on one architecture can be restored on another architecture. In general, packages that do support heterogeneity have high runtime overheads due to code instrumentation, or employ very complex and error-prone implementation techniques.
Related work can be roughly divided into two classes: approaches that use a compiler and approaches that implement their checkpointing algorithm inside a library or the OS. Although library/OS checkpointing packages are simple to implement, they do not support heterogeneity. Library approaches include, for example, Condor [3] and libckpt [4].

Bouchenak et al. [2] created a system for Java thread serialization based on decompilation for Java JITs. JIT-generated code is decompiled to the same format that the interpreter would use for its stackframes. This ensures that the checkpointer only has to deal with the Java operand stack in interpreted form. However, with increasing levels of optimization the process of decompilation will become increasingly difficult.

Porch [1] is a project to create a small preprocessor for C programs to allow heterogeneous checkpointing. The preprocessor instruments each function of a program with some extra code to test whether that function is to perform restoration, checkpointing, or to execute the actual code of that function as normal. However, the code introduced to perform the checkpointing/restoration causes substantial overhead during normal program execution.

PREACHES [6] offers heterogeneous checkpointing. Instead of creating a single generic checkpoint suitable for all architectures, PREACHES creates checkpoints suitable for each architecture that the user might wish to restore the checkpoint on. The downside is that at all times during a program's execution a machine of each different architecture needs to be available to perform the conversion of stackframes to each architecture.

5 Conclusions

We have described an algorithm where a compiler creates two extra functions for each callsite to save and restore the state of the activation record at that point. When not performing checkpointing, our algorithm has no overhead. When checkpointing is enabled, our algorithm generates portable checkpoint files: a checkpoint that is created on one architecture can be restored on another by using activation record variable descriptors. The variable usage descriptors are portable entities as they describe how a variable is used rather than describing its location. The checkpoint file sizes are moderate and can be decreased further by our proposed simple compression scheme.

References

1. B. Ramkumar and V. Strumpen. Portable checkpointing for heterogeneous architectures. In 27th International Symposium on Fault-Tolerant Computing - Digest of Papers, pages 58-67.
2. S. Bouchenak, D. Hagimont, and N. De Palma. Techniques for Implementing Efficient Java Thread Serialization. In ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 03), July.
3. M. Litzkow and M. Solomon. Supporting Checkpointing and Process Migration Outside the UNIX Kernel. In Usenix Conference Proceedings, January.
4. J.S. Plank, M. Beck, G. Kingsley, and K. Li. libckpt: Transparent checkpointing under Unix. Technical Report UT-CS.
5. P. Smith and N.C. Hutchinson. Heterogeneous process migration: The Tui system. Software Practice and Experience, 28(6).
6. Kuo-Feng Ssu and W. Kent Fuchs. PREACHES - portable recovery and checkpointing in heterogeneous systems. In Symposium on Fault-Tolerant Computing, pages 38-47.
7. R. Veldema, R. F. H. Hofman, R. A. F. Bhoedjang, and H. E. Bal. Runtime optimizations for a Java DSM implementation. In 2001 joint ACM-ISCOPE Conference on Java Grande, Palo Alto, CA, 2001.
