Dynamic Function Splicing in Jackal


Dynamic Function Splicing in Jackal

Diploma thesis in Computer Science
submitted by Michael Klemm, born in Nürnberg

prepared at the Institut für Informatik, Lehrstuhl für Informatik 2 (Programmiersysteme), Friedrich-Alexander-Universität Erlangen-Nürnberg (Prof. Dr. M. Philippsen)

Advisor: Ronald Veldema
Start of work:
Submission of work:


I affirm that I have written this thesis without outside help and without using sources other than those indicated, and that the thesis has not previously been submitted in the same or a similar form to any other examination authority and accepted by it as part of an examination. All passages taken over literally or in substance are marked as such. The Universität Erlangen-Nürnberg, represented by Informatik 2 (Programmiersysteme), is granted a simple, free right of use, unlimited in time and place, to the results of this diploma thesis, including any industrial property rights and copyrights, for purposes of research and teaching.

Erlangen,
Michael Klemm


Diploma Thesis

Topic: Dynamic Function Splicing in Jackal.

Background: In Jackal, whenever there is a sequence of N failing access checks where all the faulting objects are located on the same machine (different from the faulting machine), Jackal currently sends 2*N messages. One solution is to perform aggressive object migration (as Jackal already does); another is to perform thread migration in the form of function splicing for the code sequence containing the failing access checks. This thesis attempts to implement the latter solution through a combination of compile-time and runtime analysis.

Task: At compile time, opportunities are identified; at runtime, they are exploited if runtime profiling information indicates their usefulness. More specifically, the compiler identifies sequences of access checks and determines where the thread of control should be transferred to another machine. Example:

void foo(Data p, Data q) {
    print p.value;
    print q.value;
}

should be changed to:

void splice(Data p, Data q) {
    print p.value;
    print q.value;
}

void foo(Data p, Data q) {
    rpc(splice, p, q);
}

A problem here is that it may not be known at compile time where objects p and q are located. Thus there is a need for runtime information and possibly transformation. Furthermore, a form of call graph analysis and control flow analysis is needed to determine where to insert the code that transfers control to another machine if access check X is problematic. An example of the problem where analysis is needed:

int sum(LinkedList node) {
    if (node == null)
        return 0;
    return node.value + sum(node.next);
}

The analysis should not splice away the access to node.next but should instead identify where sum is called and rpc the call site to the home of the list.

Possible enhancements:
- piggybacking of data accessed in the spliced function,
- reduction of flushes,
- performing the optimization only if the amount of already cached data is small (the current thread needs to flush data to make already modified data visible in the spliced function),
- code rearrangement to ensure that faults to the same machine occur sequentially: fault to machine 0, fault to machine 1, fault to machine 0, could be changed to: fault to machine 1, fault to machine 0, fault to machine 0, allowing the last two faults to be handled unambiguously.

Milestones: getting to know Jackal's compiler (compiler structure, layout, etc.); function splicing without runtime support; runtime support added, no heuristics; heuristics added; performance evaluation; writing up (LaTeX).

Literature: Java Language Specification

Advisor: Ronald Veldema, veldema@informatik.uni-erlangen.de
Author: Michael Klemm
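The intended call-site transformation for the sum example above can be sketched in plain Java. This is a hypothetical illustration only: the names Node, homeNodeOf and rpcSum are stand-ins for runtime-system services, and Jackal's compiler performs the actual rewriting on LASM intermediate code, not on Java source.

```java
// Hypothetical sketch: instead of splicing inside sum (which would cut the
// recursion at node.next), the call site of sum is shipped to the home node
// of the list, so the whole traversal runs where the data lives.
class Node {
    int value;
    Node next;
    Node(int value, Node next) { this.value = value; this.next = next; }
}

class SumSplice {
    // Original recursive traversal; in Jackal every node.value / node.next
    // access would carry an access check and possibly fault.
    static int sum(Node node) {
        if (node == null) return 0;
        return node.value + sum(node.next);
    }

    // Transformed call site: if the list lives on another node, execute sum
    // remotely via a single RPC instead of faulting on every element.
    static int sumCallSite(Node head, int localNode) {
        int home = homeNodeOf(head);
        if (home != localNode) {
            return rpcSum(home, head);   // one request/reply message pair
        }
        return sum(head);                // ordinary local call
    }

    // Stand-ins so the sketch is self-contained; the real decision would use
    // Jackal's runtime information about object placement.
    static int homeNodeOf(Node head) { return 0; }
    static int rpcSum(int home, Node head) { return sum(head); }
}
```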

Abstract

In this thesis we present a new static compiler optimization technique called Dynamic Function Splicing (DFS). DFS extends the existing compiler and runtime framework of the Jackal project. The optimization goal is to minimize the number of object request messages sent between the nodes of a cluster running a Jackal program. Dynamic function splicing first identifies arbitrary sequences of Java code in a Java method that will perform a number of object requests. Such a code sequence is then moved to a new function, a so-called function splice. A function splice can be called either locally or remotely through specialized stubs and skeletons that are created automatically at compile time. Based on runtime heuristics, we replace a local call to a function splice by a remote call which executes the function splice on the home node of the objects accessed inside the splice. This replacement is made if a remote call is advantageous over a local call. In turn, remote splice calls may be replaced by local splice calls if a local call is the better choice. Dynamically switching between local and remote invocations of a function splice implies rewriting the executable to obtain better runtime behavior.

We evaluated dynamic function splicing using three example programs. First, a linked list, where DFS increases performance in an optimal way: running the optimized version, we save up to 99 % of the messages sent by the unoptimized version, and a maximal gain of 99 % is achieved in execution time. Second, a gene splicing example, where no overall performance gain is achieved due to its object access behavior; turning off object replication for this example leads to a gain of 61 % up to 95 % in execution time, and 48 % up to 87 % of the messages are saved. Our third example, Hamming, shows a visible performance gain due to an access behavior comparable to the linked list example: the gain achieved by applying DFS is between 84 % and 98 % in execution time, and between 50 % and 95 % of the messages are saved.


Contents

1 Introduction
  1.1 Jackal: A Distributed Shared Memory system
  1.2 Dynamic Function Splicing
  1.3 Contents
2 The Jackal Compiler and Runtime System
  2.1 Terminology
  2.2 Jackal's Compiler
    2.2.1 The Compiler Back-end
    2.2.2 Access Checks
    2.2.3 Extending the Compiler's Back-end
  2.3 Jackal's Runtime System
    2.3.1 Architecture
    2.3.2 Accessing Objects
  2.4 Optimizations at Runtime
  2.5 Summary
3 Compiler Support for Dynamic Function Splicing
  3.1 Static Program Analysis
    3.1.1 Sets of Basic Blocks
    3.1.2 Finding Natural Loops
  3.2 Generating Function Splices out of Basic Blocks
    3.2.1 Saving Live Variables
    3.2.2 Saving Live Registers
  3.3 Generating Function Splices out of Loops and Methods
    3.3.1 Handling Break Statements
    3.3.2 Handling Return Statements
    3.3.3 Handling Throw Statements

  3.4 Problems
    3.4.1 Handling of Generic Pointers
    3.4.2 Helper Variables
  3.5 Helper Functions
    3.5.1 Remote Splice Call Stub
    3.5.2 Remote Splice Call Skeleton
    3.5.3 Remote Splice Registrars
    3.5.4 Table of Function Splices
  3.6 Further Optimizations
  3.7 Summary
4 Runtime Support for Dynamic Function Splicing
  4.1 Dynamic Function Splicing
    4.1.1 Replacement of Call Targets
    4.1.2 Finding Function Splices Using Failed Access Checks
  4.2 Program Startup
  4.3 Remote Splice Call Protocol
  4.4 Heuristics
    4.4.1 Always and Never
    4.4.2 Coin Tossing
    4.4.3 Adaptive Coin Tossing
    4.4.4 Feedback Analysis
  4.5 Summary
5 Performance Evaluation
  5.1 Example Programs
    5.1.1 A Micro Benchmark: Linked List
    5.1.2 Genes
    5.1.3 Hamming
  5.2 Evaluation Framework
  5.3 Performance Evaluation
    5.3.1 Linked List
    5.3.2 Genes
    5.3.3 Hamming
  5.4 Summary

6 Related and Future Work
  6.1 Related Work
  6.2 Future Work
7 Summary


List of Figures

2.1 A Java source code example: Translation of object access to access check
2.2 Compilation pipeline of the Jackal compiler
2.3 Low-level LASM code of the example in figure 2.1(b)
2.4 High-level view of Jackal's runtime system
2.5 The different layers of a Jackal program
3.1 Transformation of function f into a new function calling s_f
3.2 Generation of a function splice for function f modifying variable x
3.3 Example of a subgraph with two edges exiting the subgraph
3.4 Termination of function splices by a return statement
4.1 Sequence diagram of performing a remote splice call
4.2 Graph of the functions in equations 4.1 and 4.2
4.3 Compilation using feedback analysis


List of Tables

1.1 Distribution of different computer families in TOP500 list (June 2002)
1.2 Distribution of different computer families in TOP500 list (June 2003)
4.1 Format of a remote splice call protocol packet
4.2 Remote splice call protocol message types
4.3 Feedback information gathered for each LASM function
4.4 Feedback information maintained for each access check
4.5 Symbols used in equations 4.3 to
5.1 Test setup data
5.2 Number of messages sent and execution times of Linked List (version A)
5.3 Number of messages sent and execution times of Linked List (version B)
5.4 Number of messages sent and execution times of Genes
5.5 Number of messages sent and execution times of Genes (object replication and migration turned off)
5.6 Number of messages sent and execution times of Hamming


List of Algorithms

3.1 The DFS compiler pass generating function splices
3.2 Searching for basic blocks containing access checks
3.3 Finding a natural loop using back edge information
3.4 Rewriting jump instructions exiting a subgraph of a CFG
3.5 Tasks of a remote splice call stub for a given function splice
3.6 Algorithm to find the destination node to send an RSC to by counting
3.7 Tasks of a remote splice call skeleton for a given function splice
4.1 Rewriting call instructions on an IA32 machine
4.2 Looking up entries in the function splice table by program counter


1 Introduction

In nova fert animus mutatas dicere formas
corpora; di, coeptis (nam vos mutastis et illas)
adspirate meis primaque ab origine mundi
ad mea perpetuum deducite tempora carmen.
(Ovid)

Clusters of workstations are an emerging technology in High Performance Computing (HPC). Every six months a list of the 500 fastest computer installations is published on the World Wide Web 1. Six of the ten fastest supercomputing sites in this list are built using clusters of workstations (see tables 1.1 and 1.2 for statistics over the whole TOP500 list [15]). As the tables show, the total number of clusters increased by a factor of nearly two over the last year. This is why programming techniques for such computer architectures have become an important issue in computer science. New techniques have to be applied to make programming such machines easier and to make them more manageable.

While programming single computers (such as uni-processors or Symmetric Multi-Processors, SMP) is well understood, programming large clusters is still a complex task. For example, a programmer has to cope with memory boundaries between different machines. The programmer is required to explicitly define a data distribution scheme to share data over all computing nodes. Distributing data also implies communication using message passing techniques. Libraries such as MPI or PVM ([14, 12]) help to hide some communication details, but still require the programmer to explicitly invoke library functions to control sending and receiving messages [5].

Table 1.1: Distribution of different computer families in TOP500 list (June 2002, [15]).

Table 1.2: Distribution of different computer families in TOP500 list (June 2003, [15]).

A problem inherent to programming clusters is the difficulty of load balancing. Bad data distribution schemes lead to load imbalances on clustered systems: some nodes idle while waiting for another node to finish its computation. The programmer is responsible for finding a trade-off between data locality and load balance, while in most cases data locality conflicts with load balancing. These are just two problems of many. A general discussion of the programming techniques applicable to cluster computing can be found in [5, 6, 16].

1.1 Jackal: A Distributed Shared Memory system

One possible solution to the problems inherent in cluster computing is to create a shared memory machine. A Distributed Shared Memory (DSM) machine simulates a global address space mapped onto the distributed memory of the cluster nodes. Using a DSM, a cluster can be regarded as one big computer with no memory boundaries [6]. DSM machines can be implemented in two ways. First, one type of DSM uses special hardware to create the illusion of a global address space; this type is called Hardware-based DSM (H-DSM). To transfer data from one machine to another, H-DSMs use cache lines or pages as the basis of communication. H-DSMs need special logic implemented in hardware, which makes them expensive. Second, DSMs may be implemented in software (a so-called Software-based DSM, S-DSM). S-DSMs use a software layer to hide memory boundaries from the running applications.
Mostly, S-DSMs are either based on pages (using the memory management unit of a CPU) or employ a smaller granularity such as single variables or objects. Like other S-DSM projects, e.g. TreadMarks ([3]) or CRL ([10]), the Jackal project ([24]) addresses most of the problems emerging in cluster computing. Jackal consists of a native compiler translating Java source code to assembly code for a set of architectures (e.g. Intel IA32 and IA64, SUN UltraSparc) and a supporting Runtime System (RTS), which together simulate the view of an SMP machine.

When running a multi-threaded program compiled with the Jackal compiler, the runtime system automatically distributes the threads onto different nodes of the cluster. Java objects are allocated on the node that executes the new expression used to create the object. Before accessing an object, the generated code automatically tests whether the object is locally available (a so-called access check). If the object is not stored on the local node, the remote node owning the data is contacted by the runtime system to send the object to the requesting node. The same holds for accesses to Java arrays; the only difference is that arrays are divided into equal-sized chunks, and a chunk is fetched when an element of that chunk is accessed. This avoids transferring the whole array.

1.2 Dynamic Function Splicing

Instead of accessing a remote object by creating a copy on the local node (data shipping), it is also possible to transfer the flow of control to the remote node (function shipping). Instead of sending a large number of messages over (low-bandwidth and high-latency) network links, we generate specialized Remote Procedure Calls (RPC). This avoids transferring many objects and array chunks to the local node. When traversing arrays or object graphs, the runtime sends two messages for each object or array chunk accessed: if n objects are referenced, the runtime system sends 2n messages. Generating a special remote procedure call for this type of traversal results in just two messages, one for the RPC request and one for the reply.

RPC support is divided into two parts. First, the runtime system decides whether or not to perform a remote call, because the object distribution in the cluster is known only at runtime. If an RPC is advantageous, the runtime system determines the correct node to send the RPC messages to.
If an RPC is not advantageous, the call is executed locally to preserve the correct semantics of the program. Second, the Jackal compiler is extended to generate all functions necessary to execute such a remote call. Given a Java method f containing accesses to (potentially remotely allocated) objects, the compiler generates a set of function splices. To create a function splice, the compiler copies a sequence of instructions located in method f to a new function s_f and replaces this instruction sequence in f by a call to s_f. Except for the additional call to the generated function splice, the behavior and computational result of f are not changed by this transformation. On execution of s_f, the runtime system monitors all access checks performed by the function splice to decide whether to keep invoking Local Splice Calls (LSC) or to replace the LSC by a Remote Splice Call (RSC). The runtime system may also revert such a change by replacing an RSC by the corresponding LSC. We call this technique Dynamic Function Splicing (DFS), because the runtime system dynamically

modifies the executable of the running application to perform either local splice calls or remote splice calls.

1.3 Contents

In chapter 2 we describe the architecture and features of Jackal's compiler and its runtime system in detail. We show how the existing framework operates when compiling and running an application. Chapter 3 shows how the existing Jackal compiler framework is extended to create function splices and the support functions needed to perform LSCs and RSCs. Chapter 4 describes the modifications made to the Jackal runtime system to support dynamic function splicing at runtime; this chapter also discusses heuristics to decide whether a local splice call or a remote splice call should be executed. In chapter 5 we show the performance gain achieved by applying dynamic function splicing to a set of example programs. Chapter 6 gives a short overview of other projects addressing distributed computing and the simulation of shared memory machines. Furthermore, that chapter describes some additional features which could extend dynamic function splicing in a future version.
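The 2n-versus-2 message count argument of section 1.2 can be stated as a tiny model. The class and method names below are illustrative only and not part of Jackal:

```java
// Back-of-the-envelope model of the message counts in section 1.2: data
// shipping needs a request/reply pair per remotely accessed object, while
// function shipping needs a single pair for the whole spliced call.
class MessageModel {
    static long dataShippingMessages(long remoteObjects) {
        return 2 * remoteObjects;   // one request + one reply per object
    }

    static long functionShippingMessages() {
        return 2;                   // one RSC request + one reply, total
    }
}
```

For a traversal of 1000 remotely allocated list nodes, the model yields 2000 messages for data shipping versus 2 for a single remote splice call.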

2 The Jackal Compiler and Runtime System

Every accomplishment starts with the decision to try. (Anonymous)

The Jackal compiler and runtime system together simulate a shared memory for Jackal programs. By a Jackal program we denote a program that is compiled by the Jackal compiler and managed by the Jackal runtime system. In order to provide a global address space, the compiler inserts so-called access checks in front of each object access while translating a sequence of Java code to machine code. Figure 2.1(a) shows an example of a Java method containing an object access, and figure 2.1(b) shows the corresponding intermediate code. For an introduction to this type of intermediate code, called LASM, see [23]. In figure 2.1(a) a Java method named func() accesses an object field to assign it a new value (line 8). Because the object referenced by mydata might be allocated on another node, the compiler needs to insert an access check to ensure that the object is either fetched and cached locally by the runtime system or already allocated on the local node. Figure 2.1(b) shows the access check inserted by the compiler (line 7).

In the next sections we describe the basic operation of the Jackal compiler and runtime system. Section 2.2 addresses Jackal's compiler; Jackal's runtime system is described in section 2.3. In both sections we give a short introduction to the architecture and operational features of the Jackal framework. In this description we will only refer to Java examples and Java language constructs.

2.1 Terminology

Before we start to address Jackal's compiler and runtime system, we define some terms needed to understand compiler technology. In this section we give informal definitions of basic terms used in subsequent chapters. For formalized definitions of these terms see [1, 2].

class Data {
    int field;
}

class MyThread extends Thread {
    Data mydata = new Data();

    void func() {
        mydata.field = 42;
    }

    void run() {
        func();
    }
}

(a) Method func() accesses an object field.

%g23065 = param(g, g, this+0+0, signed, no helper)
access check[%g23065, read, object, complete object, check id: 0]
%g23066 = (%g23065, g).FieldAccess.mydata
%g23067 = &(%g23066, i).Data.field
%i23069 = param(i, i, v+0+0, signed, no helper)
access check[%g23066, write, object, complete object, check id: 1]
(i, %g23067, type id: Data) = %i23069

(b) LASM representation of method func().

Figure 2.1: A Java source code example: Translation of object access to access check.

Basic Block: A sequence of (intermediate code) instructions is called a basic block if control flow enters the sequence at its beginning and leaves at its end. There are no other exits caused by jump instructions, and no instruction inside the basic block is reachable by a label. Thus, once a basic block is entered, it is exited only after executing its last instruction.

Control Flow Graph: A connected, directed graph of basic blocks is called a Control Flow Graph (CFG). The CFG shows all possible control flow paths through the basic blocks of a given (intermediate) program. Two basic blocks B1 and B2 in this graph are connected if and only if there is a jump instruction at the end of B1 targeting the first instruction of basic block B2, or if the control flow exiting B1 directly enters B2.

Live Variable Analysis: Using the CFG, a compiler is able to determine the set of live variables for each instruction with respect to the control flow of the program. A variable v is called live at instruction i if it is defined by an instruction before i and used at or after i. A variable is defined by assigning it a new value; a variable is used if its value is read by an instruction.

2.2 Jackal's Compiler

Jackal's compiler is a modularized, optimizing multi-pass compiler. It supports compiling Java, C and Pascal source code for various architectures (e.g. IA32, IA64, SUN UltraSparc III). Figure 2.2 shows how a Java source file is passed through the compiler front-end and translated into LASM intermediate code, which is then processed by the compiler's back-end. The intermediate code is passed to different modules for optimization.
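To make the live-variable definition of section 2.1 concrete, consider a short Java fragment. This is a hypothetical example, not taken from Jackal:

```java
// Illustration of liveness: at the marked point, x is live (its value is
// still needed by the return), while t is dead (its last use has passed).
class LivenessExample {
    static int f(int a) {
        int t = a * 2;     // t is defined here
        int x = t + 1;     // t is used (last use); x is defined
        // <-- at this point: x is live, t is dead
        return x * x;      // x is used
    }
}
```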
After all passes modifying LASM code have finished, the intermediate code is further lowered to machine-level instructions and passed to an external assembler and linker to create the final executable.

The Compiler Back-end

As stated before, the compiler back-end uses LASM as its representation of intermediate code. As explained in [23], LASM is a register-based intermediate language supporting both high-level and low-level instructions. The compiler front-end translates the input source to high-level LASM code, where loops and if statements are explicitly visible. The high-level intermediate code is then converted to low-level LASM code.

Figure 2.2: Compilation pipeline of the Jackal compiler. (The front-end reads Java source code, or Java byte code produced by javac, checks its validity and creates LASM; optimizer passes 1 to N transform the LASM intermediate code; the back-end lowers LASM, performs register allocation and scheduling; an assembler and linker create the executable.)

The existing compiler back-end supports a wide range of optimization techniques, among them language-independent optimizations such as Loop Unrolling, Invariant Code Motion and others. See [23] for a complete list of the optimization techniques implemented in the Jackal compiler.

%g23065 = (g, (%ebp + 8))
%g22280 = ((%g23065) >> 6)
%g23603 = fs::4
jc ((((g, (%g23603 + ((%g22280 >> 5) <<< 2))) & (1 <<< (%g22280 & 31))) != 0), .L10    ; access check
call(shm_start_read_object, %g23065)
.L10:
%g23066 = (g, (%g23065), type id: class FieldAccess)
%g23067 = (%g23066)
%i23069 = (i, (%ebp + 12))
%g22317 = ((%g23066) >> 6)
%g23639 = fs::0
jc ((((g, (%g23639 + ((%g22317 >> 5) <<< 2))) & (1 <<< (%g22317 & 31))) != 0), .L12    ; access check
call(shm_start_write_object, %g23066)
.L12:
(i, %g23067, type id: Data) = %i23069

Figure 2.3: Low-level LASM code of the example in figure 2.1(b).

Besides machine- and language-independent optimizations, the back-end also supports optimization techniques specific to Java. Using Escape Analysis, the compiler is able to allocate some objects on the stack instead of allocating them on the heap; garbage collection can then be avoided because such objects are destroyed automatically when the stack frame of the allocating function is destroyed. The compiler's optimizer is also able to remove bounds checks on Java arrays if it can prove that accesses using a given array index will always be safe and never reference elements outside the array. Furthermore, unnecessary access checks can be removed; this optimization is possible if the objects to be tested are certain to be already cached on the local node.

Access Checks

For every object access in a Java method, an access check is inserted by the compiler's front-end (see figures 2.1(a) and 2.1(b)). After converting line 7 in figure 2.1(b) to low-level LASM code, the access check consists of two parts: (1) A conditional jump instruction that determines whether the requested object is already cached locally.
(2) A call to a support function of the runtime system that contacts the home node of the object to request that a copy of the object be transferred to the local node.
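The two-part structure of an access check can be modeled in a few lines of Java. This is a hedged sketch, not Jackal's implementation: objects are identified by a plain integer, the bitmaps follow the per-thread read-write bitmaps described later in section 2.3, and requestReadCopy/requestWriteCopy are invented stand-ins for the runtime's support functions.

```java
import java.util.BitSet;

// Minimal model of an access check: a cheap bitmap test (part 1) guards an
// expensive call into the runtime system (part 2) that fetches the object.
class AccessCheckModel {
    final BitSet readBits = new BitSet();   // per-thread read bitmap
    final BitSet writeBits = new BitSet();  // per-thread write bitmap
    int requestsSent = 0;                   // counts the expensive RTS calls

    void read(int objectId) {
        if (!readBits.get(objectId)) {      // part (1): conditional test
            requestReadCopy(objectId);      // part (2): call into the RTS
        }
        // ... the actual field read would happen here ...
    }

    void write(int objectId) {
        if (!writeBits.get(objectId)) {
            requestWriteCopy(objectId);
        }
        // ... the actual field write would happen here ...
    }

    void requestReadCopy(int objectId) {    // fetch object, then set the bit
        requestsSent++;
        readBits.set(objectId);
    }

    void requestWriteCopy(int objectId) {   // a writable copy is also readable
        requestsSent++;
        readBits.set(objectId);
        writeBits.set(objectId);
    }
}
```

Once the bit is set, repeated accesses to the same object skip the runtime call entirely, which is why the fast path of an access check is cheap.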

Figure 2.3 shows the low-level LASM code of figure 2.1(b) for an x86 machine. Lines 15 and 16 show how an access check is transformed to low-level LASM code. As noted, the conditional jump expression in line 15 checks whether the object is already cached on this node. If the check is successful, it jumps to a label right after the call. Otherwise, a support function in the runtime system is called to request the object from its current home node (line 16). Section 2.3.2 will show how such object request calls are processed by the runtime system.

Extending the Compiler's Back-end

The compiler's back-end is easily extensible; modularization was one of its main design goals [23]. Each of the back-end's optimizers or transformation passes is implemented as one module, so writing a new optimizer pass amounts to implementing a new module and announcing it to the back-end. All passes available to the back-end are listed in an array called passes containing instances of class LasmPass, each describing a single pass by maintaining its name and a global function to be called on invocation of the pass. Thus, a new pass is announced to the back-end by inserting an instance of class LasmPass describing the new pass into the passes array. The global function called on invocation of the new pass then executes the actions specific to that pass. All internal data needed by the back-end is stored in a global instance of LasmState, which can be queried and modified by each pass. For example, the list of all LASM functions can be obtained by calling LasmState::functions(); a compiler pass can then perform its task-specific actions on each of these functions.

2.3 Jackal's Runtime System

In addition to the compiler, Jackal's runtime system plays an important role in simulating an S-DSM machine.
While the compiler prepares the translated source code to run on the simulated S-DSM machine, the runtime system provides the basic functionality needed to simulate the S-DSM by supplying the support functions called by the compiled program.

Architecture

When a Jackal program is run, it is loaded on every node of the cluster. After loading the program, the runtime system first initializes the memory and communication modules. After this stage, the program is started by calling the main method on one node of the cluster. This program startup equals the normal startup

Figure 2.4: High-level view of Jackal's runtime system.

sequence of a single-node application (such as a C/C++ application), except for the special initialization of the communication module and the DSM protocol.

Figure 2.4 shows the memory layout of some instances of a running Jackal program. The memory accessible by a node is divided into three parts [23]. The object heap is used to allocate objects that are constructed by new expressions inside the program on that node: if a node executes such a new expression, the runtime system allocates the new object on the local object heap. Objects that have failed an access check are requested by the local node to be fetched from their current home node; on requesting these objects, space is allocated inside the caching heap to store the local copy of the object. The administrative heap is used to store data needed by the runtime system to manage object accessibility.

The RTS stores information about object accessibility in the form of so-called read-write bitmaps: for each thread, a bitmap is maintained for each object or array chunk. The read bit in the bitmap is set if the node holds a read-only copy of the object, while the write bit states whether the local node has a writable copy of the object. In section 2.3.2 we describe how an object is transferred from its home node to the local node to be cached there.

Complementary to figure 2.4, figure 2.5 shows a compiled and linked Jackal program that utilizes the Jackal runtime system. The whole program is divided into three different layers. The lowest level of abstraction (concerning an S-DSM) is provided by the native Operating System of the nodes of the cluster running a Jackal program. Basic services (such as TCP/IP, I/O) are supplied by system calls to the kernel of the OS.

Figure 2.5: The different layers of a Jackal program.

On the next layer, networking and multi-threading are provided by internal modules of the runtime system. Using this type of abstraction, Jackal is able to provide DSM services on a wide range of networks, currently supporting Myrinet, TCP/IP and MPI. The networking layer provides a packet-oriented interface that is utilized by the RTS. The same holds for multi-threading support: an internal interface is used to hide the calls to a native thread library such as POSIX Threads.

The Jackal RTS resides on top of the OS, networking and multi-threading modules. The task of the protocol management in the Jackal RTS is to supply all features needed to create the illusion of a global address space inside the cluster. In addition to requesting and caching remotely allocated objects, Jackal's RTS manages thread creation and performs Garbage Collection (GC). On creation of a new thread, the RTS chooses a node inside the cluster and creates the thread on that machine. The runtime system therefore provides strategies that determine the node that should execute a thread. Jackal's garbage collector operates on two levels. On the first level, only the memory of the local node is checked to see if objects can be freed; local garbage collection is done when the object heap or the caching heap runs out of free space. If the local garbage collector fails to free enough memory, the runtime system initiates a global GC on all nodes of the DSM machine.

Accessing Objects

As section 2.2.2 has shown, Jackal's compiler inserts access checks before an object access takes place. In this section we further investigate the execution of a given access check. A detailed description of accessing both Java objects and arrays can be found in [23].

Using the per-object, per-thread read-write bitmaps, an access check tests whether the object is cached on the local node. That is, for a read access, the access check extracts the read bit corresponding to the object from the thread's read bitmap. If the bit is not set, the object is not cached on the local node in the right mode; thus, the object needs to be requested from its home node and cached locally. Write accesses to objects are processed in the same manner, except that the write bitmap is queried for the status bit.

If a tested object has to be requested from its home node, the access check invokes a special protocol function (shm_read_object for read accesses and shm_write_object for write accesses) provided by the Jackal S-DSM runtime. The support function determines the current home node of the requested object and sends a request message to that node. The home node responds by sending a reply message containing the requested object. On receipt of the reply message, the requesting node maps the received object into its address space and updates the read-write bitmaps accordingly.

Accesses to objects that are not locally cached are therefore expensive operations: they involve sending two messages, one for the request and one for the reply. Also, mapping the object into the local address space of the requester is a potentially costly operation, since a (global) garbage collection phase may have to be initiated by the requesting node to free enough memory to allow Jackal's RTS to cache the transferred object.

2.4 Optimizations at Runtime

In addition to the Jackal compiler, the runtime system is also able to perform optimizations while running a Jackal program [23]. If some of the running threads continuously perform read-only accesses to a set of objects or array chunks, the runtime system is able to perform object replication on these objects.
Hence, each thread receives a read-only copy of the object to avoid further remote accesses to these objects. A thread holding a read-only copy is no longer forced to flush the object to its home node as long as the object maintains its read-only state. Another optimization performed by the Jackal RTS is home-node migration. The home node of an object maintains all global meta-information of that object (e.g. object locking information or the readers and writers of the object). The home node of an object is migrated to another machine if that machine is the only node that has write access to the object.
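To make the cost of a failing access check concrete, the bitmap test and the two-message miss path described in this chapter can be condensed into a small model. This is hypothetical Java, not Jackal's generated code; only the bitmap layout and the message count are taken from the text.

```java
import java.util.BitSet;

// Hypothetical model of per-thread access checks (not Jackal's actual code).
class AccessCheckModel {
    private final BitSet readBitmap = new BitSet();  // read bits, indexed by object id
    private final BitSet writeBitmap = new BitSet(); // write bits, indexed by object id
    int messagesSent = 0;                            // request + reply messages

    // Read access check: on a miss, request the object from its home node.
    void checkRead(int objectId) {
        if (!readBitmap.get(objectId)) {   // not cached locally in read mode
            messagesSent += 2;             // one request message, one reply message
            readBitmap.set(objectId);      // object is now mapped and cached
        }
    }

    // Write access check: same shape, but queries the write bitmap.
    void checkWrite(int objectId) {
        if (!writeBitmap.get(objectId)) {
            messagesSent += 2;
            writeBitmap.set(objectId);
        }
    }
}
```

In this model a sequence of N failing checks on objects of a single remote node costs 2*N messages, which is exactly the overhead that dynamic function splicing attacks.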

2.5 Summary

In this chapter we have described the basic features of the Jackal compiler and its runtime system. Together they simulate the view of a shared-memory system by implementing an S-DSM machine. While Jackal's compiler translates a source program, it inserts access checks in front of each object access to test whether the object is already locally cached. If the object is not cached on the local node, the Jackal RTS is invoked to request the object from its home node and to map it into the address space of the local node. This causes expensive calls to the communication module to send the request and reply messages.

3 Compiler Support for Dynamic Function Splicing

Everything should be made as simple as possible, but not simpler. (A. Einstein)

To be able to transfer parts of a Java method to a remote node, the program has to be transformed accordingly. As chapter 1 introduced, parts of the instruction sequence of a method are moved to newly generated functions, called function splices. Function splices can either be called locally (local splice call) or on some other node of the cluster (remote splice call). This chapter deals with the creation of function splices out of an existing sequence of LASM intermediate code.

The transformation of a Java method into a method that calls several function splices is performed in two phases. In the first DFS phase (section 3.1), the whole instruction sequence of each function is analyzed. This pass is used to identify possible sequence candidates to be moved to a function splice. After the analysis phase, the code is transformed by the second DFS pass (section 3.2).

The complete DFS pass is described by algorithm 3.1. Firstly, the compiler pass filters out functions that cannot be processed. If a function is already a splice of another function it is not processed further. If this were allowed, the DFS pass would split the function again, potentially generating too much call overhead. Secondly, methods containing try-catch blocks are also ignored. This is due to the complexity inherent to the translation of try-catch blocks down to LASM intermediate code. Finally, so-called registrar functions are also not processed by the DFS pass. Registrars are special functions that are called at the start-up of a Jackal program to register tables and other data from the binary file with the runtime system.

If a function is suitable for dynamic function splicing, the compiler performs a static program analysis pass to find candidates for DFS (see lines 19 to 21 in algorithm 3.1).
Afterwards, the compiler generates a set of function splices using the data gathered during the analysis phase (line 22).

1: Input. A set F of functions in LASM intermediate code.
2: Output. A set of rewritten functions plus their function splices.
3:
4: for all functions f in F do
5:   S := {set of basic blocks containing access checks}
6:   L := {set of loops containing access checks}
7:
8:   {filter functions that should not be processed}
9:   if f is a splice function then
10:     continue with next function
11:   end if
12:   if f contains at least one try-catch block then
13:     continue with next function
14:   end if
15:   if f is a registrar then
16:     continue with next function
17:   end if
18:
19:   build CFG of f
20:   S := basic blocks in f
21:   L := natural loops in f
22:   call performDFS(f, S, L)
23: end for

Algorithm 3.1: The DFS compiler pass generating function splices.

3.1 Static Program Analysis

Using static program analysis, the DFS pass is able to determine possible candidate sequences inside a function to be separated from the original function. To search for candidates, the compiler pass first constructs the CFG of the function. Creation of a CFG is already supported by the compiler back-end, since CFGs are used by other optimization passes as well.

Sets of Basic Blocks

After the CFG is created, the first DFS pass runs algorithm 3.2 on all basic blocks of the CFG. The algorithm is used to find all basic blocks that contain access checks. A basic block is therefore the smallest granularity the DFS pass is able to separate from the original function. Algorithm 3.2 starts with an empty set S of basic blocks. It then iterates over all basic blocks of a given function f, searching for access check instructions inside the

basic blocks. If an instruction is an access check, the basic block is inserted into S and processing continues with the next basic block. After termination of algorithm 3.2, each basic block containing an access check is an element of set S.

1: Input. A function f in LASM intermediate code.
2: Output. A set S of basic blocks containing access checks.
3:
4: S := ∅
5: for all basic blocks b in f do
6:   for all instructions i in b do
7:     if i is of type LasmAccessCheck then
8:       S := S ∪ {b}
9:       continue with next basic block
10:     end if
11:   end for
12: end for

Algorithm 3.2: Searching for basic blocks containing access checks.

Finding Natural Loops

After searching for basic blocks containing access checks, the DFS pass identifies natural loops. Natural loops are the next higher granularity of instruction sequences for DFS. Again, the existing compiler back-end supplies all basic techniques to aid a loop finding algorithm (see algorithm 3.3). To be able to identify natural loops, we need to compute all dominators for each node of the CFG. If all paths to a given node n of the CFG pass through node d, then d is called a dominator of n (d dom n). A natural loop is found if there is a node t that is dominated by node h (h dom t) and there is a back edge t → h from t back to h. Node h is then called the header of the natural loop. A natural loop is defined by all nodes that can reach t without going through h, plus h itself [2]. Algorithm 3.3 (algorithm 10.1 in [2]) shows how to find the set of basic blocks associated with a natural loop identified by a given back edge. A detailed description of how to find natural loops can be found in [1, 2]. The algorithm is called for each back edge in the currently processed function to find all natural loops. As algorithm 3.1 shows, the set of loops identified will be used later on as the second type of granularity for pulling out splices.
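The worklist procedure of algorithm 3.3 translates almost mechanically into code. The sketch below is hypothetical: basic blocks are reduced to integers and predecessor lists are assumed to be precomputed from the CFG.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class NaturalLoops {
    // Returns the natural loop of back edge n -> d, given predecessor lists.
    static Set<Integer> loopOf(int n, int d, Map<Integer, List<Integer>> preds) {
        Set<Integer> loop = new HashSet<>();
        Deque<Integer> stack = new ArrayDeque<>();
        loop.add(d);                 // the header d is always part of the loop
        insert(n, loop, stack);
        while (!stack.isEmpty()) {
            int m = stack.pop();
            // every predecessor reaches n without passing through the header
            for (int p : preds.getOrDefault(m, List.of())) insert(p, loop, stack);
        }
        return loop;
    }

    private static void insert(int m, Set<Integer> loop, Deque<Integer> stack) {
        if (loop.add(m)) stack.push(m);  // d is pre-inserted, so the walk stops at it
    }
}
```

Because the header d is inserted before the walk starts, the backward traversal never propagates past it, which is exactly the "without going through h" condition above.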
After loop finding, the DFS pass has information about all basic blocks and loops that contain access checks. This information is used to transform a function to call one or more function splices. The next section describes how a sequence of instructions is separated from its containing function and moved to a function splice.

1: Input. A flow graph G and a back edge n → d (n, d: basic blocks).
2: Output. The set loop consisting of all nodes in the natural loop of n → d.
3:
4: procedure insert(m: basic block)
5:   if m is not in loop then
6:     loop := loop ∪ {m}
7:     push m onto stack
8:   end if
9: end procedure
10:
11: stack := empty
12: loop := {d}
13: insert(n)
14: while stack is not empty do
15:   pop m, the first element of stack, off stack
16:   for all predecessors p of m do
17:     insert(p)
18:   end for
19: end while

Algorithm 3.3: Finding a natural loop using back edge information.

3.2 Generating Function Splices out of Basic Blocks

We start by describing how function splices are generated by examining how to separate single basic blocks from the original instruction sequence. Figure 3.1 shows a function f and one of its splices s_f. The basic block that is moved to the function splice is marked dark gray. The instruction sequence I of function f is copied by the compiler to a function splice and replaced by a call to the function splice s_f. As long as the exact state of the function is preserved across the call to the function splice, the results computed by f are not changed by the transformation.

Saving Live Variables

When generating a function splice of a function, we need to ensure that the state at the beginning and at the end of the instruction sequence to be separated is the same as it would be if no transformation were applied. Figure 3.2(a) shows an example where variable x is assigned a value by the original function f. Later, this variable is changed and used at the end of f. It is obvious that the value of variable x has to be saved before entering the function splice s_f. After entering s_f, the saved value has to be restored into the representation

Figure 3.1: Transformation of function f into a new function calling s_f. (a) Original function f. (b) Transformation of f.

Figure 3.2: Generation of a function splice for function f modifying variable x. (a) Original function f modifying variable x. (b) Transformation of f calling function splice s_f.

of variable x inside the function splice. On leaving s_f, we need to record the value of x again. This time the value has to be restored after re-entering f. Thus, the changes made to x inside s_f are visible in f, and the results of f regarding x are the same as they would be if DFS were not applied (see figure 3.2(b)).

The same applies to all other variables that are live at the beginning or end of an instruction sequence that is a candidate to be separated. A variable is live if it is assigned a value and that value is used later on in the code [2]. Using the analysis framework it is possible to create a set of live variables for each instruction of a given function. These sets are then used to determine all variables whose values need to be saved and restored. Live variables at the beginning of the instruction sequence are passed by Call-By-Value (CBV) to the function splice. The values needed to restore the function state after the splice call has returned are passed using Call-By-Reference (CBR). Before returning from the function splice back to the caller, code is executed that saves all live variables back to their corresponding CBR arguments. The CBR parameters are then copied back to the original variables to restore their state. How to pass a CBR argument when a remote splice call is performed will be described later.

Saving Live Registers

The exact state of a function at a certain point in the instruction sequence is not only defined by live local variables in memory. There are also scratch registers that hold intermediate results computed by earlier instructions, or values that have been loaded into some register. Therefore, we need to save and restore not only variables but registers, too. Since a register cannot be passed directly to the callee, the code generator for DFS inserts temporary variables to carry the current value of that register.
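In plain Java the CBV/CBR protocol for both variables and registers can be imitated with one-element arrays standing in for CBR parameters (Java itself has no call-by-reference). The following is a hypothetical illustration, not generated LASM:

```java
class LiveVariableSketch {
    // The splice receives x by value and writes its final value into the CBR slot.
    static void splice(int x, int[] xRef) {
        x = x * 2 + 1;   // stand-in for the moved instruction sequence
        xRef[0] = x;     // save the live variable back into its CBR argument
    }

    // The rewritten original function: save, call, restore.
    static int caller() {
        int x = 20;                  // x is live across the spliced region
        int[] xRef = new int[1];     // CBR slot for restoring x
        splice(x, xRef);             // x passed CBV, xRef passed CBR
        x = xRef[0];                 // restore x; later uses see the new value
        return x + 1;                // a later use of x in the original function
    }
}
```

The restore after the call is what makes the splice transparent: any later use of x in the caller observes exactly the value the spliced code left behind.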
Before calling a function splice, the register is saved into a temporary local variable. The variable is then passed using CBV as an argument to the function splice. Obviously, to restore the exact state after returning from the callee, registers need to be restored as well. Again, it is not possible to return a set of values to the caller in registers. As with live variables, each register is first saved into a CBR parameter, which is copied back into the register after returning from the splice call.

3.3 Generating Function Splices out of Loops and Methods

Generating function splices out of single basic blocks is the basis for moving instruction sequences of bigger granularity, such as loops or even whole methods. In this section we will describe how to prepare dynamic function splicing on loops. Applying DFS to

a loop or a whole method uses the same process as for basic blocks: the exact state of all variables and registers is saved and passed to the function splice called. The DFS pass without feedback analysis (see section 4.3.4) only generates function splices out of outer-most loops. This is due to the lack of knowledge about the actual distribution of objects over the nodes of the cluster at runtime. Thus, the DFS pass is not able to tell whether an access check is likely to fail or to succeed. Since this knowledge is essential to decide on the instruction sequence to move to a function splice, the DFS pass is not able to select another granularity.

A loop described by a set of basic blocks forms a subgraph of the CFG of the whole method. When moving a part of the control flow graph to a function splice, we encounter the problem that the subgraph and the surrounding CFG may have more than one connection. That is, there may be more than one exit from the subgraph.

Handling Break Statements

Figure 3.3(a) shows a function searching for the first element of an array larger than some threshold. If the threshold is exceeded, the position of that element is saved and the loop is exited by a break statement. Figure 3.3(b) shows the control flow graph of that function. The for loop has two possible exits. Firstly, there is the exit taken when the loop condition fails, jumping to the first instruction after the for loop (edge a). Secondly, the loop can be terminated by the break statement (edge b) when the threshold is exceeded. Therefore the control flow exiting the loop can take one of two paths (edge a or edge b). If the whole loop is selected for splicing, we will not be able to distinguish between the two exit paths of the subgraph. Thus, the function splice needs to carry that information back to the caller, which must evaluate it and continue execution at the right location.
The exchange of this information is performed by adding two additional CBR arguments to the function splice: (1) a state argument (flags) indicating how the function splice was exited, and (2) an integer argument (break_no) telling which path led to the termination of the function splice. Each existing jump instruction leaving the subgraph is replaced by a new instruction sequence. Firstly, a unique number is assigned to the break_no argument. Secondly, a bit mask (SPLICE_BREAK) is set in the flags argument. Finally, the function splice is terminated using a simple return instruction. At runtime we can then determine which path in the CFG caused the termination of the function splice. The caller uses a LASM multi-jump instruction, comparable to a switch statement in Java, to continue execution at the right jump target; a so-called break handler. Algorithm 3.4 summarizes the replacement of jump instructions jumping out of the splicing subgraph.
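Rendered as hypothetical Java (the real transformation happens on LASM), a transformed caller and splice for a loop with one break might look as follows; SPLICE_BREAK, flags and breakNo mirror the arguments just described:

```java
class BreakHandlerSketch {
    static final int SPLICE_BREAK = 1;

    // Spliced loop body: the original break becomes "record exit path, return".
    static void splice(int[] a, int threshold,
                       int[] pos, int[] flags, int[] breakNo) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] > threshold) {
                pos[0] = i;
                breakNo[0] = 0;              // unique id of this rewritten break
                flags[0] |= SPLICE_BREAK;    // mark: splice left via a break
                return;
            }
        }
        // falling off the loop leaves flags untouched (normal exit, edge a)
    }

    static int caller(int[] a, int threshold) {
        int[] pos = {-1}, flags = {0}, breakNo = {-1};
        splice(a, threshold, pos, flags, breakNo);
        if ((flags[0] & SPLICE_BREAK) != 0) {
            switch (breakNo[0]) {            // the multi-jump "break handler"
                case 0: return pos[0];       // jump to the original break target
            }
        }
        return pos[0];                       // code after the loop (edge a)
    }
}
```

With only one break both paths happen to converge, but the switch is where a loop with several distinct exits would dispatch to several distinct continuation points.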

int findExceedingElement(int[] array, int threshold) {
    int position = -1;
    for (int i = 0; i < array.length; i++) {
        if (array[i] > threshold) {
            position = i;
            break;
        }
    }
    return position;
}

(a) Source of example method findExceedingElement(). (b) CFG of example method findExceedingElement().

Figure 3.3: Example of a subgraph with two edges exiting the subgraph.

1: Input. An instruction sequence I.
2: Output. Rewritten jump instructions; a multi-jump instruction ml for the calling function.
3:
4: id := 0 {unique identification of jump instruction}
5: for all instructions i in I do
6:   if i is a jump instruction then
7:     if i's jump target is not inside the function splice then
8:       target := jump target of i
9:       id := id + 1
10:      replace instruction i by the new instruction sequence:
11:        break_no := id
12:        flags := flags OR SPLICE_BREAK
13:        return
14:      add condition to ml: if break_no equals current id then goto target
15:    end if
16:  end if
17: end for
18: add multi-jump ml after the call instruction invoking the function splice

Algorithm 3.4: Rewriting jump instructions exiting a subgraph of a CFG.

Handling Return Statements

Another type of statement is able to end the execution of a function before its last instruction is reached. Using a return statement it is possible to end the execution of a function and pass a value back to the caller. If such a return statement is contained in a part of the function that is moved to a function splice, we need to handle this accordingly when control returns to the original function. When a return statement is executed inside a function splice, the splice has to terminate, passing the return value back to the caller. Besides the callee, the caller must also terminate, returning this value. In fact, the function and the called function splice must act as if there were no splice call and the return had been executed right in the original function. Figure 3.4(a) shows the stack frames of a function calling another function f(...). On execution of a return statement, f is terminated and control flow continues inside the caller. In figure 3.4(b) the return statement was moved to a function splice, which needs to terminate both itself and its calling function. To handle these statements, the state variable introduced for break statements is re-used.
Additionally, a CBR argument is added to the function splice that carries the return value evaluated when processing the return statement. At the end of the function splice a return handler is added to correctly terminate the function

splice, passing back the value of the return expression. The original return statement moved to the function splice is then replaced by two instructions: one to copy the value of the return expression into a temporary LASM register, and one to perform an unconditional jump to the return handler at the end of the function splice. The return handler then copies the value of the temporary register into the CBR argument reserved for passing back the return value. In the state argument (flags) a flag (SPLICE_RETURN) is additionally set to indicate that the function splice was terminated by the execution of a return statement. This state flag is tested after the function splice returns from its call. If the flags argument passed back to the caller indicates that the function splice was terminated by handling a return statement, the caller executes a return statement which passes back the value stored in the argument containing the return value from the caller's function splice.

Figure 3.4: Termination of function splices by a return statement. (a) A function terminated by a return statement. (b) A function splice terminating together with its callers.

Handling Throw Statements

The next type of statement able to change the flow of control in a Java method is the throw statement. On execution of a throw statement the control flow of the function is interrupted. The call stack of functions is then unwound until a function is reached that contains a catch block conforming to the type of the object reference inside the throw expression that caused the unwinding. While the call stack is unwound, the object reference of the throw expression is passed along to be delivered to the catch block. Execution of this type of statement is handled at runtime.
In the case of a local splice call, the Jackal compiler already ensures that the call stack is unwound correctly, because the function splice resides on the call stack. In the case of a remote splice call, the information about an exception that is thrown remotely needs to be

transferred to the calling node. This task is accomplished by the DFS runtime (see section 4.2).

3.4 Problems

While correct handling of language constructs such as break, return and throw statements enables the DFS pass to generate function splices, the compiler back-end cannot perform function splicing in all cases. This section gives a short description of two cases where DFS is not possible.

Handling of Generic Pointers

In a Jackal program there are two types of pointers. Firstly, pointers can reference Java objects. This type of pointer is valid on the local node, but can be transferred to another node by serialization and de-serialization. These pointers can be exchanged between different machines, because each instance of the runtime system maintains the information needed to map such pointers onto the local address space. Secondly, there are pointers referencing generic memory in the address space of the Jackal program. This type of pointer is used to reference data structures stored by the Jackal RTS, such as pointers to the vtable of an object. The second type of pointer prevents the DFS pass from generating function splices if such pointers are live variables or live registers at the beginning or end of the instruction sequence that is to be moved. Because we cannot assume the same memory layout on two different nodes, a generic pointer is not valid on a machine other than the local machine. Thus, transferring such pointers between nodes would cause undefined behavior of the application.

Helper Variables

On some architectures special helper variables might be inserted. One application of such variables is to access memory locations that cannot be addressed directly using normal machine instructions. In that case the compiler generates code that accesses such memory locations in the following way: the higher word of the specified address is loaded into the lower word of the variable.
Then a bit shift is performed to move the partial address to the higher word. The lower part of the address is loaded into the variable by an OR operation. The specified memory location can then be accessed by an indirect memory access using the content of the helper variable. The problems caused by generic pointers are also inherent to this type of helper variable. A DFS pass generating code for an architecture which requires the compiler to insert helper variables needs to perform careful analysis to avoid undefined behavior of the program.
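The two-step address construction described above is easy to check with 32-bit arithmetic. This is a hypothetical 16-bit-halves variant; the actual word sizes depend on the target architecture:

```java
class HelperVariableSketch {
    // Rebuilds a 32-bit address the way the generated code fills a helper variable.
    static int buildAddress(int address) {
        int helper = address >>> 16;      // load the high half into the low word
        helper = helper << 16;            // shift it up into the high word
        helper |= address & 0xFFFF;       // OR in the low half of the address
        return helper;                    // ready for an indirect memory access
    }
}
```

Since the helper variable ends up holding a raw machine address, it suffers from exactly the portability problem of generic pointers: its value is meaningless on any other node.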

1: Input. Arguments A to the corresponding function splice.
2: Output. None.
3:
4: allocate transfer buffer call_by_val for CBV arguments
5: allocate transfer buffer call_by_ref for CBR arguments
6: find remote node n for requesting the remote splice call
7: for all CBV arguments v in A do
8:   marshal v into transfer buffer call_by_val
9: end for
10: send request for remote_splice_call(call_by_val, call_by_ref) to node n and wait for reply
11: for all CBR arguments r in A do
12:   unmarshal r using transfer buffer call_by_ref
13: end for

Algorithm 3.5: Tasks of a remote splice call stub for a given function splice.

3.5 Helper Functions

In addition to generating function splices, the DFS pass is also responsible for creating specialized support functions used by the DFS runtime system to call function splices remotely. For each function splice, three types of support functions are created. This section describes the generated functions in detail. Section 4.2 will show how these functions are called at runtime to perform a remote splice call.

Remote Splice Call Stub

A so-called remote splice call stub is generated to initiate a remote splice call. This type of support function has the same signature as the corresponding function splice. Hence, the call to the function splice and the call to the RSC stub are interchangeable. The task of the RSC stub is first to marshal all CBV arguments passed to the function splice such that they can be sent over the network to a remote node. Then, a function of the DFS runtime system is called that sends the call request and the marshalled arguments to the remote node. After returning from the runtime system, CBR arguments are unmarshalled and returned to the calling function. Thus, RSC stubs correspond to the stub procedures used in most implementations of remote procedure calls [4, 21]. Algorithm 3.5 shows the tasks of the RSC stub in pseudo-code.
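Algorithm 3.5's marshal/send/unmarshal cycle can be sketched in hypothetical Java, with ByteBuffers standing in for the transfer buffers and the network round trip replaced by a direct call to a fake skeleton; all names here are illustrative, not Jackal's:

```java
import java.nio.ByteBuffer;

class RscStubSketch {
    // Stub for a splice taking one CBV int p and one CBR int slot qRef.
    static void rscStub(int p, int[] qRef) {
        ByteBuffer callByVal = ByteBuffer.allocate(16);  // CBV transfer buffer
        ByteBuffer callByRef = ByteBuffer.allocate(16);  // CBR transfer buffer
        callByVal.putInt(p);                             // marshal the CBV argument
        callByVal.flip();
        sendRequestAndWait(callByVal, callByRef);        // "network" round trip
        callByRef.flip();
        qRef[0] = callByRef.getInt();                    // unmarshal the CBR result
    }

    // Stand-in for request + reply: the remote skeleton runs the splice body.
    static void sendRequestAndWait(ByteBuffer in, ByteBuffer out) {
        int p = in.getInt();   // skeleton unmarshals the CBV argument
        out.putInt(p * 2);     // splice body runs; result is marshalled as CBR
    }
}
```

Because the stub has the splice's signature, swapping the call target between splice and stub (the patching of section 4.1) changes where the code runs without changing the caller.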
As shown in algorithm 3.5, one additional task of the remote splice call stub is to determine the remote node the RSC request is directed to. Algorithm 3.6 shows a simple O(n) algorithm used to determine the destination node for an RSC request. Because of the optimization target (reduction of messages sent in the cluster), the

algorithm tries to find the home node of a set of n object references given as input. The number of messages saved is optimal if the request is directed to the home node storing most of the objects needed to execute the function splice.

1: Input. A set A of pointers to Java objects.
2: Output. Destination node to request an RSC.
3:
4: {initialize all helper variables}
5: for all nodes h in the Jackal cluster do
6:   object_count[h] := 0
7: end for
8: max := 0
9: destination := local node
10:
11: {build frequency table}
12: for all pointers p in A do
13:   h := logical home node of p
14:   increase object_count[h] by 1
15:   if max < object_count[h] then
16:     max := object_count[h]
17:     destination := h
18:   end if
19: end for
20:
21: return destination

Algorithm 3.6: Finding the destination node for an RSC by counting home nodes.

Algorithm 3.6 first creates a table of integer counters, one for each node in the cluster. For each object reference passed to the algorithm, table element h is increased by one if node h is the home node of the object. After all object references have been examined, a frequency table has been generated indicating how many of the objects are stored on each node. To avoid searching the complete table for the node holding the maximum number of objects, the node currently holding most of the objects is recorded in destination while processing the set of object references. When a new node becomes the node with the maximal object count, destination is updated accordingly.

Remote Splice Call Skeleton

To execute the function splice at the remote node, a remote splice call skeleton is created by the DFS pass. The RSC skeleton is called by the DFS runtime system when a remote splice call is requested by some node. The received CBV arguments are unmarshalled by the RSC skeleton and the function splice is called. After the function

splice has returned, the values passed back by CBR arguments are marshalled into a buffer provided by the DFS runtime system (see algorithm 3.7). The tasks performed by the remote splice call skeleton are comparable to those of the skeleton procedures in most implementations of RPC [4, 21].

1: Input. CBV argument buffer call_by_val and CBR argument buffer call_by_ref.
2: Output. None.
3:
4: for all CBV arguments v that need to be passed to the function splice do
5:   unmarshal v using transfer buffer call_by_val
6: end for
7: call function splice
8: for all CBR arguments r that need to be passed back do
9:   marshal r into transfer buffer call_by_ref
10: end for

Algorithm 3.7: Tasks of a remote splice call skeleton for a given function splice.

Remote Splice Registrars

When a remote splice call is requested by some node, the remote node must be informed which RSC skeleton has to be invoked. Several approaches to uniquely identifying a remote splice call skeleton suggest themselves. One way to uniquely request remote execution of a function splice is to send a unique identification string to the remote node. The compiler-mangled name of the Java method together with a unique identification of the function splice could serve as this string. At runtime this string would have to be compared with the IDs of all other function splices; this potentially causes high overhead because of repeated string comparisons. Another possible way to identify remote function splices would be to use the address of the function splice in the address space of the local node. Passing this value between different nodes leads to the same problems as discussed in the section about generic pointers: we cannot be sure that a pointer to a function splice on one node is valid and references the same function splice on a remote node.
The final solution implemented in the current system is to use a surrogate identification that is computed at runtime when the program is started. A remote splice registrar is used by the runtime system to generate the surrogate identification. At program startup, the list of function splices (see section 3.6) is sorted lexicographically. This sorting is performed by all nodes in the Jackal cluster in the same way. Hence, the position of a string in the sequence is the same on all nodes, producing a unique surrogate identification number for each function splice. This index is then sent over the network to request the remote execution of a function splice.

3.6 Table of Function Splices

Information about function splices is passed to the runtime system by the DFS pass in the form of a table. This table stores all data needed to perform dynamic function splicing at runtime. For each splice the following information is emitted:

- The address of the call instruction that calls the function splice locally.
- The start and end addresses of the instruction sequence moved to the function splice.
- The start addresses of the function splice, the RSC stub and the RSC skeleton.
- A string consisting of the mangled name of the Java method and the number of the function splice.

Section 4.1 shows how the emitted splice table is used by the runtime system to perform dynamic function splicing.

3.7 Further Optimizations

Within the current implementation of the DFS pass there are still some possibilities for optimization. Mainly, the passing of arguments can be optimized further. Firstly, we can avoid saving and restoring live variables and registers if they are not used inside the function splice. The definition of the term live variable (a variable that is assigned a value which is used later in the instruction sequence) implies that a variable may be live at the beginning of the function splice although its use is not part of the function splice. In that case, passing the value as an argument to the function splice is not necessary. The same applies to live variables at the end of the function splice: if a variable is never assigned a new value inside the function splice, it is possible to avoid passing its value back to the caller. These optimizations decrease the number of copy operations needed to save and restore the correct state of all live variables and registers. Message sizes also decrease, because less data needs to be transferred on the execution of a remote splice call.
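The argument pruning of section 3.7 amounts to two set intersections over the analysis results; a hypothetical sketch with variables represented as strings:

```java
import java.util.HashSet;
import java.util.Set;

class ArgumentPruning {
    // CBV arguments: variables live at splice entry AND actually read inside it.
    static Set<String> cbvArgs(Set<String> liveAtEntry, Set<String> usedInSplice) {
        Set<String> args = new HashSet<>(liveAtEntry);
        args.retainAll(usedInSplice);     // drop values the splice never reads
        return args;
    }

    // CBR arguments: variables live at splice exit AND assigned inside it.
    static Set<String> cbrArgs(Set<String> liveAtExit, Set<String> definedInSplice) {
        Set<String> args = new HashSet<>(liveAtExit);
        args.retainAll(definedInSplice);  // drop values the splice never writes
        return args;
    }
}
```

Everything filtered out here is state that the splice neither observes nor changes, so dropping it shrinks both the argument copy code and the RSC message payload.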

3.8 Summary

We have described how the existing compiler back-end is extended to support dynamic function splicing. The DFS pass is responsible for identifying candidate instruction sequences of a Java method. Afterwards, each instruction sequence is copied to a new function and replaced by a call to this function. One of the main tasks of the DFS pass is to ensure that two requirements are met. Firstly, the exact state of a function has to be saved and restored on entry and on exit of the function splice. The values of variables and registers live at both points have to be saved and restored accordingly. Secondly, the execution of Java language constructs needs to be emulated such that the results of the computation performed are not changed by dynamic function splicing. This chapter showed solutions for both requirements.

We were able to create function splices by moving and replacing instructions of a LASM function with static transformations of LASM code. After running the DFS pass the original function contains calls to its function splices. As yet, those calls are always performed locally. The foundations for performing remote splice calls are laid, but the compiler is not able to predict whether a local or a remote splice call is advantageous; heuristics are needed to decide this. Furthermore, the DFS pass is not able to determine the optimal node to send a remote call to, because the distribution of objects over the cluster is not known at compile time. In chapter 4 we will discuss how the remaining tasks can be solved by adding support for DFS to the existing Jackal runtime system.

4 Runtime Support for Dynamic Function Splicing

No rule so good as the rule of thumb, if it fit. (Scottish proverb)

After the LASM representation of a Java program has been processed by the DFS compiler pass, the program is fully prepared for dynamic function splicing. Instruction sequences have been moved to function splices and replaced by calls in the original function. For every function splice the DFS pass also created a set of support functions which are used at runtime to perform remote splice calls. Additionally, the compiler pass created a table which stores all information needed to perform DFS at runtime.

This chapter describes the runtime support for dynamic function splicing. First, we show how local splice calls are replaced by remote splice calls during program execution (section 4.1) and how a remote splice call is requested between two nodes (section 4.2). Second, in section 4.3 we describe the heuristics implemented to decide whether to perform a local splice call or a remote splice call.

4.1 Dynamic Function Splicing

As introduced in chapter 1, dynamic function splicing is defined as the process of dynamically replacing local splice calls by remote splice calls and vice versa. The task of replacing these call instructions in an executable is performed by a special module in the Jackal RTS. The RTS needs some basic information about the executable to be able to re-vector calls at runtime. Section 3.6 introduced the table of function splices emitted by the DFS compiler pass. This table is taken as the basis to provide all required data for DFS.

4.1.1 Replacement of Call Targets

One basic problem in DFS is to find the machine instruction that calls a function splice that should be replaced by an RSC. To find this instruction, its address is recorded in the table of function splices. Thus, the address of the call instruction is easily found by a table look-up.

After the address of the call instruction to be rewritten is determined, the call target of that call needs to be replaced by a new one. The new call target is either the address of the function splice or the address of the remote splice call stub for that function splice. Both addresses are stored in the table of function splices. If a heuristic (see section 4.3) decides to replace an LSC by an RSC, the DFS runtime looks up the address of the call instruction performing the local splice call. Then, the table is searched for the address of the corresponding RSC stub. The address found in the table replaces the address of the function splice in the call instruction. From then on, the executable performs a remote splice call whenever the call instruction is executed. Algorithm 4.1 shows how the executable is rewritten using the address of the call instruction and the new call target. A remote splice call can be replaced by a local splice call in the same way by applying the reverse transformation.

 1: Input. Surrogate identification s of a function splice s_f, table T of function splices.
 2: Output. Rewritten call instruction calling s_f either locally or remotely.
 3:
 4: procedure patch_executable(pc : program counter, target : instruction address)
 5:   addr_call := pc - 5
 6:   offset := target - addr_call - 5
 7:   set 4 bytes at location (addr_call + 1) in the executable to offset
 8: end procedure
 9:
10: if the heuristic opts for an RSC on s then
11:   patch_executable(T[s].addr_of_call, T[s].addr_of_RSC_stub)
12: else
13:   patch_executable(T[s].addr_of_call, T[s].addr_of_function_splice)
14: end if

Algorithm 4.1: Rewriting call instructions on an IA-32 machine.
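The byte-level patching of Algorithm 4.1 can be illustrated on a buffer that stands in for the text segment. This is a sketch assuming the IA-32 near call encoding (opcode E8 followed by a 32-bit offset relative to the next instruction); the function and variable names are made up:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rewrite the rel32 operand of an E8 near call located at addr_call in
 * the code buffer so that it transfers control to target.  The offset
 * is relative to the address of the instruction following the call,
 * which is 5 bytes long. */
static void patch_call(uint8_t *code, uint32_t addr_call, uint32_t target)
{
    int32_t offset = (int32_t)(target - (addr_call + 5));
    memcpy(code + addr_call + 1, &offset, 4); /* skip the E8 opcode byte */
}
```

In the real runtime the buffer is the executable's own text segment, so the page containing the call instruction must be writable while patching.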
Again, the table of function splices is searched for the address of the call instruction (but now calling the RSC stub). The target of the call instruction located at the identified address is then replaced by the address of the function splice. This address is also stored in the table of function splices.

The look-up operations described so far are cheap in terms of execution cost. After reading the table of function splices at program startup, each function splice owns a unique ID which is used as the look-up index into the table. Thus, the look-up is performed by simple indirect memory accesses when searching for the metadata of a function splice. Performing the look-up of the

different types of addresses costs another indirect memory access.¹ The complexity of this look-up type is O(1), because each indirect access can be regarded as one operation.

4.1.2 Finding Function Splices Using Failed Access Checks

Besides the direct access to the function splice table, we need to implement a search for function splices in the table. This type of search is used when a heuristic is applied that monitors access checks and, based on the data gathered during the monitoring, decides to change a local splice call to a remote splice call and back. If an access check fails on a local node because the requested object is not cached locally, the Jackal program calls special functions of the runtime system to request the object transfer and caching on the local node (see sections and 2.3.2). On execution of the RTS support functions, the address of the failing access check is passed using the current program counter of the running executable. Heuristics may place a hook in the Jackal RTS S-DSM protocol support functions processing failed access checks. Therefore, a look-up function needs to be implemented that searches the splice table using the program counter of the failed access check as the search condition. To meet this requirement, the splice table also stores the address of the first and the last instruction of each function splice. If an access check fails, the program counter is passed to the DFS runtime, which tests whether the program counter lies in the address range between the begin address and the end address of a function splice. All possible intervals are disjoint, because individual function splices do not share parts of their instruction sequences with other function splices. The look-up algorithm (see algorithm 4.2) uses a form of binary search [20] to look up the start address of the interval the program counter is located in.
The end of the interval is then determined automatically because the address intervals are disjoint. After the look-up, a pointer to the entry in the table of function splices is returned. Thus, the complexity of finding an entry in the table of function splices is O(log n): the binary search needs O(log n) operations to find the start address of the address interval identifying a function splice, and a constant has to be added for the indirect memory accesses to the entry found (see section 4.1.1).

¹ The table of function splices is mapped to an array of pointers to C structs holding the information. The look-up of data using the function splice ID i causes one indirect access to get the ith element of the array. A second and third indirect memory access are due to the translation of accesses to elements of a pointer to a C struct.
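The interval look-up described above might be sketched in C as follows; the struct fields and the not-found convention are assumptions made for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* One address interval per splice; the table is sorted by addr_begin
 * and the intervals are disjoint. */
typedef struct {
    uintptr_t addr_begin;
    uintptr_t addr_end;
} splice_range_t;

/* Binary search: return the index of the splice whose interval
 * contains pc, or -1 if pc belongs to no splice. */
static int find_splice(const splice_range_t *t, int len, uintptr_t pc)
{
    int lo = 0, hi = len - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (pc < t[mid].addr_begin)
            hi = mid - 1;
        else if (pc > t[mid].addr_end)
            lo = mid + 1;
        else
            return mid; /* addr_begin <= pc <= addr_end */
    }
    return -1;
}
```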

 1: Input. Table T of function splices, length l of T, program counter pc.
 2: Output. Address of the entry of the function splice with the address range corresponding to pc.
 3:
 4: start := 0  {begin of current search interval}
 5: end := l - 1  {end of current search interval}
 6:
 7: while (end - start) > 1 do
 8:   pos := (end + start)/2
 9:   if pc < T[pos].addr_begin then
10:     end := pos
11:   else if T[pos].addr_begin <= pc and pc <= T[pos].addr_end then
12:     exit loop
13:   else
14:     start := pos + 1
15:   end if
16: end while
17:
18: return address of T[pos]

Algorithm 4.2: Looking up entries in the function splice table by program counter.

4.1.3 Program Startup

The existing startup process of a Jackal program needs to be extended to prepare DFS at runtime, because a program has to initialize all data structures needed by the DFS runtime support before the main method of the program is called. The first phase of the startup procedure loads the table of function splices into main memory from the binary file of the running program. The table is accessible through a pointer to its first entry, which is passed by the existing startup code as a parameter to the DFS startup function. This function allocates memory for the table and reads in all entries emitted by the DFS compiler pass. After the table is read, it is sorted by the begin addresses of the intervals of the individual function splices mentioned earlier. While reading the table of function splices, an array is filled with the unique names of all function splices. This array is then sorted lexicographically to create the surrogate identification for each of the function splices of the program in phase 2 of the startup (see section 3.5, remote splice registrars). The second phase of the program startup finishes all preparations that are needed for DFS.
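The derivation of surrogate identifications from the lexicographically sorted name array can be sketched as follows; the names and helper functions are made up for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_name(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort the unique splice names and return the position of one name in
 * the sorted order; this position serves as its surrogate ID.  Because
 * every node sorts the same set of unique strings, the IDs agree on
 * all nodes of the cluster. */
static int surrogate_id(const char **names, int n, const char *name)
{
    qsort(names, n, sizeof *names, cmp_name);
    for (int i = 0; i < n; i++)
        if (strcmp(names[i], name) == 0)
            return i;
    return -1;
}
```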
The surrogate identification is created by calling the remote splice registrars for each function splice, passing the surrogate ID. Each of the remote splice registrars

stores the ID passed by the startup code in a special global variable. Such a variable is created for every function splice to store the identification of the corresponding RSC skeleton. Due to the lexicographic sorting of unique strings, this identification is the same on all nodes for each of the function splices and their RSC skeletons.

In the third phase, actions specific to the different DFS heuristics take place; they are described in section 4.3.

4.2 Remote Splice Call Protocol

The Remote Splice Call Protocol (RSCP) defines all message types that are used to remotely execute a function splice. The protocol is not forced to be fault tolerant, as the underlying networking module of Jackal's RTS guarantees reliability. Therefore, the RSCP can be kept simple. The basic protocol supports three types of messages that can be sent between nodes of a cluster. Table 4.1 shows the format of the packets sent over the network; table 4.2 shows the supported packet types and their semantics.

Execution of a Remote Splice Call

Figure 4.1 shows a sequence diagram of a remote splice call between two nodes A and B using the request-reply remote splice call protocol. On local node A, func() calls the remote splice call stub for one of its function splices. The RSC stub then calls shm_dfs_send_remote_splice_call of the DFS runtime system to send the request to the destination node. The support function then blocks until the return packet of the request is received at the calling node and the blocked thread is notified. On the destination node, the Jackal runtime system accepts the request packet and calls shm_dfs_handle_remote_splice_call, which is responsible for processing the RSC request and calls the requested function splice through its RSC skeleton. After returning from the function splice, a return packet is sent to the calling node.
On receipt of the return packet, the runtime system notifies the waiting thread of the arrival of the packet and the successful execution of the function splice on the remote node.

Execution of a Remote Splice Call Throwing an Exception

Potentially, every function splice can be terminated by an exception thrown while it is executed remotely. The DFS runtime system is responsible for handling such exceptions correctly. Therefore, the runtime system catches a thrown exception when invoking a function splice. If an exception was caught, an exception packet is sent back to the calling node, which then re-throws the exception locally. Thus, the sequence of performing a remote splice call equals figure 4.1 except for the packet sent on return from the call.

proto_hdr: This field is common to all protocols in the Jackal RTS. It stores the information needed to send the packet to its destination.
source (CPU number, 4 bytes): Number of the node that requested the RSC. The return packet is directed to node source.
method_no (unsigned integer, 4 bytes): Surrogate identification of the function splice that is requested to be called remotely.
request_id (pointer, 8 bytes): Request ID of the RSC. On receipt of the return packet for an RSC, the request_id is used to wake up the waiting thread that requested the RSC.
cbv_size (unsigned integer, 4 bytes): Size of the arguments that have to be passed by call-by-value.
cbr_size (unsigned integer, 4 bytes): Size of the arguments that have to be passed by call-by-reference.
copy_back (pointer, 8 bytes): Address of the buffer that stores the CBR arguments of the function splice.
flags (pointer, 8 bytes): Address of a special flag variable of the waiting thread, used to exchange status information with it.
data (cbv_size bytes): Arguments passed by call-by-value (only present in the request packet).
data (cbr_size bytes): Arguments passed by call-by-reference (only present in the return packet).

Table 4.1: Format of a remote splice call protocol packet.
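The fixed part of the packet format in table 4.1 could map onto a C struct along these lines (the proto_hdr field is omitted, and the names, alignment, and byte order are assumptions; the actual wire layout of the Jackal RTS may differ):

```c
#include <assert.h>
#include <stdint.h>

/* Fixed header of an RSCP packet, following Table 4.1.  The variable-
 * sized argument data (cbv_size or cbr_size bytes) follows the header. */
typedef struct rscp_packet {
    uint32_t source;     /* node that requested the RSC */
    uint32_t method_no;  /* surrogate ID of the function splice */
    uint64_t request_id; /* identifies the waiting thread to wake up */
    uint32_t cbv_size;   /* bytes of call-by-value arguments */
    uint32_t cbr_size;   /* bytes of call-by-reference arguments */
    uint64_t copy_back;  /* address of the buffer for CBR arguments */
    uint64_t flags;      /* address of the waiting thread's flag variable */
} rscp_packet_t;
```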

Figure 4.1: Sequence diagram of performing a remote splice call.

DSM_SPLICE_CALL: Sent when the local node requests the remote execution of a function splice on another node of the cluster.
DSM_SPLICE_RETURN: After processing a remote splice call request, the remote node sends back a return packet of this type.
DSM_SPLICE_EXCEPTION: If the function splice called remotely was terminated by an exception, the remote node changes the type of the return packet to DSM_SPLICE_EXCEPTION to indicate the exceptional termination.

Table 4.2: Remote splice call protocol message types.

4.3 Heuristics

Heuristics are utilized by the DFS runtime system to decide on replacing a local splice call with a remote splice call and vice versa.

4.3.1 Always and Never

The Never heuristic performs no replacement of call targets from local splice calls to remote splice calls. Using Never it is also possible to test whether the DFS compiler pass created correct function splices: the result of executing the unmodified version of the Jackal program has to equal the result produced by the modified version. The heuristic can also be used to determine the performance penalty introduced by creating and calling function splices locally.

Always is another heuristic used to test the correctness of the implementation of DFS in Jackal. Using Always it is possible to check whether remote splice calls are processed correctly by the DFS runtime system. Always meets the definition of DFS because it needs to replace the call instructions calling function splices by instructions executing the corresponding RSC stubs. Because every function splice of a program is guaranteed to be executed remotely, the correctness of each remote call can be checked easily.

4.3.2 Coin Tossing

The Coin Tossing heuristic assumes that the DFS runtime has no knowledge about the past and future behavior of a program.
Thus, the DFS runtime needs to guess whether to invoke a local splice call or a remote splice call. To determine the type of invocation, LSC or RSC, Coin Tossing simulates an ideal coin toss. An ideal coin toss shows heads and tails each with a probability of 50 % [18].
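A minimal sketch of such a decision routine, assuming the C library's rand() as the source of randomness (the actual RTS may use a different generator):

```c
#include <assert.h>
#include <stdlib.h>

/* Simulated ideal coin toss: returns 1 (perform the RSC) or
 * 0 (keep the LSC) with probability 1/2 each. */
static int toss_for_remote(void)
{
    return rand() & 1;
}
```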

Thus, the probability for each individual splice to be called locally or remotely is 50 %. The decision made by Coin Tossing is used at three locations in the RTS. Firstly, after an LSC returns, the Coin Tossing routines are called to decide on the next execution of the function splice; a hook after the local splice call instruction intercepts the execution on return of the LSC and calls the decision routine. Secondly, Coin Tossing is used when an RSC stub is about to return. Finally, at program startup an initial coin toss is made for each function splice.

4.3.3 Adaptive Coin Tossing

Extending Coin Tossing by runtime profiling yields Adaptive Coin Tossing. The basic idea behind this heuristic is the same as for Coin Tossing, but after each execution of a function splice, locally or remotely, the call is assessed by a profiler. Using the profiling information it is possible to better estimate whether a local splice call or a remote splice call is favorable for the next execution. Based on this evaluation, the probability of the two possible events is increased or decreased. For example, if the last RSC was more beneficial than the last LSC, the probability to perform an RSC is increased, decreasing the probability of an LSC. Because the optimization's goal is to decrease the number of messages sent between nodes of the cluster, counting the number of access checks that failed during the execution of a function splice is sufficient to create a profile of a splice call. This information is sent back to the calling node when a remote splice call returns and is stored in a per-thread structure. On the next decision made by Adaptive Coin Tossing, this information is evaluated to compute the new probabilities. Figure 4.2 shows the plot of the functions used to increase or decrease the probability distribution of the two possible events.
These functions are designed to reward beneficial executions with a cubic increase of the probability and to penalize unfavorable executions with a cubic decrease. The function

    f_inc(x) = 1 - (1 - x)^3,  x in [0..1]    (4.1)

increases the probability of an event, while

    f_dec(x) = x^3,  x in [0..1]    (4.2)

decreases it (see figure 4.2). To avoid expensive computations at runtime, the results of functions 4.1 and 4.2 are precomputed on the interval [0..1]. To be able to use inexpensive integer arithmetic instead of potentially costly double arithmetic, the interval is normalized and discretized to integers in [0..255].
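The precomputation might look as follows in C; a probability p is represented by the integer p * 255 and the table contents are scaled accordingly (a sketch, not the actual RTS code):

```c
#include <assert.h>

/* Discretized versions of f_inc(x) = 1 - (1 - x)^3 and f_dec(x) = x^3
 * on [0..255].  An entry i stands for the probability i/255; only
 * integer arithmetic is needed at runtime. */
static unsigned char f_inc_tab[256];
static unsigned char f_dec_tab[256];

static void init_prob_tables(void)
{
    for (long i = 0; i < 256; i++) {
        long r = 255 - i;
        /* x^3 scaled back into [0..255]: divide by 255^2 */
        f_dec_tab[i] = (unsigned char)((i * i * i) / (255L * 255L));
        f_inc_tab[i] = (unsigned char)(255 - (r * r * r) / (255L * 255L));
    }
}
```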

Figure 4.2: Graph of the functions in equations 4.1 and 4.2.

Figure 4.3: Compilation using feedback analysis.

4.3.4 Feedback Analysis

Because none of the heuristics described so far knows the runtime behavior of a Jackal program at compile time, we use Feedback Analysis to provide this information to both the compiler and the DFS runtime system. Thus, both parts of the DFS support in Jackal, compiler and runtime system, need to be extended. In a first pass, the Jackal program is augmented with calls to profiling functions provided by the DFS runtime system. While executing the prepared program, the DFS runtime system records information about all access checks executed. This data is passed back to a second compiler pass that uses the feedback information to decide on the generation of function splices (see figure 4.3).


More information

Run-time Environments - 3

Run-time Environments - 3 Run-time Environments - 3 Y.N. Srikant Computer Science and Automation Indian Institute of Science Bangalore 560 012 NPTEL Course on Principles of Compiler Design Outline of the Lecture n What is run-time

More information

Distributed Systems Theory 4. Remote Procedure Call. October 17, 2008

Distributed Systems Theory 4. Remote Procedure Call. October 17, 2008 Distributed Systems Theory 4. Remote Procedure Call October 17, 2008 Client-server model vs. RPC Client-server: building everything around I/O all communication built in send/receive distributed computing

More information

RISC Principles. Introduction

RISC Principles. Introduction 3 RISC Principles In the last chapter, we presented many details on the processor design space as well as the CISC and RISC architectures. It is time we consolidated our discussion to give details of RISC

More information

Q.1 Explain Computer s Basic Elements

Q.1 Explain Computer s Basic Elements Q.1 Explain Computer s Basic Elements Ans. At a top level, a computer consists of processor, memory, and I/O components, with one or more modules of each type. These components are interconnected in some

More information

Control Abstraction. Hwansoo Han

Control Abstraction. Hwansoo Han Control Abstraction Hwansoo Han Review of Static Allocation Static allocation strategies Code Global variables Own variables (live within an encapsulation - static in C) Explicit constants (including strings,

More information

Hiding software components using functional encryption. Janek Jochheim

Hiding software components using functional encryption. Janek Jochheim Hiding software components using functional encryption by Janek Jochheim Fakultät für Elektrotechnik, Informatik und Mathematik Heinz Nixdorf Institut und Institut für Informatik Fachgebiet Softwaretechnik

More information

Final Exam. 11 May 2018, 120 minutes, 26 questions, 100 points

Final Exam. 11 May 2018, 120 minutes, 26 questions, 100 points Name: CS520 Final Exam 11 May 2018, 120 minutes, 26 questions, 100 points The exam is closed book and notes. Please keep all electronic devices turned off and out of reach. Note that a question may require

More information

Compiler Construction

Compiler Construction Compiler Construction Thomas Noll Software Modeling and Verification Group RWTH Aachen University https://moves.rwth-aachen.de/teaching/ss-17/cc/ Generation of Intermediate Code Conceptual Structure of

More information

Advanced Programming & C++ Language

Advanced Programming & C++ Language Advanced Programming & C++ Language ~6~ Introduction to Memory Management Ariel University 2018 Dr. Miri (Kopel) Ben-Nissan Stack & Heap 2 The memory a program uses is typically divided into four different

More information

Compiler Construction

Compiler Construction Compiler Construction Thomas Noll Software Modeling and Verification Group RWTH Aachen University https://moves.rwth-aachen.de/teaching/ss-17/cc/ Generation of Intermediate Code Outline of Lecture 15 Generation

More information

Operating Systems. 18. Remote Procedure Calls. Paul Krzyzanowski. Rutgers University. Spring /20/ Paul Krzyzanowski

Operating Systems. 18. Remote Procedure Calls. Paul Krzyzanowski. Rutgers University. Spring /20/ Paul Krzyzanowski Operating Systems 18. Remote Procedure Calls Paul Krzyzanowski Rutgers University Spring 2015 4/20/2015 2014-2015 Paul Krzyzanowski 1 Remote Procedure Calls 2 Problems with the sockets API The sockets

More information

Kasper Lund, Software engineer at Google. Crankshaft. Turbocharging the next generation of web applications

Kasper Lund, Software engineer at Google. Crankshaft. Turbocharging the next generation of web applications Kasper Lund, Software engineer at Google Crankshaft Turbocharging the next generation of web applications Overview Why did we introduce Crankshaft? Deciding when and what to optimize Type feedback and

More information

殷亚凤. Processes. Distributed Systems [3]

殷亚凤. Processes. Distributed Systems [3] Processes Distributed Systems [3] 殷亚凤 Email: yafeng@nju.edu.cn Homepage: http://cs.nju.edu.cn/yafeng/ Room 301, Building of Computer Science and Technology Review Architectural Styles: Layered style, Object-based,

More information

Stack Frames. September 2, Indiana University. Geoffrey Brown, Bryce Himebaugh 2015 September 2, / 15

Stack Frames. September 2, Indiana University. Geoffrey Brown, Bryce Himebaugh 2015 September 2, / 15 Stack Frames Geoffrey Brown Bryce Himebaugh Indiana University September 2, 2016 Geoffrey Brown, Bryce Himebaugh 2015 September 2, 2016 1 / 15 Outline Preserving Registers Saving and Restoring Registers

More information

G Programming Languages - Fall 2012

G Programming Languages - Fall 2012 G22.2110-003 Programming Languages - Fall 2012 Lecture 2 Thomas Wies New York University Review Last week Programming Languages Overview Syntax and Semantics Grammars and Regular Expressions High-level

More information

Computer Organization & Assembly Language Programming

Computer Organization & Assembly Language Programming Computer Organization & Assembly Language Programming CSE 2312-002 (Fall 2011) Lecture 5 Memory Junzhou Huang, Ph.D. Department of Computer Science and Engineering Fall 2011 CSE 2312 Computer Organization

More information

Lecture 9 Dynamic Compilation

Lecture 9 Dynamic Compilation Lecture 9 Dynamic Compilation I. Motivation & Background II. Overview III. Compilation Policy IV. Partial Method Compilation V. Partial Dead Code Elimination VI. Escape Analysis VII. Results Partial Method

More information

Great Reality #2: You ve Got to Know Assembly Does not generate random values Arithmetic operations have important mathematical properties

Great Reality #2: You ve Got to Know Assembly Does not generate random values Arithmetic operations have important mathematical properties Overview Course Overview Course theme Five realities Computer Systems 1 2 Course Theme: Abstraction Is Good But Don t Forget Reality Most CS courses emphasize abstraction Abstract data types Asymptotic

More information

Branch Addressing. Jump Addressing. Target Addressing Example. The University of Adelaide, School of Computer Science 28 September 2015

Branch Addressing. Jump Addressing. Target Addressing Example. The University of Adelaide, School of Computer Science 28 September 2015 Branch Addressing Branch instructions specify Opcode, two registers, target address Most branch targets are near branch Forward or backward op rs rt constant or address 6 bits 5 bits 5 bits 16 bits PC-relative

More information

Example. program sort; var a : array[0..10] of integer; procedure readarray; : function partition (y, z :integer) :integer; var i, j,x, v :integer; :

Example. program sort; var a : array[0..10] of integer; procedure readarray; : function partition (y, z :integer) :integer; var i, j,x, v :integer; : Runtime Environment Relationship between names and data objects (of target machine) Allocation & de-allocation is managed by run time support package Each execution of a procedure is an activation of the

More information

CS61C : Machine Structures

CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c/su06 CS61C : Machine Structures Lecture #6: Memory Management CS 61C L06 Memory Management (1) 2006-07-05 Andy Carle Memory Management (1/2) Variable declaration allocates

More information

DESIGN AND IMPLEMENTATION OF AN ADAPTABLE METRIC DASHBOARD

DESIGN AND IMPLEMENTATION OF AN ADAPTABLE METRIC DASHBOARD Friedrich-Alexander-Universität Erlangen-Nürnberg Technische Fakultät, Department Informatik ACHIM DÄUBLER MASTER THESIS DESIGN AND IMPLEMENTATION OF AN ADAPTABLE METRIC DASHBOARD Submitted on 4 September

More information

5. Garbage Collection

5. Garbage Collection Content of Lecture Compilers and Language Processing Tools Summer Term 2011 Prof. Dr. Arnd Poetzsch-Heffter Software Technology Group TU Kaiserslautern c Prof. Dr. Arnd Poetzsch-Heffter 1 1. Introduction

More information

COLLABORATION NETWORKS IN OPEN SOURCE PROJECTS ISSUE TRACKERS

COLLABORATION NETWORKS IN OPEN SOURCE PROJECTS ISSUE TRACKERS Friedrich-Alexander-Universität Erlangen-Nürnberg Technische Fakultät, Department Informatik BJÖRN MEIER MASTER THESIS COLLABORATION NETWORKS IN OPEN SOURCE PROJECTS ISSUE TRACKERS Submitted on November

More information

Register Allocation. CS 502 Lecture 14 11/25/08

Register Allocation. CS 502 Lecture 14 11/25/08 Register Allocation CS 502 Lecture 14 11/25/08 Where we are... Reasonably low-level intermediate representation: sequence of simple instructions followed by a transfer of control. a representation of static

More information

Heap Management. Heap Allocation

Heap Management. Heap Allocation Heap Management Heap Allocation A very flexible storage allocation mechanism is heap allocation. Any number of data objects can be allocated and freed in a memory pool, called a heap. Heap allocation is

More information

x86 assembly CS449 Fall 2017

x86 assembly CS449 Fall 2017 x86 assembly CS449 Fall 2017 x86 is a CISC CISC (Complex Instruction Set Computer) e.g. x86 Hundreds of (complex) instructions Only a handful of registers RISC (Reduced Instruction Set Computer) e.g. MIPS

More information

Chapter 5 - Input / Output

Chapter 5 - Input / Output Chapter 5 - Input / Output Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 5 - Input / Output 1 / 90 1 Motivation 2 Principle of I/O Hardware I/O Devices Device Controllers Memory-Mapped

More information

Engine Support System. asyrani.com

Engine Support System. asyrani.com Engine Support System asyrani.com A game engine is a complex piece of software consisting of many interacting subsystems. When the engine first starts up, each subsystem must be configured and initialized

More information

Compilation /15a Lecture 7. Activation Records Noam Rinetzky

Compilation /15a Lecture 7. Activation Records Noam Rinetzky Compilation 0368-3133 2014/15a Lecture 7 Activation Records Noam Rinetzky 1 Code generation for procedure calls (+ a few words on the runtime system) 2 Code generation for procedure calls Compile time

More information

CA Compiler Construction

CA Compiler Construction CA4003 - Compiler Construction David Sinclair When procedure A calls procedure B, we name procedure A the caller and procedure B the callee. A Runtime Environment, also called an Activation Record, is

More information

Distributed Systems. How do regular procedure calls work in programming languages? Problems with sockets RPC. Regular procedure calls

Distributed Systems. How do regular procedure calls work in programming languages? Problems with sockets RPC. Regular procedure calls Problems with sockets Distributed Systems Sockets interface is straightforward [connect] read/write [disconnect] Remote Procedure Calls BUT it forces read/write mechanism We usually use a procedure call

More information

Compilers and computer architecture: A realistic compiler to MIPS

Compilers and computer architecture: A realistic compiler to MIPS 1 / 1 Compilers and computer architecture: A realistic compiler to MIPS Martin Berger November 2017 Recall the function of compilers 2 / 1 3 / 1 Recall the structure of compilers Source program Lexical

More information

Machine Programming 3: Procedures

Machine Programming 3: Procedures Machine Programming 3: Procedures CS61, Lecture 5 Prof. Stephen Chong September 15, 2011 Announcements Assignment 2 (Binary bomb) due next week If you haven t yet please create a VM to make sure the infrastructure

More information

Rui Wang, Assistant professor Dept. of Information and Communication Tongji University.

Rui Wang, Assistant professor Dept. of Information and Communication Tongji University. Instructions: ti Language of the Computer Rui Wang, Assistant professor Dept. of Information and Communication Tongji University it Email: ruiwang@tongji.edu.cn Computer Hierarchy Levels Language understood

More information

Operational Semantics of Cool

Operational Semantics of Cool Operational Semantics of Cool Lecture 23 Dr. Sean Peisert ECS 142 Spring 2009 1 Status Project 3 due on Friday Project 4 assigned on Friday. Due June 5, 11:55pm Office Hours this week Fri at 11am No Class

More information

Lehrstuhl für Informatik 10 (Systemsimulation)

Lehrstuhl für Informatik 10 (Systemsimulation) FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG INSTITUT FÜR INFORMATIK (MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG) Lehrstuhl für Informatik 10 (Systemsimulation) Efficient Implementation of Finite

More information

Apple LLVM GPU Compiler: Embedded Dragons. Charu Chandrasekaran, Apple Marcello Maggioni, Apple

Apple LLVM GPU Compiler: Embedded Dragons. Charu Chandrasekaran, Apple Marcello Maggioni, Apple Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1 Agenda How Apple uses LLVM to build a GPU Compiler Factors that affect GPU performance The Apple GPU compiler

More information

CSE 230 Intermediate Programming in C and C++ Functions

CSE 230 Intermediate Programming in C and C++ Functions CSE 230 Intermediate Programming in C and C++ Functions Fall 2017 Stony Brook University Instructor: Shebuti Rayana shebuti.rayana@stonybrook.edu http://www3.cs.stonybrook.edu/~cse230/ Concept of Functions

More information

Tag der mündlichen Prüfung: 03. Juni 2004 Dekan / Dekanin: Prof. Dr. Bernhard Steffen Gutachter / Gutachterinnen: Prof. Dr. Francky Catthoor, Prof. Dr

Tag der mündlichen Prüfung: 03. Juni 2004 Dekan / Dekanin: Prof. Dr. Bernhard Steffen Gutachter / Gutachterinnen: Prof. Dr. Francky Catthoor, Prof. Dr Source Code Optimization Techniques for Data Flow Dominated Embedded Software Dissertation zur Erlangung des Grades eines Doktors der Naturwissenschaften der Universität Dortmund am Fachbereich Informatik

More information

Processes. Johan Montelius KTH

Processes. Johan Montelius KTH Processes Johan Montelius KTH 2017 1 / 47 A process What is a process?... a computation a program i.e. a sequence of operations a set of data structures a set of registers means to interact with other

More information

Process. Program Vs. process. During execution, the process may be in one of the following states

Process. Program Vs. process. During execution, the process may be in one of the following states What is a process? What is process scheduling? What are the common operations on processes? How to conduct process-level communication? How to conduct client-server communication? Process is a program

More information

The Procedure Abstraction

The Procedure Abstraction The Procedure Abstraction Procedure Abstraction Begins Chapter 6 in EAC The compiler must deal with interface between compile time and run time Most of the tricky issues arise in implementing procedures

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distributed Systems Principles and Paradigms Chapter 03 (version February 11, 2008) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20.

More information

Building a Runnable Program and Code Improvement. Dario Marasco, Greg Klepic, Tess DiStefano

Building a Runnable Program and Code Improvement. Dario Marasco, Greg Klepic, Tess DiStefano Building a Runnable Program and Code Improvement Dario Marasco, Greg Klepic, Tess DiStefano Building a Runnable Program Review Front end code Source code analysis Syntax tree Back end code Target code

More information

A process. the stack

A process. the stack A process Processes Johan Montelius What is a process?... a computation KTH 2017 a program i.e. a sequence of operations a set of data structures a set of registers means to interact with other processes

More information

CSE 504: Compiler Design. Runtime Environments

CSE 504: Compiler Design. Runtime Environments Runtime Environments Pradipta De pradipta.de@sunykorea.ac.kr Current Topic Procedure Abstractions Mechanisms to manage procedures and procedure calls from compiler s perspective Runtime Environment Choices

More information

Function Call Stack and Activation Records

Function Call Stack and Activation Records 71 Function Call Stack and Activation Records To understand how C performs function calls, we first need to consider a data structure (i.e., collection of related data items) known as a stack. Students

More information

Announcements. My office hours are today in Gates 160 from 1PM-3PM. Programming Project 3 checkpoint due tomorrow night at 11:59PM.

Announcements. My office hours are today in Gates 160 from 1PM-3PM. Programming Project 3 checkpoint due tomorrow night at 11:59PM. IR Generation Announcements My office hours are today in Gates 160 from 1PM-3PM. Programming Project 3 checkpoint due tomorrow night at 11:59PM. This is a hard deadline and no late submissions will be

More information

Naming in OOLs and Storage Layout Comp 412

Naming in OOLs and Storage Layout Comp 412 COMP 412 FALL 2018 Naming in OOLs and Storage Layout Comp 412 source IR IR target Front End Optimizer Back End Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in

More information

Allows program to be incrementally parallelized

Allows program to be incrementally parallelized Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP

More information

Short Notes of CS201

Short Notes of CS201 #includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system

More information

Programming Languages

Programming Languages TECHNISCHE UNIVERSITÄT MÜNCHEN FAKULTÄT FÜR INFORMATIK Programming Languages Concurrency: Atomic Executions, Locks and Monitors Dr. Michael Petter Winter term 2016 Atomic Executions, Locks and Monitors

More information

Procedure and Object- Oriented Abstraction

Procedure and Object- Oriented Abstraction Procedure and Object- Oriented Abstraction Scope and storage management cs5363 1 Procedure abstractions Procedures are fundamental programming abstractions They are used to support dynamically nested blocks

More information

Two Phase Commit Protocol. Distributed Systems. Remote Procedure Calls (RPC) Network & Distributed Operating Systems. Network OS.

Two Phase Commit Protocol. Distributed Systems. Remote Procedure Calls (RPC) Network & Distributed Operating Systems. Network OS. A distributed system is... Distributed Systems "one on which I cannot get any work done because some machine I have never heard of has crashed". Loosely-coupled network connection could be different OSs,

More information

CS399 New Beginnings. Jonathan Walpole

CS399 New Beginnings. Jonathan Walpole CS399 New Beginnings Jonathan Walpole Memory Management Memory Management Memory a linear array of bytes - Holds O.S. and programs (processes) - Each cell (byte) is named by a unique memory address Recall,

More information

JAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder

JAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder JAVA PERFORMANCE PR SW2 S18 Dr. Prähofer DI Leopoldseder OUTLINE 1. What is performance? 1. Benchmarking 2. What is Java performance? 1. Interpreter vs JIT 3. Tools to measure performance 4. Memory Performance

More information

Algorithms for Dynamic Memory Management (236780) Lecture 4. Lecturer: Erez Petrank

Algorithms for Dynamic Memory Management (236780) Lecture 4. Lecturer: Erez Petrank Algorithms for Dynamic Memory Management (236780) Lecture 4 Lecturer: Erez Petrank!1 March 24, 2014 Topics last week The Copying Garbage Collector algorithm: Basics Cheney s collector Additional issues:

More information

Optimizing Closures in O(0) time

Optimizing Closures in O(0) time Optimizing Closures in O(0 time Andrew W. Keep Cisco Systems, Inc. Indiana Univeristy akeep@cisco.com Alex Hearn Indiana University adhearn@cs.indiana.edu R. Kent Dybvig Cisco Systems, Inc. Indiana University

More information