Distributed Object Sharing for Cluster-based Java Virtual Machine


Distributed Object Sharing for Cluster-based Java Virtual Machine

Fang Weijian

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the University of Hong Kong

2004

Abstract of thesis entitled "Distributed Object Sharing for Cluster-based Java Virtual Machine", submitted by Fang Weijian for the degree of Doctor of Philosophy at the University of Hong Kong in 2004.

Since its debut, Java has become one of the most popular programming languages. Recent advances in Java compilation and execution technologies have further pushed Java into the arena of high performance parallel and distributed computing. Meanwhile, the computer cluster has in recent years been accepted as a scalable and affordable parallel computing platform by both academia and industry. We were therefore inspired to design a cluster-based Java Virtual Machine (JVM) that can run unmodified multi-threaded Java applications on a computer cluster, where Java threads are automatically distributed to different computer nodes to achieve high parallelism and to leverage cluster-wide resources such as memory and network bandwidth. In a cluster-based JVM, the shared memory nature of Java threads calls for a global object space (GOS) that virtualizes a single Java object heap spanning the cluster to facilitate transparent distributed object sharing. The performance of the cluster-based JVM hinges on the GOS's ability to minimize the communication and coordination overheads of maintaining the single object heap illusion.

Different from previous approaches to building a cluster-based JVM, we build the GOS as an object-based distributed shared memory (DSM) service embedded in the cluster-based JVM, which facilitates the exploitation of abundant runtime information for performance improvement. Distributed-shared objects (DSOs), i.e., objects that are reachable from threads at different nodes, are detected to facilitate efficient consistency maintenance and memory management in the cluster-based JVM. Furthermore, based on the concept of the DSO, we propose a framework that characterizes object access patterns along three orthogonal dimensions. With this framework, we are able to effectively calibrate the runtime memory access patterns and dynamically apply an adaptive cache coherence protocol to minimize the consistency maintenance overhead. The adaptation devices include adaptive object home migration, which optimizes the single-writer access pattern; synchronized method migration, which allows the execution of a synchronized method to take place remotely at the home node of its locked object; and connectivity-based object pushing, which uses object connectivity information to optimize the producer-consumer access pattern. Extensive experiments have demonstrated the effectiveness of our design.

Declarations

I hereby declare that the thesis entitled "Distributed Object Sharing for Cluster-based Java Virtual Machine" represents my own work and has not been previously submitted to this or any other institution for a degree, diploma, or other qualification.

Fang Weijian
2004

Acknowledgements

I would like to thank my supervisors, Dr. Cho-Li Wang and Dr. Francis C. M. Lau, for their advice and help with my research and daily life, which have been endless, patient, and invaluable. It is their encouragement and support that have brought this research to completion. In particular, the experience of intensively revising papers before deadlines with Dr. Wang was painful, but remarkably rewarding. From them, I learned not only how to write papers but also how to do research. Dr. Lau has been inspiring and enlightening in directing my research. I also want to thank my internal and external examiners for their valuable comments on my thesis.

It has been my pleasure to work with Zhu Wenzhang during my PhD study. I am full of gratitude for his suggestions and cooperation. It has also been my pleasure to hike with him; he is energetic in both hiking and research. I would like to thank many colleagues at HKU: Wang Lian, Wang Tianqi, Chen Weisong, Chen Lin, Chen Ge, Zhu DongLai, Li Wei, Yin Kangkai, and others. I really enjoyed the time spent with them. I also want to thank Benny Cheung, Roy Ho, and Anthony Tam for their help with my research and teaching work.

Finally, I want to express my deepest gratitude to my wife and my parents.

Contents

Declarations
Acknowledgements

1 Introduction
  1.1 Java and Java Virtual Machine
  1.2 Cluster Computing
  1.3 Cluster-based Java Virtual Machine
  1.4 Global Object Space
  1.5 Our Approach
  1.6 Contributions of the Thesis
  1.7 Thesis Organization

2 Background
  2.1 Software Distributed Shared Memory
    2.1.1 Memory Consistency Model
    2.1.2 Classification Based on the Coherence Granularity
  2.2 Java Memory Model

3 Memory Access Pattern
  3.1 Memory Access Pattern Optimization in DSM
    3.1.1 Programmer Annotation
    3.1.2 Compiler Analysis
    3.1.3 Runtime Adaptation
  3.2 Access Pattern Space

4 Distributed-Shared Object
  4.1 Definitions
  4.2 Benefits from DSO Detection
    4.2.1 Benefits on Memory Consistency Maintenance
    4.2.2 Benefits on Memory Management
  4.3 Lightweight DSO Detection and Reclamation
  4.4 Basic Cache Coherence Protocol

5 Adaptive Cache Coherence Protocol
  5.1 Adaptive Object Home Migration
    5.1.1 Home Migration Concepts
    5.1.2 Home Migration with Adaptive Threshold
  5.2 Synchronized Method Migration
  5.3 Connectivity-based Object Pushing

6 Object Access Pattern Visualization
  6.1 Object Access Trace Generator
  6.2 Pattern Analysis Engine
  6.3 Pattern Visualization Component

7 Implementation
  7.1 JIT Compiler Enabled Native Instrumentation
  7.2 Distributed Threading and Synchronization
    7.2.1 Thread Distribution
    7.2.2 Thread Synchronization
    7.2.3 JVM Termination
  7.3 Non-Blocking I/O Support
  7.4 Distributed Class Loading
  7.5 Garbage Collection
    7.5.1 Local Garbage Collection
    7.5.2 Distributed Garbage Collection

8 Performance Evaluation
  8.1 Experiment Environment
  8.2 Application Suite
    8.2.1 CPI
    8.2.2 ASP
    8.2.3 SOR
    8.2.4 NBody
    8.2.5 NSquared
    8.2.6 TSP
  8.3 Application Performance
    8.3.1 Sequential Performance
    8.3.2 Parallel Performance
  8.4 Effects of Adaptations
    8.4.1 Adaptive Object Home Migration
    8.4.2 Synchronized Method Migration
    8.4.3 Connectivity-based Object Pushing
  8.5 Sensitivity and Robustness Analysis for HM Protocol
  8.6 More on Synchronized Method Migration

9 Related Work
  9.1 Overview
  9.2 Augmenting Java for Parallel Computing
    9.2.1 Language Augmentation
    9.2.2 Class Augmentation
  9.3 Cluster-based JVM
    9.3.1 Jackal
    9.3.2 Hyperion
    9.3.3 JavaSplit
    9.3.4 cJVM
    9.3.5 JESSICA
    9.3.6 Java/DSM

10 Conclusion
  10.1 Discussions
    10.1.1 Effectiveness of the Adaptations
    10.1.2 Which Existing JVM Is Based On
    10.1.3 Thread Migration vs. Initial Placement
  10.2 Future Work
    10.2.1 Compiler Analysis to Reduce Software Checks
    10.2.2 Automatic Performance Bottleneck Detection
    10.2.3 High Performance Communication Substrate

A Appendix
  A.1 Overheads of GOS Primitive Operations
  A.2 ASP Code Segment
  A.3 The Method for Parallel Performance Breakdown
  A.4 JIT Compilation vs. Interpretation

List of Figures

3.1 The object access pattern space
The detection of distributed-shared object
The state transition graph depicting object lifecycle in the GOS
Home-based protocol for LRC with multiple-writer support
Barrier class
PAT architecture
Memory access operations in GOS
Phase parallel paradigm
The time lines window
The window of the object access pattern analysis result (the bigger one) and the window of the application's source code (the smaller one)
Pseudo code for an access check: using a function call
Pseudo code for an access check: by comparison
Detailed pseudo code for a read check
IA32 assembly code for a read check
Remote unlock of a DSO
JVM's dynamic loading, linking, and initialization of classes
Tolerating inconsistency in DGC
DSO reference diffusion tree
8.1 The typical operation in SOR
Barnes-Hut tree for 2D space decomposition
Single node performance
Speedup
Breakdown of normalized execution time against number of processors
The adaptive protocol vs. the basic protocol
Effects of adaptations w.r.t. execution time
Effects of adaptations w.r.t. message number
Effects of adaptations w.r.t. network traffic
The effect of object home migration on SOR
RCounter's source code skeleton run by each thread
Effects of home migration protocols against repetition of single-writer pattern: normalized execution time (RCounter)
Effects of home migration protocols against repetition of single-writer pattern: normalized message number (RCounter)
DSOR's source code skeleton run by each thread
Effects of home migration protocols against repetition of single-writer pattern: normalized execution time (DSOR)
Effects of home migration protocols against repetition of single-writer pattern: normalized message number (DSOR)
Effect of synchronized method migration on the barrier operation against the number of processors
ASP's execution times on different problem sizes
JavaSplit's code sample to send and receive objects
A.1 The source code to measure GOS primitive operations
A.2 JIT compilation vs. interpretation

List of Tables

4.1 Coherence protocols according to object type
Communication effort on 16 processors
A.1 Overheads (in microseconds) of primitive operations with respect to different numbers of threads

Chapter 1

Introduction

1.1 Java and Java Virtual Machine

In less than ten years since its debut on May 23, 1995 at SunWorld '95, Java [31] has become one of the most popular programming languages. The following features contribute to its success. Java adopts a simplified, C++-like grammar, which makes it a simple yet expressive object-oriented language. Java is also a concurrent programming language, supporting multi-threading. Moreover, Java is by design a platform-independent language, through the introduction of the bytecode: Java source code is first compiled to the standard bytecode, which in turn can run on any platform where there is a Java Virtual Machine (JVM) [60]. The JVM is the runtime system responsible for executing Java bytecode.

The JVM provides some very attractive runtime features, such as automatic memory management through garbage collection [79], multi-threading support, and runtime safety checks that include array boundary checks as well as reference type checks. The Java Development Kit (JDK) provides abundant libraries to support collections, sockets, Remote Method Invocation (RMI) [75], object serialization [77], etc. Although Java has long been considered a productive and universal language, its performance used to be unsatisfactory due to the poor performance of the JVM. However, recent advances in Java compilation and execution technology, such as the just-in-time compiler [73], the HotSpot technology [76], and incremental garbage collection [52], add to the attractiveness of Java as a language for high performance scientific and engineering computing [6]. As a consequence, more and more researchers are adopting Java in high performance parallel and distributed computing [19][20].

1.2 Cluster Computing

A cluster is a type of parallel or distributed processing system that consists of a collection of interconnected stand-alone computers working together as a single, integrated computing resource [32]. In recent years, the computer cluster has been widely accepted as a scalable and affordable parallel computing platform by both academia and industry [30, 26, 41]. For example, in the TOP500 list [17] released in November 2003, 41.6% of the supercomputers, i.e., 208 systems, are clusters, and they account for 49.8% of the aggregated performance. The prosperity of cluster computing is attributed to ever advancing commodity high performance microprocessors and high-speed networks, as well as to open source cluster software such as Rocks [12] for Linux cluster software installation, Torque [18] for resource management, Maui [10] for job scheduling, MPICH [11] for message passing programming, and Ganglia [2] for cluster monitoring.

Nevertheless, cluster programming is still a challenging task. One of the major programming paradigms on clusters is message passing, e.g., following the MPI standard [16]. The message passing paradigm requires programmers to write explicit code to send and receive data in order to coordinate the processes on different cluster nodes. With message passing, superior performance is usually achievable by fine-tuning the timing and content of each message, which is widely believed to be a painful and error-prone process. Alternatively, software distributed shared memory (DSM) [1] promises better programmability than the message passing paradigm by providing a globally shared memory abstraction across physically distributed memory machines. In software DSM, programmers access distributed data in the same way as local data, and special APIs are provided to synchronize the parallel processes. To improve performance, shared data units can be replicated on multiple nodes; inconsistency among the replicas is resolved according to a memory consistency model [21]. Since data coherence is enforced automatically by the DSM infrastructure, communication may happen more frequently, and involve more data traffic, than necessary. For example, the update or invalidation of a cached copy is unnecessary if the copy will not be used anymore.

1.3 Cluster-based Java Virtual Machine

Motivated by both the programmability of Java and the ample availability of clusters as a cost-effective parallel computing environment, the transparent and parallel execution of multi-threaded Java programs on clusters has become a research hotspot [62, 78, 82, 24, 61, 42, 44]. In this work, we build a cluster-based Java Virtual Machine to tackle this problem.

A cluster-based JVM conforms to the JVM specification [60], but runs on a cluster. With a cluster-based JVM, the Java threads created within one program can be transparently distributed onto different cluster nodes to achieve a higher degree of execution parallelism. In addition, cluster-wide resources such as memory, I/O, and network bandwidth can be unified and used as a whole to solve resource-demanding problems. A cluster-based JVM is also called a distributed JVM. A cluster-based JVM is composed of a group of collaborating daemons, one on each cluster node. Each daemon is a standard JVM augmented with cluster awareness and the capability to cooperate with the other daemons in order to present a single system image (SSI) [53] of the cluster to Java applications. The single system image is enabled through the global object space, which will be discussed in the next section. The adoption of the cluster-based JVM for parallel Java computing can boost cluster programming productivity. Given that the cluster-based JVM conforms to the JVM specification, any Java program can run on it without modification. The steep learning curve can thus be avoided, since programmers do not need to learn a new parallel language, a new message passing library, or a new tool in order to develop parallel programs on clusters. It is also convenient for program development, as a multi-threaded program can be implemented and tested on a non-parallel computer before it is submitted to a cluster for execution. Finally, many existing multi-threaded Java applications, especially server applications, can be ported to clusters when a cost-effective parallel platform is sought.
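For concreteness, the following is an ordinary multi-threaded Java program of the kind a cluster-based JVM targets: it uses only the standard java.lang.Thread API and contains no cluster-specific code. The program itself is our own illustrative example, not taken from the thesis.

```java
// A plain multi-threaded Java program. On a cluster-based JVM, the four
// threads could transparently be scheduled onto different cluster nodes,
// with 'partial' living in the global object space; no source changes needed.
public class ParallelSum extends Thread {
    static final int N = 4;
    static final long[] partial = new long[N];  // shared state in the object heap
    final int id;

    ParallelSum(int id) { this.id = id; }

    public void run() {
        long sum = 0;
        for (long i = id; i < 1_000_000; i += N) sum += i;
        partial[id] = sum;                      // each thread writes only its own slot
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[N];
        for (int i = 0; i < N; i++) (ts[i] = new ParallelSum(i)).start();
        for (Thread t : ts) t.join();
        long total = 0;
        for (long p : partial) total += p;
        System.out.println(total);              // 499999500000
    }
}
```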

1.4 Global Object Space

In a cluster-based JVM, as Java threads are distributed around the cluster, the shared memory nature of Java threads calls for a global object space (GOS) that virtualizes a single Java object heap spanning the cluster to facilitate transparent distributed object sharing. In the GOS, object replication is encouraged to improve data locality, which raises the consistency issue. The memory consistency issue is solved according to the Java memory model (Chapter 8 of the JVM specification [60]); in particular, memory consistency operations are triggered by thread synchronization. The GOS is responsible for enforcing the Java memory model, as well as for handling thread distribution and location-transparent synchronization. In addition, in order to comply fully with the JVM specification, the GOS needs to perform distributed garbage collection for automatic memory management. The GOS is in essence a DSM service with functionality extensions in an object-oriented Java system. The performance of the cluster-based JVM hinges on the GOS's ability to minimize the communication and coordination overheads of maintaining the single object heap illusion. It is challenging to design and implement a GOS that is both complete in terms of functionality and efficient in terms of performance.

1.5 Our Approach

We design a cluster-based JVM. Different from previous approaches [82, 61] that leverage a page-based DSM as the underlying infrastructure to build the GOS, we build a GOS embedded in the cluster-based JVM [43]. In this architecture, the GOS is able to exploit abundant runtime information in the JVM, particularly object type information, to improve performance. We leverage runtime object connectivity information to detect distributed-shared objects (DSOs). DSOs are the objects that are reachable from at least two threads located at different cluster nodes. The identification of DSOs allows us to handle the memory consistency problem more precisely and efficiently.

For example, in Java, synchronization primitives are used not only to protect critical sections but also to maintain memory consistency. Clearly, only synchronizations on DSOs may involve multiple threads on different nodes; the identification of DSOs can therefore reduce the frequency of consistency-related memory operations. Moreover, since only DSOs that are replicated on multiple nodes are involved in consistency maintenance, the detection of DSOs leads to a more efficient implementation of the cache coherence protocol. The identification of DSOs also facilitates distributed garbage collection.

The choice of a good cache coherence protocol is often application-dependent. That is, the particular memory access patterns in an application speak for the more suitable protocol. That motivates us to go after an adaptive protocol. An adaptive cache coherence protocol is able to detect the current access pattern and adjust itself accordingly. We believe that adaptive protocols are superior to non-adaptive ones due to their adaptability to the object access patterns in applications. In our design, we use an object-based adaptive cache coherence protocol to implement the Java memory model. The challenges in designing an effective and efficient adaptive cache coherence protocol are: (1) whether we can determine the important access patterns that occur frequently or that contribute a significant amount of overhead to the GOS, and (2) whether the runtime system can efficiently and correctly identify such target access patterns and apply the corresponding adaptations in a timely fashion.

To address the first challenge, we propose the access pattern space [44] as a framework to characterize object access behavior. This space has three dimensions: number of writers, synchronization, and repetition. We identify some basic access patterns along each dimension: multiple-writer, single-writer, and read-only for the number-of-writers dimension; mutual exclusion and condition for the synchronization dimension; and patterns with different numbers of consecutive repetitions for the repetition dimension.

A combination of basic patterns along the three dimensions then portrays an actual runtime memory access pattern. This 3-D access pattern space serves as a foundation on which we can identify the important object access patterns in the distributed JVM. We can then choose the right adaptations to match these access patterns and improve the overall performance of the GOS. To meet the second challenge, we take advantage of the fact that the GOS is embedded in the cluster-based JVM. Our adaptive protocol can leverage all runtime object type and access information to efficiently and accurately identify the access patterns worthy of special focus. We apply three protocol adaptations to the basic home-based multiple-writer cache coherence protocol, in three respective situations in the access pattern space: (1) adaptive object home migration, which optimizes the single-writer access pattern by moving the object's home to the writing node according to the access history; (2) synchronized method migration, which chooses between default object (data) movement and optional method (control flow) movement in order to optimize the execution of critical section methods according to some prior knowledge; (3) connectivity-based object pushing, which scales the transfer unit to optimize the producer-consumer access pattern according to object connectivity information.

1.6 Contributions of the Thesis

We summarize the contributions of this thesis as follows:

1. We design a global object space embedded in a cluster-based JVM that exploits Java's runtime information to improve performance. In particular, distributed-shared objects are identified at run time to reduce the overhead of memory consistency maintenance and to facilitate distributed garbage collection.

2. We propose an object access pattern space as a framework to characterize object access behavior.

3. We propose a novel object home migration protocol that optimizes the single-writer access pattern. The protocol demonstrates both sensitivity to a lasting single-writer pattern and robustness against a transient single-writer pattern; in the latter case, the protocol inhibits home migration in order to reduce the home notification overhead.

4. We propose other optimizations in our GOS, including synchronized method migration, which allows the execution of a synchronized method to take place remotely at the home node of its locked object, and connectivity-based object pushing, which uses object connectivity information to optimize the producer-consumer access pattern.

5. We design and implement a visualization tool called PAT (Pattern Analysis Tool) that can be used to visualize object access traces and analyze object access patterns in our GOS.

6. We have prototyped a cluster-based JVM with our GOS design and all the optimizations incorporated. Extensive experiments demonstrate the performance of our GOS and the effectiveness of the optimizations.

1.7 Thesis Organization

Chapter 2 introduces the background of this research. Chapter 3 elaborates on memory access patterns in DSM and the GOS. Chapter 4 presents the concept of the distributed-shared object and how we leverage it to improve GOS performance. Chapter 5 elaborates on the adaptations we have adopted. Chapter 6 presents our pattern analysis tool used to visualize object access patterns. Chapter 7 discusses some implementation details of our cluster-based JVM. Chapter 8 reports the experiments we conducted to measure the performance of the prototype based on our design.

Chapter 9 discusses related work and compares it with this work. Chapter 10 gives the conclusion and presents a possible agenda for future work.

Chapter 2

Background

To support the truly parallel execution of Java threads on a cluster, we need a global object space for transparent distributed object access. The concept of the global object space is rooted in software distributed shared memory, a well-established research area in cluster computing. In this chapter, the concepts of distributed shared memory are introduced. We also discuss Java's special constraints on the global object space, i.e., the Java memory model.

2.1 Software Distributed Shared Memory

Software distributed shared memory (DSM; throughout this thesis, DSM denotes software distributed shared memory) [1] promises higher programmability than the message passing paradigm by providing a globally shared memory abstraction across physically distributed memory machines. To improve performance, the replication of shared data is allowed. The data consistency issue is solved by well-defined memory consistency models [21].

2.1.1 Memory Consistency Model

The memory consistency model of a DSM system provides a formal specification of how the memory system will appear to the programmer [21]. It defines the restrictions on the legal values that a read can return among the writes performed by other processors. From the viewpoint of programmers, sequential consistency [58] is the most intuitive model: it requires that the memory accesses within each individual process follow program order, and that writes be made atomically visible to all processes. Though intuitive, sequential consistency suffers from poor performance. It not only prohibits some common compiler optimizations, such as reordering memory accesses to different memory locations, but also results in excessive data communication on a distributed shared memory platform [59].

In order to improve the efficiency of DSM, relaxations of the memory ordering constraints imposed by sequential consistency have been considered. Lazy release consistency (LRC) [56] is one of the state-of-the-art relaxed consistency models widely used in software DSM systems. LRC distinguishes synchronization variables from normal shared variables, and defines two operations on synchronization variables: acquire and release. An acquire tells the memory system that a critical region is about to be entered; a release tells it that a critical region is about to be exited. In LRC, when a process P1 acquires a synchronization variable that was most recently released by another process P2, all the writes that were visible to P2 at the time it released the synchronization variable become visible to P1. LRC allows common compiler optimizations, and it allows write propagation to be postponed and batched until synchronization points. Moreover, correctly synchronized, data-race-free LRC programs have sequentially consistent behavior [22]; thus it is intuitive for programmers to reason about the execution of a data-race-free LRC program.

2.1.2 Classification Based on the Coherence Granularity

According to the coherence granularity, there are three kinds of DSM systems: page-based DSMs, whose granularity is a virtual memory page; object-based DSMs, whose granularity is a variable-sized structured data unit defined by the application; and fine-grain DSMs, whose granularity is a fixed-sized memory block much smaller than a virtual memory page.

Page-based DSM

A page-based DSM's coherence granularity is the virtual memory page. The page-based DSM leverages the memory management unit (MMU) to intercept a faulting access to a shared page that is not locally available, either because the local copy is obsolete or because the page is not cached at all. The DSM then fetches a valid copy from the other nodes according to the memory consistency model and resumes the faulting access. The advantage of the page-based DSM is that, by using the MMU, only faulting accesses are trapped; all non-faulting accesses proceed at full speed. However, a virtual memory page is as large as 4K bytes, which raises the false sharing problem. False sharing happens when two processes independently access different parts of the same page; the DSM's effort to give the two processes the same view of the page is unnecessary for the correctness of the program. False sharing can be a serious performance issue in page-based DSMs, particularly for applications with fine-grain sharing characteristics.

TreadMarks [57] is a page-based DSM that adopts a homeless cache coherence protocol to implement lazy release consistency [56]. TreadMarks uses the twin and diff techniques to let multiple processes write to the same shared virtual memory page simultaneously in spite of false sharing. On a write fault to a locally cached page, a copy of that page, called the twin, is created. Later, the diff, i.e., the set of local updates performed since, can be computed by comparing the current page with the previously saved twin.
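The twin/diff idea can be sketched in a few lines. The following is our own minimal illustration, with pages modeled as byte arrays; a real system such as TreadMarks traps write faults through the MMU rather than calling an onWriteFault method explicitly, and encodes diffs as run-length-compressed ranges rather than single bytes.

```java
// Sketch of the twin/diff technique: on the first write fault to a cached
// page, save a copy (the twin); at flush time, compare the current page
// with the twin to extract only the bytes modified locally (the diff).
import java.util.ArrayList;
import java.util.List;

class Page {
    byte[] data;
    byte[] twin;                       // created lazily on the first write fault

    Page(int size) { data = new byte[size]; }

    void onWriteFault() {
        if (twin == null) twin = data.clone();   // remember pre-write contents
    }

    // Each diff entry is (offset, newValue).
    List<int[]> computeDiff() {
        List<int[]> diff = new ArrayList<>();
        for (int i = 0; i < data.length; i++)
            if (data[i] != twin[i]) diff.add(new int[] { i, data[i] });
        twin = null;                   // discard the twin after diffing
        return diff;
    }
}
```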

The protocol is considered homeless because the diffs are saved and managed at each process. Comparatively, HLRC [55] uses a home-based protocol to implement LRC. In a home-based protocol, each shared coherence unit has a home, to which all writes (diffs) are propagated and from which all copies are derived. It has been shown that the home-based protocol is more scalable than the homeless protocol, because it maintains simpler state, sends fewer messages, has a lower diff overhead, and consumes much less memory [55].

Object-based DSM

Having observed that the false sharing problem is rooted in the mismatch between the sharing granularity of page-based DSM systems and that of the applications, some researchers introduced the concept of object-based DSM. An object-based DSM's coherence granularity is an object, a structured data unit defined by the application. Most existing object-based DSM systems are language-based: they are either new parallel programming languages (e.g., Orca [28] and Jade [70]) or modifications of existing languages such as C (e.g., Munin [33] and Midway [83]). In both cases, the compiler or a preprocessor is leveraged to hook the source code up with the routines of the corresponding object-based DSM library. Object-based DSMs reduce the false sharing problem, owing to the relatively small probability of two processes independently accessing different parts of the same shared object. However, they raise another performance issue: since the MMU cannot be used to trap faulting accesses to arbitrarily sized objects, software checks must be inserted before memory accesses to guarantee that the accessed objects are in the right access state. These software access checks can introduce a large overhead in object-based DSMs.
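Conceptually, an inserted software read check looks like the sketch below. The state names and the fetch routine here are placeholders of our own; the thesis's actual compiler-generated checks are shown later in pseudo code and IA32 assembly.

```java
// Sketch of a software read check inserted before an object access.
// If the local copy is invalid, the valid copy is faulted in from elsewhere.
enum AccessState { INVALID, READ, WRITE }

class SharedObject {
    AccessState state = AccessState.INVALID;
    int field;
}

class AccessChecks {
    static int readField(SharedObject o) {
        if (o.state == AccessState.INVALID) {
            fetchValidCopy(o);             // placeholder: fault in up-to-date contents
            o.state = AccessState.READ;
        }
        return o.field;                    // the original access then proceeds
    }

    static void fetchValidCopy(SharedObject o) { /* communication omitted */ }
}
```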

Fine-grain DSM

The fine-grain DSM is a trade-off between the page-based DSM and the object-based DSM. The fine-grain DSM provides a shared memory address space just as the page-based DSM does: the copies of the same shared data reside at the same virtual memory address on all nodes, which eases memory management and data transfer among nodes. To reduce the false sharing problem, the coherence granularity of a fine-grain DSM is much smaller than that of a page-based DSM. For example, the fine-grain DSM Shasta [71] has a variable-sized coherence granularity, called a block, which is a multiple of a line occupying 64 or 128 bytes of memory. Software checks are inserted before memory accesses to guarantee that the shared data are in the right state, as in object-based DSMs; Shasta has demonstrated a set of techniques to reduce these software checks. Jackal [78] also uses a fine-grain DSM to build the GOS for a cluster-based JVM.

2.2 Java Memory Model

Java is a programming language incorporating multi-threading support. Java threads interact with each other through a shared memory, i.e., the object heap. It is thus necessary to define the rules describing which values may be seen by a read of shared memory that is updated by multiple threads. The Java memory model (JMM) (Chapter 8 of the JVM specification [60]) defines the memory consistency semantics of multi-threaded Java programs. There is a lock associated with each object in Java. The Java language provides the synchronized keyword, used in either a synchronized method or a synchronized statement, for synchronization among multiple threads. Entering or exiting a synchronized block corresponds to acquiring or releasing the lock of the specified object.

A synchronized method or statement is used not only to guarantee exclusive access to the critical section, but also to maintain memory consistency of objects among all the threads that have performed synchronization operations on the same lock. An abstract machine is defined in the JMM to describe threads' memory behavior. All threads share a main memory, which contains the master copies of all variables. A variable is an object field, an array element, or a static field. Each thread has its own working memory, which is its private cache of all the variables it uses. A use of a variable in the main memory causes it to be cached in the thread's working memory. The JMM defines that before a thread releases a lock, it must copy all assigned values in its working memory back to the main memory, and that before a thread acquires a lock, it must flush (invalidate) all variables in its working memory, so that later uses will load the up-to-date values from the main memory. In addition, with respect to a given lock, the acquire and release operations performed by all threads are sequentially consistent. The acquire and release operations have their embodiments in the Java bytecode set, namely monitorenter and monitorexit. The JMM resembles LRC in that acquire/release operations are used to establish a partial order between the memory actions performed by multiple threads. We follow the operations defined in the JVM specification to implement the JMM.
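In source code terms, these rules are what make the following (our illustrative example) correct: the monitorexit at the end of writer() flushes the assignment back to main memory, and the monitorenter in reader() invalidates the working memory, so a read that follows the writer's release is guaranteed to observe 42.

```java
// JMM visibility through lock acquire/release (illustrative example).
class Shared {
    int x;                     // a normal (non-volatile) variable
    final Object lock = new Object();

    void writer() {
        synchronized (lock) {  // monitorenter = acquire
            x = 42;
        }                      // monitorexit = release: copy assigned values to main memory
    }

    void reader() {
        synchronized (lock) {  // acquire: flush working memory, reload from main memory
            System.out.println(x);  // prints 42 if writer() released the lock first
        }
    }
}
```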

Revising the JMM

Some researchers argue that the current JMM is not well designed, because it prohibits some common compiler optimizations, causes some counterintuitive behavior, and even makes some well-known design patterns unsafe [68]. Currently, the JMM is under active revision through the JCP's procedures [8]. The JCP (Java Community Process) is the standard procedure for evolving Java technology through community effort, under the supervision of Sun Microsystems. A new JMM is expected to be introduced in the Tiger (1.5) release of Java to replace the original one. The latest information on the proposed JMM can be found at Pugh's website [15]; the proposal is still under constant revision. A detailed comparison between the current JMM and the proposed one is beyond the scope of this thesis. Here we simply list some major changes made in the proposed JMM:

The semantics of volatile variables have been strengthened to acquire and release semantics: a read of a volatile field has acquire semantics, and a write to a volatile field has release semantics.

The semantics of final fields have been strengthened to allow for thread-safe immutability: a read of a final field will always return the correctly initialized value, as long as the object reference is not exposed during object construction.

In addition, the proposed JMM states that useless synchronization has no memory semantics. A synchronization action is useless in a number of situations, including acquiring/releasing the lock of a thread-local object and re-acquiring an already acquired lock. This statement is very reasonable. Based on our understanding of the current and the proposed JMM, we believe that although our cluster-based JVM mainly follows the current JMM, it can be quickly adapted to the proposed JMM once it is officially approved.
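Under the strengthened volatile semantics, the classic flag-based hand-off below becomes safe. The example is ours, merely illustrating the acquire/release reading of volatile just described.

```java
// With the revised JMM, a volatile write has release semantics and a
// volatile read has acquire semantics, so 'data' is safely published.
class Handoff {
    int data;                  // ordinary field
    volatile boolean ready;    // volatile flag

    void producer() {
        data = 7;              // ordinary write...
        ready = true;          // ...made visible by the volatile write (release)
    }

    void consumer() {
        if (ready) {           // volatile read (acquire)
            System.out.println(data);  // guaranteed to print 7, not a stale value
        }
    }
}
```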

Chapter 3

Memory Access Pattern

A highlight of our cluster-based JVM is its adaptability to object access patterns. In this chapter, we first survey various memory access pattern optimizations in the area of DSM. We then propose an access pattern space as a framework to characterize object access behavior, which serves as the foundation for designing effective adaptations.

3.1 Memory Access Pattern Optimization in DSM

Although the DSM paradigm promises higher programmability than the message passing paradigm, it may involve more communication than necessary. For example, the update or invalidation of a cached copy is unnecessary if the copy will not be used anymore. To make the performance of DSM applications comparable with that of their message passing counterparts, researchers have investigated various ways to reduce the communication in DSMs. Many cache coherence protocols have been proposed to implement various memory consistency models.

The home-based protocol [55] assigns to each shared data object a home node, from which all copies are derived. It is widely believed that the home-based protocol is more scalable than the homeless protocol [57], for the reason that the former consumes less memory and can eliminate diff accumulation. The home in a home-based protocol can be either fixed [55] or mobile [35]. There are also variations in the coherence operations, such as the multiple-writer protocol versus the single-writer protocol. The single-writer protocol allows only one process to write to a shared data unit at a time; in order to become a writer, a process needs to acquire the write permission from the previous writer. The multiple-writer protocol, introduced in Munin [33], supports concurrent writes to different copies of the same object by using the diff technique; it may, however, incur heavy diff overhead compared with conventional single-writer protocols. Another choice is between the update protocol (e.g., Orca [28]) and the invalidate protocol; the latter is used in many page-based DSM systems such as TreadMarks [57] and JUMP [35]. The update protocol can prefetch data before the access, but it may send much more unneeded data than the invalidate protocol.

A promising approach to further improving the performance of DSM systems is to design adaptive cache coherence protocols that are able to detect and optimize memory access patterns. The rationale is that the particular memory access patterns in an application speak for the more suitable protocol; in other words, the choice of a good coherence protocol is often application-dependent, which motivates the pursuit of adaptive protocols. In this section, we discuss three approaches to memory access pattern optimization, namely programmer annotation, compiler analysis, and runtime adaptation.

3.1.1 Programmer Annotation

The programmer annotation approach requires programmers to explicitly annotate shared data objects with pattern declarations.

Strictly speaking, this approach does not use an adaptive cache coherence protocol; nevertheless, it manages to optimize some memory access patterns. Munin [33] follows the programmer annotation approach. Munin allows programmers to explicitly annotate an object with a pattern declaration: conventional, read-only, migratory, or write-shared. Each pattern has its own protocol that Munin uses at runtime. Munin applies a multiple-writer protocol to the write-shared pattern and a single-writer protocol to the conventional pattern. For the migratory pattern, objects migrate from machine to machine as critical regions are entered and exited. Read-only data are replicated on demand without further consistency maintenance, but a runtime error is generated if some process tries to write to read-only data. SAM [72] is an object-based DSM runtime system that supports the optimization of some object access patterns, such as the producer-consumer pattern and the accumulator pattern. An accumulator represents a piece of data that must be updated in a critical section. SAM provides synchronization primitives that let the user explicitly tie the patterns to object accesses. SAM automatically migrates accumulator data, and prefetches producer-consumer data before they are consumed.

3.1.2 Compiler Analysis

The programmer annotation approach lets programmers choose the most suitable cache coherence protocol among a set of candidates for an object presenting a particular access pattern. Although this approach helps to improve performance, it is inconvenient for programmers and error-prone. The compiler analysis approach tries to overcome this shortcoming by leveraging compiler analysis techniques to automatically extract the access pattern information from the programs. Orca [28] is a language-based DSM system. At runtime, a shared object is either replicated on all processors or not replicated at all.

For replicated objects, broadcast is used to deliver updates to all replicas; for non-replicated objects, remote procedure call is used to access the object. The actual replication policy for each object is determined by both the compiler and the runtime system. Orca's compiler estimates the expected read-to-write ratio of each shared object in the program; for example, an object with a large read/write ratio on a cheap broadcast network will be replicated on all processors. Orca's runtime system can also collect the actual read/write information to amend the compiler-derived decisions.

Although the compiler analysis approach is able to automatically extract access pattern information from programs, it has several shortcomings inherited from compiler analysis techniques. Firstly, since the input to the compiler is the program's source code, the compiler analyzes the program based on allocation sites. An allocation site is the location in the source code where object instances are created at runtime. Though compiler analysis works well when all object instances created from the same allocation site present the same access pattern, it may be difficult to distinguish among object instances from the same allocation site that present different access patterns. Secondly, the compiler analysis approach may find it difficult to detect access pattern changes; even if it is able to notice possible changes, it may be difficult to predict the actual time of change. Thirdly, compiler analysis cannot precisely predict the access patterns of a multi-threaded program without knowledge of the actual thread-to-node mapping. For example, suppose the compiler detects that two threads concurrently write to a large shared object. If the threads reside on different nodes at runtime, a multiple-writer protocol is suitable for the shared object, with the twin and diff techniques used to support the concurrent writers. However, if the two threads are on the same node at runtime, all the twin and diff overheads are simply wasted.

3.1.3 Runtime Adaptation

To overcome the shortcomings of the compiler analysis approach, researchers have been investigating a runtime approach to optimizing memory access patterns, called the runtime adaptation approach. It leverages an adaptive cache coherence protocol to detect and adapt to particular access patterns, and it is transparent to the programmers. Since all the runtime access information is accessible, precise and prompt access pattern optimizations are possible. Usually, the runtime adaptation approach speculatively detects access patterns based on some heuristics; false speculation can be corrected at runtime. Currently, most work on runtime adaptation is done on page-based DSMs. In the context of page-based DSMs, accesses to different objects residing in the same page are mingled at the page level, so it is difficult to detect access patterns in applications with fine-grain sharing.

Some homeless page-based DSM systems use adaptive cache coherence protocols to optimize memory access patterns. Adaptive TreadMarks [23] can adapt between a single-writer protocol and a multiple-writer protocol. The single-writer protocol does not use the twin and diff techniques; instead, a process must get the ownership of a shared page before writing to it. Adaptive TreadMarks switches to the single-writer protocol when it observes that the overhead of requesting and applying diffs is larger than that of requesting the whole page. It can also perform dynamic page aggregation, which groups several pages together as one coherence unit: when one page of the group is faulted in, the whole group of pages is faulted in, too. ADSM [64] can also adapt between the single-writer protocol and the multiple-writer protocol, based on the approximate association between locks and the data they protect. Initially, all pages are in the initial state, valid at and owned by process 0. Any access fault places a page in the migratory (MIG) state, until a write fault by another process happens.

The page is then placed in the multiple-writer state. The single-writer protocol is used for pages in the MIG state, and the multiple-writer protocol for pages in the multiple-writer state. From time to time, pages in the multiple-writer state can be reset to the initial state to allow continuous adaptation.

The asymmetry between the home copy and the non-home copies in home-based protocols raises the home assignment problem. In home-based protocols, the home copy is always valid: accesses on the home node never incur communication overhead, while accesses on non-home nodes trigger communication with the home node. Therefore, which node acts as the home changes the coherence data communication pattern, and thus influences application performance. In fact, the optimal home assignment is determined by the memory access pattern of the application. This has inspired several dynamic home assignment protocols that adapt to runtime memory access patterns. In JiaJia [51], a page-based DSM system, pages that are written by only one process between two barriers are recognized by the barrier manager, and their homes are migrated to the single writing process; new home notifications are piggybacked on barrier messages. JiaJia's home migration protocol optimizes only the single-writer pattern. Since JiaJia's approach relies on barrier synchronization, it will not work if the application does not use barriers or the DSM infrastructure does not expose a barrier function. For example, in our case, Java programmers have to implement barriers using more primitive synchronization operations such as lock/unlock/wait. Furthermore, since all the single-writer detection work is done centrally at the barrier manager, it may cause considerable overhead when there are a fair number of processes and shared pages. JUMP [35] adopts a migrating-home protocol in which the process requesting a page becomes the page's new home; the new home notification is broadcast to the other processes at synchronization points.

[Figure 3.1: The object access pattern space. The three axes are the number of writers (multiple writers, single writer, read only), synchronization (no synchronization, mutual exclusion (accumulator), condition (assignment)), and repetition, with an adaptation point marked on the repetition axis.]

Although the migrating-home approach results in fewer diffing operations, because the writes probably happen at the home node, the home migration decision ignores the inherent memory access patterns of the application. If the accesses by the process at the new home do not persist, home migration will not improve performance; instead, it can suffer from heavy home notification overhead. The worst case happens when the shared page is written by the processes sequentially, which produces numerous home notification messages.
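To make the trade-off concrete, the decision logic of a threshold-based home migration scheme can be sketched as follows. This is our own simplified illustration of the general idea, not the thesis's exact protocol (Chapter 5 presents the actual design, including an adaptive threshold); all names are placeholders.

```java
// Sketch of threshold-based home migration for the single-writer pattern.
// The home migrates only after the same non-home node has been the sole
// writer for 'threshold' consecutive consistency intervals, so a transient
// single-writer phase does not trigger costly home notifications.
class HomeMigrationPolicy {
    int homeNode;
    int lastWriter = -1;       // sole writer observed in the previous interval
    int consecutive = 0;       // how many intervals that writer has persisted
    final int threshold;

    HomeMigrationPolicy(int homeNode, int threshold) {
        this.homeNode = homeNode;
        this.threshold = threshold;
    }

    // Called at a synchronization point with the interval's single writer,
    // or -1 if the interval saw multiple writers.
    void onInterval(int soleWriter) {
        if (soleWriter >= 0 && soleWriter == lastWriter) consecutive++;
        else consecutive = (soleWriter >= 0) ? 1 : 0;
        lastWriter = soleWriter;
        if (consecutive >= threshold && soleWriter != homeNode) {
            homeNode = soleWriter;   // migrate the home to the persistent writer
            consecutive = 0;         // new home must still be made known to peers
        }
    }
}
```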

3.2 Access Pattern Space

According to the JMM, an object's access behavior can be described as a set of reads and writes performed on the object, with interleaving synchronization actions such as locks and unlocks. Locks and unlocks on the same object are executed sequentially. Three orthogonal dimensions capturing the characteristics of object access behavior can be defined: number of writers, synchronization, and repetition. They form a 3-dimensional access pattern space, as shown in Figure 3.1.

Number of writers

Among all the accesses from different threads, a happen-before-1 [22] partial ordering, denoted by →hb1, can be established:

- If a1 and a2 are two memory actions by the same thread, and a1 occurs before a2 in program order, then a1 →hb1 a2.
- If a1 is an unlock by thread t1, and a2 is the following lock on the same object by thread t2, then a1 →hb1 a2.
- If a1 →hb1 a2 and a2 →hb1 a3, then a1 →hb1 a3.

A write w1 is a concurrent write if there exists another write w2 such that w1 and w2 are issued by different threads, w1 and w2 are on the same object, and neither w1 →hb1 w2 nor w2 →hb1 w1 holds. We also say that w1 is concurrent with respect to w2, denoted by w1 ∥ w2. A write w1 is a sequential write if there exists no write w2 such that w1 ∥ w2.
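In Java terms (our own example), two unsynchronized threads writing fields of the same object issue concurrent writes, i.e., the multiple-writer pattern, whereas writes ordered by an unlock followed by a lock on the same object's monitor are hb1-ordered and hence sequential:

```java
// Concurrent vs. sequential writes on the same object (illustrative).
class Point { int x, y; }

class Writers {
    static void multipleWriter(Point p) {
        new Thread(() -> p.x = 1).start();   // w1 and w2 are unordered by ->hb1:
        new Thread(() -> p.y = 2).start();   // concurrent writes on one object
    }

    static void singleWriter(Point p) throws InterruptedException {
        Thread t1 = new Thread(() -> { synchronized (p) { p.x = 1; } });
        t1.start();
        t1.join();                           // t1's unlock precedes the lock below,
        synchronized (p) { p.y = 2; }        // so the writes are hb1-ordered: sequential
    }
}
```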

On the dimension of number of writers, we distinguish three cases:

Multiple writers: the object is written by multiple threads. Specifically, if w is a concurrent write on this object, the object presents the multiple-writer pattern when w happens. The multiple-writer pattern is not the data race situation: accesses in a data race happen on the same variable, while accesses in the multiple-writer pattern happen on the same object. The multiple-writer pattern implies false sharing.

Single writer: the object is written by a single thread. Specifically, if w is a sequential write on this object, the object presents the single-writer pattern when w happens. Exclusive access is a special case, where the object is accessed (written and read) by only one thread.

Read only: no thread writes to the object.

Synchronization

This dimension characterizes the execution order of accesses by different threads. When the object is accessed by multiple threads and at least one of them is a writer, the threads should be properly synchronized to avoid data races. There are three cases:

Accumulator: the object accesses are mutually exclusive. The object is updated by multiple threads concurrently, so all updates should happen in a critical section; that is, each read/write should be preceded by a lock and followed by an unlock. Java provides the synchronized block and the synchronized method to implement the accumulator pattern.

Assignment: the object accesses obey a precedence constraint. The object is used to safely transfer a value from one thread to another: the source thread writes to the object first, followed by the destination thread reading it.

Synchronization actions should be used to enforce that the write happens before the read according to the memory model. Java provides the wait and notify methods in the Object class to help implement the assignment pattern.

No synchronization: synchronization is unnecessary.

Repetition

This dimension indicates the number of consecutive repetitions of an access pattern. It is desirable that an access pattern repeat a number of times, so that the GOS is able to detect the pattern from the history information and then apply the optimization on the re-occurrences of the pattern. Such a pattern appears on the right side of the adaptation point along the repetition axis. The adaptation point is an internal threshold parameter of the GOS: when a pattern repeats more times than the adaptation point indicates, the corresponding adaptation is automatically performed. On the other hand, some important patterns appear on the left of the adaptation point, such as the producer-consumer pattern. The producer-consumer pattern is also called single assignment: the write must happen before the read, but after the object is created it is written and read only once, and then turns into garbage.
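Both synchronization patterns map onto standard Java idioms. The sketch below (our illustration) shows an accumulator whose updates are confined to critical sections, and an assignment slot in which wait/notify enforces that the producer's write precedes the consumer's read.

```java
// Accumulator: mutually exclusive updates. Assignment: write-then-read
// ordered by wait/notify (illustrative example of the two patterns).
class Patterns {
    // Accumulator pattern: every update happens inside a critical section.
    static class Counter {
        private long value;
        synchronized void add(long delta) { value += delta; }
        synchronized long get() { return value; }
    }

    // Assignment pattern: the source thread writes, the destination reads,
    // with wait/notify enforcing the precedence constraint.
    static class Slot<T> {
        private T item;
        private boolean filled;

        synchronized void put(T t) {
            item = t;
            filled = true;
            notifyAll();                // wake the waiting reader
        }

        synchronized T take() throws InterruptedException {
            while (!filled) wait();     // block until the write has happened
            return item;
        }
    }
}
```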

Chapter 4

Distributed-Shared Object

This chapter presents how the memory consistency issue is efficiently solved by leveraging the concept of distributed-shared objects. We define the distributed-shared object and discuss the benefits it brings to our GOS. We then present a lightweight mechanism for the detection of DSOs, and the basic cache coherence protocol used in the GOS.

4.1 Definitions

In the JVM, connectivity exists between two Java objects if one object contains a reference to the other. We can therefore conceive of the whole object heap as a connectivity graph, where vertices represent objects and edges represent references. Reachability describes the transitive referential relationship between a Java thread and an object in the connectivity graph: an object is reachable from a thread if its reference resides in the thread's stack, or if some path exists in the connectivity graph between this object and some known reachable object. In the terms of the escape analysis technique [38], an object that is reachable from only one thread is called a thread-local object. The opposite is a thread-escaping object, which is reachable from multiple threads.

Thread-local objects can be separated from thread-escaping objects at compile time using escape analysis. The benefits of escape analysis are that synchronization operations on thread-local objects can be safely removed, and that thread-local objects can be allocated on the threads' stacks instead of the heap to reduce the heap overhead. In a distributed JVM, Java threads are distributed to different nodes, so we need to extend the concepts of thread-local object and thread-escaping object. We define the following:

A node-local object (NLO) is an object reachable only from thread(s) on the same node. It is either a thread-local object or a thread-escaping object.

A distributed-shared object (DSO) is an object reachable from at least two threads located at different nodes.

4.2 Benefits from DSO Detection

We introduce the concept of the DSO to address both the memory consistency issue and the memory management issue in the GOS. We argue that the identification of DSOs benefits both memory consistency maintenance and memory management, i.e., distributed garbage collection.
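The distinction is easiest to see in code. In the following illustrative example of ours, the object referenced only from the creating thread's stack remains thread-local, while the one published through a shared structure escapes; if a thread on another node reaches it, it becomes a DSO.

```java
// Thread-local vs. escaping objects (illustrative).
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class Escape {
    static final List<StringBuilder> shared = new CopyOnWriteArrayList<>();

    static void worker() {
        StringBuilder local = new StringBuilder("scratch"); // reachable only from
        local.append("!");                                  // this thread: thread-local

        StringBuilder escaping = new StringBuilder("result");
        shared.add(escaping);   // now reachable by any thread that reads 'shared';
                                // if such a thread runs on another node, this
                                // object is a distributed-shared object (DSO)
    }
}
```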

4.2.1 Benefits on Memory Consistency Maintenance

The detection of DSOs can help reduce the memory consistency maintenance overhead. According to the JVM specification, there are two memory consistency problems in a distributed JVM. The first, local consistency, exists among the working memories of threads and the main memory inside one node. The second, distributed consistency, exists among the multiple main memories of different nodes. The issue of local consistency must be addressed by any JVM implementation, whereas the issue of distributed consistency is only present in the distributed JVM. The cost of maintaining distributed consistency is much higher than that of its local counterpart, due to the communication incurred. As we have mentioned before, synchronization in Java is used not only to protect critical sections but also to enforce memory consistency. However, synchronization actions on NLOs do not need to trigger distributed consistency maintenance, because all the threads that are able to acquire or release the lock of an NLO must reside on the same node, and therefore never experience distributed inconsistency. Only DSOs are involved in distributed consistency maintenance, since they have multiple copies on different nodes. With the detection of DSOs, only DSOs need to be visited during distributed consistency maintenance to make sure that they are in a consistent state.

4.2.2 Benefits on Memory Management

According to the JVM specification, one vital responsibility of the GOS is to perform automatic memory management in the distributed environment, i.e., distributed garbage collection (DGC) [67]. The detection of DSOs also helps improve the memory management in the GOS. Since we detect DSOs at runtime, we are able to perform pointer translation across node boundaries, i.e., between local object addresses and objects' global unique identifications (GUIDs), so as to relocate objects at different memory addresses on different nodes. In this way, the heap management of each node is totally decoupled: each node performs independent memory management, and the local garbage collector on each node can collect garbage objects asynchronously and independently. Global garbage collections can thus be postponed or reduced.

Moreover, all the nodes are coordinated to present a huge virtual heap. We can calculate the aggregated heap size of our distributed JVM with the following formula:

H = (1 − d)hn + dh    (4.1)

where H is the aggregated heap size, h is the heap size on each node, n is the number of nodes, and d is the ratio of the local heap space occupied by DSOs to the total local heap size. We presume DSOs are replicated on all nodes. Obviously, when the ratio of DSOs, i.e., d, is small, H ≈ hn.
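As a numeric illustration of Formula 4.1 (our own example, not from the thesis): with h = 1 GB per node, n = 16 nodes, and d = 0.1, the aggregated heap size is H = 0.9 × 1 GB × 16 + 0.1 × 1 GB = 14.5 GB, close to the full hn = 16 GB; a larger DSO ratio d shrinks H toward the single-node size h.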

4.3 Lightweight DSO Detection and Reclamation

In the distributed JVM, whether an object is a DSO or an NLO is determined by the relative locations of the object and the threads reaching it. Compile-time solutions, such as escape analysis, are not useful here, as the locations of objects and threads can only be determined at runtime. We propose a lightweight runtime DSO detection scheme that leverages Java's runtime type information to unambiguously identify pointers, i.e., object references in the Java context.

Java is a strongly typed language. Each variable, whether an object field in the heap or a thread-local variable in some Java thread's stack, is associated with a type. The type is either a reference type or a primitive type such as integer, char, or float. The type information is known at compile time and written into the class files generated by the compiler. At runtime, the class subsystem builds up the type information from the class files. By looking up the runtime type information, we can identify those variables that are of a reference type. Therefore, object connectivity can be determined at runtime. The object connectivity graph is dynamic, since the connectivity between objects may change from time to time through the reassignment of objects' fields.

DSO detection is performed whenever JVM runtime data are to be transmitted across node boundaries; such data could be thread stack contexts for thread relocation, object contents for remote object access, or diff data for update propagation. On both the sending and the receiving side, these data are examined to identify object references. A transmitted object reference indicates that the object is a DSO, since it is reachable from threads located at different nodes.

On the sending side, if the object behind an identified object reference has not yet been marked as a DSO, it is marked at this moment, and a global unique identifier (GUID) is assigned to it, which is its global name in the cluster-based JVM. Before being sent, all object references are replaced by their GUIDs: since the copies of a DSO reside at different memory addresses on different nodes, local object references, i.e., memory addresses, do not make sense on other nodes. In sending an object, the type information of all its fields can usually be determined from its class data structure. However, in some situations additional type information must be sent along to describe a field unambiguously: (a) if the field is an array, the array's size is sent along, since the size, a special field of the array object in Java, is needed to shape the array; (b) if the field holds an instance of a subclass of the declared class type, the subclass's type is sent, since Java allows a conversion from any class type S to any class type T provided that S is a subclass of T, and the actual subclass type cannot be determined from the type information loaded from the class file; (c) if the field holds an implementation of the declared interface type, the field's actual type is sent.
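To make the translation between local references and GUIDs concrete, the following Java-level sketch shows the kind of bookkeeping involved on each node. All class, field, and method names here are hypothetical, and the real GOS performs this inside the JVM on raw heap references rather than through Java maps; embedding the allocating node's identifier in the GUID is likewise our assumption of one plausible way to make names globally unique.

    import java.util.HashMap;
    import java.util.IdentityHashMap;
    import java.util.Map;

    // Hypothetical per-node GUID bookkeeping (a sketch, not the actual GOS code).
    class GuidDirectory {
        private final Map<Object, Long> guidOf = new IdentityHashMap<>();
        private final Map<Long, Object> objectOf = new HashMap<>();
        private final long nodeId;  // assumed: node id is embedded in the GUID
        private long next = 1;

        GuidDirectory(long nodeId) { this.nodeId = nodeId; }

        // Sending side: called for each reference-typed field found in outgoing data.
        synchronized long marshal(Object o) {
            Long guid = guidOf.get(o);
            if (guid == null) {                  // first escape: the object becomes a
                guid = (nodeId << 48) | next++;  // DSO and receives its global name
                guidOf.put(o, guid);
                objectOf.put(guid, o);
            }
            return guid;  // the GUID replaces the local address on the wire
        }

        // Receiving side: on first sight of a GUID, create an empty placeholder of
        // the declared type so the reference never dangles; its access state would
        // then be set to invalid until the content is faulted in. For brevity this
        // sketch assumes a no-argument constructor.
        synchronized Object unmarshal(long guid, Class<?> declaredType) throws Exception {
            Object o = objectOf.get(guid);
            if (o == null) {
                o = declaredType.getDeclaredConstructor().newInstance();
                objectOf.put(guid, o);
            }
            return o;
        }
    }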

On the receiving side, all GUIDs are replaced by their corresponding local object references. The receiver knows where a GUID should appear according to the type information. When a GUID first emerges, an empty object of the corresponding type is created and associated with it, so that the reference will not become a dangling pointer. The object's access state is set to invalid; when the object is accessed later, its up-to-date content will be faulted in.

In this scheme, only those objects whose references appear on multiple nodes are identified as DSOs. We detect DSOs in a lazy fashion: since it is never known in advance whether an object will actually be accessed by a reaching thread in the future, we postpone the detection to as close to the actual access as possible, thus making the detection scheme lightweight.

To correctly reflect the sharing status of objects in the GOS, we rely on distributed garbage collection to convert a DSO back to an NLO: if all the cached copies of a DSO have become garbage, the DSO can be converted back to an NLO. Distributed garbage collection will be discussed in section 7.5.

An Example

Examining the case in Figure 4.1, a thread T1 prepares an object tree and then passes the reference of object c to another thread T2, as shown in the reachability graph (Figure 4.1.a). When T2 is distributed to another cluster node, i.e., node 1, all the objects reachable from object c become DSOs. Objects a, b, and d are not DSOs since they are thread-local to T1. Instead of detecting all these objects as DSOs at one blow, we detect object c as a DSO and send it to node 1. Because objects e and f are directly connected with object c, we also detect them as DSOs, but we do not send them to node 1 (Figure 4.1.b). On node 1, we create two objects whose types are exactly the same as the types of objects e and f. Since the contents of objects e and f are not available, we set their access states to invalid. The next time object f is accessed by T2 on node 1 (Figure 4.1.c), an object fault will occur, and an object request message will be sent to node 0.

Figure 4.1: The detection of distributed-shared objects. (a) Reachability graph. (b) After thread T2 is distributed to Node 1. (c) Access on f by T2 triggers detection of i.

This event triggers the detection of object i as a DSO, and the up-to-date content of object f is copied from node 0 to node 1. The details of how the coherence of objects replicated on multiple nodes is maintained are discussed in the next section. If object e is never accessed by T2, it remains invalid on node 1, and objects g and h will never be detected as DSOs.

4.4 Basic Cache Coherence Protocol

Our basic cache coherence protocol is a home-based, multiple-writer cache coherence protocol. Figure 4.2 shows a state transition graph depicting the lifecycle of an object from its creation to its possible collection, based on the proposed DSO concept. On the right of the figure is the state transition graph of the cache coherence protocol for DSOs at non-home nodes. The read/write arrows represent reads and writes on this object. The lock/unlock arrows represent lock and unlock actions on any object, because lock/unlock actions on other objects also influence this object's state according to the JMM. The lower part of the figure illustrates the interaction between garbage collection and the object's states, which will be discussed in section 7.5.

When a DSO is detected, the node where the object was first created is made its home node. The home copy of a DSO is always valid. A non-home copy of a DSO can be in one of three access states: invalid, read (read-only), or write (writable). An access to an invalid copy of a DSO faults in the contents from the home node. Upon releasing a lock on a DSO, all updated values of non-home copies of DSOs are written to their corresponding home nodes. Upon acquiring a lock, a flush action invalidates the non-home copies of DSOs, which guarantees that the most up-to-date contents will be faulted in from the home nodes when those copies are accessed later. Before the flush, all updated values of non-home copies of DSOs are written to the corresponding home nodes. In this way, a thread is able to see the up-to-date contents of the DSOs after it acquires the proper lock.

Figure 4.2: The state transition graph depicting the object lifecycle in the GOS

A multiple-writer protocol permits concurrent writing to the copies of a DSO, which is implemented using the twin and diff techniques [57]. On the first write to a non-home copy of a DSO, a twin is created, which is an exact copy of the object. On lock acquire and release, the diff, i.e., the modified portion of the object, is created by comparing the twin with the current object content word by word, and is sent to the home node. On releasing a lock, after the diffs are sent out, the access states of the updated objects are changed from write to read in order to capture future writes.
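The word-by-word comparison can be sketched as follows, here over an int[] stand-in for an object body. The real implementation works on raw object memory and uses its own diff encoding, so everything below, including the names and the (index, value) diff format, is illustrative only.

    import java.util.Arrays;

    // A minimal twin/diff sketch over int[] "object bodies" (hypothetical layout).
    class TwinDiff {
        // On the first write to a non-home copy: snapshot the object.
        static int[] makeTwin(int[] body) {
            return Arrays.copyOf(body, body.length);
        }

        // At lock acquire/release: compare twin and current content word by word
        // and encode the modified words as (index, value) pairs.
        static int[] createDiff(int[] body, int[] twin) {
            int changed = 0;
            for (int i = 0; i < body.length; i++)
                if (body[i] != twin[i]) changed++;
            int[] diff = new int[2 * changed];
            for (int i = 0, j = 0; i < body.length; i++)
                if (body[i] != twin[i]) { diff[j++] = i; diff[j++] = body[i]; }
            return diff;  // propagated to the home node
        }

        // At the home node: apply a received diff to the home copy.
        static void applyDiff(int[] homeBody, int[] diff) {
            for (int j = 0; j < diff.length; j += 2)
                homeBody[diff[j]] = diff[j + 1];
        }
    }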

Since a lock can be considered a special field of an object, all operations on a lock, including acquire and release, as well as wait and notify, which are methods of the Object class, are executed at the object's home node. The object's home node thus acts as the object's lock manager. The detailed design and implementation of distributed synchronization will be discussed in section 7.2.

With the availability of object type information, it is possible to invoke different coherence protocols according to the type of the objects, as shown in Table 4.1. For example, immutable objects, such as instances of the classes String, Integer, and Float, can simply be replicated and treated as NLOs. Some objects represent node-dependent resources, such as instances of class File. When node-dependent objects are detected as DSOs, object replication must be prohibited; instead, accesses to them are transparently redirected to their home nodes. This is an important issue in the provision of a complete single system image to Java applications.

Type                                       | Characteristics                     | Protocol
java.lang.Thread                           | Represents a Java thread.           | On creation, choose a running node for load balance.
java.lang.String, java.lang.Integer, etc.  | Immutable objects.                  | Simply replicated and treated as NLOs.
java.io.File, etc.                         | Represent node-dependent resources. | No replication; accesses are transparently redirected to the home node.
Primitive arrays, such as float[], int[]   | Contain no object references.       | DSO detection is disabled.

Table 4.1: Coherence protocols according to object type

Chapter 5

Adaptive Cache Coherence Protocol

Scientific applications usually exhibit diverse memory access patterns, and the performance of various cache coherence protocols is application-dependent: an application's inherent memory access patterns speak for the most suitable protocol. This inspires us to pursue an adaptive cache coherence protocol to further improve the performance of our GOS. An adaptive cache coherence protocol is able to detect the current access pattern and adjust itself accordingly. Based on the access pattern space, we present several adaptations incorporated into our basic home-based multiple-writer cache coherence protocol, targeting three respective situations in the access pattern space: (1) object home migration [45], which optimizes the single-writer access pattern by moving the object's home to the writing node according to the access history; (2) synchronized method migration, which chooses between default object (data) movement and optional method (control flow) movement in order to optimize the execution of critical-section methods according to some prior knowledge; and (3) connectivity-based object pushing, which scales the transfer unit to optimize the producer-consumer access pattern according to the object connectivity information.

5.1 Adaptive Object Home Migration

As a state-of-the-art DSM system, TreadMarks [57] adopts a multiple-writer cache coherence protocol to implement lazy release consistency (LRC). TreadMarks uses the twin and diff techniques to support multiple processes writing to the same shared virtual memory page simultaneously in the presence of false sharing. The protocol is considered homeless because the diffs are saved and managed at each process. Although TreadMarks' homeless protocol can greatly alleviate the false sharing problem, it may still suffer from heavy communication and protocol overheads. In order to serve a page fault, the faulting process has to fetch the diffs from each process that has updated the page before the fault according to LRC, which causes multiple round-trip messages. Each diff needs to be applied once at each process that fetches it, which amounts to a large overhead. In addition, the diffs can consume a lot of memory, and cleaning up useless diffs may trigger a global garbage collection.

In order to address the above problems, a home-based protocol implementing LRC, called HLRC, was proposed [55]. In the home-based protocol, each shared coherence unit has a home to which all writes (diffs) are propagated and from which all copies are derived. It has been shown that the home-based protocol is more scalable than the homeless protocol, because it maintains simpler state, sends fewer messages, has a lower diff overhead, and consumes much less memory.

The asymmetry between the home copy and non-home copies in home-based protocols raises the home assignment problem. In home-based protocols, the home copy is always valid: accesses at the home node never incur communication overhead, while accesses at non-home nodes trigger communication with the home node. Therefore, which node acts as the home changes the coherence data communication pattern and influences application performance.

In fact, the optimal home assignment is determined by the memory access patterns of the application. This has inspired several dynamic home assignment protocols that are able to adapt to runtime memory access patterns [51, 35, 78, 44].

In DSM applications, the single-writer access pattern occurs when a shared coherence unit is updated by only one process for a certain period. This does not prohibit the shared coherence unit from being read by multiple processes at the same time. Several research projects [51, 23, 64] have demonstrated that the single-writer pattern is common in DSM applications. In our GOS, we propose a novel home migration protocol to optimize the single-writer pattern. We target only the single-writer pattern because home migration makes little difference in the multiple-writer situation so long as the home node is one of the writers.

At runtime, an object can exhibit different access patterns during its lifetime. For example, an object can be updated by multiple writers concurrently and then by a single writer exclusively; or an object can be updated by different writers sequentially, each persisting for some time. Since home migration requires that the other processes be informed of the new home, improper home migrations will degrade performance by introducing a host of messages for new home notification. Therefore, it is a challenge to exploit the single-writer property as much as possible while maintaining an acceptable level of home migration overhead.

5.1.1 Home Migration Concepts

Figure 5.1 illustrates the home-based multiple-writer protocol that implements LRC. In the figure, X represents some shared coherence unit, which could be either an object or a virtual memory page. Its home is at the processor where process P2 resides. Assume the write on X performed by process P1 causes a fault, because either the local cached copy is outdated according to LRC or X is not cached at all.

Figure 5.1: Home-based protocol for LRC with multiple-writer support (P1: lock, write on X with fault-in and twin creation, diff creation and propagation, unlock; P2, the home of X: diff application)

P1 then faults in the valid copy from X's home, P2. Before P1 can write to the newly fetched copy, it needs to create a twin, which is simply a copy of X. Later, when P1 releases the lock, it eagerly creates the diff, which is the difference between the current X and the previously saved twin, and sends the diff to the home, where it is applied to the home copy of X. If P1 is the only writer of X, we can migrate X's home from P2 to P1 to avoid the communication overhead, including faulting in the shared data and the diff propagation; the diff overhead, including creating and applying the diff; and the memory consumption caused by the twin and the diff. On the other hand, if both P1 and P2 write on X, it does not matter which node becomes the home.

Home Location Notification Mechanism

We assume there is a way to determine the initial home of each unit; for example, all units are initially assigned a home node by a well-known hash function. If the home of a shared coherence unit is subject to migration, a home miss can happen.

A home miss is the situation where a process visits an obsolete home. Therefore, we need some mechanism to inform other nodes of the new home location. There are three such mechanisms: broadcast, home manager, and forwarding pointer.

Broadcast: After a home migration, the new home location is broadcast to all the nodes.

Home manager: The most up-to-date home location of a unit is always recorded at a designated manager node, which is known to all nodes. On home migration, the new home location is posted to the manager node. On a home miss, a process can visit the manager node to find out where the current home is.

Forwarding pointer: On home migration, a forwarding pointer is left at the former home, pointing to the new home. On a home miss, a process can always be redirected to the current home via the forwarding pointer.

With the broadcast and home manager mechanisms, it is possible that the broadcast, or the update to the manager, happens after some node has already tried to fault in a copy from the home node: the former home is then already obsolete, but the new home is not yet known. This situation needs to be handled carefully, for example by waiting for some time before repeating the fault-in. Notice that this situation cannot happen with the forwarding pointer mechanism.

Which of the three mechanisms is superior depends on the memory access patterns of the application and on how frequent home migration is. If, after a home migration, all the other nodes need to visit the new home, then the broadcast mechanism is superior to the others, because a well-implemented broadcast operation should be efficient for notifying all nodes. Otherwise, the broadcast may cause a large overhead. The merit of the forwarding

pointer mechanism is that it does not need to broadcast the new home location on home migration. However, the redirection effect may cascade: multiple home migrations may form a distributed chain of home forwarding pointers, so a process may be redirected multiple times before reaching the current home, which is called redirection accumulation. This can cause significant overhead when home migration happens frequently. The manager mechanism strikes a balance between the home notification cost and the home miss cost; however, on a home miss, the process needs to visit the old home, the manager, and the new home in sequence, which is heavyweight compared with the broadcast mechanism, and with the forwarding pointer mechanism in the absence of redirection accumulation.
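The following self-contained Java sketch simulates home lookup under the forwarding pointer mechanism; the Node structure and all names are hypothetical stand-ins for per-node GOS state, and error handling is omitted.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical per-node state: either a forwarding pointer or the home copy.
    class Node {
        final Map<Long, Integer> forwardTo = new HashMap<>(); // GUID -> newer home
        final Map<Long, byte[]> homeCopy = new HashMap<>();   // GUID -> object body
    }

    class ForwardingLookup {
        // Follow forwarding pointers from the presumed home until the home copy
        // is found. Each extra hop is one redirected object request; a long chain
        // is exactly the redirection accumulation described above.
        static int findHome(Node[] nodes, long guid, int presumedHome) {
            int n = presumedHome;
            while (!nodes[n].homeCopy.containsKey(guid))
                n = nodes[n].forwardTo.get(guid); // home miss: one redirection
            return n;
        }
    }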

5.1.2 Home Migration with Adaptive Threshold

In order to detect the single-writer access pattern, the GOS monitors all home accesses as well as non-home accesses at the home node. Under the cache coherence protocol, an object request can be considered a remote read, and a diff received at a synchronization point a remote write. To monitor home accesses, the access state of the home copy is set to invalid on acquiring a lock and to read-only on releasing a lock; home access faults can therefore be trapped, and control returns after the access is recorded. We call a write fault at the home node a home write, and a read fault at the home node a home read.

At the home node, we define an object's consecutive remote writes to be those issued from the same remote node and not interleaved with writes from either the home node or other remote nodes. Note that under the Java memory model, remote writes are only reflected to the home node at synchronization points; therefore, the number of consecutive remote writes is the number of synchronizations during which the object is updated only by that node.

At runtime, the GOS continuously monitors the consecutive remote writes of each object. We also introduce a predefined home migration threshold that represents some prior knowledge of the single-writer pattern. We follow the heuristic that an object is in the single-writer pattern if the number of its consecutive remote writes exceeds the home migration threshold. If the single-writer pattern is detected, then when the object is next requested by the writing node, not only is the object sent in reply, but its home is also migrated. We adopt the forwarding pointer mechanism to notify the other nodes of the new home location: when an obsolete home node is asked for an object, it simply replies with the valid home node location.

However, this protocol is still not satisfactory. Above all, it is difficult to choose the fixed home migration threshold. If it is too large, implying a lazy migration policy, home migration will be less sensitive to the single-writer pattern, causing unnecessary remote access overhead; if the home could be migrated earlier, more remote accesses could be transformed into local accesses. If, on the contrary, the threshold is too small, it implies an eager migration policy. Although sensitive to the single-writer pattern, such a policy is less capable of avoiding unnecessary home migrations. If the single-writer pattern is transient, in that it repeats only a very limited number of times, the threads on the new home node may not perform any more accesses after the home migration; the migration then gains no performance improvement but suffers the home redirection overhead.

We observe that the transient single-writer pattern is not worth a home migration; the home migration protocol should capitalize on the lasting single-writer pattern. The challenge is to choose a threshold that yields both sensitivity and robustness with respect to the single-writer pattern. By robustness we mean taking no migration action on the transient single-writer pattern, and by sensitivity we mean responding actively to the lasting single-writer pattern. Furthermore, different objects may have different access behaviors, so it is more reasonable to use different thresholds for different objects.

Based on the above discussion, we propose a home migration protocol with an adaptive threshold.

The adaptive threshold decreases monotonically with increased likelihood that an object presents the lasting single-writer pattern; a lower threshold allows home migration to happen more quickly. The adaptive threshold is continuously adjusted at runtime according to the feedback of previous home migration decisions for each object.

Runtime Feedback

In order to measure the feedback of previous home migration decisions, the GOS observes exclusive home writes and redirected object requests at runtime. We define an exclusive home write as a home write with no intervening remote write since the previous home write. Exclusive home writes clearly reflect the single-writer pattern happening at the home node, so they represent positive feedback on previous home migration decisions. A redirected object request reflects the home redirection effect due to home migration, and represents negative feedback on previous home migration decisions. Redirected object requests take redirection accumulation into account: for example, if an object request is redirected three times before reaching the current home node, the number of redirected object requests is counted as three instead of one.

In addition, exclusive home writes and redirected object requests are associated with different costs. The home redirection overhead, measured by redirected object requests, is equal to the round-trip time of a unit-sized message. The benefits of home migration come from eliminated pairs of object fault-ins and diff propagations; they are measured by exclusive home writes and are related to the object size. Therefore, we introduce the home access coefficient, which is the overhead ratio of one eliminated pair of object fault-in and diff propagation to one home redirection. Here we mainly consider the communication overhead.

Formalization

We formalize the idea of object home migration with an adaptive threshold as follows. For each object, we have:

C_i : the number of consecutive remote writes since the (i−1)th home migration;
T_i : the value of the adaptive home migration threshold since the (i−1)th home migration;
T_init : the initial threshold, which is set to 1;
R_i : the number of redirected object requests since the (i−1)th home migration;
E_i : the number of exclusive home writes since the (i−1)th home migration;
α : the home access coefficient;
m_{1/2} : the half-peak length in bytes, which is the message length required to achieve half of the asymptotic bandwidth [50].

A home migration decision is taken when the following condition is met:

C_i = T_i    (5.1)

The adaptive home migration threshold T_i is calculated by

T_i = max{(T_{i−1} + R_i − αE_i), T_init}    (5.2)

where

T_0 = T_init = 1    (5.3)

and

α ≈ 2 + sizeof(object)/m_{1/2}    (5.4)

Equation (5.2) is the core of the above equations; it determines the adaptive home migration threshold. Both the positive feedback (exclusive home writes) and the negative feedback (redirected object requests) of previous home migrations affect the current threshold. The positive feedback tends to indicate that the object presents a lasting single-writer pattern and thus decreases the threshold; remember that the threshold decreases monotonically with increased likelihood of the lasting single-writer pattern. The negative feedback, in contrast, tends to indicate that the object presents a transient single-writer pattern and thus increases the threshold. We also take the home access coefficient into account. Whenever the home migration condition, i.e., equation (5.1), is met, a home migration takes place. All these computations are done by the GOS at the home node of the object.

The initial threshold is set to 1 in order to speed up initial data relocation where possible. The initial data layout may not be optimal with respect to the data access behavior, particularly when the writing nodes of single-writer objects are not their home nodes; a small initial home migration threshold alleviates this situation. We rely on the adaptive threshold mechanism to adjust the threshold automatically after the initial home migration.
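The per-object bookkeeping behind equations (5.1)–(5.4) can be sketched as follows; the class, its fields, and the rounding of αE_i to an integer are our assumptions, and in the real GOS this state is maintained natively at the object's home node.

    // A sketch of per-object adaptive-threshold state (hypothetical names).
    class HomeMigrationState {
        static final int T_INIT = 1;
        private final double alpha;          // home access coefficient, eq. (5.4)
        private int threshold = T_INIT;      // T_i
        private int consecutiveRemoteWrites; // C_i
        private int redirectedRequests;      // R_i, negative feedback
        private int exclusiveHomeWrites;     // E_i, positive feedback
        private int lastRemoteWriter = -1;

        HomeMigrationState(double alpha) { this.alpha = alpha; }

        // A diff received at a synchronization point counts as a remote write.
        void onRemoteWrite(int writerNode) {
            consecutiveRemoteWrites =
                (writerNode == lastRemoteWriter) ? consecutiveRemoteWrites + 1 : 1;
            lastRemoteWriter = writerNode;
        }

        // A trapped home write fault breaks the run of consecutive remote writes.
        void onHomeWrite(boolean exclusive) {
            consecutiveRemoteWrites = 0;
            lastRemoteWriter = -1;
            if (exclusive) exclusiveHomeWrites++;
        }

        // Redirection accumulation: each extra hop counts once.
        void onRedirectedRequest(int hops) { redirectedRequests += hops; }

        boolean shouldMigrate() {            // condition (5.1)
            return consecutiveRemoteWrites >= threshold;
        }

        // On migration, recompute the threshold from the feedback, eq. (5.2).
        void onMigration() {
            long t = threshold + redirectedRequests
                     - Math.round(alpha * exclusiveHomeWrites);
            threshold = (int) Math.max(t, T_INIT);
            consecutiveRemoteWrites = 0;
            redirectedRequests = 0;
            exclusiveHomeWrites = 0;
        }
    }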

The Deduction of the Home Access Coefficient

Hockney [50] proposed a model that characterizes the communication time (in µs) of a point-to-point operation as follows, where the communication overhead t(m) is a linear function of the message length m (in bytes):

t(m) = t_0 + m/r    (5.5)

where t_0 is the start-up time in µs, and r is the asymptotic bandwidth in MB/s.

Recall that the home access coefficient is the overhead ratio of one eliminated pair of object fault-in and diff propagation to one home redirection, and that we mainly consider the communication overhead. Assume the object size is o, the diff size is d, and the home redirection is a unit-sized message. Then we have:

α = ((t_0 + o/r) + (t_0 + d/r)) / t(1)    (5.6)
  = (2 t_0 r + (o + d)) / (t_0 r + 1)    (5.7)

The half-peak length, denoted by m_{1/2} bytes, is the message length required to achieve half of the asymptotic bandwidth. It can be derived using the relationship:

m_{1/2} = t_0 r    (5.8)

Based on m_{1/2} ≫ 1 and o > d, we derive equation (5.4). We restate it here:

α ≈ 2 + o/m_{1/2}    (5.9)

5.2 Synchronized Method Migration

Synchronized method migration is not meant to directly optimize synchronization-related access patterns such as assignment and accumulator. Instead, it optimizes the execution of the synchronized method itself, which is usually related to those access patterns.

Java's synchronization primitives, including the synchronized block as well as the wait and notify methods of the Object class, were originally designed for thread synchronization in a shared memory environment.

The synchronization constructs built upon them are inefficient in a distributed JVM implemented on a distributed memory architecture such as a cluster. Figure 5.2 shows the skeleton of a Java implementation of the barrier function: the execution cannot continue until all threads have invoked the barrier method. We assume the Barrier instance is a DSO and that the node invoking barrier is not its home node. On entering and exiting the synchronized barrier method, the invoking node acquires and then releases the lock of the barrier object while maintaining distributed consistency. When the method body accesses the locked object's fields, which is common behavior in a synchronized method (here the arrived and count fields), the barrier object is faulted in. The calls to wait and notifyAll each issue a synchronization request; the wait method additionally triggers an operation to maintain distributed consistency according to the JMM. (1) Therefore, four synchronization or object requests are sent to the home node, and multiple distributed consistency maintenance operations are involved.

We propose synchronized method migration to reduce the communication and consistency maintenance overhead of executing synchronized methods at non-home nodes. With synchronized method migration, instead of invoking the method locally, the synchronized object's GUID, the method's index in the dispatch table, and the arguments of the method are sent to the home node of the synchronized object, and the method is executed there. The method's return value, if any, is sent back so that the execution at the non-home node can continue. While object shipping is the default behavior in the GOS, we apply method shipping specifically to the execution of synchronized methods of DSOs. With the detection of DSOs, this adaptation is feasible in our GOS.

The synchronized method migration code is generated at JIT compilation time. All non-synchronized methods are untouched so that they run at full speed.

(1) According to the JMM, wait behaves as if the lock were released first and acquired later.

    class Barrier {
        private int count;    // the number of threads to barrier
        private int arrived;  // currently arrived threads

        public Barrier(int numOfThreads) {
            count = numOfThreads;
            arrived = 0;
        }

        public synchronized void barrier() {
            try {
                if (++arrived < count)
                    wait();
                else {
                    notifyAll();
                    arrived = 0;
                }
            } catch (Exception e) {
                // handle the synchronization exception
            }
        }
    }

Figure 5.2: Barrier class

A code stub is inserted at the beginning of each synchronized method; it includes a condition check to decide whether the current execution needs migration, and the actual code to perform the synchronized method migration. Method shipping redistributes workload among the nodes; however, synchronized methods are usually short in terms of execution time, so synchronized method migration does not significantly affect the load distribution in the distributed JVM.
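In Java form, the inserted stub behaves roughly as below. The helper interface and every name in it are hypothetical stand-ins for the native code the JIT compiler actually emits; only the logic, a condition check followed by shipping the GUID, dispatch-table index, and arguments to the home node, follows the description above.

    // Hypothetical stand-ins for the native GOS routines behind the stub.
    interface Gos {
        boolean isDsoWithRemoteHome(Object o);
        long guidOf(Object o);
        Object shipCall(long guid, int methodIndex, Object[] args); // run at home
    }

    class SyncMethodStub {
        static final Object RUN_LOCALLY = new Object(); // sentinel

        // Conceptually prepended to every synchronized method by the JIT compiler.
        static Object enter(Gos gos, Object self, int methodIndex, Object[] args) {
            if (gos.isDsoWithRemoteHome(self))
                // One round trip replaces several object and synchronization
                // requests; the reply carries the return value, if any.
                return gos.shipCall(gos.guidOf(self), methodIndex, args);
            return RUN_LOCALLY; // fall through to the original, untouched body
        }
    }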

5.3 Connectivity-based Object Pushing

Some important patterns, such as the single-writer pattern, tend to repeat a considerable number of times, giving the GOS the opportunity to detect them from history information. However, some significant access patterns do not repeat and therefore cannot be detected from access history. Connectivity-based object pushing is applied in our GOS to situations where no history information is available.

Essentially, object pushing is a prefetching strategy that takes advantage of object connectivity information to more accurately pre-send the objects to be accessed by a remote thread, thereby minimizing the network delay in subsequent remote object accesses. Connectivity-based object pushing effectively improves reference locality, and it is useful in applications with fine-grained object sharing.

The producer-consumer pattern is one of the patterns that can be optimized by connectivity-based object pushing. Like the assignment pattern, the producer-consumer pattern obeys a precedence constraint: the write must happen before the read. In the producer-consumer pattern, however, after the object is created it is written and read only once and then turns into garbage; producer-consumer is therefore single-assignment. The producer-consumer pattern is popular in Java programs. Usually, one thread produces an object tree and prompts another, consuming, thread to access the tree. In the distributed JVM, the consuming thread suffers network delay when requesting the objects one by one from the node where the object tree resides.

In order to apply connectivity-based object pushing, we follow the heuristic that after an object is accessed by a remote thread, all the objects reachable from it in the connectivity graph may be consumed by that thread afterwards. Therefore, upon a request for a specific DSO in the object tree, the home node pushes all the objects reachable from it to the requesting node.

Object pushing is better than pull-based prefetching, which relies on the requesting node to explicitly specify which objects to pull according to the object connectivity information. A fatal drawback of pull-based prefetching is that the connectivity information contained in an invalidated object may be obsolete, so the prefetching accuracy is not guaranteed: some unneeded objects, even garbage objects, may be prefetched, wasting communication bandwidth. Object pushing, on the contrary, gives more accurate prefetching, since the home node holds the up-to-date copies of the objects, and the connectivity information at the home node is always valid.

In our implementation, we rely on an optimal message length, which is the preferred aggregate size of the objects delivered to the requesting node. Objects reachable from the requested object are copied into the message buffer until the current message length exceeds the optimal message length. We use a breadth-first search algorithm to select the objects to be pushed, as sketched below. If the pushed objects are not DSOs yet, they are detected as such; in this way, DSOs are eagerly detected during object pushing.

Since object connectivity information does not guarantee that future accesses are bound to happen, object pushing also risks sending unneeded objects. We disable object pushing upon a request for an array of reference type, e.g., a multi-dimensional array, since such an array usually represents some workload shared among threads, with each thread accessing only a part of it.
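The selection of objects to push can be sketched as a size-bounded breadth-first search; GosObject and its methods are hypothetical stand-ins for the runtime's object metadata, and the message-buffer handling is elided.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    // Hypothetical view of an object at its home node.
    interface GosObject {
        int sizeInBytes();
        boolean isReferenceArray();
        List<GosObject> referencedObjects(); // always valid at the home node
    }

    class ObjectPusher {
        // Breadth-first selection bounded by the optimal message length.
        static List<GosObject> selectToPush(GosObject requested, int optimalMsgLen) {
            List<GosObject> out = new ArrayList<>();
            if (requested.isReferenceArray()) { // pushing disabled for reference arrays
                out.add(requested);
                return out;
            }
            Set<GosObject> seen = new HashSet<>();
            Queue<GosObject> queue = new ArrayDeque<>();
            queue.add(requested);
            seen.add(requested);
            int bytes = 0;
            while (!queue.isEmpty() && bytes <= optimalMsgLen) {
                GosObject o = queue.poll();
                out.add(o);               // copied into the reply; objects that are
                bytes += o.sizeInBytes(); // not DSOs yet are detected here
                for (GosObject child : o.referencedObjects())
                    if (seen.add(child)) queue.add(child);
            }
            return out;
        }
    }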

Chapter 6

Object Access Pattern Visualization

We have designed and implemented a visualization tool called PAT (Pattern Analysis Tool) that can be used to visualize object access traces and analyze object access patterns in our GOS. PAT is useful in two respects. For the protocol designer, such a tool can expose the inherent memory access patterns of a benchmark application, and thus enable the evaluation of both the effectiveness of the adaptive protocol in reducing the number of network-related memory operations and the protocol's pattern detection mechanism. It can reveal how frequently a particular memory access pattern appears in an application, and how well a particular adaptation optimizes its target memory access pattern. On the other hand, PAT can help the application developer plan the initial data layout and runtime data relocation. Since DSM systems tend to hide the communication details from application developers, performance tuning is rather difficult, if not impossible. With PAT, the parallel application developer is able to discover the performance bottlenecks in an application by observing its memory access behavior, and may then redesign the algorithm to avoid some heavyweight memory access patterns. In this respect, PAT plays the role of a profiling tool.

Figure 6.1: PAT architecture (runtime: per-node DJVM logs are merged into one object access event log; postmortem: the pattern analysis engine, with lifetime, global phase, producer-consumer, and other pattern analyzers, feeds the pattern visualization component, whose pattern window, timeline window, and source code window map patterns to access events and allocation sites)

PAT comprises three components: the object access trace generator (OATG), which is plugged into the distributed JVM; the pattern analysis engine (PAE); and the pattern visualization component (PVC), as shown in figure 6.1.

OATG gathers object access information at runtime. Improper runtime logging can introduce intolerable overheads and interruptions to the application being traced, making the logging unacceptable; for example, the recorded memory access behavior could differ markedly from that without logging, owing to the interruptions caused by heavyweight logging. To tackle this problem, OATG is designed to be lightweight. It activates recording only on DSOs. Logs are stored in a memory structure and flushed to the local disk at synchronization points or when the buffer is full. The just-in-time compiler is used to instrument only the methods the user is interested in; all other methods execute at full speed.

PAE discovers pattern knowledge from the raw access information collected by OATG. After an application's execution, the global (covering all processes) and complete (covering the application's entire lifetime) access information can be compiled, based on which the object access patterns are analyzed precisely and thoroughly.

PVC uses a pattern-centric representation to visualize object access patterns. It can display the global and complete access pattern information. In addition, for objects of interest to the user, it can associate access patterns with the source code lines that create the corresponding objects, referred to as allocation sites. The object access patterns can be further mapped to low-level object access operations.

StormWatch [37] is a profiling tool that visualizes the execution of DSM systems and links it to the program's source code. StormWatch provides three linked graphic views: trace, communication, and source. The trace and communication views together reflect the low-level access operations of the execution. The major difference between our tool and StormWatch is that StormWatch focuses only on low-level access operations, which may not provide straightforward and intuitive information to the user, whereas our pattern analysis and visualization system provides access pattern knowledge that, as high-level information, is more helpful.

Xu et al. described a profiling approach for DSM systems in [81]. It can detect and visualize some cache-block-level access patterns. However, as an online tool, it suffers from the memory and time constraints of runtime analysis. For example, it can only show the lifetime access pattern that a cache block presents over the whole execution; pattern changes cannot be expressed, because recording every pattern change of every cache block would be too expensive in memory. This is neither flexible nor precise. Our approach, on the contrary, is postmortem, so we can invest as much effort as affordable to analyze the access patterns precisely and thoroughly after the execution.

6.1 Object Access Trace Generator

OATG uses several techniques to achieve lightweight runtime logging of memory access information.

Firstly, it relies on the Java memory model to carefully choose the memory access operations to be logged. Figure 6.2 lists the memory access operations in the GOS and marks those that are logged. In the GOS, we focus on DSOs, since only they incur communication overheads; consequently, we are only interested in the access patterns presented by DSOs. On non-home nodes, object faulting-in and diff propagation represent the reads and writes on the cached copy, respectively. Similarly, the home read fault and the home write fault represent all the reads and writes happening at the home node, respectively. All these remote and home reads/writes, together with synchronization operations on objects and synchronized methods, constitute an object's access behavior.

Secondly, we are usually interested not only in the access operations themselves, but also in the relationship between them and other program states. For example, we may want to know what the object access behavior is inside a particular Java method, or we may want to log a method that implements barrier synchronization among all threads in order to observe the object access operations against the barrier synchronization. To address this requirement, OATG leverages the just-in-time compiler in the distributed JVM to dynamically instrument translated Java method code so as to log interesting operations. PAT allows the user to provide a list of Java method signatures of interest (the format of Java method signatures is defined in the JVM specification) to the distributed JVM. During just-in-time compilation, the signature of the method to be translated is compared against the user-provided list. If there is a match, the just-in-time compiler inserts the log code at both the start and the end of the method.

Figure 6.2: Memory access operations in the GOS. On node-local objects: read; write; synchronization (lock, unlock, wait, notify). On distributed-shared objects, issued on non-home nodes: remote read (object faulting-in from the home node); other reads on the cached copy; remote write (diff propagation to the home node); other writes on the cached copy. Issued on the home node: home read (home read fault); other reads on the home copy; home write (home write fault); other writes on the home copy. Also on DSOs: synchronization (lock, unlock, wait, notify) and synchronized methods. The logged operations are the remote and home reads/writes, the synchronization operations on DSOs, and synchronized methods.

In this way, the user is able to choose which method operations to log. All other methods are left untouched and operate at full speed. If the just-in-time compiler were not used, we would have to instrument every method in advance, since each method could potentially be an operation of interest to the user, and the overall slowdown could be significant.

We make use of some source code of the logging facility in MPE (the Multi-Processing Environment of MPICH) [34] for collecting the access logs. However, our logging facility does not require MPI support during logging; it is implemented as a library and linked against the distributed JVM. At runtime, each process of the distributed JVM independently generates its own log. The log records are first kept in local memory and then dumped to the local disk at synchronization points or when the memory buffer is full. After the multi-threaded Java program exits, an MPI program merges all the local logs into one log file according to the time stamps. We rely on the Network Time Protocol (NTP) [63] to synchronize the clocks of the cluster nodes; the time offset between cluster nodes can be kept below one millisecond. When merging the node-local logs, the time stamps are further tuned using the current time offset.

6.2 Pattern Analysis Engine

The analysis engine can contain many independent modules sequentially reading the same log. Each module is responsible for detecting one or several related access patterns. The access pattern analysis results from all the modules are fed into the pattern visualization component, which is discussed in the next section. The engine is extensible in the sense that we can plug in new modules to detect any precisely defined access pattern. Currently there are two analysis modules in place: the lifetime pattern analyzer and the global phase pattern analyzer.

The lifetime pattern analyzer detects, for each DSO, the access pattern that holds over its whole lifetime: it checks whether an object presents the read-only, single-writer, or multiple-writers pattern throughout its lifetime.

The global phase pattern analyzer works for applications adopting the phase parallel paradigm (see [54]), as shown in figure 6.3. In this paradigm, every thread does some computation before arriving at a barrier; after all threads have arrived at the barrier, they continue to the next computation phase. Two consecutive barriers define a global synchronization phase agreed upon by all threads. This is a very common paradigm in parallel programming. The global phase pattern analyzer checks whether an object presents the read-only, single-writer, or multiple-writers pattern in each global synchronization phase. The barrier, as a synchronized Java method, is logged as a special operation at runtime. If the application does not follow the phase parallel paradigm, i.e., no barrier operations are found in the log, the global phase pattern analyzer simply ignores the log. Detecting the read-only, single-writer, and multiple-writers patterns in the log is straightforwardly done by counting the number of writers among all the accesses to an object during each phase.
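The classification rule can be sketched as below; the record layout is hypothetical, and in PAT the phases are delimited by the logged barrier operations.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // A sketch of per-phase pattern classification (hypothetical record layout).
    class PhasePatternAnalyzer {
        enum Pattern { READ_ONLY, SINGLE_WRITER, MULTIPLE_WRITERS }

        // For one global synchronization phase: the set of writing nodes observed
        // for each accessed object, keyed by GUID.
        static Map<Long, Pattern> classifyPhase(Map<Long, Set<Integer>> writers) {
            Map<Long, Pattern> result = new HashMap<>();
            writers.forEach((guid, w) -> result.put(guid,
                w.isEmpty()   ? Pattern.READ_ONLY :
                w.size() == 1 ? Pattern.SINGLE_WRITER
                              : Pattern.MULTIPLE_WRITERS));
            return result;
        }
    }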

Figure 6.3: Phase parallel paradigm (threads compute, synchronize at a barrier, and proceed to the next computation phase)

6.3 Pattern Visualization Component

There are three windows in the presentation: a time lines window displaying the low-level access operations, a pattern result window revealing the object access patterns, and a source code window displaying the application's source code. The time lines window also reflects the overall access operations incurred during the execution.

The time lines window, shown in figure 6.4, provides a complete execution picture on 8 cluster nodes for an application called SOR. The x-axis represents time. In the y-axis direction there are 8 time lines, representing 8 threads, one thread on each node in this experiment. The rectangles on the time lines show certain states, e.g., barrier synchronization in this case. The arrows show the object access operations: those in green are writes and those in white are reads. An arrow starting on one thread's time line and ending on another's represents a remote read (object faulting-in) or a remote write (diff propagation); it is issued by the thread on whose time line the arrow starts, and the corresponding home node is the one on whose time line the arrow ends. Arrows overlapping a time line are home reads or home writes. We can click any arrow to see detailed information about that object access, e.g., the class name, size, and ID of the object. The time lines can be zoomed out to get an overall picture of the access behavior, or zoomed in to examine particular object accesses. We implemented the time lines window by modifying Jumpshot in MPE [34].

Moreover, clicking the Pattern Analysis button in the time lines window triggers the pop-up of the pattern result window, shown in figure 6.5. As SOR is a barrier-synchronized application, the global phase pattern analyzer can provide a pattern analysis result for each object. The objects are first sorted by the allocation sites where they are created in the source code.

Figure 6.4: The time lines window (arrow legend: read, write)

Each allocation site may create many objects at runtime. For each object, its access pattern in each phase is displayed. As observed from the analysis result, most objects in SOR present the single-writer access pattern. For example, in figure 6.5, the object being observed presents the read-only and the single-writer pattern in alternate phases.

The pattern result window is the center of the visualization. Inside this window, we can choose any object to highlight its accesses in the time lines window, which provides a convenient association between the high-level access pattern knowledge and the low-level access operation details. Since the objects are sorted by their allocation sites in the pattern analysis result window, we can map any object to its actual allocation site in the application's source code by clicking it, as shown in figure 6.5. Note that the highlighted line in the source code window is the actual position of the highlighted allocation site in the pattern analysis result window. Thus we also provide a convenient association between an object's access pattern and the object's allocation site in the source code.

Figure 6.5: The window of the object access pattern analysis result (the bigger one) and the window of the application's source code (the smaller one)

With such a design, our visualization tool not only helps us, the GOS designers, to visually evaluate the effectiveness of the adaptive protocol being applied, but also helps the multi-threaded Java application programmer to better understand the access behavior inherent in the program.

Chapter 7

Implementation

In this chapter, we discuss several implementation details of the cluster-based JVM.

7.1 JIT Compiler Enabled Native Instrumentation

In a DSM, shared data units have different access states, such as invalid, read (read-only), and write (writable). A faulting access triggers an operation prescribed by the cache coherence protocol; for example, an access to an invalid data unit causes the data to be faulted in, and a write to a read-only data unit causes its twin to be created under the multiple-writer protocol. It is therefore the responsibility of the DSM system to trap all faulting accesses. Unlike page-based DSMs, which rely on the MMU hardware to trap faulting accesses, object-based DSMs need to insert software checks before memory accesses in order to trap the possible faulting ones. So does our GOS.

The GOS provides transparent object accesses for Java threads distributed among the nodes of the cluster-based JVM. To this end, it needs to insert software checks before all the bytecodes that access the heap in Java programs, which include:

GETFIELD/PUTFIELD: load/store object fields.
GETSTATIC/PUTSTATIC: load/store static fields.
XALOAD/XASTORE, where X is a type indicator, e.g., A (reference), B (byte), C (char): load/store array elements.

In a JVM, the bytecode execution engine is the processor of Java bytecode; it can be an interpreter or a just-in-time (JIT) compiler. An interpreter emulates the behavior of the bytecodes one by one, while a JIT compiler translates a Java method from bytecode to native code the first time the method is invoked. A JIT compiler usually improves JVM performance by an order of magnitude compared with an interpreter. Since the cluster-based JVM targets high performance scientific and engineering computing, the JIT compiler is our choice of execution engine.

Under the JIT compiler, a heap access operation takes only one native machine instruction, so the check code must be designed to be as lightweight as possible. A straightforward solution would be to insert a function call before each heap access, as shown in figure 7.1; the object state is checked, and the necessary protocol operation performed, inside the function. Although simple, this approach is very heavyweight, because a function call incurs considerable overhead: saving registers before the call, setting up a new stack frame, and restoring registers after the call. We had better avoid the function call as much as possible. A more efficient way is to check the access state with a comparison, as illustrated in figure 7.2: if the object already has the proper access state, the function call is avoided.

    call gos_check(object1);
    access object1;

Figure 7.1: Pseudo code for access check: using a function call

    if (object1 does not have the proper state)
        call gos_check(object1);
    access object1;

Figure 7.2: Pseudo code for access check: by comparison

Since we classify all objects into DSOs and NLOs, and only DSOs have access states, a straightforward algorithm for a read operation is the one shown in figure 7.3. In this way, two comparisons are required for each read operation. In order to reduce the number of comparisons, we let NLOs also carry an access state, namely write (writable). Thus only one comparison is necessary to check the access state of an object.

We have patched the JIT compiler engine to perform native instrumentation, inserting the access state check before each heap access. Figure 7.4 shows the Intel assembly code of a read access after native instrumentation by the JIT compiler in our distributed JVM. Register esi holds the object reference, register ecx the object access state, and register eax the object field to be read. When the object is readable, only 3 machine instructions are needed to check the access state: one memory read, one comparison, and one jump.

    if (object1 is DSO)
        if (object1 is invalid)
            call gos_check(object1);
    read object1;

Figure 7.3: Detailed pseudo code for a read check
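With this trick, and assuming the access states are encoded as ordered integers with invalid below read below write (which is what the signed jge in figure 7.4 suggests), the read check collapses to a single comparison. The following pseudo code, in the style of figures 7.1–7.3, is our restatement rather than a figure from the thesis:

    if (object1's access state < read)   // one comparison: invalid < read < write
        call gos_check(object1);
    read object1;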

    0x08eac045: mov 0xc(%esi),%ecx          // load access state
    0x08eac04b: cmp $0x...,%ecx             // make a comparison
    0x08eac051: jge 0x8eac076               // go to access
    0x08eac057: mov %ecx,0xffffffac(%ebp)   // save register
    0x08eac05d: mov %esi,0xffffffb0(%ebp)   // save register
    0x08eac063: push %esi                   // push argument
    0x08eac065: call 0x8a3da0 <checkread>   // call gos_check
    0x08eac06a: add $0x4,%esp               // pop argument
    0x08eac070: mov 0xffffffb0(%ebp),%esi   // restore register
    0x08eac076: mov 0x80(%esi),%eax         // read object field

Figure 7.4: IA32 assembly code for a read check

7.2 Distributed Threading and Synchronization

In the cluster-based JVM, the threads of one Java application are automatically distributed among the cluster nodes to achieve parallel execution. We therefore need to extend the threading subsystem inside the JVM to the cluster scope and to virtualize a single thread space across machine boundaries. In particular, we need to solve the following technical issues:

Thread distribution: The threads need to be efficiently distributed among the nodes of the cluster-based JVM to achieve maximum parallelism.

Thread synchronization: Even when running on different nodes, the threads must still be able to interact and coordinate with each other through the methods provided by class java.lang.Thread, and through synchronization on any Java object according to the JMM.

JVM termination: As mentioned in the introduction, from the perspective of system architecture, a cluster-based JVM is composed of a group of collaborating daemons, one on each cluster node. Each cluster-based JVM daemon can exit if and only if the multi-threaded Java application terminates.

In a standard JVM, all threads can be classified into user threads, created by the application, and daemon threads, created by the JVM itself. Any Java application creates at least one user thread, the main thread. Daemon threads include, for example, the gc thread, which performs garbage collection, and the finalizer thread, which performs the finalization work on unreachable objects before they are collected. So the JVM is a multi-threaded system even if the running Java application is single-threaded. The whole JVM exits when all user threads have exited. The thread subsystem performs thread scheduling and thread synchronization, and also provides non-blocking I/O interfaces.

7.2.1 Thread Distribution

In our cluster-based JVM, we classify the nodes into two types: the master node and the slave nodes. The master node is where the Java application is started; the slave nodes accept threads distributed from the master node to share the workload of the application. A daemon thread called gosd is created on each node; it sits in a big loop handling cache coherence protocol requests such as object fault-ins, diff propagations, and synchronization operations.

We follow an initial placement approach to distribute user threads to slave nodes. Upon the creation of a user thread on the master node, if there is an underloaded slave node, the information of the thread, which includes the thread class name and the thread object, is sent to that slave node. The gosd thread on the slave node then creates a new user thread based on the thread information and invokes the start() method of the thread object to run the thread. The slave node is made the home of the thread object to improve access locality.

Our cluster-based JVM does not support dynamic thread distribution mechanisms such as thread migration, by which a thread could be migrated from one node to another during its execution.

7.2.2 Thread Synchronization

After the threads are distributed among the cluster nodes, they must be able to coordinate with each other during execution. This is achieved through synchronization operations on any Java object, such as lock, unlock, wait, and notify. Since each object has a home, all synchronization operations are executed at the object's home node, which thereby acts as the object's lock manager. If the object's home is not local, synchronization requests are sent to the home node, where they are handled by the gosd thread.

Some synchronization requests are blocking, such as lock and wait: a lock request suspends until the corresponding lock is acquired. Since the gosd thread must never block, upon receiving a synchronization request it arranges for another kind of daemon thread, the monitorproxy thread, to actually process and reply to the request. The monitorproxy thread performs the synchronization operation indicated in the request on behalf of the requesting remote thread. Since synchronization is stateful, the gosd thread always arranges the same monitorproxy thread for the requests from the same remote thread. For example, after a monitorproxy thread MP has acquired a lock on an object as requested by a remote thread T, it sends a notice to T, and T continues its execution as if it had acquired the lock itself. When T requests the release of this lock, it must be MP, and not any other monitorproxy thread, that processes the request. Only after MP no longer holds any lock state can it process synchronization requests from remote threads other than T.

In the startup phase of the cluster-based JVM, a number of monitorproxy threads are created. When a new synchronization request arrives, the gosd thread tries to pick a currently available monitorproxy thread; if none is available, a new monitorproxy thread is created to process the request.

Java threads can also coordinate with each other through the methods of class java.lang.Thread. For example, an invocation of join blocks until the callee thread finishes. Like other methods of java.lang.Thread, join is built on the synchronization operations on the thread object, and is thus also implemented through the mechanism discussed above.
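The dispatching described above can be sketched as follows; every class and method name is a hypothetical stand-in, and the actual gosd and monitorproxy threads are native JVM threads, not Java code.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    class SyncRequest { long remoteThreadId; /* op (lock/unlock/wait/notify), GUID */ }

    class MonitorProxy extends Thread {
        private final Deque<SyncRequest> inbox = new ArrayDeque<>();
        synchronized void enqueue(SyncRequest r) { inbox.add(r); notify(); }
        public void run() { /* loop: dequeue a request, perform the monitor
                               operation on the home copy, send the reply */ }
    }

    class GosDaemon {
        private final Map<Long, MonitorProxy> pinned = new HashMap<>();
        private final Deque<MonitorProxy> idle = new ArrayDeque<>();

        // gosd must never block: it only hands the request over and returns.
        void dispatch(SyncRequest req) {
            MonitorProxy p = pinned.get(req.remoteThreadId);
            if (p == null) {                 // synchronization is stateful, so one
                p = idle.isEmpty() ? newProxy() : idle.pop(); // proxy is pinned per
                pinned.put(req.remoteThreadId, p);            // remote thread
            }
            p.enqueue(req);
        }

        // Called once the proxy holds no lock state for its remote thread.
        void unpin(long remoteThreadId) {
            MonitorProxy p = pinned.remove(remoteThreadId);
            if (p != null) idle.push(p);
        }

        private MonitorProxy newProxy() {
            MonitorProxy p = new MonitorProxy();
            p.start();
            return p;
        }
    }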

When a new synchronization request arrives, the gosd thread tries to pick an available monitorproxy thread; if none is available, a new monitorproxy thread is created to process the request.

Java threads can also coordinate with each other through the methods of class java.lang.Thread. For example, an invocation of join blocks until the callee thread finishes. Like the other methods of java.lang.Thread, join is built on the synchronization operations on the thread object, so it is also implemented through the mechanism discussed above.

JVM Termination

A cluster-based JVM is composed of a group of collaborating JVM daemons, one on each cluster node. Each cluster-based JVM daemon can exit if and only if the multi-threaded Java application terminates. If a JVM daemon exits earlier, the home-based cache coherence protocol is violated for those DSOs whose homes are there. If a JVM daemon exits later, it becomes unattended and wastes system resources. A termination protocol is therefore designed to coordinate all the JVM daemons to exit when the Java application terminates; a sketch of the master-side accounting follows this list.

1. When a slave node is started up, a main thread is also started there, which waits on an internal lock called slavemain.

2. On the master node, a counter is increased by one whenever a user thread is created in the Java application, and decreased by one whenever a user thread exits.

3. A user thread may be distributed to a slave node. When it exits there, a notice is sent back to the master node, and the master node decreases the user thread counter accordingly. Thus the counter reflects the number of currently live user threads.

4. When the counter reaches zero, meaning all user threads have exited, the master node can safely exit. Before exiting, the master node sends notices to all slave nodes, informing them to exit.

5. Upon receiving the notice to exit, a slave node wakes up the main thread waiting on the slavemain lock. The main thread then exits. Since all user threads have now exited, the slave node also exits.

At this point, the cluster-based JVM terminates and all its JVM daemons exit.
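The counter-based bookkeeping in steps 2-4 amounts to very little code. A minimal sketch, assuming a hypothetical notifySlavesToExit() hook for the exit notices:

    import java.util.concurrent.atomic.AtomicInteger;

    final class TerminationProtocol {
        private final AtomicInteger liveUserThreads = new AtomicInteger(0);

        // Called on the master whenever the application creates a user thread.
        void onUserThreadCreated() {
            liveUserThreads.incrementAndGet();
        }

        // Called on a local exit, or when a slave reports that a distributed
        // user thread has exited there.
        void onUserThreadExited() {
            if (liveUserThreads.decrementAndGet() == 0) {
                // All user threads have exited: tell every slave daemon to wake
                // the main thread waiting on its slavemain lock and shut down.
                notifySlavesToExit();
            }
        }

        private void notifySlavesToExit() { /* send exit notices to all slaves */ }
    }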

7.3 Non-Blocking I/O Support

There are multiple threads on each node of the distributed JVM. Non-blocking I/O support is a must in the distributed JVM so that a thread doing I/O will not block the whole node. We use the remote unlock operation on a DSO as an example to illustrate non-blocking I/O support in the GOS, as shown in figure 7.5. After the requesting thread sends out the unlock request, it is switched off to give the CPU to other runnable threads. The multi-threading nature of the cluster-based JVM calls for non-blocking I/O processing; otherwise a thread performing I/O would block the whole JVM. Therefore, the receiving thread should never busy-wait for an incoming message. Instead, it gives up the CPU; later, the SIGIO signal is caught to switch the corresponding I/O-waiting thread back on. This introduces significant signal processing overhead. On the requested node, the currently running thread may be some thread other than the GOS daemon thread that takes care of all the GOS request messages, so a thread switch is necessary to switch on the GOS daemon thread. The GOS daemon thread then schedules the proper monitor proxy thread to process the unlock request, namely the one that currently holds the lock; here another thread switch is incurred. A similar situation occurs on the requesting node when it receives the unlock reply message. Along the critical path of a remote unlock, therefore, an unlock overhead, two signal processings, and three thread switches are incurred.

Figure 7.5: Remote unlock of a DSO

7.4 Distributed Class Loading

A Java class file defines a single class: its static and instance fields, its static and instance methods, and the constant pool that serves a function similar to that of a symbol table in conventional programming languages. At runtime, the JVM dynamically loads, links, and initializes classes when they appear in the application, as illustrated in figure 7.6. Loading is the process of finding the Java class file and reading it into memory. Linking is the process of combining the class into the runtime state of the Java virtual machine.

Figure 7.6: JVM's dynamic loading, linking, and initialization of classes

During the linking phase, the bytecode is verified, the static fields are allocated and initialized to their default values, and all the symbolic references in the constant pool are resolved. Initialization is the process of executing the class initialization code. The Java class finally resides in the method area of the JVM.

Our cluster-based JVM provides the dynamic class loading capability defined in the JVM specification. Since Java classes contain read-only definitions of fields and methods, they can be loaded independently by each node. However, two particular issues need to be addressed to maintain the single system image. First, although each cluster-based JVM daemon can load Java classes independently, it must be guaranteed that they load Java classes from the same source; in other words, they must load the same Java classes. Second, although each cluster-based JVM daemon can load Java classes independently, a Java class can only be initialized once, and the static variables must be kept consistent according to the JMM during the execution.

To address the first issue, we have configured a Network File System (NFS) [74] for the cluster-based JVM so that each JVM daemon sees the same file system hierarchy where the Java class files are stored. Since NFS is a very popular file system in cluster environments, this configuration does not impair the portability of our cluster-based JVM.

To address the second issue, we let the master node maintain a centralized table recording all the cluster-wide initialized classes. For each initialized class, the table also records where it was initialized, along with a lock to prevent race conditions on the initialization. In the GOS, a class is also considered an object: it contains the static fields and has a home. The node initializing the class becomes its home node. Whenever a JVM daemon loads a class, it checks the table to see whether the class has already been initialized. If so, the JVM daemon skips the initialization and fetches the current content of the static fields from the home node of the class. If not, the class is initialized locally. Before the initialization, the corresponding lock in the table must be acquired; having acquired the lock, the JVM daemon double-checks whether the class has been initialized. The lock is released after the initialization is done. Since the static fields may be replicated on different nodes, they are also handled by the cache coherence protocol to maintain consistency according to the JMM.
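The table lookup with its per-class lock and double check follows a familiar pattern. Below is a minimal local-model sketch; in the real system the table lives on the master node and is queried remotely, and Entry, fetchStaticFields, and runClassInitializer are hypothetical names.

    import java.util.HashMap;
    import java.util.Map;

    final class InitializedClassTable {
        private static final class Entry {
            final Object lock = new Object();
            volatile int homeNode = -1;   // node that initialized the class
        }
        private final Map<String, Entry> entries = new HashMap<>();

        private synchronized Entry entryFor(String className) {
            return entries.computeIfAbsent(className, c -> new Entry());
        }

        // Called by the JVM daemon on node localNode after loading className.
        void ensureInitialized(String className, int localNode) {
            Entry e = entryFor(className);
            if (e.homeNode >= 0) {                   // already initialized cluster-wide
                fetchStaticFields(className, e.homeNode);
                return;
            }
            synchronized (e.lock) {
                if (e.homeNode >= 0) {               // double check under the lock
                    fetchStaticFields(className, e.homeNode);
                    return;
                }
                runClassInitializer(className);      // execute the initialization code
                e.homeNode = localNode;              // this node becomes the class's home
            }
        }

        private void fetchStaticFields(String c, int home) { /* pull statics from home */ }
        private void runClassInitializer(String c) { /* run the class initializer locally */ }
    }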

7.5 Garbage Collection

In this section, we discuss distributed garbage collection in the GOS. There are two GC algorithms in place in the GOS: one for local garbage collection, discussed in section 7.5.1, and the other for distributed garbage collection of DSOs, discussed in section 7.5.2.

7.5.1 Local Garbage Collection

An adapted uniprocessor garbage collector, a mark-sweep collector [79] in our case, can function independently on each node in our cluster-based JVM. The challenge here is to put the right objects into the root set to assure the correctness of GC. The home copy of a DSO should always be put into the root set, since the collector has no idea whether its non-home siblings are still alive. As long as there are non-home siblings, the home copy must be kept because of its special role in the home-based cache coherence protocol.

The inconsistency among the copies of a DSO introduces a new problem in DGC, one which does not exist when there is no consistency issue involved [46]. Fig. 7.7 gives an example. Node 0 is the home of DSO a. Node 1 cached a and modified it by installing a reference to object b in a. Now the copies of a are inconsistent. If a becomes unreachable on node 1, and node 1 performs a local GC, both a and b will be mistakenly collected. Therefore, when each node performs independent local GC, all the non-home copies of DSOs that are inconsistent with their home copies, i.e., those in the write access state, should also be put into the root set.

Figure 7.7: Tolerating inconsistency in DGC
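The root-set rules above can be summarized in a few lines. The sketch below is schematic, with ObjectCopy as a hypothetical view of a DSO copy's state; a further rule involving export lists is added in section 7.5.2.

    import java.util.ArrayList;
    import java.util.List;

    interface ObjectCopy {
        boolean isHomeCopy();
        boolean isInWriteState();   // inconsistent with the home copy
    }

    final class DsoRootSet {
        // Extra roots contributed by DSO copies on this node.
        static List<ObjectCopy> extraRoots(Iterable<ObjectCopy> dsoCopies) {
            List<ObjectCopy> roots = new ArrayList<>();
            for (ObjectCopy c : dsoCopies) {
                if (c.isHomeCopy()) {
                    roots.add(c);        // non-home siblings may still be alive
                } else if (c.isInWriteState()) {
                    roots.add(c);        // dirty copy: its diffs are not yet home
                }
            }
            return roots;
        }
    }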

7.5.2 Distributed Garbage Collection

A DGC algorithm, Indirect Reference Listing [66], is adopted to collect garbage DSOs. Essentially, the indirect reference listing (IRL) algorithm maintains a distributed reference diffusion tree for each DSO. In the GOS, a reference to a DSO can be transmitted either from the home node to a non-home node or between two non-home nodes. The former is referred to as reference creation and the latter as reference duplication.

With IRL, both the home and the non-home copies of a DSO maintain two lists: an import list recording where the reference came from, and an export list recording where the reference has been sent. In a DSO's reference diffusion tree, every vertex represents a node possessing one of its copies; the root of the tree is its home node. An edge in the tree represents the transmission of the reference from one node to another. The sending node adds the receiving node to its export list, while the receiving node adds the sending node to its import list. If the node to be added is already in the list, the addition has no effect. Fig. 7.8 gives an example; the number in each circle is the node number.

Figure 7.8: DSO reference diffusion tree

When a non-home copy of a DSO is found to meet the following two conditions, it can be reclaimed locally and a garbage notice is sent to its parent in the diffusion tree: (1) its export list is empty; (2) it is not reachable from the local root set, which can be determined by the local collector. When a node receives a garbage notice for a DSO, it removes the sending node from the DSO's export list. When the export list of the home copy of a DSO becomes empty, the DSO reverts to an NLO. IRL requires that the local collector also put those non-home copies of DSOs with non-empty export lists into the root set.

The transmission path of a DSO reference may form a cycle among the nodes. In that case the export list on every node in the cycle is non-empty and all the copies are put into the local root sets, so the DSO can never be reclaimed even if it is not reachable from any local root set. In order to prevent such cycles from polluting the structure of the diffusion tree, we ensure that each node can have only one valid parent in the tree. If a DSO reference arrives from a node different from the current parent, the sender is not added to the import list. Instead, the receiver prepares a pseudo garbage notice for the sender, since the sender has already added the receiver to its export list. Having received the pseudo garbage notice, the sender can remove the receiver from its export list.

IRL inherits the idempotency property from reference listing [67]: the effect of multiple transmissions of a DSO reference between two nodes is the same as that of a single transmission. This property is very helpful in the GOS, since DSOs are transmitted many times by the cache coherence protocol. The indirect nature of IRL avoids the race condition in reference listing that arises when reference deletion and duplication happen at the same time [67]. IRL cannot collect a cycle of garbage DSOs whose home nodes are different; however, this is usually not a serious problem.
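The per-copy bookkeeping is small. The following sketch is a schematic rendering of the IRL rules above, with node ids indexing bitmaps (matching the observation below that list maintenance is simply bitmap setting); the MessageSender hook is hypothetical.

    import java.util.BitSet;

    interface MessageSender {
        void sendPseudoGarbageNotice(int node);
    }

    final class IrlState {
        final BitSet exportList = new BitSet(); // nodes this copy's reference went to
        final BitSet importList = new BitSet(); // where the reference came from
        int parent = -1;                        // the single valid parent in the tree

        // Before transmitting the DSO reference to node 'to'.
        void onReferenceSent(int to) {
            exportList.set(to);                 // idempotent: re-sending has no effect
        }

        // When the DSO reference arrives from node 'from'.
        void onReferenceReceived(int from, MessageSender sender) {
            if (parent == -1) {
                parent = from;
                importList.set(from);
            } else if (from != parent) {
                // Cycle prevention: keep one parent; cancel the sender's export
                // entry with a pseudo garbage notice.
                sender.sendPseudoGarbageNotice(from);
            }
        }

        // When node 'from' reports that its copy has become garbage.
        void onGarbageNotice(int from) {
            exportList.clear(from);
        }

        // Conditions (1) and (2) for reclaiming a non-home copy locally.
        boolean locallyReclaimable(boolean reachableFromLocalRoots) {
            return exportList.isEmpty() && !reachableFromLocalRoots;
        }
    }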

The major overheads of IRL are maintaining the import and export lists for every DSO and sending garbage notices. The list maintenance coexists with the reference transmission; compared with the transmission itself, the maintenance overhead, which is simply bitmap setting, is negligible. The garbage notices can be batched and piggybacked on coherence messages. IRL therefore does not contribute significant overhead to the GOS.

Chapter 8

Performance Evaluation

8.1 Experiment Environment

We conducted the performance evaluation on the HKU Gideon cluster [14]. Each node has an Intel 2GHz P4 CPU and 512MB of memory, and runs the Linux kernel. A Network File System (NFS) is set up and mounted on all the cluster nodes so that the user has the same view of his home directory on all nodes. All the cluster nodes are connected by two Fast Ethernet networks, one for NFS and the other for high performance communication such as MPI.

Our cluster-based JVM is implemented based on the Kaffe JVM [9], an open-source JVM. A Java application is started on the master node. When a Java thread is created, it is automatically dispatched to a cluster node to achieve parallel execution. Unless specified otherwise, the number of computation threads created is the same as the number of cluster nodes in all the experiments.

In our implementation, we leverage the TCP/IP socket interface for all communications. We use Netperf [40] to evaluate the TCP/IP performance of the Gideon cluster. It takes 114 microseconds to send a one-byte request message and get a one-byte response message. The network throughput is 94.05Mb/s when the message size is 4096 bytes.

8.2 Application Suite

In this section, we present the application suite used to evaluate the performance of our cluster-based JVM. The suite contains CPI, ASP, SOR, NBody, NSquared, and TSP.

CPI

CPI is a multi-threaded Java program to calculate $\pi$, computed as

$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx \qquad (8.1)$$

The program follows a fork-and-join style of parallelism. The integral intervals are divided equally among the threads.

ASP

The All-pairs Shortest-Path (ASP) problem is to find the shortest path between all pairs of vertices in a graph. ASP is an important problem in graph theory and has applications in communications, transportation, and electronics problems [47]. A graph can be represented as a distance matrix $D$ in which each element $(i, j)$ represents the distance between vertex $i$ and vertex $j$. We assume that for any $i$ and $j$, $D_{ij}$ exists, so that $0 \le D_{ij} < \infty$. Also, $D_{ij} = D_{ji}$ and $D_{ii} = 0$.

Floyd gives a sequential algorithm for ASP. It solves a graph of $N$ vertices in $N$ steps, constructing an intermediate matrix $I(k)$ containing the best-known shortest distance between each pair of nodes at step $k$. Initially, $I(0)$ is set to $D$. The $k$th step of the algorithm considers each $I_{ij}$ in turn and determines whether the best-known path from $i$ to $j$ is longer than the combined lengths of the best-known paths from $i$ to $k$ and from $k$ to $j$. If so, the entry $I_{ij}$ is updated to reflect the shorter path.
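For reference, the kth step just described is a pair of nested loops; a minimal sequential sketch, with the intermediate matrix represented as an int[][] for concreteness:

    final class Floyd {
        // The kth step: try to improve every pair (i, j) via vertex k.
        static void step(int[][] I, int k) {
            int n = I.length;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    int viaK = I[i][k] + I[k][j];
                    if (viaK < I[i][j]) {
                        I[i][j] = viaK;   // a shorter path through k was found
                    }
                }
            }
        }

        static void allPairsShortestPaths(int[][] I) {
            for (int k = 0; k < I.length; k++) step(I, k);
        }
    }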

We design a parallel version of Floyd's algorithm by making a row-wise domain decomposition of the distance matrix $D$ and the intermediate matrix $I$ among threads. Appendix A.2 shows the run method of the Worker thread in our ASP; instances of the Worker thread perform the actual computation. At step $k$, all threads need the value of the $k$th row of the distance matrix. There is a barrier at the end of each iteration. The workload is distributed equally among the Worker threads. The rows of $D$ are initially allocated among cluster nodes in a round-robin manner.

SOR

The red-black Successive Over-Relaxation (SOR) is used to solve partial differential equations of the form

$$\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} = 0 \qquad (8.2)$$

A matrix is created with the perimeter elements initialized to the boundary conditions of a given mathematical problem. The interior elements are repeatedly computed as the average of their top, bottom, left, and right neighbors until the computed values are sufficiently close to the values computed in the previous iteration. Two matrices, a red one and a black one, are used in SOR. In any iteration, the elements are read from one matrix and the computed values are written to the other; after the iteration finishes, the roles of the two matrices are swapped. Figure 8.1 shows the typical operation in SOR:

    black[j][k] = (red[j-1][k] + red[j+1][k] + red[j][k-1] + red[j][k+1]) / (float)4.0;

Figure 8.1: The typical operation in SOR

We partition the red and black matrices among threads in a row-wise way. Each thread computes the parts of the matrices it has been assigned. Thus, the workload is equally partitioned among the threads.
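One sweep of a thread's partition, with the barrier at the end, could look like the following sketch; firstRow and lastRow (assumed to lie in the interior of the matrix) delimit the thread's assigned rows, and barrier is a shared barrier object.

    final class SorWorker {
        // Compute this thread's share of the black matrix from the red matrix.
        // A full iteration performs this sweep, swaps the roles of the two
        // matrices, and repeats.
        static void sweep(float[][] red, float[][] black,
                          int firstRow, int lastRow, Runnable barrier) {
            for (int j = firstRow; j <= lastRow; j++) {
                for (int k = 1; k < black[j].length - 1; k++) {
                    black[j][k] = (red[j - 1][k] + red[j + 1][k]
                                 + red[j][k - 1] + red[j][k + 1]) / 4.0f;
                }
            }
            barrier.run();   // all threads finish before the roles are swapped
        }
    }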

Each thread needs to access not only its own sub-matrices but also the neighboring rows in the matrices, which are computed by other threads. After each iteration, all threads are synchronized through a barrier operation. The rows of the red and black matrices are initially allocated among cluster nodes in a round-robin manner.

NBody

NBody simulates the motion of particles due to the gravitational forces between them. The Barnes-Hut method [29] is a well-known hierarchical NBody algorithm. In the Barnes-Hut method, a physical space is recursively divided into sub-domains until each sub-domain contains at most one body. The space decomposition is based on the spatial distribution of the bodies. Figure 8.2 (a) gives an example of space decomposition in 2D space. Initially, the space is equally divided into four sub-domains. If there is more than one body in a sub-domain, the sub-domain is further decomposed into four smaller sub-domains. A Barnes-Hut tree is built based on the space decomposition, as figure 8.2 (b) shows.

Figure 8.2: Barnes-Hut tree for 2D space decomposition

In the Barnes-Hut tree, the bodies reside at the leaves. Inner cells in the tree correspond to the sub-domains, each representing the center of mass of the bodies beneath it. The Barnes-Hut tree is built at the beginning of each iteration, and the force computation is performed by traversing the tree. If a body is far enough from a cell, no further traversal is made beneath the cell: the force influence from the bodies below the cell can be computed as the force influence from the cell itself, i.e., from its center of mass. Otherwise, the body proceeds to traverse the children of the cell. After the force computation, each body updates its position in the space as the result of the force influences, which ends one simulation loop. The tree is rebuilt at the beginning of the next iteration to reflect the new body distribution in the space.

We parallelize the Barnes-Hut method by dividing the bodies equally among threads. The workload of the threads is not balanced, as the computation load associated with each body differs. The tree construction is not parallelized. In the NBody application, there is a main thread responsible for the tree construction and a number of worker threads responsible for computing the forces and the resulting body movement. During each iteration, after the main thread has built the tree, it wakes all the waiting worker threads. A barrier operation synchronizes the worker threads after they finish their computation; then the main thread is notified to begin the tree construction of the next iteration. A large number of Java objects, which describe the bodies' positions, velocities, and forces, are created during the tree construction.
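The tree traversal described above, with its opening criterion, can be sketched as follows; TreeNode, Body, and the threshold theta are simplified placeholders for the real Barnes-Hut data structures (in particular, the leaf holding the body itself would be skipped in a full implementation).

    interface TreeNode {
        boolean isLeaf();
        double size();                  // side length of the cell's sub-domain
        double[] centerOfMass();
        double totalMass();
        Iterable<TreeNode> children();
    }

    interface Body {
        double distanceTo(double[] point);
        void accumulateForceFrom(double[] point, double mass);
    }

    final class BarnesHut {
        static void computeForce(Body b, TreeNode cell, double theta) {
            if (cell == null) return;
            if (cell.isLeaf() || farEnough(b, cell, theta)) {
                // Treat everything beneath the cell as its center of mass.
                b.accumulateForceFrom(cell.centerOfMass(), cell.totalMass());
            } else {
                for (TreeNode child : cell.children()) {
                    computeForce(b, child, theta);   // too close: descend
                }
            }
        }

        // Standard opening criterion: cell size over distance below theta.
        private static boolean farEnough(Body b, TreeNode cell, double theta) {
            return cell.size() / b.distanceTo(cell.centerOfMass()) < theta;
        }
    }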

NSquared

NSquared solves the NBody problem with O(n^2) complexity, just like Water-NSquared in the Splash-2 benchmark suite [80]. All n bodies are stored in an array. The workload is evenly partitioned among threads by assigning an identical number of bodies to each thread. A thread is responsible for calculating the force on each of its assigned bodies and updating the bodies' positions accordingly. To calculate the force on one body, we combine the interactions between this body and each of the other n-1 bodies.

TSP

The Traveling Salesman Problem (TSP) is to find the cheapest way of visiting all the cities and returning to the starting point. Our TSP finds the optimal solution instead of an approximate one by searching the entire solution space. It follows a branch-and-bound approach, pruning large parts of the solution space by ignoring partial routes that are already longer than the current best solution. The program divides the whole solution space into many small sub-spaces to build up a job queue in the beginning; a sub-space contains all the routes with the same prefix. A number of worker threads are created initially. Every thread repeatedly requests a sub-space from the job queue and searches it for the optimal solution until the queue is empty; a sketch of this worker loop is shown below. The workload of the threads is not balanced. A large number of objects are created during the search.
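A minimal sketch of that worker loop, with SubSpace as a hypothetical stand-in for the partial-route representation and an atomic integer holding the current best tour length:

    import java.util.Queue;
    import java.util.concurrent.atomic.AtomicInteger;

    interface SubSpace {
        int prefixLength();                 // length of the shared route prefix
        int searchBestTour(int bound);      // best tour found, pruned by 'bound'
    }

    final class TspWorker implements Runnable {
        private final Queue<SubSpace> jobs;
        private final AtomicInteger bestSoFar;   // current best tour length

        TspWorker(Queue<SubSpace> jobs, AtomicInteger bestSoFar) {
            this.jobs = jobs;
            this.bestSoFar = bestSoFar;
        }

        public void run() {
            SubSpace s;
            while ((s = jobs.poll()) != null) {   // repeat until the queue is empty
                // Prune sub-spaces whose prefix already exceeds the bound.
                if (s.prefixLength() >= bestSoFar.get()) continue;
                int best = s.searchBestTour(bestSoFar.get());
                // Publish an improved bound.
                int cur;
                while (best < (cur = bestSoFar.get())) {
                    if (bestSoFar.compareAndSet(cur, best)) break;
                }
            }
        }
    }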

8.3 Application Performance

In our experiments, unless stated otherwise, CPI performs the integration over 100,000,000 sub-intervals, ASP solves a graph of 1024 vertices, SOR performs the successive over-relaxation on a 2-D matrix of 2048 by 2048 for 30 iterations, NBody simulates the motion of 2048 particles over 30 steps, NSquared simulates the motion of 2048 particles over 10 steps, and TSP solves a problem of 12 cities.

Sequential Performance

Our cluster-based JVM is based on the Kaffe JVM. By running our cluster-based JVM on only one processor, we can measure its sequential performance. The major GOS overhead incurred in sequential execution is caused by the software checks inserted before object accesses, which ensure that the corresponding objects are in the right access states, as discussed in section 7.1. By comparing the sequential performance of our cluster-based JVM with the performance of the original Kaffe, we can measure the overhead of these software checks.

Figure 8.3 compares the performance of the cluster-based JVM and Kaffe using our application suite. In the figure, GOS/1 denotes the cluster-based JVM on one processor. Both the Kaffe JVM and our cluster-based JVM run in the just-in-time mode.

Figure 8.3: Single node performance

Among all the applications, ASP, SOR, and NSquared incur a heavy check overhead due to their intensive array object accesses.

NBody and TSP's check overheads are well contained, at less than 10%. In CPI, most of the time is spent on calculation, and object accesses are very few.

Parallel Performance

We measure the speedup of all applications on up to 16 processors as an overall performance evaluation of our cluster-based JVM. Figure 8.4 shows the speedup curves. In the experiments, n threads are created when running on n processors. The sequential time on 1 processor is measured on the original Kaffe JVM, where only one thread is created. All the cache coherence protocol optimizations are enabled. Both the Kaffe JVM and our cluster-based JVM run in the just-in-time mode.

Figure 8.4: Speedup

The applications' parallel performance is determined by their computation-to-communication ratios. Among all the applications, TSP and CPI are computationally intensive programs; they are able to achieve speedups of more than 13 on 16 processors.

NBody and NSquared also achieve acceptable speedups on 16 processors. SOR and ASP perform poorly: they achieve speedups of less than 3.5 on 8 processors, and their speedup curves drop on 16 processors.

In order to further investigate the factors contributing to the applications' performance, we break down the execution time into several parts: Comp, which denotes the computation time; Obj, the object access time to fault in up-to-date copies of invalid objects; Syn, the time spent on synchronization operations, such as lock, unlock, wait, notify, and migrated synchronized methods; and GC, the garbage collection overhead. We instrument internal functions of our cluster-based JVM to measure the accumulated overheads of Obj, Syn, and GC. The Comp time is computed by subtracting all the other parts from the total time. All the breakdown data are normalized to the total execution time, as displayed in figure 8.5. How we obtain the breakdown data is discussed in detail in appendix A.3. In spite of a certain imprecision, figure 8.5 helps us gain insight into the executions. Notice that not every application requires GC.

The Obj and Syn portions are the GOS overhead for maintaining a global view of a virtual object heap shared by physically distributed threads. They include not only the necessary local management cost and the time spent on the wire moving the protocol-related data, but also the possible waiting time on the requested node. The percentage of Comp roughly reflects the efficiency of the parallel executions.

ASP requires n iterations to solve an n-node graph problem. There is a barrier at the end of each iteration, which requires the participation of all threads. When ASP runs on more processors, the computation workload of each thread decreases; on the contrary, the Syn part increases as more processors join. The Obj part also increases with the number of processors. In the ith iteration, all threads need to access the ith row of the distance matrix.

Figure 8.5: Breakdown of normalized execution time against number of processors

When the number of processors increases, the home node of the ith row needs to serve more requests, so the waiting time of each request increases correspondingly. When scaled up to a large number of processors, ASP's performance is hindered by the intensive data communication and synchronization overheads.

The situation of SOR is similar to that of ASP. In SOR, there are two barriers in each iteration, and the Syn part contributes a significant portion of the execution time when scaled to a large number of processors. The absolute time of Obj stays roughly constant, because each thread only needs to access the neighboring rows of the rows it manages in the matrices; the data to be accessed do not increase with the number of processors. However, the percentage of Obj in the total time increases, because each thread's computation load is reduced when SOR runs on more processors. Like ASP, SOR's performance is hindered by the intensive data communication and synchronization overheads when scaled up to a large number of processors.

NBody also involves synchronization in each simulation step. The synchronization overhead becomes a significant part of the overall execution time as we increase the number of processors. The absolute time of Obj decreases when the number of processors increases, but more slowly than the absolute time of Comp decreases, so the percentage of Obj in the total time increases. NBody is a memory intensive application and therefore triggers garbage collection. With our distributed garbage collection mechanism in place, the GC overhead is highly parallelized: the absolute time of GC is inversely proportional to the number of processors. The breakdown of NSquared is similar to that of NBody.

TSP is a computationally intensive application, and the GOS overhead accounts for less than 1% of the total execution time. TSP is also a memory intensive application. The absolute times of GC and Obj are inversely proportional to the number of processors; nevertheless, their percentages in the total time stay constant across various numbers of processors.

CPI is a computation-intensive application; most of its time is Comp.

8.4 Effects of Adaptations

In this section, we evaluate the effectiveness of the adaptations discussed in chapter 5: adaptive object home migration, synchronized method migration, and connectivity-based object pushing.

All applications except TSP and CPI incur a great deal of communication during their parallel executions. Table 8.1 shows their communication effort when running on 16 processors. The measurements are made with all the cache coherence protocol optimizations enabled.

              Parameters                                 Messages   Traffic (KB)
    CPI       100,000,000 sub-intervals
    ASP       A graph of 1024 vertices                   169,
    SOR       A 2048 by 2048 matrix for 30 iterations    35,999     93,286
    NBody     2048 particles over 30 steps               752,       ,505
    NSquared  2048 particles over 10 steps               698,192    74,230
    TSP       12 cities                                  4,

Table 8.1: Communication effort on 16 processors

Figure 8.6 shows the overall performance improvement due to the adaptations for the four benchmark applications respectively. We do not show figures for CPI and TSP because they are computationally intensive applications and incur little communication; the adaptations have no obvious effect on them. In the figures, Basic represents the basic cache coherence protocol with the three adaptations disabled, and Adaptive represents the adaptive cache coherence protocol with all adaptations enabled.

Figure 8.6: The adaptive protocol vs. the basic protocol

We display the applications' execution times against the number of processors. The cluster-based JVM runs in the JIT compilation mode. We can observe from the figures that the adaptive cache coherence protocol greatly improves the performance of ASP and SOR: for example, 76% to 89.7% of ASP's execution time is eliminated when the adaptive protocol is enabled. The adaptive protocol also improves the performance of NBody and NSquared considerably. For example, as seen in figure 8.6 (c), 23.8% of NBody's execution time is eliminated on 16 nodes when the adaptive protocol is enabled.

In order to further investigate the effectiveness of the various adaptations, we break down their effects. In the experiments, all adaptations are disabled initially; we then enable the planned adaptations incrementally. Figure 8.7 shows the effects of the adaptations on the execution time, figure 8.8 shows their effects on the number of messages generated during the execution, and figure 8.9 shows their effects on the network traffic generated during the execution. All data are normalized to those measured with none of the adaptations enabled, and are presented against different numbers of processors. In the legend, No denotes no adaptation enabled, HM denotes adaptive object home migration, SMM denotes synchronized method migration, and Push denotes connectivity-based object pushing. We elaborate on the effectiveness of each adaptation in the following sub-sections.

Adaptive Object Home Migration

Among the four applications, adaptive object home migration improves the performance of ASP and SOR substantially, as seen in figure 8.7 (a) and (b). In ASP and SOR, the data are in 2-D matrices that are shared by all threads. In Java, a 2-D matrix is implemented as an array object whose elements are also array objects. Many of these array objects exhibit the single-writer access pattern after they are initialized. The shared data are initially allocated to different cluster nodes in a round-robin manner, so their original homes are not the writing nodes. The home migration protocol automatically makes the writing node the home node to eliminate remote accesses. As seen in figures 8.8 (a) and 8.9 (a) for ASP, and figures 8.8 (b) and 8.9 (b) for SOR, home migration greatly reduces the messages and network traffic generated during the executions of ASP and SOR, which explains the performance improvement.
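For intuition only, a detector driving such a policy can be as simple as counting consecutive remote writes from the same node at the home; the threshold and counters below are illustrative assumptions, not the exact protocol of chapter 5.

    final class HomeMigrationMonitor {
        private int consecutiveWrites = 0;
        private int lastWriter = -1;
        private static final int THRESHOLD = 3;   // assumed migration trigger

        // Called at the home node each time a diff from 'writerNode' arrives.
        boolean shouldMigrateHomeTo(int writerNode) {
            if (writerNode == lastWriter) {
                consecutiveWrites++;
            } else {
                lastWriter = writerNode;
                consecutiveWrites = 1;
            }
            // Single-writer pattern detected: make the writer the home node.
            return consecutiveWrites >= THRESHOLD;
        }
    }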

Figure 8.7: Effects of adaptations w.r.t. execution time

As a further demonstration, figure 8.10 visualizes the effect of object home migration on SOR using PAT, which is discussed in chapter 6. Figure 8.10 (a) is the time line window without home migration. There are four global phases, each taking approximately the same amount of time. Figure 8.10 (b) is the time line window with home migration enabled. Three global phases are marked in the figure: Before Home Migration, Home Migrating, and After Home Migration. Before home migration takes effect, we observe that a lot of remote reads and writes are sent to their home node, node 0. (The shared objects are intentionally allocated on node 0 to simplify the visualization view.)

Figure 8.8: Effects of adaptations w.r.t. message number

During the home migrating phase, we observe that although the reads (white arrows) are still sent to the original home node, the writes (gray arrows) are performed locally, which means the home has already migrated to the local node at that moment. We can also observe that the phase after home migration takes much less time than the phase before home migration, since most remote reads and writes are eliminated by object home migration. As can be observed, the effect of home migration is to change remote reads and writes into home reads and writes.

Figure 8.9: Effects of adaptations w.r.t. network traffic

Home migration also improves the performance of NSquared. In NSquared, the particle data are stored in an array. The particles are evenly distributed among threads, and each thread only updates its assigned particles. Thus the particle objects exhibit the single-writer pattern, and the communication is reduced by migrating the homes of the particle objects to their respective updating threads.

Home migration has little impact on the performance of NBody because NBody lacks the single-writer pattern, as seen in figure 8.7 (c). This also indicates that our home migration protocol has little negative side effect, because of its lightweight design.

Figure 8.10: The effect of object home migration on SOR

Synchronized Method Migration

Synchronized method migration optimizes the execution of a synchronized method of a non-home DSO. Although it does not reduce the network traffic, it reduces the number of messages and the protocol overheads, as discussed in section 5.2. ASP requires n barriers for all the threads in order to solve an n-node graph; SOR requires two barriers in each iteration; NSquared requires one barrier in each simulation step. The barrier operation is implemented as a


More information

A Platform-Independent Distributed Runtime for Standard Multithreaded Java

A Platform-Independent Distributed Runtime for Standard Multithreaded Java International Journal of Parallel Programming, Vol. 34, No. 2, April 26 ( 26) DOI: 1.17/s1766-6-7- A Platform-Independent Distributed Runtime for Standard Multithreaded Java Michael Factor, 1 Assaf Schuster,

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

I/O in the Gardens Non-Dedicated Cluster Computing Environment

I/O in the Gardens Non-Dedicated Cluster Computing Environment I/O in the Gardens Non-Dedicated Cluster Computing Environment Paul Roe and Siu Yuen Chan School of Computing Science Queensland University of Technology Australia fp.roe, s.chang@qut.edu.au Abstract Gardens

More information

ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS

ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS Prabodha Srimal Rodrigo Registration No. : 138230V Degree of Master of Science Department of Computer Science & Engineering University

More information

JESSICA: Java-Enabled Single-System-Image Computing Architecture

JESSICA: Java-Enabled Single-System-Image Computing Architecture JESSICA: Java-Enabled Single-System-Image Computing Architecture Ma Jin Ming A thesis submitted in partial fulfillment of the requirements for the degree of Master of Philosophy at the University of Hong

More information

Scalable Shared Memory Programing

Scalable Shared Memory Programing Scalable Shared Memory Programing Marc Snir www.parallel.illinois.edu What is (my definition of) Shared Memory Global name space (global references) Implicit data movement Caching: User gets good memory

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Jupiter: A Modular and Extensible JVM

Jupiter: A Modular and Extensible JVM Jupiter: A Modular and Extensible JVM Patrick Doyle and Tarek Abdelrahman Edward S. Rogers, Sr. Department of Electrical and Computer Engineering University of Toronto {doylep tsa}@eecg.toronto.edu Outline

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

Audience. Revising the Java Thread/Memory Model. Java Thread Specification. Revising the Thread Spec. Proposed Changes. When s the JSR?

Audience. Revising the Java Thread/Memory Model. Java Thread Specification. Revising the Thread Spec. Proposed Changes. When s the JSR? Audience Revising the Java Thread/Memory Model See http://www.cs.umd.edu/~pugh/java/memorymodel for more information 1 This will be an advanced talk Helpful if you ve been aware of the discussion, have

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition,

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition, Chapter 17: Distributed-File Systems, Silberschatz, Galvin and Gagne 2009 Chapter 17 Distributed-File Systems Background Naming and Transparency Remote File Access Stateful versus Stateless Service File

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today

More information

Point-to-Point Synchronisation on Shared Memory Architectures

Point-to-Point Synchronisation on Shared Memory Architectures Point-to-Point Synchronisation on Shared Memory Architectures J. Mark Bull and Carwyn Ball EPCC, The King s Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland, U.K. email:

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

Distributed Shared Memory

Distributed Shared Memory Distributed Shared Memory EECS 498 Farnam Jahanian University of Michigan Reading List Supplemental Handout: pp. 312-313, 333-353 from Tanenbaum Dist. OS text (dist. in class) What DSM? Concept & Design

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

Part IV. Chapter 15 - Introduction to MIMD Architectures

Part IV. Chapter 15 - Introduction to MIMD Architectures D. Sima, T. J. Fountain, P. Kacsuk dvanced Computer rchitectures Part IV. Chapter 15 - Introduction to MIMD rchitectures Thread and process-level parallel architectures are typically realised by MIMD (Multiple

More information

Memory Consistency Models

Memory Consistency Models Memory Consistency Models Contents of Lecture 3 The need for memory consistency models The uniprocessor model Sequential consistency Relaxed memory models Weak ordering Release consistency Jonas Skeppstedt

More information

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors. CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says

More information

Lecture 6 Consistency and Replication

Lecture 6 Consistency and Replication Lecture 6 Consistency and Replication Prof. Wilson Rivera University of Puerto Rico at Mayaguez Electrical and Computer Engineering Department Outline Data-centric consistency Client-centric consistency

More information

Bull. HACMP 4.4 Programming Locking Applications AIX ORDER REFERENCE 86 A2 59KX 02

Bull. HACMP 4.4 Programming Locking Applications AIX ORDER REFERENCE 86 A2 59KX 02 Bull HACMP 4.4 Programming Locking Applications AIX ORDER REFERENCE 86 A2 59KX 02 Bull HACMP 4.4 Programming Locking Applications AIX Software August 2000 BULL CEDOC 357 AVENUE PATTON B.P.20845 49008

More information

Curriculum 2013 Knowledge Units Pertaining to PDC

Curriculum 2013 Knowledge Units Pertaining to PDC Curriculum 2013 Knowledge Units Pertaining to C KA KU Tier Level NumC Learning Outcome Assembly level machine Describe how an instruction is executed in a classical von Neumann machine, with organization

More information

Learning from Bad Examples. CSCI 5828: Foundations of Software Engineering Lecture 25 11/18/2014

Learning from Bad Examples. CSCI 5828: Foundations of Software Engineering Lecture 25 11/18/2014 Learning from Bad Examples CSCI 5828: Foundations of Software Engineering Lecture 25 11/18/2014 1 Goals Demonstrate techniques to design for shared mutability Build on an example where multiple threads

More information

Method-Level Phase Behavior in Java Workloads

Method-Level Phase Behavior in Java Workloads Method-Level Phase Behavior in Java Workloads Andy Georges, Dries Buytaert, Lieven Eeckhout and Koen De Bosschere Ghent University Presented by Bruno Dufour dufour@cs.rutgers.edu Rutgers University DCS

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Computer Organization. Chapter 16

Computer Organization. Chapter 16 William Stallings Computer Organization and Architecture t Chapter 16 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data

More information

What s An OS? Cyclic Executive. Interrupts. Advantages Simple implementation Low overhead Very predictable

What s An OS? Cyclic Executive. Interrupts. Advantages Simple implementation Low overhead Very predictable What s An OS? Provides environment for executing programs Process abstraction for multitasking/concurrency scheduling Hardware abstraction layer (device drivers) File systems Communication Do we need an

More information