A Study of Java Virtual Machine Scalability Issues on SMP Systems


Zhongbo Cao, Wei Huang, and J. Morris Chang
Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa
{jzbcao, huangwei,

Abstract — This paper studies the scalability issues of the Java Virtual Machine (JVM) on Symmetrical Multiprocessing (SMP) systems. Using a cycle-accurate simulator, we evaluate how the performance of multithreaded Java benchmarks scales with the number of processors and application threads. By correlating low-level hardware performance data to two high-level software constructs, thread types and memory regions, we present a detailed performance analysis and study the potential impacts of memory system latencies and resource contentions on scalability. Several key findings emerge from this study. First, among the memory access latency components, the primary portion of memory stalls is produced by L2 cache misses and cache-to-cache transfers. Second, among the memory regions, the Java heap space produces most memory stalls. Additionally, a large majority of memory stalls occur in application threads, as opposed to other JVM threads. Furthermore, we find that increasing the number of processors or application threads, independently of each other, leads to increases in the L2 cache miss ratio and the cache-to-cache transfer ratio. This problem can be alleviated by using a thread-local heap or allocation buffer, which can improve L2 cache performance. For certain benchmarks such as RayTracer, cache-to-cache transfers, dominated mainly by false sharing, can be significantly reduced. Our experiments also show that a thread-local allocation buffer with a size between 16KB and 256KB often leads to optimal performance.

I. INTRODUCTION

In recent years, Symmetrical Multiprocessing (SMP) has become increasingly popular as a scalable parallel computing platform.
This popularity is mainly attributed to two factors: SMP-capable processors and operating systems. Many modern processors, such as Intel's Xeon, Sun's UltraSparc, and IBM's PowerPC, have built-in SMP support. It is now feasible for end-users to build SMP systems at low cost. Similarly, operating systems (OS), especially Microsoft Windows and Linux, also support SMP. These two factors drive the adoption of SMP in various computing environments. Java is emerging as a competitive paradigm for software development. Designed with many advanced features, such as automatic memory management (i.e. garbage collection), enforced security checks, and cross-platform portability, Java has become a popular programming language on various platforms. Because of its built-in multithreading support, Java has been widely used to develop multithreaded programs for server platforms (such as SMP systems). The design of Java threads in the Java Virtual Machine (JVM) has been improved over the years to satisfy the requirements of higher performance and better scalability. The Java specifications offer the flexibility to implement Java threads in several alternative ways. These implementations can be generalized as an m-to-n threading model, meaning that m user-level Java threads are mapped to n native kernel threads. The original implementation of the m-to-1 threading model is known as green threads. This threading model requires a user-level scheduler to control the execution of user-level Java threads. While the green-threads approach is efficient on uniprocessor systems, it does not scale well on SMP systems because only one processor is used for actual computation. Nowadays most state-of-the-art JVMs take advantage of native kernel thread support from the OS, since the OS thread scheduler scales better on SMP systems. For instance, Sun JDK uses the 1-to-1 threading model.
Though the latest design of Java threads improves JVM performance with operating system support, many other issues, such as memory system performance, can still prevent Java threads from scaling well on SMP systems. The goal of this paper is to study the scalability issues of multithreaded Java applications on SMP systems, to identify potential performance bottlenecks, and to develop recommendations for programmers and compiler writers for performance optimization. Specifically, this paper makes the following contributions: It presents a comprehensive performance characterization of several well-known multithreaded Java benchmarks on SMP systems. It evaluates benchmark performance and examines benchmark scalability by varying the number of processors and application threads. It presents a thorough analysis by breaking down the performance data based on thread types, memory system latency components, and/or memory regions; this allows us to identify the potential performance bottlenecks and correlate them to their sources. It provides insights into the key sources of two major bottlenecks: memory system latencies and resource contentions. It also demonstrates and examines optimization techniques for minimizing such performance impacts on SMP systems. For example, the parallel garbage collector shows better scalability than the default garbage collector because of its higher processor utilization during garbage collection; a thread-local heap or allocation buffer improves the performance of multithreaded Java benchmarks on SMP systems because it can significantly improve the L2 cache locality of each thread.

The rest of this paper is organized as follows. In section II, we describe the experimental methodology, including the simulation environment, the multithreaded benchmarks, and the implementation details of our experiments. The experimental results together with their analysis are presented in section III. Section IV briefly reviews prior and related research work. Finally, we draw our conclusions and point out future research directions in section V.

II. METHODOLOGY

This section describes our simulation environment, the multithreaded Java benchmarks, and the details of implementation and measurement.

A. Simulation Environment

Simics is a full-system simulator which can simulate most popular hardware platforms running various unmodified software components. Our simulation system includes four layers: Java benchmark, Java Virtual Machine, OS, and CPU. We use Simics to simulate a Linux operating system running on a shared-memory multiprocessor system [12]. Java benchmarks can then be run on top of the JVM on the simulated system for performance measurement. The simulation environment is set up as follows. Each simulated CPU is an in-order issue processor with the Pentium IV instruction set and two cache levels (split L1 cache: 16K, 4-way, 1-cycle latency; unified L2 cache: 512K, 8-way, 8-cycle latency). 2GB of main memory (100-cycle latency) is shared by all processors in the simulated system. The number of processors can be varied across simulation configurations. The Linux operating system is based on the kernel.
This version of the kernel provides a constant-time (O(1)) thread scheduler and supports process preemption in both user space and kernel space. These new features provide faster process response times and higher throughput than previous kernel versions. To avoid skewed results in our experiments, single-user mode is enabled so that only a very few necessary system processes run on the system. The JVM we used is Sun HotSpot JVM. Unless otherwise stated, the default JVM options are used in the experiments, since the default configuration has been tuned by the software vendor and usually produces the highest throughput. In all experiments, a 512MB Java heap is used, reflecting the fact that an SMP system usually has a large main memory.

B. Multithreaded Benchmarks

Table I describes the four multithreaded Java benchmarks used in our experiments. Three benchmarks (MolDyn, MonteCarlo, and RayTracer) come from the multithreaded Java Grande Forum (JGF) Benchmark Suite [8]. We also use PseudoJBB, a variant of SPECjbb2000 [15], as a multithreaded benchmark. SPECjbb2000 was developed by the Standard Performance Evaluation Corporation (SPEC) for evaluating the performance of servers running typical Java business applications. As a variant of SPECjbb2000, PseudoJBB can run a fixed number of transactions in multiple warehouses. This benchmark has been widely used for performance evaluation in recent studies [1], [4], [16]. In our experiments, the data of the single-threaded initialization stage of PseudoJBB is excluded, since our interest is in the multithreading properties.

C. Implementation and Measurement

Many factors, such as the instructions to be executed, memory access latency, I/O access latency, processor utilization, thread synchronization, etc., can affect the performance of multithreaded Java applications on SMP systems.
In order to evaluate benchmark performance and identify the performance bottlenecks, it is necessary to record performance data during the execution of an application for further analysis. As a full-system simulator, Simics offers the capability to directly retrieve hardware performance data, such as cache misses and TLB misses, without interfering with the JVM. However, Simics does not directly know about software behavior. In order to correlate hardware performance data with high-level software behavior, we use the magic instruction in Simics to notify the simulator of the software events of interest. Simics uses a special architecture-dependent instruction (i.e. xchg %bx,%bx on Intel x86) as the magic instruction. We therefore instrument the source code of the JVM with this instruction, and a real execution of the instrumented JVM shows that the instrumentation overhead is completely negligible. The performance data is recorded on a per-thread basis. We recompile the JVM with support for Drepper's Native POSIX Thread Library (NPTL) model [3], which exhibits higher scalability than the classic pthread library on SMP systems. The NPTL threading model provides a 1-to-1 mapping from pthread to kernel process (thread). Similarly, Sun JDK uses a 1-to-1 mapping from Java thread to pthread. Therefore, we can observe Java thread behavior from the OS level by setting a breakpoint at the address of the OS process scheduler. Whenever the breakpoint is reached, Simics is informed of the context switch of kernel processes, and the performance data is recorded and correlated to the thread that was running right before the context switch. We run the multithreaded Java benchmarks with different configurations to obtain performance data for our analysis. This allows us to tell how well the multithreaded Java benchmarks scale with the number of processors and application threads, and where the potential performance bottlenecks are.

TABLE I
MULTITHREADED JAVA BENCHMARKS

Benchmark    Description                                                                  Input
MolDyn       N-body code modeling particles interacting under a Lennard-Jones potential   2,048 solutions
MonteCarlo   Monte Carlo techniques to price products derived from an underlying asset    10,000 solutions
RayTracer    A 3D raytracer, which renders 64 spheres with configurable resolutions       500 solutions
PseudoJBB    A variant version of SPECjbb2000                                             ,000 transactions

The configurations can be divided into four categories:

Fixed number of application threads: For each run of the benchmarks, we use 16 Java application threads but vary the number of processors of the SMP system (1, 2, 4, 6, and 8). We use no more than 8 processors because Linux version 2.4 or higher uses the logical destination APIC mode of Intel processors, which limits the number of processors to 8. The uniprocessor system (1 processor), which is not an SMP configuration, is included only for comparison.

Fixed number of processors: For each run of the benchmarks, we fix the number of processors at 4 but vary the number of application threads from 1 to 12 (1, 2, 4, 6, 8, 10, and 12). Table I shows the input for each benchmark. Note that the input is constant for each run, regardless of how many processors or application threads are used.

Fixed number of application threads with parallel garbage collector: This configuration is the same as the first one except that it uses a parallel garbage collector. The purpose is to show how the parallel garbage collector differs from the default stop-the-world garbage collector in terms of performance. We do not run the second configuration (fixed number of processors) with the parallel garbage collector because we find the first is sufficient to show the patterns.
Fixed number of processors and fixed number of application threads: In this configuration, we fix the number of processors and application threads at 4 and 8, respectively. For each run of the benchmarks, we vary the size of the thread-local allocation buffer to examine its performance impact on the memory system and to determine the size that achieves optimal performance. Details are discussed in section III-E.

III. EXPERIMENTAL RESULTS

The experimental results are analyzed and summarized in this section. The objective is to correlate hardware-level performance data back to software-level constructs, which allows us to identify the performance bottlenecks at multiple levels. Two software-level constructs, thread types and memory regions, are introduced and justified in Section III-A. Then, the results of throughput scaling with the number of processors and application threads are presented. We then break down the performance data based on thread types, memory access latency components, and/or memory regions, and discuss the performance and scalability analysis in detail. Lastly, optimization techniques such as the parallel garbage collector and the thread-local heap or allocation buffer are examined.

A. Two software constructs: Thread Types and Memory Regions

On SMP systems, different threads in an application usually have different behaviors, and knowing these behaviors helps us identify the performance bottlenecks. When a multithreaded Java application runs on Sun JDK 1.4.2, a certain number of threads are created. The number and types of threads can vary depending on the application behavior, the execution platform, the garbage collection algorithm, etc. Typically, the thread types include the main thread, application thread, compiler thread, garbage collection thread, idle thread, signal dispatcher thread, reference handler thread, finalizer thread, suspend checker thread, and watcher thread.
Because threads of the same type usually behave similarly, we group threads by type and present the analysis for each type. In particular, we find that the signal dispatcher, reference handler, finalizer, suspend checker, and watcher threads have very little impact on benchmark execution, since together they always contribute less than 2% of the total execution time. Thus, we do not distinguish these types of threads but classify them together under a new type, called other threads, in our analysis. In the experiments, we also study the behaviors of the L1 instruction cache, L1 data cache, and unified L2 cache. The results show that the memory system plays a very important role in overall SMP system performance. As shown in Figure 1, the 4GB virtual address space of a JVM process running on a typical 32-bit Linux operating system can be divided into several memory regions. Note that the addresses shown in the figure may vary with the size of the Java heap. To further explore the details of memory system behavior and identify the performance bottlenecks, we present the memory system performance analysis based on these memory regions where necessary.
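The thread types grouped above can be observed directly from a running JVM. The following sketch (the class name is ours, and the exact thread names vary by JVM vendor and version, so treat the list as illustrative) enumerates the live threads by name:

```java
import java.util.Map;

// Sketch: enumerate the live threads of the current JVM, similar in spirit
// to the per-thread grouping used in the paper's analysis. On a HotSpot-style
// JVM the list typically includes "main", "Finalizer", and "Reference Handler";
// the exact set depends on the JVM version and configuration.
public class ThreadTypes {
    public static String[] liveThreadNames() {
        Map<Thread, StackTraceElement[]> all = Thread.getAllStackTraces();
        return all.keySet().stream()
                  .map(Thread::getName)
                  .sorted()
                  .toArray(String[]::new);
    }

    public static void main(String[] args) {
        for (String name : liveThreadNames()) {
            System.out.println(name);
        }
    }
}
```

Running this under different JVMs shows how the service threads (finalizer, reference handler, etc.) that the paper lumps into "other threads" coexist with the application's own threads.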

Fig. 1. An Example of JVM Virtual Memory Space Layout. From the top of the 4GB address space: kernel space (0xC0000000-0xFFFFFFFF), thread heap and stack space, JVM heap space (comprising permanent, mature, and nursery heap spaces), compiled Java method space, and JVM and shared library space.

Fig. 2. Speedup of Throughput: (a) vs. Number of Processors; (b) vs. Number of Application Threads.

B. Throughput

We use the speedup of throughput to examine the scalability of the benchmarks under study. Figure 2(a) shows the scaling of throughput with the number of processors. All the benchmarks achieve increases in throughput with the number of processors. However, the speedups tend to be lower than linear. PseudoJBB has the lowest speedup in all cases. Additionally, no significant improvement is observed beyond 6 processors for PseudoJBB, indicating that some potential performance bottlenecks limit the improvement. Further investigation of this issue is presented in section III-C. Figure 2(b) shows the scaling of throughput with the number of application threads. All the benchmarks achieve increases in performance with the number of application threads as long as the number of application threads is less than the number of processors, obviously due to higher CPU utilization with more threads. Peak throughput is reached when the number of application threads equals the number of processors. Beyond that point, we observe performance degradation for both PseudoJBB and MolDyn, while there are no significant changes for MonteCarlo or RayTracer. The throughput scaling results reveal that using more processors often leads to higher performance, and that matching the number of application threads to the number of processors is important for achieving maximum performance. However, because potential bottlenecks offset this effect, no benchmark shows a linear increase in performance. In the following sections, the behavior of throughput scaling is studied in further detail by examining the impacts of different performance data on overall system performance.
Through this detailed analysis, useful findings about performance and scalability issues can be observed.

C. Breakdown of Execution Time

In this section, we present our analysis by breaking down the total execution cycles based on thread types. We also examine the contributions of various memory access latency components to the overall CPI (average cycles per instruction).

1) Breakdown of Total Execution Time by Thread Types: Figure 3(a) and Figure 3(b) show the scaling of total execution cycles broken down by thread types. The total execution cycles are normalized to 1, which allows us to compare the contributions of the different threads. The results show that, for all benchmarks, the application threads and the idle thread dominate the execution cycles, while the contributions of the compiler thread and other threads are negligible. This indicates that the JIT compiler is efficient at compiling Java methods into native machine code. The garbage collection thread contributes from 5% to 10% of the total execution cycles for PseudoJBB and up to 3% for MonteCarlo, while no significant GC cycles are observed for MolDyn or RayTracer. Our investigation reveals that both PseudoJBB and MonteCarlo have relatively large working sets, resulting in more frequent and longer garbage collections. MolDyn has a very small working set and triggers no garbage collection at all. RayTracer allocates many small objects during execution; however, most of them are thread-local and die very young, so garbage collections complete quickly without copying a large number of objects from the nursery heap space to the mature heap space. In our experiments, we find that the idle thread contributes significantly to the total execution cycles, resulting in severe performance degradation. In all benchmarks, idle cycles keep increasing with the number of processors.
For PseudoJBB running on an eight-processor SMP system, idle cycles can constitute as much as 50% of the total CPU cycles. The idle thread behaves differently when scaling with the number of application threads. We see the largest idle-cycle counts for all benchmarks when the number of application threads is 1: for instance, 74% of the total execution cycles are idle cycles for PseudoJBB, and 52% for MolDyn.

Fig. 3. Total Execution Cycles: (a) by Thread Types vs. Number of Processors; (b) by Thread Types vs. Number of Application Threads.

As the number of application threads increases, the idle cycles keep decreasing until the number of application threads equals the number of processors. Thereafter, idle cycles keep increasing with the number of application threads for both PseudoJBB and MolDyn, while there are no significant changes for MonteCarlo or RayTracer. Further investigation reveals that the major causes of idle cycles are lock contention and long garbage collections. MolDyn allocates a few highly shared objects; the strong contention of concurrent accesses to these objects produces a considerable number of idle cycles. For MonteCarlo and PseudoJBB, the idle cycles are largely produced during garbage collection because of their large working sets. RayTracer allocates a large number of thread-local objects, so contention on concurrent accesses to these objects tends to be small; meanwhile, these objects die very quickly, so garbage collection completes much faster. As a consequence, only a relatively small number of idle cycles are produced during its execution. In order to exploit thread-level parallelism for high performance, idle cycles should therefore be minimized. Optimizing the JVM threading system and Java's synchronization mechanism is important for reducing the idle cycles produced by lock contention. In addition, Java programmers have the responsibility to write scalable code at the application level. Also, as discussed in section III-F, a parallel garbage collector can significantly reduce the idle cycles during garbage collection.
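The two contention patterns described above can be illustrated with a small Java sketch (a toy workload with hypothetical names, not code from the benchmarks): several threads either serialize on one shared lock, as with MolDyn's highly shared objects, or update private state and merge once at the end, as with RayTracer's thread-local objects:

```java
// Sketch (hypothetical workload): contended vs. thread-local counting.
public class ContentionDemo {
    static final int THREADS = 4, ITERS = 100_000;
    static long shared = 0;
    static final Object lock = new Object();

    // Every increment serializes on one lock; waiting threads leave
    // their processors idle, the effect the paper attributes to
    // highly shared objects.
    static long contended() throws InterruptedException {
        shared = 0;
        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < ITERS; j++) {
                    synchronized (lock) { shared++; }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return shared;
    }

    // Each thread owns its own slot and no locking is needed;
    // results are merged once after all threads finish.
    static long threadLocal() throws InterruptedException {
        long[] partial = new long[THREADS];
        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            final int id = i;
            ts[i] = new Thread(() -> {
                for (int j = 0; j < ITERS; j++) partial[id]++;
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        long total = 0;
        for (long p : partial) total += p;
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(contended());   // 400000
        System.out.println(threadLocal()); // 400000
    }
}
```

Both variants compute the same total; the difference is purely in how much time threads spend blocked, which is what shows up as idle cycles in the paper's breakdown.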
2) Breakdown of CPI by Memory Access Latency Components: In this section, we perform the analysis by breaking down the overall CPI into five memory access latency components: processor time (the average L1 access time for the CPU to fetch one instruction and its data), L1 instruction cache miss stalls, L1 data cache miss stalls, L2 miss stalls, and cache-to-cache transfer stalls. We explicitly exclude the data of the idle thread, since it rarely accesses memory. Figure 4 shows the scaling of overall CPI, broken down into its memory access latency components. We observe that memory stalls constitute a large percentage of CPU cycles, ranging from 20% to 60%. This indicates that memory access latency is a very critical performance bottleneck on SMP systems. The results also show that the contributions of the memory access latency components to the overall CPI vary considerably among the four benchmarks. PseudoJBB and RayTracer are more significantly affected by memory system stalls, while MolDyn and MonteCarlo are less affected. Both PseudoJBB and RayTracer allocate a large volume of objects during execution, resulting in a large number of misses in both the L1 and L2 caches. MolDyn has a very small working set, so accesses to objects usually hit in the caches. MonteCarlo behaves differently from the other three benchmarks: it allocates a large number of objects, yet its CPI tends to be relatively small, because its regular memory access pattern leads to good cache locality. For all the benchmarks, L2 cache performance plays a very important role in the overall CPI. L2 cache miss stalls can account for as much as 50% of the total memory latency stalls. Cache-to-cache transfer stalls are the second-largest contributor to memory system stalls. We find significant cache-to-cache transfer stalls for RayTracer and PseudoJBB, and relatively fewer for MolDyn and MonteCarlo.
Together, L2 cache misses and cache-to-cache transfers contribute the majority of memory system stalls, while L1 instruction cache misses and L1 data cache misses have less impact on overall memory system performance. To further understand memory system performance, more detailed explanations are presented in section III-D. The scaling of L2 cache performance can become worse: we observe that increasing the number of processors or the number of application threads often leads to increases in both L2 cache misses and L2 cache-to-cache transfers, and the impact on cache-to-cache transfers is often slightly higher than that on cache misses. Two factors contribute to this effect. First, the default memory allocator of Sun JDK allocates objects in the nursery heap space through a bump pointer, so objects belonging to different threads can be allocated close to each other. An access to one object may cause another thread's object to be loaded into the same cache line, which obviously worsens the spatial cache locality of the current thread. This allocator can also lead to high cache-to-cache transfers: if two or more processors simultaneously access data in the same cache line, held in different processor caches, writing the data in one cache causes it to be invalidated in the other caches. Second, increasing the number of processors or application threads can make this situation even worse

Fig. 4. Overall CPI: (a) by Memory Access Latency Components vs. Number of Processors; (b) by Memory Access Latency Components vs. Number of Threads.

due to the increasing contention of memory accesses. Our further investigation shows that using a thread-local heap or allocation buffer can potentially improve cache system performance on SMP systems. We detail these studies in section III-E.

D. Memory System Performance

In this section, we further study memory system behavior in terms of thread types and memory regions, in order to understand in detail the causes of the memory system bottlenecks. As above, we exclude the data of the idle thread from our analysis. Also, in comparison to the other caches, the performance impact of L1 instruction cache misses tends to be small and is therefore not reported.

1) L1 Data Cache Misses: Figure 5(a) and Figure 5(b) show the scaling of the L1 data cache miss ratio, broken down by thread types. As with the L1 instruction cache, we find that the application threads dominate cache misses for all benchmarks. The garbage collection thread contributes cache misses as well for PseudoJBB and MonteCarlo; however, its impact is trivial compared to the application threads. This is the direct result of the large Java heap used in our experiments, which minimizes the impact of garbage collection. Figure 5(c) and Figure 5(d) show the scaling of the L1 data cache miss ratio, broken down by memory regions. We find that cache misses come from all regions, with the three Java heap spaces together having the largest contribution. This indicates that improving the cache locality of the Java heap spaces is critical for improving overall memory system performance. The behavior of MolDyn is slightly different: around 80% or more of its total cache misses come from the nursery heap space. As mentioned before, this is because of MolDyn's small working set.
All objects are allocated in the nursery space and are never promoted to the mature space; consequently, most memory accesses to the Java heap go to the nursery heap space.

2) L2 Cache Misses: Figures 5(e) and 5(f) illustrate the contributions of the different thread types to the L2 cache miss ratio. As with the L1 cache, we observe that L2 cache misses are dominated by the application threads. However, the L2 cache miss ratio does not stay constant; instead, it keeps increasing with either the number of processors or the number of application threads. The only exception is MolDyn, whose L2 cache miss ratio decreases with the number of processors. Figure 5(g) and Figure 5(h) show the contributions of the different memory regions to the L2 cache miss ratio. For all benchmarks, the kernel space and JVM code space cause only a small number of L2 cache misses. Instead, the Java heap space is the main contributor, since all object allocations and most data accesses occur in this space. Additionally, the garbage collector can also cause a large number of L2 cache misses in the Java heap space, since the generational garbage collector scans objects in the nursery heap space and copies live ones to the mature heap space. We also find that PseudoJBB has a much larger cache miss ratio in the Java heap space than the other benchmarks, which is attributed to PseudoJBB's large working set and code size.

3) Cache-to-Cache Transfers: To simplify the analysis, we attribute each cache-to-cache transfer to the thread responsible for the data modification that caused it. Figure 5(i) to Figure 5(l) show the contributions of the different components to L2 cache-to-cache transfers. We observe that, among all the thread types, the application threads dominate cache-to-cache transfers in all cases, while no significant contributions are observed for the other thread types.
Among all the memory regions, the nursery heap space contributes the most to cache-to-cache transfers for all benchmarks. We also find that increasing the number of processors or the number of application threads leads to an increase in cache-to-cache transfers. These observations show that L2 cache-to-cache transfers behave similarly to L2 cache misses. However, L2 cache misses are mainly caused by poor cache locality, while L2 cache-to-cache transfers are caused by true/false sharing of data among processors or threads.

E. Thread-local Heap/Allocation Buffer

A thread-local heap is a scheme in which each thread receives a partition of the heap for thread-local object allocation and thread-local garbage collection, without synchronization with other threads [2]. Its original intention is to reduce heap contention. However, our study shows that this scheme can also lead to good cache performance on SMP systems, and that this positive performance impact is greater than that of its original purpose.
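The sharing effect that thread-local allocation is meant to avoid can be reproduced with a small, machine-dependent Java sketch (class and constants are hypothetical, and the timings only illustrate the mechanism, not the paper's measurements): two threads update counters that either fall on the same cache line or on clearly separate lines; on an SMP machine the same-line case is typically slower, because each write forces the line to bounce between processor caches:

```java
// Sketch (illustrative): false sharing between two threads.
// Adjacent long[] slots usually share a 64-byte cache line; slots 32
// entries apart (256 bytes) do not. Timings are machine-dependent.
public class FalseSharingDemo {
    static final int ITERS = 10_000_000;

    // Each thread increments its own slot; returns elapsed nanoseconds.
    static long run(long[] counters, int slotA, int slotB) throws InterruptedException {
        Thread a = new Thread(() -> { for (int i = 0; i < ITERS; i++) counters[slotA]++; });
        Thread b = new Thread(() -> { for (int i = 0; i < ITERS; i++) counters[slotB]++; });
        long start = System.nanoTime();
        a.start(); b.start();
        a.join(); b.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        long sharedLine   = run(new long[64], 0, 1);   // likely same cache line
        long separateLine = run(new long[64], 0, 32);  // separate cache lines
        System.out.println("shared line:   " + sharedLine / 1_000_000 + " ms");
        System.out.println("separate line: " + separateLine / 1_000_000 + " ms");
    }
}
```

No data is logically shared between the two threads in either run; only the cache-line placement differs, which is exactly the false-sharing component of the cache-to-cache transfers measured in this section.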

Fig. 5. Cache Performance: (a) L1 data cache misses by thread types vs. number of processors; (b) L1 data cache misses by thread types vs. number of application threads; (c) L1 data cache misses by memory regions vs. number of processors; (d) L1 data cache misses by memory regions vs. number of application threads; (e) L2 cache misses by thread types vs. number of processors; (f) L2 cache misses by thread types vs. number of application threads; (g) L2 cache misses by memory regions vs. number of processors; (h) L2 cache misses by memory regions vs. number of application threads; (i) cache-to-cache transfers by thread types vs. number of processors; (j) cache-to-cache transfers by thread types vs. number of application threads; (k) cache-to-cache transfers by memory regions vs. number of processors; (l) cache-to-cache transfers by memory regions vs. number of application threads.

Fig. 6. Thread-local Allocation Buffer: (a) Throughput vs. Size of Thread-local Allocation Buffer; (b) Memory Stalls/Instruction vs. Size of Thread-local Allocation Buffer.

Sun JDK does not implement a thread-local heap directly, but offers a similar approach called the thread-local allocation buffer. However, there are some differences between the two. The thread-local heap approach requires that only non-shared objects be allocated locally in the heap belonging to the thread that creates them. It has the advantage that thread-local objects can be garbage collected independently, without stopping other application threads. However, it is complicated because it requires compiler support, and overheads such as write barriers have to be introduced. In contrast, the thread-local allocation buffer approach allows any object to be allocated locally in the allocation buffer belonging to the thread that creates it. In this approach, objects in the allocation buffer cannot be collected independently because they may be shared by other application threads. However, the implementation is simpler and often leads to similar performance. Therefore, we study the performance impact of the thread-local allocation buffer instead of the thread-local heap in this section. Sun JDK allows us to enable the thread-local allocation buffer and specify the per-thread buffer size through the command line. We only run two benchmarks, RayTracer and PseudoJBB, on a simulated SMP system with four processors. The other benchmarks seem insensitive to its performance impact, either because MolDyn has a small memory footprint or because MonteCarlo already has good cache performance (see Figure 4); therefore, they are not reported. In the simulation, eight application threads are run for each benchmark. We also vary the size of the thread-local allocation buffer to examine its impact on performance.
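The bump-pointer mechanics behind a thread-local allocation buffer can be sketched as follows (a toy model with hypothetical names and sizes, not HotSpot's implementation): each thread bumps a private cursor inside its own buffer, and only takes the shared heap lock when the buffer is exhausted and a new one must be carved out of the heap:

```java
// Toy model of a thread-local allocation buffer (TLAB): fast private
// bump-pointer allocation, with a locked slow path only on refill.
public class TlabSketch {
    static final int HEAP_SIZE = 1 << 20;     // 1 MB toy heap
    static final int TLAB_SIZE = 64 * 1024;   // 64 KB per-thread buffer

    static int heapCursor = 0;                // shared bump pointer (lock-protected)
    static final Object heapLock = new Object();

    // Per-thread buffer state: {cursor, end}.
    static final ThreadLocal<int[]> tlab = ThreadLocal.withInitial(() -> new int[] {0, 0});

    /** Returns the heap offset of a freshly allocated block of `size` bytes. */
    static int allocate(int size) {
        int[] buf = tlab.get();
        if (buf[0] + size > buf[1]) {          // buffer exhausted: refill under lock
            synchronized (heapLock) {
                if (heapCursor + TLAB_SIZE > HEAP_SIZE)
                    throw new OutOfMemoryError("toy heap exhausted");
                buf[0] = heapCursor;
                buf[1] = heapCursor + TLAB_SIZE;
                heapCursor += TLAB_SIZE;
            }
        }
        int addr = buf[0];                     // common case: private bump, no locking
        buf[0] += size;
        return addr;
    }

    public static void main(String[] args) {
        int a = allocate(16);
        int b = allocate(16);
        System.out.println(b - a);             // 16: consecutive, same buffer
    }
}
```

Because consecutive allocations from one thread land contiguously within its buffer, the thread's objects cluster on cache lines it alone touches, which is the source of the locality and false-sharing benefits discussed above. In HotSpot-based JVMs the real mechanism is controlled by command-line options (e.g. -XX:+UseTLAB and, in later releases, -XX:TLABSize).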
Figure 6(a) shows the throughput speedup of RayTracer and PseudoJBB as the size of the thread-local allocation buffer varies; a size of 0 means no thread-local allocation buffer is used. Overall, using a thread-local allocation buffer improves performance. Figure 6(b) illustrates the memory stalls as a function of buffer size. We observe that performance improves because L2 cache misses and cache-to-cache transfers are significantly reduced. The allocation behavior explains this effect. The default memory allocator of Sun JDK allocates objects in the contiguous nursery space by incrementing a bump pointer. If multiple threads are running simultaneously, objects belonging to different threads can be allocated close to each other, which leads to poor spatial cache locality for each thread. This allocator can make cache performance even worse on SMP systems because of cache-to-cache transfers (true/false sharing misses): objects belonging to different threads can be placed in the same cache line but reside in different processor caches, so a write to this cache line by one processor invalidates the data in the other processors' caches. The thread-local allocation buffer alleviates these problems by allocating a thread's objects together. Cache locality improves, and false sharing misses are significantly reduced because most cache lines are no longer shared between threads. We observe only small performance gains in the L1 instruction and L1 data caches, mainly due to their small sizes. Figure 6(a) also shows that the size of the thread-local allocation buffer affects overall system performance. We find that a size between 16KB and 256KB often leads to optimal performance. If the size is too small, a thread may have to keep requesting a new buffer whenever its current allocation buffer fills up.
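The interleaving effect of the shared bump pointer can be illustrated with a toy model (our sketch, not HotSpot's actual allocator; integer slot indices stand in for heap addresses):

```java
import java.util.Arrays;

public class BumpSketch {
    // Shared bump pointer: threads A and B alternate single-slot
    // allocations, so their objects interleave in the nursery.
    static int[][] sharedBump(int n) {
        int bump = 0;
        int[][] addrs = new int[2][n];
        for (int i = 0; i < n; i++) {
            addrs[0][i] = bump++; // thread A allocates one slot
            addrs[1][i] = bump++; // thread B allocates the next slot
        }
        return addrs;
    }

    // Thread-local buffers: each thread bump-allocates inside its own
    // contiguous region, so its objects stay together.
    static int[][] tlab(int n) {
        int[][] addrs = new int[2][n];
        for (int t = 0; t < 2; t++)
            for (int i = 0; i < n; i++)
                addrs[t][i] = t * n + i; // buffer t covers [t*n, (t+1)*n)
        return addrs;
    }

    // True if the thread's addresses are consecutive, i.e. no other
    // thread's object can land between them on the same cache line.
    static boolean contiguous(int[] a) {
        for (int i = 1; i < a.length; i++)
            if (a[i] != a[i - 1] + 1) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println("shared A: " + Arrays.toString(sharedBump(4)[0])); // [0, 2, 4, 6]
        System.out.println("tlab   A: " + Arrays.toString(tlab(4)[0]));       // [0, 1, 2, 3]
    }
}
```

With the shared bump pointer, thread A's objects sit at every other slot, so each of its cache lines also holds another thread's data; with per-thread buffers, its objects are contiguous and privately cached.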
This causes contention on object allocation and also memory fragmentation, since the object to be allocated may not fit in the free space of the current buffer. A small allocation buffer also results in poor locality, because a thread's objects can be scattered across separate buffers. With a large allocation buffer, however, the performance gain can be offset by penalties such as TLB misses. The optimal size of the thread-local allocation buffer varies with the cache organization, the application's execution behavior, and other factors. Dynamically choosing the buffer size has the potential to improve performance further, and we plan to explore this in future work.

F. Parallel Garbage Collector vs. Default Stop-the-world Garbage Collector

We use the default stop-the-world garbage collector in our experiments. This collector allows only one processor to actively execute garbage collection. A parallel garbage collector, on the other hand, can fully utilize all the processors of an SMP system to collect garbage in parallel. In this section, we compare the performance of these two collectors on SMP systems. We examine only PseudoJBB, since the JGF benchmarks do not produce sufficiently long garbage collections with the 512 MB heap space.
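The expected scaling gap between the two collectors can be anticipated with a simple Amdahl-style model (our back-of-the-envelope assumption, not measured data): if a fraction g of the single-processor run is spent in a serial stop-the-world collector, only the mutator fraction 1 - g parallelizes.

```java
public class GcScaling {
    // Speedup on p processors when GC (fraction g of the single-CPU
    // run) stays serial: the other p-1 processors idle during GC.
    static double serialGc(int p, double g) {
        return 1.0 / (g + (1.0 - g) / p);
    }

    public static void main(String[] args) {
        double g = 0.2; // assumed: 20% of the one-processor run is GC
        for (int p : new int[] {1, 2, 4, 8, 16})
            System.out.printf("P=%-2d serial-GC speedup=%.2f (ideal parallel ~ %d)%n",
                              p, serialGc(p, g), p);
        // The serial-GC curve flattens toward 1/g = 5x no matter how
        // many processors are added, while an ideal parallel collector
        // keeps both mutator and GC phases scaling with p.
    }
}
```

This plateau behavior is what the measurements below exhibit for the default collector once the processor count grows.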

Fig. 7. Default GC vs. Parallel GC: (a) throughput of PseudoJBB and (b) total execution cycles of PseudoJBB by thread type, versus the number of processors.

Figure 7(a) shows the throughput scaling of PseudoJBB with the number of processors. The parallel garbage collector shows a performance improvement close to linear as the number of processors increases, whereas the improvement of the default garbage collector falls far below linear. The performance gap between the two collectors keeps widening with the number of processors; beyond six processors, only a very small performance gain is observed for the default garbage collector. This is caused by idle cycles during garbage collection. The parallel garbage collector, on the other hand, does not waste as many CPU cycles as the default collector and therefore achieves a much higher performance improvement. To verify this conclusion, Figure 7(b) shows the scaling of total execution cycles, broken down by thread type. Due to contention and synchronization among the garbage collection threads, the parallel garbage collector uses slightly more CPU cycles on collection than the default garbage collector. However, since the decrease in idle cycles outweighs the increase in garbage collection cycles, better performance is achieved because more CPU cycles can be used by the application threads.

IV. RELATED WORK

The behavior of Java applications has been evaluated since Java was first introduced in late 1995 [6], [10], [14], [13], [7]. Most studies focused on single-threaded Java programs, especially the SPECjvm98 benchmarks; studies of multithreaded benchmarks are rare. In recent years, because of the popularity of Java-based server applications, the performance of multithreaded Java programs has become of great interest. Using the performance counters provided by the processor, Luo et al.
studied the characteristics of Java server applications on the Pentium III [11]. They found that such programs have worse instruction-stream behavior (including I-cache miss rate, ITLB miss rate, etc.) than SPECint2000. By increasing the number of threads, they also studied the impact of Java threads on the micro-architecture. Instead of running benchmarks on a uniprocessor system, our work focuses on the performance characterization of multithreaded Java programs in an SMP environment. Many metrics studied in this work, such as cache-to-cache transfers, are not available on single-processor systems. Using a full system simulator (Simics) and a real machine, Karlsson et al. studied the memory system behavior of Java middleware running on SMP systems [9]. They mainly focused on the characterization of low-level (hardware) performance metrics, such as cache-to-cache transfers. Compared with that research, our work is more fine-grained: it attributes the detailed low-level performance metrics to high-level software components of the Java Virtual Machine. Specifically, we focus on the correlation between these low-level performance metrics and two high-level software constructs: thread types and memory regions. Such correlation can help identify potential performance and scalability bottlenecks at the application level for further optimization. Sweeney et al. recently reported a performance monitoring system in Jikes RVM, implemented on top of hardware performance counters [16]. As a demonstration, two performance issues (general performance trends and memory latency) were investigated using this system. The results show that their tool is able to attribute observed program behavior to specific components of the JVM. However, the profiling system has limitations: the performance metrics it can examine rely heavily on the capabilities of the processor's performance counters.
Our simulation infrastructure is based on Simics, a full system simulator, which enables us to profile more aspects of the applications. For instance, we are able to categorize cache misses by memory region, which is infeasible with current implementations of performance counters. Building on the infrastructure introduced in [16], Hauswirth et al. [5] examined further applications of this profiling system. Their work introduces a technique called vertical profiling, which correlates performance characterizations across the layers of modern object-oriented systems (OS, virtual machine, and applications). Unlike our work, their research did not particularly focus on scalability issues on SMP systems, though their experiments were also based on a 4-way SMP system.

V. CONCLUSIONS

In this paper, we study the scalability issues of the JVM on SMP systems. The detailed simulator offers us a great environment to evaluate the performance scaling of multithreaded benchmarks with the number of processors and application threads. Our analysis methodology of correlating low-level performance data to high-level software constructs (thread types and memory regions) allows us to identify performance and scalability bottlenecks at multiple levels. Two potential bottlenecks, memory system latencies and lock contention, are studied in this work. Several key observations emerge. First, in terms of memory access latency components, memory regions, and threads, the primary portion of memory stalls is produced by L2 cache misses and cache-to-cache transfers, by the Java heap space, and by the Java application threads, respectively. Second, increasing the number of processors or application threads, independently of each other, often increases the L2 cache miss ratio and the L2 cache-to-cache transfer ratio, which potentially prevents the system from scaling up linearly. Lastly, lock contention can cause a large number of idle cycles, indicating a significant lack of thread-level parallelism on SMP systems. In particular, idle cycles often scale up with the number of processors and application threads, resulting in non-linear performance improvement or even performance degradation. Several optimization techniques are examined for their ability to reduce the impact of these performance bottlenecks. We observe that using a thread-local heap or allocation buffer can significantly reduce L2 cache misses and L2 cache-to-cache transfers for multithreaded Java benchmarks running on SMP systems, although its original intention is to reduce heap contention. A thread-local allocation buffer with a size between 16KB and 256KB often leads to optimal performance. The parallel garbage collector is also shown to have better scalability on SMP systems, because its workload is balanced across processors for higher CPU utilization.
To our knowledge, this is the first work that investigates scalability issues by correlating low-level performance data to high-level software constructs. Our future work includes dynamically choosing the size of the allocation buffer, further exploring the behavior of lock contention, and validating the simulation results on real SMP systems.

ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grants No. (ITR) and (ITR). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] S. M. Blackburn, S. Singhai, M. Hertz, K. S. McKinley, and J. E. B. Moss. Pretenuring for Java. In Proceedings of the 2001 ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), Tampa Bay, FL, October 2001.
[2] T. Domani, G. Goldshtein, E. K. Kolodner, E. Lewis, E. Petrank, and D. Sheinwald. Thread-local heaps for Java. In ISMM '02: Proceedings of the 3rd International Symposium on Memory Management, pages 76-87, Berlin, Germany, 2002. ACM Press.
[3] U. Drepper and I. Molnar. The native POSIX thread library for Linux. nptl-design.pdf.
[4] S. Z. Guyer and K. S. McKinley. Finding your cronies: static analysis for dynamic object colocation. In Proceedings of the 2004 ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), Vancouver, Canada, October 2004.
[5] M. Hauswirth, P. F. Sweeney, A. Diwan, and M. Hind. Vertical profiling: Understanding the behavior of object-oriented applications. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Vancouver, British Columbia, Canada, October 2004.
[6] C.-H. A. Hsieh, M. T. Conte, T. L. Johnson, J. C. Gyllenhaal, and W.-M. W. Hwu. A study of the cache and branch performance issues with running Java on current hardware platforms. In Proceedings of the 42nd IEEE International Computer Conference (CompCon), San Jose, CA, February 1997.
[7] W. Huang, J. Lin, Z. Zhang, and J. M. Chang. Performance characterization of Java applications on SMT processors. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, March 2005.
[8] Java Grande Forum. The Java Grande Forum multi-threaded benchmarks. Available at javagrande/threads.html.
[9] M. Karlsson, K. E. Moore, E. Hagersten, and D. A. Wood. Memory system behavior of Java-based middleware. In Proceedings of the 9th Annual International Symposium on High-Performance Computer Architecture (HPCA), Anaheim, CA, February 2003.
[10] T. Li, L. K. John, V. Narayanan, A. Sivasubramaniam, J. Sabarinathan, and A. Murthy. Using complete system simulation to characterize SPECjvm98 benchmarks. In Proceedings of the International Conference on Supercomputing (ICS), Santa Fe, NM, May 2000.
[11] Y. Luo and L. K. John. Workload characterization of multithreaded Java servers. In Proceedings of the 2001 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Tucson, Arizona, November 2001.
[12] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner. Simics: a full system simulation environment. IEEE Computer, pages 50-58, February 2002.
[13] R. Radhakrishnan, V. Narayanan, L. K. John, and A. Sivasubramaniam. Architectural issues in Java runtime systems. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture (HPCA), Toulouse, France, January 2000.
[14] B. Rychlik and J. P. Shen. Characterization of value locality in Java programs. In Proceedings of the 3rd Workshop on Workload Characterization in Association with ICCD, Austin, TX, September 2000.
[15] Standard Performance Evaluation Corporation (SPEC). SPECjbb2000 benchmark.
[16] P. F. Sweeney, M. Hauswirth, B. Cahoon, P. Cheng, A. Diwan, D. Grove, and M. Hind. Using hardware performance monitors to understand the behavior of Java applications. In Proceedings of the 3rd USENIX Virtual Machine Research and Technology Symposium (VM '04), San Jose, CA, May 2004.


More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Cross-Layer Memory Management to Reduce DRAM Power Consumption

Cross-Layer Memory Management to Reduce DRAM Power Consumption Cross-Layer Memory Management to Reduce DRAM Power Consumption Michael Jantz Assistant Professor University of Tennessee, Knoxville 1 Introduction Assistant Professor at UT since August 2014 Before UT

More information

Hardware-Supported Pointer Detection for common Garbage Collections

Hardware-Supported Pointer Detection for common Garbage Collections 2013 First International Symposium on Computing and Networking Hardware-Supported Pointer Detection for common Garbage Collections Kei IDEUE, Yuki SATOMI, Tomoaki TSUMURA and Hiroshi MATSUO Nagoya Institute

More information

Older-First Garbage Collection in Practice: Evaluation in a Java Virtual Machine

Older-First Garbage Collection in Practice: Evaluation in a Java Virtual Machine Older-First Garbage Collection in Practice: Evaluation in a Java Virtual Machine Darko Stefanovic (Univ. of New Mexico) Matthew Hertz (Univ. of Massachusetts) Stephen M. Blackburn (Australian National

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

Java On Steroids: Sun s High-Performance Java Implementation. History

Java On Steroids: Sun s High-Performance Java Implementation. History Java On Steroids: Sun s High-Performance Java Implementation Urs Hölzle Lars Bak Steffen Grarup Robert Griesemer Srdjan Mitrovic Sun Microsystems History First Java implementations: interpreters compact

More information

Measurement-based Analysis of TCP/IP Processing Requirements

Measurement-based Analysis of TCP/IP Processing Requirements Measurement-based Analysis of TCP/IP Processing Requirements Srihari Makineni Ravi Iyer Communications Technology Lab Intel Corporation {srihari.makineni, ravishankar.iyer}@intel.com Abstract With the

More information

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors. CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

IN recent years, web applications have become a popular

IN recent years, web applications have become a popular IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 5, NO. 2, APRIL-JUNE 2017 263 An Energy-Efficient Java Virtual Machine Kuo-Yi Chen, Member, IEEE, J. Morris Chang, Senior Member, IEEE, and Ting-Wei Hou, Member,

More information

PROCESS VIRTUAL MEMORY. CS124 Operating Systems Winter , Lecture 18

PROCESS VIRTUAL MEMORY. CS124 Operating Systems Winter , Lecture 18 PROCESS VIRTUAL MEMORY CS124 Operating Systems Winter 2015-2016, Lecture 18 2 Programs and Memory Programs perform many interactions with memory Accessing variables stored at specific memory locations

More information

Chapter 8: Main Memory. Operating System Concepts 9 th Edition

Chapter 8: Main Memory. Operating System Concepts 9 th Edition Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel

More information

Using Complete System Simulation to Characterize SPECjvm98 Benchmarks

Using Complete System Simulation to Characterize SPECjvm98 Benchmarks Using Complete System Simulation to Characterize SPECjvm98 Benchmarks Tao Li, Lizy Kurian John, Vijaykrishnan Narayanan, Anand Sivasubramaniam, Jyotsna Sabarinathan, and Anupama Murthy Laboratory for Computer

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

CSCI 4717 Computer Architecture

CSCI 4717 Computer Architecture CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel

More information

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 Introduction to Parallel Computing CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 1 Definition of Parallel Computing Simultaneous use of multiple compute resources to solve a computational

More information

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition Chapter 7: Main Memory Operating System Concepts Essentials 8 th Edition Silberschatz, Galvin and Gagne 2011 Chapter 7: Memory Management Background Swapping Contiguous Memory Allocation Paging Structure

More information

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of

More information

Java Garbage Collector Performance Measurements

Java Garbage Collector Performance Measurements WDS'09 Proceedings of Contributed Papers, Part I, 34 40, 2009. ISBN 978-80-7378-101-9 MATFYZPRESS Java Garbage Collector Performance Measurements P. Libič and P. Tůma Charles University, Faculty of Mathematics

More information

Chapter 8: Main Memory

Chapter 8: Main Memory Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel

More information

VIProf: A Vertically Integrated Full-System Profiler

VIProf: A Vertically Integrated Full-System Profiler VIProf: A Vertically Integrated Full-System Profiler NGS Workshop, April 2007 Hussam Mousa Chandra Krintz Lamia Youseff Rich Wolski RACELab Research Dynamic software adaptation As program behavior or resource

More information

Basic Memory Management

Basic Memory Management Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester 10/15/14 CSC 2/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it

More information

Using Java for Scientific Computing. Mark Bul EPCC, University of Edinburgh

Using Java for Scientific Computing. Mark Bul EPCC, University of Edinburgh Using Java for Scientific Computing Mark Bul EPCC, University of Edinburgh markb@epcc.ed.ac.uk Java and Scientific Computing? Benefits of Java for Scientific Computing Portability Network centricity Software

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen

More information

Chapter 8 Memory Management

Chapter 8 Memory Management Chapter 8 Memory Management Da-Wei Chang CSIE.NCKU Source: Abraham Silberschatz, Peter B. Galvin, and Greg Gagne, "Operating System Concepts", 9th Edition, Wiley. 1 Outline Background Swapping Contiguous

More information

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Agenda Introduction Memory Hierarchy Design CPU Speed vs.

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

CHAPTER 8: MEMORY MANAGEMENT. By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

CHAPTER 8: MEMORY MANAGEMENT. By I-Chen Lin Textbook: Operating System Concepts 9th Ed. CHAPTER 8: MEMORY MANAGEMENT By I-Chen Lin Textbook: Operating System Concepts 9th Ed. Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the

More information

Untyped Memory in the Java Virtual Machine

Untyped Memory in the Java Virtual Machine Untyped Memory in the Java Virtual Machine Andreas Gal and Michael Franz University of California, Irvine {gal,franz}@uci.edu Christian W. Probst Technical University of Denmark probst@imm.dtu.dk July

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

32 Hyper-Threading on SMP Systems

32 Hyper-Threading on SMP Systems 32 Hyper-Threading on SMP Systems If you have not read the book (Performance Assurance for IT Systems) check the introduction to More Tasters on the web site http://www.b.king.dsl.pipex.com/ to understand

More information

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:

More information

Heap Compression for Memory-Constrained Java

Heap Compression for Memory-Constrained Java Heap Compression for Memory-Constrained Java CSE Department, PSU G. Chen M. Kandemir N. Vijaykrishnan M. J. Irwin Sun Microsystems B. Mathiske M. Wolczko OOPSLA 03 October 26-30 2003 Overview PROBLEM:

More information

Go Deep: Fixing Architectural Overheads of the Go Scheduler

Go Deep: Fixing Architectural Overheads of the Go Scheduler Go Deep: Fixing Architectural Overheads of the Go Scheduler Craig Hesling hesling@cmu.edu Sannan Tariq stariq@cs.cmu.edu May 11, 2018 1 Introduction Golang is a programming language developed to target

More information

CS420: Operating Systems

CS420: Operating Systems Threads James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz, Galvin, Gagne Threads A thread is a basic unit of processing

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

Dynamic Object Sampling for Pretenuring

Dynamic Object Sampling for Pretenuring Dynamic Object Sampling for Pretenuring Maria Jump Department of Computer Sciences The University of Texas at Austin Austin, TX, 8, USA mjump@cs.utexas.edu Stephen M Blackburn Department of Computer Science

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr

More information