A Study of Java Virtual Machine Scalability Issues on SMP Systems


Zhongbo Cao, Wei Huang, and J. Morris Chang
Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa
{jzbcao, huangwei,

Abstract — This paper studies the scalability issues of the Java Virtual Machine (JVM) on Symmetrical Multiprocessing (SMP) systems. Using a cycle-accurate simulator, we evaluate how the performance of multithreaded Java benchmarks scales with the number of processors and application threads. By correlating low-level hardware performance data to two high-level software constructs, thread types and memory regions, we present a detailed performance analysis and study the potential impacts of memory system latencies and resource contentions on scalability. Several key findings emerge from this study. First, among the memory access latency components, the primary portion of memory stalls is produced by L2 cache misses and cache-to-cache transfers. Second, among the memory regions, the Java heap space produces most memory stalls. Additionally, a large majority of memory stalls occur in application threads, as opposed to other JVM threads. Furthermore, we find that increasing the number of processors or application threads, independently of each other, leads to increases in the L2 cache miss ratio and the cache-to-cache transfer ratio. This problem can be alleviated by using a thread-local heap or allocation buffer, which can improve L2 cache performance. For certain benchmarks such as RayTracer, cache-to-cache transfers, dominated mainly by false sharing, can be significantly reduced. Our experiments also show that a thread-local allocation buffer with a size between 16KB and 256KB often leads to optimal performance.

I. INTRODUCTION

In recent years, Symmetrical Multiprocessing (SMP) has become increasingly popular as a scalable parallel computing platform.
This popularity is mainly attributed to two factors: SMP-capable processors and operating systems. Many modern processors, such as Intel's Xeon, Sun's UltraSparc, and IBM's PowerPC, have built-in SMP support. It is now feasible for end-users to build SMP systems at low cost. Similarly, operating systems (OS), especially Microsoft Windows and Linux, also support SMP. These two factors drive the adoption of SMP in various computing environments. Java is emerging as a competitive paradigm for software development. Designed with many advanced features, such as automatic memory management (i.e. garbage collection), enforced security checks, and cross-platform portability, Java has become a popular programming language on various platforms. Because of its built-in multithreading support, Java has been widely used to develop multithreaded programs for server platforms (such as SMP systems). The design of Java threads in the Java Virtual Machine (JVM) has been improved over the years to satisfy the requirements of higher performance and better scalability. The Java specifications offer the flexibility to implement Java threads in several alternative ways. These implementations can be generalized as an m-to-n threading model, meaning that m user-level Java threads are mapped to n native kernel threads. The original implementation of the m-to-1 threading model is known as green threads. This threading model requires a user-level scheduler to control the execution of user-level Java threads. While the green-threads approach is efficient on uniprocessor systems, it does not scale well on SMP systems because only one processor is used for actual computation. Nowadays most state-of-the-art JVMs take advantage of native kernel thread support from the OS, since the OS thread scheduler scales better on SMP systems. For instance, Sun JDK uses the 1-to-1 threading model.
Though the latest design of Java threads improves JVM performance with operating system support, many other issues, such as memory system performance, can still prevent Java threads from scaling well on SMP systems. The goal of this paper is to study the scalability issues of multithreaded Java applications on SMP systems, to identify potential performance bottlenecks, and to develop recommendations for programmers and compiler writers for performance optimization. Specifically, this paper makes the following contributions: It presents a comprehensive performance characterization of several well-known multithreaded Java benchmarks on SMP systems. It evaluates benchmark performance and examines benchmark scalability by varying the number of processors and application threads. It presents a thorough analysis by breaking down the performance data based on thread types, memory system latency components, and/or memory regions; this allows us to identify the potential performance bottlenecks and correlate them to their sources. It provides insights into the key sources of two major bottlenecks: memory system latencies and resource contentions. It also demonstrates and examines optimization techniques for minimizing such performance impacts on SMP systems. For example, the parallel garbage collector shows better scalability than the default garbage collector because of its higher processor utilization during garbage collection; a thread-local heap or allocation buffer improves the performance of multithreaded Java benchmarks on SMP systems because it can significantly improve the L2 cache locality of each thread.

The rest of this paper is organized as follows. In section II, we describe the experimental methodology, including the simulation environment, the multithreaded benchmarks, and the implementation details of our experiments. The experimental results together with their analysis are presented in section III. Section IV briefly reviews prior and related research work. Finally, we draw our conclusions and point out future research directions in section V.

II. METHODOLOGY

This section describes our simulation environment, the multithreaded Java benchmarks, and the details of implementation and measurement.

A. Simulation Environment

Simics is a full-system simulator which can simulate most popular hardware platforms running various unmodified software components. Our simulation system includes four layers: Java benchmark, Java Virtual Machine, OS, and CPU. We use Simics to simulate a Linux operating system running on a shared-memory multiprocessor system [12]. Java benchmarks can then be run on top of the JVM on the simulated system for performance measurement. The simulation environment is set up as follows. Each simulated CPU is an in-order issue processor with the Pentium IV instruction set and two cache levels (split L1 cache: 16K, 4-way, 1-cycle latency; unified L2 cache: 512K, 8-way, 8-cycle latency). 2GB of main memory (100-cycle latency) is shared by all processors in the simulated system. The number of processors can be varied across simulation configurations. The Linux operating system is based on the kernel.
This version of the kernel provides a constant-time (O(1)) thread scheduler and supports process preemption in both user space and kernel space. These new features provide faster process response times and higher throughput than previous kernel versions. To avoid skewed results in our experiments, single-user mode is enabled so that only a very few necessary system processes run on the system. The JVM we used is Sun HotSpot JVM. Unless otherwise stated, the default JVM options are used in the experiments, since the default configuration has been tuned by the software vendor and usually produces the highest throughput. In all experiments, a 512MB Java heap is used, reflecting the fact that an SMP system usually has a large main memory.

B. Multithreaded Benchmarks

Table I describes the four multithreaded Java benchmarks used in our experiments. Three benchmarks (MolDyn, MonteCarlo, and RayTracer) come from the multithreaded Java Grande Forum (JGF) Benchmark Suite [8]. We also use PseudoJBB, a variant of SPECjbb2000 [15], as a multithreaded benchmark. SPECjbb2000 was developed by the Standard Performance Evaluation Corporation (SPEC) for evaluating the performance of servers running typical Java business applications. As a variant of SPECjbb2000, PseudoJBB can run a fixed number of transactions in multiple warehouses. This benchmark has been widely used for performance evaluation in recent studies [1], [4], [16]. In our experiments, the data of the single-threaded initialization stage of PseudoJBB is excluded, since our interest is in the multithreading properties.

C. Implementation and Measurement

Many factors, such as the instructions to be executed, memory access latency, I/O access latency, processor utilization, thread synchronization, etc., can affect the performance of multithreaded Java applications on SMP systems.
In order to evaluate benchmark performance and identify the performance bottlenecks, it is necessary to record performance data during the execution of an application for further analysis. As a full-system simulator, Simics offers the capability to directly retrieve hardware performance data, such as cache misses and TLB misses, without interfering with the JVM. However, Simics does not directly know about software behavior. In order to correlate hardware performance data with high-level software behavior, we use the magic instruction in Simics to notify the simulator of the software events of interest. Simics uses a special architecture-dependent instruction (i.e. xchg %bx,%bx on Intel x86) as the magic instruction. We therefore instrument the source code of the JVM with this instruction, and a real execution of the instrumented JVM shows that the instrumentation overhead is completely negligible. The performance data is recorded on a per-thread basis. We recompile the JVM with support for Drepper's Native POSIX Thread Library (NPTL) model [3], which exhibits higher scalability than the classic pthread library on SMP systems. The NPTL threading model provides a 1-to-1 mapping from pthread to kernel process (thread). Similarly, Sun JDK uses a 1-to-1 mapping from Java thread to pthread. Therefore, we can observe Java thread behavior from the OS level by setting a breakpoint at the address of the OS process scheduler. Whenever the breakpoint is reached, Simics is informed of the context switch of kernel processes, and the performance data is recorded and correlated to the thread that was running right before the context switch. We run the multithreaded Java benchmarks with different configurations to obtain performance data for our analysis. This allows us to tell how well the multithreaded Java benchmarks scale with the number of processors and application threads, and where the potential performance bottlenecks are.

TABLE I
MULTITHREADED JAVA BENCHMARKS

Benchmark    Description                                                                  Input
MolDyn       N-body code modeling particles interacting under a Lennard-Jones potential   2,048 solutions
MonteCarlo   Monte Carlo techniques to price products derived from an underlying asset    10,000 solutions
RayTracer    A 3D raytracer, which renders 64 spheres with configurable resolutions       500 solutions
PseudoJBB    A variant version of SPECjbb2000                                             ,000 transactions

The configurations can be divided into four categories:

Fixed number of application threads: For each run of the benchmarks, we use 16 Java application threads but vary the number of processors of the SMP system (1, 2, 4, 6, and 8). We use no more than 8 processors because Linux version 2.4 or higher uses the logical destination APIC mode of Intel processors, which limits the number of processors to 8. The uniprocessor system (1 processor), which is not an SMP configuration, is included only for comparison.

Fixed number of processors: For each run of the benchmarks, we fix the number of processors at 4 but vary the number of application threads from 1 to 12 (1, 2, 4, 6, 8, 10, and 12). Table I shows the input for each benchmark. Note that the input is constant for each run, regardless of how many processors or application threads are used.

Fixed number of application threads with parallel garbage collector: This configuration is the same as the first one except that it uses a parallel garbage collector. The purpose is to show how the parallel garbage collector differs from the default stop-the-world garbage collector in terms of performance. We do not run the second configuration (fixed number of processors) with the parallel garbage collector because we find the first is sufficient to show the patterns.
Fixed number of processors and fixed number of application threads: In this configuration, we fix the number of processors and application threads at 4 and 8, respectively. For each run of the benchmarks, we vary the size of the thread-local allocation buffer to examine its performance impact on the memory system and to determine the size that achieves optimal performance. Details are discussed in section III-E.

III. EXPERIMENTAL RESULTS

The experimental results are analyzed and summarized in this section. The objective is to correlate hardware-level performance data back to software-level constructs, which allows us to identify the performance bottlenecks at multiple levels. Two software-level constructs, thread types and memory regions, are introduced and justified in Section III-A. Then, the results of throughput scaling with the number of processors and application threads are presented. We then break down the performance data based on thread types, memory access latency components, and/or memory regions, and discuss the performance and scalability analysis in detail. Lastly, optimization techniques such as the parallel garbage collector and the thread-local heap or allocation buffer are examined.

A. Two software constructs: Thread Types and Memory Regions

On SMP systems, different threads in an application usually have different behaviors, and knowing these behaviors helps us identify the performance bottlenecks. When a multithreaded Java application runs on Sun JDK 1.4.2, a certain number of threads are created. The number and types of threads can vary depending on the application behavior, the execution platform, the garbage collection algorithm, etc. Typically, the thread types include the main thread, application thread, compiler thread, garbage collection thread, idle thread, signal dispatcher thread, reference handler thread, finalizer thread, suspend checker thread, and watcher thread.
Because threads of the same type usually behave similarly, we group threads by type and present the analysis for each type. In particular, we find that the signal dispatcher, reference handler, finalizer, suspend checker, and watcher threads have very little impact on benchmark execution, since together they always contribute less than 2% of the total execution time. Thus, we do not distinguish these types of threads but classify them together under a new type, called other threads, in our analysis. In the experiments, we also study the behaviors of the L1 instruction cache, L1 data cache, and unified L2 cache. The results show that the memory system plays a very important role in overall SMP system performance. As shown in Figure 1, the 4GB virtual address space of a JVM process running on a typical 32-bit Linux operating system can be divided into several memory regions. Note that the addresses shown in the figure may vary with the size of the Java heap. To further explore the details of memory system behavior and identify the performance bottlenecks, we present the memory system performance analysis based on these memory regions where necessary.
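The thread types grouped above can be observed directly from a running JVM. The following sketch (the class name is ours, and the exact thread names vary by JVM vendor and version, so treat the list as illustrative) enumerates the live threads by name:

```java
import java.util.Map;

// Sketch: enumerate the live threads of the current JVM, similar in spirit
// to the per-thread grouping used in the paper's analysis. On a HotSpot-style
// JVM the list typically includes "main", "Finalizer", and "Reference Handler";
// the exact set depends on the JVM version and configuration.
public class ThreadTypes {
    public static String[] liveThreadNames() {
        Map<Thread, StackTraceElement[]> all = Thread.getAllStackTraces();
        return all.keySet().stream()
                  .map(Thread::getName)
                  .sorted()
                  .toArray(String[]::new);
    }

    public static void main(String[] args) {
        for (String name : liveThreadNames()) {
            System.out.println(name);
        }
    }
}
```

Running this under different JVMs shows how the service threads (finalizer, reference handler, etc.) that the paper lumps into "other threads" coexist with the application's own threads.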

Fig. 1. An Example of JVM Virtual Memory Space Layout. From the top of the 4GB address space: kernel space (0xC0000000-0xFFFFFFFF), thread heap and stack space, JVM heap space (comprising permanent, mature, and nursery heap spaces), compiled Java method space, and JVM and shared library space.

Fig. 2. Speedup of Throughput: (a) vs. Number of Processors; (b) vs. Number of Application Threads.

B. Throughput

We use the speedup of throughput to examine the scalability of the benchmarks under study. Figure 2(a) shows the scaling of throughput with the number of processors. All the benchmarks achieve increases in throughput with the number of processors. However, the speedups tend to be lower than linear. PseudoJBB has the lowest speedup in all cases. Additionally, no significant improvement is observed beyond 6 processors for PseudoJBB, indicating that some potential performance bottlenecks limit the improvement. Further investigation of this issue is presented in section III-C. Figure 2(b) shows the scaling of throughput with the number of application threads. All the benchmarks achieve increases in performance with the number of application threads as long as the number of application threads is less than the number of processors, obviously due to higher CPU utilization with more threads. Peak throughput is reached when the number of application threads equals the number of processors. Beyond that point, we observe performance degradation for both PseudoJBB and MolDyn, while there are no significant changes for MonteCarlo or RayTracer. The throughput scaling results reveal that using more processors often leads to higher performance, and that matching the number of application threads to the number of processors is important for achieving maximum performance. However, because potential bottlenecks offset this effect, no benchmark shows a linear increase in performance. In the following sections, the behavior of throughput scaling is studied in further detail by examining the impacts of different performance data on overall system performance.
Through this detailed analysis, useful findings about performance and scalability issues can be observed.

C. Breakdown of Execution Time

In this section, we present our analysis by breaking down the total execution cycles based on thread types. We also examine the contributions of various memory access latency components to the overall CPI (average cycles per instruction).

1) Breakdown of Total Execution Time by Thread Types: Figure 3(a) and Figure 3(b) show the scaling of total execution cycles broken down by thread types. The total execution cycles are normalized to 1, which allows us to compare the contributions of the different threads. The results show that, for all benchmarks, the application threads and the idle thread dominate the execution cycles, while the contributions of the compiler thread and other threads are negligible. This indicates that the JIT compiler is efficient at compiling Java methods into native machine code. The garbage collection thread contributes from 5% to 10% of the total execution cycles for PseudoJBB and up to 3% for MonteCarlo, while no significant GC cycles are observed for MolDyn or RayTracer. Our investigation reveals that both PseudoJBB and MonteCarlo have relatively large working sets, resulting in more frequent and longer garbage collections. MolDyn has a very small working set and triggers no garbage collection at all. RayTracer allocates many small objects during execution; however, most of them are thread-local and die very young, so garbage collections complete quickly without copying a large number of objects from the nursery heap space to the mature heap space. In our experiments, we find that the idle thread contributes significantly to the total execution cycles, resulting in severe performance degradation. In all benchmarks, idle cycles keep increasing with the number of processors.
For PseudoJBB running on an eight-processor SMP system, idle cycles can constitute as much as 50% of the total CPU cycles. The idle thread behaves differently when scaling with the number of application threads. We see the largest idle-cycle counts for all benchmarks when the number of application threads is 1: for instance, 74% of the total execution cycles are idle cycles for PseudoJBB, and 52% for MolDyn.

Fig. 3. Total Execution Cycles: (a) by Thread Types vs. Number of Processors; (b) by Thread Types vs. Number of Application Threads.

As the number of application threads increases, the idle cycles keep decreasing until the number of application threads equals the number of processors. Thereafter, idle cycles keep increasing with the number of application threads for both PseudoJBB and MolDyn, while there are no significant changes for MonteCarlo or RayTracer. Further investigation reveals that the major causes of idle cycles are lock contention and long garbage collections. MolDyn allocates a few highly shared objects; the strong contention of concurrent accesses to these objects produces a considerable number of idle cycles. For MonteCarlo and PseudoJBB, the idle cycles are largely produced during garbage collection because of their large working sets. RayTracer allocates a large number of thread-local objects, so contention on concurrent accesses to these objects tends to be small; meanwhile, these objects die very quickly, so garbage collection completes much faster. As a consequence, only a relatively small number of idle cycles are produced during its execution. In order to exploit thread-level parallelism for high performance, idle cycles should therefore be minimized. Optimizing the JVM threading system and Java's synchronization mechanism is important for reducing the idle cycles produced by lock contention. In addition, Java programmers have the responsibility to write scalable code at the application level. Also, as discussed in section III-F, a parallel garbage collector can significantly reduce the idle cycles during garbage collection.
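The two contention patterns described above can be illustrated with a small Java sketch (a toy workload with hypothetical names, not code from the benchmarks): several threads either serialize on one shared lock, as with MolDyn's highly shared objects, or update private state and merge once at the end, as with RayTracer's thread-local objects:

```java
// Sketch (hypothetical workload): contended vs. thread-local counting.
public class ContentionDemo {
    static final int THREADS = 4, ITERS = 100_000;
    static long shared = 0;
    static final Object lock = new Object();

    // Every increment serializes on one lock; waiting threads leave
    // their processors idle, the effect the paper attributes to
    // highly shared objects.
    static long contended() throws InterruptedException {
        shared = 0;
        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < ITERS; j++) {
                    synchronized (lock) { shared++; }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return shared;
    }

    // Each thread owns its own slot and no locking is needed;
    // results are merged once after all threads finish.
    static long threadLocal() throws InterruptedException {
        long[] partial = new long[THREADS];
        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            final int id = i;
            ts[i] = new Thread(() -> {
                for (int j = 0; j < ITERS; j++) partial[id]++;
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        long total = 0;
        for (long p : partial) total += p;
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(contended());   // 400000
        System.out.println(threadLocal()); // 400000
    }
}
```

Both variants compute the same total; the difference is purely in how much time threads spend blocked, which is what shows up as idle cycles in the paper's breakdown.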
2) Breakdown of CPI by Memory Access Latency Components: In this section, we perform the analysis by breaking down the overall CPI into five memory access latency components: processor time (the average L1 access time for the CPU to fetch one instruction and its data), L1 instruction cache miss stalls, L1 data cache miss stalls, L2 miss stalls, and cache-to-cache transfer stalls. We explicitly exclude the data of the idle thread, since it rarely accesses memory. Figure 4 shows the scaling of overall CPI, broken down into its memory access latency components. We observe that memory stalls constitute a large percentage of CPU cycles, ranging from 20% to 60%. This indicates that memory access latency is a very critical performance bottleneck on SMP systems. The results also show that the contributions of the memory access latency components to the overall CPI vary considerably among the four benchmarks. PseudoJBB and RayTracer are more significantly affected by memory system stalls, while MolDyn and MonteCarlo are less affected. Both PseudoJBB and RayTracer allocate a large volume of objects during execution, resulting in a large number of misses in both the L1 and L2 caches. MolDyn has a very small working set, so accesses to objects usually hit in the caches. MonteCarlo behaves differently from the other three benchmarks: it allocates a large number of objects, yet its CPI tends to be relatively small, because its regular memory access pattern leads to good cache locality. For all the benchmarks, L2 cache performance plays a very important role in the overall CPI. L2 cache miss stalls can account for as much as 50% of the total memory latency stalls. Cache-to-cache transfer stalls are the second-largest contributor to memory system stalls. We find significant cache-to-cache transfer stalls for RayTracer and PseudoJBB, and relatively fewer for MolDyn and MonteCarlo.
Together, L2 cache misses and cache-to-cache transfers contribute the majority of memory system stalls, while L1 instruction cache misses and L1 data cache misses have less impact on overall memory system performance. To further understand memory system performance, more detailed explanations are presented in section III-D. The scaling of L2 cache performance can become worse: we observe that increasing the number of processors or the number of application threads often leads to increases in both L2 cache misses and L2 cache-to-cache transfers, and the impact on cache-to-cache transfers is often slightly higher than that on cache misses. Two factors contribute to this effect. First, the default memory allocator of Sun JDK allocates objects in the nursery heap space through a bump pointer, so objects belonging to different threads can be allocated close to each other. An access to one object may cause another thread's object to be loaded into the same cache line, which obviously worsens the spatial cache locality of the current thread. This allocator can also lead to high cache-to-cache transfers: if two or more processors simultaneously access data in the same cache line, held in different processor caches, writing the data in one cache causes it to be invalidated in the other caches. Second, increasing the number of processors or application threads can make this situation even worse

Fig. 4. Overall CPI: (a) by Memory Access Latency Components vs. Number of Processors; (b) by Memory Access Latency Components vs. Number of Threads.

due to the increasing contention of memory accesses. Our further investigation shows that using a thread-local heap or allocation buffer can potentially improve cache system performance on SMP systems. We detail these studies in section III-E.

D. Memory System Performance

In this section, we further study memory system behavior in terms of thread types and memory regions, in order to understand in detail the causes of the memory system bottlenecks. As above, we exclude the data of the idle thread from our analysis. Also, in comparison to the other caches, the performance impact of L1 instruction cache misses tends to be small and is therefore not reported.

1) L1 Data Cache Misses: Figure 5(a) and Figure 5(b) show the scaling of the L1 data cache miss ratio, broken down by thread types. As with the L1 instruction cache, we find that the application threads dominate cache misses for all benchmarks. The garbage collection thread contributes cache misses as well for PseudoJBB and MonteCarlo; however, its impact is trivial compared to the application threads. This is the direct result of the large Java heap used in our experiments, which minimizes the impact of garbage collection. Figure 5(c) and Figure 5(d) show the scaling of the L1 data cache miss ratio, broken down by memory regions. We find that cache misses come from all regions, with the three Java heap spaces together having the largest contribution. This indicates that improving the cache locality of the Java heap spaces is critical for improving overall memory system performance. The behavior of MolDyn is slightly different: around 80% or more of its total cache misses come from the nursery heap space. As mentioned before, this is because of MolDyn's small working set.
All objects are allocated in the nursery space and are never promoted to the mature space; consequently, most memory accesses to the Java heap go to the nursery heap space.

2) L2 Cache Misses: Figures 5(e) and 5(f) illustrate the contributions of the different thread types to the L2 cache miss ratio. As with the L1 cache, we observe that L2 cache misses are dominated by the application threads. However, the L2 cache miss ratio does not stay constant; instead, it keeps increasing with either the number of processors or the number of application threads. The only exception is MolDyn, whose L2 cache miss ratio decreases with the number of processors. Figure 5(g) and Figure 5(h) show the contributions of the different memory regions to the L2 cache miss ratio. For all benchmarks, the kernel space and JVM code space cause only a small number of L2 cache misses. Instead, the Java heap space is the main contributor, since all object allocations and most data accesses occur in this space. Additionally, the garbage collector can also cause a large number of L2 cache misses in the Java heap space, since the generational garbage collector scans objects in the nursery heap space and copies live ones to the mature heap space. We also find that PseudoJBB has a much larger cache miss ratio in the Java heap space than the other benchmarks, which is attributed to PseudoJBB's large working set and code size.

3) Cache-to-Cache Transfers: To simplify the analysis, we attribute each cache-to-cache transfer to the thread responsible for the data modification that caused it. Figure 5(i) to Figure 5(l) show the contributions of the different components to L2 cache-to-cache transfers. We observe that, among all the thread types, the application threads dominate cache-to-cache transfers in all cases, while no significant contributions are observed for the other thread types.
Among all the memory regions, the nursery heap space contributes the most to cache-to-cache transfers for all benchmarks. We also find that increasing the number of processors or the number of application threads leads to an increase in cache-to-cache transfers. These observations show that L2 cache-to-cache transfers behave similarly to L2 cache misses. However, L2 cache misses are mainly caused by poor cache locality, while L2 cache-to-cache transfers are caused by true/false sharing of data among processors or threads.

E. Thread-local Heap/Allocation Buffer

A thread-local heap is a scheme in which each thread receives a partition of the heap for thread-local object allocation and thread-local garbage collection, without synchronization with other threads [2]. Its original intention is to reduce heap contention. However, our study shows that this scheme can also lead to good cache performance on SMP systems, and that this positive performance impact is greater than that of its original purpose.
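The sharing effect that thread-local allocation is meant to avoid can be reproduced with a small, machine-dependent Java sketch (class and constants are hypothetical, and the timings only illustrate the mechanism, not the paper's measurements): two threads update counters that either fall on the same cache line or on clearly separate lines; on an SMP machine the same-line case is typically slower, because each write forces the line to bounce between processor caches:

```java
// Sketch (illustrative): false sharing between two threads.
// Adjacent long[] slots usually share a 64-byte cache line; slots 32
// entries apart (256 bytes) do not. Timings are machine-dependent.
public class FalseSharingDemo {
    static final int ITERS = 10_000_000;

    // Each thread increments its own slot; returns elapsed nanoseconds.
    static long run(long[] counters, int slotA, int slotB) throws InterruptedException {
        Thread a = new Thread(() -> { for (int i = 0; i < ITERS; i++) counters[slotA]++; });
        Thread b = new Thread(() -> { for (int i = 0; i < ITERS; i++) counters[slotB]++; });
        long start = System.nanoTime();
        a.start(); b.start();
        a.join(); b.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        long sharedLine   = run(new long[64], 0, 1);   // likely same cache line
        long separateLine = run(new long[64], 0, 32);  // separate cache lines
        System.out.println("shared line:   " + sharedLine / 1_000_000 + " ms");
        System.out.println("separate line: " + separateLine / 1_000_000 + " ms");
    }
}
```

No data is logically shared between the two threads in either run; only the cache-line placement differs, which is exactly the false-sharing component of the cache-to-cache transfers measured in this section.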

Fig. 5. Cache Performance: (a) L1 data cache misses by thread types vs. number of processors; (b) L1 data cache misses by thread types vs. number of application threads; (c) L1 data cache misses by memory regions vs. number of processors; (d) L1 data cache misses by memory regions vs. number of application threads; (e) L2 cache misses by thread types vs. number of processors; (f) L2 cache misses by thread types vs. number of application threads; (g) L2 cache misses by memory regions vs. number of processors; (h) L2 cache misses by memory regions vs. number of application threads; (i) cache-to-cache transfers by thread types vs. number of processors; (j) cache-to-cache transfers by thread types vs. number of application threads; (k) cache-to-cache transfers by memory regions vs. number of processors; (l) cache-to-cache transfers by memory regions vs. number of application threads.

Fig. 6. Thread-local Allocation Buffer: (a) Throughput vs. Size of Thread-local Allocation Buffer; (b) Memory Stalls/Instruction vs. Size of Thread-local Allocation Buffer.

Sun JDK does not implement a thread-local heap directly, but offers a similar approach called the thread-local allocation buffer. However, there are some differences between the two. The thread-local heap approach requires that only non-shared objects be allocated locally in the heap belonging to the thread that creates them. It has the advantage that thread-local objects can be garbage collected independently, without stopping other application threads. However, it is complicated because it requires compiler support, and overheads such as write barriers have to be introduced. In contrast, the thread-local allocation buffer approach allows any object to be allocated locally in the allocation buffer belonging to the thread that creates it. In this approach, objects in the allocation buffer cannot be collected independently because they may be shared by other application threads. However, the implementation is simpler and often leads to similar performance. Therefore, we study the performance impact of the thread-local allocation buffer instead of the thread-local heap in this section. Sun JDK allows us to enable the thread-local allocation buffer and specify the per-thread buffer size through the command line. We only run two benchmarks, RayTracer and PseudoJBB, on a simulated SMP system with four processors. The other benchmarks seem insensitive to its performance impact, either because MolDyn has a small memory footprint or because MonteCarlo already has good cache performance (see Figure 4); therefore, they are not reported. In the simulation, eight application threads are run for each benchmark. We also vary the size of the thread-local allocation buffer to examine its impact on performance.
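The bump-pointer mechanics behind a thread-local allocation buffer can be sketched as follows (a toy model with hypothetical names and sizes, not HotSpot's implementation): each thread bumps a private cursor inside its own buffer, and only takes the shared heap lock when the buffer is exhausted and a new one must be carved out of the heap:

```java
// Toy model of a thread-local allocation buffer (TLAB): fast private
// bump-pointer allocation, with a locked slow path only on refill.
public class TlabSketch {
    static final int HEAP_SIZE = 1 << 20;     // 1 MB toy heap
    static final int TLAB_SIZE = 64 * 1024;   // 64 KB per-thread buffer

    static int heapCursor = 0;                // shared bump pointer (lock-protected)
    static final Object heapLock = new Object();

    // Per-thread buffer state: {cursor, end}.
    static final ThreadLocal<int[]> tlab = ThreadLocal.withInitial(() -> new int[] {0, 0});

    /** Returns the heap offset of a freshly allocated block of `size` bytes. */
    static int allocate(int size) {
        int[] buf = tlab.get();
        if (buf[0] + size > buf[1]) {          // buffer exhausted: refill under lock
            synchronized (heapLock) {
                if (heapCursor + TLAB_SIZE > HEAP_SIZE)
                    throw new OutOfMemoryError("toy heap exhausted");
                buf[0] = heapCursor;
                buf[1] = heapCursor + TLAB_SIZE;
                heapCursor += TLAB_SIZE;
            }
        }
        int addr = buf[0];                     // common case: private bump, no locking
        buf[0] += size;
        return addr;
    }

    public static void main(String[] args) {
        int a = allocate(16);
        int b = allocate(16);
        System.out.println(b - a);             // 16: consecutive, same buffer
    }
}
```

Because consecutive allocations from one thread land contiguously within its buffer, the thread's objects cluster on cache lines it alone touches, which is the source of the locality and false-sharing benefits discussed above. In HotSpot-based JVMs the real mechanism is controlled by command-line options (e.g. -XX:+UseTLAB and, in later releases, -XX:TLABSize).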
Figure 6(a) shows the throughput speedup of RayTracer and PseudoJBB as the size of the thread-local allocation buffer varies; a size of 0 means no thread-local allocation buffer is used. Overall, using a thread-local allocation buffer improves performance. Figure 6(b) illustrates the memory stalls as a function of buffer size. We observe that performance improves because L2 cache misses and cache-to-cache transfers are significantly reduced. The allocation behavior explains this effect. The default memory allocator of Sun JDK allocates objects in the contiguous nursery space by incrementing a bump pointer. If multiple threads are running simultaneously, objects belonging to different threads can be allocated close to each other, which leads to poor spatial cache locality for each thread. This allocator can make cache performance even worse on SMP systems because of cache-to-cache transfers (true/false sharing misses): objects belonging to different threads can be placed in the same cache line but reside in different processor caches, so a write to this cache line by one processor invalidates the data in the other processors' caches. The thread-local allocation buffer alleviates these problems by allocating a thread's objects together. Cache locality improves, and false sharing misses are significantly reduced because most cache lines are no longer shared between threads. We observe only small performance gains in the L1 instruction and L1 data caches, mainly due to their small sizes. Figure 6(a) also shows that the size of the thread-local allocation buffer affects overall system performance. We find that a size between 16KB and 256KB often leads to optimal performance. If the size is too small, a thread may have to keep requesting a new buffer whenever its current allocation buffer fills up.
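The interleaving effect of the shared bump pointer can be illustrated with a toy model (our sketch, not HotSpot's actual allocator; integer slot indices stand in for heap addresses):

```java
import java.util.Arrays;

public class BumpSketch {
    // Shared bump pointer: threads A and B alternate single-slot
    // allocations, so their objects interleave in the nursery.
    static int[][] sharedBump(int n) {
        int bump = 0;
        int[][] addrs = new int[2][n];
        for (int i = 0; i < n; i++) {
            addrs[0][i] = bump++; // thread A allocates one slot
            addrs[1][i] = bump++; // thread B allocates the next slot
        }
        return addrs;
    }

    // Thread-local buffers: each thread bump-allocates inside its own
    // contiguous region, so its objects stay together.
    static int[][] tlab(int n) {
        int[][] addrs = new int[2][n];
        for (int t = 0; t < 2; t++)
            for (int i = 0; i < n; i++)
                addrs[t][i] = t * n + i; // buffer t covers [t*n, (t+1)*n)
        return addrs;
    }

    // True if the thread's addresses are consecutive, i.e. no other
    // thread's object can land between them on the same cache line.
    static boolean contiguous(int[] a) {
        for (int i = 1; i < a.length; i++)
            if (a[i] != a[i - 1] + 1) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println("shared A: " + Arrays.toString(sharedBump(4)[0])); // [0, 2, 4, 6]
        System.out.println("tlab   A: " + Arrays.toString(tlab(4)[0]));       // [0, 1, 2, 3]
    }
}
```

With the shared bump pointer, thread A's objects sit at every other slot, so each of its cache lines also holds another thread's data; with per-thread buffers, its objects are contiguous and privately cached.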
This causes contention on object allocation and also memory fragmentation, since the object to be allocated may not fit in the free space of the current buffer. A small allocation buffer also results in poor locality, because a thread's objects can be scattered across separate buffers. With a large allocation buffer, however, the performance gain can be offset by penalties such as TLB misses. The optimal size of the thread-local allocation buffer varies with the cache organization, the application's execution behavior, and other factors. Dynamically choosing the buffer size has the potential to improve performance further, and we plan to explore this in future work.

F. Parallel Garbage Collector vs. Default Stop-the-world Garbage Collector

We use the default stop-the-world garbage collector in our experiments. This collector allows only one processor to actively execute garbage collection. A parallel garbage collector, on the other hand, can fully utilize all the processors of an SMP system to collect garbage in parallel. In this section, we compare the performance of these two collectors on SMP systems. We examine only PseudoJBB, since the JGF benchmarks do not produce sufficiently long garbage collections with the 512 MB heap space.
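The expected scaling gap between the two collectors can be anticipated with a simple Amdahl-style model (our back-of-the-envelope assumption, not measured data): if a fraction g of the single-processor run is spent in a serial stop-the-world collector, only the mutator fraction 1 - g parallelizes.

```java
public class GcScaling {
    // Speedup on p processors when GC (fraction g of the single-CPU
    // run) stays serial: the other p-1 processors idle during GC.
    static double serialGc(int p, double g) {
        return 1.0 / (g + (1.0 - g) / p);
    }

    public static void main(String[] args) {
        double g = 0.2; // assumed: 20% of the one-processor run is GC
        for (int p : new int[] {1, 2, 4, 8, 16})
            System.out.printf("P=%-2d serial-GC speedup=%.2f (ideal parallel ~ %d)%n",
                              p, serialGc(p, g), p);
        // The serial-GC curve flattens toward 1/g = 5x no matter how
        // many processors are added, while an ideal parallel collector
        // keeps both mutator and GC phases scaling with p.
    }
}
```

This plateau behavior is what the measurements below exhibit for the default collector once the processor count grows.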

Fig. 7. Default GC vs. Parallel GC: (a) throughput of PseudoJBB and (b) total execution cycles of PseudoJBB by thread type, versus the number of processors.

Figure 7(a) shows the throughput scaling of PseudoJBB with the number of processors. The parallel garbage collector shows a performance improvement close to linear as the number of processors increases, whereas the improvement of the default garbage collector falls far below linear. The performance gap between the two collectors keeps widening with the number of processors; beyond six processors, only a very small performance gain is observed for the default garbage collector. This is caused by idle cycles during garbage collection. The parallel garbage collector, on the other hand, does not waste as many CPU cycles as the default collector and therefore achieves a much higher performance improvement. To verify this conclusion, Figure 7(b) shows the scaling of total execution cycles, broken down by thread type. Due to contention and synchronization among the garbage collection threads, the parallel garbage collector uses slightly more CPU cycles on collection than the default garbage collector. However, since the decrease in idle cycles outweighs the increase in garbage collection cycles, better performance is achieved because more CPU cycles can be used by the application threads.

IV. RELATED WORK

The behavior of Java applications has been evaluated since Java was first introduced in late 1995 [6], [10], [14], [13], [7]. Most studies focused on single-threaded Java programs, especially the SPECjvm98 benchmarks; studies of multithreaded benchmarks are rare. In recent years, because of the popularity of Java-based server applications, the performance of multithreaded Java programs has become of great interest. Using the performance counters provided by the processor, Luo et al.
studied the characteristics of Java server applications on the Pentium III [11]. They found that such programs have worse instruction-stream behavior (including I-cache miss rate, ITLB miss rate, etc.) than SPECint2000. By increasing the number of threads, they also studied the impact of Java threads on the micro-architecture. Instead of running benchmarks on a uniprocessor system, our work focuses on the performance characterization of multithreaded Java programs in an SMP environment. Many metrics studied in this work, such as cache-to-cache transfers, are not available on single-processor systems. Using a full system simulator (Simics) and a real machine, Karlsson et al. studied the memory system behavior of Java middleware running on SMP systems [9]. They mainly focused on the characterization of low-level (hardware) performance metrics, such as cache-to-cache transfers. Compared with that research, our work is more fine-grained: it attributes the detailed low-level performance metrics to high-level software components of the Java Virtual Machine. Specifically, we focus on the correlation between these low-level performance metrics and two high-level software constructs: thread types and memory regions. Such correlation can help identify potential performance and scalability bottlenecks at the application level for further optimization. Sweeney et al. recently reported a performance monitoring system in Jikes RVM, implemented on top of hardware performance counters [16]. As a demonstration, two performance issues (general performance trends and memory latency) were investigated using this system. The results show that their tool is able to attribute observed program behavior to specific components of the JVM. However, the profiling system has limitations: the performance metrics it can examine rely heavily on the capabilities of the processor's performance counters.
Our simulation infrastructure is based on Simics, a full system simulator, which enables us to profile more aspects of the applications. For instance, we are able to categorize cache misses by memory region, which is infeasible with current implementations of performance counters. Building on the infrastructure introduced in [16], Hauswirth et al. [5] examined further applications of this profiling system. Their work introduces a technique called vertical profiling, which correlates performance characterizations across the layers of modern object-oriented systems (OS, virtual machine, and applications). Unlike our work, their research did not particularly focus on scalability issues on SMP systems, though their experiments were also based on a 4-way SMP system.

V. CONCLUSIONS

In this paper, we study the scalability issues of the JVM on SMP systems. The detailed simulator offers us a great environment to evaluate the performance scaling of multithreaded benchmarks with the number of processors and application threads. Our analysis methodology of correlating low-level performance data to high-level software constructs (thread types and memory regions) allows us to identify performance and scalability bottlenecks at multiple levels. Two potential bottlenecks, memory system latencies and lock contention, are studied in this work. Several key observations emerge. First, in terms of memory access latency components, memory regions, and threads, the primary portion of memory stalls is produced by L2 cache misses and cache-to-cache transfers, by the Java heap space, and by the Java application threads, respectively. Second, increasing the number of processors or application threads, independently of each other, often increases the L2 cache miss ratio and the L2 cache-to-cache transfer ratio, which potentially prevents the system from scaling up linearly. Lastly, lock contention can cause a large number of idle cycles, indicating a significant lack of thread-level parallelism on SMP systems. In particular, idle cycles often scale up with the number of processors and application threads, resulting in non-linear performance improvement or even performance degradation. Several optimization techniques are examined for their ability to reduce the impact of these performance bottlenecks. We observe that using a thread-local heap or allocation buffer can significantly reduce L2 cache misses and L2 cache-to-cache transfers for multithreaded Java benchmarks running on SMP systems, although its original intention is to reduce heap contention. A thread-local allocation buffer with a size between 16KB and 256KB often leads to optimal performance. The parallel garbage collector is also shown to have better scalability on SMP systems, because its workload is balanced across processors for higher CPU utilization.
To our knowledge, this is the first work that investigates scalability issues by correlating low-level performance data to high-level software constructs. Our future work includes dynamically choosing the size of the allocation buffer, further exploring the behavior of lock contention, and validating the simulation results on real SMP systems.

ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grants No. (ITR) and (ITR). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] S. M. Blackburn, S. Singhai, M. Hertz, K. S. McKinley, and J. E. B. Moss. Pretenuring for Java. In Proceedings of the 2001 ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), Tampa Bay, FL, October 2001.
[2] T. Domani, G. Goldshtein, E. K. Kolodner, E. Lewis, E. Petrank, and D. Sheinwald. Thread-local heaps for Java. In ISMM '02: Proceedings of the 3rd International Symposium on Memory Management, pages 76-87, Berlin, Germany, 2002. ACM Press.
[3] U. Drepper and I. Molnar. The native POSIX thread library for Linux. nptl-design.pdf.
[4] S. Z. Guyer and K. S. McKinley. Finding your cronies: static analysis for dynamic object colocation. In Proceedings of the 2004 ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), Vancouver, Canada, October 2004.
[5] M. Hauswirth, P. F. Sweeney, A. Diwan, and M. Hind. Vertical profiling: Understanding the behavior of object-oriented applications. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Vancouver, British Columbia, Canada, October 2004.
[6] C.-H. A. Hsieh, M. T. Conte, T. L. Johnson, J. C. Gyllenhaal, and W.-M. W. Hwu. A study of the cache and branch performance issues with running Java on current hardware platforms. In Proceedings of the 42nd IEEE International Computer Conference (CompCon), San Jose, CA, February 1997.
[7] W. Huang, J. Lin, Z. Zhang, and J. M. Chang. Performance characterization of Java applications on SMT processors. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, March 2005.
[8] Java Grande Forum. The Java Grande Forum multi-threaded benchmarks. Available at javagrande/threads.html.
[9] M. Karlsson, K. E. Moore, E. Hagersten, and D. A. Wood. Memory system behavior of Java-based middleware. In Proceedings of the 9th Annual International Symposium on High-Performance Computer Architecture (HPCA), Anaheim, CA, February 2003.
[10] T. Li, L. K. John, V. Narayanan, A. Sivasubramaniam, J. Sabarinathan, and A. Murthy. Using complete system simulation to characterize SPECjvm98 benchmarks. In Proceedings of the International Conference on Supercomputing (ICS), Santa Fe, NM, May 2000.
[11] Y. Luo and L. K. John. Workload characterization of multithreaded Java servers. In Proceedings of the 2001 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Tucson, Arizona, November 2001.
[12] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner. Simics: a full system simulation environment. IEEE Computer, pages 50-58, February 2002.
[13] R. Radhakrishnan, V. Narayanan, L. K. John, and A. Sivasubramaniam. Architectural issues in Java runtime systems. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture (HPCA), Toulouse, France, January 2000.
[14] B. Rychlik and J. P. Shen. Characterization of value locality in Java programs. In Proceedings of the 3rd Workshop on Workload Characterization in Association with ICCD, Austin, TX, September 2000.
[15] Standard Performance Evaluation Corporation (SPEC). SPECjbb2000 benchmark.
[16] P. F. Sweeney, M. Hauswirth, B. Cahoon, P. Cheng, A. Diwan, D. Grove, and M. Hind. Using hardware performance monitors to understand the behavior of Java applications. In Proceedings of the 3rd USENIX Virtual Machine Research and Technology Symposium (VM '04), San Jose, CA, May 2004.


More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Cross-Layer Memory Management to Reduce DRAM Power Consumption

Cross-Layer Memory Management to Reduce DRAM Power Consumption Cross-Layer Memory Management to Reduce DRAM Power Consumption Michael Jantz Assistant Professor University of Tennessee, Knoxville 1 Introduction Assistant Professor at UT since August 2014 Before UT

More information

Hardware-Supported Pointer Detection for common Garbage Collections

Hardware-Supported Pointer Detection for common Garbage Collections 2013 First International Symposium on Computing and Networking Hardware-Supported Pointer Detection for common Garbage Collections Kei IDEUE, Yuki SATOMI, Tomoaki TSUMURA and Hiroshi MATSUO Nagoya Institute

More information

Older-First Garbage Collection in Practice: Evaluation in a Java Virtual Machine

Older-First Garbage Collection in Practice: Evaluation in a Java Virtual Machine Older-First Garbage Collection in Practice: Evaluation in a Java Virtual Machine Darko Stefanovic (Univ. of New Mexico) Matthew Hertz (Univ. of Massachusetts) Stephen M. Blackburn (Australian National

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

Java On Steroids: Sun s High-Performance Java Implementation. History

Java On Steroids: Sun s High-Performance Java Implementation. History Java On Steroids: Sun s High-Performance Java Implementation Urs Hölzle Lars Bak Steffen Grarup Robert Griesemer Srdjan Mitrovic Sun Microsystems History First Java implementations: interpreters compact

More information

Measurement-based Analysis of TCP/IP Processing Requirements

Measurement-based Analysis of TCP/IP Processing Requirements Measurement-based Analysis of TCP/IP Processing Requirements Srihari Makineni Ravi Iyer Communications Technology Lab Intel Corporation {srihari.makineni, ravishankar.iyer}@intel.com Abstract With the

More information

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors. CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

IN recent years, web applications have become a popular

IN recent years, web applications have become a popular IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 5, NO. 2, APRIL-JUNE 2017 263 An Energy-Efficient Java Virtual Machine Kuo-Yi Chen, Member, IEEE, J. Morris Chang, Senior Member, IEEE, and Ting-Wei Hou, Member,

More information

PROCESS VIRTUAL MEMORY. CS124 Operating Systems Winter , Lecture 18

PROCESS VIRTUAL MEMORY. CS124 Operating Systems Winter , Lecture 18 PROCESS VIRTUAL MEMORY CS124 Operating Systems Winter 2015-2016, Lecture 18 2 Programs and Memory Programs perform many interactions with memory Accessing variables stored at specific memory locations

More information

Chapter 8: Main Memory. Operating System Concepts 9 th Edition

Chapter 8: Main Memory. Operating System Concepts 9 th Edition Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel

More information

Using Complete System Simulation to Characterize SPECjvm98 Benchmarks

Using Complete System Simulation to Characterize SPECjvm98 Benchmarks Using Complete System Simulation to Characterize SPECjvm98 Benchmarks Tao Li, Lizy Kurian John, Vijaykrishnan Narayanan, Anand Sivasubramaniam, Jyotsna Sabarinathan, and Anupama Murthy Laboratory for Computer

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

CSCI 4717 Computer Architecture

CSCI 4717 Computer Architecture CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel

More information

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 Introduction to Parallel Computing CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 1 Definition of Parallel Computing Simultaneous use of multiple compute resources to solve a computational

More information

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition Chapter 7: Main Memory Operating System Concepts Essentials 8 th Edition Silberschatz, Galvin and Gagne 2011 Chapter 7: Memory Management Background Swapping Contiguous Memory Allocation Paging Structure

More information

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of

More information

Java Garbage Collector Performance Measurements

Java Garbage Collector Performance Measurements WDS'09 Proceedings of Contributed Papers, Part I, 34 40, 2009. ISBN 978-80-7378-101-9 MATFYZPRESS Java Garbage Collector Performance Measurements P. Libič and P. Tůma Charles University, Faculty of Mathematics

More information

Chapter 8: Main Memory

Chapter 8: Main Memory Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel

More information

VIProf: A Vertically Integrated Full-System Profiler

VIProf: A Vertically Integrated Full-System Profiler VIProf: A Vertically Integrated Full-System Profiler NGS Workshop, April 2007 Hussam Mousa Chandra Krintz Lamia Youseff Rich Wolski RACELab Research Dynamic software adaptation As program behavior or resource

More information

Basic Memory Management

Basic Memory Management Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester 10/15/14 CSC 2/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it

More information

Using Java for Scientific Computing. Mark Bul EPCC, University of Edinburgh

Using Java for Scientific Computing. Mark Bul EPCC, University of Edinburgh Using Java for Scientific Computing Mark Bul EPCC, University of Edinburgh markb@epcc.ed.ac.uk Java and Scientific Computing? Benefits of Java for Scientific Computing Portability Network centricity Software

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen

More information

Chapter 8 Memory Management

Chapter 8 Memory Management Chapter 8 Memory Management Da-Wei Chang CSIE.NCKU Source: Abraham Silberschatz, Peter B. Galvin, and Greg Gagne, "Operating System Concepts", 9th Edition, Wiley. 1 Outline Background Swapping Contiguous

More information

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Agenda Introduction Memory Hierarchy Design CPU Speed vs.

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

CHAPTER 8: MEMORY MANAGEMENT. By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

CHAPTER 8: MEMORY MANAGEMENT. By I-Chen Lin Textbook: Operating System Concepts 9th Ed. CHAPTER 8: MEMORY MANAGEMENT By I-Chen Lin Textbook: Operating System Concepts 9th Ed. Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the

More information

Untyped Memory in the Java Virtual Machine

Untyped Memory in the Java Virtual Machine Untyped Memory in the Java Virtual Machine Andreas Gal and Michael Franz University of California, Irvine {gal,franz}@uci.edu Christian W. Probst Technical University of Denmark probst@imm.dtu.dk July

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

32 Hyper-Threading on SMP Systems

32 Hyper-Threading on SMP Systems 32 Hyper-Threading on SMP Systems If you have not read the book (Performance Assurance for IT Systems) check the introduction to More Tasters on the web site http://www.b.king.dsl.pipex.com/ to understand

More information

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:

More information

Heap Compression for Memory-Constrained Java

Heap Compression for Memory-Constrained Java Heap Compression for Memory-Constrained Java CSE Department, PSU G. Chen M. Kandemir N. Vijaykrishnan M. J. Irwin Sun Microsystems B. Mathiske M. Wolczko OOPSLA 03 October 26-30 2003 Overview PROBLEM:

More information

Go Deep: Fixing Architectural Overheads of the Go Scheduler

Go Deep: Fixing Architectural Overheads of the Go Scheduler Go Deep: Fixing Architectural Overheads of the Go Scheduler Craig Hesling hesling@cmu.edu Sannan Tariq stariq@cs.cmu.edu May 11, 2018 1 Introduction Golang is a programming language developed to target

More information

CS420: Operating Systems

CS420: Operating Systems Threads James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz, Galvin, Gagne Threads A thread is a basic unit of processing

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

Dynamic Object Sampling for Pretenuring

Dynamic Object Sampling for Pretenuring Dynamic Object Sampling for Pretenuring Maria Jump Department of Computer Sciences The University of Texas at Austin Austin, TX, 8, USA mjump@cs.utexas.edu Stephen M Blackburn Department of Computer Science

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr

More information