Memory Organization and Optimization for Java Workloads


IJCSNS International Journal of Computer Science and Network Security, VOL.6 No.11, November 2006

K. F. Chong and Anthony S. Fong
Department of Electronic Engineering, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong

Summary
Java has become a popular paradigm in software development. It is widely used in embedded systems and network computing because of its excellent robustness, modularity and security. Its built-in garbage collection automatically reclaims unused memory space. Current generational garbage collectors work well for programs that allocate large numbers of short-lived objects. However, the existence of hot-mature (frequently-accessed and long-lived) objects inhibits object reclamation. In this paper, we present two methodologies to exploit locality for these objects. Firstly, we employ an on-chip scratchpad memory in the memory hierarchy to preserve young and hot-mature objects. This reduces energy consumption and data access cycles for Java execution. Secondly, we introduce a pretenuring technique that segregates objects into separate memory regions based on object lifetimes and reference densities, which minimizes the amount of object copying during garbage collections.

Key words: Cache and memory systems, Java, memory management, high-performance computer architecture

1. Introduction
Java is an object-oriented programming language that is widely adopted in embedded systems and network computing due to its platform independence and security features. Its built-in Garbage Collection (GC) automatically reclaims unused memory space occupied by unreferenced objects. A state-of-the-art generational garbage collector separates the heap into several regions (generations) according to object age. This is efficient because a different garbage collection algorithm can be applied in each region.
Short-lived objects can be allocated directly into the nursery (i.e. the youngest memory space or data cache) and collected by the corresponding collector once they are no longer accessible by the running program. However, if the program frequently accesses long-lived or immortal objects, these objects have to be brought from mature regions and cached in the nursery persistently for better performance. We name them hot-mature objects. The nursery overflows when the quantity or size of the hot-mature objects is large. This not only generates excessive memory traffic between main memory and cache, but also inhibits object allocation in the nursery. Thus, efficient management of hot-mature objects is crucial for optimizing Java execution.

A software-controlled on-chip SRAM, termed Scratch-Pad Memory (SPM), is employed in our Java processor design [2]. It is mapped into an address space disjoint from off-chip memory but connected to the same address and data buses [5]. The major difference between scratchpad memory and data cache is that scratchpad guarantees a single-cycle access time, whereas an access to the cache is subject to compulsory, capacity, and conflict misses [5]. We use scratchpad memory to preserve young and hot-mature objects, which shortens access cycles and minimizes cache penalty. To better utilize memory and alleviate the load on the garbage collector, we introduce an allocation strategy called pretenuring. Young and hot-mature objects are allocated directly into scratchpad memory, while long-lived objects that are not highly referenced are pretenured into main memory. This reduces the cost of object mutations.

This paper is organized as follows. In section 2, we address the limitations of generational garbage collection and of the traditional cache-only architecture, and outline our approaches. In section 3, we investigate the nature and behavior of young and hot-mature objects, and classify them into five object classes.
In sections 4 and 5, we present the concrete implementation of the scratchpad memory system and the pretenuring mechanism. Then, in section 6, we describe our experimental setup and show the results obtained from a modified JVM with our benchmark suite. Finally, we conclude the paper and discuss future work in section 7.

2. Background and Motivation
Objects allocated in Java are typically short-lived (see Fig. 1 and Fig. 2). The generational approach is widely chosen to collect objects in Java applications, as it avoids much scanning and copying by segregating objects into different memory regions based on their ages. Once objects have survived a number of collections, they are promoted to mature generations. However, this collection scheme incurs significant overhead when programs contain numerous mature objects, because long-lived objects are often copied many times before coming to rest in the mature generation. Pretenuring was introduced to reduce the amount of object copying by passing runtime advice, such as object size and lifespan, from call sites to the memory manager during object allocation [4, 7]. A site predictor selects an ideal generation for newly allocated objects based on this advice, so that long-lived objects can be allocated directly into mature spaces.

Using cache as a nursery has been demonstrated to raise the efficiency of the generational collector and make better use of the garbage-collected heap [2]. According to the empirical results, a concurrent collector was able to allocate 90% of objects directly into the object data cache, and reclaimed 68% of them from the cache. It is unnecessary to copy dead objects from cache to main memory. This design is practical for applications with large numbers of young, small objects, but it causes poor cache locality when programs contain a significant number of hot-mature objects. This flaw has three causes. First, if two hot-mature objects are mapped to addresses that conflict in the data cache, a significant number of conflict misses will occur unless the cache is set associative [5]. Second, as hot-mature objects are irreclaimable and reside in cache all the time, they generate unnecessary traffic between cache and main memory. Although these objects are copied to mature spaces when the nursery is collected, they will soon be accessed and brought back into the cache, according to the principle of reference locality. This restless sweeping considerably degrades runtime performance.
Third, the nursery space occupied by hot-mature objects inhibits allocation and leads to cache overflow. Excessive collections are then triggered to free nursery space.

For these reasons, we propose two object management techniques that improve cache locality and garbage collection for Java execution. We employ an on-chip memory in the memory hierarchy, in contrast to the traditional cache-only system. We refer to it as scratchpad memory. Using scratchpad for allocation-intensive programs has four implications. First, scratchpad has been shown to save chip area and power consumption by 46% and 40%, respectively, owing to its tag-less structure [6]. Second, since accesses to scratchpad are not subject to compulsory, capacity and conflict misses [5], assigning hot-mature objects to scratchpad prevents them from being replaced, and every access completes in a single cycle. Third, scratchpad allows flexible use of on-chip memory space. Preserving hot-mature objects in a scratchpad region leaves more nursery space for newly allocated short-lived objects, which reduces capacity misses in the nursery compared to the cache-only system. Fourth, scratchpad divides the object heap so that young and hot-mature objects are preserved separately. This allows different collection policies: objects in the nursery can be collected frequently, whereas objects in the hot-mature region are reclaimed only occasionally. These benefits motivate us to use scratchpad for programs that contain numerous long-lived objects, and to achieve better cache locality.

In addition, we introduce a profile-driven pretenuring mechanism, which predicts the existence of young and hot-mature objects and allocates them into scratchpad memory. Other objects, i.e. cool-mature, are pretenured to main memory. Several studies have pretenured old objects into mature space using a lifetime predictor [4, 7].
Our pretenuring builds upon these works by additionally considering reference density during site sampling and updating. This improves object locality and shortens data access cycles.

3. Java Object Characterization

3.1 Benchmark and JVM Profiler
To observe object behavior in Java applications, we use the industry-standard SPECjvm98 as our benchmark. It consists of eight programs, most of which are derived from real-world applications. Seven of them, named compress, jess, db, javac, mpegaudio, mtrt and jack, are used for evaluating system performance, whereas the remaining one, check, is used for validating the correctness of the JVM. As check does not contribute to any performance score, we exclude it from our experiments. We modify the Java Heap Profiler (HPROF) agent from Java 2 SDK v1.4.2 to obtain reference and stack traces [8]. It interacts with the Java Virtual Machine Profiler Interface (JVMPI) [8] and dumps information to a file in ASCII or binary format. From the trace file, we obtain runtime statistics such as CPU usage, heap allocation and object reference density for the profiled SPECjvm98 applications.

3.2 Object Age
To quantify object lifetime, we define the relative age of an object as the total number of objects allocated during its lifetime. To display object ages effectively on the graphs, we plot values on the x-axis on a log10 scale. Using the default garbage collector, 50% of Java objects have relative ages less than 14,785, which contribute to 0.9% of total allocations in SPECjvm98, as shown in Fig. 1 and Fig. 2. Objects with age less than 215,063 participate in 12.7% of heap allocations. Using a reference counting collector to reclaim young objects, 52% of total objects would have zero lifetime, and over 90% of objects have age less than 16 [2]. This implies that Java objects are typically short-lived. More than half of all objects can be collected immediately, before the next allocation, using a reference counting collector.

Fig. 1 Age distribution of allocated objects in SPECjvm98 (x-axis: object age; y-axis: number of objects (%); one series per benchmark).
Fig. 2 Overall age distribution in SPECjvm98 (x-axis: object age; y-axis: number of objects (%); series: average age and cumulative average age).

3.3 Reference Density
We now investigate the reference density of objects, which represents the total number of references they receive; objects with high reference densities are highly referenced. In Fig. 3, a large number of short-lived objects (on the left of the x-axis) incur high reference density. In addition, a substantial number of references go to long-lived objects (on the right of the x-axis). As Kim and Hsu [3] illustrated, long-lived objects tend to hold important program information, such as database records and scene data in db and mtrt, respectively, and they are more likely to be referenced. We classify objects into three types, namely young, mature and immortal, and observe their reference density distribution through this experiment. Young objects are those whose age is below 5% of total object allocations. Immortal objects are those whose time of death is close to 95% of program execution time. The remaining objects, neither young nor immortal, are classified as mature objects.

The reference density distribution varies widely across SPECjvm98 applications, as shown in Fig. 4. For compress, 92.5% of total heap references belong to young objects, whereas in db, more than 70% of references go to immortal objects. We also find that references to old objects (both mature and immortal) account for 44% of total heap references in SPECjvm98. As hot-mature objects are irreclaimable inside the nursery, they incur conflict and capacity misses in the cache, which greatly affects runtime performance. Thus, it is crucial to segregate hot-mature objects and allocate them directly into a dedicated memory partition to optimize Java execution.

Fig. 3 Reference density distribution in SPECjvm98 (x-axis: object lifetime relative to the entire execution time (%); y-axis: % of total heap references received by objects; one series per benchmark).
Fig. 4 Hot objects distribution over three object classes based on object lifetime (x-axis: SPECjvm98 benchmark; y-axis: number of references (%); series: Young, Mature, Immortal).
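The three lifetime classes of Section 3.3 and the per-class reference share plotted in Fig. 4 can be illustrated with a small Java sketch. It is not the modified HPROF agent. The `ObjectProfile` record, the choice to measure both age and time of death in allocation counts, the exact 5% and 95% cutoffs, and the order in which the tests are applied are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class LifetimeClassSketch {
    /** The three lifetime classes used in Section 3.3. */
    enum LifetimeClass { YOUNG, MATURE, IMMORTAL }

    /** Hypothetical trace record; times are counted in objects allocated so far. */
    static final class ObjectProfile {
        final long birth;       // allocation count when the object was created
        final long death;       // allocation count when it became unreachable
        final long references;  // number of references the object received

        ObjectProfile(long birth, long death, long references) {
            this.birth = birth;
            this.death = death;
            this.references = references;
        }
    }

    // Assumed cutoffs read off the text: 5% of allocations, 95% of execution.
    static final double YOUNG_FRACTION = 0.05;
    static final double IMMORTAL_FRACTION = 0.95;

    static LifetimeClass classify(ObjectProfile o, long totalAllocations) {
        long age = o.death - o.birth;
        if (age < YOUNG_FRACTION * totalAllocations) {
            return LifetimeClass.YOUNG;      // assumption: the young test is applied first
        }
        if (o.death >= IMMORTAL_FRACTION * totalAllocations) {
            return LifetimeClass.IMMORTAL;   // dies at (or near) program termination
        }
        return LifetimeClass.MATURE;         // neither young nor immortal
    }

    /** Fraction of all heap references received by each lifetime class. */
    static Map<LifetimeClass, Double> referenceShare(List<ObjectProfile> trace,
                                                     long totalAllocations) {
        Map<LifetimeClass, Double> share = new EnumMap<>(LifetimeClass.class);
        for (LifetimeClass c : LifetimeClass.values()) {
            share.put(c, 0.0);
        }
        long total = 0;
        for (ObjectProfile o : trace) {
            total += o.references;
            share.merge(classify(o, totalAllocations), (double) o.references, Double::sum);
        }
        if (total > 0) {
            for (LifetimeClass c : LifetimeClass.values()) {
                share.put(c, share.get(c) / total);
            }
        }
        return share;
    }

    public static void main(String[] args) {
        List<ObjectProfile> trace = Arrays.asList(
            new ObjectProfile(0, 10, 50),      // short-lived object
            new ObjectProfile(100, 600, 30),   // dies in the middle of the run
            new ObjectProfile(5, 1000, 20));   // survives to the end of the run
        System.out.println(referenceShare(trace, 1000));
    }
}
```

With a real trace, the share attributed to `MATURE` plus `IMMORTAL` corresponds to the "old objects" figure quoted above, under the stated assumptions.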

3.4 Object Classification
To segregate objects by lifetime and reference density, we compare each object's age with a threshold T_YNG, which represents the maximum number of nursery allocations. If the age of an object is less than or equal to T_YNG, it is a young object. Otherwise, we check whether its age is close to another runtime threshold, T_IMM, which indicates the total number of heap allocations from program start to termination. If so, the object is immortal and will only die together with the program. Otherwise, we classify it as a mature object. In addition, we need to determine which long-lived objects are frequently accessed in order to segregate them into a dedicated SPM region. The last threshold, T_HOT (the reference density threshold), is compared with the number of references received by a mature or immortal object. If its reference density is equal to or larger than T_HOT, the object is frequently accessed and the prefix Hot- is assigned to its class type; otherwise Cool- is assigned. Thus, we categorize each object into one of the following classes (see Table 1): Young (YG), Hot-Mature (HM), Hot-Immortal (HI), Cool-Mature (CM) and Cool-Immortal (CI). The detailed sampling mechanism is discussed in section 5.

Table 1: Object classification based on object's age and reference density

Object Class   Age        Reference Density
YG             ≤ T_YNG    n/a
HM             > T_YNG    ≥ T_HOT
HI             = T_IMM    ≥ T_HOT
CM             > T_YNG    < T_HOT
CI             = T_IMM    < T_HOT

4. Scratchpad Memory

4.1 Memory Hierarchy and Partitioning
Our proposed scratchpad memory system comprises four major components: Data Buffer (DBUFFER), Data Cache (DCACHE), Scratch-Pad Memory (SPM) and Main Memory (MM). The heap space is divided into two separate memories (see Fig. 5). Memory addresses 0 to S-1 map into the on-chip SPM with S data words.
Every SPM access completes in a single cycle. Memory addresses S to S+M-1 map into the M-word off-chip MM. The processor accesses MM through DCACHE, which takes a single cycle on a cache hit. On a cache miss, however, the CPU must transfer data between cache and memory, which incurs a delay of multiple cycles [5]. DBUFFER, which consists of 16 registers, is designed to cache data for multi-port accesses.

4.1.1 Nursery (young) Region: Using a cache to preserve objects is likely to cause a mismatch between object sizes and the fixed-size cache lines. As the cache has no knowledge of dead objects, excessive traffic is generated when they are copied between cache and memory. To better utilize the nursery space, we partition SPM, instead of using cache, to handle YG objects. Most heap allocations then take place in SPM instead of off-chip MM. Using the reference counting collector, 75% of all allocated YG objects were collected directly in the nursery SPM without being promoted to the cache and mature MM regions. This substantially reduces the total amount of object copying.

4.1.2 Hot (frequently-accessed) Region: Hot objects account for most of the CPU time, so they strongly influence runtime performance. Using cache to preserve HM and HI objects may improve object locality. However, since half of the hot objects are long-lived or even immortal, the chance of their being replaced by conflicting objects is high. This incurs extra cache misses and sweeping delay when hot objects are accessed. Therefore, we further partition SPM to accommodate HM and HI objects. As immortal objects cannot be reclaimed before program termination, no garbage collection is required in the Hot-Immortal space.

4.1.3 Cool (infrequently-accessed) Region: Since the garbage collector cannot efficiently reclaim CM and CI objects, which do not generate significant memory traffic, we segregate them from the SPM regions into cool regions in MM. The CPU accesses them through DCACHE.
As they do not contribute to the overall cache miss ratio, single-cycle access can be achieved. As in the hot region (see section 4.1.2), immortal objects are separated from the mature region to alleviate the load on the garbage collector.

4.1.4 Miscellaneous Region: Since SPM acts as the main store for a considerable number of short-lived and frequently-accessed objects, it may overflow. To preserve the objects that overflow from SPM and prevent them from mixing with CM and CI objects, we further partition MM and provide a miscellaneous (ML) region for them. As in the cool region (see section 4.1.3), the CPU can access objects in the ML region via DCACHE in a single cycle.

4.2 Scratchpad Memory System Architecture
Fig. 6 shows the architectural diagram of our SPM system. The address, data and control buses from the processor core are connected to DBUFFER, DCACHE, SPM and SPMC (SPM Controller) inside the embedded chip. For each memory access, a request signal from the CPU is sent to DBUFFER, DCACHE and SPMC. DBUFFER issues DB_HIT if it contains the requested data. If not, the CPU checks whether DCACHE or SPM owns the data, so that copies of the data can be fetched into DBUFFER for further use. Similarly, DCACHE signals a cache hit to the CPU through DC_HIT. The signal is also sent to SPMC for checking the existence of data. SPMC consists of a pair of registers and comparators that hold the upper and lower bound addresses of SPM. If SPMC finds that the referenced memory address maps into SPM, it sends the SPM_HIT signal to the CPU. If DBUFFER, DCACHE and SPMC all report misses, a block of data is transferred between DCACHE and MM via the external memory buses.

Fig. 5 Scratchpad memory hierarchy and heap space partitioning
Fig. 6 The architecture of reconfigurable scratchpad memory sub-system

5. Object Pretenuring
Pretenuring allocates objects into mature regions based on profiling statistics. It reduces the expense of copying long-lived objects from the nursery to the regions where they should have been allocated. Previous studies stated that object size and type cannot accurately predict object lifetime [4, 7]. Our pretenuring is premised on homogeneous object age and reference density at allocation sites. We use this advice to allocate objects into the correct regions and reduce object copying costs.

5.1 Pretenuring Mechanism
For each heap allocation request, the context, i.e. the dynamic sequence of method calls that led to the request, is obtained from the execution stack. We refer to this information as the allocation context. Objects allocated from the same allocation site (i.e. call site) are assumed to behave alike, including their lifetimes and reference densities. To put this into practice, we construct a site predictor comprising a pair of buffers that store the allocation contexts and pretenuring advice. It predicts objects' runtime behavior at allocation time and pretenures them into the correct memory regions. Once objects are tenured or collected, the predictor updates its buffers based on the object ages and reference densities stored inside the object header (OH). We call this process site sampling.

5.2 Site Prediction and Sampling
The allocation site buffer contains five counters that store the numbers of YG, HM, HI, CM and CI objects that have been sampled. These counters act as parameters for the site predictor to analyze the homogeneity of the allocation site and determine the pretenuring destinations of objects. The prediction algorithm is as follows:

1 If YG ≥ HM + HI + CM + CI, then an object allocated from the site is assigned to the YG region. Else, go to step 2-1.
2-1 If HM + HI > CM + CI, then the object is prefixed with H (i.e. Hot), else C (i.e. Cool).
2-2 And if YG + HM + CM > HI + CI, then the object is suffixed with M (i.e. Mature), else I (i.e. Immortal).
2-3 Then, the object is assigned to region HM, HI, CM or CI according to its prefix and suffix characters.

To sample each object allocated at a given site, we log all heap allocations, pointer accesses and mutations, and object deaths. When objects are tenured from the young region into an older region, the predictor implicitly updates the pretenuring advice in the OH and site buffer according to their pretenuring destinations. This brings forward the sampling of mature and immortal objects, so that we need not wait for object death to gather pretenuring statistics. Once an object is dead, the predictor extracts the object lifetime and reference density from the OH to update the entries in the site buffer. The mechanism is as follows:

1 If the age of an object is less than the maximum nursery age, we classify it as a YG object and add one to the YG counter in the site buffer. Otherwise, go to step 2-1.
2-1 For any object that has not previously been defined as a hot object: if the object's age is less than the program end time, we classify it as a CM object and add one to the CM counter in the site buffer; otherwise, we classify it as a CI object and add one to the CI counter in the site buffer.
2-2 For any object that has previously been defined as a hot object: if the object's age is less than the program end time, we classify it as an HM object and add one to the HM counter in the site buffer; otherwise, we classify it as an HI object and add one to the HI counter in the site buffer.

Cheng [4] and Blackburn [7] stated that the immortal age can be obtained by multiplying the maximum number of live objects by a threshold. By recording the maximum number of allocations and the heap usage during execution, we can estimate the program age.

6. Evaluation
Our experiments are carried out using the Jikes Research Virtual Machine (RVM) v2.4.4, an open-source JVM written in Java with its own compilers [1].
It provides us with the capabilities to implement and evaluate our pretenuring algorithm for both the JVM runtime and user applications. In addition, we use the Memory Management Toolkit (MMTk) packaged with Jikes for memory allocation and garbage collection instrumentation. It exploits object orientation as well as the JVM-in-Java property. As MMTk implements several collectors, it gives us the simplest way to implement a reference-counted nursery and mark-compacted mature spaces. Source code implementing the pretenuring algorithms is inserted into the RVM. We generate the runtime trace for evaluation using the Merlin tracer, which is provided with the RVM. A 32KB DCACHE and a 32KB SPM are used for the experiment. Fig. 7 shows that a large portion of objects (78-99%) are allocated directly into the YG space for most applications (excluding compress and mpegaudio).

Fig. 7 Distribution of allocated objects with respect to their pretenured regions (x-axis: pretenured objects (%); one bar per benchmark; series: YG, HM, HI, CM, CI).

Using a reference counting collector, 75% of the nursery space can be reclaimed immediately once objects become unreferenced. A hardware write barrier is implemented to minimize the cost of reference count updates. By contrast, for applications that possess large numbers of mature objects, such as compress and mpegaudio, 46-84% of objects are pretenured to the CI space, where they can be accessed by the CPU via DCACHE. Overall, pretenuring reduces the total amount of object copying from the young space to the mature and immortal spaces by 26.6%. In addition, we notice that a substantial number of HM and HI objects are allocated into the SPM region in compress and mpegaudio. This ensures that hot objects are not replaced by conflicting objects and achieves single-cycle access time.

7. Conclusions and Future Work
This study examines the implications of using scratchpad memory for Java execution. We classify objects into five categories: Young (YG), Hot-Mature (HM), Hot-Immortal (HI), Cool-Mature (CM) and Cool-Immortal (CI).
To exploit generational garbage collection, we introduce a pretenuring mechanism, premised on object lifetime and reference density, that pretenures YG, HM and HI objects directly into scratchpad and tenures CM and CI objects into main memory. It significantly reduces object copying costs. Results also show that 90% or more of objects are allocated into nursery space, and that 75% of that space can be immediately reclaimed by a reference counting collector with a hardware write barrier. In addition, a small fraction of HM and HI objects allocated into SPM contributes more than half of total object references. They are never replaced by conflicting objects and achieve single-cycle access.

There are several directions to go from here: pretenuring accuracy, cache pressure caused by HM objects, and the time and space overheads incurred by the pretenuring mechanism. We are now implementing a hardware site predictor so that we can more accurately assess the performance costs and benefits of our approach. To avoid relying on datasets that are too few or unrealistic, we will analyze more and larger problems in the future.

Acknowledgment
The work described in this paper was partially supported by the City University of Hong Kong, Strategic Research Grant.

References
[1] A. Alpern, C. R. Attanasio, A. Cocchi, D. Lieber, S. Smith, T. Ngo, J. J. Barton, S. F. Hummel, J. C. Shepherd, and M. Mergen, Implementing Jalapeno in Java, In Proc. of ACM Conference on Object-Oriented System, Language, and Application, 34(10).
[2] C. H. Yau, Y. Y. Tan, A. S. Fong, and W. S. Yu, Hardware Concurrent Garbage Collection for Short-Lived Objects in Mobile Java Devices, 2005 European Conference.
[3] J. S. Kim, and Y. Hsu, Memory System Behavior of Java Programs: Methodology and Analysis, In Proc. of ACM Conference on Measurement and Modeling of Computer System.
[4] P. Cheng, R. Harper, and P. Lee, Generational stack collection and profile-driven pretenuring, In 1998 SIGPLAN Conference on Programming Language, Design, and Implementation.
[5] P. R. Panda, N. D. Dutt, and A. Nicolau, On-chip vs. off-chip memory: the data partitioning problem in embedded processor-based systems, ACM Transaction on Design Automation of Electronic System (TODAES), 5, 3.
[6] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan and P. Marwedel, Scratchpad Memory: A Design Alternative for Cache On-chip memory in Embedded Systems, In Proc. International Workshop on Hardware/Software Codesign, Colorado.
[7] S. M. Blackburn, S. Singhai, M. Hertz, K. S. McKinley, and J. E. B. Moss, Pretenuring for Java, ACM Conf. on Object-Oriented Programming, System, Language and Application, FL, USA.
[8] Sun Microsystem, Java Virtual Machine,


K. F. Chong received the B.Eng. degree in Information Engineering from City University of Hong Kong. He is currently working towards the MPhil degree in Electronic Engineering at the City University of Hong Kong. His research interests include cache and memory systems, garbage collection, computer architecture and high-performance Java processors.

Anthony S. Fong received the B.E.E. degree from Villanova University, Pennsylvania. He then worked at Philco-Ford Corporation as a Programmer before he returned to university. He received his M.Sc. degree in Computer Science from the State University of New York at Buffalo and was awarded the Ph.D. degree from the University of Sunderland. In 1991 he joined the City University of Hong Kong as Senior Lecturer in the Department of Electronic Engineering. He was Visiting Professor at the Institute of Electronics, Chinese Academy of Science, from 1997. At present he is Associate Professor and the Director of the EDA Centre in the Department of Electronic Engineering. His research interests include Computer Architecture & Design, Electronic Design Automation, and Database. He is a senior member of IEEE.
