Interaction of JVM with x86, Sparc and MIPS

Sasikanth Avancha, Dipanjan Chakraborty, Dhiral Gada, Tapan Kamdar
{savanc1, dchakr1, dgada1, kamdar}@cs.umbc.edu
Department of Computer Science and Electrical Engineering
University of Maryland Baltimore County
1000 Hilltop Circle, Baltimore, MD 21250

ABSTRACT

A Java class is a perfect example of architecture-independent code. A Java program can be compiled on a MIPS R4400-based Indy workstation running Irix 6.5 and the generated class executed on an Intel Pentium III-based Windows 95 system with no problems. This independence is possible because any system claiming to support Java implements a Java Virtual Machine (JVM). This paper presents a detailed analysis of the interaction of a JVM with three popular processor architectures: the Intel x86, the Sun UltraSparc and the SGI MIPS. The analysis shows that each architecture performs better than the other two in some respects, but no single architecture is the best for Java programs.

1 Introduction

The Java Virtual Machine (JVM) is a powerful concept that allows platform- and architecture-independent program development. This independence is achieved chiefly by compiling a Java program into JVM instructions, called bytecodes, rather than native instructions of the underlying architecture. The JVM executes these bytecodes by first translating them into native code and then executing the native code. Certain questions immediately arise regarding the native code generated and executed by the JVM. In this paper, we ask the following questions and attempt to answer them through analysis of the generated data.

Q1. What is the instruction mix for different native instruction classes (ALU, data transfer and control) on different architectures?
Q2. What is the average native instruction length?
Q3. What is the bytecode complexity (i.e., the number of native instructions generated per bytecode) on different architectures?
Q4.
Which architecture causes the overall native executable code size (in bytes) to be the largest?
Q5. On a particular architecture, in which JVM mode (JIT or interpreter) do Java programs execute faster?

The rest of this paper is organized as follows. Section 2 provides an overview of the JVM architecture. The components at the heart of the JVM, the JIT and the interpreter, are described in section 3. In section 4, we discuss our performance metrics in detail. Section 5 contains details of the data generation process. Section 6 describes the analysis of the generated data. We discuss the results of our analysis in section 7. Conclusions of this work are drawn in section 8. The graphs plotted for the data generated with respect to the five performance metrics are shown in the Appendix.

2 JVM Structure

Class files, containing bytecodes and linkage information, are assembled from a variety of sources and are executed on a host machine by the implementation of the JVM. Execution speed is increased by using a verifier that performs a static examination of the code, thereby avoiding many time-consuming run-time checks.

Features of the Architecture

The JVM is a stack-based machine that manipulates data represented by words. A JVM stack comprises a collection of frames, each associated with the execution of a single method. A frame consists of two components: a collection of local variables and an operand stack. Local variables are accessed directly by index. An operand stack contains a number of words that are accessed on a LIFO basis by the JVM bytecodes. In addition, the VM incorporates a heap that contains objects.

The Java Virtual Machine (JVM) consists of two environments:

Compiler Environment: A Java source file is compiled using the javac compiler in optimizing mode to produce bytecodes in Java classes. Bytecodes are platform independent, and a Java program can be ported to different platforms by simply moving the classes.

Figure 1: Structure of JVM
Run-time Environment: The run-time environment consists of the class loaders, the Java interpreter/JIT, the runtime system, and the interaction with the operating system and the hardware.

Class Loaders: They enable the JVM to load classes without a priori knowledge about the underlying file system semantics, and they allow applications to dynamically load Java classes as extension modules.

Java Interpreter/JIT: The JVM executes bytecodes using either the interpreter or the Just-In-Time (JIT) compiler.

Runtime System: The runtime system communicates with the operating system, which in turn interacts with the underlying hardware.

3 JIT and Interpreter

3.1 JIT

A Just-In-Time (JIT) Java compiler produces native code from Java bytecode instructions during program execution. Compilation speed is more important in a Java JIT compiler than in a conventional compiler, requiring the optimization algorithms to be lightweight yet effective. The JIT consists of five major phases. The pre-pass phase performs a linear-time traversal of the bytecodes to collect information needed for global register allocation and for implementing garbage collection. The global register allocation phase assigns physical registers to local variables. The code generation phase generates instructions and performs optimizations such as common sub-expression elimination, array bounds check elimination and frame pointer elimination. The code emission phase copies the generated code and data sections to their final locations in memory. The patching phase fixes up relocations in the code and data sections, i.e., offsets of forward branches, addresses of code labels in switch table entries, etc. With the exception of the global register allocation phase, all phases are linear in time and space.

3.2 Interpreter

Interpreters play a crucial role as binary emulators, enabling code to be ported directly from one architecture to another. The execution time of an interpreted program depends upon the number of commands interpreted and the time to decode and execute each command.
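This fetch-decode-execute cycle can be made concrete with a minimal stack-based interpreter loop. The Java sketch below uses a hypothetical four-opcode instruction set and is only an illustration of the main-loop structure, not Kaffe's actual interpreter engine:

```java
// Minimal stack-based virtual machine (hypothetical opcodes, not Kaffe's).
// Each trip through the loop fetches one virtual command, decodes it in the
// switch, and only then performs the work the command specifies.
public class MiniInterpreter {
    static final int PUSH = 0, ADD = 1, MUL = 2, HALT = 3;

    static int run(int[] code) {
        int[] stack = new int[64];  // operand stack, accessed LIFO as in a JVM frame
        int sp = 0;                 // operand stack pointer
        int pc = 0;                 // program counter into the code array
        while (true) {
            int op = code[pc++];            // fetch the next virtual command
            switch (op) {                   // decode it
                case PUSH: stack[sp++] = code[pc++]; break;       // push immediate
                case ADD:  sp--; stack[sp - 1] += stack[sp]; break;
                case MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
                case HALT: return stack[sp - 1];                  // result on top
                default:   throw new IllegalStateException("bad opcode " + op);
            }
        }
    }

    public static void main(String[] args) {
        // (2 + 3) * 4 expressed as stack code
        int[] code = { PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT };
        System.out.println(run(code));  // prints 20
    }
}
```

Every trip through the loop pays the fetch (code[pc++]) and decode (switch dispatch) costs before any useful work is done, which is exactly the per-command overhead discussed here.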
The number of commands and the execution time depend directly on the complexity of the virtual machine implemented [Romer]. The virtual machine defines a set of virtual commands, which provide a portable interface between the program and the processor. The implementation of the virtual machine executes one virtual command on each trip through the main interpreter loop. The interpreter hence incurs an overhead for fetching and decoding each virtual command before performing the work specified by the command. The execution time of an interpreted command therefore depends on the number of commands interpreted, the time to fetch and decode each command, and the actual time spent executing the operation specified by the command.

4 Metrics

In order to study the interaction of the JVM with the underlying architectures, we selected the following metrics:
Instruction Count: The number of instructions of a particular category generated on a specific architecture. Instructions were categorized into classes: ALU, data transfer, control and miscellaneous.

Average Instruction Length: The instruction sets of the platforms were analyzed to obtain the size in bytes of each instruction. The RISC platforms (Sun Sparc and MIPS) have a constant instruction length for all instructions; the x86 has a variable instruction length.

Bytecode Complexity: The bytecodes are translated into native instructions on each architecture. A particular bytecode may translate to a different number and type of native instructions on different architectures. Hence, the complexity of translating a particular bytecode to native instructions indicates, to a certain degree, the complexity of the JVM generating the native code on that platform.

Native executable code size: For a particular program on a particular architecture, the native executable code size was computed as the sum, over all instruction types, of the number of instructions of that type multiplied by the average instruction length of that type. The native executable code size denotes the size in bytes of the code generated on the three architectures. This allows us to analyze possible effects of program size on the execution time of the JIT or interpreter on different architectures.

Execution time: In order to compare the JIT and the interpreter, the execution time of the programs in our test suite was determined. The source code of the programs was analyzed to understand the conditions under which the JIT and the interpreter outperform each other.

5 Data Generation

Kaffe Source Code Analysis: As a first step in obtaining the required data for our analysis, the current version of the Kaffe Virtual Machine [Wilkinson] was obtained.
The source code of the Kaffe JIT engine and the Kaffe interpreter engine was analyzed to understand their functional aspects. This code analysis explained, to a certain extent, how the JIT and the interpreter perform bytecode optimizations.

Test Programs: The next step was to obtain a set of real test programs to exercise the JVM. The programs in the regression test suite designed to test the Kaffe JVM were examined, and a set of 25 programs was chosen. These programs test all the capabilities of the JVM and are not intended to be a benchmark suite. Important features of the JVM, such as class loading, thread handling, garbage collection, exception handling, integer computations, floating-point computations and loops, are exercised by the programs. These programs were used to perform experiments and compare the performance of the JVM on the three architectures, based on the results of the metrics.

Software Tools for Data Generation: In order to evaluate JVM performance, analysis of the native code generated on each architecture was necessary. A software tool called Toba [Toba], developed in the CS department at the University of Arizona, was used for this purpose. Toba converts a Java program or a Java class file into C source file(s) or native code source file(s), as required. For this analysis, both C and native code source files were obtained. Toba requires JDK version 1.1.6 or greater. The Red Hat Linux 6.0, SunOS 5.6 and Irix 6.5 versions of the JDK were installed on three systems based on the x86, Sparc and MIPS architectures, respectively. Toba was then built, installed and configured on each system.
6 Data Analysis

Instruction Count (frequency of instructions): The possible native instructions were determined by examining the instruction set of each architecture. Perl scripts were written to analyze the native source files generated by Toba and to produce the instruction counts for each file.

Average Instruction Length: RISC machines use constant-length instructions; the Sparc and MIPS have a constant length of 4 bytes per instruction. CISC machines (e.g., the x86) generate instructions with lengths varying from 1 to 6 bytes. In our test environment (32-bit mode on the Linux OS), instructions of length 1 byte were quite rare, while instructions of lengths 2, 4 and 6 bytes were observed quite frequently.

Bytecode Complexity: Optimized class files (generated by javac with the optimization flag) were used to generate bytecodes using javap. A Java program containing a single occurrence of each bytecode was used for the analysis. The generated bytecodes were mapped manually to the native instructions generated on each architecture. This enabled us to calculate the number of native instructions generated for a particular bytecode on each architecture.

Native executable code size: The total contribution of all native instructions to the overall executable code is defined as the native executable code size of the program.

Execution time: The optimized class files were executed on each platform in JIT mode. The execution time was computed as an average of ten runs of each program. The programs were then executed using the interpreter, and the interpreter execution times were likewise computed as an average of ten runs of each program. Similar tests were carried out on all platforms. We used the java tool from JDK version 1.2 on each platform with our test programs as inputs.
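The per-file counting performed by the Perl scripts can be sketched in Java as follows. The mnemonic sets below are small hypothetical samples (MIPS-flavored); the actual scripts covered the full instruction set of each architecture:

```java
import java.util.*;

// Classify assembly lines by leading mnemonic and tally instruction classes.
// The mnemonic sets are illustrative samples, not complete ISA coverage.
public class InstructionCounter {
    static final Set<String> ALU  = Set.of("add", "addu", "sub", "and", "or", "xor", "sll");
    static final Set<String> XFER = Set.of("lw", "sw", "lb", "sb", "la", "li", "move");
    static final Set<String> CTRL = Set.of("beq", "bne", "j", "jal", "jr");

    static Map<String, Integer> count(List<String> asmLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : asmLines) {
            // The mnemonic is the first whitespace-delimited token on the line.
            String mnemonic = line.trim().split("\\s+")[0].toLowerCase();
            String cls = ALU.contains(mnemonic)  ? "ALU"
                       : XFER.contains(mnemonic) ? "DataTransfer"
                       : CTRL.contains(mnemonic) ? "Control"
                       : "Misc";
            counts.merge(cls, 1, Integer::sum);
        }
        return counts;
    }
}
```

Dividing each class count by the total instruction count of the program then yields the per-class frequencies reported in the Results section.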
We used the UNIX time command to generate the overall execution time per program in JIT and interpreter modes.

7 Results

Frequency of instructions: Instructions were grouped into the data transfer, ALU, control and miscellaneous categories. The frequency of each class was calculated as the fraction of instructions of that class over the total instruction count of the program. Graphs 1, 2 and 3 show the following information:

                x86    MIPS   Sparc
Data Transfers  60%    75%    20%
ALU             20%    10%    45%
Control         20%    15%    30%

The Sparc also has 5% miscellaneous instructions, which do not fall into any of the other categories.

Average Instruction Length: As noted above, the average instruction length for the Sparc and MIPS architectures is 4 bytes. The average instruction length for the x86 was calculated as follows: the frequency of each instruction type was multiplied by the average instruction length of that type to obtain the contribution of that type to the overall native executable code size; these contributions were summed and the sum was divided by the total number of instructions in the program. The average instruction length for the x86 was thus determined to be 4.14 bytes/instruction.

Bytecode Complexity: Graphs 7 and 8 show the distribution of native instructions for a Java program consisting of one instance of every bytecode in each category. We observe that the MIPS has the highest number of native instructions generated for data transfer as well as control bytecodes. The MIPS generates the fewest instructions for ALU bytecodes, whereas the x86 and Sparc generate nearly the same number of native instructions. The Sparc generates the most instructions for miscellaneous bytecodes. As the table shows, the x86 (CISC) generates the fewest native instructions overall, while the MIPS generates the most for a particular program.

                x86    MIPS   Sparc
Data Transfers  119    186    166
ALU             100     86    100
Control          54     71     54
Miscellaneous    76     87     93
Total           349    430    413

Native executable code size: From Figure 2, it is evident that the MIPS generates the largest native executable code size for a particular program when compared to the same program on the x86 and Sparc. Between the x86 and the Sparc, the x86 (CISC) has a greater native executable code size than the Sparc.

[Figure 2: Native executable code size on x86, MIPS and Sparc]

Execution time: From Graph 4, it is evident that the interpreter on the x86 outperforms the JIT on the majority of occasions. Graphs 5 and 6 show that the JIT outperforms the interpreter on more occasions on the MIPS and Sparc than on the x86. The JIT is slower in programs that include object/array creation (use of the new operator), synchronous calls to classes such as the I/O classes, and a larger proportion of inline code. The JIT is faster in programs dominated by loop overhead and by increment and assignment operations.
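The native executable code size and weighted-average instruction length computations described in this section can be expressed compactly. The per-length counts below are illustrative placeholders, not the measured x86 data:

```java
public class CodeSizeCalc {
    // Native executable code size: sum over instruction types of
    // (count of that type) * (length in bytes of that type).
    static int sizeBytes(int[] lengths, int[] counts) {
        int total = 0;
        for (int i = 0; i < lengths.length; i++) total += counts[i] * lengths[i];
        return total;
    }

    // Average instruction length: code size divided by total instruction count.
    static double avgLength(int[] lengths, int[] counts) {
        int instr = 0;
        for (int c : counts) instr += c;
        return (double) sizeBytes(lengths, counts) / instr;
    }

    public static void main(String[] args) {
        int[] lengths = { 2, 4, 6 };        // frequently observed x86 lengths
        int[] counts  = { 300, 500, 200 };  // hypothetical per-length counts
        System.out.println(sizeBytes(lengths, counts) + " bytes");               // 3800 bytes
        System.out.println(avgLength(lengths, counts) + " bytes/instruction");   // 3.8
    }
}
```

With the actual measured distribution, the same computation yields the 4.14 bytes/instruction figure reported above for the x86.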
These observations were confirmed when the source code of the programs was examined to determine why the interpreter and the JIT outperform each other under different conditions.

8 Conclusions
We analyzed the interaction of the JVM with the underlying architectures and arrived at the following conclusions. Data transfer performance is critical on the MIPS and x86; hence, it is very important to improve the performance of the data transfer units to improve execution time on these architectures. Branch performance affects execution time on the Sparc the most, since a major percentage of the translated instructions fall into the control category. ALU-intensive programs execute faster on the MIPS, since the bytecode translation for ALU instructions generates the fewest native instructions; hence, ALU-intensive Java applications would run best on the MIPS, even though the overall bytecode complexity of the MIPS is the highest. Consequently, programs with a proportionate mix of instructions would be slower on the MIPS than on the other platforms. The interpreter is faster than the JIT in many cases. The x86 JIT behaves quite differently from the other JITs, which indicates that it does not perform optimizations on the x86 equivalent to those performed on the other platforms.

9 References

[Krall] Krall, A., et al. CACAO - A 64 bit JavaVM Just-In-Time Compiler. In Proc. ACM PPoPP'97 Workshop on Java for Science and Engineering Computation. http://www.complang.tuwien.ac.at/andi/javaws.ps
[Lindholm] Lindholm, T. and Yellin, F. The Java Virtual Machine Specification, Second Edition. http://java.sun.com/docs/books/vmspec
[Romer] Romer, T., et al. The Structure and Performance of Interpreters. In Proc. Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (Cambridge, Massachusetts), ACM, 1996.
[Tabatabai] Adl-Tabatabai, A., et al. Fast, Effective Code Generation in a Just-In-Time Java Compiler. In Proc. ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (Montreal, Canada), ACM, 1998.
[Transvirtual] http://www.transvirtual.com/products
[Toba] http://www.cs.arizona.edu/sumatra/toba/
[Yelland] Yelland, P. A Compositional Account of the Java Virtual Machine. In Proc. 26th Annual Symposium on Principles of Programming Languages (San Antonio, Texas), ACM, 1999.
10 Appendix

[Figure 1. Instruction counts on x86]
[Figure 2. Instruction counts on MIPS]
[Figure 3. Instruction counts on Sparc]
[Figure 4. Execution time on x86 (interpreter vs. JIT)]
[Figure 5. Execution time on MIPS (interpreter vs. JIT)]
[Figure 6. Execution time on Sparc (interpreter vs. JIT)]
[Figure 7. Bytecode distribution by class on x86, Sparc and MIPS]
[Figure 8. Total bytecode distribution on x86, Sparc and MIPS]