Benchmark hardware support for virtual machines


Master thesis report

Abstract

This report identifies and investigates what performance gain hardware acceleration techniques (VFPv3, SMP, NEON, and Thumb2) can give on a virtual machine, with or without the use of JIT. The best performance gain is achieved when a combination of techniques is used (SMP support together with a VM compiled with VFPv3/NEON, using a JIT that can produce VFPv3 instructions). The impact on memory usage and CPU load is minimal when using these techniques. Executed programs must use the parts of the VM that have been affected by the acceleration techniques to gain any performance boost. It is of great interest to continue tests in the area of hardware-accelerated VMs.

Authors: Håkan Ternby, Eskil Algéus
Supervisor: Peter Andersson, Supervisor at ST-Ericsson
Henrik Svensson, Manager at ST-Ericsson
Jonas Skeppstedt, Examiner from,

Project division of labor

The work in this project was divided as follows. Eskil investigated and tested how the hardware acceleration techniques affected performance with and without the JIT. He also investigated SMP support and how memory and CPU load were affected by the techniques. Håkan investigated and tested Thumb2 and VFPv3/NEON. Håkan also investigated the VFP in more detail by examining code, and measured CPU and memory load. Both of us read through all the reference material. For the report, each of us wrote the parts we had worked on and had the other person proofread them.

Acknowledgement

We would like to thank ST-Ericsson for contributing the workspace and test equipment used in this project. We also want to thank (in alphabetical order): Andersson, Peter (our supervisor at ST-Ericsson, for guidance in this project and on this report), Fagerstedt, Axel (ST-Ericsson, for debugging and code help), Nilsson, Anders (ST-Ericsson, for help with code and programs), Skeppstedt, Jonas (our supervisor at CS-LTH, for input on this report), Strand, Henrik (ST-Ericsson, for helping out with hardware and questions), Svensson, Henrik (ST-Ericsson, our manager in this project), and the rest of the team under Henrik Svensson at ST-Ericsson.

Contents

LIST OF FIGURES
LIST OF DIAGRAMS
1 INTRODUCTION
  1.1 BACKGROUND AND PURPOSE
  1.2 PROJECT DELIMITATIONS AND LIMITATIONS
  1.3 PROJECT SCOPE
  1.4 REPORT OUTLINE
2 THEORY
  2.1 BACKGROUND
  2.2 THE VIRTUAL MACHINE
    JIT (Just In Time) compiler
  2.3 JAVA
    The J2ME platform
    The CLDC version of J2ME
    The CDC version of J2ME
  2.4 THE ARM PLATFORM
    The ARM Architecture
    Cortex Family
    Multi-Processing Core
  2.5 ARM PLATFORM EXTENSIONS
    Thumb
    Thumb2
    Thumb2EE
    Jazelle
    Jazelle RCT
    Jazelle DBX
    Vector Floating Point extension
    Single Instruction, Multiple Data
    Neon
  2.6 BENCHMARK PROGRAMS
    Grinderbench
    SciMark 2.0
    The Monte Carlo integration
    Successive Over Relaxation
3 METHODS
  3.1 WORK PLAN
  3.2 SYSTEM SETUP
    The Virtual Machine
    Compiler flags
  3.3 RUNNING TESTS AND OBTAINING RESULTS
    Test result evaluation methods
4 RESULTS AND DISCUSSION
  4.1 HARDWARE ACCELERATION TECHNIQUES COMPARISON
    JIT_OFF
    JIT_ON
    JIT_HW_FP
    Hardware acceleration techniques discussion
    JIT_OFF vs. JIT_ON vs. JIT_HW_FP
    JIT discussion
  4.2 SMP SUPPORT
    SMP discussion
  4.3 INSTRUCTION SET COMPARISON AND PERFORMANCE
    Thumb
    SWP instruction
    Thumb2 discussion
    Jazelle discussion
    Vector Floating Point
    Tracing the mathematical Java method arcsine
    Modified SciMark2 SOR method with Java method arcsine
    Code comparison from e_asin.o with and without VFP
    JIT_ON and JIT_HW_FP code comparison
    VFP discussion
    Neon discussion
    CPU/Memory usage
    SMP discussion
    General findings and discussion
5 CONCLUSION
  5.1 CREDIBILITY ANALYSIS
  5.2 FUTURE WORK

List of Figures

FIGURE 1: DIFFERENCES IN CLDC AND CDC [14]
FIGURE 2: THE ARM ARCHITECTURE EXTENSIONS FOR DIFFERENT ARCHITECTURE VERSIONS [23]
FIGURE 3: SPEED VERSUS POWER CONSUMPTION CHART OF THE CORTEX-A9 MPCORE [27]
FIGURE 4: PERFORMANCE VERSUS CODE DENSITY COMPARISON OF THREE INSTRUCTION SETS [34]
FIGURE 5: THE DIFFERENT DECODING STAGES [39]

List of Diagrams

DIAGRAM 1: COMPARISON OF TECHNIQUES WITH JIT_OFF
DIAGRAM 2: COMPARISON OF TECHNIQUES WITH JIT_ON
DIAGRAM 3: COMPARISON OF TECHNIQUES WITH JIT_HW_FP
DIAGRAM 4: COMPARISON OF JIT_OFF, JIT_ON AND JIT_HW_FP
DIAGRAM 5: SMP COMPARISON ON SCIMARK
DIAGRAM 6: SMP COMPARISON ON GRINDERBENCH
DIAGRAM 7: CVM SIZE COMPARISON WITH AND WITHOUT THUMB
DIAGRAM 8: PERFORMANCE COMPARISON OF THE MODIFIED SOR METHODS
DIAGRAM 9: CPU LOAD WITH JIT_OFF AND NO_SMP ON SCIMARK
DIAGRAM 10: CPU LOAD WITH JIT_OFF AND SMP ON SCIMARK
DIAGRAM 11: CPU LOAD WITH JIT_OFF AND NO_SMP ON GRINDERBENCH
DIAGRAM 12: CPU LOAD WITH JIT_OFF AND SMP ON GRINDERBENCH

1 Introduction

An interesting task is to investigate the benefits of hardware acceleration of a virtual machine (VM) and to see what performance gain can be made. Hardware acceleration techniques can make the difference between a good and a bad user experience when running Java programs on a mobile unit. Hardware acceleration helps to speed up execution and thus lower the power consumption. The VM's execution speed is directly related to limiting factors of the mobile unit such as memory size, battery capacity, processor speed, and cost requirements [1]. Speeding up the VM is also interesting from a company's point of view, as Java is the leading mobile application environment and has the highest penetration of devices worldwide with the most available applications [2].

1.1 Background and purpose

ST-Ericsson is a global leader in wireless technologies. As such, it is crucial for the company to investigate new techniques. One area of interest is the VM and the hardware acceleration techniques available for it. ST-Ericsson decided to start this project as a master thesis to identify and investigate available hardware acceleration techniques for VMs. The overall purpose is to find out whether there is any performance gain for the VM when using hardware acceleration techniques compared with not using any. The work and testing were done on site at ST-Ericsson in Lund, Sweden. ST-Ericsson helped out by providing the target platform, workspace, and personal support such as setting up user accounts, answering questions, and giving guidance along the way.

1.2 Project delimitations and limitations

The research in this project does not take into consideration the economic costs and benefits of using these hardware acceleration techniques. This project also does not test the techniques on different VMs, nor does it investigate the possibility of doing so. Performance tests are only run on one hardware target platform, as the goal is to compare the different hardware techniques against each other under the same hardware circumstances.

A limitation of this project is that not all hardware/software solutions can be tested, as some of the solutions may be under license or may not be supported by the software or the target platform. Furthermore, the results of the benchmark programs are not valid on systems running other kernels or when other applications are running simultaneously with the VM. For system setups other than the one used in this project, the results can only serve as rough guidance.

1.3 Project scope

The idea is to test hardware solutions that speed up the VM and to compare whether these hardware solutions have an impact on the execution speed when running a Java program. To be able to compare the hardware solutions against each other, results from standardized benchmark programs will be used. Measurements of the CPU load, RAM usage, and flash footprint will also be made. The goal is to find answers to the following questions:

- Which hardware acceleration technique gives the best performance boost?
- Can any software technique compete with, or perform better together with, hardware acceleration techniques?
- Can combinations of techniques give a better performance gain than each technique on its own?
- Are CPU load, RAM load, and flash footprint affected by the hardware acceleration techniques, and how?
- Does the VM need to be able to handle multitasking to gain a further performance boost?

1.4 Report outline

The chapters in this report cover the following topics:

- Chapter 2: theory of the techniques and the hardware platform
- Chapter 3: description of the method
- Chapter 4: results of the tests and discussion of them
- Chapter 5: conclusion and suggestions for further work

2 Theory

The purpose of this chapter is to give the background theory and present the hardware acceleration techniques, benchmark programs, and hardware used in this project.

2.1 Background

A good user experience is crucial as a selling point for mobile units. One thing that contributes to the user experience is how fast programs execute on the unit [3]. This has led to investigations into how to speed up the execution of Java programs on mobile units, since leading mobile operators are expanding their use of Java programs in their networks [4].

A Java program is platform independent when compiled into Java byte-code. When Java byte-code is executed, the VM translates it into native instructions. Optimizing the execution speed of the VM can therefore increase the overall performance of a Java program. For this reason manufacturers have developed different improvements to speed up the VM. These improvements can be made in both hardware and software.

Others have investigated improvement techniques for the VM in various works and projects. The project "Evaluation of a hardware accelerated java virtual machine on embedded devices" [5] shows that in some cases hardware acceleration can give better performance. It also concluded that further investigations of hardware acceleration are needed, as the project only looked at one type of acceleration technique. Another work, "Hardware support for embedded java" [6], presents hardware techniques used to accelerate Java binary translation through the extension of embedded processor pipelines. Techniques for RISC processors, such as ARM, are presented and investigated. The project's conclusion is that improvements can be made when using hardware extensions, but that the result depends on the embedded system.

2.2 The Virtual Machine

A Virtual Machine (VM) executes software code like a physical machine would. The VM has an instruction set and is allowed to manipulate memory areas at runtime. It creates threads and gives them program counters and native stacks [7]. VMs can provide an Instruction Set Architecture (ISA) that differs from the underlying hardware ISA, with a high level of abstraction and performance comparable to compiled programming languages.

There are two types of VMs: system VMs and process VMs. System VMs provide virtualization only at the ISA level [8]. An example is the SUN CLDC HI.

A process VM is a type of VM that can only run one process at a time. Multiple instances of a process VM are needed to run multiple processes. PhoneME Advanced is a process VM [8].

JIT (Just In Time) compiler

JIT is a dynamic compiler with the goal of producing fast code in the shortest possible compile time [9]. It compiles the most frequently executed methods to native code while the program is running. This means that portability is kept intact, as compilation to native code is done at runtime instead of before the program is run [10]. For this to work, the VM must initially interpret the program and then analyze how it runs by looking for the most frequently executed portions of byte-code. These portions of byte-code are then compiled into optimized native code during program execution [11]. Since compilation happens while the program is executing, the compilation time adds to the program's total running time.

A JIT can be implemented so that it takes advantage of a hardware acceleration technique by using the instruction set belonging to that technique. For example, a JIT compiler can use a floating point unit by emitting special floating point instructions and using floating point registers [12].

2.3 Java

The Java platform currently consists of three editions: Java 2 Enterprise Edition (J2EE), Java 2 Standard Edition (J2SE), and Java 2 Micro Edition (J2ME) [13].

Figure 1: Differences in CLDC and CDC [14]
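The interpret-first, compile-when-hot behaviour of a JIT described in the section on the JIT compiler can be illustrated with a small sketch. This is a toy model under stated assumptions: the class name `TinyVM`, the threshold value, and the method name `sor_step` are hypothetical and are not taken from PhoneME Advanced or any other real VM.

```python
class TinyVM:
    """Toy model of counter-based hot-method detection (not a real VM policy)."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold  # hypothetical tuning value
        self.call_counts = {}               # method name -> interpreted invocations
        self.compiled = set()               # methods already "compiled" to native code

    def invoke(self, name):
        """Return which execution path handled this call."""
        if name in self.compiled:
            return "native"
        count = self.call_counts.get(name, 0) + 1
        self.call_counts[name] = count
        if count >= self.hot_threshold:
            # Compilation happens at run time, so its cost would be paid here
            # and added to the program's total running time.
            self.compiled.add(name)
        return "interpreted"


vm = TinyVM(hot_threshold=3)
paths = [vm.invoke("sor_step") for _ in range(5)]
print(paths)
```

In this sketch the first three calls are interpreted and later calls take the compiled path, which mirrors why short-running programs may see little benefit from a JIT.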

Java code is compiled to byte-code and saved as a Java class file. The class file is then interpreted at runtime in the Java VM (JVM). The JVM and the Java class files are defined in the Java Virtual Machine Specification [15].

Java byte-code is compiled using either of two techniques: Ahead-Of-Time (AOT) and Just-In-Time (JIT) [10]. AOT compilation compiles the Java byte-code into a system-dependent binary before the actual execution. It provides a faster start-up time at the cost of flash memory. JIT, on the other hand, compiles the code into native machine code at runtime and in most cases gives better overall performance than AOT. It lacks predictability in performance, however, because when the JIT cannot find the needed code in the memory cache it must start compiling it, which creates an overhead.

The J2ME platform

J2ME primarily targets consumer products with limited resources. It is a collection of technologies and specifications that can be combined to construct a complete Java runtime environment [7]. In the J2ME architecture, configurations and profiles were introduced to address the problems with limited resources on different mobile platforms [16].

A configuration defines the Java language features and the core Java libraries of the JVM. Different device limitations make different configurations necessary. There are two such configurations for J2ME: the Connected Limited Device Configuration (CLDC) and the Connected Device Configuration (CDC). A profile is an extension to the configuration. It is a set of standard APIs that supports a narrower category of devices within the framework of a chosen configuration [17].

The CLDC version of J2ME

The latest version of the Connected Limited Device Configuration (CLDC) cleared by the Java Community Process (JCP) has the release name Java Specification Request (JSR) 139, also called CLDC 1.1 [18]. CLDC is a minimal runtime environment. The CLDC specification defines three things:

- The capabilities of the VM: CLDC does not handle thread groups and lacks dynamic class loading.
- A very small subset of the J2SE 1.3 classes.
- A new API set for input/output called the Generic Connection Framework (GCF).

The CLDC does not define APIs for user interfaces or how applications are loaded and activated on the device.

The CDC version of J2ME

The Connected Device Configuration (CDC) provides a much more conventional Java 2 runtime environment than CLDC. The latest release of CDC is the JSR 218 release [19]. The CDC does not require pre-verification of classes, even though pre-verified classes can be used. The CDC specification defines:

- The capabilities of the VM, which is a full-featured JVM
- A subset of the J2SE classes [20]
- The Generic Connection Framework API
- Support for file and datagram based I/O that uses both the GCF and the ordinary java.io and java.net

The CDC does not define specifications for user interface classes or how applications are loaded and activated on the device.

2.4 The ARM Platform

The ARM processor is a 32-bit Reduced Instruction Set Computer (RISC) microprocessor architecture for embedded use [21]. ARM does not manufacture processors itself but instead sells intellectual property (IP) licenses to manufacturers, who produce the processors. ARM offers a broad range of processors categorized as Application processors, Embedded processors, and SecureCores. The ARM processor families are built on different ARM architecture versions.

2.4.1 The ARM Architecture

The ARM processor architecture provides support for the 32-bit ARM and 16-bit Thumb Instruction Set Architectures (ISAs), along with architecture extensions that provide support for Java acceleration (Jazelle), security (TrustZone), SIMD, and NEON technologies [22].

Figure 2: The ARM architecture extensions for different architecture versions. [23]

Cortex Family

The ARM architecture version for the Cortex family is ARMv7. The Cortex family consists of three series, which all include the Thumb2 instruction set [24]:

- The ARM Cortex-A Series is a family of application processors with support for the ARM, Thumb, Thumb2, and Thumb2EE instruction sets. It also has VFP, Jazelle RCT, NEON, and SMP support.
- The ARM Cortex-R Series is a family of embedded processors for real-time systems. These processors support the ARM, Thumb, and Thumb2 instruction sets.
- The ARM Cortex-M Series is a family of deeply embedded processors. These processors only support the Thumb2 instruction set.

2.4.2 Multi-Processing Core

With a Multi-Processing Core (MPCore), the theoretical maximum performance of an n-processor device is n*100% [25]. An MPCore can also reduce power consumption by up to 85% when all CPUs are in standby mode, compared to when all CPUs are running at the highest capacity [26]. It provides scalability because more CPUs can be added as the workload increases. There are solutions scalable from 1 to 4 CPU cores whose memory system and sub-systems are optimized for multi-processing. Two techniques are used in multiprocessing: Symmetric Multi-Processing (SMP) and Asymmetric Multi-Processing (AMP).

Figure 3: Speed versus power consumption chart of the Cortex-A9 MPCore. [27]

SMP is a load-distributed software architecture, which means that the workload is dynamically distributed across the CPU cores. The processors are identical and are connected to a single shared memory and input/output system.

AMP is similar to SMP, but the processors are not perfectly symmetrical. The different CPUs might run different software or have dedicated input/output such as interrupt signals.

The ARM MPCores support both SMP and AMP and combinations of them. The cache sizes of each processor may be configured independently. The interrupt controller is designed for distribution across multiple cores [28].

On SMP-aware Operating Systems (OS) there is automatic load balancing and distribution across the available cores. This applies to processes, applications, threads, and interrupts. Linux 2.6 and later provides SMP support. There is also automatic power saving, where adaptive power management is used for workload variations. The ARM MPCore power modes are running, dormant, stand-by, and power-off.

One way of measuring performance is with Dhrystone results. Dhrystone measures the average time the processor takes to perform many iterations of a single loop containing a fixed sequence of instructions. The result is referred to as DMIPS or Dhrystone MIPS/MHz [29].

Benefits of Multi-Processing [30]:

- Higher performance
  ARM11 MPCore: 650 DMIPS -> 2600 DMIPS
  Cortex-A9 MPCore: 2000 DMIPS -> 8000 DMIPS
- Less power consumption than for the equivalent performance throughput of a single processor.
- More CPUs that run at a lower frequency, with the ability of individual power-off.
- Additional CPUs can be added or enabled for an on-demand performance increase.
- Scalable system expansion to meet next-generation system requirements.
- Flexible, readily available programming models to suit application requirements.
- Real-time requirements can be isolated from high-performance application deployment.
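The DMIPS figures in the list above are consistent with the n*100% theoretical maximum for four cores. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the four-core count is an assumption read from the 1 to 4 core range stated earlier, not a measured configuration.

```python
def theoretical_max_dmips(single_core_dmips, cores):
    """Upper bound assuming perfect linear scaling (n * 100%)."""
    return single_core_dmips * cores


# Single-core figures quoted in the text; 4 cores is an assumed configuration.
for name, single in (("ARM11 MPCore", 650), ("Cortex-A9 MPCore", 2000)):
    print(name, theoretical_max_dmips(single, 4), "DMIPS")
```

Real workloads rarely reach this bound, since it ignores synchronization, memory contention, and code that cannot run in parallel.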

2.5 ARM platform extensions

ARM's instruction sets are run in different modes/states: ARM, Thumb, ThumbEE, and Jazelle. Some ARM architectures include hardware extension support for Vector Floating Point calculations.

Thumb

Thumb [31] technology can give a 31% code size reduction compared to 32-bit ARM instructions, but at the expense of performance. ARM instructions can perform up to 38% better than Thumb instructions, and the equivalent performance loss for Thumb instructions is therefore 28%, according to ARM [32].

Thumb is a 16-bit instruction set that extends the 32-bit ARM architecture. A processor is operating in Thumb-state when executing Thumb instructions. These instructions are a subset of the most commonly used 32-bit ARM instructions, compressed into 16-bit operation codes. During execution, these instructions are decoded to provide the same functionality as the ARM instructions.
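The relation between the quoted 38% gain and the 28% loss can be checked with a short calculation. This is only a worked example of the arithmetic; the variable names are illustrative.

```python
# If ARM code runs up to 38% faster than Thumb code, Thumb runs at
# 1 / 1.38 of the ARM speed, which corresponds to a loss of about 28%.
arm_gain_over_thumb = 0.38
thumb_relative_speed = 1 / (1 + arm_gain_over_thumb)
thumb_loss = 1 - thumb_relative_speed

print(f"Thumb speed relative to ARM: {thumb_relative_speed:.1%}")
print(f"Thumb performance loss:      {thumb_loss:.1%}")
```

The loss evaluates to roughly 27.5%, which rounds to the 28% quoted from ARM.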

2.5.2 Thumb2

Thumb2 [33] technology can give a 31% code size reduction compared to ARM instructions, and up to 38% better performance than the Thumb instruction set [32].

Thumb2 is a set of 16- and 32-bit instructions that extends the ARM architecture to improve the Thumb instruction set. It provides almost exactly the same functionality as the ARM instruction set. A processor is operating in Thumb-state when executing Thumb2 instructions. Thumb2 consists of the existing 16-bit Thumb instructions and new 16-bit instructions for improved program flow. There are also new 32-bit instructions derived from their ARM instruction equivalents. The new instructions cover co-processor access, privileged instructions, bit-field manipulation, table branches, conditional execution, and special functions like Single Instruction, Multiple Data (SIMD).

Figure 4: Performance versus code density comparison of three instruction sets. [34]

Thumb2EE

The Thumb2 Execution Environment (Thumb2EE) instruction set is intended for dynamically generated code. It helps reduce the size of compiled code and therefore the memory footprint [35]. This also means that recompiled methods can be kept in memory, which results in better performance and almost no start-up delays. The instruction set is based on Thumb2 but has some changes and additions that make it a better target for dynamically generated code techniques like JIT and AOT. It is a set of 16- and 32-bit instructions. A processor is operating in ThumbEE-state when executing Thumb2EE instructions [36].

2.5.3 Jazelle

Jazelle provides hardware acceleration for some of the most commonly used managed execution environments, like Java, and outperforms a software-only interpreter [37]. This is because Jazelle executes a significant amount of Java byte-code in hardware. It extends the processor states with a Jazelle-state. The processor also maintains the Jazelle operand stack. Jazelle allows designers and developers to deliver more features to devices while still maintaining power and performance characteristics.

Figure 5: The different decoding stages. [39]

Jazelle RCT

A compiler that uses the Jazelle Run-time Compilation Target (RCT) can keep the overhead to only 10% when converting from byte-code to 16-bit Thumb2EE instructions [38] and still match the performance of Thumb2. There is almost no increase in size of the compiled code compared to the existing byte-code. In Jazelle RCT mode, also known as Thumb2EE mode [36], some Thumb2 instructions are changed to make compilation more efficient by combining these instructions with byte-code instructions. The processor state in which Jazelle RCT instructions are executed is called ThumbEE-state. Jazelle RCT supports AOT and JIT compilation with Java and other execution environments like the .NET Compact Framework.

The instruction set that Jazelle RCT uses is called Thumb2EE and is a superset of the existing Thumb2 instruction set. There are also instructions for changing between Jazelle RCT mode and Thumb2 mode. Implicit null-pointer tests and fast array range checking improve performance [35]. Jazelle RCT also provides 16-bit instructions for commonly used AOT/JIT compilation routines.

Jazelle DBX

Jazelle Direct Byte-code eXecution (DBX) technology has important benefits when it comes to power consumption and performance compared to co-processor or dedicated processor solutions [39]. Other hardware solutions for accelerating Java execution, like a co-processor or a dedicated processor, would typically require additional silicon footprint and consume extra power to operate. They also require external memory, which means that they do not maximize speed [39].

Jazelle DBX technology introduces a new instruction set, Java byte-code, to the processor [40]. In this state the processor fetches and decodes Java byte-code directly. These Java instructions are pausable, which means that an interrupt can take place in the middle of an executing Java instruction without affecting the interrupt latency. This ensures real-time interrupt performance. Jazelle DBX has the disadvantage that only Java byte-code is supported.

Vector Floating Point extension

The Vector Floating Point (VFP) extension is a coprocessor extension that provides hardware acceleration of single and double precision floating-point arithmetic [41]. VFP increases throughput in graphics and signal-processing applications. The implemented VFP extension follows the IEEE 754 [42] standard for binary floating-point arithmetic. The VFP supports the execution of short vector instructions, allowing Single Instruction, Multiple Data (SIMD) parallelism.

There are different implementation versions of the VFP on the ARM architecture, and all versions need support code to trap exceptions. The only version that can trap floating-point exceptions is VFPv3U. In addition, the VFP coprocessor hardware can use extra registers that describe exceptional conditions that may need to be considered.

Single Instruction, Multiple Data

Single Instruction, Multiple Data (SIMD) parallelism is used for repetitive operations performed on multiple data items [43]. SIMD uses packed vectors of data and, unlike traditional vectors, a SIMD packed vector can be used as an argument for a specific instruction. This instruction is then performed on all the elements in the vector simultaneously. The vector size directly affects the performance, as does the type of instruction performed on the vector. SIMD architectures often use a special set of CPU registers where the parallel processing takes place. Real SIMD computers have a mixture of Single Instruction, Single Data (SISD) and SIMD instructions, which is the case in the ARM implementation.

Neon

Neon is the name of the ARM Advanced SIMD extension. It has a comprehensive instruction set, an independent pipeline, separate register files, and independent execution hardware. It was developed to accelerate the performance of multimedia and signal processing applications for video encode/decode, 3D graphics, and more [44]. NEON supports 8-, 16-, 32-, and 64-bit integer and single-precision floating-point data and operates in SIMD fashion, where it can handle up to 16 operations at the same time.

According to ARM: "Processors that implement the ARMv7-A architecture profile have two options for handling single-precision floating point; VFPv3 and NEON technology. VFPv3 supports full IEEE754 compliant single-precision and double-precision handling completely in hardware. The NEON engine operates on single-precision floating-point numbers only, and its handling of denormalled numbers and NaNs (Not a Number) is not IEEE754 compliant. The NEON engine processing of floating-point numbers is compliant with the standards of most modern programming languages, including C and C++" [45].

In the Cortex-A9 an enhancement called the Media Processing Engine (MPE) has been added to NEON. This extends the floating-point unit (FPU) to provide a quad-MAC and additional 64-bit and 128-bit register sets [46]. NEON is encoded in the ARM and Thumb2 instruction sets, providing high performance with optimized code density.
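The idea of one instruction acting on every element of a packed vector can be modelled in a few lines. The sketch below is a plain software model, not NEON code: the function names are hypothetical, the 128-bit register width is taken from the register sets mentioned above, and wrap-around on overflow is just one possible overflow behaviour chosen for illustration.

```python
def lanes(register_bits, element_bits):
    """Number of elements that fit in one packed register."""
    return register_bits // element_bits


def simd_add(a, b, element_bits=8):
    """Model of a packed add: the same operation is applied to every lane.

    Results wrap around at the element width (modular arithmetic).
    """
    if len(a) != len(b):
        raise ValueError("packed vectors must have the same number of lanes")
    mask = (1 << element_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


print(lanes(128, 8))                      # sixteen 8-bit lanes in a 128-bit register
print(simd_add([1, 2, 250], [10, 20, 10]))
```

With 8-bit elements a 128-bit register holds 16 lanes, which matches the "up to 16 operations at the same time" figure quoted for NEON.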

2.6 Benchmark programs

Grinderbench

Grinderbench is a benchmarking suite that approximates the performance of J2ME. It includes five benchmarking programs [47]:

- Chess: a chess playing engine that performs the logical parts of a chess game but without any graphical output.
- Crypto: calculation of cryptographic algorithms.
- kxml: parsing of an XML document and/or manipulation of a Document Object Model (DOM) tree.
- Parallel: multiple threads running at the same time with thread switching and synchronization.
- PNG: decoding of a PNG image.

The benchmark focuses on CLDC 1.0 but has a MIDP 1.0 wrapper so it can be run on devices with MIDP 1.0. It only uses integer calculations, as floating point calculation is not available on the J2ME edition [48]. The GrinderMark score is calculated as the geometric mean of the five individual benchmark application scores [49].

SciMark 2.0

SciMark 2.0 performs five numerical tests that are common in scientific and engineering applications: Fast Fourier Transform, Jacobi Successive Over-relaxation, Monte Carlo integration, sparse matrix multiply, and dense LU matrix factorization [50]. SciMark 2.0 works with floating point calculations. The composite score presented by SciMark is the average score of the five tests. The scores are presented in MFLOPS.

The Monte Carlo integration

The Monte Carlo integration in SciMark 2.0 approximates pi by integrating the quarter circle y = sqrt(1 - x^2) [50].
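The Monte Carlo approach described above can be sketched as follows. This is an illustrative re-implementation of the general technique, not the SciMark 2.0 source: the function name, the sample count, and the seed are assumptions chosen for the example.

```python
import random


def monte_carlo_pi(samples, seed=113):
    """Estimate pi from the area under the quarter circle y = sqrt(1 - x^2).

    Random points are drawn in the unit square; the fraction that falls
    under the curve approximates pi / 4.
    """
    rng = random.Random(seed)  # fixed seed (arbitrary) for repeatable runs
    under_curve = 0
    for _ in range(samples):
        x = rng.random()
        y = rng.random()
        if x * x + y * y <= 1.0:
            under_curve += 1
    return 4.0 * under_curve / samples


print(monte_carlo_pi(200_000))
```

The kernel is dominated by floating-point multiplications and additions on scalar values, which is why it is a natural candidate for showing the effect of a hardware floating point unit.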

Successive Over Relaxation

The SOR method [51] in SciMark 2.0 uses the Jacobi Iteration [52] to operate on a 100x100 matrix. The algorithm exercises basic "grid averaging" memory patterns, where each A(i,j) is assigned an average weighting of its four nearest neighbors [50].
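The grid averaging pattern can be sketched as below. This is a minimal illustration of successive over-relaxation on a 2D grid rather than the SciMark 2.0 code: the function name, the in-place update order, and the relaxation factor `omega = 1.25` are assumptions made for the example.

```python
def sor_sweeps(grid, omega=1.25, iterations=1):
    """Run SOR sweeps in place over the interior points of a 2D grid.

    Each interior point becomes a weighted mix of the average of its four
    nearest neighbours and its own previous value.
    """
    rows, cols = len(grid), len(grid[0])
    quarter_omega = omega * 0.25
    keep = 1.0 - omega
    for _ in range(iterations):
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                neighbours = (grid[i - 1][j] + grid[i + 1][j]
                              + grid[i][j - 1] + grid[i][j + 1])
                grid[i][j] = quarter_omega * neighbours + keep * grid[i][j]
    return grid


# A 100x100 grid as in the text; the fill values are arbitrary example data.
example = [[float((i * j) % 7) for j in range(100)] for i in range(100)]
sor_sweeps(example, iterations=2)
print(example[50][50])
```

Every update reads four neighbouring values and performs a handful of floating-point operations, so the kernel stresses both memory access and floating-point throughput.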

3 Methods

To achieve the main goal of this project, the work was planned and organized as described in this chapter.

3.1 Work plan

The work was organized in three phases:

1. Find documents and literature regarding the subject and evaluate their usefulness for this project.
2. Gain familiarity with the target platform and software.
3. Run tests with benchmark programs to obtain measurements for the different hardware acceleration techniques. Compare scores and draw conclusions.

The work was documented continuously throughout the project.

3.2 System setup

The system setup was a target platform based on the ARM architecture ARMv7-A. The kernel running on the platform was based on Linux and was run as pure as possible. No applications other than the VM were run on the target platform at the same time. A PC and PC software such as PuTTY [53] were used to communicate with the target platform and the benchmark programs. The target platform and the PC were connected via the serial port.

The Virtual Machine

The VM used in this project was an open source version of Java 2 ME CDC called PhoneME Advanced [54], which has been developed with the mobile phone in mind. The VM was compiled with the open source compiler GCC [55] together with a tool library for cross-compiling for the ARM/Linux platform setup. PhoneME uses GNU Make [56] for building and compiling the VM. This makes it easy to build and compile the VM and to change build flags through the make files. The make file used for this project was GNUmakefile, located in the folder phoneme_advanced_mr2/cdc/build/linux-arm-generic. This file contains the different compiler flags for the hardware acceleration techniques. For this project it was also altered to include options for Thumb2.

Compiler flags

To compile for the hardware acceleration techniques, different flags must be set to enable or disable the techniques. The compiler flags used are presented in Appendix B.3 Compiler flags. More information on GCC's ARM flags can be found at gcc.gnu.org [57]. How to set the debug and VM flags for this VM is described in the CDC Build System Guide [12].

3.3 Running tests and obtaining results

The tests were done according to this plan:

- Compile the VMs with the hardware flags set as shown in Appendix B.4 Build options.
- Boot up Linux on the target platform and run the benchmark programs using six different scripts (Appendix C.2 Scripts for running tests):
  1. ss.sh, running one instance of SciMark 2.0
  2. ds.sh, running two instances of SciMark 2.0 in parallel
  3. ts.sh, running three instances of SciMark 2.0 in parallel
  4. sg.sh, running one instance of Grinderbench
  5. dg.sh, running two instances of Grinderbench in parallel
  6. tg.sh, running three instances of Grinderbench in parallel
  Each script was run twice for each VM.
- Collect the test results and evaluate them.
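The pattern of running one, two, or three benchmark instances in parallel can be sketched as follows. This is not the content of the scripts in Appendix C.2, which are not reproduced here. The function name is hypothetical, and the command used is a stand-in that only prints a line; on the target, the command would be whatever launches the VM with the benchmark.

```python
import subprocess
import sys


def run_in_parallel(command, instances):
    """Start `instances` copies of `command` at once and wait for all of them.

    Returns a list of (exit_code, captured_output) tuples, one per instance.
    """
    procs = [
        subprocess.Popen(command, stdout=subprocess.PIPE, text=True)
        for _ in range(instances)
    ]
    results = []
    for proc in procs:
        output, _ = proc.communicate()
        results.append((proc.returncode, output.strip()))
    return results


# Placeholder command (assumption): stands in for the real VM invocation.
placeholder = [sys.executable, "-c", "print('benchmark finished')"]
for exit_code, output in run_in_parallel(placeholder, 3):
    print(exit_code, output)
```

Starting all instances before waiting on any of them is what makes the runs overlap, which is the property the two- and three-instance scripts rely on to exercise SMP.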

3.3.1 Test result evaluation methods

The benchmark programs SciMark 2.0 and GrinderBench are well known and are used in the mobile industry to test the performance of mobile units. Both programs present their results as scores, and these scores are sufficiently accurate for this project. The results can also be compared with other hardware platforms and software VMs on the market that have been benchmarked with SciMark 2.0 and GrinderBench.

The benchmark scores were processed and compared using Excel and diagrams. Only the composite score from SciMark 2.0 and the GrinderMark from GrinderBench were used for comparison, to reduce the influence of measurement faults and of score fluctuations in the individual tests inside the benchmark programs. The results are shown as percentages relative to a reference VM, whose performance is defined as 100%.

To measure flash footprint, a Java program [Appendix C.1 Flash footprint comparison program] was written that compares the file sizes of the object files against each other.

CPU load and RAM load were checked with the Linux command top. The output from top was piped to a data file and then sent through a filter [Appendix C.3 Data filter program] to extract the useful data. This data was thereafter processed in Excel.
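The percentage presentation described above can be sketched in a few lines of Java. This is only an illustration of the arithmetic, not the Excel processing used in the project; the class name RelativeScore, its methods and the VM labels are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: expresses benchmark scores relative to a reference VM.
final class RelativeScore {

    // Returns the score as a percentage of the reference score (reference = 100%).
    static double percentOfReference(double score, double referenceScore) {
        if (referenceScore <= 0.0) {
            throw new IllegalArgumentException("reference score must be positive");
        }
        return 100.0 * score / referenceScore;
    }

    // Normalizes every score in the map against the entry named referenceVm.
    static Map<String, Double> normalize(Map<String, Double> scores, String referenceVm) {
        Double reference = scores.get(referenceVm);
        if (reference == null) {
            throw new IllegalArgumentException("unknown reference VM: " + referenceVm);
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            result.put(entry.getKey(), percentOfReference(entry.getValue(), reference));
        }
        return result;
    }
}
```

A reference VM always maps to 100, so a VM that scores 50 against a reference score of 40 is reported as 125%.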

4 Results and discussion

The results presented in this chapter were obtained with the method described in chapter three. This chapter also presents the project's discussion of the findings and results.

4.1 Hardware acceleration techniques comparison

To compare the hardware acceleration techniques against each other, the tests were arranged in three major test cases: JIT_OFF, JIT_ON and JIT_HW_FP. This makes it possible to see how the hardware acceleration techniques performed with and without the influence of the software technique JIT.

4.1.1 JIT_OFF

In this test case the VMs were built without JIT. The build flags of the different VMs are found in Appendix B.4 VM#: 1-6. This makes it possible to compare how the hardware acceleration techniques performed when the Java byte-code is processed in interpreted mode. The reference VM (Cortex A9) was built with no JIT and no hardware acceleration techniques [Appendix B.4 VM#: 1].

Diagram 1: Comparison of techniques with JIT_OFF

The results show a performance boost, especially in SciMark, when the NEON or VFPv3 techniques are used. This is because SciMark's internal tests are based on floating point calculations, which NEON and VFPv3 are designed to speed up.

4.1.2 JIT_ON

This test case compares the hardware acceleration techniques against each other under the influence of a JIT. The build flags of the different VMs are found in Appendix B.4. The reference VM was compiled with no hardware acceleration and with JIT [Appendix B.4 VM#: 7].

Diagram 2: Comparison of techniques with JIT_ON

Under the influence of JIT the hardware acceleration techniques do not have the same impact. This is because the JIT does not use any of the instruction sets provided by the hardware techniques and therefore cannot benefit from them.


4.1.3 JIT_HW_FP

Here the VMs are built with the JIT_HW_FP option. This option enables the JIT to use floating point instructions when compiling Java byte-code to native instructions. The build flags of the different VMs are found in Appendix B.4. The reference VM was compiled with no hardware acceleration technique but with JIT_HW_FP [Appendix B.4 VM#: 13].

Diagram 3: Comparison of techniques with JIT_HW_FP

In this test the JIT can use the floating point instruction set, but as seen here the VM itself does not gain much from the different techniques. The differences are too small to say whether the techniques really had any impact on performance.

Hardware acceleration techniques discussion

Testing the hardware acceleration techniques against each other under the test cases JIT_OFF, JIT_ON and JIT_HW_FP shows interesting results. When no JIT is used, the performance effect of each hardware technique is clearly visible. When JIT is enabled, the hardware acceleration techniques do not have any significant impact on performance. The boosts and drops of 1-3% cannot be used to evaluate the techniques themselves, as +/- 5% can be considered within the error margin because of background threads in the Java environment [58]. The same is valid when JIT_HW_FP is used.
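The error margin argument can be made explicit with a small check. The sketch below assumes scores already expressed as a percentage of the reference VM; the class NoiseBand and its method are hypothetical names, and the 5% threshold is the margin stated in the text.

```java
// Hypothetical helper: decides whether a relative score differs from the
// reference VM by more than the error margin mentioned in the text.
final class NoiseBand {

    // +/- 5% is treated as measurement noise (background threads in the Java environment).
    static final double ERROR_MARGIN_PERCENT = 5.0;

    // percentOfReference is the score relative to the reference VM, where 100 means equal.
    static boolean isSignificant(double percentOfReference) {
        return Math.abs(percentOfReference - 100.0) > ERROR_MARGIN_PERCENT;
    }
}
```

With this rule a result of 103% or 97% is treated as equal to the reference, while a result of 150% counts as a real difference.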

We found that the SciMark2 tests did not call any of the functions in the VM that were affected by the techniques. The mathematical operations used in the tests were mostly addition, subtraction, multiplication and division. The routines that implement these operations are located in the libgcc.a library. As we did not compile this library with the different techniques, the impact on performance would probably be larger if the library had also been built with them. That is why we looked further into the VFPv3 case in the chapter on Vector Floating Point and added one call that we knew was affected when compiling the VM with VFPv3 instructions. We also found it interesting to compare the three major test cases against each other, to see whether the hardware acceleration technique for floating point calculations gives any gain in performance when JIT is enabled. This is done in the next chapter.

JIT_OFF vs. JIT_ON vs. JIT_HW_FP

This test case compares JIT and JIT_HW_FP with the case when JIT is not used, for each hardware acceleration technique. The reference VMs are compiled for each hardware technique without JIT [Appendix B.4 VM#: 1-6].

Diagram 4: Comparison of JIT_OFF, JIT_ON and JIT_HW_FP

The diagram shows an enormous boost when the JIT can use floating point instructions.

JIT discussion

Even though JIT is a software acceleration technique, we wanted to check whether its impact on performance would be larger than that of interpreted mode combined with hardware acceleration techniques. The diagram clearly shows that JIT alone boosts performance far more than any single hardware technique in interpreted mode. When the JIT works with floating point instructions, an even greater performance boost can be achieved.

In SciMark 2.0, performance with JIT_ON was about three to four times higher than with JIT_OFF. With JIT_HW_FP, however, performance was nearly 20 times higher than without JIT. This is probably because the JIT can produce code that runs on the floating point hardware.

In the GrinderBench test there was no noticeable change in performance between JIT_ON and JIT_HW_FP. This is probably because, as noted earlier, GrinderBench does not use floating point computations and therefore cannot benefit from the floating point hardware.

4.2 SMP support

To test SMP support, the scores were measured when running one, two or three instances of a benchmark program at the same time. The scores for the reference VMs are those obtained when running one instance of the benchmark programs without SMP support on the target platform. The test runs were done via the scripts shown in Appendix C.2. The target platform used in this project has a dual-core ARM Cortex-A9.

Diagram 5: SMP comparison on SciMark2

Diagram 5 shows a clear performance drop when more than one instance is run on a single processor (NO_SMP). When SMP support is enabled, the performance for two instances is almost the same as for one instance. When running two or three instances with SMP support, the performance is almost 100% better than when running on a single core.

Diagram 6: SMP comparison on GrinderBench

Similar results for SMP support can be observed when GrinderBench is used. There is a 15% performance gain when running one instance of GrinderBench with SMP support. This is due to the parallel test inside GrinderBench, which takes advantage of the SMP support.

SMP discussion

The results confirmed our expectations for SMP support: using more than one core gives a performance boost. With two instances of a benchmark program running at the same time, each instance gives the same score as one instance running on one processor. On this target platform SMP support is limited to two cores, but we can draw the conclusion that the more cores are available, the better the unit will handle multiple tasks.
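The idea behind the SMP test, several identical workloads started at the same time, can be sketched as follows. Note the assumptions: the project started separate benchmark instances from shell scripts (Appendix C.2), whereas this sketch uses threads inside one JVM, and the kernel is a hypothetical stand-in for a benchmark, not SciMark or GrinderBench code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelInstances {

    // Hypothetical stand-in for one benchmark instance: a small floating point loop.
    static double kernel(int iterations) {
        double sum = 0.0;
        for (int i = 1; i <= iterations; i++) {
            sum += 1.0 / ((double) i * i);
        }
        return sum;
    }

    // Runs `instances` copies of the kernel at the same time and returns
    // the wall-clock time in nanoseconds that each copy needed.
    static List<Long> runInstances(int instances, final int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(instances);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (int k = 0; k < instances; k++) {
                Callable<Long> task = () -> {
                    long start = System.nanoTime();
                    double result = kernel(iterations);
                    long elapsed = System.nanoTime() - start;
                    // Use the result so the loop cannot be optimized away.
                    if (Double.isNaN(result)) {
                        throw new IllegalStateException("unexpected NaN");
                    }
                    return elapsed;
                };
                pending.add(pool.submit(task));
            }
            List<Long> times = new ArrayList<>();
            for (Future<Long> future : pending) {
                times.add(future.get());
            }
            return times;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

On a single core the per-instance time would be expected to grow roughly with the number of instances, while on two cores two instances should each take about as long as one instance alone, which matches the pattern reported for diagrams 5 and 6.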


4.3 Instruction set comparison and performance

4.3.1 Thumb2

Compiling the VM with the Thumb2 instruction set introduced problems with the SWP instruction [Chapter: 4.2.1.1]. This instruction is an atomic read-modify-write operation that is not supported by the Thumb2 instruction set. To solve this problem, some files were excluded from Thumb2 compilation. The excluded files are found in the variable UNTHUMBABLE in Appendix B.

Some files became larger when using the Thumb2 instruction set. This is because in some cases a couple of Thumb2 instructions must be combined to get the same functionality as one ARM instruction. The average increase in size of those files is about 1.5%. The files are found in Appendix B.

There was no significant increase or decrease in performance when comparing the VM with [Appendix B.4 VM#:2] or without [Appendix B.4 VM#:1] the Thumb2 option. The decrease in size of the overall VM was about 3% with JIT_ON and about 2.5% with JIT_OFF. The average decrease in size of the files that got smaller is about 11.7%, and those files are found in Appendix B.

The diagram below shows the difference in size of the VM compiled with the different techniques. The reference VM is compiled with no acceleration techniques in both the JIT_OFF and the JIT_ON case.

Diagram 7: CVM size comparison with and without Thumb2

SWP instruction

SWP is an atomic instruction. An atomic instruction always runs to completion without being disturbed by another operation. It is often used to implement semaphores, which guarantee that no other process interferes while a protected operation is executing. The SWP instruction can be replaced with other instructions, but it is hard to do so with a good thread-safe solution.

The problem with the SWP instruction first appeared when the VM was compiled with the Thumb2 instruction set, because Thumb2 does not support the SWP instruction. Therefore we had to exclude some C files from using Thumb2.

The second time SWP caused problems was when we were running the VM with JIT support, because the JIT was producing SWP instructions. To solve this problem, a workaround was used that enables the target platform to accept these instructions. One problem with this solution is that the SWP instruction locks out both the processor and the memory while it is executing. This can have a negative impact on performance, especially when running in multi-core mode.

Thumb2 discussion

We observed that the overall size of the VM did not become significantly smaller. That is because most of the content of the C files that are compiled and linked together is just C structs and other pure data. The C structs and pure data do not contain any instructions and can therefore not be compressed with another instruction set like Thumb2.

Jazelle discussion

We searched the internet for how to enable this option on the reference VM. The search criterion was that the solution needed to be license free in order to be valid. No such results were found. We then widened the search criteria to include other VMs implementing Jazelle and found that phoneME Feature MR4 included Jazelle RCT [59]. When downloading this VM we could confirm that there was a library containing implementations for Jazelle RCT. However, the only search results for Jazelle RCT stated that the open source VM phoneME Advanced does not support Jazelle RCT, and the support for Jazelle DBX is under license, which meant that we had no opportunity to test the Jazelle feature in this project.
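The role of an atomic swap such as the one described in the SWP instruction section can be illustrated with a small spin lock. This is a Java analogy rather than VM code: it uses java.util.concurrent.atomic.AtomicBoolean.getAndSet as the atomic exchange, and the class name SwapLock and the counting helper are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical spin lock built on an atomic exchange, the same primitive
// that a swap instruction provides in hardware.
final class SwapLock {

    private final AtomicBoolean held = new AtomicBoolean(false);

    // Atomically writes "true" and reads the old value; the lock is
    // acquired only by the thread that saw "false" as the old value.
    void lock() {
        while (held.getAndSet(true)) {
            Thread.yield();
        }
    }

    void unlock() {
        held.set(false);
    }

    // Increments a shared counter from several threads under the lock.
    static int countWithLock(int threads, final int incrementsPerThread) {
        final SwapLock lock = new SwapLock();
        final int[] counter = {0};
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    lock.lock();
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter[0];
    }
}
```

Because the read of the old value and the write of the new value happen as one indivisible step, two threads can never both observe the lock as free, which is why a replacement built from separate load and store instructions is hard to make thread-safe.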

4.3.3 Vector Floating Point

Further work was done on the VFP option. Based on the SciMark scores, we decided to run additional tests on a modified version of the SOR method in SciMark 2.0. Because compiling the VM with VFP gave only poor performance gains compared to no VFP, we decided that a SciMark 2.0 method had to be modified to really test whether the VFP instructions would give any performance gain.

As suspected earlier, the original SciMark 2.0 contains no relevant code segments that exercise native functions affected by compiling the VM with the VFP flag. In the original SciMark2 methods, most calls to mathematical functions go to the library libgcc.a, whose mathematical functions are, in our case, NOT affected by the VFP flag. To solve this problem, a small piece of code [Appendix A.1] in the SOR method was modified to call a VM native mathematical function that is affected by the VFP flag. This function is the Java mathematical method arcsine. It is traced to its VM native counterpart by following the steps in the sequence diagram presented in the next chapter.

4.3.3.1 Tracing the mathematical Java method arcsine


4.3.3.2 Modified SciMark2 SOR method with Java method arcsine

In diagram 8 it is clearly visible that there is a slight increase in performance in the original version of the SOR method when JIT is turned off. In the cases with JIT_ON and JIT_HW_FP there is no significant increase or decrease in performance.

Diagram 8: Performance comparison of the modified SOR methods. The pure VM with JIT_OFF [Appendix B.4 VM#: 1], JIT_ON [Appendix B.4 VM#: 7] and JIT_HW_FP [Appendix B.4 VM#: 13] is the base, compared with the VM compiled with VFP with JIT_OFF [Appendix B.4 VM#: 4], JIT_ON [Appendix B.4 VM#: 10] and JIT_HW_FP [Appendix B.4 VM#: 16].

The difference in the SciMark2 result is not big when comparing the VM without [Appendix B.4 VM#:1] and with [Appendix B.4 VM#:4] VFP when JIT is turned off. When VFP is turned off the VM uses software-emulated floating point computation, which may be why there is no big improvement in performance. It may also be that there is so much other code, e.g. overhead, when running in interpreted mode that the overall impact on performance is negligible. Compiling the VM with VFP gives only a 3% performance increase compared to compiling it without, even though there are big differences when comparing the e_asin.o file.
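The kind of modification described in the Vector Floating Point chapter can be sketched as follows. The loop follows the successive over-relaxation update used by the SciMark 2.0 SOR kernel, but the placement of the Math.asin call, its constant argument and the class name SorWithAsin are assumptions made for illustration; the actual change is the one listed in Appendix A.1.

```java
// Sketch of an SOR kernel with an added arcsine call. The position and
// argument of the Math.asin call are hypothetical, not taken from Appendix A.1.
final class SorWithAsin {

    // Relaxes the interior of grid g in place and returns the accumulated
    // arcsine values, so that the extra call cannot be optimized away.
    static double execute(double omega, double[][] g, int iterations) {
        int rows = g.length;
        int cols = g[0].length;
        double omegaOverFour = omega * 0.25;
        double oneMinusOmega = 1.0 - omega;
        double asinSum = 0.0;

        for (int p = 0; p < iterations; p++) {
            for (int i = 1; i < rows - 1; i++) {
                double[] gi = g[i];
                double[] gim1 = g[i - 1];
                double[] gip1 = g[i + 1];
                for (int j = 1; j < cols - 1; j++) {
                    gi[j] = omegaOverFour * (gim1[j] + gip1[j] + gi[j - 1] + gi[j + 1])
                            + oneMinusOmega * gi[j];
                    // Added call: on the VM in this report, arcsine ends up in the
                    // native e_asin code that is affected by the VFP flag.
                    asinSum += Math.asin(0.5);
                }
            }
        }
        return asinSum;
    }
}
```

Each interior grid point then costs one arcsine call per iteration in addition to the original additions and multiplications, which makes the benchmark sensitive to how the VM's native arcsine code was compiled.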

When JIT was turned on there was a 50% performance increase between compiling the VM with [Appendix B.4 VM#:10] and without [Appendix B.4 VM#:7] VFP. This is probably because the JIT keeps the compiled methods in memory while executing and can call them much faster and with less overhead, so that the performance impact of the code from e_asin.o becomes much larger.

Code comparison from e_asin.o with and without VFP

The difference between the two e_asin.o files when compiled with [Appendix B.4 VM#:4] or without [Appendix B.4 VM#:1] VFP is:

e_asin.o    A9        VFP
Rows
Size        8803kB    3664kB

A small example of corresponding code from e_asin.o, with [Appendix B.4 VM#:4] and without [Appendix B.4 VM#:1] VFP:

A9 [Appendix B.4 VM#:1], row 214 in e_asin.o:

    .L7:
        ldrd    r4, [sp, #16]
        mov     r2, r0
        mov     r3, r1
        mov     r0, r4
        mov     r1, r5
        bl      __aeabi_dmul    @ call the subroutine __aeabi_dmul [61] [Appendix A.2.3]
        mov     r2, r0
        mov     r3, r1
        mov     r0, r4
        mov     r1, r5
        bl      __aeabi_dadd    @ call the subroutine __aeabi_dadd [61]
        mov     r4, r0
        mov     r5, r1
        b       .L4

VFP [Appendix B.4 VM#:4], row 94 in e_asin.o:

    .L7:
        fldd    d0, [sp, #0]
        fmacd   d0, d0, d7
        b       .L4

The differences in the assembly code above when VFP is used are clearly visible. The ARM assembler syntax is described in the ARM Developer Suite Assembler Guide [60].

JIT_ON and JIT_HW_FP code comparison

When JIT_HW_FP [Appendix B.4 VM#:13] is enabled on the VM, the JIT can also produce floating point instructions, e.g. faddd to add or fmuld to multiply two registers [Appendix A.2.1] [60]. When using JIT_ON [Appendix B.4 VM#:7], the JIT must call the native methods CVMCCMruntimeDAdd to add and CVMCCMruntimeDMul to multiply [Appendix A.2.2] [61]. These native methods are usually much larger than the corresponding floating point instructions.

The native method that is called to multiply is in this case __aeabi_dmul [Appendix A.2.3]. This method is in the file _arm_muldivdf3.o, which is located in the libgcc.a library. To get the code from this file, the library was first unpacked and then the file _arm_muldivdf3.o was disassembled. The disassembled file shows that the __aeabi_dmul [Appendix A.2.3] method is 155 rows long and contains several loops.

The code shown in Appendix A.2.2 and in Appendix A.2.3 is the equivalent, with [Appendix B.4 VM#:13] and without [Appendix B.4 VM#:7] hardware support for floating point operations, of the statement if (x*x + y*y <= 1.0) in the MonteCarlo Integrate [Appendix A.2] file. The pure VM with JIT_HW_FP performs about 100% better than the pure VM with JIT_ON.

VFP discussion

The relevant finding is that there is a performance increase when the executed methods are actually affected by the VFP option. This is the case for JIT_ON with the VFP VM and for JIT_HW_FP with the VFP VM, and they give the best performance increase compared to the pure VM case.

Neon discussion

There is no gain in using NEON over VFP when running GrinderBench and SciMark 2.0. This can probably be explained by the fact that NEON is a SIMD extension [44] for floating point calculations and that the benchmark programs do not contain any test that can benefit from the NEON acceleration.
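For context, the statement if (x*x + y*y <= 1.0) discussed in the JIT_ON and JIT_HW_FP code comparison sits inside a Monte Carlo integration loop of the following shape. This is a sketch in the spirit of the SciMark 2.0 MonteCarlo kernel, not the code from Appendix A.2: it uses java.util.Random with a fixed seed as an assumption, and the class name MonteCarloSketch is hypothetical.

```java
import java.util.Random;

// Sketch of a Monte Carlo estimate of pi: the fraction of random points in
// the unit square that fall inside the quarter circle, times four.
final class MonteCarloSketch {

    static double integrate(int samples, long seed) {
        Random random = new Random(seed);
        int underCurve = 0;
        for (int i = 0; i < samples; i++) {
            double x = random.nextDouble();
            double y = random.nextDouble();
            // Two double multiplications and one double addition per sample:
            // these are the operations that become either floating point
            // instructions (JIT_HW_FP) or calls to runtime helpers (JIT_ON).
            if (x * x + y * y <= 1.0) {
                underCurve++;
            }
        }
        return 4.0 * underCurve / samples;
    }
}
```

Since the loop body consists almost entirely of double arithmetic, the cost of each multiplication and addition dominates the run time, which is consistent with the large difference reported between JIT_ON and JIT_HW_FP.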
