Thread-Level Parallel Execution in Co-designed Virtual Machines


Thomas S. Hall, Kenneth B. Kent
Faculty of Computer Science, University of New Brunswick, Fredericton, New Brunswick, Canada
Email: c849m@unb.ca, ken@unb.ca

Abstract: Virtual machine technology is becoming more important as the use of heterogeneous computer networks becomes more widespread. However, virtual machines have a major drawback: the run-time performance of an application running on a virtual machine is significantly below that of the same application running as a native executable on a given platform. Previous work shows that a hardware/software co-designed virtual machine can provide a performance improvement for single-threaded applications. This paper describes research work to further improve the performance of the co-designed virtual machine by adding thread-level parallel execution. The design put forward adds the functionality to support independent scheduling of threads in the hardware and software partitions of the co-designed virtual machine. A prototype of the design, based on the Java Virtual Machine and utilizing software simulation, has been constructed and tested. The results of this testing show that the design is feasible provided sufficient communication bandwidth is available between the hardware and software partitions.

I. INTRODUCTION

Virtual machines are increasingly important in today's heterogeneous computing environments since they provide a means for a single version of an application to execute on many different computing platforms. The most common is the Java Virtual Machine [1] along with the Java programming language [2] developed by Sun Microsystems. The major drawback of this type of computing environment is the slow run-time performance of applications due to the added layers of software between the application code and the hardware upon which it is running [3]. As described in Section II, significant research effort has been, and continues to be, applied in an attempt to improve the run-time performance of virtual machines.

This paper describes the design, prototyping and testing of an extension to the hardware/software co-designed virtual machine originally devised by Kent et al. [4]-[9] that adds thread-level parallelism. This extension allows multi-threaded applications to have their threads execute in parallel on both the hardware and software partitions of the co-designed virtual machine. The prototype of the parallel co-designed virtual machine design in this paper is an implementation of the Java Virtual Machine. This does not restrict the generality of the design presented; instead, it shows ways to implement the design and solve some of the issues that arose during implementation.

The terms hardware partition and hardware execution engine are synonymous and refer to the hardware portion of the co-designed virtual machine. Similarly, the terms software partition and software execution engine refer to the software portion of the co-designed virtual machine. An assumption used throughout this work is that hardware execution is faster than software execution for the same block of application code.

II. RELATED WORK

Virtual machines are software applications that implement processor architectures or operating system simulations. Implementations of the same virtual machine can occur on multiple hardware and operating system platforms (e.g. the Java Virtual Machine running on an Intel chip-set with the Linux or Microsoft Windows operating systems, or on the IBM AS/400).
This permits an application program that uses a given virtual machine to run on any platform that supports that virtual machine. The use of virtual machine-based applications has increased dramatically in recent years as the demand for platform independence has risen. This rising demand is due to two factors: the Internet and its e-business potential, and the cost of rewriting applications when a new platform is introduced into a computing environment.

Applications that run on a virtual machine have additional layers of software to pass through before reaching hardware for execution, leading to significant performance issues. This performance degradation arises from the need to translate (interpret or compile) the instruction set of the virtual machine to that of the host system. As the instruction sets of virtual machines become more complex, the time required for translation grows, further degrading performance. Attempts to improve virtual machine performance have utilized both software techniques (e.g. just-in-time compilation [10]) and hardware techniques.

El-Kharashi et al. have designed an extension to a RISC processor that specifically deals with the execution of Java bytecodes [11]-[13]. Their research showed that the simpler bytecodes are the ones most frequently used by applications, and these are the bytecodes included in their RISC chip extension. The remaining bytecodes are executed by a standard software virtual machine.

Another approach to custom hardware design is the creation of a complete custom processor to support the virtual machine instruction set (e.g. the picoJava processor [14]). This approach provides native execution of the virtual machine instruction set but results in reduced performance for applications written in traditional programming languages that do not utilize the virtual machine.

Other research projects involved the distribution of virtual machine threads across multiple virtual machine instances running on multiple host systems. One group added an additional system thread to each virtual machine to send application threads to other systems [15]. Each system runs a daemon that monitors for incoming thread execution requests and starts a virtual machine instance when necessary. From the viewpoint of the current work, this research showed that independent execution of Java application threads is possible with the proper communications between the processing environments. It also showed that the Java Application Programming Interface adequately supports parallel execution of multi-threaded programs. Another research group created a new runtime environment for their target virtual machine that modified the application code during the load process to include support for inter-platform communications [16].

Hardware/software co-design techniques (a combination of software and hardware engineering) have seen use in the embedded systems field for years [17]. Two different approaches to virtual machine design using these techniques have been reported. Lattanzi et al. devised a virtual machine that uses Java methods converted to optimized configurations for Field Programmable Gate Array devices [18]. This conversion can be done either during application loading or off-line. The work of Kent et al. is described in the next section. These two hardware/software co-designed virtual machines show that a combination of hardware and software designed to operate together as a single virtual machine is feasible [4]-[9], [18].

III. SINGLE THREAD CO-DESIGNED VIRTUAL MACHINE

Kent et al. developed a hardware/software co-designed virtual machine that uses a standard desktop computer and a Field Programmable Gate Array device [4], [7], [8]. In their design, the hardware partition is a bytecode interpreter. Tests have shown that the performance of this virtual machine is better than a software-only implementation of the same virtual machine.

Unlike traditional hardware/software co-designed systems, the co-designed virtual machine uses overlapping hardware and software partition functionality. It consists of a fully functional software virtual machine running on a standard desktop computer combined with a configurable (FPGA) hardware execution engine running a subset of the virtual machine functionality. The instructions supported by the hardware execution engine are the simpler ones that do not require access to host system resources or use complex memory addressing modes. The partitioning details can be found in [4]. There is a reasonable correspondence between the instructions supported by the hardware execution engine and those found by El-Kharashi et al. to be the most frequently used [11], [12]. This distribution of functionality results in an overlap in support and ultimately an opportunity to decide both the location and the ordering of application thread execution.
That is, from the perspective of the virtual machine, the host system processor and the hardware execution engine provide a multi-processor environment upon which the virtual machine can run multiple threads in parallel. This potential is explored by the work described in this paper.

The original co-designed virtual machine description proposed three different ways for the hardware execution engine to directly access the memory of the software partition of the virtual machine. The most restrictive of these allows direct access only for looking up constant values. With this type of access, the hardware execution engine receives a copy of the data and instructions that it requires prior to beginning execution of a block of code. The block of instructions passed to the hardware execution engine must be a complete execution unit (e.g. a function or method) to ensure that the hardware execution engine receives all of the instructions it needs, even though the entire unit may not be necessary. The two less restrictive modes allow increasing levels of access to the software partition memory while increasing the complexity of the hardware execution engine by requiring it to support more of the virtual machine instruction set and more complex communications. While the less restrictive modes allow more flexibility in data access, the entire code block is still required by the hardware execution engine.

As part of the application loading process, the co-designed virtual machine tags the beginning and end of blocks of code that are capable of being executed in hardware [5], [6]. Whenever the software execution engine detects a switch-to-hardware tag, the currently executing thread undergoes a context switch to the hardware execution engine. Execution then continues in hardware until the first occurrence of a switch-to-software tag.

IV. PARALLEL CO-DESIGNED VIRTUAL MACHINE DESIGN

Fig. 1. Simplified block diagram of the original co-designed virtual machine.

Figure 1 shows the structure of the existing hardware/software co-designed virtual machine [4]. As an application executes in the virtual machine environment, part of the software execution engine monitors the stream of instructions and operands coming from the instruction source of the currently executing thread for switch-to-hardware tags. Upon detection of a tag, the software execution engine redirects the executing thread to the hardware execution engine. The software scheduler does not suspend the thread; instead, it simply blocks on the call to the hardware context switcher. For single-threaded applications, this design can show measurable performance improvement over the same application running in software only [4].

Fig. 2. Simplified block diagram of the parallel co-designed virtual machine.

Figure 2 shows the extended design of the virtual machine. This design uses thread suspension rather than blocking during attempts to access the hardware execution engine. This is possible because of the addition of the execution synchronizer module, which includes a separate hardware scheduler. A thread adds itself to a queue within the hardware scheduler before suspending itself in software. This allows threads that require software execution to continue without interference from other threads trying to set the hardware access lock of the original design.

Fig. 3. Block diagram of the execution synchronizer.

The execution synchronizer consists of two parts, as shown in Figure 3. The first part is the switch-to-hardware tag detector. This part monitors the flow of instructions to the software execution engine and redirects an application thread to the hardware scheduler upon finding a tag. The hardware scheduler controls the dispatching of threads to the hardware execution engine while monitoring the state of the software scheduler. Whenever there are no threads available for execution in the software execution engine and the hardware scheduler has at least one thread waiting for the hardware execution engine, the hardware scheduler returns the thread at the front of the hardware queue to software. A thread returned in this manner will be either the one with the highest priority on the queue or the one that has been on the queue the longest.

Since more than one application thread may attempt to gain access to the hardware execution engine at once, the hardware scheduler uses a queue ordered by application thread priority to control the order in which threads are sent to the hardware execution engine. Threads of equal priority proceed through the queue in first-in-first-out (FIFO) order. The integrity of this queue could be compromised if more than one thread attempts to modify the queue concurrently. To avoid this situation, changes to the information stored in the queue are made within critical sections of code. These critical sections are kept as small as possible to minimize their impact on overall virtual machine performance.

When a thread is removed from the queue for dispatching to the hardware execution engine, the complete code, local data and operand stack of the current code block of that thread are copied to the memory of the hardware execution engine. In addition, the start addresses of each of these components within the hardware partition memory are placed in fixed locations within the hardware partition's memory, along with the program counter and stack pointer for that code block. See the next section for a detailed description of the communications between the hardware and software partitions and the layout of the hardware partition memory.
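The paper does not give code for the hardware scheduler's queue, but the discipline described above (ordering by application thread priority, FIFO among equal priorities, with queue updates confined to small critical sections) can be sketched in C as follows. All structure and function names are hypothetical, and a POSIX mutex stands in for whatever locking primitive an implementation would use; the actual prototype was built with Microsoft Visual C++.

    #include <pthread.h>
    #include <stdlib.h>

    /* One application thread waiting for the hardware execution engine
       (hypothetical record layout). */
    typedef struct hw_queue_node {
        int                   thread_id;   /* identifies the suspended application thread */
        int                   priority;    /* larger value = higher priority              */
        struct hw_queue_node *next;
    } hw_queue_node;

    typedef struct {
        hw_queue_node  *head;   /* kept ordered: priority descending, FIFO within a priority */
        pthread_mutex_t lock;   /* guards the small critical sections below                  */
    } hw_scheduler_queue;

    void hwq_init(hw_scheduler_queue *q) {
        q->head = NULL;
        pthread_mutex_init(&q->lock, NULL);
    }

    /* Called by a thread that has reached a switch-to-hardware tag,
       just before it suspends itself in software. */
    void hwq_enqueue(hw_scheduler_queue *q, int thread_id, int priority) {
        hw_queue_node *n = malloc(sizeof *n);
        n->thread_id = thread_id;
        n->priority  = priority;

        pthread_mutex_lock(&q->lock);                 /* critical section kept short       */
        hw_queue_node **p = &q->head;
        while (*p != NULL && (*p)->priority >= priority)
            p = &(*p)->next;                          /* equal priority falls behind: FIFO */
        n->next = *p;
        *p = n;
        pthread_mutex_unlock(&q->lock);
    }

    /* Called by the hardware scheduler; removes the highest-priority,
       longest-waiting thread.  Returns 0 when the queue is empty. */
    int hwq_dequeue(hw_scheduler_queue *q, int *thread_id) {
        pthread_mutex_lock(&q->lock);
        hw_queue_node *n = q->head;
        if (n != NULL) {
            q->head    = n->next;
            *thread_id = n->thread_id;
        }
        pthread_mutex_unlock(&q->lock);
        if (n == NULL)
            return 0;
        free(n);
        return 1;
    }

Because insertion walks past all entries of greater or equal priority, equal-priority threads naturally leave the queue in arrival order, matching the FIFO behaviour described above.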
Once a thread has completed execution on the hardware execution engine, the hardware scheduler copies the local data and operand stack back from the hardware partition memory, along with the updated program counter and stack pointer. The appropriate portions of the thread's memory in the software partition receive these data items. The thread is then allowed to continue execution under the control of the software (operating system) scheduler.

While the hardware scheduler attempts to keep both the hardware and software execution engines busy, priority is given to the hardware execution engine. This choice of priority follows from the previously stated assumption that hardware execution is, in general, faster than software execution.

V. HARDWARE/SOFTWARE COMMUNICATIONS

There are numerous ways for the hardware and software portions of the co-designed virtual machine to communicate with one another. The choice of a specific technology has a direct effect on the performance of the virtual machine because of the differences in the time required by various technologies to transfer a given block of data. In the co-designed virtual machine, the effects of the communication technology are most noticeable during hardware/software context switching.

The communication technology used is implementation dependent. The original co-designed virtual machine utilizes a field programmable gate array (FPGA) mounted on a 64-bit, 66 MHz PCI card along with a block of memory. The extended design described in this paper targets the same technology in order to minimize the effect of non-functional changes when comparing the two designs.

Mapping the PCI board-mounted memory into the host system address space allows the software part of the virtual machine to write instructions and data for the hardware execution engine directly to its memory. Once the hardware execution engine has completed execution, the software part of the virtual machine reads the results back into its own memory space from the hardware execution engine memory. The portions of this on-board memory allocated for data and instructions vary based upon the size of the data and instructions passed to the hardware execution engine. As a result, a small part of the on-board memory contains the locations of the instructions and data so that the hardware execution engine always knows where the various items are. Another small part of this memory stores the control signals between the hardware execution engine and the software. Figure 4 shows the on-board memory allocations.

Fig. 4. Memory allocation in the hardware partition: code block (variable size), data block (local variables and stack, variable size), address block, and control signal, accessed over the PCI bus.

The control signal value acts as a synchronization flag between the hardware and software components. The software partition writes to the memory when the control flag indicates that the hardware is idle and reads from it when the flag signals that the hardware has completed execution. The hardware execution engine reads from and writes to its memory only when the control signal flag indicates that the software has loaded instructions and data for execution.

As mentioned in Section III, there are three software virtual machine memory access modes for the hardware portion of the co-designed virtual machine. The parallel extension to this design uses both the most and the least restrictive of these modes, depending on the method used to insert context switch tags during the application load process.

For the Java Virtual Machine, the code block contains the bytecodes and their operands and the data block contains the local variables and stack for the Java method. The address block contains the starting addresses of the code block and of the local variable and stack sections of the data block. In addition, the address block contains the address of the first bytecode to be executed by the hardware execution engine (the program counter) and of the top element of the stack in the hardware execution engine's memory. The address block also contains the address of the start of the constant pool in the software virtual machine memory. Upon completion of its processing, the hardware execution engine sets the next-instruction and stack-top addresses in the address block so that software execution can resume at the appropriate instruction and stack item. The constant pool for the Java class containing the method that caused the switch to hardware remains in the software virtual machine's memory. The hardware execution engine uses Direct Memory Access (DMA) techniques to read constant pool entries directly from the host system's memory. The hardware execution engine never writes to the constant pool.
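A minimal sketch of the memory layout and handshake described above is given below, assuming the on-board memory has already been mapped into the host address space. The structure layout, field names and numeric flag encodings are hypothetical; the flag state names follow those used in the simulator pseudo-code of Fig. 5, and the real layout and PCI mapping calls are implementation specific.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical encodings for the control flag states named in Fig. 5. */
    enum { HW_IDLE = 0, SW_LOAD_COMPLETE = 1, HW_DONE = 2, SW_RTV_COMPLETE = 3 };

    /* Hypothetical address block: fixed locations the hardware engine consults. */
    typedef struct {
        uint32_t code_start;        /* offset of the bytecode block                        */
        uint32_t locals_start;      /* offset of the local variable area                   */
        uint32_t stack_start;       /* offset of the operand stack area                    */
        uint32_t pc;                /* first bytecode to execute / next one on return      */
        uint32_t stack_top;         /* top-of-stack location, updated on return            */
        uint32_t const_pool_addr;   /* constant pool address in host memory (DMA, read-only) */
    } address_block;

    /* View of the memory-mapped on-board memory. */
    typedef struct {
        volatile uint32_t control;  /* synchronization flag                                 */
        address_block     addr;     /* fixed-location address block                         */
        uint8_t           blocks[]; /* variable-size code block followed by the data block  */
    } hw_partition_mem;

    /* Software side: dispatch a code block only when the hardware is idle. */
    int dispatch_to_hardware(hw_partition_mem *mem,
                             const uint8_t *code, uint32_t code_len,
                             const uint8_t *locals, uint32_t locals_len,
                             const uint8_t *stack, uint32_t stack_len,
                             uint32_t pc, uint32_t stack_top, uint32_t const_pool_addr) {
        if (mem->control != HW_IDLE)
            return 0;                              /* hardware busy; thread stays queued */

        uint32_t off = 0;
        memcpy(mem->blocks + off, code, code_len);
        mem->addr.code_start = off;       off += code_len;
        memcpy(mem->blocks + off, locals, locals_len);
        mem->addr.locals_start = off;     off += locals_len;
        memcpy(mem->blocks + off, stack, stack_len);
        mem->addr.stack_start = off;

        mem->addr.pc              = pc;
        mem->addr.stack_top       = stack_top;
        mem->addr.const_pool_addr = const_pool_addr;

        mem->control = SW_LOAD_COMPLETE;           /* hand over to the hardware engine */
        return 1;
    }

    /* Software side: copy results back once the hardware signals completion. */
    int retrieve_from_hardware(hw_partition_mem *mem,
                               uint8_t *locals, uint32_t locals_len,
                               uint8_t *stack, uint32_t stack_len,
                               uint32_t *pc, uint32_t *stack_top) {
        if (mem->control != HW_DONE)
            return 0;                              /* results not ready yet */

        memcpy(locals, mem->blocks + mem->addr.locals_start, locals_len);
        memcpy(stack,  mem->blocks + mem->addr.stack_start,  stack_len);
        *pc        = mem->addr.pc;                 /* resume point for software execution */
        *stack_top = mem->addr.stack_top;

        mem->control = SW_RTV_COMPLETE;            /* lets the hardware reset to idle */
        return 1;
    }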
VI. HARDWARE EXECUTION ENGINE SIMULATION

Simulation of hardware devices provides a means of evaluating and debugging the hardware design using software. The parallel version of the co-designed virtual machine uses a modified version of the simulator used in evaluating the original design. The modifications consist of additional functionality to execute the simulator in its own virtual machine system thread rather than as part of the software execution engine thread. Figure 5 shows the additional functionality required for the extended design.

    FUNCTION Simulator-Control
        Initialize-Hardware-Simulation
        control-signal = HW-IDLE
        DO forever
            IF control-signal = SW-LOAD-COMPLETE
                Simulate-Hardware
                control-signal = HW-DONE
            ELSE IF control-signal = SW-RTV-COMPLETE
                Reset-Hardware
                control-signal = HW-IDLE
            END IF
        END DO
    END FUNCTION

Fig. 5. Pseudo-code of the hardware simulator control module.

The manner in which hardware signals are stored by the simulator has also been changed to allow for multiple concurrent simulator instances.

VII. DATA INTEGRITY

A key issue in the design of any software or hardware system, including virtual machines, is the integrity of the data that it manipulates. The Java Language Specification [2] and the Java Virtual Machine Specification [1] define a set of rules and guidelines for the low-level protection of data. Any virtual machine that claims to be compliant with these specifications must implement these rules and guidelines. These specifications also include recommendations on the use of some Java language constructs that allow programmers to explicitly synchronize various parts of multi-threaded applications to directly protect data.

The software part of the co-designed virtual machine implements all of the guidelines and rules set out in the Java specifications and supports the synchronization programming constructs. As a result, it provides the level of data integrity required by the Java specifications. The subset of the Java instruction set supported by the hardware partition of the co-designed virtual machine provides access to the local variables and stack of the current method only. It does not support method invocation or return, the synchronization language constructs, or object access. Therefore, all of these operations must execute in the software partition of the virtual machine. Thus, the co-designed virtual machine provides data protection as laid out in the Java specifications.

VIII. PROTOTYPE IMPLEMENTATION AND EVALUATION

A prototype of the parallel co-designed virtual machine was created using software simulation of the hardware partition. This prototype was built in C using the Microsoft Visual C++ tools and executed on a 2.4 GHz workstation running Microsoft Windows XP Professional.

Testing of the parallel co-designed virtual machine consisted of multiple executions of the SPEC JVM98 benchmarks (in particular, RayTrace) [19] and custom-written test programs (a Fibonacci number generator, an n-queens problem solver and a Mandelbrot fractal program). These test programs all operate in both single- and multiple-thread modes. Testing scenarios included executing the various test programs on versions of the parallel co-designed virtual machine with one, two and four instances of the hardware simulator, as well as on the original co-designed virtual machine and a software-only virtual machine. A discussion of the test results appears in the next section. Table I shows the test scenarios utilized; a Y indicates that the scenario was used, an N that it was not.

In order to evaluate performance during the trial executions of the test programs, all of the virtual machine versions used included functionality to provide timing, hardware cycle counts and hardware partition memory usage at various stages of execution. This data was recorded and analyzed later. The multi-threaded nature of the system meant that a metric output queueing feature was required in the test versions of the virtual machines to avoid data loss or out-of-order results.

IX. EXPERIMENTAL RESULTS

Functional correctness of the parallel version of the co-designed virtual machine was demonstrated using some of the SPEC JVM98 benchmarks [19]. These benchmark programs target single- and multi-threaded operation on single-processor or symmetric multi-processor systems. The single-thread-only benchmarks were not used; while they could demonstrate the functional correctness of the parallel co-designed virtual machine in some respects, they are, by design, not suitable for parallel execution. The parallel virtual machine proved to be functionally correct based on these benchmarks.
This was the expected result, since the software partition contains a standard Java interpreter (albeit slightly modified to support the context switch tags) and the hardware execution engine has the same computational components as the original, extensively tested co-designed virtual machine.

TABLE I
TESTS PERFORMED IN EVALUATING THE PARALLEL CO-DESIGNED VIRTUAL MACHINE

    Prog/VM         SW    Orig   Par 1   Par 2   Par 4
    Fibonacci 1     Y     Y      Y       N       N
    Fibonacci 2     Y     N      Y       Y       N
    Fibonacci 10    Y     N      Y       Y       Y
    Fibonacci 100   Y     N      Y       Y       Y
    RayTrace 1      Y     Y      Y       N       N
    RayTrace 2      Y     N      Y       Y       N
    RayTrace 10     Y     N      Y       Y       Y
    Queens 1        Y     Y      Y       N       N
    Queens 4        Y     N      Y       Y       Y
    Mandelbrot 1    Y     Y      Y       N       N
    Mandelbrot 2    Y     N      Y       Y       N

SW - Software Virtual Machine; Orig - Original Co-designed Virtual Machine; Par - Parallel Co-designed Virtual Machine, where the number after Par is the number of hardware execution engine instances. The number after each program name is the number of threads used in that test.

The evaluation of the performance of the parallel co-designed virtual machine is more difficult than establishing its basic functional correctness. The benchmark and custom-written programs described in Section VIII were all used in this phase of the evaluation. Since the prototype testing was done on a single-processor computer, the test results reflect concurrent execution of multiple threads within a single operating system process rather than the hardware partition executing on a separate device.

Fig. 6. Plot of a two-thread trial execution of the Fibonacci program. The horizontal axis represents time and the vertical axis represents execution mode: Low - in software partition, Middle - on hardware queue, High - in hardware partition.

Figure 6 shows a plot of the execution of the Fibonacci program with two application threads, both computing fib(19). These two threads have no data interdependence. The plot shows the two threads switching between the hardware and software partitions in an interleaved manner. Figure 7 shows an expanded view of part of the same trial run shown in Figure 6. This view shows that the two threads alternate between hardware and software with significant delays while waiting on the hardware queue. These delays can be explained by two factors: real queue waiting time while the hardware is busy, and the host system scheduling other threads within the virtual machine's process as well as other system processes. While these delays could manifest themselves as delays in other virtual machine threads, the impact on the hardware queue is more noticeable since a system thread context switch must occur in the prototype's simulated environment. In these two figures, and in the one for RayTrace (Figure 8), the sloped lines between the states represent the amount of time required to make the transition from one state to another.

TABLE II
TEST RESULTS FOR TWO-THREAD FIBONACCI AND RAYTRACE TRIALS

    Metric                                  Fibonacci   RayTrace
    Context Switches                        40          3579
    Simulator Invocations                   37          3520
    Average Execution Time (µs)             14892       2857
    Average Dispatch Time (µs)              28          27
    Average Simulation Time (µs)            1402        256
    Average Cycles                          920         136
    Avg Host Cyc for SW Exec of HW Block    53598       5217273
    Average Retrieval Time (µs)             17          17
    Average Queue Wait (µs)                 743128      382713
    Average Data Dispatched (bytes)         96          378
    Average Data Retrieved (bytes)          28          36

TABLE III
PCI COMMUNICATION TIME ESTIMATES FOR THE TWO-THREADED FIBONACCI AND RAYTRACE PROGRAMS

    Direction               Fibonacci   RayTrace
    Hardware to software    32 µs       41 µs
    Software to hardware    108 µs      426 µs

Fig. 7. Expanded view of part of the full trial run shown in Figure 6.

The RayTrace benchmark program exhibits data interdependence amongst its threads. As a result, there is little overlap in the execution of its threads. The expanded view of part of the two threads' execution in Figure 8 shows some of the portion of the trial where they did overlap. A full plot of the trial is not shown since the threads perform so many context switches that they appear as two partially overlapping solid rectangles. This program does not make effective use of the multi-processor capabilities of the parallel co-designed virtual machine.

Fig. 8. Expanded plot of a trial two-thread RayTrace run.

Computed results for the two-threaded trials of both the Fibonacci and RayTrace programs are shown in Table II. These results are based on measurements taken when the parallel co-designed virtual machine was the only application running on the host system. However, no attempt was made to normalize the raw data or results to compensate for the existence of operating system services, since these will exist in any normal computing environment. The data in Table II show that, for both programs, not every attempt to send a thread to hardware succeeded (e.g. for the RayTrace program, 3520 of the 3579 attempts succeeded). The unsuccessful attempts are a result of the software scheduler being idle, in which case the threads were returned to software. The average amount of data transferred between the hardware and software partitions can be used to compute the communication requirements for these two programs; other programs will have their own requirements.
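As an illustration of that computation, using only the Table II per-invocation averages (and leaving aside the small fixed-size control data described next), the total application data moved across the PCI bus during the two-thread trials is roughly:

    Fibonacci:    37 simulator invocations × (96 + 28) bytes  ≈ 4.6 kB
    RayTrace:   3520 simulator invocations × (378 + 36) bytes ≈ 1.5 MB

Even for RayTrace this is a small total relative to the theoretical peak of a 64-bit, 66 MHz PCI bus (on the order of 500 MB/s), which is consistent with the bandwidth conclusion drawn in Section X.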
As discussed previously, in addition to the actual program data some control data is also passed between the two partitions: 8 bytes from the hardware partition to the software partition and 20 bytes in the other direction. The target hardware device communicates with the host system over a 64-bit, 66 MHz PCI bus, and it is known from previous work that 8760 host system clock cycles are required to transfer a 32-bit word using a 32-bit, 33 MHz PCI bus [7]. By extrapolating from the 32-bit PCI bus requirements, an estimate of the time required to transfer data can be found. Table III shows these time estimates.

A prototype of the hardware partition of the original co-designed virtual machine indicates that it will operate with a clock rate of approximately 25 MHz on the target reconfigurable device [20]. Based on this estimate, Table IV provides an estimate of the total time required to send a block of code and data to hardware, execute the necessary instructions there and retrieve the results.

TABLE IV
TOTAL HARDWARE EXECUTION TIME INCLUDING DISPATCH, RETRIEVAL, COMMUNICATION AND EXECUTION TIMES

    Time (µs)                Fibonacci   RayTrace
    Dispatch to hardware     28          27
    Total communications     140         476
    Hardware execution       37          5
    Retrieve from hardware   16          17
    Total                    221         525

TABLE V
COMPARISON OF EXECUTION TIMES

    Time (µs)                               Fibonacci   RayTrace
    Average software block execution        14892       2857
    Avg SW execution of a HW code block     23          2173
    Estimated hardware execution            221         525
    Maximum number of threads               67          5

Table V shows the execution times measured during the testing of the parallel co-designed virtual machine. A comparison of the execution times for hardware-capable blocks of code in both hardware and software shows the potential performance increase or decrease when executing in single-thread mode. For parallel execution, dividing the software execution time by the hardware execution time gives an estimate of the maximum number of threads that an application should use to obtain its maximum possible performance gain, as shown in the last row of Table V. The communication time between the hardware and software partitions of the parallel co-designed virtual machine is the major factor in moving a thread from one partition to another (see Table IV and the sloped portions of the plot in Figure 6). This is consistent with the findings of El-Araby et al. [21].

X. CONCLUSIONS

The concept of virtual machine design using thread-level parallelism and hardware/software co-design is sound, as shown by this research. The parallel version of the co-designed virtual machine is functionally correct, as shown by the SPEC JVM98 benchmark tests. The PCI bus bandwidth requirement is small enough that multiple threads can be executed by the virtual machine even if the bus is shared, although a tighter coupling between the partitions would provide better overall performance. The hardware execution engine runs blocks of code in fewer cycles than the software virtual machine can. As with any parallel system, the applications that run on it need to be designed for parallel execution; for example, the RayTrace benchmark does not exhibit good parallel behavior while the Fibonacci number generator, by design, does. Future work on this research will include improving the design of the execution synchronizer and replacing the hardware simulation with the actual hardware execution engine device [20].

REFERENCES

[1] T. Lindholm and F. Yellin, The Java Virtual Machine Specification, 2nd ed. Addison-Wesley Publishing Company, 1999.
[2] Sun Microsystems Inc. (2000) Java language specification, second edition. [Online]. Available: www.java.sun.com
[3] J. Meyer and T. Downing, Java Virtual Machine. O'Reilly & Associates, Inc., 1997.
[4] K. B. Kent, "The co-design of virtual machines using reconfigurable hardware," Ph.D. dissertation, University of Victoria, 2003.
[5] K. B. Kent, "Branch sensitive context switching between partitions in a hardware/software co-design of the Java virtual machine," in IEEE Pacific Rim Conference on Computers, Communications and Signal Processing (PACRIM) 2003, Victoria, Canada, Aug. 28-30, 2003, pp. 642-645.
[6] K. B. Kent and M. Serra, "Context switching in a hardware/software co-design of the Java virtual machine," in Designer's Forum of Design, Automation & Test in Europe (DATE) 2002, Paris, France, Mar. 4-8, 2002, pp. 81-86.
[7] K. B. Kent and M. Serra, "Reconfigurable architecture requirements for co-designed virtual machines," in 10th Reconfigurable Architectures Workshop (RAW) 2003, part of the 17th International Parallel & Distributed Processing Symposium (IPDPS), Nice, France, Apr. 22, 2003.
[8] K. B. Kent and M. Serra, "Hardware architecture for Java in a hardware/software co-design of the virtual machine," in Euromicro Symposium on Digital System Design (DSD) 2002, Dortmund, Germany, Sept. 4-6, 2002.
[9] K. B. Kent and M. Serra, "Hardware/software co-design of a Java virtual machine," in Proceedings of the IEEE International Workshop on Rapid Systems Prototyping (RSP) 2000, Paris, France, June 2000, pp. 66-71.
[10] J. L. Schilling, "The simplest heuristics may be the best in Java JIT compilers," ACM SIGPLAN Notices, vol. 38, no. 2, pp. 36-46, Feb. 2003.
[11] M. W. El-Kharashi, F. ElGuibaly, and K. F. Li, "A quantitative study for Java microprocessor architectural requirements. Part I: Instruction set design," Microprocessors and Microsystems, vol. 24, no. 5, pp. 225-236, Aug. 2000.
[12] M. W. El-Kharashi, F. ElGuibaly, and K. F. Li, "A quantitative study for Java microprocessor architectural requirements. Part II: High-level language support," Microprocessors and Microsystems, vol. 24, no. 5, pp. 237-250, Aug. 2000.
[13] M. W. El-Kharashi, F. ElGuibaly, K. F. Li, and F. Zhang, "The JAFARDD processor: A Java architecture based on a folding algorithm, with reservation stations, dynamic translation, and dual processing," IEEE Transactions on Consumer Electronics, vol. 48, no. 4, pp. 1004-1015, Nov. 2002.
[14] H. McGhan and M. O'Connor, "picoJava: A direct execution engine for Java bytecode," Computer, vol. 31, no. 10, pp. 22-30, Oct. 1998.
[15] K. B. Kent, J. C. Muzio, and G. C. Shoja, "Remote transparent execution of Java threads," in Proceedings of the High Performance Computing Symposium (HPC 2001), Seattle, WA, Apr. 2001, pp. 184-191.
[16] M. Factor, A. Schuster, and K. Shagin, "A distributed runtime for Java: Yesterday and today," in Proceedings of the 18th International Parallel and Distributed Processing Symposium, Apr. 26-30, 2004, pp. 159-165.
[17] G. De Micheli, "Hardware/software co-design: Application domains and design technologies," in Proceedings of the NATO Advanced Study Institute on Hardware/Software Co-Design. Tremezzo, Italy: Kluwer Academic Publishers, June 19-30, 1995, pp. 1-28.
[18] E. Lattanzi, A. Gayasen, M. Kandemir, V. Narayanan, L. Benini, and A. Bogliolo, "Improving Java performance using dynamic method migration on FPGAs," in Proceedings of the 18th International Parallel and Distributed Processing Symposium, Apr. 26-30, 2004, pp. 134-141.
[19] SPEC. (1997, Nov.) SPEC JVM98 benchmarks. [Online]. Available: www.spec.org/osg/jvm98
[20] H. Ma, "An implementation of the hardware partition in a software/hardware co-designed Java virtual machine," Master's thesis, University of New Brunswick, 2004.
[21] E. El-Araby, M. Taher, K. Gaj, T. El-Ghazawi, D. Caliga, and N. Alexandridis, "System-level parallelism and throughput optimization in designing reconfigurable computing applications," in Proceedings of the 18th International Parallel and Distributed Processing Symposium, Apr. 26-30, 2004, pp. 136-141.