Thread-Level Parallel Execution in Co-designed Virtual Machines

Thomas S. Hall, Kenneth B. Kent
Faculty of Computer Science, University of New Brunswick
Fredericton, New Brunswick, Canada

Abstract

Virtual machine technology is becoming more important as the use of heterogeneous computer networks has become more widespread. However, virtual machines have a major drawback: the run-time performance of an application running on a virtual machine is significantly below that of the same application running as a native executable on a given platform. Previous work shows that a hardware/software co-designed virtual machine can provide a performance improvement for single-threaded applications. This paper describes research work to further improve the performance of the co-designed virtual machine by adding thread-level parallel execution. The design put forward adds the functionality to support independent scheduling of threads in the hardware and software partitions of the co-designed virtual machine. A prototype of the design, based on the Java Virtual Machine and utilizing software simulation, has been constructed and tested. The results of this testing show that the design is feasible provided sufficient communication bandwidth is available between the hardware and software partitions.

I. INTRODUCTION

Virtual machines are increasingly important in today's heterogeneous computing environments since they provide a means for a single version of an application to execute on many different computing platforms. The most common is the Java Virtual Machine [1] along with the Java programming language [2] developed by Sun Microsystems. The major drawback of this type of computing environment is the slow run-time performance of applications due to the added layers of software between the application code and the hardware upon which it is running [3].
As described in Section II, significant research effort has been, and continues to be, applied in an attempt to improve the run-time performance of virtual machines. This paper describes the design, prototyping and testing of an extension that adds thread-level parallelism to the hardware/software co-designed virtual machine originally devised by Kent et al. [4]-[9]. This extension allows the threads of a multi-threaded application to execute in parallel on both the hardware and software partitions of the co-designed virtual machine.

The prototype of the parallel co-designed virtual machine design in this paper is an implementation of the Java Virtual Machine. This does not restrict the generality of the design presented; instead, it shows ways to implement the design and solve some of the issues that arose during implementation.

The terms hardware partition and hardware execution engine are synonymous and refer to the hardware portion of the co-designed virtual machine. Similarly, the terms software partition and software execution engine refer to the software portion of the co-designed virtual machine. An assumption used throughout this work is that hardware execution is faster than software execution for the same block of application code.

II. RELATED WORK

Virtual machines are software applications that implement processor architectures or operating system simulations. Implementations of the same virtual machine can occur on multiple hardware and operating system platforms (e.g. the Java Virtual Machine running on an Intel chip-set with the Linux or Microsoft Windows operating systems or on the IBM AS/400). This permits an application program that uses a given virtual machine to run on any platform that supports that virtual machine. The use of virtual machine-based applications has increased dramatically in recent years as the demand for platform independence has risen.
This rising demand is due to two factors: the Internet and its e-business potential, and the cost of rewriting applications because of the introduction of a new platform within some computing environment.

Applications that run on a virtual machine have additional layers of software to pass through before reaching hardware for execution, leading to significant performance issues. This performance degradation arises from the need to translate (interpret or compile) the instruction set of the virtual machine to that of the host system. As the instruction sets of virtual machines become more complex, the time required for translation gets longer, thus further degrading performance.

Attempts to improve virtual machine performance have utilized both software (e.g. just-in-time compilation [10]) and hardware techniques. El-Kharashi et al. have designed an extension to a RISC processor that specifically deals with the execution of Java bytecodes [11]-[13]. Their research showed that the simpler bytecodes are the ones most frequently used by applications. These are the bytecodes that they have included in their RISC chip extension. The remaining bytecodes are executed by a standard software virtual machine.

Another approach to custom hardware designs is the creation of a complete custom processor to support the virtual machine instruction set (e.g. the picoJava processor [14]). This approach provides native execution for the virtual machine instruction set but results in reduced performance for applications written in traditional programming languages that do not utilize the virtual machine.

Other research projects involved the distribution of virtual machine threads across multiple virtual machine instances running on multiple host systems. One group added an additional system thread to each virtual machine to send application threads to other systems [15]. Each system has a daemon running that monitors for incoming thread execution requests and starts a virtual machine instance when necessary. From the viewpoint of the current work, this research showed that independent execution of Java application threads is possible with the proper communications between the processing environments. This work also showed that the Java Applications Programming Interface adequately supports parallel execution of multi-threaded programs. Another research group created a new runtime environment for their target virtual machine that modified the application code during the load process to include support for inter-platform communications [16].

Hardware/software co-design techniques (a combination of software and hardware engineering) have seen use in the embedded systems field for years [17]. Two different approaches to virtual machine design using these techniques have been reported. Lattanzi et al. devised a virtual machine that uses Java methods converted to optimized configurations for Field Programmable Gate Array devices [18]. This conversion can be done either during application loading or off-line. The work of Kent et al. is described in the next section.
These two hardware/software co-designed virtual machines show that a combination of hardware and software designed to operate together as a single virtual machine is feasible [4]-[9], [18].

III. SINGLE THREAD CO-DESIGNED VIRTUAL MACHINE

Kent et al. developed a hardware/software co-designed virtual machine design that uses a standard desktop computer and a Field Programmable Gate Array (FPGA) device [4], [7], [8]. In their design, the hardware partition is a bytecode interpreter. Tests have shown that the performance of this virtual machine is better than that of a software-only implementation of the same virtual machine.

Unlike traditional hardware/software co-designed systems, the co-designed virtual machine uses overlapping hardware and software partition functionality. It consists of a fully functional software virtual machine running on a standard desktop computer combined with a configurable (FPGA) hardware execution engine running a subset of the virtual machine functionality. The instructions supported by the hardware execution engine are the simpler ones that do not require access to host system resources or use complex memory addressing modes. The partitioning details can be found in [4]. There is a reasonable correspondence between the instructions that are supported by the hardware execution engine and those found by El-Kharashi et al. to be the most frequently used [11], [12].

This distribution of functionality results in an overlap in support and ultimately an opportunity to decide both the location and the ordering of application thread execution. That is, from the perspective of the virtual machine, the host system processor and the hardware execution engine provide a multi-processor environment upon which the virtual machine can run multiple threads in parallel. This potential is explored by the work described in this paper.
The original co-designed virtual machine description proposed three different ways for the hardware execution engine to directly access the memory of the software partition of the virtual machine. The most restrictive of these allows direct access only for looking up constant values. With this type of access, the hardware execution engine receives a copy of the data and instructions that it requires prior to beginning execution of a block of code. The block of instructions passed to the hardware execution engine must be a complete execution unit (e.g. function or method) to ensure that the hardware execution engine receives all of the instructions it needs, even though the entire unit may not be necessary. The two less restrictive modes allow increasing levels of access to the software partition memory while increasing the complexity of the hardware execution engine by requiring it to support more of the virtual machine instruction set and more complex communications. While the less restrictive modes allow more flexibility in data access, the entire code block is still required by the hardware execution engine.

As part of the application loading process, the co-designed virtual machine tags the beginning and end of blocks of code that are capable of being executed in hardware [5], [6]. Whenever the software execution engine detects a switch-to-hardware tag, the currently executing thread undergoes a context switch to the hardware execution engine. Execution then continues in hardware until the first occurrence of a switch-to-software tag.

IV. PARALLEL CO-DESIGNED VIRTUAL MACHINE DESIGN

Fig. 1. Simplified block diagram of the original co-designed virtual machine.

Figure 1 shows the structure of the existing hardware/software co-designed virtual machine [4]. As an application executes in the virtual machine environment, part of the

software execution engine monitors the stream of instructions and operands coming from the instruction source of the currently executing thread for switch-to-hardware tags. Upon detection of a tag, the software execution engine redirects the executing thread to the hardware execution engine. The software scheduler does not suspend the thread; instead, the thread simply blocks on the call to the hardware context switcher. For single-threaded applications, this design can show measurable performance improvement over the same application running in software only [4].

Fig. 2. Simplified block diagram of the parallel co-designed virtual machine.

Figure 2 shows the extended design of the virtual machine. This design uses thread suspension rather than blocking during attempts to access the hardware execution engine. This is possible because of the addition of the execution synchronizer module, which includes a separate hardware scheduler. A thread adds itself to a queue within the hardware scheduler before suspending itself in software. This allows threads that require software execution to continue without interference from other threads trying to set the hardware access lock of the original design.

Fig. 3. Block diagram of the execution synchronizer.

The execution synchronizer consists of two parts, as shown in Figure 3. The first part is the switch-to-hardware tag detector. This part monitors the flow of instructions to the software execution engine and redirects an application thread to the hardware scheduler upon finding a tag. The second part, the hardware scheduler, controls the dispatching of threads to the hardware execution engine while monitoring the state of the software scheduler.
Whenever there are no threads available for execution in the software execution engine and the hardware scheduler has at least one thread waiting for the hardware execution engine, the hardware scheduler returns the thread at the front of the hardware queue to software. A thread returned in this manner is the one with the highest priority on the queue or, among threads of equal priority, the one that has been on the queue the longest.

Since there may be more than one application thread attempting to gain access to the hardware execution engine at once, the hardware scheduler uses a queue ordered by application thread priority to control the order in which threads are sent to the hardware execution engine. Threads that have equal priority proceed through the queue in first-in-first-out (FIFO) order. The integrity of this queue could be compromised if more than one thread attempts to change the queue concurrently. To avoid this situation, changes to the information stored in the queue are made within critical sections of code. These critical sections are made as small as possible to minimize their impact on overall virtual machine performance.

When a thread is removed from the queue for dispatching to the hardware execution engine, the complete code, local data and operand stack of the current code block of that thread are copied to the memory of the hardware execution engine. In addition, the start addresses of each of these components are placed in fixed locations within the hardware partition's memory along with the program counter and stack pointer for that code block. See the next section for a detailed description of the communications between the hardware and software partitions and the layout of the hardware partition memory.
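The queue discipline described above (priority order, FIFO among equal priorities, short critical sections) can be illustrated with a minimal Python sketch. It is not the prototype's code: the class name HardwareQueue, the method names, and the convention that a larger number means a higher priority are all assumptions made for illustration.

```python
import heapq
import itertools
import threading


class HardwareQueue:
    """Hypothetical priority queue of threads waiting for the hardware engine."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # arrival counter: FIFO tie-breaker
        self._lock = threading.Lock()      # guards the short critical sections

    def enqueue(self, thread_id, priority):
        # Assumption: a larger number means a higher priority.
        with self._lock:
            # heapq is a min-heap, so the priority is negated.
            heapq.heappush(self._heap, (-priority, next(self._arrival), thread_id))

    def dequeue(self):
        """Return the next thread to dispatch, or None if the queue is empty."""
        with self._lock:
            if not self._heap:
                return None
            return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    q = HardwareQueue()
    for tid, prio in [("a", 5), ("b", 10), ("c", 5), ("d", 10)]:
        q.enqueue(tid, prio)
    print([q.dequeue() for _ in range(4)])  # ['b', 'd', 'a', 'c']
```

Only the heap operations sit inside the lock, which mirrors the stated goal of keeping the critical sections as small as possible.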
Once a thread has completed execution on the hardware execution engine, the hardware scheduler copies the local data and operand stack back from the hardware partition memory along with the updated program counter and stack pointer. The appropriate portions of the thread's memory in the software partition receive these data items. The thread is then allowed to continue execution under the control of the software (operating system) scheduler.

While the hardware scheduler attempts to keep both the hardware and software execution engines busy, priority is given to the hardware execution engine. This choice of priority follows from the previously stated assumption that hardware execution is, in general, faster than software execution.

V. HARDWARE/SOFTWARE COMMUNICATIONS

There are numerous ways for the hardware and software portions of the co-designed virtual machine to communicate with one another. The choice of a specific technology will have a direct effect on the performance of the virtual machine. This is due to the differences in time required by various technologies to transfer a given block of data. In the co-designed virtual machine, the effects of communication technologies are noticeable during hardware/software context switching.

The communication technology used is implementation dependent. The original co-designed virtual machine utilizes a field programmable gate array (FPGA) mounted on a 64-bit, 66 MHz PCI card along with a block of memory. The extended design described in this paper targets the same technology in order to minimize the effect of non-functional changes when performing comparisons of the two designs.

Mapping the PCI board-mounted memory into the host system address space allows the software part of the virtual machine to write instructions and data for the hardware execution engine directly to its memory. Once the hardware execution engine has completed execution, the software part of the virtual machine reads the results back into its own memory space from the hardware execution engine memory. The portions of this on-board memory allocated for data and instructions vary based upon the size of the data and instructions passed to the hardware execution engine. As a result, a small part of the on-board memory contains the locations of the instructions and data so that the hardware execution engine always knows where the various items are. Another small part of this memory stores the control signals between the hardware execution engine and the software. Figure 4 shows the on-board memory allocations.

Fig. 4. Memory allocation in the hardware partition. Regions shown: code block (variable size), data block (local variables and stack, variable size), address block, control signal.

The control signal value acts as a synchronization flag between the hardware and software components. The software partition writes to the memory when the control flag indicates that the hardware is idle and reads from it when the flag signals that the hardware has completed execution. The hardware execution engine reads from and writes to its memory only when the control signal flag indicates that the software has loaded instructions and data for execution.
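The idea of variable-size code and data blocks located through a small fixed-position address block can be sketched in Python as follows. The field order, field widths, flag encodings and the function names load_board_memory and read_blocks are hypothetical and chosen only for illustration; the actual board layout is the one given in Figure 4 and in [4].

```python
import struct

# Hypothetical fixed-position header: control signal followed by an address
# block holding the start and length of the code and data blocks.
HEADER = struct.Struct("<IIIII")  # control, code_start, code_len, data_start, data_len
HW_IDLE, SW_LOAD_COMPLETE = 0, 1  # assumed encodings of two control-signal values


def load_board_memory(code: bytes, data: bytes) -> bytearray:
    """Lay out one code block and one data block behind the fixed header."""
    code_start = HEADER.size
    data_start = code_start + len(code)
    mem = bytearray(data_start + len(data))
    mem[code_start:data_start] = code
    mem[data_start:] = data
    # The header is written last so the flag changes only after the payload is in place.
    HEADER.pack_into(mem, 0, SW_LOAD_COMPLETE, code_start, len(code),
                     data_start, len(data))
    return mem


def read_blocks(mem: bytearray):
    """Locate the two variable-size blocks using only the address block."""
    _, code_start, code_len, data_start, data_len = HEADER.unpack_from(mem, 0)
    return (bytes(mem[code_start:code_start + code_len]),
            bytes(mem[data_start:data_start + data_len]))


if __name__ == "__main__":
    mem = load_board_memory(b"\x10\x05\x3c", bytes(8))
    print(read_blocks(mem))
```

Because the reader consults only the header, the block sizes can differ on every dispatch without any change on the reading side.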
As mentioned in Section III, there are three software virtual machine memory access modes for the hardware portion of the co-designed virtual machine. The parallel extension to this design uses both the most and least restrictive of these modes, depending on the method used to insert context switch tags during the application load process.

For the Java Virtual Machine, the code block contains the bytecodes and their operands, and the data block contains the local variables and stack for the Java method. The address block contains the starting addresses of the code block and of the local variable and stack sections of the data block. In addition, the address block contains the address of the first bytecode to be executed by the hardware execution engine (the program counter) and the address of the top element of the stack in the hardware execution engine's memory. The address block also contains the address of the start of the constant pool in the software virtual machine memory. Upon completion of its processing, the hardware execution engine sets the next instruction and stack top addresses in the address block so that software execution can resume at the appropriate instruction and stack item.

The constant pool for the Java class containing the method that caused the switch to hardware remains in the software virtual machine's memory. The hardware execution engine uses Direct Memory Access (DMA) techniques to read the constant pool entry directly from the host system's memory. The hardware execution engine never writes to the constant pool.

VI. HARDWARE EXECUTION ENGINE SIMULATION

Simulation of hardware devices provides a means of evaluating and debugging the hardware design using software. The parallel version of the co-designed virtual machine uses a modified version of the simulator used in evaluating the original design. The modifications consist of additional functionality to execute the simulator in its own virtual machine system thread rather than as part of the software execution engine thread.
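The control-signal handshake of Section V, with the simulator running in its own thread, might be sketched in Python as below. The four flag names follow the simulator pseudo-code in Figure 5; everything else (the Board class, the function names, the bounded loop so that the example terminates) is assumed for illustration. A condition variable is used here in place of polling a flag in board memory, which is a substitution made for the sketch and not the prototype's mechanism.

```python
import threading

HW_IDLE, SW_LOAD_COMPLETE, HW_DONE, SW_RTV_COMPLETE = range(4)


class Board:
    """Hypothetical stand-in for the shared on-board memory and control flag."""

    def __init__(self):
        self.signal = HW_IDLE
        self.memory = None
        self.cond = threading.Condition()


def simulator_thread(board, execute, blocks):
    """Hardware side: process `blocks` code blocks, then exit."""
    for _ in range(blocks):
        with board.cond:
            board.cond.wait_for(lambda: board.signal == SW_LOAD_COMPLETE)
            board.memory = execute(board.memory)  # simulate the hardware run
            board.signal = HW_DONE
            board.cond.notify_all()
            board.cond.wait_for(lambda: board.signal == SW_RTV_COMPLETE)
            board.signal = HW_IDLE                # reset for the next block
            board.cond.notify_all()


def run_on_hardware(board, payload):
    """Software side: load a block, wait for completion, retrieve the result."""
    with board.cond:
        board.cond.wait_for(lambda: board.signal == HW_IDLE)
        board.memory = payload
        board.signal = SW_LOAD_COMPLETE
        board.cond.notify_all()
        board.cond.wait_for(lambda: board.signal == HW_DONE)
        result = board.memory
        board.signal = SW_RTV_COMPLETE
        board.cond.notify_all()
        return result


if __name__ == "__main__":
    board = Board()
    sim = threading.Thread(target=simulator_thread,
                           args=(board, lambda m: m[::-1], 1))
    sim.start()
    print(run_on_hardware(board, [1, 2, 3]))  # [3, 2, 1]
    sim.join()
```

Each side touches the shared memory only while the flag holds the value that grants it access, which is the property the handshake is meant to guarantee.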
Figure 5 shows the additional functionality required for the extended design.

    FUNCTION Simulator-Control
        Initialize--Simulation
        control-signal = HW-IDLE
        DO forever
            IF control-signal = SW-LOAD-COMPLETE
                Simulate-
                control-signal = HW-DONE
            ELSE IF control-signal = SW-RTV-COMPLETE
                Reset-
                control-signal = HW-IDLE
            END IF
        END DO
    END FUNCTION

Fig. 5. Pseudo-code of the hardware simulator control module.

The manner in which hardware signals are stored by the simulator has also been changed to allow for multiple concurrent simulator instances.

VII. DATA INTEGRITY

A key issue in the design of any software or hardware system, including virtual machines, is the integrity of the data that it manipulates. The Java Language Specification [2] and the Java Virtual Machine Specification [1] define a set of rules and guidelines for the low-level protection of data. Any virtual machine that claims to be compliant with these specifications must implement these rules and guidelines. These specifications also include recommendations on the use of some Java language constructs that allow programmers to explicitly synchronize

various parts of multi-threaded applications to directly protect data.

The software part of the co-designed virtual machine implements all of the guidelines and rules set out in the Java specifications and supports the programming constructs. As a result, it provides the level of data integrity required by the Java specifications. The subset of the Java instruction set supported by the hardware partition of the co-designed virtual machine provides access to the local variables and stack of the current method only. It does not support method invocation or return, the synchronization language constructs, or object access. Therefore, all of these operations must execute in the software partition of the virtual machine. Thus, the co-designed virtual machine provides data protection as laid out in the Java specifications.

VIII. PROTOTYPE IMPLEMENTATION AND EVALUATION

A prototype of the parallel co-designed virtual machine was created using software simulation of the hardware partition. This prototype was built in C using the Microsoft Visual C++ tools and executed on a 2.4 GHz workstation running Microsoft Windows XP Professional.

Testing of the parallel co-designed virtual machine consisted of multiple executions of the JVM SPEC 98 benchmarks (in particular RayTrace) [19] and custom-written test programs (a Fibonacci number generator, an n-queens problem solver and a Mandelbrot fractal program). These test programs all operate in both single and multiple thread modes. Testing scenarios included executing the various test programs on versions of the parallel co-designed virtual machine with one, two and four instances of the hardware simulator, as well as on the original co-designed virtual machine and a software-only virtual machine. A discussion of the test results appears in the next section. Table I shows the various test scenarios utilized; a Y indicates that the scenario was used, an N that it was not.
In order to evaluate performance during the trial executions of the test programs, all of the virtual machine versions used included functionality to provide timing, hardware cycle counts and hardware partition memory usage at various stages of execution. This data was recorded and analyzed later. The multi-threaded nature of the system meant that a metric output queueing feature was required as part of the test versions of the virtual machines to avoid data loss or out-of-order results.

IX. EXPERIMENTAL RESULTS

Functional correctness of the parallel version of the co-designed virtual machine was demonstrated using some of the JVM SPEC 98 benchmarks [19]. These benchmark programs target single and multi-threaded operation on single processor or symmetric multi-processor type systems. The single-thread-only benchmarks were not used. While they could show the functional correctness of the parallel co-designed virtual machine in some ways, they are, by design, not suitable for parallel execution. The parallel virtual machine proved to be functionally correct based on these benchmarks. This was the expected result since the software partition contains a standard Java interpreter (albeit slightly modified to support the context switch tags) and the hardware execution engine has the same computational components as the original, extensively tested co-designed virtual machine.

TABLE I
TESTS PERFORMED IN EVALUATING THE PARALLEL CO-DESIGNED VIRTUAL MACHINE

    Prog/VM          SW   Orig   Par 1   Par 2   Par 4
    Fibonacci 1      Y    Y      Y       N       N
    Fibonacci 2      Y    N      Y       Y       N
    Fibonacci 10     Y    N      Y       Y       Y
    Fibonacci 100    Y    N      Y       Y       Y
    RayTrace 1       Y    Y      Y       N       N
    RayTrace 2       Y    N      Y       Y       N
    RayTrace 10      Y    N      Y       Y       Y
    Queens 1         Y    Y      Y       N       N
    Queens 4         Y    N      Y       Y       Y
    Mandelbrot 1     Y    Y      Y       N       N
    Mandelbrot 2     Y    N      Y       Y       N

    SW - Software Virtual Machine
    Orig - Original Co-designed Virtual Machine
    Par n - Parallel Co-designed Virtual Machine
    The number after Par is the number of hardware execution engine instances.
    The number after each program name is the number of threads used in that test.

Evaluating the performance of the parallel co-designed virtual machine is more difficult than establishing its basic functional correctness. The benchmark and custom-written programs described in Section VIII were all used in this phase of evaluation. Since the prototype testing was done on a single processor computer, the test results are for concurrent execution of multiple threads within a single operating system process rather than for a hardware partition executing on a separate device.

Fig. 6. Plot of a two thread trial execution of the Fibonacci program. The horizontal axis represents time and the vertical axis represents execution mode where: Low - in software partition, Middle - on hardware queue, High - in hardware partition.

Figure 6 shows a plot of the execution of the Fibonacci

program with two application threads, both computing fib(19). These two threads have no data interdependence. The plot shows the two threads switching between the hardware and software partitions in an interleaved manner.

Figure 7 shows an expanded view of part of the same trial run as shown in Figure 6. This view shows that the two threads alternate between hardware and software with significant delays while waiting on the hardware queue. These delays can be explained by two factors: real queue waiting time while the hardware is busy, and the host system scheduling other threads within the virtual machine's process as well as other system processes. While these delays could also manifest themselves as delays in other virtual machine threads, the impact on the hardware queue is more noticeable since a system thread context switch must occur in the prototype's simulated environment. In these two figures, and the one for RayTrace (Figure 8), the sloped lines between the states represent the amount of time required to make the transition from one state to another.

TABLE II
TEST RESULTS FOR TWO THREAD FIBONACCI AND RAYTRACE TRIALS

    Metric                                   Fibonacci   RayTrace
    Context Switches
    Simulator Invocations
    Average Execution Time (µs)
    Average Dispatch Time (µs)
    Average Simulation Time (µs)
    Average Cycles
    Avg Host Cyc for SW Exec of HW block
    Average Retrieval Time (µs)
    Average Queue Wait (µs)
    Average Data Dispatched (bytes)
    Average Data Retrieved (bytes)

TABLE III
PCI COMMUNICATION TIME ESTIMATES FOR THE TWO-THREADED FIBONACCI AND RAYTRACE PROGRAMS

    Direction   Fibonacci   RayTrace
    to          32 µs       41 µs
    to          108 µs      426 µs

Fig. 7. Expanded view of part of the full trial run shown in Figure 6.

The RayTrace benchmark program exhibits data interdependence amongst its threads. As a result, there is little overlap in the execution of its threads. The expanded view of part of the two threads' execution in Figure 8 shows a portion of the trial where they did overlap.
A full plot of the trial is not shown since the threads perform so many context switches that they appear as two partially overlapping solid rectangles. This program does not make effective use of the multi-processor capabilities of the parallel co-designed virtual machine.

Fig. 8. Expanded plot of a trial two thread RayTrace run.

Computed results for the two-threaded trials of both the Fibonacci and RayTrace programs are shown in Table II. These results are based on measurements taken when the parallel co-designed virtual machine was the only application running on the host system. However, no attempt was made to normalize the raw data or results to compensate for the existence of operating system services, since these will exist in any normal computing environment.

The data in Table II shows that, for both programs, not every attempt to send a thread to hardware succeeded (e.g. for the RayTrace program, 3520 of the 3579 attempts succeeded). The unsuccessful attempts occurred when the software scheduler was idle and the waiting threads were therefore returned to software.

The average amount of data transferred between the hardware and software partitions can be used to compute the communication requirements for these two programs; other programs will have their own requirements. As discussed previously, in addition to the actual program data, some control data is also passed between the two partitions: 8 bytes from the hardware partition to the software partition and 20 bytes in the other direction. The target hardware device communicates with the host system utilizing a 64-bit, 66 MHz PCI bus, and it is known from previous work that 8760 host system clock cycles are required to transfer a 32-bit word using a 32-bit, 33 MHz PCI bus [7]. By extrapolating the 32-bit PCI bus requirements, an estimate for the time required to transfer data can be found. Table III shows these time estimates.
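The extrapolation described above can be illustrated with a short Python sketch. Two assumptions are made here that the text does not state: that the 8760-cycle figure can be converted to time using the 2.4 GHz clock of the prototype workstation, and that the 64-bit, 66 MHz bus is four times faster than the 32-bit, 33 MHz bus (twice the width at twice the clock). The function name and the sample payload size are hypothetical, so the result is not expected to reproduce Table III.

```python
HOST_CLOCK_HZ = 2.4e9         # assumption: prototype workstation clock (Section VIII)
CYCLES_PER_WORD_32_33 = 8760  # host cycles per 32-bit word on a 32-bit, 33 MHz bus [7]


def transfer_time_us(payload_bytes, control_bytes, speedup=4.0):
    """Estimate the time in microseconds to move one block across the PCI bus.

    `speedup` is the assumed gain of the 64-bit, 66 MHz bus over the
    32-bit, 33 MHz bus (2x width times 2x clock under a linear model).
    """
    total = payload_bytes + control_bytes
    words = -(-total // 4)  # round up to whole 32-bit words
    cycles = words * CYCLES_PER_WORD_32_33 / speedup
    return cycles / HOST_CLOCK_HZ * 1e6


if __name__ == "__main__":
    # Hypothetical dispatch of 100 bytes of program data plus the 20 bytes of
    # control data sent from the software partition to the hardware partition.
    print(transfer_time_us(100, 20))
```

Setting speedup to 1.0 gives the corresponding estimate for the 32-bit, 33 MHz bus, which makes the effect of the extrapolation factor easy to compare.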
A prototype of the hardware partition of the original co-designed virtual machine indicates that it will operate with a clock rate of approximately 25 MHz on the target reconfigurable device [20]. Based on this estimate, Table IV provides an estimate of the total time required to send a

block of code and data to hardware, execute the necessary instructions there and retrieve the results.

TABLE IV
TOTAL HARDWARE EXECUTION TIME INCLUDING DISPATCH, RETRIEVAL, COMMUNICATION AND EXECUTION TIMES

    Time (µs)                  Fibonacci   RayTrace
    Dispatch to hardware
    Total communications
    execution                  37          5
    Retrieve from hardware
    Total

TABLE V
COMPARISON OF EXECUTION TIMES

    Time (µs)                               Fibonacci   RayTrace
    Average software block execution
    Avg SW execution of a HW code block
    Estimated hardware execution
    Maximum number of threads               67          5

Table V shows the execution times measured during the testing of the parallel co-designed virtual machine. A comparison of the execution times for hardware-capable blocks of code in both hardware and software shows the potential performance increase or decrease when executing in single-thread mode. For parallel execution, dividing the software execution time by the hardware execution time gives an estimate of the maximum number of threads that an application should use to obtain its maximum possible performance gain, as shown in the last row of Table V.

The communication time between the hardware and software partitions of the parallel co-designed virtual machine is the major factor in moving a thread from one partition to another (see Table IV and the sloped portion of the plot in Figure 6). This is consistent with the findings of El-Araby et al. [21].

X. CONCLUSIONS

The concept of virtual machine design using thread-level parallelism and hardware/software co-design is sound, as shown by this research. The parallel version of the co-designed virtual machine is functionally correct, as shown by the JVM SPEC 98 benchmark tests. The PCI bus bandwidth requirement is small enough that multiple threads can be executed by the virtual machine even if the bus is shared, although a tighter coupling between the partitions would provide better overall performance. The hardware execution engine runs blocks of code in fewer cycles than the software virtual machine can.
As with any parallel system, the applications that run on it need to be designed for parallel execution. For example, the RayTrace benchmark does not exhibit good parallel behavior, while the Fibonacci number generator, by design, does. Future work on this research will include improving the design of the execution synchronizer and replacing the hardware simulation with the actual hardware execution engine device [20].

REFERENCES

[1] T. Lindholm and F. Yellin, The Java Virtual Machine Specification (2nd Edition). Addison-Wesley Publishing Company.
[2] Sun Microsystems Inc. (2000) Java language specification, second edition. Sun Microsystems Inc. [Online].
[3] J. Meyer and T. Downing, Java Virtual Machine. O'Reilly & Associates Inc.
[4] K. B. Kent, "The co-design of virtual machines using reconfigurable hardware," Ph.D. dissertation, University of Victoria.
[5] K. B. Kent, "Branch sensitive context switching between partitions in a hardware/software co-design of the Java virtual machine," in IEEE Pacific Rim Conference on Computers, Communications and Signal Processing (PACRIM) 2003, Victoria, Canada, Aug.
[6] K. B. Kent and M. Serra, "Context switching in a hardware/software co-design of the Java virtual machine," in Designer's Forum of Design Automation & Test in Europe (DATE) 2002, Paris, France, Mar.
[7] K. B. Kent and M. Serra, "Reconfigurable architecture requirements for co-designed virtual machines," in 10th Reconfigurable Architectures Workshop (RAW) 2003, part of the 17th annual International Parallel & Distributed Processing Symposium (IPDPS), Nice, France, Apr.
[8] K. B. Kent and M. Serra, "Hardware architecture for Java in a hardware/software co-design of the virtual machine," in Euromicro Symposium on Digital System Design (DSD) 2002, Dortmund, Germany, Sept.
[9] K. B. Kent and M. Serra, "Hardware/software co-design of a Java virtual machine," in Proceedings of IEEE International Workshop on Rapid Systems Prototyping (RSP) 2000, Paris, France, June 2000.
[10] J. L. Schilling, "The simplest heuristics may be the best in Java JIT compilers," ACM SIGPLAN Notices, vol. 38, no. 2, Feb.
[11] M. W. El-Kharashi, F. ElGuibaly, and K. F. Li, "A quantitative study for Java microprocessor architectural requirements. Part I: Instruction set design," Microprocessors and Microsystems, vol. 24, no. 5, Aug.
[12] M. W. El-Kharashi, F. ElGuibaly, and K. F. Li, "A quantitative study for Java microprocessor architectural requirements. Part II: High-level language support," Microprocessors and Microsystems, vol. 24, no. 5, Aug.
[13] M. W. El-Kharashi, F. ElGuibaly, K. F. Li, and F. Zhang, "The JAFARDD processor: A Java architecture based on a folding algorithm with reservation stations, dynamic translation, and dual processing," IEEE Transactions on Consumer Electronics, vol. 48, no. 4, Nov.
[14] H. McGhan and M. O'Connor, "PicoJava: a direct execution engine for Java bytecode," Computer Magazine, vol. 31, no. 10, Oct.
[15] K. B. Kent, J. C. Muzio, and G. C. Shoja, "Remote transparent execution of Java threads," in Proceedings of the High Performance Computing Symposium - HPC 2001, Seattle, WA, Apr. 2001.
[16] M. Factor, A. Schuster, and K. Shagin, "A distributed runtime for Java: Yesterday and today," in Proceedings. 18th International Parallel and Distributed Processing Symposium 2004, Apr.
[17] G. De Micheli, "Hardware/software co-design: Application domains and design technologies," in Proceedings of the NATO Advanced Study Institute on Hardware/Software Co-Design. Tremezzo, Italy: Kluwer Academic Publishers, June.
[18] E. Lattanzi, A. Gayasen, M. Kandemir, V. Narayanan, L. Benini, and A. Bogliolo, "Improving Java performance using dynamic method migration on FPGAs," in Proceedings. 18th International Parallel and Distributed Processing Symposium 2004, Apr.
[19] (1997, Nov.) JVM SPEC benchmarks. [Online].
[20] H. Ma, "An implementation of the hardware partition in a software/hardware co-designed Java virtual machine," Master's thesis, University of New Brunswick.
[21] E. El-Araby, M. Taher, K. Gaj, T. El-Ghazawi, D. Caliga, and N. Alexandridis, "System-level parallelism and throughput optimization in designing reconfigurable computing applications," in Proceedings. 18th International Parallel and Distributed Processing Symposium 2004, Apr.
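
The thread-count estimate described alongside Table V (software execution time divided by hardware execution time) can be illustrated with a small sketch. This is not the authors' code: the function name, the choice to round down, and the floor of one thread are assumptions made for illustration, and the sample timings are hypothetical values, not measurements from the paper.

```python
import math


def estimate_max_threads(sw_block_time_us: float, hw_block_time_us: float) -> int:
    """Estimate the number of threads worth running in parallel.

    Follows the rule of thumb in the text: divide the software execution
    time of a block by its hardware execution time. Rounding down and
    returning at least 1 are assumptions of this sketch.
    """
    if sw_block_time_us <= 0 or hw_block_time_us <= 0:
        raise ValueError("execution times must be positive")
    ratio = sw_block_time_us / hw_block_time_us
    return max(1, math.floor(ratio))


# Hypothetical timings in microseconds, chosen only to show the arithmetic.
print(estimate_max_threads(120.0, 8.0))   # software 15x slower than hardware
print(estimate_max_threads(10.0, 40.0))   # hardware slower: single thread only
```

A ratio below one corresponds to the case where moving a block to hardware costs more than running it in software, so the sketch falls back to a single thread.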


1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola 1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device

More information

Pause-Less GC for Improving Java Responsiveness. Charlie Gracie IBM Senior Software charliegracie

Pause-Less GC for Improving Java Responsiveness. Charlie Gracie IBM Senior Software charliegracie Pause-Less GC for Improving Java Responsiveness Charlie Gracie IBM Senior Software Developer charlie_gracie@ca.ibm.com @crgracie charliegracie 1 Important Disclaimers THE INFORMATION CONTAINED IN THIS

More information

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java s Devrim Akgün Computer Engineering of Technology Faculty, Duzce University, Duzce,Turkey ABSTRACT Developing multi

More information

Intel VTune Performance Analyzer 9.1 for Windows* In-Depth

Intel VTune Performance Analyzer 9.1 for Windows* In-Depth Intel VTune Performance Analyzer 9.1 for Windows* In-Depth Contents Deliver Faster Code...................................... 3 Optimize Multicore Performance...3 Highlights...............................................

More information

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 5, MAY 2014 1061 A Cool Scheduler for Multi-Core Systems Exploiting Program Phases Zhiming Zhang and J. Morris Chang, Senior Member, IEEE Abstract Rapid growth

More information

Chapter 9: Virtual-Memory

Chapter 9: Virtual-Memory Chapter 9: Virtual-Memory Management Chapter 9: Virtual-Memory Management Background Demand Paging Page Replacement Allocation of Frames Thrashing Other Considerations Silberschatz, Galvin and Gagne 2013

More information

CHAPTER 16 - VIRTUAL MACHINES

CHAPTER 16 - VIRTUAL MACHINES CHAPTER 16 - VIRTUAL MACHINES 1 OBJECTIVES Explore history and benefits of virtual machines. Discuss the various virtual machine technologies. Describe the methods used to implement virtualization. Show

More information

PowerAware RTL Verification of USB 3.0 IPs by Gayathri SN and Badrinath Ramachandra, L&T Technology Services Limited

PowerAware RTL Verification of USB 3.0 IPs by Gayathri SN and Badrinath Ramachandra, L&T Technology Services Limited PowerAware RTL Verification of USB 3.0 IPs by Gayathri SN and Badrinath Ramachandra, L&T Technology Services Limited INTRODUCTION Power management is a major concern throughout the chip design flow from

More information

Hierarchical PLABs, CLABs, TLABs in Hotspot

Hierarchical PLABs, CLABs, TLABs in Hotspot Hierarchical s, CLABs, s in Hotspot Christoph M. Kirsch ck@cs.uni-salzburg.at Hannes Payer hpayer@cs.uni-salzburg.at Harald Röck hroeck@cs.uni-salzburg.at Abstract Thread-local allocation buffers (s) are

More information

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE Reiner W. Hartenstein, Rainer Kress, Helmut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany

More information

VASim: An Open Virtual Automata Simulator for Automata Processing Research

VASim: An Open Virtual Automata Simulator for Automata Processing Research University of Virginia Technical Report #CS2016-03 VASim: An Open Virtual Automata Simulator for Automata Processing Research J. Wadden 1, K. Skadron 1 We present VASim, an open, extensible virtual automata

More information

A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems

A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems Abstract Reconfigurable hardware can be used to build a multitasking system where tasks are assigned to HW resources at run-time

More information