instruction fetch memory interface signal unit priority manager instruction decode stack register sets address PC2 PC3 PC4 instructions extern signals

Size: px

Start display at page:

Download "instruction fetch memory interface signal unit priority manager instruction decode stack register sets address PC2 PC3 PC4 instructions extern signals"

Rachel Gardner
5 years ago
Views:

1 Performance Evaluations of a Multithreaded Java Microcontroller J. Kreuzinger, M. Pfeer A. Schulz, Th. Ungerer Institute for Computer Design and Fault Tolerance University of Karlsruhe, Germany U. Brinkschulte, C. Krakowski Institute for Process Control Automation and Robotics University of Karlsruhe, Germany Abstract We propose handling of external real-time events through multithreading and describe the microarchitecture of our multithreaded Java microcontroller, called Komodo microcontroller. Real-time Java threads are used as interrupt service threads (ISTs) instead of interrupt service routines (ISRs). Our proposed Komodo microcontroller supports multiple ISTs with zerocycle context switching overhead. We evaluate the basic performance and optimize the internal structures of our microcontroller using real-time applications as benchmarks on a software simulator. We determine the correlation between the fetch bandwidth and the instruction window size. Our results show a performance gain of about. by latency reducing over the single threaded model. Keywords: multithreading, microcontroller, Java microprocessor, real-time, interrupt service threads Introduction A multithreaded processor is characterized by the ability to simultaneously execute instructions of dierent threads within the processor pipeline. Multiple on-chip register sets are employed to reach an extremely fast context switch. Multithreading techniques are proposed in processor architecture research to mask latencies of instructions of the presently executing thread by execution of instructions of other threads []. Recently, multithreading has also been proposed for event-handling of internal events ([], []) by taking advantage of its fast context switching ability. However, to date multithreading has never been explored in context of microcontrollers for handling of external hardware events. Contemporary microprocessors and microcontrollers activate Interrupt-Service-Routines (ISRs) for event handling. Assuming a multithreaded processor core allows to trigger so-called Interrupt-Service-Threads (ISTs) instead of ISRs for event handling. In this case an occurring event activates an assigned thread. Using threads instead of ISRs in combination with a multithreaded processor core has several advantages: Thread activation is extremely fast due to the multiple register sets on the multithreaded processor chip, thus enabling single-cycle event reaction. Because of the parallel execution, ISTs are not blocked by ISTs with higher priority, in contrast to ISRs. Latencies within the pipeline can be utilized by other concurrently executing threads or ISTs. A exible context switch between event triggered ISTs and other threads is made possible by the IST scheme. In contrast, ISRs can only be interrupted by other ISRs, but not by threads.

2 Threads allow exible priority schemes. Our project explores the suitability ofmul- tithreading techniques in embedded real-time systems. We propose multithreading as an event handling mechanism that allows ecient handling of simultaneous overlapping events with hard real-time requirements. Moreover, we explore our methods in context of a realtime Java system. State-of-the-art embedded systems are programmed in assembly or C/C++ languages. Java features several advantages with respect to real-time systems: The object orientation of the Java language supports easy programming, reusability, and robustness. Java bytecode is portable, of small size, and secure due to the bytecode verier in the Java-Virtual- Machine (JVM). However, to date the use of Java is hardly possible in embedded real-time systems because of its high hardware requirements and unpredictable real-time behavior. Both problems are solved with the design of a multithreaded Java-microcontroller enhanced by an adapted JVM. A Java-microprocessor, like the picojava-i and II from Sun ([], [5]), executes Java bytecode instructions directly in hardware raising the performance of Java applications and decreasing the hardware requirements compared to a Java interpreter or Just-In-Time (JIT) compiler. In enhancement to the picojava approach the real-time issue is supported by applying the multithreading technique to reach a zero cycle context switching overhead and by hardware support for advanced real-time scheduling schemes. The next section presents our Komodo microcontroller and section three our performance results. The proposed Komodo microcontroller The Komodo microcontroller is a multithreaded Java microcontroller which supports multiple ISTs with zero-cycle context switching overhead and several priority schemes. A high priority IST may execute with full speed, lower priority ISTs and non-real-time threads are restricted to the latency pipeline slots of the high priority IST. In the actual design, the microcontroller holds the contexts of up to four threads. If there are more then four threads to handle, our architecture allows up to three real-time threads, which are directly mapped to hardware threads. The remaining threads must be non real-time and are scheduled within the fourth hardware thread. To scale up for larger systems with more than three real-time threads, we propose a parallel execution on several microcontrollers connected by a middleware platform. This topic is covered in another paper [6]. Because of its application for embedded systems, the processor core of the Komodo microcontroller is kept at the hardware level of a simple scalar processor. As shown in Fig., the microcontroller core consists of an instructionfetch unit, a decode unit, a memory access unit (MEM) and an execution unit (ALU). Four stack register sets are provided on the processor chip. A signal unit triggers IST execution due to external signals. memory interface address instructions micro-ops ROM data path PC IW MEM instruction fetch PC PC PC IW IW IW priority manager instruction decode stack register sets ALU signal unit Figure : Block diagram of the Komodo microcontroller The instruction fetch unit holds four program counters (PC) with dedicated status bits (e.g. thread active/suspended), each PC is assigned to another thread. Four byte por- extern signals

3 tions are fetched over the memory interface and put in the according instruction window (IW). Several instructions may be contained in the fetch portion, because of the average bytecode length of.8 bytes. Instructions are fetched depending on the status bits and ll levels of the IWs. The instruction decode unit contains the above mentioned IWs, dedicated status bits (e.g. priority) and counters. A priority manager decides subject to the bits and counters from which IW the next instruction will be decoded. We dened several priority schemes [7] to handle real-time requirements. The priority manager applies one of the implemented thread priority schemes for IW selection. Abytecode instruction is decoded either to a single microop, a sequence of micro-ops, or a trap routine is called. Each opcode is propagated through the pipeline together with its thread id. Opcodes from multiple threads can be simultaneously present in the dierent pipeline stages. The instructions for memory access are executed by the MEM unit. If the memory interface only permits one access each cycle, an arbiter is needed for instruction fetch and data access. All other instructions are executed by the ALU unit. Both units (MEM and ALU) can take several cycles to complete an instruction execution. After that, the result is written back to the stack register set of the according thread. External signals are delivered to the signal unit from the peripheral components of the microcontroller core as e.g. timer, counter, or serial interface. By the occurrence of such a signal the corresponding IST is activated. As soon as an IST activation ends its assigned real-time thread is suspended and its status is stored. An external signal may activate the same thread again. To avoid pipeline stalls, instructions from other threads can be fed into the pipeline. Idle times may result from branches or memory accesses. The decode unit predicts the latency after such an instruction, and proceeds with instructions from other IWs. There is no overhead for such a context switch. No save/restore of registers or removal of instructions from the pipeline is needed, because each thread has it's own stack register set. Because of the unpredictability of cache accesses, a non-cached memory access is preferred for real-time microcontrollers. The emerging load latencies are bridged by scheduling instructions of other threads by the priority manager. Therefore a cache is omitted from our Komodo microcontroller. Performance evaluation For the evaluation of our design, we use an execution-based simulator of our Komodo microcontroller with an adapted JVM to run several benchmarks (e.g. CoeineMark.0, a FFT algorithm, a PID element, and an impulse counter). This JVM is yet not optimized with quick-opcodes or something else. Therefore, the benchmarks are used to apply a real bytecode mix for microarchitecture optimization and not to compare the performance of the Komodo microcontroller with other approaches. First of all, we want to prove the presumption of a performance gain by bridging latencies. Our rst experiment assumes processor clock cycles of 00MHz or less and fast SRAM storage for memory thus eliminating load/store latencies. Under these assumptions the only latencies stem from branches, that take two cycles overhead in the pipeline to resolve and fetch from the new PC. The results pointed out a gain of.8 with: gain = cycles needed for sequential execution cycles needed by multithreaded execution The second experiment additionally regards latencies of data accesses. We assume a large data memory with DRAM modules instead of costly SRAM chips. Due to the need for available instructions we assume a small instruction memory implemented with SRAM chips or as an on-chip memory. This yields instruction fetch without latency. We vary the share of load/store instructions and their latency. The

4 results of this situation are shown in gure. The dierent graphs show the gain with respect to memory latency (in cycles). Each graph represents a program with another percentage of l/s-instructions (0, 0, 0, 50, 70, 90%) respectively the impulse counter (imp), PID element (pid), and fast fourier transformation algorithm (t). The results in Fig show a,5,5, latency of l/s-instructions,5,5,5 0,5 share of l/s-instructions in % imp pid fft Memory latency Figure : Benchmark with dierent share and latency of l/s-instructions signicant performance gain which is reached only by the multithreading ability of latency hiding and not by parallel processing. As a third performance test, we determine the number of threads needed to bridge nearly all latencies in typical Java bytecode programs subject to the latency of data access. The "zero" latency graph in Figure shows the gain yielded by branch latency hiding only, while the other graphs concern branch and load/store latency hiding. The gure shows an increasing gain until the number of threads reaches the latency of load/store-instructions. At this point, nearly all latencies are bridged and the processor is fully utilized. After the investigation the basic performance abilities of our architecture, we take a closer look at some hardware parameters. Henceforth, we assume a single memory interface which prefers data access over instruction fetch and a bandwidth of bit. We assume latency of zero cycles for instruction fetch, one cycle for data access, and two cycles for branches. Four threads are active and scheduled by a xed priority until they are 0, Number of threads Figure : threads Benchmark with dierent number of blocked. The share of load/store-instructions is 0% which is typical for many applications. For a zero cycle context switch, it is necessary to keep the instruction windows stued. To reach this goal, several instruction fetch strategies are investigated: only the ll level of the IWs is taken into account, the ll level and the priority of the thread are evaluated, ll level and recently taken branches are taken into account, because the IW is empty after a branch, or a combination of all three techniques is used. Figure shows the dierence between these strategies subject to the IW size. As you can see, there is only a minimal discrepancy and therefore the simplest fetch strategy (only ll level) is sucient. To nd an optimal size of the instruction windows, we run several tests with dierent sizes. We found out, that the window size correlates with the instruction fetch bandwidth, as shown in gure 5. The minimum of the window size is about the fetch bandwidth plus two, necessary to avoid a deadlock that arises when no complete bytecode is in the buer and there

5 ,5,,5,,5 fill level fill level & priority fill level & branches fill level & priority & branches Window size in bytes Figure : Benchmark with dierent fetch strategies is not enough space to fetch a new portion. A detailed view shows an optimum size of about fetch bandwidth plus six, with regard to hardware costs and performance.,6,, 0,8 0,6 0, 0, 0 Fetch bandwidth in bytes Window size in bytes Figure 5: Benchmark with dierent window size and fetch bandwidth Conclusions This paper describes the basic idea of ISTs to handle external events and the appropriate multithreaded Java microcontroller. The microarchitecture is outlined and several benchmark results are shown. On one hand the performance gain possible through multithreading is proven and on the other hand some hardware parameters are optimized. A complete system with hardware, operating system (JVM), middleware and application (automated guided vehicles) is under construction. We are working on a detailed hardware design in VHDL of the Komodo microcontroller and further improve the adapted JVM. References [] J. Silc, B. Robic, and T. Ungerer. Processor Architecture: From Dataow to Superscalar and Beyond. Springer-Verlag, Heidelberg, 999. [] R. S. Chappell, J. Stark, S. P. Kim, S. K. Reinhardt, Y. N. Patt. Simultaneous Subordinate Microthreading (SSMT). ISCA'6 Proceedings, Atlanta, Georgia, Vol 7, No, pp , May 999. [] S. W. Keckler, A. Chang, W. S. Lee, W. J. Dally. Concurrent Event Handling through Multithreading. IEEE Transactions on computers, Vol 8, No 9, pp , September 999 [] J. O'Connor and M. Tremblay. PicoJava-I: The Java Virtual Machine in Hardware. IEEE Micro, pages 5{5, March/April 997. [5] Sun. PicoJava-II Microarchitecture Guide. Sun microsystems, Part No.: , March 999. [6] U. Brinkschulte, C. Krakowski, J. Kreuzinger, R. Marston, and T. Ungerer. The Komodo Project: Thread-Based Event Handling Supported by a Multithreaded Java Microcontroller. 5th EUROMICRO Conference, Milano, September 999. [7] U. Brinkschulte, C. Krakowski, J. Kreuzinger, T. Ungerer. A Multithreaded Java Microcontroller for Thread-oriented Real-time Event- Handling. 999 International Conference on Parallel Architectures and Compilation Techniques (PACT '99), Newport Beach, Ca., pp. -9, October 999. The maximum length of a bytecode instruction executed direct by the Komodo microcontroller is three.

Real-time Scheduling on Multithreaded Processors

Real-time Scheduling on Multithreaded Processors J. Kreuzinger, A. Schulz, M. Pfeffer, Th. Ungerer Institute for Computer Design, and Fault Tolerance University of Karlsruhe D-76128 Karlsruhe, Germany