
Implementing Real-time Scheduling Within a Multithreaded Java Microcontroller

S. Uhrig 1, C. Liemke 2, M. Pfeffer 1, J. Becker 2, U. Brinkschulte 3, Th. Ungerer 1

1 Institute for Computer Science, University of Augsburg, 86159 Augsburg, Germany, {uhrig, pfeffer, ungerer}@informatik.uni-augsburg.de
2 Institute for Information Processing Technology, University of Karlsruhe, 76128 Karlsruhe, Germany, becker@itiv.uni-karlsruhe.de
3 Institute for Process Control, Automation and Robotics, University of Karlsruhe, 76128 Karlsruhe, Germany, brinks@ira.uka.de

Abstract

This paper presents the design, evaluation and hardware implementation of real-time scheduling schemes, which are embedded in a multithreaded Java microcontroller. We show the feasibility of a hardware real-time scheduler integrated deeply into the processor pipeline with a VHDL design and its synthesis. Evaluations with a software simulator and real-time applications as benchmarks show that hardware multithreading reaches a 1.2 to 1.6 performance increase for hard real-time applications (multithreading without latency utilization) and a 1.8 to 2.6 speedup by latency utilization for programs without hard real-time requirements. We also show that even for the complex scheduling algorithms EDF (Earliest Deadline First), LLF (Least Laxity First), and GP (Guaranteed Percentage) a scheduling decision is possible within one processor cycle of a 327 MHz, 325 MHz, resp. 274 MHz processor with four threads. With respect to real-time scheduling on a multithreaded microcontroller, the LLF scheme outperforms the FPP (Fixed Priority Preemptive), EDF, and GP schemes. However, only GP allows isolation of threads.

Keywords: real-time Java, real-time scheduling, embedded systems, multithreading

1 Introduction

The target market of our project is the widespread market of embedded systems, in particular embedded real-time systems. In this area microcontrollers are typically preferred over general-purpose processors because of their on-chip integration of RAM and peripheral interfaces, which results in smaller and cheaper hardware. Besides execution performance, requirements for microcontroller design concern in particular support for real-time event handling, flexible real-time scheduling strategies, rapid context-switching ability, and small memory requirements. Hard real-time events are never allowed to miss their deadlines; to guarantee the handling of hard real-time events in time, the runtime of the event-handling algorithm must be exactly determinable in processor cycles.

A multithreaded processor is able to pursue multiple threads of control in parallel within the processor pipeline. The functional units are multiplexed between the thread contexts, and most approaches store the thread contexts in separate register sets on the processor chip. Latencies that arise from cache misses, long-running operations or other pipeline hazards are masked by switching to another thread. Thread scheduling has been proposed to optimize the throughput of simultaneous multithreaded (SMT) processors [12, 9] and to schedule soft real-time applications on SMT processors [5]. The EVENTS project [8] introduces thread scheduling for event handling on multithreaded processors by means of an external hardware scheduler. A simple kind of thread scheduling for latency bridging in real-time environments is the round-robin scheduling scheme used in [4].

The Komodo project explores the suitability of hardware multithreading techniques in embedded real-time systems on the basis of a microcontroller, called the Komodo microcontroller [2]. Key features of the Komodo microcontroller are its very rapid context switching and the real-time scheduling algorithms integrated deeply within the pipeline [6]. We therefore propose hardware multithreading as an event-handling mechanism that allows efficient handling of simultaneous, overlapping events with hard real-time requirements. We design a microcontroller with a multithreaded processor core that triggers so-called Interrupt Service Threads (ISTs) instead of Interrupt Service Routines (ISRs) for event handling [3]. The basic idea of the IST concept is that an occurring event activates an assigned thread instead of an ISR, as is done by conventional processors and microcontrollers. The IST concept activates threads directly in hardware: ISTs are mapped directly to the thread slots of the Komodo processor core, and execution of a thread is triggered by an external hardware event. The required real-time scheduling algorithms are embedded within the processor pipeline.
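In software terms, the IST concept can be pictured as binding each external event source to a dedicated thread slot whose handler is activated directly when the event occurs, with no interrupt service routine and no software dispatcher in between. The following Java sketch is purely illustrative (the class name ThreadSlot and the use of a semaphore are assumptions of this model); in the Komodo microcontroller the activation is performed in hardware.

```java
// Illustrative software model of an Interrupt Service Thread (IST).
// Hypothetical names; in Komodo the activation happens in hardware.
import java.util.concurrent.Semaphore;

class ThreadSlot {
    private final Semaphore pending = new Semaphore(0);
    private final Runnable handler;

    ThreadSlot(Runnable handler) { this.handler = handler; }

    // Called on an external event: activates the assigned thread
    // instead of invoking an interrupt service routine.
    void signalEvent() { pending.release(); }

    // Body executed by the thread mapped to this slot.
    void run() throws InterruptedException {
        while (true) {
            pending.acquire();   // wait until an event activates the slot
            handler.run();       // event handling runs as an ordinary thread
        }
    }
}
```

In the hardware implementation the waiting costs nothing: an inactive slot simply contributes no instructions until an external hardware event activates it.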

In the following we present the design of the Komodo microcontroller core and focus in particular on the implementation of the real-time scheduling schemes in hardware. The next section describes the implemented real-time scheduling algorithms, section 3 presents the pipeline core of the Komodo microcontroller, section 4 the implementation of the hardware scheduling algorithms, and section 5 the evaluation. Section 6 concludes the paper.

2 Real-Time Scheduling Algorithms

Applying a multithreaded processor eliminates the latencies of IST activation and context switching and allows an additional optimization: the scheduling can be done by hardware. This avoids a software scheduler call after an IST activation and allows the immediate processing of an occurring event. However, the scheduling scheme must then be implemented in hardware, and the hardware scheduler should provide a scheduling decision within one clock cycle. The following real-time scheduling schemes are adapted to the needs of a multithreaded microcontroller and implemented in the Komodo processor core:

The Fixed Priority Preemptive (FPP) scheme assigns a fixed constant priority to each thread. The processor always executes the thread with the highest priority among all active threads.

The Earliest Deadline First (EDF) scheme [7] executes the thread closest to its deadline; the only parameter needed by this scheme is therefore the deadline. Stankovic et al. [11] show that EDF is an optimal scheme for periodic threads on a single-processor system: it guarantees all deadlines up to a processor utilization of 100%.

The Least Laxity First (LLF) scheme can be considered an extension of EDF. In addition to the deadline, the execution time of each thread is used to calculate its laxity, which is the difference between the remaining time to the deadline and the remaining execution time. The thread with the least laxity gets the processor.

Guaranteed Percentage (GP) [1] is a scheme newly designed for real-time scheduling on multithreaded processors. The basic idea is to statically assign percentages of the available processor time to the threads and to guarantee these percentages within short time intervals. This ensures a definite and predictable progress of the threads and provides isolation of real-time event-handling threads from each other: a thread cannot harm the timing behavior of any other thread. Such an isolation has two advantages over conventional microcontrollers: multiple hard real-time events can be processed by a single microcontroller, and real-time threads can be removed or replaced without affecting the behavior of the remaining threads in the system, so real-time constraints can be kept even during dynamic reconfiguration. Due to its many context switches, the GP scheme is only suitable within a multithreaded processor core with single-cycle context-switching overhead.
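To illustrate how the four schemes differ in their selection rule, the following Java sketch picks the next thread among the active thread slots under each policy. It is a simplified software model: the class and field names (Slot, priority, deadline, remainingTime, grantedCycles) are assumptions made for this example, and the GP case only checks the remaining per-interval budget; the actual single-cycle decision is taken by the hardware priority manager described in section 4.

```java
// Simplified software model of the four scheduling schemes (illustrative only).
import java.util.Comparator;
import java.util.List;

class Slot {
    boolean active;      // thread slot holds an activated IST
    int priority;        // FPP: fixed priority, higher value = more important
    int deadline;        // EDF/LLF: remaining cycles until the deadline
    int remainingTime;   // LLF: remaining execution time in cycles
    int grantedCycles;   // GP: cycles still granted in the current interval
}

enum Scheme { FPP, EDF, LLF, GP }

class SchedulerModel {
    // Returns the slot that gets the processor in this cycle, or null if none is runnable.
    static Slot select(List<Slot> slots, Scheme scheme) {
        var runnable = slots.stream().filter(s -> s.active);
        if (scheme == Scheme.GP) {
            // GP: any thread with budget left in the current interval may run;
            // budgets are reloaded from the configured percentages at each interval start.
            return runnable.filter(s -> s.grantedCycles > 0).findFirst().orElse(null);
        }
        Comparator<Slot> order = switch (scheme) {
            case FPP -> Comparator.comparingInt((Slot s) -> -s.priority);                  // highest fixed priority
            case EDF -> Comparator.comparingInt((Slot s) -> s.deadline);                   // nearest deadline
            default  -> Comparator.comparingInt((Slot s) -> s.deadline - s.remainingTime); // least laxity
        };
        return runnable.min(order).orElse(null);
    }
}
```

Minimizing deadline - remainingTime in the default branch implements the laxity rule described above for LLF.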

3 The Komodo Microcontroller

The Komodo microcontroller consists of a processor core attached to typical controllers such as timer/counter, capture/compare, and serial and parallel interfaces via an I/O bus [10]. In the following we focus on the processor core [2], which is a multithreaded Java processor with a four-stage pipeline. Because of its application in embedded systems, the processor core of the Komodo microcontroller is kept at a simple hardware level.

[Figure 1: The Komodo processor core. The block diagram shows the memory interface, the four pipeline stages (instruction fetch with program counters PC1-PC4 and instruction windows IW1-IW4, instruction decode with the microcode ROM, operand fetch with stack register sets 1-4, and execute/memory/I/O access), the priority manager, the signal unit, and the I/O components.]

Figure 1 shows the multithreaded pipeline enhanced by the priority manager and the signal unit. The pipeline consists of the following four stages: instruction fetch (IF), instruction decode (ID), operand fetch (OF), and execute/memory/I/O access (EXE). These four stages perform the following tasks:

Instruction fetch: If not all instruction windows (IWs) are full, the IF stage tries to fetch a new instruction package from the memory interface. A successfully fetched instruction package is routed to the corresponding instruction window. A fetch is not successful if a memory access occurs at the same time. Each instruction package consists of four bytes; because of the variable length of bytecodes, each package contains from zero up to four bytecodes.

Instruction decode: The decoding of instructions is started after a received instruction package has been written into the corresponding IW. Each IW is organized as an 8-byte transparent FIFO buffer. Every cycle, the priority manager decides which thread will be decoded next. The decoding results in a hardware instruction, executed directly in the execution stage, or it starts a sequence of microcodes; very complex instructions are executed by trap routines. After termination of a microcode sequence, a new instruction is decoded. The design of the microcode unit allows an interleaving of microcode instructions with instructions from other threads.

Operand fetch: In this pipeline stage, the operands needed by the current operation are read from the stack. Because of the stack architecture of Java, many data dependencies occur. To manage this problem without adding latencies, data forwarding has been integrated, which allows result forwarding from the execution stage's output and from memory accesses directly to the input latches of the execution stage.

Execution, memory and I/O access: The execution stage is responsible for all instructions except load/store instructions. It uses the given operands to execute the operation submitted by the decode stage. The result is sent to the stack and to the operand fetch unit for forwarding; an explicit write-back stage is not necessary. In the case of a load/store instruction, the memory is addressed by one of the operands. Address calculation is performed by software; because only physical addresses are used, no additional calculation by hardware is necessary, which means the whole execution cycle is available for the memory access. An I/O access is handled in the same way as a memory access.

One of the main improvements in comparison to other simple pipelines is the context-switching overhead of zero cycles. Such a fast context switch requires the ability to execute different threads within the pipeline. Therefore, a thread tag is routed from one pipeline stage to the next; the tag indicates the thread to which the currently transmitted signals belong. These thread tags establish a chain of tags propagated through the pipeline. The origin of this chain is the priority manager described in the next section.
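The zero-overhead context switch can be made concrete with a small behavioral model: every cycle the priority manager injects the tag of the selected thread into the decode stage, while the tags already in flight move one stage further, so consecutive pipeline stages may be working on different threads. The stage names and the clock method below are assumptions of this sketch, not part of the VHDL design.

```java
// Behavioral sketch of the thread-tag chain (not the VHDL implementation).
// Each cycle the priority manager picks a thread; its tag enters the decode
// stage while older tags move on, so different stages can hold different threads.
import java.util.Arrays;

class TagChain {
    // stage[0] = instruction decode, stage[1] = operand fetch, stage[2] = execute
    private final int[] stage = {-1, -1, -1};   // -1 means bubble / no thread

    // One clock cycle: shift the chain and insert the newly scheduled tag.
    void clock(int scheduledTag) {
        for (int i = stage.length - 1; i > 0; i--) {
            stage[i] = stage[i - 1];
        }
        stage[0] = scheduledTag;
    }

    @Override
    public String toString() { return Arrays.toString(stage); }

    public static void main(String[] args) {
        TagChain pipe = new TagChain();
        int[] decisions = {0, 2, 2, 1, 3};      // e.g. priority-manager output per cycle
        for (int tag : decisions) {
            pipe.clock(tag);
            System.out.println("stages (ID, OF, EXE): " + pipe);
        }
    }
}
```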

4 The Priority Manager

The priority manager (PM) is responsible for the hardware real-time scheduling of the Interrupt Service Threads. To our knowledge, no previous work integrates modern real-time scheduling algorithms into a processor pipeline. Four different priority manager implementations supporting the scheduling algorithms FPP, EDF, LLF and GP were investigated. In spite of the different algorithms, the four implementations are very similar.

Targeting a possible context switch every clock cycle, the priority manager has to perform a scheduling decision each cycle. Its main procedure is split into three phases, shown in figure 2 (a purely illustrative software sketch of these phases is given at the end of this section). In the first phase, a characteristic value (PrioValue in figure 2) is generated for each hardware thread. In the second phase, these values are compared in a comparison tree to determine which thread's instruction is to be executed. The last phase updates the characteristic value of each thread depending on the scheduling decision and the algorithm.

[Figure 2: Implementation of the priority manager. The three phases (generating, determining, actualizing) operate on the PrioValues of the four threads; the determining phase is a comparison tree whose output is the thread tag.]

Figure 3 shows the composition of the characteristic value. The upper four bits are independent of the chosen scheduling algorithm; they represent the thread's state. In particular, these bits indicate whether the thread is active, whether it is waiting for an atomic lock, and whether there are latencies due to the last executed instruction; also indicated is whether the corresponding instruction window contains enough bytes for a complete instruction. Because the comparison tree looks for the lowest value, all these bits except the latency indicator have to be inverted. Threads with latencies, inactive threads, threads waiting for an atomic lock, or threads with empty instruction windows thereby get the lowest priority.

[Figure 3: Composition of the characteristic value. The scheme-independent status bits (latency indicator, not(IW full), not(waiting for atomic lock), not(active)) are combined with the scheme-specific PrioValue.]

The rest of the characteristic value, the PrioValue, depends on the chosen scheduling scheme and is defined as follows:

FPP: The PrioValue is the fixed priority of the thread, as stated by the programmer. In the case of four threads, four priority levels are sufficient, so the PrioValue is 2 bits wide.

EDF: When a thread is activated, its deadline is stored in the PrioValue. By comparing these values, the PM determines the thread with the lowest PrioValue, that is, the thread with the nearest deadline. During the third phase of the PM, all PrioValues are decremented, because each thread gets closer to its deadline. The width of the PrioValue depends on the maximum deadline length.

LLF: This algorithm is very similar to EDF. The difference is an additional value, the runtime of each thread: the PrioValue is given by the difference between deadline and runtime, which is called the laxity. The runtime has the same width as the deadline and is decremented each time the corresponding thread is decoded.

GP: When a new interval is entered, the PrioValue of each loaded thread is initialized with the number of cycles given by the GP parameter. We chose an interval length of 100 cycles, which allows the PrioValue to be loaded simply with the percentage. During the third phase, the PM decrements the PrioValue of the thread just selected; additionally, the PrioValues of all threads that have a latency greater than 0 are decremented, because in the analytical model these threads are considered to be executing.

By integrating the PM into the decode stage, the whole task of the decode stage can be divided into six steps:

1. Storing the instruction package into the IW
2. Generating the characteristic thread values
3. Determining the thread tag for decoding
4. Updating the PrioValues
5. Decoding the instruction
6. Updating the IW

Once the number of bytes in the IWs has been calculated, the second step can be overlapped with the first step, and steps five and six can be executed in parallel to step four.
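The per-cycle decision of the PM can be summarized in software terms. The sketch below builds, for each of four threads, a characteristic value in which the four status bits (encoded so that unfavourable states compare as large) precede the scheme-specific PrioValue, selects the minimum with a small comparison tree, and applies the EDF update of the third phase. Bit order, field widths and method names are assumptions for illustration; the synthesized VHDL priority manager is not derived from this code.

```java
// Illustrative model of the priority manager's per-cycle decision (EDF variant).
// Bit layout, widths and names are assumptions; the real PM is a VHDL comparison tree.
class PriorityManagerModel {
    static final int PRIO_BITS = 16;   // assumed width of the PrioValue field

    // Phase 1: build the characteristic value for one thread. Status bits are
    // encoded so that unfavourable states yield large values and thus lose the
    // minimum comparison (i.e. such threads get the lowest priority).
    static int characteristic(boolean latencyPending, boolean iwHasInstruction,
                              boolean waitingForLock, boolean active, int prioValue) {
        int status = ((latencyPending   ? 1 : 0) << 3)   // pending latency
                   | ((iwHasInstruction ? 0 : 1) << 2)   // no complete instruction in the IW
                   | ((waitingForLock   ? 1 : 0) << 1)   // waiting for an atomic lock
                   | ( active           ? 0 : 1);        // inactive thread
        return (status << PRIO_BITS) | (prioValue & ((1 << PRIO_BITS) - 1));
    }

    // Phase 2: comparison tree over four threads; the lowest value wins.
    static int selectThread(int[] c) {
        int left  = (c[0] <= c[1]) ? 0 : 1;
        int right = (c[2] <= c[3]) ? 2 : 3;
        return (c[left] <= c[right]) ? left : right;
    }

    // Phase 3 (EDF): every thread moves one cycle closer to its deadline,
    // so all PrioValues are decremented (clamped at zero in this model).
    static void updateEdf(int[] prioValue) {
        for (int i = 0; i < prioValue.length; i++) {
            if (prioValue[i] > 0) prioValue[i]--;
        }
    }
}
```

Called once per simulated cycle, selectThread plays the role of the comparison tree of figure 2; in the hardware all three phases fit into a single clock cycle.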

5 Evaluations

We developed a software simulator of the Komodo processor core for performance estimation, as well as an FPGA and an ASIC prototype of the whole Komodo microcontroller. In the following we present results of the software evaluations and of an ASIC-directed synthesis of the different scheduling algorithms to assess signal runtimes and chip area requirements.

We use three real-time applications as benchmarks: an impulse counter (IC), a PID element (proportional, integral, derivative element), and an FFT algorithm (FFT). These benchmarks are programmed in Java and compiled to Java bytecode. The latency assumptions are three cycles for branches and two cycles for memory transfers and for writes to special registers.

[Figure 4: Measurements with the DifMix benchmark. Gain (0.00 to 3.00) of the baseline, multithreading without latency hiding, and multithreading with latency hiding, shown for FPP, EDF, GP, and LLF.]

Figure 4 shows the results of our measurements using one IC, one FFT and two PID threads in the four thread slots (DifMix benchmark). The deadlines of the threads were shortened until a deadline miss occurred. As baseline processor, we chose a model of the single-threaded picoJava-II with an assumed context-switching overhead of 100 cycles and no ability to hide latencies. The multithreaded model with no context-switching costs but without the ability to use latencies is dedicated towards hard real-time applications, whereas the multithreading-with-latency-hiding model speeds up soft real-time applications. All results are normalized to the baseline FPP version.

The measurements of the multithreaded version without latency bridging are important for hard real-time environments, because the amount of utilizable latencies depends completely on the software: we cannot guarantee latency utilization, but if any latencies are available, they will be bridged by executing other threads. A performance increase of 1.2 to 1.6 is reached for hard real-time applications due to multithreading and the resulting fast context switching. A further performance gain of 1.8 to 2.6 is reached for soft or non real-time applications by latency hiding (for further simulation results see [6]).

The next step was a synthesis of the different scheduling algorithms using the DesignCompiler from Synopsys and the UMC18 library from Virtual Silicon for a 0.18 micron ASIC technology, leading to the clock frequencies and sizes shown in table 1. These measurements were made with priority managers supporting 4, 8 or 16 threads. In view of the reached frequencies, we show the feasibility of using the priority manager within state-of-the-art microcontroller systems.

Table 1: Run times and sizes using the UMC18 technology

             4 threads            8 threads            16 threads
        MHz   size [mm²]     MHz   size [mm²]     MHz   size [mm²]
  FPP   610   0.017          277   0.033          113   0.055
  EDF   327   0.068          162   0.129           81   0.262
  LLF   325   0.075          160   0.150           80   0.301
  GP    274   0.045          165   0.074           84   0.144

6 Conclusions

This paper presents a Java-based real-time multithreaded microcontroller. We base our Interrupt Service Thread (IST) concept on the idea of handling events by threads, utilizing the fast context switching of multithreaded processors. Up to now such processors have been designed for latency hiding and throughput increase; in contrast, our Komodo microcontroller core applies hardware multithreading to fast real-time event handling. Moreover, we investigated the behavior of real-time scheduling in combination with the multithreaded processor technique. Because the Komodo microcontroller performs a context switch without any switching overhead, we implemented several well-known scheduling techniques in hardware (FPP, EDF, LLF, and GP).

We showed the feasibility of a hardware real-time scheduler integrated deeply into the processor pipeline with a VHDL design and its synthesis. Our evaluations show a performance increase of 1.2 to 1.6 for hard real-time applications due to the fast context-switch ability of multithreading and a 1.8 to 2.6 speedup for soft or non real-time applications by latency hiding. We also show that even for the complex scheduling algorithms EDF, LLF, and GP a scheduling decision is possible within one processor cycle of a 327 MHz, 325 MHz, resp. 274 MHz processor with four threads. With respect to real-time scheduling on a multithreaded microcontroller, the LLF (Least Laxity First) scheme outperforms the FPP (Fixed Priority Preemptive), EDF (Earliest Deadline First), and GP (Guaranteed Percentage) schemes. Only GP, however, allows isolation of threads.

The next step is to redesign the Komodo microcontroller with the aim of reducing power consumption and to implement it as an ASIC prototype. The microcontroller will be applied to control an autonomous guided vehicle in order to test it in an industrial environment.

References

[1] U. Brinkschulte, J. Kreuzinger, M. Pfeffer, and Th. Ungerer. A Scheduling Technique Providing a Strict Isolation of Real-time Threads. In Seventh IEEE International Workshop on Object-oriented Real-time Dependable Systems (WORDS), San Diego, CA, January 2002.

[2] U. Brinkschulte, C. Krakowski, J. Kreuzinger, and Th. Ungerer. A Multithreaded Java Microcontroller for Thread-Oriented Real-Time Event-Handling. In International Conference on Parallel Architectures and Compilation Techniques (PACT 99), Newport Beach, pages 34-39, October 1999.

[3] U. Brinkschulte, C. Krakowski, J. Kreuzinger, and Th. Ungerer. Interrupt Service Threads - A New Approach to Handle Multiple Hard Real-Time Events on a Multithreaded Microcontroller. In RTSS WIP sessions, Phoenix, pages 11-15, December 1999.

[4] B. Cogswell and Z. Segall. MACS: A Predictable Architecture for Real Time Systems. In IEEE Real-Time Systems Symposium, pages 296-305, 1991.

[5] R. Jain, Ch. J. Hughes, and S. V. Adve. Soft Real-Time Scheduling on Simultaneous Multithreaded Processors. In 23rd IEEE International Real-Time Systems Symposium, December 2002.

[6] J. Kreuzinger, A. Schulz, M. Pfeffer, Th. Ungerer, U. Brinkschulte, and C. Krakowski. Real-time Scheduling on Multithreaded Processors. In The 7th International Conference on Real-Time Computing Systems and Applications (RTCSA 2000), Cheju Island, South Korea, pages 155-159, December 2000.

[7] C. L. Liu and J. W. Layland. Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment. Journal of the ACM, 20(1):46-61, 1973.

[8] K. Lüth, A. Metzner, T. Peikenkamp, and J. Risau. The EVENTS Approach to Rapid Prototyping for Embedded Control Systems. In Zielarchitekturen eingebetteter Systeme, 14. ITG/GI Fachtagung Architektur von Rechnersystemen, Rostock, pages 45-54, September 1997.

[9] S. Raasch and S. Reinhardt. Applications of Thread Prioritization in SMT Processors. In Proceedings of the 1999 Multithreaded Execution, Architecture and Compilation Workshop (MTEAC), January 1999.

[10] S. Uhrig, U. Brinkschulte, M. Pfeffer, and Th. Ungerer. Connecting Peripheral Interfaces to a Multithreaded Java Microcontroller. In Workshop on Java in Embedded Systems, ARCS 2002, Karlsruhe, April 2002.

[11] J. A. Stankovic, M. Spuri, K. Ramamritham, and G. C. Buttazzo. Deadline Scheduling for Real-Time Systems: EDF and Related Algorithms. Kluwer Academic Publishers, Dordrecht/Norwell, 1998.

[12] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In The 23rd International Symposium on Computer Architecture (ISCA), Philadelphia, Pennsylvania, pages 191-202, May 1996.