Real-time Scheduling on Multithreaded Processors

J. Kreuzinger, A. Schulz, M. Pfeffer, Th. Ungerer
Institute for Computer Design and Fault Tolerance
University of Karlsruhe
D-76128 Karlsruhe, Germany

U. Brinkschulte, C. Krakowski
Institute for Process Control, Automation and Robotics
University of Karlsruhe
D Karlsruhe, Germany

ira.uka.de

Abstract

This paper investigates real-time scheduling algorithms on upcoming multithreaded processors. As an evaluation testbed we introduce a multithreaded processor kernel that is specifically designed as the core processor of a microcontroller or system-on-a-chip. External real-time events are handled through multithreading: real-time threads are used as interrupt service threads (ISTs) instead of interrupt service routines (ISRs). Our proposed microcontroller supports multiple ISTs with zero-cycle context switching overhead. We investigate the behavior of fixed priority preemptive (FPP), earliest deadline first (EDF), least laxity first (LLF), and guaranteed percentage (GP) scheduling on multithreaded processors. Our finding is that the GP and LLF strategies result in a good blending of instructions of different threads, thus enabling a multithreaded processor to utilize latencies best. Assuming a zero-cycle context switch, LLF performs best; however, its implementation costs are prohibitive.

1 Introduction

The target market of our project is the widespread market of embedded systems, in particular embedded real-time systems. In this area, microcontrollers are typically preferred over general-purpose processors because of their on-chip integration of RAM and peripheral controllers, which results in smaller and cheaper hardware. Execution performance is not the main criterion for microcontrollers; support for real-time event handling, rapid context switching, and small memory requirements are also essential.
Rapid context switching is a basic feature of the multithreaded processor technique, which has been investigated for several years for its ability to utilize latencies. A multithreaded processor is able to pursue multiple threads of control in parallel within the processor pipeline. The functional units are multiplexed between the thread contexts. Most approaches store the thread contexts in different register sets on the processor chip. Latencies that arise from cache misses, long-running operations, or other pipeline hazards are masked by switching to another thread. Multithreaded processors bridge these latencies efficiently if the workload offers enough threads that can execute in parallel and if the time needed to switch threads is very small.

Recently, industry has announced several multithreaded processors, among them a 4-threaded Alpha processor by DEC/Compaq [1] and Sun's MAJC-5200 processor, which features two 4-threaded processors on a single die [2]. Both are designed as high-performance processors and will not be suitable for low-cost embedded systems.

Our Komodo project [3] explores the suitability of multithreading techniques in embedded real-time systems. We propose multithreading as an event handling mechanism that allows efficient handling of simultaneous, overlapping events with hard real-time requirements. We design a microcontroller with a multithreaded processor core that triggers so-called Interrupt Service Threads (ISTs) instead of Interrupt Service Routines (ISRs) for event handling [4]. Our Komodo microcontroller features a zero-cycle context switch overhead and hardware support for priority schemes.
Because it targets embedded systems, the processor core of the Komodo microcontroller is kept at the hardware level of a simple microcontroller similar to the M. Our target architecture is a simple pipelined processor kernel that issues one instruction per cycle.

Recently, multithreading has also been proposed for handling internal events ([5], [6], [7]) in future high-end processors, which apply one or more threads for exception handling and execute these threads simultaneously with the main thread that caused the exception. However, the fast context switching ability of multithreading has rarely been explored in the context of microcontrollers for handling external hardware events. Besides our own approach, the EVENTS mechanism [8] proposes an FPGA-based, processor-external hardware scheduler that triggers context switches in one or more multithreaded MSparc processors [9].

This paper investigates real-time scheduling algorithms suitable for multithreaded processors and presents performance evaluations on our evaluation testbed, a multithreaded Java microcontroller called Komodo.

2 The proposed Komodo microcontroller

The Komodo microcontroller [10] is a multithreaded Java microcontroller that supports multiple ISTs with zero-cycle context switching overhead and several priority schemes. Because it targets embedded systems, its processor core is kept at the hardware level of a simple scalar processor. As shown in Fig. 1, the four-stage pipelined processor core consists of an instruction fetch unit, a decode unit, a memory access unit (MEM), and an execution unit (ALU). Four stack register sets are provided on the processor chip. A signal unit triggers IST execution when external signals occur.

Figure 1. Block diagram of the Komodo microcontroller

The instruction fetch unit holds four program counters (PCs) with dedicated status bits (e.g., thread active/suspended); each PC is assigned to a different thread. Four-byte portions are fetched over the memory interface and put into the corresponding instruction window (IW). A fetch portion may contain several instructions, because the average bytecode length is 1.8 bytes. Instructions are fetched depending on the fill levels of the IWs, which is sufficient as an instruction fetch strategy [11]. The instruction decode unit contains the above-mentioned IWs, dedicated status bits (e.g., priority), and counters.
A priority manager decides, based on these bits and counters, from which IW the next instruction will be decoded. We define several priority schemes to handle real-time requirements. Specifically, we implemented the fixed priority preemptive (FPP), earliest deadline first (EDF), least laxity first (LLF), and guaranteed percentage (GP) scheduling schemes. The priority manager applies one of these schemes for IW selection. However, latencies may result from branches or memory accesses. To avoid pipeline stalls, instructions from threads other than the highest-priority thread can be fed into the pipeline. The decode unit predicts the latency after such an instruction and proceeds with instructions from other IWs. Such a context switch incurs no overhead: no save/restore of registers or removal of instructions from the pipeline is needed, because each thread has its own stack register set.

A bytecode instruction is decoded into a single micro-op or a sequence of micro-ops, or a trap routine is called. Each opcode is propagated through the pipeline together with its thread id. Opcodes from multiple threads can be present simultaneously in the different pipeline stages. Memory access instructions are executed by the MEM unit and all other instructions by the ALU unit. Finally, the result is written back to the stack register set of the corresponding thread.

External signals are delivered to the signal unit from the peripheral components of the microcontroller core, e.g., timer, counter, or serial interface. The occurrence of such a signal activates the corresponding IST. As soon as an IST activation ends, its assigned real-time thread is suspended and its status is stored. An external signal may activate the same thread again.

In our current implementation, the Komodo microcontroller holds the contexts of up to four threads, which are directly mapped to hardware threads.
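To illustrate how a priority manager of this kind could choose among the instruction windows, the following Python sketch models the selection as a software function. It is a minimal sketch under stated assumptions, not the Komodo hardware: the `HwThread` record, its fields, the `select_thread` function, and the convention that a smaller key means higher urgency are hypothetical names introduced here for illustration. The three key functions correspond to FPP, EDF, and LLF; GP is left out because it needs per-interval accounting.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class HwThread:
    """Hypothetical model of one hardware thread slot (illustration only)."""
    tid: int
    fixed_priority: int   # FPP key; assumption: smaller value = more urgent
    deadline: int         # absolute deadline, in cycles
    remaining: int        # estimated remaining execution time, in cycles
    active: bool = True   # False while the thread is suspended
    stall_until: int = 0  # first cycle at which a predicted latency is over


def fpp_key(t: HwThread, now: int) -> int:
    return t.fixed_priority


def edf_key(t: HwThread, now: int) -> int:
    return t.deadline


def llf_key(t: HwThread, now: int) -> int:
    # Laxity: time left until the deadline minus the work still to do.
    return t.deadline - now - t.remaining


def select_thread(threads: List[HwThread], now: int,
                  key: Callable[[HwThread, int], int]) -> Optional[HwThread]:
    """Pick the most urgent thread that can issue in cycle `now`.

    Threads that are suspended, finished, or inside a predicted latency
    are skipped, so a less urgent thread may fill the slot.
    """
    ready = [t for t in threads
             if t.active and t.remaining > 0 and t.stall_until <= now]
    if not ready:
        return None
    # Ties are broken by thread id to keep the choice deterministic.
    return min(ready, key=lambda t: (key(t, now), t.tid))
```

Passing a different key function switches the scheme. A thread whose `stall_until` lies in the future is skipped, which mirrors how instructions of lower-priority threads fill the latency slots of the highest-priority thread.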
Three of the four hardware threads may be real-time threads; all remaining threads must be non-real-time and are scheduled within the fourth hardware thread. To scale up to larger systems with more than three real-time threads, we propose parallel execution on several microcontrollers connected by a middleware platform called OSA+ [3].

Because cache accesses are unpredictable, non-cached memory access is preferred for real-time microcontrollers. The resulting load latencies are bridged by the priority manager, which schedules instructions of other threads. The Komodo processor has been simulated in software and implemented in hardware on a Xilinx FPGA, yielding chip-space requirements of about gates for a four-threaded processor kernel [12].

3 Evaluation

In this section we evaluate the timing behavior and the use of latency slots by real-time scheduling strategies on a multithreaded processor. We examine the four scheduling techniques Earliest Deadline First (EDF), Least Laxity First (LLF), Fixed Priority Preemptive (FPP), and Guaranteed Percentage (GP). As benchmarks we choose real application programs that are typical for real-time systems. The first program is a simple impulse counter (IC), which reads data from an interface, scales it, and stores it in memory. The other two programs are a PID element (PID) and a rather costly Fast Fourier Transform (FFT). Our testbed is the Komodo microcontroller with four hardware threads and a zero-cycle context switch. Latencies from memory accesses and branches are bridged by instructions of threads other than the highest-priority thread.

In the first part of the evaluation, we executed four identical programs on the processor. All four threads were given the same real-time parameters (deadline = period, starting processor utilization = 0.25 for each thread). The common deadline was then shortened until the scheduler could no longer meet it. Figure 2 compares the results of the different schedulers. The presentation is scaled to a non-multithreaded processor, i.e., a value of 1 corresponds to the performance of a processor that does not utilize latencies but also needs no additional clock cycles for a context switch.

Figure 2. Speed-up of the computation times of different schedulers with the same threads (benchmarks: IC, PID, FFT)

The multithreaded processor increases the speed-up for all benchmark programs and scheduling schemes and thus raises the achievable sample rates. All scheduling strategies provide the same speed-up for the impulse counter (IC). The reason is that the IC program is extremely short, so differences between the scheduling strategies cannot show up.

More interesting are the PID element and the FFT, which yield different speed-ups depending on the scheduling strategy. The differences are caused by the following behavior, which is typical for multithreaded processors: the performance gain arises from utilizing instruction latencies by switching context to instructions of another thread. To be effective, a pool of executable instructions of different threads must be present. Techniques like FPP or EDF tend to shrink this pool, because the most urgent thread is executed first, then the second most urgent thread, and so on. Figure 3 depicts this behavior for EDF (or FPP) scheduling of four threads with the same code. Assume that all four threads start at time zero. Up to time t1, the most urgent thread is executed with highest priority and the other three threads are ready for execution. To utilize the instruction latencies that arise in the execution of the most urgent thread, the processor can switch to instructions of one of the other three threads. After time t1, however, only two threads remain for this purpose, after t2 only one, and after t3 none is left to use the latencies of the last running thread.

LLF and GP perform better than FPP and EDF for the PID and FFT programs. Figure 4 shows that under LLF scheduling all threads keep executable instructions until they terminate simultaneously. The equal deadlines and the permanently changing least laxities induce a high number of context switches. This provides an instruction mix that keeps the threads alive for as long as possible and so creates optimal conditions for using latency slots on the multithreaded processor. GP creates a similar behavior through its frequent context switches. Figure 5 shows the frequency of context switches caused by the different strategies.
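The pool-thinning effect described in this section can be reproduced with a toy model. The Python sketch below is a rough illustration under assumptions that are not taken from the paper: a single-issue pipeline, identical threads, and a hypothetical latency of `latency` cycles after every `period`-th instruction of a thread. The policy name `"blockwise"` stands in for EDF or FPP on identical threads (the most urgent thread always goes first), and `"interleaved"` stands in for LLF with equal deadlines (the thread with the most remaining work goes first). All names and parameters are made up for this example.

```python
def simulate(policy: str, n_threads: int = 4, instrs: int = 100,
             period: int = 4, latency: int = 2) -> int:
    """Return the number of cycles until all threads have finished."""
    remaining = [instrs] * n_threads   # instructions left per thread
    issued = [0] * n_threads           # instructions issued per thread
    stall_until = [0] * n_threads      # first cycle a thread may issue again
    cycle = 0
    while any(remaining):
        ready = [t for t in range(n_threads)
                 if remaining[t] > 0 and stall_until[t] <= cycle]
        if ready:
            if policy == "blockwise":
                # Lowest thread id is the most urgent; others only fill gaps.
                t = min(ready)
            elif policy == "interleaved":
                # Most remaining work first, which keeps all threads alive.
                t = max(ready, key=lambda i: (remaining[i], -i))
            else:
                raise ValueError(f"unknown policy: {policy}")
            remaining[t] -= 1
            issued[t] += 1
            if issued[t] % period == 0:
                # Assumed latency: this thread cannot issue for a while.
                stall_until[t] = cycle + 1 + latency
        cycle += 1  # an empty `ready` list means an unused latency slot
    return cycle


if __name__ == "__main__":
    # Baseline: the four threads run back to back without latency use.
    baseline = 4 * simulate("blockwise", n_threads=1)
    for policy in ("blockwise", "interleaved"):
        cycles = simulate(policy)
        print(policy, cycles, round(baseline / cycles, 2))
```

In this model the block-wise policy fills latency slots only while other threads still have work, so the last running thread leaves its slots empty. The interleaved policy keeps all threads alive until the end and therefore needs fewer cycles. The absolute numbers depend entirely on the assumed parameters and say nothing about the measured Komodo results.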
Figure 3. Four equal threads with EDF scheduling (threads T1 to T4, context switches, and number of ready threads over time)

Figure 4. Four equal threads with LLF scheduling

Figure 5. Context switches of different schedulers (benchmarks: IC, PID, FFT)

From these considerations we conclude the following requirements for an optimal real-time scheduler on a multithreaded processor that is able to utilize instruction latencies. The scheduler must sustain each thread as long as possible, i.e., up to its deadline. Given a zero-cycle context switching overhead, it is a quality factor for a good scheduler that it causes a high number of context switches. This creates an instruction mix that keeps the threads alive as long as possible.

Footnote: Processor utilization = execution time without latency utilization / deadline.
Footnote: The highest-priority thread executes as on a non-multithreaded processor, which allows the worst-case execution time to be computed as usual.

The second experiment uses all three programs and an additional non-real-time thread. We assume that deadlines equal periods and use a starting processor utilization of 0.3 for each real-time thread. We fix the deadlines of the impulse counter and the FFT and shorten the deadline of the PID element until the first deadline is missed. The priorities for FPP are assigned according to rate monotonic analysis. The implementation of GP on the Komodo microcontroller defines three priority classes: exact, minimal, and non-real-time. Class exact causes a thread to receive exactly the requested percentage, no more and no less. In class minimal, a thread gets at least the requested percentage but may get more. For GP, the impulse counter and the FFT therefore belong to class exact and the PID element to class minimal. The start conditions are 30% of execution time for every real-time thread and 10% for the non-real-time thread.

Figure 6. Speed-ups of the workload with mixed application programs (schedulers: FPP, EDF, LLF, GP)

Figure 6 shows the results of our experiment. Again, all scheduling algorithms profit from the multithreaded processor. Remarkably, LLF does not perform better than FPP or EDF in this experiment. This, too, can be explained by the mixture of threads: because the execution times and corresponding deadlines of the threads differ strongly, the thread with the least laxity stays the same over a long period.
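The dependence of LLF on the deadline mix can be checked with a few lines of Python. The sketch below is a simplified illustration, not the Komodo scheduler: it assumes one instruction per cycle, no latencies, laxity computed as deadline minus current time minus remaining execution time, and ties broken by thread name. The function names `llf_trace` and `count_switches` and the example thread sets are hypothetical.

```python
from typing import Dict, List, Tuple


def llf_trace(threads: Dict[str, Tuple[int, int]]) -> List[str]:
    """Run LLF cycle by cycle and return the name of the thread per cycle.

    `threads` maps a thread name to (absolute deadline, execution time),
    both in cycles.
    """
    remaining = {name: exec_time for name, (_, exec_time) in threads.items()}
    trace: List[str] = []
    now = 0
    while any(remaining.values()):
        runnable = [n for n in remaining if remaining[n] > 0]
        # Least laxity first; the name breaks ties deterministically.
        chosen = min(
            runnable,
            key=lambda n: (threads[n][0] - now - remaining[n], n),
        )
        remaining[chosen] -= 1
        trace.append(chosen)
        now += 1
    return trace


def count_switches(trace: List[str]) -> int:
    """Count how often consecutive cycles belong to different threads."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a != b)


# Equal deadlines: the laxities keep overtaking each other.
equal = llf_trace({"A": (100, 20), "B": (100, 20)})
# Strongly different deadlines: one thread has the least laxity throughout.
mixed = llf_trace({"A": (30, 20), "B": (400, 20)})
print(count_switches(equal), count_switches(mixed))
```

With these hypothetical inputs, the two threads with equal deadlines alternate in every cycle, whereas the pair with strongly different deadlines switches only once, because thread A keeps the least laxity until it has finished.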
LLF therefore behaves much like EDF in this experiment, with nearly the same number of context switches for both.

The behavior of the GP scheduler is unexpected. In principle GP should be an ideal scheduler, because the threads in class exact are kept active until the deadline arrives. A thread that needs 10 ms of execution time and has a deadline of 40 ms terminates, with a share of 25%, exactly at its deadline. The drawback of GP in this experiment lies in the current implementation. The scheduler distributes the shares for the threads in intervals of 100 cycles. Within each interval, a priority order determines the next thread. First, the threads of class exact are scheduled in order of the cycles they need. Then the threads of class minimal and the non-real-time threads are taken for execution. Threads that are blocked or waiting on latencies are excluded from the schedule. Figure 7 shows the typical execution sequence of the four given threads. After the two exact threads terminate (after about 60 cycles, depending on the usage of latencies), only the non-real-time thread can utilize the latencies of the PID element. Therefore, the non-real-time thread gets many more cycles than under LLF or EDF, and the performance for handling real-time events drops. In this case, the number of executable threads always decreases toward the end of each interval, and even though the number of context switches is high, the mixture of threads is poor. This observation leads to the conclusion that a high number of context switches is only an indication of a good scheduling algorithm on a multithreaded processor, not a guarantee.

Figure 7. Thread execution within an interval (exact (FFT), exact (IC), minimal (PID), non-real-time; interval of 100 cycles)

Another essential point is the overhead introduced by the various scheduling techniques. To reach a zero-cycle context switch on a multithreaded processor, the scheduler must decide within a single processor cycle which instruction to issue next. The prototype implementation of the Komodo microcontroller in an FPGA showed that FPP has by far the lowest implementation cost. GP and EDF follow with similar costs. LLF has the highest implementation cost.

GP and LLF profit from fast context switching and yield good performance results under the assumption of a zero-cycle context switching overhead. These strategies produce a high number of context switches, which allows an excellent blending of threads and therefore an optimal latency utilization. However, their performance deteriorates quickly when context switching costs increase.

4 Conclusions

Multithreaded processors with the ability of very fast context switching offer a new challenge to real-time scheduling policies. First, scheduling strategies like EDF, LLF, and GP may be implemented without thread switching overhead. Second, multithreaded processors may switch context to another thread to increase performance by utilizing latencies caused by memory accesses or branch instructions. Latency utilization in a multithreaded processor can increase processor performance over 100% compared to a non-multithreaded processor. However, latency utilization is an additional performance gain that cannot be guaranteed for hard real-time event handling.

To utilize latencies efficiently, a pool of executable instructions of different threads is needed. Classical real-time scheduling policies like EDF or FPP tend to thin out this pool by executing the instructions of a thread block-wise: the most urgent thread first, then the second most urgent, and so on. This produces a minimal number of context switches, which is a good choice on conventional processors.
On a multithreaded processor, however, too few instructions of ready threads may then remain to bridge the latencies that occur. A real-time scheduling policy that is optimal in the sense of bridging latencies on multithreaded processors must keep each thread alive as long as possible. This means that the execution time of a thread must be stretched up to its deadline. LLF and GP are still not optimal, however. LLF thins out the thread pool like EDF or FPP when deadlines differ strongly. GP may be a candidate, but the current implementation produces the same problem. An optimal policy therefore still has to be found.

The work described in this paper is intended as a basis for further research on real-time scheduling on envisioned future multithreaded processors and microcontrollers. By modifying the well-known real-time scheduling policies, the architectural features of such processors can be used more efficiently.

References

[1] J. Emer. Simultaneous Multithreading: Multiplying Alpha's Performance. Microprocessor Forum 1999, San Jose, CA, October.
[2] L. Gwennap. MAJC Gives VLIW a New Twist. Microprocessor Report, Vol. 13, No. 12, September.
[3] U. Brinkschulte, C. Krakowski, J. Kreuzinger, R. Marston, and T. Ungerer. The Komodo Project: Thread-Based Event Handling Supported by a Multithreaded Java Microcontroller. 25th EUROMICRO Conference, Milano, September.
[4] U. Brinkschulte, C. Krakowski, J. Kreuzinger, and Th. Ungerer. Interrupt Service Threads - A New Approach to Handle Multiple Hard Real-time Events on a Multithreaded Microcontroller. The 20th IEEE Real-Time Systems Symposium, Phoenix, Arizona, December 1-3.
[5] R. S. Chappell, J. Stark, S. P. Kim, S. K. Reinhardt, and Y. N. Patt. Simultaneous Subordinate Microthreading (SSMT). ISCA 26 Proceedings, Atlanta, Georgia, Vol. 27, No. 2, May.
[6] S. W. Keckler, A. Chang, W. S. Lee, and W. J. Dally. Concurrent Event Handling through Multithreading. IEEE Transactions on Computers, Vol. 48, No. 9, September 1999.
[7] C. B. Zilles, J. S. Emer, and G. S. Sohi. The Use of Multithreading for Exception Handling. MICRO-32, Haifa, November 1999.
[8] K. Lüth, A. Metzner, T. Peikenkamp, and J. Risau. The EVENTS Approach to Rapid Prototyping for Embedded Control Systems. Zielarchitekturen eingebetteter Systeme, 14. ITG/GI Fachtagung Architektur von Rechnersystemen, Rostock.
[9] W. Damm and A. Mikschl. MSPARC: a multithreaded SPARC. Euro-Par 96 Parallel Processing: Second International Euro-Par Conference, Vol. II, LNCS 1124, Springer Verlag.
[10] U. Brinkschulte, C. Krakowski, J. Kreuzinger, and T. Ungerer. A Multithreaded Java Microcontroller for Thread-oriented Real-time Event-Handling. International Conference on Parallel Architectures and Compilation Techniques (PACT 99), Newport Beach, CA, October.
[11] J. Kreuzinger, M. Pfeffer, A. Schulz, T. Ungerer, U. Brinkschulte, and C. Krakowski. Performance Evaluations of a Multithreaded Java Microcontroller. PDPTA'00, Las Vegas, Nevada, USA, Vol. 1, June.
[12] J. Kreuzinger, R. Zulauf, A. Schulz, T. Ungerer, M. Pfeffer, U. Brinkschulte, and C. Krakowski. Performance Evaluations and Chip-Space Requirements of a Multithreaded Java Microcontroller. The Second Annual Workshop on Hardware Support for Objects and Microarchitectures for Java, in conjunction with ICCD 2000, Austin, Texas, September.
