Interrupt Service Threads - A New Approach to Handle Multiple Hard Real-Time Events on a Multithreaded Microcontroller

U. Brinkschulte, C. Krakowski
Institute of Process Control, Automation and Robotics, University of Karlsruhe, D-76128 Karlsruhe, Germany

J. Kreuzinger, Th. Ungerer
Institute of Computer Design and Fault Tolerance, University of Karlsruhe, D-76128 Karlsruhe, Germany

Abstract

We propose a new event handling mechanism based on a multithreaded microcontroller that allows efficient handling of concurrent events with hard real-time requirements. Real-time threads are used as interrupt service threads (ISTs) instead of interrupt service routines (ISRs) and are executed on a multithreaded microcontroller. Several thread priority schemes are managed in hardware; in particular, we propose a guaranteed percentage scheme where each real-time thread is assigned a rate of the full processing power. We present an analytical evaluation of the IST technique and the guaranteed percentage scheme using the real-time requirements of an autonomous guided vehicle. The evaluation shows that the ISR solution with the fixed priority preemptive scheme is not able to guarantee the specified deadlines, in contrast to the IST solution, which even offers a reserve of 5%. Moreover, in our case guaranteed percentage scheduling offers an advantage over earliest deadline first scheduling if not only deadlines but also data rates must be met. When calculating the maximum vehicle speed that does not violate the real-time constraints, ISTs dominate ISRs with a speed increase of 28%.

1 Introduction

The market for embedded systems is growing rapidly, as can be seen from the number of microcontrollers sold. A special requirement of many embedded systems is real-time behavior. Usually interrupt service routines (ISRs) are used to implement event handling. However, when several hard real-time events that arrive in an irregular time pattern must be serviced, ISRs are cumbersome to program and may even miss hard deadlines because of non-optimal processor utilization. We therefore propose interrupt service threads (ISTs) as a new, efficient, hardware-supported event handling mechanism which simplifies and improves the handling of concurrent events with hard real-time requirements. To this end, we propose a multithreaded microcontroller that supports multiple ISTs, has zero-cycle context switching overhead, and triggers ISTs directly in hardware. A hardware unit within the microcontroller, called the priority manager, manages several thread priority schemes. We propose a guaranteed percentage scheme where each real-time thread is assigned a rate of the full processor power.

Section 2 introduces our thread-based real-time event handling by ISTs in combination with thread priority strategies, in particular the guaranteed percentage scheme. Section 3 introduces multithreaded processor techniques and presents our multithreaded microcontroller model. The benefits of using ISTs instead of ISRs and of the guaranteed percentage scheme for handling concurrent events with hard real-time requirements are evaluated in section 4 with respect to an autonomous guided vehicle.

2 Thread-based Real-time Event Handling

The conventional method of event handling on today's processors and microcontrollers is event handling by interrupt service routines (ISRs). Events of different priorities are handled by ISRs with corresponding priorities. The generally used priority scheme is fixed priority preemptive.
This priority scheme has several disadvantages. First, fixed priority preemptive scheduling restricts the guaranteed processor utilization to about 70% [1]. Furthermore, the handling of lower-priority events can be blocked for a long time by higher-priority events. This forces the programmer to keep ISRs as short as possible and to move work outside the ISR. If several concurrent time-critical events must be handled in this way, the resulting programs are complex and hard to test.
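For n independent periodic tasks, the classical bound of Liu and Layland [1] guarantees all deadlines under fixed (rate-monotonic) priorities whenever the total utilization satisfies

\[ U = \sum_{i=1}^{n} \frac{C_i}{T_i} \;\le\; n\left(2^{1/n} - 1\right), \qquad n\left(2^{1/n}-1\right) \to \ln 2 \approx 0.69 \ \text{for large } n, \qquad 3\left(2^{1/3}-1\right) \approx 0.78, \]

where C_i denotes the worst-case execution time and T_i the period of task i (notation introduced here for brevity); the value for n = 3 is the 78% figure used in section 4.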

An alternative method is to handle events by threads. In this case, an occurring event activates an assigned thread. Using threads instead of ISRs has several advantages:

- A flexible context switch between event-triggered threads and other threads is possible. ISRs can only be interrupted by other ISRs, but not by threads.
- Threads allow flexible priority schemes. For example, earliest deadline first (EDF) or guaranteed percentage scheduling can be used instead of fixed priority preemptive. Both the EDF scheme [2] and the guaranteed percentage scheme allow processor utilizations of 100%.
- The guaranteed percentage scheme offers the possibility to assign a rate of the full processing power to each thread. In this way, response times and data rates can be guaranteed even for several concurrent events, independent of other processor activities, as long as there is no overload condition. Furthermore, overload conditions are detected early, as soon as the total requested percentage of processor cycles exceeds 100%. Scheduling strategies like EDF detect overload conditions late, by missed deadlines.

On a conventional microcontroller, thread-based event handling can be emulated by activating a thread indirectly through an ISR. The only task of the ISR is to launch the corresponding event handling thread. This concept is realized by several operating systems, e.g. as Asynchronous System Traps (ASTs) by DEC [3].

In our approach, we propose a multithreaded microcontroller which handles events by activating interrupt service threads (ISTs) directly in hardware. First, this avoids the latencies caused by indirect thread activation, and operating system calls for EDF or guaranteed percentage scheduling are avoided as well, so response times improve. Second, a multithreaded microcontroller with zero-cycle context switching overhead allows a very fine-grained realization of the guaranteed percentage scheduling scheme. The requested percentage can be guaranteed within a very short period of a few dozen processor cycles, which is not possible when a context switch itself takes time.

This is demonstrated in Fig. 1 as follows. The top part of Fig. 1 shows three events with assigned rates and deadlines to be met. The shaded column marks a time period in which all three ISTs that handle the three events are active. On a conventional microcontroller every context switch generates some overhead. If the guaranteed rates of the three ISTs are to be fulfilled within the shown time period, the 90% processing power needed by the three concurrently executing ISTs exceeds the available 100% as soon as the context switching overhead of the three IST switches amounts to more than 10% of the processing power, as demonstrated in the middle part of the figure. In contrast, on a multithreaded processor the instructions of the three ISTs can be interleaved in a very fine-grained fashion without context switching overhead, and the remaining 10% of the processing power remains as a reserve.

Figure 1. Guaranteed percentage scheme on a multithreaded processor

For these reasons, the IST concept simplifies and improves the programming of concurrent real-time events. Moreover, it offers, for example, the possibility of debug threads which monitor the system at a low percentage without changing the real-time behavior. The IST concept and the guaranteed percentage scheme unfold their full power only in combination with hardware support by a multithreaded processor core.
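To make the guaranteed percentage idea concrete, the following small C simulation sketches one possible cycle-allocation rule: within each short allocation period every active thread receives a budget proportional to its assigned percentage, and in each cycle an instruction of the thread with the largest remaining budget is issued. The thread count, the period length of 100 cycles, and the selection rule are illustrative assumptions, not the actual Komodo hardware logic.

/* Minimal sketch of guaranteed-percentage cycle allocation (illustrative only):
 * every PERIOD cycles each active thread is granted a budget proportional to
 * its assigned percentage; each cycle the manager issues an instruction from
 * the active thread with the largest remaining budget. */
#include <stdio.h>
#include <stdbool.h>

#define THREADS 3
#define PERIOD  100                       /* cycles per allocation period (assumption) */

static const int percent[THREADS] = { 50, 10, 35 };  /* control, camera, transponder */
static int budget[THREADS];
static bool active[THREADS] = { true, true, true };

static int pick_thread(void)
{
    int best = -1;
    for (int t = 0; t < THREADS; t++)
        if (active[t] && budget[t] > 0 && (best < 0 || budget[t] > budget[best]))
            best = t;
    return best;                          /* -1: no runnable IST, the cycle stays unused */
}

int main(void)
{
    int issued[THREADS] = { 0 };

    for (int cycle = 0; cycle < 10 * PERIOD; cycle++) {
        if (cycle % PERIOD == 0)          /* refill budgets at the start of each period */
            for (int t = 0; t < THREADS; t++)
                budget[t] = percent[t] * PERIOD / 100;

        int t = pick_thread();
        if (t >= 0) { budget[t]--; issued[t]++; }
    }

    for (int t = 0; t < THREADS; t++)
        printf("thread %d: %d cycles (%d%% requested)\n", t, issued[t], percent[t]);
    return 0;
}

With the percentages of section 4 (50%, 10%, 35%), the counts printed after ten periods match the requested shares exactly, and the 5% of cycles that is never granted corresponds to the reserve mentioned above.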
3 A multithreaded microcontroller

A multithreaded processor is characterized by multiple on-chip instruction counters for different threads of control and by the ability to execute instructions of different threads in the pipeline simultaneously. A multithreaded processor usually features multiple on-chip register sets to allow a fast context switch. Multithreading techniques may be applied within a microprocessor or microcontroller to enhance its performance by masking latencies of instructions of the currently scheduled thread with instructions of other threads. Thus the throughput of a multiprogramming workload is increased, leading to very powerful techniques that appear in next-generation multiple-issue microprocessors (see Sun's MAJC [4] and DEC/Compaq's EV8 architecture [5]). However, to date multithreading has not been applied to event handling by taking advantage of its fast context switching ability.

The basic multithreading techniques that are appropriate for microcontrollers with a single-issue RISC processor kernel are the cycle-by-cycle and the block interleaving techniques [6]. Cycle-by-cycle interleaving switches context every cycle with zero-cycle context-switching overhead but with poor single-thread performance, because every cycle an instruction of another thread is introduced into the pipeline. Block interleaving processors execute a single thread until a context-switch event occurs. Typically a switch-on-cache-miss strategy is applied, leading to 6-12 cycles of context-switching overhead.

To meet the requirements of thread-based real-time event handling, we propose a multithreaded microcontroller that implements ISTs with several priority schemes; in particular, we focus on the guaranteed percentage scheme. Our multithreaded microcontroller is intermediate between the cycle-by-cycle and the block interleaving approaches and reaches a zero-cycle context-switching overhead. The microcontroller holds the contexts of up to four hardware threads. Such hardware threads can be non-real-time threads (e.g. operating system, debugging, garbage collection) or real-time threads (ISTs). Because of its application in embedded systems, the processor core of the multithreaded microcontroller is kept at the hardware level of a simple microcontroller similar to the M68302. As shown in Fig. 2, the microcontroller core consists of an instruction fetch unit, a decode unit, a memory access unit (MEM), and an execution unit (ALU). Four register sets are provided on the processor chip, restricting the number of hardware-supported ISTs to at most four. A signal unit triggers IST execution directly in hardware upon external signals. There are no caches, because of the needs of real-time applications.

Figure 2. Block diagram of the Komodo microcontroller (memory interface, instruction fetch unit with program counters PC1-PC4, instruction windows IW1-IW4, micro-op ROM, priority manager, instruction decode, signal unit with external signals, data path with MEM and ALU units, and the register sets)

The instruction fetch unit holds four program counters (PCs) with dedicated status bits (e.g. thread active/suspended); each PC is assigned to a different thread. Instructions are fetched depending on the status bits and the fill levels of the instruction windows (IWs). The instruction decode unit contains the above-mentioned IWs, dedicated status bits (e.g. priority, delay) and, for the guaranteed percentage scheme, counters. The priority manager decides, based on these bits and counters, from which IW the next instruction is decoded. Besides the guaranteed percentage scheme, other priority schemes may be supported by the priority manager. Each opcode is propagated through the pipeline together with its thread id, so opcodes of multiple threads can be present in different pipeline stages simultaneously. Memory access instructions are executed by the MEM unit; if the memory interface permits only one access per cycle, an arbiter is needed to resolve conflicts between instruction fetch and data access. All other instructions are executed by the ALU unit. Both units (MEM and ALU) may take several cycles to complete an instruction; afterwards the result is written back to the register set of the corresponding thread.

External signals are delivered to the signal unit by the peripheral components of the microcontroller, e.g. timers, counters, or serial interfaces. On the occurrence of such a signal, the corresponding IST is activated. As soon as an IST activation ends, its assigned real-time thread is suspended and its status is stored; an external signal may activate the same thread again. To avoid pipeline stalls, instructions from other threads can be fed into the pipeline using various static or dynamic multithreading techniques. Possible reasons for idle times are branches and memory accesses.
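The following C fragment sketches how such latency masking can work in principle: each hardware thread becomes unavailable for a few cycles after issuing a load, and every cycle the manager issues an instruction from the highest-priority thread that is ready, so the load latency of one thread is bridged by instructions of the others. The thread count, the load latency of three cycles, the instruction mix, and the fixed-priority selection are illustrative assumptions and do not claim to model the Komodo pipeline exactly.

/* Minimal latency-masking sketch (illustrative only): a thread stalls for
 * LOAD_LATENCY cycles after a load, modeled after the delay bits; ready
 * threads fill the cycles that would otherwise be pipeline bubbles. */
#include <stdio.h>
#include <stdlib.h>

#define THREADS      4
#define LOAD_LATENCY 3                    /* stall cycles signaled by the delay bits */
#define CYCLES       1000

int main(void)
{
    int ready_at[THREADS] = { 0 };        /* cycle from which a thread may issue again */
    int issued[THREADS]   = { 0 };
    int idle = 0;

    srand(1);
    for (int cycle = 0; cycle < CYCLES; cycle++) {
        int t = -1;
        for (int i = 0; i < THREADS; i++) /* fixed thread priority: lowest id wins */
            if (ready_at[i] <= cycle) { t = i; break; }

        if (t < 0) { idle++; continue; }  /* all threads stalled: pipeline bubble */

        issued[t]++;
        if (rand() % 4 == 0)              /* roughly every 4th instruction is a load */
            ready_at[t] = cycle + 1 + LOAD_LATENCY;
        else
            ready_at[t] = cycle + 1;
    }

    for (int i = 0; i < THREADS; i++)
        printf("thread %d issued %d instructions\n", i, issued[i]);
    printf("idle cycles: %d of %d\n", idle, CYCLES);
    return 0;
}

With guaranteed percentage scheduling, the selection inside the loop would be driven by the per-thread counters of the priority manager instead of a fixed thread order.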
The decode unit may predict the latency following a branch or memory access instruction and inform the priority manager via delay bits (switch-on-branch and switch-on-load strategies). There is no overhead for such a context switch: no registers have to be saved or restored and no instructions have to be removed from the pipeline, because each thread has its own register set. Because of the unpredictability of cache accesses, non-cached memory access is preferred for real-time microcontrollers. The resulting load latencies are bridged by the priority manager scheduling instructions of other threads; therefore a cache is omitted from our multithreaded microcontroller.

4 Evaluation using an industrial application example

This section gives an evaluation of the IST technique and of the guaranteed percentage scheme. The evaluation is done using a real industrial application: autonomous guided vehicles (AGVs). This is a good example of several concurrent time-critical events. The vehicles in our example are guided by a reflective tape glued to the floor. A vehicle follows its track by means of a CCD line camera. This camera produces periodic events at a rate of 10 milliseconds. This period gives the deadline for converting and reading the camera information and for executing the control loop which keeps the vehicle on the track. A second time-critical event is produced asynchronously by transponder-based position marks.

These position marks notify the vehicle that a predefined position has been reached (e.g. a docking station). If the vehicle detects a position mark, the corresponding transponder, which is installed in the floor beside the track, must be read. The precision required for position detection with these marks is 1 cm. This yields a vehicle-speed-dependent deadline for reading the transponder information. The vehicle speed can vary in the range of 0.5 to 1 meter per second, which results in a deadline range of 20 down to 10 milliseconds.

To solve this job, the vehicle software is structured into three tasks: The control task performs the control loop based on the current camera information; it is triggered by a timer event with a period of 10 milliseconds. The camera task, triggered by the same timer event, converts and reads the next camera information. The transponder task is triggered by the position mark events and reads the transponder information.

We compare the real-time behavior of three different realizations of these tasks: first, each task is realized by a conventional ISR; second, each task is realized by an IST using guaranteed percentage scheduling; third, each task is realized by an IST using EDF scheduling. To allow a fair comparison, we assume identical processor performance for all three techniques, based on the real performance of a 20 MHz M68302 microcontroller, which is often used in industrial applications. Thus our timing values stem from real applications. The evaluation itself is done as follows: first we examine the three techniques at a fixed vehicle speed of 0.65 meters per second; then we calculate the maximum speed that can be reached with each technique without violating the real-time constraints.

The following table summarizes the basic values for the first evaluation:

vehicle speed                       0.65 m/sec
camera period                       10 msec
position mark precision             1 cm
control task calculation time 1)    5 msec
camera task execution time 1)       1 msec
transponder task execution time 1)  5.5 msec
1) based on a 20 MHz M68302 processor

This gives the following execution-time / deadline ratios:

control task       5 msec / 10 msec        = 50%
camera task        1 msec / 10 msec        = 10%
transponder task   5.5 msec / 15.4 msec 1) = 35%
1) 15.4 msec = 1 cm / 0.65 m/sec

1. Realization with ISRs

First we demonstrate the solution with a conventional ISR realization of the tasks. The priority scheme of ISRs is a fixed priority scheme. Fixed priority scheduling can only guarantee a processor utilization of 78% for three events [1], which is less than the required 95%. Therefore deadlines will be missed regardless of the priority assignment. Figure 3 shows an example of missed deadlines for a specific assignment.

Figure 3. ISRs with P_control > P_camera > P_transp (time line of the control, camera, and transponder ISRs; the preempted transponder ISR misses its deadline)

2. Realization with ISTs using guaranteed percentage scheduling

Guaranteed percentage scheduling assigns a guaranteed percentage of processor cycles to a thread. On the proposed microcontroller with its zero-cycle context switches, this percentage is guaranteed within a very short period of a few processor cycles by the hardware priority manager. So the realization of the three tasks is simple: each task is assigned to a thread (IST) with its execution-time/deadline ratio as the requested percentage of processor cycles (50% control task, 10% camera task, 35% transponder task). This gives a total requested percentage of 95%, which means that all deadlines will be met in any case. Furthermore, there is an additional reserve of 5%, which could, for example, be used for a debug thread.
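The schedulability argument can be written out explicitly; writing D_transp for the transponder deadline at 0.65 m/sec and U for the total requested share of processor cycles (notation introduced here for brevity), the numbers simply restate the table above:

\[ D_{\mathrm{transp}} = \frac{1\,\mathrm{cm}}{0.65\,\mathrm{m/s}} \approx 15.4\,\mathrm{ms}, \qquad U = \frac{5}{10} + \frac{1}{10} + \frac{5.5}{15.4} \approx 0.50 + 0.10 + 0.35 = 0.95 \le 1. \]

Since the total requested percentage stays below 100%, the guaranteed percentage scheme meets all three deadlines, whereas the 78% fixed-priority bound for three tasks is clearly exceeded.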
Figure 4 shows the worst-case scenario, where all events occur at the same time.

Figure 4. ISTs with guaranteed percentage (time line of the control, camera, and transponder ISTs running concurrently at their guaranteed rates)

3. Realization with ISTs using EDF scheduling

EDF and guaranteed percentage scheduling allow a processor utilization of 100% [1]. This means that ISTs with EDF or guaranteed percentage scheduling meet all deadlines in the example above. Figure 5 shows the same scenario using EDF scheduling.

Figure 5. ISTs with EDF (time line of the control, camera, and transponder ISTs under EDF scheduling)

However, Figure 5 reveals a disadvantage of EDF scheduling. The camera task and the transponder task must not only meet a deadline, they also have to deal with data rates. The transponder information starts at the position mark event and lasts 4 milliseconds (8 bytes over a 19200 baud serial link). With guaranteed percentage scheduling this is no problem, because the transponder task starts at the position mark event as well, with 35% of the processor cycles, which is enough to read the information. With EDF scheduling, the transponder task starts 6 milliseconds after the position mark event. This means that if the serial link does not contain an 8-byte hardware buffer, the information will be lost. For the same reason, a ten times faster A/D converter is needed for reading the CCD camera.

In conclusion, the ISR solution is not able to guarantee the specified deadlines, in contrast to the IST solution, which even offers a reserve of 5%. Moreover, guaranteed percentage scheduling offers an advantage over EDF scheduling if not only deadlines but also data rates must be met.

Maximum vehicle speed for IST and ISR

In a second step, the maximum vehicle speed that does not violate the real-time constraints can be calculated. For ISTs using guaranteed percentage scheduling, the maximum vehicle speed is reached if the transponder task uses the remaining reserve of 5%. This leads to a transponder task of 40% and a total processor utilization of 100%. In this case, the transponder task can guarantee a deadline of (5.5 msec * 100%) / 40% = 13.75 msec. With this deadline, the position resolution of 1 cm can be maintained up to a vehicle speed of Vmax,IST = 0.73 m/sec. This calculation is valid for ISTs using EDF scheduling as well, because the same processor utilization of 100% is reached. For ISRs, the reachable deadline for the transponder task can be taken from Fig. 3; it amounts to 5 msec + 1 msec + 4 msec + 5 msec + 1 msec + 1.5 msec = 17.5 msec, so the maximum vehicle speed is Vmax,ISR = 0.57 m/sec. As a result, the vehicle speedup of the IST architecture compared to ISRs is Speedup IST/ISR = 28%.
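The speed figures follow directly from the 1 cm position resolution; writing D for the transponder deadline and v_max for the corresponding maximum speed (notation introduced here for brevity):

\[ D_{\mathrm{IST}} = \frac{5.5\,\mathrm{ms} \cdot 100\%}{40\%} = 13.75\,\mathrm{ms}, \qquad v_{\max}^{\mathrm{IST}} = \frac{1\,\mathrm{cm}}{13.75\,\mathrm{ms}} \approx 0.73\,\mathrm{m/s}, \]
\[ D_{\mathrm{ISR}} = 17.5\,\mathrm{ms}, \qquad v_{\max}^{\mathrm{ISR}} = \frac{1\,\mathrm{cm}}{17.5\,\mathrm{ms}} \approx 0.57\,\mathrm{m/s}, \qquad \mathrm{Speedup}_{\mathrm{IST/ISR}} = \frac{v_{\max}^{\mathrm{IST}}}{v_{\max}^{\mathrm{ISR}}} - 1 \approx 28\%. \]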
Finally, we would like to point out that in this section we evaluated the benefits of the IST concept. However, using a multithreaded microcontroller has the additional benefit of latency masking [7], which has not yet been taken into account. Unfortunately, this benefit cannot be evaluated statically, so our calculated values are worst-case values.

5 Conclusions

We combine interrupt service threads (ISTs) with a multithreaded microcontroller to form a new hardware-supported event handling mechanism which allows efficient handling of concurrent events with hard real-time requirements. We propose a guaranteed percentage scheme where each real-time thread is assigned a percentage of the full processor power. An analytical evaluation shows the advantages of our approach in handling concurrent, overlapping events. The additional ability of a multithreaded microcontroller to utilize instruction latencies by scheduling instructions of a different thread is not considered in the calculations, although it can yield a higher execution speed; the calculated values are therefore worst-case values. We are working on a simulation of the proposed multithreaded microcontroller. Our goal is the evaluation of its real-time performance versus the performance of a conventional microprocessor with ISRs and a fixed priority preemptive scheme.

References

[1] C. L. Liu and J. W. Layland. Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment. JACM, 20(1):46-61, 1973.
[2] J. A. Stankovic, M. Spuri, K. Ramamritham, and G. C. Buttazzo. Deadline Scheduling for Real-Time Systems: EDF and Related Algorithms. Kluwer Academic Publishers, 1998.
[3] Digital. Guide to DECthreads. March 1996.
[4] L. Gwennap. MAJC Gives VLIW a New Twist. Microprocessor Report, Vol. 13, No. 12, pp. 12-15, September 1999.
[5] J. Emer. Simultaneous Multithreading: Multiplying Alpha Performance. Microprocessor Forum, San Jose, October 1999.
[6] J. Kreuzinger and T. Ungerer. Context Switching Techniques for Decoupled Multithreaded Processors. 25th EUROMICRO Conference, Milano, Vol. 1, pp. 248-251, September 1999.
[7] W. Grünewald and T. Ungerer. A Multithreaded Processor Designed for Distributed Shared Memory Systems. International Conference on Advances in Parallel and Distributed Computing, Shanghai, pp. 206-213, March 1997.