DISC: DYNAMIC INSTRUCTION STREAM COMPUTER


Dr. Mario Daniel Nemirovsky
Apple Computer Corporation

Drs. Forrest Brewer and Roger C. Wood
Electrical and Computer Engineering Department
University of California, Santa Barbara

ABSTRACT

This paper applies a form of instruction stream interleaving to the problem of high performance real-time systems. Such systems are characterized by high bandwidth, stochastically occurring interrupts as well as high throughput requirements. The DISC computer is based on dynamic interleaving, where the next instruction to be executed is dynamically selected from several possible simultaneously active streams. Each stream context is stored internally, making active task switching possible in a single instruction cycle. For several RTS applications the DISC concept promises higher computation throughput at lower cost than is possible on contemporary RISC processors. Implementation and register organization details are presented as well as simulation results.

1.0 INTRODUCTION

This paper describes an architecture concept and implementation which is specifically oriented for use in real time controller systems (RTS). Such systems are characterized by various priority hard and soft deadlines for completion of tasks, and by efficient interaction of the processor with several peripherals running at vastly differing data rates. Ideally, system deadlines can all be met by the computation engine in all cases. However, reasonable provisions must be made for graceful degradation of low priority tasks in exceptional circumstances. Another characteristic of such systems is the notion that the worst case delays of the system must fall within the critical timing constraints. It is of no use for the average performance to meet these requirements, as the system may incur permanent damage if these constraints are not met. Many present applications of micro-controllers are in relatively low-end applications where meeting these requirements is simple even for slow microprocessors. These uses have not pushed the technology significantly, except in methods for lowering costs. Recently, however, there has been a rapid increase in the complexity of control systems as mechanical and computer integrated manufacturing systems have become common. Other real time systems occur in the automotive industry and in airplane control systems. In these newer applications, it is questionable whether conventional architectures provide a cost effective computation engine solution. We propose an efficient architectural concept for construction of real time system controllers which promises significantly higher performance than conventional approaches for modest increases in cost.

Real time systems present different constraints to the system architect than do conventional systems. In particular, externally derived deadlines from the controlled system produce widely varying computational loads on the controller, as it must respond to these external requests and interrupts in a specified amount of time. In this work, we consider deadline times from microseconds to milliseconds, common to conventional microprocessor system controllers. In these systems, I/O timing constraints become a primary issue. Often the data required is generated by a sensor on a time scale much slower than the operation of the processor. On the other hand, keeping the data current is much desired, which prevents caching or queueing earlier data values. In these cases, it is difficult to make use of the processor idle time while it is awaiting new data, due to the overhead required to change program context. Interrupt processing is also very important in RTS, to help alleviate overhead due to polling and to insure quick responses to exceptional or critical deadlines. For this reason, interrupt latency (a measure of the time to respond to an interrupt signal) is an important performance measure for real time control systems. If the control system is complex and dynamic rescheduling is required, then there must be provisions for rapid context switching as processes are started and stopped as required by the system. Finally, it has been shown [1] that if the processor throughput can be partitioned arbitrarily among the executing processes, scheduling which is in some sense optimal can be achieved. This throughput partitioning must be done with very low overhead so as not to compete with the processing tasks themselves.

2.0 PREVIOUS WORK

Previous work in architectures for real time controller systems involves a few specialized architectures and several specially modified current microprocessors.

Although digital signal processing (DSP) chips are often used in real time systems, they are usually used as auxiliary processors, as their specialized architectures do not perform more general (non-numeric) processing efficiently. In addition, the large parallelism and register set size of DSPs make these devices very inefficient for use in interrupt driven or heavy context switching applications.

Several common microprocessors have been modified for real time control applications with the addition of internal timer, DMA, and communication interface functions. Examples include the 68332, 68HC11, and 8748 microprocessors. It is important to note that the general purpose architecture of the original microprocessor is retained in these controllers, with the extra functions simplifying the peripheral interfacing. For this reason, these micro-controllers have interrupt latency and context switching behaviors similar to the original microprocessor parent. The 68332 [2] does have an auxiliary processor, called the timer processing unit (TPU), which is capable of performing relatively complex timed process behaviors such as stepper motor control. The purpose of this unit is to reduce the frequency of interrupts and context switches required by the real time system.

Another solution to the interrupt latency and context switching time problems is the use of a stack architecture such as that of the RTX2000 machine. Since the instruction stream is primarily zero address (stack) operations, these machines do not have large internal register sets which need to be saved. For this reason, the interrupt latency and the context switching times are very fast. The stack instructions, however, do not lend themselves to manipulating complex I/O devices, due to the lack of support for complex addressing modes to these peripherals. These processors also tend to have slightly lower performance than their register heavy counterparts.

Instruction level interleaving is not new; early processors using interleaving include the CDC6600 I/O processor [4], the multiple instruction stream processor of Flynn [5,6], the work of Kaminsky and Davidson [7], and the Denelcor HEP computer [8]. More recent work includes the UCSB CCMP system [10], the APRIL [13], and others [9, 11, 12]. This work (with the exception of the CDC) is primarily directed towards the performance gains and ease of parallel programming implementations possible with interleaving. Very little attention has been given to the advantages of interleaving in real time systems. Interleaving architectures rely on maintenance of a simultaneous context for each of the processes running in the processor. This leads inevitably to a large overhead of registers required to sustain these contexts. These registers are often organized into register windows or multiple windows with disadvantageous worst case replacement behavior. These memory problems have been studied by Sites [18], and by Wyes and Plessmann [19], who use background processes to update the register windows before registers are needed. Another alternative is proposed in the CRISP [22] architecture, using a stack cache. We will propose a variable sized multi-window organization for this purpose.

3.0 DYNAMIC INSTRUCTION STREAM COMPUTER (DISC)

3.1 DISC Concept

The dynamic instruction stream computer (DISC) concept relies on an architecture maintaining several simultaneous instruction streams which are dynamically started and halted by the processor. Each of these streams is interleaved in the processor at the instruction level, providing the highest level of granularity for task scheduling and partitioning. The instruction level interleaving allows for efficient pipelining to obtain high instruction throughput not achievable in conventional architectures. For applications in real time systems, however, it is the dynamic nature of the effectively parallel streams which is particularly useful. In a conventional processor, the control unit selects the next instruction to be executed in sequential order unless this order is changed by a jump or other control instruction. In DISC, the sequential order is replaced by a hardware scheduler which selects, from among the several possible streams, a particular instruction for execution on the next cycle. It is thus possible to assign an interrupt to a given stream which begins processing effectively in parallel, at a given level of throughput partitioned with respect to the rest of the streams then active in the processor. Streams can also trigger other instruction streams, and multiple streams can synchronize with each other when necessary. As an example, consider a machine running 3 streams concurrently, where one of the streams is halted by wait states from a slow peripheral. The other streams are automatically allocated the instruction slots which would otherwise be spent as polling or interrupt overhead. In situations where the number of active contexts is smaller than the number of supported streams, all overhead for context switching is removed. (Even when this is not the case, for many real time systems, the frequency of context switches should be reduced.)

It is well known that multi-stream interleaving on a pipelined processor is more efficient than single stream execution. DISC exploits this efficiency advantage by implementing a RISC-based processing engine designed to automatically interleave instruction execution from a small number of stored process contexts. Scheduling of streams on an instruction basis allows simple partitioning of the processing power among the several active real time tasks. This schedule allows several versions of real time scheduling models, including preemptive and fixed schedules as well as General scheduling [1], with little or no overhead. The cost for this is the necessity of several stored contexts, along with the ancillary registers which must be duplicated for each stream. In particular, PC, SP, and El registers must be maintained for each stream. To manage this large number of registers, DISC introduces the concept of a stack window register set. These registers are similar to the register windows proposed by Patterson in RISC-I [3], with the exception that the number of registers allocated in a procedure call is variable. Each stream is allocated its own stack window, as well as a common set of global registers used for inter-stream parameter passing. The stack window, described below, is very important in a hard deadline environment to minimize context switching, procedure call/return, and interrupt overhead.

3.2 Pipelining

Pipelining is a mechanism by which multiple instructions from a sequential instruction stream are simultaneously executed in an overlapped fashion. For this discussion we will consider a five stage pipeline, consisting of instruction fetch (IF), instruction decode (ID), read registers (RR), execute (EX), and write register (WR). The essential feature of a pipe is that ensuing instructions are scheduled before earlier ones have completed. This leads to hazards, which lower the performance of the pipeline. A hazard is a situation which precludes executing the next instruction of the stream. Hazards are caused by violation of either data or control dependencies.

A data hazard exists when an instruction, A, is modifying data which is used by the next instruction, B. In this case, the data for B has not been updated by the time it is read. To insure correct operation, instruction A should be completed before B executes its third stage; the pipeline should keep running A but delay all those instructions that follow A until the register write is completed. A control hazard takes place when the instruction sequence is modified as a result of an interrupt or an instruction such as a jump or branch. By the time an instruction modifies the program sequence, there will be several instructions in the pipe which belong to the incorrect sequence. Any such instructions need to be flushed from the pipe. It is important to reduce the performance overhead associated with hazards. Several techniques, such as delayed branching and pipeline bypasses, reduce the effect of hazards but generally do not eliminate them. Interleaving, however, can be used to eliminate hazards from pipeline execution, and has been employed in a number of systems [4-8].

3.3 Interleaving

A pipeline is interleaved if, at every pipe cycle, an instruction from a different instruction stream enters the pipe and there are at least as many instruction streams as pipe stages. Therefore, interleaving is a way to share the processor resources between multiple processes. Figure 3.1 shows an interleaved pipeline, in which five independent instruction streams or tasks are shown in a five stage pipeline. The result of such an interleaved pipe is the equivalent of five parallel processors, where each processor is running at one instruction every five cycles. Thus in an ideal pipeline there is no performance gain from interleaving instructions. In fact, the overhead of supporting several parallel streams may slow down the achievable clock cycle, hence the performance may decrease. However, a single stream running on a pipeline will have both data and control hazards reducing the throughput of the pipeline. In an interleaved pipeline, all instructions present in the pipeline belong to separate processes at all times. Thus each instruction for each process completes before further instructions from that process are fetched. Under these conditions, and assuming the processes to be independent, there are no control or data hazards at all. A representative branch is shown in Figure 3.2. Hazards between separate processes are possible if the processes are not independent; for example, the processes may communicate. Special hardware can be added for process communication, which will reduce the overhead in these cases. As a result, the interleaved pipeline achieves higher throughput on several processes than an identical pipeline executing a single stream, due to the reduction in the number of hazards.

[Figure 3.1: Interleaved Pipeline. (a,1 indicates instruction a running on instruction stream 1.)]

[Figure 3.2: Interleaved Pipeline During a Jump. No other instruction in the pipe belongs to instruction stream 1.]
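
As a concrete illustration of this point, the following sketch (added here for exposition; the round-robin issue policy and cycle counts are assumptions, not the paper's model) issues instructions from N streams into the five stage pipe of Section 3.2 and measures the spacing between consecutive instructions of the same stream. With at least as many streams as stages, each instruction completes before its successor from the same stream is fetched, so intra-stream hazards cannot occur.

# Illustrative sketch: round-robin interleaving of N instruction streams
# through a 5-stage pipe (IF, ID, RR, EX, WR). With streams >= stages,
# consecutive instructions of any one stream are >= 5 cycles apart.

STAGES = ["IF", "ID", "RR", "EX", "WR"]

def issue_schedule(n_streams: int, n_cycles: int):
    """Return, per cycle, which stream issues into IF under round-robin."""
    return [cycle % n_streams for cycle in range(n_cycles)]

def min_intra_stream_spacing(n_streams: int, n_cycles: int = 100) -> int:
    """Smallest cycle distance between two issues of the same stream."""
    last_issue = {}
    spacing = n_cycles
    for cycle, stream in enumerate(issue_schedule(n_streams, n_cycles)):
        if stream in last_issue:
            spacing = min(spacing, cycle - last_issue[stream])
        last_issue[stream] = cycle
    return spacing

for n in (1, 2, 5):
    s = min_intra_stream_spacing(n)
    print(f"{n} stream(s): spacing {s} cycles, hazard-free = {s >= len(STAGES)}")
# 1 stream(s): spacing 1 cycles, hazard-free = False
# 2 stream(s): spacing 2 cycles, hazard-free = False
# 5 stream(s): spacing 5 cycles, hazard-free = True

With fewer streams than stages the spacing falls below the pipe depth, which is exactly where the bypassing and flushing machinery of a conventional pipeline would be needed.
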
The performance increase for interleaved pipelines is not without cost. There must be sufficient registers to retain the states of all executing processes; the resources required to hold a context must be duplicated as many times as there are virtual processors to be supported. This cost is highly dependent on the architecture of the processor. It is important to have a very small context per process, and to have minimal extra hardware to support the multiple process switching. The question remains: how does interleaving provide a solution to the real-time controller requirements?

3.4 Dynamic Interleaving

As we described earlier, a real-time system requires that multiple tasks be able to run concurrently. Some of these tasks occur at deterministic times, others at random times. There are a large number of interrupts, and the I/O speed is generally much slower than the processor speed. Interleaving could be a very good solution if a sufficient number of active tasks could be guaranteed, but this is difficult because of the randomness. Thus, we introduce the concept of dynamic interleaving. A pipeline organization is said to be dynamically interleaved if it can run anywhere from a single instruction stream to multiple instruction streams, the computational power of the processor can be allocated between the multiple virtual processors in any way, and the throughput can be dynamically reallocated when the instruction stream scheduled to run is not ready.

This is achieved in DISC by dynamically selecting the next instruction to execute from the possible streams. In the case where only one stream is active, each pipeline slot executes a sequential instruction from that stream. The concept is illustrated by Figure 3.3. The figure shows up to four instruction streams (IS1, IS2, IS3, and IS4). Assume that the total throughput of the processor is T and the following partition is assigned: T/2 to IS1, and T/6 each to IS2, IS3, and IS4. As the figure shows, when IS1 is the only one active, it will be dynamically assigned T, even though the static assignment is T/2. Similarly, if IS3 is inactive, its processor time will be dynamically reassigned to IS2 and IS4. Dynamic interleaving greatly facilitates scheduling and multitasking, since each task can be assigned its own virtual processor of adjustable computational power.

[Figure 3.3: Dynamic Instruction Stream Diagram.]

Real time systems also require hard deadline management, which is often implemented via timer based interrupts. In conventional architectures, these interrupts require context switches. In DISC, an interrupt, instead of suspending a running process, can create its own instruction stream. This makes the system more deterministic, since other tasks can keep running even when interrupts are invoked. When the interrupt routine is finished, the throughput will be dynamically reallocated to the remaining instruction streams. Context switching will not be required as long as the number of instruction streams required by the application is less than or equal to the number supported by the processor. Otherwise, some context switching will be required, but the total number of switches will be smaller in a DISC than in a traditional architecture.
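
The following sketch models this reallocation policy in software (a minimal illustration; the credit-based selection rule is an assumption for exposition, not the actual DISC scheduler logic). Streams are indexed 0 to 3, standing for IS1 to IS4, with static shares T/2, T/6, T/6, T/6; slots belonging to inactive streams flow automatically to whichever streams remain active.

# A software model of dynamic interleaving: static throughput weights
# T/2, T/6, T/6, T/6, with slots of inactive streams reallocated.
from fractions import Fraction
from collections import Counter

WEIGHTS = [Fraction(1, 2), Fraction(1, 6), Fraction(1, 6), Fraction(1, 6)]

def schedule(active, n_cycles):
    """Yield the stream chosen for each cycle among the active streams."""
    credit = [Fraction(0)] * len(WEIGHTS)
    total = sum(WEIGHTS[i] for i in active)
    for _ in range(n_cycles):
        for i in active:                     # inactive streams earn nothing,
            credit[i] += WEIGHTS[i] / total  # so their share is redistributed
        chosen = max(active, key=lambda i: credit[i])
        credit[chosen] -= 1
        yield chosen

print(Counter(schedule({0, 1, 2, 3}, 60)))   # roughly a 30/10/10/10 split
print(Counter(schedule({1, 3}, 60)))         # IS1, IS3 idle: IS2 and IS4 get ~30 each
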
3.5 The Stack Window

Due to the speed degradation of external access with respect to the processor, it is important to keep operands in the processor. Therefore, the processor should have enough registers to be able to allocate registers to most, or all, of the local and global variables for all streams. However, keeping local variables in internal registers causes context switching overhead on procedure calls and returns [14, 15]. To resolve the tension between a large register set and fast context switching, a multi-window approach is a very logical alternative [14-23]. In addition to reducing the local register saving/restoring to just a pointer change, if the windows overlap, then the overlapped registers can be used for argument passing. DISC is an architecture which contains multiple instruction streams; each instruction stream should have its own multiple window file.

The approach used on DISC is called a stack window. Figure 3.4 shows the window file in the stack window approach. The Bottom Of Stack register (BOS) points to the last empty word of the stack window (SW). The Active Window Pointer (AWP) points to register zero (R0) of the window. If the window size is S, then the address of R0 is AWP, R1 is AWP-1, ..., Rn is AWP-n, and R(S-1) is AWP-S+1. In the instruction set, stack increment and decrement options are added to some instructions such as Load, Store, Add, Subtract, etc. When an instruction increments the AWP, the new AWP location becomes R0, R0 becomes R1, R1 becomes R2, and so on (Figure 3.5). Thus the SW is a window that moves up and down as demands require.

[Figure 3.4: Stack Window Approach.]

Let us assume that instructions which increment the AWP do so at the end of the instruction. Then a procedure call will increment the AWP, storing the return address there. On a return, the TOS is decremented by the instruction offset (no larger than the window size) to the return address location. It restores the program counter and decrements the AWP one more time, leaving it at the same place it was before the call took place.

[Figure 3.5: Stack Window Movements. Incrementing the AWP makes the new location R0 and shifts the old register names up; decrementing reverses this, and the old R0 is lost.]
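
A toy software model of the stack window may make the addressing concrete (a sketch under assumed parameters: the eight-register window follows the DISC1 figures of Section 3.7, while the physical file size and call convention details are illustrative). Register Rn resolves to physical register AWP - n, so a call renames the entire window with a single pointer change.

# Toy model of stack-window addressing: Rn maps to physical register
# AWP - n, so moving the AWP renames the whole window in one step.

class StackWindow:
    def __init__(self, n_physical=64, window_size=8):
        self.regs = [0] * n_physical   # physical register file (size assumed)
        self.S = window_size
        self.awp = window_size - 1     # Active Window Pointer

    def _addr(self, n):
        assert 0 <= n < self.S, "register index outside the window"
        return self.awp - n            # R0 = AWP, R1 = AWP-1, ..., R(S-1) = AWP-S+1

    def read(self, n): return self.regs[self._addr(n)]
    def write(self, n, v): self.regs[self._addr(n)] = v

    def call(self, return_addr, n_alloc=1):
        """Procedure call: bump the AWP and store the return address at the new R0."""
        self.awp += n_alloc
        self.write(0, return_addr)

    def ret(self, offset=0):
        """Return: step back to the return address, restore the PC, drop one more slot."""
        self.awp -= offset
        pc = self.read(0)
        self.awp -= 1                  # AWP is back where it was before the call
        return pc

sw = StackWindow()
sw.write(0, 42)                        # caller's R0
sw.call(return_addr=0x100)             # callee sees a fresh R0 holding the return address
assert sw.read(0) == 0x100
assert sw.ret() == 0x100 and sw.read(0) == 42
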

3.6 Communication Issues

3.6.1 Input/Output

Real time systems require multiple I/O peripherals with different access times; therefore, DISC has to support an asynchronous data bus. DISC is a load/store type machine. To avoid stopping the other instruction streams when a load or store instruction is issued, a pseudo-DMA type load/store was implemented on DISC1. On a load instruction, the effective address of the external request is calculated. It is then loaded into the Asynchronous Bus Interface (ABI), along with the address of the destination register. The IS requesting the read cycle is sent into a wait state, and the ABI initiates the read cycle. While the access is taking place, any other IS that requests a load or store is also sent into a wait state. Once the read is completed, the ABI stores the data into the destination register and re-activates all waiting ISs. This is done without affecting the running instruction streams. The store instruction works in a similar way.
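
The following event-style sketch illustrates the ABI behavior just described (the data structures and method boundaries are assumptions for illustration, not the DISC1 hardware): a load parks the requesting stream in a wait state, a second requester finding the bus busy also waits (and would re-issue its request on wake-up), and completion writes the destination register and re-activates every waiting stream.

# Sketch of the Asynchronous Bus Interface (ABI) pseudo-DMA load.

class ABI:
    def __init__(self):
        self.busy = False
        self.pending = None            # (stream, dest_register, address)
        self.waiting = set()           # streams stalled on the bus

    def load(self, stream, dest_reg, addr, streams):
        if self.busy:                  # bus occupied: this IS waits and re-issues later
            self.waiting.add(stream)
            streams[stream]["active"] = False
            return
        self.busy = True
        self.pending = (stream, dest_reg, addr)
        self.waiting.add(stream)
        streams[stream]["active"] = False   # requester enters a wait state

    def complete(self, data, streams):
        """Called when the external device answers, many cycles later."""
        stream, dest_reg, _ = self.pending
        streams[stream]["regs"][dest_reg] = data
        for s in self.waiting:              # re-activate all waiting ISs
            streams[s]["active"] = True
        self.busy, self.pending, self.waiting = False, None, set()

streams = {i: {"active": True, "regs": [0] * 16} for i in range(4)}
abi = ABI()
abi.load(stream=0, dest_reg=3, addr=0x8000, streams=streams)
assert not streams[0]["active"]        # IS0 waits; IS1-IS3 keep running
abi.complete(data=0xBEEF, streams=streams)
assert streams[0]["active"] and streams[0]["regs"][3] == 0xBEEF
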
3.6.2 Interprocess Communication and Synchronization

Since DISC has multiple ISs, communication between ISs is required. This can be accomplished in different ways; DISC1 supports three of them. There are four global registers that are shared between all the ISs. In addition, there is an internal global memory shared between the ISs; since the global registers and internal memory allow read-modify-write instructions, they can be used as semaphores. IPC can also be done via software interrupts, which are discussed in the next section. Process synchronization can be achieved by either semaphore polling or by interprocess interrupts. Interrupts are more efficient, since they do not require repetitive instructions on the processing engine.

3.6.3 Interrupts

The interrupt structure on DISC is very special because of the importance of interrupts in real time systems, and because interrupts are also used to obtain inter-IS communication and synchronization. Every IS has one interrupt register (IR) and one mask register (MR). On DISC1 the interrupt registers contain 8 bits; bit 7 is the highest priority, and bit 0 is the lowest priority (the background, or normal mode of running). Interrupts 7 to 1 are vectored interrupts. Interrupt 0 is the background; no vector is generated. For example, IS0 interrupts IS2 by setting a bit in the IR of IS2. External interrupts can also set a request in any of the IRs. Finally, interrupts can be generated automatically, such as the stack overflow or other exceptional interrupts. Interrupt request bits can only be cleared by the IS to which the IR belongs. When no bit of its IR is set, the instruction stream will not be scheduled (not active). Once an interrupt is requested, if it is the highest priority one pending, a vector interrupt will be generated. The next instruction that belongs to that IS will be started at the address given by the interrupt vector. Vectored interrupts were chosen in the implementation to avoid the need for polling to determine the interrupt source. Synchronization between ISs can be obtained via interrupts. When interrupts are used to synchronize ISs, the first IS to reach the join point is deactivated until the other IS arrives. This is much better than having the IS poll a semaphore to check synchronization, since the computation throughput which would be spent polling is dynamically allocated to the active ISs.
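
A small sketch of the per-stream interrupt selection described above may help (the vector addresses are hypothetical, and the IR-and-MR masking rule is an assumption; the paper does not spell out how the mask register combines with the IR).

# Per-stream interrupt selection: 8-bit IR and MR; bit 7 is highest
# priority, bit 0 is the background; bits 7..1 produce a vector, and a
# stream with no pending bits is simply not scheduled.

VECTOR_TABLE = {n: 0x0100 + 4 * n for n in range(1, 8)}  # hypothetical addresses

def highest_pending(ir: int, mr: int):
    """Return the highest-priority unmasked pending bit, or None if idle."""
    pending = ir & mr & 0xFF          # masking rule assumed, not from the paper
    for bit in range(7, -1, -1):
        if pending & (1 << bit):
            return bit
    return None

def next_fetch_address(ir, mr, background_pc):
    bit = highest_pending(ir, mr)
    if bit is None:
        return None                   # no bit set: stream is not scheduled
    if bit == 0:
        return background_pc          # background: no vector generated
    return VECTOR_TABLE[bit]          # vectored interrupt, no polling needed

assert next_fetch_address(0b0000_0000, 0xFF, 0x2000) is None
assert next_fetch_address(0b0000_0001, 0xFF, 0x2000) == 0x2000
assert next_fetch_address(0b1000_0101, 0xFF, 0x2000) == VECTOR_TABLE[7]
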

3.7 Implementation of DISC

DISC1 is the experimental implementation of the DISC concept. More information about the implementation and models is available [24]. This implementation was designed to prove the feasibility of DISC and to obtain benchmarks. The design is targeted to the typical control requirements of automotive electronics. A 16-bit architecture was chosen for DISC1, since the goal was to compare its performance with respect to present RTS controllers; in fact, present technology would allow physical implementation as a 32-bit architecture. A Harvard architecture was chosen to allow simultaneous instruction and data fetch. Instructions are fetched through the program bus, which is 24 bits wide, while the data bus is 16-bit asynchronous. An asynchronous data bus is required since controllers have a very large variety of I/O peripherals with a large variety of access times. DISC1 supports up to four instruction streams running concurrently and uses a four stage pipeline. The scheduler of DISC1 is responsible for selecting which instruction stream will be executed next, based on present priority. The computational power of the system can be allocated evenly between ISs, or assigned in increments as small as 1/16 of the total. DISC1 contains 2 Kbytes of internal memory in addition to the stack window registers. The internal memory is shared between all ISs, with access via register indirect, register plus offset, or 9-bit immediate addressing. DISC1 is a load/store computer with a reduced instruction set. All instructions are effectively single cycle, including the load and store instructions, with the proviso of asynchronous waits for external memory and I/O. This simplifies the design and reduces the overhead cost of the multiple instruction streams. A 16x16 integer hardware multiplier is included in DISC1. DISC1 has 16 registers per instruction stream: four global, four special registers, and eight local (stack window) registers. Figure 3.6 shows a block diagram of DISC1. An RTL model of DISC1 was written in Verilog, and several programs were run on the model.

[Figure 3.6: Block Diagram of DISC1.]

4.0 EVALUATION OF DISC PERFORMANCE

4.1 Stochastic Model

A stochastic model was developed to evaluate the DISC architecture. Poisson distributions, with the indicated means, were assumed for the number of consecutive instructions for which an IS is active (meanon) or inactive (meanoff), for the number of instructions between external access requests (mean_req), and for I/O request times (mean_io). Also controlled were the percentage of external requests that were directed to memory (alpha), the percentage of instructions, such as jumps, calls, returns, branches and interrupts, that modify program flow (aljmp), and the number of wait cycles for an external memory access. The model simulates the sequencer used in DISC1, so that any sequence that can run on DISC1 can be simulated. The model assumes that when a jump instruction takes place, all of the instructions in the pipe that belong to the same IS have to be flushed from the pipe. If only one IS is active, this simplifying assumption makes DISC performance worse than that of a single IS computer. For an external request, either I/O or memory, if the access time is larger than zero, all instructions in the pipe belonging to the same IS are flushed, and the IS requesting access is put into a wait state. This is done in order to allow other ISs to keep running, but it penalizes DISC with respect to a standard architecture if only one IS is being run, since the pipe could simply be halted. If the bus was busy at the time access is requested, the instruction is flushed and a new external access is requested once the IS is out of the wait state. If the bus was not busy, the busy flag is set and it remains set until the access time is completed. Upon completion of the external access, all waiting flags are cleared.

Two performance measures for DISC are evaluated: processor utilization on DISC, PD, and delta. Delta is a value used to compare a single IS system with a multiple IS system, and is defined as:

    delta = ((PD - PS) / PS) * 100%

PS (processor utilization on the standard processor) is calculated as the total number of executable instructions divided by the sum of the total number of executable instructions, the number of cycles that the data bus was busy, and the number of cycles dropped due to jump type instructions. This assumes that instructions are not being executed in a standard processor when it is waiting for data; to assume the contrary implies support of out-of-sequence code and/or a smart compiler. It also assumes that every time a jump type instruction is executed, the standard processor will require (pipe_length - 1) cycles to be flushed from the pipeline. This is conservative, in that delayed branching can be used to reduce the number of cycles that need to be flushed. However, delayed branching can only be applied to statically analyzable portions of the design and is less effective as pipeline depth increases.

It is common practice in RTS analysis to measure the interrupt latency time as a system evaluation. By dedicating a stream to a particular interrupt, we can achieve very high figures of merit, since the instructions will start execution immediately. However, we must still ensure that the appropriate context is available and that the interrupt executes quickly enough once it is started. The latency time as conventionally described is ambiguous in this sense, since a short interrupt to retrieve a value will execute very quickly (the common micro-controller case) while a longer interrupt will be allocated throughput by the hardware scheduler.
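
For concreteness, the two measures can be written out as follows (a sketch: the instruction and cycle counts below are invented for the example, and PD itself would come from the stochastic simulation).

# PS and delta as defined above, with made-up counts for illustration.

def standard_utilization(executed, bus_busy_cycles, jumps, pipe_length=4):
    """PS: executed instructions over executed + bus-busy + jump-flush cycles.
    Each jump costs (pipe_length - 1) flushed cycles on the standard machine."""
    flushed = jumps * (pipe_length - 1)
    return executed / (executed + bus_busy_cycles + flushed)

def delta(pd, ps):
    """Relative improvement of DISC over the single-stream processor, in %."""
    return (pd - ps) / ps * 100.0

ps = standard_utilization(executed=10_000, bus_busy_cycles=2_500, jumps=1_000)
print(f"PS = {ps:.3f}, delta = {delta(pd=0.90, ps=ps):+.1f}%")
# PS = 0.645, delta = +39.5%
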
4.2 Simulation Results

A large number of simulation runs were made to evaluate the DISC architecture. Parameters varied, in addition to those described above, included the scheduler sequence, the number of cycles for an external memory access (tmem), and the pipeline length. One set of runs evaluated the effect of jump instructions only, another of external I/O only. Finally, a set of four program loads was specified to simulate more realistic RTS behavior. Loads 1 and 2 represent typical RTS behavior, differing principally in the fact that load 2 alternates between active and inactive, while load 1 is always active. Load 3 represents a DSP type program running only from internal memory, and load 4 an interrupt driven program which is only active while handling an interrupt. These loads were also combined into a single IS; e.g., load (1:4) represents a statistical combination of loads 1 and 4 into a single IS. Table 4.1 shows the parameters for each of these runs, and Tables 4.2 and 4.3 show the processor utilization and delta for different combinations.

[Table 4.1: Parameter Set for Typical Programs. Parameters meanon, meanoff, mean_req, alpha, tmem, mean_io, and aljmp for loads 1, 1:2, 1:3, 1:4, 2, 3, and 4.]

In Table 4.2 we show that as the degree of partitioning increases, so does the utilization. Hence if we have a program that can be partitioned into multiple ISs, a much better processor utilization is obtained, especially if the processor utilization of the single IS is low. Even when the processor utilization of a single IS is very high, there are still some gains obtained by running multiple instruction streams, as shown for load 3 in Table 4.2.

[Table 4.2a: Processor Utilization PD, and Table 4.2b: Delta, for each load versus the maximum number of instruction streams.]

Table 4.3 shows results for load 1 combined with each of the other loads: first into a single IS, then each run in an independent IS, then with load 1 partitioned into two ISs, and finally with both loads partitioned into dual ISs.

[Table 4.3a: Processor Utilization PD, and Table 4.3b: Delta, for load 1 combined with each other load type: combined loads, separated loads, three ISs, and four ISs.]

The range of improvement of DISC over a traditional single-instruction-stream processor (delta) is dramatic as long as at least two ISs are enabled, especially when traditional processor performance is poor. On the other hand, in applications where single stream processor utilization is very high, the advantages of DISC are not significant. In addition, if the application does not permit keeping multiple ISs active, then a DISC architecture could result in a performance degradation. What is remarkable is the large throughput increase made available by using such a small number of parallel streams.

5.0 CONCLUSIONS

DISC shows a performance improvement over standard architectures for real time applications. The ability to dynamically reallocate the throughput permits the system to take advantage of time that would otherwise be lost. It was shown that even a system with two instruction streams significantly outperforms a single instruction stream system. In particular, the ability to partition throughput among streams, the rapid interrupt handling, and the concurrent processing during I/O should provide substantial benefits to RTS. There are many applications where DISC will be outperformed; specifically, this will be true in applications where the number of wait cycles and pipe hazards is very small.

Future work should be done to evaluate the optimum number of instruction streams for a given application. The stochastic model sheds considerable light on this question, but detailed analysis of algorithmic requirements, I/O patterns, etc. will be necessary. Two other parameters also need study: the depth and size of memory usage in the stack windows could be evaluated by stochastic means, and appropriate measures of interrupt latency need to be defined and modeled. Numerous operating system, compiler, and other software questions also need to be addressed. Finally, implementation technology dependent constraints on the performance and size of DISC architectures need to be evaluated.

ACKNOWLEDGEMENTS

This research was partially supported by Delco Systems Operations, a subsidiary of Delco Electronics Corporation, and by a University of California MICRO grant.

REFERENCES

1. Coffman, E.G. and Denning, P.J., Operating System Theory, Prentice-Hall, 1973.
2. CPU32 Reference Manual (Rev. 0.8), Motorola, 1989.
3. Patterson, D. and Sequin, C., "RISC I: A Reduced Instruction Set VLSI Computer," Proc. of the 8th Symposium on Computer Architecture, May 1981.
4. Thornton, J.E., "Parallel Operation in the Control Data 6600," Proceedings, Spring Joint Computer Conference, 1964.
5. Flynn, M.J., Podvin, A., and Shimizu, K., "A Multiple Instruction Stream Processor with Shared Resources," in Parallel Processor Systems, C. Hobbs, ed., Washington, D.C., Spartan, 1970.
6. Flynn, M.J., "Some Computer Organizations and Their Effectiveness," IEEE Transactions on Computers, Vol. C-21, No. 9, Sept. 1972.
7. Kaminsky, W.J. and Davidson, E.S., "Developing a Multiple-Instruction-Stream Single-Chip Processor," IEEE Computer Magazine, Dec. 1979.
8. Kowalik, J.S., ed., Parallel MIMD Computation: HEP Supercomputer and its Applications, The MIT Press, 1985.
9. Smith, B.J., "A Pipelined, Shared Resource MIMD Computer," Proc. of the 1978 International Conference on Parallel Processing, 1978.
10. Staley, C.A., Design and Analysis of the CCMP: A Highly Expandable Shared Memory Parallel Computer, Ph.D. Dissertation, UCSB, August 1986.
11. Halstead, R.H. and Fujita, T., "MASA: A Multithreaded Processor Architecture for Parallel Symbolic Computing," Proc. of the 15th Symposium on Computer Architecture, June 1988.
12. Nikhil, R.S. and Arvind, "Can Dataflow Subsume von Neumann Computing?," Proc. of the 16th Symposium on Computer Architecture, June 1989.

13. Agarwal, A., Lim, B., Kranz, D., and Kubiatowicz, J., "APRIL: A Processor Architecture for Multiprocessing," Proc. of the 17th Symposium on Computer Architecture, May 1990.
14. Patterson, D. and Sequin, C., "A VLSI RISC," IEEE Computer Magazine, Sept. 1982.
15. Lunde, A., "Empirical Evaluation of Some Features of Instruction Set Processor Architectures," Communications of the ACM, March 1977.
16. Alexander, G. and Wortman, D., "Static and Dynamic Characteristics of XPL Programs," IEEE Computer Magazine, Nov. 1975.
17. Patterson, D., "Reduced Instruction Set Computers," Communications of the ACM, Jan. 1985.
18. Sites, R.L., "How to Use 1000 Registers," Proc. Caltech Conference on VLSI, Jan. 1979.
19. Wyes, H.W. and Plessmann, K.W., "OMEGA: A RISC Architecture for Real-Time Applications," IFAC 10th Triennial World Congress, Munich, FRG, 1987.
20. Tanenbaum, A.S., "Implications of Structured Programming for Machine Architecture," Communications of the ACM, March 1978.
21. Halbert, D. and Kessler, P., "Windows of Overlapping Register Frames," CS292R course final report, UC Berkeley, June 1980.
22. Ditzel, D.R. and McLellan, H.R., "Register Allocation for Free: The C Machine Stack Cache," Proc. Symp. on Architectural Support for Programming Languages and Operating Systems, Palo Alto, CA, March 1982.
23. Siewiorek, D.P., Bell, C.G., and Newell, A., Computer Structures: Principles and Examples, McGraw-Hill, 1982.
24. Nemirovsky, M., DISC: A Dynamic Instruction Stream Computer, Ph.D. Dissertation, University of California, Santa Barbara, September 1990.


More information

Computer System Overview

Computer System Overview Computer System Overview Operating Systems 2005/S2 1 What are the objectives of an Operating System? 2 What are the objectives of an Operating System? convenience & abstraction the OS should facilitate

More information

1 MALP ( ) Unit-1. (1) Draw and explain the internal architecture of 8085.

1 MALP ( ) Unit-1. (1) Draw and explain the internal architecture of 8085. (1) Draw and explain the internal architecture of 8085. The architecture of 8085 Microprocessor is shown in figure given below. The internal architecture of 8085 includes following section ALU-Arithmetic

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

Digital IP Cell 8-bit Microcontroller PE80

Digital IP Cell 8-bit Microcontroller PE80 1. Description The is a Z80 compliant processor soft-macro - IP block that can be implemented in digital or mixed signal ASIC designs. The Z80 and its derivatives and clones make up one of the most commonly

More information

Computer System Overview OPERATING SYSTEM TOP-LEVEL COMPONENTS. Simplified view: Operating Systems. Slide 1. Slide /S2. Slide 2.

Computer System Overview OPERATING SYSTEM TOP-LEVEL COMPONENTS. Simplified view: Operating Systems. Slide 1. Slide /S2. Slide 2. BASIC ELEMENTS Simplified view: Processor Slide 1 Computer System Overview Operating Systems Slide 3 Main Memory referred to as real memory or primary memory volatile modules 2004/S2 secondary memory devices

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Arithmetic Unit 10032011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Recap Chapter 3 Number Systems Fixed Point

More information

PC Interrupt Structure and 8259 DMA Controllers

PC Interrupt Structure and 8259 DMA Controllers ELEC 379 : DESIGN OF DIGITAL AND MICROCOMPUTER SYSTEMS 1998/99 WINTER SESSION, TERM 2 PC Interrupt Structure and 8259 DMA Controllers This lecture covers the use of interrupts and the vectored interrupt

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

CS252 Lecture Notes Multithreaded Architectures

CS252 Lecture Notes Multithreaded Architectures CS252 Lecture Notes Multithreaded Architectures Concept Tolerate or mask long and often unpredictable latency operations by switching to another context, which is able to do useful work. Situation Today

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Ninth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

Topics in computer architecture

Topics in computer architecture Topics in computer architecture Sun Microsystems SPARC P.J. Drongowski SandSoftwareSound.net Copyright 1990-2013 Paul J. Drongowski Sun Microsystems SPARC Scalable Processor Architecture Computer family

More information

C 1. Last time. CSE 490/590 Computer Architecture. Complex Pipelining I. Complex Pipelining: Motivation. Floating-Point Unit (FPU) Floating-Point ISA

C 1. Last time. CSE 490/590 Computer Architecture. Complex Pipelining I. Complex Pipelining: Motivation. Floating-Point Unit (FPU) Floating-Point ISA CSE 490/590 Computer Architecture Complex Pipelining I Steve Ko Computer Sciences and Engineering University at Buffalo Last time Virtual address caches Virtually-indexed, physically-tagged cache design

More information

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding

More information

structural RTL for mov ra, rb Answer:- (Page 164) Virtualians Social Network Prepared by: Irfan Khan

structural RTL for mov ra, rb Answer:- (Page 164) Virtualians Social Network  Prepared by: Irfan Khan Solved Subjective Midterm Papers For Preparation of Midterm Exam Two approaches for control unit. Answer:- (Page 150) Additionally, there are two different approaches to the control unit design; it can

More information

Processors. Young W. Lim. May 12, 2016

Processors. Young W. Lim. May 12, 2016 Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version

More information

Introduction to Operating. Chapter Chapter

Introduction to Operating. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 System I/O System I/O (Chap 13) Central

More information

CISC Attributes. E.g. Pentium is considered a modern CISC processor

CISC Attributes. E.g. Pentium is considered a modern CISC processor What is CISC? CISC means Complex Instruction Set Computer chips that are easy to program and which make efficient use of memory. Since the earliest machines were programmed in assembly language and memory

More information

Efficiency and memory footprint of Xilkernel for the Microblaze soft processor

Efficiency and memory footprint of Xilkernel for the Microblaze soft processor Efficiency and memory footprint of Xilkernel for the Microblaze soft processor Dariusz Caban, Institute of Informatics, Gliwice, Poland - June 18, 2014 The use of a real-time multitasking kernel simplifies

More information

Basic concepts UNIT III PIPELINING. Data hazards. Instruction hazards. Influence on instruction sets. Data path and control considerations

Basic concepts UNIT III PIPELINING. Data hazards. Instruction hazards. Influence on instruction sets. Data path and control considerations UNIT III PIPELINING Basic concepts Data hazards Instruction hazards Influence on instruction sets Data path and control considerations Performance considerations Exception handling Basic Concepts It is

More information

Input Output (IO) Management

Input Output (IO) Management Input Output (IO) Management Prof. P.C.P. Bhatt P.C.P Bhatt OS/M5/V1/2004 1 Introduction Humans interact with machines by providing information through IO devices. Manyon-line services are availed through

More information

Computer-System Organization (cont.)

Computer-System Organization (cont.) Computer-System Organization (cont.) Interrupt time line for a single process doing output. Interrupts are an important part of a computer architecture. Each computer design has its own interrupt mechanism,

More information

8086 Interrupts and Interrupt Responses:

8086 Interrupts and Interrupt Responses: UNIT-III PART -A INTERRUPTS AND PROGRAMMABLE INTERRUPT CONTROLLERS Contents at a glance: 8086 Interrupts and Interrupt Responses Introduction to DOS and BIOS interrupts 8259A Priority Interrupt Controller

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

Computer Architecture

Computer Architecture Instruction Cycle Computer Architecture Program Execution and Instruction Sets INFO 2603 Platform Technologies The basic function performed by a computer is the execution of a program, which is a set of

More information

DSP/BIOS Kernel Scalable, Real-Time Kernel TM. for TMS320 DSPs. Product Bulletin

DSP/BIOS Kernel Scalable, Real-Time Kernel TM. for TMS320 DSPs. Product Bulletin Product Bulletin TM DSP/BIOS Kernel Scalable, Real-Time Kernel TM for TMS320 DSPs Key Features: Fast, deterministic real-time kernel Scalable to very small footprint Tight integration with Code Composer

More information

Computer Logic II CCE 2010

Computer Logic II CCE 2010 Computer Logic II CCE 2010 Dr. Owen Casha Computer Logic II 1 The Processing Unit Computer Logic II 2 The Processing Unit In its simplest form, a computer has one unit that executes program instructions.

More information

9/25/ Software & Hardware Architecture

9/25/ Software & Hardware Architecture 8086 Software & Hardware Architecture 1 INTRODUCTION It is a multipurpose programmable clock drive register based integrated electronic device, that reads binary instructions from a storage device called

More information

Announcement. Computer Architecture (CSC-3501) Lecture 25 (24 April 2008) Chapter 9 Objectives. 9.2 RISC Machines

Announcement. Computer Architecture (CSC-3501) Lecture 25 (24 April 2008) Chapter 9 Objectives. 9.2 RISC Machines Announcement Computer Architecture (CSC-3501) Lecture 25 (24 April 2008) Seung-Jong Park (Jay) http://wwwcsclsuedu/~sjpark 1 2 Chapter 9 Objectives 91 Introduction Learn the properties that often distinguish

More information

Chapter 1: Basics of Microprocessor [08 M]

Chapter 1: Basics of Microprocessor [08 M] Microprocessor: Chapter 1: Basics of Microprocessor [08 M] It is a semiconductor device consisting of electronic logic circuits manufactured by using either a Large scale (LSI) or Very Large Scale (VLSI)

More information

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION Introduction :- An exploits the hardware resources of one or more processors to provide a set of services to system users. The OS also manages secondary memory and I/O devices on behalf of its users. So

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Seventh Edition By William Stallings Objectives of Chapter To provide a grand tour of the major computer system components:

More information

Lecture 25: Board Notes: Threads and GPUs

Lecture 25: Board Notes: Threads and GPUs Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel

More information