DISC: DYNAMIC INSTRUCTION STREAM COMPUTER


Dr. Mario Daniel Nemirovsky
Apple Computer Corporation

Drs. Forrest Brewer and Roger C. Wood
Electrical and Computer Engineering Department
University of California, Santa Barbara

ABSTRACT

This paper applies a form of instruction stream interleaving to the problem of high performance real-time systems. Such systems are characterized by high bandwidth, stochastically occurring interrupts as well as high throughput requirements. The DISC computer is based on dynamic interleaving, where the next instruction to be executed is dynamically selected from several possible simultaneously active streams. Each stream context is stored internally, making active task switching possible in a single instruction cycle. For several RTS applications the DISC concept promises higher computation throughput at lower cost than is possible on contemporary RISC processors. Implementation and register organization details are presented as well as simulation results.

1.0 INTRODUCTION

This paper describes an architecture concept and implementation which is specifically oriented for use in real time controller systems (RTS). Such systems are characterized by various priority hard and soft deadlines for completion of tasks, and by efficient interaction of the processor with several peripherals running at vastly differing data rates. Ideally, system deadlines can all be met by the computation engine in all cases. However, reasonable provisions must be made for graceful degradation of low priority tasks in exceptional circumstances. Another characteristic of such systems is the notion that the worst case delays of the system must fall within the critical timing constraints. It is of no use for the average performance to meet these requirements, as the system may incur permanent damage if these constraints are not met. Many present applications of micro-controllers are in relatively low-end applications where meeting these requirements is simple even for slow microprocessors. These uses have not pushed the technology significantly, except in methods for lowering costs. Recently, however, there has been a rapid increase in the complexity of control systems as mechanical and computer integrated manufacturing systems have become common. Other real time systems occur in the automotive industry and in airplane control systems. In these newer applications, it is questionable whether conventional architectures provide a cost effective computation engine solution. We propose an efficient architectural concept for construction of real time system controllers which promises significantly higher performance than conventional approaches for modest increases in cost.

Real time systems present different constraints to the system architect than do conventional systems. In particular, externally derived deadlines from the controlled system produce widely varying computational loads on the controller, as it must respond to these external requests and interrupts in a specified amount of time. In this work, we consider deadline times from microseconds to milliseconds, common to conventional microprocessor system controllers. In these systems, I/O timing constraints become a primary issue. Often the data required is generated by a sensor on a time scale much slower than the operation of the processor. On the other hand, keeping the data current is much desired, which prevents caching or queueing earlier data values. In these cases, it is difficult to make use of the processor idle time while it is awaiting new data, due to the overhead required to change program context. Interrupt processing is also very important in RTS, to help alleviate overhead due to polling and to insure quick responses to exceptional or critical deadlines. For this reason, interrupt latency (a measure of the time to respond to an interrupt signal) is an important performance measure for real time control systems. If the control system is complex and dynamic rescheduling is required, then there must be provisions for rapid context switching as processes are started and stopped as required by the system. Finally, it has been shown [1] that if the processor throughput can be partitioned arbitrarily among the executing processes, scheduling which is in some sense optimal can be achieved. This throughput partitioning must be done with very low overhead so as not to compete with the processing tasks themselves.

2.0 PREVIOUS WORK

Previous work in architectures for real time controller systems involves a few specialized architectures and several specially modified current microprocessors.

Although digital signal processing (DSP) chips are often used in real time systems, they are usually used as auxiliary processors, as their specialized architectures do not perform more general (non-numeric) processing efficiently. In addition, the large parallelism and register set size of DSPs make these devices very inefficient for use in interrupt driven or heavy context switching applications.

Several common microprocessors have been modified for real time control applications with the addition of internal timer, DMA, and communication interface functions. Examples include the 68332, 68HC11, and 8748 microprocessors. It is important to note that the general purpose architecture of the original microprocessor is retained in these controllers, with the extra functions simplifying the peripheral interfacing. For this reason, these micro-controllers have interrupt latency and context switching behaviors similar to the original microprocessor parent. The 68332 [2] does have an auxiliary processor, called the timer processing unit (TPU), which is capable of performing relatively complex timed process behaviors such as stepper motor control. The purpose of this unit is to reduce the frequency of interrupts and context switches required by the real time system.

Another solution to the interrupt latency and context switching time problems is the use of a stack architecture such as that of the RTX2000 machine. Since the instruction stream is primarily zero address (stack) operations, these machines do not have large internal register sets which need to be saved. For this reason, the interrupt latency and the context switching times are very fast. The stack instructions, however, do not lend themselves to manipulating complex I/O devices, due to the lack of support for complex addressing modes to these peripherals. These processors also tend to have slightly lower performance than their register heavy counterparts.

Instruction level interleaving is not new; early processors using interleaving include the CDC6600 I/O processor [4], the multiple instruction stream processor of Flynn [5,6], the work of Kaminsky and Davidson [7], and the Denelcor HEP computer [8]. More recent work includes the UCSB CCMP system [10], the APRIL [13], and others [9, 11, 12]. This work (with the exception of the CDC) is primarily directed towards the performance gains and ease of parallel programming implementations possible with interleaving. Very little attention has been given to the advantages of interleaving in real time systems. Interleaving architectures rely on maintenance of a simultaneous context for each of the processes running in the processor. This leads inevitably to a large overhead of registers required to sustain these contexts. These registers are often organized into register windows or multiple windows with disadvantageous worst case replacement behavior. These memory problems have been studied by Sites [18], and by Wyes and Plessmann [19], who use background processes to update the register windows before registers are needed. Another alternative is proposed in the CRISP [22] architecture, using a stack cache. We will propose a variable sized multi-window organization for this purpose.

3.0 DYNAMIC INSTRUCTION STREAM COMPUTER (DISC)

3.1 DISC Concept

The dynamic instruction stream computer (DISC) concept relies on an architecture maintaining several simultaneous instruction streams which are dynamically started and halted by the processor. Each of these streams is interleaved in the processor at the instruction level, providing the highest level of granularity for task scheduling and partitioning. The instruction level interleaving allows for efficient pipelining to obtain high instruction throughput not achievable in conventional architectures. For applications in real time systems, however, it is the dynamic nature of the effectively parallel streams which is particularly useful. In a conventional processor, the control unit selects the next instruction to be executed in sequential order unless this order is changed by a jump or other control instruction. In DISC, the sequential order is replaced by a hardware scheduler which selects, from among the several possible streams, a particular instruction for execution on the next cycle. It is thus possible to assign an interrupt to a given stream which begins processing effectively in parallel, at a given level of throughput partitioned with respect to the rest of the streams then active in the processor. Streams can also trigger other instruction streams, and multiple streams can synchronize with each other when necessary. As an example, consider a machine running 3 streams concurrently, where one of the streams is halted by wait states from a slow peripheral. The other streams are automatically allocated the instruction slots which would otherwise be spent as polling or interrupt overhead. In situations where the number of active contexts is smaller than the number of supported streams, all overhead for context switching is removed. (Even when this is not the case, for many real time systems, the frequency of context switches should be reduced.)

It is well known that multi-stream interleaving on a pipelined processor is more efficient than single stream execution. DISC exploits this efficiency advantage by implementing a RISC-based processing engine designed to automatically interleave instruction execution from a small number of stored process contexts. Scheduling of streams on an instruction basis allows simple partitioning of the processing power among the several active real time tasks. This schedule allows several versions of real time scheduling models, including preemptive and fixed schedules as well as General scheduling [1], with little or no overhead. The cost for this is the necessity of several stored contexts, along with the ancillary registers which must be duplicated for each stream. In particular, PC, SP, and El registers must be maintained for each stream. To manage this large number of registers, DISC introduces the concept of a stack window register set. These registers are similar to the register windows proposed by Patterson in RISC-I [3], with the exception that the number of registers allocated in a procedure call is variable. Each stream is allocated its own stack window, as well as a common set of global registers used for inter-stream parameter passing. The stack window, described below, is very important in a hard deadline environment to minimize context switching, procedure call/return, and interrupt overhead.

3.2 Pipelining

Pipelining is a mechanism by which multiple instructions from a sequential instruction stream are simultaneously executed in an overlapped fashion. For this discussion we will consider a five stage pipeline, consisting of instruction fetch (IF), instruction decode (ID), read registers (RR), execute (EX), and write register (WR). The essential feature of a pipe is that ensuing instructions are scheduled before earlier ones have completed. This leads to hazards, which lower the performance of the pipeline. A hazard is a situation which precludes executing the next instruction of the stream. Hazards are caused by violation of either data or control dependencies.

A data hazard exists when an instruction, A, is modifying data which is used by the next instruction, B. In this case, the data for B has not been updated by the time it is read. To insure correct operation, instruction A should be completed before B executes its third stage; the pipeline should keep running A but delay all those instructions that follow A until the register write is completed. A control hazard takes place when the instruction sequence is modified as a result of an interrupt or an instruction such as a jump or branch. By the time an instruction modifies the program sequence, there will be several instructions in the pipe which belong to the incorrect sequence. Any such instructions need to be flushed from the pipe. It is important to reduce the performance overhead associated with hazards. Several techniques, such as delayed branching and pipeline bypasses, reduce the effect of hazards but generally do not eliminate them. Interleaving, however, can be used to eliminate hazards from pipeline execution, and has been employed in a number of systems [4-8].

3.3 Interleaving

A pipeline is interleaved if, at every pipe cycle, an instruction from a different instruction stream enters the pipe and there are at least as many instruction streams as pipe stages. Therefore, interleaving is a way to share the processor resources between multiple processes. Figure 3.1 shows an interleaved pipeline, in which five independent instruction streams or tasks are shown in a five stage pipeline. The result of such an interleaved pipe is the equivalent of five parallel processors, where each processor is running at one instruction every five cycles. Thus in an ideal pipeline there is no performance gain from interleaving instructions. In fact, the overhead of supporting several parallel streams may slow down the achievable clock cycle, hence the performance may decrease. However, a single stream running on a pipeline will have both data and control hazards reducing the throughput of the pipeline. In an interleaved pipeline, all instructions present in the pipeline belong to separate processes at all times. Thus each instruction for each process completes before further instructions from that process are fetched. Under these conditions, and assuming the processes to be independent, there are no control or data hazards at all. A representative branch is shown in Figure 3.2. Hazards between separate processes are possible if the processes are not independent; for example, the processes may communicate. Special hardware can be added for process communication, which will reduce the overhead in these cases. As a result, the interleaved pipeline achieves higher throughput on several processes than an identical pipeline executing a single stream, due to the reduction in the number of hazards.

[Figure 3.1: Interleaved Pipeline. (a,1 indicates instruction a running on instruction stream 1.)]

[Figure 3.2: Interleaved Pipeline During a Jump. No other instruction in the pipe belongs to instruction stream 1.]
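
As a concrete illustration of this point, the following sketch (added here for exposition; the round-robin issue policy and cycle counts are assumptions, not the paper's model) issues instructions from N streams into the five stage pipe of Section 3.2 and measures the spacing between consecutive instructions of the same stream. With at least as many streams as stages, each instruction completes before its successor from the same stream is fetched, so intra-stream hazards cannot occur.

# Illustrative sketch: round-robin interleaving of N instruction streams
# through a 5-stage pipe (IF, ID, RR, EX, WR). With streams >= stages,
# consecutive instructions of any one stream are >= 5 cycles apart.

STAGES = ["IF", "ID", "RR", "EX", "WR"]

def issue_schedule(n_streams: int, n_cycles: int):
    """Return, per cycle, which stream issues into IF under round-robin."""
    return [cycle % n_streams for cycle in range(n_cycles)]

def min_intra_stream_spacing(n_streams: int, n_cycles: int = 100) -> int:
    """Smallest cycle distance between two issues of the same stream."""
    last_issue = {}
    spacing = n_cycles
    for cycle, stream in enumerate(issue_schedule(n_streams, n_cycles)):
        if stream in last_issue:
            spacing = min(spacing, cycle - last_issue[stream])
        last_issue[stream] = cycle
    return spacing

for n in (1, 2, 5):
    s = min_intra_stream_spacing(n)
    print(f"{n} stream(s): spacing {s} cycles, hazard-free = {s >= len(STAGES)}")
# 1 stream(s): spacing 1 cycles, hazard-free = False
# 2 stream(s): spacing 2 cycles, hazard-free = False
# 5 stream(s): spacing 5 cycles, hazard-free = True

With fewer streams than stages the spacing falls below the pipe depth, which is exactly where the bypassing and flushing machinery of a conventional pipeline would be needed.
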
The performance increase for interleaved pipelines is not without cost. There must be sufficient registers to retain the states of all executing processes; the resources required to hold a context must be duplicated as many times as there are virtual processors to be supported. This cost is highly dependent on the architecture of the processor. It is important to have a very small context per process, and to have minimal extra hardware to support the multiple process switching. The question remains: how does interleaving provide a solution to the real-time controller requirements?

3.4 Dynamic Interleaving

As we described earlier, a real-time system requires that multiple tasks be able to run concurrently. Some of these tasks occur at deterministic times, others at random times. There are a large number of interrupts, and the I/O speed is generally much slower than the processor speed. Interleaving could be a very good solution if a sufficient number of active tasks could be guaranteed, but this is difficult because of the randomness. Thus, we introduce the concept of dynamic interleaving. A pipeline organization is said to be dynamically interleaved if it can run anywhere from a single instruction stream to multiple instruction streams, the computational power of the processor can be allocated between the multiple virtual processors in any way, and the throughput can be dynamically reallocated when the instruction stream scheduled to run is not ready.

This is achieved in DISC by dynamically selecting the next instruction to execute from the possible streams. In the case where only one stream is active, each pipeline slot executes a sequential instruction from that stream. The concept is illustrated by Figure 3.3. The figure shows up to four instruction streams (IS1, IS2, IS3, and IS4). Assume that the total throughput of the processor is T and the following partition is assigned: T/2 to IS1, and T/6 each to IS2, IS3, and IS4. As the figure shows, when IS1 is the only one active, it will be dynamically assigned T, even though the static assignment is T/2. Similarly, if IS3 is inactive, its processor time will be dynamically reassigned to IS2 and IS4. Dynamic interleaving greatly facilitates scheduling and multitasking, since each task can be assigned its own virtual processor of adjustable computational power.

[Figure 3.3: Dynamic Instruction Stream Diagram.]

Real time systems also require hard deadline management, which is often implemented via timer based interrupts. In conventional architectures, these interrupts require context switches. In DISC, an interrupt, instead of suspending a running process, can create its own instruction stream. This makes the system more deterministic, since other tasks can keep running even when interrupts are invoked. When the interrupt routine is finished, the throughput will be dynamically reallocated to the remaining instruction streams. Context switching will not be required as long as the number of instruction streams required by the application is less than or equal to the number supported by the processor. Otherwise, some context switching will be required, but the total number of switches will be smaller in a DISC than in a traditional architecture.
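
The following sketch models this reallocation policy in software (a minimal illustration; the credit-based selection rule is an assumption for exposition, not the actual DISC scheduler logic). Streams are indexed 0 to 3, standing for IS1 to IS4, with static shares T/2, T/6, T/6, T/6; slots belonging to inactive streams flow automatically to whichever streams remain active.

# A software model of dynamic interleaving: static throughput weights
# T/2, T/6, T/6, T/6, with slots of inactive streams reallocated.
from fractions import Fraction
from collections import Counter

WEIGHTS = [Fraction(1, 2), Fraction(1, 6), Fraction(1, 6), Fraction(1, 6)]

def schedule(active, n_cycles):
    """Yield the stream chosen for each cycle among the active streams."""
    credit = [Fraction(0)] * len(WEIGHTS)
    total = sum(WEIGHTS[i] for i in active)
    for _ in range(n_cycles):
        for i in active:                     # inactive streams earn nothing,
            credit[i] += WEIGHTS[i] / total  # so their share is redistributed
        chosen = max(active, key=lambda i: credit[i])
        credit[chosen] -= 1
        yield chosen

print(Counter(schedule({0, 1, 2, 3}, 60)))   # roughly a 30/10/10/10 split
print(Counter(schedule({1, 3}, 60)))         # IS1, IS3 idle: IS2 and IS4 get ~30 each
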
3.5 The Stack Window

Due to the speed degradation of external access with respect to the processor, it is important to keep operands in the processor. Therefore, the processor should have enough registers to be able to allocate registers to most, or all, of the local and global variables for all streams. However, keeping local variables in internal registers causes context switching overhead on procedure calls and returns [14, 15]. To resolve the tension between a large register set and fast context switching, a multi-window approach is a very logical alternative [14-23]. In addition to reducing the local register saving/restoring to just a pointer change, if the windows overlap, then the overlapped registers can be used for argument passing. DISC is an architecture which contains multiple instruction streams; each instruction stream should have its own multiple window file.

The approach used on DISC is called a stack window. Figure 3.4 shows the window file in the stack window approach. The Bottom Of Stack register (BOS) points to the last empty word of the stack window (SW). The Active Window Pointer (AWP) points to register zero (R0) of the window. If the window size is S, then the address of R0 is AWP, R1 is AWP-1, ..., Rn is AWP-n, and R(S-1) is AWP-S+1. In the instruction set, stack increment and decrement options are added to some instructions such as Load, Store, Add, Subtract, etc. When an instruction increments the AWP, the new AWP location becomes R0, R0 becomes R1, R1 becomes R2, and so on (Figure 3.5). Thus the SW is a window that moves up and down as demands require.

[Figure 3.4: Stack Window Approach.]

Let us assume that instructions which increment the AWP do so at the end of the instruction. Then a procedure call will increment the AWP, storing the return address there. On a return, the TOS is decremented by the instruction offset (no larger than the window size) to the return address location. It restores the program counter and decrements the AWP one more time, leaving it at the same place it was before the call took place.

[Figure 3.5: Stack Window Movements. Incrementing the AWP makes the new location R0 and shifts the old register names up; decrementing reverses this, and the old R0 is lost.]
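
A toy software model of the stack window may make the addressing concrete (a sketch under assumed parameters: the eight-register window follows the DISC1 figures of Section 3.7, while the physical file size and call convention details are illustrative). Register Rn resolves to physical register AWP - n, so a call renames the entire window with a single pointer change.

# Toy model of stack-window addressing: Rn maps to physical register
# AWP - n, so moving the AWP renames the whole window in one step.

class StackWindow:
    def __init__(self, n_physical=64, window_size=8):
        self.regs = [0] * n_physical   # physical register file (size assumed)
        self.S = window_size
        self.awp = window_size - 1     # Active Window Pointer

    def _addr(self, n):
        assert 0 <= n < self.S, "register index outside the window"
        return self.awp - n            # R0 = AWP, R1 = AWP-1, ..., R(S-1) = AWP-S+1

    def read(self, n): return self.regs[self._addr(n)]
    def write(self, n, v): self.regs[self._addr(n)] = v

    def call(self, return_addr, n_alloc=1):
        """Procedure call: bump the AWP and store the return address at the new R0."""
        self.awp += n_alloc
        self.write(0, return_addr)

    def ret(self, offset=0):
        """Return: step back to the return address, restore the PC, drop one more slot."""
        self.awp -= offset
        pc = self.read(0)
        self.awp -= 1                  # AWP is back where it was before the call
        return pc

sw = StackWindow()
sw.write(0, 42)                        # caller's R0
sw.call(return_addr=0x100)             # callee sees a fresh R0 holding the return address
assert sw.read(0) == 0x100
assert sw.ret() == 0x100 and sw.read(0) == 42
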

3.6 Communication Issues

3.6.1 Input/Output

Real time systems require multiple I/O peripherals with different access times; therefore, DISC has to support an asynchronous data bus. DISC is a load/store type machine. To avoid stopping the other instruction streams when a load or store instruction is issued, a pseudo-DMA type load/store was implemented on DISC1. On a load instruction, the effective address of the external request is calculated. It is then loaded into the Asynchronous Bus Interface (ABI), along with the address of the destination register. The IS requesting the read cycle is sent into a wait state, and the ABI initiates the read cycle. While the access is taking place, any other IS that requests a load or store is also sent into a wait state. Once the read is completed, the ABI stores the data into the destination register and re-activates all waiting ISs. This is done without affecting the running instruction streams. The store instruction works in a similar way.
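
The following event-style sketch illustrates the ABI behavior just described (the data structures and method boundaries are assumptions for illustration, not the DISC1 hardware): a load parks the requesting stream in a wait state, a second requester finding the bus busy also waits (and would re-issue its request on wake-up), and completion writes the destination register and re-activates every waiting stream.

# Sketch of the Asynchronous Bus Interface (ABI) pseudo-DMA load.

class ABI:
    def __init__(self):
        self.busy = False
        self.pending = None            # (stream, dest_register, address)
        self.waiting = set()           # streams stalled on the bus

    def load(self, stream, dest_reg, addr, streams):
        if self.busy:                  # bus occupied: this IS waits and re-issues later
            self.waiting.add(stream)
            streams[stream]["active"] = False
            return
        self.busy = True
        self.pending = (stream, dest_reg, addr)
        self.waiting.add(stream)
        streams[stream]["active"] = False   # requester enters a wait state

    def complete(self, data, streams):
        """Called when the external device answers, many cycles later."""
        stream, dest_reg, _ = self.pending
        streams[stream]["regs"][dest_reg] = data
        for s in self.waiting:              # re-activate all waiting ISs
            streams[s]["active"] = True
        self.busy, self.pending, self.waiting = False, None, set()

streams = {i: {"active": True, "regs": [0] * 16} for i in range(4)}
abi = ABI()
abi.load(stream=0, dest_reg=3, addr=0x8000, streams=streams)
assert not streams[0]["active"]        # IS0 waits; IS1-IS3 keep running
abi.complete(data=0xBEEF, streams=streams)
assert streams[0]["active"] and streams[0]["regs"][3] == 0xBEEF
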
3.6.2 Interprocess Communication and Synchronization

Since DISC has multiple ISs, communication between ISs is required. This can be accomplished in different ways; DISC1 supports three of them. There are four global registers that are shared between all the ISs. In addition, there is an internal global memory shared between the ISs; since the global registers and internal memory allow read-modify-write instructions, they can be used as semaphores. IPC can also be done via software interrupts, which are discussed in the next section. Process synchronization can be achieved by either semaphore polling or by interprocess interrupts. Interrupts are more efficient, since they do not require repetitive instructions on the processing engine.

3.6.3 Interrupts

The interrupt structure on DISC is very special because of the importance of interrupts in real time systems, and because interrupts are also used to obtain inter-IS communication and synchronization. Every IS has one interrupt register (IR) and one mask register (MR). On DISC1 the interrupt registers contain 8 bits; bit 7 is the highest priority, and bit 0 is the lowest priority (the background, or normal mode of running). Interrupts 7 to 1 are vectored interrupts. Interrupt 0 is the background; no vector is generated. For example, IS0 interrupts IS2 by setting a bit in the IR of IS2. External interrupts can also set a request in any of the IRs. Finally, interrupts can be generated automatically, such as the stack overflow or other exceptional interrupts. Interrupt request bits can only be cleared by the IS to which the IR belongs. When no bit of its IR is set, the instruction stream will not be scheduled (not active). Once an interrupt is requested, if it is the highest priority one pending, a vector interrupt will be generated. The next instruction that belongs to that IS will be started at the address given by the interrupt vector. Vectored interrupts were chosen in the implementation to avoid the need for polling to determine the interrupt source. Synchronization between ISs can be obtained via interrupts. When interrupts are used to synchronize ISs, the first IS to reach the join point is deactivated until the other IS arrives. This is much better than having the IS poll a semaphore to check synchronization, since the computation throughput which would be spent polling is dynamically allocated to the active ISs.
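
A small sketch of the per-stream interrupt selection described above may help (the vector addresses are hypothetical, and the IR-and-MR masking rule is an assumption; the paper does not spell out how the mask register combines with the IR).

# Per-stream interrupt selection: 8-bit IR and MR; bit 7 is highest
# priority, bit 0 is the background; bits 7..1 produce a vector, and a
# stream with no pending bits is simply not scheduled.

VECTOR_TABLE = {n: 0x0100 + 4 * n for n in range(1, 8)}  # hypothetical addresses

def highest_pending(ir: int, mr: int):
    """Return the highest-priority unmasked pending bit, or None if idle."""
    pending = ir & mr & 0xFF          # masking rule assumed, not from the paper
    for bit in range(7, -1, -1):
        if pending & (1 << bit):
            return bit
    return None

def next_fetch_address(ir, mr, background_pc):
    bit = highest_pending(ir, mr)
    if bit is None:
        return None                   # no bit set: stream is not scheduled
    if bit == 0:
        return background_pc          # background: no vector generated
    return VECTOR_TABLE[bit]          # vectored interrupt, no polling needed

assert next_fetch_address(0b0000_0000, 0xFF, 0x2000) is None
assert next_fetch_address(0b0000_0001, 0xFF, 0x2000) == 0x2000
assert next_fetch_address(0b1000_0101, 0xFF, 0x2000) == VECTOR_TABLE[7]
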

3.7 Implementation of DISC

DISC1 is the experimental implementation of the DISC concept. More information about the implementation and models is available [24]. This implementation was designed to prove the feasibility of DISC and to obtain benchmarks. The design is targeted to the typical control requirements of automotive electronics. A 16-bit architecture was chosen for DISC1, since the goal was to compare its performance with respect to present RTS controllers; in fact, present technology would allow physical implementation as a 32-bit architecture. A Harvard architecture was chosen to allow simultaneous instruction and data fetch. Instructions are fetched through the program bus, which is 24 bits wide, while the data bus is 16-bit asynchronous. An asynchronous data bus is required since controllers have a very large variety of I/O peripherals with a large variety of access times. DISC1 supports up to four instruction streams running concurrently and uses a four stage pipeline. The scheduler of DISC1 is responsible for selecting which instruction stream will be executed next, based on present priority. The computational power of the system can be allocated evenly between ISs, or assigned in increments as small as 1/16 of the total. DISC1 contains 2 Kbytes of internal memory in addition to the stack window registers. The internal memory is shared between all ISs, with access via register indirect, register plus offset, or 9-bit immediate addressing. DISC1 is a load/store computer with a reduced instruction set. All instructions are effectively single cycle, including the load and store instructions, with the proviso of asynchronous waits for external memory and I/O. This simplifies the design and reduces the overhead cost of the multiple instruction streams. A 16x16 integer hardware multiplier is included in DISC1. DISC1 has 16 registers per instruction stream: four global, four special registers, and eight local (stack window) registers. Figure 3.6 shows a block diagram of DISC1. An RTL model of DISC1 was written in Verilog, and several programs were run on the model.

[Figure 3.6: Block Diagram of DISC1.]

4.0 EVALUATION OF DISC PERFORMANCE

4.1 Stochastic Model

A stochastic model was developed to evaluate the DISC architecture. Poisson distributions, with the indicated means, were assumed for the number of consecutive instructions for which an IS is active (meanon) or inactive (meanoff), for the number of instructions between external access requests (mean_req), and for I/O request times (mean_io). Also controlled were the percentage of external requests that were directed to memory (alpha), the percentage of instructions, such as jumps, calls, returns, branches and interrupts, that modify program flow (aljmp), and the number of wait cycles for an external memory access. The model simulates the sequencer used in DISC1, so that any sequence that can run on DISC1 can be simulated. The model assumes that when a jump instruction takes place, all of the instructions in the pipe that belong to the same IS have to be flushed from the pipe. If only one IS is active, this simplifying assumption makes DISC performance worse than that of a single IS computer. For an external request, either I/O or memory, if the access time is larger than zero, all instructions in the pipe belonging to the same IS are flushed, and the IS requesting access is put into a wait state. This is done in order to allow other ISs to keep running, but it penalizes DISC with respect to a standard architecture if only one IS is being run, since the pipe could simply be halted. If the bus was busy at the time access is requested, the instruction is flushed and a new external access is requested once the IS is out of the wait state. If the bus was not busy, the busy flag is set and it remains set until the access time is completed. Upon completion of the external access, all waiting flags are cleared.

Two performance measures for DISC are evaluated: processor utilization on DISC, PD, and delta. Delta is a value used to compare a single IS system with a multiple IS system, and is defined as:

    delta = ((PD - PS) / PS) * 100%

PS (processor utilization on the standard processor) is calculated as the total number of executable instructions divided by the sum of the total number of executable instructions, the number of cycles that the data bus was busy, and the number of cycles dropped due to jump type instructions. This assumes that instructions are not being executed in a standard processor when it is waiting for data; to assume the contrary implies support of out-of-sequence code and/or a smart compiler. It also assumes that every time a jump type instruction is executed, the standard processor will require (pipe_length - 1) cycles to be flushed from the pipeline. This is conservative, in that delayed branching can be used to reduce the number of cycles that need to be flushed. However, delayed branching can only be applied to statically analyzable portions of the design and is less effective as pipeline depth increases.

It is common practice in RTS analysis to measure the interrupt latency time as a system evaluation. By dedicating a stream to a particular interrupt, we can achieve very high figures of merit, since the instructions will start execution immediately. However, we must still ensure that the appropriate context is available and that the interrupt executes quickly enough once it is started. The latency time as conventionally described is ambiguous in this sense, since a short interrupt to retrieve a value will execute very quickly (the common micro-controller case) while a longer interrupt will be allocated throughput by the hardware scheduler.
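
For concreteness, the two measures can be written out as follows (a sketch: the instruction and cycle counts below are invented for the example, and PD itself would come from the stochastic simulation).

# PS and delta as defined above, with made-up counts for illustration.

def standard_utilization(executed, bus_busy_cycles, jumps, pipe_length=4):
    """PS: executed instructions over executed + bus-busy + jump-flush cycles.
    Each jump costs (pipe_length - 1) flushed cycles on the standard machine."""
    flushed = jumps * (pipe_length - 1)
    return executed / (executed + bus_busy_cycles + flushed)

def delta(pd, ps):
    """Relative improvement of DISC over the single-stream processor, in %."""
    return (pd - ps) / ps * 100.0

ps = standard_utilization(executed=10_000, bus_busy_cycles=2_500, jumps=1_000)
print(f"PS = {ps:.3f}, delta = {delta(pd=0.90, ps=ps):+.1f}%")
# PS = 0.645, delta = +39.5%
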
4.2 Simulation Results

A large number of simulation runs were made to evaluate the DISC architecture. Parameters varied, in addition to those described above, included the scheduler sequence, the number of cycles for an external memory access (tmem), and the pipeline length. One set of runs evaluated the effect of jump instructions only, another of external I/O only. Finally, a set of four program loads was specified to simulate more realistic RTS behavior. Loads 1 and 2 represent typical RTS behavior, differing principally in the fact that load 2 alternates between active and inactive, while load 1 is always active. Load 3 represents a DSP type program running only from internal memory, and load 4 an interrupt driven program which is only active while handling an interrupt. These loads were also combined into a single IS; e.g., load (1:4) represents a statistical combination of loads 1 and 4 into a single IS. Table 4.1 shows the parameters for each of these runs, and Tables 4.2 and 4.3 show the processor utilization and delta for different combinations.

[Table 4.1: Parameter Set for Typical Programs. Parameters meanon, meanoff, mean_req, alpha, tmem, mean_io, and aljmp for loads 1, 1:2, 1:3, 1:4, 2, 3, and 4.]

In Table 4.2 we show that as the degree of partitioning increases, so does the utilization. Hence if we have a program that can be partitioned into multiple ISs, a much better processor utilization is obtained, especially if the processor utilization of the single IS is low. Even when the processor utilization of a single IS is very high, there are still some gains obtained by running multiple instruction streams, as shown for load 3 in Table 4.2.

[Table 4.2a: Processor Utilization PD, and Table 4.2b: Delta, for each load versus the maximum number of instruction streams.]

Table 4.3 shows results for load 1 combined with each of the other loads: first into a single IS, then each run in an independent IS, then with load 1 partitioned into two ISs, and finally with both loads partitioned into dual ISs.

[Table 4.3a: Processor Utilization PD, and Table 4.3b: Delta, for load 1 combined with each other load type: combined loads, separated loads, three ISs, and four ISs.]

The range of improvement of DISC over a traditional single-instruction-stream processor (delta) is dramatic as long as at least two ISs are enabled, especially when traditional processor performance is poor. On the other hand, in applications where single stream processor utilization is very high, the advantages of DISC are not significant. In addition, if the application does not permit keeping multiple ISs active, then a DISC architecture could result in a performance degradation. What is remarkable is the large throughput increase made available by using such a small number of parallel streams.

5.0 CONCLUSIONS

DISC shows a performance improvement over standard architectures for real time applications. The ability to dynamically reallocate the throughput permits the system to take advantage of time that would otherwise be lost. It was shown that even a system with two instruction streams significantly outperforms a single instruction stream system. In particular, the ability to partition throughput among streams, the rapid interrupt handling, and the concurrent processing during I/O should provide substantial benefits to RTS. There are many applications where DISC will be outperformed; specifically, this will be true in applications where the number of wait cycles and pipe hazards is very small.

Future work should be done to evaluate the optimum number of instruction streams for a given application. The stochastic model sheds considerable light on this question, but detailed analysis of algorithmic requirements, I/O patterns, etc. will be necessary. Two other parameters also need study: the depth and size of memory usage in the stack windows could be evaluated by stochastic means, and appropriate measures of interrupt latency need to be defined and modeled. Numerous operating system, compiler, and other software questions also need to be addressed. Finally, implementation technology dependent constraints on the performance and size of DISC architectures need to be evaluated.

ACKNOWLEDGEMENTS

This research was partially supported by Delco Systems Operations, a subsidiary of Delco Electronics Corporation, and by a University of California MICRO grant.

REFERENCES

1. Coffman, E.G. and Denning, P.J., Operating System Theory, Prentice-Hall, 1973.
2. CPU32 Reference Manual (Rev. 0.8), Motorola, 1989.
3. Patterson, D. and Sequin, C., "RISC I: A Reduced Instruction Set VLSI Computer," Proc. of the 8th Symposium on Computer Architecture, May 1981.
4. Thornton, J.E., "Parallel Operation in the Control Data 6600," Proceedings, Spring Joint Computer Conference, 1964.
5. Flynn, M.J., Podvin, A., and Shimizu, K., "A Multiple Instruction Stream Processor with Shared Resources," in Parallel Processor Systems, C. Hobbs, ed., Washington, D.C., Spartan, 1970.
6. Flynn, M.J., "Some Computer Organizations and Their Effectiveness," IEEE Transactions on Computers, Vol. C-21, No. 9, Sept. 1972.
7. Kaminsky, W.J. and Davidson, E.S., "Developing a Multiple-Instruction-Stream Single-Chip Processor," IEEE Computer Magazine, Dec. 1979.
8. Kowalik, J.S., ed., Parallel MIMD Computation: HEP Supercomputer and its Applications, The MIT Press, 1985.
9. Smith, B.J., "A Pipelined, Shared Resource MIMD Computer," Proc. of the 1978 International Conference on Parallel Processing, 1978.
10. Staley, C.A., Design and Analysis of the CCMP: A Highly Expandable Shared Memory Parallel Computer, Ph.D. Dissertation, UCSB, August 1986.
11. Halstead, R.H. and Fujita, T., "MASA: A Multithreaded Processor Architecture for Parallel Symbolic Computing," Proc. of the 15th Symposium on Computer Architecture, June 1988.
12. Nikhil, R.S. and Arvind, "Can Dataflow Subsume von Neumann Computing?," Proc. of the 16th Symposium on Computer Architecture, June 1989.

13. Agarwal, A., Lim, B., Kranz, D., and Kubiatowicz, J., "APRIL: A Processor Architecture for Multiprocessing," Proc. of the 17th Symposium on Computer Architecture, May 1990.
14. Patterson, D. and Sequin, C., "A VLSI RISC," IEEE Computer Magazine, Sept. 1982.
15. Lunde, A., "Empirical Evaluation of Some Features of Instruction Set Processor Architectures," Communications of the ACM, March 1977.
16. Alexander, G. and Wortman, D., "Static and Dynamic Characteristics of XPL Programs," IEEE Computer Magazine, Nov. 1975.
17. Patterson, D., "Reduced Instruction Set Computers," Communications of the ACM, Jan. 1985.
18. Sites, R.L., "How to Use 1000 Registers," Proc. Caltech Conference on VLSI, Jan. 1979.
19. Wyes, H.W. and Plessmann, K.W., "OMEGA: A RISC Architecture for Real-Time Applications," IFAC 10th Triennial World Congress, Munich, FRG, 1987.
20. Tanenbaum, A.S., "Implications of Structured Programming for Machine Architecture," Communications of the ACM, March 1978.
21. Halbert, D. and Kessler, P., "Windows of Overlapping Register Frames," CS292R course final report, UC Berkeley, June 1980.
22. Ditzel, D.R. and McLellan, H.R., "Register Allocation for Free: The C Machine Stack Cache," Proc. Symp. on Architectural Support for Programming Languages and Operating Systems, Palo Alto, CA, March 1982.
23. Siewiorek, D.P., Bell, C.G., and Newell, A., Computer Structures: Principles and Examples, McGraw-Hill, 1982.
24. Nemirovsky, M., DISC: A Dynamic Instruction Stream Computer, Ph.D. Dissertation, University of California, Santa Barbara, September 1990.


More information

Computer System Overview

Computer System Overview Computer System Overview Operating Systems 2005/S2 1 What are the objectives of an Operating System? 2 What are the objectives of an Operating System? convenience & abstraction the OS should facilitate

More information

1 MALP ( ) Unit-1. (1) Draw and explain the internal architecture of 8085.

1 MALP ( ) Unit-1. (1) Draw and explain the internal architecture of 8085. (1) Draw and explain the internal architecture of 8085. The architecture of 8085 Microprocessor is shown in figure given below. The internal architecture of 8085 includes following section ALU-Arithmetic

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

Digital IP Cell 8-bit Microcontroller PE80

Digital IP Cell 8-bit Microcontroller PE80 1. Description The is a Z80 compliant processor soft-macro - IP block that can be implemented in digital or mixed signal ASIC designs. The Z80 and its derivatives and clones make up one of the most commonly

More information

Computer System Overview OPERATING SYSTEM TOP-LEVEL COMPONENTS. Simplified view: Operating Systems. Slide 1. Slide /S2. Slide 2.

Computer System Overview OPERATING SYSTEM TOP-LEVEL COMPONENTS. Simplified view: Operating Systems. Slide 1. Slide /S2. Slide 2. BASIC ELEMENTS Simplified view: Processor Slide 1 Computer System Overview Operating Systems Slide 3 Main Memory referred to as real memory or primary memory volatile modules 2004/S2 secondary memory devices

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Arithmetic Unit 10032011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Recap Chapter 3 Number Systems Fixed Point

More information

PC Interrupt Structure and 8259 DMA Controllers

PC Interrupt Structure and 8259 DMA Controllers ELEC 379 : DESIGN OF DIGITAL AND MICROCOMPUTER SYSTEMS 1998/99 WINTER SESSION, TERM 2 PC Interrupt Structure and 8259 DMA Controllers This lecture covers the use of interrupts and the vectored interrupt

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

CS252 Lecture Notes Multithreaded Architectures

CS252 Lecture Notes Multithreaded Architectures CS252 Lecture Notes Multithreaded Architectures Concept Tolerate or mask long and often unpredictable latency operations by switching to another context, which is able to do useful work. Situation Today

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Ninth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

Topics in computer architecture

Topics in computer architecture Topics in computer architecture Sun Microsystems SPARC P.J. Drongowski SandSoftwareSound.net Copyright 1990-2013 Paul J. Drongowski Sun Microsystems SPARC Scalable Processor Architecture Computer family

More information

C 1. Last time. CSE 490/590 Computer Architecture. Complex Pipelining I. Complex Pipelining: Motivation. Floating-Point Unit (FPU) Floating-Point ISA

C 1. Last time. CSE 490/590 Computer Architecture. Complex Pipelining I. Complex Pipelining: Motivation. Floating-Point Unit (FPU) Floating-Point ISA CSE 490/590 Computer Architecture Complex Pipelining I Steve Ko Computer Sciences and Engineering University at Buffalo Last time Virtual address caches Virtually-indexed, physically-tagged cache design

More information

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding

More information

structural RTL for mov ra, rb Answer:- (Page 164) Virtualians Social Network Prepared by: Irfan Khan

structural RTL for mov ra, rb Answer:- (Page 164) Virtualians Social Network  Prepared by: Irfan Khan Solved Subjective Midterm Papers For Preparation of Midterm Exam Two approaches for control unit. Answer:- (Page 150) Additionally, there are two different approaches to the control unit design; it can

More information

Processors. Young W. Lim. May 12, 2016

Processors. Young W. Lim. May 12, 2016 Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version

More information

Introduction to Operating. Chapter Chapter

Introduction to Operating. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 System I/O System I/O (Chap 13) Central

More information

CISC Attributes. E.g. Pentium is considered a modern CISC processor

CISC Attributes. E.g. Pentium is considered a modern CISC processor What is CISC? CISC means Complex Instruction Set Computer chips that are easy to program and which make efficient use of memory. Since the earliest machines were programmed in assembly language and memory

More information

Efficiency and memory footprint of Xilkernel for the Microblaze soft processor

Efficiency and memory footprint of Xilkernel for the Microblaze soft processor Efficiency and memory footprint of Xilkernel for the Microblaze soft processor Dariusz Caban, Institute of Informatics, Gliwice, Poland - June 18, 2014 The use of a real-time multitasking kernel simplifies

More information

Basic concepts UNIT III PIPELINING. Data hazards. Instruction hazards. Influence on instruction sets. Data path and control considerations

Basic concepts UNIT III PIPELINING. Data hazards. Instruction hazards. Influence on instruction sets. Data path and control considerations UNIT III PIPELINING Basic concepts Data hazards Instruction hazards Influence on instruction sets Data path and control considerations Performance considerations Exception handling Basic Concepts It is

More information

Input Output (IO) Management

Input Output (IO) Management Input Output (IO) Management Prof. P.C.P. Bhatt P.C.P Bhatt OS/M5/V1/2004 1 Introduction Humans interact with machines by providing information through IO devices. Manyon-line services are availed through

More information

Computer-System Organization (cont.)

Computer-System Organization (cont.) Computer-System Organization (cont.) Interrupt time line for a single process doing output. Interrupts are an important part of a computer architecture. Each computer design has its own interrupt mechanism,

More information

8086 Interrupts and Interrupt Responses:

8086 Interrupts and Interrupt Responses: UNIT-III PART -A INTERRUPTS AND PROGRAMMABLE INTERRUPT CONTROLLERS Contents at a glance: 8086 Interrupts and Interrupt Responses Introduction to DOS and BIOS interrupts 8259A Priority Interrupt Controller

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

Computer Architecture

Computer Architecture Instruction Cycle Computer Architecture Program Execution and Instruction Sets INFO 2603 Platform Technologies The basic function performed by a computer is the execution of a program, which is a set of

More information

DSP/BIOS Kernel Scalable, Real-Time Kernel TM. for TMS320 DSPs. Product Bulletin

DSP/BIOS Kernel Scalable, Real-Time Kernel TM. for TMS320 DSPs. Product Bulletin Product Bulletin TM DSP/BIOS Kernel Scalable, Real-Time Kernel TM for TMS320 DSPs Key Features: Fast, deterministic real-time kernel Scalable to very small footprint Tight integration with Code Composer

More information

Computer Logic II CCE 2010

Computer Logic II CCE 2010 Computer Logic II CCE 2010 Dr. Owen Casha Computer Logic II 1 The Processing Unit Computer Logic II 2 The Processing Unit In its simplest form, a computer has one unit that executes program instructions.

More information

9/25/ Software & Hardware Architecture

9/25/ Software & Hardware Architecture 8086 Software & Hardware Architecture 1 INTRODUCTION It is a multipurpose programmable clock drive register based integrated electronic device, that reads binary instructions from a storage device called

More information

Announcement. Computer Architecture (CSC-3501) Lecture 25 (24 April 2008) Chapter 9 Objectives. 9.2 RISC Machines

Announcement. Computer Architecture (CSC-3501) Lecture 25 (24 April 2008) Chapter 9 Objectives. 9.2 RISC Machines Announcement Computer Architecture (CSC-3501) Lecture 25 (24 April 2008) Seung-Jong Park (Jay) http://wwwcsclsuedu/~sjpark 1 2 Chapter 9 Objectives 91 Introduction Learn the properties that often distinguish

More information

Chapter 1: Basics of Microprocessor [08 M]

Chapter 1: Basics of Microprocessor [08 M] Microprocessor: Chapter 1: Basics of Microprocessor [08 M] It is a semiconductor device consisting of electronic logic circuits manufactured by using either a Large scale (LSI) or Very Large Scale (VLSI)

More information

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION Introduction :- An exploits the hardware resources of one or more processors to provide a set of services to system users. The OS also manages secondary memory and I/O devices on behalf of its users. So

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Seventh Edition By William Stallings Objectives of Chapter To provide a grand tour of the major computer system components:

More information

Lecture 25: Board Notes: Threads and GPUs

Lecture 25: Board Notes: Threads and GPUs Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel

More information