Control and Data Dependence Speculation in Multithreaded Processors


Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (MTEAC '98)

Control and Data Dependence Speculation in Multithreaded Processors

Pedro Marcuello and Antonio González
Universitat Politècnica de Catalunya
Departament d'Arquitectura de Computadors
C/ Jordi Girona 1-3, Mòdul D6
08034 Barcelona, Spain
Email: {pmarcue,antonio}@ac.upc.es

Keywords: Multithreaded Architecture, Runtime Generation of Threads, Control Speculation, Loop Detection, Data Dependence Speculation.

Abstract

Boosting instruction-level parallelism in dynamically scheduled processors requires a large instruction window. The approach taken by current superscalar processors to build the instruction window is known to have important limitations, such as the requirement of more powerful instruction fetch mechanisms and the increasing complexity and delay of the issue logic. In this paper we present a novel processor architecture (called DeSM) based on a multithreaded execution model that takes a completely different approach to managing a large instruction window. The idea is to identify at runtime sections of code that correspond to loops and to execute several iterations concurrently, even if they are dependent. Unlike in superscalar processors, instructions are not decoded in sequential order, so the dependence-checking mechanism of superscalar processors would not work for DeSM processors. In a DeSM processor, inter-thread dependences are speculated (i.e., they are predicted and instructions are executed obeying the predicted dependences). Besides, a DeSM processor significantly reduces the required instruction fetch bandwidth, since it takes advantage of the fact that the multiple threads of control share common code (they execute different iterations of the same loop). The novel features of the DeSM rely on hardware mechanisms that require neither any special feature in the instruction set architecture nor compiler support.

1. Introduction

In recent years, technology improvements have allowed processor designers to increase the instruction window size and the number of functional units in such a way that processors have been able to increase their instruction issue width. Nevertheless, the approach used by current superscalar processors to build the instruction window is not scalable, and it faces two main problems:

- Branch prediction. Building a large window requires speculating beyond a large number of branches. Even with very powerful prediction techniques, such as the hybrid predictors announced for the DEC 21264, the probability of being on the correct path after speculating on multiple branches is rather low. For instance, if we assume a 95% uniformly distributed success rate, as expected for the announced DEC 21264 for integer codes, and a 20% branch frequency, the probability of correctly predicting the control flow beyond 100 instructions is only 0.36 (see the short computation below).
- The complexity and delay of the issue logic grow with the instruction window size. It has been shown in a detailed study that the issue logic is likely to be one of the most critical parts of future processors [9].

On the other hand, multithreaded architectures have been studied as a way to increase instruction-level parallelism in two non-exclusive ways: executing several independent threads, which may come from different programs, or executing dependent threads with the necessary synchronization to obey all the dependences among them.
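The 0.36 figure follows directly from compounding the per-branch prediction accuracy. A minimal C check, using the example's assumed parameters (95% accuracy, 20% branch frequency, a 100-instruction window):

    #include <stdio.h>
    #include <math.h>

    /* Probability of still being on the correct path after fetching n
     * instructions, given a per-branch prediction accuracy p and a
     * branch frequency f (fraction of instructions that are branches). */
    static double survival(double p, double f, int n) {
        return pow(p, f * n);   /* one prediction per branch, n*f branches */
    }

    int main(void) {
        printf("%.2f\n", survival(0.95, 0.20, 100));  /* prints 0.36 */
        return 0;
    }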
We propose a novel processor architecture, called the Dependence Speculative Multithreaded Architecture (DeSM), that is based on the second kind of multithreaded architecture. However, unlike previous proposals, the DeSM architecture enforces inter-thread dependences through runtime dependence speculation.

2. Related work

An important feature of the DeSM architecture is that, unlike previous works such as the Multiscalar [3][10], SPSM [2], Superthreaded [12] and Multithreaded Decoupled [1] architectures, it does not require any addition or extension to the ISA. The DeSM architecture obtains multiple threads of control dynamically from conventional sequential object code, without any support from the compiler, by speculating on highly predictable branches (e.g., branches that close loops).

Figure 1: General diagram of the DeSM architecture (a single PC and fetch unit feed an instruction cache; decode and renaming, together with the control speculation logic, feed per-thread instruction queues, shared functional units, a large register file, and a Multi-Value first-level data cache).

Executing multiple dependent threads requires mechanisms to enforce dependences among different threads. Data dependence speculation refers to those techniques that execute parts of a code without complete knowledge of its data dependences. The DeSM architecture makes extensive use of it: those dependences that are unknown are predicted, and the code is speculatively executed obeying the predicted dependences in addition to the known ones. Data dependence speculation has hardly been used in the past. Since dependences through registers are easily identified by both the compiler and the hardware, data dependence speculation has previously been used to speculate on dependences through memory. Memory references whose effective addresses are unknown are usually called ambiguous references. When memory instructions are executed out-of-order, a memory reference may be performed before the effective addresses of all previous references are known; that is, it is performed without completely disambiguating the reference. We call this scheme dynamic partial disambiguation. The most remarkable work on dynamic partial disambiguation is the address resolution buffer of the Multiscalar [4] and the address reorder buffer of the HP PA-8000. Both approaches use a very simple speculation heuristic: they assume no dependence between an instruction and any previous instruction whose effective address is unknown. Recent results have shown that data dependence speculation can be significantly improved through more sophisticated dependence predictors [5][7][8][16]. Obviously, the compiler may help to improve the performance of the speculation techniques, but this issue is beyond the scope of this paper.

The DeSM architecture speculates on both inter-thread memory and register dependences. The former are handled through address prediction, whereas the latter are managed by means of predicting the number of writes to each architected register.

3. The Dependence Speculative Multithreaded Architecture

The DeSM hardware architecture (see figure 1) is based on a simultaneous multithreaded processor [14][15] with the capability to speculate on multiple threads of control (threads for short) obtained from a single sequential program, and to speculate on inter-thread data dependences through registers and memory. Thus, the distinguishing feature of the novel architecture is the extensive use of speculation on control and data dependences, as is reflected in its name. Both types of speculation are performed entirely by the hardware, without requiring any compiler support. Thus, the object code can be specified in any conventional instruction set architecture (ISA), without any extension. Each thread, as in the SMT architecture, executes out-of-order. The DeSM processor fetches instructions from a single program counter (PC). The instructions are decoded and renamed in the normal way, and they are also processed by the control speculation logic for the runtime generation of threads, as described in Section 3.1. The instructions are renamed separately for each thread (every thread has its own register map table, although all the threads share the register file) and put into different instruction queues.
The threads executed by the DeSM architecture may not be independent, so there can be dependences among them. The enforcement of inter-thread dependences through registers is described in Section 3.2.1; enforcement through memory is performed by a special first-level data cache, called the Multi-Value cache, which is described in Section 3.2.2. Threads also share the functional units.

3.1. Control speculation

The control speculation mechanism is based on the technique we proposed in [13]. Below we summarize the main issues of that technique; the interested reader is referred to the original paper for more details. The idea of the control speculation technique is to detect loops at runtime and to predict their number of iterations, in order to generate multiple threads from different iterations of the same loop. Control speculation is performed by means of a table, called the loop table, that is indexed by a loop identifier. A loop is identified by the target address of a backward conditional or unconditional branch (for most loops this is a precise identifier). A loop table entry contains information that allows the processor to predict the number of iterations that the next execution of the loop will likely perform. In particular, it has three fields: the number of iterations of the last execution of the loop (last count); the last observed stride, i.e., the difference between the numbers of iterations of the last two executions of the loop; and the number of iterations of the current execution so far (current count). A new entry in this table is allocated every time a new loop is started (in fact, it is detected when its first iteration finishes). The entry corresponding to the loop with the least recently started iteration is chosen for replacement.

Termination of loops is detected by means of an auxiliary structure called the loop stack. This structure is a stack of loops that contains, at any time, the loops to which the last committed instruction belongs. The loops are ordered from innermost to outermost, the top of the stack being the innermost. In case of overflow in the loop stack, the outermost loop is lost when a new loop is introduced. The size of the loop stack should be equal to the deepest loop nest of the program in order not to lose information.

With the support of the loop table and the loop stack, control speculation proceeds as follows. When a new iteration of a loop is started (this is recognized by a backward taken branch), the loop table is looked up. If the loop is in the table, the number of iterations left to finish the loop is predicted as the last count plus the stride minus the current count. If there are enough free contexts and the dependences can be speculated (this latter condition is analyzed in the following section), one thread for each remaining iteration can proceed in parallel (a sketch of this prediction is given in the code below). If the loop is not in the loop table, or the current iteration count is higher than the predicted one, the current thread proceeds with the next iteration and several new threads are allocated to iterations that may follow it. In this case, creating more threads implies more potential for parallelism but also, if the threads turn out to be squashed, more wasted resources. The analysis of this parameter is beyond the scope of this paper and will be researched in future work.

Figure 2: The iteration table (indexed by the target address of a backward branch; each entry holds the LNRW and CNRW register-write counts, the LSA store addresses and strides, the CSA store addresses, and the confidence field C).

We assume here, for simplicity, that in this case as many threads as there are free contexts in the processor are created. Notice that there may be speculative threads for different loops at the same time. However, in this paper we consider a simple implementation in which any loop can be speculated but only one loop can be speculated at a time; each speculated iteration may contain any number of nested loops and subroutine calls.
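A minimal C sketch of a loop table entry and the iteration-count prediction just described (the structure and names are ours, for illustration; the real mechanism is a hardware table):

    #include <stdint.h>

    /* One loop table entry, indexed by the loop identifier (the target
     * address of the backward branch). */
    struct loop_entry {
        uint32_t id;            /* target address of the backward branch  */
        int      last_count;    /* iterations of the last execution       */
        int      stride;        /* difference between the last two counts */
        int      current_count; /* iterations of the current execution    */
    };

    /* Predicted number of iterations left in the current execution:
     * last count plus stride, minus the iterations executed so far. */
    static int remaining_iterations(const struct loop_entry *e) {
        int left = e->last_count + e->stride - e->current_count;
        return left > 0 ? left : 0;  /* past the prediction: nothing left */
    }

One speculative thread per predicted remaining iteration can then be spawned, bounded by the number of free contexts and by the dependence-predictability test of Section 3.2.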
The objective is to provide a preliminary evaluation of the feasibility of the proposed architecture; optimizing the control speculation scheme is left for future work. In this simplified scheme, only the non-speculative thread is allowed to create new speculative threads.

3.2. Data dependence speculation

The multiple threads that are simultaneously executed by the DeSM architecture are not necessarily independent: dependences among threads, either through memory or through registers, may be present. Dependences are predicted based on the history of each loop. In particular, dependences through memory are predicted by taking advantage of the high predictability of memory addresses, as shown in studies such as [5]. The memory address prediction mechanism is based on keeping track of the last effective address generated by each static store instruction and its last stride. Using this scheme, about 75% of the memory instructions executed by the SPEC95 programs can be correctly predicted [6]. Predicting dependences through registers is based on the observation that the number of writes to the architected registers is usually the same across executions of a given loop body. This obviously holds for loops without conditionals in the loop body, and it may also hold in the presence of conditionals, for instance when the then and else parts assign different values to the same registers, or when the conditional branches usually take the same direction.

Data dependences are predicted by means of a table called the iteration table (see figure 2). Each entry of this table keeps the information regarding the dependences among iterations of a loop (usually called loop-carried dependences). It is indexed by the loop identifier (the target address of a backward branch), and each entry contains the following fields (an example of the management of these fields is shown in figure 3):

- LNRW (Last Number of Register Writes): indicates, for each architected register, the number of times it was written in the last iteration of the loop.
- CNRW (Current Number of Register Writes): similar to LNRW, but referring to the current execution of the loop body, which has not finished yet.
- LSA (Last Store Addresses): an array with one entry per store instruction executed in the last iteration of the loop. For each store it contains the last effective address and the last stride (the difference between the effective addresses of the last two iterations). Stores are allocated in sequential order. To reduce complexity, this field can be implemented with a fixed number of entries; loops that have more stores than this limit are not speculated.
- CSA (Current Store Addresses): contains the memory addresses of the store instructions executed so far in the current iteration.
- C (Confidence): assigns confidence to the predictions in a similar way as branch predictors do. In this paper we assume a 2-bit saturating counter to implement this field. It avoids spending resources when the data required by a speculative thread is not highly predictable.

Figure 3: Values of the iteration table at the end of an iteration of a sample loop that loads two array elements, adds them, stores the result into a third array and decrements its index register by 4 (LNRW records the register-write counts of the iteration; LSA records the address and stride of the store).

Although the size of each entry is large, a small number of entries is enough to predict most of the loops. For instance, we showed that the hit ratio of this table is about 85% with just two entries for the SPEC95 suite. When a loop that is in the iteration table finishes (the termination of a loop is detected by means of the loop stack, as previously described), the predictability of the loop is determined by checking whether: a) the current iteration has performed the same number of writes to each register as the previous one; and b) the current iteration has performed the same number of stores as the previous one, and the addresses of these stores correspond to the addresses of the stores of the previous iteration plus the last observed stride. If both conditions hold, the C field is increased; otherwise it is decreased. A loop is considered predictable if the most significant bit of C is set. Thus, in addition to the conditions imposed by the control speculation scheme described in Section 3.1, an additional requirement for creating a speculative thread to execute an iteration of a loop is that the loop is found in the iteration table and the most significant bit of its C field is set. At the end of a loop, CNRW is copied into LNRW and CSA is copied into LSA.address; for each i, LSA[i].stride is updated with CSA[i].address - LSA[i].address if the most significant bit of the new value of C is zero. The CNRW and CSA fields are then reset (this update is sketched in code below).

3.2.1. Dependences through registers

The processor has a large number of registers in order to store the multiple contexts corresponding to the multiple threads that are simultaneously executed. Such registers could be organized into a single register file or distributed into several files in order to reduce the number of ports. The study of the impact of the register file organization is beyond the scope of this paper; for the evaluation part of this paper, a single register file is considered.
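Returning to the iteration-table update above, the following C sketch summarizes the end-of-iteration bookkeeping and the confidence test. It is written under simplifying assumptions of ours (32 architected registers, a fixed 8-entry LSA/CSA array, the update applied once per monitored iteration); all names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define NREGS    32   /* architected registers (an assumption)        */
    #define MAXSTORE  8   /* fixed number of LSA/CSA entries (assumption) */

    struct it_entry {
        uint8_t  lnrw[NREGS];          /* writes per register, last iter. */
        uint8_t  cnrw[NREGS];          /* writes per register, current    */
        uint32_t lsa_addr[MAXSTORE];   /* last store addresses            */
        int32_t  lsa_stride[MAXSTORE]; /* last observed store strides     */
        uint32_t csa[MAXSTORE];        /* current store addresses         */
        int      last_nstores, nstores;
        unsigned c;                    /* 2-bit saturating confidence     */
    };

    /* A loop is considered predictable if the MSB of C is set. */
    static bool predictable(const struct it_entry *e) { return (e->c & 2) != 0; }

    /* End-of-iteration bookkeeping: raise or lower the confidence
     * depending on whether register writes and store addresses behaved
     * as predicted, then roll the "current" fields into the "last" ones. */
    static void end_of_iteration(struct it_entry *e) {
        bool ok = (e->nstores == e->last_nstores);
        for (int r = 0; ok && r < NREGS; r++)
            ok = (e->cnrw[r] == e->lnrw[r]);
        for (int s = 0; ok && s < e->nstores; s++)
            ok = (e->csa[s] == e->lsa_addr[s] + (uint32_t)e->lsa_stride[s]);

        e->c = ok ? (e->c == 3 ? 3u : e->c + 1) : (e->c == 0 ? 0u : e->c - 1);

        for (int r = 0; r < NREGS; r++) { e->lnrw[r] = e->cnrw[r]; e->cnrw[r] = 0; }
        for (int s = 0; s < e->nstores; s++) {
            if (!predictable(e))   /* MSB of the new C is zero */
                e->lsa_stride[s] = (int32_t)(e->csa[s] - e->lsa_addr[s]);
            e->lsa_addr[s] = e->csa[s];
        }
        e->last_nstores = e->nstores;
        e->nstores = 0;
    }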
Each thread has its own register map table, with as many entries as there are architected (also called logical) registers, that indicates the current physical register allocated to each logical register. We will refer to these tables as Rmap. Notice that each map table reflects a different assignment of logical to physical registers, corresponding to a different point in time of the execution. An Rmap entry may contain a special value, NIL, indicating that the corresponding logical register is not currently mapped to any physical register. In addition, each thread has another table, called Rwrite (register write table), that contains for each logical register the number of remaining writes to that register. When the non-speculative thread creates the speculative ones, each entry of their Rmap tables is initialized with the same physical register as the mapping used by the non-speculative thread if no writes are expected to this register, or with the special value NIL otherwise (see the sketch below). In other words, if Rwrite[r] is zero, then all instructions of speculative threads that have register r as a source operand are allowed to issue (once the value in the physical register is available). If there are not enough physical registers for a thread, it and the subsequent speculative threads are not created.
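A C sketch of the register-state initialization of a newly created speculative thread, as described above. We assume the predicted write counts come from the LNRW field of the iteration table, and that each thread is initialized from its immediate predecessor; the names are ours:

    #include <stdint.h>

    #define NREGS 32        /* architected registers (an assumption) */
    #define NIL   0xFFFFu   /* "not mapped to any physical register" */

    struct thread_regs {
        uint16_t rmap[NREGS];   /* logical -> physical register map       */
        uint8_t  rwrite[NREGS]; /* predicted writes still to be performed */
    };

    /* Initialize a speculative thread's register state from its
     * predecessor. predicted_writes[r] is the per-iteration write count
     * predicted by the iteration table (the LNRW field). If the
     * predecessor is not expected to write r again, both threads can
     * share the same physical register; otherwise the mapping stays NIL
     * until the value is produced and the mapping is forwarded. */
    static void init_speculative(struct thread_regs *spec,
                                 const struct thread_regs *pred,
                                 const uint8_t predicted_writes[NREGS]) {
        for (int r = 0; r < NREGS; r++) {
            spec->rwrite[r] = predicted_writes[r];
            spec->rmap[r]   = (pred->rwrite[r] == 0) ? pred->rmap[r] : NIL;
        }
    }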

Figure 4: The MV cache for a four-context DeSM processor (each cache word holds, for every context, a replicated value together with its valid bit V and expected-write count NW; the MV cache is backed by the L2 cache).

When an instruction with destination register r finishes, the corresponding entry in Rwrite is decreased and then the following actions are taken:

- If Rwrite[r] becomes equal to 0, the corresponding entry in the Rmap of this thread is copied into the same entry of the table of the next thread (if it exists), provided this latter entry is equal to NIL; from then on the two threads share the same physical register.
- If Rwrite[r] becomes lower than 0, the current thread is performing a non-predicted write to a register. In this case, all subsequent threads are squashed, and all physical registers allocated to them that are not shared with the previous thread are released. (A selective squashing of only the dependent instructions would be more effective, but it is not considered in this paper.)

When a thread finishes and some entry of its Rwrite table is greater than 0, the thread has performed fewer writes to a register than predicted. If there is a subsequent thread executing the next iteration and the corresponding entry of its Rmap is NIL, then the entry of Rmap is copied into the same entry of the subsequent map table. Finally, those physical registers of the finished thread that are not shared with the next thread are freed.

3.2.2. Dependences through memory

In the same way as the processor provides support for storing multiple states of the registers, each one corresponding to the view that a different thread has, a similar type of support is provided for memory. This is achieved by a special first-level data cache. This cache is called the Multi-Value (MV) cache and has the particularity that its data words are replicated for each context (see figure 4), since each thread may have a different view of the contents of memory. For each replicated word, the MV cache contains two additional fields: the number of writes expected from the thread (NW) and a bit (V) indicating whether it contains a valid value for the corresponding thread. Another important feature is that it stores non-committed values and does not allow the next memory level to be modified until they are committed. This is implemented by means of a write-back policy together with a particular replacement scheme.

When the non-speculative thread creates a set of speculative ones, the MV cache is initialized with all those words whose addresses are predicted to be written by these threads, and the number of expected writes is put in the NW field. Those addresses can be predicted by means of the LSA field of the iteration table, by adding to each address the stride as many times as the distance between the speculative thread and the non-speculative one. The validity bit is initialized to 1 for the first thread that is expected to write to that address and all the preceding ones, and to 0 for the subsequent threads. Notice that the initialization of the MV cache may seem a costly task; however, it can be done before the creation of the speculative threads in a distributed way, while the non-speculative thread is executing the iteration prior to the creation of the speculative threads. An example of the initial values of the Multi-Value cache for a given loop is shown in figure 5, and a code sketch of the initialization is given below.
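The following C sketch models one MV-cache word and its initialization from an LSA prediction for a four-context processor. It is a simplified model of ours: in particular, we assume a single predicted write per thread and per word, and all names are illustrative:

    #include <stdint.h>

    #define NCTX 4   /* contexts in the evaluated DeSM processor */

    /* One MV-cache word: the datum is replicated per context, each copy
     * carrying an expected-write count (NW) and a valid bit (V). */
    struct mv_word {
        uint32_t addr;
        uint32_t value[NCTX];
        uint8_t  nw[NCTX];   /* writes still expected from this thread */
        uint8_t  v[NCTX];    /* does this thread hold a valid value?   */
    };

    /* Initialize the word predicted to be written by speculative thread
     * 'writer': its address is the LSA address extended by one stride
     * per iteration of distance from the non-speculative thread. The
     * value is valid for the writer and all preceding threads (they see
     * the old memory contents); later threads must wait for forwarding. */
    static void mv_init(struct mv_word *w, uint32_t lsa_addr, int32_t stride,
                        int writer, uint32_t old_value) {
        w->addr = lsa_addr + (uint32_t)(stride * writer);
        for (int t = 0; t < NCTX; t++) {
            w->value[t] = old_value;
            w->nw[t] = (t == writer) ? 1 : 0; /* one predicted write per word */
            w->v[t]  = (t <= writer) ? 1 : 0;
        }
    }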
Each thread executes memory instructions out-of-order using a total disambiguation scheme. That is, memory instructions compute their effective addresses as soon as their operands are available and are then sent to a load/store buffer. Stores write to memory when all the previous instructions of the same thread have completed, whereas loads read from memory when the addresses of all previous stores of the same thread are known.

Figure 5: Initial state of the Multi-Value cache for the sample loop of figure 3 (one word per predicted store address: 20, 24, 28 and 32). Thread 0 is assumed to be the non-speculative one, its store address is 20, and no thread has performed its store yet.

If a load matches a previous store address, the store data is forwarded to the load destination register; otherwise the read is performed from memory. When a thread performs a store instruction, the corresponding value of the MV cache is updated, the NW field is decreased and the V flag is set. If NW becomes 0, the data is copied to the succeeding threads, from the next one up to the first one that has either NW or V different from 0 (excluded); the data is copied into this latter thread only if its V flag is reset. The V flags of all the threads to which the produced data is forwarded are set. If NW becomes lower than 0, all succeeding threads are squashed. Besides, if the corresponding line is not in the MV cache, all succeeding threads are squashed, because the write was not predicted and there is thus no guarantee that all memory dependences of the succeeding threads have been obeyed. (A selective squashing of only the dependent instructions will be considered in future work.) When a thread performs a read from memory, the MV cache is checked first: if the validity bit is set, the value is read; otherwise the load is cancelled and stored in a load queue. If the address is not found, the value is requested from the next memory level. A code sketch of the store update and forwarding rules is given at the end of this section.

3.3. Other features: management of precise exceptions and instruction fetching

Precise exceptions are implemented in the DeSM architecture by means of small reorder buffers, one for each thread, along with the MV cache previously described. A reorder buffer holds instructions until they are partially committed. The first instruction of a thread can partially commit as soon as it finishes; any other instruction can partially commit when all previous instructions of the same thread have partially committed. Each reorder buffer allows for out-of-order execution and control speculation inside each thread, and therefore its size is of the same order as that of a conventional superscalar processor. In case of an asynchronous exception, a precise state can be recovered by squashing all the speculative threads (which frees their physical registers as described in Section 3.2.1) and recovering a precise state in the non-speculative thread through the support of its local reorder buffer, following the conventional approach used in superscalar processors. Any memory write performed by a speculative thread has been performed in the MV cache and has not modified the L2 cache. Therefore, the part of the MV cache corresponding to the non-speculative thread, together with the next levels of the memory hierarchy, holds the committed values that constitute the recovered precise state.

Regarding instruction fetching, since all the concurrent threads are executing the same code, a single instruction fetch engine can fetch the instructions of the loop just once and replicate them as many times as there are active threads. Each instruction is renamed using a different register map table and then dispatched to a shared instruction queue. In this way, the DeSM architecture significantly reduces the instruction memory pressure when compared to a conventional superscalar processor. A similar feature was also proposed in [17]. This organization overcomes one of the most important hurdles of multithreaded architectures: in those machines, the processor has to fetch from different program counters, simultaneously or alternately, which makes the fetch engine one of the critical parts of such architectures. In the DeSM architecture, instructions are always fetched from a single program counter.
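Returning to the MV-cache protocol of Section 3.2.2, the store update and forwarding rules can be sketched in C as follows, reusing the mv_word structure and NCTX constant from the previous sketch. squash_from() is a hypothetical hook into the rest of a pipeline model, stubbed out here; the case of a store that misses in the MV cache (also a squash, per the text) is assumed to be handled by the cache lookup that precedes this call:

    /* Squash speculative threads 'first'..NCTX-1 (hypothetical hook). */
    static void squash_from(int first) { (void)first; }

    /* Store by thread t to MV-cache word w: update the local copy and,
     * on the last predicted write, forward the value to successors. */
    static void mv_store(struct mv_word *w, int t, uint32_t data) {
        if (w->nw[t] == 0) {      /* NW would drop below 0: this write  */
            squash_from(t + 1);   /* was not predicted for this thread  */
            return;
        }
        w->value[t] = data;
        w->v[t] = 1;
        if (--w->nw[t] != 0)
            return;               /* more writes still expected locally */
        /* Last predicted write: copy to succeeding threads up to the
         * first one that expects its own write or already holds a valid
         * value; that boundary thread takes the copy only if its V flag
         * is still reset. */
        for (int s = t + 1; s < NCTX; s++) {
            if (w->nw[s] != 0 || w->v[s] != 0) {
                if (w->v[s] == 0) { w->value[s] = data; w->v[s] = 1; }
                break;
            }
            w->value[s] = data;
            w->v[s] = 1;
        }
    }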
Even with this simple fetch organization, the DeSM architecture can build a large instruction window composed of several non-adjacent small windows.

4. Performance evaluation

The DeSM architecture has been evaluated through trace-driven simulation of the SPEC95 benchmark suite. The programs were compiled for a DEC AlphaStation 600 5/266 with an Alpha 21164 processor using the DEC compiler with full optimization, and instrumented by means of the ATOM tool [11]. A cycle-by-cycle simulation is performed in order to obtain accurate timing results. Because of the level of detail at which the simulation is carried out, the simulator is slow, so we have simulated 25 million instructions for each benchmark after skipping the initial part that corresponds to initializations.

The evaluated DeSM processor has 4 contexts, an issue bandwidth of 4 instructions per cycle for each context, 4 entries in the iteration table, and a fully associative MV cache with 128 entries. Every context has a local reorder buffer with 64 entries. A rather limited fetch bandwidth has been considered: up to 4 instructions at consecutive addresses. The number of physical registers is 256. The numbers of read and write ports of the Rmap and Rwrite tables are 8 and 4, respectively. We have assumed perfect branch prediction for intra-thread branches and an ideal L2 cache. The functional units are (latency in brackets): 8 simple integer (1), 4 integer multiplication (2), 4 memory (2), 6 simple FP (1), 3 FP multiplication (4) and 2 FP division (7).

4.1. Performance figures

The evaluation of the DeSM is summarized in Table 1. The first five rows correspond to FP programs and the last four to integer benchmarks. The first column of the table shows the average number of committed instructions per cycle (IPC). It can be seen that for the FP programs the IPC is significantly higher than for the integer programs.

Table 1: Performance measures for the DeSM processor

                  IPC    TPC    % single thread    pred. hit ratio
    FP   tomcatv  4.5    3.8           2                0.99
         swim     4.5    3.9                            0.99
         hydro2d  4.2    3.8           2                0.99
         mgrid    5.4    3.0           8                0.95
         applu    3.7    2.0           6                0.86
    int  m88ksim  3.0    2.4          48                0.88
         vortex   3.6    1.2          93                0.9
         go       3.5    1.0          97                0.7
         ijpeg    3.7    1.8          64                0.7

For most of the FP programs, the DeSM architecture achieves an IPC even higher than the fetch bandwidth. This confirms the potential benefits of the fetch mechanism in terms of the reduction in peak fetch bandwidth requirements. The second column shows the average number of active threads per cycle (TPC) that are correctly speculated. This figure measures the ability of the control and dependence speculation mechanisms to dynamically overlap multiple iterations of a loop, and therefore to exploit ILP in a non-contiguous window. This is the main source of additional ILP provided by the novel features of the DeSM architecture. The average TPC for FP programs is 3.3, out of a maximum of 4, whereas for integer programs it is much lower (1.6). From these figures we can conclude that the speculation approach is very effective for FP programs but not as good for integer codes. This motivates our current work to research the main reasons for this and to come up with alternative speculation mechanisms more appropriate for integer programs.

The third column shows the percentage of the execution time in which only a single thread (the non-speculative one) is active. This figure is very low for all FP programs except applu, which also exhibits the lowest TPC among the FP programs. However, it is quite high for the integer programs. The fourth column shows the percentage of correctly speculated threads. This percentage is in general quite high, even for integer programs, which confirms that the speculation mechanism is quite accurate in identifying speculative threads (in terms of control and dependences). It remains to be proved whether it is too conservative and thus fails to identify speculative threads, which could be one of the reasons for the low TPC of the integer programs.

We have also evaluated the performance of an out-of-order superscalar processor with the same number of functional units and the same fetch bandwidth. We have also assumed a perfect data cache, and perfect branch prediction for all branches, which may favor the superscalar processor since the DeSM processor has to predict loop-closing branches. Besides, with this perfect prediction the superscalar processor can overlap the execution of several iterations of a loop when they are small, as the DeSM processor does. With a realistic branch predictor, the DeSM would still be able to overlap the same number of iterations of a loop, but the superscalar processor in general would not, due to branches that are difficult to predict. In spite of this, the DeSM processor is on average 28% faster for FP codes and 6% faster for integer codes, assuming the same clock cycle. Moreover, notice that superscalar processors have severe limitations in scaling up their issue logic [9], whereas the DeSM architecture has a distributed issue logic that is more scalable.

The main objective of this preliminary evaluation is to confirm the potential benefits of the DeSM architecture. More exhaustive evaluations are required to assess the benefits of different configurations, to include the effects of realistic intra-thread branch prediction and cache memory, and in general to identify critical design parameters and propose alternative solutions.
5. Conclusions

We have presented a novel processor architecture that uses a new approach to dynamically extract and execute multiple threads of control from a sequential application, without any special feature in the ISA or compiler support. This approach is based on speculating on highly predictable branches that are not necessarily adjacent, such as the closing branches of loops. In this way, the processor manages a large instruction window that is composed of several small windows, each one corresponding to a different iteration of a loop. The multiple instruction windows are simultaneously processed by different threads of control. To overcome the problems caused by the serialized instruction fetch and decode approach implemented by superscalar processors, the DeSM architecture speculates on inter-thread dependences when they are unknown, for both register and memory dependences.

A preliminary evaluation of the DeSM architecture has shown that inter-thread parallelism, measured as the average number of threads per cycle (TPC), is very high for FP programs (3.3 for a four-context machine) and moderate for integer ones (1.6). Besides, this additional source of parallelism does not require any increase in the fetch bandwidth, since the processor takes advantage of the fact that all the simultaneous threads execute the same code. In fact, it has been shown that the DeSM processor can achieve an IPC even higher than the peak fetch bandwidth, as observed for some programs.

6. Acknowledgments

This work has been supported by the Spanish Ministry of Education under contract CICYT 429/95 and by grant AP96 522746. The research described in this paper has been developed using the resources of the Center of Parallelism of Barcelona (CEPBA).

References

[1] M.N. Dorojevets and V.G. Oklobdzija, Multithreaded Decoupled Architecture, Int. J. of High Speed Computing, 7(3), pp. 465-480, 1995.
[2] P.K. Dubey, K. O'Brien, K.M. O'Brien and C. Barton, Single-Program Speculative Multithreading (SPSM) Architecture: Compiler-Assisted Fine-Grained Multithreading, in Proc. Int. Conf. on Parallel Architectures and Compilation Techniques, pp. 109-121, 1995.
[3] M. Franklin and G.S. Sohi, The Expandable Split Window Paradigm for Exploiting Fine Grain Parallelism, in Proc. of the Int. Symp. on Computer Architecture, pp. 58-67, 1992.
[4] M. Franklin and G.S. Sohi, ARB: A Hardware Mechanism for Dynamic Reordering of Memory References, IEEE Transactions on Computers, pp. 552-571, May 1996.
[5] J. González and A. González, Memory Address Prediction for Data Speculation, in Proc. of the Third Int. Euro-Par Conf., pp. 1084-1091, 1997.
[6] J. González and A. González, Speculative Execution via Address Prediction and Data Prefetching, in Proc. of the 11th ACM Int. Conf. on Supercomputing, 1997.
[7] A. Moshovos, S.E. Breach, T.N. Vijaykumar and G.S. Sohi, Dynamic Speculation and Synchronization of Data Dependences, in Proc. of the Int. Symp. on Computer Architecture, pp. 181-193, 1997.
[8] A. Moshovos and G.S. Sohi, Streamlining Inter-operation Memory Communication via Data Dependence Prediction, in Proc. of the 30th Int. Symp. on Microarchitecture, pp. 218-228, 1997.
[9] S. Palacharla, N.P. Jouppi and J.E. Smith, Complexity-Effective Superscalar Processors, in Proc. of the Int. Symp. on Computer Architecture, pp. 206-218, 1997.
[10] G.S. Sohi, S.E. Breach and T.N. Vijaykumar, Multiscalar Processors, in Proc. of the Int. Symp. on Computer Architecture, pp. 414-425, 1995.
[11] A. Srivastava and A. Eustace, ATOM: A System for Building Customized Program Analysis Tools, in Proc. of the 1994 Conf. on Programming Language Design and Implementation, 1994.
[12] J.-Y. Tsai and P.-C. Yew, The Superthreaded Architecture: Thread Pipelining with Run-Time Data Dependence Checking and Control Speculation, in Proc. Int. Conf. on Parallel Architectures and Compilation Techniques, pp. 35-46, 1996.
[13] J. Tubella and A. González, Control Speculation in Multithreaded Processors through Dynamic Loop Detection, in Proc. of the 4th Int. Symp. on High-Performance Computer Architecture (HPCA-4), to appear.
[14] D.M. Tullsen, S.J. Eggers and H.M. Levy, Simultaneous Multithreading: Maximizing On-Chip Parallelism, in Proc. of the Int. Symp. on Computer Architecture, pp. 392-403, 1995.
[15] D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo and R.L. Stamm, Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, in Proc. of the Int. Symp. on Computer Architecture, pp. 191-202, 1996.
[16] G. Tyson and T. Austin, Improving the Accuracy and Performance of Memory Communication Through Renaming, in Proc. of the 30th Int. Symp. on Microarchitecture, pp. 235-245, 1997.
[17] S. Vajapeyam and T. Mitra, Improving Superscalar Instruction Dispatch and Issue by Exploiting Dynamic Code Sequences, in Proc. of the Int. Symp. on Computer Architecture, pp. 1-12, 1997.