Master's Thesis in Computer Engineering: Out-of-Order Retirement of Instructions in Superscalar and Multithreaded Processors


Master's Thesis in Computer Engineering
Out-of-Order Retirement of Instructions in Superscalar and Multithreaded Processors
Author: Rafael Ubal Tena
Advisors: Julio Sahuquillo Borrás, Pedro López Rodríguez


Contents

Abstract

1 Introduction
    1.1 Out-of-Order Retirement in Monothreaded Processors
    1.2 Out-of-Order Retirement with Support for Multiple Threads

2 The Validation Buffer Microarchitecture
    2.1 Register Reclamation
    2.2 Recovery Mechanism
    2.3 Working Example
    2.4 Uniprocessor Memory Model

3 Multithreaded Validation Buffer (VB-MT) Microarchitecture
    Multithreading support
    Resource sharing
    Adapting SMT Techniques to VB-MT
        Instruction Count (ICOUNT)
        Predictive Data Gating (PDG)
        Dynamically Controlled Resource Allocation (DCRA)

4 The Simulation Framework Multi2Sim
    Basic simulator description
        Program Loading
        Simulation Model
    Support for Multithreaded and Multicore Architectures
        Functional simulation: parallel workloads support
        Detailed simulation: Multithreading support
        Detailed simulation: Multicore support

5 Experimental Results
    Evaluation of the VB Microarchitecture
        Exploring the Potential of the VB Microarchitecture
        Exploring the Behavior in a Modern Microprocessor
        Impact on Performance of Supporting Precise Floating-Point Exceptions
    Evaluation of the VB-MT Microarchitecture
        Multithreading Granularity
        Multithreading Scalability: Performance vs. Complexity
        Fetch Policies
        Resource occupancy

6 Related Work

7 Conclusions
    Contributions and Future Work
    Publications Related with this Work

Abstract

Current superscalar processors commit instructions in program order by using a reorder buffer (ROB). The ROB provides support for speculation, precise exceptions, and register reclamation. However, committing instructions in program order may lead to significant performance degradation if a long latency operation blocks the ROB head. Several proposals have been published to deal with this problem. Most of them retire instructions speculatively. However, as speculation may fail, checkpoints are required to roll the processor back to a precise state, which demands both extra hardware to manage checkpoints and the enlargement of other major processor structures, which in turn might impact the processor cycle time. This work focuses on out-of-order commit in a nonspeculative way, thus avoiding checkpoints. To this end, we replace the ROB with a structure called the Validation Buffer (VB). This structure keeps dispatched instructions only until they are known to be nonspeculative or mispeculated, which allows early retirement. By doing so, the performance bottleneck is largely alleviated. An aggressive register reclamation mechanism targeted to this microarchitecture is also devised. As experimental results show, the VB structure is much more efficient than a typical ROB since, with only 32 entries, it achieves performance close to that of an in-order commit microprocessor using a 256-entry ROB. The present work also makes an exhaustive analysis of out-of-order retirement of instructions on multithreaded processors. Superscalar processors exploit instruction level parallelism by issuing multiple instructions per cycle. However, the issue width is usually wasted because of instruction dependencies. On the other hand, multithreaded processors reduce this waste by supporting the concurrent execution of instructions from multiple threads, thus exploiting both instruction and thread level parallelism.
Additionally, out-of-order commit (OOC) processors overlap the execution of long latency instructions with potentially many subsequent ones, so these microarchitectures also help to reduce the issue waste. We analyze the impact on performance of unifying both multithreading and OOC techniques, by combining the three main paradigms of multithreading

fine grain (FGMT), coarse grain (CGMT), and simultaneous (SMT) with the Validation Buffer microarchitecture, which retires instructions out of order. From the experimental results, we conclude that: (i) an OOC-SMT processor achieves the same performance as a conventional SMT with half the number of hardware threads; (ii) an OOC-FGMT processor outperforms a conventional SMT processor, which requires a more complex issue logic; and (iii) the use of OOC allows optimized fetch policies for SMT processors (such as DCRA) to do their job better, almost completely removing the issue width waste.

Chapter 1

Introduction

1.1 Out-of-Order Retirement in Monothreaded Processors

Current high-performance microprocessors execute instructions out of order to exploit instruction level parallelism (ILP). To support speculative execution, precise exceptions, and register reclamation, a reorder buffer (ROB) structure is used [1]. After being decoded, instructions are inserted in program order in the ROB, where they are kept while being executed and until retired in the commit stage. The key to supporting speculation and precise exceptions is that instructions leave the ROB also in program order, that is, when they are the oldest ones in the pipeline. Consequently, if a branch is mispredicted or an instruction raises an exception, there is a guarantee that, when the offending instruction reaches the commit stage, all the previous instructions have already been retired and none of the subsequent ones has. Therefore, to recover from that situation, all the processor has to do is abort the latter. This behavior is conservative. For instance, when the ROB head is blocked by a long latency instruction (e.g., a load that misses in the L2), subsequent instructions cannot release their ROB entries. This happens even if these instructions are independent of the long latency one and have already completed. In such a case, since the ROB has a finite size, as long as instruction decoding continues, the ROB may become full, thus stalling the processor for a significant number of cycles. Register reclamation is also handled in a conservative way, because physical registers stay mapped for longer than their useful lifetime. In summary, both the advantages and the shortcomings of the ROB come from the fact that instructions are committed in program order. A naive solution to this problem is to enlarge the ROB to accommodate more instructions in flight. However, as ROB-based microarchitectures

serialize the release of some critical resources at the commit stage (e.g., physical registers or store queue entries), these resources should also be enlarged. This resizing increases the cost in terms of area and power, and it might also impact the processor cycle time [2]. To overcome this drawback, some solutions that commit instructions out of order have been published. These proposals can be classified into two approaches depending on whether instructions are speculatively retired or not. Some proposals falling into the first approach, like [3], allow the retirement of the instruction obstructing the ROB head by providing a speculative value. Others, like [4] or [5], replace the normal ROB with alternative structures to speculatively retire instructions out of order. As speculation may fail, these proposals need to provide a mechanism to recover the processor to a correct state. To this end, the architectural state of the machine is checkpointed. Again, this implies the enlargement of some major microprocessor structures, for instance, the register file [5] or the load/store queue [4], because completed instructions cannot free some critical resources until their associated checkpoint is released. Regarding the nonspeculative approach, Bell and Lipasti [6] propose to scan a few ROB entries, as many as the commit width, and allow those instructions satisfying certain conditions to be retired. None of these conditions requires an instruction to be the oldest one in the pipeline. Hence, instructions can be retired out of program order. However, in this scenario, the ROB head may become fragmented after the commit stage, and thus the ROB must be collapsed for the next cycle. Collapsing a large structure is costly in time and could adversely impact the processor cycle time. As a consequence, this proposal is unsuitable for large ROB sizes, which is the current trend.
Moreover, the small number of instructions scanned at the commit stage significantly constrains the potential this proposal could achieve. In this work we propose the Validation Buffer (VB) microarchitecture, which is also based on the nonspeculative approach. This microarchitecture uses a FIFO-like table structure analogous to the ROB. The aim of this structure is to provide support for speculative execution, exceptions, and register reclamation. While in the VB, instructions are speculatively executed. Once all the previous branches and supported exceptions are resolved, the execution mode of an instruction changes either to nonspeculative or mispeculated. At that point, instructions are allowed to leave the VB. Therefore, instructions leave the VB in program order but, unlike with the ROB, they do not remain in the VB until retirement. Instead, they remain in the VB only until their execution mode is resolved, either nonspeculative or mispeculated. Consequently, instructions leave the VB at different stages of their execution: completed, issued, or just decoded and not yet issued. For instance, instructions following a long latency memory reference instruction could leave the VB as soon as its memory address is successfully calculated and no page fault has been raised. This work discusses how the VB microarchitecture works, focusing on how it deals with register reclamation, speculative execution, and exceptions. The first main contribution (Chapter 2) of this work is the proposal of an aggressive out-of-order retirement microarchitecture without checkpointing. This microarchitecture decouples instruction tracking for execution purposes from tracking for resource reclamation purposes. The proposal outperforms the existing proposal that does not perform checkpoints, achieving more than twice its performance while using smaller VB/ROB sizes. On the other hand, register reclamation cannot be handled as done in current microprocessors [7, 8, 9] because no ROB is used. Therefore, we devise an aggressive register reclamation method targeted to this architecture. Experimental results show that the VB microarchitecture increases the ILP while requiring less complexity in some major critical resources like the register file and the load/store queue.

1.2 Out-of-Order Retirement with Support for Multiple Threads

Superscalar processors effectively exploit the instruction level parallelism of a single thread by issuing multiple instructions in an out-of-order fashion in the same cycle. Nevertheless, issue ports are usually wasted because of instruction dependencies (i.e., the limited available parallelism), thus adversely impacting performance. Two kinds of waste are distinguishable [10]: vertical waste, when no instruction is issued, and horizontal waste, when some instructions are issued without completely filling the issue width. Resource utilization can be improved by supporting the execution of multiple threads, that is, by exploiting both instruction and thread level parallelism. There are three main multithreading models implemented in current processors: fine grain (FGMT), coarse grain (CGMT), and simultaneous multithreading (SMT).
All of them reduce vertical waste, but only SMT reduces horizontal waste [10] by issuing instructions from multiple threads in the same cycle, achieving the best performance gains. Nevertheless, this is done at the expense of adding complexity to the issue logic, which is a critical point in current microprocessors. Multithreaded architectures represent an important segment of the industry. For instance, the Alpha 21464, the Intel Pentium 4 [9], the IBM Power 5 [11], the Sun Niagara [12], and the Intel Montecito [13] are commercial microprocessors included in this group. On the other hand, and as introduced above, out-of-order commit (OOC) processors differ from in-order commit (IOC) architectures in that they do not force an instruction to be the oldest one in the pipeline in order to be retired. In this way, long latency operations (e.g., an L2 miss) do not block the ROB when they reach the ROB head; instead, long memory latencies are overlapped with the retirement of subsequent instructions that do not depend on the memory operation. Thus, these architectures mainly attack the vertical waste, although horizontal waste is also indirectly improved. In addition, we will show that OOC processors can make better use of resources and offer a better performance-cost trade-off than IOC processors. Quite recently, out-of-order retirement has been investigated on superscalar processors and chip multiprocessors (CMPs), but, to the best of our knowledge, no research has focused on multithreaded processors. As the second main contribution of this work (Chapter 3), we analyze the impact of retiring instructions in an out-of-order fashion in the three main models of multithreading: FGMT, CGMT, and SMT. To this end, we selected our own proposal, the Validation Buffer microarchitecture, as the OOC base architecture, and extended it to support the execution of multiple threads. Experimental results provide three main conclusions. First, a VB-based SMT processor requires in most cases half the number of hardware threads of a ROB-based SMT processor to achieve similar performance. In other words, performance can be maintained in a VB-based SMT when reducing the number of hardware threads, thus saving all the hardware resources needed to track their status. Second, a VB-based FGMT processor outperforms a ROB-based SMT processor. In this case, performance can be sustained while simplifying the issue logic, which translates into shorter issue delays or lower power consumption of the instruction schedulers. Third, existing fetch policies for SMT processors provide advantages complementary to the out-of-order retirement benefits.
A high-performance SMT design could implement both techniques if area, power consumption, and hardware constraints allow it. The rest of this work is structured as follows. Chapter 2 gives a detailed view of the Validation Buffer microarchitecture. Chapter 3 deals with out-of-order retirement in multithreaded processors, and explains a set of existing instruction fetch policies that will be used for evaluation purposes. Chapter 4 describes the simulation framework used to model all proposed techniques. Chapter 5 performs an exhaustive evaluation of both the VB and VB-MT architectures, and Chapters 6 and 7 discuss related work and provide some concluding remarks, respectively.
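As a rough illustration of the head-of-ROB blocking effect discussed in Section 1.1, the following toy model (an assumption-laden sketch, not the simulator used in this thesis) compares how long buffer entries stay allocated under in-order commit and under nonspeculative out-of-order retirement:

```python
# Toy model (illustrative only): each instruction is a pair
# (complete_cycle, validated_cycle). A ROB entry is released when the
# instruction is complete AND all older entries have left; a VB-style
# entry is released, still in program order at the buffer head, as soon
# as the instruction is known nonspeculative (validated).

def rob_release_cycles(instrs):
    """In-order commit: entry i leaves at max(complete_i, release_{i-1})."""
    release, prev = [], 0
    for complete, _ in instrs:
        prev = max(complete, prev)
        release.append(prev)
    return release

def vb_release_cycles(instrs):
    """Out-of-order retirement: entry i leaves once validated, without
    waiting for its own (or any older instruction's) completion."""
    release, prev = [], 0
    for _, validated in instrs:
        prev = max(validated, prev)
        release.append(prev)
    return release

# A load that misses in L2 (completes at cycle 300) but whose address is
# validated early (cycle 5), followed by independent short instructions.
stream = [(300, 5), (6, 6), (7, 7), (8, 8)]
print(rob_release_cycles(stream))  # [300, 300, 300, 300]: head blocks all
print(vb_release_cycles(stream))   # [5, 6, 7, 8]: entries freed early
```

In the ROB case the long latency load holds all four entries for 300 cycles; in the VB case the entries are freed almost immediately, which is the effect the chapter exploits.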

Chapter 2

The Validation Buffer Microarchitecture

The commit stage is typically the last one of the microarchitecture pipeline. At this stage, a completed instruction updates the architectural machine state, frees the resources it uses, and exits the ROB. The mechanism proposed in this work allows instructions to be retired early, as soon as it is known that they are nonspeculative. Notice that these instructions may not yet be completed. Once they complete, they will update the machine state and free the used resources. Therefore, instructions exit the pipeline in an out-of-order fashion. The necessary conditions to allow an instruction to be committed out of order are [6]: i) the instruction is completed; ii) WAR hazards are solved (i.e., a write to a particular register cannot be permitted to commit before all prior reads of that architected register have completed); iii) previous branches are successfully predicted; iv) none of the previous instructions is going to raise an exception; and v) the instruction is not involved in memory replay traps. The first condition is straightforwardly met by any proposal at the writeback stage. The last three conditions are handled by the Validation Buffer (VB) structure, which replaces the ROB and contains the instructions for which these conditions are not yet known. The second condition is fulfilled by the devised register reclamation method (see Section 2.1). The VB deals with the speculation-related conditions (iii, iv, and v) by decomposing code into fragments or epochs. The epoch boundaries are defined by instructions that may initiate a speculative execution, referred to as epoch initiators (e.g., branches or instructions that may raise exceptions). Only those instructions whose previous epoch initiators have completed and confirmed their prediction are allowed to modify the machine state. We refer to these instructions as validated instructions.
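The five conditions above can be written as a single predicate. This is a direct transcription for clarity; the field names are assumptions for illustration, not the thesis implementation:

```python
# The five out-of-order commit conditions from [6], as a predicate.
from collections import namedtuple

Instr = namedtuple("Instr", ["completed",              # (i)
                             "war_hazards_resolved",   # (ii)
                             "prev_branches_resolved", # (iii)
                             "no_prev_exception",      # (iv)
                             "in_replay_trap"])        # (v)

def can_commit_out_of_order(i):
    return (i.completed
            and i.war_hazards_resolved
            and i.prev_branches_resolved
            and i.no_prev_exception
            and not i.in_replay_trap)

# A completed instruction behind an unresolved branch may not commit yet:
print(can_commit_out_of_order(Instr(True, True, False, True, False)))  # False
```

Note that nothing in the predicate requires the instruction to be the oldest in the pipeline, which is exactly what enables out-of-order retirement.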
Instructions reserve an entry in the VB when they are dispatched, that is, they enter the VB in program order. Epoch initiator instructions are marked as such in the VB. When an epoch initiator detects a mispeculation, all the following instructions are cancelled. When an instruction reaches the VB head, if it is an epoch initiator and has not completed execution yet, it waits. When it completes, it leaves the VB and updates the machine state, if required. Non epoch-initiator instructions that reach the VB head can leave it regardless of their execution state. That is, they can be either dispatched, issued, or completed. However, only the non-cancelled (i.e., validated) instructions will update the machine state. On the other hand, cancelled instructions are drained to free the resources they occupy (see Section 2.2). Notice that when an instruction leaves the VB, if it is already completed, it no longer consumes execution resources in the pipeline; this is analogous to a normal retirement when using the ROB. Otherwise, unlike with the ROB, the instruction is retired from the VB but remains in the pipeline until it is completed. The proposed microarchitecture can support a wide range of epoch initiators. At least, epoch initiators covering the three speculation-related conditions are supported. Therefore, branches and memory reference instructions (i.e., the address calculation part) act as epoch initiators. In other words, branch speculation, memory replay traps (see Section 2.4), and exceptions related to address calculation (e.g., page faults, invalid addresses) are supported by design. It is possible to include more instructions in the set of epoch initiators. For instance, to support precise floating-point exceptions, floating-point instructions should be included in this set. As instructions are able to leave the VB only when their epoch initiators validate their epoch, a high percentage of epoch initiators might reduce the performance benefits of the VB.
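The VB head rule just described can be sketched as follows. This is a simplified model under assumed semantics (one-shot drain, no machine-state update), not the hardware algorithm itself:

```python
# Sketch of the VB head rule: an epoch initiator must complete (resolving
# its epoch) before leaving; any other instruction leaves regardless of
# its execution state. Cancelled instructions are drained the same way,
# merely to free resources, and simply never update machine state.
from collections import deque

class Instr:
    def __init__(self, name, epoch_initiator=False):
        self.name = name
        self.epoch_initiator = epoch_initiator
        self.completed = False
        self.cancelled = False   # set when a previous initiator mispeculates

def drain_vb_head(vb):
    """Pop every instruction allowed to leave the VB this cycle."""
    left = []
    while vb:
        head = vb[0]
        if head.epoch_initiator and not head.completed:
            break                  # initiator waits at the head until resolved
        left.append(vb.popleft())  # may still be dispatched/issued/completed
    return left

vb = deque([Instr("br", epoch_initiator=True), Instr("add"), Instr("sub")])
print([i.name for i in drain_vb_head(vb)])  # []: branch unresolved, head waits
vb[0].completed = True
print([i.name for i in drain_vb_head(vb)])  # ['br', 'add', 'sub']
```

The second drain shows the key difference from a ROB: `add` and `sub` leave the VB even if they have not completed, remaining in the pipeline afterwards.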
User-definable flags can be used to enable or disable support for precise exceptions. If the corresponding flag is enabled, an instruction that may generate the given type of exception will force a new epoch when it is decoded. In fact, a program could dynamically enable or disable these flags during its execution. For instance, a flag can be enabled when the compiler suspects that an arithmetic exception may be raised.

2.1 Register Reclamation

Typically, modern microprocessors free a physical register when the instruction that renames the corresponding logical register commits [14]. Then, the physical register index is placed in the list that contains the free physical registers available for new producers. Waiting until the commit stage to free a physical register is easy to implement but is a conservative approach; a sufficient condition to free the register is that all its consumers have read the corresponding value. Therefore, this method does not use registers efficiently, as they can stay mapped for longer than their useful lifetime. In addition, this method requires keeping track of the oldest instruction in the pipeline. As this instruction may have already left the VB, the method is unsuitable for our proposal. For these reasons, we devised a register reclamation strategy based on the counter method [15, 14] and targeted to the VB microarchitecture.

Figure 2.1: VB microarchitecture block diagram.

The hardware components used in this scheme, as shown in Figure 2.1, are:

Frontend Register Alias Table (RAT_front). This table maintains the current mapping for each logical register and is accessed in the rename stage. The table is indexed by the source logical register to obtain the corresponding physical register identifier. Additionally, each time a new physical register is mapped to a destination logical register, the RAT_front is updated.

Retirement Register Alias Table (RAT_ret). The RAT_ret table is updated as instructions exit the VB. This table contains a precise state of the register map, as only the validated instructions leaving the VB are allowed to update it.

Register Status Table (RST). The RST is indexed by a physical register number and contains three fields, labelled pending readers, valid remapping, and completed, respectively. The pending readers field contains the number of decoded instructions that consume the corresponding physical register but have not read it yet. This value is incremented as consumers enter the decode stage, and decremented when they are issued to the execution units. The second and third fields each consist of a single bit. The valid remapping bit is set when the associated logical register has been definitively remapped to a new physical register, that is, when the instruction that remapped the logical register has left the VB as validated. Finally, the completed bit indicates that the instruction producing the register's value has completed execution, writing the result to the physical register.
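The RST bookkeeping just described can be sketched in a few lines. Method names are assumptions for illustration, and the sketch compresses one detail: in the real scheme it is the validated VB exit of the *remapping* instruction that sets the valid remapping bit of the previous mapping, whereas here the previous mapping is passed in directly:

```python
# Illustrative model of the Register Status Table (RST). A physical
# register is free exactly when its triplet is
# {pending_readers = 0, valid_remapping = 1, completed = 1}.

class RST:
    def __init__(self, nregs):
        # Start with every physical register free: {0, 1, 1}.
        self.pending_readers = [0] * nregs
        self.valid_remapping = [1] * nregs
        self.completed = [1] * nregs

    def is_free(self, p):
        return (self.pending_readers[p] == 0 and
                self.valid_remapping[p] == 1 and
                self.completed[p] == 1)

    def on_rename_source(self, p):    # a consumer of p is decoded
        self.pending_readers[p] += 1

    def on_rename_dest(self):         # allocate a free p.r. as {0, 0, 0}
        p = next(r for r in range(len(self.completed)) if self.is_free(r))
        self.pending_readers[p] = 0
        self.valid_remapping[p] = 0
        self.completed[p] = 0
        return p

    def on_issue_read(self, p):       # consumer issues and reads p
        self.pending_readers[p] -= 1

    def on_writeback(self, p):        # producer writes its result to p
        self.completed[p] = 1

    def on_vb_exit_validated(self, p_prev):
        self.valid_remapping[p_prev] = 1   # previous mapping may be freed

rst = RST(4)
p = rst.on_rename_dest()     # new mapping for some logical register
rst.on_rename_source(p)      # a consumer is decoded...
rst.on_issue_read(p)         # ...and issues, reading the value
rst.on_writeback(p)          # producer completes
assert not rst.is_free(p)    # busy: triplet is {0, 0, 1}
rst.on_vb_exit_validated(p)  # a validated remapping unmaps this register
assert rst.is_free(p)        # triplet {0, 1, 1}: safely reclaimable
```

The final two assertions reproduce the busy {0,0,1} and free {0,1,1} triplets discussed below.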

Table 2.1: Actions depending on pipeline events

Event: An instruction I enters the rename stage and has physical register (p.r.) p as a source operand.
Action: RST[p].pending_readers++

Event: I enters the rename stage and reclaims a p.r. to map an output logical register l.
Action: Find a free p.r. p and set RST[p] = {0,0,0}, RAT_front[l] = p.

Event: I is issued and reads p.r. p.
Action: RST[p].pending_readers--

Event: I finishes execution, writing the result over p.r. p.
Action: RST[p].completed = 1

Event: I exits the VB as validated; l is the logical destination of I, p.r. p is the current mapping of l, and p.r. p' was the previous mapping of l.
Action: RST[p'].valid_remapping = 1, RAT_ret[l] = p

With this representation, a free physical register p can be easily identified: its RST entry contains the triplet {0,1,1}. A 0 in pending_readers guarantees that no instruction already in the pipeline will read the contents of p. Next, a 1 in valid_remapping implies that no new instruction will enter the pipeline (i.e., the rename stage) and read the contents of p, because p has been unmapped by a valid instruction. Finally, a 1 in the completed field denotes that no instruction in the pipeline is going to overwrite p. These conditions ensure that a specific physical register can be safely reallocated for a subsequent renaming. On the other hand, the triplet {0,0,1} denotes a busy register: one with no pending readers, not yet unmapped by a valid instruction, and holding the result of a completed (and valid) operation. Table 2.1 shows different situations which illustrate the dynamic operation of the proposed register reclamation strategy in non-speculative mode, as well as the updating of the RST, RAT_front, and RAT_ret tables.

2.2 Recovery Mechanism

The recovery mechanism always involves restoring both the RAT_front and the RST tables.

Register Alias Table Recovery. Current microprocessors employ different methods to restore the renaming information when a mispeculation or exception occurs.
The method presented in this work uses the two renaming tables, RAT_front and RAT_ret, explained above, similarly to the Pentium 4 [9].

RAT_ret contains a delayed copy of a validated RAT_front. That is, it matches the RAT_front table as it was at the time the exiting (validated) instruction was renamed. So, a simple way to implement the recovery mechanism (restoring the mapping to a precise state) is to wait until the offending instruction reaches the VB head, and then copy RAT_ret into RAT_front. Alternative implementations can be found in [4].

Register Status Table Recovery. The recovery mechanism must also undo the modifications performed by the cancelled instructions in any of the three fields of the RST. Concerning the valid_remapping field, we describe two possible techniques to restore its values. The first technique squashes from the VB those entries corresponding to instructions younger than the offending instruction when the latter reaches the VB head. At that point, the RAT_ret contains the physical register identifiers that we use to restore the correct mapping. The remaining physical registers must be freed. To this end, all valid_remapping entries are initially set to 1 (a necessary condition to be freed). Then, the RAT_ret is scanned looking for the physical registers whose valid_remapping entry must be reset. The second technique relies on the following observation: only the physical registers that were allocated (i.e., mapped to a logical register) by instructions younger than the offending one must be freed. Therefore, instead of squashing the VB contents when the offending instruction reaches the VB head, instructions in the VB are drained as they are cancelled. These instructions must set to 1 the valid_remapping entry of their current mapping. Notice that in this case the valid_remapping flag is used to free the register allocated by the current mapping, instead of the previous mapping as in normal operation. While the cancelled instructions are being drained, new instructions can enter the rename stage, provided that the RAT_front has already been recovered.
Therefore, the VB draining can be overlapped with subsequent processor operation. Regarding the pending_readers field, it cannot simply be reset, as there can already be valid pending readers in the issue queue. Thus, each pending_readers entry must be decremented by the number of cancelled pending readers of the corresponding physical register. To this end, the issue logic must be able to detect the instructions younger than the offending instruction, that is, the cancelled pending readers. This can be implemented by using a bitmap mask in the issue queue that identifies which instructions are younger than a given branch [16]. The cancelled instructions must be drained from the issue queue to correctly handle (i.e., decrement) their pending_readers entries. Notice that this logic can also be used to handle the completed field, by letting a cancelled instruction set the entry of its destination physical register. Alternatively, it is also possible to simply let the cancelled instructions execute to correctly handle the pending_readers and completed fields.
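The pending_readers repair can be sketched as a drain over the issue queue. The data layout (age-tagged entries, larger age meaning younger) is an assumption for illustration; real hardware would use the bitmap mask mentioned above:

```python
# Sketch of the pending_readers repair: consumers younger than the
# offending instruction are drained from the issue queue, and the counter
# of each source register they hold is decremented to undo their
# decode-time increments.

def repair_pending_readers(pending_readers, issue_queue, offender_age):
    """issue_queue: list of (age, src_regs); larger age = younger."""
    survivors = []
    for age, src_regs in issue_queue:
        if age > offender_age:            # cancelled pending reader
            for p in src_regs:
                pending_readers[p] -= 1   # undo its decode-time increment
        else:
            survivors.append((age, src_regs))
    return survivors

pending = {7: 2, 9: 1}
queue = [(3, [7]), (5, [7, 9])]           # entry aged 5 is younger than the
queue = repair_pending_readers(pending, queue, offender_age=4)  # offender
print(pending)                            # {7: 1, 9: 0}
print(queue)                              # [(3, [7])]
```

After the repair, register 9 has no remaining readers, while register 7 keeps the one valid reader that is older than the mispeculated branch.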

2.3 Working Example

Figure 2.2: Instruction status and epochs.

To illustrate different scenarios that could require triggering the recovery mechanism, as well as register reclamation handling, we use the example shown in Figure 2.2. The example shows a validation buffer with 12 instructions belonging to 3 different epochs. Instructions can be in one of the following states: dispatched but not issued, issued but not yet completed, and completed. Unlike in normal ROBs, no control information about these states is stored in the VB. Instead, the only information required is whether the epoch is validated or cancelled. In the example, assume that the three epochs have just been resolved: epochs 0 and 1 as validated and epoch 2 as cancelled. Thus, only instructions belonging to epochs 0 and 1 should be allowed to update the machine state. First, the instructions belonging to epoch 0 leave the VB. As this epoch has been validated, each instruction will update the RAT_ret and set the valid_remapping bit of the physical register previously mapped to its destination logical register. Since these instructions are completed, the VB is the last machine resource they consume. Then, the instructions of epoch 1 leave the VB: two completed, one issued, and one dispatched but not yet issued. The RAT_ret table and the valid_remapping bit are handled as above, regardless of the instruction status. However, the non-completed instructions will remain in the pipeline until they complete. Finally, the instructions belonging to epoch 2 leave the VB as cancelled. These instructions must not update the RAT_ret, but they must set the valid_remapping bit of the physical register currently mapped to their destination logical register. In addition, the processor must obtain a correct RAT_front, re-execute the epoch initiator (if needed), and fetch the correct instructions. To this end, the machine waits until the epoch initiator instruction that triggered the cancellation reaches the VB head.
Then, the processor recovers to a correct state by copying the RAT_ret to the RAT_front and resumes execution from the correct path. The RST state is recovered as explained in Section 2.2.

2.4 Uniprocessor Memory Model

To correctly follow the uniprocessor memory model, it must be ensured that each load instruction gets the data produced by the newest previous store matching its memory address. A key component to improve performance under such a model is the load/store queue (LSQ). In the VB microarchitecture, as done in some current microprocessors, memory reference instructions are internally split by the hardware into two operations when they are decoded and dispatched: the memory address calculation, which is considered an epoch initiator, and the memory operation itself. The former reserves a VB entry when it is dispatched, while the latter reserves an entry in the LSQ. A memory reference instruction frees its corresponding queue entry provided that it has been validated and completed. Additionally, a store instruction must be the oldest memory instruction in the pipeline. Load bypassing is the main technique applied to the LSQ to improve processor performance. This technique permits loads to execute early by overtaking previous stores in their access to the cache. Load bypassing can be performed speculatively by allowing loads to bypass previous stores in the LSQ even if some store address is still unresolved. As speculation may fail, processors must provide a mechanism to detect and recover from load mispeculation. For instance, loads issued speculatively can be placed in a special buffer called the finished load buffer [17]. An entry of this buffer is released when the corresponding load commits. On the other hand, when a store commits, a search is performed in this buffer looking for aliasing loads (note that all loads in the buffer are younger than the store). If aliasing is detected, when the mispeculated load commits, both the load and the subsequent instructions must be re-executed. A finished load buffer can be implemented quite straightforwardly in the VB, with no additional complexity.
In this case, a load instruction releases its entry in the finished load buffer when it leaves the VB. When a store leaves the VB, the addresses of all previous memory instructions, as well as its own address, have been resolved. Thus, it can already search the mentioned buffer looking for aliasing loads that were speculatively issued. As in a ROB-based implementation, all loads in the buffer are younger than the store. On a hit, the recovery mechanism is triggered as soon as the mispeculated load exits the VB. Notice that this implementation allows mispeculation to be detected early. Finally, store-load forwarding and load replay traps are also supported by the VB, using the same hardware available in current microprocessors.
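The finished-load-buffer check performed when a store leaves the VB can be sketched as a simple associative search. Ages and the buffer layout are assumptions for illustration, not the thesis hardware:

```python
# Sketch of the aliasing check: when a store leaves the VB, it searches
# the finished load buffer for a speculatively issued load to the same
# address; a hit means that load (and its dependents) must be replayed.

def store_leaves_vb(store_addr, finished_loads):
    """finished_loads: list of (load_age, addr) for speculatively issued
    loads still tracked; all of them are younger than the exiting store.
    Returns the age of the oldest aliasing load, or None if speculation
    was correct."""
    aliasing = [age for age, addr in finished_loads if addr == store_addr]
    return min(aliasing) if aliasing else None

flb = [(12, 0x1000), (15, 0x2000)]
print(store_leaves_vb(0x2000, flb))  # 15: load 15 bypassed a conflicting store
print(store_leaves_vb(0x3000, flb))  # None: no aliasing, speculation correct
```

On a hit, the recovery mechanism of Section 2.2 would be triggered as soon as the returned (mispeculated) load exits the VB.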

Chapter 3

Multithreaded Validation Buffer (VB-MT) Microarchitecture

Superscalar processors effectively exploit the instruction level parallelism of a single thread. To this end, multiple instructions can be issued out of order in the same cycle. Nevertheless, issue ports are usually wasted because of instruction dependencies (i.e., the limited available parallelism), thus adversely impacting performance. Two kinds of waste can be distinguished [10]: vertical waste, when no instruction is issued, and horizontal waste, when some instructions are issued without completely filling the issue width. Resource utilization can be improved by providing support for the execution of multiple threads, that is, by exploiting both instruction and thread level parallelism. There are three main multithreading models implemented in current processors: fine grain (FGMT), coarse grain (CGMT), and simultaneous multithreading (SMT). All of them reduce vertical waste, but only SMT reduces horizontal waste [10] by issuing instructions from multiple threads in the same cycle, achieving the best performance gains. Nevertheless, this is done at the expense of adding complexity to the issue logic, which is a critical point in current microprocessors. Multithreaded architectures represent an important segment of the industry. For instance, the Alpha 21464, the Intel Pentium 4 [9], the IBM Power 5 [11], the Sun Niagara [12], and the Intel Montecito [13] are commercial microprocessors included in this group. On the other hand, out-of-order commit prevents long latency operations (e.g., an L2 cache miss) from blocking the ROB when they reach its head; instead, long memory latencies are overlapped with the retirement of subsequent instructions that do not depend on the memory operation. Thus, these architectures mainly attack vertical waste, although horizontal waste is indirectly improved as well.
In this chapter, we deeply analyze the impact of retiring instructions in an out-of-order fashion in the three main models of multithreading (FGMT, CGMT, and SMT), using the Validation Buffer microarchitecture as the base out-of-order commit approach.

3.1 Multithreading support

As in any multithreading model, the Multithreaded Validation Buffer (VB-MT) microarchitecture gives the illusion of having several logical processors, that is, several simultaneously active software contexts, one per hardware thread. Although most hardware structures can be shared or private among threads [6], the logical state, defined by a virtual memory image and a logical register file, must be maintained independently per thread. Concerning the virtual memory image, it makes sense for all threads to share a common physical address space. The MMU (Memory Management Unit), and the Translation Lookaside Buffers (TLBs) in particular, must be adapted to be indexed by pairs {thread, virtual address}, returning disjoint physical addresses to each thread. In the case of the register file, the associated physical structures can be either shared or private per thread, but the subsets of physical registers bound to specific threads must be disjoint from each other. This implies a renaming mechanism that allocates a new physical register from a pair {thread, logical register}, which straightforwardly translates into a private RAT (Register Alias Table) per thread. Regarding the register renaming strategy, its implementation on a multithreaded, out-of-order commit architecture is analogous to the single-threaded one. The main difference is the replication of the two renaming tables (RAT_front and RAT_ret) for each thread. The former is looked up and updated when an instruction is at the rename stage, while the latter is modified when an instruction leaves the VB, providing a delayed copy of RAT_front. Depending on the thread to which a specific instruction belongs, it looks up or updates its associated RAT.
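The per-thread renaming scheme can be illustrated with a short sketch (all names are illustrative, and free-list recycling on instruction retirement is omitted for brevity):

```python
class RenameUnit:
    """Per-thread register renaming: a front-end RAT and a retirement RAT
    per thread, both indexed by the pair {thread, logical register}."""

    def __init__(self, num_threads, num_logical, num_physical):
        self.free_list = list(range(num_physical))
        # One RAT_front and one RAT_ret per thread (logical -> physical).
        self.rat_front = [[None] * num_logical for _ in range(num_threads)]
        self.rat_ret = [[None] * num_logical for _ in range(num_threads)]

    def rename_source(self, thread, logical):
        # Source operands look up the thread's front-end RAT.
        return self.rat_front[thread][logical]

    def rename_dest(self, thread, logical):
        # Destinations allocate a fresh physical register; the registers
        # handed to different threads are disjoint by construction.
        phys = self.free_list.pop(0)
        self.rat_front[thread][logical] = phys
        return phys

    def retire(self, thread, logical, phys):
        # When the instruction leaves the VB, the delayed copy is updated.
        self.rat_ret[thread][logical] = phys

ru = RenameUnit(num_threads=2, num_logical=32, num_physical=128)
p0 = ru.rename_dest(0, 5)  # thread 0 writes its logical r5
p1 = ru.rename_dest(1, 5)  # thread 1 writes its own r5: a different register
```

Since each thread owns its RAT pair, the same logical register in two threads maps to distinct physical registers, preserving independent logical state.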
3.2 Resource sharing

A multithreaded design offers the view of having multiple logical processors in a single chip. A straightforward way of implementing such a system consists of replicating all hardware structures per thread, while maintaining a common pool of functional resources. In the opposite approach, hardware structures can be shared among threads, using policies that dynamically allocate resource slots. Between these two extremes, there is a gradient of solutions to be explored in order to find the optimal resource sharing strategy.

Figure 3.1: VB-MT architecture diagram with all storage resources shared among threads.

Processor resources can be classified as storage resources (ROB, IQ, LSQ...) and bandwidth resources (fetch stage, issue logic...). As pointed out in [18], the sharing of storage resources can result in performance degradation if no appropriate resource allocation policy is used. The reason is that over-allocations to stalled threads can starve active ones, and wrong allocation decisions can affect several future cycles. In contrast, unwise allocation decisions involving bandwidth resources can be compensated immediately in the next cycle. On the other hand, the high demand variability of bandwidth resources (in contrast to storage resources) makes a shared design the most performance-efficient solution. The issue stage is the main bandwidth resource affected by this fact, which has caused SMT (Simultaneous Multithreading) to stand out among multithreading designs. Related experimental results are shown in Chapter 5. In this section, we evaluate the effect of sharing the main storage resources, so that we can compare it with the results of previous works and check whether they can be extrapolated to VB-MT. Figure 3.1 shows a general block diagram of the VB-MT architecture, where all resources with a variable sharing strategy are drawn in gray. Additionally, Figure 3.2 shows the impact of varying the sharing strategy of some of them. In all cases, an SMT processor with a round-robin fetch policy is implemented, with all resources private except the one labelled on the X-axis. The IPCs of the ROB/VB architectures with all resources set as private are tagged as ROB/VB Baselines. Shared resources are often more costly than private ones, as the allocation/deallocation of resource entries may become more complex.
Moreover, shared resources need more read/write ports in order to enable parallel access by several threads, which may increase both their area and latency. The advantage of shared resources, as one would expect, lies in the fact that a single thread can compete for all their entries.

Figure 3.2: Impact of resource sharing for ROB/VB architectures.

Nevertheless, Figure 3.2 reveals that the conclusions of [18] remain valid in a VB-MT architecture: there is a performance loss when sharing any storage resource without an optimized allocation policy. The reason is that slow or stalled threads can monopolize shared resources, reserving their entries for a long time and preventing other threads from using them fruitfully. The only component that does not show performance degradation when shared is the ROB/VB (not shown in Figure 3.2, see below). The negative effects that stalled threads have on shared components can be solved by applying a resource allocation policy that detects such situations (e.g., DCRA [19], evaluated later). Consequently, we will use a baseline processor with all resources configured as private among threads, except in those experiments that implement DCRA, where all resources are shared among threads using this policy. The ROB/VB is among the resources that can be shared in an MT design. In this case, we can cite two possible mechanisms to handle the assignment of ROB entries to threads. One approach consists of assigning disjoint ROB portions to threads [20], whose size can vary depending on the threads' demand; each portion is treated as an independent ROB. As an opposite alternative, instructions can be inserted in FIFO order into a shared ROB, as if all of them belonged to the same thread, and hence retired in the same order. Both approaches can exploit the advantages of a shared storage resource (also applying a resource allocation policy if necessary), but each one suffers from certain drawbacks. In the former case, ROB portions cannot grow arbitrarily to fill the whole ROB; a portion size is constrained by the position of contiguous portions, and ROB portions can only be shifted when the associated head and tail pointers are properly aligned. In the latter case, instructions from different threads are intermingled across the ROB, so non-completed instructions from one thread may prevent completed instructions from another from exiting the ROB. Moreover, the recovery process should cancel instructions selectively, forcing interleaved gaps to remain in the ROB until they are retired. A deeper study of the effects of the ROB/VB sharing strategies is planned as future work, so we assume private ROBs/VBs for all experiments in this work.

3.3 Adapting SMT Techniques to VB-MT

In this section, we discuss the adaptation of techniques previously proposed for SMT processors to the VB architecture. These techniques try to exploit the SMT potential by keeping functional unit utilization as high as possible. The techniques studied in this section are Instruction Count (ICOUNT), Predictive Data Gating (PDG) and Dynamically Controlled Resource Allocation (DCRA). Their impact on the VB architecture, compared with the ROB, will be evaluated in Chapter 5.

3.3.1 Instruction Count (ICOUNT)

ICOUNT is a fetch policy proposed by Tullsen et al. [21], which gives fetch priority to the threads with fewer instructions in the decode, rename and issue stages. The designation ICOUNT.n_t.n_i means that at most n_t threads can be handled at the fetch stage in the same cycle, taking at most n_i instructions from each one. In the experiments shown in Chapter 5 we use an ICOUNT.2.8 configuration, which provides the best results for the ROB-SMT architecture.

3.3.2 Predictive Data Gating (PDG)

PDG was proposed by El-Moursy et al. [22], and can also be classified as a fetch policy. This technique aims to avoid a long permanence of non-ready instructions in the issue queue (IQ) due to dependences on long-latency instructions, long data dependence chains, or contention for functional units or the cache.
To detect long-latency events early, a load miss predictor is placed in the processor front-end, and a per-thread counter tracks the estimated number of pending cache misses. The values of these counters are used to stall a thread's fetch when a threshold is exceeded. For our experiments, a load miss predictor of 4K entries with 2-bit saturating counters is assumed, as suggested in [22]. The predictor is indexed by the PC of the load instruction, and the prediction is given by the most significant bit of the saturating counter. Whenever a load misses in the data cache, the corresponding counter is reset, while it is incremented when the load hits in the cache.

3.3.3 Dynamically Controlled Resource Allocation (DCRA)

DCRA was proposed by Cazorla et al. [19] and can be classified as a resource allocation policy. Like ICOUNT and PDG, it tries to increase functional unit utilization by preventing stalled threads from occupying resources that could be used by other threads with instructions ready to execute. DCRA not only controls the fetch bandwidth granted to each thread, but also the allocation of shared processor resources. However, the nomenclature has been relaxed in previous works [22], which consider this approach an instruction fetch policy as well. DCRA classifies threads according to two criteria. A thread is slow or fast depending on whether or not it has pending L1 cache misses. At the same time, a thread can be active or inactive for a given resource R depending on whether it has recently demanded that resource: a thread is considered active for resource R during a predefined number Y of cycles after the cycle in which it demanded an entry of R (e.g., Y = 256). The authors propose a mathematical formula to limit the number of entries of a shared resource that a thread can allocate depending on this classification. The shared resources can be the instruction queue, the load/store queue, the physical register file, etc., and the additional hardware consists of resource occupancy counters and a lookup table implementing the computation of the resource entry limits using the aforementioned formula.
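The PDG load-miss predictor described above can be sketched as follows (a minimal model; the indexing function and all names are assumptions for illustration):

```python
class LoadMissPredictor:
    """4K-entry table of 2-bit saturating counters, indexed by the load's PC.
    The prediction is the counter's most significant bit: since the counter
    is reset on a miss and incremented on a hit, a low counter predicts a miss."""

    SIZE = 4096

    def __init__(self):
        self.counters = [3] * self.SIZE  # start saturated: predict hits

    def index(self, pc):
        # Drop the word-alignment bits, then fold into the table (assumed scheme).
        return (pc >> 2) % self.SIZE

    def predict_miss(self, pc):
        # MSB of the 2-bit counter: 0 -> predict miss, 1 -> predict hit.
        return self.counters[self.index(pc)] < 2

    def update(self, pc, missed):
        i = self.index(pc)
        if missed:
            self.counters[i] = 0  # reset on a data-cache miss
        else:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate on hits

pred = LoadMissPredictor()
pred.update(0x400100, missed=True)    # a miss drives the counter to zero
pred.update(0x400100, missed=False)   # two subsequent hits raise it again
pred.update(0x400100, missed=False)
```

In the fetch stage, each predicted miss would increment the thread's pending-miss counter, gating that thread's fetch once the threshold is exceeded.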

Chapter 4

The Simulation Framework Multi2Sim

This chapter describes Multi2Sim [23], the simulation framework that has been used to model the architectural designs proposed in this work. Multi2Sim integrates a model of the processor cores, the memory hierarchy and the interconnection network in a tool that enables their evaluation. The simulator has been extended to model the VB microarchitecture, both for monothreaded and multithreaded environments, including all instruction fetch policies cited in Chapter 3.

4.1 Basic simulator description

Multi2Sim has been developed integrating some significant characteristics of popular simulators, such as separate functional and timing simulation, SMT and multiprocessor support, and cache coherence. Multi2Sim is an application-only tool intended to simulate final MIPS32 executable files. With a MIPS32 cross-compiler (or a MIPS32 machine), one can compile one's own program sources and test them under Multi2Sim. This section deals with the process of starting and running an application in a cross-platform environment, and briefly describes the three implemented simulation techniques (functional, detailed and event-driven simulation).

4.1.1 Program Loading

Program loading is the process in which an executable file is mapped into different virtual memory regions of a new software context, and its register file and stack are initialized to start execution. In a real machine, the operating system is in charge of these actions, but an application-only tool must manage program loading during its initialization.

Executable File Loading. The executable files output by gcc follow the ELF (Executable and Linkable Format) specification. An ELF file is made up of a header and a set of sections. Some Linux distributions include the library libbfd, which provides types and functions to list the sections of an ELF file and track their main attributes (starting address, size, flags and content). When the flags of an ELF section indicate that it is loadable, its contents are copied into memory at the corresponding starting address.

Program Stack. The next step of the program loading process is to initialize the process stack. The aim of the program stack is to store function local variables and parameters. During program execution, the stack pointer ($sp register) is managed by the program code itself. However, when the program starts, it expects some data on the stack, namely the program arguments and environment variables, which must be placed there by the program loader.

Register File. The last step is the register file initialization. This includes the $sp register, which has been progressively updated during the stack initialization, and the PC and NPC registers. The initial value of the PC register is specified in the ELF header of the executable file as the program entry point. The NPC register is not explicitly defined in the MIPS32 architecture, but it is used internally by the simulator to handle the branch delay slot.

4.1.2 Simulation Model

Multi2Sim uses three different simulation models, embodied in different modules: a functional simulation engine, a detailed simulator and an event-driven module; the latter two perform the timing simulation. To describe them, the term context will be used hereafter to denote a software entity, defined by the status of a virtual memory image and a logical register file.
In contrast, the term thread will refer to a processor hardware entity comprising a physical register file, a set of physical memory pages, a set of entries in the pipeline queues, etc. The three main simulation techniques are described next.

Functional Simulation, also called the simulator kernel. It is built as an autonomous library and provides an interface to the rest of the simulator. This engine is unaware of hardware threads, and provides functions to create/destroy software contexts, perform program loading, enumerate existing contexts, consult their status, execute machine instructions and handle speculative execution. The supported machine instructions follow the MIPS32 specification [24] [25]. This choice was basically motivated by the fixed instruction size and formats, which enable simple instruction decoding. An important feature of the simulation kernel, inherited from SimpleScalar [26], is the checkpointing capability of the implemented memory module and register file, intended for an external module that needs to implement speculative execution. In this sense, when a wrong execution path starts, both the register file and the memory status are saved, and they are reloaded when the misprediction is detected.

Detailed Simulation. The Multi2Sim detailed simulator uses the functional engine to perform a timing-first [27] simulation: in each cycle, a sequence of calls to the kernel updates the state of the existing contexts. The detailed simulator analyzes the nature of the recently executed machine instructions and accounts for the operation latencies incurred by hardware structures. The main simulated hardware consists of pipeline structures (stage resources, instruction queue, load/store queue, reorder buffer...), the branch predictor (modelling a combined bimodal-gshare predictor), cache memories (with variable size, associativity and replacement policy), the memory management unit, and segmented functional units of configurable latency.

Event-Driven Simulation. In a scheme where functional and detailed simulation are independent, the implementation of the machine instruction behaviour can be centralized in a single file (functional simulation), increasing the simulator's modularity. In this sense, function calls that activate hardware components (detailed simulation) have an interface that returns the latency required to complete the access. Nevertheless, in some situations this latency is not a deterministic value, so it cannot be calculated when the function call is performed. Instead, it must be simulated cycle by cycle. This is the case of interconnects and caches, where an access can result in a message transfer whose delay cannot be computed a priori, justifying the need for an independent event-driven simulation engine.
4.2 Support for Multithreaded and Multicore Architectures

This section describes the basic simulator features that provide support for multithreaded and multicore processor modelling. They can be classified in two main groups: those that affect the functional simulation engine (enabling the execution of parallel workloads) and those that involve the detailed simulation module (enabling pipelines with several hardware threads on the one hand, and systems with several cores on the other).


More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Optimizing SMT Processors for High Single-Thread Performance

Optimizing SMT Processors for High Single-Thread Performance University of Maryland Inistitute for Advanced Computer Studies Technical Report UMIACS-TR-2003-07 Optimizing SMT Processors for High Single-Thread Performance Gautham K. Dorai, Donald Yeung, and Seungryul

More information

Superscalar Processor Design

Superscalar Processor Design Superscalar Processor Design Superscalar Organization Virendra Singh Indian Institute of Science Bangalore virendra@computer.org Lecture 26 SE-273: Processor Design Super-scalar Organization Fetch Instruction

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Hyperthreading Technology

Hyperthreading Technology Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Dynamically Controlled Resource Allocation in SMT Processors

Dynamically Controlled Resource Allocation in SMT Processors Dynamically Controlled Resource Allocation in SMT Processors Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized

More information

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections ) Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections 2.3-2.6) 1 Correlating Predictors Basic branch prediction: maintain a 2-bit saturating counter for each

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections ) Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections 2.3-2.6) 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB) Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Exploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 18-742 Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 Past Due: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Lecture 11: Out-of-order Processors. Topics: more ooo design details, timing, load-store queue

Lecture 11: Out-of-order Processors. Topics: more ooo design details, timing, load-store queue Lecture 11: Out-of-order Processors Topics: more ooo design details, timing, load-store queue 1 Problem 0 Show the renamed version of the following code: Assume that you have 36 physical registers and

More information

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Reconfigurable and Self-optimizing Multicore Architectures. Presented by: Naveen Sundarraj

Reconfigurable and Self-optimizing Multicore Architectures. Presented by: Naveen Sundarraj Reconfigurable and Self-optimizing Multicore Architectures Presented by: Naveen Sundarraj 1 11/9/2012 OUTLINE Introduction Motivation Reconfiguration Performance evaluation Reconfiguration Self-optimization

More information

LIMITS OF ILP. B649 Parallel Architectures and Programming

LIMITS OF ILP. B649 Parallel Architectures and Programming LIMITS OF ILP B649 Parallel Architectures and Programming A Perfect Processor Register renaming infinite number of registers hence, avoids all WAW and WAR hazards Branch prediction perfect prediction Jump

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Superscalar Processors Ch 14

Superscalar Processors Ch 14 Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion

More information

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency? Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion

More information

Architectures for Instruction-Level Parallelism

Architectures for Instruction-Level Parallelism Low Power VLSI System Design Lecture : Low Power Microprocessor Design Prof. R. Iris Bahar October 0, 07 The HW/SW Interface Seminar Series Jointly sponsored by Engineering and Computer Science Hardware-Software

More information

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

Limitations of Scalar Pipelines

Limitations of Scalar Pipelines Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC = 1 Inefficient unified pipeline

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of

More information

High Performance Memory Requests Scheduling Technique for Multicore Processors

High Performance Memory Requests Scheduling Technique for Multicore Processors High Performance Memory Requests Scheduling Technique for Multicore Processors by Walid Ahmed Mohamed El-Reedy A Thesis Submitted to the Faculty of Engineering at Cairo University In Partial Fulfillment

More information

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB)

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are

More information