Master's Thesis in Computer Engineering
Out-of-Order Retirement of Instructions in Superscalar and Multithreaded Processors
Author: Rafael Ubal Tena
Advisors: Julio Sahuquillo Borrás, Pedro López Rodríguez
Contents

Abstract

1 Introduction
  1.1 Out-of-Order Retirement in Monothreaded Processors
  1.2 Out-of-Order Retirement with Support for Multiple Threads

2 The Validation Buffer Microarchitecture
  2.1 Register Reclamation
  2.2 Recovery Mechanism
  2.3 Working Example
  2.4 Uniprocessor Memory Model

3 Multithreaded Validation Buffer (VB-MT) Microarchitecture
  3.1 Multithreading support
  Resource sharing
  Adapting SMT Techniques to VB-MT
    Instruction Count (ICOUNT)
    Predictive Data Gating (PDG)
    Dynamically Controlled Resource Allocation (DCRA)

4 The Simulation Framework Multi2Sim
  Basic simulator description
    Program Loading
    Simulation Model
  Support for Multithreaded and Multicore Architectures
    Functional simulation: parallel workloads support
    Detailed simulation: Multithreading support
    Detailed simulation: Multicore support

5 Experimental Results
  Evaluation of the VB Microarchitecture
    Exploring the Potential of the VB Microarchitecture
    Exploring the Behavior in a Modern Microprocessor
    Impact on Performance of Supporting Precise Floating-Point Exceptions
  Evaluation of the VB-MT Microarchitecture
    Multithreading Granularity
    Multithreading Scalability: Performance vs. Complexity
    Fetch Policies
    Resource occupancy

6 Related Work

7 Conclusions
  Contributions and Future Work
  Publications Related with this Work
Abstract

Current superscalar processors commit instructions in program order by using a reorder buffer (ROB). The ROB provides support for speculation, precise exceptions, and register reclamation. However, committing instructions in program order may lead to significant performance degradation if a long latency operation blocks the ROB head. Several proposals have been published to deal with this problem. Most of them retire instructions speculatively. However, as speculation may fail, checkpoints are required in order to roll back the processor to a precise state, which requires both extra hardware to manage checkpoints and the enlargement of other major processor structures; this enlargement, in turn, might impact the processor cycle time. This work focuses on out-of-order commit in a nonspeculative way, thus avoiding checkpoints. To this end, we replace the ROB with a structure called the Validation Buffer (VB). This structure keeps dispatched instructions until they are known to be nonspeculative or mispeculated, which allows early retirement. By doing so, the performance bottleneck is largely alleviated. An aggressive register reclamation mechanism targeted to this microarchitecture is also devised. As experimental results show, the VB structure is much more efficient than a typical ROB: with only 32 entries, it achieves performance close to that of an in-order commit microprocessor using a 256-entry ROB. The present work also makes an exhaustive analysis of out-of-order retirement of instructions on multithreaded processors. Superscalar processors exploit instruction level parallelism by issuing multiple instructions per cycle. However, the issue width is usually wasted because of instruction dependencies. On the other hand, multithreaded processors reduce this waste by supporting the concurrent execution of instructions from multiple threads, thus exploiting both instruction and thread level parallelism.
Additionally, out-of-order commit (OOC) processors overlap the execution of long latency instructions with potentially many subsequent ones, so these microarchitectures also help to reduce issue waste. We analyze the impact on performance of unifying both multithreading and OOC techniques, by combining the three main multithreading paradigms, fine grain (FGMT), coarse grain (CGMT), and simultaneous (SMT), with the Validation Buffer microarchitecture, which retires instructions out of order. From the experimental results, we conclude that: (i) an OOC-SMT processor achieves the same performance as a conventional SMT with half the number of hardware threads; (ii) an OOC-FGMT processor outperforms a conventional SMT processor, which requires a more complex issue logic; and (iii) the use of OOC allows optimized fetch policies (DCRA) for SMT processors to do their job even better, almost completely removing the issue width waste.
Chapter 1
Introduction

1.1 Out-of-Order Retirement in Monothreaded Processors

Current high-performance microprocessors execute instructions out of order to exploit instruction level parallelism (ILP). To support speculative execution, precise exceptions, and register reclamation, a reorder buffer (ROB) structure is used [1]. After being decoded, instructions are inserted in program order in the ROB, where they are kept while being executed and until retired at the commit stage. The key to supporting speculation and precise exceptions is that instructions also leave the ROB in program order, that is, when they are the oldest ones in the pipeline. Consequently, if a branch is mispredicted or an instruction raises an exception, there is a guarantee that, when the offending instruction reaches the commit stage, all the previous instructions have already been retired and none of the subsequent ones has. Therefore, to recover from that situation, all the processor has to do is abort the latter ones. This behavior is conservative. For instance, when the ROB head is blocked by a long latency instruction (e.g., a load that misses in the L2 cache), subsequent instructions cannot release their ROB entries. This happens even if these instructions are independent of the long latency one and have already completed. In such a case, since the ROB has a finite size, as long as instruction decoding continues, the ROB may become full, thus stalling the processor for a considerable number of cycles. Register reclamation is also handled in a conservative way, because physical registers remain mapped for longer than their useful lifetime. In summary, both the advantages and the shortcomings of the ROB come from the fact that instructions are committed in program order. A naive solution to this problem is to enlarge the ROB to accommodate more instructions in flight. However, as ROB-based microarchitectures
serialize the release of some critical resources at the commit stage (e.g., physical registers or store queue entries), these resources should also be enlarged. This resizing increases the cost in terms of area and power, and it might also impact the processor cycle time [2]. To overcome this drawback, some solutions that commit instructions out of order have been published. These proposals can be classified into two approaches depending on whether instructions are speculatively retired or not. Some proposals falling into the first approach, like [3], allow the retirement of the instruction obstructing the ROB head by providing a speculative value. Others, like [4] or [5], replace the normal ROB with alternative structures to speculatively retire instructions out of order. As speculation may fail, these proposals need to provide a mechanism to recover the processor to a correct state. To this end, the architectural state of the machine is checkpointed. Again, this implies the enlargement of some major microprocessor structures, for instance, the register file [5] or the load/store queue [4], because completed instructions cannot free some critical resources until their associated checkpoint is released. Regarding the nonspeculative approach, Bell and Lipasti [6] propose to scan a few entries of the ROB, as many as the commit width, allowing those instructions that satisfy certain conditions to be retired. None of these conditions requires an instruction to be the oldest one in the pipeline in order to be retired. Hence, instructions can be retired out of program order. However, in this scenario, the ROB head may become fragmented after the commit stage, and thus the ROB must be collapsed for the next cycle. Collapsing a large structure is costly in time and could adversely impact the microprocessor cycle. As a consequence, this proposal is unsuitable for large ROB sizes, which is the current trend.
Moreover, the small number of instructions scanned at the commit stage significantly constrains the potential that this proposal could achieve. In this work we propose the Validation Buffer (VB) microarchitecture, which is also based on the nonspeculative approach. This microarchitecture uses a FIFO-like table structure analogous to the ROB. The aim of this structure is to provide support for speculative execution, exceptions, and register reclamation. While in the VB, instructions are speculatively executed. Once all the previous branches and supported exceptions are resolved, the execution mode of an instruction changes either to nonspeculative or mispeculated. At that point, instructions are allowed to leave the VB. Therefore, instructions leave the VB in program order but, unlike in the ROB, they do not remain in the VB until retirement. Instead, they remain in the VB only until their execution mode is resolved, either as nonspeculative or mispeculated. Consequently, instructions leave the VB at different stages of their execution: completed, issued, or just decoded and not yet issued. For instance, instructions following a long latency memory reference instruction could leave the VB as soon as its memory address is successfully calculated and no page fault has been raised. This work discusses how the VB microarchitecture works, focusing on how it deals with register reclamation, speculative execution, and exceptions. The first main contribution (Chapter 2) of this work is the proposal of an aggressive out-of-order retirement microarchitecture without checkpointing. This microarchitecture decouples instruction tracking for execution purposes from instruction tracking for resource reclamation purposes. The proposal outperforms the existing proposal that does not perform checkpoints, since it achieves more than twice its performance using smaller VB/ROB sizes. On the other hand, register reclamation cannot be handled as done in current microprocessors [7, 8, 9] because no ROB is used. Therefore, we devise an aggressive register reclamation method targeted to this architecture. Experimental results show that the VB microarchitecture increases the ILP while requiring less complexity in some major critical resources like the register file and the load/store queue.

1.2 Out-of-Order Retirement with Support for Multiple Threads

Superscalar processors effectively exploit the instruction level parallelism of a single thread by issuing multiple instructions in an out-of-order fashion in the same cycle. Nevertheless, issue ports are usually wasted because of instruction dependencies (i.e., the available parallelism), thus adversely impacting performance. Two kinds of waste are distinguishable [10]: vertical waste, when no instruction is issued, and horizontal waste, when some instruction is issued without completely filling the issue width. Resource utilization can be improved by providing support for the execution of multiple threads, that is, by exploiting both instruction and thread level parallelism. There are three main multithreading models implemented in current processors: fine grain (FGMT), coarse grain (CGMT), and simultaneous multithreading (SMT).
All of them reduce vertical waste, but only SMT reduces horizontal waste [10] by issuing instructions from multiple threads in the same cycle, achieving the best performance gains. Nevertheless, this is done at the expense of adding complexity to the issue logic, which is a critical point in current microprocessors. Multithreaded architectures represent an important segment of the industry. For instance, the Alpha 21464, the Intel Pentium 4 [9], the IBM Power 5 [11], the Sun Niagara [12], and the Intel Montecito [13] are commercial microprocessors included in this group. On the other hand, and as introduced above, out-of-order commit (OOC) processors differ from in-order commit (IOC) architectures in that they do not force an instruction to be the oldest one in the pipeline in order to be retired. In this way, long latency operations (e.g., an L2 miss) do not block the ROB when they reach the ROB head; instead, long memory latencies are overlapped with the retirement of subsequent instructions that do not depend on the memory operation. Thus, these architectures mainly attack the vertical waste, although horizontal waste is indirectly improved as well. In addition, we will show that OOC processors can make better use of resources and are more cost-effective than IOC processors. Quite recently, out-of-order retirement has been investigated on superscalar processors and chip multiprocessors (CMPs), but, to the best of our knowledge, no research has focused on multithreaded processors. As the second main contribution of this work (Chapter 3), we analyze the impact of retiring instructions in an out-of-order fashion in the three main models of multithreading: FGMT, CGMT, and SMT. To this end, we selected our own proposal, the Validation Buffer microarchitecture, as the OOC base architecture, and extended it to support the execution of multiple threads. Experimental results provide three main conclusions. First, a VB-based SMT processor requires in most cases half the number of hardware threads of a ROB-based SMT processor to achieve similar performance. In other words, performance can be maintained in a VB-based SMT when reducing the number of hardware threads, thus saving all the hardware resources needed to track their status. Second, a VB-based FGMT processor outperforms a ROB-based SMT processor. In this case, performance can be sustained while simplifying the issue logic, which can be translated into shorter issue delays or lower power consumption of instruction schedulers. Third, existing fetch policies for SMT processors provide advantages complementary to the out-of-order retirement benefits.
A high-performance SMT design could implement both techniques if area, power consumption and hardware constraints allow it. The rest of this work is structured as follows. Chapter 2 gives a detailed view of the Validation Buffer microarchitecture. Chapter 3 deals with out-of-order retirement in multithreaded processors, and explains a set of existing instruction fetch policies that will be used for evaluation purposes. Chapter 4 describes the simulation framework used to model all the proposed techniques. Chapter 5 performs an exhaustive evaluation of both the VB and VB-MT architectures, and Chapters 6 and 7 provide citations to related work and some concluding remarks, respectively.
Chapter 2
The Validation Buffer Microarchitecture

The commit stage is typically the last one of the microarchitecture pipeline. At this stage, a completed instruction updates the architectural machine state, frees the resources it uses and exits the ROB. The mechanism proposed in this work allows instructions to be retired early, as soon as it is known that they are nonspeculative. Notice that these instructions may not be completed. Once they are completed, they will update the machine state and free the used resources. Therefore, instructions will exit the pipeline in an out-of-order fashion. The necessary conditions to allow an instruction to be committed out of order are [6]: i) the instruction is completed; ii) WAR hazards are solved (i.e., a write to a particular register cannot be permitted to commit before all prior reads of that architected register have completed); iii) previous branches are successfully predicted; iv) none of the previous instructions is going to raise an exception; and v) the instruction is not involved in memory replay traps. The first condition is straightforwardly met by any proposal at the writeback stage. The last three conditions are handled by the Validation Buffer (VB) structure, which replaces the ROB and contains the instructions whose conditions are not yet known. The second condition is fulfilled by the devised register reclamation method (see Section 2.1). The VB deals with the speculation-related conditions (iii, iv and v) by decomposing code into fragments or epochs. The epoch boundaries are defined by instructions that may initiate a speculative execution, referred to as epoch initiators (e.g., branches or potentially exception-raising instructions). Only those instructions whose previous epoch initiators have completed and confirmed their prediction are allowed to modify the machine state. We refer to these instructions as validated instructions.
Instructions reserve an entry in the VB when they are dispatched, that is, they enter the VB in program order. Epoch initiator instructions are marked as such in the VB. When an epoch initiator detects a mispeculation, all the following instructions are cancelled. When an instruction reaches the VB head, if it is an epoch initiator and it has not completed execution yet, it waits. When it completes, it leaves the VB and updates the machine state, if needed. Non epoch-initiator instructions that reach the VB head can leave it regardless of their execution state. That is, they can be either dispatched, issued or completed. However, only the instructions that have not been cancelled (i.e., the validated ones) will update the machine state. On the other hand, cancelled instructions are drained to free the resources they occupy (see Section 2.2). Notice that when an instruction leaves the VB, if it is already completed, it does not consume execution resources in the pipeline. Thus, this case is analogous to a normal retirement when using the ROB. Otherwise, unlike with the ROB, the instruction is retired from the VB but remains in the pipeline until it is completed. The proposed microarchitecture can support a wide range of epoch initiators. At least, epoch initiators according to the three speculation-related conditions are supported. Therefore, branches and memory reference instructions (i.e., their address calculation part) act as epoch initiators. In other words, branch speculation, memory replay traps (see Section 2.4) and exceptions related to address calculation (e.g., page faults, invalid addresses) are supported by design. It is possible to include more instructions in the set of epoch initiators. For instance, in order to support precise floating-point exceptions, floating-point instructions should be included in this set. As instructions are able to leave the VB only when their epoch initiators validate their epoch, a high percentage of epoch initiators might reduce the performance benefits of the VB.
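The drain policy described above can be captured in a small sketch. This is an illustrative Python model with names of our own choosing, not part of the actual design, and it simplifies one point: in the real microarchitecture a mispeculating epoch initiator cancels all subsequent instructions, not just its own epoch. The key behavior it shows is that entries leave the VB head once their epoch is resolved, independently of whether the instructions themselves have completed.

```python
from collections import deque

def drain_vb(vb, epoch_status):
    """Release entries from the VB head whose epoch is already resolved.

    vb holds (name, epoch) pairs in program order; epoch_status maps an
    epoch id to 'validated', 'cancelled' or 'pending'. An instruction's
    own completion is NOT required to leave the VB: only the resolution
    of its epoch matters. Returns the names of validated and cancelled
    instructions that left the buffer."""
    validated, cancelled = [], []
    while vb:
        name, epoch = vb[0]
        status = epoch_status.get(epoch, "pending")
        if status == "pending":
            break                       # head epoch unresolved: wait
        vb.popleft()
        (validated if status == "validated" else cancelled).append(name)
    return validated, cancelled

# Epochs 0 and 1 just resolved; epoch 2 is still pending.
vb = deque([("i0", 0), ("i1", 0), ("i2", 1), ("i3", 2)])
v, c = drain_vb(vb, {0: "validated", 1: "cancelled"})
# v == ["i0", "i1"], c == ["i2"]; i3 waits in the VB until epoch 2 resolves
```

Note that the model stores no per-instruction execution state at all, mirroring the observation made later in the working example that the VB only needs to know whether an epoch is validated or cancelled.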
User-definable flags can be used to enable or disable support for precise exceptions. If the corresponding flag is enabled, the instruction that may generate a given type of exception will force a new epoch when it is decoded. In fact, a program could dynamically enable or disable these flags during its execution. For instance, a flag can be enabled when the compiler suspects that an arithmetic exception may be raised.

2.1 Register Reclamation

Typically, modern microprocessors free a physical register when the instruction that renames the corresponding logical register commits [14]. Then, the physical register index is placed in the list that contains the free physical registers available for new producers. Waiting until the commit stage to free a physical register is easy to implement but follows a conservative approach; a sufficient condition is that all consumers have read the corresponding value. Therefore, this method does not
efficiently use registers, as they can be mapped for longer than their useful lifetime. In addition, this method requires keeping track of the oldest instruction in the pipeline. As this instruction may have already left the VB, this method is unsuitable for our proposal. For these reasons, we devised a register reclamation strategy based on the counter method [15, 14] targeted to the VB microarchitecture.

Figure 2.1: VB microarchitecture block diagram.

The hardware components used in this scheme, as shown in Figure 2.1, are:

Frontend Register Alias Table (RAT front). This table maintains the current mapping for each logical register and is accessed in the rename stage. The table is indexed by the source logical register to obtain the corresponding physical register identifier. Additionally, each time a new physical register is mapped to a destination logical register, the RAT front is updated.

Retirement Register Alias Table (RAT ret). The RAT ret table is updated as instructions exit the VB. This table contains a precise state of the register map, as only the validated instructions leaving the VB are allowed to update it.

Register Status Table (RST). The RST is indexed by a physical register number and contains three fields, labelled pending readers, valid remapping and completed, respectively. The pending readers field contains the number of decoded instructions that consume the corresponding physical register but have not read it yet. This value is incremented as consumers enter the decode stage, and decremented when they are issued to the execution units. The second and third fields each consist of a single bit. The valid remapping bit is set when the associated logical register has been definitively remapped to a new physical register, that is, when the instruction that remapped the logical register has left the VB as validated.
Finally, the completed bit indicates that the instruction producing the register's value has completed execution, writing the result to the physical register.
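The three RST fields can be modelled in a few lines of Python. This is an illustrative sketch only (the class and method names are ours, and the choice of starting all registers as free is an assumption for the example); it shows how the per-register triplet evolves and when the {0,1,1} free condition discussed in the text holds.

```python
# Minimal model of the Register Status Table (RST). Field order follows
# the text: [pending_readers, valid_remapping, completed].
class RST:
    def __init__(self, num_regs):
        # Assumption for this sketch: every register starts free ({0,1,1}).
        self.e = [[0, 1, 1] for _ in range(num_regs)]

    def allocate(self, p):    # new producer mapped to p: triplet {0,0,0}
        self.e[p] = [0, 0, 0]

    def add_reader(self, p):  # a consumer of p is renamed (decode stage)
        self.e[p][0] += 1

    def read(self, p):        # a consumer of p is issued
        self.e[p][0] -= 1

    def complete(self, p):    # the producer wrote its result into p
        self.e[p][2] = 1

    def unmap(self, p):       # the remapping instruction left the VB validated
        self.e[p][1] = 1

    def is_free(self, p):     # the {0,1,1} triplet from the text
        return self.e[p] == [0, 1, 1]

rst = RST(4)
rst.allocate(2); rst.add_reader(2); rst.complete(2)
assert not rst.is_free(2)   # {1,0,1}: a pending reader, still mapped
rst.read(2); rst.unmap(2)
assert rst.is_free(2)       # {0,1,1}: safe to reallocate
```

The {0,0,1} busy state from the text corresponds to a register whose producer has completed but which is still the live mapping of its logical register.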
Table 2.1: Actions depending on pipeline events
- An instruction I enters the rename stage and has physical register (p.r.) p as a source operand: RST[p].pending readers++.
- I enters the rename stage and reclaims a p.r. to map an output logical register l: find a free p.r. p and set RST[p] = {0,0,0} and RAT front[l] = p.
- I is issued and reads p.r. p: RST[p].pending readers--.
- I finishes execution, writing the result over p.r. p: RST[p].completed = 1.
- I exits the VB as validated, where l is the logical destination of I, p.r. p is the current mapping of l, and p.r. p' was the previous mapping of l: RST[p'].valid remapping = 1 and RAT ret[l] = p.

With this representation, a free physical register p can be easily identified: the corresponding entry in the RST contains the triplet {0,1,1}. A 0 in pending readers guarantees that no instruction already in the pipeline will read the contents of p. Next, a 1 in valid remapping implies that no new instruction will enter the pipeline (i.e., the rename stage) and read the contents of p, because p has been unmapped by a valid instruction. Finally, a 1 in the completed field denotes that no instruction in the pipeline is going to overwrite p. These conditions ensure that a specific physical register can be safely reallocated for a subsequent renaming. On the other hand, a triplet {0,0,1} denotes a busy register, with no pending readers, not unmapped by a valid instruction, and holding the result of a completed (and valid) operation. Table 2.1 shows different situations which illustrate the dynamic operation of the proposed register reclamation strategy in non-speculative mode, as well as the updating mechanism of the RST, RAT front and RAT ret tables.

2.2 Recovery Mechanism

The recovery mechanism always involves restoring both the RAT front and the RST tables.

Register Alias Table Recovery. Current microprocessors employ different methods to restore the renaming information when a mispeculation or exception occurs.
The method presented in this work uses the two renaming tables, RAT front and RAT ret, explained above, similarly to the Pentium 4 [9].
RAT ret contains a delayed copy of a validated RAT front. That is, it matches the RAT front table at the time the exiting (as valid) instruction was renamed. So, a simple method to implement the recovery mechanism (restoring the mapping to a precise state) is to wait until the offending instruction reaches the VB head, and then copy RAT ret into RAT front. Alternative implementations can be found in [4].

Register Status Table Recovery. The recovery mechanism must also undo the modifications performed by the cancelled instructions in any of the three fields of the RST. Concerning the valid remapping field, we describe two possible techniques to restore its values. The first technique squashes from the VB the entries corresponding to instructions younger than the offending instruction when the latter reaches the VB head. At that point, the RAT ret contains the physical register identifiers that we use to restore the correct mapping. The remaining physical registers must be freed. To this end, all valid remapping entries are initially set to 1 (a necessary condition to be freed). Then, the RAT ret is scanned looking for physical registers whose valid remapping entry must be reset. The second technique relies on the following observation: only the physical registers that were allocated (i.e., mapped to a logical register) by instructions younger than the offending one must be freed. Therefore, instead of squashing the VB contents when the offending instruction reaches the VB head, instructions in the VB are drained as they are cancelled. These instructions must set to 1 the valid remapping entry of their current mapping. Notice that in this case, the valid remapping flag is used to free the registers allocated by the current mapping, instead of the previous mapping as in normal operation. While the cancelled instructions are being drained, new instructions can enter the rename stage, provided that the RAT front has already been recovered.
Therefore, the VB draining can be overlapped with subsequent new processor operations. Regarding the pending readers field, it cannot simply be reset, as there can already be valid pending readers in the issue queue. Thus, each pending readers entry must be decremented by the number of cancelled pending readers for the corresponding physical register. To this end, the issue logic must be able to detect the instructions younger than the offending instruction, that is, the cancelled pending readers. This can be implemented by using a bitmap mask in the issue queue to identify which instructions are younger than a given branch [16]. The cancelled instructions must be drained from the issue queue to correctly handle (i.e., decrement) their pending readers entries. Notice that this logic can also be used to handle the completed field, by enabling a cancelled instruction to set the entry of its destination physical register. Alternatively, it is also possible to simply let the cancelled instructions execute to correctly handle the pending readers and completed fields.
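The first technique for restoring the valid remapping field can be sketched as follows (illustrative Python; the function and variable names are ours). All bits are first set to 1, and the bit is then cleared for every physical register still referenced by RAT ret, since those registers hold the precise architectural state and must not be freed.

```python
def recover_valid_remapping(num_phys_regs, rat_ret):
    """Recompute all valid_remapping bits after a squash: set every bit
    (the necessary condition for a register to be freed), then scan
    RAT_ret and clear the bit of each register that is still the precise
    architectural mapping of some logical register."""
    bits = [1] * num_phys_regs
    for phys in rat_ret.values():
        bits[phys] = 0              # precise mapping: must not be freed
    return bits

# Precise state maps r0 -> p3 and r1 -> p5 out of 6 physical registers.
bits = recover_valid_remapping(6, {"r0": 3, "r1": 5})
# bits == [1, 1, 1, 0, 1, 0]: p3 and p5 stay mapped, the rest may be freed
```

The second technique avoids this scan altogether by letting each drained, cancelled instruction set the bit of its own (current) mapping, at the cost of keeping the drain logic active.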
2.3 Working Example

Figure 2.2: Instruction status and epochs.

To illustrate different scenarios that could require triggering the recovery mechanism, as well as register reclamation handling, we use the example shown in Figure 2.2. This example shows a validation buffer with 12 instructions belonging to 3 different epochs. Instructions can be in one of the following states: dispatched but not issued, issued but not yet completed, and completed. Unlike in normal ROBs, no control information about these states is stored in the VB. Instead, the only information required is whether the epoch is validated or cancelled. In the example, assume that the three epochs have just been resolved: epochs 0 and 1 as validated and epoch 2 as cancelled. Thus, only instructions belonging to epochs 0 and 1 should be allowed to update the machine state. First, the instructions belonging to epoch 0 leave the VB. As this epoch has been validated, each instruction will update the RAT ret and set the valid remapping bit of the physical register previously mapped to its destination logical register. Since these instructions are completed, the VB is the last machine resource they consume. Then, the instructions of epoch 1 leave the VB: two completed, one issued and the other dispatched but not yet issued. The RAT ret table and the valid remapping bit are handled as above, no matter the instruction status. However, the non-completed instructions will remain in the pipeline until they complete. Finally, the instructions belonging to epoch 2 leave the VB as cancelled. These instructions must not update the RAT ret, but they must set the valid remapping bit of the physical register currently mapped to their destination logical register. In addition, the processor must obtain a correct RAT front, re-execute the epoch initiator (if needed), and fetch the correct instructions. To this end, the machine waits until the epoch initiator instruction that triggered the cancellation reaches the VB head.
Then, the processor recovers to a correct state by copying the RAT ret to the RAT front and resumes execution from the correct path. The RST state is recovered as explained in Section 2.2.
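The asymmetric handling of validated and cancelled instructions in this example can be condensed into one function (an illustrative sketch; the dictionary keys and names are invented for the example): a validated instruction publishes its mapping in RAT ret and frees its previous one, while a cancelled instruction leaves RAT ret untouched and frees its own, never-validated mapping.

```python
def release_mapping(entry, rat_ret, valid_remapping):
    """Model of how an instruction leaving the VB updates RAT_ret and the
    valid_remapping bits. 'entry' carries the destination logical
    register, its current and previous physical mappings, and whether
    its epoch was validated."""
    if entry["validated"]:
        rat_ret[entry["logical"]] = entry["current"]
        valid_remapping[entry["previous"]] = 1   # free the old mapping
    else:
        valid_remapping[entry["current"]] = 1    # free its own mapping

rat_ret, bits = {"r1": 0}, [0] * 4
# Validated (epochs 0/1 in the example): r1 moves from p0 to p2.
release_mapping({"logical": "r1", "current": 2, "previous": 0,
                 "validated": True}, rat_ret, bits)
# Cancelled (epoch 2): the mapping r1 -> p3 is undone, RAT_ret untouched.
release_mapping({"logical": "r1", "current": 3, "previous": 2,
                 "validated": False}, rat_ret, bits)
```

After both calls, RAT ret still records the precise mapping r1 to p2, while p0 and p3 are marked for release.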
2.4 Uniprocessor Memory Model

To correctly follow the uniprocessor memory model, it must be ensured that load instructions get the data produced by the newest previous store matching their memory address. A key component to improve performance in such a model is the load/store queue (LSQ). In the VB microarchitecture, as done in some current microprocessors, memory reference instructions are internally split by the hardware into two instructions when they are decoded and dispatched: the memory address calculation, which is considered an epoch initiator, and the memory operation itself. The former reserves a VB entry when it is dispatched, while the latter reserves an entry in the LSQ. A memory reference instruction frees its corresponding queue entry provided that it has been validated and completed. Additionally, a store instruction must be the oldest memory instruction in the pipeline. Load bypassing is the main technique applied to the LSQ to improve processor performance. This technique permits loads to execute early by overtaking previous stores in their access to the cache. Load bypassing can be performed speculatively by allowing loads to bypass previous stores in the LSQ even if some store addresses are still unresolved. As speculation may fail, processors must provide some mechanism to detect and recover from load mispeculation. For instance, loads issued speculatively can be placed in a special buffer called the finished load buffer [17]. The entry of this buffer is released when the load commits. On the other hand, when a store commits, a search is performed in this buffer looking for aliasing loads (note that all loads in the buffer are younger than the store). If aliasing is detected, when the mispeculated load commits, both the load and the subsequent instructions must be re-executed. A finished load buffer can be quite straightforwardly implemented in the VB, with no additional complexity.
In this case, a load instruction releases its entry in the finished load buffer when it leaves the VB. When a store leaves the VB, the addresses of all previous memory instructions, as well as its own address, have been resolved. Thus, it can already search the mentioned buffer looking for speculatively issued aliasing loads. As in a ROB-based implementation, all loads in the buffer are younger than the store. On a hit, the recovery mechanism is triggered as soon as the mispeculated load exits the VB. Notice that this implementation allows mispeculation to be detected early. Finally, store-load forwarding and load replay traps are also supported by the VB by using the same hardware available in current microprocessors.
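The aliasing check performed when a store leaves the VB reduces to a search of the finished load buffer. A minimal sketch follows (illustrative names; real hardware would perform an associative lookup rather than a sequential scan):

```python
def find_aliasing_loads(store_addr, finished_load_buffer):
    """Return the sequence numbers of speculatively issued loads whose
    address matches the store's. Every buffered load is younger than the
    store, so any match means the load read stale data and the recovery
    mechanism must be triggered when that load exits the VB."""
    return [seq for seq, addr in finished_load_buffer
            if addr == store_addr]

flb = [(12, 0x100), (14, 0x200)]        # (sequence number, address)
assert find_aliasing_loads(0x200, flb) == [14]   # load 14 mispeculated
assert find_aliasing_loads(0x300, flb) == []     # no aliasing, no replay
```

For simplicity the sketch compares whole addresses; a real LSQ would compare address ranges according to access size.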
Chapter 3
Multithreaded Validation Buffer (VB-MT) Microarchitecture

Superscalar processors effectively exploit the instruction level parallelism of a single thread. To this end, multiple instructions can be issued in an out-of-order fashion in the same cycle. Nevertheless, issue ports are usually wasted because of instruction dependencies (i.e., the available parallelism), thus adversely impacting performance. Two kinds of waste are distinguishable [10]: vertical waste, when no instruction is issued, and horizontal waste, when some instruction is issued without completely filling the issue width. Resource utilization can be improved by providing support for the execution of multiple threads, that is, by exploiting both instruction and thread level parallelism. There are three main multithreading models implemented in current processors: fine grain (FGMT), coarse grain (CGMT), and simultaneous multithreading (SMT). All of them reduce vertical waste, but only SMT reduces horizontal waste [10] by issuing instructions from multiple threads in the same cycle, achieving the best performance gains. Nevertheless, this is done at the expense of adding complexity to the issue logic, which is a critical point in current microprocessors. Multithreaded architectures represent an important segment of the industry. For instance, the Alpha 21464, the Intel Pentium 4 [9], the IBM Power 5 [11], the Sun Niagara [12], and the Intel Montecito [13] are commercial microprocessors included in this group. On the other hand, out-of-order commit prevents long latency operations (e.g., an L2 cache miss) from blocking the ROB when they reach its head; instead, long memory latencies are overlapped with the retirement of subsequent instructions that do not depend on the memory operation. Thus, these architectures mainly attack the vertical waste, although horizontal waste is indirectly improved as well.
In this chapter, we analyze in depth the impact of retiring instructions in an out-of-order fashion in the three main models of multithreading: FGMT, CGMT, and
SMT, using the Validation Buffer microarchitecture as the base out-of-order commit approach.
3.1 Multithreading support
As in any multithreading model, the Multithreaded Validation Buffer (VB-MT) microarchitecture gives the illusion of several logical processors, that is, several simultaneously active software contexts, one per hardware thread. Although most hardware structures can be shared or private among threads [6], the logical state, defined by a virtual memory image and a logical register file, must be maintained independently per thread. Concerning the virtual memory image, it makes sense for all threads to share a common physical address space. The MMU (Memory Management Unit), and the Translation Lookaside Buffers (TLBs) in particular, must be adapted to be indexed by pairs {thread, virtual address}, returning disjoint physical addresses to each thread. In the case of the register file, the associated physical structures can be either shared or private per thread, but the subsets of physical registers bound to specific threads must be disjoint from each other. This implies a renaming mechanism that allocates a new physical register from a pair {thread, logical register}, which straightforwardly translates into a private RAT (Register Alias Table) per thread. Regarding the register renaming strategy, its implementation on a multithreaded, out-of-order commit architecture is analogous to the single-threaded, out-of-order commit one. The main difference is the replication of the two renaming tables (RAT_front and RAT_ret) for each thread. The former is looked up and updated when an instruction is at the rename stage, while the latter is modified when an instruction leaves the VB, providing a delayed copy of RAT_front. Depending on the thread to which an instruction belongs, it looks up or updates the associated RAT tables.
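The per-thread renaming just described can be sketched as follows (a minimal illustration; the class and field names are hypothetical, and only the front-end RAT is modelled):

```python
# Minimal sketch of per-thread register renaming: each thread owns a
# private front-end RAT, while physical registers are allocated from a
# common free pool, so the per-thread register subsets stay disjoint
# by construction.

class Renamer:
    def __init__(self, num_threads, num_phys_regs):
        self.free = list(range(num_phys_regs))                 # shared free pool
        self.rat_front = [dict() for _ in range(num_threads)]  # one RAT per thread

    def rename_dest(self, thread, logical_reg):
        phys = self.free.pop(0)                     # allocate a physical register
        self.rat_front[thread][logical_reg] = phys  # update this thread's RAT only
        return phys

    def rename_src(self, thread, logical_reg):
        return self.rat_front[thread][logical_reg]  # look up this thread's RAT

r = Renamer(num_threads=2, num_phys_regs=8)
p0 = r.rename_dest(0, "r1")   # thread 0 maps its logical r1
p1 = r.rename_dest(1, "r1")   # thread 1 maps its own r1 to a different register
```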
3.2 Resource sharing
A multithreaded design offers the view of multiple logical processors on a single chip. A straightforward way of implementing such a system consists of replicating all hardware structures per thread, while maintaining a common pool of functional units. At the opposite extreme, hardware structures can be shared among threads, using policies that dynamically allocate resource slots. Between these limits, there is a gradient of solutions to be explored in order to find the
optimal sharing strategy of resources.
Figure 3.1: VB-MT architecture diagram with all storage resources shared among threads.
Processor resources can be classified into storage resources (ROB, IQ, LSQ...) and bandwidth resources (fetch stage, issue logic...). As pointed out in [18], sharing storage resources can result in performance degradation if no appropriate resource allocation policy is used. The reason is that over-allocations to stalled threads can starve active ones, and wrong allocation decisions can affect several future cycles. In contrast, unwise allocation decisions involving bandwidth resources can be immediately compensated in the next cycle. On the other hand, the high demand variability of bandwidth resources (in contrast to storage resources) makes a shared design the most performance-efficient solution. The issue stage is the main bandwidth resource affected by this fact, which has caused SMT (Simultaneous Multithreading) to stand out from other multithreading designs. Related experimental results are shown in Chapter 5. In this section, we evaluate the effect of sharing the main storage resources, so that we can compare it with the results of previous works and check whether they can be extrapolated to VB-MT. Figure 3.1 shows a general block diagram of the VB-MT architecture, where all resources with a variable sharing strategy are drawn in gray. Additionally, Figure 3.2 shows the impact of varying the sharing strategy of some of them. In all cases, an SMT processor with a round-robin fetch policy is implemented, with all resources private except the one labelled on the X-axis. IPCs of the ROB/VB architectures with all resources set as private are tagged as ROB/VB Baselines. Shared resources are often more costly than private ones, as the allocation/deallocation of resource entries may become more complex.
Moreover, shared resources require more read/write ports to enable parallel access by multiple threads, which may increase both their area and latency. The advantage of shared resources, as one would expect, lies in the fact that a single thread can compete for all their entries.
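The round-robin fetch policy used in these experiments can be sketched as follows (a minimal illustration; representing stalled threads as a boolean list is our assumption):

```python
# Minimal sketch of a round-robin fetch policy: each cycle, fetch moves
# to the next thread after the last one fetched, skipping stalled threads.

def round_robin_fetch(threads_stalled, last_fetched):
    """Return the next non-stalled thread after last_fetched, or None."""
    n = len(threads_stalled)
    for i in range(1, n + 1):
        t = (last_fetched + i) % n
        if not threads_stalled[t]:
            return t
    return None

# With thread 1 stalled, fetch alternates between threads 0 and 2.
```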
Figure 3.2: Impact of resource sharing for ROB/VB architectures.
Nevertheless, Figure 3.2 reveals that the conclusions of [18] remain valid in a VB-MT architecture: there is a performance loss when sharing any storage resource without an optimized allocation policy. The reason is that slow or stalled threads can monopolize shared resources, holding their entries for a long time and preventing other threads from using them fruitfully. The only component that shows no performance degradation when shared is the ROB/VB (not shown in Figure 3.2; see below). The negative effects that stalled threads have on shared components can be mitigated by applying a resource allocation policy that detects such situations (e.g., DCRA [19], evaluated later). Consequently, we will use a baseline processor with all resources configured as private among threads, except in those experiments that implement DCRA, where all resources will be shared among threads using this policy. The ROB/VB is among the resources that can be shared in an MT design. In this case, we can cite two possible mechanisms to handle the ROB entries assigned to threads. One approach consists of assigning disjoint ROB portions to threads [20], whose size can vary depending on the threads' demand; each portion is treated as an independent ROB. As an opposite alternative, instructions can be inserted in a FIFO manner into the shared ROB, as if all of them belonged to the same thread, and hence retired in the same order. Both approaches can obtain the advantages of a shared storage resource (also applying some resource allocation policy if necessary), but each one suffers from certain drawbacks. In the former case, ROB portions cannot grow arbitrarily to fill the whole ROB; a portion's size is constrained by the position of the contiguous
portions, and ROB portions can only be shifted when the associated head and tail pointers are properly aligned. In the latter case, instructions from different threads are intermingled across the ROB, so non-completed instructions from one thread may prevent completed instructions from another from exiting the ROB. Moreover, the recovery process should cancel instructions selectively, forcing interleaved gaps to remain in the ROB until they are retired. A deeper study of the effects of the ROB/VB sharing strategies is planned as future work, so we assume private ROBs/VBs for all experiments in this work.
3.3 Adapting SMT Techniques to VB-MT
In this section, we discuss the adaptation of techniques previously proposed for SMT processors to the VB architecture. These techniques try to exploit the SMT potential by keeping functional unit utilization as high as possible. The techniques studied in this section are Instruction Count (ICOUNT), Predictive Data Gating (PDG) and Dynamically Controlled Resource Allocation (DCRA). Their impact on the VB architecture, compared with the ROB, will be evaluated in Chapter 5.
3.3.1 Instruction Count (ICOUNT)
ICOUNT is a fetch policy proposed by Tullsen et al. [21], which assigns fetch priority to those threads with fewer instructions in the decode, rename and issue stages. The designation ICOUNT.n_t.n_i means that at most n_t threads can be handled at the fetch stage in the same cycle, taking at most n_i instructions from each one. In the experiments shown in Chapter 5 we used an ICOUNT.2.8 configuration, which provides the best results for the ROB-SMT architecture.
3.3.2 Predictive Data Gating (PDG)
PDG was proposed by El-Moursy et al. [22], and can also be classified as a fetch policy. This technique is aimed at avoiding a long residence of non-ready instructions in the issue queue (IQ) due to dependences on long-latency instructions, long data dependence chains, or contention for functional units or the cache.
To detect long-latency events early, a load miss predictor is placed in the processor front-end, and a per-thread counter tracks the estimated number of pending cache misses. The values of these counters are used to stall a thread's fetch when a threshold is exceeded. For our experiments, a load miss predictor of 4K entries with 2-bit saturating counters is assumed, as suggested in [22]. The predictor is indexed by the PC
of the load instruction, and the prediction is given by the most significant bit of the saturating counter. Whenever a load misses the data cache, the corresponding counter is reset, while it is incremented when the associated load hits the cache.
3.3.3 Dynamically Controlled Resource Allocation (DCRA)
DCRA was proposed by Cazorla et al. [19] and can be classified as a resource allocation policy. Like ICOUNT and PDG, it tries to increase functional unit utilization by preventing stalled threads from occupying resources that could be used by other threads with instructions ready to execute. DCRA controls not only the fetch bandwidth granted to each thread, but also the allocation of shared processor resources. However, the nomenclature has been relaxed in previous works [22], which consider this approach an instruction fetch policy as well. DCRA classifies threads according to two criteria. A thread is slow or fast depending on whether or not it has pending L1 cache misses. At the same time, a thread can be active or inactive for a given resource R depending on whether it has recently demanded that resource. A thread is considered active for resource R during a predefined number Y of cycles after the cycle in which it demanded an entry of R (e.g., Y = 256). The authors propose a mathematical formula that limits the number of entries of a shared resource that a thread can allocate depending on this classification. The shared resources can be the instruction queue, load-store queue, physical register file, etc., and the additional hardware consists of resource occupancy counters and a lookup table that implements the computation of the resource entry limits using the aforementioned formula.
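As an illustration, the PDG load miss predictor described in Section 3.3.2 can be sketched as follows (a minimal model; the exact indexing function and the polarity of the prediction bit are our reading of the description, not taken from [22]):

```python
# Sketch of a PDG-style load miss predictor: a 4K-entry table of 2-bit
# saturating counters indexed by the load PC. The counter is reset on a
# cache miss and incremented (saturating at 3) on a hit, so a set MSB
# indicates a run of recent hits; here we read MSB = 1 as "predict hit".

TABLE_SIZE = 4096  # 4K entries, as assumed in the experiments

class LoadMissPredictor:
    def __init__(self):
        self.counters = [0] * TABLE_SIZE

    def _index(self, pc):
        return (pc >> 2) % TABLE_SIZE   # word-aligned PCs; an assumption

    def predict_hit(self, pc):
        return self.counters[self._index(pc)] >= 2   # MSB of the 2-bit counter

    def update(self, pc, hit):
        i = self._index(pc)
        self.counters[i] = min(self.counters[i] + 1, 3) if hit else 0

p = LoadMissPredictor()
p.update(0x400100, hit=True)
p.update(0x400100, hit=True)   # counter reaches 2, MSB becomes set
```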
Chapter 4 The Simulation Framework Multi2Sim
This chapter describes Multi2Sim [23], the simulation framework that has been used to model the architectural designs proposed in this work. Multi2Sim integrates a model of processor cores, memory hierarchy and interconnection network in a tool that enables their evaluation. The simulator has been extended to model the VB microarchitecture, both for monothreaded and multithreaded environments, including all instruction fetch policies cited in Chapter 3.
4.1 Basic simulator description
Multi2Sim has been developed by integrating some significant characteristics of popular simulators, such as separate functional and timing simulation, SMT and multiprocessor support, and cache coherence. Multi2Sim is an application-only tool intended to simulate final MIPS32 executable files. With a MIPS32 cross-compiler (or a MIPS32 machine), one can compile their own program sources and test them under Multi2Sim. This section deals with the process of starting and running an application in a cross-platform environment, and briefly describes the three implemented simulation techniques (functional, detailed and event-driven simulation).
4.1.1 Program Loading
Program loading is the process in which an executable file is mapped into different virtual memory regions of a new software context, and its register file and stack are initialized to start execution. In a real machine, the operating system is in charge of
these actions, but an application-only tool must manage program loading during its initialization.
Executable File Loading. The executable files output by gcc follow the ELF (Executable and Linkable Format) specification. An ELF file is made up of a header and a set of sections. Some Linux distributions include the library libbfd, which provides types and functions to list the sections of an ELF file and query their main attributes (starting address, size, flags and content). When the flags of an ELF section indicate that it is loadable, its contents are copied into memory starting at the corresponding address.
Program Stack. The next step of the program loading process is to initialize the process stack. The aim of the program stack is to store function local variables and parameters. During program execution, the stack pointer ($sp register) is managed by the program code itself. However, when the program starts, it expects some data on the stack, namely the program arguments and environment variables, which must be placed there by the program loader.
Register File. The last step is the register file initialization. This includes the $sp register, which has been progressively updated during the stack initialization, and the PC and NPC registers. The initial value of the PC register is specified in the ELF header of the executable file as the program entry point. The NPC register is not explicitly defined in the MIPS32 architecture, but it is used internally by the simulator to handle the branch delay slot.
4.1.2 Simulation Model
Multi2Sim uses three different simulation models, embodied in different modules: a functional simulation engine, a detailed simulator and an event-driven module; the latter two perform the timing simulation. To describe them, the term context will be used hereafter to denote a software entity, defined by the status of a virtual memory image and a logical register file.
In contrast, the term thread will refer to a processor hardware entity comprising a physical register file, a set of physical memory pages, a set of entries in the pipeline queues, etc. The three main simulation techniques are described next.
Functional Simulation. Also called the simulator kernel, it is built as an autonomous library and provides an interface to the rest of the simulator. This engine is unaware of hardware threads, and provides functions to create/destroy software contexts, perform program loading, enumerate existing contexts, query their status, execute machine instructions and handle speculative execution. The supported machine instructions follow the MIPS32 specification [24] [25]. This choice was basically motivated by its fixed instruction size and formats, which enable simple instruction decoding. An important feature of the simulation kernel, inherited from SimpleScalar
26 [26], is the checkpointing capability of the implemented memory module and register file, thinking of an external module that needs to implement speculative execution. In this sense, when a wrong execution path starts, both the register file and memory status are saved, reloading them on the misprediction detection. Detailed Simulation. The Multi2Sim detailed simulator uses the functional engine to perform a timing-first [27] simulation: in each cycle, a sequence of calls to the kernel updates the state of existing contexts. The detailed simulator analyzes the nature of the recently executed machine instructions and accounts the operation latencies incurred by hardware structures. The main simulated hardware consists of pipeline structures (stage resources, instruction queue, load-store queue, reorder buffer...), branch predictor (modelling a combined bimodal-gshare predictor), cache memories (with variable size, associativity and replacement policy), memory management unit, and segmented functional units of configurable latency. Event-Driven Simulation. In a scheme where functional and detailed simulation are independent, the implementation of the machine instructions behaviour can be centralized in a single file (functional simulation), increasing the simulator modularity. In this sense, function calls that activate hardware components (detailed simulation) have an interface that returns the latency required to complete their access. Nevertheless, this latency is not a deterministic value in some situations, so it cannot be calculated when the function call is performed. Instead, it must be simulated cycle by cycle. This is the case of interconnects and caches, where an access can result in a message transfer, whose delay cannot be computed a priori, justifying the need of an independent event-driven simulation engine. 
4.2 Support for Multithreaded and Multicore Architectures This section describes the basic simulator features that provide support for multithreaded and multicore processor modelling. They can be classified into two main groups: those that affect the functional simulation engine (enabling the execution of parallel workloads) and those that involve the detailed simulation module (enabling pipelines with several hardware threads on the one hand, and systems with several cores on the other).
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationSuperscalar Processors
Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input
More informationControl Hazards. Branch Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationControl Hazards. Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More information5008: Computer Architecture
5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More information15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011
5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer
More informationComputer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović
Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationFall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012
18-742 Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 Past Due: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi
More informationEN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design
EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown
More informationA Mechanism for Verifying Data Speculation
A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationChapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)
Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More informationLecture 11: Out-of-order Processors. Topics: more ooo design details, timing, load-store queue
Lecture 11: Out-of-order Processors Topics: more ooo design details, timing, load-store queue 1 Problem 0 Show the renamed version of the following code: Assume that you have 36 physical registers and
More informationComputer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013
18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed
More informationUG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects
Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer
More informationSpeculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationReconfigurable and Self-optimizing Multicore Architectures. Presented by: Naveen Sundarraj
Reconfigurable and Self-optimizing Multicore Architectures Presented by: Naveen Sundarraj 1 11/9/2012 OUTLINE Introduction Motivation Reconfiguration Performance evaluation Reconfiguration Self-optimization
More informationLIMITS OF ILP. B649 Parallel Architectures and Programming
LIMITS OF ILP B649 Parallel Architectures and Programming A Perfect Processor Register renaming infinite number of registers hence, avoids all WAW and WAR hazards Branch prediction perfect prediction Jump
More informationReal Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More information15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project
More informationUNIT I (Two Marks Questions & Answers)
UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-
More informationSuperscalar Processors Ch 14
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More informationSuperscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More informationArchitectures for Instruction-Level Parallelism
Low Power VLSI System Design Lecture : Low Power Microprocessor Design Prof. R. Iris Bahar October 0, 07 The HW/SW Interface Seminar Series Jointly sponsored by Engineering and Computer Science Hardware-Software
More informationDonn Morrison Department of Computer Science. TDT4255 ILP and speculation
TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls
More informationAR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More informationCS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines
CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per
More informationLimitations of Scalar Pipelines
Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC = 1 Inefficient unified pipeline
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of
More informationHigh Performance Memory Requests Scheduling Technique for Multicore Processors
High Performance Memory Requests Scheduling Technique for Multicore Processors by Walid Ahmed Mohamed El-Reedy A Thesis Submitted to the Faculty of Engineering at Cairo University In Partial Fulfillment
More informationLecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ
Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB)
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are
More information