Improving Value Prediction by Exploiting Both Operand and Output Value Locality


Jian Huang and Youngsoo Choi
Department of Computer Science and Engineering
Minnesota Supercomputing Institute
University of Minnesota, Minneapolis, MN

David J. Lilja
Department of Electrical and Computer Engineering
Minnesota Supercomputing Institute
University of Minnesota, Minneapolis, MN

Abstract

Several studies have determined that the values produced by the execution of a program's instructions are often quite repetitive. There are basically two approaches that have been proposed for exploiting this instruction output value locality: 1) reusing the results of a prior execution of an instruction, and 2) predicting the value that will be produced by the current execution of an instruction based on the previous values it has produced. The result cache [17], on the other hand, exploits operand value locality by reusing the values produced by any instruction of the same type that has been executed previously with the same input operands. However, due to the non-speculative nature of the result cache, it is effective only for long-latency operations. In this paper, we extend the reuse-based result cache to support speculative execution using value prediction. We call this new scheme the Speculative Result Cache (SRC). Our simulations using the SimpleScalar Tool Set show that value prediction with the SRC alone can produce 94%-98% raw prediction accuracies and speedups of 2%-7% for the tested SPEC95 integer benchmark programs in a 16-issue superscalar processor. Our proposed Combined Dynamic Predictor (CDP) couples the SRC with the hybrid predictor [21] and dynamically selects between the two component predictors. This approach outperforms the SRC and the hybrid predictor when they are used separately, achieving higher effective prediction accuracies and speedups of up to 10%.
Keywords: speculative result cache, instruction reuse, value locality, value prediction, combined dynamic predictor

1 Introduction

Computer architects have long exploited the typical behavior of application programs to improve performance. For instance, the assumption that future memory access patterns can be predicted by observing the patterns in the previous memory addresses referenced by a program has led to mechanisms such as caches, prefetching, and vector instructions. Architects speak of the spatial and temporal locality of memory references to describe the behavior, observed in programs, that future memory accesses are likely to be close in space and time to other recent references. Other types of repetitive program behavior have recently been further exploited by noticing that programs exhibit not only locality of memory addresses

referenced, but also locality in the output values produced by individual instructions. This instruction output value locality suggests that a single instruction often produces the same value, or a predictable sequence of values, each time it is executed [2, 8, 20]. There are essentially two approaches that have been proposed for exploiting this program behavior: 1) reusing the results of a prior execution of an instruction, and 2) predicting the value that will be produced by the current execution of an instruction based on the previous values it has produced. With instruction reuse [15], the input operand values of an instruction and the corresponding output result are saved in a reuse buffer, which is indexed by the instruction's address. The next time the instruction is executed, this buffer is queried to see if the current operand values match previously seen values. If they do, the previously stored output value can be used immediately without needing to re-execute the instruction. Simulations with the SPEC benchmarks have shown that this approach can improve performance by 1-5 percent [16]. This idea can be extended to larger granularities to reuse entire sequences of instructions, such as basic blocks [6]. Alternatively, the result cache proposed by Richardson [17] indexes the reuse buffer with a function that hashes the instructions' operands instead of the instructions' addresses. This indexing scheme allows the result cache to reuse the values produced by any previously executed instructions of the same type. With these value reuse approaches, the output values stored in the reuse buffer are known to be the correct values since the input operands are matched exactly. In contrast, value prediction [2, 8, 16] allows an instruction to begin executing before its input operands are available by guessing what the operand values will be.
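The instruction-reuse mechanism described above can be sketched in a few lines of C. This is a minimal illustration, not the paper's implementation: the buffer size, the PC hash, and the two-operand entry layout are all assumptions made for the example.

```c
/* Sketch of a direct-mapped reuse buffer indexed by instruction address.
 * A lookup is a hit only when the stored operand values match exactly,
 * so a reused result is always correct (non-speculative). */
#include <stdint.h>

#define RB_ENTRIES 1024  /* illustrative size */

typedef struct {
    uint32_t pc;        /* address of the instruction that filled this entry */
    uint32_t op1, op2;  /* operand values seen on that execution */
    uint32_t result;    /* result produced by that execution */
    int      valid;
} ReuseEntry;

static ReuseEntry rb[RB_ENTRIES];

/* Returns 1 (and sets *result) when the current operands match the
 * stored ones, so the stored result can be reused without re-executing. */
int reuse_lookup(uint32_t pc, uint32_t op1, uint32_t op2, uint32_t *result) {
    ReuseEntry *e = &rb[(pc >> 2) % RB_ENTRIES];
    if (e->valid && e->pc == pc && e->op1 == op1 && e->op2 == op2) {
        *result = e->result;
        return 1;
    }
    return 0;
}

/* Record the operands and result of a completed execution. */
void reuse_update(uint32_t pc, uint32_t op1, uint32_t op2, uint32_t result) {
    ReuseEntry *e = &rb[(pc >> 2) % RB_ENTRIES];
    e->pc = pc; e->op1 = op1; e->op2 = op2;
    e->result = result; e->valid = 1;
}
```

Note that the hit condition requires an exact operand match, which is why reuse needs no verification step, in contrast to the speculative prediction schemes discussed next.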
This speculative instruction execution effectively breaks flow dependences between instructions, although it does require a subsequent check to verify that the values were predicted correctly. If they were not, the program execution must be restored to the appropriate state before the speculation, and the instructions must then be re-executed with the correct values. Due to the potentially high cost of wrong predictions, several techniques have been proposed to improve value prediction accuracy. These include a last-value predictor [9], a stride-based predictor [21], a two-level predictor [9, 21], a hybrid predictor [21], and context-based predictors [10, 13, 14]. To be effective, value reuse mechanisms must store previous operand and result values for as many instructions as possible. Similarly, since value prediction mechanisms attempt to predict the next value that will be produced by an instruction based on the previous values already generated, they need to cache as large a history of previously seen values for each instruction as possible. Thus, both of these approaches require large hardware tables to be implemented directly on the processor die. For instance, each entry in the hybrid predictor needs at least 16 bytes to store the last four unique values produced previously by the same instruction [21]. Additionally, these tables must be quick to query to provide any performance benefit. Our experiments with the programs in the SPECint95 benchmark suite show, however, that all of the instructions executed within a single application have a relatively small overall result space. That is, while the potential range of values that could be produced is enormous, the actual range of values tends to be much more constrained. Furthermore, different instructions of the same type often produce the same values, since the operand values of instructions are frequently repeated.
The repeated operand values could come from previous executions of the same instruction, or from previous executions of other instructions of the same type. We say that a program exhibits operand value locality when an instruction is executed using the same operand values that have been used by other instructions of the same type.

If we can save the result values corresponding to the operands, the execution of this instruction can be skipped. The previously proposed result cache [17] is one example of this approach. The result cache, however, requires that the operand values used to index the buffer be the same as the actual values that will be operated on. This means that the earliest the result cache can be accessed is right before instructions are actually issued. Hence, this scheme can be applied only to long-latency operations [11, 17], such as those typically seen in floating-point programs. In this paper, we extend the reuse-based result cache to a Speculative Result Cache (SRC) to support data value prediction for integer programs by exploiting operand value locality. Since most integer operations require only a single cycle in a pipelined superscalar processor, waiting for the correct operand values to access the result cache would not bring any performance benefit. Tullsen et al. [19] showed that the current value in the register file for an operand register may stay the same for a while even though new values will be produced. Hence, rather than waiting to index the result buffer with the instructions' exact operand values, as was done in the result cache, or indexing with the instructions' addresses, as is done with other predictors [9, 10, 13, 21], the SRC is indexed using the currently available values in the corresponding operand registers. Instead of using the hardware to detect value patterns dynamically, the SRC simply reuses the values produced previously by this or any other instruction of the same type. Previously proposed value predictors [9, 10, 13, 21] focus on exploiting the output value locality of individual instructions, while the SRC focuses on the operand value locality produced by all instructions of the same type.
To simultaneously exploit both kinds of value locality, we propose the Combined Dynamic Predictor (CDP), which dynamically chooses between the values predicted by a predictor that exploits operand value locality (the SRC) and a predictor that exploits instruction output value locality (the hybrid predictor [21]). (We choose the hybrid predictor as being representative of the class of predictors that use instruction output value locality due to its demonstrated high prediction accuracy [3, 13, 21].) Our simulations show that the CDP performs better than either of the two component predictors applied alone, with prediction accuracies and speedups of up to 96% and 10%, respectively. Since each SRC entry is much narrower than the entries in the hybrid predictor, the CDP may also be more efficient in its use of die space. In the remainder of the paper, Section 2 summarizes related work. Section 3 describes the SRC, while Section 4 studies the Combined Dynamic Predictor. Section 5 evaluates the performance potential of these new predictors and Section 6 then summarizes and concludes the paper.

2 Related Work

Sodani et al. [15] investigated the potential of instruction reuse with a reuse buffer that stored the inputs and outputs of each executed instruction. The goal was to make use of the results of instructions that were executed but not actually used due to branch misprediction, as well as to reuse the results of the same instruction from previous executions. They showed relatively moderate speedups of 1-5 percent for the SPECint benchmark programs [16]. The primary advantage of this scheme is that the results do not need to be verified, since it is strict reuse with no speculation at all. This instruction reuse technique has been extended to reusing the results of an entire sequence of instructions within a basic block [6]. This approach detects a dependence chain at runtime and compares

all of the inputs to an entire basic block instead of only a single instruction. This technique produces slightly greater performance improvements than the simple instruction reuse scheme due to the larger amount of time that can be saved when the execution of all of the instructions within a basic block can be skipped by reusing the block's previously stored output values. Lipasti et al. [8] investigated the potential of predicting the values returned by load instructions to improve instruction-level parallelism. The performance benefits of this scheme in a simulation of the DEC and PowerPC 620 processors were also moderate, with speedups in the range of 4 to 12 percent with limited hardware. This basic scheme has been improved to achieve better prediction accuracies [10, 13, 21], although it has been suggested that, due to the cost of a misprediction, the improvement in instruction-level parallelism that can be achieved through value prediction mechanisms is limited [4]. In addition, since an instruction typically has two inputs, obtaining a higher prediction rate on only one input operand but not on the other will further limit the performance benefit. Harbison [5] used a mechanism called the value cache to store the result of an instruction as well as some dependence information and the execution phase. This cache was also indexed by the instruction address. With this mechanism, hardware-level common subexpression elimination was achieved to produce speedups of about 2.0 on a stack machine. Richardson [17] proposed a scheme targeting floating-point instructions. He suggested a tabulation method to store the return values of function calls with no side effects using software, and to store long-latency operation results in a device called a result cache. The result cache was indexed by exclusive-ORing some of the most significant bits of an instruction's two operands.
This mechanism was able to reduce the execution time of some SPEC92 floating-point programs by 15 to 44 percent. However, this result cache requires that the operands used to produce the index be the correct values in program order. Due to its non-speculative nature, the result cache can only reuse results produced by long-latency operations, such as floating-point division and multiplication. Tullsen and Seng [19] proposed a scheme that saves value buffer space by using the values already available in the register file. They found that at least 75% of the time, the value loaded from memory is either already in the register file or was recently there. In this scheme, the instruction uses the old value in the destination (architectural) register as a prediction for the new value produced. Compiler support is used to increase the prediction opportunities for this scheme. They showed speedups of up to 11% for the SPEC95 integer programs, and up to 13% for the SPEC95 floating-point programs.

3 Speculative Result Cache

There are several approaches that can be used to predict the result of an instruction before it is executed:

1. If the operands are available, and the values of these operands have been used previously by this instruction, we can simply reuse the previously stored result.

2. If the operands are available, but the values of these operands have not been used previously by this instruction, we cannot simply reuse the previously stored result. However, if the values of these operands have been used by another instruction of the same type, and we have previously saved the result produced by this other instruction, we can predict that the result of the current instruction will be the same as that of this previously executed instruction.

3. If the operands are not available, but the result produced by this instruction follows a regular pattern, we can predict the result by following this regularity.

The first case is handled by the instruction reuse [15] and value cache [5] mechanisms, while the existing predictors [9, 10, 13, 14, 21] handle at least part of the third case. The result cache and our proposed SRC are both targeted towards the second case. For example, in Figure 1, arrays A, B, and C are all defined to have 100 elements. The index variables used to operate on each element of these arrays (i1 and i2) are incremented from 0 to 99 in two different loops. The operations used to increment i1 and i2 are both addu instructions. Thus, the results produced by updating i1 can be used to predict the results that will be produced by the addu instruction that updates i2. Additionally, the instructions used to determine the end of the loop (slt) in the first loop can also be used to predict the outcome of the slt instruction used in the second loop. This similar behavior among different instruction instances allows us to resolve the branch instructions (bne) earlier than would otherwise be possible.

3.1 Basic Operation of the Speculative Result Cache

The SRC buffers the result value produced by each instruction that may need predicting in the future in an entry of the SRC indexed by hashing the values of this instruction's input operands. When a decoded instruction cannot be issued because the source instructions that produce its operands have not yet completed, the processor can predict the values that will eventually be produced by the source instructions using the SRC. To perform this prediction, the processor indexes the SRC using the currently available values of the operands of the source instructions. If predictable values are found in the SRC, the instruction can be speculatively issued for execution using these predicted values for its input operands.
Figure 2 provides an example of how the SRC is used. The instructions in this figure are assumed to be in MIPS-IV format, so that the first register specified is the destination register, and the other two registers are the source operands. Instruction 1 is a load instruction that has been issued but not completed. Instructions 2, 3, 4, and 5 are waiting for their operands to be ready. The arrows show the dependences between instructions. We could speculatively issue both instructions 2 and 3 if the result cache query for instruction 1 produces a hit. However, if the ultimately loaded value is found to be different from the one in the result cache, the speculation would fail. Instruction 4 depends on the results of instructions 2 and 3. After decoding, we know the register numbers for the operands of instructions 2 and 3, and we could use the currently available values in the register file for the corresponding operands to query the result cache and proceed with issuing instruction 4. By doing so, we are betting that the operands used by instructions 2 and 3 will not be altered by instruction 1. The speculation of instruction 5 can be done similarly. Hence, if the processor's issue width allows, all 5 instructions could be issued concurrently. When instruction 1 is executed a second time, the address calculated by adding register sp (stack pointer) and the value 4 could be different from its previous executions. However, if other load instructions have previously read the value corresponding to the new address, the SRC mechanism allows us to effectively exploit this operand value locality globally.

3.2 An Implementation of the Speculative Result Cache

Instructions in the execution window are decoded before being issued, which allows dependence chains to be constructed before the instructions reach the issue stage. This then allows the SRC to be queried

/* C source code */
int i1, i2, A[100], B[100], C[100];

for (i1=0; i1 < 100; i1++)
    A[i1] = A[i1]+C[i1]+B[i1];
for (i2=0; i2 < 100; i2++)
    B[i2] = B[i2]+A[i2];

/* MIPS-IV Assembly Code for the above loops */
        move $6,$0
        addu $5,$sp,16
LOOP1:  lw   $2,0($5)
        lw   $3,400($5)
        lw   $4,800($5)
        addu $6,$6,1      # update the index
        addu $2,$2,$3
        addu $2,$2,$4
        sw   $2,0($5)     # save the result
        addu $5,$5,4      # compute the next address
        slt  $2,$6,99     # compare index to loop-bound
        bne  $2,$0,LOOP1

        move $5,$0
        addu $4,$sp,16
LOOP2:  lw   $2,800($4)
        lw   $3,0($4)
        addu $5,$5,1      # update the index
        addu $2,$2,$3
        sw   $2,800($4)   # save the computed result
        addu $4,$4,4      # compute the next address
        slt  $2,$5,99     # compare index to loop-bound
        bne  $2,$0,LOOP2

Figure 1: An example code segment that could utilize the Speculative Result Cache.

in the instruction decode stage or, possibly, in the dispatch stage.

    5   ld   $5, 16($6)    not issued
    4   sub  $8, $6, $4    not issued
    3   muli $6, $4, 4     not issued
    2   addi $5, $4, 8     not issued
    1   ld   $4, 16($sp)   issued
    (instructions listed in reverse program order; instruction 1 is earliest)

Figure 2: The inputs to an instruction that is blocked in the issue stage can be predicted by querying the Speculative Result Cache using the input operands of the previous instructions on which the blocked instruction is dependent.

As shown in Figure 3, in addition to storing a result value, each entry in the SRC contains: 1) a tag field that contains the lower 16 bits of each of the operand values that produced this result; 2) the corresponding result itself; 3) a valid bit; and 4) a 4-bit confidence counter. We used only the lower 16 bits of both operands as the tag, since the SRC is speculative in nature and most repeated operands have fewer than 16 significant bits. The size of each entry is about 9 bytes, while a typical entry in a two-level predictor has more than 18 bytes [21]. If the valid bit is not set, the SRC signals the issue stage not to predict the output value, since the value stored in the SRC is invalid. This situation occurs when the hashed operand values have not been previously seen. The counter field determines the confidence of the prediction. If the value of the counter is below some threshold, the stored value is expected to be a poor prediction. This counter is incremented by a predetermined amount when an access of the SRC produces a successful prediction, and is decremented otherwise. By the time an instruction reaches the issue stage, the result of the SRC query will be available. If this instruction cannot be issued due to data dependences, the issue logic can choose either to issue the instruction speculatively with the predicted operand values from the SRC, or to wait for the actual operand values to become available.
The second choice forces the existing dependence to be completely resolved, and so eliminates the chance to increase the instruction issue rate. However, it also eliminates the chance of a potentially expensive misprediction. Figure 4 depicts a possible implementation of a superscalar processor with an SRC. Since different types of instructions will produce different output results when given the same input operand values, we partition the SRC into several different sections by using an op-type field (see Figure 3). Each section corresponds to a single instruction type. In the following experiments, we base this partitioning on the most common types of instructions that occur in the SPEC95 integer benchmarks when
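One section of the SRC, with the fields described above, can be sketched as follows. This is a sketch under stated assumptions, not the paper's hardware design: the XOR index, 16-bit per-operand tags, and 4-bit counter come from the text, while the confidence threshold, counter step, and fill policy are illustrative choices.

```c
/* Sketch of one op-type section of the Speculative Result Cache.
 * Index: XOR of the lower n bits of the two operand values.
 * Tag: lower 16 bits of each operand. A 4-bit saturating confidence
 * counter gates predictions. Threshold/step values are assumptions. */
#include <stdint.h>

#define SRC_INDEX_BITS 12                  /* n = 12 -> 4K entries per section */
#define SRC_ENTRIES    (1u << SRC_INDEX_BITS)
#define CONF_MAX       15                  /* 4-bit saturating counter */
#define CONF_THRESHOLD 8                   /* assumed threshold */

typedef struct {
    uint16_t tag1, tag2;   /* lower 16 bits of each operand value */
    uint32_t result;
    uint8_t  conf;         /* 4-bit confidence counter */
    uint8_t  valid;
} SrcEntry;

static SrcEntry src[SRC_ENTRIES];          /* one op-type section */

static uint32_t src_index(uint32_t op1, uint32_t op2) {
    return (op1 ^ op2) & (SRC_ENTRIES - 1);
}

/* Predict only on a valid, tag-matching, sufficiently confident entry. */
int src_predict(uint32_t op1, uint32_t op2, uint32_t *pred) {
    SrcEntry *e = &src[src_index(op1, op2)];
    if (e->valid && e->tag1 == (uint16_t)op1 && e->tag2 == (uint16_t)op2 &&
        e->conf >= CONF_THRESHOLD) {
        *pred = e->result;
        return 1;
    }
    return 0;
}

/* After the instruction actually executes, train the entry. */
void src_update(uint32_t op1, uint32_t op2, uint32_t result) {
    SrcEntry *e = &src[src_index(op1, op2)];
    if (e->valid && e->tag1 == (uint16_t)op1 && e->tag2 == (uint16_t)op2) {
        if (e->result == result) {
            if (e->conf < CONF_MAX) e->conf++;        /* good prediction */
        } else {
            e->result = result;                        /* replace and penalize */
            if (e->conf > 0) e->conf--;
        }
    } else {
        /* First sighting of these operands: fill the entry.
         * Starting at the threshold (so the next hit may predict)
         * is an assumption, not specified in the paper. */
        e->tag1 = (uint16_t)op1; e->tag2 = (uint16_t)op2;
        e->result = result; e->conf = CONF_THRESHOLD; e->valid = 1;
    }
}
```

Because the index and tag depend only on the operand values, any later instruction of the same type that presents the same operands hits the same entry, which is exactly how the SRC exploits operand value locality globally.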

Figure 3: A possible implementation of the SRC. (Each entry holds an op-type, a tag formed from Operand1 and Operand2, the result value, a confidence counter, and a valid bit; the two operand values are concatenated and hashed to form the index.)

Figure 4: A superscalar processor implementation with a Speculative Result Cache. (The SRC sits alongside the instruction cache, branch predictor, decode, dispatch, execution core, and reorder buffer.)

executed by the processor model in the SimpleScalar [1] simulation tool set. The six most common and predictable instruction categories found are addition (including subtraction), left-shift, right-shift, load, set-less-than (SLT), and logical (or, and, not) instructions. Since the logical instructions require three separate sections of the SRC, we have a total of 8 types of operations. Most of these instructions are also common in other RISC instruction set architectures, but SLT is more platform-dependent. Since many branch instructions work with SLT in the SimpleScalar instruction set, we predict SLT instructions in our study to provide early resolution of branches. As shown in Figure 5, these types of instructions comprise from 62% to 80% of the total instructions executed in each of the benchmark programs.

Figure 5: The types of instructions that are applicable to the SRC (add/sub, load, logical, shift-left, shift-right, SLT, and other) and their frequency distribution.

We focus on improving the performance of only integer programs in this work, since it tends to be more difficult to obtain high levels of parallelism from these programs than from floating-point programs. Thus, each entry in the SRC must be able to store a single integer value. The operand values are exclusive-ORed [17] to index into the SRC. We use the lower n bits for this hash index instead of the full-length operand values. With these n bits, each instruction type requires 2^n entries to store the corresponding result values. Since the SRC used in the following experiments is partitioned into eight different instruction types, the overall SRC structure requires 2^(n+3) entries. In order to pick a reasonable n for our evaluation, we limit the total SRC size to be comparable to the typical size of the level-1 cache of current RISC processors. For example, with 32K entries, n = 12.
Figure 6 shows the prediction accuracy of the SRC with 4K to 32K total entries, where the prediction accuracy is calculated as the number of correctly predicted instructions divided by the total number of predicted instructions. We can see that the raw prediction accuracy of the SRC is quite high, ranging from 94.5% to 98%. Furthermore, this accuracy is relatively insensitive to the SRC size for the tested programs. The percentage of instructions that are not predicted by the SRC, either because the instruction type is not applicable or because the confidence counter in the SRC is below the threshold, decreases slightly as the size of the SRC increases, as shown in Figure 7. However, the slopes of the curves are not particularly steep. The difference in prediction accuracy between the 4K and 32K cases is at most 8%, indicating that

the SRC adequately captures the operand value locality within a certain size of execution window. This result also suggests that the number of different operand combinations in these windows is not large. Hence, we conclude that a 4K SRC is sufficient for the programs tested.

Figure 6: The prediction accuracy of the SRC.

4 The Combined Dynamic Predictor (CDP)

While the SRC utilizes the operand value locality of programs, most of the existing predictors [10, 13, 21] exploit the instruction output value locality. That is, they simply assume that each instruction will produce regular outputs across different instances of its execution. Consequently, saving the results produced in the previous executions of instructions will help predict what value a specific instruction is likely to produce next. Since the SRC and these predictors exploit different aspects of program behavior, combining the SRC with one of the other predictors may produce higher prediction accuracy and, hence, better performance.

4.1 Implementation of the Combined Dynamic Predictor

The hybrid predictor presented in [21], as shown in Figure 8, is one of the most sophisticated and effective predictors proposed so far [13, 19, 21]. Hence, we combine our SRC with the hybrid predictor for our subsequent experiments. We call this new predictor the Combined Dynamic Predictor (CDP). The hybrid predictor combines a two-level predictor and a stride-based predictor. In the two-level predictor portion, the most recent 4 unique values produced by an instruction's previous executions are stored in each entry of the value history table (VHT). An additional 2p bits are used for the Value History Pattern (VHP) portion of the table to record the order in which the 4 unique values were produced in the most recent executions. Here p refers to the history depth that can be stored. Since 4 unique values

Figure 7: The percentage of instructions not predicted by the SRC.

Figure 8: A simplified description of the hybrid predictor described in [21]. (Each entry, indexed by a hash of the instruction address, holds a tag, state, stride, LRU information, the data values, and a value history pattern that indexes the pattern history table.)

are stored and each history entry needs 2 bits to indicate which of the four values occurred at that point, a total of 2p bits are needed. This VHP is then used to index the second level of the predictor, the Pattern History Table (PHT), which is essentially a table of confidence counters for each stored value. Of the given four values, the one with the highest confidence is picked as the predicted value. If the stored confidence is higher than the confidence threshold, the predictor will return one of the four unique values as the predicted result. If none of the confidence values exceeds the threshold, no prediction is returned. The stride predictor stores an extra value called the stride, which is the difference between the last two values produced by the same instruction. The stride predictor always returns last value produced + stride as the predicted value. As its name suggests, the stride predictor is effective when an instruction produces a sequence of values with a constant stride. When the hybrid predictor is queried, the result returned by the two-level predictor has higher priority, which means the stride predictor is used only when the two-level predictor does not return a predicted value. Combining this hybrid predictor with our SRC produces the Combined Dynamic Predictor (CDP) shown in Figure 9. Note that the SRC is indexed with the operand values while the hybrid predictor is indexed with an instruction's address. When the CDP is accessed, both the hybrid predictor and the SRC simultaneously go to work to produce two predicted values. The processor then selects one of these predicted values for speculative execution. We next evaluate several different policies for determining which of the two predicted values should be used when each speculated instruction is issued.

Figure 9: The Combined Dynamic Predictor (CDP). (A decoded instruction queries the SRC with its operand values and the hybrid predictor with its instruction address; a multiplexer selects between the two predicted values.)
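The stride component described above is simple enough to sketch directly. This is an illustrative model, not the hardware of [21]: the table size and PC hash are assumptions, and the table here is untagged for brevity.

```c
/* Sketch of the stride component of the hybrid predictor: each entry,
 * indexed by instruction address, keeps the last value produced and the
 * stride (difference between the last two values), and predicts
 * last value + stride. Sizes and the PC hash are illustrative. */
#include <stdint.h>

#define ST_ENTRIES 4096

typedef struct {
    uint32_t last;     /* last value produced by this instruction */
    int32_t  stride;   /* difference between the last two values */
    int      valid;
} StrideEntry;

static StrideEntry st[ST_ENTRIES];

static uint32_t st_index(uint32_t pc) { return (pc >> 2) % ST_ENTRIES; }

/* The stride predictor always returns last + stride once trained. */
int stride_predict(uint32_t pc, uint32_t *pred) {
    StrideEntry *e = &st[st_index(pc)];
    if (!e->valid) return 0;
    *pred = e->last + (uint32_t)e->stride;
    return 1;
}

/* Train on the value actually produced by this execution. */
void stride_update(uint32_t pc, uint32_t value) {
    StrideEntry *e = &st[st_index(pc)];
    if (e->valid) e->stride = (int32_t)(value - e->last);
    e->last = value;
    e->valid = 1;
}
```

For the loop-index addu instructions in Figure 1, the produced values 1, 2, 3, ... give a constant stride of 1, so after two executions every subsequent value is predicted correctly.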
4.2 Selecting Between the Component Predictors

We evaluated three policies for selecting between the component predictors: 1) always select the hybrid predictor first; 2) always select the SRC first; and 3) select the predictor with the higher confidence first. Figure 10 shows the prediction accuracy of the CDP using these different component predictor selection policies. The raw prediction accuracy is again calculated as the number of correctly predicted instructions divided by the total number of instructions predicted. We used 4K entries in both the hybrid predictor and the SRC. The prediction accuracies of the hybrid predictor alone with 8K entries and the SRC alone with 16K entries are also compared. Note that the 16K-entry SRC and the 8K-entry hybrid predictor each consume approximately the same amount of total storage as the CDP with its 4K-entry hybrid and 4K-entry SRC component predictors.
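The third policy can be sketched as a small selection function. This assumes each component reports its prediction together with a confidence value on a common scale, which is an assumption for illustration; the paper does not specify how the two confidence counters are compared.

```c
/* Sketch of the CDP's "higher confidence first" selection policy.
 * Each component predictor reports whether it predicts, the predicted
 * value, and a confidence score (assumed comparable across components). */
#include <stdint.h>

typedef struct {
    int      valid;    /* did this component return a prediction? */
    uint32_t value;    /* the predicted value */
    int      conf;     /* confidence score */
} Prediction;

/* Returns 1 and sets *out when at least one component predicts;
 * when both predict, the more confident one wins (ties go to the SRC,
 * an arbitrary illustrative choice). */
int cdp_select(Prediction src, Prediction hp, uint32_t *out) {
    if (src.valid && hp.valid) {
        *out = (src.conf >= hp.conf) ? src.value : hp.value;
        return 1;
    }
    if (src.valid) { *out = src.value; return 1; }
    if (hp.valid)  { *out = hp.value;  return 1; }
    return 0;
}
```

The other two policies simply replace the confidence comparison with a fixed preference for one component, falling back to the other when the preferred one declines to predict.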

Figure 10: The raw prediction accuracy of the different selection policies that can be used in the CDP.

We can see that the prediction accuracy of the CDP is about 10% to 14% higher than the hybrid-only case, and about 3% to 30% lower than the SRC-only case. However, since different predictors may return different prediction decisions for the same instruction, the numbers of instructions that are not predicted by the various predictors differ, as shown in Figure 11. Thus, the prediction accuracies shown in Figure 10 cannot be directly compared. Instead, we must normalize the number of instructions correctly predicted in each case to the same base, which is simply the total number of instructions executed by each program. This gives the effective prediction accuracy = number of instructions predicted correctly / total number of instructions executed. Using this metric, Figure 12 shows that the CDP outperforms the other two predictors for all of these test programs. Among the component predictor selection policies, the higher-confidence-first and the SRC-first policies produce essentially the same prediction accuracies. Thus, a nice feature of the CDP is that its effective prediction accuracy is not particularly sensitive to the choice of this policy. For the remainder of this study, we use the higher-confidence-first policy.

4.3 Resource Allocation Among the Component Predictors

Now that the component-selection policy has been examined, we must next determine how to allocate the total available storage for the prediction hardware between the two component predictors. We evaluate the different configurations of the CDP shown in Table 1. In the first line of this table, approximately 1/3 of the total available storage is devoted to the SRC, with the remaining 2/3 devoted to the hybrid predictor.
We denote this configuration as (1/3, 2/3). The second line approximates the (1/2, 1/2) case, and the third line the (3/4, 1/4) case. The remaining two configurations devote all of the available resources to one predictor alone. Evaluating these five combinations will tell us how the available resources should be partitioned among the component predictors. The goal is to keep the total available storage for the overall predictor roughly constant. However, if the number of entries in a predictor is not a power-of-two, the utilization of the value buffers will be uneven. Consequently, the sizes of the

Figure 11: The percentage of not-predicted instructions for the different selection policies.

Figure 12: The effective prediction accuracy of the CDP selection policies.
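The difference between the raw accuracy of Figure 10 and the effective accuracy of Figure 12 can be sketched with hypothetical counts; the numbers below are illustrative only, not measurements from the paper.

```python
def raw_accuracy(correct, predicted):
    # raw accuracy = correctly predicted / instructions predicted
    return correct / predicted

def effective_accuracy(correct, executed):
    # effective accuracy = correctly predicted / instructions executed,
    # so predictors that decline different numbers of instructions
    # are normalized to the same base
    return correct / executed

# Hypothetical counts: 100M instructions executed; the predictor
# attempts 40M of them and gets 38M right.
executed, predicted, correct = 100e6, 40e6, 38e6
print(raw_accuracy(correct, predicted))       # 0.95
print(effective_accuracy(correct, executed))  # 0.38
```

A predictor can thus report a very high raw accuracy while contributing little overall if it declines to predict most instructions, which is why the effective metric is used to compare the CDP against its components.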

Line  SRC   Hybrid  Total Size  % SRC  Label
1     4K    4K      124KB        29%   (1/3, 2/3)
2     8K    4K      160KB        45%   (1/2, 1/2)
3     16K   2K      184KB        76%   (3/4, 1/4)
4     16K   0K      144KB       100%   SRC-only
5     0K    8K      176KB         0%   HP-only

Table 1: Predictor configurations used to determine how the available storage resources should be allocated among the component predictors of the CDP.

component predictors in Table 1 are chosen to be the power-of-two sizes that most closely approximate the desired distribution of resources. The actual percentage of the total storage allocated to the SRC predictor in each configuration is shown in the % SRC column. Figures 13 - 15 show the raw prediction accuracy, the percentage of instructions not predicted, and the effective prediction accuracy for these configurations of the CDP. We can see that the prediction accuracy of the SRC is still the best among all of the configurations, while the HP-only case trails the others by a margin of up to 24%. For the three combinations of the hybrid predictor and the SRC that compose the CDP, the (3/4, 1/4) configuration outperforms the other two by a small margin. This is expected since the SRC has a very high prediction accuracy; hence, devoting more resources to the SRC can reduce the raw misprediction rate. However, since the SRC predicts a smaller portion of the total number of instructions than the hybrid predictor, as shown in Figure 14, a hybrid predictor of considerable size is needed to provide a prediction for the instructions rejected by the SRC. As measured by the effective prediction accuracy in Figure 15, the three CDP configurations outperform the SRC-only and the HP-only cases, with the (3/4, 1/4) resource allocation again outperforming the others. Since the (3/4, 1/4) configuration has more total storage than the other two CDP configurations, we conclude that there is only a slight preference for devoting more resources to the SRC predictor than to the hybrid predictor when they are used as components of the CDP.
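The storage accounting behind Table 1 can be sketched as follows. The per-entry sizes used here are our own assumptions, chosen only to illustrate the bookkeeping; they are not the paper's entry formats, and the function is a sketch of the calculation, not a claim about the actual hardware budgets.

```python
# Assumed per-entry sizes in bytes; illustrative only, not taken
# from the paper.
SRC_ENTRY_BYTES = 9
HP_ENTRY_BYTES = 22

def cdp_storage(src_entries, hp_entries):
    """Return (total KB, % of storage devoted to the SRC) for one
    CDP configuration, mirroring the 'Total Size' and '% SRC'
    columns of Table 1."""
    src_bytes = src_entries * SRC_ENTRY_BYTES
    hp_bytes = hp_entries * HP_ENTRY_BYTES
    total = src_bytes + hp_bytes
    pct_src = 100.0 * src_bytes / total if total else 0.0
    return total // 1024, round(pct_src)

# The (1/3, 2/3) configuration of Table 1, line 1: 4K SRC entries
# plus 4K hybrid-predictor entries.
print(cdp_storage(4096, 4096))  # (124, 29)
```

The point of the sketch is that with fixed per-entry sizes and power-of-two entry counts, only a few discrete storage splits are reachable, which is why the configurations in Table 1 only approximate the desired (1/3, 2/3), (1/2, 1/2), and (3/4, 1/4) fractions.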
5 Performance Potential

In this section, we evaluate the performance benefit of the SRC and the CDP when used to perform value prediction within a pipelined superscalar processor. The results are compared to the hybrid-only configuration.

5.1 Simulation Environment

We use the SimpleScalar Tool Set Version 3.0 [1] for all of our experiments. The sim-outorder detailed superscalar simulator in this tool set is modified to accommodate value prediction with the hybrid predictor [7], the SRC (Section 3), and the CDP (Section 4). Our base processor model is a 16-issue superscalar processor with a 64KB level-1 instruction cache, a 64KB level-1 data cache, and a 1MB unified level-2 cache. Thirty-two ALUs and eight MULT/DIV units are supplied in the execution core. The miss recovery

Figure 13: The prediction accuracy of the CDP for the different resource allocation configurations shown in Table 1.

Figure 14: The percentage of instructions not predicted by the CDP for the different resource allocation configurations shown in Table 1.

Figure 15: The effective prediction accuracy of the CDP for the different resource allocation configurations shown in Table 1.

penalty of value prediction is set to 3 cycles. The number of entries in the load/store queue and the reorder buffer is 256, since instructions dependent on predicted values are issued and wait in those buffers to be committed. All of the predicted values are verified in the write-back stage of the execution pipeline and squashed if they are mispredicted. The mispredicted instructions, including the instructions dependent on their values, are re-executed after the misprediction recovery penalty. The hybrid predictor part of the CDP uses a 4-bit confidence counter with a prediction threshold of 7. It is configured to increment the counter by 1 when a correct prediction is made, and to decrement it by 7 when an incorrect prediction is made [3]. The SRC part adopts a similarly configured confidence counter, except that it increments by 3 and decrements by 1, since the operand value locality is relatively higher. Six programs from the SPECint 95 benchmark suite are tested in our studies: Compress, Gcc, Go, Ijpeg, Li, and M88Ksim 1. Compress and Go use the train input size, while the other programs use the test input size. All of the program binaries were downloaded from the SimpleScalar website [1].

5.2 The SRC-Only Case

As shown in Figure 16, the SRC by itself can produce speedups of 1% - 6% with 4K entries, and 1.5% - 7% when the size increases to 32K entries. The changes in speedup as the size of the SRC varies are not very dramatic, even though the size of the SRC increases by a factor of 8. Only M88Ksim shows a large improvement when the size of the SRC changes from 4K entries to 8K entries.
We conclude that, for these test programs, an SRC of 4K - 8K entries is sufficient to capture most of the available operand value locality.

1 Note to the reviewers: We will include the results for Perl in the final version of this paper.
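The confidence-counter configuration described in Section 5.1 (a 4-bit saturating counter with a prediction threshold of 7; +1/-7 updates for the hybrid part and +3/-1 for the SRC part) can be sketched as follows. The class structure is our own illustration of those update rules:

```python
class ConfidenceCounter:
    """Saturating confidence counter with a prediction threshold,
    configured as in Section 5.1 (4 bits, threshold 7)."""

    def __init__(self, inc, dec, threshold=7, bits=4):
        self.max = (1 << bits) - 1   # 15 for a 4-bit counter
        self.inc, self.dec = inc, dec
        self.threshold = threshold
        self.value = 0

    def may_predict(self):
        # A prediction is only used once confidence reaches the threshold.
        return self.value >= self.threshold

    def update(self, correct):
        # Saturate at the top on correct predictions, at zero on
        # incorrect ones.
        if correct:
            self.value = min(self.max, self.value + self.inc)
        else:
            self.value = max(0, self.value - self.dec)

# Hybrid part: +1 on a correct prediction, -7 on an incorrect one [3].
hp = ConfidenceCounter(inc=1, dec=7)
# SRC part: +3 / -1, since operand value locality is relatively higher.
src = ConfidenceCounter(inc=3, dec=1)
for _ in range(3):
    src.update(correct=True)
print(src.may_predict())  # True: three correct reuses reach 9 >= 7
```

With these parameters the SRC part reaches the threshold after three consecutive correct reuses, while the hybrid part needs seven, and a single misprediction drops the hybrid part's confidence by 7.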

Figure 16: The speedup potential of the SRC with different sizes.

5.3 The Combined Dynamic Predictor (CDP)

The Combined Dynamic Predictor (CDP) produces speedups varying from -1% to 10%, as shown in Figure 17. The CDP is unable to produce any speedup for Go, but is able to reasonably improve the performance of all of the other programs. The speedup of the (1/2, 1/2) configuration indicates that splitting the available storage resources equally between the two component predictors tends to produce the best overall performance. However, the performance differences between the three CDP allocations studied are quite small within a single application program. Furthermore, splitting the available resources between these two component predictors produces better performance than using only a single type of predictor alone for all of the programs except Compress and Go, for which the SRC-only configuration produces slightly greater speedups than the split CDP configurations. Since the total storage taken by the CDP is close to 128KB, we also compare against the case of using a larger level-1 cache instead of allocating any resources to value predictors. With this configuration, the level-1 data and level-1 instruction caches are increased from 64KB to 128KB each. Figure 17 shows that Gcc and Go strongly favor the larger cache, while the other programs perform better when the resources are used for value prediction instead of simply making the caches larger. For M88Ksim, a larger cache does not provide any performance benefit.

6 Conclusion

Most of the existing value predictors [8, 10, 13, 21] exploit instruction output value locality. The Result Cache [17], in contrast, exploits operand value locality by reusing the results produced by any of the previously executed instructions of the same type. Due to its non-speculative nature, however, the result

Figure 17: Processor speedups when using the various value prediction mechanisms shown in Table 1 compared to allocating the predictor resources to a larger cache.

cache can only be accessed when the instructions are ready to be issued. Since most integer operations in pipelined superscalar processors take one cycle to finish, this reuse-based result cache produces only limited benefit for integer programs. In this paper, we extended the result cache to support speculative execution with value prediction. We call this new scheme the Speculative Result Cache (SRC). Simulations based on the SimpleScalar Tool Set showed that the SRC can produce 94% - 98% raw prediction accuracies and up to 7% speedup on a 16-issue superscalar processor. Since the SRC exploits a different kind of value locality than most of the existing predictors, we evaluated combining the SRC with an existing predictor to produce better prediction accuracy and performance potential. Among the existing value predictors, the hybrid predictor [21] has been demonstrated to have good prediction accuracy and substantial performance benefit. Thus, we proposed a new predictor, called the Combined Dynamic Predictor (CDP), that combines the hybrid predictor and the SRC. This new predictor exploits both instruction output value locality and operand value locality by dynamically selecting between the predicted values produced by its two component predictors. The higher confidence first policy is shown to have raw prediction accuracies of up to 96%. Using a CDP based on this policy produced speedups of up to 10% on a 16-issue superscalar processor for the SPEC95 integer programs tested. This performance is better than that of either of its component predictors used separately. We conclude that the best value prediction for integer programs is obtained by dynamically selecting between a predictor that exploits instruction output value locality and one that exploits operand value locality.
Furthermore, our results indicate that, for most of the programs tested, it is better to allocate some resources to this type of value prediction than to simply increase the size of the level-1 caches.

References

[1] D. Burger, T. Austin, and S. Bennett. The SimpleScalar Tool Set, Version 2.0. Technical Report 1342, Computer Science Department, University of Wisconsin, Madison.
[2] F. Gabbay and A. Mendelson. Using Value Prediction to Increase the Power of Speculative Execution Hardware. ACM Transactions on Computer Systems, Vol. 16, No. 3, August 1998.
[3] B. Calder, G. Reinman, and D. Tullsen. Selective Value Prediction. In Proceedings of the 26th Int'l Symposium on Computer Architecture (ISCA), Atlanta, May 1999.
[4] J. Gonzalez and A. Gonzalez. The Potential of Data Value Speculation to Boost ILP. In Proceedings of the 1998 Int'l Conference on Supercomputing (ICS 98), July 1998.
[5] S. Harbison. An Architectural Alternative to Optimizing Compilers. In Proceedings of the Int'l Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 1982.
[6] J. Huang and D. Lilja. Exploiting Basic Block Value Locality with Block Reuse. In Proceedings of the 5th Int'l Symposium on High Performance Computer Architecture (HPCA-5), Orlando, Jan. 9-13, 1999.
[7] S.J. Lee. Several data value predictors. misc.html
[8] M. Lipasti, C. Wilkerson, and J. Shen. Value Locality and Load Value Prediction. In Proceedings of the 7th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), October 1996.

[9] M. Lipasti and J. Shen. Exceeding the Dataflow Limit via Value Prediction. In Proceedings of the 29th Annual ACM/IEEE Int'l Symposium on Microarchitecture (MICRO-29), Dec. 2-4, 1996.
[10] T. Nakra, R. Gupta, and M.L. Soffa. Global Context-Based Value Prediction. In Proceedings of the 5th Int'l Symposium on High Performance Computer Architecture (HPCA-5).
[11] S. Oberman and M. Flynn. On Division and Reciprocal Caches. Stanford University Technical Report CSL-TR, April.
[12] Y. Patt, S. Patel, M. Evers, D. Friendly, and J. Stark. One Billion Transistors, One Uniprocessor, One Chip. IEEE Computer, Special Issue on Billion-Transistor Architectures, Vol. 30, No. 9, September 1997.
[13] B. Rychlik, J. Faistl, B. Krug, and J. Shen. Efficacy and Performance Impact of Value Prediction. In Proceedings of the Xth Int'l Symposium on Parallel Architectures and Compilation Techniques, Paris, October 1998, pages xxx-yyy.
[14] Y. Sazeides and J. Smith. The Predictability of Data Values. In Proceedings of the 30th Annual Int'l Symposium on Microarchitecture (MICRO 30), Dec. 1997.
[15] A. Sodani and G. Sohi. Dynamic Instruction Reuse. In Proceedings of the 24th Int'l Symposium on Computer Architecture (ISCA), June 1997.
[16] A. Sodani and G. Sohi. Understanding the Differences Between Value Prediction and Instruction Reuse. In Proceedings of the 31st Int'l Symposium on Microarchitecture (MICRO-31), 1998.
[17] S. Richardson. Caching Function Results: Faster Arithmetic by Avoiding Unnecessary Computation. Technical Report SMLI TR-92-1, Sun Microsystems Laboratories, September.
[18] J. Smith and S. Vajapeyam. Trace Processors: Moving to Fourth Generation Microarchitectures. IEEE Computer, Vol. 30, No. 9, September 1997.
[19] D. Tullsen and J. Seng. Storageless Value Prediction Using Prior Register Values. In Proceedings of the 26th Int'l Symposium on Computer Architecture (ISCA), Atlanta, May 1999.
[20] G. Tyson and T. Austin. Improving the Accuracy and Performance of Memory Communication Through Renaming. In Proceedings of the 30th Annual Int'l Symposium on Microarchitecture (MICRO 30), December 1997.
[21] K. Wang and M. Franklin. Highly Accurate Data Value Prediction Using Hybrid Predictors. In Proceedings of the 30th Annual Int'l Symposium on Microarchitecture (MICRO 30), Dec. 1997.


Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

SOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name:

SOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name: SOLUTION Notes: CS 152 Computer Architecture and Engineering CS 252 Graduate Computer Architecture Midterm #1 February 26th, 2018 Professor Krste Asanovic Name: I am taking CS152 / CS252 This is a closed

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

On Pipelining Dynamic Instruction Scheduling Logic

On Pipelining Dynamic Instruction Scheduling Logic On Pipelining Dynamic Instruction Scheduling Logic Jared Stark y Mary D. Brown z Yale N. Patt z Microprocessor Research Labs y Intel Corporation jared.w.stark@intel.com Dept. of Electrical and Computer

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

Towards a More Efficient Trace Cache

Towards a More Efficient Trace Cache Towards a More Efficient Trace Cache Rajnish Kumar, Amit Kumar Saha, Jerry T. Yen Department of Computer Science and Electrical Engineering George R. Brown School of Engineering, Rice University {rajnish,

More information

Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions

Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions Resit Sendag 1, David J. Lilja 1, and Steven R. Kunkel 2 1 Department of Electrical and Computer Engineering Minnesota

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Aneesh Aggarwal Electrical and Computer Engineering Binghamton University Binghamton, NY 1392 aneesh@binghamton.edu Abstract With the

More information

ONE-CYCLE ZERO-OFFSET LOADS

ONE-CYCLE ZERO-OFFSET LOADS ONE-CYCLE ZERO-OFFSET LOADS E. MORANCHO, J.M. LLABERÍA, À. OLVÉ, M. JMÉNEZ Departament d'arquitectura de Computadors Universitat Politècnica de Catnya enricm@ac.upc.es Fax: +34-93 401 70 55 Abstract Memory

More information

Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor

Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor Ying Chen, Resit Sendag, and David J Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Homework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures

Homework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang

More information

DFCM++: Augmenting DFCM with Early Update and Data Dependence-driven Value Estimation

DFCM++: Augmenting DFCM with Early Update and Data Dependence-driven Value Estimation ++: Augmenting with Early Update and Data Dependence-driven Value Estimation Nayan Deshmukh*, Snehil Verma*, Prakhar Agrawal*, Biswabandan Panda, Mainak Chaudhuri Indian Institute of Technology Kanpur,

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

For Problems 1 through 8, You can learn about the "go" SPEC95 benchmark by looking at the web page

For Problems 1 through 8, You can learn about the go SPEC95 benchmark by looking at the web page Problem 1: Cache simulation and associativity. For Problems 1 through 8, You can learn about the "go" SPEC95 benchmark by looking at the web page http://www.spec.org/osg/cpu95/news/099go.html. This problem

More information

Computer System Architecture Quiz #2 April 5th, 2019

Computer System Architecture Quiz #2 April 5th, 2019 Computer System Architecture 6.823 Quiz #2 April 5th, 2019 Name: This is a closed book, closed notes exam. 80 Minutes 16 Pages (+2 Scratch) Notes: Not all questions are of equal difficulty, so look over

More information

A Cache Scheme Based on LRU-Like Algorithm

A Cache Scheme Based on LRU-Like Algorithm Proceedings of the 2010 IEEE International Conference on Information and Automation June 20-23, Harbin, China A Cache Scheme Based on LRU-Like Algorithm Dongxing Bao College of Electronic Engineering Heilongjiang

More information

Applications of Thread Prioritization in SMT Processors

Applications of Thread Prioritization in SMT Processors Applications of Thread Prioritization in SMT Processors Steven E. Raasch & Steven K. Reinhardt Electrical Engineering and Computer Science Department The University of Michigan 1301 Beal Avenue Ann Arbor,

More information

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines 6.823, L15--1 Branch Prediction & Speculative Execution Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 6.823, L15--2 Branch Penalties in Modern Pipelines UltraSPARC-III

More information

Selective Value Prediction

Selective Value Prediction Pubshed in the Proceedings of the 26th International Symposium on Computer Architecture, May 1999. Selective Value Prediction Brad Calder Glenn Reinman Dean M. Tullsen Department of Computer Science and

More information

Fetch Directed Instruction Prefetching

Fetch Directed Instruction Prefetching In Proceedings of the 32nd Annual International Symposium on Microarchitecture (MICRO-32), November 1999. Fetch Directed Instruction Prefetching Glenn Reinman y Brad Calder y Todd Austin z y Department

More information

Chapter 13 Reduced Instruction Set Computers

Chapter 13 Reduced Instruction Set Computers Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining

More information

Midterm Exam 1 Wednesday, March 12, 2008

Midterm Exam 1 Wednesday, March 12, 2008 Last (family) name: Solution First (given) name: Student I.D. #: Department of Electrical and Computer Engineering University of Wisconsin - Madison ECE/CS 752 Advanced Computer Architecture I Midterm

More information

Static, multiple-issue (superscaler) pipelines

Static, multiple-issue (superscaler) pipelines Static, multiple-issue (superscaler) pipelines Start more than one instruction in the same cycle Instruction Register file EX + MEM + WB PC Instruction Register file EX + MEM + WB 79 A static two-issue

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

SEVERAL studies have proposed methods to exploit more

SEVERAL studies have proposed methods to exploit more IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 4, APRIL 2005 1 The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture Resit Sendag, Member, IEEE, Ying

More information

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Jong Wook Kwak 1, Seong Tae Jhang 2, and Chu Shik Jhon 1 1 Department of Electrical Engineering and Computer Science, Seoul

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Itanium 2 Processor Microarchitecture Overview

Itanium 2 Processor Microarchitecture Overview Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs

More information

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Speculation and Future-Generation Computer Architecture

Speculation and Future-Generation Computer Architecture Speculation and Future-Generation Computer Architecture University of Wisconsin Madison URL: http://www.cs.wisc.edu/~sohi Outline Computer architecture and speculation control, dependence, value speculation

More information

Selective Value Prediction

Selective Value Prediction Selective Value Prediction Brad Calder Glenn Reinman Dean M. Tullsen Department of Computer Science and Engineering University of Cafornia, San Die fcalder,greinman,tullseng@cs.ucsd.edu Abstract Value

More information

CS 152, Spring 2011 Section 8

CS 152, Spring 2011 Section 8 CS 152, Spring 2011 Section 8 Christopher Celio University of California, Berkeley Agenda Grades Upcoming Quiz 3 What it covers OOO processors VLIW Branch Prediction Intel Core 2 Duo (Penryn) Vs. NVidia

More information

Computer Architecture and Engineering. CS152 Quiz #4 Solutions

Computer Architecture and Engineering. CS152 Quiz #4 Solutions Computer Architecture and Engineering CS152 Quiz #4 Solutions Problem Q4.1: Out-of-Order Scheduling Problem Q4.1.A Time Decode Issued WB Committed OP Dest Src1 Src2 ROB I 1-1 0 1 2 L.D T0 R2 - I 2 0 2

More information

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1 Superscalar Processing 740 October 31, 2012 Evolution of Intel Processor Pipelines 486, Pentium, Pentium Pro Superscalar Processor Design Speculative Execution Register Renaming Branch Prediction Architectural

More information