Improving Value Prediction by Exploiting Both Operand and Output Value Locality


Jian Huang and Youngsoo Choi
Department of Computer Science and Engineering
Minnesota Supercomputing Institute
University of Minnesota, Minneapolis, MN

David J. Lilja
Department of Electrical and Computer Engineering
Minnesota Supercomputing Institute
University of Minnesota, Minneapolis, MN

Abstract

Several studies have determined that the values produced by the execution of a program's instructions are often quite repetitive. There are basically two approaches that have been proposed for exploiting this instruction output value locality: 1) reusing the results of a prior execution of an instruction, and 2) predicting the value that will be produced by the current execution of an instruction based on the previous values it has produced. The result cache [17], on the other hand, exploits operand value locality by reusing the values produced by any instruction of the same type that has been executed previously with the same input operands. However, due to the non-speculative nature of the result cache, it is effective only for long-latency operations. In this paper, we extend the reuse-based result cache to support speculative execution using value prediction. We call this new scheme the Speculative Result Cache (SRC). Our simulations using the SimpleScalar Tool Set show that value prediction with the SRC alone can produce 94%-98% raw prediction accuracies and speedups of 2%-7% for the tested SPEC95 integer benchmark programs in a 16-issue superscalar processor. Our proposed Combined Dynamic Predictor (CDP) couples the SRC with the hybrid predictor [21] and dynamically selects between the two component predictors. This approach outperforms the SRC and the hybrid predictor when they are used separately, achieving higher effective prediction accuracies and speedups of up to 10%.
Keywords: speculative result cache, instruction reuse, value locality, value prediction, combined dynamic predictor

1 Introduction

Computer architects have long exploited the typical behavior of application programs to improve performance. For instance, the assumption that future memory access patterns can be predicted by observing the patterns in the previous memory addresses referenced by a program has led to mechanisms such as caches, prefetching, and vector instructions. Architects speak of the spatial and temporal locality of memory references to describe the behavior, observed in programs, that future memory accesses are likely to be close in space and time to other recent references. Other types of repetitive program behavior have recently been further exploited by noticing that programs exhibit not only locality of memory addresses

referenced, but also locality in the output values produced by individual instructions. This instruction output value locality suggests that a single instruction often produces the same value, or a predictable sequence of values, each time it is executed [2, 8, 20]. There are essentially two approaches that have been proposed for exploiting this program behavior: 1) reusing the results of a prior execution of an instruction, and 2) predicting the value that will be produced by the current execution of an instruction based on the previous values it has produced. With instruction reuse [15], the input operand values of an instruction and the corresponding output result are saved in a reuse buffer, which is indexed by the instruction's address. The next time the instruction is executed, this buffer is queried to see if the current operand values match previously seen values. If they do, the previously stored output value can be used immediately without needing to re-execute the instruction. Simulations with the SPEC benchmarks have shown that this approach can improve performance by 1-5 percent [16]. This idea can be extended to larger granularities to reuse entire sequences of instructions, such as basic blocks [6]. Alternatively, the result cache proposed by Richardson [17] indexes the reuse buffer with a function that hashes the instructions' operands instead of the instructions' addresses. This indexing scheme allows the result cache to reuse the values produced by any previously executed instructions of the same type. With these value reuse approaches, the output values stored in the reuse buffer are known to be the correct values since the input operands are matched exactly. In contrast, value prediction [2, 8, 16] allows an instruction to begin executing before its input operands are available by guessing what the operand values will be.
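The instruction-reuse mechanism described above can be sketched in a few lines of C. This is a minimal illustration, not the paper's implementation: the buffer size, the PC hash, and the two-operand entry layout are all assumptions made for the example.

```c
/* Sketch of a direct-mapped reuse buffer indexed by instruction address.
 * A lookup is a hit only when the stored operand values match exactly,
 * so a reused result is always correct (non-speculative). */
#include <stdint.h>

#define RB_ENTRIES 1024  /* illustrative size */

typedef struct {
    uint32_t pc;        /* address of the instruction that filled this entry */
    uint32_t op1, op2;  /* operand values seen on that execution */
    uint32_t result;    /* result produced by that execution */
    int      valid;
} ReuseEntry;

static ReuseEntry rb[RB_ENTRIES];

/* Returns 1 (and sets *result) when the current operands match the
 * stored ones, so the stored result can be reused without re-executing. */
int reuse_lookup(uint32_t pc, uint32_t op1, uint32_t op2, uint32_t *result) {
    ReuseEntry *e = &rb[(pc >> 2) % RB_ENTRIES];
    if (e->valid && e->pc == pc && e->op1 == op1 && e->op2 == op2) {
        *result = e->result;
        return 1;
    }
    return 0;
}

/* Record the operands and result of a completed execution. */
void reuse_update(uint32_t pc, uint32_t op1, uint32_t op2, uint32_t result) {
    ReuseEntry *e = &rb[(pc >> 2) % RB_ENTRIES];
    e->pc = pc; e->op1 = op1; e->op2 = op2;
    e->result = result; e->valid = 1;
}
```

Note that the hit condition requires an exact operand match, which is why reuse needs no verification step, in contrast to the speculative prediction schemes discussed next.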
This speculative instruction execution effectively breaks flow dependences between instructions, although it does require a subsequent check to verify that the values were predicted correctly. If they were not, the program execution must be restored to the appropriate state before the speculation, and the instructions must then be re-executed with the correct values. Due to the potentially high cost of wrong predictions, several techniques have been proposed to improve value prediction accuracy. These include a last-value predictor [9], a stride-based predictor [21], a two-level predictor [9, 21], a hybrid predictor [21], and context-based predictors [10, 13, 14]. To be effective, value reuse mechanisms must store previous operand and result values for as many instructions as possible. Similarly, since value prediction mechanisms attempt to predict the next value that will be produced by an instruction based on the previous values already generated, they need to cache as large a history of previously seen values for each instruction as possible. Thus, both of these approaches require large hardware tables to be implemented directly on the processor die. For instance, each entry in the hybrid predictor needs at least 16 bytes to store the last four unique values produced previously by the same instruction [21]. Additionally, these tables must be quick to query to provide any performance benefit. Our experiments with the programs in the SPECint95 benchmark suite show, however, that all of the instructions executed within a single application have a relatively small overall result space. That is, while the potential range of values that could be produced is enormous, the actual range of values tends to be much more constrained. Furthermore, different instructions of the same type often produce the same values, since the operand values of instructions are frequently repeated.
The repeated operand values could come from previous executions of the same instruction, or from previous executions of other instructions of the same type. We say that a program exhibits operand value locality when an instruction is executed using the same operand values that have been used by other instructions of the same type.

If we can save the result values corresponding to the operands, the execution of this instruction can be skipped. The previously proposed result cache [17] is one example of this approach. The result cache, however, requires that the operand values used to index the buffer be the same as the actual values that will be operated on. This means that the earliest the result cache can be accessed is right before instructions are actually issued. Hence, this scheme can be applied only to long-latency operations [11, 17], such as those typically seen in floating-point programs. In this paper, we extend the reuse-based result cache to a Speculative Result Cache (SRC) to support data value prediction for integer programs by exploiting operand value locality. Since most integer operations require only a single cycle in a pipelined superscalar processor, waiting for the correct operand values to access the result cache would not bring any performance benefit. Tullsen et al. [19] showed that the current value in the register file for an operand register may stay the same for a while even though new values will be produced. Hence, rather than waiting to index the result buffer with the instructions' exact operand values, as was done in the result cache, or indexing with the instructions' addresses, as is done with other predictors [9, 10, 13, 21], the SRC is indexed using the currently available values in the corresponding operand registers. Instead of using the hardware to detect value patterns dynamically, the SRC simply reuses the values produced previously by this or any other instruction of the same type. Previously proposed value predictors [9, 10, 13, 21] focus on exploiting the output value locality of individual instructions, while the SRC focuses on the operand value locality produced by all instructions of the same type.
To simultaneously exploit both kinds of value locality, we propose the Combined Dynamic Predictor (CDP), which dynamically chooses between the values predicted by a predictor that exploits operand value locality (the SRC) and a predictor that exploits instruction output value locality (the hybrid predictor [21]). (We choose the hybrid predictor as being representative of the class of predictors that use instruction output value locality due to its demonstrated high prediction accuracy [3, 13, 21].) Our simulations show that the CDP performs better than either of the two component predictors applied alone, with prediction accuracies and speedups of up to 96% and 10%, respectively. Since each SRC entry is much narrower than the entries in the hybrid predictor, the CDP may also be more efficient in its use of die space. In the remainder of the paper, Section 2 summarizes related work. Section 3 describes the SRC, while Section 4 studies the Combined Dynamic Predictor. Section 5 evaluates the performance potential of these new predictors and Section 6 then summarizes and concludes the paper.

2 Related Work

Sodani et al. [15] investigated the potential of instruction reuse with a reuse buffer that stored the inputs and outputs of each executed instruction. The goal was to make use of the results of instructions that were executed but not actually used due to branch misprediction, as well as to reuse the results of the same instruction from previous executions. They showed relatively moderate speedups of 1-5 percent for the SPECint benchmark programs [16]. The primary advantage of this scheme is that the results do not need to be verified, since it is strict reuse with no speculation at all. This instruction reuse technique has been extended to reusing the results of an entire sequence of instructions within a basic block [6]. This approach detects a dependence chain at runtime and compares

all of the inputs to an entire basic block instead of only a single instruction. This technique produces slightly greater performance improvements than the simple instruction reuse scheme due to the larger amount of time that can be saved when the execution of all of the instructions within a basic block can be skipped by reusing the block's previously stored output values. Lipasti et al. [8] investigated the potential of predicting the values returned by load instructions to improve instruction-level parallelism. The performance benefits of this scheme in a simulation of the DEC and PowerPC 620 processors were also moderate, with speedups in the range of 4 to 12 percent with limited hardware. This basic scheme has been improved to achieve better prediction accuracies [10, 13, 21], although it has been suggested that, due to the cost of a misprediction, the improvement in instruction-level parallelism that can be achieved through value prediction mechanisms is limited [4]. In addition, since an instruction typically has two inputs, obtaining a higher prediction rate on only one input operand but not on the other will further limit the performance benefit. Harbison [5] used a mechanism called the value cache to store the result of an instruction as well as some dependence information and the execution phase. This cache was also indexed by the instruction address. With this mechanism, hardware-level common subexpression elimination was achieved to produce speedups of about 2.0 on a stack machine. Richardson [17] proposed a scheme targeting floating-point instructions. He suggested a tabulation method to store the return values of function calls with no side effects using software, and to store long-latency operation results in a device called a result cache. The result cache was indexed by exclusive-ORing some of the most significant bits of an instruction's two operands.
This mechanism was able to reduce the execution time of some SPEC92 floating-point programs by 15 to 44 percent. However, this result cache requires that the operands used to produce the index be the correct values in program order. Due to its non-speculative nature, the result cache can only reuse results produced by long-latency operations, such as floating-point division and multiplication. Tullsen and Seng [19] proposed a scheme that saves value buffer space by using the values already available in the register file. They found that at least 75% of the time, the value loaded from memory is either already in the register file or was recently there. In this scheme, the instruction uses the old value in the destination (architectural) register as a prediction for the new value produced. Compiler support is used to increase the prediction opportunities for this scheme. They showed speedups of up to 11% for the SPEC95 integer programs, and up to 13% for the SPEC95 floating-point programs.

3 Speculative Result Cache

There are several approaches that can be used to predict the result of an instruction before it is executed:

1. If the operands are available, and the values of these operands have been used previously by this instruction, we can simply reuse the previously stored result.

2. If the operands are available, but the values of these operands have not been used previously by this instruction, we cannot simply reuse the previously stored result. However, if the values of these operands have been used by another instruction of the same type, and we have previously saved the result produced by this other instruction, we can predict that the result of the current instruction will be the same as that of this previously executed instruction.

3. If the operands are not available, but the result produced by this instruction follows a regular pattern, we can predict the result by following this regularity.

The first case is handled by the instruction reuse [15] and value cache [5] mechanisms, while the existing predictors [9, 10, 13, 14, 21] handle at least part of the third case. The result cache and our proposed SRC are both targeted towards the second case. For example, in Figure 1, arrays A, B, and C are all defined to have 100 elements. The index variables used to operate on each element of these arrays (i1 and i2) are incremented from 0 to 99 in two different loops. The operations used to increment i1 and i2 are both addu instructions. Thus, the results produced by updating i1 can be used to predict the results that will be produced by the addu instruction that updates i2. Additionally, the instructions used to determine the end of the loop (slt) in the first loop can also be used to predict the outcome of the slt instruction used in the second loop. This similar behavior among different instruction instances allows us to resolve the branch instructions (bne) earlier than would otherwise be possible.

3.1 Basic Operation of the Speculative Result Cache

The SRC buffers the result value produced by each instruction that may need predicting in the future in an entry of the SRC indexed by hashing the values of this instruction's input operands. When a decoded instruction cannot be issued because the source instructions that produce its operands have not yet completed, the processor can predict the values that will eventually be produced by the source instructions using the SRC. To perform this prediction, the processor indexes the SRC using the currently available values of the operands of the source instructions. If predictable values are found in the SRC, the instruction can be speculatively issued for execution using these predicted values for its input operands.
Figure 2 provides an example of how the SRC is used. The instructions in this figure are assumed to be in MIPS-IV format, so that the first register specified is the destination register, and the other two registers are the source operands. Instruction 1 is a load instruction that has been issued but not completed. Instructions 2, 3, 4, and 5 are waiting for their operands to be ready. The arrows show the dependences between instructions. We could speculatively issue both instructions 2 and 3 if the result cache query for instruction 1 produces a hit. However, if the ultimately loaded value is found to be different from the one in the result cache, the speculation would fail. Instruction 4 depends on the results of instructions 2 and 3. After decoding, we know the register numbers for the operands of instructions 2 and 3, and we could use the currently available values in the register file for the corresponding operands to query the result cache and proceed with issuing instruction 4. By doing so, we are betting that the operands used by instructions 2 and 3 will not be altered by instruction 1. The speculation of instruction 5 can be done similarly. Hence, if the processor's issue width allows, all 5 instructions could be issued concurrently. When instruction 1 is executed a second time, the address calculated by adding register sp (stack pointer) and the value 4 could be different from its previous executions. However, if other load instructions have previously read the value corresponding to the new address, the SRC mechanism allows us to effectively exploit this operand value locality globally.

3.2 An Implementation of the Speculative Result Cache

Instructions in the execution window are decoded before being issued, which allows dependence chains to be constructed before the instructions reach the issue stage. This then allows the SRC to be queried

/* C source code */
int i1, i2, A[100], B[100], C[100];

for (i1=0; i1 < 100; i1++)
    A[i1] = A[i1]+C[i1]+B[i1];
for (i2=0; i2 < 100; i2++)
    B[i2] = B[i2]+A[i2];

/* MIPS-IV Assembly Code for the above loops */
        move $6,$0
        addu $5,$sp,16
LOOP1:  lw   $2,0($5)
        lw   $3,400($5)
        lw   $4,800($5)
        addu $6,$6,1      # update the index
        addu $2,$2,$3
        addu $2,$2,$4
        sw   $2,0($5)     # save the result
        addu $5,$5,4      # compute the next address
        slt  $2,$6,99     # compare index to loop-bound
        bne  $2,$0,LOOP1

        move $5,$0
        addu $4,$sp,16
LOOP2:  lw   $2,800($4)
        lw   $3,0($4)
        addu $5,$5,1      # update the index
        addu $2,$2,$3
        sw   $2,800($4)   # save the computed result
        addu $4,$4,4      # compute the next address
        slt  $2,$5,99     # compare index to loop-bound
        bne  $2,$0,LOOP2

Figure 1: An example code segment that could utilize the Speculative Result Cache.

in the instruction decode stage or, possibly, in the dispatch stage.

    5   ld   $5, 16($6)    not issued
    4   sub  $8, $6, $4    not issued
    3   muli $6, $4, 4     not issued
    2   addi $5, $4, 8     not issued
    1   ld   $4, 16($sp)   issued
    (instructions listed in reverse program order; instruction 1 is earliest)

Figure 2: The inputs to an instruction that is blocked in the issue stage can be predicted by querying the Speculative Result Cache using the input operands of the previous instructions on which the blocked instruction is dependent.

As shown in Figure 3, in addition to storing a result value, each entry in the SRC contains: 1) a tag field that contains the lower 16 bits of each of the operand values that produced this result; 2) the corresponding result itself; 3) a valid bit; and 4) a 4-bit confidence counter. We used only the lower 16 bits of both operands as the tag, since the SRC is speculative in nature and most repeated operands have fewer than 16 significant bits. The size of each entry is about 9 bytes, while a typical entry in a two-level predictor has more than 18 bytes [21]. If the valid bit is not set, the SRC signals the issue stage not to predict the output value, since the value stored in the SRC is invalid. This situation occurs when the hashed operand values have not been previously seen. The counter field determines the confidence of the prediction. If the value of the counter is below some threshold, the stored value is expected to be a poor prediction. This counter is incremented by a predetermined amount when an access of the SRC produces a successful prediction, and is decremented otherwise. By the time an instruction reaches the issue stage, the result of the SRC query will be available. If this instruction cannot be issued due to data dependences, the issue logic can choose either to issue the instruction speculatively with the predicted operand values from the SRC, or to wait for the actual operand values to become available.
The second choice forces the existing dependence to be completely resolved, and so eliminates the chance to increase the instruction issue rate. However, it also eliminates the chance of a potentially expensive misprediction. Figure 4 depicts a possible implementation of a superscalar processor with an SRC. Since different types of instructions will produce different output results when given the same input operand values, we partition the SRC into several different sections by using an op-type field (see Figure 3). Each section corresponds to a single instruction type. In the following experiments, we base this partitioning on the most common types of instructions that occur in the SPEC95 integer benchmarks when
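One section of the SRC, with the fields described above, can be sketched as follows. This is a sketch under stated assumptions, not the paper's hardware design: the XOR index, 16-bit per-operand tags, and 4-bit counter come from the text, while the confidence threshold, counter step, and fill policy are illustrative choices.

```c
/* Sketch of one op-type section of the Speculative Result Cache.
 * Index: XOR of the lower n bits of the two operand values.
 * Tag: lower 16 bits of each operand. A 4-bit saturating confidence
 * counter gates predictions. Threshold/step values are assumptions. */
#include <stdint.h>

#define SRC_INDEX_BITS 12                  /* n = 12 -> 4K entries per section */
#define SRC_ENTRIES    (1u << SRC_INDEX_BITS)
#define CONF_MAX       15                  /* 4-bit saturating counter */
#define CONF_THRESHOLD 8                   /* assumed threshold */

typedef struct {
    uint16_t tag1, tag2;   /* lower 16 bits of each operand value */
    uint32_t result;
    uint8_t  conf;         /* 4-bit confidence counter */
    uint8_t  valid;
} SrcEntry;

static SrcEntry src[SRC_ENTRIES];          /* one op-type section */

static uint32_t src_index(uint32_t op1, uint32_t op2) {
    return (op1 ^ op2) & (SRC_ENTRIES - 1);
}

/* Predict only on a valid, tag-matching, sufficiently confident entry. */
int src_predict(uint32_t op1, uint32_t op2, uint32_t *pred) {
    SrcEntry *e = &src[src_index(op1, op2)];
    if (e->valid && e->tag1 == (uint16_t)op1 && e->tag2 == (uint16_t)op2 &&
        e->conf >= CONF_THRESHOLD) {
        *pred = e->result;
        return 1;
    }
    return 0;
}

/* After the instruction actually executes, train the entry. */
void src_update(uint32_t op1, uint32_t op2, uint32_t result) {
    SrcEntry *e = &src[src_index(op1, op2)];
    if (e->valid && e->tag1 == (uint16_t)op1 && e->tag2 == (uint16_t)op2) {
        if (e->result == result) {
            if (e->conf < CONF_MAX) e->conf++;        /* good prediction */
        } else {
            e->result = result;                        /* replace and penalize */
            if (e->conf > 0) e->conf--;
        }
    } else {
        /* First sighting of these operands: fill the entry.
         * Starting at the threshold (so the next hit may predict)
         * is an assumption, not specified in the paper. */
        e->tag1 = (uint16_t)op1; e->tag2 = (uint16_t)op2;
        e->result = result; e->conf = CONF_THRESHOLD; e->valid = 1;
    }
}
```

Because the index and tag depend only on the operand values, any later instruction of the same type that presents the same operands hits the same entry, which is exactly how the SRC exploits operand value locality globally.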

Figure 3: A possible implementation of the SRC. (Each entry holds an op-type, a tag formed from Operand1 and Operand2, the result value, a confidence counter, and a valid bit; the two operand values are concatenated and hashed to form the index.)

Figure 4: A superscalar processor implementation with a Speculative Result Cache. (The SRC sits alongside the instruction cache, branch predictor, decode, dispatch, execution core, and reorder buffer.)

executed by the processor model in the SimpleScalar [1] simulation tool set. The six most common and predictable instruction categories found are addition (including subtraction), left-shift, right-shift, load, set-less-than (SLT), and logical (or, and, not) instructions. Since the logical instructions require three separate sections of the SRC, we have a total of 8 types of operations. Most of these instructions are also common in other RISC instruction set architectures, but SLT is more platform-dependent. Since many branch instructions work with SLT in the SimpleScalar instruction set, we predict SLT instructions in our study to provide early resolution of branches. As shown in Figure 5, these types of instructions comprise from 62% to 80% of the total instructions executed in each of the benchmark programs.

Figure 5: The types of instructions that are applicable to the SRC (add/sub, load, logical, shift-left, shift-right, SLT, and other) and their frequency distribution.

We focus on improving the performance of only integer programs in this work, since it tends to be more difficult to obtain high levels of parallelism from these programs than from floating-point programs. Thus, each entry in the SRC must be able to store a single integer value. The operand values are exclusive-ORed [17] to index into the SRC. We use the lower n bits for this hash index instead of the full-length operand values. With these n bits, each instruction type requires 2^n entries to store the corresponding result values. Since the SRC used in the following experiments is partitioned into eight different instruction types, the overall SRC structure requires 2^(n+3) entries. In order to pick a reasonable n for our evaluation, we limit the total SRC size to be comparable to the typical size of the level-1 cache of current RISC processors. For example, with 32K entries, n = 12.
Figure 6 shows the prediction accuracy of the SRC with 4K to 32K total entries, where the prediction accuracy is calculated as the number of correctly predicted instructions divided by the total number of predicted instructions. We can see that the raw prediction accuracy of the SRC is quite high, ranging from 94.5% to 98%. Furthermore, this accuracy is relatively insensitive to the SRC size for the tested programs. The percentage of instructions that are not predicted by the SRC, either because the instruction type is not applicable or because the confidence counter in the SRC is below the threshold, decreases slightly as the size of the SRC increases, as shown in Figure 7. However, the slopes of the curves are not particularly steep. The difference in prediction accuracy between the 4K and 32K cases is at most 8%, indicating that

the SRC adequately captures the operand value locality within a certain size of execution window. This result also suggests that the number of different operand combinations in these windows is not large. Hence, we conclude that a 4K SRC is sufficient for the programs tested.

Figure 6: The prediction accuracy of the SRC.

4 The Combined Dynamic Predictor (CDP)

While the SRC utilizes the operand value locality of programs, most of the existing predictors [10, 13, 21] exploit the instruction output value locality. That is, they simply assume that each instruction will produce regular outputs across different instances of its execution. Consequently, saving the results produced in the previous executions of instructions will help predict what value a specific instruction is likely to produce next. Since the SRC and these predictors exploit different aspects of program behavior, combining the SRC with one of the other predictors may produce higher prediction accuracy and, hence, better performance.

4.1 Implementation of the Combined Dynamic Predictor

The hybrid predictor presented in [21], as shown in Figure 8, is one of the most sophisticated and effective predictors proposed so far [13, 19, 21]. Hence, we combine our SRC with the hybrid predictor for our subsequent experiments. We call this new predictor the Combined Dynamic Predictor (CDP). The hybrid predictor combines a two-level predictor and a stride-based predictor. In the two-level predictor portion, the most recent 4 unique values produced by an instruction's previous executions are stored in each entry of the value history table (VHT). An additional 2p bits are used for the Value History Pattern (VHP) portion of the table to record the order in which the 4 unique values were produced in the most recent executions. Here p refers to the history depth that can be stored. Since 4 unique values

Figure 7: The percentage of instructions not predicted by the SRC.

Figure 8: A simplified description of the hybrid predictor described in [21]. (Each entry, indexed by a hash of the instruction address, holds a tag, state, stride, LRU information, the data values, and a value history pattern that indexes the pattern history table.)

are stored and each history entry needs 2 bits to indicate which of the four values occurred at that point, a total of 2p bits are needed. This VHP is then used to index the second level of the predictor, the Pattern History Table (PHT), which is essentially a table of confidence counters for each stored value. Of the given four values, the one with the highest confidence is picked as the predicted value. If the stored confidence is higher than the confidence threshold, the predictor will return one of the four unique values as the predicted result. If none of the confidence values exceeds the threshold, no prediction is returned. The stride predictor stores an extra value called the stride, which is the difference between the last two values produced by the same instruction. The stride predictor always returns last value produced + stride as the predicted value. As its name suggests, the stride predictor is effective when an instruction produces a sequence of values with a constant stride. When the hybrid predictor is queried, the result returned by the two-level predictor has higher priority, which means the stride predictor is used only when the two-level predictor does not return a predicted value. Combining this hybrid predictor with our SRC produces the Combined Dynamic Predictor (CDP) shown in Figure 9. Note that the SRC is indexed with the operand values while the hybrid predictor is indexed with an instruction's address. When the CDP is accessed, both the hybrid predictor and the SRC simultaneously go to work to produce two predicted values. The processor then selects one of these predicted values for speculative execution. We next evaluate several different policies for determining which of the two predicted values should be used when each speculated instruction is issued.

Figure 9: The Combined Dynamic Predictor (CDP). (A decoded instruction queries the SRC with its operand values and the hybrid predictor with its instruction address; a multiplexer selects between the two predicted values.)
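The stride component described above is simple enough to sketch directly. This is an illustrative model, not the hardware of [21]: the table size and PC hash are assumptions, and the table here is untagged for brevity.

```c
/* Sketch of the stride component of the hybrid predictor: each entry,
 * indexed by instruction address, keeps the last value produced and the
 * stride (difference between the last two values), and predicts
 * last value + stride. Sizes and the PC hash are illustrative. */
#include <stdint.h>

#define ST_ENTRIES 4096

typedef struct {
    uint32_t last;     /* last value produced by this instruction */
    int32_t  stride;   /* difference between the last two values */
    int      valid;
} StrideEntry;

static StrideEntry st[ST_ENTRIES];

static uint32_t st_index(uint32_t pc) { return (pc >> 2) % ST_ENTRIES; }

/* The stride predictor always returns last + stride once trained. */
int stride_predict(uint32_t pc, uint32_t *pred) {
    StrideEntry *e = &st[st_index(pc)];
    if (!e->valid) return 0;
    *pred = e->last + (uint32_t)e->stride;
    return 1;
}

/* Train on the value actually produced by this execution. */
void stride_update(uint32_t pc, uint32_t value) {
    StrideEntry *e = &st[st_index(pc)];
    if (e->valid) e->stride = (int32_t)(value - e->last);
    e->last = value;
    e->valid = 1;
}
```

For the loop-index addu instructions in Figure 1, the produced values 1, 2, 3, ... give a constant stride of 1, so after two executions every subsequent value is predicted correctly.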
4.2 Selecting Between the Component Predictors

We evaluated three policies for selecting between the component predictors: 1) always select the hybrid predictor first; 2) always select the SRC first; and 3) select the predictor with the higher confidence first. Figure 10 shows the prediction accuracy of the CDP using these different component predictor selection policies. The raw prediction accuracy is again calculated as the number of correctly predicted instructions divided by the total number of instructions predicted. We used 4K entries in both the hybrid predictor and the SRC. The prediction accuracies of the hybrid predictor alone with 8K entries and the SRC alone with 16K entries are also compared. Note that the 16K-entry SRC and the 8K-entry hybrid predictor each consume approximately the same amount of total storage as the CDP with its 4K-entry hybrid and 4K-entry SRC component predictors.
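The third policy can be sketched as a small selection function. This assumes each component reports its prediction together with a confidence value on a common scale, which is an assumption for illustration; the paper does not specify how the two confidence counters are compared.

```c
/* Sketch of the CDP's "higher confidence first" selection policy.
 * Each component predictor reports whether it predicts, the predicted
 * value, and a confidence score (assumed comparable across components). */
#include <stdint.h>

typedef struct {
    int      valid;    /* did this component return a prediction? */
    uint32_t value;    /* the predicted value */
    int      conf;     /* confidence score */
} Prediction;

/* Returns 1 and sets *out when at least one component predicts;
 * when both predict, the more confident one wins (ties go to the SRC,
 * an arbitrary illustrative choice). */
int cdp_select(Prediction src, Prediction hp, uint32_t *out) {
    if (src.valid && hp.valid) {
        *out = (src.conf >= hp.conf) ? src.value : hp.value;
        return 1;
    }
    if (src.valid) { *out = src.value; return 1; }
    if (hp.valid)  { *out = hp.value;  return 1; }
    return 0;
}
```

The other two policies simply replace the confidence comparison with a fixed preference for one component, falling back to the other when the preferred one declines to predict.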

Figure 10: The raw prediction accuracy of the different selection policies that can be used in the CDP.

We can see that the prediction accuracy of the CDP is about 10% to 14% higher than the hybrid-only case, and about 3% to 30% lower than the SRC-only case. However, since different predictors may return different prediction decisions for the same instruction, the numbers of instructions that are not predicted by the various predictors differ, as shown in Figure 11. Thus, the prediction accuracies shown in Figure 10 cannot be directly compared. Instead, we must normalize the number of instructions correctly predicted in each case to the same base, which is simply the total number of instructions executed by each program. This gives the effective prediction accuracy = number of instructions predicted correctly / total number of instructions executed. Using this metric, Figure 12 shows that the CDP outperforms the other two predictors for all of these test programs. Among the component predictor selection policies, the higher-confidence-first and the SRC-first policies produce essentially the same prediction accuracies. Thus, a nice feature of the CDP is that its effective prediction accuracy is not particularly sensitive to the choice of this policy. For the remainder of this study, we use the higher-confidence-first policy.

4.3 Resource Allocation Among the Component Predictors

Now that the component-selection policy has been examined, we must next determine how to allocate the total available storage for the prediction hardware between the two component predictors. We evaluate the different configurations of the CDP shown in Table 1. In the first line of this table, approximately 1/3 of the total available storage is devoted to the SRC, with the remaining 2/3 devoted to the hybrid predictor.
We denote this configuration as (1/3, 2/3). The second line approximates the (1/2, 1/2) case, and the third line the (3/4, 1/4) case. The remaining two configurations devote all of the available resources to one predictor alone. Evaluating these five combinations will tell us how the available resources should be partitioned among the component predictors. The goal is to keep the total available storage for the overall predictor roughly constant. However, if the number of entries in a predictor is not a power-of-two, the utilization of the value buffers will be uneven. Consequently, the sizes of the

Figure 11: The percentage of not-predicted instructions for the different selection policies.

Figure 12: The effective prediction accuracy of the CDP selection policies.
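The difference between the raw accuracy of Figure 10 and the effective accuracy of Figure 12 can be sketched with hypothetical counts; the numbers below are illustrative only, not measurements from the paper.

```python
def raw_accuracy(correct, predicted):
    # raw accuracy = correctly predicted / instructions predicted
    return correct / predicted

def effective_accuracy(correct, executed):
    # effective accuracy = correctly predicted / instructions executed,
    # so predictors that decline different numbers of instructions
    # are normalized to the same base
    return correct / executed

# Hypothetical counts: 100M instructions executed; the predictor
# attempts 40M of them and gets 38M right.
executed, predicted, correct = 100e6, 40e6, 38e6
print(raw_accuracy(correct, predicted))       # 0.95
print(effective_accuracy(correct, executed))  # 0.38
```

A predictor can thus report a very high raw accuracy while contributing little overall if it declines to predict most instructions, which is why the effective metric is used to compare the CDP against its components.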

Line  SRC   Hybrid  Total Size  % SRC  Label
1     4K    4K      124KB        29%   (1/3, 2/3)
2     8K    4K      160KB        45%   (1/2, 1/2)
3     16K   2K      184KB        76%   (3/4, 1/4)
4     16K   0K      144KB       100%   SRC-only
5     0K    8K      176KB         0%   HP-only

Table 1: Predictor configurations used to determine how the available storage resources should be allocated among the component predictors of the CDP.

component predictors in Table 1 are chosen to be the power-of-two sizes that most closely approximate the desired distribution of resources. The actual percentage of the total storage allocated to the SRC predictor in each configuration is shown in the % SRC column. Figures 13 - 15 show the raw prediction accuracy, the percentage of instructions not predicted, and the effective prediction accuracy for these configurations of the CDP. We can see that the prediction accuracy of the SRC is still the best among all of the configurations, while the HP-only case trails the others by a margin of up to 24%. For the three combinations of the hybrid predictor and the SRC that compose the CDP, the (3/4, 1/4) configuration outperforms the other two by a small margin. This is expected since the SRC has a very high prediction accuracy; hence, devoting more resources to the SRC can reduce the raw misprediction rate. However, since the SRC predicts a smaller portion of the total number of instructions than the hybrid predictor, as shown in Figure 14, a hybrid predictor of considerable size is needed to provide a prediction for the instructions rejected by the SRC. As measured by the effective prediction accuracy in Figure 15, the three CDP configurations outperform the SRC-only and the HP-only cases, with the (3/4, 1/4) resource allocation again outperforming the others. Since the (3/4, 1/4) configuration has more total storage than the other two CDP configurations, we conclude that there is only a slight preference for devoting more resources to the SRC predictor than to the hybrid predictor when they are used as components of the CDP.
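The storage accounting behind Table 1 can be sketched as follows. The per-entry sizes used here are our own assumptions, chosen only to illustrate the bookkeeping; they are not the paper's entry formats, and the function is a sketch of the calculation, not a claim about the actual hardware budgets.

```python
# Assumed per-entry sizes in bytes; illustrative only, not taken
# from the paper.
SRC_ENTRY_BYTES = 9
HP_ENTRY_BYTES = 22

def cdp_storage(src_entries, hp_entries):
    """Return (total KB, % of storage devoted to the SRC) for one
    CDP configuration, mirroring the 'Total Size' and '% SRC'
    columns of Table 1."""
    src_bytes = src_entries * SRC_ENTRY_BYTES
    hp_bytes = hp_entries * HP_ENTRY_BYTES
    total = src_bytes + hp_bytes
    pct_src = 100.0 * src_bytes / total if total else 0.0
    return total // 1024, round(pct_src)

# The (1/3, 2/3) configuration of Table 1, line 1: 4K SRC entries
# plus 4K hybrid-predictor entries.
print(cdp_storage(4096, 4096))  # (124, 29)
```

The point of the sketch is that with fixed per-entry sizes and power-of-two entry counts, only a few discrete storage splits are reachable, which is why the configurations in Table 1 only approximate the desired (1/3, 2/3), (1/2, 1/2), and (3/4, 1/4) fractions.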
5 Performance Potential

In this section, we evaluate the performance benefit of the SRC and the CDP when used to perform value prediction within a pipelined superscalar processor. The results are compared to the hybrid-only configuration.

5.1 Simulation Environment

We use the SimpleScalar Tool Set Version 3.0 [1] for all of our experiments. The sim-outorder detailed superscalar simulator in this tool set is modified to accommodate value prediction with the hybrid predictor [7], the SRC (Section 3), and the CDP (Section 4). Our base processor model is a 16-issue superscalar processor with a 64KB level-1 instruction cache, a 64KB level-1 data cache, and a 1MB unified level-2 cache. Thirty-two ALUs and eight MULT/DIV units are supplied in the execution core. The miss recovery

Figure 13: The prediction accuracy of the CDP for the different resource allocation configurations shown in Table 1.

Figure 14: The percentage of instructions not predicted by the CDP for the different resource allocation configurations shown in Table 1.

Figure 15: The effective prediction accuracy of the CDP for the different resource allocation configurations shown in Table 1.

penalty of value prediction is set to 3 cycles. The number of entries in the load/store queue and the reorder buffer is 256, since instructions dependent on predicted values are issued and wait in those buffers to be committed. All of the predicted values are verified in the write-back stage of the execution pipeline and squashed if they are mispredicted. The mispredicted instructions, including the instructions dependent on their values, are re-executed after the misprediction recovery penalty. The hybrid predictor part of the CDP uses a 4-bit confidence counter with a prediction threshold of 7. It is configured to increment the counter by 1 when a correct prediction is made, and to decrement it by 7 when an incorrect prediction is made [3]. The SRC part adopts a similarly configured confidence counter, except that it increments by 3 and decrements by 1, since the operand value locality is relatively higher. Six programs from the SPECint 95 benchmark suite are tested in our studies: Compress, Gcc, Go, Ijpeg, Li, and M88Ksim 1. Compress and Go use the train input size, while the other programs use the test input size. All of the program binaries were downloaded from the SimpleScalar website [1].

5.2 The SRC-Only Case

As shown in Figure 16, the SRC by itself can produce speedups of 1% - 6% with 4K entries, and 1.5% - 7% when the size increases to 32K entries. The changes in speedup as the size of the SRC varies are not very dramatic, even though the size of the SRC increases by a factor of 8. Only M88Ksim shows a large improvement when the size of the SRC changes from 4K entries to 8K entries.
We conclude that, for these test programs, an SRC of 4K - 8K entries is sufficient to capture most of the available operand value locality.

1 Note to the reviewers: We will include the results for Perl in the final version of this paper.
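The confidence-counter configuration described in Section 5.1 (a 4-bit saturating counter with a prediction threshold of 7; +1/-7 updates for the hybrid part and +3/-1 for the SRC part) can be sketched as follows. The class structure is our own illustration of those update rules:

```python
class ConfidenceCounter:
    """Saturating confidence counter with a prediction threshold,
    configured as in Section 5.1 (4 bits, threshold 7)."""

    def __init__(self, inc, dec, threshold=7, bits=4):
        self.max = (1 << bits) - 1   # 15 for a 4-bit counter
        self.inc, self.dec = inc, dec
        self.threshold = threshold
        self.value = 0

    def may_predict(self):
        # A prediction is only used once confidence reaches the threshold.
        return self.value >= self.threshold

    def update(self, correct):
        # Saturate at the top on correct predictions, at zero on
        # incorrect ones.
        if correct:
            self.value = min(self.max, self.value + self.inc)
        else:
            self.value = max(0, self.value - self.dec)

# Hybrid part: +1 on a correct prediction, -7 on an incorrect one [3].
hp = ConfidenceCounter(inc=1, dec=7)
# SRC part: +3 / -1, since operand value locality is relatively higher.
src = ConfidenceCounter(inc=3, dec=1)
for _ in range(3):
    src.update(correct=True)
print(src.may_predict())  # True: three correct reuses reach 9 >= 7
```

With these parameters the SRC part reaches the threshold after three consecutive correct reuses, while the hybrid part needs seven, and a single misprediction drops the hybrid part's confidence by 7.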

Figure 16: The speedup potential of the SRC with different sizes.

5.3 The Combined Dynamic Predictor (CDP)

The Combined Dynamic Predictor (CDP) produces speedups varying from -1% to 10%, as shown in Figure 17. The CDP is unable to produce any speedup for Go, but is able to reasonably improve the performance of all of the other programs. The speedup of the (1/2, 1/2) configuration indicates that splitting the available storage resources equally between the two component predictors tends to produce the best overall performance. However, the performance differences between the three CDP allocations studied are quite small within a single application program. Furthermore, splitting the available resources between these two component predictors produces better performance than using only a single type of predictor alone for all of the programs except Compress and Go, for which the SRC-only configuration produces slightly greater speedups than the split CDP configurations. Since the total storage taken by the CDP is close to 128KB, we also compare against the case of using a larger level-1 cache instead of allocating any resources to value predictors. With this configuration, the level-1 data and level-1 instruction caches are increased from 64KB to 128KB each. Figure 17 shows that Gcc and Go strongly favor the larger cache, while the other programs perform better when the resources are used for value prediction instead of simply making the caches larger. For M88Ksim, a larger cache does not provide any performance benefit.

6 Conclusion

Most of the existing value predictors [8, 10, 13, 21] exploit instruction output value locality. The Result Cache [17], in contrast, exploits operand value locality by reusing the results produced by any of the previously executed instructions of the same type. Due to its non-speculative nature, however, the result

Figure 17: Processor speedups when using the various value prediction mechanisms shown in Table 1 compared to allocating the predictor resources to a larger cache.

cache can only be accessed when the instructions are ready to be issued. Since most integer operations in pipelined superscalar processors take one cycle to finish, this reuse-based result cache produces only limited benefit for integer programs. In this paper, we extended the result cache to support speculative execution with value prediction. We call this new scheme the Speculative Result Cache (SRC). Simulations based on the SimpleScalar Tool Set showed that the SRC can produce 94% - 98% raw prediction accuracies and up to 7% speedup on a 16-issue superscalar processor. Since the SRC exploits a different kind of value locality than most of the existing predictors, we evaluated combining the SRC with an existing predictor to produce better prediction accuracy and performance potential. Among the existing value predictors, the hybrid predictor [21] has been demonstrated to have good prediction accuracy and substantial performance benefit. Thus, we proposed a new predictor, called the Combined Dynamic Predictor (CDP), that combines the hybrid predictor and the SRC. This new predictor exploits both instruction output value locality and operand value locality by dynamically selecting between the predicted values produced by its two component predictors. The higher confidence first policy is shown to have raw prediction accuracies of up to 96%. Using a CDP based on this policy produced speedups of up to 10% on a 16-issue superscalar processor for the SPEC95 integer programs tested. This performance is better than that of either of its component predictors used separately. We conclude that the best value prediction for integer programs is obtained by dynamically selecting between a predictor that exploits instruction output value locality and one that exploits operand value locality.
Furthermore, our results indicate that, for most of the programs tested, it is better to allocate some resources to this type of value prediction than to simply increase the size of the level-1 caches.

References

[1] D. Burger, T. Austin, and S. Bennett. The SimpleScalar Tool Set, Version 2.0. Technical Report 1342, Computer Science Department, University of Wisconsin, Madison.
[2] F. Gabbay and A. Mendelson. Using Value Prediction to Increase the Power of Speculative Execution Hardware. ACM Transactions on Computer Systems, Vol. 16, No. 3, August 1998.
[3] B. Calder, G. Reinman, and D. Tullsen. Selective Value Prediction. In Proceedings of the 26th Int'l Symposium on Computer Architecture (ISCA), Atlanta, May 1999.
[4] J. Gonzalez and A. Gonzalez. The Potential of Data Value Speculation to Boost ILP. In Proceedings of the 1998 Int'l Conference on Supercomputing (ICS 98), July 1998.
[5] S. Harbison. An Architectural Alternative to Optimizing Compilers. In Proceedings of the Int'l Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 1982.
[6] J. Huang and D. Lilja. Exploiting Basic Block Value Locality with Block Reuse. In Proceedings of the 5th Int'l Symposium on High Performance Computer Architecture (HPCA-5), Orlando, Jan. 9-13, 1999.
[7] S.J. Lee. Several data value predictors. misc.html
[8] M. Lipasti, C. Wilkerson, and J. Shen. Value Locality and Load Value Prediction. In Proceedings of the 7th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), October 1996.

[9] M. Lipasti and J. Shen. Exceeding the Dataflow Limit via Value Prediction. In Proceedings of the 29th Annual ACM/IEEE Int'l Symposium on Microarchitecture (MICRO-29), Dec. 2-4, 1996.
[10] T. Nakra, R. Gupta, and M.L. Soffa. Global Context-Based Value Prediction. In Proceedings of the 5th Int'l Symposium on High Performance Computer Architecture (HPCA-5).
[11] S. Oberman and M. Flynn. On Division and Reciprocal Caches. Stanford University Technical Report CSL-TR, April.
[12] Y. Patt, S. Patel, M. Evers, D. Friendly, and J. Stark. One Billion Transistors, One Uniprocessor, One Chip. IEEE Computer, Special Issue on Billion-Transistor Architectures, Vol. 30, No. 9, September 1997.
[13] B. Rychlik, J. Faistl, B. Krug, and J. Shen. Efficacy and Performance Impact of Value Prediction. In Proceedings of the Xth Int'l Symposium on Parallel Architectures and Compilation Techniques, Paris, October 1998, pages xxx-yyy.
[14] Y. Sazeides and J. Smith. The Predictability of Data Values. In Proceedings of the 30th Annual Int'l Symposium on Microarchitecture (MICRO 30), Dec. 1997.
[15] A. Sodani and G. Sohi. Dynamic Instruction Reuse. In Proceedings of the 24th Int'l Symposium on Computer Architecture (ISCA), June 1997.
[16] A. Sodani and G. Sohi. Understanding the Differences Between Value Prediction and Instruction Reuse. In Proceedings of the 31st Int'l Symposium on Microarchitecture (MICRO-31), 1998.
[17] S. Richardson. Caching Function Results: Faster Arithmetic by Avoiding Unnecessary Computation. Technical Report SMLI TR-92-1, Sun Microsystems Laboratories, September.
[18] J. Smith and S. Vajapeyam. Trace Processors: Moving to Fourth Generation Microarchitectures. IEEE Computer, Vol. 30, No. 9, September 1997.
[19] D. Tullsen and J. Seng. Storageless Value Prediction Using Prior Register Values. In Proceedings of the 26th Int'l Symposium on Computer Architecture (ISCA), Atlanta, May 1999.
[20] G. Tyson and T. Austin. Improving the Accuracy and Performance of Memory Communication Through Renaming. In Proceedings of the 30th Annual Int'l Symposium on Microarchitecture (MICRO 30), December 1997.
[21] K. Wang and M. Franklin. Highly Accurate Data Value Prediction Using Hybrid Predictors. In Proceedings of the 30th Annual Int'l Symposium on Microarchitecture (MICRO 30), Dec. 1997.


Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

SOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name:

SOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name: SOLUTION Notes: CS 152 Computer Architecture and Engineering CS 252 Graduate Computer Architecture Midterm #1 February 26th, 2018 Professor Krste Asanovic Name: I am taking CS152 / CS252 This is a closed

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

On Pipelining Dynamic Instruction Scheduling Logic

On Pipelining Dynamic Instruction Scheduling Logic On Pipelining Dynamic Instruction Scheduling Logic Jared Stark y Mary D. Brown z Yale N. Patt z Microprocessor Research Labs y Intel Corporation jared.w.stark@intel.com Dept. of Electrical and Computer

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

Towards a More Efficient Trace Cache

Towards a More Efficient Trace Cache Towards a More Efficient Trace Cache Rajnish Kumar, Amit Kumar Saha, Jerry T. Yen Department of Computer Science and Electrical Engineering George R. Brown School of Engineering, Rice University {rajnish,

More information

Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions

Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions Resit Sendag 1, David J. Lilja 1, and Steven R. Kunkel 2 1 Department of Electrical and Computer Engineering Minnesota

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Aneesh Aggarwal Electrical and Computer Engineering Binghamton University Binghamton, NY 1392 aneesh@binghamton.edu Abstract With the

More information

ONE-CYCLE ZERO-OFFSET LOADS

ONE-CYCLE ZERO-OFFSET LOADS ONE-CYCLE ZERO-OFFSET LOADS E. MORANCHO, J.M. LLABERÍA, À. OLVÉ, M. JMÉNEZ Departament d'arquitectura de Computadors Universitat Politècnica de Catnya enricm@ac.upc.es Fax: +34-93 401 70 55 Abstract Memory

More information

Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor

Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor Ying Chen, Resit Sendag, and David J Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Homework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures

Homework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang

More information

DFCM++: Augmenting DFCM with Early Update and Data Dependence-driven Value Estimation

DFCM++: Augmenting DFCM with Early Update and Data Dependence-driven Value Estimation ++: Augmenting with Early Update and Data Dependence-driven Value Estimation Nayan Deshmukh*, Snehil Verma*, Prakhar Agrawal*, Biswabandan Panda, Mainak Chaudhuri Indian Institute of Technology Kanpur,

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

For Problems 1 through 8, You can learn about the "go" SPEC95 benchmark by looking at the web page

For Problems 1 through 8, You can learn about the go SPEC95 benchmark by looking at the web page Problem 1: Cache simulation and associativity. For Problems 1 through 8, You can learn about the "go" SPEC95 benchmark by looking at the web page http://www.spec.org/osg/cpu95/news/099go.html. This problem

More information

Computer System Architecture Quiz #2 April 5th, 2019

Computer System Architecture Quiz #2 April 5th, 2019 Computer System Architecture 6.823 Quiz #2 April 5th, 2019 Name: This is a closed book, closed notes exam. 80 Minutes 16 Pages (+2 Scratch) Notes: Not all questions are of equal difficulty, so look over

More information

A Cache Scheme Based on LRU-Like Algorithm

A Cache Scheme Based on LRU-Like Algorithm Proceedings of the 2010 IEEE International Conference on Information and Automation June 20-23, Harbin, China A Cache Scheme Based on LRU-Like Algorithm Dongxing Bao College of Electronic Engineering Heilongjiang

More information

Applications of Thread Prioritization in SMT Processors

Applications of Thread Prioritization in SMT Processors Applications of Thread Prioritization in SMT Processors Steven E. Raasch & Steven K. Reinhardt Electrical Engineering and Computer Science Department The University of Michigan 1301 Beal Avenue Ann Arbor,

More information

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines 6.823, L15--1 Branch Prediction & Speculative Execution Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 6.823, L15--2 Branch Penalties in Modern Pipelines UltraSPARC-III

More information

Selective Value Prediction

Selective Value Prediction Pubshed in the Proceedings of the 26th International Symposium on Computer Architecture, May 1999. Selective Value Prediction Brad Calder Glenn Reinman Dean M. Tullsen Department of Computer Science and

More information

Fetch Directed Instruction Prefetching

Fetch Directed Instruction Prefetching In Proceedings of the 32nd Annual International Symposium on Microarchitecture (MICRO-32), November 1999. Fetch Directed Instruction Prefetching Glenn Reinman y Brad Calder y Todd Austin z y Department

More information

Chapter 13 Reduced Instruction Set Computers

Chapter 13 Reduced Instruction Set Computers Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining

More information

Midterm Exam 1 Wednesday, March 12, 2008

Midterm Exam 1 Wednesday, March 12, 2008 Last (family) name: Solution First (given) name: Student I.D. #: Department of Electrical and Computer Engineering University of Wisconsin - Madison ECE/CS 752 Advanced Computer Architecture I Midterm

More information

Static, multiple-issue (superscaler) pipelines

Static, multiple-issue (superscaler) pipelines Static, multiple-issue (superscaler) pipelines Start more than one instruction in the same cycle Instruction Register file EX + MEM + WB PC Instruction Register file EX + MEM + WB 79 A static two-issue

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

SEVERAL studies have proposed methods to exploit more

SEVERAL studies have proposed methods to exploit more IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 4, APRIL 2005 1 The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture Resit Sendag, Member, IEEE, Ying

More information

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Jong Wook Kwak 1, Seong Tae Jhang 2, and Chu Shik Jhon 1 1 Department of Electrical Engineering and Computer Science, Seoul

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Itanium 2 Processor Microarchitecture Overview

Itanium 2 Processor Microarchitecture Overview Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs

More information

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Speculation and Future-Generation Computer Architecture

Speculation and Future-Generation Computer Architecture Speculation and Future-Generation Computer Architecture University of Wisconsin Madison URL: http://www.cs.wisc.edu/~sohi Outline Computer architecture and speculation control, dependence, value speculation

More information

Selective Value Prediction

Selective Value Prediction Selective Value Prediction Brad Calder Glenn Reinman Dean M. Tullsen Department of Computer Science and Engineering University of Cafornia, San Die fcalder,greinman,tullseng@cs.ucsd.edu Abstract Value

More information

CS 152, Spring 2011 Section 8

CS 152, Spring 2011 Section 8 CS 152, Spring 2011 Section 8 Christopher Celio University of California, Berkeley Agenda Grades Upcoming Quiz 3 What it covers OOO processors VLIW Branch Prediction Intel Core 2 Duo (Penryn) Vs. NVidia

More information

Computer Architecture and Engineering. CS152 Quiz #4 Solutions

Computer Architecture and Engineering. CS152 Quiz #4 Solutions Computer Architecture and Engineering CS152 Quiz #4 Solutions Problem Q4.1: Out-of-Order Scheduling Problem Q4.1.A Time Decode Issued WB Committed OP Dest Src1 Src2 ROB I 1-1 0 1 2 L.D T0 R2 - I 2 0 2

More information

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1 Superscalar Processing 740 October 31, 2012 Evolution of Intel Processor Pipelines 486, Pentium, Pentium Pro Superscalar Processor Design Speculative Execution Register Renaming Branch Prediction Architectural

More information