Improving Multiple-block Prediction in the Block-based Trace Cache


Master's Thesis, May 1
Ryan Rakvic
Department of Electrical and Computer Engineering
Carnegie Mellon University, Pittsburgh, PA 15213
rnr@ece.cmu.edu

Abstract

Multiple-block prediction is emerging as a new and exciting research area. Highly accurate multiple-block predictors are essential for the wide instruction fetch mechanisms that will support future generations of microprocessors. The block-based trace cache is a recent proposal for wide instruction fetch. It aligns and stores instructions at the basic block level instead of at the trace level, thus significantly reducing instruction trace storage requirements. This paper investigates a new mechanism, the tree-based multiple-block predictor, which utilizes multiple-branch prediction techniques to improve trace construction in the fill unit. This new fill-time predictor does not replace but augments the fetch-time path-based next-trace predictor by improving the trace construction heuristic. The tree-based multiple-block predictor utilizes a tree structure that represents all possible paths beginning at a root block, where each tree node is a branch predictor. Both a bimodal scheme and two-level adaptive schemes are examined for these tree-node branch predictors. Results: The enhanced trace constructor using the tree-based multiple-block predictor improves performance of the SPECint95 benchmarks by % over [1]. It is observed that a two-level adaptive predictor outperforms bimodal by %. Finally, the block-based trace cache with the enhanced trace constructor improves performance 8% beyond that of perfect branch prediction and outperforms the conventional trace cache, with a perfect predictor, for instruction storage capacities up to 1KB.

1 Introduction

Microprocessor performance has been increasing at a phenomenal rate. This is accomplished by increasing both the clocking rate and the instruction-level parallelism (measured in IPC).
To further increase performance, it is necessary for future processors to increase their instruction fetch bandwidth. The trace cache [1][13] has been proposed as an effective way to fetch multiple blocks of instructions in every cycle. More recently, the block-based trace cache [1] has been proposed as an efficient way to implement the trace cache. Fundamental to the operation of the trace cache is the ability to predict a sequence of basic blocks, or a trace, that will be executed. We refer to this as the task of multiple-block prediction.

While [1] demonstrated that the block-based trace cache can be very efficient in its trace storage, the multiple-block predictor used in that design was quite simplistic and was a limiting factor to achieving higher performance. In this paper we introduce a new multiple-block predictor that achieves higher prediction accuracy than the simplistic design and is able to significantly increase the performance of the block-based trace cache.

1.1 Previous Work

Multiple-block predictors are derived from single-branch prediction techniques. For example, the two-bit predictor from Smith [1] and the pattern predictors of Yeh and Patt [19][] are employed in multiple-block predictors. Consequently, all current multiple-block predictors are very similar. The differences between these predictors are determined by: 1) the type of execution history used, 2) the prediction scheme, and 3) the unit of prediction. The execution history typically involves a combination of branch direction bits and path history (instruction addresses). The prediction scheme varies from the two-bit saturating counter to two-level adaptive schemes, and the prediction unit is either a basic block or a trace of instructions. Yeh and Patt [18] introduced one of the first multiple-block predictors. It is an extension of the two-level adaptive branch predictor and incorporates a Branch Address Cache, yielding a combination that predicts both branch directions and branch targets. Their mechanism uses branch direction history to predict basic blocks with a two-level adaptive scheme. Later, [][3][15] improved on Yeh and Patt by introducing cache index prediction, the collapsing buffer, and efficient multi-ported branch target buffers, respectively. All of these works use branch direction history to predict basic blocks with a two-bit predictor. Path predictors, introduced by [11], use path address history to index the predictor. [1] and [9] combined branch direction history with path history to index the predictor.
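As a concrete illustration of these building blocks, the sketch below implements a two-bit saturating counter (the Smith-style predictor mentioned above) and a minimal two-level adaptive predictor that uses the last k branch directions to index a pattern history table. This is a generic textbook sketch under our own assumptions (class names and the choice of k are ours), not the simulator used in this thesis.

```python
class TwoBitCounter:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""
    def __init__(self, state=2):
        self.state = state  # start weakly taken
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

class TwoLevelPredictor:
    """Two-level adaptive scheme: k bits of branch-direction history select one
    two-bit counter in a pattern history table of 2**k entries."""
    def __init__(self, k=4):
        self.k, self.history = k, 0
        self.pht = [TwoBitCounter() for _ in range(2 ** k)]
    def predict(self):
        return self.pht[self.history].predict()
    def update(self, taken):
        self.pht[self.history].update(taken)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)
```

After a short warm-up, the two-level scheme learns repeating direction patterns (such as a strictly alternating branch) that a single counter cannot capture.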
[1] predicts basic blocks with a last path (i.e. predict the same path as last time) scheme and [9] predicts traces using two-bit confidence counters. Another major step in multiple-block prediction is the use of tree-like structures that capture traces of instructions. Dutta [5] used a tree structure to capture portions of the control flow graph and four-bit counters to pick which path through

the tree will be predicted. A tree structure is also used in [13] to predict traces with a two-bit predictor.

1.2 The Tree-based Multiple-block Predictor

When a trace cache is used, multiple-block prediction occurs in two phases. Essentially, the trace cache approach attempts to move work from the front end of the machine to the back end. The work done in the back end is not in the critical performance path, which allows the front end to be faster. As shown in Figure 1, the first phase of multiple-block prediction is performed at instruction fetch time, and the second phase is performed at instruction completion time. At instruction fetch time, a predictor explicitly predicts the next trace to be fetched by using global execution history information to index into the trace cache. We refer to this as the fetch-time predictor. At instruction completion time, a trace is formed by the fill unit and loaded into the trace cache based on certain trace construction heuristics. The implicit multiple-block prediction implied by the trace is predetermined, or predicted, by the trace construction mechanism. We refer to this as the fill-time predictor, which determines the sequence of basic blocks that constitutes the trace and which traces are to be loaded into the trace cache for use in future fetch-time prediction.

Figure 1 - Multiple-block prediction in the trace cache (next trace selection feeds the trace cache at fetch time; the fill unit constructs traces from completing instructions at fill time).
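The two phases can be caricatured as a lookup at fetch time and an install at completion time. The sketch below is a deliberately minimal model under our own assumptions (a dictionary keyed by execution history, four-block traces); it only illustrates the division of labor between the two predictors, not the actual trace cache organization.

```python
# Hypothetical two-phase model: explicit prediction at fetch time,
# implicit prediction (trace construction) at fill time.
trace_cache = {}  # execution history -> tuple of block ids (a "trace")

def fetch(history):
    """Fetch time: use execution history to predict the next trace (None on a miss)."""
    return trace_cache.get(history)

def complete(history, executed_blocks, max_len=4):
    """Fill time: the fill unit builds a trace from completing blocks and installs it."""
    trace_cache[history] = tuple(executed_blocks[:max_len])
```

The point of the split is that `complete` runs off the critical path, so the heuristics deciding what to install can be relatively elaborate without slowing down `fetch`.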

The fetch-time predictor used in the block-based trace cache [1] is similar to that of [9] and determines which trace will be fetched next. The fetch-time predictor is not the focus of this paper. In [1] the fill-time predictor of the block-based trace cache used a simple last path (i.e. predict the same path as last time) scheme that determined which instruction traces would be available for prediction at fetch time. This paper proposes a new fill-time predictor, the tree-based multiple-block predictor (TMBP), that enhances the trace constructor of the block-based trace cache. The TMBP employs prediction techniques to both construct traces and judiciously place into the trace cache only those traces that are likely to be predicted at fetch time. Effectively, the TMBP prevents instruction traces that will not be predicted at fetch time from being stored in the fetch-time predictor table. This improves the efficiency of trace predictor storage and further increases the prediction rate of the fetch-time next-trace predictor. Using this enhanced trace constructor, significant performance improvement can be obtained for the block-based trace cache. The proposed fill-time TMBP borrows and combines the best features of previous designs. It uses both branch direction and path history, and employs a two-level tree-based adaptive scheme. Implementation issues and details are discussed in Section 2.1.

2 Implementation

The tree-based multiple-block predictor (TMBP) is investigated using the block-based trace cache [1]. In the block-based trace cache, illustrated in Figure 2, an instruction trace is not explicitly stored as one physical unit as in the conventional trace cache [1][13]. Instead, only pointers to the basic blocks that constitute a trace are stored in a structure called the trace table (i.e. the fetch-time predictor table). Besides the trace table, there are three other major components to the block-based trace cache: the block cache, the rename table, and the fill unit.
The block cache stores aligned instruction blocks and is indexed by a renamed block-id. To support simultaneous multi-block accesses, the block cache is replicated (in effect, multi-ported). The trace table can be viewed as storing a short-hand representation of traces, since each of its entries contains the block-ids of a trace. The rename table implements the fetch address renaming of basic blocks,

and converts instruction fetch addresses to the block-ids which are used to index the block cache entries. The fill unit gathers instructions into physical blocks for renaming and storing in the block cache, and constructs traces of block-ids for future next-trace prediction via the trace table. The block-based trace cache performs a path-based next-trace prediction by accessing the trace table with path and branch history, similar to [9]. The trace table outputs the block-ids of the predicted next trace. These block-ids are used in the next cycle (fetch) to access the replicated block cache.

Figure 2 - The block-based trace cache (path-based next trace predictor, trace table, block cache, rename table, instruction cache, and fill unit with the tree-based multiple-block predictor).

Finally, the fill unit determines which traces are actually stored in the trace table and how instruction blocks are renamed. In previous work [1] the fill unit used a last path algorithm to construct traces of block-ids.

2.1 Tree-based Multiple-block Predictor Implementation

The proposed TMBP is part of the fill unit and is used at fill time to perform a multiple-block prediction that helps determine which basic blocks should be in a trace. Figure 3 illustrates the trace construction part of the fill unit, which is made up of the TMBP and the fill heuristic logic. The fill heuristic, described in Section 2.1.1, uses prediction information from the TMBP and instruction

completion information to determine which blocks to insert into a trace. The TMBP consists of two major parts: one performs the execution history hashing, and the other is the tree-based pattern history table (TPHT). Execution history hashing, described in Section 2.1.2, uses execution history, in the form of an execution path and/or directions of branches, and a hashing function to reduce the execution history bits to a more manageable size for indexing into the TPHT. The TPHT, described in Section 2.1.3, supports a two-level adaptive scheme for multiple-block prediction.

Figure 3 - The tree-based multiple-block predictor in the trace constructor of the fill unit (execution history buffer, hashing function, TPHT, and fill heuristic logic).

2.1.1 Fill Heuristic Logic

The fill heuristic logic determines which blocks make up a trace and inserts the associated block-ids of that trace into the trace table. The original block-based trace cache presented in [1] used a

simple heuristic, illustrated in Figure 4, that is based solely on a last path scheme. It used a greedy algorithm to construct traces, which viewed every executed block as a potential start of a new trace. Such a greedy algorithm is useful because it creates every possible trace from the completing sequence of instruction blocks; however, it can lead to unnecessary traces that consume trace storage capacity. The new heuristic proposed in this work, also shown in Figure 4, only constructs non-overlapping traces, which can lead to a smaller total number of traces and more efficient trace storage. A trace is terminated when the limit for maximum trace length is reached. Traces are started with the oldest completed block or after a branch misprediction. In addition to the new heuristic, the fill heuristic logic uses multiple-block prediction information from the TMBP to determine which block-ids should be in a trace.

Figure 4 - Trace construction heuristics (for the completed block sequence shown, the old greedy algorithm produces the traces ABCD, BCDE, CDEF, DEFC, EFCD, and FCDE, while the new heuristic produces only ABCD and EFCD).

2.1.2 Execution History and Hashing

The execution history gathered by a predictor is critical. McFarling [1] showed that both branch address and branch direction history improve performance significantly. Nair [9] showed the effectiveness of path history in the prediction of a single branch, while [1][9] use path history along with branch directions to predict the next instruction trace. This work investigates the possible choices for global execution history in Section 4.2, and concludes that a simple global path history is sufficient for predicting the next trace.
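Using the block sequence from Figure 4, the two heuristics can be sketched as sliding versus non-overlapping windows over the stream of completed blocks. This is our own illustrative code (the block names and four-block trace limit follow the figure), not the fill unit's actual logic, which also consults the TMBP and restarts traces after mispredictions.

```python
def greedy_traces(blocks, max_len=4):
    """Old greedy scheme: every completed block may start a trace (overlapping windows)."""
    return [tuple(blocks[i:i + max_len]) for i in range(len(blocks) - max_len + 1)]

def nonoverlapping_traces(blocks, max_len=4):
    """New heuristic: each trace starts at the oldest block not yet placed in a trace,
    so constructed traces never overlap."""
    return [tuple(blocks[i:i + max_len])
            for i in range(0, len(blocks) - max_len + 1, max_len)]

blocks = list("ABCDEFCDE")            # completed block sequence from Figure 4
print(greedy_traces(blocks))          # six overlapping traces: ABCD, BCDE, ..., FCDE
print(nonoverlapping_traces(blocks))  # only ABCD and EFCD
```

The greedy scheme generates six candidate traces from this nine-block sequence, while the non-overlapping scheme generates two, which is the storage saving the new heuristic exploits.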

Hashing is a common method used to reduce execution history bits such that predictor array access is direct, with no tag comparison. Hashing of execution history bits can cause significant aliasing of distinct execution histories. When only branch direction is being predicted, such aliasing may not be that detrimental, since both negative and positive interference can occur. However, multiple-block predictors predict branch targets along with branch directions. It is unlikely that aliasing will result in positive interference for branch target prediction; hence this work proposes the use of concatenation. Concatenation eliminates aliasing, but requires a tag comparison when only a subset of the history bits is used to index into the predictor table. Section 4.2 demonstrates the advantage of concatenation.

2.1.3 Tree-based Pattern History Table

The TMBP is used by the trace constructor to predict which blocks should compose a trace. During trace construction the TMBP uses global path execution history bits (Section 2.1.2) to index into the tree-based pattern history table (TPHT). The TPHT is an extension of the pattern history table in [19][] that more effectively supports traces. Each entry of the TPHT has three fields: tag, predicted path, and tree; see Figure 5. The tag field stores the execution history bits that are not used for indexing the TPHT. Each entry of the TPHT has an associated root block. The tree field contains a tree of blocks that represents all possible sequences of blocks, or paths, originating from the root block. The predicted path field contains a sequence of branch directions (taken or not taken) that identifies a particular path of the tree. This path represents the instruction trace (starting from the root block) that is currently stored in the trace table.
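To make the entry layout concrete, here is a sketch of one TPHT entry under our own simplifying assumptions: the tree is a full depth-4 binary tree (15 nodes, matching a four-block trace), each tree node uses a single two-bit saturating counter (the bimodal tree-node option; the two-level options would instead use local history to index a second-level table), and the predicted-path field is recomputed on demand rather than stored. The class and method names are ours, not the thesis's.

```python
from dataclasses import dataclass, field

@dataclass
class TPHTEntry:
    """One TPHT entry: a tag plus a 15-node tree of two-bit counters.
    Nodes are stored heap-style: the children of node i are 2*i+1 (not taken)
    and 2*i+2 (taken), giving a depth-4 tree of 1+2+4+8 = 15 nodes."""
    tag: int = 0
    counters: list = field(default_factory=lambda: [2] * 15)  # weakly taken

    def predicted_path(self, depth=4):
        """Follow each tree-node prediction from the root; the resulting direction
        sequence is the predicted path (i.e. the trace) for this entry."""
        path, node = [], 0
        for _ in range(depth):
            taken = self.counters[node] >= 2
            path.append(taken)
            node = 2 * node + (2 if taken else 1)
        return path

    def update(self, executed_path):
        """Train the tree-node predictors along the path that actually executed."""
        node = 0
        for taken in executed_path:
            c = self.counters[node]
            self.counters[node] = min(3, c + 1) if taken else max(0, c - 1)
            node = 2 * node + (2 if taken else 1)
```

Because only the nodes along the executed path are trained, repeated executions of one path quickly steer the predicted path toward it without disturbing predictors on other paths of the tree.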
Each node of the tree represents a block and has an associated single-branch predictor that predicts whether the branch instruction at the end of that block will be taken or not taken. The predicted path is determined from the outcomes of these single-branch predictors associated with the nodes of the tree. We call these tree-node predictors. The predicted path is used by the trace constructor to update the trace table. If the predicted path matches the trace that already exists in the trace table, no update is performed. Otherwise, the newly predicted trace is loaded into the trace

table in anticipation of a future next-trace prediction, which can involve evicting the existing trace in that entry of the trace table. During trace construction, the execution history is used to select a particular entry of the TPHT. The current sequence of blocks in the trace construction buffer (representing recently executed blocks) identifies a particular path in the tree of that TPHT entry. The tree-node predictors along this path are updated based on this sequence of recently executed blocks. In other words, the branch execution results of these buffered blocks are used to update the individual tree-node predictors along that path of the tree. Once these tree-node predictors are updated, a new predicted path for this TPHT entry can be identified and the trace constructor can then update the trace table. Essentially, the TMBP implements a form of the two-level adaptive branch prediction scheme. Global execution history (G) is used to index into the TPHT at the first level. (The fetch-time next trace predictor also uses the same global execution history.) The second level involves the tree-node predictors, which can be implemented in a number of ways. Borrowing the terminology of [19][], the second-level predictor table can be: per-branch (p), global (g), or shared (s). The per-branch option allocates a second-level table for each node of the tree, as illustrated in Figure 5. The local branch history of each node is used to index into its own second-level table to generate a prediction for that node.

Figure 5 - Tree-based pattern history table and its components (each entry holds a tag, a predicted path, and a tree; each tree node is a predictor with n bits of local history indexing a second-level table; the predicted path is highlighted with bold arcs).

The global option allocates a single second-level table that is used by all of the TPHT entries. Each node in each tree uses its local history bits to index into this single second-level table. The shared option allocates a second-level table for each TPHT entry, i.e. all nodes of a tree use the same second-level table. Hence the shared (s) option can also be called the per-tree option. In all three options, each node maintains its own local history bits, which are used to index into the second-level table. A fourth option investigated in this paper replaces the two-level adaptive scheme with the bimodal scheme: the second-level table is eliminated, and each node of the tree instead uses a single two-bit saturating counter. Section 4.3 examines the performance and space requirements of each of these four options. We also compare these four schemes with the original last path scheme used in [1].

3 Experimental Methodology

All the experimental data reported in this paper are generated by a full-function performance simulator based on the PowerPC architecture. The simulation model is developed from published reports [][8][17] and accurately models all key features of the microarchitecture.

3.1 Machine Model

To focus the current study on instruction fetching and to highlight the impact of instruction availability on machine performance, the PowerPC microarchitecture is extended to remove resource constraints, and widened to utilize more instruction bandwidth. A centralized reservation station with 512 entries and unlimited out-of-order issue bandwidth are assumed. This effectively limits the instruction window to 512 instructions. An unlimited number of functional units is also assumed. The instruction fetch, dispatch, and completion bandwidth is increased to 1 instructions per cycle. The memory hierarchy is fully modeled with a perfect main memory, a 3KB Level-1 I-cache, a 3KB Level-1 D-cache, and a 5KB unified Level-2 cache.
Access latencies are 1, 3, and 1 cycles for the L1 cache, the L2 cache, and main memory, respectively. On-chip implementation of the L2 is assumed. An unlimited load miss queue and an unlimited

store queue handle all load and store execution. The store queue performs data forwarding, and load/store instructions execute out-of-order if no address aliasing is detected.[1] All register and memory data dependencies are enforced. Instruction execution latencies can be found in [8], and accurately reflect the PowerPC. The potential bottlenecks of this machine are the data flow limit due to true data dependencies and instruction availability.

3.2 Benchmarks

The benchmark set used is the SPECint95 suite, compiled by gcc 2.7.2. To reduce simulation time, we use small input files and limit run length to million instructions for each benchmark, totaling 1. billion instructions. All user library calls are modeled, though system calls are not.

4 Experimental Results

This section explores the design space of the enhanced trace constructor outlined in Section 2. The design parameters of the fill heuristic logic are discussed in Section 4.1, the execution history and hashing in Section 4.2, and the tree-based multiple-block predictor (TMBP) in Section 4.3. Each of these sections analyzes one component with the purpose of narrowing the design space. Finally, the performance of a block-based trace cache with the enhanced trace constructor is compared to that of the original block-based trace cache design [1] in Section 4.4.

4.1 Fill Heuristic Logic

As discussed in Section 2.1.1, the fill heuristic logic controls the update of both the fetch-time predictor, namely the trace table (TT), and the fill-time predictor, namely the tree-based multiple-block predictor (TMBP). The new fill heuristic is designed to reduce the total number of traces generated and stored, by not allowing traces to overlap, thus reducing the capacity requirements of the TT and the TMBP. The new heuristic is quite efficient and allows the size of the TT and TMBP tables to be reduced without sacrificing performance.
[1] The authors are not proposing this as a realistic machine design, but a machine model that focuses the performance bottleneck on instruction fetch while enforcing register and memory data dependencies and using realistic trace prediction.

When the predictor tables are large enough,

e.g. having 8K entries, the new heuristic has insignificant impact. However, if the size is reduced to K entries for the two predictor tables, the new heuristic is able to yield a % performance gain.

4.2 Execution History and Hashing

The number of (global) branch-direction history bits and the type of path history are explored using very large predictor sizes. Large predictors reduce the capacity and conflict misses, focusing the comparison on the type of history gathered. After testing several different history lengths (k), it is discovered that longer history yields better prediction rates, but the longer warm-up increases the number of misses when accessing the predictor. A history length of k=3 is found to achieve a good balance between the correct prediction rate and the hit rate of the predictor. The remainder of this work assumes a history length of k=3. In addition to history length, the path history is also explored. The path history can range from many block-ids to just one. Several combinations of path history are explored, revealing that longer paths produce better prediction rates but take longer to warm up. Short histories warm up quickly, with slightly lower prediction rates; however, all combinations of path history yielded very similar performance numbers. Since the performance impact of the path history variations is not significant, only the last block-id is recorded, very similar to gshare [1]. Typical hashing functions include the XOR and a simple concatenation of some or all of the bits. As discussed in Section 2.1.2, aliasing is not desirable when predicting the targets of branch instructions. If the aliasing of the hashing function is significant, performance will be negatively impacted. It is discovered that a simple concatenation of the 15 history bits (12 bits for the block-id with k=3 global history bits) improves performance by .3% over a gshare [1] XOR.
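The difference between the two hashing styles can be sketched as follows; the bit widths (12-bit block-id, k=3 direction bits, 15 index bits) follow the text, while the function names are ours. With an XOR fold, distinct (block-id, history) pairs can collide in the predictor table; with concatenation the mapping is one-to-one, at the cost of a tag compare whenever fewer than all 15 bits are used as the index.

```python
def gshare_index(block_id, history, index_bits=15):
    """gshare-style hashing: XOR-fold the block-id and global history into an index.
    Distinct (block-id, history) pairs can alias to the same predictor entry."""
    return (block_id ^ history) & ((1 << index_bits) - 1)

def concat_index(block_id, history, history_bits=3, index_bits=15):
    """Concatenation: history bits in the low positions, block-id above them.
    Returns (index, tag); if index_bits covers the full width, the tag is 0 and
    no tag comparison is needed, otherwise the tag must be checked on lookup."""
    full = (block_id << history_bits) | history
    return full & ((1 << index_bits) - 1), full >> index_bits
```

For example, block-id 5 with history 0 and block-id 0 with history 5 XOR-fold to the same gshare index, but concatenate to different values.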
The performance is improved by increasing the number of correct predictions and converting most incorrect predictions into no predictions. The remainder of this work assumes a simple concatenation hash function, yielding a total of 15 bits for indexing into predictor tables.

4.3 Tree-based Multiple-block Predictor

As shown in Figure 5, the TMBP contains a branch predictor at each node of the tree structure. Figure 6 illustrates the performance of the four possible configurations for the design of the TMBP

outlined in Section 2.1.3. The first three plots show the performance of the three two-level predictors (g, p, & s), each with three different local history lengths (n), while varying the number of entries in the TPHT. The node labels in the plots indicate the number of entries (N) in the TPHT for each data point. The total byte count includes the TMBP and the TT. The TT size is N*(# tag bits + 48)/8 bytes, where 48 is the number of bits necessary to record the four block-ids (12 bits each) that represent the instruction trace. The number of TT entries equals the number of TMBP entries.

Figure 6 - Possible configurations for the TMBP and their performance (two-level global (g), two-level per-branch (p), and two-level per-tree (s) plots, plus a summary plot comparing the best of each against the two-bit and last-path schemes, all as a function of total bytes).

The first plot shows the performance of the two-level global (g) predictor with history lengths n=, 8, and 1. The total size in bytes of the TMBP with two-level global predictors is: [N*(15*n + 4 + # tag bits) + 2^(n+1)]/8, where 15 is the number of predictor nodes in each tree and 4 is the number of bits necessary to record the predicted path. The second-level predictor table (size 2^(n+1) bits) is shared among all the trees in the TMBP and is therefore only counted once. For the smaller predictor

sizes, it is seen that the trade-off between the number of entries and the length of history is won by the number of entries. As the number of entries N increases (>K), the scheme with n=1 outperforms the other two. n=1 is better than n= or 8 because the second-level predictor table is larger and is able to reduce aliasing. The second plot shows the performance of the two-level per-branch (p) predictor with local history lengths of n=1, 3, and 5 bits. The size (in bytes) of the TMBP with two-level per-branch predictors is: [N*(15*(n + 2^(n+1)) + 4 + # tag bits)]/8. Notice that the second-level table (size 2^(n+1) bits) is accounted for at each node. This seriously increases the area requirement of the per-branch configuration. History length n=3 performs significantly better than the other two, because it provides a balance between table entry count and history length. The third plot shows the performance of the two-level per-tree (s) predictor with three history lengths n. The size (in bytes) of the TMBP with two-level per-tree predictors is: [N*(15*n + 4 + # tag bits + 2^(n+1))]/8. Here the second-level table is only per-tree, which significantly reduces the capacity requirements as compared to the per-branch predictor. A history length of n= is the best. The final plot in Figure 6 shows the best results of the three two-level adaptive configurations against the bimodal and last-path schemes. The bimodal scheme has only a two-bit predictor at each node, yielding a size of: [N*(15*2 + 4 + # tag bits)]/8. The last-path algorithm from [1] does not require a TMBP, and therefore its size depends only on the TT. The data in this figure shows that the two-level per-tree predictor significantly outperforms any other predictor. The performance of the last-path algorithm levels off very quickly, because the number of branch mispredictions is quite significant and quickly dominates the capacity misses.
The two-bit predictor is more effective, but still levels off well below the performance of the two-level predictors. The two-level per-tree predictor with n= and N=K (most efficient for the 5-1 Kbyte range) is considered the sweet spot and is used in the remainder of this paper.
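The storage expressions above can be collected into one hypothetical helper. The constants follow the text (15 tree nodes per entry, a 4-bit predicted path, second-level tables of 2^n two-bit counters, i.e. 2^(n+1) bits); the tag-bit count is left as a parameter since it depends on how many of the 15 execution-history bits are used as the index. The function and key names are ours.

```python
def sizes_bytes(N, n, tag_bits, nodes=15, path_bits=4):
    """Approximate storage in bytes for the four TMBP options, following the
    expressions in the text. N is the TPHT entry count, n the local history
    length; each second-level table holds 2**n two-bit counters (2**(n+1) bits)."""
    second = 2 ** (n + 1)  # bits in one second-level table
    return {
        "global":     (N * (nodes * n + path_bits + tag_bits) + second) / 8,
        "per_branch":  N * (nodes * (n + second) + path_bits + tag_bits) / 8,
        "per_tree":    N * (nodes * n + path_bits + tag_bits + second) / 8,
        "bimodal":     N * (nodes * 2 + path_bits + tag_bits) / 8,
    }
```

Evaluating the helper for any fixed N and n makes the qualitative ordering in the text visible: per-branch pays for a second-level table at every node, per-tree pays for one per entry, and global pays for one table total.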

We now take a closer look at the prediction accuracy comparison between the per-tree (N=K, n=) and the last-path (N=8K) schemes. Both predictors predict to a depth of 4 (i.e. predicting up to four blocks per cycle). Figure 7 displays the prediction accuracy at each level for the SPECint95 benchmarks. For each level, accuracy is measured only if a prediction is made at that level. Also, the accuracy at each level is compounded: for example, the compounded accuracy at level 2 is equal to the prediction accuracy of level 2 multiplied by the accuracy of level 1. Notice there is significant improvement in accurately predicting the third and fourth blocks. This improvement is due to the use of the tree structure of the TMBP and significantly increases the fetch bandwidth and the resultant performance.

Figure 7 - Branch prediction performance comparison (compounded accuracy (%) versus number of blocks predicted correctly, last-path versus two-level per-tree (N=K, n=)).

The new trace constructor is more complex; the update of the TT could take longer than a single cycle. We ran simulations extending the fill latency all the way to 1 cycles and found no impact on either the hit rate of the trace table or the overall performance of the benchmarks.

4.4 Performance Improvement in the Block-based Trace Cache

Figure 8 shows the gain of the enhanced trace constructor in the block-based trace cache. The addition of the new fill heuristic and the TMBP with two-level per-tree predictors improves performance by 3% to 8%, with an average improvement of %. The TMBP uses N=K entries and a local history length of n=. All of this performance improvement is due to the increased prediction rates of the multiple blocks shown in Figure 7. Further improvement can be achieved with a larger TMBP.

Figure 8 - Performance of the enhanced trace constructor in the block-based trace cache (last-path [1] versus two-level per-tree across comp, gcc, go, ijpeg, li, m88k, perl, vort, and the harmonic mean).

5 Analysis of Experimental Results

In this section we present the performance of the block-based trace cache employing the new tree-based multiple-block predictor (TMBP), as a function of the block cache size or total instruction storage capacity. We also compare this performance against four different idealized machine configurations. We then analyze the breakdown of the total execution cycles to assess the effectiveness of the block-based trace cache with a realistic TMBP.

5.1 Performance Comparisons

Figure 9 contains nine graphs, one for each benchmark, with a ninth one showing the harmonic means of all the benchmarks. Each graph plots sustained IPC as a function of total trace cache storage capacity in bytes. (The machine width is 1.) In each graph there are two straight lines and three curves. The two straight lines represent the idealized machine configurations of perfect fetch (pfetch) and perfect branch prediction (pbranch). These two configurations do not employ a trace cache and hence are independent of the trace cache size. Perfect fetch can perfectly predict and fetch beyond any number of branches and can cross I-cache line boundaries. It is only limited by true data dependencies and the size of the instruction window (512 instructions). Perfect branch is a standard I-cache based machine that can predict all conditional branches with 100% accuracy but

Figure 9 - IPC as a function of bytes of trace storage for all benchmarks (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, and the harmonic mean; curves: pfetch, pblock, rblock, pbranch, ptrace; annotated regions: data dependencies/instruction window limit, block cache capacity misses & fragmentation, block mispredictions, block-based over traditional, taken branch boundary).

can only fetch up to the first taken branch instruction or the I-cache line boundary. This configuration is limited by cache line boundaries and instruction cache misses, and represents the best that can be achieved without using a trace cache.

The three curves in each graph represent the three machine configurations that employ a trace cache. The middle curve, denoted real block (rblock), plots the IPC that can be achieved by a block-based trace cache that employs the realistic TMBP chosen in Section 4. For this TMBP, the tree-based pattern history table (TPHT) as well as the trace table each contain N=K entries. Each tree-node predictor in the TPHT uses n= local history bits to index into a second-level table shared by all the nodes in a tree, i.e. the per-tree scheme. The top curve, denoted perfect block (pblock), shows the IPC that can be achieved by a block-based trace cache that employs a perfect multiple-block predictor. The gap between the real block and perfect block curves reveals the imperfect prediction of the realistic TMBP. The bottom curve, denoted perfect trace (ptrace), shows the IPC achievable by a conventional trace cache with perfect trace prediction. The perfect trace curve represents the best performance that a conventional trace cache can possibly achieve.

We now compare the real block performance to the other four idealized configurations. Except for compress and go, we see that for all the other benchmarks the real block performance exceeds that of perfect branch given adequate block cache capacity. Looking at the harmonic means graph, we see that real block crosses over perfect branch with a total trace cache storage capacity of about 5KB (or a K-entry block cache). This total storage capacity also includes the K-entry trace table. This comparison shows the need to be able to cross taken-branch boundaries and justifies the use of the trace cache.
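The per-tree scheme described above can be sketched in code. This is a minimal illustrative model only: the parameter values (4 local history bits, depth-4 trees) and class names are assumptions for the sketch, not the thesis configuration. Each tree node keeps its own local branch history, while all nodes in a tree index one shared second-level table of 2-bit saturating counters.

```python
# Sketch of a "per tree" two-level tree-node predictor: each node has its
# own n-bit local history; all nodes in a tree share one table of 2-bit
# saturating counters. Parameter values below are illustrative assumptions.

N_HISTORY_BITS = 4   # local history bits per tree node (assumed)
TREE_DEPTH = 4       # up to 4 block predictions per tree walk (assumed)

class TreeNodePredictor:
    """One branch predictor sitting at a node of the tree."""
    def __init__(self, shared_counters):
        self.history = 0               # n-bit local branch history
        self.shared = shared_counters  # second-level table, shared per tree

    def predict(self):
        # Counters 2 and 3 mean "predict taken".
        return self.shared[self.history] >= 2

    def update(self, taken):
        # Saturating 2-bit counter update, then shift in the new outcome.
        c = self.shared[self.history]
        self.shared[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | taken) & ((1 << N_HISTORY_BITS) - 1)

class PredictorTree:
    """A tree of node predictors; each root-to-leaf path is one block sequence."""
    def __init__(self):
        shared = [1] * (1 << N_HISTORY_BITS)  # one second-level table per tree
        self.nodes = [TreeNodePredictor(shared)
                      for _ in range((1 << TREE_DEPTH) - 1)]

    def predict_path(self):
        """Walk from the root, collecting one taken/not-taken prediction per level."""
        idx, path = 0, []
        for _ in range(TREE_DEPTH):
            taken = self.nodes[idx].predict()
            path.append(taken)
            idx = 2 * idx + (2 if taken else 1)  # left = not taken, right = taken
            if idx >= len(self.nodes):
                break
        return path

tree = PredictorTree()
print(tree.predict_path())  # [False, False, False, False] before any training
```

The shared second-level table is what distinguishes the per-tree scheme from giving every node a private counter table: it cuts storage at the cost of possible aliasing between nodes.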
It is interesting to note that with limited trace storage capacity, real block can actually outperform perfect trace for most of the benchmarks. Looking at the harmonic means graph, we see that for trace storage capacities of less than 1KB, real block consistently achieves a higher IPC than perfect trace. Perfect trace performance crosses over real block performance at a trace storage capacity

Figure 10 - How total execution cycles are spent (the exploded pieces are the trace cache accesses)

beyond 1KB. When realistic trace prediction is taken into account for the conventional trace cache, this crossover point will move further out.

5.2 Analysis of Execution Cycles

In this subsection we attempt to gain a better understanding of the gap between real block and perfect block by analyzing the breakdown of all execution cycles to see where cycles are spent or wasted. Figure 10 provides the breakdown. We notice that in 8% of all execution cycles, the fetch buffer is full and no fetching is done. This indicates that either the window is full or there is a dispatch stall. In % of all execution cycles, fetching is performed from the trace cache. In % of the cycles I-cache hits are achieved, and in 1% of the cycles we are waiting on an I-cache miss. The I-cache is consulted if the trace table is not making a prediction. In a towering 9% of the cycles, fetch is worthless as we are waiting for a mispredicted branch to be resolved. Even though the overall block prediction rate is near 9%, the branch misprediction penalty is the biggest piece of the pie.

A closer examination of the % of the cycles in which fetching occurs from the trace cache reveals that in % of the cycles trace misprediction occurs and no block is fetched from the trace cache. The fractions of cycles in which 1 block, 2 blocks, 3 blocks, and 4 blocks are fetched from the trace cache are %, 1%, 1%, and 13%, respectively. These percentages indicate that the multiple-block predictor in the block-based trace cache is doing quite well in facilitating the fetching of multiple blocks in each cycle. We are pleasantly surprised to see that in 13% of all execution cycles, four blocks are successfully fetched from the trace cache.
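The accounting behind a breakdown like Figure 10 can be sketched as follows: every execution cycle is charged to exactly one category, and the pie chart reports each category's share of the total. The counts in the example log are made up for illustration; they are not the measured values.

```python
# Sketch of per-cycle accounting for a Figure-10-style breakdown.
# Each simulated cycle is tagged with exactly one category; the breakdown
# is each category's percentage of total cycles. Counts below are invented.

from collections import Counter

CATEGORIES = [
    "buffer full", "branch penalty", "icache hit", "icache miss penalty",
    "0 blocks", "1 block", "2 blocks", "3 blocks", "4 blocks",
]

def cycle_breakdown(cycle_log):
    """Return each category's percentage of the total execution cycles."""
    counts = Counter(cycle_log)
    total = sum(counts.values())
    return {cat: 100.0 * counts[cat] / total for cat in CATEGORIES}

# Hypothetical log: one category label per simulated cycle (100 cycles here).
log = (["buffer full"] * 8 + ["branch penalty"] * 29 + ["icache hit"] * 6 +
       ["icache miss penalty"] * 11 + ["0 blocks"] * 4 + ["1 block"] * 6 +
       ["2 blocks"] * 12 + ["3 blocks"] * 11 + ["4 blocks"] * 13)
breakdown = cycle_breakdown(log)
print(breakdown["4 blocks"])  # 13.0
```

Because the categories partition the cycles, the percentages always sum to 100, which is what makes "biggest piece of the pie" comparisons like the one above meaningful.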

6 Conclusion

In [1] the block-based trace cache was proposed as a realistic and efficient way to implement a trace cache. In that previous work a simplistic last-path predictor was used to perform the necessary multiple-block prediction. In this paper we introduced a new tree-based multiple-block predictor (TMBP) that is used at fill time to predict traces and select them for storage in the trace table. A realistic design of a TMBP is presented that employs a two-level adaptive scheme with a first-level tree-based pattern history table (TPHT) of K entries and a second-level shared table indexed by bits of local history. With this realistic multiple-block predictor the new block-based trace cache is able to achieve an improvement of % over the original block-based trace cache [1] for the SPECint95 benchmark suite. We also show that the block-based trace cache with the realistic TMBP is able to outperform both a traditional I-cache based machine with perfect branch prediction and a conventional trace cache design with perfect trace prediction. These results highlight the limitations of the traditional I-cache based design, motivate and justify the use of the trace cache, and further validate that the block-based approach is an effective and efficient way to implement trace caches.

By examining the gap between the real block and perfect block performance curves of Figure 9, we see that there is still headroom to improve the multiple-block predictor for the block-based trace cache. It is also encouraging to see that perfect block performance approaches that of perfect fetch. The gap between these two curves is due to the limited capacity and fragmentation of the block cache. This gap can be narrowed through more efficient storing of blocks in the replicated block cache, for example, by not storing every block in every copy of the block cache but only in the copy where it is likely to be accessed. We will pursue these issues in our future research.
7 References

[1] B. Black, B. Rychlik, and J. Shen, The Block-based Trace Cache. In Proceedings of the 26th Annual International Symposium on Computer Architecture, May 1999.
[2] B. Calder and D. Grunwald, Next Cache Line and Set Prediction. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.
[3] T. Conte, K. Menezes, P. Mills, and B. Patel, Optimization of Instruction Fetch Mechanisms for High Issue Rates. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.
[4] K. Diefendorf and E. Silha, The PowerPC User Instruction Set Architecture. In IEEE Micro, 1994.
[5] S. Dutta and M. Franklin, Control Flow Prediction with Tree-Like Subgraphs for Superscalar Processors. In Proceedings of the 28th International Symposium on Microarchitecture, December 1995.
[6] D. Friendly, S. Patel, and Y. Patt, Alternative Fetch and Issue Policies for the Trace Cache Fetch Mechanism. In Proceedings of the 30th International Symposium on Microarchitecture, November 1997.
[7] D. Friendly, S. Patel, and Y. Patt, Putting the Fill Unit to Work: Dynamic Optimizations for Trace Cache Microprocessors. In Proceedings of the 31st International Symposium on Microarchitecture, December 1998.
[8] IBM Microelectronics Division, PowerPC RISC Microprocessor User's Manual, 1994.
[9] Q. Jacobson, E. Rotenberg, and J. Smith, Path-Based Next Trace Prediction. In Proceedings of the 30th International Symposium on Microarchitecture, November 1997.
[10] S. McFarling, Combining Branch Predictors. Technical Report TN-36, Digital Equipment Corp., June 1993.
[11] R. Nair, Dynamic Path-based Branch Correlation. In Proceedings of the 28th International Symposium on Microarchitecture, December 1995.
[12] E. Rotenberg, S. Bennett, and J. E. Smith, Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching. In Proceedings of the 29th International Symposium on Microarchitecture, December 1996.
[13] S. Patel, D. Friendly, and Y. Patt, Evaluation of Design Options for the Trace Cache Fetch Mechanism. IEEE Transactions on Computers, Special Issue on Cache Memory and Related Problems, 1999.
[14] S. Patel, M. Evers, and Y. Patt, Improving Trace Cache Effectiveness with Branch Promotion and Trace Packing. In Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998.
[15] A. Seznec, S. Jourdan, P. Sainrat, and P. Michaud, Multiple-Block Ahead Branch Predictors. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996.
[16] J. Smith, A Study of Branch Prediction Strategies. In Proceedings of the 8th International Symposium on Computer Architecture, May 1981.
[17] S. Song, M. Denman, and J. Chang, The PowerPC 604 RISC Microprocessor. In IEEE Micro, 1994.
[18] T-Y. Yeh, D. Marr, and Y. Patt, Increasing the Instruction Fetch Rate via Multiple Branch Prediction and a Branch Address Cache. In Proceedings of the 7th ACM International Conference on Supercomputing, July 1993.
[19] T-Y. Yeh and Y. Patt, Alternative Implementations of Two-Level Adaptive Branch Prediction. In Proceedings of the 19th International Symposium on Computer Architecture, May 1992.
[20] T-Y. Yeh and Y. Patt, A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History. In Proceedings of the 20th International Symposium on Computer Architecture, May 1993.


More information

Static Branch Prediction

Static Branch Prediction Announcements EE382A Lecture 5: Branch Prediction Project proposal due on Mo 10/14 List the group members Describe the topic including why it is important and your thesis Describe the methodology you will

More information

Instruction Level Parallelism (Branch Prediction)

Instruction Level Parallelism (Branch Prediction) Instruction Level Parallelism (Branch Prediction) Branch Types Type Direction at fetch time Number of possible next fetch addresses? When is next fetch address resolved? Conditional Unknown 2 Execution

More information

An Efficient Indirect Branch Predictor

An Efficient Indirect Branch Predictor An Efficient Indirect ranch Predictor Yul Chu and M. R. Ito 2 Electrical and Computer Engineering Department, Mississippi State University, ox 957, Mississippi State, MS 39762, USA chu@ece.msstate.edu

More information

Combining Local and Global History for High Performance Data Prefetching

Combining Local and Global History for High Performance Data Prefetching Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Using Cache Line Coloring to Perform Aggressive Procedure Inlining

Using Cache Line Coloring to Perform Aggressive Procedure Inlining Using Cache Line Coloring to Perform Aggressive Procedure Inlining Hakan Aydın David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA, 02115 {haydin,kaeli}@ece.neu.edu

More information

Threaded Multiple Path Execution

Threaded Multiple Path Execution Threaded Multiple Path Execution Steven Wallace Brad Calder Dean M. Tullsen Department of Computer Science and Engineering University of California, San Diego fswallace,calder,tullseng@cs.ucsd.edu Abstract

More information

Improving Value Prediction by Exploiting Both Operand and Output Value Locality. Abstract

Improving Value Prediction by Exploiting Both Operand and Output Value Locality. Abstract Improving Value Prediction by Exploiting Both Operand and Output Value Locality Youngsoo Choi 1, Joshua J. Yi 2, Jian Huang 3, David J. Lilja 2 1 - Department of Computer Science and Engineering 2 - Department

More information

Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures

Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures Eric Hao, Po-Yung Chang, Marks Evers, and Yale N. Patt Advanced Computer Architecture Laboratory Department of Electrical

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

Why memory hierarchy

Why memory hierarchy Why memory hierarchy (3 rd Ed: p.468-487, 4 th Ed: p. 452-470) users want unlimited fast memory fast memory expensive, slow memory cheap cache: small, fast memory near CPU large, slow memory (main memory,

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

The 2-way Thrashing-Avoidance Cache (TAC): An Efficient Instruction Cache Scheme for Object-Oriented Languages

The 2-way Thrashing-Avoidance Cache (TAC): An Efficient Instruction Cache Scheme for Object-Oriented Languages The 2- Thrashing-voidance Cache (): n Efficient Instruction Cache Scheme for Object-Oriented Languages Yul Chu and M. R. Ito Electrical and Computer Engineering Department, University of British Columbia

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined

More information

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog) Announcements EE382A Lecture 6: Register Renaming Project proposal due on Wed 10/14 2-3 pages submitted through email List the group members Describe the topic including why it is important and your thesis

More information

Use-Based Register Caching with Decoupled Indexing

Use-Based Register Caching with Decoupled Indexing Use-Based Register Caching with Decoupled Indexing J. Adam Butts and Guri Sohi University of Wisconsin Madison {butts,sohi}@cs.wisc.edu ISCA-31 München, Germany June 23, 2004 Motivation Need large register

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

Optimizations Enabled by a Decoupled Front-End Architecture

Optimizations Enabled by a Decoupled Front-End Architecture Optimizations Enabled by a Decoupled Front-End Architecture Glenn Reinman y Brad Calder y Todd Austin z y Department of Computer Science and Engineering, University of California, San Diego z Electrical

More information

AN ANALYSIS OF VALUE PREDICTIBALITY AND ITS APPLICATION TO A SUPERSCALAR PROCESSOR

AN ANALYSIS OF VALUE PREDICTIBALITY AND ITS APPLICATION TO A SUPERSCALAR PROCESSOR AN ANALYSIS OF VALUE PREDICTIBALITY AND ITS APPLICATION TO A SUPERSCALAR PROCESSOR by Yiannakis Thrasou Sazeides A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Jong Wook Kwak 1, Seong Tae Jhang 2, and Chu Shik Jhon 1 1 Department of Electrical Engineering and Computer Science, Seoul

More information

Computer Architecture EE 4720 Final Examination

Computer Architecture EE 4720 Final Examination Name Computer Architecture EE 4720 Final Examination Primary: 6 December 1999, Alternate: 7 December 1999, 10:00 12:00 CST 15:00 17:00 CST Alias Problem 1 Problem 2 Problem 3 Problem 4 Exam Total (25 pts)

More information

Comparing Multiported Cache Schemes

Comparing Multiported Cache Schemes Comparing Multiported Cache Schemes Smaїl Niar University of Valenciennes, France Smail.Niar@univ-valenciennes.fr Lieven Eeckhout Koen De Bosschere Ghent University, Belgium {leeckhou,kdb}@elis.rug.ac.be

More information

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1 Superscalar Processing 740 October 31, 2012 Evolution of Intel Processor Pipelines 486, Pentium, Pentium Pro Superscalar Processor Design Speculative Execution Register Renaming Branch Prediction Architectural

More information

On Pipelining Dynamic Instruction Scheduling Logic

On Pipelining Dynamic Instruction Scheduling Logic On Pipelining Dynamic Instruction Scheduling Logic Jared Stark y Mary D. Brown z Yale N. Patt z Microprocessor Research Labs y Intel Corporation jared.w.stark@intel.com Dept. of Electrical and Computer

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information