Improving Multiple-block Prediction in the Block-based Trace Cache


Master's Thesis, May 1
Ryan Rakvic
Department of Electrical and Computer Engineering
Carnegie Mellon University, Pittsburgh, PA 15213
rnr@ece.cmu.edu

Abstract

Multiple-block prediction is emerging as a new and exciting research area. Highly accurate multiple-block predictors are essential for the wide instruction fetch mechanisms that will support future generations of microprocessors. The block-based trace cache is a recent proposal for wide instruction fetch. It aligns and stores instructions at the basic block level instead of at the trace level, thus significantly reducing instruction trace storage requirements. This paper investigates a new mechanism, the tree-based multiple-block predictor, which utilizes multiple-branch prediction techniques to improve trace construction in the fill unit. This new fill-time predictor does not replace but augments the fetch-time path-based next-trace predictor by improving the trace construction heuristic. The tree-based multiple-block predictor utilizes a tree structure that represents all possible paths beginning at a root block, where each tree node is a branch predictor. Both a bimodal scheme and two-level adaptive schemes are examined for these tree-node branch predictors. Results: The enhanced trace constructor using the tree-based multiple-block predictor improves performance of the SPECint95 benchmarks by % over [1]. It is observed that a two-level adaptive predictor outperforms bimodal by %. Finally, the block-based trace cache with the enhanced trace constructor improves performance 8% beyond that of perfect branch prediction and outperforms the conventional trace cache, with a perfect predictor, for instruction storage capacities up to 1KB.

1 Introduction

Microprocessor performance has been increasing at a phenomenal rate. This is accomplished by increasing both the clocking rate and the instruction-level parallelism (measured in IPC).
To further increase performance, it is necessary for future processors to increase their instruction fetch bandwidth. The trace cache [1][13] has been proposed as an effective way to fetch multiple blocks of instructions in every cycle. More recently, the block-based trace cache [1] has been proposed as an efficient way to implement the trace cache. Fundamental to the operation of the trace cache is the ability to predict a sequence of basic blocks, or a trace, that will be executed. We refer to this as the task of multiple-block prediction.

While [1] demonstrated that the block-based trace cache can be very efficient in its trace storage, the multiple-block predictor used in that design was quite simplistic and was a limiting factor to achieving higher performance. In this paper we introduce a new multiple-block predictor that achieves higher prediction accuracy than the simplistic design and is able to significantly increase the performance of the block-based trace cache.

1.1 Previous Work

Multiple-block predictors are derived from single-branch prediction techniques. For example, the two-bit predictor from Smith [1] and the pattern predictors of Yeh and Patt [19][] are employed in multiple-block predictors. Consequently, all current multiple-block predictors are very similar. The differences between these predictors are determined by: 1) the type of execution history used, 2) the prediction scheme, and 3) the unit of prediction. The execution history typically involves a combination of branch direction bits and path history (instruction addresses). The prediction scheme varies from the two-bit saturating counter to two-level adaptive schemes, and the prediction unit is either a basic block or a trace of instructions. Yeh and Patt [18] introduced one of the first multiple-block predictors. It is an extension of the two-level adaptive branch predictor and incorporates a Branch Address Cache, yielding a combination that predicts both branch directions and branch targets. Their mechanism uses branch direction history to predict basic blocks with a two-level adaptive scheme. Later, [][3][15] improved on Yeh and Patt by introducing cache index prediction, the collapsing buffer, and efficient multi-ported branch target buffers, respectively. All of these works use branch direction history to predict basic blocks with a two-bit predictor. Path predictors, introduced by [11], use path address history to index the predictor. [1] and [9] combined branch direction history with path history to index the predictor.
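As a concrete illustration of these building blocks, the sketch below implements a two-bit saturating counter (the Smith-style predictor mentioned above) and a minimal two-level adaptive predictor that uses the last k branch directions to index a pattern history table. This is a generic textbook sketch under our own assumptions (class names and the choice of k are ours), not the simulator used in this thesis.

```python
class TwoBitCounter:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""
    def __init__(self, state=2):
        self.state = state  # start weakly taken
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

class TwoLevelPredictor:
    """Two-level adaptive scheme: k bits of branch-direction history select one
    two-bit counter in a pattern history table of 2**k entries."""
    def __init__(self, k=4):
        self.k, self.history = k, 0
        self.pht = [TwoBitCounter() for _ in range(2 ** k)]
    def predict(self):
        return self.pht[self.history].predict()
    def update(self, taken):
        self.pht[self.history].update(taken)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)
```

After a short warm-up, the two-level scheme learns repeating direction patterns (such as a strictly alternating branch) that a single counter cannot capture.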
[1] predicts basic blocks with a last path (i.e. predict the same path as last time) scheme and [9] predicts traces using two-bit confidence counters. Another major step in multiple-block prediction is the use of tree-like structures that capture traces of instructions. Dutta [5] used a tree structure to capture portions of the control flow graph and four-bit counters to pick which path through

the tree will be predicted. A tree structure is also used in [13] to predict traces with a two-bit predictor.

1.2 The Tree-based Multiple-block Predictor

When a trace cache is used, multiple-block prediction occurs in two phases. Essentially, the trace cache approach attempts to move work from the front end of the machine to the back end. The work done in the back end is not in the critical performance path, which allows the front end to be faster. As shown in Figure 1, the first phase of multiple-block prediction is performed at instruction fetch time, and the second phase is performed at instruction completion time. At instruction fetch time, a predictor explicitly predicts the next trace to be fetched by using global execution history information to index into the trace cache. We refer to this as the fetch-time predictor. At instruction completion time, a trace is formed by the fill unit and loaded into the trace cache based on certain trace construction heuristics. The implicit multiple-block prediction implied by the trace is predetermined, or predicted, by the trace construction mechanism. We refer to this as the fill-time predictor, which determines the sequence of basic blocks that constitutes the trace and which traces are to be loaded into the trace cache for use in future fetch-time prediction.

Figure 1 - Multiple-block prediction in the trace cache (next trace selection feeds the trace cache at fetch time; the fill unit constructs traces from completing instructions at fill time).
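The two phases can be caricatured as a lookup at fetch time and an install at completion time. The sketch below is a deliberately minimal model under our own assumptions (a dictionary keyed by execution history, four-block traces); it only illustrates the division of labor between the two predictors, not the actual trace cache organization.

```python
# Hypothetical two-phase model: explicit prediction at fetch time,
# implicit prediction (trace construction) at fill time.
trace_cache = {}  # execution history -> tuple of block ids (a "trace")

def fetch(history):
    """Fetch time: use execution history to predict the next trace (None on a miss)."""
    return trace_cache.get(history)

def complete(history, executed_blocks, max_len=4):
    """Fill time: the fill unit builds a trace from completing blocks and installs it."""
    trace_cache[history] = tuple(executed_blocks[:max_len])
```

The point of the split is that `complete` runs off the critical path, so the heuristics deciding what to install can be relatively elaborate without slowing down `fetch`.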

The fetch-time predictor used in the block-based trace cache [1] is similar to that of [9] and determines which trace will be fetched next. The fetch-time predictor is not the focus of this paper. In [1] the fill-time predictor of the block-based trace cache used a simple last path (i.e. predict the same path as last time) scheme that determined which instruction traces would be available for prediction at fetch time. This paper proposes a new fill-time predictor, the tree-based multiple-block predictor (TMBP), that enhances the trace constructor of the block-based trace cache. The TMBP employs prediction techniques to both construct traces and judiciously place into the trace cache only those traces that are likely to be predicted at fetch time. Effectively, the TMBP prevents instruction traces that will not be predicted at fetch time from being stored in the fetch-time predictor table. This improves the efficiency of trace predictor storage and further increases the prediction rate of the fetch-time next-trace predictor. Using this enhanced trace constructor, significant performance improvement can be obtained for the block-based trace cache. The proposed fill-time TMBP borrows and combines the best features of previous designs. It uses both branch direction and path history, and employs a two-level tree-based adaptive scheme. Implementation issues and details are discussed in Section 2.1.

2 Implementation

The tree-based multiple-block predictor (TMBP) is investigated using the block-based trace cache [1]. In the block-based trace cache, illustrated in Figure 2, an instruction trace is not explicitly stored as one physical unit as in the conventional trace cache [1][13]. Instead, only pointers to the basic blocks that constitute a trace are stored in a structure called the trace table (i.e. the fetch-time predictor table). Besides the trace table, there are three other major components to the block-based trace cache: the block cache, the rename table, and the fill unit.
The block cache stores aligned instruction blocks and is indexed by a renamed block-id. To support simultaneous multi-block accesses, the block cache is replicated (in effect, multi-ported). The trace table can be viewed as storing a short-hand representation of traces, since each of its entries contains the block-ids of a trace. The rename table implements the fetch address renaming of basic blocks,

and converts instruction fetch addresses to the block-ids which are used to index the block cache entries. The fill unit gathers instructions into physical blocks for renaming and storing in the block cache, and constructs traces of block-ids for future next-trace prediction via the trace table. The block-based trace cache performs a path-based next-trace prediction by accessing the trace table with path and branch history, similar to [9]. The trace table outputs the block-ids of the predicted next trace. These block-ids are used in the next cycle (fetch) to access the replicated block cache.

Figure 2 - The block-based trace cache (path-based next trace predictor, trace table, block cache, rename table, instruction cache, and fill unit with the tree-based multiple-block predictor).

Finally, the fill unit determines which traces are actually stored in the trace table and how instruction blocks are renamed. In previous work [1] the fill unit used a last path algorithm to construct traces of block-ids.

2.1 Tree-based Multiple-block Predictor Implementation

The proposed TMBP is part of the fill unit and is used at fill time to perform a multiple-block prediction that helps determine which basic blocks should be in a trace. Figure 3 illustrates the trace construction part of the fill unit, which is made up of the TMBP and the fill heuristic logic. The fill heuristic, described in Section 2.1.1, uses prediction information from the TMBP and instruction

completion information to determine which blocks to insert into a trace. The TMBP consists of two major parts: one performs the execution history hashing, and the other is the tree-based pattern history table (TPHT). Execution history hashing, described in Section 2.1.2, uses execution history, in the form of an execution path and/or directions of branches, and a hashing function to reduce the execution history bits to a more manageable size for indexing into the TPHT. The TPHT, described in Section 2.1.3, supports a two-level adaptive scheme for multiple-block prediction.

Figure 3 - The tree-based multiple-block predictor in the trace constructor of the fill unit (execution history buffer, hashing function, TPHT, and fill heuristic logic).

2.1.1 Fill Heuristic Logic

The fill heuristic logic determines which blocks make up a trace and inserts the associated block-ids of that trace into the trace table. The original block-based trace cache presented in [1] used a

simple heuristic, illustrated in Figure 4, that is based solely on a last path scheme. It used a greedy algorithm to construct traces, which viewed every executed block as a potential start of a new trace. Such a greedy algorithm is useful because it creates every possible trace from the completing sequence of instruction blocks; however, it can lead to unnecessary traces that consume trace storage capacity. The new heuristic proposed in this work, also shown in Figure 4, only constructs non-overlapping traces, which can lead to a smaller total number of traces and more efficient trace storage. A trace is terminated when the limit for maximum trace length is reached. Traces are started with the oldest completed block or after a branch misprediction. In addition to the new heuristic, the fill heuristic logic uses multiple-block prediction information from the TMBP to determine which block-ids should be in a trace.

Figure 4 - Trace construction heuristics (for the completed block sequence shown, the old greedy algorithm produces the traces ABCD, BCDE, CDEF, DEFC, EFCD, and FCDE, while the new heuristic produces only ABCD and EFCD).

2.1.2 Execution History and Hashing

The execution history gathered by a predictor is critical. McFarling [1] showed that both branch address and branch direction history improve performance significantly. Nair [9] showed the effectiveness of path history in the prediction of a single branch, while [1][9] use path history along with branch directions to predict the next instruction trace. This work investigates the possible choices for global execution history in Section 4.2, and concludes that a simple global path history is sufficient for predicting the next trace.
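Using the block sequence from Figure 4, the two heuristics can be sketched as sliding versus non-overlapping windows over the stream of completed blocks. This is our own illustrative code (the block names and four-block trace limit follow the figure), not the fill unit's actual logic, which also consults the TMBP and restarts traces after mispredictions.

```python
def greedy_traces(blocks, max_len=4):
    """Old greedy scheme: every completed block may start a trace (overlapping windows)."""
    return [tuple(blocks[i:i + max_len]) for i in range(len(blocks) - max_len + 1)]

def nonoverlapping_traces(blocks, max_len=4):
    """New heuristic: each trace starts at the oldest block not yet placed in a trace,
    so constructed traces never overlap."""
    return [tuple(blocks[i:i + max_len])
            for i in range(0, len(blocks) - max_len + 1, max_len)]

blocks = list("ABCDEFCDE")            # completed block sequence from Figure 4
print(greedy_traces(blocks))          # six overlapping traces: ABCD, BCDE, ..., FCDE
print(nonoverlapping_traces(blocks))  # only ABCD and EFCD
```

The greedy scheme generates six candidate traces from this nine-block sequence, while the non-overlapping scheme generates two, which is the storage saving the new heuristic exploits.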

Hashing is a common method used to reduce execution history bits such that predictor array access is direct, with no tag comparison. Hashing of execution history bits can cause significant aliasing of distinct execution histories. When only branch direction is being predicted, such aliasing may not be that detrimental, since both negative and positive interference can occur. However, multiple-block predictors predict branch targets along with branch directions. It is unlikely that aliasing will result in positive interference for branch target prediction; hence this work proposes the use of concatenation. Concatenation eliminates aliasing, but requires a tag comparison when only a subset of the history bits is used to index into the predictor table. Section 4.2 demonstrates the advantage of concatenation.

2.1.3 Tree-based Pattern History Table

The TMBP is used by the trace constructor to predict which blocks should compose a trace. During trace construction the TMBP uses global path execution history bits (Section 2.1.2) to index into the tree-based pattern history table (TPHT). The TPHT is an extension of the pattern history table in [19][] that more effectively supports traces. Each entry of the TPHT has three fields: tag, predicted path, and tree; see Figure 5. The tag field stores the execution history bits that are not used for indexing the TPHT. Each entry of the TPHT has an associated root block. The tree field contains a tree of blocks that represents all possible sequences of blocks, or paths, originating from the root block. The predicted path field contains a sequence of branch directions (taken or not taken) that identifies a particular path of the tree. This path represents the instruction trace (starting from the root block) that is currently stored in the trace table.
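To make the entry layout concrete, here is a sketch of one TPHT entry under our own simplifying assumptions: the tree is a full depth-4 binary tree (15 nodes, matching a four-block trace), each tree node uses a single two-bit saturating counter (the bimodal tree-node option; the two-level options would instead use local history to index a second-level table), and the predicted-path field is recomputed on demand rather than stored. The class and method names are ours, not the thesis's.

```python
from dataclasses import dataclass, field

@dataclass
class TPHTEntry:
    """One TPHT entry: a tag plus a 15-node tree of two-bit counters.
    Nodes are stored heap-style: the children of node i are 2*i+1 (not taken)
    and 2*i+2 (taken), giving a depth-4 tree of 1+2+4+8 = 15 nodes."""
    tag: int = 0
    counters: list = field(default_factory=lambda: [2] * 15)  # weakly taken

    def predicted_path(self, depth=4):
        """Follow each tree-node prediction from the root; the resulting direction
        sequence is the predicted path (i.e. the trace) for this entry."""
        path, node = [], 0
        for _ in range(depth):
            taken = self.counters[node] >= 2
            path.append(taken)
            node = 2 * node + (2 if taken else 1)
        return path

    def update(self, executed_path):
        """Train the tree-node predictors along the path that actually executed."""
        node = 0
        for taken in executed_path:
            c = self.counters[node]
            self.counters[node] = min(3, c + 1) if taken else max(0, c - 1)
            node = 2 * node + (2 if taken else 1)
```

Because only the nodes along the executed path are trained, repeated executions of one path quickly steer the predicted path toward it without disturbing predictors on other paths of the tree.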
Each node of the tree represents a block and has an associated single-branch predictor that predicts whether the branch instruction at the end of that block will be taken or not taken. The predicted path is determined from the outcomes of these single-branch predictors associated with the nodes of the tree. We call these tree-node predictors. The predicted path is used by the trace constructor to update the trace table. If the predicted path matches the trace that already exists in the trace table, no update is performed. Otherwise, the newly predicted trace is loaded into the trace

table in anticipation of a future next-trace prediction, which can involve evicting the existing trace in that entry of the trace table. During trace construction, the execution history is used to select a particular entry of the TPHT. The current sequence of blocks in the trace construction buffer (representing recently executed blocks) identifies a particular path in the tree of that TPHT entry. The tree-node predictors along this path are updated based on this sequence of recently executed blocks. In other words, the branch execution results of these buffered blocks are used to update the individual tree-node predictors along that path of the tree. Once these tree-node predictors are updated, a new predicted path for this TPHT entry can be identified and the trace constructor can then update the trace table. Essentially, the TMBP implements a form of the two-level adaptive branch prediction scheme. Global execution history (G) is used to index into the TPHT at the first level. (The fetch-time next trace predictor also uses the same global execution history.) The second level involves the tree-node predictors, which can be implemented in a number of ways. Borrowing the terminology of [19][], the second-level predictor table can be: per-branch (p), global (g), or shared (s). The per-branch option allocates a second-level table for each node of the tree, as illustrated in Figure 5. The local branch history of each node is used to index into its own second-level table to generate a prediction for that node.

Figure 5 - Tree-based pattern history table and its components (each entry holds a tag, a predicted path, and a tree; each tree node is a predictor with n bits of local history indexing a second-level table; the predicted path is highlighted with bold arcs).

The global option allocates a single second-level table that is used by all of the TPHT entries. Each node in each tree uses its local history bits to index into this single second-level table. The shared option allocates a second-level table for each TPHT entry, i.e. all nodes of a tree use the same second-level table. Hence the shared (s) option can also be called the per-tree option. In all three options, each node maintains its own local history bits, which are used to index into the second-level table. A fourth option investigated in this paper replaces the two-level adaptive scheme with the bimodal scheme: the second-level table is eliminated, and each node of the tree instead uses a single two-bit saturating counter. Section 4.3 examines the performance and space requirements of each of these four options. We also compare these four schemes with the original last path scheme used in [1].

3 Experimental Methodology

All the experimental data reported in this paper are generated by a full-function performance simulator based on the PowerPC architecture. The simulation model is developed from published reports [][8][17] and accurately models all key features of the microarchitecture.

3.1 Machine Model

To focus the current study on instruction fetching and to highlight the impact of instruction availability on machine performance, the PowerPC microarchitecture is extended to remove resource constraints, and widened to utilize more instruction bandwidth. A centralized reservation station with 512 entries and unlimited out-of-order issue bandwidth are assumed. This effectively limits the instruction window to 512 instructions. An unlimited number of functional units is also assumed. The instruction fetch, dispatch, and completion bandwidth is increased to 1 instructions per cycle. The memory hierarchy is fully modeled with a perfect main memory, a 3KB Level-1 I-cache, a 3KB Level-1 D-cache, and a 5KB unified Level-2 cache.
Access latencies are 1, 3, and 1 cycles for the L1 cache, the L2 cache, and main memory, respectively. On-chip implementation of the L2 is assumed. An unlimited load miss queue and an unlimited

store queue handle all load and store execution. The store queue performs data forwarding, and load/store instructions execute out-of-order if no address aliasing is detected.[1] All register and memory data dependencies are enforced. Instruction execution latencies can be found in [8], and accurately reflect the PowerPC. The potential bottlenecks of this machine are the data flow limit due to true data dependencies and instruction availability.

3.2 Benchmarks

The benchmark set used is the SPECint95 suite, compiled by gcc 2.7.2. To reduce simulation time, we use small input files and limit run length to million instructions for each benchmark, totaling 1. billion instructions. All user library calls are modeled, though system calls are not.

4 Experimental Results

This section explores the design space of the enhanced trace constructor outlined in Section 2. The design parameters of the fill heuristic logic are discussed in Section 4.1, the execution history and hashing in Section 4.2, and the tree-based multiple-block predictor (TMBP) in Section 4.3. Each of these sections analyzes one component with the purpose of narrowing the design space. Finally, the performance of a block-based trace cache with the enhanced trace constructor is compared to that of the original block-based trace cache design [1] in Section 4.4.

4.1 Fill Heuristic Logic

As discussed in Section 2.1.1, the fill heuristic logic controls the update of both the fetch-time predictor, namely the trace table (TT), and the fill-time predictor, namely the tree-based multiple-block predictor (TMBP). The new fill heuristic is designed to reduce the total number of traces generated and stored, by not allowing traces to overlap, thus reducing the capacity requirements of the TT and the TMBP. The new heuristic is quite efficient and allows the size of the TT and TMBP tables to be reduced without sacrificing performance.
[1] The authors are not proposing this as a realistic machine design, but a machine model that focuses the performance bottleneck on instruction fetch while enforcing register and memory data dependencies and using realistic trace prediction.

When the predictor tables are large enough,

e.g. having 8K entries, the new heuristic has insignificant impact. However, if the size is reduced to K entries for the two predictor tables, the new heuristic is able to yield a % performance gain.

4.2 Execution History and Hashing

The number of (global) branch-direction history bits and the type of path history are explored using very large predictor sizes. Large predictors reduce the capacity and conflict misses, focusing the comparison on the type of history gathered. After testing several different history lengths (k), it is discovered that longer history yields better prediction rates, but the longer warm-up increases the number of misses when accessing the predictor. A history length of k=3 is found to achieve a good balance between the correct prediction rate and the hit rate of the predictor. The remainder of this work assumes a history length of k=3. In addition to history length, the path history is also explored. The path history can range from many block-ids to just one. Several combinations of path history are explored, revealing that longer paths produce better prediction rates but take longer to warm up. Short histories warm up quickly, with slightly lower prediction rates; however, all combinations of path history yielded very similar performance numbers. Since the performance impact of the path history variations is not significant, only the last block-id is recorded, very similar to gshare [1]. Typical hashing functions include the XOR and a simple concatenation of some or all of the bits. As discussed in Section 2.1.2, aliasing is not desirable when predicting the targets of branch instructions. If the aliasing of the hashing function is significant, performance will be negatively impacted. It is discovered that a simple concatenation of the 15 history bits (12 bits for the block-id with k=3 global history bits) improves performance by .3% over a gshare [1] XOR.
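The difference between the two hashing styles can be sketched as follows; the bit widths (12-bit block-id, k=3 direction bits, 15 index bits) follow the text, while the function names are ours. With an XOR fold, distinct (block-id, history) pairs can collide in the predictor table; with concatenation the mapping is one-to-one, at the cost of a tag compare whenever fewer than all 15 bits are used as the index.

```python
def gshare_index(block_id, history, index_bits=15):
    """gshare-style hashing: XOR-fold the block-id and global history into an index.
    Distinct (block-id, history) pairs can alias to the same predictor entry."""
    return (block_id ^ history) & ((1 << index_bits) - 1)

def concat_index(block_id, history, history_bits=3, index_bits=15):
    """Concatenation: history bits in the low positions, block-id above them.
    Returns (index, tag); if index_bits covers the full width, the tag is 0 and
    no tag comparison is needed, otherwise the tag must be checked on lookup."""
    full = (block_id << history_bits) | history
    return full & ((1 << index_bits) - 1), full >> index_bits
```

For example, block-id 5 with history 0 and block-id 0 with history 5 XOR-fold to the same gshare index, but concatenate to different values.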
The performance is improved by increasing the number of correct predictions and converting most incorrect predictions into no predictions. The remainder of this work assumes a simple concatenation hash function, yielding a total of 15 bits for indexing into predictor tables.

4.3 Tree-based Multiple-block Predictor

As shown in Figure 5, the TMBP contains a branch predictor at each node of the tree structure. Figure 6 illustrates the performance of the four possible configurations for the design of the TMBP

outlined in Section 2.1.3. The first three plots show the performance of the three two-level predictors (g, p, & s), each with three different local history lengths (n), while varying the number of entries in the TPHT. The node labels in the plots indicate the number of entries (N) in the TPHT for each data point. The total byte count includes the TMBP and the TT. The TT size is N*(# tag bits + 48)/8 bytes, where 48 is the number of bits necessary to record the four block-ids (12 bits each) that represent the instruction trace. The number of TT entries equals the number of TMBP entries.

Figure 6 - Possible configurations for the TMBP and their performance (two-level global (g), two-level per-branch (p), and two-level per-tree (s) plots, plus a summary plot comparing the best of each against the two-bit and last-path schemes, all as a function of total bytes).

The first plot shows the performance of the two-level global (g) predictor with history lengths n=, 8, and 1. The total size in bytes of the TMBP with two-level global predictors is: [N*(15*n + 4 + # tag bits) + 2^(n+1)]/8, where 15 is the number of predictor nodes in each tree and 4 is the number of bits necessary to record the predicted path. The second-level predictor table (size 2^(n+1) bits) is shared among all the trees in the TMBP and is therefore only counted once. For the smaller predictor

sizes, it is seen that the trade-off between the number of entries and the length of history is won by the number of entries. As the number of entries N increases (>K), the scheme with n=1 outperforms the other two. n=1 is better than n= or 8 because the second-level predictor table is larger and is able to reduce aliasing. The second plot shows the performance of the two-level per-branch (p) predictor with local history lengths of n=1, 3, and 5 bits. The size (in bytes) of the TMBP with two-level per-branch predictors is: [N*(15*(n + 2^(n+1)) + 4 + # tag bits)]/8. Notice that the second-level table (size 2^(n+1) bits) is accounted for at each node. This seriously increases the area requirement of the per-branch configuration. History length n=3 performs significantly better than the other two, because it provides a balance between table entry count and history length. The third plot shows the performance of the two-level per-tree (s) predictor with three history lengths n. The size (in bytes) of the TMBP with two-level per-tree predictors is: [N*(15*n + 4 + # tag bits + 2^(n+1))]/8. Here the second-level table is only per-tree, which significantly reduces the capacity requirements as compared to the per-branch predictor. A history length of n= is the best. The final plot in Figure 6 shows the best results of the three two-level adaptive configurations against the bimodal and last-path schemes. The bimodal scheme has only a two-bit predictor at each node, yielding a size of: [N*(15*2 + 4 + # tag bits)]/8. The last-path algorithm from [1] does not require a TMBP, and therefore its size depends only on the TT. The data in this figure shows that the two-level per-tree predictor significantly outperforms any other predictor. The performance of the last-path algorithm levels off very quickly, because the number of branch mispredictions is quite significant and quickly dominates the capacity misses.
The two-bit predictor is more effective, but still levels off well below the performance of the two-level predictors. The two-level per-tree predictor with n= and N=K (most efficient for the 5-1 Kbyte range) is considered the sweet spot and is used in the remainder of this paper.
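The storage expressions above can be collected into one hypothetical helper. The constants follow the text (15 tree nodes per entry, a 4-bit predicted path, second-level tables of 2^n two-bit counters, i.e. 2^(n+1) bits); the tag-bit count is left as a parameter since it depends on how many of the 15 execution-history bits are used as the index. The function and key names are ours.

```python
def sizes_bytes(N, n, tag_bits, nodes=15, path_bits=4):
    """Approximate storage in bytes for the four TMBP options, following the
    expressions in the text. N is the TPHT entry count, n the local history
    length; each second-level table holds 2**n two-bit counters (2**(n+1) bits)."""
    second = 2 ** (n + 1)  # bits in one second-level table
    return {
        "global":     (N * (nodes * n + path_bits + tag_bits) + second) / 8,
        "per_branch":  N * (nodes * (n + second) + path_bits + tag_bits) / 8,
        "per_tree":    N * (nodes * n + path_bits + tag_bits + second) / 8,
        "bimodal":     N * (nodes * 2 + path_bits + tag_bits) / 8,
    }
```

Evaluating the helper for any fixed N and n makes the qualitative ordering in the text visible: per-branch pays for a second-level table at every node, per-tree pays for one per entry, and global pays for one table total.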

We now take a closer look at the prediction accuracy comparison between the per-tree (N=K, n=) and the last-path (N=8K) schemes. Both predictors predict to a depth of 4 (i.e. predicting up to four blocks per cycle). Figure 7 displays the prediction accuracy at each level for the SPECint95 benchmarks. For each level, accuracy is measured only if a prediction is made at that level. Also, the accuracy at each level is compounded: for example, the compounded accuracy at level 2 is equal to the prediction accuracy of level 2 multiplied by the accuracy of level 1. Notice there is significant improvement in accurately predicting the third and fourth blocks. This improvement is due to the use of the tree structure of the TMBP and significantly increases the fetch bandwidth and the resultant performance.

Figure 7 - Branch prediction performance comparison (compounded accuracy (%) versus number of blocks predicted correctly, last-path versus two-level per-tree (N=K, n=)).

The new trace constructor is more complex; the update of the TT could take longer than a single cycle. We ran simulations extending the fill latency all the way to 1 cycles and found no impact on either the hit rate of the trace table or the overall performance of the benchmarks.

4.4 Performance Improvement in the Block-based Trace Cache

Figure 8 shows the gain of the enhanced trace constructor in the block-based trace cache. The addition of the new fill heuristic and the TMBP with two-level per-tree predictors improves performance by 3% to 8%, with an average improvement of %. The TMBP uses N=K entries and a local history length of n=. All of this performance improvement is due to the increased prediction rates of the multiple blocks shown in Figure 7. Further improvement can be achieved with a larger TMBP.

Figure 8 - Performance of the enhanced trace constructor in the block-based trace cache (last-path [1] versus two-level per-tree across comp, gcc, go, ijpeg, li, m88k, perl, vort, and the harmonic mean).

5 Analysis of Experimental Results

In this section we present the performance of the block-based trace cache employing the new tree-based multiple-block predictor (TMBP), as a function of the block cache size or total instruction storage capacity. We also compare this performance against four different idealized machine configurations. We then analyze the breakdown of the total execution cycles to assess the effectiveness of the block-based trace cache with a realistic TMBP.

5.1 Performance Comparisons

Figure 9 contains nine graphs, one for each benchmark, with a ninth one showing the harmonic means of all the benchmarks. Each graph plots sustained IPC as a function of total trace cache storage capacity in bytes. (The machine width is 1.) In each graph there are two straight lines and three curves. The two straight lines represent the idealized machine configurations of perfect fetch (pfetch) and perfect branch prediction (pbranch). These two configurations do not employ a trace cache and hence are independent of the trace cache size. Perfect fetch can perfectly predict and fetch beyond any number of branches and can cross I-cache line boundaries. It is only limited by true data dependencies and the size of the instruction window (512 instructions). Perfect branch is a standard I-cache based machine that can predict all conditional branches with 100% accuracy but

Figure 9 - IPC as a function of bytes of trace storage for all benchmarks (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, and the harmonic mean; curves: pfetch, pblock, rblock, pbranch, ptrace; annotated regions: data dependencies/instruction window limit, block cache capacity misses & fragmentation, block mispredictions, block-based over traditional, taken branch boundary).

can only fetch up to the first taken branch instruction or the I-cache line boundary. This configuration is limited by cache line boundaries and instruction cache misses, and represents the best that can be achieved without using a trace cache.

The three curves in each graph represent the three machine configurations that employ a trace cache. The middle curve, denoted real block (rblock), plots the IPC that can be achieved by a block-based trace cache that employs the realistic TMBP chosen in Section 4. For this TMBP, the tree-based pattern history table (TPHT) as well as the trace table each contain N=K entries. Each tree-node predictor in the TPHT uses n= local history bits to index into a second-level table shared by all the nodes in a tree, i.e. the per-tree scheme. The top curve, denoted perfect block (pblock), shows the IPC that can be achieved by a block-based trace cache that employs a perfect multiple-block predictor. The gap between the real block and perfect block curves reveals the imperfect prediction of the realistic TMBP. The bottom curve, denoted perfect trace (ptrace), shows the IPC achievable by a conventional trace cache with perfect trace prediction. The perfect trace curve represents the best performance that a conventional trace cache can possibly achieve.

We now compare the real block performance to the other four idealized configurations. Except for compress and go, we see that for all the other benchmarks the real block performance exceeds that of perfect branch given adequate block cache capacity. Looking at the harmonic means graph, we see that real block crosses over perfect branch with a total trace cache storage capacity of about 5KB (or a K-entry block cache). This total storage capacity also includes the K-entry trace table. This comparison shows the need to be able to cross taken-branch boundaries and justifies the use of the trace cache.
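The per-tree scheme described above can be sketched in code. This is a minimal illustrative model only: the parameter values (4 local history bits, depth-4 trees) and class names are assumptions for the sketch, not the thesis configuration. Each tree node keeps its own local branch history, while all nodes in a tree index one shared second-level table of 2-bit saturating counters.

```python
# Sketch of a "per tree" two-level tree-node predictor: each node has its
# own n-bit local history; all nodes in a tree share one table of 2-bit
# saturating counters. Parameter values below are illustrative assumptions.

N_HISTORY_BITS = 4   # local history bits per tree node (assumed)
TREE_DEPTH = 4       # up to 4 block predictions per tree walk (assumed)

class TreeNodePredictor:
    """One branch predictor sitting at a node of the tree."""
    def __init__(self, shared_counters):
        self.history = 0               # n-bit local branch history
        self.shared = shared_counters  # second-level table, shared per tree

    def predict(self):
        # Counters 2 and 3 mean "predict taken".
        return self.shared[self.history] >= 2

    def update(self, taken):
        # Saturating 2-bit counter update, then shift in the new outcome.
        c = self.shared[self.history]
        self.shared[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | taken) & ((1 << N_HISTORY_BITS) - 1)

class PredictorTree:
    """A tree of node predictors; each root-to-leaf path is one block sequence."""
    def __init__(self):
        shared = [1] * (1 << N_HISTORY_BITS)  # one second-level table per tree
        self.nodes = [TreeNodePredictor(shared)
                      for _ in range((1 << TREE_DEPTH) - 1)]

    def predict_path(self):
        """Walk from the root, collecting one taken/not-taken prediction per level."""
        idx, path = 0, []
        for _ in range(TREE_DEPTH):
            taken = self.nodes[idx].predict()
            path.append(taken)
            idx = 2 * idx + (2 if taken else 1)  # left = not taken, right = taken
            if idx >= len(self.nodes):
                break
        return path

tree = PredictorTree()
print(tree.predict_path())  # [False, False, False, False] before any training
```

The shared second-level table is what distinguishes the per-tree scheme from giving every node a private counter table: it cuts storage at the cost of possible aliasing between nodes.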
It is interesting to note that with limited trace storage capacity, real block can actually outperform perfect trace for most of the benchmarks. Looking at the harmonic means graph, we see that for trace storage capacities of less than 1KB, real block consistently achieves a higher IPC than perfect trace. Perfect trace performance crosses over real block performance at a trace storage capacity

Figure 10 - How total execution cycles are spent (the exploded pieces are the trace cache accesses)

beyond 1KB. When realistic trace prediction is taken into account for the conventional trace cache, this crossover point will move further out.

5.2 Analysis of Execution Cycles

In this subsection we attempt to gain a better understanding of the gap between real block and perfect block by analyzing the breakdown of all execution cycles to see where cycles are spent or wasted. Figure 10 provides the breakdown. We notice that in 8% of all execution cycles, the fetch buffer is full and no fetching is done. This indicates that either the window is full or there is a dispatch stall. In % of all execution cycles, fetching is performed from the trace cache. In % of the cycles I-cache hits are achieved, and in 1% of the cycles we are waiting on an I-cache miss. The I-cache is consulted if the trace table is not making a prediction. In a towering 9% of the cycles, fetch is worthless as we are waiting for a mispredicted branch to be resolved. Even though the overall block prediction rate is near 9%, the branch misprediction penalty is the biggest piece of the pie.

A closer examination of the % of the cycles in which fetching occurs from the trace cache reveals that in % of the cycles trace misprediction occurs and no block is fetched from the trace cache. The fractions of cycles in which 1 block, 2 blocks, 3 blocks, and 4 blocks are fetched from the trace cache are %, 1%, 1%, and 13%, respectively. These percentages indicate that the multiple-block predictor in the block-based trace cache is doing quite well in facilitating the fetching of multiple blocks in each cycle. We are pleasantly surprised to see that in 13% of all execution cycles, four blocks are successfully fetched from the trace cache.
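The accounting behind a breakdown like Figure 10 can be sketched as follows: every execution cycle is charged to exactly one category, and the pie chart reports each category's share of the total. The counts in the example log are made up for illustration; they are not the measured values.

```python
# Sketch of per-cycle accounting for a Figure-10-style breakdown.
# Each simulated cycle is tagged with exactly one category; the breakdown
# is each category's percentage of total cycles. Counts below are invented.

from collections import Counter

CATEGORIES = [
    "buffer full", "branch penalty", "icache hit", "icache miss penalty",
    "0 blocks", "1 block", "2 blocks", "3 blocks", "4 blocks",
]

def cycle_breakdown(cycle_log):
    """Return each category's percentage of the total execution cycles."""
    counts = Counter(cycle_log)
    total = sum(counts.values())
    return {cat: 100.0 * counts[cat] / total for cat in CATEGORIES}

# Hypothetical log: one category label per simulated cycle (100 cycles here).
log = (["buffer full"] * 8 + ["branch penalty"] * 29 + ["icache hit"] * 6 +
       ["icache miss penalty"] * 11 + ["0 blocks"] * 4 + ["1 block"] * 6 +
       ["2 blocks"] * 12 + ["3 blocks"] * 11 + ["4 blocks"] * 13)
breakdown = cycle_breakdown(log)
print(breakdown["4 blocks"])  # 13.0
```

Because the categories partition the cycles, the percentages always sum to 100, which is what makes "biggest piece of the pie" comparisons like the one above meaningful.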

6 Conclusion

In [1] the block-based trace cache was proposed as a realistic and efficient way to implement a trace cache. In that previous work a simplistic last-path predictor was used to perform the necessary multiple-block prediction. In this paper we introduced a new tree-based multiple-block predictor (TMBP) that is used at fill time to predict traces and select them for storage in the trace table. A realistic design of a TMBP is presented that employs a two-level adaptive scheme with a first-level tree-based pattern history table (TPHT) of K entries and a second-level shared table indexed by bits of local history. With this realistic multiple-block predictor the new block-based trace cache is able to achieve an improvement of % over the original block-based trace cache [1] for the SPECint95 benchmark suite. We also show that the block-based trace cache with the realistic TMBP is able to outperform both a traditional I-cache based machine with perfect branch prediction and a conventional trace cache design with perfect trace prediction. These results highlight the limitations of the traditional I-cache based design, motivate and justify the use of the trace cache, and further validate that the block-based approach is an effective and efficient way to implement trace caches.

By examining the gap between the real block and perfect block performance curves of Figure 9, we see that there is still headroom to improve the multiple-block predictor for the block-based trace cache. It is also encouraging to see that perfect block performance approaches that of perfect fetch. The gap between these two curves is due to the limited capacity and fragmentation of the block cache. This gap can be narrowed through more efficient storing of blocks in the replicated block cache, for example, by not storing every block in every copy of the block cache but only in the copy where it is likely to be accessed. We will pursue these issues in our future research.
7 References

[1] B. Black, B. Rychlik, and J. Shen, The Block-based Trace Cache. In Proceedings of the 26th Annual International Symposium on Computer Architecture, May 1999.
[2] B. Calder and D. Grunwald, Next Cache Line and Set Prediction. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.
[3] T. Conte, K. Menezes, P. Mills, and B. Patel, Optimization of Instruction Fetch Mechanisms for High Issue Rates. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.
[4] K. Diefendorf and E. Silha, The PowerPC User Instruction Set Architecture. In IEEE Micro, 1994.
[5] S. Dutta and M. Franklin, Control Flow Prediction with Tree-Like Subgraphs for Superscalar Processors. In Proceedings of the 28th International Symposium on Microarchitecture, December 1995.
[6] D. Friendly, S. Patel, and Y. Patt, Alternative Fetch and Issue Policies for the Trace Cache Fetch Mechanism. In Proceedings of the 30th International Symposium on Microarchitecture, November 1997.
[7] D. Friendly, S. Patel, and Y. Patt, Putting the Fill Unit to Work: Dynamic Optimizations for Trace Cache Microprocessors. In Proceedings of the 31st International Symposium on Microarchitecture, December 1998.
[8] IBM Microelectronics Division, PowerPC RISC Microprocessor User's Manual, 1994.
[9] Q. Jacobson, E. Rotenberg, and J. Smith, Path-Based Next Trace Prediction. In Proceedings of the 30th International Symposium on Microarchitecture, November 1997.
[10] S. McFarling, Combining Branch Predictors. Technical Report TN-36, Digital Equipment Corp., June 1993.
[11] R. Nair, Dynamic Path-based Branch Correlation. In Proceedings of the 28th International Symposium on Microarchitecture, December 1995.
[12] E. Rotenberg, S. Bennett, and J. E. Smith, Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching. In Proceedings of the 29th International Symposium on Microarchitecture, December 1996.
[13] S. Patel, D. Friendly, and Y. Patt, Evaluation of Design Options for the Trace Cache Fetch Mechanism. IEEE Transactions on Computers, Special Issue on Cache Memory and Related Problems, 1999.
[14] S. Patel, M. Evers, and Y. Patt, Improving Trace Cache Effectiveness with Branch Promotion and Trace Packing. In Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998.
[15] A. Seznec, S. Jourdan, P. Sainrat, and P. Michaud, Multiple-Block Ahead Branch Predictors. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996.
[16] J. Smith, A Study of Branch Prediction Strategies. In Proceedings of the 8th International Symposium on Computer Architecture, May 1981.
[17] S. Song, M. Denman, and J. Chang, The PowerPC 604 RISC Microprocessor. In IEEE Micro, 1994.
[18] T-Y. Yeh, D. Marr, and Y. Patt, Increasing the Instruction Fetch Rate via Multiple Branch Prediction and a Branch Address Cache. In Proceedings of the 7th ACM International Conference on Supercomputing, July 1993.
[19] T-Y. Yeh and Y. Patt, Alternative Implementations of Two-Level Adaptive Branch Prediction. In Proceedings of the 19th International Symposium on Computer Architecture, May 1992.
[20] T-Y. Yeh and Y. Patt, A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History. In Proceedings of the 20th International Symposium on Computer Architecture, May 1993.


More information

Static Branch Prediction

Static Branch Prediction Announcements EE382A Lecture 5: Branch Prediction Project proposal due on Mo 10/14 List the group members Describe the topic including why it is important and your thesis Describe the methodology you will

More information

Instruction Level Parallelism (Branch Prediction)

Instruction Level Parallelism (Branch Prediction) Instruction Level Parallelism (Branch Prediction) Branch Types Type Direction at fetch time Number of possible next fetch addresses? When is next fetch address resolved? Conditional Unknown 2 Execution

More information

An Efficient Indirect Branch Predictor

An Efficient Indirect Branch Predictor An Efficient Indirect ranch Predictor Yul Chu and M. R. Ito 2 Electrical and Computer Engineering Department, Mississippi State University, ox 957, Mississippi State, MS 39762, USA chu@ece.msstate.edu

More information

Combining Local and Global History for High Performance Data Prefetching

Combining Local and Global History for High Performance Data Prefetching Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Using Cache Line Coloring to Perform Aggressive Procedure Inlining

Using Cache Line Coloring to Perform Aggressive Procedure Inlining Using Cache Line Coloring to Perform Aggressive Procedure Inlining Hakan Aydın David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA, 02115 {haydin,kaeli}@ece.neu.edu

More information

Threaded Multiple Path Execution

Threaded Multiple Path Execution Threaded Multiple Path Execution Steven Wallace Brad Calder Dean M. Tullsen Department of Computer Science and Engineering University of California, San Diego fswallace,calder,tullseng@cs.ucsd.edu Abstract

More information

Improving Value Prediction by Exploiting Both Operand and Output Value Locality. Abstract

Improving Value Prediction by Exploiting Both Operand and Output Value Locality. Abstract Improving Value Prediction by Exploiting Both Operand and Output Value Locality Youngsoo Choi 1, Joshua J. Yi 2, Jian Huang 3, David J. Lilja 2 1 - Department of Computer Science and Engineering 2 - Department

More information

Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures

Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures Eric Hao, Po-Yung Chang, Marks Evers, and Yale N. Patt Advanced Computer Architecture Laboratory Department of Electrical

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

Why memory hierarchy

Why memory hierarchy Why memory hierarchy (3 rd Ed: p.468-487, 4 th Ed: p. 452-470) users want unlimited fast memory fast memory expensive, slow memory cheap cache: small, fast memory near CPU large, slow memory (main memory,

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

The 2-way Thrashing-Avoidance Cache (TAC): An Efficient Instruction Cache Scheme for Object-Oriented Languages

The 2-way Thrashing-Avoidance Cache (TAC): An Efficient Instruction Cache Scheme for Object-Oriented Languages The 2- Thrashing-voidance Cache (): n Efficient Instruction Cache Scheme for Object-Oriented Languages Yul Chu and M. R. Ito Electrical and Computer Engineering Department, University of British Columbia

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined

More information

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog) Announcements EE382A Lecture 6: Register Renaming Project proposal due on Wed 10/14 2-3 pages submitted through email List the group members Describe the topic including why it is important and your thesis

More information

Use-Based Register Caching with Decoupled Indexing

Use-Based Register Caching with Decoupled Indexing Use-Based Register Caching with Decoupled Indexing J. Adam Butts and Guri Sohi University of Wisconsin Madison {butts,sohi}@cs.wisc.edu ISCA-31 München, Germany June 23, 2004 Motivation Need large register

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

Optimizations Enabled by a Decoupled Front-End Architecture

Optimizations Enabled by a Decoupled Front-End Architecture Optimizations Enabled by a Decoupled Front-End Architecture Glenn Reinman y Brad Calder y Todd Austin z y Department of Computer Science and Engineering, University of California, San Diego z Electrical

More information

AN ANALYSIS OF VALUE PREDICTIBALITY AND ITS APPLICATION TO A SUPERSCALAR PROCESSOR

AN ANALYSIS OF VALUE PREDICTIBALITY AND ITS APPLICATION TO A SUPERSCALAR PROCESSOR AN ANALYSIS OF VALUE PREDICTIBALITY AND ITS APPLICATION TO A SUPERSCALAR PROCESSOR by Yiannakis Thrasou Sazeides A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Jong Wook Kwak 1, Seong Tae Jhang 2, and Chu Shik Jhon 1 1 Department of Electrical Engineering and Computer Science, Seoul

More information

Computer Architecture EE 4720 Final Examination

Computer Architecture EE 4720 Final Examination Name Computer Architecture EE 4720 Final Examination Primary: 6 December 1999, Alternate: 7 December 1999, 10:00 12:00 CST 15:00 17:00 CST Alias Problem 1 Problem 2 Problem 3 Problem 4 Exam Total (25 pts)

More information

Comparing Multiported Cache Schemes

Comparing Multiported Cache Schemes Comparing Multiported Cache Schemes Smaїl Niar University of Valenciennes, France Smail.Niar@univ-valenciennes.fr Lieven Eeckhout Koen De Bosschere Ghent University, Belgium {leeckhou,kdb}@elis.rug.ac.be

More information

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1 Superscalar Processing 740 October 31, 2012 Evolution of Intel Processor Pipelines 486, Pentium, Pentium Pro Superscalar Processor Design Speculative Execution Register Renaming Branch Prediction Architectural

More information

On Pipelining Dynamic Instruction Scheduling Logic

On Pipelining Dynamic Instruction Scheduling Logic On Pipelining Dynamic Instruction Scheduling Logic Jared Stark y Mary D. Brown z Yale N. Patt z Microprocessor Research Labs y Intel Corporation jared.w.stark@intel.com Dept. of Electrical and Computer

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information