Tango: a Hardware-based Data Prefetching Technique for Superscalar Processors

Shlomit S. Pinter, IBM Science and Technology, MATAM Advanced Technology Ctr., Haifa 31905, Israel, shlomit@vnet.ibm.com
Adi Yoaz, Intel Israel (74), MATAM Advanced Technology Ctr., Haifa 31905, Israel, ayoaz@iil.intel.com

Abstract

We present a new hardware-based data prefetching mechanism for enhancing instruction level parallelism and improving the performance of superscalar processors. The emphasis in our scheme is on the effective utilization of slack (dead) time and hardware resources not used for the main computation. The scheme suggests a new hardware construct, the Program Progress Graph (PPG), as a simple extension to the Branch Target Buffer (BTB). We use the PPG for implementing a fast pre-program counter, pre-pc, that travels only through memory reference instructions (rather than scanning all the instructions sequentially). In a single clock cycle the pre-pc extracts all the predicted memory references in some future block of instructions, to obtain early data prefetching. In addition, the PPG can be used to implement a pre-processor and for instruction prefetching. The prefetch requests are scheduled to "tango" with the core requests from the data cache, by using only free time slots on the existing data cache tag ports. Employing special methods for removing prefetch requests that are already in the cache (without utilizing the cache-tag ports bandwidth) and a simple optimization of the cache LRU mechanism reduce the number of prefetch requests sent to the core-cache bus and to the memory (second level) bus. Simulation results on the SPEC92 benchmark for the baseline architecture (32K-byte data cache and 12-cycle fetch latency) show an average speedup of 1.36 (CPI ratio). The highest speedup of 2.53 is obtained for systems with a smaller data cache (8K-byte).

Key words: data prefetching, instruction prefetching, memory reference prediction, superscalar processors, pre-instruction decoding, BTB, branch prediction.

Preliminary results were published in the MICRO-29 conference (see [13]).

1 Introduction

The demand on the performance of memory subsystems is rapidly increasing with the advances in microprocessor architecture. The growing gap between processor and memory speeds further increases the memory access penalty problem. Data prefetching is an effective technique for hiding memory latency and can alleviate the pressure to constantly increase the size of first-level caches. Timely and correct prediction of data accesses are the key issues in both software- and hardware-based prefetching techniques.

In hardware-based schemes, the timely generation of a data prefetch and the additional traffic on both the memory bus and the cache-tag ports can potentially be controlled so as not to interfere with the main computation process. In addition, the amount of prefetching done can be tuned to the cache parameters using dynamic information. Prediction in hardware can recognize constant memory strides (resulting from integer-valued linear functions of loop variables) [7, 1, 8]. For speculative data prefetching, the hardware-based techniques can improve on software approaches by using information from the branch prediction unit. Hardware-based off-chip schemes, like the stream buffer [6], are driven by cache miss accesses, whereas with on-chip schemes all the memory access instructions are sampled and more prediction information is available [7, 1]. Prediction on-chip is not influenced by out-of-order computations, and the data cache status can be used for controlling the prefetch rate. However, no on-chip prefetch schemes have been suggested for superscalar processors. The main problem in the design of an on-chip prefetch scheme for superscalar processors is to initiate the prefetch requests fast enough. Another problem, studied in [14] and noted by [6], is the cache bandwidth problem. The problem is manifested by the heavy load on the data ports and tag ports of the data cache. In superscalar processors the cache-tag ports are one of the most critical resources, and extra prefetch requests (especially with aggressive on-chip prediction) may create contention. In our scheme we provide an effective and simple solution to these problems.

In software-based prefetching, extra prefetch instructions are inserted. The new instructions may increase the length of the critical path since there are not enough "open" slots (in the load functional units), especially for data-intensive programs, or when prefetching most of the memory access instructions [14]. The extra prefetch instructions also occupy space in the I-cache and increase the traffic on the memory bus. Whereas extensive software analysis can generate accurate static predictions, the timely generation of the prefetch requests is a problem (see [11, 2]). In [11] the number of iterations to prefetch ahead was chosen to be the prefetch latency divided by the length of the shortest path through the loop. In general, prefetch instructions in the wrong places (time) can cause cache pollution if no special prefetch buffer is used. For reducing such misses and for removing redundant prefetch instructions, the compiler must be tuned to the data cache parameters. Such tuning has no effect on existing code and is very intricate for the compiler (see [2, 11] for some heuristics).

In this paper we present the design and simulation results of a new on-chip hardware-based data prefetching scheme called Tango.

The scheme uses a pre-pc (pre-program counter) component that exploits only instructions that use the data cache (rather than scanning the instructions sequentially). At each clock cycle, all the memory references in some future block of instructions are exposed with the help of the Program Progress Graph (PPG) table. The pre-pc mechanism uses the data prediction component (which reveals all constant-stride accesses) to generate prefetch requests for these instructions. The scheme thus provides data fast enough for future multi-functional processors that can employ out-of-order execution. The prefetch requests do not interfere with the main computation and are scheduled to tango with the core requests from the cache (they only use the free time slots on the existing data cache tag ports). Tango employs a new, simple method that removes prefetch requests for data already in the cache without consuming cache-tag ports bandwidth (this mechanism can be used by other schemes as well). Another simple optimization of the cache LRU mechanism lets us further reduce the number of prefetch requests sent to the core-cache bus and to the memory bus. The cost incurred by Tango involves chip space, which is smaller than a 4K-byte dual-ported cache.

The PPG, which is implemented as a small and simple extension to a BTB, provides an efficient and fast method for lookahead on instructions. Thus, in addition to data prefetching, it can be used as a pre-processor for decoding instructions early enough on their way to the core, or for instruction prefetching.

Our investigations show that the Tango prefetching scheme significantly improves the overall system performance and extensively reduces the memory access miss penalty. In most of the programs tested, the speedup gained is over 1.33 (an average of 1.36 for the baseline architecture with a memory fetch latency of 12 cycles), and the reduction in the memory penalty ranges from 43% to 96%.

We first discuss related work in Section 2. Then, in Section 3, we present our scheme and discuss its properties. Section 4 includes the machine model, and experimental results are presented in Section 5. Conclusions and discussion are presented in Section 6.

2 Related Work

Most simple hardware data prefetching schemes are based on data misses that can trigger the prefetch of the next data line. Such schemes were investigated in [4] and, with the help of special prefetch buffers (stream buffers), by [9]. Similar schemes that use more elaborate off-chip predictions are studied in [12] and in [6]. For improving the prediction, [7, 8] use a history table for keeping the previous memory address. The difference between the current address and the previous one is used to calculate the memory stride. The prefetch is issued for the next iteration based on the calculated stride. The penalty incurred by small-stride accesses was investigated in [5]. We note that Tango does not generate requests to the data cache and the memory bus for predictions with small strides whenever the relevant part is in some cache line (cache block).

Even with good prediction, data can be prefetched too early or too late to be useful. One way of solving this problem is a lookahead scheme. The lookahead scheme in [10] is based on generating a data prefetch for operands simultaneously with the decoding of the instruction. Prediction and lookahead are integrated by Baer and Chen in [1, 3] to support prefetching for scalar processors. In this on-chip scheme the stride prediction is calculated with a reference prediction (history) table (RPT) indexed by load/store instruction addresses. The lookahead mechanism implements a lookahead program counter (LA-PC) that advances at the same pace as the PC. At each clock cycle a new instruction is scanned. The time wasted on checking all the instructions prevents adapting this method to superscalar processors with multiple-instruction issue rates. Our pre-pc mechanism skips over instructions that are not memory references (it goes only through memory reference instructions). Thus, it can progress at a higher pace and may use more of the precious free time slots on the core-cache bus (cache-tag ports).

In [3] branches are chosen for the LA-PC by using the BTB with a duplicated branch address field (indexing field) for the use of the lookahead PC. Instead, we add extra information to the BTB with the same amount of extra hardware. Furthermore, in [3], whenever an incorrect branch prediction occurs the distance between the LA-PC and the PC is reset and has to build up by waiting for a data miss. During such a period, prefetches may not be issued early enough. In our scheme the pre-pc can build up the distance from the PC immediately after a wrong branch prediction, without waiting for a data miss.

The major differences between the scheme by Baer and Chen and Tango are:

- The scheme by Baer and Chen does not apply to superscalar (multiple-issue) processors. Adaptation to such a processor is impossible with the current LA-PC and RPT. This precludes the possibility of comparing the performance of the two schemes in a multiple-issue environment.
- The pre-pc lookahead scheme in Tango scans only branches and memory access instructions. The number of memory access instructions analyzed in a single clock cycle may be equal to the number of ports to the data cache. This is in contrast to the LA-PC operation.
- In Tango we offer a new technique for removing (filtering) undesirable predictions without the need to consume cache-tag bandwidth. This mechanism can be used by Baer and Chen as well.
- Tango implements an improvement to the LRU replacement algorithm. In some cases the total number of transactions on the memory bus with Tango was smaller than that of the system without prefetching.

3 The Tango Data Prefetching Scheme

The Tango hardware predicts when data is likely to be needed and generates requests to bring it into the data cache. To accomplish this goal, memory access instructions are saved in the Reference Prediction Table for Superscalar Processors (SRPT) together with some history information. In order to bring the predicted data on time, Tango employs a fast pre-program counter, pre-pc, that uses the branch prediction information and the PPG (a graph representing the predicted future of the execution flow). With the PPG information the pre-pc searches the predicted memory access instructions in the SRPT and generates the prefetch requests. The stream of prefetch requests is filtered in the prefetch requests controller (PRC) and redundant requests are removed. This is done before the requests query the cache, thus reducing primary cache-tag ports bandwidth consumption as well as memory bus bandwidth.

In this section we describe our data prefetching mechanism and discuss its hardware considerations. We start with the motivation and a functional presentation of the design, and then provide the details of the components.

3.1 Motivation

The purpose of a data prefetching scheme is to bring data into the cache such that it will be there when needed by the memory access instruction. Along this line the Tango scheme has the following goals:

- Provide a design to generate data prefetches on time for superscalar processors, without the need to change, revalidate, or interfere with existing components (specifically, with critical path timing).
- Generate correct predictions for as many memory references as possible.
- Use data cache-tag ports only for data not in the cache and only when not used by the core.
- Issue prefetch requests to memory in time (so the data will be available in the cache when needed) and only for relevant data (not in the cache or on its way to the cache).
- Incur no execution time penalty for prediction or for prefetching data already in the cache.

The Tango scheme exploits the advantages of the lookahead scheme presented for scalar processors by [1] and extends it further to superscalar processors. In [6] it is suggested that lookahead schemes are not efficient for superscalar processors due to the need for extra ports to the data cache tag. Indeed, in our simulations the tag ports are heavily used by the core (demand fetches). Thus, in order to solve this problem, Tango filters out prefetch requests for data in the cache before the requests consume data cache-tag ports bandwidth, and it uses only open slots on the cache-tag ports.

The scheme comprises three functional parts. The first is a special lookahead mechanism. Our pre-pc mechanism jumps from a branch point to its predicted successor branch point (using the branch prediction information), and in each block it searches only through the memory reference instructions. It is implemented as a simple extension to the BTB, called the PPG, and a special field in the SRPT. The second functional part generates data access predictions. This is done by storing the access history information of the memory reference instructions. Our mechanism is based on the Reference Prediction Table of [1], designed for scalar processors. The table is enhanced to support the fast pre-pc, which extracts the reference predictions and advances ahead of the PC more effectively than the lookahead PC in [1]. The third part, the Prefetch Requests Controller (PRC), is a mechanism for filtering out redundant prefetch requests. The SRPT and the lookahead mechanism (pre-pc) generate prefetch requests for most of the future memory access instructions. With software prefetching, some extra analysis is done in order to remove prefetch requests whenever it is predicted that the data is already in the cache [11, 2]. In our scheme this task is very simple. In particular, our PRC has a simple mechanism for removing redundant prefetch requests without the need to probe the data cache.

3.2 PPG: The Program Progress Graph

The first hardware component is the Program Progress Graph (PPG), generated from the instructions currently viewed by the processor. In this directed graph every node corresponds to a branch instruction and an edge to a block of instructions on the path between the corresponding two branch instructions. A number on every edge indicates the number of instructions in that block. A number in a node is the entry number of the branch in the PPG table (also marked br-entry-num). Figure 1 is an example of a program fragment and its PPG. For example, instructions 17 and 3 are in entries 18 and 15, respectively; the marking T,3 on the edge from br-num 18 to br-num 15 corresponds to instructions 1,2,3 of the taken block following instruction 17.

The PPG is stored as an extension to the BTB by adding four new columns. An entry in the BTB/PPG table has 7 fields:

branch-pc | target | prediction-info | T-entry | NT-entry | T-size | NT-size

Figure 1: A sample program fragment and its PPG (branch instructions 3 (beqz), 13 (bnez), and 17 (beqz) are held in PPG entries 15, 30, and 18, respectively; edges are labeled with the direction and block size, e.g., NT,10 from br-num 15 to br-num 30, NT,4 from br-num 30 to br-num 18, and T,3 from br-num 18 to br-num 15).

The first three fields are the address of the branch instruction (branch tag), the branch target address, and the branch prediction information of the BTB. The T-entry field contains the entry number (in the BTB/PPG table) of the next branch on the taken path, and NT-entry is the entry number of the next branch along the not-taken path. Each of the T-size and NT-size fields holds the size of the block (number of instructions) following the branch on the taken and not-taken paths, respectively. The relevant parts of entries 15, 18, and 30 of the BTB/PPG table for the example in Figure 1 are given in Table 1.

Table 1: Entries 15, 18, and 30 of the BTB/PPG table for the example in Figure 1 (columns: entry number, BTB info., T-entry, NT-entry, T-size, NT-size).

A simple BTB has two large fields: the address of the branch instruction (part of it is used as a tag) and the branch target address.
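To make the entry layout concrete, here is a minimal Python sketch of a BTB/PPG row as a plain record; the class name and the 512-entry table are illustrative only, and the field widths discussed next are not enforced by the sketch.

```python
from dataclasses import dataclass

@dataclass
class BTBPPGEntry:
    """One row of the combined BTB/PPG table (the 7 fields listed above)."""
    branch_pc: int        # address of the branch instruction (part used as tag)
    target: int           # branch target address
    prediction_info: int  # BTB branch-prediction state
    t_entry: int          # PPG entry number of the next branch on the taken path
    nt_entry: int         # PPG entry number of the next branch on the not-taken path
    t_size: int           # number of instructions in the block on the taken path
    nt_size: int          # number of instructions in the block on the not-taken path

# Hypothetical 512-entry table; a size field of 0 means the successor is not yet known.
ppg_table = [BTBPPGEntry(0, 0, 0, 0, 0, 0, 0) for _ in range(512)]
```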

For implementing a parallel lookup for the PC and a simple pre-pc, there is a need for an extra tag of a size similar to the address size. Instead, in our scheme each of the two extra entry fields, T-entry and NT-entry, contains an entry number in the PPG table, which takes 9 bits for a 512-entry table. The size fields (T-size and NT-size), of 7 bits each, are used for controlling the distance between the PC and the pre-pc. As a result, 32 bits per entry are needed to keep the extra information, independent of the processor's address size. Note that the allocation of a new branch (entry) in the PPG is done only when the branch is first taken; hence a block of the PPG may include more than a single basic block of the program. Yet, the graph truly represents the execution taking place.

Updating the PPG Table and Hardware Design Considerations

Following a hit on a PC lookup in the BTB, the prediction and the entry number of the current branch are found. The PPG's entry number of the previous branch instruction was saved (during its update) in a small buffer, together with the taken/not-taken bit flag. Following the current branch resolution, a direct update is performed on both the current branch fields and two fields of the previous branch. Since different fields are changed in the two updates (prediction-info and possibly target in the current branch, and possibly T-entry with T-size, or its NT counterpart, for the previous branch), no extra write port is needed. The extra hardware consists of the above small buffer and some logic for counting the size values. For speculative executions, we attach the entry number of the previous branch instruction to the entry in the branch reservation station (instead of using the above buffer).

The PPG design has to consider the case in which an entry is removed when allocating a new entry in a full table set. In such a case, the T-size and/or NT-size fields of the entries pointing to the entry being removed need to be reset (set to 0). We could avoid this reset by keeping the branch address in the T-entry and the NT-entry fields and by adding an extra tag field for the pre-pc search, but this would double the area of a standard BTB.
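A sketch of how such a resolution-time update could proceed, using the BTBPPGEntry record from the previous sketch; the one-entry buffer and the helper name are hypothetical, the block-size counting is assumed to happen at commit, and the speculative (reservation-station) variant is not shown.

```python
# Hypothetical bookkeeping: PPG entry number of the previously resolved branch,
# whether it was taken, and how many instructions have committed since then
# (block_size would be incremented by the commit logic; not shown here).
prev_branch = {"entry": None, "taken": False, "block_size": 0}

def on_branch_resolved(ppg_table, cur_entry, cur_taken, cur_target, new_pred_info):
    """Update the current entry's BTB fields and link the previous entry to it."""
    cur = ppg_table[cur_entry]
    cur.prediction_info = new_pred_info   # BTB part of the update
    cur.target = cur_target               # target may be (re)written as well

    prev = prev_branch["entry"]
    if prev is not None:
        # The previous branch now knows its successor on the path actually taken,
        # so its T-entry/T-size (or NT-entry/NT-size) fields can be written directly.
        if prev_branch["taken"]:
            ppg_table[prev].t_entry = cur_entry
            ppg_table[prev].t_size = prev_branch["block_size"]
        else:
            ppg_table[prev].nt_entry = cur_entry
            ppg_table[prev].nt_size = prev_branch["block_size"]

    # Remember the current branch (and its direction) for the next resolution.
    prev_branch.update(entry=cur_entry, taken=cur_taken, block_size=0)
```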

Since evicting a BTB entry is a relatively rare event (in our simulations it happens, on average, only once per 7000 instructions: a 99.9% BTB hit rate with a 1/7 branch instruction rate), and the PC is stalling in this case due to a mispredicted branch, we decided instead to keep the table small and stop the pre-pc during the updates. The T-entry and NT-entry fields are stored in an associative memory; thus, when removing an entry from the table, an associative search is done on the two fields with the removed entry number. For every match the corresponding size field is reset to 0; this operation may consume a few cycles for a big table. We note that the delay thus incurred to the pre-pc is negligible compared with the cost of the alternative. Only one bit (taken/not-taken) from the BTB is used by both the PC and the pre-pc (when reading the BTB/PPG predictions); a dual read port to this bit can solve the problem. Altogether, with four bytes per entry (two of which use associative memory), a table of 512 entries is equivalent to a 2K-byte cache.

3.3 SRPT: a Reference Prediction Table for Superscalar Processors

The second hardware component is a Reference Prediction Table for Superscalar processors (SRPT). This table stores the access history information of the memory reference instructions, so that whenever a memory reference has a steady access pattern it is possible to predict the future memory accesses. The table is managed as a cache. An entry is allocated for a memory reference instruction accessed by the PC, and the relevant information is kept and updated on the following accesses. A special tag field is used for retrieving information stored in the table. The RPT of [3] has two pc-tag fields (or two ports to the same pc-tag); one is used for the PC search (set-associative) and the second for the lookahead PC (which is incremented and maintained in the same fashion as the PC). Instead, in the SRPT, which is an extension of the RPT, the PC search uses the pc-tag as above, but the search with the second index (called the pre-pc-tag) is fully associative and is tuned for superscalar processors.
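As a behavioral stand-in for the associative reset just described (a CAM operation in hardware), the helper below clears the size fields of every entry that points at the evicted one; it reuses the BTBPPGEntry sketch from above and is only illustrative.

```python
def on_ppg_entry_evicted(ppg_table, removed_entry):
    """Reset the size fields of every entry that points at the evicted entry.

    A size of 0 tells the pre-pc that the successor block is unknown, so it
    stops following the stale pointer instead of scanning a dangling block.
    """
    for e in ppg_table:
        if e.t_entry == removed_entry:
            e.t_size = 0
        if e.nt_entry == removed_entry:
            e.nt_size = 0
```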

The pre-pc uses the second index to identify all the access instructions in a block predicted for execution. An entry in the SRPT contains the following fields:

pre-pc-tag | pc-tag | last-ea | stride | times | fsm | dist

The pc-tag field holds the address of the load/store instruction. The last-ea field stores the last effective address accessed by this instruction. The stride field holds the difference between the last two effective addresses accessed by the memory instruction, and the counter times keeps track of how many iterations (with respect to that instruction) the pre-pc is ahead of the PC. The fields stride and times, together with fsm, are used to generate the predicted address as in [3]. The dist field contains the distance of the memory reference instruction from the last branch and is used to control the distance between the PC and the pre-pc. The pre-pc-tag field stores the following three values:

br-entry-num | T/NT | mem-ref-num

The 9-bit br-entry-num field holds the BTB/PPG entry number of the last branch executed before reaching the memory access instruction; the T/NT bit indicates whether the instruction is on the taken or not-taken path; and the last 7 bits, called mem-ref-num, are the ordinal number of the instruction within the load/store instructions of the block.

The SRPT is updated whenever the CPU core (PC) executes a load/store instruction. During this update the prediction information (such as stride, times, etc.) is calculated and stored together with the effective address used by the instruction. The pre-pc uses this information in order to predict future accesses addressed by this instruction. The predicted effective address is equal to last-ea + (stride × times), and the decision to generate the prediction is made by the two states init and prediction of the finite state machine (fsm) in Figure 2. The hardware needed to implement a 128-entry SRPT is equivalent to a 2K-byte data cache, where the pc-tag size is equivalent to the tag field of the cache and the memory needed for the pre-pc-tag is counted twice its actual size (since it is fully associative).
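A minimal behavioral sketch of an SRPT entry follows; only the last-ea + (stride × times) computation and the update of last-ea and stride on a core access come from the text, while the method names and the times bookkeeping are our assumptions (the fsm handling is elided).

```python
from dataclasses import dataclass

@dataclass
class SRPTEntry:
    # pre-pc-tag components, used by the fully associative pre-pc search
    br_entry_num: int
    t_nt: int
    mem_ref_num: int
    # per-instruction prediction state, updated on core (PC) accesses
    pc_tag: int       # address of the load/store instruction
    last_ea: int = 0  # last effective address accessed
    stride: int = 0   # difference between the last two effective addresses
    times: int = 0    # how many iterations the pre-pc is ahead of the PC
    dist: int = 0     # distance (in instructions) from the last branch

    def predicted_ea(self) -> int:
        """Predicted effective address: last-ea + (stride × times)."""
        return self.last_ea + self.stride * self.times

    def prepc_advance(self) -> int:
        """Pre-pc scanned this instruction once more; return the prefetch address."""
        self.times += 1
        return self.predicted_ea()

    def core_update(self, effective_address: int) -> None:
        """Core executed the instruction; refresh history (fsm update omitted)."""
        self.stride = effective_address - self.last_ea
        self.last_ea = effective_address
        if self.times > 0:
            self.times -= 1   # assumed: the PC caught up by one iteration
```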

Figure 2: The finite state machine fsm (states include init, no-prediction, and prediction; transitions are taken on correct and incorrect predictions).

The search in the SRPT is done in parallel to the execution in the CPU pipeline; thus, two clock cycles can be used if the operation is too long.

SRPT Updates and Management

In our simulations we assume the possibility of issuing multiple instructions per cycle with no more than two memory accesses per cycle (dual-ported cache); thus, to support the updates of the memory reference instructions, the SRPT must be dual ported. An entry in the SRPT is allocated on the first execution of a memory access instruction. At that point the prediction fields (stride, times, and fsm) are set to zero. When an entry is removed from the BTB/PPG, there is a need to update the SRPT entries whose br-entry-num field is equal to the number of the entry being removed. This is done by setting the dist field to zero (invalidating the pre-pc-tag field of the SRPT entry). Note that the PC is stalling in this case (due to a mispredicted branch penalty) for a few cycles, and the pre-pc does not progress until all updates complete.

3.4 The pre-pc Using the PPG and SRPT Structures

In this section we describe the pre-pc mechanism. At the beginning of the program execution, and following a misprediction in the direction taken by the pre-pc, the PC and pre-pc are equal. The pre-pc can depart from the PC at each branch entry whose size field is not 0. When the size field is not 0, the branch was visited at least once before, and its prediction entries and size fields were updated (pointing to the next branch).

Within one clock cycle the pre-pc can advance a full block ahead of the PC. Using the entry of the next branch (according to the branch prediction), the pre-pc can move directly to the next block without the need for a lookup on the BTB tag field. Thus, the PPG provides an efficient method for supporting a pre-pc that can be used in many places. During a data cache access, a lookup and update are conducted on the SRPT with the pc-tag. In parallel, the pre-pc is looking for future memory reference instructions in the SRPT.

With the example in Figure 1 we illustrate the pre-pc progress, assuming that branch instructions 3, 13, and 17 are kept in entries 15, 30, and 18 of the BTB/PPG table, respectively (as in Table 1). Table 2 presents the pre-pc-tag for some of the SRPT entries in a possible execution of the code fragment in Figure 1.

Table 2: The pre-pc-tag for some of the SRPT entries in a possible execution of the code fragment in Figure 1.
inst | br-ent-num | T/NT | mem-ref-num
I4   | 15         | 0    | 1
I5   | 15         | 0    | 1
I7   | 15         | 0    | 2
I12  | 15         | 0    | 2
I16  | 30         | 0    | 1

The mem-ref-num fields of the first two memory reference instructions (in block 15), 4 and 5 (denoted by I4, I5), have value 1 when the memory is dual ported. The value of that field for I7 and I12 is 2, and it is 1 for I16 (first in block 30). The T/NT values for all five instructions are zero, since they are on the not-taken path of their respective branches, 15 and 30. At some point, when the pre-pc value is (15,0,1), its lookup in the SRPT on the pre-pc-tag field (br-entry-num, T/NT, mem-ref-num) matches I4 and I5 and partially matches I7 and I12. A partial match is given for SRPT entries for which the br-entry-num and T/NT fields of the pre-pc-tag have the same values as in the lookup operation, e.g., (15,0,*) (the use of the partial match will soon be clear). In the next clock cycle the pre-pc value is (15,0,2) and it matches I7 and I12. Thus, the pre-pc jumps over irrelevant instructions without losing cycles between consecutive accesses to the SRPT.
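The full-match/partial-match lookup can be modeled as below over a list of the SRPTEntry records sketched earlier; this is an illustrative stand-in for the fully associative CAM search, not a hardware description.

```python
def srpt_lookup(srpt, br_entry_num, t_nt, mem_ref_num):
    """Fully associative pre-pc-tag search.

    Returns (full_matches, partial_matches): a full match agrees on all three
    tag components; a partial match agrees only on (br-entry-num, T/NT) and
    tells the pre-pc that the current block still has memory references left.
    """
    full, partial = [], []
    for entry in srpt:
        if entry.br_entry_num == br_entry_num and entry.t_nt == t_nt:
            if entry.mem_ref_num == mem_ref_num:
                full.append(entry)
            else:
                partial.append(entry)
    return full, partial
```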

Once I7 and I12 are found (matched), with the help of the PPG the pre-pc value in the next cycle is (30,0,1) (described later). The new pre-pc value causes the next block to be scanned, and the load of I16 is found. Whenever a match occurs, the reference prediction is computed based on the history information fields, and the decision to issue the prefetch request is made by a four-state finite state machine.

In the following example we discuss the progress of the pre-pc in a few scenarios for the example in Figure 1. Assume we are at clock cycle x and a lookup is done on entry 15 of the PPG (see the PPG:LU column in the first row of Table 3). By the end of cycle x the PPG:LU results are: T/NT=0 (branch 15 is not taken) and next-br=30.

Table 3: The pre-pc progress (a possible scenario for Figure 1).
cycle | SRPT:LU  | PPG:LU | PPG:LU results     | SRPT:LU results
x     |          | 15     | T/NT=0, next-br=30 |
x+1   | (15,0,1) | 30     | T/NT=0, next-br=18 | FM={I4,I5}, PM={I12}
x+2   | (15,0,2) |        |                    | FM={I12}
x+3   | (30,0,1) | 18     | T/NT=1, next-br=15 | FM={I16}
x+4   | (18,1,1) | 15     | T/NT=0, next-br=30 | no match
x+5   | (15,0,1) | 30     | T/NT=0, next-br=18 | FM={I4,I5}, PM={I12}

To make the example more general, assume at this stage that I4, I5, I12, and I16 are in the SRPT and the entry for I7 was removed. During cycle x+1, in parallel to a direct lookup on PPG entry 30, the pre-pc searches the SRPT pre-pc-tag with (15,0,1) (see the SRPT:LU column of Table 3). The SRPT:LU results are two full matches, for I4 and I5, and a partial match for I12. The partial match on I12 indicates the need for the pre-pc to continue searching this block. Thus, during cycle x+2 the SRPT is searched with the values (15,0,2). By the end of cycle x+2 instruction I12 is fully matched (last in the block). The next predicted block is 30 and T/NT=0; thus, during cycle x+3 a SRPT:LU search for (30,0,1) is generated and I16 is fully matched. In parallel, a PPG:LU search on branch entry 18 is generated. At the next cycle a SRPT:LU with (18,1,1) yields no match, since block 18 has no memory access instructions. From the PPG:LU results of cycle x+3 the next branch is 15; this branch is not taken (according to the PPG:LU results at the end of cycle x+4). Thus, the next SRPT:LU search (cycle x+5) is for (15,0,1).
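Building on the SRPTEntry and srpt_lookup sketches above, the per-cycle loop of the pre-pc could look roughly like this; the state dictionary, the prediction stand-in (reading prediction_info as a taken/not-taken bit), and the hand-off to a prefetch queue are all assumptions made for illustration, and no cycle timing is modeled.

```python
def prepc_step(state, ppg_table, srpt, prefetch_queue):
    """One pre-pc step: SRPT lookup for the current block, PPG lookup for the next."""
    # SRPT:LU with the current pre-pc value (block, T/NT, mem-ref-num).
    full, partial = srpt_lookup(srpt, state["block"], state["t_nt"], state["ref_num"])
    for entry in full:
        prefetch_queue.append(entry.prepc_advance())  # predicted address, handed to the PRC

    # PPG:LU, performed in parallel in hardware: predicted direction and successor entry.
    ppg = ppg_table[state["next_ppg_lookup"]]
    taken = bool(ppg.prediction_info)  # stand-in for the BTB taken/not-taken prediction

    if partial:
        # More memory references remain in this block: keep scanning it next step.
        state["ref_num"] += 1
    else:
        # Jump to the next predicted block and schedule the following PPG lookup.
        state["block"] = state["next_ppg_lookup"]
        state["t_nt"] = 1 if taken else 0
        state["ref_num"] = 1
        state["next_ppg_lookup"] = ppg.t_entry if taken else ppg.nt_entry
```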

The pre-pc search in the SRPT is fully associative and can take a full clock cycle (if more than a single clock cycle is needed for 128 entries, the depth of the associativity can be reduced by partitioning the SRPT, e.g., mapping instructions of blocks in the upper and lower halves of the PPG table into two different parts of the SRPT). Thus, this search is pipelined with the calculation of a predicted effective address and the prefetch dispatch.

The example above shows the progress of the pre-pc in a few special cases. The pre-pc searches the SRPT and the PPG in parallel and thus, in one clock cycle, it can jump over a block that has two or fewer memory reference instructions. If a block has more than two memory reference instructions, the pre-pc's progress through the block is never slower than that of the PC. In general, the pre-pc can be either ahead of the PC or in the same position as the PC (following a misprediction in the direction taken by the pre-pc, or at the beginning of the execution). The goal is to increase the distance before a data cache miss occurs.

The optimal distance between the pre-pc and the PC is the one at which no cache miss occurs for predicted data. Thus, it must not be smaller than the time taken for a prefetch request to be fulfilled. A prefetch request has low priority both for the lookup in the cache and for the memory bus. This implies that the time interval between the PC and the pre-pc must be even larger than the fetch latency. On the other hand, it must not be too big, in order not to cause a replacement of data that will soon be needed for the computation. In addition, we note that the history in the BTB/PPG and SRPT, used for generating the predictions, is less accurate when the distance is very large. Tango can only control the maximum distance. Since the distance is measured by the number of instructions between the PC and the pre-pc, the maximum value should reflect the execution time of the instructions and the delays added by servicing the prefetch requests.

The distance between the pre-pc and the PC is computed every cycle. Whenever the pre-pc jumps to a new block, the distance increases by T-size or NT-size, depending on the direction taken by the pre-pc. The distance inside a block is accumulated upon an SRPT match (the dist value of the matched memory reference instruction is added and that of the previous one, in the same block, is subtracted). The distance covered by the PC is subtracted from the accumulated distance following every instruction's commit.
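As one literal reading of these three rules, the counter below adds the block size on a jump, refines the distance on each SRPT match, and subtracts committed instructions; how the first two increments compose in the actual hardware is our assumption, so this is only a rough sketch.

```python
class PrePCDistance:
    """Instruction distance between the pre-pc and the PC (rough sketch)."""

    def __init__(self, max_distance):
        self.value = 0
        self.max_distance = max_distance   # Tango only bounds the maximum distance
        self._last_dist_in_block = 0       # dist field of the previous match in the block

    def on_block_jump(self, block_size):
        """Pre-pc jumped to a new block: add its T-size or NT-size."""
        self.value += block_size
        self._last_dist_in_block = 0

    def on_srpt_match(self, dist):
        """Accumulate the within-block distance from the matched entry's dist field."""
        self.value += dist - self._last_dist_in_block
        self._last_dist_in_block = dist

    def on_commit(self, n=1):
        """The PC committed n instructions, closing part of the gap."""
        self.value -= n

    def may_advance(self):
        """Stall the pre-pc once the maximum allowed distance has been reached."""
        return self.value < self.max_distance
```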

3.5 PRC: a Prefetch Requests Controller

The third part of our scheme, the Prefetch Requests Controller (PRC), is a mechanism for controlling and scheduling the prefetch requests. Due to the heavy traffic on the core-cache bus, the number of open slots for prefetching queries in the cache is very small. In addition, the timing of the open slots may not be the right one for prefetching. Minimizing the traffic overhead incurred by prefetching on this bus, and on the memory bus, calls for optimizations of data prefetching in two places. The PRC mechanism filters out redundant prefetch requests and controls the scheduling of the prefetch requests on the core-cache bus (it always gives priority to the core requests). In addition, the data cache LRU is touched (following reference predictions) in order to reduce the number of requests on the memory bus and to improve Tango's performance.

When the pre-pc scans a memory reference instruction, its predicted effective address is supplied by the SRPT. In the next step the address can be looked up in the data cache. Since the ports to the data cache tag may be occupied by requests from the core, we added a FIFO buffer, Req-Q, of four entries for keeping the prefetch requests (see Figure 3). Thus, the pre-pc can advance before the request is served. When Req-Q is full, the pre-pc must wait. The queue is flushed if the pre-pc took the wrong direction (wrong branch prediction). Due to the locality-of-reference principle it is very likely that within short time periods more than a single request for data in a cache line (block) can occur. An associative search on Req-Q prevents a double insertion of a prefetch request. At every clock cycle it is possible to issue up to two prefetch requests on the cache-tag ports not used by the core.

A prefetch request which misses the cache is directed to the next level in the memory hierarchy. If the memory bus is busy, the request is stored in a second small buffer, Wait-Q, of which two entries (with top priority) are dedicated to requests generated by the core (see Figure 3). The prefetch requests are flushed from this buffer when the pre-pc takes the wrong direction (on a branch). A Track-Q is used for tracking prefetch requests issued to the memory bus and for preserving system consistency.

Figure 3: Scheduling and controlling prefetch requests (the multiple-issue CPU core, the Req-Q dispatch buffer, the PRC with its Filter-cache, Wait-Q, and Track-Q, the write buffer, the two data cache banks, and the priority interconnect to the next level in the memory hierarchy).

Every memory request is put on Track-Q and is removed when the data arrive. When Track-Q is full, the issuing of prefetch requests to the memory bus ceases. Lastly, Tango uses a unique buffer, Filter-cache, in order to track requests that were found (hit) in the data cache. This buffer is original to Tango, unlike Track-Q, Wait-Q, and Req-Q. For a 2-way set-associative cache, and a memory bus that can serve a request every 4 cycles with a 12-cycle latency, a line found (hit) in the cache will stay in the cache at least 16 cycles. Following a cache hit, the requested block address is put in Filter-cache. A counter is attached to this entry, and its value is set to the fetch latency plus the fetch spacing. This counter is decremented by one every clock cycle, and the entry is removed when its value reaches 0. The idea behind this action is to minimize the dispatching of prefetch requests to Req-Q, and thus to decrease the traffic on the core-cache bus. If Filter-cache is full, it behaves like a FIFO queue. This filter provides a simple way to dynamically remove redundant prefetch requests and is thus suitable for out-of-order computations as well.

Only prefetch requests that are not in Req-Q, Wait-Q, Track-Q, or Filter-cache are directed to the Req-Q buffer. Thus, the Tango scheme minimizes the number of requests sent to Req-Q and prevents loading the core-cache bus, leaving out most (about 2/3) of the requests generated by the SRPT.
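A sketch of the Filter-cache countdown and of the admission check into Req-Q follows; the class, the helper, and the fixed sizes (six Filter-cache entries, a lifetime of fetch latency plus fetch spacing, 32-byte lines) mirror the baseline numbers in the text, but the Python structures themselves are only illustrative (a full Req-Q stalling the pre-pc is not modeled).

```python
from collections import OrderedDict

LINE_SIZE = 32  # bytes, as in the baseline configuration

class FilterCache:
    """Tracks line addresses that recently hit in the data cache (FIFO when full)."""

    def __init__(self, entries=6, lifetime=12 + 4):   # fetch latency + fetch spacing
        self.entries, self.lifetime = entries, lifetime
        self.lines = OrderedDict()                    # line address -> remaining cycles

    def insert(self, line_addr):
        if len(self.lines) >= self.entries:
            self.lines.popitem(last=False)            # behaves like a FIFO queue when full
        self.lines[line_addr] = self.lifetime

    def tick(self):
        """Called every clock cycle: age the entries and drop expired ones."""
        for addr in list(self.lines):
            self.lines[addr] -= 1
            if self.lines[addr] == 0:
                del self.lines[addr]

    def __contains__(self, line_addr):
        return line_addr in self.lines

def dispatch_prefetch(addr, req_q, wait_q, track_q, filter_cache):
    """Admit a predicted address into Req-Q only if no copy is already pending."""
    line = addr // LINE_SIZE
    if line in filter_cache or line in req_q or line in wait_q or line in track_q:
        return False           # redundant request, filtered before touching the tag ports
    req_q.append(line)         # e.g., a 4-entry queue in the baseline configuration
    return True
```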

About 80% of the removed requests were filtered by a Filter-cache of six entries. Since we use an associative search on these buffers, they were all chosen to be small.

Avoiding the prefetch of data found in the data cache, without indicating that it may be needed soon, can result in purging it before its actual use. This indeed happened in our early simulations and indicated that, due to the distance between the pre-pc and the PC, many lines residing in the cache at the time of the pre-pc lookup were no longer there when the PC looked for them. This problem was solved by a simple modification to the data cache LRU scheme. The Tango scheme generates an LRU touch whenever a prefetch request is a hit, thus indicating that the line may be needed soon (rather than leaving it marked as used early). This internal prefetching prevents an erroneous purging on the one hand, and saves the cost of issuing redundant requests to the next memory level on the other. The simulation results showed significant improvement along this avenue for both the performance and the memory bus traffic.

The prediction done by the SRPT is very aggressive, since the majority of the references are of constant stride and since the stride is not checked to be stable before generating a prefetch request (e.g., following the first 2 references of a load/store instruction). An extreme example is the dnasa7 program in the SPEC92 benchmark. For this program the SRPT generated predictions for 98% of the memory accesses. Since 46% of all the instructions are memory reference instructions, the cache-tag ports were occupied by the core 60% of the execution time. Without an additional optimization the prefetch requests would also need 60% of the cache-tag ports bandwidth. Since only 40% of the bus bandwidth is available, and not always at the right moment, the prefetch performance was impaired. With the above PRC optimizations only 1/3 of the SRPT requests remained. As a result, only 20% of the cache-tag ports bandwidth was needed for data prefetching. Note that the core always has first priority on the cache-tag ports. In addition, the internal prefetch (touching the LRU) reduces the traffic on the memory bus. The improvement in the total performance and the reduction in bus utilization show the effectiveness of the above optimizations.
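For illustration, the LRU touch on a prefetch hit can be modeled on a generic set-associative cache as below; this is not the simulator used in the paper, and a plain per-set LRU order is assumed.

```python
class SetAssociativeCache:
    """Generic LRU cache model used only to illustrate the 'touch on prefetch hit'."""

    def __init__(self, sets, ways):
        # Each set keeps its lines ordered from least to most recently used.
        self.sets = [[] for _ in range(sets)]
        self.num_sets, self.ways = sets, ways

    def _set_of(self, line_addr):
        return self.sets[line_addr % self.num_sets]

    def probe_for_prefetch(self, line_addr):
        """Prefetch lookup: on a hit, touch the LRU so the line is not purged early."""
        s = self._set_of(line_addr)
        if line_addr in s:
            s.remove(line_addr)
            s.append(line_addr)   # mark as most recently used ("internal prefetch")
            return True           # hit: no request goes to the memory bus
        return False              # miss: the request proceeds to the next level

    def fill(self, line_addr):
        """Install a line fetched from the next level, evicting the LRU line if needed."""
        s = self._set_of(line_addr)
        if len(s) >= self.ways:
            s.pop(0)              # evict the least recently used line
        s.append(line_addr)
```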

4 Architectural Models for the Simulations

Our investigated architecture consists of a modern processor that can issue up to four instructions per cycle (the baseline architecture). The physical limitations on this rate are: at most two memory reference instructions per cycle (i.e., two loads, two stores, or one of each), and a single conditional or unconditional control flow operation. All execution units are fully pipelined with one-cycle throughput and a one-cycle execution stage (latency). If no delay is incurred by any of the instructions issued in a cycle, their execution contributes a single cycle to the simulation. The instruction cache size is 32K bytes, organized as 4-way set-associative with a 32-byte cache line size. No penalty is charged for a hit, and the miss penalty in this cache is 6 cycles (the instruction cache miss penalty is relatively small, to compensate for the lack of an instruction prefetching mechanism in the simulations).

In addition to the I-cache miss penalty (and the accompanying use of memory bus bandwidth), a misprediction of a branch can also stall the execution. In our simulated architecture (described in Figure 4) we use a branch prediction mechanism based on the two-level adaptive training scheme suggested by Yeh and Patt [15]. In our simulations the automata are initialized with non-zero values, and branches that are not in the table are predicted as not taken. A mispredicted branch stalls the execution for 2 cycles. This happens when the BTB generates a wrong prediction (direction or target address) or for a taken branch which is not in the table. The BTB configuration is determined by specifying the BTB size, the associativity, the number of bits in the history register, and the automaton type. The baseline BTB configuration for our simulations comprises a 512-entry 4-way set-associative table. Each entry has a 2-bit history register which is used for selecting one of four 2-bit Lee & Smith automata.

The data cache is an interleaved (2-bank) write-back cache with a write-allocate policy and a write-back buffer (for replacing dirty data lines) of 8 entries. Each bank has a separate data port that can serve either a load or a store in every cycle (i.e., no write "blockage" between a store and a subsequent load). The two ports are accessible from all four execution units, and two memory reference operations may access the cache in parallel if they address different banks.

Figure 4: Simulated architecture model (multiple-issue processor core; a 2-level BTB, baseline 512 entries, 4-way set-associative, with a 2-bit history register and four 2-bit state machines per entry, extended with the PPG; a 128-entry 4-way set-associative SRPT; a 32K-byte, 4-way set-associative, 32-byte-line, 2-bank data cache; a 32K-byte, 4-way set-associative, 32-byte-line I-cache with no prefetching; the data prefetching unit with the PRC and Wait-Q; a write buffer; and an interconnect to the next memory level with configurable fetch latency and spacing between requests).

The bank selection is made at the switching point, with the low-order bit of the line address. All the cache parameters are configurable. The baseline configuration for the data cache is 32K bytes, organized as 4-way set-associative with a 32-byte cache line size. Dirty cache lines which are evacuated from the cache are moved to the write buffer. The write buffer has the lowest priority on the memory bus, and its lines are moved to memory in FIFO order whenever the bus is not busy. A prefetch request that may cause a dirty cache line replacement when the write buffer is full will not be entered into the cache. Thus, the delay caused by writing a dirty line from the write buffer can increase only the penalty of a data cache miss (in such a case the write buffer gets the highest priority on the memory bus). The sizes of the PRC buffers are those presented in Figure 3 (4, 6, and 4 entries for Req-Q, Filter-cache, and Track-Q, respectively).

The interface with the memory is parameterized in two ways. First, the bandwidth is constrained by controlling the number of cycles, the fetch spacing, between two consecutive launches of memory fetch requests. The second parameter, the fetch latency, specifies the number of cycles needed for fetching the data. When the fetch spacing is one, the memory interface is pipelined; at the other extreme, when the fetch spacing equals the fetch latency, only one memory fetch is allowed at a time.
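A toy model of this two-parameter memory interface is sketched below with the baseline values used next (a fetch spacing of 4 cycles and a fetch latency of 12 cycles); it only illustrates how the spacing bounds the issue rate while the latency sets the return time, and the class itself is not part of the paper's simulator.

```python
class MemoryInterface:
    """Issue at most one fetch every `fetch_spacing` cycles; data returns
    `fetch_latency` cycles after issue (pipelined when spacing < latency)."""

    def __init__(self, fetch_spacing=4, fetch_latency=12):
        self.fetch_spacing = fetch_spacing
        self.fetch_latency = fetch_latency
        self.cycle = 0
        self.next_issue_cycle = 0
        self.in_flight = []          # (completion_cycle, line_address)

    def try_issue(self, line_addr):
        """Launch a fetch if the bus slot is free; return True if accepted."""
        if self.cycle < self.next_issue_cycle:
            return False
        self.next_issue_cycle = self.cycle + self.fetch_spacing
        self.in_flight.append((self.cycle + self.fetch_latency, line_addr))
        return True

    def tick(self):
        """Advance one cycle and return the lines whose data arrived this cycle."""
        self.cycle += 1
        done = [addr for t, addr in self.in_flight if t <= self.cycle]
        self.in_flight = [(t, a) for t, a in self.in_flight if t > self.cycle]
        return done
```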

In our base configuration the fetch spacing is 4 cycles and the fetch latency is 12 cycles.

We tested the Tango prefetching scheme by running programs from the SPEC benchmarks. In each simulation we used the same environment parameters with two system architectures: the reference (no data prefetching) and the Tango prefetching scheme. We assumed, for both systems, that the instructions are statically scheduled such that there is no memory reference dependence between instructions executing simultaneously. Thus, if 4 instructions are executed in the same cycle, none of them needs the results obtained by the others (this behavior is supported by all out-of-order mechanisms). Any stall due to a data cache miss can postpone only those instructions that were issued at a later stage. The number of extra cycles added to the simulation due to a miss depends on the memory bus status and the fetch latency. We compared both systems with various types of memory units by changing the cache size, associativity, line size, buffer sizes, and the interface parameters to the memory. Since two misses can occur during a single cycle (dual-ported data cache), and the write buffer as well as the I-cache use the memory bus, both systems benefit when the fetch spacing is smaller than the fetch latency, compared to the case when they are equal.

5 Simulation Results

In this section we explore our prefetching design for various architecture parameters. We ran trace-driven simulations with five programs from the SPEC92 benchmark and matrix from SPEC89. We used matrix mainly for investigating the load on the cache-tag ports (a data-intensive program). Every simulation included the first 100 million instructions, and results were inspected every 10 million instructions. The behavior of five programs (all but espresso) reached a steady phase already at 50 million instructions; hence, some of the detailed simulations (starting from Section 5.2) were carried out for the first 50 million instructions. The average size of a program tested is about 20 thousand instructions.

5.1 Program Characteristics and General Results

Table 4 shows the dynamic characteristics of the programs.

Table 4: Program characteristics for 100 million instructions (columns: data reference rate (write, read); stride distribution (zero, large, small, irregular); correct BTB predictions (%); programs: matrix, dnasa7, xlisp, tomcatv, espresso, spice2g6).

The column "Data ref rate" shows the percentage of writes and reads in each application. As expected, the reads are more frequent than the writes, and the portion of read misses out of the total misses is also larger than that of the writes (see Table 5, first 2 columns). The next four columns in Table 4 indicate the predictability of memory references. The stride distribution information tells us the proportions of data references that behave according to one of four categories: data references with zero stride are steady references directed to the same memory location; large and small are those references with strides larger than or equal to 32 bytes (the line size in the baseline architecture) and smaller than 32 bytes, respectively (this data was gathered with an infinite-size SRPT). The data cache can be very helpful with zero- and small-stride references, but the prefetching mechanism provides further improvement in these cases when there is no temporal locality. Large and irregular stride references can be the main source of cache misses; while the prefetching mechanism is useful for large strides, it must also identify irregular memory references so as to avoid unnecessary initiation of erroneous prefetch requests (this is done using a 2-bit state machine). The last column presents the percentage of correct BTB predictions for the baseline BTB. This information can help in further estimating the success of our prefetching mechanism.
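The four-way stride classification behind Table 4 can be expressed as a small helper, assuming the baseline 32-byte line size; treating "irregular" as a stream without a single constant stride is our reading of the category, not necessarily the paper's exact criterion.

```python
LINE_SIZE = 32  # bytes; line size of the baseline architecture

def classify_stride(effective_addresses):
    """Classify a reference stream as 'zero', 'small', 'large', or 'irregular'."""
    strides = [b - a for a, b in zip(effective_addresses, effective_addresses[1:])]
    if not strides or any(s != strides[0] for s in strides):
        return "irregular"          # no single constant stride (our assumption)
    stride = abs(strides[0])
    if stride == 0:
        return "zero"               # steady references to the same location
    return "small" if stride < LINE_SIZE else "large"

# Example: an array accessed with an 8-byte stride falls in the 'small' category.
print(classify_stride([0x1000, 0x1008, 0x1010, 0x1018]))  # -> 'small'
```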

Table 5 summarizes the results obtained for the baseline architecture (with a pipelined memory bus).

Table 5: Simulation results of the base architecture (100 million instructions), Tango vs. the reference system (columns: misses (%), read and write; reference hit ratio; Tango C-hit ratio; C-tag used by core (%); memory bus extra bandwidth (%); memory penalty reduction (%); speedup, CPI ratio; programs: matrix, dnasa7, xlisp, tomcatv, espresso, spice2g6).

The misses (read/write) columns of Table 5 represent the distribution of read and write misses. The "ref hit ratio" column shows the hit ratio of the reference data cache, followed by the data cache hit ratio column of the prefetch-enhanced architecture. For the Tango architecture we incorporated the relative portions of the penalty for those miss requests that were on their way (due to a late prefetch); in this calculated hit ratio, the sum of all the partial penalties was divided by the fetch latency value to generate the relative number of misses. In five out of the six programs the C-hit ratio was 99%, and only for spice2g6 is the change from 83.8% to 91.2% in the Tango system.

Instruction scheduling and out-of-order execution exposed most of the available parallelism (this is implied by the relatively small ideal CPI in Figure 5). Such parallelism exploits the hardware resources most of the time. In the "C-tag used by core" column of Table 5 we present the percentage of time in which the data cache tag ports were busy due to core accesses (demand fetches). On the average, the cache-tag ports are used by the core for a significant fraction of the time. Thus, the remaining bandwidth for prefetch requests is small and must be used wisely. The "M bus ext bandwidth" column presents the percentage of extra requests imposed by Tango on the memory bus. For xlisp and espresso this number is significant. Nevertheless, in xlisp the total number of prefetch requests was small, since the hit ratio (for the reference system) is 99.3%.

The last two columns summarize the performance of the Tango scheme. The percentage of memory penalty reduction (the "M pen red" column) is correlated with low irregular-stride percentages and a high BTB prediction rate (see matrix, dnasa7, and tomcatv). The other programs, too, exhibit significant improvements. For the matrix and dnasa7 programs, 96.57% and 94.11% of the memory penalty were removed, respectively, demonstrating an efficient prefetching scheme in spite of the small bandwidth left on the core-cache bus (45.3% and 46.7% of the dynamic code accesses the data cache, respectively). On the other side of the scale we find xlisp, with a very high hit ratio (99.3%) and a large irregular-stride percentage. Nevertheless, the performance of this program is improved as well.

Figure 5 summarizes the performance in three bars for each application. The I-CPI is the ideal CPI, derived for the case in which every memory access reference is a hit. The R-CPI bar is the result found for the reference system, and the T-CPI bar presents the Tango performance.

Figure 5: Comparing performance results; I-CPI: ideal CPI, T-CPI: CPI with prefetching (Tango), R-CPI: CPI without prefetching (reference).


More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,

More information

Branch-directed and Pointer-based Data Cache Prefetching

Branch-directed and Pointer-based Data Cache Prefetching Branch-directed and Pointer-based Data Cache Prefetching Yue Liu Mona Dimitri David R. Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, M bstract The design of the

More information

Cache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time

Cache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time Cache Performance!! Memory system and processor performance:! CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st = Pipeline time +

More information

EITF20: Computer Architecture Part 5.1.1: Virtual Memory

EITF20: Computer Architecture Part 5.1.1: Virtual Memory EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs

More information

A Survey of prefetching techniques

A Survey of prefetching techniques A Survey of prefetching techniques Nir Oren July 18, 2000 Abstract As the gap between processor and memory speeds increases, memory latencies have become a critical bottleneck for computer performance.

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring 2018 SOLUTIONS Caches and the Memory Hierarchy Assigned February 8 Problem Set #2 Due Wed, February 21 http://inst.eecs.berkeley.edu/~cs152/sp18

More information

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Lecture 29 Review CPU time: the best metric Be sure you understand CC, clock period Common (and good) performance metrics Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont. History Table. Correlating Prediction Table

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont.   History Table. Correlating Prediction Table Lecture 15 History Table Correlating Prediction Table Prefetching Latest A0 A0,A1 A3 11 Fall 2018 Jon Beaumont A1 http://www.eecs.umich.edu/courses/eecs470 Prefetch A3 Slides developed in part by Profs.

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Dynamic Branch Prediction

Dynamic Branch Prediction #1 lec # 6 Fall 2002 9-25-2002 Dynamic Branch Prediction Dynamic branch prediction schemes are different from static mechanisms because they use the run-time behavior of branches to make predictions. Usually

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

COSC 6385 Computer Architecture. - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Caches. Hiding Memory Access Times

Caches. Hiding Memory Access Times Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Memory hierarchy J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

A MULTIPORTED NONBLOCKING CACHE FOR A SUPERSCALAR UNIPROCESSOR JAMES EDWARD SICOLO THESIS. Submitted in partial fulllment of the requirements

A MULTIPORTED NONBLOCKING CACHE FOR A SUPERSCALAR UNIPROCESSOR JAMES EDWARD SICOLO THESIS. Submitted in partial fulllment of the requirements A MULTIPORTED NONBLOCKING CACHE FOR A SUPERSCALAR UNIPROCESSOR BY JAMES EDWARD SICOLO B.S., State University of New York at Bualo, 989 THESIS Submitted in partial fulllment of the requirements for the

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

CS152 Computer Architecture and Engineering SOLUTIONS Caches and the Memory Hierarchy Assigned 9/17/2016 Problem Set #2 Due Tue, Oct 4

CS152 Computer Architecture and Engineering SOLUTIONS Caches and the Memory Hierarchy Assigned 9/17/2016 Problem Set #2 Due Tue, Oct 4 CS152 Computer Architecture and Engineering SOLUTIONS Caches and the Memory Hierarchy Assigned 9/17/2016 Problem Set #2 Due Tue, Oct 4 http://inst.eecs.berkeley.edu/~cs152/sp16 The problem sets are intended

More information

Instruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.

Instruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3. Instruction Fetch and Branch Prediction CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.3) 1 Frontend and Backend Feedback: - Prediction correct or not, update

More information

Memory Consistency. Challenges. Program order Memory access order

Memory Consistency. Challenges. Program order Memory access order Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Dynamic Hardware Prediction. Basic Branch Prediction Buffers. N-bit Branch Prediction Buffers

Dynamic Hardware Prediction. Basic Branch Prediction Buffers. N-bit Branch Prediction Buffers Dynamic Hardware Prediction Importance of control dependences Branches and jumps are frequent Limiting factor as ILP increases (Amdahl s law) Schemes to attack control dependences Static Basic (stall the

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 omputer Architecture Spring 2016 Lecture 09: Prefetching Shuai Wang Department of omputer Science and Technology Nanjing University Prefetching(1/3) Fetch block ahead of demand Target compulsory, capacity,

More information

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

Lec 11 How to improve cache performance

Lec 11 How to improve cache performance Lec 11 How to improve cache performance How to Improve Cache Performance? AMAT = HitTime + MissRate MissPenalty 1. Reduce the time to hit in the cache.--4 small and simple caches, avoiding address translation,

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

Week 11: Assignment Solutions

Week 11: Assignment Solutions Week 11: Assignment Solutions 1. Consider an instruction pipeline with four stages with the stage delays 5 nsec, 6 nsec, 11 nsec, and 8 nsec respectively. The delay of an inter-stage register stage of

More information

Data Speculation. Architecture. Carnegie Mellon School of Computer Science

Data Speculation. Architecture. Carnegie Mellon School of Computer Science Data Speculation Adam Wierman Daniel Neill Lipasti and Shen. Exceeding the dataflow limit, 1996. Sodani and Sohi. Understanding the differences between value prediction and instruction reuse, 1998. 1 A

More information

Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin

Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Professor Alvin R. Lebeck Computer Science 220 Fall 1999 Exam Average 76 90-100 4 80-89 3 70-79 3 60-69 5 < 60 1 Admin

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5) Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 More Cache Basics caches are split as instruction and data; L2 and L3 are unified The /L2 hierarchy can be inclusive,

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these

More information

Power/Performance Advantages of Victim Buer in. High-Performance Processors. R. Iris Bahar y. y Brown University. Division of Engineering.

Power/Performance Advantages of Victim Buer in. High-Performance Processors. R. Iris Bahar y. y Brown University. Division of Engineering. Power/Performance Advantages of Victim Buer in High-Performance Processors Gianluca Albera xy x Politecnico di Torino Dip. di Automatica e Informatica Torino, ITALY 10129 R. Iris Bahar y y Brown University

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1)

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1) Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1) 1 Types of Cache Misses Compulsory misses: happens the first time a memory word is accessed the misses for an infinite cache

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Mainstream Computer System Components

Mainstream Computer System Components Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved

More information

for Energy Savings in Set-associative Instruction Caches Alexander V. Veidenbaum

for Energy Savings in Set-associative Instruction Caches Alexander V. Veidenbaum Simultaneous Way-footprint Prediction and Branch Prediction for Energy Savings in Set-associative Instruction Caches Weiyu Tang Rajesh Gupta Alexandru Nicolau Alexander V. Veidenbaum Department of Information

More information

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken Branch statistics Branches occur every 4-7 instructions on average in integer programs, commercial and desktop applications; somewhat less frequently in scientific ones Unconditional branches : 20% (of

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

Cache Optimisation. sometime he thought that there must be a better way

Cache Optimisation. sometime he thought that there must be a better way Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction

More information

Appendix A.2 (pg. A-21 A-26), Section 4.2, Section 3.4. Performance of Branch Prediction Schemes

Appendix A.2 (pg. A-21 A-26), Section 4.2, Section 3.4. Performance of Branch Prediction Schemes Module: Branch Prediction Krishna V. Palem, Weng Fai Wong, and Sudhakar Yalamanchili, Georgia Institute of Technology (slides contributed by Prof. Weng Fai Wong were prepared while visiting, and employed

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Opteron example Cache performance Six basic optimizations Virtual memory Processor DRAM gap (latency) Four issue superscalar

More information