Tango: a Hardware-based Data Prefetching Technique for Superscalar Processors

Shlomit S. Pinter, IBM Science and Technology, MATAM Advanced Technology Ctr., Haifa 31905, Israel, shlomit@vnet.ibm.com
Adi Yoaz, Intel Israel (74), MATAM Advanced Technology Ctr., Haifa 31905, Israel, ayoaz@iil.intel.com

Abstract

We present a new hardware-based data prefetching mechanism for enhancing instruction level parallelism and improving the performance of superscalar processors. The emphasis in our scheme is on the effective utilization of slack (dead) time and hardware resources not used for the main computation. The scheme suggests a new hardware construct, the Program Progress Graph (PPG), as a simple extension to the Branch Target Buffer (BTB). We use the PPG for implementing a fast pre-program counter, pre-pc, that travels only through memory reference instructions (rather than scanning all the instructions sequentially). In a single clock cycle the pre-pc extracts all the predicted memory references in some future block of instructions, to obtain early data prefetching. In addition, the PPG can be used to implement a pre-processor and for instruction prefetching. The prefetch requests are scheduled to "tango" with the core requests from the data cache, by using only free time slots on the existing data cache tag ports. Employing special methods for removing prefetch requests that are already in the cache (without utilizing the cache-tag ports bandwidth) and a simple optimization of the cache LRU mechanism reduce the number of prefetch requests sent to the core-cache bus and to the memory (second level) bus. Simulation results on the SPEC92 benchmark for the baseline architecture (32K-byte data cache and 12-cycle fetch latency) show an average speedup of 1.36 (CPI ratio). The highest speedup of 2.53 is obtained for systems with a smaller data cache (8K-byte).

Key words: data prefetching, instruction prefetching, memory reference prediction, superscalar processors, pre-instruction decoding, BTB, branch prediction.

Preliminary results were published in the MICRO-29 conference (see [13]).

1 Introduction

The demand on the performance of memory subsystems is rapidly increasing with the advances in microprocessor architecture. The growing gap between processor and memory speeds further increases the memory access penalty problem. Data prefetching is an effective technique for hiding memory latency and can alleviate the pressure to constantly increase the size of first-level caches. Timely and correct prediction of data accesses are the key issues in both software- and hardware-based prefetching techniques.

In hardware-based schemes, the timely generation of a data prefetch and the additional traffic on both the memory bus and the cache-tag ports can potentially be controlled so as not to interfere with the main computation process. In addition, the amount of prefetching done can be tuned to the cache parameters using dynamic information. Prediction in hardware can recognize constant memory strides (resulting from integer-valued linear functions of loop variables) [7, 1, 8]. For speculative data prefetching, the hardware-based techniques can improve on software approaches by using information from the branch prediction unit. Hardware-based off-chip schemes, like the stream buffer [6], are driven by cache miss accesses, whereas with on-chip schemes all the memory access instructions are sampled and more prediction information is available [7, 1]. Prediction on-chip is not influenced by out-of-order computations, and the data cache status can be used for controlling the prefetch rate. However, no on-chip prefetch schemes have been suggested for superscalar processors. The main problem in the design of an on-chip prefetch scheme for superscalar processors is to initiate the prefetch requests fast enough. Another problem, studied in [14] and noted by [6], is the cache bandwidth problem. The problem is manifested by the heavy load on the data ports and tag ports of the data cache. In superscalar processors the cache-tag ports are one of the most critical resources, and extra prefetch requests (especially with aggressive on-chip prediction) may create contention. In our scheme we provide an effective and simple solution to these problems.

In software-based prefetching, extra prefetch instructions are inserted. The new instructions may increase the length of the critical path since there are not enough "open" slots (in the load functional units), especially for data-intensive programs, or when prefetching most of the memory access instructions [14]. The extra prefetch instructions also occupy space in the I-cache and increase the traffic on the memory bus. Whereas extensive software analysis can generate accurate static predictions, the timely generation of the prefetch requests is a problem (see [11, 2]). In [11] the number of iterations to prefetch ahead was chosen to be the prefetch latency divided by the length of the shortest path through the loop. In general, prefetch instructions in the wrong places (time) can cause cache pollution if no special prefetch buffer is used. For reducing such misses and for removing redundant prefetch instructions, the compiler must be tuned to the data cache parameters. Such tuning has no effect on existing code and is very intricate for the compiler (see [2, 11] for some heuristics).

In this paper we present the design and simulation results of a new on-chip hardware-based data prefetching scheme called Tango.

The scheme uses a pre-pc (pre-program counter) component that exploits only instructions that use the data cache (rather than scanning the instructions sequentially). At each clock cycle, all the memory references in some future block of instructions are exposed with the help of the Program Progress Graph (PPG) table. The pre-pc mechanism uses the data prediction component (which reveals all constant-stride accesses) to generate prefetch requests for these instructions. The scheme thus provides data fast enough for future multi-functional processors that can employ out-of-order execution. The prefetch requests do not interfere with the main computation and are scheduled to tango with the core requests from the cache (they only use the free time slots on the existing data cache tag ports). Tango employs a new, simple method that removes prefetch requests for data already in the cache without consuming cache-tag ports bandwidth (this mechanism can be used by other schemes as well). Another simple optimization of the cache LRU mechanism lets us further reduce the number of prefetch requests sent to the core-cache bus and to the memory bus. The cost incurred by Tango involves chip space, which is smaller than a 4K-byte dual-ported cache.

The PPG, which is implemented as a small and simple extension to a BTB, provides an efficient and fast method for lookahead on instructions. Thus, in addition to data prefetching, it can be used as a pre-processor for decoding instructions early enough on their way to the core, or for instruction prefetching.

Our investigations show that the Tango prefetching scheme significantly improves the overall system performance and extensively reduces the memory access miss penalty. In most of the programs tested, the speedup gained is over 1.33 (an average of 1.36 for the baseline architecture with a memory fetch latency of 12 cycles), and the reduction in the memory penalty ranges from 43% to 96%.

We first discuss related work in Section 2. Then, in Section 3, we present our scheme and discuss its properties. Section 4 includes the machine model, and experimental results are presented in Section 5. Conclusions and discussion are presented in Section 6.

2 Related Work

Most simple hardware data prefetching schemes are based on data misses that can trigger the prefetch of the next data line. Such schemes were investigated in [4] and, with the help of special prefetch buffers (stream buffers), by [9]. Similar schemes that use more elaborate off-chip predictions are studied in [12] and in [6]. For improving the prediction, [7, 8] use a history table for keeping the previous memory address. The difference between the current address and the previous one is used to calculate the memory stride. The prefetch is issued for the next iteration based on the calculated stride. The penalty incurred by small-stride accesses was investigated in [5]. We note that Tango does not generate requests to the data cache and the memory bus for predictions with small strides whenever the relevant part is in some cache line (cache block).

Even with good prediction, data can be prefetched too early or too late to be useful. One way of solving this problem is a lookahead scheme. The lookahead scheme in [10] is based on generating a data prefetch for operands simultaneously with the decoding of the instruction. Prediction and lookahead are integrated by Baer and Chen in [1, 3] to support prefetching for scalar processors. In this on-chip scheme the stride prediction is calculated with a reference prediction (history) table (RPT) indexed by load/store instruction addresses. The lookahead mechanism implements a lookahead program counter (LA-PC) that advances at the same pace as the PC. At each clock cycle a new instruction is scanned. The time wasted on checking all the instructions prevents adapting this method to superscalar processors with multiple-instruction issue rates. Our pre-pc mechanism skips over instructions that are not memory references (it goes only through memory reference instructions). Thus, it can progress at a higher pace and may use more of the precious free time slots on the core-cache bus (cache-tag ports).

In [3] branches are chosen for the LA-PC by using the BTB with a duplicated branch address field (indexing field) for the use of the lookahead PC. Instead, we add extra information to the BTB with the same amount of extra hardware. Furthermore, in [3], whenever an incorrect branch prediction occurs the distance between the LA-PC and the PC is reset and has to build up by waiting for a data miss. During such a period, prefetches may not be issued early enough. In our scheme the pre-pc can build up the distance from the PC immediately after a wrong branch prediction, without waiting for a data miss.

The major differences between the scheme by Baer and Chen and Tango are:

- The scheme by Baer and Chen does not apply to superscalar (multiple-issue) processors. Adaptation to such a processor is impossible with the current LA-PC and RPT. This precludes the possibility of comparing the performance of the two schemes in a multiple-issue environment.
- The pre-pc lookahead scheme in Tango scans only branches and memory access instructions. The number of memory access instructions analyzed in a single clock cycle may be equal to the number of ports to the data cache. This is in contrast to the LA-PC operation.
- In Tango we offer a new technique for removing (filtering) undesirable predictions without the need to consume cache-tag bandwidth. This mechanism can be used by Baer and Chen as well.
- Tango implements an improvement to the LRU replacement algorithm. In some cases the total number of transactions on the memory bus with Tango was smaller than that of the system without prefetching.

3 The Tango Data Prefetching Scheme

The Tango hardware predicts when data is likely to be needed and generates requests to bring it into the data cache. To accomplish this goal, memory access instructions are saved in the Reference Prediction Table for Superscalar Processors (SRPT) together with some history information. In order to bring the predicted data on time, Tango employs a fast pre-program counter, pre-pc, that uses the branch prediction information and the PPG (a graph representing the predicted future of the execution flow). With the PPG information the pre-pc searches the predicted memory access instructions in the SRPT and generates the prefetch requests. The stream of prefetch requests is filtered in the prefetch requests controller (PRC) and redundant requests are removed. This is done before the requests query the cache, thus reducing primary cache-tag ports bandwidth consumption as well as memory bus bandwidth.

In this section we describe our data prefetching mechanism and discuss its hardware considerations. We start with the motivation and a functional presentation of the design, and then provide the details of the components.

3.1 Motivation

The purpose of a data prefetching scheme is to bring data into the cache such that it will be there when needed by the memory access instruction. Along this line the Tango scheme has the following goals:

- Provide a design to generate data prefetches on time for superscalar processors, without the need to change, revalidate, or interfere with existing components (specifically, with critical path timing).
- Generate correct predictions for as many memory references as possible.
- Use data cache-tag ports only for data not in the cache and only when not used by the core.
- Issue prefetch requests to memory in time (so the data will be available in the cache when needed) and only for relevant data (not in the cache or on its way to the cache).
- Incur no execution time penalty for prediction or for prefetching data already in the cache.

The Tango scheme exploits the advantages of the lookahead scheme presented for scalar processors by [1] and extends it further to superscalar processors. In [6] it is suggested that lookahead schemes are not efficient for superscalar processors due to the need for extra ports to the data cache tag. Indeed, in our simulations the tag ports are heavily used by the core (demand fetches). Thus, in order to solve this problem, Tango filters out prefetch requests for data in the cache before the requests consume data cache-tag ports bandwidth, and it uses only open slots on the cache-tag ports.

The scheme comprises three functional parts. The first is a special lookahead mechanism. Our pre-pc mechanism jumps from a branch point to its predicted successor branch point (using the branch prediction information), and in each block it searches only through the memory reference instructions. It is implemented as a simple extension to the BTB, called the PPG, and a special field in the SRPT. The second functional part generates data access predictions. This is done by storing the access history information of the memory reference instructions. Our mechanism is based on the Reference Prediction Table of [1], designed for scalar processors. The table is enhanced to support the fast pre-pc, which extracts the reference predictions and advances ahead of the PC more effectively than the lookahead PC in [1]. The third part, the Prefetch Requests Controller (PRC), is a mechanism for filtering out redundant prefetch requests. The SRPT and the lookahead mechanism (pre-pc) generate prefetch requests for most of the future memory access instructions. With software prefetching, some extra analysis is done in order to remove prefetch requests whenever it is predicted that the data is already in the cache [11, 2]. In our scheme this task is very simple. In particular, our PRC has a simple mechanism for removing redundant prefetch requests without the need to probe the data cache.

3.2 PPG: The Program Progress Graph

The first hardware component is the Program Progress Graph (PPG), generated from the instructions currently viewed by the processor. In this directed graph every node corresponds to a branch instruction and an edge to a block of instructions on the path between the corresponding two branch instructions. A number on every edge indicates the number of instructions in that block. A number in a node is the entry number of the branch in the PPG table (also marked br-entry-num). Figure 1 is an example of a program fragment and its PPG. For example, instructions 17 and 3 are in entries 18 and 15, respectively; the marking T,3 on the edge from br-num 18 to br-num 15 corresponds to instructions 1,2,3 of the taken block following instruction 17.

The PPG is stored as an extension to the BTB by adding four new columns. An entry in the BTB/PPG table has 7 fields:

branch-pc | target | prediction-info | T-entry | NT-entry | T-size | NT-size

Figure 1: A sample program fragment and its PPG (branch instructions 3 (beqz), 13 (bnez), and 17 (beqz) are held in PPG entries 15, 30, and 18, respectively; edges are labeled with the direction and block size, e.g., NT,10 from br-num 15 to br-num 30, NT,4 from br-num 30 to br-num 18, and T,3 from br-num 18 to br-num 15).

The first three fields are the address of the branch instruction (branch tag), the branch target address, and the branch prediction information of the BTB. The T-entry field contains the entry number (in the BTB/PPG table) of the next branch on the taken path, and NT-entry is the entry number of the next branch along the not-taken path. Each of the T-size and NT-size fields holds the size of the block (number of instructions) following the branch on the taken and not-taken paths, respectively. The relevant parts of entries 15, 18, and 30 of the BTB/PPG table for the example in Figure 1 are given in Table 1.

Table 1: Entries 15, 18, and 30 of the BTB/PPG table for the example in Figure 1 (columns: entry number, BTB info., T-entry, NT-entry, T-size, NT-size).

A simple BTB has two large fields: the address of the branch instruction (part of it is used as a tag) and the branch target address.
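To make the entry layout concrete, here is a minimal Python sketch of a BTB/PPG row as a plain record; the class name and the 512-entry table are illustrative only, and the field widths discussed next are not enforced by the sketch.

```python
from dataclasses import dataclass

@dataclass
class BTBPPGEntry:
    """One row of the combined BTB/PPG table (the 7 fields listed above)."""
    branch_pc: int        # address of the branch instruction (part used as tag)
    target: int           # branch target address
    prediction_info: int  # BTB branch-prediction state
    t_entry: int          # PPG entry number of the next branch on the taken path
    nt_entry: int         # PPG entry number of the next branch on the not-taken path
    t_size: int           # number of instructions in the block on the taken path
    nt_size: int          # number of instructions in the block on the not-taken path

# Hypothetical 512-entry table; a size field of 0 means the successor is not yet known.
ppg_table = [BTBPPGEntry(0, 0, 0, 0, 0, 0, 0) for _ in range(512)]
```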

For implementing a parallel lookup for the PC and a simple pre-pc, there is a need for an extra tag of a size similar to the address size. Instead, in our scheme each of the two extra entry fields, T-entry and NT-entry, contains an entry number in the PPG table, which takes 9 bits for a 512-entry table. The size fields (T-size and NT-size), of 7 bits each, are used for controlling the distance between the PC and the pre-pc. As a result, 32 bits per entry are needed to keep the extra information, independent of the processor's address size. Note that the allocation of a new branch (entry) in the PPG is done only when the branch is first taken; hence a block of the PPG may include more than a single basic block of the program. Yet, the graph truly represents the execution taking place.

Updating the PPG Table and Hardware Design Considerations

Following a hit on a PC lookup in the BTB, the prediction and the entry number of the current branch are found. The PPG's entry number of the previous branch instruction was saved (during its update) in a small buffer, together with the taken/not-taken bit flag. Following the current branch resolution, a direct update is performed on both the current branch fields and two fields of the previous branch. Since different fields are changed in the two updates (prediction-info and possibly target in the current branch, and possibly T-entry with T-size, or its NT counterpart, for the previous branch), no extra write port is needed. The extra hardware consists of the above small buffer and some logic for counting the size values. For speculative executions, we attach the entry number of the previous branch instruction to the entry in the branch reservation station (instead of using the above buffer).

The PPG design has to consider the case in which an entry is removed when allocating a new entry in a full table set. In such a case, the T-size and/or NT-size fields of the entries pointing to the entry being removed need to be reset (set to 0). We could avoid this reset by keeping the branch address in the T-entry and the NT-entry fields and by adding an extra tag field for the pre-pc search, but this would double the area of a standard BTB.
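A sketch of how such a resolution-time update could proceed, using the BTBPPGEntry record from the previous sketch; the one-entry buffer and the helper name are hypothetical, the block-size counting is assumed to happen at commit, and the speculative (reservation-station) variant is not shown.

```python
# Hypothetical bookkeeping: PPG entry number of the previously resolved branch,
# whether it was taken, and how many instructions have committed since then
# (block_size would be incremented by the commit logic; not shown here).
prev_branch = {"entry": None, "taken": False, "block_size": 0}

def on_branch_resolved(ppg_table, cur_entry, cur_taken, cur_target, new_pred_info):
    """Update the current entry's BTB fields and link the previous entry to it."""
    cur = ppg_table[cur_entry]
    cur.prediction_info = new_pred_info   # BTB part of the update
    cur.target = cur_target               # target may be (re)written as well

    prev = prev_branch["entry"]
    if prev is not None:
        # The previous branch now knows its successor on the path actually taken,
        # so its T-entry/T-size (or NT-entry/NT-size) fields can be written directly.
        if prev_branch["taken"]:
            ppg_table[prev].t_entry = cur_entry
            ppg_table[prev].t_size = prev_branch["block_size"]
        else:
            ppg_table[prev].nt_entry = cur_entry
            ppg_table[prev].nt_size = prev_branch["block_size"]

    # Remember the current branch (and its direction) for the next resolution.
    prev_branch.update(entry=cur_entry, taken=cur_taken, block_size=0)
```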

Since evicting a BTB entry is a relatively rare event (in our simulations it happens, on average, only once per 7000 instructions: a 99.9% BTB hit rate with a 1/7 branch instruction rate), and the PC is stalling in this case due to a mispredicted branch, we decided instead to keep the table small and stop the pre-pc during the updates. The T-entry and NT-entry fields are stored in an associative memory; thus, when removing an entry from the table, an associative search is done on the two fields with the removed entry number. For every match the corresponding size field is reset to 0; this operation may consume a few cycles for a big table. We note that the delay thus incurred to the pre-pc is negligible compared with the cost of the alternative. Only one bit (taken/not-taken) from the BTB is used by both the PC and the pre-pc (when reading the BTB/PPG predictions); a dual read port to this bit can solve the problem. Altogether, with four bytes per entry (two of which use associative memory), a table of 512 entries is equivalent to a 2K-byte cache.

3.3 SRPT: a Reference Prediction Table for Superscalar Processors

The second hardware component is a Reference Prediction Table for Superscalar processors (SRPT). This table stores the access history information of the memory reference instructions, so that whenever a memory reference has a steady access pattern it is possible to predict the future memory accesses. The table is managed as a cache. An entry is allocated for a memory reference instruction accessed by the PC, and the relevant information is kept and updated on the following accesses. A special tag field is used for retrieving information stored in the table. The RPT of [3] has two pc-tag fields (or two ports to the same pc-tag); one is used for the PC search (set-associative) and the second for the lookahead PC (which is incremented and maintained in the same fashion as the PC). Instead, in the SRPT, which is an extension of the RPT, the PC search uses the pc-tag as above, but the search with the second index (called the pre-pc-tag) is fully associative and is tuned for superscalar processors.
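As a behavioral stand-in for the associative reset just described (a CAM operation in hardware), the helper below clears the size fields of every entry that points at the evicted one; it reuses the BTBPPGEntry sketch from above and is only illustrative.

```python
def on_ppg_entry_evicted(ppg_table, removed_entry):
    """Reset the size fields of every entry that points at the evicted entry.

    A size of 0 tells the pre-pc that the successor block is unknown, so it
    stops following the stale pointer instead of scanning a dangling block.
    """
    for e in ppg_table:
        if e.t_entry == removed_entry:
            e.t_size = 0
        if e.nt_entry == removed_entry:
            e.nt_size = 0
```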

The pre-pc uses the second index to identify all the access instructions in a block predicted for execution. An entry in the SRPT contains the following fields:

pre-pc-tag | pc-tag | last-ea | stride | times | fsm | dist

The pc-tag field holds the address of the load/store instruction. The last-ea field stores the last effective address accessed by this instruction. The stride field holds the difference between the last two effective addresses accessed by the memory instruction, and the counter times keeps track of how many iterations (with respect to that instruction) the pre-pc is ahead of the PC. The fields stride and times, together with fsm, are used to generate the predicted address as in [3]. The dist field contains the distance of the memory reference instruction from the last branch and is used to control the distance between the PC and the pre-pc. The pre-pc-tag field stores the following three values:

br-entry-num | T/NT | mem-ref-num

The 9-bit br-entry-num field holds the BTB/PPG entry number of the last branch executed before reaching the memory access instruction; the T/NT bit indicates whether the instruction is on the taken or not-taken path; and the last 7 bits, called mem-ref-num, are the ordinal number of the instruction within the load/store instructions of the block.

The SRPT is updated whenever the CPU core (PC) executes a load/store instruction. During this update the prediction information (such as stride, times, etc.) is calculated and stored together with the effective address used by the instruction. The pre-pc uses this information in order to predict future accesses addressed by this instruction. The predicted effective address is equal to last-ea + (stride × times), and the decision to generate the prediction is made by the two states init and prediction of the finite state machine (fsm) in Figure 2. The hardware needed to implement a 128-entry SRPT is equivalent to a 2K-byte data cache, where the pc-tag size is equivalent to the tag field of the cache and the memory needed for the pre-pc-tag is counted twice its actual size (since it is fully associative).
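A minimal behavioral sketch of an SRPT entry follows; only the last-ea + (stride × times) computation and the update of last-ea and stride on a core access come from the text, while the method names and the times bookkeeping are our assumptions (the fsm handling is elided).

```python
from dataclasses import dataclass

@dataclass
class SRPTEntry:
    # pre-pc-tag components, used by the fully associative pre-pc search
    br_entry_num: int
    t_nt: int
    mem_ref_num: int
    # per-instruction prediction state, updated on core (PC) accesses
    pc_tag: int       # address of the load/store instruction
    last_ea: int = 0  # last effective address accessed
    stride: int = 0   # difference between the last two effective addresses
    times: int = 0    # how many iterations the pre-pc is ahead of the PC
    dist: int = 0     # distance (in instructions) from the last branch

    def predicted_ea(self) -> int:
        """Predicted effective address: last-ea + (stride × times)."""
        return self.last_ea + self.stride * self.times

    def prepc_advance(self) -> int:
        """Pre-pc scanned this instruction once more; return the prefetch address."""
        self.times += 1
        return self.predicted_ea()

    def core_update(self, effective_address: int) -> None:
        """Core executed the instruction; refresh history (fsm update omitted)."""
        self.stride = effective_address - self.last_ea
        self.last_ea = effective_address
        if self.times > 0:
            self.times -= 1   # assumed: the PC caught up by one iteration
```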

Figure 2: The finite state machine fsm (states include init, no-prediction, and prediction; transitions are taken on correct and incorrect predictions).

The search in the SRPT is done in parallel to the execution in the CPU pipeline; thus, two clock cycles can be used if the operation is too long.

SRPT Updates and Management

In our simulations we assume the possibility of issuing multiple instructions per cycle with no more than two memory accesses per cycle (dual-ported cache); thus, to support the updates of the memory reference instructions, the SRPT must be dual ported. An entry in the SRPT is allocated on the first execution of a memory access instruction. At that point the prediction fields (stride, times, and fsm) are set to zero. When an entry is removed from the BTB/PPG, there is a need to update the SRPT entries whose br-entry-num field is equal to the number of the entry being removed. This is done by setting the dist field to zero (invalidating the pre-pc-tag field of the SRPT entry). Note that the PC is stalling in this case (due to a mispredicted branch penalty) for a few cycles, and the pre-pc does not progress until all updates complete.

3.4 The pre-pc Using the PPG and SRPT Structures

In this section we describe the pre-pc mechanism. At the beginning of the program execution, and following a misprediction in the direction taken by the pre-pc, the PC and pre-pc are equal. The pre-pc can depart from the PC at each branch entry whose size field is not 0. When the size field is not 0, the branch was visited at least once before, and its prediction entries and size fields were updated (pointing to the next branch).

Within one clock cycle the pre-pc can advance a full block ahead of the PC. Using the entry of the next branch (according to the branch prediction), the pre-pc can move directly to the next block without the need for a lookup on the BTB tag field. Thus, the PPG provides an efficient method for supporting a pre-pc that can be used in many places. During a data cache access, a lookup and update are conducted on the SRPT with the pc-tag. In parallel, the pre-pc is looking for future memory reference instructions in the SRPT.

With the example in Figure 1 we illustrate the pre-pc progress, assuming that branch instructions 3, 13, and 17 are kept in entries 15, 30, and 18 of the BTB/PPG table, respectively (as in Table 1). Table 2 presents the pre-pc-tag for some of the SRPT entries in a possible execution of the code fragment in Figure 1.

Table 2: The pre-pc-tag for some of the SRPT entries in a possible execution of the code fragment in Figure 1.
inst | br-ent-num | T/NT | mem-ref-num
I4   | 15         | 0    | 1
I5   | 15         | 0    | 1
I7   | 15         | 0    | 2
I12  | 15         | 0    | 2
I16  | 30         | 0    | 1

The mem-ref-num fields of the first two memory reference instructions (in block 15), 4 and 5 (denoted by I4, I5), have value 1 when the memory is dual ported. The value of that field for I7 and I12 is 2, and it is 1 for I16 (first in block 30). The T/NT values for all five instructions are zero, since they are on the not-taken path of their respective branches, 15 and 30. At some point, when the pre-pc value is (15,0,1), its lookup in the SRPT on the pre-pc-tag field (br-entry-num, T/NT, mem-ref-num) matches I4 and I5 and partially matches I7 and I12. A partial match is given for SRPT entries for which the br-entry-num and T/NT fields of the pre-pc-tag have the same values as in the lookup operation, e.g., (15,0,*) (the use of the partial match will soon be clear). In the next clock cycle the pre-pc value is (15,0,2) and it matches I7 and I12. Thus, the pre-pc jumps over irrelevant instructions without losing cycles between consecutive accesses to the SRPT.
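The full-match/partial-match lookup can be modeled as below over a list of the SRPTEntry records sketched earlier; this is an illustrative stand-in for the fully associative CAM search, not a hardware description.

```python
def srpt_lookup(srpt, br_entry_num, t_nt, mem_ref_num):
    """Fully associative pre-pc-tag search.

    Returns (full_matches, partial_matches): a full match agrees on all three
    tag components; a partial match agrees only on (br-entry-num, T/NT) and
    tells the pre-pc that the current block still has memory references left.
    """
    full, partial = [], []
    for entry in srpt:
        if entry.br_entry_num == br_entry_num and entry.t_nt == t_nt:
            if entry.mem_ref_num == mem_ref_num:
                full.append(entry)
            else:
                partial.append(entry)
    return full, partial
```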

Once I7 and I12 are found (matched), with the help of the PPG the pre-pc value in the next cycle is (30,0,1) (described later). The new pre-pc value causes the next block to be scanned, and the load of I16 is found. Whenever a match occurs, the reference prediction is computed based on the history information fields, and the decision to issue the prefetch request is made by a four-state finite state machine.

In the following example we discuss the progress of the pre-pc in a few scenarios for the example in Figure 1. Assume we are at clock cycle x and a lookup is done on entry 15 of the PPG (see the PPG:LU column in the first row of Table 3). By the end of cycle x the PPG:LU results are: T/NT=0 (branch 15 is not taken) and next-br=30.

Table 3: The pre-pc progress (a possible scenario for Figure 1).
cycle | SRPT:LU  | PPG:LU | PPG:LU results     | SRPT:LU results
x     |          | 15     | T/NT=0, next-br=30 |
x+1   | (15,0,1) | 30     | T/NT=0, next-br=18 | FM={I4,I5}, PM={I12}
x+2   | (15,0,2) |        |                    | FM={I12}
x+3   | (30,0,1) | 18     | T/NT=1, next-br=15 | FM={I16}
x+4   | (18,1,1) | 15     | T/NT=0, next-br=30 | no match
x+5   | (15,0,1) | 30     | T/NT=0, next-br=18 | FM={I4,I5}, PM={I12}

To make the example more general, assume at this stage that I4, I5, I12, and I16 are in the SRPT and the entry for I7 was removed. During cycle x+1, in parallel to a direct lookup on PPG entry 30, the pre-pc searches the SRPT pre-pc-tag with (15,0,1) (see the SRPT:LU column of Table 3). The SRPT:LU results are two full matches, for I4 and I5, and a partial match for I12. The partial match on I12 indicates the need for the pre-pc to continue searching this block. Thus, during cycle x+2 the SRPT is searched with the values (15,0,2). By the end of cycle x+2 instruction I12 is fully matched (last in the block). The next predicted block is 30 and T/NT=0; thus, during cycle x+3 a SRPT:LU search for (30,0,1) is generated and I16 is fully matched. In parallel, a PPG:LU search on branch entry 18 is generated. At the next cycle a SRPT:LU with (18,1,1) yields no match, since block 18 has no memory access instructions. From the PPG:LU results of cycle x+3 the next branch is 15; this branch is not taken (according to the PPG:LU results at the end of cycle x+4). Thus, the next SRPT:LU search (cycle x+5) is for (15,0,1).
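Building on the SRPTEntry and srpt_lookup sketches above, the per-cycle loop of the pre-pc could look roughly like this; the state dictionary, the prediction stand-in (reading prediction_info as a taken/not-taken bit), and the hand-off to a prefetch queue are all assumptions made for illustration, and no cycle timing is modeled.

```python
def prepc_step(state, ppg_table, srpt, prefetch_queue):
    """One pre-pc step: SRPT lookup for the current block, PPG lookup for the next."""
    # SRPT:LU with the current pre-pc value (block, T/NT, mem-ref-num).
    full, partial = srpt_lookup(srpt, state["block"], state["t_nt"], state["ref_num"])
    for entry in full:
        prefetch_queue.append(entry.prepc_advance())  # predicted address, handed to the PRC

    # PPG:LU, performed in parallel in hardware: predicted direction and successor entry.
    ppg = ppg_table[state["next_ppg_lookup"]]
    taken = bool(ppg.prediction_info)  # stand-in for the BTB taken/not-taken prediction

    if partial:
        # More memory references remain in this block: keep scanning it next step.
        state["ref_num"] += 1
    else:
        # Jump to the next predicted block and schedule the following PPG lookup.
        state["block"] = state["next_ppg_lookup"]
        state["t_nt"] = 1 if taken else 0
        state["ref_num"] = 1
        state["next_ppg_lookup"] = ppg.t_entry if taken else ppg.nt_entry
```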

The pre-pc search in the SRPT is fully associative and can take a full clock cycle (if more than a single clock cycle is needed for 128 entries, the depth of the associativity can be reduced by partitioning the SRPT, e.g., mapping instructions of blocks in the upper and lower halves of the PPG table into two different parts of the SRPT). Thus, this search is pipelined with the calculation of a predicted effective address and the prefetch dispatch.

The example above shows the progress of the pre-pc in a few special cases. The pre-pc searches the SRPT and the PPG in parallel and thus, in one clock cycle, it can jump over a block that has two or fewer memory reference instructions. If a block has more than two memory reference instructions, the pre-pc's progress through the block is never slower than that of the PC. In general, the pre-pc can be either ahead of the PC or in the same position as the PC (following a misprediction in the direction taken by the pre-pc, or at the beginning of the execution). The goal is to increase the distance before a data cache miss occurs.

The optimal distance between the pre-pc and the PC is the one at which no cache miss occurs for predicted data. Thus, it must not be smaller than the time taken for a prefetch request to be fulfilled. A prefetch request has low priority both for the lookup in the cache and for the memory bus. This implies that the time interval between the PC and the pre-pc must be even larger than the fetch latency. On the other hand, it must not be too big, in order not to cause a replacement of data that will soon be needed for the computation. In addition, we note that the history in the BTB/PPG and SRPT, used for generating the predictions, is less accurate when the distance is very large. Tango can only control the maximum distance. Since the distance is measured by the number of instructions between the PC and the pre-pc, the maximum value should reflect the execution time of the instructions and the delays added by servicing the prefetch requests.

The distance between the pre-pc and the PC is computed every cycle. Whenever the pre-pc jumps to a new block, the distance increases by T-size or NT-size, depending on the direction taken by the pre-pc. The distance inside a block is accumulated upon an SRPT match (the dist value of the matched memory reference instruction is added and that of the previous one, in the same block, is subtracted). The distance covered by the PC is subtracted from the accumulated distance following every instruction's commit.
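As one literal reading of these three rules, the counter below adds the block size on a jump, refines the distance on each SRPT match, and subtracts committed instructions; how the first two increments compose in the actual hardware is our assumption, so this is only a rough sketch.

```python
class PrePCDistance:
    """Instruction distance between the pre-pc and the PC (rough sketch)."""

    def __init__(self, max_distance):
        self.value = 0
        self.max_distance = max_distance   # Tango only bounds the maximum distance
        self._last_dist_in_block = 0       # dist field of the previous match in the block

    def on_block_jump(self, block_size):
        """Pre-pc jumped to a new block: add its T-size or NT-size."""
        self.value += block_size
        self._last_dist_in_block = 0

    def on_srpt_match(self, dist):
        """Accumulate the within-block distance from the matched entry's dist field."""
        self.value += dist - self._last_dist_in_block
        self._last_dist_in_block = dist

    def on_commit(self, n=1):
        """The PC committed n instructions, closing part of the gap."""
        self.value -= n

    def may_advance(self):
        """Stall the pre-pc once the maximum allowed distance has been reached."""
        return self.value < self.max_distance
```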

3.5 PRC: a Prefetch Requests Controller

The third part of our scheme, the Prefetch Requests Controller (PRC), is a mechanism for controlling and scheduling the prefetch requests. Due to the heavy traffic on the core-cache bus, the number of open slots for prefetching queries in the cache is very small. In addition, the timing of the open slots may not be the right one for prefetching. Minimizing the traffic overhead incurred by prefetching on this bus, and on the memory bus, calls for optimizations of data prefetching in two places. The PRC mechanism filters out redundant prefetch requests and controls the scheduling of the prefetch requests on the core-cache bus (it always gives priority to the core requests). In addition, the data cache LRU is touched (following reference predictions) in order to reduce the number of requests on the memory bus and to improve Tango's performance.

When the pre-pc scans a memory reference instruction, its predicted effective address is supplied by the SRPT. In the next step the address can be looked up in the data cache. Since the ports to the data cache tag may be occupied by requests from the core, we added a FIFO buffer, Req-Q, of four entries for keeping the prefetch requests (see Figure 3). Thus, the pre-pc can advance before the request is served. When Req-Q is full, the pre-pc must wait. The queue is flushed if the pre-pc took the wrong direction (wrong branch prediction). Due to the locality-of-reference principle it is very likely that within short time periods more than a single request for data in a cache line (block) can occur. An associative search on Req-Q prevents a double insertion of a prefetch request. At every clock cycle it is possible to issue up to two prefetch requests on the cache-tag ports not used by the core.

A prefetch request which misses the cache is directed to the next level in the memory hierarchy. If the memory bus is busy, the request is stored in a second small buffer, Wait-Q, of which two entries (with top priority) are dedicated to requests generated by the core (see Figure 3). The prefetch requests are flushed from this buffer when the pre-pc takes the wrong direction (on a branch). A Track-Q is used for tracking prefetch requests issued to the memory bus and for preserving system consistency.

Figure 3: Scheduling and controlling prefetch requests (the multiple-issue CPU core, the Req-Q dispatch buffer, the PRC with its Filter-cache, Wait-Q, and Track-Q, the write buffer, the two data cache banks, and the priority interconnect to the next level in the memory hierarchy).

Every memory request is put on Track-Q and is removed when the data arrive. When Track-Q is full, the issuing of prefetch requests to the memory bus ceases. Lastly, Tango uses a unique buffer, Filter-cache, in order to track requests that were found (hit) in the data cache. This buffer is original to Tango, unlike Track-Q, Wait-Q, and Req-Q. For a 2-way set-associative cache, and a memory bus that can serve a request every 4 cycles with a 12-cycle latency, a line found (hit) in the cache will stay in the cache at least 16 cycles. Following a cache hit, the requested block address is put in Filter-cache. A counter is attached to this entry, and its value is set to the fetch latency plus the fetch spacing. This counter is decremented by one every clock cycle, and the entry is removed when its value reaches 0. The idea behind this action is to minimize the dispatching of prefetch requests to Req-Q, and thus to decrease the traffic on the core-cache bus. If Filter-cache is full, it behaves like a FIFO queue. This filter provides a simple way to dynamically remove redundant prefetch requests and is thus suitable for out-of-order computations as well.

Only prefetch requests that are not in Req-Q, Wait-Q, Track-Q, or Filter-cache are directed to the Req-Q buffer. Thus, the Tango scheme minimizes the number of requests sent to Req-Q and prevents loading the core-cache bus, leaving out most (about 2/3) of the requests generated by the SRPT.
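A sketch of the Filter-cache countdown and of the admission check into Req-Q follows; the class, the helper, and the fixed sizes (six Filter-cache entries, a lifetime of fetch latency plus fetch spacing, 32-byte lines) mirror the baseline numbers in the text, but the Python structures themselves are only illustrative (a full Req-Q stalling the pre-pc is not modeled).

```python
from collections import OrderedDict

LINE_SIZE = 32  # bytes, as in the baseline configuration

class FilterCache:
    """Tracks line addresses that recently hit in the data cache (FIFO when full)."""

    def __init__(self, entries=6, lifetime=12 + 4):   # fetch latency + fetch spacing
        self.entries, self.lifetime = entries, lifetime
        self.lines = OrderedDict()                    # line address -> remaining cycles

    def insert(self, line_addr):
        if len(self.lines) >= self.entries:
            self.lines.popitem(last=False)            # behaves like a FIFO queue when full
        self.lines[line_addr] = self.lifetime

    def tick(self):
        """Called every clock cycle: age the entries and drop expired ones."""
        for addr in list(self.lines):
            self.lines[addr] -= 1
            if self.lines[addr] == 0:
                del self.lines[addr]

    def __contains__(self, line_addr):
        return line_addr in self.lines

def dispatch_prefetch(addr, req_q, wait_q, track_q, filter_cache):
    """Admit a predicted address into Req-Q only if no copy is already pending."""
    line = addr // LINE_SIZE
    if line in filter_cache or line in req_q or line in wait_q or line in track_q:
        return False           # redundant request, filtered before touching the tag ports
    req_q.append(line)         # e.g., a 4-entry queue in the baseline configuration
    return True
```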

About 80% of the removed requests were filtered by a Filter-cache of six entries. Since we use an associative search on these buffers, they were all chosen to be small.

Avoiding the prefetch of data found in the data cache, without indicating that it may be needed soon, can result in purging it before its actual use. This indeed happened in our early simulations and indicated that, due to the distance between the pre-pc and the PC, many lines residing in the cache at the time of the pre-pc lookup were no longer there when the PC looked for them. This problem was solved by a simple modification to the data cache LRU scheme. The Tango scheme generates an LRU touch whenever a prefetch request is a hit, thus indicating that the line may be needed soon (rather than leaving it marked as used early). This internal prefetching prevents an erroneous purging on the one hand, and saves the cost of issuing redundant requests to the next memory level on the other. The simulation results showed significant improvement along this avenue for both the performance and the memory bus traffic.

The prediction done by the SRPT is very aggressive, since the majority of the references are of constant stride and since the stride is not checked to be stable before generating a prefetch request (e.g., following the first 2 references of a load/store instruction). An extreme example is the dnasa7 program in the SPEC92 benchmark. For this program the SRPT generated predictions for 98% of the memory accesses. Since 46% of all the instructions are memory reference instructions, the cache-tag ports were occupied by the core 60% of the execution time. Without an additional optimization the prefetch requests would also need 60% of the cache-tag ports bandwidth. Since only 40% of the bus bandwidth is available, and not always at the right moment, the prefetch performance was impaired. With the above PRC optimizations only 1/3 of the SRPT requests remained. As a result, only 20% of the cache-tag ports bandwidth was needed for data prefetching. Note that the core always has first priority on the cache-tag ports. In addition, the internal prefetch (touching the LRU) reduces the traffic on the memory bus. The improvement in the total performance and the reduction in bus utilization show the effectiveness of the above optimizations.
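For illustration, the LRU touch on a prefetch hit can be modeled on a generic set-associative cache as below; this is not the simulator used in the paper, and a plain per-set LRU order is assumed.

```python
class SetAssociativeCache:
    """Generic LRU cache model used only to illustrate the 'touch on prefetch hit'."""

    def __init__(self, sets, ways):
        # Each set keeps its lines ordered from least to most recently used.
        self.sets = [[] for _ in range(sets)]
        self.num_sets, self.ways = sets, ways

    def _set_of(self, line_addr):
        return self.sets[line_addr % self.num_sets]

    def probe_for_prefetch(self, line_addr):
        """Prefetch lookup: on a hit, touch the LRU so the line is not purged early."""
        s = self._set_of(line_addr)
        if line_addr in s:
            s.remove(line_addr)
            s.append(line_addr)   # mark as most recently used ("internal prefetch")
            return True           # hit: no request goes to the memory bus
        return False              # miss: the request proceeds to the next level

    def fill(self, line_addr):
        """Install a line fetched from the next level, evicting the LRU line if needed."""
        s = self._set_of(line_addr)
        if len(s) >= self.ways:
            s.pop(0)              # evict the least recently used line
        s.append(line_addr)
```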

4 Architectural Models for the Simulations

Our investigated architecture consists of a modern processor that can issue up to four instructions per cycle (the baseline architecture). The physical limitations on this rate are: at most two memory reference instructions per cycle (i.e., two loads, two stores, or one of each), and a single conditional or unconditional control flow operation. All execution units are fully pipelined with one-cycle throughput and a one-cycle execution stage (latency). If no delay is incurred by any of the instructions issued in a cycle, their execution contributes a single cycle to the simulation. The instruction cache size is 32K bytes, organized as 4-way set-associative with a 32-byte cache line size. No penalty is charged for a hit, and the miss penalty in this cache is 6 cycles (the instruction cache miss penalty is relatively small, to compensate for the lack of an instruction prefetching mechanism in the simulations).

In addition to the I-cache miss penalty (and the accompanying use of memory bus bandwidth), a misprediction of a branch can also stall the execution. In our simulated architecture (described in Figure 4) we use a branch prediction mechanism based on the two-level adaptive training scheme suggested by Yeh and Patt [15]. In our simulations the automata are initialized with non-zero values, and branches that are not in the table are predicted as not taken. A mispredicted branch stalls the execution for 2 cycles. This happens when the BTB generates a wrong prediction (direction or target address) or for a taken branch which is not in the table. The BTB configuration is determined by specifying the BTB size, the associativity, the number of bits in the history register, and the automaton type. The baseline BTB configuration for our simulations comprises a 512-entry 4-way set-associative table. Each entry has a 2-bit history register which is used for selecting one of four 2-bit Lee & Smith automata.

The data cache is an interleaved (2-bank) write-back cache with a write-allocate policy and a write-back buffer (for replacing dirty data lines) of 8 entries. Each bank has a separate data port that can serve either a load or a store in every cycle (i.e., no write "blockage" between a store and a subsequent load). The two ports are accessible from all four execution units, and two memory reference operations may access the cache in parallel if they address different banks.

Figure 4: Simulated architecture model (multiple-issue processor core; a 2-level BTB, baseline 512 entries, 4-way set-associative, with a 2-bit history register and four 2-bit state machines per entry, extended with the PPG; a 128-entry 4-way set-associative SRPT; a 32K-byte, 4-way set-associative, 32-byte-line, 2-bank data cache; a 32K-byte, 4-way set-associative, 32-byte-line I-cache with no prefetching; the data prefetching unit with the PRC and Wait-Q; a write buffer; and an interconnect to the next memory level with configurable fetch latency and spacing between requests).

The bank selection is made at the switching point, with the low-order bit of the line address. All the cache parameters are configurable. The baseline configuration for the data cache is 32K bytes, organized as 4-way set-associative with a 32-byte cache line size. Dirty cache lines which are evacuated from the cache are moved to the write buffer. The write buffer has the lowest priority on the memory bus, and its lines are moved to memory in FIFO order whenever the bus is not busy. A prefetch request that may cause a dirty cache line replacement when the write buffer is full will not be entered into the cache. Thus, the delay caused by writing a dirty line from the write buffer can increase only the penalty of a data cache miss (in such a case the write buffer gets the highest priority on the memory bus). The sizes of the PRC buffers are those presented in Figure 3 (4, 6, and 4 entries for Req-Q, Filter-cache, and Track-Q, respectively).

The interface with the memory is parameterized in two ways. First, the bandwidth is constrained by controlling the number of cycles, the fetch spacing, between two consecutive launches of memory fetch requests. The second parameter, the fetch latency, specifies the number of cycles needed for fetching the data. When the fetch spacing is one, the memory interface is pipelined; at the other extreme, when the fetch spacing equals the fetch latency, only one memory fetch is allowed at a time.
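A toy model of this two-parameter memory interface is sketched below with the baseline values used next (a fetch spacing of 4 cycles and a fetch latency of 12 cycles); it only illustrates how the spacing bounds the issue rate while the latency sets the return time, and the class itself is not part of the paper's simulator.

```python
class MemoryInterface:
    """Issue at most one fetch every `fetch_spacing` cycles; data returns
    `fetch_latency` cycles after issue (pipelined when spacing < latency)."""

    def __init__(self, fetch_spacing=4, fetch_latency=12):
        self.fetch_spacing = fetch_spacing
        self.fetch_latency = fetch_latency
        self.cycle = 0
        self.next_issue_cycle = 0
        self.in_flight = []          # (completion_cycle, line_address)

    def try_issue(self, line_addr):
        """Launch a fetch if the bus slot is free; return True if accepted."""
        if self.cycle < self.next_issue_cycle:
            return False
        self.next_issue_cycle = self.cycle + self.fetch_spacing
        self.in_flight.append((self.cycle + self.fetch_latency, line_addr))
        return True

    def tick(self):
        """Advance one cycle and return the lines whose data arrived this cycle."""
        self.cycle += 1
        done = [addr for t, addr in self.in_flight if t <= self.cycle]
        self.in_flight = [(t, a) for t, a in self.in_flight if t > self.cycle]
        return done
```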

In our base configuration the fetch spacing is 4 cycles and the fetch latency is 12 cycles.

We tested the Tango prefetching scheme by running programs from the SPEC benchmarks. In each simulation we used the same environment parameters with two system architectures: the reference (no data prefetching) and the Tango prefetching scheme. We assumed, for both systems, that the instructions are statically scheduled such that there is no memory reference dependence between instructions executing simultaneously. Thus, if 4 instructions are executed in the same cycle, none of them needs the results obtained by the others (this behavior is supported by all out-of-order mechanisms). Any stall due to a data cache miss can postpone only those instructions that were issued at a later stage. The number of extra cycles added to the simulation due to a miss depends on the memory bus status and the fetch latency. We compared both systems with various types of memory units by changing the cache size, associativity, line size, buffer sizes, and the interface parameters to the memory. Since two misses can occur during a single cycle (dual-ported data cache), and the write buffer as well as the I-cache use the memory bus, both systems benefit when the fetch spacing is smaller than the fetch latency, compared to the case when they are equal.

5 Simulation Results

In this section we explore our prefetching design for various architecture parameters. We ran trace-driven simulations with five programs from the SPEC92 benchmark and matrix from SPEC89. We used matrix mainly for investigating the load on the cache-tag ports (a data-intensive program). Every simulation included the first 100 million instructions, and results were inspected every 10 million instructions. The behavior of five programs (all but espresso) reached a steady phase already at 50 million instructions; hence, some of the detailed simulations (starting from Section 5.2) were carried out for the first 50 million instructions. The average size of a program tested is about 20 thousand instructions.

5.1 Program Characteristics and General Results

Table 4 shows the dynamic characteristics of the programs.

Table 4: Program characteristics for 100 million instructions (columns: data reference rate (write, read); stride distribution (zero, large, small, irregular); correct BTB predictions (%); programs: matrix, dnasa7, xlisp, tomcatv, espresso, spice2g6).

The column "Data ref rate" shows the percentage of writes and reads in each application. As expected, the reads are more frequent than the writes, and the portion of read misses out of the total misses is also larger than that of the writes (see Table 5, first 2 columns). The next four columns in Table 4 indicate the predictability of memory references. The stride distribution information tells us the proportions of data references that behave according to one of four categories: data references with zero stride are steady references directed to the same memory location; large and small are those references with strides larger than or equal to 32 bytes (the line size in the baseline architecture) and smaller than 32 bytes, respectively (this data was gathered with an infinite-size SRPT). The data cache can be very helpful with zero- and small-stride references, but the prefetching mechanism provides further improvement in these cases when there is no temporal locality. Large and irregular stride references can be the main source of cache misses; while the prefetching mechanism is useful for large strides, it must also identify irregular memory references so as to avoid unnecessary initiation of erroneous prefetch requests (this is done using a 2-bit state machine). The last column presents the percentage of correct BTB predictions for the baseline BTB. This information can help in further estimating the success of our prefetching mechanism.
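The four-way stride classification behind Table 4 can be expressed as a small helper, assuming the baseline 32-byte line size; treating "irregular" as a stream without a single constant stride is our reading of the category, not necessarily the paper's exact criterion.

```python
LINE_SIZE = 32  # bytes; line size of the baseline architecture

def classify_stride(effective_addresses):
    """Classify a reference stream as 'zero', 'small', 'large', or 'irregular'."""
    strides = [b - a for a, b in zip(effective_addresses, effective_addresses[1:])]
    if not strides or any(s != strides[0] for s in strides):
        return "irregular"          # no single constant stride (our assumption)
    stride = abs(strides[0])
    if stride == 0:
        return "zero"               # steady references to the same location
    return "small" if stride < LINE_SIZE else "large"

# Example: an array accessed with an 8-byte stride falls in the 'small' category.
print(classify_stride([0x1000, 0x1008, 0x1010, 0x1018]))  # -> 'small'
```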

Table 5 summarizes the results obtained for the baseline architecture (with a pipelined memory bus).

Table 5: Simulation results of the base architecture (100 million instructions), Tango vs. the reference system (columns: misses (%), read and write; reference hit ratio; Tango C-hit ratio; C-tag used by core (%); memory bus extra bandwidth (%); memory penalty reduction (%); speedup, CPI ratio; programs: matrix, dnasa7, xlisp, tomcatv, espresso, spice2g6).

The misses (read/write) columns of Table 5 represent the distribution of read and write misses. The "ref hit ratio" column shows the hit ratio of the reference data cache, followed by the data cache hit ratio column of the prefetch-enhanced architecture. For the Tango architecture we incorporated the relative portions of the penalty for those miss requests that were on their way (due to a late prefetch); in this calculated hit ratio, the sum of all the partial penalties was divided by the fetch latency value to generate the relative number of misses. In five out of the six programs the C-hit ratio was 99%, and only for spice2g6 is the change from 83.8% to 91.2% in the Tango system.

Instruction scheduling and out-of-order execution exposed most of the available parallelism (this is implied by the relatively small ideal CPI in Figure 5). Such parallelism exploits the hardware resources most of the time. In the "C-tag used by core" column of Table 5 we present the percentage of time in which the data cache tag ports were busy due to core accesses (demand fetches). On the average, the cache-tag ports are used by the core for a significant fraction of the time. Thus, the remaining bandwidth for prefetch requests is small and must be used wisely. The "M bus ext bandwidth" column presents the percentage of extra requests imposed by Tango on the memory bus. For xlisp and espresso this number is significant. Nevertheless, in xlisp the total number of prefetch requests was small, since the hit ratio (for the reference system) is 99.3%.

The last two columns summarize the performance of the Tango scheme. The percentage of memory penalty reduction (the "M pen red" column) is correlated with low irregular-stride percentages and a high BTB prediction rate (see matrix, dnasa7, and tomcatv). The other programs, too, exhibit significant improvements. For the matrix and dnasa7 programs, 96.57% and 94.11% of the memory penalty were removed, respectively, demonstrating an efficient prefetching scheme in spite of the small bandwidth left on the core-cache bus (45.3% and 46.7% of the dynamic code accesses the data cache, respectively). On the other side of the scale we find xlisp, with a very high hit ratio (99.3%) and a large irregular-stride percentage. Nevertheless, the performance of this program is improved as well.

Figure 5 summarizes the performance in three bars for each application. The I-CPI is the ideal CPI, derived for the case in which every memory access reference is a hit. The R-CPI bar is the result found for the reference system, and the T-CPI bar presents the Tango performance.

Figure 5: Comparing performance results; I-CPI: ideal CPI, T-CPI: CPI with prefetching (Tango), R-CPI: CPI without prefetching (reference).


More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,

More information

Branch-directed and Pointer-based Data Cache Prefetching

Branch-directed and Pointer-based Data Cache Prefetching Branch-directed and Pointer-based Data Cache Prefetching Yue Liu Mona Dimitri David R. Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, M bstract The design of the

More information

Cache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time

Cache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time Cache Performance!! Memory system and processor performance:! CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st = Pipeline time +

More information

EITF20: Computer Architecture Part 5.1.1: Virtual Memory

EITF20: Computer Architecture Part 5.1.1: Virtual Memory EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs

More information

A Survey of prefetching techniques

A Survey of prefetching techniques A Survey of prefetching techniques Nir Oren July 18, 2000 Abstract As the gap between processor and memory speeds increases, memory latencies have become a critical bottleneck for computer performance.

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring 2018 SOLUTIONS Caches and the Memory Hierarchy Assigned February 8 Problem Set #2 Due Wed, February 21 http://inst.eecs.berkeley.edu/~cs152/sp18

More information

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Lecture 29 Review CPU time: the best metric Be sure you understand CC, clock period Common (and good) performance metrics Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont. History Table. Correlating Prediction Table

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont.   History Table. Correlating Prediction Table Lecture 15 History Table Correlating Prediction Table Prefetching Latest A0 A0,A1 A3 11 Fall 2018 Jon Beaumont A1 http://www.eecs.umich.edu/courses/eecs470 Prefetch A3 Slides developed in part by Profs.

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Dynamic Branch Prediction

Dynamic Branch Prediction #1 lec # 6 Fall 2002 9-25-2002 Dynamic Branch Prediction Dynamic branch prediction schemes are different from static mechanisms because they use the run-time behavior of branches to make predictions. Usually

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

COSC 6385 Computer Architecture. - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Caches. Hiding Memory Access Times

Caches. Hiding Memory Access Times Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Memory hierarchy J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

A MULTIPORTED NONBLOCKING CACHE FOR A SUPERSCALAR UNIPROCESSOR JAMES EDWARD SICOLO THESIS. Submitted in partial fulllment of the requirements

A MULTIPORTED NONBLOCKING CACHE FOR A SUPERSCALAR UNIPROCESSOR JAMES EDWARD SICOLO THESIS. Submitted in partial fulllment of the requirements A MULTIPORTED NONBLOCKING CACHE FOR A SUPERSCALAR UNIPROCESSOR BY JAMES EDWARD SICOLO B.S., State University of New York at Bualo, 989 THESIS Submitted in partial fulllment of the requirements for the

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

CS152 Computer Architecture and Engineering SOLUTIONS Caches and the Memory Hierarchy Assigned 9/17/2016 Problem Set #2 Due Tue, Oct 4

CS152 Computer Architecture and Engineering SOLUTIONS Caches and the Memory Hierarchy Assigned 9/17/2016 Problem Set #2 Due Tue, Oct 4 CS152 Computer Architecture and Engineering SOLUTIONS Caches and the Memory Hierarchy Assigned 9/17/2016 Problem Set #2 Due Tue, Oct 4 http://inst.eecs.berkeley.edu/~cs152/sp16 The problem sets are intended

More information

Instruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.

Instruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3. Instruction Fetch and Branch Prediction CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.3) 1 Frontend and Backend Feedback: - Prediction correct or not, update

More information

Memory Consistency. Challenges. Program order Memory access order

Memory Consistency. Challenges. Program order Memory access order Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Dynamic Hardware Prediction. Basic Branch Prediction Buffers. N-bit Branch Prediction Buffers

Dynamic Hardware Prediction. Basic Branch Prediction Buffers. N-bit Branch Prediction Buffers Dynamic Hardware Prediction Importance of control dependences Branches and jumps are frequent Limiting factor as ILP increases (Amdahl s law) Schemes to attack control dependences Static Basic (stall the

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 omputer Architecture Spring 2016 Lecture 09: Prefetching Shuai Wang Department of omputer Science and Technology Nanjing University Prefetching(1/3) Fetch block ahead of demand Target compulsory, capacity,

More information

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

Lec 11 How to improve cache performance

Lec 11 How to improve cache performance Lec 11 How to improve cache performance How to Improve Cache Performance? AMAT = HitTime + MissRate MissPenalty 1. Reduce the time to hit in the cache.--4 small and simple caches, avoiding address translation,

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

Week 11: Assignment Solutions

Week 11: Assignment Solutions Week 11: Assignment Solutions 1. Consider an instruction pipeline with four stages with the stage delays 5 nsec, 6 nsec, 11 nsec, and 8 nsec respectively. The delay of an inter-stage register stage of

More information

Data Speculation. Architecture. Carnegie Mellon School of Computer Science

Data Speculation. Architecture. Carnegie Mellon School of Computer Science Data Speculation Adam Wierman Daniel Neill Lipasti and Shen. Exceeding the dataflow limit, 1996. Sodani and Sohi. Understanding the differences between value prediction and instruction reuse, 1998. 1 A

More information

Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin

Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Professor Alvin R. Lebeck Computer Science 220 Fall 1999 Exam Average 76 90-100 4 80-89 3 70-79 3 60-69 5 < 60 1 Admin

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5) Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 More Cache Basics caches are split as instruction and data; L2 and L3 are unified The /L2 hierarchy can be inclusive,

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these

More information

Power/Performance Advantages of Victim Buer in. High-Performance Processors. R. Iris Bahar y. y Brown University. Division of Engineering.

Power/Performance Advantages of Victim Buer in. High-Performance Processors. R. Iris Bahar y. y Brown University. Division of Engineering. Power/Performance Advantages of Victim Buer in High-Performance Processors Gianluca Albera xy x Politecnico di Torino Dip. di Automatica e Informatica Torino, ITALY 10129 R. Iris Bahar y y Brown University

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1)

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1) Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1) 1 Types of Cache Misses Compulsory misses: happens the first time a memory word is accessed the misses for an infinite cache

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Mainstream Computer System Components

Mainstream Computer System Components Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved

More information

for Energy Savings in Set-associative Instruction Caches Alexander V. Veidenbaum

for Energy Savings in Set-associative Instruction Caches Alexander V. Veidenbaum Simultaneous Way-footprint Prediction and Branch Prediction for Energy Savings in Set-associative Instruction Caches Weiyu Tang Rajesh Gupta Alexandru Nicolau Alexander V. Veidenbaum Department of Information

More information

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken Branch statistics Branches occur every 4-7 instructions on average in integer programs, commercial and desktop applications; somewhat less frequently in scientific ones Unconditional branches : 20% (of

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

Cache Optimisation. sometime he thought that there must be a better way

Cache Optimisation. sometime he thought that there must be a better way Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction

More information

Appendix A.2 (pg. A-21 A-26), Section 4.2, Section 3.4. Performance of Branch Prediction Schemes

Appendix A.2 (pg. A-21 A-26), Section 4.2, Section 3.4. Performance of Branch Prediction Schemes Module: Branch Prediction Krishna V. Palem, Weng Fai Wong, and Sudhakar Yalamanchili, Georgia Institute of Technology (slides contributed by Prof. Weng Fai Wong were prepared while visiting, and employed

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Opteron example Cache performance Six basic optimizations Virtual memory Processor DRAM gap (latency) Four issue superscalar

More information