Configuration Prefetching Techniques for Partial Reconfigurable Coprocessor with Relocation and Defragmentation

Zhiyuan Li, Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, USA
Scott Hauck, Department of Electrical Engineering, University of Washington, Seattle, WA, USA

Abstract

One of the major overheads for reconfigurable computing is the time it takes to reconfigure the devices in the system. This overhead limits the speedup possible in this paradigm. In this paper we explore configuration prefetching techniques for reducing this overhead. By overlapping configuration loading with computation on the host processor, the reconfiguration overhead can be reduced. Our prefetching techniques target reconfigurable systems containing a Partial Reconfigurable FPGA with Relocation + Defragmentation (the R+D model), since the R+D FPGA has shown high hardware utilization. We have investigated static configuration prefetching, dynamic configuration prefetching, and hybrid prefetching, and have developed prefetching algorithms that significantly reduce the reconfiguration overhead.

1. INTRODUCTION

In recent years, reconfigurable computing systems such as Garp [Hauser97], OneChip [Wittig96], PipeRench [Schmit97], and Chimaera [Hauck97] have attracted a lot of attention because of their promise to deliver the high performance of reconfigurable hardware along with the flexibility of general-purpose processors. In such systems, portions of an application with repetitive logic and arithmetic computation are mapped to the reconfigurable hardware, while the general-purpose processor handles the other portions of the computation. For many applications the system needs to be reconfigured frequently during run-time, and the reconfiguration overhead becomes a major concern because the system must sit idle during reconfiguration. By reducing the overall reconfiguration overhead, the performance of the system can be improved. Therefore, many studies have investigated techniques to reduce the configuration overhead, including configuration caching [Li00] and configuration compression [Li99, Li01, Dandalis01].

Another technique that effectively reduces this overhead is configuration prefetching [Hauck98]. The basic idea of configuration prefetching, similar to that of prefetching in general-purpose computer systems, is to overlap configuration loading with computation. Targeted at the Single Context FPGA model, [Hauck98] described an algorithm that can reduce the reconfiguration overhead by a factor of 2. As technology moves forward, more advanced devices and systems, such as the Xilinx Virtex and XC6200 families, Garp [Hauser97], and Chimaera [Hauck97], can be partially reconfigured at run time. For a system containing a Partial Run-Time Reconfigurable device, a configuration can be loaded into part of the device while the rest of the system continues computing.
Compared with the Single Context FPGA, Partial Run-Time Reconfigurable devices provide greater flexibility and higher hardware utilization. Based on the Partial Run-Time Reconfigurable model, a new model called Relocation + Defragmentation [Compton00] was built to further improve hardware utilization. Relocation allows the final placement of a configuration within the FPGA to be determined at run-time, while defragmentation provides a method to consolidate unused area within an FPGA during run-time without unloading useful configurations. The configuration caching techniques [Li00] applied to the Relocation + Defragmentation model (Partial R+D) have demonstrated a significant reduction in reconfiguration overhead. Note that this overhead reduction was based on demand fetching of configurations, meaning that effective prefetching approaches will further reduce this overhead.

2. PARTIAL R+D FPGA

Similar to the partially reconfigurable FPGA, the memory array of the Partial R+D FPGA is composed of an array of SRAM bits. These bits are read/write enabled by the decoded row address for the programming data. However, instead of using a column decoder, an SRAM buffer called a staging area is built. This buffer is essentially a set of memory cells equal in number to one row of programming bits in the FPGA memory array. To configure the chip, every row of a configuration is loaded into the staging area and then transferred to the array at the row location indicated by the row address. By providing a run-time-determined row address to the row decoder, the rows of a configuration can be relocated to locations specified by the system.
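To make this mechanism concrete, the following C++ sketch (our own illustration, not part of the original paper; the type and function names are assumptions) models the programming memory as a vector of rows and relocates a configuration by streaming each row through a staging buffer to a base row chosen at run time.

    #include <array>
    #include <cstddef>
    #include <vector>

    // Illustrative model of the Partial R+D programming memory.
    constexpr std::size_t kRowBits = 64;      // programming bits per row (assumed width)
    using Row = std::array<bool, kRowBits>;

    struct PartialRDArray {
        std::vector<Row> rows;                // the on-chip configuration memory array
        Row staging{};                        // the staging area: one row of memory cells

        explicit PartialRDArray(std::size_t num_rows) : rows(num_rows) {}

        // Relocate a configuration (a sequence of rows) to a base row chosen at run time.
        // Each row is first loaded into the staging area, then transferred to the array
        // at the row address supplied by the system (the steps of Figure 1).
        void relocate(const std::vector<Row>& config, std::size_t base_row) {
            for (std::size_t i = 0; i < config.size(); ++i) {
                staging = config[i];              // load one row into the staging area
                rows[base_row + i] = staging;     // write it to the run-time-determined row
            }
        }
    };

Defragmentation can be modeled in the same way: rows are read back into the staging buffer and rewritten at lower addresses to close gaps.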

Figure 1. An example of configuration relocation. The incoming configuration contains 2 rows. The first row is loaded into the staging area (a) and then transferred to the desired location that is determined at run-time (b). Then the second row of the incoming configuration is loaded into the staging area (c) and transferred to the array (d).

Figure 2. An example of defragmentation. By moving the rows in a top-down fashion into the staging area and then moving them upwards into the array, the smaller fragments are collected.

Figure 1 shows the steps of relocating a configuration into the array. The defragmentation operation is slightly more complicated than a simple relocation operation. In order to collect the fragments within the array, each row of a particular configuration is read back into the staging area and then moved to a new location in the array. Figure 2 demonstrates the steps of a defragmentation operation. It has been demonstrated in [Compton00] that, with a very minor area increase over standard architectures, the Relocation + Defragmentation model has a considerably lower reconfiguration overhead than the Partial Run-Time Reconfigurable model.

3. PREFETCHING OVERVIEW

As demonstrated in [Li00], an FPGA can be viewed as a cache of configurations. Prefetching configurations on an FPGA, which is similar to prefetching in a general memory system, overlaps reconfigurations with computation to hide the reconfiguration latency. Before we discuss the details of configuration prefetching, we first reexamine the factors that are most important to the effectiveness of prefetching in a general-purpose system:

1) Accuracy. This is the ratio of executed prefetched instructions or data to the overall prefetched instructions or data.

2) Coverage. This is the fraction of cache misses eliminated by a prefetching technique. Even an accurate prefetching technique will not significantly reduce latency without high coverage.

3) Pollution. One side-effect of prefetching is that cache lines that would have been used in the future may be replaced by prefetched instructions or data that are never used. This is known as cache pollution.

These issues are even more critical to the performance of configuration prefetching. In general-purpose systems the atomic data transfer unit is the cache block. Cache studies consistently show that the average access time tends to drop as the block size increases, until the block size reaches a certain value (usually fewer than 128 bytes); beyond that point the access time increases as the block size continues to grow, since a very large block results in an enormous penalty for every cache miss.

The atomic data transfer unit in the configuration caching or configuration prefetching domain, rather than a block, is the configuration itself, which is normally significantly larger than a block. Therefore the system suffers severely if a demanded configuration is not present on chip. In order for a system to minimize this huge latency, accurate prediction of the next required configurations is highly desirable.

As bad as it can be in general memory systems, cache pollution plays a much more malicious role in the configuration prefetching domain. As demonstrated in [Li00], due to the large configuration size and relatively small on-chip memory, very few configurations can be stored on chip. As a result, a wrong prefetch is very likely to cause a required configuration to be replaced, and a significant overhead will be incurred when the required configuration is brought back later. Thus, rather than reducing the overall configuration overhead, a poor prefetching approach can actually significantly increase it.

In general, prefetching algorithms can be divided into 3 categories: static prefetching, dynamic prefetching, and hybrid prefetching. A compiler-controlled approach, static prefetching inserts prefetch instructions after performing control flow or data flow analysis based on profile information and data access patterns. One major advantage of static prefetching is that it requires no additional hardware. However, since a significant amount of access information is unknown at compile time, the static approach is limited by the lack of run-time data access information. Dynamic prefetching determines and dispatches prefetches at run-time without compiler intervention. With the help of extra hardware, dynamic prefetching uses more data access information to make accurate predictions. Hybrid prefetching tries to combine the strong points of both approaches: it utilizes both compile-time and run-time information to become a more accurate and efficient approach.

4. PREFETCHING FACTORS

In order to better discuss the factors that affect prefetching performance, we first make the following definitions:

L_k: the latency of loading configuration k.
S_k: the size of configuration k.
D_ik: the distance between an operation i (either a normal instruction or a prefetch operation) and the execution of configuration k.
P_ik: the probability that i reaches configuration k.
C: the capacity of the chip.

The probability factor can have a significant impact on prefetching accuracy. The combination of S_k and C determines whether there is enough space on the chip to load configuration k. The combination of L_k and D_ik determines how much of the latency of configuration k can be hidden if it is prefetched at i. Obviously there will not be a significant reduction if D_ik is too short, because most of the latency of configuration k cannot be eliminated if it is prefetched at i. In addition, the non-uniform configuration latency affects the order in which prefetches need to be performed. Specifically, we might want to perform out-of-order prefetch (prefetch configuration j before configuration k even if k is required earlier than j) in some situations. For example, suppose we have 3 configurations 1, 2, and 3 to be executed in that order.
Given that S_3 >> S_1 >> S_2, S_1 + S_3 < C < S_1 + S_2 + S_3, and D_12 >> L_3 >> L_2 > D_23, prefetching configuration 3 before configuration 2 while 1 is executed results in a penalty of at most L_2. This is because the latency of configuration 3 can be completely hidden, and configuration 2 can be either demand fetched (a penalty of L_2) or prefetched once the execution of configuration 1 completes. However, if in-order prefetches are performed, the overall penalty is L_3 - D_23, which is much larger than L_2.

5. CONFIGURATION PREFETCHING

Most current reconfigurable computing architectures consist of an FPGA connected, either loosely or tightly, to a host microprocessor. In these systems the microprocessor acts as the controller for the computation, and performs the computations that are more efficiently handled outside the FPGA logic. The FPGA performs multiple, regular operations (contained in FPGA configurations) during the execution of the computation. In normal operation the processor executes until a call to the reconfigurable coprocessor is found. These calls (RFUOPs) contain the ID of the configuration required to compute the desired function. In this work we target systems containing Partial R+D FPGAs and seek efficient prefetching techniques for such systems. We have developed algorithms applying the different configuration prefetching techniques. Based on the available access information and the additional hardware required, our configuration prefetching algorithms can be divided into 3 categories: Static Configuration Prefetching, Dynamic Configuration Prefetching, and Hybrid Configuration Prefetching. Before presenting the details of the algorithms, we first discuss the experimental setup.

5.1 Experimental Setup

In order to evaluate the different configuration prefetching algorithms we must perform the following steps. First, some method must be developed to choose which portions of the software algorithms should be mapped to the reconfigurable coprocessor. In this work, we apply the approach presented in [Hauck98] (these mappings of portions of the source code will be referred to as RFUOPs). Second, a simulator of the reconfigurable system must be employed to measure the performance of the prefetching algorithms. Our simulator is developed from SHADE [Cmelik93]. It allows us to track the cycle-by-cycle operation of the system and get exact cycle counts. We will compare the performance of the prefetching algorithms as well as the performance assuming no prefetching occurs at all.

5.2 Static Configuration Prefetching

Similar to the prefetching approach used in [Hauck98], a program running on this reconfigurable system can insert prefetch operations into the code executed on the host processor. However, the system described in [Hauck98] contains a Single Context FPGA rather than a Partial R+D FPGA. The prefetch instructions for systems containing a Single Context FPGA are executed just like any other instructions, occupying a single slot in the processor's pipeline. The prefetch instruction specifies the ID of a specific configuration that should be loaded into the coprocessor.

If the desired configuration is already loaded, or is in the process of being loaded by some other prefetch instruction, the prefetch instruction becomes a NO-OP. Since a Single Context FPGA can only hold one configuration, at any given point only one configuration needs to be prefetched. This greatly simplifies prefetches in these systems.

Unlike the Single Context FPGA, a Partial R+D FPGA can hold more than one configuration, and at a given point multiple RFUOPs may need to be prefetched. Thus, the method to effectively specify the IDs of the RFUOPs to prefetch becomes an issue. One intuitive approach is to pack the IDs into one single instruction. However, since the number of IDs that needs to be specified can differ for each prefetch instruction, it is not possible to generate prefetch instructions of equal length. Another option is to use a sequence of prefetch instructions when multiple prefetching operations need to be performed. However, to make this an effective approach, a method that can terminate previously issued prefetches is required. This is because during execution certain previously issued but unfinished prefetches may become obsolete, and these unwanted prefetching operations will significantly harm performance. In Figure 3, for example, 1, 2, 3, and 4 are RFUOPs and P1, P2, P3, and P4 are prefetching instructions. It is obvious that when P1 is executed, configurations 3 and 4 will not be reached, and therefore the prefetches of 3 and 4 are wasted. This waste may be negligible for a general-purpose system, since the load latency of an instruction or a data block is very small. However, because of the large configuration latency in reconfigurable systems, it is likely that the prefetch of configuration 3 has not completed, or has not even started, when P1 is reached. As a consequence, if we use the same approach used in general-purpose systems, letting the prefetches of P3 and P4 complete before prefetching P1 and P2, the effectiveness of the prefetches of P1 and P2 will be severely damaged, since they cannot completely or even mostly hide the load latencies of configurations 1 and 2. Therefore, we must find a way to terminate previously issued prefetches if they become unwanted.

Figure 3. Example of prefetching operation control.

A simple approach we use in this work to solve this problem is to insert termination instructions when necessary. The format of a termination instruction is just like that of any other instruction, consuming a single slot in the processor's pipeline. Once a termination instruction is encountered, the processor terminates all previously issued prefetches so the new prefetches can start immediately. For the example in Figure 3, a termination instruction is inserted immediately before P1 to eliminate the unwanted prefetches of P4 and P3.

Now that we have described the way prefetches are handled, the remaining problem is to determine where the prefetch instructions should be placed, given the RFUOPs and the control flow graph of the application. Since the algorithm used in [Hauck98] demonstrates good-quality results for Single Context reconfigurable systems, we extend it for systems containing a Partial R+D FPGA. The algorithm we use to determine the prefetches contains 3 stages:

1) Penalty calculation. In this stage the algorithm computes the potential penalties for a set of prefetches at each instruction node of the control flow graph.

2) Prefetch scheduling and generation.
In this stage the algorithm determines the configurations that need to be prefetched at each instruction node based on the penalties calculated in stage 1. Prefetches are generated under the restriction of the size of the chip.

3) Prefetch reduction. In this stage the algorithm trims the redundant prefetches generated in the previous stage. In addition, the termination instructions are inserted.

To make our analysis of the control flow graph clearer, we will use circles to represent the instruction nodes and squares to represent the RFUOPs. Since a prefetch should be executed at the top node of a single-entry, single-exit path, but not at the other nodes contained in the path, only the top node is considered as a candidate where prefetch instructions can be inserted. Therefore, we simplify the control flow graph by packing the other nodes in the path. In the following sections we discuss the details of every stage.

5.2.1 Penalty Calculation

In [Hauck98] a bottom-up approach is applied to calculate the penalties using the probability and distance factors. Since the probability is dominant in deciding the penalties, and the average prefetching distance is mostly greater than the latencies of RFUOPs, it is adequate to use probability to represent penalty. Note that since the control flow graph can be complex, with nested loops, the bottom-up algorithm cannot be performed without loop detection and conversion. In this work we apply the loop detection and conversion algorithm presented in [Hauck98]. Given the simplified control flow graph and the branch probabilities, we use a bottom-up approach to calculate the potential probabilities for an instruction node to reach a set of RFUOPs. The basic steps of the bottom-up algorithm are outlined below:

1. For each instruction node, initialize the probability of each reachable configuration to 0, and set num_children_searched to 0.

2. Set the probability of the configuration nodes to 1. Place the configuration nodes into a queue.

3. While the queue is not empty, do:

3.1. Remove a node k from the queue. If it is not a configuration node, compute P_kj = Σ_i (P_ki × P_ij) over all children i of node k, for each reachable configuration j.

3.2. For each parent node of k, if it is not a configuration node, increase its num_children_searched by 1; if num_children_searched equals the total number of children of that node, insert the parent node into the queue.

Once this stage is complete, each instruction node contains the probabilities of the reachable RFUOPs. The following stages use these results to schedule the necessary prefetches.

5.2.2 Prefetch Scheduling and Generation

Prefetch scheduling is quite trivial once the probabilities are calculated. Based on the decreasing order of the probabilities, a sequence of prefetches can be generated for each instruction node. Since the aggregate size of the reachable RFUOPs for a certain instruction may exceed the capacity of the chip, the algorithm only generates prefetches under the size restriction of the chip. The rest of the reachable RFUOPs are ignored. Assuming in Figure 4 that the chip can hold at most 2 RFUOPs at a time, the generated prefetches at each instruction are shown on the right side of the figure.

Figure 4. An example of prefetch scheduling and generation. The control flow graph is shown on the left; the corresponding prefetches are shown on the right.

5.2.3 Prefetch Reduction

The prefetches generated at a child node are considered redundant if they match the beginning sub-sequence generated at its parents. Our algorithm checks for and eliminates these redundant prefetches. One may argue that the prefetches at a child node should be considered redundant if they are a sub-sequence, but not necessarily the beginning sub-sequence, of its parents, because the parents represent a superset of prefetches. However, by eliminating the prefetch instructions at the child node, the desired prefetches cannot start immediately, since the unwanted prefetches at the parents might not have completed. For example, in Figure 4, P2 at instruction I2 cannot be eliminated even though it is a sub-sequence of P1,2, because when I2 is reached P2 may not be able to start if P1 has not completed. In Figure 4, the prefetches at instructions I1, I4, and I6 can be eliminated. Once the prefetch reduction is complete, for each instruction node where prefetches need to be performed, a termination instruction is inserted followed by a sequence of prefetch instructions. Note that a termination instruction does not flush the entire FPGA, but merely clears the prefetch queue. RFUOPs that have already been loaded by the preceding prefetches are often retained.

5.3 Dynamic Configuration Prefetching

Among the various dynamic prefetching techniques available for general-purpose computing systems, Markov prefetching [Joseph97] is a unique approach: it does not tie itself to particular data structure accesses and is capable of prefetching both instructions and data.

5.3.1 Markov Prefetching

As the name implies, Markov prefetching utilizes a Markov model to determine which blocks should be brought in from the higher-level memory. A Markov process is a stochastic system for which the occurrence of a future state depends on the immediately preceding state, and only on it. A Markov process can be represented as a directed graph, with a probability associated with each transition. Each node represents a specific state, and a state transition is described by traversing an edge from the current node to a new node.
The graph is built and updated dynamically using the available access information. As an example, the access string A B C D C C C A B D E will result in the Markov model depicted in Figure 5. Using the Markov graph, multiple prefetches with different priorities can be issued.
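As an illustration only (not part of the original paper), the following C++ sketch builds such a Markov transition graph from an access string, deriving each edge probability from the observed transition counts.

    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <vector>

    int main() {
        // The example access string from the text above.
        std::vector<char> accesses = {'A','B','C','D','C','C','C','A','B','D','E'};

        // Count transitions between consecutive accesses (the edges of the Markov graph).
        std::map<char, std::map<char, int>> edge_count;
        std::map<char, int> out_count;
        for (std::size_t i = 0; i + 1 < accesses.size(); ++i) {
            ++edge_count[accesses[i]][accesses[i + 1]];
            ++out_count[accesses[i]];
        }

        // The probability on an edge is the fraction of transitions leaving its source node.
        for (const auto& src : edge_count)
            for (const auto& dst : src.second)
                std::cout << src.first << " -> " << dst.first << " : "
                          << static_cast<double>(dst.second) / out_count[src.first] << "\n";
        return 0;
    }

For the string above this prints, for example, A -> B with probability 1 and C -> C with probability 0.5, edge weights of the kind illustrated in Figure 5.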

Figure 5. The Markov model generated from the access string A B C D C C C A B D E. Each node represents a specific state and each edge represents a transition from one state to another. The number on an edge represents the probability of the transition occurring.

5.3.2 Dynamic Prefetching Algorithm

Markov prefetching can be extended to handle configuration prefetching for reconfigurable systems. Specifically, the RFUOPs can be represented as the vertices of the Markov graph, and the transitions can be built and updated using the RFUOP access sequence. However, Markov prefetching needs to be modified because of the differences between general-purpose systems and reconfigurable systems. As mentioned in previous sections, due to their large sizes only a few RFUOPs can be stored on chip at a given time. Therefore, it is very unlikely that all the transitions from the current RFUOP node can be covered by prefetches. This feature requires the system to make accurate predictions to guarantee that only the highly probable RFUOPs are prefetched.

In order to find good candidates to prefetch, Markov prefetching keeps updating the probability of each transition using the currently available access information. This probability represents the overall likelihood that the transition will occur during the course of the execution, and it may work well for a general-purpose system, in which a large number of instructions or data blocks with small loading latency can be stored in a cache. However, without emphasizing the recent history, Markov prefetching could make poor predictions during certain periods of the execution. Due to the features of reconfigurable systems mentioned above, we believe the probability calculated based on the recent history is more important than the overall probability. In this work, a weighted probability, in which recent accesses are given higher weight, is used as the metric for candidate selection and prefetching order determination. The weighted probability of each transition is continually updated as the RFUOP access sequence progresses. Specifically, the probabilities out of a node u are updated once a transition (u, v) is executed. For each transition (u, w) starting from u:

P_u,w = P_u,w / (1 + C), if w ≠ v;
P_u,v = (P_u,v + C) / (1 + C);

where C is a weight coefficient.

For general-purpose systems, the prefetching unit generally operates separately from the cache management unit. Specifically, the prefetching unit picks candidates and then sends prefetch requests into a prefetching buffer (usually a FIFO). Working separately, the cache management unit removes the requests one by one from the buffer and loads the instructions or data blocks into the cache. Since more accurate run-time access information is available for prefetching, integrating caching with prefetching allows the cache manager to choose better victims to replace. The dynamic prefetching algorithm can be described as follows. Upon the completion of each RFUOP k's execution, do:

1. Sort all transitions starting from k in the Markov graph in decreasing order of weighted probability.

2. Terminate all previously issued prefetches. Select k as the first candidate.

3. Select the rest of the candidates in sorted order under the size constraint of the chip. Issue prefetch requests for each candidate that is not currently on chip.

4. Update the weighted probability of each transition starting from j, where j is the RFUOP executed just before k.
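The following C++ fragment is a minimal sketch of these steps (our own illustration; the class and member names are assumptions, and issuing and terminating prefetch requests is reduced to returning a candidate list).

    #include <algorithm>
    #include <cstddef>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Illustrative sketch of the Markov-based dynamic configuration prefetcher.
    struct DynamicPrefetcher {
        // Weighted transition probabilities: prob[u][v] for transition (u, v).
        std::unordered_map<int, std::unordered_map<int, double>> prob;
        std::unordered_map<int, std::size_t> rfuop_size;  // size of each RFUOP
        std::size_t chip_capacity = 0;                    // capacity of the chip
        double weight = 1.0;                              // weight coefficient C (1 in this work)

        // Step 4: update the probabilities out of u after transition (u, v) is executed.
        void updateTransition(int u, int v) {
            if (prob[u].find(v) == prob[u].end()) prob[u][v] = 0.0;
            for (auto& entry : prob[u]) {
                if (entry.first == v) entry.second = (entry.second + weight) / (1.0 + weight);
                else                  entry.second = entry.second / (1.0 + weight);
            }
        }

        // Steps 1-3: pick the prefetch candidates after RFUOP k finishes.
        std::vector<int> selectCandidates(int k) {
            std::vector<std::pair<int, double>> trans(prob[k].begin(), prob[k].end());
            std::sort(trans.begin(), trans.end(),
                      [](const std::pair<int, double>& a, const std::pair<int, double>& b) {
                          return a.second > b.second;
                      });

            std::vector<int> candidates = {k};            // k itself is always the top candidate
            std::size_t used = rfuop_size[k];
            for (const auto& t : trans) {
                if (t.first == k) continue;               // self-loop transitions are ignored
                if (used + rfuop_size[t.first] > chip_capacity) break;  // chip size constraint
                used += rfuop_size[t.first];
                candidates.push_back(t.first);            // a prefetch request would be issued here
            }
            return candidates;
        }
    };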
Though the replacement is not presented in the algorithm, it is carried out indirectly. Specifically, any RFUOPs that are currently on chip are marked for eviction if they are not selected as candidates. One last thing to mention is that the RFUOP just executed is automatically treated as the top candidate, since generally each RFUOP is contained in a loop and is likely to be repeated multiple times. Correspondingly, the self-loop transitions in the Markov graph are ignored.

5.3.3 Hardware Requirements of Dynamic Prefetching

A data structure is required to maintain and continually update the Markov graph as execution progresses. A table, shown in Figure 6, can be used to represent the Markov graph. Each node (RFUOP) of the Markov graph occupies a single row of the table. The first column of each row is the ID of an RFUOP, and the rest of the columns are the RFUOPs it can reach. Since, under the chip size constraint, only the most probable RFUOPs out of each node are used by the prefetching algorithm, keeping all transitions out of a node would simply waste precious hardware resources. As can be seen in Figure 6, the number of transitions retained is limited to K.

RFUOP 1: Next 1, Next 2, ..., Next K
RFUOP 2: Next 1, Next 2, ..., Next K
...
RFUOP M: Next 1, Next 2, ..., Next K

Figure 6. A table is used to represent the Markov graph. The first column of each row is the ID of an RFUOP, and the rest of the columns are the K RFUOPs reachable from this row's RFUOP with the highest probabilities.

In addition, a small FIFO buffer is required to store the prefetch requests. The configuration management unit takes the requests from the buffer and loads the corresponding RFUOPs.

Note that the buffer will be flushed to terminate previous prefetches before the new prefetches are sent to the buffer. Furthermore, the configuration management unit can be interrupted to stop the current loading if an RFUOP not currently loaded is invoked. In order for the host processor to save execution time in updating the probabilities, the weight coefficient C is set to 1. This means that when a transition needs to be updated, the host processor simply right-shifts the registers retaining the probabilities by one bit; then the most significant bit of the register representing the currently occurring transition is set to 1. To balance the hardware cost and retain enough history, we use 8-bit registers in this work.

5.4 Hybrid Configuration Prefetching

Dynamic prefetching using recent history works well for the transitions that occur within a loop. However, this approach is not able to make accurate predictions for the transitions that jump out of a loop. For example, on the left side of Figure 7, we assume only one RFUOP can be stored on chip at any given point. By applying the dynamic prefetching approach, RFUOP 2 is always prefetched after RFUOP 1, assuming the inner loop is always taken several times. Thus, the reconfiguration penalty for RFUOP 3 can never be hidden, due to the wrong prediction. This misprediction can be avoided if the static prefetching approach is integrated with the dynamic approach. More specifically, before reaching RFUOP 3 a normal instruction node will likely be encountered, and the static prefetches determined at that instruction node can be used to correct the wrong predictions made by the dynamic prefetching. As illustrated on the right side of Figure 7, a normal instruction I1 will be encountered before RFUOP 3 is reached, and our static prefetching will correctly predict that 3 will be the next required RFUOP. As a consequence, the wrong prefetch of RFUOP 2 determined by our dynamic prefetching can be corrected at I1.

Figure 7. An example illustrating the ineffectiveness of dynamic prefetching.

The goal of combining dynamic configuration prefetching with static configuration prefetching is to take advantage of the recent access history without exaggerating it. Specifically, dynamic prefetching using the recent history will make accurate predictions within the loops, while static prefetching using the global history will make accurate predictions between the loops. The challenge of integrating dynamic prefetching with static prefetching is to coordinate the prefetches such that wrong prefetches are minimized. When the prefetches determined by the dynamic prefetching do not agree with those determined by the static prefetching, a decision must be made. The basic idea we use to determine the beneficial prefetches for our hybrid prefetching is to penalize the wrong prefetches. We add a per-RFUOP flag bit to indicate the correctness of the prefetch made by previous static prefetching. When the prefetches determined by the static prefetching conflict with those determined by the dynamic prefetching, the statically predicted prefetch of an RFUOP is issued only if the flag bit for that RFUOP is set to 1. The flag bit of an RFUOP is set to 0 once the static prefetch of the RFUOP is issued, and remains 0 until the RFUOP is executed. As a consequence, statically predicted prefetches, especially those made within loops, are ignored if they are not correctly predicted. On the other hand, correctly predicted static prefetches, especially those made between loops, are chosen to replace the wrong prefetches made by the dynamic prefetching.
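A minimal sketch of this flag-bit arbitration is given below (our own illustration; the names and the queue handling are assumptions, and the full set of steps is outlined right after it).

    #include <deque>
    #include <unordered_map>

    // Illustrative sketch of the per-RFUOP flag bit used by hybrid prefetching.
    struct HybridArbiter {
        std::unordered_map<int, bool> flag;   // flag[r] == true: static prefetches of r are trusted
        std::deque<int> prefetch_queue;       // pending prefetch candidates, highest priority first

        // A statically scheduled prefetch instruction for RFUOP r is encountered.
        void onStaticPrefetch(int r) {
            if (flag.find(r) == flag.end()) flag[r] = true;   // flag bits start at 1
            if (!flag[r]) return;                             // previously mispredicted: ignore it
            flag[r] = false;                                  // penalize until r actually executes
            prefetch_queue.push_front(r);                     // give this RFUOP the highest priority
        }

        // RFUOP r finishes executing; the dynamic predictions refill the queue.
        void onExecute(int r, const std::deque<int>& dynamic_candidates) {
            flag[r] = true;                                   // r was needed, so trust its static prefetches again
            prefetch_queue = dynamic_candidates;              // clear the queue, then add the dynamic IDs
        }
    };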
The basic steps of the hybrid prefetching are outlined as follows:

1. Perform the static configuration prefetching algorithm. Set the flag bit of each RFUOP to 1. An empty priority queue is created.

2. Upon the completion of an RFUOP's execution, perform the dynamic prefetching algorithm. Set the flag bit of the RFUOP to 1. Clear the priority queue first, then place the IDs of the dynamically predicted RFUOPs into the queue.

3. When a static prefetch of an RFUOP is encountered and the flag bit of the RFUOP is 1, terminate the current loading. Set the flag bit of the RFUOP to 0. Give the highest priority to this RFUOP and insert its ID into the priority queue. The RFUOPs with lower priorities are replaced or ignored to make room for the new RFUOP.

4. Load the RFUOPs from the priority queue.

6. RESULTS AND ANALYSIS

All algorithms are implemented in C++ on a Sun Sparc-20 workstation and are simulated with the SHADE simulator [Cmelik93]. We use the SPEC95 benchmark suite [Spec95] to test the performance of our prefetching algorithms. Note that these applications have not been optimized for reconfigurable systems, and may not be as accurate in predicting exact performance as real applications for reconfigurable systems would be. However, such real applications are not generally available for experimentation. In addition, the performance of the prefetching techniques will be compared against the previous caching techniques, which also use the SPEC95 benchmark suite.

As can be seen in Figure 8, 5 algorithms are compared: Least Recently Used (LRU) Caching, Off-line Caching, Static Prefetching, Dynamic Prefetching, and Hybrid Prefetching. The LRU algorithm chooses victims to be replaced based on run-time information, while the Off-line algorithm takes future access patterns into consideration to make more accurate decisions. Note that our static prefetching uses the Off-line algorithm to pick victims. Since cache replacement is integrated into the dynamic prefetching and the hybrid prefetching, no additional replacement algorithms are used for those two prefetching algorithms. Clearly, all prefetching techniques substantially outperform the caching-only techniques, especially when the cache size is small.

As the cache size grows, the chip is able to hold more RFUOPs and the cache misses are reduced. However, the prefetching distance does not change. As a consequence, the performance due to prefetching does not significantly improve as the cache size grows. Among the prefetching techniques, dynamic prefetching performs consistently better than static prefetching because dynamic prefetching can use the RFUOP access information. Hybrid prefetching performs slightly better than dynamic prefetching because of its ability to correct some wrong predictions made by dynamic prefetching. However, the advantage of hybrid prefetching becomes negligible as the cache size grows. Figure 9 demonstrates the effect of the different replacement algorithms used for the static prefetching. As can be seen, the off-line replacement algorithm performs slightly better than LRU since it has more complete information about the applications.

Figure 8. Reconfiguration overhead comparison: normalized reconfiguration penalty versus normalized FPGA size for LRU, Off-line, Static+Off-line, Dynamic, and Hybrid.

Figure 9. Effect of the replacement algorithms for the static prefetching: normalized reconfiguration penalty versus normalized FPGA size for Static+LRU and Static+Off-line.

7. CONCLUSIONS

Configuration prefetching, where configurations are preloaded on chip before they are required, is a technique to reduce the reconfiguration overhead. However, the limited on-chip memory and the large configuration latency add complexity in deciding which configurations to prefetch. In this work we developed efficient prefetching techniques for reconfigurable systems containing a Partial Reconfigurable FPGA with Relocation and Defragmentation (R+D). We have developed algorithms applying the different configuration prefetching techniques. Based on the available access information and the additional hardware required, our configuration prefetching algorithms can be divided into 3 categories: Static Configuration Prefetching, Dynamic Configuration Prefetching, and Hybrid Configuration Prefetching. Compared with the caching techniques presented in our previous work [Li00], our prefetching algorithms can further reduce the reconfiguration overhead by more than a factor of 2.

References

[Babb97] J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, A. Agarwal, The RAW Benchmark Suite: Computation Structures for General Purpose Computing, IEEE Symposium on FPGAs for Custom Computing Machines, 1997.

[Bolotski94] M. Bolotski, A. DeHon, T. F. Knight Jr., Unifying FPGAs and SIMD Arrays, 2nd International ACM/SIGDA Workshop on Field-Programmable Gate Arrays, 1994.

[Callahan91] D. Callahan, K. Kennedy, A. Porterfield, Software Prefetching, International Conference on Architectural Support for Programming Languages and Operating Systems, 1991.

[Cmelik93] R. F. Cmelik, Introduction to Shade, Sun Microsystems Laboratories, Inc., February 1993.

[Compton00] K. Compton, J. Cooley, S. Knol, S. Hauck, Abstract: Configuration Relocation and Defragmentation for FPGAs, IEEE Symposium on Field-Programmable Custom Computing Machines, 2000.

[Dandalis01] A. Dandalis, V. Prasanna, Configuration Compression for FPGA-based Embedded Systems, ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2001.

[Hauck97] S. Hauck, T. W. Fry, M. M. Hosler, J. P. Kao, The Chimaera Reconfigurable Functional Unit, IEEE Symposium on FPGAs for Custom Computing Machines, 1997.
[Hauck98] S. Hauck, Configuration Prefetch for Single Context Reconfigurable Coprocessors, ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 65-74, 1998.

[Hauser97] J. Hauser, J. Wawrzynek, Garp: A MIPS Processor with a Reconfigurable Coprocessor, IEEE Symposium on FPGAs for Custom Computing Machines, 1997.

[Joseph97] D. Joseph, D. Grunwald, Prefetching using Markov Predictors, Proceedings of the 24th International Symposium on Computer Architecture, 1997.

[Li99] Z. Li, S. Hauck, Don't Care Discovery for FPGA Configuration Compression, ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 1999.

[Li00] Z. Li, K. Compton, S. Hauck, Configuration Caching Management Techniques for Reconfigurable Computing, IEEE Symposium on FPGAs for Custom Computing Machines, 2000.

[Li01] Z. Li, S. Hauck, Configuration Compression for Virtex FPGAs, IEEE Symposium on FPGAs for Custom Computing Machines, 2001.

[Schmit97] H. Schmit, Incremental Reconfiguration for Pipelined Applications, IEEE Symposium on FPGAs for Custom Computing Machines, pp. 47-55, 1997.

[Spec95] SPEC CPU95 Benchmark Suite, Standard Performance Evaluation Corp., Manassas, VA, 1995.

[Wittig96] R. Wittig, P. Chow, OneChip: An FPGA Processor with Reconfigurable Logic, IEEE Symposium on FPGAs for Custom Computing Machines, 1996.


Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5) Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 More Cache Basics caches are split as instruction and data; L2 and L3 are unified The /L2 hierarchy can be inclusive,

More information

Parallel graph traversal for FPGA

Parallel graph traversal for FPGA LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,

More information

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT Hardware Assisted Recursive Packet Classification Module for IPv6 etworks Shivvasangari Subramani [shivva1@umbc.edu] Department of Computer Science and Electrical Engineering University of Maryland Baltimore

More information

PAGE REPLACEMENT. Operating Systems 2015 Spring by Euiseong Seo

PAGE REPLACEMENT. Operating Systems 2015 Spring by Euiseong Seo PAGE REPLACEMENT Operating Systems 2015 Spring by Euiseong Seo Today s Topics What if the physical memory becomes full? Page replacement algorithms How to manage memory among competing processes? Advanced

More information

CS 136: Advanced Architecture. Review of Caches

CS 136: Advanced Architecture. Review of Caches 1 / 30 CS 136: Advanced Architecture Review of Caches 2 / 30 Why Caches? Introduction Basic goal: Size of cheapest memory... At speed of most expensive Locality makes it work Temporal locality: If you

More information

Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin

Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Professor Alvin R. Lebeck Computer Science 220 Fall 1999 Exam Average 76 90-100 4 80-89 3 70-79 3 60-69 5 < 60 1 Admin

More information

SAMBA-BUS: A HIGH PERFORMANCE BUS ARCHITECTURE FOR SYSTEM-ON-CHIPS Λ. Ruibing Lu and Cheng-Kok Koh

SAMBA-BUS: A HIGH PERFORMANCE BUS ARCHITECTURE FOR SYSTEM-ON-CHIPS Λ. Ruibing Lu and Cheng-Kok Koh BUS: A HIGH PERFORMANCE BUS ARCHITECTURE FOR SYSTEM-ON-CHIPS Λ Ruibing Lu and Cheng-Kok Koh School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 797- flur,chengkokg@ecn.purdue.edu

More information

CS162 Operating Systems and Systems Programming Lecture 14. Caching (Finished), Demand Paging

CS162 Operating Systems and Systems Programming Lecture 14. Caching (Finished), Demand Paging CS162 Operating Systems and Systems Programming Lecture 14 Caching (Finished), Demand Paging October 11 th, 2017 Neeraja J. Yadwadkar http://cs162.eecs.berkeley.edu Recall: Caching Concept Cache: a repository

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Virtual Memory Outline

Virtual Memory Outline Virtual Memory Outline Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations Operating-System Examples

More information

MEMORY MANAGEMENT/1 CS 409, FALL 2013

MEMORY MANAGEMENT/1 CS 409, FALL 2013 MEMORY MANAGEMENT Requirements: Relocation (to different memory areas) Protection (run time, usually implemented together with relocation) Sharing (and also protection) Logical organization Physical organization

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 23 Virtual memory Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ Is a page replaces when

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Operating Systems. Operating Systems Sina Meraji U of T

Operating Systems. Operating Systems Sina Meraji U of T Operating Systems Operating Systems Sina Meraji U of T Recap Last time we looked at memory management techniques Fixed partitioning Dynamic partitioning Paging Example Address Translation Suppose addresses

More information

Improving Cache Performance. Reducing Misses. How To Reduce Misses? 3Cs Absolute Miss Rate. 1. Reduce the miss rate, Classifying Misses: 3 Cs

Improving Cache Performance. Reducing Misses. How To Reduce Misses? 3Cs Absolute Miss Rate. 1. Reduce the miss rate, Classifying Misses: 3 Cs Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the. Reducing Misses Classifying Misses: 3 Cs! Compulsory The first access to a block is

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

Chapter-5 Memory Hierarchy Design

Chapter-5 Memory Hierarchy Design Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or

More information

Memory Hierarchies 2009 DAT105

Memory Hierarchies 2009 DAT105 Memory Hierarchies Cache performance issues (5.1) Virtual memory (C.4) Cache performance improvement techniques (5.2) Hit-time improvement techniques Miss-rate improvement techniques Miss-penalty improvement

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined

More information

Quick Reconfiguration in Clustered Micro-Sequencer

Quick Reconfiguration in Clustered Micro-Sequencer Quick Reconfiguration in Clustered Micro-Sequencer Roozbeh Jafari UCLA 3256N Boelter Hall Los Angeles, CA 90095 rjafari@cs.ucla.edu Seda Ogrenci Memik Northwestern University 2145 Sheridan Road Evanston,

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information

Slide Set 9. for ENCM 369 Winter 2018 Section 01. Steve Norman, PhD, PEng

Slide Set 9. for ENCM 369 Winter 2018 Section 01. Steve Norman, PhD, PEng Slide Set 9 for ENCM 369 Winter 2018 Section 01 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018 ENCM 369 Winter 2018 Section 01

More information

Reducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses

Reducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the. Reducing Miss Penalty: Read Priority over Write on Miss Write buffers may offer RAW

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

COSC 6385 Computer Architecture. - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement"

CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement" October 3, 2012 Ion Stoica http://inst.eecs.berkeley.edu/~cs162 Lecture 9 Followup: Inverted Page Table" With

More information

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup Yan Sun and Min Sik Kim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

An Instruction Stream Compression Technique 1

An Instruction Stream Compression Technique 1 An Instruction Stream Compression Technique 1 Peter L. Bird Trevor N. Mudge EECS Department University of Michigan {pbird,tnm}@eecs.umich.edu Abstract The performance of instruction memory is a critical

More information

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

ECE468 Computer Organization and Architecture. Virtual Memory

ECE468 Computer Organization and Architecture. Virtual Memory ECE468 Computer Organization and Architecture Virtual Memory ECE468 vm.1 Review: The Principle of Locality Probability of reference 0 Address Space 2 The Principle of Locality: Program access a relatively

More information

A C Compiler for a Processor with a Reconfigurable Functional Unit Abstract

A C Compiler for a Processor with a Reconfigurable Functional Unit Abstract A C Compiler for a Processor with a Reconfigurable Functional Unit Alex Ye, Nagaraj Shenoy, Scott Hauck, Prithviraj Banerjee Department of Electrical and Computer Engineering, Northwestern University Evanston,

More information

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging 6.823, L8--1 Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Highly-Associative

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

1/19/2009. Data Locality. Exploiting Locality: Caches

1/19/2009. Data Locality. Exploiting Locality: Caches Spring 2009 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Data Locality Temporal: if data item needed now, it is likely to be needed again in near future Spatial: if data item needed now, nearby

More information

Cache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time

Cache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time Cache Performance!! Memory system and processor performance:! CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st = Pipeline time +

More information

ECE4680 Computer Organization and Architecture. Virtual Memory

ECE4680 Computer Organization and Architecture. Virtual Memory ECE468 Computer Organization and Architecture Virtual Memory If I can see it and I can touch it, it s real. If I can t see it but I can touch it, it s invisible. If I can see it but I can t touch it, it

More information