A Study of Data Prefetching using Multi2Sim
Vaivaswatha N
sercvaivaswatha@ssl.serc.iisc.in

1. KEYWORDS
Data prefetching, computer architecture, Multi2Sim, global history buffer.

2. ABSTRACT
Data prefetching[6] is a cache optimization technique that tries to minimize access time by predicting future data accesses and initiating a fetch for the data, so that it is available in the cache when required. Multi2Sim[11] is an architectural simulator for heterogeneous architectures. It includes models for superscalar, multithreaded, and multicore CPUs, as well as GPU architectures. This course project implements a prefetching model in the Multi2Sim simulator and studies the changes in program performance due to prefetching. Primarily, two global history buffer based hardware prefetchers[8] and compiler aided prefetching are studied, using the MediaBench[7] and PolyBench[9] benchmark suites to measure the performance impact. An important goal of this project has been to contribute to the open-source community. In the course of this study, I have contributed the following to the Multi2Sim simulator: support for the x86 prefetch instruction[4], memory hierarchy support for prefetching, and two global history buffer based prefetchers.

3. INTRODUCTION
Data prefetching is an optimization in which block fetches are initiated by speculating on the data that will be required, so that the block is available in the cache when it is actually needed. Related to data prefetching is instruction prefetching, where instruction cache blocks are prefetched. This project is entirely in the context of data prefetching, which is henceforth referred to simply as prefetching. Prefetching may be targeted at any of the cache levels, and different approaches are effective at different levels of the cache.
For example, the prefetching technique described in [1] is suitable for L1 caches whereas the technique described in [8] is suitable for L2 caches.

for i = 1, n do
    a[i] = b[i] + c[i]
end for

Figure 1: Simple loop

3.1 The basic idea of prefetching
Consider the simple loop shown in figure 1. Assume that the L1 cache block size is 16 bytes and all three arrays are of the same data type, of size 4 bytes. This means that the loop requires two new blocks to be fetched every 4 iterations, just for reading b[i] and c[i]. If we could recognize this pattern and fetch the blocks for iteration i+4 while iteration i is running, then, when iteration i+4 begins, the blocks would already be in the cache. All data prefetching schemes have the following common goals[1]:

- Generate prefetches well in advance for blocks that may be needed in the future, so that a block is available by the time it is actually accessed.
- Avoid unnecessary prefetches (i.e., the prediction accuracy must be high, so that bandwidth is not wasted on blocks that will never be used).
- Do not prefetch too early, since prefetched blocks still take up cache space.
- Incur no penalty for prefetching blocks that are already in the cache.
- Do not increase the processor cycle time, i.e., do not interfere with the critical path timing.

3.2 Classification of prefetching techniques
Prefetching techniques may be broadly classified into software (or compiler) aided prefetching and hardware prefetching. Compiler aided prefetching involves compiler analysis passes that try to determine the best positions in the code at which to prefetch data, and what data should be prefetched at each point. The compiler inserts prefetch hints (instructions) in the program at these points. Whenever the processor sees a prefetch hint, it initiates a prefetch for the block specified.
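For the loop of figure 1, such hints can also be inserted by hand using GCC's __builtin_prefetch intrinsic. The sketch below is illustrative, not part of the project; the prefetch distance of 4 follows from the 16-byte block and 4-byte element assumption above.

```c
#include <assert.h>
#include <stddef.h>

/* Prefetch distance: a 16-byte block holds four 4-byte
 * elements, so fetch the block needed 4 iterations ahead. */
enum { PF_DIST = 4 };

void add_arrays(int *a, const int *b, const int *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n) {
            /* rw = 0 (read), locality = 1 (low temporal reuse) */
            __builtin_prefetch(&b[i + PF_DIST], 0, 1);
            __builtin_prefetch(&c[i + PF_DIST], 0, 1);
        }
        a[i] = b[i] + c[i];
    }
}
```

The hints only affect timing, never correctness, so the loop computes the same result with or without them.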
Hardware prefetching is a completely hardware based technique that does not depend on compiler inserted prefetch hints. The hardware has circuits that detect patterns in data accesses, based on which it initiates prefetch requests to the next lower level of the memory hierarchy. A good comparison of software prefetching and hardware prefetching can be found in [3].

3.3 Multi2Sim
Multi2Sim[11] is an architectural simulator for heterogeneous architectures that includes models for superscalar, multithreaded, and multicore CPUs, as well as GPU architectures. It is an application-only simulator (i.e., not a full-system simulator). The Multi2Sim documentation can be referred to for more details on the architecture and features of the simulator; only the relevant details are discussed in this document. The part of the simulator of primary interest to this project is the memory subsystem, which is based on an event simulation model. Every request to a memory module (cache or main memory) is scheduled as an event, and the event simulator is responsible for executing the event (through a call-back) at the right time. Central to the memory subsystem is the NMOESI cache coherence protocol, an extension of the MOESI protocol that adds non-coherent (the N in NMOESI) accesses. This protocol is implemented as a distributed directory based protocol. Multi2Sim also provides a configurable interconnect system to organize the different memory components in different ways.

4. BACKGROUND
This section discusses a few hardware prefetching mechanisms, including the technique that was implemented in this project. Early approaches to prefetching used simple heuristics. Smith[10] studied variations of block lookahead schemes where, whenever block i was referenced, block i+1 would be prefetched.
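The block lookahead rule fits in a few lines; the sketch below computes the address to prefetch when a reference touches block i (the 64-byte block size is an assumption for illustration).

```c
#include <assert.h>

/* One-block-lookahead: on a reference that maps to block i,
 * prefetch block i+1. A 64-byte block size is assumed here. */
enum { BLOCK_SHIFT = 6 };

unsigned long next_block_addr(unsigned long addr)
{
    unsigned long block = addr >> BLOCK_SHIFT; /* block index */
    return (block + 1) << BLOCK_SHIFT;         /* next block's base */
}
```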
Fu and Patel[5] studied a scheme that maintained a history of previous addresses and generated prefetches based on a constant stride. This scheme did not have a control mechanism to reduce unnecessary prefetches on irregular accesses[1].

4.1 Reference prediction based approach
Chen and Baer[1] study three variations of an approach based on a reference prediction table. This scheme is most suitable for processors with a small first-level cache and a small block size. The main idea of reference prediction is to predict the future references an instruction might make based on the past history of that instruction's accesses. The paper discusses three variations, each with an increasing degree of complexity and effectiveness. The first (simplest) approach is based on a four-state machine. The state of the machine determines whether the accesses made by an instruction follow a regular pattern (constant stride) and hence whether prefetching is useful. The state transitions are aided by keeping track of the previous address, the stride, and the state for each instruction in a reference prediction table (RPT). The state machine works much like a two-bit branch predictor, so I will not explain it in detail. The prediction mechanism is triggered when the processor decodes a load (or store) instruction; based on the instruction's entry in the RPT, a prefetch may be initiated. This basic scheme has a potential weakness in the timing of the prefetch. If the loop body is too small, the prefetched data may arrive late, reducing the benefit of prefetching. Similarly, too large a loop body might trigger the prefetch too early. This can be fixed by looking ahead to determine the best time to initiate a prefetch, so that the data arrives just in time to be used. This is achieved by using a look-ahead program counter (LA-PC).
The LA-PC stays ahead of the PC by the time δ that a block requires to arrive at the cache after a prefetch request. To keep the LA-PC as accurate as possible, it is linked with the branch prediction table. In this scheme, it is the LA-PC, rather than the PC, that is responsible for initiating prefetches. In the first two approaches, reference prediction was based on regularity between adjacent data accesses. For more general patterns, however, such as small inner loops or a triangle-shaped loop, there would be frequent redundant prefetches. This is avoided by using correlated reference prediction, which tracks not only adjacent accesses in inner loops but also accesses that are correlated across changes in loop level. This method uses a shift register to record the outcomes of recent branches, and an extended RPT with extra fields for computing the strides of the various correlated accesses. Keeping track of branch outcomes is necessary since a not-taken branch triggers the correlation to the next loop level up.

4.2 Global history buffer based prefetching
Nesbit and Smith[8] proposed a new structure called the global history buffer (GHB) that provides a new way to organize the access history. The scheme not only improves the accuracy of correlation prefetching by minimizing stale entries, but also holds a more complete picture of the cache miss history, which can be used to design more effective prefetching techniques. One main difference between the GHB approach and the approach in [1] is that [1] stores memory reference history, whereas this method stores a history of cache misses. Prefetching methods prior to this used a table indexed by some key, with the corresponding row containing the history information. The GHB based approach decouples indexing into the table from the storage of prefetch-related history information. More precisely, the following two tables were proposed (Figure 2).
An Index Table (IT) that is accessed with a key (typically the load (or store) instruction's PC or a cache miss address), with each entry pointing to an entry in the GHB.
for i = 1, n do
    prefetch(arr[i+1])
    use/define arr[i]
end for

Figure 3: Badly inserted prefetches

Figure 2: GHB prefetch structure

The Global History Buffer (GHB) is an n-entry FIFO table that holds the n most recent cache misses. Each entry also holds a link pointer used to form linked lists (called address lists) of elements in the GHB; the elements of any one list all correspond to the same IT key. Every cache miss is entered into the GHB in FIFO order. At insertion time, the new GHB entry is made to point to the existing pointee of the corresponding IT entry, and the IT entry is then made to point to the new GHB entry (in other words, the linked list is kept consistent). Depending on the key used to index the IT and on whether stride prefetching or correlation prefetching is performed, many possible prefetching schemes may be implemented using the GHB structure. This project implements two GHB based prefetching schemes (the details of which are discussed in section 5.3):

1. PC/CS: The PC of the load (or store) instruction is used to index into the IT, and Constant Stride prefetching is performed.
2. PC/DC: The PC of the load (or store) instruction is used to index into the IT, and Delta Correlation prefetching is performed. This method is referred to as Local Delta Correlation in [8].

5. IMPLEMENTATION
This section discusses the implementation work that was carried out as part of this project. The implementation was done in the Multi2Sim simulator and was committed into the main source tree (i.e., it is available on the web as part of the main Multi2Sim simulator). The implementation is separated into three parts, each of which is explained in detail in the following subsections.

5.1 Prefetch hint support
Prior to this work, Multi2Sim did not support any kind of prefetch hints, so the first phase of work involved supporting the x86 prefetch instruction[4].
This was essential for comparing hardware prefetching with software prefetching. Supporting the prefetch instruction involved expanding it into the appropriate micro instructions (an effective address calculation followed by a memory operation). This expansion was done during the emulation stage; in terms of the architecture of the simulator, this is equivalent to saying that the expansion of the ISA instruction into micro instructions is done during the fetch stage of the pipeline. The implementation also involved creating a separate queue for prefetch requests in the issue unit. Prefetch requests were scheduled to memory by the issue unit in the same way as loads and stores. However, the implementation was kept separate so as to enable future work in this direction (for example, prefetch requests may be issued at a lower priority). An important part of implementing prefetch hint support was avoiding redundant prefetch requests to memory. Even though the memory unit discards a prefetch request that hits in the cache (more on this in section 5.2), sending requests to memory itself costs bandwidth. Also, just to detect the hit, the cache unit must acquire a port and a lock on the directory, both of which may turn out to be expensive. To illustrate this, consider the case of a badly inserted prefetch shown in figure 3. Assuming 10 elements of the array fit in one cache block, there will be 9 redundant prefetches for every necessary prefetch. Ideally, the compiler should either have unrolled the loop so that one iteration consumes a whole block of data, or inserted the prefetch conditionally (i.e., inside an "if"). Such simple bad scenarios can easily be handled by keeping a small prefetch request history in the issue unit, which the issue unit queries before issuing prefetch requests to memory. A simple FIFO table is sufficient to handle many cases.
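A minimal sketch of such a FIFO history follows; the size, the sentinel value, and the names are illustrative assumptions, not the simulator's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Small FIFO of recently issued prefetch block addresses;
 * the issue unit consults it before sending a request to
 * memory. HIST_SIZE and the sentinel are illustrative. */
enum { HIST_SIZE = 8 };
#define NO_BLOCK (~0UL)

struct pf_history {
    unsigned long blocks[HIST_SIZE];
    int head; /* next slot to overwrite (FIFO order) */
};

void pf_history_init(struct pf_history *h)
{
    for (int i = 0; i < HIST_SIZE; i++)
        h->blocks[i] = NO_BLOCK;
    h->head = 0;
}

/* True if the prefetch is not a recent duplicate; records it. */
bool pf_should_issue(struct pf_history *h, unsigned long block)
{
    for (int i = 0; i < HIST_SIZE; i++)
        if (h->blocks[i] == block)
            return false;                 /* redundant: drop it */
    h->blocks[h->head] = block;           /* evict the oldest entry */
    h->head = (h->head + 1) % HIST_SIZE;
    return true;
}
```

For the loop of figure 3, the 9 redundant prefetches to the same block would all hit in this table and never reach the memory system.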
Such a table was implemented as part of implementing support for the x86 prefetch instruction.

5.2 Memory system support for prefetching
Prefetching support in the memory system mainly involved incorporating the concept of prefetching into the NMOESI cache coherence protocol. For the most part, this was implemented similarly to how load requests are handled by the protocol. However, the implementation was fully separated from the load request implementation, primarily because of the following differences. Whenever two load (or store) requests to the same block arrive within a short duration of each other, such that the first request has not yet begun executing, the controller will try to coalesce them into one request. A prefetch request, in contrast, just
needs to be dropped: prefetches are not coalesced with other prefetches or with load (or store) requests. Whenever there is contention between a normal load (or store) request and a prefetch request, priority needs to be given to the load (or store) request (or maybe not). As a follow-up on this point, one heuristic currently implemented is to not retry a prefetch request when it fails to acquire a port or a directory lock; this assumes that there is already contention, which we do not want to increase further by handling prefetch requests. Keeping the implementation separate makes future modifications along these lines easy.

5.3 GHB based hardware prefetching
As mentioned earlier, two GHB based hardware prefetching schemes were implemented as part of this project. The implementation of the Index Table (IT) and the Global History Buffer (GHB) is common to both schemes; only the actual prefetching logic varies. Both tables were implemented as simple arrays of structures whose sizes are configurable through the memory configuration file. One point that deserves mention is that the linked list (address list) management of the GHB entries was done differently from the method in the paper. Since the description of a GHB entry specifies only a single pointer for maintaining the linked list, deletion of elements is not trivial. The paper proposes keeping extra bits in each pointer (not used to find the pointee), based on which an invalid pointer (i.e., a pointer to an element no longer in the GHB) can be detected whenever the difference between the pointer value (including the extra bits) and the current head of the FIFO queue exceeds the size of the GHB. However, this is not perfect, as the head pointer can wrap around and still cause an incorrect match. To avoid this problem altogether, my implementation uses a doubly linked list.
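The GHB bookkeeping and the two prediction schemes of sections 4.2 and 5.3 can be sketched as follows. This is an illustrative reconstruction in C, not the simulator's code: the table sizes, field names, and helper functions are my own, and only the doubly linked unlinking mirrors the design decision described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative GHB with doubly linked address lists. */
enum { GHB_SIZE = 256, IT_SIZE = 64 };

struct ghb_entry {
    unsigned long addr; /* cache miss address */
    int key;            /* IT slot this entry belongs to */
    int prev, next;     /* prev = newer, next = older miss; -1 = none */
};

struct ghb {
    struct ghb_entry e[GHB_SIZE];
    int it[IT_SIZE];    /* index table: key -> newest GHB entry */
    int head;           /* FIFO slot the next miss overwrites */
};

void ghb_init(struct ghb *g)
{
    for (int i = 0; i < GHB_SIZE; i++)
        g->e[i] = (struct ghb_entry){0, -1, -1, -1};
    for (int i = 0; i < IT_SIZE; i++)
        g->it[i] = -1;
    g->head = 0;
}

/* Record a miss: unlink the FIFO victim from its address list
 * (trivial with a doubly linked list), then link the new entry
 * at the front of the list for this PC's key. */
void ghb_insert(struct ghb *g, unsigned long pc, unsigned long addr)
{
    int key = (int)(pc % IT_SIZE);
    int slot = g->head;
    struct ghb_entry *v = &g->e[slot];
    if (v->key >= 0) {
        if (v->prev >= 0)
            g->e[v->prev].next = -1; /* truncate the old list */
        else
            g->it[v->key] = -1;      /* victim was a list head */
    }
    v->addr = addr;
    v->key = key;
    v->prev = -1;
    v->next = g->it[key];
    if (v->next >= 0)
        g->e[v->next].prev = slot;
    g->it[key] = slot;
    g->head = (g->head + 1) % GHB_SIZE;
}

/* Collect up to max strides (deltas) for this PC, newest first. */
int ghb_deltas(const struct ghb *g, unsigned long pc, long *d, int max)
{
    int n = 0;
    for (int i = g->it[(int)(pc % IT_SIZE)];
         i >= 0 && g->e[i].next >= 0 && n < max;
         i = g->e[i].next)
        d[n++] = (long)(g->e[i].addr - g->e[g->e[i].next].addr);
    return n;
}

/* PC/CS: prefetch only if the last `depth` deltas are all equal. */
bool pccs_predict(const struct ghb *g, unsigned long pc,
                  int depth, unsigned long *out)
{
    long d[16];
    int n = ghb_deltas(g, pc, d, 16);
    if (n < depth)
        return false;
    for (int i = 1; i < depth; i++)
        if (d[i] != d[0])
            return false;
    *out = g->e[g->it[(int)(pc % IT_SIZE)]].addr + (unsigned long)d[0];
    return true;
}

/* PC/DC: match the most recent delta pair against older history;
 * the delta that followed the previous occurrence gives the
 * predicted next address. */
bool pcdc_predict(const struct ghb *g, unsigned long pc, unsigned long *out)
{
    long d[16];
    int n = ghb_deltas(g, pc, d, 16);
    if (n < 3)
        return false;
    for (int i = 1; i + 1 < n; i++) {
        if (d[i] == d[0] && d[i + 1] == d[1]) {
            *out = g->e[g->it[(int)(pc % IT_SIZE)]].addr
                   + (unsigned long)d[i - 1];
            return true;
        }
    }
    return false;
}
```

Because each victim unlinks itself from its list on overwrite, a list can never contain a pointer to a recycled slot, which is exactly the wrap-around hazard the doubly linked design avoids.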
The implementation of GHB and IT also involved adding a mechanism to convey the PC of the load (or store) instruction from the processor to the memory system. In the PC/CS implementation 2, the PC is used to index the IT. The lookup depth to look in the GHB can be specified in the configuration file. Based on the lookup depth d, the prefetcher looks at the past d cache misses for that PC and initiates a prefetch if all of them have a common (constant) stride, using this stride to predict the next address. To illustrate the necessity of the PC/DC scheme 2, consider the following sequence of memory accesses[8]:, 1, 2, 64, 65, 66, 128, 129. These accesses have a delta (stride) of 1, 1, 62, 1, 1, 62, 1. This pattern is representative of a load that accesses the first three words of each column in a 2-D array. The PC/CS scheme (with a lookup depth of 2) would conclude a constant stride by seeing the two 1s and hence 2 See section 4.2 wrongly prefetch along that stride. This problem can be overcome by the PC/DC scheme. In this scheme, delta pairs are used to find correlation in the accesses. In other words, the last two deltas are taken and are matched with the history to find the most recent previous occurrence of this delta pair. Whatever delta followed this previous occurrence will be considered for determining the prefetch address. Though the paper doesn t mention anything beyond two deltas (i.e a pair) my implementation can take this number as a parameter and can try to match beyond two deltas. 6. EXPERIMENTATION This section describes the simulation runs that were carried out in order to evaluate different prefetching schemes. The results from the various runs are also tabulated here. All programs were built with the -O4 flag to GCC. 6.1 Choice of benchmarks MediaBench[7] is a set of video and image codecs. It includes popular codecs such as mpeg, jpeg, adpcm, and epic. 
The binaries for MediaBench, along with their inputs and outputs, were already available on the Multi2Sim website for easy download. For this project, adpcm-dec, adpcm-enc, epic-dec and epic-enc were used, mainly because the simulation times of these programs were affordable. PolyBench[9] is a benchmark suite designed to benchmark compiler loop optimization techniques. Its programs contain a variety of loop patterns, which makes the suite a good candidate for evaluating prefetching. The programs in PolyBench are broadly separated into data mining, linear algebra, medley, and stencil computations. The following PolyBench programs were used in this project: floyd-warshall, (triangular matrix multiply), (dynamic programming 2D), gramschmidt, (multiresolution analysis), (matrix transpose and vector multiplication). Each program can be run on different input sizes (mini, small, standard, large, extra large); the small input size was used in this project, again mainly due to time constraints. An attempt was made to run the SPEC2006 benchmarks, but the simulation time was extremely high even for the test input set. I tried to use SimPoints (along with the PinPoints tool), but the downloaded sources failed to build, even after manual attempts to correct them. Both the PolyBench and MediaBench programs were run to completion, and the number of CPU cycles they ran for is used as the measure of performance.

6.2 Base configuration
The cache configuration used to obtain the primary results is shown in table 1. Figures 4 and 5 show the impact of the various prefetching schemes on PolyBench and MediaBench respectively. As noted before, performance is measured as the total cycle count for executing the entire program. In the MediaBench suite, epic-dec shows the highest improvement, of 13.9%. In the PolyBench suite, the highest improvement is 25.1%.
Both of these highest improvements are with the delta correlation prefetching scheme (PC/DC with a lookup depth of 2).

Table 1: Base configuration

    Parameter                     L1 value   L2 value
    Sets                          -          -
    Assoc                         4          8
    BlockSize                     -          -
    Latency                       4          2
    Policy                        LRU        LRU
    Ports                         4          4
    H/W Prefetching               NO         YES
    Prefetcher index table size   -          64
    Prefetcher GHB size           -          256

Figure 4: PolyBench performance for the base configuration (no_prefetching, compiler_prefetching, pc_cs_2, pc_dc_2)

Figure 5: MediaBench performance for the base configuration (no_prefetching, pc_cs_2, pc_dc_2)

For a few benchmarks, the performance improvement is low and hence cannot be clearly seen in the graphs. As will be shown later, a lookup depth of 3 for constant stride prefetching performs better than the lookup depth of 2 used in the base configuration. (In the graphs, the final number suffixed to a hardware prefetching scheme's name is the lookup depth.) For compiler prefetching, the programs were built with the -fprefetch-loop-arrays -mtune=core2 flags passed to GCC. Note that for MediaBench, compiler prefetching performance is not measured, since the binaries were downloaded directly from the Multi2Sim website and were not compiled with prefetching enabled.

6.3 Associativity
Here we examine how associativity affects prefetching performance. This is an important dimension in which to study prefetching, since prefetching might cause other blocks that are currently in the working set to be evicted due to a way conflict. Figure 6 shows a comparison of prefetching with the L2 associativity reduced from 8 ways to 4 ways (table 1). Note that the cycle count for compiler prefetching is zero for the MediaBench programs, as discussed previously. The first four bars for each benchmark represent the cycles for a 4-way L2 and the second four bars are for an 8-way L2. The impact of associativity is significant for the gramschmidt benchmark, while for the other benchmarks prefetching behavior remained almost the same.

Figure 6: Effect of associativity on each scheme (compiler_4way, pc_cs_2_4way, pc_dc_2_4way, compiler_8way, pc_cs_2_8way, pc_dc_2_8way)

6.4 L1 vs L2 prefetching
As mentioned earlier, hardware prefetching can be performed both at the L1 cache level and at the L2 cache level. Prefetching methods designed to work on a history of cache misses work better on L2 caches, while methods that work on the reference address stream work better on the L1 cache[1]. The results of the experiment agree with this idea. Figure 7 shows a comparison of prefetching on the L1 and L2 caches.
Figure 7: L1 vs L2 prefetching (pc_cs_2_l1, pc_cs_2_l2, pc_dc_2_l1, pc_dc_2_l2)

Figure 8: Lookup depth comparison (pc_cs_2, pc_cs_3, pc_cs_4)

6.5 Lookup depth
This subsection evaluates the PC/CS (Program Counter / Constant Stride) prefetching scheme for lookup depths of 2, 3, and 4 (all other parameters remain as in table 1). No lookup depth comparison is done for delta correlation prefetching, as implementing a lookup beyond 2 in hardware seemed unrealistic (although the simulator implementation does support higher lookup depths for DC). Figure 8 shows the comparison for constant stride prefetching. Except for floyd-warshall, a lookup depth of 3 appears to be the best for constant stride prefetching. The lookup depth influences prefetch accuracy and impact in two ways. (1) As the lookup depth increases, prefetch accuracy improves: since the history examined is longer (by definition), the prefetcher is less likely to issue prefetches along a wrong stride. (2) However, as the lookup depth increases, the prefetcher takes longer to start initiating prefetches (even on constant stride accesses). This can be significant if the innermost loop runs for a short duration.

7. CONCLUSION
This project implemented and compared a few prefetching schemes, studying their behavior under different cache configurations and other parameters. One important aspect of cache configuration that was missed in the discussion above is the number of ports: a minimum of 4 ports on the prefetching cache was required for prefetching to show improvements, and any smaller number resulted in quite a few degradations. The detailed graphs are not shown due to lack of space. A natural extension of this project would be to study prefetching behavior in a multi-core scenario[2].
Such a study was not carried out as part of this project, since it involves significantly more implementation work to account for multiple cores accessing the same shared cache. It also requires implementing support for other x86 prefetch hints, such as prefetchnta[4], which makes more sense in a multicore scenario. Finally, I would like to thank Prof. Matthew Jacob for his guidance, and the Multi2Sim developer team for their comments on the prefetcher implementation.

8. REFERENCES
[1] Jean-Loup Baer and Tien-Fu Chen. Effective hardware-based data prefetching for high-performance processors. IEEE Trans. Comput., 44(5):609-623, May 1995.
[2] Surendra Byna, Yong Chen, and Xian-He Sun. Taxonomy of data prefetching for multicore processors. Journal of Computer Science and Technology, 24(3):405-417, 2009.
[3] T.-F. Chen and J.-L. Baer. A performance study of software and hardware data prefetching schemes. SIGARCH Comput. Archit. News, 22(2):223-232, April 1994.
[4] Intel Corporation. Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference.
[5] John W. C. Fu, Janak H. Patel, and Bob L. Janssens. Stride directed prefetching in scalar processors. SIGMICRO Newsl., 23(1-2):102-110, December 1992.
[6] John L. Hennessy and David A. Patterson. Computer Architecture, Fourth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006.
[7] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. In Proc. of the 30th Int'l Symposium on Microarchitecture, Dec. 1997.
[8] K. J. Nesbit and J. E. Smith. Data cache prefetching using a global history buffer. In Proc. of the 10th International Symposium on High Performance Computer Architecture (HPCA-10), pages 96-105, Feb. 2004.
[9] Louis-Noël Pouchet. PolyBench: the Polyhedral Benchmark Suite.
[10] Alan Jay Smith. Cache memories. ACM Comput. Surv., 14(3):473-530, September 1982.
[11] Rafael Ubal, Byunghyun Jang, Perhaad Mistry, Dana Schaa, and David Kaeli. Multi2Sim: A Simulation Framework for CPU-GPU Computing. In Proc. of the 21st International Conference on Parallel Architectures and Compilation Techniques, Sept. 2012.
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationThreshold-Based Markov Prefetchers
Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this
More informationComputer Architecture Memory hierarchies and caches
Computer Architecture Memory hierarchies and caches S Coudert and R Pacalet January 23, 2019 Outline Introduction Localities principles Direct-mapped caches Increasing block size Set-associative caches
More informationLECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY
LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal
More informationSF-LRU Cache Replacement Algorithm
SF-LRU Cache Replacement Algorithm Jaafar Alghazo, Adil Akaaboune, Nazeih Botros Southern Illinois University at Carbondale Department of Electrical and Computer Engineering Carbondale, IL 6291 alghazo@siu.edu,
More informationCACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás
CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,
More informationCombining Local and Global History for High Performance Data Prefetching
Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu
More informationLecture 7 - Memory Hierarchy-II
CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw
More informationCache Performance (H&P 5.3; 5.5; 5.6)
Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st
More informationReduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction
ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch
More informationHigh Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur
High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 26 Cache Optimization Techniques (Contd.) (Refer
More informationEXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu
Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points
More informationAn Automated Method for Software Controlled Cache Prefetching
An Automated Method for Software Controlled Cache Prefetching Daniel F. Zucker*, Ruby B. Lee, and Michael J. Flynn Computer Systems Laboratory Department of Electrical Engineering Stanford University Stanford,
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationData Speculation. Architecture. Carnegie Mellon School of Computer Science
Data Speculation Adam Wierman Daniel Neill Lipasti and Shen. Exceeding the dataflow limit, 1996. Sodani and Sohi. Understanding the differences between value prediction and instruction reuse, 1998. 1 A
More informationComprehensive Review of Data Prefetching Mechanisms
86 Sneha Chhabra, Raman Maini Comprehensive Review of Data Prefetching Mechanisms 1 Sneha Chhabra, 2 Raman Maini 1 University College of Engineering, Punjabi University, Patiala 2 Associate Professor,
More informationMore on Conjunctive Selection Condition and Branch Prediction
More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused
More information15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15
More informationPortland State University ECE 587/687. Memory Ordering
Portland State University ECE 587/687 Memory Ordering Copyright by Alaa Alameldeen and Haitham Akkary 2012 Handling Memory Operations Review pipeline for out of order, superscalar processors To maximize
More informationHardware versus Hybrid Data Prefetching in Multimedia Processors: A Case Study
In Proc, of the IEEE Int. Performance, Computing and Communications Conference, Phoenix, USA, Feb. 2, c 2 IEEE, reprinted with permission of the IEEE Hardware versus Hybrid Data ing in Multimedia Processors:
More information1 Introduction The demand on the performance of memory subsystems is rapidly increasing with the advances in microprocessor architecture. The growing
Tango: a Hardware-based Data Prefetching Technique for Superscalar Processors 1 Shlomit S. Pinter IBM Science and Technology MATAM Advance Technology ctr. Haifa 31905, Israel E-mail: shlomit@vnet.ibm.com
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of
More informationABSTRACT. PANT, SALIL MOHAN Slipstream-mode Prefetching in CMP s: Peformance Comparison and Evaluation. (Under the direction of Dr. Greg Byrd).
ABSTRACT PANT, SALIL MOHAN Slipstream-mode Prefetching in CMP s: Peformance Comparison and Evaluation. (Under the direction of Dr. Greg Byrd). With the increasing gap between processor speeds and memory,
More informationTechniques for Efficient Processing in Runahead Execution Engines
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu
More informationCache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals
Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics
More informationSuperscalar Processors
Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input
More informationComputer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Branch Prediction Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 11: Branch Prediction
More informationMPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors
MPEG- Video Decompression on Simultaneous Multithreaded Multimedia Processors Heiko Oehring Ulrich Sigmund Theo Ungerer VIONA Development GmbH Karlstr. 7 D-733 Karlsruhe, Germany uli@viona.de VIONA Development
More informationSISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:
SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationA Perfect Branch Prediction Technique for Conditional Loops
A Perfect Branch Prediction Technique for Conditional Loops Virgil Andronache Department of Computer Science, Midwestern State University Wichita Falls, TX, 76308, USA and Richard P. Simpson Department
More informationControl Hazards. Branch Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More information2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]
EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian
More informationCache Optimisation. sometime he thought that there must be a better way
Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching
More informationExploitation of instruction level parallelism
Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering
More informationAdvanced optimizations of cache performance ( 2.2)
Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped
More informationDemand fetching is commonly employed to bring the data
Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni
More informationAugust 1994 / Features / Cache Advantage. Cache design and implementation can make or break the performance of your high-powered computer system.
Cache Advantage August 1994 / Features / Cache Advantage Cache design and implementation can make or break the performance of your high-powered computer system. David F. Bacon Modern CPUs have one overriding
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationAN OVERVIEW OF HARDWARE BASED CACHE OPTIMIZATION TECHNIQUES
AN OVERVIEW OF HARDWARE BASED CACHE OPTIMIZATION TECHNIQUES Swadhesh Kumar 1, Dr. P K Singh 2 1,2 Department of Computer Science and Engineering, Madan Mohan Malaviya University of Technology, Gorakhpur,
More informationCISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1
CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationHigh Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur
High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 23 Hierarchical Memory Organization (Contd.) Hello
More informationChapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs
Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple
More informationBranch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken
Branch statistics Branches occur every 4-7 instructions on average in integer programs, commercial and desktop applications; somewhat less frequently in scientific ones Unconditional branches : 20% (of
More informationAssignment 2: Understanding Data Cache Prefetching
Assignment 2: Understanding Data Cache Prefetching Computer Architecture Due: Monday, March 27, 2017 at 4:00 PM This assignment represents the second practical component of the Computer Architecture module.
More informationLRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.
LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E
More informationInstruction Level Parallelism
Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches
More informationEvaluation of Branch Prediction Strategies
1 Evaluation of Branch Prediction Strategies Anvita Patel, Parneet Kaur, Saie Saraf Department of Electrical and Computer Engineering Rutgers University 2 CONTENTS I Introduction 4 II Related Work 6 III
More informationA Survey of Data Prefetching Techniques
A Survey of Data Prefetching Techniques Technical Report No: HPPC-96-05 October 1996 Steve VanderWiel David J. Lilja High-Performance Parallel Computing Research Group Department of Electrical Engineering
More informationA Survey of prefetching techniques
A Survey of prefetching techniques Nir Oren July 18, 2000 Abstract As the gap between processor and memory speeds increases, memory latencies have become a critical bottleneck for computer performance.
More informationTo Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs
To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com
More informationKeywords and Review Questions
Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain
More informationECE 485/585 Microprocessor System Design
Microprocessor System Design Lecture 11: Reducing Hit Time Cache Coherence Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering and Computer Science Source: Lecture based
More informationSPECULATIVE MULTITHREADED ARCHITECTURES
2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions
More informationA Review on Cache Memory with Multiprocessor System
A Review on Cache Memory with Multiprocessor System Chirag R. Patel 1, Rajesh H. Davda 2 1,2 Computer Engineering Department, C. U. Shah College of Engineering & Technology, Wadhwan (Gujarat) Abstract
More information6 th Lecture :: The Cache - Part Three
Dr. Michael Manzke :: CS7031 :: 6 th Lecture :: The Cache - Part Three :: October 20, 2010 p. 1/17 [CS7031] Graphics and Console Hardware and Real-time Rendering 6 th Lecture :: The Cache - Part Three
More informationUNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.
UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. July 14) (June 2013) (June 2015)(Jan 2016)(June 2016) H/W Support : Conditional Execution Also known
More informationLecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1)
Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1) 1 Types of Cache Misses Compulsory misses: happens the first time a memory word is accessed the misses for an infinite cache
More informationCS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III
CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationControl Hazards. Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationMemory Hierarchy Basics. Ten Advanced Optimizations. Small and Simple
Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases capacity and conflict misses, increases miss penalty Larger total cache capacity to reduce miss
More informationEE482: Advanced Computer Organization Lecture #3 Processor Architecture Stanford University Monday, 8 May Branch Prediction
EE482: Advanced Computer Organization Lecture #3 Processor Architecture Stanford University Monday, 8 May 2000 Lecture #3: Wednesday, 5 April 2000 Lecturer: Mattan Erez Scribe: Mahesh Madhav Branch Prediction
More informationImprove performance by increasing instruction throughput
Improve performance by increasing instruction throughput Program execution order Time (in instructions) lw $1, 100($0) fetch 2 4 6 8 10 12 14 16 18 ALU Data access lw $2, 200($0) 8ns fetch ALU Data access
More informationSuperscalar Processor Design
Superscalar Processor Design Superscalar Organization Virendra Singh Indian Institute of Science Bangalore virendra@computer.org Lecture 26 SE-273: Processor Design Super-scalar Organization Fetch Instruction
More informationInstruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.
Instruction Fetch and Branch Prediction CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.3) 1 Frontend and Backend Feedback: - Prediction correct or not, update
More informationUnderstanding The Effects of Wrong-path Memory References on Processor Performance
Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend
More informationSpeculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationA Framework for the Performance Evaluation of Operating System Emulators. Joshua H. Shaffer. A Proposal Submitted to the Honors Council
A Framework for the Performance Evaluation of Operating System Emulators by Joshua H. Shaffer A Proposal Submitted to the Honors Council For Honors in Computer Science 15 October 2003 Approved By: Luiz
More informationECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation. Qisi Wang Hui-Shun Hung Chien-Fu Chen
ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation Qisi Wang Hui-Shun Hung Chien-Fu Chen Outline Data Prefetching Exist Data Prefetcher Stride Data Prefetcher Offset Prefetcher
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationMemory Consistency. Challenges. Program order Memory access order
Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined
More informationComputer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi
Computer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 32 The Memory Systems Part III Welcome back. (Refer Slide
More informationPrinciples in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008
Principles in Computer Architecture I CSE 240A (Section 631684) CSE 240A Homework Three November 18, 2008 Only Problem Set Two will be graded. Turn in only Problem Set Two before December 4, 2008, 11:00am.
More information2 TEST: A Tracer for Extracting Speculative Threads
EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath
More information