A Study of Data Prefetching using Multi2Sim

Vaivaswatha N
sercvaivaswatha@ssl.serc.iisc.in

1. KEYWORDS

Data prefetching, computer architecture, Multi2Sim, global history buffer.

2. ABSTRACT

Data prefetching[6] is a cache optimization technique that tries to minimize access time by predicting future data accesses and initiating a fetch for the data, so that it is available in the cache when required. Multi2Sim[11] is an architectural simulator for heterogeneous architectures. It includes models for superscalar, multithreaded, and multicore CPUs, as well as GPU architectures. This course project implements a prefetching model in the Multi2Sim simulator and studies the changes in program performance due to prefetching. Primarily, two global history buffer based hardware prefetchers[8] and compiler aided prefetching are studied. The study uses the MediaBench[7] and PolyBench[9] benchmark suites to measure the performance impact of prefetching. An important goal of this project has been to contribute to the open-source community. In the process of doing this study, I have contributed the following to the Multi2Sim simulator: support for the x86 prefetch instruction[4], memory hierarchy support for prefetching, and two global history buffer based prefetchers.

3. INTRODUCTION

Data prefetching is an optimization in which block fetches are initiated by speculating on the data that will be required, so that the block is available in the cache when it is actually needed. Related to data prefetching is instruction prefetching, where instruction cache blocks are prefetched. This project is entirely in the context of data prefetching, which is henceforth referred to simply as prefetching. Prefetching may be targeted at any of the cache levels, and different approaches are effective at different levels. For example, the prefetching technique described in [1] is suitable for L1 caches, whereas the technique described in [8] is suitable for L2 caches.

for i = 1, n do
    a[i] = b[i] + c[i]
end for

Figure 1: Simple Loop

3.1 The basic idea of prefetching

Consider the simple loop shown in figure 1. Assume that the L1 cache block size is 16 bytes and all three arrays have elements of the same 4-byte data type. The loop then requires two new blocks to be fetched every 4 iterations, just for reading b[i] and c[i]. If we could recognize this pattern and fetch the blocks for iteration i+4 while iteration i is running, then, when iteration i+4 begins, the blocks would already be in the cache. All data prefetching schemes share the following goals[1]:

- Generate prefetches well in advance for blocks that may be needed in the future, so that upon an actual access the block is already available.
- Avoid unnecessary prefetches, i.e., the prediction accuracy must be high so that bandwidth is not wasted on blocks that will never be used.
- Do not prefetch too early, since prefetched blocks take up cache space while they wait to be used.
- Incur no penalty for prefetching blocks that are already in the cache.
- Do not increase the processor cycle time, i.e., prefetching must not interfere with the critical path timing.

3.2 Classification of prefetching techniques

Prefetching techniques may be broadly classified into software (or compiler) aided prefetching and hardware prefetching.
Compiler aided prefetching relies on compiler analysis passes that try to accurately determine the best positions in the code at which to prefetch data, and what data should be prefetched at each point. The compiler inserts prefetch hints (instructions) into the program at these points. Whenever the processor sees a prefetch hint, it initiates a prefetch for the specified block.
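
As an illustration, this is how the loop of figure 1 might look after prefetch hints are inserted. The sketch uses GCC's __builtin_prefetch, which on x86 compiles down to a prefetch instruction; the lookahead distance of 4 iterations assumes the 16-byte blocks and 4-byte elements of section 3.1.

#include <stddef.h>

/* The loop of figure 1 with prefetch hints written out by hand, as a
 * compiler might insert them.  __builtin_prefetch is GCC's hint and
 * lowers to an x86 prefetch instruction. */
void add_arrays(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 4 < n) {
            __builtin_prefetch(&b[i + 4], 0, 3);   /* 0 = read access   */
            __builtin_prefetch(&c[i + 4], 0, 3);   /* 3 = high locality */
        }
        a[i] = b[i] + c[i];
    }
}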

Hardware prefetching is a purely hardware based technique that does not depend on compiler inserted prefetch hints: the hardware has circuits that detect patterns in the data accesses, based on which it initiates prefetch requests to the next lower level of the memory hierarchy. A good comparison of software and hardware prefetching can be found in [3].

3.3 Multi2Sim

Multi2Sim[11] is an architectural simulator for heterogeneous architectures that includes models for superscalar, multithreaded, and multicore CPUs, as well as GPU architectures. It is an application-only simulator (i.e., not a full-system simulator). The Multi2Sim documentation can be consulted for more details on the architecture and features of the simulator; only the relevant details are discussed in this document. The part of the simulator of primary interest to this project is the memory subsystem, which is based on an event simulation model: every request to a memory module (cache or main memory) is scheduled as an event, and the event simulator is responsible for executing the event (through a callback) at the right time. Central to the memory subsystem is the NMOESI cache coherence protocol, an extension of the MOESI protocol with non-coherent (the N in NMOESI) accesses, implemented as a distributed directory based protocol. Multi2Sim also provides a configurable interconnect system to organize the different memory components in different ways.

4. BACKGROUND

This section discusses a few hardware prefetching mechanisms, including the technique implemented in this project. Early approaches to prefetching used simple heuristics. Smith[10] studied variations of block lookahead schemes where, whenever block i was referenced, block i+1 would be prefetched. Fu and Patel[5] studied a scheme that maintained a history of previous addresses and generated prefetches based on a constant stride; this scheme had no control mechanism to reduce unnecessary prefetches on irregular accesses[1].

4.1 Reference prediction based approach

Chen and Baer[1] study three variations of an approach based on a reference prediction table. This scheme is most suitable for processors with a small first-level cache and a small block size. The main idea of reference prediction is to predict the future references of an instruction from that instruction's past history of accesses. The paper discusses three variations, each with an increasing degree of complexity and effectiveness.

The first (simplest) approach is based on a four-state machine. The state of the machine indicates whether the accesses of an instruction follow a regular pattern (constant stride) and hence whether prefetching is useful. The state transitions are aided by tracking the previous address, the stride, and the state of each instruction in a reference prediction table (RPT). The state machine works much like a two-bit branch predictor, so I will skip the details. The prediction mechanism is triggered when a load (or store) instruction is decoded; based on the instruction's entry in the RPT, a prefetch may be initiated. This basic scheme has a potential weakness in the timing of the prefetch: if the loop body is too small, the prefetched data may arrive late, reducing the benefit of prefetching; similarly, too large a loop body might trigger the prefetch too early. This is fixed by looking ahead to determine the best time to initiate a prefetch, so that the data arrives just in time to be used. A look-ahead program counter (LA-PC) stays ahead of the PC by the time δ that a block takes to arrive at the cache after a prefetch request. To keep the LA-PC as accurate as possible, it is linked with the branch prediction table. In this second scheme, it is the LA-PC that initiates prefetches, rather than the PC as in the previous approach.

In the first two approaches, reference prediction is based on regularity between adjacent data accesses. For more general patterns, for example small inner loops or a triangular loop, these schemes issue frequent redundant prefetches. The third variation avoids this with correlated reference prediction, which tracks not only adjacent accesses in inner loops but also accesses correlated across loop levels. It uses a shift register that records recent branch outcomes, and an extended RPT with extra fields for computing the strides of the various correlated accesses. Tracking branch outcomes is necessary since a not-taken branch triggers the correlation to the next loop level up.
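
To make the RPT mechanism concrete, the sketch below gives one plausible C encoding of an RPT entry and its update on each access by the owning instruction. The states and fields follow the description above; the exact transition rules in [1] differ in detail, so treat this as an illustration rather than the paper's exact automaton.

#include <stdint.h>
#include <stdbool.h>

/* One plausible encoding of a reference prediction table (RPT) entry. */
enum rpt_state { INITIAL, TRANSIENT, STEADY, NO_PREDICTION };

struct rpt_entry {
    uint64_t pc;         /* PC of the load/store that owns this entry */
    uint64_t prev_addr;  /* last address accessed by this instruction */
    int64_t  stride;     /* last observed stride                      */
    enum rpt_state state;
};

/* Update the entry on a new access; returns true if a prefetch should
 * be issued, with the predicted address in *prefetch_addr. */
bool rpt_update(struct rpt_entry *e, uint64_t addr, uint64_t *prefetch_addr)
{
    bool correct = (addr == e->prev_addr + (uint64_t)e->stride);

    switch (e->state) {
    case INITIAL:       e->state = correct ? STEADY : TRANSIENT;      break;
    case TRANSIENT:     e->state = correct ? STEADY : NO_PREDICTION;  break;
    case STEADY:        if (!correct) e->state = INITIAL;             break;
    case NO_PREDICTION: if (correct)  e->state = TRANSIENT;           break;
    }
    if (!correct)
        e->stride = (int64_t)(addr - e->prev_addr);  /* learn new stride */
    e->prev_addr = addr;

    /* Prefetch only while the access pattern looks regular. */
    *prefetch_addr = addr + (uint64_t)e->stride;
    return e->state == STEADY;
}
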
4.2 Global history buffer based prefetching

Nesbit and Smith[8] proposed a structure called the global history buffer (GHB) that provides a new way to organize the access history. The scheme not only improves the accuracy of correlation prefetching by minimizing stale entries, but also holds a more complete picture of the cache miss history, which can be used to design more effective prefetching techniques. One main difference from the approach in [1] is that [1] stores memory reference history, whereas this method stores a history of cache misses. Prior prefetching methods used a table indexed by some key, with the corresponding row containing the history information. The GHB based approach decouples indexing into the table from the storage of the prefetch-related history. More precisely, the following two tables are proposed (figure 2):

- An Index Table (IT), accessed with a key (typically the load (or store) instruction's PC, or a cache miss address), whose entry points to an entry in the GHB.

- The Global History Buffer (GHB), an n-entry FIFO table that holds the n most recent cache misses. Each entry also holds a link pointer, used to form linked lists (called address lists) of GHB elements, such that all elements of a list correspond to the same IT key.

Figure 2: GHB prefetch structure

Every cache miss is entered into the GHB in FIFO order. At insertion time, the new GHB entry is made to point to the existing pointee of the corresponding IT entry, and the IT entry is then made to point to the new GHB entry (in other words, the linked list is kept consistent). Depending on the key used to index the IT, and on whether stride or correlation prefetching is performed, many prefetching schemes can be implemented on top of the GHB structure. This project implements two of them (detailed in section 5.3):

1. PC/CS: the PC of the load (or store) instruction indexes the IT, and Constant Stride prefetching is performed.

2. PC/DC: the PC of the load (or store) instruction indexes the IT, and Delta Correlation prefetching is performed. This method is referred to as Local Delta Correlation in [8].
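
To make the bookkeeping concrete, here is a minimal sketch of the IT and GHB in C. It follows the FIFO-plus-linked-list organization described above, but uses the doubly linked address lists of my implementation (see section 5.3) rather than the paper's single link pointers; the sizes, field names, and direct-mapped IT are illustrative assumptions, not the actual Multi2Sim code.

#include <stdint.h>

#define GHB_SIZE 256   /* sizes from table 1; configurable in practice */
#define IT_SIZE   64
#define NIL       (-1)

struct ghb_entry {
    uint64_t pc;         /* IT key that owns this miss                 */
    uint64_t miss_addr;  /* address of the cache miss                  */
    int next;            /* next-older miss with the same key          */
    int prev;            /* next-younger miss (doubly linked, sec 5.3) */
};

struct it_entry {
    uint64_t key;        /* e.g. the load/store PC                     */
    int head;            /* most recent GHB entry for this key         */
};

static struct ghb_entry ghb[GHB_SIZE];
static struct it_entry it[IT_SIZE];
static int fifo_head;                 /* next FIFO slot to overwrite   */

void ghb_init(void)
{
    for (int i = 0; i < GHB_SIZE; i++) { ghb[i].next = ghb[i].prev = NIL; }
    for (int i = 0; i < IT_SIZE; i++)  { it[i].head = NIL; it[i].key = 0; }
}

/* Record a cache miss: push it into the FIFO and splice it in at the
 * head of the address list for its IT entry. */
void ghb_record_miss(uint64_t pc, uint64_t addr)
{
    int slot = fifo_head;
    fifo_head = (fifo_head + 1) % GHB_SIZE;

    /* Unlink the victim leaving the FIFO; this is the step the doubly
     * linked list makes trivial. */
    struct ghb_entry *v = &ghb[slot];
    if (v->prev != NIL)
        ghb[v->prev].next = NIL;
    else if (it[v->pc % IT_SIZE].head == slot)
        it[v->pc % IT_SIZE].head = NIL;   /* victim was a list head    */

    struct it_entry *ie = &it[pc % IT_SIZE];  /* direct-mapped IT      */
    if (ie->key != pc) {               /* new key: old list ages out   */
        ie->key = pc;
        ie->head = NIL;
    }
    v->pc = pc;
    v->miss_addr = addr;
    v->next = ie->head;                /* point at the old pointee ... */
    v->prev = NIL;
    if (ie->head != NIL)
        ghb[ie->head].prev = slot;
    ie->head = slot;                   /* ... then update the IT entry */
}

A prefetcher then walks the list from it[...].head via the next pointers to recover the recent miss addresses (and hence deltas) for that key.
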
5. IMPLEMENTATION

This section discusses the implementation work carried out as part of this project. The implementation was done in the Multi2Sim simulator and has been committed into the main source tree (i.e., it is available on the web as part of the main Multi2Sim simulator). The work is separated into three parts, each explained in detail in the following subsections.

5.1 Prefetch hint support

Prior to this work, Multi2Sim did not support prefetch hints of any kind, so the first phase of work involved supporting the x86 prefetch instruction[4]. This was essential for comparing hardware prefetching with software prefetching. Supporting the prefetch instruction involved expanding it into the appropriate micro-instructions (an effective address calculation followed by a memory operation). This expansion is done during the emulation stage; in terms of the simulator's architecture, this is equivalent to saying that the expansion of the ISA instruction into micro-instructions happens during the fetch stage of the pipeline. The implementation also involved creating a separate queue for prefetch requests in the issue unit. Prefetch requests are scheduled to memory by the issue unit much like loads and stores, but the implementation was kept separate so as to enable future work in this direction (for example, issuing prefetch requests at a lower priority).

An important part of implementing prefetch hint support was avoiding redundant prefetch requests to the memory. Even though the memory unit discards a prefetch request upon realizing a cache hit (more on this in section 5.2), sending the request to memory itself costs bandwidth; moreover, just to detect the hit, the cache unit must acquire a port and a lock on the directory, both of which may turn out to be expensive. To illustrate this, consider the badly inserted prefetch shown in figure 3.

for i = 1, n do
    prefetch(arr[i+1])
    use/define arr[i]
end for

Figure 3: Badly inserted prefetches

Assuming 10 elements of the array fit in one cache block, there will be 9 redundant prefetches for every necessary prefetch. Ideally, the compiler should either have unrolled the loop so that one iteration consumes a whole block of data, or inserted the prefetch conditionally (i.e., inside an if). Such simple bad scenarios are easily handled by keeping a small prefetch request history in the issue unit, which the issue unit queries before issuing prefetch requests to memory. A simple FIFO table suffices for many cases, and such a table was implemented as part of the x86 prefetch instruction support.

5.2 Memory system support for prefetching

Prefetching support in the memory system mainly involved incorporating the concept of prefetching into the NMOESI cache coherence protocol. For the most part, this was implemented similarly to how load requests are handled by the protocol. However, the implementation was kept fully separate from the load request implementation, primarily because of the following differences.

- Whenever two load (or store) requests to the same block arrive within a short duration of each other, such that the first request has not yet begun executing, the controller tries to coalesce them into one request. A prefetch request in the same situation, in contrast, just needs to be dropped: prefetches are not coalesced with other prefetches or with load (or store) requests.

- Whenever there is contention between a normal load (or store) request and a prefetch request, priority needs to be given to the load (or store) request (or maybe not).

- As a follow-up on the previous point, one heuristic currently implemented is to not retry a prefetch request when it fails to acquire a port or a directory lock. This assumes that there is already contention, and we do not want to increase it further by handling prefetch requests.

Keeping the implementation separate makes future modifications along these lines easy.

5.3 GHB based hardware prefetching

As mentioned earlier, two GHB based hardware prefetching schemes were implemented as part of this project. The implementation of the Index Table (IT) and the Global History Buffer (GHB) is common to both schemes; only the actual prefetching logic varies. Both tables were added as simple arrays of structures whose sizes are configurable through the memory configuration file. One point that deserves mention is that the linked list (address list) management of the GHB entries was done differently from the method in the paper. Since the paper's GHB entries carry only a single pointer for the linked list, deletion of elements is not trivial. The paper proposes keeping extra bits in each pointer (not used to find the pointee), based on which an invalid pointer (i.e., a pointer to an element no longer in the GHB) can be detected whenever the difference between the pointer value (including the extra bits) and the current head of the FIFO queue exceeds the size of the GHB. However, this is not perfect, as the head pointer can wrap around and still cause an incorrect match. To avoid the problem altogether, my implementation uses a doubly linked list. Implementing the GHB and IT also involved adding a mechanism to convey the PC of the load (or store) instruction from the processor to the memory system.

In the PC/CS implementation (see section 4.2), the PC is used to index the IT, and the lookup depth into the GHB can be specified in the configuration file. Based on the lookup depth d, the prefetcher looks at the past d cache misses for that PC and initiates a prefetch only if all of them share a common (constant) stride, using this stride to predict the next address.

To illustrate the necessity of the PC/DC scheme (see section 4.2), consider the following sequence of miss addresses[8]: 0, 1, 2, 64, 65, 66, 128, 129. These accesses have deltas (strides) of 1, 1, 62, 1, 1, 62, 1. The pattern is representative of a load that accesses the first three words of each column of a 2-D array. The PC/CS scheme (with a lookup depth of 2) would conclude a constant stride upon seeing the two 1s and hence wrongly prefetch along that stride. The PC/DC scheme overcomes this by using delta pairs to find correlation in the accesses: the last two deltas are matched against the history to find their most recent previous occurrence, and whatever delta followed that occurrence is used to determine the prefetch address. Though the paper does not discuss matching more than two deltas (i.e., a pair), my implementation takes this number as a parameter and can try to match longer delta sequences.
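
The two prediction routines can be sketched as follows, assuming the deltas of the current PC's address list have already been gathered (oldest first) into a deltas[] array; the helper names and the treatment of the lookup depth as a number of deltas are my assumptions, not the Multi2Sim code.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* PC/CS check with lookup depth d, taken here as the number of trailing
 * deltas compared: prefetch only if they are all equal; the common
 * stride then predicts the next address. */
bool cs_predict(const int64_t *deltas, size_t n, size_t d, int64_t *stride)
{
    if (n < d)
        return false;
    for (size_t k = n - d + 1; k < n; k++)
        if (deltas[k] != deltas[n - d])
            return false;
    *stride = deltas[n - 1];
    return true;
}

/* PC/DC lookup: match the trailing `width` deltas (2 in [8]; my
 * implementation allows more) against the history and return the delta
 * that followed their most recent previous occurrence. */
bool dc_predict(const int64_t *deltas, size_t n, size_t width,
                int64_t *next_delta)
{
    if (n < width + 1)
        return false;

    const int64_t *pattern = &deltas[n - width];  /* trailing deltas */

    /* Scan backwards for the most recent earlier occurrence. */
    for (size_t pos = n - width; pos-- > 0; ) {
        bool match = true;
        for (size_t k = 0; k < width; k++) {
            if (deltas[pos + k] != pattern[k]) { match = false; break; }
        }
        if (match) {
            *next_delta = deltas[pos + width];  /* delta that followed */
            return true;
        }
    }
    return false;
}

For example, after the misses at 64, 65 and 66 above (delta history 1, 1, 62, 1, 1), cs_predict with d = 2 sees the trailing 1, 1 and would wrongly prefetch address 67, while dc_predict matches the pair (1, 1) at its earlier occurrence and correctly predicts the next delta of 62, i.e. address 128.
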
6. EXPERIMENTATION

This section describes the simulation runs carried out to evaluate the different prefetching schemes; the results of the various runs are also tabulated here. All programs were built with the -O4 flag to GCC.

6.1 Choice of benchmarks

MediaBench[7] is a set of video and image codecs, including popular codecs such as mpeg, jpeg, adpcm, and epic. The binaries for MediaBench, along with their inputs and outputs, were already available on the Multi2Sim website for easy download. For this project, adpcm-dec, adpcm-enc, epic-dec and epic-enc were used, mainly because the simulation times of these programs were affordable.

PolyBench[9] is a benchmark suite designed for benchmarking compiler loop optimization techniques. Its programs contain a variety of loop patterns, which makes the suite a good candidate for evaluating prefetching. The programs in PolyBench are broadly divided into data mining, linear algebra, medley, and stencil computations. The following PolyBench programs were used in this project: floyd-warshall, trmm (triangular matrix multiply), dynprog (2-D dynamic programming), gramschmidt, a multiresolution analysis kernel, and mvt (matrix transpose and vector multiplication). Each program can be run on different input sizes (mini, small, standard, large, extra large); the small input size was used in this project, again mainly due to time constraints.

An attempt was made to run the SPEC2006 benchmarks, but the simulation time was extremely high even for the test input set. I tried to use SimPoints (along with the PinPoints tool), but the downloaded sources failed to build, even after manual attempts to correct them. Both the PolyBench and MediaBench programs were run to completion, and the number of CPU cycles they ran for is used as the measure of performance.

6.2 Base configuration

The cache configuration used to obtain the primary results is shown in table 1.

Parameter                     L1 value   L2 value
Sets
Assoc                         4          8
BlockSize
Latency                       4          2
Policy                        LRU        LRU
Ports                         4          4
H/W Prefetching               NO         YES
Prefetcher index table size              64
Prefetcher GHB size                      256

Table 1: Base configuration
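
For reference, caches like those in table 1 are described to Multi2Sim through an INI-style memory configuration file. The sketch below follows Multi2Sim's documented [CacheGeometry] section; the Sets and BlockSize values (absent from table 1) are assumed, and the commented Prefetcher* option names are illustrative placeholders rather than verbatim simulator syntax.

; Sketch of a Multi2Sim memory configuration fragment for table 1.
; Sets and BlockSize are assumed values.
[CacheGeometry geo-l1]
Sets = 128            ; assumed
Assoc = 4
BlockSize = 64        ; assumed
Latency = 4
Policy = LRU
Ports = 4

[CacheGeometry geo-l2]
Sets = 1024           ; assumed
Assoc = 8
BlockSize = 64        ; assumed
Latency = 2
Policy = LRU
Ports = 4
; PrefetcherType = GHB_PC_DC    (hypothetical option name)
; PrefetcherITSize = 64         (hypothetical option name)
; PrefetcherGHBSize = 256       (hypothetical option name)
; PrefetcherLookupDepth = 2     (hypothetical option name)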

Figures 4 and 5 show the impact of the various prefetching schemes on PolyBench and MediaBench respectively. As noted before, performance is measured as the total cycle count for executing the entire program. In the MediaBench suite, epic-dec shows the highest improvement, of 13.9%. In the PolyBench suite, the highest improvement is 25.1%. Both of these best results are obtained with the delta correlation scheme (PC/DC with a lookup depth of 2). For a few benchmarks, the performance improvement is low and hence cannot be clearly seen in the graphs. As will be shown later, a lookup depth of 3 for constant stride prefetching performs better than the lookup depth of 2 used in the base configuration. (In the graphs, the final number suffixed to a hardware prefetching scheme's name is its lookup depth.)

Figure 4: PolyBench performance for the base configuration (total cycles for no_prefetching, compiler_prefetching, pc_cs_2 and pc_dc_2)

Figure 5: MediaBench performance for the base configuration (total cycles for no_prefetching, pc_cs_2 and pc_dc_2)

For compiler prefetching, the programs were built with the -fprefetch-loop-arrays -mtune=core2 flags passed to GCC. Note that for MediaBench, compiler prefetching performance is not measured, since the binaries were downloaded directly from the Multi2Sim website and were not compiled with prefetching enabled.

6.3 Associativity

Here we examine how associativity affects prefetching performance. This is an important dimension along which to study prefetching, since prefetching might cause blocks that are currently in the working set to be evicted due to a way conflict. Figure 6 shows a comparison of prefetching with the L2 associativity reduced from 8 ways to 4 ways (table 1). Note that the cycle count for compiler prefetching is zero for the MediaBench programs, for the reason discussed previously. The first four bars for each benchmark represent the cycles for a 4-way L2 and the second four bars are for an 8-way L2. The impact of associativity is significant for the gramschmidt benchmark, while for the other benchmarks prefetching behavior remained almost the same.

Figure 6: Effect of associativity on each scheme (total cycles for compiler_4way, pc_cs_2_4way, pc_dc_2_4way, compiler_8way, pc_cs_2_8way and pc_dc_2_8way)

6.4 L1 vs L2 prefetching

As mentioned earlier, hardware prefetching can be performed both at the L1 and at the L2 cache level. Prefetching methods designed to work on a history of cache misses work better on L2 caches, while methods that work on the reference address stream work better on the L1 cache[1]. The results of this experiment agree with that idea. Figure 7 shows a comparison of prefetching on the L1 and L2 caches.

Figure 7: L1 vs L2 prefetching (total cycles for pc_cs_2_l1, pc_cs_2_l2, pc_dc_2_l1 and pc_dc_2_l2)

6.5 Lookup depth

This subsection evaluates the PC/CS (Program Counter / Constant Stride) prefetching scheme for lookup depths of 2, 3, and 4 (all other parameters remain as in table 1). No lookup depth comparison is done for delta correlation prefetching, as implementing such a lookup beyond 2 in hardware seemed unrealistic (although the simulator implementation does support higher lookup depths for DC). Figure 8 shows the comparison for constant stride prefetching. Except for floyd-warshall, a lookup depth of 3 seems to be the best for constant stride prefetching. The lookup depth influences prefetch accuracy and impact in two ways. (1) As the lookup depth increases, prefetch accuracy improves: since the history examined is longer (by definition), the prefetcher is less likely to issue prefetches along a wrong stride. (2) However, as the lookup depth increases, it takes longer for the prefetcher to start initiating prefetches (even on constant stride accesses). This may be significant if the innermost loop runs for a short duration.

Figure 8: Lookup depth comparison (total cycles for pc_cs_2, pc_cs_3 and pc_cs_4)

7. CONCLUSION

This project implemented and compared a few prefetching schemes, studying their behavior across different cache configurations and other parameters. One important aspect of cache configuration missed in the discussion above is the number of ports: a minimum of 4 ports on the prefetching cache was required for prefetching to show improvements, and any lower number resulted in quite a few degradations. The detailed graphs are not shown due to lack of space. A natural extension of this project would be to study prefetching behavior in a multi-core scenario[2]. Such a study was not carried out as part of this project since it involves significantly more implementation work to account for multiple cores accessing the same shared cache. It would also require implementing support for other x86 prefetch hints, such as prefetchnta[4], which make more sense in a multicore scenario. Finally, I would like to thank Prof. Matthew Jacob for his guidance, and the Multi2Sim developer team for their comments on the prefetcher implementation.

8. REFERENCES

[1] Jean-Loup Baer and Tien-Fu Chen. Effective hardware-based data prefetching for high-performance processors. IEEE Trans. Comput., 44(5):609-623, May 1995.
[2] Surendra Byna, Yong Chen, and Xian-He Sun. Taxonomy of data prefetching for multicore processors. Journal of Computer Science and Technology, 24:405-417, 2009.
[3] T.-F. Chen and J.-L. Baer. A performance study of software and hardware data prefetching schemes. SIGARCH Comput. Archit. News, 22(2), April 1994.
[4] Intel Corporation. Intel Architecture - Software Developer's Manual, Volume 2: Instruction Set Reference.
[5] John W. C. Fu, Janak H. Patel, and Bob L. Janssens. Stride directed prefetching in scalar processors. SIGMICRO Newsl., 23(1-2):102-110, December 1992.
[6] John L. Hennessy and David A. Patterson. Computer Architecture, Fourth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006.
[7] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. In Proc. of the 30th Int'l Symposium on Microarchitecture, Dec. 1997.
[8] K. J. Nesbit and J. E. Smith. Data cache prefetching using a global history buffer. In Proc. of the 10th International Symposium on High Performance Computer Architecture (HPCA-10), page 96, Feb. 2004.
[9] Louis-Noël Pouchet. PolyBench: The Polyhedral Benchmark Suite.
[10] Alan Jay Smith. Cache memories. ACM Comput. Surv., 14(3):473-530, September 1982.
[11] Rafael Ubal, Byunghyun Jang, Perhaad Mistry, Dana Schaa, and David Kaeli. Multi2Sim: A Simulation Framework for CPU-GPU Computing. In Proc. of the 21st International Conference on Parallel Architectures and Compilation Techniques, Sep. 2012.


More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

A Framework for the Performance Evaluation of Operating System Emulators. Joshua H. Shaffer. A Proposal Submitted to the Honors Council

A Framework for the Performance Evaluation of Operating System Emulators. Joshua H. Shaffer. A Proposal Submitted to the Honors Council A Framework for the Performance Evaluation of Operating System Emulators by Joshua H. Shaffer A Proposal Submitted to the Honors Council For Honors in Computer Science 15 October 2003 Approved By: Luiz

More information

ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation. Qisi Wang Hui-Shun Hung Chien-Fu Chen

ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation. Qisi Wang Hui-Shun Hung Chien-Fu Chen ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation Qisi Wang Hui-Shun Hung Chien-Fu Chen Outline Data Prefetching Exist Data Prefetcher Stride Data Prefetcher Offset Prefetcher

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Memory Consistency. Challenges. Program order Memory access order

Memory Consistency. Challenges. Program order Memory access order Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined

More information

Computer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi

Computer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi Computer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 32 The Memory Systems Part III Welcome back. (Refer Slide

More information

Principles in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008

Principles in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008 Principles in Computer Architecture I CSE 240A (Section 631684) CSE 240A Homework Three November 18, 2008 Only Problem Set Two will be graded. Turn in only Problem Set Two before December 4, 2008, 11:00am.

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information