A Study of Data Prefetching using Multi2Sim

Vaivaswatha N
sercvaivaswatha@ssl.serc.iisc.in

1. KEYWORDS

Data prefetching, computer architecture, Multi2Sim, global history buffer.

2. ABSTRACT

Data prefetching[6] is a cache optimization technique that tries to minimize access time by predicting future data accesses and initiating a fetch for the data, so that it is available in the cache when required. Multi2Sim[11] is an architectural simulator for heterogeneous architectures. It includes models for superscalar, multithreaded, and multicore CPUs, as well as GPU architectures. This course project implements a prefetching model in the Multi2Sim simulator and studies the changes in program performance due to prefetching. Primarily, two global history buffer based hardware prefetchers[8] and compiler aided prefetching are studied. The study uses the MediaBench[7] and PolyBench[9] benchmark suites to measure the performance impact of prefetching. An important goal of this project has been to contribute to the open-source community. In the process of doing this study, I have contributed the following to the Multi2Sim simulator: support for the x86 prefetch instruction[4], memory hierarchy support for prefetching, and two global history buffer based prefetchers.

3. INTRODUCTION

Data prefetching is an optimization in which block fetches are initiated by speculating on the data that will be required, so that the block is available in the cache when it is actually needed. Related to data prefetching is instruction prefetching, where instruction cache blocks are prefetched. This project is entirely in the context of data prefetching, which is henceforth referred to simply as prefetching. Prefetching may be targeted at any of the cache levels, and different approaches are effective at different levels. For example, the prefetching technique described in [1] is suitable for L1 caches, whereas the technique described in [8] is suitable for L2 caches.

for i = 1, n do
    a[i] = b[i] + c[i]
end for

Figure 1: Simple Loop

3.1 The basic idea of prefetching

Consider the simple loop shown in figure 1. Assume that the L1 cache block size is 16 bytes and all three arrays have elements of the same 4-byte data type. The loop then requires two new blocks to be fetched every 4 iterations, just for reading b[i] and c[i]. If we could recognize this pattern and fetch the blocks for iteration i+4 while iteration i is running, then, when iteration i+4 begins, the blocks would already be in the cache. All data prefetching schemes share the following goals[1]:

- Generate prefetches well in advance for blocks that may be needed in the future, so that upon an actual access the block is already available.
- Avoid unnecessary prefetches, i.e., the prediction accuracy must be high so that bandwidth is not wasted on blocks that will never be used.
- Do not prefetch too early, since prefetched blocks take up cache space while they wait to be used.
- Incur no penalty for prefetching blocks that are already in the cache.
- Do not increase the processor cycle time, i.e., prefetching must not interfere with the critical path timing.

3.2 Classification of prefetching techniques

Prefetching techniques may be broadly classified into software (or compiler) aided prefetching and hardware prefetching.
Compiler aided prefetching relies on compiler analysis passes that try to accurately determine the best positions in the code at which to prefetch data, and what data should be prefetched at each point. The compiler inserts prefetch hints (instructions) into the program at these points. Whenever the processor sees a prefetch hint, it initiates a prefetch for the specified block.
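
As an illustration, this is how the loop of figure 1 might look after prefetch hints are inserted. The sketch uses GCC's __builtin_prefetch, which on x86 compiles down to a prefetch instruction; the lookahead distance of 4 iterations assumes the 16-byte blocks and 4-byte elements of section 3.1.

#include <stddef.h>

/* The loop of figure 1 with prefetch hints written out by hand, as a
 * compiler might insert them.  __builtin_prefetch is GCC's hint and
 * lowers to an x86 prefetch instruction. */
void add_arrays(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 4 < n) {
            __builtin_prefetch(&b[i + 4], 0, 3);   /* 0 = read access   */
            __builtin_prefetch(&c[i + 4], 0, 3);   /* 3 = high locality */
        }
        a[i] = b[i] + c[i];
    }
}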

Hardware prefetching is a purely hardware based technique that does not depend on compiler inserted prefetch hints: the hardware has circuits that detect patterns in the data accesses, based on which it initiates prefetch requests to the next lower level of the memory hierarchy. A good comparison of software and hardware prefetching can be found in [3].

3.3 Multi2Sim

Multi2Sim[11] is an architectural simulator for heterogeneous architectures that includes models for superscalar, multithreaded, and multicore CPUs, as well as GPU architectures. It is an application-only simulator (i.e., not a full-system simulator). The Multi2Sim documentation can be consulted for more details on the architecture and features of the simulator; only the relevant details are discussed in this document. The part of the simulator of primary interest to this project is the memory subsystem, which is based on an event simulation model: every request to a memory module (cache or main memory) is scheduled as an event, and the event simulator is responsible for executing the event (through a callback) at the right time. Central to the memory subsystem is the NMOESI cache coherence protocol, an extension of the MOESI protocol with non-coherent (the N in NMOESI) accesses, implemented as a distributed directory based protocol. Multi2Sim also provides a configurable interconnect system to organize the different memory components in different ways.

4. BACKGROUND

This section discusses a few hardware prefetching mechanisms, including the technique implemented in this project. Early approaches to prefetching used simple heuristics. Smith[10] studied variations of block lookahead schemes where, whenever block i was referenced, block i+1 would be prefetched. Fu and Patel[5] studied a scheme that maintained a history of previous addresses and generated prefetches based on a constant stride; this scheme had no control mechanism to reduce unnecessary prefetches on irregular accesses[1].

4.1 Reference prediction based approach

Chen and Baer[1] study three variations of an approach based on a reference prediction table. This scheme is most suitable for processors with a small first-level cache and a small block size. The main idea of reference prediction is to predict the future references of an instruction from that instruction's past history of accesses. The paper discusses three variations, each with an increasing degree of complexity and effectiveness.

The first (simplest) approach is based on a four-state machine. The state of the machine indicates whether the accesses of an instruction follow a regular pattern (constant stride) and hence whether prefetching is useful. The state transitions are aided by tracking the previous address, the stride, and the state of each instruction in a reference prediction table (RPT). The state machine works much like a two-bit branch predictor, so I will skip the details. The prediction mechanism is triggered when a load (or store) instruction is decoded; based on the instruction's entry in the RPT, a prefetch may be initiated. This basic scheme has a potential weakness in the timing of the prefetch: if the loop body is too small, the prefetched data may arrive late, reducing the benefit of prefetching; similarly, too large a loop body might trigger the prefetch too early. This is fixed by looking ahead to determine the best time to initiate a prefetch, so that the data arrives just in time to be used. A look-ahead program counter (LA-PC) stays ahead of the PC by the time δ that a block takes to arrive at the cache after a prefetch request. To keep the LA-PC as accurate as possible, it is linked with the branch prediction table. In this second scheme, it is the LA-PC that initiates prefetches, rather than the PC as in the previous approach.

In the first two approaches, reference prediction is based on regularity between adjacent data accesses. For more general patterns, for example small inner loops or a triangular loop, these schemes issue frequent redundant prefetches. The third variation avoids this with correlated reference prediction, which tracks not only adjacent accesses in inner loops but also accesses correlated across loop levels. It uses a shift register that records recent branch outcomes, and an extended RPT with extra fields for computing the strides of the various correlated accesses. Tracking branch outcomes is necessary since a not-taken branch triggers the correlation to the next loop level up.
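
To make the RPT mechanism concrete, the sketch below gives one plausible C encoding of an RPT entry and its update on each access by the owning instruction. The states and fields follow the description above; the exact transition rules in [1] differ in detail, so treat this as an illustration rather than the paper's exact automaton.

#include <stdint.h>
#include <stdbool.h>

/* One plausible encoding of a reference prediction table (RPT) entry. */
enum rpt_state { INITIAL, TRANSIENT, STEADY, NO_PREDICTION };

struct rpt_entry {
    uint64_t pc;         /* PC of the load/store that owns this entry */
    uint64_t prev_addr;  /* last address accessed by this instruction */
    int64_t  stride;     /* last observed stride                      */
    enum rpt_state state;
};

/* Update the entry on a new access; returns true if a prefetch should
 * be issued, with the predicted address in *prefetch_addr. */
bool rpt_update(struct rpt_entry *e, uint64_t addr, uint64_t *prefetch_addr)
{
    bool correct = (addr == e->prev_addr + (uint64_t)e->stride);

    switch (e->state) {
    case INITIAL:       e->state = correct ? STEADY : TRANSIENT;      break;
    case TRANSIENT:     e->state = correct ? STEADY : NO_PREDICTION;  break;
    case STEADY:        if (!correct) e->state = INITIAL;             break;
    case NO_PREDICTION: if (correct)  e->state = TRANSIENT;           break;
    }
    if (!correct)
        e->stride = (int64_t)(addr - e->prev_addr);  /* learn new stride */
    e->prev_addr = addr;

    /* Prefetch only while the access pattern looks regular. */
    *prefetch_addr = addr + (uint64_t)e->stride;
    return e->state == STEADY;
}
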
4.2 Global history buffer based prefetching

Nesbit and Smith[8] proposed a structure called the global history buffer (GHB) that provides a new way to organize the access history. The scheme not only improves the accuracy of correlation prefetching by minimizing stale entries, but also holds a more complete picture of the cache miss history, which can be used to design more effective prefetching techniques. One main difference from the approach in [1] is that [1] stores memory reference history, whereas this method stores a history of cache misses. Prior prefetching methods used a table indexed by some key, with the corresponding row containing the history information. The GHB based approach decouples indexing into the table from the storage of the prefetch-related history. More precisely, the following two tables are proposed (figure 2):

- An Index Table (IT), accessed with a key (typically the load (or store) instruction's PC, or a cache miss address), whose entry points to an entry in the GHB.

- The Global History Buffer (GHB), an n-entry FIFO table that holds the n most recent cache misses. Each entry also holds a link pointer, used to form linked lists (called address lists) of GHB elements, such that all elements of a list correspond to the same IT key.

Figure 2: GHB prefetch structure

Every cache miss is entered into the GHB in FIFO order. At insertion time, the new GHB entry is made to point to the existing pointee of the corresponding IT entry, and the IT entry is then made to point to the new GHB entry (in other words, the linked list is kept consistent). Depending on the key used to index the IT, and on whether stride or correlation prefetching is performed, many prefetching schemes can be implemented on top of the GHB structure. This project implements two of them (detailed in section 5.3):

1. PC/CS: the PC of the load (or store) instruction indexes the IT, and Constant Stride prefetching is performed.

2. PC/DC: the PC of the load (or store) instruction indexes the IT, and Delta Correlation prefetching is performed. This method is referred to as Local Delta Correlation in [8].
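
To make the bookkeeping concrete, here is a minimal sketch of the IT and GHB in C. It follows the FIFO-plus-linked-list organization described above, but uses the doubly linked address lists of my implementation (see section 5.3) rather than the paper's single link pointers; the sizes, field names, and direct-mapped IT are illustrative assumptions, not the actual Multi2Sim code.

#include <stdint.h>

#define GHB_SIZE 256   /* sizes from table 1; configurable in practice */
#define IT_SIZE   64
#define NIL       (-1)

struct ghb_entry {
    uint64_t pc;         /* IT key that owns this miss                 */
    uint64_t miss_addr;  /* address of the cache miss                  */
    int next;            /* next-older miss with the same key          */
    int prev;            /* next-younger miss (doubly linked, sec 5.3) */
};

struct it_entry {
    uint64_t key;        /* e.g. the load/store PC                     */
    int head;            /* most recent GHB entry for this key         */
};

static struct ghb_entry ghb[GHB_SIZE];
static struct it_entry it[IT_SIZE];
static int fifo_head;                 /* next FIFO slot to overwrite   */

void ghb_init(void)
{
    for (int i = 0; i < GHB_SIZE; i++) { ghb[i].next = ghb[i].prev = NIL; }
    for (int i = 0; i < IT_SIZE; i++)  { it[i].head = NIL; it[i].key = 0; }
}

/* Record a cache miss: push it into the FIFO and splice it in at the
 * head of the address list for its IT entry. */
void ghb_record_miss(uint64_t pc, uint64_t addr)
{
    int slot = fifo_head;
    fifo_head = (fifo_head + 1) % GHB_SIZE;

    /* Unlink the victim leaving the FIFO; this is the step the doubly
     * linked list makes trivial. */
    struct ghb_entry *v = &ghb[slot];
    if (v->prev != NIL)
        ghb[v->prev].next = NIL;
    else if (it[v->pc % IT_SIZE].head == slot)
        it[v->pc % IT_SIZE].head = NIL;   /* victim was a list head    */

    struct it_entry *ie = &it[pc % IT_SIZE];  /* direct-mapped IT      */
    if (ie->key != pc) {               /* new key: old list ages out   */
        ie->key = pc;
        ie->head = NIL;
    }
    v->pc = pc;
    v->miss_addr = addr;
    v->next = ie->head;                /* point at the old pointee ... */
    v->prev = NIL;
    if (ie->head != NIL)
        ghb[ie->head].prev = slot;
    ie->head = slot;                   /* ... then update the IT entry */
}

A prefetcher then walks the list from it[...].head via the next pointers to recover the recent miss addresses (and hence deltas) for that key.
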
5. IMPLEMENTATION

This section discusses the implementation work carried out as part of this project. The implementation was done in the Multi2Sim simulator and has been committed into the main source tree (i.e., it is available on the web as part of the main Multi2Sim simulator). The work is separated into three parts, each explained in detail in the following subsections.

5.1 Prefetch hint support

Prior to this work, Multi2Sim did not support prefetch hints of any kind, so the first phase of work involved supporting the x86 prefetch instruction[4]. This was essential for comparing hardware prefetching with software prefetching. Supporting the prefetch instruction involved expanding it into the appropriate micro-instructions (an effective address calculation followed by a memory operation). This expansion is done during the emulation stage; in terms of the simulator's architecture, this is equivalent to saying that the expansion of the ISA instruction into micro-instructions happens during the fetch stage of the pipeline. The implementation also involved creating a separate queue for prefetch requests in the issue unit. Prefetch requests are scheduled to memory by the issue unit much like loads and stores, but the implementation was kept separate so as to enable future work in this direction (for example, issuing prefetch requests at a lower priority).

An important part of implementing prefetch hint support was avoiding redundant prefetch requests to the memory. Even though the memory unit discards a prefetch request upon realizing a cache hit (more on this in section 5.2), sending the request to memory itself costs bandwidth; moreover, just to detect the hit, the cache unit must acquire a port and a lock on the directory, both of which may turn out to be expensive. To illustrate this, consider the badly inserted prefetch shown in figure 3.

for i = 1, n do
    prefetch(arr[i+1])
    use/define arr[i]
end for

Figure 3: Badly inserted prefetches

Assuming 10 elements of the array fit in one cache block, there will be 9 redundant prefetches for every necessary prefetch. Ideally, the compiler should either have unrolled the loop so that one iteration consumes a whole block of data, or inserted the prefetch conditionally (i.e., inside an if). Such simple bad scenarios are easily handled by keeping a small prefetch request history in the issue unit, which the issue unit queries before issuing prefetch requests to memory. A simple FIFO table suffices for many cases, and such a table was implemented as part of the x86 prefetch instruction support.

5.2 Memory system support for prefetching

Prefetching support in the memory system mainly involved incorporating the concept of prefetching into the NMOESI cache coherence protocol. For the most part, this was implemented similarly to how load requests are handled by the protocol. However, the implementation was kept fully separate from the load request implementation, primarily because of the following differences.

- Whenever two load (or store) requests to the same block arrive within a short duration of each other, such that the first request has not yet begun executing, the controller tries to coalesce them into one request. A prefetch request in the same situation, in contrast, just needs to be dropped: prefetches are not coalesced with other prefetches or with load (or store) requests.

- Whenever there is contention between a normal load (or store) request and a prefetch request, priority needs to be given to the load (or store) request (or maybe not).

- As a follow-up on the previous point, one heuristic currently implemented is to not retry a prefetch request when it fails to acquire a port or a directory lock. This assumes that there is already contention, and we do not want to increase it further by handling prefetch requests.

Keeping the implementation separate makes future modifications along these lines easy.

5.3 GHB based hardware prefetching

As mentioned earlier, two GHB based hardware prefetching schemes were implemented as part of this project. The implementation of the Index Table (IT) and the Global History Buffer (GHB) is common to both schemes; only the actual prefetching logic varies. Both tables were added as simple arrays of structures whose sizes are configurable through the memory configuration file. One point that deserves mention is that the linked list (address list) management of the GHB entries was done differently from the method in the paper. Since the paper's GHB entries carry only a single pointer for the linked list, deletion of elements is not trivial. The paper proposes keeping extra bits in each pointer (not used to find the pointee), based on which an invalid pointer (i.e., a pointer to an element no longer in the GHB) can be detected whenever the difference between the pointer value (including the extra bits) and the current head of the FIFO queue exceeds the size of the GHB. However, this is not perfect, as the head pointer can wrap around and still cause an incorrect match. To avoid the problem altogether, my implementation uses a doubly linked list. Implementing the GHB and IT also involved adding a mechanism to convey the PC of the load (or store) instruction from the processor to the memory system.

In the PC/CS implementation (see section 4.2), the PC is used to index the IT, and the lookup depth into the GHB can be specified in the configuration file. Based on the lookup depth d, the prefetcher looks at the past d cache misses for that PC and initiates a prefetch only if all of them share a common (constant) stride, using this stride to predict the next address.

To illustrate the necessity of the PC/DC scheme (see section 4.2), consider the following sequence of miss addresses[8]: 0, 1, 2, 64, 65, 66, 128, 129. These accesses have deltas (strides) of 1, 1, 62, 1, 1, 62, 1. The pattern is representative of a load that accesses the first three words of each column of a 2-D array. The PC/CS scheme (with a lookup depth of 2) would conclude a constant stride upon seeing the two 1s and hence wrongly prefetch along that stride. The PC/DC scheme overcomes this by using delta pairs to find correlation in the accesses: the last two deltas are matched against the history to find their most recent previous occurrence, and whatever delta followed that occurrence is used to determine the prefetch address. Though the paper does not discuss matching more than two deltas (i.e., a pair), my implementation takes this number as a parameter and can try to match longer delta sequences.
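
The two prediction routines can be sketched as follows, assuming the deltas of the current PC's address list have already been gathered (oldest first) into a deltas[] array; the helper names and the treatment of the lookup depth as a number of deltas are my assumptions, not the Multi2Sim code.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* PC/CS check with lookup depth d, taken here as the number of trailing
 * deltas compared: prefetch only if they are all equal; the common
 * stride then predicts the next address. */
bool cs_predict(const int64_t *deltas, size_t n, size_t d, int64_t *stride)
{
    if (n < d)
        return false;
    for (size_t k = n - d + 1; k < n; k++)
        if (deltas[k] != deltas[n - d])
            return false;
    *stride = deltas[n - 1];
    return true;
}

/* PC/DC lookup: match the trailing `width` deltas (2 in [8]; my
 * implementation allows more) against the history and return the delta
 * that followed their most recent previous occurrence. */
bool dc_predict(const int64_t *deltas, size_t n, size_t width,
                int64_t *next_delta)
{
    if (n < width + 1)
        return false;

    const int64_t *pattern = &deltas[n - width];  /* trailing deltas */

    /* Scan backwards for the most recent earlier occurrence. */
    for (size_t pos = n - width; pos-- > 0; ) {
        bool match = true;
        for (size_t k = 0; k < width; k++) {
            if (deltas[pos + k] != pattern[k]) { match = false; break; }
        }
        if (match) {
            *next_delta = deltas[pos + width];  /* delta that followed */
            return true;
        }
    }
    return false;
}

For example, after the misses at 64, 65 and 66 above (delta history 1, 1, 62, 1, 1), cs_predict with d = 2 sees the trailing 1, 1 and would wrongly prefetch address 67, while dc_predict matches the pair (1, 1) at its earlier occurrence and correctly predicts the next delta of 62, i.e. address 128.
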
6. EXPERIMENTATION

This section describes the simulation runs carried out to evaluate the different prefetching schemes; the results of the various runs are also tabulated here. All programs were built with the -O4 flag to GCC.

6.1 Choice of benchmarks

MediaBench[7] is a set of video and image codecs, including popular codecs such as mpeg, jpeg, adpcm, and epic. The binaries for MediaBench, along with their inputs and outputs, were already available on the Multi2Sim website for easy download. For this project, adpcm-dec, adpcm-enc, epic-dec and epic-enc were used, mainly because the simulation times of these programs were affordable.

PolyBench[9] is a benchmark suite designed for benchmarking compiler loop optimization techniques. Its programs contain a variety of loop patterns, which makes the suite a good candidate for evaluating prefetching. The programs in PolyBench are broadly divided into data mining, linear algebra, medley, and stencil computations. The following PolyBench programs were used in this project: floyd-warshall, trmm (triangular matrix multiply), dynprog (2-D dynamic programming), gramschmidt, a multiresolution analysis kernel, and mvt (matrix transpose and vector multiplication). Each program can be run on different input sizes (mini, small, standard, large, extra large); the small input size was used in this project, again mainly due to time constraints.

An attempt was made to run the SPEC2006 benchmarks, but the simulation time was extremely high even for the test input set. I tried to use SimPoints (along with the PinPoints tool), but the downloaded sources failed to build, even after manual attempts to correct them. Both the PolyBench and MediaBench programs were run to completion, and the number of CPU cycles they ran for is used as the measure of performance.

6.2 Base configuration

The cache configuration used to obtain the primary results is shown in table 1.

Parameter                     L1 value   L2 value
Sets
Assoc                         4          8
BlockSize
Latency                       4          2
Policy                        LRU        LRU
Ports                         4          4
H/W Prefetching               NO         YES
Prefetcher index table size              64
Prefetcher GHB size                      256

Table 1: Base configuration
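
For reference, caches like those in table 1 are described to Multi2Sim through an INI-style memory configuration file. The sketch below follows Multi2Sim's documented [CacheGeometry] section; the Sets and BlockSize values (absent from table 1) are assumed, and the commented Prefetcher* option names are illustrative placeholders rather than verbatim simulator syntax.

; Sketch of a Multi2Sim memory configuration fragment for table 1.
; Sets and BlockSize are assumed values.
[CacheGeometry geo-l1]
Sets = 128            ; assumed
Assoc = 4
BlockSize = 64        ; assumed
Latency = 4
Policy = LRU
Ports = 4

[CacheGeometry geo-l2]
Sets = 1024           ; assumed
Assoc = 8
BlockSize = 64        ; assumed
Latency = 2
Policy = LRU
Ports = 4
; PrefetcherType = GHB_PC_DC    (hypothetical option name)
; PrefetcherITSize = 64         (hypothetical option name)
; PrefetcherGHBSize = 256       (hypothetical option name)
; PrefetcherLookupDepth = 2     (hypothetical option name)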

Figures 4 and 5 show the impact of the various prefetching schemes on PolyBench and MediaBench respectively. As noted before, performance is measured as the total cycle count for executing the entire program. In the MediaBench suite, epic-dec shows the highest improvement, of 13.9%. In the PolyBench suite, the highest improvement is 25.1%. Both of these best results are obtained with the delta correlation scheme (PC/DC with a lookup depth of 2). For a few benchmarks, the performance improvement is low and hence cannot be clearly seen in the graphs. As will be shown later, a lookup depth of 3 for constant stride prefetching performs better than the lookup depth of 2 used in the base configuration. (In the graphs, the final number suffixed to a hardware prefetching scheme's name is its lookup depth.)

Figure 4: PolyBench performance for the base configuration (total cycles for no_prefetching, compiler_prefetching, pc_cs_2 and pc_dc_2)

Figure 5: MediaBench performance for the base configuration (total cycles for no_prefetching, pc_cs_2 and pc_dc_2)

For compiler prefetching, the programs were built with the -fprefetch-loop-arrays -mtune=core2 flags passed to GCC. Note that for MediaBench, compiler prefetching performance is not measured, since the binaries were downloaded directly from the Multi2Sim website and were not compiled with prefetching enabled.

6.3 Associativity

Here we examine how associativity affects prefetching performance. This is an important dimension along which to study prefetching, since prefetching might cause blocks that are currently in the working set to be evicted due to a way conflict. Figure 6 shows a comparison of prefetching with the L2 associativity reduced from 8 ways to 4 ways (table 1). Note that the cycle count for compiler prefetching is zero for the MediaBench programs, for the reason discussed previously. The first four bars for each benchmark represent the cycles for a 4-way L2 and the second four bars are for an 8-way L2. The impact of associativity is significant for the gramschmidt benchmark, while for the other benchmarks prefetching behavior remained almost the same.

Figure 6: Effect of associativity on each scheme (total cycles for compiler_4way, pc_cs_2_4way, pc_dc_2_4way, compiler_8way, pc_cs_2_8way and pc_dc_2_8way)

6.4 L1 vs L2 prefetching

As mentioned earlier, hardware prefetching can be performed both at the L1 and at the L2 cache level. Prefetching methods designed to work on a history of cache misses work better on L2 caches, while methods that work on the reference address stream work better on the L1 cache[1]. The results of this experiment agree with that idea. Figure 7 shows a comparison of prefetching on the L1 and L2 caches.

Figure 7: L1 vs L2 prefetching (total cycles for pc_cs_2_l1, pc_cs_2_l2, pc_dc_2_l1 and pc_dc_2_l2)

6.5 Lookup depth

This subsection evaluates the PC/CS (Program Counter / Constant Stride) prefetching scheme for lookup depths of 2, 3, and 4 (all other parameters remain as in table 1). No lookup depth comparison is done for delta correlation prefetching, as implementing such a lookup beyond 2 in hardware seemed unrealistic (although the simulator implementation does support higher lookup depths for DC). Figure 8 shows the comparison for constant stride prefetching. Except for floyd-warshall, a lookup depth of 3 seems to be the best for constant stride prefetching. The lookup depth influences prefetch accuracy and impact in two ways. (1) As the lookup depth increases, prefetch accuracy improves: since the history examined is longer (by definition), the prefetcher is less likely to issue prefetches along a wrong stride. (2) However, as the lookup depth increases, it takes longer for the prefetcher to start initiating prefetches (even on constant stride accesses). This may be significant if the innermost loop runs for a short duration.

Figure 8: Lookup depth comparison (total cycles for pc_cs_2, pc_cs_3 and pc_cs_4)

7. CONCLUSION

This project implemented and compared a few prefetching schemes, studying their behavior across different cache configurations and other parameters. One important aspect of cache configuration missed in the discussion above is the number of ports: a minimum of 4 ports on the prefetching cache was required for prefetching to show improvements, and any lower number resulted in quite a few degradations. The detailed graphs are not shown due to lack of space. A natural extension of this project would be to study prefetching behavior in a multi-core scenario[2]. Such a study was not carried out as part of this project since it involves significantly more implementation work to account for multiple cores accessing the same shared cache. It would also require implementing support for other x86 prefetch hints, such as prefetchnta[4], which make more sense in a multicore scenario. Finally, I would like to thank Prof. Matthew Jacob for his guidance, and the Multi2Sim developer team for their comments on the prefetcher implementation.

8. REFERENCES

[1] Jean-Loup Baer and Tien-Fu Chen. Effective hardware-based data prefetching for high-performance processors. IEEE Trans. Comput., 44(5):609-623, May 1995.
[2] Surendra Byna, Yong Chen, and Xian-He Sun. Taxonomy of data prefetching for multicore processors. Journal of Computer Science and Technology, 24:405-417, 2009.
[3] T.-F. Chen and J.-L. Baer. A performance study of software and hardware data prefetching schemes. SIGARCH Comput. Archit. News, 22(2), April 1994.
[4] Intel Corporation. Intel Architecture - Software Developer's Manual, Volume 2: Instruction Set Reference.
[5] John W. C. Fu, Janak H. Patel, and Bob L. Janssens. Stride directed prefetching in scalar processors. SIGMICRO Newsl., 23(1-2):102-110, December 1992.
[6] John L. Hennessy and David A. Patterson. Computer Architecture, Fourth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006.
[7] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. In Proc. of the 30th Int'l Symposium on Microarchitecture, Dec. 1997.
[8] K. J. Nesbit and J. E. Smith. Data cache prefetching using a global history buffer. In Proc. of the 10th International Symposium on High Performance Computer Architecture (HPCA-10), page 96, Feb. 2004.
[9] Louis-Noël Pouchet. PolyBench: The Polyhedral Benchmark Suite.
[10] Alan Jay Smith. Cache memories. ACM Comput. Surv., 14(3):473-530, September 1982.
[11] Rafael Ubal, Byunghyun Jang, Perhaad Mistry, Dana Schaa, and David Kaeli. Multi2Sim: A Simulation Framework for CPU-GPU Computing. In Proc. of the 21st International Conference on Parallel Architectures and Compilation Techniques, Sep. 2012.


More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

A Framework for the Performance Evaluation of Operating System Emulators. Joshua H. Shaffer. A Proposal Submitted to the Honors Council

A Framework for the Performance Evaluation of Operating System Emulators. Joshua H. Shaffer. A Proposal Submitted to the Honors Council A Framework for the Performance Evaluation of Operating System Emulators by Joshua H. Shaffer A Proposal Submitted to the Honors Council For Honors in Computer Science 15 October 2003 Approved By: Luiz

More information

ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation. Qisi Wang Hui-Shun Hung Chien-Fu Chen

ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation. Qisi Wang Hui-Shun Hung Chien-Fu Chen ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation Qisi Wang Hui-Shun Hung Chien-Fu Chen Outline Data Prefetching Exist Data Prefetcher Stride Data Prefetcher Offset Prefetcher

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Memory Consistency. Challenges. Program order Memory access order

Memory Consistency. Challenges. Program order Memory access order Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined

More information

Computer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi

Computer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi Computer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 32 The Memory Systems Part III Welcome back. (Refer Slide

More information

Principles in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008

Principles in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008 Principles in Computer Architecture I CSE 240A (Section 631684) CSE 240A Homework Three November 18, 2008 Only Problem Set Two will be graded. Turn in only Problem Set Two before December 4, 2008, 11:00am.

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information