Effective Prefetching for Multicore/Multiprocessor Systems

Suchita Pati and Pratyush Mahapatra

Abstract

Prefetching has been widely used by designers to hide memory access latency for applications with predictable memory access patterns. But in the age of multi-cores, using only per-core information to prefetch data into the cache hierarchy could lead to unnecessary cache pollution. Moreover, the additional coherence traffic and frequent operations such as downgrades can further reduce the overall system throughput. Additionally, in multi-threaded applications with synchronization, performance is determined by a few critical threads. Prefetching aimed at a non-critical thread could throttle critical thread progress and thus impact the system as a whole. In this project, we propose revisiting the prefetcher design questions of what, when, where and how to prefetch for multiprocessor architectures, which we believe will be extremely important for producer-consumer applications. Through our proposal targeting coherence invalidations due to prefetchers, we were able to significantly reduce coherence messages by augmenting the coherence protocol. We found that it did not impact performance of the prefetcher while also saving power. In our second proposal, on Global Prefetch Control, we characterized the usefulness of local prefetchers using parameters like accuracy, cache pollution and thread criticality. Combining information from all these parameters, a global decision can be made regarding when to and when not to throttle the local prefetchers.

I. INTRODUCTION

With the end of Dennard scaling, we have shifted from building faster, beefier cores to having multiple smaller cores working cooperatively together. While there has been a shift towards a multi-core era, many artifacts from the single-core era still remain and are still being used as they were earlier.
In our project, we wanted to revisit this and find out whether architecture techniques borrowed from the single-core era are still relevant in a multi-core scenario. This work focuses on prefetchers and their adoption in a multicore/multiprocessor system. We have focused on two problems with prefetchers in multicore systems.

First is the problem of additional coherence messages introduced due to prefetching. In a single-core system, we prefetch aggressively since the data would either be in the private caches or in the main memory. However, in multicore systems, we also need to consider the possibility of data being present in other cores' caches. In this work we carry out studies to answer the question of whether it is worth fetching data into the private cache while inducing additional coherence messages, or whether there are other optimizations. We propose augmenting the MESI coherency protocol to ignore lines which are already present in the cache hierarchy.

*Department of Computer Sciences, University of Wisconsin-Madison

Second is the problem of cache pollution induced at the shared last-level cache due to individual local prefetchers. While cache pollution due to prefetching is a problem faced in single-core systems as well, it takes on a completely new dimension in multicore systems, with many more permutations and combinations to be explored. We propose the idea of a Global Prefetch Control which takes into consideration cache pollution at the LLC, prefetch accuracy and thread criticality to identify the usefulness of the individual prefetchers, and throttles them if they are found to be useless.

The paper is organized as follows. In Section II we briefly describe some related work that has been done in this area. In Section III we introduce the simulator and the system configuration used for our experiments. We proceed to show experimental evidence on cache pollution and additional coherence messages introduced by prefetchers in a multicore system in Section IV.
In Section V we introduce our proposals for improving prefetch efficiency without impacting performance in a multicore system. In Section VI we propose the idea of Global Prefetch Control. In Section VII we present our results, Section VIII lists a few ideas for future work, and we conclude in Section IX.

II. RELATED WORK

There has been previous work on inter-core prefetching. We list a few examples here. Inter-core prefetching [2] uses the idea of helper threads to speed up computation. It uses idle threads to prefetch and uses thread migration to switch to the thread with the prefetched data. The paper also mentions the problem of cache invalidations due to prefetching but does not propose a solution to it. [3] also uses helper threads to prefetch useful data. [4] uses a spare core to prefetch data by executing the nth instance of all non-control-flow instructions. [5] considers coherence traffic invalidations and addresses them by identifying unshared regions of memory and prefetching from them. [6] proposes new metrics to identify critical threads and new hardware additions which allow critical threads to be identified at runtime. [1] uses inbuilt counters to predict critical threads. The idea of global prefetch control was first introduced in [9], which proposes controlling local prefetchers in a multiprogrammed environment using local accuracy, global pollution and memory contention. Ours is the first work that proposes a global control that also includes the aspect of thread criticality while prefetching, making it a truly global and application-performance-sensitive technique.


TABLE I
SYSTEM CONFIGURATION

Core                 Out-of-Order
L1 I Cache           32 KB, 4-way Set Associative
L1 D Cache           32 KB, 2-way Set Associative
L2 Cache             256 KB, 16-way Set Associative
L3 Cache (Shared)    32 MB, 16-way Set Associative
Coherency Protocol   MESI Directory Based
Prefetchers          2 L2 Stream Prefetchers for Instr. and Data

Fig. 1. % of prefetches in L3 leading to invalidations for a dual core system

III. ANALYSIS TOOLS AND SYSTEM CONFIGURATION

A. SIMULATOR

For our evaluation, we used ZSIM [7], which is a fast x86-64 simulator developed by Daniel Sanchez, et al. It is a Pin-based simulator which focuses on simulating memory hierarchies and large, heterogeneous systems. ZSIM allowed us to scale to large core counts, and it also came with an inbuilt stream prefetcher which we could use out of the box.

B. BENCHMARKS

We used the PARSEC multithreaded benchmarks [8]. We ran Blackscholes, Dedup, Facesim, Ferret, Raytrace and Swaptions. These benchmarks were chosen since they offer a mix of high and low sharing. We also enabled the MAGIC OPS hooks in PARSEC so that only the Region of Interest of each benchmark is simulated.

C. SYSTEM CONFIGURATION

We wanted to ensure our system mimics a modern processor, and hence we set up our configuration to be similar to that of Intel's Broadwell server. The configuration is described in Table I.

IV. INITIAL STUDIES

Coherence Traffic Due to Prefetchers: In this study, we analyzed the additional coherence traffic induced by prefetchers in a multicore system. We are particularly interested in GETS prefetch requests which lead to a downgrade in other caches. That scenario could potentially lead to further coherence traffic: the cache whose data was downgraded by the prefetch request may ask for the data in Exclusive state again, thus invalidating the prefetched copy. The prefetch would then be a wasted prefetch, and it would also have led to additional coherence traffic while delaying execution in one of the cores.
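The quantity measured in this study can be sketched in a few lines of Python. This is a minimal illustration and not the ZSIM instrumentation used for the experiments: the trace format, the `Line` record and the function name `downgrade_stats` are hypothetical, and the directory model is deliberately simplified (no evictions, no L3 capacity). It counts prefetch GETS requests that find the line owned in Modified or Exclusive state by another core, and reports them both as a percentage of prefetches and as a percentage of all accesses.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    """Directory view of one cache line (hypothetical, simplified)."""
    state: str = "I"                       # "M", "E", "S" or "I"
    sharers: set = field(default_factory=set)


def downgrade_stats(trace):
    """trace: iterable of (core, addr, kind), kind in {"load", "store", "prefetch"}.

    Returns (% of prefetches causing a downgrade,
             % of all accesses that are downgrade-causing prefetches).
    """
    directory = {}
    total = prefetches = downgrading = 0
    for core, addr, kind in trace:
        total += 1
        line = directory.setdefault(addr, Line())
        if kind == "store":
            # GETX: the requester becomes the sole owner, other copies are invalidated.
            line.state, line.sharers = "M", {core}
            continue
        # Loads and prefetches are both GETS requests.
        if kind == "prefetch":
            prefetches += 1
        if not line.sharers:
            line.state, line.sharers = "E", {core}
        elif line.state in ("M", "E"):
            if line.sharers - {core}:
                # Another core owns the line: the GETS forces a downgrade to Shared.
                if kind == "prefetch":
                    downgrading += 1
                line.state = "S"
                line.sharers.add(core)
        else:
            line.sharers.add(core)

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return pct(downgrading, prefetches), pct(downgrading, total)


# Core 0 writes a line, core 1 prefetches it (downgrade), core 0 writes again
# (invalidating the prefetched copy), then core 1 prefetches an untouched line.
trace = [(0, 0x40, "store"), (1, 0x40, "prefetch"),
         (0, 0x40, "store"), (1, 0x80, "prefetch")]
print(downgrade_stats(trace))
```

The second access in the example trace is exactly the wasted prefetch described above: it downgrades core 0's line and is invalidated by the very next store.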
Figure 1 shows that a significant percentage of prefetch accesses in L3 lead to downgrades of cache lines in other cores. Figure 2 shows the prefetch accesses leading to downgrades as a percentage of all L3 accesses combined. The number stays significant, especially for high-sharing workloads like dedup and ferret. We decided to explore the impact on systems with varying numbers of cores.

Fig. 2. % of prefetches over all accesses to L3, leading to invalidations for a dual core system

Figure 3 shows that the problem exists even in systems with higher core counts. Moreover, we find that the number of coherence invalidations due to prefetches becomes a significant factor even for benchmarks with low sharing.

Fig. 3. % of prefetches over all accesses to L3, leading to invalidations for multicore systems

Cache Pollution Due to Prefetchers: We next study the problem of cache pollution by analyzing the performance of benchmarks with L2 prefetchers turned on and off. The study was done over many PARSEC benchmarks, but we chose to focus on ferret, as its per-thread IPC degraded with increasing threads/cores, implying high memory access contention. As shown in Figure 4, the per-thread IPC of ferret does not improve with prefetchers turned on, implying there is no performance gain from using prefetchers. However, from Figures 5 and 6, it can be seen that both L3 cache misses and accesses increase on turning on prefetchers. This implies that even though the prefetchers are fetching more lines from L3 and memory, the prefetched lines are either evicting useful cache lines or are simply getting evicted by other cores' prefetches, leading to larger miss rates and no performance gain. Given this observation, we were motivated to control local prefetchers using global information, i.e., cache pollution at L3, relative accuracy and thread criticality, not only to reduce this interference but also to prefetch data that is useful for improving overall application performance.

Fig. 4. IPC of ferret with increasing thread/core count with prefetchers turned off (left) and on (right)

Fig. 5. L3 Misses of ferret with increasing thread/core count with prefetchers turned off (left) and on (right)

Fig. 6. L3 Accesses of ferret with increasing thread/core count with prefetchers turned off (left) and on (right)

V. COHERENCE TRAFFIC REDUCTION FOR PREFETCHERS

Our initial studies clearly indicate that coherence downgrades due to prefetch requests are significant and should be looked at in greater detail. As systems scale and incorporate more and more cores, we find that the number of such invalidations increases further. The problem could potentially worsen in systems with a high number of cores, since the cache-to-cache latency will be much higher due to a much more complex interconnect, and the wasted transfers also cost a large amount of energy. We propose two solutions to counter this problem:

A. No Inter Core Cache Line Fetching
Ignore a GETS request generated by a prefetcher if the data is present in the cache hierarchy.

B. No Modified/Exclusive Lines Fetching
Ignore a GETS request generated by a prefetcher if the data is present in Exclusive or Modified state in the cache hierarchy.

In both of the above proposals, we augment the cache coherence protocol. In the first proposal, we disallow any prefetch data transfer when the data is already present in the cache hierarchy. Prefetches are only allowed from memory. This helps avoid the additional cache invalidations and downgrades. To implement this proposal, we track all coherence requests sent to L3 and add a special flag that allows us to identify requests originating from prefetchers. If we find that the requested data is not in Invalid state, i.e., the data is present in one of the other caches in Shared, Modified or Exclusive state, we drop the coherence request and send a nack to the requesting core.

In the second proposal, we follow a similar approach to the first but optimize it further. Since prefetches are GETS requests, we can safely allow data that is present in Shared state to be sent to the prefetcher without downgrading permissions in other cores.
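The two filtering rules can be sketched as a single decision function at the directory. This is a hedged illustration rather than the actual change made to the simulator's MESI implementation: the names `State`, `Policy` and `handle_gets`, and the action strings it returns, are hypothetical and chosen only for readability.

```python
from enum import Enum


class State(Enum):
    """Directory state of the requested line in the other caches."""
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Policy(Enum):
    BASELINE = 0        # unmodified MESI: prefetches are treated like demand GETS
    NO_INTER_CORE = 1   # proposal A: drop prefetch GETS if the line is in any other cache
    NO_MOD_EXCL = 2     # proposal B: drop prefetch GETS only if the line is in M or E


def handle_gets(state, is_prefetch, policy):
    """Return the (hypothetical) action the directory takes for a GETS request."""
    if state is State.I:
        # No other private cache holds the line: nothing to downgrade.
        return "serve_from_l3_or_memory"
    if is_prefetch:
        if policy is Policy.NO_INTER_CORE:
            return "nack"
        if policy is Policy.NO_MOD_EXCL and state in (State.M, State.E):
            return "nack"
    if state is State.S:
        # Shared copies can be supplied without changing anyone's permissions.
        return "forward_shared"
    # Line is owned in M or E elsewhere: the owner must be downgraded to Shared.
    return "downgrade_owner"


for policy in Policy:
    row = [handle_gets(s, True, policy) for s in State]
    print(policy.name, row)
```

Demand requests (`is_prefetch=False`) take the same path under all three policies, so only the flag that marks a request as prefetcher-generated changes the outcome.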
The implementation of the second proposal is similar to that of the first.

VI. GLOBAL PREFETCH CONTROL

In order to reduce the pollution caused by multiple local prefetchers at the last-level cache, and to provide the illusion of a global prefetch unit whose objective is to improve overall application performance instead of local core performance, we put forth the idea of a Global Prefetch Throttler. This is a unit next to the LLC which checks each individual prefetcher's pollution at the LLC, its accuracy, and the criticality of the thread running on the core it is prefetching for, and throttles local prefetchers which aren't fetching useful data for the application. The Global Prefetch Throttler periodically measures the following:

- Prefetcher Accuracy: Percentage of prefetches which were indeed accessed by the core.
- Prefetcher Pollution: Number of useful cache lines evicted by the prefetcher. If a demand access for a line evicted by a prefetch (a normal or a previously prefetched line) arrives before the prefetched line is accessed, the prefetcher is penalized for pollution.
- Thread Criticality: Whether the thread running on the core is in a critical section or is a trailing barrier thread.

Global Control Rules

We implement the global throttler to check the three parameters mentioned above for all the local prefetchers at certain intervals (on every 1000 L3 misses) and to control global prefetching by throttling the local prefetchers based on the rules in Figure 11. The thresholds for pollution and accuracy are calculated at each interval from the median of the pollution and accuracy values of all the local prefetchers, instead of being fixed to certain values. Out of all the rules in the table, the two highlighted rows are the best and worst cases of prefetcher performance. The worst case is when a less critical thread with low accuracy is causing high pollution at the LLC, in which case it is throttled.
The best case is when a critical thread with high accuracy is prefetching without causing any pollution at the LLC.
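One interval of the throttler's decision can be sketched as follows. This is a minimal sketch under stated assumptions: the full rule table of Figure 11 is not reproduced here, so the code throttles only in the worst case described above (non-critical thread, accuracy below the median, pollution above the median) and leaves every other combination running. The names `PrefetcherStats` and `throttle_decisions` are hypothetical, and the function is assumed to be called once per interval, for example every 1000 L3 misses.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PrefetcherStats:
    """Per-core counters collected at the LLC during one interval (hypothetical)."""
    issued: int               # prefetches issued by this core's prefetcher
    used: int                 # prefetched lines later accessed by the core
    polluting_evictions: int  # useful lines evicted by this prefetcher
    critical: bool            # thread in a critical section or trailing at a barrier

    @property
    def accuracy(self):
        return self.used / self.issued if self.issued else 0.0


def throttle_decisions(stats):
    """stats: dict core_id -> PrefetcherStats. Returns dict core_id -> throttle?"""
    # Thresholds are the medians over all local prefetchers, recomputed each interval.
    acc_threshold = median(s.accuracy for s in stats.values())
    pol_threshold = median(s.polluting_evictions for s in stats.values())
    decisions = {}
    for core, s in stats.items():
        low_accuracy = s.accuracy < acc_threshold
        high_pollution = s.polluting_evictions > pol_threshold
        # Only the worst case from the text is encoded; other rows of the
        # rule table are assumed to leave the prefetcher untouched.
        decisions[core] = (not s.critical) and low_accuracy and high_pollution
    return decisions


interval = {
    0: PrefetcherStats(issued=100, used=90, polluting_evictions=1, critical=True),
    1: PrefetcherStats(issued=100, used=20, polluting_evictions=30, critical=False),
    2: PrefetcherStats(issued=100, used=60, polluting_evictions=5, critical=False),
    3: PrefetcherStats(issued=100, used=70, polluting_evictions=10, critical=False),
}
print(throttle_decisions(interval))
```

Using medians makes the thresholds relative: a prefetcher is judged against its peers in the same interval rather than against a fixed cutoff, which matches the description of the thresholds above.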


Fig. 7. Number of Prefetches served by L3 for a dual core and a 32 core system. Comparing Normal Approach and Optimization 1

Fig. 8. Number of Prefetches served by L3 for a dual core and a 32 core system. Comparing both optimizations

Fig. 9. Comparing L3 Hit Rates of Normal Approach and Optimization 1

VII. EVALUATION

A. RESULTS: COHERENCE TRAFFIC REDUCTION

Our results show that we were able to reduce the prefetch accesses served by L3 (Figure 7 and Figure 8). This is because we ignore prefetch requests to lines already present in the cache hierarchy, which leads to a reduction in coherence traffic. In addition, we should note that as we scale cores, the number of prefetch requests sent to L3 increases exponentially. In such cases our proposal bears major fruit, since we find that most of the requests are for lines already present in the cache hierarchy, and we thus end up significantly reducing coherence traffic. While we did not find any discernible impact on IPC due to our proposal, we found an improvement in L3 hit rates for some benchmarks (Figure 9 and Figure 10). We believe that, due to the streamlining of the prefetcher, we have been able to reduce coherence pollution by a small amount, leading to higher L3 hit rates in some scenarios. Understanding the exact nature of these results, and whether we could leverage them further, is part of our future work.

B. RESULTS: GLOBAL PREFETCH CONTROL

The implementation of the Global Prefetch Throttler is ongoing work. We have been able to implement a unit which tracks every prefetcher's accuracy and pollution at the LLC; however, identification of critical threads and throttling of local prefetchers are yet to be implemented. There were several deadlock and simulation slowdown issues with ZSIM which we have overcome, and the implementation looks feasible now.

VIII. FUTURE WORK

There is scope for additional work in this domain to streamline prefetching in multicore systems further.
Here are a few items that we intend to look at in the future:

- Augmenting the coherence protocol further by setting a hint bit when a prefetch request is made for a line in Exclusive or Modified state. The cache line is eventually sent to the requesting prefetcher on an invalidation, a downgrade, or when a demand request is made for it, whichever occurs first.
- Augmenting the cache and interconnect model to analyze cache-to-cache transfer latency, and using a more accurate memory simulator to simulate memory contention.
- For global throttling, memory contention can be the fourth parameter to consider, as aggressive prefetching can impact demand accesses at high memory contention.
- Evaluating the proposal using different types of prefetchers.

Fig. 10. Comparing L3 Hit Rates of both optimizations

IX. CONCLUSIONS

In this project, we put forth two proposals to improve prefetching in a multicore system. We first looked at the impact of prefetchers on additional coherence traffic and on invalidations/downgrades of shared cache lines. We proposed augmenting the coherence protocol to ignore prefetch requests for cache lines in Exclusive and Modified state. This allowed us to reduce coherence traffic by a significant amount while also having the unintended effect of improving L3 hit rates in some scenarios. We also propose Global Prefetch Control, which we believe, by having a holistic view of the system state (pollution) and requirements (criticality), can significantly improve overall system and application performance while also saving power wasted on useless prefetches.

Fig. 11. Global Prefetch Control Rules

REFERENCES

[1] Bhattacharjee, Abhishek, and Margaret Martonosi. Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. ACM SIGARCH Computer Architecture News. Vol. 37. No. 3. ACM.
[2] Kamruzzaman, Md, Steven Swanson, and Dean M. Tullsen. Inter-core prefetching for multicore processors using migrating helper threads. ACM SIGPLAN Notices 46.3 (2011).
[3] Kim, Dongkeun, et al. Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors. Code Generation and Optimization, CGO International Symposium on. IEEE, 2004.
[4] Ganusov, Ilya, and Martin Burtscher. Future execution: A hardware prefetching technique for chip multiprocessors. Parallel Architectures and Compilation Techniques, PACT, International Conference on. IEEE.
[5] Cantin, Jason F., Mikko H. Lipasti, and James E. Smith. Stealth prefetching. ACM SIGOPS Operating Systems Review. Vol. 40. No. 5. ACM.
[6] Du Bois, Kristof, et al. Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior. ACM SIGARCH Computer Architecture News 41.3 (2013).
[7] Sanchez, Daniel, and Christos Kozyrakis. ZSim: Fast and accurate microarchitectural simulation of thousand-core systems. ACM SIGARCH Computer Architecture News. Vol. 41. No. 3. ACM.
[8] Bienia, Christian, et al. The PARSEC benchmark suite: Characterization and architectural implications. Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM.
[9] Lee, J., Lakshminarayana, N. B., Kim, H., & Vuduc, R. (2010, December). Many-thread aware prefetching mechanisms for GPGPU applications. In Microarchitecture (MICRO), Annual IEEE/ACM International Symposium on. IEEE.
