Performance Evaluation of data-push Thread on Commercial CMP Platform

Jianxun Zhang 1,2, Zhimin Gu 1, Ninghan Zheng 1,3, Yan Huang 1, Min Cai 1, Sicai Yang 1, Wenbiao Zhou 1
1 School of Computer, Beijing Institute of Technology, Beijing, China
2 Network Center, Tianjin University of Traditional Chinese Medicine, Tianjin, China
3 Department of Computer Science and Technology, Tsinghua University, Beijing, China
Zhangjx@bit.edu.cn, zmgu@x263.net

Abstract: Helper threading is a promising prefetching technique for bridging the memory wall on contemporary CMP platforms. However, the synchronization between the application and the helper thread is critical to the performance improvement. Previous research focused mainly on loop-count based synchronization, which is only suitable when the main thread has enough computation workload. For the case of a small computation workload in the main thread, this paper presents a multi-parameter helper thread prefetching model. Using memory intensive workloads, the paper gives a detailed performance evaluation of the data-push (helper) thread on a commercial CMP platform, and also evaluates the applicability of data-push thread prefetching in a multi-process environment. A methodology covering workload selection, measurement metrics and the throttling effect of the hardware prefetcher is described. The evaluation results using data-push threads on em3d, mcf and mst show gains of 12%, 24% and 42% respectively when the hardware prefetcher is configured properly.

Keywords: pre-execution; data prefetching; push/helper thread

I. INTRODUCTION
The chip multiprocessor (CMP) is the trend in contemporary microprocessor design. It provides a tremendous ability to perform arithmetic computations that do not involve slow memory operations such as last level cache (LLC) misses. Despite advances in memory technology, the speed gap between processor and memory is still widening, especially for memory intensive workloads with heavy pointer chasing. Prefetching is a promising technique for narrowing this gap. Hardware prefetch schemes are quite effective on many memory access patterns, such as strides [1] or previously encountered patterns [2], but they have trouble predicting the access patterns of pointer chasing workloads. In recent years, pre-execution/pre-computation prefetching approaches based on helper threads have attracted wide attention [3-12]. The pre-execution/pre-computation scheme has been found to complement hardware prefetching well, and it can prefetch addresses that hardware prefetchers cannot [3]. Software prefetching methods based on pre-execution are proposed in [13,14]; to our knowledge, these are the first two proposals in which the helper thread was implemented on real commercial platforms, namely an SMT platform and a CMP platform. Motivated by these works, we apply the pre-execution technique to commercial CMP platforms. We proposed a helper thread framework on CMP in [15]. In this paper, we present a multi-parameter helper thread prefetching model that suits various applications by adjusting its parameters, especially those with a small computation workload in the main thread. We evaluate the bandwidth effect of the data-push thread, as well as the performance effect of the hardware prefetcher on the data-push thread. To study the performance improvement of the data-push thread mechanism, we test it on an Intel Core 2 Quad processor with three memory intensive benchmarks from SPEC 2006 and Olden.
The experimental results for em3d, mcf and mst are better than those in the literature [5,6,13,14,19], with performance gains of 12%, 24% and 42% respectively.

II. RELATED WORK
Data prefetching is a promising technique for bridging the ever-widening gap between processor and memory speed, and it motivates the various helper threading mechanisms implemented on different platforms. Collins et al. use speculative precomputation for prefetching [5]. Luk describes the use of helper threading on simultaneous multithreading machines [8]. Both [5] and [8] require special hardware support. Liao et al. [7] use a binary rewriter to generate helper threads. Song et al. [13] propose a compiler framework for helper threading on the Sun UltraSPARC IV+ processor. They introduce a cost-benefit method to select candidate loops; after a candidate loop is selected, the termination condition of the helper thread is added to the relevant loop of the main thread. The helper thread synchronizes with the main thread every several loop iterations to check whether it has run too far from the main thread, and if the termination condition is satisfied the helper thread is terminated. Kim et al. [14] apply the pre-execution technique on a real multithreaded machine, the Intel Pentium 4 with Hyper-Threading technology; they construct helper threads in software and propose two synchronization mechanisms, loop-based and sample-based. Lee et al. [19] propose a coarse-grain synchronization mechanism to control the execution rhythm between the application and the helper thread; in essence, they reduce the synchronization frequency by using lightweight general semaphores. None of the above proposals considers the small-workload scenario in the main thread, in which the main thread has so little computation that it runs faster than the helper thread even when the helper thread is lightweight; their loop-count based synchronization is then ineffective, because the helper thread's work is useless during the synchronization interval. Different from these earlier works, we propose a K-forward, P-push data-push thread scheme targeted at real CMP platforms.

In the case of a small computation workload, the target LLC misses are shared between the main thread and the helper thread. The helper thread does not always prefetch: it skips K iterations to guarantee that it stays ahead of the main thread. Based on our previous work [15], a performance model is developed to analyze the relationship between the parameters K and P. In this paper we give a detailed evaluation of the data-push thread on a real CMP platform, since the simplistic memory model of a simulator can produce overly optimistic results [16].

III. PERFORMANCE MODEL OF DATA-PUSH PREFETCHING
The latest CMP architectures often have a shared last level cache (LLC). This makes it possible for a helper thread running on an otherwise idle core to push data into the LLC before it is actually needed by the computing core, so that the main thread hits in the LLC and the miss penalty is reduced. This approach is regarded as a processor-side push-based data prefetching method [17]. In this paper we call the prefetching thread the push thread and the computing thread the main thread; the two execute on different cores that share the LLC.

For memory intensive workloads, especially pointer chasing workloads, the execution time is dominated by the memory dependence chain of missing loads, since the computation (including cache hits) is either overlapped with the memory access latency or accounts for only a small portion of the overall execution time. We therefore ignore the computation time in our model. Let N be the length of a memory dependence chain containing N dependent LLC missing loads. We do not model the latency of the lower cache levels because it can be hidden by out-of-order execution. Defining the execution time as the time to resolve all these missing loads, the performance bound of the original program is

\[ T_{\mathrm{original}} = N \, M \tag{1} \]

where M denotes the LLC miss penalty.

To model the performance potential of data-push thread prefetching, we first explain how the data-push thread works; the scenario is depicted in Figure 1. To guarantee that the main thread benefits from the push thread, the push thread runs ahead of the main thread at a suitable distance from the beginning. Here K represents the run-ahead distance. Initially the main thread suffers K LLC misses, while the push thread pushes LLC data from load K+1 onwards until it has pushed P LLC loads, where P is the number of loads pushed by the push thread. A block is defined as K plus P loads. Due to the lack of computation workload, we assume the time during which the main thread consumes the pushed data in the LLC is zero, so the push thread synchronizes with the main thread at the block boundary and aligns its current working point; it then skips another K LLC loads and starts to prefetch the next P LLC loads. In this scenario, when K = P, the chain of N dependent loads is divided into N/(K+P) blocks and NP/(K+P) LLC missing loads are prefetched by the push thread. This suits the situation where the main thread and the helper thread have the same workload. The main thread's performance bound is the time to resolve the remaining N - NP/(K+P) loads plus the synchronization overhead between the main thread and the push thread. In special cases, such as mcf, there can be two synchronizations, one at the start and one at the end of the helper thread. Let O denote the overhead of a single synchronization between the main thread and the push thread.
The performance bound on execution time is then

\[ T_{\mathrm{push}} = \frac{N}{K+P}\,K\,M + \frac{N}{K+P}\,O \tag{2} \]

Figure 1. Data-push thread working scenario (K = P). Main: main thread; Push: push thread; K: run-ahead distance; P: push distance.

Suppose the prefetch accuracy of the push thread is x%; for simplicity we assume it is constant. The performance bound can then be extended to

\[ T_{\mathrm{push}}^{\mathrm{acc}} = \frac{N}{K+P}\bigl(K + P\,(1-x\%)\bigr)\,M + \frac{N}{K+P}\,O \tag{3} \]

When P < K < 2P, the push thread working scenario is depicted in Figure 2. Initially the push thread runs ahead of the main thread by distance K and pushes P LLC missing loads. When the push thread completes pushing the P missing loads, it synchronizes with the main thread, which at this point has also completed P missing loads. The push thread obtains the main thread's current pointer and starts to push data from load K+P. As shown in Figure 2, the yellow bar represents the K-P missing loads: the main thread benefits from the pushed data only after it completes these K-P missing loads, while during the same time the push thread completes K-P missing loads from its start point K+P. During the period t1 the main thread and the push thread load the same data; once their current pointers are not exactly the same they interfere with each other, and the push thread causes LLC cache line conflicts or cache pollution. We define the block as K+2P. The push thread synchronizes with the main thread twice per block, and in one block the main thread loads 2P data items (2P = K + (P - (K - P))). From this analysis, when K > P the best values of K are those divisible by P, since this reduces the interference between the main thread and the push thread. This suits the situation where the main thread has slightly more workload than the helper thread. The performance bound on execution time is then

\[ T_{\mathrm{push}}^{K>P} = \frac{N}{K+2P}\,2P\,M + \frac{2N}{K+2P}\,O \tag{4} \]

and, when the prefetch accuracy of the push thread is considered,

\[ T_{\mathrm{push}}^{K>P,\,\mathrm{acc}} = \frac{N}{K+2P}\bigl(2P + K\,(1-x\%)\bigr)\,M + \frac{2N}{K+2P}\,O \tag{5} \]

Figure 2. Data-push thread working scenario (P < K < 2P). Main: main thread; Push: push thread; K: run-ahead distance; P: push distance.
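To make the bounds concrete, the following is a small worked instance of equations (1) and (3); the numbers (N, M, K, P, O and x) are assumed for illustration and are not measurements from the paper.

```latex
% Assumed values: N = 10^6 missing loads, M = 300 cycles,
% K = P = 64, O = 2000 cycles per synchronization, x = 90.
\[
T_{\mathrm{original}} = 10^{6} \times 300 = 3 \times 10^{8}\ \text{cycles}
\]
\[
T_{\mathrm{push}}^{\mathrm{acc}}
  = \frac{10^{6}}{128}\bigl(64 + 64\,(1 - 0.9)\bigr) \times 300
    + \frac{10^{6}}{128} \times 2000
  \approx 1.81 \times 10^{8}\ \text{cycles}
\]
% Predicted upper bound on speedup: 3.0e8 / 1.81e8, roughly 1.66x.
```

Even with K = P, where at most half of the missing loads can be covered, the bound already stays well below 2x, which is consistent with the measured speedups reported in Section V.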

The model targets a small computation workload in the main thread, so the precondition is K >= P. In the case of a small computation workload, if K were smaller than P the main thread would start taking the missing loads by itself right after consuming the pushed LLC data; the push thread could then no longer run ahead of the main thread, putting heavy pressure on the shared LLC in terms of bandwidth and cache line conflicts. In the multi-parameter data-push prefetch model, setting K < P means there is a certain amount of computation workload in the main thread that still cannot fully overlap with the helper thread; in that scenario K should indeed be set smaller than P so that the push thread can push more data into the LLC. When K = 0 there is enough computation workload in the main thread, and the synchronization mechanism is the same as in [13,19]. However, in a real situation, if the two threads are unaware of each other's progress, severe cache pollution results.

IV. EVALUATION SETUP
A. Experiment Environment
This section briefly describes the experimental test bed used in this study; Table I summarizes its key architectural and system features. The Intel Core 2 Quad Q6600 processor combines two Core 2 Duo E6600 (Conroe) dies in a single multi-chip module (MCM). On an Intel Core 2 CPU there are four hardware prefetchers, the DPL, ACL, DCU and IP prefetchers; their detailed functions are described in [18]. Each L1 data cache is equipped with two prefetchers, the DCU prefetcher and the IP prefetcher; each L2 cache has one DPL prefetcher and one ACL prefetcher. Each prefetcher can be enabled or disabled independently. These prefetchers can improve performance, mostly when accessing successive data, but they can also cause performance degradation when unneeded data evicts required lines. In this paper we therefore also evaluate the performance of push thread prefetching under different hardware prefetcher configurations.

TABLE I. EXPERIMENTAL SETUP
Processor:  Intel Core 2 Quad Q6600
Memory:     2 GB (DDR 667, non-ECC)
L1 D-Cache: 32 KB x 4, 8-way set-associative, 64-byte lines
L1 I-Cache: 32 KB x 4, 8-way set-associative, 64-byte lines
L2 Cache:   4096 KB x 2, 16-way set-associative, 64-byte lines
FSB Speed:  1066 MHz
Compiler:   gcc 4.3, -O2
OS:         Fedora 9

B. Measuring Method
The evaluation is performed on a real processor, the Intel Core 2 Quad. We insert timing functions to record the execution time of the benchmarks. The code regions with a high concentration of L2 cache misses are identified with a profiling tool such as Intel VTune, and the data-push thread is constructed manually after this offline profiling.
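To illustrate what such a manually constructed thread pair can look like, the following is a minimal sketch of the K-forward, P-push scheme from Section III for an array-of-pointers traversal. Everything here (the record type, the workload, the counter-based synchronization and the constants) is a hypothetical simplification for illustration, not the paper's actual code.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical workload: an array of pointers to scattered records, so
 * the main loop misses the LLC on most dereferences (mcf-like).        */
typedef struct { long data[8]; } record_t;      /* one 64-byte line */

#define NREC (4L * 1024 * 1024)
#define K 64              /* run-ahead distance: loads skipped per block */
#define P 64              /* push distance: loads pushed per block (K=P) */

static record_t **records;
static atomic_long main_pos;    /* main thread's published progress */
static long result;

/* Main thread: the original loop, plus a published progress counter. */
static void *main_thread(void *arg) {
    (void)arg;
    for (long i = 0; i < NREC; i++) {
        atomic_store_explicit(&main_pos, i, memory_order_relaxed);
        result += records[i]->data[0];          /* the "real" work */
    }
    return NULL;
}

/* Push thread: in each block of K+P records, skip the first K (the main
 * thread takes those misses itself), touch the next P so their lines are
 * pulled into the shared L2, then realign at the block boundary.        */
static void *push_thread(void *arg) {
    (void)arg;
    for (long base = 0; base < NREC; base += K + P) {
        for (long j = base + K; j < base + K + P && j < NREC; j++) {
            volatile long sink = records[j]->data[0];   /* push one line */
            (void)sink;
        }
        /* Block-boundary synchronization (the model's overhead O):
         * wait until the main thread has taken its K self-misses.      */
        long bound = (base + K < NREC) ? base + K : NREC - 1;
        while (atomic_load_explicit(&main_pos, memory_order_relaxed) < bound)
            ;   /* spin */
    }
    return NULL;
}

int main(void) {
    records = malloc(NREC * sizeof *records);
    for (long i = 0; i < NREC; i++)
        records[i] = malloc(sizeof(record_t)); /* scattered allocations */
    pthread_t mt, pt;
    pthread_create(&pt, NULL, push_thread, NULL);
    pthread_create(&mt, NULL, main_thread, NULL);
    pthread_join(mt, NULL);
    pthread_join(pt, NULL);
    return (int)(result & 1);   /* keep the work observable */
}
```

In the actual experiments the two threads would additionally be pinned to two cores sharing the L2, for example with the Linux processor affinity interface described in Section V.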
We also use the VTune performance analyzer to obtain performance data including execution clock ticks, instruction counts, memory reference counts, cache hits/misses, etc. To evaluate the performance of the push thread we use three memory intensive benchmarks, shown in Table II: mcf from SPEC 2006, and mst and em3d from the Olden suite.

TABLE II. BENCHMARKS AND DATA SETS
Benchmark   Suite       Data Set/Arguments   Model Parameter
429.mcf     SPEC 2006   ref/inp.in           K = P
mst         Olden                            K > P
em3d        Olden                            K = 0

C. Measurement Metrics
To understand the LLC behavior of the main thread and the push thread, we define the following metrics for the LLC; other metrics include the execution time of the application and the number of L2 misses.

Bandwidth utilization: Intel uses the front side bus (FSB) as the only bus on the chip. All traffic to and from the processor is sent over this bus; the two dual-core dies in the quad-core package also communicate over it, and all memory accesses travel across it. The frequency of the FSB is 1066 MHz. The theoretical peak bandwidth k in GB/s for a 64-bit wide bus is calculated as k = bus_frequency * 8 / 10^9; with the 1066 MHz FSB, the theoretical peak bandwidth of the Q6600 test bed is 8.5 GB/s. The bus utilization percentage p is defined as p = BUS_TRANS_BURST * 64 * 100 / k, where BUS_TRANS_BURST is counted per second and k is expressed in bytes per second. BUS_TRANS_BURST is a PMC event of the Intel Core 2 microarchitecture that counts burst (full cache line) bus transactions; it can be obtained directly with Intel VTune.

L2 miss rate: we define the L2 miss rate as MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY. MEM_LOAD_RETIRED.L2_MISS is a precise event that counts the number of retired load operations that missed the L2 cache; INST_RETIRED.ANY counts the number of instructions that retire.

V. EVALUATION RESULTS
A. Performance Speedup of a Single Benchmark
To compare the performance improvement, we use the processor affinity interface provided by Linux to bind a thread to a specified processor core. The benchmarks were compiled with gcc-4.3 at optimization level O2. Figure 3 shows the performance speedup of em3d, mst and SPEC CPU2006 mcf relative to the original programs. In Figure 3, "ori" denotes the original benchmark and "push" denotes the benchmark with the push thread, whose main and push threads were bound to processor cores 0 and 1. As shown in Figure 3, the speedup of mcf is 1.2, while in [19] the reported improvement of mcf is less than 10%. At the top optimization level our result for mcf is 1.24, better than the corresponding result in [13]. The mst improvement at O2 is also better than that of [19]. Figure 4 shows the L2 cache miss rate of the original benchmarks and the benchmarks with the push thread; MT denotes the main thread, ORI the original benchmark, and O0 compilation without optimization.

Figure 3. Speedup of the scientific computing benchmarks (O2)

As shown in Figure 4, with the push thread technique the L2 cache miss rate of the main thread is lower than that of the original program in all three benchmarks. This shows that the L2 misses of the main thread were successfully transferred to the push thread: the number of L2 misses in the main thread decreased significantly compared with the original benchmark, and the push thread prefetches effectively. For example, at optimization level O2 the total number of L2 cache misses in MST-ORI is 4.524*10^8, while in MST-MT it is 1.21*10^8; more than 70% of the L2 misses were transferred to the push thread. The performance improvement of mst is therefore very pronounced, with a speedup of about 1.42 at O2, consistent with the 42% gain reported above.

Figure 4. L2 miss rate of mcf, em3d and mst

B. Bandwidth Utilization
As discussed in Section IV, it is straightforward to measure FSB saturation. Bandwidth can be expressed as the number of bytes associated with the cache lines transferred per second, i.e., cache line bandwidth (bytes/sec); bandwidth utilization is the percentage of the ideal peak bandwidth that is used. The BUS_TRANS_* events can be used in a hierarchical manner to break down the contributions to front side bus utilization.
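As a small worked example, the helper below turns raw counts into the two metrics just defined; the raw event counts would come from VTune (or any PMC reader), and the sample numbers in main are purely illustrative, not measurements from the paper.

```c
#include <stdio.h>

/* Theoretical peak FSB bandwidth in bytes/s for a 64-bit (8-byte) wide
 * bus: k = bus_frequency * 8. At 1066 MHz this gives about 8.5 GB/s.   */
static double peak_bandwidth(double bus_freq_hz) {
    return bus_freq_hz * 8.0;
}

/* Bus utilization in percent: each BUS_TRANS_BURST event is one full
 * 64-byte cache line transferred over the FSB.                         */
static double bus_utilization(double bus_trans_burst, double seconds,
                              double bus_freq_hz) {
    double bytes_per_sec = bus_trans_burst * 64.0 / seconds;
    return bytes_per_sec * 100.0 / peak_bandwidth(bus_freq_hz);
}

/* L2 miss rate: MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY. */
static double l2_miss_rate(double l2_miss, double inst_retired) {
    return l2_miss / inst_retired;
}

int main(void) {
    /* Illustrative counts only: 5e8 burst transactions in 10 s,
     * and a hypothetical 3e10 retired instructions.              */
    printf("FSB utilization: %.1f%%\n", bus_utilization(5.0e8, 10.0, 1066e6));
    printf("L2 miss rate:    %.5f\n", l2_miss_rate(4.524e8, 3.0e10));
    return 0;
}
```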
Figure 5 shows the bandwidth utilization of the original benchmarks and the benchmarks with the push thread. When the push thread is applied to a program, the reported utilization is the total bandwidth utilization of the main thread and the push thread. The figure shows additional bandwidth utilization when the push thread is applied to the three memory intensive benchmarks, which means the push thread causes some cache pollution and wasted prefetches; if the push thread prefetched the data exactly in time before the main thread used it, the bandwidth utilization would decrease. Overall, the extra bandwidth utilization is within 5% of the base benchmark, which is quite tolerable.

C. Performance of Mixed Workloads
As multithreaded applications become popular, commercial servers run multiple applications most of the time. To verify the applicability of the push thread technique, we evaluated it with mixed workloads of the three benchmarks. Figure 6 shows the normalized execution time of the three benchmarks and the mixed workloads. In Figure 6, a bare benchmark name denotes a single benchmark run, and the pushXX suffix denotes the benchmark optimized with the push thread, where XX gives the processor core IDs. We bound the main thread and the push thread to two processor cores that share the L2 cache.

For example, mst_push01 denotes the benchmark mst optimized with the push thread technique, with the main thread and the push thread bound to processor cores 0 and 1 through the Linux processor affinity interface.

Figure 5. Bandwidth utilization of the applications

In Figure 6 we use the execution time of the original program (O3) as the baseline. Analyzing the performance of mcf and mst under multiple parallel tasks, we find that both improve properly at optimization level O3: mcf improves by 14% at O3 when run in parallel with mst and by 15% when run with em3d, while mst improves by 23% at O3 when run with mcf and by 33% when run with em3d.

Figure 6. Normalized execution time of the mixed workloads

Figure 6 also shows that the performance of em3d degrades when it runs in parallel with mcf, whereas it improves when paired with mst. When em3d is paired with mcf, although they execute on different processor cores, the limits of bandwidth and memory capacity cause em3d to block waiting on I/O. The main reason is that the combined working set of mcf and em3d is larger than the memory capacity, and mcf is more aggressive than em3d in using memory; em3d therefore cannot obtain enough memory capacity and simply waits for its data to swap in and out until mcf completes. mst paired with mcf at optimization level O0 is the same case.

D. Effect of Hardware Prefetcher
As discussed in Section IV, Intel Core 2 Quad processors have four hardware prefetchers: two L1 cache prefetchers (the DCU and IP prefetchers) and two L2 cache prefetchers (the adjacent line and stream prefetchers). Following [18], we encode a configuration with four bits: bit 0 represents the setting of the DPL prefetcher; bit 1, the ACL prefetcher; bit 2, the DCU prefetcher; and bit 3, the IP prefetcher. A bit value of 0 denotes that the corresponding prefetcher is enabled, and a bit value of 1 that it is disabled. For example, the prefetcher configuration config-0111 means that the IP prefetcher is enabled and the other three prefetchers are disabled. The default configuration for many Intel Q-series Core 2 CPUs is config-0000, that is, all prefetchers enabled.
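For completeness, the sketch below shows one way such a four-bit configuration can be applied from user space through Linux's msr driver. The IA32_MISC_ENABLE bit positions used here (bit 9 for the L2 stream/DPL prefetcher, bit 19 for the adjacent line prefetcher, bit 37 for the DCU prefetcher and bit 39 for the IP prefetcher, where a set bit disables the prefetcher) follow Intel's documentation for the Core 2 microarchitecture, but they are model-specific assumptions that should be verified against the SDM for the exact CPU; this is not necessarily the tooling used in the paper.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Assumed Core 2 prefetcher-disable bits in IA32_MISC_ENABLE (MSR 0x1A0);
 * a set bit disables the prefetcher. Verify for the exact CPU model.     */
#define MSR_MISC_ENABLE 0x1A0
#define DPL_DISABLE  (1ULL << 9)    /* L2 stream (DPL) prefetcher    */
#define ACL_DISABLE  (1ULL << 19)   /* L2 adjacent-line prefetcher   */
#define DCU_DISABLE  (1ULL << 37)   /* L1 DCU (streaming) prefetcher */
#define IP_DISABLE   (1ULL << 39)   /* L1 IP-based prefetcher        */

/* Apply a config-XXXX setting (paper's encoding: 1 = disabled) to one
 * core through the Linux msr driver (requires root and modprobe msr). */
static int set_prefetchers(int cpu, int dpl, int acl, int dcu, int ip) {
    char path[64];
    uint64_t val;
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;
    if (pread(fd, &val, 8, MSR_MISC_ENABLE) != 8) { close(fd); return -1; }
    val &= ~(DPL_DISABLE | ACL_DISABLE | DCU_DISABLE | IP_DISABLE);
    if (dpl) val |= DPL_DISABLE;
    if (acl) val |= ACL_DISABLE;
    if (dcu) val |= DCU_DISABLE;
    if (ip)  val |= IP_DISABLE;
    int ok = (pwrite(fd, &val, 8, MSR_MISC_ENABLE) == 8) ? 0 : -1;
    close(fd);
    return ok;
}

int main(void) {
    /* Example: config-0111 (IP enabled, DPL/ACL/DCU disabled) on core 0. */
    return set_prefetchers(0, 1, 1, 1, 0);
}
```

Running it requires root and a loaded msr module, and the write would be repeated for every core whose prefetchers are being configured.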

Figure 7. Execution time of mst under different prefetcher configurations
Figure 8. Execution time of mcf under different prefetcher configurations

Figures 7 and 8 show the normalized execution time of the benchmarks with push thread optimization; all execution times are normalized to the base execution time of the benchmark with push thread optimization at config-. In Figure 7 we can see that mst obtains its best performance at config-0011, config-0111, config-1011 and config-1111 relative to the other configurations. The common characteristic of these configurations is that bits 0 and 1 are set to 1, i.e., the two L2 prefetchers are disabled; when the two L2 prefetchers are disabled, mst gets its best performance improvement, which verifies our discussion in Section V(A). In Figure 8, mcf shows an obvious performance improvement at config-0000, config-0010, config-1000 and config-; in these configurations bit 0 is 0, i.e., the DPL prefetcher is enabled, which means the L2 DPL hardware prefetcher contributes to the performance improvement. At the same time, we find that em3d is not sensitive to the states of the hardware prefetchers. Compared with the execution time of the original benchmark, the performance of mst improves by 70% at O3, while mcf improves by 24%. The reason is that the push thread optimization targets shared L2 prefetching, while the L2 hardware prefetchers sometimes cause cache pollution and issue useless prefetches.

VI. CONCLUSION
With the ever-widening gap between processor and memory speed, helper thread techniques have been proposed. A multi-parameter helper thread prefetching model is developed in this paper to model the performance potential of push thread prefetching for memory intensive workloads, in particular pointer chasing codes with long dependence chains. As multithreaded applications become popular on today's microprocessors, we evaluate the applicability of data-push thread prefetching on a commercial CMP platform. The evaluation results show that the data-push thread technique can effectively prefetch data into the LLC on a commercial CMP platform. Based on our performance evaluation, an interesting future direction is to optimize the parameters dynamically by monitoring PMCs such as L2 cache misses. In addition, the performance model is based on several assumptions; profile-based analysis is an effective method to refine it.

ACKNOWLEDGMENT
We would like to thank all the members of our research group for their contributions, as well as the anonymous reviewers. This research is supported in part by MoE-Intel.

REFERENCES
[1] J. Fu, J. H. Patel, and B. L. Janssens, "Stride directed prefetching in scalar processors," in Proc. 25th Annual International Symposium on Microarchitecture, 1992.
[2] D. Joseph and D. Grunwald, "Prefetching using Markov predictors," in Proc. International Symposium on Computer Architecture, June 1997.
[3] M. Annavaram, J. M. Patel, and E. S. Davidson, "Data prefetching by dependence graph precomputation," in Proc. 28th Annual International Symposium on Computer Architecture, Goteborg, Sweden, pp. 52-61, 2001.
[4] J. Collins, D. Tullsen, H. Wang, and J. Shen, "Dynamic speculative precomputation," in Proc. 34th International Symposium on Microarchitecture, Austin, Texas, 2001.
[5] J. Collins, H. Wang, D. Tullsen, et al., "Speculative precomputation: Long-range prefetching of delinquent loads," in Proc. 28th Annual International Symposium on Computer Architecture, Goteborg, Sweden, pp. 14-25, 2001.
[6] D. Kim and D. Yeung, "Design and evaluation of compiler algorithms for pre-execution," in Proc. 10th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, California, 2002.
[7] S. S. Liao, P. Wang, H. Wang, et al., "Post-pass binary adaptation for software-based speculative precomputation," in Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation, Berlin, Germany, 2002.
[8] C.-K. Luk, "Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors," in Proc. 28th Annual International Symposium on Computer Architecture, Goteborg, Sweden, pp. 40-51, 2001.
[9] A. Moshovos, D. N. Pnevmatikatos, and A. Baniasadi, "Slice-processors: An implementation of operation-based prediction," in Proc. International Conference on Supercomputing, Sorrento, Italy, 2001.
[10] A. Roth and G. S. Sohi, "Speculative data-driven multithreading," in Proc. 7th International Symposium on High Performance Computer Architecture, Monterrey, Mexico, 2001.
[11] A. Roth and G. S. Sohi, "A quantitative framework for automated pre-execution thread selection," in Proc. 35th Annual International Symposium on Microarchitecture, Istanbul, Turkey, 2002.
[12] C. B. Zilles and G. Sohi, "Execution-based prediction using speculative slices," in Proc. 28th Annual International Symposium on Computer Architecture, Goteborg, Sweden, pp. 2-13, 2001.
[13] Y. Song, S. Kalogeropulos, and P. Tirumalai, "Design and implementation of a compiler framework for helper threading on multi-core processors," in Proc. 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05), 2005.
[14] D. Kim, S. S. Liao, P. Wang, et al., "Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors," in Proc. International Symposium on Code Generation and Optimization, March 2004.
[15] Z. Gu, N. Zheng, Y. Zhang, et al., "The stable conditions of a task-pair with helper-thread in CMP," in Proc. International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, USA.
[16] S. Srinivasan, L. Zhao, B. Ganesh, et al., "CMP memory modeling: How much does accuracy matter?" in Proc. Fifth Annual Workshop on Modeling, Benchmarking and Simulation, Austin, Texas.
[17] S. Byna, Y. Chen, and X. H. Sun, "A taxonomy of data prefetching mechanisms," Journal of Computer Science and Technology, 24(3): 405-417, May 2009.
[18] S. Liao, T. Hung, D. Nguyen, C. Chou, et al., "Machine learning-based prefetch optimization for data center applications," in Proc. 2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'09), Portland, Oregon, November 2009.
[19] J. Lee, C. Jung, D. Lim, and Y. Solihin, "Prefetching with helper threads for loosely coupled multiprocessor systems," IEEE Transactions on Parallel and Distributed Systems, 20(9), September 2009.


Dynamic Speculative Precomputation

Dynamic Speculative Precomputation In Proceedings of the 34th International Symposium on Microarchitecture, December, 2001 Dynamic Speculative Precomputation Jamison D. Collins y, Dean M. Tullsen y, Hong Wang z, John P. Shen z y Department

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Energy Characterization of Hardware-Based Data Prefetching

Energy Characterization of Hardware-Based Data Prefetching Energy Characterization of Hardware-Based Data Prefetching Yao Guo, Saurabh Chheda, Israel Koren, C. Mani Krishna, and Csaba Andras Moritz Electrical and Computer Engineering, University of Massachusetts,

More information

An Improvement Over Threads Communications on Multi-Core Processors

An Improvement Over Threads Communications on Multi-Core Processors Australian Journal of Basic and Applied Sciences, 6(12): 379-384, 2012 ISSN 1991-8178 An Improvement Over Threads Communications on Multi-Core Processors 1 Reza Fotohi, 2 Mehdi Effatparvar, 3 Fateme Sarkohaki,

More information

Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I. Prof. Onur Mutlu Carnegie Mellon University 10/10/2012

Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I. Prof. Onur Mutlu Carnegie Mellon University 10/10/2012 18-742 Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I Prof. Onur Mutlu Carnegie Mellon University 10/10/2012 Reminder: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 omputer Architecture Spring 2016 Lecture 09: Prefetching Shuai Wang Department of omputer Science and Technology Nanjing University Prefetching(1/3) Fetch block ahead of demand Target compulsory, capacity,

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

15-740/ Computer Architecture

15-740/ Computer Architecture 15-740/18-740 Computer Architecture Lecture 16: Runahead and OoO Wrap-Up Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/17/2011 Review Set 9 Due this Wednesday (October 19) Wilkes, Slave Memories

More information

Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 13:

More information

Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors

Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors Ilya Ganusov and Martin Burtscher Computer Systems Laboratory Cornell University {ilya, burtscher}@csl.cornell.edu Abstract This

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:

More information

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1 Available online at www.sciencedirect.com Physics Procedia 33 (2012 ) 1029 1035 2012 International Conference on Medical Physics and Biomedical Engineering Memory Performance Characterization of SPEC CPU2006

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information