Performance Evaluation of data-push Thread on Commercial CMP Platform

Jianxun Zhang 1,2, Zhimin Gu 1, Ninghan Zheng 1,3, Yan Huang 1, Min Cai 1, Sicai Yang 1, Wenbiao Zhou 1
1 School of Computer, Beijing Institute of Technology, Beijing, China
2 Network Center, Tianjin University of Traditional Chinese Medicine, Tianjin, China
3 Department of Computer Science and Technology, Tsinghua University, Beijing, China
Zhangjx@bit.edu.cn, zmgu@x263.net

Abstract: Helper threading is a promising prefetching technique for bridging the memory wall on contemporary CMP platforms. However, the synchronization between the application and the helper thread is critical to the performance improvement. Previous research focused mainly on loop-count based synchronization, which is only suitable when the main thread has enough computation workload. For the case of a small computation workload in the main thread, this paper presents a multi-parameter helper thread prefetching model. Using memory intensive workloads, the paper gives a detailed performance evaluation of the data-push (helper) thread on a commercial CMP platform, and also evaluates the applicability of data-push thread prefetching in a multi-process environment. A methodology covering workload selection, measurement metrics and the throttling effect of the hardware prefetcher is described. The evaluation results using data-push threads on em3d, mcf and mst show gains of 12%, 24% and 42% respectively when the hardware prefetcher is configured properly.

Keywords: pre-execution; data prefetching; push/helper thread

I. INTRODUCTION
The chip multiprocessor (CMP) is the trend in contemporary microprocessor design. It provides a tremendous ability to perform arithmetic computations that do not involve slow memory operations such as last level cache (LLC) misses. Despite advances in memory technology, the speed gap between processor and memory is still widening, especially for memory intensive workloads with heavy pointer chasing. Prefetching is a promising technique for narrowing this gap. Hardware prefetch schemes are quite effective on many memory access patterns, such as strides [1] or previously encountered patterns [2], but they have trouble predicting the access patterns of pointer chasing workloads. In recent years, pre-execution/pre-computation prefetching approaches based on helper threads have attracted wide attention [3-12]. The pre-execution/pre-computation scheme has been found to complement hardware prefetching well, and it can prefetch addresses that hardware prefetchers cannot [3]. Software prefetching methods based on pre-execution are proposed in [13,14]; to our knowledge, these are the first two proposals in which the helper thread was implemented on real commercial platforms, namely an SMT platform and a CMP platform. Motivated by these works, we apply the pre-execution technique to commercial CMP platforms. We proposed a helper thread framework on CMP in [15]. In this paper, we present a multi-parameter helper thread prefetching model that suits various applications by adjusting its parameters, especially those with a small computation workload in the main thread. We evaluate the bandwidth effect of the data-push thread, as well as the performance effect of the hardware prefetcher on the data-push thread. To study the performance improvement of the data-push thread mechanism, we test it on an Intel Core 2 Quad processor with three memory intensive benchmarks from SPEC 2006 and Olden.
The experimental results for em3d, mcf and mst are better than those in the literature [5,6,13,14,19], with performance gains of 12%, 24% and 42% respectively.

II. RELATED WORK
Data prefetching is a promising technique for bridging the ever-widening gap between processor and memory speed, and it motivates the various helper threading mechanisms implemented on different platforms. Collins et al. use speculative precomputation for prefetching [5]. Luk describes the use of helper threading on simultaneous multithreading machines [8]. Both [5] and [8] require special hardware support. Liao et al. [7] use a binary rewriter to generate helper threads. Song et al. [13] propose a compiler framework for helper threading on the Sun UltraSPARC IV+ processor. They introduce a cost-benefit method to select candidate loops; after a candidate loop is selected, the termination condition of the helper thread is added to the relevant loop of the main thread. The helper thread synchronizes with the main thread every several loop iterations to check whether it has run too far from the main thread, and if the termination condition is satisfied the helper thread is terminated. Kim et al. [14] apply the pre-execution technique on a real multithreaded machine, the Intel Pentium 4 with Hyper-Threading technology; they construct helper threads in software and propose two synchronization mechanisms, loop-based and sample-based. Lee et al. [19] propose a coarse-grain synchronization mechanism to control the execution rhythm between the application and the helper thread; in essence, they reduce the synchronization frequency by using lightweight general semaphores. None of the above proposals considers the small-workload scenario in the main thread, in which the main thread has so little computation that it runs faster than the helper thread even when the helper thread is lightweight; their loop-count based synchronization is then ineffective, because the helper thread's work is useless during the synchronization interval. Different from these earlier works, we propose a K-forward, P-push data-push thread scheme targeted at real CMP platforms.

In the case of a small computation workload, the target LLC misses are shared between the main thread and the helper thread. The helper thread does not always prefetch: it skips K iterations to guarantee that it stays ahead of the main thread. Based on our previous work [15], a performance model is developed to analyze the relationship between the parameters K and P. In this paper we give a detailed evaluation of the data-push thread on a real CMP platform, since the simplistic memory model of a simulator can produce overly optimistic results [16].

III. PERFORMANCE MODEL OF DATA-PUSH PREFETCHING
The latest CMP architectures often have a shared last level cache (LLC). This makes it possible for a helper thread running on an otherwise idle core to push data into the LLC before it is actually needed by the computing core, so that the main thread hits in the LLC and the miss penalty is reduced. This approach is regarded as a processor-side push-based data prefetching method [17]. In this paper we call the prefetching thread the push thread and the computing thread the main thread; the two execute on different cores that share the LLC.

For memory intensive workloads, especially pointer chasing workloads, the execution time is dominated by the memory dependence chain of missing loads, since the computation (including cache hits) is either overlapped with the memory access latency or accounts for only a small portion of the overall execution time. We therefore ignore the computation time in our model. Let N be the length of a memory dependence chain containing N dependent LLC missing loads. We do not model the latency of the lower cache levels because it can be hidden by out-of-order execution. Defining the execution time as the time to resolve all these missing loads, the performance bound of the original program is

\[ T_{\mathrm{original}} = N \, M \tag{1} \]

where M denotes the LLC miss penalty.

To model the performance potential of data-push thread prefetching, we first explain how the data-push thread works; the scenario is depicted in Figure 1. To guarantee that the main thread benefits from the push thread, the push thread runs ahead of the main thread at a suitable distance from the beginning. Here K represents the run-ahead distance. Initially the main thread suffers K LLC misses, while the push thread pushes LLC data from load K+1 onwards until it has pushed P LLC loads, where P is the number of loads pushed by the push thread. A block is defined as K plus P loads. Due to the lack of computation workload, we assume the time during which the main thread consumes the pushed data in the LLC is zero, so the push thread synchronizes with the main thread at the block boundary and aligns its current working point; it then skips another K LLC loads and starts to prefetch the next P LLC loads. In this scenario, when K = P, the chain of N dependent loads is divided into N/(K+P) blocks and NP/(K+P) LLC missing loads are prefetched by the push thread. This suits the situation where the main thread and the helper thread have the same workload. The main thread's performance bound is the time to resolve the remaining N - NP/(K+P) loads plus the synchronization overhead between the main thread and the push thread. In special cases, such as mcf, there can be two synchronizations, one at the start and one at the end of the helper thread. Let O denote the overhead of a single synchronization between the main thread and the push thread.
The performance bound on execution time is then

\[ T_{\mathrm{push}} = \frac{N}{K+P}\,K\,M + \frac{N}{K+P}\,O \tag{2} \]

Figure 1. Data-push thread working scenario (K = P). Main: main thread; Push: push thread; K: run-ahead distance; P: push distance.

Suppose the prefetch accuracy of the push thread is x%; for simplicity we assume it is constant. The performance bound can then be extended to

\[ T_{\mathrm{push}}^{\mathrm{acc}} = \frac{N}{K+P}\bigl(K + P\,(1-x\%)\bigr)\,M + \frac{N}{K+P}\,O \tag{3} \]

When P < K < 2P, the push thread working scenario is depicted in Figure 2. Initially the push thread runs ahead of the main thread by distance K and pushes P LLC missing loads. When the push thread completes pushing the P missing loads, it synchronizes with the main thread, which at this point has also completed P missing loads. The push thread obtains the main thread's current pointer and starts to push data from load K+P. As shown in Figure 2, the yellow bar represents the K-P missing loads: the main thread benefits from the pushed data only after it completes these K-P missing loads, while during the same time the push thread completes K-P missing loads from its start point K+P. During the period t1 the main thread and the push thread load the same data; once their current pointers are not exactly the same they interfere with each other, and the push thread causes LLC cache line conflicts or cache pollution. We define the block as K+2P. The push thread synchronizes with the main thread twice per block, and in one block the main thread loads 2P data items (2P = K + (P - (K - P))). From this analysis, when K > P the best values of K are those divisible by P, since this reduces the interference between the main thread and the push thread. This suits the situation where the main thread has slightly more workload than the helper thread. The performance bound on execution time is then

\[ T_{\mathrm{push}}^{K>P} = \frac{N}{K+2P}\,2P\,M + \frac{2N}{K+2P}\,O \tag{4} \]

and, when the prefetch accuracy of the push thread is considered,

\[ T_{\mathrm{push}}^{K>P,\,\mathrm{acc}} = \frac{N}{K+2P}\bigl(2P + K\,(1-x\%)\bigr)\,M + \frac{2N}{K+2P}\,O \tag{5} \]

Figure 2. Data-push thread working scenario (P < K < 2P). Main: main thread; Push: push thread; K: run-ahead distance; P: push distance.
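To make the bounds concrete, the following is a small worked instance of equations (1) and (3); the numbers (N, M, K, P, O and x) are assumed for illustration and are not measurements from the paper.

```latex
% Assumed values: N = 10^6 missing loads, M = 300 cycles,
% K = P = 64, O = 2000 cycles per synchronization, x = 90.
\[
T_{\mathrm{original}} = 10^{6} \times 300 = 3 \times 10^{8}\ \text{cycles}
\]
\[
T_{\mathrm{push}}^{\mathrm{acc}}
  = \frac{10^{6}}{128}\bigl(64 + 64\,(1 - 0.9)\bigr) \times 300
    + \frac{10^{6}}{128} \times 2000
  \approx 1.81 \times 10^{8}\ \text{cycles}
\]
% Predicted upper bound on speedup: 3.0e8 / 1.81e8, roughly 1.66x.
```

Even with K = P, where at most half of the missing loads can be covered, the bound already stays well below 2x, which is consistent with the measured speedups reported in Section V.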

The model targets a small computation workload in the main thread, so the precondition is K >= P. In the case of a small computation workload, if K were smaller than P the main thread would start taking the missing loads by itself right after consuming the pushed LLC data; the push thread could then no longer run ahead of the main thread, putting heavy pressure on the shared LLC in terms of bandwidth and cache line conflicts. In the multi-parameter data-push prefetch model, setting K < P means there is a certain amount of computation workload in the main thread that still cannot fully overlap with the helper thread; in that scenario K should indeed be set smaller than P so that the push thread can push more data into the LLC. When K = 0 there is enough computation workload in the main thread, and the synchronization mechanism is the same as in [13,19]. However, in a real situation, if the two threads are unaware of each other's progress, severe cache pollution results.

IV. EVALUATION SETUP
A. Experiment Environment
This section briefly describes the experimental test bed used in this study; Table I summarizes its key architectural and system features. The Intel Core 2 Quad Q6600 processor combines two Core 2 Duo E6600 (Conroe) dies in a single multi-chip module (MCM). On an Intel Core 2 CPU there are four hardware prefetchers, the DPL, ACL, DCU and IP prefetchers; their detailed functions are described in [18]. Each L1 data cache is equipped with two prefetchers, the DCU prefetcher and the IP prefetcher; each L2 cache has one DPL prefetcher and one ACL prefetcher. Each prefetcher can be enabled or disabled independently. These prefetchers can improve performance, mostly when accessing successive data, but they can also cause performance degradation when unneeded data evicts required lines. In this paper we therefore also evaluate the performance of push thread prefetching under different hardware prefetcher configurations.

TABLE I. EXPERIMENTAL SETUP
Processor:  Intel Core 2 Quad Q6600
Memory:     2 GB (DDR 667, non-ECC)
L1 D-Cache: 32 KB x 4, 8-way set-associative, 64-byte lines
L1 I-Cache: 32 KB x 4, 8-way set-associative, 64-byte lines
L2 Cache:   4096 KB x 2, 16-way set-associative, 64-byte lines
FSB Speed:  1066 MHz
Compiler:   gcc 4.3, -O2
OS:         Fedora 9

B. Measuring Method
The evaluation is performed on a real processor, the Intel Core 2 Quad. We insert timing functions to record the execution time of the benchmarks. The code regions with a high concentration of L2 cache misses are identified with a profiling tool such as Intel VTune, and the data-push thread is constructed manually after this offline profiling.
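To illustrate what such a manually constructed thread pair can look like, the following is a minimal sketch of the K-forward, P-push scheme from Section III for an array-of-pointers traversal. Everything here (the record type, the workload, the counter-based synchronization and the constants) is a hypothetical simplification for illustration, not the paper's actual code.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical workload: an array of pointers to scattered records, so
 * the main loop misses the LLC on most dereferences (mcf-like).        */
typedef struct { long data[8]; } record_t;      /* one 64-byte line */

#define NREC (4L * 1024 * 1024)
#define K 64              /* run-ahead distance: loads skipped per block */
#define P 64              /* push distance: loads pushed per block (K=P) */

static record_t **records;
static atomic_long main_pos;    /* main thread's published progress */
static long result;

/* Main thread: the original loop, plus a published progress counter. */
static void *main_thread(void *arg) {
    (void)arg;
    for (long i = 0; i < NREC; i++) {
        atomic_store_explicit(&main_pos, i, memory_order_relaxed);
        result += records[i]->data[0];          /* the "real" work */
    }
    return NULL;
}

/* Push thread: in each block of K+P records, skip the first K (the main
 * thread takes those misses itself), touch the next P so their lines are
 * pulled into the shared L2, then realign at the block boundary.        */
static void *push_thread(void *arg) {
    (void)arg;
    for (long base = 0; base < NREC; base += K + P) {
        for (long j = base + K; j < base + K + P && j < NREC; j++) {
            volatile long sink = records[j]->data[0];   /* push one line */
            (void)sink;
        }
        /* Block-boundary synchronization (the model's overhead O):
         * wait until the main thread has taken its K self-misses.      */
        long bound = (base + K < NREC) ? base + K : NREC - 1;
        while (atomic_load_explicit(&main_pos, memory_order_relaxed) < bound)
            ;   /* spin */
    }
    return NULL;
}

int main(void) {
    records = malloc(NREC * sizeof *records);
    for (long i = 0; i < NREC; i++)
        records[i] = malloc(sizeof(record_t)); /* scattered allocations */
    pthread_t mt, pt;
    pthread_create(&pt, NULL, push_thread, NULL);
    pthread_create(&mt, NULL, main_thread, NULL);
    pthread_join(mt, NULL);
    pthread_join(pt, NULL);
    return (int)(result & 1);   /* keep the work observable */
}
```

In the actual experiments the two threads would additionally be pinned to two cores sharing the L2, for example with the Linux processor affinity interface described in Section V.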
We also use the VTune performance analyzer to obtain performance data including execution clock ticks, instruction counts, memory reference counts, cache hits/misses, etc. To evaluate the performance of the push thread we use three memory intensive benchmarks, shown in Table II: mcf from SPEC 2006, and mst and em3d from the Olden suite.

TABLE II. BENCHMARKS AND DATA SETS
Benchmark   Suite       Data Set/Arguments   Model Parameter
429.mcf     SPEC 2006   ref/inp.in           K = P
mst         Olden                            K > P
em3d        Olden                            K = 0

C. Measurement Metrics
To understand the LLC behavior of the main thread and the push thread, we define the following metrics for the LLC; other metrics include the execution time of the application and the number of L2 misses.

Bandwidth utilization: Intel uses the front side bus (FSB) as the only bus on the chip. All traffic to and from the processor is sent over this bus; the two dual-core dies in the quad-core package also communicate over it, and all memory accesses travel across it. The frequency of the FSB is 1066 MHz. The theoretical peak bandwidth k in GB/s for a 64-bit wide bus is calculated as k = bus_frequency * 8 / 10^9; with the 1066 MHz FSB, the theoretical peak bandwidth of the Q6600 test bed is 8.5 GB/s. The bus utilization percentage p is defined as p = BUS_TRANS_BURST * 64 * 100 / k, where BUS_TRANS_BURST is counted per second and k is expressed in bytes per second. BUS_TRANS_BURST is a PMC event of the Intel Core 2 microarchitecture that counts burst (full cache line) bus transactions; it can be obtained directly with Intel VTune.

L2 miss rate: we define the L2 miss rate as MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY. MEM_LOAD_RETIRED.L2_MISS is a precise event that counts the number of retired load operations that missed the L2 cache; INST_RETIRED.ANY counts the number of instructions that retire.

V. EVALUATION RESULTS
A. Performance Speedup of a Single Benchmark
To compare the performance improvement, we use the processor affinity interface provided by Linux to bind a thread to a specified processor core. The benchmarks were compiled with gcc-4.3 at optimization level O2. Figure 3 shows the performance speedup of em3d, mst and SPEC CPU2006 mcf relative to the original programs. In Figure 3, "ori" denotes the original benchmark and "push" denotes the benchmark with the push thread, whose main and push threads were bound to processor cores 0 and 1. As shown in Figure 3, the speedup of mcf is 1.2, while in [19] the reported improvement of mcf is less than 10%. At the top optimization level our result for mcf is 1.24, better than the corresponding result in [13]. The mst improvement at O2 is also better than that of [19]. Figure 4 shows the L2 cache miss rate of the original benchmarks and the benchmarks with the push thread; MT denotes the main thread, ORI the original benchmark, and O0 compilation without optimization.

Figure 3. Speedup of the scientific computing benchmarks (O2)

As shown in Figure 4, with the push thread technique the L2 cache miss rate of the main thread is lower than that of the original program in all three benchmarks. This shows that the L2 misses of the main thread were successfully transferred to the push thread: the number of L2 misses in the main thread decreased significantly compared with the original benchmark, and the push thread prefetches effectively. For example, at optimization level O2 the total number of L2 cache misses in MST-ORI is 4.524*10^8, while in MST-MT it is 1.21*10^8; more than 70% of the L2 misses were transferred to the push thread. The performance improvement of mst is therefore very pronounced, with a speedup of about 1.42 at O2, consistent with the 42% gain reported above.

Figure 4. L2 miss rate of mcf, em3d and mst

B. Bandwidth Utilization
As discussed in Section IV, it is straightforward to measure FSB saturation. Bandwidth can be expressed as the number of bytes associated with the cache lines transferred per second, i.e., cache line bandwidth (bytes/sec); bandwidth utilization is the percentage of the ideal peak bandwidth that is used. The BUS_TRANS_* events can be used in a hierarchical manner to break down the contributions to front side bus utilization.
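As a small worked example, the helper below turns raw counts into the two metrics just defined; the raw event counts would come from VTune (or any PMC reader), and the sample numbers in main are purely illustrative, not measurements from the paper.

```c
#include <stdio.h>

/* Theoretical peak FSB bandwidth in bytes/s for a 64-bit (8-byte) wide
 * bus: k = bus_frequency * 8. At 1066 MHz this gives about 8.5 GB/s.   */
static double peak_bandwidth(double bus_freq_hz) {
    return bus_freq_hz * 8.0;
}

/* Bus utilization in percent: each BUS_TRANS_BURST event is one full
 * 64-byte cache line transferred over the FSB.                         */
static double bus_utilization(double bus_trans_burst, double seconds,
                              double bus_freq_hz) {
    double bytes_per_sec = bus_trans_burst * 64.0 / seconds;
    return bytes_per_sec * 100.0 / peak_bandwidth(bus_freq_hz);
}

/* L2 miss rate: MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY. */
static double l2_miss_rate(double l2_miss, double inst_retired) {
    return l2_miss / inst_retired;
}

int main(void) {
    /* Illustrative counts only: 5e8 burst transactions in 10 s,
     * and a hypothetical 3e10 retired instructions.              */
    printf("FSB utilization: %.1f%%\n", bus_utilization(5.0e8, 10.0, 1066e6));
    printf("L2 miss rate:    %.5f\n", l2_miss_rate(4.524e8, 3.0e10));
    return 0;
}
```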
Figure 5 shows the bandwidth utilization of the original benchmarks and the benchmarks with the push thread. When the push thread is applied to a program, the reported utilization is the total bandwidth utilization of the main thread and the push thread. The figure shows additional bandwidth utilization when the push thread is applied to the three memory intensive benchmarks, which means the push thread causes some cache pollution and wasted prefetches; if the push thread prefetched the data exactly in time before the main thread used it, the bandwidth utilization would decrease. Overall, the extra bandwidth utilization is within 5% of the base benchmark, which is quite tolerable.

C. Performance of Mixed Workloads
As multithreaded applications become popular, commercial servers run multiple applications most of the time. To verify the applicability of the push thread technique, we evaluated it with mixed workloads of the three benchmarks. Figure 6 shows the normalized execution time of the three benchmarks and the mixed workloads. In Figure 6, a bare benchmark name denotes a single benchmark run, and the pushXX suffix denotes the benchmark optimized with the push thread, where XX gives the processor core IDs. We bound the main thread and the push thread to two processor cores that share the L2 cache.

For example, mst_push01 denotes the benchmark mst optimized with the push thread technique, with the main thread and the push thread bound to processor cores 0 and 1 through the Linux processor affinity interface.

Figure 5. Bandwidth utilization of the applications

In Figure 6 we use the execution time of the original program (O3) as the baseline. Analyzing the performance of mcf and mst under multiple parallel tasks, we find that both improve properly at optimization level O3: mcf improves by 14% at O3 when run in parallel with mst and by 15% when run with em3d, while mst improves by 23% at O3 when run with mcf and by 33% when run with em3d.

Figure 6. Normalized execution time of the mixed workloads

Figure 6 also shows that the performance of em3d degrades when it runs in parallel with mcf, whereas it improves when paired with mst. When em3d is paired with mcf, although they execute on different processor cores, the limits of bandwidth and memory capacity cause em3d to block waiting on I/O. The main reason is that the combined working set of mcf and em3d is larger than the memory capacity, and mcf is more aggressive than em3d in using memory; em3d therefore cannot obtain enough memory capacity and simply waits for its data to swap in and out until mcf completes. mst paired with mcf at optimization level O0 is the same case.

D. Effect of Hardware Prefetcher
As discussed in Section IV, Intel Core 2 Quad processors have four hardware prefetchers: two L1 cache prefetchers (the DCU and IP prefetchers) and two L2 cache prefetchers (the adjacent line and stream prefetchers). Following [18], we encode a configuration with four bits: bit 0 represents the setting of the DPL prefetcher; bit 1, the ACL prefetcher; bit 2, the DCU prefetcher; and bit 3, the IP prefetcher. A bit value of 0 denotes that the corresponding prefetcher is enabled, and a bit value of 1 that it is disabled. For example, the prefetcher configuration config-0111 means that the IP prefetcher is enabled and the other three prefetchers are disabled. The default configuration for many Intel Q-series Core 2 CPUs is config-0000, that is, all prefetchers enabled.
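For completeness, the sketch below shows one way such a four-bit configuration can be applied from user space through Linux's msr driver. The IA32_MISC_ENABLE bit positions used here (bit 9 for the L2 stream/DPL prefetcher, bit 19 for the adjacent line prefetcher, bit 37 for the DCU prefetcher and bit 39 for the IP prefetcher, where a set bit disables the prefetcher) follow Intel's documentation for the Core 2 microarchitecture, but they are model-specific assumptions that should be verified against the SDM for the exact CPU; this is not necessarily the tooling used in the paper.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Assumed Core 2 prefetcher-disable bits in IA32_MISC_ENABLE (MSR 0x1A0);
 * a set bit disables the prefetcher. Verify for the exact CPU model.     */
#define MSR_MISC_ENABLE 0x1A0
#define DPL_DISABLE  (1ULL << 9)    /* L2 stream (DPL) prefetcher    */
#define ACL_DISABLE  (1ULL << 19)   /* L2 adjacent-line prefetcher   */
#define DCU_DISABLE  (1ULL << 37)   /* L1 DCU (streaming) prefetcher */
#define IP_DISABLE   (1ULL << 39)   /* L1 IP-based prefetcher        */

/* Apply a config-XXXX setting (paper's encoding: 1 = disabled) to one
 * core through the Linux msr driver (requires root and modprobe msr). */
static int set_prefetchers(int cpu, int dpl, int acl, int dcu, int ip) {
    char path[64];
    uint64_t val;
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;
    if (pread(fd, &val, 8, MSR_MISC_ENABLE) != 8) { close(fd); return -1; }
    val &= ~(DPL_DISABLE | ACL_DISABLE | DCU_DISABLE | IP_DISABLE);
    if (dpl) val |= DPL_DISABLE;
    if (acl) val |= ACL_DISABLE;
    if (dcu) val |= DCU_DISABLE;
    if (ip)  val |= IP_DISABLE;
    int ok = (pwrite(fd, &val, 8, MSR_MISC_ENABLE) == 8) ? 0 : -1;
    close(fd);
    return ok;
}

int main(void) {
    /* Example: config-0111 (IP enabled, DPL/ACL/DCU disabled) on core 0. */
    return set_prefetchers(0, 1, 1, 1, 0);
}
```

Running it requires root and a loaded msr module, and the write would be repeated for every core whose prefetchers are being configured.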

Figure 7. Execution time of mst under different prefetcher configurations
Figure 8. Execution time of mcf under different prefetcher configurations

Figures 7 and 8 show the normalized execution time of the benchmarks with push thread optimization; all execution times are normalized to the base execution time of the benchmark with push thread optimization at config-. In Figure 7 we can see that mst obtains its best performance at config-0011, config-0111, config-1011 and config-1111 relative to the other configurations. The common characteristic of these configurations is that bits 0 and 1 are set to 1, i.e., the two L2 prefetchers are disabled; when the two L2 prefetchers are disabled, mst gets its best performance improvement, which verifies our discussion in Section V(A). In Figure 8, mcf shows an obvious performance improvement at config-0000, config-0010, config-1000 and config-; in these configurations bit 0 is 0, i.e., the DPL prefetcher is enabled, which means the L2 DPL hardware prefetcher contributes to the performance improvement. At the same time, we find that em3d is not sensitive to the states of the hardware prefetchers. Compared with the execution time of the original benchmark, the performance of mst improves by 70% at O3, while mcf improves by 24%. The reason is that the push thread optimization targets shared L2 prefetching, while the L2 hardware prefetchers sometimes cause cache pollution and issue useless prefetches.

VI. CONCLUSION
With the ever-widening gap between processor and memory speed, helper thread techniques have been proposed. A multi-parameter helper thread prefetching model is developed in this paper to model the performance potential of push thread prefetching for memory intensive workloads, in particular pointer chasing codes with long dependence chains. As multithreaded applications become popular on today's microprocessors, we evaluate the applicability of data-push thread prefetching on a commercial CMP platform. The evaluation results show that the data-push thread technique can effectively prefetch data into the LLC on a commercial CMP platform. Based on our performance evaluation, an interesting future direction is to optimize the parameters dynamically by monitoring PMCs such as L2 cache misses. In addition, the performance model is based on several assumptions; profile-based analysis is an effective method to refine it.

ACKNOWLEDGMENT
We would like to thank all the members of our research group for their contributions, as well as the anonymous reviewers. This research is supported in part by MoE-Intel.

REFERENCES
[1] J. Fu, J. H. Patel, and B. L. Janssens, "Stride directed prefetching in scalar processors," in Proc. 25th Annual International Symposium on Microarchitecture, 1992.
[2] D. Joseph and D. Grunwald, "Prefetching using Markov predictors," in Proc. International Symposium on Computer Architecture, June 1997.
[3] M. Annavaram, J. M. Patel, and E. S. Davidson, "Data prefetching by dependence graph precomputation," in Proc. 28th Annual International Symposium on Computer Architecture, Goteborg, Sweden, pp. 52-61, 2001.
[4] J. Collins, D. Tullsen, H. Wang, and J. Shen, "Dynamic speculative precomputation," in Proc. 34th International Symposium on Microarchitecture, Austin, Texas, 2001.
[5] J. Collins, H. Wang, D. Tullsen, et al., "Speculative precomputation: Long-range prefetching of delinquent loads," in Proc. 28th Annual International Symposium on Computer Architecture, Goteborg, Sweden, pp. 14-25, 2001.
[6] D. Kim and D. Yeung, "Design and evaluation of compiler algorithms for pre-execution," in Proc. 10th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, California, 2002.
[7] S. S. Liao, P. Wang, H. Wang, et al., "Post-pass binary adaptation for software-based speculative precomputation," in Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation, Berlin, Germany, 2002.
[8] C.-K. Luk, "Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors," in Proc. 28th Annual International Symposium on Computer Architecture, Goteborg, Sweden, pp. 40-51, 2001.
[9] A. Moshovos, D. N. Pnevmatikatos, and A. Baniasadi, "Slice-processors: An implementation of operation-based prediction," in Proc. International Conference on Supercomputing, Sorrento, Italy, 2001.
[10] A. Roth and G. S. Sohi, "Speculative data-driven multithreading," in Proc. 7th International Symposium on High Performance Computer Architecture, Monterrey, Mexico, 2001.
[11] A. Roth and G. S. Sohi, "A quantitative framework for automated pre-execution thread selection," in Proc. 35th Annual International Symposium on Microarchitecture, Istanbul, Turkey, 2002.
[12] C. B. Zilles and G. Sohi, "Execution-based prediction using speculative slices," in Proc. 28th Annual International Symposium on Computer Architecture, Goteborg, Sweden, pp. 2-13, 2001.
[13] Y. Song, S. Kalogeropulos, and P. Tirumalai, "Design and implementation of a compiler framework for helper threading on multi-core processors," in Proc. 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05), 2005.
[14] D. Kim, S. S. Liao, P. Wang, et al., "Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors," in Proc. International Symposium on Code Generation and Optimization, March 2004.
[15] Z. Gu, N. Zheng, Y. Zhang, et al., "The stable conditions of a task-pair with helper-thread in CMP," in Proc. International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, USA.
[16] S. Srinivasan, L. Zhao, B. Ganesh, et al., "CMP memory modeling: How much does accuracy matter?" in Proc. Fifth Annual Workshop on Modeling, Benchmarking and Simulation, Austin, Texas.
[17] S. Byna, Y. Chen, and X. H. Sun, "A taxonomy of data prefetching mechanisms," Journal of Computer Science and Technology, 24(3): 405-417, May 2009.
[18] S. Liao, T. Hung, D. Nguyen, C. Chou, et al., "Machine learning-based prefetch optimization for data center applications," in Proc. 2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'09), Portland, Oregon, November 2009.
[19] J. Lee, C. Jung, D. Lim, and Y. Solihin, "Prefetching with helper threads for loosely coupled multiprocessor systems," IEEE Transactions on Parallel and Distributed Systems, 20(9), September 2009.


Dynamic Speculative Precomputation

Dynamic Speculative Precomputation In Proceedings of the 34th International Symposium on Microarchitecture, December, 2001 Dynamic Speculative Precomputation Jamison D. Collins y, Dean M. Tullsen y, Hong Wang z, John P. Shen z y Department

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Energy Characterization of Hardware-Based Data Prefetching

Energy Characterization of Hardware-Based Data Prefetching Energy Characterization of Hardware-Based Data Prefetching Yao Guo, Saurabh Chheda, Israel Koren, C. Mani Krishna, and Csaba Andras Moritz Electrical and Computer Engineering, University of Massachusetts,

More information

An Improvement Over Threads Communications on Multi-Core Processors

An Improvement Over Threads Communications on Multi-Core Processors Australian Journal of Basic and Applied Sciences, 6(12): 379-384, 2012 ISSN 1991-8178 An Improvement Over Threads Communications on Multi-Core Processors 1 Reza Fotohi, 2 Mehdi Effatparvar, 3 Fateme Sarkohaki,

More information

Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I. Prof. Onur Mutlu Carnegie Mellon University 10/10/2012

Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I. Prof. Onur Mutlu Carnegie Mellon University 10/10/2012 18-742 Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I Prof. Onur Mutlu Carnegie Mellon University 10/10/2012 Reminder: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 omputer Architecture Spring 2016 Lecture 09: Prefetching Shuai Wang Department of omputer Science and Technology Nanjing University Prefetching(1/3) Fetch block ahead of demand Target compulsory, capacity,

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

15-740/ Computer Architecture

15-740/ Computer Architecture 15-740/18-740 Computer Architecture Lecture 16: Runahead and OoO Wrap-Up Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/17/2011 Review Set 9 Due this Wednesday (October 19) Wilkes, Slave Memories

More information

Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 13:

More information

Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors

Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors Ilya Ganusov and Martin Burtscher Computer Systems Laboratory Cornell University {ilya, burtscher}@csl.cornell.edu Abstract This

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:

More information

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1 Available online at www.sciencedirect.com Physics Procedia 33 (2012 ) 1029 1035 2012 International Conference on Medical Physics and Biomedical Engineering Memory Performance Characterization of SPEC CPU2006

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information