Effect of Data Prefetching on Chip MultiProcessor


THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS
TECHNICAL REPORT OF IEICE

Naoto FUKUMOTO, Tomonobu MIHARA, Koji INOUE, and Kazuaki MURAKAMI

Graduate School of Information Science and Electrical Engineering, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka, JAPAN
Faculty of Information Science and Electrical Engineering, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka, JAPAN
{fukumoto,mihara}@c.csce.kyushu-u.ac.jp, {inoue,murakami}@i.kyushu-u.ac.jp

Abstract
Chip MultiProcessors (CMPs) achieve higher performance by exploiting thread-level parallelism. Increasing the number of processor cores on a chip dramatically improves peak performance. However, because memory bandwidth does not scale with the number of cores, the memory-wall problem becomes more critical. Data prefetching is a well-known way to compensate for poor memory performance and is used in commercial processor chips. Many prefetching techniques have been proposed, but most of them assume a chip with a single processor core. A CMP chip contains shared resources such as L2 caches and buses, so the effect of prefetching on a CMP should differ from its effect on a single-core processor. In this paper, we analyze the effect of prefetching on CMP performance. We first classify the impact of the prefetch operations issued during program execution. We then discuss, qualitatively and quantitatively, the effect of prefetching on memory performance. The experimental results show that the negative effect of invalidating prefetched data is very small. In addition, about 5% of prefetch operations improve the cache hit rates of other cores.

Key words: CMP, data prefetching, cache memory

1. [Section heading and body not recovered. Surviving fragment: CMP (Chip MultiProcessor).]

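[Remainder of Section 1 not recovered. Surviving citations: [3], [5], [4].]

[Figure 1: CMP model. Surviving labels: address bus, data bus, on-chip, off-chip, L2 cache, main memory.]

2. [Section heading and most of the body not recovered. Surviving terms: CMP; L1 and L2 caches; MOESI; prefetching into L1; tagged prefetching [6] and the scheme of [1] (see Section 2.2).]

2.2 [Heading not recovered.]
Tagged prefetching [6] is a next-line (sequential) scheme: for a reference to block address a, the blocks a + 1, a + 2, ..., a + d are prefetched, with d = 5. The second scheme is stride prefetching [1].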
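The tagged next-line scheme can be sketched as follows. This is a minimal illustration, not the simulator's implementation: the trigger rule (prefetch on a demand miss or on the first demand reference to a prefetched block) follows the usual description of tagged prefetching and is an assumption here, since the corresponding text did not survive. The unbounded set used as a cache and the names `TaggedPrefetcher` and `next_line_candidates` are hypothetical.

```python
from typing import List, Set


def next_line_candidates(a: int, d: int = 5) -> List[int]:
    """Return the block addresses a+1, a+2, ..., a+d."""
    return [a + k for k in range(1, d + 1)]


class TaggedPrefetcher:
    """Toy tagged sequential prefetcher over an unbounded cache (assumption)."""

    def __init__(self, d: int = 5) -> None:
        self.d = d
        self.cache: Set[int] = set()   # resident blocks
        self.tagged: Set[int] = set()  # prefetched blocks not yet referenced

    def access(self, a: int) -> List[int]:
        """Demand access to block a; return the blocks prefetched as a result."""
        miss = a not in self.cache
        first_use = a in self.tagged
        self.cache.add(a)
        self.tagged.discard(a)
        if not (miss or first_use):
            return []  # ordinary hit: no prefetch is triggered
        issued = [b for b in next_line_candidates(a, self.d)
                  if b not in self.cache]
        self.cache.update(issued)
        self.tagged.update(issued)
        return issued


p = TaggedPrefetcher(d=5)
on_miss = p.access(100)       # demand miss
on_first_use = p.access(101)  # first reference to a prefetched block
on_plain_hit = p.access(100)  # ordinary hit
```

With d = 5, the miss on block 100 fetches blocks 101 to 105, and the first reference to block 101 extends the window by one block.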

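Stride prefetching: for an access to address a with stride s, tracked per instruction address (PC), the addresses a + s, a + 2s, ..., a + ds are prefetched. Surviving parameters: PC, 64; d = 5.

3. [Heading not recovered.]
[Body largely not recovered. The section reviews the classification of Natalie et al. [4], using an example in which CPU0 issues a prefetch to a block that CPU1 holds in the Modified state.]

[Figure 2: example of cache state transitions for block A.]

Time  CPU0 instruction  CPU0 cache state  CPU1 instruction  CPU1 cache state  Coherence traffic
1                                         store A           Modified          CPU1 write hit
2     prefetch A        Shared                                                CPU0 issues prefetch
3                                                           Owned             CPU1 moves to Owned
4                                         store A           Modified          CPU1 moves to Modified
5                       Invalid                                               CPU0 invalidated (broadcast)

[Figure 3: classification of prefetches. States: Useless, Useful, Harmful, Useless/Conflict, Useful/Conflict, Harmful/Conflict. Transitions are labeled event 1, event 2 and event 3. Region labels: multiprocessor, single processor.]

Starting from Useless, the surviving text associates the classes with events as follows (the event definitions are not recovered):
Useful: event 1
Useless/Conflict: event 2
Useful/Conflict: events 2 and 1
Harmful: event 3
Harmful/Conflict: events 2 and 3

[Surviving fragment: Natalie et al.; Harmful and Harmful/Conflict; 23%.]

4. [Heading and opening text not recovered. Surviving terms: CMP; L2; Figure 4.]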
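The five-step transition example for block A in Section 3 can be replayed with a toy model of one cache line under MOESI. This is a sketch under stated assumptions, not the paper's simulator: it models only the read and write transitions needed for the example, it treats a prefetch as an ordinary read request, and it starts with CPU1 holding the line in the Exclusive state so that the first store is a write hit. The class name `CoherentLine` is hypothetical.

```python
from enum import Enum
from typing import List


class State(Enum):
    MODIFIED = "Modified"
    OWNED = "Owned"
    EXCLUSIVE = "Exclusive"
    SHARED = "Shared"
    INVALID = "Invalid"


class CoherentLine:
    """State of one cache line (block A) in each core's private cache."""

    def __init__(self, initial: List[State]) -> None:
        self.state = list(initial)

    def read(self, cpu: int) -> None:
        """Demand load or prefetch by `cpu`."""
        if self.state[cpu] is not State.INVALID:
            return  # hit: no bus transaction
        shared = False
        for other in range(len(self.state)):
            if other == cpu:
                continue
            if self.state[other] is State.MODIFIED:
                self.state[other] = State.OWNED   # supplies data, stays dirty
            elif self.state[other] is State.EXCLUSIVE:
                self.state[other] = State.SHARED
            if self.state[other] is not State.INVALID:
                shared = True
        self.state[cpu] = State.SHARED if shared else State.EXCLUSIVE

    def write(self, cpu: int) -> List[int]:
        """Store by `cpu`; return the cores whose copies were invalidated."""
        invalidated = []
        for other in range(len(self.state)):
            if other != cpu and self.state[other] is not State.INVALID:
                self.state[other] = State.INVALID
                invalidated.append(other)
        self.state[cpu] = State.MODIFIED
        return invalidated


line = CoherentLine([State.INVALID, State.EXCLUSIVE])  # [CPU0, CPU1]
line.write(1)                      # time 1: CPU1 store A -> Modified
line.read(0)                       # time 2-3: CPU0 prefetch A
after_prefetch = list(line.state)  # CPU0 Shared, CPU1 Owned
killed = line.write(1)             # time 4-5: CPU1 store A, CPU0 invalidated
```

The prefetched copy in CPU0 is invalidated before it is ever used, and CPU1's second store needs an invalidation broadcast that it would not have needed without the prefetch.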

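[Figure 4: classification of prefetches for CMP. States: Useless, Useful, Harmful, Useless/Conflict, Useful/Conflict, Harmful/Conflict, Useless/Remote, Useless/Conflict/Remote. Transitions are labeled event 1, event 2, event 3 and event 4.]

The classification has 8 classes. The surviving text associates the added classes with events as follows:
Useless/Remote: event 4
Useless/Conflict/Remote: events 2 and 4
A prefetch in Useless or Useless/Conflict moves to Useless/Remote or Useless/Conflict/Remote, respectively, on event 4, and to Useful or Useful/Conflict on event 1.

Table 1: [Caption not recovered. Bus traffic is given as Data/Address. The first two column headers are not recovered; the surviving header tokens are "L2" and "L1".]

Class                     (col. 1)  (col. 2)  Bus traffic (Data/Address)
Useless                   ±0        ±0        +1/+1
Useless/Remote            ±0        -1        +1/+1
Useful                    -1        ±0        ±0/±0
Harmful                   ±0        ±0        +1/+2
Useless/Conflict          +1        ±0        +2/+2
Useless/Conflict/Remote   [lost]    [lost]    [lost]/+2
Useful/Conflict           ±0        ±0        +1/+1
Harmful/Conflict          +1        ±0        +2/+3

[Text following Table 1 not recovered. Surviving terms: L2; Useful/Conflict; L1; Useful.]

5. [Heading not recovered.]
The total number of clock cycles CC is expressed with the execution cycles CC_exe, the memory access cycles CC_mem, and the overlapped cycles CC_overlap:

\[ CC = CC_{exe} + CC_{mem} - CC_{overlap} \qquad (1) \]

CC_mem is given by equation (2):

\[ CC_{mem} = AC \times \{ HCC_{L1} + MR_{L1} \times ( (1 - MR_{L1R}) \times (SBCC + HCC_{L1}) + MR_{L1R} \times ( (HCC_{L2} + SBCC) + MR_{L2} \times (MBCC + MC_{L2}) ) ) \} \qquad (2) \]

AC: number of memory accesses
HCC_L1: L1 hit clock cycles
MR_L1: L1 miss rate
SBCC: clock cycles on the L1-L2 bus
MR_L1R: miss rate in the L1 caches of the other cores
HCC_L2: L2 hit clock cycles
MR_L2: L2 miss rate
MBCC: clock cycles on the bus between L2 and main memory
MC_L2: clock cycles incurred on an L2 miss

[Discussion of equation (2) largely not recovered. The surviving text associates the terms with the prefetch classes as follows.]
MR_L1: Useless/Conflict, Useful
SBCC: Useful
MR_L1R: Useless/Remote, Useless/Conflict/Remote
MR_L2: [text lost; mentions L2]
MBCC: Useful
MC_L2: [text lost]
[Closing fragment: MR_L1, MR_L1R, MR_L2, SBCC, MBCC; CC_mem; MR_L1, MR_L2.]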
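Equations (1) and (2) of the performance model can be evaluated directly. The sketch below is illustrative only: the function and field names are hypothetical, the latencies of 1, 12 and 300 cycles are borrowed from the simulated configuration, and the access count, miss rates and bus cycles are made-up values chosen to show how a lower L1 miss rate, as a Useful prefetch would produce, reduces CC_mem.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MemParams:
    ac: float      # AC: number of memory accesses
    hcc_l1: float  # HCC_L1: L1 hit clock cycles
    mr_l1: float   # MR_L1: L1 miss rate
    mr_l1r: float  # MR_L1R: miss rate in the other cores' L1 caches
    sbcc: float    # SBCC: L1-L2 bus clock cycles
    hcc_l2: float  # HCC_L2: L2 hit clock cycles
    mr_l2: float   # MR_L2: L2 miss rate
    mbcc: float    # MBCC: L2-to-memory bus clock cycles
    mc_l2: float   # MC_L2: clock cycles incurred on an L2 miss


def cc_mem(p: MemParams) -> float:
    """Equation (2): memory access cycles."""
    remote_l1 = (1 - p.mr_l1r) * (p.sbcc + p.hcc_l1)
    l2_path = p.mr_l1r * ((p.hcc_l2 + p.sbcc)
                          + p.mr_l2 * (p.mbcc + p.mc_l2))
    return p.ac * (p.hcc_l1 + p.mr_l1 * (remote_l1 + l2_path))


def total_cycles(cc_exe: float, cc_mem_value: float, cc_overlap: float) -> float:
    """Equation (1): CC = CC_exe + CC_mem - CC_overlap."""
    return cc_exe + cc_mem_value - cc_overlap


# Hypothetical workload: all rates, counts and bus cycles are illustrative.
base = MemParams(ac=1000, hcc_l1=1, mr_l1=0.1, mr_l1r=0.5, sbcc=4,
                 hcc_l2=12, mr_l2=0.2, mbcc=8, mc_l2=300)
with_useful_prefetch = replace(base, mr_l1=0.05)  # fewer L1 misses

baseline_cycles = cc_mem(base)
improved_cycles = cc_mem(with_useful_prefetch)
```

Because every term after HCC_L1 is scaled by MR_L1, halving the L1 miss rate in this example removes half of the miss penalty while leaving the L1 hit cost unchanged.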

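Table 2: [Caption not recovered; simulation parameters.]

L1 caches (two entries, identical)   64KB, 2-way, 64B lines, 1 clock cycle each
L2 cache                             4MB, 8-way, 64B lines, 12 clock cycles
Bus [labels partly lost]             64B; L2-[...]: 16B
DRAM                                 300 clock cycles

6. [Heading not recovered.]

6.1 [Heading not recovered.]
[Body largely not recovered. The experiments use the M5 simulator [2] from the University of Michigan to model the CMP and run programs from the SPLASH2 benchmark suite [7]. Table 2 lists the simulation parameters and Table 3 the benchmark inputs. Tagged prefetching is among the evaluated schemes.]

Table 3: SPLASH2 benchmark programs and inputs

barnes          8k particles
fmm             16k particles
lu(contig)      [size lost] matrix
radix           256K keys
raytrace        teapot.env
water(spatial)  512 molecules

[Results text largely not recovered. Surviving fragments, in order: Useful 80%; tagged: Useful 30%, Useless/Conflict 20%; Radix: Useful, Useless/Conflict; L1; Harmful and Harmful/Conflict: 1%; Useless/Remote and Useless/Conflict/Remote: 5%.]

[Figure 5: classification of prefetches, tagged prefetching.]

[Figure 6: breakdown of memory access time. Components: HCCL1, HCCRL1, HCCL2, SBCC, MBSS, MCL2.]

The components are defined as:

\[ HCCRL1 = MR_{L1} \times (1 - MR_{L1R}) \times HCC_{L1} \]
\[ HCCL2 = MR_{L1} \times MR_{L1R} \times HCC_{L2} \]
\[ SBCC = MR_{L1} \times SBCC \]
\[ MBSS = MR_{L1} \times MR_{L1R} \times MR_{L2} \times MBCC \]
\[ MCL2 = MR_{L1} \times MR_{L1R} \times MR_{L2} \times MC_{L2} \]

[Remaining text not recovered. Surviving fragments: L1; Useful 80%; tagged; Figure 6.]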
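The component definitions above follow from expanding the model's memory-cycle equation per access. The worked expansion below is a reconstruction: it assumes the component labeled HCCL1 is the unconditional L1 hit term HCC_{L1}, which the surviving text does not define explicitly. On the left of the SBCC line, SBCC denotes the plotted component; on the right, the bus cycles parameter.

```latex
\begin{align*}
\frac{CC_{mem}}{AC}
 &= HCC_{L1} + MR_{L1}\Bigl((1 - MR_{L1R})(SBCC + HCC_{L1}) \\
 &\qquad\qquad + MR_{L1R}\bigl((HCC_{L2} + SBCC) + MR_{L2}(MBCC + MC_{L2})\bigr)\Bigr) \\
 &= \underbrace{HCC_{L1}}_{HCCL1}
  + \underbrace{MR_{L1}(1 - MR_{L1R})\,HCC_{L1}}_{HCCRL1}
  + \underbrace{MR_{L1}\,MR_{L1R}\,HCC_{L2}}_{HCCL2} \\
 &\quad + \underbrace{MR_{L1}\,SBCC}_{SBCC}
  + \underbrace{MR_{L1}\,MR_{L1R}\,MR_{L2}\,MBCC}_{MBSS}
  + \underbrace{MR_{L1}\,MR_{L1R}\,MR_{L2}\,MC_{L2}}_{MCL2}
\end{align*}
```

The SBCC term collapses to MR_{L1} × SBCC because the L1-L2 bus is used on every L1 miss, whether the block is supplied by another core's L1 (probability 1 - MR_{L1R}) or by the L2 path (probability MR_{L1R}).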

6.3 [Heading not recovered.]
[Body largely not recovered. Surviving fragments, in order: Useful; tagged: Useful 30%, 70%; L1-L2; Useless/Conflict 20%; Useful, Useless/Conflict; Harmful, Harmful/Conflict; stride; L1; Figure 7; L1 sizes of 128KB, 256KB, 512KB and 1MB.]

[Figure 7: tagged prefetching with L1 Dcache sizes of 128, 256, 512 and 1024 KB. Benchmarks: Barnes, FMM, LU, Radix, Raytrace, Water. Vertical axis: 0% to 60%.]

7. [Heading not recovered.]
[Body largely not recovered. Surviving fragments: CMP; 1%; 5%; tagged.]

[Acknowledgment text not recovered.]

References
[1] J. L. Baer and T. F. Chen. An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty. In Proceedings of the 1991 Conference on Supercomputing.
[2] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt. Network-oriented full-system simulation using M5. In Sixth Workshop on Computer Architecture Evaluation using Commercial Workloads, February.
[3] F. Dahlgren and P. Stenström. Evaluation of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems.
[4] N. D. Enright Jerger, E. L. Hill, and M. H. Lipasti. Friendly Fire: Understanding the Effects of Multiprocessor Prefetching. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, March.
[5] M. J. Garzaran, J. L. Briz, P. E. Ibanez, and V. Vinals. Hardware prefetching in bus-based multiprocessors: Pattern characterization and cost-effective hardware. In Proceedings of Parallel and Distributed Processing 2001, February.
[6] A. J. Smith. Cache Memories. Computing Surveys, Vol. 14, No. 3, September.
[7] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, June.
