Effect of Data Prefetching on Chip MultiProcessor


THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS
TECHNICAL REPORT OF IEICE

Naoto FUKUMOTO, Tomonobu MIHARA, Koji INOUE, and Kazuaki MURAKAMI

Graduate School of Information Science and Electrical Engineering, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka, JAPAN
Faculty of Information Science and Electrical Engineering, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka, JAPAN
{fukumoto,mihara}@c.csce.kyushu-u.ac.jp, {inoue,murakami}@i.kyushu-u.ac.jp

Abstract Chip MultiProcessors (CMPs) achieve higher performance by exploiting thread-level parallelism. Increasing the number of processor cores on a chip dramatically improves peak performance. However, because memory bandwidth does not scale with the number of cores, the negative impact of the memory-wall problem becomes more critical. Data prefetching is a well-known approach to compensating for poor memory performance and has been employed in commercial processor chips. Although a number of prefetching techniques have been proposed, most of them assume a single processor core per chip. CMP chips contain shared resources such as L2 caches and buses, so the effect of prefetching on CMPs should differ from that on single-core processors. In this paper, we analyze the effect of prefetching on CMP performance. We first classify the impact of prefetch operations issued during program execution. We then discuss, qualitatively and quantitatively, the effect of prefetching on memory performance. The experimental results show that the negative effect of invalidating prefetched data is very small. In addition, about 5% of prefetch operations improve the cache hit rates of other cores.

Key words CMP, data prefetching, cache memory

1. (CMP: Chip MultiProcessor)

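(The Japanese text of Section 1 is largely lost. Surviving fragments: CMP, I/O, and citations of related work [3], [5], and [4].)

Figure 1: (caption lost) Labels: address bus, data bus, on-chip, off-chip, L2 cache, main memory.

2. (Heading lost.) Surviving fragments on the target CMP: L1 and L2 caches, the MOESI protocol, tagged prefetching [6], and the scheme of [1], both described in Section 2.2.

2.2 Tagged prefetching [6] is a next-line (sequential) scheme: for an access to a, the prefetch targets are a + 1, a + 2, ..., a + d. The surviving text gives the value 5 alongside d.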
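The next-line scheme above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the configuration used in the report: the cache is unbounded (no replacement), addresses are block numbers, and the trigger rule is the usual tagged-prefetch rule (a demand miss, or the first reference to a prefetched block). The class name `TaggedSequentialPrefetcher` and the parameter `degree` (standing for d) are hypothetical.

```python
class TaggedSequentialPrefetcher:
    """Toy model of tagged next-line prefetching over block numbers."""

    def __init__(self, degree=5):
        self.degree = degree  # d: number of following blocks to request
        self.cache = {}       # block -> tag bit (True = prefetched, not yet referenced)

    def access(self, block):
        """Demand access to `block`; return the blocks prefetched as a result."""
        if block in self.cache and not self.cache[block]:
            return []  # ordinary hit on an already referenced block: no prefetch
        # Demand miss, or first reference to a prefetched block: trigger a prefetch.
        self.cache[block] = False
        issued = []
        for k in range(1, self.degree + 1):
            target = block + k
            if target not in self.cache:
                self.cache[target] = True  # install with the tag bit set
                issued.append(target)
        return issued


prefetcher = TaggedSequentialPrefetcher(degree=2)
first = prefetcher.access(10)    # demand miss on block 10
repeat = prefetcher.access(10)   # plain hit
tagged = prefetcher.access(11)   # first reference to a prefetched block
```

With `degree=2`, the miss on block 10 requests blocks 11 and 12, the repeated access requests nothing, and the first touch of block 11 extends the stream to block 13.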

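Stride prefetching [1]: for an access to a with stride s, tracked per program counter (PC), the prefetch targets are a + s, a + 2s, ..., a + ds. Surviving fragments: tagged, PC, 64, d, 5.

3. (Heading lost.) Natalie et al. [4]. Surviving fragments: CPU0, CPU1, Modified.

Figure 2: (caption lost)
Columns: time; CPU0 instruction and cache state; CPU1 instruction and cache state; coherence traffic.
Time 1: CPU1 executes store A; CPU1 state Modified (CPU1 write hit).
Time 2: CPU0 executes prefetch A; CPU0 state Shared (CPU0 issues the prefetch).
Time 3: CPU1 state Owned (CPU1 moves to Owned).
Time 4: CPU1 executes store A; CPU1 state Modified (CPU1 moves to Modified).
Time 5: CPU0 state Invalid (CPU0 copy invalidated by broadcast).

Figure 3: (caption lost) Labels: multiprocessor, single processor. Classes: Useless, Useful, Harmful, Useless/Conflict, Useful/Conflict, Harmful/Conflict. Events: event 1, event 2, event 3.

The text following the figure pairs the classes with event numbers: Useful, 1; Useless/Conflict, 2; Useful/Conflict, 2 and 1; Harmful, 3; Harmful/Conflict, 2 and 3. A further fragment reads: Natalie et al., Harmful and Harmful/Conflict, 23%.

4. (Heading lost.) Surviving fragments: CMP, L2.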
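The sequence in Figure 2 can be replayed with a very small state model. The sketch below is an illustration under assumptions, not the protocol implementation used in the report: it tracks a single block A, handles only the two operations that appear in the figure (a store and a read-type prefetch), and uses textbook MOESI transitions. The function names `store` and `prefetch` and the traffic strings are hypothetical.

```python
def store(states, cpu):
    """`cpu` writes block A; returns a short description of the bus traffic."""
    if states[cpu] in ("M", "E"):
        states[cpu] = "M"
        return "write hit, no bus traffic"
    had_copy = states[cpu] != "I"
    for other in states:
        if other != cpu:
            states[other] = "I"  # every other copy is invalidated
    states[cpu] = "M"
    return "invalidation broadcast" if had_copy else "read-exclusive request"


def prefetch(states, cpu):
    """`cpu` issues a read-type prefetch for block A."""
    if states[cpu] != "I":
        return "already cached, prefetch dropped"
    holders = [c for c in states if c != cpu and states[c] != "I"]
    for other in holders:
        if states[other] == "M":
            states[other] = "O"  # dirty owner keeps supplying the data
        elif states[other] == "E":
            states[other] = "S"
    states[cpu] = "S" if holders else "E"
    return "read request"


states = {"CPU0": "I", "CPU1": "M"}
trace = []
trace.append((store(states, "CPU1"), dict(states)))     # time 1
trace.append((prefetch(states, "CPU0"), dict(states)))  # times 2 and 3
trace.append((store(states, "CPU1"), dict(states)))     # times 4 and 5
```

After the prefetch, CPU0 holds A in Shared and CPU1 has dropped from Modified to Owned, so the second store by CPU1 is no longer a silent write hit: it needs an invalidation broadcast, and CPU0's prefetched copy is discarded unused.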

Figure 4: (caption lost) Classes: Useful, Useful/Conflict, Useless, Useless/Conflict, Useless/Remote, Useless/Conflict/Remote, Harmful, Harmful/Conflict. Events: event 2, event 3, event 4.

The text following the figure lists 8 classes and pairs the two new ones with event numbers: Useless/Remote, 4; Useless/Conflict/Remote, 2 and 4.

Table 1: (caption lost; the first two column headers survive only as the fragments "L2" and "L1")

Class                      Column 1   Column 2   Bus traffic (Data/Address)
Useless                    ±0         ±0         +1/+1
Useless/Remote             ±0         -1         +1/+1
Useful                     -1         ±0         ±0/±0
Harmful                    ±0         ±0         +1/+2
Useless/Conflict           +1         ±0         +2/+2
Useless/Conflict/Remote    (lost)     (lost)     (lost)/+2
Useful/Conflict            ±0         ±0         +1/+1
Harmful/Conflict           +1         ±0         +2/+3

5. (Heading lost.)

CC is expressed with CC_exe, CC_mem, and CC_overlap as in equation (1), and CC_mem is given by equation (2).

$$CC = CC_{exe} + CC_{mem} - CC_{overlap} \quad (1)$$

$$CC_{mem} = AC \times \{ HCC_{L1} + MR_{L1} \times ( (1 - MR_{L1R}) \times (SBCC + HCC_{L1}) + MR_{L1R} \times ( (HCC_{L2} + SBCC) + MR_{L2} \times (MBCC + MC_{L2}) ) ) \} \quad (2)$$

Symbols (the definitions survive only as fragments):
AC: (lost)
HCC_L1: L1 ...
MR_L1: L1 ...
SBCC: L1-L2 ...
MR_L1R: ... L1 ...
HCC_L2: L2 ...
MR_L2: L2 ...
MBCC: L2-...
MC_L2: (lost)

The discussion of equation (2) relates its terms to the prefetch classes: MR_L1 with Useless/Conflict and Useful; SBCC with Useful; MR_L1R with Useless/Remote and Useless/Conflict/Remote; MR_L2 with the L2 cache; MBCC with Useful. The closing fragment lists MR_L1, MR_L1R, MR_L2, SBCC, and MBCC as the terms of CC_mem that prefetching affects.
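Equation (2) is easy to evaluate directly, which makes it possible to see how a change in one miss rate moves CC_mem. The function below is a minimal sketch that transcribes the equation term by term; the function name `cc_mem` and every numeric value in the example are assumptions chosen for illustration, not measurements from the report.

```python
def cc_mem(ac, hcc_l1, mr_l1, mr_l1r, sbcc, hcc_l2, mr_l2, mbcc, mc_l2):
    """Memory access clock cycles according to equation (2)."""
    # Local L1 miss served by another core's L1.
    remote_hit = (1 - mr_l1r) * (sbcc + hcc_l1)
    # Local L1 miss that also misses in the other L1s: go to L2, then memory on an L2 miss.
    remote_miss = mr_l1r * ((hcc_l2 + sbcc) + mr_l2 * (mbcc + mc_l2))
    return ac * (hcc_l1 + mr_l1 * (remote_hit + remote_miss))


# Illustrative values only (assumed, not taken from the report).
base = cc_mem(ac=1000, hcc_l1=1, mr_l1=0.10, mr_l1r=0.5, sbcc=4,
              hcc_l2=12, mr_l2=0.2, mbcc=10, mc_l2=300)
fewer_l1_misses = cc_mem(ac=1000, hcc_l1=1, mr_l1=0.05, mr_l1r=0.5, sbcc=4,
                         hcc_l2=12, mr_l2=0.2, mbcc=10, mc_l2=300)
```

With these assumed inputs, halving MR_L1 lowers CC_mem from about 5150 to about 3075 cycles, since every term except the local L1 hit time is scaled by MR_L1.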

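Table 2: (caption lost)
L1 cache (label lost): 64KB 2-way, 64B lines, 1 clock cycle
L1 cache (label lost): 64KB 2-way, 64B lines, 1 clock cycle
L2 cache: 4MB 8-way, 64B lines, 12 clock cycles
Bus widths (labels partly lost): 64B; L2-...: 16B
DRAM: 300 clock cycles

Surviving fragments on the experimental setup: the M5 CMP simulator [2] (Michigan); a CMP configuration with the figure 2; the SPLASH2 benchmarks [7]; tagged prefetching.

Table 3: SPLASH2 inputs
barnes: 8k particles
fmm: 16k particles
lu (contig): matrix (size lost)
radix: 256K keys
raytrace: teapot.env
water (spatial): 512 molecules

Surviving fragments of the results: Useful, 80%; with tagged prefetching, Useful, 30%, and Useless/Conflict, 20%; Radix; L1; Harmful and Harmful/Conflict, 1%; Useless/Remote and Useless/Conflict/Remote, 5%.

Figures 5 and 6: (captions lost; fragment: tagged)

The breakdown uses the components HCCL1, HCCRL1, HCCL2, SBCC, MBSS, and MCL2 (fragments: L1, L1, L2):

$$HCCRL1 = MR_{L1} \times (1 - MR_{L1R}) \times HCC_{L1}$$
$$HCCL2 = MR_{L1} \times MR_{L1R} \times HCC_{L2}$$
$$SBCC = MR_{L1} \times SBCC$$
$$MBSS = MR_{L1} \times MR_{L1R} \times MR_{L2} \times MBCC$$
$$MCL2 = MR_{L1} \times MR_{L1R} \times MR_{L2} \times MC_{L2}$$

Further fragments: L1; L1; Useful, 80%; tagged; Table 3; Figure 6.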
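The breakdown components above are the expansion of equation (2) divided by AC, so together with the local L1 hit time they should add up to the per-access memory cost. The sketch below checks that numerically. It uses the latencies listed in Table 2 (1, 12, and 300 clock cycles) for HCC_L1, HCC_L2, and MC_L2; the bus cycle counts and all three miss rates are assumed values, and taking the HCCL1 component to be HCC_L1 itself is also an assumption. The function name `breakdown_per_access` is hypothetical.

```python
def breakdown_per_access(hcc_l1, mr_l1, mr_l1r, sbcc, hcc_l2, mr_l2, mbcc, mc_l2):
    """Per-access components of memory access time, keyed by the labels in the text."""
    return {
        "HCCL1": hcc_l1,                          # assumed: local L1 hit time
        "HCCRL1": mr_l1 * (1 - mr_l1r) * hcc_l1,  # hit in another core's L1
        "HCCL2": mr_l1 * mr_l1r * hcc_l2,         # L2 hit time
        "SBCC": mr_l1 * sbcc,                     # L1-L2 bus time
        "MBSS": mr_l1 * mr_l1r * mr_l2 * mbcc,    # L2-memory bus time
        "MCL2": mr_l1 * mr_l1r * mr_l2 * mc_l2,   # main memory access time
    }


# Latencies from Table 2; miss rates and bus cycles are assumed for illustration.
parts = breakdown_per_access(hcc_l1=1, mr_l1=0.10, mr_l1r=0.5, sbcc=4,
                             hcc_l2=12, mr_l2=0.2, mbcc=10, mc_l2=300)
total = sum(parts.values())

# Equation (2) divided by AC, evaluated with the same inputs.
eq2_per_access = 1 + 0.10 * ((1 - 0.5) * (4 + 1)
                             + 0.5 * ((12 + 4) + 0.2 * (10 + 300)))
shares = {name: value / total for name, value in parts.items()}
```

With these inputs the main memory component MCL2 accounts for roughly 58% of the per-access cost even though only 1% of accesses reach DRAM, which is why small changes in MR_L1, MR_L1R, or MR_L2 dominate the result.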

Figure 7: (caption lost) Results for L1 data cache sizes of 128, 256, 512, and 1024 KB. Benchmarks: Barnes, FMM, LU, Radix, Raytrace, Water. Vertical axis: 0% to 60%.

6. (Heading lost.) Surviving fragments: Table 3; Useful; tagged; Useful, 30%; 70%; L1-L2; Useless/Conflict, 20%; Useful; Useless/Conflict; Harmful; Harmful/Conflict; stride; 77; L1; Figure 7; L1 sizes of 128KB, 256KB, 512KB, and 1MB.

7. (Heading lost.) Surviving fragments: CMP; 1%; 5%; tagged.

(Acknowledgment fragment: LSI, (A).)

[1] J. L. Baer and T. F. Chen. An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty. In Proceedings of the 1991 Conference on Supercomputing.
[2] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt. Network-Oriented Full-System Simulation using M5. In Sixth Workshop on Computer Architecture Evaluation using Commercial Workloads, February.
[3] F. Dahlgren and P. Stenström. Evaluation of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems.
[4] N. D. Enright Jerger, E. L. Hill, and M. H. Lipasti. Friendly Fire: Understanding the Effects of Multiprocessor Prefetches. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, March.
[5] M. J. Garzaran, J. L. Briz, P. E. Ibanez, and V. Vinals. Hardware Prefetching in Bus-Based Multiprocessors: Pattern Characterization and Cost-Effective Hardware. In Proceedings of Parallel and Distributed Processing 2001, February.
[6] A. J. Smith. Cache Memories. Computing Surveys, Vol. 14, No. 3, September.
[7] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, June.

Performance Balancing: Software-based On-chip Memory Management for Effective CMP Executions

Performance Balancing: Software-based On-chip Memory Management for Effective CMP Executions Performance Balancing: Software-based On-chip Memory Management for Effective CMP Executions Naoto Fukumoto, Kenichi Imazato, Koji Inoue, Kazuaki Murakami Department of Advanced Information Technology,

More information

Miss Penalty Reduction Using Bundled Capacity Prefetching in Multiprocessors

Miss Penalty Reduction Using Bundled Capacity Prefetching in Multiprocessors Miss Penalty Reduction Using Bundled Capacity Prefetching in Multiprocessors Dan Wallin and Erik Hagersten Uppsala University Department of Information Technology P.O. Box 337, SE-751 05 Uppsala, Sweden

More information

for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami

for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami 3D Implemented dsram/dram HbidC Hybrid Cache Architecture t for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami Kyushu University

More information

3D Memory Architecture. Kyushu University

3D Memory Architecture. Kyushu University 3D Memory Architecture Koji Inoue Kyushu University 1 Outline Why 3D? Will 3D always work well? Support Adaptive Execution! Memory Hierarchy Run time Optimization Conclusions 2 Outline Why 3D? Will 3D

More information

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache

More information

PERFORMANCE OF CACHE MEMORY SUBSYSTEMS FOR MULTICORE ARCHITECTURES

PERFORMANCE OF CACHE MEMORY SUBSYSTEMS FOR MULTICORE ARCHITECTURES PERFORMANCE OF CACHE MEMORY SUBSYSTEMS FOR MULTICORE ARCHITECTURES N. Ramasubramanian 1, Srinivas V.V. 2 and N. Ammasai Gounden 3 1, 2 Department of Computer Science and Engineering, National Institute

More information

Hardware Prefetching in Bus-Based Multiprocessors: Pattern Characterization and Cost-Effective Hardware

Hardware Prefetching in Bus-Based Multiprocessors: Pattern Characterization and Cost-Effective Hardware Hardware Prefetching in Bus-d Multiprocessors: Pattern Characterization and Cost-Effective Hardware M.J. Garzarán, J.L. Briz, P.E. Ibáñez and V. Viñals Univ. Zaragoza, España {garzaran,briz,imarin,victor}@posta.unizar.es

More information

Shared vs. Snoop: Evaluation of Cache Structure for Single-chip Multiprocessors

Shared vs. Snoop: Evaluation of Cache Structure for Single-chip Multiprocessors vs. : Evaluation of Structure for Single-chip Multiprocessors Toru Kisuki,Masaki Wakabayashi,Junji Yamamoto,Keisuke Inoue, Hideharu Amano Department of Computer Science, Keio University 3-14-1, Hiyoshi

More information

RAPID HARDWARE PROTOTYPING ON RPM-2: METHODOLOGY AND EXPERIENCE

RAPID HARDWARE PROTOTYPING ON RPM-2: METHODOLOGY AND EXPERIENCE RAPID HARDWARE PROTOTYPING ON RPM-2: METHODOLOGY AND EXPERIENCE Michel Dubois, Jaeheon Jeong, Yong Ho Song, and Adrian Moga Department of Electrical Engineering - Systems University of Southern California

More information

An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors

An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors Myoung Kwon Tcheun, Hyunsoo Yoon, Seung Ryoul Maeng Department of Computer Science, CAR Korea Advanced nstitute of Science and

More information

Bundling: Reducing the Overhead of Multiprocessor Prefetchers

Bundling: Reducing the Overhead of Multiprocessor Prefetchers Bundling: Reducing the Overhead of Multiprocessor Prefetchers Dan Wallin and Erik Hagersten Uppsala University, Department of Information Technology P.O. Box 337, SE-751 05 Uppsala, Sweden fdan.wallin,

More information

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables Storage Efficient Hardware Prefetching using Correlating Prediction Tables Marius Grannaes Magnus Jahre Lasse Natvig Norwegian University of Science and Technology HiPEAC European Network of Excellence

More information

Adaptive Prefetching Technique for Shared Virtual Memory

Adaptive Prefetching Technique for Shared Virtual Memory Adaptive Prefetching Technique for Shared Virtual Memory Sang-Kwon Lee Hee-Chul Yun Joonwon Lee Seungryoul Maeng Computer Architecture Laboratory Korea Advanced Institute of Science and Technology 373-1

More information

Bundling: Reducing the Overhead of Multiprocessor Prefetchers

Bundling: Reducing the Overhead of Multiprocessor Prefetchers Bundling: Reducing the Overhead of Multiprocessor Prefetchers Dan Wallin and Erik Hagersten Uppsala University, Department of Information Technology P.O. Box 337, SE-751 05 Uppsala, Sweden dan.wallin,

More information

Snoop-Based Multiprocessor Design III: Case Studies

Snoop-Based Multiprocessor Design III: Case Studies Snoop-Based Multiprocessor Design III: Case Studies Todd C. Mowry CS 41 March, Case Studies of Bus-based Machines SGI Challenge, with Powerpath SUN Enterprise, with Gigaplane Take very different positions

More information

Cache Memories. From Bryant and O Hallaron, Computer Systems. A Programmer s Perspective. Chapter 6.

Cache Memories. From Bryant and O Hallaron, Computer Systems. A Programmer s Perspective. Chapter 6. Cache Memories From Bryant and O Hallaron, Computer Systems. A Programmer s Perspective. Chapter 6. Today Cache memory organization and operation Performance impact of caches The memory mountain Rearranging

More information

SGI Challenge Overview

SGI Challenge Overview CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 2 (Case Studies) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions

More information

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Bushra Ahsan and Mohamed Zahran Dept. of Electrical Engineering City University of New York ahsan bushra@yahoo.com mzahran@ccny.cuny.edu

More information

A Study of the Efficiency of Shared Attraction Memories in Cluster-Based COMA Multiprocessors

A Study of the Efficiency of Shared Attraction Memories in Cluster-Based COMA Multiprocessors A Study of the Efficiency of Shared Attraction Memories in Cluster-Based COMA Multiprocessors Anders Landin and Mattias Karlgren Swedish Institute of Computer Science Box 1263, S-164 28 KISTA, Sweden flandin,

More information

Lect. 6: Directory Coherence Protocol

Lect. 6: Directory Coherence Protocol Lect. 6: Directory Coherence Protocol Snooping coherence Global state of a memory line is the collection of its state in all caches, and there is no summary state anywhere All cache controllers monitor

More information

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont. History Table. Correlating Prediction Table

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont.   History Table. Correlating Prediction Table Lecture 15 History Table Correlating Prediction Table Prefetching Latest A0 A0,A1 A3 11 Fall 2018 Jon Beaumont A1 http://www.eecs.umich.edu/courses/eecs470 Prefetch A3 Slides developed in part by Profs.

More information

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads Hideki Miwa, Yasuhiro Dougo, Victor M. Goulart Ferreira, Koji Inoue, and Kazuaki Murakami Dept. of Informatics, Kyushu

More information

Identifying Optimal Multicore Cache Hierarchies for Loop-based Parallel Programs via Reuse Distance Analysis

Identifying Optimal Multicore Cache Hierarchies for Loop-based Parallel Programs via Reuse Distance Analysis Identifying Optimal Multicore Cache Hierarchies for Loop-based Parallel Programs via Reuse Distance Analysis Meng-Ju Wu and Donald Yeung Department of Electrical and Computer Engineering University of

More information

Friendly Fire: Understanding the Effects of Multiprocessor Prefetches

Friendly Fire: Understanding the Effects of Multiprocessor Prefetches Friendly Fire: Understanding the Effects of Multiprocessor Prefetches Natalie D. Enright Jerger, Eric L. Hill, and Mikko H. Lipasti Department of Electrical & Computer Engineering, University of Wisconsin-Madison

More information

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 14: Photonic Interconnect

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 14: Photonic Interconnect 1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 14: Photonic Interconnect Instructor: Ron Dreslinski Winter 2016 1 1 Announcements 2 Remaining lecture schedule 3/15: Photonics

More information

Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy

Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy Bushra Ahsan Electrical Engineering Department City University of New York bahsan@gc.cuny.edu Mohamed Zahran Electrical Engineering

More information

h Coherence Controllers

h Coherence Controllers High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs

More information

Design of A Memory Latency Tolerant. *Faculty of Eng.,Tokai Univ **Graduate School of Eng.,Tokai Univ. *

Design of A Memory Latency Tolerant. *Faculty of Eng.,Tokai Univ **Graduate School of Eng.,Tokai Univ. * Design of A Memory Latency Tolerant Processor() Naohiko SHIMIZU* Kazuyuki MIYASAKA** Hiroaki HARAMIISHI** *Faculty of Eng.,Tokai Univ **Graduate School of Eng.,Tokai Univ. 1117 Kitakaname Hiratuka-shi

More information

Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors

Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors Maged M. Michael y, Ashwini K. Nanda z, Beng-Hong Lim z, and Michael L. Scott y y University of Rochester z IBM Research Department

More information

Cache Architecture Limitations in Multicore Processors

Cache Architecture Limitations in Multicore Processors Journal of Emerging Trends in Engineering and Applied Sciences (JETEAS) 8(1): 7-13 Scholarlink Research Institute Journals, 2017 (ISSN: 2141-7016) jeteas.scholarlinkresearch.com Journal of Emerging Trends

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Agenda Cache memory organization and operation Chapter 6 Performance impact of caches Cache Memories

Agenda Cache memory organization and operation Chapter 6 Performance impact of caches Cache Memories Agenda Chapter 6 Cache Memories Cache memory organization and operation Performance impact of caches The memory mountain Rearranging loops to improve spatial locality Using blocking to improve temporal

More information

Early Experience with Profiling and Optimizing Distributed Shared Cache Performance on Tilera s Tile Processor

Early Experience with Profiling and Optimizing Distributed Shared Cache Performance on Tilera s Tile Processor Early Experience with Profiling and Optimizing Distributed Shared Cache Performance on Tilera s Tile Processor Inseok Choi, Minshu Zhao, Xu Yang, and Donald Yeung Department of Electrical and Computer

More information

Memory Access Pattern-Aware DRAM Performance Model for Multi-core Systems

Memory Access Pattern-Aware DRAM Performance Model for Multi-core Systems Memory Access Pattern-Aware DRAM Performance Model for Multi-core Systems ISPASS 2011 Hyojin Choi *, Jongbok Lee +, and Wonyong Sung * hjchoi@dsp.snu.ac.kr, jblee@hansung.ac.kr, wysung@snu.ac.kr * Seoul

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

The SPLASH-2 Programs: Characterization and Methodological Considerations

The SPLASH-2 Programs: Characterization and Methodological Considerations Appears in the Proceedings of the nd Annual International Symposium on Computer Architecture, pages -36, June 995 The SPLASH- Programs: Characterization and Methodological Considerations Steven Cameron

More information

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!

More information

Hybrid Limited-Pointer Linked-List Cache Directory and Cache Coherence Protocol

Hybrid Limited-Pointer Linked-List Cache Directory and Cache Coherence Protocol Hybrid Limited-Pointer Linked-List Cache Directory and Cache Coherence Protocol Mostafa Mahmoud, Amr Wassal Computer Engineering Department, Faculty of Engineering, Cairo University, Cairo, Egypt {mostafa.m.hassan,

More information

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and

More information

Portland State University ECE 587/687. Caches and Memory-Level Parallelism

Portland State University ECE 587/687. Caches and Memory-Level Parallelism Portland State University ECE 587/687 Caches and Memory-Level Parallelism Copyright by Alaa Alameldeen, Zeshan Chishti and Haitham Akkary 2017 Revisiting Processor Performance Program Execution Time =

More information

Cost-Performance Evaluation of SMP Clusters

Cost-Performance Evaluation of SMP Clusters Cost-Performance Evaluation of SMP Clusters Darshan Thaker, Vipin Chaudhary, Guy Edjlali, and Sumit Roy Parallel and Distributed Computing Laboratory Wayne State University Department of Electrical and

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

CS 838 Chip Multiprocessor Prefetching

CS 838 Chip Multiprocessor Prefetching CS 838 Chip Multiprocessor Prefetching Kyle Nesbit and Nick Lindberg Department of Electrical and Computer Engineering University of Wisconsin Madison 1. Introduction Over the past two decades, advances

More information

Multicast Snooping: A Multicast Address Network. A New Coherence Method Using. With sponsorship and/or participation from. Mark Hill & David Wood

Multicast Snooping: A Multicast Address Network. A New Coherence Method Using. With sponsorship and/or participation from. Mark Hill & David Wood Multicast Snooping: A New Coherence Method Using A Multicast Address Ender Bilir, Ross Dickson, Ying Hu, Manoj Plakal, Daniel Sorin, Mark Hill & David Wood Computer Sciences Department University of Wisconsin

More information

Processor-Directed Cache Coherence Mechanism A Performance Study

Processor-Directed Cache Coherence Mechanism A Performance Study Processor-Directed Cache Coherence Mechanism A Performance Study H. Sarojadevi, dept. of CSE Nitte Meenakshi Institute of Technology (NMIT) Bangalore, India hsarojadevi@gmail.com S. K. Nandy CAD Lab, SERC

More information

Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses

Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand, Onur Mutlu, Phillip B. Gibbons, Michael A.

More information

Lecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )

Lecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections ) Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections 5.1-5.3) 1 Reducing Miss Rate Large block size reduces compulsory misses, reduces miss penalty in case

More information

Investigating design tradeoffs in S-NUCA based CMP systems

Investigating design tradeoffs in S-NUCA based CMP systems Investigating design tradeoffs in S-NUCA based CMP systems P. Foglia, C.A. Prete, M. Solinas University of Pisa Dept. of Information Engineering via Diotisalvi, 2 56100 Pisa, Italy {foglia, prete, marco.solinas}@iet.unipi.it

More information

CPACM: A New Embedded Memory Architecture Proposal

CPACM: A New Embedded Memory Architecture Proposal CPACM: A New Embedded Memory Architecture Proposal PaulKeltcher, Steph en Richardson Computer Systems and Technology Laboratory HP Laboratories Palo Alto HPL-2000-153 November 21 st, 2000* edram, embedded

More information

CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5)

CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5) CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived

More information

Cache Injection on Bus Based Multiprocessors

Cache Injection on Bus Based Multiprocessors Cache Injection on Bus Based Multiprocessors Aleksandar Milenkovic, Veljko Milutinovic School of Electrical Engineering, University of Belgrade E-mail: {emilenka,vm@etf.bg.ac.yu, Http: {galeb.etf.bg.ac.yu/~vm,

More information

An Adaptive Update-Based Cache Coherence Protocol for Reduction of Miss Rate and Traffic

An Adaptive Update-Based Cache Coherence Protocol for Reduction of Miss Rate and Traffic To appear in Parallel Architectures and Languages Europe (PARLE), July 1994 An Adaptive Update-Based Cache Coherence Protocol for Reduction of Miss Rate and Traffic Håkan Nilsson and Per Stenström Department

More information

London SW7 2BZ. in the number of processors due to unfortunate allocation of the. home and ownership of cache lines. We present a modied coherency

London SW7 2BZ. in the number of processors due to unfortunate allocation of the. home and ownership of cache lines. We present a modied coherency Using Proxies to Reduce Controller Contention in Large Shared-Memory Multiprocessors Andrew J. Bennett, Paul H. J. Kelly, Jacob G. Refstrup, Sarah A. M. Talbot Department of Computing Imperial College

More information

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Agenda Introduction Memory Hierarchy Design CPU Speed vs.

More information

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2015 Lecture 15 LAST TIME! Discussed concepts of locality and stride Spatial locality: programs tend to access values near values they have already accessed

More information

Today. Cache Memories. General Cache Concept. General Cache Organization (S, E, B) Cache Memories. Example Memory Hierarchy Smaller, faster,

Today. Cache Memories. General Cache Concept. General Cache Organization (S, E, B) Cache Memories. Example Memory Hierarchy Smaller, faster, Today Cache Memories CSci 2021: Machine Architecture and Organization November 7th-9th, 2016 Your instructor: Stephen McCamant Cache memory organization and operation Performance impact of caches The memory

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects 1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 12: On-Chip Interconnects Instructor: Ron Dreslinski Winter 216 1 1 Announcements Upcoming lecture schedule Today: On-chip

More information

ECE PP used in class for assessing cache coherence protocols

ECE PP used in class for assessing cache coherence protocols ECE 5315 PP used in class for assessing cache coherence protocols Assessing Protocol Design The benchmark programs are executed on a multiprocessor simulator The state transitions observed determine the

More information
