Processor-Directed Cache Coherence Mechanism: A Performance Study


H. Sarojadevi
Dept. of CSE, Nitte Meenakshi Institute of Technology (NMIT), Bangalore, India
hsarojadevi@gmail.com

S. K. Nandy
CAD Lab, SERC, Indian Institute of Science, Bangalore, India
nandy@serc.iisc.ernet.in

ISSN : 0975-3397 Vol. 3 No. 9 September 2011 3202

Abstract

Cache-coherent multiprocessor architecture is widely used in recent multi-core systems, embedded systems, and massively parallel processors. With the ever-increasing performance gap between processor and memory, a cache-coherent multiprocessor requires an efficient cache coherence mechanism. The conventional directory-based cache coherence scheme used in large-scale multiprocessors suffers from considerable overhead. To overcome this problem, we have developed and evaluated a compiler-assisted, processor-directed cache coherence mechanism. The approach is auto-invalidation based, uses a hardware buffer termed the Coherence Buffer (CB), and needs no directory. This paper compares the CB method with a self-invalidation-based directory approach that employs a last touch predictor (LTP). Detailed architectural simulations of Distributed Shared Memory (DSM) configurations with superscalar processors show that an 8-entry, 4-way associative CB performs better than both the LTP-based self-invalidation method and a full-map 3-hop directory for five of the SPLASH-2 benchmarks under the release consistency memory model. Given its performance, cost, complexity, and scalability advantages, the CB approach is found to be promising for emerging applications in large-scale multiprocessors, multi-core systems, and transaction processing systems.

Keywords: Cache coherence, Distributed shared memory multiprocessor system, Self-invalidation, Last touch predictor, Release consistency

I. INTRODUCTION

Emerging classes of high-performance multiprocessor architectures, such as multi-core systems, large-scale multiprocessors [1], and massively parallel architectures [2], require faster and more efficient cache coherence mechanisms. The constraints become more stringent for complex and critical applications such as biological sequence analysis, hospital information systems, multimedia, transaction processing, signal processing, vehicle automation, and many grand challenge applications.

In this paper we present a performance study of a cost-effective hardware-software approach to cache coherence, which we term the Coherence Buffer (CB) scheme. The CB approach is a processor-directed, auto-invalidation-based method that does not need a directory. Results are compared with a full-map directory and with a directory-based self-invalidation scheme called Last Touch Predictor (LTP), which uses speculative self-invalidation of cache blocks. We use architectural simulations of various DSM configurations under the Release Consistency (RC) memory model. The SPLASH-2 benchmark suite, which represents different application domains with a wide range of sharing patterns, is used to compare the performance of the CB and directory schemes. The programs are properly labeled (PL) [5] by the compiler to aid the runtime coherence mechanism.

A 16-processor DSM system with an 8-entry, 4-way associative CB is simulated and compared with an equivalent directory system with LTP. The simulated LTP has 16 first-level registers, each 12 bits long, and 2-way associative second-level last touch trace predictors. The speedup of CB over LTP is found to be 1.07-1.32 for five of the benchmarks. Thus the CB-based approach improves memory performance and is the most cost-effective of the schemes compared. By avoiding global control, the CB eliminates a large number of coherence transactions in the system, thereby making the system more power-aware. With its completely local control, the CB-based system becomes highly scalable.

The rest of the paper is organized as follows. Section II provides details of the CB scheme and of the LTP. Section III presents the methodology and the results. Section IV concludes.

II. BACKGROUND

The CB-based approach proposed in [3] and [6] imposes early coherence transactions that are directed by the individual processor, unlike the globally controlled coherence of a directory-based scheme. Several ways of improving directory-based methods have been tried, of which the best performing is the Last Touch Predictor (LTP) approach. Details of the CB approach and of the directory approach with the LTP enhancement are provided in the sub-sections below.

Figure 1. Processor, CB, and Cache interactions

A. Compiler assisted Coherence Buffer based approach

This approach uses an on-chip buffer called the CB [6], which has the cache-like structure shown in Figure 1. A controller, the CBC, carries out all CB-related operations. The CBC dynamically identifies all shared accesses between two consecutive release boundaries, which are themselves identified with the help of compiler annotations [6]. These accesses are recorded in the CB and are indexed using the block address of a cache line. The CB invalidates or updates the corresponding blocks in memory at data sharing boundaries (release synchronization points), thereby enforcing coherence. The cache block address is further split into a CB tag and a CB index to facilitate address reconstruction during CB update and flush operations.
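The recording and release-time flush described above can be illustrated with a small software model. The Python sketch below is not the authors' hardware design. It assumes an 8-entry, 4-way organization (hence two sets), a modulo split of the block address into CB index and CB tag, an "update" action for blocks that were written and an "invalidate" action for blocks that were only read, and immediate handling of an entry displaced from a full set. The names CoherenceBuffer, record, and release are hypothetical.

```python
class CoherenceBuffer:
    """Toy model of a set-associative Coherence Buffer (illustrative only)."""

    def __init__(self, entries=8, ways=4):
        self.ways = ways
        self.num_sets = entries // ways
        # One dict per set: CB tag -> written flag, kept in insertion order.
        self.sets = [dict() for _ in range(self.num_sets)]

    def split(self, block_addr):
        """Split a cache block address into (CB tag, CB index)."""
        return block_addr // self.num_sets, block_addr % self.num_sets

    def rebuild(self, tag, index):
        """Reconstruct the block address from its CB tag and CB index."""
        return tag * self.num_sets + index

    def _action(self, tag, index, written):
        # Assumption: written blocks are updated, read-only blocks invalidated.
        kind = "update" if written else "invalidate"
        return (kind, self.rebuild(tag, index))

    def record(self, block_addr, is_write):
        """Record a shared access; return actions for any displaced entry."""
        tag, index = self.split(block_addr)
        cb_set = self.sets[index]
        actions = []
        if tag not in cb_set and len(cb_set) == self.ways:
            # Assumption: a full set displaces its oldest entry, whose
            # coherence action is then performed early.
            old_tag = next(iter(cb_set))
            actions.append(self._action(old_tag, index, cb_set.pop(old_tag)))
        cb_set[tag] = cb_set.get(tag, False) or is_write
        return actions

    def release(self):
        """Flush every recorded block at a release synchronization point."""
        actions = []
        for index, cb_set in enumerate(self.sets):
            for tag, written in cb_set.items():
                actions.append(self._action(tag, index, written))
            cb_set.clear()
        return actions
```

In this model a processor would call record for every shared access between two release boundaries and release at the boundary itself; the returned list stands in for the coherence transactions the CBC would issue.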
The CBC coordinates with the cache controller (CC) for tracking changes in the cache.

B. Last Touch Predictor based Directory method

This approach improves directory performance by speculatively self-invalidating cache blocks using a predictor called the LTP. An LTP (Figure 2) is a two-level adaptive predictor used for predicting memory invalidations [4]. The first level, called the history table, stores memory traces generated by a processor in the form of encoded trace signatures. The second level is a block invalidation predictor table that stores a list of previously observed trace signatures for each memory block. The LTP predicts the last touch to a memory block, which indicates that the block is not going to be used again and hence can be invalidated. Based on the prediction, memory blocks are evicted from the local cache without waiting for the directory to send an explicit invalidation request. A misprediction requires a recovery action that involves multiple network transactions.

Figure 2. Using Last Touch Predictor in a DSM

While the LTP is considered the best self-invalidation scheme, it is more complex than the plain directory-based scheme. Besides, the directory transactions for regular and misprediction situations increase the overhead because more network-level transactions are incurred. The following section presents a simulation-based study that investigates this aspect and compares it with the CB approach.

III. METHODOLOGY AND RESULTS

A. Simulation Details

The RSIM simulator [7] on the Solaris platform has been used for our experiments on the DSM, DSM with LTP, and CB approaches. Simulator parameters are given in Table 1.

TABLE 1. SYSTEM PARAMETERS

RSIM is modified to incorporate the LTP into the directory scheme; this version is DSMLTP. The CB-based coherence mechanism is built into RSIM after removing the directory part while retaining the processor and memory hierarchy structure; this version is DSMCB. DSMDir, the base configuration, uses a full-map, three-hop, MSI invalidate-based approach. The SPLASH-2 benchmark suite is used for the performance evaluation. The applications are MP3D (50000 particles), Water (512 molecules), LU (512x512 matrix) with a block size of 16, its optimized version LUOPT, FFT (256K points), its optimized version FFTOPT, Radix (512K keys), and Quicksort (32K integers). These applications are compiler-annotated for proper labeling [7].

B. Invalidation Prediction with LTP

Predicting invalidations correctly is important for the performance of the LTP-based coherence mechanism. Figure 3 shows the fraction of invalidations, split by the number of correct predictions obtained by the LTP approach. A major fraction of the invalidations is mispredicted in all the benchmarks.

Figure 3. Invalidations with the LTP (sixteen 12-bit-wide history registers, 1K-entry two-way associative second-level table)

C. Execution time Performance Comparison of various Coherence Mechanisms

Figure 4. Performance Comparison of Full-Map Directory (FM), LTP and CB based approaches (LTP configuration: sixteen 12-bit-wide history registers, 1K-entry two-way associative second-level table).
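The two-level LTP configuration quoted in the captions of Figures 3 and 4 can be made concrete with a small software model. The Python sketch below is only an illustration under stated assumptions, not the simulated predictor: it assumes a trace signature formed by adding instruction addresses and truncating to 12 bits, it ignores the capacity limits (16 history registers, 1K-entry two-way second-level table), and the names LastTouchPredictor, access, and invalidated are hypothetical.

```python
SIG_BITS = 12                     # signature width taken from the captions
SIG_MASK = (1 << SIG_BITS) - 1


class LastTouchPredictor:
    """Toy two-level last-touch predictor (illustrative only)."""

    def __init__(self):
        # First level: running trace signature per memory block.
        self.current = {}
        # Second level: signatures that ended earlier traces, per block.
        self.learned = {}

    def access(self, block, pc):
        """Fold one access into the block's trace signature.

        Returns True when the signature matches a previously observed
        last-touch signature, i.e. the block may be self-invalidated.
        """
        # Assumption: truncated addition of instruction addresses.
        sig = (self.current.get(block, 0) + pc) & SIG_MASK
        self.current[block] = sig
        if sig in self.learned.get(block, set()):
            del self.current[block]   # predicted last touch ends the trace
            return True
        return False

    def invalidated(self, block):
        """An explicit invalidation arrived: learn the trace that just ended."""
        sig = self.current.pop(block, None)
        if sig is not None:
            self.learned.setdefault(block, set()).add(sig)
```

In this model the first trace to a block is only learned when the directory's invalidation arrives; a later trace that reproduces the same signature triggers a self-invalidation prediction. A prediction made too early would correspond to the misprediction case whose recovery cost the text describes.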

Figure 4 shows the overall execution time for the three coherence mechanisms in a DSM: the full-map directory (FM), the LTP-based directory method (LTP), and the CB-based method (CB). The execution times reported are relative to the execution time of the full-map directory. The CB is one word wide plus status bits and is 8-entry, 4-way associative.

The results in Figure 4 show that the LTP improves performance in certain applications, which is evidence that self-invalidation is a step towards improved performance. However, the cost-benefit of the LTP approach does not appear to be high. The bursty nature and long queuing of self-invalidations can offset the performance achievable by the LTP, resulting in poor performance. The CB approach performs significantly better than the other two methods in general. The performance benefit comes mainly from the early and local coherence transactions that are processor-directed.

The Water and LUOPT cases are different. Due to its O(n^2) complexity, Water encounters a large number of data sharing boundaries, for which a larger CB is required. The sharing pattern of LUOPT requires a CB configuration different from the present one to improve performance. Both of these findings for Water and LUOPT come from separate experiments reported in the thesis [8].

IV. CONCLUSION

With the emerging class of multiprocessors that includes multi-core and massively parallel processor systems, cache coherence is a major system mechanism that contributes to performance and power. Cost-effective approaches such as the Coherence Buffer (CB) based processor-directed auto-invalidation approach [3][6] are a must for current parallel systems, which invariably use shared memory units deployed in a distributed architecture.

In this paper, a performance study of a directory-based approach with self-invalidation using a last touch predictor (LTP) is presented and compared with the conventional full-map 3-hop directory method as well as the CB approach. While the processor-directed CB approach shows a significant performance benefit, the cost-benefit of the self-invalidation-based LTP approach does not seem to be good. The CB approach, which is a promising cost-effective approach to cache coherence, can be extended easily to provide cache coherence in multi-core systems and other bus-based multiprocessor systems.

ACKNOWLEDGMENT

This work was carried out as part of the post-doctoral research work of the first author at CAD Lab, SERC, IISc. The author is grateful for the support provided by the Institute and the research guide. The author is thankful to the present institution of employment, NMIT, for the encouragement and support received.

REFERENCES

[1] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, M.K. Publishers, 2006.
[2] Culler and Singh, Parallel Computer Architecture: A Hardware Software Approach, M.K. Publishers, 1998.
[3] H. Sarojadevi, S. K. Nandy, and S. Balakrishnan, Enforcing Cache Coherence at Data Sharing Boundaries without Global Control: A Hardware-Software Approach, EuroPar 2002.
[4] A.C. Lai and B. Falsafi, Selective, Accurate, and Timely Self-Invalidation Using Last-Touch Prediction, ISCA 2000.
[5] S. Adve, V. Pai, and P. Ranganathan, Recent Advances in Memory Consistency Models for Hardware Shared Memory Systems, Proceedings of the IEEE, 1999, 87(3): pp. 445-455.
[6] H. Sarojadevi, S. K. Nandy, and S. Balakrishnan, On the Correctness of Program Execution when Cache Coherence is Maintained Locally at Data-Sharing Boundaries in Distributed Shared Memory Multiprocessors, International Journal of Parallel Programming, 2004, vol. 32 (no. 5): pp. 415-446.
[7] C.J. Hughes, V. Pai, P. Ranganathan, and S. Adve, RSIM: Simulating Shared Memory Multiprocessors with ILP Processors, IEEE Computer, 35(2): pp. 40-49, 2002.
[8] H. Sarojadevi, "Extending Architectural Support for Imposing Early and Local Cache Coherence in Distributed Shared-Memory Multiprocessors", PhD thesis / Technical report (CAD Lab), IISc, 2002.

AUTHORS PROFILE

Dr. H. Sarojadevi (corresponding author) is presently working as Professor in the Department of Computer Science & Engineering, Nitte Meenakshi Institute of Technology (NMIT), Bangalore. She is a PhD graduate of the Indian Institute of Science. She has more than 18 publications, most of them at the international level. She has 14 years of teaching experience, 16 years of research experience, and 4 years of industry experience. She is guiding 6 PhD candidates at present and has guided several Master's-level and BE projects.

Dr. S. K. Nandy is Professor in the Super Computer Education and Research Centre, IISc, Bangalore. He is also convenor of the CAD Lab. He has authored several papers and carries out funded projects from internationally reputed organizations. He has produced more than five PhDs and several MSc graduates from IISc.