Shared vs. Snoop: Evaluation of Cache Structure for Single-chip Multiprocessors


Shared vs. Snoop: Evaluation of Cache Structure for Single-chip Multiprocessors

Toru Kisuki, Masaki Wakabayashi, Junji Yamamoto, Keisuke Inoue, Hideharu Amano
Department of Computer Science, Keio University
3-14-1, Hiyoshi, Yokohama 223, Japan
{kisuki,masaki,junji,keisuke,hunga}@aa.cs.keio.ac.jp

Abstract. The shared cache and snoop cache structures for single-chip multiprocessors are evaluated and compared using an instruction-level simulator. Simulation results show that a large 1-port shared cache achieves the best performance if there is no delay penalty for arbitration and accessing the bus. However, if a 1-clock delay is assumed for accessing the shared cache, a snoop cache with an internal wide bus and the invalidate-style NewKeio protocol outperforms shared caches.

1 Introduction

In order to make the best use of a large silicon area, the single-chip multiprocessor approach using simple microprocessors has been investigated[1][2]. Precise simulation results demonstrated that this approach is promising compared with complicated superscalar processors for parallel applications[1]. In those evaluations, processors are connected with a shared cache, which is suitable for on-chip implementation. However, a private cache with a snoop mechanism, which is mainly used in board-level implementations, is also a promising structure for on-chip implementation, since each cache can be connected with a high-bandwidth on-chip bus, and the snoop cache protocol can be optimized for on-chip implementation. In this paper, the structure of a snoop cache for single-chip multiprocessors is discussed, and its performance is compared with shared caches using instruction-level simulation.

2 Cache structures for Single-Chip Multiprocessors

[Fig. 1. Shared cache structure. Fig. 2. Snoop cache structure. Both figures show PUs on one VLSI chip connected through caches and an external interface to off-chip main memory and I/O; bandwidth is high on-chip and low off-chip.]

2.1 Shared cache structure

Among the many possible connection architectures for single-chip multiprocessors, the simplest structure is to connect the processors directly to a shared cache. Fig. 1 shows this simple and quite realistic structure of a single-chip multiprocessor. In this structure, a large SRAM shared cache is also connected to the main memory outside the chip. To reduce the overhead caused by conflicts, a multi-port memory which allows simultaneous access is sometimes introduced. However, the access time is often stretched by the large fan-out required in the cells of an m-port memory.

2.2 Snoop cache structure

A private cache with a snoop mechanism is an alternative approach to an efficient cache for single-chip multiprocessors. In this approach, each processor provides its own private cache, and the caches are connected with each other by a shared bus. Each cache checks the addresses and data on the shared bus and maintains consistency according to the cache consistency protocol. This structure is commonly used in recent multiprocessor workstations. In board-level implementations, since a common 64-bit address/data multiplexed bus is used, the cost of maintaining cache consistency often dominates the performance. However, in a single-chip multiprocessor, the cost of wires is much lower than in a board-level implementation, and a bus with a large bandwidth can be used. Here, we propose a snoop cache with a wide bus for single-chip multiprocessors. In this cache, the following bus is used: (1) the data bus is as wide as a cache line, and (2) a 64-bit address bus is provided independently of the data bus. With such a bus, a cache line can be transferred between caches in a single clock, and the overhead for maintaining consistency is drastically reduced.

2.3 NewKeio protocol

In a snoop cache with a wide bus, the bandwidth gap between the inside and the outside of the chip is an essential problem.
In order to cope with this problem, we have proposed a coherence protocol which minimizes data transfer between the caches and main memory. This protocol is called the NewKeio protocol[3]. As shown in Fig. 3, ownership is introduced even for a clean line, and a missed cache line is transferred from another cache whenever possible. Moreover, an attribute is attached to the page table to select whether the protocol uses invalidation or updating. By using the invalidation protocol for instructions or local data, and the update protocol for shared data, the loss caused by maintaining consistency is minimized. In Fig. 3, (U) and (I) indicate the update and invalidate types, respectively.

[Fig. 3. NewKeio protocol: state transition diagram with states I, S, CEO, DEO, CSO and DSO; P/: processor-based transitions, B/: bus-induced transitions; R: read, W: write.]
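The two properties described above, ownership even for clean lines and write back deferred until replacement, can be sketched in a toy coherence model. This is a minimal illustration, not the paper's exact state machine: the state names and methods below are hypothetical simplifications of Fig. 3, and corner cases such as ownership hand-off on eviction are omitted.

```python
from enum import Enum, auto

class State(Enum):
    SHARED = auto()        # line held, but this cache is not the owner
    CLEAN_OWNER = auto()   # owner of an unmodified line
    DIRTY_OWNER = auto()   # owner of a modified line

class SnoopSystem:
    def __init__(self, n_caches):
        self.caches = [dict() for _ in range(n_caches)]  # addr -> State
        self.mem_accesses = 0  # off-chip main-memory traffic counter

    def _owner(self, addr):
        for c in self.caches:
            if c.get(addr) in (State.CLEAN_OWNER, State.DIRTY_OWNER):
                return c
        return None

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr in cache:
            return                      # hit
        if self._owner(addr) is not None:
            # cache-to-cache transfer over the wide on-chip bus, even for
            # a clean line: no off-chip access is needed
            cache[addr] = State.SHARED
        else:
            self.mem_accesses += 1      # fetch from main memory
            cache[addr] = State.CLEAN_OWNER

    def write(self, cpu, addr, invalidate=True):
        self.read(cpu, addr)            # allocate on a write miss
        for i, other in enumerate(self.caches):
            if i != cpu and addr in other:
                if invalidate:          # invalidate-type page
                    other.pop(addr)
                else:                   # update-type page: copies stay valid
                    other[addr] = State.SHARED
        self.caches[cpu][addr] = State.DIRTY_OWNER

    def evict(self, cpu, addr):
        state = self.caches[cpu].pop(addr, None)
        if state is State.DIRTY_OWNER:
            self.mem_accesses += 1      # write back only at replacement

s = SnoopSystem(4)
s.read(0, 0x40)    # miss: fetched from memory; CPU 0 owns the clean line
s.read(1, 0x40)    # served cache-to-cache: no off-chip access
s.write(1, 0x40)   # CPU 1 becomes dirty owner; CPU 0's copy is invalidated
s.evict(1, 0x40)   # the only write back
assert s.mem_accesses == 2
```

In an Illinois-style protocol, CPU 1's dirty line would additionally be written back when another processor requested it; deferring the write back to eviction is what reduces off-chip traffic in the MP3D-like sharing patterns discussed in Sect. 4.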

3 Simulation Environment

3.1 Target systems

By the end of 1997, it is possible to make a processor chip with a 500 mm2 die area in a 0.25 µm process technology. Since a sophisticated processor requires a large area, only a small number of such processors can be implemented; there are therefore many possible combinations of processor class and processor count. Here, we select a rather simple RISC processor class. In this case, four to six processors and 256 KB of SRAM can be mounted. A 500 MHz system clock is assumed, and the latency for accessing the outside main memory is 50 clocks. We evaluate the following cache structures:

4-Port Shared: A 4-port memory is used. Each processor has its own port and accesses the cache memory without any conflicts. However, the total cache size is set to a fourth (64 KB) of the 1-Port case.

1-Port Shared: A single-port memory is used. Since the processors share one port, a conflict occurs when multiple processors access the port simultaneously. However, its simple structure allows a large cache size (256 KB).

Snoop (Illinois protocol): A common snoop cache protocol (the Illinois protocol[4]) is used. Each processor has its own 64 KB cache, so the total cache size is 256 KB.

Snoop (NewKeio protocol): The same structure as the snoop cache with the Illinois protocol, but the NewKeio protocol is used.

3.2 Instruction-level simulator ISIS

Since cache performance depends on the detailed behavior of the hardware, including the pipeline structure, a precise simulation is required. For this purpose, we developed an instruction-level simulator called ISIS. ISIS simulates the behavior of the RISC processor pipeline stage by stage for every instruction; the precise operation of each pipeline stage, including delay slots, can be simulated. Fig. 4 shows the organization of ISIS. All units such as processors, caches, buses and memories are implemented as classes. First, each class is instantiated, and then the objects are connected through interfaces. In this way, various architectures can be simulated by replacing units.

[Fig. 4. Organization of ISIS: (1) classes, (2) objects, (3) simulator.]
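The construct-then-connect organization described above might look like the following sketch. The class names and methods are hypothetical (the paper does not show ISIS's actual code); only the 50-clock off-chip latency and the 1-port/256 KB configuration are taken from Sect. 3.1.

```python
class Memory:
    """Off-chip main memory: 50-clock access latency (Sect. 3.1)."""
    def __init__(self, latency=50):
        self.latency = latency
    def access(self, addr):
        return self.latency

class SharedCache:
    """A shared-cache unit; swapping this object changes the architecture."""
    def __init__(self, ports, size_kb, lower):
        self.ports, self.size_kb = ports, size_kb
        self.lower = lower                 # connected through an interface
    def access(self, addr, hit):
        # 1 clock on a hit; a miss adds the lower level's latency
        return 1 if hit else 1 + self.lower.access(addr)

class Processor:
    """A simple RISC processor model driving its cache port."""
    def __init__(self, cache):
        self.cache = cache
    def load(self, addr, hit=True):
        return self.cache.access(addr, hit)

# Construct the units, then connect the objects: here, a 1-port 256 KB
# shared-cache configuration with four processors. A snoop-cache system
# would be built by wiring one private cache object per processor instead.
mem = Memory(latency=50)
cache = SharedCache(ports=1, size_kb=256, lower=mem)
cpus = [Processor(cache) for _ in range(4)]
assert cpus[0].load(0x100, hit=False) == 51  # miss: 1 clock + 50-clock memory
assert cpus[3].load(0x100) == 1              # hit: 1 clock
```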

Three parallel applications from the SPLASH benchmark programs[5] are selected for evaluation: FFT (4096 data points), MP3D (100 steps with 8000 particles at test.geom), and LU (a 128x128 matrix with a block size of 16).

4 Simulation Results

4.1 Access frequency of the outside memory

[Fig. 5. Access number of memory for FFT, MP3D and LU (4-port shared, 1-port shared, snoop/Illinois, snoop/NewKeio).]

Fig. 5 shows the number of accesses to the outside memory. Since there is a large gap between the bandwidths inside and outside the chip, a large number of accesses to the main memory causes performance degradation. In FFT, the 4-port shared cache shows the most frequent memory accesses because of capacity misses. The Illinois protocol is better than the 4-port shared cache, since a cache-to-cache transfer inside the chip serves a miss on a clean line. The NewKeio protocol shows still better performance than the Illinois protocol: unlike the Illinois protocol, which requires a write back when a dirty line is requested by another processor, a write back occurs only at replacement in the NewKeio protocol. This causes the difference in access numbers, especially in MP3D, which requires a large amount of interprocessor communication. For the same reason, the update-type protocol is advantageous in MP3D.

4.2 Execution Time

Considering the multi-port structure and the arbitration delay, it takes a few more clocks to access the shared cache. We model this delay by adding penalty clocks (1-3 clocks) to each shared-cache access. In order to investigate the influence of the conflicts which occur in the 1-port shared cache, the ideal execution time without conflicts is also shown in Fig. 6. Without the delay, the performance of the 1-port shared cache is the best in all applications, and the influence of conflicts on the 1-port shared cache is not large when no delay is assumed.
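The trade-off at work here, a per-access penalty on the shared cache versus per-miss bus traffic on the snoop caches, can be sketched with a back-of-envelope model. All access counts, miss rates, and stall counts below are hypothetical illustrations chosen for the example; they are not the paper's measured data. Only the 50-clock memory latency comes from Sect. 3.1.

```python
MEM_LATENCY = 50  # clocks to reach off-chip memory (Sect. 3.1)

def shared_1port_clocks(accesses, delay, conflict_stalls, miss_rate):
    # every cache access pays the arbitration/access delay penalty
    return (accesses * (1 + delay)
            + conflict_stalls
            + int(accesses * miss_rate) * MEM_LATENCY)

def snoop_clocks(accesses, miss_rate, coherence_events):
    # wide internal bus: one clock per access, one clock per bus event
    return (accesses
            + coherence_events
            + int(accesses * miss_rate) * MEM_LATENCY)

acc = 1_000_000  # hypothetical number of memory references
t_shared = shared_1port_clocks(acc, delay=1, conflict_stalls=50_000,
                               miss_rate=0.01)
t_snoop = snoop_clocks(acc, miss_rate=0.02, coherence_events=30_000)
assert t_shared == 2_550_000
assert t_snoop == 2_030_000
# Even with double the miss rate, the snoop cache comes out ahead once the
# shared cache pays a 1-clock delay on every access.
```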
As mentioned above, the performance of the 4-port shared cache is degraded mainly by communication with the outside memory. With the delay penalty, however, the performance of the 1-port shared cache is severely degraded because of conflicts. Even with a 1-clock delay, the invalidate-type NewKeio protocol shows better performance than the 1-port shared cache. Since there are few shared lines and little interprocessor communication, each private cache can be used efficiently, reducing the overhead of maintaining consistency. For this reason, the performance of the invalidate-type NewKeio protocol is better than that of the update type. Compared with the Illinois protocol, the NewKeio protocol improves the performance by 15%-20%.

[Fig. 6. Execution time (clocks) for FFT, MP3D and LU, with delay penalties of 1-3 clocks, and with/without conflicts for the 1-port (256 KB) shared cache.]

5 Conclusion

The simulation results show that, with the parameters used here, a large 1-port shared cache achieves the best performance if there is no delay penalty for arbitration and accessing the bus. However, if a 1-clock delay is assumed for accessing the shared cache, a snoop cache with an internal wide bus and the invalidate-style NewKeio protocol shows better performance. Further simulations with various parameters and applications are required to find the optimal cache structure for single-chip multiprocessors.

References

1. Basem A. Nayfeh, Lance Hammond, and Kunle Olukotun. Evaluation of Design Alternatives for a Multiprocessor Microprocessor. ISCA, 1995.
2. Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, and Whay S. Lee. The M-Machine Multicomputer. 1995.
3. T. Terasawa, S. Ogura, K. Inoue, and H. Amano. A Cache Coherence Protocol for Multiprocessor Chip. In Proc. of the IEEE International Conference on Wafer Scale Integration, pages 238-247, January 1995.
4. M. S. Papamarcos and J. H. Patel. A Low-overhead Coherence Solution for Multiprocessors with Private Cache Memories. ISCA, pages 348-354, 1984.
5. Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta.
The SPLASH-2 Programs: Characterization and Methodological Considerations. ISCA, pages 24-36, June 1995.

This article was processed using the LaTeX macro package with LLNCS style.