Concerning with On-Chip Network Features to Improve Cache Coherence Protocols for CMPs


Hongbo Zeng 1,2, Kun Huang 1,2, Ming Wu 1,2, and Weiwu Hu 1

1 Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2 Graduate University of the Chinese Academy of Sciences, Beijing, China
{hbzeng,huangkun,wuming,hww}@ict.ac.cn

Abstract. Chip multiprocessors (CMPs) with on-chip networks connecting processor cores have been widely accepted as a promising way to efficiently utilize the ever-increasing density of transistors on a chip. Communication in CMPs requires invalidating cached copies of a shared data block, and this coherence traffic incurs increasingly significant overhead as the number of cores in a CMP grows. Conventional designs of cache coherence protocols do not take the characteristics of the underlying network into account, for flexibility reasons. In CMPs, however, processor cores and the on-chip network are tightly integrated, and exposing the network features to the cache coherence protocol unveils optimization opportunities. In this paper, we propose a distance-aware protocol and multi-target invalidations, which exploit network characteristics to reduce the invalidation traffic overhead at negligible hardware cost. Experimental results on a 16-core CMP simulator show that the two mechanisms reduced the average invalidation traffic latency by 5%, and by up to 8%.

1 Introduction

The wide availability of chip multiprocessors (CMPs) has demonstrated their capability to efficiently utilize the ever-increasing number of transistors. On-chip networks [1], which interconnect multiple processing elements on a chip, are a promising technology targeting the delay and power consumption problems of global wires [2].

Like distributed shared memory (DSM) machines, CMPs maintain data coherence with cache coherence protocols. Conventionally, the cache coherence protocol and the network are considered two unrelated components of a DSM system. The design concepts and optimization techniques of protocols do not take into account the characteristics of the underlying network; likewise, network optimizations concentrate on reducing communication latency without awareness of the upper-level protocol. Considerable flexibility is achieved this way, as the protocols can be deployed on a wide variety of networks. Where CMPs are concerned, however, processor cores and the interconnection network are tightly integrated and, in addition, the parameters of the on-chip network are determined at design time.

These properties motivated us to expose network characteristics to the protocol so as to explore new approaches to improving the performance of CMPs.

As more and more cores are placed on future CMPs, one practical design methodology is to assemble tiles of same-sized cores into an array connected by an on-chip network, as depicted in Figure 1 [3,4]. Each core contains a fraction of the L2 cache, which is shared by all cores although physically distributed. Compared with private L2 caches, a shared cache has the advantage of allowing more capacity for each core and avoiding duplicated copies of the same cache line in private caches.

Fig. 1. An example architecture of a tiled CMP. Each core contains private L1 caches and a fraction of the shared L2 cache. Multiple cores are connected by a 2-D mesh on-chip network.

For scalability reasons, a directory-based cache coherence protocol [5] is a more appropriate option for maintaining coherence of data copies among L1 caches than a bus-based snoopy protocol. A directory keeps track of the global coherence state and the sharer identities of every cache line in the L2 cache. Before a processor core can modify the data of a cache line, it must send a read-exclusive request to the directory, which invalidates the remote copies of that cache line. When the directory has received acknowledgments from all the sharers, it replies to the requester with a write grant. Figure 2 illustrates the communication incurred. The invalidation process introduces high overhead, and its significance grows as the number of cores in a CMP increases. This paper presents two mechanisms that exploit network features to reduce this invalidation overhead at negligible hardware cost.

Following the discussion above, traditional designs of directory-based protocols may perform sub-optimal operations because they have little knowledge of the network. As shown in Figure 1, where a 2-D mesh network with the XY routing algorithm [6] is applied in a CMP, if the directory in core A needs to invalidate data copies in the L1 caches of cores B and C, the protocol may first send out invalidation message b for B and then message c for C. The problem is that c takes more hops, and therefore more clock cycles, than b to complete, yet it is sent later, which increases the overall delay. Lacking information about the network (C is farther away than B), the protocol makes a sub-optimal schedule.
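To make the baseline concrete, the following is a minimal C sketch of the read-exclusive invalidation flow just described. It is illustrative only: the names (dir_entry_t, send_invalidation, send_write_grant) are our own assumptions rather than the paper's code, and the sharer set is kept as a 16-bit vector matching a 16-core CMP.

```c
/* A minimal sketch of the baseline directory behavior described above. */
#include <stdint.h>

#define NUM_CORES 16

typedef struct {
    uint16_t sharers;      /* one bit per core: which L1s hold a copy */
    int      pending_acks; /* acknowledgments still outstanding       */
    int      requester;    /* core waiting for the write grant        */
} dir_entry_t;

void send_invalidation(int target);   /* illustrative network hooks */
void send_write_grant(int target);

/* A core asks for exclusive ownership: invalidate every other sharer. */
void on_read_exclusive(dir_entry_t *e, int requester)
{
    e->requester    = requester;
    e->pending_acks = 0;
    for (int core = 0; core < NUM_CORES; core++) {
        if ((e->sharers & (1u << core)) && core != requester) {
            send_invalidation(core);   /* oblivious order: 0..15 */
            e->pending_acks++;
        }
    }
    if (e->pending_acks == 0)
        send_write_grant(requester);   /* no remote copies to invalidate */
}

/* One sharer has acknowledged; grant the write once all have. */
void on_invalidation_ack(dir_entry_t *e)
{
    if (--e->pending_acks == 0)
        send_write_grant(e->requester);
}
```

Note the oblivious dispatch order: sharers are visited from core 0 to core 15 regardless of their distance from the directory, which is precisely the scheduling that Section 2 improves.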

Fig. 2. The directory invalidates all the shared copies before replying to the requester.

We propose that coherence protocols designed for CMPs should explicitly consider the distance between cores. Observations show that the number of cores sharing a cache line is usually larger than one. Conventional approaches, which send one invalidation message for each sharing core, create bursts of messages in the network and cause significant contention, with a negative impact on performance. We therefore extend the above optimization to compact multiple invalidation requests into one network packet, which effectively lowers the network load. Using a cycle-accurate execution-driven simulator of a 16-core CMP, we evaluate the proposed mechanisms with a set of scientific computation workloads and find that the two mechanisms together reduced the average overhead of invalidation traffic by 5%, and by up to 8%.

This paper is organized as follows: Section 2 explains the distance-aware optimization technique; Section 3 extends this mechanism to deliver multiple invalidation requests within one network packet; Section 4 discusses the simulation methodology and the workloads we use; Section 5 presents experimental results; Section 6 describes related work; and we conclude in Section 7.

2 Distance Aware Protocol

Conventional cache coherence protocols for DSMs are not designed for a dedicated network, in order to preserve flexibility. Protocols neither care whether the network has a mesh or a torus topology, nor require messages to be delivered in order [5]. More specifically, processors proceed without knowing their positions in the system or the distance from one node to another. Although flexible, this methodology misses some optimization opportunities.

Shared copies of a cache line must be invalidated before the data can be modified or evicted from the L2 cache. With an oblivious policy, the invalidation message for a farther node, which takes more cycles to reach its destination, may be sent late, increasing the overall latency. As we can also see from Figure 1, there is little benefit in getting the acknowledgment from node B early, because the directory has to wait until the last acknowledgment arrives, which most probably comes from C, the farthest node. Sending the message for C early is therefore the better choice.

The coherence protocol should be aware of the network topology and prioritize the dispatch of invalidation messages by the distance of each sharer from the directory. Each cycle, the distances of the sharers that have not yet been sent invalidation messages are calculated, and the protocol sends an invalidation message to the farthest sharing node. As a result, the total cost will commonly not exceed the delay of the invalidation-acknowledgment round trip for the farthest node. This mechanism sends long-delay messages first to hide the latency of short-range ones.

Taking an XY-routed mesh network as an example, one intuitive way of defining the distance between two nodes i and j is the Manhattan distance:

    Distance_{i,j} = |x_i - x_j| + |y_i - y_j|    (1)

where x and y are the coordinates of a node. To reduce the calculation latency in hardware, we can approximate the distance by the number of hops in one dimension, which achieves almost the same improvement in practice.
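Continuing the same illustrative sketch, and assuming the 4*4 mesh of Section 4 with row-major core indices, the distance-aware policy can be expressed as follows: each cycle, evaluate Equation (1) for the sharers not yet served and dispatch the farthest one first.

```c
/* A minimal sketch of distance-aware dispatch on a 4x4 mesh, using the
 * hypothetical helpers from the previous sketch. */
#include <stdint.h>

#define MESH_DIM 4

/* Eq. (1): Manhattan distance between cores a and b (row-major ids). */
static int manhattan(int a, int b)
{
    int dx = a % MESH_DIM - b % MESH_DIM;
    int dy = a / MESH_DIM - b / MESH_DIM;
    return (dx < 0 ? -dx : dx) + (dy < 0 ? -dy : dy);
}

/* Pick the farthest core whose bit is still set in the sharer vector,
 * clear its bit, and return it; -1 when no sharers remain. */
static int pop_farthest_sharer(uint16_t *sharers, int home)
{
    int best = -1, best_dist = -1;
    for (int core = 0; core < MESH_DIM * MESH_DIM; core++) {
        if (*sharers & (1u << core)) {
            int d = manhattan(home, core);
            if (d > best_dist) { best_dist = d; best = core; }
        }
    }
    if (best >= 0)
        *sharers &= ~(1u << best);
    return best;
}
```

Replacing the fixed 0-to-15 loop of the earlier sketch with repeated calls to pop_farthest_sharer gives the distance-aware order; dropping one of the two terms of manhattan() yields the one-dimension hop approximation mentioned above.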

3 Multi-target Invalidations

Observations show that the number of cores sharing a cache line is usually larger than one. This is especially true for instruction cache lines, which are shared by almost all processor cores when running parallel programs. When invalidating all the copies, coherence protocols conventionally send one invalidation message for each sharing core. This can create message bursts in the network, resulting in significant contention and a negative impact on performance. The problem is exacerbated in an on-chip network environment, where buffer resources are limited by the power and area budget.

Compacting multiple invalidation requests into one network packet reduces the number of invalidation messages in flight. The processor cores are divided into several groups such that messages destined for the cores within a group share part of their routing path. A directory then sends one invalidation message for each group within which multiple cores are targeted. Routers in the network deliver the multi-target invalidation to the specified group and dispatch the message to the targets one by one. In conjunction with the mechanism described in the previous section, the multi-target invalidation for the farthest group should be sent first. This approach adds to each invalidation message a vector representing the targets in a group and the identification of that group. In routers, each buffer entry of the invalidation channel only needs to be augmented with a few extra bits, which is negligible.

Fig. 3. Multi-target invalidations. Group G1 consists of the light shaded nodes and group G2 consists of the heavy shaded nodes.

Again, we demonstrate the mechanism with an XY-routed mesh network. Figure 3 shows how multi-target invalidations work. Cores in the same column form a group because, with the XY routing algorithm, messages for cores in the same column first travel the same path along the X dimension (e.g., messages from A to B and from A to C share the whole path from A to B). As illustrated, when the directory in core A needs to invalidate copies in the L1 caches of cores B, C, D, and E, instead of sending four separate invalidations it sends just two messages, one per group. Each message has two targets, a 50% saving in the number of messages. When the message for B and C arrives at the router attached to core B, the router finds that it has multiple targets and forwards the message down to core C in parallel with invalidating the cached copy in core B.
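A hedged sketch of the mechanism, reusing the hypothetical 4*4 mesh layout from before: the directory folds the 16-bit sharer vector into at most one message per column group, and each router in the target column peels off its local target while forwarding the remainder down the column.

```c
/* A minimal sketch of multi-target invalidation with column groups on a
 * 4x4 mesh; all helper names are illustrative assumptions. */
#include <stdint.h>

#define MESH_DIM 4

typedef struct {
    uint8_t group;    /* 2 bits: which column the message is routed to */
    uint8_t targets;  /* 4 bits: one bit per row within that column    */
} mt_inval_t;

void invalidate_local_l1(int core);              /* illustrative hooks */
void forward_down_column(int router_row, mt_inval_t m);

/* Directory side: fold a 16-bit sharer vector into up to 4 messages. */
int build_mt_invals(uint16_t sharers, mt_inval_t out[MESH_DIM])
{
    int n = 0;
    for (int col = 0; col < MESH_DIM; col++) {
        uint8_t vec = 0;
        for (int row = 0; row < MESH_DIM; row++)
            if (sharers & (1u << (row * MESH_DIM + col)))
                vec |= 1u << row;
        if (vec)
            out[n++] = (mt_inval_t){ .group = col, .targets = vec };
    }
    return n;   /* e.g., B, C, D, E in Fig. 3 yield just 2 messages */
}

/* Router side: serve the local target, then keep forwarding while any
 * targets farther down the column remain. */
void on_mt_inval(int router_row, mt_inval_t m)
{
    if (m.targets & (1u << router_row)) {
        invalidate_local_l1(router_row * MESH_DIM + m.group);
        m.targets &= ~(1u << router_row);
    }
    if (m.targets)
        forward_down_column(router_row, m);
}
```

The 4-bit targets field and the 2-bit group field in this sketch match the per-message overhead quantified in Section 4.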

4 Simulator and Workloads

We use a cycle-accurate execution-driven simulator to evaluate the proposed mechanisms. The processor cores modeled in the simulator conform to the architecture of the Godson-2 processor [7], a high-performance microprocessor implementing the MIPS ISA and featuring 4-issue out-of-order execution, non-blocking caches, etc. We implement the directory-based write-invalidate cache coherence protocol and the on-chip network in significant detail to make the simulator behave in strict accordance with the hardware implementation. This methodology provides accurate simulation results at the cost of long simulation time.

The simulator models a 16-core CMP with an on-chip network using a mesh topology and the XY routing algorithm. We employ an aggressive implementation of routers, which take two cycles to forward a packet in the absence of contention; contention within the on-chip network is also simulated. The wires are optimistically assumed to take just two cycles to deliver a packet from one router to the next. We believe that as technology evolves, the speed of processor cores will pull further ahead of that of wires, and the advantages of our mechanisms will become more evident as wire delay increases. The detailed architecture parameters are summarized in Table 1.

Table 1. System configurations

    Parameter          Value
    Number of cores    16
    Processor          4-issue, out-of-order
    Cache block size   32 B
    L1 I-cache         64 KB, 4-way, 1-cycle latency
    L1 D-cache         64 KB, 4-way, 1-cycle latency
    Shared L2 cache    8 MB, 4-way, 4-cycle latency
    DRAM latency       100 processor cycles
    Network topology   4*4 2-D mesh
    Router             2 pipeline stages
    Wire delay         2 processor cycles

To test our ideas, we employ a set of scientific applications consisting of seven programs from the SPLASH-2 benchmark suite. The programs are run to completion, but all experimental results reported in this paper are for the parallel phases of these applications. Table 2 presents the applications and the input data sets used in the evaluation.

Table 2. Applications and input data sets

    Application                 Problem size
    FFT                         256K complex data points
    LU                          512*512 matrix
    Water-nsquared (WATERNS)    512 molecules, 3 timesteps
    Water-spatial (WATERSP)     512 molecules, 3 timesteps
    Cholesky                    D750
    LU-noncontiguous (LUNC)     128*128 matrix
    Ocean                       130*130 array, 1e-7 error tolerance

When evaluating the multi-target invalidation mechanism, we divide the cores into four groups by column: the four cores lying on the same column belong to one group. Each multi-target invalidation message then needs a 4-bit vector to represent the four cores in a group and two extra bits to identify the group it is routed to.
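The six bits of per-message overhead quoted above (a 4-bit target vector plus a 2-bit group identifier) can be illustrated with a trivial packing routine; the field layout here is an arbitrary assumption, not the hardware encoding.

```c
/* Packing the 6 bits of per-message overhead described above. */
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: bits [5:4] group id, bits [3:0] target vector. */
static uint8_t pack_mt_bits(uint8_t group, uint8_t targets)
{
    return (uint8_t)(((group & 0x3u) << 4) | (targets & 0xFu));
}

int main(void)
{
    /* Group 2 (third column), targets in rows 0 and 3 -> vector 0b1001. */
    uint8_t bits = pack_mt_bits(2, 0x9);
    assert((bits >> 4) == 2 && (bits & 0xFu) == 0x9);
    return 0;
}
```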

5 Results

This section describes the simulation results of applying both mechanisms, compared with the baseline protocol.

Figure 4 depicts the distribution of the number of sharers when a block needs to be invalidated. The home node of a cache block is not counted, as invalidating the copy in the L1 cache of a block's home node does not put a message into the network. As we can see, the programs demonstrate various behaviors. Almost all the data in FFT and Ocean has just one sharer besides the home node, so we can predict that these two applications will gain little, as few messages can be saved by multi-target invalidations. Some applications (such as LU, WATERSP, and Cholesky) have a modest number of sharers that barely exceeds 5; the performance of these applications can be expected to improve. LUNC differs from the others in that most of its data is shared by multiple cores; however, its number of sharers likewise remains small.

Fig. 4. Distribution of the number of sharers when a block needs to be invalidated.

Reductions in the cycles spent in the invalidation process are shown in Figure 5. The two mechanisms reduced the average invalidation traffic overhead by 5%. Among the applications, LU achieved the largest improvement, with a nearly 8% decrease in latency, which can be inferred from its distribution of the number of sharers. As predicted, FFT and Ocean show barely any improvement. However, the traffic overhead of LUNC was not reduced as much as the data in Figure 4 led us to expect; we can explain this phenomenon with Figure 6. (The mechanisms have limited impact on overall performance, as the dominant factor is memory access latency.)

Fig. 5. Invalidation overhead scaled to the baseline protocol.

Figure 6 presents the distribution of the number of targets in each multi-target invalidation message. In some applications data is shared by several cores, but the sharers are scattered around rather than kept together in groups.

Fig. 6. Distribution of the number of targets in each multi-target invalidation message.

Each multi-target invalidation message is therefore responsible for only a few targets, which lessens its effect on reducing the overhead. This explains why LUNC receives only a modest improvement: most of its multi-target invalidations aim at just one or two sharers.

6 Related Work

A large body of literature focuses on optimization techniques for cache coherence protocols. [8,9] proposed adaptive coherence protocols for different data sharing patterns. [10,11] effectively eliminated the overhead of remote misses by making producers push data to consumers in advance, instead of fetching data when the consumers' read misses actually happen.

Lebeck and Wood [12] proposed Dynamic Self-Invalidation (DSI), which automatically writes back the writer's dirty copy of data at synchronization boundaries so as to save the coherence messages incurred when a sharer subsequently reads the same cache line. Lai and Falsafi [13] extended DSI with a last-touch predictor to invalidate the data in a more timely manner and avoid potential message bursts in the network. On-chip networks also attract considerable attention: researchers aim to reduce transmission latency by shortening router pipelines and to alleviate contention with adaptive routing algorithms [14,15].

All the research mentioned above is devoted to its own field. Recently, however, some work has demonstrated the benefits of coupling cache coherence protocols more tightly with the underlying network. Eisley et al. [16] proposed embedding directories within each router node to satisfy requests with nearby data copies. Cheng et al. [17] leveraged wires of different power and latency properties to deliver different coherence protocol messages depending on their bandwidth-latency requirements.

7 Conclusions and Future Work

In this paper, we proposed two techniques to reduce invalidation traffic overhead in CMPs whose processor cores are connected by on-chip networks. We were motivated by one major difference between DSMs and CMPs: for flexibility, traditional cache coherence protocols of DSMs are not designed for a dedicated network, missing some optimization opportunities; in CMPs, the parameters of the on-chip network are determined at design time, so the two can be considered jointly. The distance-aware optimization dispatches invalidations in order of how far the sharers are from the directory; compared with an oblivious policy, this mechanism processes long-latency events first so as to lower the overall overhead. Multi-target invalidations convey invalidation requests for a group of cores within one network message; this approach decreases invalidation traffic and alleviates message bursts in the network when a crowd of cores shares a cache line. We conducted simulations on a 16-core CMP simulator using a subset of the SPLASH-2 benchmark suite. The experimental results showed that the two mechanisms together reduced the average invalidation traffic overhead by 5%, and by up to 8%.

In the future, we will optimize the simulator to support more cores. At present, the simulator models the coherence protocol in significant detail, so it takes prohibitively long to simulate more than 32 cores. We will refine the implementation in a more efficient and configurable way, and we believe that the two approaches will achieve more notable improvements in CMPs containing many more cores.

Acknowledgements. We thank the anonymous reviewers for their valuable advice. Our work is supported by the National Natural Science Foundation of China for Distinguished Young Scholars under Grant No. , the National Natural Science Foundation of China under Grant No. and No. , the National High Technology Development 863 Program of China under Grant No. 2006AA010201, the National Basic Research Program of China under Grant No. 2005CB , and the Beijing Natural Science Foundation under Grant No. .

References

1. Dally, W.J., Towles, B.: Route packets, not wires: on-chip interconnection networks. In: DAC '01: Proceedings of the 38th Conference on Design Automation, New York, NY, USA. ACM Press, New York (2001)
2. Ho, R., Mai, K.W., Horowitz, M.A.: The future of wires. Proceedings of the IEEE 89(4) (2001)
3. Zhang, M., Asanovic, K.: Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors. In: ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, Washington, DC, USA. IEEE Computer Society, Los Alamitos (2005)
4. Held, J., Bautista, J., Koehl, S.: From a Few Cores to Many: A Tera-scale Computing Research Overview. Technical report, Intel (2006)
5. Laudon, J., Lenoski, D.: The SGI Origin: a ccNUMA highly scalable server. In: ISCA '97: Proceedings of the 24th Annual International Symposium on Computer Architecture. ACM Press, New York, NY, USA (1997)
6. Dally, W.J., Towles, B.: Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2003)
7. Hu, W., Zhang, F., Li, Z.: Microarchitecture of the Godson-2 processor. Journal of Computer Science and Technology 20(2) (2005)
8. Cox, A.L., Fowler, R.J.: Adaptive cache coherency for detecting migratory shared data. In: ISCA '93: Proceedings of the 20th Annual International Symposium on Computer Architecture, New York, NY, USA. ACM Press, New York (1993)
9. Kaxiras, S., Goodman, J.R.: Improving CC-NUMA performance using instruction-based prediction. In: Proceedings of the Fifth IEEE Symposium on High-Performance Computer Architecture (1999)
10. Abdel-Shafi, H., Hall, J., Adve, S.V., Adve, V.S.: An evaluation of fine-grain producer-initiated communication in cache-coherent multiprocessors. In: Third International Symposium on High-Performance Computer Architecture (1997)
11. Koufaty, D.A., Chen, X., Poulsen, D.K., Torrellas, J.: Data forwarding in scalable shared-memory multiprocessors. In: ICS '95: Proceedings of the 9th International Conference on Supercomputing. ACM Press, New York, NY, USA (1995)
12. Lebeck, A.R., Wood, D.A.: Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors. In: ISCA '95: Proceedings of the 22nd Annual International Symposium on Computer Architecture. ACM Press, New York, NY, USA (1995)
13. Lai, A.-C., Falsafi, B.: Selective, accurate, and timely self-invalidation using last-touch prediction. In: ISCA '00: Proceedings of the 27th Annual International Symposium on Computer Architecture. ACM Press, New York, NY, USA (2000)

14. Mullins, R., West, A., Moore, S.: Low-latency virtual-channel routers for on-chip networks. In: ISCA '04: Proceedings of the 31st Annual International Symposium on Computer Architecture, Washington, DC, USA, p. 188. IEEE Computer Society (2004)
15. Kim, J., Park, D., Theocharides, T., Vijaykrishnan, N., Das, C.R.: A low latency router supporting adaptivity for on-chip interconnects. In: DAC '05: Proceedings of the 42nd Annual Conference on Design Automation. ACM Press, New York, NY, USA (2005)
16. Eisley, N., Peh, L.S., Shang, L.: In-network cache coherence. In: MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, Washington, DC, USA (2006)
17. Cheng, L., Muralimanohar, N., Ramani, K., Balasubramonian, R., Carter, J.B.: Interconnect-aware coherence protocols for chip multiprocessors. In: ISCA '06: Proceedings of the 33rd Annual International Symposium on Computer Architecture, Washington, DC, USA. IEEE Computer Society (2006)
