Concerning with On-Chip Network Features to Improve Cache Coherence Protocols for CMPs


Hongbo Zeng 1,2, Kun Huang 1,2, Ming Wu 1,2, and Weiwu Hu 1

1 Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2 Graduate University of the Chinese Academy of Sciences, Beijing, China
{hbzeng,huangkun,wuming,hww}@ict.ac.cn

Abstract. Chip multiprocessors (CMPs) with on-chip networks connecting processor cores have been widely accepted as a promising way to efficiently utilize the ever-increasing density of transistors on a chip. Communication in CMPs requires invalidating cached copies of a shared data block, and this coherence traffic incurs increasingly significant overhead as the number of cores in a CMP grows. Conventional designs of cache coherence protocols do not take the characteristics of the underlying network into account, for flexibility reasons. In CMPs, however, processor cores and the on-chip network are tightly integrated, and exposing the network features to the cache coherence protocol unveils optimization opportunities. In this paper, we propose a distance-aware protocol and multi-target invalidations, which exploit network characteristics to reduce the invalidation traffic overhead at negligible hardware cost. Experimental results on a 16-core CMP simulator show that the two mechanisms reduced the average invalidation traffic latency by 5%, and by up to 8%.

1 Introduction

The wide availability of chip multiprocessors (CMPs) has demonstrated their capability to efficiently utilize the ever-increasing number of transistors. On-chip networks [1], which interconnect multiple processing elements on a chip, are a promising technology targeting the delay and power consumption problems of global wires [2].

Like distributed shared memory (DSM) machines, CMPs maintain data coherence with cache coherence protocols. Conventionally, the cache coherence protocol and the network are considered two unrelated components of a DSM system. The design concepts and optimization techniques of protocols do not take into account the characteristics of the underlying network; likewise, network optimizations concentrate on reducing communication latency without awareness of the upper-level protocol. Considerable flexibility is achieved this way, as the protocols can be deployed on a wide variety of networks. Where CMPs are concerned, however, processor cores and the interconnection network are tightly integrated and, in addition, the parameters of the on-chip network are determined at design time.

These properties motivated us to expose network characteristics to the protocol so as to explore new approaches to improving the performance of CMPs.

As more and more cores are placed on future CMPs, one practical design methodology is to assemble tiles of same-sized cores into an array connected by an on-chip network, as depicted in Figure 1 [3,4]. Each core contains a fraction of the L2 cache, which is shared by all cores although physically distributed. Compared with private L2 caches, a shared cache has the advantage of allowing more capacity for each core and avoiding duplicated copies of the same cache line in private caches.

Fig. 1. An example architecture of a tiled CMP. Each core contains private L1 caches and a fraction of the shared L2 cache. Multiple cores are connected by a 2-D mesh on-chip network.

For scalability reasons, a directory-based cache coherence protocol [5] is a more appropriate option for maintaining coherence of data copies among L1 caches than a bus-based snoopy protocol. A directory keeps track of the global coherence state and the sharer identities of every cache line in the L2 cache. Before a processor core can modify the data of a cache line, it must send a read-exclusive request to the directory, which invalidates the remote copies of that cache line. When the directory has received acknowledgments from all the sharers, it replies to the requester with a write grant. Figure 2 illustrates the communication incurred. The invalidation process introduces high overhead, and its significance grows as the number of cores in a CMP increases. This paper presents two mechanisms that exploit network features to reduce this invalidation overhead at negligible hardware cost.

Following the discussion above, traditional designs of directory-based protocols may perform sub-optimal operations because they have little knowledge of the network. As shown in Figure 1, where a 2-D mesh network with the XY routing algorithm [6] is applied in a CMP, if the directory in core A needs to invalidate data copies in the L1 caches of cores B and C, the protocol may first send out invalidation message b for B and then message c for C. The problem is that c takes more hops, and therefore more clock cycles, than b to complete, yet it is sent later, which increases the overall delay. Lacking information about the network (C is farther away than B), the protocol makes a sub-optimal schedule.
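To make the baseline concrete, the following is a minimal C sketch of the read-exclusive invalidation flow just described. It is illustrative only: the names (dir_entry_t, send_invalidation, send_write_grant) are our own assumptions rather than the paper's code, and the sharer set is kept as a 16-bit vector matching a 16-core CMP.

```c
/* A minimal sketch of the baseline directory behavior described above. */
#include <stdint.h>

#define NUM_CORES 16

typedef struct {
    uint16_t sharers;      /* one bit per core: which L1s hold a copy */
    int      pending_acks; /* acknowledgments still outstanding       */
    int      requester;    /* core waiting for the write grant        */
} dir_entry_t;

void send_invalidation(int target);   /* illustrative network hooks */
void send_write_grant(int target);

/* A core asks for exclusive ownership: invalidate every other sharer. */
void on_read_exclusive(dir_entry_t *e, int requester)
{
    e->requester    = requester;
    e->pending_acks = 0;
    for (int core = 0; core < NUM_CORES; core++) {
        if ((e->sharers & (1u << core)) && core != requester) {
            send_invalidation(core);   /* oblivious order: 0..15 */
            e->pending_acks++;
        }
    }
    if (e->pending_acks == 0)
        send_write_grant(requester);   /* no remote copies to invalidate */
}

/* One sharer has acknowledged; grant the write once all have. */
void on_invalidation_ack(dir_entry_t *e)
{
    if (--e->pending_acks == 0)
        send_write_grant(e->requester);
}
```

Note the oblivious dispatch order: sharers are visited from core 0 to core 15 regardless of their distance from the directory, which is precisely the scheduling that Section 2 improves.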

Fig. 2. The directory invalidates all the shared copies before replying to the requester.

We propose that coherence protocols designed for CMPs should explicitly consider the distance between cores. Observations show that the number of cores sharing a cache line is usually larger than one. Conventional approaches, which send one invalidation message for each sharing core, create bursts of messages in the network and cause significant contention, with a negative impact on performance. We therefore extend the above optimization to compact multiple invalidation requests into one network packet, which effectively lowers the network load. Using a cycle-accurate execution-driven simulator of a 16-core CMP, we evaluate the proposed mechanisms with a set of scientific computation workloads and find that the two mechanisms together reduced the average overhead of invalidation traffic by 5%, and by up to 8%.

This paper is organized as follows: Section 2 explains the distance-aware optimization technique; Section 3 extends this mechanism to deliver multiple invalidation requests within one network packet; Section 4 discusses the simulation methodology and the workloads we use; Section 5 presents experimental results; Section 6 describes related work; and we conclude in Section 7.

2 Distance Aware Protocol

Conventional cache coherence protocols for DSMs are not designed for a dedicated network, in order to preserve flexibility. Protocols neither care whether the network has a mesh or a torus topology, nor require messages to be delivered in order [5]. More specifically, processors proceed without knowing their positions in the system or the distance from one node to another. Although flexible, this methodology misses some optimization opportunities.

Shared copies of a cache line must be invalidated before the data can be modified or evicted from the L2 cache. With an oblivious policy, the invalidation message for a farther node, which takes more cycles to reach its destination, may be sent late, increasing the overall latency. As we can also see from Figure 1, there is little benefit in getting the acknowledgment from node B early, because the directory has to wait until the last acknowledgment arrives, which most probably comes from C, the farthest node. Sending the message for C early is therefore the better choice.

The coherence protocol should be aware of the network topology and prioritize the dispatch of invalidation messages by the distance of each sharer from the directory. Each cycle, the distances of the sharers that have not yet been sent invalidation messages are calculated, and the protocol sends an invalidation message to the farthest sharing node. As a result, the total cost will commonly not exceed the delay of the invalidation-acknowledgment round trip for the farthest node. This mechanism sends long-delay messages first to hide the latency of short-range ones.

Taking an XY-routed mesh network as an example, one intuitive way of defining the distance between two nodes i and j is the Manhattan distance:

    Distance_{i,j} = |x_i - x_j| + |y_i - y_j|    (1)

where x and y are the coordinates of a node. To reduce the calculation latency in hardware, we can approximate the distance by the number of hops in one dimension, which achieves almost the same improvement in practice.
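Continuing the same illustrative sketch, and assuming the 4*4 mesh of Section 4 with row-major core indices, the distance-aware policy can be expressed as follows: each cycle, evaluate Equation (1) for the sharers not yet served and dispatch the farthest one first.

```c
/* A minimal sketch of distance-aware dispatch on a 4x4 mesh, using the
 * hypothetical helpers from the previous sketch. */
#include <stdint.h>

#define MESH_DIM 4

/* Eq. (1): Manhattan distance between cores a and b (row-major ids). */
static int manhattan(int a, int b)
{
    int dx = a % MESH_DIM - b % MESH_DIM;
    int dy = a / MESH_DIM - b / MESH_DIM;
    return (dx < 0 ? -dx : dx) + (dy < 0 ? -dy : dy);
}

/* Pick the farthest core whose bit is still set in the sharer vector,
 * clear its bit, and return it; -1 when no sharers remain. */
static int pop_farthest_sharer(uint16_t *sharers, int home)
{
    int best = -1, best_dist = -1;
    for (int core = 0; core < MESH_DIM * MESH_DIM; core++) {
        if (*sharers & (1u << core)) {
            int d = manhattan(home, core);
            if (d > best_dist) { best_dist = d; best = core; }
        }
    }
    if (best >= 0)
        *sharers &= ~(1u << best);
    return best;
}
```

Replacing the fixed 0-to-15 loop of the earlier sketch with repeated calls to pop_farthest_sharer gives the distance-aware order; dropping one of the two terms of manhattan() yields the one-dimension hop approximation mentioned above.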

3 Multi-target Invalidations

Observations show that the number of cores sharing a cache line is usually larger than one. This is especially true for instruction cache lines, which are shared by almost all processor cores when running parallel programs. When invalidating all the copies, coherence protocols conventionally send one invalidation message for each sharing core. This can create message bursts in the network, resulting in significant contention and a negative impact on performance. The problem is exacerbated in an on-chip network environment, where buffer resources are limited by the power and area budget.

Compacting multiple invalidation requests into one network packet reduces the number of invalidation messages in flight. The processor cores are divided into several groups such that messages destined for the cores within a group share part of their routing path. A directory then sends one invalidation message for each group within which multiple cores are targeted. Routers in the network deliver the multi-target invalidation to the specified group and dispatch the message to the targets one by one. In conjunction with the mechanism described in the previous section, the multi-target invalidation for the farthest group should be sent first. This approach adds to each invalidation message a vector representing the targets in a group and the identification of that group. In routers, each buffer entry of the invalidation channel only needs to be augmented with a few extra bits, which is negligible.

Fig. 3. Multi-target invalidations. Group G1 consists of the light shaded nodes and group G2 consists of the heavy shaded nodes.

Again, we demonstrate the mechanism with an XY-routed mesh network. Figure 3 shows how multi-target invalidations work. Cores in the same column form a group because, with the XY routing algorithm, messages for cores in the same column first travel the same path along the X dimension (e.g., messages from A to B and from A to C share the whole path from A to B). As illustrated, when the directory in core A needs to invalidate copies in the L1 caches of cores B, C, D, and E, instead of sending four separate invalidations it sends just two messages, one per group. Each message has two targets, a 50% saving in the number of messages. When the message for B and C arrives at the router attached to core B, the router finds that it has multiple targets and forwards the message down to core C in parallel with invalidating the cached copy in core B.
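A hedged sketch of the mechanism, reusing the hypothetical 4*4 mesh layout from before: the directory folds the 16-bit sharer vector into at most one message per column group, and each router in the target column peels off its local target while forwarding the remainder down the column.

```c
/* A minimal sketch of multi-target invalidation with column groups on a
 * 4x4 mesh; all helper names are illustrative assumptions. */
#include <stdint.h>

#define MESH_DIM 4

typedef struct {
    uint8_t group;    /* 2 bits: which column the message is routed to */
    uint8_t targets;  /* 4 bits: one bit per row within that column    */
} mt_inval_t;

void invalidate_local_l1(int core);              /* illustrative hooks */
void forward_down_column(int router_row, mt_inval_t m);

/* Directory side: fold a 16-bit sharer vector into up to 4 messages. */
int build_mt_invals(uint16_t sharers, mt_inval_t out[MESH_DIM])
{
    int n = 0;
    for (int col = 0; col < MESH_DIM; col++) {
        uint8_t vec = 0;
        for (int row = 0; row < MESH_DIM; row++)
            if (sharers & (1u << (row * MESH_DIM + col)))
                vec |= 1u << row;
        if (vec)
            out[n++] = (mt_inval_t){ .group = col, .targets = vec };
    }
    return n;   /* e.g., B, C, D, E in Fig. 3 yield just 2 messages */
}

/* Router side: serve the local target, then keep forwarding while any
 * targets farther down the column remain. */
void on_mt_inval(int router_row, mt_inval_t m)
{
    if (m.targets & (1u << router_row)) {
        invalidate_local_l1(router_row * MESH_DIM + m.group);
        m.targets &= ~(1u << router_row);
    }
    if (m.targets)
        forward_down_column(router_row, m);
}
```

The 4-bit targets field and the 2-bit group field in this sketch match the per-message overhead quantified in Section 4.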

4 Simulator and Workloads

We use a cycle-accurate execution-driven simulator to evaluate the proposed mechanisms. The processor cores modeled in the simulator conform to the architecture of the Godson-2 processor [7], a high-performance microprocessor implementing the MIPS ISA and featuring 4-issue out-of-order execution, non-blocking caches, etc. We implement the directory-based write-invalidate cache coherence protocol and the on-chip network in significant detail to make the simulator behave in strict accordance with the hardware implementation. This methodology provides accurate simulation results at the cost of long simulation time.

The simulator models a 16-core CMP with an on-chip network using a mesh topology and the XY routing algorithm. We employ an aggressive implementation of routers, which take two cycles to forward a packet in the absence of contention; contention within the on-chip network is also simulated. The wires are optimistically assumed to take just two cycles to deliver a packet from one router to the next. We believe that as technology evolves, the speed of processor cores will pull further ahead of that of wires, and the advantages of our mechanisms will become more evident as wire delay increases. The detailed architecture parameters are summarized in Table 1.

Table 1. System configurations

    Parameter          Value
    Number of cores    16
    Processor          4-issue, out-of-order
    Cache block size   32 B
    L1 I-cache         64 KB, 4-way, 1-cycle latency
    L1 D-cache         64 KB, 4-way, 1-cycle latency
    Shared L2 cache    8 MB, 4-way, 4-cycle latency
    DRAM latency       100 processor cycles
    Network topology   4*4 2-D mesh
    Router             2 pipeline stages
    Wire delay         2 processor cycles

To test our ideas, we employ a set of scientific applications consisting of seven programs from the SPLASH-2 benchmark suite. The programs are run to completion, but all experimental results reported in this paper are for the parallel phases of these applications. Table 2 presents the applications and the input data sets used in the evaluation.

Table 2. Applications and input data sets

    Application                 Problem size
    FFT                         256K complex data points
    LU                          512*512 matrix
    Water-nsquared (WATERNS)    512 molecules, 3 timesteps
    Water-spatial (WATERSP)     512 molecules, 3 timesteps
    Cholesky                    D750
    LU-noncontiguous (LUNC)     128*128 matrix
    Ocean                       130*130 array, 1e-7 error tolerance

When evaluating the multi-target invalidation mechanism, we divide the cores into four groups by column: the four cores lying on the same column belong to one group. Each multi-target invalidation message then needs a 4-bit vector to represent the four cores in a group and two extra bits to identify the group it is routed to.
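The six bits of per-message overhead quoted above (a 4-bit target vector plus a 2-bit group identifier) can be illustrated with a trivial packing routine; the field layout here is an arbitrary assumption, not the hardware encoding.

```c
/* Packing the 6 bits of per-message overhead described above. */
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: bits [5:4] group id, bits [3:0] target vector. */
static uint8_t pack_mt_bits(uint8_t group, uint8_t targets)
{
    return (uint8_t)(((group & 0x3u) << 4) | (targets & 0xFu));
}

int main(void)
{
    /* Group 2 (third column), targets in rows 0 and 3 -> vector 0b1001. */
    uint8_t bits = pack_mt_bits(2, 0x9);
    assert((bits >> 4) == 2 && (bits & 0xFu) == 0x9);
    return 0;
}
```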

5 Results

This section describes the simulation results of applying both mechanisms, compared with the baseline protocol.

Figure 4 depicts the distribution of the number of sharers when a block needs to be invalidated. The home node of a cache block is not counted, as invalidating the copy in the L1 cache of a block's home node does not put a message into the network. As we can see, the programs demonstrate various behaviors. Almost all the data in FFT and Ocean has just one sharer besides the home node, so we can predict that these two applications will gain little, as few messages can be saved by multi-target invalidations. Some applications (such as LU, WATERSP, and Cholesky) have a modest number of sharers that barely exceeds 5; the performance of these applications can be expected to improve. LUNC differs from the others in that most of its data is shared by multiple cores; however, its number of sharers likewise remains small.

Fig. 4. Distribution of the number of sharers when a block needs to be invalidated.

Reductions in the cycles spent in the invalidation process are shown in Figure 5. The two mechanisms reduced the average invalidation traffic overhead by 5%. Among the applications, LU achieved the largest improvement, with a nearly 8% decrease in latency, which can be inferred from its distribution of the number of sharers. As predicted, FFT and Ocean show barely any improvement. However, the traffic overhead of LUNC was not reduced as much as the data in Figure 4 led us to expect; we can explain this phenomenon with Figure 6. (The mechanisms have limited impact on overall performance, as the dominant factor is memory access latency.)

Fig. 5. Invalidation overhead scaled to the baseline protocol.

Figure 6 presents the distribution of the number of targets in each multi-target invalidation message. In some applications data is shared by several cores, but the sharers are scattered around rather than kept together in groups.

Fig. 6. Distribution of the number of targets in each multi-target invalidation message.

Each multi-target invalidation message is therefore responsible for only a few targets, which lessens its effect on reducing the overhead. This explains why LUNC receives only a modest improvement: most of its multi-target invalidations aim at just one or two sharers.

6 Related Work

A large body of literature focuses on optimization techniques for cache coherence protocols. [8,9] proposed adaptive coherence protocols for different data sharing patterns. [10,11] effectively eliminated the overhead of remote misses by making producers push data to consumers in advance, instead of fetching data when the consumers' read misses actually happen.

Lebeck and Wood [12] proposed Dynamic Self-Invalidation (DSI), which automatically writes back the writer's dirty copy of data at synchronization boundaries so as to save the coherence messages incurred when a sharer subsequently reads the same cache line. Lai and Falsafi [13] extended DSI with a last-touch predictor to invalidate the data in a more timely manner and avoid potential message bursts in the network. On-chip networks also attract considerable attention: researchers aim to reduce transmission latency by shortening router pipelines and to alleviate contention with adaptive routing algorithms [14,15].

All the research mentioned above is devoted to its own field. Recently, however, some work has demonstrated the benefits of coupling cache coherence protocols more tightly with the underlying network. Eisley et al. [16] proposed embedding directories within each router node to satisfy requests with nearby data copies. Cheng et al. [17] leveraged wires of different power and latency properties to deliver different coherence protocol messages depending on their bandwidth-latency requirements.

7 Conclusions and Future Work

In this paper, we proposed two techniques to reduce invalidation traffic overhead in CMPs whose processor cores are connected by on-chip networks. We were motivated by one major difference between DSMs and CMPs: for flexibility, traditional cache coherence protocols of DSMs are not designed for a dedicated network, missing some optimization opportunities; in CMPs, the parameters of the on-chip network are determined at design time, so the two can be considered jointly. The distance-aware optimization dispatches invalidations in order of how far the sharers are from the directory; compared with an oblivious policy, this mechanism processes long-latency events first so as to lower the overall overhead. Multi-target invalidations convey invalidation requests for a group of cores within one network message; this approach decreases invalidation traffic and alleviates message bursts in the network when a crowd of cores shares a cache line. We conducted simulations on a 16-core CMP simulator using a subset of the SPLASH-2 benchmark suite. The experimental results showed that the two mechanisms together reduced the average invalidation traffic overhead by 5%, and by up to 8%.

In the future, we will optimize the simulator to support more cores. At present, the simulator models the coherence protocol in significant detail, so it takes prohibitively long to simulate more than 32 cores. We will refine the implementation in a more efficient and configurable way, and we believe that the two approaches will achieve more notable improvements in CMPs containing many more cores.

Acknowledgements. We thank the anonymous reviewers for their valuable advice. Our work is supported by the National Natural Science Foundation of China for Distinguished Young Scholars under Grant No. , the National Natural Science Foundation of China under Grant No. and No. , the National High Technology Development 863 Program of China under Grant No. 2006AA010201, the National Basic Research Program of China under Grant No. 2005CB , and the Beijing Natural Science Foundation under Grant No. .

References

1. Dally, W.J., Towles, B.: Route packets, not wires: on-chip interconnection networks. In: DAC '01: Proceedings of the 38th Conference on Design Automation, New York, NY, USA. ACM Press, New York (2001)
2. Ho, R., Mai, K.W., Horowitz, M.A.: The future of wires. Proceedings of the IEEE 89(4) (2001)
3. Zhang, M., Asanovic, K.: Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors. In: ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, Washington, DC, USA. IEEE Computer Society, Los Alamitos (2005)
4. Held, J., Bautista, J., Koehl, S.: From a Few Cores to Many: A Tera-scale Computing Research Overview. Technical report, Intel (2006)
5. Laudon, J., Lenoski, D.: The SGI Origin: a ccNUMA highly scalable server. In: ISCA '97: Proceedings of the 24th Annual International Symposium on Computer Architecture. ACM Press, New York, NY, USA (1997)
6. Dally, W.J., Towles, B.: Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2003)
7. Hu, W., Zhang, F., Li, Z.: Microarchitecture of the Godson-2 processor. Journal of Computer Science and Technology 20(2) (2005)
8. Cox, A.L., Fowler, R.J.: Adaptive cache coherency for detecting migratory shared data. In: ISCA '93: Proceedings of the 20th Annual International Symposium on Computer Architecture, New York, NY, USA. ACM Press, New York (1993)
9. Kaxiras, S., Goodman, J.R.: Improving CC-NUMA performance using instruction-based prediction. In: Proceedings of the Fifth IEEE Symposium on High-Performance Computer Architecture (1999)
10. Abdel-Shafi, H., Hall, J., Adve, S.V., Adve, V.S.: An evaluation of fine-grain producer-initiated communication in cache-coherent multiprocessors. In: Third International Symposium on High-Performance Computer Architecture (1997)
11. Koufaty, D.A., Chen, X., Poulsen, D.K., Torrellas, J.: Data forwarding in scalable shared-memory multiprocessors. In: ICS '95: Proceedings of the 9th International Conference on Supercomputing. ACM Press, New York, NY, USA (1995)
12. Lebeck, A.R., Wood, D.A.: Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors. In: ISCA '95: Proceedings of the 22nd Annual International Symposium on Computer Architecture. ACM Press, New York, NY, USA (1995)
13. Lai, A.-C., Falsafi, B.: Selective, accurate, and timely self-invalidation using last-touch prediction. In: ISCA '00: Proceedings of the 27th Annual International Symposium on Computer Architecture. ACM Press, New York, NY, USA (2000)

14. Mullins, R., West, A., Moore, S.: Low-latency virtual-channel routers for on-chip networks. In: ISCA '04: Proceedings of the 31st Annual International Symposium on Computer Architecture, Washington, DC, USA, p. 188. IEEE Computer Society (2004)
15. Kim, J., Park, D., Theocharides, T., Vijaykrishnan, N., Das, C.R.: A low latency router supporting adaptivity for on-chip interconnects. In: DAC '05: Proceedings of the 42nd Annual Conference on Design Automation. ACM Press, New York, NY, USA (2005)
16. Eisley, N., Peh, L.S., Shang, L.: In-network cache coherence. In: MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, Washington, DC, USA (2006)
17. Cheng, L., Muralimanohar, N., Ramani, K., Balasubramonian, R., Carter, J.B.: Interconnect-aware coherence protocols for chip multiprocessors. In: ISCA '06: Proceedings of the 33rd Annual International Symposium on Computer Architecture, Washington, DC, USA. IEEE Computer Society (2006)
