Comparison of memory write policies for NoC based Multicore Cache Coherent Systems

Size: px

Start display at page:

Download "Comparison of memory write policies for NoC based Multicore Cache Coherent Systems"

Adele Baker
5 years ago
Views:

Comparison of memory write policies for NoC based Mlticore Cache Coherent Systems Pierre Gironnet de Massas, Frederic Petrot System-Level Synthesis Grop TIMA Laboratory 46, Av Felix Viallet, 38031

1 Comparison of memory write policies for NoC based Mlticore Cache Coherent Systems Pierre Gironnet de Massas, Frederic Petrot System-Level Synthesis Grop TIMA Laboratory 46, Av Felix Viallet, Grenoble, France Abstract The following stdy shows a direct comparison of memory write policies in Shared Memory Mlticore Systems. Althogh there are mch work and many stdies abot this isse, or work takes into accont the difficlties related to on chip commnication sing network-like interconnects. Or stdy is based on Cycle Approximate Bit Accrate simlations (CABA) of platforms with p to 64 processors, modelling accrately all the aspects of mlti-threaded program exection and memory accesses. Or main reslts show that write-throgh caches perform well compared to write-back ones, with a slightly simpler implementation and comparable traffic. 1. Introdction In the past few years, the search for processors with higher-freqencies and evermore complex pipelines seems to have stopped. Instead, as integration capabilities still increase according to Moore's law, architects are ptting together more and more processors, memories and devices on a single die. Sch chips with dozen of processors achieve high comptational throghpt by exploiting parallelism at task level in applications. Since bses do not scale well with more than few processors - say ten - Network on Chip (NoC) interconnect [6] seems to be one of the best soltions for the next decades. In NoCs, the available bandwidth is not limited by the nmber of nodes becase each one comes with its own set of wires. Moreover, all commnications are on chip, hence NoC's achieve higher throghpt bandwidth and mch lower latency than off-chip large-scale Symmetric MltiProcessors (SMP). A high nmber of programmable components significantly increases software design and implementation complexity. Having a hardware that provides a simple programming model will be a major architectral argment in the ftre. Some architectres do not inclde any more caches since they are targeting low-power devices with streaming application models. Bt as shown in [10] hardware coherent caches offer similar performances in streaming applications and better overall performances when they do not perform write-allocate protocols. Hence, caches with hardware coherency are a good choice to provide a simple programming model and redce accesses latencies in a large scale NoC. For the sake of simplicity, we will se seqential consistency in or platform simlation. Nevertheless or comparison remains valid with a weaker model as the one sed in commercial designs. Cache Area is an important trade off, since in actal high-end processors it represents 50% of the area, and it will still increase in ftre implementations. Nevertheless, integrating several processors in the same die implies less area allowed to caches and memories. Conseqently, the cache implementation shold be as simple as possible with niform access and in-order reqest isses. De to the presence of a NoC, we se a directory-based coherency scheme inspired from one proposed by Censier and Featrier [5]. Its area overhead does not scale well with a high nmber of processors. Or work can be adapted to more efficient soltions as one reviewed in [16, 2]. Therefore, or work focses on on-chip shared memory mlticore systems with NoC and caches. The memory hierarchy implements seqential consistency and hardware coherence with a directory scheme. Hardware protocols can be divided into two categories to maintain coherency, write-pdate and write-invalidate [15]. We implement the later soltion since it is the most commonly sed and srely the best one in or context. As a reslt, we focs or stdy on memory write policies. In actal designs, the write-back is the most widely sed policy. In contrast, write-throgh memory pdates are well known in the literatre to give poor performances and are almost nsed in actal designs. The high bandwidth available in NoC architectres, and the high latencies to access foreign on chip nodes, makes it necessary to re-evalate the write-throgh memory pdate policy which is crrently considered inefficient. The main contribtion of or work is a head to head com /DATE EDAA

2 parison between directory-based generic write-back MESI like and write-throgh invalidate protocols implemented on top of a NoC. The comparison is done simlating at CABA level a fll application software with an OS. Therefore, all the instrctions and memory accesses are flly modelled. To give a self contained stdy, we present hereafter the necessary backgrond definitions. Secondly, we show the related work on the sbject, followed in the third section by a description of the compared protocols and their implementations. In the forth section we describe the experiments, and finally we present or reslts and conclding comments. 2. Backgrond and Definitions Or stdy focses on directory-based soltions and write-invalidate protocols. A memory location may have several readers bt only one writer. Hence, when a processor exectes a store instrction, all the copies of the targeted block mst be invalidated before the new vale is written. This ensre the coherency and seqential consistency of the memory. Two memory pdate policies named write-throgh and write-back are available with the writeinvalidate protocol. Write-throgh policy: A write reqest is sent to the main memory for each store issed by a processor. Ths, main memory is always p-to-date bt severe contention may appear on a bs based architectre. Write-back policy: This is the most widely sed policy. When a processor exectes a store instrction, it modifies the local copy bt no reqest is sent to the main memory. The modified block is said to be "dirty", it will be writtenback when it is evicted. This policy is commonly implemented as write-allocates: on a write miss the block is retrieved from memory and modified locally. The main advantage of this soltion is that only the necessary memory accesses are done (on a block basis however). 3. Related works Cache coherency has been a problem for architects from the earlier shared memory mltiprocessors systems. The first proposed soltions to maintain coherency are the Write-once, Illinois and Berkeley protocols, followed few years later by the Dragon and Firefly protocols. A review of these protocols was done by Archibal and Baer[4]. Other well-known reviews and stdies abot existing protocols [16, 2, 18] show that write-throgh invalidate is the less efficient protocol in a bs-like interconnect. Althogh previos cited work may seem a bit old, there is an p-to-date work comparing existing snooping protocols with cycle accrate simlations [11]. This work shows that Write-throgh Invalidate protocol is less efficient than write-back MESI. As we have stated before, write-back implementations take advantage of bs snooping facilities and leverage the bs locally pdating a block and writing back on eviction. With the se of non-bs interconnects, like crossbars or later NoC's soltions, directories schemes have been (and still are) a good soltion. Different soltions are reviewed in [16, 2]. With the growing nmber of caches and components in the same chips, interconnect latencies to access directory and foreign data is a major problem. There are several soltions like [12, 7] relying on specific network commnication and coherency message schemes, bt both of them still implement MESI like protocols. Protocol optimizations like [8, 9], try to hide or redce access latencies to foreign dirty blocks. Nevertheless, all proposed soltions (to or knowledge) allow a block to be dirty in a cache and ths falls in the write-back category. 4. Coherency protocols and implementation 4.1 Compared protocols: Write-throgh Invalidate (): This protocol is the simplest to implement as attested by its finite state machine diagram on figre 1. A write-throgh policy is combined with a write-invalidate protocol. When a write reqest is issed, the main memory, throgh its directory, will invalidate all the cached copies to ensre the coherency. Traffic overhead is the major drawback of this protocol. The main advantage is that the main memory always contains clean copies. Write-back MESI (): For comparison prposes we implemented a generic MESI like protocol (Illinois [13]). When a processor exectes a store instrction, the cache mst get the exclsivity of the targeted block. This reqest will invalidate all other copies, retrieve dirty blocks from foreign caches and allocate the clean block if needed. These actions can take a long time on a high latency NoC, and this is the major drawback of this protocol. Its finite state transition diagram can be seen in figre Protocols implementation Implementing a or protocol (at CABA or RTL level) can lead to completely different soltions. Hereafter we present the main distinctive behaviors of both protocols. We call a hop the delay needed to cross the NoC from one node to another, and se this nit to characterize the protocols actions. In or context a node is either a cache, its associated processor and its controller or the main memory, its controller and its directory. Transferring messages

The simlation environment: Or simlation platforms are bilt sing the Cycle Approximate, Bit Accrate comvi UJ z UQ._ @.1 m S 4' L. I..C.1..1. invalidation cp load / cp load/ RE.

3 The simlation environment: Or simlation platforms are bilt sing the Cycle Approximate, Bit Accrate comvi UJ z m S 4' L. I..C invalidation cp load / cp load/ RE.acl* cp store / Writ allocat 'invalidation_s cp load / rnii cp store,tnrp / c l invalidatiore cp ld cp load / cp store / Write cp store/cp load/ invalidation cp load / Read ( "I N cp store / Write processor reqest / issed command to memory transition - on memory directory reqest - transition on cp reqest * depends on the response states: Modified, Exclsive, Shared, Invalid, Valid be granted and all copies will be invalidated. This action is blocking for the data cache and the processor and costs 2 or 4 hops. In the second case, the block is allocated and can lead to an p to 6 hops action as shown in figre 2, which is blocking (stalls the processor) ntil step 4. Main memory * reqest to write-allocate a = l0dill 8 reqest to retrieve a clean copy Q3 response with a clean copy of the do 0 response with clean copy to processor 0 write-back of the Modified block 4' U ' U A0 Figre 2. a 6 hops example Figre 1. Finite State Diagram of the and protocols from node to node contribtes for almost all the data latencies. Hence, conting hops is a good comparison basis for the protocol's implementation efficiency. Read reqests: In both implementations, a read miss leads to at least a two hops reqest (read a clean copy from the memory). Nevertheless, in the block may be in Modified state in a foreign cache. In or implementation, the corresponding action is decomposed as follows: 1. a read reqest is sent by the cache to the main memory node. 2. the main memory node sends a reqest to the owner of the clean copy. 3. The foreign cache response contains a clean copy and pts his in Shared state. 4. The main memory node responds to the reqesting cache with a clean copy of the block. Write reqests: In, misses and hits are handled identically. A write command containing the modified word is sent to the main memory throgh its write bffer. The main memory node can respond immediately if there is no shared copies (2 hops action), or send invalidate commands to all the caches which contains a copy and waits for the acknowledgements before sending a response to the reqesting node (4 hops). These actions are non-blocking for the cache controller ntil the bffer is fll. In, a write misses when the block is either in Shared or Invalid state. In the first case, an exclsivity will [ processor action 11 ] read hit 0 0 read miss 2 b. hops 2 or 4 hops b., (+2 n.b.) write miss 2,4 hops n.b. 2, 4 blocking, (+2 n.b.) write hit S 2,4 hops n.b. 2, 4 hops blocking write hite - 0 write hit M 0 Table 1. Cost in hops of each reqest in both protocols, b. and n.b. stands for blocking and non-blocking reqests for the processor. In the table 1, we smmarize the cost of each action for both protocols. Or implementations can be optimized by allowing cache to cache transfers or any other architectral enhancement. For example, invalidation acknowledges can be sent directly to the reqesting node (cache) leveraging the memory node and saving one hop transfer. Nevertheless, or implementations were done with identical behaviors in actions steps, leading to a as fair as possible comparison. Moreover, typical protocol optimizations can often be applied on both protocols. 5. Experiments 5.1 Simlation environment and Hardware architectre

4 ponents of the SoCLib [1] library which ses the VCI [3] protocol to commnicate. This library allows to design and simlate in an easy way platforms with dozen of processors rnning cross-compiled software. Althogh most components (processors, caches, memories and many devices) are simlated with cycle-accrate precision, some are instead cycle-approximate. The main cycle-approximate component is the Generic Micro Network (GMN). This component does not trly represent a set of roters, bt a configrable crossbar like interconnect with internal delay fifos. Nevertheless, setting the minimm transfer delay and fifo's depth allows s to model a NoC with 2D mesh latencies and contention characteristics. This cycle-approximate component has no major impact in or reslts since it is sed for all the configrations and gives s fair comparisons between the different stdied protocols. RAM o RAM I shared maximm contention -, + 1 on memory bank I U ~~Generic Micro Network ISPARC V8 SPARC V8 ISPARC V8 SEMRAM Processor O Processor 31 Processo n-l IIziz RAM 0 RAM 1 RAM 2 RAM 3 RAM 4 RAM n+2 ~~~tack stackstc accesses sprayed over ~~~~~~m memory banks Modeled architectre description: In figre 3 we present or modelled architectres. The Sparc processors are connected to an instrction and data cache. The data cache and the memory implement one of the protocols (, ). The instrction and data cache se the same interconnect port in order to minimize the NoC area. As a reslt, high data workload can interfere with instrction misses reqests, increasing the average instrction access latency. We implemented two architectres to evalate the impact of the contention at memory banks in the protocols. As we can see, one has few memory banks which will receive reqests from p to 64 processors. The other one spreads the memory accesses among several memory banks (one per processor and three shared one). The table 2 smmarizes the main hardware characteristics. nmber of processors n {4, 16, 32, 64} nmber of memory banks m {2, n + 3} processor model SPARC-V8 with FPU data cache size 4Kb instrction cache size 4Kb data/instrction block size 32 bytes cache associativity Direct-mapped write-bffer size 8 words (32 bytes) NoC topology Mesh NoC Latency 3. 2 V/n +Tm+m 3 Table 2. Simlated platforms characteristics 5.2 Software architectre Application: On top of the CABA simlation platform, we exected the Ocean and Water SPLASH-2 benchmark tests. In [17] can be fond a detailed description of each Figre 3. Modeled architectres to compare coherency protocols with different levels of memory contention test bench with their main characteristics. This test bench site was designed to evalate Shared Memory Mltiprocessors architectres. It is widely sed and accepted by the research commnity. In or simlations, we execte a comparable workload per processor in each platform configration (4, 16, 32 and 64 processors). That way, the nmber of load/store instrctions exected by each processor will not decrease while increasing the processors nmber. The configration characteristics are presented briefly in figre 4. These test benches need some services, typically granted by a POSIX operating system. Hereafter we shortly describe the sed one and its possible configrations. Operating System configrations: We sed a lightweight operating system [14], which implements POSIX pthreads allowing to execte parallel tasks on mlticore systems. It can be configred in two ways: 1. Symmetric Schedling (SMP). In this configration the OS distribtes the workload on the available processors on a first come, first served basis. Hence, a task can migrate from processor to processor in an npredictable way. This configration is not sitable for large scale platforms since the centralized schedler access becomes a bottle-neck. 2. Decentralized Schedling (DS). This configration greatly limits contention as each processor has its own schedler. Tasks can be pin pointed on a desired processor in order to avoid migration.

5 Memory layot: On the architectre 1, we se the SMP kernel. All the shared and local data and thread stacks are in the same memory bank. This configration leads to a maximm memory workload and contention increasing accesses latencies. On the architectre 2, we se the DS version. Each thread is pin pointed to a dedicated processor. Its stack and local data are stored in a dedicated memory bank. Shared dynamic and static data are contained in different memory banks. The prpose of this layot is to spread as fairly as possible the accesses to all memory banks. Conseqently there are only few contention points and coherency protocol workload is slightly lower than in architectre Reslts and comments We present hereafter the obtained reslts on two SPLASH-2 benchmarks: exection time, NoC traffic and data latencies. 6.1 Exection time We rn the applications ntil completion and report the exection time for both architectres on figre 4: 1000 _ * 800 z EAN OCEAN: contigos partitions grid size: 4N+2 x 4N+2* error tolerance: protocol. This behavior was expected de to the centralized memory. In fact, sever contention appears on a specific memory bank as one that can be shown on a bs like interconnect with too many processors. DS: The exections took slightly less time. This is de to less contention on schedler strctres by the processors, and to a better repartition of the commnication workload since accesses are sprayed among all the memory banks. Moreover, pinpointing tasks on processors avoids migration overhead. Conseqently, the Ocean exection is p to 30% faster and the difference is increasing with the nmber of processors. In Water exection, both protocols give the same performances withot a clear advantage for one of them. These reslts show that, in or context, protocols performs well compared to ones. Moreover, ftre MPSoCs will embed many memory banks distribted across the die. As a reslt, protocols are a viable soltion in terms of performances. 6.2 Traffic load over the NoC As we can see in figre 5, the traffic is of the same magnitde order in both protocols. Differences in kernels, applications and architectres gives different behaviors. Therefore, there is not a clear advantage for none of the stdied protocols r r 7 1~ OCEAN]._ 'a 600 E 400 Xi 200 n centralized better than meoy*n :nmber of processors distribted memory better than WATER: spatial nsqared nmber of mols: 27 for N = { 4, 16) 64 for N = {32, 64) CD w in WER.S 70 architectre 1 architectre 1 architectre 2 architectre 2 x processors nmber architectre 1 architectre 1 architectre 2 architectre m3.107 ioe lo Figre 4. Exection time in Mega cycles for each simlation platform and both protocols 4 0 O processors nmber SMP: We see that and have almost the same exection times. Nevertheless, increasing the processors nmber above 32 gives an advantage to Figre 5. Total traffic on the NoC in bytes on a complete rn exection

6 6.3 Data cache stall cycles Instead of comparing directly data latencies, we compared the average stall time de to data cache accesses. The reason is that data latencies rely on loads instrctions only. Therefore, it does not show stall cycles de to fll write bffers and write-allocate actions. In the figre 6 we see that both protocols have almost identical reslts on both architectres. As expected, on architectre 1 there is more contention than on architectre 2. With more than 32 processors, the time being stalled by the data caches reaches almost 70% of the exection time. data cache stall cycles 100% architectre 1 80% 60% *,''.K< 40% 40% OCEAN architectre 1 8- architectre 2 Vi architectre 2 _O C processors nmber Figre 6. Percentage of data cache stall cycles 7. Conclsion In this paper we compared the memory pdate policies in NoC based shared memory mlticore systems implementing hardware coherence. Althogh there has been many works and stdies arond cache coherency, none of them (to or knowledge) has shown the sability of the very simple write-throgh-invalidate protocol. Moreover, all the new proposed protocols and optimizations make se of writeback policies. The obtained reslts have shown that in or context write-throgh-invalidate protocols are a possible and simple soltion to maintain coherency. This protocol performs very well compared to a classic write-back-mesi protocol in both exection time and generated traffic. The main limitation of this work is the lack of best-case / worstcase reslts wich will be done in ftre work. [2] A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz. An evalation of directory schemes for cache coherence. In ISCA '88. [3] V. Alliance. Virtal component interface standard (ocb 2 [4] 2.0) J. Archibald and J.-L. Baer. Cache coherence protocols: evalation sing a mltiprocessor simlation model. ACM Trans. Compt. Syst., 4(4): , [5] L. Censier and P. Featrier. A new soltion to coherence problems in mlticache systems. ieee Transactions on Compters, C-27: , [6] W. J. Dally and B. Towles. Rote packets, not wires: onchip inteconnection networks. In DAC '01: Proceedings of the 38th conference on Design atomation. [7] N. Eisley, L.-S. Peh, and L. Shang. In-network cache coherence. In MICRO 39: Proceedings of the 39th Annal IEEE/ACM International Symposim on Microarchitectre, [9] A.-C. Lai and B. Falsafi. Selective, accrate, and timely self-invalidation sing last-toch prediction. In ISCA '00. [10] J. Leverich, H. Arakida, A. Solomatnikov, A. Firoozshahian, [8] J. Hh, J. decopling: Chang, D. Brger, making se of incoherence. and G. S. Sohi. In ASPLOS-XI'04. Coherence M. Horowitz, and C. Kozyrakis. Comparing memory systems for chip mltiprocessors. In ISCA '07. [11] M. Loghi, M. Poncino, and L. Benini. Cache coherence tradeoffs in shared-memory mpsocs. Trans. on Embedded Compting Sys., 5(2), [12] M. M. K. Martin, M. D. Hill, and D. A. Wood. herence: decopling performance and correctness. Token co- In ISCA '03. [13] M. S. Papamarcos and J. H. Patel. A low-overhead coherence soltion for mltiprocessors with private cache memories. In ISCA '84: Proceedings of the 11th annal international symposim on Compter architectre. [14] F. P6trot and P. Gomez. Lightweight implementation of the posix threads api for an on-chip mips mltiprocessor with vci interconnect. In DATE, [15] P. Stenstrom. A srvey of cache coherence schemes for mltiprocessors. Compter, 23(6): 12-24, Jne [16] M. Tomasevic and V. Miltinovic. Hardware approaches coherence in shared-memory mltiprocessors, part 1. IEEE Micro, 14(5):52-59, [17] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gpta. The splash-2 programs: characterization and methodological considerations. In ISCA '95: Proceedings of the 22nd annal international symposim on Compter architectre. [18] Q. Yang, L. N. Bhyan, and B.-C. Li. Analysis and comparison of cache coherence protocols for a packet-switched mltiprocessor. IEEE Trans. Compt., 38(8): , References [1] Soclib project. http: // Home. html.

CS 153 Design of Operating Systems Spring 18

CS 153 Design of Operating Systems Spring 18 CS 53 Design of Operating Systems Spring 8 Lectre 2: Virtal Memory Instrctor: Chengy Song Slide contribtions from Nael Ab-Ghazaleh, Harsha Madhyvasta and Zhiyn Qian Recap: cache Well-written programs exhibit