Directories vs. Snooping in Chip-Multiprocessor

Karlen Lie and Saengrawee Pratoomtong
University of Wisconsin-Madison, Computer Sciences Department
1210 West Dayton Street, Madison, WI, USA

Abstract

Advances in semiconductor integration technology have resulted in ever-increasing transistor density. This has led to designs where multiple processors, each with its own caches, memory controller, network interface, and network router, are integrated on a single chip. Previously understood and deeply explored tradeoffs in conventional multiprocessor systems need to be reevaluated. We chose to investigate the coherence choice between directory-based and snooping-based protocols on such a chip multiprocessor (CMP). Our result was inconclusive, as we did not manage to get the snooping simulator fully working. However, from the numbers that we compiled, both protocols would perform equally well on a CMP, and the decision to choose one protocol over the other should be made after considering whether the CMP would be used by itself or connected to other CMPs.

1 Introduction

Our goal was to determine the implications of CMP on the choice of coherence protocol. With the advent of CMP, the multiprocessor characteristics that influenced this choice changed. Most notably, CMP provided high on-chip bandwidth as well as much reduced on-chip latency. These changes resulted in systems that had different limitations and advantages compared to conventional cache-coherent multiprocessors such as Symmetric Multiprocessor (SMP) and cache-coherent Non-Uniform Memory Access (cc-NUMA) systems. In terms of coherence protocol, snoopy-based coherence was prevalent on small- to medium-scale systems, with the largest such system being the Sun Starfire [1], capable of supporting up to 64 processors. Systems that intended to support more than 64 processors, such as the SGI Origin [2], adopted a directory-based coherence protocol. This division was dictated by the common goal of achieving the best memory performance possible, both in terms of maximum data transfer rate and lowest possible latency per memory operation. Generally, snoopy-based systems offered lower latency than directory-based systems, while directory-based systems were more scalable bandwidth-wise than snoopy-based systems. With multiple processors on a single chip, the division was not as clear as before.

Both available bandwidth and communication latency improved significantly. This led to the possible viability of system configurations that were previously considered impractical because of bandwidth limitations or high-latency overhead. In this project, we focused on comparing the two coherence protocols on similarly configured systems. We wanted to measure the effect of the bandwidth and latency improvements on overall execution time. We were also interested in the overall utilization of the available bandwidth, and in the latency of each memory instruction. To achieve this, we used the RSIM [3] multiprocessor simulator developed at Rice University. Since RSIM was originally a cc-NUMA system simulator, we had to incorporate the snoopy coherence protocol and remove the directory-coherence portion from our version of RSIM.

This paper is organized in the following manner. The next section discusses our methodology and setup. Section 3 presents our experimental results and discussion. Section 4 discusses related work, and finally, we present our conclusions in section 5.

2 Methodology

The CMP we were emulating had the following features. We had 8 processors on a single chip, each with its own L1 and L2 caches that were also on-chip. The directory states were also stored on-chip. For the directory-based protocol, the L2 caches of the eight processors and the directory were directly connected to a 2-D mesh interconnect. In the case of the snoopy coherence protocol, the directory and the L2 caches were connected to a single split-transaction bus. For all of our simulations, we forced the processors to comply with the Sequential Consistency (SC) model, and the coherence protocol we used was the 3-state MSI protocol. We also enforced inclusion between the L1 and the L2 caches. The rest of this section describes the simulator platform and the implemented modifications.
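For reference, the 3-state MSI protocol mentioned above tracks each cache line as Modified, Shared, or Invalid. The sketch below is a generic, highly simplified view of the processor-side transitions and is not taken from RSIM; the names (MsiState, BusAction, on_processor_access) are ours, and the read shared / read own request types are the ones discussed later in section 3.

```cpp
// Illustrative-only sketch of processor-side MSI transitions; the actual
// RSIM protocol engine handles many more cases (races, replacements, etc.).
enum class MsiState { Invalid, Shared, Modified };

struct BusAction {
    bool issue_read_shared = false;  // fetch a copy for reading
    bool issue_read_own = false;     // fetch/upgrade for writing, invalidating sharers
};

BusAction on_processor_access(MsiState& state, bool is_write) {
    BusAction act;
    switch (state) {
        case MsiState::Invalid:                               // miss: request the line
            if (is_write) { act.issue_read_own = true;    state = MsiState::Modified; }
            else          { act.issue_read_shared = true; state = MsiState::Shared;   }
            break;
        case MsiState::Shared:                                // upgrade on write
            if (is_write) { act.issue_read_own = true;    state = MsiState::Modified; }
            break;
        case MsiState::Modified:                              // hit: no bus traffic
            break;
    }
    return act;
}
```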

2.1 Baseline System

We used RSIM as our simulation platform. As RSIM was originally a cc-NUMA system with a directory-based coherence protocol, our baseline system was thus a CMP with a directory-based coherence protocol. RSIM featured one superscalar, dynamically scheduled uniprocessor per node, connected by a 2-D mesh network. Figure 1 shows the layout of our baseline simulation platform. The dotted line marks the extent of the chip, with all components within the boundary built on the same chip.

Figure 1. Baseline system architecture, derived from the RSIM developer manual [3].

To make unmodified RSIM emulate the behavior of a CMP, we adjusted various system parameters of the simulator. The main adjustments were an increase in the bandwidth and a reduction in the latency of the interconnects, and a reduction in the access latency of the directories and caches, all of which were assumed to be on-chip.

By being on-chip and built using SRAM instead of slower DRAM, the directory states had an access latency and a cycle time between accesses that were consequently much smaller than those of main memory, which was still off-chip and built from slower DRAM modules. Table 1 summarizes the parameter adjustments that we made to the unmodified version of RSIM to simulate our directory-based CMP.

Table 1. Parameters for the directory-based CMP.

Processor speed: 600 MHz
Issue width / instruction window size: 4 / 64
Number of processors: 8
Cache line size: 64 bytes
L1 cache size: 16 KB
L1 cache associativity: 1
L1 type: write-through (WT)
L1 access latency (both tag and data): 1.6 ns
L2 cache size: 64 KB
L2 cache associativity: 4
L2 type: write-back (WB)
L2 access latency (tag / data): 5 / 6.6 ns
Maximum bus bandwidth: 9.6 GB/s
Bus clock: 300 MHz
Memory access latency: 60 ns
Minimum directory access latency: 1.6 ns
Coherence protocol: MSI
Flit size (equivalent to the width of the network): 64 bytes
Flit delay (latency for each flit to pass through a network switch): 1.6 ns
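To put the latencies in Table 1 in terms of processor cycles, the short back-of-the-envelope conversion below can be used. It is not simulator code; it simply divides each nanosecond figure by the 600 MHz cycle time (about 1.67 ns), so an L1 access costs roughly one cycle, an L2 data access about four, and an off-chip memory access about 36.

```cpp
#include <cmath>
#include <cstdio>

// Convert the Table 1 latencies (ns) into 600 MHz processor cycles.
int main() {
    const double cycle_ns = 1e9 / 600e6;  // ~1.67 ns per cycle at 600 MHz
    const char* labels[]  = {"L1 access", "L2 tag", "L2 data",
                             "memory access", "directory access"};
    const double lat_ns[] = {1.6, 5.0, 6.6, 60.0, 1.6};
    for (int i = 0; i < 5; ++i)
        printf("%-17s %5.1f ns  ~= %2.0f cycles\n",
               labels[i], lat_ns[i], std::ceil(lat_ns[i] / cycle_ns));
    return 0;
}
```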

2.2 Directory on a Bus

Since our goal was to eventually have a snoopy-based multiprocessor CMP simulator to compare against the directory-based system described in section 2.1, we needed to either construct a brand-new snoopy-based simulator, choose another simulator to run the snoopy-based system on, or modify the existing RSIM simulator to support this protocol. We decided against the first two options for the following reasons:

- Writing a brand-new simulator was simply not feasible in the limited amount of time available for the project.
- Using a different simulator for the snoopy-based system would introduce too many uncontrolled variables and make the simulation results incomparable.
- Modifying the existing simulator should not pose too much difficulty.

We thus proceeded with the next logical step: modifying the baseline system. We modified the simulator so that all the processors shared the same bus. Since the bus would be our main and only communication medium, we disabled all network interfaces, updated the routing protocol to bypass them, and routed every network-bound transaction directly to its destination through the bus. All the parameters remained the same as in the baseline system. Figure 2 shows the modified system. As in figure 1, the dotted line marks the boundary that separates on-chip from off-chip components.

Figure 2. Directory on a bus.

2.3 System with Snoopy-Based Coherence Protocol

This system was the simulator that would serve as the comparison point to the baseline simulator specified in section 2.1, allowing a directory-based and a snoopy-based system to be compared. The following modifications were made to the directory-on-the-bus system described in section 2.2 to arrive at this system:

- Modified the caches so that they would snoop on the bus.
- Modified the order in which invalidations occurred. This was necessary because in the original RSIM, invalidations did not occur immediately. Instead, the system would first try to invalidate the L1, and then invalidate the L2. The existing delay for accessing these tags was unacceptable as well. Thus, we had to modify both caches to handle snooping messages simultaneously.
- Modified the bus to broadcast. We had not realized that the bus module in the original RSIM only supported point-to-point communication; as a result, we had to rewrite a significant chunk of the bus module to incorporate this capability.
- Modified the bus to limit the number of outstanding transactions on the bus to one transaction per cache line (a simplified sketch of these last two rules appears after this section).
- Finally, instead of removing the directory module, we used it to keep track of which cache lines were currently in the shared state. This way, we did not have to create a shared signal line that would need to be pulled whenever any cache held a line in the shared state in order to determine whether memory should reply to a particular transaction. The directory would now cease to respond to all transactions except those that would ordinarily be serviced by the memory unit in a snoopy-based system.

After applying all the above modifications, we essentially had a snooping-based system with separate processor-side cache tags and bus-side cache tags. The split-transaction bus had a variable response delay, and responses could come back out of order.
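The broadcast and one-outstanding-transaction-per-line rules described above can be illustrated with the following sketch. This is not the RSIM bus module; the names (SnoopBus, Snooper, try_issue) and the structure are our own, and the real split-transaction bus tracks much more state (arbitration, variable response delays, retries).

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Stand-in for a cache's bus-side tags: each cache observes every request.
struct Snooper {
    virtual void snoop(uint64_t line_addr, bool is_read_own) = 0;
    virtual ~Snooper() = default;
};

class SnoopBus {
public:
    explicit SnoopBus(std::vector<Snooper*> snoopers)
        : snoopers_(std::move(snoopers)) {}

    // Issue a request: fails if a transaction for this line is already
    // outstanding (the requester must retry), otherwise broadcasts to all.
    bool try_issue(uint64_t line_addr, bool is_read_own) {
        if (outstanding_.count(line_addr)) return false;
        outstanding_.insert(line_addr);
        for (Snooper* s : snoopers_)        // broadcast, not point-to-point
            s->snoop(line_addr, is_read_own);
        return true;
    }

    // The (possibly out-of-order) response for the line has arrived.
    void complete(uint64_t line_addr) { outstanding_.erase(line_addr); }

private:
    std::vector<Snooper*> snoopers_;
    std::unordered_set<uint64_t> outstanding_;  // lines with a pending transaction
};
```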

Up until the time this paper was written, we still could not get the snoopy-based system completely functioning. Even after we had resolved all the race conditions that existed in the system, we still had issues with our modifications being incompatible with the original RSIM version.

3 Results and Discussion

The following section describes the results we collected from our available simulators. All results were from running three applications from the Splash-2 benchmark suite, namely fft, radix, and mp3d.

3.1 Simulation results for the directory-on-the-bus system

Both the baseline and the directory-on-the-bus system showed similar behavior in terms of miss/hit rate and number of references for each benchmark. Thus, the simulation results presented here are based on comparisons of directory latency, bus bandwidth, and bus utilization.

Figure 3. Directory latency ratio (directory-on-the-bus / baseline) per node for fft, radix, and mp3d.

Figure 3 shows the ratio of directory latency between the directory-on-the-bus and the baseline system for all 8 processors. Directory latency (number of cycles needed per request) was derived by counting the total number of cycles elapsed from the time a coherence request was sent out to the time when all the coherence replies for that request arrived back at the directory.

Figure 4. Speculated reduction in bus bandwidth utilization, broken down by coherence packet type (copyback-invalidate, copyback, and invalidate) for each benchmark.

Figure 4 shows the speculated reduction in bandwidth utilization when the system architecture is changed from directory-on-the-bus to complete snooping. The values in the table below the figure indicate the actual percentage of bandwidth reduction caused by each type of coherence packet, and the bar graph shows how each type of coherence packet contributes to the overall reduction. We analyzed the total number of coherence packets that would not actually be sent when using the snoopy coherence protocol and calculated the reduction in bandwidth usage by multiplying that count by the coherence packet size. In a directory-based system, the directory had to generate an invalidation packet to each sharer once it received a read own request, while in a snooping-based system a read own request itself also served to invalidate all sharers.

Thus, the bandwidth consumed by all invalidation and invalidation-acknowledgement packets in our directory-on-the-bus system would not be present in a snoop-based system. Copyback and copyback-invalidate packets were sent from the directory to the owner cache in response to read share and read own requests. The directory-on-the-bus and the baseline simulator provided cache-to-cache transfer capability for the MSI coherence protocol, so that if the cache holding the block in M state received a copyback or copyback-invalidate request, it would invoke the cache-to-cache transfer mechanism. In a snoopy-based system, these requests would not be needed; read own and read shared requests would serve the same purpose. In addition, the separate messages that would originally go to the directory (the acknowledgement for a copyback invalidate and the data transfer for a copyback request) would also not be present. Thus, we could calculate the total bandwidth consumed by the two requests and their acknowledgements and subtract it from our total utilized bandwidth. For radix and mp3d, the major factor in the reduction of bandwidth usage came from the removal of copyback packets, while for fft it came from the removal of invalidation packets.

Figure 5. Percentage of bus utilization, split into contention/arbitration delay and packet transfer, for each benchmark on the directory-on-the-bus (DOB) and baseline systems.
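The estimate behind Figure 4 amounts to the simple calculation sketched below: the counts of coherence packets that a snooping protocol would never send are multiplied by the packet size and divided by the run time. The counts would come from simulator statistics; the structure, names, and numbers used here are illustrative only.

```cpp
#include <cstdint>
#include <cstdio>

// Packets a snooping protocol would not send, taken from simulator statistics.
struct CoherenceCounts {
    uint64_t invalidations;
    uint64_t invalidation_acks;
    uint64_t copybacks_and_replies;
    uint64_t copyback_invls_and_acks;
};

// Bandwidth saved (GB/s) = eliminated packets * packet size / run time.
double bandwidth_saved_gbs(const CoherenceCounts& c, uint64_t packet_bytes,
                           double run_time_s) {
    const uint64_t packets = c.invalidations + c.invalidation_acks +
                             c.copybacks_and_replies + c.copyback_invls_and_acks;
    return packets * packet_bytes / run_time_s / 1e9;
}

int main() {
    CoherenceCounts c{120000, 120000, 80000, 30000};   // hypothetical counts
    printf("saved %.2f GB/s\n", bandwidth_saved_gbs(c, 64, 0.01));
    return 0;
}
```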

Figure 5 shows the percentage of bus utilization of the directory-on-the-bus system compared to the average bus utilization among the 8 buses of the baseline system. Bus utilization was derived from two parts: utilization due to contention and arbitration delay, and utilization due to the actual packet transfer. In the figure, <Application_name>_DOB indicates that the application was simulated on the directory-on-the-bus architecture, while <Application_name>_Base indicates that the application was simulated on the baseline architecture. In a snoop-based system, we expected to see a reduction in bus utilization and contention because of the reduced number of coherence packets sent out for a given memory operation.

From the results, the directory-on-the-bus system had higher directory latency, percentage of bus utilization, and contention on the bus than the baseline system. The higher directory latency was caused by contention for the single bus, and by the fact that it was not possible for the directory to issue multiple coherence packets to different nodes at the same time.

One very curious result that we obtained was the total execution time for the benchmarks. When we compared the total execution time between the baseline and the directory-on-the-bus system, we had a surprising result. For all benchmarks, the directory-on-the-bus system actually took slightly less time, in terms of total number of simulation cycles, to complete the benchmarks. This was unexpected, as the directory latency numbers leaned in favor of the baseline system performing better than the modified system. Our guess was that even though directory latency was better in the baseline system, the time taken for a request from the cache to reach the directory was actually better in the modified system by virtue of having fewer hops. Thus, with applications that accessed remotely stored data frequently, the directory-on-the-bus system would perform better than the baseline simulator.
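As a small clarification of the metric in Figure 5, the utilization percentage simply combines the two components named above. A minimal sketch, assuming the simulator reports per-bus cycle counts (the function name and parameters are ours):

```cpp
#include <cstdint>

// Bus utilization as used in Figure 5: the fraction of simulated cycles the
// bus spent either contended/arbitrating or transferring packets.
double bus_utilization_pct(uint64_t contention_and_arbitration_cycles,
                           uint64_t packet_transfer_cycles,
                           uint64_t total_cycles) {
    return 100.0 * (contention_and_arbitration_cycles + packet_transfer_cycles)
                 / static_cast<double>(total_cycles);
}
```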

3.2 Discussion

The system used in our project had a maximum on-chip bandwidth of 9.6 GB/s. Among all the benchmarks that we ran, mp3d consumed the most bandwidth, at 4.6 GB/s in the directory-on-the-bus system. The bandwidth usage of a snoopy-based system could be expected to be lower than that, as portrayed in figures 4 and 5. For our benchmarks, we could see that communication bandwidth between the different processors was not a limiting factor. The main bottleneck would still be the off-chip bandwidth to main memory, and thus, when it came down to the choice of coherence protocol, both directory and snooping would perform equally well.

We were not able to measure the latency of memory accesses in the snoopy-based system. However, as snooping required fewer hops from the time a request was sent out to the time the reply was received, memory operations in a snooping-based system should incur lower latency than those of a similarly configured directory-based system.

However, it should be noted that a directory-based CMP system had several advantages. If most memory operations were to go to main memory, then access latency in a directory-based system was not much worse than in a snooping-based system, since the access time to off-chip memory modules would dominate. Another advantage would be the ease of integrating multiple CMPs to form bigger multiprocessors. With directory as the main coherence protocol, as more CMP modules got connected, no extra complexity would be needed to support cache coherence; the existing mechanism would suffice.

4 Related Work

There are currently several proposed CMP systems. Piranha [4], the brainchild of Compaq Research, was optimized for commercial workloads. Piranha provided a crossbar with a 32 GB/s maximum bandwidth to connect the eight simple, in-order uniprocessors that resided on the same die. It used a directory-based coherence protocol, and it utilized one big shared L2 cache which did not maintain inclusion with its L1 counterparts. Compared to Piranha, our system had lower on-chip bandwidth, and each processor still had its own L1 and L2 caches. Another proposed system was the IBM Power4. Power4 [5] was also targeted at the high-reliability server market. Each multichip module (MCM) had 2 uniprocessors with their own L1 caches, a shared L2 cache, and the tags for an off-chip L3 cache. The on-chip bandwidth between the superscalar cores and the L2 cache was more than 100 GB/s. Finally, from the academic projects, there was the Stanford Hydra CMP [6]. The Hydra had 4 processors on a single chip and a write-through bus that connected the L1 caches with the L2 cache, the main memory interface, and the I/O bus interface.

5 Conclusions and Future Work

At the outset of the project, we wanted to compare the effect of CMP on the choice of directory- vs. snooping-based protocols. We did not manage to achieve that goal, as our snooping simulator did not quite work. However, we did obtain some measure of the expected bandwidth usage for the snooping system from our directory data. From there, we could conclude that even though a snooping-based protocol on a CMP would save a lot of bandwidth, bandwidth was actually not an issue. With the high availability of on-chip bandwidth, intra-chip communication would not be a bottleneck. Instead, more consideration should be given to how the CMP would fit into the larger picture, i.e., which protocol would be feasible if we were to connect more than one CMP together.

An extension to this project would be to look at how multiple connected CMPs would perform under different coherence protocols. As CMPs were designed for high-reliability server markets, we would probably want more than just 2, 4, or 8 processors to run the server. Thus, the performance of the system could be greatly affected by how the CMPs communicated with each other.

References

[1] A. Charlesworth. Starfire: Extending the SMP Envelope. IEEE Micro, Vol. 18, No. 1, pages 39-49, Jan/Feb 1998.
[2] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997.
[3] V. S. Pai, P. Ranganathan, and S. V. Adve. RSIM Reference Manual, Version 1.0. Rice University.
[4] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.
[5] K. Diefendorff. Power4 Focuses on Memory Bandwidth. Microprocessor Report, Vol. 13, No. 13, Oct 1999.
[6] L. Hammond, B. Hubbert, M. Siu, M. Prabhu, M. Chen, and K. Olukotun. The Stanford Hydra CMP. IEEE Micro, Mar/Apr 1999.
