A Study of the Efficiency of Shared Attraction Memories in Cluster-Based COMA Multiprocessors


Anders Landin and Mattias Karlgren
Swedish Institute of Computer Science
Box 1263, Kista, Sweden

Abstract

The performance of a COMA multiprocessor depends greatly on the efficiency of its large node caches, the attraction memories. When more than one processor shares an attraction memory, its behavior changes. From experiments with program-driven simulation we have found that clustering may improve the performance of the attraction memory significantly: traffic is reduced, and the miss rates are lower for shared attraction memories. However, clustering may introduce contention for the attraction memory that can ruin any potential performance gain from the increased attraction memory hit rate. Provided enough local bandwidth, application execution can remain efficient at higher memory pressure in clustered systems than in systems with single-processor nodes. At very high memory pressure some applications change behavior and start to suffer from clustering. This is caused by conflict misses due to the relatively lower associativity of the shared attraction memory.

1. Introduction

It is popular to build shared-memory multiprocessors using nodes with a few processors on a shared bus as building blocks. This is attractive since modules with two or four high-end microprocessors are readily available on the market and are often used in workstations or high-end PCs. Clustering a couple of processors together in each node is also a way to amortize the overhead of the node controller and network interface over more processors, resulting in a machine with lower overall overhead than one with a single processor per node. In addition, the network becomes simpler, since the number of nodes is reduced for a machine with a given number of processors. For the same reason, clustering may also reduce the overhead of managing directory information.
The aim of this study is to understand how the efficiency of the attraction memory in a cache-only memory architecture (COMA) multiprocessor is affected when it is shared by several processors. As the study will show, there are factors that may both improve and degrade the behavior of shared attraction memories. Although sharing attraction memories between processors was proposed long ago [6], this is the first study that actually addresses the efficiency effects of the sharing. Program-driven simulation of the programs in the SPLASH-2 application suite [17] is used to drive a simulation model of a cluster-based memory system. The performance of shared attraction memories is measured for two- and four-fold clustering at varying degrees of memory overhead. There is also an opportunity to compare our observations of sharing in COMAs with a few recent papers that address cache sharing. Clustering with cache sharing has been studied by Erlichson, Nayfeh et al. in [3, 15] and by Bennett et al. in [1]. Part of the results in [3] actually apply to COMAs (with infinite caches), and the results they report are consistent with our observations. However, the previous studies focus on UMAs and NUMAs and miss interesting properties of clustering in COMAs that this paper and [10] point out.

1.1. Organization

The rest of the paper is organized as follows. Section 2 briefly presents the key issues for COMA architectures and the behavior of the attraction memories; the expected benefits and drawbacks of clustering are also briefly discussed. In section 3 the experimental methodology is described. The results are presented and discussed in section 4. Related work is covered in section 5. Finally, the paper is concluded in section 6.

[Figure 1. A typical clustered node with four processors sharing an attraction memory: four processors with second-level caches (SLCs) connect to a node controller with attraction memory state and tag storage, to the DRAM part of the attraction memory, and to the interconnect.]

2. COMA Background

In COMA multiprocessors [6, 5], all memory is used to form large node caches called attraction memories. Since the entire memory is cache, data can migrate dynamically and live close to the processor that currently accesses it. It is also possible to replicate data that is shared among several processors. In a cluster-based COMA, the processors in a cluster share the same attraction memory. A sketch of a node with a shared attraction memory is shown in figure 1, where four processors with private lower-level caches share the attraction memory.

In contrast to conventional caches, attraction memories have to treat replacements with special care. Since there is no backup main memory, an evicted cache line must be sent to another attraction memory to ensure that the data is not lost. The replacement behavior is a key factor in the trade-off between performance and memory utilization in COMA systems [8].

An important parameter for an execution on a COMA machine is the memory pressure [16]: the relation between the size of the working set of the application and the aggregate size of the attraction memories. The memory pressure (MP) is defined as:

    MP = application working set size / total attraction memory size

At high memory pressure, most of the attraction memories are filled with unique data; at low memory pressure there is much cache space left for replication. The memory pressure can be adjusted dynamically by the operating system by paging data in and out between the attraction memories and disk. By selecting the physical addresses for new pages, the OS can set different memory pressures for different regions of the attraction memories.
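The memory pressure relation above can be illustrated with a short sketch (the sizes used here are hypothetical, chosen only to show the calculation, not values from the paper's experiments):

```python
def memory_pressure(working_set_bytes, am_size_per_node_bytes, num_nodes):
    """MP = application working set size / total attraction memory size."""
    total_am = am_size_per_node_bytes * num_nodes
    return working_set_bytes / total_am

# Hypothetical example: a 50 MB working set on 16 nodes with 6.25 MB
# of attraction memory each gives a memory pressure of 50%.
mp = memory_pressure(50 * 2**20, 6.25 * 2**20, 16)
print(f"{mp:.0%}")  # 50%
```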
To reduce memory overhead and/or to execute efficiently even when the application working set fills most of the available memory, it is desirable to run efficiently at relatively high memory pressures. However, at high memory pressure replication is very limited, which leads to frequent replacement of data. In a UMA or NUMA machine, replacement results in increased traffic and, of course, in cache misses when the evicted data is accessed again. In a COMA, the effects of replacement may be even worse. When the last copy of a datum is replaced, it must be inserted into another attraction memory to prevent it from being lost. This may cause invalidation of a shared cache line in the receiving node; thus the hit rate may decrease in both nodes. At lower memory pressure, it is likely that an invalid cache line is found or that the space of a shared cache line can be reused, but at high memory pressure very few cache lines are shared, and replacements of exclusive entries, which have to be relocated to other nodes, are frequent. This leads to significant increases in communication and impacts on execution time, removing much of the potential performance benefit offered by COMA over NUMA and UMA systems. The effects of this have been studied in [11, 2].

2.1. Shared attraction memories

Depending on the behavior of the application program, we can expect either improvements or degradations of efficiency as the number of processors that share an attraction memory grows. As stated above, this has been covered for normal caches in [3, 1]. We can expect that clustering improves performance for coherence misses, since a writing and a subsequently reading processor may be located in the same node. It may also reduce the coherence and cold miss rates: if two reading processors belong to the same node, the first processor may prefetch data for the second.
Furthermore, the space needed for replication of data can be shared between several processors, provided that they share the same data, thus reducing the need for extra replication space. This is valuable at higher memory pressures, when replication space is scarce. On the other hand, we could expect a degradation of performance if, for example, one of the processors in a cluster makes accesses with very poor locality, causing replacement of data used by the other processors in the cluster. Naturally, we can also expect a higher demand for bandwidth within each cluster as the number of processors in the cluster is increased. This may lead to performance degradation due to contention for the resources in the node.

3. Experimental methodology

The experiments have been done using the SimICS simulation system [13, 14]. SimICS is an efficient instruction

set processor simulator that simulates multiple processors executing the SPARC V8 instruction set. Memory accesses performed by the processors are fed to a memory system simulator which models the memory architecture. In the results presented below, the simulations model both shared and private data accesses, but not the instruction fetches made by the processors. Instructions are instead assumed to always hit in the cache. This is a reasonable assumption for the applications modeled, since their instruction footprints are all relatively small. All ordinary data accesses as well as synchronization accesses have been modeled. Data pages are allocated consecutively on demand, as they are accessed by the processors. Allocation of a page is done instantly, without any delay for the processor. No instruction pages are allocated.

3.1. COMA architecture model

When evaluating different degrees of clustering, we assume that the amount of attraction memory per processor is constant. That means that the attraction memory in a node with two processors is twice the size of an attraction memory in a one-processor node, etc. The number of processors in the machine is held constant in all the simulations. The architecture modeled is the Bus-based COMA [11], which is a COMA with snooping attraction memories. Coherence is maintained with an invalidation-based coherence protocol with four states per cache line (Exclusive, Owner, Shared and Invalid). It uses an accept-based replacement strategy: upon replacement of a cache line in state Exclusive or Owner, a snooping-based mechanism is used to find a receiving node that can store the replaced cache line without causing further avalanching replacements. When choosing which local line to replace, entries in state Shared are prioritized over entries in the Owner and Exclusive states.
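The victim-selection priority just described can be sketched as follows (our illustration of the stated priority order, not the simulator's code; the numeric cost ranking is an assumption used only to encode the preference):

```python
# Hypothetical eviction-cost ranking: Invalid slots are free, Shared
# copies can simply be dropped, while Owner/Exclusive lines hold data
# that must be relocated to another attraction memory.
PRIORITY = {"Invalid": 0, "Shared": 1, "Owner": 2, "Exclusive": 2}

def choose_victim(set_states):
    """Pick the index of the line in a set that is cheapest to evict."""
    return min(range(len(set_states)), key=lambda i: PRIORITY[set_states[i]])

print(choose_victim(["Exclusive", "Owner", "Shared", "Exclusive"]))  # 2
```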
When choosing a receiver for the replacement, nodes with Invalid entries are prioritized over those with Shared entries.

Throughout this study, we have simulated systems with 16 processors. The system configuration has been varied between 1, 2 and 4 processors per node, yielding 16-, 8- and 4-node systems respectively. The first-level cache has been 4 kbyte direct-mapped for all experiments. Each processor also has its own second-level cache. Since the relation between working set and cache sizes is critical to the results, we have tried to match those of real systems with larger working sets. The size of each second-level cache has been set to 1/128th of the total working set allocated by the application. The attraction memory size has been varied to achieve a memory pressure variation. The memory pressures used have been 6%, 50%, 75%, 81% and 87%. This means that a single copy of the working set entirely fills 1, 8, 12, 13 and 14 of the 16 attraction memories in a 16-node machine. This methodology results in odd cache sizes but has the advantage that it allows the application working set to be constant throughout the experiments. For each memory pressure, the attraction memory size per processor has been held constant; a node with 2 processors thus has an attraction memory twice the size of a single-processor node's. Processes have been assigned to processors in sequential order, so that processes created after each other are likely to belong to the same cluster. Thus, if the application has trivial communication locality, this can be exploited in the clusters. The cache line size has been held at 64 bytes, and the attraction memory has been 4-way set-associative for all experiments, except where explicitly specified.
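The sizing methodology above can be summarized in a small sketch (a hypothetical helper, not code from the study; it only restates the arithmetic: second-level cache is 1/128th of the working set, total attraction memory follows from the target memory pressure, and attraction memory per processor is held constant):

```python
def sizes_for(working_set_mb, mp, processors=16, procs_per_node=1):
    """Return (second-level cache MB per processor, AM MB per node)."""
    slc_mb = working_set_mb / 128           # SLC is 1/128th of the working set
    total_am_mb = working_set_mb / mp       # from MP = working set / total AM
    am_per_proc = total_am_mb / processors  # AM per processor is constant
    return slc_mb, am_per_proc * procs_per_node

# The 50 MB FFT working set at 81% MP (= 13/16), with 4-processor nodes:
slc, am_node = sizes_for(50, 13 / 16, procs_per_node=4)
print(slc, am_node)  # a 4-processor node gets 4x the AM of a 1-processor node
```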
Due to implementation constraints, it is generally not considered realistic to increase the associativity much beyond this in a real system.

3.2. Timing model

The memory system simulator models contention effects for the node controllers, attraction memory DRAMs, second-level caches and the shared bus. The processors are 4-way superscalar and run at 250 MHz, yielding a cycle time of 4 ns. The contention-less access times for reads that hit in the memory hierarchy are: first-level cache, 0 ns; second-level cache, 32 ns; attraction memory, 148 ns, where 24 ns are spent in the node controller/state & tag memory and 100 ns are spent reading data from the DRAM; remote access, 332 ns, of which the shared bus is occupied 2 times 20 ns. A release consistency model [4] with a 10-entry write buffer has been assumed. No pipeline effects or other stalls have been modeled; the processors execute 4 instructions of any kind per cycle but stall on read misses.

3.3. Applications

We have used the programs in the SPLASH-2 application suite [17] to drive the simulator. Table 1 shows the applications used and the corresponding working set sizes. Measurements have been done on the parallel section of the execution, as recommended in the SPLASH-2 codes. Note that for three of the applications (LU, Ocean and Water) SPLASH-2 contains two versions, one original and one with improved locality. We have included both versions in the study since they behave quite differently.

4. Results

We start by looking at the sharing effects in large attraction memories where the memory pressure is low. Then we study what happens when the memory pressure is increased. Following that, we see how this impacts the execution time on the bus-based machine model.

Table 1. Applications and working sets

    Application  Description                                    Problem               Working set (MB)
    Barnes       N-body                                         part.                 3.5
    Cholesky     Sparse matrix factorization                    tk29.o                40.5
    FFT          1-dim. six-step FFT                            1 M data points       50
    FMM          N-body                                         two cluster, part.    29
    LU cont      Blocked LU-fact., enhanced locality            512*
    LU non       Blocked LU-factorization                       512*
    Ocean cont   Ocean movement simul., enhanced locality       258*
    Ocean non    Ocean movement simulation                      258*
    Radiosity    Light distribution                             -room -batch          29
    Radix        Integer sorting                                2 M keys, radix
    Raytrace     Hierarchical ray tracing                       car.env -a1           36
    Volrend      3-D volume rendering                           256*256*126 vx head   22.5
    Water n2     Molecular dyn. N-body, O(n2)                   512 mol.              1
    Water sp     Molecular dyn. N-body, O(n), larger data str.  512 mol.              1.7

[Figure 2. Read node miss rate at low memory pressure for 2- and 4-way clustering, relative to 1-processor node miss rates, for all 14 applications.]

4.1. Coherence misses

A good measure of attraction memory efficiency is the read node miss rate, RNMr: the fraction of all reads the processors perform that result in node misses. To compare the efficiency of 2- and 4-fold clustering, we study the relative RNMr, i.e. the RNMr of a clustered system divided by the RNMr of a non-clustered system. If the memory pressure is sufficiently low, the only misses that occur in a COMA are coherence misses and cold misses. We study the behavior at 6% (= 1/16) memory pressure. Here, the caches are effectively infinite, since the entire working set fits in each attraction memory, and thus no replacements occur. In figure 2 the relative RNMr is shown for clustering with 2 and 4 processors per node. The upper group of bars shows the read node miss rate for 2-processor clusters relative to the RNMr of single-processor nodes; the lower half of the graph shows the corresponding figures for 4-processor clusters.
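The RNMr metric defined above is a simple ratio; a short sketch makes the relative form explicit (the miss counts here are hypothetical, chosen only to illustrate the calculation):

```python
def rnmr(node_misses, total_reads):
    """Read node miss rate: fraction of all reads that miss in the node."""
    return node_misses / total_reads

def relative_rnmr(clustered, non_clustered):
    """RNMr of a clustered system divided by that of a non-clustered one."""
    return clustered / non_clustered

# Hypothetical counts: single-processor nodes miss on 50 of 10,000 reads,
# 4-processor clusters on 31 of 10,000 reads.
rel = relative_rnmr(rnmr(31, 10_000), rnmr(50, 10_000))
print(f"{rel:.0%}")  # 62%, the average reported below for 4-fold clustering
```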
It can be seen that clustering reduces the RNMr significantly for all applications. We can also see that there is a strong correspondence between the 2-fold and 4-fold clustering gains. The average relative read node miss rate is 82% for 2-fold clustering and 62% for 4-fold clustering.

4.2. Increasing the memory pressure

It is not realistic to expect that COMAs will normally run at only 6% memory pressure. It is always desirable to reduce the memory overhead and run applications with working sets close to what fits in the physical memory. The performance implications of this have been studied in [8, 11, 2]. Here we will study how the efficiency of clustering changes as the memory pressure is increased. To understand what happens, we examine the traffic on the global shared bus. Figure 3 shows the traffic for the eight applications where clustering is consistently effective. Ten bars are shown for each application: the first five show the traffic for one-processor nodes at 6%, 50%, 75%, 81% and 87% MP, and the last five show the traffic for 4-processor nodes at the same MPs. Each bar is divided into three segments, representing read, write and replacement traffic. The read traffic reflects the read node misses from above. As expected, no replacements are made at 6% MP, while an increasing memory pressure leads to growing traffic, especially for reads and replacements. As can be seen, the 4-processor nodes consistently show lower global traffic. The situation is slightly different for the remaining six applications. Up to 81% MP the behavior is the same as for the applications above. However, at 87% MP clustering

no longer reduces traffic efficiently. For four of the applications the traffic is even higher in the clustered case.

[Figure 3. Traffic for 1- and 4-processor nodes at 6%, 50%, 75%, 81% and 87% memory pressure, split into read, write and replacement traffic, for Cholesky, FFT, LU non, Ocean cont, Ocean non, Radix, Water n2 and Water sp.]

[Figure 4. Traffic for 1- and 4-processor nodes at 6%, 50%, 75%, 81% and 87% memory pressure, with additional 8-way, 87% MP bars, for Barnes, FMM, LU cont, Radiosity, Raytrace and Volrend.]

Figure 4 shows the traffic for these six applications. The format is the same as above, with the addition of two bars per application. The new bars show traffic at 87% MP for 1- and 4-fold clustering with an attraction memory with 8-way associativity instead of the default 4-way; they are placed just after the corresponding normal 87% bars. Except for LU cont, the figure shows clearly that the reason for the dramatic traffic increase at high memory pressure for these applications is conflict misses in the attraction memory. While clustering significantly reduces the sensitivity to these misses up until 81% MP for all the applications, at 87% MP clustering can no longer help. Instead, conflict miss traffic is increased for Radiosity, Raytrace and Volrend. For LU cont the AM associativity explains only part of the increase, since the traffic for 8-way AMs is also increased by clustering. It is interesting to note that for single-processor nodes with 4-way associative attraction memories, above 76.5% MP (49/64) there is no longer space to replicate a cache line over all the 16 nodes, while 8-way associativity moves this threshold to 88.2% MP (113/128). With four-processor clusters, the corresponding levels are 81.25% MP (13/16) and 90.6% MP (29/32). This is a likely explanation of the observed effects for conflict misses.
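The thresholds quoted above follow from a counting argument: replicating a line in every node requires a free way in the corresponding set of every node, which is impossible once more than nodes*(assoc-1) of the nodes*assoc ways per set hold unique data. A small sketch of that arithmetic (our reconstruction of the reasoning, not code from the paper):

```python
from fractions import Fraction

def replication_limit(nodes, assoc):
    """First memory pressure at which some node's set must be completely
    full of unique data, so a cache line can no longer be replicated in
    all nodes: one more way than nodes*(assoc-1), out of nodes*assoc."""
    return Fraction(nodes * (assoc - 1) + 1, nodes * assoc)

print(replication_limit(16, 4))  # 49/64   (76.5% MP: 1-processor nodes, 4-way)
print(replication_limit(16, 8))  # 113/128 (88.2% MP: 1-processor nodes, 8-way)
print(replication_limit(4, 4))   # 13/16   (81.25% MP: 4-processor clusters)
print(replication_limit(4, 8))   # 29/32   (90.6% MP: 4-processor clusters, 8-way)
```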
A way to overcome this limitation is to break the inclusion in the cache hierarchy, as studied in [9, 2].

4.3. Performance impact

We choose a memory pressure of 50% as the baseline for the execution time measurements. A lower memory pressure leads to considerable memory overhead while only marginal performance gains can be made (FFT is the most sensitive application, and its execution time is improved by 4.2% when going down to 6% MP). The performance impact of clustering depends on several factors. Most important is attraction memory bandwidth. This is natural, since the aim in a COMA is to maximize the likelihood that accesses can be satisfied within the node. A single processor loads the attraction memory much less than 2 or 4 processors do. It is therefore of prime importance that the nodes are designed to tolerate the increased attraction memory load. In the original configuration, where the DRAM is occupied 100 ns per AM access, 5 of the applications show significant performance degradation for 4-way clustering at 50% memory pressure. If the DRAM bandwidth is doubled (while the latency is held constant), three applications still show a significant performance degradation: LU non (17.8% slower), Radix (12.7%) and Ocean non (5.5%). Six applications are slightly improved, while the remaining five are unaffected.

[Figure 5. Execution time for 1-way clustering at 50 and 81% MP and for 4-way clustering at 81% MP, divided into busy, SLC stall, AM stall and remote stall time, for all 14 applications.]

Figure 5 shows the execution time for the applications on the simulated machine with the doubled DRAM bandwidth. For each application three bars are shown, representing the execution time for single-processor nodes at 50 and 81% MP and for 4-processor nodes at 81% MP. Each bar is divided into four sections. Busy indicates that the processor is executing instructions or performing memory accesses that hit in the first-level cache. SLC stall represents the time spent waiting for accesses that hit in the second-level cache. Similarly, AM stall and Remote stall show the time spent waiting for accesses that hit and miss in the attraction memory. Here the performance impact of clustering is clearly illustrated. Except for LU non and Radix, which suffer most from node contention, the performance at 81% MP is significantly better for all the applications with significant remote stall time.
For Water not much can be done, since it already spends almost all its time inside the node; still, the remote stall time is reduced significantly. We can see that for many of the applications, clustering removes the performance penalty that resulted from the memory pressure increase from 50 to 81%. If the DRAM bandwidth is doubled again and the node controller gets twice the default bandwidth, all applications except the non-optimized LU show similar or better performance with 4-way clustering than with single-processor nodes, even at 50% memory pressure. Similarly, if the global bus bandwidth is halved, clustering becomes even more efficient, since the penalty for remote accesses is increased. The effects of this are largest for Barnes, FFT and LU non. Yet, the overall results are very similar to figure 5.

5. Related work

This work starts off from the observations in [3] regarding basic clustering effects. The evaluation of the performance impact of clustering with infinite caches in [3] is confirmed by our results. However, the conclusions in [3] apply to UMAs and NUMAs, while the real effects for COMAs are not addressed. They conclude that clustering is only effective for small caches (since that is where the effects are largest for caches that are backed up by real memory). This line is carried further in [15]. Our work comes to very different results regarding the effects of clustering in COMAs. In [1], Bennett et al. study a system with many similarities to [3]; again, the architecture addressed is a NUMA-like system. In [8, 9], Joe and Hennessy address the issue of memory overhead due to replication space requirements in COMA architectures. They study increased associativity as a way to reduce the effects of high memory pressure. Memory utilization in COMA has been further studied in [7, 12, 11, 2]. This is the first study to provide understanding of the real behavior of shared COMA attraction memories.
A more detailed analysis of the issues addressed in this paper is given in [10].

6. Conclusions

We have studied the effects of clustering in COMA multiprocessors. Through experiments with program-driven simulation of the applications in the SPLASH-2 suite, we have found that clustering may improve the efficiency of the attraction memories significantly: traffic is reduced, and the miss rates are lower for shared attraction memories. When the memory pressure is low, the impact of the reduced miss rate is marginal for most applications, due to the low communication required for coherence traffic. As the memory pressure is increased, large gains can be made from clustering. The reduced need for replication space means that application execution can remain efficient at higher memory pressure in clustered systems than in systems with single-processor nodes. This results in reduced memory overhead and/or higher performance. A necessary requirement is that the attraction memory has enough bandwidth to efficiently serve the processors in the cluster. Of the 14 applications we studied, all but one showed improved performance for four-way clustering at 81% memory pressure; the remaining application was dominated by intra-node contention. All applications showed significant reductions in traffic due to clustering up to 81% memory pressure. When the memory pressure was increased further, to very high levels, some applications changed behavior and started to suffer from clustering. We found that this was caused by conflict misses due to the relatively lower associativity of the shared attraction memory.

7. Acknowledgments

We wish to thank Peter Magnusson and Bengt Werner for cooperation and help with the SimICS simulator. SICS is sponsored by Ericsson AB, Telia AB, Celsius Tech Systems AB, IBM Svenska AB, Sun Microsystems AB, the Swedish Defense Material Administration (FMV), Swedish State Railroads (SJ) and the National Board for Industrial and Technical Development (Nutek). This work was partly funded by the European Commission under Esprit project SODA.
References

[1] Bennett, J.K., Fletcher, K.E. and Speight, W.E., Classification of Data Accesses to Shared Remote Data Caches in Cluster Multiprocessors, Rice ELEC TR 9502, Rice University.
[2] Dahlgren, F. and Landin, A., Reducing the Replacement Overhead in Bus-Based COMA Multiprocessors, in Proceedings of HPCA-3, IEEE, February 1997.
[3] Erlichson, A., Nayfeh, B.A., Singh, J.P. and Olukotun, K., The Benefits of Clustering in Shared Address Space Multiprocessors: An Applications-Driven Investigation, in Proceedings of Supercomputing '95, 1995.
[4] Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A. and Hennessy, J., Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors, in Proceedings of ISCA-17, ACM, May 1990.
[5] Hagersten, E., Toward Scalable Cache Only Memory Architectures, Ph.D. Thesis, SICS Dissertation Series 08, 1992.
[6] Hagersten, E., Landin, A. and Haridi, S., DDM - A Cache-Only Memory Architecture, IEEE Computer, September 1992.
[7] Jamil, S. and Lee, G., Unallocated Memory Space in COMA Multiprocessors, in Proceedings of the 8th Int'l Conf. on Parallel and Distributed Computing Systems, September 1995.
[8] Joe, T. and Hennessy, J.L., Evaluating the Memory Overhead Required for COMA Architectures, in Proceedings of the 21st Int'l Symposium on Computer Architecture, April 1994.
[9] Joe, T., COMA-F: A Non-Hierarchical Cache-Only Memory Architecture, Ph.D. Dissertation, Stanford University, March.
[10] Karlgren, M., Performance Characterization of Shared Attraction Memories in Cluster-Based COMA Multiprocessors, Master's Thesis, SICS Research Report, February.
[11] Landin, A. and Dahlgren, F., Bus-Based COMA - Reducing Traffic in Shared-Bus Multiprocessors, in Proceedings of HPCA-2, IEEE, February 1996.
[12] Lee, G. and Jamil, S., Memory Block Relocation in Cache-Only Memory Multiprocessors, in Proceedings of the 7th IASTED-ISMM Int'l Conf. on Parallel and Distributed Computing and Systems, October 1995.
[13] Magnusson, P. and Werner, B., Efficient Memory Simulation in SimICS, in Proceedings of the 28th Annual Simulation Symposium, 1995.
[14] Magnusson, P., The SimICS Home Page.
[15] Nayfeh, B.A., Olukotun, K. and Singh, J.P., The Impact of Shared-Cache Clustering in Small-Scale Shared-Memory Multiprocessors, in Proceedings of HPCA-2, IEEE, February 1996.
[16] Saulsbury, A., Wilkinson, T., Landin, A. and Carter, J., An Argument for Simple COMA, in HPCA-95. Presentation slides: ans/simple-coma/hpca-95/hpca-pres.ps
[17] Woo, S.C., Ohara, M., Torrie, E., Singh, J.P. and Gupta, A., The SPLASH-2 Programs: Characterization and Methodological Considerations, in Proceedings of ISCA-22, ACM, June 1995.


More information

Latency Hiding on COMA Multiprocessors

Latency Hiding on COMA Multiprocessors Latency Hiding on COMA Multiprocessors Tarek S. Abdelrahman Department of Electrical and Computer Engineering The University of Toronto Toronto, Ontario, Canada M5S 1A4 Abstract Cache Only Memory Access

More information

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Bushra Ahsan and Mohamed Zahran Dept. of Electrical Engineering City University of New York ahsan bushra@yahoo.com mzahran@ccny.cuny.edu

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 11: Cache Coherence: Part II Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) It

More information

DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor

DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor Kyriakos Stavrou, Paraskevas Evripidou, and Pedro Trancoso Department of Computer Science, University of Cyprus, 75 Kallipoleos Ave., P.O.Box

More information

Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Module 14: Directory-based Cache Coherence Lecture 31: Managing Directory Overhead Directory-based Cache Coherence: Replacement of S blocks Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang*, Josep Torrellas University of Illinois at Urbana-Champaign *Hewlett-Packard

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Lecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Lecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012) Lecture 11: Snooping Cache Coherence: Part II CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 2 due tonight 11:59 PM - Recall 3-late day policy Assignment

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

Computer Architecture

Computer Architecture Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors

More information

Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors

Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors Vijayaraghavan Soundararajan 1, Mark Heinrich 1, Ben Verghese 2, Kourosh Gharachorloo 2, Anoop Gupta 1,3, and John

More information

Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors

Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors IEEE TRANSACTIONS ON COMPUTERS, VOL. 47, NO. 10, OCTOBER 1998 1041 Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors Fredrik Dahlgren, Member, IEEE

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared Memory Multiprocessor

Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared Memory Multiprocessor In Proceedings of the 6th IEEE Symposium on Parallel and Distributed Processing Dallas, October 26-28 1994, pp. 612-619. Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

An Efficient Lock Protocol for Home-based Lazy Release Consistency

An Efficient Lock Protocol for Home-based Lazy Release Consistency An Efficient Lock Protocol for Home-based Lazy ease Consistency Hee-Chul Yun Sang-Kwon Lee Joonwon Lee Seungryoul Maeng Computer Architecture Laboratory Korea Advanced Institute of Science and Technology

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Combined Performance Gains of Simple Cache Protocol Extensions

Combined Performance Gains of Simple Cache Protocol Extensions Combined Performance Gains of Simple Cache Protocol Extensions Fredrik Dahlgren, Michel Duboi~ and Per Stenstrom Department of Computer Engineering Lund University P.O. Box 118, S-221 00 LUND, Sweden *Department

More information

Three Tier Proximity Aware Cache Hierarchy for Multi-core Processors

Three Tier Proximity Aware Cache Hierarchy for Multi-core Processors Three Tier Proximity Aware Cache Hierarchy for Multi-core Processors Akshay Chander, Aravind Narayanan, Madhan R and A.P. Shanti Department of Computer Science & Engineering, College of Engineering Guindy,

More information

Minimizing the Directory Size for Large-scale DSM Multiprocessors. Technical Report

Minimizing the Directory Size for Large-scale DSM Multiprocessors. Technical Report Minimizing the Directory Size for Large-scale DSM Multiprocessors Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building 200 Union Street SE Minneapolis,

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 11

More information

SMD149 - Operating Systems - Multiprocessing

SMD149 - Operating Systems - Multiprocessing SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system

More information

h Coherence Controllers

h Coherence Controllers High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs

More information

Scalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Scalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Scalable Cache Coherence Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels

More information

A Simulation: Improving Throughput and Reducing PCI Bus Traffic by. Caching Server Requests using a Network Processor with Memory

A Simulation: Improving Throughput and Reducing PCI Bus Traffic by. Caching Server Requests using a Network Processor with Memory Shawn Koch Mark Doughty ELEC 525 4/23/02 A Simulation: Improving Throughput and Reducing PCI Bus Traffic by Caching Server Requests using a Network Processor with Memory 1 Motivation and Concept The goal

More information

Multiple Context Processors. Motivation. Coping with Latency. Architectural and Implementation. Multiple-Context Processors.

Multiple Context Processors. Motivation. Coping with Latency. Architectural and Implementation. Multiple-Context Processors. Architectural and Implementation Tradeoffs for Multiple-Context Processors Coping with Latency Two-step approach to managing latency First, reduce latency coherent caches locality optimizations pipeline

More information

Cache Coherence in Scalable Machines

Cache Coherence in Scalable Machines ache oherence in Scalable Machines SE 661 arallel and Vector Architectures rof. Muhamed Mudawar omputer Engineering Department King Fahd University of etroleum and Minerals Generic Scalable Multiprocessor

More information

Evaluation of memory latency in cluster-based cachecoherent multiprocessor systems with di erent interconnection topologies

Evaluation of memory latency in cluster-based cachecoherent multiprocessor systems with di erent interconnection topologies Computers and Electrical Engineering 26 (2000) 207±220 www.elsevier.com/locate/compeleceng Evaluation of memory latency in cluster-based cachecoherent multiprocessor systems with di erent interconnection

More information

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

Portland State University ECE 588/688. Cache-Only Memory Architectures

Portland State University ECE 588/688. Cache-Only Memory Architectures Portland State University ECE 588/688 Cache-Only Memory Architectures Copyright by Alaa Alameldeen and Haitham Akkary 2018 Non-Uniform Memory Access (NUMA) Architectures Physical address space is statically

More information

Exploiting Spatial Store Locality through Permission Caching in Software DSMs

Exploiting Spatial Store Locality through Permission Caching in Software DSMs Exploiting Spatial Store Locality through Permission Caching in Software DSMs Håkan Zeffer, Zoran Radović, Oskar Grenholm, and Erik Hagersten Uppsala University, Dept. of Information Technology, P.O. Box

More information

Effect of Data Prefetching on Chip MultiProcessor

Effect of Data Prefetching on Chip MultiProcessor THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS TECHNICAL REPORT OF IEICE. 819-0395 744 819-0395 744 E-mail: {fukumoto,mihara}@c.csce.kyushu-u.ac.jp, {inoue,murakami}@i.kyushu-u.ac.jp

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

A Multiprocessor system generally means that more than one instruction stream is being executed in parallel.

A Multiprocessor system generally means that more than one instruction stream is being executed in parallel. Multiprocessor Systems A Multiprocessor system generally means that more than one instruction stream is being executed in parallel. However, Flynn s SIMD machine classification, also called an array processor,

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Lecture 7: Implementing Cache Coherence. Topics: implementation details Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Agenda Introduction Memory Hierarchy Design CPU Speed vs.

More information

An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language

An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language Martin C. Rinard (martin@cs.ucsb.edu) Department of Computer Science University

More information

Directories vs. Snooping. in Chip-Multiprocessor

Directories vs. Snooping. in Chip-Multiprocessor Directories vs. Snooping in Chip-Multiprocessor Karlen Lie soph@cs.wisc.edu Saengrawee Pratoomtong pratoomt@cae.wisc.edu University of Wisconsin Madison Computer Sciences Department 1210 West Dayton Street

More information

Scalable Cache Coherence

Scalable Cache Coherence arallel Computing Scalable Cache Coherence Hwansoo Han Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels of caches on a processor Large scale multiprocessors with hierarchy

More information

Cache Injection on Bus Based Multiprocessors

Cache Injection on Bus Based Multiprocessors Cache Injection on Bus Based Multiprocessors Aleksandar Milenkovic, Veljko Milutinovic School of Electrical Engineering, University of Belgrade E-mail: {emilenka,vm@etf.bg.ac.yu, Http: {galeb.etf.bg.ac.yu/~vm,

More information

2. Futile Stall HTM HTM HTM. Transactional Memory: TM [1] TM. HTM exponential backoff. magic waiting HTM. futile stall. Hardware Transactional Memory:

2. Futile Stall HTM HTM HTM. Transactional Memory: TM [1] TM. HTM exponential backoff. magic waiting HTM. futile stall. Hardware Transactional Memory: 1 1 1 1 1,a) 1 HTM 2 2 LogTM 72.2% 28.4% 1. Transactional Memory: TM [1] TM Hardware Transactional Memory: 1 Nagoya Institute of Technology, Nagoya, Aichi, 466-8555, Japan a) tsumura@nitech.ac.jp HTM HTM

More information

Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: Miss Rates for Snooping Protocol 4th C: Coherency Misses More processors:

More information

EE382 Processor Design. Processor Issues for MP

EE382 Processor Design. Processor Issues for MP EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors, Part I EE 382 Processor Design Winter 98/99 Michael Flynn 1 Processor Issues for MP Initialization Interrupts Virtual Memory TLB Coherency

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 12: Advanced Caching Prof. Onur Mutlu Carnegie Mellon University Announcements Chuck Thacker (Microsoft Research) Seminar Tomorrow RARE: Rethinking Architectural

More information

An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors

An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors Myoung Kwon Tcheun, Hyunsoo Yoon, Seung Ryoul Maeng Department of Computer Science, CAR Korea Advanced nstitute of Science and

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Performance of coherence protocols

Performance of coherence protocols Performance of coherence protocols Cache misses have traditionally been classified into four categories: Cold misses (or compulsory misses ) occur the first time that a block is referenced. Conflict misses

More information

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence SMP Review Multiprocessors Today s topics: SMP cache coherence general cache coherence issues snooping protocols Improved interaction lots of questions warning I m going to wait for answers granted it

More information

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck

More information

Scalability of the RAMpage Memory Hierarchy

Scalability of the RAMpage Memory Hierarchy Scalability of the RAMpage Memory Hierarchy Philip Machanick Department of Computer Science, University of the Witwatersrand, philip@cs.wits.ac.za Abstract The RAMpage hierarchy is an alternative to the

More information

Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy

Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy Bushra Ahsan Electrical Engineering Department City University of New York bahsan@gc.cuny.edu Mohamed Zahran Electrical Engineering

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors

Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors Maged M. Michael y, Ashwini K. Nanda z, Beng-Hong Lim z, and Michael L. Scott y y University of Rochester z IBM Research Department

More information

Lecture 12: Hardware/Software Trade-Offs. Topics: COMA, Software Virtual Memory

Lecture 12: Hardware/Software Trade-Offs. Topics: COMA, Software Virtual Memory Lecture 12: Hardware/Software Trade-Offs Topics: COMA, Software Virtual Memory 1 Capacity Limitations P P P P B1 C C B1 C C Mem Coherence Monitor Mem Coherence Monitor B2 In a Sequent NUMA-Q design above,

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

Scientific Applications. Chao Sun
