Using Proxies to Reduce Controller Contention in Large Shared-Memory Multiprocessors

Andrew J. Bennett, Paul H. J. Kelly, Jacob G. Refstrup, Sarah A. M. Talbot
Department of Computing, Imperial College of Science, Technology and Medicine, London SW7 2BZ

Abstract. Some shared-memory applications have execution times linear in the number of processors due to unfortunate allocation of the home and ownership of cache lines. We present a modified coherency protocol which avoids this effect. Read requests are routed via "proxies", randomly-selected intermediate nodes. We present results from execution-driven simulations of a cc-NUMA architecture which show that proxying can yield a large speedup in cases where read contention is extreme, while only causing small slowdowns in other benchmarks. We investigate how many proxies should be used and what effect the scheme has on traffic levels and queuing of requests at node controllers.

1 Introduction

Coherent-cache shared-memory multiprocessors can suffer from disastrous contention effects, especially in large configurations [2]. In some cases, an apparently-parallel application can show execution time proportional to the number of processors used. In this paper we study one potential cause for such behaviour, establish the scale of the effect, and investigate remedies.

Each processor's memory and cache is managed by a "node controller". In addition to local memory references, the controller must handle requests arriving via the network from other processors. These requests concern cache lines owned by this cache (reads, ownership requests), lines of which a copy is held in this cache (invalidations and replacements), and lines whose "home" is this node, i.e. this node holds directory information about the line. It is obviously important that controllers can handle requests at a high rate. This is exacerbated in large configurations where unfortunate ownership migration or home allocation can lead to concentrations of requests at particular nodes.

An interesting alternative is to distribute the workload to other node controllers, essentially using them to act as "proxies" for read requests. When a processor makes a read request, instead of going directly to the cache line's home, we route it first to another node. If the proxy node has the line, it replies directly. If not, it requests the value from the home itself, allocates it in its own cache, and replies. We present results from simulation experiments which evaluate this idea.
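As a minimal sketch of this routing decision (illustrative only: the node count, the proxy-set size, the way the proxy set is derived from the line address, and the page_is_proxied predicate are assumptions, not details taken from the protocol described here), a client might choose the target of a read request as follows:

```python
import random

NODES = 64           # assumed machine size
PROXY_SET_SIZE = 2   # assumed number of proxies per line
LINE_BYTES = 64      # cache-line size used throughout the paper

def proxy_set(line):
    """Nodes acting as proxies for this line, derived from the line address
    so that every requester agrees on the same set (an assumption here)."""
    base = line % NODES
    return [(base + i) % NODES for i in range(PROXY_SET_SIZE)]

def read_request_target(addr, home_node, page_is_proxied):
    """Pick the node a read request is sent to: the home node under the
    basic protocol, or a randomly chosen proxy when the page is proxied."""
    if not page_is_proxied(addr):
        return home_node
    return random.choice(proxy_set(addr // LINE_BYTES))

# Example: route a read for address 0x1040 whose home is node 5,
# with proxying enabled for every page.
print(read_request_target(0x1040, 5, lambda addr: True))
```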

2 Contention in Shared-Memory Multiprocessors

Each node consists of a processor with an integral first-level cache (FLC), a large second-level cache (SLC), some DRAM and a node controller. The SLC, DRAM and controller are interconnected by two decoupled buses. The controller sends messages to, and receives messages from, the network and the processor. We model a cc-NUMA architecture with an invalidation-based coherency protocol which maintains sequential consistency. The identities of nodes which have cached a particular line are maintained using distributed singly-linked lists, using a protocol similar to that outlined in [3]. Each cache line has a "home" node associated with it (at the granularity of a page) which:

- Either holds a valid copy of the line (in SLC, DRAM, or both), or knows the identity of a node which does,
- Has pre-allocated space in DRAM to which the final replacement of the line from cache can take place, and
- Holds directory information for the line (head and state of the sharing list) in DRAM.

When a memory reference cannot be completed on a client node, a request is sent to the home node. When the read request is serviced at the home node, the controller performs a lookup in DRAM to determine the state of the line. Assuming that the state is exclusive, the SLC bus is acquired and a lookup is done in the SLC. If the data is not present, the SLC bus is released, the data is read from DRAM, directory information is updated in DRAM, the MEM bus is released, and the reply message is dispatched. Note that the processor is prevented from accessing the SLC for part of the transaction. In addition, whilst the request is being serviced, other requests may arrive which cannot be serviced until the controller has finished this transaction.

Contention of this form can occur for either homes or owners: ownership of many cache lines with different homes may become concentrated in a single cache because of the application's write behaviour. Conversely, directory traffic for a large set of lines, whose ownership is dispersed, may be concentrated in a single home node due to home allocation.

2.1 The Impact of Node Controller Contention

The severity of controller contention is both application and architecture dependent. Some contention is inevitable and will result in the latency of transactions being elongated. The communications access pattern is non-uniform primarily because of the way homes and ownership are allocated. It is the non-uniform distribution of requests made by the application which causes the variation in contention over the execution time of the program. The characteristics of the architecture determine how effectively the non-uniform distribution of requests can be resolved. If the network is relatively fast and controller occupancy high, requests can arrive at a controller at such a high rate that contention will occur.
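A rough illustration of how long the controller is occupied by a single read serviced at the home (a sketch only, using the latencies later listed in Table 1; the exact ordering of bus operations and the cost of the directory update are approximated, so this is not the simulator's accounting):

```python
# Latencies in 10 ns cycles, taken from Table 1 (Section 4).
LAT = {
    "acq_slc_bus": 2, "rel_slc_bus": 1, "slc_lookup": 6, "slc_line": 18,
    "acq_mem_bus": 3, "rel_mem_bus": 2, "dram_lookup": 20, "dram_line": 24,
    "send_msg": 5,
}

def home_read_occupancy(data_in_slc):
    """Approximate cycles the node controller is busy servicing one read at
    the home, following the sequence described above (directory-update cost
    folded into the DRAM accesses)."""
    cycles = LAT["acq_mem_bus"] + LAT["dram_lookup"]     # directory state lookup
    cycles += LAT["acq_slc_bus"] + LAT["slc_lookup"]     # probe the SLC
    if data_in_slc:
        cycles += LAT["slc_line"]                        # line supplied by the SLC
    else:
        cycles += LAT["dram_line"]                       # fall back to DRAM
    cycles += LAT["rel_slc_bus"] + LAT["rel_mem_bus"] + LAT["send_msg"]
    return cycles

print(home_read_occupancy(True), home_read_occupancy(False))   # prints: 57 63
```

While those tens of cycles elapse, further requests queuing at the same controller cannot be serviced, which is the effect proxying is designed to relieve.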

3 Proxies

Proxies form the basis of a technique designed to reduce the queuing of requests at controllers. Proxies also achieve combining: if multiple requests for the same cache line are sent to the same proxy, only the first requires a request to be made to the home. When the data is supplied to the proxy, it sends it to all the clients which are waiting for it. This is done by using proxy cache entries to form "pending chains", distributed lists of the nodes which have requested a particular line. When the proxy receives the line, it is added to its SLC and sent down the pending list of clients which have requested the data.

3.1 Selective Use of Proxying

The overhead of resolving read requests via proxies can be two extra messages, and therefore, to avoid unnecessary overheads, only data structures which are contended for should be proxied. We control this on a page-by-page basis, so that proxying is enabled only for selected memory regions. We determine which data structures should be proxied by analysing traces from simulations using a trace analysis tool (see our companion paper in these proceedings) to determine which lines and pages are widely shared, and may therefore benefit from proxying. To evaluate the cost of proxying in circumstances when it is not beneficial, we have simulated runs in which all shared data is proxied.

3.2 Selecting the Proxy and the Proxy Set

The requesting processor can select any member of the proxy set, i.e. the nodes which act as proxies for the requested line. In a clustered system it would make sense to allocate a proxy in each cluster. Alternatively, in some network designs it may be possible to choose a proxy to avoid congestion. Perhaps the most interesting possibility is to choose the proxy at random each time. This should reduce network contention and balance load evenly between the elements of the proxy set -- this is the policy we simulate here. In general there is a trade-off in the number of proxies to use: too few proxies and there may still be contention, this time for proxies. Too many, and there will be little combining effect: requests from the proxies will cause contention. Although some benchmarks may benefit from a large number of proxies, in less extreme examples a smaller proxy set should give better results because more combining will happen.
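A minimal sketch of the combining behaviour at one proxy node (hypothetical data structures, not the authors' controller implementation; the send_to_home and reply callbacks stand in for the simulated message system):

```python
class ProxyController:
    """Read combining at a proxy: the first request for a line goes to the
    home, later requests for the same line join a pending chain and are
    answered when the data arrives."""

    def __init__(self, node_id, send_to_home):
        self.node_id = node_id
        self.send_to_home = send_to_home   # callback: request a line from its home
        self.cache = {}                    # line -> data the proxy already holds
        self.pending = {}                  # line -> clients waiting for that line

    def read_request(self, line, client, reply):
        if line in self.cache:             # proxy hit: answer directly
            reply(client, self.cache[line])
        elif line in self.pending:         # proxy hit: combine with an earlier request
            self.pending[line].append(client)
        else:                              # proxy miss: one request to the home
            self.pending[line] = [client]
            self.send_to_home(line, self.node_id)

    def data_from_home(self, line, data, reply):
        self.cache[line] = data            # allocate the line in the proxy's cache
        for client in self.pending.pop(line, []):
            reply(client, data)            # forward down the pending chain
```

In the protocol itself the pending chain is held in the proxy's cache entries rather than in a separate table, and eviction or invalidation of proxy copies introduces further cases not shown in this sketch.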

3.3 Potential Costs and Benefits

Among the benefits we should look for are:

- Queuing to gain copies of data served by a single controller, due to unfortunate data migration: contention from this source should be reduced, since the controller managing the data need only service requests from members of the proxy set, while other requests will be handled by proxies.
- Queuing for directory information served by a single node controller, due to unfortunate allocation of "homes": contention from this source should also be reduced.
- Blocking in the interconnection network, due to unfortunate communications patterns: this should be reduced since, although the overall traffic level is somewhat higher, the distribution should be more uniform.

The potential costs are:

- Every load (for addresses subject to proxying) must now go via a proxy, whereas with the basic protocol no indirection would be involved. For example, a simple client-home-client round trip can now take five messages instead of three.
- Cache pollution: allocating space in the cache for proxying may displace another line, and lead to a later cache miss. Extra invalidations are required also.
- Hardware complexity: controllers must be able to deal with multiple outstanding requests (as required for multithreading) and be able to represent the pending chains required for combining.

Our goal in this paper is to study the costs and benefits of the proxy scheme, and to try to quantify the above effects.

4 Simulated Architecture and Benchmarks

The simulated system consists of a set of nodes interconnected by a crossbar network. First-level caches are write-through, direct-mapped, and 4 KB in size. Instruction accesses are assumed to be dealt with by a separate, perfect memory system. Second-level caches are write-back, direct-mapped, and 1 MB in size. The line size is 64 bytes throughout. The clock speed of the system (other than processors) is 100 MHz. Latencies for various operations, expressed in terms of 10 ns clock cycles, are shown in Table 1. We have used a network bandwidth of 160 MB/s. Messages consist of a header (containing a type, and source, destination, requester and home identifiers) and possibly a 64-byte payload.

Table 1. Latencies of the Most Important Node Actions (100 MHz clock)

  Operation               Time (cycles)
  Acquire SLC bus               2
  Release SLC bus               1
  SLC lookup                    6
  SLC line access              18
  Acquire MEM bus               3
  Release MEM bus               2
  DRAM lookup                  20
  DRAM line access             24
  Initiate message send         5

ge is a simple Gaussian elimination program, similar to that used by Bianchini and LeBlanc in their study of proxies [1]. At the end of each iteration a single processor updates a row of the matrix which is designated as the pivot row. Following a barrier, all processors read this row and use it to update a set of rows which they maintain. It is immediately clear that this will cause contention since the cache lines holding the pivot row will all reside in a single cache. The entire 256x256 matrix is annotated so that proxying is used.

Trace analysis did not indicate that any particular data structure used by barnes [4] would cause contention problems, and as a result all shared data has been marked for proxying. Three iterations with particles are used.

fmm [4] was run for a two-cluster Plummer distribution with cost zones partitioning, and the precision set at 1e-6. Trace results of a 32-node system showed that queues of length 31 were occurring for access to elements of the f array which forms part of the G_Memory data structure. The queuing occurs immediately after a barrier, when all the processors read all the elements of the f array. This was the most significant case of read contention detected in fmm, and it scales with the number of processors. It is independent of the number of particles, so simulations were run for a small problem size of 4096 particles and three iterations to reduce simulation time.
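As a back-of-the-envelope illustration of the ge read contention described above (assuming 8-byte matrix elements, which is not stated in the paper), the number of read requests converging on the node that owns the pivot row after each barrier grows linearly with the processor count:

```python
def pivot_row_requests(processors=64, cols=256, elem_bytes=8, line_bytes=64):
    """Reads of the pivot row that arrive at its owning node per iteration:
    every other processor fetches every cache line of the row."""
    lines_in_row = (cols * elem_bytes) // line_bytes
    return (processors - 1) * lines_in_row

print(pivot_row_requests())   # 63 processors x 32 lines = 2016 requests
```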

5 Simulation Results

Our simulations have shown that contention only becomes an important issue when more than a few tens of nodes are used. For this reason the results presented below are from simulations of 64-node machines. The graph showing the variation in execution time with the number of proxies for ge (Figure 1) indicates that using a single proxy results in a reduction in execution time of 35%, but higher numbers of proxies do not produce further reductions. Execution time worsens slightly when more than five proxies are used. Further instrumentation has been used to explain this in the following sections.

5.1 Queuing at Controllers

The total number of cycles for which messages are waiting to be serviced is counted during simulation, and used to determine the relative buffer delay, i.e. the buffer delay observed for a particular number of proxies divided by that observed without proxying. For ge, buffer delay is more than halved when proxying is used. This is a result of the write behaviour of the program, which tends to concentrate ownership of individual lines on particular nodes. When proxying is used, messages are directed to nodes which do not act as home for any shared data, resulting in a more uniform message distribution and reduced queuing time. Interestingly, increasing the number of proxies beyond 1 offers little benefit.
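In symbols (a restatement of the definition above, not notation used in the paper), with W(p) the total cycles messages spend queued at controllers when p proxies are used per location:

```latex
\mathrm{relative\ buffer\ delay}(p) \;=\; \frac{W(p)}{W(0)},
\qquad W(0) = \text{total queuing cycles without proxying.}
```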

[Figure 1. Relative measurements as a function of the number of proxies: four panels plotting relative execution time, relative buffer delay, relative number of messages sent, and proxy hit rate against the number of proxies, for the barnes, fmm and ge benchmarks.]

5.2 Variation in Message Traffic

Proxying is designed to reduce queuing, but does so at the expense of increased network traffic. Counts of the total number of messages sent during simulations were used to produce the graphs in which the variation in the relative number of messages is plotted against the number of proxies. In the case of ge, the overhead is clearly visible: 27% more messages are required when proxies are used. However, since the distribution of these messages is more uniform, the resulting contention is less severe. Note that little variation in message counts is observed as the number of proxies is increased beyond one. Using proxies results in higher network traffic because read requests are routed via the proxy (rather than directly to the home), and sharing lists are longer, resulting in more messages being required for invalidations and replacements.

5.3 Proxy Hit Rate

This graph provides a measure of the effectiveness of proxies at moving network traffic away from (possibly congested) home nodes. We define a proxy hit to be a read request served by a proxy which either returns data directly, or adds the requester to a pending list. A proxy miss requires the data to be fetched from the home node.

The graph for ge shows a constant proxy hit rate because every node accesses each line of the pivot row, and therefore a high rate of combining is achieved.

5.4 fmm and barnes

fmm shows only a minor change in overall execution time when proxies are used, although message queuing is reduced by about 30%. Together, the graphs showing the variation in network traffic and proxy hit rate show that proxying is effective at keeping messages away from congested home nodes. However, the data structure marked in this case is small and accesses to it form a relatively small part of overall execution time.

barnes is an interesting test of proxying since all its shared data structures have been marked for proxying. Despite the overhead that this could introduce, it has a relatively minor adverse effect on execution time. Network traffic is again increased and buffer delay reduced, resulting in only a small increase in load and store delays.

The three benchmark programs exhibit a range of behaviours. Although only ge shows a substantial improvement in execution time, the results have demonstrated that more random message delivery can result in reduced buffering of messages, despite the inevitable increase in network traffic. fmm showed little change in execution time because the marked data structure is only used in a small part of the program. The performance of barnes was not adversely affected, despite all data being proxied. For all programs, one or two proxies tend to realise most of the advantage that can be achieved. This is a positive result, since using more proxies will result in increased cache pollution, invalidations and replacements.

6 Related Work

Holt et al. investigated the effect of varying network latency and occupancy on the performance of similar benchmarks, and concluded that controller occupancy has a large impact on performance [2]. Increasing problem size to maintain parallel efficiency was shown to be ineffective in many cases since it leads to unreasonably large problem sizes. Our results also demonstrate that occupancy is of critical importance. In addition, we have found that proxying can be highly effective at alleviating contention in programs, such as ge, which suffer particularly badly.

Our proxy scheme is similar to Bianchini and LeBlanc's "eager combining" idea [1]. The important difference is that in eager combining, when a dirty value is first read, the value is broadcast to all proxies. In our protocol, only the requesting proxy gets the value. Although we have not simulated eager combining, we appear to get most, if not all, of the advantage of eager combining on ge. The difference matters since eager combining would incur a large overhead on other benchmarks. We have shown that our protocol leads to only modest overheads.

7 Conclusions

This paper has presented the proxy technique, discussed the design and implementation of a proxying cache coherence protocol, and presented some preliminary simulation results. We have shown that proxies benefit some applications immensely, as expected, while for other benchmarks they do not lead to a substantial slowdown.

We believe that the reason proxies do not substantially increase execution time, despite increasing the overall number of messages needed, lies in the effect of proxying on the distribution of network traffic. This needs further analysis, but it would appear that performance is influenced more by anomalous contention effects than by the overall average network traffic level.

Much if not all of the benefit of proxying is realised with just one proxy per location, although there is some evidence that two proxies may be justified for some applications. We believe the reason for this is that severe read contention must involve data structures of considerable size, and the proxied copies are then spread uniformly across the machine.

In summary, we have shown that proxies can be added to a cc-NUMA cache coherence protocol fairly easily, that proxying can improve some applications' performance (a 50% speedup for ge) while appearing to risk only a small slowdown for other applications, and that one proxy per location is likely to be enough.

Acknowledgements. This work was funded by the U.K. Engineering and Physical Sciences Research Council through project GR/J (Combining Randomisation and Mixed-policy Caching for Bounded-contention Shared Memory). Enormous thanks are due to Ashley Saulsbury for allowing us to use his simulator.

References

1. Ricardo Bianchini and Thomas J. LeBlanc. Eager combining: a coherency protocol for increasing effective network and memory bandwidth in shared-memory multiprocessors. In 6th IEEE Symposium on Parallel and Distributed Processing, Dallas, October, pages 204-213.
2. Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and John Hennessy. The effects of latency, occupancy and bandwidth in distributed shared memory multiprocessors. Technical Report CSL-TR-660, Computer Systems Laboratory, Stanford University, January.
3. Andreas Nowatzyk, Gunes Aybay, Michael Browne, Edmund Kelly, Michael Parkin, Bill Radke, and Sanjay Vishin. The S3.mp scalable shared memory multiprocessor. In International Conference on Parallel Processing, Pennsylvania State University, August, volume 1, pages 1-10.
4. Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In 22nd Annual International Symposium on Computer Architecture, in Computer Architecture News, pages 24-36, June.

This article was processed using the LaTeX macro package with LLNCS style.
