
Minimizing the Directory Size for Large-scale DSM Multiprocessors
Technical Report TR

Department of Computer Science and Engineering
University of Minnesota
EECS Building, 200 Union Street SE
Minneapolis, MN, USA

Jinseok Kong, Pen-Chung Yew, and Gyungho Lee
January 09, 1999


Minimizing the Directory Size for Large-Scale DSM Multiprocessors

Jinseok Kong, Pen-Chung Yew
Department of Computer Science
University of Minnesota
Minneapolis, MN, USA

Gyungho Lee
Division of Engineering
University of Texas
San Antonio, TX, USA

Abstract

Directory-based cache coherence schemes are commonly used in large-scale distributed shared-memory (DSM) multiprocessors, but most of them rely on heuristics to avoid large hardware requirements. We propose to use physical address mapping on the memory modules to significantly reduce the directory size needed. This approach allows the size of the directory to grow as O(cn log2 n), as in optimal pointer-based directory schemes [10], where n is the number of nodes in the system and c is the number of cache lines in each cache memory. Our simulations show that the proposed scheme can achieve performance equivalent to the heuristic pointer-based directory schemes, but with a smaller directory size and a simpler pointer management scheme.

Key words: cache coherence, directory protocol, distributed shared memory multiprocessor, memory architecture.

1 Introduction

A major challenge in building a large-scale distributed shared memory (DSM) multiprocessor is to provide the system with a cost-effective cache coherence protocol. Snooping protocols [4, 9, 19] are not feasible on such systems because of their need to perform extensive broadcasts in the network. Directory-based protocols, which support point-to-point communication between nodes, have long been proposed for such systems [5], and have also been implemented in some recent DSM multiprocessors [1, 14, 15, 16, 17]. Although directory-based protocols are more complex, their distributed nature makes them a potential choice even for systems with a large number of processors. However, the poor scalability of the full-map directory size [5] poses a serious impediment to using such protocols in large systems.

Many pointer-based directories [2, 6, 10, 18] have been proposed to reduce the directory size and to make the hardware cost more manageable. Most of the proposed pointer-based schemes rely on heuristics to scale with the increase of the system size. Although the heuristic schemes can reduce the memory overhead to well below the memory size, they still require a large amount of memory compared to the cache memory size [18]. Furthermore, the heuristic schemes can increase the complexity of the directory protocols, which raises the issue of stability for the system.

Another consideration is whether the directory schemes are scalable with the increase of the memory size. Due to the rapid progress in VLSI technology, DRAMs have quadrupled in capacity every three years since the 1980s, which is more than the increase in the system size (e.g., the Cray X-MP had 4 processors around 1985, while the Cray T3E could have 2048 processors around 1997). Hence, even without scaling up the system size significantly, the directory size can still increase significantly if the memory size is substantially increased, as in most recent systems.

In this paper, we propose a physical address mapping scheme on the memory modules to control the number of cache lines which can be cached in each cache memory. More specifically, we use the data interleaving scheme on the memory modules to group all of the cache lines which are mapped to the same cache set, and then assign the group of cache lines to the same memory module. In a direct-mapped cache, only one of the cache lines in this group can reside in a particular cache memory at any instance of time. Using this property, we can reduce the size of the pointer-based directories and allow it to scale well with both the system size and the memory size.

[Figure 1: Directory-based distributed shared memory (DSM) multiprocessor: each node contains a processor with its cache, a memory module with its directory, and a connection to the interconnection network.]

To simplify our presentation, we assume a generic cache-coherent DSM multiprocessor (see Figure 1). Each system node consists of a processor with its own cache memory and a memory module which is a part of the main memory. These system nodes are connected by a point-to-point interconnection network.

The paper is organized as follows. Section 2 provides the necessary background to motivate our study. Section 3 describes our proposed directory organization and its associated protocol; the implementation issues for the directory scheme are also discussed in that section. The simulation results are presented in Section 4, and Section 5 summarizes the results with some conclusions.

2 Background

The storage requirement of a basic full-map directory [5] in each node is proportional to O(mn) for a system with n nodes, where m is the number of cache lines in each memory module. This growth makes the full-map directory very expensive in a large DSM multiprocessor. Since the total cache memory size is much smaller than the total main memory size, and data sharing is often limited to a small number of processors [2, 7, 21], most of the bits in the full-map directory are wasted. One key observation here is that the number of directory bits set at any time is bounded by the total number of cache lines in the cache memories, regardless of how the data is shared.

To take advantage of this observation and to improve the scalability of the directory size, some pointer-based directory schemes have been proposed. For example, the size of the chained directory in each node as proposed in [10] grows as O(c log2 n), where c is the number of cache lines in each cache memory. This is much smaller than the O(mn) of the full-map directory. However, the chained directory significantly increases the protocol complexity and the data access latency because the directory information is distributed over the cache memories as linked lists.

Several other pointer-based directories have also been proposed [2, 6, 18]. The limited-pointer directory scheme [2] tries to reduce the directory size by limiting the number of pointers allowed for each cache line in the main memory. An expensive broadcast scheme is used if excessive data sharing causes the number of pointers to exceed that limit. The LimitLESS scheme [6] reduces the broadcasting overhead by overflowing the pointers into full-map bit vectors. The dynamic pointer allocation scheme [18] maintains a pointer pool of a fixed size for each memory module. A linked list of pointers is maintained for each cache line which has a copy in the cache memories, and each of the pointers in the linked list points to a node which contains a copy of the cache line.

[Figure 2: Some linked lists in a directory using a shared pointer pool, showing the directory entries for cache lines, the head of free list (HFL), and the shared pointer pool.]

Figure 2 shows an example of such linked lists in a node: one linked list records the nodes that have a copy of one cache line, and another records the node that has a copy of a second cache line. When the needed pointers exceed the total number of pointers in the shared pointer pool, some pointers are selected and "victimized" by invalidating their corresponding cache copies to free the pointers for new assignments. The performance can suffer significantly if the pointer pool is too small and pointer overflow occurs too often. Heuristics show that pointer overflow can be rare with a pointer pool size less than 7% of the memory size. However, the number of pointers still needs to be much (8 to 16 times) larger than the number of cache lines in the cache memory [18]. Furthermore, if the possibility of pointer overflow is not zero, the directory protocol has to cope with pointer overflow, which can be critical to the stability of the system. Our proposed scheme tries to avoid most of these problems.

3 The Proposed Scheme

[Figure 3: (a) A general data interleaving scheme on memory modules, showing the address fields: page address (I_p), line address (I_l), byte offset (I_b), index for nodes (I_n), and index for sets (I_s); (b) an example listing all of the cache lines in one memory module which are mapped to the same cache set in the cache memory; (c) every memory module has 4 cache lines mapped to that cache set.]

To simplify our presentation, let us first look at some general data interleaving schemes on the main memory modules and study their effect on the data mapping in the cache memories. Assume the unit of our data interleaving is one page, i.e., we allocate one page to each memory module in a round-robin fashion across all memory modules. Figure 3 (a) shows the assignment of the address fields in a physical byte address using such an interleaving scheme. The address field I_l specifies the cache line number, I_b specifies the byte offset in a cache line, I_p specifies the page number, and I_n specifies the node number, i.e., the home node of a cache line.

To simplify our presentation, let us further assume that the cache is direct-mapped and that its size is the same as the page size. For a direct-mapped cache, the address field I_s is used as the index to the cache line within a cache memory. Using such an interleaving scheme, an entire page (or any portion of a page) can reside in any cache memory.

If we study this data interleaving scheme more carefully, we can observe that 2^(I_l - I_s - I_n) (or, equivalently, 2^(I_p - I_n)) of the cache lines in each memory module are mapped to the same cache line in each cache memory. To see this, in Figure 3 (a), the node number of a particular memory module and the set number of a particular cache line in its cache memory are highlighted as the shaded address fields. For a selected node and a particular cache line in the cache memory, the address bits in those shaded fields are fixed. Only the remaining address bits in the non-shaded address field, whose length is (I_l - I_s - I_n) bits, can vary. Each of those addresses corresponds to a cache line in the selected memory module which is mapped to the same cache line in the cache memory.

Figure 3 (b) shows an example with a 10-bit byte address and a 2-byte cache line. In the figure, the selected memory module (I_n) is 000_2, and a particular cache set (I_s) is selected. The table lists all of the cache lines in the memory module that are mapped to that cache set in the cache memory. Since (I_l - I_s - I_n) has two bits, we have four different cache lines mapped to the same cache set. Figure 3 (c) shows those four cache lines in the selected memory module, and it also shows that every other memory module has four cache lines mapped to the same cache set, for a total of 32 such cache lines since there are 8 memory modules in our example. It is important to note that, at any instance of time, only one of these 4 cache lines in each memory module (32 cache lines in total) can reside in a particular cache memory. Bringing in any of the other cache lines in this group will create a conflict miss in that direct-mapped cache memory.
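To make the field layout concrete, the following C fragment is a minimal sketch (not taken from the paper) that decodes the cache set and home node of a physical address under the Figure 3 (a) layout, using the toy parameters of this example: 10-bit addresses, 2-byte lines, a direct-mapped cache equal to the page size, and 8 memory modules. The constant and function names are our own, chosen only for illustration.

#include <stdio.h>
#include <stdint.h>

/* General (page-interleaved) layout of Figure 3 (a), MSB to LSB:
 * [ upper page bits | I_n | I_s | I_b ].  Field widths follow the
 * small example in the text and are illustrative only.             */
enum { IB_BITS = 1,   /* byte offset within a 2-byte cache line          */
       IS_BITS = 4,   /* set index (cache = page size, direct-mapped)    */
       IN_BITS = 3 }; /* node index: low-order bits of the page number   */

static unsigned set_index(uint32_t addr)   /* I_s */
{
    return (addr >> IB_BITS) & ((1u << IS_BITS) - 1);
}

static unsigned home_node(uint32_t addr)   /* I_n: pages assigned round-robin */
{
    return (addr >> (IB_BITS + IS_BITS)) & ((1u << IN_BITS) - 1);
}

int main(void)
{
    /* Fixing I_n and I_s and letting the remaining 2 line-address bits
     * vary enumerates the 2^(I_l - I_s - I_n) = 4 lines of one module
     * that conflict in the same cache set.                              */
    for (uint32_t a = 0; a < (1u << 10); a += 2)
        if (home_node(a) == 0 && set_index(a) == 5)
            printf("address 0x%03x maps to set 5 of module 0\n", (unsigned)a);
    return 0;
}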

With this important observation, we can calculate the maximum number of pointers needed in the pointer pool of each memory module using the dynamic pointer scheme [18] (described in Section 2) without causing pointer overflow. In each memory module, the I_n field is fixed, only one cache line in the group specified by the (I_l - I_s - I_n) field can reside in any cache memory, and there are 2^(I_s) such groups. Therefore, we need a maximum of 2^(I_n) * 2^(I_s) pointers in each memory module. (If the cache set associativity is a, at most a cache lines in the group can reside in the cache memory, so the maximum number of pointers is a * 2^(I_n) * 2^(I_s).) This worst case happens when all of the cache memories have all of their cache lines from the same memory module. Notice that 2^(I_s) = c is the total number of cache lines in each cache memory, and 2^(I_n) = n is the total number of system nodes. The maximum number of pointers required in the pointer pool without "victimizing" any existing pointer is therefore O(cn) (or O(cn log2 n) in bits) for each memory module, which grows proportionally to the cache size. In contrast, the basic full-map directory scheme grows proportionally to the memory size, O(mn).

[Figure 4: (a) An improved data interleaving scheme on memory modules; (b) an example showing all of the cache lines (32 in each memory module) which are mapped to the same cache set in the cache memory; (c) our proposed data interleaving scheme places all 32 such cache lines in the same memory module.]

To see if we could reduce the number of needed pointers even further, let us change the interleaving scheme as follows. As shown in Figure 4 (a), we now interleave only a portion of a page, specified by the (I_s - I_n) address field, across all of the memory modules. The node number field I_n now occupies the most significant portion of the I_s field (as opposed to being on the outside of the I_s field as in Figure 3 (a)).

In this interleaving scheme, a page is now spread across all memory modules instead of being placed entirely in one memory module. Figure 4 (b) shows the same example as in Figure 3 (b), but now we have 32 cache lines in one memory module that are mapped to the same cache set in the cache memory. In a sense, the new interleaving scheme places all 32 cache lines, which used to be spread across all 8 memory modules as shown in Figure 3 (c), in the same memory module 010_2, as shown in Figure 4 (c). Similar to the earlier argument, only one of these 32 cache lines can reside in that cache set at any instance of time. Bringing in any of the other cache lines in this group will create a cache conflict miss.

Based on this observation, we can again estimate the maximum number of pointers needed in each memory module as follows. The first interesting observation is that, since the node number field I_n is part of the cache index field I_s, all of the cache lines in a particular memory module (in which the I_n field has a fixed value, i.e., its node number) can only occupy a specific portion of the cache memory (the part where the I_n field has the value of its node number). In other words, the addresses in each memory module can no longer be mapped to the entire cache memory as in the previous interleaving scheme, but only to a portion of it. The cache memory is, in effect, divided into equal portions, one for each memory module, as shown in Figure 4 (c). Because of such a reduction in the mappable cache memory space for each memory module, more cache lines from a memory module are mapped to the same cache set. Again, from Figure 4 (a), for a selected memory module and a selected cache set (in which the value of the I_s field is fixed, and I_n is a part of I_s), we have 2^(I_l - I_s) (or 2^(I_p)) cache lines in that memory module being mapped to the same cache set. This is a much larger group than the group in Figure 3 (a), but the number of groups in each memory module decreases from 2^(I_s) to 2^(I_s - I_n). The maximum number of pointers needed without having to "victimize" any pointer in each memory module is now 2^(I_n) * 2^(I_s - I_n) = 2^(I_s) = c, i.e., it is proportional to the cache size c only (as opposed to cn in the previous case). (If the set associativity is a, the maximum number of pointers is a * 2^(I_n) * 2^(I_s - I_n) = a * 2^(I_s) = c.)
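Under the proposed layout, only the position of the node field changes: I_n becomes the most significant part of I_s. The sketch below, again using the illustrative toy parameters and names of our own choosing, shows the modified home-node computation and the resulting per-module pointer bound of 2^(I_s) = c.

#include <stdint.h>

/* Proposed layout of Figure 4 (a): I_n is the top part of I_s, so lines
 * homed at module j can only occupy the j-th slice of every cache.
 * Field widths are the same toy values as in the previous sketch.       */
enum { IB_BITS = 1, IS_BITS = 4, IN_BITS = 3 };

static unsigned set_index(uint32_t addr)   /* I_s, unchanged position */
{
    return (addr >> IB_BITS) & ((1u << IS_BITS) - 1);
}

static unsigned home_node(uint32_t addr)   /* I_n = top bits of I_s */
{
    return set_index(addr) >> (IS_BITS - IN_BITS);
}

/* Upper bound on simultaneously cached lines homed at one module, i.e.
 * the pointer-pool size per module: each of the 2^(I_n) nodes can cache
 * at most 2^(I_s - I_n) of this module's lines, so the bound is
 * 2^(I_n) * 2^(I_s - I_n) = 2^(I_s) = c, independent of the node count. */
static unsigned max_pointers_per_module(void)
{
    return 1u << IS_BITS;
}

With this mapping, even if every node fills its slice of the cache with lines from the same module, at most c copies of that module's lines exist system-wide, which is exactly the pointer bound derived above.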

Let us use the Cray T3E [3] as an example to show how much reduction we could achieve using the scheme described above, assuming we organize the Cray T3E with directories. The Cray T3E can have up to 2048 processors. Each processor is a DEC Alpha chip [8] which has separate L1 8K-byte instruction and data caches, and an integrated 3-way set-associative L2 cache of 96K bytes (to simplify our discussion, we assume a direct-mapped L2 cache of 32K bytes). It can also have an optional 1M-byte to 64M-byte off-chip L3 cache (we assume a 2M-byte L3 cache). Each processor can have a memory module of up to 2G bytes. The cache line size for the L2 cache and the L3 cache can be either 32 bytes or 64 bytes. We will focus only on the L2 and L3 caches in our discussion, and assume the cache line size is 64 bytes.

Using the full-map directory scheme for a Cray T3E with 1024 nodes, we would need a 1024-bit (i.e., 128-byte) vector, 1 bit for each node, for each cache line in the memory module. For a 2G-byte memory module, there are 32M cache lines; hence, the directory size needed in each memory module is 128 bytes * 32M, which is 4G bytes. Using the page interleaving scheme, each pointer has 10 bits (approximated as 2 bytes) since there are 1024 nodes. The Cray T3E does not have virtual memory, so we assume the unit of the interleaving is 32K bytes if there is only an L2 cache (32K bytes = 2^9 lines), and 2M bytes if there is an L3 cache (2M bytes = 2^15 lines). If we ignore the other overhead needed to maintain the pointer lists, the total number of pointers needed in a memory module without overflowing the pointer pool is 2^9 * 2^10 = 2^19, which is 1M bytes of pointers if there is only an L2 cache, and 2^15 * 2^10 = 2^25, which is 64M bytes, if there is an L3 cache. Even though this is a substantial reduction in the directory size compared to the full-map directory scheme, it is still a significant hardware overhead. Using our proposed interleaving scheme, however, the number of pointers needed in the pointer pool is only 2^9 = 512 (1K bytes) and 2^15 (64K bytes), respectively, for the L2 and L3 caches. This is a very significant reduction in the pointer pool size, and it makes the proposed scheme workable even for a full-sized 1024-node Cray T3E.
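The arithmetic of this example can be checked with a few lines of C. This is only a sketch of the back-of-the-envelope calculation above, using the numbers quoted in the text (1024 nodes, 2-GB memory modules, 64-byte lines, 32-KB and 2-MB interleaving units, 2-byte pointers); it is not a model of the actual Cray T3E hardware.

#include <stdio.h>

int main(void)
{
    const long long nodes       = 1024;           /* 2^10 nodes             */
    const long long module_size = 2LL << 30;      /* 2 GB per node          */
    const long long line_size   = 64;
    const long long lines       = module_size / line_size;  /* 32M lines    */
    const long long ptr_bytes   = 2;              /* 10-bit pointer, rounded up */

    long long full_map = lines * (nodes / 8);     /* one bit per node per line */
    printf("full-map directory:       %lld MB\n", full_map >> 20);   /* 4096 */

    long long l2_lines = (32 << 10) / line_size;  /* 2^9 lines in a 32 KB L2 */
    long long l3_lines = (2 << 20)  / line_size;  /* 2^15 lines in a 2 MB L3 */

    /* cache-size (page-style) interleaving: up to c*n pointers per module  */
    printf("interleaved pointer pool: %lld KB (L2), %lld KB (L3)\n",
           (l2_lines * nodes * ptr_bytes) >> 10,   /* 1024 KB = 1 MB         */
           (l3_lines * nodes * ptr_bytes) >> 10);  /* 65536 KB = 64 MB       */

    /* proposed interleaving: only c pointers per module                    */
    printf("proposed pointer pool:    %lld KB (L2), %lld KB (L3)\n",
           (l2_lines * ptr_bytes) >> 10,           /* 1 KB                   */
           (l3_lines * ptr_bytes) >> 10);          /* 64 KB                  */
    return 0;
}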

It is very important to note that the reduction in the mappable space in each cache memory for each memory module does not alter the cache miss ratio in the cache memory. This is because, even though we use a different data interleaving scheme on the main memory modules, we do not change the original data layout within the physical address space, on which the cache data mapping is based. That is, the position of the I_s field is unchanged. The only thing that changes is the location of the data in the memory modules, not the cache behavior. However, different interleaving schemes will incur different amounts of memory conflicts in the memory modules, and may also incur extra network traffic to remote memory modules which might have been local in the original interleaving scheme. In fact, our simulations show very little difference between the two address interleaving schemes, since neither scheme has anything to do with locality of reference. We will discuss these issues using simulation results in Section 4. Nevertheless, the cache miss ratios remain unchanged in each cache memory.

[Figure 5: Other data interleaving schemes on memory modules: (a) the I_n field straddling the boundary between the I_p and I_s fields; (b) the I_n field inside the I_s field but straddling part of the I_p field; (c) the I_n field inside the L1 set index field for a two-level cache.]

Using a similar strategy, there could be many ways to interleave the data among the memory modules, each with different consequences for the total required pointers and the memory performance. One possible interleaving scheme is to straddle the I_n field between the I_p and I_s fields, as shown in Figure 5 (a). Such schemes tend to increase the required pointers, but our simulations show that they have very little impact on the overall performance (see Section 4). Another design consideration is the size of a page, which determines the length of the I_p field and its relationship to the cache size. The page size is usually smaller than the size of the L2 or L3 cache, and caches may have a set associativity larger than one. However, whatever the layout of I_p and I_s is, the size of the pointer pool depends only on the location of the I_n field relative to the I_s field. Figure 5 (b) shows the case in which the I_n field is a part of the I_s field, but straddles part of the I_p field.

For systems with multi-level cache memories, if all of the cache memories at the different levels satisfy the inclusion property, then we only need to consider the cache memory level which is farthest away from the processor, which is usually the largest in size. All of the discussion so far still applies.

However, if the inclusion property is relaxed [11, 12], then the maximum number of pointers required in each pointer pool will be the sum of the total cache lines at all levels if the I_n field is included in the I_s field of the L1 cache (I_s^1). That is, if the total numbers of cache lines in the L1 and L2 caches are c_1 and c_2, respectively, and the I_n field stays within the I_s^1 field as shown in Figure 5 (c), the maximum number of pointers needed in each pointer pool will be c_1 + c_2. Again, different interleaving unit sizes will have different consequences. However, studying all possible variations is beyond the scope of this paper.

3.1 Directory organization

[Figure 6: Directory organization: (a) each cache line in the memory module has a head link; (b) the shared pointer pool with node number and link fields and the head of free list (HFL) register.]

The management of the pointer pool used in our proposed scheme is very similar to the one proposed in [18]. The basic organization is shown in Figure 6. Each pointer has two major fields: one stores the number of the node which has a copy of the cache line, and the other keeps a link which points to the next pointer in a non-circular singly linked list (see Figure 6 (b)). A free pointer list is maintained, and its head is held in a special register called the head of free list (HFL). To facilitate access to the linked list associated with a cache line in the memory module, each cache line in the memory module could have a link (shown as the "head link" in Figure 6 (a)) which points to the first pointer of the associated linked list. The link is of size log2(c) bits, which is very small compared to the size of a cache line.

However, the total size of the head links can grow proportionally to the size of each memory module if each cache line in the memory module is associated with such a link. Even though this storage overhead is very small in practice, to avoid the extra overhead we could instead hash the address of a cache line to find the associated list in the pointer pool. This could make the pointer access a little more complicated, with some extra time overhead if conflicts occur during hashing. However, it allows the directory size to scale only with the cache memory size instead of the memory size. In fact, we can optimize the hashing function by using the fact that only a * n copies among the 2^(I_l - I_s) cache lines in a group are cached at any time (where a is the set associativity and n is the number of nodes). Thus, in the worst case, the number of hashing conflicts is limited to a * n. In Section 4, we study the effect of such pointer pool overhead on the overall performance.
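As an illustration of this organization, the following C sketch implements the pointer pool, the HFL register, and the per-line head links for one memory module, with operations to add and remove a sharer. The sizes, field widths, and function names are illustrative assumptions of ours, not the paper's hardware design.

#include <stdint.h>
#include <stddef.h>

#define POOL_SIZE 256            /* c: cache lines per cache memory        */
#define NIL       0xFFFF         /* illustrative "null link" encoding      */

struct ptr_entry {
    uint16_t node;               /* node holding a copy of the line        */
    uint16_t link;               /* next entry in the singly linked list   */
};

struct directory {
    struct ptr_entry pool[POOL_SIZE];
    uint16_t hfl;                /* head of free list (HFL) register       */
    uint16_t *head_link;         /* per-memory-line head of its list       */
};

static void dir_init(struct directory *d, uint16_t *head_links, size_t nlines)
{
    for (size_t i = 0; i + 1 < POOL_SIZE; i++)
        d->pool[i].link = (uint16_t)(i + 1);
    d->pool[POOL_SIZE - 1].link = NIL;
    d->hfl = 0;
    d->head_link = head_links;
    for (size_t i = 0; i < nlines; i++)
        d->head_link[i] = NIL;
}

/* Record that `node` now caches memory line `line` (e.g. after a read miss). */
static int dir_add_sharer(struct directory *d, size_t line, uint16_t node)
{
    if (d->hfl == NIL)
        return -1;               /* pool exhausted; with the proposed
                                    interleaving plus replacement
                                    notification this should not occur     */
    uint16_t e = d->hfl;
    d->hfl = d->pool[e].link;    /* take an entry from the free list       */
    d->pool[e].node = node;
    d->pool[e].link = d->head_link[line];
    d->head_link[line] = e;      /* push onto the line's list              */
    return 0;
}

/* Remove `node` from `line`'s list, e.g. on invalidation or replacement. */
static void dir_remove_sharer(struct directory *d, size_t line, uint16_t node)
{
    uint16_t *p = &d->head_link[line];
    while (*p != NIL) {
        uint16_t e = *p;
        if (d->pool[e].node == node) {
            *p = d->pool[e].link;          /* unlink                       */
            d->pool[e].link = d->hfl;      /* return entry to free list    */
            d->hfl = e;
            return;
        }
        p = &d->pool[e].link;
    }
}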

3.2 Directory protocol and implementation

Since our scheme is based only on the data interleaving on the memory modules, we could use many existing cache coherence protocols, for example, the protocol used in the full-map directory scheme [5]. That is, when a read request arrives at the memory module, the request is forwarded to the owner node, where the request can get a valid copy of the requested cache line. After its completion, a pointer is added to the linked list associated with the requested cache line. If it is an exclusive-read or a write operation, all of the copies pointed to by the linked list have to be invalidated (or updated, depending on the protocol used). A new pointer is added to the linked list after the completion of the invalidation (or the update).

It is pointed out in [18] that a change in the protocol for a write-back cache might be needed. The issue arises when a "clean" cache line is replaced. In a full-map directory scheme, a "clean" cache line can be replaced without informing the directory: even though the bit corresponding to the replaced "clean" cache line remains set in the directory, there is no significant side effect as long as data consistency is guaranteed. However, in the proposed scheme, the pointer corresponding to the replaced "clean" cache line becomes "stale" and is not returned to the free list. If the number of such "stale" pointers increases over time, the available free pointers will be depleted. To avoid this situation, we force all replaced cache lines, either "clean" or "dirty", to inform the directory and to free up the pointers associated with the replaced cache lines.

Another possible modification is needed when we replace a cache line. We may want to fetch the new cache line first and perform the replacement later to minimize the processor stall time. Under the proposed scheme, the replaced cache line and the line to be fetched are mapped to the same cache line, i.e., they come from the same memory module. However, if we provide only c pointers in the pointer pool as proposed (where c is the number of cache lines in each cache memory), we implicitly assume that the replaced cache line frees up its pointer before the new cache line needs one. If we reverse that order, it is possible that when the request for the new cache line arrives, there is no longer a free pointer available in the free list. Since this situation occurs only temporarily, a small number of extra pointers could be added to avoid such temporary starvation. For instance, we could add n extra pointers, assuming each processor generates only one request for a new cache line at a time. However, if the replaced copy is available at the same time the cache miss occurs, the extra pointers may not be necessary, because we could send the miss request and the replaced copy to the same home node together. Usually, the address of the replaced copy is available without increasing the cache access time. Thus, the extra network traffic needed for the replacement notification of a "clean" cache copy could be very small.

There are many ways to implement the proposed scheme. For example, because the size of the pointer pool is very small, it can be implemented in SRAM, which is much faster than the DRAM of the memory module. If the size of the head links grows proportionally to the size of each memory module (see Figure 6), they could be implemented in DRAM for a better cost/performance design. The management of the linked lists in the pointer pool (insertion and deletion of pointers) can be carried out concurrently with the management of directory requests (decoding, forming, and sending requests). The pointer overhead can thus be partially overlapped. In a write operation, since the main overhead is in invalidating (or updating) the existing cache copies in the remote cache memories, the overhead of managing the pointer pool is even less significant. We address the impact of such overhead in Section 4.
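Tying the protocol actions of this subsection to the pointer-pool sketch of Section 3.1, the fragment below shows how a home directory might drive dir_add_sharer() and dir_remove_sharer() on a read miss, a write miss, and a replacement notification. The message types and the two extern hooks are hypothetical names for illustration; the actual protocol also handles forwarding, ownership, and the non-blocking states, which are omitted here.

#include <stdint.h>
#include <stddef.h>
/* Builds on struct directory, NIL, dir_add_sharer(), and
 * dir_remove_sharer() from the Section 3.1 sketch above.                  */

enum req_type { READ_MISS, WRITE_MISS, REPLACE_NOTICE };

struct request {
    enum req_type type;
    size_t   line;               /* memory line within this module          */
    uint16_t node;               /* requesting (or replacing) node          */
};

extern void forward_to_owner(size_t line, uint16_t requester);  /* stub     */
extern void invalidate_copy(size_t line, uint16_t holder);      /* stub     */

static void directory_handle(struct directory *d, const struct request *r)
{
    switch (r->type) {
    case READ_MISS:
        forward_to_owner(r->line, r->node);     /* get a valid copy         */
        dir_add_sharer(d, r->line, r->node);    /* then record the new copy */
        break;
    case WRITE_MISS:
        /* invalidate every other copy on the line's list, freeing its
           pointers, then record the writer as the only holder              */
        while (d->head_link[r->line] != NIL) {
            uint16_t e = d->head_link[r->line];
            if (d->pool[e].node != r->node)
                invalidate_copy(r->line, d->pool[e].node);
            dir_remove_sharer(d, r->line, d->pool[e].node);
        }
        dir_add_sharer(d, r->line, r->node);
        break;
    case REPLACE_NOTICE:
        /* both clean and dirty replacements notify the home node, so
           stale pointers never accumulate and the free list is not
           depleted, as required by the discussion above                    */
        dir_remove_sharer(d, r->line, r->node);
        break;
    }
}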

Program        Description                                      Problem size
Barnes         Barnes-Hut N-body simulation                     16K particles
Cholesky       blocked sparse factorization                     d750.o
FFT            complex 1D radix-sqrt(n) six-step FFT            64K points
FMM            Fast Multipole N-body simulation                 16K particles
LU (Cont.)     blocked dense LU factorization                   matrix
Ocean (Cont.)  ocean movement simulation                        ocean
Radiosity      computation of light equilibrium distribution    room
Radix          integer radix sort                               1M integers
Raytrace       rendering a 3D scene                             teapot
Volrend        rendering a 3D volume                            head (render only)
Water-Nsqr     evaluation of a water molecule system            1000 molecules
Water-Sp       improved version of Water-Nsqr                   1000 molecules

Table 1: Characteristics of the traces.

System unit    Parameter                      Default value
system         # of processors                32
               page size                      4K bytes
               line size                      64 bytes
cache          size (per node)                16K bytes
               access time                    1 cycle
               fill time                      4 cycles
               set associativity              4-way
               replacement                    random selection
memory         line access time               24 cycles
network        service time                   30/10, 70/50, or 110/90 cycles

Table 2: Parameters in the simulation.

4 Performance Study

We use simulations to study the effect of different directory schemes on system performance, using the Splash-2 programs [22] as the workload. Some characteristics of the Splash-2 programs are listed in Table 1. We use a modified MINT [20] simulator as our front end and attach a generic DSM multiprocessor (as shown in Figure 1) with our proposed directory schemes at the back end. For fast data access latency, the cache coherence protocol used is a non-blocking directory with five cache states: invalid, shared, shared-owner, dirty, and issue [13]. This means that all subsequent requests to the same line can proceed beyond the directory.
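For reference, the five line states named above can be written down as follows. The comments paraphrase the usual roles of such states; since [13], which defines the protocol, is still in preparation, the exact transitions are not given here and this enum is only a naming sketch.

/* Line states of the non-blocking directory protocol used in the
 * simulations; names from the text, comments are our paraphrase.          */
enum line_state {
    INVALID,        /* no valid copy in this cache                          */
    SHARED,         /* read-only copy; other caches may also hold one       */
    SHARED_OWNER,   /* read-only copy; this cache supplies the data         */
    DIRTY,          /* exclusive, modified copy                             */
    ISSUE           /* miss outstanding: later requests to the same line
                       can proceed past the directory without blocking      */
};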

The default simulation parameters are shown in Table 2. There are 32 processors, with one processor in each node. Each processor can generate one missing cache line request at a time. The size of a cache line is 64 bytes, and the page size is 4K bytes. In order to reflect the working set sizes (relative to the cache memory) in our simulation, the cache size is set at 16K bytes so as to fit between the first and the second important working sets of the Splash-2 programs [22]. The cache access time is the unit of our simulation clock cycle. Four cycles are assumed for the cache fill time, and four-way set associativity is used for the cache memory. The replaced cache line is randomly selected. The memory access time is 24 cycles. Each non-memory instruction completes in a single cycle, as assumed in MINT. To simplify our simulation, we do not simulate the network topology in detail, but rather assume the communication time between nodes (the network latency or network service time) to be a constant regardless of their distance. We simulate three settings for the network latencies: 30/10 cycles, 70/50 cycles, and 110/90 cycles. In each pair of settings, the longer latency is for requests with data and the shorter latency is for requests without data. Contention at the system resources (i.e., cache, memory, directory, and network) is modeled using queues. Thus, we do simulate the memory conflicts at each memory module, since the data interleaving scheme can affect the memory access patterns.

4.1 Performance sensitivity to data interleaving schemes

We first study how different data interleaving schemes on the memory modules affect the system performance. Even though different data interleaving schemes require different directory sizes, and hence different directory processing times, we assume the directory processing time to be 16 cycles for all of the schemes in this section.

The effect of the directory processing time on the overall performance is studied in the next section. Also, instead of studying all possible data interleaving schemes, we only look at the two schemes that place the node field, I_n, at the two extreme positions in the address. The other schemes, which place I_n somewhere between these two extremes, should give results somewhere between those of these two schemes. Our simulation results actually show very little difference in the performance of the two schemes we studied.

[Figure 7: The two interleaving schemes in our simulation: (a) page interleaving, with I_n inside the I_p field; (b) line interleaving, with I_n directly above the byte offset.]

The first scheme places the I_n field inside the I_p field (see Figure 7 (a)). This corresponds to interleaving data with one page as the unit of data interleaving, i.e., page interleaving (PI). The second scheme places the I_n field outside the I_p field (see Figure 7 (b)). It corresponds to interleaving data with one cache line as the unit of data interleaving, i.e., line interleaving (LI). The page size is assumed to be 4K (or 2^12) bytes and the cache line size is 64 (or 2^6) bytes. Also, the width of I_s is 6 bits because the size of the cache is assumed to be 16K (or 2^14) bytes and the set associativity is four. The number of cache lines in a cache memory, c, is the 2^14-byte cache size divided by the 2^6-byte line size, i.e., 256. The home node (I_n) of an address is determined by bits 12 to 16 in the PI scheme, and by bits 6 to 10 in the LI scheme. This results in different numbers of pointers needed in the pointer pool: cn = 256 * 32 = 8192 for the PI scheme and c = 256 for the LI scheme, assuming 32 nodes.
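A few lines of C reproduce these numbers from the parameters of Table 2. This is only a sketch of the arithmetic, and the PI/LI labels follow the naming used here.

#include <stdio.h>

int main(void)
{
    const int nodes = 32;                  /* 2^5 nodes -> 5-bit I_n field   */
    const int c = (16 << 10) / 64;         /* 16 KB cache / 64 B lines = 256 */

    printf("PI pointer pool: %d entries\n", c * nodes);   /* cn = 8192       */
    printf("LI pointer pool: %d entries\n", c);           /* c  = 256        */

    /* The home-node field sits just above the interleaving unit: pages are
       4 KB (2^12) and lines are 64 B (2^6), so I_n occupies bits 12..16
       under page interleaving and bits 6..10 under line interleaving,
       matching the bit positions quoted above.                              */
    return 0;
}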

Our simulation results show no difference in the cache memory hit rate between the PI and LI schemes, since the data interleaving schemes only affect the placement of data in the memory modules. Although the two schemes allocate data in the memory modules differently, this makes no difference in the memory hit rates, because both schemes allocate data in the memory modules without considering reference locality. In fact, we found that the probability of finding the line requested by a cache miss is uniformly distributed among all memory modules.

[Figure 8: Waiting time per request at system resources (directory and memory) for the PI and LI schemes under the 30/10, 70/50, and 110/90 network service times.]

Figure 8 shows the average waiting time at the directory and at the memory module for both schemes with different network service times. In general, a longer network service time increases the waiting time at the network while decreasing the waiting time at the other system resources. The LI scheme has a longer average waiting time at the directory and the memory module than the PI scheme. This is because the requests for a missing cache line and a replaced cache line are sent to the same home memory module in the LI scheme; the replacement request, which follows the missing-line request, usually needs to wait for the completion of the missing-line request at the same memory module. The home directories of a missing cache line and a replaced cache line in the PI scheme are different in most cases; hence, conflicts at the directory and the memory module happen less frequently. However, since the replacement request is not on the critical path of a data access, the longer waiting time does not affect the overall system performance. Our results show very little difference in the average overall data access latency; that is, the cache miss penalty is almost the same in both schemes.

[Figure 9: Speedup of the two interleaving schemes under the different network service times.]

Figure 9 shows the speedup based on the PI (30/10) configuration, i.e., the execution time of PI (30/10) divided by the execution time of a given scheme.

The figure also shows the maximum stalled time among the processors at synchronization points (wait(), barrier(), locks, etc.). The PI and LI schemes have almost the same performance.

In the PI scheme, the maximum number of pointers needed in each pointer pool should be cn, where c is the number of cache lines in a cache memory and n is the number of nodes. Given our simulation parameters, cn is 8192 (= 256 cache lines in the cache memory * 32 nodes). However, the actual number of pointers used in the pointer pool varies over time during the program execution, and the average number of pointers used cannot reveal the usage of those pointers. In Figure 10, we therefore show the number of pointers which have already been allocated when a cache line request arrives at the directory in the PI scheme. To compare against the LI scheme, which needs a maximum of c pointers in its pointer pool (c = 256 in this example), the x-axis in the figure starts at c = 256. The y-axis represents the fraction of the requests which see fewer than x pointers in use.

[Figure 10: The number of pointers in use seen by a cache line request to the directory in the PI scheme, for (a) 16 nodes and (b) 32 nodes.]

Figure 10 (a) and (b) assume systems of 16 nodes and 32 nodes, respectively. From the figures, the number of pointers actually used is much less than the maximum. Also, the number of pointers used in a program, such as LU, does not always increase as the system size grows. The number of pointers in use depends very much on the program behavior, regardless of the system size.

From Figure 10 (b), even though the maximum number of pointers which can be used by a program is 8192, all programs use fewer than 1792 pointers; however, that number is still much larger than the 256 needed in the LI scheme.

4.2 Performance sensitivity to the directory processing times

We consider three implementation choices for the directory: the full-map directory, a scheme using a head link for each cache line in the memory module to point to the first pointer of the linked list, and a scheme using hashing of the cache line address to locate the first pointer of the linked list, designated as the full-map, head-link, and hashing schemes, respectively, in the figures. In our simulation, only the number of accesses to the directory is considered for the directory processing time. For the full-map scheme, we assume that every request can be served within one DRAM access time (8 cycles); that is, the bit vector saved in DRAM can be fetched and updated in 8 cycles. For the head-link scheme, the number of DRAM accesses per request is assumed to be two (16 cycles). For the hashing scheme, 5 DRAM accesses (40 cycles) are used as the processing time.

All three schemes have the same address mapping to the cache and the memory module; hence, there is no difference in the cache memory and memory module hit rates. Therefore, different directory schemes do not affect the number of requests to the cache memory, the memory module, the directory, or the network. Although we observe very little difference in the number of requests among the schemes, the waiting time varies as the directory processing time increases. In general, if a request stays at a resource for a longer time, there is a higher probability of conflict at that resource, which in turn reduces the probability of conflict at the other resources. The different waiting times in turn affect the average data access latency, as shown in Figure 11. The average data access latency of a cache line request includes the access latency in the local cache memory, the local memory module or a remote memory module if a node miss occurs, and a remote cache memory if it is needed. On the critical path of a data access, the waiting time caused by contention at the system resources is also included in the average latency. A longer directory processing time causes a small increase in the waiting time at the directory, while causing a large decline in the waiting time at the network. However, the overall data access latency still grows.

[Figure 11: Average data access latency, broken down into local cache hit, local memory hit, remote cache hit, remote memory hit, and waiting time components.]

Figure 12 shows the speedup based on the system which uses the full-map scheme with a network latency of 30 cycles for requests with data and 10 cycles for requests without data. The results show that the performance is sensitive to the directory processing time. Hence, the size of the pointer pool (a smaller pointer pool has a faster access time) and the management of the linked list should be kept as simple as possible. Schemes similar to the head-link scheme seem to be a good choice. However, the detailed hardware implementation of the pointer pool is beyond the scope of this paper.

5 Conclusion

Controlling the data interleaving scheme allows us to change the data layout in the main memory without altering the cache access behavior in each cache memory. We can also take advantage of the data interleaving to control the size of the directories needed to support cache coherence protocols. In this paper, we have shown that the reduction in the directory size from a well-conceived data interleaving scheme can be quite substantial.

[Figure 12: Speedup of the three directory implementations under the different network service times.]

The proposed scheme also allows the directory size to scale with the cache memory size instead of the main memory size and the system size, which is very attractive for large-scale systems. Our simulations show that the impact of the different data interleaving schemes on the system performance is usually minimal. The performance is, on the other hand, more sensitive to the directory management overhead. Hence, by selecting a suitable data interleaving scheme with an efficient pointer pool management mechanism, a directory-based cache coherence protocol can be used very cost-effectively even for very large DSM multiprocessors.

References

[1] A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B. H. Lim, K. Mackenzie, and D. Yeung, "The MIT Alewife Machine: Architecture and Performance," Proc. of the 22nd Annual Int'l Sym. on Comp. Archi.
[2] A. Agarwal, R. Simoni, M. Horowitz, and J. Hennessy, "An Evaluation of Directory Schemes for Cache Coherence," Proc. of the 15th Annual Int'l Sym. on Comp. Archi.
[3] E. Anderson, J. Brooks, C. Grassl, and S. Scott, "Performance of the CRAY T3E Multiprocessor," Proc. of Supercomputing '97, Aug.
[4] J. Archibald and J. L. Baer, "Cache Coherence Protocols Evaluation Using a Multiprocessor Simulation Model," ACM Trans. on Comp. Sys., Vol. 4, No. 4, Nov.
[5] L. Censier and P. Feautrier, "A New Solution to Coherence Problem in Multicache Systems," IEEE Trans. Computers, Vol. C-27, No. 12, Dec. 1978.
[6] D. Chaiken, J. Kubiatowicz, and A. Agarwal, "LimitLESS Directories: A Scalable Cache Coherence Scheme," ASPLOS-IV, April.
[7] S. J. Eggers and R. H. Katz, "A Characterization of Sharing in Parallel Programs and its Application to Coherency Protocol Evaluation," Proc. of the 15th Annual Int'l Sym. on Comp. Archi., May.
[8] Digital Equipment Corp., Alpha Microprocessor Hardware Reference Manual, DEC part number EC-QP99B-TE, Feb.

[9] J. R. Goodman, "Using Cache Memory to Reduce Processor-Memory Traffic," Proc. of the 10th Annual Int'l Sym. on Comp. Archi., May 1983.
[10] D. V. James, A. T. Laundrie, S. Gjessing, and G. S. Sohi, "New Directories in Scalable Shared Memory Multiprocessor Architectures: Scalable Coherent Interface," Computer, Vol. 23, No. 6, June 1990.
[11] N. P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proc. of the 17th ISCA.
[12] J. Kong and G. Lee, "Relaxing the Inclusion Property in Cache Only Memory Architecture," Euro-Par'96, Vol. II, Aug. 1996.
[13] J. Kong, P.-C. Yew, and G. Lee, "A Non-blocking Directory for Large-Scale Multiprocessors," in preparation.
[14] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy, "The Stanford FLASH Multiprocessor," Proc. of the 21st Annual Int'l Sym. on Comp. Archi.
[15] J. Laudon and D. Lenoski, "The SGI Origin: A ccNUMA Highly Scalable Server," Proc. of the 24th Annual Int'l Sym. on Comp. Archi.
[16] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor," Proc. of the 17th Annual Int'l Sym. on Comp. Archi.
[17] T. D. Lovett and R. M. Clapp, "STiNG: A CC-NUMA Computer System for the Commercial Marketplace," Proc. of the 23rd Annual Int'l Sym. on Comp. Archi.
[18] R. Simoni and M. Horowitz, "Dynamic Pointer Allocation for Scalable Cache Coherence Directories," Proc. of the Int'l Sym. on Shared Memory Multiprocessing, April 1991.
[19] P. Sweazey and A. J. Smith, "A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus," Proc. of the 13th Annual Int'l Sym. on Comp. Archi.
[20] J. Veenstra and R. Fowler, "MINT: A Front-End for Efficient Simulation of Shared-Memory Multiprocessors," Proc. of the 2nd MASCOTS, Jan.-Feb.
[21] W. Weber and A. Gupta, "Analysis of Cache Invalidation Patterns in Multiprocessors," ASPLOS-III, April.
[22] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 Programs: Characterization and Methodological Considerations," Proc. of the 22nd Annual Int'l Sym. on Comp. Archi.


More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Bushra Ahsan and Mohamed Zahran Dept. of Electrical Engineering City University of New York ahsan bushra@yahoo.com mzahran@ccny.cuny.edu

More information

Multiple Context Processors. Motivation. Coping with Latency. Architectural and Implementation. Multiple-Context Processors.

Multiple Context Processors. Motivation. Coping with Latency. Architectural and Implementation. Multiple-Context Processors. Architectural and Implementation Tradeoffs for Multiple-Context Processors Coping with Latency Two-step approach to managing latency First, reduce latency coherent caches locality optimizations pipeline

More information

Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: Miss Rates for Snooping Protocol 4th C: Coherency Misses More processors:

More information

A MULTIPORTED NONBLOCKING CACHE FOR A SUPERSCALAR UNIPROCESSOR JAMES EDWARD SICOLO THESIS. Submitted in partial fulllment of the requirements

A MULTIPORTED NONBLOCKING CACHE FOR A SUPERSCALAR UNIPROCESSOR JAMES EDWARD SICOLO THESIS. Submitted in partial fulllment of the requirements A MULTIPORTED NONBLOCKING CACHE FOR A SUPERSCALAR UNIPROCESSOR BY JAMES EDWARD SICOLO B.S., State University of New York at Bualo, 989 THESIS Submitted in partial fulllment of the requirements for the

More information

LimitLESS Directories: A Scalable Cache Coherence Scheme. David Chaiken, John Kubiatowicz, and Anant Agarwal. Massachusetts Institute of Technology

LimitLESS Directories: A Scalable Cache Coherence Scheme. David Chaiken, John Kubiatowicz, and Anant Agarwal. Massachusetts Institute of Technology LimitLESS Directories: A Scalable Cache Coherence Scheme David Chaiken, John Kubiatowicz, and Anant Agarwal Laboratory for Computer Science Massachusetts Institute of Technology Cambridge, Massachusetts

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax: Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

Concepts Introduced in Appendix B. Memory Hierarchy. Equations. Memory Hierarchy Terms

Concepts Introduced in Appendix B. Memory Hierarchy. Equations. Memory Hierarchy Terms Concepts Introduced in Appendix B Memory Hierarchy Exploits the principal of spatial and temporal locality. Smaller memories are faster, require less energy to access, and are more expensive per byte.

More information

Lazy Release Consistency for Hardware-Coherent Multiprocessors

Lazy Release Consistency for Hardware-Coherent Multiprocessors Lazy Release Consistency for Hardware-Coherent Multiprocessors Leonidas I. Kontothanassis, Michael L. Scott, and Ricardo Bianchini Department of Computer Science University of Rochester Rochester, NY 14627-0226

More information

Shared vs. Snoop: Evaluation of Cache Structure for Single-chip Multiprocessors

Shared vs. Snoop: Evaluation of Cache Structure for Single-chip Multiprocessors vs. : Evaluation of Structure for Single-chip Multiprocessors Toru Kisuki,Masaki Wakabayashi,Junji Yamamoto,Keisuke Inoue, Hideharu Amano Department of Computer Science, Keio University 3-14-1, Hiyoshi

More information

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo Real-Time Scalability of Nested Spin Locks Hiroaki Takada and Ken Sakamura Department of Information Science, Faculty of Science, University of Tokyo 7-3-1, Hongo, Bunkyo-ku, Tokyo 113, Japan Abstract

More information

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T A Mean Value Analysis Multiprocessor Model Incorporating Superscalar Processors and Latency Tolerating Techniques 1 David H. Albonesi Israel Koren Department of Electrical and Computer Engineering University

More information

Design Trade-Offs in High-Throughput Coherence Controllers

Design Trade-Offs in High-Throughput Coherence Controllers Design Trade-Offs in High-Throughput Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation Santa Clara, CA 9 anthony.d.nguyen@intel.com Josep Torrellas Department of

More information

Scalable Directory Organization for Tiled CMP Architectures

Scalable Directory Organization for Tiled CMP Architectures Scalable Directory Organization for Tiled CMP Architectures Alberto Ros, Manuel E. Acacio, José M. García Departamento de Ingeniería y Tecnología de Computadores Universidad de Murcia {a.ros,meacacio,jmgarcia}@ditec.um.es

More information

Massachusetts Institute of Technology. Cambridge, MA Abstract. in Alewife, a scalable multiprocessor that is being developed at MIT.

Massachusetts Institute of Technology. Cambridge, MA Abstract. in Alewife, a scalable multiprocessor that is being developed at MIT. Latency Tolerance through Multithreading in Large-Scale Multiprocessors Kiyoshi Kurihara 3, David Chaiken, and Anant Agarwal Laboratory for Computer Science Massachusetts Institute of Technology Cambridge,

More information

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will

More information

Lecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Lecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012) Lecture 11: Snooping Cache Coherence: Part II CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 2 due tonight 11:59 PM - Recall 3-late day policy Assignment

More information

An Efficient Tree Cache Coherence Protocol for Distributed Shared Memory Multiprocessors

An Efficient Tree Cache Coherence Protocol for Distributed Shared Memory Multiprocessors 352 IEEE TRANSACTIONS ON COMPUTERS, VOL. 48, NO. 3, MARCH 1999 An Efficient Tree Cache Coherence Protocol for Distributed Shared Memory Multiprocessors Yeimkuan Chang and Laxmi N. Bhuyan AbstractÐDirectory

More information

The Impact of Instruction-Level Parallelism on. Vijay S. Pai, Parthasarathy Ranganathan, and Sarita V. Adve. Rice University. Houston, Texas 77005

The Impact of Instruction-Level Parallelism on. Vijay S. Pai, Parthasarathy Ranganathan, and Sarita V. Adve. Rice University. Houston, Texas 77005 The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology Vijay S. Pai, Parthasarathy Ranganathan, and Sarita V. Adve Department of Electrical and Computer Engineering

More information

Chapter 5. Thread-Level Parallelism

Chapter 5. Thread-Level Parallelism Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated

More information

An Evaluation of Fine-Grain Producer-Initiated Communication in. and is fairly straightforward to implement. Therefore, commercial systems.

An Evaluation of Fine-Grain Producer-Initiated Communication in. and is fairly straightforward to implement. Therefore, commercial systems. An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors Hazim Abdel-Sha y, Jonathan Hall z, Sarita V. Adve y, Vikram S. Adve [ y Electrical and Computer Engineering

More information

A Quantitative Analysis of the Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols

A Quantitative Analysis of the Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols A Quantitative Analysis of the Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols Mark Heinrich School of Electrical Engineering, Cornell University, Ithaca, NY 453 Vijayaraghavan

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

EECS 570 Lecture 11. Directory-based Coherence. Winter 2019 Prof. Thomas Wenisch

EECS 570 Lecture 11. Directory-based Coherence. Winter 2019 Prof. Thomas Wenisch Directory-based Coherence Winter 2019 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs570/ Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Martin, Narayanasamy, Nowatzyk, Reinhardt,

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 18 Guest Lecturer: Shakir James Plan for Today Announcements No class meeting on Monday, meet in project groups Project demos < 2 weeks, Nov 23 rd Questions

More information

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid Chapter 5 Thread-Level Parallelism Abdullah Muzahid 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors is saturating + Modern multiple issue processors are becoming very complex

More information

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4 Algorithms Implementing Distributed Shared Memory Michael Stumm and Songnian Zhou University of Toronto Toronto, Canada M5S 1A4 Email: stumm@csri.toronto.edu Abstract A critical issue in the design of

More information

Multicast Snooping: A Multicast Address Network. A New Coherence Method Using. With sponsorship and/or participation from. Mark Hill & David Wood

Multicast Snooping: A Multicast Address Network. A New Coherence Method Using. With sponsorship and/or participation from. Mark Hill & David Wood Multicast Snooping: A New Coherence Method Using A Multicast Address Ender Bilir, Ross Dickson, Ying Hu, Manoj Plakal, Daniel Sorin, Mark Hill & David Wood Computer Sciences Department University of Wisconsin

More information

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 11: Cache Coherence: Part II Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) It

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

Handout 3 Multiprocessor and thread level parallelism

Handout 3 Multiprocessor and thread level parallelism Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed

More information

Bundling: Reducing the Overhead of Multiprocessor Prefetchers

Bundling: Reducing the Overhead of Multiprocessor Prefetchers Bundling: Reducing the Overhead of Multiprocessor Prefetchers Dan Wallin and Erik Hagersten Uppsala University, Department of Information Technology P.O. Box 337, SE-751 05 Uppsala, Sweden dan.wallin,

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Author's personal copy

Author's personal copy J. Parallel Distrib. Comput. 68 (2008) 1413 1424 Contents lists available at ScienceDirect J. Parallel Distrib. Comput. journal homepage: www.elsevier.com/locate/jpdc Two proposals for the inclusion of

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Early Experience with Profiling and Optimizing Distributed Shared Cache Performance on Tilera s Tile Processor

Early Experience with Profiling and Optimizing Distributed Shared Cache Performance on Tilera s Tile Processor Early Experience with Profiling and Optimizing Distributed Shared Cache Performance on Tilera s Tile Processor Inseok Choi, Minshu Zhao, Xu Yang, and Donald Yeung Department of Electrical and Computer

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica A New Register Allocation Scheme for Low Power Data Format Converters Kala Srivatsan, Chaitali Chakrabarti Lori E. Lucke Department of Electrical Engineering Minnetronix, Inc. Arizona State University

More information

Special Topics. Module 14: "Directory-based Cache Coherence" Lecture 33: "SCI Protocol" Directory-based Cache Coherence: Sequent NUMA-Q.

Special Topics. Module 14: Directory-based Cache Coherence Lecture 33: SCI Protocol Directory-based Cache Coherence: Sequent NUMA-Q. Directory-based Cache Coherence: Special Topics Sequent NUMA-Q SCI protocol Directory overhead Cache overhead Handling read miss Handling write miss Handling writebacks Roll-out protocol Snoop interaction

More information

III Data Structures. Dynamic sets

III Data Structures. Dynamic sets III Data Structures Elementary Data Structures Hash Tables Binary Search Trees Red-Black Trees Dynamic sets Sets are fundamental to computer science Algorithms may require several different types of operations

More information

An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors

An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors Myoung Kwon Tcheun, Hyunsoo Yoon, Seung Ryoul Maeng Department of Computer Science, CAR Korea Advanced nstitute of Science and

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Appears in the proceedings of the ACM/IEEE Supercompuing 94

Appears in the proceedings of the ACM/IEEE Supercompuing 94 Appears in the proceedings of the ACM/IEEE Supercompuing 94 A Compiler-Directed Cache Coherence Scheme with Improved Intertask Locality Lynn Choi Pen-Chung Yew Center for Supercomputing R & D Department

More information

Relative Reduced Hops

Relative Reduced Hops GreedyDual-Size: A Cost-Aware WWW Proxy Caching Algorithm Pei Cao Sandy Irani y 1 Introduction As the World Wide Web has grown in popularity in recent years, the percentage of network trac due to HTTP

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Speeding-up Synchronizations in DSM Multiprocessors

Speeding-up Synchronizations in DSM Multiprocessors Speeding-up Synchronizations in DSM Multiprocessors A. de Dios 1, B. Sahelices 1, P. Ibáñez 2, V. Viñals 2, and J.M. Llabería 3 1 Dpto. de Informática. Univ. de Valladolid agustin,benja@infor.uva.es 2

More information

Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors

Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors IEEE TRANSACTIONS ON COMPUTERS, VOL. 47, NO. 10, OCTOBER 1998 1041 Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors Fredrik Dahlgren, Member, IEEE

More information

The Impact of Write Back on Cache Performance

The Impact of Write Back on Cache Performance The Impact of Write Back on Cache Performance Daniel Kroening and Silvia M. Mueller Computer Science Department Universitaet des Saarlandes, 66123 Saarbruecken, Germany email: kroening@handshake.de, smueller@cs.uni-sb.de,

More information

Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy

Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy Bushra Ahsan Electrical Engineering Department City University of New York bahsan@gc.cuny.edu Mohamed Zahran Electrical Engineering

More information

A Memory Management Architecture for a Mobile Computing Environment

A Memory Management Architecture for a Mobile Computing Environment A Memory Management Architecture for a Mobile Computing Environment Shigemori Yokoyama, Takahiro Okuda2, Tadanori Mizuno2 and Takashi Watanabe2 Mitsubishi Electric Corp. Faculty of nformation, Shizuoka

More information

Fairlocks - A High Performance Fair Locking Scheme

Fairlocks - A High Performance Fair Locking Scheme Fairlocks - A High Performance Fair Locking Scheme Swaminathan Sivasubramanian, Iowa State University, swamis@iastate.edu John Stultz, IBM Corporation, jstultz@us.ibm.com Jack F. Vogel, IBM Corporation,

More information

Lecture 13. Shared memory: Architecture and programming

Lecture 13. Shared memory: Architecture and programming Lecture 13 Shared memory: Architecture and programming Announcements Special guest lecture on Parallel Programming Language Uniform Parallel C Thursday 11/2, 2:00 to 3:20 PM EBU3B 1202 See www.cse.ucsd.edu/classes/fa06/cse260/lectures/lec13

More information

Memory Consistency and Multiprocessor Performance

Memory Consistency and Multiprocessor Performance Memory Consistency Model Memory Consistency and Multiprocessor Performance Define memory correctness for parallel execution Execution appears to the that of some correct execution of some theoretical parallel

More information

SCHEDULING REAL-TIME MESSAGES IN PACKET-SWITCHED NETWORKS IAN RAMSAY PHILP. B.S., University of North Carolina at Chapel Hill, 1988

SCHEDULING REAL-TIME MESSAGES IN PACKET-SWITCHED NETWORKS IAN RAMSAY PHILP. B.S., University of North Carolina at Chapel Hill, 1988 SCHEDULING REAL-TIME MESSAGES IN PACKET-SWITCHED NETWORKS BY IAN RAMSAY PHILP B.S., University of North Carolina at Chapel Hill, 1988 M.S., University of Florida, 1990 THESIS Submitted in partial fulllment

More information

a process may be swapped in and out of main memory such that it occupies different regions

a process may be swapped in and out of main memory such that it occupies different regions Virtual Memory Characteristics of Paging and Segmentation A process may be broken up into pieces (pages or segments) that do not need to be located contiguously in main memory Memory references are dynamically

More information

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994, pp. 573-584.. The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor David J.

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Chapter 8 Virtual Memory What are common with paging and segmentation are that all memory addresses within a process are logical ones that can be dynamically translated into physical addresses at run time.

More information

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB Memory Consistency and Multiprocessor Performance Adapted from UCB CS252 S01, Copyright 2001 USB 1 Memory Consistency Model Define memory correctness for parallel execution Execution appears to the that

More information