Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared Memory Multiprocessor


In Proceedings of the 6th IEEE Symposium on Parallel and Distributed Processing, Dallas, October 1994.

Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared Memory Multiprocessor

Mats Brorsson and Per Stenström
Department of Computer Engineering, Lund University, P.O. Box 118, S-221 00 Lund, Sweden

Abstract

Directory-based, write-invalidate cache coherence protocols are effective in reducing memory latencies but suffer from cache misses due to coherence actions. It is therefore important to understand the nature of the data sharing that causes misses for this class of protocols. In this paper we identify a set of parameters that characterises the accesses to migratory and producer-consumer data in sufficient detail to predict the number of cache misses in directory-based, write-invalidate protocols. We show that the parameters can be extracted from real programs and used as input to a reference generator that artificially generates a stream of references yielding accurate estimates of cold, coherence and directory replacement misses compared to the program itself.

1 Introduction

The problem of cache consistency is one of the most performance-limiting factors for shared-memory multiprocessors. Depending on the application, the cost of maintaining a coherent view of the memory may easily account for half of the execution time or more [7]. It is not difficult to verify that the shared memory access pattern has a dramatic effect on the performance of the cache memory subsystem, and this makes it very important to understand the nature of data sharing. We loosely refer to the program properties relating to the sharing of data as the sharing behaviour of a program.

The use of workload models is an invaluable tool to understand performance issues related to program behaviour. In an earlier paper [4] we presented a model of an access pattern we refer to as stationary data accesses (this access category is defined in section 2.2). The results show that the miss ratios for a program with only stationary data, such as parallel matrix multiplication, can be predicted with high accuracy for cache block sizes ranging from 4 to 64 bytes. The fundamental concept behind this model is the identification of a number of parameters that characterise stationary data accesses. The characterisation of the sharing behaviour is based on the observation that a shared data object can be classified in one of a small number of classes during some time interval, depending on the access mode (read-only or read/write accesses) and the degree of sharing (number of sharers) of the accesses made to the object.

This paper extends the previous model to cover data that are migratory or accessed in a producer-consumer fashion. Migratory data may be accessed by many processors, but only by one processor at a time. Previous studies have shown that migratory data is a common source of cache invalidations [8, 12], and it is thus important to characterise and model this behaviour. Variables which are accessed in a producer-consumer fashion are also quite common in many applications, especially in algorithms such as the Successive Over Relaxation (S.O.R.) program kernel, in which a processor works on a submatrix and communicates with the processors working on the nearest-neighbouring submatrices. Migratory and producer-consumer data are defined in detail in section 3.
In order to cope with migratory and producer-consumer data, the parameter set representing the sharing behaviour has been augmented with parameters describing the accesses to data in these categories. We have verified that the extended set of parameters can be used to accurately predict the cache miss ratios of directory-based, write-invalidate protocols by means of a reference generator that artificially generates shared memory references based on the sharing behaviour parameters extracted from a program. The generated reference stream is fed into a simulated cache coherence protocol. The resulting cache miss ratio, subdivided into cold, coherence and directory replacement miss ratios, is compared with the results from an actual execution of the program on simulated processors. The results show that with a limited set of parameters it is possible to capture the miss ratio components caused by migratory and producer-consumer data objects.

Section 2 contains background information on the architectural assumptions and the characterisation of accesses to stationary data. Section 3 defines and describes the characteristics of migratory and producer-consumer data, and section 4 discusses how they can be measured. Section 5 presents the reference generator that uses this information, section 6 uses the generator to evaluate the parameters, and section 7 concludes the paper.

2 Background

This section briefly recapitulates the architectural assumptions and the characterisation of stationary data accesses originally presented in [4].

2.1 Architectural assumptions

We assume a shared memory multiprocessor consisting of N processing elements, each with a processor and cache memory. The processing elements share a common main memory which we assume can be accessed by all processing elements without contention. In order to concentrate on issues related to sharing behaviour we assume infinitely large caches. The cache memories are kept consistent by means of a directory-based, write-invalidate cache coherence protocol (see e.g. [1]). Each memory block has a directory that contains information on the identities of the processing elements holding a copy of the block. A coherent view of the memory is ensured by invalidating the copies of the block that are pointed out by the directory upon a write operation.

We have chosen to study the relation between sharing behaviour and the miss ratio for a class of directory-based protocols with i entries in the directory, 1 < i < N. This class of protocols is often called limited directory cache coherence protocols and is denoted Dir_i NB, where i is the number of entries in the directory and NB stands for No-Broadcast of coherence operations (in contrast to some protocols which rely on broadcasting) [1]. If the number of processors reading a memory block exceeds the number of entries, one of the existing copies is chosen to be invalidated to give space for the pointer to the new copy in the directory. This is called directory replacement.

Assuming infinitely large cache memories, there will be three types of cache misses for this protocol: (1) a cold miss is a cache miss experienced by the first access a processor makes to a block; (2) a coherence miss is a miss that is not a cold miss, where the block has been modified by some other processor since it was last valid in the cache; and (3) a directory replacement miss is a miss that is neither a cold nor a coherence miss and is caused by a directory replacement for the memory block. Note that the number of cache misses seen by a processor is completely determined by looking at block accesses, because miss detection and coherence actions are based on address tags at block level. We will therefore in the following only consider block accesses.

2.2 Characterisation of stationary data

The fundamental concept behind the characterisation of stationary data accesses (as for migratory and producer-consumer accesses, see section 3) is a classification of shared data blocks at regular time intervals according to the degree of sharing and whether blocks are modified or not. We assume a Single-Program-Multiple-Data programming model [6] and that the execution of a program can be viewed as a sequence of uniform time slots, so that the processors execute in lock-step one time slot at a time. In a single time slot, we assume that a shared memory block may be accessed by 0, 1, i, or N processors. A processor may in one time slot perform exactly one shared read reference, possibly followed by a write reference to the same block, in which case a read-modify-write sequence has been performed. If more than one processor performs read-modify-write sequences to a block, it is assumed that the read operations of all processors are carried out before any write operation is started.
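To make the three miss types concrete before continuing with the access classification, the sketch below labels block-level accesses under an infinite-cache Dir_i NB protocol. It is a minimal illustration written for this text, not the simulator used in the paper; the class name, its interface and the arbitrary choice of directory victim are assumptions.

```python
class DirINBMissClassifier:
    """Minimal sketch: label each block access under an infinite-cache
    Dir_i NB protocol as a hit (None) or a cold/coherence/replacement miss."""

    def __init__(self, num_entries):
        self.i = num_entries     # directory pointers per memory block
        self.blocks = {}         # block -> {"version", "seen", "valid"}

    def access(self, proc, block, is_write):
        st = self.blocks.setdefault(
            block, {"version": 0, "seen": {}, "valid": set()})
        miss = None
        if proc not in st["valid"]:
            if proc not in st["seen"]:
                miss = "cold"
            elif st["version"] > st["seen"][proc]:
                miss = "coherence"          # written by another processor since last valid
            else:
                miss = "replacement"        # copy was only lost to a directory replacement
            if len(st["valid"]) >= self.i:  # directory full: evict an arbitrary pointer
                st["valid"].discard(next(iter(st["valid"])))
            st["valid"].add(proc)
        if is_write:                        # write-invalidate: only the writer stays valid
            st["version"] += 1
            st["valid"] = {proc}
        st["seen"][proc] = st["version"]    # remember the version this copy reflects
        return miss
```

Feeding a block-level reference trace through `access` and counting the returned labels gives the three miss-ratio components used throughout the rest of the paper.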
There is no restriction on the number of instructions or private data references made within a time slot.

The number of cache misses seen by processors accessing a block depends on the degree of sharing and the access mode (read-only, read-modify-write). For read-only blocks, there will be directory replacement misses when the degree of sharing exceeds the number of entries, i, in the protocol directory. For read-modify-write blocks there will be an increasing number of coherence misses as the degree of sharing increases.

We classify a memory block, according to the accesses made in one time slot, into an access class c, where c ∈ C = {UNR, ROE, ROF, ROS, RWE, RWF, RWS}, as follows:

UNR Unreferenced. No processor accessed the block;
ROE Read Only, Exclusive. One processor has done a read, but no write operation, to the block;
ROF Read Only, Shared by few. i processors have each done a read, but no write operation, to the block;
ROS Read Only, Shared by many. N processors have each done a read, but no write operation, to the block;
RWE Read/Write, Exclusive. One processor has issued a read-modify-write sequence to the block;
RWF Read/Write, Shared by few. i processors have each issued a read-modify-write sequence to the block;
RWS Read/Write, Shared by many. N processors have each issued a read-modify-write sequence to the block.

Based on this classification we define the sharing profile as the set P = {P_c | c ∈ C}, where P_c is the fraction of references made to blocks in class c, averaged over all shared blocks and all time slots. The distribution in the sharing profile affects the number of cache misses during the execution. For example, the higher P_RWF and P_RWS, the more coherence misses there will be, and the higher P_ROS, the more directory replacement misses there will be.
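As a sketch of how this classification and the sharing profile could be computed from a trace, the following code classifies one block per time slot and accumulates P. The input format is invented for the example, and taking "few" to mean at most i sharers (with "many" anything above i, up to N) is an assumption about how a measurement tool would bin real programs that do not hit the exact 1/i/N sharer counts of the idealised model.

```python
from collections import Counter

CLASSES = ("UNR", "ROE", "ROF", "ROS", "RWE", "RWF", "RWS")

def classify(accesses, i):
    """Classify one block for one time slot.
    `accesses` maps processor -> (reads, writes) made to this block in the slot."""
    sharers = [p for p, (r, w) in accesses.items() if r or w]
    if not sharers:
        return "UNR"
    written = any(w for _, w in accesses.values())
    if len(sharers) == 1:
        return "RWE" if written else "ROE"
    if len(sharers) <= i:                      # assumption: "few" = at most i sharers
        return "RWF" if written else "ROF"
    return "RWS" if written else "ROS"         # "many" = more than i sharers

def sharing_profile(slots, i):
    """`slots`: one dict per time slot, block -> {proc: (reads, writes)}.
    Returns P = {P_c | c in C}: the fraction of shared references to each class."""
    refs = Counter()
    for slot in slots:
        for block, accesses in slot.items():
            c = classify(accesses, i)
            refs[c] += sum(r + w for r, w in accesses.values())
    total = sum(refs.values()) or 1
    return {c: refs[c] / total for c in CLASSES}
```

The block distribution B and the temporal granularities T defined next can be tallied in the same pass by remembering each block's first and consecutive classifications.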

We define the block distribution as B = {B_c | c ∈ C}, where B_c is the number of blocks that were classified in access class c in the first time slot they were accessed. Since the access classification reflects the degree of sharing, the block distribution contains information on the number of cold misses caused by a block. For instance, a block that is accessed in the ROS class will cause N cold misses. Finally, in order to capture the temporal locality of blocks we define the temporal granularity as T = {T_c | c ∈ C}, where T_c is the number of time slots a block stays in class c before being reclassified into some other class. In order to describe cold block accesses we define R_cold as the rate at which cold blocks are accessed.

For an access pattern which we refer to as stationary, it has been shown [4] that the parameters S = (P, B, T, R_cold) can be extracted from real programs executing on a simulated multiprocessor, and that they describe the access pattern in such detail that the number of cold, coherence and directory replacement misses under a Dir_i NB protocol can be predicted with high accuracy. The method by which the parameters are extracted is discussed in section 4. The stationary access category is defined as follows. A block that is accessed in the stationary access category is called a stationary block. A stationary block is classified in the same access class for all time slots in which it is accessed; we refer to this class as the stationary class of the block. A stationary block will thus alternate between the UNR class (when it is not accessed) and its stationary class. All blocks with stationary class c have the same temporal granularity, T_c, and all stationary blocks have the same temporal granularity, T_UNR, for the unreferenced class, c = UNR.

Even though some commonly used routines, such as matrix multiplication, access data in a stationary manner [4], this category is very limited in scope. The four-tuple S = (P, B, T, R_cold) cannot by itself describe the sharing behaviour in sufficient detail to recreate the cache miss ratio for a program in which the access pattern to shared blocks changes dynamically over time. Two examples of such dynamic access patterns are migratory and producer-consumer characterised data. For both of these categories the identity of the processor accessing a block exclusively may change, and for producer-consumer blocks the degree of sharing may change during the execution. Neither of these access categories is captured by the model of stationary data accesses. In the next section we extend the parameter set S = (P, B, T, R_cold) to also cover accesses to migratory and producer-consumer characterised data.

3 Migratory and producer-consumer data

3.1 Characterising migratory data

Migratory data has been found to be quite common in many real-life applications [8]. In fact, any program that uses locks to protect shared variables will result in migratory data. Loosely speaking, a migratory block is read and written by several processors during the entire execution, but always exclusively by one processor at a time. To formally define migratory data accesses, we again assume that the program execution consists of a number of time slots executed synchronously by all processors. In each time slot, a migratory block is either unreferenced (UNR) or accessed by processor i in a read-modify-write fashion, denoted RWE_i.
The following regular expression defines the access sequence to a migratory block:

(RWE_i+ UNR*)+ (RWE_j+ UNR*)+, where i ≠ j

In this expression, X* designates zero or more occurrences of X, whereas X+ designates at least one occurrence of X. The regular expression above thus indicates that a migratory block migrates at least once from processor i to processor j during the total execution.

In addition to the four-tuple S = (P, B, T, R_cold) that describes stationary data accesses, we need to capture information on the frequency and probability of block migrations. This is done in two parameters, denoted P'_migr and P''_migr, as follows. P'_migr is the probability that a block migrates when the block is classified as RWE after having been in the UNR class; P''_migr is the probability that a block migrates to a new processor after T_RWE (the temporal granularity) time slots. To conclude, migratory data accesses are described by the six-tuple S_1 = (P, B, T, R_cold, P'_migr, P''_migr).

3.2 Characterising producer-consumer data

Informally, producer-consumer accesses are characterised by one processor updating a block in a producer phase and an arbitrary number of processors reading the block in a non-overlapping consumer phase. Formally, we assume that the producer and consumer phases consist of an integral number of time slots, which enables us to define producer-consumer accesses with the following regular expression:

(RWE_i+ UNR*)+ UNR+ (ROX+ UNR*)+ UNR+

The first parenthesised term is the producer phase and the second is the consumer phase, with at least one UNR time slot in between. In this expression ROX represents one of the access classes ROE, ROF, or ROS.
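The two regular expressions can be checked directly against a block's sequence of per-slot classes. Below is a sketch of such a test under the lock-step assumptions above; the sequence encoding (pairs of class name and accessing processor) is invented for the example, and the producer-consumer test matches a single producer/consumer phase pair as written in the expression, leaving repetition over successive iterations to the surrounding tool.

```python
READ_ONLY = ("ROE", "ROF", "ROS")

def is_migratory(seq):
    """seq: one block's per-slot classes as (class, processor) pairs, where the
    processor is given for exclusive classes and None otherwise.  Migratory means
    every referenced slot is RWE by one processor at a time, with >= 1 migration."""
    writers = []
    for cls, proc in seq:
        if cls == "UNR":
            continue
        if cls != "RWE":
            return False                     # any shared or read-only slot disqualifies it
        if not writers or writers[-1] != proc:
            writers.append(proc)
    return len(writers) >= 2                 # the block changed owner at least once

def is_producer_consumer(seq):
    """Matches (RWE_i+ UNR*)+ UNR+ (ROX+ UNR*)+ : a producer phase of exclusive
    read-modify-writes by one processor, at least one unreferenced slot,
    then a read-only consumer phase."""
    producer, gap, phase = None, False, "producer"
    for cls, proc in seq:
        if cls == "UNR":
            gap = True
        elif phase == "producer" and cls == "RWE" and producer in (None, proc):
            producer, gap = proc, False      # still in the producer phase
        elif cls in READ_ONLY and producer is not None and gap:
            phase = "consumer"               # separator seen, consumer phase starts
        elif phase == "consumer" and cls in READ_ONLY:
            pass                             # consumer phase continues
        else:
            return False                     # e.g. a write after consumers started reading
    return phase == "consumer"
```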

To characterise producer-consumer data accesses, we have extended the four-tuple S = (P, B, T, R_cold) with the number of consumers, denoted nc. Moreover, unlike stationary blocks, producer-consumer blocks can be classified in different access classes over time, and information is needed on the probability of a transition of a block from one access class, c1, to another access class, c2. This is captured in a data structure called the mobility matrix, MM = {m_c1c2 | c1, c2 ∈ C}. Element m_c1c2 of the mobility matrix contains the probability of a reclassification of a block from class c1 to class c2, averaged over all analysis intervals. To conclude, producer-consumer data accesses are described by the six-tuple S_2 = (P, B, T, R_cold, nc, MM).

3.3 Characterising accesses in different categories

Although the primary purpose of the parameters introduced in the previous sections has been to characterise accesses to each category in isolation, we will present experimental results that validate the sharing behaviour for: a program with only migratory blocks; a program with only producer-consumer blocks; and a program with both stationary and migratory blocks. Since the last program contains blocks belonging to two access categories (stationary and migratory), we need to extend the model with a parameter designating the fraction of all blocks that belong to each category. This is given by D = {D_t | t ∈ {stationary, migratory, producer-consumer}}. Furthermore, we make the following simplifying assumptions: a block is stationary, migratory or producer-consumer throughout the entire execution, and the temporal granularity for access class c is the same for all data access categories.

4 Measuring sharing characteristics

We will now briefly discuss a method, used in a previous study [4], to extract the parameter set S = (P, B, T, R_cold) that characterises stationary data accesses in a real program. The method has been augmented to categorise blocks into the migratory or producer-consumer categories according to the regular expressions in the previous section. Blocks that are neither migratory nor producer-consumer are assumed to be stationary.

The program whose sharing behaviour is to be characterised executes on a simulated PRAM model [9]. All shared references are captured by an analysis tool as shown in figure 1. The simulation is interrupted at regular time intervals, called analysis intervals, at which each block is analysed and classified in one of the classes in C (see section 2.2) based on the number of reads and writes made to the block by each processor. The temporal granularity, T_c, of class c is measured as the mean number of consecutive analysis intervals during which a block remains classified in class c. The information on the first time a block is classified is used to calculate the block distribution, B, and R_cold. Finally, the number of references made to blocks in each class is used to calculate the sharing profile, P. The additional parameters needed to characterise migratory and producer-consumer accesses, P'_migr, P''_migr, nc, and MM, are easily calculated at the same time.

Figure 1: Measurement of sharing behaviour through the analysis of the references to the shared memory (the processors of the CacheMire test bench feed shared references to the analysis tool for data gathering and analysis).

Under the ideal assumption that the program execution conforms to the lock-step model, and that a block stays in a certain access class an integral number of time slots, the analysis interval should be the same as a time slot.
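As an illustration of this measurement step, the sketch below derives the temporal granularities T_c and the mobility matrix MM from each block's sequence of per-interval classifications. The data layout is invented for the example, and it counts only actual reclassifications when forming MM; how the authors' tool treats intervals in which a block keeps its class is not spelled out here.

```python
from collections import defaultdict

def granularity_and_mobility(class_seq_per_block):
    """class_seq_per_block: block -> list of classes, one per analysis interval.
    Returns (T, MM): T[c] is the mean number of consecutive intervals a block
    stays in class c; MM[c1][c2] is the probability that a block leaving class
    c1 is reclassified as c2."""
    run_lengths = defaultdict(list)                         # c -> lengths of stays in c
    transitions = defaultdict(lambda: defaultdict(int))     # c1 -> c2 -> count

    for seq in class_seq_per_block.values():
        current, length = None, 0
        for c in seq:
            if c == current:
                length += 1
            else:
                if current is not None:
                    run_lengths[current].append(length)
                    transitions[current][c] += 1
                current, length = c, 1
        if current is not None:
            run_lengths[current].append(length)             # close the final run

    T = {c: sum(runs) / len(runs) for c, runs in run_lengths.items()}
    MM = {c1: {c2: n / sum(row.values()) for c2, n in row.items()}
          for c1, row in transitions.items()}
    return T, MM
```

The migration probabilities P'_migr and P''_migr can be tallied in the same pass by additionally recording which processor owns a block in its RWE intervals.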
In reality, however, a program does not conform to this simplistic model, for the following reasons. Firstly, the completion time of read-modify-write operations may vary, which makes it difficult to define a uniform time slot. Secondly, it may also be difficult to synchronise the start of an analysis interval with the start of a time slot, leading to a skewing effect that can cause perturbation in the measurements.

For stationary data blocks, it suffices to select an analysis interval that is several time slots long. We then minimise the chance that an RWE block is classified as ROE in one analysis interval and RWE in another due to the skewing. For migratory and producer-consumer blocks, however, the duration of phases also has to be taken into account. If the analysis interval has the same duration as the phases, skewing can lead to perturbation. For example, if a block migrates between two processors in two successive phases, it may be classified as RWF due to this mismatch. However, under the assumption that a time slot is significantly shorter than the duration of a phase, e.g., the time between successive migrations or the duration of producer and consumer phases, the choice of analysis interval is not very critical.

5 A reference generator model

This section describes a reference generator that takes the set of parameters S = (P, B, T, R_cold, D, P'_migr, P''_migr, MM, nc) extracted from a program and generates a stream of references according to these parameters. The reference generator replaces the processors in architectural simulations. Therefore, we will in the following use the term processor to denote the generation of references corresponding to a particular processor.

A shared data block belongs to one of the access classes described in section 2.2. These classes are represented in the reference generator as lists whose elements correspond to memory blocks. Figure 2 shows the structure of the reference generator, in which an access class is represented by one or more lists depending on the number of processors that may share blocks in that class. The ROE and RWE classes are represented by N lists, one for each processor; the ROF and RWF classes are represented by N/i lists (i is the number of processor identity entries in the cache coherence directory); and the other classes by one list each, since these classes are common to all processors.

Figure 2: Structure of the model for artificial generation of shared memory references (a random number generator with the sharing profile distribution selects among the per-class block lists; recently referenced blocks are re-inserted at the end of the list they were taken from, and reclassification is invoked at regular time intervals).

The generation of references is synchronised so that all processors in a time slot will perform references to blocks in the same access class. For each time slot, one of the classes in C is selected with probability P_c from the sharing profile, and each processor will then perform a memory reference to a block belonging to this class. If one of the ROE or RWE classes was selected, the processors will perform references to different blocks. If the selected class was ROF or RWF, the processors will pair-wise access the same block, and if the selected class was ROS or RWS, all processors will access the same block. When a read/write class is selected, all processors generate the read references before any processor generates the entailing write reference to the same block.

A block is tagged as either stationary, migratory or producer-consumer according to the distribution given by the parameter D. The generation of block accesses is the same regardless of data category. At regular time intervals, based on the analysis intervals, the classification of blocks is evaluated, and if a block has remained in its class for a time corresponding to the temporal granularity of this class, it is reclassified into some other class. The reclassification is, however, different depending on the data category of the block. Stationary blocks simply alternate between the UNR class and their stationary class. There are two kinds of actions on migratory blocks. If the block was in the UNR class, it is reclassified in the RWE class; with probability P'_migr the block migrates to a new processor, otherwise it stays with the same processor that previously accessed it. If the block was in the RWE class, it remains in this class with probability P''_migr, but for a different processor.
Otherwise, it is reclassified as UNR. Producer-consumer blocks are reclassified into a new class according to the conditional probabilities given by the mobility matrix, MM. Each time a producer-consumer block is classified in the ROE class, the identity of the processor that will be accessing the block changes, until the block has been accessed by nc processors. The model needs to be supplied with the analysis interval used in measuring the sharing behaviour in order to perform the reclassifications at the correct intervals. The model also needs to know the value of i for ROF and RWF blocks in order to generate the correct degree of sharing for these blocks. Finally, the number of shared memory references to generate before termination has to be supplied.
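The per-slot generation step can be condensed into the following sketch. It is a simplification written for this text: list management, block reclassification and the cold-block rate R_cold are omitted, the parameter layout (`params["P"]`, `params["lists"]`) and function names are invented, and the per-class block lists are assumed to be non-empty.

```python
import random

def generate_references(params, num_refs, N, i, seed=0):
    """Condensed sketch of the generator's per-time-slot loop: pick an access
    class with probability P_c, then let every processor reference a block of
    that class, reads before writes for read-modify-write classes."""
    rng = random.Random(seed)
    classes, weights = zip(*params["P"].items())    # sharing profile P
    refs = []                                       # (processor, block, is_write)
    while len(refs) < num_refs:
        c = rng.choices(classes, weights)[0]
        if c == "UNR":
            continue                                # nothing referenced this slot
        blocks = params["lists"][c]                 # blocks currently in class c
        if c in ("ROE", "RWE"):                     # each processor gets its own block
            chosen = [blocks[p % len(blocks)] for p in range(N)]
        elif c in ("ROF", "RWF"):                   # groups of i processors share a block
            chosen = [blocks[(p // i) % len(blocks)] for p in range(N)]
        else:                                       # ROS / RWS: all processors share one
            chosen = [blocks[0]] * N
        refs += [(p, chosen[p], False) for p in range(N)]       # all reads first
        if c.startswith("RW"):
            refs += [(p, chosen[p], True) for p in range(N)]    # then the writes
    return refs[:num_refs]
```

Feeding the generated tuples into a Dir_i NB miss classifier, like the sketch in section 2, reproduces the three miss-ratio components that are compared with the real program in the next section.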

In the next section we will show that this reference generator, given the characteristics of migratory and producer-consumer data for a program, generates a stream of references that accurately mimics the real program with respect to the cache miss ratio.

6 Evaluation of the access characterisation

In order to verify that the suggested parameters describe the access pattern with respect to miss ratios in directory-based, write-invalidate cache coherence protocols, we have compared the miss ratios observed when the reference generator is used with the miss ratios seen when the real program is executed. Three different synthetic programs and the S.O.R. kernel were used as workloads in the evaluation.

6.1 Experimental methodology

The CacheMire test bench has been used as a framework to carry out the evaluation of the model [2]. Figure 3 shows the setup with a simulator for the Dir_i NB class of cache coherence protocols. The protocol simulator classifies each cache miss as either a cold, coherence, or directory replacement miss according to the definitions in section 2.1. The miss ratio is calculated as the ratio between the number of cache misses and the number of all shared data references. References to instructions and private data are thus not included in the miss ratio calculation. The sharing behaviour was measured for the different programs with varying analysis intervals and, in some cases, with varying block sizes from 4 bytes to 64 bytes. The default block size is 4 bytes. We have used 8 processors and the Dir_2 NB cache coherence protocol in all experiments.

Figure 3: The miss ratios observed by the artificial reference stream are compared with the real miss ratios (program code on SPARC processor simulators and the model's memory reference generators, driven by the sharing behaviour parameters, both feed the cache coherence protocol simulator).

6.2 The synthetic programs Synth-A, -B, -C

The synthetic program has been designed to produce specific shared data access patterns. It can be set up to access data in any applicable access class of all three categories. The shared data structures consist of integer arrays with consecutive accesses to the elements. Three variants of the synthetic program were used:

(A) References are made to migratory data only. One array of 512 elements is shared by all processors. Processor i accesses elements 64i to 64(i + 1) - 1 in sequence for a fixed number of processor cycles, after which these elements are accessed by processor j = (i + 1) mod 8.

(B) References are made to stationary data and migratory data with equal probability. The stationary data consists of data in the ROS and RWE classes. This program contains the same migratory behaviour as Synth-A, but with a longer migration interval due to the accesses to stationary data; the iteration interval is correspondingly longer. The stationary data structures consist of one array of 512 elements subdivided among the eight processors for RWE accesses. There is also one array with 64 elements that is accessed by all processors for ROS accesses.

(C) Only data in the producer-consumer category are accessed. This program contains a producer phase and a consumer phase, each of approximately fixed length in processor cycles. The data structure is an integer array of 512 elements.
Processor i updates elements 64i to 64(i + 1) - 1 in the producer phase. The consumer phase can take one of two alternatives: (1) every tenth iteration, all processors read elements 0 to 63 of the array; or (2) in nine out of ten iterations, processors i and i+1 read the elements of each other's subvectors which were updated during the producer phase. The blocks containing these elements will thus be classified as ROF. The access pattern of Synth-A, -B, and -C as described above is repeated for 50 iterations.

In the Synth-A program there will be a coherence miss for each block a processor accesses between migrations, since all blocks were updated by some processor other than the current one. Cold misses are particularly easy to model when there are only migratory data: there will be exactly one cold miss per block and processor. Since there are at most two cached copies of a migratory block (between the read operation and the write operation issued by a processor), there will be no directory replacement misses.

Synth-B accesses data from both the stationary and the migratory data categories. The number of coherence misses will be the same as for Synth-A, which has the same data structure and access pattern to migratory data.

The coherence miss ratio will, however, be lower for the Synth-B program since this program has more shared references. The Synth-C program will experience coherence misses in the consumer phase when the consumers read a block which has been updated during the producer phase. There will also be some directory replacement misses, since all processors access the same elements every tenth consumer phase.

All Synth programs can utilise the prefetch effect of a larger block size to reduce the cold and coherence miss ratios, because the shared data is organised in arrays with consecutive accesses to the elements. The directory replacement miss ratio should, however, not be affected by the cache block size. These misses are caused by accesses to ROS blocks, and when many processors are reading a block, most of them will see a miss because of the limited directory size. There is no prefetch effect on these blocks, since an increase in the block size will not bring more information into the cache.

Figure 4 shows the miss ratios for Synth-A, -B, and -C when the cache block size was varied from 4 to 64 bytes. The effects on the miss ratio discussed above can be seen for the three programs, and we can also note that the trends are closely followed by the model.

Figure 4: Miss ratios (cold, coherence and directory replacement misses) for the modelling of the synthetic programs as a function of the block size: (a) Synth-A, only migratory data; (b) Synth-B, stationary and migratory data; (c) Synth-C, only producer-consumer data. Analysis interval: 5000 processor cycles.

The choice of analysis interval is mainly dictated by the time between migrations for migratory data, and by the length of producer and consumer phases for producer-consumer data. However, for the practical reasons described in section 5, the analysis interval cannot be as short as a time slot. We have used an analysis interval of 5000 processor cycles for all three programs. Although it is not part of this study, we have observed that there is a relatively large range in which the analysis interval can be chosen to produce a good characterisation of migratory and producer-consumer data accesses.

In summary, the experiments reported here show that the set of parameters described in section 3 suffices to describe the sharing behaviour of accesses to migratory and producer-consumer data. The next section shows results for S.O.R., a program kernel used in many scientific applications.

6.3 The S.O.R. program kernel

The Successive Over Relaxation program kernel is a program with a dominating amount of producer-consumer data. The main shared data structure consists of a matrix whose elements are iteratively updated. In the parallel implementation of S.O.R., each iteration is split into two phases. In the first phase, all elements in even positions (the sum of the indices is even) are updated; in doing so, the values of the neighbouring elements in odd positions are read. The second phase performs the same operation on the elements in odd positions. The work is divided so that each processor works on a sub-matrix and communicates with other processors only at the borders between the sub-matrices. We have measured the sharing behaviour of S.O.R. with a matrix of randomly initialised elements.
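The two-phase (red-black) update just described can be sketched as follows. This is a generic sequential illustration, not the code used in the paper; in the parallel version each processor would apply the same two phases to its own sub-matrix and exchange border elements with its neighbours. The relaxation factor `omega` is an assumed parameter.

```python
import numpy as np

def sor_iteration(grid, omega=1.5):
    """One red-black S.O.R. iteration over a 2-D grid: first update elements
    whose index sum is even (reading odd neighbours), then the odd ones."""
    def relax(parity):
        for r in range(1, grid.shape[0] - 1):
            for c in range(1, grid.shape[1] - 1):
                if (r + c) % 2 != parity:
                    continue
                neighbour_avg = (grid[r - 1, c] + grid[r + 1, c] +
                                 grid[r, c - 1] + grid[r, c + 1]) / 4.0
                grid[r, c] += omega * (neighbour_avg - grid[r, c])
    relax(0)   # phase 1: even positions
    relax(1)   # phase 2: odd positions
    return grid
```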
Each of the eight processors works on a 4 × 8 submatrix with 32 elements, of which 12 are exclusively accessed by one processor, 16 are shared with one neighbour and 4 are shared between 4 neighbours. The time between two iterations was measured to be around 5100 processor cycles. Figure 5 shows a comparison of the miss ratios between the model and the execution of the program itself for an analysis interval of 1000 processor cycles. As can be seen, there is a very good correspondence for all three types of cache misses, which shows that the set of parameters characterising producer-consumer data suffices to capture the characteristics needed to predict cache miss ratios.

Figure 5: Comparison of miss ratios (cold, coherence and directory replacement misses) between the model and the program for the S.O.R. kernel. Analysis interval: 1000 processor cycles.

When the analysis interval is larger than 1500 processor cycles, the model does not work, since the analysis of shared references is then made over two (or more) iterations, which gives a distorted view of the real sharing behaviour.

To conclude this section, we have shown that the model presented in this paper, consisting of the reference generator and the set of parameters describing sharing behaviour, indeed models the sharing behaviour generated by data in the migratory and producer-consumer categories. The cache miss ratios observed when the reference generator is used are approximately the same as the real values caused by the programs themselves.

7 Conclusions

We have in this paper presented a model of accesses to migratory and producer-consumer data in a class of shared memory multiprocessor programs. The model consists of a set of parameters characterising the sharing behaviour with respect to cache miss ratios for directory-based, write-invalidate cache coherence protocols. It has been shown that the model can be used to accurately predict cold, coherence and directory replacement miss ratios for migratory and producer-consumer accesses.

The main contribution of this paper is the identification of the characteristic information that essentially captures the access pattern to migratory and producer-consumer data. An additional contribution is the design and implementation of a reference generator that makes use of this information to generate a reference stream that mimics the references made by the processors. We have achieved the goal of the model, namely to understand the relation between migratory and producer-consumer data accesses and the performance of the cache coherence mechanism. The assumptions of course contain several simplifications and limitations that will be the focus of future research.

Acknowledgements

The research in this paper was supported by the Swedish National Board for Industrial and Technical Development (NUTEK) under contract number P855.

References

[1] A. Agarwal, R. Simoni, J. Hennessy and M. Horowitz. An Evaluation of Directory Schemes for Cache Coherence. In Proceedings of the 15th Annual International Symposium on Computer Architecture, 1988.
[2] M. Brorsson, F. Dahlgren, H. Nilsson and P. Stenström. The CacheMire Test Bench - A Flexible and Effective Approach for Simulation of Multiprocessors. In Proceedings of the 26th Annual Simulation Symposium, Washington, DC, March 1993.
[3] M. Brorsson and P. Stenström. Visualising Sharing Behaviour in Relation to Shared Memory Management. In Proceedings of the 1992 International Conference on Parallel and Distributed Systems, Hsinchu, Taiwan, December 1992.
[4] M. Brorsson and P. Stenström. Modelling Accesses to Stationary Data in a Shared Memory Multiprocessor. In Proceedings of the 7th International Conference on Parallel and Distributed Computing Systems, Las Vegas, NV, October 1994. To appear.
[5] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. Operating Systems Review, 25(5), October 1991 (Proceedings of the 13th ACM Symposium on Operating Systems Principles).
[6] F. Darema, D. A. George, V. A. Norton, and G. F. Pfister.
A Single-Program-Multiple-Data Computational Model for EPEX/FORTRAN. Parallel Computing, 7(1):11-24, April 1988.
[7] A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber. Comparative Evaluation of Latency Reducing and Tolerating Techniques. In Proceedings of the 18th Annual International Symposium on Computer Architecture, Toronto, Canada, May 1991.
[8] A. Gupta and W.-D. Weber. Cache Invalidation Patterns in Shared-Memory Multiprocessors. IEEE Transactions on Computers, 41(7), July 1992.
[9] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. In Proceedings of the 10th ACM Symposium on Theory of Computing, 1978.
[10] M. S. Lam, E. E. Rothberg and M. E. Wolf. The Cache Performance and Optimizations of Blocked Algorithms. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.
[11] J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. Computer Architecture News, 20(1):5-44, March 1992.
[12] P. Stenström, M. Brorsson and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, San Diego, California, May 1993.


More information

Multi-Processor / Parallel Processing

Multi-Processor / Parallel Processing Parallel Processing: Multi-Processor / Parallel Processing Originally, the computer has been viewed as a sequential machine. Most computer programming languages require the programmer to specify algorithms

More information

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily

More information

MULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration

MULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration MULTIPROCESSORS Characteristics of Multiprocessors Interconnection Structures Interprocessor Arbitration Interprocessor Communication and Synchronization Cache Coherence 2 Characteristics of Multiprocessors

More information

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis Meshal Shutaywi and Nezamoddin N. Kachouie Department of Mathematical Sciences, Florida Institute of Technology Abstract

More information

2 Discrete Dynamic Systems

2 Discrete Dynamic Systems 2 Discrete Dynamic Systems This chapter introduces discrete dynamic systems by first looking at models for dynamic and static aspects of systems, before covering continuous and discrete systems. Transition

More information

CSCI 5454 Ramdomized Min Cut

CSCI 5454 Ramdomized Min Cut CSCI 5454 Ramdomized Min Cut Sean Wiese, Ramya Nair April 8, 013 1 Randomized Minimum Cut A classic problem in computer science is finding the minimum cut of an undirected graph. If we are presented with

More information

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors

More information

Adaptive Data Dissemination in Mobile ad-hoc Networks

Adaptive Data Dissemination in Mobile ad-hoc Networks Adaptive Data Dissemination in Mobile ad-hoc Networks Joos-Hendrik Böse, Frank Bregulla, Katharina Hahn, Manuel Scholz Freie Universität Berlin, Institute of Computer Science, Takustr. 9, 14195 Berlin

More information

Fault tolerant TTCAN networks

Fault tolerant TTCAN networks Fault tolerant TTCAN networks B. MŸller, T. FŸhrer, F. Hartwich, R. Hugel, H. Weiler, Robert Bosch GmbH TTCAN is a time triggered layer using the CAN protocol to communicate in a time triggered fashion.

More information

The Cache Write Problem

The Cache Write Problem Cache Coherency A multiprocessor and a multicomputer each comprise a number of independent processors connected by a communications medium, either a bus or more advanced switching system, such as a crossbar

More information

Removing Belady s Anomaly from Caches with Prefetch Data

Removing Belady s Anomaly from Caches with Prefetch Data Removing Belady s Anomaly from Caches with Prefetch Data Elizabeth Varki University of New Hampshire varki@cs.unh.edu Abstract Belady s anomaly occurs when a small cache gets more hits than a larger cache,

More information

Cycle accurate transaction-driven simulation with multiple processor simulators

Cycle accurate transaction-driven simulation with multiple processor simulators Cycle accurate transaction-driven simulation with multiple processor simulators Dohyung Kim 1a) and Rajesh Gupta 2 1 Engineering Center, Google Korea Ltd. 737 Yeoksam-dong, Gangnam-gu, Seoul 135 984, Korea

More information

Introduction to Computing and Systems Architecture

Introduction to Computing and Systems Architecture Introduction to Computing and Systems Architecture 1. Computability A task is computable if a sequence of instructions can be described which, when followed, will complete such a task. This says little

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

Shared Symmetric Memory Systems

Shared Symmetric Memory Systems Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

CS377P Programming for Performance Multicore Performance Cache Coherence

CS377P Programming for Performance Multicore Performance Cache Coherence CS377P Programming for Performance Multicore Performance Cache Coherence Sreepathi Pai UTCS October 26, 2015 Outline 1 Cache Coherence 2 Cache Coherence Awareness 3 Scalable Lock Design 4 Transactional

More information

Flash Drive Emulation

Flash Drive Emulation Flash Drive Emulation Eric Aderhold & Blayne Field aderhold@cs.wisc.edu & bfield@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison Abstract Flash drives are becoming increasingly

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

Hashing. Hashing Procedures

Hashing. Hashing Procedures Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements

More information

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

Modeling Systems Using Design Patterns

Modeling Systems Using Design Patterns Modeling Systems Using Design Patterns Jaroslav JAKUBÍK Slovak University of Technology Faculty of Informatics and Information Technologies Ilkovičova 3, 842 16 Bratislava, Slovakia jakubik@fiit.stuba.sk

More information

CSC 631: High-Performance Computer Architecture

CSC 631: High-Performance Computer Architecture CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 10: Memory Part II CSC 631: High-Performance Computer Architecture 1 Two predictable properties of memory references: Temporal Locality:

More information

Computer Architecture, EIT090 exam

Computer Architecture, EIT090 exam Department of Information Technology Lund University Computer Architecture, EIT090 exam 15-12-2004 I. Problem 1 (15 points) Briefly (1-2 sentences) describe the following items/concepts concerning computer

More information

Comparing the Parix and PVM parallel programming environments

Comparing the Parix and PVM parallel programming environments Comparing the Parix and PVM parallel programming environments A.G. Hoekstra, P.M.A. Sloot, and L.O. Hertzberger Parallel Scientific Computing & Simulation Group, Computer Systems Department, Faculty of

More information

Automatic visual recognition for metro surveillance

Automatic visual recognition for metro surveillance Automatic visual recognition for metro surveillance F. Cupillard, M. Thonnat, F. Brémond Orion Research Group, INRIA, Sophia Antipolis, France Abstract We propose in this paper an approach for recognizing

More information

Probabilistic Worst-Case Response-Time Analysis for the Controller Area Network

Probabilistic Worst-Case Response-Time Analysis for the Controller Area Network Probabilistic Worst-Case Response-Time Analysis for the Controller Area Network Thomas Nolte, Hans Hansson, and Christer Norström Mälardalen Real-Time Research Centre Department of Computer Engineering

More information

"is.n21.jiajia" "is.n21.nautilus" "is.n22.jiajia" "is.n22.nautilus"

is.n21.jiajia is.n21.nautilus is.n22.jiajia is.n22.nautilus A Preliminary Comparison Between Two Scope Consistency DSM Systems: JIAJIA and Nautilus Mario Donato Marino, Geraldo Lino de Campos Λ Computer Engineering Department- Polytechnic School of University of

More information

X-RAY: A Non-Invasive Exclusive Caching Mechanism for RAIDs

X-RAY: A Non-Invasive Exclusive Caching Mechanism for RAIDs : A Non-Invasive Exclusive Caching Mechanism for RAIDs Lakshmi N. Bairavasundaram, Muthian Sivathanu, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau Computer Sciences Department, University of Wisconsin-Madison

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

A Reconfigurable Cache Design for Embedded Dynamic Data Cache

A Reconfigurable Cache Design for Embedded Dynamic Data Cache I J C T A, 9(17) 2016, pp. 8509-8517 International Science Press A Reconfigurable Cache Design for Embedded Dynamic Data Cache Shameedha Begum, T. Vidya, Amit D. Joshi and N. Ramasubramanian ABSTRACT Applications

More information

Near Memory Key/Value Lookup Acceleration MemSys 2017

Near Memory Key/Value Lookup Acceleration MemSys 2017 Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy

More information

Hybrid Limited-Pointer Linked-List Cache Directory and Cache Coherence Protocol

Hybrid Limited-Pointer Linked-List Cache Directory and Cache Coherence Protocol Hybrid Limited-Pointer Linked-List Cache Directory and Cache Coherence Protocol Mostafa Mahmoud, Amr Wassal Computer Engineering Department, Faculty of Engineering, Cairo University, Cairo, Egypt {mostafa.m.hassan,

More information

Operating Systems. Steven Hand. Michaelmas / Lent Term 2008/ lectures for CST IA. Handout 4. Operating Systems

Operating Systems. Steven Hand. Michaelmas / Lent Term 2008/ lectures for CST IA. Handout 4. Operating Systems Operating Systems Steven Hand Michaelmas / Lent Term 2008/09 17 lectures for CST IA Handout 4 Operating Systems N/H/MWF@12 I/O Hardware Wide variety of devices which interact with the computer via I/O:

More information

FOUR EDGE-INDEPENDENT SPANNING TREES 1

FOUR EDGE-INDEPENDENT SPANNING TREES 1 FOUR EDGE-INDEPENDENT SPANNING TREES 1 Alexander Hoyer and Robin Thomas School of Mathematics Georgia Institute of Technology Atlanta, Georgia 30332-0160, USA ABSTRACT We prove an ear-decomposition theorem

More information

Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations

Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations Lecture 12: TM, Consistency Models Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations 1 Paper on TM Pathologies (ISCA 08) LL: lazy versioning, lazy conflict detection, committing

More information

Latency Hiding on COMA Multiprocessors

Latency Hiding on COMA Multiprocessors Latency Hiding on COMA Multiprocessors Tarek S. Abdelrahman Department of Electrical and Computer Engineering The University of Toronto Toronto, Ontario, Canada M5S 1A4 Abstract Cache Only Memory Access

More information

Data Distribution, Migration and Replication on a cc-numa Architecture

Data Distribution, Migration and Replication on a cc-numa Architecture Data Distribution, Migration and Replication on a cc-numa Architecture J. Mark Bull and Chris Johnson EPCC, The King s Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland,

More information

OPERATING SYSTEMS: Lesson 1: Introduction to Operating Systems

OPERATING SYSTEMS: Lesson 1: Introduction to Operating Systems OPERATING SYSTEMS: Lesson 1: Introduction to Jesús Carretero Pérez David Expósito Singh José Daniel García Sánchez Francisco Javier García Blas Florin Isaila 1 Why study? a) OS, and its internals, largely

More information