Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared Memory Multiprocessor


In Proceedings of the 6th IEEE Symposium on Parallel and Distributed Processing, Dallas, October 1994.

Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared Memory Multiprocessor

Mats Brorsson and Per Stenström
Department of Computer Engineering, Lund University, P.O. Box 118, S-221 00 Lund, Sweden

Abstract

Directory-based, write-invalidate cache coherence protocols are effective in reducing memory latencies but suffer from cache misses due to coherence actions. It is therefore important to understand the nature of the data sharing that causes misses for this class of protocols. In this paper we identify a set of parameters that characterises the accesses to migratory and producer-consumer data in sufficient detail to predict the number of cache misses in directory-based, write-invalidate protocols. We show that the parameters can be extracted from real programs and used as input to a reference generator that artificially generates a stream of references yielding accurate estimates of cold, coherence and directory replacement misses compared to the program itself.

1 Introduction

The problem of cache consistency is one of the most performance-limiting factors for shared-memory multiprocessors. Depending on the application, the cost of maintaining a coherent view of the memory may easily account for half of the execution time or more [7]. It is not difficult to verify that the shared memory access pattern has a dramatic effect on the performance of the cache memory subsystem, and this makes it very important to understand the nature of data sharing. We loosely refer to the program properties relating to the sharing of data as the sharing behaviour of a program.

The use of workload models is an invaluable tool to understand performance issues related to program behaviour. In an earlier paper [4] we presented a model of an access pattern we refer to as stationary data accesses (this access category is defined in section 2.2). The results show that the miss ratios for a program with only stationary data, such as parallel matrix multiplication, can be predicted with high accuracy for cache block sizes ranging from 4 to 64 bytes. The fundamental concept behind this model is the identification of a number of parameters that characterise stationary data accesses. The characterisation of the sharing behaviour is based on the observation that a shared data object can be classified in one of a small number of classes during some time interval, depending on the access mode (read-only or read/write accesses) and the degree of sharing (number of sharers) of the accesses made to the object.

This paper extends the previous model to cover data that are migratory or accessed in a producer-consumer fashion. Migratory data may be accessed by many processors, but only by one processor at a time. Previous studies have shown that migratory data is a common source of cache invalidations [8, 12], and it is thus important to characterise and model this behaviour. Variables which are accessed in a producer-consumer fashion are also quite common in many applications, especially in algorithms such as the Successive Over Relaxation (S.O.R.) program kernel, in which a processor works on a submatrix and communicates with the processors working on the nearest-neighbouring submatrices. Migratory and producer-consumer data are defined in detail in section 3.
In order to cope with migratory and producer-consumer data, the parameter set representing the sharing behaviour has been augmented with parameters describing the accesses to data in these categories. We have verified that the extended set of parameters can be used to accurately predict the cache miss ratios of directory-based, write-invalidate protocols by means of a reference generator that artificially generates shared memory references based on the sharing behaviour parameters extracted from a program. The generated reference stream is fed into a simulated cache coherence protocol. The resulting cache miss ratio, subdivided into cold, coherence and directory replacement miss ratios, is compared with the results from an actual execution of the program on simulated processors. The results show that with a limited set of parameters it is possible to capture the miss ratio components caused by migratory and producer-consumer data objects.

Section 2 contains background information on the architectural assumptions and the characterisation of accesses to stationary data. Section 3 defines and describes the characteristics of migratory and producer-consumer data, and section 4 discusses how they can be measured. Section 5 presents the reference generator that uses this information, section 6 uses the generator to evaluate the parameters, and section 7 concludes the paper.

2 Background

This section briefly recapitulates the architectural assumptions and the characterisation of stationary data accesses originally presented in [4].

2.1 Architectural assumptions

We assume a shared memory multiprocessor consisting of N processing elements, each with a processor and cache memory. The processing elements share a common main memory which we assume can be accessed by all processing elements without contention. In order to concentrate on issues related to sharing behaviour we assume infinitely large caches. The cache memories are kept consistent by means of a directory-based, write-invalidate cache coherence protocol (see e.g. [1]). Each memory block has a directory that contains information on the identities of the processing elements holding a copy of the block. A coherent view of the memory is ensured by invalidating the copies of the block that are pointed out by the directory upon a write operation.

We have chosen to study the relation between sharing behaviour and the miss ratio for a class of directory-based protocols with i entries in the directory, 1 < i < N. This class of protocols is often called limited directory cache coherence protocols and is denoted Dir_i NB, where i is the number of entries in the directory and NB stands for No-Broadcast of coherence operations (in contrast to some protocols which rely on broadcasting) [1]. If the number of processors reading a memory block exceeds the number of entries, one of the existing copies is chosen to be invalidated to give space for the pointer to the new copy in the directory. This is called directory replacement.

Assuming infinitely large cache memories, there will be three types of cache misses for this protocol: (1) a cold miss is a cache miss experienced by the first access a processor makes to a block; (2) a coherence miss is a miss that is not a cold miss, where the block has been modified by some other processor since it was last valid in the cache; and (3) a directory replacement miss is a miss that is neither a cold nor a coherence miss and is caused by a directory replacement for the memory block. Note that the number of cache misses seen by a processor is completely determined by looking at block accesses, because miss detection and coherence actions are based on address tags at block level. We will therefore in the following only consider block accesses.

2.2 Characterisation of stationary data

The fundamental concept behind the characterisation of stationary data accesses (as for migratory and producer-consumer accesses, see section 3) is a classification of shared data blocks at regular time intervals according to the degree of sharing and whether blocks are modified or not. We assume a Single-Program-Multiple-Data programming model [6] and that the execution of a program can be viewed as a sequence of uniform time slots, so that the processors execute in lock-step one time slot at a time. In a single time slot, we assume that a shared memory block may be accessed by 0, 1, i, or N processors. A processor may in one time slot perform exactly one shared read reference, possibly followed by a write reference to the same block, in which case a read-modify-write sequence has been performed. If more than one processor performs read-modify-write sequences to a block, it is assumed that the read operations of all processors are carried out before any write operation is started.
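To make the three miss types concrete before continuing with the access classification, the sketch below labels block-level accesses under an infinite-cache Dir_i NB protocol. It is a minimal illustration written for this text, not the simulator used in the paper; the class name, its interface and the arbitrary choice of directory victim are assumptions.

```python
class DirINBMissClassifier:
    """Minimal sketch: label each block access under an infinite-cache
    Dir_i NB protocol as a hit (None) or a cold/coherence/replacement miss."""

    def __init__(self, num_entries):
        self.i = num_entries     # directory pointers per memory block
        self.blocks = {}         # block -> {"version", "seen", "valid"}

    def access(self, proc, block, is_write):
        st = self.blocks.setdefault(
            block, {"version": 0, "seen": {}, "valid": set()})
        miss = None
        if proc not in st["valid"]:
            if proc not in st["seen"]:
                miss = "cold"
            elif st["version"] > st["seen"][proc]:
                miss = "coherence"          # written by another processor since last valid
            else:
                miss = "replacement"        # copy was only lost to a directory replacement
            if len(st["valid"]) >= self.i:  # directory full: evict an arbitrary pointer
                st["valid"].discard(next(iter(st["valid"])))
            st["valid"].add(proc)
        if is_write:                        # write-invalidate: only the writer stays valid
            st["version"] += 1
            st["valid"] = {proc}
        st["seen"][proc] = st["version"]    # remember the version this copy reflects
        return miss
```

Feeding a block-level reference trace through `access` and counting the returned labels gives the three miss-ratio components used throughout the rest of the paper.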
There is no restriction on the number of instructions or private data references made within a time slot.

The number of cache misses seen by processors accessing a block depends on the degree of sharing and the access mode (read-only, read-modify-write). For read-only blocks, there will be directory replacement misses when the degree of sharing exceeds the number of entries, i, in the protocol directory. For read-modify-write blocks there will be an increasing number of coherence misses as the degree of sharing increases.

We classify a memory block, according to the accesses made in one time slot, into an access class c, where c ∈ C = {UNR, ROE, ROF, ROS, RWE, RWF, RWS}, as follows:

UNR Unreferenced. No processor accessed the block;
ROE Read Only, Exclusive. One processor has done a read, but no write operation, to the block;
ROF Read Only, Shared by few. i processors have each done a read, but no write operation, to the block;
ROS Read Only, Shared by many. N processors have each done a read, but no write operation, to the block;
RWE Read/Write, Exclusive. One processor has issued a read-modify-write sequence to the block;
RWF Read/Write, Shared by few. i processors have each issued a read-modify-write sequence to the block;
RWS Read/Write, Shared by many. N processors have each issued a read-modify-write sequence to the block.

Based on this classification we define the sharing profile as the set P = {P_c | c ∈ C}, where P_c is the fraction of references made to blocks in class c, averaged over all shared blocks and all time slots. The distribution in the sharing profile affects the number of cache misses during the execution. For example, the higher P_RWF and P_RWS, the more coherence misses there will be, and the higher P_ROS, the more directory replacement misses there will be.
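As a sketch of how this classification and the sharing profile could be computed from a trace, the following code classifies one block per time slot and accumulates P. The input format is invented for the example, and taking "few" to mean at most i sharers (with "many" anything above i, up to N) is an assumption about how a measurement tool would bin real programs that do not hit the exact 1/i/N sharer counts of the idealised model.

```python
from collections import Counter

CLASSES = ("UNR", "ROE", "ROF", "ROS", "RWE", "RWF", "RWS")

def classify(accesses, i):
    """Classify one block for one time slot.
    `accesses` maps processor -> (reads, writes) made to this block in the slot."""
    sharers = [p for p, (r, w) in accesses.items() if r or w]
    if not sharers:
        return "UNR"
    written = any(w for _, w in accesses.values())
    if len(sharers) == 1:
        return "RWE" if written else "ROE"
    if len(sharers) <= i:                      # assumption: "few" = at most i sharers
        return "RWF" if written else "ROF"
    return "RWS" if written else "ROS"         # "many" = more than i sharers

def sharing_profile(slots, i):
    """`slots`: one dict per time slot, block -> {proc: (reads, writes)}.
    Returns P = {P_c | c in C}: the fraction of shared references to each class."""
    refs = Counter()
    for slot in slots:
        for block, accesses in slot.items():
            c = classify(accesses, i)
            refs[c] += sum(r + w for r, w in accesses.values())
    total = sum(refs.values()) or 1
    return {c: refs[c] / total for c in CLASSES}
```

The block distribution B and the temporal granularities T defined next can be tallied in the same pass by remembering each block's first and consecutive classifications.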

We define the block distribution as B = {B_c | c ∈ C}, where B_c is the number of blocks that were classified in access class c in the first time slot they were accessed. Since the access classification reflects the degree of sharing, the block distribution contains information on the number of cold misses caused by a block. For instance, a block that is accessed in the ROS class will cause N cold misses. Finally, in order to capture the temporal locality of blocks we define the temporal granularity as T = {T_c | c ∈ C}, where T_c is the number of time slots a block stays in class c before being reclassified into some other class. In order to describe cold block accesses we define R_cold as the rate at which cold blocks are accessed.

For an access pattern which we refer to as stationary, it has been shown [4] that the parameters S = (P, B, T, R_cold) can be extracted from real programs executing on a simulated multiprocessor, and that they describe the access pattern in such detail that the number of cold, coherence and directory replacement misses under a Dir_i NB protocol can be predicted with high accuracy. The method by which the parameters are extracted is discussed in section 4. The stationary access category is defined as follows. A block that is accessed in the stationary access category is called a stationary block. A stationary block is classified in the same access class for all time slots in which it is accessed; we refer to this class as the stationary class of the block. A stationary block will thus alternate between the UNR class (when it is not accessed) and its stationary class. All blocks with stationary class c have the same temporal granularity, T_c, and all stationary blocks have the same temporal granularity, T_UNR, for the unreferenced class, c = UNR.

Even though some commonly used routines, such as matrix multiplication, access data in a stationary manner [4], this category is very limited in scope. The four-tuple S = (P, B, T, R_cold) cannot by itself describe the sharing behaviour in sufficient detail to recreate the cache miss ratio for a program in which the access pattern to shared blocks changes dynamically over time. Two examples of such dynamic access patterns are migratory and producer-consumer characterised data. For both of these categories the identity of the processor accessing a block exclusively may change, and for producer-consumer blocks the degree of sharing may change during the execution. Neither of these access categories is captured by the model of stationary data accesses. In the next section we extend the parameter set S = (P, B, T, R_cold) to also cover accesses to migratory and producer-consumer characterised data.

3 Migratory and producer-consumer data

3.1 Characterising migratory data

Migratory data has been found to be quite common in many real-life applications [8]. In fact, any program that uses locks to protect shared variables will result in migratory data. Loosely speaking, a migratory block is read and written by several processors during the entire execution, but always exclusively by one processor at a time. To formally define migratory data accesses, we again assume that the program execution consists of a number of time slots executed synchronously by all processors. In each time slot, a migratory block is either unreferenced (UNR) or accessed by processor i in a read-modify-write fashion, denoted RWE_i.
The following regular expression defines the access sequence to a migratory block:

(RWE_i+ UNR*)+ (RWE_j+ UNR*)+, where i ≠ j

In this expression, X* designates zero or more occurrences of X, whereas X+ designates at least one occurrence of X. The regular expression above thus indicates that a migratory block migrates at least once from processor i to processor j during the total execution.

In addition to the four-tuple S = (P, B, T, R_cold) that describes stationary data accesses, we need to capture information on the frequency and probability of block migrations. This is done in two parameters, denoted P'_migr and P''_migr, as follows. P'_migr is the probability that a block migrates when the block is classified as RWE after having been in the UNR class; P''_migr is the probability that a block migrates to a new processor after T_RWE (the temporal granularity) time slots. To conclude, migratory data accesses are described by the six-tuple S_1 = (P, B, T, R_cold, P'_migr, P''_migr).

3.2 Characterising producer-consumer data

Informally, producer-consumer accesses are characterised by one processor updating a block in a producer phase and an arbitrary number of processors reading the block in a non-overlapping consumer phase. Formally, we assume that the producer and consumer phases consist of an integral number of time slots, which enables us to define producer-consumer accesses with the following regular expression:

(RWE_i+ UNR*)+ UNR+ (ROX+ UNR*)+ UNR+

The first parenthesised term is the producer phase and the second is the consumer phase, with at least one UNR time slot in between. In this expression ROX represents one of the access classes ROE, ROF, or ROS.
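The two regular expressions can be checked directly against a block's sequence of per-slot classes. Below is a sketch of such a test under the lock-step assumptions above; the sequence encoding (pairs of class name and accessing processor) is invented for the example, and the producer-consumer test matches a single producer/consumer phase pair as written in the expression, leaving repetition over successive iterations to the surrounding tool.

```python
READ_ONLY = ("ROE", "ROF", "ROS")

def is_migratory(seq):
    """seq: one block's per-slot classes as (class, processor) pairs, where the
    processor is given for exclusive classes and None otherwise.  Migratory means
    every referenced slot is RWE by one processor at a time, with >= 1 migration."""
    writers = []
    for cls, proc in seq:
        if cls == "UNR":
            continue
        if cls != "RWE":
            return False                     # any shared or read-only slot disqualifies it
        if not writers or writers[-1] != proc:
            writers.append(proc)
    return len(writers) >= 2                 # the block changed owner at least once

def is_producer_consumer(seq):
    """Matches (RWE_i+ UNR*)+ UNR+ (ROX+ UNR*)+ : a producer phase of exclusive
    read-modify-writes by one processor, at least one unreferenced slot,
    then a read-only consumer phase."""
    producer, gap, phase = None, False, "producer"
    for cls, proc in seq:
        if cls == "UNR":
            gap = True
        elif phase == "producer" and cls == "RWE" and producer in (None, proc):
            producer, gap = proc, False      # still in the producer phase
        elif cls in READ_ONLY and producer is not None and gap:
            phase = "consumer"               # separator seen, consumer phase starts
        elif phase == "consumer" and cls in READ_ONLY:
            pass                             # consumer phase continues
        else:
            return False                     # e.g. a write after consumers started reading
    return phase == "consumer"
```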

To characterise producer-consumer data accesses, we have extended the four-tuple S = (P, B, T, R_cold) with the number of consumers, denoted nc. Moreover, unlike stationary blocks, producer-consumer blocks can be classified in different access classes over time, and information is needed on the probability of a transition of a block from one access class, c1, to another access class, c2. This is captured in a data structure called the mobility matrix, MM = {m_c1c2 | c1, c2 ∈ C}. Element m_c1c2 of the mobility matrix contains the probability of a reclassification of a block from class c1 to class c2, averaged over all analysis intervals. To conclude, producer-consumer data accesses are described by the six-tuple S_2 = (P, B, T, R_cold, nc, MM).

3.3 Characterising accesses in different categories

Although the primary purpose of the parameters introduced in the previous sections has been to characterise accesses to each category in isolation, we will present experimental results that validate the sharing behaviour for: a program with only migratory blocks; a program with only producer-consumer blocks; and a program with both stationary and migratory blocks. Since the last program contains blocks belonging to two access categories (stationary and migratory), we need to extend the model with a parameter designating the fraction of all blocks that belong to each category. This is given by D = {D_t | t ∈ {stationary, migratory, producer-consumer}}. Furthermore, we make the following simplifying assumptions: a block is stationary, migratory or producer-consumer throughout the entire execution, and the temporal granularity for access class c is the same for all data access categories.

4 Measuring sharing characteristics

We will now briefly discuss a method, used in a previous study [4], to extract the parameter set S = (P, B, T, R_cold) that characterises stationary data accesses in a real program. The method has been augmented to categorise blocks into the migratory or producer-consumer categories according to the regular expressions in the previous section. Blocks that are neither migratory nor producer-consumer are assumed to be stationary.

The program whose sharing behaviour is to be characterised executes on a simulated PRAM model [9]. All shared references are captured by an analysis tool as shown in figure 1. The simulation is interrupted at regular time intervals, called analysis intervals, at which each block is analysed and classified in one of the classes in C (see section 2.2) based on the number of reads and writes made to the block by each processor. The temporal granularity, T_c, of class c is measured as the mean number of consecutive analysis intervals during which a block remains classified in class c. The information on the first time a block is classified is used to calculate the block distribution, B, and R_cold. Finally, the number of references made to blocks in each class is used to calculate the sharing profile, P. The additional parameters needed to characterise migratory and producer-consumer accesses, P'_migr, P''_migr, nc, and MM, are easily calculated at the same time.

Figure 1: Measurement of sharing behaviour through the analysis of the references to the shared memory (the processors of the CacheMire test bench feed shared references to the analysis tool for data gathering and analysis).

Under the ideal assumption that the program execution conforms to the lock-step model, and that a block stays in a certain access class an integral number of time slots, the analysis interval should be the same as a time slot.
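As an illustration of this measurement step, the sketch below derives the temporal granularities T_c and the mobility matrix MM from each block's sequence of per-interval classifications. The data layout is invented for the example, and it counts only actual reclassifications when forming MM; how the authors' tool treats intervals in which a block keeps its class is not spelled out here.

```python
from collections import defaultdict

def granularity_and_mobility(class_seq_per_block):
    """class_seq_per_block: block -> list of classes, one per analysis interval.
    Returns (T, MM): T[c] is the mean number of consecutive intervals a block
    stays in class c; MM[c1][c2] is the probability that a block leaving class
    c1 is reclassified as c2."""
    run_lengths = defaultdict(list)                         # c -> lengths of stays in c
    transitions = defaultdict(lambda: defaultdict(int))     # c1 -> c2 -> count

    for seq in class_seq_per_block.values():
        current, length = None, 0
        for c in seq:
            if c == current:
                length += 1
            else:
                if current is not None:
                    run_lengths[current].append(length)
                    transitions[current][c] += 1
                current, length = c, 1
        if current is not None:
            run_lengths[current].append(length)             # close the final run

    T = {c: sum(runs) / len(runs) for c, runs in run_lengths.items()}
    MM = {c1: {c2: n / sum(row.values()) for c2, n in row.items()}
          for c1, row in transitions.items()}
    return T, MM
```

The migration probabilities P'_migr and P''_migr can be tallied in the same pass by additionally recording which processor owns a block in its RWE intervals.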
In reality, however, a program does not conform to this simplistic model, for the following reasons. Firstly, the completion time of read-modify-write operations may vary, which makes it difficult to define a uniform time slot. Secondly, it may also be difficult to synchronise the start of an analysis interval with the start of a time slot, leading to a skewing effect that can cause perturbation in the measurements.

For stationary data blocks, it suffices to select an analysis interval that is several time slots long. We then minimise the chance that an RWE block is classified as ROE in one analysis interval and RWE in another due to the skewing. For migratory and producer-consumer blocks, however, the duration of phases also has to be taken into account. If the analysis interval has the same duration as the phases, skewing can lead to perturbation. For example, if a block migrates between two processors in two successive phases, it may be classified as RWF due to this mismatch. However, under the assumption that a time slot is significantly shorter than the duration of a phase, e.g., the time between successive migrations or the duration of producer and consumer phases, the choice of analysis interval is not very critical.

5 A reference generator model

This section describes a reference generator that takes the set of parameters S = (P, B, T, R_cold, D, P'_migr, P''_migr, MM, nc) extracted from a program and generates a stream of references according to these parameters. The reference generator replaces the processors in architectural simulations. Therefore, we will in the following use the term processor to denote the generation of references corresponding to a particular processor.

A shared data block belongs to one of the access classes described in section 2.2. These classes are represented in the reference generator as lists whose elements correspond to memory blocks. Figure 2 shows the structure of the reference generator, in which an access class is represented by one or more lists depending on the number of processors that may share blocks in that class. The ROE and RWE classes are represented by N lists, one for each processor; the ROF and RWF classes are represented by N/i lists (i is the number of processor identity entries in the cache coherence directory); and the other classes by one list each, since these classes are common to all processors.

Figure 2: Structure of the model for artificial generation of shared memory references (a random number generator with the sharing profile distribution selects among the per-class block lists; recently referenced blocks are re-inserted at the end of the list they were taken from, and reclassification is invoked at regular time intervals).

The generation of references is synchronised so that all processors in a time slot will perform references to blocks in the same access class. For each time slot, one of the classes in C is selected with probability P_c from the sharing profile, and each processor will then perform a memory reference to a block belonging to this class. If one of the ROE or RWE classes was selected, the processors will perform references to different blocks. If the selected class was ROF or RWF, the processors will pair-wise access the same block, and if the selected class was ROS or RWS, all processors will access the same block. When a read/write class is selected, all processors generate the read references before any processor generates the entailing write reference to the same block.

A block is tagged as either stationary, migratory or producer-consumer according to the distribution given by the parameter D. The generation of block accesses is the same regardless of data category. At regular time intervals, based on the analysis intervals, the classification of blocks is evaluated, and if a block has remained in its class for a time corresponding to the temporal granularity of this class, it is reclassified into some other class. The reclassification is, however, different depending on the data category of the block. Stationary blocks simply alternate between the UNR class and their stationary class. There are two kinds of actions on migratory blocks. If the block was in the UNR class, it is reclassified in the RWE class; with probability P'_migr the block migrates to a new processor, otherwise it stays with the same processor that previously accessed it. If the block was in the RWE class, it remains in this class with probability P''_migr, but for a different processor.
Otherwise, it is reclassified as UNR. Producer-consumer blocks are reclassified into a new class according to the conditional probabilities given by the mobility matrix, MM. Each time a producer-consumer block is classified in the ROE class, the identity of the processor that will be accessing the block changes, until the block has been accessed by nc processors. The model needs to be supplied with the analysis interval used in measuring the sharing behaviour in order to perform the reclassifications at the correct intervals. The model also needs to know the value of i for ROF and RWF blocks in order to generate the correct degree of sharing for these blocks. Finally, the number of shared memory references to generate before termination has to be supplied.
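The per-slot generation step can be condensed into the following sketch. It is a simplification written for this text: list management, block reclassification and the cold-block rate R_cold are omitted, the parameter layout (`params["P"]`, `params["lists"]`) and function names are invented, and the per-class block lists are assumed to be non-empty.

```python
import random

def generate_references(params, num_refs, N, i, seed=0):
    """Condensed sketch of the generator's per-time-slot loop: pick an access
    class with probability P_c, then let every processor reference a block of
    that class, reads before writes for read-modify-write classes."""
    rng = random.Random(seed)
    classes, weights = zip(*params["P"].items())    # sharing profile P
    refs = []                                       # (processor, block, is_write)
    while len(refs) < num_refs:
        c = rng.choices(classes, weights)[0]
        if c == "UNR":
            continue                                # nothing referenced this slot
        blocks = params["lists"][c]                 # blocks currently in class c
        if c in ("ROE", "RWE"):                     # each processor gets its own block
            chosen = [blocks[p % len(blocks)] for p in range(N)]
        elif c in ("ROF", "RWF"):                   # groups of i processors share a block
            chosen = [blocks[(p // i) % len(blocks)] for p in range(N)]
        else:                                       # ROS / RWS: all processors share one
            chosen = [blocks[0]] * N
        refs += [(p, chosen[p], False) for p in range(N)]       # all reads first
        if c.startswith("RW"):
            refs += [(p, chosen[p], True) for p in range(N)]    # then the writes
    return refs[:num_refs]
```

Feeding the generated tuples into a Dir_i NB miss classifier, like the sketch in section 2, reproduces the three miss-ratio components that are compared with the real program in the next section.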

In the next section we will show that this reference generator, given the characteristics of migratory and producer-consumer data for a program, generates a stream of references that accurately mimics the real program with respect to the cache miss ratio.

6 Evaluation of the access characterisation

In order to verify that the suggested parameters describe the access pattern with respect to miss ratios in directory-based, write-invalidate cache coherence protocols, we have compared the miss ratios observed when the reference generator is used with the miss ratios seen when the real program is executed. Three different synthetic programs and the S.O.R. kernel were used as workloads in the evaluation.

6.1 Experimental methodology

The CacheMire test bench has been used as a framework to carry out the evaluation of the model [2]. Figure 3 shows the setup with a simulator for the Dir_i NB class of cache coherence protocols. The protocol simulator classifies each cache miss as either a cold, coherence, or directory replacement miss according to the definitions in section 2.1. The miss ratio is calculated as the ratio between the number of cache misses and the number of all shared data references. References to instructions and private data are thus not included in the miss ratio calculation. The sharing behaviour was measured for the different programs with varying analysis intervals and, in some cases, with varying block sizes from 4 bytes to 64 bytes. The default block size is 4 bytes. We have used 8 processors and the Dir_2 NB cache coherence protocol in all experiments.

Figure 3: The miss ratios observed by the artificial reference stream are compared with the real miss ratios (program code on SPARC processor simulators and the model's memory reference generators, driven by the sharing behaviour parameters, both feed the cache coherence protocol simulator).

6.2 The synthetic programs Synth-A, -B, -C

The synthetic program has been designed to produce specific shared data access patterns. It can be set up to access data in any applicable access class of all three categories. The shared data structures consist of integer arrays with consecutive accesses to the elements. Three variants of the synthetic program were used:

(A) References are made to migratory data only. One array of 512 elements is shared by all processors. Processor i accesses elements 64i to 64(i + 1) - 1 in sequence for a fixed number of processor cycles, after which these elements are accessed by processor j = (i + 1) mod 8.

(B) References are made to stationary data and migratory data with equal probability. The stationary data consists of data in the ROS and RWE classes. This program contains the same migratory behaviour as Synth-A, but with a longer migration interval due to the accesses to stationary data; the iteration interval is correspondingly longer. The stationary data structures consist of one array of 512 elements subdivided among the eight processors for RWE accesses. There is also one array with 64 elements that is accessed by all processors for ROS accesses.

(C) Only data in the producer-consumer category are accessed. This program contains a producer phase and a consumer phase, each of approximately fixed length in processor cycles. The data structure is an integer array of 512 elements.
Processor i updates elements 64i to 64(i + 1) - 1 in the producer phase. The consumer phase can take one of two alternatives: (1) every tenth iteration, all processors read elements 0 to 63 of the array; or (2) in nine out of ten iterations, processors i and i+1 read the elements of each other's subvectors which were updated during the producer phase. The blocks containing these elements will thus be classified as ROF. The access pattern of Synth-A, -B, and -C as described above is repeated for 50 iterations.

In the Synth-A program there will be a coherence miss for each block a processor accesses between migrations, since all blocks were updated by some processor other than the current one. Cold misses are particularly easy to model when there are only migratory data: there will be exactly one cold miss per block and processor. Since there are at most two cached copies of a migratory block (between the read operation and the write operation issued by a processor), there will be no directory replacement misses.

Synth-B accesses data from both the stationary and the migratory data categories. The number of coherence misses will be the same as for Synth-A, which has the same data structure and access pattern to migratory data.

The coherence miss ratio will, however, be lower for the Synth-B program since this program has more shared references. The Synth-C program will experience coherence misses in the consumer phase when the consumers read a block which has been updated during the producer phase. There will also be some directory replacement misses, since all processors access the same elements every tenth consumer phase.

All Synth programs can utilise the prefetch effect of a larger block size to reduce the cold and coherence miss ratios, because the shared data is organised in arrays with consecutive accesses to the elements. The directory replacement miss ratio should, however, not be affected by the cache block size. These misses are caused by accesses to ROS blocks, and when many processors are reading a block, most of them will see a miss because of the limited directory size. There is no prefetch effect on these blocks, since an increase in the block size will not bring more information into the cache.

Figure 4 shows the miss ratios for Synth-A, -B, and -C when the cache block size was varied from 4 to 64 bytes. The effects on the miss ratio discussed above can be seen for the three programs, and we can also note that the trends are closely followed by the model.

Figure 4: Miss ratios (cold, coherence and directory replacement misses) for the modelling of the synthetic programs as a function of the block size: (a) Synth-A, only migratory data; (b) Synth-B, stationary and migratory data; (c) Synth-C, only producer-consumer data. Analysis interval: 5000 processor cycles.

The choice of analysis interval is mainly dictated by the time between migrations for migratory data, and by the length of producer and consumer phases for producer-consumer data. However, for the practical reasons described in section 5, the analysis interval cannot be as short as a time slot. We have used an analysis interval of 5000 processor cycles for all three programs. Although it is not part of this study, we have observed that there is a relatively large range in which the analysis interval can be chosen to produce a good characterisation of migratory and producer-consumer data accesses.

In summary, the experiments reported here show that the set of parameters described in section 3 suffices to describe the sharing behaviour of accesses to migratory and producer-consumer data. The next section shows results for S.O.R., a program kernel used in many scientific applications.

6.3 The S.O.R. program kernel

The Successive Over Relaxation program kernel is a program with a dominating amount of producer-consumer data. The main shared data structure consists of a matrix whose elements are iteratively updated. In the parallel implementation of S.O.R., each iteration is split into two phases. In the first phase, all elements in even positions (the sum of the indices is even) are updated; in doing so, the values of the neighbouring elements in odd positions are read. The second phase performs the same operation on the elements in odd positions. The work is divided so that each processor works on a sub-matrix and communicates with other processors only at the borders between the sub-matrices. We have measured the sharing behaviour of S.O.R. with a matrix of randomly initialised elements.
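The two-phase (red-black) update just described can be sketched as follows. This is a generic sequential illustration, not the code used in the paper; in the parallel version each processor would apply the same two phases to its own sub-matrix and exchange border elements with its neighbours. The relaxation factor `omega` is an assumed parameter.

```python
import numpy as np

def sor_iteration(grid, omega=1.5):
    """One red-black S.O.R. iteration over a 2-D grid: first update elements
    whose index sum is even (reading odd neighbours), then the odd ones."""
    def relax(parity):
        for r in range(1, grid.shape[0] - 1):
            for c in range(1, grid.shape[1] - 1):
                if (r + c) % 2 != parity:
                    continue
                neighbour_avg = (grid[r - 1, c] + grid[r + 1, c] +
                                 grid[r, c - 1] + grid[r, c + 1]) / 4.0
                grid[r, c] += omega * (neighbour_avg - grid[r, c])
    relax(0)   # phase 1: even positions
    relax(1)   # phase 2: odd positions
    return grid
```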
Each of the eight processors works on a 4 × 8 submatrix with 32 elements, of which 12 are exclusively accessed by one processor, 16 are shared with one neighbour and 4 are shared between 4 neighbours. The time between two iterations was measured to be around 5100 processor cycles. Figure 5 shows a comparison of the miss ratios between the model and the execution of the program itself for an analysis interval of 1000 processor cycles. As can be seen, there is a very good correspondence for all three types of cache misses, which shows that the set of parameters characterising producer-consumer data suffices to capture the characteristics needed to predict cache miss ratios.

Figure 5: Comparison of miss ratios (cold, coherence and directory replacement misses) between the model and the program for the S.O.R. kernel. Analysis interval: 1000 processor cycles.

When the analysis interval is larger than 1500 processor cycles, the model does not work, since the analysis of shared references is then made over two (or more) iterations, which gives a distorted view of the real sharing behaviour.

To conclude this section, we have shown that the model presented in this paper, consisting of the reference generator and the set of parameters describing sharing behaviour, indeed models the sharing behaviour generated by data in the migratory and producer-consumer categories. The cache miss ratios observed when the reference generator is used are approximately the same as the real values caused by the programs themselves.

7 Conclusions

We have in this paper presented a model of accesses to migratory and producer-consumer data in a class of shared memory multiprocessor programs. The model consists of a set of parameters characterising the sharing behaviour with respect to cache miss ratios for directory-based, write-invalidate cache coherence protocols. It has been shown that the model can be used to accurately predict cold, coherence and directory replacement miss ratios for migratory and producer-consumer accesses.

The main contribution of this paper is the identification of the characteristic information that essentially captures the access pattern to migratory and producer-consumer data. An additional contribution is the design and implementation of a reference generator that makes use of this information to generate a reference stream that mimics the references made by the processors. We have achieved the goal of the model, namely to understand the relation between migratory and producer-consumer data accesses and the performance of the cache coherence mechanism. The assumptions of course contain several simplifications and limitations that will be the focus of future research.

Acknowledgements

The research in this paper was supported by the Swedish National Board for Industrial and Technical Development (NUTEK) under contract number P855.

References

[1] A. Agarwal, R. Simoni, J. Hennessy and M. Horowitz. An Evaluation of Directory Schemes for Cache Coherence. In Proceedings of the 15th Annual International Symposium on Computer Architecture, 1988.
[2] M. Brorsson, F. Dahlgren, H. Nilsson and P. Stenström. The CacheMire Test Bench - A Flexible and Effective Approach for Simulation of Multiprocessors. In Proceedings of the 26th Annual Simulation Symposium, Washington, DC, March 1993.
[3] M. Brorsson and P. Stenström. Visualising Sharing Behaviour in Relation to Shared Memory Management. In Proceedings of the 1992 International Conference on Parallel and Distributed Systems, Hsinchu, Taiwan, December 1992.
[4] M. Brorsson and P. Stenström. Modelling Accesses to Stationary Data in a Shared Memory Multiprocessor. In Proceedings of the 7th International Conference on Parallel and Distributed Computing Systems, Las Vegas, NV, October 1994. To appear.
[5] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. Operating Systems Review, 25(5), October 1991 (Proceedings of the 13th ACM Symposium on Operating Systems Principles).
[6] F. Darema, D. A. George, V. A. Norton, and G. F. Pfister.
A Single-Program-Multiple-Data Computational Model for EPEX/FORTRAN. Parallel Computing, 7(1):11-24, April 1988.
[7] A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber. Comparative Evaluation of Latency Reducing and Tolerating Techniques. In Proceedings of the 18th Annual International Symposium on Computer Architecture, Toronto, Canada, May 1991.
[8] A. Gupta and W.-D. Weber. Cache Invalidation Patterns in Shared-Memory Multiprocessors. IEEE Transactions on Computers, 41(7), July 1992.
[9] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. In Proceedings of the 10th ACM Symposium on Theory of Computing, 1978.
[10] M. S. Lam, E. E. Rothberg and M. E. Wolf. The Cache Performance and Optimizations of Blocked Algorithms. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.
[11] J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. Computer Architecture News, 20(1):5-44, March 1992.
[12] P. Stenström, M. Brorsson and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, San Diego, California, May 1993.


More information

Multi-Processor / Parallel Processing

Multi-Processor / Parallel Processing Parallel Processing: Multi-Processor / Parallel Processing Originally, the computer has been viewed as a sequential machine. Most computer programming languages require the programmer to specify algorithms

More information

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily

More information

MULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration

MULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration MULTIPROCESSORS Characteristics of Multiprocessors Interconnection Structures Interprocessor Arbitration Interprocessor Communication and Synchronization Cache Coherence 2 Characteristics of Multiprocessors

More information

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis Meshal Shutaywi and Nezamoddin N. Kachouie Department of Mathematical Sciences, Florida Institute of Technology Abstract

More information

2 Discrete Dynamic Systems

2 Discrete Dynamic Systems 2 Discrete Dynamic Systems This chapter introduces discrete dynamic systems by first looking at models for dynamic and static aspects of systems, before covering continuous and discrete systems. Transition

More information

CSCI 5454 Ramdomized Min Cut

CSCI 5454 Ramdomized Min Cut CSCI 5454 Ramdomized Min Cut Sean Wiese, Ramya Nair April 8, 013 1 Randomized Minimum Cut A classic problem in computer science is finding the minimum cut of an undirected graph. If we are presented with

More information

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors

More information

Adaptive Data Dissemination in Mobile ad-hoc Networks

Adaptive Data Dissemination in Mobile ad-hoc Networks Adaptive Data Dissemination in Mobile ad-hoc Networks Joos-Hendrik Böse, Frank Bregulla, Katharina Hahn, Manuel Scholz Freie Universität Berlin, Institute of Computer Science, Takustr. 9, 14195 Berlin

More information

Fault tolerant TTCAN networks

Fault tolerant TTCAN networks Fault tolerant TTCAN networks B. MŸller, T. FŸhrer, F. Hartwich, R. Hugel, H. Weiler, Robert Bosch GmbH TTCAN is a time triggered layer using the CAN protocol to communicate in a time triggered fashion.

More information

The Cache Write Problem

The Cache Write Problem Cache Coherency A multiprocessor and a multicomputer each comprise a number of independent processors connected by a communications medium, either a bus or more advanced switching system, such as a crossbar

More information

Removing Belady s Anomaly from Caches with Prefetch Data

Removing Belady s Anomaly from Caches with Prefetch Data Removing Belady s Anomaly from Caches with Prefetch Data Elizabeth Varki University of New Hampshire varki@cs.unh.edu Abstract Belady s anomaly occurs when a small cache gets more hits than a larger cache,

More information

Cycle accurate transaction-driven simulation with multiple processor simulators

Cycle accurate transaction-driven simulation with multiple processor simulators Cycle accurate transaction-driven simulation with multiple processor simulators Dohyung Kim 1a) and Rajesh Gupta 2 1 Engineering Center, Google Korea Ltd. 737 Yeoksam-dong, Gangnam-gu, Seoul 135 984, Korea

More information

Introduction to Computing and Systems Architecture

Introduction to Computing and Systems Architecture Introduction to Computing and Systems Architecture 1. Computability A task is computable if a sequence of instructions can be described which, when followed, will complete such a task. This says little

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

Shared Symmetric Memory Systems

Shared Symmetric Memory Systems Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

CS377P Programming for Performance Multicore Performance Cache Coherence

CS377P Programming for Performance Multicore Performance Cache Coherence CS377P Programming for Performance Multicore Performance Cache Coherence Sreepathi Pai UTCS October 26, 2015 Outline 1 Cache Coherence 2 Cache Coherence Awareness 3 Scalable Lock Design 4 Transactional

More information

Flash Drive Emulation

Flash Drive Emulation Flash Drive Emulation Eric Aderhold & Blayne Field aderhold@cs.wisc.edu & bfield@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison Abstract Flash drives are becoming increasingly

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

Hashing. Hashing Procedures

Hashing. Hashing Procedures Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements

More information

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

Modeling Systems Using Design Patterns

Modeling Systems Using Design Patterns Modeling Systems Using Design Patterns Jaroslav JAKUBÍK Slovak University of Technology Faculty of Informatics and Information Technologies Ilkovičova 3, 842 16 Bratislava, Slovakia jakubik@fiit.stuba.sk

More information

CSC 631: High-Performance Computer Architecture

CSC 631: High-Performance Computer Architecture CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 10: Memory Part II CSC 631: High-Performance Computer Architecture 1 Two predictable properties of memory references: Temporal Locality:

More information

Computer Architecture, EIT090 exam

Computer Architecture, EIT090 exam Department of Information Technology Lund University Computer Architecture, EIT090 exam 15-12-2004 I. Problem 1 (15 points) Briefly (1-2 sentences) describe the following items/concepts concerning computer

More information

Comparing the Parix and PVM parallel programming environments

Comparing the Parix and PVM parallel programming environments Comparing the Parix and PVM parallel programming environments A.G. Hoekstra, P.M.A. Sloot, and L.O. Hertzberger Parallel Scientific Computing & Simulation Group, Computer Systems Department, Faculty of

More information

Automatic visual recognition for metro surveillance

Automatic visual recognition for metro surveillance Automatic visual recognition for metro surveillance F. Cupillard, M. Thonnat, F. Brémond Orion Research Group, INRIA, Sophia Antipolis, France Abstract We propose in this paper an approach for recognizing

More information

Probabilistic Worst-Case Response-Time Analysis for the Controller Area Network

Probabilistic Worst-Case Response-Time Analysis for the Controller Area Network Probabilistic Worst-Case Response-Time Analysis for the Controller Area Network Thomas Nolte, Hans Hansson, and Christer Norström Mälardalen Real-Time Research Centre Department of Computer Engineering

More information

"is.n21.jiajia" "is.n21.nautilus" "is.n22.jiajia" "is.n22.nautilus"

is.n21.jiajia is.n21.nautilus is.n22.jiajia is.n22.nautilus A Preliminary Comparison Between Two Scope Consistency DSM Systems: JIAJIA and Nautilus Mario Donato Marino, Geraldo Lino de Campos Λ Computer Engineering Department- Polytechnic School of University of

More information

X-RAY: A Non-Invasive Exclusive Caching Mechanism for RAIDs

X-RAY: A Non-Invasive Exclusive Caching Mechanism for RAIDs : A Non-Invasive Exclusive Caching Mechanism for RAIDs Lakshmi N. Bairavasundaram, Muthian Sivathanu, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau Computer Sciences Department, University of Wisconsin-Madison

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

A Reconfigurable Cache Design for Embedded Dynamic Data Cache

A Reconfigurable Cache Design for Embedded Dynamic Data Cache I J C T A, 9(17) 2016, pp. 8509-8517 International Science Press A Reconfigurable Cache Design for Embedded Dynamic Data Cache Shameedha Begum, T. Vidya, Amit D. Joshi and N. Ramasubramanian ABSTRACT Applications

More information

Near Memory Key/Value Lookup Acceleration MemSys 2017

Near Memory Key/Value Lookup Acceleration MemSys 2017 Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy

More information

Hybrid Limited-Pointer Linked-List Cache Directory and Cache Coherence Protocol

Hybrid Limited-Pointer Linked-List Cache Directory and Cache Coherence Protocol Hybrid Limited-Pointer Linked-List Cache Directory and Cache Coherence Protocol Mostafa Mahmoud, Amr Wassal Computer Engineering Department, Faculty of Engineering, Cairo University, Cairo, Egypt {mostafa.m.hassan,

More information

Operating Systems. Steven Hand. Michaelmas / Lent Term 2008/ lectures for CST IA. Handout 4. Operating Systems

Operating Systems. Steven Hand. Michaelmas / Lent Term 2008/ lectures for CST IA. Handout 4. Operating Systems Operating Systems Steven Hand Michaelmas / Lent Term 2008/09 17 lectures for CST IA Handout 4 Operating Systems N/H/MWF@12 I/O Hardware Wide variety of devices which interact with the computer via I/O:

More information

FOUR EDGE-INDEPENDENT SPANNING TREES 1

FOUR EDGE-INDEPENDENT SPANNING TREES 1 FOUR EDGE-INDEPENDENT SPANNING TREES 1 Alexander Hoyer and Robin Thomas School of Mathematics Georgia Institute of Technology Atlanta, Georgia 30332-0160, USA ABSTRACT We prove an ear-decomposition theorem

More information

Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations

Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations Lecture 12: TM, Consistency Models Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations 1 Paper on TM Pathologies (ISCA 08) LL: lazy versioning, lazy conflict detection, committing

More information

Latency Hiding on COMA Multiprocessors

Latency Hiding on COMA Multiprocessors Latency Hiding on COMA Multiprocessors Tarek S. Abdelrahman Department of Electrical and Computer Engineering The University of Toronto Toronto, Ontario, Canada M5S 1A4 Abstract Cache Only Memory Access

More information

Data Distribution, Migration and Replication on a cc-numa Architecture

Data Distribution, Migration and Replication on a cc-numa Architecture Data Distribution, Migration and Replication on a cc-numa Architecture J. Mark Bull and Chris Johnson EPCC, The King s Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland,

More information

OPERATING SYSTEMS: Lesson 1: Introduction to Operating Systems

OPERATING SYSTEMS: Lesson 1: Introduction to Operating Systems OPERATING SYSTEMS: Lesson 1: Introduction to Jesús Carretero Pérez David Expósito Singh José Daniel García Sánchez Francisco Javier García Blas Florin Isaila 1 Why study? a) OS, and its internals, largely

More information