Physical Design of Snoop-Based Cache Coherence on Multiprocessors

Muge Guher
University of Ottawa

Abstract

This report focuses on the hardware design issues associated with the physical implementation of cache coherence algorithms. The bus design, the cache and controller design, and the integration with memory play an important role in the latency and bandwidth achieved by a protocol. The report examines the design and implementation of the hardware that supports the logical operation of a protocol and addresses a number of questions relating to these concepts.

1. Questions

How does memory know that another cache will respond and provide a copy of the block, so that it does not have to?

How would the design of a cache controller be modified for L1/L2 caches and split-transaction buses? Show examples of modern split-transaction buses/systems. How many outstanding transactions are allowed?

How are transactions propagated for multilevel caches? Again, show some examples of modern systems.

Are there any solutions for shared L2 caches that are based on a bus network? How does the bus network need to be modified to support shared caches?

2. Introduction

The predominant parallel multiprocessor architecture, in which all processors share a global address space and have symmetric access to main memory, is referred to as a Symmetric Multiprocessor (SMP). All the processors and the memory are connected via a shared bus. Various memory hierarchies are found in multiprocessors: shared cache, bus-based shared memory, dancehall, distributed memory, and so on. In all cases caches play an important role in reducing the bandwidth requirements that the processors place on the shared interconnect and the memory. Cache coherency issues arise when the processors have a level of private (non-shared) cache hierarchy, because a memory block can then be present in one or more processors' caches while another processor modifies the same block. On the logical side, various cache coherence protocols address this problem of presenting the processors with the most up-to-date copy of a memory block. Usually these architectures exploit the fundamental properties of the interconnect bus.

3. Cache Coherence Requirements

All cache coherency protocols are based on cache block states and state transitions. The state transitions of a cache block are driven by the actions of the local processor as well as those of other processors holding a copy of the block. An important question in coherency protocols is how to find the copies of a cache block. One method is a snoop bus protocol, in which the caches of the processors monitor (snoop) bus transactions to detect whether another processor is accessing a cache block whose copy is also present in the snooping cache.

A state diagram for a given protocol can fully capture the states, transitions, and outputs of the algorithm; however, a practical implementation of the algorithm has to be correct, require minimal hardware, and offer high performance. These requirements are interrelated: to provide high performance, multiple events can be launched at the same time and remain outstanding, which leads to complex interactions between events and thus opens the design to the possibility of more bugs. At its most basic, a coherence protocol should invalidate (or update) stale copies of memory blocks on writes and provide write serialization.
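To make the state-transition view concrete, the sketch below is a minimal C++ model of a generic three-state MSI invalidation protocol for a single cache block. The state and event names are the standard MSI ones rather than those of any machine discussed here, and write-backs, replacements, and bus races are deliberately ignored; it only shows how processor and bus events drive the next state and the bus actions a controller would request.

    // Minimal MSI snooping-protocol sketch (illustrative only).
    #include <cstdio>

    enum class State { Invalid, Shared, Modified };
    enum class Event { PrRd, PrWr, BusRd, BusRdX };

    struct Action {
        State next;
        bool  issueBusRd;    // fetch the block over the bus
        bool  issueBusRdX;   // fetch with intent to modify (invalidates other copies)
        bool  flush;         // supply dirty data on the bus
    };

    // Next-state/output function for one cache block.
    Action transition(State s, Event e) {
        switch (s) {
        case State::Invalid:
            if (e == Event::PrRd)   return {State::Shared,   true,  false, false};
            if (e == Event::PrWr)   return {State::Modified, false, true,  false};
            return {s, false, false, false};            // bus events: block not present
        case State::Shared:
            if (e == Event::PrWr)   return {State::Modified, false, true,  false};
            if (e == Event::BusRdX) return {State::Invalid,  false, false, false};
            return {s, false, false, false};            // PrRd hit or BusRd: no change
        case State::Modified:
            if (e == Event::BusRd)  return {State::Shared,   false, false, true};
            if (e == Event::BusRdX) return {State::Invalid,  false, false, true};
            return {s, false, false, false};            // PrRd/PrWr hit
        }
        return {s, false, false, false};
    }

    int main() {
        Action a = transition(State::Shared, Event::PrWr);
        std::printf("Shared + PrWr -> state %d, BusRdX=%d\n",
                    static_cast<int>(a.next), a.issueBusRdX);
    }

A real controller layers the tag organization, snoop-result reporting, and write-back handling described in the following sections on top of a state machine of this kind.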
Beyond invalidation and serialization, the design should also ensure write atomicity and detect write completions, in order to provide sequential consistency. It should be deadlock-free and livelock-free and should make starvation unlikely. It should also be able to handle error conditions appropriately and recover from them when possible. Sequential consistency implies that the memory operations issued by one process must become visible to other processes, and to itself, in program order. Atomicity implies that a memory operation should appear to complete with respect to all processes before the next one is issued. Snoop-based designs provide serialization by the nature of the bus, which imposes a total order on transactions. When there is a single level of cache per processor and bus transactions are atomic, the cache can hold off the processor while it performs its memory operations, so these operations appear atomic to the processor.

3.1 Cache Controller and Tags

Recall that the cache controller's main responsibilities include maintaining the tags and control bits associated with the cache; performing the snoops and snarfs of data; updating the cache memory; and implementing the write policy.

The cache controller is also responsible for determining whether a memory request is cacheable and whether a request is a cache hit or a miss. The cache controller can also trigger an immediate write-out to memory of all dirty cache lines; this is done when the cache must be brought in sync with memory (i.e., no stale data). When a cache controller watches the address lines for transactions, this is called snooping. This function allows the controller to see whether any transactions are accessing its cache. A snoop cycle is initiated for cache coherency when data is written to memory by an entity other than the processor and that address is also in the cache. A snoop hit or miss is determined in the same way as for read or write cycles, and on a hit the cache block is invalidated.

The basic idea behind multiprocessor snooping-based coherence is that the transactions on the bus are visible to all processors, and processors can monitor the bus to take action on events relevant to them. The uniprocessor cache controller must be enhanced to support a snooping cache coherence protocol. In this new arrangement, the cache controller effectively has two controllers:

1. a bus-side controller, to monitor bus operations, and
2. a processor-side controller, to respond to processor operations.

Figure 3-1. Organization of a single-level snoopy cache: the tag and state array is duplicated, with one copy used by the processor and one by the bus snooper, both referring to the same cached data [1].

When an operation occurs, the controller must access the cache tags. The bus-side controller must capture the address on the bus and perform a tag check on every bus transaction. If the check fails (a snoop miss), the transaction is of no importance to the cache. If the check is a snoop hit, the controller has to act according to its coherence protocol, which may involve modifying (read-modify-write) its state bits and/or acquiring the bus to put a memory block on it. Cache designs use a dual-ported tag and state store, or duplicate the tag and state for every block, to allow both controllers to access the tag array and perform their checks simultaneously. This ensures that during a bus transaction the processor is not locked out from accessing the cache. However, when the tag for a block is updated, both copies must be modified, which locks out one of the controllers briefly.

The cache controller of a snoopy multiprocessor is also a responder to bus transactions. A memory bank controller responds, after some wait cycles, to relevant read/write operations on the fixed subset of addresses that it contains. The cache controller, in contrast, monitors the bus and performs a tag check on every transaction to determine whether it is relevant. In addition, for update-based protocols the caches may need to snarf the new data off the bus. A snooping cache controller may therefore have to update state, respond with data, and/or generate new bus transactions.
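As a rough illustration of the dual-tag organization just described, the following sketch (C++; the TagArray and SnoopyCacheController names, and the use of a map in place of a real tag array, are inventions for illustration) keeps two copies of the tag/state store, lets the bus-side snoop check proceed against its own copy without disturbing the processor, and updates both copies whenever a block's state changes.

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    enum class State { Invalid, Shared, Modified };

    // One copy of the tag/state store: block address -> state.
    using TagArray = std::unordered_map<std::uint64_t, State>;

    struct SnoopyCacheController {
        TagArray procTags;   // copy used by the processor-side controller
        TagArray busTags;    // duplicate copy used by the bus-side snooper

        // Bus-side controller: tag check on every bus transaction.
        // Uses only busTags, so the processor is not locked out.
        bool snoopHit(std::uint64_t blockAddr) const {
            auto it = busTags.find(blockAddr);
            return it != busTags.end() && it->second != State::Invalid;
        }

        // A state update touches both copies, briefly locking out one side.
        void setState(std::uint64_t blockAddr, State s) {
            procTags[blockAddr] = s;
            busTags[blockAddr]  = s;
        }
    };

    int main() {
        SnoopyCacheController cc;
        cc.setState(0x1000, State::Modified);
        std::printf("snoop 0x1000: %s\n", cc.snoopHit(0x1000) ? "hit" : "miss");
        std::printf("snoop 0x2000: %s\n", cc.snoopHit(0x2000) ? "hit" : "miss");
    }

The brief lock-out mentioned above corresponds to setState having to touch both copies at once.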
3.2 Reporting Snoop Results

How does memory know that another cache will respond and provide a copy of the block, so that it does not have to? In a uniprocessor system, an initiator places an address on the bus and all other devices monitor it; the responder device recognizes the address as relevant and data is transferred between the two devices. The responder must acknowledge within a time-out window by raising a wired-OR signal; if no device responds, a bus error occurs. With snooping caches, each cache must check the address on the bus against its tags, and all caches must respond before the transaction can proceed. The snoop results inform main memory whether it should respond to the request or whether a cache holds a modified copy of the block. The questions are when and how the snoop results are reported on the bus.

When to report snoop results? As quickly as possible, so that main memory can decide quickly what action to take. Some of the implementation options are:

1. The design could guarantee snoop results within a fixed number of clock cycles from the address issue on the bus. This requires a dual set of tags, because the processor, which has higher priority, could otherwise be accessing the tags heavily; even so, both sets of tags are inaccessible during processor tag updates. This option has the advantage of a simple memory subsystem, but requires extra hardware and potentially a longer worst-case snoop latency. The Pentium Pro, HP servers, and the Sun Enterprise implement this option.

2. The design could provide snoop results after a variable delay. Main memory assumes that one of the caches will supply the data until all caches have snooped and indicated otherwise. This option may be easier to implement, since controllers do not have to worry about tag-access conflicts delaying the snoop response, and it can also provide high performance, since the designer does not have to assume the worst-case delay for snoop results. The SGI Challenge implements a variant of this option in which memory fetches the data and stalls until the snoops complete.

3. The design could provide snoop results immediately. Main memory maintains a state bit per block that indicates whether the block is modified in one of the caches, so memory does not have to wait for snoop results to take action. The disadvantage of this approach is the extra complexity added to the main memory subsystem.

How to report snoop results? Use three wired-OR signals, which a responder can raise to acknowledge on the bus. Two bits carry the snoop result itself: a Shared bit, asserted if any cache has a copy of the block, and a Dirty bit, asserted if some cache has a dirty copy (the dirty cache itself knows what action it has to take). The third bit is an inhibit ("snoop valid") signal, which is asserted until all processors have completed their snoop; only when it is released are the other two lines meaningful.

For example, in the MESI protocol, the requesting cache controller needs to know whether the block is present in another cache or only in memory, in order to decide whether to move to the Exclusive or the Shared state. The memory needs to know whether any cache has the block in modified state, so that it knows whether it has to respond. The Illinois MESI protocol is more complex because it allows cache-to-cache transfers, where the requesting controller can obtain the data from other caches rather than from memory; however, a priority scheme has to be implemented to select a single supplier. The SGI Challenge and the Sun Enterprise server use cache-to-cache transfers only for data in exclusive or modified state, so that there is always a single supplier. The Challenge also updates memory during a cache-to-cache transfer, so that it does not need to provide a shared-modified (owned) state.
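The sketch below (C++; SnoopReply, BusSnoopLines, and combineSnoopResults are invented names) models the wired-OR combination of per-cache snoop results and the decision derived from them: while the inhibit line is asserted the result is not yet valid; if the dirty line is asserted a cache will supply the data and memory stays silent; otherwise memory responds, and the shared line tells a MESI-style requester whether to load the block in the Shared or the Exclusive state.

    #include <cstdio>
    #include <vector>

    // Per-cache contribution to the wired-OR snoop lines.
    struct SnoopReply {
        bool shared;   // this cache has a copy of the block
        bool dirty;    // this cache has a modified copy
        bool pending;  // this cache has not finished its snoop yet (drives inhibit)
    };

    struct BusSnoopLines {
        bool shared;
        bool dirty;
        bool inhibit;  // asserted while any snoop is still pending
    };

    // Wired-OR: a line is asserted if any cache asserts it.
    BusSnoopLines combineSnoopResults(const std::vector<SnoopReply>& replies) {
        BusSnoopLines lines{false, false, false};
        for (const auto& r : replies) {
            lines.shared  |= r.shared;
            lines.dirty   |= r.dirty;
            lines.inhibit |= r.pending;
        }
        return lines;
    }

    int main() {
        std::vector<SnoopReply> replies = {
            {false, false, false},   // cache 0: no copy
            {false, true,  false},   // cache 1: modified copy
            {false, false, false},   // cache 2: no copy
        };
        BusSnoopLines lines = combineSnoopResults(replies);
        if (lines.inhibit)
            std::printf("snoop result not valid yet\n");
        else if (lines.dirty)
            std::printf("a cache supplies the data; memory stays silent\n");
        else
            std::printf("memory responds; requester loads block as %s\n",
                        lines.shared ? "Shared" : "Exclusive");
    }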

3.3 Write-backs

To allow the processor to continue quickly on a cache miss that requires a write-back, i.e., replacement of modified data in the cache, it is desirable to delay the write-back: service the miss first and then process the write-back asynchronously. For this, an additional write-back buffer is provided to temporarily store the block being replaced while the new block is brought in and before the bus can be reacquired to complete the write-back. If the controller sees a bus transaction containing the address of the buffered block, it must supply the data from the write-back buffer and cancel its pending write-back request to the bus. To support this, an address comparator is added to the snoop logic so that the write-back buffer is also snooped.
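A rough sketch of this write-back buffer is shown below (C++; the WriteBackBuffer type, its field names, and the 128-byte block size are illustrative assumptions). The buffer holds a single displaced dirty block; the snoop path compares every bus address against it, and a hit both supplies the data and cancels the pending write-back, as described above.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <optional>

    constexpr std::size_t kBlockBytes = 128;

    // Holds one dirty block displaced by a miss until the bus can be
    // re-acquired to write it back to memory.
    struct WriteBackBuffer {
        bool valid = false;
        std::uint64_t blockAddr = 0;
        std::array<std::uint8_t, kBlockBytes> data{};

        void fill(std::uint64_t addr, const std::array<std::uint8_t, kBlockBytes>& d) {
            valid = true; blockAddr = addr; data = d;
        }

        // Snoop path: address comparator against the buffered block.
        // On a match, supply the data and cancel the pending write-back.
        std::optional<std::array<std::uint8_t, kBlockBytes>>
        snoop(std::uint64_t busAddr) {
            if (valid && busAddr == blockAddr) {
                valid = false;                 // write-back request cancelled
                return data;                   // data supplied from the buffer
            }
            return std::nullopt;
        }
    };

    int main() {
        WriteBackBuffer wb;
        wb.fill(0x4000, {});                   // dirty block displaced by a miss
        auto hit = wb.snoop(0x4000);
        std::printf("snoop on buffered block: %s\n",
                    hit ? "data supplied, write-back cancelled" : "miss");
    }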
4.0 Multilevel Cache Hierarchies

In modern systems, processors have two or three levels of caches. Since the early 1990s, multiprocessor designs usually have an on-chip first-level cache and a much larger second-level cache on or off chip. In this arrangement, changes made by the processor to the L1 cache may not be visible to the L2 cache controller that is responsible for the bus operations, and bus transactions are not directly visible to the L1 cache. This appears rather complicated; however, the basic principles of cache coherence extend quite well to multilevel caches.

One way to handle multilevel caches is to implement snooping hardware at each level of the cache hierarchy. The disadvantages of this design are:

1. The L1 cache is usually on the processor chip, and on-chip snoop hardware would consume precious pin resources to monitor the addresses on the bus.
2. Duplicating the tags to allow simultaneous access by the snooper and the processor is too area intensive.
3. There is duplication of effort between the snoop logic at each level of the hierarchy, L1 and L2, since blocks in the L1 cache are usually in the L2 cache as well.

A practical solution, which is commonly used, is to preserve what is referred to as the inclusion property. Assuming two cache levels, the inclusion property can be defined by the following two requirements:

1. If a cache block is in the L1 cache, then it must also be in the L2 cache.
2. If a block is in the modified state in the L1 cache, then it must also be marked as modified in the L2 cache. This can, for example, be accomplished by a write-through L1 cache.

Therefore, if a multiprocessor system that implements a snoop-based protocol with multiple levels of caches adheres to the inclusion property, only the lowest-level cache needs to have snooping logic, because the data in a higher-level cache is a subset of the data in the lower-level cache. The second requirement ensures that if a BusRd transaction requests a block that is in modified state in either cache, the snoop logic in the lower-level cache can immediately respond on the bus. Information now flows in both directions: L1 accesses L2 for cache-miss handling and block state changes, and L2 forwards to L1 the blocks invalidated or updated by bus transactions.

The first requirement for inclusion is not trivial to achieve. Differences in block size, associativity, etc. between the caches at the different levels may result in data being cached in the L1 cache but not in the caches below it (level > 1). For example, when the L1 cache is set-associative and the L2 cache is direct mapped, two cache blocks may reside in the L1 cache at the same time while in the L2 cache they map onto the same entry.
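The following small sketch (C++; the cache geometries are invented purely to make the point numerically) shows such a violation: two block addresses fall into the same set of a 2-way set-associative L1, so both can be resident there, yet they also map onto the same entry of a direct-mapped L2, where only one of them can be held.

    #include <cstdint>
    #include <cstdio>

    // Illustrative cache geometries (made-up numbers): a 2-way set-associative
    // L1 and a larger direct-mapped L2, both with 32-byte blocks.
    constexpr std::uint64_t kBlockBytes = 32;
    constexpr std::uint64_t kL1Sets = 64;      // 2 ways x 64 sets x 32 B = 4 KB
    constexpr std::uint64_t kL2Sets = 2048;    // 1 way x 2048 sets x 32 B = 64 KB

    std::uint64_t l1Set(std::uint64_t addr) { return (addr / kBlockBytes) % kL1Sets; }
    std::uint64_t l2Set(std::uint64_t addr) { return (addr / kBlockBytes) % kL2Sets; }

    int main() {
        // Two block addresses exactly one L2 capacity apart.
        std::uint64_t a = 0x0;
        std::uint64_t b = kL2Sets * kBlockBytes;   // 64 KB away

        std::printf("L1 sets: a -> %llu, b -> %llu (same set; 2-way, so both fit)\n",
                    (unsigned long long)l1Set(a), (unsigned long long)l1Set(b));
        std::printf("L2 sets: a -> %llu, b -> %llu (same entry; direct mapped, only one fits)\n",
                    (unsigned long long)l2Set(a), (unsigned long long)l2Set(b));
    }

Whichever of the two blocks loses the single L2 entry is then cached in L1 but not in L2, violating the first inclusion requirement.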

4.1 Maintaining Inclusion

Maintaining the requirements for inclusion is not easy. Assuming two cache levels, the following scenarios are important to consider and must be handled properly: processor references to L1 can cause it to change state and perform replacements; bus transactions can cause the L2 cache to change state and flush blocks, and these events must be forwarded to L1; and the modified state must be propagated out to the L2 cache.

For some cache configurations inclusion is achieved automatically. For example, in the commonly encountered case where the L1 cache is direct mapped, the L2 cache is either direct mapped or set associative, both caches use the same block size, and the number of sets in the L1 cache is smaller than in the L2 cache, the inclusion property is preserved. Even though all L1 cache misses go to the L2 cache, inclusion is not automatic for all configurations, because the two caches may choose to replace different blocks when their replacement algorithms are based on the history of accesses to the blocks. Inclusion can be violated if L1 is not direct mapped and uses an LRU replacement policy, regardless of the configuration of the L2 cache. A similar problem with replacements is seen when the L1 cache is split between instructions and data. Inclusion also cannot be guaranteed if multiple independent caches are connected to a highly associative cache at a lower level, and it can be violated if the cache levels use different cache block sizes.

4.2 Propagating Transactions for Coherence in the Hierarchy

How are transactions propagated for multilevel caches? The mechanisms used for propagating coherency events need to be extended to explicitly maintain inclusion when the cache configuration does not do so automatically. When a block in L2 is replaced, the address of the block is sent to the L1 cache, to get it to invalidate, or if dirty to flush, the corresponding block. Information about invalidations and updates can be passed from L2 to L1 in two different ways: either propagate all bus transactions from L2 to the L1 cache (even if the given block is not present there), or maintain an extra state bit per block (an inclusion bit) in the L2 cache to identify which blocks in L2 are also in L1; with some extra hardware, the L2 controller can then filter interventions to the L1 cache.

The same kind of tradeoff occurs for writes to L1. On a write hit to L1, the modification needs to be communicated to L2, i.e., the modified state must be propagated from L1 to L2. L1 can be implemented as write-through (so that all modifications also reach L2), but this can consume a lot of L2 cache bandwidth for writes, and a write buffer is needed between the two caches to prevent processor stalls. Another solution is to implement L1 as a write-back cache and add a state bit per block in L2 indicating "modified-but-stale": the block behaves as modified as far as coherence is concerned, but a newer version must be fetched from L1 if the block has to be supplied as part of a bus transaction (handled by L2), i.e., the data is fetched from the L1 cache on a flush. Interestingly, dual tags are less critical with multilevel caches, since L2 serves as a filter for the L1 cache, screening out irrelevant transactions from the bus. If we maintain inclusion and allow only one transaction on the bus at a time, multilevel hierarchies do not pose any additional correctness issues.
The required transactions are propagated up and down the hierarchy, and bus transactions may be held until the propagation completes. Obviously, the performance penalty for holding a processor write until the BusRdX has been granted is rather high, so there is a clear motivation to decouple these operations. The penalty of holding the bus until the L1 cache has been queried can be avoided if the design uses early commitment.

Figure 4-1. Organization of a two-level snoopy cache: each level has its own tags and cached data, with the L1 tags used mainly by the processor and the L2 tags used mainly by the bus snooper [1].
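As a rough model of the filtering mechanisms described in Section 4.2, the sketch below (C++; L2BlockState, Intervention, and onBusRdX are invented names) gives each L2 block an inclusion bit (the block is also in L1) and a modified-but-stale bit (L1 holds newer data), and shows how an L2 controller might use them when it snoops a BusRdX: the intervention is forwarded to L1 only if the block is actually there, and the data is fetched from L1 first only if L1's copy is newer.

    #include <cstdio>

    // Per-block bookkeeping kept by the L2 controller (illustrative only).
    struct L2BlockState {
        bool valid         = false;
        bool modified      = false;  // block is dirty with respect to memory
        bool inL1          = false;  // inclusion bit: block is also present in L1
        bool modifiedStale = false;  // L1 holds a newer version than L2's copy
    };

    struct Intervention {
        bool forwardToL1;   // must L1 be invalidated/flushed?
        bool fetchFromL1;   // must the data be pulled from L1 before responding?
        bool supplyData;    // does this cache supply the block on the bus?
    };

    // Decide how to handle a snooped BusRdX (read-exclusive) for this block.
    Intervention onBusRdX(const L2BlockState& s) {
        Intervention act{false, false, false};
        if (!s.valid) return act;                 // block not here: nothing to do
        act.forwardToL1 = s.inL1;                 // filter: only disturb L1 if needed
        act.fetchFromL1 = s.modifiedStale;        // newest data lives in L1
        act.supplyData  = s.modified || s.modifiedStale;
        return act;
    }

    int main() {
        L2BlockState s;
        s.valid = true; s.modified = true; s.inL1 = true; s.modifiedStale = true;
        Intervention a = onBusRdX(s);
        std::printf("forwardToL1=%d fetchFromL1=%d supplyData=%d\n",
                    a.forwardToL1, a.fetchFromL1, a.supplyData);
    }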

5.0 Split-Transaction Bus

A split-transaction bus is a more aggressive implementation in which bus transactions are not necessarily atomic. In a split-transaction bus (STB), transactions that require a response are split into two independent sub-transactions: a request transaction and a response transaction, each of which is arbitrated separately. Other transactions are allowed to interleave between them, so the bus is used while the response to the original request is being fetched. Buffering between the bus and the cache controller allows multiple transactions to be outstanding on the bus while waiting for snoop and/or data responses from other controllers. The advantage of this bus design is that, by pipelining bus operations, the bus is utilized more efficiently, so more processors can share the same bus. However, this comes at the cost of increased complexity.

There are three phases in a transaction: (1) a request is put on the bus, (2) snoop results are sent by the other caches to communicate whether the requested block is present in a state the requester needs to know about, and finally (3) data is sent to the requesting cache, if needed.

There are some major issues encountered in supporting STBs, and the most common ones are discussed here. A new request can appear on the bus before the snoop and/or servicing of an earlier request is complete; in particular, these requests may be to the same block, i.e., conflicting requests. The number of buffers for incoming requests and data responses between the bus and the cache controller is usually fixed and small, so flow-control mechanisms are needed to avoid overflowing these buffers. Care must also be taken over when and how snoop responses and data responses are produced on the bus, since requests from the bus are buffered: depending on the implementation, the snoop result and the data can be part of the same response transaction, or the snoop result can be generated with respect to the request appearing on the bus. The Intel Pentium Pro and DEC TurboLaser buses keep the responses in the same order as the requests, whereas the SGI Challenge and Sun Enterprise buses allow out-of-order responses.

5.1 SGI Challenge Example

The SGI Challenge handles conflicting requests conservatively: it allows only eight outstanding requests at a time and does not allow conflicting requests to the same block. Flow control for the small buffering space between the bus and the controllers is implemented through negative-acknowledgement (NACK) lines on the bus. If the buffer is full when a request arrives, the request is NACKed as soon as it appears on the bus; the request is then invalid and the requestor must retry. Responses may return in a different order than the requests: the request phase establishes the order of transactions, and snoop results from the cache controllers are presented on the bus as part of the response phase, together with the data response if applicable.

The split-transaction bus design consists of two separate buses: a request bus for the command (including NACK) plus the address, and a response bus for the data. When a request (a command and address pair) wins the bus arbitration and is granted the bus, it is assigned a unique 3-bit tag (3 bits because there are at most 8 outstanding transactions). A returning response consists of the data on the response bus together with the request tag. The address and data buses are arbitrated separately, and there are separate buses for arbitration, flow control, and snoop results. Cache blocks are 128 bytes while the data bus is 256 bits wide, so four bus cycles plus a one-cycle turnaround are required for the response phase. A uniform pipeline strategy is followed, so the request phase is also five cycles, comprising arbitration, resolution, address, decode, and acknowledgment.
Overall, a complete request-response transaction takes three or more of these five-cycle phases: an address request phase, a data request phase (which uses the data bus arbitration logic to obtain access to the data bus for the response sub-transaction), and a data transfer or response phase (which uses the data bus). Three different memory operations can therefore be in three different phases at the same time.

5.2 Bus and Cache Controller Design

To keep track of the eight outstanding requests on the bus, each cache controller maintains an eight-entry table, called a request table. Whenever a new request is issued on the bus, it is added to all request tables at the same index as part of the arbitration process; the index is the 3-bit tag assigned to that request. Each table entry contains the block address, the request type, the state of the block in that cache, and so on. The table is fully associative, so a new entry can be placed anywhere in the table, and it is checked for a match both on requests from the local processor and on all requests and responses snooped from the bus. The table entry and its tag are freed when the corresponding response is observed on the bus, at which point the tag can be reassigned.

5.3 Snoop Results and Request Conflicts

Avoiding conflicting requests is straightforward: since every controller has a record in its request table of the pending transactions issued on the bus, no request is issued for a block that already has a transaction outstanding. Each controller sees all the requests made by the other controllers through the shared bus. Therefore, if a controller wants a block for which there is already an entry in the request table (made by another controller), it may decide to snoop the response to the other request and set the shared bit when the data is transferred, in order to let the originator know that the block is now shared. The SGI Challenge uses variable-delay snooping: the snoop portion of the bus consists of three wired-OR lines (sharing, dirty, inhibit), and the inhibit line extends the current response sub-transaction. The request phase determines who will respond, but this may take many cycles, with intervening request-response transactions; all controllers present their snoop results on the bus when they see the response. There are no data responses or snoop results for write-backs and upgrades.
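The sketch below (C++; RequestTable and its method names are invented, and the entry fields are reduced to the ones listed above) models the eight-entry request table: a new bus request is installed under the 3-bit tag assigned at arbitration time, a controller can check for a conflicting outstanding transaction before issuing its own request, and the entry is freed when the matching response is observed on the bus so that the tag can be reused.

    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <optional>

    constexpr int kOutstanding = 8;   // eight outstanding requests -> 3-bit tags

    struct RequestEntry {
        bool valid = false;
        std::uint64_t blockAddr = 0;
        int requestType = 0;          // e.g. BusRd, BusRdX (encoding not specified here)
    };

    class RequestTable {
    public:
        // Install a request under the 3-bit tag assigned at arbitration time.
        void insert(int tag, std::uint64_t addr, int type) {
            table_[tag] = {true, addr, type};
        }
        // Conflict check: is there already an outstanding transaction for this block?
        std::optional<int> findConflict(std::uint64_t addr) const {
            for (int t = 0; t < kOutstanding; ++t)
                if (table_[t].valid && table_[t].blockAddr == addr) return t;
            return std::nullopt;
        }
        // The response carrying this tag frees the entry; the bus may reuse the tag.
        void release(int tag) { table_[tag].valid = false; }

    private:
        std::array<RequestEntry, kOutstanding> table_{};
    };

    int main() {
        RequestTable rt;
        rt.insert(3, 0x8000, /*BusRd*/ 0);       // snooped request from another controller
        if (auto tag = rt.findConflict(0x8000))
            std::printf("block 0x8000 outstanding under tag %d; defer our request\n", *tag);
        rt.release(3);                            // response seen on the bus
        std::printf("after response: conflict? %s\n", rt.findConflict(0x8000) ? "yes" : "no");
    }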

5.4 Flow Control

The cache subsystem implements flow control at the buffers for incoming requests from the bus to the cache controller (and at the write-back buffer), and also at a response buffer where the responses to requests are stored. A response buffer entry contains an address and a full cache block of data, so the number of entries in this buffer is kept small. The controller limits the number of outstanding requests so that it can guarantee buffer space for every response. Flow control is also needed at main memory, since it must potentially be able to accept a write-back for each of the eight pending requests; these transactions can arrive quickly and overflow the buffers. The SGI Challenge provides separate NACK lines for the address and data buses. NACKed transactions are cancelled everywhere and must be retried later; back-off and priority schemes can be used to limit the bandwidth consumed by failed retries and to avoid starvation. The Sun Enterprise implements a scheme in which the destination buffer that was full initiates the retry when it has free space, which limits the system to at most two retries per transaction.

Writes are performed during the request phase, so the operations on an individual memory location are serialized as in the atomic case, even though the bus is pipelined. Once a BusRdX or BusUpgr has obtained the bus, the associated write is committed. However, commitment of a write does not guarantee that the value produced by the write is visible to all other processors; only actual completion guarantees that. Therefore, additional mechanisms are needed to support sequential consistency. As stated before, the condition necessary for SC in this case is that a processor must not be allowed to actually see the new value of a write before previous writes (in bus order) are visible to it. This can be achieved in two ways: (1) by not letting certain types of incoming transactions from the bus to the cache be reordered in the incoming queues, or (2) by allowing these reorderings in the queues but ensuring that the important orders are preserved at the necessary points in the design. An alternative, simpler approach is to treat all requests in FIFO order: the fully associative request table can be replaced with a simple FIFO buffer that stores the pending requests observed on the bus. Although this approach is simpler, it can have performance problems.
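Returning to the buffer flow control described at the start of this section, the following is a minimal sketch (C++; IncomingQueue and tryAccept are invented names, and the buffer size is deliberately tiny) of NACK-on-full behaviour: an incoming transaction that finds the buffer full is NACKed and must be retried later, with back-off, priority, or destination-initiated retry schemes limiting the cost of those retries.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <queue>

    struct BusTransaction {
        std::uint64_t blockAddr;
        int type;                      // request type; encoding not specified here
    };

    // Fixed-size incoming buffer between the bus and a cache controller.
    class IncomingQueue {
    public:
        explicit IncomingQueue(std::size_t capacity) : capacity_(capacity) {}

        // Returns true if accepted, false if the transaction must be NACKed.
        bool tryAccept(const BusTransaction& t) {
            if (q_.size() >= capacity_) return false;   // buffer full -> NACK
            q_.push(t);
            return true;
        }
        void drainOne() { if (!q_.empty()) q_.pop(); }

    private:
        std::size_t capacity_;
        std::queue<BusTransaction> q_;
    };

    int main() {
        IncomingQueue in(2);                             // deliberately tiny buffer
        int nacks = 0;
        for (int i = 0; i < 4; ++i) {
            BusTransaction t{0x1000ull + 0x80 * i, 0};
            if (!in.tryAccept(t))
                ++nacks;                                 // requester must retry later
        }
        std::printf("accepted %d, NACKed %d\n", 4 - nacks, nacks);
    }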
6.0 Multilevel Caches and STB

The design shown in Figure 5-1 combines a split-transaction bus with a two-level cache hierarchy.

Figure 5-1. Multilevel cache hierarchy on a split-transaction bus: queues carry requests and responses between the processor, the L1 cache, the L2 cache, and the bus [1].

It takes a considerable number of cycles for a request to propagate through the cache controllers, and other transactions are allowed to propagate up and down the hierarchy during this time. To maintain high bandwidth while allowing the individual units, such as controllers and caches, to operate at their own rates, queues are placed between the levels of the hierarchy. However, this can lead to deadlock and serialization issues. To avoid fetch deadlock, the L2 cache must buffer incoming requests and responses while its own request is outstanding, in order to keep the bus free. If one outstanding request is allowed per processor, the design needs enough space to hold p requests (p = number of processors) plus one reply; this much is essential. If it does not have enough buffer space, or if it wants to allow multiple outstanding requests per processor, then it may need to NACK bus requests, and in that case the request bus arbiter needs a priority mechanism to ensure forward progress.

Buffer deadlock can occur, for example, if L1 is a write-back cache: data then flows in both directions (L2 to L1 and L1 to L2), and both levels may be waiting for the other with full buffers. (This particular deadlock does not occur if L1 is not write-back.) One could provide enough buffering to rule this out, but that takes up a lot of area and does not scale. Another way to deal with it is to limit the number of outstanding requests from the processors and then provide enough buffering for incoming requests and responses at each level.

6.1 Sequential Consistency in Multilevel Caches

Maintaining sequential consistency is another major concern. The separation of commitment from completion is even greater with a multilevel cache hierarchy; it is therefore important that the bus does not wait for an invalidation to reach all the way up to L1 and return a reply, but instead considers a write committed when it is placed on the bus. Fortunately, the techniques used for a single-level cache extend well to multilevel caches; they simply need to be applied at each level of the hierarchy. These techniques were listed in Section 5.4.
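As a minimal illustration of the FIFO-ordering option mentioned in Section 5.4 and applied per level here, the sketch below (C++; the event and function names are invented) drains an incoming queue strictly in bus order, so an invalidation that was committed on the bus before a data reply is always applied before that reply is made visible to the level above.

    #include <cstdint>
    #include <cstdio>
    #include <queue>

    enum class Kind { Invalidate, DataReply };

    struct IncomingEvent {
        Kind kind;
        std::uint64_t blockAddr;
    };

    // Apply incoming events strictly in the order they appeared on the bus.
    void drainInOrder(std::queue<IncomingEvent>& q) {
        while (!q.empty()) {
            IncomingEvent e = q.front();
            q.pop();
            if (e.kind == Kind::Invalidate)
                std::printf("invalidate block 0x%llx\n", (unsigned long long)e.blockAddr);
            else
                std::printf("deliver reply for block 0x%llx to the level above\n",
                            (unsigned long long)e.blockAddr);
        }
    }

    int main() {
        std::queue<IncomingEvent> q;
        q.push({Kind::Invalidate, 0x2000});   // committed first on the bus
        q.push({Kind::DataReply,  0x3000});   // our own outstanding read returns later
        drainInOrder(q);
    }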

6.2 Case Studies of Bus-Based Multiprocessors

A couple of case studies of bus-based multiprocessors are presented in this section, to show how some of the key design issues discussed in the previous sections are addressed in commercial implementations.

SGI Challenge

A summary of the main design features of the SGI Challenge multiprocessor, which supports up to 18/36 MIPS processors, is listed below.

Bus design:
- Powerpath-2 bus with a peak bandwidth of 1.2 GB/s; interleaved memory is accessible through the bus in uniform time.
- Powerpath-2 (PP2) is non-multiplexed, with a 256-bit-wide data portion and a separate 40-bit-wide address portion, plus command and other signals, for a total of 329 signals. It is clocked at 47.6 MHz and is a split-transaction design supporting eight outstanding read requests.
- The bus supports 16 slots, 9 of which can be used by 4-processor boards (a wide bus means more interface chips and therefore higher latency, but more bandwidth at a slower clock).
- The L2 block size is 128 bytes, i.e., a block can be transferred in 4 bus cycles.
- All transactions take 5 cycles: arbitration, resolution, address, decode, and acknowledge; when no transactions are occurring, each bus controller drops into a two-state idle machine.
- Since the bus is split-transaction, the address and data buses are arbitrated separately.

Figure 6-1. Powerpath-2 bus state transition diagram [1].

Processor and memory systems: each processor board has four MIPS R4400 processors that share the board's address and data chips. There is one cache-coherence chip (CC-chip) per processor, holding a duplicate set of tags, and one address chip (A-chip) per board, containing the address bus interface, the request table, and the control logic. Processor requests go from the CC-chip through the A-chip to the bus. Four bit-sliced data chips (D-chips) interface with the data bus; they are quite simple and are shared among the processors on the board.

Figure 6-2. SGI Challenge processor board [1].

Memory access latency: the Challenge provides a 250 ns access time from address on the bus to data on the bus (a data request satisfied by main memory appears 12 cycles after the address appeared on the bus). However, the overall latency seen by the processor is about 1000 ns: roughly 300 ns for the request to get from the processor to the bus (down through the cache hierarchy, the CC-chip, and the A-chip); about 400 ns until the data reaches the D-chips (3 bus cycles to the address phase of the request transaction, 12 to access main memory, and 5 to deliver the data across the bus to the D-chips); and roughly 300 ns more for the data to reach the processor chip (up through the D-chips, the CC-chip, and the 64-bit-wide interface to the processor chip, loading the data into the primary cache and restarting the pipeline).

Sun Enterprise 6000

The Sun Enterprise supports up to 30 UltraSPARC processors with the Gigaplane system bus of 2.67 GB/s. Even though some memory is physically local to a pair of processors, all of memory is accessed through the bus and therefore has a uniform access time.

Sun Gigaplane bus:
- Non-multiplexed and split-transaction, with 256-bit data lines and 41-bit physical addresses, clocked at 83.5 MHz.
- The bus can support up to 112 outstanding transactions, including up to 7 from each board.
- The bus carries 388 signals.
- The bus protocol is very different from the Challenge's, using collision-based speculative arbitration to avoid the cost of bus arbitration, with an emphasis on reducing latency. If a request collision occurs, the requestor that wins simply drives the address again in the next cycle, as it would have with conventional arbitration.
- Five cycles after the address is on the bus, all boards assert their snoop signals on the state bus lines.
- As in the Challenge, invalidations are ordered by the BusRdX requests appearing on the address bus and are handled in FIFO fashion by the cache subsystem.
- Although the UltraSPARC implements a five-state MOESI protocol in the L2 caches, the D-tags maintain an approximation using only three states (owned, shared, and invalid), which is the minimum information needed to know when to place data on the bus.

7.0 Shared Cache Designs

Are there any solutions for shared L2 caches that are based on a bus network? How does the bus network need to be modified to support shared caches? Processors can be grouped together to share a level of the memory hierarchy, which leads to shared cache designs; this is a potentially rewarding implementation choice, especially for multiple processors on a single chip. The interconnect is placed between the processors and the first-level cache, and both the cache and the memory system may be interleaved to provide higher bandwidth [1]. Sharing of L2 cache banks among two or more processors is typically achieved using a crossbar. This allows multiple core ports to launch operations to the L2 in the same clock cycle, and, similarly, multiple L2 banks can send data to different processor ports in the same cycle [6]. The crossbar interconnect consists of crossbar links and interface logic: address lines going from each core to the banks (required for write-backs) and data lines going from every bank to the cores (required for data reloads as well as invalidate addresses) [6].

There are several advantages to this type of design over one in which each processor has a private cache at that level of the hierarchy, and the benefits are similar regardless of the level at which the sharing occurs:

1. It eliminates the need for cache coherence at this level; if the L1 cache is shared, there are no multiple copies of a cache block and hence no coherence problem at all.
2. It reduces the latency of communication: communication through a shared L1 cache takes on the order of 2-10 clocks, whereas communication through main memory takes many times longer, and the reduced latency enables finer-grained sharing of data.
3. Prefetching of data can occur across processors, whereas with private caches each processor incurs the miss penalty separately.
4. It reduces the bandwidth requirements at the next level of the hierarchy.
5. It provides more effective use of long cache blocks, as there is no false sharing.
6. The shared cache can be smaller than the combined size of the private caches if the working sets of different processors overlap.

However, there are also disadvantages, particularly when the L1 cache is shared:

1. The shared cache has to satisfy the bandwidth requirements of several processors.
2. The hit latency to a shared cache is higher than to a private one.
3. The design of the cache is more complicated because of points 1 and 2, and shared caches are usually slower.
4. Instead of constructive interference (as in the overlapping working-set example), destructive interference can occur.
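As a small illustration of the banked, interleaved shared L2 described above, the sketch below (C++; the bank count, block size, and addresses are invented and follow [6] only loosely) maps block addresses to banks and checks whether the requests issued by several cores in the same cycle land on distinct banks, in which case the crossbar can serve them in parallel, or collide on a bank and must be serialized.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <set>
    #include <vector>

    constexpr std::uint64_t kBlockBytes = 64;   // illustrative block size
    constexpr std::uint64_t kNumBanks   = 4;    // illustrative bank count

    // Interleave consecutive blocks across the L2 banks.
    std::uint64_t bankOf(std::uint64_t addr) {
        return (addr / kBlockBytes) % kNumBanks;
    }

    int main() {
        // One request per core in the same cycle (addresses are made up).
        std::vector<std::uint64_t> perCoreRequest = {0x0000, 0x0040, 0x0080, 0x1000};

        std::set<std::uint64_t> banksUsed;
        bool conflict = false;
        for (std::size_t core = 0; core < perCoreRequest.size(); ++core) {
            std::uint64_t b = bankOf(perCoreRequest[core]);
            std::printf("core %zu -> bank %llu\n", core, (unsigned long long)b);
            conflict |= !banksUsed.insert(b).second;   // same bank twice -> serialize
        }
        std::printf(conflict ? "bank conflict: some requests must wait\n"
                             : "all requests proceed in parallel through the crossbar\n");
    }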
7.1 Examples of Shared Cache Designs

Alliant FX-8 (1980s): eight custom processors with a clock cycle of 170 ns, connected through a crossbar to a 512 KB, 4-way interleaved cache. The cache has a 32-byte block size, is direct mapped, uses write-back, and supports two outstanding misses per processor.

Encore Multimax (contemporary): a snoopy cache-coherent multiprocessor in which each private cache supports two processors instead of one.

A practical approach to sharing caches among groups of processors is to give each processor a private L1 cache and have the processors share an L2 cache, since many processors already provide snoop support at the L2 level. The shared cache has to be large to reduce destructive interference, so packaging considerations are also important when architecting shared-cache multiprocessors.

8.0 New Questions and Answers

Q1. Describe how the uniprocessor cache controller and tag memory have to be modified to support a snoopy coherence protocol. Please see Section 3.1.

Q2. How are write-backs handled in a snoopy cache protocol? Please see Section 3.3.

Q3. Describe how snoopy-bus cache coherency protocols work. From [3]: in snoopy-bus protocols, every cache (as part of a processor) attached to the bus monitors all bus transactions. When a snooping cache detects a bus transaction that involves an address which is also present in the snooping cache, appropriate actions are taken. In a write-invalidate protocol, this means that the write actions of one cache in the system invalidate all other copies of the data in the other caches. In write-update systems, the writing cache needs to update all copies in the other caches. For snoopy-bus protocols it is important that caches announce enough information about their actions (e.g., announce all write actions, including cache hits) so that the other (snooping) caches can react to them.

Q4. Explain why coherency protocols for write-back caches have separate Read-Write and Read-Only states. From [3], there are two reasons: in a write-back cache you need to know whether a block is dirty in order to write it back to memory when the block is invalidated by another cache (which wants to write the block) or when it is simply evicted. Moreover, once a cache block is in the Read-Write state, the cache owning the block can write to it without needing to send invalidation messages over the bus (there are no other copies).

Q5. Why is it not a good design choice to implement snooping hardware at each level of the hierarchy in multilevel cache designs? Please see Section 4.0, second paragraph and the accompanying list of disadvantages.

Q6. How are write-backs handled in snoop-based multiprocessor designs? What additional hardware, if any, is required? Please see Section 3.3.

Q7. Describe the types of deadlock issues that can occur in a multilevel cache with an STB implementation and how they can be avoided. Please see Section 6.0.

9. References

[1] David Culler, Jaswinder Pal Singh, and Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, preliminary draft edition, August 1997.

[2] Daniel Braga de Faria, Stanford, Book Summaries, retrieved October 2010.

[3] Andy Pimentel, Introduction to Parallel Architecture, retrieved October 2010.

[4] R. H. Katz, S. J. Eggers, D. A. Wood, C. L. Perkins, and R. G. Sheldon, "Implementing a Cache Consistency Protocol," Proceedings of the 12th ISCA, 1985.

[5] M. S. Papamarcos and J. H. Patel, "A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories," Proceedings of the 11th ISCA, 1984.

[6] R. Kumar, V. Zyuban, and D. M. Tullsen, "Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling," ISCA, June 2005.


Cache Coherence in Bus-Based Shared Memory Multiprocessors Cache Coherence in Bus-Based Shared Memory Multiprocessors Shared Memory Multiprocessors Variations Cache Coherence in Shared Memory Multiprocessors A Coherent Memory System: Intuition Formal Definition

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,

More information

Chapter 9 Multiprocessors

Chapter 9 Multiprocessors ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 11: Cache Coherence: Part II Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) It

More information

CS 252 Graduate Computer Architecture. Lecture 11: Multiprocessors-II

CS 252 Graduate Computer Architecture. Lecture 11: Multiprocessors-II CS 252 Graduate Computer Architecture Lecture 11: Multiprocessors-II Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs252

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Aleksandar Milenkovic, Electrical and Computer Engineering University of Alabama in Huntsville

Aleksandar Milenkovic, Electrical and Computer Engineering University of Alabama in Huntsville Lecture 18: Multiprocessors Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Parallel Computers Definition: A parallel computer is a collection

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

CMSC 411 Computer Systems Architecture Lecture 21 Multiprocessors 3

CMSC 411 Computer Systems Architecture Lecture 21 Multiprocessors 3 MS 411 omputer Systems rchitecture Lecture 21 Multiprocessors 3 Outline Review oherence Write onsistency dministrivia Snooping Building Blocks Snooping protocols and examples oherence traffic and performance

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Limitations of parallel processing

Limitations of parallel processing Your professor du jour: Steve Gribble gribble@cs.washington.edu 323B Sieg Hall all material in this lecture in Henessey and Patterson, Chapter 8 635-640 645, 646 654-665 11/8/00 CSE 471 Multiprocessors

More information

Chapter Seven Morgan Kaufmann Publishers

Chapter Seven Morgan Kaufmann Publishers Chapter Seven Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored as a charge on capacitor (must be

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 26 Cache Optimization Techniques (Contd.) (Refer

More information

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri Cache Coherence (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri mainakc@cse.iitk.ac.in 1 Setting Agenda Software: shared address space Hardware: shared memory multiprocessors Cache

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Lecture 20: Multi-Cache Designs. Spring 2018 Jason Tang

Lecture 20: Multi-Cache Designs. Spring 2018 Jason Tang Lecture 20: Multi-Cache Designs Spring 2018 Jason Tang 1 Topics Split caches Multi-level caches Multiprocessor caches 2 3 Cs of Memory Behaviors Classify all cache misses as: Compulsory Miss (also cold-start

More information

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 18 Guest Lecturer: Shakir James Plan for Today Announcements No class meeting on Monday, meet in project groups Project demos < 2 weeks, Nov 23 rd Questions

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Approaches to Building Parallel Machines. Shared Memory Architectures. Example Cache Coherence Problem. Shared Cache Architectures

Approaches to Building Parallel Machines. Shared Memory Architectures. Example Cache Coherence Problem. Shared Cache Architectures Approaches to Building arallel achines Switch/Bus n Scale Shared ory Architectures (nterleaved) First-level (nterleaved) ain memory n Arvind Krishnamurthy Fall 2004 (nterleaved) ain memory Shared Cache

More information

Advanced OpenMP. Lecture 3: Cache Coherency

Advanced OpenMP. Lecture 3: Cache Coherency Advanced OpenMP Lecture 3: Cache Coherency Cache coherency Main difficulty in building multiprocessor systems is the cache coherency problem. The shared memory programming model assumes that a shared variable

More information

Special Topics. Module 14: "Directory-based Cache Coherence" Lecture 33: "SCI Protocol" Directory-based Cache Coherence: Sequent NUMA-Q.

Special Topics. Module 14: Directory-based Cache Coherence Lecture 33: SCI Protocol Directory-based Cache Coherence: Sequent NUMA-Q. Directory-based Cache Coherence: Special Topics Sequent NUMA-Q SCI protocol Directory overhead Cache overhead Handling read miss Handling write miss Handling writebacks Roll-out protocol Snoop interaction

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

3/13/2008 Csci 211 Lecture %/year. Manufacturer/Year. Processors/chip Threads/Processor. Threads/chip 3/13/2008 Csci 211 Lecture 8 4

3/13/2008 Csci 211 Lecture %/year. Manufacturer/Year. Processors/chip Threads/Processor. Threads/chip 3/13/2008 Csci 211 Lecture 8 4 Outline CSCI Computer System Architecture Lec 8 Multiprocessor Introduction Xiuzhen Cheng Department of Computer Sciences The George Washington University MP Motivation SISD v. SIMD v. MIMD Centralized

More information

Dr e v prasad Dt

Dr e v prasad Dt Dr e v prasad Dt. 12.10.17 Contents Characteristics of Multiprocessors Interconnection Structures Inter Processor Arbitration Inter Processor communication and synchronization Cache Coherence Introduction

More information

Page 1. What is (Hardware) Shared Memory? Some Memory System Options. Outline

Page 1. What is (Hardware) Shared Memory? Some Memory System Options. Outline What is (Hardware) Shared Memory? ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Shared Memory MPs Coherence & Snooping Copyright 2006 Daniel J. Sorin Duke University

More information

Characteristics of Mult l ip i ro r ce c ssors r

Characteristics of Mult l ip i ro r ce c ssors r Characteristics of Multiprocessors A multiprocessor system is an interconnection of two or more CPUs with memory and input output equipment. The term processor in multiprocessor can mean either a central

More information

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O 6.823, L21--1 Cache Coherence Protocols: Implementation Issues on SMP s Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Coherence Issue in I/O 6.823, L21--2 Processor Processor

More information

Replacement Policy: Which block to replace from the set?

Replacement Policy: Which block to replace from the set? Replacement Policy: Which block to replace from the set? Direct mapped: no choice Associative: evict least recently used (LRU) difficult/costly with increasing associativity Alternative: random replacement

More information

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

Module 17: Interconnection Networks Lecture 37: Introduction to Routers Interconnection Networks. Fundamentals. Latency and bandwidth Interconnection Networks Fundamentals Latency and bandwidth Router architecture Coherence protocol and routing [From Chapter 10 of Culler, Singh, Gupta] file:///e /parallel_com_arch/lecture37/37_1.htm[6/13/2012

More information

Scalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Scalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Scalable Cache Coherence Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Shared Memory Architectures. Approaches to Building Parallel Machines

Shared Memory Architectures. Approaches to Building Parallel Machines Shared Memory Architectures Arvind Krishnamurthy Fall 2004 Approaches to Building Parallel Machines P 1 Switch/Bus P n Scale (Interleaved) First-level $ P 1 P n $ $ (Interleaved) Main memory Shared Cache

More information

Cache Optimisation. sometime he thought that there must be a better way

Cache Optimisation. sometime he thought that there must be a better way Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching

More information

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture

More information

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information