[ 5.4] What cache line size is performs best? Which protocol is best to use?

Size: px

Start display at page:

Download "[ 5.4] What cache line size is performs best? Which protocol is best to use?"

Drusilla Pearson
5 years ago
Views:

1 Performance results [ 5.4] What cache line size is performs best? Which protocol is best to use? Questions like these can be answered by simulation. However, getting the answer write is part art and part science. Parameters need to be chosen for the simulator. The authors selected a single-level 4-way set-associative 1 MB cache with 64-byte lines. The simulation assumes an idealized memory model, which assumes that references take constant time. Why is this not realistic? The simulated workload consists of 6 parallel programs from the SPLASH-2 suite and one multiprogrammed workload, consisting of mainly serial programs. Effect of coherence protocol [ 5.4.3] Three coherence protocols were compared: The Illinois MESI protocol ( Ill, left bar). The three-state invalidation protocol (3St) with bus upgrade for S M transitions. (This means that instead of rereading data from main memory when a block moves to the M state, we just issue a bus transaction invalidating the other copies.) The three-state invalidation protocol without bus upgrade (3St-BusRdX). (This means that when a block moves to the M state, we reread it from main memory.) Lecture 16 Architecture of Parallel Computers 1

2 Address bus Data bus Traffic (MB/s) Barnes/III Barnes/3St Barnes/3St-RdEx x LU/III d LU/3St l LU/3St-RdEx t Ocean/III x Ocean/3S Ill Ocean/3St-RdEx t Radiosity/III Ex Radiosity/3St Radiosity/3St-RdEx Radix/III Radix/3St Radix/3St-RdEx Raytrace/III Raytrace/3St Raytrace/3St-RdEx In our parallel programs, which protocol seems to be best? Somewhat surprisingly, the result turns out to be the same for the multiprocessor workload. The reason for this? The advantage of the four-state protocol is that no bus traffic is generated on E M transitions. But E M transitions are very rare (less than 1 per 1K references) Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring

3 Effect of cache line size [ 5.4.4] Recall from Lecture 11 that cache misses can be classified into four categories: Cold misses (called compulsory misses in the previous discussion) occur the first time that a block is referenced. Conflict misses are misses that would not occur if the cache were fully associative with LRU replacement. Capacity misses occur when the cache size is not sufficient to hold data between references. Coherence misses are misses caused by the coherence protocol. Coherence misses can be divided into those caused by true sharing and those caused by false sharing. False-sharing misses are those caused by having a line size larger than one word. Can you explain? True-sharing misses, on the other hand, occur when a processor writes some words into a cache block, invalidating the block in another processors cache, after which the other processor reads one of the modified words. How could we attack each of the four kinds of misses? To reduce capacity misses, we could To reduce conflict misses, we could To reduce cold misses, we could To reduce coherence misses, we could If we increase the line size, the number of coherence misses might go up or down. Why? Lecture 16 Architecture of Parallel Computers 3

4 Increasing the line size has other disadvantages. It increases conflict misses. Why? It increases bus traffic. Why? So it is not clear which line size will work best. 0.6 Upgrade False sharing True sharing Capacity Cold Barnes/8 Barnes/16 Barnes/32 8 Barnes/64 Miss rate (%) Barnes/128 Barnes/256 Lu/8 Lu/16 Lu/32 Lu/64 Lu/128 Lu/256 Radiosity/8 Radiosity/16 Radiosity/32 Radiosity/64 Radiosity/128 Radiosity/256 Results for the first three applications seem to show that which line size is best? For the second set of applications, Radix shows a greatly increasing number of false-sharing misses with increasing block size Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring

5 12 Upgrade 10 8 False sharing True sharing Capacity Cold Miss rate (%) Ocean/8 Ocean/ Ocean/32 Ocean/64 Ocean/128 Ocean/256 Radix/8 Radix/16 Radix/32 Radix/64 Radix/128 Radix/256 Raytrace/8 Raytrace/16 Raytrace/32 Raytrace/64 Raytrace/128 Raytrace/256 However, this is not the whole story. Larger line sizes also create more bus traffic. Traffic (bytes/instructions) Address bus Data bus 0 Barnes/8 Barnes/ Barnes/32 Barnes/64 Barnes/ Barnes/256 Radiosity/8 Radiosity/16 Radiosity/32 Radiosity/64 Radiosity/128 Radiosity/256 Raytrace/8 Raytrace/16 Raytrace/32 Raytrace/64 Raytrace/128 Raytrace/256 Lecture 16 Architecture of Parallel Computers 5

6 With this in mind, which line size would you say is best? Invalidate vs. update [ 5.4.5] Which is better, an update or an invalidation protocol? At first glance, it might seem that update schemes would always be superior to write-invalidate schemes. Why might this be true? Why might this not be true? When there are not many external rereads, When there is a high degree of sharing, For example, in a producer-consumer pattern, Update and invalidation schemes can be combined (see 5.4.5). Let s look at real programs Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring

7 0.60 False sharing True sharing Capacity 2.00 Miss rate (%) Cold Miss rate (%) LU/inv LU/upd Ocean/inv Ocean/mix Ocean/upd Raytrace/inv Raytrace/upd Radix/inv Radix/mix Radix/upd Where there are many coherence misses, If there were many capacity misses, So let s look at bus traffic Lecture 16 Architecture of Parallel Computers 7

8 Note that in two of the applications, updates in an update protocol are much more prevalent than upgrades in an invalidation protocol. LU/inv LU/upd 0.00 Upgrade/update rate (%) Each of these operations produces bus traffic; therefore, the update protocol causes more traffic. Ocean/inv Ocean/mix Ocean/upd The main problem is that one processor tends to write a block multiple times before another processor reads it. Raytrace/inv Raytrace/upd This causes several bus transactions instead of one, as there would be in an invalidation protocol. Radix/inv Upgrade/update rate (%) In addition, updates cause problems in nonbus-based multiprocessors. Radix/mix Radix/upd Extending Cache Coherence [ 6.6] There are several ways in which cache coherence can be extended. These ways interact with other aspects of the architecture. Considering these issues yields a good understanding of where caches fit in the overall hardware/software architecture Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring

9 Shared-cache designs [ 6.6.1] Why not let one cache serve multiple processors? This could be a 1stlevel cache, a 2nd-level cache, etc. P 1... Switch P n What would be the advantages? It eliminates the need for cache coherence. $ (interleaved) Main memory (interleaved) It may decrease the miss ratio. Why? It may reduce the need for a large cache. Why? It allows more effective use of long cache lines. Examples of such designs include the Alliant FX-8 (ca. 1985), which shared a cache among 8 processors, and the Encore Multimax (ca. 1987), which shared a cache among two processors. Today, shared caches are being considered for single-chip multiprocessors, with 4 8 processors sharing an on-chip first-level cache. What are some disadvantages of shared caches? Latency. Lecture 16 Architecture of Parallel Computers 9

10 Bandwidth. Speed of cache. Destructive interference the converse of overlapping working sets. Using shared second-level caches moderates both the advantages and disadvantages of shared first-level caches. Virtually indexed caches and cache coherence [ 6.6.2] Why not just store information in the cache by virtual address rather than physical address? What would be the advantage of this? [A TLB is like a cache that translates a virtual address of the form (page #, displacement) to a physical address of the form (frame #, displacement). ] Page Number Frame Number With large caches, it becomes increasingly difficult to do address translation in parallel with cache access. Why? 2002 Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring

11 Recall this diagram from Lecture 10. What happens to the set number as the cache increases in size? Page number k Set number d j Byte within block Rather than slow down every instruction by serializing address translation and cache access, it makes sense just to use a virtually addressed cache. But virtually addressed caches have their own problems. One problem occurs when memory is shared. Two processes may have different virtual names for the same block. Process 1 s page table Process 2 s page table Page What s wrong with this? Why are duplicate copies of information bad? Think about what happens on a process switch. Assume The processor has been running Process 1. It then switches to Process 2. Later it switches back to Process 1. This is called the synonym problem. Lecture 16 Architecture of Parallel Computers 11

12 Solution: For each cache block, keep both a virtual tag and a physical tag. This allows a virtual address to be found, given a physical address, or vice versa. When a cache miss occurs, look up the physical address to find its virtual address (if any) within the cache. If the block is already in the cache, rename it to a new virtual address rather than reading in the block from memory. Let s see how this works in practice. 1. Look up referenced virtual address in cache. 2. Is virtual address found in cache? No 4. Use translated physical address to look up physical tags. 5. Is physical address found in cache? No 6. Cache miss; fetch data from lower level in memory hierarchy, writing back block(s) if necessary. Yes Yes 3. Cache hit. Proceed with reference. 7. Possibility of synonym in cache 8. Move cache block B from synonym virtual index to replace block C at current virtual index, writing back block C if necessary. Notes: Step 4: We are looking for another copy of the same block being used by another process, with a different virtual address. Step 6: Block at virtual address needs to be written back if it has been modified. Its physical tag may be the same as another block in cache; if so, that block needs to be written back too. Step 7: In one cache organization, there s not just a 2002 Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring

13 possibility of a synonym; it s a certainty. What organization is this? Step 8: If another process q was using this block and tries to access it again, it will get a miss on virtual address, but will find it by physical address. What will it therefore do with this block? Note that only physical addresses are put on the bus. Therefore, they can be snooped, and blocks can be invalidated (or updated) just like in coherence protocols for physically addressed caches. Figure 6.24 in CS&G, which I didn t have time to redraw, illustrates the organization of a virtually addressed cache. Consider first a virtual address. ASID means address-space identifier, which is often called a process identifier (PID) in other books. The diagram shows the V-index overlapping the virtual page number. What is the significance of this? Is the V-index actually stored in the cache? What is the state field in the tag memories? Now consider the physical address. Lecture 16 Architecture of Parallel Computers 13

14 What is the role of the P-index? How long is the P-index, relative to the V-index? Note that each P-tag entry points to a V-tag entry and vice versa; this makes it possible to find a physical address given a virtual address, or vice versa. In multilevel caches, the L 1 cache can be virtually tagged for rapid access, while the L 2 cache is physically tagged to avoid synonyms across processors. Example: Suppose P 1 s page 7 (addresses to 7FFF 16 ) is stored in page frame Suppose also that P 2 is using the same page, but calls it page 2. P 1 is the first to access block 0 on this page. Assuming cache lines are 16 bytes long, addresses 7000 to 700F are now cached. Let s assume the cache is direct mapped, and has 1024 (= ) lines. Which line will this block be placed in? Now suppose that P 2 accesses this block. What will happen? What number given above haven t we used yet? Where does this number get used or stored? TLB coherence [ 6.6.3] A TLB is effectively just a cache of recently used page-table entries. Each processor has its own TLB Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring

15 How can two TLBs come to contain entries for the same page? Name some times when TLB entries need to be changed. Which of the above lead to coherence problems? Solution 1: Virtually addressed caches When a cache is virtually addressed, when is address translation needed? Given that accesses to page mappings are rare in virtually addressed caches, what table might we be able to dispense with? If we don t have a, we don t have a TLB coherence problem. Well, we still have the problem of keeping the swap-out and protection information coherent. Assuming no TLB, where will this information be found? This problem will be handled by the cache-coherence hardware when the OS updates the page tables. Solution 2: TLB shootdown With the TLB-shootdown approach, a processor that changes a TLB entry tells other processors that may be caching the same TLB entries to invalidate their copies. Step 1. A processor p that wants to modify a page table disables interprocessor interrupts and locks the page table. It also clears its active flag, which indicates whether it is actively using any page table. Lecture 16 Architecture of Parallel Computers 15

16 Step 2. Processor p sends an interrupt to other processors that might be using the page table, describing the TLB actions to be performed. Step 3. Each processor that receives the interrupt clears its active flag. Step 4. Processor p busy-waits till the active flags of all interrupted processors are clear, then modifies the page table. Processor p then releases the page-table lock, sets its active flag, and resumes execution. Step 5. Each interrupted processor busy-waits until none of the page tables it is using are locked. After executing the required TLB actions and setting its active flag, it resumes execution. This solution has an overhead that scales linearly with the number of processors. Why? Solution 3: Invalidate instructions Some processors, notably the PowerPC, have a TLB_invalidate_entry instruction. This instruction broadcasts the page address on the bus so that the snooping hardware on other processors can automatically invalidate the corresponding TLB entries without interrupting the processor. A processor simply issues a TLB_invalidate_entry instruction immediately after changing a page-table entry. This works well on a shared-bus system; if two processors change the same entry at the same time, only one change can be broadcast first on the bus Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring

A three-state update protocol

A three-state update protocol Whenever a bus update is generated, suppose that main memory as well as the caches updates its contents. Then which state don t we need? What s the advantage, then, of having