A three-state update protocol

Whenever a bus update is generated, suppose that main memory, as well as the caches, updates its contents. Then which state don't we need? What's the advantage, then, of having the fourth state?

The Firefly protocol, named after a multiprocessor workstation developed by DEC, is an example of such a protocol. Here is a state diagram for the Firefly protocol:

[State diagram: states V (valid-exclusive), S (shared), and D (dirty). Processor-induced transitions are labeled CRM, CWM, and CWH; bus-induced transitions are labeled BR and BW.]

Key:
CRM  CPU read miss
CWM  CPU write miss
CWH  CPU write hit
BR   bus read
BW   bus write

A ✓ following a transition means SharedLine was asserted. An x means it was not. Read hits do not cause state transitions and are not shown.

What do you think the states are, and how do they correspond to the states in                ?

The scheme works as follows:

Lecture 10 Architecture of Parallel Computers 1
On a read hit, the data is returned immediately to the processor, and no caches change state.

On a read miss:
o If another cache (or other caches) has a copy of the block, it supplies it (one of them supplies it) directly to the requesting cache and raises the SharedLine. The bus timing is fixed so all caches respond in the same cycle. All caches, including the requestor, set the state to shared. If the owning cache had the block in state dirty, the block is written to main memory at the same time.
o If no other cache has a copy of the block, it is read from main memory and assigned state valid-exclusive.

On a write hit:
o If the block is already dirty, the write proceeds to the cache without delay.
o If the block is valid-exclusive, the write proceeds without delay and the state is changed to dirty.
o If the block is in state shared, the write is delayed until the bus is acquired and a write-word to main memory is initiated. Other caches pick the data off the bus and update their copies (if any). They also raise the SharedLine. The writing cache can determine whether the block is still being shared by testing this line. If the SharedLine is not asserted, no other cache has a copy of the block, and the requesting cache changes to state valid-exclusive. If the SharedLine is asserted, the block remains in state shared.

2010 Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring 2010 2
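The transitions just described (together with the write-miss rule the notes give next) can be condensed into a transition function for the requesting cache. This is an illustrative sketch, not DEC's hardware; the type and function names are invented for the example:

```c
#include <stdbool.h>

/* Firefly block states, per the key above (names are ours). */
typedef enum { NOT_CACHED, VALID_EXCLUSIVE, SHARED, DIRTY } State;

/* Processor-induced events (read hits cause no transition). */
typedef enum { READ_MISS, WRITE_HIT, WRITE_MISS } Event;

/* Next state of the REQUESTING cache.  shared_line is true when
 * some other cache asserted SharedLine during the bus transaction. */
State next_state(State s, Event e, bool shared_line) {
    switch (e) {
    case READ_MISS:
        /* Supplied by another cache -> shared; from memory -> V. */
        return shared_line ? SHARED : VALID_EXCLUSIVE;
    case WRITE_MISS:
        /* Supplied by another cache -> shared (memory also written);
         * otherwise loaded from memory directly in state dirty. */
        return shared_line ? SHARED : DIRTY;
    case WRITE_HIT:
        if (s == DIRTY || s == VALID_EXCLUSIVE)
            return DIRTY;        /* proceeds without a bus transaction */
        /* Shared: write-word on the bus; SharedLine reveals whether
         * any other cache still holds a copy. */
        return shared_line ? SHARED : VALID_EXCLUSIVE;
    default:
        return s;
    }
}
```

The bus-induced transitions (BR and BW observed from other caches) would be a second function keyed on the snooped transaction; they are omitted here.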
On a write miss:
o If any other caches have a copy of the block, they supply it. By inspecting the SharedLine, the requesting processor determines that the block has been supplied by another cache, and sets its state to shared. The block is also written to memory, and other caches pick the data off the bus and update their copies (if any).
o If no other cache has a copy of the block, the block is loaded from memory in state dirty.

In update protocols in general, since all writes appear on the bus, write serialization, write-completion detection, and write atomicity are simple.

Performance of coherence protocols [§5.4]

What cache line size performs best? Which protocol is best to use? Questions like these can be answered by simulation. However, getting the answer right is part art and part science. Parameters need to be chosen for the simulator.

Culler & Singh (1998) selected a single-level 4-way set-associative 1 MB cache with 64-byte lines. The simulation assumes an idealized memory model, in which references take constant time. Why is this not realistic?

The simulated workload consists of 6 parallel programs from the SPLASH-2 suite and one multiprogrammed workload, consisting mainly of serial programs.
Effect of coherence protocol [CS&G §5.4.3]

Three coherence protocols were compared:
o The Illinois MESI protocol ("Ill", left bar).
o The three-state invalidation protocol (3St) with bus upgrade for S→M transitions. (This means that instead of rereading data from main memory when a block moves to the M state, we just issue a bus transaction invalidating the other copies.)
o The three-state invalidation protocol without bus upgrade (3St-RdEx). (This means that when a block moves to the M state, we reread it from main memory.)

[Bar chart: address-bus and data-bus traffic (MB/s), scale 0–200, for Barnes, LU, Ocean, Radiosity, Radix, and Raytrace under the Ill, 3St, and 3St-RdEx protocols.]

In our parallel programs, which protocol seems to be best?
Somewhat surprisingly, the result turns out to be the same for the multiprogrammed workload. The reason for this? The advantage of the four-state protocol is that no bus traffic is generated on E→M transitions. But E→M transitions are very rare (less than 1 per 1K references).

Invalidate vs. update [CS&G §5.4.5]

Which is better, an update or an invalidation protocol? Let's look at real programs.

[Bar charts: miss rate (%) broken down into cold, capacity, true-sharing, and false-sharing components (scales 0–0.6% and 0–2.5%), for LU, Ocean, Raytrace, and Radix under invalidation (inv), update (upd), and, for Ocean and Radix, mixed (mix) protocols.]

Where there are many coherence misses,

If there were many capacity misses,

So let's look at bus traffic
Note that in two of the applications, updates in an update protocol are much more prevalent than upgrades in an invalidation protocol.

[Bar charts: upgrade/update rate (%) for LU, Ocean, and Raytrace (scale 0–2.5%) and for Radix (scale 0–8%), under inv, mix, and upd protocols.]

Each of these operations produces bus traffic; therefore, the update protocol causes more traffic.

The main problem is that one processor tends to write a block multiple times before another processor reads it. This causes several bus transactions instead of one, as there would be in an invalidation protocol.

In addition, updates cause problems in non-bus-based multiprocessors.

Effect of cache line size [CS&G §5.4.4]

Cache misses can be classified into four categories:

Cold misses (or "compulsory misses") occur the first time that a block is referenced.

Conflict misses are misses that would not occur if the cache were fully associative with LRU replacement.

Capacity misses occur when the cache size is not sufficient to hold data between references.
Coherence misses are misses caused by the coherence protocol. Coherence misses can be divided into those caused by true sharing and those caused by false sharing.

False-sharing misses are those caused by having a line size larger than one word. Can you explain?

True-sharing misses, on the other hand, occur when a processor writes some words into a cache block, invalidating the block in another processor's cache, after which the other processor reads one of the modified words.

How could we attack each of the four kinds of misses?

To reduce capacity misses, we could

To reduce conflict misses, we could

To reduce cold misses, we could

To reduce coherence misses, we could

If we increase the line size, the number of coherence misses might go up or down. What happens to the number of false-sharing misses? What happens to the number of true-sharing misses?

If we increase the line size, what happens to capacity misses?
conflict misses? bus traffic?

So it is not clear which line size will work best.

[Bar chart: miss rate (%), scale 0–0.6, broken down into cold, capacity, true-sharing, false-sharing, and upgrade components, for Barnes, LU, and Radiosity with line sizes of 8, 16, 32, 64, 128, and 256 bytes.]

Results for the first three applications seem to show that which line size is best?

For the second set of applications, which do not fit in cache, Radix shows a greatly increasing number of false-sharing misses with increasing block size.
[Bar chart: miss rate (%), scale 0–12, broken down into cold, capacity, true-sharing, false-sharing, and upgrade components, for Ocean, Radix, and Raytrace with line sizes of 8, 16, 32, 64, 128, and 256 bytes.]

However, larger line sizes also create more bus traffic.

[Bar chart: address-bus and data-bus traffic (bytes/instruction), scale 0–0.18, for Barnes, Radiosity, and Raytrace with line sizes of 8, 16, 32, 64, 128, and 256 bytes.]

With this in mind, which line size would you say is best? 32 or 64.
Write propagation in multilevel caches [§8.4.2]

The coherence protocols we have seen so far have been based on one-level caches. Suppose each processor has its own L1 cache and L2 cache, and the L2 caches are kept coherent. Writes must be propagated upstream and downstream. Define the terms.

Downstream write propagation.

Upstream write propagation.

Which makes downstream propagation simpler, a write-through or write-back L1 cache? Why?

For upstream write propagation:
o An invalidation/intervention received by the L2 must be propagated to the L1 (in case the L1 has the block).
o The inclusion property cuts down the number of such upstream invalidations/interventions.

Lock Implementations [§9.1]

Recall the three kinds of synchronization from Lecture 6:

Point-to-point

Lock
Performance metrics for lock implementations

Uncontended latency
o Time to acquire a lock when there is no contention

Traffic
o Lock acquisition when lock is already locked
o Lock acquisition when lock is free
o Lock release

Fairness
o Degree to which a thread can acquire a lock with respect to others

Storage
o As a function of # of threads/processors

The need for atomicity

This code sequence illustrates the need for atomicity. Explain.

void lock (int *lockvar) {
    while (*lockvar == 1) {} ;  // wait until released
    *lockvar = 1;               // acquire lock
}

void unlock (int *lockvar) {
    *lockvar = 0;
}

In assembly language, the sequence looks like this:

lock:    ld   R1, &lockvar    // R1 = lockvar
         bnz  R1, lock        // jump to lock if R1 != 0
         st   &lockvar, #1    // lockvar = 1
         ret                  // return to caller

unlock:  sti  &lockvar, #0    // lockvar = 0
         ret                  // return to caller

The ld-to-st sequence must be executed atomically:
o The sequence appears to execute in its entirety
o Multiple sequences are serialized
Examples of atomic instructions

test-and-set Rx, M
o Read the value stored in memory location M, test the value against a constant (e.g., 0), and if they match, write the value in register Rx to the memory location M.

fetch-and-op M
o Read the value stored in memory location M, perform op on it (e.g., increment, decrement, addition, subtraction), then store the new value to the memory location M.

exchange Rx, M
o Atomically exchange (or swap) the value in memory location M with the value in register Rx.

compare-and-swap Rx, Ry, M
o Compare the value in memory location M with the value in register Rx. If they match, write the value in register Ry to M, and copy the value in Rx to Ry.

How to ensure one atomic instruction is executed at a time:

1. Reserve the bus until done
   o Other atomic instructions cannot get to the bus
2. Reserve the cache block involved until done
   o Obtain exclusive permission (e.g., M in MESI)
   o Reject or delay any invalidation or intervention requests until done
3. Provide an illusion of atomicity instead
   o Using load-link/store-conditional (to be discussed later)
Test-and-set

test-and-set is implemented like this:

lock:    t&s  R1, &lockvar   // R1 = MEM[&lockvar];
                             // if (R1 == 0) MEM[&lockvar] = 1
         bnz  R1, lock       // jump to lock if R1 != 0
         ret                 // return to caller

unlock:  st   &lockvar, #0   // MEM[&lockvar] = 0
         ret                 // return to caller

What value does lockvar have when the lock is acquired? free?

Here is an example of test-and-set execution. Describe what it shows.

[Figure: example of test-and-set execution.]

Lecture 10 Architecture of Parallel Computers 13
Let's look at how a sequence of test-and-sets by three processors plays out:

Request      P1  P2  P3  BusRequest
Initially    -   -   -   -
P1: t&s      M   -   -   BusRdX
P2: t&s      I   M   -   BusRdX
P3: t&s      I   I   M   BusRdX
P2: t&s      I   M   I   BusRdX
P1: unlock   M   I   I   BusRdX
P2: t&s      I   M   I   BusRdX
P3: t&s      I   I   M   BusRdX
P3: t&s      I   I   M   -
P2: unlock   I   M   I   BusRdX
P3: t&s      I   I   M   BusRdX
P3: unlock   I   I   M   -

How does test-and-set perform on the four metrics listed above?

Uncontended latency

Fairness

Traffic

Storage