CS377P Programming for Performance Multicore Performance Cache Coherence

1 CS377P Programming for Performance
Multicore Performance
Cache Coherence
Sreepathi Pai, UTCS
October 26, 2015

2 Outline
1. Cache Coherence
2. Cache Coherence Awareness
3. Scalable Lock Design
4. Transactional Memory and Cache Coherence

4 The Problem
- Shared address space
- (Core) private caches
- Copies of shared data can exist in multiple locations
- How to keep these copies synchronized transparently?
  - Not transparent to performance!

5 Shared Reads
[Figure: two processors/cores, each with a private cache (columns: cc, tags, data), above a shared memory. A core issues READ 0xdae...; one cache holds 0xdae... = X; memory holds 0xdae... = X.]

6 Shared Reads: Copy
[Figure: same layout. Both caches hold 0xdae... = X; memory holds 0xdae... = X.]

7 Shared Writes
[Figure: same layout. A core issues WRITE Y to 0xdae...; both caches hold 0xdae... = X; memory holds 0xdae... = X.]

8 Shared Writes: Invalidate Copies
[Figure: same layout. The WRITE Y to 0xdae... is pending; an INVALIDATE 0xdae... message is sent; both caches are still shown holding 0xdae... = X; memory holds 0xdae... = X.]

9 Shared Writes: Then Write
[Figure: same layout. Only one cache holds the line, now 0xdae... = Y; memory still holds 0xdae... = X.]

10 Simplified MESI
[Figure: state diagram over the states INVALID, EXCLUSIVE, SHARED, and MODIFIED. Edge labels: read from memory; read from other processor; write by other processor; read by other processor; write to memory; write by this processor.]
- Solid lines: actions by this processor
- Dashed lines: actions by other processors
- Many transitions are not shown, e.g. to INVALID from SHARED or EXCLUSIVE on cache line replacement

11 The MESI Protocol (Simplified)
- Every cache line begins in the INVALID state
- On a read, the cache line is put into:
  - EXCLUSIVE: if it was read from memory
  - SHARED: if it was read from another copy
- On a write, the line is moved to the MODIFIED state
  - If it was previously SHARED, all other copies are INVALIDATED
  - It will eventually be written back to memory
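The rules above can be sketched as a tiny state machine for a single cache line. This is only an illustration of the simplified protocol on the slide, not a real coherence implementation; the names `line_t`, `mesi_read`, `mesi_write`, and the two-core limit `NCORES` are assumptions made for the example.

```c
#include <stdbool.h>

/* Simplified MESI states for one cache line, following the slide's rules. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

#define NCORES 2  /* hypothetical core count for this sketch */

/* State of the same line in each core's private cache. */
typedef struct { mesi_t state[NCORES]; } line_t;

/* Core c reads the line. */
void mesi_read(line_t *l, int c) {
    if (l->state[c] != INVALID)
        return;                               /* hit: no state change */
    bool other_copy = false;
    for (int i = 0; i < NCORES; i++) {
        if (i != c && l->state[i] != INVALID) {
            other_copy = true;
            l->state[i] = SHARED;             /* supplier drops to SHARED */
        }
    }
    /* Read from another copy gives SHARED, read from memory gives EXCLUSIVE. */
    l->state[c] = other_copy ? SHARED : EXCLUSIVE;
}

/* Core c writes the line: every other copy is invalidated first. */
void mesi_write(line_t *l, int c) {
    for (int i = 0; i < NCORES; i++)
        if (i != c)
            l->state[i] = INVALID;
    l->state[c] = MODIFIED;
}
```

Calling `mesi_read` for core 0, then core 1, then `mesi_write` for core 0 walks the line through EXCLUSIVE, SHARED, and MODIFIED, with core 1 ending in INVALID.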

12 Snoop Protocols
- Require a shared bus among all processors
- All requests to read/write are broadcast on the bus
- All processors snoop (listen to) memory requests
- If a processor has a copy in EXCLUSIVE/SHARED/MODIFIED state:
  - It responds with a copy of its data
  - It moves its line to SHARED
- Processors broadcast INVALIDATE to all processors before writing
  - Must wait for acknowledgements

13 Directory-based Protocols
- Require a shared structure called the directory
- The directory tracks the contents of every cache in the system
  - Addresses only
- Caches talk to the directory only
- The directory sends messages only to caches that contain affected data
- Used in systems with a large number of processors (> 8)
- The implementation need NOT be a centralized structure
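To make the difference from snooping concrete, here is a minimal sketch of one directory entry that records which caches hold a line in a bitmask. The names `dir_entry_t`, `dir_read`, `dir_write`, and the cache count `NCACHES` are hypothetical; real directories track more state than this.

```c
#include <stdint.h>

#define NCACHES 16   /* hypothetical number of caches in the system */

/* Directory entry for one line address: bit i set means cache i holds a copy. */
typedef struct { uint32_t sharers; } dir_entry_t;

/* Record that cache c fetched the line. */
void dir_read(dir_entry_t *e, int c) {
    e->sharers |= (uint32_t)1 << c;
}

/* Cache c writes: invalidate only the caches that actually hold a copy.
   Returns the number of INVALIDATE messages sent. */
int dir_write(dir_entry_t *e, int c) {
    uint32_t targets = e->sharers & ~((uint32_t)1 << c);
    int msgs = 0;
    for (int i = 0; i < NCACHES; i++)
        if (targets & ((uint32_t)1 << i))
            msgs++;
    e->sharers = (uint32_t)1 << c;   /* the writer is now the only holder */
    return msgs;
}
```

With three sharers, a write by one of them sends two messages here, whereas a snooping broadcast would reach all `NCACHES - 1` other caches regardless of who holds the line.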

14 Summary of Cache Coherence
- Reads and writes to shared data involve communication with other processors
  - Expensive
  - Possible serialization bottleneck

16 Shared Variables
- Variables that are read/written by multiple threads are called shared variables.

17 Compilers and Cache Coherence
    int *a;

    T0:
        while (*a < 1000)
            *a = *a + 1;

    T1:
        while (1)
            printf("%d\n", *a);

18 Volatiles
    volatile int *a;

    T0:
        while (*a < 1000)
            *a = *a + 1;

    T1:
        while (1)
            printf("%d\n", *a);
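`volatile` forces the compiler to issue each load and store, but it does not make `*a = *a + 1` an atomic update. As a hedged alternative that is not on the slides, the same two loops can be sketched with C11 atomics; `counter`, `t0_count_to`, and `t1_observe` are names invented for this example.

```c
#include <stdatomic.h>

atomic_int counter;   /* hypothetical shared variable standing in for *a */

/* T0's loop: every load and update goes to memory, and so through the
   coherence protocol, instead of staying in a compiler-held register. */
void t0_count_to(int limit) {
    while (atomic_load(&counter) < limit)
        atomic_fetch_add(&counter, 1);
}

/* T1's observation: one atomic read of the current value. */
int t1_observe(void) {
    return atomic_load(&counter);
}
```

Run from a single incrementing thread, `t0_count_to(1000)` leaves `counter` at 1000; with several incrementing threads the check and the increment are separate operations, so the final value may overshoot the limit.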

19 False Sharing
    int sums[nthreads];

    Tx:
        for (...)
            sums[x] += a[...];

20 Memory Layout for sums
    offset:  0x00  0x04  0x08  0x0C  0x10  0x14  0x18  0x1C
    thread:  T0    T1    T2    T3    T4    T5    T6    T7
- sums[] occupies a single cache line.

21 Cache Line Bouncing
- No thread shares data with another thread
- However, each thread's data resides within the same cache line
- Coherence operates at cache-line granularity
- Every write to the cache line will potentially be serialized

22 Summary
- Memory locations may be kept in registers by the compiler
  - They will then not participate in cache coherence
- Data layout may cause inadvertent conflicts between threads
  - One solution: privatize and then merge
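A minimal sketch of "privatize and then merge", combined with padding so that each thread's slot sits in its own cache line. The 64-byte line size, the type `padded_sum_t`, and the functions `partial_sum` and `merge_sums` are assumptions for illustration; thread creation is left out.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size; check the target CPU */

/* One accumulator per thread, each aligned to its own cache line. */
typedef struct { alignas(CACHE_LINE) long sum; } padded_sum_t;

/* Work done by thread x over its slice a[lo..hi): accumulate in a
   private local variable, then store once into the padded slot. */
void partial_sum(padded_sum_t *sums, int x, const int *a, size_t lo, size_t hi) {
    long local = 0;                 /* privatized accumulator */
    for (size_t i = lo; i < hi; i++)
        local += a[i];
    sums[x].sum = local;            /* single write to the shared array */
}

/* Merge step, run after all threads have finished. */
long merge_sums(const padded_sum_t *sums, int n) {
    long total = 0;
    for (int i = 0; i < n; i++)
        total += sums[i].sum;
    return total;
}
```

The local accumulator removes the repeated writes to shared memory, and the alignment keeps the one remaining write per thread from landing in a neighbour's line.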

24 Basic Spinning Lock
    /* assumes atomic_cas returns the value previously stored in lock */
    while (atomic_cas(lock, UNLOCKED, LOCKED) != UNLOCKED);

25 Basic Spinning Lock and CC
- Must invalidate every copy of lock at every check!
- Consumes bus/directory bandwidth!
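One common mitigation, which the slides do not cover, is to read the lock before attempting the atomic write ("test-and-test-and-set"), so that waiters spin on a plain read of a line that can stay SHARED. A sketch with C11 atomics follows; `spinlock_t`, `spin_trylock`, `spin_lock`, and `spin_unlock` are names chosen for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } spinlock_t;

/* One acquisition attempt: read first, write only if the lock looks free. */
bool spin_trylock(spinlock_t *l) {
    if (atomic_load_explicit(&l->locked, memory_order_relaxed))
        return false;   /* held: a plain read, no invalidation needed */
    /* The exchange returns the old value; false means we took the lock. */
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

void spin_lock(spinlock_t *l) {
    while (!spin_trylock(l))
        ;               /* spin */
}

void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

All waiters still rush for the line when the lock is released, so this reduces traffic while the lock is held but does not by itself make the lock scalable.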

26 Ticket Lock
    lock:
        ticket = atomicadd(lock.ticket, 1);
        while (lock.current != ticket);

    unlock:
        lock.current++;

27 Ticket Lock and CC
- lock.current is in SHARED state
- lock.current is INVALIDATEd by unlock
  - owned by the writing core
- lock.current moves back to SHARED
  - the line is distributed serially to all requesting cores
  - assumption in the "Non-scalable locks are dangerous" paper
- very small critical sections may execute faster than this distribution
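The ticket lock pseudocode can be written with C11 atomics roughly as follows. This is a sketch under the assumption that `atomicadd` returns the value before the addition; the helper names `ticket_take` and `ticket_is_served` are introduced here only so that the two steps can be examined separately.

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint ticket;   /* next ticket to hand out */
    atomic_uint current;  /* ticket currently being served */
} ticket_lock_t;

/* Take a ticket; the old counter value is this caller's ticket. */
unsigned ticket_take(ticket_lock_t *l) {
    return atomic_fetch_add_explicit(&l->ticket, 1, memory_order_relaxed);
}

/* Non-blocking check: is it ticket t's turn? */
int ticket_is_served(ticket_lock_t *l, unsigned t) {
    return atomic_load_explicit(&l->current, memory_order_acquire) == t;
}

void ticket_lock(ticket_lock_t *l) {
    unsigned t = ticket_take(l);
    while (!ticket_is_served(l, t))
        ;  /* every waiter spins on the same line, the one holding current */
}

void ticket_unlock(ticket_lock_t *l) {
    /* Only the holder writes current; this write invalidates every
       waiter's copy, and each waiter must then re-fetch the line. */
    atomic_fetch_add_explicit(&l->current, 1, memory_order_release);
}
```

Because all waiters read `current`, one unlock triggers a re-fetch by every waiting core, which is the serial distribution described above.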

28 MCS Locks
- Each lock maintains a queue of waiting threads
- If a thread did not arrive first, it adds itself to the queue
- The thread at the head of the queue gets the lock
- After it finishes, it passes the lock to the next waiting thread
  - if there is a waiting thread

29 MCS Locks: Unlocked
    head: NULL

30 MCS Locks: Successful Lock
    head: cpu0_qnode
    cpu0_qnode: next: NULL, have_lock: 1

31 MCS Locks: Unsuccessful Lock Attempt (Step 1)
    head: cpu1_qnode
    cpu0_qnode: next: NULL, have_lock: 1
    cpu1_qnode: next: NULL, have_lock: 0
- Put node at head of queue (using atomicexchange), but not first to arrive, so lock acquisition is unsuccessful.

32 MCS Locks: Unsuccessful Lock Attempt (Step 2)
    head: cpu1_qnode
    cpu0_qnode: next: cpu1_qnode, have_lock: 1
    cpu1_qnode: next: NULL, have_lock: 0
- Add node to next pointer of previous head of queue.
  - Only one thread will do this
- Set have_lock to 0
- Spin on have_lock

33 MCS Unlock
    head: cpu1_qnode
    cpu0_qnode: next: cpu1_qnode, have_lock: 0
    cpu1_qnode: next: NULL, have_lock: 1
- Attempt to store NULL back into lock (atomiccas)
  - Fails if there is a waiting thread
- If there are waiting threads:
  - Set have_lock of next node to 1
  - The waiting thread will eventually notice this
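The lock and unlock steps walked through above can be sketched in C11 as follows. The field names `head`, `next`, and `have_lock` follow the slides; the types `qnode_t` and `mcs_lock_t` and the split into `mcs_enqueue` and `mcs_lock` are assumptions made so the intermediate queue state can be inspected. Default sequentially consistent atomics are used for simplicity.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct qnode {
    _Atomic(struct qnode *) next;
    atomic_int have_lock;
} qnode_t;

typedef struct { _Atomic(qnode_t *) head; } mcs_lock_t;

/* Swap me into the lock. Returns the previous node (NULL if the lock was free). */
qnode_t *mcs_enqueue(mcs_lock_t *l, qnode_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->have_lock, 0);
    qnode_t *prev = atomic_exchange(&l->head, me);
    if (prev == NULL)
        atomic_store(&me->have_lock, 1);   /* first to arrive */
    else
        atomic_store(&prev->next, me);     /* link behind the previous node */
    return prev;
}

void mcs_lock(mcs_lock_t *l, qnode_t *me) {
    mcs_enqueue(l, me);
    while (!atomic_load(&me->have_lock))
        ;   /* spin on a flag in this thread's own node */
}

void mcs_unlock(mcs_lock_t *l, qnode_t *me) {
    atomic_store(&me->have_lock, 0);
    qnode_t *expected = me;
    if (atomic_compare_exchange_strong(&l->head, &expected, NULL))
        return;                            /* no waiter: lock is free again */
    qnode_t *next;
    while ((next = atomic_load(&me->next)) == NULL)
        ;   /* a waiter swapped itself in but has not linked yet */
    atomic_store(&next->have_lock, 1);     /* hand the lock to the next node */
}
```

Each caller supplies its own `qnode_t`, so the flag a waiter spins on is written only once, by its predecessor at hand-off.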

34 MCS Locks and CC
- Why is spinning on have_lock cheap?

35 Summary
- Scalability of data structures must take cache coherence into account.

37 Transactional Memory
- Must detect conflicts with other threads
- Conflict: read/write sets overlap

38 Maintaining Read and Write Sets
- Cache line granularity
- Additional bits required in cache
- Piggy-back on cache coherence protocol to detect conflicts
- More states added to CC protocol
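A software sketch of conflict detection at cache-line granularity may help fix the idea; hardware does this with per-line bits, not arrays. The 64-byte line size, the set capacity `MAX_LINES`, and all type and function names below are assumptions. The sketch also assumes that two transactions which only read the same line do not conflict.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SHIFT 6      /* assumes 64-byte cache lines */
#define MAX_LINES 64      /* hypothetical cap on tracked lines per set */

typedef struct { uintptr_t lines[MAX_LINES]; int n; } line_set_t;
typedef struct { line_set_t reads, writes; } txn_t;

/* Record the cache line containing addr; returns false if the set is full. */
bool set_add(line_set_t *s, const void *addr) {
    uintptr_t line = (uintptr_t)addr >> LINE_SHIFT;
    for (int i = 0; i < s->n; i++)
        if (s->lines[i] == line)
            return true;                 /* already tracked */
    if (s->n == MAX_LINES)
        return false;                    /* capacity overflow */
    s->lines[s->n++] = line;
    return true;
}

/* True if the two sets have at least one line in common. */
bool sets_overlap(const line_set_t *a, const line_set_t *b) {
    for (int i = 0; i < a->n; i++)
        for (int j = 0; j < b->n; j++)
            if (a->lines[i] == b->lines[j])
                return true;
    return false;
}

/* Conflict: one transaction writes a line the other reads or writes. */
bool txn_conflict(const txn_t *a, const txn_t *b) {
    return sets_overlap(&a->writes, &b->writes)
        || sets_overlap(&a->writes, &b->reads)
        || sets_overlap(&a->reads, &b->writes);
}
```

Two accesses to different bytes of the same line count as a conflict here, which is the same granularity effect seen with false sharing, and a full set makes `set_add` fail, which bounds how large a transaction can be.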

39 Side-effects of using Cache Coherence
- What is the maximum size of a transaction?
