Shared Memory. SMP Architectures and Programming

Size: px

Start display at page:

Download "Shared Memory. SMP Architectures and Programming"

Harold Melvin Logan
5 years ago
Views:

1 Shared Memory SMP Architectures and Programming 1

2 Why work with shared memory parallel programming? Speed Ease of use CLUMPS Good starting point 2

3 Shared Memory Processes or threads share memory No explicit communication is needed, only synchronization Easy to program 3

4 No explicit communication needed Can a process work at any time? No we still need communication Barriers, mutex and signals are much more important in SMP than in distributed memory architectures Swap Test-and-set Interrupts/systemcalls 4

5 Shared Memory Architecture UMA Uniform Memory Architecture NUMA Non Uniform Memory Architecture 5

6 SMP Architectures Shared bus Very simple Fairly Cheap Poor scalability Crossbar Switch More complex More expensive Scales better 6

7 SMP HW M M M M $ $ $ $ P P P P (a) (b) 7

8 SHMP a quick refresh Shared bus Rather simple Very cheap Only scales to a few processors Maintains the standard memory view 8

9 Shared Memoy Shared Bus... Processor Memory Bus Interrupt Controller Communications Bus Frame Buffer Interrupt Controller... I/O Interface I/O Bus Interrupt Controller I/O Interface I/O Bus 9 Interrupt Controller Memory Bus Controller Cache Controller Cache Memory Interrupt Controller Memory Bus Controller Cache Controller Cache Memory CPU CPU Interrupt Controller Memory Bus Controller Cache Controller Cache Memory CPU

10 SHMP a quick refresh Crossbar switched Rather complex Quite expensive Can scale to tens of processors Needs a relaxed memory consistency protocol 10

11 Crossbar switch CPU CPU CPU Cache CPU Cache Memory CPU Cache Cache CPU Cache Cache Cache Cache Cache Memory Cache Memory Cache Mem. Ctl. Mem. Ctl. Mem. Ctl. UPA Interface UPA Interface UPA Interface GigaPlane BUS I/O Extension I/O Extension 11

12 COMA Architectures Eliminates all main memory Cache entries are the only thing in the system Problems: Invalidating all copies on write Not eliminating the last copy 12

13 SHMP a quick refresh MP on a chip Extremely simple Extremely cheap Only very few processors per chip Allows the CPUs to work together more closely 13

14 MP on a chip (Majc) North UPA Instruction Cache Graphics Preprocessor Memory Controller PCI Instruction Cache CPU 0 Data Cache CPU 1 South UPA 14

15 SHMP a quick refresh Virtual MP on a chip Named Hyper-threading Extremely cheap only an extra register-file per VP and some control logic Virtual depth can be quite large but few applications may take advantage of it Allows us much better utilization of the CPU area 15

16 MP on a chip First Virtual CPU Second Virtual CPU INT Unit FP Unit 16

17 Processes and Threads A thread is simply a process which shares the address space of the process it resides in with the other threads in that process What is the importance of Threads vs Processes in SMP? 17

18 The MESI Protocol Common protocol for ensuring sequential consistency States are Modified Exclusive Shared Invalid 18

19 MESI Protocol M E S I 19

20 False Sharing Unfortunate memory layout may cause performance problems due to false sharing Int cnt[2]; A: cnt[0]++; B: cnt[1]++; cnt[0] cacheline cnt[1] 20

21 False Sharing: A test! volatile int count[2]; #ifdef FALSE worker(void * arg) { int i,index=(int)arg; for(i=0;i< ; i++) count[index]++; } #else worker(void * arg) { int i,index=(int)arg; int temp=0; for(i=0;i< ; i++) temp++; count[index]+=temp; } #endif main() { pthread_t t; } pthread_create(&t,null,worker,null); worker((void *)1); printf("%d %d\n",count[0],count[1]); False Sharing u 0.020s 0: % 0+0k 0+0io 245pf+0w No False Sharing 2.690u 0.000s 0: % 0+0k 0+0io 245pf+0w 21

22 Scaling Shared Memory MESI requires broadcast of memory operations To scale SHM architectures above a few tens of processors we may relax memory consistency 22

23 Defining Memory Consistency A Memory Consistency Model defines a set of constraints that must be meet by a system to conform to the given consistency model. These constraints define a set of rules that define how memory operations are viewed relative to: Real time Each other Different nodes 23

24 Why Relax the Consistency Model To simplify bus design on SMP systems More relaxed consistency models requires less bus bandwidth More relaxed consistency requires less cache synchronization To lower contention on DSM systems More relaxed consistency models allows better sharing More relaxed consistency models requires less interconnect bandwidth 24

25 Strict Consistency Any read to a memory location x returns the value stored by the most resent write to x. Performs correctly with race conditions Can t be implemented in systems with more than one CPU 25

26 Strict Consistency P 0 : P 1 : R(x)0 W(x)1 R(x)1 P 0 : P 1 : W(x)1 R(x)0 R(x)1 26

27 Sequential Consistency [A multiprocessor system is sequentially consistent if ] the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appears in this sequence in the order specified by its program. Handles all correct code, except race conditions Can be implemented with more than one CPU 27

28 Sequential Consistency P 0 : P 1 : R(x)0 W(x)1 R(x)1 P 0 : P 1 : W(x)1 R(x)0 R(x)1 P 0 : W(x)1 P 0 : W(x)1 W(y)1 P 1 : R(x)0 P 1 : R(x)0 P 2 : R(x)1 P 2 : R(y)1 28

29 Causal Consistency Writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines. Still fits programmers idea of sequential memory accesses Hard to make an efficient implementation 29

30 Causal Consistency P 0 W(X)1 P 1 R(X)1 W(Y)1 P 2 R(Y)1 R(X)1 P 0 W(X)1 P 1 R(X)1 W(Y)1 P 2 R(Y)1 R(X)0 30

31 PRAM Consistency Writes done by a single process are received by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes. Operations from one node can be grouped for better performance Does not comply with ordinary memory conception 31

32 W(X)1 W(y(1) W(X)2 W(y(1) W(X)3 W(y(1) PRAM Model CPU CPU... CPU Memory 32

33 PRAM Consistency P 0 W(X)1 P 1 R(X)1 W(Y)1 P 2 R(Y)1 R(X)0 P 0 W(X)1 W(X)2 P 1 R(X)2 R(X)1 33

34 Processor Consistency 1. Before a read is allowed to perform with respect to any other processor, all previous reads must be performed. 2. Before a write is allowed to perform with respect to any other processor, all other accesses (read and write) must be performed. Slightly stronger than PRAM Slightly easier than PRAM 34

35 Weak Consistency 1. Accesses to synchronization variables are sequentially consistent. 2. No access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere. 3. No data access ( read or write ) is allowed to be performed until all previous accesses to synchronization variables have been performed. Synchronization variables are different from ordinary variables Lends itself to natural synchronization based parallel programming 35

36 Weak Consistency P 0 W(X)1 W(X)2 S P 1 R(X)1 S R(X)2 P 0 W(X)1 W(X)2 S P 1 S R(X)1 R(X)2 36

37 Release Consistency 1. Before an ordinary access to a shared variable is performed, all previous acquires done by the process must have completed successfully. 2. Before a release is allowed to be performed, all previous reads and writes done by the process must have completed. 3. The acquire and release accesses must be processor consistent. Synchronization's now differ between Acquire and Release Lends itself directly to semaphore synchronized parallel programming 37

38 Release Consistency P 0 : Acq(L) W(x)0 W(x)1 Rel(L) P 1 : R(x)0 P 2 : Acq(L) R(x)1 Rel(L) P 0 : Acq(L) W(x)0 W(x)1 Rel(L) P 1 : Acq(L) R(x)0 Rel(L) P 2 : Acq(L) R(x)1 Rel(L) 38

39 Lazy Release Consistency Differs only slightly from Release Consistency Release dependent variables are not propagated at release, but rather at the following acquire This allows Release Consistency to be used with smaller granularity 39

40 Entry Consistency 1. An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process. 2. Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in non-exclusive mode. 3. After an exclusive mode access to a synchronization variable has been performed, any other process next non-exclusive mode access to that synchronization variable may not be performed until it has been performed with respect to that variable s owner. Associates specific synchronization variables with specific data variables 40

41 Automatic Update Automatic update consistency has the same semantics as lazy release consistency, and adding: Before performing a release all automatic updates must be performed. Lends itself to hardware support Efficient when two nodes are sharing the same data often 41

42 Comparing Consistency models Added Semantics Entry Automatic Update Release Lazy Release Weak PRAM Processor Causal Strict Sequential Efficiency 42

43 Working with Relaxed Consistency Models Natural tradeoff between efficiency and added work Anything beyond Causal Consistency requires the consistency model to be explicitly known Compiler knowledge of the consistency model can hide the relaxation from the programmer 43

44 CASE Studies Intel MP Specification Sequent NUMA-Q SGI Origin-2000 Cray T3E 44

45 Intel MP Specification Shared bus approach Sequential Consistency MESI Protocol Scales to 4 processors directly 8 processors by gluing two 4 processor busses together UMA 45

46 Sequent NUMA-Q Based on Intel MP Spec and SCI bus Scalable Coherent Bus 4 way SMP boards keept coherent by MESI Boards keeps coherent by SCI The SCI ring is implemented in one chip UMA/NUMA 46

47 SGI Origin-2000 Extention to Stanford DASH Provides sequential consistency Boards with one R12000 or two R10000 processors (MESI) LARGE interface to the world keeps memory coherent NUMA (cc-numa) 47

48 Cray T3E Alpha processors with local memory May access the memory on other processors Memory is note keept coherent in any way! Special functionalities for synchronization NUMA 48

49 Summary Relaxing memory consistency is necessary for any system with more than one processor Simple relaxation can be hidden Strong relaxation can achieve better performance 49

Distributed Shared Memory

Distributed Shared Memory History, fundamentals and a few examples Coming up The Purpose of DSM Research Distributed Shared Memory Models Distributed Shared Memory Timeline Three example DSM Systems The