Flynn's Classification

Eustace Merritt

1 Multiprocessors
  Oracle's SPARC M7: 32 cores; 64 MB L3 cache (8 x 8 MB), 1.6 TB/s.
  256 KB 4-way set-associative (SA) L2 I-cache, 0.5 TB/s per cluster.
  Two cores share a 256 KB 8-way SA L2 D-cache, 0.5 TB/s.

2 Flynn's Classification
  Single instruction stream, single data stream (SISD): uniprocessor.
  Single instruction stream, multiple data streams (SIMD): data-level parallelism; the same operation is applied to multiple data items in parallel. Examples: multimedia extensions, vector architectures. Applications: gaming, 3-dimensional real-time virtual environments.
  Multiple instruction streams, single data stream (MISD).
  Multiple instruction streams, multiple data streams (MIMD): thread-level parallelism.

3 SIMD Instructions
    ADDVV V2, V0, V1
  [Figure: vector registers V0 and V1 are added element by element, indices [0] to [VLR-1], into V2. VLR is the Vector Length Register.]
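The effect of the ADDVV instruction can be sketched in a few lines of Python. This is only an illustration: the function name `addvv` and the list-based "registers" are assumptions made for the example, and a real vector unit performs the element operations in hardware rather than in a loop.

```python
def addvv(v0, v1, vlr):
    """Element-wise add of the first `vlr` elements, like ADDVV V2, V0, V1.

    `v0` and `v1` are plain lists standing in for vector registers
    (an assumption of this sketch); `vlr` plays the role of the
    Vector Length Register.
    """
    if vlr > min(len(v0), len(v1)):
        raise ValueError("VLR exceeds the vector register length")
    # One instruction applies the same operation to every element.
    return [v0[i] + v1[i] for i in range(vlr)]


v0 = [1, 2, 3, 4]
v1 = [10, 20, 30, 40]
v2 = addvv(v0, v1, vlr=3)   # only elements [0] .. [VLR-1] are processed
print(v2)                   # [11, 22, 33]
```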

4 Multithreading

5 Motivation for Multiprocessing
  Performance.
  Clusters, Software as a Service.
  Data-intensive applications.
  Natural parallelism in large scientific applications.
  More return on investment by replicating current designs.

6 Symmetric Multiprocessor (SMP)
  [Figure: four processors, each with one or more levels of cache, connected through a shared cache to main memory and the I/O system.]
  Symmetric shared memory; centralized shared memory; uniform memory access.

7 Distributed Shared Memory
  [Figure: multicore multiprocessor (MP) nodes, each with its own memory and I/O, connected by an interconnection network.]
  Non-uniform memory access.

8 Distributed Shared Memory
  Scalability.
  High memory bandwidth demands.
  Low memory access latency to local memory.
  Communication infrastructure is complex.

9 Example
  CPI = Base CPI + Remote request rate × Remote request cost
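A worked instance of the CPI formula, as a minimal sketch. All numbers below (base CPI of 0.5, 0.2% of instructions making a remote request, 200 ns remote latency, 2 GHz clock) are assumptions chosen for illustration; they are not given on the slide.

```python
def effective_cpi(base_cpi, remote_request_rate, remote_request_cost):
    """CPI = Base CPI + Remote request rate x Remote request cost (in cycles)."""
    return base_cpi + remote_request_rate * remote_request_cost


# Assumed parameters for the example.
base_cpi = 0.5
remote_request_rate = 0.002      # remote requests per instruction (0.2%)
remote_latency_ns = 200.0
clock_ghz = 2.0                  # 2 cycles per nanosecond

# Convert the remote latency to cycles: 200 ns * 2 cycles/ns = 400 cycles.
remote_request_cost = remote_latency_ns * clock_ghz

cpi = effective_cpi(base_cpi, remote_request_rate, remote_request_cost)
slowdown = cpi / base_cpi        # relative to a machine with no remote requests

print(f"CPI = {cpi:.2f}, slowdown = {slowdown:.2f}x")
```

With these assumed values, a remote request rate of only 0.2% raises the CPI from 0.5 to about 1.3, so the all-local machine would be about 2.6 times faster.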

10 Shared Memory vs. Message Passing
  Shared memory machine: processors share the same physical address space.
    Implicit communication; hardware-controlled cache coherence.
  Message passing machine:
    Explicit, programmed communication.
    No cache coherence (simpler hardware).
    Message passing libraries: MPI.
  [Figure: processors (P) with caches (C) and memories (M), main memory, and the interconnect.]

11 Shared Memory vs. Message Passing
  [Figure: processors (P) and memories (M) on an interconnect; a Read A request is shown for data item A held in memory.]

12 Ocean Kernel
    Procedure Solve(A)
    begin
      diff = done = 0;
      while (!done) do
        diff = 0;
        for i ← 1 to n do
          for j ← 1 to n do
            temp = A[i,j];
            A[i,j] ← 0.2 * (A[i,j] + neighbors);
            diff += abs(A[i,j] - temp);
          end for
        end for
        if (diff < TOL) then done = 1;
      end while
    end procedure
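A runnable Python sketch of the sequential kernel follows. It assumes that "neighbors" means the sum of the four adjacent grid points and that the grid carries a fixed one-cell boundary around the n x n interior; the grid size, boundary values, and tolerance are made up for the example.

```python
def solve(A, tol):
    """Sweep the interior of A in place until the summed change is below tol.

    A is an (n+2) x (n+2) list of lists; row/column 0 and n+1 form a fixed
    boundary (an assumption of this sketch). Returns the number of sweeps.
    """
    n = len(A) - 2
    sweeps = 0
    done = False
    while not done:
        diff = 0.0
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                temp = A[i][j]
                # Assumed meaning of "neighbors": the four adjacent points.
                neighbors = A[i - 1][j] + A[i + 1][j] + A[i][j - 1] + A[i][j + 1]
                A[i][j] = 0.2 * (A[i][j] + neighbors)
                diff += abs(A[i][j] - temp)
        sweeps += 1
        if diff < tol:
            done = True
    return sweeps


n = 4
A = [[0.0] * (n + 2) for _ in range(n + 2)]
A[0] = [1.0] * (n + 2)            # example boundary: top edge held at 1.0
sweeps = solve(A, tol=1e-6)
print("converged after", sweeps, "sweeps")
```

Because each point is updated in place, later points in a sweep already see the new values of earlier ones. This is the dependence that the parallel versions on the following slides have to handle.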

13 Shared Address Space Model
    int n, nprocs;
    float **A, diff;
    LOCKDEC(diff_lock);
    BARDEC(bar1);

    main()
    begin
      read(n); read(nprocs);
      A ← G_MALLOC();
      initialize(A);
      CREATE(nprocs, Solve, A);
      WAIT_FOR_END(nprocs);
    end main

    procedure Solve(A)
      int i, j, pid, done = 0;
      float temp, mydiff = 0;
      int mymin = 1 + (pid * n/nprocs);
      int mymax = mymin + n/nprocs - 1;
      while (!done) do
        mydiff = diff = 0;
        BARRIER(bar1, nprocs);
        for i ← mymin to mymax
          for j ← 1 to n do
            ...
          endfor
        endfor
        LOCK(diff_lock);
        diff += mydiff;
        UNLOCK(diff_lock);
        BARRIER(bar1, nprocs);
        if (diff < TOL) then done = 1;
        BARRIER(bar1, nprocs);
      endwhile

15 Message Passing Model
  [Figure: thread m holds rows myA[1] to myA[4] and exchanges boundary rows with threads m-1 and m+1 using MPI_Send() and MPI_Recv().]

16 Message Passing Model
    main()
      read(n); read(nprocs);
      CREATE(nprocs-1, Solve);
      Solve();
      WAIT_FOR_END(nprocs-1);

    procedure Solve()
      int i, j, pid, nn = n/nprocs, done = 0;
      float temp, tempdiff, mydiff = 0;
      myA ← malloc( )
      initialize(myA);
      while (!done) do
        mydiff = 0;
        if (pid != 0) SEND(&myA[1,0], n, pid-1, ROW);
        if (pid != nprocs-1) SEND(&myA[nn,0], n, pid+1, ROW);
        if (pid != 0) RECEIVE(&myA[0,0], n, pid-1, ROW);
        if (pid != nprocs-1) RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
        for i ← 1 to nn do
          for j ← 1 to n do
            ...
          endfor
        endfor
        if (pid != 0)
          SEND(mydiff, 1, 0, DIFF);
          RECEIVE(done, 1, 0, DONE);
        else
          for i ← 1 to nprocs-1 do
            RECEIVE(tempdiff, 1, *, DIFF);
            mydiff += tempdiff;
          endfor
          if (mydiff < TOL) done = 1;
          for i ← 1 to nprocs-1 do
            SEND(done, 1, i, DONE);
          endfor
        endif
      endwhile

17 Multiprocessor Cache Coherence

18 Multiprocessor Cache Coherence
  1. A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P.
  2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses.
  3. Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors.

19 Coherence
  Cache coherence: which value to return on a read.
  Consistency: when a written value should be available to read (memory consistency models).
  Write propagation: a write is visible after a sufficient time lapse.
  Write serialization: all writes to a location are seen by every processor in the same order.

20 Cache Coherence
  Directory-based protocols: sharing status is maintained in a directory.
  Snooping protocols: sharing status is stored in the cache controller; the cache controller snoops the broadcast medium.
  Write-invalidate protocols: invalidate other processors' copies on a write.
  Write-update protocols: update all copies of the data on a write.
  Sharing status: Invalid (I), Shared (S) (or Clean), Modified (M) (or Dirty).

21 SMP - Write Invalidate
  CPU A reads X. [Figure: cache miss in A; the miss goes to memory; A's copy of X becomes Shared.]

22 SMP - Write Invalidate
  CPU B reads X. [Figure: cache miss in B; A's copy stays Shared; B's copy becomes Shared.]

23 SMP - Write Invalidate
  CPU A writes X. [Figure: A broadcasts Write Invalidate X; A's copy goes from Shared to Modified; B's copy goes from Shared to Invalid.]

24 SMP - Write Invalidate
  CPU B reads X. [Figure: cache miss in B; A's copy is Modified, B's copy is Invalid.]

25 SMP - Write Invalidate
  CPU B reads X (continued). [Figure: A writes X back to memory; A's copy goes from Modified to Shared; B's copy goes from Invalid to Shared.]
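The read, read, write, read sequence of slides 21 to 25 can be replayed with a toy model. This is a teaching sketch under simplifying assumptions: a single one-word block X, an atomic bus, and only the three states I, S, and M. The class and method names are invented for the example.

```python
class SnoopingBus:
    """Toy write-invalidate (MSI) model for one block X shared by all CPUs."""

    def __init__(self, cpus, memory_value=0):
        self.state = {c: "I" for c in cpus}     # per-cache state of X
        self.value = {c: None for c in cpus}    # per-cache copy of X
        self.memory = memory_value
        self.log = []                           # bus events, in order

    def read(self, cpu):
        if self.state[cpu] == "I":
            self.log.append(f"{cpu}: cache miss")
            for other, st in self.state.items():
                if other != cpu and st == "M":
                    # The owner snoops the miss, writes back, and downgrades.
                    self.memory = self.value[other]
                    self.state[other] = "S"
                    self.log.append(f"{other}: write back")
            self.value[cpu] = self.memory
            self.state[cpu] = "S"
        return self.value[cpu]

    def write(self, cpu, new_value):
        if self.state[cpu] != "M":
            self.log.append(f"{cpu}: write invalidate X")
            for other in self.state:
                if other != cpu:
                    if self.state[other] == "M":
                        self.memory = self.value[other]
                    self.state[other] = "I"
        self.state[cpu] = "M"
        self.value[cpu] = new_value


bus = SnoopingBus(["A", "B"], memory_value=5)
trace = []
bus.read("A");     trace.append(("A reads X", dict(bus.state)))
bus.read("B");     trace.append(("B reads X", dict(bus.state)))
bus.write("A", 7); trace.append(("A writes X", dict(bus.state)))
seen_by_b = bus.read("B")
trace.append(("B reads X", dict(bus.state)))

for step, states in trace:
    print(f"{step:11s} -> {states}")
```

The final read by B returns the value written by A, and memory is updated by the write back. This is the write propagation that the coherence definition requires.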

26 Write Invalidate Coherence Protocol
  Invalidate.
  Find data.
  Writeback / writethrough.
  Cache block states.
  Contention for tags.
  Enforcing write serialization.

27 SMP Example
  [Figure: processors A, B, C, and D, each with caches, connected to main memory and the I/O system.]
  Access sequence: A: Rd X; B: Rd X; C: Rd X; A: Wr X; A: Wr X; C: Wr X; B: Rd X; A: Rd X; A: Rd Y; B: Wr X; B: Rd Y; B: Wr X; B: Wr Y.

28 SMP Cache Coherence
  MESI protocol. Exclusive state: no invalidate messages on writes. Intel i7 uses MESIF.
  MOESI protocol. Owned state: only valid copy in the system; the main memory copy is stale; the owner supplies data on a miss. Used in AMD Opteron.

29 Directory Based Cache Coherence
  Physical memory is distributed among all processors.
  The directory is distributed and keeps track of sharing status.
  The physical address determines the data location.
  Point-to-point messages between nodes are sent over an interconnection network (ICN).

30 Directory Based Example
  A: Read X. [Figure: CPUs A, B, and C, each with a private cache, memory (M), and directory (D), on an interconnection network. A's copy of X is Shared; the directory entry is S: A.]

31 Directory Based Example
  B: Read X. [Figure: A's and B's copies are Shared; the directory entry becomes S: A, B.]

32 Directory Based Example
  A: Write X. [Figure: Inv X is sent to B and an ACK returns. A's copy goes from Shared to Modified, B's copy is invalidated, and the directory entry changes from S: A, B to M: A.]

33 Directory Based Example
  [Figure: same state as slide 32.] Further requests shown: C: Write X; B: Read X; A: Write X.
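The directory bookkeeping in slides 30 to 32 can be sketched as a small state machine. This is a simplified illustration for one block X at its home node: the message names other than Inv X and ACK, the "U" (uncached) state, and the class itself are assumptions, and a real protocol would also move the dirty data when ownership changes.

```python
class Directory:
    """Home-node directory entry for one block X (states: U, S, M)."""

    def __init__(self):
        self.state = "U"        # uncached (assumed initial state)
        self.sharers = set()    # nodes holding a copy; a single owner when M
        self.messages = []      # point-to-point messages as (what, src, dst)

    def read(self, node):
        if node in self.sharers:
            return              # cache hit: the directory is not involved
        self.messages.append(("ReadMiss X", node, "dir"))
        if self.state == "M":
            owner = next(iter(self.sharers))
            self.messages.append(("Fetch X", "dir", owner))
            self.messages.append(("WriteBack X", owner, "dir"))
        self.state = "S"
        self.sharers.add(node)
        self.messages.append(("Data X", "dir", node))

    def write(self, node):
        if self.state == "M" and self.sharers == {node}:
            return              # already the exclusive owner
        self.messages.append(("WriteMiss X", node, "dir"))
        for other in sorted(self.sharers - {node}):
            self.messages.append(("Inv X", "dir", other))
            self.messages.append(("ACK", other, "dir"))
        self.state = "M"
        self.sharers = {node}


d = Directory()
history = []
for node, op in [("A", "Read"), ("B", "Read"), ("A", "Write")]:
    (d.read if op == "Read" else d.write)(node)
    history.append((f"{node}: {op} X", d.state, sorted(d.sharers)))

for request, state, sharers in history:
    print(f"{request:11s} -> {state}: {', '.join(sharers)}")
```

Unlike the snooping model, invalidations go only to the nodes listed in the directory entry, so no broadcast medium is needed.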

34 Multiprocessor Performance
  Amdahl's Law.
  Coherence miss (apart from the 3 Cs): would not have occurred if another processor had not written to the same cache line; not a miss in a uniprocessor.
  False coherence miss: another word in the same cache line is written by another processor; not a miss if the cache line is one word.

35 Implementing Locks
  Processes must be synchronized so that they access a shared variable one at a time in a critical section; this is called mutual exclusion.
  Mutex lock: a synchronization primitive.
  AcquireLock(L): done before the critical section of code; returns when it is safe for the process to enter the critical section.
  ReleaseLock(L): done after the critical section; allows another process to acquire the lock.

36 Implementing Locks
    int L = 0;

    AcquireLock(L):
      while (L == 1) ;   /* BUSY WAITING */
      L = 1;

    ReleaseLock(L):
      L = 0;

37 Problem in Implementing Locks
    AcquireLock(L):
      while (L == 1) ;
      L = 1;

    wait: LW   R1, Addr(L)
          BNEZ R1, wait
          ADDI R1, R1, 1
          SW   R1, Addr(L)

38 Problem in Implementing Locks
  Initially L = 0. P1 and P2 are in contention to acquire the lock.

    Process 1                 Process 2
    LW   R1, Addr(L)
    (context switch)
                              LW   R1, Addr(L)
                              BNEZ R1, wait
                              ADDI R1, R1, 1
                              SW   R1, Addr(L)
                              # Critical Section #
                              (context switch)
    BNEZ R1, wait
    ADDI R1, R1, 1
    SW   R1, Addr(L)
    # Critical Section #

  Both P1 and P2 are executing in the critical section!
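The interleaving above can be reproduced deterministically. The sketch below models each process as a Python generator in which every `yield` marks a point where a context switch may occur. The generator-based scheduling and all names are assumptions of this illustration, not part of the slides.

```python
def naive_acquire(proc, mem):
    """Non-atomic AcquireLock: load, test, then store as separate steps."""
    while True:
        r1 = mem["L"]            # LW   R1, Addr(L)
        yield "LW"               # a context switch may happen here
        if r1 != 0:              # BNEZ R1, wait
            continue
        r1 += 1                  # ADDI R1, R1, 1
        mem["L"] = r1            # SW   R1, Addr(L)
        yield "SW"
        mem["in_cs"].append(proc)  # the process is now in its critical section
        return


mem = {"L": 0, "in_cs": []}
p1 = naive_acquire("P1", mem)
p2 = naive_acquire("P2", mem)

next(p1)            # P1 loads L == 0, then is switched out
for _ in p2:        # P2 runs to completion: loads 0, stores 1, enters
    pass
for _ in p1:        # P1 resumes with its stale R1 == 0 and also enters
    pass

print(mem["in_cs"])   # ['P2', 'P1']: mutual exclusion is violated
```

The failure comes from the gap between the load and the store, which is why the following slides turn to an atomic read-modify-write.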

39 Atomic Exchange
  Hardware support for lock implementation.
  Atomic exchange: swaps contents between a register and memory.
  Test&Set: takes one memory operand and one register operand.
    Test&Set Lock:
      tmp = Lock
      Lock = 1
      return tmp
  Test&Set occurs atomically (indivisibly). It is an atomic read-modify-write (RMW) instruction.

40 Lock Implementation
    lock: Test&Set R1, L
          BNZ      R1, lock
          # Critical Section
          SW       R0, L
  [Figure: register R1 and lock L exchanging the values 1 and 0.]
  The atomic read-modify-write hardware primitive facilitates synchronization implementations (locks, barriers, etc.).
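A spin lock built on Test&Set can be sketched with Python threads. Python has no hardware Test&Set, so the sketch emulates its atomicity with an internal `threading.Lock`; that stand-in, the class name, and the thread and iteration counts are all assumptions of the example.

```python
import threading
import time


class TestAndSetCell:
    """A lock word L supporting an (emulated) atomic Test&Set."""

    def __init__(self):
        self._value = 0
        self._atomic = threading.Lock()   # stands in for hardware atomicity

    def test_and_set(self):
        with self._atomic:                # tmp = L; L = 1; return tmp
            tmp = self._value
            self._value = 1
            return tmp

    def clear(self):
        with self._atomic:                # SW R0, L
            self._value = 0


def acquire(cell):
    while cell.test_and_set() == 1:       # BNZ R1, lock
        time.sleep(0)                     # busy wait; let other threads run


def release(cell):
    cell.clear()


shared = {"counter": 0}
lock = TestAndSetCell()


def worker(iterations):
    for _ in range(iterations):
        acquire(lock)
        shared["counter"] += 1            # critical section
        release(lock)


threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["counter"])   # 2000
```

Only the thread that reads back 0 from Test&Set enters the critical section, so every increment is applied exactly once.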

41 Lock Performance Issues
  Spin lock: a process may loop indefinitely on the read-modify-write until it succeeds.
  If the lock is in memory: heavy traffic.
  Atomicity ensures that the process is not switched out, so other processes do not progress.

42 Caching Locks
  Locks can be cached.
  The atomic exchange happens between the register file (RF) and the local copy in the cache.
  Coherence ensures that a lock update is seen by other processors.
  [Figure: P1 and P2 both issue T&S on lock L. Cache C1 holds L in state M; an invalidate (Inv) leaves C2's copy in state I. L also resides in main memory.]

43 Caching Locks
  [Figure: the next T&S reverses the states: the copy of L in one cache goes from I to M while the other goes from M to I, with another invalidate on the interconnect.]

44 Caching Locks
  If two processes share a lock variable, T&S generates a huge amount of coherence traffic.
  [Figure: P1 and P2 alternate T&S operations; each one invalidates the other cache's copy of L.]

45 Coherence Traffic for a Lock
  Test and Test and Set:
    lockloop: test R1, Lock
              bnz  R1, lockloop
              t&s  R1, Lock
              bnz  R1, lockloop
              # Critical Section #
              st   Lock, #0
  [Figure: threads T0, T1, and T2 run on P0, P1, and P2, connected by an interconnect to main memory. All three caches hold Lock in state S with value 1; main memory holds 1.]

46 Coherence Traffic for a Lock
  T0 releases Lock: a Write Invalidate for Lock is sent. P0 goes from S to M and its value from 1 to 0; P1 and P2 go from S to I. Main memory still holds 1.

47 Coherence Traffic for a Lock
  T1 tests Lock: Read Miss on Lock. P0 goes from M to S; P1 goes from I to S with value 0; P2 stays I. Main memory is updated from 1 to 0. T1 exits the inner loop.

48 Coherence Traffic for a Lock
  T1 tests-and-sets Lock: Write Miss on Lock. P0 and P1 hold Lock in S with value 0; P2 is I. Main memory holds 0.

49 Coherence Traffic for a Lock
  T1 tests-and-sets Lock (continued): P0 becomes I, P1 becomes M, P2 stays I. Main memory holds 0.

50 Coherence Traffic for a Lock
  T1's test-and-set is an atomic read-modify-write. P0 is I, P1 is M, P2 is I. Main memory holds 0.

51 LL-SC Example
  Atomic execution required!
    LD R2, X ... SW R2, X   becomes   LL R2, X ... SC R2, X
  If some other thread has modified X, R2 is filled with a special value indicating failure of the SC: if (R2 == )...

52 LL-SC Example
    lockit: LL     R2, 0(R1)    ; no coherence traffic
            BNEZ   R2, lockit   ; not available, keep spinning
            DADDUI R2, R0, #1   ; put value 1 in R2
            SC     R2, 0(R1)    ; store-conditional succeeds if no one
                                ; updated the lock since the last LL
            BEQZ   R2, lockit   ; confirm that SC succeeded, else keep trying
  Spin lock with lower coherence traffic.
  If there are i processes waiting for the lock, how many bus transactions happen?
  1 write by the releaser + i read-miss requests + i responses + 1 write by the acquirer + 0 (i-1 failed SCs) + (i-1) read-miss requests + (i-1) responses.
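The count above can be tallied mechanically. The sketch below simply adds up the terms listed on the slide for a given number of waiting processes `i`; the function name and the dictionary labels are made up for the example.

```python
def llsc_bus_transactions(i):
    """Bus transactions for one lock hand-off with i waiting processes."""
    if i < 1:
        raise ValueError("need at least one waiting process")
    terms = {
        "write by the releaser": 1,
        "read-miss requests": i,
        "read-miss responses": i,
        "write by the acquirer (successful SC)": 1,
        "failed SCs (no bus traffic)": 0,
        "read-miss requests after the acquire": i - 1,
        "responses after the acquire": i - 1,
    }
    return sum(terms.values()), terms


for waiting in (1, 2, 8):
    total, _ = llsc_bus_transactions(waiting)
    print(f"i = {waiting}: {total} bus transactions")
```

The terms sum to 1 + 2i + 1 + 2(i-1) = 4i, so the traffic per hand-off grows linearly with the number of waiters.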

53 Load Linked and Store Conditional
  LL-SC is an implementation of atomic read-modify-write with high flexibility.
  LL: records the loaded address in a table. The table updates a flag if any other process has modified the contents of that address.
  Any number of instructions may be performed between the LL and the SC.
  SC: the store succeeds only if the flag in the table is clear, that is, no other process attempted a store since the local LL (success only if the operation was effectively atomic).
  LL-SC does not generate bus traffic if the SC fails.
  More efficient than test&test&set.
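The LL/SC bookkeeping described above can be modelled in a few lines. This is a simplified sketch: the "table" is a dictionary mapping each process to its linked address, a link is dropped when anyone stores to that address, and the class and method names are assumptions of the example rather than any real ISA interface.

```python
class LLSCMemory:
    """Toy memory with load-linked / store-conditional semantics."""

    def __init__(self):
        self.mem = {}
        self.link = {}            # process -> address it has linked

    def ll(self, proc, addr):
        self.link[proc] = addr    # record the loaded address in the table
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        self.mem[addr] = value
        # Any store to the address breaks every link on it.
        for proc in [p for p, a in self.link.items() if a == addr]:
            del self.link[proc]

    def sc(self, proc, addr, value):
        if self.link.get(proc) != addr:
            return 0              # failure: someone stored since our LL
        self.store(addr, value)
        return 1                  # success: the LL..SC pair was atomic


m = LLSCMemory()
seen_by_p1 = m.ll("P1", "L")      # both processes see the lock free (0)
seen_by_p2 = m.ll("P2", "L")
p1_ok = m.sc("P1", "L", 1)        # P1's SC succeeds and breaks P2's link
p2_ok = m.sc("P2", "L", 1)        # P2's SC fails without writing

print(seen_by_p1, seen_by_p2, p1_ok, p2_ok, m.mem["L"])   # 0 0 1 0 1
```

P2 must retry from its LL, at which point it sees the lock held and keeps spinning on reads, as in the `lockit` loop.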

54 References
  D. J. Sorin, M. D. Hill, D. A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture (SLoCA), Morgan and Claypool (M&C).
  R. Balasubramonian, N. Jouppi, N. Muralimanohar. Multi-Core Cache Hierarchies. SLoCA, M&C.
  T. Harris, J. Larus, R. Rajwar. Transactional Memory, 2e. SLoCA, M&C.
  M. L. Scott. Shared Memory Synchronization. SLoCA, M&C.
  S. Adve and K. Gharachorloo. Shared Memory Consistency Models. HP Labs Tech Report, WRL 95/7.

55 Slides Contents Rajeev Balasubramonian, CS6810, University of Utah. Matthew T Jacob, High Performance Computing, IISc/NPTEL. Hennessy and Patterson. CA. 5ed.
