Lecture 17. NUMA Architecture and Programming

Size: px

Start display at page:

Download "Lecture 17. NUMA Architecture and Programming"

Lesley Norman
5 years ago
Views:

1 Lecture 17 NUMA Architecture and Programming

2 Announcements Extended office hours today until 6pm Weds after class? Partitioning and communication in Particle method project 2012 Scott B. Baden /CSE 260/ Fall

3 Uniform vs. non-uniform partitioning Rectilinear partitioning maintains fixed connectivity, simplifying communication 2012 Scott B. Baden /CSE 260/ Fall

4 Uniform vs. non-uniform partitioning Rectilinear partitioning maintains fixed connectivity, simplifying communication 2012 Scott B. Baden /CSE 260/ Fall

5 Load imbalance drift increases running time S e n oc d s 130 STATIC 110 DYNAMIC Velocity Evaluation Acomparisonofstaticanddynamicloadbalancingofavortexdynamics computation on 32 processors of the Intel ipsc-1. If the workloads are partitioned only at the beginningofthecalculation(staticloadbalancing, top curve), the loads drift gradually out of balance, andthetimerequiredtoperformavelocity evaluation steadily increases with time. In contrast, a dynamic load balancing strategy (bottom curve) periodically rebalances the workloads and can maintain an almost steady running time; here loads were rebalanced every fourth velocity evaluation Scott B. Baden /CSE 260/ Fall

6 Ghost cells with irregular partitioning D i L i L k L j C Processor i is assigned L i,asubregionoftheworkmesh,andanexternalinteractionregion D i,whichwe have shaded. D i is a surrounding shell of bins and is C bins thick, where we have chosen C =2. Thisregion divides into 6 subregions, one for each interacting processor. Subproblems L i and L k are locally dependent because D i and L k intersect. In comparison, subproblems L i and L j are locally independent, because D i and L j do not intersect Scott B. Baden /CSE 260/ Fall

7 Repatriating particles L i L i D i L-L i i Processor i s assigned subproblem changes from L i to L i as the result of repartitioning. The processor must give up the work contained in the region L i L i to other processors Scott B. Baden /CSE 260/ Fall

8 Today s lecture Consistency NUMA Example NUMA Systems u u Cray XE-6 SGI Performance progrmaming 2012 Scott B. Baden /CSE 260/ Fall

9 Memory model Cache coherence tells us that memory will eventually be consistent The memory consistency policy tells us when this will happen Even if memory is consistent, changes don t propagate instantaneously These give rise to correctness issues involving program behavior 2012 Scott B. Baden /CSE 260/ Fall

10 Memory consistency model The memory consistency model determines when a written value will be seen by a reader Sequential Consistency maintains a linear execution on a parallel architecture that is consistent with the sequential execution of some interleaved arrangement of the separate concurrent instruction streams Expensive to implement Relaxed consistency u u Enforce consistency only at well defined times Useful in handling false sharing 2012 Scott B. Baden /CSE 260/ Fall

11 Memory consistency A memory system is consistent if the following 3 conditions hold u u u Program order Definition of a coherent view of memory Serialization of writes 2012 Scott B. Baden /CSE 260/ Fall

12 Program order If a processor writes and then reads the same location X, and there are no other intervening writes by other processors to X, then the read will always return the value previously written. X:=2 X:=2 Memory P X:= Scott B. Baden /CSE 260/ Fall

13 Definition of a coherent view of memory If a processor P reads from location X that was previously written by a processor Q, then the read will return the value previously written, if a sufficient amount of time has elapsed between the read and the write. X:=1 Memory P X:=1 Q X:=1 Load X 2012 Scott B. Baden /CSE 260/ Fall

14 Serialization of writes If two processors write to the same location X, then other processors reading X will observe the same the sequence of values in the order written If 10 and then 20 is written into X, then no processor can read 20 and then Scott B. Baden /CSE 260/ Fall

15 Today s lecture Consistency NUMA Example NUMA Systems u u Cray XE-6 SGI Performance programming 2012 Scott B. Baden /CSE 260/ Fall

for each block of memory Stanford Dash; NUMA nodes of the Cray XE-6, SGI UV, Altix, Origin

16 NUMA Architectures The address space is global to all processors, but memory is physically distributed Point-to-point messages manage coherence A directory keeps track of sharers, one for each block of memory Stanford Dash; NUMA nodes of the Cray XE-6, SGI UV, Altix, Origin 2000 en.wikipedia.org/wiki/ Non-Uniform_Memory_Access 2012 Scott B. Baden /CSE 260/ Fall

17 Some terminology Every block of memory has an associated home: the specific processor that physically holds the associated portion of the global address space Every block also has an owner: the processor whose memory contains the actual value of the data Initially home = owner, but this can change if a processor other than the home processor writes a block 2012 Scott B. Baden /CSE 260/ Fall

18 Inside a directory Each processor has a 1-bit sharer entry in the directory There is also a dirty bit and a PID identifying the owner in the case of a dirt block P P Cache Cache Interconnection Network Memory Directory Parallel Computer Architecture, Culler, Singh, & Gupta Memory Directory presence bits dirty bit presence bits dirty bit 2012 Scott B. Baden /CSE 260/ Fall

19 P0 loads A Operation of a directory Set directory entry for A (on P1) to indicate that P0 is a sharer Mem $ P0 P Scott B. Baden /CSE 260/ Fall

20 Operation of a directory P2, P3 load A (not shown) Set directory entry for A (on P1) to indicate that P0 is a sharer Mem P2 $ P P0 P Scott B. Baden /CSE 260/ Fall

21 Acquiring ownership of a block P0 writes A P0 becomes the owner of A Mem $ P0 P Scott B. Baden /CSE 260/ Fall

22 Acquiring ownership of a block P0 becomes the owner of A P1 s directory entry for A is set to Dirty Outstanding sharers are invalidated Access to line is blocked until all invalidations are acknowledged Mem $ P P0 P D P0 P Scott B. Baden /CSE 260/ Fall

23 Change of ownership P0 stores into A (home & owner) P1 stores into A (becomes owner) P2 loads A Store A, #y Store A, #x (home & owner) P1 P0 A dirty P2 Load A 1 1 D P1 Directory 2012 Scott B. Baden /CSE 260/ Fall

24 Forwarding P0 stores into A (home & owner) P1 stores into A (becomes owner) P2 loads A home (P0) forwards request to owner (P1) P1 Store A, #x (home & owner) Store A, #y P0 A dirty P2 Load A 1 1 D P1 Directory 2012 Scott B. Baden /CSE 260/ Fall

25 Performance issues False sharing Locality, locality, locality Page placement u Page migration u Copying v. redistribution u Layout u 2012 Scott B. Baden /CSE 260/ Fall

26 Today s lecture Consistency NUMA Example NUMA Systems u u Cray XE-6 SGI Performance programming 2012 Scott B. Baden /CSE 260/ Fall

24 cores sharing 32GB main memory Packaged as 2 AMD Opteron 6172 processors Magny-Cours Each processor is a directly connected Multi-Chip Module: two

coherence traffic Direct access to 8GB main memory via 2 memory channels 4 Hyper Transport (HT) links for communicating with other dies Asymmetric

27 24 cores sharing 32GB main memory Packaged as 2 AMD Opteron 6172 processors Magny-Cours Each processor is a directly connected Multi-Chip Module: two hex-core dies living on the same socket Each die has 6MB of shared L3, 512KB L2/core, 64K L1/core u u u Cray XE6 node 1MB of L3 is used for cache coherence traffic Direct access to 8GB main memory via 2 memory channels 4 Hyper Transport (HT) links for communicating with other dies Asymmetric connections between dies and processors Scott B. Baden /CSE 260/ Fall

XE-6 Processor memory interconnect (node) http://www.hector.ac.

28 XE-6 Processor memory interconnect (node) Scott B. Baden /CSE 260/ Fall

29 XE-6 Processor memory interconnect (node) Scott B. Baden /CSE 260/ Fall

30 Today s lecture Consistency Example NUMA Systems u Cray XE-6 u SGI Origin 2000 Performance programming 2012 Scott B. Baden /CSE 260/ Fall

31 Origin 2000 Interconnect 2012 Scott B. Baden /CSE 260/ Fall

32 Locality 2012 Scott B. Baden /CSE 260/ Fall

33 Poor Locality 2012 Scott B. Baden /CSE 260/ Fall

34 Today s lecture Consistency NUMA Example NUMA Systems u u Cray XE-6 SGI Performance programming 2012 Scott B. Baden /CSE 260/ Fall

35 Quick primer on paging We group the physical and virtual address spaces into units called pages Pages are backed up on disk Virtual to physical mapping done by the Translation Lookaside Buffer (TLB), backs up page tables set up by the OS When we allocate a block of memory, we don t need to allocate physical storage to pages; we do it on demand 2012 Scott B. Baden /CSE 260/ Fall

36 Remote access latency When we allocate a block of memory, which processor(s) is (are) the owner(s)? We can control memory locality with the same kind of data layouts that we use with message passing Page allocation policies u u First touch Round robin Page placement and Page migration Copying v. redistribution Layout 2012 Scott B. Baden /CSE 260/ Fall

37 Example Consider the following loop for r = 0 to nreps-1 for i = 0 to n-1 a[i] = b[i] + q*c[i] 2012 Scott B. Baden /CSE 260/ Fall

Page Migration a[i] = b[i] + q*c[i] Round robin initialization, w/ migration Parallel initialization Serial initialization (No migration) Parallel Initialization, First touch

38 Page Migration a[i] = b[i] + q*c[i] Round robin initialization, w/ migration Parallel initialization Serial initialization (No migration) Parallel Initialization, First touch (No migration) techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0650/bks/sgi_developer/books/oron2_pftune/ sgi_html/ch08.html#id Scott B. Baden /CSE 260/ Fall

Migration eventually reaches the optimal time Round robin initial, w/ migration One node initial placement, w/ migration Parallel initialization Serial

39 Migration eventually reaches the optimal time Round robin initial, w/ migration One node initial placement, w/ migration Parallel initialization Serial initialization Parallel initialization, first touch, migration Parallel initialization, first touch (optimal) 2012 Scott B. Baden /CSE 260/ Fall

40 Cumulative effect of Page Migration 2012 Scott B. Baden /CSE 260/ Fall

41 Bisection Bandwidth and Latency a(i) = b(i) + q*c(i) 2012 Scott B. Baden /CSE 260/ Fall

42 Parallelization via the compiler integer n, lda, ldr, lds, i, j, k real*8 a(lda,n), r(ldr,n), s(lds,n)!$omp PARALLEL DO private(j,k,i), shared(n,a,r,s), schedule(static) do j = 1, n do k = 1, n do i = 1, n a(i,j) = a(i,j) + r(i,k)*s(k,j) enddo enddo enddo 2012 Scott B. Baden /CSE 260/ Fall

43 Coping with false sharing

44 False sharing Successive writes by P0 and P1 cause the processors to uselessly invalidate one another s cache P0 P Scott B. Baden /CSE 260/ Fall

45 An example of false sharing float a[m,n], s[m] // Outer loop is in parallel // Consider m=4, 128 byte cache line size // Thread i updates element s[i] #pragma omp parallel for private(i,j), shared(s,a) for i = 0, m-1 s[i] = 0.0 for j = 0, n-1 s[i] += a[i,j] end for end for 2012 Scott B. Baden /CSE 260/ Fall

46 Avoiding false sharing float a[m,n], s[m,32] #pragma omp parallel for private(i,j), shared(s,a) for i = 0, m-1 s[i,1] = 0.0 for j = 0, n-1 s[i,1] += a[i,j] end for end for Scott B. Baden /CSE 260/ Fall

47 False sharing in higher dimension arrays Jacobi s method for solving Laplace s Eqn for j=1 : N for i=1 : M u [i,j] = (u[i-1,j]+ u[i+1,j] +u[i,j-1]+u[i,j+1])/4; C a c h e b l o c k C o n t i g u i t y i n m e m o r y l a y o u t s t r a d d l e s p a r t i t i o n b o u n d a r y C a c h e b l o c k i s w i t h i n a p a r t i t P 0 P 1 P 2 P 3 P 0 P 1 P 2 P 3 P 4 P 5 P 6 P 7 P 4 P 5 P 6 P 7 P 8 P 8 Parallel Computer Architecture, Culler, Singh, & Gupta In global memory Distributed memory 2012 Scott B. Baden /CSE 260/ Fall

48 Fin

CSE 160 Lecture 8. NUMA OpenMP. Scott B. Baden

CSE 160 Lecture 8. NUMA OpenMP. Scott B. Baden CSE 160 Lecture 8 NUMA OpenMP Scott B. Baden OpenMP Today s lecture NUMA Architectures 2013 Scott B. Baden / CSE 160 / Fall 2013 2 OpenMP A higher level interface for threads programming Parallelization