TDT 4260 lecture 9 spring semester 2015
1 1 TDT 4260 lecture 9 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU
2 2 Lecture overview. Repetition: CMP application classes; Vector MIPS. Today: vector & SIMD processing (sparse matrices, gather/scatter, stride); shared memory multiprocessor intro; cache coherence, snooping.
3 3 Repetition
4 4 Repetition
5 5 Excursion to the Vilje supercomputer? It is a 5 min walk. Interested? Monday 23/3 in the 12:15-14:00 timeslot?
6 6 Stride. Consider straightforward matrix multiplication:

    for (i = 0; i < 100; i=i+1)
        for (j = 0; j < 100; j=j+1) {
            A[i][j] = 0.0;
            for (k = 0; k < 100; k=k+1)
                A[i][j] = A[i][j] + B[i][k] * D[k][j];
        }

Vectorize the multiplication of rows of B with columns of D. Matrices are stored row-by-row (row-major order) or column-by-column (column-major order). The distance between elements that are to be gathered into a vector register is the stride. Vector processors must handle strides > 1 (example: LVWS, Load Vector With Stride).
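[Editor's note] A minimal C sketch of the stride in this example (illustrative only): in C's row-major layout, walking down a column of D visits elements one full row apart, while walking along a row of B has stride 1.

    #include <stdio.h>

    #define N 100

    static double D[N][N];    /* row-major, as in C */

    int main(void) {
        /* Walking down column j visits D[0][j], D[1][j], ...;
           consecutive column elements are one full row apart. */
        long elems = sizeof D[0] / sizeof D[0][0];   /* 100 doubles */
        long bytes = sizeof D[0];                    /* 800 bytes   */
        printf("column stride: %ld elements (%ld bytes); row stride: 1 element\n",
               elems, bytes);
        return 0;
    }

An LVWS loading a column of D into a vector register would use exactly this stride.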
7 7 Background, sparse matrices. Sparse matrix: only a few elements have non-zero values. Example: from solving a finite element problem (FEM) in 2 dimensions (the nonzero elements are shown in black). "Often no sense in multiplying with zero." How can the non-zero elements be condensed into a dense vector? => compact storage formats (using meta-data) and scatter/gather. (Lars-Ivar Hesselberg Simonsen, Master thesis (2013))
8 8 Background, sparse matrices. Sparse matrix formats: CSV, compressed sparse vector; CSR, compressed sparse row; CSX (research by CARD & NTNU-IT/HPC & GRNET under PRACE). "Energy-efficient Sparse Matrix Autotuning with CSX - A Trade-off Study", J. C. Meyer, J. M. Cebrian, L. Natvig, V. Karakasis, D. Siakavaras, and K. Nikas, HP-PAC (High-Performance, Power-Aware Computing), Boston, May 2013 (part of PRACE). Scatter-gather: LVI, Load Vector Indexed (gather); SVI, Store Vector Indexed (scatter).
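[Editor's note] To make the compact-storage idea concrete, a minimal C sketch of sparse matrix-vector multiply in the CSR format (the standard three-array layout; the names are illustrative). The indexed read x[col_idx[k]] is exactly the gather that an LVI instruction performs.

    /* y = A*x for a sparse matrix A with n_rows rows, stored in CSR:
       val[]     - the nonzero values, row by row
       col_idx[] - the column index of each nonzero
       row_ptr[] - row i occupies val[row_ptr[i] .. row_ptr[i+1]-1] */
    void spmv_csr(int n_rows, const double *val, const int *col_idx,
                  const int *row_ptr, const double *x, double *y)
    {
        for (int i = 0; i < n_rows; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];   /* gather from x */
            y[i] = sum;
        }
    }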
9 9 Scatter-Gather. Consider:

    for (i = 0; i < n; i=i+1)
        A[K[i]] = A[K[i]] + C[M[i]];

Use index vectors:

    LV      Vk, Rk          ;load K
    LVI     Va, (Ra+Vk)     ;load A[K[]]
    LV      Vm, Rm          ;load M
    LVI     Vc, (Rc+Vm)     ;load C[M[]]
    ADDVV.D Va, Va, Vc      ;add them
    SVI     (Ra+Vk), Va     ;store A[K[]]
10 10 Section 4.3 SIMD INSTRUCTION SET EXTENSIONS FOR MULTIMEDIA
11 11 SIMD Extensions. Media applications operate on data types narrower than the native word size. Limitations, compared to vector instructions: the number of data operands is encoded into the opcode; no sophisticated addressing modes (strided, scatter-gather); no mask registers.
12 12 SIMD DAXPY

    L.D    F0,a        ;load scalar a
    MOV    F1, F0      ;copy a into F1 for SIMD MUL
    MOV    F2, F0      ;copy a into F2 for SIMD MUL
    MOV    F3, F0      ;copy a into F3 for SIMD MUL
    DADDIU R4,Rx,#512  ;last address to load
    Loop:
    L.4D   F4,0[Rx]    ;load X[i], X[i+1], X[i+2], X[i+3]
    MUL.4D F4,F4,F0    ;a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
    L.4D   F8,0[Ry]    ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
    ADD.4D F8,F8,F4    ;a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
    S.4D   0[Ry],F8    ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
    DADDIU Rx,Rx,#32   ;increment index to X
    DADDIU Ry,Ry,#32   ;increment index to Y
    DSUBU  R20,R4,Rx   ;compute bound
    BNEZ   R20,Loop    ;check if done
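[Editor's note] For comparison, the same DAXPY written in C with x86 AVX intrinsics; a sketch that assumes AVX hardware and n divisible by 4.

    #include <immintrin.h>

    /* y[i] = a*x[i] + y[i], four doubles per iteration. */
    void daxpy_avx(int n, double a, const double *x, double *y)
    {
        __m256d va = _mm256_set1_pd(a);               /* broadcast a (the MOVs above) */
        for (int i = 0; i < n; i += 4) {
            __m256d vx = _mm256_loadu_pd(&x[i]);      /* like L.4D */
            __m256d vy = _mm256_loadu_pd(&y[i]);      /* like L.4D */
            vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);  /* MUL.4D + ADD.4D */
            _mm256_storeu_pd(&y[i], vy);              /* like S.4D */
        }
    }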
13 13 SSE and AVX are energy efficient. Recent research in the CARD group: see the PP4EE video (Juan Manuel Cebrian) (link under It's Learning) and the paper in Computing 2013 (link under It's Learning); also a newer and larger paper presented at ISPASS, March 2014. Details not required.
14 14 CHAP 5 - THREAD LEVEL PARALLELISM
15 15 Single/ILP vs. Multi/TLP. Uniprocessor trends: getting too complex; speed of light; diminishing returns from ILP. Multiprocessors (focus in the textbook: 4-32 CPUs): increased performance through parallelism; multichip or multicore ((single) chip multiprocessors, CMP); cost effective. The right balance of ILP and TLP is unclear today; desktop vs. server?
16 16 Other factors favoring multiprocessors. Growth in data-intensive applications: databases, file servers, multimedia, ... Growing interest in servers and server performance; increasing desktop performance is less important, outside of graphics. Improved understanding of how to use multiprocessors effectively, especially in servers with significant natural TLP. The advantage of leveraging design investment by replication rather than a unique design. Power/cooling issues favor multicore.
17 17 A bit of history 60s and 70s: Lots of research in multiprocessors. 80s and 90s: Relentless focus on single threaded performance, common belief that parallel computing was dead. Today: Almost ALL new systems are parallel systems!
18 18 MIMD architectures ($ = cache). [Diagram] Centralized: processors P1..Pn, each with a cache, share one memory through an interconnection network (IN). Distributed: each node pairs a processor and cache with its own memory (Mem), and the nodes communicate over the interconnection network.
19 19 Centralized Shared Memory Multiprocessor. Also called Symmetric Multiprocessors (SMPs); Uniform Memory Access (UMA) architecture. The shared memory becomes a bottleneck. With large caches, a single memory can satisfy the memory demands of a small number of processors. Can scale to a few dozen processors by using a switch and many memory banks; scaling beyond that is hard.
20 20 Distributed memory (diagram: processors P, each with local memory M, connected by a network). 1. Shared address space: logically shared, physically distributed; Distributed Shared Memory (DSM); NUMA architecture (conceptual model vs. implementation). 2. Separate address spaces: every P-M module is a separate computer; multicomputers, clusters; not a focus in this course.
21 21 Distributed (Shared Memory) Multiprocessor. Pro: a cost-effective way to scale memory bandwidth, if most accesses are to local memory. Pro: reduces the latency of local memory accesses. Con: communication becomes more complex. Pro/Con: it is possible to change software to take advantage of memory that is close, but this can also make the software less portable; Non-Uniform Memory Access (NUMA); the classical performance vs. portability trade-off.
22 22 MP (MIMD), cluster of SMPs. [Diagram] Each node holds several processors with caches on a node interconnection network, with I/O; the nodes are joined by a cluster interconnection network. A combination of centralized and distributed.
23 23 Kahoot Quiz no 2
24 24 Communication models. Shared memory (centralized or distributed shared memory): communication using LOAD/STORE; coordinated using traditional OS methods (semaphores, monitors, etc.); busy-waiting is more acceptable than on a uniprocessor. Message passing: using send (put) and receive (get); asynchronous / synchronous; libraries, standards: PVM, MPI, ...
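[Editor's note] A minimal message-passing counterpart, in C, of the shared-memory example on the next slides, using standard MPI send/receive (a sketch; compile with mpicc and run with two processes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                 /* producer: explicit send */
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {          /* consumer: blocking receive */
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("data %d\n", data);
        }
        MPI_Finalize();
        return 0;
    }

Here the synchronization is explicit in the send/receive pair, so no flag variable (and no cache coherence worry) is needed.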
25-31 Shared memory: how do we program these things? (animation, slides 25-31)

    Thread 0:                  Thread 1 (on a different core):
    data = 42;                 while (flag == 0) { }
    flag = 1;                  printf("data %i", data);

Main memory initially holds data = 0, flag = 0. Thread 1 loads flag = 0 into its cache and spins. Thread 0 then writes data = 42 and flag = 1, but only into its own cache. Now what?? Nothing happens, since thread 1's cache is not updated: it spins forever on the stale flag = 0. The caches must be kept coherent.
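[Editor's note] The slide example as runnable C with POSIX threads; a sketch only. volatile merely keeps the compiler from caching flag in a register; it is the hardware coherence protocol (next slides) that makes thread 0's stores visible, and production code would use proper atomics/fences for ordering.

    #include <stdio.h>
    #include <pthread.h>

    volatile int data = 0;
    volatile int flag = 0;

    void *thread0(void *arg) {           /* producer */
        data = 42;
        flag = 1;
        return NULL;
    }

    void *thread1(void *arg) {           /* consumer, on another core */
        while (flag == 0) { }            /* spin on the cached copy of flag */
        printf("data %i\n", data);
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t0, NULL, thread0, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }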
32 32 Enforcing coherence. Separate caches make multiple copies frequent. Migration: data moves from shared memory to a local cache; speeds up access, reduces memory bandwidth requirements. Replication: several local copies when an item is read by several processors; speeds up access, reduces memory contention. We need coherence protocols to track shared data. Directory based: status stored in a shared location (centralized or distributed). (Bus) snooping: each cache maintains local status; all caches monitor a broadcast medium; write invalidate / write update.
33 33 Snooping: write invalidate. Several reads or one write: no change. Writes require exclusive access. Writes to shared data: all other cache copies are invalidated; the invalidate command and address are broadcast; all caches listen (snoop) and invalidate if necessary. Read miss: with write-through, memory is always up to date; with write-back, caches listen and any exclusive copy is put on the bus.
34 34 Snooping: write update. Also called write broadcast. Must know which cache blocks are shared. Usually write-through. Write to shared data: broadcast; all caches listen and update their copy (if any). Read miss: main memory is up to date.
35 35 Snooping: invalidate vs. update. Repeated writes to the same address (with no reads) require several updates, but only one invalidate. Invalidates are done at cache-block level, while updates are done on individual words. Invalidate is most common: less bus traffic and less memory traffic, and bus and memory bandwidth are the typical bottlenecks.
36 36 An example snoopy protocol: an invalidation protocol for write-back caches. Each cache block is in one state. Shared: clean in all caches and up-to-date in memory; the block can be read. Exclusive: one cache has the only copy; it is writeable and dirty. Invalid: the block contains no data. (1) example, (2) in more detail (next lecture).
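[Editor's note] A toy C model of these three states; a sketch of the local transitions only, not of a real bus controller, with illustrative names.

    typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

    /* Transition of one cache block on a local read or write;
       returns the bus transaction the cache must issue. */
    const char *cpu_access(BlockState *s, int is_write)
    {
        switch (*s) {
        case INVALID:                       /* miss: fetch the block */
            *s = is_write ? EXCLUSIVE : SHARED;
            return is_write ? "read miss + invalidate others" : "read miss";
        case SHARED:
            if (is_write) {                 /* need exclusive access */
                *s = EXCLUSIVE;
                return "broadcast invalidate";
            }
            return "none (read hit)";
        case EXCLUSIVE:                     /* only copy: reads and writes hit */
        default:
            return "none (hit)";
        }
    }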
37 37 Snooping: invalidation protocol (1/6). [Diagram: processors 0..N-1 with caches on an interconnection network, plus main memory and an I/O system.] Processor 0 issues read x: a read miss goes out on the network; main memory holds x = 0.
38 38 Snooping: invalidation protocol (2/6). Processor 0 now caches x = 0 in state shared.
39 39 Snooping: invalidation protocol (3/6). Processor 2 issues read x: another read miss on the network.
40 40 Snooping: invalidation protocol (4/6). Processors 0 and 2 both cache x = 0 in state shared.
41 41 Snooping: invalidation protocol (5/6). One of the sharers writes x: an invalidate is broadcast, and the other shared copy is invalidated.
42 42 Snooping: invalidation protocol (6/6). The writer's copy is now x = 1 in state exclusive; main memory still holds the stale x = 0.
43 Directory Based Coherency. Snooping doesn't scale. Idea: track who is sharing the various addresses in a directory. Used both in large DSM machines and in small multicores. Complicated! Things you have to worry about: deadlock, livelock & starvation, consistency models.
44-51 Directory Based Coherency (animation, slides 44-51). [Diagram: several caches, a directory, and a sharer list for block X.] Processors 0, 1, 2 and 4 issue Ld X in turn; after each load, the loading cache is added to the directory's sharer list for X, which grows from empty to 0, then 0,1, then 0,1,2, then 0,1,2,4.
52-64 Directory Based Coherency (animation, slides 52-64). Processor 4 now wants to St X=7 and asks the directory for write permission ("Can I write???? Please?"). The directory sends an invalidate to each of the other sharers (0, 1, 2) in turn; as each replies "Inval complete", it is removed from the sharer list. When only 4 remains in the list, the directory answers "Yes, you can write!", and processor 4 writes X = 7.
65-70 Directory Based Coherency (animation, slides 65-70). Processor 1 issues Ld X again ("Give me new X"). The directory knows processor 4 holds a modified copy, so it asks for it ("Give me your modified x"); processor 4 supplies X = 7 ("Here you go!"), the value is passed on to processor 1, and the sharer list becomes 1,4.
71 Directory Based Coherency. All these different types of messages travel over various interconnects; interconnect design is a big component of shared memory systems!
72 Directory based cache coherence. For large multiprocessor systems with lots of CPUs, distributed memory is preferable, since it increases memory bandwidth. A snooping bus with broadcast? A single bus becomes a bottleneck, so other ways of communicating are needed, and with these, broadcasting is hard/expensive. We can avoid broadcast if we know exactly which caches have a copy: a directory.
73 Directory based cache coherence. The directory knows which blocks are in which cache, and their state; it can be partitioned and distributed. Typical states: shared, uncached, modified. The protocol is based on messages: invalidates and updates are sent only where needed, which avoids broadcast and reduces traffic.
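[Editor's note] A toy C sketch of such a directory entry, under stated assumptions: at most 64 nodes, a full bit-vector of sharers, and a hypothetical send_invalidate message primitive, stubbed out here.

    #include <stdint.h>

    typedef enum { UNCACHED, SHARED_STATE, MODIFIED } DirState;

    typedef struct {
        DirState state;
        uint64_t sharers;     /* bit i set => cache i holds a copy */
    } DirEntry;

    static void send_invalidate(int node) { /* stub: would send a message */ }

    /* Node w wants to write: invalidate every other sharer, then record
       w as the sole (modified) owner - no broadcast needed. */
    void on_write_request(DirEntry *e, int w)
    {
        uint64_t others = e->sharers & ~(1ull << w);
        for (int i = 0; i < 64; i++)
            if ((others >> i) & 1)
                send_invalidate(i);
        e->sharers = 1ull << w;
        e->state = MODIFIED;
    }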