Caching Puts and Gets in a PGAS Language Runtime

Size: px

Start display at page:

Download "Caching Puts and Gets in a PGAS Language Runtime"

Noel Barrett
6 years ago
Views:

1 Caching Puts and Gets in a PGAS Language Runtime Michael Ferguson Cray Inc. Daniel Buettner Laboratory for Telecommunication Sciences September 17, 2015 C O M P U T E S T O R E A N A L Y Z E

2 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements. C O M P U T E S T O R E A N A L Y Z E Copyright 2015 Cray Inc. 2

3 CC Flickr/Daniel Jolivet

4 CC Flickr/Ben Salter

5 CC Flickr/ajmexico

6 TRAIN LATENCY (8 HOUR TRIP, 60 TON CARS, 60 SEC/CAR) 800 Latency (min) Number of Cars 6

7 TRAIN BANDWIDTH 12 Bandwidth tons/min Number of Cars 7

8 CC Flickr/ChrisDag

9 INFINIBAND (IB) LATENCY 4000 * with small 10-node cluster, QDR IB 3000 Latency (ns) Request Size (bytes) 9

10 INFINIBAND (IB) BANDWIDTH Bandwidth MB/s * with small 10-node cluster, QDR IB Max BW: 5000 MB/s Request Size (bytes) 10

11 AGGREGATION OVERLAP CC Flickr/ajmexico CC Flickr/Barry Lewis CACHE HELPS WITH BOTH! 11

12 BACKGROUND: MEMORY MODEL ALLOWS PREFETCH AND WRITE-BEHIND Library of Congress

13 Memory model for C11, C++11, Chapel: data race free programs are sequentially consistent See Adve, S. V., Boehm, H.-J Memory models: a case for rethinking parallel languages and hardware. Communications of the ACM 53(8): /8/96610-memory-models-a-case-for-rethinking-parallel-languages-and-hardware/fulltext C O M P U T E S T O R E A N A L Y Z E 13

14 A RACY PROGRAM Thread 1 x = 42; notify = 1; Thread 2 while 0 == notify { /* wait */ } compute_with(x); C O M P U T E S T O R E A N A L Y Z E 14

15 A RACY PROGRAM Thread 1 x = 42; notify = 1; Thread 2 while 0 == notify { /* wait */ } compute_with(x); compiler or processor Thread 1 r1 = 42; notify = 1; x = r1; Thread 2 r2 = notify; while 0 == r2 { /* wait */ } compute_with(x); C O M P U T E S T O R E A N A L Y Z E 15

16 prefetch load x = A[i] Compiler and processor would like to start loads earlier in order to hide memory latency. We ll call that prefetch. C O M P U T E S T O R E A N A L Y Z E 16

17 Compiler and processor would like to complete stores later in order to hide memory latency. We ll call that write behind. B[i] = store y write behind C O M P U T E S T O R E A N A L Y Z E 17

18 prefetch Overlap loads (start early) Reuse values from earlier load Aggregate loads (cache lines) load x store y Overlap stores (finish later) Aggregate many stores into a single store write behind C O M P U T E S T O R E A N A L Y Z E 18

19 REMEMBER THE RACY PROGRAM? Thread 1 x = 42; notify = true; Thread 2 while 0 == notify { /* wait */ } compute_with(x); compiler or processor Thread 1 r1 = 42; notify = 1; x = r1; Thread 2 r2 = notify; while 0 == r2 { /* wait */ } compute_with(x); C O M P U T E S T O R E A N A L Y Z E 19

20 prefetch acquire load x store y release write behind C O M P U T E S T O R E A N A L Y Z E 20

21 prefetch acquire load x atomic, sync provide both acquire, release store y release write behind C O M P U T E S T O R E A N A L Y Z E 21

22 COMMUNICATION OPTIMIZATION Library of Congress

23 prefetch Overlap GETs (start early) Reuse values from earlier GET Aggregate GETs (cache lines) load x GET x store y PUT y Overlap PUTs (finish later) Aggregate many PUTs into a single PUT write behind C O M P U T E S T O R E A N A L Y Z E 23

24 FIXING IT WITH A CACHE Library of Congress

25 CACHE FOR REMOTE DATA Goal: communication aggregation and overlap Bonus points: avoiding repeated communication Software cache in Chapel's runtime One cache per pthread Write-back cache with dirty bits C O M P U T E S T O R E A N A L Y Z E 25

26 CACHE COHERENCY Simple, local coherency Discard all cached data on acquire Wait for pending operations on a release Strategy used in related work with UPC C O M P U T E S T O R E A N A L Y Z E 26

27 CACHE FEATURES Do PUTs in background Overlap Aggregation GET PUT GET PUT X Start one PUT per contiguous written region Round GETs up to 64-byte X cache lines X Sequential read-ahead X X X Programmer-provided prefetch hints* X C O M P U T E S T O R E A N A L Y Z E 27

28 OTHER APPROACHES Library of Congress

29 WEAK MEMORY CONSISTENCY? 1 x starts at 0;... if someoption then 2 x = 2; if someotheroption then 3 x = 3; 4 return x; C O M P U T E S T O R E A N A L Y Z E 29

30 WEAK MEMORY CONSISTENCY? 1 x starts at 0; PUT 2 into x;... 3 PUT 3 into x; 4 GET x; Chapel OpenSHMEM result must be 3 result could be 0, 2, or 3 C O M P U T E S T O R E A N A L Y Z E 30

31 COMPILER OPTIMIZATION? PUT 1 into B[f(1)] for i in { // PUT into B B[f(i)] = i; } Can the compiler prove these PUTs do not overlap? PUT 2 into B[f(2)] PUT 3 into B[f(3)] 31

32 COMPILER OPTIMIZATION? for i in { // PUT into B B[f(i)] = i; } PUT 1 into B[f(1)] PUT 2 into B[f(2)] PUT 2 into B[f(2)] With a cache, conflicting access is handled at runtime. 32

33 OVERLAPPING GETS var A:[1..n] int; on Locales[1] { var sum:int; for i in 1..n do sum += A[f(i)] } We would like to overlap the GETs for A[f(i)] with each other C O M P U T E S T O R E A N A L Y Z E 33

34 var A:[1..n] int; on Locales[1] { var sum:int; var h: [0..k] handles; var bufs: [0..k] int; // Warm up loop for i in 1..k { // nonblocking GET A[f(i) into bufs[i%k] h[i%k] = get_nb(bufs[i%k], A[f(i)]) } for i in 1..n { wait (h[i%k]); sum += bufs[i%k]; if i+k<=n { // nonblocking GET A[f(i+k)] into bufs[(i+k)%k] h[(i+k)%k] = get_nb(bufs[(i+k)%k],a[f(i+k)]) } } } Explicit overlap is messy! 34

35 var A:[1..n] int; on Locales[1] { var sum:int; // Optional warm up for i in 1..k do prefetch(a[f(i)]); for i in 1..n { if i+k <= n then prefetch(a[f(i+k)]); sum += A[f(i)] } } Much better! 35

36 COMMUNICATION AGGREGATION for i in 1..n do B[i] = compute(i); var localb:[1..n] int; for i in 1..n do localb[i] = compute(i); B = localb; for i in 1..n do consume(a[i]); var locala:[1..n] int = A; for i in 1..n do consume(locala[i]); Simple, cache aggregates Manual optimization reduces portability 36

37 PERFORMANCE San Diego Air and Space Museum

38 TEST CONFIGURATIONS Cray XC30 TM system with 50 nodes, Aries network GASNet Aries: GASNet with the aries conduit Cray ugni: native ugni support for Chapel Cray CS400 TM system with 200 nodes, FDR InfiniBand GASNet ibv: GASNet with the InfinBand Verbs conduit v1.9+ is revision 5ba6639 v1.11+ is revision 6c635a1 C O M P U T E S T O R E A N A L Y Z E 38

39 SYNTHETIC BENCHMARKS GASNet ibv v1.11+ GASNet Aries v1.11+ Cray ugni v x Speedup copy 0 rand-puts rand-gets 39

40 APPLICATION BENCHMARKS 5 4 GASNet ibv v1.11+ GASNet Aries v1.11+ Cray ugni v x Speedup lulesh minimd PTRANS SSCA2.4 40

41 PREFETCH EXAMPLE var A:[1..n] int; on Locales[1] { var sum:int; // Optional warm up for i in 1..k do prefetch(a[f(i)]); for i in 1..n { if i+k <= n then prefetch(a[f(i+k)]); sum += A[f(i)] } } C O M P U T E S T O R E A N A L Y Z E 41

42 PREFETCH EXAMPLE 10 8 GASNet ibv v1.11+ GASNet Aries v1.11+ Cray ugni v x Speedup Prefetch Distance 42

43 VS OPTIMIZATION 5 4 GASNet ibv v1.9+ GASNet ibv v1.11+ GASNet Aries v1.9+ GASNet Aries v x Speedup lulesh minimd PTRANS SSCA2.4 *for GASNet/Aries, lulesh improved 3.2x between v1.9+ and v

44 VS OPTIMIZATION * with GASNet Aries v1.11+ configuration C C+cache llvm llvm+cache 250 Time lulesh minimd PTRANS*10 44

45 Cache for Remote Data: providing communication overlap and aggregation since Chapel v 1.10! Library of Congress

46 Legal Disclaimer Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners. Copyright 2014 Cray Inc. C O M P U T E S T O R E A N A L Y Z E Copyright 2015 Cray Inc. 46

48 Backup Slides C O M P U T E S T O R E A N A L Y Z E 48

49 APPLICATION BENCHMARKS 5 4 GASNet ibv v1.11+ GASNet Aries v1.11+ Cray ugni v loc GASNet Aries v x Speedup lulesh minimd PTRANS SSCA2.4 49

50 ADDING IMPLIED FENCES acquire and release triggered by task or on statement spawn, join, start, and finish release on { acquire release } acquire sync { release begin { acquire. release } } acquire C O M P U T E S T O R E A N A L Y Z E 50

51 LOOKING INSIDE Library of Congress

52 CACHE ENTRY 1024 byte cache page node address readahead trigger min sequence number max put sequence number max prefetch sequence number 64 byte cache lines Valid Line Bits C O M P U T E S T O R E A N A L Y Z E Optional Dirty Bits 52

53 Pointer Tree top bits bottom bits 17 bits 10 bits 17 bits 10 bits 10 bits top[top bits] top half bottom half page offset... bottom[bottom bits] page entries C O M P U T E S T O R E A N A L Y Z E Inspired by Two Level Tree Structure for Fast Pointer Lookup by Hans J Boehm 53

54 CACHE DATA STRUCTURES New Pages Am LRU Ain Pointer Tree 2Q Queues Aout Dirty LRU Free Lists Operations Queue per task: last acquire sequence number C O M P U T E S T O R E A N A L Y Z E 54

55 WRITE BEHIND Write Recorded in Dirty Bits, Page added to Dirty Queue 55

56 56

57 Flushed on release or when there are too many dirty pages 57

58 READAHEAD ra skip,len = 0 GET with 2 earlier valid lines triggers synchronous readahead ra skip=1 pg len = 1 pg C O M P U T E S T O R E A N A L Y Z E 58

59 ra skip=1 pg len = 1 pg The next GET triggers asynchronous readahead ra skip,len=0 ra skip=1 pg len =2 pg C O M P U T E S T O R E A N A L Y Z E 59

60 ra skip,len=0 ra skip=1 pg len =2 pg GET here triggers more readahead ra skip,len=0 ra skip=2 pg len =4 pg C O M P U T E S T O R E A N A L Y Z E 60

61 INFINIBAND (IB) LATENCY 3600 * with small 10-node cluster, QDR IB 2700 Latency (ns) GASNET get IB Benchmark GASNET put Request Size (bytes) 61

62 INFINIBAND (IB) BANDWIDTH Bandwidth MB/s IB Benchmark GASNET put GASNET get * with small 10-node cluster, QDR IB Max BW: 5000 MB/s Request Size (bytes) 62

63 Legal Disclaimer Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners. Copyright 2014 Cray Inc. C O M P U T E S T O R E A N A L Y Z E Copyright 2015 Cray Inc. 63

Hands-On II: Ray Tracing (data parallelism) COMPUTE STORE ANALYZE

Hands-On II: Ray Tracing (data parallelism) Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include