Caching Puts and Gets in a PGAS Language Runtime
|
|
- Noel Barrett
- 6 years ago
- Views:
Transcription
1 Caching Puts and Gets in a PGAS Language Runtime Michael Ferguson Cray Inc. Daniel Buettner Laboratory for Telecommunication Sciences September 17, 2015 C O M P U T E S T O R E A N A L Y Z E
2 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements. C O M P U T E S T O R E A N A L Y Z E Copyright 2015 Cray Inc. 2
3 CC Flickr/Daniel Jolivet
4 CC Flickr/Ben Salter
5 CC Flickr/ajmexico
6 TRAIN LATENCY (8 HOUR TRIP, 60 TON CARS, 60 SEC/CAR) 800 Latency (min) Number of Cars 6
7 TRAIN BANDWIDTH 12 Bandwidth tons/min Number of Cars 7
8 CC Flickr/ChrisDag
9 INFINIBAND (IB) LATENCY 4000 * with small 10-node cluster, QDR IB 3000 Latency (ns) Request Size (bytes) 9
10 INFINIBAND (IB) BANDWIDTH Bandwidth MB/s * with small 10-node cluster, QDR IB Max BW: 5000 MB/s Request Size (bytes) 10
11 AGGREGATION OVERLAP CC Flickr/ajmexico CC Flickr/Barry Lewis CACHE HELPS WITH BOTH! 11
12 BACKGROUND: MEMORY MODEL ALLOWS PREFETCH AND WRITE-BEHIND Library of Congress
13 Memory model for C11, C++11, Chapel: data race free programs are sequentially consistent See Adve, S. V., Boehm, H.-J Memory models: a case for rethinking parallel languages and hardware. Communications of the ACM 53(8): /8/96610-memory-models-a-case-for-rethinking-parallel-languages-and-hardware/fulltext C O M P U T E S T O R E A N A L Y Z E 13
14 A RACY PROGRAM Thread 1 x = 42; notify = 1; Thread 2 while 0 == notify { /* wait */ } compute_with(x); C O M P U T E S T O R E A N A L Y Z E 14
15 A RACY PROGRAM Thread 1 x = 42; notify = 1; Thread 2 while 0 == notify { /* wait */ } compute_with(x); compiler or processor Thread 1 r1 = 42; notify = 1; x = r1; Thread 2 r2 = notify; while 0 == r2 { /* wait */ } compute_with(x); C O M P U T E S T O R E A N A L Y Z E 15
16 prefetch load x = A[i] Compiler and processor would like to start loads earlier in order to hide memory latency. We ll call that prefetch. C O M P U T E S T O R E A N A L Y Z E 16
17 Compiler and processor would like to complete stores later in order to hide memory latency. We ll call that write behind. B[i] = store y write behind C O M P U T E S T O R E A N A L Y Z E 17
18 prefetch Overlap loads (start early) Reuse values from earlier load Aggregate loads (cache lines) load x store y Overlap stores (finish later) Aggregate many stores into a single store write behind C O M P U T E S T O R E A N A L Y Z E 18
19 REMEMBER THE RACY PROGRAM? Thread 1 x = 42; notify = true; Thread 2 while 0 == notify { /* wait */ } compute_with(x); compiler or processor Thread 1 r1 = 42; notify = 1; x = r1; Thread 2 r2 = notify; while 0 == r2 { /* wait */ } compute_with(x); C O M P U T E S T O R E A N A L Y Z E 19
20 prefetch acquire load x store y release write behind C O M P U T E S T O R E A N A L Y Z E 20
21 prefetch acquire load x atomic, sync provide both acquire, release store y release write behind C O M P U T E S T O R E A N A L Y Z E 21
22 COMMUNICATION OPTIMIZATION Library of Congress
23 prefetch Overlap GETs (start early) Reuse values from earlier GET Aggregate GETs (cache lines) load x GET x store y PUT y Overlap PUTs (finish later) Aggregate many PUTs into a single PUT write behind C O M P U T E S T O R E A N A L Y Z E 23
24 FIXING IT WITH A CACHE Library of Congress
25 CACHE FOR REMOTE DATA Goal: communication aggregation and overlap Bonus points: avoiding repeated communication Software cache in Chapel's runtime One cache per pthread Write-back cache with dirty bits C O M P U T E S T O R E A N A L Y Z E 25
26 CACHE COHERENCY Simple, local coherency Discard all cached data on acquire Wait for pending operations on a release Strategy used in related work with UPC C O M P U T E S T O R E A N A L Y Z E 26
27 CACHE FEATURES Do PUTs in background Overlap Aggregation GET PUT GET PUT X Start one PUT per contiguous written region Round GETs up to 64-byte X cache lines X Sequential read-ahead X X X Programmer-provided prefetch hints* X C O M P U T E S T O R E A N A L Y Z E 27
28 OTHER APPROACHES Library of Congress
29 WEAK MEMORY CONSISTENCY? 1 x starts at 0;... if someoption then 2 x = 2; if someotheroption then 3 x = 3; 4 return x; C O M P U T E S T O R E A N A L Y Z E 29
30 WEAK MEMORY CONSISTENCY? 1 x starts at 0; PUT 2 into x;... 3 PUT 3 into x; 4 GET x; Chapel OpenSHMEM result must be 3 result could be 0, 2, or 3 C O M P U T E S T O R E A N A L Y Z E 30
31 COMPILER OPTIMIZATION? PUT 1 into B[f(1)] for i in { // PUT into B B[f(i)] = i; } Can the compiler prove these PUTs do not overlap? PUT 2 into B[f(2)] PUT 3 into B[f(3)] 31
32 COMPILER OPTIMIZATION? for i in { // PUT into B B[f(i)] = i; } PUT 1 into B[f(1)] PUT 2 into B[f(2)] PUT 2 into B[f(2)] With a cache, conflicting access is handled at runtime. 32
33 OVERLAPPING GETS var A:[1..n] int; on Locales[1] { var sum:int; for i in 1..n do sum += A[f(i)] } We would like to overlap the GETs for A[f(i)] with each other C O M P U T E S T O R E A N A L Y Z E 33
34 var A:[1..n] int; on Locales[1] { var sum:int; var h: [0..k] handles; var bufs: [0..k] int; // Warm up loop for i in 1..k { // nonblocking GET A[f(i) into bufs[i%k] h[i%k] = get_nb(bufs[i%k], A[f(i)]) } for i in 1..n { wait (h[i%k]); sum += bufs[i%k]; if i+k<=n { // nonblocking GET A[f(i+k)] into bufs[(i+k)%k] h[(i+k)%k] = get_nb(bufs[(i+k)%k],a[f(i+k)]) } } } Explicit overlap is messy! 34
35 var A:[1..n] int; on Locales[1] { var sum:int; // Optional warm up for i in 1..k do prefetch(a[f(i)]); for i in 1..n { if i+k <= n then prefetch(a[f(i+k)]); sum += A[f(i)] } } Much better! 35
36 COMMUNICATION AGGREGATION for i in 1..n do B[i] = compute(i); var localb:[1..n] int; for i in 1..n do localb[i] = compute(i); B = localb; for i in 1..n do consume(a[i]); var locala:[1..n] int = A; for i in 1..n do consume(locala[i]); Simple, cache aggregates Manual optimization reduces portability 36
37 PERFORMANCE San Diego Air and Space Museum
38 TEST CONFIGURATIONS Cray XC30 TM system with 50 nodes, Aries network GASNet Aries: GASNet with the aries conduit Cray ugni: native ugni support for Chapel Cray CS400 TM system with 200 nodes, FDR InfiniBand GASNet ibv: GASNet with the InfinBand Verbs conduit v1.9+ is revision 5ba6639 v1.11+ is revision 6c635a1 C O M P U T E S T O R E A N A L Y Z E 38
39 SYNTHETIC BENCHMARKS GASNet ibv v1.11+ GASNet Aries v1.11+ Cray ugni v x Speedup copy 0 rand-puts rand-gets 39
40 APPLICATION BENCHMARKS 5 4 GASNet ibv v1.11+ GASNet Aries v1.11+ Cray ugni v x Speedup lulesh minimd PTRANS SSCA2.4 40
41 PREFETCH EXAMPLE var A:[1..n] int; on Locales[1] { var sum:int; // Optional warm up for i in 1..k do prefetch(a[f(i)]); for i in 1..n { if i+k <= n then prefetch(a[f(i+k)]); sum += A[f(i)] } } C O M P U T E S T O R E A N A L Y Z E 41
42 PREFETCH EXAMPLE 10 8 GASNet ibv v1.11+ GASNet Aries v1.11+ Cray ugni v x Speedup Prefetch Distance 42
43 VS OPTIMIZATION 5 4 GASNet ibv v1.9+ GASNet ibv v1.11+ GASNet Aries v1.9+ GASNet Aries v x Speedup lulesh minimd PTRANS SSCA2.4 *for GASNet/Aries, lulesh improved 3.2x between v1.9+ and v
44 VS OPTIMIZATION * with GASNet Aries v1.11+ configuration C C+cache llvm llvm+cache 250 Time lulesh minimd PTRANS*10 44
45 Cache for Remote Data: providing communication overlap and aggregation since Chapel v 1.10! Library of Congress
46 Legal Disclaimer Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners. Copyright 2014 Cray Inc. C O M P U T E S T O R E A N A L Y Z E Copyright 2015 Cray Inc. 46
47
48 Backup Slides C O M P U T E S T O R E A N A L Y Z E 48
49 APPLICATION BENCHMARKS 5 4 GASNet ibv v1.11+ GASNet Aries v1.11+ Cray ugni v loc GASNet Aries v x Speedup lulesh minimd PTRANS SSCA2.4 49
50 ADDING IMPLIED FENCES acquire and release triggered by task or on statement spawn, join, start, and finish release on { acquire release } acquire sync { release begin { acquire. release } } acquire C O M P U T E S T O R E A N A L Y Z E 50
51 LOOKING INSIDE Library of Congress
52 CACHE ENTRY 1024 byte cache page node address readahead trigger min sequence number max put sequence number max prefetch sequence number 64 byte cache lines Valid Line Bits C O M P U T E S T O R E A N A L Y Z E Optional Dirty Bits 52
53 Pointer Tree top bits bottom bits 17 bits 10 bits 17 bits 10 bits 10 bits top[top bits] top half bottom half page offset... bottom[bottom bits] page entries C O M P U T E S T O R E A N A L Y Z E Inspired by Two Level Tree Structure for Fast Pointer Lookup by Hans J Boehm 53
54 CACHE DATA STRUCTURES New Pages Am LRU Ain Pointer Tree 2Q Queues Aout Dirty LRU Free Lists Operations Queue per task: last acquire sequence number C O M P U T E S T O R E A N A L Y Z E 54
55 WRITE BEHIND Write Recorded in Dirty Bits, Page added to Dirty Queue 55
56 56
57 Flushed on release or when there are too many dirty pages 57
58 READAHEAD ra skip,len = 0 GET with 2 earlier valid lines triggers synchronous readahead ra skip=1 pg len = 1 pg C O M P U T E S T O R E A N A L Y Z E 58
59 ra skip=1 pg len = 1 pg The next GET triggers asynchronous readahead ra skip,len=0 ra skip=1 pg len =2 pg C O M P U T E S T O R E A N A L Y Z E 59
60 ra skip,len=0 ra skip=1 pg len =2 pg GET here triggers more readahead ra skip,len=0 ra skip=2 pg len =4 pg C O M P U T E S T O R E A N A L Y Z E 60
61 INFINIBAND (IB) LATENCY 3600 * with small 10-node cluster, QDR IB 2700 Latency (ns) GASNET get IB Benchmark GASNET put Request Size (bytes) 61
62 INFINIBAND (IB) BANDWIDTH Bandwidth MB/s IB Benchmark GASNET put GASNET get * with small 10-node cluster, QDR IB Max BW: 5000 MB/s Request Size (bytes) 62
63 Legal Disclaimer Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners. Copyright 2014 Cray Inc. C O M P U T E S T O R E A N A L Y Z E Copyright 2015 Cray Inc. 63
64
Hands-On II: Ray Tracing (data parallelism) COMPUTE STORE ANALYZE
Hands-On II: Ray Tracing (data parallelism) Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include
More informationSonexion GridRAID Characteristics
Sonexion GridRAID Characteristics CUG 2014 Mark Swan, Cray Inc. 1 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking
More informationThe Use and I: Transitivity of Module Uses and its Impact
The Use and I: Transitivity of Module Uses and its Impact Lydia Duncan, Cray Inc. CHIUW 2016 May 27 th, 2016 Safe Harbor Statement This presentation may contain forward-looking statements that are based
More informationMemory Leaks Chapel Team, Cray Inc. Chapel version 1.14 October 6, 2016
Memory Leaks Chapel Team, Cray Inc. Chapel version 1.14 October 6, 2016 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward
More informationArray, Domain, & Domain Map Improvements Chapel Team, Cray Inc. Chapel version 1.17 April 5, 2018
Array, Domain, & Domain Map Improvements Chapel Team, Cray Inc. Chapel version 1.17 April 5, 2018 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current
More informationProductive Programming in Chapel: A Computation-Driven Introduction Chapel Team, Cray Inc. SC16, Salt Lake City, UT November 13, 2016
Productive Programming in Chapel: A Computation-Driven Introduction Chapel Team, Cray Inc. SC16, Salt Lake City, UT November 13, 2016 Safe Harbor Statement This presentation may contain forward-looking
More informationLocality/Affinity Features COMPUTE STORE ANALYZE
Locality/Affinity Features Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about
More informationCompiler / Tools Chapel Team, Cray Inc. Chapel version 1.17 April 5, 2018
Compiler / Tools Chapel Team, Cray Inc. Chapel version 1.17 April 5, 2018 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward
More informationQ & A, Project Status, and Wrap-up COMPUTE STORE ANALYZE
Q & A, Project Status, and Wrap-up Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements
More informationOpenFOAM Scaling on Cray Supercomputers Dr. Stephen Sachs GOFUN 2017
OpenFOAM Scaling on Cray Supercomputers Dr. Stephen Sachs GOFUN 2017 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking
More informationGrab-Bag Topics / Demo COMPUTE STORE ANALYZE
Grab-Bag Topics / Demo Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about
More informationData-Centric Locality in Chapel
Data-Centric Locality in Chapel Ben Harshbarger Cray Inc. CHIUW 2015 1 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward
More informationAdding Lifetime Checking to Chapel Michael Ferguson, Cray Inc. CHIUW 2018 May 25, 2018
Adding Lifetime Checking to Chapel Michael Ferguson, Cray Inc. CHIUW 2018 May 25, 2018 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations.
More informationCompiler Improvements Chapel Team, Cray Inc. Chapel version 1.13 April 7, 2016
Compiler Improvements Chapel Team, Cray Inc. Chapel version 1.13 April 7, 2016 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations.
More informationXC System Management Usability BOF Joel Landsteiner & Harold Longley, Cray Inc. Copyright 2017 Cray Inc.
XC System Management Usability BOF Joel Landsteiner & Harold Longley, Cray Inc. 1 BOF Survey https://www.surveymonkey.com/r/kmg657s Aggregate Ideas at scale! Take a moment to fill out quick feedback form
More informationAn Exploration into Object Storage for Exascale Supercomputers. Raghu Chandrasekar
An Exploration into Object Storage for Exascale Supercomputers Raghu Chandrasekar Agenda Introduction Trends and Challenges Design and Implementation of SAROJA Preliminary evaluations Summary and Conclusion
More informationLustre Lockahead: Early Experience and Performance using Optimized Locking. Michael Moore
Lustre Lockahead: Early Experience and Performance using Optimized Locking Michael Moore Agenda Purpose Investigate performance of a new Lustre and MPI-IO feature called Lustre Lockahead (LLA) Discuss
More informationTransferring User Defined Types in
Transferring User Defined Types in OpenACC James Beyer, Ph.D () 1 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking
More informationCray XC System Node Diagnosability. Jeffrey J. Schutkoske Platform Services Group (PSG)
Cray XC System Node Diagnosability Jeffrey J. Schutkoske Platform Services Group (PSG) jjs@cray.com Safe Harbor Statement This presentation may contain forward-looking statements that are based on our
More informationReveal Heidi Poxon Sr. Principal Engineer Cray Programming Environment
Reveal Heidi Poxon Sr. Principal Engineer Cray Programming Environment Legal Disclaimer Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to
More informationData Parallelism COMPUTE STORE ANALYZE
Data Parallelism Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about our financial
More informationMPI for Cray XE/XK Systems & Recent Enhancements
MPI for Cray XE/XK Systems & Recent Enhancements Heidi Poxon Technical Lead Programming Environment Cray Inc. Legal Disclaimer Information in this document is provided in connection with Cray Inc. products.
More informationProject Caribou; Streaming metrics for Sonexion Craig Flaskerud
Project Caribou; Streaming metrics for Sonexion Craig Flaskerud Legal Disclaimer Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual
More informationReveal. Dr. Stephen Sachs
Reveal Dr. Stephen Sachs Agenda Reveal Overview Loop work estimates Creating program library with CCE Using Reveal to add OpenMP 2 Cray Compiler Optimization Feedback OpenMP Assistance MCDRAM Allocation
More informationHybrid Warm Water Direct Cooling Solution Implementation in CS300-LC
Hybrid Warm Water Direct Cooling Solution Implementation in CS300-LC Roger Smith Mississippi State University Giridhar Chukkapalli Cray, Inc. C O M P U T E S T O R E A N A L Y Z E 1 Safe Harbor Statement
More informationPorting the parallel Nek5000 application to GPU accelerators with OpenMP4.5. Alistair Hart (Cray UK Ltd.)
Porting the parallel Nek5000 application to GPU accelerators with OpenMP4.5 Alistair Hart (Cray UK Ltd.) Safe Harbor Statement This presentation may contain forward-looking statements that are based on
More informationHow-to write a xtpmd_plugin for your Cray XC system Steven J. Martin
How-to write a xtpmd_plugin for your Cray XC system Steven J. Martin (stevem@cray.com) Cray XC Telemetry Plugin Introduction Enabling sites to get telemetry data off the Cray Plugin interface enables site
More informationRedfish APIs on Next Generation Cray Hardware CUG 2018 Steven J. Martin, Cray Inc.
Redfish APIs on Next Generation Cray Hardware Steven J. Martin, Cray Inc. Modernizing Cray Systems Management Use of Redfish APIs on Next Generation Cray Hardware Steven Martin, David Rush, Kevin Hughes,
More informationVectorization of Chapel Code Elliot Ronaghan, Cray Inc. June 13, 2015
Vectorization of Chapel Code Elliot Ronaghan, Cray Inc. CHIUW @PLDI June 13, 2015 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations.
More informationIntel Xeon PhiTM Knights Landing (KNL) System Software Clark Snyder, Peter Hill, John Sygulla
Intel Xeon PhiTM Knights Landing (KNL) System Software Clark Snyder, Peter Hill, John Sygulla Motivation The Intel Xeon Phi TM Knights Landing (KNL) has 20 different configurations 5 NUMA modes X 4 memory
More informationCray Performance Tools Enhancements for Next Generation Systems Heidi Poxon
Cray Performance Tools Enhancements for Next Generation Systems Heidi Poxon Agenda Cray Performance Tools Overview Recent Enhancements Support for Cray systems with KNL 2 Cray Performance Analysis Tools
More informationNew Tools and Tool Improvements Chapel Team, Cray Inc. Chapel version 1.16 October 5, 2017
New Tools and Tool Improvements Chapel Team, Cray Inc. Chapel version 1.16 October 5, 2017 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations.
More informationChapel s New Adventures in Data Locality Brad Chamberlain Chapel Team, Cray Inc. August 2, 2017
Chapel s New Adventures in Data Locality Brad Chamberlain Chapel Team, Cray Inc. August 2, 2017 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current
More informationProductive Programming in Chapel:
Productive Programming in Chapel: A Computation-Driven Introduction Base Language with n-body Michael Ferguson and Lydia Duncan Cray Inc, SC15 November 15th, 2015 C O M P U T E S T O R E A N A L Y Z E
More informationEnhancing scalability of the gyrokinetic code GS2 by using MPI Shared Memory for FFTs
Enhancing scalability of the gyrokinetic code GS2 by using MPI Shared Memory for FFTs Lucian Anton 1, Ferdinand van Wyk 2,4, Edmund Highcock 2, Colin Roach 3, Joseph T. Parker 5 1 Cray UK, 2 University
More informationLustre Networking at Cray. Chris Horn
Lustre Networking at Cray Chris Horn hornc@cray.com Agenda Lustre Networking at Cray LNet Basics Flat vs. Fine-Grained Routing Cost Effectiveness - Bandwidth Matching Connection Reliability Dealing with
More informationExperiences Running and Optimizing the Berkeley Data Analytics Stack on Cray Platforms
Experiences Running and Optimizing the Berkeley Data Analytics Stack on Cray Platforms Kristyn J. Maschhoff and Michael F. Ringenburg Cray Inc. CUG 2015 Copyright 2015 Cray Inc Legal Disclaimer Information
More informationChapel Hierarchical Locales
Chapel Hierarchical Locales Greg Titus, Chapel Team, Cray Inc. SC14 Emerging Technologies November 18 th, 2014 Safe Harbor Statement This presentation may contain forward-looking statements that are based
More informationChapel: Productive Parallel Programming from the Pacific Northwest
Chapel: Productive Parallel Programming from the Pacific Northwest Brad Chamberlain, Cray Inc. / UW CS&E Pacific Northwest Prog. Languages and Software Eng. Meeting March 15 th, 2016 Safe Harbor Statement
More informationPerformance Optimizations Chapel Team, Cray Inc. Chapel version 1.14 October 6, 2016
Performance Optimizations Chapel Team, Cray Inc. Chapel version 1.14 October 6, 2016 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations.
More informationXC Series Shifter User Guide (CLE 6.0.UP02) S-2571
XC Series Shifter User Guide (CLE 6.0.UP02) S-2571 Contents Contents 1 About the XC Series Shifter User Guide...3 2 Shifter System Introduction...6 3 Download and Convert the Docker Image...7 4 Submit
More informationLLVM-based Communication Optimizations for PGAS Programs
LLVM-based Communication Optimizations for PGAS Programs nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15 Akihiro Hayashi (Rice University) Jisheng Zhao (Rice University) Michael Ferguson
More informationFirst experiences porting a parallel application to a hybrid supercomputer with OpenMP 4.0 device constructs. Alistair Hart (Cray UK Ltd.
First experiences porting a parallel application to a hybrid supercomputer with OpenMP 4.0 device constructs Alistair Hart (Cray UK Ltd.) Safe Harbor Statement This presentation may contain forward-looking
More informationPerformance Optimizations Generated Code Improvements Chapel Team, Cray Inc. Chapel version 1.11 April 2, 2015
Performance Optimizations Generated Code Improvements Chapel Team, Cray Inc. Chapel version 1.11 April 2, 2015 Safe Harbor Statement This presentation may contain forward-looking statements that are based
More informationData Parallelism, By Example COMPUTE STORE ANALYZE
Data Parallelism, By Example Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements
More informationChapel Background and Motivation COMPUTE STORE ANALYZE
Chapel Background and Motivation Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements
More informationI/O Buffering and Streaming
I/O Buffering and Streaming I/O Buffering and Caching I/O accesses are reads or writes (e.g., to files) Application access is arbitary (offset, len) Convert accesses to read/write of fixed-size blocks
More informationEvaluating Shifter for HPC Applications Don Bahls Cray Inc.
Evaluating Shifter for HPC Applications Don Bahls Cray Inc. Agenda Motivation Shifter User Defined Images (UDIs) provide a mechanism to access a wider array of software in the HPC environment without enduring
More informationADVANCED PGAS CENTRIC USAGE OF THE OPENFABRICS INTERFACE
13 th ANNUAL WORKSHOP 2017 ADVANCED PGAS CENTRIC USAGE OF THE OPENFABRICS INTERFACE Erik Paulson, Kayla Seager, Sayantan Sur, James Dinan, Dave Ozog: Intel Corporation Collaborators: Howard Pritchard:
More informationBenchmarks, Performance Optimizations, and Memory Leaks Chapel Team, Cray Inc. Chapel version 1.13 April 7, 2016
Benchmarks, Performance Optimizations, and Memory Leaks Chapel Team, Cray Inc. Chapel version 1.13 April 7, 2016 Safe Harbor Statement This presentation may contain forward-looking statements that are
More informationOncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries
Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big
More informationStandard Library and Interoperability Improvements Chapel Team, Cray Inc. Chapel version 1.11 April 2, 2015
Standard Library and Interoperability Improvements Chapel Team, Cray Inc. Chapel version 1.11 April 2, 2015 Safe Harbor Statement This presentation may contain forward-looking statements that are based
More informationPurity: An Integrated, Fine-Grain, Data- Centric, Communication Profiler for the Chapel Language
Purity: An Integrated, Fine-Grain, Data- Centric, Communication Profiler for the Chapel Language Richard B. Johnson and Jeffrey K. Hollingsworth Department of Computer Science, University of Maryland,
More informationScalable Software Transactional Memory for Chapel High-Productivity Language
Scalable Software Transactional Memory for Chapel High-Productivity Language Srinivas Sridharan and Peter Kogge, U. Notre Dame Brad Chamberlain, Cray Inc Jeffrey Vetter, Future Technologies Group, ORNL
More informationAffine Loop Optimization using Modulo Unrolling in CHAPEL
Affine Loop Optimization using Modulo Unrolling in CHAPEL Aroon Sharma, Joshua Koehler, Rajeev Barua LTS POC: Michael Ferguson 2 Overall Goal Improve the runtime of certain types of parallel computers
More informationEnabling Parallel Computing in Chapel with Clang and LLVM
Enabling Parallel Computing in Chapel with Clang and LLVM Michael Ferguson Cray Inc. October 19, 2017 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our
More informationIntel Many Integrated Core (MIC) Architecture
Intel Many Integrated Core (MIC) Architecture Karl Solchenbach Director European Exascale Labs BMW2011, November 3, 2011 1 Notice and Disclaimers Notice: This document contains information on products
More informationLanguage Improvements Chapel Team, Cray Inc. Chapel version 1.15 April 6, 2017
Language Improvements Chapel Team, Cray Inc. Chapel version 1.15 April 6, 2017 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations.
More informationGrappa: A latency tolerant runtime for large-scale irregular applications
Grappa: A latency tolerant runtime for large-scale irregular applications Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin Computer Science & Engineering, University
More informationPorting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT
Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT Paul Hargrove Dan Bonachea, Michael Welcome, Katherine Yelick UPC Review. July 22, 2009. What is GASNet?
More informationToward Understanding Life-Long Performance of a Sonexion File System
Toward Understanding Life-Long Performance of a Sonexion File System CUG 2015 Mark Swan, Doug Petesch, Cray Inc. dpetesch@cray.com Safe Harbor Statement This presentation may contain forward-looking statements
More informationIntel Xeon Phi Coprocessor. Technical Resources. Intel Xeon Phi Coprocessor Workshop Pawsey Centre & CSIRO, Aug Intel Xeon Phi Coprocessor
Technical Resources Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPETY RIGHTS
More informationAtomic Transactions in Cilk Project Presentation 12/1/03
Atomic Transactions in Cilk 6.895 Project Presentation 12/1/03 Data Races and Nondeterminism int x = 0; 1: read x 1: write x time cilk void increment() { x = x + 1; cilk int main() { spawn increment();
More informationThe Transition to PCI Express* for Client SSDs
The Transition to PCI Express* for Client SSDs Amber Huffman Senior Principal Engineer Intel Santa Clara, CA 1 *Other names and brands may be claimed as the property of others. Legal Notices and Disclaimers
More informationAdvanced cache optimizations. ECE 154B Dmitri Strukov
Advanced cache optimizations ECE 154B Dmitri Strukov Advanced Cache Optimization 1) Way prediction 2) Victim cache 3) Critical word first and early restart 4) Merging write buffer 5) Nonblocking cache
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationThe Journey of an I/O request through the Block Layer
The Journey of an I/O request through the Block Layer Suresh Jayaraman Linux Kernel Engineer SUSE Labs sjayaraman@suse.com Introduction Motivation Scope Common cases More emphasis on the Block layer Why
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationImproving I/O Bandwidth With Cray DVS Client-Side Caching
Improving I/O Bandwidth With Cray DVS Client-Side Caching Bryce Hicks Cray Inc. Bloomington, MN USA bryceh@cray.com Abstract Cray s Data Virtualization Service, DVS, is an I/O forwarder providing access
More informationMICHAL MROZEK ZBIGNIEW ZDANOWICZ
MICHAL MROZEK ZBIGNIEW ZDANOWICZ Legal Notices and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY
More informationLatency-Tolerant Software Distributed Shared Memory
Latency-Tolerant Software Distributed Shared Memory Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin University of Washington USENIX ATC 2015 July 9, 2015 25
More informationCREATING A COMMON SOFTWARE VERBS IMPLEMENTATION
12th ANNUAL WORKSHOP 2016 CREATING A COMMON SOFTWARE VERBS IMPLEMENTATION Dennis Dalessandro, Network Software Engineer Intel April 6th, 2016 AGENDA Overview What is rdmavt and why bother? Technical details
More informationOPENFABRICS INTERFACES: PAST, PRESENT, AND FUTURE
OPENFABRICS INTERFACES: PAST, PRESENT, AND FUTURE Sean Hefty Openfabrics Interfaces Working Group Co-Chair Intel November 2016 OFIWG: develop interfaces aligned with application needs Open Source Expand
More informationIntroduction to Cray Data Virtualization Service S
TM Introduction to Cray Data Virtualization Service S 0005 4002 2008-2011 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or
More informationThe Intel Processor Diagnostic Tool Release Notes
The Intel Processor Diagnostic Tool Release Notes Page 1 of 7 LEGAL INFORMATION INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR
More informationCell Programming Tips & Techniques
Cell Programming Tips & Techniques Course Code: L3T2H1-58 Cell Ecosystem Solutions Enablement 1 Class Objectives Things you will learn Key programming techniques to exploit cell hardware organization and
More informationLanguage and Compiler Improvements Chapel Team, Cray Inc. Chapel version 1.14 October 6, 2016
Language and Compiler Improvements Chapel Team, Cray Inc. Chapel version 1.14 October 6, 2016 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current
More informationUnified Runtime for PGAS and MPI over OFED
Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationRelaxed Memory Consistency
Relaxed Memory Consistency Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationUCX: An Open Source Framework for HPC Network APIs and Beyond
UCX: An Open Source Framework for HPC Network APIs and Beyond Presented by: Pavel Shamis / Pasha ORNL is managed by UT-Battelle for the US Department of Energy Co-Design Collaboration The Next Generation
More informationLecture notes for CS Chapter 2, part 1 10/23/18
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationCMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman)
CMSC 714 Lecture 4 OpenMP and UPC Chau-Wen Tseng (from A. Sussman) Programming Model Overview Message passing (MPI, PVM) Separate address spaces Explicit messages to access shared data Send / receive (MPI
More informationCS510 Advanced Topics in Concurrency. Jonathan Walpole
CS510 Advanced Topics in Concurrency Jonathan Walpole Threads Cannot Be Implemented as a Library Reasoning About Programs What are the valid outcomes for this program? Is it valid for both r1 and r2 to
More informationParallel Programming Features in the Fortran Standard. Steve Lionel 12/4/2012
Parallel Programming Features in the Fortran Standard Steve Lionel 12/4/2012 Agenda Overview of popular parallelism methodologies FORALL a look back DO CONCURRENT Coarrays Fortran 2015 Q+A 12/5/2012 2
More informationIntel Xeon Phi Coprocessor Performance Analysis
Intel Xeon Phi Coprocessor Performance Analysis Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO
More informationKrzysztof Laskowski, Intel Pavan K Lanka, Intel
Krzysztof Laskowski, Intel Pavan K Lanka, Intel Legal Notices and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR
More informationAnnouncements. ! Previous lecture. Caches. Inf3 Computer Architecture
Announcements! Previous lecture Caches Inf3 Computer Architecture - 2016-2017 1 Recap: Memory Hierarchy Issues! Block size: smallest unit that is managed at each level E.g., 64B for cache lines, 4KB for
More informationMulti-core processors are here, but how do you resolve data bottlenecks in native code?
Multi-core processors are here, but how do you resolve data bottlenecks in native code? hint: it s all about locality Michael Wall October, 2008 part I of II: System memory 2 PDC 2008 October 2008 Session
More informationModule 15: "Memory Consistency Models" Lecture 34: "Sequential Consistency and Relaxed Models" Memory Consistency Models. Memory consistency
Memory Consistency Models Memory consistency SC SC in MIPS R10000 Relaxed models Total store ordering PC and PSO TSO, PC, PSO Weak ordering (WO) [From Chapters 9 and 11 of Culler, Singh, Gupta] [Additional
More informationMobility: Innovation Unleashed!
Mobility: Innovation Unleashed! Mooly Eden Corporate Vice President General Manager, Mobile Platforms Group Intel Corporation Legal Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationCache Performance (H&P 5.3; 5.5; 5.6)
Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationPortable, MPI-Interoperable! Coarray Fortran
Portable, MPI-Interoperable! Coarray Fortran Chaoran Yang, 1 Wesley Bland, 2! John Mellor-Crummey, 1 Pavan Balaji 2 1 Department of Computer Science! Rice University! Houston, TX 2 Mathematics and Computer
More informationSpeculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding
More informationChapter 2: Memory Hierarchy Design Part 2
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationHigh Performance Computing Systems
High Performance Computing Systems Shared Memory Doug Shook Shared Memory Bottlenecks Trips to memory Cache coherence 2 Why Multicore? Shared memory systems used to be purely the domain of HPC... What
More informationScaling with PGAS Languages
Scaling with PGAS Languages Panel Presentation at OFA Developers Workshop (2013) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More information