LLVM-based Communication Optimizations for PGAS Programs


1 LLVM-based Communication Optimizations for PGAS Programs. 2nd Workshop on the LLVM Compiler Infrastructure in HPC at SC15. Akihiro Hayashi (Rice University), Jisheng Zhao (Rice University), Michael Ferguson (Cray Inc.), Vivek Sarkar (Rice University)

2 A Big Picture: communication optimization for PGAS programs (X10, Habanero-UPC++, and others) on large-scale systems. Photo credits: Berkeley Lab., Argonne National Lab., RIKEN AICS.

3 PGAS Languages. High-productivity features: global-view programming, task parallelism, data distribution, synchronization. Examples: Habanero-UPC++, X10, CAF.

4 Communication is implicit in some PGAS programming models. With a global address space, the compiler and runtime are responsible for performing communication across nodes. Remote data access in Chapel:
1: var x = 1;         // on Node 0
2: on Locales[1] {    // on Node 1
3:   ... = x;         // DATA ACCESS
4: }

5 Communication is Implicit in some PGAS Programming Models (Cont'd). Remote data access:
1: var x = 1;         // on Node 0
2: on Locales[1] {    // on Node 1
3:   ... = x;         // DATA ACCESS
can be handled either by compiler optimization:
1: var x = 1;
2: on Locales[1] {
3:   ... = 1;
OR by runtime affinity handling:
if (x.locale == MYLOCALE) { *(x.addr) = 1; } else { gasnet_get( ); }
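
To make the runtime-affinity path above concrete, here is a minimal C sketch, assuming a hypothetical wide_ptr_t struct, my_locale() helper, and comm_put() stub in place of the Chapel runtime's real wide pointers and GASNet calls:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical wide pointer: a locale id plus a local address. */
typedef struct {
    int32_t locale;
    void   *addr;
} wide_ptr_t;

/* Hypothetical stand-ins for the runtime/communication layer. */
static int32_t my_locale(void) { return 0; }

static void comm_put(int32_t dst_locale, void *dst_addr,
                     const void *src, size_t len) {
    /* A real runtime would issue a one-sided PUT here; this stub just logs. */
    printf("PUT %zu bytes to locale %d\n", len, (int)dst_locale);
    (void)dst_addr; (void)src;
}

/* Lowering of an implicit write "x = 1" where x may be remote:
 * check affinity, then either store directly or issue a PUT. */
static void put_int64(wide_ptr_t x, int64_t value) {
    if (x.locale == my_locale()) {
        *(int64_t *)x.addr = value;                          /* definitely local */
    } else {
        comm_put(x.locale, x.addr, &value, sizeof value);    /* possibly remote */
    }
}

int main(void) {
    int64_t storage = 0;
    wide_ptr_t local_x  = { 0, &storage };
    wide_ptr_t remote_x = { 1, (void *)0x1000 };   /* pretend address on locale 1 */
    put_int64(local_x, 1);
    put_int64(remote_x, 1);
    printf("storage = %lld\n", (long long)storage);
    return 0;
}

The same shape applies to reads, with a GET instead of a PUT; the point is that every possibly-remote access pays either a branch or a communication call, which is what the compile-time optimizations on the following slides try to remove.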

6 Communication Optimization is Important. Figure: latency (ms, log scale from 1 to 10000) versus transferred bytes for an optimized (bulk transfer) and an unoptimized version of a synthetic Chapel program on an Intel Xeon CPU X5660 cluster with QDR InfiniBand; lower is better, and the annotated gaps are 59x and 1,500x.

7 PGAS Optimizations are language-specific: each language (Chapel, UPC, X10, Habanero-UPC++, ...) has its own compiler (Chapel Compiler, UPC Compiler, X10 Compiler, Habanero-C Compiler) with its own optimizations. Photo credits: Berkeley Lab., Argonne National Lab., RIKEN AICS.

8 Our goal: optimize these PGAS programs with a single, shared compiler infrastructure instead of per-language compilers. Photo credits: Berkeley Lab., Argonne National Lab., RIKEN AICS.

9 Why LLVM? It is a widely used, language-agnostic compiler infrastructure: frontends (Clang for C/C++; dragonegg for C/C++, Fortran, Ada, Objective-C; a Chapel frontend; a UPC++ frontend) emit the LLVM Intermediate Representation (LLVM IR), which is analyzed and optimized once and then lowered by backends (x86, PowerPC, ARM, PTX) to x86, PPC, ARM, and GPU binaries.

10 Summary & Contributions. Our observations: many PGAS languages share semantically similar constructs, yet PGAS optimizations are language-specific. Contributions: we built a compilation framework that can uniformly optimize PGAS programs (initial focus: communication), by enabling existing LLVM passes for communication optimizations and by adding PGAS-aware communication optimizations.

11 Overview of our framework. Per-language frontends (Chapel-LLVM, UPC++-LLVM, X10-LLVM, CAF-LLVM) need to be implemented when supporting a new language/runtime; the rest is generally language-agnostic. Each frontend emits LLVM IR, which passes through the LLVM-based communication optimization passes and then a lowering pass. The IR is (1) vanilla LLVM IR that (2) uses the address space feature to express communications.

12 How optimizations work. Chapel: x = 1; // x is possibly remote. UPC++: shared_var<int> x; x = 1;. Both are expressed in LLVM IR as: store i64 1, i64 addrspace(100)* %x, so a remote access is treated as if it were a local access. The IR then goes through (1) existing LLVM optimizations and (2) PGAS-aware, address-space-aware optimizations, followed by runtime-specific lowering to communication API calls.
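
The addrspace encoding can be reproduced with plain Clang, which shows why stock LLVM passes work unchanged on possibly-remote data. The sketch below uses Clang's __attribute__((address_space(...))) extension rather than the Chapel frontend's actual code generator; address space 100 is the number from the slide.

/* Model a possibly-remote 64-bit integer as a pointer into address space
 * 100. Compile with: clang -S -emit-llvm wide.c */
typedef __attribute__((address_space(100))) long remote_long;

/* To the optimizer this is an ordinary store; on an LP64 target it is
 * emitted roughly as: store i64 1, i64 addrspace(100)* %x
 * Only the later lowering pass turns such stores into communication calls. */
void write_one(remote_long *x) {
    *x = 1;
}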

13 LLVM-based Communication Optimizations for Chapel. 1. Enabling existing LLVM passes: loop-invariant code motion (LICM), scalar replacement, and others. 2. Aggregation: combine sequences of loads/stores on adjacent memory locations into a single memcpy. These are already implemented in the standard Chapel compiler.

14 An optimization example: LICM for Communication Optimizations. LICM by LLVM hoists the loop-invariant possibly-remote load out of the loop:
for i in ... {
  %x = load i64 addrspace(100)* %xptr
  A(i) = %x;
}
LICM = Loop Invariant Code Motion
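
A minimal C sketch of the effect, assuming a hypothetical comm_get_int64() stub in place of the GET that a possibly-remote load is eventually lowered to:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-sided GET (single-process stand-in for a gasnet_get-style call). */
static int64_t comm_get_int64(int32_t locale, const int64_t *addr) {
    (void)locale;
    return *addr;
}

/* Before LICM: the possibly-remote load of *xptr sits inside the loop,
 * so the lowered code issues one GET per iteration (n GETs in total). */
static void fill_unopt(int64_t *A, size_t n, int32_t xloc, const int64_t *xptr) {
    for (size_t i = 0; i < n; i++)
        A[i] = comm_get_int64(xloc, xptr);
}

/* After LICM: because the IR models the remote load as an ordinary
 * addrspace-qualified load, stock LICM can hoist it, leaving a single GET. */
static void fill_licm(int64_t *A, size_t n, int32_t xloc, const int64_t *xptr) {
    int64_t x = comm_get_int64(xloc, xptr);
    for (size_t i = 0; i < n; i++)
        A[i] = x;
}

int main(void) {
    int64_t x = 7, A[4];
    fill_unopt(A, 4, 0, &x);
    fill_licm(A, 4, 0, &x);
    return (int)(A[0] - 7);   /* 0 on success */
}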

15 An optimization example: Aggregation.
// p is possibly remote
sum = p.x + p.y;
This produces two possibly-remote loads, load i64 addrspace(100)* %pptr+0 and load i64 addrspace(100)* %pptr+4, i.e., one GET for x and one GET for y. Aggregation combines them into a single llvm.memcpy( ), i.e., one GET.
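
A hedged C sketch of the same transformation, assuming a hypothetical comm_get() bulk-transfer stub and 32-bit fields (so the x/y offsets are 0 and 4 as on the slide):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The possibly-remote record: x at offset 0, y at offset 4. */
typedef struct { int32_t x; int32_t y; } point_t;

/* Hypothetical bulk GET (single-process stand-in). */
static void comm_get(void *dst, int32_t locale, const void *src, size_t len) {
    (void)locale; memcpy(dst, src, len);
}

/* Without aggregation: the two adjacent loads are lowered separately,
 * costing two GETs (two network round trips). */
static int64_t sum_unopt(int32_t ploc, const point_t *p) {
    int32_t x, y;
    comm_get(&x, ploc, &p->x, sizeof x);
    comm_get(&y, ploc, &p->y, sizeof y);
    return (int64_t)x + y;
}

/* With aggregation: the adjacent loads are merged into one bulk transfer
 * (conceptually an llvm.memcpy on the addrspace(100) pointer): one GET. */
static int64_t sum_aggregated(int32_t ploc, const point_t *p) {
    point_t local;
    comm_get(&local, ploc, p, sizeof local);
    return (int64_t)local.x + local.y;
}

int main(void) {
    point_t p = { 3, 4 };
    printf("%lld %lld\n", (long long)sum_unopt(0, &p),
                          (long long)sum_aggregated(0, &p));
    return 0;
}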

16 LLVM-based Communication Optimizations for Chapel. 3. Locality Optimization: infer the locality of data and convert possibly-remote accesses to definitely-local accesses at compile time where possible. 4. Coalescing: remote array access vectorization. These are implemented, but not in the standard Chapel compiler.

17 An Optimization example: Locality Optimization.
1: proc habanero(ref x, ref y, ref z) {
2:   var p: int = 0;
3:   var A: [1..N] int;
4:   local { p = z; }
5:   z = A(0) + z;
6: }
(1) A is definitely local; (2) p and z are definitely local; (3) so the accesses on line 5 are definitely-local accesses, avoiding runtime affinity checking.
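
A rough C rendering of what the locality optimization removes, assuming the same hypothetical wide_ptr_t, my_locale(), and comm_put() stand-ins as in the sketch after slide 5 (not the Chapel runtime's real types or API):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { int32_t locale; void *addr; } wide_ptr_t;   /* hypothetical wide pointer */
static int32_t my_locale(void) { return 0; }
static void comm_put(int32_t loc, void *dst, const void *src, size_t len) {
    (void)loc; memcpy(dst, src, len);   /* single-process stand-in */
}

/* Possibly-remote store: generated code must check affinity on every access. */
static void store_possibly_remote(wide_ptr_t z, int64_t v) {
    if (z.locale == my_locale()) *(int64_t *)z.addr = v;
    else                         comm_put(z.locale, z.addr, &v, sizeof v);
}

/* Definitely-local store: once the compiler has inferred (e.g. from a
 * local { } block or a locally declared array) that z is on this locale,
 * the check disappears and the access becomes a plain store. */
static void store_definitely_local(wide_ptr_t z, int64_t v) {
    *(int64_t *)z.addr = v;
}

int main(void) {
    int64_t storage = 0;
    wide_ptr_t z = { 0, &storage };
    store_possibly_remote(z, 41);
    store_definitely_local(z, 42);
    printf("storage = %lld\n", (long long)storage);
    return 0;
}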

18 An Optimization example: Coalescing.
Before:
1: for i in 1..N {
2:   ... = A(i);
3: }
After (perform bulk transfer):
1: localA = A;
2: for i in 1..N {
3:   ... = localA(i);
4: }
The accesses are converted to definitely-local accesses.
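
A minimal C sketch of the same idea, assuming a hypothetical comm_get() bulk-transfer stub:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical bulk GET (single-process stand-in for a gasnet_get-style call). */
static void comm_get(void *dst, int32_t locale, const void *src, size_t len) {
    (void)locale; memcpy(dst, src, len);
}

/* Before coalescing: each A(i) is a possibly-remote element access, so the
 * loop issues n fine-grained GETs. */
static int64_t sum_unopt(int32_t aloc, const int64_t *A, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t ai;
        comm_get(&ai, aloc, &A[i], sizeof ai);
        s += ai;
    }
    return s;
}

/* After coalescing: the slice is fetched once into a local buffer
 * (localA = A on the slide), and the loop body is definitely local. */
static int64_t sum_coalesced(int32_t aloc, const int64_t *A, size_t n) {
    int64_t *localA = malloc(n * sizeof *localA);
    comm_get(localA, aloc, A, n * sizeof *localA);
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) s += localA[i];
    free(localA);
    return s;
}

int main(void) {
    int64_t A[5] = {1, 2, 3, 4, 5};
    printf("%lld %lld\n", (long long)sum_unopt(0, A, 5),
                          (long long)sum_coalesced(0, A, 5));
    return 0;
}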

19 Performance Evaluations: Benchmarks. Application and input size: Smith-Waterman, 185,600 x 192,000; Cholesky Decomposition, 10,000 x 10,000; NPB EP, CLASS = D; Sobel, 48,000 x 48,000; SSCA2 Kernel 4, SCALE = 16; Stream EP.

20 Performance Evaluations: Platforms. Cray XC30 (NERSC): per node, 2 x Intel Xeon (24 cores) and 64 GB of RAM; interconnect: Cray Aries with Dragonfly topology. Westmere cluster (Rice): per node, 2 x Intel Xeon (12 cores) and 48 GB of RAM; interconnect: quad data rate (QDR) InfiniBand.

21 Performance Evaluations: Details of Compiler & Runtime. Compiler: the Chapel compiler with its LLVM backend. Runtime: GASNet, using the aries conduit on the Cray XC and the ibv conduit on the Westmere cluster; Qthreads 1.10, with 4 workers per shepherd on the Cray XC and 6 workers per shepherd on the Westmere cluster.

22 Performance Evaluation BRIEF SUMMARY OF PERFORMANCE EVALUATIONS

23 Performance Improvement over LLVM-unopt: Results on the Cray XC30 (LLVM-unopt vs. LLVM-allopt). Figure: speedup over LLVM-unopt for SW, Cholesky, Sobel, StreamEP, EP, and SSCA2, broken down by optimization (Existing, Locality Opt, Aggregation, Coalescing); higher is better; per-benchmark speedups include 19.5x and 1.1x. Overall: 4.6x average performance improvement relative to LLVM-unopt on the same number of locales (averaged over 1, 2, 4, 8, 16, 32, and 64 locales).

24 Performance Improvement over LLVM-unopt: Results on the Westmere Cluster (LLVM-unopt vs. LLVM-allopt). Figure: speedup over LLVM-unopt for SW, Cholesky, Sobel, StreamEP, EP, and SSCA2, broken down by optimization (Existing, Locality Opt, Aggregation, Coalescing); per-benchmark speedups include 16.9x and 1.1x. Overall: 4.4x average performance improvement relative to LLVM-unopt on the same number of locales (averaged over 1, 2, 4, 8, 16, 32, and 64 locales).

25 Performance Evaluation DETAILED RESULTS & ANALYSIS OF CHOLESKY DECOMPOSITION

26 Cholesky Decomposition. Figure: tiled Cholesky decomposition with dependencies among tiles distributed across Node0, Node1, Node2, and Node3.

27 Metrics. 1. Performance and scalability: baseline (LLVM-unopt) vs. LLVM-based optimizations (LLVM-allopt). 2. The dynamic number of communication API calls. 3. Analysis of optimized code. 4. Performance comparison: conventional C backend vs. LLVM backend.

28 Performance Improvement by LLVM (Cholesky on the Cray XC30). Figure: speedup over LLVM-unopt on 1 locale for LLVM-unopt and LLVM-allopt at 1, 2, 4, 8, 16, and 32 locales. LLVM-based communication optimizations show scalability.

29 Communication API call elimination by LLVM (Cholesky on the Cray XC30). Figure: dynamic number of communication API calls (LOCAL_GET, REMOTE_GET, LOCAL_PUT, REMOTE_PUT), normalized to LLVM-unopt at 100%. LLVM-allopt removes a large fraction of the calls, with per-category improvements of up to 500x.

30 Analysis of optimized code.
LLVM-unopt:
for jb in zero..tilesize-1 {
  for kb in zero..tilesize-1 {    // 4 GETs
    for ib in zero..tilesize-1 {  // 9 GETs + 1 PUT
}}}
LLVM-allopt: (1) allocate a local buffer, (2) perform a bulk transfer, then:
for jb in zero..tilesize-1 {
  for kb in zero..tilesize-1 {    // 1 GET
    for ib in zero..tilesize-1 {  // 1 GET + 1 PUT
}}}

31 Performance comparison with the C backend. Figure: speedup over LLVM-unopt on 1 locale for the C backend, LLVM-unopt, and LLVM-allopt at 1, 2, 4, 8, 16, 32, and 64 locales; the C backend is faster.

32 Current limitation. For C code generation, a wide pointer is a 128-bit struct accessed as ptr.locale and ptr.addr. For LLVM code generation, it is a 64-bit packed pointer with the locale in the upper 16 bits and the address in the lower 48 bits, accessed as ptr >> 48 and ptr & 48BITS_MASK. This (1) needs more instructions and (2) loses opportunities for alias analysis. In LLVM, many optimizations assume that the pointer size is the same across all address spaces.
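
A short C sketch of the two representations; the 16-bit locale / 48-bit address split follows the slide, while the struct layout and helper names are illustrative only:

#include <stdint.h>
#include <stdio.h>

/* C backend: a 128-bit wide-pointer struct; locale and address are separate
 * fields, so ptr.locale / ptr.addr need no extra arithmetic. */
typedef struct {
    int64_t locale;
    void   *addr;
} wide_ptr_struct_t;

/* LLVM backend: a 64-bit packed pointer, locale in the top 16 bits and
 * address in the lower 48 bits. */
#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

static uint64_t pack_wide(uint16_t locale, uint64_t addr) {
    return ((uint64_t)locale << ADDR_BITS) | (addr & ADDR_MASK);
}
static uint16_t wide_locale(uint64_t p) { return (uint16_t)(p >> ADDR_BITS); }
static uint64_t wide_addr(uint64_t p)   { return p & ADDR_MASK; }

int main(void) {
    uint64_t p = pack_wide(3, 0x00007f00deadbeefULL);
    /* Every field access now costs extra shift/mask instructions, and the
     * packed integer hides the underlying address from alias analysis. */
    printf("locale = %u, addr = 0x%llx\n",
           (unsigned)wide_locale(p), (unsigned long long)wide_addr(p));
    return 0;
}

This extra unpacking is part of why the C backend is currently faster in the comparison on slide 31.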

33 Conclusions. LLVM-based communication optimizations for PGAS programs are a promising way to optimize PGAS programs in a language-agnostic manner. Preliminary evaluation with 6 Chapel applications: 4.6x average performance improvement on the Cray XC30 supercomputer and 4.4x on the Westmere cluster.

34 Future work. Extend LLVM IR to support parallel programs with PGAS and explicit task parallelism through a higher-level IR: parallel programs (Chapel, X10, CAF, HC, ...) first go through RI-PIR generation, analysis, and transformation in LLVM for runtime-independent optimizations (e.g., on task-parallel constructs), then through RS-PIR generation, analysis, and transformation in LLVM for runtime-specific optimizations (e.g., targeting the GASNet API), and finally to a binary.

35 Acknowledgements. Special thanks to Brad Chamberlain (Cray), Rafael Larrosa Jimenez (UMA), Rafael Asenjo Plaza (UMA), and the Habanero Group at Rice.

36 Backup slides

37 Compilation Flow. Chapel programs go through AST generation and optimizations, and then either (a) C code generation, producing C programs that are optimized by the backend compiler (e.g., gcc with optimizations enabled) into a binary, or (b) LLVM IR generation, LLVM optimizations, and a binary.
