A Comparison of the PGAS Languages XcalableMP (XMP) and Unified Parallel C (UPC)


1 Center for Computational Sciences, University of Tsukuba
2 Graduate School of Systems and Information Engineering, University of Tsukuba
3 RIKEN Advanced Institute for Computational Science

Abstract: XcalableMP (XMP) and Unified Parallel C (UPC) are PGAS extensions of the C language. This paper compares XMP and UPC in terms of language features and performance. The UPC implementation evaluated is Berkeley UPC, which runs on the GASNet communication layer.

1. Introduction

Parallel programs for distributed-memory systems are most commonly written with MPI, but MPI programming places a heavy burden on the programmer. The Partitioned Global Address Space (PGAS) model reduces this burden by presenting a global view of data that is physically distributed, while still exposing data locality. Like MPI programs, PGAS programs follow the Single Program Multiple Data (SPMD) execution model. XcalableMP (XMP) 1),2) and Unified Parallel C (UPC) 3) are PGAS extensions of C; this paper compares the two. Section 2 introduces the PGAS model, XMP, and UPC. Section 3 compares their language features. Section 4 evaluates their performance with read/write microbenchmarks on a global array, a Laplace solver, and the Conjugate Gradient (CG) kernel of the NAS Parallel Benchmarks (NPB) 4). Section 5 concludes.

2. Partitioned Global Address Space

2.1 The PGAS Model

In the PGAS model, each process owns one partition of a single, global address space. A process can access data in remote partitions, but accesses to its own partition are the fastest, so the model exposes locality to the programmer.

An instance of execution is called a node in XMP and a thread in UPC; in the implementations evaluated here, both correspond to MPI processes.

2.2 XcalableMP

XcalableMP (XMP) is a directive-based PGAS language designed by the XcalableMP Specification Working Group of the e-Science project 5). XMP inherits many ideas from High Performance Fortran (HPF) 7); however, where HPF leaves communication to compiler analysis, XMP makes all communication explicit through directives. XMP extends both C and Fortran; this paper uses the C version.

Data distribution in XMP is described with a template, a virtual index space onto which both nodes and arrays are mapped (Fig. 1):

    #pragma xmp template t(0:N-1)
    #pragma xmp nodes p(4)
    #pragma xmp distribute t(block) onto p
    #pragma xmp align a[i] with t(i)

The template t spans indexes 0 to N-1 and is distributed block-wise onto four nodes, so that node 1 owns indexes 0 to N/4-1, node 2 owns N/4 to N/2-1, and so on; the array a[] is then aligned with t, giving it the same distribution.

Fig. 1 Conceptual diagram of template (XMP)

The gmove directive expresses communication as a global array assignment. In Fig. 2, elements N/2 to N-1 of a2[] are copied into elements 0 to N/2-1 of a1[]; the compiler generates the necessary communication. The array-section notation follows Fortran:

    #pragma xmp gmove
    a1[0:N/2-1] = a2[N/2:N-1];

Fig. 2 Example of gmove directive (XMP)

The loop directive parallelizes a for loop: each node executes only the iterations whose template indexes it owns (Fig. 3):

    #pragma xmp loop on t(i)
    for (i = 0; i < N; i++) {
        a[i] = func(i);
    }

Fig. 3 Example of loop directive (XMP)

2.3 Unified Parallel C

Unified Parallel C (UPC) is a PGAS extension of C specified by the UPC Consortium 6). Shared data are declared with the shared qualifier (Fig. 4). In the declaration below, the elements of a1[] and a2[] are distributed cyclically over the threads, because the default block size is 1; a block distribution is obtained by giving an explicit block size, as in shared [10] double a[100];. The library function upc_memcpy() copies 100*sizeof(double) bytes of shared data from a2 to a1; upc_memget() and upc_memput() transfer data between shared and private memory.

    shared double a1[100], a2[100];
    upc_memcpy(a1, a2, 100*sizeof(double));

Fig. 4 How to declare and transfer shared data (UPC)

A for loop is parallelized with the upc_forall construct (Fig. 5). Its fourth argument is the affinity expression: each iteration is executed by the thread that has affinity to the given shared address.

    upc_forall (i = 0; i < N; i++; &a[i]) {
        a[i] = func(i);
    }

Fig. 5 Example of upc_forall (UPC)
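To show how the pieces of Sections 2.2 and 2.3 fit together in practice, the following is a minimal, self-contained XMP/C sketch (our illustration, not code from the paper): it distributes a[] over four nodes, fills it in parallel with the loop directive, and sums it with a reduction clause. N and func() are placeholders.

    #include <stdio.h>
    #define N 1000

    int a[N];
    #pragma xmp nodes p(4)
    #pragma xmp template t(0:N-1)
    #pragma xmp distribute t(block) onto p
    #pragma xmp align a[i] with t(i)

    /* Placeholder computation. */
    static int func(int i) { return 2 * i; }

    int main(void)
    {
        int i, sum = 0;

        /* Each node executes only the iterations whose template
           indexes it owns (cf. Fig. 3). */
    #pragma xmp loop on t(i)
        for (i = 0; i < N; i++)
            a[i] = func(i);

        /* The reduction clause combines the node-local partial
           sums into a global sum available on every node. */
    #pragma xmp loop on t(i) reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %d\n", sum);
        return 0;
    }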

3. Comparison of XcalableMP and Unified Parallel C

3.1 Data Distribution

Both XMP and UPC support cyclic, block, and block-cyclic distributions; XMP additionally supports gblock, a generalized block distribution with per-node block sizes. XMP can also distribute an array in more than one dimension. Fig. 6 distributes a two-dimensional array block-wise in one dimension and cyclically in the other over a 2 x 2 node grid; Table 1 lists the indexes each process then owns.

    #pragma xmp nodes p(2, 2)
    #pragma xmp template t(0:9, 0:9)
    #pragma xmp distribute t(block, cyclic) onto p
    int a[10][10];
    #pragma xmp align a[i][j] with t(j, i)

Fig. 6 Example of distribution of two-dimensional array (XMP)

Table 1 Indexes of each process in Fig. 6

    Process   1st indexes of a[][]   2nd indexes of a[][]
    p(1,1)    0, 1, 2, 3, 4          0, 2, 4, 6, 8
    p(2,1)    5, 6, 7, 8, 9          0, 2, 4, 6, 8
    p(1,2)    0, 1, 2, 3, 4          1, 3, 5, 7, 9
    p(2,2)    5, 6, 7, 8, 9          1, 3, 5, 7, 9

UPC, by contrast, distributes a shared array in only one dimension, over its row-major linearization; UPC also provides upc_alloc() for allocating shared memory dynamically.

3.2 Communication and Synchronization

In UPC, a remote element of a shared array can be read or written by an ordinary expression, so communication is implicit in ordinary code. Each shared access has one of two consistency modes, strict or relaxed; relaxed accesses may be reordered and optimized by the compiler and runtime. In XMP, communication occurs only where a directive such as gmove (Fig. 2) specifies it, so all communication is explicit in the program text.

3.3 Co-arrays

XMP also provides a co-array feature modeled on Co-Array Fortran (CAF) 8). In Fig. 7, elements 2 to 4 of x[] on node 3 are copied into elements 1 to 3 of the local array y[]. Where CAF declares co-arrays with a codimension, XMP uses the coarray directive; UPC has no equivalent feature.

    #pragma xmp coarray
    y[1:3] = x[2:4]:[3];

Fig. 7 Example of co-array function (XMP)
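To make the one-dimensional restriction of Section 3.1 concrete, here is a small UPC sketch (our illustration, not code from the paper). It assumes Berkeley UPC compiled with a fixed thread count of 4 (e.g. upcc -T 4) and prints which thread owns each element of a 10 x 10 shared array declared with block size 25; the owners form stripes of the row-major linearization, not the 2-D tiles of Fig. 6.

    #include <upc.h>
    #include <stdio.h>

    #define N 10

    /* The array is linearized row-major and split into blocks of 25
       consecutive elements, one block per thread: thread 0 gets rows
       0-1 plus half of row 2, and so on.  The (block, cyclic) mapping
       of Fig. 6 cannot be expressed this way. */
    shared [25] double a[N][N];

    int main(void)
    {
        int i, j;

        if (MYTHREAD == 0)
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    /* upc_threadof() reports which thread has
                       affinity to a shared object. */
                    printf("a[%d][%d] -> thread %d\n",
                           i, j, (int)upc_threadof(&a[i][j]));
        return 0;
    }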

4. Performance Evaluation

4.1 Experimental Setup

For XMP we use the Omni XMP Compiler 9) version 0.5.3 (hereafter TXMP), and for UPC we use Berkeley UPC 10) (hereafter BUPC), developed at Lawrence Berkeley National Laboratory and UC Berkeley. TXMP translates an XMP program into C with MPI calls, while BUPC runs on the GASNet communication layer 11),12). All experiments were run on the T2K Tsukuba System; Table 2 gives the specification of each node. BUPC uses GASNet's InfiniBand conduit (ibv). BUPC programs were compiled with -O3 --param max-inline-insns-single=35000 --param inline-unit-growth=10000 --param large-function-growth=

Table 2 Specifications of each node on experimental environment

    CPU        AMD Opteron Quad-Core 8000 series, 2.3 GHz (4 sockets)
    Memory     DDR2 667 MHz, 32 GB
    Network    InfiniBand DDR (4 rails), 8 GB/s
    OS         Linux
    Compiler   gcc
    MPI        mvapich2-1.7a

4.2 Read/Write Performance on a Global Array

This benchmark reads and writes a global array of 2^20 doubles under Block and Cyclic distributions. The TXMP version uses the loop directive (Fig. 3) and the BUPC version uses upc_forall (Fig. 5). Fig. 8 shows the measured access speed, where "Native" denotes a sequential version compiled with gcc.

Fig. 8 Access speed in global region

Under the Block distribution, TXMP comes close to Native speed: like an HPF compiler 13), the XMP translator converts the distributed loop into accesses to an ordinary local array. BUPC is slower than Native under the Block distribution and slower still under Cyclic, because every access to a shared array goes through the runtime's shared-pointer arithmetic 14); performance drops for both implementations under the Cyclic distribution.
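For reference, the BUPC side of such a read/write benchmark can be sketched as follows (our reconstruction under stated assumptions, not the paper's code): a shared array of 2^20 doubles is written with upc_forall and the loop is timed. The default block size of 1 gives the Cyclic case; shared [N/THREADS] would give the Block case. A fixed thread count at compile time is assumed so that the declaration is legal.

    #include <upc.h>
    #include <upc_relaxed.h>   /* relaxed consistency, cf. Section 3.2 */
    #include <stdio.h>
    #include <sys/time.h>

    #define N (1 << 20)

    /* Block size 1 (cyclic layout): element i lives on thread i%THREADS. */
    shared double a[N];

    static double walltime(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void)
    {
        int i;
        double t0, t1;

        upc_barrier;
        t0 = walltime();
        /* Each thread writes only the elements it has affinity to,
           as in Fig. 5. */
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] = (double)i;
        upc_barrier;
        t1 = walltime();

        if (MYTHREAD == 0)
            printf("write: %.3f s\n", t1 - t0);
        return 0;
    }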

4.3 Laplace Solver

We next compare the two languages on a Laplace solver. The TXMP source is shown in Fig. 9 and the BUPC source in Fig. 12; both distribute the arrays block-wise. In the XMP version, the halo region is declared with the shadow directive and exchanged with the reflect directive. In the BUPC version, THREADS and MYTHREAD are built-in values giving the total number of threads and the index of the executing thread, analogous to the size and rank of MPI. The problem size is SIZE = 512 with TIMES = 100 iterations.

Fig. 9 Source of Laplace solver (TXMP)

Fig. 12 Source of Laplace solver (BUPC)

Fig. 10 shows the results for TXMP and BUPC; the gap between them follows the Block-distribution behavior observed in Section 4.2.

Fig. 10 Result of Laplace solver
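Since the TXMP source of Fig. 9 is not reproduced here, the following sketch (our reconstruction, assuming XMP 1.x C syntax; initialization, boundary conditions, and convergence testing omitted) shows the shape of a Laplace solver built on the shadow and reflect directives, with the SIZE and TIMES values quoted above.

    #define SIZE  512
    #define TIMES 100

    double u[SIZE][SIZE], uu[SIZE][SIZE];
    #pragma xmp nodes p(*)
    #pragma xmp template t(0:SIZE-1)
    #pragma xmp distribute t(block) onto p
    #pragma xmp align u[i][*] with t(i)
    #pragma xmp align uu[i][*] with t(i)
    /* One row of shadow (halo) cells on each side of the
       distributed dimension. */
    #pragma xmp shadow uu[1:1][0:0]

    int main(void)
    {
        int i, j, k;

        /* Initialization of u[] omitted. */
        for (k = 0; k < TIMES; k++) {
    #pragma xmp loop on t(i)
            for (i = 0; i < SIZE; i++)
                for (j = 0; j < SIZE; j++)
                    uu[i][j] = u[i][j];

            /* Exchange the halo rows of uu[] with neighbour nodes. */
    #pragma xmp reflect (uu)

    #pragma xmp loop on t(i)
            for (i = 1; i < SIZE - 1; i++)
                for (j = 1; j < SIZE - 1; j++)
                    u[i][j] = (uu[i-1][j] + uu[i+1][j]
                             + uu[i][j-1] + uu[i][j+1]) / 4.0;
        }
        return 0;
    }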

4.4 NAS Parallel Benchmarks: Conjugate Gradient

Finally, we compare the languages on the Conjugate Gradient (CG) kernel. The UPC version is taken from the UPC implementation of the NAS Parallel Benchmarks (UPC-NPB) 15), and the XMP version was written based on it. CG distributes the work over a two-dimensional grid of PROC_COLS x PROC_ROWS processes; its main kernel is a doubly nested for loop that accumulates the partial results of the sparse matrix-vector product into w[] (w1[] in BUPC-1), followed by a second loop nest that reduces them into q[]. We compare three UPC variants: (1) a version that accesses the shared array w[] directly, called BUPC-1; (2) a version that applies privatization to (1), called BUPC-2; and (3) a further optimization of (2), called BUPC-3. Privatization (Fig. 11) casts the part of a shared array that has affinity to the executing thread into an ordinary private pointer: the shared array w[] of SIZE elements is accessed through a private pointer w_ptr covering the local SIZE/THREADS elements, so that local accesses bypass the shared-pointer overhead 16). The TXMP source is shown in Fig. 13 and the BUPC source in Fig. 15.

Fig. 11 Sample code of privatization (UPC)

Fig. 13 Source of conjugate gradient (TXMP)

Fig. 15 Source of conjugate gradient (BUPC)

The problem size is CLASS C. Fig. 14 compares TXMP, BUPC-1, BUPC-2, BUPC-3, and the reference MPI implementation (MPI-CG) on 2, 8, 32, and 128 cores, and Tables 3 and 4 break the measurements down into CPU time and communication time per implementation.

Fig. 14 Result of conjugate gradient

Table 3 CPU time of each implementation

Table 4 Comm. time of each implementation
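Fig. 11 is likewise not reproduced here, so the following sketch reconstructs the privatization idiom described above (our illustration; the function scale_local and the factor f are hypothetical): the part of the shared array w[] local to the executing thread is accessed through the private pointer w_ptr. A fixed thread count and SIZE divisible by THREADS are assumed.

    #include <upc.h>

    #define SIZE 1024

    /* Block distribution: each thread owns the SIZE/THREADS
       consecutive elements starting at MYTHREAD * (SIZE/THREADS). */
    shared [SIZE/THREADS] double w[SIZE];

    void scale_local(double f)
    {
        int i;
        /* Privatization: cast the thread-local part of the shared
           array to an ordinary C pointer.  Accesses through w_ptr
           compile to plain loads and stores instead of shared-pointer
           arithmetic and runtime calls, which is the source of the
           BUPC-2 speedup over BUPC-1. */
        double *w_ptr = (double *)&w[MYTHREAD * (SIZE / THREADS)];

        for (i = 0; i < SIZE / THREADS; i++)
            w_ptr[i] *= f;
    }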

Table 5 Language features of XcalableMP and Unified Parallel C

                       XcalableMP       Unified Parallel C
    Base languages     C, Fortran       C
    Work sharing       loop directive   upc_forall

5. Conclusion

This paper compared two PGAS extensions of the C language, XcalableMP (XMP) and Unified Parallel C (UPC); Table 5 summarizes their language features. We evaluated both with read/write microbenchmarks on a global array, a Laplace solver, and the CG kernel of the NAS Parallel Benchmarks, comparing XMP and UPC in both performance and how the programs are written. This work was carried out in part under the e-Science project of the XcalableMP Specification Working Group.

References

1) XcalableMP Specification Working Group: XcalableMP Specification DRAFT 0.7.
2) XcalableMP, Vol. 3, No. 3 (in Japanese).
3) UPC Consortium: UPC Language Specifications V1.2, Technical Report, Lawrence Berkeley National Laboratory.
4) Bailey, D. H., et al.: The NAS Parallel Benchmarks, Technical Report, NASA Ames Research Center.
5) The e-Science project, go.jp/bmenu/boshu/detail/ /002.htm
6) Unified Parallel C at George Washington University.
7) Koelbel, C. H., Loveman, D. B., Schreiber, R., Steele Jr., G. L. and Zosel, M. E.: The High Performance Fortran Handbook, MIT Press.
8) Numrich, R. and Reid, J.: Co-array Fortran for parallel programming, Technical Report RAL-TR, Rutherford Appleton Laboratory.
9) Omni XcalableMP Compiler.
10) Berkeley UPC, http://upc.lbl.gov/
11) Bell, C., Bonachea, D., Nishtala, R. and Yelick, K.: Optimizing bandwidth limited problems using one-sided communication and overlap, Proc. 20th International Parallel and Distributed Processing Symposium (IPDPS) (2006).
12) GASNet, http://gasnet.lbl.gov/
13) fhpf, J. JSSAC, Vol. 11, No. 3-4 (in Japanese).
14) Chen, W.-Y., Bonachea, D., Duell, J., Husbands, P., Iancu, C. and Yelick, K.: A Performance Analysis of the Berkeley UPC Compiler, Proc. 17th Annual International Conference on Supercomputing (ICS '03) (2003).
15) UPC implementation of the NAS Parallel Benchmarks (UPC-NPB).
16) El-Ghazawi, T. and Chauvin, S.: UPC benchmarking issues, Proc. International Conference on Parallel Processing (2001).

© 2011 Information Processing Society of Japan
