Implementation and Evaluation of Coarray Fortran Translator Based on OMNI XcalableMP. October 29, 2015 Hidetoshi Iwashita, RIKEN AICS
2 Background: XMP Contains Coarray Features
XcalableMP (XMP) is a PGAS language, an extension of Fortran and C, with two programming models.
Global-view programming model:
- Abstraction of distribution: nodes & template directives
- Data distribution: distribute, align & shadow directives
- Work distribution: task, loop & array directives
- Communication/synchronization: reflect, gmove, reduction, bcast, barrier & wait_async directives
- Intrinsic procedures
Local-view programming model:
- Coarray features compatible with Coarray Fortran (CAF) 1.0
- Interoperability with the global view: coarray, image & local_alias directives
- Coarray/C extensions
LENS2015 WORKSHOP
3 Background: MPI, XMP and CAF Programming
An example of 2-dimensional stencil communication (halo width = 2) for the array a(m,n) held on image [k1,k2], exchanged with images [k1-1,k2], [k1+1,k2], [k1,k2-1] and [k1,k2+1].
MPI (~30 lines):
- call mpi_cart_create, mpi_cart_get and mpi_cart_shift
- call mpi_type_vector and mpi_type_commit
- call mpi_isend, mpi_irecv and mpi_waitall
XMP global view (1 line):
  !$xmp reflect (A)
CAF (6 lines):
  if (k1 > 1)   A(-1:0, 1:n)       = A(m-1:m, 1:n)[k1-1, k2]      ! (1)
  if (k1 < k1x) A(m+1:m+2, 1:n)    = A(1:2, 1:n)[k1+1, k2]        ! (2)
  sync all
  if (k2 > 1)   A(-1:m+2, -1:0)    = A(-1:m+2, n-1:n)[k1, k2-1]   ! (3)
  if (k2 < k2x) A(-1:m+2, n+1:n+2) = A(-1:m+2, 1:2)[k1, k2+1]     ! (4)
  sync all
Ease of programming: MPI < CAF < XMP
Expressiveness: MPI ≈ CAF > XMP
4 Contents
- Coarray Fortran and Other Implementations
- Issues in Our Implementation
- Evaluation, Comparing with Other Implementations
- Summary and Conclusion
5 Coarray Fortran: Language Specification
An extension of Fortran to describe parallel execution; adopted as a part of Fortran 2008.
Basic usage of coarrays:
Declaration:
  real A(10,10)[*]            ! coarray A(10,10) on each image
Reference (for "get") and definition (for "put"):
  ... A(i,j)[k] ...           ! reference to A(i,j) on image k
  A(i,j)[k] = ...             ! assignment to A(i,j) on image k
Useful in the context of array expressions/assignments:
  ... A[k] ...                ! reference to the whole array A on image k
  ... A(i1:i2, j1:j2)[k] ...  ! reference to a subarray of A on image k
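The declarations and references above can be combined into a minimal runnable sketch (the program name, the wrap-around exchange pattern and the gather on image 1 are illustrative, not from the slides; it needs a coarray-capable compiler, e.g. gfortran with -fcoarray=single for one image or with OpenCoarrays for many):

```fortran
program caf_basics
  implicit none
  real :: a(10,10)[*]              ! one 10x10 instance of A per image
  integer :: me, next

  me = this_image()
  a  = real(me)                    ! define the local instance
  sync all                         ! make definitions visible to all images

  ! "get": read from the neighboring image (wrapping at the last image)
  next = merge(1, me + 1, me == num_images())
  print *, 'image', me, 'reads a(1,1) =', a(1,1)[next], 'from image', next

  ! "put": every image defines one element on image 1
  if (me <= 10) a(1, me)[1] = real(me)   ! guard: A has only 10 columns
  sync all
  if (me == 1) print *, 'gathered on image 1:', a(1, 1:min(10, num_images()))
end program caf_basics
```

Note that the two `sync all` statements play the same role as in the stencil example on slide 3: they separate the local definitions from the remote references.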
6 Coarray Fortran: Existing Implementations
Compilers:
- Vendors: Cray Fortran; Intel Fortran
- Open source: OpenUH (U. of Houston), based on the Open64 compiler; OpenCoarrays, called by GCC or later
Translators (source-to-source compilers):
- Open source: Rice CAF (not compatible with F2008), based on the ROSE source-to-source compiler; Omni XMP CAF (preliminary version), based on the Omni XMP source-to-source compiler
7 Status of Our Implementation
Fortran 2008 coarray features: listed below by section of [1] (support was marked per feature on the original slide).
Fortran 2015 coarray features: partially supported (co_sum, co_max, co_min).
Interoperability with the XMP global view: not supported yet.
- Sec. 3: declaration of static coarrays; initialization of coarrays; declaration of allocatable coarrays
- Sec. 4: reference to a coindexed object; definition of a coindexed variable
- Sec. 5: dummy argument of static coarray; dummy argument of allocatable coarray
- Sec. 9: ALLOCATE statement for coarray; DEALLOCATE statement for coarray; implicit deallocation
- Sec. 10: derived-type coarray; allocatable component of derived-type coarray; pointer component of derived-type coarray; coarray component of a structure
- Sec. 12: SYNC ALL statement; SYNC IMAGES statement; LOCK/UNLOCK statements; CRITICAL section; SYNC MEMORY statement; stat= and errmsg= specifiers
- Sec. 13: normal termination; error termination, ERROR STOP statement
- Sec. 15: image_index, lcobound, ucobound; num_images, this_image([coarray[, dim]]); atomic_define, atomic_ref
[1] John Reid. Coarrays in the next Fortran Standard. ISO/IEC JTC1/SC22/WG5 N1824, April 21, 2010.
8 Our Implementation Based on the Omni XMP Compiler
The Omni XMP compiler was extended to support coarrays.
[Figure: compilation flow. An XMP & coarray program passes through the coarray translator and the XMP translator, producing a Fortran program; that program is compiled by a Fortran compiler (GNU, Fujitsu, ...) and linked with the coarray library, XMP library and Fortran library, on top of MPI, GASNet or FJ-RDMA, to produce the executable.]
Implementation of the coarray features:
- A part of the Omni XMP compiler, for interoperability with the XMP global view
- Translates CAF programs into Fortran 90 programs, for portability (not depending on Fortran compilers)
Advantage of the translator: any Fortran compiler can be chosen.
Issues we faced during implementation:
1. Memory allocation of the coarrays via the communication library
2. Knowing the Fortran data layout at runtime
9 Issue 1: Declaration of Static Coarrays
Issue: GASNet requires all coarrays to be allocated via the GASNet library, so allocating static coarrays at the entrance of every procedure would cause runtime overhead.
Solution: allocate all static coarrays just before the program starts executing.
(1) The translator generates an initializer corresponding to each procedure.
(2) A traverser generator generates a traverser that calls all the initializers.
(3) The built-in main routine calls the traverser before the user's main program.
[Figure: the translator turns program main and subroutines foo and bar into user_main, foo and bar plus initializers init_main, init_foo and init_bar; the generated traverser calls the initializers; the built-in main routine calls the traverser and then user_main; the linker combines everything into a.out.]
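Steps (1)-(3) can be sketched as hand-written Fortran. This is an illustrative stand-in, not Omni's generated code: the real translator uses Cray pointers and common blocks (see the appendix slides) and allocates through the runtime, whereas here a module variable and a dummy handle keep the sketch portable.

```fortran
! Illustrative sketch of the initializer/traverser pattern (names hypothetical).
module registry
  implicit none
  integer(8) :: cp_a = 0   ! stand-in for the handle of coarray A in foo
end module registry

subroutine init_foo()      ! (1) initializer generated for procedure foo
  use registry
  ! The real initializer calls the communication library's allocator
  ! (e.g. via GASNet); here we only record a dummy nonzero handle.
  cp_a = 1
end subroutine init_foo

subroutine traverser()     ! (2) generated traverser: calls every initializer
  call init_foo()
end subroutine traverser

program builtin_main       ! (3) built-in main: traverser first, then user code
  call traverser()
  call user_main()
contains
  subroutine user_main()
    use registry
    print *, 'coarrays already allocated, handle =', cp_a
  end subroutine user_main
end program builtin_main
```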
10 Issue 2: Reference to Coindexed Objects
Issue: the data layout of an array is decided by the back-end Fortran compiler.
Examples of array data layouts:
- the whole array of an explicit-shape array or an allocation: 2-dimensionally (fully) contiguous
- a subarray (a part of a whole array) or an assumed-shape array: 1-dimensionally contiguous, or non-contiguous when strided
For efficient communication, the runtime library should know how long and how periodic the contiguous pieces of data are.
11 Issue 2 (cont.): Reference to Coindexed Objects
Solution: an algorithm performed cooperatively by the translator and the runtime library.
(1) The translator generates a library call with the following arguments:
  Addresses:                          Sizes:
  P0 = address of A(ib, jb, ...)      L0 = size of an array element [bytes]
  P1 = address of A(ib+1, jb, ...)    L1 = size(A, 1)
  P2 = address of A(ib, jb+1, ...)    L2 = size(A, 2)
(2) The runtime library executes the following algorithm:
- If P0 + L0 /= P1, the data is contiguous in chunks of L0 bytes.
- Else if P0 + L0*L1 /= P2, the data is 1-dimensionally contiguous, in chunks of L0*L1 bytes.
- Else the data is 2-dimensionally (fully) contiguous, in chunks of L0*L1*L2 bytes.
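The test in step (2) can be sketched in plain Fortran, using C_LOC to obtain raw byte addresses (the program and the transfer()-based address extraction are illustrative, not Omni's runtime API; for a whole explicit-shape array it takes the fully contiguous branch):

```fortran
! Sketch of the runtime contiguity test from step (2) above.
program contiguity_demo
  use iso_c_binding, only: c_loc
  implicit none
  real(4), target :: a(8, 6)           ! explicit-shape: fully contiguous
  integer(8) :: p0, p1, p2, l0, l1, l2

  l0 = 4                               ! element size in bytes
  l1 = size(a, 1)
  l2 = size(a, 2)

  ! Addresses of A(ib,jb), A(ib+1,jb) and A(ib,jb+1), with ib = jb = 1
  p0 = transfer(c_loc(a(1, 1)), 0_8)
  p1 = transfer(c_loc(a(2, 1)), 0_8)
  p2 = transfer(c_loc(a(1, 2)), 0_8)

  if (p0 + l0 /= p1) then
    print *, 'contiguous in chunks of', l0, 'bytes'
  else if (p0 + l0*l1 /= p2) then
    print *, '1-dim contiguous, chunks of', l0*l1, 'bytes'
  else
    print *, '2-dim (fully) contiguous, chunks of', l0*l1*l2, 'bytes'
  end if
end program contiguity_demo
```

Pointing p0, p1 and p2 at a strided section instead (e.g. a(1:8:2, :)) would make the first or second test fail and select a smaller contiguous chunk size, which is exactly what the runtime needs for packing communication buffers.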
12 Evaluation
Application: Himeno benchmark
- The original MPI program: 610 lines (excluding comment lines)
- The ported CAF program: 415 lines (32% fewer)
Porting steps:
- Add declaration and allocation of communication buffers as coarrays.
- Replace mpi_allreduce with co_sum.
- Delete the code around mpi_cart_create, mpi_cart_get and mpi_cart_shift.
- Delete the code around mpi_type_vector and mpi_type_commit.
- Replace mpi_isend/mpi_irecv and mpi_waitall with coarray assignment statements.
Hardware: HA-PACS/TCA at the University of Tsukuba
- CPU: Intel Xeon E5-2680 v2, 2.8 GHz, 10 cores, 2 CPUs/node
- Memory: 128 GB
- GPU: NVIDIA Tesla K20X x 4 (not used in this evaluation)
- Nodes: 64 nodes/system
- Network: Mellanox InfiniBand QDR, 8 GB/s/node
13 Comparison of the Implementations: Coarray Fortran
[Chart: Himeno benchmark (M model), CAF version; GFLOPS vs. number of nodes (images) from 1x1x1 to 2x2x4, for:
- OpenUH 3.0.40 / mvapich 2.0 + GASNet
- OpenCoarrays 1.0.0 / gfortran 6.0.0 / mpich 3.1.4
- Omni XMP 0.9.1 / gfortran 4.4.7 / mvapich 2.0 + GASNet
- Omni XMP 0.9.1 / ifort 15.0.2 / Intel MPI 5.0.0 + GASNet
- ifort -coarray=distributed / Intel MPI 5.0.0]
(1) Omni XMP and OpenUH are comparable in performance when they use comparable Fortran compilers and the same GASNet.
(2) The performance of Omni XMP depends strongly on the Fortran compiler.
14 Comparison of the Implementations: Coarray Fortran and MPI
[Chart: the same CAF results as slide 13, together with MPI versions built with gfortran 6.0.0 / mpich 3.1.4, gfortran 4.4.7 / mvapich 2.0 and ifort 15.0.2 / Intel MPI 5.0.0.]
(1) Omni XMP and OpenUH are comparable in performance when they use comparable Fortran compilers and the same GASNet.
(2) The performance of Omni XMP depends strongly on the Fortran compiler.
(3) CAF programs under Omni XMP currently show 2% to 5% lower performance than MPI programs built with the same Fortran compiler.
15 Summary and Conclusion
Omni XMP CAF translator:
- Implemented the major features.
- Settled issues concerning efficient memory allocation of coarrays and knowing the data layout at runtime.
Evaluation with the Himeno benchmark on HA-PACS:
- The CAF version is 32% shorter than the original MPI version, with 2% to 5% lower performance.
- Omni XMP achieves higher performance than the OpenUH, OpenCoarrays and Intel implementations when Intel Fortran is chosen.
- Omni XMP with Intel Fortran is 2 to 3 times faster than Omni XMP with gfortran.
Advantage of the translator: any back-end Fortran compiler can be chosen to get the best performance on a given environment.
16 Appendix
17 Goals of Omni XMP CAF
- Interoperability with the XcalableMP (XMP) global-view programming model
- Portability across different Fortran compilers and platforms
- Compatibility with the coarray features of Fortran 2008
- And, of course, high performance
18 Declaration of an Allocatable Coarray
Original code:
  subroutine FOO
    real(4), allocatable :: a(:, :)[:]
    allocate ( a(lb1:ub1, lb2:ub2)[*] )
    ... a(i, j) ...
  end subroutine
Translated code:
  subroutine FOO
    real(4), pointer :: a(:, :)
    call xmpf_coarray_alloc2d_r4 ( desc_a, a, tag_foo, lb1, ub1, lb2, ub2 )
    ... a(i, j) ...
  end subroutine
Runtime library:
  subroutine xmpf_coarray_alloc2d_r4 &
       (descriptor, a, tag, lb1, ub1, lb2, ub2)
    real(4), pointer :: a(:, :)
    real(4) :: data(lb1:ub1, lb2:ub2)
    pointer (pdata, data)
    pdata = xxx_malloc( 4*(ub1-lb1+1)*(ub2-lb2+1) )
    call pointer_assign(a, data)
  contains
    subroutine pointer_assign(a, data)
      real(4), pointer :: a(:, :)
      real(4), target :: data(lb1:, lb2:)   ! set lower bounds
      a => data
    end subroutine
  end subroutine
19 Omni XMP CAF Translator: Declaration of a Static Coarray
XMP & CAF program:
  subroutine foo
    real(4), save :: a(10,20)[*]
    ...
  end subroutine
Translated Fortran 90 program:
  subroutine foo
    real(4) :: a(10,20)
    pointer (cp_a, a)
    common /xmpf_cp_foo/ cp_a
    ...
  end subroutine
Generated initializer:
  subroutine xmpf_init_foo
    integer(8) :: cp_a
    common /xmpf_cp_foo/ cp_a
    cp_a = xmpf_coarray_malloc(4*10*20)
  end subroutine
The translated program and the initializer are compiled by the Fortran 90 compiler and linked with the runtime library into a.out.
20 Measurement Configurations
The original MPI program was evaluated with (1), (4) and (8); cafwide was evaluated with (2), (3), (5), (6), (7) and (9).
(1) mvapich 2.0 and gcc 4.4.7; option -O2.
(2) omni/gnu: xmpf built with (1) and the GASNet ibv conduit (built with gnu); option -O2.
(3) UHCAF: OpenUH built with (1) and the GASNet above; options -mpi -static-libcaf -layer=gasnet-ibv.
(4) Intel MPI and icc/ifort; option -O2.
(5) omni/intel-O2: xmpf built with (4) and the GASNet above; option -O2.
(6) omni/intel-O1: same as (5); option -O1.
(7) ifort/intelmpi: same as (4); options -O2 -coarray=distributed -mt_mpi.
(8) mpich and hydra built with gcc 6.0.0; option -O2.
(9) OpenCoarrays 1.0.0, called from (8) with options -O2 -fcoarray=lib -lcaf_mpi.
ABSTRACT Masahiro Nakao RIKEN Advanced Institute for Computational Science Hyogo, Japan masahiro.nakao@riken.jp Taisuke Boku Center for Computational Sciences University of Tsukuba Ibaraki, Japan To reduce
More informationLatest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand
Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationCESM (Community Earth System Model) Performance Benchmark and Profiling. August 2011
CESM (Community Earth System Model) Performance Benchmark and Profiling August 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell,
More informationAdvanced Fortran Programming
Sami Ilvonen Pekka Manninen Advanced Fortran Programming March 20-22, 2017 PRACE Advanced Training Centre CSC IT Center for Science Ltd, Finland type revector(rk) integer, kind :: rk real(kind=rk), allocatable
More informationICON Performance Benchmark and Profiling. March 2012
ICON Performance Benchmark and Profiling March 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox Compute resource - HPC
More informationSupport for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth
Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU
More informationChapter 3. Fortran Statements
Chapter 3 Fortran Statements This chapter describes each of the Fortran statements supported by the PGI Fortran compilers Each description includes a brief summary of the statement, a syntax description,
More informationHimeno Performance Benchmark and Profiling. December 2010
Himeno Performance Benchmark and Profiling December 2010 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource
More informationIndex. classes, 47, 228 coarray examples, 163, 168 copystring, 122 csam, 125 csaxpy, 119 csaxpyval, 120 csyscall, 127 dfetrf,14 dfetrs, 14
Index accessor-mutator routine example in a module, 7 PUBLIC or PRIVATE components, 6 ACM, ix editors of CALGO, ix Adams, Brainerd et al., see books, Fortran reference Airy s equation boundary value problem,
More informationX10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management
X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large
More informationOPEN MPI WITH RDMA SUPPORT AND CUDA. Rolf vandevaart, NVIDIA
OPEN MPI WITH RDMA SUPPORT AND CUDA Rolf vandevaart, NVIDIA OVERVIEW What is CUDA-aware History of CUDA-aware support in Open MPI GPU Direct RDMA support Tuning parameters Application example Future work
More informationComparing One-Sided Communication with MPI, UPC and SHMEM
Comparing One-Sided Communication with MPI, UPC and SHMEM EPCC University of Edinburgh Dr Chris Maynard Application Consultant, EPCC c.maynard@ed.ac.uk +44 131 650 5077 The Future ain t what it used to
More informationCode Parallelization
Code Parallelization a guided walk-through m.cestari@cineca.it f.salvadore@cineca.it Summer School ed. 2015 Code Parallelization two stages to write a parallel code problem domain algorithm program domain
More informationBits, Bytes, and Precision
Bits, Bytes, and Precision Bit: Smallest amount of information in a computer. Binary: A bit holds either a 0 or 1. Series of bits make up a number. Byte: 8 bits. Single precision variable: 4 bytes (32
More informationOCTOPUS Performance Benchmark and Profiling. June 2015
OCTOPUS Performance Benchmark and Profiling June 2015 2 Note The following research was performed under the HPC Advisory Council activities Special thanks for: HP, Mellanox For more information on the
More informationProgramming techniques for heterogeneous architectures. Pietro Bonfa SuperComputing Applications and Innovation Department
Programming techniques for heterogeneous architectures Pietro Bonfa p.bonfa@cineca.it SuperComputing Applications and Innovation Department Heterogeneous computing Gain performance or energy efficiency
More informationLiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster
LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster H. W. Jin, S. Sur, L. Chai, and D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering
More informationAMBER 11 Performance Benchmark and Profiling. July 2011
AMBER 11 Performance Benchmark and Profiling July 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource -
More informationThe Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer
The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer William Killian Tom Scogland, Adam Kunen John Cavazos Millersville University of Pennsylvania
More informationAdvanced Fortran Topics - Hands On Sessions
Advanced Fortran Topics - Hands On Sessions Getting started & general remarks Log in to the LRZ Linux Cluster The windows command line (Start --> Enter command) should be used to execute the following
More informationPedraforca: a First ARM + GPU Cluster for HPC
www.bsc.es Pedraforca: a First ARM + GPU Cluster for HPC Nikola Puzovic, Alex Ramirez We ve hit the power wall ALL computers are limited by power consumption Energy-efficient approaches Multi-core Fujitsu
More information