Performance Evaluation for Omni XcalableMP Compiler on Many-core Cluster System based on Knights Landing


Masahiro Nakao, RIKEN Advanced Institute for Computational Science, Hyogo, Japan
Hitoshi Murai, RIKEN Advanced Institute for Computational Science, Hyogo, Japan
Taisuke Boku, Center for Computational Sciences, University of Tsukuba, Ibaraki, Japan
Mitsuhisa Sato, RIKEN Advanced Institute for Computational Science, Hyogo, Japan

ABSTRACT
To reduce the programming cost on cluster systems, Partitioned Global Address Space (PGAS) languages are used. We have designed the XcalableMP (XMP) PGAS language and developed the Omni XMP compiler (Omni compiler) for XMP. In the present study, we evaluate the performance of the Omni compiler on Oakforest-PACS, a cluster system based on Knights Landing, and on a general Linux cluster system. We performed performance tuning of the Omni compiler using a Lattice QCD mini-application and some of the mathematical functions appearing in that application. As a result, the performance of the Omni compiler after tuning was improved over that before tuning on both systems. Furthermore, we compared the performance of the existing MPI and OpenMP (MPI+OpenMP) programming model with that of XMP using the tuned Omni compiler. The results show that the Lattice QCD mini-application written in XMP achieves more than 94% of the performance of the implementation written in MPI+OpenMP.

CCS CONCEPTS
• Software and its engineering → Parallel programming languages;

KEYWORDS
Knights Landing, Cluster system, PGAS language, Compiler

ACM Reference Format:
Masahiro Nakao, Hitoshi Murai, Taisuke Boku, and Mitsuhisa Sato. 2018. Performance Evaluation for Omni XcalableMP Compiler on Many-core Cluster System based on Knights Landing. In HPC Asia 2018 WS: Workshops of HPC Asia 2018, January 31, 2018, Chiyoda, Tokyo, Japan. ACM, New York, NY, USA, 7 pages.

1 BACKGROUND
For developing parallel applications on cluster systems, Partitioned Global Address Space (PGAS) languages [1, 3, 7, 10] are used to reduce the programming cost. A PGAS language builds a global memory space out of the individual memory spaces of the machines and provides users with global addresses for accessing it. With a PGAS language, a user can therefore program a cluster system as if it were a shared-memory system, which makes it easy to develop parallel applications. We have designed the XcalableMP (XMP) PGAS language [4, 8, 9], a directive-based language extension of C and Fortran. XMP also specifies that some OpenMP directives can be combined with XMP directives for thread programming. In addition, we have developed the Omni XMP compiler (Omni compiler) [11] as a compiler system for XMP. The Omni compiler is a source-to-source compiler that translates code written in XMP into parallel code. In previous studies [8, 9], we evaluated the performance of XMP and the Omni compiler on the K computer [5] and on general Linux cluster systems. However, we have little experience with XMP on cluster systems based on many-core processors, which have been attracting attention in the HPC field.
In this study, we evaluate the performance of the Omni compiler on Oakforest-PACS [6], a cluster system based on Knights Landing, and on a general Linux cluster system. Moreover, to evaluate the performance of XMP, we implement a Lattice QCD mini-application; Lattice QCD is an important application in the HPC field. Specifically, this study makes the following key contributions: (1) we describe how to implement the Lattice QCD mini-application using a hybrid model of XMP and OpenMP, and (2) we describe an effective code translation method for a source-to-source compiler. The remainder of this paper is structured as follows. Sections 2 and 3 give overviews of XMP and the Omni compiler, respectively. Section 4 describes the implementation of the Lattice QCD mini-application. Section 5 discusses the performance tuning of the Omni compiler on Knights Landing and on a general CPU, and evaluates the performance of the Lattice QCD mini-application and of some mathematical functions appearing in this application. Section 6 summarizes this paper and discusses areas for future research.

2 XCALABLEMP
The XMP specification has been designed by the PC Cluster Consortium. XMP provides directives for data mapping, work mapping, and communication for developing parallel applications on cluster systems. This section describes the XMP C syntax required for implementing the Lattice QCD mini-application.

2.1 Data Mapping
XMP can declare a distributed array on the global memory space. To express the distribution method, a virtual index set called a template is used. The template, nodes, distribute, and align directives are used to declare a distributed array. Fig. 1 shows XMP example code for declaring distributed arrays.

(a) One-dimensional
#pragma xmp template t[N]
#pragma xmp nodes p[4]
#pragma xmp distribute t[block] onto p
double a[N];
#pragma xmp align a[i] with t[i]

(b) Two-dimensional
#pragma xmp template t[N][M]
#pragma xmp nodes p[2][2]
#pragma xmp distribute t[block][block] onto p
double b[N][M];
#pragma xmp align b[i][j] with t[i][j]

Figure 1: Examples of code for data mapping [10]. (The figure also illustrates how the template indices 0 to N-1 and the array elements are divided among the nodes p[0] to p[3] in (a), and among p[0][0] to p[1][1] in (b).)

In Fig. 1a, the template directive defines a template t that has index values from 0 to N-1. An XMP execution unit is called a node, and the nodes execute the program redundantly in parallel. The nodes directive defines a node set p that has four nodes, which means that the application is executed by four nodes. If * is used in place of a number in the square brackets (as in p[*]), then the number of nodes is determined dynamically when the program runs. The distribute directive distributes t onto p in a block manner, meaning that the elements of template t are divided into blocks of as equal a size as possible. XMP also provides cyclic, block-cyclic, and uneven-block distributions. The align directive aligns the distributed array a[] with t. In the case of N = 16, each node has four elements of a[]. XMP can also declare multi-dimensional distributed arrays. Fig. 1b shows how to declare a two-dimensional distributed array b[][] using a two-dimensional template and a two-dimensional node set. Please refer to [10] for the details of multi-dimensional distribution.

2.2 Work Mapping
The loop directive parallelizes the following loop statement according to a template. Fig. 2 shows XMP example code that uses the distributed array a[] and the template t defined in Fig. 1a.

(a) Only XcalableMP
#pragma xmp loop on t[i]
for(i=0;i<N;i++)
  a[i] = ... ;

(b) XcalableMP and OpenMP
#pragma xmp loop on t[i]
#pragma omp parallel for
for(i=0;i<N;i++)
  a[i] = ... ;

Figure 2: Examples of code for work mapping

In Fig. 2a, the loop directive parallelizes the loop statement according to template t. In the case of N = 16, node p[0] executes indices i from 0 to 3. Fig. 2b shows an example of a hybrid program using XMP and OpenMP directives.
In Fig. 2b, the loop directive first parallelizes the loop statement across the nodes, and then the parallel for directive parallelizes the resulting loop across the threads within each node. The order of the loop directive and the parallel for directive does not matter in this case.

2.3 Communication
Reduction. To support an addition expression in a loop statement, a reduction clause can be added to the loop directive. Fig. 3 shows XMP example code.

(a) Only XcalableMP with reduction clause
int sum = 0;
#pragma xmp loop on t[i] reduction(+:sum)
for(i=0;i<N;i++)
  sum += a[i];

(b) XcalableMP and OpenMP with reduction clauses
int sum = 0;
#pragma xmp loop on t[i] reduction(+:sum)
#pragma omp parallel for reduction(+:sum)
for(i=0;i<N;i++)
  sum += a[i];

Figure 3: Examples of code for reduction

In Fig. 3a, a reduction clause is added to the loop directive. The reduction clause performs a reduction operation on the local variable sum of each node when the loop statement finishes. In Fig. 3b, reduction clauses are added to both the loop and the parallel for directives so that the reduction is also performed across the threads.
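Putting these directives together, the following complete program is a minimal sketch (assuming the Omni compiler and an MPI execution environment; the array size and the number of nodes are chosen only for illustration) that initializes a block-distributed array and sums it with both node-level and thread-level reductions, combining the patterns of Figs. 1-3.

#include <stdio.h>
#define N 16

#pragma xmp template t[N]
#pragma xmp nodes p[4]
#pragma xmp distribute t[block] onto p
double a[N];
#pragma xmp align a[i] with t[i]

int main(void)
{
  int i;
  double sum = 0.0;

#pragma xmp loop on t[i]
#pragma omp parallel for
  for (i = 0; i < N; i++)
    a[i] = (double)i;            /* each node touches only its own block  */

#pragma xmp loop on t[i] reduction(+:sum)
#pragma omp parallel for reduction(+:sum)
  for (i = 0; i < N; i++)
    sum += a[i];                 /* reduced across threads and nodes      */

  printf("sum = %f\n", sum);     /* every node holds the reduced value    */
  return 0;
}

After the second loop, the reduction clause of the loop directive leaves the same value of sum on every node, so each node prints the same result.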

Halo exchange. For easy development of stencil applications, XMP provides the shadow and reflect directives. Fig. 4 shows the concepts of these directives, where the distributed array a[] is the array declared in Fig. 1a and N = 16.

Figure 4: Examples of code for shadow/reflect directives [10]. ((a) Add halo region: #pragma xmp shadow a[1] adds halo regions to the local block of a[] on each of p[0] to p[3]. (b) Synchronize halo region: #pragma xmp reflect (a) exchanges the halo regions between neighboring nodes.)

In Fig. 4a, the shadow directive adds halo regions to both sides of the distributed array a[]; the halo regions appear as gray cells in the figure. The width of the halo region is 1, as indicated in the shadow directive. In Fig. 4b, the reflect directive synchronizes the halo regions between neighboring nodes. Note that the global boundary (the left halo of p[0] and the right halo of p[3]) is not updated in this example. When the global boundary must also be updated, the periodic clause should be added to the reflect directive [10].

3 OMNI XCALABLEMP COMPILER
3.1 Overview
Fig. 5 shows the compile flow of the Omni compiler. First, the Omni compiler translates the XMP directives in the user code into XMP runtime calls; if necessary, code other than the XMP directives is also modified. Second, a native compiler (e.g., gcc, the Intel compiler, or PGI) compiles the translated code and links it against the XMP runtime library to create an execution binary.

Figure 5: Compile flow of the Omni XcalableMP compiler (user code in the base language (C or Fortran) with XcalableMP directives → Frontend → Translator → translated code in the base language with runtime calls → Backend (native compiler) + runtime library → execution binary)

3.2 Example of Code Translation
This section describes how user code is translated, using Omni compiler 1.2.2, which is the latest stable version. The distributed array b[][] and the template t used in this section are those defined in Fig. 1b, with N = M = 10.

Distributed array. Fig. 6 shows XMP example code for an align directive and an array declaration.

(a) User code
1 double b[10][10];
2 #pragma xmp align b[i][j] with t[i][j]

(b) Translated code
1 int *_XMP_ADDR_b;
2 static void *_XMP_DESC_b;
3 _XMP_init_array_desc(&_XMP_DESC_b, .., sizeof(double), 10, 10);
4 :
5 _XMP_alloc_array(&_XMP_ADDR_b, _XMP_DESC_b, ...);

Figure 6: Code translation of align directive

The align directive removes the declaration of the array b[][] from the user code and creates a pointer _XMP_ADDR_b and a descriptor _XMP_DESC_b for a new array in the translated code. Moreover, the align directive adds functions that allocate memory for the new array. Note that a multi-dimensional distributed array is expressed as a one-dimensional array in the translated code, because the size of each dimension of the array may be determined dynamically (for example, when * is used in the node set, as described in Section 2.1).

Loop statement. Fig. 7 shows XMP example code for a loop directive and a loop statement.

(a) User code
1 #pragma xmp loop (i,j) on t[i][j]
2 for(i=0;i<10;i++)
3   for(j=0;j<10;j++)
4     b[i][j] = ..;

(b) Translated code
1 for(i=0;i<5;i++)
2   for(j=0;j<5;j++)
3     *(_XMP_ADDR_b + i * (5) + j) = ..;

Figure 7: Code translation of loop directive

In Fig. 7a, the loop directive parallelizes the following nested loop statement according to the two-dimensional template t. In Fig. 7b, the initial values (i = 0, j = 0) and the end conditions (i < 5, j < 5) of the loop statements are emitted automatically as constants, which is possible in this case. If the iteration range differs for each node, the range is calculated just before the loop statement by an XMP runtime function, and the results are used as variables in the initial values and end conditions.
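To illustrate the kind of calculation involved, the following self-contained plain-C sketch (an illustration only, not Omni compiler output; the block rule shown is a common ceiling-division rule and may differ in detail from the rule used by the XMP runtime) computes a per-node iteration range for a block distribution.

#include <stdio.h>

/* Local block [*lb, *ub) of a block-distributed index range of size n
 * divided over nprocs nodes, for the node with the given rank. */
static void block_range(int n, int nprocs, int rank, int *lb, int *ub)
{
  int chunk = (n + nprocs - 1) / nprocs;          /* block size, rounded up */
  *lb = rank * chunk;
  *ub = (*lb + chunk < n) ? *lb + chunk : n;
}

int main(void)
{
  int n = 10, nprocs = 4;
  for (int rank = 0; rank < nprocs; rank++) {
    int lb, ub;
    block_range(n, nprocs, rank, &lb, &ub);
    printf("node %d: i = %d .. %d\n", rank, lb, ub - 1);
  }
  return 0;
}

For n = 10 and nprocs = 4 this prints the ranges 0-2, 3-5, 6-8, and 9-9; with the values fixed at compile time, as in Fig. 7, the translator can emit the resulting bounds directly as constants.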
4 IMPLEMENTATION OF LATTICE QCD
4.1 Overview of Lattice QCD
Quantum chromodynamics (QCD) is the fundamental theory describing the quark, a species of elementary particle, and the gluon, the particle that mediates the strong interaction. Lattice QCD is a discrete formulation of QCD that performs simulations on a four-dimensional lattice (time: T, and space: Z, Y, X). The quark degrees of freedom are represented as a field that has four spin components and three color components. The gluon is defined as a

3 x 3 complex matrix. In a Lattice QCD simulation, a linear equation for the matrix that represents the interaction between the quark and gluon fields must be solved many times.

4.2 Implementation in XcalableMP
We implemented a Lattice QCD mini-application based on the existing Lattice QCD application Bridge++ [2]; our code extracts the main function of Bridge++. Fig. 8 shows pseudo-code of our implementation, where U is a gluon, the other uppercase characters are quarks, and the lowercase characters are scalar variables. WD() in lines 5, 6, 11, and 12 is the Wilson Dirac operator [12], the main kernel of the Lattice QCD mini-application, which calculates the interactions of the quarks under the influence of the gluon. Our implementation uses the CG method to solve the linear equation. For the CG method, the following mathematical functions on the quark matrices are implemented: (1) the COPY method in lines 1-3 and 8 copies the right-hand side to the left-hand side; (2) the NORM method in lines 4, 9, and 17 calculates the L2-norm; (3) the AXPY method in lines 7, 15, 16, and 20 adds matrices; (4) the DOT method in line 13 calculates a dot product; and (5) the SCAL method in line 19 multiplies by a scalar. Lines 14, 18, and 21 are scalar-to-scalar operations.

 1 S = B
 2 R = B
 3 X = B
 4 sr = norm(S)
 5 T = WD(U,X)
 6 S = WD(U,T)
 7 R = R - S
 8 P = R
 9 rrp = rr = norm(R)
10 do{
11   T = WD(U,P)
12   V = WD(U,T)
13   pap = dot(V,P)
14   cr = rr/pap
15   X = cr * P + X
16   R = -cr * V + R
17   rr = norm(R)
18   bk = rr/rrp
19   P = bk * P
20   P = P + R
21   rrp = rr
22 }while(rr/sr > 1.E-16)

Figure 8: Lattice QCD pseudo-code

 1 typedef struct Quark {
 2   double v[4][3][2];
 3 } Quark_t;
 4 typedef struct Gluon {
 5   double v[3][3][2];
 6 } Gluon_t;
 7 Quark_t v[NT][NZ][NY][NX], tmp_v[NT][NZ][NY][NX];
 8 Gluon_t u[4][NT][NZ][NY][NX];
 9
10 #pragma xmp template t[NT][NZ]
11 #pragma xmp nodes p[PT][PZ]
12 #pragma xmp distribute t[block][block] onto p
13 #pragma xmp align v[i][j][*][*] with t[i][j]
14 #pragma xmp align tmp_v[i][j][*][*] with t[i][j]
15 #pragma xmp align u[*][i][j][*][*] with t[i][j]
16 #pragma xmp shadow v[1][1][0][0]
17 #pragma xmp shadow tmp_v[1][1][0][0]
18 #pragma xmp shadow u[0][1][1][0][0]

Figure 9: Declaration of distributed arrays for Lattice QCD

 1 void WD(Quark_t v_out[NT][NZ][NY][NX], const Gluon_t u[4][NT][NZ][NY][NX], const Quark_t v[NT][NZ][NY][NX]){
 2   :
 3 #pragma xmp loop (t,z) on t[t][z]
 4 #pragma omp parallel for collapse(4) ...
 5   for(t=0;t<NT;t++)
 6     for(z=0;z<NZ;z++)
 7       for(y=0;y<NY;y++)
 8         for(x=0;x<NX;x++){
           :

Figure 10: Portion of Wilson Dirac operator code

1 #pragma xmp reflect(v) width(/periodic/1,/periodic/1,0,0) orthogonal
2 #pragma xmp reflect(u) width(0,/periodic/1:0,/periodic/1:0,0,0) orthogonal
3 WD(tmp_v, u, v);
4 #pragma xmp reflect(tmp_v) width(/periodic/1,/periodic/1,0,0) orthogonal
5 WD(v, u, tmp_v);

Figure 11: Calling Wilson Dirac operator code

 1 void scal(Quark_t v[NT][NZ][NY][NX], const double a){
 2   :
 3 #pragma xmp loop (t,z) on t[t][z]
 4 #pragma omp parallel for collapse(4)
 5   for(t=0;t<NT;t++)
 6     for(z=0;z<NZ;z++)
 7       for(y=0;y<NY;y++)
 8         for(x=0;x<NX;x++)
 9           for(i=0;i<4;i++)
10             for(j=0;j<3;j++)
11               for(k=0;k<2;k++)
12                 v[t][z][y][x].v[i][j][k] *= a;
13 }

Figure 12: SCAL operator code

Fig. 9 shows the declaration of the distributed arrays for the quark and gluon fields. In lines 1-8, the quark and gluon structure arrays are declared; the last dimension "[2]" of each structure represents the real and imaginary parts of a complex number. Macros NT, NZ, NY, and NX are the numbers of elements along the T, Z, Y, and X axes.
Macros PT and PZ in the nodes directive of line 11 are the numbers of nodes along the T and Z axes; thus, the program is parallelized along the T and Z axes only. In the align directives of lines 13-15, an * in square brackets means that the dimension is not divided. In the shadow directives of lines 16-18, a 0 in square brackets means that the dimension has no halo region. Fig. 10 shows a portion of the Wilson Dirac operator code. In line 3, the loop directive parallelizes the outer two loop statements. In line 4, the parallel for directive with the collapse clause also parallelizes the loop statements that were parallelized by the loop directive. Inside the loop statements, each calculation needs neighboring elements of the distributed arrays. Fig. 11 shows the code that calls WD(), where reflect directives are used before WD() to synchronize the halo regions with the neighboring nodes. The width and orthogonal clauses restrict the transfer range of the halo region in order to reduce the communication time [4, 10]. The halo region of u is not synchronized before the second WD() because the values of u are not updated in WD(). Fig. 12 shows the SCAL method, whose loop statement comprises seven nested loops and is parallelized using XMP and OpenMP directives. The COPY, NORM, AXPY, and DOT methods are shown in Figs. 19-22 in the Appendix. The reason for specifying collapse(4) instead of collapse(7) in the parallel for directive is that collapse(4) performed better than collapse(7) in our preliminary evaluation.
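To clarify how the pseudo-code of Fig. 8 maps onto the methods listed above, the following self-contained plain-C sketch runs the same iteration with the Wilson Dirac operator replaced by a small symmetric positive-definite matrix, so it solves (A*A)X = B, mirroring the two WD() calls per step. The matrix, size, and threshold are illustrative only, and the halo exchanges of Fig. 11 have no counterpart here; the element-wise loops play the roles of COPY, AXPY, and SCAL.

#include <stdio.h>
#define N 4

static const double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};

static void apply(double out[N], const double in[N])   /* out = A * in */
{
  for (int i = 0; i < N; i++) {
    out[i] = 0.0;
    for (int j = 0; j < N; j++) out[i] += A[i][j] * in[j];
  }
}
static double norm(const double v[N])     /* sum of squares, as in Fig. 20 */
{
  double a = 0.0;
  for (int i = 0; i < N; i++) a += v[i] * v[i];
  return a;
}
static double dot(const double v[N], const double w[N])
{
  double a = 0.0;
  for (int i = 0; i < N; i++) a += v[i] * w[i];
  return a;
}

int main(void)
{
  double B[N] = {1, 2, 3, 4};
  double X[N], R[N], S[N], P[N], T[N], V[N];
  for (int i = 0; i < N; i++) { S[i] = B[i]; R[i] = B[i]; X[i] = B[i]; }
  double sr = norm(S);
  apply(T, X); apply(S, T);                       /* T = A*X; S = A*T      */
  for (int i = 0; i < N; i++) R[i] -= S[i];       /* R = R - S             */
  for (int i = 0; i < N; i++) P[i] = R[i];
  double rr = norm(R), rrp = rr;
  do {
    apply(T, P); apply(V, T);                     /* V = A*A*P             */
    double cr = rr / dot(V, P);
    for (int i = 0; i < N; i++) X[i] += cr * P[i];
    for (int i = 0; i < N; i++) R[i] -= cr * V[i];
    rr = norm(R);
    double bk = rr / rrp;
    for (int i = 0; i < N; i++) P[i] = bk * P[i] + R[i];
    rrp = rr;
  } while (rr / sr > 1.e-16);
  for (int i = 0; i < N; i++) printf("X[%d] = %f\n", i, X[i]);
  return 0;
}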

5 PERFORMANCE EVALUATION
5.1 Tuning Omni XcalableMP Compiler
In our preliminary evaluation using Omni compiler 1.2.2, the performance of the mathematical functions deteriorated. Fig. 13 shows a portion of the SCAL code of Fig. 12 as translated by Omni compiler 1.2.2, where NT, NZ, NY, and NX are 32 and PT and PZ are 2.

1 #pragma omp parallel for collapse(4)
2 for(t=0;t<16;t++)
3   for(z=0;z<16;z++)
4     for(y=0;y<32;y++)
5       for(x=0;x<32;x++)
6         for(i=0;i<4;i++)
7           for(j=0;j<3;j++)
8             for(k=0;k<2;k++)
9               (*((*((*(((_XMP_ADDR_v + (t+1)*(16+2)*(32)*(32) + (z+1)*(32)*(32) + y*(32) + x)->v) + i)) + j)) + k)) *= a;

Figure 13: Translated SCAL operator code

In line 9, 1 is added to t and z to account for the halo region, and the 2 in the (16+2) term is the total width of the halo on both sides. Because the distributed array v is expressed as a one-dimensional array, it is difficult for a native compiler to optimize the code effectively. In order to solve this problem, the code translation of the Omni compiler has been changed so that it leaves the size information of each dimension just before a target loop statement whenever possible (for example, when the node set is determined dynamically, the new code translation is not performed). Fig. 14 shows a portion of the SCAL code of Fig. 12 as translated by the new Omni compiler. In line 1, a new pointer _XMP_MULTI_ADDR_v, which carries the size of each dimension, is declared and set to the head of the distributed array v. In the loop statement, operations are performed through this new pointer.

 1 Quark_t (*_XMP_MULTI_ADDR_v)[16+2][32][32] = (Quark_t (*)[16+2][32][32])_XMP_ADDR_v;
 2 #pragma omp parallel for collapse(4)
 3 for(t=0;t<16;t++)
 4   for(z=0;z<16;z++)
 5     for(y=0;y<32;y++)
 6       for(x=0;x<32;x++)
 7         for(i=0;i<4;i++)
 8           for(j=0;j<3;j++)
 9             for(k=0;k<2;k++)
10               (*((*((*(((&(_XMP_MULTI_ADDR_v[t+1][z+1][y][x]))->v) + i)) + j)) + k)) *= a;

Figure 14: New translated SCAL operator code
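The effect of this change can be seen in the following self-contained plain-C sketch (an illustration, not actual Omni compiler output): the same heap block is accessed once through a flat pointer with hand-computed offsets, as in Fig. 13, and once through a pointer type that carries the trailing dimension sizes, as in Fig. 14, which gives the native compiler the dimension information it needs for optimization.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  double *flat = calloc(5 * 5, sizeof(double));   /* like _XMP_ADDR_b        */
  double (*multi)[5] = (double (*)[5])flat;       /* dimension-carrying view */

  for (int i = 0; i < 5; i++)
    for (int j = 0; j < 5; j++)
      *(flat + i * 5 + j) = i + j;                /* flat access (old style) */

  for (int i = 0; i < 5; i++)
    for (int j = 0; j < 5; j++)
      multi[i][j] *= 2.0;                         /* dimension-aware access  */

  printf("%f\n", multi[4][4]);
  free(flat);
  return 0;
}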
5.2 Performance of Mathematical Functions
In order to investigate the influence of the code translation change described in Section 5.1, we evaluate the performance of the mathematical functions in Figs. 12 and 19-22. We use a single compute node of Oakforest-PACS and also of the COMA system (COMA), which serves as a general Linux cluster system. Tables 1 and 2 show the evaluation environments.

Table 1: Specification of Oakforest-PACS
CPU       Intel Xeon Phi 7250, 1.4 GHz, 68 cores
Memory    MCDRAM 16 GB, DDR4 96 GB
Network   Intel Omni-Path Host Fabric Interface, 12.5 GB/s
Software  intel/, intelmpi/

Table 2: Specifications of COMA
CPU       Intel Xeon E5-2670v2, 2.5 GHz, 10 cores, 2 sockets
Memory    DDR3 64 GB
Network   InfiniBand FDR, 7 GB/s
Software  intel/17.0.5, intelmpi/

We use two program sizes: (NT, NZ, NY, NX) = (32, 32, 32, 32) and (2, 2, 32, 32). The compile options on Oakforest-PACS are -O2 -mcmodel=medium -axMIC-AVX512, whereas those on COMA are -O2 -mcmodel=medium. The number of processes is 1 on both machines, and the number of threads is 64 on Oakforest-PACS and 10 on COMA. The memory mode of Oakforest-PACS is cache mode, in which MCDRAM works as a cache for the DDR4 memory. We use the Intel compiler as the backend compiler of the Omni compiler. We execute each mathematical function repeatedly and measure the elapsed time.
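A minimal sketch of the kind of measurement loop implied here is shown below (illustrative only: f() stands in for one of the mathematical functions, and the values of REPEAT and N are placeholders, not the settings used in the paper).

#include <stdio.h>
#include <omp.h>
#define REPEAT 1000
#define N 1024

static double a[N];

static void f(double s)             /* stand-in for a function under test */
{
#pragma omp parallel for
  for (int i = 0; i < N; i++)
    a[i] *= s;
}

int main(void)
{
  for (int i = 0; i < N; i++) a[i] = 1.0;

  double t0 = omp_get_wtime();
  for (int r = 0; r < REPEAT; r++)
    f(1.000001);                    /* execute the function repeatedly    */
  double elapsed = omp_get_wtime() - t0;

  printf("elapsed: %f s (%e s per call)\n", elapsed, elapsed / REPEAT);
  return 0;
}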

Figure 15: Performance of mathematical functions on Oakforest-PACS for (a) (NT, NZ, NY, NX) = (32, 32, 32, 32) and (b) (NT, NZ, NY, NX) = (2, 2, 32, 32)

Figure 16: Performance of mathematical functions on COMA for (a) (NT, NZ, NY, NX) = (32, 32, 32, 32) and (b) (NT, NZ, NY, NX) = (2, 2, 32, 32)

Figs. 15 and 16 show the performance results. For comparison purposes, the figures also include the results of code that uses only OpenMP and is compiled directly by the Intel compiler. Fig. 17 shows the OpenMP-only version of SCAL.

 1 void scal(Quark_t v[NT+2][NZ+2][NY][NX], const double a){
 2   :
 3 #pragma omp parallel for collapse(4)
 4   for(t=1;t<NT+1;t++)
 5     for(z=1;z<NZ+1;z++)
 6       for(y=0;y<NY;y++)
 7         for(x=0;x<NX;x++)
 8           for(i=0;i<4;i++)
 9             for(j=0;j<3;j++)
10               for(k=0;k<2;k++)
11                 v[t][z][y][x].v[i][j][k] *= a;
12 }

Figure 17: SCAL operator code using OpenMP

Note that 1 is added in the loop conditions of the outer two loop statements in lines 4 and 5 because the arrays include halo regions, which reflects the natural way of writing this code by hand. All performance results of the new Omni compiler are equal to or better than those of Omni compiler 1.2.2; the performance improvement of SCAL is particularly remarkable. In addition, these results show that the performance of the new Omni compiler is close to that of the OpenMP-only code.

5.3 Performance of Lattice QCD
This section evaluates the performance of the Lattice QCD mini-application written in XMP, using Omni compiler 1.2.2 and the new Omni compiler. For comparison purposes, an implementation of the application written in a combination of MPI and OpenMP (MPI+OpenMP) is also evaluated. We assigned one process per compute node on Oakforest-PACS and two processes per compute node on COMA, because a COMA node has two CPU sockets. We ran up to 256 processes on both systems. The compile options, the number of threads, the memory mode, and the backend compiler are the same as those in Section 5.2. We execute the Lattice QCD code with strong scaling and (NT, NZ, NY, NX) = (32, 32, 32, 32).

Figure 18: Performance (GFlops) of the Lattice QCD mini-application on (a) Oakforest-PACS and (b) COMA, for 1x1 to 16x16 processes (PT x PZ), comparing Omni compiler 1.2.2, the new Omni compiler, and MPI+OpenMP

Fig. 18 shows the performance results. All the performance results using the new Omni compiler are better than those using Omni compiler 1.2.2. In addition, XMP with the new Omni compiler achieves more than 94% of the performance of MPI+OpenMP. To examine why the performance of XMP is slightly worse than that of MPI+OpenMP at 256 processes, note that, as shown in Figs. 15b and 16b, the performance of the mathematical functions other than COPY with the new Omni compiler is almost the same as or slightly worse than that of the OpenMP-only code. Although the performance of COPY with the new Omni compiler is better, COPY is called only a small number of times, as shown in Fig. 8.

6 SUMMARY AND FUTURE WORK
In this paper, we examined the tuned performance of the Omni compiler for the XMP PGAS language on a Knights Landing cluster system and a general Linux cluster system. Specifically, we evaluated the XMP performance through an implementation of a

Lattice QCD mini-application. As a result of evaluating the Omni compiler before and after the performance tuning, we found that the performance of the Omni compiler after tuning was superior. The results also showed that the Lattice QCD mini-application written in XMP using the tuned Omni compiler achieves more than 94% of the performance of the implementation written in MPI+OpenMP. Future research will examine the performance difference between XMP and MPI+OpenMP more deeply and will also evaluate the performance of applications other than the Lattice QCD mini-application.

ACKNOWLEDGEMENTS
We would like to extend grateful thanks to Hideo Matsufuru, who provided us with the Lattice QCD code. This research used the Oakforest-PACS and COMA systems provided by the Interdisciplinary Computational Science Program in the Center for Computational Sciences, University of Tsukuba. The work was supported by the Japan Science and Technology Agency, Core Research for Evolutional Science and Technology program entitled "Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era" in the research area of "Development of System Software Technologies for Post-Peta Scale High Performance Computing."

A MATHEMATICAL FUNCTIONS

 1 void copy(Quark_t v[NT][NZ][NY][NX], const Quark_t w[NT][NZ][NY][NX]){
 2   :
 3 #pragma xmp loop (t,z) on t[t][z]
 4 #pragma omp parallel for collapse(4)
 5   for(t=0;t<NT;t++)
 6     for(z=0;z<NZ;z++)
 7       for(y=0;y<NY;y++)
 8         for(x=0;x<NX;x++)
 9           for(i=0;i<4;i++)
10             for(j=0;j<3;j++)
11               for(k=0;k<2;k++)
12                 v[t][z][y][x].v[i][j][k] = w[t][z][y][x].v[i][j][k];
13 }

Figure 19: COPY operator code

 1 double norm(const Quark_t v[NT][NZ][NY][NX]){
 2   :
 3   double a = 0.0;
 4 #pragma xmp loop (t,z) on t[t][z] reduction(+:a)
 5 #pragma omp parallel for collapse(4) reduction(+:a)
 6   for(t=0;t<NT;t++)
 7     for(z=0;z<NZ;z++)
 8       for(y=0;y<NY;y++)
 9         for(x=0;x<NX;x++)
10           for(i=0;i<4;i++)
11             for(j=0;j<3;j++)
12               for(k=0;k<2;k++)
13                 a += v[t][z][y][x].v[i][j][k] * v[t][z][y][x].v[i][j][k];
14   return a;
15 }

Figure 20: NORM operator code

 1 void axpy(Quark_t v[NT][NZ][NY][NX], const double a, const Quark_t w[NT][NZ][NY][NX]){
 2   :
 3 #pragma xmp loop (t,z) on t[t][z]
 4 #pragma omp parallel for collapse(4)
 5   for(t=0;t<NT;t++)
 6     for(z=0;z<NZ;z++)
 7       for(y=0;y<NY;y++)
 8         for(x=0;x<NX;x++)
 9           for(i=0;i<4;i++)
10             for(j=0;j<3;j++)
11               for(k=0;k<2;k++)
12                 v[t][z][y][x].v[i][j][k] += a * w[t][z][y][x].v[i][j][k];
13 }

Figure 21: AXPY operator code

 1 double dot(const Quark_t v[NT][NZ][NY][NX], const Quark_t w[NT][NZ][NY][NX]){
 2   :
 3   double a = 0.0;
 4 #pragma xmp loop (t,z) on t[t][z] reduction(+:a)
 5 #pragma omp parallel for collapse(4) reduction(+:a)
 6   for(t=0;t<NT;t++)
 7     for(z=0;z<NZ;z++)
 8       for(y=0;y<NY;y++)
 9         for(x=0;x<NX;x++)
10           for(i=0;i<4;i++)
11             for(j=0;j<3;j++)
12               for(k=0;k<2;k++)
13                 a += v[t][z][y][x].v[i][j][k] * w[t][z][y][x].v[i][j][k];
14   return a;
15 }

Figure 22: DOT operator code

REFERENCES
[1] Andrew I. Stone et al. 2011. Evaluating Coarray Fortran with the CGPOP Miniapp. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models (PGAS).
[2] Bridge++ Project. 2017. Bridge++. Lattice-code/.
[3] F. Cantonnet et al. 2004. Productivity analysis of the UPC language. In 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[4] XcalableMP Specification Working Group. 2017. XcalableMP Specification.
[5] Hiroyuki Miyazaki et al. 2012. Overview of the K computer. FUJITSU SCIENTIFIC and TECHNICAL JOURNAL 48, 3 (2012).
[6] Information Technology Center, The University of Tokyo. 2017. Oakforest-PACS Supercomputer System.
[7] Katherine Yelick et al. 2007. Productivity and Performance Using Partitioned Global Address Space Languages. In PASCO '07: Proceedings of the 2007 International Workshop on Parallel Symbolic Computation.
[8] Masahiro Nakao et al. 2012. Productivity and Performance of Global-View Programming with XcalableMP PGAS Language. In Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID '12).
[9] Masahiro Nakao et al. 2017. Implementation and evaluation of the HPC Challenge benchmark in the XcalableMP PGAS language. The International Journal of High Performance Computing Applications (2017).
[10] Masahiro Nakao et al. 2017. Implementing Lattice QCD Application with XcalableACC Language on Accelerated Cluster. In 2017 IEEE International Conference on Cluster Computing (CLUSTER).
[11] Omni Compiler Project. 2017. Omni Compiler.
[12] Wilson, K. G. 1974. Confinement of quarks. Phys. Rev. D 10 (Oct 1974), Issue 8.


Tightly Coupled Accelerators with Proprietary Interconnect and Its Programming and Applications 1 Tightly Coupled Accelerators with Proprietary Interconnect and Its Programming and Applications Toshihiro Hanawa Information Technology Center, The University of Tokyo Taisuke Boku Center for Computational

More information

MILC Performance Benchmark and Profiling. April 2013

MILC Performance Benchmark and Profiling. April 2013 MILC Performance Benchmark and Profiling April 2013 Note The following research was performed under the HPC Advisory Council activities Special thanks for: HP, Mellanox For more information on the supporting

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015

Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015 Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015 Abstract As both an OpenMP and OpenACC insider I will present my opinion of the current status of these two directive sets for programming

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture

More information

Deutscher Wetterdienst

Deutscher Wetterdienst Accelerating Work at DWD Ulrich Schättler Deutscher Wetterdienst Roadmap Porting operational models: revisited Preparations for enabling practical work at DWD My first steps with the COSMO on a GPU First

More information

Efficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI

Efficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI Efficient AMG on Hybrid GPU Clusters ScicomP 2012 Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann Fraunhofer SCAI Illustration: Darin McInnis Motivation Sparse iterative solvers benefit from

More information

OpenACC (Open Accelerators - Introduced in 2012)

OpenACC (Open Accelerators - Introduced in 2012) OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in

More information

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid

More information

Scientific Programming in C XIV. Parallel programming

Scientific Programming in C XIV. Parallel programming Scientific Programming in C XIV. Parallel programming Susi Lehtola 11 December 2012 Introduction The development of microchips will soon reach the fundamental physical limits of operation quantum coherence

More information

OpenMP Algoritmi e Calcolo Parallelo. Daniele Loiacono

OpenMP Algoritmi e Calcolo Parallelo. Daniele Loiacono OpenMP Algoritmi e Calcolo Parallelo References Useful references Using OpenMP: Portable Shared Memory Parallel Programming, Barbara Chapman, Gabriele Jost and Ruud van der Pas OpenMP.org http://openmp.org/

More information

Lecture 4: OpenMP Open Multi-Processing

Lecture 4: OpenMP Open Multi-Processing CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP

More information

Fig. 1. Omni OpenMP compiler

Fig. 1. Omni OpenMP compiler Performance Evaluation of the Omni OpenMP Compiler Kazuhiro Kusano, Shigehisa Satoh and Mitsuhisa Sato RWCP Tsukuba Research Center, Real World Computing Partnership 1-6-1, Takezono, Tsukuba-shi, Ibaraki,

More information

MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); double a[100]; #pragma acc data copy(a) { #pragma acc parallel loop for(i=0;i<100;i++)

MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); double a[100]; #pragma acc data copy(a) { #pragma acc parallel loop for(i=0;i<100;i++) 2 MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); double a[100]; #pragma acc data copy(a) { #pragma acc parallel loop for(i=0;i

More information