Performance Evaluation for Omni XcalableMP Compiler on Many-core Cluster System based on Knights Landing


Masahiro Nakao, RIKEN Advanced Institute for Computational Science, Hyogo, Japan
Hitoshi Murai, RIKEN Advanced Institute for Computational Science, Hyogo, Japan
Taisuke Boku, Center for Computational Sciences, University of Tsukuba, Ibaraki, Japan
Mitsuhisa Sato, RIKEN Advanced Institute for Computational Science, Hyogo, Japan

ABSTRACT
To reduce the programming cost on cluster systems, Partitioned Global Address Space (PGAS) languages are used. We have designed the XcalableMP (XMP) PGAS language and developed the Omni XMP compiler (Omni compiler) for XMP. In the present study, we evaluate the performance of the Omni compiler on Oakforest-PACS, a cluster system based on Knights Landing, and on a general Linux cluster system. We performed performance tuning of the Omni compiler using a Lattice QCD mini-application and some of the mathematical functions appearing in that application. As a result, the performance of the Omni compiler after tuning was improved over that before tuning on both systems. Furthermore, we compared the performance of the existing MPI and OpenMP (MPI+OpenMP) programming model with that of XMP using the tuned Omni compiler. The results show that the Lattice QCD mini-application written in XMP achieves more than 94% of the performance of the implementation written in MPI+OpenMP.

CCS CONCEPTS
• Software and its engineering → Parallel programming languages;

KEYWORDS
Knights Landing, Cluster system, PGAS language, Compiler

ACM Reference Format:
Masahiro Nakao, Hitoshi Murai, Taisuke Boku, and Mitsuhisa Sato. 2018. Performance Evaluation for Omni XcalableMP Compiler on Many-core Cluster System based on Knights Landing. In HPC Asia 2018 WS: Workshops of HPC Asia 2018, January 31, 2018, Chiyoda, Tokyo, Japan. ACM, New York, NY, USA, 7 pages.

1 BACKGROUND
For developing parallel applications on cluster systems, Partitioned Global Address Space (PGAS) languages [1, 3, 7, 10] are used to reduce the programming cost. A PGAS language builds a global memory space out of the individual memory spaces of the machines and provides users with global addresses for accessing it. With a PGAS language, a user can therefore program a cluster system as if it were a shared-memory system, which makes it easy to develop parallel applications. We have designed the XcalableMP (XMP) PGAS language [4, 8, 9], a directive-based language extension of C and Fortran. XMP also specifies that some OpenMP directives can be combined with XMP directives for thread programming. In addition, we have developed the Omni XMP compiler (Omni compiler) [11] as a compiler system for XMP. The Omni compiler is a source-to-source compiler that translates code written in XMP into parallel code. In previous studies [8, 9], we evaluated the performance of XMP and the Omni compiler on the K computer [5] and on general Linux cluster systems. However, we have little experience with XMP on cluster systems based on many-core processors, which have been attracting attention in the HPC field.
In this study, we evaluate the performance of the Omni compiler on Oakforest-PACS [6], a cluster system based on Knights Landing, and on a general Linux cluster system. Moreover, to evaluate the performance of XMP, we implement a Lattice QCD mini-application; Lattice QCD is an important application in the HPC field. Specifically, this study makes the following key contributions: (1) we describe how to implement the Lattice QCD mini-application using a hybrid model of XMP and OpenMP, and (2) we describe an effective code translation method for a source-to-source compiler. The remainder of this paper is structured as follows. Sections 2 and 3 give overviews of XMP and the Omni compiler, respectively. Section 4 describes the implementation of the Lattice QCD mini-application. Section 5 discusses the performance tuning of the Omni compiler on Knights Landing and on a general CPU, and evaluates the performance of the Lattice QCD mini-application and of some mathematical functions appearing in this application. Section 6 summarizes this paper and discusses areas for future research.

2 XCALABLEMP
The XMP specification has been designed by the PC Cluster Consortium. XMP provides directives for data mapping, work mapping, and communication for developing parallel applications on cluster systems. This section describes the XMP C syntax required for implementing the Lattice QCD mini-application.

2.1 Data Mapping
XMP can declare a distributed array on the global memory space. To express the distribution method, a virtual index set called a template is used. The template, nodes, distribute, and align directives are used to declare a distributed array. Fig. 1 shows XMP example code for declaring distributed arrays.

(a) One-dimensional
#pragma xmp template t[N]
#pragma xmp nodes p[4]
#pragma xmp distribute t[block] onto p
double a[N];
#pragma xmp align a[i] with t[i]

(b) Two-dimensional
#pragma xmp template t[N][M]
#pragma xmp nodes p[2][2]
#pragma xmp distribute t[block][block] onto p
double b[N][M];
#pragma xmp align b[i][j] with t[i][j]

Figure 1: Examples of code for data mapping [10]. (The figure also illustrates how the template indices 0 to N-1 and the array elements are divided among the nodes p[0] to p[3] in (a), and among p[0][0] to p[1][1] in (b).)

In Fig. 1a, the template directive defines a template t that has index values from 0 to N-1. An XMP execution unit is called a node, and the nodes execute the program redundantly in parallel. The nodes directive defines a node set p that has four nodes, which means that the application is executed by four nodes. If * is used in place of a number in the square brackets (as in p[*]), then the number of nodes is determined dynamically when the program runs. The distribute directive distributes t onto p in a block manner, meaning that the elements of template t are divided into blocks of as equal a size as possible. XMP also provides cyclic, block-cyclic, and uneven-block distributions. The align directive aligns the distributed array a[] with t. In the case of N = 16, each node has four elements of a[]. XMP can also declare multi-dimensional distributed arrays. Fig. 1b shows how to declare a two-dimensional distributed array b[][] using a two-dimensional template and a two-dimensional node set. Please refer to [10] for the details of multi-dimensional distribution.

2.2 Work Mapping
The loop directive parallelizes the following loop statement according to a template. Fig. 2 shows XMP example code that uses the distributed array a[] and the template t defined in Fig. 1a.

(a) Only XcalableMP
#pragma xmp loop on t[i]
for(i=0;i<N;i++)
  a[i] = ... ;

(b) XcalableMP and OpenMP
#pragma xmp loop on t[i]
#pragma omp parallel for
for(i=0;i<N;i++)
  a[i] = ... ;

Figure 2: Examples of code for work mapping

In Fig. 2a, the loop directive parallelizes the loop statement according to template t. In the case of N = 16, node p[0] executes indices i from 0 to 3. Fig. 2b shows an example of a hybrid program using XMP and OpenMP directives.
In Fig. 2b, the loop directive first parallelizes the loop statement across the nodes, and then the parallel for directive parallelizes the resulting loop across the threads within each node. The order of the loop directive and the parallel for directive does not matter in this case.

2.3 Communication
Reduction. To support an addition expression in a loop statement, a reduction clause can be added to the loop directive. Fig. 3 shows XMP example code.

(a) Only XcalableMP with reduction clause
int sum = 0;
#pragma xmp loop on t[i] reduction(+:sum)
for(i=0;i<N;i++)
  sum += a[i];

(b) XcalableMP and OpenMP with reduction clauses
int sum = 0;
#pragma xmp loop on t[i] reduction(+:sum)
#pragma omp parallel for reduction(+:sum)
for(i=0;i<N;i++)
  sum += a[i];

Figure 3: Examples of code for reduction

In Fig. 3a, a reduction clause is added to the loop directive. The reduction clause performs a reduction operation on the local variable sum of each node when the loop statement finishes. In Fig. 3b, reduction clauses are added to both the loop and the parallel for directives so that the reduction is also performed across the threads.
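Putting these directives together, the following complete program is a minimal sketch (assuming the Omni compiler and an MPI execution environment; the array size and the number of nodes are chosen only for illustration) that initializes a block-distributed array and sums it with both node-level and thread-level reductions, combining the patterns of Figs. 1-3.

#include <stdio.h>
#define N 16

#pragma xmp template t[N]
#pragma xmp nodes p[4]
#pragma xmp distribute t[block] onto p
double a[N];
#pragma xmp align a[i] with t[i]

int main(void)
{
  int i;
  double sum = 0.0;

#pragma xmp loop on t[i]
#pragma omp parallel for
  for (i = 0; i < N; i++)
    a[i] = (double)i;            /* each node touches only its own block  */

#pragma xmp loop on t[i] reduction(+:sum)
#pragma omp parallel for reduction(+:sum)
  for (i = 0; i < N; i++)
    sum += a[i];                 /* reduced across threads and nodes      */

  printf("sum = %f\n", sum);     /* every node holds the reduced value    */
  return 0;
}

After the second loop, the reduction clause of the loop directive leaves the same value of sum on every node, so each node prints the same result.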

Halo exchange. For easy development of stencil applications, XMP provides the shadow and reflect directives. Fig. 4 shows the concepts of these directives, where the distributed array a[] is the array declared in Fig. 1a and N = 16.

Figure 4: Examples of code for shadow/reflect directives [10]. ((a) Add halo region: #pragma xmp shadow a[1] adds halo regions to the local block of a[] on each of p[0] to p[3]. (b) Synchronize halo region: #pragma xmp reflect (a) exchanges the halo regions between neighboring nodes.)

In Fig. 4a, the shadow directive adds halo regions to both sides of the distributed array a[]; the halo regions appear as gray cells in the figure. The width of the halo region is 1, as indicated in the shadow directive. In Fig. 4b, the reflect directive synchronizes the halo regions between neighboring nodes. Note that the global boundary (the left halo of p[0] and the right halo of p[3]) is not updated in this example. When the global boundary must also be updated, the periodic clause should be added to the reflect directive [10].

3 OMNI XCALABLEMP COMPILER
3.1 Overview
Fig. 5 shows the compile flow of the Omni compiler. First, the Omni compiler translates the XMP directives in the user code into XMP runtime calls; if necessary, code other than the XMP directives is also modified. Second, a native compiler (e.g., gcc, the Intel compiler, or PGI) compiles the translated code and links it against the XMP runtime library to create an execution binary.

Figure 5: Compile flow of the Omni XcalableMP compiler (user code in the base language (C or Fortran) with XcalableMP directives → Frontend → Translator → translated code in the base language with runtime calls → Backend (native compiler) + runtime library → execution binary)

3.2 Example of Code Translation
This section describes how user code is translated, using Omni compiler 1.2.2, which is the latest stable version. The distributed array b[][] and the template t used in this section are those defined in Fig. 1b, with N = M = 10.

Distributed array. Fig. 6 shows XMP example code for an align directive and an array declaration.

(a) User code
1 double b[10][10];
2 #pragma xmp align b[i][j] with t[i][j]

(b) Translated code
1 int *_XMP_ADDR_b;
2 static void *_XMP_DESC_b;
3 _XMP_init_array_desc(&_XMP_DESC_b, .., sizeof(double), 10, 10);
4 :
5 _XMP_alloc_array(&_XMP_ADDR_b, _XMP_DESC_b, ...);

Figure 6: Code translation of align directive

The align directive removes the declaration of the array b[][] from the user code and creates a pointer _XMP_ADDR_b and a descriptor _XMP_DESC_b for a new array in the translated code. Moreover, the align directive adds functions that allocate memory for the new array. Note that a multi-dimensional distributed array is expressed as a one-dimensional array in the translated code, because the size of each dimension of the array may be determined dynamically (for example, when * is used in the node set, as described in Section 2.1).

Loop statement. Fig. 7 shows XMP example code for a loop directive and a loop statement.

(a) User code
1 #pragma xmp loop (i,j) on t[i][j]
2 for(i=0;i<10;i++)
3   for(j=0;j<10;j++)
4     b[i][j] = ..;

(b) Translated code
1 for(i=0;i<5;i++)
2   for(j=0;j<5;j++)
3     *(_XMP_ADDR_b + i * (5) + j) = ..;

Figure 7: Code translation of loop directive

In Fig. 7a, the loop directive parallelizes the following nested loop statement according to the two-dimensional template t. In Fig. 7b, the initial values (i = 0, j = 0) and the end conditions (i < 5, j < 5) of the loop statements are emitted automatically as constants, which is possible in this case. If the iteration range differs for each node, the range is calculated just before the loop statement by an XMP runtime function, and the results are used as variables in the initial values and end conditions.
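To illustrate the kind of calculation involved, the following self-contained plain-C sketch (an illustration only, not Omni compiler output; the block rule shown is a common ceiling-division rule and may differ in detail from the rule used by the XMP runtime) computes a per-node iteration range for a block distribution.

#include <stdio.h>

/* Local block [*lb, *ub) of a block-distributed index range of size n
 * divided over nprocs nodes, for the node with the given rank. */
static void block_range(int n, int nprocs, int rank, int *lb, int *ub)
{
  int chunk = (n + nprocs - 1) / nprocs;          /* block size, rounded up */
  *lb = rank * chunk;
  *ub = (*lb + chunk < n) ? *lb + chunk : n;
}

int main(void)
{
  int n = 10, nprocs = 4;
  for (int rank = 0; rank < nprocs; rank++) {
    int lb, ub;
    block_range(n, nprocs, rank, &lb, &ub);
    printf("node %d: i = %d .. %d\n", rank, lb, ub - 1);
  }
  return 0;
}

For n = 10 and nprocs = 4 this prints the ranges 0-2, 3-5, 6-8, and 9-9; with the values fixed at compile time, as in Fig. 7, the translator can emit the resulting bounds directly as constants.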
4 IMPLEMENTATION OF LATTICE QCD
4.1 Overview of Lattice QCD
Quantum chromodynamics (QCD) is the fundamental theory describing the quark, a species of elementary particle, and the gluon, the particle that mediates the strong interaction. Lattice QCD is a discrete formulation of QCD that performs simulations on a four-dimensional lattice (time: T, and space: Z, Y, X). The quark degrees of freedom are represented as a field that has four spin components and three color components. The gluon is defined as a

3 x 3 complex matrix. In a Lattice QCD simulation, a linear equation for the matrix that represents the interaction between the quark and gluon fields must be solved many times.

4.2 Implementation in XcalableMP
We implemented a Lattice QCD mini-application based on the existing Lattice QCD application Bridge++ [2]; our code extracts the main function of Bridge++. Fig. 8 shows pseudo-code of our implementation, where U is a gluon, the other uppercase characters are quarks, and the lowercase characters are scalar variables. WD() in lines 5, 6, 11, and 12 is the Wilson Dirac operator [12], the main kernel of the Lattice QCD mini-application, which calculates the interactions of the quarks under the influence of the gluon. Our implementation uses the CG method to solve the linear equation. For the CG method, the following mathematical functions on the quark matrices are implemented: (1) the COPY method in lines 1-3 and 8 copies the right-hand side to the left-hand side; (2) the NORM method in lines 4, 9, and 17 calculates the L2-norm; (3) the AXPY method in lines 7, 15, 16, and 20 adds matrices; (4) the DOT method in line 13 calculates a dot product; and (5) the SCAL method in line 19 multiplies by a scalar. Lines 14, 18, and 21 are scalar-to-scalar operations.

 1 S = B
 2 R = B
 3 X = B
 4 sr = norm(S)
 5 T = WD(U,X)
 6 S = WD(U,T)
 7 R = R - S
 8 P = R
 9 rrp = rr = norm(R)
10 do{
11   T = WD(U,P)
12   V = WD(U,T)
13   pap = dot(V,P)
14   cr = rr/pap
15   X = cr * P + X
16   R = -cr * V + R
17   rr = norm(R)
18   bk = rr/rrp
19   P = bk * P
20   P = P + R
21   rrp = rr
22 }while(rr/sr > 1.E-16)

Figure 8: Lattice QCD pseudo-code

 1 typedef struct Quark {
 2   double v[4][3][2];
 3 } Quark_t;
 4 typedef struct Gluon {
 5   double v[3][3][2];
 6 } Gluon_t;
 7 Quark_t v[NT][NZ][NY][NX], tmp_v[NT][NZ][NY][NX];
 8 Gluon_t u[4][NT][NZ][NY][NX];
 9
10 #pragma xmp template t[NT][NZ]
11 #pragma xmp nodes p[PT][PZ]
12 #pragma xmp distribute t[block][block] onto p
13 #pragma xmp align v[i][j][*][*] with t[i][j]
14 #pragma xmp align tmp_v[i][j][*][*] with t[i][j]
15 #pragma xmp align u[*][i][j][*][*] with t[i][j]
16 #pragma xmp shadow v[1][1][0][0]
17 #pragma xmp shadow tmp_v[1][1][0][0]
18 #pragma xmp shadow u[0][1][1][0][0]

Figure 9: Declaration of distributed arrays for Lattice QCD

 1 void WD(Quark_t v_out[NT][NZ][NY][NX], const Gluon_t u[4][NT][NZ][NY][NX], const Quark_t v[NT][NZ][NY][NX]){
 2   :
 3 #pragma xmp loop (t,z) on t[t][z]
 4 #pragma omp parallel for collapse(4) ...
 5   for(t=0;t<NT;t++)
 6     for(z=0;z<NZ;z++)
 7       for(y=0;y<NY;y++)
 8         for(x=0;x<NX;x++){
           :

Figure 10: Portion of Wilson Dirac operator code

1 #pragma xmp reflect(v) width(/periodic/1,/periodic/1,0,0) orthogonal
2 #pragma xmp reflect(u) width(0,/periodic/1:0,/periodic/1:0,0,0) orthogonal
3 WD(tmp_v, u, v);
4 #pragma xmp reflect(tmp_v) width(/periodic/1,/periodic/1,0,0) orthogonal
5 WD(v, u, tmp_v);

Figure 11: Calling Wilson Dirac operator code

 1 void scal(Quark_t v[NT][NZ][NY][NX], const double a){
 2   :
 3 #pragma xmp loop (t,z) on t[t][z]
 4 #pragma omp parallel for collapse(4)
 5   for(t=0;t<NT;t++)
 6     for(z=0;z<NZ;z++)
 7       for(y=0;y<NY;y++)
 8         for(x=0;x<NX;x++)
 9           for(i=0;i<4;i++)
10             for(j=0;j<3;j++)
11               for(k=0;k<2;k++)
12                 v[t][z][y][x].v[i][j][k] *= a;
13 }

Figure 12: SCAL operator code

Fig. 9 shows the declaration of the distributed arrays for the quark and gluon fields. In lines 1-8, the quark and gluon structure arrays are declared; the last dimension "[2]" of each structure represents the real and imaginary parts of a complex number. Macros NT, NZ, NY, and NX are the numbers of elements along the T, Z, Y, and X axes.
Macros PT and PZ in the nodes directive of line 11 are the numbers of nodes along the T and Z axes; thus, the program is parallelized along the T and Z axes only. In the align directives of lines 13-15, an * in square brackets means that the dimension is not divided. In the shadow directives of lines 16-18, a 0 in square brackets means that the dimension has no halo region. Fig. 10 shows a portion of the Wilson Dirac operator code. In line 3, the loop directive parallelizes the outer two loop statements. In line 4, the parallel for directive with the collapse clause also parallelizes the loop statements that were parallelized by the loop directive. Inside the loop statements, each calculation needs neighboring elements of the distributed arrays. Fig. 11 shows the code that calls WD(), where reflect directives are used before WD() to synchronize the halo regions with the neighboring nodes. The width and orthogonal clauses restrict the transfer range of the halo region in order to reduce the communication time [4, 10]. The halo region of u is not synchronized before the second WD() because the values of u are not updated in WD(). Fig. 12 shows the SCAL method, whose loop statement comprises seven nested loops and is parallelized using XMP and OpenMP directives. The COPY, NORM, AXPY, and DOT methods are shown in Figs. 19-22 in the Appendix. The reason for specifying collapse(4) instead of collapse(7) in the parallel for directive is that collapse(4) performed better than collapse(7) in our preliminary evaluation.
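To clarify how the pseudo-code of Fig. 8 maps onto the methods listed above, the following self-contained plain-C sketch runs the same iteration with the Wilson Dirac operator replaced by a small symmetric positive-definite matrix, so it solves (A*A)X = B, mirroring the two WD() calls per step. The matrix, size, and threshold are illustrative only, and the halo exchanges of Fig. 11 have no counterpart here; the element-wise loops play the roles of COPY, AXPY, and SCAL.

#include <stdio.h>
#define N 4

static const double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};

static void apply(double out[N], const double in[N])   /* out = A * in */
{
  for (int i = 0; i < N; i++) {
    out[i] = 0.0;
    for (int j = 0; j < N; j++) out[i] += A[i][j] * in[j];
  }
}
static double norm(const double v[N])     /* sum of squares, as in Fig. 20 */
{
  double a = 0.0;
  for (int i = 0; i < N; i++) a += v[i] * v[i];
  return a;
}
static double dot(const double v[N], const double w[N])
{
  double a = 0.0;
  for (int i = 0; i < N; i++) a += v[i] * w[i];
  return a;
}

int main(void)
{
  double B[N] = {1, 2, 3, 4};
  double X[N], R[N], S[N], P[N], T[N], V[N];
  for (int i = 0; i < N; i++) { S[i] = B[i]; R[i] = B[i]; X[i] = B[i]; }
  double sr = norm(S);
  apply(T, X); apply(S, T);                       /* T = A*X; S = A*T      */
  for (int i = 0; i < N; i++) R[i] -= S[i];       /* R = R - S             */
  for (int i = 0; i < N; i++) P[i] = R[i];
  double rr = norm(R), rrp = rr;
  do {
    apply(T, P); apply(V, T);                     /* V = A*A*P             */
    double cr = rr / dot(V, P);
    for (int i = 0; i < N; i++) X[i] += cr * P[i];
    for (int i = 0; i < N; i++) R[i] -= cr * V[i];
    rr = norm(R);
    double bk = rr / rrp;
    for (int i = 0; i < N; i++) P[i] = bk * P[i] + R[i];
    rrp = rr;
  } while (rr / sr > 1.e-16);
  for (int i = 0; i < N; i++) printf("X[%d] = %f\n", i, X[i]);
  return 0;
}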

5 PERFORMANCE EVALUATION
5.1 Tuning Omni XcalableMP Compiler
In our preliminary evaluation using Omni compiler 1.2.2, the performance of the mathematical functions deteriorated. Fig. 13 shows a portion of the SCAL code of Fig. 12 as translated by Omni compiler 1.2.2, where NT, NZ, NY, and NX are 32 and PT and PZ are 2.

1 #pragma omp parallel for collapse(4)
2 for(t=0;t<16;t++)
3   for(z=0;z<16;z++)
4     for(y=0;y<32;y++)
5       for(x=0;x<32;x++)
6         for(i=0;i<4;i++)
7           for(j=0;j<3;j++)
8             for(k=0;k<2;k++)
9               (*((*((*(((_XMP_ADDR_v + (t+1)*(16+2)*(32)*(32) + (z+1)*(32)*(32) + y*(32) + x)->v) + i)) + j)) + k)) *= a;

Figure 13: Translated SCAL operator code

In line 9, 1 is added to t and z to account for the halo region, and the 2 in the (16+2) term is the total width of the halo on both sides. Because the distributed array v is expressed as a one-dimensional array, it is difficult for a native compiler to optimize the code effectively. In order to solve this problem, the code translation of the Omni compiler has been changed so that it leaves the size information of each dimension just before a target loop statement whenever possible (for example, when the node set is determined dynamically, the new code translation is not performed). Fig. 14 shows a portion of the SCAL code of Fig. 12 as translated by the new Omni compiler. In line 1, a new pointer _XMP_MULTI_ADDR_v, which carries the size of each dimension, is declared and set to the head of the distributed array v. In the loop statement, operations are performed through this new pointer.

 1 Quark_t (*_XMP_MULTI_ADDR_v)[16+2][32][32] = (Quark_t (*)[16+2][32][32])_XMP_ADDR_v;
 2 #pragma omp parallel for collapse(4)
 3 for(t=0;t<16;t++)
 4   for(z=0;z<16;z++)
 5     for(y=0;y<32;y++)
 6       for(x=0;x<32;x++)
 7         for(i=0;i<4;i++)
 8           for(j=0;j<3;j++)
 9             for(k=0;k<2;k++)
10               (*((*((*(((&(_XMP_MULTI_ADDR_v[t+1][z+1][y][x]))->v) + i)) + j)) + k)) *= a;

Figure 14: New translated SCAL operator code
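The effect of this change can be seen in the following self-contained plain-C sketch (an illustration, not actual Omni compiler output): the same heap block is accessed once through a flat pointer with hand-computed offsets, as in Fig. 13, and once through a pointer type that carries the trailing dimension sizes, as in Fig. 14, which gives the native compiler the dimension information it needs for optimization.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  double *flat = calloc(5 * 5, sizeof(double));   /* like _XMP_ADDR_b        */
  double (*multi)[5] = (double (*)[5])flat;       /* dimension-carrying view */

  for (int i = 0; i < 5; i++)
    for (int j = 0; j < 5; j++)
      *(flat + i * 5 + j) = i + j;                /* flat access (old style) */

  for (int i = 0; i < 5; i++)
    for (int j = 0; j < 5; j++)
      multi[i][j] *= 2.0;                         /* dimension-aware access  */

  printf("%f\n", multi[4][4]);
  free(flat);
  return 0;
}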
5.2 Performance of Mathematical Functions
In order to investigate the influence of the code translation change described in Section 5.1, we evaluate the performance of the mathematical functions in Figs. 12 and 19-22. We use a single compute node of Oakforest-PACS and also of the COMA system (COMA), which serves as a general Linux cluster system. Tables 1 and 2 show the evaluation environments.

Table 1: Specification of Oakforest-PACS
CPU       Intel Xeon Phi 7250, 1.4 GHz, 68 cores
Memory    MCDRAM 16 GB, DDR4 96 GB
Network   Intel Omni-Path Host Fabric Interface, 12.5 GB/s
Software  intel/, intelmpi/

Table 2: Specifications of COMA
CPU       Intel Xeon E5-2670v2, 2.5 GHz, 10 cores, 2 sockets
Memory    DDR3 64 GB
Network   InfiniBand FDR, 7 GB/s
Software  intel/17.0.5, intelmpi/

We use two program sizes: (NT, NZ, NY, NX) = (32, 32, 32, 32) and (2, 2, 32, 32). The compile options on Oakforest-PACS are -O2 -mcmodel=medium -axMIC-AVX512, whereas those on COMA are -O2 -mcmodel=medium. The number of processes is 1 on both machines, and the number of threads is 64 on Oakforest-PACS and 10 on COMA. The memory mode of Oakforest-PACS is cache mode, in which MCDRAM works as a cache for the DDR4 memory. We use the Intel compiler as the backend compiler of the Omni compiler. We execute each mathematical function repeatedly and measure the elapsed time.
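A minimal sketch of the kind of measurement loop implied here is shown below (illustrative only: f() stands in for one of the mathematical functions, and the values of REPEAT and N are placeholders, not the settings used in the paper).

#include <stdio.h>
#include <omp.h>
#define REPEAT 1000
#define N 1024

static double a[N];

static void f(double s)             /* stand-in for a function under test */
{
#pragma omp parallel for
  for (int i = 0; i < N; i++)
    a[i] *= s;
}

int main(void)
{
  for (int i = 0; i < N; i++) a[i] = 1.0;

  double t0 = omp_get_wtime();
  for (int r = 0; r < REPEAT; r++)
    f(1.000001);                    /* execute the function repeatedly    */
  double elapsed = omp_get_wtime() - t0;

  printf("elapsed: %f s (%e s per call)\n", elapsed, elapsed / REPEAT);
  return 0;
}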

Figure 15: Performance of mathematical functions on Oakforest-PACS for (a) (NT, NZ, NY, NX) = (32, 32, 32, 32) and (b) (NT, NZ, NY, NX) = (2, 2, 32, 32)

Figure 16: Performance of mathematical functions on COMA for (a) (NT, NZ, NY, NX) = (32, 32, 32, 32) and (b) (NT, NZ, NY, NX) = (2, 2, 32, 32)

Figs. 15 and 16 show the performance results. For comparison purposes, the figures also include the results of code that uses only OpenMP and is compiled directly by the Intel compiler. Fig. 17 shows the OpenMP-only version of SCAL.

 1 void scal(Quark_t v[NT+2][NZ+2][NY][NX], const double a){
 2   :
 3 #pragma omp parallel for collapse(4)
 4   for(t=1;t<NT+1;t++)
 5     for(z=1;z<NZ+1;z++)
 6       for(y=0;y<NY;y++)
 7         for(x=0;x<NX;x++)
 8           for(i=0;i<4;i++)
 9             for(j=0;j<3;j++)
10               for(k=0;k<2;k++)
11                 v[t][z][y][x].v[i][j][k] *= a;
12 }

Figure 17: SCAL operator code using OpenMP

Note that 1 is added in the loop conditions of the outer two loop statements in lines 4 and 5 because the arrays include halo regions, which reflects the natural way of writing this code by hand. All performance results of the new Omni compiler are equal to or better than those of Omni compiler 1.2.2; the performance improvement of SCAL is particularly remarkable. In addition, these results show that the performance of the new Omni compiler is close to that of the OpenMP-only code.

5.3 Performance of Lattice QCD
This section evaluates the performance of the Lattice QCD mini-application written in XMP, using Omni compiler 1.2.2 and the new Omni compiler. For comparison purposes, an implementation of the application written in a combination of MPI and OpenMP (MPI+OpenMP) is also evaluated. We assigned one process per compute node on Oakforest-PACS and two processes per compute node on COMA, because a COMA node has two CPU sockets. We ran up to 256 processes on both systems. The compile options, the number of threads, the memory mode, and the backend compiler are the same as those in Section 5.2. We execute the Lattice QCD code with strong scaling and (NT, NZ, NY, NX) = (32, 32, 32, 32).

Figure 18: Performance (GFlops) of the Lattice QCD mini-application on (a) Oakforest-PACS and (b) COMA, for 1x1 to 16x16 processes (PT x PZ), comparing Omni compiler 1.2.2, the new Omni compiler, and MPI+OpenMP

Fig. 18 shows the performance results. All the performance results using the new Omni compiler are better than those using Omni compiler 1.2.2. In addition, XMP with the new Omni compiler achieves more than 94% of the performance of MPI+OpenMP. To examine why the performance of XMP is slightly worse than that of MPI+OpenMP at 256 processes, note that, as shown in Figs. 15b and 16b, the performance of the mathematical functions other than COPY with the new Omni compiler is almost the same as or slightly worse than that of the OpenMP-only code. Although the performance of COPY with the new Omni compiler is better, COPY is called only a small number of times, as shown in Fig. 8.

6 SUMMARY AND FUTURE WORK
In this paper, we examined the tuned performance of the Omni compiler for the XMP PGAS language on a Knights Landing cluster system and a general Linux cluster system. Specifically, we evaluated the XMP performance through an implementation of a

Lattice QCD mini-application. As a result of evaluating the Omni compiler before and after the performance tuning, we found that the performance of the Omni compiler after tuning was superior. The results also showed that the Lattice QCD mini-application written in XMP using the tuned Omni compiler achieves more than 94% of the performance of the implementation written in MPI+OpenMP. Future research will examine the performance difference between XMP and MPI+OpenMP more deeply and will also evaluate the performance of applications other than the Lattice QCD mini-application.

ACKNOWLEDGEMENTS
We would like to extend grateful thanks to Hideo Matsufuru, who provided us with the Lattice QCD code. This research used the Oakforest-PACS and COMA systems provided by the Interdisciplinary Computational Science Program in the Center for Computational Sciences, University of Tsukuba. The work was supported by the Japan Science and Technology Agency, Core Research for Evolutional Science and Technology program entitled "Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era" in the research area of "Development of System Software Technologies for Post-Peta Scale High Performance Computing."

A MATHEMATICAL FUNCTIONS

 1 void copy(Quark_t v[NT][NZ][NY][NX], const Quark_t w[NT][NZ][NY][NX]){
 2   :
 3 #pragma xmp loop (t,z) on t[t][z]
 4 #pragma omp parallel for collapse(4)
 5   for(t=0;t<NT;t++)
 6     for(z=0;z<NZ;z++)
 7       for(y=0;y<NY;y++)
 8         for(x=0;x<NX;x++)
 9           for(i=0;i<4;i++)
10             for(j=0;j<3;j++)
11               for(k=0;k<2;k++)
12                 v[t][z][y][x].v[i][j][k] = w[t][z][y][x].v[i][j][k];
13 }

Figure 19: COPY operator code

 1 double norm(const Quark_t v[NT][NZ][NY][NX]){
 2   :
 3   double a = 0.0;
 4 #pragma xmp loop (t,z) on t[t][z] reduction(+:a)
 5 #pragma omp parallel for collapse(4) reduction(+:a)
 6   for(t=0;t<NT;t++)
 7     for(z=0;z<NZ;z++)
 8       for(y=0;y<NY;y++)
 9         for(x=0;x<NX;x++)
10           for(i=0;i<4;i++)
11             for(j=0;j<3;j++)
12               for(k=0;k<2;k++)
13                 a += v[t][z][y][x].v[i][j][k] * v[t][z][y][x].v[i][j][k];
14   return a;
15 }

Figure 20: NORM operator code

 1 void axpy(Quark_t v[NT][NZ][NY][NX], const double a, const Quark_t w[NT][NZ][NY][NX]){
 2   :
 3 #pragma xmp loop (t,z) on t[t][z]
 4 #pragma omp parallel for collapse(4)
 5   for(t=0;t<NT;t++)
 6     for(z=0;z<NZ;z++)
 7       for(y=0;y<NY;y++)
 8         for(x=0;x<NX;x++)
 9           for(i=0;i<4;i++)
10             for(j=0;j<3;j++)
11               for(k=0;k<2;k++)
12                 v[t][z][y][x].v[i][j][k] += a * w[t][z][y][x].v[i][j][k];
13 }

Figure 21: AXPY operator code

 1 double dot(const Quark_t v[NT][NZ][NY][NX], const Quark_t w[NT][NZ][NY][NX]){
 2   :
 3   double a = 0.0;
 4 #pragma xmp loop (t,z) on t[t][z] reduction(+:a)
 5 #pragma omp parallel for collapse(4) reduction(+:a)
 6   for(t=0;t<NT;t++)
 7     for(z=0;z<NZ;z++)
 8       for(y=0;y<NY;y++)
 9         for(x=0;x<NX;x++)
10           for(i=0;i<4;i++)
11             for(j=0;j<3;j++)
12               for(k=0;k<2;k++)
13                 a += v[t][z][y][x].v[i][j][k] * w[t][z][y][x].v[i][j][k];
14   return a;
15 }

Figure 22: DOT operator code

REFERENCES
[1] Andrew I. Stone et al. 2011. Evaluating Coarray Fortran with the CGPOP Miniapp. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models (PGAS).
[2] Bridge++ Project. 2017. Bridge++. Lattice-code/.
[3] F. Cantonnet et al. 2004. Productivity analysis of the UPC language. In 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[4] XcalableMP Specification Working Group. 2017. XcalableMP Specification.
[5] Hiroyuki Miyazaki et al. 2012. Overview of the K computer. FUJITSU SCIENTIFIC and TECHNICAL JOURNAL 48, 3 (2012).
[6] Information Technology Center, The University of Tokyo. 2017. Oakforest-PACS Supercomputer System.
[7] Katherine Yelick et al. 2007. Productivity and Performance Using Partitioned Global Address Space Languages. In PASCO '07: Proceedings of the 2007 International Workshop on Parallel Symbolic Computation.
[8] Masahiro Nakao et al. 2012. Productivity and Performance of Global-View Programming with XcalableMP PGAS Language. In Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID '12).
[9] Masahiro Nakao et al. 2017. Implementation and evaluation of the HPC Challenge benchmark in the XcalableMP PGAS language. The International Journal of High Performance Computing Applications (2017).
[10] Masahiro Nakao et al. 2017. Implementing Lattice QCD Application with XcalableACC Language on Accelerated Cluster. In 2017 IEEE International Conference on Cluster Computing (CLUSTER).
[11] Omni Compiler Project. 2017. Omni Compiler.
[12] Wilson, K. G. 1974. Confinement of quarks. Phys. Rev. D 10 (Oct 1974), Issue 8.


Tightly Coupled Accelerators with Proprietary Interconnect and Its Programming and Applications 1 Tightly Coupled Accelerators with Proprietary Interconnect and Its Programming and Applications Toshihiro Hanawa Information Technology Center, The University of Tokyo Taisuke Boku Center for Computational

More information

MILC Performance Benchmark and Profiling. April 2013

MILC Performance Benchmark and Profiling. April 2013 MILC Performance Benchmark and Profiling April 2013 Note The following research was performed under the HPC Advisory Council activities Special thanks for: HP, Mellanox For more information on the supporting

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015

Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015 Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015 Abstract As both an OpenMP and OpenACC insider I will present my opinion of the current status of these two directive sets for programming

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture

More information

Deutscher Wetterdienst

Deutscher Wetterdienst Accelerating Work at DWD Ulrich Schättler Deutscher Wetterdienst Roadmap Porting operational models: revisited Preparations for enabling practical work at DWD My first steps with the COSMO on a GPU First

More information

Efficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI

Efficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI Efficient AMG on Hybrid GPU Clusters ScicomP 2012 Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann Fraunhofer SCAI Illustration: Darin McInnis Motivation Sparse iterative solvers benefit from

More information

OpenACC (Open Accelerators - Introduced in 2012)

OpenACC (Open Accelerators - Introduced in 2012) OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in

More information

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid

More information

Scientific Programming in C XIV. Parallel programming

Scientific Programming in C XIV. Parallel programming Scientific Programming in C XIV. Parallel programming Susi Lehtola 11 December 2012 Introduction The development of microchips will soon reach the fundamental physical limits of operation quantum coherence

More information

OpenMP Algoritmi e Calcolo Parallelo. Daniele Loiacono

OpenMP Algoritmi e Calcolo Parallelo. Daniele Loiacono OpenMP Algoritmi e Calcolo Parallelo References Useful references Using OpenMP: Portable Shared Memory Parallel Programming, Barbara Chapman, Gabriele Jost and Ruud van der Pas OpenMP.org http://openmp.org/

More information

Lecture 4: OpenMP Open Multi-Processing

Lecture 4: OpenMP Open Multi-Processing CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP

More information

Fig. 1. Omni OpenMP compiler

Fig. 1. Omni OpenMP compiler Performance Evaluation of the Omni OpenMP Compiler Kazuhiro Kusano, Shigehisa Satoh and Mitsuhisa Sato RWCP Tsukuba Research Center, Real World Computing Partnership 1-6-1, Takezono, Tsukuba-shi, Ibaraki,

More information

MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); double a[100]; #pragma acc data copy(a) { #pragma acc parallel loop for(i=0;i<100;i++)

MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); double a[100]; #pragma acc data copy(a) { #pragma acc parallel loop for(i=0;i<100;i++) 2 MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); double a[100]; #pragma acc data copy(a) { #pragma acc parallel loop for(i=0;i

More information