Technical Report. Document Id.: CESGA-2013-001. Date: July 28th, 2013. Responsible: Andrés Gómez. Status: FINAL


Abstract: This technical report presents CESGA's experience of porting three applications to the new Intel Xeon Phi coprocessor. The objective of these experiments was to evaluate the complexity of such work, to check the capabilities of this new technology, and to identify its limitations for executing scientific and technical software. The results show that migrating these applications to the Xeon Phi can be done easily, but obtaining real performance improvements demands deeper refactoring.

Evaluation of Intel Xeon Phi to execute easily scientific applications. José Carlos Mouriño Gallego, Carmen Cotelo Queijo, Andrés Gómez Tato, Aurelio Rodríguez López


Index

1 Introduction
2 INTEL Xeon Phi
3 Applications
3.1 CalcunetW
3.2 GammaMaps
3.3 ROMS
4 Results
4.1 Infrastructure
4.2 CalcunetW
4.3 GammaMaps
4.4 ROMS
5 Conclusions

Figures

Figure 1: The first generation Intel Xeon Phi product codenamed Knights Corner
Figure 2: Example of compact policy in a 4-core coprocessor for 8 threads
Figure 3: Example of scatter policy in a 4-core coprocessor for 8 threads
Figure 4: Example of balanced policy in a 4-core coprocessor for 8 threads with granularity fine
Figure 5: Example of balanced policy in a 4-core coprocessor for 8 threads with granularity core
Figure 6: On the left, a voxel with the calculated dose. On the right, example of meshes for reference (blue) and test (green) doses
Figure 7: Grid domain decomposition
Figure 8: Example of tiled grid
Figure 9: Execution time with one random matrix
Figure 10: Scalability with one random matrix
Figure 11: Parallel performance with increasing number of random matrices
Figure 12: Speed-up for the local host with different affinities. The X and Y loops were collapsed
Figure 13: Elapsed time for the test case on the host
Figure 14: Elapsed time for the offload method

Figure 15: Execution times for the different phases
Figure 16: Elapsed times in the second test-bed. Xeon E5-2680 + Xeon Phi 60 cores
Figure 17: Offload execution times for the 60-core Xeon Phi
Figure 18: Comparative results

Tables

Table 1: Host characteristics for testbed 1
Table 2: Intel Xeon Phi technical characteristics for testbed 1
Table 3: Host characteristics for testbed 2
Table 4: Intel Xeon Phi technical characteristics for testbed 2
Table 5: Grid size for ROMS benchmark
Table 6: MPI benchmark results

1 Introduction

Heterogeneous computing with multiple levels of parallelism is a leading topic in the design of future exascale systems. Accelerators such as current-generation GPGPUs offer relatively high bandwidth with lower relative power consumption than general-purpose processors. However, GPU-based acceleration requires special programming constructs (e.g. NVIDIA's CUDA language) for the accelerated work. With the release of the Intel Many Integrated Core (MIC) architecture, an additional coprocessor technology is available to the scientific community. This document reports on several early experiences of porting applications to the Intel Xeon Phi platform. An attractive feature of this architecture is its support for standard threading models such as OpenMP, which are already used by many scientific applications. In addition, the Xeon Phi platform is based on the x86 architecture, so C/C++ and FORTRAN kernels can be easily compiled for direct native execution on it.

The objective of this work was to check the programmability and usability of the new Intel Xeon Phi in different contexts: several programming languages (C and FORTRAN), use of the Intel Math Kernel Library (MKL) in different configurations, and use of MPI in a real application. The applications considered are taken from existing development efforts at CESGA: CalcunetW 1, an application developed in C which makes extensive use of the matrix multiplication BLAS routines included in MKL; GammaMaps, a FORTRAN application which calculates a figure of merit between two radiotherapy treatment doses; and ROMS, a FORTRAN oceanography application which was used to check MPI inside the Xeon Phi.

The remainder of the report is organized as follows. First, a brief description of the Intel MIC architecture is presented. In the next section, the applications used as test cases are briefly described. Finally, the results of the tests are presented, followed by a final section with the conclusions.

2 INTEL Xeon Phi

In this section, a brief description of the Intel Xeon Phi architecture is given. A more detailed architecture description can be found on the Intel website 2.

1 J.C. Mouriño, E. Estrada, A. Gómez. CalcuNetW: Calculate Measurements in Complex Networks, Informe Técnico CESGA-2005-003
2 http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner

Intel Many Integrated Core (Intel MIC) is a multiprocessor computer architecture developed by Intel. It combines in a coprocessor several modified Intel CPU cores which use the x86 instruction set, executed in-order with a short pipeline. Each core includes a new 512-bit SIMD vector processing unit (VPU), a dedicated 512 KB L2 cache, and 32 KB L1 caches for data and instructions with their TLBs. The L2 cache is kept fully coherent among all the cores. The VPU can execute up to 32 single-precision or 16 double-precision floating point operations per cycle with Fused Multiply-Add (FMA, which calculates a*b+c as a single instruction), or half of them when FMA cannot be applied. All the floating point operations follow IEEE 754 arithmetic, making this system suitable for scientific HPC. Each core executes 4 hardware threads, so one 60-core Xeon Phi can run up to 240 threads simultaneously. The cores are connected to a high-speed bidirectional ring interconnect which gives them access to the RAM memory (up to 8 GB), through the directly attached memory controllers, and to the PCIe bus. The RAM memory is based on GDDR5 technology.

Figure 1: The first generation Intel Xeon Phi product codenamed Knights Corner

The Intel Xeon Phi is provided as a coprocessor unit attached to the PCIe bus of the host. The board loads a dedicated Linux operating system and can be configured to have its own IP address and services, so the final user can log in or copy data using standard Linux commands such as ssh or scp.
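As a back-of-the-envelope check of the compute figures above (using the 60-core, 1.053 GHz part that appears later as testbed 2), the theoretical peak is

\mathrm{Peak_{DP}} = 60\ \text{cores} \times 1.053\ \text{GHz} \times 16\ \tfrac{\text{flop}}{\text{cycle}} \approx 1011\ \text{GFLOP/s},

i.e. roughly 1 TFLOP/s in double precision, and about twice that in single precision, provided the FMA units and the hardware threads can be kept busy.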

The Xeon Phi filesystem is mounted directly on the RAM memory; as a consequence, the copy of the operating system and the loading of data files reduce the amount of memory available to applications. As an alternative, an NFS filesystem can be used to access the host filesystem or, when working in offload mode (see later), special directories are mounted automatically. Intel provides C, C++ and FORTRAN compilers, mathematical libraries (MKL), debuggers, and other development tools. The compilers can generate applications which can be executed in two modes:

- Native. The generated binary can only be executed on the Intel Xeon Phi. If it is compiled on the host, the executable must be transferred to the coprocessor for execution. To simplify this step, Intel provides the tool micnativeloadex, which copies the executable and the needed libraries to the Xeon Phi before executing it.
- Offload. The application is executed on the host, but some sections are selected to run on the Xeon Phi using pragmas. The compiler automatically generates the binary code for these sections on the coprocessor, and they are transferred automatically. Although the programmer can select, also using pragmas, which data should be transferred from the host memory to the board and back, the compiler can detect them automatically in several cases, reducing the complexity of porting applications.

Some functions of the MKL library support offload mode directly. This mode of execution is selected by external environment variables 3:

- MKL_MIC_ENABLE. If set to 1, the MKL library uses the Xeon Phi coprocessor with automatic offload.
- MKL_HOST_WORKDIVISION. A number between 0.0 and 1.0 telling the MKL library how much of the work must be done on the host.
- MKL_MIC_WORKDIVISION. A number between 0.0 and 1.0 selecting the fraction of work to be done on the Intel Xeon Phi coprocessor. If more than one board is attached to the host, it can be set per board using MKL_MIC_<BOARD NUMBER>_WORKDIVISION, where BOARD NUMBER is the id of the Xeon Phi coprocessor (starting at 0).
- MKL_MIC_MAX_MEMORY. Limits the amount of memory to be used on the MIC when automatic offload is used.

3 http://software.intel.com/sites/products/documentation/doclib/iss/2013/mkl/mkl_userguide_lnx/guid-3dc4fc7d-A1E4-423D-9C0C-06AB265FFA86.htm
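As a minimal sketch of the compiler-assisted offload model (and of the kind of DGEMM offload applied later to CalcunetW), the following C fragment marks an MKL matrix multiplication for execution on the coprocessor. The function name and the explicit in/out clauses are illustrative and not taken from the actual application:

    #include <mkl.h>

    /* Multiply two n x n matrices on the coprocessor using compiler-assisted
       offload (Intel compiler + MKL required).  The in/out clauses state
       which buffers are copied to the card and back; in simple cases the
       compiler can infer them automatically. */
    void offload_dgemm(const double *a, const double *b, double *c, int n)
    {
        #pragma offload target(mic) in(a : length(n*n)) in(b : length(n*n)) \
                                    out(c : length(n*n))
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    }

The same call can also be offloaded without touching the code at all by exporting MKL_MIC_ENABLE=1 (automatic offload), or pushed entirely onto the card by additionally setting MKL_MIC_WORKDIVISION=1.0, as described above.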

Because it is a multi-threaded environment, the affinity of the threads to the cores is an important issue when an OpenMP application is executed. It can be selected with an environment variable (prefix_KMP_AFFINITY, where prefix is set using another environment variable, MIC_ENV_PREFIX). The Intel Xeon Phi supports three policies and two granularities. The policies are:

- Compact. The threads are placed in order on the cores, as compactly as possible. So an 8-thread application will use only two cores, because each core can execute up to 4 threads (Figure 2).
- Scatter. Threads are spread as much as possible among the cores, in order, avoiding the sharing of the same core if possible (see Figure 3 for an example with 4 cores and 8 threads).
- Balanced. This mode, which is not supported on hosts, is similar to scatter, but if the number of requested threads is larger than the number of cores, the threads are placed grouping those with the nearest tags. For example, for an 8-thread application on a 4-core Xeon Phi, threads 0 and 1 will share the same core (see Figure 4).

Figure 2: Example of compact policy in a 4-core coprocessor for 8 threads

Figure 3: Example of scatter policy in a 4-core coprocessor for 8 threads

Figure 4: Example of balanced policy in a 4-core coprocessor for 8 threads with granularity fine

Figure 5: Example of balanced policy in a 4-core coprocessor for 8 threads with granularity core

The granularities are:

- Fine (or thread). Each thread is bound to a single hardware thread.
- Core. The threads are bound to a core and can migrate from one of its hardware threads to another. See Figure 5 for an example with the balanced policy.

More information about the Xeon Phi and how to program it is available in the Intel Xeon Phi Coprocessor System Software Developers Guide 4.
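To see where threads actually land under a given policy and granularity, a small native probe such as the following can be used (a sketch only; the build and run lines in the comments follow the environment variables described above):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <omp.h>

    /* Prints the hardware thread on which each OpenMP thread runs.
       Build natively for the coprocessor, e.g.:  icc -openmp -mmic probe.c
       Run on the card with, e.g.:  KMP_AFFINITY=balanced,granularity=fine ./a.out
       From the host (offload mode) the same setting would be passed as
       MIC_ENV_PREFIX=MIC and MIC_KMP_AFFINITY=balanced,granularity=fine. */
    int main(void)
    {
        #pragma omp parallel
        printf("OpenMP thread %2d of %2d on hardware thread %3d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
        return 0;
    }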

3 Applications

3.1 CalcunetW

Complex networks, consisting of sets of nodes or vertices joined together in pairs by links or edges, appear frequently in various technological, social and biological scenarios. These networks include the Internet, the World Wide Web, social networks, scientific collaboration networks, lexicon or semantic networks, neural networks, food webs, metabolic networks and protein-protein interaction networks. They have been shown to share global statistical features, such as the small-world and scale-free effects, as well as the clustering property. The first feature is simply the fact that the average distance between nodes in the network is short and usually scales logarithmically with the total number of nodes. The second is a characteristic of several real-world networks in which there are many nodes with low degree and only a small number with high degree (the so-called hubs). The node degree is simply the number of ties a node has with other nodes; in scale-free networks, the node degree follows a power-law distribution. Finally, clustering is a property of two linked nodes that are each linked to a third node: these three nodes form a triangle, and clustering is frequently measured by counting the number of triangles in the network.

In order to calculate some measurements on complex networks, a simple program has been developed. This application calculates several characterization measurements of a given network and compares them with a number of random networks given by the user. The measurements calculated by the program are: the Subgraph Centrality (SC), SC odd, SC even, Bipartivity, Network Communicability (C(G)) and Network Communicability for Connected Nodes. A detailed description of this application can be found in the CalcuNetW technical report 5.

A given number of random networks are generated for comparison purposes, and the average value of the target measurements is computed, together with the mean squared error. As the number of random networks grows, the computational time increases, but the mean results improve. The random networks are generated under some restrictions: they must have the same number of nodes and edges as the original network, they must be connected, and the node degrees must be the same as in the original network.

The program is provided as a simple executable developed in C, using the LAPACK and BLAS libraries. The program was initially not parallelized, but can make use of the parallel capabilities of the MKL LAPACK and BLAS libraries.

4 http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-system-software-developers-guide
5 https://www.cesga.es/es/biblioteca/downloadasset/id/57
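As background (this definition is standard in the literature and is not detailed in this report), the measurements above are functions of the network's adjacency matrix A; for instance, the subgraph centrality of node i is usually defined as

SC(i) = \sum_{k=0}^{\infty} \frac{(A^k)_{ii}}{k!} = \left( e^{A} \right)_{ii},

which illustrates why dense matrix operations from BLAS/LAPACK dominate the computation.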

However, the application is easily parallelizable with OpenMP, and we have used the Intel Xeon Phi to explore its parallel capabilities to speed up the process. The size of the matrices can be huge in real cases; this has been taken into account for the movement of data between the host and the Xeon Phi and for the distribution of tasks among the Xeon Phi cores.

3.2 GammaMaps

In cancer radiation therapy treatments, especially in complex cases, the calculated doses are verified against experimental data before the treatment itself is delivered to the patient. There are many figures of merit for checking the quality of the proposed treatment. One of them is the gamma index 6, which generates a difference map between the measured and calculated doses. The gamma index can also be used to compare two doses calculated with different algorithms, as the eIMRT 7 project does. In this case, the doses calculated for the treatment (reference dose) are compared with those obtained by simulating the linear accelerator and the patient's body using Monte Carlo techniques (test dose). For each case, the body of the patient is divided into small cubes (called voxels) and the dose deposited in each one is calculated. As a consequence, the full patient is represented as a three-dimensional set of voxels with information about the dose deposited in their volume. In Figure 6, the meshes for the reference (blue) and test (green) doses are shown. The two grids can differ in the size of the voxels and in the position of their edges.

6 D. A. Low, W. B. Harms, S. Mutic, and J. A. Purdy, "A technique for the quantitative evaluation of dose distributions", Medical Physics, vol. 25, no. 5, pp. 656-661, May 1998.
7 D. M. González-Castaño, J. Pena, F. Gómez, A. Gago-Arias, F. J. González-Castaño, D. A. Rodríguez-Silva, A. Gómez, C. Mouriño, M. Pombar, and M. Sánchez, "eIMRT: a web platform for the verification and optimization of radiation treatment plans", Journal of Applied Clinical Medical Physics, vol. 10, no. 3, p. 2998, Jan. 2009.

Figure 6: On the left, a voxel with the calculated dose. On the right, example of meshes for reference (blue) and test (green) doses.

The gamma index is defined for each voxel of the reference dose as

\gamma(\mathbf{r}_r) = \min_{\mathbf{r}_t \in D_{test}} \sqrt{ \frac{\lVert \mathbf{r}_t - \mathbf{r}_r \rVert^2}{R^2} + \frac{\bigl( d(\mathbf{r}_t) - d(\mathbf{r}_r) \bigr)^2}{D^2} }

where d(r_t) is the test dose for the voxel at position r_t, d(r_r) is the reference dose for the voxel at position r_r, R is a parameter indicating the desired geometric distance-to-agreement between doses (usually 3 mm), and D is the maximum allowed difference between doses (commonly 3% of the maximum dose to be delivered to the tumour). Any value below one is considered acceptable, while values higher than 1 should be investigated.

GammaMaps is an application developed for the eIMRT 8 project that calculates this gamma index for two- and three-dimensional data. It is written in FORTRAN and uses a geometric algorithm to speed up the process 9. It is parallelized using OpenMP.

8 http://eimrt.cesga.es
9 T. Ju, T. Simpson, J. O. Deasy, and D. A. Low, "Geometric interpretation of the γ dose distribution comparison technique: Interpolation-free calculation", Medical Physics, vol. 35, no. 3, p. 879, 2008.
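GammaMaps itself uses the geometric algorithm cited above to avoid scanning every test voxel, but a direct, brute-force C transcription of the definition for a single reference voxel would look like the following sketch (the names and calling convention are illustrative only; the real code is FORTRAN):

    #include <math.h>
    #include <float.h>

    /* Brute-force gamma index for one reference voxel at position rr with
       dose dose_ref, against n_test test voxels (positions rt, doses
       dose_test).  R is the distance-to-agreement (e.g. 3 mm) and D the
       dose-difference criterion (e.g. 3% of the maximum dose). */
    double gamma_index(const double rr[3], double dose_ref,
                       const double (*rt)[3], const double *dose_test,
                       int n_test, double R, double D)
    {
        double best = DBL_MAX;
        for (int i = 0; i < n_test; ++i) {
            double dx = rt[i][0] - rr[0];
            double dy = rt[i][1] - rr[1];
            double dz = rt[i][2] - rr[2];
            double dd = dose_test[i] - dose_ref;
            double g2 = (dx*dx + dy*dy + dz*dz) / (R*R) + (dd*dd) / (D*D);
            if (g2 < best) best = g2;
        }
        return sqrt(best);   /* gamma <= 1 is considered acceptable */
    }

The OpenMP parallelization mentioned above is applied over the reference-voxel loops (presumably the X and Y loops referred to in the caption of Figure 12).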

3.3 ROMS

The Regional Ocean Modelling System (ROMS) is a software package that models and simulates an ocean region using a finite difference grid and time stepping. It is a complex model with many options and capabilities. The code is written in F90/F95 and uses C-preprocessing flags to activate the various physical and numerical options. The simulations can take from hours to days to complete due to the compute-intensive nature of the software, and the size and resolution of the simulations are constrained by the performance of the computing hardware used.

Figure 7: Grid domain decomposition (Source: www.myroms.org)

ROMS can be run in parallel with OpenMP or MPI. It does not use the MKL libraries, nor is it an easy case for introducing pragmas to activate the offload mode. Therefore, this test case was used to try the Message-Passing Interface (MPI).

An example of a grid domain decomposition with tiles is shown in Figure 8, with one colour per tile. The overlap areas are known as ghost points. Each tile is an MPI process and contains the information needed to time-step all of its interior points. For MPI jobs, the ghost points need to be updated between interior point computations.

Figure 8: Example of tiled grid 10

The main characteristics of ROMS with MPI are:

- The master process (0) does all the I/O (NetCDF). On input, it sends the tiled fields to the respective processes; it collects the tiled fields for output.
- ROMS needs to pass many small MPI messages.
- The product NtileI * NtileJ must match the number of MPI processes (the more MPI processes, the fewer points in each tile, and the more communication is needed to exchange neighbour information). A sketch of this ghost-point exchange is given below.

10 Source: www.myroms.org
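The ghost-point updates mentioned above are ordinary halo exchanges between neighbouring tiles. The following C fragment is only a schematic of that pattern (ROMS is FORTRAN and its actual exchange routines are more elaborate); the buffer names and neighbour ranks are illustrative:

    #include <mpi.h>

    /* Exchange one ghost strip with the tiles to the west and east.
       send_w/recv_w etc. are contiguous buffers of 'count' doubles already
       packed from the tile edges; neighbours set to MPI_PROC_NULL are
       skipped automatically by MPI_Sendrecv. */
    void exchange_ghost_strips(double *send_w, double *recv_w,
                               double *send_e, double *recv_e,
                               int count, int west, int east, MPI_Comm comm)
    {
        MPI_Sendrecv(send_w, count, MPI_DOUBLE, west, 0,
                     recv_e, count, MPI_DOUBLE, east, 0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(send_e, count, MPI_DOUBLE, east, 1,
                     recv_w, count, MPI_DOUBLE, west, 1,
                     comm, MPI_STATUS_IGNORE);
    }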

This test compares the host and hybrid modes because real ROMS simulations involve input and output files of several GB managed by the master process (process 0).

4 Results

4.1 Infrastructure

The tests were performed during February 2013 on an infrastructure provided remotely by Intel. Two systems were used:

- A Xeon Phi with 61 cores (testbed 1). The characteristics of this system are shown in Table 1 for the host and in Table 2 for the Intel Xeon Phi coprocessor. CalcunetW and GammaMaps were tested on this system, and some execution attempts were made with ROMS. The maximum number of threads was set to 240, leaving the last core unused.
- A Xeon Phi with 60 cores (testbed 2). The characteristics of this system are shown in Table 3 for the host and in Table 4 for the Intel Xeon Phi coprocessor. GammaMaps, with some additional modifications explained below, and ROMS were tested on this system. The maximum number of threads was set to 240.

Host (testbed 1)
CPU Model: Intel Xeon CPU E5-2680 0 @ 2.70GHz
Nr. of cores: 16
Memory: 32788 MB
Operating System: Linux 2.6.32-279.el6.x86_64
Compiler Version: 2013U2

Table 1: Host characteristics for testbed 1

Intel Xeon Phi (testbed 1)
Model: Beta0 Engineering Sample
Nr. of cores: 61 at 1.09 GHz
Memory: 7936 MB
Operating System: MPSS Gold U1
Compiler Version: 2013U2
GDDR Technology: GDDR5
GDDR Frequency: 2750000 kHz

Table 2: Intel Xeon Phi technical characteristics for testbed 1

Host (testbed 2)
CPU Model: Intel Xeon CPU E5-2680
Nr. of cores: 16
Memory: 128 GB
Operating System: RHEL 6.3
Compiler Version: composer_xe_2013.3.163

Table 3: Host characteristics for testbed 2

Intel Xeon Phi (testbed 2)
Model: 5110P
Nr. of cores: 60 at 1.053 GHz
Memory: 8 GB Elpida
Operating System: MPSS release 2.1, kernel 2.6.38.8-g9b2c036 on k1om
Compiler Version: 2013U3
GDDR Technology: GDDR5
GDDR Frequency: 2500000 kHz

Table 4: Intel Xeon Phi technical characteristics for testbed 2

4.2 CalcunetW

The following figures show the main initial results. Figure 9 shows the elapsed time for calculating one network of 2324 nodes plus one random network, executed in several configurations:

- Host. The application is compiled for execution on the host.
- Xeon Phi Native. The application is compiled to be executed exclusively on the Xeon Phi coprocessor. The input file is copied to the coprocessor before execution; the time to copy this data is not included in the results.
- Compiler-assisted Offload. The code is modified to include a pragma before the DGEMM call so that it is executed on the Intel Xeon Phi using offloading.
- Workdivision=1.0. The unmodified application is executed on the host with a work division equal to one, so the MKL DGEMM function is executed entirely on the Intel Xeon Phi coprocessor.
- Automatic Offload. In this case MKL selects the work division between host and coprocessor automatically.

The results (mean value of 10 repetitions on testbed 1) show that native execution on the Xeon Phi is about 6 times slower than the other methods. There are no significant differences between the host and the offload versions, because the offloaded calculations account for less than 10% of the work.

Figure 9: Execution time with one random matrix

Figure 10 shows the scalability on both the host (with hyperthreading enabled) and the Xeon Phi. Again, the elapsed time for a network of 2324 nodes plus one random network has been measured. This version of the program has not been parallelized itself, but makes use of the parallel capabilities of the MKL functions. As can be observed, the program does not reduce its execution time beyond 4 threads on the host and 16 threads on the Xeon Phi.

Finally, the application has been parallelized with OpenMP so that each thread generates and processes one random matrix, as sketched below. Figure 11 shows the elapsed time for different numbers of random matrices on one socket of the host, on two sockets (in both cases without hyperthreading) and on the Xeon Phi. In this case the number of nodes of each network is 616, due to memory restrictions inside the Xeon Phi. As can be observed, if the number of matrices (networks) is high, the performance of the Xeon Phi is better than that of an E5-2670 CPU without hyperthreading; but if the host is used with both sockets, its performance is still better than that of one Xeon Phi.
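A minimal sketch of this per-matrix OpenMP parallelization follows; the random-network generation and measurement routines are hypothetical placeholders for the actual CalcunetW functions:

    #include <stdlib.h>

    /* Hypothetical prototypes standing in for the actual CalcunetW routines. */
    void generate_random_network(double *adj, int n, unsigned seed);
    void compute_measurements(const double *adj, int n, double *results);

    /* Each OpenMP thread builds and analyses one random network.  Since the
       MKL calls inside compute_measurements then run inside a parallel
       region, MKL falls back to one thread per call by default. */
    void analyse_random_networks(int n_networks, int n_nodes,
                                 double *all_results, int n_meas)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < n_networks; ++i) {
            double *adj = malloc((size_t)n_nodes * n_nodes * sizeof(double));
            generate_random_network(adj, n_nodes, (unsigned)i);
            compute_measurements(adj, n_nodes, all_results + (size_t)i * n_meas);
            free(adj);
        }
    }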

The size of the matrices can be huge in real cases, so this must be taken into account in the movement of data between the host and the Xeon Phi, and in the distribution of tasks among the Xeon Phi cores.

Figure 10: Scalability with one random matrix

Figure 11: Parallel performance with increasing number of random matrices

4.3 GammaMaps

Using the Xeon Phi, the FORTRAN program for 3D calculations was tested in four modes: local execution (where the main loops were parallelized using OpenMP and executed with all the available cores), native Xeon Phi execution (where the same code was compiled to be executed on the Xeon Phi), offload to the Xeon Phi (where the initial information is read and processed by the host but the main loops are executed exclusively on the Xeon Phi), and nested (where the main loops are executed simultaneously by the Xeon Phi and the host, after modifying the code to support this execution method using parallel sections). For the OpenMP executions, all the available affinities for each system (host and Xeon Phi) were tested. The first three methods were executed on testbed 1, but the timings for the nested case were recorded on the 60-core Xeon Phi (testbed 2).

The following figures show the main initial results. Figure 12 shows the speedup of the problem when it is executed on the host with different affinities. Scatter affinity shows the best performance and scales well up to 16 threads. Because there can be some load imbalance, the best solution is to use dynamic scheduling.

Figure 12: Speed-up for the local host with different affinities. The X and Y loops were collapsed

Figure 14 shows the elapsed time when the main loops are offloaded to the Xeon Phi. Balanced and scatter affinities produce similar results, but compact performs worse when the number of threads is below the maximum, as expected. Figure 15 compares the results of three of the execution methods on the 61-core Xeon Phi. The program can be divided into four sections: reading the information (two files of about 300 MB each), initialization of arrays based on the information read, parallel computation of the gamma index using these arrays, and storage of the results (again around 300 MB). The figure shows the best elapsed time obtained for each of these phases with the different execution methods, together with the total time. The Xeon Phi performs very well in the parallel phase but, due to its poor I/O performance, the total time is not competitive with the local host. The initialization phase also performs worse than on the local host, even though several sections were vectorized.

Figure 16, Figure 17 and Figure 18 show the results when the same case was executed on the second test-bed. They now include the nested case, where the main loops were divided symmetrically between the host and the Xeon Phi. They show that the nested solution using the full capacity of the host is limited by the Xeon Phi section; to achieve a better elapsed time, the correct work division between host and coprocessor should be defined. A sketch of this nested scheme is shown below.
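GammaMaps itself is FORTRAN, but the nested host-plus-coprocessor pattern can be sketched in C as follows (the split point and the work routine are hypothetical; an asynchronous offload clause could equally be used):

    #include <omp.h>

    /* Hypothetical per-range work routine; the target attribute makes it
       available on the coprocessor as well as on the host. */
    __attribute__((target(mic))) void process_range(int begin, int end);

    /* Run part of the iteration space on the coprocessor and the rest on
       the host at the same time, using two OpenMP sections.  'split'
       controls the work division between the two devices.  If
       process_range opens its own parallel region, nested parallelism
       must be enabled. */
    void nested_run(int n, int split)
    {
        omp_set_nested(1);
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            {
                /* This section blocks while the coprocessor works. */
                #pragma offload target(mic)
                process_range(0, split);
            }
            #pragma omp section
            process_range(split, n);   /* host share of the iterations */
        }
    }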

The main findings for this test case are:

- A Xeon Phi is almost equivalent in performance (for this test case) to one E5-2670 CPU without hyperthreading.
- Xeon Phi I/O was too slow (almost 10 times slower than the host).
- Sharing the parallel work between the host and the Xeon Phi required refactoring.

Figure 13: Elapsed time for the test case on the host

Figure 14: Elapsed time for the offload method

Figure 15: Execution times for the different phases

Figure 16: Elapsed times in the second test-bed. Xeon E5-2680 + Xeon Phi 60 cores

Figure 17: Offload execution times for the 60-core Xeon Phi

Figure 18: Comparative results

4.4 ROMS

Because ROMS usually has a high demand for input and output (it has to read and write several GB of data), and because poor I/O was observed on the Intel Xeon Phi in the other tests, only the host and hybrid cases were executed. The executed benchmarks (see below) do not need input files, but this situation is not common.

As with many scientific applications, ROMS needs some external libraries, so prior to performing the tests on the MIC it was necessary to build the required libraries. The libraries zlib and NetCDF were compiled for both the host CPU and the MIC. While building these libraries for the host architecture is a common task, building them for the MIC requires cross-compilation techniques. Native MIC builds were configured by adding a new "mic" Linux target to the autotools config.sub script (following the model of another existing target called "blackfin"). To build native libraries for the MIC, the -mmic flag was added to the compiler options.

The configure options used to build a native MIC version of the netcdf-4.1.3 static library were:

./configure CXX=icpc CC=icc FC=ifort FFLAGS=-mmic FCFLAGS=-mmic --disable-shared --disable-netcdf-4 --host=mic

The option --disable-netcdf-4 avoids the need to compile HDF5 first; it was used to simplify testing and save time, because the first attempts to compile HDF5 on the MIC were unsuccessful. Following this strategy it was possible to build host and native NetCDF static libraries and, after that, the ROMS executables for the MIC.

To execute the tests, the ROMS benchmark case 11 was used. The model runs for 200 time-steps and no input files are needed, since all the initial and forcing fields are set up with analytical expressions. Table 5 shows the parameters used for the benchmark.

Grid size
Number of I-direction INTERIOR RHO-points: Lm == 512
Number of J-direction INTERIOR RHO-points: Mm == 64
Number of vertical levels: N == 30

Table 5: Grid size for ROMS benchmark

The benchmark was executed with 16 tiles, i.e., 16 MPI processes. For the native host version, this matches the number of cores, and it was executed using the command:

mpirun -np 16 ./oceanM-benck1.host

where oceanM-benck1.host is the name of the executable. For the hybrid version, the executed command was:

mpirun -np 1 ./oceanM-benck1.host : -np 15 -host mic0 ./oceanM-benck1.mic

where mic0 is the name of the Xeon Phi coprocessor and oceanM-benck1.mic is the name of the executable for the Xeon Phi, which must previously be copied to the coprocessor; the environment variable I_MPI_MIC must be set to enable. In this case, process 0 is executed on the host while the other processes are executed on the MIC. Table 6 shows the results of this test. The hybrid version is 20 times slower than the host version, maybe due to the low performance of a single core of the Xeon Phi, where the hardware threads are not fully used, and to the communications between process 0 and the other processes.

11 https://www.myroms.org/wiki/index.php/test_case

On the host these MPI messages are faster because of the use of shared memory, while between the Xeon Phi and the host the communication goes through the PCIe interface. For this benchmark, running with more than 16 processes was unfeasible; to use the 60 cores, a larger case would be needed so that the tiles contain enough points to compute, which in turn increases the memory demand. A second optimization could be the use of a hybrid parallel mode (MPI+OpenMP), in which the Xeon Phi hardware threads can be used within each process. Unfortunately, due to the time constraints of the experiment, no further investigation could be done.

                              HOST     HYBRID
NtileI * NtileJ (nº cores)    16       16
Elapsed time (seconds)        18.18    365.80

Table 6: MPI benchmark results

Apart from the initial problems found in compiling the required libraries, porting ROMS to the MIC architecture was not a hard task. A different issue is achieving good performance on this architecture, which requires further analysis and probably some modifications to the code. In any case, ROMS is probably not the best real application with which to exploit this kind of architecture, because real simulations need to perform a significant amount of I/O (up to 60 GB for a NetCDF output file).

5 Conclusions

The three experiments presented were designed to investigate the complexity of porting existing applications and running them on the new Intel MIC architecture. The objective was to execute the applications with the minimum changes and to compare the results against execution on a classical architecture. Vectorization was not used explicitly (i.e., no specific pragmas were added to drive the compiler to vectorize particular loops), but the compiler options were selected to allow auto-vectorization. Changes to the original code were made only to introduce the pragmas that enable the offload mode and, in one single case, to execute in hybrid mode sharing the loop work between the host and the MIC. The main early conclusions that we can extract from the experiments are:

- The I/O of the Xeon Phi should be improved; it is currently a handicap for the native mode. We tested both NFS-mounted and local filesystems, with the same results.
- The initial porting of the applications is easy, but obtaining real performance requires real modifications to the code. New pragmas to divide the work between the host CPUs and the Xeon Phi in parallelized OpenMP loops would be welcome.
- The performance of a Xeon Phi for the selected cases is close to that of a single Xeon E5-2670 CPU (using all its cores).
- The affinity policy is important for good performance when the full number of threads is not used.
- The RAM memory is, for some problems, small. A larger memory/core ratio would be desirable, taking into account that the filesystem consumes part of the available memory 12.
- The low MPI performance needs more research; we do not yet have a clear idea about its causes.

Acknowledgements

The authors would like to thank Intel for providing access to Intel Xeon Phi coprocessors.

12 In June 2013 Intel released a new Xeon Phi with 16 GB of memory, which could solve some of the issues detected in this work.