
Technical Report

Abstract: This technical report presents CESGA's experience of porting three applications to the new Intel Xeon Phi coprocessor. The objective of these experiments was to evaluate the complexity of such work, to check the capabilities of this new technology, and to identify its limitations for executing scientific and technical software. The results show that the migration of these applications to the Xeon Phi can be done easily, but getting real performance improvements demands a deeper refactoring.

Document Id.: CESGA
Date: July 28th, 2013
Responsible: Andrés Gómez
Status: FINAL

Evaluation of Intel Xeon Phi to easily execute scientific applications

José Carlos Mouriño Gallego
Carmen Cotelo Queijo
Andrés Gómez Tato
Aurelio Rodríguez López


Index

1 Introduction
2 INTEL Xeon Phi
3 Applications
  3.1 CalcunetW
  3.2 GammaMaps
  3.3 ROMS
4 Results
  4.1 Infrastructure
  4.2 CalcunetW
  4.3 GammaMaps
  4.4 ROMS
5 Conclusions

Figures

Figure 1: The first generation Intel Xeon Phi product codenamed Knights Corner
Figure 2: Example of Compact policy in a 4 core coprocessor for 8 threads
Figure 3: Example of scatter policy in a 4 core coprocessor for 8 threads
Figure 4: Example of balanced policy in a 4 core coprocessor for 8 threads with granularity fine
Figure 5: Example of balanced policy in a 4 core coprocessor for 8 threads with granularity core
Figure 6: On the left, a voxel with the calculated dose. On the right, example of meshes for reference (blue) and test (green) doses
Figure 7: Grid domain decomposition
Figure 8: Example of tiled grid
Figure 9: Execution time with one random matrix
Figure 10: Scalability with one random matrix
Figure 11: Parallel performance with increasing number of random matrices
Figure 12: Speed-up for the local host with different affinities. The X and Y loops were collapsed
Figure 13: Elapsed time for the test case on the host
Figure 14: Elapsed time for the offload method
Figure 15: Execution times for the different phases
Figure 16: Elapsed times on the second test-bed (Xeon E host and Xeon Phi with 60 cores)
Figure 17: Offload execution times for the 60-core Xeon Phi
Figure 18: Comparative results

Tables

Table 1: Host characteristics for testbed 1
Table 2: Intel Xeon Phi technical characteristics for testbed 1
Table 3: Host characteristics for testbed 2
Table 4: Intel Xeon Phi technical characteristics for testbed 2
Table 5: Grid size for ROMS benchmark
Table 6: MPI benchmark results

1 Introduction

Heterogeneous computing with multiple levels of parallelism is a leading topic for the design of future exascale systems. Indeed, accelerators like current-generation GPGPUs offer relatively high bandwidth with lower relative power consumption than general-purpose processors. However, GPU-based acceleration requires special programming constructs (e.g. NVIDIA's CUDA language) for the accelerated work. With the release of the Intel Many Integrated Core (MIC) architecture, an additional coprocessor technology is available to the scientific community. This document reports on several early porting experiences to the Intel Xeon Phi platform. An attractive feature of this architecture is its support for standard threading models like OpenMP, which are already used by many scientific applications. In addition, the Xeon Phi platform is based on the x86 architecture, and C/C++ and FORTRAN kernels can be easily compiled for direct native execution on it.

The objective of this work was to check the programmability and usability of the new Intel Xeon Phi in different contexts: several programming languages (C and FORTRAN), using the Intel Math Kernel Library (MKL) in different configurations, and applying MPI in a real application. The applications considered are taken from existing development efforts at CESGA: CalcuNetW 1, an application developed in C which makes extensive use of the matrix multiplication BLAS routines included in MKL; GammaMaps, a FORTRAN application which calculates a figure-of-merit between two radiotherapy treatment doses; and ROMS, a FORTRAN application for oceanography which was used to check MPI inside the Xeon Phi.

The remainder of the report is organized as follows. First, a brief description of the Intel MIC architecture is presented. In the next section, the applications used as test cases are briefly described. Finally, the results of the tests are presented, followed by a final section with the conclusions.

2 INTEL Xeon Phi

In this section, a brief description of the Intel Xeon Phi architecture is given. A more detailed architecture description can be found on the Intel website 2.

1 J.C. Mouriño, E. Estrada, A. Gómez. CalcuNetW: Calculate Measurements in Complex Networks, Informe Técnico CESGA.

Intel Many Integrated Core (Intel MIC) is a multiprocessor computer architecture developed by Intel. It combines in a coprocessor several modified Intel CPU cores which use the x86 instruction set, executed in-order with a short pipeline. Each core includes a new 512-bit SIMD vector processing unit (VPU), a dedicated 512 KB L2 cache, 32 KB L1 caches for data and instructions, and a TLB. The L2 cache is kept fully coherent among all the cores. The VPU can execute up to 32 single or 16 double precision floating point operations per cycle with Fused Multiply-Add (FMA, which calculates a*b+c as a single instruction), or half as many when FMA cannot be applied. All the floating point operations follow IEEE 754 arithmetic, making this system suitable for scientific HPC. Each core supports 4 hardware threads, so one 60-core Xeon Phi can execute up to 240 threads simultaneously. The cores are connected to a high-speed bidirectional ring interconnect which allows them to access the RAM memory (up to 8 GB) through the directly connected memory controllers, and the PCIe bus. The RAM memory is based on GDDR5 technology.

Figure 1: The first generation Intel Xeon Phi product codenamed Knights Corner

The Intel Xeon Phi is provided as a coprocessor unit which is attached to the PCIe bus of the host. This board loads a dedicated Linux operating system and can be configured to have its own IP address and services, so the final user can log in or copy data using standard Linux commands such as ssh or scp.

The Xeon Phi filesystem is mounted directly in RAM memory; as a consequence, the copy of the operating system and the loading of data files reduce the memory available to applications. As an alternative, an NFS filesystem can be used to access the host filesystem or, when working in offload mode (see later), an automatic mounting of special directories is done.

Intel provides C, C++ and FORTRAN compilers, mathematical libraries (MKL), debuggers, and other development tools. The compilers can generate applications which can be executed in two modes:

- Native. The generated binary can only be executed on the Intel Xeon Phi. If it is compiled on the host, the executable must be transferred to the coprocessor for execution. To simplify this step, Intel has included the tool micnativeloadex, which copies the executable and the needed libraries to the Xeon Phi before executing it.
- Offload. The application is executed on the host, but some sections are selected to run on the Xeon Phi using pragmas. The compiler automatically generates the binary code for executing these sections on the coprocessor, and they are transferred automatically. Although the programmer can select which data should be transferred from the host memory to the board and back, also using pragmas, the compiler can detect them automatically in several cases, reducing the complexity of porting applications (a minimal C sketch is shown after the list of environment variables below).

Some functions of the MKL library support offload mode directly. This mode of execution is selected by external environment variables 3:

- MKL_MIC_ENABLE. If set to 1, the MKL library uses the Xeon Phi coprocessor with automatic offload.
- MKL_HOST_WORKDIVISION. A number between 0.0 and 1.0 telling the MKL library how much work must be done on the host.
- MKL_MIC_WORKDIVISION. A number between 0.0 and 1.0 selecting the amount of work to be done on the Intel Xeon Phi coprocessor. If more than one board is attached to the host, it can be set per board using MKL_MIC_<BOARD NUMBER>_WORKDIVISION, where BOARD NUMBER is the id of the Xeon Phi coprocessor (starting at 0).
- MKL_MIC_MAX_MEMORY. Limits the amount of memory to be used on the MIC when automatic offload is used.

3 A1E4-423D-9C0C-06AB265FFA86.htm
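As a concrete illustration of the two paths just listed, the following minimal C sketch (written for readability, not taken from any of the ported applications) offloads a single MKL DGEMM call with a compiler-assisted offload pragma; the comments indicate how the same unmodified call could instead be dispatched through MKL Automatic Offload by setting the environment variables described above. The matrix size and the build command are illustrative assumptions.

    /* Illustrative sketch only: offloading one DGEMM call to the Xeon Phi with
     * the Intel compiler's offload pragmas (KNC-era syntax). Build, e.g., with
     *   icc -mkl offload_dgemm.c
     * Alternatively, remove the pragma and run the unmodified binary with MKL
     * Automatic Offload:
     *   MKL_MIC_ENABLE=1 MKL_MIC_WORKDIVISION=1.0 ./a.out
     */
    #include <stdio.h>
    #include <mkl.h>

    int main(void)
    {
        const int n = 2324;   /* size of the test network used in this report */
        double *a = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
        double *b = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
        double *c = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
        for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

        /* Compiler-assisted offload: the in/inout clauses tell the compiler which
         * buffers to copy to the coprocessor and back around the DGEMM call. */
        #pragma offload target(mic:0) in(a, b : length(n*n)) inout(c : length(n*n))
        {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0, a, n, b, n, 0.0, c, n);
        }

        printf("c[0] = %f\n", c[0]);
        mkl_free(a); mkl_free(b); mkl_free(c);
        return 0;
    }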

Because it is a multi-threaded environment, when an OpenMP application is executed, the affinity of the threads to the cores is an important issue. It can be selected with an environment variable (prefix_KMP_AFFINITY, where prefix is set using another environment variable, MIC_ENV_PREFIX). The Intel Xeon Phi supports three policies and two granularities. The policies are:

- Compact. The threads are placed in the cores, in order, as compactly as possible. So, an 8-thread application will use only two cores, because each core can execute up to 4 threads (Figure 2).
- Scatter. Threads are spread as much as possible among the cores, in order, avoiding the sharing of the same core if possible (see Figure 3 for an example with 4 cores and 8 threads).
- Balanced. This mode, which is not supported on hosts, is similar to scatter, but if the number of requested threads is larger than the number of cores, the threads are placed grouping those with the nearest tag. For example, for an 8-thread application on a 4-core Xeon Phi, threads 0 and 1 will share the same core (see Figure 4).

Figure 2: Example of Compact policy in a 4 core coprocessor for 8 threads
Figure 3: Example of scatter policy in a 4 core coprocessor for 8 threads

Figure 4: Example of balanced policy in a 4 core coprocessor for 8 threads with granularity fine.
Figure 5: Example of balanced policy in a 4 core coprocessor for 8 threads with granularity core.

The granularities are:

- Fine (or thread). Each thread is bound to a single hardware thread.
- Core. The threads are bound to a core and can migrate from one hardware thread to another within it. See Figure 5 for an example with the balanced policy.

More information about the Xeon Phi and how to program it is available in the Intel Xeon Phi Coprocessor System Software Developers Guide 4.

3 Applications

3.1 CalcunetW

Complex networks, consisting of sets of nodes or vertices joined together in pairs by links or edges, appear frequently in various technological, social and biological scenarios. These networks include the Internet, the World Wide Web, social networks, scientific collaboration networks, lexicon or semantic networks, neural networks, food webs, metabolic networks and protein-protein interaction networks. They have been shown to share global statistical features, such as the small-world and scale-free effects, as well as the clustering property. The first feature is simply the fact that the average distance between nodes in the network is short and usually scales logarithmically with the total number of nodes. The second is a characteristic of several real-world networks in which there are many nodes with low degree and only a small number with high degree (the so-called hubs). The node degree is simply the number of ties a node has with other nodes. In scale-free networks, the node degree follows a power-law distribution. Finally, clustering is a property of two linked nodes that are each linked to a third node. In consequence, these three nodes form a triangle, and clustering is frequently measured by counting the number of triangles in the network.

In order to calculate some measurements in complex networks, a simple program has been developed. This application calculates some characterization measurements in a given network and compares them with a number of random networks given by the user. The measurements calculated by the program are: the Subgraph Centrality (SC), SC odd, SC even, Bipartivity, Network Communicability (C(G)) and Network Communicability for Connected Nodes. A detailed description of this application can be found in the CalcuNetW technical report 5. A given number of random networks are calculated for comparison purposes, and the average value of the target measurements is calculated, indicating also the mean squared error. As the number of random networks grows, the computational time increases, but the mean results improve. The random networks are calculated taking into account some restrictions: they must have the same number of nodes and edges as the original network, they must also be connected, and the node degrees must be the same as in the original network.

The program is given as a simple executable developed in C, using the LAPACK and BLAS libraries. The program was initially not parallelized, but can make use of the parallel capabilities of the MKL LAPACK and BLAS libraries.

However, the application is easily parallelizable with OpenMP. Using the Intel Xeon Phi, we have explored its parallel capabilities to speed up the process. The size of the matrix could be huge in real cases. This fact has been taken into account for the movement of data between the host and the Xeon Phi and for the distribution of tasks among the Xeon Phi processors.

3.2 GammaMaps

In cancer radiation therapy treatments, especially in complex cases, the calculated doses are verified using experimental data before the treatment itself is delivered to the patient. There are many figures-of-merit to check the quality of the proposed treatment. One of them is the gamma index 6, which generates a difference map between the measured and calculated doses. This gamma index can also be used to compare two doses calculated using different algorithms, as the eIMRT 7 project does. In this case, the calculated doses for the treatment (reference dose) are compared with those obtained by simulating the linear accelerator and the patient's body using Monte Carlo techniques (test dose). For each case, the body of the patient is divided into small cubes (coined voxels) and the dose deposited in each one is calculated. As a consequence, the full patient is a three-dimensional set of voxels with information about the dose deposited in its volume. In Figure 6, the meshes for the reference (blue) and test (green) doses are shown. Both grids can differ in the size of the voxels and the position of their edges.

6 D. A. Low, W. B. Harms, S. Mutic, and J. A. Purdy, A technique for the quantitative evaluation of dose distributions, Medical Physics, vol. 25, no. 5, May 1998.
7 D. M. González-Castaño, J. Pena, F. Gómez, A. Gago-Arias, F. J. González-Castaño, D. A. Rodríguez-Silva, A. Gómez, C. Mouriño, M. Pombar, and M. Sánchez, eIMRT: a web platform for the verification and optimization of radiation treatment plans, Journal of Applied Clinical Medical Physics, vol. 10, no. 3, p. 2998, 2009.

Figure 6: On the left, a voxel with the calculated dose. On the right, example of meshes for reference (blue) and test (green) doses.

The gamma index is defined for each voxel of the reference dose as:

\gamma(\vec{r}_r) = \min_{\vec{r}_t \in D_{test}} \sqrt{ \frac{\|\vec{r}_t - \vec{r}_r\|^2}{R^2} + \frac{\left(d(\vec{r}_t) - d(\vec{r}_r)\right)^2}{D^2} }

where d(r_t) is the test dose for the voxel at position r_t, d(r_r) is the reference dose for the voxel at position r_r, R is a parameter indicating the desired geometric distance-to-agreement between doses (usually 3 mm) and D is the maximum required difference between doses (commonly 3% of the maximum dose to be delivered to the tumour). Any value less than one is considered acceptable, while values higher than 1 should be investigated.

GammaMaps is an application developed for the eIMRT 8 project that calculates this gamma index for two- and three-dimensional data. It is written in FORTRAN and uses a geometric algorithm to speed up the process 9. It is parallelized using OpenMP.

9 T. Ju, T. Simpson, J. O. Deasy, and D. A. Low, Geometric interpretation of the γ dose distribution comparison technique: Interpolation-free calculation, Medical Physics, vol. 35, no. 3, p. 879, 2008.
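To make the computation concrete, the following C sketch evaluates the gamma index by a brute-force search over all test voxels, with the loop over reference voxels parallelized with OpenMP and dynamic scheduling. It only illustrates the formula above under the assumption that both doses are sampled on the same regular grid; GammaMaps itself is written in FORTRAN, supports different reference and test meshes, and uses the geometric algorithm of Ju et al. instead of this exhaustive search.

    /* Brute-force gamma-index sketch (illustration only; not GammaMaps code).
     * Assumes both doses are sampled on the same regular nx*ny*nz grid with
     * spacing dx in mm; GammaMaps handles different reference/test meshes.
     */
    #include <math.h>

    void gamma_map(const double *ref, const double *test, double *gamma_out,
                   int nx, int ny, int nz, double dx,
                   double R,  /* distance-to-agreement criterion, e.g. 3 mm      */
                   double D)  /* dose-difference criterion, e.g. 3% of max dose  */
    {
        #pragma omp parallel for collapse(2) schedule(dynamic)
        for (int kr = 0; kr < nz; kr++)
        for (int jr = 0; jr < ny; jr++)
        for (int ir = 0; ir < nx; ir++) {
            const long r = ((long)kr * ny + jr) * nx + ir;
            double best = INFINITY;
            /* search every test voxel for the minimum of the argument of gamma */
            for (int kt = 0; kt < nz; kt++)
            for (int jt = 0; jt < ny; jt++)
            for (int it = 0; it < nx; it++) {
                const long t = ((long)kt * ny + jt) * nx + it;
                const double dist2 = dx * dx * ((it - ir) * (it - ir) +
                                                (jt - jr) * (jt - jr) +
                                                (kt - kr) * (kt - kr));
                const double dd = test[t] - ref[r];
                const double g2 = dist2 / (R * R) + (dd * dd) / (D * D);
                if (g2 < best) best = g2;
            }
            gamma_out[r] = sqrt(best);  /* values <= 1 are considered acceptable */
        }
    }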

3.3 ROMS

The Regional Ocean Modelling System (ROMS) is a software package that models and simulates an ocean region using a finite-difference grid and time stepping. It is a complex model with many options and capabilities. The code is written in F90/F95 and uses C-preprocessing flags to activate the various physical and numerical options. The simulations can take from hours to days to complete due to the compute-intensive nature of the software. The size and resolution of the simulations are constrained by the performance of the computing hardware used.

Figure 7: Grid domain decomposition

ROMS can be run in parallel with OpenMP or MPI. It does not use the MKL libraries, and neither is it an easy case for introducing OpenMP pragmas to activate the offload mode. Therefore, this test case was used to try the Message Passing Interface (MPI).

An example of a grid domain decomposition with tiles is shown in Figure 8, one colour per tile. The overlap areas are known as ghost points. Each tile is an MPI process and contains the information needed to time-step all of its interior points. For MPI jobs, the ghost points need to be updated between interior-point computations (a minimal halo-exchange sketch in C is shown below).

Figure 8: Example of tiled grid 10
10 Source

The main characteristics of ROMS with MPI are:

- The master process (0) does all the I/O (NetCDF).
  o On input, it sends the tiled fields to the respective processes.
  o It collects the tiled fields for output.
- ROMS needs to pass many small MPI messages.
- The product NtileI * NtileJ must match the number of MPI processes (the more MPI processes, the fewer points per tile and the more communication is needed to exchange neighbour information).
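The ghost-point update mentioned above is, in essence, a halo exchange. The following generic C/MPI sketch (a one-dimensional decomposition with a single ghost cell per side, written only for illustration; ROMS is FORTRAN and exchanges two-dimensional tile boundaries) shows the non-blocking pattern such an update typically follows.

    /* Generic 1D halo (ghost point) exchange sketch; not ROMS code. Each rank
     * owns n interior cells plus one ghost cell on each side of the array.
     */
    #include <mpi.h>

    void exchange_ghosts(double *field, int n, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        MPI_Request req[4];
        /* receive into the ghost cells field[0] and field[n+1] */
        MPI_Irecv(&field[0],     1, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Irecv(&field[n + 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
        /* send the boundary interior cells field[1] and field[n] */
        MPI_Isend(&field[1],     1, MPI_DOUBLE, left,  1, comm, &req[2]);
        MPI_Isend(&field[n],     1, MPI_DOUBLE, right, 0, comm, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }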

This test compares the host and hybrid modes because real ROMS simulations involve input and output files of several GB managed by the master (process 0).

4 Results

4.1 Infrastructure

The tests were performed during February 2013 on an infrastructure provided remotely by Intel. Two systems were used:

- A Xeon Phi with 61 cores (testbed 1). The characteristics of this system are shown in Table 1 for the host and Table 2 for the Intel Xeon Phi coprocessor. CalcunetW and GammaMaps were tested on this system, and some execution attempts were made with ROMS. The maximum number of threads was set to 240, leaving the last core unused.
- A Xeon Phi with 60 cores (testbed 2). The characteristics of this system are shown in Table 3 for the host and Table 4 for the Intel Xeon Phi coprocessor. GammaMaps, with some additional modifications explained below, and ROMS were tested on this system. The maximum number of threads was set to 240.

Host (testbed 1)
CPU Model: Intel Xeon CPU E GHz
Nr. of cores: 16
Memory: MB
Operating System: Linux el6.x86_
Compiler Version: U2
Table 1: Host characteristics for testbed 1

Intel Xeon Phi (testbed 1)
Model: Beta0 Engineering Sample
Nr. of cores: 61 at 1.09 GHz
Memory: 7936 MB
Operating System: MPSS Gold U1
Compiler Version: 2013U2
GDDR Technology: GDDR
GDDR Frequency: KHz
Table 2: Intel Xeon Phi technical characteristics for testbed 1

Host (test-bed 2)
CPU Model: Intel Xeon CPU E
Nr. of cores: 16
Memory: 128 GB
Operating System: RHEL6.3
Compiler Version: composer_xe_
Table 3: Host characteristics for testbed 2

Intel Xeon Phi (test-bed 2)
Model: 5110P
Nr. of cores: 60 at GHz
Memory: 8 GB Elpida
Operating System: MPSS release 2.1, Kernel g9b2c036 on an k1om
Compiler Version: 2013U3
GDDR Technology: GDDR
GDDR Frequency: KHz
Table 4: Intel Xeon Phi technical characteristics for testbed 2

4.2 CalcunetW

The following figures show the main initial results. Figure 9 shows the elapsed time for calculating one network of 2324 nodes plus one random network, executed in several configurations:

- Host. The application has been compiled for execution on the host.
- Xeon Phi Native. The application is compiled to be executed exclusively on the Xeon Phi coprocessor. The input file is copied to the coprocessor before execution; the time to copy this data is not included in the results.
- Compiler-assisted Offload. The code has been modified to include a pragma before the DGEMM call to execute it on the Intel Xeon Phi using offloading.
- Workdivision=1.0. The application, without modification, is executed on the host selecting a work division equal to one, so that the MKL DGEMM function is executed on the Intel Xeon Phi coprocessor.
- Automatic Offload. In this case, MKL should automatically select the work division between host and coprocessor.

The results (mean value of 10 repetitions on testbed 1) show that native execution on the Xeon Phi is about 6 times slower than the other methods. There are no significant differences between the host and the offload versions, because the amount of offloaded calculation is less than 10%.

Figure 9: Execution time with one random matrix

Figure 10 shows the scalability, both on the host (hyperthreading was enabled) and on the Xeon Phi. Again, the elapsed time for a network of 2324 nodes plus one random network has been measured. This version of the program has not itself been parallelized, but makes use of the parallel capabilities of the MKL functions. As can be observed, the program does not reduce its execution time beyond 4 threads on the host and 16 threads on the Xeon Phi.

Finally, the application has been parallelized with OpenMP: each thread generates and calculates one random matrix (see the sketch below). Figure 11 shows the elapsed time for different numbers of random matrices on one socket of the host, on two sockets (in both cases without hyperthreading) and on the Xeon Phi. In this case the number of nodes of each network is 616, due to memory restrictions inside the Xeon Phi. As can be observed, if the number of matrices (networks) is high, the performance on the Xeon Phi is better than on a Xeon E CPU without hyperthreading. But if the host is used with both sockets, its performance is still better than one Xeon Phi.
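The parallelization just described can be sketched as follows. The helpers generate_random_network() and compute_measurements() are hypothetical placeholders for CalcunetW's actual routines, and the assumption that MKL runs sequentially inside each thread is ours; the point of the sketch is only the loop structure, one random network per OpenMP iteration.

    /* Sketch of the OpenMP parallelization over random networks (not the actual
     * CalcunetW source). The two helper routines are hypothetical placeholders.
     */
    #include <stdlib.h>

    typedef struct { double sc, sc_odd, sc_even, bipartivity, comm; } measures_t;

    /* hypothetical: build a connected, degree-preserving random network */
    void generate_random_network(int n, const double *adj_orig, double *adj_rand,
                                 unsigned seed);
    /* hypothetical: compute SC, bipartivity, communicability, etc. (uses MKL) */
    void compute_measurements(int n, const double *adj, measures_t *out);

    void random_network_averages(int n, const double *adj_orig,
                                 int n_random, measures_t *results)
    {
        /* One random network per iteration; assumption: MKL is used sequentially
         * inside each thread so the hardware threads are not oversubscribed. */
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < n_random; i++) {
            double *adj_rand = (double *)malloc((size_t)n * n * sizeof(double));
            generate_random_network(n, adj_orig, adj_rand, 1234u + (unsigned)i);
            compute_measurements(n, adj_rand, &results[i]);
            free(adj_rand);
        }
    }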

The size of the matrices could be huge in real cases, so this must be taken into account in the movement of data between the host and the Xeon Phi, and in the distribution of tasks among the Xeon Phi processors.

Figure 10: Scalability with one random matrix

Figure 11: Parallel performance with increasing number of random matrices

4.3 GammaMaps

Using the Xeon Phi, the Fortran program for 3D calculations was tested in four modes: local execution (where the main loops were parallelized using OpenMP and executed with all the available cores), native Xeon Phi execution (where the same code was compiled to be executed on the Xeon Phi), offload to the Xeon Phi (where the initial information is read and processed by the host but the main loops are executed exclusively on the Xeon Phi) and nested (where the main loops are executed simultaneously by the Xeon Phi and the host, after modifying the code to support this new execution method using parallel sections). For the OpenMP executions, all the available affinities for each system (host and Xeon Phi) were tested. The first three methods were executed on testbed 1, but timings for the nested case were recorded on a 60-core Xeon Phi (testbed 2).

The next figures show the main initial results. Figure 12 shows the speed-up of the problem when it is executed on the host with different affinities. The scatter affinity shows the best performance and scales well up to 16 threads. Because there can be some imbalance, the best solution is to use dynamic scheduling.

Figure 14 shows the elapsed time when the main loops are offloaded to the Xeon Phi. The balanced and scatter affinities produce similar results, but compact, when the number of threads is below the maximum, performs worse, as expected. Figure 15 compares the results of three of the execution methods on the 61-core Xeon Phi. The program can be divided into four sections: reading the information (two files of about 300 MB each), initialization of arrays based on the read information, parallel computation of the gamma index using these arrays, and storage of the results (again around 300 MB). This figure shows the best elapsed time for the different execution methods, as well as the total time. The Xeon Phi performs very well in the parallel phase but, due to poor I/O performance, the total time is not competitive with the local host. Also, the initialization phase performs worse than on the local host, even though several sections are vectorized.

Figure 12: Speed-up for the local host with different affinities. The X and Y loops were collapsed.

Figure 16, Figure 17 and Figure 18 show the results when the same case was executed on the second test-bed. Now it includes the nested case, where the main loops were divided symmetrically between the host and the Xeon Phi (a sketch of this scheme is shown below). It shows that the nested solution using the full capacity of the host is limited by the Xeon Phi section. To achieve a better elapsed time, the correct work division should be defined.
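The nested scheme can be sketched in C as follows. work_on_range() is a hypothetical stand-in for the gamma-index loops (GammaMaps itself is FORTRAN), and the 50/50 split mirrors the symmetric division used in this test, which, as noted above, is not the optimal work division.

    /* Sketch of the nested execution mode: the iteration range is split between
     * the host and the coprocessor, each part processed in its own OpenMP
     * section. Illustration only; work_on_range() is a hypothetical placeholder.
     * If it contains its own OpenMP loops, nested parallelism must be enabled
     * on the host.
     */
    #include <stddef.h>

    /* The target(mic) attribute makes the routine callable inside offload regions. */
    __attribute__((target(mic)))
    void work_on_range(double *data, size_t begin, size_t end);

    void nested_run(double *data, size_t n)
    {
        size_t split = n / 2;              /* symmetric host/MIC work division */

        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            {
                /* The first half is shipped to the coprocessor; only that part
                 * of the array is transferred there and back. */
                #pragma offload target(mic:0) inout(data : length(split))
                work_on_range(data, 0, split);
            }
            #pragma omp section
            {
                /* The second half stays on the host. */
                work_on_range(data, split, n);
            }
        }
    }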

The main achievements for this test case are:

- A Xeon Phi is almost equivalent in performance (for this test case) to one Xeon E CPU without hyperthreading.
- Xeon Phi I/O was too slow (almost 10 times slower than the host).
- Sharing the parallel work between the host and the Xeon Phi required refactoring.

Figure 13: Elapsed time for the test case on the host

Figure 14: Elapsed time for the offload method

Figure 15: Execution times for the different phases

Figure 16: Elapsed times on the second test-bed (Xeon E host and Xeon Phi with 60 cores)

Figure 17: Offload execution times for the 60-core Xeon Phi

Figure 18: Comparative results

4.4 ROMS

Because ROMS usually has a high demand for input and output (it has to read and write several GB of data), and poor I/O was observed on the Intel Xeon Phi in other tests, only the host and hybrid cases were executed. Although the executed benchmark (see below) does not need input files, this situation is not common.

As with many scientific applications, ROMS needs some external libraries. Prior to performing the tests on the MIC, it was necessary to build the required libraries. The zlib and NetCDF libraries were compiled for both the host CPU and the MIC. While building these libraries for the host architecture is a common task, building them for the MIC requires cross-compilation techniques. Native MIC builds were configured by adding a new "mic" Linux target to the autotools config.sub (following the model of another existing target called "blackfin"). To build native libraries for the MIC, the "-mmic" flag was added to the compiler options.

The configure options used to build a native MIC version of the NetCDF static library were:

./configure CXX=icpc CC=icc FC=ifort FFLAGS=-mmic FCFLAGS=-mmic --disable-shared --disable-netcdf-4 --host=mic

The option --disable-netcdf-4 avoids the need to compile HDF5 beforehand; it was used to simplify testing and save time, because the first attempts to compile HDF5 on the MIC were unsuccessful. Following this strategy, it was possible to build host and native NetCDF static libraries and, after that, the ROMS executables for the MIC.

To execute the tests, the ROMS benchmark case 11 was used. The model ran for 200 time-steps and no input files are needed, since all the initial and forcing fields are set up with analytical expressions. Table 5 shows the parameters used for the benchmark.

Number of I-direction INTERIOR RHO-points: Lm == 512
Number of J-direction INTERIOR RHO-points: Mm == 64
Number of vertical levels: N == 30
Table 5: Grid size for ROMS benchmark

The benchmark was executed with 16 tiles, i.e., 16 MPI processes. In the native host version, this matches the number of cores. The host runs were executed using the command:

mpirun -np 16 ./oceanM-benck1.host

where oceanM-benck1.host is the name of the executable. For the hybrid version, the executed command was:

mpirun -np 1 ./oceanM-benck1.host : -np 15 -host mic0 ./oceanM-benck1.mic

where mic0 is the name of the Xeon Phi coprocessor and oceanM-benck1.mic is the name of the executable for the Xeon Phi, which must previously be copied to the coprocessor; the environment variable I_MPI_MIC must also be set to enable. In this case, process 0 is executed on the host while the other processes are executed on the MIC. Table 6 shows the results of this test. The hybrid version is 20 times slower than the host version, maybe due to the low performance of a single core on the Xeon Phi, where the hardware threads are not fully used, and to the communications between process 0 and the others.

On the host these MPI messages are faster because of the use of shared memory, while between the Xeon Phi and the host the communication is done over the PCIe interface. For this benchmark, running with more than 16 processes was unfeasible. To use the 60 cores, a larger case must be used, so that there are enough points in each tile to compute, which increases the memory demand. A second optimization could be the use of a hybrid parallel mode (MPI+OpenMP), in which the Xeon Phi hardware threads are used within each process. Unfortunately, due to the time constraints of the full experiment, no further investigation could be done.

HOST / HYBRID
NtileI * NtileJ (nº cores)
Elapsed time (seconds):
Table 6: MPI benchmark results

Apart from the initial problems found in compiling the required libraries, porting ROMS to the MIC architecture was not a hard task. A different issue is easily achieving good performance on this architecture; this requires further analysis and perhaps some modifications to the code. Moreover, ROMS is probably not the best real application with which to exploit this kind of architecture, because real simulations need to perform a significant amount of I/O (up to 60 GB for a NetCDF output file).

5 Conclusions

The three presented experiments were designed to investigate the complexity of porting existing applications and running them on the new Intel MIC architecture. The objective was to execute the applications with the minimum of changes and to compare the results against execution on a classical architecture. Vectorization was not used explicitly (i.e., no specific pragmas were added to drive the compiler to vectorize particular loops), but the compiler options were selected to allow auto-vectorization. Changes to the original code were made only to introduce the pragmas which enable offload mode and, in one single case, to execute in hybrid mode, sharing loop work between the host and the MIC. The main early conclusions that we can extract from the experiments are:

- The I/O of the Xeon Phi should be improved; it is currently a handicap for the native mode. We tested both NFS-mounted and local filesystems, with the same results.
- The initial porting of the applications is easy, but getting real performance needs real modifications to the code. New pragmas to divide the work between the host CPUs and the Xeon Phi in parallelized OpenMP loops would be welcome.
- The performance of a Xeon Phi for the selected cases is close to a single Xeon E CPU (using all the cores).
- The affinity policy is important for good performance when the full number of threads is not used.
- The RAM memory is, for some problems, small. A larger memory/cores ratio would be desirable, taking into account that the filesystem consumes part of the available memory 12.
- The low MPI performance needs more research. We do not yet have a clear idea about the causes.

Acknowledgements

The authors would like to thank Intel for providing access to Intel Xeon Phi coprocessors.

12 Intel released a new Xeon Phi with 16 GB in June 2013, which could solve some of the issues detected in this work.


GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

The Era of Heterogeneous Computing

The Era of Heterogeneous Computing The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------

More information

Running HARMONIE on Xeon Phi Coprocessors

Running HARMONIE on Xeon Phi Coprocessors Running HARMONIE on Xeon Phi Coprocessors Enda O Brien Irish Centre for High-End Computing Disclosure Intel is funding ICHEC to port & optimize some applications, including HARMONIE, to Xeon Phi coprocessors.

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

John Hengeveld Director of Marketing, HPC Evangelist

John Hengeveld Director of Marketing, HPC Evangelist MIC, Intel and Rearchitecting for Exascale John Hengeveld Director of Marketing, HPC Evangelist Intel Data Center Group Dr. Jean-Laurent Philippe, PhD Technical Sales Manager & Exascale Technical Lead

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D.

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D. Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

Accelerating Implicit LS-DYNA with GPU

Accelerating Implicit LS-DYNA with GPU Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,

More information

Non-uniform memory access (NUMA)

Non-uniform memory access (NUMA) Non-uniform memory access (NUMA) Memory access between processor core to main memory is not uniform. Memory resides in separate regions called NUMA domains. For highest performance, cores should only access

More information

PRACE PATC Course: Intel MIC Programming Workshop, MKL. Ostrava,

PRACE PATC Course: Intel MIC Programming Workshop, MKL. Ostrava, PRACE PATC Course: Intel MIC Programming Workshop, MKL Ostrava, 7-8.2.2017 1 Agenda A quick overview of Intel MKL Usage of MKL on Xeon Phi Compiler Assisted Offload Automatic Offload Native Execution Hands-on

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

OpenStaPLE, an OpenACC Lattice QCD Application

OpenStaPLE, an OpenACC Lattice QCD Application OpenStaPLE, an OpenACC Lattice QCD Application Enrico Calore Postdoctoral Researcher Università degli Studi di Ferrara INFN Ferrara Italy GTC Europe, October 10 th, 2018 E. Calore (Univ. and INFN Ferrara)

More information

Experiences with ENZO on the Intel Many Integrated Core Architecture

Experiences with ENZO on the Intel Many Integrated Core Architecture Experiences with ENZO on the Intel Many Integrated Core Architecture Dr. Robert Harkness National Institute for Computational Sciences April 10th, 2012 Overview ENZO applications at petascale ENZO and

More information

Scalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany

Scalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP

More information

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Why do we need HPC? High Performance Computing Amazon can ship products within hours would it

More information

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2

More information

Investigation of Intel MIC for implementation of Fast Fourier Transform

Investigation of Intel MIC for implementation of Fast Fourier Transform Investigation of Intel MIC for implementation of Fast Fourier Transform Soren Goyal Department of Physics IIT Kanpur e-mail address: soren@iitk.ac.in The objective of the project was to run the code for

More information

OpenACC 2.6 Proposed Features

OpenACC 2.6 Proposed Features OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively

More information

Klaus-Dieter Oertel, May 28 th 2013 Software and Services Group Intel Corporation

Klaus-Dieter Oertel, May 28 th 2013 Software and Services Group Intel Corporation S c i c o m P 2 0 1 3 T u t o r i a l Intel Xeon Phi Product Family Programming Tools Klaus-Dieter Oertel, May 28 th 2013 Software and Services Group Intel Corporation Agenda Intel Parallel Studio XE 2013

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

Path to Exascale? Intel in Research and HPC 2012

Path to Exascale? Intel in Research and HPC 2012 Path to Exascale? Intel in Research and HPC 2012 Intel s Investment in Manufacturing New Capacity for 14nm and Beyond D1X Oregon Development Fab Fab 42 Arizona High Volume Fab 22nm Fab Upgrades D1D Oregon

More information