Technical Report. Document Id.: CESGA Date: July 28th, 2013. Responsible: Andrés Gómez. Status: FINAL
Abstract: This technical report presents CESGA's experience of porting three applications to the new Intel Xeon Phi coprocessor. The objective of these experiments was to evaluate the complexity of such work, to check the capabilities of this new technology, and to learn its limitations for executing scientific and technical software. The results show that the migration of these applications to the Xeon Phi can be done easily, but getting real performance improvements demands deeper refactoring.
Evaluation of Intel Xeon Phi to easily execute scientific applications

José Carlos Mouriño Gallego
Carmen Cotelo Queijo
Andrés Gómez Tato
Aurelio Rodríguez López

Technical Report CESGA Act:29/07/ / 33
Index

1 Introduction
2 INTEL Xeon Phi
3 Applications
3.1 CalcunetW
3.2 GammaMaps
3.3 ROMS
4 Results
4.1 Infrastructure
4.2 CalcunetW
4.3 GammaMaps
4.4 ROMS
5 Conclusions
Figures

Figure 1: The first generation Intel Xeon Phi product codenamed Knights Corner
Figure 2: Example of compact policy in a 4 core coprocessor for 8 threads
Figure 3: Example of scatter policy in a 4 core coprocessor for 8 threads
Figure 4: Example of balanced policy in a 4 core coprocessor for 8 threads with granularity fine
Figure 5: Example of balanced policy in a 4 core coprocessor for 8 threads with granularity core
Figure 6: On the left, a voxel with the calculated dose. On the right, example of meshes for reference (blue) and test (green) doses
Figure 7: Grid domain decomposition
Figure 8: Example of tiled grid
Figure 9: Execution time with one random matrix
Figure 10: Scalability with one random matrix
Figure 11: Parallel performance with increasing number of random matrixes
Figure 12: Speed-up for the local host with different affinities. The X and Y loops were collapsed
Figure 13: Elapsed time for the test case on the host
Figure 14: Elapsed time for the offload method
Figure 15: Execution times for the different phases
Figure 16: Elapsed times in the second test-bed. Xeon E vs. Xeon Phi 60 cores
Figure 17: Offload execution times for the 60 cores Xeon Phi
Figure 18: Comparative results

Tables

Table 1: Host characteristics for testbed 1
Table 2: Intel Xeon Phi technical characteristics for testbed 1
Table 3: Host characteristics for testbed 2
Table 4: Intel Xeon Phi technical characteristics for testbed 2
Table 5: Grid size for ROMS benchmark
Table 6: MPI benchmark results
1 Introduction

Heterogeneous computing with multiple levels of parallelism is a leading topic in the design of future exascale systems. Indeed, accelerators like current-generation GPGPUs offer relatively high bandwidth with lower relative power consumption than general-purpose processors. However, GPU-based acceleration requires special programming constructs (e.g. NVIDIA's CUDA language) for the accelerated work. With the release of the Intel Many Integrated Core (MIC) architecture, an additional coprocessor technology is available to the scientific community. This document reports on several early experiences of porting applications to the Intel Xeon Phi platform. An attractive feature of this architecture is the support for standard threading models like OpenMP, which are already used by many scientific applications. In addition, the Xeon Phi platform is based on the x86 architecture, and C/C++ and FORTRAN kernels can be easily compiled for direct native execution on it. The objective of this work was to check the programmability and usability of the new Intel Xeon Phi in different contexts: several programming languages (C and FORTRAN), using the Intel Math Kernel Library (MKL) in different configurations, and applying MPI to a real application. The applications considered are taken from existing development efforts at CESGA: CalcunetW 1, an application developed in C which makes extensive use of the matrix multiplication BLAS routines included in MKL; GammaMaps, a FORTRAN application which calculates a figure of merit between two radiotherapy treatment doses; and ROMS, a FORTRAN application for oceanography which was used to check MPI inside the Xeon Phi. The remainder of the report is organized as follows. First, a brief description of the Intel MIC architecture is presented. The next section briefly describes the applications used as test cases. Finally, the results of the tests are presented, followed by a final section with the conclusions.
2 INTEL Xeon Phi

In this section, a brief description of the Intel Xeon Phi architecture is given. A more detailed architecture description can be found on the Intel website 2.

1 J.C. Mouriño, E. Estrada, A. Gómez. CalcuNetW. Calculate Measurements in Complex Networks, Informe Técnico CESGA
Intel Many Integrated Core (Intel MIC) is a multiprocessor computer architecture developed by Intel. It combines in a coprocessor several modified Intel CPU cores which execute the x86 instruction set in order with a short pipeline. Each core includes a new 512-bit SIMD vector processing unit (VPU), a dedicated 512 KB L2 cache for data, and a 32 KB L1 data cache and TLB. The L2 cache is kept fully coherent among all the cores. The VPU can execute up to 32 single-precision or 16 double-precision floating point operations per cycle with Fused Multiply-Add (FMA, which calculates a*b+c as a single instruction), or half of that when FMA cannot be applied. All the floating point operations follow IEEE 754 arithmetic, making this system suitable for scientific HPC. Each core executes 4 hardware threads, so one 60-core Xeon Phi can execute up to 240 threads simultaneously. The cores are connected by a high-speed bidirectional ring interconnect, which allows them to access the RAM memory (up to 8 GB) through directly connected memory controllers, and the PCIe bus. The RAM memory is based on GDDR5 technology.

Figure 1: The first generation Intel Xeon Phi product codenamed Knights Corner

The Intel Xeon Phi is provided as a coprocessor unit which is attached to the PCIe bus of the host. This board loads a dedicated Linux operating system and can be configured to have its own IP address and services, so the final user can log in or copy data using standard Linux commands such as ssh or scp. The
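These figures imply a theoretical peak throughput. As a back-of-the-envelope estimate for the 61-core part at 1.09 GHz used later in this report (the exact clock varies by model):

```latex
\text{Peak}_{\mathrm{DP}} \approx 61~\text{cores} \times 16~\tfrac{\text{FLOP}}{\text{cycle}} \times 1.09~\text{GHz} \approx 1.06~\text{TFLOPS}
\qquad
\text{Peak}_{\mathrm{SP}} \approx 61 \times 32 \times 1.09 \approx 2.13~\text{TFLOPS}
```

Reaching anything close to this peak requires that the 512-bit VPU is kept busy with FMA instructions, which is why vectorization matters so much on this architecture.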
Xeon Phi filesystem is mounted directly on the RAM memory; as a consequence, the copy of the operating system and the loading of data files reduce the amount of memory available to applications. As an alternative, an NFS filesystem can be used to access the host filesystem or, when working in offload mode (see later), special directories are mounted automatically. Intel provides C, C++ and FORTRAN compilers, mathematical libraries (MKL), debuggers, and other development tools. The compilers can generate applications which can be executed in two modes:
- Native. The generated binary can only be executed on the Intel Xeon Phi. If it is compiled on the host, the executable must be transferred to the coprocessor for execution. To simplify this step, Intel provides the tool micnativeloadex, which copies the executable and the needed libraries to the Xeon Phi before executing it.
- Offload. The application is executed on the host, but some sections are selected to run on the Xeon Phi using pragmas. The compiler automatically generates the binary code for executing these sections on the coprocessor, and they are transferred automatically. Although the programmer can also use pragmas to select which data should be transferred from the host memory to the board and back, the compiler can detect this automatically in several cases, reducing the complexity of porting applications.
Some functions of the MKL library support offload mode directly. This mode of execution is selected by external environment variables 3:
- MKL_MIC_ENABLE. If set to 1, the MKL library uses the Xeon Phi coprocessor with automatic offload.
- MKL_HOST_WORKDIVISION. A number between 0.0 and 1.0 telling the MKL library how much work must be done on the host.
- MKL_MIC_WORKDIVISION. A number between 0.0 and 1.0 selecting the amount of work to be done on the Intel Xeon Phi coprocessor.
If more than one board is attached to the host, the division can be set per board using MKL_MIC_<BOARD NUMBER>_WORKDIVISION, where BOARD NUMBER is the id of the Xeon Phi coprocessor (starting at 0).
- MKL_MIC_MAX_MEMORY. Limits the amount of memory to be used on the MIC when automatic offload is used.
Because it is a multi-threaded environment, when an OpenMP application is executed, the affinity of the threads to the cores is an important issue. It can be selected with an environment variable
(<prefix>_KMP_AFFINITY, where the prefix is set using another environment variable, MIC_ENV_PREFIX). The Intel Xeon Phi supports three policies and two granularities. The policies are:
- Compact. The threads are placed in order in the cores, as compactly as possible. Thus, an 8-thread application will use only two cores, because each core can execute up to 4 threads (Figure 2).
- Scatter. Threads are spread as much as possible among the cores, in order, avoiding the sharing of the same core when possible (see Figure 3 for an example with 4 cores and 8 threads).
- Balanced. This mode, which is not supported on hosts, is similar to scatter, but if the number of demanded threads is larger than the number of cores, the threads are placed grouping those with the nearest tag. For example, for an 8-thread application on a 4-core Xeon Phi, threads 0 and 1 will share the same core (see Figure 4).

Figure 2: Example of compact policy in a 4 core coprocessor for 8 threads

Figure 3: Example of scatter policy in a 4 core coprocessor for 8 threads
Figure 4: Example of balanced policy in a 4 core coprocessor for 8 threads with granularity fine.

Figure 5: Example of balanced policy in a 4 core coprocessor for 8 threads with granularity core.

The granularities are:
- Fine (or thread). Each thread is bound to a single hardware thread.
- Core. The threads are bound to a core and can migrate from one hardware thread of that core to another. See Figure 5 for an example with the balanced policy.
More information about the Xeon Phi and how to program it is available in the Intel Xeon Phi Coprocessor
System Software Developers Guide 4.

3 Applications

3.1 CalcunetW

Complex networks, consisting of sets of nodes or vertices joined together in pairs by links or edges, appear frequently in various technological, social and biological scenarios. These networks include the Internet, the World Wide Web, social networks, scientific collaboration networks, lexicon or semantic networks, neural networks, food webs, metabolic networks and protein-protein interaction networks. They have been shown to share global statistical features, such as the small-world and scale-free effects, as well as the clustering property. The first feature is simply the fact that the average distance between nodes in the network is short and usually scales logarithmically with the total number of nodes. The second is a characteristic of several real-world networks in which there are many nodes with low degree and only a small number with high degree (the so-called hubs). The node degree is simply the number of ties a node has with other nodes. In scale-free networks, the node degree follows a power-law distribution. Finally, clustering is a property of two linked nodes that are each linked to a third node. In consequence, these three nodes form a triangle, and clustering is frequently measured by counting the number of triangles in the network. In order to calculate some measurements in complex networks, a simple program has been developed. This application calculates some characterization measurements for a given network and compares them with a number of random networks given by the user. The measurements calculated by the program are: the Subgraph Centrality (SC), SC odd, SC even, Bipartivity, Network Communicability (C(G)) and Network Communicability for Connected Nodes. A detailed description of this application can be found in the CalcuNetW technical report 5.
A given number of random networks are generated for comparison purposes, and the average value of the target measurements is calculated, indicating also the mean squared error. As the number of random networks grows, the computational time increases, but the mean results improve. The random networks are generated taking into account some restrictions: they must have the same number of nodes and edges as the original network, they must also be connected, and the node degrees must be the same as in the original network. The program is a simple executable developed in C, using the LAPACK and BLAS libraries. The program was initially not parallelized, but can make use of the parallel capabilities of the
MKL LAPACK and BLAS libraries. However, the application is easily parallelizable with OpenMP. Using the Intel Xeon Phi, we have explored its parallel capabilities to speed up the process. The size of the matrix can be huge in real cases. This fact has been taken into account for the movement of data between the host and the Xeon Phi and for the distribution of tasks among the Xeon Phi processors.

3.2 GammaMaps

In cancer radiation therapy treatments, especially in complex cases, the calculated doses are verified using experimental data before the treatment itself is delivered to the patient. There are many figures of merit to check the quality of the proposed treatment. One of them is the gamma index 6, which generates a difference map between the measured and calculated doses. This gamma index can also be used to compare two doses calculated with different algorithms, as the eIMRT 7 project does. In this case, the calculated doses for the treatment (reference dose) are compared with those obtained by simulating the linear accelerator and the patient's body using Monte Carlo techniques (test dose). For each case, the body of the patient is divided into small cubes (called voxels) and the dose deposited in each of them is calculated. As a consequence, the full patient is a three-dimensional set of voxels with information about the dose deposited in its volume. In Figure 6, the meshes for the reference (blue) and test (green) doses are shown. Both grids can differ in the size of the voxels and the position of their edges.

6 D. A. Low, W. B. Harms, S. Mutic, and J. A. Purdy, A technique for the quantitative evaluation of dose distributions, Medical Physics, vol. 25, no. 5, May.
7 D. M. González-Castaño, J. Pena, F. Gómez, A. Gago-Arias, F. J. González-Castaño, D. A. Rodríguez-Silva, A. Gómez, C. Mouriño, M. Pombar, and M.
Sánchez, eIMRT: a web platform for the verification and optimization of radiation treatment plans, Journal of Applied Clinical Medical Physics, vol. 10, no. 3, p. 2998, Jan.
Figure 6: On the left, a voxel with the calculated dose. On the right, example of meshes for reference (blue) and test (green) doses.

The gamma index is defined for each voxel of the reference dose as:

\gamma(r_r) = \min_{r_t \in D_{test}} \sqrt{ \frac{|r_t - r_r|^2}{R^2} + \frac{(d(r_t) - d(r_r))^2}{D^2} }

where d(r_t) is the test dose for the voxel at position r_t, d(r_r) the reference dose for the voxel at position r_r, R is a parameter indicating the desired geometric distance to agreement between doses (usually 3 mm) and D the maximum required difference between doses (commonly 3% of the maximum dose to be delivered to the tumour). Any value less than one is considered acceptable, while values higher than 1 should be investigated. GammaMaps is an application developed for the eIMRT 8 project that calculates this gamma index for two- and three-dimensional data. It is written in FORTRAN and uses a geometric algorithm to speed up the process 9. It is parallelized using OpenMP.

9 T. Ju, T. Simpson, J. O. Deasy, and D. A. Low, Geometric interpretation of the γ dose distribution comparison technique: Interpolation-free calculation, Medical Physics, vol. 35, no. 3, p. 879, 2008.
3.3 ROMS

The Regional Ocean Modelling System (ROMS) is a software package that models and simulates an ocean region using a finite-difference grid and time stepping. It is a complex model with many options and capabilities. The code is written in F90/F95 and uses C-preprocessor flags to activate the various physical and numerical options. The simulations can take from hours to days to complete due to the compute-intensive nature of the software. The size and resolution of the simulations are constrained by the performance of the computing hardware used.

Figure 7: Grid domain decomposition

ROMS can be run in parallel with OpenMP or MPI. It does not use the MKL libraries, and neither is it an easy case for introducing OpenMP pragmas to activate the offload mode. Therefore, this test case was used to try the Message Passing Interface (MPI).
An example of a grid domain decomposition with tiles is shown in Figure 8, one colour per tile. The overlap areas are known as ghost points. Each tile is an MPI process and contains the information needed to time-step all of its interior points. For MPI jobs, the ghost points need to be updated between interior-point computations.

Figure 8: Example of tiled grid

The main characteristics of ROMS with MPI are:
- The master process (0) does all the I/O (NetCDF).
  o On input, it sends the tiled fields to the respective processes.
  o It collects the tiled fields for output.
- ROMS needs to pass many small MPI messages.
- The product NtileI * NtileJ must match the number of MPI processes (more MPI processes means fewer points per tile, and more communications are needed to exchange neighbour information).

This test compares host and hybrid modes because real ROMS simulations involve input and output files of several GB managed by the master (process 0).

4 Results

4.1 Infrastructure

The tests were performed during February 2013 on an infrastructure provided remotely by Intel. Two systems were used:
- A Xeon Phi with 61 cores (testbed 1). The characteristics of this system are shown in Table 1 for the host and Table 2 for the Intel Xeon Phi coprocessor. CalcunetW and GammaMaps were tested on this system, and some execution attempts were made with ROMS. The maximum number of threads was set to 240, leaving the last core unused.
- A Xeon Phi with 60 cores (testbed 2). The characteristics of this system are shown in Table 3 for the host and Table 4 for the Intel Xeon Phi coprocessor. GammaMaps, with some additional modifications explained below, and ROMS were tested on this system. The maximum number of threads was set to 240.
Host (testbed 1)
CPU Model: Intel Xeon CPU E GHz
Nr. of cores: 16
Memory: MB
Operating System: Linux el6.x86_
Compiler Version: U2

Table 1: Host characteristics for testbed 1

Intel Xeon Phi (testbed 1)
Model: Beta0 Engineering Sample
Nr. of cores: 61 at 1.09 GHz
Memory: 7936 MB
Operating System: MPSS Gold U1
Compiler Version: 2013U2
GDDR Technology: GDDR5
GDDR Frequency: KHz

Table 2: Intel Xeon Phi technical characteristics for testbed 1

Host (test-bed 2)
CPU Model: Intel Xeon CPU E
Nr. of cores: 16
Memory: 128 GB
Operating System: RHEL 6.3
Compiler Version: composer_xe_

Table 3: Host characteristics for testbed 2
Intel Xeon Phi (test-bed 2)
Model: 5110P
Nr. of cores: 60 at GHz
Memory: 8 GB Elpida
Operating System: MPSS release 2.1 (kernel g9b2c036 on k1om)
Compiler Version: 2013U3
GDDR Technology: GDDR5
GDDR Frequency: KHz

Table 4: Intel Xeon Phi technical characteristics for testbed 2

4.2 CalcunetW

The following figures show the main initial results. Figure 9 shows the elapsed time for calculating one network of 2324 nodes plus one random network, executed in several configurations:
- Host. The application has been compiled for execution on the host.
- Xeon Phi Native. It is compiled to be executed exclusively on the Xeon Phi coprocessor. The input file is copied to the Xeon Phi coprocessor before execution; the time to copy this data is not included in the results.
- Compiler-assisted Offload. The code has been modified to include a pragma before the DGEMM call to execute it on the Intel Xeon Phi using offloading.
- Workdivision=1.0. The unmodified application is executed on the host with a workdivision equal to one, so the MKL DGEMM function is executed on the Intel Xeon Phi coprocessor.
- Automatic Offload. In this case, MKL should select the workdivision between host and coprocessor automatically.

The results (mean value of 10 repetitions on testbed 1) show that native execution on the Xeon Phi is about 6 times slower than the other methods. There are no significant differences between the host and the offload versions, because the amount of offloaded calculation is less than 10%.
Figure 9: Execution time with one random matrix

Figure 10 shows the scalability on both the host (with hyperthreading enabled) and the Xeon Phi. Again, the elapsed time for a network of 2324 nodes plus one random network was measured. This version of the program has not itself been parallelized, but makes use of the parallel capabilities of the MKL functions. As can be observed, the program does not reduce its execution time beyond 4 threads on the host and 16 threads on the Xeon Phi. Finally, the application was parallelized with OpenMP: each thread generates and calculates one random matrix. Figure 11 shows the elapsed time for different numbers of random matrixes on one socket of the host, on two sockets (in both cases without hyperthreading) and on the Xeon Phi. In this case the number of nodes of each network is 616, due to memory restrictions inside the Xeon Phi. As can be observed, if the number of matrixes (networks) is high, the performance on the Xeon Phi is better than on an E CPU without hyperthreading. But if the host is used with both sockets, its performance is still better than one Xeon Phi.
The size of the matrixes can be huge in real cases, so this must be taken into account in the movement of data between the host and the Xeon Phi, and in the distribution of tasks among the Xeon Phi processors.

Figure 10: Scalability with one random matrix
4.3 GammaMaps

Figure 11: Parallel performance with increasing number of random matrixes

Using the Xeon Phi, the FORTRAN program for 3D calculations was tested in four models: local execution (where the main loops were parallelized using OpenMP and executed with all the available cores), native Xeon Phi execution (where the same code was compiled to be executed on the Xeon Phi), offload to the Xeon Phi (where the initial information is read and processed by the host but the main loops are executed exclusively on the Xeon Phi) and nested (where the main loops are executed simultaneously by the Xeon Phi and the host, after modifying the code to support this new execution method using parallel sections). For the OpenMP executions, all the available affinities for each system (host and Xeon Phi) were tested. The first three methods were executed on testbed 1, but timings for the nested case were recorded on a 60-core Xeon Phi (testbed 2). The following figures show the main initial results. Figure 12 shows the speed-up of the problem when it is executed on the host with different affinities. Scatter affinity shows the best performance and scales well up to 16 threads. Because there can be some imbalance, the best solution is to use dynamic scheduling. Figure 14 shows the elapsed time when the main loops are offloaded to the Xeon Phi. Balanced and scatter affinities produce similar results, but compact, when the number of threads is below the maximum, performs worse, as expected. Figure 15 compares the results of three of the execution methods on the 61-core Xeon Phi. The program can be divided into four sections: reading information (two files of about 300 MB each), initialization of arrays based on the read information, parallel computation of the gamma index using these arrays, and storage of the results (again around 300 MB). This figure shows the best elapsed time in seconds for the different execution methods. The Xeon Phi performs very well in the parallel phase but, due to poor I/O performance, the total time is not competitive with the local host. Also, the initialization phase performs worse than on the local host, even with several sections vectorized.

Figure 12: Speed-up for the local host with different affinities. The X and Y loops were collapsed

Figure 16, Figure 17 and Figure 18 show the results when the same case was executed on the second test-bed. Now it includes the nested case, where the main loops were divided symmetrically between the host and the Xeon Phi. It shows that the nested solution using the full capacity of the host is limited by the Xeon Phi section. To achieve a better elapsed time, the correct workdivision
should be defined. The main achievements for this test case are:
- A Xeon Phi is almost equivalent in performance (for this test case) to one E CPU without hyperthreading.
- Xeon Phi I/O was too slow (almost 10 times slower than the host).
- Sharing the parallel work between the host and the Xeon Phi required refactoring.

Figure 13: Elapsed time for the test case on the host
Figure 14: Elapsed time for the offload method
Figure 15: Execution times for the different phases
Figure 16: Elapsed times in the second test-bed. Xeon E vs. Xeon Phi 60 cores
Figure 17: Offload execution times for the 60 cores Xeon Phi
Figure 18: Comparative results

4.4 ROMS

Because ROMS usually has a high demand of input and output (it has to read and write several GB of data), and poor I/O was observed on the Intel Xeon Phi in the other tests, only the host and hybrid cases were executed. Although the executed benchmarks (see later) do not need input files, this situation is not common. As with many scientific applications, ROMS needs some external libraries. Prior to performing the tests on the MIC, it was necessary to build the required libraries. The zlib and NetCDF libraries were compiled for both the host CPU and the MIC. While building these libraries for the host architecture is a common task, building them for the MIC needs cross-compilation techniques. Native MIC builds were configured by adding a new "mic" Linux target to the autotools config.sub script (following the model of an existing target called "blackfin"). To build native libraries for the MIC, the -mmic flag was added to
the compiler options. The configure options used to build a native MIC version of the NetCDF static library were:

./configure CXX=icpc CC=icc FC=ifort FFLAGS=-mmic FCFLAGS=-mmic --disable-shared --disable-netcdf-4 --host=mic

The option --disable-netcdf-4 avoids the need to compile HDF5 first; it was used to simplify testing and save time, because the first attempts to compile HDF5 on the MIC were unsuccessful. Following this strategy it was possible to build host and native NetCDF static libraries and, after that, the ROMS executables for the MIC. To execute the tests, the ROMS benchmark case 11 was used. The model runs for 200 time-steps, and no input files are needed since all the initial and forcing fields are set up with analytical expressions. Table 5 shows the parameters used for the benchmark.

Grid values
Number of I-direction INTERIOR RHO-points: Lm == 512
Number of J-direction INTERIOR RHO-points: Mm == 64
Number of vertical levels: N == 30

Table 5: Grid size for ROMS benchmark

The benchmark was executed with 16 tiles, i.e., 16 MPI processes. In the native host version, this matches the number of cores. The jobs were executed using the command:

mpirun -np 16 ./oceanM-benck1.host

where oceanM-benck1.host is the name of the executable. For the hybrid version, the executed command was:

mpirun -np 1 ./oceanM-benck1.host : -np 15 -host mic0 ./oceanM-benck1.mic

where mic0 is the name of the Xeon Phi coprocessor and oceanM-benck1.mic is the name of the executable for the Xeon Phi, which should be copied to the coprocessor beforehand; the environment variable I_MPI_MIC should be set to enable. In this case, process 0 is executed on the host while the other processes are executed on the MIC. Table 6 shows the results of this test. The hybrid version is 20 times slower than the host version, maybe due to the low performance of a single core on the Xeon Phi, where
the hardware threads are not fully used, and to the communications between process 0 and the others. On the host these MPI messages are faster because of the use of shared memory, while between the Xeon Phi and the host the communication goes over the PCIe interface. For this benchmark, running with more than 16 processes was unfeasible. To use the 60 cores, a larger case must be used so that there are enough points per tile to calculate, which increases the memory demand. A second optimization could be the use of a hybrid parallel mode (MPI+OpenMP), where the Xeon Phi hardware threads can be used within each process. Unfortunately, due to time constraints of the full experiment, no further investigation could be done.

                            HOST   HYBRID
NtileI * NtileJ (nº cores)
Elapsed time (seconds)

Table 6: MPI benchmark results

Apart from the initial problems found in compiling the required libraries, porting ROMS to the MIC architecture was not a hard task. Easily achieving good performance on this architecture is another issue; it requires further analysis and maybe some modifications to the code. ROMS is probably not the best real application case to exploit this kind of architecture, because real simulations need to perform a significant amount of I/O (up to 60 GB for a NetCDF output file).

5 Conclusions

The three presented experiments were designed to investigate the complexity of porting existing applications and running them on the new Intel MIC architecture. The objective was to execute the applications with minimum changes and compare the results against execution on a classical architecture. Vectorization was not used explicitly (i.e., no specific pragmas to drive the compiler to vectorize particular loops), but the compiler options were selected to allow auto-vectorization. Changes to the original code were made only to introduce the pragmas which enable offload mode and, in one single case, to execute in hybrid mode sharing loop work between host and MIC.
The main early conclusions that we can extract from the experiments are:
- The I/O of the Xeon Phi should be improved; it is currently a handicap for the native mode. We tested both NFS-mounted and local filesystems, with the same results.
- The initial porting of the applications is easy, but getting real performance needs real modifications to the code. New pragmas to divide the work between host CPUs and Xeon Phi in parallelized OpenMP loops would be welcome.
- The performance of a Xeon Phi for the selected cases is close to a single Xeon E CPU (using all the cores).
- The affinity policy is important for good performance when the full number of threads is not used.
- The RAM memory is, for some problems, small. A larger memory/cores ratio would be desirable, taking into account that the filesystem consumes part of the available memory 12.
- The low MPI performance needs more research. We do not yet have a clear idea about its causes.

Acknowledgements

The authors would like to thank Intel for providing access to Intel Xeon Phi coprocessors.

12 Intel released a new Xeon Phi with 16 GB in June 2013, which could solve some of the issues detected in this work.