Investigation of Intel MIC for implementation of Fast Fourier Transform


Soren Goyal
Department of Physics, IIT Kanpur

The objective of the project was to run the code for the Fast Fourier Transform on a newly developed computing architecture, the Intel MIC (Many Integrated Core). If higher performance is obtained on this architecture, the simulation software Tarang, which relies heavily on computing Fourier transforms, will also be able to perform better and faster.

I. INTRODUCTION

Accurate numerical schemes are required for simulating turbulent flows. Such simulations are important in weather prediction and climate modeling, and they aid the general understanding of fluid flows. Prof. Mahendra K. Verma has developed Tarang (Verma, 2011) for these applications. At present, Tarang has solvers for incompressible flows involving pure fluid, Rayleigh-Bénard convection, passive and active scalars, magnetohydrodynamics, liquid metals, etc.

One of the key algorithms used in Tarang is the Fast Fourier Transform (FFT). Typically, these simulations spend 75% of their time computing Fourier transforms. Therefore, optimizing the FFT implementation will significantly improve Tarang's capacity to handle bigger simulations. At present Tarang is written to run on Intel multi-core platforms (e.g., Intel Xeon, Intel Core i7). However, promising new computing platforms such as Intel MIC (Many Integrated Core) have been developed, and these can be leveraged to improve the performance of Tarang. The objective of the project is to investigate the Intel MIC architecture and port Tarang's code to this new platform.

II. ABOUT INTEL MIC

Intel MIC (pronounced "mike"), or Intel Many Integrated Core, is a coprocessor architecture developed by Intel for high performance computing. It is a shared-memory architecture that combines many Intel CPU cores on a single chip. Programs for it can be written in C, C++ and FORTRAN.
These programs use familiar programming models and support parallel execution of code through standard parallel programming APIs such as OpenMP and MPI. This gives Intel MIC the advantage that existing source code written for an Intel Xeon processor can be compiled and run on an Intel MIC based chip. Starting from 2011, processors based on this architecture have been released and branded as Intel Xeon Phi. These processors have been installed in many supercomputing facilities. The Texas Advanced Computing Center (TACC) is using coprocessors based on Intel MIC in its 10-PetaFLOPS "Stampede" supercomputer. In June 2013, the Tianhe-2 supercomputer at the National Supercomputing Center in Guangzhou (NSCC-GZ) was announced as the world's fastest supercomputer. It uses Intel Ivy Bridge-EP Xeon and Xeon Phi processors to achieve its PetaFLOPS-scale performance. In IIT Kanpur's HPC-2010 computing cluster, 4 nodes have been equipped with Xeon Phi 5100 cards, 2 cards on each node.

The key features of the Intel Xeon Phi 5100 are:

1. 60 cores (a variation of the Intel Pentium core) have been packed on a single chip, with a shared memory of 16GB.
2. Each core can execute up to 4 threads at once, giving the processor the ability to execute a total of 240 threads in parallel.
3. Each core has a 512-bit wide SIMD Vector Processing Unit (VPU).
4. Xeon Phi is a coprocessor, so parts of the computation on the host processor can be offloaded onto it for execution.

Further details on how to use the Intel Xeon Phi cards installed in the HPC-2010 cluster are given in the Appendix.

III. ABOUT FAST FOURIER TRANSFORM

A Fast Fourier Transform (FFT) is an algorithm to compute the discrete Fourier transform (DFT) and its inverse. FFTs are frequently used in engineering applications. The FFT has been described as the most important numerical algorithm of our lifetime (Strang 1994).
Given a signal $X(t)$ of size $N$, the naïve way of computing the Discrete Fourier Transform $F(k)$ is given by the following equation:

$$F(k) = \sum_{t=0}^{N-1} X(t)\, e^{-2\pi i k t / N} \qquad (3.1)$$

This can be interpreted as multiplying the vector $X(t)$ with the matrix $W$ whose elements are $W_{ij} = e^{-2\pi i\, ij/N}$. As the dimension of the matrix is $N \times N$, the operation will have a time complexity of $O(N^2)$.

An algorithm to evaluate individual Fourier coefficients was devised by Goertzel (Goertzel 1958). It requires $O(N)$ operations per coefficient, but its errors grow rapidly (Gentleman 1969), so it is suitable only for computing a small number of coefficients. As finding the Fourier transform is essentially the task of multiplying a matrix with a vector, (Cooley, et al., 1965) proposed an algorithm based on the matrix multiplication technique of (Good, 1958).

IV. EXPERIMENTS

1. Comparison of different FFT implementations

FFT has been implemented by a number of libraries. The most famous among them is FFTW3 [citation], developed at MIT by Matteo Frigo and Steven Johnson. Intel has also developed an FFT implementation for use on its processors. It is shipped as part of the Intel MKL (Math Kernel Library). As the first experiment, and to gain hands-on experience of parallel programming in the Intel Xeon and MIC environments, the performance of FFTW and MKL was compared.

a. Comparison of performance of FFTW3 and MKL on Intel Xeon

The current code of Tarang is compatible with Intel Xeon and uses FFTW3 for computing Fourier transforms. The task was to compute the Fast Fourier Transform of a signal containing $10^7$ elements. This operation typically gets faster as more threads are used. The time taken to compute the transform was recorded as a function of the number of threads, as shown in Fig 1. Intel MKL clearly outperforms FFTW3. Another point to note is that Intel MKL provides wrapper functions for FFTW3. This makes programming easier, since the same code, with minor modifications, can be linked to either FFTW or MKL as required.

FIG. 1 (Color Online) Comparison between the FFT implementations of Intel MKL and FFTW3 on Intel Xeon

b. Comparison of performance of FFTW and MKL on Intel Xeon Phi

Intel MKL supports Intel MIC. FFTW is compatible with most of the common x86 platforms, but since Intel MIC is a new architecture, FFTW could not be built for it. It might be possible to tweak the FFTW build process to get this done, but that would require deeper knowledge of Intel MIC and of the FFTW compile process.

2. Scaling of FFT on Intel Xeon and Intel Xeon Phi

The performance of FFT on Xeon and Xeon Phi was compared. Only the MKL implementation was used because, as observed in the previous experiment, FFTW could not be built for Xeon Phi. Code on Xeon Phi and Xeon can be executed in a number of ways (all of them are described in the Appendix). For Intel Xeon there were two possibilities: Offload Enabled and Offload Disabled. The documentation of Intel MKL claims that if a Xeon Phi is attached to the host processor, the compiler will ensure that the relevant portions of the code are offloaded to the Xeon Phi to gain additional speed-up. In the Offload Enabled mode this feature is allowed, while in the Offload Disabled mode the code is executed strictly on the host processor. Fig 2 compares the performance in the two modes.

FIG. 2 (Color Online) Comparison of performance of FFT on Intel Xeon with Offload enabled and disabled

For Intel Xeon Phi, there are two modes of execution (both are explained in the Appendix): Offload


Execution and Native Execution. The documentation claims that Native Execution is faster than Offload Execution because Offload Execution has communication overheads. Fig 3 shows the performance of the two modes. The FFT of a $10^7$-element signal was computed for each of the 4 modes of execution.

FIG. 3 (Color Online) Comparison between performance of native execution and offload execution of FFT on Intel Xeon Phi

Xeon can support 32 parallel threads while Xeon Phi can support 240 parallel threads. Since FFT is a highly scalable algorithm, it was expected to perform much better on Xeon Phi. But as is evident from the graph, the performance on Xeon Phi is worse than the performance on Xeon. The FFT on Xeon Phi does not scale beyond 120 threads. Fig 4 shows the performance of FFT on both Xeon and Xeon Phi.

FIG. 4 (Color Online) Performance of FFT on Intel Xeon and Intel Xeon Phi

3. Variation in performance with Signal Size

As mentioned in Section III (About Fast Fourier Transform), the performance of FFT implementations depends on the signal size. The Intel MKL implementation performs faster for signal sizes that can be factored into products of small primes. A signal size of the form $2^k$ will therefore be transformed most efficiently, while a signal size equal to a large prime will be the least efficient. The size of the signal was varied between $10^6$ and $10^7$, taking only numbers of the form $2^n 3^m$. The transforms were carried out on both Xeon and Xeon Phi. A 3D graph is plotted with Signal Size (Data Size) on the X-axis, Number of Threads on the Y-axis and GigaFLOPS on the Z-axis. Note that although the execution on Xeon was scaled up to a maximum of 32 threads, in the graph it has been stretched along the Y-axis to 240 threads for ease of comparison.

FIG. 5 3D plot of Signal Size vs Number of Threads vs Performance (GFLOPS)

It was observed that for certain combinations of thread count and signal size the data points of Xeon Phi lie above those of Xeon, indicating that Xeon Phi can indeed outperform Xeon. A graph showing the best performances of Xeon and Xeon Phi has been plotted for comparison.

FIG. 6 Best performances of Xeon and Xeon Phi for various signal sizes

V. CONCLUSIONS

The results indicate that Intel Xeon Phi may offer an opportunity to speed up Tarang. In addition, the existing code of Tarang can be ported to Intel Xeon Phi with minimal changes. This makes Intel Xeon Phi a very attractive option. However, a lot more study is needed to obtain the speed-up. The implementation of the Fast Fourier Transform is highly non-trivial and requires careful study to identify the parameters its performance depends on. Similarly, Xeon Phi is a new computer architecture, and the nuances of its hardware must be understood in order to optimize FFT for it.

VI. FUTURE WORK

The first step would be to understand the parallel implementation of FFT and the architecture of Intel Xeon Phi. The understanding gained will be used to optimize the algorithm for the Intel MIC architecture. Further, the following questions also need to be answered:

1. Is MKL utilizing the VPUs of Intel Xeon Phi?
2. Why is the FFT algorithm not scaling beyond 120 threads on Xeon Phi?
3. Xeon Phi has 60 physical cores and each core can support 4 threads, so when more than 60 threads are instantiated, how are they distributed among the cores?

Bibliography

Cooley J W and Tukey J W An algorithm for the machine calculation of complex Fourier series [Journal]. - [s.l.] : Mathematics of Computation, 1965.
Gentleman W. M. An error analysis of Goertzel's (Watt's) method for computing Fourier Coefficients [Journal] // Journal of Computation. - 1969. - pp. 12:
Goertzel G. An algorithm for the evaluation of finite trigonometric series [Article] // The American Mathematical. - January 1958. - p. 65(1):
Good I J The interaction algorithm and practical Fourier analysis [Journal] // Statistics. - [s.l.] : Royal Society, 1958.
Strang Gilbert Wavelets [Article] // American Scientist. - May 1994. - p. 82.
Thiagarajan Sudha Udanapalli [et al.] Intel Xeon Phi Coprocessor Developer's Quick Start Guide [Online]. - Intel, 2013.
Verma Mahendra K Object-oriented Pseudo-spectral code TARANG for turbulence simulation [Online] // arxiv.org. - March 2011.

APPENDIX

Working with Intel Xeon Phi

The reference material for development on Intel Xeon Phi is available at (Thiagarajan, et al., 2013). This appendix gives a high-level description of the features of Xeon Phi and how to program on it in IIT Kanpur's HPC-2010 cluster.

Setting Up the Environment

After getting an account created on the HPC-2010 cluster, do the following to access the Xeon Phi coprocessor and set up the environment:

1. Log on to the HPC2010 cluster at hpc2010.hpc.iitk.ac.in.
2. 4 nodes are available which have Xeon Phi cards attached to them: mic001, mic002, mic003 and mic004. Log on to any one of these mic nodes.
3. To run the 64-bit Intel compiler:
    /opt/extra_software/intel/initpaths intel64

Parallel Programming Options on Intel Xeon Phi

Most of the parallel programming options available on the host systems are also available for the Intel Xeon Phi coprocessor. These include the following:

1. Intel Threading Building Blocks (Intel TBB)
2. OpenMP
3. Intel Cilk Plus
4. pthreads

Of the 4 options, only OpenMP was used for this project. There is no correspondence between OpenMP threads on the host CPU and on the Intel Xeon Phi coprocessor. Because an OpenMP parallel region within an

offload pragma is offloaded as a unit, the offload compiler creates a team of threads based on the available resources on the Intel Xeon Phi coprocessor. Since the entire OpenMP construct is executed on the Intel Xeon Phi coprocessor, the usual OpenMP semantics of shared and private data apply within the construct.

Compiling A Program

There are two main ways of compiling a program for Xeon Phi.

Native Compilation: The binary is built on the host's file system using Intel's icc compiler. This file and its dependencies are then copied to the coprocessor's filesystem and executed. For example, suppose the following code is saved in test.cpp.

    #include <cstdio>

    const int SIZE = 1000;

    int main() {
        float ret = 0;
        int data[SIZE];
        for (int i = 0; i < SIZE; i++) data[i] = 1;  // placeholder initialization
        // Split the loop across threads; the reduction clause avoids a data race on ret
        #pragma omp parallel for reduction(+:ret)
        for (int i = 0; i < SIZE; i++)
            ret += data[i];
        printf("%f\n", ret);
        return 0;
    }

The pragma directive instructs the compiler to divide the for loop among the maximum number of threads the processor can support. The compilation steps are as follows:

1. Compile the program with the -mmic flag:
    icc -mmic -openmp test.cpp
2. The output file does not need to be copied to the Xeon Phi's filesystem, as the host and the Xeon Phi share the same filesystem on HPC2010. So now log on to one of the MIC cards (for example, mic001-mic0).
3. Set the library paths for OpenMP and Intel MKL:
    LD_LIBRARY_PATH=/opt/extra_software/intel/mkl/lib/mic/:/opt/extra_software/intel/composer_xe_2013/lib/mic/
4. Execute the program.

Offload Compilation: Intel's icc allows the user to specify a region of code to be offloaded to the Xeon Phi card. There are many ways to do it; a simple method is described here to get started and run basic codes. The code is modified to indicate the segment to be offloaded:

    #include <cstdio>

    const int SIZE = 1000;

    int main() {
        float ret = 0;
        int data[SIZE];
        for (int i = 0; i < SIZE; i++) data[i] = 1;  // placeholder initialization
        // Run the following OpenMP loop on the Xeon Phi card
        #pragma offload target(mic)
        #pragma omp parallel for reduction(+:ret)
        for (int i = 0; i < SIZE; i++)
            ret += data[i];
        printf("%f\n", ret);
        return 0;
    }

To compile and run it, the following steps are followed:

1. As the code contains segments to be offloaded, the compiler has to be instructed to link the offload segments to libraries meant for MIC. This is done using the -offload-option flag. The $(LIBS) variable can be replaced by the libraries to be used for compilation of the program:
    icc test.cpp -openmp -offload-attribute-target=mic -offload-option,mic,compiler,"-L/opt/extra_software/intel/composer_xe_2013/lib/mic $(LIBS)"
2. The program can now be executed on the host directly; the segment meant for Xeon Phi will be offloaded to it.
