FEniCS Performance Investigation and Porting minidft to GPU Clusters


FEniCS Performance Investigation and Porting minidft to GPU Clusters

Chao Peng

17th August 2017

MSc in High Performance Computing with Data Science
The University of Edinburgh

Year of Presentation: 2017

Abstract

This dissertation project is based on participation in the 2017 International Supercomputing Conference Student Cluster Competition (ISC17 SCC). The author was responsible for FEniCS optimisation before the competition and, as part of the dissertation project, ported minidft to the CPU-GPU hybrid cluster after the competition. The report has three parts. The first part describes the design of the cluster for the competition and discusses the process of hardware and software installation; it also covers the challenges we faced and the competition result. In the second part, the report provides details and step-by-step performance results of how the author optimised FEniCS on the cluster by trying different compilers, BLAS libraries, MPI implementations, compiler optimisation flags and optimisation levels. The last part of the report describes the process of porting the CPU-only program minidft to the cluster with NVIDIA P100 GPUs. The process starts with CPU-oriented optimisation, including replacing OpenBLAS, FFTW3 and ScaLAPACK with the Intel MKL library and some optimisation of the source code. The report then discusses the performance improvement given by GPU-enabled libraries, including cublas and MAGMA. Some command-line optimisation methods, such as process binding, are also included in this part. After utilising MKL, cublas and MAGMA, the overall execution time of minidft was reduced to only 7% of that of the original version.

Contents

1 Introduction
2 Background Review
  2.1 Heterogeneous Architectures
  2.2 The NVIDIA Tesla P100 Accelerator
  2.3 The CUDA Programming Model
  2.4 The cufft and cublas Libraries
  2.5 The MAGMA Library
  2.6 Project Motivation
  2.7 Obstacles and Deviation from the Project Plan
3 The Student Cluster Competition
  3.1 Competition Guidance
  3.2 Benchmarks and Applications
  3.3 Awards
  3.4 Team EPCC's Cluster Configuration
    3.4.1 Hardware Configuration
    3.4.2 Software Configuration
  3.5 Preparing for the Competition
  3.6 Competition Results and Experiences
4 Building FEniCS on the Cluster and Performance Analysis
  4.1 Introduction to FEniCS
  4.2 Initial Work and Obstacles
  4.3 Performance Investigation
  4.4 Summary
5 Porting minidft to NVIDIA GPUs
  5.1 Initial Performance Testing
  5.2 Optimisation Based on Source Code Investigation
  5.3 Performance with GPU-enabled Libraries
    5.3.1 FFT-related Code Optimisation
    5.3.2 BLAS-related Code Optimisation
    5.3.3 MAGMA-related Code Optimisation
  5.4 Process Binding
  5.5 NVIDIA Multi-Process Service
  5.6 Power Consumption and Summary
6 Conclusions and Future Work
  6.1 Future Work
A Automated Building Script for FEniCS

List of Tables

2.1 Top 10 supercomputers in the 2017 June Green 500 list; Available from:
3.1 Team EPCC cluster hardware configuration
4.1 FEniCS timing for different number of MPI tasks
4.2 FEniCS timing for different MPI implementations
4.3 FEniCS timing for different BLAS libraries
4.4 FEniCS timing for different levels of optimisation
5.1 minidft timing for different compilers and libraries
5.2 minidft timing for different levels of optimisation
5.3 minidft timing for different FFT communication types
5.4 FFTW subroutines replaced by cufft subroutines
5.5 minidft timing for process binding

List of Figures

3.1 One node of Team EPCC's cluster
4.1 FEniCS Streamline
4.2 FEniCS timing for different number of MPI tasks
4.3 FEniCS timing for different MPI implementations using the small test case
4.4 FEniCS timing for different MPI implementations using the large test case
4.5 FEniCS timing for different levels of optimisation
4.6 Step-by-step optimisation summary of FEniCS
5.1 minidft timing for different optimisation levels
5.2 Process binding topology
5.3 Step-by-step optimisation summary of minidft
5.4 CPU-only minidft power consumption
5.5 GPU-enabled minidft power consumption

Acknowledgements

First and foremost, I would like to express my great gratitude to my supervisor, Dr. Michele Weiland, for her constant guidance and the time she devoted to me. She carefully and patiently helped me arrange my work and guided me through the whole project. Her rigorous approach to scholarship and her deep knowledge of HPC greatly improved my understanding of academic research.

A special thank you goes to my teammates, Alexandros Nakos, Antriani Mappoura and Jingmei Zhang, for their work and the friendship we have built. I would also like to express my thanks to the coach of the team, Mr. Emmanouil-Ioannis Farsarakis, for his support and encouragement. Boston Limited and its member of staff Mr. Konstantinos Mouzakitis deserve many thanks for their technical support and for providing us with access to a fantastic cluster with state-of-the-art hardware components.

Finally, I would like to thank my parents for their precious love and unconditional support, without which I would not have had the opportunity to take this Master of Science programme at EPCC (Edinburgh Parallel Computing Centre), the University of Edinburgh.

Chapter 1 Introduction

High Performance Computing (HPC) has seen great performance improvements through parallelism and modern heterogeneous architectures. With the development of the HPC industry, processors, accelerators, interconnects, storage devices and other components have been updated over several generations. The HPC industry has also driven the development of parallel programming models and libraries such as the Message-Passing Interface (MPI), Open Multi-Processing (OpenMP) and High Performance Fortran (HPF). Additionally, more complex problems such as atmospheric simulation, which require high-density computation and the input and output of large-scale datasets, place great demands on the arithmetic speed and power efficiency of existing supercomputers. This has given rise to modern heterogeneous computing systems containing both conventional processors and accelerators, together with programming models such as the Open Computing Language (OpenCL) and Open ACCelerators (OpenACC).

With the aim of training the next generation of HPC experts, three major Student Cluster Competitions are held every year: the Asia Student Supercomputer Challenge (ASC) and two others held within the two mainstream HPC conferences, the Supercomputing Conference (SC) and the International Supercomputing Conference (ISC). They touch a wide range of HPC topics, from choosing the components of a cluster to optimising applications according to the characteristics of the cluster in order to achieve better performance. These competitions attract many student teams from well-known universities and scientific research institutions and provide them with a good platform for academic communication.

This report is written by a member of Team EPCC for the 2017 ISC Student Cluster Competition, which required each team to build a cluster on which to optimise and run benchmarks as well as applications (announced by the HPC Advisory Council) within a power budget of 3000 Watts. The report covers the work done by the author for the competition, including the optimisation of FEniCS and minidft.

FEniCS, a popular computing platform for solving partial differential equations (PDEs), was chosen by the HPC Advisory Council as one of the applications for teams to run

on their clusters in the competition. As a mature project, FEniCS is composed of seven parts; each part has different dependencies and is compiled separately. Although the FEniCS community provides Docker and Anaconda versions, these pre-built packages cannot be modified according to a user's specifications and the characteristics of the target machine. This report discusses how FEniCS was built on the cluster for the competition and compares the performance of different configurations.

MiniDFT is a simplified program for modelling materials using plane-wave density functional theory and was chosen by the HPC Advisory Council as the coding challenge of the competition. It utilises MPI, and one of its computational back-ends, FFTW3, uses OpenMP to achieve better performance through parallelism. However, it does not have GPU support. This report describes a GPU-enabled version of minidft and investigates the performance improvement provided by GPUs.

The remainder of this report is organised as follows:

Chapter 2: A background review of some aspects of HPC that are related to this dissertation, including heterogeneous systems, the NVIDIA P100 GPU and some GPU-oriented mathematical libraries, namely cufft, cublas and MAGMA, which are used to accelerate minidft in Chapter 5. This chapter also presents the project motivation, the obstacles we met and the deviations from the initial project proposal.

Chapter 3: An introduction to the ISC Student Cluster Competition and a review of the work done for the competition. The chapter starts with the details of the cluster and gives the reasoning behind the cluster configuration in terms of both hardware and software.

Chapter 4: The process of building FEniCS with different compilers and compilation parameters is introduced in this chapter.

Chapter 5: This chapter presents the process of optimising minidft. The process includes basic optimisation using different compilers and CPU-optimised mathematical libraries, followed by the work done to port minidft to NVIDIA GPUs using GPU-enabled libraries and GPU-specific command-line optimisation.

Chapter 6: The conclusions of this dissertation project and future work are discussed in this chapter.

Chapter 2 Background Review

The work done for the competition is discussed in the subsequent chapters. Because the competition touches a wide range of HPC fields, this chapter gives a brief introduction to the aspects that are relevant to this project.

2.1 Heterogeneous Architectures

High Performance Computing has experienced great performance improvements benefiting from both hardware and software parallelism. In the past, the performance of Central Processing Units (CPUs) doubled roughly every 18 months through improvements in chip fabrication technology and increases in clock frequency[1]. Binary digits (0s and 1s) are represented by different voltage levels, and as clock frequencies increased, supply voltages had to decrease to keep power consumption reasonable. However, the voltage cannot be reduced much further, because the binary digits would no longer be distinguished easily.

Over the last few years, computer scientists and engineers have made noticeable progress on Graphics Processing Units (GPUs) for the highly lucrative gaming market[2]. GPUs show excellent characteristics for HPC, such as power efficiency and enormous floating-point computing power, which have made them firmly established in the HPC industry. In addition, Intel released a different type of accelerator, the Intel Xeon Phi series, to compete with GPUs for scientific computing. Most supercomputers used CPU-only architectures before 2009, but the need for power efficiency has driven rapid adoption of GPUs in recent years[3]. Supercomputers containing both traditional CPUs and accelerators have now become popular, and scientific simulations in climate science, physics, astronomy and other domains benefit greatly from the development of heterogeneous HPC systems.

The Green 500 list[4] ranks the top 500 supercomputers in the world twice a year, in June and November. Unlike the Top 500 list, whose ordering criterion is only the rate of floating point operations per second (FLOPS), the Green 500 list puts

a premium on power efficiency and uses "FLOPS-per-Watt" as its power-performance metric. According to the June 2017 Green 500 list (shown in Table 2.1), nine of the top 10 most power-efficient supercomputers are heterogeneous systems, and they all use the NVIDIA Tesla P100 as their accelerator.

Rank  Name           CPU                   GPU
1     TSUBAME3.0     Xeon E5-2680v4 14C    NVIDIA Tesla P100
2     kukai          Xeon E5-2650Lv4 14C   NVIDIA Tesla P100
3     AIST AI Cloud  Xeon E5-2630Lv4 10C   NVIDIA Tesla P100
4     RAIDEN         Xeon E5-2698v4 20C    NVIDIA Tesla P100
5     Wilkes-2       Xeon E5-2650v4 12C    NVIDIA Tesla P100
6     Piz Daint      Xeon E5-2690v3 12C    NVIDIA Tesla P100
7     Gyoukou        Xeon D                N/A
8     RCF2           Xeon E5-2650v4 12C    NVIDIA Tesla P100
9     N/A            Xeon E5-2698v4 20C    NVIDIA Tesla P100
10    Saturn V       Xeon E5-2698v4 20C    NVIDIA Tesla P100

Table 2.1: Top 10 supercomputers in the 2017 June Green 500 list; Available from:

2.2 The NVIDIA Tesla P100 Accelerator

NVIDIA Tesla GPUs are widely used in supercomputers, enabling leading-edge Artificial Intelligence and Machine Learning systems and speeding up numerous HPC applications, as well as scientific research in many domains with highly complex simulations. Key features of the Tesla P100 include:

Exceptional performance. According to the white paper[5] released by NVIDIA, the Tesla P100 delivers 5.3 TFLOPS, 10.6 TFLOPS and 21.2 TFLOPS of double-precision, single-precision and half-precision floating point performance respectively. For Deep Learning algorithms that do not require high levels of floating-point precision, the extreme half-precision performance provided by the P100 and the reduced storage requirements of half-precision datatypes can give noticeable speedups.

The brand-new NVLink interconnect. As more and more hybrid systems deploy multi-GPU architectures to solve bigger and more complex problems, the bandwidth between GPUs has become an issue. NVIDIA therefore introduced a new high-speed interface, NVLink, enabling up to 160 GB/s of data transfer from one GPU to another, five times faster than traditional PCIe Gen 3.

High-speed memory. The P100 is the first GPU to introduce the new memory technology High Bandwidth Memory 2 (HBM2), which provides better performance through higher bandwidth (up to 256GB/s) and lower power consumption than conventional GPU memory. This feature enables the P100 to tackle much larger problems with much larger datasets.

Simplified programming. The P100 provides two major GPU programming features, Unified Memory and Compute Preemption. Based on the architecture of the P100, Unified Memory provides a single, unified virtual address space for accessing the memory of both CPUs and GPUs. Programmers therefore do not need to consider how to manage data transfers between separate memory systems and can concentrate on designing hybrid parallel programs. Long-running applications can occupy the system while waiting for a task to complete and could be killed by the operating system or the CUDA driver, forcing programmers to divide large workloads into smaller ones. With Compute Preemption, programmers are able to let their programs wait for certain conditions to occur while being scheduled alongside other tasks.

2.3 The CUDA Programming Model

Introduced by NVIDIA in 2007, the CUDA (Compute Unified Device Architecture) programming model is designed for joint CPU-GPU execution of an application[3]. There are also other, more recent models, such as OpenCL, OpenACC and C++ AMP, that support hybrid system programming. The release of CUDA opened up the possibility for developers to concentrate on algorithm design rather than thinking in terms of graphics primitives[6]. CUDA provides extensions, with new keywords and application programming interfaces, to both the C/C++ and Fortran programming languages, so developers do not need to learn a new programming language, and CUDA makes it easier for them to modify existing code to enable GPU support.

A CUDA program consists of a host and one or more devices. A CUDA host is generally a traditional CPU, such as an Intel microprocessor, while a device is an NVIDIA GPU. This programming model can be described as a master-worker pattern in which the CPU acts as the master, initialising the program and executing the serial parts, while the GPU serves as the worker responsible for executing the parallel regions. Device functions marked with CUDA keywords for parallel execution are called kernels. The execution of a CUDA program starts with host execution, which launches the various kernel functions. When a kernel function is called, it is executed by a large number of threads, and each thread runs on a single CUDA core. Threads are grouped into blocks, and the blocks collectively form a grid of up to three dimensions; threads within the same block can communicate through shared memory and synchronise with each other.
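The following short CUDA Fortran example is an illustrative sketch (not taken from the dissertation's code) of the host/device split and the thread hierarchy described above: the host copies data to the device, launches a kernel over a 1D grid of blocks, and copies the result back. The kernel and variable names are hypothetical, and the code assumes a CUDA Fortran compiler such as the PGI compiler used later for minidft.

    module saxpy_mod
      use cudafor
      implicit none
    contains
      ! Kernel: each thread updates one element of y.
      attributes(global) subroutine saxpy(n, a, x, y)
        integer, value :: n
        real, value    :: a
        real, device   :: x(n), y(n)
        integer :: i
        i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
        if (i <= n) y(i) = a * x(i) + y(i)
      end subroutine saxpy
    end module saxpy_mod

    program test_saxpy
      use saxpy_mod
      use cudafor
      implicit none
      integer, parameter :: n = 1024
      real :: x(n), y(n)
      real, device :: x_d(n), y_d(n)   ! device copies of the arrays
      x = 1.0; y = 2.0
      x_d = x; y_d = y                 ! host-to-device copies
      ! Launch the kernel with a 1D grid of 1D blocks (256 threads per block).
      call saxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x_d, y_d)
      y = y_d                          ! device-to-host copy
      print *, 'max error: ', maxval(abs(y - 4.0))
    end program test_saxpy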

In heterogeneous systems, there is rarely only one CPU or CUDA host. Most HPC clusters and supercomputers have more than one node, and each node has one or more hosts and one or more devices. The dominant programming model for computing clusters, MPI, can be used together with CUDA to program these systems. A simple and versatile approach is to associate each MPI rank (process) with a single GPU. When there is more than one GPU per node, multi-GPU programming can be realised by placing multiple MPI ranks on each node[7].

2.4 The cufft and cublas Libraries

Traditional C/C++ and Fortran programming clearly benefits from a great number of mature, well-documented and easy-to-use libraries; FFTW[8] and OpenBLAS[9] are two examples. FFTW is a free and open-source C subroutine library for computing the discrete Fourier transform (DFT), while OpenBLAS is an optimised BLAS (Basic Linear Algebra Subprograms) library which provides basic vector and matrix operations. Both libraries show outstanding performance on traditional CPU-only platforms. NVIDIA provides the corresponding cufft (CUDA Fast Fourier Transform)[10] and cublas (CUDA Basic Linear Algebra Subroutines)[11] libraries, which allow CUDA programmers to speed up their applications by offloading compute-intensive operations to a single GPU, using the hundreds of CUDA cores inside NVIDIA GPUs, or by distributing work efficiently across multi-GPU systems. However, these libraries are implemented using C-based CUDA, which means that modifying CPU-only programs written in Fortran to support NVIDIA GPU execution requires additional Fortran wrapper interfaces.

2.5 The MAGMA Library

Standing for Matrix Algebra on GPU and Multicore Architectures, MAGMA[12] is a library for dense linear algebra on heterogeneous systems. It is similar to LAPACK and ScaLAPACK[13], which are designed for CPU-only architectures, and is developed by the team that developed LAPACK and ScaLAPACK[14]. Developers can therefore port their existing applications to hybrid systems efficiently and smoothly. MAGMA uses a flexible methodology to schedule tasks: small tasks, which lie on the critical path and cannot be parallelised, are scheduled on the CPU, while larger tasks are scheduled on the GPU by the library. In addition, many MAGMA subroutines have multiple versions, enabling developers to arrange the workload. For example, for the LAPACK subroutine DGETRF, MAGMA provides versions including magma_dgetrf, magma_dgetrf_m, magma_dgetrf_gpu and magma_dgetrf_mgpu, so that developers can choose whether the function is executed in CPU/GPU

or CPU/multi-GPU mode and whether the matrix is located in CPU host or GPU device memory.

2.6 Project Motivation

As the author is a member of Team EPCC for the 2017 ISC Student Cluster Competition, the project is composed of three parts: cluster design and configuration, the work done by the author for the ISC Student Cluster Competition (optimising FEniCS on the cluster), and the optimisation of the CPU-only program minidft using GPU-enabled libraries. The project aims to investigate and make use of all the previously mentioned HPC features and innovations. It starts with maintaining a cluster with a liquid cooling system for the competition, including installing software and compiling and linking libraries. This process is essential for all subsequent optimisation and performance measurement work and can be difficult, but beneficial, for Linux beginners. The optimisation tasks include improving the performance of FEniCS by trying different popular compilers and libraries, and rewriting parts of the minidft code to support GPUs through GPU-enabled libraries. MiniDFT uses traditional mathematical libraries such as FFTW3, OpenBLAS and ScaLAPACK, which motivated the author to improve its performance by introducing the corresponding libraries designed for hybrid systems with GPUs.

2.7 Obstacles and Deviation from the Project Plan

During the preparation period for the competition, we faced many difficulties, from compiling the code to executing programs across the three nodes. Some of the problems were easy to solve, while others cost us hours or even days to resolve, since our team members had limited knowledge and experience of Linux system administration, cluster networking, libraries and application optimisation. To give the reader a better understanding of the work done for the project, the obstacles the author met are presented and discussed in detail throughout the report. This section provides an overview of the deviations from the initial project proposal.

The project was planned to focus on the optimisation of FEniCS. The project preparation report[15] proposed that, before the competition days, we should vectorise the FEniCS code according to the hardware architecture of the cluster, i.e. the Intel Xeon 2630v4 CPUs, and that additional work should go into enabling PETSc GPU support to offload some work to GPUs. After the competition, the proposed future work was to optimise FEniCS on the KNL architecture.

The vectorisation report generated by Intel Vector Advisor showed that there were not many loops left to vectorise. In addition, the version of FEniCS built with GPU-enabled PETSc did not provide any performance improvement. The author therefore decided to change the post-competition work to porting minidft to the cluster.

The most significant obstacle was system accessibility. Due to network and maintenance issues, we were unable to access the cluster remotely from time to time; the longest outage lasted for a month. During these periods, team members could not conduct any power or performance measurements and thus could not do any further work that depended on them. In addition, before a separate interface for monitoring the power consumption of the whole system was set up, we could only check the power consumption with Linux commands, which was not convenient compared to a dedicated monitoring interface.

Chapter 3 The Student Cluster Competition

The ISC Student Cluster Competition takes place every year together with the International Supercomputing Conference. Being held alongside the conference gives participating teams more opportunities to explore the latest developments and achievements in HPC technology, to present their ideas, and to get valuable feedback from attendees of the conference, including scholars from prestigious universities and scientists from well-known HPC vendors.

3.1 Competition Guidance

Student teams from around the world, each composed of four to six students and a coach, had to submit an initial proposal containing their biographies, their reasons for participating, their approach to effective teamwork, etc. to the competition board before 11 November 2016. During the 2016 Supercomputing Conference, held in Salt Lake City, 12 teams were announced as finalists of the 2017 ISC Student Cluster Competition. The selected teams had to submit more detailed team and individual profiles as well as their final cluster configuration before April 2017 to the HPC Advisory Council, which acted as the competition board.

In preparation for the competition, student teams needed to optimise both benchmarks and applications for their clusters. Benchmarks and applications are referred to collectively as tasks in this report when it is not necessary to distinguish between them. There was also a coding challenge, which allowed teams to modify source code or switch to other libraries in order to bring the full potential of their clusters into play. During the competition days, from June 19 to June 21 in Frankfurt, teams had to run on their clusters all the benchmarks and applications, including a secret application announced on the second day of the competition. As each team had a power budget of 3000 Watts, the power consumption of every team was monitored in real time; whenever a team exceeded the limit, they were warned by the competition board, and if it happened during a benchmark

running period, their final score was deducted accordingly. After each result was submitted, the competition board calculated the final score, taking into consideration both cluster performance and an interview by representatives of the competition board.

3.2 Benchmarks and Applications

HPC Challenge (HPCC) and the High Performance Conjugate Gradient (HPCG) benchmark were used for benchmarking the clusters. HPCC comprises seven different tests, measuring aspects ranging from floating-point execution performance to sustainable memory bandwidth, while HPCG is another comprehensive benchmark that is also used to rank the Top 500 list. Moreover, teams could carry out an independent run of High Performance LINPACK (HPL), one of the HPCC tests, which measures the floating-point rate of execution for solving a linear system of equations. At the ISC17 Student Cluster Competition, the known applications (announced three months before the competition) were:

FEniCS[16]: A computational interface for solving partial differential equations with both C++ and Python support. The competition board provided a small test case for teams to validate their installation of FEniCS before the official run, and a large case for final scoring on the competition day.

Coding Challenge, MiniDFT[17]: An application that takes a set of atomic coordinates and pseudopotentials as input and models materials. The competition board provided the original minidft code together with a brief description of the required dependencies and a running guide. This task allowed teams to modify the source code and change the libraries minidft uses in order to accelerate it.

TensorFlow[18]: A library for numerical computation using data flow graphs, which allows users to deploy it on hybrid systems. This challenge was also known as the CAPTCHA challenge because teams needed to run Keras, a Python deep learning library, on top of TensorFlow to identify images.

On the second day of the competition, the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS), a classical molecular dynamics code, was announced as the secret application. Each member of Team EPCC was responsible for one or two tasks (benchmarks and/or applications), and the whole team worked together to exchange ideas and help each other solve problems.

3.3 Awards

Six awards were given at the 2017 ISC competition:

Overall Winner: For the three teams with the highest scores (10% HPCC performance, 10% HPCG, 10% FEniCS, 10% LAMMPS, 25% MiniDFT, 25% TensorFlow and 10% interview by the competition board).

Highest LINPACK: For the team with the best HPL performance (either as part of the HPCC run or as an independent run).

Fan Favourite: For the team which received the most votes from ISC participants.

AI Award: For the team which solved the CAPTCHA challenge on TensorFlow with the highest accuracy.

3.4 Team EPCC's Cluster Configuration

3.4.1 Hardware Configuration

The hardware and software configuration of the cluster was designed jointly by our team and the industrial sponsor of the team, Boston Limited, who also provided the hardware. The company gave us several hardware choices, including Intel Xeon CPUs, Intel Xeon Phi accelerators and NVIDIA P100 GPUs. As the team was aiming for the Highest LINPACK award, and based on preliminary power consumption measurements conducted by the sponsor, we decided on a GPU-centric design. Team EPCC's cluster is composed of three nodes, each with three NVIDIA P100 GPUs and two Intel Xeon E5-2630v4 CPUs. NVLink is used for intra-node GPU communication and Mellanox InfiniBand is used as the interconnect. Initially, the cluster used only air cooling; adding liquid cooling allowed the removal of power-hungry fans, saving power overall. Figure 3.1 shows a computational node of the cluster. Hardware details are summarised in Table 3.1.

Component        Detail
CPU              2 x Intel Xeon E5-2630v4 2.20GHz (10 cores each) per node; 60 cores (or MPI ranks) on 3 nodes in total
GPU              3 x NVIDIA Tesla P100 (3584 cores and 16GB HBM2 each) per node; 9 GPUs on 3 nodes in total
RAM              64GB 2400MHz
Storage          2 x 900GB SSD
Interconnect     Mellanox EDR InfiniBand networking
Cooling System   Liquid cooling

Table 3.1: Team EPCC cluster hardware configuration

Figure 3.1: One node of Team EPCC's cluster

3.4.2 Software Configuration

Besides the operating system, the NVIDIA GPU libraries and the GNU and Intel compilers, the team installed other compilers and libraries according to our needs during the preparation period. In order to maintain different versions of applications such as FEniCS and minidft and to compare their performance, the same library was sometimes installed multiple times in different locations, and different MPI implementations, such as OpenMPI and MPICH, were installed to help improve application performance. The operating system used was CentOS 7.3, which is well known for its stability and reliability. The following software and libraries were installed:

NVIDIA drivers and CUDA Toolkit version 8.0

MPI implementations:
Open MPI
MPICH
Intel MPI 2017

Compilers:
PGI 17.4
Intel Compilers 2016
GNU

Libraries:

Intel Math Kernel Library (MKL) (required by FEniCS and minidft)
OpenBLAS (required by FEniCS and minidft)
FFTW 3.3.6 (required by minidft)
ScaLAPACK (required by FEniCS and minidft)
CMake (required by FEniCS)
Boost (required by FEniCS)
HDF5 (required by FEniCS)
SWIG (required by FEniCS)
PETSc (required by FEniCS)
SLEPc (required by FEniCS)
Eigen (required by FEniCS)
Anaconda Python 3 (required by FEniCS)

3.5 Preparing for the Competition

Because the hardware was located at the headquarters of Boston Limited during the preparation period, the only times at which we had direct access to the cluster were the competition days and a two-day training session at the headquarters in London on the 30th and 31st of May. During the training session, Konstantinos Mouzakitis, a Senior HPC Systems Engineer at Boston Limited, led a visit of the company and taught us skills ranging from system configuration on the command line to how to screw and unscrew components of the cluster. For the remaining time, we only had remote access to the cluster.

The Intelligent Platform Management Interface (IPMI), which provides a management and monitoring interface for the CPUs, the BIOS and the operating system, was installed on the cluster. It allowed us to reboot the cluster when encountering hardware or software crashes and, most importantly, it helped us monitor and manipulate the BIOS configuration, temperatures, fan speeds, etc. during power consumption measurements and application performance tests. A Windows system with Remote Desktop Protocol (RDP) access, providing a graphical user interface to monitor the overall power consumption of the cluster, was also set up.

Each member of Team EPCC was responsible for one or two benchmarks or applications. The author of this report was mainly responsible for FEniCS during the competition preparation period and ported minidft to the NVIDIA Tesla P100 GPU architecture after the competition. As mentioned in the Introduction, details of the optimisation process are presented in Chapters 4 and 5.

Before the competition days, the whole team and the coach held meetings every one or two weeks to exchange ideas on optimisation, report progress to the coach and arrange work for the following week. Each team member also met periodically with their dissertation supervisor to discuss their own work.

3.6 Competition Results and Experiences

The team originally set the Highest LINPACK award as its primary goal, which strongly influenced the choice of hardware configuration. After performing HPL tests with 4 nodes (2 GPUs per node) and 3 nodes (3 GPUs per node), the team decided that the latter option was more promising in terms of both performance and power consumption. The final HPL performance for the competition was 33.99 TFLOPS at a power consumption of exactly 3000 Watts, which was the fourth highest HPL performance achieved during the competition and approximately three times the record set at the 2016 ISC Student Cluster Competition. Using the power efficiency formula (GFLOPS per Watt) used to rank the Green 500 list, the power efficiency of Team EPCC's cluster was 11.33, which would have placed 4th in the June 2017 Green 500 list.

Being one of the members of Team EPCC taking part in the 2017 ISC Student Cluster Competition was the experience of a lifetime. Because the competition touched a wide range of aspects of HPC, we had to put into practice what we had learned in our MSc in High Performance Computing with Data Science programme. It also required us to find useful information and read the literature by ourselves, which helped us develop our skills in searching for and processing information. For team members who did not have much Linux experience, the competition forced us to become familiar with command-line work as quickly as possible. We had to learn how to install and configure different libraries and software and how to cope with errors. Teamwork was greatly important for us: each team member has his or her own strong and weak points, and learning from others' strengths to offset one's own weaknesses played an important role in our collaboration. Last but not least, discussing with other teams, conference participants and the competition board was also inspiring, as we had the chance to exchange ideas and learn from each other.
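The quoted power efficiency follows directly from the Green 500 metric applied to these two figures:

\[ \frac{33.99\ \text{TFLOPS}}{3000\ \text{W}} = \frac{33990\ \text{GFLOPS}}{3000\ \text{W}} \approx 11.33\ \text{GFLOPS/W}. \]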

Chapter 4 Building FEniCS on the Cluster and Performance Analysis

As a mature project for solving partial differential equations, FEniCS is widely used in scientific simulations and was one of the benchmarked applications at the 2017 ISC Student Cluster Competition. In this chapter, we look at how it was installed on the cluster and compare the performance of different configurations and compilers.

4.1 Introduction to FEniCS

As shown in Figure 4.1, FEniCS essentially works as a pipeline starting from and ending at DOLFIN, which is the main user interface of FEniCS. End users are able to describe their problem in handwriting-style formulas thanks to the Unified Form Language (UFL). The FEniCS Form Compiler (FFC) can then translate these formulas into program source code, which is further compiled by the compiler. Variational forms expressed in UFL are passed to FFC to generate low-level code, which can then be used by DOLFIN to assemble linear systems. This code generation depends on the FInite element Automatic Tabulator (FIAT) and Instant; Instant is used for inlining C/C++ code in Python. Finally, mshr can be used to generate meshes.

As described in Section 3.4.2, FEniCS has a number of required and optional dependencies. For instance, the Python package Six, which makes Python programs compatible with both Python 2 and Python 3 without modification, and NumPy, which provides fundamental scientific computing functionality, are required by all the Python-related components. SWIG (Simplified Wrapper and Interface Generator) is required by both DOLFIN and Instant to connect programs written in C++ and Python. In terms of optional dependencies, MPI and HDF5, for example, can be installed together with DOLFIN to enable parallelism and improve performance. These dependencies should therefore be installed properly according to the end user's requirements and specifications to make sure that FEniCS works.

Figure 4.1: FEniCS Streamline

4.2 Initial Work and Obstacles

FEniCS provides several simple installation methods, including Docker containers, pre-built Anaconda Python packages and Ubuntu packages. For end users these options are easy to use and straightforward, but in order to modify parameters and try different compilers we had to build FEniCS from its source code; the pre-built packages could still be used to help validate the results of the optimised builds.

The author of this report had limited experience of Linux before the preparation period of the competition and had never used Linux terminal commands or batch systems before. The author was used to the Eclipse Integrated Development Environment for Java and was not familiar with command-line operations or setting Linux environment variables. We therefore encountered problems such as selecting the right compiler and linking the correct library.

Library linking was the greatest obstacle when compiling FEniCS. Boost is a well-known library family which provides a variety of high-quality C/C++ libraries, from frequently-used algorithms to regular expression processing. As discussed in Section 4.1, DOLFIN is the computational back-end of FEniCS and Boost is one of its compulsory dependencies.

The author first installed multiple versions of Boost with the default configuration into the default directory. As the author was not familiar with Linux environment variables and did not know how to set the BOOST_ROOT variable to indicate which Boost installation should be linked, DOLFIN was built against one version of Boost but looked for another version at run time. As a result, FEniCS could not be launched. A similar problem was encountered when the author compiled multiple PETSc libraries with different configurations and tried to test DOLFIN with each of them. The problem was solved by installing compilers and libraries into user-specified folders and setting environment variables such as LD_LIBRARY_PATH, BOOST_ROOT and PETSC_DIR to indicate which libraries to link against.

In order to compile FEniCS with different compilers, libraries and configurations conveniently, an automated build script was written so that only a few statements need to be changed. For example, the variable PREFIX only needs to be set once at the beginning; the directory it points to is then used to install libraries and tells CMake where to find them. CMake is a widely-used tool that finds and links against different libraries and then generates Makefiles accordingly. The author was not familiar with the "cmake - make - make install" workflow and thus ran into problems with CMake. One problem was that the CMake originally installed on the cluster was an older version and could not link against newer versions of Boost. We thought that the problem was caused by how Boost had been compiled and installed, and spent a lot of time reinstalling Boost. After the problem was solved, we had gained a lot of experience with CMake, and the author decided to write a CMakeLists.txt file for building minidft in order to practise using it.

OpenBLAS is a well-known optimised BLAS library and is required by PETSc, the computational back-end of DOLFIN. When compiling OpenBLAS with the Intel Compiler (icc for C, icpc for C++ and ifort for Fortran) during the performance investigation, a fatal error "unknown register name %1 in asm statement" was raised. The reason was that icc did not recognise the gcc-style register naming, so we used gcc for C, g++ for C++ and ifort for Fortran instead. Compiling OpenMPI with the Intel Compiler was straightforward, i.e. specifying the C, C++ and Fortran compilers as the Intel compilers was sufficient. However, when executing MPI tasks with this implementation on more than one node, the nodes other than the host node halted with an error that libimf.so, libsvml.so, libirng.so and libintlc.so.5 could not be found. This issue was solved by creating symbolic links to the corresponding Intel libraries in the FEniCS library path.

4.3 Performance Investigation

Unlike a benchmarking program, FEniCS is a scientific application and most programs that use it are quite time-consuming. Before the competition days, the competition board provided neither test cases nor benchmarking programs; they only

provided a list of the required and optional dependencies of DOLFIN. In this case, the example programs installed together with FEniCS were used to validate each installation. On the first day of the competition, a small test case, provided by the competition board to help teams check whether their FEniCS installation worked properly, was used to choose a suitable compilation configuration, as it has a relatively low completion time. The following steps were used to investigate the performance of FEniCS.

The first step was to investigate how FEniCS performs with different numbers of MPI processes, which shows how well it scales across the cluster nodes. The GNU compilers, OpenMPI 2.2.1, OpenBLAS compiled with the GNU compilers and the small test case were used for this step. Table 4.1 and Figure 4.2 present FEniCS performance for different numbers of MPI tasks.

Table 4.1: FEniCS timing for different number of MPI tasks (columns: nodes used, MPI tasks, time in seconds)

We can see that the execution time of the small test case decreased as the number of MPI tasks increased. However, with 64 MPI tasks the performance was not as good as with 32 tasks. The reason may be that the test case was too small, so that more time was spent on communication between MPI tasks than on computation.

The second step of the FEniCS performance investigation was to test which MPI library and implementation works best with FEniCS. The chosen implementations were: MPICH 3.2, compiled with the GNU compiler; OpenMPI 2.2.1, compiled with the GNU compiler; OpenMPI 2.2.1, compiled with the Intel compiler; and Intel MPI, using the Intel compiler. The BLAS library used in the second step was OpenBLAS compiled with the same compiler as the MPI implementation, and the optimisation flag for each test case was -O2. The small test case was run with 30 MPI ranks on the 3 nodes (10 ranks per node) while the large test case was run with 60 MPI ranks on the 3 nodes (20 ranks per node). The results of the MPI library testing are shown in Table 4.2 and Figures 4.3 and 4.4.

Figure 4.2: FEniCS timing for different number of MPI tasks

Figure 4.3: FEniCS timing for different MPI implementations using the small test case

Table 4.2: FEniCS timing for different MPI implementations (rows: small and large test case; columns: MPICH-GNU, OpenMPI-GNU, OpenMPI-Intel, Intel MPI; times in seconds)

Figure 4.4: FEniCS timing for different MPI implementations using the large test case

It can be clearly seen in Table 4.2 that, for the small test case, OpenMPI compiled with the GNU compiler had the best performance, more than two times faster than MPICH compiled with the GNU compiler. OpenMPI compiled with the Intel compiler was slightly slower, and Intel MPI was approximately 2 seconds slower than OpenMPI compiled with the GNU compiler. For the large test case, Intel MPI, which was the second worst for the small test case, gave the best performance. The reason might be that Intel MPI provides better scalability for long-running programs. Moreover, MPICH compiled with the GNU compiler again had the worst performance for the large test case.

The third step was to test how different BLAS libraries affected the performance of FEniCS. The small test case executed too quickly for differences to be visible, so the large test case, which was used for competition scoring, was used in this step. Based on the previous step, Intel MPI with the Intel compiler was used. The selected BLAS libraries were as follows: Intel Compiler 2016 with OpenBLAS compiled by the Intel compiler, and

Intel Compiler 2016 with the Intel Math Kernel Library (MKL). The test case was executed using all the available CPU cores (60 MPI tasks). The results of the library testing are shown in Table 4.3.

BLAS library   Time (sec)
OpenBLAS
Intel MKL      1112

Table 4.3: FEniCS timing for different BLAS libraries

It can be seen in Table 4.3 that the Intel compiler provided better performance than the GNU compiler and, when the OpenBLAS library was replaced by MKL, the performance increased further.

The fourth step was to compile FEniCS, using the best compiler from the previous steps, with different optimisation levels from -O0 to -O3 and -fast, which according to the Intel C++ Compiler manual is equivalent to -xHost -O3 -ipo -no-prec-div -static -fp-model fast=2, and to see how the performance changed. The version compiled with the Intel compiler and linked with the Intel MKL library was chosen for this test. The bigger test case was again used and the program was executed with 60 MPI tasks. Table 4.4 and Figure 4.5 show the test results.

Table 4.4: FEniCS timing for different levels of optimisation (-O0, -O1, -O2, -O3 and -fast; times in seconds)

We can see that the performance of FEniCS improved with the optimisation level -fast and, unsurprisingly, -O0 provided the worst performance.

The final step used Intel processor-specific optimisation. Since the CPUs of the cluster were all from the Intel Xeon Processor E5 v4 family, which corresponds to the CORE-AVX2 processor-specific option of the Intel compiler, we set the following flags: -fast -march=core-avx2 -xcore-avx2. The flag -axcore-avx2 can replace -xcore-avx2: -xcore-avx2 generates specialised code only for the specified processor family, which is incompatible with older processors, while -axcore-avx2 generates multiple code paths at the cost of larger executable files. The performance of FEniCS was further improved, benefiting from the machine-specific compilation.

Figure 4.5: FEniCS timing for different levels of optimisation

4.4 Summary

FEniCS is a large program consisting of various components, and each component has its own required and optional dependencies. An automated build script helped us save time on downloading, configuring and linking the selected libraries when trying different optimisation methods. Although we encountered a variety of obstacles in compiling and linking the software, we learned a lot through this process. FEniCS showed acceptable scalability on the multi-node cluster with MPI and, according to the MPI implementation comparison, Intel MPI showed the best performance. Additionally, as a well-known optimised mathematical library with widely-used routines including BLAS, FFT and LAPACK, MKL also helped improve the performance of FEniCS. Finally, the optimisation flag -fast provided the best performance among the flags -O0, -O1, -O2, -O3 and -fast. Figure 4.6 provides a step-by-step optimisation summary of this chapter.

Figure 4.6: Step-by-step optimisation summary of FEniCS

Chapter 5 Porting minidft to NVIDIA GPUs

As a minimalist version of the general-purpose Quantum ESPRESSO (open-source Package for Research in Electronic Structure, Simulation, and Optimization) code, minidft[17] is an application for modelling materials, written in Fortran, that solves the Kohn-Sham equations using plane-wave density functional theory (DFT). minidft uses parallel programming technologies such as MPI and Open Multi-Processing (OpenMP) and is designed to run on CPU-only systems.

5.1 Initial Performance Testing

The README file of minidft states that the following libraries are used by minidft:

ScaLAPACK (Scalable Linear Algebra PACKage): a library that provides high-performance linear algebra routines for distributed-memory architectures.

OpenBLAS: an optimised BLAS (Basic Linear Algebra Subprograms) library.

FFTW3: a C subroutine library that computes the discrete Fourier transform (DFT) in one or more dimensions. It supports arbitrary input sizes and both real and complex data.

These libraries are well optimised for traditional CPU-only machines and support MPI as well as OpenMP to exploit parallelism on CPUs. However, they do not have GPU support, and the other parts of the minidft source code do not use GPU functionality either, which means that minidft is a CPU-only program. Looking for corresponding GPU-enabled computational libraries to replace these traditional libraries is therefore one method of optimisation. In addition, the FEniCS performance results in Section 4.3 show that, when running only on CPUs, MKL can provide better performance than OpenBLAS, and MKL also provides FFT and ScaLAPACK subroutines. Therefore, we first looked at how minidft performs with the Intel compiler and the MKL library.

A primary test of the original minidft program, using pe-23.local.in as the input file, was conducted with the GNU and Intel compilers; the results are given in Table 5.1. The version compiled with the GNU compiler was linked against the original OpenBLAS, FFTW3 and ScaLAPACK, while the Intel version was linked against MKL. The compiler flag used for this step was -O3 for both the GNU and Intel compilers.

Table 5.1: minidft timing for different compilers and libraries (CPU and wall times in seconds for the GNU and Intel builds)

The CPU column gives the time consumed by computation and the WALL column the time consumed by both computation and communication. The results of this step showed that the Intel compiler and the MKL library save approximately half of the execution time of the GNU version of minidft.

We then looked at how optimisation flags affected the performance of minidft compiled with the Intel compiler and the MKL library. The results of this step are shown in Table 5.2 and Figure 5.1. We can see that -O3 provided the best performance, and this optimisation flag was used in the subsequent optimisation steps.

Table 5.2: minidft timing for different levels of optimisation (-O0, -O1, -O2, -O3, -fast and -O3 with CPU-specific flags; CPU and wall times in seconds)

5.2 Optimisation Based on Source Code Investigation

The source code file fft_base.f90 contains a subroutine named fft_scatter. This subroutine transposes the FFT grid across nodes from columns to planes, or in the opposite direction, and it is implemented in two ways. The default implementation uses the MPI_Alltoallv collective to send data from all processes to all processes. The alternative implementation is a non-blocking transpose: it makes the loop iterations different on each process (MPI task) so that not all processes send a message to the same process at the same time, and it uses a combination of MPI_Isend, MPI_Irecv and MPI_Test to make the communications asynchronous and to make sure the routine exits only when all processes have sent and received their data. The source file defines a macro named NONBLOCKING_FFT; when it is turned on, the latter implementation is used, and

otherwise the blocking implementation is used.

Figure 5.1: minidft timing for different optimisation levels

The non-blocking implementation of fft_scatter is suitable for a switched network such as InfiniBand where no topology is defined, while the blocking implementation should be better on a network with a defined topology. As the three nodes of the cluster were connected by a switched InfiniBand network, a performance test was conducted with -D NONBLOCKING_FFT turned on. The comparison of this version with the previous Intel compiler and MKL version is shown in Table 5.3.

Table 5.3: minidft timing for different FFT communication types (blocking versus non-blocking; CPU and wall times in seconds)

We can see that the non-blocking communication method reduced both the computation time and the overall execution time.
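The following Fortran sketch illustrates the non-blocking pattern described above; it is not the actual fft_scatter code, and the buffer names, chunk layout and the use of MPI_Waitall (rather than the MPI_Test polling used in minidft) are assumptions made for the example.

    ! Sketch of a staggered non-blocking all-to-all transpose. Each rank
    ! starts its loop at a different offset so that not every rank targets
    ! the same peer at the same time.
    subroutine nonblocking_transpose(sendbuf, recvbuf, chunk, comm)
      use mpi
      implicit none
      integer, intent(in) :: chunk, comm
      double complex, intent(in)  :: sendbuf(*)   ! chunk elements per destination rank
      double complex, intent(out) :: recvbuf(*)   ! chunk elements per source rank
      integer :: nproc, me, i, dest, src, ierr
      integer, allocatable :: reqs(:)

      call MPI_Comm_size(comm, nproc, ierr)
      call MPI_Comm_rank(comm, me, ierr)
      allocate(reqs(2*nproc))

      do i = 0, nproc - 1
         dest = mod(me + i, nproc)              ! stagger destinations per rank
         src  = mod(me - i + nproc, nproc)
         call MPI_Irecv(recvbuf(src*chunk + 1), chunk, MPI_DOUBLE_COMPLEX, &
                        src, 0, comm, reqs(2*i + 1), ierr)
         call MPI_Isend(sendbuf(dest*chunk + 1), chunk, MPI_DOUBLE_COMPLEX, &
                        dest, 0, comm, reqs(2*i + 2), ierr)
      end do
      ! Complete all sends and receives before returning.
      call MPI_Waitall(2*nproc, reqs, MPI_STATUSES_IGNORE, ierr)
      deallocate(reqs)
    end subroutine nonblocking_transpose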

5.3 Performance with GPU-enabled Libraries

5.3.1 FFT-related Code Optimisation

NVIDIA has developed the cufft library, which provides FFT functionality on GPUs and can be used to replace the FFTW library. We first tried to make use of the cufft library and made the following modifications to the code:

Add the DEVICE attribute to arrays used on GPUs.

Replace "OpenMP parallel do" directives over arrays on GPUs with CUDA CUF kernel directives, which automatically map the loops over kernel arrays onto the GPU. Here is an example of the replacement:

    ! Original OpenMP directive
    !$omp parallel default(shared), private(mc, j, i)
    !$omp do
    DO i = 1, dfft%nst
       mc = dfft%ismap( i )
       DO j = 1, dfft%npp( me_p )
          f_in( j + ( i - 1 ) * nppx ) = f_aux( mc + ( j - 1 ) * dfft%nnp )
       ENDDO
    ENDDO
    !$omp end parallel

    ! CUDA CUF kernel
    DO i = 1, dfft%nst
       mc = dfft%ismap( i )
       !$cuf kernel do(1) <<<, >>>
       DO j = 1, dfft%npp( me_p )
          f_in( j + ( i - 1 ) * nppx ) = f_aux( mc + ( j - 1 ) * dfft%nnp )
       ENDDO
    ENDDO

Replace the FFTW3 subroutines with cufft subroutines as shown in Table 5.4.

FFTW subroutine        cufft subroutine
dfftw_destroy_plan     cufftDestroy
dfftw_plan_many_dft    cufftPlanMany
dfftw_execute_dft      cufftExecZ2Z

Table 5.4: FFTW subroutines replaced by cufft subroutines
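As an illustration of the mapping in Table 5.4, the sketch below creates, executes and destroys a batched 1D double-complex plan using the cufft Fortran module. This is not the minidft code: the problem sizes are made up, and it assumes the PGI/NVIDIA cufft Fortran module (compiled with, for example, -Mcudalib=cufft), whose interfaces mirror the C API.

    program cufft_batch_demo
      use cudafor
      use cufft
      implicit none
      integer, parameter :: nx = 128, batch = 64
      complex(8), device :: data_d(nx, batch)   ! data already resident on the GPU
      integer :: plan, istat
      integer :: n(1), inembed(1), onembed(1)

      data_d  = (1.0d0, 0.0d0)
      n       = nx
      inembed = nx
      onembed = nx

      ! Equivalent of dfftw_plan_many_dft: a batch of 1D Z2Z transforms.
      istat = cufftPlanMany(plan, 1, n, inembed, 1, nx, onembed, 1, nx, &
                            CUFFT_Z2Z, batch)
      ! Equivalent of dfftw_execute_dft: in-place forward transform on the GPU.
      istat = cufftExecZ2Z(plan, data_d, data_d, CUFFT_FORWARD)
      ! Equivalent of dfftw_destroy_plan.
      istat = cufftDestroy(plan)
    end program cufft_batch_demo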

A CUDA-aware OpenMPI built with the PGI compiler was used to compile this version of minidft, because the Intel compiler does not support CUDA Fortran syntax such as the CUF kernel directives used here. However, the execution time of minidft increased to 1097 seconds. We therefore decided not to use cufft and to use the MKL FFT subroutines instead.

5.3.2 BLAS-related Code Optimisation

The second step of porting minidft to GPUs was to replace BLAS subroutine calls with the cublas library. After checking the source code, we found that two BLAS subroutines, ZGEMM and ZGEMV, are used in minidft: ZGEMM performs matrix-matrix operations and ZGEMV performs matrix-vector operations. These subroutines were replaced by the corresponding cublas subroutines, cublas_zgemm and cublas_zgemv, respectively. NVIDIA provides a Fortran binding interface for cublas, named fortran_thunking.c, under the CUDA installation directory; it was used in this step because minidft is implemented in Fortran. Macros (such as #define ZGEMM cublas_zgemm) were used to quickly replace the original function calls.

    ! Original ZGEMM
    CALL ZGEMM( 'N', 'N', n, m, nkb, ( 1.D0, 0.D0 ), vkb, lda, ps, nkb, &
                ( 1.D0, 0.D0 ), hpsi, lda )

    ! cublas ZGEMM
    CALL cublas_zgemm( 'N', 'N', n, m, nkb, ( 1.D0, 0.D0 ), vkb, lda, ps, nkb, &
                       ( 1.D0, 0.D0 ), hpsi, lda )

After switching to cublas, the computation time and the overall time were reduced by 90.95 seconds and 72.05 seconds respectively. Moreover, the subroutine add_vuspsi_k in add_vuspsi.f90 multiplied one matrix over each column of another matrix using ZGEMV. The matrix operation part of this subroutine was replaced by a matrix-matrix multiplication using ZGEMM and an MPI reduction (summation), developed by Siyuan Liu, and the computation and overall execution times were reduced further, because the number of matrix operations was divided by the number of columns. At this point, minidft was more than two times faster than in the previous step.

5.3.3 MAGMA-related Code Optimisation

The original minidft uses ScaLAPACK for the diagonalisation in the source code file cdiaghg.f90. This operation can be ported to GPUs with the MAGMA library, which supports

hybrid CPU+GPU LAPACK functionality. However, MAGMA is a serial LAPACK library, while the original minidft used a parallel version with full data distribution. Filippo Spiga et al. developed a plug-in to accelerate Quantum ESPRESSO using NVIDIA GPUs[20], which provides a serial version of the diagonalisation. After importing the MAGMA library as well as this GPU-enabled function, the diagonalisation can be performed on GPUs. Two functions provided by MAGMA, magmaf_zhegvd and magmaf_zhegvx, are used in the program file cdiaghg.f90. As introduced in Section 2.5, there is a multi-GPU version of magmaf_zhegvd named magmaf_zhegvd_m. Both of them were tested.

    ! Multi-GPU version of ZHEGVD; the first parameter indicates the number of GPUs
    CALL magmaf_zhegvd_m( 3, 1, 'V', 'U', n, v, ldh, s, ldh, e, work, lwork, &
                          rwork, lrwork, iwork, liwork, info )

    ! Single-GPU version of ZHEGVD
    CALL magmaf_zhegvd( 1, 'V', 'U', n, v, ldh, s, ldh, e, work, lwork, &
                        rwork, lrwork, iwork, liwork, info )

With the single-GPU version, the computation time and the overall time were reduced by a further 99.55 seconds and 68.96 seconds respectively compared with the previous step. However, the multi-GPU version did not perform better. The reason could be that the matrix in our case was not large enough to scale well across multiple GPUs, and the overall time was increased by the communication between GPUs.

5.4 Process Binding

numactl is a utility which can be used to control the NUMA policy for processes or shared memory. NUMA (Non-Uniform Memory Access) is a memory architecture in which a given CPU core has different access speeds to different regions of memory. Figure 5.2 shows the topology we used for binding processes when executing minidft, and the execution time comparison is shown in Table 5.5. We can see that the computation performance was not improved; the reason is that the number of GPUs cannot be divided evenly by the number of CPUs in each node. However, the overall execution time was reduced, because each GPU was bound to a CPU, so that the communication path between CPU and GPU was fixed and stable.

5.4 Process Binding

Numactl is a utility that can be used to control the NUMA policy of processes and shared memory. NUMA (Non-Uniform Memory Access) is a memory architecture in which a given CPU core accesses different regions of memory at different speeds. Figure 5.2 shows the topology we used to bind processes when executing minidft, and the execution times are compared in Table 5.5. The computation performance was not improved, because the number of GPUs in each node cannot be divided evenly between the CPUs. The overall execution time was nevertheless reduced, because each GPU was bound to a CPU, so the communication path between CPU and GPU was fixed and stable.

Figure 5.2: Process binding topology
