Comparing Performance and Power Consumption on Different Architectures

1 Comparing Performance and Power Consumption on Different Architectures

Andriani Mappoura

August 18, 2017

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2017

2 Abstract

This project aims to compare a conventional CPU platform with a GPU-based heterogeneous cluster and a platform composed of Intel Xeon Phi processors. The comparison covers performance as well as power consumption, since both are critical factors in the field of HPC. The first part of this project concerns participation in the Student Cluster Competition (SCC) that was held during the International Supercomputing Conference in June 2017. The minidft application and the High Performance Conjugate Gradient benchmark were chosen from the SCC. These two codes were ported and optimised on three different platforms; one of them is the cluster that was used at the SCC. A description of the work that was carried out on the GPU-based cluster and the Intel Xeon Phi nodes is presented. From the outcomes of this project, it was observed that, under the appropriate configuration and with suitable code modifications, both GPUs and Xeon Phi processors are able to achieve significantly better performance and power efficiency than a CPU-only platform.

3 Contents

Chapter 1 Introduction
    1.1 Report Organisation ... 2
Chapter 2 Background Overview
    2.1 The P100 NVIDIA GPU
    2.2 The second generation Intel Xeon Phi Processor, Knights Landing (KNL)
    2.3 The Student Cluster Competition
        2.3.1 Rules and Background
        2.3.2 The team's cluster
        2.3.3 Setting up the system
    2.4 Project Motivation
    2.5 Obstacles and Deviation from initial project plan ... 13
Chapter 3 SCC's Applications
    3.1 High Performance LINPACK
        3.1.1 Background
        3.1.2 Performance Investigations
        3.1.3 Power Consumption Investigations
        3.1.4 Final configuration and competition's results
    3.2 High Performance Conjugate Gradient
        3.2.1 Background
        3.2.2 Performance Investigations
    3.3 MiniDFT
        3.3.1 Background ... 24
i

4       3.3.2 The code challenge ... 24
Chapter 4 MiniDFT and HPCG on KNL
    4.1 Porting and Optimising on KNL
    4.2 Performance results for MiniDFT
    4.3 Performance results for HPCG ... 37
Chapter 5 Comparing Performance and Power Consumption on different platforms
    5.1 MiniDFT on KNL-nodes & CPU-nodes
    5.2 HPCG on KNL-nodes, CPU-nodes & GPU+CPU-nodes ... 44
Chapter 6 Conclusions
    6.1 Future Work ... 48
ii

5 List of Tables

Table 1 Initial investigations of HPL performance and power consumption in one node ... 12
Table 2 Initial investigations of performance in Teraflops for N and NB input parameters in one node with three GPUs ... 17
Table 3 Initial GPU_DGEMM_SPLIT investigations on one node with N=80750 and NB=384
Table 4 Performance results and Input Parameters keeping GPU_DGEMM_SPLIT=1 and OMP_NUM_THREADS=5
Table 5 GPU frequency investigations with N=139995, NB=384, GPU_DGEMM_SPLIT=1 and OMP_NUM_THREADS=5
Table 6 OMP_NUM_THREADS and Hyperthreading investigations with N=139995, NB=384, GPU_DGEMM_SPLIT=1 and GPU_frequency=1265MHz ... 20
Table 7 ECC investigations with N=139995, NB=384, GPU_DGEMM_SPLIT=1, GPU_frequency=1265MHz, OMP_NUM_THREADS=4 and hyperthreading enabled ... 20
Table 8 Configuration before the competition ... 21
Table 9 Final Configuration ... 22
Table 10 Performance and Power Consumption Investigations for local problem size 256x256x256, 60 seconds run and OMP_NUM_THREADS=5
Table 11 Execution times for two different combinations of compilers and MPI implementations for the small.in input file
Table 12 Execution times in one node (20 cores) for different numbers of threads and MPI processes, keeping MKL threading layer sequential for the small.in input file ... 26
Table 13 Optimal values for input parameters ... 27
Table 14 Profiling results of minidft with CrayPat on one node ... 34
Table 15 Final Results of minidft and impact of vectorisation ... 37
iii

6 Table 16 Platforms Specifications ... 41
Table 17 Power Consumption and Performance Results of minidft on one node of Platforms 2 & 3
Table 18 - Power Consumption and Performance Results of HPCG on one node of Platforms 1, 2 & 3
iv

7 List of Figures

Figure 1 - NVIDIA Tesla P100
Figure 2 - NVIDIA NVLink Hybrid Cube Mesh ... 5
Figure 3 - NVIDIA Pascal Architecture GP100, Full GPU with 60 SM Units ... 6
Figure 4 NVIDIA Pascal Architecture GP100, SM Unit ... 6
Figure 5 Intel Xeon Phi Processor ... 7
Figure 6 KNL Architecture ... 8
Figure 7 Tile Design ... 8
Figure 8 MCDRAM in cache, flat and hybrid modes ... 9
Figure 9 The front view of the cluster with liquid cooling system on the top ... 11
Figure 10 Initial Problem Size N investigations on three nodes ... 18
Figure 11 VTune profiling, amplxe-gui output for minidft, using 10 MPI processes and OMP_NUM_THREADS=2 with small.in input file ... 25
Figure 12 Investigations on MPI processes and OpenMP threads, on one node configured in cache mode ... 30
Figure 13 Investigations on hyperthreading on one node configured in cache mode ... 31
Figure 14 Scalability of minidft across nodes configured in cache mode with 64 MPI processes per node and OMP_NUM_THREADS=1
Figure 15 Cache Vs Flat mode results using 64 MPI processes on each node and OMP_NUM_THREADS=1
Figure 16 Forcing Vectorisation with omp simd directive ... 35
Figure 17 - Helping Vectorisation with introducing a new aligned vector ... 36
Figure 18 - Helping Vectorisation with replacing integer with a new aligned vector ... 36
Figure 19 Global problem size investigations on one node ... 37
v

8 Figure 20 Impact of hyperthreading on one node with 4 MPI processes ... 38
Figure 21 Scalability across multiple nodes using 4 MPI processes per node and 32 OpenMP threads per core ... 39
Figure 22 Cache and Flat Mode for different number of nodes, MPI processes, OpenMP threads and hyperthreading ... 40
Figure 23 Performance of HPCG across multiple nodes ... 40
Figure 24 CPU time on Platforms 2 & 3
Figure 25 Relative CPU time of Platforms 2 & 3
Figure 26 Scalability on Platforms 2 & 3
Figure 27 - Performance on Platforms 2 & 3
vi

9 Acknowledgements I would first like to thank my supervisor Fiona Reid for her valuable guidance and her continuous willingness to help me throughout this period. In addition, I would like to thank Boston Limited and especially David Power and Konstantinos Mouzakitis for supporting our participation in the Student Cluster Competition, as well as our team coach Emmanouil Farsarakis and my teammates for this great cooperation. Special thanks to my family and friends for their love and support. vii


11 Chapter 1 Introduction

High Performance Computing (HPC) is a field that has been advancing rapidly, introducing new hardware and software features. Exascale-level computing is considered the next goal of HPC [1]. This new target aims to create systems that will be able to achieve 10^18 double precision (64-bit) operations per second within a power consumption of 20 to 30 MW. Thus, an exascale cluster would be roughly 50 times faster than a modern cluster that delivers 20 Petaflops. However, Sunway TaihuLight, the Chinese supercomputer that was ranked first in the TOP500 list and fourth in the Green500 list in 2016, delivers about 6,051 Megaflops per Watt [2]. Consequently, today's most powerful system, which is also one of the most power-efficient systems of our days, would need about 165 MW to deliver one Exaflops. To put these power consumption numbers into perspective, it is worth mentioning that the largest power station in the United Kingdom, Drax, has a power capacity of 3,960 MW, where the high-pressure turbines used generate 140 MW each. [3]

Obviously, the exascale challenge requires once again new features in both hardware and software. The new supercomputers will be developed by co-designing applications, systemware and hardware, while power consumption will play a key role in this process. Regarding the hardware part of this challenge, new technologies are required in different areas such as the memory, the processing units and the cooling of the systems. Liquid cooling has already been introduced and is widely used instead of conventional air-cooling fans, in order to achieve a high ratio of operations per Watt. Numerous studies have been carried out investigating new memory features that could help with the power consumption issue. [4] Moreover, it is often argued that accelerators, coprocessors and other many-core architectures could play a key role in this new era, since the performance of multi-core systems is increasing much more slowly, while their power efficiency is not particularly promising [5].

Many-core architectures seem to be a hot topic in the field of HPC; their popularity in the TOP500 and Green500 lists is particularly impressive [6], [7]. More specifically, in the latest announced June 2017 Green500 list, nine out of the ten most energy-efficient supercomputers are using NVIDIA GPUs. Regarding the June 2017 TOP500 list, among the ten most powerful supercomputers, three are using Intel Xeon Phi and two are using NVIDIA GPUs. With the appropriate handling, these technologies can ensure great performance and energy efficiency. 1
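As a quick cross-check of the power figures quoted above (a back-of-the-envelope calculation of mine, not taken from the references):

\[
\frac{10^{18}\ \text{Flops}}{6{,}051 \times 10^{6}\ \text{Flops/W}} \approx 1.65 \times 10^{8}\ \text{W} = 165\ \text{MW},
\qquad
\frac{10^{18}\ \text{Flops}}{20 \times 10^{15}\ \text{Flops}} = 50.
\]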

12 NVIDIA GPUs seem to be the predominant choice of accelerators in the area of HPC. The Pascal architecture was introduced in April 2016 by NVIDIA. In addition, Intel introduced the Xeon Phi. The first generation was Knights Corner (KNC), an HPC-purpose coprocessor. The second generation, Knights Landing (KNL), was released later and is mainly used as a standalone CPU. These technologies can be beneficial for various fields such as Machine Learning. Nevertheless, it is argued that heterogeneous clusters are over-estimated. The benchmarks that are used to rank supercomputers, such as High Performance LINPACK (HPL), do not represent a wide variety of real applications. Most scientific codes were initially developed for CPU systems. Thus, exploiting accelerators fully requires a great number of changes to the existing codes, and sometimes it is not possible to achieve the desired performance. As far as the Xeon Phi is concerned, it appears that fewer code modifications might be required, but not all applications are able to scale well on these systems. [8]

This project aims to compare these different processing units in terms of performance as well as power consumption. The writer of this project participated in the Student Cluster Competition (SCC) as part of the MSc dissertation and had the opportunity to investigate the performance and power consumption of different benchmarks and applications on a cluster that was built with P100 GPUs. After the completion of the competition, two of these applications were ported and optimised on KNL nodes, as well as on CPU-only nodes. The performance and power consumption on these three different systems were measured and compared.

1.1 Report Organisation

The report is organised in six chapters as follows:
Chapter 1 is a brief introduction to the current project.
Chapter 2 contains the background theory that is required for understanding some technical features as well as the SCC. The motivation for this project, the obstacles faced throughout this period and the deviation from the initial plan are also described.
Chapter 3 describes the work done by the author of the report for the SCC preparation, the results achieved and the outcome of the SCC.
Chapter 4 contains the work done in order to port and optimise an application and a benchmark of the SCC on the KNL nodes.
Chapter 5 presents the comparison of the performance and power consumption of three different platforms for one application and one benchmark.
Chapter 6 contains the conclusions that were drawn from this project and future work. 2

13 Chapter 2 Background Overview

From the mid-1980s until the mid-2000s, manufacturers managed to improve processor performance mainly by increasing the processor frequency, a technique known as frequency scaling. However, heat generation and power consumption increased significantly, and it is argued that this is the main limitation to further increasing the performance of modern CPUs [9]. Power consumption is influenced by frequency, as shown by the equation:

P = C × V² × f    (1)

where P is the power, f is the frequency, V is the voltage and C is the capacitance. Reducing the voltage used to be a method for keeping power consumption low. There is a limit, though, regarding the overall voltage, since the difference between voltage levels is what distinguishes 0s and 1s in the systems. A further decrease would not allow these digital differences to be clear. Making smaller transistors reduces power consumption as well, since the capacitance is reduced. Manufacturers try to fit as many smaller transistors as possible (smaller transistors are also faster) onto a single chip. Nevertheless, there are physical constraints in terms of the size and the speed of a single chip. Transistor gates have already become too thin. Previous attempts to make transistors even smaller resulted in leaky transistors that would not even allow a processor to function. [10]

As a result, the current trend is parallel scaling with multi-core processors. Moore's law, which says that chip performance doubles every 18 months, is actually still valid; performance can be increased by exploiting parallelism and developing software that takes advantage of the new hardware features. Programming is now more difficult, since codes no longer become faster on their own as they used to. Instead, different programming techniques must be used in order to exploit the capabilities of multi-core chips. Many-core architecture is a subcategory of multi-core processors that aims to achieve a higher level of parallelism and lower power consumption. In the area of HPC, many-core processors, accelerators and coprocessors are able to achieve a high degree of parallelism due to their special design. They are composed of a high number of simple, number-crunching independent cores in a single silicon die with high-bandwidth 3

14 memory. Moreover, these cores consume less power than modern CPU cores, which are power-hungry due to their capability to deal with complex concepts such as branch prediction and out-of-order execution. Consequently, the right code development is again required in order to exploit the maximum performance capabilities of these technologies, since their simple cores are slower than modern CPU cores in terms of latency and single-thread performance.

GPUs are a very common choice of many-core accelerators in HPC. They are used in heterogeneous systems together with CPUs. They cannot be used independently, as they are designed neither for I/O operations nor for running the operating system. The host, which is the CPUs, offloads computationally expensive parts of the code to the GPUs, known as the device, in order to accelerate these computations. However, in order to offload these parts to the GPUs effectively, a lot of changes to the code might be needed.

The Intel Xeon Phi was released aiming to help and accelerate various scientific fields, using the Many Integrated Core (MIC) architecture. The first generation of Xeon Phi, Knights Corner (KNC), was a coprocessor that could be connected via the PCI Express bus. It was used like GPUs, accompanying CPUs. However, it could also be used directly without offloading code from the host system, i.e. in native mode. The fact that the cores used in KNC were initially designed for CPUs in 1993 indicates their simplicity. The second generation of Xeon Phi is Knights Landing (KNL), and one of its main differences from the previous generation is that KNL is a many-core processor available as a stand-alone system that does not need a host CPU. Although no changes to the code are required to run it on the Xeon Phi, unlike GPUs, some alterations may be essential in order to achieve good performance.

2.1 The P100 NVIDIA GPU

The P100 NVIDIA GPU is today's fastest GPU, built on the NVIDIA Pascal architecture, and its target is to accelerate HPC and Big Data applications as well as Deep Learning and Artificial Intelligence systems.

Figure 1 - NVIDIA Tesla P100 Source: 4

15 There are four main new features that made the P100 GPUs so powerful. The first one is the Pascal architecture, which targets data-centre applications and reached three times better performance than the previous GPU generation, the Tesla K40. The second new feature is the integration of both compute and data on a single package. This was implemented by introducing Chip on Wafer on Substrate (CoWoS) with High Bandwidth Memory 2 (HBM2) technology. This improvement gave three times more memory bandwidth compared to the previous version and helped data-intensive applications. NVIDIA NVLink is another of the new features. NVLink is the first high-speed interconnect for GPU-to-CPU and GPU-to-GPU communication, and its bandwidth is five times that of PCI Express Gen3. Figure 2 shows how eight GPUs can be connected with NVLink, with the GPUs connected to the CPUs through PCIe switches. Finally, the page migration engine with unified memory enables virtual memory paging and page faulting, giving applications the opportunity to scale beyond the physical memory size of the system. P100 GPUs have also been improved regarding power efficiency with TSMC's 16 nm FinFET manufacturing process, compared to the 28 nm process used for the previous K40 and M40.

Figure 2 - NVIDIA NVLink Hybrid Cube Mesh Source:

As far as the Pascal GP100 GPU architecture is concerned, it includes the same main features as previous architectures, i.e. Texture Processing Clusters (TPCs), an array of Graphics Processing Clusters (GPCs), Streaming Multiprocessors (SMs), and memory controllers, with some differences. Figure 3 presents the design of the Pascal GP100. It consists of six GPCs, thirty TPCs and 60 SMs (two in each TPC, ten in each GPC). Each SM has 64 CUDA cores, i.e. there are 3840 cores overall. In this figure, we can also see the memory controllers attached to the L2 cache that control the HBM2 DRAM. Figure 4 presents the design of an SM unit in Pascal GP100. It consists of 64 CUDA cores that 5

16 form two warps. Each warp executes the same instruction on multiple data at a time, and its cores are controlled by a warp scheduler.

Figure 3 - NVIDIA Pascal Architecture GP100, Full GPU with 60 SM Units Source: NVIDIA Pascal Architecture Whitepaper

Figure 4 NVIDIA Pascal Architecture GP100, SM Unit Source: NVIDIA Pascal Architecture Whitepaper 6

17 2.2 The second generation Intel Xeon Phi Processor, Knights Landing (KNL)

KNL, Intel's latest many-core processor, was introduced in June 2016, at ISC in Germany. It targets supercomputing and high performance computing applications by providing massive parallelism and vectorisation.

Figure 5 Intel Xeon Phi Processor Source:

KNL introduced many improvements over the previous generation, KNC. First, the fact that KNL is a stand-alone CPU avoids the PCIe bottleneck of KNC. The different cores, the new processor architecture, the new memory technology and the operation modes are some of the most important new features. Figure 6 presents the KNL architecture. On KNC, the on-die interconnect used to be a ring. On KNL, a mesh interconnect is used to connect 36 tiles, allowing a higher bandwidth connection between cores and memory. Each tile consists of two cores sharing the L2 cache, with two improved Vector Processing Units (VPUs) in each core, as shown in Figure 7. The new cores aim to balance both single and parallel thread performance as well as power efficiency. The peak Flops on KNL can be up to three times higher than the peak Flops on KNC. Depending on the Xeon Phi model, the peak double precision performance is roughly 3,000 Gflops on KNL compared to roughly 1,200 Gflops on KNC. Regarding hyperthreading, KNC required at least two threads per core in order to give good performance. However, that is not the case with KNL, where there are codes that do not need hyperthreading at all. 7

18 Figure 6 KNL Architecture Source:

Figure 7 Tile Design Source:

The KNL has two levels of memory. It has a large DRAM main memory that can be directly accessed. The second level is MCDRAM, a high bandwidth memory of 16GB with higher latency than DRAM, which can be used in flat and cache mode. Flat mode means that the MCDRAM is used as main memory in the same address space. In cache mode, the MCDRAM acts as a last level cache for DRAM. There is also a hybrid mode, where a specific percentage of the MCDRAM is used as cache and the rest as part of the main memory. Figure 8 shows these three different memory configurations. Regarding the programmability effort, cache mode is the easiest because it requires no changes in order to use the MCDRAM. On the other hand, hybrid and flat modes require modifications either inside the code or at the command line using the numactl program in order to use the HBM memory; by default the DRAM is used. Features such as the access patterns, the problem size and the memory bandwidth requirements of an application might determine which memory mode can be beneficial for it. 8
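As an illustration of the flat-mode usage mentioned above, the MCDRAM is typically exposed as a separate, CPU-less NUMA node (often node 1 on a single-socket KNL; this numbering is an assumption and should be confirmed on the actual system). A minimal sketch, where ./my_app is a placeholder executable:

```bash
# Inspect the NUMA layout; in flat mode the 16GB MCDRAM shows up as its own node.
numactl --hardware

# Allocate from MCDRAM only (allocations fail once the 16GB is exhausted).
numactl --membind=1 ./my_app

# Prefer MCDRAM but fall back to DDR when it fills up.
numactl --preferred=1 ./my_app
```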

19 Figure 8 MCDRAM in cache, flat and hybrid modes Source:

KNL also offers three different cluster-configuration options. The first one is All-to-All, where addresses are uniformly hashed across all distributed directories. Quadrant is the second one; it divides the chip into four virtual quadrants, and each address is hashed to a directory in the same quadrant as the memory. The last one is Sub-NUMA clustering, where each quadrant is exposed as a separate NUMA domain to the operating system. All-to-All might have lower performance than the other two options because it generally causes more mesh traffic. Quadrant clustering can be a better choice than Sub-NUMA for applications that use KNL as a symmetric multi-processor. On the other hand, Sub-NUMA clustering can be advantageous for MPI or hybrid applications that use KNL as a distributed memory system, with proper control of process and thread pinning. [11] Last but not least, the Omni-Path Fabric is integrated on the KNL package. This feature offers better scalability to large systems as well as lower power consumption and cost.

2.3 The Student Cluster Competition

The SCC was held from the 19th until the 21st of June, in Frankfurt, during the International Supercomputing Conference 2017 (ISC 17). Our team consisted of four students who represented EPCC and the University of Edinburgh. The team's coach and supervisors helped and guided the team. Our participation was made possible in collaboration with our vendor Boston Limited, which provided us with the cluster as well as technical support [12]. Twelve teams from all around the world participated.

2.3.1 Rules and Background

The aim of this competition is to give students a number of benchmarks and applications to be ported and optimised on each team's cluster, in order to achieve the best possible performance while the power consumption must not exceed 3 kW. A Power Distribution Unit (PDU) is given to each team in order to monitor the power consumption during the competition. Screens with the power consumption of all teams are available in the conference hall. If the power limit is exceeded by a team, then an SCC supervisor comes to the booth of the team 9

20 to ensure that the application or benchmark is repeated. A new rule regarding the power consumption was introduced this year. During the competition we were informed that every time the power limit is exceeded, there would be a penalty on the marking. Another rule is that teams are not allowed to physically touch the system after the first run. Moreover, changes in the BIOS are forbidden after the competition has started. All system equipment that is used for the first run should be powered on during the whole competition. Finally, rebooting is not permitted, unless there is a significant reason such as a system hang. In that case, an SCC supervisor should be notified to give permission for rebooting the system. [13]

During the competition, we were given a USB flash drive with the instructions for running each day's applications as well as the input files. Our results were submitted with that USB flash drive to the SCC committee. The given benchmarks were High Performance LINPACK (HPL), HPC Challenge (HPCC) and High Performance Conjugate Gradient (HPCG). As far as applications are concerned, the first was FEniCS, a computing platform for partial differential equations. MiniDFT, which is part of Quantum Espresso (QE), was the application used for the code challenge. The last one given before the competition was TensorFlow, an open source software library for numerical computation. This application was used for the CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) challenge using the Keras Deep Learning library. On the second day of the competition a secret application was announced. This was LAMMPS, using the GRANULAR package. [14]

The competition traditionally includes five awards. The first one is the Highest LINPACK award, given to the team that delivers the highest performance on HPL under the power budget. The second one is the Fan Favourite, given to the team that receives the most votes from the ISC participants. The other three awards are the 1st, 2nd and 3rd Place Overall Winners, whose score depends on the HPCC performance, the application runs as well as the interviews that were carried out during the competition. [14]

2.3.2 The team's cluster

Our cluster consisted of three nodes. One of them was the head node that was used for some further configuration. All nodes were used for computing purposes. Each node consisted of:
- 2 x Intel Xeon E5-2630 v4 CPUs: 2.2 GHz frequency, 10 cores (20 threads)
- 3 x NVIDIA Tesla P100 GPUs: 3584 cores 10

21   16GB HBM2 memory
- 1 x DDR4 RAM: 64GB, 2400MHz frequency
- 1 x SSD: 900GB

The interconnection of the nodes was implemented with Mellanox EDR InfiniBand networking. Liquid cooling was added for cooling the system, instead of relying purely on air fans. The liquid cooling was provided by CoolIT [15]. This feature enabled us to remove some energy-hungry air fans, and the total power consumption was eventually reduced. Figure 9 presents the cluster after it was set up on the first day of the competition.

Figure 9 The front view of the cluster with liquid cooling system on the top

It is worth pointing out that the initial configuration plan included a cluster with four nodes, each one containing two GPUs and two CPUs, i.e. eight GPUs and eight CPUs. However, after some further investigations on one node, it appeared that the HPL 11

22 benchmark could scale almost linearly when a third GPU was added. Thus, the alternative plan was to add a third GPU per node. In order to keep the power consumption under the power budget, only three nodes could then be included in the system. Consequently, the cluster would include in total nine GPUs instead of eight and six CPUs instead of eight. Our team decided to focus on the Highest LINPACK award. Thus, the second plan was eventually implemented, which enabled us to achieve a particularly high performance on the HPL benchmark during the competition. Table 1 presents the performance and power consumption results that led us to our final configuration decision. Although the power consumption in one node with three GPUs was above 1 kW, we decided that the plan could still be advantageous, since we would later be able to reduce the power consumption by under-clocking the GPU and/or CPU frequencies and removing air fans.

Table 1 Initial investigations of HPL performance and power consumption in one node

2.3.3 Setting up the system

The cluster was accessed remotely throughout the preparation period before the competition. The operating system used was CentOS 7.3. A number of different software packages were installed and used. These included:
- NVIDIA drivers
- Several libraries:
  - CUDA libraries (version 8.0)
  - cuDNN (version 5.1)
  - Intel Math Kernel Library (MKL)
  - OpenBLAS
- Different compilers:
  - Intel
  - GNU
  - PGI 12

23 - Several MPI implementations:
  - Intel MPI (version 5.1.2)
  - OpenMPI (versions 1.10, 1.6.5)
  - MVAPICH2 (version 2.3a)

In order to facilitate access to the cluster for the different members of the team, ssh keys were used and different user accounts were set up. User accounts were a simple solution that could keep each user's folders and setup settings isolated without requiring any cluster management tool. As far as the power consumption measurements before the competition are concerned, a Windows system with power meter software was installed, which was accessed through a Remote Desktop (RDP) client.

2.4 Project Motivation

Obviously, much effort is being put into creating and advancing technologies for the field of HPC. Both Intel and NVIDIA have done a lot for this purpose, and their latest products, the KNL Xeon Phi and the P100 GPU respectively, seem particularly promising in terms of performance and power efficiency. Some prior investigations from these companies as well as from other research organisations showed that, under the appropriate software and hardware configuration, these technologies can give better results than previous generations of these products and than conventional CPU-only systems for highly parallelised applications. Through the SCC, various applications and benchmarks were examined on a P100 GPU-based cluster regarding performance and power consumption. That gave the writer the idea of porting and investigating some of the SCC's applications on Intel Xeon Phi processors and Intel Xeon processors. That was a unique opportunity to fulfil the purpose of this project, which is comparing these technologies for HPC applications.

2.5 Obstacles and Deviation from initial project plan

From the beginning of the SCC preparation, the given benchmarks and applications were divided among the members of the team, and the writer was responsible for two benchmarks, HPL and HPCG. Thus, the initial purpose of this project was to compare the performance and power consumption of these two benchmarks on three different platforms, i.e. the SCC GPU-based cluster, ARCHER KNL nodes and ARCHER CPU nodes. However, more applications were announced later on for the SCC and the writer undertook one of them, the minidft application. This application was given for the code challenge and it required more work compared to the other two benchmarks. It is also 13

24 worth mentioning that for both benchmarks, optimised binaries provided by NVIDIA were used and, as a result, code profiling and optimisation were not required. Consequently, it was decided to port and optimise minidft on KNL and CPU nodes, since the writer had become familiar with that code. HPCG was also ported to KNL and CPU nodes, but there was not enough time to port HPL, too. Unfortunately, we did not manage to port minidft efficiently to the GPUs of the SCC cluster and it was decided that the CPU-only code version would be used. Consequently, for this application, we compared the power consumption and performance only on KNL and CPU nodes.

The power measurements caused another obstacle for this project. More specifically, it was planned that the power on KNL would be measured on a different system, called Hydra, since a tool was available there for this purpose. It was later noticed that power consumption can also be measured on the ARCHER nodes, with the CrayPat profiler, and this would give more realistic results, since both power consumption and performance would be measured on the same system. However, the tool used to measure power consumption on the SCC cluster included the power consumption of every component inside the node, in contrast to the power measurements on ARCHER. As a result, some extra measurements were required in order to make an appropriate comparison of the power consumption on the different platforms. It is also worth mentioning that after the competition, the SCC cluster was unavailable for a long period of time and this resulted in a delay of the report writing compared to the initial plan, since a significant amount of measurements were stored on the cluster. 14

25 Chapter 3 SCC's Applications

The given applications and benchmarks of the competition were divided among the members of the team. Three of them were undertaken by the writer of the current report. The work that was carried out and the results obtained are described in this chapter.

3.1 High Performance LINPACK

The first benchmark that was undertaken by the writer of this report was HPL, in collaboration with another member of the team. This benchmark's performance was tested and submitted on the first day of the competition.

3.1.1 Background

HPL is probably the most well-known benchmark in the field of HPC, since it is used to rank the TOP500 supercomputers. The algorithm of the benchmark solves a dense linear system of double precision numbers. It uses LU factorisation with row-partial pivoting. It mainly captures the computing power of a system on floating-point operations. Data are distributed among processors in a two-dimensional block-cyclic way to ensure good load balance. [16] In our case, an optimised binary for GPUs, provided by NVIDIA, was used, which required a specific OpenMPI version. In order to achieve the best possible result on our cluster, a two-stage process was followed. The first stage was to find the combination of input parameters and system configuration that ensures the best performance, without investigating the power consumption. After this optimal combination was found, we measured the power consumption and investigated how our previous decisions could change in order to keep the peak power consumption under the power budget. 15
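The benchmark is configured through an HPL.dat file; the full file used for nine GPUs is given in Appendix A.2 and is not reproduced here, but a minimal sketch of the lines that matter most for the tuning discussed in the next section (the problem size N, the block size NB and the P x Q process grid; values here are illustrative, taken from Table 4) looks roughly as follows:

```bash
# Sketch of the key lines of an HPL.dat input file (standard HPL format;
# the values shown correspond to the 3-node, 9-GPU configuration).
cat > HPL.dat << 'EOF'
HPLinpack benchmark input file
SCC 2017 (illustrative sketch)
HPL.out      output file name (if any)
6            device out (6=stdout, 7=stderr, file)
1            # of problems sizes (N)
139995       Ns
1            # of NBs
384          NBs
0            PMAP process mapping (0=Row-, 1=Column-major)
1            # of process grids (P x Q)
3            Ps
3            Qs
16.0         threshold
EOF
```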

26 3.1.2 Performance Investigations

The performance of HPL can be significantly improved by tuning some input parameters. Thus, a lot of experiments were carried out to ensure that we were using the optimal combination for our machine configuration. [17] At the beginning, only one node was available and thus some initial investigations were carried out on one node. It was soon concluded that the most important input parameters were the size of the problem (N), the block size of the sub-problems that are given to the GPUs (NB), and the process grid dimensions (P and Q). For this benchmark, a matrix of size N by N is created. The matrix is divided into smaller blocks of size NB by NB. These blocks are cyclically distributed over a P by Q process grid. The rest of the input parameters were investigated as well, but they could not offer a significant benefit to the performance and so these results are not presented in the report.

As far as the problem size is concerned, the bigger the problem, the better the performance that can be reached. Thus, the biggest problem that can fit into our system's memory would be the ideal choice. However, we would like to leave some space for other processes, such as the operating system's requirements. Thus, it is suggested to find a problem size that uses about 80% of the total memory of the system [18]. Given that our system was using three GPUs with 16GB memory each and 64GB RAM on each node, and that we are using double precision numbers (8 bytes), an initial suggestion for the value of N could be obtained from the equation:

N ≈ sqrt(0.80 × n × 64 × 10^9 / 8)    (2)

where n is the number of nodes.

The value of NB is also important for the performance. The smaller the block size, the better the load balance among the GPUs, since the distribution will be more even. On the other hand, a very small block size can lead to a significant communication overhead. [18] The optimal P and Q values depend on the interconnection network of the system. It is generally suggested to choose P and Q values that are as close as possible. Our final configuration included nine GPUs; consequently, there were not many combinations to be tested. The final process grid was set to 3 by 3. Table 2 presents the initial results of the performance in one node. The results show that there is a limit when increasing the problem size, beyond which the performance cannot be further improved. Regarding the NB size, it seems that 384 was the optimal value. 16
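As a quick worked check of Equation (2), using the 64 GB of RAM per node (the calculation is mine, not taken from the report):

\[
n = 1:\ N \approx \sqrt{\frac{0.80 \times 1 \times 64 \times 10^{9}}{8}} \approx 80{,}000,
\qquad
n = 3:\ N \approx \sqrt{\frac{0.80 \times 3 \times 64 \times 10^{9}}{8}} \approx 138{,}600,
\]

which is consistent with the values N=80750 and N=139995 used in the experiments.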

27 Table 2 Initial investigations of performance in Teraflops for N and NB input parameters in one node with three GPUs

It is worth mentioning that, upon reaching this performance limit with respect to the problem size, we suggested increasing the memory on the nodes, so that our system could fit a bigger problem size, which would increase performance. People from Boston agreed to double the memory so that we could test bigger problems. However, the performance limit did not change much, probably because the GPUs were already fully utilised and the extra work was simply added to the CPUs. Given that more memory would consume more power, we kept the initial amount of memory.

In the scripts that are used to run the HPL benchmark, there are some other parameters that can influence the performance. The first one is the number of OpenMP threads. Each MPI process is mapped to one GPU, and we have three GPUs using two CPUs on each node. Thus, two MPI ranks will share the 10 cores of the same CPU. As a result, the number of OpenMP threads was set to five, so that each thread runs on a different core. Increasing the number of threads did not improve the performance. In addition, the GPU_DGEMM_SPLIT environment variable indicates the percentage of the work that will be offloaded to the GPUs. Some investigations on one node are presented in Table 3. The optimal choice was to offload all of the work to the GPUs. However, when the problem cannot fit in the GPUs' memory, the CPUs are also used to solve the system, even when this variable is set to one. These values of OpenMP threads and GPU_DGEMM_SPLIT were used for the rest of the experiments that are described in this section.

Table 3 Initial GPU_DGEMM_SPLIT investigations on one node with N=80750 and NB=384 17

28 Obviously, more experiments were carried out as more nodes were added to the system. Using Equation 2, we could define the theoretical optimal size and then, with some further investigations close to this number, the best values were determined. Regarding the value of NB, its optimal value was 384 in all cases. This value is not much influenced by the number of nodes used, since it represents the sub-problem size that is given to each GPU. Figure 10 presents the initial results on three nodes. It is clear that as the problem size increases, the performance increases too. However, when the problem does not fit in the system memory, swapping occurs and the performance drops dramatically. Table 4 shows the optimal input parameters and the resulting performance for different numbers of nodes. The full list of the input parameters used for nine GPUs is shown in Appendix A.2 and the scripts used for running HPL are presented in Appendix A.1.

Figure 10 Initial Problem Size N investigations on three nodes

Table 4 Performance results and Input Parameters keeping GPU_DGEMM_SPLIT=1 and OMP_NUM_THREADS=5 18
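As an illustration of how these settings fit together, a launch along the following lines was used (a minimal sketch only; the real run scripts are in Appendix A.1, and the binary and hostfile names here are illustrative):

```bash
# One MPI rank per GPU: 3 nodes x 3 GPUs = 9 ranks on a 3x3 process grid.
export OMP_NUM_THREADS=5      # five CPU cores driving each rank (Table 4)
export GPU_DGEMM_SPLIT=1      # offload all DGEMM work to the GPUs
mpirun -np 9 --hostfile hosts ./xhpl
```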

29 3.1.3 Power Consumption Investigations

The next step was to focus on power consumption while keeping the optimal input parameter combination. The final configuration described above, on 9 GPUs, consumed about 900 Watts more than the power limit. Thus, several methods were attempted in order to decrease the power consumption. First, we attempted to decrease the CPU frequency, hoping that this change would not affect the performance much. The CPU frequency must lie between 1.20 GHz and 3.10 GHz (the default CPU frequency limits). With the script in Appendix A.3, we could change the upper and lower limit for all the CPUs. However, decreasing the CPU frequency to the minimum permitted value, i.e. 1.20 GHz, not only affected the performance dramatically but also did not improve the power consumption. The problem size in use was forcing the CPUs to take part in solving the system, as it could not fit on the GPUs, and so the performance dropped by about 20% while the peak power consumption remained stable. Reducing the percentage of the problem that is offloaded to the GPUs with the GPU_DGEMM_SPLIT environment variable did not offer any power gain either. The default CPU frequency limits were eventually used.

Then, the impact of the GPU frequency on power was examined. Table 5 shows the impact of the GPU frequency on both performance and power consumption. Decreasing the frequency below 1265MHz resulted in a significant decrease in performance. As a result, we kept this value and examined other configurations. Changing the number of threads used per rank was one of them. As shown in Table 6, decreasing this parameter has a small impact on performance but it also reduces the power consumption. Combining this with disabling hyperthreading resulted in the best-performing configuration that did not exceed the power limit. The hyperthreading configuration was implemented with the script in Appendix A.5.

It is also worth noting that from the beginning of these experiments, error correction code (ECC) was turned off and persistence mode was enabled on all the GPUs. These two options further decreased the idle power consumption of the GPUs. ECC's purpose is finding and correcting data corruption. However, in the case of GPUs these errors are quite rare and it is generally suggested to turn off ECC, because it influences both performance and power consumption. Turning ECC back on gave the results shown in Table 7. The script in Appendix A.6 was used for changing ECC. Regarding the persistence mode, the script is presented in Appendix A.7. 19
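The scripts in Appendices A.5-A.7 are not reproduced in this excerpt; as an illustration, the GPU-side settings described above are typically applied with standard nvidia-smi commands along the following lines (the clock values come from the configuration above; the actual scripts may differ):

```bash
# Enable persistence mode on all GPUs (keeps the driver loaded, reducing idle overhead).
nvidia-smi -pm 1

# Disable ECC on all GPUs (the change takes effect after the next reboot).
nvidia-smi -e 0

# Set application clocks: 715 MHz memory clock, 1265 MHz graphics clock.
nvidia-smi -ac 715,1265
```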

30 Table 5 GPU frequency investigations with N=139995, NB=384, GPU_DGEMM_SPLIT=1 and OMP_NUM_THREADS=5

Table 6 OMP_NUM_THREADS and Hyperthreading investigations with N=139995, NB=384, GPU_DGEMM_SPLIT=1 and GPU_frequency=1265MHz

Table 7 ECC investigations with N=139995, NB=384, GPU_DGEMM_SPLIT=1, GPU_frequency=1265MHz, OMP_NUM_THREADS=4 and hyperthreading enabled

To sum up, before the competition we settled on the configuration with the best possible performance within the power budget, which is summarised in Table 8. 20

31 Configuration before the competition:
- ECC disabled
- Persistence mode enabled
- Hyperthreading disabled
- Input parameters: HPL.dat file A.2
- OMP_NUM_THREADS=4
- GPU_DGEMM_SPLIT=1
- GPU_frequency=1265MHz
- Default CPU frequency limits
- OpenMPI version

Table 8 Configuration before the competition

3.1.4 Final configuration and competition's results

Some further investigations on power consumption and performance were required in Frankfurt. After setting up our cluster in the conference hall, the performance and power results of the previously mentioned configuration improved. Moreover, we had the chance to make some changes directly to the hardware that further decreased the power consumption. More specifically, we disabled hyperthreading from the BIOS and the peak power consumption decreased by about 30 Watts. Then, we disabled one of the two pumps used for the liquid cooling, and as a result the peak power consumption decreased by a further 90 Watts. Given that, we had the opportunity to increase the GPU frequency and the OMP_NUM_THREADS variable, achieving even better performance. It is worth mentioning that the power consumption was measured with a different tool, given to each team by the committee, and this was probably one reason for the unexpectedly improved results. In addition, the committee asked all teams using clusters with GPUs to run the HPL benchmark with a specific given binary that required a particular OpenMPI version. Our final configuration and results are presented in Table 9, and in this way we reached third place for the HPL benchmark. [19] 21

32 Final Configuration:
- ECC disabled
- Persistence mode enabled
- Hyperthreading disabled from the BIOS
- Input parameters: HPL.dat file A.2
- OMP_NUM_THREADS=5
- GPU_DGEMM_SPLIT=1
- GPU_frequency=1290MHz
- Default CPU frequency limits
- Use one out of two pumps for liquid cooling
- OpenMPI version

Table 9 Final Configuration

3.2 High Performance Conjugate Gradient

The second benchmark that was investigated was HPCG. This benchmark was tested on the first day of the competition.

3.2.1 Background

HPCG is a significant benchmark in the area of HPC, as it was created as an alternative metric for ranking HPC systems. The algorithm of this benchmark creates a synthetic sparse 3D linear system with double precision (64 bit) floating-point values. A fixed number of symmetric Gauss-Seidel preconditioned conjugate gradient iterations is performed to solve the system. It is implemented in C++ with MPI and OpenMP. An official run needs a minimum of thirty minutes to complete. [20]

It is argued that HPCG can be a better way to rank HPC supercomputers than HPL. HPL tests only floating-point oriented patterns and so tends to create an optimistic performance prediction of the systems. Hence, it is representative only of compute-bound applications. On the other hand, HPCG tests more patterns, such as memory accesses and global communications. It has a lower compute to data access ratio, and other system features such as the memory bandwidth can influence its performance. Consequently, it could potentially represent a wider range of real scientific applications. [21] HPCG optimised binaries can be found for several platforms. In our case, the HPCG 3.1 binary, optimised by NVIDIA for GPUs including Pascal, was used, which required a specific OpenMPI version. [22] 22
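For context, the stock HPCG benchmark is configured through a small hpcg.dat input file containing the local grid dimensions and the target run time; assuming the NVIDIA-optimised binary keeps this standard format, a minimal sketch using the values chosen in the next section would be:

```bash
# Sketch of an hpcg.dat input file (standard HPCG format: two free-text comment
# lines, then the local grid nx ny nz, then the runtime in seconds).
cat > hpcg.dat << 'EOF'
HPCG benchmark input file
SCC 2017 (illustrative sketch)
256 256 256
60
EOF
```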

33 3.2.2 Performance Investigations

For this benchmark, there were only a few configurations that could be tested. Regarding input parameters, there are only two. The first one is the size of the local problem on each GPU, defined as a three-dimensional grid, and the second one is the runtime in seconds. Moreover, the number of OpenMP threads could be manipulated, as well as the GPU frequency, through the run script. The HPCG package provided by NVIDIA, containing the optimised binary, also included a file with configuration suggestions depending on the GPU model of the underlying hardware. For P100 GPUs, it was suggested to set the local problem size to 256x256x256 and to set the maximum boost clock GPU frequency. Some tests were carried out regarding the problem size and it was found that the suggested size gave the best performance. The GPU frequency was set to 1480MHz for graphics and to 715MHz for memory. Given that the implementation assumes a one-to-one mapping between GPUs and MPI ranks (i.e. three MPI ranks were running on each node with two CPUs, with 10 cores each), five OpenMP threads was the highest number that could ensure that each thread runs on a different core. Further increasing the number of OpenMP threads did not offer any performance gain. Table 10 presents the HPCG results before the competition, for a 256x256x256 local problem size, a 60 second run and five OpenMP threads.

Table 10 Performance and Power Consumption Investigations for local problem size 256x256x256, 60 seconds run and OMP_NUM_THREADS=5

The power consumption was not a problem in this case; consequently, no further investigations were required. On the first day of the competition, our run over 3660 seconds reached the fourth place in terms of Gflops performance [23]. The scripts used to run the HPCG benchmark are presented in Appendix A.

3.3 MiniDFT

This year a code challenge was announced by the SCC committee, using the MiniDFT application. A GitHub repository was created and given to the teams, where the initial code was available as well as the instructions, input files and further information about 23

34 the given application and the code challenge [24]. The improved code of each team was later uploaded and shared among the SCC participants in a different GitHub repository.

3.3.1 Background

MiniDFT is a mini-app that is part of the Quantum Espresso (QE) code. Its purpose is modelling materials with plane-wave density functional theory (DFT). Its algorithm uses either the Local Density Approximation (LDA) or the Perdew-Burke-Ernzerhof (PBE) exchange-correlation functional in order to solve the Kohn-Sham equations. The input of the application includes a set of atomic coordinates and pseudopotentials. Its parallelisation is implemented with MPI and OpenMP. The code is written in FORTRAN and C. [25]

3.3.2 The code challenge

This challenge had no restrictions regarding the changes that could be implemented in the code, as long as the verification rules of the output were not violated. Some suggested changes were improving multi-threading, linking optimised libraries, adding compiler optimisation flags, offloading code to accelerators, rewriting parallelism and re-implementing algorithms. [26] A small input case, the small.in file, was given so that we could carry out some initial investigations. However, for the final submission, a larger known input case was included, the pe-23.local.in file, together with a second, unknown input case that was given on the second day of the competition.

Porting MiniDFT to our cluster involved compiling with different compilers, using several MPI implementations and linking with libraries. After some initial investigations, the two main choices were the Intel Compiler Suite with Intel MPI and the PGI Compiler Suite with MVAPICH2. Using the given small input case, the execution times of the two builds were compared and there was a noticeable difference, which is presented in Table 11. As far as the libraries are concerned, MiniDFT required ScaLAPACK, BLAS and FFT libraries. The Math Kernel Library (MKL) was used for this purpose. Using the sequential version of the MKL library was suggested by the Quantum-ESPRESSO User's Guide [27] and the Intel MKL User's Guide [28], so that the application's processes do not interfere with MKL threads. In addition, the performance of the application was tested with the MKL library linked statically and dynamically, and it appeared that static linking was slightly better. The appropriate compilation options are provided by Intel's MKL Link Line Advisor site [29]. The last two mentioned options (sequential and static linking) were used for the following experiments.

Table 11 Execution times for two different combinations of compilers and MPI implementations for the small.in input file. 24

35 Profiling and identifying overheads and computationally expensive functions and loops was the next step before the optimisation process. For this purpose, VTune was used and some time-consuming subroutines and loops were found. The initial output of the VTune GUI tool (amplxe-gui) for the small.in input file, on one node (20 cores) using ten MPI processes and two OpenMP threads per process, is presented in Figure 11. A great amount of time is spent on MKL routines, but also on communications between processes (the MPI_Alltoall routine) and threads (omp parallel regions). From the Summary section of the profiler, it was noted that the total time spent on the CPU waiting on an OpenMP barrier inside the parallel region was considerably high, about 40.4% of the total CPU time, while the overhead due to MPI communications was about 17.5%.

Figure 11 VTune profiling, amplxe-gui output for minidft, using 10 MPI processes and OMP_NUM_THREADS=2 with small.in input file

It was then decided to offload these parts of the code to the GPUs. Due to time limitations, OpenACC was used instead of CUDA, with parallel loop as well as data directives, in order to eliminate data movements from CPUs to GPUs and from GPUs to CPUs. However, due to the nature of the existing algorithm, it was not possible to offload the large loops to the GPUs, so smaller loops were offloaded. It appeared, though, that two main problems limited the performance. The first problem was that these parts were not computationally intensive enough for the acceleration of the computations to cover the data transfer overhead. The second problem was the number of MPI processes per GPU. More specifically, the number of MPI processes used was higher than the number of GPUs. Consequently, more than one process was using each GPU and an 25

36 extra overhead was added when switching the running process on the GPUs. Decreasing the number of MPI processes was not a solution to this problem, since the performance scaled almost linearly with the number of MPI processes. If more time had been available, using CUDA with GPU-enabled libraries such as cuFFT and/or cuBLAS could possibly have given better results.

Using OpenACC required the PGI compilers. As mentioned before, the Intel compilers produced a faster executable. The OpenACC directives could not overcome this performance difference. Thus, it was eventually decided to use the initial CPU-only version of the code, compiled with the Intel compilers and using the Intel MPI library. For this version, two optimisation flags, -O3 and -fast, were added and the execution time was further decreased. In more detail, -fast increased performance by about 11.8% compared to the initial version that was using only -O3. Compiler flags for the PGI version, such as -Mcache_align, -fast, -Minline and -Mipa=fast,inline, were also investigated. Hyper-threading was tested as well; however, the execution time increased by about 7%.

The two different levels of parallelism, MPI and OpenMP, were then investigated. The idea is that OpenMP could add an extra level of parallelism inside a node, for each MPI process. Although several combinations of threads per process were tested, it appeared that pure MPI was definitely better. As was concluded from the VTune profiler, a significant amount of time was consumed in omp parallel regions due to load imbalance. The results on one node, for the small input case, are presented in Table 12.

Table 12 Execution times in one node (20 cores) for different numbers of threads and MPI processes, keeping MKL threading layer sequential for the small.in input file

The input parameters seemed to play a significant role in performance. These are the number of pools (npool), the number of band groups (nbgrp), the number of MPI processes for diagonalisation (ndiag) and the number of task groups (ntg). The given version of MiniDFT supported only one pool. Each pool is responsible for a group of k-points and it is partitioned into band groups in order to improve the parallel scaling. Task-group parallelism is implemented to improve the parallelisation of the FFTs. A lot of experiments were carried out in order to identify the best combination for the known 26

37 input case, pe-23.local.in, which is presented in Table 13. The same combination was used for the unknown input case as well. More detailed test results can be found in Appendix B.

Input parameter | Final value
npool           | 1
nbgrp           | 1
ndiag           | 64
ntg             | 5

Table 13 Optimal values for input parameters

The last part was to try some code improvements. This included branch elimination, loop fusion and moving if statements outside the loops. These alterations slightly improved the final performance, by about 0.3%. It is worth pointing out that during the competition, the committee carried out an interview with each team, where we discussed the process of porting and optimising minidft before submitting our results. There, they suggested using the NVBLAS library. With just some trivial changes to the Makefile, this library could make use of the GPUs for the BLAS calls; no changes in the code were needed [30]. Unfortunately, the performance gain was very small and there was not enough time to further investigate the usage of this library. 27
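For illustration only, a run combining the Table 13 parameters with the pure-MPI configuration discussed above might be launched roughly as follows; the mini_dft executable name and the QE-style command-line flags are assumptions based on Quantum Espresso conventions, not details taken from the report:

```bash
# Pure MPI run of minidft with the Table 13 input parameters and sequential
# MKL threading (executable name and flag spellings assumed, QE-style).
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
NPROCS=60   # e.g. 3 nodes x 20 cores; adjust to the allocation in use
mpirun -np ${NPROCS} ./mini_dft -in pe-23.local.in \
    -npool 1 -nbgrp 1 -ndiag 64 -ntg 5
```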

38 Chapter 4 MiniDFT and HPCG on KNL

In this chapter, the results of porting and optimising one benchmark (HPCG) and one application (minidft) of the competition are presented. First, a brief description of this process is given.

4.1 Porting and Optimising on KNL

The KNL nodes on ARCHER, the UK National Supercomputing Service, were used for investigating performance. ARCHER has twelve KNL compute nodes, with one Xeon Phi processor on each one. The KNL model is the 7210, with 64 cores at 1.3 GHz, and each core can run up to four hyperthreads. Each node has a DDR memory of 96GB. [31]

After the code of each program was successfully compiled on the ARCHER KNL nodes, the input parameters were investigated first, in order to identify the optimal combination. The next step was to investigate the optimal combination of MPI processes and OpenMP threads, since both programs are hybrid. Based on previous investigations, most hybrid codes seem to perform better on the ARCHER Xeon Phi processors than on Xeon CPUs [32]. The impact of hyperthreading was examined, as well as the performance on several numbers of nodes.

Then, some technical features of the KNL were tested. The first one was the MCDRAM mode. As explained in section 2.2, MCDRAM can be used in three different modes, i.e. flat mode, cache mode and hybrid mode. Moreover, there are three possible cluster options, called all-to-all, quadrant and sub-NUMA. On ARCHER, each KNL has a 16GB on-chip MCDRAM memory. Two of the KNL nodes are configured in flat mode, the rest of them in cache mode, and all of them are set to quadrant clustering mode. Nodes are configured at boot time and, given that rebooting nodes to change the initial configuration of the system is a time-consuming process, we were able to examine only these two options, i.e. flat and cache memory modes with quadrant clustering mode, for up to two and eight nodes respectively.

Vectorisation was the next feature to examine. For this purpose, the Cray Performance Analysis Tool (CrayPat) was used in order to profile the applications, identify the most time consuming parts of the code and focus on the loops that could be vectorised in order 28

Moreover, vectorisation reports were needed for this purpose; the Intel compiler flag -qopt-report=5 was used to produce these reports for the relevant files.

Sometimes the compiler does not manage to vectorise all the loops efficiently. As a result, a loop might not be vectorised at all, or it might be vectorised without reaching the maximum possible performance. Possible reasons are non-unit-stride loops, i.e. loops whose array accesses are not contiguous, unaligned data accesses and data dependencies between loop iterations. First of all, aligning the data to cache-line boundaries is significant for performance because it minimises the number of data, instruction and cache-line loads; this requires only a trivial modification in the code at the definition of the variables. Aligning the data is not always enough: in addition, the compiler often needs to be informed of the alignment with the corresponding directives. Even then, the compiler might still not vectorise a loop because it assumes dependencies between loop iterations that may or may not be real. In other cases, the compiler might not vectorise a loop because of non-unit strides; it would need to gather and/or scatter the data from/to the arrays, and this process has an overhead. In addition, when a loop has only a few iterations, the compiler might choose not to vectorise it at all, because scalar execution might be faster. For these cases, there are several compiler directives that either inform the compiler that there are no dependencies or force vectorisation.

Lastly, several compilation options were tested, including linked libraries and optimisation flags.

4.2 Performance results for MiniDFT

As mentioned in section 3.3.2, two input files were given for the competition. In this section, the investigations were carried out using the small.in input file. Identifying the optimal input parameters was a necessary step; as observed from previous investigations on the SCC cluster, wrong choices might lead to a significant drop in performance. The results for different numbers of nodes are presented in Appendix B.3.

From previous investigations, it was concluded that this application performs better when only MPI processes are used, without OpenMP threads, due to load imbalance. The hybrid version of minidft did not give better results on KNL than its pure MPI version. This was expected, since explicit OpenMP is a recent addition, still under development, and it is suggested not to mix MPI and OpenMP parallelisation unless it is known how to run them in a controlled manner [27]. As shown in Figure 12, the execution time decreases at a significant rate as the number of MPI processes increases and the number of OpenMP threads decreases. Hyperthreading did not improve the performance either. In fact, using the best number of processes, i.e. 64, doubling the OpenMP threads with hyperthreading makes the execution time more than two times slower.

This test decreased performance even more in the case with 32 MPI processes. It is worth mentioning that minidft is probably not the best application for testing hyperthreading, due to the poor OpenMP performance explained earlier. However, increasing the number of MPI processes with hyperthreading also deteriorated the performance.

Figure 12 Investigations on MPI processes and OpenMP threads on one node configured in cache mode

Figure 13 Investigations on hyperthreading on one node configured in cache mode

Thus, for the rest of the experiments, 64 MPI processes were used per node with OMP_NUM_THREADS set to one. Figure 14 presents the improvement in performance as more nodes were used, all configured in cache mode. The scaling is linear for up to two nodes, but then the performance increases at a slower rate. With the CrayPat profiler, it was observed that the time spent in MPI routines such as MPI_Barrier and MPI_Alltoall rises significantly as the number of MPI processes increases. Thus, the hybrid version was tested once more on multiple nodes, in order to reduce the inter-process communication overhead. However, this resulted in higher overall execution times than the corresponding pure MPI runs.

Figure 14 Scalability of minidft across nodes configured in cache mode with 64 MPI processes per node and OMP_NUM_THREADS=1

Nodes configured in flat mode were also used, and the results compared to cache mode are presented in Figure 15. Execution in cache mode was about 1.6 times faster than in flat mode. The numactl program was used to target MCDRAM, with the -p 1 option setting it as the preferred memory. With this change in the run script, the program tries to use MCDRAM and falls back to DDR memory when MCDRAM is exhausted. A possible explanation for this noticeable difference in performance is that the application is streaming but its problem size fits in the cache; thus, using a high-bandwidth cache is more beneficial than using a high-bandwidth main memory.

Figure 15 Cache vs flat mode results using 64 MPI processes on each node and OMP_NUM_THREADS=1

Two different ways of linking the MKL libraries were investigated, both with static linking. The first used the OpenMP threading layer (the default option in the given Makefile) and the second used the sequential version of MKL. With the second option, performance was slightly improved, by about 1.1%.

Several flags were added in order to investigate their impact on performance. The flags -fast, -fma, -ipo, -finline-functions and -align array64byte did not offer any significant improvement.

The last part was to investigate the impact of vectorisation and try to improve it with several techniques: aligning the data, adding new aligned vectors and forcing vectorisation using compiler directives. Vectorisation can be disabled with the compiler flags -no-vec -no-simd -qno-openmp-simd; doing so decreased the performance of the initial code by about 7.3% on one node, indicating that vectorisation was already beneficial for the application.

The initial code of minidft was profiled with CrayPat in order to identify the most time-consuming functions and loops. The non-user functions that consumed a considerable amount of time were MKL (31.5%) and MPI (10.5%) functions. The most time-consuming user-defined functions are fft_scatter (28.7%) from the fft_base.f90 file, exxenergy2 (7.2%) and vexx (6.9%) from the exx.f90 file, and cft_1z (2.1%) from the fft_scalar.f90 file; the line numbers reported for each function correspond to the most time-consuming loops. Using the Intel compiler's vectorisation reports, an attempt was made to improve their performance.

Table 2: Profile by Group, Function, and Line (CrayPat output, condensed)

Samp%    Group / Function           Source
100.0%   Total
 28.7%   fft_base_mp_fft_scatter_   minidft/isc17-scc-minidft/src/fft_base.f90 (hottest loop at line 808)
  7.2%   exx_mp_exxenergy2_         minidft/isc17-scc-minidft/src/exx.f90 (line 1236)
  6.9%   exx_mp_vexx_               minidft/isc17-scc-minidft/src/exx.f90 (line 1006)
  2.1%   fft_scalar_mp_cft_1z_      minidft/isc17-scc-minidft/src/fft_scalar.f90 (line 173)
 42.2%   ETC

Table 14 Profiling results of minidft with CrayPat on one node

After identifying the loops that could be vectorised, the arrays used inside these loops were aligned. This was achieved with the !dir$ attributes align : 64 :: variable_name directive placed after the declaration of each variable. For variables that were passed into a function as in or inout arguments, the file containing their original declaration had to be found so that the directive could be added there; it was then necessary to add the !dir$ assume_aligned variable_name : 64 directive in the file containing the actual loop.

Forcing vectorisation was not beneficial for every loop. In most cases where the vectorisation report indicated that a loop was not vectorised because it "seems inefficient", adding the omp simd directive resulted in worse performance. In the most time-consuming loop, i.e. line 808 in fft_base.f90, the accesses inside the loop were neither aligned nor contiguous, and the compiler report gave the message "non-unit strided load was emulated" for the variable f_aux. That was the only case where simply forcing vectorisation with the omp simd directive improved performance, probably because it is the largest loop and was therefore worth the gather/scatter overhead. The exact code is given in Figure 16. It is worth noting that this directive required removing any omp parallel directives; in our case that was not a problem, since we were not using multiple OpenMP threads per process.

In some cases, introducing new aligned arrays could help vectorisation and the overall performance. For example, in Figure 17 a new vector was introduced, together with a new loop, in order to ensure that the aligned calculations take full advantage of vectorisation. In Figure 18, an integer was replaced by an array in order to make all accesses inside the loop aligned. The omp simd directive forces vectorisation, but it can also be used to inform the compiler that the vectors inside the loop are aligned. However, it was observed from the vectorisation reports that the omp simd directive was not producing aligned accesses for arrays of complex datatype; in these cases the !dir$ vector aligned directive was added or used instead. All the modifications to these files can be found in the submitted code.

Introducing the compiler directives improved the final performance by about 3.4% on one node. Table 15 presents the results across several nodes and the impact of the final vectorisation. It is worth mentioning that as the number of nodes increases, the percentage benefit due to vectorisation decreases. However, this was expected, given that the percentage of time spent in these loops becomes less significant as the communication overhead increases.

!$omp simd
DO j = 1, dfft%npp( me_p )
   f_in( j + ( i - 1 ) * nppx ) = f_aux( mc + ( j - 1 ) * dfft%nnp )
ENDDO

Figure 16 Forcing vectorisation with the omp simd directive

A. Initial Code

do ig = 1, ngm
   vc = vc + fac(ig) * rhoc(nls(ig)) * CONJG(rhoc(nls(ig)))
end do

B. Improved Code

!$omp simd aligned(help:64) aligned(nls:64)
!dir$ vector aligned
do ig = 1, ngm   ! here the dir$ directive is required for nls
   help(ig) = rhoc(nls(ig)) * CONJG(rhoc(nls(ig)))
end do
!$omp simd reduction(+:vc) aligned(fac:64) aligned(help:64)
do ig = 1, ngm
   vc = vc + fac(ig) * help(ig)
end do

Figure 17 Helping vectorisation by introducing a new aligned vector

A. Original Code

DO proc = 1, nprocp
   gproc = dfft%nplist( proc ) + 1
   sendcount(proc) = npp_( gproc ) * ncp_(me)
   recvcount(proc) = npp_(me) * ncp_( gproc )
ENDDO

B. Improved Code

!$omp simd aligned(gproc2:64)
DO proc = 1, nprocp
   gproc2(proc) = dfft%nplist( proc ) + 1
ENDDO
!$omp simd aligned(sendcount:64) aligned(recvcount:64) aligned(gproc2:64)
DO proc = 1, nprocp
   sendcount(proc) = npp_( gproc2(proc) ) * ncp_(me)
   recvcount(proc) = npp_(me) * ncp_( gproc2(proc) )
ENDDO

Figure 18 Helping vectorisation by replacing an integer with a new aligned vector
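The declaration-side part of this process is not shown in Figures 16 to 18. The sketch below illustrates, with hypothetical module, array and routine names, how an array can be aligned where it is declared and how that alignment is then asserted in the routine that loops over it; it follows the directives described above but is not taken from the minidft source.

module aligned_work
  implicit none
  integer, parameter :: max_ng = 4096
  ! Hypothetical work array, fixed size for the sketch; align its data to a
  ! 64-byte boundary at the point of declaration.
  complex(8) :: help(max_ng)
  !dir$ attributes align : 64 :: help
end module aligned_work

subroutine accumulate(ngm, fac, vc)
  use aligned_work, only: help
  implicit none
  integer,    intent(in)    :: ngm   ! assumed to be at most max_ng
  real(8),    intent(in)    :: fac(ngm)
  complex(8), intent(inout) :: vc
  integer :: ig
  ! Assert, in the file that contains the loop, that help is 64-byte aligned,
  ! then request SIMD execution of the reduction.
  !dir$ assume_aligned help : 64
  !$omp simd reduction(+:vc)
  do ig = 1, ngm
     vc = vc + fac(ig) * help(ig)
  end do
end subroutine accumulate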

Table 15 Final results of minidft and the impact of vectorisation (columns: number of nodes, final execution time in CPU seconds, percentage of improvement due to overall vectorisation)

4.3 Performance results for HPCG

For the HPCG benchmark, the version from the Intel Math Kernel Library Benchmarks 2017 Update 3 for Linux was used [34]. The provided code is already optimised for Intel Xeon and Intel Xeon Phi.

First, the problem size was investigated. As mentioned before, the first input parameter is the local grid dimension for each process, so the global problem size depends on this parameter as well as on the number of processes. A valid run must have a problem size large enough to occupy at least a quarter of the main memory. For each number of processes there is an optimal local size, as shown in Figure 19, and these configurations gave about the same performance. Using one process, or more than 16 processes per node, was generally inefficient.

Figure 19 Global problem size investigations on one node (columns: number of MPI processes, OpenMP threads per process, optimal local grid size)
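As a rough illustration of the quarter-of-memory rule above, the sketch below estimates the footprint of a chosen local grid on one 96 GB KNL node. The bytes-per-grid-point value is an assumed ballpark for the sparse matrix, vectors and coarse-grid levels rather than a constant defined by HPCG, and the grid and rank counts are example values, so this only shows the shape of the check.

program hpcg_size_check
  implicit none
  ! Example local grid per MPI rank and ranks per node (illustrative values).
  integer, parameter :: nx = 256, ny = 256, nz = 256, ranks_per_node = 4
  ! Assumed rough storage cost per grid point in bytes; not an HPCG constant.
  real(8), parameter :: bytes_per_point = 500.0d0
  ! Main (DDR) memory per ARCHER KNL node in GB.
  real(8), parameter :: node_mem_gb = 96.0d0
  real(8) :: points_per_node, footprint_gb

  points_per_node = real(nx, 8) * real(ny, 8) * real(nz, 8) * real(ranks_per_node, 8)
  footprint_gb = points_per_node * bytes_per_point / 1024.0d0**3

  print '(a, f8.2, a)', 'Estimated footprint: ', footprint_gb, ' GB'
  if (footprint_gb >= 0.25d0 * node_mem_gb) then
     print *, 'Large enough for a valid run (at least a quarter of main memory).'
  else
     print *, 'Too small for a valid run; increase the local grid size.'
  end if
end program hpcg_size_check

With these example values the estimate comes out above the quarter-of-memory threshold; smaller local grids or fewer ranks per node would need to be checked in the same way.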

In addition, using hyperthreading and increasing the number of OpenMP threads per process had a different impact depending on the number of processes. For example, using two MPI processes with 64 OpenMP threads each, i.e. two threads per physical core, could not significantly increase performance compared to the corresponding non-hyperthreading version. This is because OpenMP parallelisation becomes less effective when a high number of threads causes significant synchronisation overhead. In the cases of four, eight and sixteen processes, the performance with two threads per core was about the same. Figure 20 presents the impact of hyperthreading on different local sizes with four MPI processes. Using four threads per core is generally less efficient than using two threads per core. It is worth noting that, for the smaller problem size, using four threads per core performs worse than using one thread per core, since there is not enough parallelism to exploit inside the local problem and the thread synchronisation overhead becomes a limiting factor.

Figure 20 Impact of hyperthreading on one node with 4 MPI processes

Scalability of HPCG across multiple nodes is presented in Figure 21, using four MPI processes per node. The scaling is not linear, but it is close to linear. An attempt was made to improve scalability across nodes by reducing the number of MPI processes and increasing the problem size, but it did not offer better performance.

Figure 21 Scalability across multiple nodes using 4 MPI processes per node and 32 OpenMP threads per process

The performance of HPCG on nodes configured in flat mode was also tested; the numactl program with the -p 1 option was used as before. The performance was less than half of the corresponding performance on nodes configured in cache mode. It was also observed that increasing the number of threads per core could not offer any performance gain, indicating that for this application a high-bandwidth cache was more important than a high-bandwidth main memory. The comparison of HPCG in flat and cache mode is summarised in Figure 22. It is worth mentioning that increasing the problem size on flat-mode nodes did not improve performance either.

Regarding vectorisation, the code was already optimised: all the arrays were aligned and all the critical loops were vectorised, as shown by the vectorisation reports. Adding omp simd compiler directives did not improve performance. It was also noticed that disabling vectorisation with the -no-vec -no-simd -qno-openmp-simd flags decreased performance by only about 0.9%; it seems that, although the loops were appropriately vectorised, vectorisation was not particularly effective in this case. Adding optimisation flags was not beneficial. Figure 23 presents the results of HPCG.

Figure 22 Cache and flat mode for different numbers of nodes, MPI processes, OpenMP threads and hyperthreading

Figure 23 Performance of HPCG across multiple nodes (columns: number of KNL nodes, performance in Gflops)
