Comparing Performance and Power Consumption on Different Architectures

1 Comparing Performance and Power Consumption on Different Architectures

Andriani Mappoura

August 18, 2017

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2017

2 Abstract

This project aims to compare a conventional CPU platform with a GPU-based heterogeneous cluster and a platform composed of Intel Xeon Phi processors. The comparison covers performance as well as power consumption, since both are critical factors in the field of HPC. The first part of this project concerns participation in the Student Cluster Competition (SCC) that was held during the International Supercomputing Conference in June 2017. The minidft application and the High Performance Conjugate Gradient benchmark were chosen from the SCC. These two codes were ported and optimised on three different platforms; one of them is the cluster that was used at the SCC. A description of the work that was carried out on the GPU-based cluster and the Intel Xeon Phi nodes is presented. From the outcomes of this project, it was observed that, under the appropriate configuration and with suitable code modifications, both GPUs and Xeon Phi processors are able to achieve significantly better performance and power efficiency than a CPU-only platform.

3 Contents

Chapter 1 Introduction
    1.1 Report Organisation ... 2
Chapter 2 Background Overview
    2.1 The P100 NVIDIA GPU
    2.2 The second generation Intel Xeon Phi Processor, Knights Landing (KNL)
    2.3 The Student Cluster Competition
        2.3.1 Rules and Background
        2.3.2 The team's cluster
        2.3.3 Setting up the system
    2.4 Project Motivation
    2.5 Obstacles and Deviation from initial project plan ... 13
Chapter 3 SCC's Applications
    3.1 High Performance LINPACK
        3.1.1 Background
        3.1.2 Performance Investigations
        3.1.3 Power Consumption Investigations
        3.1.4 Final configuration and competition's results
    3.2 High Performance Conjugate Gradient
        3.2.1 Background
        3.2.2 Performance Investigations
    3.3 MiniDFT
        3.3.1 Background ... 24
i

4       3.3.2 The code challenge ... 24
Chapter 4 MiniDFT and HPCG on KNL
    4.1 Porting and Optimising on KNL
    4.2 Performance results for MiniDFT
    4.3 Performance results for HPCG ... 37
Chapter 5 Comparing Performance and Power Consumption on different platforms
    5.1 MiniDFT on KNL-nodes & CPU-nodes
    5.2 HPCG on KNL-nodes, CPU-nodes & GPU+CPU-nodes ... 44
Chapter 6 Conclusions
    6.1 Future Work ... 48
ii

5 List of Tables

Table 1 Initial investigations of HPL performance and power consumption in one node ... 12
Table 2 Initial investigations of performance in Teraflops for N and NB input parameters in one node with three GPUs ... 17
Table 3 Initial GPU_DGEMM_SPLIT investigations on one node with N=80750 and NB=384
Table 4 Performance results and Input Parameters keeping GPU_DGEMM_SPLIT=1 and OMP_NUM_THREADS=5
Table 5 GPU frequency investigations with N=139995, NB=384, GPU_DGEMM_SPLIT=1 and OMP_NUM_THREADS=5
Table 6 OMP_NUM_THREADS and Hyperthreading investigations with N=139995, NB=384, GPU_DGEMM_SPLIT=1 and GPU_frequency=1265MHz ... 20
Table 7 ECC investigations with N=139995, NB=384, GPU_DGEMM_SPLIT=1, GPU_frequency=1265MHz, OMP_NUM_THREADS=4 and hyperthreading enabled ... 20
Table 8 Configuration before the competition ... 21
Table 9 Final Configuration ... 22
Table 10 Performance and Power Consumption Investigations for local problem size 256x256x256, 60 seconds run and OMP_NUM_THREADS=5
Table 11 Execution times for two different combinations of compilers and MPI implementations for the small.in input file
Table 12 Execution times in one node (20 cores) for different numbers of threads and MPI processes, keeping MKL threading layer sequential for the small.in input file ... 26
Table 13 Optimal values for input parameters ... 27
Table 14 Profiling results of minidft with CrayPat on one node ... 34
Table 15 Final Results of minidft and impact of vectorisation ... 37
iii

6 Table 16 Platforms Specifications ... 41
Table 17 Power Consumption and Performance Results of minidft on one node of Platforms 2 & 3
Table 18 - Power Consumption and Performance Results of HPCG on one node of Platforms 1, 2 & 3
iv

7 List of Figures

Figure 1 - NVIDIA Tesla P100
Figure 2 - NVIDIA NVLink Hybrid Cube Mesh ... 5
Figure 3 - NVIDIA Pascal Architecture GP100, Full GPU with 60 SM Units ... 6
Figure 4 NVIDIA Pascal Architecture GP100, SM Unit ... 6
Figure 5 Intel Xeon Phi Processor ... 7
Figure 6 KNL Architecture ... 8
Figure 7 Tile Design ... 8
Figure 8 MCDRAM in cache, flat and hybrid modes ... 9
Figure 9 The front view of the cluster with liquid cooling system on the top ... 11
Figure 10 Initial Problem Size N investigations on three nodes ... 18
Figure 11 VTune profiling, amplxe-gui output for minidft, using 10 MPI processes and OMP_NUM_THREADS=2 with small.in input file ... 25
Figure 12 Investigations on MPI processes and OpenMP threads, on one node configured in cache mode ... 30
Figure 13 Investigations on hyperthreading on one node configured in cache mode ... 31
Figure 14 Scalability of minidft across nodes configured in cache mode with 64 MPI processes per node and OMP_NUM_THREADS=1
Figure 15 Cache Vs Flat mode results using 64 MPI processes on each node and OMP_NUM_THREADS=1
Figure 16 Forcing Vectorisation with omp simd directive ... 35
Figure 17 - Helping Vectorisation with introducing a new aligned vector ... 36
Figure 18 - Helping Vectorisation with replacing integer with a new aligned vector ... 36
Figure 19 Global problem size investigations on one node ... 37
v

8 Figure 20 Impact of hyperthreading on one node with 4 MPI processes ... 38
Figure 21 Scalability across multiple nodes using 4 MPI processes per node and 32 OpenMP threads per core ... 39
Figure 22 Cache and Flat Mode for different number of nodes, MPI processes, OpenMP threads and hyperthreading ... 40
Figure 23 Performance of HPCG across multiple nodes ... 40
Figure 24 CPU time on Platforms 2 & 3
Figure 25 Relative CPU time of Platforms 2 & 3
Figure 26 Scalability on Platforms 2 & 3
Figure 27 - Performance on Platforms 2 & 3
vi

9 Acknowledgements I would first like to thank my supervisor Fiona Reid for her valuable guidance and her continuous willingness to help me throughout this period. In addition, I would like to thank Boston Limited and especially David Power and Konstantinos Mouzakitis for supporting our participation in the Student Cluster Competition, as well as our team coach Emmanouil Farsarakis and my teammates for this great cooperation. Special thanks to my family and friends for their love and support. vii


11 Chapter 1 Introduction

High Performance Computing (HPC) is a field that has been advancing rapidly, introducing new hardware and software features. Exascale-level computing is considered the next goal of HPC [1]. This new target aims to create systems that will be able to achieve 10^18 double precision (64-bit) operations per second within a power consumption of 20 to 30 MW. Thus, an exascale cluster would be roughly 50 times faster than a modern cluster that delivers 20 Petaflops. However, Sunway TaihuLight, the Chinese supercomputer that was ranked first in the TOP500 list and fourth in the Green500 list in 2016, delivers about 6,051 Megaflops per Watt [2]. Consequently, today's most powerful system, which is also one of the most power-efficient systems of our days, would need about 165 MW to deliver one Exaflops. To put these power consumption numbers into perspective, it is worth mentioning that the largest power station in the United Kingdom, Drax, has a power capacity of 3,960 MW, where the high-pressure turbines used generate 140 MW each. [3]

Obviously, the exascale challenge requires once again new features in both hardware and software. The new supercomputers will be developed by co-designing applications, systemware and hardware, while power consumption will play a key role in this process. Regarding the hardware part of this challenge, new technologies are required in different areas such as the memory, the processing units and the cooling of the systems. Liquid cooling has already been introduced and is widely used instead of conventional air-cooling fans, in order to achieve a high ratio of operations per Watt. Numerous studies have been carried out investigating new memory features that could help with the power consumption issue. [4] Moreover, it is often argued that accelerators, coprocessors and other many-core architectures could play a key role in this new era, since the performance of multi-core systems is increasing much more slowly, while their power efficiency is not particularly promising [5].

Many-core architectures seem to be a hot topic in the field of HPC; their popularity in the TOP500 and Green500 lists is particularly impressive [6], [7]. More specifically, in the latest announced June 2017 Green500 list, nine out of the ten most energy-efficient supercomputers are using NVIDIA GPUs. Regarding the June 2017 TOP500 list, among the ten most powerful supercomputers, three are using Intel Xeon Phi and two are using NVIDIA GPUs. With the appropriate handling, these technologies can ensure great performance and energy efficiency. 1
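As a quick cross-check of the power figures quoted above (a back-of-the-envelope calculation of mine, not taken from the references):

\[
\frac{10^{18}\ \text{Flops}}{6{,}051 \times 10^{6}\ \text{Flops/W}} \approx 1.65 \times 10^{8}\ \text{W} = 165\ \text{MW},
\qquad
\frac{10^{18}\ \text{Flops}}{20 \times 10^{15}\ \text{Flops}} = 50.
\]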

12 NVIDIA GPUs seem to be the predominant choice of accelerators in the area of HPC. The Pascal architecture was introduced in April 2016 by NVIDIA. In addition, Intel introduced the Xeon Phi. The first generation was Knights Corner (KNC), an HPC-purpose coprocessor. The second generation, Knights Landing (KNL), was released later and is mainly used as a standalone CPU. These technologies can be beneficial for various fields such as Machine Learning. Nevertheless, it is argued that heterogeneous clusters are over-estimated. The benchmarks that are used to rank supercomputers, such as High Performance LINPACK (HPL), do not represent a wide variety of real applications. Most scientific codes were initially developed for CPU systems. Thus, exploiting accelerators fully requires a great number of changes to the existing codes, and sometimes it is not possible to achieve the desired performance. As far as the Xeon Phi is concerned, it appears that fewer code modifications might be required, but not all applications are able to scale well on these systems. [8]

This project aims to compare these different processing units in terms of performance as well as power consumption. The writer of this project participated in the Student Cluster Competition (SCC) as part of the MSc dissertation and had the opportunity to investigate the performance and power consumption of different benchmarks and applications on a cluster that was built with P100 GPUs. After the completion of the competition, two of these applications were ported and optimised on KNL nodes, as well as on CPU-only nodes. The performance and power consumption on these three different systems were measured and compared.

1.1 Report Organisation

The report is organised in six chapters as follows:
Chapter 1 is a brief introduction to the current project.
Chapter 2 contains the background theory that is required for understanding some technical features as well as the SCC. The motivation for this project, the obstacles faced throughout this period and the deviation from the initial plan are also described.
Chapter 3 describes the work done by the author of the report for the SCC preparation, the results achieved and the outcome of the SCC.
Chapter 4 contains the work done in order to port and optimise an application and a benchmark of the SCC on the KNL nodes.
Chapter 5 presents the comparison of the performance and power consumption of three different platforms for one application and one benchmark.
Chapter 6 contains the conclusions that were drawn from this project and future work. 2

13 Chapter 2 Background Overview

From the mid-1980s until the mid-2000s, manufacturers managed to improve processor performance mainly by increasing the processor frequency, a technique known as frequency scaling. However, heat generation and power consumption increased significantly, and it is argued that this is the main limitation to further increasing the performance of modern CPUs [9]. Power consumption is influenced by frequency, as shown by the equation:

P = C × V² × f    (1)

where P is the power, f is the frequency, V is the voltage and C is the capacitance. Reducing the voltage used to be a method for keeping power consumption low. There is a limit, though, regarding the overall voltage, since the difference between voltage levels is what distinguishes 0s and 1s in the systems. A further decrease would not allow these digital differences to be clear. Making smaller transistors reduces power consumption as well, since the capacitance is reduced. Manufacturers try to fit as many smaller transistors as possible (smaller transistors are also faster) onto a single chip. Nevertheless, there are physical constraints in terms of the size and the speed of a single chip. Transistor gates have already become too thin. Previous attempts to make transistors even smaller resulted in leaky transistors that would not even allow a processor to function. [10]

As a result, the current trend is parallel scaling with multi-core processors. Moore's law, which says that chip performance doubles every 18 months, is actually still valid; performance can be increased by exploiting parallelism and developing software that takes advantage of the new hardware features. Programming is now more difficult, since codes no longer become faster on their own as they used to. Instead, different programming techniques must be used in order to exploit the capabilities of multi-core chips. Many-core architecture is a subcategory of multi-core processors that aims to achieve a higher level of parallelism and lower power consumption. In the area of HPC, many-core processors, accelerators and coprocessors are able to achieve a high degree of parallelism due to their special design. They are composed of a high number of simple, number-crunching independent cores in a single silicon die with high-bandwidth 3

14 memory. Moreover, these cores consume less power than modern CPU cores, which are power-hungry due to their capability to deal with complex concepts such as branch prediction and out-of-order execution. Consequently, the right code development is again required in order to exploit the maximum performance capabilities of these technologies, since their simple cores are slower than modern CPU cores in terms of latency and single-thread performance.

GPUs are a very common choice of many-core accelerators in HPC. They are used in heterogeneous systems together with CPUs. They cannot be used independently, as they are designed neither for I/O operations nor for running the operating system. The host, which is the CPUs, offloads computationally expensive parts of the code to the GPUs, known as the device, in order to accelerate these computations. However, in order to offload these parts to the GPUs effectively, a lot of changes to the code might be needed.

The Intel Xeon Phi was released aiming to help and accelerate various scientific fields, using the Many Integrated Core (MIC) architecture. The first generation of Xeon Phi, Knights Corner (KNC), was a coprocessor that could be connected via the PCI Express bus. It was used like GPUs, accompanying CPUs. However, it could also be used directly without offloading code from the host system, i.e. in native mode. The fact that the cores used in KNC were initially designed for CPUs in 1993 indicates their simplicity. The second generation of Xeon Phi is Knights Landing (KNL), and one of its main differences from the previous generation is that KNL is a many-core processor available as a stand-alone system that does not need a host CPU. Although no changes to the code are required to run it on the Xeon Phi, unlike GPUs, some alterations may be essential in order to achieve good performance.

2.1 The P100 NVIDIA GPU

The P100 NVIDIA GPU is today's fastest GPU, built on the NVIDIA Pascal architecture, and its target is to accelerate HPC and Big Data applications as well as Deep Learning and Artificial Intelligence systems.

Figure 1 - NVIDIA Tesla P100 Source: 4

15 There are four main new features that made the P100 GPUs so powerful. The first one is the Pascal architecture, which targets data-centre applications and reached three times better performance than the previous GPU generation, the Tesla K40. The second new feature is the integration of both compute and data on a single package. This was implemented by introducing Chip on Wafer on Substrate (CoWoS) with High Bandwidth Memory 2 (HBM2) technology. This improvement gave three times more memory bandwidth compared to the previous version and helped data-intensive applications. NVIDIA NVLink is another of the new features. NVLink is the first high-speed interconnect for GPU-to-CPU and GPU-to-GPU communication, and its bandwidth is five times that of PCI Express Gen3. Figure 2 shows how eight GPUs can be connected with NVLink, with the GPUs connected to the CPUs through PCIe switches. Finally, the page migration engine with unified memory enables virtual memory paging and page faulting, giving applications the opportunity to scale beyond the physical memory size of the system. P100 GPUs have also been improved regarding power efficiency with TSMC's 16 nm FinFET manufacturing process, compared to the 28 nm process used for the previous K40 and M40.

Figure 2 - NVIDIA NVLink Hybrid Cube Mesh Source:

As far as the Pascal GP100 GPU architecture is concerned, it includes the same main features as previous architectures, i.e. Texture Processing Clusters (TPCs), an array of Graphics Processing Clusters (GPCs), Streaming Multiprocessors (SMs), and memory controllers, with some differences. Figure 3 presents the design of the Pascal GP100. It consists of six GPCs, thirty TPCs and 60 SMs (two in each TPC, ten in each GPC). Each SM has 64 CUDA cores, i.e. there are 3840 cores overall. In this figure, we can also see the memory controllers attached to the L2 cache that control the HBM2 DRAM. Figure 4 presents the design of an SM unit in Pascal GP100. It consists of 64 CUDA cores that 5

16 form two warps. Each warp executes the same instruction on multiple data at a time, and its cores are controlled by a warp scheduler.

Figure 3 - NVIDIA Pascal Architecture GP100, Full GPU with 60 SM Units Source: NVIDIA Pascal Architecture Whitepaper

Figure 4 NVIDIA Pascal Architecture GP100, SM Unit Source: NVIDIA Pascal Architecture Whitepaper 6

17 2.2 The second generation Intel Xeon Phi Processor, Knights Landing (KNL)

KNL, Intel's latest many-core processor, was introduced in June 2016, at ISC in Germany. It targets supercomputing and high performance computing applications by providing massive parallelism and vectorisation.

Figure 5 Intel Xeon Phi Processor Source:

KNL introduced many improvements over the previous generation, KNC. First, the fact that KNL is a stand-alone CPU avoids the PCIe bottleneck of KNC. The different cores, the new processor architecture, the new memory technology and the operation modes are some of the most important new features. Figure 6 presents the KNL architecture. On KNC, the on-die interconnect used to be a ring. On KNL, a mesh interconnect is used to connect 36 tiles, allowing a higher bandwidth connection between cores and memory. Each tile consists of two cores sharing the L2 cache, with two improved Vector Processing Units (VPUs) in each core, as shown in Figure 7. The new cores aim to balance both single and parallel thread performance as well as power efficiency. The peak Flops on KNL can be up to three times higher than the peak Flops on KNC. Depending on the Xeon Phi model, the peak double precision performance is roughly 3,000 Gflops on KNL compared to roughly 1,200 Gflops on KNC. Regarding hyperthreading, KNC required at least two threads per core in order to give good performance. However, that is not the case with KNL, where there are codes that do not need hyperthreading at all. 7

18 Figure 6 KNL Architecture Source:

Figure 7 Tile Design Source:

The KNL has two levels of memory. It has a large DRAM main memory that can be directly accessed. The second level is MCDRAM, a high bandwidth memory of 16GB with higher latency than DRAM, which can be used in flat and cache mode. Flat mode means that the MCDRAM is used as main memory in the same address space. In cache mode, the MCDRAM acts as a last level cache for DRAM. There is also a hybrid mode, where a specific percentage of the MCDRAM is used as cache and the rest as part of the main memory. Figure 8 shows these three different memory configurations. Regarding the programmability effort, cache mode is the easiest because it requires no changes in order to use the MCDRAM. On the other hand, hybrid and flat modes require modifications either inside the code or at the command line using the numactl program in order to use the HBM memory; by default the DRAM is used. Features such as the access patterns, the problem size and the memory bandwidth requirements of an application might determine which memory mode can be beneficial for it. 8
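As an illustration of the flat-mode usage mentioned above, the MCDRAM is typically exposed as a separate, CPU-less NUMA node (often node 1 on a single-socket KNL; this numbering is an assumption and should be confirmed on the actual system). A minimal sketch, where ./my_app is a placeholder executable:

```bash
# Inspect the NUMA layout; in flat mode the 16GB MCDRAM shows up as its own node.
numactl --hardware

# Allocate from MCDRAM only (allocations fail once the 16GB is exhausted).
numactl --membind=1 ./my_app

# Prefer MCDRAM but fall back to DDR when it fills up.
numactl --preferred=1 ./my_app
```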

19 Figure 8 MCDRAM in cache, flat and hybrid modes Source:

KNL also offers three different cluster-configuration options. The first one is All-to-All, where addresses are uniformly hashed across all distributed directories. Quadrant is the second one; it divides the chip into four virtual quadrants, and each address is hashed to a directory in the same quadrant as the memory. The last one is Sub-NUMA clustering, where each quadrant is exposed as a separate NUMA domain to the operating system. All-to-All might have lower performance than the other two options because it generally causes more mesh traffic. Quadrant clustering can be a better choice than Sub-NUMA for applications that use KNL as a symmetric multi-processor. On the other hand, Sub-NUMA clustering can be advantageous for MPI or hybrid applications that use KNL as a distributed memory system, with proper control of process and thread pinning. [11] Last but not least, the Omni-Path Fabric is integrated on the KNL package. This feature offers better scalability to large systems as well as lower power consumption and cost.

2.3 The Student Cluster Competition

The SCC was held from the 19th until the 21st of June, in Frankfurt, during the International Supercomputing Conference 2017 (ISC 17). Our team consisted of four students who represented EPCC and the University of Edinburgh. The team's coach and supervisors helped and guided the team. Our participation was made possible in collaboration with our vendor Boston Limited, which provided us with the cluster as well as technical support [12]. Twelve teams from all around the world participated.

2.3.1 Rules and Background

The aim of this competition is to give students a number of benchmarks and applications to be ported and optimised on each team's cluster, in order to achieve the best possible performance while the power consumption must not exceed 3 kW. A Power Distribution Unit (PDU) is given to each team in order to monitor the power consumption during the competition. Screens with the power consumption of all teams are available in the conference hall. If the power limit is exceeded by a team, then an SCC supervisor comes to the booth of the team 9

20 to ensure that the application or benchmark is repeated. A new rule regarding the power consumption was introduced this year. During the competition we were informed that every time the power limit is exceeded, there would be a penalty on the marking. Another rule is that teams are not allowed to physically touch the system after the first run. Moreover, changes in the BIOS are forbidden after the competition has started. All system equipment that is used for the first run should be powered on during the whole competition. Finally, rebooting is not permitted, unless there is a significant reason such as a system hang. In that case, an SCC supervisor should be notified to give permission for rebooting the system. [13]

During the competition, we were given a USB flash drive with the instructions for running each day's applications as well as the input files. Our results were submitted with that USB flash drive to the SCC committee. The given benchmarks were High Performance LINPACK (HPL), HPC Challenge (HPCC) and High Performance Conjugate Gradient (HPCG). As far as applications are concerned, the first was FEniCS, a computing platform for partial differential equations. MiniDFT, which is part of Quantum Espresso (QE), was the application used for the code challenge. The last one given before the competition was TensorFlow, an open source software library for numerical computation. This application was used for the CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) challenge using the Keras Deep Learning library. On the second day of the competition a secret application was announced. This was LAMMPS, using the GRANULAR package. [14]

The competition traditionally includes five awards. The first one is the Highest LINPACK award, given to the team that delivers the highest performance on HPL under the power budget. The second one is the Fan Favourite, given to the team that receives the most votes from the ISC participants. The other three awards are the 1st, 2nd and 3rd Place Overall Winners, whose score depends on the HPCC performance, the application runs as well as the interviews that were carried out during the competition. [14]

2.3.2 The team's cluster

Our cluster consisted of three nodes. One of them was the head node that was used for some further configuration. All nodes were used for computing purposes. Each node consisted of:
- 2 x Intel Xeon E5-2630 v4 CPUs: 2.2 GHz frequency, 10 cores (20 threads)
- 3 x NVIDIA Tesla P100 GPUs: 3584 cores 10

21   16GB HBM2 memory
- 1 x DDR4 RAM: 64GB, 2400MHz frequency
- 1 x SSD: 900GB

The interconnection of the nodes was implemented with Mellanox EDR InfiniBand networking. Liquid cooling was added for cooling the system, instead of relying purely on air fans. The liquid cooling was provided by CoolIT [15]. This feature enabled us to remove some energy-hungry air fans, and the total power consumption was eventually reduced. Figure 9 presents the cluster after it was set up on the first day of the competition.

Figure 9 The front view of the cluster with liquid cooling system on the top

It is worth pointing out that the initial configuration plan included a cluster with four nodes, each one containing two GPUs and two CPUs, i.e. eight GPUs and eight CPUs. However, after some further investigations on one node, it appeared that the HPL 11

22 benchmark could scale almost linearly when a third GPU was added. Thus, the alternative plan was to add a third GPU per node. In order to keep the power consumption under the power budget, only three nodes could then be included in the system. Consequently, the cluster would include in total nine GPUs instead of eight and six CPUs instead of eight. Our team decided to focus on the Highest LINPACK award. Thus, the second plan was eventually implemented, which enabled us to achieve a particularly high performance on the HPL benchmark during the competition. Table 1 presents the performance and power consumption results that led us to our final configuration decision. Although the power consumption in one node with three GPUs was above 1 kW, we decided that the plan could still be advantageous, since we would later be able to reduce the power consumption by under-clocking the GPU and/or CPU frequencies and removing air fans.

Table 1 Initial investigations of HPL performance and power consumption in one node

2.3.3 Setting up the system

The cluster was accessed remotely throughout the preparation period before the competition. The operating system used was CentOS 7.3. A number of different software packages were installed and used. These included:
- NVIDIA drivers
- Several libraries:
  - CUDA libraries (version 8.0)
  - cuDNN (version 5.1)
  - Intel Math Kernel Library (MKL)
  - OpenBLAS
- Different compilers:
  - Intel
  - GNU
  - PGI 12

23 - Several MPI implementations:
  - Intel MPI (version 5.1.2)
  - OpenMPI (versions 1.10, 1.6.5)
  - MVAPICH2 (version 2.3a)

In order to facilitate access to the cluster for the different members of the team, ssh keys were used and different user accounts were set up. User accounts were a simple solution that could keep each user's folders and setup settings isolated without requiring any cluster management tool. As far as the power consumption measurements before the competition are concerned, a Windows system with power meter software was installed, which was accessed through a Remote Desktop (RDP) client.

2.4 Project Motivation

Obviously, much effort is being put into creating and advancing technologies for the field of HPC. Both Intel and NVIDIA have done a lot for this purpose, and their latest products, the KNL Xeon Phi and the P100 GPU respectively, seem particularly promising in terms of performance and power efficiency. Some prior investigations from these companies as well as from other research organisations showed that, under the appropriate software and hardware configuration, these technologies can give better results than previous generations of these products and than conventional CPU-only systems for highly parallelised applications. Through the SCC, various applications and benchmarks were examined on a P100 GPU-based cluster regarding performance and power consumption. That gave the writer the idea of porting and investigating some of the SCC's applications on Intel Xeon Phi processors and Intel Xeon processors. That was a unique opportunity to fulfil the purpose of this project, which is comparing these technologies for HPC applications.

2.5 Obstacles and Deviation from initial project plan

From the beginning of the SCC preparation, the given benchmarks and applications were divided among the members of the team, and the writer was responsible for two benchmarks, HPL and HPCG. Thus, the initial purpose of this project was to compare the performance and power consumption of these two benchmarks on three different platforms, i.e. the SCC GPU-based cluster, ARCHER KNL nodes and ARCHER CPU nodes. However, more applications were announced later on for the SCC and the writer undertook one of them, the minidft application. This application was given for the code challenge and it required more work compared to the other two benchmarks. It is also 13

24 worth mentioning that for both benchmarks, optimised binaries provided by NVIDIA were used and, as a result, code profiling and optimisation were not required. Consequently, it was decided to port and optimise minidft on KNL and CPU nodes, since the writer had become familiar with that code. HPCG was also ported to KNL and CPU nodes, but there was not enough time to port HPL, too. Unfortunately, we did not manage to port minidft efficiently to the GPUs of the SCC cluster and it was decided that the CPU-only code version would be used. Consequently, for this application, we compared the power consumption and performance only on KNL and CPU nodes.

The power measurements caused another obstacle for this project. More specifically, it was planned that the power on KNL would be measured on a different system, called Hydra, since a tool was available there for this purpose. It was later noticed that power consumption can also be measured on the ARCHER nodes, with the CrayPat profiler, and this would give more realistic results, since both power consumption and performance would be measured on the same system. However, the tool used to measure power consumption on the SCC cluster included the power consumption of every component inside the node, in contrast to the power measurements on ARCHER. As a result, some extra measurements were required in order to make an appropriate comparison of the power consumption on the different platforms. It is also worth mentioning that after the competition, the SCC cluster was unavailable for a long period of time and this resulted in a delay of the report writing compared to the initial plan, since a significant amount of measurements were stored on the cluster. 14

25 Chapter 3 SCC's Applications

The given applications and benchmarks of the competition were divided among the members of the team. Three of them were undertaken by the writer of the current report. The work that was carried out and the results obtained are described in this chapter.

3.1 High Performance LINPACK

The first benchmark that was undertaken by the writer of this report was HPL, in collaboration with another member of the team. This benchmark's performance was tested and submitted on the first day of the competition.

3.1.1 Background

HPL is probably the most well-known benchmark in the field of HPC, since it is used to rank the TOP500 supercomputers. The algorithm of the benchmark solves a dense linear system of double precision numbers. It uses LU factorisation with row-partial pivoting. It mainly captures the computing power of a system on floating-point operations. Data are distributed among processors in a two-dimensional block-cyclic way to ensure good load balance. [16] In our case, an optimised binary for GPUs, provided by NVIDIA, was used, which required a specific OpenMPI version. In order to achieve the best possible result on our cluster, a two-stage process was followed. The first stage was to find the combination of input parameters and system configuration that ensures the best performance, without investigating the power consumption. After this optimal combination was found, we measured the power consumption and investigated how our previous decisions could change in order to keep the peak power consumption under the power budget. 15
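The benchmark is configured through an HPL.dat file; the full file used for nine GPUs is given in Appendix A.2 and is not reproduced here, but a minimal sketch of the lines that matter most for the tuning discussed in the next section (the problem size N, the block size NB and the P x Q process grid; values here are illustrative, taken from Table 4) looks roughly as follows:

```bash
# Sketch of the key lines of an HPL.dat input file (standard HPL format;
# the values shown correspond to the 3-node, 9-GPU configuration).
cat > HPL.dat << 'EOF'
HPLinpack benchmark input file
SCC 2017 (illustrative sketch)
HPL.out      output file name (if any)
6            device out (6=stdout, 7=stderr, file)
1            # of problems sizes (N)
139995       Ns
1            # of NBs
384          NBs
0            PMAP process mapping (0=Row-, 1=Column-major)
1            # of process grids (P x Q)
3            Ps
3            Qs
16.0         threshold
EOF
```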

26 3.1.2 Performance Investigations

The performance of HPL can be significantly improved by tuning some input parameters. Thus, a lot of experiments were carried out to ensure that we were using the optimal combination for our machine configuration. [17] At the beginning, only one node was available and thus some initial investigations were carried out on one node. It was soon concluded that the most important input parameters were the size of the problem (N), the block size of the sub-problems that are given to the GPUs (NB), and the process grid dimensions (P and Q). For this benchmark, a matrix of size N by N is created. The matrix is divided into smaller blocks of size NB by NB. These blocks are cyclically distributed over a P by Q process grid. The rest of the input parameters were investigated as well, but they could not offer a significant benefit to the performance and so these results are not presented in the report.

As far as the problem size is concerned, the bigger the problem, the better the performance that can be reached. Thus, the biggest problem that can fit into our system's memory would be the ideal choice. However, we would like to leave some space for other processes, such as the operating system's requirements. Thus, it is suggested to find a problem size that uses about 80% of the total memory of the system [18]. Given that our system was using three GPUs with 16GB memory each and 64GB RAM on each node, and that we are using double precision numbers (8 bytes), an initial suggestion for the value of N could be obtained from the equation:

N ≈ sqrt(0.80 × n × 64 × 10^9 / 8)    (2)

where n is the number of nodes.

The value of NB is also important for the performance. The smaller the block size, the better the load balance among the GPUs, since the distribution will be more even. On the other hand, a very small block size can lead to a significant communication overhead. [18] The optimal P and Q values depend on the interconnection network of the system. It is generally suggested to choose P and Q values that are as close as possible. Our final configuration included nine GPUs; consequently, there were not many combinations to be tested. The final process grid was set to 3 by 3. Table 2 presents the initial results of the performance in one node. The results show that there is a limit when increasing the problem size, beyond which the performance cannot be further improved. Regarding the NB size, it seems that 384 was the optimal value. 16
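As a quick worked check of Equation (2), using the 64 GB of RAM per node (the calculation is mine, not taken from the report):

\[
n = 1:\ N \approx \sqrt{\frac{0.80 \times 1 \times 64 \times 10^{9}}{8}} \approx 80{,}000,
\qquad
n = 3:\ N \approx \sqrt{\frac{0.80 \times 3 \times 64 \times 10^{9}}{8}} \approx 138{,}600,
\]

which is consistent with the values N=80750 and N=139995 used in the experiments.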

27 Table 2 Initial investigations of performance in Teraflops for N and NB input parameters in one node with three GPUs

It is worth mentioning that, upon reaching this performance limit with respect to the problem size, we suggested increasing the memory on the nodes, so that our system could fit a bigger problem size, which would increase performance. People from Boston agreed to double the memory so that we could test bigger problems. However, the performance limit did not change much, probably because the GPUs were already fully utilised and the extra work was simply added to the CPUs. Given that more memory would consume more power, we kept the initial amount of memory.

In the scripts that are used to run the HPL benchmark, there are some other parameters that can influence the performance. The first one is the number of OpenMP threads. Each MPI process is mapped to one GPU, and we have three GPUs using two CPUs on each node. Thus, two MPI ranks will share the 10 cores of the same CPU. As a result, the number of OpenMP threads was set to five, so that each thread runs on a different core. Increasing the number of threads did not improve the performance. In addition, the GPU_DGEMM_SPLIT environment variable indicates the percentage of the work that will be offloaded to the GPUs. Some investigations on one node are presented in Table 3. The optimal choice was to offload all of the work to the GPUs. However, when the problem cannot fit in the GPUs' memory, the CPUs are also used to solve the system, even when this variable is set to one. These values of OpenMP threads and GPU_DGEMM_SPLIT were used for the rest of the experiments that are described in this section.

Table 3 Initial GPU_DGEMM_SPLIT investigations on one node with N=80750 and NB=384 17

28 Obviously, more experiments were carried out as more nodes were added to the system. Using Equation 2, we could define the theoretical optimal size and then, with some further investigations close to this number, the best values were determined. Regarding the value of NB, its optimal value was 384 in all cases. This value is not much influenced by the number of nodes used, since it represents the sub-problem size that is given to each GPU. Figure 10 presents the initial results on three nodes. It is clear that as the problem size increases, the performance increases too. However, when the problem does not fit in the system memory, swapping occurs and the performance drops dramatically. Table 4 shows the optimal input parameters and the resulting performance for different numbers of nodes. The full list of the input parameters used for nine GPUs is shown in Appendix A.2 and the scripts used for running HPL are presented in Appendix A.1.

Figure 10 Initial Problem Size N investigations on three nodes

Table 4 Performance results and Input Parameters keeping GPU_DGEMM_SPLIT=1 and OMP_NUM_THREADS=5 18
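As an illustration of how these settings fit together, a launch along the following lines was used (a minimal sketch only; the real run scripts are in Appendix A.1, and the binary and hostfile names here are illustrative):

```bash
# One MPI rank per GPU: 3 nodes x 3 GPUs = 9 ranks on a 3x3 process grid.
export OMP_NUM_THREADS=5      # five CPU cores driving each rank (Table 4)
export GPU_DGEMM_SPLIT=1      # offload all DGEMM work to the GPUs
mpirun -np 9 --hostfile hosts ./xhpl
```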

29 3.1.3 Power Consumption Investigations

The next step was to focus on power consumption while keeping the optimal input parameter combination. The final configuration described above, on 9 GPUs, consumed about 900 Watts more than the power limit. Thus, several methods were attempted in order to decrease the power consumption. First, we attempted to decrease the CPU frequency, hoping that this change would not affect the performance much. The CPU frequency must lie between 1.20 GHz and 3.10 GHz (the default CPU frequency limits). With the script in Appendix A.3, we could change the upper and lower limit for all the CPUs. However, decreasing the CPU frequency to the minimum permitted value, i.e. 1.20 GHz, not only affected the performance dramatically but also did not improve the power consumption. The problem size in use was forcing the CPUs to take part in solving the system, as it could not fit on the GPUs, and so the performance dropped by about 20% while the peak power consumption remained stable. Reducing the percentage of the problem that is offloaded to the GPUs with the GPU_DGEMM_SPLIT environment variable did not offer any power gain either. The default CPU frequency limits were eventually used.

Then, the impact of the GPU frequency on power was examined. Table 5 shows the impact of the GPU frequency on both performance and power consumption. Decreasing the frequency below 1265MHz resulted in a significant decrease in performance. As a result, we kept this value and examined other configurations. Changing the number of threads used per rank was one of them. As shown in Table 6, decreasing this parameter has a small impact on performance but it also reduces the power consumption. Combining this with disabling hyperthreading resulted in the best-performing configuration that did not exceed the power limit. The hyperthreading configuration was implemented with the script in Appendix A.5.

It is also worth noting that from the beginning of these experiments, error correction code (ECC) was turned off and persistence mode was enabled on all the GPUs. These two options further decreased the idle power consumption of the GPUs. ECC's purpose is finding and correcting data corruption. However, in the case of GPUs these errors are quite rare and it is generally suggested to turn off ECC, because it influences both performance and power consumption. Turning ECC back on gave the results shown in Table 7. The script in Appendix A.6 was used for changing ECC. Regarding the persistence mode, the script is presented in Appendix A.7. 19
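The scripts in Appendices A.5-A.7 are not reproduced in this excerpt; as an illustration, the GPU-side settings described above are typically applied with standard nvidia-smi commands along the following lines (the clock values come from the configuration above; the actual scripts may differ):

```bash
# Enable persistence mode on all GPUs (keeps the driver loaded, reducing idle overhead).
nvidia-smi -pm 1

# Disable ECC on all GPUs (the change takes effect after the next reboot).
nvidia-smi -e 0

# Set application clocks: 715 MHz memory clock, 1265 MHz graphics clock.
nvidia-smi -ac 715,1265
```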

30 Table 5 GPU frequency investigations with N=139995, NB=384, GPU_DGEMM_SPLIT=1 and OMP_NUM_THREADS=5

Table 6 OMP_NUM_THREADS and Hyperthreading investigations with N=139995, NB=384, GPU_DGEMM_SPLIT=1 and GPU_frequency=1265MHz

Table 7 ECC investigations with N=139995, NB=384, GPU_DGEMM_SPLIT=1, GPU_frequency=1265MHz, OMP_NUM_THREADS=4 and hyperthreading enabled

To sum up, before the competition we settled on the configuration with the best possible performance within the power budget, which is summarised in Table 8. 20

31 Configuration before the competition:
- ECC disabled
- Persistence mode enabled
- Hyperthreading disabled
- Input parameters: HPL.dat file A.2
- OMP_NUM_THREADS=4
- GPU_DGEMM_SPLIT=1
- GPU_frequency=1265MHz
- Default CPU frequency limits
- OpenMPI version

Table 8 Configuration before the competition

3.1.4 Final configuration and competition's results

Some further investigations on power consumption and performance were required in Frankfurt. After setting up our cluster in the conference hall, the performance and power results of the previously mentioned configuration improved. Moreover, we had the chance to make some changes directly to the hardware that further decreased the power consumption. More specifically, we disabled hyperthreading from the BIOS and the peak power consumption decreased by about 30 Watts. Then, we disabled one of the two pumps used for the liquid cooling, and as a result the peak power consumption decreased by a further 90 Watts. Given that, we had the opportunity to increase the GPU frequency and the OMP_NUM_THREADS variable, achieving even better performance. It is worth mentioning that the power consumption was measured with a different tool, given to each team by the committee, and this was probably one reason for the unexpectedly improved results. In addition, the committee asked all teams using clusters with GPUs to run the HPL benchmark with a specific given binary that required a particular OpenMPI version. Our final configuration and results are presented in Table 9, and in this way we reached third place for the HPL benchmark. [19] 21

32 Final Configuration:
- ECC disabled
- Persistence mode enabled
- Hyperthreading disabled from the BIOS
- Input parameters: HPL.dat file A.2
- OMP_NUM_THREADS=5
- GPU_DGEMM_SPLIT=1
- GPU_frequency=1290MHz
- Default CPU frequency limits
- Use one out of two pumps for liquid cooling
- OpenMPI version

Table 9 Final Configuration

3.2 High Performance Conjugate Gradient

The second benchmark that was investigated was HPCG. This benchmark was tested on the first day of the competition.

3.2.1 Background

HPCG is a significant benchmark in the area of HPC, as it was created as an alternative metric for ranking HPC systems. The algorithm of this benchmark creates a synthetic sparse 3D linear system with double precision (64 bit) floating-point values. A fixed number of symmetric Gauss-Seidel preconditioned conjugate gradient iterations is performed to solve the system. It is implemented in C++ with MPI and OpenMP. An official run needs a minimum of thirty minutes to complete. [20]

It is argued that HPCG can be a better way to rank HPC supercomputers than HPL. HPL tests only floating-point oriented patterns and so tends to create an optimistic performance prediction of the systems. Hence, it is representative only of compute-bound applications. On the other hand, HPCG tests more patterns, such as memory accesses and global communications. It has a lower compute to data access ratio, and other system features such as the memory bandwidth can influence its performance. Consequently, it could potentially represent a wider range of real scientific applications. [21] HPCG optimised binaries can be found for several platforms. In our case, the HPCG 3.1 binary, optimised by NVIDIA for GPUs including Pascal, was used, which required a specific OpenMPI version. [22] 22
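For context, the stock HPCG benchmark is configured through a small hpcg.dat input file containing the local grid dimensions and the target run time; assuming the NVIDIA-optimised binary keeps this standard format, a minimal sketch using the values chosen in the next section would be:

```bash
# Sketch of an hpcg.dat input file (standard HPCG format: two free-text comment
# lines, then the local grid nx ny nz, then the runtime in seconds).
cat > hpcg.dat << 'EOF'
HPCG benchmark input file
SCC 2017 (illustrative sketch)
256 256 256
60
EOF
```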

33 3.2.2 Performance Investigations

For this benchmark, there were only a few configurations that could be tested. Regarding input parameters, there are only two. The first one is the size of the local problem on each GPU, defined as a three-dimensional grid, and the second one is the runtime in seconds. Moreover, the number of OpenMP threads could be manipulated, as well as the GPU frequency, through the run script. The HPCG package provided by NVIDIA, containing the optimised binary, also included a file with configuration suggestions depending on the GPU model of the underlying hardware. For P100 GPUs, it was suggested to set the local problem size to 256x256x256 and to set the maximum boost clock GPU frequency. Some tests were carried out regarding the problem size and it was found that the suggested size gave the best performance. The GPU frequency was set to 1480MHz for graphics and to 715MHz for memory. Given that the implementation assumes a one-to-one mapping between GPUs and MPI ranks (i.e. three MPI ranks were running on each node with two CPUs, with 10 cores each), five OpenMP threads was the highest number that could ensure that each thread runs on a different core. Further increasing the number of OpenMP threads did not offer any performance gain. Table 10 presents the HPCG results before the competition, for a 256x256x256 local problem size, a 60 second run and five OpenMP threads.

Table 10 Performance and Power Consumption Investigations for local problem size 256x256x256, 60 seconds run and OMP_NUM_THREADS=5

The power consumption was not a problem in this case; consequently, no further investigations were required. On the first day of the competition, our run over 3660 seconds reached the fourth place in terms of Gflops performance [23]. The scripts used to run the HPCG benchmark are presented in Appendix A.

3.3 MiniDFT

This year a code challenge was announced by the SCC committee, using the MiniDFT application. A GitHub repository was created and given to the teams, where the initial code was available as well as the instructions, input files and further information about 23

34 the given application and the code challenge [24]. The improved code of each team was later uploaded and shared among the SCC participants in a different GitHub repository.

3.3.1 Background

MiniDFT is a mini-app that is part of the Quantum Espresso (QE) code. Its purpose is modelling materials with plane-wave density functional theory (DFT). Its algorithm uses either the Local Density Approximation (LDA) or the Perdew-Burke-Ernzerhof (PBE) exchange-correlation functional in order to solve the Kohn-Sham equations. The input of the application includes a set of atomic coordinates and pseudopotentials. Its parallelisation is implemented with MPI and OpenMP. The code is written in FORTRAN and C. [25]

3.3.2 The code challenge

This challenge had no restrictions regarding the changes that could be implemented in the code, as long as the verification rules of the output were not violated. Some suggested changes were improving multi-threading, linking optimised libraries, adding compiler optimisation flags, offloading code to accelerators, rewriting parallelism and re-implementing algorithms. [26] A small input case, the small.in file, was given so that we could carry out some initial investigations. However, for the final submission, a larger known input case was included, the pe-23.local.in file, together with a second, unknown input case that was given on the second day of the competition.

Porting MiniDFT to our cluster involved compiling with different compilers, using several MPI implementations and linking with libraries. After some initial investigations, the two main choices were the Intel Compiler Suite with Intel MPI and the PGI Compiler Suite with MVAPICH2. Using the given small input case, the execution times of the two builds were compared and there was a noticeable difference, which is presented in Table 11. As far as the libraries are concerned, MiniDFT required ScaLAPACK, BLAS and FFT libraries. The Math Kernel Library (MKL) was used for this purpose. Using the sequential version of the MKL library was suggested by the Quantum-ESPRESSO User's Guide [27] and the Intel MKL User's Guide [28], so that the application's processes do not interfere with MKL threads. In addition, the performance of the application was tested with the MKL library linked statically and dynamically, and it appeared that static linking was slightly better. The appropriate compilation options are provided by Intel's MKL Link Line Advisor site [29]. The last two mentioned options (sequential and static linking) were used for the following experiments.

Table 11 Execution times for two different combinations of compilers and MPI implementations for the small.in input file. 24

35 Profiling and identifying overheads and computationally expensive functions and loops was the next step before the optimisation process. For this purpose, VTune was used and some time-consuming subroutines and loops were found. The initial output of the VTune GUI tool (amplxe-gui) for the small.in input file, on one node (20 cores) using ten MPI processes and two OpenMP threads per process, is presented in Figure 11. A great amount of time is spent on MKL routines, but also on communications between processes (the MPI_Alltoall routine) and threads (omp parallel regions). From the Summary section of the profiler, it was noted that the total time spent on the CPU waiting on an OpenMP barrier inside the parallel region was considerably high, about 40.4% of the total CPU time, while the overhead due to MPI communications was about 17.5%.

Figure 11 VTune profiling, amplxe-gui output for minidft, using 10 MPI processes and OMP_NUM_THREADS=2 with small.in input file

It was then decided to offload these parts of the code to the GPUs. Due to time limitations, OpenACC was used instead of CUDA, with parallel loop as well as data directives, in order to eliminate data movements from CPUs to GPUs and from GPUs to CPUs. However, due to the nature of the existing algorithm, it was not possible to offload the large loops to the GPUs, so smaller loops were offloaded. It appeared, though, that two main problems limited the performance. The first problem was that these parts were not computationally intensive enough for the acceleration of the computations to cover the data transfer overhead. The second problem was the number of MPI processes per GPU. More specifically, the number of MPI processes used was higher than the number of GPUs. Consequently, more than one process was using each GPU and an 25

36 extra overhead was added when switching the running process on the GPUs. Decreasing the number of MPI processes was not a solution to this problem, since the performance scaled almost linearly with the number of MPI processes. If more time had been available, using CUDA with GPU-enabled libraries such as cuFFT and/or cuBLAS could possibly have given better results.

Using OpenACC required the PGI compilers. As mentioned before, the Intel compilers produced a faster executable. The OpenACC directives could not overcome this performance difference. Thus, it was eventually decided to use the initial CPU-only version of the code, compiled with the Intel compilers and using the Intel MPI library. For this version, two optimisation flags, -O3 and -fast, were added and the execution time was further decreased. In more detail, -fast increased performance by about 11.8% compared to the initial version that was using only -O3. Compiler flags for the PGI version, such as -Mcache_align, -fast, -Minline and -Mipa=fast,inline, were also investigated. Hyper-threading was tested as well; however, the execution time increased by about 7%.

The two different levels of parallelism, MPI and OpenMP, were then investigated. The idea is that OpenMP could add an extra level of parallelism inside a node, for each MPI process. Although several combinations of threads per process were tested, it appeared that pure MPI was definitely better. As was concluded from the VTune profiler, a significant amount of time was consumed in omp parallel regions due to load imbalance. The results on one node, for the small input case, are presented in Table 12.

Table 12 Execution times in one node (20 cores) for different numbers of threads and MPI processes, keeping MKL threading layer sequential for the small.in input file

The input parameters seemed to play a significant role in performance. These are the number of pools (npool), the number of band groups (nbgrp), the number of MPI processes for diagonalisation (ndiag) and the number of task groups (ntg). The given version of MiniDFT supported only one pool. Each pool is responsible for a group of k-points and it is partitioned into band groups in order to improve the parallel scaling. Task-group parallelism is implemented to improve the parallelisation of the FFTs. A lot of experiments were carried out in order to identify the best combination for the known 26

37 input case, pe-23.local.in, which is presented in Table 13. The same combination was used for the unknown input case as well. More detailed test results can be found in Appendix B.

Input parameter | Final value
npool           | 1
nbgrp           | 1
ndiag           | 64
ntg             | 5

Table 13 Optimal values for input parameters

The last part was to try some code improvements. This included branch elimination, loop fusion and moving if statements outside the loops. These alterations slightly improved the final performance, by about 0.3%. It is worth pointing out that during the competition, the committee carried out an interview with each team, where we discussed the process of porting and optimising minidft before submitting our results. There, they suggested using the NVBLAS library. With just some trivial changes to the Makefile, this library could make use of the GPUs for the BLAS calls; no changes in the code were needed [30]. Unfortunately, the performance gain was very small and there was not enough time to further investigate the usage of this library. 27
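For illustration only, a run combining the Table 13 parameters with the pure-MPI configuration discussed above might be launched roughly as follows; the mini_dft executable name and the QE-style command-line flags are assumptions based on Quantum Espresso conventions, not details taken from the report:

```bash
# Pure MPI run of minidft with the Table 13 input parameters and sequential
# MKL threading (executable name and flag spellings assumed, QE-style).
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
NPROCS=60   # e.g. 3 nodes x 20 cores; adjust to the allocation in use
mpirun -np ${NPROCS} ./mini_dft -in pe-23.local.in \
    -npool 1 -nbgrp 1 -ndiag 64 -ntg 5
```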

38 Chapter 4 MiniDFT and HPCG on KNL

In this chapter, the results of porting and optimising one benchmark (HPCG) and one application (minidft) of the competition are presented. First, a brief description of this process is given.

4.1 Porting and Optimising on KNL

The KNL nodes on ARCHER, the UK National Supercomputing Service, were used for investigating performance. ARCHER has twelve KNL compute nodes, with one Xeon Phi processor on each one. The KNL model is the 7210, with 64 cores at 1.3 GHz, and each core can run up to four hyperthreads. Each node has a DDR memory of 96GB. [31]

After the code of each program was successfully compiled on the ARCHER KNL nodes, the input parameters were investigated first, in order to identify the optimal combination. The next step was to investigate the optimal combination of MPI processes and OpenMP threads, since both programs are hybrid. Based on previous investigations, most hybrid codes seem to perform better on the ARCHER Xeon Phi processors than on Xeon CPUs [32]. The impact of hyperthreading was examined, as well as the performance on several numbers of nodes.

Then, some technical features of the KNL were tested. The first one was the MCDRAM mode. As explained in section 2.2, MCDRAM can be used in three different modes, i.e. flat mode, cache mode and hybrid mode. Moreover, there are three possible cluster options, called all-to-all, quadrant and sub-NUMA. On ARCHER, each KNL has a 16GB on-chip MCDRAM memory. Two of the KNL nodes are configured in flat mode, the rest of them in cache mode, and all of them are set to quadrant clustering mode. Nodes are configured at boot time and, given that rebooting nodes to change the initial configuration of the system is a time-consuming process, we were able to examine only these two options, i.e. flat and cache memory modes with quadrant clustering mode, for up to two and eight nodes respectively.

Vectorisation was the next feature to examine. For this purpose, the Cray Performance Analysis Tool (CrayPat) was used in order to profile the applications, identify the most time consuming parts of the code and focus on the loops that could be vectorised in order 28

Moreover, vectorisation reports were needed for this purpose; the Intel compiler flag -qopt-report=5 was used to produce these reports for the relevant files.

Sometimes the compiler does not manage to vectorise all the loops efficiently. As a result, a loop might not be vectorised at all, or it might be vectorised without reaching the maximum possible performance. Possible reasons are non-unit-stride loops, i.e. loops whose array accesses are not contiguous, unaligned data accesses and data dependencies between loop iterations. First of all, aligning the data to cache-line boundaries is significant for performance because it minimises the number of data, instruction and cache-line loads; this requires only a trivial modification in the code at the definition of the variables. Aligning the data is not always enough: in addition, the compiler often needs to be informed of the alignment with the corresponding directives. Even then, the compiler might still not vectorise a loop because it assumes dependencies between loop iterations that may or may not be real. In other cases, the compiler might not vectorise a loop because of non-unit strides; it would need to gather and/or scatter the data from/to the arrays, and this process has an overhead. In addition, when a loop has only a few iterations, the compiler might choose not to vectorise it at all, because scalar execution might be faster. For these cases, there are several compiler directives that either inform the compiler that there are no dependencies or force vectorisation.

Lastly, several compilation options were tested, including linked libraries and optimisation flags.

4.2 Performance results for MiniDFT

As mentioned in section 3.3.2, two input files were given for the competition. In this section, the investigations were carried out using the small.in input file. Identifying the optimal input parameters was a necessary step; as observed from previous investigations on the SCC cluster, wrong choices might lead to a significant drop in performance. The results for different numbers of nodes are presented in Appendix B.3.

From previous investigations, it was concluded that this application performs better when only MPI processes are used, without OpenMP threads, due to load imbalance. The hybrid version of minidft did not give better results on KNL than its pure MPI version. This was expected, since explicit OpenMP is a recent addition, still under development, and it is suggested not to mix MPI and OpenMP parallelisation unless it is known how to run them in a controlled manner [27]. As shown in Figure 12, the execution time decreases at a significant rate as the number of MPI processes increases and the number of OpenMP threads decreases. Hyperthreading did not improve the performance either. In fact, using the best number of processes, i.e. 64, doubling the OpenMP threads with hyperthreading makes the execution time more than two times slower.

This test decreased performance even more in the case with 32 MPI processes. It is worth mentioning that minidft is probably not the best application for testing hyperthreading, due to the poor OpenMP performance explained earlier. However, increasing the number of MPI processes with hyperthreading also deteriorated the performance.

Figure 12 Investigations on MPI processes and OpenMP threads on one node configured in cache mode

Figure 13 Investigations on hyperthreading on one node configured in cache mode

Thus, for the rest of the experiments, 64 MPI processes were used per node with OMP_NUM_THREADS set to one. Figure 14 presents the improvement in performance as more nodes were used, all configured in cache mode. The scaling is linear for up to two nodes, but then the performance increases at a slower rate. With the CrayPat profiler, it was observed that the time spent in MPI routines such as MPI_Barrier and MPI_Alltoall rises significantly as the number of MPI processes increases. Thus, the hybrid version was tested once more on multiple nodes, in order to reduce the inter-process communication overhead. However, this resulted in higher overall execution times than the corresponding pure MPI runs.

Figure 14 Scalability of minidft across nodes configured in cache mode with 64 MPI processes per node and OMP_NUM_THREADS=1

Nodes configured in flat mode were also used, and the results compared to cache mode are presented in Figure 15. Execution in cache mode was about 1.6 times faster than in flat mode. The numactl program was used to target MCDRAM, with the -p 1 option setting it as the preferred memory. With this change in the run script, the program tries to use MCDRAM and falls back to DDR memory when MCDRAM is exhausted. A possible explanation for this noticeable difference in performance is that the application is streaming but its problem size fits in the cache; thus, using a high-bandwidth cache is more beneficial than using a high-bandwidth main memory.

Figure 15 Cache vs flat mode results using 64 MPI processes on each node and OMP_NUM_THREADS=1

Two different ways of linking the MKL libraries were investigated, both with static linking. The first used the OpenMP threading layer (the default option in the given Makefile) and the second used the sequential version of MKL. With the second option, performance was slightly improved, by about 1.1%.

Several flags were added in order to investigate their impact on performance. The flags -fast, -fma, -ipo, -finline-functions and -align array64byte did not offer any significant improvement.

The last part was to investigate the impact of vectorisation and try to improve it with several techniques: aligning the data, adding new aligned vectors and forcing vectorisation using compiler directives. Vectorisation can be disabled with the compiler flags -no-vec -no-simd -qno-openmp-simd; doing so decreased the performance of the initial code by about 7.3% on one node, indicating that vectorisation was already beneficial for the application.

The initial code of minidft was profiled with CrayPat in order to identify the most time-consuming functions and loops. The non-user functions that consumed a considerable amount of time were MKL (31.5%) and MPI (10.5%) functions. The most time-consuming user-defined functions are fft_scatter (28.7%) from the fft_base.f90 file, exxenergy2 (7.2%) and vexx (6.9%) from the exx.f90 file, and cft_1z (2.1%) from the fft_scalar.f90 file; the line numbers reported for each function correspond to the most time-consuming loops. Using the Intel compiler's vectorisation reports, an attempt was made to improve their performance.

Table 2: Profile by Group, Function, and Line (CrayPat output, condensed)

Samp%    Group / Function           Source
100.0%   Total
 28.7%   fft_base_mp_fft_scatter_   minidft/isc17-scc-minidft/src/fft_base.f90 (hottest loop at line 808)
  7.2%   exx_mp_exxenergy2_         minidft/isc17-scc-minidft/src/exx.f90 (line 1236)
  6.9%   exx_mp_vexx_               minidft/isc17-scc-minidft/src/exx.f90 (line 1006)
  2.1%   fft_scalar_mp_cft_1z_      minidft/isc17-scc-minidft/src/fft_scalar.f90 (line 173)
 42.2%   ETC

Table 14 Profiling results of minidft with CrayPat on one node

After identifying the loops that could be vectorised, the arrays used inside these loops were aligned. This was achieved with the !dir$ attributes align : 64 :: variable_name directive placed after the declaration of each variable. For variables that were passed into a function as in or inout arguments, the file containing their original declaration had to be found so that the directive could be added there; it was then necessary to add the !dir$ assume_aligned variable_name : 64 directive in the file containing the actual loop.

Forcing vectorisation was not beneficial for every loop. In most cases where the vectorisation report indicated that a loop was not vectorised because it "seems inefficient", adding the omp simd directive resulted in worse performance. In the most time-consuming loop, i.e. line 808 in fft_base.f90, the accesses inside the loop were neither aligned nor contiguous, and the compiler report gave the message "non-unit strided load was emulated" for the variable f_aux. That was the only case where simply forcing vectorisation with the omp simd directive improved performance, probably because it is the largest loop and was therefore worth the gather/scatter overhead. The exact code is given in Figure 16. It is worth noting that this directive required removing any omp parallel directives; in our case that was not a problem, since we were not using multiple OpenMP threads per process.

In some cases, introducing new aligned arrays could help vectorisation and the overall performance. For example, in Figure 17 a new vector was introduced, together with a new loop, in order to ensure that the aligned calculations take full advantage of vectorisation. In Figure 18, an integer was replaced by an array in order to make all accesses inside the loop aligned. The omp simd directive forces vectorisation, but it can also be used to inform the compiler that the vectors inside the loop are aligned. However, it was observed from the vectorisation reports that the omp simd directive was not producing aligned accesses for arrays of complex datatype; in these cases the !dir$ vector aligned directive was added or used instead. All the modifications to these files can be found in the submitted code.

Introducing the compiler directives improved the final performance by about 3.4% on one node. Table 15 presents the results across several nodes and the impact of the final vectorisation. It is worth mentioning that as the number of nodes increases, the percentage benefit due to vectorisation decreases. However, this was expected, given that the percentage of time spent in these loops becomes less significant as the communication overhead increases.

!$omp simd
DO j = 1, dfft%npp( me_p )
   f_in( j + ( i - 1 ) * nppx ) = f_aux( mc + ( j - 1 ) * dfft%nnp )
ENDDO

Figure 16 Forcing vectorisation with the omp simd directive

A. Initial Code

do ig = 1, ngm
   vc = vc + fac(ig) * rhoc(nls(ig)) * CONJG(rhoc(nls(ig)))
end do

B. Improved Code

!$omp simd aligned(help:64) aligned(nls:64)
!dir$ vector aligned
do ig = 1, ngm   ! here the dir$ directive is required for nls
   help(ig) = rhoc(nls(ig)) * CONJG(rhoc(nls(ig)))
end do
!$omp simd reduction(+:vc) aligned(fac:64) aligned(help:64)
do ig = 1, ngm
   vc = vc + fac(ig) * help(ig)
end do

Figure 17 Helping vectorisation by introducing a new aligned vector

A. Original Code

DO proc = 1, nprocp
   gproc = dfft%nplist( proc ) + 1
   sendcount(proc) = npp_( gproc ) * ncp_(me)
   recvcount(proc) = npp_(me) * ncp_( gproc )
ENDDO

B. Improved Code

!$omp simd aligned(gproc2:64)
DO proc = 1, nprocp
   gproc2(proc) = dfft%nplist( proc ) + 1
ENDDO
!$omp simd aligned(sendcount:64) aligned(recvcount:64) aligned(gproc2:64)
DO proc = 1, nprocp
   sendcount(proc) = npp_( gproc2(proc) ) * ncp_(me)
   recvcount(proc) = npp_(me) * ncp_( gproc2(proc) )
ENDDO

Figure 18 Helping vectorisation by replacing an integer with a new aligned vector
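The declaration-side part of this process is not shown in Figures 16 to 18. The sketch below illustrates, with hypothetical module, array and routine names, how an array can be aligned where it is declared and how that alignment is then asserted in the routine that loops over it; it follows the directives described above but is not taken from the minidft source.

module aligned_work
  implicit none
  integer, parameter :: max_ng = 4096
  ! Hypothetical work array, fixed size for the sketch; align its data to a
  ! 64-byte boundary at the point of declaration.
  complex(8) :: help(max_ng)
  !dir$ attributes align : 64 :: help
end module aligned_work

subroutine accumulate(ngm, fac, vc)
  use aligned_work, only: help
  implicit none
  integer,    intent(in)    :: ngm   ! assumed to be at most max_ng
  real(8),    intent(in)    :: fac(ngm)
  complex(8), intent(inout) :: vc
  integer :: ig
  ! Assert, in the file that contains the loop, that help is 64-byte aligned,
  ! then request SIMD execution of the reduction.
  !dir$ assume_aligned help : 64
  !$omp simd reduction(+:vc)
  do ig = 1, ngm
     vc = vc + fac(ig) * help(ig)
  end do
end subroutine accumulate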

Table 15 Final results of minidft and the impact of vectorisation (columns: number of nodes, final execution time in CPU seconds, percentage of improvement due to overall vectorisation)

4.3 Performance results for HPCG

For the HPCG benchmark, the version from the Intel Math Kernel Library Benchmarks 2017 Update 3 for Linux was used [34]. The provided code is already optimised for Intel Xeon and Intel Xeon Phi.

First, the problem size was investigated. As mentioned before, the first input parameter is the local grid dimension for each process, so the global problem size depends on this parameter as well as on the number of processes. A valid run must have a problem size large enough to occupy at least a quarter of the main memory. For each number of processes there is an optimal local size, as shown in Figure 19, and these configurations gave about the same performance. Using one process, or more than 16 processes per node, was generally inefficient.

Figure 19 Global problem size investigations on one node (columns: number of MPI processes, OpenMP threads per process, optimal local grid size)
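As a rough illustration of the quarter-of-memory rule above, the sketch below estimates the footprint of a chosen local grid on one 96 GB KNL node. The bytes-per-grid-point value is an assumed ballpark for the sparse matrix, vectors and coarse-grid levels rather than a constant defined by HPCG, and the grid and rank counts are example values, so this only shows the shape of the check.

program hpcg_size_check
  implicit none
  ! Example local grid per MPI rank and ranks per node (illustrative values).
  integer, parameter :: nx = 256, ny = 256, nz = 256, ranks_per_node = 4
  ! Assumed rough storage cost per grid point in bytes; not an HPCG constant.
  real(8), parameter :: bytes_per_point = 500.0d0
  ! Main (DDR) memory per ARCHER KNL node in GB.
  real(8), parameter :: node_mem_gb = 96.0d0
  real(8) :: points_per_node, footprint_gb

  points_per_node = real(nx, 8) * real(ny, 8) * real(nz, 8) * real(ranks_per_node, 8)
  footprint_gb = points_per_node * bytes_per_point / 1024.0d0**3

  print '(a, f8.2, a)', 'Estimated footprint: ', footprint_gb, ' GB'
  if (footprint_gb >= 0.25d0 * node_mem_gb) then
     print *, 'Large enough for a valid run (at least a quarter of main memory).'
  else
     print *, 'Too small for a valid run; increase the local grid size.'
  end if
end program hpcg_size_check

With these example values the estimate comes out above the quarter-of-memory threshold; smaller local grids or fewer ranks per node would need to be checked in the same way.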

In addition, using hyperthreading and increasing the number of OpenMP threads per process had a different impact depending on the number of processes. For example, using two MPI processes with 64 OpenMP threads each, i.e. two threads per physical core, could not significantly increase performance compared to the corresponding non-hyperthreading version. This is because OpenMP parallelisation becomes less effective when a high number of threads causes significant synchronisation overhead. In the cases of four, eight and sixteen processes, the performance with two threads per core was about the same. Figure 20 presents the impact of hyperthreading on different local sizes with four MPI processes. Using four threads per core is generally less efficient than using two threads per core. It is worth noting that, for the smaller problem size, using four threads per core performs worse than using one thread per core, since there is not enough parallelism to exploit inside the local problem and the thread synchronisation overhead becomes a limiting factor.

Figure 20 Impact of hyperthreading on one node with 4 MPI processes

Scalability of HPCG across multiple nodes is presented in Figure 21, using four MPI processes per node. The scaling is not linear, but it is close to linear. An attempt was made to improve scalability across nodes by reducing the number of MPI processes and increasing the problem size, but it did not offer better performance.

Figure 21 Scalability across multiple nodes using 4 MPI processes per node and 32 OpenMP threads per process

The performance of HPCG on nodes configured in flat mode was also tested; the numactl program with the -p 1 option was used as before. The performance was less than half of the corresponding performance on nodes configured in cache mode. It was also observed that increasing the number of threads per core could not offer any performance gain, indicating that for this application a high-bandwidth cache was more important than a high-bandwidth main memory. The comparison of HPCG in flat and cache mode is summarised in Figure 22. It is worth mentioning that increasing the problem size on flat-mode nodes did not improve performance either.

Regarding vectorisation, the code was already optimised: all the arrays were aligned and all the critical loops were vectorised, as shown by the vectorisation reports. Adding omp simd compiler directives did not improve performance. It was also noticed that disabling vectorisation with the -no-vec -no-simd -qno-openmp-simd flags decreased performance by only about 0.9%; it seems that, although the loops were appropriately vectorised, vectorisation was not particularly effective in this case. Adding optimisation flags was not beneficial. Figure 23 presents the results of HPCG.

Figure 22 Cache and flat mode for different numbers of nodes, MPI processes, OpenMP threads and hyperthreading

Figure 23 Performance of HPCG across multiple nodes (columns: number of KNL nodes, performance in Gflops)
