FEniCS Performance Investigation and Porting minidft to GPU Clusters


FEniCS Performance Investigation and Porting minidft to GPU Clusters

Chao Peng

17th August 2017

MSc in High Performance Computing with Data Science
The University of Edinburgh

Year of Presentation: 2017

Abstract

This dissertation project is based on participation in the 2017 International Supercomputing Conference Student Cluster Competition (ISC17 SCC). The author was responsible for FEniCS optimisation before the competition and, as part of the dissertation project, ported minidft to the CPU-GPU hybrid cluster after the competition. The report has three parts. The first part describes the design of the cluster for the competition and discusses the process of hardware and software installation; it also covers the challenges we faced and the competition result. In the second part, the report provides details and step-by-step performance results of how the author optimised FEniCS on the cluster by trying different compilers, BLAS libraries, MPI implementations, compiler optimisation flags and optimisation levels. The last part of the report describes the process of porting the CPU-only program minidft to the cluster with NVIDIA P100 GPUs. The process starts with CPU-oriented optimisation, including replacing OpenBLAS, FFTW3 and ScaLAPACK with the Intel MKL library and some optimisation of the source code. The report then discusses the performance improvement given by GPU-enabled libraries, including cublas and MAGMA. Some command-line optimisation methods, such as process binding, are also included in this part. After utilising MKL, cublas and MAGMA, the overall execution time of minidft was reduced to only 7% of that of the original version.

Contents

1 Introduction
2 Background Review
  2.1 Heterogeneous Architectures
  2.2 The NVIDIA Tesla P100 Accelerator
  2.3 The CUDA Programming Model
  2.4 The cufft and cublas Libraries
  2.5 The MAGMA Library
  2.6 Project Motivation
  2.7 Obstacles and Deviation from the Project Plan
3 The Student Cluster Competition
  3.1 Competition Guidance
  3.2 Benchmarks and Applications
  3.3 Awards
  3.4 Team EPCC's Cluster Configuration
    3.4.1 Hardware Configuration
    3.4.2 Software Configuration
  3.5 Preparing for the Competition
  3.6 Competition Results and Experiences
4 Building FEniCS on the Cluster and Performance Analysis
  4.1 Introduction to FEniCS
  4.2 Initial Work and Obstacles
  4.3 Performance Investigation
  4.4 Summary
5 Porting minidft to NVIDIA GPUs
  5.1 Initial Performance Testing
  5.2 Optimisation Based on Source Code Investigation
  5.3 Performance with GPU-enabled Libraries
    5.3.1 FFT-related Code Optimisation
    5.3.2 BLAS-related Code Optimisation
    5.3.3 MAGMA-related Code Optimisation
  5.4 Process Binding
  5.5 NVIDIA Multi-Process Service
  5.6 Power Consumption and Summary
6 Conclusions and Future Work
  6.1 Future Work
A Automated Building Script for FEniCS

List of Tables

2.1 Top 10 supercomputers in the 2017 June Green 500 list; Available from:
3.1 Team EPCC cluster hardware configuration
4.1 FEniCS timing for different number of MPI tasks
4.2 FEniCS timing for different MPI implementations
4.3 FEniCS timing for different BLAS libraries
4.4 FEniCS timing for different levels of optimisation
5.1 minidft timing for different compilers and libraries
5.2 minidft timing for different levels of optimisation
5.3 minidft timing for different FFT communication types
5.4 FFTW subroutines replaced by cufft subroutines
5.5 minidft timing for process binding

List of Figures

3.1 One node of Team EPCC's cluster
4.1 FEniCS Streamline
4.2 FEniCS timing for different number of MPI tasks
4.3 FEniCS timing for different MPI implementations using the small test case
4.4 FEniCS timing for different MPI implementations using the large test case
4.5 FEniCS timing for different levels of optimisation
4.6 Step-by-step optimisation summary of FEniCS
5.1 minidft timing for different optimisation levels
5.2 Process binding topology
5.3 Step-by-step optimisation summary of minidft
5.4 CPU-only minidft power consumption
5.5 GPU-enabled minidft power consumption

Acknowledgements

First and foremost, I would like to express my great gratitude to my supervisor, Dr. Michele Weiland, for her constant guidance and the time she devoted to me. She carefully and patiently helped me arrange my work and guided me through the whole project. Her rigorous approach to scholarship and her deep knowledge of HPC greatly improved my understanding of academic research.

A special thank you goes to my teammates, Alexandros Nakos, Antriani Mappoura and Jingmei Zhang, for their work and the friendship we have built. I would also like to express my thanks to the coach of the team, Mr. Emmanouil-Ioannis Farsarakis, for his support and encouragement. Boston Limited and its member of staff Mr. Konstantinos Mouzakitis deserve many thanks for their technical support and for providing us with access to a fantastic cluster with state-of-the-art hardware components.

Finally, I would like to thank my parents for their precious love and unconditional support, without which I would not have had the opportunity to take this Master of Science programme at EPCC (Edinburgh Parallel Computing Centre), the University of Edinburgh.

Chapter 1 Introduction

High Performance Computing (HPC) has seen great performance improvements through parallelism and modern heterogeneous architectures. With the development of the HPC industry, processors, accelerators, interconnects, storage devices and other components have been updated over several generations. The HPC industry has also driven the development of parallel programming models and libraries such as the Message-Passing Interface (MPI), Open Multi-Processing (OpenMP) and High Performance Fortran (HPF). Additionally, more complex problems such as atmospheric simulation, which require high-density computation and the input and output of large-scale datasets, place great demands on the arithmetic speed and power efficiency of existing supercomputers. This has given rise to modern heterogeneous computing systems containing both conventional processors and accelerators, together with programming models such as the Open Computing Language (OpenCL) and Open ACCelerators (OpenACC).

With the aim of training the next generation of HPC experts, three major Student Cluster Competitions are held every year: the Asia Student Supercomputer Challenge (ASC) and two others held within the two mainstream HPC conferences, the Supercomputing Conference (SC) and the International Supercomputing Conference (ISC). They touch a wide range of HPC topics, from choosing the components of a cluster to optimising applications according to the characteristics of the cluster in order to achieve better performance. These competitions attract many student teams from well-known universities and scientific research institutions and provide them with a good platform for academic communication.

This report is written by a member of Team EPCC for the 2017 ISC Student Cluster Competition, which required each team to build a cluster on which to optimise and run benchmarks as well as applications (announced by the HPC Advisory Council) within a power budget of 3000 Watts. The report covers the work done by the author for the competition, including the optimisation of FEniCS and minidft.

FEniCS, a popular computing platform for solving partial differential equations (PDEs), was chosen by the HPC Advisory Council as one of the applications for teams to run

on their clusters in the competition. As a mature project, FEniCS is composed of seven parts; each part has different dependencies and is compiled separately. Although the FEniCS community provides Docker and Anaconda versions, these pre-built packages cannot be modified according to a user's specifications and the characteristics of the target machine. This report discusses how FEniCS was built on the cluster for the competition and compares the performance of different configurations.

MiniDFT is a simplified program for modelling materials using plane-wave density functional theory and was chosen by the HPC Advisory Council as the coding challenge of the competition. It utilises MPI, and one of its computational back-ends, FFTW3, uses OpenMP to achieve better performance through parallelism. However, it does not have GPU support. This report describes a GPU-enabled version of minidft and investigates the performance improvement provided by GPUs.

The remainder of this report is organised as follows:

Chapter 2: A background review of some aspects of HPC that are related to this dissertation, including heterogeneous systems, the NVIDIA P100 GPU and some GPU-oriented mathematical libraries, namely cufft, cublas and MAGMA, which are used to accelerate minidft in Chapter 5. This chapter also presents the project motivation, the obstacles we met and the deviations from the initial project proposal.

Chapter 3: An introduction to the ISC Student Cluster Competition and a review of the work done for the competition. The chapter starts with the details of the cluster and gives the reasoning behind the cluster configuration in terms of both hardware and software.

Chapter 4: The process of building FEniCS with different compilers and compilation parameters is introduced in this chapter.

Chapter 5: This chapter presents the process of optimising minidft. The process includes basic optimisation using different compilers and CPU-optimised mathematical libraries, followed by the work done to port minidft to NVIDIA GPUs using GPU-enabled libraries and GPU-specific command-line optimisation.

Chapter 6: The conclusions of this dissertation project and future work are discussed in this chapter.

Chapter 2 Background Review

The work done for the competition is discussed in the subsequent chapters. Because the competition touches a wide range of HPC fields, this chapter gives a brief introduction to the aspects that are relevant to this project.

2.1 Heterogeneous Architectures

High Performance Computing has experienced great performance improvements benefiting from both hardware and software parallelism. In the past, the performance of Central Processing Units (CPUs) doubled roughly every 18 months through improvements in chip fabrication technology and increases in clock frequency[1]. Binary digits (0s and 1s) are represented by different voltage levels, and as clock frequencies increased, supply voltages had to decrease to keep power consumption reasonable. However, the voltage cannot be reduced much further, because the binary digits would no longer be distinguished easily.

Over the last few years, computer scientists and engineers have made noticeable progress on Graphics Processing Units (GPUs) for the highly lucrative gaming market[2]. GPUs show excellent characteristics for HPC, such as power efficiency and enormous floating-point computing power, which have made them firmly established in the HPC industry. In addition, Intel released a different type of accelerator, the Intel Xeon Phi series, to compete with GPUs for scientific computing. Most supercomputers used CPU-only architectures before 2009, but the need for power efficiency has driven rapid adoption of GPUs in recent years[3]. Supercomputers containing both traditional CPUs and accelerators have now become popular, and scientific simulations in climate science, physics, astronomy and other domains benefit greatly from the development of heterogeneous HPC systems.

The Green 500 list[4] ranks the top 500 supercomputers in the world twice a year, in June and November. Unlike the Top 500 list, whose ordering criterion is only the rate of floating point operations per second (FLOPS), the Green 500 list puts

a premium on power efficiency and uses "FLOPS-per-Watt" as its power-performance metric. According to the June 2017 Green 500 list (shown in Table 2.1), nine of the top 10 most power-efficient supercomputers are heterogeneous systems, and they all use the NVIDIA Tesla P100 as their accelerator.

Rank  Name           CPU                   GPU
1     TSUBAME3.0     Xeon E5-2680v4 14C    NVIDIA Tesla P100
2     kukai          Xeon E5-2650Lv4 14C   NVIDIA Tesla P100
3     AIST AI Cloud  Xeon E5-2630Lv4 10C   NVIDIA Tesla P100
4     RAIDEN         Xeon E5-2698v4 20C    NVIDIA Tesla P100
5     Wilkes-2       Xeon E5-2650v4 12C    NVIDIA Tesla P100
6     Piz Daint      Xeon E5-2690v3 12C    NVIDIA Tesla P100
7     Gyoukou        Xeon D                N/A
8     RCF2           Xeon E5-2650v4 12C    NVIDIA Tesla P100
9     N/A            Xeon E5-2698v4 20C    NVIDIA Tesla P100
10    Saturn V       Xeon E5-2698v4 20C    NVIDIA Tesla P100

Table 2.1: Top 10 supercomputers in the 2017 June Green 500 list; Available from:

2.2 The NVIDIA Tesla P100 Accelerator

NVIDIA Tesla GPUs are widely used in supercomputers, enabling leading-edge Artificial Intelligence and Machine Learning systems and speeding up numerous HPC applications, as well as scientific research in many domains with highly complex simulations. Key features of the Tesla P100 include:

Exceptional performance. According to the white paper[5] released by NVIDIA, the Tesla P100 delivers 5.3 TFLOPS, 10.6 TFLOPS and 21.2 TFLOPS of double-precision, single-precision and half-precision floating point performance respectively. For Deep Learning algorithms that do not require high levels of floating-point precision, the extreme half-precision performance provided by the P100 and the reduced storage requirements of half-precision datatypes can give noticeable speedups.

The brand-new NVLink interconnect. As more and more hybrid systems deploy multi-GPU architectures to solve bigger and more complex problems, the bandwidth between GPUs has become an issue. NVIDIA therefore introduced a new high-speed interface, NVLink, enabling up to 160 GB/s of data transfer from one GPU to another, five times faster than traditional PCIe Gen 3.

High-speed memory. The P100 is the first GPU to introduce the new memory technology High Bandwidth Memory 2 (HBM2), which provides better performance through higher bandwidth (up to 256GB/s) and lower power consumption than conventional GPU memory. This feature enables the P100 to tackle much larger problems with much larger datasets.

Simplified programming. The P100 provides two major GPU programming features, Unified Memory and Compute Preemption. Based on the architecture of the P100, Unified Memory provides a single, unified virtual address space for accessing the memory of both CPUs and GPUs. Programmers therefore do not need to consider how to manage data transfers between separate memory systems and can concentrate on designing hybrid parallel programs. Long-running applications can occupy the system while waiting for a task to complete and could be killed by the operating system or the CUDA driver, forcing programmers to divide large workloads into smaller ones. With Compute Preemption, programmers are able to let their programs wait for certain conditions to occur while being scheduled alongside other tasks.

2.3 The CUDA Programming Model

Introduced by NVIDIA in 2007, the CUDA (Compute Unified Device Architecture) programming model is designed for joint CPU-GPU execution of an application[3]. There are also other, more recent models, such as OpenCL, OpenACC and C++ AMP, that support hybrid system programming. The release of CUDA opened up the possibility for developers to concentrate on algorithm design rather than thinking in terms of graphics primitives[6]. CUDA provides extensions, with new keywords and application programming interfaces, to both the C/C++ and Fortran programming languages, so developers do not need to learn a new programming language, and CUDA makes it easier for them to modify existing code to enable GPU support.

A CUDA program consists of a host and one or more devices. A CUDA host is generally a traditional CPU, such as an Intel microprocessor, while a device is an NVIDIA GPU. This programming model can be described as a master-worker pattern in which the CPU acts as the master, initialising the program and executing the serial parts, while the GPU serves as the worker responsible for executing the parallel regions. Device functions marked with CUDA keywords for parallel execution are called kernels. The execution of a CUDA program starts with host execution, which launches the various kernel functions. When a kernel function is called, it is executed by a large number of threads, and each thread runs on a single CUDA core. Threads are grouped into blocks, and the blocks collectively form a grid of up to three dimensions; threads within the same block can communicate through shared memory and synchronise with each other.
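The following short CUDA Fortran example is an illustrative sketch (not taken from the dissertation's code) of the host/device split and the thread hierarchy described above: the host copies data to the device, launches a kernel over a 1D grid of blocks, and copies the result back. The kernel and variable names are hypothetical, and the code assumes a CUDA Fortran compiler such as the PGI compiler used later for minidft.

    module saxpy_mod
      use cudafor
      implicit none
    contains
      ! Kernel: each thread updates one element of y.
      attributes(global) subroutine saxpy(n, a, x, y)
        integer, value :: n
        real, value    :: a
        real, device   :: x(n), y(n)
        integer :: i
        i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
        if (i <= n) y(i) = a * x(i) + y(i)
      end subroutine saxpy
    end module saxpy_mod

    program test_saxpy
      use saxpy_mod
      use cudafor
      implicit none
      integer, parameter :: n = 1024
      real :: x(n), y(n)
      real, device :: x_d(n), y_d(n)   ! device copies of the arrays
      x = 1.0; y = 2.0
      x_d = x; y_d = y                 ! host-to-device copies
      ! Launch the kernel with a 1D grid of 1D blocks (256 threads per block).
      call saxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x_d, y_d)
      y = y_d                          ! device-to-host copy
      print *, 'max error: ', maxval(abs(y - 4.0))
    end program test_saxpy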

In heterogeneous systems, there is rarely only one CPU or CUDA host. Most HPC clusters and supercomputers have more than one node, and each node has one or more hosts and one or more devices. The dominant programming model for computing clusters, MPI, can be used together with CUDA to program these systems. A simple and versatile approach is to associate each MPI rank (process) with a single GPU. When there is more than one GPU per node, multi-GPU programming can be realised by placing multiple MPI ranks on each node[7].

2.4 The cufft and cublas Libraries

Traditional C/C++ and Fortran programming clearly benefits from a great number of mature, well-documented and easy-to-use libraries; FFTW[8] and OpenBLAS[9] are two examples. FFTW is a free and open-source C subroutine library for computing the discrete Fourier transform (DFT), while OpenBLAS is an optimised BLAS (Basic Linear Algebra Subprograms) library which provides basic vector and matrix operations. Both libraries show outstanding performance on traditional CPU-only platforms. NVIDIA provides the corresponding cufft (CUDA Fast Fourier Transform)[10] and cublas (CUDA Basic Linear Algebra Subroutines)[11] libraries, which allow CUDA programmers to speed up their applications by offloading compute-intensive operations to a single GPU, using the hundreds of CUDA cores inside NVIDIA GPUs, or by distributing work efficiently across multi-GPU systems. However, these libraries are implemented using C-based CUDA, which means that modifying CPU-only programs written in Fortran to support NVIDIA GPU execution requires additional Fortran wrapper interfaces.

2.5 The MAGMA Library

Standing for Matrix Algebra on GPU and Multicore Architectures, MAGMA[12] is a library for dense linear algebra on heterogeneous systems. It is similar to LAPACK and ScaLAPACK[13], which are designed for CPU-only architectures, and is developed by the team that developed LAPACK and ScaLAPACK[14]. Developers can therefore port their existing applications to hybrid systems efficiently and smoothly. MAGMA uses a flexible methodology to schedule tasks: small tasks, which lie on the critical path and cannot be parallelised, are scheduled on the CPU, while larger tasks are scheduled on the GPU by the library. In addition, many MAGMA subroutines have multiple versions, enabling developers to arrange the workload. For example, for the LAPACK subroutine DGETRF, MAGMA provides versions including magma_dgetrf, magma_dgetrf_m, magma_dgetrf_gpu and magma_dgetrf_mgpu, so that developers can choose whether the function is executed in CPU/GPU

or CPU/multi-GPU mode and whether the matrix is located in CPU host or GPU device memory.

2.6 Project Motivation

As the author is a member of Team EPCC for the 2017 ISC Student Cluster Competition, the project is composed of three parts: cluster design and configuration, the work done by the author for the ISC Student Cluster Competition (optimising FEniCS on the cluster), and the optimisation of the CPU-only program minidft using GPU-enabled libraries. The project aims to investigate and make use of all the previously mentioned HPC features and innovations. It starts with maintaining a cluster with a liquid cooling system for the competition, including installing software and compiling and linking libraries. This process is essential for all subsequent optimisation and performance measurement work and can be difficult, but beneficial, for Linux beginners. The optimisation tasks include improving the performance of FEniCS by trying different popular compilers and libraries, and rewriting parts of the minidft code to support GPUs through GPU-enabled libraries. MiniDFT uses traditional mathematical libraries such as FFTW3, OpenBLAS and ScaLAPACK, which motivated the author to improve its performance by introducing the corresponding libraries designed for hybrid systems with GPUs.

2.7 Obstacles and Deviation from the Project Plan

During the preparation period for the competition, we faced many difficulties, from compiling the code to executing programs across the three nodes. Some of the problems were easy to solve, while others cost us hours or even days to resolve, since our team members had limited knowledge and experience of Linux system administration, cluster networking, libraries and application optimisation. To give the reader a better understanding of the work done for the project, the obstacles the author met are presented and discussed in detail throughout the report. This section provides an overview of the deviations from the initial project proposal.

The project was planned to focus on the optimisation of FEniCS. The project preparation report[15] proposed that, before the competition days, we should vectorise the FEniCS code according to the hardware architecture of the cluster, i.e. the Intel Xeon 2630v4 CPUs, and that additional work should go into enabling PETSc GPU support to offload some work to GPUs. After the competition, the proposed future work was to optimise FEniCS on the KNL architecture.

The vectorisation report generated by Intel Vector Advisor showed that there were not many loops left to vectorise. In addition, the version of FEniCS built with GPU-enabled PETSc did not provide any performance improvement. The author therefore decided to change the post-competition work to porting minidft to the cluster.

The most significant obstacle was system accessibility. Due to network and maintenance issues, we were unable to access the cluster remotely from time to time; the longest outage lasted for a month. During these periods, team members could not conduct any power or performance measurements and thus could not do any further work that depended on them. In addition, before a separate interface for monitoring the power consumption of the whole system was set up, we could only check the power consumption with Linux commands, which was not convenient compared to a dedicated monitoring interface.

Chapter 3 The Student Cluster Competition

The ISC Student Cluster Competition takes place every year together with the International Supercomputing Conference. Being held alongside the conference gives participating teams more opportunities to explore the latest developments and achievements in HPC technology, to present their ideas, and to get valuable feedback from attendees of the conference, including scholars from prestigious universities and scientists from well-known HPC vendors.

3.1 Competition Guidance

Student teams from around the world, each composed of four to six students and a coach, had to submit an initial proposal containing their biographies, their reasons for participating, their approach to effective teamwork, etc. to the competition board before 11 November 2016. During the 2016 Supercomputing Conference, held in Salt Lake City, 12 teams were announced as finalists of the 2017 ISC Student Cluster Competition. The selected teams had to submit more detailed team and individual profiles as well as their final cluster configuration before April 2017 to the HPC Advisory Council, which acted as the competition board.

In preparation for the competition, student teams needed to optimise both benchmarks and applications for their clusters. Benchmarks and applications are referred to collectively as tasks in this report when it is not necessary to distinguish between them. There was also a coding challenge, which allowed teams to modify source code or switch to other libraries in order to bring the full potential of their clusters into play. During the competition days, from June 19 to June 21 in Frankfurt, teams had to run on their clusters all the benchmarks and applications, including a secret application announced on the second day of the competition. As each team had a power budget of 3000 Watts, the power consumption of every team was monitored in real time; whenever a team exceeded the limit, they were warned by the competition board, and if it happened during a benchmark

running period, their final score was deducted accordingly. After each result was submitted, the competition board calculated the final score, taking into consideration both cluster performance and an interview by representatives of the competition board.

3.2 Benchmarks and Applications

HPC Challenge (HPCC) and the High Performance Conjugate Gradient (HPCG) benchmark were used for benchmarking the clusters. HPCC comprises seven different tests, measuring aspects ranging from floating-point execution performance to sustainable memory bandwidth, while HPCG is another comprehensive benchmark that is also used to rank the Top 500 list. Moreover, teams could carry out an independent run of High Performance LINPACK (HPL), one of the HPCC tests, which measures the floating-point rate of execution for solving a linear system of equations. At the ISC17 Student Cluster Competition, the known applications (announced three months before the competition) were:

FEniCS[16]: A computational interface for solving partial differential equations with both C++ and Python support. The competition board provided a small test case for teams to validate their installation of FEniCS before the official run, and a large case for final scoring on the competition day.

Coding Challenge, MiniDFT[17]: An application that takes a set of atomic coordinates and pseudopotentials as input and models materials. The competition board provided the original minidft code together with a brief description of the required dependencies and a running guide. This task allowed teams to modify the source code and change the libraries minidft uses in order to accelerate it.

TensorFlow[18]: A library for numerical computation using data flow graphs, which allows users to deploy it on hybrid systems. This challenge was also known as the CAPTCHA challenge because teams needed to run Keras, a Python deep learning library, on top of TensorFlow to identify images.

On the second day of the competition, the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS), a classical molecular dynamics code, was announced as the secret application. Each member of Team EPCC was responsible for one or two tasks (benchmarks and/or applications), and the whole team worked together to exchange ideas and help each other solve problems.

3.3 Awards

Six awards were given at the 2017 ISC competition:

Overall Winner: For the three teams with the highest scores (10% HPCC performance, 10% HPCG, 10% FEniCS, 10% LAMMPS, 25% MiniDFT, 25% TensorFlow and 10% interview by the competition board).

Highest LINPACK: For the team with the best HPL performance (either as part of the HPCC run or as an independent run).

Fan Favourite: For the team which received the most votes from ISC participants.

AI Award: For the team which solved the CAPTCHA challenge on TensorFlow with the highest accuracy.

3.4 Team EPCC's Cluster Configuration

3.4.1 Hardware Configuration

The hardware and software configuration of the cluster was designed jointly by our team and the industrial sponsor of the team, Boston Limited, who also provided the hardware. The company gave us several hardware choices, including Intel Xeon CPUs, Intel Xeon Phi accelerators and NVIDIA P100 GPUs. As the team was aiming for the Highest LINPACK award, and based on preliminary power consumption measurements conducted by the sponsor, we decided on a GPU-centric design. Team EPCC's cluster is composed of three nodes, each with three NVIDIA P100 GPUs and two Intel Xeon E5-2630v4 CPUs. NVLink is used for intra-node GPU communication and Mellanox InfiniBand is used as the interconnect. Initially, the cluster used only air cooling; adding liquid cooling allowed the removal of power-hungry fans, saving power overall. Figure 3.1 shows a computational node of the cluster. Hardware details are summarised in Table 3.1.

Component        Detail
CPU              2 x Intel Xeon E5-2630v4 2.20GHz (10 cores each) per node; 60 cores (or MPI ranks) on 3 nodes in total
GPU              3 x NVIDIA Tesla P100 (3584 cores and 16GB HBM2 each) per node; 9 GPUs on 3 nodes in total
RAM              64GB 2400MHz
Storage          2 x 900GB SSD
Interconnect     Mellanox EDR InfiniBand networking
Cooling System   Liquid cooling

Table 3.1: Team EPCC cluster hardware configuration

Figure 3.1: One node of Team EPCC's cluster

3.4.2 Software Configuration

Besides the operating system, the NVIDIA GPU libraries and the GNU and Intel compilers, the team installed other compilers and libraries according to our needs during the preparation period. In order to maintain different versions of applications such as FEniCS and minidft and to compare their performance, the same library was sometimes installed multiple times in different locations, and different MPI implementations, such as OpenMPI and MPICH, were installed to help improve application performance. The operating system used was CentOS 7.3, which is well known for its stability and reliability. The following software and libraries were installed:

NVIDIA drivers and CUDA Toolkit version 8.0

MPI implementations:
Open MPI
MPICH
Intel MPI 2017

Compilers:
PGI 17.4
Intel Compilers 2016
GNU

Libraries:

Intel Math Kernel Library (MKL) (required by FEniCS and minidft)
OpenBLAS (required by FEniCS and minidft)
FFTW 3.3.6 (required by minidft)
ScaLAPACK (required by FEniCS and minidft)
CMake (required by FEniCS)
Boost (required by FEniCS)
HDF5 (required by FEniCS)
SWIG (required by FEniCS)
PETSc (required by FEniCS)
SLEPc (required by FEniCS)
Eigen (required by FEniCS)
Anaconda Python 3 (required by FEniCS)

3.5 Preparing for the Competition

Because the hardware was located at the headquarters of Boston Limited during the preparation period, the only times at which we had direct access to the cluster were the competition days and a two-day training session at the headquarters in London on the 30th and 31st of May. During the training session, Konstantinos Mouzakitis, a Senior HPC Systems Engineer at Boston Limited, led a visit of the company and taught us skills ranging from system configuration on the command line to how to screw and unscrew components of the cluster. For the remaining time, we only had remote access to the cluster.

The Intelligent Platform Management Interface (IPMI), which provides a management and monitoring interface for the CPUs, the BIOS and the operating system, was installed on the cluster. It allowed us to reboot the cluster when encountering hardware or software crashes and, most importantly, it helped us monitor and manipulate the BIOS configuration, temperatures, fan speeds, etc. during power consumption measurements and application performance tests. A Windows system with Remote Desktop Protocol (RDP) access, providing a graphical user interface to monitor the overall power consumption of the cluster, was also set up.

Each member of Team EPCC was responsible for one or two benchmarks or applications. The author of this report was mainly responsible for FEniCS during the competition preparation period and ported minidft to the NVIDIA Tesla P100 GPU architecture after the competition. As mentioned in the Introduction, details of the optimisation process are presented in Chapters 4 and 5.

Before the competition days, the whole team and the coach held meetings every one or two weeks to exchange ideas on optimisation, report progress to the coach and arrange work for the following week. Each team member also met periodically with their dissertation supervisor to discuss their own work.

3.6 Competition Results and Experiences

The team originally set the Highest LINPACK award as its primary goal, which strongly influenced the choice of hardware configuration. After performing HPL tests with 4 nodes (2 GPUs per node) and 3 nodes (3 GPUs per node), the team decided that the latter option was more promising in terms of both performance and power consumption. The final HPL performance for the competition was 33.99 TFLOPS at a power consumption of exactly 3000 Watts, which was the fourth highest HPL performance achieved during the competition and approximately three times the record set at the 2016 ISC Student Cluster Competition. Using the power efficiency formula (GFLOPS per Watt) used to rank the Green 500 list, the power efficiency of Team EPCC's cluster was 11.33, which would have placed 4th in the June 2017 Green 500 list.

Being one of the members of Team EPCC taking part in the 2017 ISC Student Cluster Competition was the experience of a lifetime. Because the competition touched a wide range of aspects of HPC, we had to put into practice what we had learned in our MSc in High Performance Computing with Data Science programme. It also required us to find useful information and read the literature by ourselves, which helped us develop our skills in searching for and processing information. For team members who did not have much Linux experience, the competition forced us to become familiar with command-line work as quickly as possible. We had to learn how to install and configure different libraries and software and how to cope with errors. Teamwork was greatly important for us: each team member has his or her own strong and weak points, and learning from others' strengths to offset one's own weaknesses played an important role in our collaboration. Last but not least, discussing with other teams, conference participants and the competition board was also inspiring, as we had the chance to exchange ideas and learn from each other.
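The quoted power efficiency follows directly from the Green 500 metric applied to these two figures:

\[ \frac{33.99\ \text{TFLOPS}}{3000\ \text{W}} = \frac{33990\ \text{GFLOPS}}{3000\ \text{W}} \approx 11.33\ \text{GFLOPS/W}. \]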

Chapter 4 Building FEniCS on the Cluster and Performance Analysis

As a mature project for solving partial differential equations, FEniCS is widely used in scientific simulations and was one of the benchmarked applications at the 2017 ISC Student Cluster Competition. In this chapter, we look at how it was installed on the cluster and compare the performance of different configurations and compilers.

4.1 Introduction to FEniCS

As shown in Figure 4.1, FEniCS essentially works as a pipeline starting from and ending at DOLFIN, which is the main user interface of FEniCS. End users are able to describe their problem in handwriting-style formulas thanks to the Unified Form Language (UFL). The FEniCS Form Compiler (FFC) can then translate these formulas into program source code, which is further compiled by the compiler. Variational forms expressed in UFL are passed to FFC to generate low-level code, which can then be used by DOLFIN to assemble linear systems. This code generation depends on the FInite element Automatic Tabulator (FIAT) and Instant; Instant is used for inlining C/C++ code in Python. Finally, mshr can be used to generate meshes.

As described in Section 3.4.2, FEniCS has a number of required and optional dependencies. For instance, the Python package Six, which makes Python programs compatible with both Python 2 and Python 3 without modification, and NumPy, which provides fundamental scientific computing functionality, are required by all the Python-related components. SWIG (Simplified Wrapper and Interface Generator) is required by both DOLFIN and Instant to connect programs written in C++ and Python. In terms of optional dependencies, MPI and HDF5, for example, can be installed together with DOLFIN to enable parallelism and improve performance. These dependencies should therefore be installed properly according to the end user's requirements and specifications to make sure that FEniCS works.

Figure 4.1: FEniCS Streamline

4.2 Initial Work and Obstacles

FEniCS provides several simple installation methods, including Docker containers, pre-built Anaconda Python packages and Ubuntu packages. For end users these options are easy to use and straightforward, but in order to modify parameters and try different compilers we had to build FEniCS from its source code; the pre-built packages could still be used to help validate the results of the optimised builds.

The author of this report had limited experience of Linux before the preparation period of the competition and had never used Linux terminal commands or batch systems before. The author was used to the Eclipse Integrated Development Environment for Java and was not familiar with command-line operations or setting Linux environment variables. We therefore encountered problems such as selecting the right compiler and linking the correct library.

Library linking was the greatest obstacle when compiling FEniCS. Boost is a well-known library family which provides a variety of high-quality C/C++ libraries, from frequently-used algorithms to regular expression processing. As discussed in Section 4.1, DOLFIN is the computational back-end of FEniCS and Boost is one of its compulsory dependencies.

The author first installed multiple versions of Boost with the default configuration into the default directory. As the author was not familiar with Linux environment variables and did not know how to set the BOOST_ROOT variable to indicate which Boost installation should be linked, DOLFIN was built against one version of Boost but looked for another version at run time. As a result, FEniCS could not be launched. A similar problem was encountered when the author compiled multiple PETSc libraries with different configurations and tried to test DOLFIN with each of them. The problem was solved by installing compilers and libraries into user-specified folders and setting environment variables such as LD_LIBRARY_PATH, BOOST_ROOT and PETSC_DIR to indicate which libraries to link against.

In order to compile FEniCS with different compilers, libraries and configurations conveniently, an automated build script was written so that only a few statements need to be changed. For example, the variable PREFIX only needs to be set once at the beginning; the directory it points to is then used to install libraries and tells CMake where to find them. CMake is a widely-used tool that finds and links against different libraries and then generates Makefiles accordingly. The author was not familiar with the "cmake - make - make install" workflow and thus ran into problems with CMake. One problem was that the CMake originally installed on the cluster was an older version and could not link against newer versions of Boost. We thought that the problem was caused by how Boost had been compiled and installed, and spent a lot of time reinstalling Boost. After the problem was solved, we had gained a lot of experience with CMake, and the author decided to write a CMakeLists.txt file for building minidft in order to practise using it.

OpenBLAS is a well-known optimised BLAS library and is required by PETSc, the computational back-end of DOLFIN. When compiling OpenBLAS with the Intel Compiler (icc for C, icpc for C++ and ifort for Fortran) during the performance investigation, a fatal error "unknown register name %1 in asm statement" was raised. The reason was that icc did not recognise the gcc-style register naming, so we used gcc for C, g++ for C++ and ifort for Fortran instead. Compiling OpenMPI with the Intel Compiler was straightforward, i.e. specifying the C, C++ and Fortran compilers as the Intel compilers was sufficient. However, when executing MPI tasks with this implementation on more than one node, the nodes other than the host node halted with an error that libimf.so, libsvml.so, libirng.so and libintlc.so.5 could not be found. This issue was solved by creating symbolic links to the corresponding Intel libraries in the FEniCS library path.

4.3 Performance Investigation

Unlike a benchmarking program, FEniCS is a scientific application and most programs that use it are quite time-consuming. Before the competition days, the competition board provided neither test cases nor benchmarking programs; they only

provided a list of the required and optional dependencies of DOLFIN. In this case, the example programs installed together with FEniCS were used to validate each installation. On the first day of the competition, a small test case, provided by the competition board to help teams check whether their FEniCS installation worked properly, was used to choose a suitable compilation configuration, as it has a relatively low completion time. The following steps were used to investigate the performance of FEniCS.

The first step was to investigate how FEniCS performs with different numbers of MPI processes, which shows how well it scales across the cluster nodes. The GNU compilers, OpenMPI 2.2.1, OpenBLAS compiled with the GNU compilers and the small test case were used for this step. Table 4.1 and Figure 4.2 present FEniCS performance for different numbers of MPI tasks.

Table 4.1: FEniCS timing for different number of MPI tasks (columns: nodes used, MPI tasks, time in seconds)

We can see that the execution time of the small test case decreased as the number of MPI tasks increased. However, with 64 MPI tasks the performance was not as good as with 32 tasks. The reason may be that the test case was too small, so that more time was spent on communication between MPI tasks than on computation.

The second step of the FEniCS performance investigation was to test which MPI library and implementation works best with FEniCS. The chosen implementations were: MPICH 3.2, compiled with the GNU compiler; OpenMPI 2.2.1, compiled with the GNU compiler; OpenMPI 2.2.1, compiled with the Intel compiler; and Intel MPI, using the Intel compiler. The BLAS library used in the second step was OpenBLAS compiled with the same compiler as the MPI implementation, and the optimisation flag for each test case was -O2. The small test case was run with 30 MPI ranks on the 3 nodes (10 ranks per node) while the large test case was run with 60 MPI ranks on the 3 nodes (20 ranks per node). The results of the MPI library testing are shown in Table 4.2 and Figures 4.3 and 4.4.

Figure 4.2: FEniCS timing for different number of MPI tasks

Figure 4.3: FEniCS timing for different MPI implementations using the small test case

Table 4.2: FEniCS timing for different MPI implementations (rows: small and large test case; columns: MPICH-GNU, OpenMPI-GNU, OpenMPI-Intel, Intel MPI; times in seconds)

Figure 4.4: FEniCS timing for different MPI implementations using the large test case

It can be clearly seen in Table 4.2 that, for the small test case, OpenMPI compiled with the GNU compiler had the best performance, more than two times faster than MPICH compiled with the GNU compiler. OpenMPI compiled with the Intel compiler was slightly slower, and Intel MPI was approximately 2 seconds slower than OpenMPI compiled with the GNU compiler. For the large test case, Intel MPI, which was the second worst for the small test case, gave the best performance. The reason might be that Intel MPI provides better scalability for long-running programs. Moreover, MPICH compiled with the GNU compiler again had the worst performance for the large test case.

The third step was to test how different BLAS libraries affected the performance of FEniCS. The small test case executed too quickly for differences to be visible, so the large test case, which was used for competition scoring, was used in this step. Based on the previous step, Intel MPI with the Intel compiler was used. The selected BLAS libraries were as follows: Intel Compiler 2016 with OpenBLAS compiled by the Intel compiler, and

Intel Compiler 2016 with the Intel Math Kernel Library (MKL). The test case was executed using all the available CPU cores (60 MPI tasks). The results of the library testing are shown in Table 4.3.

BLAS library   Time (sec)
OpenBLAS
Intel MKL      1112

Table 4.3: FEniCS timing for different BLAS libraries

It can be seen in Table 4.3 that the Intel compiler provided better performance than the GNU compiler and, when the OpenBLAS library was replaced by MKL, the performance increased further.

The fourth step was to compile FEniCS, using the best compiler from the previous steps, with different optimisation levels from -O0 to -O3 and -fast, which according to the Intel C++ Compiler manual is equivalent to -xHost -O3 -ipo -no-prec-div -static -fp-model fast=2, and to see how the performance changed. The version compiled with the Intel compiler and linked with the Intel MKL library was chosen for this test. The bigger test case was again used and the program was executed with 60 MPI tasks. Table 4.4 and Figure 4.5 show the test results.

Table 4.4: FEniCS timing for different levels of optimisation (-O0, -O1, -O2, -O3 and -fast; times in seconds)

We can see that the performance of FEniCS improved with the optimisation level -fast and, unsurprisingly, -O0 provided the worst performance.

The final step used Intel processor-specific optimisation. Since the CPUs of the cluster were all from the Intel Xeon Processor E5 v4 family, which corresponds to the CORE-AVX2 processor-specific option of the Intel compiler, we set the following flags: -fast -march=core-avx2 -xcore-avx2. The flag -axcore-avx2 can replace -xcore-avx2: -xcore-avx2 generates specialised code only for the specified processor family, which is incompatible with older processors, while -axcore-avx2 generates multiple code paths at the cost of larger executable files. The performance of FEniCS was further improved, benefiting from the machine-specific compilation.

Figure 4.5: FEniCS timing for different levels of optimisation

4.4 Summary

FEniCS is a large program consisting of various components, and each component has its own required and optional dependencies. An automated build script helped us save time on downloading, configuring and linking the selected libraries when trying different optimisation methods. Although we encountered a variety of obstacles in compiling and linking the software, we learned a lot through this process. FEniCS showed acceptable scalability on the multi-node cluster with MPI and, according to the MPI implementation comparison, Intel MPI showed the best performance. Additionally, as a well-known optimised mathematical library with widely-used routines including BLAS, FFT and LAPACK, MKL also helped improve the performance of FEniCS. Finally, the optimisation flag -fast provided the best performance among the flags -O0, -O1, -O2, -O3 and -fast. Figure 4.6 provides a step-by-step optimisation summary of this chapter.

Figure 4.6: Step-by-step optimisation summary of FEniCS

Chapter 5 Porting minidft to NVIDIA GPUs

As a minimalist version of the general-purpose Quantum ESPRESSO (open-source Package for Research in Electronic Structure, Simulation, and Optimization) code, minidft[17] is an application for modelling materials, written in Fortran, that solves the Kohn-Sham equations using plane-wave density functional theory (DFT). minidft uses parallel programming technologies such as MPI and Open Multi-Processing (OpenMP) and is designed to run on CPU-only systems.

5.1 Initial Performance Testing

The README file of minidft states that the following libraries are used by minidft:

ScaLAPACK (Scalable Linear Algebra PACKage): a library that provides high-performance linear algebra routines for distributed-memory architectures.

OpenBLAS: an optimised BLAS (Basic Linear Algebra Subprograms) library.

FFTW3: a C subroutine library that computes the discrete Fourier transform (DFT) in one or more dimensions. It supports arbitrary input sizes and both real and complex data.

These libraries are well optimised for traditional CPU-only machines and support MPI as well as OpenMP to exploit parallelism on CPUs. However, they do not have GPU support, and the other parts of the minidft source code do not use GPU functionality either, which means that minidft is a CPU-only program. Looking for corresponding GPU-enabled computational libraries to replace these traditional libraries is therefore one method of optimisation. In addition, the FEniCS performance results in Section 4.3 show that, when running only on CPUs, MKL can provide better performance than OpenBLAS, and MKL also provides FFT and ScaLAPACK subroutines. Therefore, we first looked at how minidft performs with the Intel compiler and the MKL library.

A primary test of the original minidft program, using pe-23.local.in as the input file, was conducted with the GNU and Intel compilers; the results are given in Table 5.1. The version compiled with the GNU compiler was linked against the original OpenBLAS, FFTW3 and ScaLAPACK, while the Intel version was linked against MKL. The compiler flag used for this step was -O3 for both the GNU and Intel compilers.

Table 5.1: minidft timing for different compilers and libraries (CPU and wall times in seconds for the GNU and Intel builds)

The CPU column gives the time consumed by computation and the WALL column the time consumed by both computation and communication. The results of this step showed that the Intel compiler and the MKL library save approximately half of the execution time of the GNU version of minidft.

We then looked at how optimisation flags affected the performance of minidft compiled with the Intel compiler and the MKL library. The results of this step are shown in Table 5.2 and Figure 5.1. We can see that -O3 provided the best performance, and this optimisation flag was used in the subsequent optimisation steps.

Table 5.2: minidft timing for different levels of optimisation (-O0, -O1, -O2, -O3, -fast and -O3 with CPU-specific flags; CPU and wall times in seconds)

5.2 Optimisation Based on Source Code Investigation

The source code file fft_base.f90 contains a subroutine named fft_scatter. This subroutine transposes the FFT grid across nodes from columns to planes, or in the opposite direction, and it is implemented in two ways. The default implementation uses the MPI_Alltoallv collective to send data from all processes to all processes. The alternative implementation is a non-blocking transpose: it makes the loop iterations different on each process (MPI task) so that not all processes send a message to the same process at the same time, and it uses a combination of MPI_Isend, MPI_Irecv and MPI_Test to make the communications asynchronous and to make sure the routine exits only when all processes have sent and received their data. The source file defines a macro named NONBLOCKING_FFT; when it is turned on, the latter implementation is used, and

otherwise the blocking implementation is used.

Figure 5.1: minidft timing for different optimisation levels

The non-blocking implementation of fft_scatter is suitable for a switched network such as InfiniBand where no topology is defined, while the blocking implementation should be better on a network with a defined topology. As the three nodes of the cluster were connected by a switched InfiniBand network, a performance test was conducted with -D NONBLOCKING_FFT turned on. The comparison of this version with the previous Intel compiler and MKL version is shown in Table 5.3.

Table 5.3: minidft timing for different FFT communication types (blocking versus non-blocking; CPU and wall times in seconds)

We can see that the non-blocking communication method reduced both the computation time and the overall execution time.
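The following Fortran sketch illustrates the non-blocking pattern described above; it is not the actual fft_scatter code, and the buffer names, chunk layout and the use of MPI_Waitall (rather than the MPI_Test polling used in minidft) are assumptions made for the example.

    ! Sketch of a staggered non-blocking all-to-all transpose. Each rank
    ! starts its loop at a different offset so that not every rank targets
    ! the same peer at the same time.
    subroutine nonblocking_transpose(sendbuf, recvbuf, chunk, comm)
      use mpi
      implicit none
      integer, intent(in) :: chunk, comm
      double complex, intent(in)  :: sendbuf(*)   ! chunk elements per destination rank
      double complex, intent(out) :: recvbuf(*)   ! chunk elements per source rank
      integer :: nproc, me, i, dest, src, ierr
      integer, allocatable :: reqs(:)

      call MPI_Comm_size(comm, nproc, ierr)
      call MPI_Comm_rank(comm, me, ierr)
      allocate(reqs(2*nproc))

      do i = 0, nproc - 1
         dest = mod(me + i, nproc)              ! stagger destinations per rank
         src  = mod(me - i + nproc, nproc)
         call MPI_Irecv(recvbuf(src*chunk + 1), chunk, MPI_DOUBLE_COMPLEX, &
                        src, 0, comm, reqs(2*i + 1), ierr)
         call MPI_Isend(sendbuf(dest*chunk + 1), chunk, MPI_DOUBLE_COMPLEX, &
                        dest, 0, comm, reqs(2*i + 2), ierr)
      end do
      ! Complete all sends and receives before returning.
      call MPI_Waitall(2*nproc, reqs, MPI_STATUSES_IGNORE, ierr)
      deallocate(reqs)
    end subroutine nonblocking_transpose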

5.3 Performance with GPU-enabled Libraries

5.3.1 FFT-related Code Optimisation

NVIDIA has developed the cufft library, which provides FFT functionality on GPUs and can be used to replace the FFTW library. We first tried to make use of the cufft library and made the following modifications to the code:

Add the DEVICE attribute to arrays used on GPUs.

Replace "OpenMP parallel do" directives over arrays on GPUs with CUDA CUF kernel directives, which automatically map the loops over kernel arrays onto the GPU. Here is an example of the replacement:

    ! Original OpenMP directive
    !$omp parallel default(shared), private(mc, j, i)
    !$omp do
    DO i = 1, dfft%nst
       mc = dfft%ismap( i )
       DO j = 1, dfft%npp( me_p )
          f_in( j + ( i - 1 ) * nppx ) = f_aux( mc + ( j - 1 ) * dfft%nnp )
       ENDDO
    ENDDO
    !$omp end parallel

    ! CUDA CUF kernel
    DO i = 1, dfft%nst
       mc = dfft%ismap( i )
       !$cuf kernel do(1) <<<, >>>
       DO j = 1, dfft%npp( me_p )
          f_in( j + ( i - 1 ) * nppx ) = f_aux( mc + ( j - 1 ) * dfft%nnp )
       ENDDO
    ENDDO

Replace the FFTW3 subroutines with cufft subroutines as shown in Table 5.4.

FFTW subroutine        cufft subroutine
dfftw_destroy_plan     cufftDestroy
dfftw_plan_many_dft    cufftPlanMany
dfftw_execute_dft      cufftExecZ2Z

Table 5.4: FFTW subroutines replaced by cufft subroutines
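As an illustration of the mapping in Table 5.4, the sketch below creates, executes and destroys a batched 1D double-complex plan using the cufft Fortran module. This is not the minidft code: the problem sizes are made up, and it assumes the PGI/NVIDIA cufft Fortran module (compiled with, for example, -Mcudalib=cufft), whose interfaces mirror the C API.

    program cufft_batch_demo
      use cudafor
      use cufft
      implicit none
      integer, parameter :: nx = 128, batch = 64
      complex(8), device :: data_d(nx, batch)   ! data already resident on the GPU
      integer :: plan, istat
      integer :: n(1), inembed(1), onembed(1)

      data_d  = (1.0d0, 0.0d0)
      n       = nx
      inembed = nx
      onembed = nx

      ! Equivalent of dfftw_plan_many_dft: a batch of 1D Z2Z transforms.
      istat = cufftPlanMany(plan, 1, n, inembed, 1, nx, onembed, 1, nx, &
                            CUFFT_Z2Z, batch)
      ! Equivalent of dfftw_execute_dft: in-place forward transform on the GPU.
      istat = cufftExecZ2Z(plan, data_d, data_d, CUFFT_FORWARD)
      ! Equivalent of dfftw_destroy_plan.
      istat = cufftDestroy(plan)
    end program cufft_batch_demo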

A CUDA-aware OpenMPI built with the PGI compiler was used to compile this version of minidft, because the Intel compiler does not support CUDA Fortran syntax such as the CUF kernel directives used here. However, the execution time of minidft increased to 1097 seconds. We therefore decided not to use cufft and to use the MKL FFT subroutines instead.

5.3.2 BLAS-related Code Optimisation

The second step of porting minidft to GPUs was to replace BLAS subroutine calls with the cublas library. After checking the source code, we found that two BLAS subroutines, ZGEMM and ZGEMV, are used in minidft: ZGEMM performs matrix-matrix operations and ZGEMV performs matrix-vector operations. These subroutines were replaced by the corresponding cublas subroutines, cublas_zgemm and cublas_zgemv, respectively. NVIDIA provides a Fortran binding interface for cublas, named fortran_thunking.c, under the CUDA installation directory; it was used in this step because minidft is implemented in Fortran. Macros (such as #define ZGEMM cublas_zgemm) were used to quickly replace the original function calls.

    ! Original ZGEMM
    CALL ZGEMM( 'N', 'N', n, m, nkb, ( 1.D0, 0.D0 ), vkb, lda, ps, nkb, &
                ( 1.D0, 0.D0 ), hpsi, lda )

    ! cublas ZGEMM
    CALL cublas_zgemm( 'N', 'N', n, m, nkb, ( 1.D0, 0.D0 ), vkb, lda, ps, nkb, &
                       ( 1.D0, 0.D0 ), hpsi, lda )

After switching to cublas, the computation time and the overall time were reduced by 90.95 seconds and 72.05 seconds respectively. Moreover, the subroutine add_vuspsi_k in add_vuspsi.f90 multiplied one matrix over each column of another matrix using ZGEMV. The matrix operation part of this subroutine was replaced by a matrix-matrix multiplication using ZGEMM and an MPI reduction (summation), developed by Siyuan Liu, and the computation and overall execution times were reduced further, because the number of matrix operations was divided by the number of columns. At this point, minidft was more than two times faster than in the previous step.

5.3.3 MAGMA-related Code Optimisation

The original minidft uses ScaLAPACK for the diagonalisation in the source code file cdiaghg.f90. This operation can be ported to GPUs with the MAGMA library, which supports

hybrid CPU+GPU LAPACK functionality. However, MAGMA is a serial LAPACK library, while the original minidft used a parallel version with full data distribution. Filippo Spiga et al. developed a plug-in to accelerate Quantum ESPRESSO using NVIDIA GPUs[20], which provides a serial version of the diagonalisation. After importing the MAGMA library as well as this GPU-enabled function, the diagonalisation can be performed on GPUs. Two functions provided by MAGMA, magmaf_zhegvd and magmaf_zhegvx, are used in the program file cdiaghg.f90. As introduced in Section 2.5, there is a multi-GPU version of magmaf_zhegvd named magmaf_zhegvd_m. Both of them were tested.

    ! Multi-GPU version of ZHEGVD; the first parameter indicates the number of GPUs
    CALL magmaf_zhegvd_m( 3, 1, 'V', 'U', n, v, ldh, s, ldh, e, work, lwork, &
                          rwork, lrwork, iwork, liwork, info )

    ! Single-GPU version of ZHEGVD
    CALL magmaf_zhegvd( 1, 'V', 'U', n, v, ldh, s, ldh, e, work, lwork, &
                        rwork, lrwork, iwork, liwork, info )

With the single-GPU version, the computation time and the overall time were reduced by a further 99.55 seconds and 68.96 seconds respectively compared with the previous step. However, the multi-GPU version did not perform better. The reason could be that the matrix in our case was not large enough to scale well across multiple GPUs, and the overall time was increased by the communication between GPUs.

5.4 Process Binding

numactl is a utility which can be used to control the NUMA policy for processes or shared memory. NUMA (Non-Uniform Memory Access) is a memory architecture in which a given CPU core has different access speeds to different regions of memory. Figure 5.2 shows the topology we used for binding processes when executing minidft, and the execution time comparison is shown in Table 5.5. We can see that the computation performance was not improved; the reason is that the number of GPUs cannot be divided evenly by the number of CPUs in each node. However, the overall execution time was reduced, because each GPU was bound to a CPU, so that the communication path between CPU and GPU was fixed and stable.

5.4 Process Binding

Numactl is a utility that can be used to control the NUMA policy of processes and shared memory. NUMA (Non-Uniform Memory Access) is a memory architecture in which a given CPU core accesses different regions of memory at different speeds. Figure 5.2 shows the topology we used to bind processes when executing minidft, and the execution times are compared in Table 5.5. The computation performance was not improved, because the number of GPUs in each node cannot be divided evenly between the CPUs. The overall execution time was nevertheless reduced, because each GPU was bound to a CPU, so the communication path between CPU and GPU was fixed and stable.

Figure 5.2: Process binding topology
