Porting Scalable Parallel CFD Application HiFUN on NVIDIA GPU


D. V., N. Munikrishna, Nikhil Vijay Shende (1), N. Balakrishnan (2), Thejaswi Rao (3)
1. S & I Engineering Solutions Pvt. Ltd., Bangalore, India
2. Aerospace Engineering, Indian Institute of Science, Bangalore, India
3. NVIDIA Graphics Pvt. Ltd., Bangalore, India
GPU Technology Conference, Silicon Valley, March 26-29

Introduction
The HiFUN software: High Resolution Flow Solver on Unstructured Meshes. A computational fluid dynamics (CFD) flow solver. Primary product of the company SandI. A robust, fast, accurate and efficient tool.
About SandI: A technology company incubated from the Indian Institute of Science, Bangalore. Promotes high-end CFD technologies with uncompromising quality standards.

Features of HiFUN: General

Features of HiFUN: Well Validated
AIAA DPW, SPICES, AIAA HiLiftPW.

Features of HiFUN: Super Scalable
Workload: 165 million volumes.
[Table: simulation type (RANS / URANS / DES), CPU cores, time (hours/days)]

SandI-NVIDIA Collaboration
Way ahead: NVIDIA Pascal, Volta; NVLink with IBM Power CPU
2018: GTC 2018
GTCx Mumbai
2016: HiFUN in Apps Catalogue; GTC 2016 poster presentation
2015: NVIDIA Innovation Award
2014: Joint development initiative kicks off

NVIDIA GPU Hybrid Supercomputers
Consist of CPUs and NVIDIA GPUs. Less power to achieve the same FLOPS. Less cooling and space. Thousands of computing cores sharing the same RAM. Higher memory bandwidth. High data transfer overheads with the CPU.

Parallelization on NVIDIA GPU
Parallelization model: Model on shared memory. Many FLOPs needed per byte of data moved from CPU to GPU. Re-look at the parallelization of CFD algorithms.
Parallelization challenges: General-purpose algorithms. Implicit schemes: global data dependence. Complex multi-layered unstructured data structure.
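
The need for many FLOPs per byte transferred can be illustrated with a toy cost model. This is only a sketch under simplifying assumptions that are not from the slides: transfer and compute do not overlap, and the rates `cpu_rate`, `gpu_rate` (FLOP/s) and `link_bw` (bytes/s) are constant. All function and parameter names are hypothetical.

```c
#include <assert.h>

/* Time to run a kernel of `flops` operations entirely on the host. */
double host_time(double flops, double cpu_rate)
{
    return flops / cpu_rate;
}

/* Time to ship `bytes` over the CPU-GPU link and then run on the GPU
 * (assumes transfer and compute are not overlapped). */
double offload_time(double flops, double bytes, double gpu_rate, double link_bw)
{
    return bytes / link_bw + flops / gpu_rate;
}

/* FLOPs per transferred byte at which host_time equals offload_time.
 * Derived from flops/cpu_rate = bytes/link_bw + flops/gpu_rate.
 * Only meaningful when the GPU is faster than the CPU. */
double break_even_intensity(double cpu_rate, double gpu_rate, double link_bw)
{
    assert(cpu_rate > 0.0 && gpu_rate > cpu_rate && link_bw > 0.0);
    return 1.0 / (link_bw * (1.0 / cpu_rate - 1.0 / gpu_rate));
}
```

In this model a kernel is worth offloading only when its FLOPs-per-byte ratio exceeds `break_even_intensity(...)`; below that, the transfer cost outweighs the faster compute.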

HiFUN on NVIDIA GPU
Constraints: No compromise on distributed-memory scalability. Source code maintainability should not suffer. Software portability should not suffer.
Parallel strategy: Accelerate single-node performance via the offload model. Hybrid: MPI and OpenACC directives.
Offload model: The computationally intensive part is offloaded to the GPU. Optimal data communication between CPU and GPU.
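
The offload strategy can be sketched in C with OpenACC directives. This is a minimal illustration and not HiFUN's code: the function `relax_cells`, its arrays, and the three-point averaging update are hypothetical stand-ins for a computationally intensive solver loop. A `data` region keeps the arrays resident on the accelerator across iterations so that CPU-GPU transfers happen once, and each loop is offloaded with `parallel loop`. The MPI layer between nodes is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: for `iters` sweeps, replace each interior value
 * of u by the average of itself and its two neighbours. */
void relax_cells(double *u, double *tmp, size_t n, int iters)
{
    assert(n >= 3);

    /* u is copied to the device on entry and back on exit;
     * tmp is device-side scratch and is never transferred. */
    #pragma acc data copy(u[0:n]) create(tmp[0:n])
    {
        for (int it = 0; it < iters; ++it) {
            /* Iterations are independent: reads from u, writes to tmp. */
            #pragma acc parallel loop present(u, tmp)
            for (size_t i = 1; i < n - 1; ++i)
                tmp[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;

            /* Copy the interior back; boundary values stay fixed. */
            #pragma acc parallel loop present(u, tmp)
            for (size_t i = 1; i < n - 1; ++i)
                u[i] = tmp[i];
        }
    }
}
```

A compiler without OpenACC support ignores the pragmas and runs the same source serially on the CPU, which is how a directive-based approach can keep a single maintainable and portable code base.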

HiFUN on NVIDIA GPU
Configurations and workloads (million):
Onera M6 Wing: 1.1, 9.3, 12.12, 15.4
NASA CRM: 6.2, 26.5, 30
NASA Trap Wing: 20, 66
Simulation type: Steady RANS simulations.

HiFUN on NVIDIA GPU
Computing platform: NVIDIA PSG.
Node configuration: Two hexadeca-core Intel(R) Xeon(R) Haswell processors. Eight NVIDIA Tesla K80 GPUs. GPU memory = 12 GB. Total CPU memory per node = 256 GB. InfiniBand interconnect.
Software: PGI Compiler 16.7, OpenMPI, OpenACC.

HiFUN on NVIDIA GPU: Parallel Performance Parameters
Ideal speed-up: Ratio of the number of nodes used for a given run to the reference number of nodes.
Actual speed-up: Ratio of the time per iteration using the reference number of nodes to the time per iteration using the number of nodes for the given run.
Accelerator speed-up: Ratio of the time per iteration obtained using a given number of CPUs to the time per iteration obtained using the same number of CPUs working in tandem with GPUs.
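
The three ratios above can be written compactly. The symbols are notation introduced here for illustration and do not appear in the slides: N is the number of nodes in a given run, N_ref the reference number of nodes, T(N) the time per iteration on N nodes, and T_CPU and T_CPU+GPU the times per iteration on the same number of CPUs without and with GPUs.

```latex
\begin{align*}
S_{\text{ideal}}  &= \frac{N}{N_{\text{ref}}}, \\
S_{\text{actual}} &= \frac{T(N_{\text{ref}})}{T(N)}, \\
S_{\text{acc}}    &= \frac{T_{\text{CPU}}}{T_{\text{CPU+GPU}}}.
\end{align*}
```

In this notation, a run scales linearly when the actual speed-up equals the ideal speed-up, and GPUs pay off when the accelerator speed-up exceeds 1.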

HiFUN on NVIDIA GPU: Single-Node Performance
[Figure: accelerator speed-up on 2 GPUs]
Observations: An increase in grid size increases GPU utilization and accelerator speed-up. It is important to load the GPU completely.

HiFUN on NVIDIA GPU: Single-Node Performance
[Figure: varying GPUs, % increase]
Observations: Increasing the number of GPUs increases accelerator speed-up. Use of 4 GPUs per node is optimal.

HiFUN on NVIDIA GPU: Single-Node Performance
[Figure: time to RANS solution (hours)]
Observations: Time to solution on a 1 million grid: 15 minutes. Time to solution on a 30 million grid: half a day. A single node serves as a desktop supercomputer.

HiFUN on NVIDIA GPU: Multi-Node Performance
[Figure: parallel speed-up, 66 million workload]
Observations: Near-linear speed-up using 2 GPUs per node. Speed-up drops for larger numbers of nodes and/or more GPUs because of lower GPU utilization.

HiFUN on NVIDIA GPU: Multi-Node Performance
[Figure: normalized time per iteration, 66 million workload]
Observations: Time per iteration drops as the number of nodes and/or GPUs increases. Time to solution with 8 nodes: 4 hours.

HiFUN on NVIDIA GPU: Concluding Remarks
An offload model was used to port HiFUN to the GPU. A GPU-based computing node is powerful enough to serve as a desktop supercomputer. HiFUN is ideally suited to solving grand-challenge problems on GPU-based hybrid supercomputers. The OpenACC directive-based offload model is an attractive option for porting legacy CFD codes to GPUs.
