Ali Khajeh-Saeed, Software Engineer, CD-adapco. J. Blair Perot, Mechanical Engineering, UMASS, Amherst.

1 Ali Khajeh-Saeed, Software Engineer, CD-adapco; J. Blair Perot, Mechanical Engineering, UMASS, Amherst

2 Outline: Supercomputers; Optimization; Stream Benchmark; Stag++ (3D Incompressible Flow Code); Matrix Multiply Function (Parallel Nsight); Isotropic Turbulence Decay; Speedup Results on Lincoln, Forge and Keeneland Supercomputers; Summary

3 Supercomputers
Lincoln: 192 servers; 384 GPUs; two quad-core Intel processors; 2 GB of RAM per core; PCI-e Gen2 x4 for each GPU
Forge: 36 servers; 288 NVIDIA Tesla M2070 GPUs (6 GB GDDR5); two eight-core AMD processors (16 cores per node); PCI-e Gen2 x8
Keeneland: 120 servers; 360 NVIDIA Tesla M2070 Fermi GPUs (6 GB GDDR5); two hex-core Intel Xeon processors (12 cores per node); PCI-e Gen2 x16

4 Page-Locked Host Memory: Pinned Memory; Portable Memory; Mapped Memory; Write-Combined Memory

5 Minimize data transfer between the CPU (host) and the GPU (device).
Maximize coalesced access to global memory where possible.
Use a suitable memory type for the data being stored (shared, texture, constant, mapped or write-combining memory).
Overlap data copies with kernel launches where possible.
Run kernels concurrently where possible.
Minimize CUDA synchronization calls.
Use a single large copy instead of many small copies.
Maximize occupancy where possible.
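The advice to prefer a single large copy over many small ones can be illustrated with a simple cost model in which every host-device transfer pays a fixed latency plus a size-dependent term. The sketch below is only a model under stated assumptions: the 10 microsecond latency and the 8 GB/s bandwidth are hypothetical defaults, and the function names are illustrative, not part of the presentation or of any CUDA API.

```python
def transfer_time_s(n_bytes, latency_s=10e-6, bandwidth_bytes_per_s=8e9):
    """Model one host-device copy as fixed latency plus bytes / bandwidth.

    Both default parameters are assumptions chosen for illustration.
    """
    return latency_s + n_bytes / bandwidth_bytes_per_s


def total_copy_time_s(total_bytes, n_copies, latency_s=10e-6,
                      bandwidth_bytes_per_s=8e9):
    """Time to move total_bytes when split evenly into n_copies transfers."""
    per_copy = total_bytes / n_copies
    return n_copies * transfer_time_s(per_copy, latency_s, bandwidth_bytes_per_s)


if __name__ == "__main__":
    total = 8_000_000  # bytes to move in total
    for n in (1, 10, 1000):
        ms = 1e3 * total_copy_time_s(total, n)
        print(f"{n:5d} copies: {ms:.3f} ms")
```

In this model the bandwidth term is the same however the data is split, so the fixed per-copy latency is what penalizes many small transfers.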

6 Stream Benchmark. Copy: a = b (one read and one write); Scalar: a = s*b (one read and one write); Add: a = b + c (two reads and one write); TriAdd: a = s*b + c (two reads and one write). [Figure: time (ms) and bandwidth (GB/sec) versus vector length; curves: Copy, Scalar, Add and TriAdd on GTX 295 and GTX 480]
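The bandwidth axis of a stream-style benchmark follows from the read and write counts listed for each kernel: bytes moved per element times vector length, divided by the measured time. A minimal sketch of that arithmetic follows, assuming 8-byte elements by default; the function names are illustrative, and the timing helper runs on the CPU in pure Python, so it shows only the bookkeeping and not GPU performance.

```python
import time

# (reads, writes) per element for the four kernels named on the slide
KERNELS = {
    "Copy": (1, 1),    # a = b
    "Scalar": (1, 1),  # a = s*b
    "Add": (2, 1),     # a = b + c
    "TriAdd": (2, 1),  # a = s*b + c
}


def bandwidth_gb_per_s(kernel, n_elements, seconds, bytes_per_element=8):
    """Effective bandwidth: total bytes read and written divided by time."""
    reads, writes = KERNELS[kernel]
    total_bytes = (reads + writes) * n_elements * bytes_per_element
    return total_bytes / seconds / 1e9


def time_triadd(n, s=3.0):
    """Run a = s*b + c on plain Python lists and return (a, elapsed seconds)."""
    b = [1.0] * n
    c = [2.0] * n
    start = time.perf_counter()
    a = [s * bi + ci for bi, ci in zip(b, c)]
    elapsed = time.perf_counter() - start
    return a, elapsed
```

For example, a Copy of one million 4-byte elements that takes 1 ms moves 8 MB and therefore corresponds to 8 GB/s.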

7 Stag++: 3D incompressible; parallel; second-order staggered mesh; three-step Runge-Kutta; non-uniform spacing in all directions. [Figure panels: CPU, GPU]

8 Kernel; Algorithm; Coalesced Access; Shared Memory; Maximize the Occupancy; Minimize the Thread Divergence; Use Faster Math Functions; Minimize Data Transfer; Run Kernels Concurrently; Minimize CUDA Synchronization; Overlap Copying Data with Kernel Launches; Use Suitable Memory (Constant, Mapped, ...); Single Large Copy Instead of Many Small Copies; Compute Visual Profiler; NVIDIA Parallel Nsight

9 [Figure: labels Domain Size and Subdomain Size]

10 CPU GPU

11 CPU Pinned Memory GPU Mapped Memory

14 [Figure: Turbulence Kinetic Energy (TKE) versus Time; curves: de Bruyn Kops and Riley, Present] S. M. de Bruyn Kops, J. J. Riley, Direct numerical simulation of laboratory experiments in isotropic turbulence, Physics of Fluids 10 (9), (1998)

15 [Figure: panels at 5 s, 7 s, 12 s, 20 s, 40 s and 110 s]

16 Single (SP) and double (DP) precision; ECC effects. SP is 2 times faster than DP. Up to 58x and 44x speedup. [Figure: time (ms) and speedup versus problem size; curves: CPU (SP), GPU (SP - ECC off), GPU (SP - ECC on), CPU (DP), GPU (DP - ECC off), GPU (DP - ECC on)]

17 CG (Conjugate Gradient) Solver. PCI-e effects on Lincoln and Forge supercomputers. [Figure: time (ms) versus problem size for the solver components Laplace, MPI, AXPY, Laplace_Inv, Extract, Copy, Interior, Fix and Summation; annotations: PCI-e x8 + MPI, Lincoln Copy]
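To show how component names such as Laplace, AXPY and Summation plausibly map onto a conjugate gradient iteration, here is a minimal serial sketch. It is not the authors' GPU or MPI implementation: it assumes a 1D Laplacian with zero boundary values, uses plain Python lists, and omits the preconditioning, halo extraction and communication steps that the figure also lists.

```python
def laplace_1d(x):
    """Apply the 1D stencil (-1, 2, -1) with zero values outside the domain."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y[i] = 2.0 * x[i] - left - right
    return y


def dot(a, b):
    """Summation step: inner product of two vectors."""
    return sum(p * q for p, q in zip(a, b))


def axpy(alpha, x, y):
    """AXPY step: return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A given as a function."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # search direction
    rr = dot(r, r)
    for iteration in range(max_iter):
        if rr ** 0.5 <= tol:
            return x, iteration
        ap = apply_a(p)
        alpha = rr / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)  # new direction: r + beta * p
        rr = rr_new
    return x, max_iter
```

Each iteration needs one operator application, two inner products and three AXPY updates. In a distributed run the inner products require a global reduction, which is one reason communication shows up in such timing breakdowns.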

18 Up to 45x speedup (16 GPUs vs. 16 CPU cores). Reasonable performance loss from 2 GPUs to 64 GPUs (less than 50%). The Lincoln supercomputer has lower bandwidth between CPU and GPU (2 GB/s instead of 8 GB/s). [Figure: speedup and MCUPS/processor versus number of processors; curves: CPU, GPU]
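The 2 GB/s and 8 GB/s figures are consistent with the theoretical per-direction peak of PCI-e Gen2 links that are 4 and 16 lanes wide. The sketch below works through that arithmetic using the standard Gen2 signalling rate of 5 GT/s per lane with 8b/10b encoding; the function name is illustrative, and achieved transfer rates are lower than these peaks.

```python
def pcie_gen2_peak_gb_per_s(lanes):
    """Theoretical one-direction PCI-e Gen2 bandwidth for a given lane count."""
    transfers_per_s = 5e9          # 5 GT/s per lane
    encoding_efficiency = 8 / 10   # 8b/10b: 8 data bits per 10 bits on the wire
    data_bits_per_s = transfers_per_s * encoding_efficiency * lanes
    return data_bits_per_s / 8 / 1e9  # bits to bytes, then to GB/s


if __name__ == "__main__":
    for lanes in (4, 8, 16):
        print(f"x{lanes}: {pcie_gen2_peak_gb_per_s(lanes):.1f} GB/s")
```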

19 Up to 40x speedup (32 GPUs vs. 32 CPU cores). Simulating case with 64 GPUs (256^3 per GPU). [Figure: speedup and MCUPS/processor versus number of processors; curves: CPU, GPU; problem sizes given per GPU or CPU]

20 Up to 25x speedup (4 GPUs vs. 4 CPU cores). Perfect performance up to 4 GPUs per node. [Figure: MCUPS/processor and speedup vs. same number of processors, versus number of processors; curves: CPU, GPU]

21 Up to 35x speedup (32 GPUs vs. 32 CPU cores). Simulating case with 64 GPUs (256^3 per GPU). Perfect performance up to 64 GPUs (4 GPUs per node). [Figure: speedup vs. same number of processors and MCUPS/processor, versus number of processors; curves: GPU (8 GPUs/Node), GPU (4 GPUs/Node), CPU (16 Cores/Node), CPU (8 Cores/Node)]

22 Speedup and performance per processor for strong scaling of the CFD problem on Keeneland using GPUs and CPUs. Up to 25x speedup (32 GPUs vs. 32 CPU cores). [Figure: MCUPS/processor and speedup vs. same number of processors, versus number of processors; curves: CPU, GPU]
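The two quantities plotted on these slides can be computed from raw timings as sketched below. The sketch assumes that MCUPS stands for million cell updates per second and that speedup is the ratio of CPU time to GPU time at the same processor count. The function names and the numbers in the usage lines are hypothetical, not measurements from the presentation.

```python
def mcups_per_processor(n_cells, n_steps, seconds, n_processors):
    """Million cell updates per second, divided by the processor count."""
    cell_updates = n_cells * n_steps
    return cell_updates / seconds / 1e6 / n_processors


def speedup(cpu_seconds, gpu_seconds):
    """Speedup of the GPU run over the CPU run with the same processor count."""
    return cpu_seconds / gpu_seconds


if __name__ == "__main__":
    # Hypothetical timings, for illustration only
    rate = mcups_per_processor(n_cells=256**3, n_steps=10, seconds=2.0,
                               n_processors=4)
    print(f"{rate:.2f} MCUPS per processor")
    print(f"{speedup(cpu_seconds=50.0, gpu_seconds=2.0):.0f}x speedup")
```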

23 Up to 20x speedup (16 GPUs vs. 16 CPU cores). Simulating 1024x1536x2048 case with 192 GPUs (256^3 per GPU). Perfect performance up to 192 GPUs. [Figure: speedup vs. same number of processors and MCUPS/processor, versus number of processors; curves: CPU, GPU; problem sizes given per GPU or CPU]

24 Forge Efficiency and Keeneland Efficiency. 96% efficiency up to 64 GPUs on the Forge and Keeneland supercomputers. 90% efficiency up to 192 GPUs on Keeneland. 70% efficiency up to 64 and 192 CPUs on Forge and Keeneland. [Figure: efficiency (10% to 110%) versus number of processors; curves: CPU (16 Cores/Node), CPU (8 Cores/Node), GPU (8 GPUs/Node), GPU (4 GPUs/Node), per CPU (8 Cores/Node), per CPU (4 Cores/Node), per GPU, Ideal]
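The slide does not define efficiency. One common reading, assumed here, is per-processor throughput relative to the smallest run, so that ideal scaling gives 100%. The sketch below applies that definition to a table of per-processor rates; the function name and the rates are hypothetical and chosen only to illustrate the calculation.

```python
def scaling_efficiency(rates_by_processor_count):
    """Per-processor rate at each count, relative to the smallest count.

    rates_by_processor_count maps processor count to per-processor
    throughput (for example MCUPS/processor).
    """
    baseline_count = min(rates_by_processor_count)
    baseline_rate = rates_by_processor_count[baseline_count]
    return {
        count: rate / baseline_rate
        for count, rate in sorted(rates_by_processor_count.items())
    }


if __name__ == "__main__":
    # Hypothetical per-processor rates, for illustration only
    rates = {1: 100.0, 64: 96.0, 192: 90.0}
    for count, eff in scaling_efficiency(rates).items():
        print(f"{count:4d} processors: {100 * eff:.0f}%")
```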

25 Unlike the CPU, the GPU is well suited for large problems.
PCI-e speed controls the performance for large scientific problems.
Internal calculations are now so efficient that the operations related to MPI communication are the primary scaling bottleneck.
GPU synchronization calls can have a profound effect on GPU performance.
GPUs can significantly enhance the speed of CFD calculations (by roughly a factor of 20-30x).

27 References
1. Ali Khajeh-Saeed and J. Blair Perot, GPU-Supercomputer Acceleration of Pattern Matching, GPU Computing Gems Emerald Edition, Chapter 13, January 2011.
2. Ali Khajeh-Saeed, Stephen Poole and J. Blair Perot, Acceleration of the Smith-Waterman algorithm using single and multiple graphics processors, Journal of Computational Physics 229 (2010).
3. Ali Khajeh-Saeed and J. Blair Perot, Computational Fluid Dynamics Simulations using many Graphics Processors, submitted to Computing in Science and Engineering, Special Issue on scientific computing with GPUs, August.
4. Timothy McGuiness, Ali Khajeh-Saeed, Stephen Poole and J. Blair Perot, High Performance Computing on GPU Clusters, submitted to the SIAM Journal on Scientific Computing, October.
5. Ali Khajeh-Saeed and J. Blair Perot, Turbulence Simulation using many Graphics Processors, submitted to the 64th Annual Meeting of the APS Division of Fluid Dynamics, November 20-22, 2011, Baltimore, MD.
6. T. McGuiness and J. B. Perot, Parallel Graph Analysis and Adaptive Meshing using Graphics Processing Units, 2010 Meeting of the Canadian CFD Society, London, Ontario.
7. S. Menon and J. B. Perot, Implementation of an efficient conjugate gradient algorithm for Poisson solutions on graphics processors, Proceedings of the 2007 Meeting of the Canadian CFD Society, Canada, (2007).

28 [Figure: time (ms) versus problem size on Orion, Lincoln, Forge and Keeneland; curves: CPU (SP), GPU (SP), CPU (DP), GPU (DP)]

31 [Figure: time (ms); cases: x128x128 per GPU and 256x256x256 per GPU]

Computational Fluid Dynamics Simulations using Many Graphics Processors Ali Khajeh-Saeed and J. Blair Perot Theoretical and Computational Fluid Dynamics Laboratory, University of Massachusetts, Amherst,