Designing a GPU Computing Solution

1 Designing a GPU Computing Solution
Patrick Van Reeth
EMEA HPC Competency Center - GPU Computing Solutions
Saturday, May the 29th
Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

2 Current Computing Challenges
- Insatiable demand for more Flops and bigger data sets
- Power, Cooling, Density, Price/perf
- Accelerators supplement multi-core with more silicon devoted to computation

3 What is an HPC Accelerator?
- Hundreds of functional units executing in parallel
- Speeds up applications by 2x, 10x, 30x, or even 100x!
- Really fast
- Really cheap
- Was hard to program, but getting easier now
- Specific applications benefit
  - Not useful for most applications
  - Excellent for a growing number of highly parallel applications
- When it works, it can really fly!

4 NVIDIA Tesla Industry Results
Speedups are 20x to 150x
- 146X: Interactive visualization of volumetric white matter connectivity
- 36X: Ionic placement for molecular dynamics simulation on GPU
- 19X: Transcoding HD video stream to H.264
- 17X: Simulation in Matlab using .mex file CUDA function
- 100X: Astrophysics N-body simulation
- 149X: Financial simulation of LIBOR model with swaptions
- 47X: An M-script API for linear algebra operations on GPU
- 20X: Ultrasound medical imaging for cancer diagnostics
- 24X: Highly optimized object oriented molecular dynamics
- 30X: Cmatch exact string matching to find similar proteins and gene sequences

5 How can Accelerators improve the Applicative Environment?
- Multi Nodes, Multi CPUs, Multi-cores (each level is marked "Acc?" in the diagram)
- Many cores: brings hundreds of functional units executing in parallel
- Speed-up, $$$, m², Watts

6 HP Accelerator program
Hybrid Platforms: FPGAs, GPUs
Diagram labels: Large codes and operating systems; Floating point; Text or integer; Performance per Watt; Floating point (engineering computations)

7 Application Speed-up key factors
Understand the existing code:
- Track the most computationally expensive areas (inner loops)
- Probe the load-balancing on nodes & cores
- Examine the data set splits
- Probe the communications
The tool box is rich:
- Multi-core CPUs
- Memory bandwidth/latency
- InterConnect
- Accelerators
- PCI-E speed
- SW Environments: C & Fortran compilers, OpenCL, CUDA, HMPP, Allinea, TotalView
Be Agnostic first!!!
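
The advice to track the most computationally expensive areas first can be made concrete with Amdahl's law, which bounds the overall speedup by the fraction of runtime that is actually accelerated. The sketch below is illustrative and not from the slides; the function name and the 80% hot-loop fraction are assumptions.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when only part of the runtime is accelerated (Amdahl's law)."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    # Time left on the CPU plus the shortened time of the offloaded part.
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    hot_loop_fraction = 0.8  # assumed share of runtime spent in the inner loops
    for kernel_speedup in (2, 10, 30, 100):
        overall = amdahl_speedup(hot_loop_fraction, kernel_speedup)
        print(f"kernel {kernel_speedup:>3}x -> application {overall:.2f}x")
```

Even a very fast kernel gives a limited application speedup if the profiled hot spot covers only part of the runtime, which is why the code analysis comes before the hardware choice.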

8 GPU rules: Accelerator Success Factors
TRACK THE MINES!!!
- Embarrassingly parallel is good!
- Minimize memory access versus calculation. Too few calculations per memory read/write is bad.
- Use cache when possible (use the memory hierarchy!)
- Thread/core mapping is very important
- Redesign the code architecture to scale to the right levels: Node(s), core(s), GPU(s)
- Minimize CPU-GPU transfers
- Partition the data sets to stay within the onboard memory boundary
- Data set split: shift along the correct X, Y, Z axis!
- Hide communications
- Balance CPU (e.g. summations, fewer flops) and GPU (e.g. convolutions, transpositions, ...) execution time appropriately.
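
The rule about calculations per memory read/write can be quantified as arithmetic intensity (flops per byte moved) and compared with what the device can sustain, in the spirit of the roofline model. This is a minimal sketch, not taken from the slides; the peak and bandwidth figures are hypothetical placeholders, not the specifications of any particular GPU.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Floating-point operations performed per byte read or written."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def attainable_gflops(intensity, peak_gflops, mem_bandwidth_gb_s):
    """Roofline-style bound: limited by compute peak or by memory traffic."""
    return min(peak_gflops, intensity * mem_bandwidth_gb_s)


if __name__ == "__main__":
    # Hypothetical device figures, chosen only for illustration.
    peak, bandwidth = 1000.0, 100.0

    # y = a*x + y in single precision: 2 flops, 12 bytes per element
    # (read x, read y, write y, 4 bytes each).
    saxpy = arithmetic_intensity(2, 12)
    print("saxpy bound (GFLOP/s):", attainable_gflops(saxpy, peak, bandwidth))

    # A kernel doing 50 flops per byte is limited by the compute peak instead.
    print("dense kernel bound (GFLOP/s):", attainable_gflops(50, peak, bandwidth))
```

A kernel whose bound sits far below the compute peak is memory bound, so reusing data from cache or on-chip memory helps more than adding GPU cores.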

9 Determine the right platform
- CPU / GPU execution time ratio will determine the type of node cores and the # of GPUs per node
- PCI-E communication between CPU & GPU will determine if the PCI-E link can be shared by multiple GPUs
- Single or double precision will determine the GPU type
- Data set size (SMP, Cluster, ...)
- $$$, Watts, perf
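
Whether a PCI-E link can be shared depends on how busy one GPU keeps it. The sketch below is a rough model, not from the slides: it assumes transfers and GPU compute do not overlap, and that the link saturates once the summed transfer time of all GPUs fills a step. The function names and all numbers are hypothetical.

```python
import math


def transfer_seconds(num_bytes, link_gb_per_s):
    """Time to move num_bytes over a link with the given effective bandwidth."""
    if link_gb_per_s <= 0:
        raise ValueError("link_gb_per_s must be positive")
    return num_bytes / (link_gb_per_s * 1e9)


def max_gpus_sharing_link(bytes_per_step, gpu_compute_seconds, link_gb_per_s):
    """Rough count of GPUs one link can feed before it saturates."""
    if bytes_per_step <= 0:
        raise ValueError("bytes_per_step must be positive")
    t_transfer = transfer_seconds(bytes_per_step, link_gb_per_s)
    step_time = t_transfer + gpu_compute_seconds  # no overlap assumed
    return max(1, math.floor(step_time / t_transfer))


if __name__ == "__main__":
    # Assumed workload: 1 GB moved per step over a link sustaining 5 GB/s.
    for compute_s in (0.05, 0.5, 5.0):
        n = max_gpus_sharing_link(1e9, compute_s, 5.0)
        print(f"compute {compute_s:>4} s per step -> up to {n} GPU(s) per link")
```

When the result is 1, communication dominates and each GPU needs its own link; larger values suggest the link can be shared.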

10 GPU Solution architectures (at node level)
Applicative environment topology 1
- CPU: High CPU activity
- PCI-E: Few CPU/GPU communications
- GPU: High GPU activity
- CPU activity limited to communications: Low-end bi-socket servers
- Low communication: PCI-E can be shared by GPUs
- Many GPU cores requested

11 GPU Solution architectures (at node level)
Applicative environment topology 2
- CPU: High CPU activity
- PCI-E: Few CPU/GPU communications
- GPU: High GPU activity
- CPU intense activity, large data sets: N x multicore, Tbytes Mem
- Low communication: PCI-E can be shared by GPUs
- Many GPU cores requested

12 GPU Solution architectures (at node level)
Applicative environment topology 3
- CPU: Low CPU activity
- PCI-E: Few CPU/GPU communications
- GPU: High GPU activity
- CPU activity limited to communications: Bi-socket servers
- GPU: Dedicated PCI-E 16x
- Direct access to many GPUs (PCI-E 16x)

13 GPU Solution architectures (at node level)
Applicative environment topology 4
- CPU: High CPU activity
- PCI-E: Critical CPU/GPU communications
- GPU: High GPU activity
- CPU intense activity, large data sets: N x multicore, Tbytes Mem
- GPU: Dedicated PCI-E 16x
- Direct access to many GPUs (PCI-E 16x)
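
The four node-level topologies can be read as a two-by-two choice: how heavy the CPU side is, and whether CPU/GPU communication is critical. The lookup below encodes that reading as a sketch. Treating the slides as a strict grid is an interpretation, not something the deck states, and the function and key names are hypothetical.

```python
# Keys: (cpu_heavy, communication_critical). Values paraphrase slides 10 to 13.
NODE_TOPOLOGIES = {
    (False, False): (1, "low-end bi-socket server", "PCI-E shared by GPUs"),
    (True, False): (2, "N x multicore, Tbytes of memory", "PCI-E shared by GPUs"),
    (False, True): (3, "bi-socket server", "dedicated PCI-E 16x per GPU"),
    (True, True): (4, "N x multicore, Tbytes of memory", "dedicated PCI-E 16x per GPU"),
}


def choose_node_topology(cpu_heavy, communication_critical):
    """Return (topology number, host type, PCI-E layout) for a workload profile."""
    return NODE_TOPOLOGIES[(bool(cpu_heavy), bool(communication_critical))]


if __name__ == "__main__":
    for cpu_heavy in (False, True):
        for comm_critical in (False, True):
            number, host, pcie = choose_node_topology(cpu_heavy, comm_critical)
            print(f"cpu_heavy={cpu_heavy}, comm_critical={comm_critical}: "
                  f"topology {number}, {host}, {pcie}")
```

The two inputs would come from profiling the existing code, not from the hardware catalogue.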

14 Conclusions
- When asked to speed up an existing environment: analyze & probe the existing code, and identify if/where accelerators fit
- Adding GPUs is not exactly playing LEGO!
- Define the right CPU/GPU split and data set partitioning
- Adapt the HW architecture:
  - At node level
  - At cluster level
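
Data set partitioning can be sketched as cutting a 3D grid into slabs along one axis so that each slab fits in the GPU's onboard memory. This is an illustrative example, not from the slides; the grid size, memory budget and function name are assumptions, and a real code would also account for halo planes and working buffers.

```python
def slab_ranges(planes, bytes_per_plane, memory_budget_bytes):
    """Split `planes` planes into [start, stop) slabs that each fit the budget."""
    planes_per_slab = memory_budget_bytes // bytes_per_plane
    if planes_per_slab < 1:
        raise ValueError("a single plane does not fit in the memory budget")
    return [
        (start, min(start + planes_per_slab, planes))
        for start in range(0, planes, planes_per_slab)
    ]


if __name__ == "__main__":
    # Assumed case: 512 x 512 x 512 single-precision grid, split along Z.
    nx = ny = nz = 512
    plane_bytes = nx * ny * 4            # one XY plane of 4-byte floats
    budget = 200 * 1024 * 1024           # hypothetical 200 MiB usable per GPU
    for start, stop in slab_ranges(nz, plane_bytes, budget):
        print(f"planes {start}..{stop - 1}: {(stop - start) * plane_bytes} bytes")
```

Splitting along the slowest-varying axis keeps each slab contiguous in memory, so it can be sent to the GPU in one transfer.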

15 Thank You!
