Integrating GPUs as fast co-processors into the existing parallel FE package FEAST



1 Integrating GPUs as fast co-processors into the existing parallel FE package FEAST. Dipl.-Inform. Dominik Göddeke; Mathematics III: Applied Mathematics and Numerics; Computer Science VII: Computer Graphics; University of Dortmund. ASIM th Symposium on Simulation Technique, Workshop "Implementational Issues in Scientific Computing", Hannover, Germany, September 14, 2006.

2 Acknowledgements: This work is a collaboration of Christian Becker, Stefan Turek and the FEAST group in Dortmund; Robert Strzodka (Stanford University, Max Planck Center); and Patrick McCormick (Los Alamos National Laboratory).

3 Overview: 1. Motivation and Background; 2. Integration into FEAST; 3. Preliminary Results; 4. Summary and Conclusions.


5 Motivation: We want to solve large systems arising from FEM discretisations quickly on commodity clusters. CPUs are general-purpose and achieve close-to-peak performance only for in-cache data. CPUs devote most of their chip area to memory (hierarchies) rather than to processing elements (PEs). Emerging parallel, specialised chips are PE-dominated and potentially provide many FLOPS and huge memory bandwidth. Goal: Investigate how such designs can be used as numerical co-processors in scientific computing.

6 GPU Characteristics: We focus on graphics processors (GPUs) as an example. High-level view of the GPU: a deeply pipelined parallel array processor with up to 24 cores. Peak performance is > 200 GFLOP/s but very hard to achieve in practice, so more importantly: sustained memory bandwidth is > 20 GB/s for streaming access patterns despite tiny caches. Up to 1 GB onboard DDR4 memory clocked at > 1 GHz. High-end models cost about EUR 500 and consume Watts.

7 GPU Programming and Limitations: Challenge: Reformulate algorithms to fit the data-stream-based programming paradigm! GPUs are currently programmable only through graphics APIs, so some graphics background is required. Incoherent branches and incoherent memory access patterns are expensive. Full gather support, but very limited scatter. No read-modify-write! GPUs are saturated only with many parallel threads in flight. Most important limitations: The PCIe bus between host system and GPU delivers only up to 2 GB/s. GPUs provide only quasi-IEEE 32-bit floating point storage and arithmetic. No double precision!

8 Mathematical Background: Test problem: Poisson equation in 2D, $-\Delta u = f$ in some domain $\Omega \subset \mathbb{R}^2$ with Dirichlet BCs. Bilinear conforming finite elements ($Q_1$) for increasing levels of refinement of the underlying quadrilateral mesh. The resulting linear system matrices comprise nine bands. Example: unit square, multigrid in single and double precision. [Table: Level; single precision: Cycles, Error, Reduction; double precision: Cycles, Error, Reduction. Numerical entries not recoverable.]
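The nine bands follow from the Q1 coupling of each mesh node with itself and its eight neighbours. A minimal sketch, assuming a square tensor-product mesh with row-wise node numbering (the function names and the numbering are illustrative assumptions, not FEAST's data layout):

```python
def q1_band_offsets(nodes_per_row):
    """Offsets of the nine matrix bands relative to the main diagonal.

    Assumes row-wise numbering: node (i, j) has index i * nodes_per_row + j.
    """
    m = nodes_per_row
    return [dy * m + dx for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def q1_row_entries(i, j, nodes_per_row):
    """Column indices coupled to node (i, j), skipping neighbours outside the mesh."""
    m = nodes_per_row
    cols = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if 0 <= i + dy < m and 0 <= j + dx < m:
                cols.append((i + dy) * m + (j + dx))
    return cols
```

An interior node touches all nine bands, while a corner node only has four non-zero entries in its row.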

9 Mixed Precision Iterative Refinement: Single precision computation is insufficient for the required accuracy of the result, but high precision is only necessary at a few crucial stages! Mixed precision iterative refinement approach to solve Ax = b: (1) Compute d = b - Ax in high precision. (2) Solve Ac = d approximately in low precision. (3) Update x = x + c in high precision and iterate. Use arbitrary iterative inner solvers until a few digits are gained locally. This fits naturally on the target hardware: few high precision updates on the CPU and the expensive low precision iterative solution on the GPU. Exhaustive experimental and theoretical foundation: very robust wrt. solvers, degrees of anisotropy in the discretisation and matrix condition. The combined GPU-CPU scheme is up to five times faster than, and as accurate as, computing entirely on the CPU in double precision or emulating double precision on the GPU.
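The three refinement steps can be sketched on a small model problem. This is an illustration under stated assumptions, not the FEAST implementation: the matrix is the 1D stand-in tridiag(-1, 2, -1), the low precision inner solver is a tridiagonal (Thomas) elimination rather than multigrid, and single precision is emulated on the CPU by rounding every intermediate result with the hypothetical helper `f32`.

```python
import struct


def f32(x):
    """Round a double to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def residual(b, x):
    """Step 1: d = b - A x in double precision, A = tridiag(-1, 2, -1)."""
    n = len(x)
    d = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        d.append(b[i] - (2.0 * x[i] - left - right))
    return d


def solve_low_precision(d):
    """Step 2: solve A c = d with every operation rounded to single precision."""
    n = len(d)
    cp = [0.0] * n  # modified upper diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = f32(-0.5)
    dp[0] = f32(f32(d[0]) * 0.5)
    for i in range(1, n):
        denom = f32(2.0 + cp[i - 1])
        cp[i] = f32(-1.0 / denom)
        dp[i] = f32(f32(f32(d[i]) + dp[i - 1]) / denom)
    c = [0.0] * n
    c[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        c[i] = f32(dp[i] - f32(cp[i] * c[i + 1]))
    return c


def mixed_precision_refinement(b, tol=1e-12, max_iter=20):
    """Iterate steps 1-3 until the double precision defect drops below tol."""
    x = [0.0] * len(b)
    for it in range(max_iter):
        d = residual(b, x)
        if max(abs(v) for v in d) <= tol:
            return x, it
        c = solve_low_precision(d)
        x = [xi + ci for xi, ci in zip(x, c)]  # Step 3: update in double precision
    return x, max_iter
```

For example, `mixed_precision_refinement([h * h] * 63)` with `h = 1 / 64` drives the defect below the tolerance although each inner solve only carries single precision accuracy.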


10 FEAST Solution Strategy: ScaRC approach: Combine the advantages of (parallel) domain decomposition and multigrid methods. Exploit structured subdomains for high efficiency. Hide anisotropies locally to increase robustness. Globally unstructured, locally structured. Recursive solution: Smooth the outer global multigrid with local multigrid on the refined macros. Low communication overhead.


12 Integration into FEAST: FEAST has been under development since 1999 and comprises 100K+ lines of code, tuned data structures, and adaptations for clusters (MPI) and NEC vector machines. Consequence: A full rewrite to incorporate GPUs is out of the question! Goal: Minimally invasive integration. Some observations: Local sub-problems are all highly structured. Smoothing of the outer multigrid is performed locally with small communication overhead. The GPU backend adds a new smoother, while FEAST maintains all global data structures. Data flow example: The outer MG calls the smoother, the matrix and the current defect are duplicated into GPU memory, smoothing is performed independently, and the correction term is read back to the CPU.
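The data flow above can be sketched as a smoother backend that the outer solver selects by name. Everything here is a hypothetical stand-in: the function names, the tridiagonal matrix layout and the damped Jacobi parameters are assumptions, and the "GPU" path merely copies the data into single precision buffers on the CPU to mimic the duplication into GPU memory.

```python
from array import array


def jacobi_correction(lower, diag, upper, defect, sweeps, omega, typecode):
    """Damped Jacobi sweeps for A c = defect, starting from c = 0.

    typecode "d" stores vectors in double precision, "f" in single precision.
    lower[i] and upper[i] couple row i to c[i - 1] and c[i + 1].
    """
    n = len(defect)
    # Duplicate matrix and defect into separate buffers (stand-in for the upload).
    lo, dg, up = array(typecode, lower), array(typecode, diag), array(typecode, upper)
    d = array(typecode, defect)
    c = array(typecode, [0.0] * n)
    for _ in range(sweeps):
        new = array(typecode, c)
        for i in range(n):
            ac = dg[i] * c[i]
            if i > 0:
                ac += lo[i] * c[i - 1]
            if i < n - 1:
                ac += up[i] * c[i + 1]
            new[i] = c[i] + omega * (d[i] - ac) / dg[i]
        c = new
    # Read the correction back as ordinary (double precision) Python floats.
    return list(c)


def cpu_smoother(lower, diag, upper, defect, sweeps=4, omega=0.7):
    return jacobi_correction(lower, diag, upper, defect, sweeps, omega, "d")


def gpu_smoother_stand_in(lower, diag, upper, defect, sweeps=4, omega=0.7):
    return jacobi_correction(lower, diag, upper, defect, sweeps, omega, "f")


def smooth(backend, lower, diag, upper, defect):
    """Entry point for the outer solver; global data structures stay untouched."""
    smoothers = {"cpu": cpu_smoother, "gpu": gpu_smoother_stand_in}
    return smoothers[backend](lower, diag, upper, defect)
```

The outer solver only sees `smooth(...)` returning a correction, which is the sense in which such an integration stays minimally invasive.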

13 Integration Issues: The first prototype was straightforward to assemble. Expected: 5x speedup, based on the performance of the standalone GPU-CPU iterative refinement multigrid (not MG-MG!) solver. Observed: a disappointing break-even. We identified and addressed two main bottlenecks: (1) Poor performance for small problem sizes (not enough parallel threads in flight to saturate the PEs). (2) Transfers to and from on-chip memory ("manual prefetching"), CPU computations and GPU computations were all performed sequentially. Resulting performance for MG-MG: 3.5x compared to the CPU-only solution on a single node (Athlon X , GeForce 7800 GTX). The GPU smoother only provides multigrid with local Jacobi; anisotropic macros require more powerful smoothers.

14 Performance Improvements: (1) Poor performance for small problem sizes: Outer MG with F or W cycles results in smoothing of (too) many (too) small problems. GPUs are inappropriate for small problems, and the transfer overhead adds a further penalty. Solution: Dynamic CPU-GPU switch based on problem size: small problems are rescheduled to a single precision CPU smoother. (2) Overlapping transfers and computation: The CPU idles while the GPU computes and vice versa, but some CPU support is required to orchestrate the GPU. Solution: Streaming compute model: smooth problem i on the GPU while transferring data for problems i + 1 and i - 1 and updating the defect for problem i - 1.
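Both remedies can be sketched in a few lines. The function names are hypothetical, and the threshold of 20,000 is an assumption that borrows the "Threshold=20K" label from the result plots and interprets it as a number of unknowns. The schedule only lists which problem occupies which pipeline stage in each step; it does not perform any transfers.

```python
def choose_smoother(num_unknowns, threshold=20_000):
    """Dynamic switch: small problems stay on the CPU in single precision."""
    return "gpu" if num_unknowns >= threshold else "cpu_single_precision"


def streaming_schedule(num_problems):
    """Per step: (problem being uploaded, problem being smoothed, problem read back).

    While problem i is smoothed, problem i + 1 is uploaded and the result of
    problem i - 1 is read back so that its defect can be updated.
    """
    if num_problems <= 0:
        return []
    steps = []
    for s in range(num_problems + 2):
        upload = s if s < num_problems else None
        smooth = s - 1 if 0 <= s - 1 < num_problems else None
        readback = s - 2 if 0 <= s - 2 < num_problems else None
        steps.append((upload, smooth, readback))
    return steps
```

With three problems the pipeline needs five steps instead of nine fully sequential stages, and only the first and last two steps leave a stage idle.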

15 Coarsely Adapted Grids: The GPU offers MG with a Jacobi smoother. The CPU offers a wide range of MG smoothers, esp. for anisotropic generalised tensor-product meshes. Goal: Many easy sub-problems are scheduled on the GPU, while the CPU smooths the few hard ones with a more powerful numerical scheme in the meantime. This is a hard dynamic scheduling problem. All tests so far are based on (suboptimal) static partitionings of the domain.
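A static partitioning of this kind might look as follows. The anisotropy measure per macro (for instance a maximum cell aspect ratio), the cut-off value and all names are assumptions made for illustration; the slides do not state how the partitioning was chosen.

```python
def static_partition(macro_anisotropy, max_gpu_anisotropy=2.0):
    """Assign each macro once, before the solve, to one of two smoothers.

    macro_anisotropy: one hypothetical anisotropy value per macro.
    Macros at or below the cut-off go to the GPU Jacobi smoother,
    the rest to a more powerful CPU smoother.
    """
    gpu, cpu = [], []
    for macro_id, anisotropy in enumerate(macro_anisotropy):
        if anisotropy <= max_gpu_anisotropy:
            gpu.append(macro_id)
        else:
            cpu.append(macro_id)
    return {"gpu_jacobi": gpu, "cpu_strong_smoother": cpu}
```

Because the assignment never changes during the solve, an unbalanced split leaves one of the two processors waiting, which is why a dynamic scheduler would be preferable.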


17 Test Environment: Cluster with 32 compute nodes and 1 master node. Each node: dual Intel EM64T 3.4 GHz, NVIDIA Quadro FX1400 PCIe mid-range graphics card. Fully connected via Infiniband. Two test cases: (A) Full cartesian case: static 3:1 scheduling GPU:CPU, both with MG-Jacobi as local smoother to the outer (parallel) MG. (B) Coarsely adapted grids: some sub-problems require a more powerful smoother (CPU), while cartesian sub-problems are scheduled to the GPU. Level 10 computations are missing for all test cases; results are preliminary since we did not have time to adapt FEAST to the Xeon architecture.

18 Test Case A: CPU vs. CPU-GPU, one or two jobs per dual node. [Figure: "CPU, GPU Performance Study for 1x16p, 2x16p (Threshold=20K)". Two panels: absolute time for solution (Seconds over Level) and normalised time for solution (Seconds per macro grid node over Level). Curves: 1x16p CPU, 1x16p GPU FX1400, 2x16p CPU, 2x16p GPU FX1400.] Configurations: 1x16p CPU: 16 nodes, one CPU each, one CPU process. 1x16p GPU: 16 nodes, one GPU each, one GPU process. 2x16p CPU: 16 nodes, two CPUs each, two CPU processes. 2x16p GPU: 16 nodes, two CPUs and one GPU each, one CPU and one GPU process.

19 Test Case A: CPU vs. CPU-GPU, one or two jobs per dual node. [Figure: "CPU, GPU Performance Study for 1x16p, 2x16p (Threshold=20K)". Two panels: absolute time for solution (Seconds over Level) and normalised time for solution (Seconds per macro grid node over Level).] A second CPU job per node gains only 20% performance (shared FSB). The timings of the CPU-GPU configurations fluctuate because of the CPU switch in the GPU module for small levels. For large problem sizes, the GPU outperforms the CPU jobs; 1x16 GPU is even faster than 2x16 CPU!

20 Test Case A: CPU vs. CPU-GPU scalability test. [Figure: "CPU, GPU Performance Study for 1x32p, 2x16p (Threshold=20K)". Two panels: absolute time for solution (Seconds over Level) and normalised time for solution (Seconds per macro grid node over Level). Curves: 1x32p CPU, 1x32p GPU FX1400, 2x16p CPU, 2x16p GPU FX1400.] Configurations: 2x16p CPU: 16 nodes, two CPUs each, two CPU processes. 2x16p GPU: 16 nodes, two CPUs and one GPU each, one CPU and one GPU process. 1x32p CPU: 32 nodes, one CPU each, one CPU process. 1x32p GPU: 32 nodes, one CPU and one GPU each, one GPU process.

21 Test Case A: CPU vs. CPU-GPU scalability test. [Figure: "CPU, GPU Performance Study for 1x32p, 2x16p (Threshold=20K)". Two panels: absolute time for solution (Seconds over Level) and normalised time for solution (Seconds per macro grid node over Level).] The significant gain of 1x32 shows the importance of memory bandwidth for the Xeons. The CPU-GPU configuration wins by a smaller margin than before. Tendency: Increasing the problem size leads to increasing time per grid node on the CPU, but not on the GPU.

22 Test Case B: Coarsely adapted grids. [Figure: "CPU, GPU Performance Study for 2x16p (Threshold=20K)". Two panels: absolute time for solution (Seconds over Level) and normalised time for solution (Seconds per macro grid node over Level). Curves: 2x16p CPU ADI + CPU JAC, 2x16p CPU ADI + GPU FX1400.] Configurations: 2x16p CPU: 16 nodes, CPU-MG-ADITRIGS and CPU-MG-JACOBI. 2x16p GPU: 16 nodes, CPU-MG-ADITRIGS and GPU-MG-JACOBI.

23 Test Case B: Coarsely adapted grids. [Figure: "CPU, GPU Performance Study for 2x16p (Threshold=20K)". Two panels: absolute time for solution (Seconds over Level) and normalised time for solution (Seconds per macro grid node over Level).] Results are consistent with the previous graphs. Additional advantage of the CPU-GPU configuration: less strain on the memory subsystem.


25 Summary and Conclusions: Clearly work in progress in its current state. Interesting perspectives: Inexpensive upgrade of commodity clusters wrt. TCO. Potential to accelerate production codes. But: Two code lines have to be maintained on the solver and data structure level, though not on the application level. Paradigm shift to data parallelism: multicores, Cell BE etc., so start learning now: The first honest attempt at petascale computing, the IBM Roadrunner at LANL, will contain multi-GPUs, Cells and Opterons and will in general be a massively parallel hybrid machine.

26 Further Reading: Owens et al.: A Survey of General-Purpose Computation on Graphics Hardware, Eurographics 2005 State of the Art Report. Papers, Demos, Forums, FAQs. goeddeke/gpgpu: Tutorials.
