Large Scale Imaging on Current Many-Core Platforms

1 Large Scale Imaging on Current Many-Core Platforms
SIAM Conf. on Imaging Science 2012, May 20, 2012
Dr. Harald Köstler
Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

2 Outline
Introduction
3D Anisotropic Diffusion in OpenCL
HDR Compression on GPU
A Large Scale Imaging Framework
Performance on Jugene and Tsubame
Summary

3 Introduction
Goals: real-time imaging for large data sets; high productivity; high efficiency; high flexibility
Currently: specialized implementations; single-node frameworks; Matlab

4 Example: Anisotropic Diffusion (CED, Weickert)
Abdominal CTA scan (medium noise level), original vs. filtered
[Figure: original image next to the result of 3D anisotropic diffusion after 100 iterations; edge sensitivity]

5 Performance Table
Type:               Intel Core i7 4.2 GHz      NVIDIA GeForce 570 GTX
Processors:         4                          15
Stream processors:
Parallel threads:
RAM:                16 GB (up to 64 GB)        1.28 GB (up to 2.5 GB)
Memory bandwidth:   21 GB/s                    152 GB/s
GFLOPS:
Average time for one iteration (measured over 50 iterations), table columns: x size [voxel], y size [voxel], z size [voxel], GPU time [s], CPU time [s]

6 Summary Anisotropic Diffusion
Real-time capability: 64x64x64 = 0.26 million points
Productivity: one master's thesis (Thomas Kluge)
Efficiency: ?
Flexibility: no

7 High Dynamic Range Compression
Data: Siemens AG, Healthcare Sector
[Figure: original image (960x960) next to the result of HDR compression]
Sequences of 2D x-ray images
Image histogram and windowing
Interactive computation with user-input parameters
Interactive visualization of results with OpenGL

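8 HDR Compression
Idea: modify the magnitude of the image gradient by a position-dependent attenuating function $\Phi : \mathbb{R}^2 \to \mathbb{R}$:
$$C = \nabla I \, \Phi$$
(Fattal/Lischinski/Werman, Gradient Domain High Dynamic Range Compression, SIGGRAPH, 2002)
Energy functional:
$$E(u) = \int_\Omega \| \nabla u(x) - C \|^2 \, dx \to \min$$
Solve the Euler-Lagrange equation by multigrid:
$$\Delta u = f \ \text{in } \Omega, \qquad u = 0 \ \text{on } \partial\Omega$$
where $f = \operatorname{div} C$.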
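As a rough illustration of the idea on this slide, the following Python sketch builds the attenuated gradient field C and the right-hand side f = div C of the resulting Poisson equation. It is a minimal sketch under stated assumptions, not the implementation from the talk: the function name `attenuated_divergence`, the single-scale attenuation Φ = (|∇I| / alpha)^(beta - 1), the parameter defaults, and the finite-difference treatment at the image border are all assumptions made for illustration (Fattal et al. build Φ from several resolution levels).

```python
import numpy as np

def attenuated_divergence(image, alpha=0.1, beta=0.9):
    """Return f = div(C) with C = grad(I) * Phi on a unit-spaced grid.

    Assumption: single-scale attenuation Phi = (|grad I| / alpha)**(beta - 1),
    which shrinks gradients larger than alpha when beta < 1.
    """
    image = np.asarray(image, dtype=float)

    # Forward differences; the gradient is set to zero past the last row/column.
    gx = np.zeros_like(image)
    gy = np.zeros_like(image)
    gx[:, :-1] = image[:, 1:] - image[:, :-1]
    gy[:-1, :] = image[1:, :] - image[:-1, :]

    # Position-dependent attenuation; Phi = 1 where the gradient vanishes.
    mag = np.hypot(gx, gy)
    phi = np.ones_like(mag)
    nonzero = mag > 0
    phi[nonzero] = (mag[nonzero] / alpha) ** (beta - 1.0)

    cx, cy = gx * phi, gy * phi

    # Backward differences for the divergence (pairs with the forward gradient).
    f = np.zeros_like(image)
    f[:, 0] += cx[:, 0]
    f[:, 1:] += cx[:, 1:] - cx[:, :-1]
    f[0, :] += cy[0, :]
    f[1:, :] += cy[1:, :] - cy[:-1, :]
    return f
```

The array returned here would then be handed to a Poisson solver such as the multigrid method discussed on the following slides; with beta = 1 the attenuation is switched off and f reduces to a discrete Laplacian of the input image.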

9 Performance Engineering
Inputs: algorithm, hardware
Create performance model
Identify performance bottlenecks
Create problem-specific, hardware-dependent, and highly efficient kernels
Integrate them in a software framework

10 What is Multigrid?
An efficient iterative solver for sparse systems $A u_h = f_h$
A multigrid solver has complexity O(N) in the number of unknowns N
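A short worked estimate shows where the linear cost of a single cycle comes from. It assumes standard coarsening (the mesh width doubles in each of the d space dimensions from one level to the next) and a constant cost c per unknown and level for smoothing, residual, and grid transfer; the symbols W, c, d, and L are introduced here only for illustration.

```latex
W_{\text{V-cycle}}
  \;\le\; \sum_{l=0}^{L} c \, N \, 2^{-d l}
  \;<\; \frac{c \, N}{1 - 2^{-d}},
\qquad\text{so for } d = 3:\quad
W_{\text{V-cycle}} \;<\; \tfrac{8}{7} \, c \, N .
```

If, in addition, the error reduction per cycle does not depend on N, a fixed number of cycles reaches a given tolerance, which gives the O(N) total.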

11 Multigrid idea
Multigrid methods are based on two principles:
1. Smoothing property: smooth the error on the fine grid (e.g. Jacobi, Gauss-Seidel)

12 Multigrid idea
Multigrid methods are based on two principles:
1. Smoothing property
2. Coarse grid principle: approximate the smooth error on coarser grids
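The two principles can be sketched in a few lines of Python for the 1D model problem -u'' = f on (0, 1) with u = 0 at both ends. This is a minimal sketch under stated assumptions, not code from the talk or from the framework presented later: the function names, the weighted Jacobi smoother with omega = 2/3, full weighting, linear interpolation, and the grid layout (2^k intervals, boundary points stored in the arrays) are all choices made for illustration.

```python
import numpy as np

def residual(u, f, h):
    """r = f - A u for the 1D Poisson stencil (-1, 2, -1) / h^2."""
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / h**2
    return r

def smooth(u, f, h, sweeps, omega=2.0 / 3.0):
    """Principle 1: weighted Jacobi damps the oscillatory error components."""
    for _ in range(sweeps):
        u_new = u.copy()
        u_new[1:-1] = (1.0 - omega) * u[1:-1] + omega * 0.5 * (
            u[:-2] + u[2:] + h * h * f[1:-1])
        u = u_new
    return u

def restrict(r):
    """Full weighting from the fine grid to the grid with half as many intervals."""
    rc = np.zeros((r.size - 1) // 2 + 1)
    rc[1:-1] = 0.25 * r[1:-3:2] + 0.5 * r[2:-2:2] + 0.25 * r[3:-1:2]
    return rc

def prolong(ec):
    """Linear interpolation from the coarse grid to the fine grid."""
    e = np.zeros(2 * (ec.size - 1) + 1)
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return e

def v_cycle(u, f, h, nu1=2, nu2=2):
    """One V(nu1, nu2) cycle; assumes zero Dirichlet boundary values."""
    if u.size == 3:                      # one interior unknown: solve exactly
        u = u.copy()
        u[1] = 0.5 * h * h * f[1]
        return u
    u = smooth(u, f, h, nu1)             # pre-smoothing
    rc = restrict(residual(u, f, h))     # Principle 2: move the smooth error
    ec = v_cycle(np.zeros_like(rc), rc, 2.0 * h, nu1, nu2)  # to a coarser grid
    u = u + prolong(ec)                  # coarse grid correction
    return smooth(u, f, h, nu2)          # post-smoothing
```

A typical use is to start from `u = np.zeros(n + 1)` with `h = 1.0 / n` and call `v_cycle` repeatedly until the residual norm is small enough.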

13 Optimized HDR Compression (size 2048x2048)
[Bar chart: frames per second (fps) for GTX 295/2, GTX 480, and GTX 480 (wavefront)]
GTX 295/2: half of an NVIDIA GTX 295, 112 GB/s peak bandwidth, compute capability 1.3
GTX 480: NVIDIA GTX 480, ... GB/s peak bandwidth, compute capability 2.0 (Fermi)

14 Summary HDR Compression
Real-time capability: 4096x4096 = 16.7 million points
Productivity: one PhD (Markus Stürmer)
Efficiency: nearly optimal
Flexibility: no

15 waLBerla: addressing flexibility and productivity

16 waLBerla: Patch concept (Christian Feichtinger)

17 Summary waLBerla
Real-time capability: ?
Productivity: provides interprocess communication; provides debugging, IO, data management; clear interfaces
Efficiency: depends on kernel
Flexibility: arbitrary 3D stencils; multigrid and CG solvers; hardware: CPU, GPU, HPC clusters

18 Tsubame 2.0 in Japan
Compute nodes: 1442
Processor: Intel Xeon X5670
GPU: 3 x NVIDIA Tesla M2050
LINPACK performance: 1.2 Petaflops
Power consumption: 1.4 MW
Interconnect: QDR Infiniband

19 Multigrid results on Tsubame
Settings:
Times for one V(2,2) cycle with 256^3 unknowns per GPU, double precision
CG as coarse grid solver
Poisson 3D (linear interpolation, full weighting)
RBGS smoother
waLBerla MPI communication routines

20 Weak scaling with MPI
[Plot: runtime in ms (smoother on fine grid, coarser grids, V-cycle) and efficiency over the number of GPUs]

21 Weak scaling without MPI
[Plot: runtime in ms (smoother on fine grid, coarser grids, V-cycle) and efficiency over the number of GPUs]

22 Strong scaling
[Plot: runtime in ms (smoother on fine grid, coarser grids, V-cycle) and speedup over the number of GPUs]

23 Optimal problem size on 1029 GPUs
[Plot: runtime in ms over unknowns in million; curves labeled by multigrid level, including level 5]

24 Summary waLBerla on Tsubame
Real-time capability: > 500 million points
Productivity: GPU clusters are hard to program; waLBerla provides communication routines; CUDA or OpenCL
Efficiency: multigrid scalability is limited by the coarser grids
Flexibility: yes

25 Performance Engineering revisited
Performance of code is a property that can and should be predicted, planned, and steered.
Performance vs. effort: architectural factors, programming techniques, optimization techniques
Performance model!

26 Performance Model
Input:
Algorithm: V-cycle on (block-)structured grid
Generic implementation
Hardware information (bandwidth, peak performance)
Assumption:
$$t_{\text{total}} = \max(t_{\text{comp}}, t_{\text{overlapcomm}}) + t_{\text{nonoverlapcomm}}$$
Computation time is limited by memory bandwidth and instruction throughput.
Communication time is limited by network bandwidth and latency (for direct and collective communication).
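The model on this slide can be written down as a small Python sketch. Only the formula for the total time is taken from the slide. The class `KernelCost`, the function names, the choice to take the larger of the memory-bound and compute-bound estimates, and the latency-plus-size-over-bandwidth estimate for one message are assumptions made here for illustration.

```python
from dataclasses import dataclass

@dataclass
class KernelCost:
    """Hypothetical cost description of one kernel call."""
    bytes_moved: float   # bytes transferred to and from main memory
    flops: float         # floating point operations executed

def compute_time(cost, mem_bandwidth, peak_flops):
    """Assumed bound: the slower of memory traffic and arithmetic dominates."""
    return max(cost.bytes_moved / mem_bandwidth, cost.flops / peak_flops)

def message_time(nbytes, latency, net_bandwidth):
    """Assumed bound for one message: latency plus size over bandwidth."""
    return latency + nbytes / net_bandwidth

def total_time(t_comp, t_overlap_comm, t_nonoverlap_comm):
    """Total time as stated on the slide: overlapped communication hides
    behind computation, non-overlapped communication adds on top."""
    return max(t_comp, t_overlap_comm) + t_nonoverlap_comm
```

For example, `total_time(3.0, 2.0, 0.5)` gives 3.5: the overlapped communication is hidden completely behind the computation, and only the non-overlapped part adds to the runtime.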

27 Data sets for 3D HDR Compression
MRI data provided by Universitätsklinikum Erlangen
Tetrahedral finite element mesh used in HHG

28 Blue Gene/P in Jülich (Jugene)
Compute node: 4-way SMP processor
Processor: 32-bit PowerPC 450 core, 850 MHz
Cores:
Overall peak performance: 1 Petaflops
Main memory: 2 Gbytes per node (aggregate 144 TB)

29 Runtime Prediction from Performance Model

30 Strong Scaling for Multigrid Solver on Jugene

31 Summary & Conclusion
We want to provide a generic framework for large scale imaging that
supports various computer architectures,
supports various imaging applications, and
supports performance models to estimate the overall runtime of your application on given hardware.

32 THE END Questions?
