Hybrid Implementation of 3D Kirchhoff Migration

1 Hybrid Implementation of 3D Kirchhoff Migration
Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez
GTC, San Jose, March 19, 2013

2 Agenda
1. Motivation
2. The Problem at Hand
3. Solution Strategy
4. GPU Implementation
5. CPU+GPU w/ Static Work Partitioning
6. CPU+GPU w/ Dynamic Work Distribution
7. Results
8. Conclusions

4 Motivation
- Heterogeneous clusters are becoming the standard
- Geophysical applications process huge data sets (TBs)
- Need to:
  - Fully utilize all devices (CPU & GPU)
  - Execute computational kernels on optimal hardware
  - Maximize utilization of network and device bandwidth
- No well-known best practices for irregular applications on heterogeneous systems

6 The Problem at Hand (I)
What is Kirchhoff Migration?
- Subsurface imaging algorithm based on ray tracing
- 2 stages: traveltime (TT) computation and migration (MIG)
- Multi-node execution time on the order of weeks

7 The Problem at Hand (II)
Interesting target for hybrid execution
- Data structures: massive, dynamically sized, legacy implementation requires pointer chasing
- Compute-intensive kernels, complex control flow, high register pressure
- Major I/O bottlenecks
- Software engineering challenges
  - Massive code base
  - Facilitate upgrades
  - Constant numerical checks

8 The Problem at Hand (IV)
Existing software infrastructure (applies to MIG and TT)
[Figure: (1) Setup with job parameters (vel=, inline=, crossline=, nodes=); (2) Job launch and execution: the master node sends microjobs to worker nodes, where a management process spawns a microjob process]
- Both use microjobs as the scheduling unit, 1 microjob per node
- Both TT & MIG use a two-stage pipeline of microjobs
- Stage 1: computation bound, Stage 2: I/O bound

10 Solution Strategy
Incremental porting approach:
- Legacy CPU-only approach -> GPU-only
  - Support dynamically sized, pointer-based data structures
  - Usual GPU optimization strategies (shared memory, access coalescing, etc.)
  - Stabilize application numerical instability
- GPU-only -> CPU+GPU w/ static work partitioning
  - Support same data layout efficiently on host & device
  - Modified I/O front-end w/ dedicated communication thread
- CPU+GPU -> CPU+GPU w/ dynamic work distribution
  - Different load-balancing algorithms for TT, MIG
  - New I/O back-end, utilize the shared FS

12 GPU Implementation (I)
TT: main computational kernel (tt_hwt):
[Figure: loop structure of tt_hwt, annotated "OpenMP", "Loop-carried dependencies", "CUDA, thousands of iterations"]
- SIMD parallelism across rays, kernels with complex control flow
- Ray-based wavefront is dynamically sized
- Each ray has multiple references to others
- Large amount of sequential computation

13 GPU Implementation (II)
MIG: main computational kernel (kern_corr):
[Figure: loop structure of kern_corr, annotated "Large source of overhead", "CUDA, hundreds of iterations", "Loop-carried dependencies"]
- Parallelize across x,y physical dimensions
- Accumulate work from multiple traces for sufficient parallelism
- Data size > GPU memory, so host-managed caching in device memory is used
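The host-managed caching mentioned on this slide could look roughly like the sketch below. The slides do not describe the eviction policy or the block granularity, so the least-recently-used policy, the `DeviceCache` class, and the block ids and sizes are all assumptions made for illustration; the device copy is only counted, not performed.

```python
from collections import OrderedDict


class DeviceCache:
    """Host-side bookkeeping for blocks resident in (simulated) device memory.

    Hypothetical sketch: LRU eviction is an assumption, not the slides' method.
    """

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.resident = OrderedDict()  # block_id -> size, least recent first
        self.uploads = 0               # simulated host-to-device copies

    def acquire(self, block_id, size):
        """Make block_id resident on the device; return True on a cache hit."""
        if block_id in self.resident:
            self.resident.move_to_end(block_id)  # mark as most recently used
            return True
        if size > self.capacity:
            raise ValueError("block larger than device memory")
        # Evict least recently used blocks until the new block fits.
        while self.used + size > self.capacity:
            _, evicted_size = self.resident.popitem(last=False)
            self.used -= evicted_size
        self.resident[block_id] = size
        self.used += size
        self.uploads += 1  # a real implementation would issue the copy here
        return False


cache = DeviceCache(capacity_bytes=100)
hits = [cache.acquire(block, 40) for block in ["a", "b", "a", "c", "b"]]
print(hits, cache.uploads)
```

Only misses trigger a transfer, so traces that reuse the same blocks avoid repeated copies.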

14 GPU Implementation (III)
Optimization Techniques:
- Maximize use of non-blocking CUDA streams and events
- Use the massively parallel GPU to re-order arrays for improved memory coalescing in future kernels
- Track live and modified data to reduce or eliminate copies
- Focus on divergent code sections, try to maximize shared code
- Shared memory used, particularly for kernels with high register pressure
- Brute-force optimization of block sizes
  - No natural division of work between blocks, but tweaking block size caused significant speedup

16 CPU+GPU w/ Static Work Partitioning
TT: OpenMP parallelization across shots
- Given N CPU cores, M GPUs
- Launch N OpenMP threads (#pragma omp parallel for)
- Threads 0 through M-1 use GPUs to accelerate parallel loops over rays within each shot
MIG: partition x,y dimensions within traces across devices and threads
- Dedicated input communication thread, buffers tasks
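A minimal sketch of the static assignment described on this slide, written in Python rather than OpenMP for brevity. The helper names `assign_devices` and `partition_range` are hypothetical, and the equal-sized contiguous split is an assumption; the slides do not say how the x,y ranges are weighted between CPU threads and GPUs.

```python
def assign_devices(n_threads, n_gpus):
    """Threads 0..M-1 each get a GPU id; the remaining threads stay on the CPU."""
    return [("gpu", t) if t < n_gpus else ("cpu", None) for t in range(n_threads)]


def partition_range(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous, near-equal [start, stop) chunks."""
    base, extra = divmod(n_items, n_parts)
    chunks, start = [], 0
    for part in range(n_parts):
        stop = start + base + (1 if part < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


devices = assign_devices(n_threads=4, n_gpus=2)
# [('gpu', 0), ('gpu', 1), ('cpu', None), ('cpu', None)]
x_chunks = partition_range(n_items=10, n_parts=4)
# [(0, 3), (3, 6), (6, 8), (8, 10)]
print(devices, x_chunks)
```

Because the mapping is fixed before execution, a thread with a light shot keeps its GPU even when another thread has far more work.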

18 CPU+GPU w/ Dynamic Work Distribution
TT: threads peek at workload of other threads, donate GPUs to threads with more computation
- Integer value indicates workload
- Oversubscription of GPUs improved performance
- All threads running, w/ or w/o GPU
1. T1 owns the GPU
2. T1 checks workload on T2, finds it is larger
3. T1 donates ownership of the GPU to T2
4. T2 discovers that a GPU has been donated, switches to GPU execution
[Figure: timeline (t=1, t=2, t=3) of threads T1 and T2 under light and heavy load, with GPU ownership marked at steps 1-4]
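The four donation steps can be sketched as a single-threaded simulation. `GpuOwnership`, `maybe_donate`, and `device_for` are hypothetical names, and the workload integers are invented for illustration; a real multi-threaded version would need atomic updates or a lock around the ownership field, which this sketch omits.

```python
class GpuOwnership:
    """Tracks which thread currently owns the single (simulated) GPU."""

    def __init__(self, owner):
        self.owner = owner


def maybe_donate(gpu, me, workloads):
    """If `me` owns the GPU and another thread has a strictly larger workload,
    hand ownership to the most loaded thread. Returns the resulting owner."""
    if gpu.owner != me:
        return gpu.owner
    heaviest = max(workloads, key=workloads.get)
    if workloads[heaviest] > workloads[me]:
        gpu.owner = heaviest
    return gpu.owner


def device_for(gpu, me):
    """Each thread checks ownership to decide where its next loop runs."""
    return "gpu" if gpu.owner == me else "cpu"


workloads = {"T1": 3, "T2": 9}           # integer workload indicators (made up)
gpu = GpuOwnership(owner="T1")           # step 1: T1 owns the GPU
before = (device_for(gpu, "T1"), device_for(gpu, "T2"))
maybe_donate(gpu, "T1", workloads)       # steps 2-3: T1 peeks, then donates
after = (device_for(gpu, "T1"), device_for(gpu, "T2"))  # step 4: T2 switches
print(before, after)
```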

19 CPU+GPU w/ Dynamic Work Distribution
MIG: add support for multiple microjobs per node, each microjob assigned a device (GPU or multi-core CPU)
- GPUs naturally receive more work as they process microjobs faster
[Figure: the master node sends microjobs to a worker node, where the management process spawns one microjob process per device (GPU, CPU, GPU)]
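The effect that faster devices naturally receive more microjobs can be illustrated with a small event simulation. This is a sketch under stated assumptions, not the scheduler from the talk: `simulate_greedy` is a hypothetical helper, every microjob has the same cost, and the 4x GPU speed is an invented figure.

```python
import heapq


def simulate_greedy(job_costs, device_speeds):
    """Each device takes the next microjob from a shared queue as soon as it is idle.

    Returns (microjobs completed per device, time at which the last one finishes).
    """
    # Heap of (time at which the device becomes idle, device name).
    idle = [(0.0, name) for name in sorted(device_speeds)]
    heapq.heapify(idle)
    completed = {name: 0 for name in device_speeds}
    makespan = 0.0
    for cost in job_costs:
        start, name = heapq.heappop(idle)        # earliest idle device wins
        finish = start + cost / device_speeds[name]
        completed[name] += 1
        makespan = max(makespan, finish)
        heapq.heappush(idle, (finish, name))
    return completed, makespan


# Eight equal microjobs; the GPU process is assumed 4x faster than the CPU one.
completed, makespan = simulate_greedy([1.0] * 8, {"cpu": 1.0, "gpu": 4.0})
print(completed, makespan)
```

No explicit weighting is needed: the pull-based queue gives the faster device a larger share simply because it returns for work more often.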

21 Results (I)
Tests were run on 1 master node, 5 worker nodes
- 2 socket systems, 6 core Intel Xeon X
- NVIDIA M2090 per node
- Infiniband QDR
- Intel SW stack
- Panasas FS
- CUDA 5.0 V
- Compile flags: -gencode arch=compute_20,code=sm_20 -fmad=false

22 Results (II)
Overall wallclock, including I/O, initialization, etc.:

TT:
- Legacy (s): Set A (1.00x), Set B (1.00x)
- GPU only (s): (1.37x)
- Static Hybrid (s): Set A (1.09x), Set B (1.35x)
- Dynamic Hybrid (s): Set A (1.53x), Set B (1.78x)

MIG:
- Legacy (s): Set A (1.00x), Set B (1.00x)
- Dynamic Hybrid (s): Set A (1.03x), Set B (1.17x)

- Good speedup for TT
- Significant amount of sequential code remains
- Accelerated relatively small part of application
- Poor speedup for MIG
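The gap between kernel-level and whole-run speedup is what Amdahl's law predicts when only part of the runtime is accelerated. The sketch below is illustrative only: the accelerated fractions (10% and 90%) are hypothetical, since the slides do not report them, and 35.3x is the maximum MIG kernel speedup quoted later in the deck.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-run speedup when only a fraction of the original
    runtime is sped up by kernel_speedup and the rest is unchanged."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


# Hypothetical fractions of runtime spent in accelerated kernels.
s_small = overall_speedup(0.10, 35.3)  # most of the run stays sequential or I/O bound
s_large = overall_speedup(0.90, 35.3)  # most of the run is accelerated
print(round(s_small, 2), round(s_large, 2))
```

Even a very fast kernel yields a whole-run speedup close to 1x when sequential code and I/O dominate the wallclock.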

23 Results (III)
Further Investigation of TT Performance:
- Slowdown for 16/1596 shots, Min=0.84x, Max=3.65x
[Figure: Per-Shot Speedup, Data Set A (x-axis: Shots, y-axis: Speedup)]

24 Results (IV)
Further Investigation of MIG Performance:
- Recall the microjob pipeline: Stage 1 is compute bound, Stage 2 is I/O bound
[Figure: Microjobs Completed vs. Time (m) for Legacy Stage 1, Legacy Stage 2, Hybrid Stage 1, Hybrid Stage 2]

MIG Stage 1:
- Legacy (m):
- Dynamic Hybrid (m): Set A 1320 (3.73x), Set B 1200 (3.30x)

25 Results (V)
Further Investigation of MIG Performance:
- Many microjobs were too small to fully utilize computational resources
- Acceleration of actual computational kernel
  - 104 kernels on GPU, 22 kernels on CPU
  - Maximum speedup of 35.3x
[Figure: Speedup per MIG kernel (x-axis: MIG Kernels, y-axis: Speedup)]

27 Conclusions
- Full system utilization, no idle resources
  - CPU is always preparing work for GPU or doing useful computation
  - CUDA streams ensure GPU always has work
  - Dedicated communication threads maximize network utilization
- Performance improvements from GPUs for both MIG and TT
  - TT overall 1.8x, TT shots up to 3.65x, MIG kernels up to 35.3x
  - Limited by sequential code, I/O overhead
  - Work continues on optimizations
- New inter-node greedy work distribution for MIG
- New inter-thread device management algorithm by donation

28 Acknowledgements
- Repsol
- Mauricio Araya-Polo (now at Shell International E&P)
- Gladys Gonzalez
- NVIDIA
contact:
