Hybrid Implementation of 3D Kirchhoff Migration
Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez
GTC, San Jose, March 19, 2013
Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation 5. CPU+GPU w/ Static Work Partitioning 6. CPU+GPU w/ Dynamic Work Distribution 7. Results 8. Conclusions
Motivation
- Heterogeneous clusters are becoming the standard
- Geophysical applications process huge data sets (TBs). Need to:
  - Fully utilize all devices (CPU & GPU)
  - Execute computational kernels on optimal hardware
  - Maximize utilization of network and device bandwidth
- No well-known best practices for irregular applications on heterogeneous systems
Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation 5. CPU+GPU w/ Static Work Partitioning 6. CPU+GPU w/ Dynamic Work Distribution 7. Results 8. Conclusions
The Problem at Hand (I)
What is Kirchhoff Migration?
- Subsurface imaging algorithm based on ray tracing
- 2 stages: traveltime (TT) computation and migration (MIG)
- Multi-node execution times measured in weeks
The Problem at Hand (II)
Interesting target for hybrid execution:
- Data structures: massive, dynamically sized; the legacy implementation requires pointer chasing
- Compute-intensive kernels, complex control flow, high register pressure
- Major I/O bottlenecks
Software engineering challenges:
- Massive code base
- Facilitate upgrades
- Constant numerical checks
The Problem at Hand (IV)
Existing software infrastructure (applies to MIG and TT):
[Diagram: (1) Setup (vel=, inline=, crossline=, nodes=) feeds a job launch on the Master Node; (2) the Master Node spawns microjobs on Worker Nodes, where a Management Process spawns a Microjob Process for execution.]
- Both use microjobs as the scheduling unit, 1 microjob per node
- Both TT & MIG use a two-stage pipeline of microjobs: Stage 1 is computation bound, Stage 2 is I/O bound
Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation 5. CPU+GPU w/ Static Work Partitioning 6. CPU+GPU w/ Dynamic Work Distribution 7. Results 8. Conclusions
Solution Strategy
Incremental porting approach:
1. Legacy CPU-only -> GPU-only
   - Support dynamically sized, pointer-based data structures
   - Usual GPU optimization strategies (shared memory, access coalescing, etc.)
   - Stabilize the application's numerical behavior
2. GPU-only -> CPU+GPU w/ static work partitioning
   - Support the same data layout efficiently on host & device
   - Modified I/O front-end w/ dedicated communication thread
3. CPU+GPU static -> CPU+GPU w/ dynamic work distribution
   - Different load-balancing algorithms for TT and MIG
   - New I/O back-end, utilizing the shared FS
Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation 5. CPU+GPU w/ Static Work Partitioning 6. CPU+GPU w/ Dynamic Work Distribution 7. Results 8. Conclusions
GPU Implementation (I)
TT: main computational kernel (tt_hwt)
- Ported from OpenMP to CUDA; the main loop runs thousands of iterations and has loop-carried dependencies
- SIMD parallelism across rays, with kernels containing complex control flow
- The ray-based wavefront is dynamically sized; each ray has multiple references to other rays
- Large amount of sequential computation remains
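To make the mapping concrete, here is a minimal CUDA sketch of the per-ray parallelism described above; the Ray layout, the field names, and the update step are illustrative assumptions, not the actual tt_hwt code:

    // Hypothetical sketch: one CUDA thread advances one ray of the wavefront.
    struct Ray {
        float x, y, z;       // current position
        float px, py, pz;    // direction / slowness components
        int   left, right;   // indices of neighboring rays (the legacy
                             // pointers become indices on the device)
    };

    __global__ void tt_step(Ray *rays, int num_rays, float dt) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_rays) return;        // wavefront is dynamically sized
        Ray r = rays[i];
        r.x += r.px * dt;                 // advance this ray one time step;
        r.y += r.py * dt;                 // the real kernel has far more
        r.z += r.pz * dt;                 // complex, divergent control flow
        rays[i] = r;
    }

Each thread owns one ray, so divergence between rays (the complex control flow mentioned above) is the main performance hazard.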
GPU Implementation (II)
MIG: main computational kernel (kern_corr)
- A large source of overhead; the main CUDA loop runs hundreds of iterations and has loop-carried dependencies
- Parallelize across the x,y physical dimensions
- Accumulate work from multiple traces to expose sufficient parallelism
- Data size > GPU memory, so host-managed caching in device memory is used (see the sketch below)
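Because the data exceeds device memory, a host-side cache decides which blocks are resident on the GPU. A minimal sketch, assuming LRU replacement and a fixed slot count; NSLOTS, the block granularity, and all names are illustrative, not the deck's actual code:

    #include <cuda_runtime.h>

    #define NSLOTS 8                  // device-resident blocks (assumption)
    static void *d_slot[NSLOTS];      // device buffers, one block each
    static int   slot_block[NSLOTS];  // which trace block a slot holds
    static long  slot_used[NSLOTS];   // last-use tick for LRU eviction
    static long  tick = 0;

    void cache_init(size_t block_bytes) {
        for (int s = 0; s < NSLOTS; s++) {
            cudaMalloc(&d_slot[s], block_bytes);
            slot_block[s] = -1;       // slot starts empty
            slot_used[s] = 0;
        }
    }

    void *cache_get(int block, const void *h_block, size_t block_bytes) {
        int lru = 0;
        for (int s = 0; s < NSLOTS; s++) {
            if (slot_block[s] == block) {     // hit: data already resident
                slot_used[s] = ++tick;
                return d_slot[s];
            }
            if (slot_used[s] < slot_used[lru]) lru = s;
        }
        // Miss: evict the least-recently-used slot and stage the block.
        cudaMemcpy(d_slot[lru], h_block, block_bytes, cudaMemcpyHostToDevice);
        slot_block[lru] = block;
        slot_used[lru] = ++tick;
        return d_slot[lru];
    }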
GPU Implementation (III)
Optimization Techniques:
- Maximize use of non-blocking CUDA streams and events (sketched below)
- Use the massively parallel GPU to re-order arrays, improving memory coalescing for later kernels
- Track live and modified data to reduce or eliminate copies
- Focus on divergent code sections and try to maximize shared code paths
- Shared memory used, particularly for kernels with high register pressure
- Brute-force optimization of block sizes: no natural division of work between blocks, but tweaking block size produced significant speedups
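As an illustration of the stream pattern, a double-buffered sketch that stages the next trace while the previous kernel runs; kern_corr_kernel's signature, the buffer sizes, and the launch configuration are assumptions:

    #include <cuda_runtime.h>

    // Placeholder declaration; the real correlation kernel is not shown here.
    __global__ void kern_corr_kernel(const float *trace, float *image, int n);

    // h_trace buffers should be pinned (cudaHostAlloc) for truly async copies.
    void migrate_traces(float **h_trace, int ntraces, int n, float *d_image) {
        cudaStream_t stream[2];
        float *d_trace[2];
        for (int s = 0; s < 2; s++) {
            cudaStreamCreate(&stream[s]);
            cudaMalloc(&d_trace[s], n * sizeof(float));
        }
        for (int i = 0; i < ntraces; i++) {
            cudaStream_t st = stream[i % 2];
            // Copy trace i in one stream while the other stream's kernel
            // runs; work within a stream stays ordered, streams overlap.
            cudaMemcpyAsync(d_trace[i % 2], h_trace[i], n * sizeof(float),
                            cudaMemcpyHostToDevice, st);
            kern_corr_kernel<<<(n + 255) / 256, 256, 0, st>>>
                (d_trace[i % 2], d_image, n);
        }
        cudaDeviceSynchronize();   // wait for all streams before reusing data
    }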
Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation 5. CPU+GPU w/ Static Work Partitioning 6. CPU+GPU w/ Dynamic Work Distribution 7. Results 8. Conclusions
CPU+GPU w/ Static Work Partitioning
TT: OpenMP parallelization across shots
- Given N CPU cores and M GPUs, launch N OpenMP threads (#pragma omp parallel for)
- Threads 0 through M-1 use GPUs to accelerate the parallel loops over rays within each shot
MIG: partition the x,y dimensions within traces across devices and threads
- Dedicated input communication thread buffers tasks
A minimal sketch of the TT split follows.
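This is a sketch of the static split under stated assumptions: process_shot_gpu and process_shot_cpu are hypothetical per-shot entry points standing in for the real code paths.

    #include <omp.h>
    #include <cuda_runtime.h>

    void process_shot_gpu(int shot);   // hypothetical per-shot entry points
    void process_shot_cpu(int shot);

    // N OpenMP threads process shots; the first M threads each bind one GPU.
    void process_all_shots(int nshots, int M /* number of GPUs */) {
        #pragma omp parallel for
        for (int shot = 0; shot < nshots; shot++) {
            int tid = omp_get_thread_num();
            if (tid < M) {
                cudaSetDevice(tid);       // threads 0..M-1 own one GPU each
                process_shot_gpu(shot);   // ray loops offloaded to the GPU
            } else {
                process_shot_cpu(shot);   // remaining threads stay on CPU
            }
        }
    }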
Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation 5. CPU+GPU w/ Static Work Partitioning 6. CPU+GPU w/ Dynamic Work Distribution 7. Results 8. Conclusions
CPU+GPU w/ Dynamic Work Distribution
TT: threads peek at the workload of other threads and donate GPUs to threads with more computation
- An integer value per thread indicates its workload
- Oversubscription of GPUs improved performance
- All threads keep running, w/ or w/o a GPU
Donation protocol:
1. T1 owns the GPU
2. T1 checks the workload on T2 and finds it is larger
3. T1 donates ownership of the GPU to T2
4. T2 discovers that a GPU has been donated and switches to GPU execution
[Diagram: timeline of threads T1 and T2, showing the GPU (light vs. heavy load) passing from T1 to T2 across t=1..3]
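A minimal sketch of this donation protocol using C11 atomics; NTHREADS, the polling points, and all names are illustrative, since the deck does not show the actual implementation:

    #include <stdatomic.h>

    #define NTHREADS 12              // assumption: one thread per core
    atomic_int workload[NTHREADS];   // each thread publishes an integer load
    atomic_int gpu_owner;            // thread id that currently owns the GPU

    // Called periodically by the current owner (steps 2 and 3 above).
    void maybe_donate_gpu(int me) {
        if (atomic_load(&gpu_owner) != me) return;
        int best = me;
        for (int t = 0; t < NTHREADS; t++)      // peek at peers' workloads
            if (atomic_load(&workload[t]) > atomic_load(&workload[best]))
                best = t;
        if (best != me)
            atomic_store(&gpu_owner, best);     // donate; the peer notices
    }

    // Polled by every thread between work items (step 4 above).
    int should_use_gpu(int me) {
        return atomic_load(&gpu_owner) == me;
    }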
CPU+GPU w/ Dynamic Work Distribution
MIG: add support for multiple microjobs per node, each microjob assigned a device (GPU or multi-core CPU)
- GPUs naturally receive more work, as they process microjobs faster
[Diagram: Master Node sends microjobs to Worker Nodes; each node's Management Process spawns one Microjob Process per device (GPU, CPU, GPU)]
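The per-device spawning could look roughly like the following fork/exec sketch; the ./microjob binary, its flags, and the use of CUDA_VISIBLE_DEVICES are assumptions for illustration, not the deck's actual mechanism:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    // Spawn one microjob process per device on this node; faster (GPU)
    // processes simply finish and request new microjobs sooner.
    void spawn_microjobs(int ngpus, int ncpu_jobs) {
        for (int g = 0; g < ngpus; g++) {
            if (fork() == 0) {
                char dev[16];
                snprintf(dev, sizeof(dev), "%d", g);
                setenv("CUDA_VISIBLE_DEVICES", dev, 1); // pin child to GPU g
                execlp("./microjob", "microjob", "--device=gpu", (char *)NULL);
            }
        }
        for (int c = 0; c < ncpu_jobs; c++) {
            if (fork() == 0)
                execlp("./microjob", "microjob", "--device=cpu", (char *)NULL);
        }
        while (wait(NULL) > 0) ;   // management process reaps its children
    }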
Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation 5. CPU+GPU w/ Static Work Partitioning 6. CPU+GPU w/ Dynamic Work Distribution 7. Results 8. Conclusions
Results (I)
Tests ran on 1 master node and 5 worker nodes:
- 2-socket systems with 6-core Intel Xeon X5675 CPUs
- 3 NVIDIA M2090 GPUs per node
- InfiniBand QDR
- Intel SW stack
- Panasas FS
- CUDA 5.0 V0.2.1221, built with -gencode arch=compute_20,code=sm_20 -fmad=false
Results (II)
Overall wallclock, including I/O, initialization, etc.:

TT (s)              Set A               Set B
Legacy              54093.72 (1.00x)    15683.16 (1.00x)
GPU only            -                   11453.69 (1.37x)
Static Hybrid       49818.889 (1.09x)   11602.262 (1.35x)
Dynamic Hybrid      35434.645 (1.53x)   8832.886 (1.78x)

MIG (s)             Set A               Set B
Legacy              5304.11 (1.00x)     4183.47 (1.00x)
Dynamic Hybrid      5174.47 (1.03x)     3563.60 (1.17x)

- Good speedup for TT, though a significant amount of sequential code remains
- Poor speedup for MIG: only a relatively small part of the application was accelerated
Results (III)
Further investigation of TT performance:
- Slowdown for 16 of 1596 shots; Min = 0.84x, Max = 3.65x
[Figure: Per-Shot Speedup, Data Set A; x-axis: shots, y-axis: speedup, 0x to 4x]
Results (IV)
Further investigation of MIG performance:
- Recall the microjob pipeline: Stage 1 is compute bound, Stage 2 is I/O bound
[Figure: microjobs completed vs. time (m) for Legacy Stage 1, Legacy Stage 2, Hybrid Stage 1, Hybrid Stage 2]

MIG Stage 1 (m)     Set A           Set B
Legacy              4920            3960
Dynamic Hybrid      1320 (3.73x)    1200 (3.30x)
Results (V)
Further investigation of MIG performance:
- Many microjobs were too small to fully utilize computational resources
- Acceleration of the actual computational kernel: 104 kernels ran on the GPU, 22 on the CPU, with a maximum speedup of 35.3x
[Figure: per-kernel speedup across MIG kernels, 0x to 40x]
Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation 5. CPU+GPU w/ Static Work Partitioning 6. CPU+GPU w/ Dynamic Work Distribution 7. Results 8. Conclusions
Conclusions
- Full system utilization, no idle resources:
  - The CPU is always preparing work for the GPU or doing useful computation
  - CUDA streams ensure the GPU always has work
  - Dedicated communication threads maximize network utilization
- Performance improvements from GPUs for both MIG and TT:
  - TT overall 1.8x, TT shots up to 3.65x, MIG kernels up to 35.3x
  - Limited by sequential code and I/O overhead
- Work continues on optimizations:
  - New inter-node greedy work distribution for MIG
  - New inter-thread device-management algorithm based on donation
Acknowledgements
- Repsol
- Mauricio Araya-Polo (now at Shell International E&P)
- Gladys Gonzalez
- NVIDIA
Contact: jmaxg3@gmail.com