Efficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migration


1 Efficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migration Max Grossman, Gladys Gonzalez, Mauricio Araya-Polo GTC, San Jose March 25, 2014

2 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation 5. Results 6. Conclusions

3 Motivation As heterogeneous clusters become the standard, full resource utilization becomes more challenging. No generally-applicable best practices for irregular workloads on heterogeneous systems. Kirchhoff is a great case study in optimizing holistic resource utilization on huge data sets (TBs).

4 The Problem at Hand (I) What is Kirchhoff? Subsurface imaging algorithm based on ray-tracing. 2 stages: traveltime (TT) computation and migration (MIG). Lots of divergent, irregular kernels.

5 The Problem at Hand (II) Builds on GTC 2013 scheduling work: greedy inter-node task system for job distribution. Two-stage pipeline of tasks: stage 1 is computation-bound, stage 2 is I/O-bound result aggregation tasks. For TT: single compute process per node; multiple threads per process negotiate access to GPUs and CPU cores by donation. For MIG: many compute processes per node, each assigned a single device (CPU or GPU). [Diagram: master node dispatches microjobs to worker nodes; each worker node's manager spawns worker processes]

6 The Problem at Hand (III) While previous work was evaluated on real-world datasets, no one ever requests support for smaller data. For example: 1. GPUs with ~3.5GB of global memory 2. Individual dense matrix size of ~2600MB 3. 5 matrices on-device = ~360% GPU memory utilization from these data structures alone (5 x 2600MB = 13,000MB, roughly 3.6x the ~3.5GB available). Clearly, something more clever is required to scale up to arbitrarily large datasets.

7 Solution Strategy Custom device memory management for TT and MIG, based on the characteristics of each algorithm. Focus on: maximizing utilization of device memory; minimizing CPU-GPU transfers; maximizing overlap (inter-device and inter-node); future-proofing against larger and larger datasets.

8 GPU Implementation (I) MIG Problem Each trace accesses a set of matrices, but a single trace has insufficient computation for the whole GPU. Hundreds to thousands of these matrices in total. Must manage transport of whole matrices to and from GPU. The matrices referenced by a trace can be pre-computed based on trace #.
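
Since the matrix set is a pure function of the trace number, the host can bucket traces into matrix-affine chunks before any device work starts. A minimal host-side sketch in CUDA-flavored C++, with a hypothetical matrices_for_trace helper standing in for the real geometry-derived mapping (not the authors' code):

    #include <map>
    #include <set>
    #include <vector>

    // Hypothetical stand-in: in the real application the table set a trace
    // touches is derived from survey geometry, but it remains a pure
    // function of the trace number.
    std::set<long> matrices_for_trace(long trace_id) {
        return { trace_id / 64, trace_id / 64 + 1 };  // e.g. two nearby tables
    }

    // Bucket traces so that traces referencing the same matrix set form one
    // chunk; each chunk can then run with all of its tables resident at once.
    std::map<std::set<long>, std::vector<long>> group_by_affinity(long n_traces) {
        std::map<std::set<long>, std::vector<long>> chunks;
        for (long t = 0; t < n_traces; t++)
            chunks[matrices_for_trace(t)].push_back(t);
        return chunks;
    }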

9 GPU Implementation (II) MIG - Memory Management Group traces into chunks based on matrix affinity. Pre-allocate on device: 1) objects for parameterizing a single trace chunk, and 2) slots for caching tables. [Diagram: static allocations in GPU global memory]
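
A minimal sketch of the slot pre-allocation idea, assuming a hypothetical fixed table size and headroom reservation (an illustration, not the production code):

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    #define TABLE_BYTES (64UL * 1024 * 1024)   // assumed fixed table slot size

    // Grab as many fixed-size table slots as the device can hold at startup,
    // leaving headroom for kernel-parameter objects and scratch space.
    std::vector<void*> preallocate_slots() {
        std::vector<void*> slots;
        size_t free_b = 0, total_b = 0;
        size_t headroom = 256UL * 1024 * 1024;  // assumed reservation
        cudaMemGetInfo(&free_b, &total_b);
        while (free_b > headroom + TABLE_BYTES) {
            void *p = nullptr;
            if (cudaMalloc(&p, TABLE_BYTES) != cudaSuccess) break;
            slots.push_back(p);
            cudaMemGetInfo(&free_b, &total_b);
        }
        printf("pre-allocated %zu table cache slots\n", slots.size());
        return slots;
    }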

10 GPU Implementation (III) MIG - Model Dedicated I/O and compute threads. The host side must manage: 1. Kernel objects 2. TT caching slots 3. CUDA streams, events. A working set object stores the work for a single kernel invocation: trace chunks, associated matrix cache slots, etc. Memory-mapped flags enable immediate resource release. Dynamic device selection based on resource availability. [Diagram legend: working set = CUDA stream + CUDA event + kernel parameters]
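
A minimal sketch of one plausible working-set layout with a zero-copy completion flag; the struct and field names are assumptions, not the authors' definitions:

    #include <cuda_runtime.h>

    struct WorkingSet {
        cudaStream_t  stream;     // all copies/kernels for this set queue here
        cudaEvent_t   done_evt;   // recorded after the kernel for coarse waits
        volatile int *done_host;  // host-mapped flag the kernel writes on exit,
        int          *done_dev;   //   letting the host recycle slots immediately
    };

    void init_working_set(WorkingSet *ws) {
        cudaStreamCreate(&ws->stream);
        cudaEventCreateWithFlags(&ws->done_evt, cudaEventDisableTiming);
        cudaHostAlloc((void**)&ws->done_host, sizeof(int), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&ws->done_dev, (void*)ws->done_host, 0);
        *ws->done_host = 0;  // host polls this instead of a full stream sync
    }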

11 GPU Implementation (IV) MIG - Cache Management Cache management uses reference counting to track in-use cache slots. When placing a set of tables on the GPU for a trace chunk, the host simply checks for tables already in-cache and for sufficient free slots. Exclusive device access and pre-determined table accesses imply no forced eviction. If resource allocation for a particular chunk fails (no more streams, events, cache slots, etc.): 1. The working set currently being populated is immediately launched on the GPU 2. The current trace chunk is executed multi-threaded on the CPU
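
A host-side sketch of that lookup under the stated assumptions (reference-counted slots, no forced eviction); the helper and field names are hypothetical:

    #include <unordered_map>
    #include <vector>

    struct Slot { int refs = 0; long table_uid = -1; };

    // Return the slot holding table 'uid', claiming a free slot on a miss, or
    // -1 when no slot is free -- the caller then launches the working set built
    // so far and executes the current trace chunk multi-threaded on the CPU.
    int claim_slot(std::vector<Slot> &slots,
                   std::unordered_map<long, int> &where, long uid) {
        auto it = where.find(uid);
        if (it != where.end()) {          // cache hit: bump the refcount
            slots[it->second].refs++;
            return it->second;
        }
        for (int i = 0; i < (int)slots.size(); i++) {
            if (slots[i].refs == 0) {     // free slot: reuse, never evict in-use
                where.erase(slots[i].table_uid);
                slots[i].refs = 1;
                slots[i].table_uid = uid;
                where[uid] = i;           // caller schedules the H2D table copy
                return i;
            }
        }
        return -1;                        // cache full: fall back to CPU path
    }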

12 GPU Implementation (V) TT Problem Small number of massive, 3D, global, read-only matrices shared by all tasks and all devices. Unpredictable access patterns. Multiple threads share each GPU kernel pipeline, each using different access patterns.

13 GPU Implementation (VI) TT - Cache Design Partition global matrices into sub-matrices of fixed size N x N x N. Pre-allocate as many cache slots as possible on GPU. On-device reference tracking: track all references to global matrices on-device by submatrix UIDs; track cache hits and misses. [Diagram: a global matrix tiled into N x N sub-matrices]
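
A minimal device-side sketch of UID-based reference tracking, assuming a linearized block index serves as the UID (N, the layout, and all names are assumptions):

    #define N 32   // assumed submatrix edge length

    // Map a global (x, y, z) access to the UID of its enclosing N^3 submatrix.
    __device__ long submatrix_uid(int x, int y, int z, int dimx, int dimy) {
        long nbx = (dimx + N - 1) / N;   // blocks along x
        long nby = (dimy + N - 1) / N;   // blocks along y
        return ((long)(z / N) * nby + y / N) * nbx + x / N;
    }

    // Each access bumps a per-submatrix counter; the host reads these back
    // between iterations to decide which tables to keep cached.
    __device__ void record_reference(unsigned *ref_counts, long uid) {
        atomicAdd(&ref_counts[uid], 1u);
    }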

14 GPU Implementation (VII) TT - Caching Policy Use an LRU-like caching policy on the host to manage cache slots. Multiple threads share a single device's memory. Iteratively call kernels until they converge on completion; refresh the cache on each iteration. Lock rlock during kernel launch to ensure a consistent mapping and slots.

    Given:
      - global submatrix -> cache slot mapping
      - thread-local reference history for every thread
      - all cache hits and misses for a kernel instance

    for each submatrix s:
        if s in hits or s in misses:
            mark s as MRU in history
        if s in misses and !s.is_cached:
            wlock.lock()
            if !s.is_cached:
                slot = select_victim(mapping, histories)
                update_slot(slot, s)
    if own(wlock):
        update_mapping(mapping)
        wlock.unlock()
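
One plausible select_victim for the LRU-like policy above, sketched on the host (all names hypothetical): evict the cached submatrix whose most recent use, across every thread's history, is oldest.

    #include <unordered_map>
    #include <vector>

    // histories[t][uid] = logical timestamp of thread t's last use of 'uid'.
    // 'mapping' maps cached submatrix uid -> slot index.
    int select_victim(const std::unordered_map<long, int> &mapping,
                      const std::vector<std::unordered_map<long, long>> &histories) {
        long best_time = -1;
        int victim = -1;
        for (auto &kv : mapping) {          // only cached submatrices are candidates
            long last = 0;
            for (auto &h : histories) {     // most recent use across all threads
                auto it = h.find(kv.first);
                if (it != h.end() && it->second > last) last = it->second;
            }
            if (victim < 0 || last < best_time) {
                best_time = last;           // oldest last-use seen so far
                victim = kv.second;
            }
        }
        return victim;                      // slot to repurpose (LRU victim)
    }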

15 Results (I) 2-socket systems, 6-core Intel Xeon X5675; NVIDIA K10; InfiniBand QDR; Intel SW stack; Panasas FS; CUDA 5.0; compiler flags: prec-sqrt=true, prec-div=true, fmad=false

16 Results (II) TT compares 1 core + 1 GPU against 16 CPU cores. MIG compares 16 cores + 1 GPU against 16 CPU cores. [Table: Data Set / Legacy Execution Time / GPU Execution Time / Speedup, for data sets TT A-G and MIG A; numeric values not preserved in the transcription] For comparison, GTC13 results showed ~3.5x TT speedup, ~1.15x MIG speedup.

17 Results (III) Correlation between points being processed and table cache behavior. [Chart: # tables evicted, # reported tables referenced, # reported cache misses, and # points processed, per time step]

18 Results (IV) Correlation between cache behavior and per-iteration execution time. [Chart: # tables evicted, # reported tables referenced, # reported cache misses, table cache capacity, and iteration execution time (s), per time step]

19 Conclusions This work expands on the full resource utilization work from GTC 2013. Performance improvements from GPUs for both MIG and TT, even with caching overheads. Kirchhoff on GPU is future-proofed against the next wave of ever-growing datasets.

20 Backup Slides

21 Results (II) Overall wallclock, including I/O, etc. (absolute times not preserved; relative speedups shown):

    TT                  Set A    Set B
    Legacy              1.00x    1.00x
    GPU only            0.83x    1.37x
    Static Hybrid       1.09x    1.35x
    Dynamic Hybrid      1.53x    1.78x

    MIG                 Set A    Set B
    Legacy              1.00x    1.00x
    Dynamic Hybrid      1.03x    1.17x

Good speedup for TT, though a significant amount of sequential code remains and only a relatively small part of the application was accelerated. Poor speedup for MIG.

22 Results (III) Further analysis of TT performance: slowdown for 16 of 1596 shots; Min=0.84x, Max=3.65x. [Chart: per-shot speedup, Data Set A]

23 Results (IV) Further analysis of MIG performance: recall the microjob pipeline; Stage 1 is compute-bound, Stage 2 is I/O-bound. [Chart: microjobs completed over time (m) for Legacy Stage 1, Legacy Stage 2, Hybrid Stage 1, Hybrid Stage 2] For MIG Stage 1, Dynamic Hybrid takes 1320 m on Set A (3.73x over Legacy) and 1200 m on Set B (3.30x).

24 Results (V) Further analysis of MIG performance: many microjobs were too small to fully utilize the resources of the device. For the actual kernels: 104 kernels ran on GPU, 22 kernels on CPU; maximum speedup of 35.3x. [Chart: speedup per MIG kernel]

25 Acknowledgements Repsol; Gladys Gonzalez; Mauricio Araya-Polo (now at Shell E&P); NVIDIA
