Hybrid Implementation of 3D Kirchhoff Migration

Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez
GTC, San Jose, March 19, 2013

Agenda
1. Motivation
2. The Problem at Hand
3. Solution Strategy
4. GPU Implementation
5. CPU+GPU w/ Static Work Partitioning
6. CPU+GPU w/ Dynamic Work Distribution
7. Results
8. Conclusions

Motivation
- Heterogeneous clusters are becoming the standard
- Geophysical applications process huge data sets (TBs). They need to:
  - Fully utilize all devices (CPU & GPU)
  - Execute computational kernels on the optimal hardware
  - Maximize utilization of network and device bandwidth
- There are no well-known best practices for irregular applications on heterogeneous systems

The Problem at Hand (I)
What is Kirchhoff Migration?
- A subsurface imaging algorithm based on ray tracing
- Two stages: traveltime (TT) computation and migration (MIG)
- Multi-node execution times are measured in weeks

The Problem at Hand (II)
An interesting target for hybrid execution:
- Data structures are massive and dynamically sized; the legacy implementation requires pointer chasing
- Compute-intensive kernels with complex control flow and high register pressure
- Major I/O bottlenecks
Software engineering challenges:
- Massive code base
- Must facilitate upgrades
- Constant numerical checks

The Problem at Hand (IV)
Existing software infrastructure (applies to MIG and TT):
[Figure: a setup file (vel=, inline=, crossline=, nodes=) drives a job launch on the Master Node, which spawns microjobs; each Worker Node runs a Management Process and Microjob Processes over blocks of the data]
- Both use microjobs as the scheduling unit, 1 microjob per node
- Both TT & MIG use a two-stage pipeline of microjobs: Stage 1 is computation-bound, Stage 2 is I/O-bound
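
A minimal sketch of this two-stage shape, assuming a simple producer/consumer queue; the names and the printf stand-in for real I/O are illustrative, not the production infrastructure:

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::queue<int> done;               // ids of computed work units
    std::mutex m;
    std::condition_variable cv;
    bool compute_finished = false;

    // Stage 1: computation-bound producer.
    void stage1_compute(int nunits) {
        for (int id = 0; id < nunits; id++) {
            // ... numerical kernel for unit `id` would run here ...
            { std::lock_guard<std::mutex> lk(m); done.push(id); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); compute_finished = true; }
        cv.notify_one();
    }

    // Stage 2: I/O-bound consumer.
    void stage2_io() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !done.empty() || compute_finished; });
            if (done.empty()) break;    // producer finished and queue drained
            int id = done.front(); done.pop();
            lk.unlock();
            printf("writing result %d to the shared FS\n", id); // stands in for real I/O
        }
    }

    int main() {
        std::thread t1(stage1_compute, 8), t2(stage2_io);
        t1.join(); t2.join();
        return 0;
    }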

Solution Strategy
Incremental porting approach:

Legacy CPU-only -> GPU-only:
- Support dynamically sized, pointer-based data structures
- Usual GPU optimization strategies (shared memory, access coalescing, etc.)
- Stabilize the application's numerical instability

GPU-only -> CPU+GPU w/ static work partitioning:
- Support the same data layout efficiently on host & device
- Modified I/O front-end w/ a dedicated communication thread

CPU+GPU static -> CPU+GPU w/ dynamic work distribution:
- Different load-balancing algorithms for TT and MIG
- New I/O back-end that utilizes the shared FS

GPU Implementation (I)
TT: main computational kernel (tt_hwt)
- OpenMP: loop-carried dependencies
- CUDA, thousands of iterations: SIMD parallelism across rays; kernels with complex control flow
- The ray-based wavefront is dynamically sized, and each ray has multiple references to others
- Large amount of sequential computation
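
As an illustration of SIMD parallelism across rays, here is a hypothetical CUDA kernel that advances one ray per thread over a dynamically sized wavefront; the Ray struct and the update step are assumptions, not the actual tt_hwt kernel:

    #include <cuda_runtime.h>

    struct Ray { float x, y, z, t; };   // simplified per-ray state

    // One CUDA thread advances one ray of the wavefront.
    __global__ void advance_rays(Ray* rays, int nrays, float dt) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nrays) return;
        Ray r = rays[i];
        // The real kernel has divergent, per-ray control flow here.
        r.z += dt;                      // placeholder for the traveltime update
        r.t += dt;
        rays[i] = r;
    }

    // The wavefront is dynamically sized, so the grid size follows nrays.
    void advance_wavefront(Ray* d_rays, int nrays, float dt) {
        const int block = 256;
        int grid = (nrays + block - 1) / block;
        advance_rays<<<grid, block>>>(d_rays, nrays, dt);
    }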

GPU Implementation (II)
MIG: main computational kernel (kern_corr)
- Large source of overhead
- CUDA, hundreds of iterations: loop-carried dependencies
- Parallelize across the x,y physical dimensions
- Accumulate work from multiple traces for sufficient parallelism
- Data size > GPU memory, so host-managed caching in device memory is used
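
The slides do not describe the caching scheme itself, so here is one plausible sketch of host-managed caching in device memory, assuming fixed-size blocks and an LRU eviction policy (both assumptions):

    #include <cuda_runtime.h>
    #include <list>
    #include <unordered_map>
    #include <utility>

    // Host-side LRU cache of fixed-size blocks resident in device memory.
    struct DeviceCache {
        size_t block_bytes, max_blocks;
        std::list<int> lru;  // block ids, most recently used at the front
        std::unordered_map<int, std::pair<void*, std::list<int>::iterator>> blocks;

        DeviceCache(size_t bytes, size_t nblocks)
            : block_bytes(bytes), max_blocks(nblocks) {}

        // Return a device pointer for block `id`, copying from `host_src` on a miss.
        void* get(int id, const void* host_src) {
            auto it = blocks.find(id);
            if (it != blocks.end()) {            // hit: refresh LRU position
                lru.erase(it->second.second);
                lru.push_front(id);
                it->second.second = lru.begin();
                return it->second.first;
            }
            void* dev;
            if (blocks.size() == max_blocks) {   // full: evict LRU block, reuse its buffer
                int victim = lru.back(); lru.pop_back();
                dev = blocks[victim].first;
                blocks.erase(victim);
            } else {
                cudaMalloc(&dev, block_bytes);
            }
            cudaMemcpy(dev, host_src, block_bytes, cudaMemcpyHostToDevice);
            lru.push_front(id);
            blocks[id] = std::make_pair(dev, lru.begin());
            return dev;
        }
    };

On a miss the evicted block's device buffer is reused, so the cache never holds more than max_blocks * block_bytes of device memory.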

GPU Implementation (III)
Optimization Techniques:
- Maximize use of non-blocking CUDA streams and events (sketched below)
- Use the massively parallel GPU to re-order arrays for improved memory coalescing in future kernels
- Track live and modified data to reduce or eliminate copies
- Focus on divergent code sections; try to maximize shared code
- Shared memory used, particularly for kernels with high register pressure
- Brute-force optimization of block sizes: no natural division of work between blocks, but tweaking the block size produced significant speedups
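
A small sketch of the non-blocking stream pattern from the first bullet: chunks of pinned host memory are copied in, processed, and copied back on independent streams so transfers for one chunk overlap compute for another. kern_corr_stub is a placeholder, not the real migration kernel:

    #include <cuda_runtime.h>

    __global__ void kern_corr_stub(float* d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;              // placeholder work
    }

    int main() {
        const int n = 1 << 20, chunks = 4, cn = n / chunks;
        float *h, *d;
        cudaHostAlloc((void**)&h, n * sizeof(float), cudaHostAllocDefault); // pinned, required for async copies
        cudaMalloc((void**)&d, n * sizeof(float));
        for (int i = 0; i < n; i++) h[i] = 1.0f;

        cudaStream_t s[chunks];
        for (int c = 0; c < chunks; c++)
            cudaStreamCreateWithFlags(&s[c], cudaStreamNonBlocking);

        // Each chunk's copy-in, kernel, and copy-out are issued on its own stream.
        for (int c = 0; c < chunks; c++) {
            float *hc = h + c * cn, *dc = d + c * cn;
            cudaMemcpyAsync(dc, hc, cn * sizeof(float), cudaMemcpyHostToDevice, s[c]);
            kern_corr_stub<<<(cn + 255) / 256, 256, 0, s[c]>>>(dc, cn);
            cudaMemcpyAsync(hc, dc, cn * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
        }
        cudaDeviceSynchronize();
        for (int c = 0; c < chunks; c++) cudaStreamDestroy(s[c]);
        cudaFree(d); cudaFreeHost(h);
        return 0;
    }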

CPU+GPU w/ Static Work Partitioning
TT: OpenMP parallelization across shots
- Given N CPU cores and M GPUs, launch N OpenMP threads (#pragma omp parallel for)
- Threads 0 through M-1 use GPUs to accelerate the parallel loops over rays within each shot (see the sketch below)
MIG: partition the x,y dimensions within traces across devices and threads
- A dedicated input communication thread buffers tasks
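
A minimal sketch of the TT scheme above, assuming hypothetical process_shot_gpu/process_shot_cpu helpers in place of the real TT routines: N OpenMP threads split the shots, and threads 0 through M-1 each bind one GPU:

    #include <cuda_runtime.h>
    #include <omp.h>

    void process_shot_cpu(int shot) { /* CPU ray loops would run here */ }
    void process_shot_gpu(int shot) { /* CUDA ray kernels would launch here */ }

    void traveltime(int num_shots) {
        int num_gpus = 0;
        cudaGetDeviceCount(&num_gpus);       // M GPUs

        // N OpenMP threads split the shots; the first M threads each own a GPU.
        #pragma omp parallel for
        for (int shot = 0; shot < num_shots; shot++) {
            int tid = omp_get_thread_num();
            if (tid < num_gpus) {
                cudaSetDevice(tid);          // static thread-to-GPU binding
                process_shot_gpu(shot);
            } else {
                process_shot_cpu(shot);      // remaining threads stay on the CPU
            }
        }
    }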

CPU+GPU w/ Dynamic Work Distribution
TT: threads peek at the workload of other threads and donate GPUs to threads with more computation
- An integer value indicates each thread's workload
- Oversubscription of the GPUs improved performance
- All threads keep running, with or without a GPU
Donation protocol (shown in the original figure as threads T1 and T2 trading a GPU over time under light/heavy loads):
1. T1 owns the GPU
2. T1 checks the workload on T2 and finds it is larger
3. T1 donates ownership of the GPU to T2
4. T2 discovers that a GPU has been donated and switches to GPU execution
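
A sketch of how this donation could be coded, assuming shared atomic variables; the slides only specify that an integer value indicates workload, so the data structures and names here are illustrative:

    #include <atomic>

    const int MAX_THREADS = 64;
    std::atomic<int> workload[MAX_THREADS];  // per-thread integer workload estimate
    std::atomic<int> gpu_owner(0);           // thread id that currently owns the GPU

    // Called periodically by the thread that owns the GPU: peek at peers'
    // workloads and donate the device to the heaviest one.
    void maybe_donate(int my_tid, int nthreads) {
        int heaviest = my_tid;
        for (int t = 0; t < nthreads; t++)
            if (workload[t].load() > workload[heaviest].load())
                heaviest = t;
        if (heaviest != my_tid)
            gpu_owner.store(heaviest);       // step 3: donate ownership
    }

    // Each thread checks this before its next chunk of work (step 4): if it
    // now owns the GPU it switches to GPU execution, otherwise it keeps
    // computing on the CPU, so no thread ever idles.
    bool owns_gpu(int my_tid) { return gpu_owner.load() == my_tid; }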

CPU+GPU w/ Dynamic Work Distribution
MIG: add support for multiple microjobs per node, with each microjob assigned a device (GPU or multi-core CPU)
- GPUs naturally receive more work because they process microjobs faster
[Figure: the Master Node spawns microjobs; each Worker Node's Management Process spawns one Microjob Process per device (GPU, CPU, GPU)]
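
A sketch of the spawning step in the diagram, assuming a POSIX fork/exec model and a hypothetical ./microjob binary that takes its assigned device as an argument (the device list mirrors the 3-GPU nodes in the Results section):

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        // One microjob process per device on this node: three GPUs plus the CPU cores.
        const char* devices[] = { "gpu0", "gpu1", "gpu2", "cpu" };
        const int ndev = 4;
        for (int d = 0; d < ndev; d++) {
            pid_t pid = fork();
            if (pid == 0) {   // child becomes a microjob process bound to one device
                execlp("./microjob", "./microjob", devices[d], (char*)NULL);
                _exit(1);     // only reached if exec fails
            }
        }
        while (wait(NULL) > 0) {}  // the management process reaps finished microjobs
        return 0;
    }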

Results (I)
Tests ran on 1 master node and 5 worker nodes:
- 2-socket systems with 6-core Intel Xeon X5675 CPUs
- 3 NVIDIA M2090 GPUs per node
- InfiniBand QDR
- Intel SW stack
- Panasas FS
- CUDA 5.0 V0.2.1221, -gencode arch=compute_20,code=sm_20 --fmad=false

Results (II)
Overall wallclock, including I/O, initialization, etc.:

TT                    Set A                Set B
Legacy (s)            54093.72  (1.00x)    15683.16  (1.00x)
GPU only (s)          --                   11453.69  (1.37x)
Static Hybrid (s)     49818.889 (1.09x)    11602.262 (1.35x)
Dynamic Hybrid (s)    35434.645 (1.53x)    8832.886  (1.78x)

MIG                   Set A                Set B
Legacy (s)            5304.11   (1.00x)    4183.47   (1.00x)
Dynamic Hybrid (s)    5174.47   (1.03x)    3563.60   (1.17x)

- Good speedup for TT, though a significant amount of sequential code remains
- Poor speedup for MIG: only a relatively small part of the application was accelerated

Results (III)
Further investigation of TT performance:
- Slowdown for 16 of 1596 shots; Min = 0.84x, Max = 3.65x
[Figure: Per-Shot Speedup, Data Set A; speedup (0 to 4x) per shot]

Results (IV)
Further investigation of MIG performance:
- Recall the microjob pipeline: Stage 1 is compute-bound, Stage 2 is I/O-bound
[Figure: microjobs completed (0-100) vs. time (0-140 m) for Legacy Stage 1, Legacy Stage 2, Hybrid Stage 1, and Hybrid Stage 2]

MIG Stage 1           Set A            Set B
Legacy (m)            4920             3960
Dynamic Hybrid (m)    1320 (3.73x)     1200 (3.30x)

Results (V)
Further investigation of MIG performance:
- Many microjobs were too small to fully utilize computational resources
- Acceleration of the actual computational kernel: 104 kernels ran on the GPU, 22 on the CPU
- Maximum speedup of 35.3x
[Figure: per-kernel speedup (0 to 40x) across the 126 MIG kernels]

Conclusions
- Full system utilization, no idle resources:
  - The CPU is always preparing work for the GPU or doing useful computation
  - CUDA streams ensure the GPU always has work
  - Dedicated communication threads maximize network utilization
- Performance improvements from GPUs for both MIG and TT:
  - TT overall 1.8x; TT shots up to 3.65x; MIG kernels up to 35.3x
  - Limited by sequential code and I/O overhead
- Work continues on optimizations:
  - A new inter-node greedy work distribution for MIG
  - A new inter-thread device management algorithm based on donation

Acknowledgements
- Repsol
- Mauricio Araya-Polo (now at Shell International E&P)
- Gladys Gonzalez
- NVIDIA
Contact: jmaxg3@gmail.com