ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

Ferdinando Alessi, Annalisa Massini, Roberto Basili (INGV)

Introduction

The simulation of wave propagation is a typical context demanding massive computation. The COMPSYN package is a set of applications designed for producing the synthetic seismograms used in evaluating seismic hazard and risk. COMPSYN is time-consuming, so we explore the possibility of accelerating it by using a cluster of multicore processors with multiple GPUs.

Reference: P. Spudich and L. Xu, "Documentation of software package COMPSYN sxv3.11: programs for earthquake ground motion calculation using complete 1-D Green's functions", International Handbook of Earthquake and Engineering Seismology, 2002.

Motivations

The study of seismic wave propagation allows us to formulate models describing the motion of the Earth's surface and to foresee the consequences of earthquakes. The accuracy of a model is verified by comparing:
- synthetic seismograms produced by a simulation
- real seismograms obtained from seismometers

COMPSYN

Three input files:
- *.old: Earth velocity structure
- *.tfd: fault geometry definition and observer locations
- *.sld: earthquake slip distribution

Five applications: OLSON, XOLSON, TFAULT, SLIP, SEESLO
- OLSON calculates Green's functions in the frequency and wavenumber domain for laterally homogeneous velocity structures
- TFAULT computes Green's functions evaluated for surface observers as functions of position, by Bessel-transforming the OLSON output rearranged by XOLSON

Parallelization

The TFAULT module has been selected for parallelization:
- TFAULT is one of the most time-consuming modules
- TFAULT may be required to run many times

Parallelization of TFAULT:
- translation from Fortran to C
- analysis of the TFAULT code and its structure
- parallelization on the host side: OpenMP to exploit the power of the multicore processors, MPI to allow execution on the cluster
- parallelization on the device side (GPU): CUDA

Determining the kernel

TFAULT computes the tractions on a grid of points on the fault plane, for each observer. The structure of TFAULT is highly parallel and consists of five nested loops:
- Loop 1: loop over the observers
- Loop 2: loop over the frequencies
- Loop 3: loop over the lines of constant depth on the fault surface
- Loop 4: loop over the sample points on lines of constant depth on the fault surface
- Loop 5: loop over the wavenumbers
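To make the loop structure concrete, here is a minimal C sketch of the nest; the bounds and the accumulation helper are illustrative names introduced here, not identifiers from the COMPSYN source:

    /* Illustrative sketch of the TFAULT loop nest; all names hypothetical.
     * accumulate_traction() stands in for the per-wavenumber computation. */
    void tfault_loops(int n_observers, int n_frequencies, int n_depth_lines,
                      const int *n_points, int n_wavenumbers)
    {
        for (int obs = 0; obs < n_observers; obs++)              /* Loop 1 */
            for (int f = 0; f < n_frequencies; f++)              /* Loop 2 */
                for (int line = 0; line < n_depth_lines; line++) /* Loop 3 */
                    for (int p = 0; p < n_points[line]; p++)     /* Loop 4 */
                        for (int k = 0; k < n_wavenumbers; k++)  /* Loop 5 */
                            accumulate_traction(obs, f, line, p, k);
    }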

Determining the kernel

Loop 4 is the best candidate for becoming the kernel:
- Loop 5 presents dependencies among iterations
- Loop 1 and Loop 2 are too complex
- Loop 3 is simpler, but has too few iterations to harness the GPUs and presents many memory accesses
- Loop 4 has more iterations than the other loops, but still not enough to fully exploit the GPU capabilities

We therefore group several instances of Loop 4 into the same GPU grid; the grouped instances correspond to consecutive iterations of Loop 3.

Determining the kernel

Loop 3 is split into two phases:
- the first phase gathers the information needed by each iteration for its computing tasks on the GPU
- the second phase consists of a sequence of grid launches

Note that the number of grid launches:
- is determined at runtime by the number of iterations of Loop 3
- is much lower than the number of iterations of Loop 3

In this way the number of CUDA threads executed per grid is increased.
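The following host-side sketch illustrates the two-phase scheme; TARGET_THREADS, BLOCK, the staging buffers and the helper functions are assumptions introduced for illustration, not COMPSYN code:

    int line = 0;
    while (line < n_depth_lines) {
        int group = 0, threads = 0;
        /* Phase 1: gather the data of consecutive Loop 3 iterations until
         * the grid is large enough to keep the GPU busy. */
        while (line + group < n_depth_lines && threads < TARGET_THREADS) {
            gather_line_data(h_staging, line + group);
            threads += n_points[line + group];
            group++;
        }
        cudaMemcpy(d_staging, h_staging, staging_bytes(group),
                   cudaMemcpyHostToDevice);
        /* Phase 2: one grid launch covers the whole group of Loop 4 instances. */
        int blocks = (threads + BLOCK - 1) / BLOCK;
        loop4_kernel<<<blocks, BLOCK>>>(d_staging, d_out, threads);
        line += group;
    }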

Execution configuration

Great attention is required in the choice of the size and number of blocks. In this choice we considered:
- the available multiprocessors
- the dimension of the warp and the size of the grid
- the shared memory required by each block
- the occupancy (active threads / number of supported threads)
- the block sizes known to give the best performance (64-256)

We choose the largest block size that yields the fewest idle (void) CUDA threads among the candidate values 64, 128 and 192.
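This selection rule can be expressed as a small host-side helper; a sketch of the stated rule, using the candidate sizes 64, 128 and 192 from the slide:

    #include <limits.h>

    /* Among 64, 128 and 192, return the largest block size that leaves the
     * fewest idle (void) threads in the last, partially filled block. */
    static int pick_block_size(int n_threads)
    {
        const int candidates[3] = { 64, 128, 192 };
        int best = 64, best_idle = INT_MAX;
        for (int i = 0; i < 3; i++) {
            int b = candidates[i];
            int idle = (b - n_threads % b) % b;  /* padding in the last block */
            if (idle < best_idle || (idle == best_idle && b > best)) {
                best = b;
                best_idle = idle;
            }
        }
        return best;
    }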

Memory accesses

The design of memory transfers is a very important issue. We have:
- host memory
- several memory spaces on the device (global, constant, texture, and per-block shared memory)

Data to transfer:
- the Bessel lookup table
- the vertical and horizontal arrays of weights for linear interpolation among values in the Bessel lookup table
- support arrays
- the *.xoo file
- the output array

[Figure: the CUDA memory hierarchy, showing the host and, on the device, the grid of blocks with per-block shared memory and threads, plus global, constant and texture memory.]

Memory accesses

Host-to-device transfers must be minimized:
- the *.xoo file and the data related to the Bessel functions must be transferred from host to device
- to reduce the impact of these transfers, we batch several transfers into one large transfer (see the sketch below)
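A sketch of the batching idea (buffer names are hypothetical, and h_batch/d_batch are assumed to be preallocated char buffers): rather than issuing one cudaMemcpy per array, the arrays are packed into a single contiguous host buffer and shipped in one transfer:

    /* Pack the Bessel lookup table, the two weight arrays and the *.xoo
     * data contiguously on the host, then issue a single copy. */
    size_t off = 0;
    memcpy(h_batch + off, bessel_table, bessel_bytes); off += bessel_bytes;
    memcpy(h_batch + off, w_vertical,   w_bytes);      off += w_bytes;
    memcpy(h_batch + off, w_horizontal, w_bytes);      off += w_bytes;
    memcpy(h_batch + off, xoo_data,     xoo_bytes);    off += xoo_bytes;
    cudaMemcpy(d_batch, h_batch, off, cudaMemcpyHostToDevice); /* one transfer */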

Memory accesses

The *.xoo file is moved to global memory; we use shared memory and cooperation among threads in the same block to obtain coalesced accesses.

The two arrays of weights for linear interpolation of the Bessel functions (and other arrays) are moved to constant memory, which is cached. These small arrays are accessed in read-only mode, at the same address, by all threads.
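A device-side sketch of the two placements; all symbols and the arithmetic are illustrative stand-ins, not the actual traction computation:

    #define N_WEIGHTS 64   /* illustrative size of the small weight arrays */

    __constant__ float c_w_vertical[N_WEIGHTS];    /* cached constant memory, */
    __constant__ float c_w_horizontal[N_WEIGHTS];  /* read at the same address */

    __global__ void loop4_kernel(const float *g_xoo, float *g_out, int n)
    {
        extern __shared__ float s_xoo[];  /* per-block staging buffer */
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        /* Consecutive threads load consecutive addresses: coalesced access. */
        if (gid < n) s_xoo[tid] = g_xoo[gid];
        __syncthreads();

        /* Placeholder computation standing in for the traction evaluation. */
        if (gid < n)
            g_out[gid] = s_xoo[tid] * c_w_vertical[tid % N_WEIGHTS]
                       + c_w_horizontal[tid % N_WEIGHTS];
    }

With dynamic shared memory as above, the launch would pass blockDim.x * sizeof(float) bytes as the third launch-configuration argument.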

Memory accesses

The output array is accessed in write-only mode and is stored in global memory. The output array has a different size at each GPU execution: each grid has a different number of threads, depending on the Loop 3 iterations grouped into it, and produces a different amount of output. To eliminate the overhead of repeated re-allocations, we preallocate an array that is large enough for every launch.
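A sketch of the preallocation pattern; the worst-case bound and its factors are assumptions introduced for illustration:

    /* Allocate the device output buffer once, sized for the largest output
     * any grouped launch can produce, instead of reallocating per launch. */
    float *d_out = NULL;
    size_t max_bytes = (size_t)max_grouped_threads * floats_per_thread
                     * sizeof(float);
    cudaMalloc((void **)&d_out, max_bytes); /* once, before the launch loop */
    /* Each launch writes into d_out; only the prefix actually produced is
     * copied back:
     * cudaMemcpy(h_out, d_out, used_bytes, cudaMemcpyDeviceToHost); */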

Host side parallelization: multicore processors

The several cores generate a set of flows of execution at CPU level. Since CUDA grids execute sequentially, having several CPU threads create CUDA grids at the same time causes competition for the GPU among the CPU threads (plus other drawbacks). Therefore, the grids related to different observers working on the same frequency value are merged into a single grid. In this way Loop 1 is efficiently parallelized across the cores. This level of parallelization is implemented with OpenMP (see the sketch below).
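A host-side sketch of this level, with hypothetical helpers; the frequency loop is placed outermost here purely for clarity, so that the per-observer work for one frequency can be merged into a single launch:

    /* For each frequency: per-observer preparation runs in parallel on the
     * CPU cores, then one merged grid serves all observers, avoiding
     * contention for the GPU among CPU threads. */
    for (int f = 0; f < n_frequencies; f++) {
        #pragma omp parallel for
        for (int obs = 0; obs < n_observers; obs++)
            prepare_observer_data(h_staging, obs, f);  /* CPU-side gathering */

        launch_merged_grid(h_staging, n_observers, f); /* one grid, one launch */
    }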

Host side parallelization: cluster

The observers are partitioned among the nodes of the cluster (multicore CPU + GPU); input scenarios for real cases have a high number of observers (on the order of 10^2). A cluster corresponds to an architecture with distributed memory instead of shared memory, so this level of parallelization is implemented with MPI:
- the inter-process communications are limited
- the resulting overhead is not significant
- the use of the cluster shows good scalability
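A sketch of the MPI level, under the assumption of a simple block partition of the observers across ranks; N_OBSERVERS and process_observer are illustrative:

    #include <mpi.h>
    #include <stdio.h>

    #define N_OBSERVERS 256   /* illustrative; matches the test case size */

    static void process_observer(int obs)
    {
        /* Stand-in for the per-observer multicore + GPU computation. */
        printf("observer %d\n", obs);
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Block partition: each process owns a contiguous range of
         * observers, so inter-process communication stays limited. */
        int per_rank = (N_OBSERVERS + size - 1) / size;
        int first = rank * per_rank;
        int last  = first + per_rank < N_OBSERVERS ? first + per_rank
                                                   : N_OBSERVERS;
        for (int obs = first; obs < last; obs++)
            process_observer(obs);

        MPI_Finalize();
        return 0;
    }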

Experimental results

Our cluster consists of 4 nodes. Each node consists of:
- 2 quad-core Intel Xeon E5520 CPUs operating at 2.27 GHz
- 2 Tesla T10 GPUs, each with 240 cores and 4 GB of memory (four T10 GPUs form a Tesla S1070 GPU Computing System)

Each node runs Linux kernel 2.6.27. Nodes are connected to the GPUs by a PCI Express 2 bus.

Experimental results

Data related to the earthquake of Colfiorito, Italy, 1997. The experimental test considers:
- 256 observers
- 287 different frequencies for each observer
- a grid of 120,000 points associated with the fault plane

We evaluate the performance of the two parallelized versions:
- TFaultMPI: the version running on the multicore CPU cluster
- TFaultCUDA: the version realized for the multicore CPU cluster with two Tesla GPUs attached to each node

We measure the execution time required by the two versions using 1 to 4 nodes of the cluster.

Performance

               1 node               2 nodes             3 nodes             4 nodes
TFaultMPI      61126 s (16 h 59 m)  32084 s (8 h 55 m)  20728 s (5 h 45 m)  15532 s (4 h 19 m)
TFaultCUDA     6428 s (1 h 47 m)    3242 s (54 m)       2024 s (33.7 m)     1511 s (25.2 m)
Speedup        9.5x                 9.9x                10.2x               10.3x

[Charts: execution time in seconds of TFaultMPI and TFaultCUDA, and TFaultCUDA-over-TFaultMPI speedup (9.0 to 10.4), for 1 to 4 nodes.]

Experimental results

We perform a comparison between the original version of COMPSYN and the two parallel versions, TFaultMPI and TFaultCUDA. Due to the long time required to run the simulation over the whole set of 256 observers, we ran it for a smaller subset and extrapolated the execution time. The resulting estimate of the speedup for 256 observers, using a single node in both cases, is:
- 15x for TFaultMPI
- 140x for TFaultCUDA

Conclusions and future work

This parallelization study has given promising results. We have two parallelized versions:
- TFaultMPI, running on a multicore CPU cluster (host-side parallelization)
- TFaultCUDA, parallelized on both the host side and the device side

Advantages:
- the use of COMPSYN becomes effective even in the most complex scenarios
- the parallel version is naturally scalable, thanks to the workload balancing obtained by distributing the computations related to the different observers over the available computing nodes

Conclusions and future work

Work done in the meantime: we have improved the parallelization design, obtaining better performance, by
- handling register spilling
- improving memory access coalescing
- using page-locked memory and texture memory

Future work:
- we are planning to realize a browser-based interface
- we are evaluating the parallelization of other modules of COMPSYN in a problem-oriented fashion