
18th World Conference on Nondestructive Testing, 16-20 April 2012, Durban, South Africa

Graphical Processing Units (GPU)-based Modeling for Acoustic and Ultrasonic NDE

Nahas CHERUVALLYKUDY, Krishnan BALASUBRAMANIAM and Prabhu RAJAGOPAL
Centre for NDE, Indian Institute of Technology - Madras, Chennai 600036, T.N., India

Abstract: With rapidly improving computational power, numerical models are being developed for ever more complex problems that cannot be solved analytically, making them more and more computationally intensive. Parallel computing has emerged as an important paradigm for speeding up such models. In recent years, graphics processing units (GPUs) have become one of the massively parallel devices widely available on the commodity market, and general-purpose computing on these devices went mainstream after the introduction of the NVIDIA compute unified device architecture (CUDA). Here we develop CUDA implementations of the finite difference time domain (FDTD) scheme for three-dimensional acoustic and two-dimensional ultrasonic NDE problems. Simulations are run on a commodity GeForce 9800 GT GPU. The results show a strong improvement in computational speed compared with a serial implementation on a CPU. We also discuss the accuracy of GPU-based computation.

Keywords: GPU computing, CUDA, acoustic emission, elastic wave propagation

Introduction

From the introduction of microprocessors until the early 2000s, programmers relied on faster processors to increase the speed of their programs, as software was mostly written serially and computers also worked serially. To build more powerful computers, hardware manufacturers increased processor frequencies, which decreased the average time taken to execute an instruction. However, as processor frequency increased, power consumption rose dramatically and cooling the computer became a major issue. Manufacturers therefore realized that computers could not simply become faster: instead, they had to grow wider. In place of a single very fast core, two or more cores of moderate speed started to appear in computers. This approach has its problems too: serially coded software does not run faster on wider computers; programs have to be parallelized. In parallel computing, a problem is solved using multiple processing units simultaneously. The concept of parallel programming is not entirely new to the computing world, though. From the 1970s onward, vector processors ran parallel programs in high-end systems such as supercomputers and high-performance computing facilities. However, it was the introduction of multi-core processors in desktop PCs that made parallel programming a mainstream paradigm in day-to-day information processing.

Graphics processing units (GPUs) were introduced to personal computers well before multi-core CPUs. GPUs are processors dedicated to accelerating the building of images in the frame buffer intended for output to a display. These processors are highly parallel and multithreaded, and are much more efficient at workloads that apply a single instruction to large amounts of data. This suitability for single-instruction, multiple-data (SIMD) algorithms made them attractive for video games and high-quality video playback and processing systems.

In a GPU, more transistors are devoted to data processing than to data caching and flow control, so GPUs are well suited to data-parallel algorithms with heavy floating-point workloads. After single-precision floating-point arithmetic was introduced to mainstream consumer graphics cards, programmers realized that many-core GPUs could be used even for non-graphical algorithms of a single-instruction, multiple-data nature. Using shading-language application programming interfaces (APIs), programmers attempted to harness the massively parallel GPU for general-purpose scientific and engineering computations. In late 2006, NVIDIA introduced its new parallel computing architecture, the compute unified device architecture, or CUDA [1]. With the introduction of CUDA, programming GPUs for general-purpose computing became much more affordable and accessible to the average programmer. OpenCL [2], a cross-platform programming interface, was added to the scene in 2008, but CUDA had already taken the lead in scientific and high-performance computing by providing a mature compiler, better documentation, performance libraries, and high-end graphics cards such as the Tesla series that are dedicated entirely to computation.

The CUDA programming model

CUDA extends the standard C/C++ languages to provide an interface to the GPU. Programs are divided into kernels and host code: kernels are compiled for and executed on the GPU, while the host code is executed on the CPU. Kernels are the compute-intensive portions of the program, whereas the host code sets up the environment for the program. Kernels are C functions that are executed N times in parallel by N different CUDA threads; each process in a kernel is treated as a thread. Threads are grouped into blocks, which are further grouped into grids. Each thread has its own private memory and a shared memory visible to all threads in the same block. Apart from these two types, there is a third memory, known as global memory, which can be accessed by any thread in the kernel and is the main memory for computations.

As the number of cores in CPUs and GPUs grows, the challenge is to create programs that scale automatically when the hardware is upgraded. Since CUDA groups threads into blocks, and each block can be executed on any processing core without disturbing any other block in the grid, this programming model is extremely scalable. CUDA programs are executed much like other programs: the host code starts execution on the CPU, the kernels then execute on the GPU, and at the end of GPU execution control is given back to the CPU. The NVIDIA CUDA compiler is used to compile the kernels, and a normal C compiler such as Microsoft Visual C++ is used to compile the host code; on a Linux platform the CUDA compiler can be used along with GCC.

In this paper, using CUDA, we develop parallel finite difference schemes applicable to two areas of central interest to the NDE community, namely acoustic (sound in fluids) and ultrasonic wave propagation.
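To make the kernel/host split and the thread-block-grid hierarchy concrete, a minimal CUDA sketch is given below. It is illustrative only and not taken from the paper: the kernel name, array size and scaling operation are our own choices.

#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel: each CUDA thread scales one array element.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)
        data[idx] *= factor;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host code sets up the environment ...
    float *h_data = new float[n];
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // ... allocates global (device) memory and copies data into it ...
    float *d_data;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // ... launches the kernel as a grid of thread blocks ...
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    // ... and copies the result back; control returns to the CPU here.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("h_data[0] = %f\n", h_data[0]);

    cudaFree(d_data);
    delete[] h_data;
    return 0;
}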

CUDA implementation of acoustic wave propagation

The three-dimensional acoustic wave equation can be written in the form:

\frac{\partial^2 p}{\partial t^2} = c^2 \left( \frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial y^2} + \frac{\partial^2 p}{\partial z^2} \right)    (1)

where p represents the pressure at a point in space and c represents the speed of acoustic waves in the (fluid) medium. Upon discretization using a forward-time (explicit), centered-space finite difference scheme with equal spatial steps \Delta x, Eq. (1) takes the form:

p_{i,j,k}^{n+1} = 2\,p_{i,j,k}^{n} - p_{i,j,k}^{n-1} + \left( \frac{c\,\Delta t}{\Delta x} \right)^{2} \left( p_{i+1,j,k}^{n} + p_{i-1,j,k}^{n} + p_{i,j+1,k}^{n} + p_{i,j-1,k}^{n} + p_{i,j,k+1}^{n} + p_{i,j,k-1}^{n} - 6\,p_{i,j,k}^{n} \right)    (2)

where p_{i,j,k}^{n} represents the pressure at the spatial point (i, j, k) at the n-th time step. The whole space at the n-th time step is represented by a three-dimensional matrix. To increase the efficiency of computation, this three-dimensional matrix is decomposed into several one-dimensional arrays and stored in memory. In sequential computing, each array element is updated in order by a spatial loop which is enclosed in a time loop. In our GPU-based parallel implementation, the spatial loop is decomposed into CUDA kernels which update the entire spatial matrix in one go. Since each time step depends on the previous one, the time loop cannot be parallelized. The block size of the kernels was chosen using the CUDA occupancy calculator; for maximum computational efficiency the block size was fixed at 16 x 16, so that 256 threads are available in one block. The grid size is calculated from the size of the problem domain: with M rows, N columns and Z slices, the grid width is (N / block width) x Z and the grid height is M / block height. The computational scheme developed for our CUDA implementation is represented in Figure 1.

System specification

A normal desktop PC was used. An Intel Core i3-530 processor clocked at 2.93 GHz with 3.24 GB of RAM served as the host platform. An NVIDIA GeForce 9800 GT with 112 CUDA cores clocked at 1.37 GHz and 512 MB of RAM was used as the GPU device. The operating system was Windows 7 Professional (32-bit).

Figure 1: CUDA implementation of acoustic wave propagation (a time loop enclosing a grid of thread blocks, with the threads of each block updating the field).
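To show how this decomposition maps onto CUDA threads, a minimal sketch of an update kernel for Eq. (2) and of its launch configuration is given below. The paper does not list its source code, so the kernel name, array layout, buffer rotation and boundary treatment here are our own assumptions; the launch configuration follows the block and grid sizes described above (and assumes the domain dimensions are multiples of the block size).

// Sketch (not the authors' code) of the Eq. (2) update for an M x N x Z grid
// flattened into 1-D arrays; coeff = (c*dt/dx)^2.
// Grid layout as in the text: gridDim.x = (N / blockDim.x) * Z, gridDim.y = M / blockDim.y.
__global__ void acousticUpdate(float *pNew, const float *pCur, const float *pOld,
                               float coeff, int M, int N, int Z)
{
    int blocksPerRow = N / blockDim.x;
    int j = (blockIdx.x % blocksPerRow) * blockDim.x + threadIdx.x;  // column
    int k =  blockIdx.x / blocksPerRow;                              // slice
    int i =  blockIdx.y * blockDim.y + threadIdx.y;                  // row

    if (i < 1 || i >= M - 1 || j < 1 || j >= N - 1 || k < 1 || k >= Z - 1)
        return;  // boundary points are left unchanged in this sketch

    int idx = (k * M + i) * N + j;  // flattened index (slice-major, then row-major)
    float lap = pCur[idx + 1] + pCur[idx - 1]          // j +/- 1
              + pCur[idx + N] + pCur[idx - N]          // i +/- 1
              + pCur[idx + M * N] + pCur[idx - M * N]  // k +/- 1
              - 6.0f * pCur[idx];
    pNew[idx] = 2.0f * pCur[idx] - pOld[idx] + coeff * lap;
}

// Host side: the serial time loop launches the kernel once per step and
// rotates the three pressure buffers.
//   dim3 block(16, 16);
//   dim3 grid((N / block.x) * Z, M / block.y);
//   for (int n = 0; n < nSteps; ++n) {
//       acousticUpdate<<<grid, block>>>(pNew, pCur, pOld, coeff, M, N, Z);
//       std::swap(pOld, pCur);
//       std::swap(pCur, pNew);
//   }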

Problem specification

Sound propagation through air was chosen as a simple illustrative problem. Since we did not use absorbing or radiation conditions at the boundary, a relatively small domain of size 1 m^3 was selected to save computational time. A toneburst pulse was used to excite the medium at the center for a minimum of 100 time steps. Space was divided into 100 cells in each dimension, and the temporal step was chosen accordingly to satisfy the stability criterion.

Results

While the problem took about 25 seconds to compute 500 time steps on the CPU, it took only 0.12 seconds on the GPU. Figure 2 shows an A-scan obtained from a point near the center of the problem domain. The results are not amenable to easy physical interpretation because of the multiple reflections from the domain boundaries. The small difference between the amplitude values computed on the CPU and on the GPU is probably due to their different architectures; we also note that the GeForce 9800 GT is not an IEEE 754-compliant processor [3] and is intended mainly for home use and the gaming market. We are currently investigating this issue further and are also looking into whether IEEE-compliant, double-precision-capable GPUs provide any improvement in the results.

Figure 2: A-scan monitored at the mid-point of the problem domain.
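For reference, the stability criterion mentioned in the problem specification above can be made concrete. Assuming a sound speed in air of roughly c = 343 m/s (the paper does not state the value used), the standard CFL condition for the explicit scheme of Eq. (2) gives

\Delta x = \frac{1\ \text{m}}{100} = 0.01\ \text{m}, \qquad \Delta t \le \frac{\Delta x}{c\sqrt{3}} \approx \frac{0.01}{343 \times 1.732}\ \text{s} \approx 1.7 \times 10^{-5}\ \text{s}

so under these assumptions any time step below roughly 17 microseconds would satisfy the criterion.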

CUDA implementation of elastic wave propagation

We chose the first-order velocity-stress formulation to represent elastic wave propagation. Assuming plane-strain conditions, we can write:

\rho \frac{\partial v_x}{\partial t} = \frac{\partial T_{xx}}{\partial x} + \frac{\partial T_{xz}}{\partial z}    (3)

\rho \frac{\partial v_z}{\partial t} = \frac{\partial T_{xz}}{\partial x} + \frac{\partial T_{zz}}{\partial z}    (4)

\frac{\partial T_{xx}}{\partial t} = (\lambda + 2\mu) \frac{\partial v_x}{\partial x} + \lambda \frac{\partial v_z}{\partial z}    (5)

\frac{\partial T_{zz}}{\partial t} = \lambda \frac{\partial v_x}{\partial x} + (\lambda + 2\mu) \frac{\partial v_z}{\partial z}    (6)

\frac{\partial T_{xz}}{\partial t} = \mu \left( \frac{\partial v_x}{\partial z} + \frac{\partial v_z}{\partial x} \right)    (7)

where v_x and v_z are the particle velocity components, T_xx, T_zz and T_xz are the stress components, ρ is the density, and λ and μ are the Lamé constants. This system of equations is discretized using an explicit centered finite difference scheme on a staggered grid with the leapfrog algorithm [5]. In the leapfrog algorithm the field components are updated sequentially in time: the velocity components are calculated first, then the stress components from the velocity components, then the velocity components again from the stress components, and so on.

After discretization, the update equation for the velocity component v_z, for example, takes the form:

v_z^{\,n+1/2}(i, j+\tfrac{1}{2}) = v_z^{\,n-1/2}(i, j+\tfrac{1}{2}) + \frac{\Delta t}{\rho} \left[ \frac{T_{xz}^{\,n}(i+\tfrac{1}{2}, j+\tfrac{1}{2}) - T_{xz}^{\,n}(i-\tfrac{1}{2}, j+\tfrac{1}{2})}{\Delta x} + \frac{T_{zz}^{\,n}(i, j+1) - T_{zz}^{\,n}(i, j)}{\Delta z} \right]    (8)

Similarly, the normal stress component T_xx is updated as:

T_{xx}^{\,n+1}(i, j) = T_{xx}^{\,n}(i, j) + \Delta t \left[ (\lambda + 2\mu) \frac{v_x^{\,n+1/2}(i+\tfrac{1}{2}, j) - v_x^{\,n+1/2}(i-\tfrac{1}{2}, j)}{\Delta x} + \lambda \frac{v_z^{\,n+1/2}(i, j+\tfrac{1}{2}) - v_z^{\,n+1/2}(i, j-\tfrac{1}{2})}{\Delta z} \right]    (9)

Discretized equations for the remaining field components are obtained in the same manner. Again, the spatial loop is enclosed in a temporal loop. In this case there are five field variables, two velocity components and three stress components, so at least five grids are needed. The block size and grid size were calculated in the same way as described above for the acoustic case.
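A minimal sketch of how the two-stage leapfrog update might look in CUDA is given below. The paper does not list its code, so the kernel names, array layout and staggering convention here are our own assumptions, chosen only to be consistent with Eqs. (3)-(9); boundary points are simply left fixed.

// Sketch (not the authors' code) of the staggered-grid leapfrog update.
// The five field grids vx, vz, txx, tzz, txz are N x M row-major arrays
// (i along x, j along z); only interior points are updated.
__global__ void updateVelocity(float *vx, float *vz,
                               const float *txx, const float *tzz, const float *txz,
                               float dtOverRho, float invDx, float invDz, int N, int M)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || i >= N - 1 || j < 1 || j >= M - 1) return;
    int idx = j * N + i;

    // Eq. (3): rho dvx/dt = dTxx/dx + dTxz/dz
    vx[idx] += dtOverRho * ((txx[idx + 1] - txx[idx]) * invDx
                          + (txz[idx] - txz[idx - N]) * invDz);
    // Eq. (4): rho dvz/dt = dTxz/dx + dTzz/dz
    vz[idx] += dtOverRho * ((txz[idx] - txz[idx - 1]) * invDx
                          + (tzz[idx + N] - tzz[idx]) * invDz);
}

__global__ void updateStress(float *txx, float *tzz, float *txz,
                             const float *vx, const float *vz,
                             float lam, float mu, float dt,
                             float invDx, float invDz, int N, int M)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || i >= N - 1 || j < 1 || j >= M - 1) return;
    int idx = j * N + i;

    float dvxdx = (vx[idx] - vx[idx - 1]) * invDx;
    float dvzdz = (vz[idx] - vz[idx - N]) * invDz;
    txx[idx] += dt * ((lam + 2.0f * mu) * dvxdx + lam * dvzdz);   // Eq. (5)
    tzz[idx] += dt * (lam * dvxdx + (lam + 2.0f * mu) * dvzdz);   // Eq. (6)
    txz[idx] += dt * mu * ((vx[idx + N] - vx[idx]) * invDz
                         + (vz[idx + 1] - vz[idx]) * invDx);      // Eq. (7)
}

// Host-side leapfrog loop: velocities, then stresses, once per time step.
//   dim3 block(16, 16);
//   dim3 grid((N + 15) / 16, (M + 15) / 16);
//   for (int n = 0; n < nSteps; ++n) {
//       updateVelocity<<<grid, block>>>(vx, vz, txx, tzz, txz, dt / rho, 1 / dx, 1 / dz, N, M);
//       updateStress<<<grid, block>>>(txx, tzz, txz, vx, vz, lam, mu, dt, 1 / dx, 1 / dz, N, M);
//   }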

Problem and system specification

An aluminium sample of 30 mm width and 15 mm height, with density 2667 kg/m^3, longitudinal wave velocity 6396 m/s and shear wave velocity 3103 m/s, was chosen for the simulation. The excitation consisted of a 5-cycle Hanning-windowed toneburst pulse centered at 2.25 MHz, applied for the first 120 time steps at the center of the model domain. The sample was divided into a grid of 317 x 158 cells with a time step of 9.89 x 10^-9 s. The problem was again solved on the same CPU and GPU devices described earlier.

Results

While the CPU took 1.500 seconds to solve the problem to 1000 time steps, the corresponding GPU implementation took only 0.030 seconds. Figure 3 shows the A-scan obtained from a point near the center of the sample (grid point (75, 100)). Again, we believe that the discrepancy in the results, especially in the multiply-scattered signals, is due to the low-end GPU we used; we are currently investigating this issue.

Figure 3: A-scan from the point (75, 100).

Conclusions and future work

Explicit finite difference time domain (FDTD) implementations of two-dimensional elastic wave propagation and three-dimensional acoustic wave propagation were developed for GPU computation using NVIDIA CUDA. GPU computing is shown to be effective in increasing computational speed. However, the underlying architecture of the GPU device used strongly affects the achievable accuracy. Ongoing and further work concerns addressing these accuracy issues as well as extending the CUDA implementations to other areas of NDE.

References

1. NVIDIA CUDA C Programming Guide, Version 4.0 (5/6/2011), http://developer.nvidia.com/category/zone/cuda-zone
2. The OpenCL Specification, Version 1.1, revision 44 (6/1/2011), http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
3. N. Whitehead and A. Fit-Florea, Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs, http://developer.nvidia.com/content/precisionperformance-floating-point-and-ieee-754-compliance-nvidia-gpus
4. B. Deschizeaux and J.-Y. Blanc, in GPU Gems 3, edited by H. Nguyen, Ch. 38, Addison-Wesley, Boston, 2008, pp. 831-850.
5. C. T. Schroder and W. R. Scott, IEEE Transactions on Geoscience and Remote Sensing, vol. 38, pp. 1505-1512 (2000).