Thrust++: Portable, Abstract Library for Medical Imaging Applications


Siemens Corporate Technology, March 2015
Thrust++: Portable, Abstract Library for Medical Imaging Applications
Siemens AG 2015. All rights reserved.

Agenda
- Parallel Computing: Challenges and Solutions
- What is Thrust++?
- Thrust++ pipeline pattern
- Demo
- Thrust++ data structures and algorithms
- Future work

Parallel Computing
Parallel computing brings real-time performance and scalability to Siemens products across Healthcare, Industry, Energy, and Infrastructure & Cities (I&C):
- Accelerated algorithms; rich, interactive advanced visualization; real-time analytics
- Real-time embedded control; high-precision simulations; rich, interactive operator interfaces
- Real-time algorithms for automation and control; real-time image and video processing
- Detailed 3D simulations; interactive CFD; high-performance solutions for crowd simulations
Example applications: Advanced Coronary Analysis, portable diagnosis and screening (various probes, signal conversion), SINUMERIK CNC, fault localization of turbine blades, virtual design of steam turbines, security solutions, danger management (evacuation).

Thrust++ is based on Thrust [1]
Thrust features:
- Open-source C++ template library for parallel platforms
- Abstraction and portability
- Boosts productivity
- Widely used parallel abstraction library
[1] Thrust: http://thrust.github.io/
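As a concrete illustration of what this abstraction buys, here is a minimal example in standard Thrust (ordinary Thrust, not Thrust++); the same STL-style code compiles for the CUDA, OpenMP, TBB, or CPP back end:

    // Minimal standard-Thrust example (not Thrust++): STL-style code that runs
    // on whichever back end it is compiled for.
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>
    #include <thrust/reduce.h>

    int main()
    {
        thrust::device_vector<float> x(1024, 1.0f);
        thrust::device_vector<float> y(1024, 2.0f);
        thrust::device_vector<float> z(1024);

        // z = x + y, executed in parallel on the selected back end
        thrust::transform(x.begin(), x.end(), y.begin(), z.begin(),
                          thrust::plus<float>());

        float sum = thrust::reduce(z.begin(), z.end(), 0.0f);  // parallel reduction
        return sum > 0.0f ? 0 : 1;
    }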

Extensions to Thrust
Thrust limitations:
- Limited data structures: lack of multi-dimensional and complex data structures
- Limited algorithms, generic for 1-dimensional data
- No patterns
Thrust++ extensions:
- Extended collection of algorithms: 1D FFT, FIR, convolution, ...
- Parallel patterns: pipeline pattern
- Data structures: device/host texture 1D/2D, device/host array 2D/3D

Pipeline
This example illustrates how a pipeline can be used to speed up computation. Pipelines occur in various domains.
Example of an image processing pipeline (source: Parallel Programming with Microsoft .NET).

Thrust++ pipeline pattern
The Thrust++ pipeline pattern can be used to develop portable, parallel, stream-based applications. It is an extension of FluenC [2].
Key features:
- Abstraction from the underlying hardware
- Linear pipelines with multiple sources/sinks
- Serial (in-order) and parallel (out-of-order) stages
- Generic programming in STL style
- Supports different back ends such as CUDA, TBB, OpenMP, and CPP (plain C++)
In the current version, the pipeline pattern uses the following external schedulers:
- Sequential scheduler
- Parallel scheduler (TBB)
[2] T. Schuele, "Efficient parallel execution of streaming applications on multi-core processors," in Proceedings of the 19th International Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP 2011), Ayia Napa, Cyprus, 9-11 February 2011.

Steps to set up the Thrust++ pipeline
1. Include the header file: #include <thrust/patterns/pipeline/pipeline.hpp>
2. Construct a network. A network consists of a set of stages that are connected by communication channels.
3. Create stages by inheriting from the parent classes:
   - Source: has only output ports.
   - Serial: has input and output ports; only one instance of this stage can run at a time.
   - Parallel: has input and output ports; multiple instances of this stage can run in parallel. In general, stages that neither have side effects nor maintain state can safely be executed in parallel.
   - Sink: has only input ports.
4. Connect the stages, e.g. stage1.connect<0>(stage2.port<0>()).
5. Start the network, e.g. nw(10).
Example topology: Source -> Serial -> Parallel_A / Parallel_B -> Sink.
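The Thrust++ stage classes themselves are not publicly documented, so as an illustration of the same source/parallel/sink idea, here is a small, runnable pipeline written directly against TBB's parallel_pipeline, the parallel scheduler mentioned above; this is standard TBB, not the Thrust++ API:

    // Illustrative only: the same pipeline shape in plain TBB
    // (oneTBB spelling; classic TBB uses <tbb/pipeline.h> and tbb::filter::serial_in_order).
    #include <tbb/parallel_pipeline.h>
    #include <cmath>
    #include <cstdio>

    int main()
    {
        int produced = 0;
        tbb::parallel_pipeline(
            /*max_number_of_live_tokens=*/4,
            // Source stage: serial, produces one item per token
            tbb::make_filter<void, int>(tbb::filter_mode::serial_in_order,
                [&](tbb::flow_control& fc) -> int {
                    if (produced == 10) { fc.stop(); return 0; }
                    return produced++;
                }) &
            // Parallel stage: stateless, instances may run out of order
            tbb::make_filter<int, double>(tbb::filter_mode::parallel,
                [](int v) { return std::sqrt(static_cast<double>(v)); }) &
            // Sink stage: serial, consumes results in order
            tbb::make_filter<double, void>(tbb::filter_mode::serial_in_order,
                [](double r) { std::printf("%f\n", r); }));
        return 0;
    }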

Pipeline: CUDA
Requirements for overlap:
- CUDA operations must be issued in different, non-default (non-0) streams
- Use cudaMemcpyAsync with host buffers in 'pinned' memory
- Sufficient resources must be available
- A blocked operation blocks all other operations in the queue, even in other streams
Stream scheduling: a CUDA operation is dispatched from the engine queue if
- preceding calls in the same stream have completed,
- preceding calls in the same queue have been dispatched, and
- resources are available.
Thrust++ extensions: asynchronous copies, pinned allocation, stream support.
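A minimal sketch of these requirements in plain CUDA runtime calls (not the Thrust++ wrappers): a pinned host buffer, a non-default stream, and asynchronous copies so that copies and kernels from different images can overlap:

    // Sketch: asynchronous copy/compute in a non-default stream with pinned memory.
    #include <cuda_runtime.h>

    __global__ void scale(float* d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;                       // placeholder work
    }

    int main()
    {
        const int n = 1 << 20;
        float *h = nullptr, *d = nullptr;
        cudaMallocHost(&h, n * sizeof(float));         // pinned host allocation
        cudaMalloc(&d, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);                     // non-default (non-0) stream

        // Async copies only overlap with other work when the host buffer is pinned.
        cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
        cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }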

Stream scheduling, sequential scheduler: results in depth-first issue order.
[Diagram: images 1-3 flow through stages 1-3 in streams 1-3. The compute queue executes one image at a time: S1-I1, S2-I1, S3-I1, then S1-I2, S2-I2, S3-I2, then S1-I3, S2-I3, S3-I3.]

Stream scheduling, TBB scheduler: in the ideal situation, results in breadth-first issue order.
[Diagram: the same images, stages, and streams, but stage 1 is issued for all images first (S1-I1, S1-I2, S1-I3), then stages 2 and 3 interleave across streams, giving better overlap.]

Event matrix
Create an event matrix of dimension [Max_Streams][Max_Stages]. Before a stage executes (pre-stage) it waits (W) on the event recorded by the previous image for the same stage; after it executes (post-stage) it records (R) its own event:
Stage 0 (Source):   pre-stage W([0][0]), W([1][0]);  post-stage R([0][0]), R([1][0]), R([2][0])
Stage 1 (Serial):   pre-stage W([0][1]), W([1][1]);  post-stage R([0][1]), R([1][1]), R([2][1])
Stage 2 (Parallel): no pre-stage wait;               post-stage R([0][2]), R([1][2]), R([2][2])
Stage 3 (Sink):     pre-stage W([0][3]), W([1][3]);  post-stage R([0][3]), R([1][3]), R([2][3])
// Pre-stage
if (image_id != 0 && stage_kind != PARALLEL)
    cudaStreamWaitEvent(stream_id, kernelEvent[image_id - 1][stage_id], 0);
// Post-stage
cudaEventRecord(kernelEvent[image_id][stage_id], stream_id);
cudaEventRecord and cudaStreamWaitEvent guarantee in-order execution.
Note: the parallel stage has no wait event, as it does not demand in-order execution.
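Putting the two hooks together, a sketch of how a stage launch might be wrapped; kernelEvent, StageKind, launch_stage, MAX_IMAGES, and MAX_STAGES are illustrative placeholder names, not Thrust++ internals:

    // Sketch of the pre-/post-stage hooks around a stage launch.
    #include <cuda_runtime.h>

    enum StageKind { SOURCE, SERIAL, PARALLEL, SINK };

    const int MAX_IMAGES = 16, MAX_STAGES = 4;
    cudaEvent_t kernelEvent[MAX_IMAGES][MAX_STAGES];   // created once with cudaEventCreate

    void launch_stage(int /*image_id*/, int /*stage_id*/, cudaStream_t /*stream*/)
    {
        // Enqueue the stage's kernels and copies here (omitted in this sketch).
    }

    void run_stage(int image_id, int stage_id, StageKind kind, cudaStream_t stream)
    {
        // Pre-stage: wait until the previous image has finished this stage,
        // except for parallel stages, which do not require in-order execution.
        if (image_id != 0 && kind != PARALLEL)
            cudaStreamWaitEvent(stream, kernelEvent[image_id - 1][stage_id], 0);

        launch_stage(image_id, stage_id, stream);

        // Post-stage: record completion so the next image can be released.
        cudaEventRecord(kernelEvent[image_id][stage_id], stream);
    }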

Demo: application to illustrate the pipeline pattern, CT reconstruction
- Cone-beam computed tomography (CBCT): the X-ray beam is a cone beam.
- The CBCT scanner rotates around the patient, acquiring projections at different angles, typically at an angular resolution of 1 or 0.5 degrees. A full 360-degree scan at 0.5-degree resolution results in a total of 720 projections.
- A reconstruction algorithm is used to reconstruct the 3D image of interest from the projections. A popular algorithm for cone-beam CT reconstruction is the Feldkamp algorithm.
- Input images: 512 x 1024 x 720.
- Pipeline: Source -> Pre-processing -> Convolution (parallel) -> Back-projection (serial) -> Post-processing -> Sink.

Summary and findings: CT reconstruction
Patterns:
- Texture usage should be avoided inside the pipeline (the CUDA 7.0 RC removes this limitation)
- Device-to-device copies are launched in the zero (default) stream by default; this breaks the pipeline
- The parallel scheduler provides better overlap between stages
- Complete C++11 support was not available until CUDA 5.5 for Linux
Memory placement:
Data                | CUDA memory type                | Reason
Input               | Texture memory                  | Takes advantage of hardware linear interpolation
Pre-computed values | Shared memory / constant memory | A few pre-computed values repeatedly accessed by multiple threads
Output              | Global memory                   | N*N*N float values

Summary and findings: CT reconstruction
Hardware back end: NVIDIA K20c using CUDA 5.5.
[Chart: performance comparison (execution time in ms) of different implementations of the back-projection kernel: Native CUDA (shared memory, textures), Thrust, and PARADIGM (constant memory, textures), annotated with gaps of 43.4% and 25% relative to the native implementation.]
What could be the other reasons for the performance degradation with PARADIGM? Thread configuration.
Back-projection kernel thread configuration:
- Native CUDA implementation: block [128 x 4], grid [4 x 128]
- PARADIGM implementation: block [768 x 1], grid [26 x 1]

Summary and findings: CT reconstruction
Performance comparison of different implementations of the back-projection kernel (execution time in ms):
- Native CUDA (shared memory, textures): 12.18
- Thrust: 21.54
- PARADIGM (constant memory, textures): 16.18
- PARADIGM (512,512): 12.17
The PARADIGM implementation with launch configuration (512,512) shows minimal performance degradation compared with the native CUDA implementation.

Thrust++ data structures
Two-dimensional:
- device/host_array2d: 2-dimensional data structure for host/device
- device/host_image2d: read-only data structure exploiting 2D spatial locality; supports interpolation; supported only on Kepler+ with CUDA 5.0 onwards (bindless textures)
Three-dimensional:
- device/host_array3d: 3-dimensional data structure for host/device
Texture data structures are built on cudaCreateTextureObject(), cudaMemcpyToArray(), and tex1Dfetch().
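A sketch of the underlying bindless-texture mechanism in plain CUDA 5.0+ (the internal layout of device_image2d is not shown; make_image2d below is an illustrative helper, not part of Thrust++):

    // Sketch: creating a 2D bindless texture object with hardware linear interpolation.
    #include <cuda_runtime.h>

    cudaTextureObject_t make_image2d(const float* h_src, int width, int height)
    {
        cudaArray_t arr;
        cudaChannelFormatDesc ch = cudaCreateChannelDesc<float>();
        cudaMallocArray(&arr, &ch, width, height);
        cudaMemcpy2DToArray(arr, 0, 0, h_src, width * sizeof(float),
                            width * sizeof(float), height, cudaMemcpyHostToDevice);

        cudaResourceDesc res = {};
        res.resType = cudaResourceTypeArray;
        res.res.array.array = arr;

        cudaTextureDesc tex = {};
        tex.filterMode     = cudaFilterModeLinear;     // hardware linear interpolation
        tex.addressMode[0] = cudaAddressModeClamp;
        tex.addressMode[1] = cudaAddressModeClamp;
        tex.readMode       = cudaReadModeElementType;

        cudaTextureObject_t obj = 0;
        cudaCreateTextureObject(&obj, &res, &tex, nullptr);
        return obj;                                    // read in kernels with tex2D<float>(obj, x, y)
    }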

Future Work
- Non-linear pipeline support for CUDA
- OpenCL back end, to support the embedded space (e.g. Mali GPU): VexCL (https://github.com/ddemidov/vexcl), Boost.Compute (http://kylelutz.github.io/compute/)
- Shared memory/local memory: Bulk library approach (http://on-demand.gputechconf.com/gtc/2014/presentations/s4673-thrustparallel-algorithms-bulk.pdf)
- Other calculators: occupancy is not always the best measure for optimal performance; two-phase decomposition/LaunchBox (http://nvlabs.github.io/moderngpu/intro.html)
PARADIGM: Parallel software Abstractions for Rapid Adaptation, Deployment and InteGration over Multicore/Manycore architectures.

Contact
Thank you. For further information you can contact the Parallel Systems India Lab:
- Giri Prabhakar, Head of Research Group, Parallel Systems India Lab. E-mail: giri.prab@siemens.com
- Bharatkumar Sharma, Lead Research Engineer, Parallel Systems India Lab. E-mail: bharatkumar.sharma@siemens.com

Backup

Back-projection
Data                | CUDA memory type                | Reason
Input               | Texture memory                  | Takes advantage of hardware linear interpolation
Pre-computed values | Shared memory / constant memory | A few pre-computed values repeatedly accessed by multiple threads
Output              | Global memory                   | N*N*N float values
Back-projection is the most compute-intensive step in 3D image reconstruction, accounting for about 97% of the execution time. For an N*N*N volume, N*N threads are launched. A loop runs depth-wise, accumulating the contribution of each convolved input image to the output volume. The loop is unrolled to increase ILP and obtain better performance.
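An illustrative CUDA sketch of the launch structure described above; the voxel-to-detector mapping is geometry-dependent and omitted, and all names are placeholders rather than the actual implementation:

    // Sketch: one thread per (x, y) column of the N*N*N volume, looping over depth.
    #include <cuda_runtime.h>

    __global__ void backproject(float* volume, cudaTextureObject_t projection,
                                int N, float weight)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= N || y >= N) return;

        #pragma unroll 4                       // unroll the depth loop to increase ILP
        for (int z = 0; z < N; ++z) {
            // Map voxel (x, y, z) to detector coordinates (u, v) for this projection
            // angle; the real mapping is geometry-dependent and omitted here.
            float u = (float)x, v = (float)y;

            // Hardware-interpolated read of the convolved projection, accumulated
            // into the output volume in global memory.
            volume[((size_t)z * N + y) * N + x] +=
                weight * tex2D<float>(projection, u + 0.5f, v + 0.5f);
        }
    }

    // Host side, once per projection (N*N threads; e.g. block [128 x 4], grid [4 x 128] for N = 512):
    // backproject<<<dim3(4, 128), dim3(128, 4)>>>(d_volume, projTex, 512, w);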

Software stack (top to bottom):
- Application
- Parallel algorithms, parallel patterns, data structures and iterators; power monitor [2]
- Device back-end extensions
- Many-core software abstraction layer [1]
- CUDA / OpenMP / TBB
- Operating system & device drivers
- Hardware (multi-core CPU + many-core device (GPU, MIC, ...))
[1] Thrust, an open-source C++ library based on the C++ STL, will be extended to provide the many-core software abstraction layer.
[2] Tools and utilities to monitor and optimize power (energy) usage.