Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors
1 Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors
Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron
University of Virginia
IPDPS 2009
2 Outline
- Leukocyte tracking:
  - Problem
  - Current approaches
- Acceleration using CUDA:
  - Bottlenecks
  - Optimization techniques
  - Performance impact
3 Leukocyte Tracking
- Velocity of rolling leukocytes (white blood cells) provides important information about the inflammatory response
- Velocity measured by tracking leukocytes through multiple frames
4 Leukocyte Tracking: Approaches
- Manual analysis
  - Researcher marks leukocyte centers frame-by-frame
  - Process 1 minute of video in tens of hours
- Automated analysis using MATLAB
  - Removes manual effort and observer bias
  - Process 1 minute of video in >4.5 hours
5 Goal: Leverage CUDA and a GPU to accelerate leukocyte tracking to near real-time speeds
6 Acceleration
1. Translation: convert MATLAB code to C
2. Parallelization:
   - OpenMP for multi-core CPU
   - CUDA for GPU
Experimental setup:
- CPU: 3.2 GHz quad-core Intel Core 2 Extreme X9770
- GPU: NVIDIA GeForce GTX 280 (PCIe 2.0)
7 CUDA
- Programming model for running general-purpose applications on NVIDIA GPUs
- Based on C, with some minor extensions
- Main CUDA abstraction: kernel function
  - Scalar program invoked across many threads
  - Threads grouped into thread blocks
  - Communication only allowed among threads within the same thread block
8 Acceleration using CUDA
[Diagram: the CPU program allocates GPU memory, transfers input data, launches the kernel, transfers results, and frees GPU memory; the CUDA kernel runs on the GPU]
- Step 1: Determine which code to offload to the GPU as a CUDA kernel
- Step 2: Write the CPU-side CUDA code
- We focus on these two steps
- Step 3: Write and optimize the GPU kernel
9 Tracking Algorithm
- Inputs:
  - Video frame
  - Location of cells in previous frame
- Output:
  - Location of cells in current frame
- For each cell:
  - Extract sub-image near cell's old location
  - Compute MGVF matrix over sub-image
  - Evolve active contour using MGVF matrix
[Slide annotation: 99.8% of total runtime]
10 Computing the MGVF Matrix
- MGVF: Motion Gradient Vector Flow
- Gradient vector field biased in the assumed direction of motion
- MGVF matrix is approximated via an iterative solution procedure
[Figure: sub-image near cell and the corresponding MGVF]
11 MGVF Pseudo-code
  MGVF = normalized sub-image gradient
  do {
    Compute the difference between each element and its eight neighbors
    Compute the regularized Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
  } while (not converged)
[Slide annotation: Initial kernel body]
12 Naïve CUDA Implementation
[Chart: Speedup over MATLAB. C: 2.0x; C + OpenMP: 7.7x; Naïve CUDA: 0.8x]
- Kernel is called ~50,000 times per frame
- Amount of work per call is small
- Runtime dominated by CUDA overheads:
  - Memory allocation
  - Memory copying
  - Kernel call overhead
13 Kernel Overhead
- Kernel calls are not cheap!
  - Overhead of one kernel call: 9 µs
  - Overhead of one CPU function: 3 ns
- Heaviside kernel:
  - 27% of kernel runtime due to computation
  - 73% of kernel runtime due to kernel overhead
14 Lesson 1: Reduce Kernel Overhead
- Increase amount of work per kernel call
  - Decrease total number of kernel calls
  - Amortize overhead of each kernel call across more computation
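The amortization argument in Lesson 1 can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the 9 µs launch overhead and the 73% overhead share quoted for the Heaviside kernel describe the same call, which implies roughly 9 × 27/73 ≈ 3.3 µs of useful work per call. The function names and the idea of merging equal-sized steps are hypothetical, not taken from the talk.

```c
#include <assert.h>

/* Fraction of one kernel call's runtime that is launch overhead. */
double overhead_fraction(double overhead_us, double work_us)
{
    assert(overhead_us >= 0.0 && work_us >= 0.0);
    assert(overhead_us + work_us > 0.0);
    return overhead_us / (overhead_us + work_us);
}

/* Overhead fraction when `merged_steps` steps of equal size
   (an assumption made for illustration) share a single launch. */
double merged_overhead_fraction(double overhead_us, double work_per_step_us,
                                int merged_steps)
{
    assert(merged_steps > 0);
    return overhead_fraction(overhead_us, work_per_step_us * merged_steps);
}
```

Under these assumptions, `overhead_fraction(9.0, 9.0 * 27.0 / 73.0)` reproduces the 73% figure, and folding four such steps into one launch would bring the overhead share down to about 40%.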
15 Larger Kernel Implementation
  MGVF = normalized sub-image gradient
  do {
    Compute the difference between each pixel and its eight neighbors
    Compute the regularized Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
  } while (!converged)
[Slide annotation: Expand kernel body]
16 Larger Kernel Implementation
[Chart: Speedup over MATLAB. C: 2.0x; C + OpenMP: 7.7x; Naïve CUDA: 0.8x; Larger Kernel: 6.3x]
[Chart: Percentage of Runtime. Memory Allocation: 71% (bottleneck); Memory Copying: 15%; Kernel Execution: 9%]
17 Memory Allocation Overhead
[Chart: Time Per Call (microseconds) versus Megabytes Allocated Per Call, for malloc (CPU memory) and cudaMalloc (GPU memory)]
18 Lesson 2: Reduce Memory Management Overhead
- Reduce the number of memory allocations
  - Allocate memory once and reuse it throughout the application
  - If memory size is not known a priori, estimate and only re-allocate if estimate is too small
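The allocate-once, grow-only policy from Lesson 2 might look roughly like the following. This is a host-side sketch under stated assumptions: it uses `realloc` as a stand-in allocator so that it runs without a GPU, and the `ReusableBuffer` type and both function names are hypothetical. On the GPU, the same policy would be built on `cudaMalloc` and `cudaFree`, which have no `realloc` equivalent, so growth would mean freeing and allocating again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    void  *data;        /* current allocation, or NULL if none yet */
    size_t capacity;    /* bytes currently allocated */
    int    alloc_calls; /* number of times the allocator was invoked */
} ReusableBuffer;

/* Return a buffer of at least `bytes` bytes, calling the allocator only
   when the request exceeds the current capacity. Returns NULL if the
   allocator fails (the previous allocation then remains valid). */
void *buffer_reserve(ReusableBuffer *buf, size_t bytes)
{
    assert(buf != NULL);
    if (bytes > buf->capacity) {
        void *grown = realloc(buf->data, bytes);
        if (grown == NULL) {
            return NULL;
        }
        buf->data = grown;
        buf->capacity = bytes;
        buf->alloc_calls++;
    }
    return buf->data;
}

/* Free the allocation once, at the end of the application. */
void buffer_release(ReusableBuffer *buf)
{
    assert(buf != NULL);
    free(buf->data);
    buf->data = NULL;
    buf->capacity = 0;
}
```

A caller would start from `ReusableBuffer buf = {NULL, 0, 0};`, call `buffer_reserve` with its size estimate before each use, and call `buffer_release` once at shutdown. Repeated requests at or below the estimate then cost no allocator calls at all.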
19 Reduced Allocation Implementation
[Chart: Speedup over MATLAB. C: 2.0x; C + OpenMP: 7.7x; Naïve CUDA: 0.8x; Larger Kernel: 6.3x; Reduced Allocation: 25.4x]
[Chart: Percentage of Runtime. Memory Allocation: 3%; Memory Copying: 56% (bottleneck); Kernel Execution: 31%]
20 Memory Transfer Overhead
[Chart: Transfer Time (milliseconds) versus Megabytes per Transfer, for CPU to GPU and GPU to CPU transfers; the transfer size used by this application is marked]
21 Lesson 3: Reduce Memory Transfer Overhead
- If the CPU operates on values produced by the GPU:
  - Move the operation to the GPU
  - May improve performance even if the operation itself is slower on the GPU
[Diagram: timeline between values produced by the GPU and values consumed by the GPU, comparing Memory Transfer, Operation (CPU), Memory Transfer against Operation (GPU)]
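The trade-off in Lesson 3 reduces to comparing two sums. The sketch below is a simple cost model, not a measurement: it assumes the CPU path pays one transfer in each direction around the operation, while the GPU path pays only for the (possibly slower) operation. The function names are hypothetical, and the timings in the usage note are made up for illustration.

```c
#include <assert.h>

/* Time for the CPU path: copy GPU-produced values to the CPU, operate on
   them there, then copy the results back for the GPU to consume. */
double cpu_path_us(double gpu_to_cpu_us, double cpu_op_us, double cpu_to_gpu_us)
{
    assert(gpu_to_cpu_us >= 0.0 && cpu_op_us >= 0.0 && cpu_to_gpu_us >= 0.0);
    return gpu_to_cpu_us + cpu_op_us + cpu_to_gpu_us;
}

/* Returns 1 when keeping the operation on the GPU is faster overall,
   0 otherwise. */
int gpu_path_wins(double gpu_op_us, double gpu_to_cpu_us,
                  double cpu_op_us, double cpu_to_gpu_us)
{
    assert(gpu_op_us >= 0.0);
    return gpu_op_us < cpu_path_us(gpu_to_cpu_us, cpu_op_us, cpu_to_gpu_us);
}
```

With hypothetical timings of 20 µs per transfer and a 5 µs CPU operation, a GPU version of the operation that takes 30 µs (six times slower in isolation) still wins, because the CPU path totals 45 µs.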
22 GPU Reduction Implementation
  MGVF = normalized sub-image gradient
  do {
    Compute the difference between each pixel and its eight neighbors
    Compute the regularized Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
  } while (!converged)
[Slide annotation: Add convergence check to kernel body]
23 Kernel Overhead Revisited
- Overhead depends on calling pattern:
  - One at a time (synchronous): 9 µs
  - Back-to-back (asynchronous): 3 µs
[Diagram: Synchronous: Kernel Call, Memory Transfer, Kernel Call, Memory Transfer, Kernel Call, with implicit synchronization. Asynchronous: five Kernel Calls issued back-to-back]
24 Lesson 1 Revisited: Reduce Kernel Overhead
- Increase amount of work per kernel call
  - Decrease total number of kernel calls
  - Amortize overhead of each kernel call across more computation
- Launch kernels back-to-back
  - Kernel calls are asynchronous: avoid explicit or implicit synchronization between kernel calls
  - Overlap kernel execution on the GPU with driver access on the CPU
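To get a feel for what the two calling patterns mean at scale, the per-call overheads can be multiplied out. The sketch below borrows the ~50,000 calls per frame reported for the naïve implementation purely as an illustrative count (the later implementations issue fewer calls), and the function name is hypothetical.

```c
#include <assert.h>

/* Total launch overhead, in milliseconds, for `calls` kernel calls that
   each pay `overhead_per_call_us` microseconds of overhead. */
double launch_overhead_ms(long calls, double overhead_per_call_us)
{
    assert(calls >= 0 && overhead_per_call_us >= 0.0);
    return (double)calls * overhead_per_call_us / 1000.0;
}
```

For 50,000 calls, the synchronous figure of 9 µs gives 450 ms of pure launch overhead per frame, while the back-to-back figure of 3 µs gives 150 ms.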
25 GPU Reduction Implementation
[Chart: Speedup over MATLAB. C: 2.0x; C + OpenMP: 7.7x; Naïve CUDA: 0.8x; Larger Kernel: 6.3x; Reduced Allocation: 25.4x; GPU Reduction: 60.7x]
[Chart: Percentage of Runtime. Memory Allocation: 7%; Memory Copying: 1%; Kernel Execution: 80%]
26 Persistent Thread Block
  MGVF = normalized sub-image gradient
  do {
    Compute the difference between each pixel and its eight neighbors
    Compute the regularized Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
  } while (!converged)
- How can we offload the entire while loop as a kernel?
27 Persistent Thread Block
- Problem: need a global memory fence
  - Multiple thread blocks compute the MGVF matrix
  - Thread blocks cannot communicate with each other
  - So each iteration requires a separate kernel call
- Solution: compute entire matrix in one thread block
  - Arbitrary number of iterations can be computed in a single kernel call
28 Persistent Thread Block: Example
[Figure: the MGVF matrix under the Canonical CUDA Approach (1-to-1 mapping between threads and data elements) and under the Persistent Thread Block approach]
29 Persistent Thread Block: Example
[Figure: cells and the GPU's SMs under the Canonical CUDA Approach (1-to-1 mapping between threads and data elements) and under the Persistent Thread Block approach]
30 Lesson 4: Avoid Global Memory Fences
- Confine dependent computations to a single thread block
- Execute an iterative algorithm until convergence in a single kernel call
- Only efficient if there are multiple independent computations
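The persistent-thread-block idea can be emulated sequentially in plain C to show its control flow. The sketch below is not the MGVF computation: it swaps in a simple 1-D relaxation (each interior element becomes the average of its two neighbors) as a stand-in iterative algorithm. The function name is hypothetical. What it illustrates is that a fixed number of "threads" each stride over several elements, and that the whole iterate-until-converged loop lives inside one call, with a block-wide barrier (`__syncthreads()` in CUDA) between iterations instead of a new kernel launch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sequentially emulates ONE thread block of `num_threads` threads relaxing
   `data` (n elements, fixed end points) until the largest per-element
   change is at most `tol` or `max_iters` is reached.
   Returns the number of iterations, or -1 if scratch allocation fails. */
int persistent_block_relax(float *data, int n, int num_threads,
                           float tol, int max_iters)
{
    assert(data != NULL && n > 0 && num_threads > 0 && max_iters > 0);

    float *next = malloc((size_t)n * sizeof *next);
    if (next == NULL) {
        return -1;
    }
    memcpy(next, data, (size_t)n * sizeof *next); /* end points never change */

    int iters = 0;
    float max_change;
    do {
        max_change = 0.0f;
        /* Each emulated thread t handles elements t, t + T, t + 2T, ... */
        for (int t = 0; t < num_threads; ++t) {
            for (int i = t; i < n; i += num_threads) {
                if (i == 0 || i == n - 1) {
                    continue; /* boundary elements stay fixed */
                }
                next[i] = 0.5f * (data[i - 1] + data[i + 1]);
                float change = next[i] - data[i];
                if (change < 0.0f) {
                    change = -change;
                }
                if (change > max_change) {
                    max_change = change;
                }
            }
        }
        /* A block-wide barrier would sit here in a real kernel. */
        memcpy(data, next, (size_t)n * sizeof *next);
        ++iters;
    } while (max_change > tol && iters < max_iters);

    free(next);
    return iters;
}
```

Because the number of emulated threads is independent of the array size, the same call handles any `n`. In the approach described above, each cell's matrix would get its own such block, which is why the technique only pays off when there are multiple independent computations to fill the GPU.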
31 Persistent Thread Block Implementation
[Chart: Speedup over MATLAB. C: 2.0x; C + OpenMP: 7.7x; Naïve CUDA: 0.8x; Larger Kernel: 6.3x; Reduced Allocation: 25.4x; GPU Reduction: 60.7x; Persistent Thread Block: 211.3x (27x speedup over OpenMP)]
32 Absolute Performance
[Chart: Frames per Second (FPS) for MATLAB, C, C + OpenMP, and CUDA]
33 Conclusions
- CUDA overheads can be significant bottlenecks
- Techniques presented here can help mitigate the impact of these bottlenecks
- CUDA provides enormous performance improvements for leukocyte tracking
  - 200x speedup over MATLAB
  - 27x speedup over OpenMP
  - Processing time for a 1 minute video reduced from >4.5 hours to <1.5 minutes
- Real-time leukocyte tracking will be feasible in the near future
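The headline numbers in the conclusions can be cross-checked against the speedup chart with two divisions. The sketch below takes the 211.3x and 7.7x figures from the Persistent Thread Block Implementation chart and, as an assumption, treats the MATLAB baseline as exactly 4.5 hours (the talk only says ">4.5 hours"), so the projected time is a rough consistency check rather than a measured result. The function names are hypothetical.

```c
#include <assert.h>

/* Speedup of implementation A relative to implementation B, given both
   speedups over a common baseline. */
double relative_speedup(double speedup_a, double speedup_b)
{
    assert(speedup_b > 0.0);
    return speedup_a / speedup_b;
}

/* Processing time after applying `speedup` to a baseline time. */
double projected_seconds(double baseline_seconds, double speedup)
{
    assert(baseline_seconds >= 0.0 && speedup > 0.0);
    return baseline_seconds / speedup;
}
```

`relative_speedup(211.3, 7.7)` comes to about 27.4, matching the quoted 27x over OpenMP, and `projected_seconds(4.5 * 3600.0, 211.3)` comes to roughly 77 seconds, which is in line with the "<1.5 minutes" claim under the stated assumption.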
34 Acknowledgements
- Funding provided by:
  - NSF grant IIS
  - SRC grant
  - NVIDIA research grant
  - GRC AMD/Mahboob Kahn Ph.D. fellowship
- Equipment donated by NVIDIA
35 Software
- Source code available at:
- ImageJ plugin will be available soon
More informationFace Detection using GPU-based Convolutional Neural Networks
Face Detection using GPU-based Convolutional Neural Networks Fabian Nasse 1, Christian Thurau 2 and Gernot A. Fink 1 1 TU Dortmund University, Department of Computer Science, Dortmund, Germany 2 Fraunhofer
More informationMassively Parallel Architectures
Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationConcurrent kernel execution on Graphic Processing Units
Master 2 Informatique Université de Bordeaux I Projet d Étude et de Recherche Concurrent kernel execution on Graphic Processing Units Adrien Cassagne, Aurélien George, Benjamin Lorendeau, Jean-Charles
More informationA GPU Implementation of Tiled Belief Propagation on Markov Random Fields. Hassan Eslami Theodoros Kasampalis Maria Kotsifakou
A GPU Implementation of Tiled Belief Propagation on Markov Random Fields Hassan Eslami Theodoros Kasampalis Maria Kotsifakou BP-M AND TILED-BP 2 BP-M 3 Tiled BP T 0 T 1 T 2 T 3 T 4 T 5 T 6 T 7 T 8 4 Tiled
More informationLarge Scale Debugging
Large Scale Debugging Project Meeting Report - December 2015 Didier Nadeau Under the supervision of Michel Dagenais Distributed Open Reliable Systems Analysis Lab École Polytechnique de Montréal Table
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationCS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010
CS4961 Parallel Programming Lecture 14: Reasoning about Performance Administrative: What s Coming Programming assignment 2 due Friday, 11:59PM Homework assignment out on Tuesday, Oct. 19 and due Monday,
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationFast Tridiagonal Solvers on GPU
Fast Tridiagonal Solvers on GPU Yao Zhang John Owens UC Davis Jonathan Cohen NVIDIA GPU Technology Conference 2009 Outline Introduction Algorithms Design algorithms for GPU architecture Performance Bottleneck-based
More informationAcceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP
Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP Zhe Weng and Peter Strazdins*, Computer Systems Group, Research School of Computer Science, The Australian National University
More informationPerformance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX
Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX Wen-mei Hwu with David Kirk, Shane Ryoo, Christopher Rodrigues, John Stratton, Kuangwei Huang Overview
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationGPU Acceleration of the Generalized Interpolation Material Point Method
GPU Acceleration of the Generalized Interpolation Material Point Method Wei-Fan Chiang, Michael DeLisi, Todd Hummel, Tyler Prete, Kevin Tew, Mary Hall, Phil Wallstedt, and James Guilkey Sponsored in part
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More information