Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors

1 Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors. Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron. University of Virginia. IPDPS 2009.

2 Outline. Leukocyte tracking: problem, current approaches. Acceleration using CUDA: bottlenecks, optimization techniques, performance impact.

3 Leukocyte Tracking. The velocity of rolling leukocytes (white blood cells) provides important information about the inflammatory response. Velocity is measured by tracking leukocytes through multiple frames.

4 Leukocyte Tracking: Approaches. Manual analysis: a researcher marks leukocyte centers frame by frame; processing 1 minute of video takes tens of hours. Automated analysis using MATLAB: removes manual effort and observer bias; processing 1 minute of video takes >4.5 hours.

5 Goal: Leverage CUDA and a GPU to accelerate leukocyte tracking to near real-time speeds.

6 Acceleration. 1. Translation: convert MATLAB code to C. 2. Parallelization: OpenMP for the multi-core CPU, CUDA for the GPU. Experimental setup: CPU: 3.2 GHz quad-core Intel Core 2 Extreme X9770. GPU: NVIDIA GeForce GTX 280 (PCIe 2.0).

7 CUDA. Programming model for running general-purpose applications on NVIDIA GPUs. Based on C, with some minor extensions. Main CUDA abstraction: the kernel function, a scalar program invoked across many threads. Threads are grouped into thread blocks. Communication is only allowed among threads within the same thread block.

8 Acceleration using CUDA. Diagram: the CPU program allocates GPU memory, transfers input data, launches the kernel, transfers results, and frees GPU memory; the GPU runs the CUDA kernel. Step 1: Determine which code to offload to the GPU as a CUDA kernel. Step 2: Write the CPU-side CUDA code. (We focus on these two steps.) Step 3: Write and optimize the GPU kernel.

9 Tracking Algorithm. Inputs: video frame; location of cells in previous frame. Output: location of cells in current frame. For each cell: extract the sub-image near the cell's old location; compute the MGVF matrix over the sub-image; evolve the active contour using the MGVF matrix. Slide annotation: 99.8% of total runtime.

10 Computing the MGVF Matrix. Motion Gradient Vector Flow (MGVF): a gradient vector field biased in the assumed direction of motion. The MGVF matrix is approximated via an iterative solution procedure. Figure: sub-image near a cell and the corresponding MGVF.

11 MGVF Pseudo-code
MGVF = normalized sub-image gradient
do {
    Compute the difference between each element and its eight neighbors
    Compute the regularized Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (not converged)
Slide annotation: Initial kernel body

12 Naïve CUDA Implementation. Chart: speedup over MATLAB: C 2.0x, C + OpenMP 7.7x, Naïve CUDA 0.8x. The kernel is called ~50,000 times per frame. The amount of work per call is small. Runtime is dominated by CUDA overheads: memory allocation, memory copying, and kernel call overhead.

13 Kernel Overhead. Kernel calls are not cheap! Overhead of one kernel call: 9 µs. Overhead of one CPU function call: 3 ns. Heaviside kernel: 27% of kernel runtime is due to computation; 73% is due to kernel overhead.

14 Lesson 1: Reduce Kernel Overhead. Increase the amount of work per kernel call. Decrease the total number of kernel calls. Amortize the overhead of each kernel call across more computation.
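
The figures quoted on the preceding slides (about 50,000 kernel calls per frame, 9 µs of overhead per call) can be plugged into a back-of-the-envelope model to see why amortization matters. The C sketch below is only illustrative: the function names are hypothetical, and the per-call compute time and the merge factor used in the note afterwards are assumed values, not measurements from the talk.

```c
#include <assert.h>

/* Total launch overhead per frame, in seconds, for a given number of
 * kernel calls and a fixed per-call overhead in microseconds. */
double launch_overhead_seconds(long calls_per_frame, double overhead_us)
{
    assert(calls_per_frame >= 0 && overhead_us >= 0.0);
    return (double)calls_per_frame * overhead_us * 1e-6;
}

/* Fraction of total kernel runtime spent on overhead when each call
 * performs compute_us microseconds of useful work. */
double overhead_fraction(double compute_us, double overhead_us)
{
    assert(compute_us >= 0.0 && overhead_us >= 0.0);
    assert(compute_us + overhead_us > 0.0);
    return overhead_us / (compute_us + overhead_us);
}
```

With the slide's numbers, launch_overhead_seconds(50000, 9.0) comes to about 0.45 s of pure launch overhead per frame. Assuming roughly 3 µs of useful work per call, overhead_fraction(3.0, 9.0) is 0.75, close to the 73% reported for the Heaviside kernel; folding ten such calls into one (an assumed factor) gives overhead_fraction(30.0, 9.0), roughly 0.23.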

15 Larger Kernel Implementation
MGVF = normalized sub-image gradient
do {
    Compute the difference between each pixel and its eight neighbors
    Compute the regularized Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (!converged)
Slide annotation: Expand kernel body

16 Larger Kernel Implementation. Chart: speedup over MATLAB: C 2.0x, C + OpenMP 7.7x, Naïve CUDA 0.8x, Larger Kernel 6.3x. Percentage of runtime: memory allocation 71% (bottleneck), memory copying 15%, kernel execution 9%.

17 Memory Allocation Overhead. Chart: time per call (microseconds) versus megabytes allocated per call, for malloc (CPU memory) and cudaMalloc (GPU memory).

18 Lesson 2: Reduce Memory Management Overhead. Reduce the number of memory allocations. Allocate memory once and reuse it throughout the application. If the memory size is not known a priori, estimate it and only re-allocate if the estimate is too small.
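
The "allocate once, re-allocate only if the estimate is too small" pattern can be sketched as a grow-only scratch buffer. This is a minimal CPU-side illustration, not the authors' code: ScratchBuffer, scratch_reserve, and scratch_release are hypothetical names, and malloc/free stand in for the GPU allocation calls so the sketch compiles without a CUDA toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A scratch buffer that is allocated once and reused across calls.
 * It only grows: a request that fits in the current capacity costs
 * no allocation at all. */
typedef struct {
    void  *data;
    size_t capacity;     /* bytes currently allocated */
    size_t alloc_count;  /* number of real allocations, for illustration */
} ScratchBuffer;

/* Return a pointer to at least `needed` bytes, allocating only when the
 * current capacity is too small. Old contents are not preserved when the
 * buffer grows. Returns NULL on allocation failure, in which case the
 * previous buffer is left intact. */
void *scratch_reserve(ScratchBuffer *buf, size_t needed)
{
    assert(buf != NULL);
    if (buf->data != NULL && needed <= buf->capacity) {
        return buf->data;            /* fast path: reuse, no allocation */
    }
    size_t new_capacity = needed > 0 ? needed : 1;
    void *fresh = malloc(new_capacity);
    if (fresh == NULL) {
        return NULL;
    }
    free(buf->data);                 /* free(NULL) is a no-op */
    buf->data = fresh;
    buf->capacity = new_capacity;
    buf->alloc_count++;
    return buf->data;
}

/* Release the buffer once, at the end of the application. */
void scratch_release(ScratchBuffer *buf)
{
    assert(buf != NULL);
    free(buf->data);
    buf->data = NULL;
    buf->capacity = 0;
}
```

A caller would zero-initialize one ScratchBuffer, call scratch_reserve before each frame or cell, and call scratch_release once at exit. In a GPU version the same structure would presumably wrap cudaMalloc and cudaFree, which is where the slide's allocation cost applies.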

19 Reduced Allocation Implementation. Chart: speedup over MATLAB: C 2.0x, C + OpenMP 7.7x, Naïve CUDA 0.8x, Larger Kernel 6.3x, Reduced Allocation 25.4x. Percentage of runtime: memory allocation 3%, memory copying 56% (bottleneck), kernel execution 31%.

20 Memory Transfer Overhead. Chart: transfer time (milliseconds) versus megabytes per transfer, for CPU-to-GPU and GPU-to-CPU transfers, with the transfer size used by this application marked.

21 Lesson 3: Reduce Memory Transfer Overhead. If the CPU operates on values produced by the GPU, move the operation to the GPU. This may improve performance even if the operation itself is slower on the GPU. Diagram (time axis): between the values produced by the GPU and the values consumed by the GPU, the CPU path consists of a memory transfer, the operation (CPU), and another memory transfer; the GPU path consists of the operation (GPU) alone.
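
The trade-off in this lesson reduces to a simple comparison of path lengths. The sketch below assumes, for illustration only, that both transfers cost the same and that nothing overlaps; the function names and every number in the note afterwards are hypothetical.

```c
#include <assert.h>

/* Time for the CPU path, in microseconds: copy the GPU's results to the
 * CPU, operate on them there, then copy the values back to the GPU. */
double cpu_path_us(double transfer_us, double cpu_op_us)
{
    assert(transfer_us >= 0.0 && cpu_op_us >= 0.0);
    return 2.0 * transfer_us + cpu_op_us;
}

/* Returns 1 when running the operation on the GPU (no transfers) is
 * faster overall than the CPU path, 0 otherwise. */
int gpu_path_wins(double transfer_us, double cpu_op_us, double gpu_op_us)
{
    assert(gpu_op_us >= 0.0);
    return gpu_op_us < cpu_path_us(transfer_us, cpu_op_us);
}
```

For example, with an assumed 10 µs per transfer, an operation that takes 2 µs on the CPU and 8 µs on the GPU still favors the GPU: gpu_path_wins(10.0, 2.0, 8.0) returns 1, because the CPU path totals 22 µs. With free transfers the answer flips, and gpu_path_wins(0.0, 2.0, 8.0) returns 0.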

22 GPU Reduction Implementation
MGVF = normalized sub-image gradient
do {
    Compute the difference between each pixel and its eight neighbors
    Compute the regularized Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (!converged)
Slide annotation: Add convergence check to kernel body

23 Kernel Overhead Revisited. Overhead depends on the calling pattern: one at a time (synchronous): 9 µs; back-to-back (asynchronous): 3 µs. Diagram: Synchronous: kernel call, memory transfer, kernel call, memory transfer, kernel call, where each memory transfer causes an implicit synchronization. Asynchronous: kernel call, kernel call, kernel call, kernel call, kernel call.

24 Lesson 1 Revisited: Reduce Kernel Overhead. Increase the amount of work per kernel call: decrease the total number of kernel calls and amortize the overhead of each kernel call across more computation. Launch kernels back-to-back: kernel calls are asynchronous, so avoid explicit or implicit synchronization between kernel calls, and overlap kernel execution on the GPU with driver access on the CPU.

25 GPU Reduction Implementation. Chart: speedup over MATLAB: C 2.0x, C + OpenMP 7.7x, Naïve CUDA 0.8x, Larger Kernel 6.3x, Reduced Allocation 25.4x, GPU Reduction 60.7x. Percentage of runtime: memory allocation 7%, memory copying 1%, kernel execution 80%.

26 Persistent Thread Block
MGVF = normalized sub-image gradient
do {
    Compute the difference between each pixel and its eight neighbors
    Compute the regularized Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (!converged)
How can we offload the entire while loop as a kernel?

27 Persistent Thread Block. Problem: need a global memory fence. Multiple thread blocks compute the MGVF matrix. Thread blocks cannot communicate with each other, so each iteration requires a separate kernel call. Solution: compute the entire matrix in one thread block. An arbitrary number of iterations can then be computed in a single kernel call.

28 Persistent Thread Block: Example. Figure: the MGVF matrix under the canonical CUDA approach (1-to-1 mapping between threads and data elements) and under the persistent thread block approach.

29 Persistent Thread Block: Example. Figure: cells and the GPU's SMs under the canonical CUDA approach (1-to-1 mapping between threads and data elements) and under the persistent thread block approach.

30 Lesson 4: Avoid Global Memory Fences. Confine dependent computations to a single thread block. Execute an iterative algorithm until convergence in a single kernel call. This is only efficient if there are multiple independent computations.
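
The persistent thread block idea can be mimicked on the CPU to make the control flow concrete. The sketch below is a sequential C emulation under stated assumptions, not the authors' kernel: NUM_THREADS and relax_until_converged are hypothetical names, the array is one-dimensional for brevity, and the update rule (neighbor averaging with fixed end points) is a stand-in for the MGVF update, which the slides do not spell out. What it does show is a fixed set of "threads" striding over more elements than there are threads, with the whole iterate-until-converged loop inside a single call.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_THREADS 4   /* hypothetical block size; real blocks are larger */

/* Sequentially emulates one persistent thread block. Each of the
 * NUM_THREADS "threads" handles elements t, t + NUM_THREADS, ... and the
 * convergence loop never leaves this function. Returns the number of
 * iterations performed; the result is left in cur. */
int relax_until_converged(double *cur, double *next, size_t n,
                          double tolerance, int max_iterations)
{
    assert(cur != NULL && next != NULL && n >= 2 && tolerance > 0.0);
    int iterations = 0;
    while (iterations < max_iterations) {
        double max_change = 0.0;   /* on a GPU this would be a reduction */
        for (size_t t = 0; t < NUM_THREADS; t++) {
            for (size_t i = t; i < n; i += NUM_THREADS) {
                /* Stand-in update: end points stay fixed, interior
                 * elements become the mean of their two neighbors. */
                double v = (i == 0 || i == n - 1)
                               ? cur[i]
                               : 0.5 * (cur[i - 1] + cur[i + 1]);
                double change = v > cur[i] ? v - cur[i] : cur[i] - v;
                if (change > max_change) {
                    max_change = change;
                }
                next[i] = v;
            }
        }
        /* On a GPU, a block-wide barrier (__syncthreads) would sit here,
         * which is why the work must stay within one thread block. */
        for (size_t i = 0; i < n; i++) {
            cur[i] = next[i];
        }
        iterations++;
        if (max_change < tolerance) {
            break;
        }
    }
    return iterations;
}
```

One such call corresponds to one cell's matrix. The lesson's final caveat then applies: a single block leaves most of the GPU idle, so the approach pays off only when many cells are processed independently, one block per cell.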

31 Persistent Thread Block Implementation. Chart: speedup over MATLAB: C 2.0x, C + OpenMP 7.7x, Naïve CUDA 0.8x, Larger Kernel 6.3x, Reduced Allocation 25.4x, GPU Reduction 60.7x, Persistent Thread Block 211.3x (27x speedup over OpenMP).

32 Absolute Performance. Chart: frames per second (FPS) for MATLAB, C, C + OpenMP, and CUDA.

33 Conclusions. CUDA overheads can be significant bottlenecks. The techniques presented here can help mitigate the impact of these bottlenecks. CUDA provides enormous performance improvements for leukocyte tracking: 200x speedup over MATLAB and 27x speedup over OpenMP. Processing time for a 1-minute video is reduced from >4.5 hours to <1.5 minutes. Real-time leukocyte tracking will be feasible in the near future.

34 Acknowledgements. Funding provided by: NSF grant IIS; SRC grant; NVIDIA research grant; GRC AMD/Mahboob Kahn Ph.D. fellowship. Equipment donated by NVIDIA.

35 Software. Source code available at: [link not preserved]. An ImageJ plugin will be available soon.
