Computer Vision Algorithm Acceleration Using GPGPU and the Tegra Processor's Unified Memory

1 Computer Vision Algorithm Acceleration Using GPGPU and the Tegra Processor's Unified Memory
Engineering, Operations & Technology, Boeing Research & Technology
Aaron Mosher, Boeing Research & Technology, Avionics Systems Technology
MODAT, Ocean Search
GPU Technology Conference 2016, S6141
RROI # EOT
Mosher, 4/4/2016, S6141

2 Background: Computer vision algorithms give us the capability to make vehicles smarter (better automation), but they are often limited by the computation power available.
- DARPA Grand Challenge (2004/2005), CMU Red Team: shock-mounted electronics enclosure, multiple rack-mount computers with a 5 kW auxiliary generator and cooling [1]
- CSWaP: Cost, Size, Weight, Power
- Usually focused on thermal dissipation (power): I can supply the power, but how do I dissipate the heat?
- Figure labels: 3-4 kW, ~4 kW, <10 W
- Great idea, now how do I fit it on my vehicle?
- Image/signal processing, deep learning: sensor data into actionable decisions

3 Why explore GPGPU?
- Performance vs. power vs. cost: CPU speeds have not been getting faster since 2005 [2],[3], but transistor counts are still growing (SIMD, multi-core, etc.)
- Push towards concurrency, parallel operations
- Computationally expensive algorithms
- Miniaturization for deployment
- Traditional paths: FPGA, ASIC (long development / adaptation time)
- Speed costs money: how fast do you want to go?
- Can GPGPU provide a faster and easier way to get there?
- Figure: illustrative trend data. Labels: Laptop Computer; system on a chip/module

4 Algorithms
- Image processing or signal processing is a good fit
  - Image manipulation (GPUs are already good at this): for each pixel, do <<something>>
  - Signal processing: cuFFT, cuBLAS, etc.
  - Large data sets, same operation for each element
- Structure of the algorithm: is it structured for parallelization?
  - Some places have lots of branch divergence: while loops, if-then-else, open-ended iterations
  - Leave these on the CPU; use streaming and async operations to maximize performance around them
- Adaptation effort vs. throughput: how much time do you want to spend rewriting vs. how fast does it need to run for your needs?

5 First steps: Profile and understand data flow before you start
- Starting from an existing (reference) algorithm? Profile the current run-time performance
- Find bottlenecks in current throughput; understand where to spend your time wisely
- Map out a flowchart of execution flow and data dependency
  - A flowchart of data dependency helps you design concurrency
  - Software is usually written in a linear (imperative) style, but a data-dependency graph can help you determine areas for concurrency
- Figure: flowchart of Function A, Function B, Function C, Function D
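One way to turn a data-dependency flowchart into candidate areas for concurrency is to group functions into stages whose inputs are all ready at the same time. The sketch below is illustrative only: the slide names Function A through D but does not show their dependencies, so the edges used here (B and C depend on A, D depends on B and C) are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def concurrency_stages(deps):
    """Group nodes into stages; nodes in the same stage have no
    dependency on each other and could run concurrently.

    deps maps each function name to the set of functions it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        stages.append(ready)
        sorter.done(*ready)  # mark the whole stage as finished
    return stages


# Hypothetical dependencies (not taken from the slide).
deps = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}

stages = concurrency_stages(deps)
for i, stage in enumerate(stages):
    print(f"stage {i}: {', '.join(stage)}")
```

With these assumed edges, B and C land in the same stage, which marks them as candidates for concurrent execution (for example, on separate CUDA streams, or one on the CPU and one on the GPU).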

6 Initial efforts and results
- Initial cut at replacing functions with CUDA kernels
- Structure of the algorithm led to suboptimal utilization of kernels
- Restructured the application to keep the GPU pipeline full of compute tasks
- Identified further areas of optimization needed (some kernels taking too long)

7 GPGPU Pipeline Optimization: After
- The GPU pipeline is kept full with processing, with no air gaps where it stalls
- In this case, most processing moved onto the device (GPU), except for some processing at the end of the cycle
- Some algorithms may require intermediate steps of data movement between GPU and CPU; use overlapped operations [6]
- What is missing from this graph? The CPU. It's doing nothing.
- Further work: restructure the algorithm to use both GPU and CPU concurrently
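To see why keeping the pipeline full matters, a back-of-the-envelope timing model can compare strictly serial processing with an idealized overlap of upload, kernel, and download. This is a sketch under stated assumptions, not a measurement: it assumes the three stages can run fully concurrently for different frames, and the per-stage times below are hypothetical.

```python
def serial_time(n_frames, upload, kernel, download):
    """Total time when each frame is uploaded, processed, and
    downloaded before the next frame starts."""
    return n_frames * (upload + kernel + download)


def overlapped_time(n_frames, upload, kernel, download):
    """Total time for an idealized three-stage pipeline in which
    transfers for one frame overlap the kernel for another.
    After the pipeline fills, the slowest stage sets the pace."""
    if n_frames == 0:
        return 0
    stages = (upload, kernel, download)
    return sum(stages) + (n_frames - 1) * max(stages)


# Hypothetical per-frame stage times in milliseconds (not from the slides).
UPLOAD_MS, KERNEL_MS, DOWNLOAD_MS = 2, 5, 1
N_FRAMES = 100

t_serial = serial_time(N_FRAMES, UPLOAD_MS, KERNEL_MS, DOWNLOAD_MS)
t_overlap = overlapped_time(N_FRAMES, UPLOAD_MS, KERNEL_MS, DOWNLOAD_MS)
print(f"serial: {t_serial} ms, overlapped: {t_overlap} ms")
```

In this model the overlapped total approaches the number of frames times the slowest stage, so any gap where the GPU stalls shows up directly as lost throughput.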

8 MODAT adaptation results
- Before: MFC application on Intel i7, Intel Performance Primitives (video)
- After: OpenCV + CUDA on the Tegra X1

9 Results / performance (MODAT)
Before our adaptation, this algorithm ran at:
- ~13.8 frames per second (older C/C++ algorithm using MFC) on M4800
- ~36.6 frames per second when adapted to use IPP (Intel Performance Primitives) on M4800
After adaptation (OpenCV + CUDA), it ran at:
- ~14 frames per second (TK1)
- ~26 frames per second (TX1)
- ~29 frames per second on M4800 (Quadro K2100M)
Throughput on the TX1 approaches (but does not match) the laptop (Dell M4800) using CUDA or IPP, with a ~7:1 reduction in power dissipation.
- Compared to the simple C++ algorithm: the CUDA implementation on X1 is an 11x improvement in performance-per-watt
- Compared to the optimized IPP implementation: the CUDA implementation on X1 is a 5.7x improvement in performance-per-watt
- ~3 to 4 person-months of development effort
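The performance-per-watt comparison can be sketched as the throughput ratio multiplied by the power ratio. The slide does not list absolute wattages, so this sketch applies the rounded ~7:1 power reduction to both baselines; that is an assumption. It gives roughly 13x and 5x, near but not equal to the slide's 11x and 5.7x, which presumably rest on measured power for each configuration.

```python
def perf_per_watt_gain(fps_new, fps_old, power_ratio_old_to_new):
    """Improvement in frames-per-second-per-watt of the new system
    over the old one: (fps_new / W_new) / (fps_old / W_old)."""
    return (fps_new / fps_old) * power_ratio_old_to_new


# Frame rates from the slide; the single power ratio is the slide's
# rounded "~7:1" figure, applied to both baselines as an assumption.
FPS_TX1_CUDA = 26.0
FPS_M4800_CPP = 13.8
FPS_M4800_IPP = 36.6
POWER_RATIO = 7.0

gain_vs_cpp = perf_per_watt_gain(FPS_TX1_CUDA, FPS_M4800_CPP, POWER_RATIO)
gain_vs_ipp = perf_per_watt_gain(FPS_TX1_CUDA, FPS_M4800_IPP, POWER_RATIO)
print(f"vs simple C++: {gain_vs_cpp:.1f}x, vs IPP: {gain_vs_ipp:.1f}x")
```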

10 Unified Memory
- Physically combined memory on the Tegra K1 & X1
- Portable to other GPU platforms: data migration between host and device still happens but is hidden
- Initial testing with K1 was disappointing: slow memory access speeds (cache problem?)
- Testing with X1 was encouraging: as far as I can tell, unified memory incurs no penalty compared to traditional global memory
- The real value of Unified Memory, in our experience, is that it removes barriers to adaptation / conversion
  - Ease of programming / adaptation: existing algorithms had complicated C++ classes, object-oriented structures, deep-copy operations for data [4]
  - Object-oriented programming is a common design
- Figure labels: GPU; CPU
- Take advantage in conjunction with overlapped operations and CUDA streams [6]
  - Our Ocean Search algorithm does this: ~15% speedup when implemented

11 Unified Memory benchmarking
- Based on the BoxFilter sample, reversing pixels per row in the image (mirror left-right)
- In these examples, the Memory Reversal benchmark includes both upload and download times
- GFLOP/s are an estimate and a relative measurement only, not an optimized algorithm for maximum performance
- Charts: Tegra K1; Tegra X1
- On K1, depends on access patterns, data movement
- Faster on X1 (Shield TV and Jetson TX1): around 2x faster memory
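The per-row reversal used in the benchmark is simple enough to state as a CPU reference, which can be handy for checking a GPU kernel's output. This is a minimal sketch of the operation only (mirror left-right), not the benchmark code from the talk; the function name and the tiny test image are illustrative.

```python
def mirror_rows(image):
    """Return a new image with the pixels of every row reversed
    (left-right mirror). The image is a list of rows of pixel values."""
    return [row[::-1] for row in image]


# Tiny illustrative image: 2 rows of 3 pixels.
image = [
    [1, 2, 3],
    [4, 5, 6],
]

mirrored = mirror_rows(image)
print(mirrored)
```

Mirroring twice returns the original image, which gives a cheap sanity check for any accelerated version.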

12 Conclusions and lessons learned
What we learned:
- Reference algorithm (CPU only): start with a profiler before jumping into CUDA adaptation
- Adaptation effort vs. runtime performance gained: use the Nsight profiler; choose optimizations carefully to maintain the development schedule
What went right:
- We were able to provide a computer-vision capability within a size/weight/power not previously possible
What went wrong:
- Some experiments in kernel optimization took extra development time but yielded no appreciable runtime improvement. 80% / 20% rule? Profile often to understand where to spend effort

13 References and Credits:
[1] "A Robust Approach to High-Speed Navigation for Unrehearsed Desert Terrain," Chris Urmson, Charlie Ragusa, and David Ray, Journal of Field Robotics 23(8), 467-508 (2006)
[2] "CPU DB: Recording Microprocessor History," Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University. ACM Queue, April 6, Volume 10, Issue 4.
[3]
[4]
[5] (Alex St. John)
[6]

14 Questions and Answers time [1]
