Graphics Processing Unit (GPU) Acceleration of Machine Vision Software for Space Flight Applications

Workshop on Space Flight Software, November 6, 2009
Brent Tweddle
Massachusetts Institute of Technology, Space Systems Laboratory

Machine Vision in Space
  CSA: Space Vision System
  JSC: Sprint AERCam & Mini AERCam
  DARPA: Orbital Express AVGS
  GSFC: Hubble Robotic Repair
  NRL: SUMO / FREND
  JPL: Mars Exploration Rovers
  MIT SSL: SPHERES

MER Driving Speeds
  Mode                         Speed
  Manual Driving               124 m/hr
  AutoNav (safe terrain)       36 m/hr
  AutoNav (obstacles)          10 m/hr
  Visual Odometry              10 m/hr
  Visual Odometry + AutoNav    5 m/hr
13 second 0.5 m drive, 70 second compute
Flight Processor
  20 MHz RAD6000 CPU
  128 MB DRAM
  VxWorks Operating System
  Memory Space & Cache Shared by 97 other tasks
[1] J. J. Biesiadecki, C. Leger, and M. W. Maimone. Tradeoffs Between Directed and Autonomous Driving on the Mars Exploration Rovers. In S. Thrun, R. A. Brooks, and H. F. Durrant-Whyte, editors, ISRR, volume 28 of Springer Tracts in Advanced Robotics, pages 254-267. Springer, 2005.
[2] M. W. Maimone, A. E. Johnson, Y. Cheng, R. G. Willson, and L. Matthies. Autonomous Navigation Results from the Mars Exploration Rover (MER) Mission. In M. H. Ang Jr. and O. Khatib, editors, ISER, Springer Tracts in Advanced Robotics, pages 3-13. Springer, 2004.

Overview
  Characteristics of Vision Algorithms
    Parallelism and locality
  Hardware Architecture
    CPU
    FPGA
    GPU
  GPU Programming Model
  Initial Performance Results and Comparison
  Path to space operations
  Conclusions

Stereo Depth Map
Minimize Windowed Sum of Squared Differences over d
[Figure: Left Stereo Image, Right Stereo Image, and the resulting Stereo Disparity Map; d marks the disparity between the two images]
Characteristics
  2D Spatial Locality
    Read or Write
  Data Parallel
  Minimal Branching & Instruction Complexity
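The minimization named on this slide can be sketched in plain C for a single pixel. This is a minimal sketch under stated assumptions, not the implementation from the talk: the images are 8-bit grayscale in row-major order, the left image is the reference, and a left pixel at column x is assumed to appear at column x - d in the right image. The names window_ssd and best_disparity, and all parameters, are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Windowed sum of squared differences between the left window centered at
   (x, y) and the right window shifted left by d pixels. */
static long window_ssd(const unsigned char *left, const unsigned char *right,
                       int width, int x, int y, int d, int radius)
{
    long ssd = 0;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            int l = left[(y + dy) * width + (x + dx)];
            int r = right[(y + dy) * width + (x + dx - d)];
            ssd += (long)(l - r) * (l - r);
        }
    }
    return ssd;
}

/* Return the disparity in [0, max_d] with the smallest windowed SSD.
   The window must stay inside both images for every candidate d. */
int best_disparity(const unsigned char *left, const unsigned char *right,
                   int width, int height, int x, int y, int max_d, int radius)
{
    assert(radius >= 0 && max_d >= 0);
    assert(x - radius - max_d >= 0 && x + radius < width);
    assert(y - radius >= 0 && y + radius < height);

    int best_d = 0;
    long best_cost = LONG_MAX;
    for (int d = 0; d <= max_d; ++d) {
        long cost = window_ssd(left, right, width, x, y, d, radius);
        if (cost < best_cost) {   /* ties keep the smaller disparity */
            best_cost = cost;
            best_d = d;
        }
    }
    return best_d;
}
```

Every pixel runs the same loops on a small 2D neighborhood and writes one independent result, which is the combination of 2D locality, data parallelism, and minimal branching listed above.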

Cache Fundamentals
Cache: Smaller, faster, more expensive memory that mirrors data that is likely to be used in the future
Principle of Locality
  Temporal Locality: Data that has been recently accessed will likely be accessed again in the future
  Spatial Locality: Data that is near recently accessed data will likely be accessed in the future
[Figure: 4x4 matrix with elements 0 to F stored in main memory in row order (0 1 2 ... F); a 2 block cache holds 4 5 6 7 and C D E F]
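The effect of spatial locality can be illustrated with a toy cache model. This is a minimal sketch, assuming a fully associative cache of 2 blocks with 4 elements per block and least-recently-used replacement; the block size and the replacement policy are assumptions chosen to resemble the figure, and count_misses is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_ELEMS 4   /* elements per cache block (assumed) */
#define NUM_BLOCKS  2   /* blocks the cache can hold */

/* Count misses for a sequence of element addresses on a tiny fully
   associative cache with least-recently-used replacement. */
int count_misses(const int *addrs, size_t n)
{
    int tags[NUM_BLOCKS];   /* tags[0] is the most recently used block */
    int used = 0;
    int misses = 0;

    for (size_t i = 0; i < n; ++i) {
        assert(addrs[i] >= 0);
        int tag = addrs[i] / BLOCK_ELEMS;
        int pos = -1;
        for (int j = 0; j < used; ++j) {
            if (tags[j] == tag) pos = j;
        }
        if (pos < 0) {                    /* miss: the block must be loaded */
            ++misses;
            if (used < NUM_BLOCKS) ++used;
            pos = used - 1;               /* when full, this slot is the LRU block */
        }
        for (int j = pos; j > 0; --j) {   /* move the block to the front */
            tags[j] = tags[j - 1];
        }
        tags[0] = tag;
    }
    return misses;
}
```

Under these assumptions, reading the 16 elements in address order misses 4 times (once per block), while reading the row-major 4x4 matrix column by column misses on all 16 accesses.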

2D Spatial Locality: Morton Mapping
2D Principle of Locality
  Optimized for two dimensional applications
  Currently implemented as texture cache in GPUs, used in machine vision applications
  Could be implemented on standard CPUs, but would need a remap procedure that will generate cache misses
  Translation from x-y to Morton Mapping address is computationally more expensive
[Figure: 4x4 matrix with elements 0 to F stored in main memory in Morton order (0 1 4 5 2 3 6 7 8 9 C D A B E F); a 2 block cache holds 2 3 6 7 and A B E F]
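The x-y to Morton address translation can be sketched in C by interleaving the bits of the two coordinates. This is a minimal sketch, assuming 16-bit coordinates, x bits in the even positions and y bits in the odd positions, and a square matrix whose side is a power of two; the function names are hypothetical and this is not how GPU texture hardware is necessarily implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 16 bits of v so they occupy the even bit positions. */
static uint32_t spread_bits(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Morton (Z-order) address: x bits in even positions, y bits in odd. */
uint32_t morton_address(uint32_t x, uint32_t y)
{
    assert(x < 65536u && y < 65536u);
    return spread_bits(x) | (spread_bits(y) << 1);
}

/* Copy a row-major n x n matrix into Morton order (n a power of two). */
void to_morton_layout(const unsigned char *row_major, unsigned char *morton,
                      uint32_t n)
{
    for (uint32_t y = 0; y < n; ++y) {
        for (uint32_t x = 0; x < n; ++x) {
            morton[morton_address(x, y)] = row_major[y * n + x];
        }
    }
}
```

Applied to the 4x4 matrix 0 to F, to_morton_layout produces the memory order 0 1 4 5 2 3 6 7 8 9 C D A B E F shown in the figure. The shifts and masks in spread_bits are the extra work a CPU would do on every access, compared with the single multiply-add of row-major addressing.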

Data Parallel Visual Navigation Algorithms
  Estimation
    Particle Filter
  2D Image Processing
    Disparity Map
    Kernel Filtering
  3D Data Processing
    Iterative Closest Point
  Path Planning
    Rapidly Exploring Random Trees

CPU Architecture
Pentium 4 Willamette
  Released Nov 2000
  1.3 to 2.0 GHz
  256 KB cache
  Total Power @ 1.6 GHz: 60.8 W
  L1 miss: 2 cycles
  L2 miss: 7 cycles
[1] W. Wu, L. Jin, J. Yang, P. Liu, and S. X. D. Tan. A Systematic Method for Functional Unit Power Estimation in Microprocessors. In Design Automation Conference, 2006.

FPGA Architecture
  Programmable logic implemented as look up tables
  Incorporates on-chip memory and DSP blocks
  Implemented using VHDL or Verilog to describe logic
  Development and testing is very difficult
  Less power efficient than a custom ASIC
[Figure: Altera Stratix Look Up Table]

NVIDIA GPU Architecture
  Architecture Designed for Data Parallel Applications
  Programming Model: Single Program, Multiple Data

GPUs for Embedded Systems
  Processor                       Theoretical Peak   Watts   Watts per GFLOPS
  Quad Bloomfield Xeon 3.2 GHz    25.6 GFLOPS        130 W   5.078
  Core 2 Duo Penryn 2.53 GHz      20.2 GFLOPS        25 W    1.238
  Cell Processor                  152 GFLOPS         80 W    0.526
  NVIDIA Tesla C870               518 GFLOPS         170 W   0.328
  NVIDIA GeForce 9800 GT          504 GFLOPS         105 W   0.208
  NVIDIA GeForce 8800M GTS        240 GFLOPS         35 W    0.145
Assumptions:
  Xeon issues 2 flops per cycle per core
  Core2Duo issues 4 flops per cycle per core
  http://icl.cs.utk.edu/hpcc/hpcc_desc.cgi?field=theoretical%20peak
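The CPU rows follow from simple arithmetic: theoretical peak is clock rate times core count times flops issued per cycle per core, and the last column is watts divided by that peak. The sketch below spells this out in C; the function names are hypothetical, and the core counts (4 for the quad-core Xeon, 2 for the Core 2 Duo) are assumptions read from the processor names.

```c
#include <assert.h>

/* Theoretical peak in GFLOPS = clock (GHz) x cores x flops per cycle per core. */
double peak_gflops(double clock_ghz, int cores, int flops_per_cycle)
{
    assert(clock_ghz > 0.0 && cores > 0 && flops_per_cycle > 0);
    return clock_ghz * cores * flops_per_cycle;
}

/* Power cost of throughput in watts per GFLOPS; lower is better. */
double watts_per_gflops(double watts, double gflops)
{
    assert(watts >= 0.0 && gflops > 0.0);
    return watts / gflops;
}
```

For example, peak_gflops(3.2, 4, 2) gives 25.6 GFLOPS for the Xeon, and watts_per_gflops(130.0, 25.6) gives about 5.078, matching the first row of the table.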

Mip-Mapping & Texture Cache
  GPUs have hardware to support mapping textures onto 3D objects
    2D Spatial Locality
    High throughput
    Low latency
  Data is stored as a Mip-Map in Texture Cache
  Hardware supports sub-pixel interpolation
[Figure: Rendered Scene; Stored Texture; Mip-Mapped Texture; Morton Access Pattern]

CUDA Programming Model
  Single Program, Multiple Data in C
    Same instruction issued to 8 threads (context & data)
  Parallel Execution with no guarantee of order
    Race conditions & deadlocks are possible
    Synchronization and mutual exclusion are necessary
  Direct control of on-chip memory (memory read is 100s of cycles)
    Implement custom caching protocols
  Maximizing performance is challenging
    Aligned Memory access
    Resource Utilization
Example kernel, O(n^3/p):
    __global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
    {
        // Each thread computes one element of P = M * N.
        int tx = threadIdx.x;   // row of P
        int ty = threadIdx.y;   // column of P
        int Mcols = M.width;
        int Ncols = N.width;
        float sum = 0;
        // Dot product of row tx of M with column ty of N.
        for (int i = 0; i < Mcols; ++i) {
            float a = M.elements[tx * Mcols + i];
            float b = N.elements[i * Ncols + ty];
            sum += a * b;
        }
        int index = tx * Ncols + ty;
        P.elements[index] = sum;
    }

Initial GPU Stereo Results
  Implemented stereo disparity map on GPU with left-right (LR) Consistency Check, based on NVIDIA original code
    25 ms for a 640x480 frame
  Optimized algorithms for CPU SIMD hardware
    512x512: <0.1 s
    Van der Mark, Gavrila, Real-Time Dense Stereo for Intelligent Vehicles, IEEE Trans. ITS, 2006
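A left-right consistency check compares the disparity map computed with the left image as reference against the one computed with the right image as reference, and discards pixels where the two disagree. The C sketch below shows the idea for one image row. It is a minimal sketch under stated assumptions, not the NVIDIA-based code used for these results: a left pixel at column x with disparity d is assumed to match right column x - d, and INVALID_DISPARITY, lr_consistency_row, and max_diff are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define INVALID_DISPARITY (-1)

/* Left-right consistency check on one image row.
   disp_left[x]  : disparity of left pixel x (its match is right column x - d)
   disp_right[x] : disparity of right pixel x
   A left disparity is invalidated when its match falls outside the row, when
   the matching right pixel has no valid disparity, or when the two disparities
   differ by more than max_diff. Returns the number of rejected pixels. */
int lr_consistency_row(int *disp_left, const int *disp_right,
                       int width, int max_diff)
{
    assert(width >= 0 && max_diff >= 0);
    int rejected = 0;
    for (int x = 0; x < width; ++x) {
        int d = disp_left[x];
        if (d == INVALID_DISPARITY) continue;   /* already invalid */
        int xr = x - d;                         /* matching right column */
        if (xr < 0 || xr >= width ||
            disp_right[xr] == INVALID_DISPARITY ||
            abs(d - disp_right[xr]) > max_diff) {
            disp_left[x] = INVALID_DISPARITY;
            ++rejected;
        }
    }
    return rejected;
}
```

Each pixel is checked independently with two reads and one write, so the check maps onto the same data-parallel pattern as the disparity search itself.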

Path To Space
Future Research and Development
  COTS GPU Implementation of Navigation Algorithms
    Do they work well in practice?
  Development of embedded system architectures
    Should we:
      Radiation harden a COTS GPU
      Or build a rad-hard GPU-like ASIC?
  Software testing of parallel algorithms?
  ESA's architecture: Primary Flight Computer to monitor Accelerator for errors
[Figure: Primary Flight Computer connected to GPU Vision Accelerator]

Summary & Conclusions
  Discussed Characteristics of Machine Vision Algorithms
  Identified need for faster and more power efficient processing architectures
  GPU Architecture matches well with Machine Vision
    2D Locality
      Texture Caches
    Data Parallel
      SPMD Programming Model
    Minimal Branching and Instruction Complexity
      Reduced control hardware
  Initial Performance Tests show promise
  Significant work ahead

Questions & Acknowledgements