GPU Accelerating Speeded-Up Robust Features
Timothy B. Terriberry, Lindley M. French, and John Helmsen

Overview of ArgonST
Manufacturer of integrated sensor hardware and sensor analysis systems
  RF, COMINT, ELINT, EO/IR, LIDAR, Multispectral, Hyperspectral, Acoustic
Research Group Focus
  Artificial Intelligence and Machine Learning
  Automated Scene Understanding
  Visual Navigation

Au COIN Project
Automated Understanding via Collective Image Navigation
Advanced, ultra-tight coupling methods for visual navigation
Partnered with the Air Force Institute of Technology
This research: Front-end processing

Outline
Introduction
Overview of SURF
GPU Implementation Details
Results
Conclusion

Robust Image Features
Summarize an image by a small number of interest points
  Less data
  More entropy
Robust features
  Relatively insensitive to view changes
    Scale, rotation, affine, perspective, etc.
  Match more reliably
SURF (Bay et al. 2006)
  Simple to compute, small features

SURF: Detection
Use determinant of Hessian
  Components are convolutions of the image with Gaussian derivatives
  Approximate these with box filters
    Very easy to compute (constant time)
    Does not impair detection
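
For reference, the detector response described above can be written as follows. The relative weight 0.9 on the mixed term is the value given in the SURF paper (Bay et al. 2006), not something stated on this slide. Here $L_{xx}$, $L_{xy}$, $L_{yy}$ are the Gaussian second-derivative responses at scale $\sigma$, and $D_{xx}$, $D_{xy}$, $D_{yy}$ are their box-filter approximations.

```latex
H(\mathbf{x}, \sigma) =
\begin{pmatrix}
L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\
L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma)
\end{pmatrix},
\qquad
\det(H_{\text{approx}}) = D_{xx} D_{yy} - (0.9\, D_{xy})^2
```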

SURF: Detection
Scale invariance
  Run detector at many scales
  Take the 3×3×3 local maxima
  Fit a quadratic patch to get sub-pixel resolution
Rotation invariance
  Compute local orientation of the image near the interest point
  Compute descriptor relative to the local coordinate system
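
The 3×3×3 local-maximum test can be illustrated with a minimal CPU sketch. It assumes the detector responses are stored as a nested list indexed `resp[scale][y][x]`; the function name and the strict-inequality tie rule are choices made for this illustration, and the sub-pixel quadratic fit is not covered.

```python
def local_maxima_3x3x3(resp, threshold):
    """Return (scale, y, x) indices of strict 3x3x3 local maxima above threshold.

    resp is a nested list indexed as resp[scale][y][x]. Border samples are
    skipped because they lack a full 26-sample neighbourhood.
    """
    n_s, n_y, n_x = len(resp), len(resp[0]), len(resp[0][0])
    points = []
    for s in range(1, n_s - 1):
        for y in range(1, n_y - 1):
            for x in range(1, n_x - 1):
                v = resp[s][y][x]
                if v <= threshold:
                    continue  # cheap rejection before touching neighbours
                is_max = all(
                    v > resp[s + ds][y + dy][x + dx]
                    for ds in (-1, 0, 1)
                    for dy in (-1, 0, 1)
                    for dx in (-1, 0, 1)
                    if (ds, dy, dx) != (0, 0, 0)
                )
                if is_max:
                    points.append((s, y, x))
    return points
```

Rejecting on the threshold first mirrors the idea of discarding most pixels early, since only a small fraction of samples survive to the neighbourhood comparison.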

SURF: Description
Split region around feature into 16 bins
Each bin: sum 25 high-frequency Haar wavelet responses in x and y
  Also sum magnitude of responses
  (Σdx, Σdy, Σ|dx|, Σ|dy|) × 16 = 64 dimensions
Same or better matching performance as SIFT (128 dimensions)
Contrast invariance: normalize to a unit vector
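
As a rough sketch of how the 64 dimensions are assembled, the snippet below takes the Haar responses already grouped into 16 bins and emits four sums per bin, then normalizes. The function name and input layout are assumptions for illustration, and the Gaussian weighting used in the SURF paper is left out.

```python
import math

def surf_descriptor(responses):
    """Build a 64-D descriptor from 16 bins of (dx, dy) Haar responses.

    responses[b] is a list of (dx, dy) pairs for bin b (25 samples per bin
    in SURF). Each bin contributes (sum dx, sum dy, sum |dx|, sum |dy|).
    """
    if len(responses) != 16:
        raise ValueError("expected 16 bins")
    vec = []
    for samples in responses:
        vec.append(sum(dx for dx, _ in samples))
        vec.append(sum(dy for _, dy in samples))
        vec.append(sum(abs(dx) for dx, _ in samples))
        vec.append(sum(abs(dy) for _, dy in samples))
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return vec  # flat patch: nothing to normalize
    # Unit length makes the descriptor invariant to a contrast scale factor.
    return [v / norm for v in vec]
```

Scaling every response by the same positive factor leaves the output unchanged, which is the contrast invariance mentioned above.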

Implementation Details
Target platform: GeForce Go 7950 GTX
  OpenGL+Cg instead of CUDA
  No 32-bit integer textures
  No hardware blending of 32-bit floats
Performance target
  10 fps at 1280x1024 (speed of the camera)
Bottleneck: memory bandwidth
  Computation is almost free
  Texture lookups are expensive

Integral Image Computation
The Integral Image allows constant time summation over arbitrarily large regions
Each pixel contains the sum of all the values in the original image to the left and above it
The sum of any rectangular region can be computed with four lookups
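
A small serial sketch of the definition and the four-lookup box sum is given below. It is a plain CPU reference, not the GPU implementation; the function names and the inclusive-index convention are assumptions made for the example.

```python
def integral_image(img):
    """Inclusive integral image: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0..y1][x0..x1] (inclusive corners) using four lookups."""
    def at(y, x):
        # Anything above or left of the image contributes zero.
        return ii[y][x] if y >= 0 and x >= 0 else 0
    return at(y1, x1) - at(y0 - 1, x1) - at(y1, x0 - 1) + at(y0 - 1, x0 - 1)
```

For example, with `img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, `box_sum(integral_image(img), 1, 1, 2, 2)` adds 5 + 6 + 8 + 9 regardless of how large the box is.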

1-D Parallel Approach
Sum across columns in parallel, then across rows
  ~1000-degree parallelism (good)
  ~2000 passes (not as bad as you'd think)
  Ping-pong between two textures to avoid read-after-write dependencies
Bad: Texture cache is 2-D (8×8 pixel blocks)
  Cache is flushed between rendering passes
  If we only use one row (column) in each rendering pass, we're wasting 7/8 of the memory bandwidth

2-D Parallel Approach (Blelloch)
Sum within a column (row) in parallel as well
Two-phase approach, O(log n) passes each
  Upsweep: Collects local sums
  Downsweep: Distributes cumulative sums
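
The two phases can be sketched on a 1-D array as follows. This is a serial simulation of the work-efficient scan: each inner `for` loop corresponds to one pass whose updates are independent and could run in parallel. The power-of-two length restriction and the function names are simplifications for this example.

```python
def blelloch_exclusive_scan(values):
    """Work-efficient exclusive prefix sum; len(values) must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    a = list(values)

    # Upsweep: collect local sums, one tree level per pass (log2 n passes).
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            a[i] += a[i - stride]
        stride *= 2

    # Downsweep: distribute cumulative sums back down the tree.
    a[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = a[i - stride]
            a[i - stride] = a[i]
            a[i] += left
        stride //= 2
    return a

def inclusive_scan(values):
    """Inclusive prefix sum built from the exclusive scan."""
    exclusive = blelloch_exclusive_scan(values)
    return [e + v for e, v in zip(exclusive, values)]
```

Applying `inclusive_scan` to every row and then to every column of an image yields its integral image.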

Moment Pyramid Algorithm
Blelloch still sums across columns, then rows
  Can we sum in both directions at once?
To generate an integral image from a ¼ resolution integral image, need 4 pieces:
  Sum of upper-left region for odd x, odd y
  Sum of left row for odd x, even y
  Sum of upper column for even x, odd y
  Original pixel for even x, even y

Moment Pyramid Algorithm
  Sum of upper-left region for odd x, odd y
  Sum of left row for odd x, even y
  Sum of upper column for even x, odd y
  Original pixel for even x, even y
Where do we get the row/column sums?
  Output three values during upsweep
    Sum of all 4 values, sum of 2 odd x, sum of 2 odd y
  Apply Blelloch's algorithm to make row/column sums on each level cumulative

Moment Pyramid Algorithm
Upsweeps: Collect local sums
Downsweeps: Distribute cumulative sums
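
One plausible reading of the four-piece reconstruction is sketched below as a recursive CPU reference. It assumes a square image whose side is a power of two and uses 0-based coordinates. Plain loops stand in for Blelloch's algorithm when making the row and column sums cumulative. The function name and data layout are illustrative, not taken from the authors' implementation.

```python
def pyramid_integral(p):
    """Inclusive integral image of a square 2^k x 2^k nested list."""
    n = len(p)
    if n == 1:
        return [[p[0][0]]]
    h = n // 2

    # Upsweep: three values per 2x2 block (block sum, a row-pair sum,
    # a column-pair sum).
    block = [[p[2*Y][2*X] + p[2*Y][2*X+1] + p[2*Y+1][2*X] + p[2*Y+1][2*X+1]
              for X in range(h)] for Y in range(h)]
    R = [[p[2*Y][2*X] + p[2*Y][2*X+1] for X in range(h)] for Y in range(h)]
    C = [[p[2*Y][2*X] + p[2*Y+1][2*X] for X in range(h)] for Y in range(h)]

    # Make the row/column sums cumulative on this level.
    for Y in range(h):
        for X in range(1, h):
            R[Y][X] += R[Y][X-1]      # R[Y][X] = sum of p[2Y][0..2X+1]
    for Y in range(1, h):
        for X in range(h):
            C[Y][X] += C[Y-1][X]      # C[Y][X] = sum of p[0..2Y+1][2X]

    A = pyramid_integral(block)       # quarter-resolution integral image

    def a(Y, X):
        return A[Y][X] if Y >= 0 and X >= 0 else 0

    # Downsweep: combine the four pieces according to pixel parity.
    out = [[0] * n for _ in range(n)]
    for Y in range(h):
        for X in range(h):
            left_row = R[Y][X-1] if X > 0 else 0
            upper_col = C[Y-1][X] if Y > 0 else 0
            out[2*Y+1][2*X+1] = A[Y][X]                  # odd x, odd y
            out[2*Y][2*X+1] = a(Y-1, X) + R[Y][X]        # odd x, even y
            out[2*Y+1][2*X] = a(Y, X-1) + C[Y][X]        # even x, odd y
            out[2*Y][2*X] = (a(Y-1, X-1) + upper_col
                             + left_row + p[2*Y][2*X])   # even x, even y
    return out
```

Each level touches only a quarter of the data of the level below it, which is consistent with the claim that the full-sized image is read and written only once.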

Moment Pyramid Algorithm
Why is this actually faster?
  Only read/write a full-sized image once
  Extremely good cache use
    More reads than texels fetched

Algorithm           Reads   Cache Efficiency   Real Adds (effective)   Writes   Speed-up
1D Ping Pong        4N      12.5%              2N (2N)                 4N       1.00
2D Blelloch         10N     100.0%             4N (6N)                 6N       3.89
2D Moment Pyramid   7.67N   109.5%             3.33N (4.33N)           4.33N    4.63

Box Filters
Gaussian derivatives for feature location
Applied at many different resolutions and scales
  Identify both position and size

Box Filters
Simple implementation requires 32 lookups
  Too many!

Box Filters
Simple implementation requires 32 lookups
  Too many!
Many differences separated by common offsets
  Compute differences for all pixels in one pass
  Reuse each result for several pixels in another pass
Can easily reduce to 17 lookups per scale
  47% reduction in running time
Do 3 scales at once: 13.33 per scale
  77% reduction in running time
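
To make the lookup count concrete, here is a sketch of a direct (unoptimized) evaluation of the three box-filter responses from an integral image. The lobe layout follows the 9×9 filters as commonly described for SURF, and the exact offsets should be treated as assumptions. This evaluation uses 8 + 8 + 16 = 32 integral-image lookups, which is consistent with the count quoted above, although the slides do not spell out the layout. Area normalization is omitted.

```python
def integral_image(img):
    """Zero-padded integral image: ii[y][x] = sum of img[:y][:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii

def box(ii, y, x, rows, cols):
    """Sum of the rows-by-cols box whose top-left pixel is (y, x): 4 lookups."""
    return ii[y + rows][x + cols] - ii[y][x + cols] - ii[y + rows][x] + ii[y][x]

def hessian_9x9(ii, cy, cx):
    """Box-filter responses (dxx, dyy, dxy) for a 9x9 filter centred on (cy, cx)."""
    # Dyy: three stacked lobes (3 rows x 5 cols) weighted (+1, -2, +1),
    # written as (whole 9x5 strip) - 3 * (middle lobe).
    dyy = box(ii, cy - 4, cx - 2, 9, 5) - 3 * box(ii, cy - 1, cx - 2, 3, 5)
    # Dxx: the same layout transposed.
    dxx = box(ii, cy - 2, cx - 4, 5, 9) - 3 * box(ii, cy - 2, cx - 1, 5, 3)
    # Dxy: four 3x3 lobes, one per quadrant, with alternating signs.
    dxy = (box(ii, cy - 3, cx - 3, 3, 3) + box(ii, cy + 1, cx + 1, 3, 3)
           - box(ii, cy - 3, cx + 1, 3, 3) - box(ii, cy + 1, cx - 3, 3, 3))
    return dxx, dyy, dxy

def det_hessian(ii, cy, cx, w=0.9):
    """Approximate Hessian determinant; w = 0.9 is the weight from Bay et al."""
    dxx, dyy, dxy = hessian_9x9(ii, cy, cx)
    return dxx * dyy - (w * dxy) ** 2
```

Neighbouring pixels evaluate boxes whose corners differ by the same fixed offsets, which is the redundancy that the multi-pass scheme above exploits.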

Box Filter Sampling Locations
[Figure: sampling locations for Pass 1 through Pass 5]

Point Location
Compute Hessian determinant from box filters
Find 3×3×3 local maxima over threshold
  Multiple passes of EarlyZ culling
  Tried stencil buffer approach, but had driver problems on Linux
Convert to a flat array of coordinates using the HistoPyramid algorithm (Ziegler et al. 2006)
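
The compaction step can be illustrated with a small CPU sketch of the HistoPyramid idea: build a pyramid of 2×2 counts over a binary mask, then locate the k-th marked pixel by walking from the top of the pyramid down. The function names, the power-of-two mask size, and the child visiting order are assumptions for this example.

```python
def build_histopyramid(mask):
    """mask: square 2^k x 2^k nested list of 0/1. Returns levels, finest first."""
    levels = [[row[:] for row in mask]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        h = len(prev) // 2
        levels.append([[prev[2*y][2*x] + prev[2*y][2*x+1]
                        + prev[2*y+1][2*x] + prev[2*y+1][2*x+1]
                        for x in range(h)] for y in range(h)])
    return levels

def extract_point(levels, k):
    """Return (y, x) of the k-th marked pixel (0-based) by top-down traversal."""
    y = x = 0
    for level in reversed(levels[:-1]):   # coarse to fine, below the 1x1 top
        y, x = 2 * y, 2 * x
        for dy, dx in ((0, 0), (0, 1), (1, 0), (1, 1)):
            count = level[y + dy][x + dx]
            if k < count:
                y, x = y + dy, x + dx     # the k-th point lies in this child
                break
            k -= count                    # skip the points in this child
    return y, x

def compact(mask):
    """Flat list of (y, x) coordinates of all marked pixels."""
    levels = build_histopyramid(mask)
    total = levels[-1][0][0]
    # Each output index is resolved independently of the others.
    return [extract_point(levels, k) for k in range(total)]
```

Because every output element does its own traversal, the extraction maps naturally onto one fragment or thread per output coordinate.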

Orientation Detection
Compute high-frequency (HF) Haar responses in a 6s radius
Sort by angle, use sliding window to find max
  Sorting on a GPU is about as slow as on a CPU!
Don't sort: histogram
  R2VB (Scheuermann & Hensley 2007)
  RMS error 0.20 using 256 bins
  Sum sliding window with Blelloch's algorithm
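
A minimal sketch of the histogram-based alternative to sorting follows. Responses are accumulated into angle bins, prefix sums are taken over the bins, and each sliding window is evaluated as a difference of two prefix sums. The window width of π/3 is the value from the SURF paper rather than from this slide. The serial prefix-sum loop stands in for Blelloch's algorithm, and the function name is hypothetical.

```python
import math

def dominant_orientation(responses, bins=256, window=math.pi / 3):
    """Estimate the dominant angle of (dx, dy) responses without sorting."""
    hx = [0.0] * bins
    hy = [0.0] * bins
    for dx, dy in responses:
        angle = math.atan2(dy, dx) % (2.0 * math.pi)
        b = min(int(angle / (2.0 * math.pi) * bins), bins - 1)
        hx[b] += dx
        hy[b] += dy

    # Prefix sums over the histogram unrolled twice, so a window that wraps
    # past 2*pi is still one contiguous range.
    px, py = [0.0], [0.0]
    for i in range(2 * bins):
        px.append(px[-1] + hx[i % bins])
        py.append(py[-1] + hy[i % bins])

    width = max(1, round(window / (2.0 * math.pi) * bins))
    best_mag, best_angle = -1.0, 0.0
    for start in range(bins):
        sx = px[start + width] - px[start]
        sy = py[start + width] - py[start]
        mag = sx * sx + sy * sy
        if mag > best_mag:
            best_mag, best_angle = mag, math.atan2(sy, sx)
    return best_angle
```

Binning quantizes only which window a response falls in; the returned angle is still computed from the summed response vector.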

Feature Descriptor
Need oriented Haar responses
  Can only sum over rectangular regions
Compute axis-aligned responses
  Rotate the resulting vector
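
The rotation step amounts to expressing each axis-aligned response in the feature's local frame. A small sketch, assuming the orientation `theta` is measured counter-clockwise from the image x axis (the function name and sign convention are assumptions):

```python
import math

def rotate_response(dx, dy, theta):
    """Express an axis-aligned response (dx, dy) in a frame rotated by theta."""
    c, s = math.cos(theta), math.sin(theta)
    # Project onto the local x axis (c, s) and the local y axis (-s, c).
    return (c * dx + s * dy, -s * dx + c * dy)
```

A response that points exactly along the feature orientation maps to a vector with a zero second component, so the descriptor no longer depends on the in-plane rotation of the image.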

Framerate vs. Resolution
[Figure: framerate vs. resolution for the GeForce Go 7950 GTX and the GeForce 8800 GTX]
Does not include time to transfer the image to the card, as this can be done asynchronously and affects only latency, not throughput.

Go 7950 GTX Performance Breakdown
[Figure: bar chart over the stages Radial Undistortion, Integral Image, Box Filters, Point Location, Orient. Detection, Feature Extraction, with series T0=1.00, T0=0.50, T0=0.25]
Execution time (in %) in each stage of the algorithm for various threshold levels

8800 GTX Performance Breakdown
[Figure: bar chart over the stages Radial Undistortion, Integral Image, Box Filters, Point Location, Orient. Detection, Feature Extraction, with series T0=1.00, T0=0.50, T0=0.25]
Execution time (in %) in each stage of the algorithm for various threshold levels

7 Series vs. 8 Series
[Figure: bar chart of speed-up over the stages Radial Undistortion, Integral Image, Box Filters, Point Loc., Orient. Detection, Feature Extraction, and Overall, with series T0=1.00, T0=0.50, T0=0.25]
Improvement of the 8800 GTX over the Go 7950 GTX in each stage of the algorithm for various threshold levels

Registration Examples
[Figure: registration examples]

Conclusion
Lots of pieces
  2-D parallel prefix sums (integral image)
  Common subexpression elimination (box filters)
  EarlyZ culling (point location)
  HistoPyramid (point location)
  Scattered writes for histogram generation (orientation detection)
  1-D parallel prefix sums (orientation detection)

Conclusion
Can process video in real time on a laptop
  New cards will only be faster
  Scales to high resolutions on a desktop while still real time
Enables a whole host of algorithms which require robust features as input
  Recognition, Tracking, Structure from Motion, Visual Navigation, etc.

Future Improvements
CUDA
  Skip the graphics pipeline
  No render-to-texture API for multipass algorithms
    Need to add an extra copy
    Or avoid the texture cache (a large portion of local memory)
32-bit Integer Textures
  Can reduce memory bandwidth by at least half
Hardware bilinear interpolation for 32-bit floats
  Big speed gain for Haar responses

Questions?