Graphics Processing Unit (GPU) Acceleration of Machine Vision Software for Space Flight Applications

Graphics Processing Unit (GPU) Acceleration of Machine Vision Software for Space Flight Applications
Workshop on Space Flight Software, November 6, 2009
Brent Tweddle
Massachusetts Institute of Technology, Space Systems Laboratory

Machine Vision in Space
- CSA Space Vision System
- JSC Sprint AERCam & Mini AERCam
- DARPA Orbital Express AVGS
- GSFC Hubble Robotic Repair
- NRL SUMO / FREND
- JPL Mars Exploration Rovers
- MIT SSL SPHERES

MER Driving Speeds

Mode                        Speed
Manual Driving              124 m/hr
AutoNav (safe terrain)      36 m/hr
AutoNav (obstacles)         10 m/hr
Visual Odometry             10 m/hr
Visual Odometry + AutoNav   5 m/hr

A 0.5 m drive step takes 13 seconds to drive and 70 seconds to compute.

Flight processor:
- 20 MHz RAD6000 CPU
- 128 MB DRAM
- VxWorks operating system
- Memory space & cache shared by 97 other tasks

[1] J. J. Biesiadecki, C. Leger, and M. W. Maimone. Tradeoffs between directed and autonomous driving on the Mars Exploration Rovers. In S. Thrun, R. A. Brooks, and H. F. Durrant-Whyte, editors, ISRR, volume 28 of Springer Tracts in Advanced Robotics, pages 254-267. Springer, 2005.
[2] M. W. Maimone, A. E. Johnson, Y. Cheng, R. G. Willson, and L. Matthies. Autonomous navigation results from the Mars Exploration Rover (MER) mission. In M. H. Ang Jr. and O. Khatib, editors, ISER, Springer Tracts in Advanced Robotics, pages 3-13. Springer, 2004.

Overview
- Characteristics of vision algorithms: parallelism and locality
- Hardware architectures: CPU, FPGA, GPU
- GPU programming model
- Initial performance results and comparison
- Path to space operations
- Conclusions

Stereo Depth Map
Minimize the windowed sum of squared differences between the left and right stereo images over the disparity d to produce the stereo disparity map.
[Figure: left stereo image, right stereo image, and the resulting stereo disparity map]
Characteristics:
- 2D spatial locality (read or write)
- Data parallel
- Minimal branching & instruction complexity
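Written out, the windowed SSD cost being minimized is the following (a standard formulation reconstructed here for reference, with W the correlation window and I_L, I_R the left and right images; not copied from the slide):

    d^*(x, y) = \arg\min_{d} \sum_{(u, v) \in W} \left[ I_L(x + u,\, y + v) - I_R(x + u - d,\, y + v) \right]^2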

Cache Fundamentals
[Figure: 16 blocks of matrix data (0-F) in main memory, backed by a 2-block cache]
- Cache: a smaller, faster, more expensive memory that mirrors data that is likely to be used in the future.
- Principle of locality:
  - Temporal locality: data that has been recently accessed will likely be accessed again in the future.
  - Spatial locality: data that is near recently accessed data will likely be accessed in the future.
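To make spatial locality concrete, a small host-side sketch (an illustration, not from the slide): the same row-major matrix is summed once in row order, which touches consecutive addresses and uses each cached block fully, and once in column order, which strides across rows and misses the cache far more often.

```cuda
#include <stdio.h>

#define N 1024
static float a[N][N];   // row-major storage: a[i][j] and a[i][j+1] are adjacent

int main(void)
{
    float rowSum = 0.0f, colSum = 0.0f;

    // Row order: consecutive addresses, good spatial locality.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            rowSum += a[i][j];

    // Column order: stride of N floats between accesses, far more cache misses.
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            colSum += a[i][j];

    printf("%f %f\n", rowSum, colSum);
    return 0;
}
```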

2D Spatial Locality: Morton Mapping
[Figure: the same matrix data laid out in main memory in Morton (Z-order) order, backed by a 2-block cache]
- 2D principle of locality: optimized for two-dimensional applications.
- Currently implemented as the texture cache in GPUs; used in machine vision applications.
- Could be implemented on standard CPUs, but needs a remap procedure that will generate cache misses.
- Translation from an x-y address to a Morton-mapped address is computationally more expensive.
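As an illustration of the remap cost mentioned above, a minimal sketch of the x-y to Morton (Z-order) address translation, assuming 16-bit coordinates (not part of the slide):

```cuda
// Interleave the bits of x (even positions) and y (odd positions) to form
// the Morton index. This per-access bit manipulation is what makes the
// remap more expensive than the plain row-major index y * width + x.
__host__ __device__ unsigned int morton2d(unsigned int x, unsigned int y)
{
    unsigned int z = 0;
    for (int i = 0; i < 16; ++i) {
        z |= ((x >> i) & 1u) << (2 * i);        // bit i of x -> bit 2i
        z |= ((y >> i) & 1u) << (2 * i + 1);    // bit i of y -> bit 2i + 1
    }
    return z;
}
```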

Data Parallel Visual Navigation Algorithms
- Estimation: particle filter
- 2D image processing: disparity map, kernel filtering (see the sketch after this list)
- 3D data processing: iterative closest point
- Path planning: rapidly-exploring random trees
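To make the data-parallel pattern behind these algorithms concrete, a hedged sketch of a 3x3 box filter with one CUDA thread per output pixel; the row-major float image layout and the names (boxFilter3x3, w, h) are illustrative assumptions, not code from the talk.

```cuda
__global__ void boxFilter3x3(const float *in, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;  // skip the border

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)             // every thread runs the same
        for (int dx = -1; dx <= 1; ++dx)         // small loop over its window:
            sum += in[(y + dy) * w + (x + dx)];  // data parallel, no divergence
    out[y * w + x] = sum / 9.0f;
}
```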

CPU Architecture: Pentium 4 Willamette
- Released November 2000
- 1.3 to 2.0 GHz
- 256 kB cache
- Total power @ 1.6 GHz: 60.8 W
- L1 miss: 2 cycles; L2 miss: 7 cycles

[1] W. Wu, L. Jin, J. Yang, P. Liu, and S. X. D. Tan. A systematic method for functional unit power estimation in microprocessors. In Design Automation Conference, 2006.

FPGA Architecture
- Programmable logic implemented as look-up tables
- Incorporates on-chip memory and DSP blocks
- Logic is described using VHDL or Verilog
- Development and testing is very difficult
- Less power efficient than a custom ASIC
[Figure: Altera Stratix device and its look-up table structure]

NVIDIA GPU Architecture
- Architecture designed for data parallel applications
- Programming model: Single Program, Multiple Data (SPMD)

GPUs for Embedded Systems

Processor                       Theoretical Peak GFLOPS   Watts   Watts per GFLOPS
Quad Bloomfield Xeon 3.2 GHz    25.6 GFLOPS               130 W   5.078
Core 2 Duo Penryn 2.53 GHz      20.2 GFLOPS               25 W    0.810
Cell Processor                  152 GFLOPS                80 W    0.526
NVIDIA Tesla C870               518 GFLOPS                170 W   0.328
NVIDIA GeForce 9800 GT          504 GFLOPS                105 W   0.208
NVIDIA GeForce 8800M GTS        240 GFLOPS                35 W    0.145

Assumptions: the Xeon issues 2 flops per cycle per core; the Core 2 Duo issues 4 flops per cycle per core.
http://icl.cs.utk.edu/hpcc/hpcc_desc.cgi?field=theoretical%20peak
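As a check of how the peak figures are derived under the stated flops-per-cycle assumptions (a reconstruction, not part of the slide):

    3.2\,\text{GHz} \times 4\ \text{cores} \times 2\ \text{flops/cycle} = 25.6\ \text{GFLOPS}, \qquad 130\,\text{W} \,/\, 25.6\ \text{GFLOPS} \approx 5.08\ \text{W/GFLOPS}

    2.53\,\text{GHz} \times 2\ \text{cores} \times 4\ \text{flops/cycle} \approx 20.2\ \text{GFLOPS}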

Mip-Mapping & Texture Cache
- GPUs have hardware to support mapping textures onto 3D objects in a rendered scene.
- The texture path provides 2D spatial locality, high throughput, and low latency.
- Data is stored as a mip-map in the texture cache and accessed with a Morton pattern.
- Hardware supports sub-pixel interpolation.
[Figure: rendered scene, stored texture, and mip-mapped texture]
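A minimal sketch of how this texture path was exposed in the CUDA of that era (the classic texture-reference API): the image is placed in a cudaArray, reads go through the texture cache, and cudaFilterModeLinear provides the hardware sub-pixel interpolation. The names (texImg, copyThroughTexture, runExample) and the float image format are assumptions, not code from the talk.

```cuda
#include <cuda_runtime.h>

texture<float, cudaTextureType2D, cudaReadModeElementType> texImg;

__global__ void copyThroughTexture(float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        // Half-texel offsets hit texel centres; fractional coordinates are
        // bilinearly interpolated by the texture hardware.
        out[y * w + x] = tex2D(texImg, x + 0.5f, y + 0.5f);
}

void runExample(const float *hostImg, float *devOut, int w, int h)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *arr;
    cudaMallocArray(&arr, &desc, w, h);
    cudaMemcpy2DToArray(arr, 0, 0, hostImg, w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);

    texImg.filterMode = cudaFilterModeLinear;     // hardware bilinear filtering
    texImg.addressMode[0] = cudaAddressModeClamp; // clamp reads at the borders
    texImg.addressMode[1] = cudaAddressModeClamp;
    cudaBindTextureToArray(texImg, arr);

    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    copyThroughTexture<<<grid, block>>>(devOut, w, h);

    cudaUnbindTexture(texImg);
    cudaFreeArray(arr);
}
```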

CUDA Programming Model
- Single Program, Multiple Data, written in C
- The same instruction is issued to 8 threads (each with its own context & data)
- Parallel execution with no guarantee of order
  - Race conditions & deadlocks are possible
  - Synchronization and mutual exclusion are necessary
- Direct control of on-chip memory (a memory read costs 100s of cycles)
  - Implement custom caching protocols
- Maximizing performance is challenging
  - Aligned memory access
  - Resource utilization

Naive matrix multiply kernel from the slide (O(n^3/p)):

    __global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
    {
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        int Mcols = M.width;
        int Ncols = N.width;

        float sum = 0;
        for (int i = 0; i < Mcols; ++i) {
            float a = M.elements[tx * Mcols + i];
            float b = N.elements[i * Ncols + ty];
            sum += a * b;
        }
        int index = tx * Ncols + ty;
        P.elements[index] = sum;
    }
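For context, a hedged host-side sketch of how the kernel above might be launched for small square matrices, in the style of the early CUDA matrix-multiply examples; the Matrix struct layout, the WIDTH constant, and the single-block launch are assumptions, and WIDTH*WIDTH must stay within the hardware's per-block thread limit.

```cuda
#include <cuda_runtime.h>

struct Matrix { int width; int height; float *elements; };  // assumed layout

#define WIDTH 16   // assumed square matrix size

// Md, Nd, Pd carry device pointers in elements; the kernel is the one above.
void launchMatrixMul(Matrix Md, Matrix Nd, Matrix Pd)
{
    dim3 block(WIDTH, WIDTH);                    // one thread per element of P
    MatrixMulKernel<<<1, block>>>(Md, Nd, Pd);   // single-block version
    cudaDeviceSynchronize();                     // wait before reading Pd back
}
```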

Initial GPU Stereo Results
- Implemented a stereo disparity map with a left-right consistency check on the GPU, based on NVIDIA's original code: 25 ms for a 640x480 frame.
- For comparison, algorithms optimized for CPU SIMD hardware process a 512x512 frame in under 0.1 s (van der Mark and Gavrila, "Real-Time Dense Stereo for Intelligent Vehicles", IEEE Trans. ITS, 2006).

Path to Space: Future Research and Development
- COTS GPU implementation of navigation algorithms: do they work well in practice?
- Development of embedded system architectures. Should we radiation-harden a COTS GPU, or build a rad-hard GPU-like ASIC?
- Software testing of parallel algorithms?
- ESA's architecture: a primary flight computer monitors the accelerator for errors.
[Figure: primary flight computer connected to a GPU vision accelerator]

Summary & Conclusions
- Discussed characteristics of machine vision algorithms.
- Identified the need for faster and more power-efficient processing architectures.
- The GPU architecture matches machine vision well:
  - 2D locality -> texture caches
  - Data parallel -> SPMD programming model
  - Minimal branching and instruction complexity -> reduced control hardware
- Initial performance tests show promise.
- Significant work ahead.

Questions & Acknowledgements