Contour Detection on Mobile Platforms

1 Contour Detection on Mobile Platforms
Bor-Yiing Su, Prof. Kurt Keutzer
Parallel Computing Lab, University of California, Berkeley

2 Category of This Work
[Figure: research stack diagram. Recoverable labels: applications (Personal Health, Image Retrieval, Hearing, Music, Speech, Parallel Browser); Design Patterns/Motifs; Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Static Verification, Type Systems; Parallel Libraries, Parallel Frameworks, Sketching; Efficiency Languages, Autotuners, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers, Legacy Code; OS Libraries & Services, Legacy OS, Hypervisor; Multicore/GPGPU, RAMP Manycore; Diagnosing Power/Performance; Correctness (Directed Testing, Dynamic Checking, Debugging with Replay).]

3 Outline
- Motivation
- Revisiting the Image Contour Detector: Damascene
- Parallelizing Damascene on Multiple Platforms
- Conclusion

4 What will future visual applications look like?
- Interact with images
- The computer needs to understand images to provide image interaction services
- Summarize information from the images: location-aware services, optical character recognition
- Interact with image parts through touchpad interfaces: augmented reality

5 CBIR on a Mobile Platform
- Goal: meeting end users' requirements
  - Improving detection rate (accuracy), response time (speed), and battery life (energy)
- Our work: using Damascene as a case study to understand the capability of modern mobile platforms in terms of:
  - Execution time
  - Power consumption
  - Choice of computation platform

6 Outline
- Motivation
- Revisiting the Image Contour Detector: Damascene
- Parallelizing Damascene on Multiple Platforms
- Conclusion

7 Image Contour Detection
- State-of-the-art algorithm which achieves the highest accuracy: gPb
  - M. Maire, P. Arbeláez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images. CVPR, pp. 1-8, June
[Figure: three panels titled Image, Human Generated Contours, and Machine Generated Contours]

8 Computation Flow of Damascene
[Figure: computation flow diagram. Recoverable stage labels: Combine, Normalize; Intervening Contour; Non-max Suppression; Combine; Contours]

9 Outline
- Motivation
- Revisiting the Image Contour Detector: Damascene
- Parallelizing Damascene on Multiple Platforms
  - Targeting Platforms
  - Performance on Desktop Platforms
    - On Nvidia GTX480 GPU
    - On Intel Nehalem Core i7 920 Quad-Core CPU
  - Performance on Mobile Platforms
    - On ARM v7 Quad-Core Processor
    - On MSM8655 Single-Core Snapdragon with NEON Support
  - Execution Time Analysis
  - Power Analysis
- Conclusion

10 Desktop Platforms: Fermi GTX 480 and Nehalem Core i7 920

Specifications              GTX 480
CUDA Cores                  480
Processor Clock             1.4 GHz
SP FLOPS                    1.35 T
Memory Bandwidth            [value missing] GB/s
Memory Size                 1.5 GB
L1 Cache + Shared Memory    64 KB per SM
L2 Cache                    768 KB

Specifications              Core i7 920
Cores                       4
Processor Clock             2.66 GHz
SP FLOPS                    [value missing] G
Memory Bandwidth            25.6 GB/s
L1 Cache                    64 KB per core
L2 Cache                    1 MB per core
L3 Cache                    8 MB

11 Mobile Platforms: ARM Cortex A9 and Snapdragon MSM8655

Specifications        Cortex A9    MSM8655
Cores                 4            1
Processor Clock       800 MHz      1 GHz
SIMD Width (NEON)     128 bits     128 bits
DMIPS                 8000         2100
L1 Cache              32 KB        32 KB
L2 Cache              512 KB       256 KB
Memory Bandwidth      N.A.         N.A.

12 Damascene on GTX 480
- The starting implementation
- Image size used for benchmarking: 321 x 481 pixels

Routine           Execution Time (ms)
Preprocess        [value missing]
Textons           129
Localcues         733
Intervening       [value missing]
Lanczos Solver    800
Postprocess       7.96
Total             1706

Bryan Catanzaro, Bor-Yiing Su, Narayanan Sundaram, Yunsup Lee, Mark Murphy, and Kurt Keutzer, "Efficient, High-Quality Image Contour Detection," in International Conference on Computer Vision (ICCV-2009), Kyoto, Japan, September.

13 Porting to the CPU Platform
- Porting the original CUDA implementation to a serial C implementation
  - Using our revised algorithms from the CUDA implementation
- Using pthread to parallelize routines that execute for more than 1 second on a 321 x 481 image
  - Exploring coarse-grained parallelism on the CPU instead of fine-grained parallelism on the GPU
- Setting compiler flags to vectorize the code, but not using SSE intrinsics manually
  - Resulting in a 10%-20% performance gain
- Image size used for benchmarking
  - The same as the image used in GPU benchmarking: 321 x 481 pixels
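The coarse-grained pthread approach described on this slide can be sketched as follows. This is a minimal illustration, not the Damascene code: the names `band_task`, `process_band`, and `process_image_parallel` are hypothetical, and the per-pixel operation (scaling by 0.5) is a stand-in for a real routine. Each thread receives one contiguous band of image rows, in contrast to the per-pixel threads of a GPU kernel.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16

/* Work description for one thread: a contiguous band of image rows. */
typedef struct {
    const float *in;
    float *out;
    size_t width;
    size_t row_begin;
    size_t row_end; /* exclusive */
} band_task;

/* Hypothetical per-pixel routine: scale every pixel in the band by 0.5. */
static void *process_band(void *arg)
{
    band_task *t = (band_task *)arg;
    for (size_t r = t->row_begin; r < t->row_end; ++r)
        for (size_t c = 0; c < t->width; ++c)
            t->out[r * t->width + c] = 0.5f * t->in[r * t->width + c];
    return NULL;
}

/* Split the rows into nthreads bands and process them concurrently.
 * Returns 0 on success, -1 if a thread could not be created. */
int process_image_parallel(const float *in, float *out,
                           size_t width, size_t height, int nthreads)
{
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);

    pthread_t tid[MAX_THREADS];
    band_task task[MAX_THREADS];
    int started = 0;
    int status = 0;

    for (int i = 0; i < nthreads; ++i) {
        task[i].in = in;
        task[i].out = out;
        task[i].width = width;
        task[i].row_begin = height * (size_t)i / (size_t)nthreads;
        task[i].row_end = height * (size_t)(i + 1) / (size_t)nthreads;
        if (pthread_create(&tid[i], NULL, process_band, &task[i]) != 0) {
            status = -1;
            break;
        }
        ++started;
    }

    /* Wait for every thread that was actually started. */
    for (int i = 0; i < started; ++i)
        pthread_join(tid[i], NULL);

    return status;
}
```

Calling `process_image_parallel(in, out, 481, 321, 4)` would split a 321-row image into four bands. Because the bands write to disjoint rows, no locking is needed in this sketch.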

14 Performance on Core i7 920
- Not all kernels scale linearly with the number of threads
  - Some computations are bounded by memory bandwidth
  - Compiler vectorization works better on a single thread
  - Not all routines in Damascene are parallelized
[Tables: Execution Time (s) and Scalability on Number of Threads, for the texton, localcues, intervene, and Lanczos routines and the total, by number of threads used]

15 Porting to the ARM Cortex A9 Quad-Core Processor
- Code base
  - The same as the CPU pthread code
- Compiler flags
  - None: not using any compiler flags
  - fpu: -march=armv7-a -mfloat-abi=softfp -mfpu=neon
  - All: -O3 -ftree-vectorize
  - Neonize: using the NEON intrinsics manually
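As an illustration of what the "All" flags can act on, the sketch below shows a loop written so that a compiler's auto-vectorizer has a fair chance of turning it into SIMD code. The function name `scale_add` is hypothetical and not taken from Damascene. The `restrict` qualifiers promise the compiler that the arrays do not overlap, which is often what blocks vectorization otherwise. Whether the loop is actually vectorized depends on the compiler version, the target, and the floating-point options in effect.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = a[i] + k * b[i] for i in [0, n).
 * A simple counted loop with unit stride and non-overlapping (restrict)
 * arrays is the kind of code an auto-vectorizer handles best. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + k * b[i];
}
```

The same source file can be built with each flag set listed on the slide to compare timings, without touching the code.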

16 Performance on the ARM Cortex A9 Quad-Core Processor
- Image size used for benchmarking: 205 x 154 pixels
  - Due to memory limitations
- Scalability is very close to linear in the number of threads
  - Computation tends to be compute bound on a slower processor
[Tables: Execution Time (s) and Scalability on Number of Threads, for the texton, localcues, intervene, and Lanczos routines and the total, by number of threads used]

17 Porting to the Single Core Snapdragon Processor
- Development environment: Android NDK
- Code base
  - The same as the serial CPU implementation: not using pthread
- Compiler flags
  - None: not using any compiler flags
  - fpu: -march=armv7-a -mfloat-abi=softfp -mfpu=neon
  - All: -O3 -ftree-vectorize
  - Neonize: using the NEON intrinsics manually

18 Benchmarking the Lanczos Eigensolver
- The Lanczos iterative solver is composed of spmv, saxpy, sdot, and snrm2 kernels
  - All these kernels are memory bound
  - Manually NEONizing the kernels results in a performance gain of 7% to 3x
[Figure: Sparse Matrix Pattern]

19 Performance on the Snapdragon Processor
- Image size used for benchmarking: 205 x 154 pixels
  - Due to memory limitations
- Setting all compiler flags delivers a 6x speedup on the contour detection computation
  - It is essential to get the compiler flags right for floating-point computations
[Table: Execution Time (s), Compiler Flags None vs. All. Columns: Routine, Flag_None, Flag_All, Ratio. Rows: texton, localcues, intervene, Lanczos, Total]

20 Execution Time Analysis
- Image size used for benchmarking: 205 x 154 pixels
- Implementations
  - Snapdragon: single core, 4-way SIMD, compiler vectorization
  - Cortex A9: quad-core, 4-way SIMD, compiler vectorization
  - Nehalem: quad-core, 4-way SIMD, compiler vectorization
  - Fermi: 15 cores, 16-way SIMD, dual issue, manual vectorization
[Tables: Execution Time (s) and Speedup Ratio. Columns: Routine, Snapdragon, Cortex A9, Nehalem, Fermi. Rows: texton, localcues, intervene, Lanczos, Total]

21 Execution Time Observations
- The performance gap between the Snapdragon and the Core i7 920 is about 50x
  - Frequency difference: 1 GHz vs. 2.66 GHz
  - Memory bandwidth difference: 1 GB/s vs. 25.6 GB/s
  - A quad-core Snapdragon with perfect scalability would narrow the gap to about 12x (50x divided across 4 cores)
- Platform choice of computation
  - Transferring the entire image to the cloud and using the CUDA implementation will be faster than performing the computation locally on the mobile platform
[Tables: Execution Time (s) and Speedup Ratio. Columns: Routine, Snapdragon, Cortex A9, Nehalem, Fermi. Rows: texton, localcues, intervene, Lanczos, Total]

22 Power Analysis
- Image size used for benchmarking: 205 x 154 pixels
- Snapdragon power numbers
  - Using the Trepn profiler
  - Sampling CPU current every second
- Nehalem and Fermi power numbers
  - Using the Kill-a-watt meter
  - Measuring the rough system power
[Table: columns Max Power (W), Min Power (W), Total Energy (Joule). Rows: Snapdragon, Nehalem (1 thread), Nehalem (2 threads), Nehalem (4 threads), Fermi, Ratio]

23 Power Observations
- The race-to-halt principle applies well on the desktop platforms
- Mobile platforms are about 400x more power efficient than desktop platforms
  - Although execution on the mobile platform is about 50x (compared to the CPU) to 250x (compared to the GPU) slower, it still consumes the least energy
- As long as the energy for uploading the image and downloading the results is smaller than the computation energy, we should keep the computation on the cloud
[Table: columns Max Power (W), Min Power (W), Total Energy (Joule). Rows: Snapdragon, Nehalem (1 thread), Nehalem (2 threads), Nehalem (4 threads), Fermi, Ratio]
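The offloading rule in the last bullet can be written as a small decision function. This is a sketch under stated assumptions: the `energy_estimate` type and `should_offload` name are hypothetical, and the energy figures would have to come from measurements or from a transmission model, which the slides list as future work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Device-side energy estimates, in joules (all values are assumptions
 * supplied by the caller, not measurements from the slides). */
typedef struct {
    double upload_joules;   /* energy to send the image to the cloud */
    double download_joules; /* energy to receive the detected contours */
    double local_joules;    /* energy to run the computation on the device */
} energy_estimate;

/* Returns true when transferring the data costs the device less energy
 * than computing locally, i.e. when the computation should stay on the cloud. */
bool should_offload(const energy_estimate *e)
{
    assert(e != NULL);
    assert(e->upload_joules >= 0.0 && e->download_joules >= 0.0 &&
           e->local_joules >= 0.0);
    return e->upload_joules + e->download_joules < e->local_joules;
}
```

The comparison only covers energy drawn on the mobile device. Response time would need a separate check against the network latency.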

24 Outline
- Motivation
- Revisiting the Image Contour Detector: Damascene
- Parallelizing Damascene on Multiple Platforms
- Conclusion

25 Conclusion
- Execution time
  - Our hand-optimized CUDA code still delivers the best performance
  - Our CPU implementation achieves linear scalability on the ARM Cortex A9, but not on the Core i7 920
    - The computation tends to be more compute bound on the slower ARM Cortex A9, but more memory bound on the faster Core i7 920
  - The gap between the single-core Snapdragon and the quad-core Core i7 920 is about 50x
- Power
  - The race-to-halt principle applies well on the desktop platforms
  - Snapdragon is about 400x more power efficient than desktop platforms
    - Although Snapdragon is about 50x to 250x slower than the desktop platforms, it still consumes less energy
- Platform choice of computation
  - Damascene is very computationally intensive and has lots of data parallelism
  - Uploading the entire image to the cloud will be better than performing the computations locally on the mobile platform in terms of both total execution time and energy consumption
  - With the advancement of mobile technologies, this decision might change in the future

26 Future Work
- Include network bandwidth scenarios
  - Data transmission latency model
  - Data transmission power consumption model
- Include input pre-processing analysis
  - Image down sampling
  - Image compression rate
- Port the entire object recognition system onto mobile platforms
- Client-cloud co-design
  - Revisit the object recognition system from the perspective of client-cloud co-design
  - Try out other object recognition systems that are less computationally intensive on the mobile platforms

27 Thank You
