Challenges of Heterogeneous MPSoC for Image Processing

1 Challenges of Heterogeneous MPSoC for Image Processing
DGLR 2017
Walter Stechele, Institute for Integrated Systems, Technische Universität München

2 Overview
- Reconfigurable hardware
- Hardware <---> software migration
- Heterogeneous MPSoC
- Application mapping and resource-aware programming
- Case studies from driver assistance and robotic vision

3 AutoVision Processor
[Table: columns Shape Engine, Contrast Engine, Taillight Engine, Optical Flow, C; rows Highway, Tunnel entrance, Tunnel, City; two columns are marked for each driving scenario]
[Block diagram: FPGA with C1, C2, I/O, on-chip bus, ShapeEng, Eng0, Eng1, ICAP, MEM CTRL and Video IF; SDRAM holding partial bitstreams for MatE, CensusE, TaillightE, ConEng, ShapeEng and EdgeEng]

4 Optical Flow
[Block diagram: HW/SW partitioning with the blocks Census Transformation, Matching, Post Proc, List of features and Draw features]
Status: SW-Algorithm (Matlab), SW-Algorithm (OpenCV), Profiling, HW/SW Partitioning, HW Accelerator, Demonstrator
Results: Optical flow (640x480, ~17000 features)
- Core2 Duo 1.86 GHz: 40 ms
- Engine: 22.17 ms

5 Census transformation
1) Compute a signature for every pixel within frame t_k (0 ms)
2) Perform the same comparison in frame t_{k+1} (40 ms)
3) Match the signatures from both images: correspondence / motion vector
[Figure: pixel neighborhoods in frame t_k and frame t_{k+1}, each neighbor compared with the center pixel x (>, <, =), yielding signature k and signature k+1]
[1] F Stein: Efficient Computation of Optical Flow Using the Census Transform, DAGM-Symposium, 2004
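
Step 1 can be sketched in C. This is a minimal illustration rather than the AutoVision implementation: the 3x3 neighborhood, the tolerance `eps`, the base-3 packing of the >, <, = outcomes and the names `census_digit` and `census_signature` are assumptions made for the example; the comparison pattern actually used is the one defined in [1].

```c
#include <assert.h>
#include <stdint.h>

/* Ternary comparison of one neighbor against the center pixel:
 * 0 = darker by more than eps, 2 = brighter by more than eps, 1 = similar. */
static unsigned census_digit(int neighbor, int center, int eps)
{
    if (neighbor < center - eps) return 0u;
    if (neighbor > center + eps) return 2u;
    return 1u;
}

/* Signature of pixel (x, y): the 8 ternary digits of its 3x3 neighborhood,
 * packed base 3 in row-major order (assumed layout). 3^8 = 6561 values,
 * so the result fits in 16 bits. */
uint16_t census_signature(const uint8_t *img, int width, int height,
                          int x, int y, int eps)
{
    assert(x >= 1 && y >= 1 && x < width - 1 && y < height - 1);
    int center = img[y * width + x];
    uint16_t sig = 0;
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            if (dx == 0 && dy == 0) continue;   /* skip the center itself */
            int n = img[(y + dy) * width + (x + dx)];
            sig = (uint16_t)(sig * 3u + census_digit(n, center, eps));
        }
    }
    return sig;
}
```

Running the same function on frame t_k and frame t_{k+1} gives the two signature images that step 3 compares.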

6 Algorithmic redesign: Software Version [1]
3 steps: low pass filter, Census transformation, matching
Signature = Address/Key
- Signature serves as address, pixel coordinates as value
- Non-consecutive memory regions (no bursting possible)
- Counter update requires a read for every write operation
- Memory consumption is too high because of the table-based indexing scheme
- Global matching: motion vectors across the whole image are possible
- Algorithm unsuitable for FPGA implementation!
[Figure: frames t_k and t_{k+1} (n x m pixels) and tables of x,y coordinate entries with counters, addressed by signature]
[1] F Stein: Efficient Computation of Optical Flow Using the Census Transform, DAGM-Symposium, 2004

7 Algorithmic redesign: Hardware Version [2]
3 steps: low pass filter, Census transformation, matching
Signature = Value
- Signature serves as value, pixel coordinates as address
- Bursting possible
- Table-based approach removed completely, so no counter update is needed
- Local matching: motion vectors are possible only within a neighborhood (225 parallel paths and comparisons in one clock cycle)
- Algorithm in that form unsuitable for software!
[Figure: frames t_k and t_{k+1} (n x m pixels) stored as signature images]
[2] C Claus, A Laika, Li Jia, W Stechele: High performance FPGA based optical flow calculation using the census transformation, IV
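
The local matching scheme can be written down sequentially as a rough sketch. The 15 x 15 search window is inferred from the 225 comparisons named on the slide; the rule that only a unique hit counts as a match, and the names `motion_vector` and `match_local`, are assumptions for illustration. The two nested loops are what the hardware evaluates in parallel in one clock cycle, which is why this form is a poor fit for software.

```c
#include <assert.h>
#include <stdint.h>

#define SEARCH_RADIUS 7   /* 15 x 15 = 225 candidate positions */

typedef struct {
    int dx, dy;   /* displacement, meaningful only when found != 0 */
    int found;    /* 1 if exactly one candidate carried the signature */
} motion_vector;

/* Look for the signature of pixel (x, y) of frame t_k inside the
 * neighborhood of the same position in frame t_{k+1}. Both signature
 * images are addressed by pixel coordinates (signature = value). */
motion_vector match_local(const uint16_t *sig_k, const uint16_t *sig_k1,
                          int width, int height, int x, int y)
{
    assert(x >= 0 && x < width && y >= 0 && y < height);
    motion_vector mv = {0, 0, 0};
    uint16_t wanted = sig_k[y * width + x];
    int hits = 0;
    for (int dy = -SEARCH_RADIUS; dy <= SEARCH_RADIUS; dy++) {
        for (int dx = -SEARCH_RADIUS; dx <= SEARCH_RADIUS; dx++) {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
            if (sig_k1[ny * width + nx] == wanted) {
                mv.dx = dx;
                mv.dy = dy;
                hits++;
            }
        }
    }
    mv.found = (hits == 1);   /* assumption: ambiguous matches are dropped */
    return mv;
}
```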

8 Performance Comparison
                          Processing on C (SW)   Processing on FPGA (HW)
Bursting possible         no                     yes
Counter update required   yes                    no
Matching scheme           global                 local
Platform                  Intel Core 2 Duo       FPGA, 2 embedded PPCs
Frequency                 1.86 GHz               100 MHz (FPGA), 300 MHz (PPC)
Total time                40.55 ms*              22.17 ms*
Power consumption         65 W (TDP)             10 W (TDP, < 1 W target)
Step times, listed under image smoothing, Census transformation, finding matches and drawing motion vectors: SW 1.87 ms and 38.68 ms; HW 3.95 ms, 5.92 ms and 12.3 ms
*Execution time for images with a resolution of 640x480 and approximately 17000 features detected

9 CensusEngine

10 MatchingEngine

11 AutoVision in a car
Special thanks to DFG, BMW, Xilinx, sensor-to-image

12 Object Recognition
- An approach used in computer vision to extract features and infer the contents of an image
- Enables the ARMAR robot to recognize objects and to carry out tasks such as object tracking and object grasping
- Consists of three main stages: Harris corner detection, SIFT feature extraction and SIFT feature matching
[Pipeline: CAM -> Frame Buffer -> Stage-1 Harris Corner -> Stage-2 SIFT Feature Extraction -> Stage-3 SIFT Feature Matching]

13 Heterogeneous MPSoC
Tiled hardware architecture with:
- Tightly Coupled Processor Array (TCPA) for image processing
- Invasive-Core (i-core) with special instruction (SI) support
- Loosely coupled LEON3 cores for high-level algorithms
- Network-on-Chip (NoC)
- Tile Local Memory (TLM)
- Memory tile with interface to external DDR-II memory
- IO tile and Ethernet interface

14 Humanoid Robot ARMAR from KIT [Asfour et al]
[Figure labels: Camera, Microphone, Accelerometer, Pressure sensor, Torque sensor, Rotary encoder]

15 Harris Corner Detection on TCPA
Tightly Coupled Processor Array [Teich et al]
- A TCPA consists of numerous lightweight processing elements (PEs)
- TCPAs benefit from instruction-level and loop-level parallelism and offer significant acceleration for image processing algorithms
- Direct PE-to-PE communication channels result in continuous streaming of data from the surrounding buffers through the array
[Block diagram: PE array surrounded by four Reconfigurable Buffers with AG, GC and IM blocks; Configuration Manager with Config Memory and Config Loader; Config & Com Proc, Irq Ctrl and Network Adapter on the AMBA AHB Bus; AHB/APB Bridge to the AMBA APB Bus]

16 Mapping HCD on TCPA
- TCPA prototype for HCD consists of two PEs
- Achieved a frame rate of 5 fps (640x480 pixels)
- The TCPA implementation is expected to consume less power due to its lightweight PE structure
[Block diagram: 3x3 array with I/O Buffers, AG, GC and IM blocks, Configuration Manager, AMBA bus, Conf & Com Proc (LEON3), Memory]

17 SIFT Feature Matching on i-core [Henkel et al]
i-core: an extension of a LEON3 processor with a reconfigurable fabric, which allows loading application-specific accelerators at runtime
[Flow chart: Start -> Harris Corner Detection -> SIFT Feature Extraction -> SIFT Feature Matching -> Visualize -> Stop]
Euclidean distance between p and q: D(p, q) = \sum_k (p_k - q_k)^2

    /* accumulate the squared differences over all descriptor dimensions */
    for (k = 0; k < ndimension; k++) {
        v = pquery[k] - pdata[k];
        sum += v * v;
    }
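
To show where the distance loop sits in feature matching, the following sketch wraps it in a brute-force nearest-neighbor search. The row-major descriptor layout and the names `squared_distance` and `nearest_descriptor` are assumptions for illustration; the slides do not show how the i-core accelerator or the matching stage is actually organized.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of squared differences between two descriptors (the slide's loop). */
static float squared_distance(const float *pquery, const float *pdata,
                              int ndimension)
{
    float sum = 0.0f;
    for (int k = 0; k < ndimension; k++) {
        float v = pquery[k] - pdata[k];
        sum += v * v;
    }
    return sum;
}

/* Index of the database descriptor closest to the query. The database
 * holds ncount descriptors of ndimension floats each, stored row-major
 * (assumed layout). */
int nearest_descriptor(const float *pquery, const float *database,
                       int ncount, int ndimension)
{
    assert(ncount > 0 && ndimension > 0);
    int best = 0;
    float best_dist = squared_distance(pquery, database, ndimension);
    for (int i = 1; i < ncount; i++) {
        const float *pdata = database + (size_t)i * (size_t)ndimension;
        float d = squared_distance(pquery, pdata, ndimension);
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    return best;
}
```

Comparing squared distances avoids a square root per candidate and does not change which descriptor is closest.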

18 SIFT Feature Matching on i-core
Two memory ports provide a high-bandwidth connection (2x128 bits) to the tile-local memory

19 Homogeneous vs Heterogeneous
- Three stages of the object recognition algorithm operate in a pipelined fashion
- Hardware variants used for comparison:
  - Homogeneous MPSoC: 2x3 tile design with four LEON3 PEs per tile
  - Heterogeneous MPSoC: 2x3 tile design with one TCPA tile and one i-core
[Pipeline: CAM -> Frame Buffer -> Stage-1 Harris Corner on TCPA -> Stage-2 SIFT Feature Extraction on LEON3 -> Stage-3 SIFT Feature Matching on i-core]
[Timeline: TCPA, i-core and LEON3 over time; legend: HCD-TCPA, SIFT-Extr-LEON3, SIFT-Match-iCore]

20 Homogeneous vs Heterogeneous
               Load LEON3  Load TCPA  Load i-core  Throughput  WOLT (msec)
Homogeneous    59%         0%         0%           97 frames   732
Heterogeneous  23%         62%        72%          97 frames   683
[Timeline: TCPA, i-core and LEON3 over time; legend: HCD-TCPA, SIFT-Extr-LEON3, SIFT-Match-iCore]

21 Additional Applications
Conventional task distribution
- App-1: Audio Filtering on TCPA
- App-2: Matrix Multiplication on i-core
- App-3: CAM -> Frame Buffer -> Stage-1 Harris Corner on TCPA -> Stage-2 SIFT Feature Extraction on LEON3 -> Stage-3 SIFT Feature Matching on i-core
[Timeline: TCPA, i-core and LEON3 over time for Frame 1 and Frame 2, markers (a) to (d); legend: Audio-TCPA, MatrixMul-iCore, HCD-TCPA, HCD-LEON3, SIFT-Extr-LEON3, SIFT-Match-iCore, SIFT-Match-LEON3]

22 Additional Applications
Conventional task distribution
                Load LEON3  Load TCPA  Load i-core  Throughput  WOLT (msec)
Obj-Recog Only  23%         62%        72%          97 frames   683
All Three Apps  17%         96%        67%          72 frames   1400
[Timeline: TCPA, i-core and LEON3 over time for Frame 1 and Frame 2, markers (a) to (d); legend: Audio-TCPA, MatrixMul-iCore, HCD-TCPA, HCD-LEON3, SIFT-Extr-LEON3, SIFT-Match-iCore, SIFT-Match-LEON3]

23 Conventional vs Resource-aware
Resource-aware task distribution
- App-1: Audio Filtering on TCPA
- App-2: Matrix Multiplication on i-core
- App-3: CAM -> Frame Buffer -> Stage-1 Harris Corner on TCPA -> Stage-2 SIFT Feature Extraction on LEON3 -> Stage-3 SIFT Feature Matching on i-core
[Timeline: TCPA, i-core and LEON3 over time for Frame 1, Frame 2 and Frame 3, markers (e) to (h), (x) and (y); legend: Audio-TCPA, MatrixMul-iCore, HCD-TCPA, HCD-LEON3, SIFT-Extr-LEON3, SIFT-Match-iCore, SIFT-Match-LEON3]

24 Conventional vs Resource-aware
Resource-aware task distribution
                Load LEON3  Load TCPA  Load i-core  Throughput  WOLT (msec)
Conventional    17%         96%        67%          72 frames   1400
Resource-aware  38%         80%        75%          98 frames   705
[Timeline: TCPA, i-core and LEON3 over time for Frame 1, Frame 2 and Frame 3, markers (e) to (h), (x) and (y); legend: Audio-TCPA, MatrixMul-iCore, HCD-TCPA, HCD-LEON3, SIFT-Extr-LEON3, SIFT-Match-iCore, SIFT-Match-LEON3]
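
The timeline legend lists LEON3 variants of stages that otherwise run on an accelerator (HCD-LEON3, SIFT-Match-LEON3). One plausible reading is a fallback rule: a stage uses its preferred accelerator when that tile is free and moves to a LEON3 core when it is occupied. The sketch below encodes only that reading. All types and names (`resource`, `tile_status`, `place_stage`) are hypothetical, and the slides do not describe the actual allocation interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { RES_LEON3, RES_TCPA, RES_ICORE } resource;

/* Snapshot of accelerator occupancy at the time a stage is dispatched. */
typedef struct {
    bool tcpa_busy;    /* e.g. audio filtering holds the TCPA */
    bool icore_busy;   /* e.g. matrix multiplication holds the i-core */
} tile_status;

/* Pick where a pipeline stage runs: the preferred accelerator if it is
 * free, otherwise the software variant on a LEON3 core. */
resource place_stage(resource preferred, const tile_status *status)
{
    assert(status != NULL);
    if (preferred == RES_TCPA && status->tcpa_busy) return RES_LEON3;
    if (preferred == RES_ICORE && status->icore_busy) return RES_LEON3;
    return preferred;
}
```

A conventional distribution would instead wait for the preferred tile, which is consistent with the higher TCPA load and the longer WOLT reported in the table.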

25 Summary for Heterogeneous MPSoC
What we could see so far:
- Resource-awareness helps task distribution on heterogeneous MPSoC
- Improves throughput and WOLT (worst observed latency time)
Now, what if no more C cores are available?

26 HCD and Additional Load
[Plot: execution time per frame number, expected vs observed]
[Plot: core count per frame number, required vs available]
[Diagram: Harris Corner Detection and additional load running on the operating system and many-core HW]

27 Harris Corner Detection - Stages
The Harris corner detection algorithm consists of three main stages:
Stage-1 Covariance: I_x = p_1 - p_2, I_y = p_3 - p_4
Stage-2 Harris Map: \begin{pmatrix} \sum_w I_x^2 & \sum_w I_x I_y \\ \sum_w I_x I_y & \sum_w I_y^2 \end{pmatrix} = \begin{pmatrix} a & b \\ b & c \end{pmatrix}
Stage-3 Threshold: R = ac - b^2 - k (a + c)^2
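
The three stages can be followed for a single pixel in a short C sketch. Several choices here are assumptions, not taken from the slides: p_1, p_2 and p_3, p_4 are taken to be the horizontal and vertical neighbors of a pixel, the window w is 3x3 and unweighted, and the name `harris_response` is hypothetical. The response uses the standard Harris form R = ac - b^2 - k(a + c)^2.

```c
#include <assert.h>
#include <stdint.h>

/* Harris response R at pixel (x, y) of an 8-bit grayscale image.
 * A corner is reported where R exceeds a threshold (not shown here). */
double harris_response(const uint8_t *img, int width, int height,
                       int x, int y, double k)
{
    /* The 3x3 window plus the 1-pixel gradient support needs a 2-pixel border. */
    assert(x >= 2 && y >= 2 && x < width - 2 && y < height - 2);
    double a = 0.0, b = 0.0, c = 0.0;
    for (int wy = y - 1; wy <= y + 1; wy++) {
        for (int wx = x - 1; wx <= x + 1; wx++) {
            /* Stage 1: central differences as gradients I_x and I_y. */
            double ix = (double)img[wy * width + wx + 1]
                      - (double)img[wy * width + wx - 1];
            double iy = (double)img[(wy + 1) * width + wx]
                      - (double)img[(wy - 1) * width + wx];
            /* Stage 2: accumulate the matrix entries a, b, c over the window. */
            a += ix * ix;
            b += ix * iy;
            c += iy * iy;
        }
    }
    /* Stage 3 input: determinant minus k times squared trace. */
    return a * c - b * b - k * (a + c) * (a + c);
}
```

With a commonly used value such as k = 0.04, a flat region gives R = 0, a straight edge gives a negative R, and a corner gives a positive R.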

28 Subsampling by Dropping Pixels
Advantages:
- Reduces the computational load by dropping pixels
- Dropping every alternate pixel horizontally reduces the computations by 50%
- Dropping every alternate pixel horizontally and vertically reduces the workload by 75%
Disadvantages:
- Regions with and without corners are considered equally
- Memory read/write remains almost the same
- Low ratio of computation to memory access leads to poor scalability
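
The 50% and 75% figures follow directly from the loop strides, which a small counting sketch makes concrete. The function name `visited_pixels` and the use of a plain counter in place of the per-pixel corner computation are assumptions for illustration.

```c
#include <assert.h>

/* Count the pixels a detector would evaluate when it keeps only every
 * step_x-th column and every step_y-th row. */
long visited_pixels(int width, int height, int step_x, int step_y)
{
    assert(step_x >= 1 && step_y >= 1);
    long visited = 0;
    for (int y = 0; y < height; y += step_y) {
        for (int x = 0; x < width; x += step_x) {
            visited++;   /* a real detector would evaluate pixel (x, y) here */
        }
    }
    return visited;
}
```

For a 640x480 frame this gives 307200 evaluations without subsampling, 153600 with a horizontal stride of 2 (50% fewer) and 76800 with a stride of 2 in both directions (75% fewer). The count says nothing about memory traffic, which the slide notes stays almost the same.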

29 Masking Technique
[Diagram: Covariance, Harris Map and Threshold stages, with a threshold applied to the covariance image]
- Threshold the covariance image to generate a mask unique to the input image, based on [Alkaabi 2004]
- Mask out regions with regular intensities and unmask others
- Masked and unmasked regions appear in clusters, which makes the technique highly cache friendly
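
A mask of this kind can be sketched as a single pass over a per-pixel covariance value. The slides do not say which covariance quantity is thresholded or how the mask is stored, so the array `cov_energy`, the byte-per-pixel mask, the `>=` comparison and the name `build_mask` are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mark every pixel whose covariance value reaches the mask threshold as
 * unmasked (1); all other pixels are masked (0) and can be skipped by the
 * later stages. Returns the number of unmasked pixels. With a threshold
 * of 0 nothing is masked. */
size_t build_mask(const uint32_t *cov_energy, uint8_t *mask, size_t npixels,
                  uint32_t mask_threshold)
{
    assert(cov_energy != NULL && mask != NULL);
    size_t unmasked = 0;
    for (size_t i = 0; i < npixels; i++) {
        mask[i] = (uint8_t)(cov_energy[i] >= mask_threshold);
        unmasked += mask[i];
    }
    return unmasked;
}
```

The returned count is the number of pixels that still need the Harris map stage, so raising the threshold trades detection quality for execution time.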

30 Masking Technique
- Self-adaptive HCD algorithm with variable masking
- Negligible loss in precision and recall up to a mask threshold of 8
- Execution time reduced by 60% with a mask threshold of 100, while keeping precision and recall at 88-90%
precision = 1 - \frac{\#\text{incorrect matches}}{\#\text{correct} + \#\text{incorrect}}
recall = \frac{\#\text{correct matches}}{\#\text{total possible matches}}
[Plot: resource usage, precision and recall for mask thresholds 0, 2, 4, 8, 16, 32, 64 and 128]
[Diagram: Harris Corner Detection on the operating system and many-core HW]
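
The two quality measures translate directly into code. The struct `match_quality` and the function `evaluate_matches` are hypothetical names, and how matches are counted as correct or incorrect is outside this sketch.

```c
#include <assert.h>

typedef struct {
    double precision;
    double recall;
} match_quality;

/* precision = 1 - incorrect / (correct + incorrect)
 * recall    = correct / total possible matches */
match_quality evaluate_matches(int correct, int incorrect, int total_possible)
{
    assert(correct >= 0 && incorrect >= 0 && total_possible > 0);
    assert(correct + incorrect > 0);
    match_quality q;
    q.precision = 1.0 - (double)incorrect / (double)(correct + incorrect);
    q.recall = (double)correct / (double)total_possible;
    return q;
}
```

For example, 3 correct and 1 incorrect match out of 4 possible matches give a precision of 0.75 and a recall of 0.75.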

31 Results: Conventional vs Resource-aware HCD
[Plot: execution time profile, duration (milliseconds) per frame number, RA vs CN]
[Plot: accuracy, precision/recall rate per frame number; series PR-RA, RE-RA, PR-CN, RE-CN]
[Plot: summary]

32 Conclusion
- Reconfigurable hardware: hardware is not always fixed
- Optical flow in hardware and software: software is not simply mapped to hardware
- Heterogeneous MPSoC: static mapping is not sufficient
- Resource-aware computing: software adaptation is not limited to parameter tuning

33 References
- J A Colmenares, G Eads, S A Hofmeyr, S Bird, M Moretó, D Chou, B Gluzman, E Roman, D B Bartolini, N Mor et al: Tessellation: refactoring the OS around explicit resource containers with continuous adaptation, DAC 2013
- Henry Hoffmann, Jonathan Eastep, Marco D Santambrogio, Jason E Miller, Anant Agarwal: Application Heartbeats - A Generic Interface for Specifying Program Performance and Goals in Autonomous Computing Environments, ICAC 2010
- Henry Hoffmann, Stelios Sidiroglou, Michael Carbin, Sasa Misailovic, Anant Agarwal, Martin Rinard: Dynamic Knobs for Responsive Power-Aware Computing, ASPLOS 2012
- D B Bartolini, R Cattaneo, G C Durelli, M Maggio, M D Santambrogio, F Sironi: The autonomic operating system research project: achievements and future directions, DAC 2013
- Edoardo Paone, Davide Gadioli, Gianluca Palermo, Vittorio Zaccaria, Cristina Silvano: Evaluating Orthogonality between Application Auto-Tuning and Run-Time Resource Management for Adaptive OpenCL Applications, ASAP

34 References
- F Stein: Efficient Computation of Optical Flow Using the Census Transform, DAGM-Symposium, 2004
- S Alkaabi, F Deravi: Candidate pruning for fast corner detection, Electronics Letters, 2004
- [Teich et al]
- [Henkel et al]
- J Paul, W Stechele et al: Resource awareness on heterogeneous MPSoCs for image processing, Journal of Systems Architecture, Elsevier, 2015
- J Paul, W Stechele et al: Self-adaptive corner detection on MPSoCs through resource-aware programming, Journal of Systems Architecture, Elsevier, 2015
- C Claus, A Laika, Li Jia, W Stechele: High performance FPGA based optical flow calculation using the census transformation, Intelligent Vehicles Symposium,
