Throughput Exploration and Optimization of a Consumer Camera Interface for a Reconfigurable Platform

By: Floris Driessen (f.c.driessen@student.tue.nl)

Introduction
- Video applications on embedded platforms
- Use of accelerators
  - Faster
  - Energy efficiency
- USB camera

Platform of Interest - ZYNQ
- Zedboard by Digilent
- Xilinx Zynq platform
  - Dual core ARM Cortex A9
  - Programmable logic
- 512 MB RAM
- USB connectivity
- HDMI output
- USB camera

Naïve implementation
Software:
1. Read camera frame
2. Copy frame to DMA region
3. Perform HW accelerated operation (Sobel)
4. Copy result from DMA region
5. Show result
A separate DMA region is needed due to the lack of DMA drivers.
Block diagram: Zynq platform with USB, ARM Core 0, ARM Core 1, Linux RAM, programmable logic, and DMA RAM.
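To make step 3 concrete, the following is a minimal software reference for a 3x3 Sobel filter. It is only a sketch: the slides do not describe the accelerator's pixel format or arithmetic, so the 8-bit grayscale input, the |gx| + |gy| magnitude approximation, the zeroed border, and the name sobel_gray8 are all assumptions made for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical software reference for a 3x3 Sobel edge filter.
 * Assumptions (not from the slides): 8-bit grayscale input, border pixels
 * set to 0, magnitude approximated as |gx| + |gy| and clamped to 255. */
static void sobel_gray8(const uint8_t *src, uint8_t *dst, int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* The 3x3 window does not fit on the image border. */
            if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
                dst[y * width + x] = 0;
                continue;
            }
            const uint8_t *p = src + y * width + x;
            /* Horizontal gradient: right column minus left column. */
            int gx = -p[-width - 1] + p[-width + 1]
                     - 2 * p[-1]    + 2 * p[1]
                     - p[width - 1] + p[width + 1];
            /* Vertical gradient: bottom row minus top row. */
            int gy = -p[-width - 1] - 2 * p[-width] - p[-width + 1]
                     + p[width - 1] + 2 * p[width]  + p[width + 1];
            int mag = abs(gx) + abs(gy);
            dst[y * width + x] = (uint8_t)(mag > 255 ? 255 : mag);
        }
    }
}
```

A reference like this can serve as a stand-in for the accelerator while testing the capture, conversion, and copy stages on their own.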

Bottleneck Study
Performance limits:
- Converting the format: camera output to accelerator input
- Copying from/to DMA region: mmap, not cached
- Frame capturing
Block diagram: Zynq platform with USB, ARM Core 0, ARM Core 1, Linux RAM, programmable logic, and DMA RAM.

Possible improvements
- Exploiting scratchpad
  - A frame would not fit
- DMA driver support
  - Not feasible within time frame of project
- Optimize the current implementation
  - Copying data
  - Converting format
  - Capturing camera frame

Format conversion
- Naïve implementation
  - Combined conversion and copy
  - Writing small chunks to mmapped memory (slow)
- Split conversion and copy
  - OpenCV mixChannels
  - NEON interleaving (ARM SIMD), see next slide

Implementation   Convert + copy [s]     Speed-up
Naïve            1.95                   1x
Split            0.28 + 0.04 = 0.32     6.1x
OpenCV           0.05 + 0.04 = 0.09     21.7x
NEON             0.04                   50.6x

Figure: NEON interleaving. vld3.8 {d0-d2} loads the packed RGB24 bytes with de-interleave; vst4.8 {d0-d3} stores them as RGB32 with a fourth byte (x) per pixel.

Address   Source (RGB24)   Destination (RGB32)
0x00      R0               R0
0x01      G0               G0
0x02      B0               B0
0x03      R1               x
0x04      G1               R1
0x05      B1               G1
0x06      R2               B1
0x07      G2               x

Register   Contents
d0         R7 R6 R5 R4 R3 R2 R1 R0
d1         G7 G6 G5 G4 G3 G2 G1 G0
d2         B7 B6 B5 B4 B3 B2 B1 B0
d3         x  x  x  x  x  x  x  x

NEON RGB24 to RGB32 conversion example

    void __attribute__((noinline)) neonrgbtorgba_gas(unsigned char *src, unsigned char *dst, int numpix)
    {
        // Converts 8 pixels per iteration; numpix must be a non-zero multiple of 8.
        // Operands and clobbers are declared so the compiler knows which registers are used.
        asm volatile(
            "   mov     %2, %2, lsr #3\n"       // loop count = numpix / 8
            "   vmov.u8 d3, #0xff\n"            // load alpha channel value
            "1:\n"
            "   vld3.8  {d0,d1,d2}, [%0]!\n"    // load 8 RGB pixels with de-interleave
            "   pld     [%0, #40]\n"            // preload next values
            "   pld     [%0, #48]\n"
            "   pld     [%0, #56]\n"
            "   subs    %2, %2, #1\n"           // decrement loop counter
            //" vswp    d0, d2\n"               // disabled: would swap the R and B channels
            "   vst4.8  {d0-d3}, [%1]!\n"       // store as 4 x 8-bit interleaved values
            "   bgt     1b\n"                   // loop while iterations remain
            : "+r"(src), "+r"(dst), "+r"(numpix)
            :
            : "d0", "d1", "d2", "d3", "cc", "memory");
    }

Figure: vld3.8 de-interleaves the packed source bytes R0 G0 B0 R1 G1 B1 ... into d0 (R7..R0), d1 (G7..G0), and d2 (B7..B0); d3 holds the fourth byte (x) for every pixel, and vst4.8 writes the destination as R0 G0 B0 x R1 G1 B1 x ...
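The same de-interleaving load and interleaving store can also be written with NEON intrinsics, which leaves register allocation to the compiler. The version below is a sketch rather than the author's code: the function name rgb24_to_rgb32 is hypothetical, and the scalar loop is an addition that handles pixel counts that are not a multiple of 8 and lets the code build on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: expand packed RGB24 pixels to RGB32 with the
 * fourth byte of every pixel set to 0xFF. */
static void rgb24_to_rgb32(const uint8_t *src, uint8_t *dst, size_t numpix)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* NEON path: 8 pixels per iteration. */
    for (; i + 8 <= numpix; i += 8) {
        uint8x8x3_t rgb = vld3_u8(src + 3 * i);  /* de-interleaving load */
        uint8x8x4_t rgba;
        rgba.val[0] = rgb.val[0];
        rgba.val[1] = rgb.val[1];
        rgba.val[2] = rgb.val[2];
        rgba.val[3] = vdup_n_u8(0xFF);           /* constant fourth byte */
        vst4_u8(dst + 4 * i, rgba);              /* interleaving store */
    }
#endif
    /* Scalar path: remaining pixels, or all pixels when NEON is unavailable. */
    for (; i < numpix; i++) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0xFF;
    }
}
```

Whether intrinsics match the hand-written assembly in speed depends on the compiler and flags, so both variants would need to be measured on the target.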

Frame copy from/to DMA RAM
- OpenCV (as used in the naïve implementation)
- Manual copy (loop over virtually contiguous memory)
- Memcpy from C library
- NEON accelerated copy
Chart: execution time [ms] of the OpenCV, manual, memcpy, and NEON copies, measured between Linux RAM and Linux RAM, Linux RAM and DMA RAM, and DMA RAM and Linux RAM.
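A comparison like this can be reproduced with a small timing harness. The sketch below covers only the manual loop and memcpy variants; all names (copy_fn, copy_manual, copy_memcpy, time_copy_ms) are hypothetical, and ordinary buffers are assumed as stand-ins for the mmapped, uncached DMA region, so the numbers it produces are not comparable to those on the slide.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Common signature so different copy routines can be timed the same way. */
typedef void (*copy_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Manual copy: plain byte loop (an optimizing compiler may vectorize it). */
static void copy_manual(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* C library copy. */
static void copy_memcpy(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Returns the average CPU time per copy in milliseconds over `repeats` runs. */
static double time_copy_ms(copy_fn fn, uint8_t *dst, const uint8_t *src,
                           size_t n, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        fn(dst, src, n);
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / repeats;
}
```

On the real platform the destination or source pointer would come from the mmapped DMA region, which is where the uncached accesses are expected to dominate.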

Camera capture
- OpenCV
  - Always BGR24
- Video4Linux
  - Different formats
  - Not a big improvement
Chart: frame delay [s] for V4L2 RGB24, V4L2 BGR24, V4L2 MJPEG, V4L2 YUYV, and OpenCV BGR.

Results
- Multiple configurations
- Combined the conversion and copy (NEON accelerated)
Application configurations:
1: Split convert and copy
2: OpenCV mixChannels
3: Combined mixChannels to external
4: No convert back + V4L capture
5: NEON copy
6: Combined NEON convert and NEON copy
Chart: execution time per frame [s] for each application configuration, broken down into get frame, convert and copy, Sobel calculation, and copy back and convert.

Contributions
- Framework for combining USB camera with accelerators in programmable logic
- Multiple format conversion routines
  - NEON
- NEON copying routines
- Video4Linux frame capture
Pipeline diagram stages: capture frame, convert format, copy to DMA RAM, execute accelerator, copy result back, convert format, process result.

Conclusion and Future work
- Huge improvement
  - 32x (0.2 to 7.7 FPS)
  - Still one ARM core unoccupied for processing data after accelerator
- Make camera frame buffer available to DMA
  - DMA buffer sharing
  - Linux kernel 3.8
- Improve frame capture
  - Takes more than half of the time
  - Latency of ~4 frames
  - Driver from manufacturer
  - Consider other cameras
