Efficient Video Processing on Embedded GPU

Similar documents
4K Video Processing and Streaming Platform on TX1

4K Video Processing and Streaming Platform on TX1

4K HEVC Video Processing with GPU Optimization on Jetson TX1

Multimedia SoC System Solutions

Deep Learning: Transforming Engineering and Science The MathWorks, Inc.

THE LEADER IN VISUAL COMPUTING

Realtime Signal Processing on Nvidia TX2 using CUDA

Realtime Signal Processing on Embedded GPUs

ACCELERATED GSTREAMER USER GUIDE

Embedded Streaming Media with GStreamer and BeagleBoard. Presented by Todd Fischer todd.fischer (at) ridgerun.com

A176 Cyclone. GPGPU Fanless Small FF RediBuilt Supercomputer. IT and Instrumentation for industry. Aitech I/O

GstShark profiling: a real-life example. Michael Grüner - David Soto -

GStreamer Conference 2013, Edinburgh 22 October Sebastian Dröge Centricular Ltd

Using mobile processors for general purpose industrial signal processing

Our Technology Expertise for Software Engineering Services. AceThought Services Your Partner in Innovation

INTEGRATING COMPUTER VISION SENSOR INNOVATIONS INTO MOBILE DEVICES. Eli Savransky Principal Architect - CTO Office Mobile BU NVIDIA corp.

Designing with NXP i.mx8m SoC

MCN Streaming. An Adaptive Video Streaming Platform. Qin Chen Advisor: Prof. Dapeng Oliver Wu

TEGRA LINUX DRIVER PACKAGE R23.2

What's new in GStreamer Land The last 2 years and the future

Intelligent Surveillance

Memory Management in Tizen. SW Platform Team, SW R&D Center

GStreamer Daemon - Building a media server under 30min. Michael Grüner - David Soto -

Embedded Computing without Compromise. Evolution of the Rugged GPGPU Computer Session: SIL7127 Dan Mor PLM -Aitech Systems GTC Israel 2017

FOSDEM 3 February 2018, Brussels. Tim-Philipp Müller < >

NVIDIA AI BRAIN OF SELF DRIVING AND HD MAPPING. September 13, 2016

TEGRA LINUX DRIVER PACKAGE R24.1

Simple Plugin API. Wim Taymans Principal Software Engineer October 10, Pinos Wim Taymans

Accelerating Cloud Graphics

A176 C clone. GPGPU Fanless Small FF RediBuilt Supercomputer. Aitech

TEGRA LINUX DRIVER PACKAGE R21.2

WE INNOVATE WE DELIVER YOU SUCCEED

What s cooking in GStreamer. FOSDEM, Brussels 1 February Tim-Philipp Müller Sebastian Dröge

Remote Access and Output Sharing Between Multiple ECUs for Automotive

Porting Tizen-IVI 3.0 to an ARM based SoC Platform

Porting Tizen-IVI 3.0 to an ARM based SoC Platform. Damian Hobson-Garcia, IGEL Co., Ltd.

Mobile AR Hardware Futures

Connecting the uvcvideo driver of the PHYTEC USB-CAM-104H

NVIDIA DESIGNWORKS Ankit Patel - Prerna Dogra -

Chapter 28. Multimedia

Autonomous Driving Solutions

TBS8510 Transcoder Server User Guide

Android Multimedia Framework Overview. Li Li, Solution and Service Wind River

MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES. Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015

April 4-7, 2016 Silicon Valley VISIONWORKS A CUDA ACCELERATED COMPUTER VISION LIBRARY S6783. Elif Albuz, April 4, 2016

The Mobile Internet: The Potential of Handhelds to Bring Internet to the Masses. April 2008

EGLSTREAMS: INTEROPERABILITY FOR CAMERA, CUDA AND OPENGL. Debalina Bhattacharjee Sharan Ashwathnarayan

FOSDEM Open Media Devroom. 02 February 2019, Brussels. Tim-Philipp Müller < >

TEGRA LINUX DRIVER PACKAGE

Towards Automatic Heterogeneous Computing Performance Analysis. Carl Pearson Adviser: Wen-Mei Hwu

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018

NVIDIA TEGRA LINUX DRIVER PACKAGE

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

TEGRA LINUX DRIVER PACKAGE R21.6

Model: LT-122-PCIE For PCI Express

Completing the Multimedia Architecture

Case Study: Building a High Quality Video Pipeline Using GStreamer and V4Linux on an i.mx6

GTC 2013 March San Jose, CA The Smartest People. The Best Ideas. The Biggest Opportunities. Opportunities for Participation:

OpenCV on Zynq: Accelerating 4k60 Dense Optical Flow and Stereo Vision. Kamran Khan, Product Manager, Software Acceleration and Libraries July 2017

Track Three Building a Rich UI Based Dual Display Video Player with the Freescale i.mx53 using LinuxLink

Mustang-200 Accelerate to the Future

DS-6900UDI SERIES DECODER

Bringing display and 3D to the C.H.I.P computer

Elaborazione dati real-time su architetture embedded many-core e FPGA

Khronos and the Mobile Ecosystem

What's new in GStreamer

TEGRA LINUX DRIVER PACKAGE R21.7

HDMI based Video Conference Device Recording

Accelerate to the Future

Martin Dubois, ing. Contents

Bringing display and 3D to the C.H.I.P computer

OpenMAX AL, OpenSL ES

GStreamer in the living room and in outer space

TEGRA LINUX DRIVER PACKAGE R19.3

A Linux multimedia platform for SH-Mobile processors

Android PerfHUD ES quick start guide

Introduction to creating 3D UI with BeagleBoard. ESC-341 Presented by Diego Dompe

P I X E V I A : A I B A S E D, R E A L - T I M E C O M P U T E R V I S I O N S Y S T E M F O R D R O N E S

PCIe/104 or PCI/104-Express 4-Channel Audio/Video Codec Model 953 User's Manual Rev.C September 2017

E9171-based Graphics/Compute Engine

Freescale i.mx6 Architecture

TBS8520 Transcoder Server User Guide

Discover Video. StreamEngine. User Guide. Version 1.0. Discover Video LLC 8/5/2016

SOM PRODUCTS BRIEF. S y s t e m o n M o d u l e. Engicam. SOMProducts ver

TEGRA K1 AND THE AUTOMOTIVE INDUSTRY. Gernot Ziegler, Timo Stich

AM57x Sitara Processors Multimedia and Graphics

LegUp: Accelerating Memcached on Cloud FPGAs

Embedded Vision Solutions

AddPac Technology. Sales and Marketing.

NVIDIA Jetson TX1, TX2 TX2i Software Features. DA_ March 8, 2018

Hugo Cunha. Senior Firmware Developer Globaltronics

TEGRA LINUX DRIVER PACKAGE R17.1

MAGEWELL Pro Capture HDMI Technical Specification

Copyright Khronos Group Page 1

Project Title: Design and Implementation of IPTV System Project Supervisor: Dr. Khaled Fouad Elsayed

CS365-TI Digital Media Software Development Kit

S CUDA on Xavier

Intelligent Video Analytics for Urban Management

NSIGHT ECLIPSE EDITION

Practical Experiences from Using Pulseaudio in Embedded Handheld Device

Transcription:

Efficient Video Processing on Embedded GPU Tobias Kammacher Armin Weiss Matthias Frei Institute of Embedded Systems High Performance Multimedia Research Group Zurich University of Applied Sciences (ZHAW)

Goals 1. Share Experiences 2. Benefits of Embedded GPU 3. Bottlenecks 2

Experience with Gstreamer on Embedded Devices Live Video Stream HW / SW Embedded + 4K Drivers Gbps Mbps 3

Experience with Gstreamer on Embedded Devices Live Video Stream HW / SW Embedded + 4K Drivers Gbps Mbps Nvidia Jetson TX1 Development Board 4K HDMI Capture Module 4

Experience with Gstreamer on Embedded Devices Live Video Stream HW / SW Embedded + 4K Drivers Gbps Mbps Multi Camera Capture Debayer on GPU GPU is powerful Realtime? 5

Experience with Gstreamer on Embedded Devices Live Video Stream HW / SW Embedded + 4K Drivers Gbps Mbps Multi Camera Capture Debayer on GPU GPU is powerful Realtime? Live Video Processing Computer Vision Deep Learning Person Tree 6

Embedded: Nvidia TX1/TX2 Interfaces CSI PCIe USB Ethernet 7 Image: nvidia.com

Embedded: Nvidia TX1/TX2 Interfaces CSI PCIe USB Ethernet 8 Image: nvidia.com

Embedded: Nvidia TX1/TX2 Processing GStreamer MM API CPU GPU DMAs CODECs H.264 H.265 VP8 Streaming HLS Mpeg-TS RT(S)P Interfaces CSI PCIe USB Ethernet 9 Image: nvidia.com

Software Frameworks on TX1/TX2 OS: Linux for Tegra (L4T) by Nvidia Kernel 4.4.15 Video Input: V4L2 drivers (e.g. for CSI) Video Output: Xorg or proprietary framebuffer Multimedia APIs GStreamer Hardware Scaling, CODECs (omx) Video Input, Display ISP hidden L4T Multimedia API (Nvidia) Video input, V4L2 API, Buffer management OpenCV, Deep Learning Frameworks (TensorRT, Yolo,..) GPU Integration CUDA OpenGL (ES) / EGL Vulkan GStreamer is free software available under the terms of the LGPL license OpenGL and the oval logo are trademarks or registered trademarks of Silicon Graphics, Inc 10

Software Stack High Level Sources Sinks Processing CODECs Stream VisionWorks TensorRT OpenCV (-> AI) User Space v4l2, alsa, tcp/udp libargus, V4L2 API Buffer utility xvideo, overlay (omx), tcp/udp NVOSD GStreamer mix, scale, convert, cuda, opengl Multimedia API cuda, opengl omx h264/h265, libav, mp3 NvVideoEncoder, NvVideoDecoder rtp, rtsp, hls, mpeg-ts Libraries X11 OpenGL, EGL, Vulkan CUDA OpenMAX (omx) Kernel Space V4L2, videobuf2 v4l2-subdev ALSA DRM/KMS/FB GPU Driver Linux Kernel (Frameworks) Modules / Drivers Sockets TCP/IP/UDP Host1x / Graphics Host Eth Driver CPU HW VI (CSI) Video Source Display Ctrl GPU Convert CODECs H.264/265/VP8 PCIe Ctrl Eth PHY 11

Simple Video Streaming Pipeline HLS Gstreamer Pipeline V4L2 Source Convert Encode H.265 MPEG- TS Mux HLS Sink WebServer (lighttpd) $ gst-launch-1.0 v4l2src! videoconvert! omxh265enc bitrate=5000000! mpegtsmux! hlssink playlist-location=/var/www/playlist.m3u8 location=/var/www/segment%05d.ts playlist-root=http://192.168.0.1 12

Video Processing Scaling, Mixing Gstreamer Pipeline V4L2 Source V4L2 Source Format Convert Scale Mix (PiP) Render HDMI Mixing two sources (4K and 1080p) 13

Video Processing Example: Scaling, Mixing Logo 1080p Video 4K Video Images: CC BY-SA Wikimedia 14

Video Processing Example: Scaling, Mixing Gstreamer Pipeline V4L2 Source V4L2 Source Format Convert Scale Mix (PiP) Render HDMI Mixing two sources (4K and 1080p) 15

Video Processing Example: Scaling, Mixing Gstreamer Pipeline V4L2 Source V4L2 Source Format Convert Scale Mix (PiP) Render HDMI Mixing two sources (4K and 1080p) CPU: Using compositor element: 1.2 FPS gst-launch-1.0 v4l2src! 'video/x-raw, format=uyvy, framerate=30/1, width=3840, height=2160'! compositor name=comp sink_0::alpha=1 sink_1::alpha=0.5! xvimagesink sync=false videotestsrc pattern=1! 'video/xraw,format=uyvy, framerate=30/1, width=1000, height=1000'! comp. 16

Video Processing Example: Scaling, Mixing Gstreamer Pipeline V4L2 Source V4L2 Source Format Convert Scale Mix (PiP) Render HDMI Mixing two sources (4K and 1080p) CPU: Using compositor element: 1.2 FPS gst-launch-1.0 v4l2src! 'video/x-raw, format=uyvy, framerate=30/1, width=3840, height=2160'! compositor name=comp sink_0::alpha=1 sink_1::alpha=0.5! xvimagesink sync=false videotestsrc pattern=1! 'video/xraw,format=uyvy, framerate=30/1, width=1000, height=1000'! comp. OpenGL (glvideomixer & glimagesink): 6.8 FPS 17

Video Processing Example: Scaling, Mixing Gstreamer Pipeline V4L2 Source V4L2 Source Format Convert Scale Mix (PiP) Render HDMI Mixing two sources (4K and 1080p) CPU: Using compositor element: 1.2 FPS gst-launch-1.0 v4l2src! 'video/x-raw, format=uyvy, framerate=30/1, width=3840, height=2160'! compositor name=comp sink_0::alpha=1 sink_1::alpha=0.5! xvimagesink sync=false videotestsrc pattern=1! 'video/xraw,format=uyvy, framerate=30/1, width=1000, height=1000'! comp. OpenGL (glvideomixer & glimagesink): 6.8 FPS 35 30 25 20 15 10 5 0 PiP Pipeline FPS? CPU OpenGL Required Need a solution with better performance => GPU 18

Use GPU with GStreamer GStreamer Plugin From nvidia: nvivafilter CUDA processing NVMM frame format (Nv internal) EGLImage type Only 1 input and 1 output pad V4L2 Source V4L2 Source Gstreamer Pipeline GPU Plugin Signals Display Sink Our own plugin (internal) CUDA processing Multiple input pads, 1 output pad Allocate managed memory from GPU and pass to src plugin Support Userptr io-mode Alternatives? 19

GPU Processing GPU Memory Access Methods Unified Virtual Addressing TX1 CPU L2 Cache GPU L2 Cache Memory Controller DRAM 4GB CPU Buffer GPU Buffer 20

GPU Processing GPU Memory Access Methods Unified Virtual Addressing TX1 Zero Copy TX1 CPU GPU CPU GPU L2 Cache L2 Cache L2 Cache L2 Cache Memory Controller DRAM 4GB Memory Controller DRAM 4GB CPU Buffer GPU Buffer Shared Buffer 21

GPU Processing GPU Memory Access Methods Unified Virtual Addressing TX1 Zero Copy TX1 Managed Memory TX1 CPU GPU CPU GPU CPU GPU L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache Memory Controller DRAM 4GB Memory Controller DRAM 4GB Memory Controller DRAM 4GB CPU Buffer GPU Buffer Shared Buffer Shared Buffer 22

GPU Processing PiP Test (GPU Data Transfer and Kernel Execution) Unified Virtual Addressing Step 1: cudamemcpy() to GPU * 12.5 ms Step 2: Execute kernel 9-11 ms Step 3: cudamemcpy() to host * 7.2 ms Total: 30 ms * Upload 4K + 1080p, Download 4K 23

GPU Processing PiP Test (GPU Data Transfer and Kernel Execution) Unified Virtual Addressing Step 1: cudamemcpy() to GPU * 12.5 ms Step 2: Execute kernel 9-11 ms Step 3: cudamemcpy() to host * 7.2 ms Step 1: Zero Copy cudamallochost(): Allocate memory on host ** Step 2: Execute kernel 23.5 25.7 ms - Total: 30 ms Total: 25 ms * Upload 4K + 1080p, Download 4K ** One time only operation 24

GPU Processing PiP Test (GPU Data Transfer and Kernel Execution) Unified Virtual Addressing Step 1: cudamemcpy() to GPU * 12.5 ms Step 2: Execute kernel 9-11 ms Step 3: cudamemcpy() to host * 7.2 ms Step 1: Zero Copy cudamallochost(): Allocate memory on host ** Step 2: Execute kernel 23.5 25.7 ms - Total: 30 ms Total: 25 ms * Upload 4K + 1080p, Download 4K ** One time only operation Step 1: Managed Memory cudamallocmanaged(): Allocate shared memory ** Step 2: Execute kernel 9-11 ms Step 3: synchronize with CPU 0.2 ms - Total: 10 ms 25

GPU Processing Results PiP pipeline achieves 30 FPS Using managed memory Additional: Consecutive kernels executed faster 35 30 25 20 15 10 5 0 PiP Pipeline FPS CPU OpenGL GPU 26

Conclusion Hardware Mapping 2nd Video Source Gbps Mbps Video Input Color Space Conversion Scaling Picture in Picture H.264/H.265 Encoder Audio Audio/Video Mux Encryption Transport Protocol Packer Forward Error Correction Ethernet Output GPU Recorder HW Block CPU 27

Conclusion Live 4K on Embedded GPU and HW-accelerated blocks Enable Desktop -> Embedded Bottlenecks and Solutions Allocate GPU Managed Memory for Capture Gst GPU Plugin 28

Get started with embedded GPU now! Blog: https://blog.zhaw.ch/high-performance/ 4K Drivers: https://github.com/ines-hpmm Hardware Board: http://pender.ch/products_zhaw.shtml tobias.kammacher@zhaw.ch matthias.rosenthal@zhaw.ch 29