4K HEVC Video Processing with GPU Optimization on Jetson TX1

Similar documents
4K Video Processing and Streaming Platform on TX1

Efficient Video Processing on Embedded GPU

4K Video Processing and Streaming Platform on TX1

Using mobile processors for general purpose industrial signal processing

Realtime Signal Processing on Nvidia TX2 using CUDA

Realtime Signal Processing on Embedded GPUs

Multimedia SoC System Solutions

GstShark profiling: a real-life example. Michael Grüner - David Soto -

GStreamer Conference 2013, Edinburgh 22 October Sebastian Dröge Centricular Ltd

Embedded Streaming Media with GStreamer and BeagleBoard. Presented by Todd Fischer todd.fischer (at) ridgerun.com

THE LEADER IN VISUAL COMPUTING

Embedded Computing without Compromise. Evolution of the Rugged GPGPU Computer Session: SIL7127 Dan Mor PLM -Aitech Systems GTC Israel 2017

Intelligent Surveillance

MCN Streaming. An Adaptive Video Streaming Platform. Qin Chen Advisor: Prof. Dapeng Oliver Wu

A176 Cyclone. GPGPU Fanless Small FF RediBuilt Supercomputer. IT and Instrumentation for industry. Aitech I/O

Porting Tizen-IVI 3.0 to an ARM based SoC Platform. Damian Hobson-Garcia, IGEL Co., Ltd.

Case Study: Building a High Quality Video Pipeline Using GStreamer and V4Linux on an i.mx6

MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES. Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015

Porting Tizen-IVI 3.0 to an ARM based SoC Platform

Memory Management in Tizen. SW Platform Team, SW R&D Center

Simple Plugin API. Wim Taymans Principal Software Engineer October 10, Pinos Wim Taymans

Towards Automatic Heterogeneous Computing Performance Analysis. Carl Pearson Adviser: Wen-Mei Hwu

Embedded Vision Solutions

GStreamer Daemon - Building a media server under 30min. Michael Grüner - David Soto -

Connecting the uvcvideo driver of the PHYTEC USB-CAM-104H

EGLSTREAMS: INTEROPERABILITY FOR CAMERA, CUDA AND OPENGL. Debalina Bhattacharjee Sharan Ashwathnarayan

Deep Learning: Transforming Engineering and Science The MathWorks, Inc.

TBS8510 Transcoder Server User Guide

Solid State Graphics (SSG) SDK Setup and Raw Video Player Guide

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

E9171-based Graphics/Compute Engine

A176 C clone. GPGPU Fanless Small FF RediBuilt Supercomputer. Aitech

Yafit Snir Arindam Guha Cadence Design Systems, Inc. Accelerating System level Verification of SOC Designs with MIPI Interfaces

arm MULTICORE PLATFORMS FOR ADVANCED APPLICATIONS Product Longevity

Remote Access and Output Sharing Between Multiple ECUs for Automotive

SOM PRODUCTS BRIEF. S y s t e m o n M o d u l e. Engicam. SOMProducts ver

Speed Sign Detection Using Convolutional Neural Network Accelerator IP User Guide

Designing with NXP i.mx8m SoC

AMD HD5450 PCIe ADD-IN BOARD. Datasheet AEGX-A3T5-01FST1

The Mobile Internet: The Potential of Handhelds to Bring Internet to the Masses. April 2008

Speed Sign Detection Using Convolutional Neural Network Accelerator IP Reference Design

Hugo Cunha. Senior Firmware Developer Globaltronics

GPM0002 E9171-based Graphics/Compute Engine

Mustang-200 Accelerate to the Future

A Linux multimedia platform for SH-Mobile processors

TEGRA LINUX DRIVER PACKAGE R23.2

Accelerating Cloud Graphics

ACCELERATED GSTREAMER USER GUIDE

Accelerate to the Future

Adhocracy Innovation with Imaging technology. Socionext Inc. Hiroyuki Komori July 5th, 2017

Exporting virtual memory as dmabuf. Nikhil Devshatwar Texas Instruments, India

NETPC-BD6C User s Manual BASED ON VERSION AMD DRIVER

GPGPU introduction and network applications. PacketShaders, SSLShader

What s cooking in GStreamer. FOSDEM, Brussels 1 February Tim-Philipp Müller Sebastian Dröge

Bringing display and 3D to the C.H.I.P computer

TBS8520 Transcoder Server User Guide

NVJPEG. DA _v0.1.4 August nvjpeg Libary Guide

AMD HD7750 PCIe ADD-IN BOARD. Datasheet (GFX-A3T2-01FST1)

Thunderbolt 3 USER GUIDE. For more information visit

CSE 599 I Accelerated Computing - Programming GPUS. Advanced Host / Device Interface

April 4-7, 2016 Silicon Valley VISIONWORKS A CUDA ACCELERATED COMPUTER VISION LIBRARY S6783. Elif Albuz, April 4, 2016

Module Introduction. Content 15 pages 2 questions. Learning Time 25 minutes

AMD HD7750 2GB PCIEx16

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

Bringing display and 3D to the C.H.I.P computer

NVIDIA AI BRAIN OF SELF DRIVING AND HD MAPPING. September 13, 2016

Object Counting Using Convolutional Neural Network Accelerator IP Reference Design

edix Data Sheet XTRMX January 2018 XTRMX.com/edix

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut

Efficient Data Transfers

Windowing System on a 3D Pipeline. February 2005

FOSDEM 3 February 2018, Brussels. Tim-Philipp Müller < >

Face Tracking Using Convolutional Neural Network Accelerator IP Reference Design

HLS Authoring Update. Media #WWDC17. Eryk Vershen, AVFoundation Engineer

Shareable Camera Framework for Multiple Computer Vision Applications

Khronos and the Mobile Ecosystem

NVJPEG. DA _v0.2.0 October nvjpeg Libary Guide

GTC 2013 March San Jose, CA The Smartest People. The Best Ideas. The Biggest Opportunities. Opportunities for Participation:

TEGRA LINUX DRIVER PACKAGE R24.1

NXP-Freescale i.mx6 MicroSoM i4pro. Quad Core SoM (System-On-Module) Rev 1.3

GPU Fundamentals Jeff Larkin November 14, 2016

Inspiron Setup and Specifications

Before you use your Point Grey Zebra2 camera, we recommend that you are aware of the following resources:

. SMARC 2.0 Compliant

How to achieve low latency audio/video streaming over IP network?

Mobile AR Hardware Futures

VT988. Key Features. Benefits. High speed 16 ADC at 3 GSPS with Synchronous Capture VT ADC for synchronous capture

Discover Video. StreamEngine. User Guide. Version 1.0. Discover Video LLC 8/5/2016

ARM Multimedia IP: working together to drive down system power and bandwidth

Optimizing Film, Media with OpenCL & Intel Quick Sync Video

INTEGRATING COMPUTER VISION SENSOR INNOVATIONS INTO MOBILE DEVICES. Eli Savransky Principal Architect - CTO Office Mobile BU NVIDIA corp.

Inspiron Setup and Specifications

Elaborazione dati real-time su architetture embedded many-core e FPGA

Build cost-effective, reliable signage solutions with the 8 display output, single slot form factor NVIDIA NVS 810

Distributing Computation to Large GPU Clusters

ECE 574 Cluster Computing Lecture 17

PacketShader: A GPU-Accelerated Software Router

Spring 2009 Prof. Hyesoon Kim

Hezi Saar, Sr. Staff Product Marketing Manager Synopsys. Powering Imaging Applications with MIPI CSI-2

G3399 Single Board Computer Introduction

Transcription:

4K HEVC Video Processing with GPU Optimization on Jetson TX1 Tobias Kammacher Matthias Frei Hans Gelke Institute of Embedded Systems / High Performance Multimedia Research Group Zurich University of Applied Sciences

Goal 1. 4K Video Realtime 2. Capturing 3. Scaling, Mixing 4. Encoding H.265/HEVC 2

Requirements Streaming Use Case 2nd Video Source Gbps Mbps Video Input Color Space Conversion Scaling Picture in Picture H.264/H.265 Encoder Audio Audio/Video Mux Encryption Transport Protocol Packer Forward Error Correction Ethernet Output Recorder 3

Requirements Hardware Limitations Nvidia Jetson TX1 Development Board Gigabit Ethernet Video Output HDMI 2.0 4096x2160 60Hz WIFI 802.1ac 2x2 Video Input 4 Images: anandtech.com, wccftech.com

Video Input MIPI CSI-2 Video Interface (Camera Serial Interface) 1-4 Lanes Images: wccftech.com, nvidia.com 5

Video Input Use Case / Hardware Design Capture HDMI sources 4K + 1080p and process them CSI HDMI 4K HDMI 1080p Toshiba TC358840 TC358840 Jetson TX1 Dev Board Process Video Ethernet Stream 4K* requires 8 CSI lanes in our case (hardware limitation) * 2160p30 YUV422 6

Video Input Hardware Design Nvidia Jetson TX1 Development Board HDMI2CSI module 7

Video Input Capture Driver Development Driver Overview Bridge: tc358840 Host: tegra_vi2 Features Capture 720p60, 1080p60, 2160p30 Capture custom formats Available Open Source: https://github.com/ines-hpmm 8

Video Rendering Frameworks GStreamer Pipeline-based Multimedia Framework OpenGL Graphics API with GPU Acceleration Plugins for Gstreamer available GStreamer is free software available under the terms of the LGPL license OpenGL and the oval logo are trademarks or registered trademarks of Silicon Graphics, Inc 9

Video Processing Example: Scaling, Mixing Gstreamer Pipeline V4L2 Source V4L2 Source Format Convert Scale Mix (PiP) Render HDMI Mixing two sources (4K and 1080p) 10

Video Processing Example: Scaling, Mixing Logo 1080p Video 4K Video Images: CC BY-SA Wikimedia 11

Video Processing Example: Scaling, Mixing Gstreamer Pipeline V4L2 Source V4L2 Source Format Convert Scale Mix (PiP) Render HDMI Mixing two sources (4K and 1080p) CPU: Using compositor element: 1.2 FPS gst-launch-1.0 v4l2src! 'video/x-raw, format=uyvy, framerate=30/1, width=3840, height=2160'! compositor name=comp sink_0::alpha=1 sink_1::alpha=0.5! xvimagesink sync=false videotestsrc pattern=1! 'video/xraw,format=uyvy, framerate=30/1, width=1000, height=1000'! comp. OpenGL (glvideomixer & glimagesink): 6.8 FPS 35 30 25 20 15 10 5 0 PiP Pipeline FPS? CPU OpenGL Required Need a solution with better performance => GPU 12

GPU Processing GPU Memory Access Methods Unified Virtual Addressing TX1 Zero Copy TX1 Managed Memory TX1 CPU GPU CPU GPU CPU GPU L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache Memory Controller DRAM 4GB Memory Controller DRAM 4GB Memory Controller DRAM 4GB CPU Buffer GPU Buffer Shared Buffer Shared Buffer 13

GPU Processing Results Tested use cases PiP Dynamic Logo Fixed Images: CC BY-SA Wikimedia 14

GPU Processing PiP Test (GPU Data Transfer and Kernel Execution) Unified Virtual Addressing Step 1: cudamemcpy() to GPU * 12.5 ms Step 2: Execute kernel 9-11 ms Step 3: cudamemcpy() to host * 7.2 ms Step 1: Zero Copy cudamallochost(): Allocate memory on host ** Step 2: Execute kernel 23.5 25.7 ms - Total: 30 ms Total: 25 ms * Upload 4K + 1080p, Download 4K ** One time only operation Step 1: Managed Memory cudamallocmanaged(): Allocate shared memory ** Step 2: Execute kernel 9-11 ms Step 3: synchronize with CPU 0.2 ms - Total: 10 ms 15

GPU Processing Results PiP pipeline achieves 30 FPS Using managed memory Logo pipeline achieves 30 FPS Using UVA, copy once Additional: Consecutive kernels executed faster 35 30 25 20 15 10 5 0 PiP Pipeline FPS CPU OpenGL GPU 16

Conclusion Hardware Mapping 2nd Video Source Gbps Mbps Video Input Color Space Conversion Scaling Picture in Picture H.264/H.265 Encoder Audio Audio/Video Mux Encryption Transport Protocol Packer Forward Error Correction Ethernet Output GPU Recorder HW Block CPU 17

Conclusion Driver for 4K HDMI capturing Video processing on GPU Shared Memory Data transfer mode TX1 for 4K video Capturing Processing Streaming 18

Find us on the web: Blog: https://blog.zhaw.ch/high-performance/ Github: https://github.com/ines-hpmm Hardware Board: http://pender.ch/products_zhaw.shtml 19