4K HEVC Video Processing with GPU Optimization on Jetson TX1

4K HEVC Video Processing with GPU Optimization on Jetson TX1 Tobias Kammacher Matthias Frei Hans Gelke Institute of Embedded Systems / High Performance Multimedia Research Group Zurich University of Applied Sciences

Goal 1. 4K Video Realtime 2. Capturing 3. Scaling, Mixing 4. Encoding H.265/HEVC 2

Requirements Streaming Use Case 2nd Video Source Gbps Mbps Video Input Color Space Conversion Scaling Picture in Picture H.264/H.265 Encoder Audio Audio/Video Mux Encryption Transport Protocol Packer Forward Error Correction Ethernet Output Recorder 3

Requirements Hardware Limitations Nvidia Jetson TX1 Development Board Gigabit Ethernet Video Output HDMI 2.0 4096x2160 60Hz WIFI 802.1ac 2x2 Video Input 4 Images: anandtech.com, wccftech.com

Video Input MIPI CSI-2 Video Interface (Camera Serial Interface) 1-4 Lanes Images: wccftech.com, nvidia.com 5

Video Input Use Case / Hardware Design Capture HDMI sources 4K + 1080p and process them CSI HDMI 4K HDMI 1080p Toshiba TC358840 TC358840 Jetson TX1 Dev Board Process Video Ethernet Stream 4K* requires 8 CSI lanes in our case (hardware limitation) * 2160p30 YUV422 6

Video Input Hardware Design Nvidia Jetson TX1 Development Board HDMI2CSI module 7

Video Input Capture Driver Development Driver Overview Bridge: tc358840 Host: tegra_vi2 Features Capture 720p60, 1080p60, 2160p30 Capture custom formats Available Open Source: https://github.com/ines-hpmm 8

Video Rendering Frameworks GStreamer Pipeline-based Multimedia Framework OpenGL Graphics API with GPU Acceleration Plugins for Gstreamer available GStreamer is free software available under the terms of the LGPL license OpenGL and the oval logo are trademarks or registered trademarks of Silicon Graphics, Inc 9

Video Processing Example: Scaling, Mixing Gstreamer Pipeline V4L2 Source V4L2 Source Format Convert Scale Mix (PiP) Render HDMI Mixing two sources (4K and 1080p) 10

Video Processing Example: Scaling, Mixing Logo 1080p Video 4K Video Images: CC BY-SA Wikimedia 11

Video Processing Example: Scaling, Mixing Gstreamer Pipeline V4L2 Source V4L2 Source Format Convert Scale Mix (PiP) Render HDMI Mixing two sources (4K and 1080p) CPU: Using compositor element: 1.2 FPS gst-launch-1.0 v4l2src! 'video/x-raw, format=uyvy, framerate=30/1, width=3840, height=2160'! compositor name=comp sink_0::alpha=1 sink_1::alpha=0.5! xvimagesink sync=false videotestsrc pattern=1! 'video/xraw,format=uyvy, framerate=30/1, width=1000, height=1000'! comp. OpenGL (glvideomixer & glimagesink): 6.8 FPS 35 30 25 20 15 10 5 0 PiP Pipeline FPS? CPU OpenGL Required Need a solution with better performance => GPU 12

GPU Processing GPU Memory Access Methods Unified Virtual Addressing TX1 Zero Copy TX1 Managed Memory TX1 CPU GPU CPU GPU CPU GPU L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache Memory Controller DRAM 4GB Memory Controller DRAM 4GB Memory Controller DRAM 4GB CPU Buffer GPU Buffer Shared Buffer Shared Buffer 13

GPU Processing Results Tested use cases PiP Dynamic Logo Fixed Images: CC BY-SA Wikimedia 14

GPU Processing PiP Test (GPU Data Transfer and Kernel Execution) Unified Virtual Addressing Step 1: cudamemcpy() to GPU * 12.5 ms Step 2: Execute kernel 9-11 ms Step 3: cudamemcpy() to host * 7.2 ms Step 1: Zero Copy cudamallochost(): Allocate memory on host ** Step 2: Execute kernel 23.5 25.7 ms - Total: 30 ms Total: 25 ms * Upload 4K + 1080p, Download 4K ** One time only operation Step 1: Managed Memory cudamallocmanaged(): Allocate shared memory ** Step 2: Execute kernel 9-11 ms Step 3: synchronize with CPU 0.2 ms - Total: 10 ms 15

GPU Processing Results PiP pipeline achieves 30 FPS Using managed memory Logo pipeline achieves 30 FPS Using UVA, copy once Additional: Consecutive kernels executed faster 35 30 25 20 15 10 5 0 PiP Pipeline FPS CPU OpenGL GPU 16

Conclusion Hardware Mapping 2nd Video Source Gbps Mbps Video Input Color Space Conversion Scaling Picture in Picture H.264/H.265 Encoder Audio Audio/Video Mux Encryption Transport Protocol Packer Forward Error Correction Ethernet Output GPU Recorder HW Block CPU 17

Conclusion Driver for 4K HDMI capturing Video processing on GPU Shared Memory Data transfer mode TX1 for 4K video Capturing Processing Streaming 18

Find us on the web: Blog: https://blog.zhaw.ch/high-performance/ Github: https://github.com/ines-hpmm Hardware Board: http://pender.ch/products_zhaw.shtml 19