PERSEUS Plus HEVC: the first single-FPGA real-time 4Kp60 encoder


Presented by: Fabio Murra and Obioma Okehie
Title: PERSEUS Plus, the first single-FPGA real-time 4Kp60 encoder
Date: 10th December 2018

Unique compression technologies to dramatically improve density & video quality to all screens over any network

Traditional approaches to video compression: h.264, HEVC, VP9, AV1
Traditionally, standards have been developed within MPEG every 7-10 years to freeze the codec algorithms and deploy them as hardware blocks within dedicated encoding/decoding devices and SoCs, which are needed to handle the high complexity of the algorithms in real time.
[Figure: Traditional encoding approaches]

PERSEUS Plus: a new approach
- Unique hierarchical image representation is far more efficient than traditional block-based codecs
- Combining PERSEUS Plus with an existing base codec improves overall quality and reduces bandwidth requirements
- The approach better utilises the hardware resources available in modern chipsets and FPGAs
- Lower-resolution base increases density 4x
[Diagram labels: Transforms, Control, Shaders and Scalers, graphics computations]

PERSEUS Plus: a new approach
V-Nova becoming a standard:
- PERSEUS Pro undergoing standardization as VC-6/ST-2117
- PERSEUS Plus in process for Low Complexity Codec Enhancements

PERSEUS on Xilinx FPGA: unique benefits
Unbeatable density
- 4x increase in density on FPGA
- 50x denser than the equivalent software-only implementation
- UHDp60 in a single FPGA
Bandwidth savings
- Up to 50% more efficient
- Live UHDp60 @ 8 Mbps, 1080p60 @ 3 Mbps
- Increase reach, improve quality of experience, reduce cost
Codec agnostic
- PERSEUS Plus is codec agnostic
- Works with h.264, HEVC, VP9 and even AV1 when available
- Maximum compatibility with existing workflows
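To put the quoted live bitrates in context, they can be converted to average bits per pixel. This is only a back-of-the-envelope sketch: it assumes UHD means 3840x2160 and 1080p means 1920x1080, neither of which is stated on the slide, and the function name is illustrative.

```python
def bits_per_pixel(bitrate_bps, width, height, fps):
    """Average number of coded bits spent on each pixel of each frame."""
    return bitrate_bps / (width * height * fps)

# Bitrates are the ones quoted on the slide; frame sizes are assumptions.
uhd = bits_per_pixel(8_000_000, 3840, 2160, 60)   # UHDp60 @ 8 Mbps
fhd = bits_per_pixel(3_000_000, 1920, 1080, 60)   # 1080p60 @ 3 Mbps

print(f"UHDp60  @ 8 Mbps: {uhd:.4f} bits/pixel")
print(f"1080p60 @ 3 Mbps: {fhd:.4f} bits/pixel")
```

Under these assumptions both operating points work out to roughly 0.02 bits per pixel.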

The ONLY 4Kp60 real-time encoder on a single FPGA

Platform                  Encoder                           Performance  Cost     Power
4Kp60 on single VU9P      V-Nova PERSEUS+ NGC HEVC          Best         Lowest   Lowest
4Kp60 on 4 x VU9P         NGC HEVC only                     Medium       Medium   Medium
4Kp60 on 80 x x86 cores   x265 software (very slow preset)  Lowest       Highest  Highest

PERSEUS: from Algorithm to Board

Compression algorithm to board
[Flow diagram, stages as labelled: Design spec, Algorithm, Fix-Pointing, RTL Coding, System Verification, Board Programming, Synthesis / P&R, RTL Verification, Board Installation, Host Machine setup]

Going through the implementation options
Two options for achieving design implementation:
Full hardware flow
- We have full control of every aspect of the implementation and deployment platforms
- Tools used for pre-synthesis stages can be flexible
- We take full responsibility for host / kernel drivers
- Typically longer design times (much work involved)
SDAccel design flow
- Static region is abstracted out, so we only need to care about the core IP
- Reduces design time
- Integration with partner IPs is much easier and faster
Option we chose: SDAccel RTL-kernel design flow
- OpenCL runtime
- XMA runtime

SDAccel RTL Coding
[Diagram labels: Full system integration (Core-IP, DDR, PCIe); Split Core-IP into separate kernels; Package custom IP]
Existing project
- Started from a known good design
SDAccel generate kernels
- Block design based integration
- Split design into separate kernels (parallelism)
- Generated RTL kernels (.xo) and MicroBlaze executables (.elf) via Vivado
- Integrate host code and external IP within SDAccel
[Flow diagram labels: MicroBlaze executable build; Synthesize; Generate RTL kernel (.xo); Kernel Verification; Host Code; SDAccel build; *.exe; *.xclbin; External IP (.xo file)]

FPGA occupancy: 4Kp60 in a single VU9P FPGA
[Block diagram labels: Host CPU; Static Region; DSA (Xilinx); Memory Subsystem (Xilinx); PERSEUS Kernels; NGC Kernel; Dynamic Region; Core-IP; DDR. Chart labels: DSA, Free, % LUTs]

Item                           Core-IP %   DSA static %   Fitting percentage (%)
Current LUTs                   62.94       9.36           72.3
Memory LUTs                    20.11       1.35           21.5
Current DSPs                   45.77       0.04           45.8
Current FFs                    32.91       5.56           38.5
36kbit BRAMs (RAMB36)          45.09       10.42          55.5
(OR) 18kbit BRAMs (RAMB18)     19.12       0.19           19.3
BRAM (Kbits)                   64.22       10.60          74.8
288Kbit UltraRams (RAMB288)    41.67       0.00           41.7
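The fitting percentages appear to be the sum of the Core-IP and DSA static columns, rounded to one decimal place. The short check below reproduces that arithmetic from the slide's figures; the dictionary and function names are illustrative, not taken from any Xilinx tool report.

```python
# Utilisation figures copied from the occupancy table (percent of VU9P resources).
core_ip = {
    "Current LUTs": 62.94, "Memory LUTs": 20.11, "Current DSPs": 45.77,
    "Current FFs": 32.91, "RAMB36": 45.09, "RAMB18": 19.12,
    "BRAM (Kbits)": 64.22, "UltraRAM": 41.67,
}
dsa_static = {
    "Current LUTs": 9.36, "Memory LUTs": 1.35, "Current DSPs": 0.04,
    "Current FFs": 5.56, "RAMB36": 10.42, "RAMB18": 0.19,
    "BRAM (Kbits)": 10.60, "UltraRAM": 0.00,
}
reported_fit = {
    "Current LUTs": 72.3, "Memory LUTs": 21.5, "Current DSPs": 45.8,
    "Current FFs": 38.5, "RAMB36": 55.5, "RAMB18": 19.3,
    "BRAM (Kbits)": 74.8, "UltraRAM": 41.7,
}

def fitting(core, static):
    """Total occupancy per resource: dynamic-region Core-IP plus static DSA."""
    return {item: round(core[item] + static[item], 2) for item in core}

total = fitting(core_ip, dsa_static)
for item, pct in total.items():
    print(f"{item:<14}{pct:7.2f}   (slide: {reported_fit[item]})")
```

Read this way, the table shows that LUTs and block RAM (both above 70%) are the tightest resources in this design.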

SDAccel: difficulties we faced and how we solved them (2017.4.op)
Close communication with Xilinx SMEs and FAEs will help resolve most issues in a timely manner.
- Kernel verification (the HW-Emu option does not enable backpressure on the memory interfaces): requires a mixed signal simulator license to use faster external simulators for complex systems
- Accessing local memory contents (e.g. ROM values): create an extra HW process to dump the contents onto DDR
- Debugging the MicroBlaze code: connect the MicroBlaze processor to the AXI-MM interface to dump text printouts onto DDR
- Lack of control over allocation of local memory (DDR on FPGA): address re-routing logic is required (for cases where the allocated address is outside the module address range)
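The address re-routing workaround can be pictured with a small software model. This is a hypothetical sketch only, not the logic used in the design: the class name, the 28-bit module address range, and the example buffer address are all assumptions made for illustration.

```python
class AddressRerouter:
    """Toy model: maps module-local addresses into a runtime-allocated DDR buffer."""

    def __init__(self, module_addr_bits, buffer_base, buffer_size):
        module_range = 1 << module_addr_bits  # addresses the module can generate
        if buffer_size > module_range:
            raise ValueError("buffer does not fit in the module address range")
        self.module_range = module_range
        self.buffer_base = buffer_base
        self.buffer_size = buffer_size

    def needs_rerouting(self):
        """True when the allocated buffer lies (partly) outside the module range."""
        return self.buffer_base + self.buffer_size > self.module_range

    def to_ddr(self, local_addr):
        """Translate a module-local address into the DDR address actually issued."""
        if not 0 <= local_addr < self.buffer_size:
            raise ValueError("local address outside the allocated buffer")
        return self.buffer_base + local_addr

# Hypothetical case: the runtime places a 16 MiB buffer far above a 256 MiB module range.
rerouter = AddressRerouter(module_addr_bits=28,
                           buffer_base=0x4_2000_0000,
                           buffer_size=0x0100_0000)
print(rerouter.needs_rerouting())
print(hex(rerouter.to_ddr(0x1000)))
```

In hardware, the equivalent would presumably be a base-address register written by the host plus an adder on the kernel's AXI address path, but the slide does not describe the actual mechanism.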

Integrating PERSEUS Plus into the FFmpeg framework

$ ffmpeg \
    -f rawvideo -pix_fmt yuv420p -s:v 1920x1080 -r 30 -an -i /home/ffmpeg/vu9p/testsequences/kimono1_1920x1080_24.yuv \
    -frames 240 -c:v libx264 -preset medium -profile:v high -crf 23 -bf 4 -refs 3 -g 30 -b:v 4000k -maxrate 4000k -bufsize 8000k -f h264 -r 30 -y ./sw_outdir/x264_medium_out0_br4000k.h264

Change < 20 characters to get hyper acceleration:

$ ffmpeg \
    -f rawvideo -pix_fmt yuv420p -s:v 1920x1080 -r 30 -an -i /home/ffmpeg/vu9p/testsequences/kimono1_1920x1080_24.yuv \
    -frames 240 -b:v 4000k -g 30 -c:v pplusenc_fpga -y ./hw_outdir/out1_br4000k.h264

https://trac.ffmpeg.org/wiki/encodingforstreamingsites
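A script that drives both encodes can share the input options and vary only the encoder options. The sketch below assembles the two argument lists from the slide without running FFmpeg; the helper name `build_cmd` is an assumption, and `pplusenc_fpga` is the encoder name shown on the slide, which exists only in an FFmpeg build that includes the PERSEUS Plus plugin.

```python
import shlex

INPUT = "/home/ffmpeg/vu9p/testsequences/kimono1_1920x1080_24.yuv"

def build_cmd(encoder_opts, output):
    """Return the ffmpeg argv: shared raw-video input options plus encoder options."""
    common = ["ffmpeg", "-f", "rawvideo", "-pix_fmt", "yuv420p",
              "-s:v", "1920x1080", "-r", "30", "-an", "-i", INPUT,
              "-frames", "240"]
    return common + encoder_opts + ["-y", output]

# Software baseline, options as on the slide.
sw_cmd = build_cmd(
    ["-c:v", "libx264", "-preset", "medium", "-profile:v", "high",
     "-crf", "23", "-bf", "4", "-refs", "3", "-g", "30",
     "-b:v", "4000k", "-maxrate", "4000k", "-bufsize", "8000k",
     "-f", "h264", "-r", "30"],
    "./sw_outdir/x264_medium_out0_br4000k.h264")

# FPGA-accelerated variant, options as on the slide.
hw_cmd = build_cmd(
    ["-b:v", "4000k", "-g", "30", "-c:v", "pplusenc_fpga"],
    "./hw_outdir/out1_br4000k.h264")

print(shlex.join(sw_cmd))
print(shlex.join(hw_cmd))
```

Either list can then be passed to `subprocess.run` on a host where the corresponding encoder is installed.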

PERSEUS Plus Visual Quality Improvement

UHD VQ improvements
[Figure: quality vs. bitrate; curves: PERSEUS Plus HEVC (NGCodec), HEVC (NGCodec)]
4x lift in density coupled with:
- Video quality improvement
- Bandwidth savings

UHD VQ improvements
[Figure: quality vs. frames; curves: PERSEUS Plus HEVC (NGCodec), HEVC (NGCodec)]

Improving quality and density of existing deployments
[Figures: 'Watchability' (MS-SSIM) quality, higher is better, plotted against frames. Panel titles: 480p at 500 kbps; 1080p at 1.0 Mbps - Full HD over 4G. Series labels: QSV h.264, PERSEUS, X.264, Perseus, PERSEUS Plus h.264, Native h.264]

PERSEUS Plus Products:
1. Acceleration for any 3rd-party IP
2. Acceleration for Xilinx Video IP

PERSEUS XSA: available for deployment
Benefits
- Enhance existing server performance: add real-time 4Kp60, increase ABR density
- Reduce power
- Reduce $ per channel
PERSEUS + any codec
- QSV (available now)
- x264 (available now)
- x265 (available now)
- VP9 (roadmap)
- AVS2 (pending business case)
- AV1 (pending business case)
- VVC (pending business case)
The Xilinx VU9P implementation of PERSEUS Plus works with any codec

PERSEUS XDE: available soon

Platform                  Encoder                           Performance  Cost     Power
4Kp60 on single VU9P      V-Nova PERSEUS+ NGC HEVC          Best         Lowest   Lowest
4Kp60 on 4 x VU9P         NGC HEVC only                     Medium       Medium   Medium
4Kp60 on 80 x x86 cores   x265 software (very slow preset)  Lowest       Highest  Highest

PERSEUS Plus Xilinx IP Offering (under review)

Codec | Partner | Description | PERSEUS Plus benefits | Availability
H.264 HDE | Alma | High density encoder | Improve video quality | Feasibility
H.264 HQE | IDT | High quality encoder
HEVC-HDE | NGCodec | High density encoder
HEVC-HQE | NGCodec | High quality encoder | Improve density 4x; improve UHD VQ
VP9-HQE | NGCodec | High quality encoder | Improve density 4x; improve UHD VQ
Zynq-H.264 | Xilinx | Hardened H.264 core | Improved video quality; improved density
Zynq-H.265 | Xilinx | Hardened H.265 core | Improved video quality; improved density
Remaining Availability entries, in slide order: SOON, Roadmap, Feasibility, Feasibility

Codec | Description | PERSEUS Plus benefits | Availability
x264 | Open source software encoder | Improve video quality (from medium to very-slow); improve density | NOW
x265 | Open source software encoder | Improve video quality (from fastest to medium); improve density | NOW
QSV | Intel hardened core | Improve density 3x; improve video quality | NOW