Energy scalability and the RESUME scalable video codec

Energy scalability and the RESUME scalable video codec
Harald Devos, Hendrik Eeckhaut, Mark Christiaens
ELIS/PARIS, Ghent University
pag. 1

Outline
Introduction
  Scalable Video
  Reconfigurable HW: FPGAs
Implementation details
Energy measurements
pag. 2

Scalable Video
[Figure: a server encodes once; intelligent network nodes rescale the video stream on its way to the clients]
Quality ~ deployed hardware resources
pag. 3

Overview Video Codec
Encoder: Original frames -> Motion Estim. -> Wavelet Transform -> Entropy Encoding -> Pack
Pull bit stream
Decoder: Unpack -> Entropy Decoding -> Inverse Wavelet T. -> Motion Comp. -> Decompressed frames
pag. 4

Overview Video Codec
[Codec diagram: Motion Estim., Wavelet Transform, Entropy Encoding, Pack, Pull bit stream, Unpack, Entropy Decoding, Inverse Wavelet T., Motion Comp.]
Motion Estim. / Motion Comp.: Temporal + Temporal Scalability
pag. 5

Overview Video Codec
[Codec diagram: Motion Estim., Wavelet Transform, Entropy Encoding, Pack, Pull bit stream, Unpack, Entropy Decoding, Inverse Wavelet T., Motion Comp.]
Motion Estim. / Motion Comp.: Temporal + Temporal Scalability
Wavelet Transform / Inverse Wavelet T.: Spatial + Resolution Scalability
pag. 6

Overview Video Codec
[Codec diagram: Motion Estim., Wavelet Transform, Entropy Encoding, Pack, Pull bit stream, Unpack, Entropy Decoding, Inverse Wavelet T., Motion Comp.]
Motion Estim. / Motion Comp.: Temporal + Temporal Scalability
Wavelet Transform / Inverse Wavelet T.: Spatial + Resolution Scalability
Entropy Encoding / Entropy Decoding: Statistical + Resolution & Quality Scalability
pag. 7

FPGA
FPGA: Field Programmable Gate Array, e.g. Altera Stratix
[Figure: device layout with IO, logic elements (LE), memory blocks (Mem), M-RAM and DSP blocks]
pag. 8

Development Board
256 MiB PC333 DDR SDRAM
Altera Stratix S60
PCI interface
pag. 9

Introduction: RESUME
RESUME project (Reconfigurable Embedded Systems for Use in scalable Multimedia Environments)
Build real-time decoder for scalable video
Software profiling: hardware acceleration needed
Scalable video -> scalable hardware and energy?
pag. 10

Outline
Introduction
Implementation details
  System Overview
  2D-IDWT
Energy measurements
pag. 11

System Overview
[Figure: Enc. video stream -> Unpack -> Entropy Decoding -> Inverse Wavelet T. -> Motion Comp. -> Decoded frames]
[Figure labels: Control, PCI (DMA), PCI, Software, FPGA, VGA-card]
pag. 12

System Overview pag. 13

System Overview
Pipeline: WED -> Bitplane assembl. -> Inverse Wavelet T. -> Motion Comp. -> Color Conv.
Data objects are too large to store in the FPGA
[Figure: the pipeline stages exchange their data through the DDR memory, which is marked as the bottleneck]
pag. 14

2D-IDWT Inverse Discrete Wavelet Transform Resolution scalability pag. 15

2D-IDWT: 1st Design
Made manually (SystemC, VHDL)
Results:
  Simulation: 869530 cycles/frame
  Synthesis: clock @ 68.91 MHz
  Expectation: 79 frames/s
  Measurements on hardware: 29 frames/s: memory bottleneck!
pag. 16
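The expected frame rate on this slide follows from dividing the synthesized clock frequency by the simulated cycle count per frame. A minimal Python check of that arithmetic, using only the figures quoted above (the function and variable names are illustrative, not from the project):

```python
# Figures quoted on the slide.
CYCLES_PER_FRAME = 869_530   # from simulation
CLOCK_HZ = 68.91e6           # from synthesis
MEASURED_FPS = 29.0          # measured on hardware

def expected_frame_rate(clock_hz: float, cycles_per_frame: int) -> float:
    """Frame rate if the design never had to wait for external memory."""
    return clock_hz / cycles_per_frame

expected_fps = expected_frame_rate(CLOCK_HZ, CYCLES_PER_FRAME)
fraction_achieved = MEASURED_FPS / expected_fps

print(f"expected: {expected_fps:.1f} frames/s")
print(f"fraction of expectation achieved: {fraction_achieved:.2f}")
```

The gap between the two numbers is what the slide attributes to the memory bottleneck: the cycle count from simulation does not seem to include stalls on the external DDR.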

2D-IDWT: 2nd Design
Loop transformations
  improve spatial and temporal locality of data accesses
  polyhedral model
  common practice for software (cf. cache optimization)
Hardware generation from the polyhedral model (CLooGVHDL)
pag. 17

Loop Transformations
Original algorithm in, e.g., C
-> Representation in the polyhedral model
-> Loop transformations
-> Optimized algorithm in, e.g., C, or optimized algorithm in HW (VHDL)
pag. 18

2D-IDWT: Loop transformations
Data flow to external memory

Data flow    Burst usage    Variant
5.25 RC      50%            RC-based
2.625 RC     100%           Line-based
2 RC         100%           Stripe-based

[Figure labels: 1st design, 2nd design]
pag. 19

Outline
Introduction
Implementation details
Energy measurements
  Method
  Results and problems
pag. 20

Power supply
Measuring the FPGA alone is not possible -> measure the entire board
pag. 21

PCI extender 3.3V 5V pag. 22

TCP202
15 Ampere AC/DC current probe
Accuracy: ~ 20 mA
pag. 23

Steady state current
FPGA board: 1.8 A x 3.3 V ≈ 6 W when idle
pag. 26

Line-Based IDWT pag. 28

Energy
[Figure: measured current I (A) versus time (s) with the steady-state level I_steady state marked, and the derived power P (W) versus time (s)]
$P(t) = 3.3\,\mathrm{V} \times I(t)$
$E = \int P(t)\,dt$
pag. 31
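The energy calculation on this slide can be sketched as a numerical integration of the sampled current trace. The sketch below is plain Python and rests on two assumptions not stated on the slide: that the steady-state (idle) current is subtracted before integrating, and that the trapezoidal rule is an acceptable integration scheme. The trace itself is synthetic and all names are hypothetical.

```python
SUPPLY_VOLTAGE = 3.3  # volts, the supply rail named on the slide

def energy_above_idle(times, currents, i_steady):
    """Trapezoidal integration of P(t) = V * (I(t) - I_steady) over the trace.

    Subtracting i_steady is an assumption about how the figure is meant to
    be read; pass i_steady=0.0 to integrate the total board power instead.
    """
    energy = 0.0
    for k in range(1, len(times)):
        p_prev = SUPPLY_VOLTAGE * (currents[k - 1] - i_steady)
        p_curr = SUPPLY_VOLTAGE * (currents[k] - i_steady)
        energy += 0.5 * (p_prev + p_curr) * (times[k] - times[k - 1])
    return energy

# Synthetic trace (made up): 1 ms sampling, idle at 1.8 A,
# with the board drawing 2.0 A for one second of activity.
times = [k / 1000 for k in range(3001)]
currents = [2.0 if 1000 <= k < 2000 else 1.8 for k in range(3001)]

energy = energy_above_idle(times, currents, i_steady=1.8)
print(f"energy above idle: {energy:.3f} J")
```

With an idle draw of about 6 W and activity that adds only a fraction of a watt, a small error in the steady-state current changes the result noticeably, so the value passed as `i_steady` matters more than the integration scheme.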

Automation
Measurement (PC is master)
  Trigger scope
  Save wave trace
  GPIB
Processing
  Matlab script
  Steady state current determination
  Energy calculation
pag. 32

Energy for increasing quality
Foreman CIF, 10 GOPs (161 frames)
32 different image quality settings
20 identical runs
pag. 33

Noise
Steady state current: after - before
Impact temperature -> Add heat sink
Steady state current calculation
pag. 34

Different sequences > 1J pag. 35

Measure per component?
Components: CPU, WED, PCI, MS, AS, IDWT, MC, CC, AD, VGA, DDR
Log commands of components and replay per component
Keep all (intermediate) data in DDR
  256 MiB: 5 GOPs
pag. 36

Per component
10 -> 5 GOPs
Energy = ~ 1/2
pag. 37

Wavelet entropy decoder
[Plot: E (J) versus PSNR (dB); annotation: x 10]
pag. 38

Inverse wavelet transform
# calculations = constant!
[Plot: E (J) versus PSNR (dB); annotation: x 1.5]
pag. 39

Whole = sum of components? VGA-component Interaction pag. 40

2 variants of IDWT
RC-IDWT: made manually
LB-IDWT: generated semiautomatically
Energy total decoder, 10 GOPs (= 161 frames)
pag. 41

2 variants of IDWT
RC-IDWT: T = 40 s, Pmean = 0.635 W, E = 25.4 J
LB-IDWT: T = 10 s, Pmean = 1.16 W, E = 11.6 J
pag. 42
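The energy figures for the two variants are consistent with E = T x Pmean. A short Python check using only the numbers on this slide (the dictionary layout and names are illustrative):

```python
# Run time and mean power per IDWT variant, as quoted on the slide.
variants = {
    "RC-IDWT": {"T_s": 40.0, "P_mean_W": 0.635},
    "LB-IDWT": {"T_s": 10.0, "P_mean_W": 1.16},
}

# Energy is mean power multiplied by run time.
energy_j = {name: v["T_s"] * v["P_mean_W"] for name, v in variants.items()}
ratio = energy_j["RC-IDWT"] / energy_j["LB-IDWT"]

for name, e in energy_j.items():
    print(f"{name}: {e:.1f} J")
print(f"RC-IDWT / LB-IDWT energy ratio: {ratio:.2f}")
```

The line-based variant draws almost twice the mean power but finishes four times sooner, so it needs less than half the energy.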

Future work
Resolution and temporal scalability
Try different approach for measurement per component
Measure temperature of FPGA (MAX1619)
Predict energy consumption
  Steady state current?
pag. 43

Conclusions
Energy measurement feasible
  Sufficient accuracy: not trivial
Scalability has significant impact on energy consumption
External memory has large impact
pag. 44

References
H. Devos et al., "From loop transformation to hardware generation," ProRISC 06, Veldhoven, The Netherlands.
H. Devos et al., "Finding and applying loop transformations for optimized FPGA implementations," Transactions on HiPEAC, to appear.
pag. 45

Reconfigurable computing
[Figure: CPU, DSP, VLIW, FPGA and ASIC placed along axes of flexibility, efficiency and development effort]
pag. 47

Infrastructure: SOPC-builder
[Figure: board with PCI and DDR; inside the FPGA, a PCI core with DMA, a DDR core and custom components (WED, IDWT, ...) connected through the Avalon switch fabric]
SOPC-builder (Quartus, Altera): automatic generation of Avalon switch fabric
pag. 48

[Plot: frame rate (frames/s) versus available bandwidth to external memory (MB/s), with a bandwidth-limited region and a calculation-limited region]
pag. 49
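The two regions of this plot can be read as the minimum of two limits: what the memory bandwidth allows and what the datapath can compute. A minimal Python sketch of that reading follows. The traffic per frame is a hypothetical placeholder, not a value from the slides; only the 79 frames/s compute limit is taken from the first 2D-IDWT design.

```python
def achievable_frame_rate(bandwidth_mb_s: float,
                          traffic_mb_per_frame: float,
                          compute_limit_fps: float) -> float:
    """Frame rate bounded by external-memory bandwidth and by computation."""
    bandwidth_limit_fps = bandwidth_mb_s / traffic_mb_per_frame
    return min(bandwidth_limit_fps, compute_limit_fps)

# Hypothetical traffic of 1.0 MB per frame (an assumption for illustration);
# compute limit of 79 frames/s as expected for the first design.
TRAFFIC_MB_PER_FRAME = 1.0
COMPUTE_LIMIT_FPS = 79.0

for bandwidth in (20.0, 79.0, 200.0):
    fps = achievable_frame_rate(bandwidth, TRAFFIC_MB_PER_FRAME, COMPUTE_LIMIT_FPS)
    print(f"{bandwidth:6.1f} MB/s -> {fps:5.1f} frames/s")
```

Under this model, a loop transformation that lowers the traffic per frame moves the knee of the curve to the left, so the design becomes calculation limited at a lower bandwidth.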

[Plot: frame rate (frames/s) versus available bandwidth to external memory (MB/s). Curves: Row-column = untransformed, Line-based, Stripe-based. Legend: Manual, CLooGVHDL + Manual opt., CLooGVHDL, Impulse C]
pag. 50

The Polyhedral Model
SCoP: Static Control Part
  Part of program with data-independent control flow
  Typically a set of nested loops (hot code)
  Loop bounds are linear expressions of parameters and other iterators
pag. 51
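As an illustration of a static control part, the Python sketch below enumerates the iteration domain of a two-deep loop nest whose bounds are linear in the parameters and the outer iterator, then scans the same domain with the loops interchanged. The loop nest is a made-up example, not code from the RESUME decoder, and the functions are hypothetical.

```python
def original_order(n: int, m: int):
    """Integer points of {(i, j) : 0 <= i < n, i <= j < m}, i outermost.

    The bounds depend only on the parameters n, m and the outer iterator i,
    never on data, which is what makes this a static control part.
    """
    return [(i, j) for i in range(n) for j in range(i, m)]

def interchanged_order(n: int, m: int):
    """The same domain with j outermost.

    The inner bound becomes min(j + 1, n), a minimum of linear expressions,
    so the interchanged nest is still a static control part.
    """
    return [(i, j) for j in range(m) for i in range(min(j + 1, n))]

points = original_order(3, 4)
swapped = interchanged_order(3, 4)

# A legal loop transformation visits exactly the same iterations,
# only in a different order.
print(len(points), len(swapped))
print(set(points) == set(swapped))
```

Because the domain is described by linear inequalities alone, a tool can derive the new loop bounds automatically when the scanning order changes.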

2D-IDWT: Memory bottleneck
Memory hierarchy
  large but slow external memory
  fast but smaller (parallel) on-chip memory
External memory often bottleneck
  Minimize accesses to external memory
  Increase reuse of data stored in on-chip buffers
pag. 52

2D-IDWT: Problem
Memory bottleneck
Design automation needed
  Manual design process = slow, error-prone, ...
  Lots of designs to be made
    Reconfigurable HW (QoS)
    Different platforms
pag. 53

2D-IDWT
Memory bottleneck -> Loop transformations
  SW techniques can be reused for HW
  Polyhedral model eases transformations
Design automation
  CLooGVHDL: hardware generation from the polyhedral model
pag. 54

Overview Video Codec
[Codec diagram: Motion Estim., Wavelet Transform, Entropy Encoding, Pack, Pull bit stream, Unpack, Entropy Decoding, Inverse Wavelet T., Motion Comp.]
Motion Estim. / Motion Comp.: Temporal + Temporal Scalability
Wavelet Transform / Inverse Wavelet T.: Spatial + Resolution Scalability
Entropy Encoding / Entropy Decoding: Statistical + Resolution & Quality Scalability
pag. 55