Energy scalability and the RESUME scalable video codec Harald Devos, Hendrik Eeckhaut, Mark Christiaens ELIS/PARIS Ghent University pag. 1
Outline: Introduction (Scalable Video; Reconfigurable HW: FPGAs); Implementation details; Energy measurements pag. 2
Scalable Video: Server (encode once) -> Intelligent Network nodes (rescale video stream) -> Clients (quality ~ deployed hardware resources) pag. 3
Overview Video Codec. Encoder: Original frames -> Motion Estim. -> Wavelet Transform -> Entropy Encoding -> Pack. Decoder: Pull bit stream -> Unpack -> Entropy Decoding -> Inverse Wavelet T. -> Motion Comp. -> Decompressed frames. pag. 4
Overview Video Codec. Encoder: Original frames -> Motion Estim. -> Wavelet Transform -> Entropy Encoding -> Pack. Decoder: Pull bit stream -> Unpack -> Entropy Decoding -> Inverse Wavelet T. -> Motion Comp. -> Decompressed frames. Motion Estim. / Motion Comp.: Temporal + Temporal Scalability. pag. 5
Overview Video Codec. Encoder: Original frames -> Motion Estim. -> Wavelet Transform -> Entropy Encoding -> Pack. Decoder: Pull bit stream -> Unpack -> Entropy Decoding -> Inverse Wavelet T. -> Motion Comp. -> Decompressed frames. Motion Estim. / Motion Comp.: Temporal + Temporal Scalability. Wavelet Transform / Inverse Wavelet T.: Spatial + Resolution Scalability. pag. 6
Overview Video Codec. Encoder: Original frames -> Motion Estim. -> Wavelet Transform -> Entropy Encoding -> Pack. Decoder: Pull bit stream -> Unpack -> Entropy Decoding -> Inverse Wavelet T. -> Motion Comp. -> Decompressed frames. Motion Estim. / Motion Comp.: Temporal + Temporal Scalability. Wavelet Transform / Inverse Wavelet T.: Spatial + Resolution Scalability. Entropy Encoding / Entropy Decoding: Statistical + Resolution & Quality Scalability. pag. 7
FPGA: Field Programmable Gate Array, e.g. Altera Stratix. Blocks shown: IO, LE, Mem, M-RAM, DSP blocks pag. 8
Development Board 256 MiB PC333 DDR SDRAM Altera Stratix S60 PCI interface pag. 9
Introduction: RESUME. RESUME project (Reconfigurable Embedded Systems for Use in scalable Multimedia Environments). Build real-time decoder for scalable video. Software profiling: hardware acceleration needed. Scalable video -> scalable hardware and energy? pag. 10
Outline: Introduction; Implementation details (System Overview; 2D-IDWT); Energy measurements pag. 11
System Overview Unpack Entropy Decoding Inverse Wavelet T. Motion Comp. Decoded frames Enc. video stream Control PCI (DMA) PCI Software FPGA VGA-card pag. 12
System Overview: WED -> Bitplane assembl. -> Inverse Wavelet T. -> Motion Comp. -> Color Conv. Data objects are too large to store in the FPGA -> DDR: bottleneck pag. 14
2D-IDWT Inverse Discrete Wavelet Transform Resolution scalability pag. 15
2D-IDWT: 1st Design. Made manually (SystemC, VHDL). Results: Simulation: 869530 cycles/frame. Synthesis: clock @ 68.91 MHz. Expectation: 79 frames/s. Measurements on hardware: 29 frames/s: memory bottleneck! pag. 16
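The expected frame rate on this slide follows from dividing the clock frequency by the simulated cycle count per frame. A minimal sketch of that arithmetic, using only the figures quoted on the slide (the function and variable names are illustrative, not taken from the project):

```python
# Figures quoted on the slide.
CYCLES_PER_FRAME = 869_530   # from simulation
CLOCK_HZ = 68.91e6           # from synthesis
MEASURED_FPS = 29.0          # measured on hardware


def expected_fps(clock_hz, cycles_per_frame):
    """Frame rate if the design were limited only by its own cycle count."""
    return clock_hz / cycles_per_frame


fps = expected_fps(CLOCK_HZ, CYCLES_PER_FRAME)
shortfall = fps / MEASURED_FPS  # how far the hardware falls short of the estimate

print(f"expected: {fps:.1f} frames/s")
print(f"measured: {MEASURED_FPS:.0f} frames/s")
print(f"expected / measured: {shortfall:.2f}")
```

The estimate ignores stalls on external memory, which is consistent with the slide attributing the gap to a memory bottleneck.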
2D-IDWT: 2nd Design. Loop transformations: improve spatial and temporal locality of data accesses; polyhedral model; common practice for software (cf. cache optimization). Hardware generation from the polyhedral model (CLooGVHDL) pag. 17
Loop Transformations: Original algorithm in, e.g., C -> Representation in the Polyhedral Model -> Loop Transformations -> Optimized algorithm in, e.g., C, or optimized algorithm in HW (VHDL) pag. 18
2D-IDWT: Loop transformations. Data flow to external memory:
Variant        Data flow   Burst usage
RC-based       5.25 RC     50%
Line-based     2.625 RC    100%
Stripe-based   2 RC        100%
Design labels on the slide: 1st design, 2nd design pag. 19
Outline: Introduction; Implementation details; Energy measurements (Method; Results and problems) pag. 20
Power supply: measuring the FPGA alone is not possible -> measure the entire board pag. 21
PCI extender 3.3V 5V pag. 22
TCP202: 15 A AC/DC current probe. Accuracy: ~20 mA pag. 23
Steady state current of the FPGA board: 1.8 A x 3.3 V ≈ 6 W when idle pag. 26
Line-Based IDWT pag. 28
Energy. Plots: I (A) versus time (s), with the steady state current Isteady state marked; P (W) versus time (s). P(t) = 3.3 V x I(t); E = ∫ P(t) dt pag. 31
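The two relations on this slide, P(t) = 3.3 V x I(t) and E = ∫ P(t) dt, can be applied to a sampled current trace with numerical integration. The sketch below is not the Matlab script used in the project: it assumes trapezoidal integration, and the optional `steady_state_a` parameter (subtracting the idle current before integrating) is an assumption about how the steady state current is used. The sample trace is synthetic.

```python
def energy_joules(times_s, currents_a, supply_v=3.3, steady_state_a=0.0):
    """Trapezoidal estimate of E = integral of supply_v * (I(t) - steady_state_a) dt.

    steady_state_a is an assumed way of removing the idle current;
    with the default 0.0 this is exactly E = integral of P(t) dt.
    """
    if len(times_s) != len(currents_a):
        raise ValueError("times and currents must have the same length")
    energy = 0.0
    for k in range(1, len(times_s)):
        dt = times_s[k] - times_s[k - 1]
        p_prev = supply_v * (currents_a[k - 1] - steady_state_a)
        p_next = supply_v * (currents_a[k] - steady_state_a)
        energy += 0.5 * (p_prev + p_next) * dt
    return energy


# Synthetic trace: idle at 1.8 A, a short burst at 2.0 A, back to idle.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
currents = [1.8, 2.0, 2.0, 2.0, 1.8]

total = energy_joules(times, currents)
above_idle = energy_joules(times, currents, steady_state_a=1.8)
print(f"total energy: {total:.2f} J, energy above idle: {above_idle:.2f} J")
```

With an idle power of about 6 W, the energy above idle is a small fraction of the total, so an error in the steady state current shifts the result noticeably.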
Automation. Measurement (PC is master): trigger scope; save wave trace; GPIB. Processing (Matlab-script): steady state current determination; energy calculation pag. 32
Energy for increasing quality: Foreman CIF, 10 GOPs (161 frames); 32 different image quality settings; 20 identical runs pag. 33
Noise. Steady state current: after - before. Impact of temperature -> add heat sink. Steady state current calculation pag. 34
Different sequences > 1J pag. 35
Measure per component? Components: CPU, WED, PCI, MS, AS, IDWT, MC, CC, AD, VGA, DDR. Log component commands and replay per component. Keep all (intermediate) data in DDR: 256 MiB -> 5 GOPs pag. 36
Per component: 10 -> 5 GOPs, Energy = ~ 1/2 pag. 37
Wavelet entropy decoder. Plot: E (J) versus PSNR (dB); x 10 pag. 38
Inverse wavelet transform: # calculations = constant! Plot: E (J) versus PSNR (dB); x 1.5 pag. 39
Whole = sum of components? VGA-component Interaction pag. 40
2 variants of IDWT. RC-IDWT: made manually. LB-IDWT: generated semi-automatically. Energy of the total decoder, 10 GOPs (= 161 frames) pag. 41
2 variants of IDWT. RC-IDWT: T = 40 s, Pmean = 0.635 W, E = 25.4 J. LB-IDWT: T = 10 s, Pmean = 1.16 W, E = 11.6 J pag. 42
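The energies on this slide are the product of mean power and run time, E = Pmean x T. A short check of both rows, using only the slide's numbers (the dictionary layout is illustrative):

```python
# (run time in s, mean power in W) as quoted on the slide.
variants = {
    "RC-IDWT": (40.0, 0.635),
    "LB-IDWT": (10.0, 1.16),
}

# E = Pmean * T for each variant.
energies = {name: t * p for name, (t, p) in variants.items()}
ratio = energies["LB-IDWT"] / energies["RC-IDWT"]

for name, e in energies.items():
    print(f"{name}: {e:.1f} J")
print(f"LB-IDWT uses {ratio:.0%} of the RC-IDWT energy")
```

The line-based variant draws a higher mean power but finishes four times sooner, so it ends up using less than half the energy.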
Future work Resolution and temporal scalability Try different approach for measurement per component Measure temperature of FPGA (MAX1619) Predict energy consumption Steady state current? pag. 43
Conclusions Energy measurement feasible Sufficient accuracy: not trivial Scalability has significant impact on energy consumption External memory has large impact pag. 44
References From loop transformation to hardware generation, H. Devos et al. ProRISC 06, Veldhoven, The Netherlands. Finding and applying loop transformations for optimized FPGA implementations, H. Devos et al. Transactions on HiPEAC, to appear. pag. 45
Reconfigurable computing. Spectrum: CPU, DSP, VLIW, FPGA, ASIC. Axes: flexibility, efficiency, development effort pag. 47
Infrastructure: SOPC-builder Board PCI DMA PCIcore DDR DDR-core WED IDWT... Avalon switch fabric FPGA Custom components SOPC-builder (Quartus, Altera): Automatic generation of Avalon switch fabric pag. 48
Plot: frame rate (frames/s) versus available bandwidth to external memory (MB/s). Regions: bandwidth limited, calculation limited pag. 49
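The two regions of this plot can be read as a simple model: the frame rate grows with the available bandwidth until the computation itself becomes the limit. The sketch below is an assumed model, not taken from the slides, and every number in it (bytes moved per frame, calculation limit) is hypothetical.

```python
def frame_rate(bandwidth_bytes_per_s, bytes_per_frame, calc_limit_fps):
    """Assumed model: the slower of the memory-bound and compute-bound rates."""
    bandwidth_limit_fps = bandwidth_bytes_per_s / bytes_per_frame
    return min(calc_limit_fps, bandwidth_limit_fps)


# Hypothetical design: 1 MB of external memory traffic per frame,
# able to compute at most 80 frames/s.
BYTES_PER_FRAME = 1_000_000
CALC_LIMIT_FPS = 80.0

low = frame_rate(40e6, BYTES_PER_FRAME, CALC_LIMIT_FPS)    # bandwidth limited
high = frame_rate(200e6, BYTES_PER_FRAME, CALC_LIMIT_FPS)  # calculation limited
print(low, high)
```

In this model, reducing the traffic per frame moves the knee of the curve to a lower bandwidth.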
Plot: frame rate (frames/s) versus available bandwidth to external memory (MB/s). Curve labels: Row-column, untransformed, Line-based, Stripe-based. Legend: Manual, CLooGVHDL + Manual opt., CLooGVHDL, Impulse C pag. 50
The Polyhedral Model SCoP: Static Control Part Part of program with data independent control flow Typically set of nested loops (hot code) Loop bounds are linear expressions of parameters and other iterators pag. 51
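To make the SCoP definition concrete, here is a small illustrative pair of loops (hypothetical, not taken from the RESUME code). The first has data-independent control flow and bounds that are linear in a parameter and an outer iterator; the second does not qualify because its exit depends on the data.

```python
def triangular_scop(n):
    """A static control part: bounds are linear in the parameter n and iterator i."""
    visited = []
    for i in range(n):           # 0 <= i < n
        for j in range(i, n):    # i <= j < n, bound depends linearly on i and n
            visited.append((i, j))
    return visited


def not_a_scop(data):
    """Counter-example: the loop exit depends on the values in data."""
    count = 0
    for x in data:
        if x < 0:                # data-dependent control flow
            break
        count += 1
    return count


print(len(triangular_scop(4)))
print(not_a_scop([1, 2, -1, 3]))
```

For the first function the set of visited (i, j) pairs is known from n alone, which is what lets a tool reason about and reorder the iterations.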
2D-IDWT: Memory bottleneck. Memory hierarchy: large but slow external memory; fast but smaller (parallel) on-chip memory. External memory is often the bottleneck. Minimize accesses to external memory. Increase reuse of data stored in on-chip buffers pag. 52
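The effect of on-chip reuse can be illustrated with a toy model that counts external memory accesses. This is only a sketch: it uses a 2-tap sum in each direction instead of a wavelet filter, the class and function names are hypothetical, and the counts are not meant to reproduce the data-flow figures of the actual designs.

```python
class ExternalMemory:
    """Counts every element read from or written to a simulated external memory."""

    def __init__(self):
        self.accesses = 0

    def read(self, array, r, c):
        self.accesses += 1
        return array[r][c]

    def write(self, array, r, c, value):
        self.accesses += 1
        array[r][c] = value


def two_pass(image, mem):
    """Horizontal pass, then vertical pass, with the intermediate in external memory."""
    rows, cols = len(image), len(image[0])
    tmp = [[0] * cols for _ in range(rows)]
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        left = 0
        for c in range(cols):
            x = mem.read(image, r, c)
            mem.write(tmp, r, c, x + left)
            left = x
    for c in range(cols):
        above = 0
        for r in range(rows):
            t = mem.read(tmp, r, c)
            mem.write(out, r, c, t + above)
            above = t
    return out


def line_based(image, mem):
    """Same result, but the previous intermediate row stays in an on-chip line buffer."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    line_buffer = [0] * cols     # on-chip: not counted as external access
    for r in range(rows):
        left = 0
        for c in range(cols):
            x = mem.read(image, r, c)
            t = x + left
            left = x
            mem.write(out, r, c, t + line_buffer[c])
            line_buffer[c] = t
    return out


image = [[r * 6 + c for c in range(6)] for r in range(4)]
mem_a, mem_b = ExternalMemory(), ExternalMemory()
result_a = two_pass(image, mem_a)
result_b = line_based(image, mem_b)
print(mem_a.accesses, mem_b.accesses, result_a == result_b)
```

In this toy model, fusing the two passes halves the external traffic at the cost of one on-chip buffer of a single row.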
2D-IDWT: Problem. Memory bottleneck. Design automation needed: manual design process = slow, error-prone, ...; lots of designs to be made; reconfigurable HW (QoS); different platforms pag. 53
2D-IDWT. Memory bottleneck -> Loop transformations: SW-techniques can be reused for HW; polyhedral model eases transformations. Design automation: CLooGVHDL: hardware generation from the polyhedral model pag. 54