Design and Tradeoff Analysis of JPEG-2000 on Hardware-Reconfigurable Systems


Design and Tradeoff Analysis of JPEG-2000 on Hardware-Reconfigurable Systems
Ryan DeVille, Vikas Aggarwal, Ian Troxel, and Alan D. George
High-performance Computing and Simulation (HCS) Research Laboratory
Department of Electrical and Computer Engineering
University of Florida

Introduction
- JPEG-2000 Encoding
  - State-of-the-art low bit-rate compression algorithm
  - Progressive transmission by quality, resolution, component, or spatial locality
  - Spatially random access to bitstream
  - Region of interest coding
- Motivation for porting JPEG-2000 to RC systems
  - High-performance and low-cost solution is attractive for airborne and satellite imaging systems
  - Speedup readily available with fine-grain and coarse-grain parallelism opportunities
[Figure: encoder block diagram labeled "EBCOT Algorithm" with stages Multicomponent Transform, Discrete Wavelet Transform, Quantization, Tier-1 Encoding (compression), Tier-2 Encoding (packetization)]

Related Research
- EBCOT encoder designs
  - Group of Column optimization method
  - Previous RC designs
    - Space systems prototype [5]
    - Scalable Entropy Encoder [6]
    - Dual Processing Elements Architecture [7]
- 2D Discrete Wavelet Transform designs
  - Several mimic early VLSI designs [8, 9]
  - Multiple architecture design classifications [10]
    - Direct: 1D, transpose, perform another 1D
      - Intrinsically slow
    - Separate serial and parallel filters, or parallel row, parallel column filters
      - Processes along rows and columns
      - Represents significant performance improvement
    - Symmetrically extended
      - Improves processing efficiency, especially towards center of image

JPEG-2000 Encoder Design & Development
- Software code profiling first used to determine effort distribution
  - Previous research efforts show that DWT and Tier-1 encoding consume 80-85% of execution time
  - Current profiling results with JasPer and OpenJPEG show that >90% of execution time is spent in DWT and Tier-1
  - Benchmark images selected from Kodak Lossless True Color Image Suite, JasPer benchmark images, and standard image processing images (lena, etc.)
[Figure: JasPer Execution Time Profile. Stacked bars, 0% to 100% of execution time, one per benchmark image (water.pnm, lena.ras, baboon.ras, kodim23.ras, kodim22.ras, kodim21.ras, kodim16.ras, kodim11.ras, kodim10.ras, kodim06.ras, camera.ras, peppers.ras); series: MCT, FWT, QUANT, TIER1, TIER2]

Discrete Wavelet Transform (DWT)
- Features
  - Second-most computationally intensive block in compression process
  - Transforms each component tile's data into coefficients
  - Reversible transform involves only integer operations
  - Represents high- and low-frequency components of image
  - Amenable to compression: results in better compression ratios
  - Recursive application yields frequency bands at multiple resolutions
- Operation
  - 2D transform achieved by successively applying 1D transform in X and Y directions
  - Each 1D transform consists of
    - Filtering step
    - De-interleave step: reorganizing of data in bands
  - Available data and functional parallelism can be exploited
[Figure: multi-level subband layout with bands a3 LL, a3 HL, a3 LH, a3 HH, a2 HL, a2 LH, a2 HH, a1 HL, a1 LH, a1 HH]
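The filtering and de-interleave steps listed above can be sketched in software. The slides do not name the filter, so this sketch assumes the reversible 5/3 integer lifting filter that JPEG-2000 specifies for its lossless path, with whole-sample symmetric extension at the borders. The function names (fwd53_1d, inv53_1d, fwd53_2d) are hypothetical, and the code illustrates the arithmetic only, not the authors' hardware design.

```python
def _mirror(i, n):
    """Map an out-of-range index into [0, n) by whole-sample symmetric extension."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i = abs(i) % period
    return period - i if i >= n else i


def _high_at(d, k, n):
    """High-pass sample k, mirrored at the borders of a length-n signal."""
    return d[(_mirror(2 * k + 1, n) - 1) // 2]


def fwd53_1d(x):
    """Filter, then de-interleave: return the (low, high) integer bands."""
    n = len(x)
    if n < 2:
        return list(x), []
    # Predict step: each odd sample minus the floor-average of its even neighbours.
    d = [x[2 * k + 1] - ((x[2 * k] + x[_mirror(2 * k + 2, n)]) >> 1)
         for k in range(n // 2)]
    # Update step: each even sample plus a rounded quarter of the neighbouring highs.
    s = [x[2 * k] + ((_high_at(d, k - 1, n) + _high_at(d, k, n) + 2) >> 2)
         for k in range((n + 1) // 2)]
    return s, d


def inv53_1d(s, d):
    """Undo fwd53_1d exactly (integer arithmetic, so no rounding loss)."""
    n = len(s) + len(d)
    if n < 2:
        return list(s)
    x = [0] * n
    for k in range(len(s)):  # undo the update step on even positions
        x[2 * k] = s[k] - ((_high_at(d, k - 1, n) + _high_at(d, k, n) + 2) >> 2)
    for k in range(len(d)):  # undo the predict step on odd positions
        x[2 * k + 1] = d[k] + ((x[2 * k] + x[_mirror(2 * k + 2, n)]) >> 1)
    return x


def fwd53_2d(tile):
    """One decomposition level: 1D transform down columns, then along rows."""
    rows, cols = len(tile), len(tile[0])
    tmp = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        low, high = fwd53_1d([tile[r][c] for r in range(rows)])
        for r, v in enumerate(low + high):  # low band on top, high band below
            tmp[r][c] = v
    out = []
    for row in tmp:
        low, high = fwd53_1d(row)
        out.append(low + high)  # low band on the left, high band on the right
    return out
```

For example, fwd53_1d([1, 2, 3, 4]) gives the low band [1, 3] and the high band [0, 1], and inv53_1d recovers the input exactly. Applying fwd53_2d again to the top-left quadrant of its output would give the next resolution level.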

DWT Hardware Architecture
- Challenges presented by DWT
  - Parallel processing limited by memory bandwidth requirements
  - Some sequential nature in processing involved
- Design features
  - Data-level parallelism exploited by operating on multiple tiles
  - Function-level parallelism exploited by pipelining different processing steps
  - Data reuse eliminates extra read cycles
- Internal architecture
  - Each tile is stored entirely in a single Block RAM to minimize data movement
  - Overlapped processing to further reduce latency
[Figure: DWT datapath block diagram with blocks Input Buffer, Tile Data (Even Coeff, Odd Coeff), DWT Column, Temp Buffer, Deinterleave Column, Temp Buffer, DWT Row, Temp Buffer, Deinterleave Row, Output Buffer]

Embedded Block Coding with Optimized Truncation (EBCOT): Tier-1
- Features
  - Specially adapted arithmetic coder
  - Four bit-plane coding primitives
  - Three coding passes for each bit-plane (except the most significant)
- Operation
  - Coding passes: the cleanup pass (CUP) begins at the most significant bit-plane
  - Iteratively perform coding passes over remaining bit-planes
  - Coding-pass-generated context and bit data are serially encoded and compressed by the arithmetic encoder
  - Flush and reset arithmetic coder at completion
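The pass ordering described above can be made concrete with a small sketch. It splits coefficient magnitudes into bit-planes from the most significant down, and lists the coding passes in order. It assumes the JPEG-2000 ordering within a plane (significance propagation, magnitude refinement, cleanup). The names bit_planes and pass_schedule are hypothetical, and context formation and the arithmetic coder are left out.

```python
def bit_planes(magnitudes):
    """Return [(plane_index, bits)] ordered from the most significant plane down."""
    top = max(magnitudes, default=0).bit_length()
    return [(p, [(m >> p) & 1 for m in magnitudes])
            for p in range(top - 1, -1, -1)]


def pass_schedule(num_planes):
    """List (plane_index, pass_name) in coding order.

    The most significant plane gets only a cleanup pass; every lower
    plane gets all three passes, so the total is 3 * num_planes - 2.
    """
    schedule = []
    for p in range(num_planes - 1, -1, -1):
        if p != num_planes - 1:
            schedule.append((p, "significance propagation"))
            schedule.append((p, "magnitude refinement"))
        schedule.append((p, "cleanup"))
    return schedule
```

With four magnitudes [5, 0, 3, 12] there are four bit-planes, so pass_schedule(4) holds ten passes, the first being the cleanup pass on plane 3.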

Tier-1 Encoding Hardware Architecture
- Challenges presented by Tier-1 encoding
  - Serial process: creation of current MQ context data directly depends upon previous pass results
  - Bursty communication: contextual data from a pass comes in short, semi-continuous bursts
  - Large amounts of data and flags must be stored through multiple iterations of the algorithm, requiring high memory bandwidth
- Internal architecture (high-level)
  - Retrieve current stripe from memory for processing
  - Data is operated on in a pipelined fashion through registers
  - Context and data information sent to queues
  - Serializing agent: arithmetic entropy encoder
    - MQ Input Controller regulates input to arithmetic entropy encoder, ensuring correct operation
  - Data from arithmetic entropy encoder is written to a separate, final buffer
- Design decision to use MQ encoder as serializing agent saves area and BlockRAM space without sacrificing too much performance
[Figure: Tier-1 block diagram with blocks Read buffer, Significance Propagation Pass, Magnitude Refinement Pass, Cleanup Pass, Write buffer]
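The stripe-at-a-time access pattern can be sketched as follows. The slide only says that the current stripe is retrieved from memory. The four-row stripe height and the column-by-column order inside a stripe are taken from the JPEG-2000 code-block scan, and the name stripe_scan is hypothetical.

```python
def stripe_scan(height, width, stripe_height=4):
    """Return (row, col) positions of a code-block in stripe scan order.

    Stripes are visited top to bottom; within a stripe, columns are
    visited left to right and each column is read top to bottom.
    """
    order = []
    for top in range(0, height, stripe_height):
        bottom = min(top + stripe_height, height)  # last stripe may be short
        for col in range(width):
            for row in range(top, bottom):
                order.append((row, col))
    return order
```

For a 5 x 2 block, the first eight positions cover the full four-row stripe column by column, and the last two cover the one-row remainder.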

Target HPEC Platform
- High-Performance Embedded Computing: Nallatech BenNUEY w/ BenBLUE-II
  - Three FPGAs (all Xilinx Virtex2 6000, -4)
    - Single user FPGA on BenNUEY PCI board
    - Dual FPGAs on BenBLUE-II daughter card
  - Low bandwidth to system memory through 64-bit/66 MHz PCI bus connection
  - Large memory storage capability with 12 MB SRAM (166 MHz, ZBT)
- Advantages/Disadvantages
  - High configuration time (PCI bus + chained JTAG interface)
  - Large memory storage helps alleviate strain on PCI bus
  - Very good IO interface support with proprietary tools (159 IO, user-defined clk)
[Figure: platform block diagram. FPGAs: PCI FPGA (Xilinx Spartan2), BenNUEY User FPGA (Xilinx Virtex2 6000, -4), BenBLUE-II Primary FPGA (Xilinx Virtex2 6000, -4), BenBLUE-II Secondary FPGA (Xilinx Virtex2 6000, -4). Memories: ZBT SSRAM (2 MB, 2 MB, 4 MB, 4 MB). Buses: PCI COMMS bus (32-bit data, 40 MHz), Local Bus (64-bit data, 66 MHz), Inter-FPGA communications bus; link widths labeled 32, 32, 64, 64]
* Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.
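As a rough guide to why the interconnect is described as low-bandwidth, the peak rate of a parallel bus follows from its width and clock. The sketch below assumes one transfer per clock cycle and ignores arbitration, burst setup, and protocol overhead, so sustained DMA rates would be lower. The function name is hypothetical and the figures are upper bounds, not measurements from the slides.

```python
def peak_bandwidth_mb_s(width_bits, clock_mhz):
    """Upper bound in MB/s, assuming one transfer of width_bits per clock cycle."""
    return width_bits / 8 * clock_mhz


# Bus parameters as listed on the slide.
local_bus = peak_bandwidth_mb_s(64, 66)      # 64-bit data, 66 MHz
pci_comms_bus = peak_bandwidth_mb_s(32, 40)  # 32-bit data, 40 MHz
```

Under these assumptions the 64-bit, 66 MHz bus tops out at 528 MB/s and the 32-bit, 40 MHz PCI COMMS bus at 160 MB/s.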

DWT Single FPGA Results

Results for single DWT module design for BenNUEY board operating at 80 MHz (times in μs)

                                  One tile    Eight tiles
  DMA write time                      127         1001
  DMA read time                        80          573
  Computation time (part 1)            52           56
  Computation time (part 2)            48          404
  Total time for FPGA solution        307         2034
  Time for software solution          130         1043

Results for eight DWT modules design for BenNUEY board operating at 40 MHz (times in μs)

                                  Eight tiles    Forty tiles
  DMA write time                      758           3750
  DMA read time                       382           1900
  Computation time (part 1)            80             80
  Computation time (part 2)            82            424
  Total time for FPGA solution       1302           6154
  Time for software solution         1043           5219

Note: software solution comes from execution on server with 2.4 GHz Xeon CPU

[Figure: Performance Comparison, single module. Exec. time (μs) vs. tiles processed (1, 8); series: FPGA Solution (without DMA), FPGA Solution (with DMA), Software Solution]
[Figure: Performance Comparison, eight modules. Exec. time (μs) vs. tiles processed (8, 40); series: FPGA Solution (without DMA), FPGA Solution (with DMA), Software Solution]

Resource Utilization on Virtex2 6000-4

  # of Modules      Slices         BRAMs
  Single Module     1157 (3%)      6 (4%)
  Eight Modules     5742 (17%)     48 (33%)
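To see how much of each FPGA run goes to data movement, the reported DWT times can be tabulated and the DMA share computed. The numbers below are copied from the slide. The dictionary layout and the name dma_share are hypothetical.

```python
# Times in microseconds, copied from the DWT results slide.
DWT_TIMES_US = {
    "1 module, 1 tile":    {"dma_write": 127,  "dma_read": 80,   "compute": (52, 48),  "total": 307,  "software": 130},
    "1 module, 8 tiles":   {"dma_write": 1001, "dma_read": 573,  "compute": (56, 404), "total": 2034, "software": 1043},
    "8 modules, 8 tiles":  {"dma_write": 758,  "dma_read": 382,  "compute": (80, 82),  "total": 1302, "software": 1043},
    "8 modules, 40 tiles": {"dma_write": 3750, "dma_read": 1900, "compute": (80, 424), "total": 6154, "software": 5219},
}


def dma_share(row):
    """Fraction of the FPGA total that is DMA write plus DMA read time."""
    dma = row["dma_write"] + row["dma_read"]
    return dma / (dma + sum(row["compute"]))


shares = {name: round(dma_share(row), 2) for name, row in DWT_TIMES_US.items()}
```

On these figures DMA accounts for roughly two thirds of the single-tile run and over 90% of the forty-tile run, which is why the FPGA total exceeds the software time even though the computation alone is faster.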

Tier-1 Encoding Current Results

Results for Tier-1 module design for BenNUEY board operating at 90 MHz (times in μs)

                        Single-module design,    Eight-module design,
                        one codeblock            one codeblock each
  DMA Write Time                 70                     218
  DMA Read Time                  49                     388
  Computation Time              175                     175
  Total Time                    294                     781
  Software Time                 276                    2189

Note: software solution comes from execution on server with 2.4 GHz Xeon processor

  # of modules     Slices           BlockRAMs
  Single           3,527 (10%)      7 (5%)
  Eight            25,556 (75%)     56 (38%)

Profiling shows performance projections with DMA transfer times included.
[Figure: projected execution time profile. Stacked bars, 0% to 100%, one per benchmark image (peppers.ras, camera.ras, kodim06.ras, kodim10.ras, kodim11.ras, kodim16.ras, kodim21.ras, kodim22.ras, kodim23.ras, baboon.ras, lena.ras, water.pnm); series: MCT, FWT, QUANT, TIER1, TIER2]
* Results synthesized with Synplify Pro 7.7.1, PAR with Xilinx ISE 6.3
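The Tier-1 numbers can be turned into speedups over software, with and without the DMA transfers counted. The times are copied from the slide. The names TIER1_US and speedups are hypothetical.

```python
# Times in microseconds, copied from the Tier-1 results slide.
TIER1_US = {
    "single module, one codeblock":      {"dma_write": 70,  "dma_read": 49,  "compute": 175, "software": 276},
    "eight modules, one codeblock each": {"dma_write": 218, "dma_read": 388, "compute": 175, "software": 2189},
}


def speedups(row):
    """Return (speedup including DMA, speedup on computation time alone)."""
    total = row["dma_write"] + row["dma_read"] + row["compute"]
    return row["software"] / total, row["software"] / row["compute"]


for name, row in TIER1_US.items():
    with_dma, compute_only = speedups(row)
    print(f"{name}: {with_dma:.2f}x with DMA, {compute_only:.2f}x compute only")
```

On these figures the eight-module design is about 12.5 times faster than software on computation alone, but only about 2.8 times faster once the DMA transfers are included. The single-module design falls slightly below software speed when DMA is counted.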

Conclusions from HPEC Platform
- Multi-chip system offers resources for increased parallelism or a multi-component application
- Order of magnitude improvement in total computation time
  - Faster computation times on FPGA
  - But communication overhead severely hinders performance improvement
- Low-bandwidth PCI interconnect not amenable to designs with challenging memory demands

Target HPC Platform
- High-Performance Computing: SGI Altix 350 with FPGA Brick
  - Single FPGA: Virtex2 6000 (-6 speed grade)
    - Approximately 33% of chip used for SGI's RASC system layer
    - Two algorithm clock speeds: 200 MHz and 100 MHz
  - High bandwidth to system memory through proprietary NUMAlink interconnect (12.8 GB/s) through Scalable System Port (6.4 GB/s)
  - 3 banks of QDR SRAM (6 MB each) with a full bandwidth of 9.6 GB/s (1.6 GB/s for each read and write)
- Advantages/Disadvantages
  - Extremely low reconfiguration time
  - High memory bandwidth greatly helps memory-intensive apps, such as JPEG-2K
[Figure: block diagram with SGI Altix w/ RASC extension and two 2 MB QDR SRAM blocks]
* Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.

Performance Projections
- Profile shows projections for no-latency, infinite-bandwidth interconnect
- NUMAlink interconnect
  - Approximate order-of-magnitude improvement of transfers in similar designs
  - Mitigates communication overhead bottleneck
[Figure: projected execution time profile. Stacked bars, 0% to 100%, one per benchmark image (water.pnm, lena.ras, baboon.ras, kodim23.ras, kodim22.ras, kodim21.ras, kodim16.ras, kodim11.ras, kodim10.ras, kodim06.ras, camera.ras, peppers.ras); series: MCT, FWT, QUANT, TIER1, TIER2]

Lessons Learned and Conclusions
- Lessons Learned
  - HW/SW codesign
    - Shared-memory systems more amenable to closely-coupled processing associated with communication-sensitive RC applications
    - PCI boards for servers effective when tasks are offloaded for processing with minimal or masked communication
  - Memory bandwidth constrains parallelism in DWT design
  - Serializing agent (arithmetic coder) in Tier-1 design is key limit to performance improvement
- Conclusions
  - Identifying and accelerating key components yields better system performance (with a wary eye on Amdahl's Law)
  - Performance enhancements achieved mostly through functional parallelism due to sequential processing constraints
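The Amdahl's Law caveat can be illustrated with the profiling figure reported earlier in the deck, where more than 90% of software execution time is spent in DWT and Tier-1. The sketch below takes 0.9 as the accelerated fraction, which is a lower bound rather than a measured value. The component speedups passed in are illustrative assumptions, and the function name is hypothetical.

```python
def amdahl_speedup(accelerated_fraction, component_speedup):
    """Overall speedup when only part of the run time is accelerated."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / component_speedup)


# Illustrative values only: 90% of the run accelerated by 10x, then by an
# effectively unlimited factor to show the ceiling.
tenfold = amdahl_speedup(0.9, 10)
ceiling = amdahl_speedup(0.9, 1e12)
```

With 90% of the run accelerated tenfold, the whole encoder speeds up by only about 5.3 times, and no amount of acceleration of that 90% pushes the total past 10 times. This is why the remaining stages (MCT, quantization, Tier-2) and the transfer overhead matter.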

Future Work and Acknowledgments
- Future Work
  - Full system implementation on SGI Altix with RASC
    - Region of Interest capability
    - Lossy encoding and rate capability
    - MCT and Tier-2 encoding on FPGA as well
  - Single FPGA JPEG-2000 encoding application
- Acknowledgments
  - We wish to thank the following vendors for equipment and/or tools in support of this research:
    - SGI
    - Nallatech
    - Xilinx
    - Aldec
  - Special thanks to SGI Digital Media group and SGI RASC engineers for their help and suggestions

References
[1] Adams, M.D. and Ward, R.K., JasPer: a portable flexible open-source software tool kit for image coding/process, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 04), pp. 241-244, May 2004.
[2] OpenJPEG. http://www.opegjpeg.org/
[3] Liu, L., Li, D., Li, Z., Wang, Z. and Chen, H., A VLSI architecture of EBCOT encoder for JPEG2000, in 5th International Conference on ASIC, pp. 882-885, Oct. 2003.
[4] Chen, K., Lian, C., Chen, H., and Chen, L., Analysis and architecture design of EBCOT for JPEG-2000, in IEEE International Symposium on Circuits and Systems, vol. 2, pp. 765-768, May 2001.
[5] Van Buren, D., A high-rate JPEG2000 compression system for space, in IEEE Aerospace Conference, March 2005.
[6] Aouadi, I., and Hammami, O., Analysis and hardware design of a scalable dual JPEG-2000 entropy coder, in Euromicro Symposium on Digital System Design (DSD 2004), pp. 227-233, Sept. 2004.
[7] Gangadhar, M. and Bhatia, D., FPGA based EBCOT architecture for JPEG 2000, in IEEE International Conference on Field-Programmable Technology (FPT 03), pp. 228-233, Dec. 2003.
[8] Hung, K., Huang, Y., Truong, T., and Wang, C., FPGA implementation for 2D discrete wavelet transform, in Electronics Letters, pp. 639-640, April 1998.
[9] Lakshminarayanan, G., Venkataramani, B., Senthil Kumar, J., Yousuf, A.K. and Sriram, G., Design and FPGA implementation of image block encoders with 2D-DWT, in Conference on Convergent Technologies for Asia-Pacific Region (TENCON 2003), pp. 1015-1019, Oct. 2003.
[10] McCanny, P., Masud, S., and McCanny, J., Design and implementation of the symmetrically extended 2-D wavelet transform, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 02), vol. 3, pp. 3108-3111, May 2002.
[11] Taubman, D., High performance scalable image compression with EBCOT, in IEEE Trans. Image Processing, vol. 9, pp. 1158-1170, July 2000.
[12] Richardson, I.E.G., Video Codec Design: Developing Image and Video Compression Systems. Chichester, West Sussex, New York: John Wiley and Sons, Ltd (UK), 2002.
[13] Acharya, T. and Tsai, P.-S., JPEG 2000 Standard for Image Compression: Concepts, Algorithms, and VLSI Architectures. Hoboken, New Jersey: John Wiley and Sons, Inc., 2005.