Design and Tradeoff Analysis of JPEG-2000 on Hardware-Reconfigurable Systems

Ryan DeVille, Vikas Aggarwal, Ian Troxel, and Alan D. George
High-performance Computing and Simulation (HCS) Research Laboratory
Department of Electrical and Computer Engineering
University of Florida

Introduction
[Diagram labels: EBCOT Algorithm; Multicomponent Transform; Discrete Wavelet Transform; Quantization; Tier-1 Encoding (compression); Tier-2 Encoding (packetization).]
- JPEG-2000 Encoding
  - State-of-the-art low bit-rate compression algorithm
  - Progressive transmission by quality, resolution, component, or spatial locality
  - Spatially random access to bitstream
  - Region of interest coding
- Motivation for porting JPEG-2000 to reconfigurable computing (RC) systems
  - High-performance and low-cost solution is attractive for airborne and satellite imaging systems
  - Speedup readily available with fine-grain and coarse-grain parallelism opportunities

Related Research
- EBCOT
  - Encoder designs
    - Group of Column optimization method
  - Previous RC designs
    - Space systems prototype [5]
    - Scalable Entropy Encoder [6]
    - Dual Processing Elements Architecture [7]
- 2D Discrete Wavelet Transform designs
  - Several mimic early VLSI designs [8, 9]
  - Multiple architecture design classifications [10]
    - Direct: 1D, transpose, perform another 1D
      - Intrinsically slow
    - Separate serial and parallel filters, or parallel row and parallel column filters
      - Processes along rows and columns
      - Represents significant performance improvement
    - Symmetrically extended
      - Improves processing efficiency, especially towards center of image

JPEG-2000 Encoder Design & Development
- Software code profiling first used to determine effort distribution
  - Previous research efforts show that DWT and Tier-1 encoding consume 80-85% of execution time
  - Current profiling results with JasPer and OpenJPEG show that >90% of execution time is spent in DWT and Tier-1
- Benchmark images selected from Kodak Lossless True Color Image Suite, JasPer benchmark images, and standard image processing images (lena, etc.)
[Figure: JasPer Execution Time Profile. Share of execution time (0% to 100%) per benchmark image (water.pnm, lena.ras, baboon.ras, kodim23.ras, kodim22.ras, kodim21.ras, kodim16.ras, kodim11.ras, kodim10.ras, kodim06.ras, camera.ras, peppers.ras), broken down into MCT, FWT, QUANT, TIER1, and TIER2.]

Discrete Wavelet Transform (DWT)
- Features
  - Second-most computationally intensive block in compression process
  - Transforms each component's tile data into coefficients
  - Reversible transform involves all-integer operations
  - Represents high- and low-frequency components of image
    - Amenable to compression; results in better compression ratios
  - Recursive application yields frequency bands at multiple resolutions
- Operation
  - 2D transform achieved by successively applying 1D transform in X and Y directions
  - Each 1D transform consists of
    - Filtering step
    - De-interleave step: reorganizing data into bands
  - Available data and functional parallelism can be exploited
[Figure: three-level sub-band decomposition with bands a3LL, a3HL, a3LH, a3HH, a2HL, a2LH, a2HH, a1HL, a1LH, a1HH.]
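The filtering and de-interleave steps can be illustrated with a small software sketch. It assumes the reversible 5/3 lifting filter that JPEG-2000 uses for its integer transform and mirrors samples at the tile edges; the function names are hypothetical, the column-then-row order follows the hardware diagram labels, and nothing here is the authors' FPGA design.

```python
def fwd_53_1d(x):
    """One level of reversible 5/3 lifting on a list of ints.

    Returns (low, high), i.e. the filtered signal already de-interleaved.
    """
    n = len(x)
    y = list(x)
    if n < 2:
        return y, []
    # Filtering, predict step: each odd sample becomes a high-pass coefficient.
    for i in range(1, n, 2):
        right = y[i + 1] if i + 1 < n else y[i - 1]  # mirror at the edge
        y[i] -= (y[i - 1] + right) >> 1
    # Filtering, update step: each even sample becomes a low-pass coefficient.
    for i in range(0, n, 2):
        left = y[i - 1] if i > 0 else y[i + 1]
        right = y[i + 1] if i + 1 < n else y[i - 1]
        y[i] += (left + right + 2) >> 2
    # De-interleave step: gather even and odd positions into separate bands.
    return y[0::2], y[1::2]


def inv_53_1d(low, high):
    """Undo fwd_53_1d exactly (integer-only, so the round trip is lossless)."""
    n = len(low) + len(high)
    y = [0] * n
    y[0::2] = low
    y[1::2] = high
    if n < 2:
        return y
    for i in range(0, n, 2):  # undo update
        left = y[i - 1] if i > 0 else y[i + 1]
        right = y[i + 1] if i + 1 < n else y[i - 1]
        y[i] -= (left + right + 2) >> 2
    for i in range(1, n, 2):  # undo predict
        right = y[i + 1] if i + 1 < n else y[i - 1]
        y[i] += (y[i - 1] + right) >> 1
    return y


def fwd_53_2d(tile):
    """Column pass, then row pass, over a tile given as equal-length rows.

    Band naming is an assumption: the first letter is the horizontal filter.
    """
    cols = [fwd_53_1d(list(c)) for c in zip(*tile)]
    low_rows = [list(r) for r in zip(*(lo for lo, _ in cols))]
    high_rows = [list(r) for r in zip(*(hi for _, hi in cols))]
    bands = {"LL": [], "HL": [], "LH": [], "HH": []}
    for row in low_rows:
        lo, hi = fwd_53_1d(row)
        bands["LL"].append(lo)
        bands["HL"].append(hi)
    for row in high_rows:
        lo, hi = fwd_53_1d(row)
        bands["LH"].append(lo)
        bands["HH"].append(hi)
    return bands
```

Applying fwd_53_2d again to the returned LL band would give the next resolution level, which is the recursive application the slide refers to.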

DWT Hardware Architecture
- Challenges presented by DWT
  - Parallel processing limited by memory bandwidth requirements
  - Some sequential nature in processing involved
- Design features
  - Data-level parallelism exploited by operating on multiple tiles
  - Function-level parallelism exploited by pipelining different processing steps
  - Data reuse eliminates extra read cycles
- Internal architecture
  - Each tile is stored entirely in a single Block RAM to minimize data movement
  - Overlapped processing to further reduce latency
[Diagram labels: Input Buffer, Even Coeff, Odd Coeff, Tile Data, DWT Column, Temp Buffer, Deinterleave Column, Temp Buffer, DWT Row, Temp Buffer, Deinterleave Row, Output Buffer.]

Embedded Block Coding with Optimized Truncation (EBCOT): Tier-1
- Features
  - Specially adapted arithmetic coder
  - Four bit-plane coding primitives
  - Three coding passes for each bit-plane (except the most significant)
- Operation
  - Coding passes: cleanup pass (CUP) begins at most significant bit-plane
  - Iteratively perform coding passes over remaining bit-planes
  - Coding-pass-generated context and bit data serially encoded and compressed by arithmetic encoder
  - Flush and reset arithmetic coder at completion
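The pass ordering described above can be sketched as a schedule generator. This is only an illustration of the ordering (cleanup alone on the most significant bit-plane, then three passes per remaining plane); the function names are hypothetical and no context modeling or arithmetic coding is performed.

```python
def num_bit_planes(coeffs):
    """Magnitude bit-planes needed for the largest |coefficient| in a code-block."""
    return max((abs(c) for c in coeffs), default=0).bit_length()


def bit_plane(coeffs, p):
    """Magnitude bits of every coefficient at bit-plane p (0 = least significant)."""
    return [(abs(c) >> p) & 1 for c in coeffs]


def coding_pass_schedule(coeffs):
    """List of (bit_plane, pass_name) in the order Tier-1 visits them."""
    planes = num_bit_planes(coeffs)
    schedule = []
    for p in range(planes - 1, -1, -1):
        if p == planes - 1:
            # Most significant plane: only the cleanup pass runs.
            schedule.append((p, "cleanup"))
        else:
            schedule.append((p, "significance propagation"))
            schedule.append((p, "magnitude refinement"))
            schedule.append((p, "cleanup"))
    return schedule
```

For a code-block with K non-empty bit-planes this yields 3K - 2 passes, all of which feed the single arithmetic coder in order.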

Tier-1 Encoding Hardware Architecture
- Challenges presented by Tier-1 encoding
  - Serial process: creation of current MQ context data directly depends upon previous pass results
  - Bursty communication: contextual data from a pass arrives in short, semi-continuous bursts
  - Large amounts of data and flags must be stored through multiple iterations of algorithm, requiring high memory bandwidth
- Internal architecture (high-level)
  - Retrieve current stripe from memory for processing
  - Data is operated on in a pipelined fashion through registers
  - Context and data information sent to queues
  - Serializing agent: arithmetic entropy encoder
    - MQ Input Controller regulates input to arithmetic entropy encoder, ensuring correct operation
  - Data from arithmetic entropy encoder is written to a separate, final buffer
- Design decision to use MQ encoder as serializing agent saves area and BlockRAM space without sacrificing too much performance
[Diagram labels: Write buffer, Cleanup Pass, Magnitude Refinement Pass, Significance Propagation Pass, Read buffer.]
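A toy software model may help picture the serializing role of the MQ input stage. It assumes one queue per coding pass and a fixed drain order (significance propagation, magnitude refinement, cleanup); the class, its methods, and the integer context labels are hypothetical, and the slides do not specify the real controller's arbitration policy.

```python
from collections import deque

PASS_ORDER = ("significance_propagation", "magnitude_refinement", "cleanup")


class MQInputController:
    """Toy model: per-pass queues drained by a single consumer in fixed order."""

    def __init__(self):
        self.queues = {name: deque() for name in PASS_ORDER}

    def push(self, pass_name, context, bit):
        """A coding pass deposits one (context, bit) pair in its own queue."""
        self.queues[pass_name].append((context, bit))

    def drain(self):
        """Yield (pass_name, context, bit) one at a time, pass by pass.

        The single consumer is what serializes the otherwise bursty,
        concurrently produced pass outputs.
        """
        for name in PASS_ORDER:
            queue = self.queues[name]
            while queue:
                context, bit = queue.popleft()
                yield name, context, bit


ctrl = MQInputController()
ctrl.push("cleanup", 17, 1)                  # arrives first, encoded last
ctrl.push("significance_propagation", 3, 0)
ctrl.push("magnitude_refinement", 14, 1)
ctrl.push("significance_propagation", 5, 1)
stream = list(ctrl.drain())
```

Here stream comes out grouped by pass regardless of arrival order, which is the property a single arithmetic coder needs.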

Target HPEC Platform
- High-Performance Embedded Computing (HPEC): Nallatech BenNUEY w/ BenBLUE-II
  - Three FPGAs (all Xilinx Virtex2 6000, -4 speed grade)
    - Single user FPGA on BenNUEY PCI board
    - Dual FPGAs on BenBLUE-II daughter card
  - Low bandwidth to system memory through 64-bit/66 MHz PCI bus connection
  - Large memory storage capability with 12 MB SRAM (166 MHz, ZBT)
- Advantages/Disadvantages
  - High configuration time (PCI bus + chained JTAG interface)
  - Large memory storage helps alleviate strain on PCI bus
  - Very good I/O interface support with proprietary tools (159 I/O, user-defined clock)
[Diagram labels: PCI FPGA (Xilinx Spartan2); ZBT SSRAM (2 MB); PCI COMMS bus (32-bit data, 40 MHz); BenNUEY User FPGA (Xilinx Virtex2 6000, -4); ZBT SSRAM (2 MB); ZBT SSRAM (4 MB); BenBLUE-II Primary FPGA (Xilinx Virtex2 6000, -4); BenBLUE-II Secondary FPGA (Xilinx Virtex2 6000, -4); ZBT SSRAM (4 MB); bus widths 32, 32, 64, 64; Local Bus (64-bit data, 66 MHz); Inter-FPGA communications bus.]
* Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.
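For a sense of scale, the theoretical peak rates of the two buses named on this slide follow from width times clock. The sketch assumes one transfer per clock cycle and decimal megabytes; real DMA throughput will be lower, and the helper name is hypothetical.

```python
def peak_bandwidth_mb_s(width_bits, clock_mhz):
    """Theoretical peak in MB/s, assuming one full-width transfer per clock."""
    return width_bits / 8 * clock_mhz


# Figures taken from the slide: PCI COMMS bus and Local Bus.
pci_comms = peak_bandwidth_mb_s(32, 40)   # 32-bit data at 40 MHz
local_bus = peak_bandwidth_mb_s(64, 66)   # 64-bit data at 66 MHz

print(f"PCI COMMS bus peak: {pci_comms:.0f} MB/s")
print(f"Local Bus peak:     {local_bus:.0f} MB/s")
```

Both peaks are well under 1 GB/s, which is consistent with the slide's description of this path to system memory as low bandwidth.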

DWT Single FPGA Results
Single DWT module design, BenNUEY board operating at 80 MHz
                                 One tile (μs)   Eight tiles (μs)
  DMA write time                      127             1001
  DMA read time                        80              573
  Computation time (part 1)            52               56
  Computation time (part 2)            48              404
  Total time for FPGA solution        307             2034
  Time for software solution          130             1043
Eight DWT modules design, BenNUEY board operating at 40 MHz
                                 Eight tiles (μs)   Forty tiles (μs)
  DMA write time                      758             3750
  DMA read time                       382             1900
  Computation time (part 1)            80               80
  Computation time (part 2)            82              424
  Total time for FPGA solution       1302             6154
  Time for software solution         1043             5219
Note: software solution comes from execution on server with 2.4 GHz Xeon CPU
Resource utilization on Virtex2 6000-4
  # of modules     Slices         BRAMs
  Single module    1157 (3%)      6 (4%)
  Eight modules    5742 (17%)     48 (33%)
[Figure: Performance Comparison, single module. Exec. time (μs) vs. tiles processed (1, 8) for FPGA solution (without DMA), FPGA solution (with DMA), and software solution.]
[Figure: Performance Comparison, eight modules. Exec. time (μs) vs. tiles processed (8, 40) for FPGA solution (without DMA), FPGA solution (with DMA), and software solution.]
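The speedups implied by the DWT timing figures above can be worked out directly. The sketch below only re-derives ratios from the reported microsecond values; the dictionary layout and function name are hypothetical.

```python
# Reported DWT timings in microseconds, copied from the results tables.
DWT_RESULTS = {
    "1 module, 1 tile":    {"dma": 127 + 80,    "compute": 52 + 48,  "software": 130},
    "1 module, 8 tiles":   {"dma": 1001 + 573,  "compute": 56 + 404, "software": 1043},
    "8 modules, 8 tiles":  {"dma": 758 + 382,   "compute": 80 + 82,  "software": 1043},
    "8 modules, 40 tiles": {"dma": 3750 + 1900, "compute": 80 + 424, "software": 5219},
}


def speedups(result):
    """Return (speedup without DMA, speedup with DMA) over the software run."""
    without_dma = result["software"] / result["compute"]
    with_dma = result["software"] / (result["compute"] + result["dma"])
    return without_dma, with_dma


for name, result in DWT_RESULTS.items():
    no_dma, with_dma = speedups(result)
    print(f"{name:20s} compute-only {no_dma:5.2f}x   end-to-end {with_dma:4.2f}x")
```

Computation alone reaches roughly 10x for eight modules on forty tiles, while every end-to-end figure stays below 1x once DMA time is included.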

Tier-1 Encoding Current Results
Tier-1 module design, BenNUEY board operating at 90 MHz
                       Single module, one codeblock (μs)   Eight modules, one codeblock each (μs)
  DMA write time                     70                                 218
  DMA read time                      49                                 388
  Computation time                  175                                 175
  Total time                        294                                 781
  Software time                     276                                2189
Note: software solution comes from execution on server with 2.4 GHz Xeon processor
Resource utilization
  # of modules     Slices           BlockRAMs
  Single           3,527 (10%)      7 (5%)
  Eight            25,556 (75%)     56 (38%)
* Results synthesized with Synplify Pro 7.7.1, PAR with Xilinx ISE 6.3
[Figure: execution time profile (0% to 100%) per benchmark image (peppers.ras, camera.ras, kodim06.ras, kodim10.ras, kodim11.ras, kodim16.ras, kodim21.ras, kodim22.ras, kodim23.ras, baboon.ras, lena.ras, water.pnm), broken down into MCT, FWT, QUANT, TIER1, and TIER2. Profile shows performance projections with DMA transfer times included.]

Conclusions from HPEC Platform
- Multi-chip system offers resources for increased parallelism or a multi-component application
- Order of magnitude improvement in total computation time
  - Faster computation times on FPGA
  - But communication overhead severely hinders performance improvement
- Low-bandwidth PCI interconnect not amenable to designs with challenging memory demands

Target HPC Platform
- High-Performance Computing (HPC): SGI Altix 350 with FPGA Brick
  - Single FPGA: Virtex2 6000 (-6 speed grade)
    - Approximately 33% of chip used for SGI's RASC system layer
    - Two algorithm clock speeds: 200 MHz and 100 MHz
  - High bandwidth to system memory over proprietary NUMAlink interconnect (12.8 GB/s) via Scalable System Port (6.4 GB/s)
  - 3 banks of QDR SRAM (6 MB each) with a full bandwidth of 9.6 GB/s (1.6 GB/s for each read and write)
- Advantages/Disadvantages
  - Extremely low reconfiguration time
  - High memory bandwidth greatly helps memory-intensive apps, such as JPEG-2000
[Diagram labels: SGI Altix w/ RASC extension; 2 MB QDR SRAM; 2 MB QDR SRAM.]
* Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.

Performance Projections
[Figure: execution time profile (0% to 100%) per benchmark image (water.pnm, lena.ras, baboon.ras, kodim23.ras, kodim22.ras, kodim21.ras, kodim16.ras, kodim11.ras, kodim10.ras, kodim06.ras, camera.ras, peppers.ras), broken down into MCT, FWT, QUANT, TIER1, and TIER2. Profile shows projections for no-latency, infinite-bandwidth interconnect.]
- NUMAlink interconnect
  - Approximate order-of-magnitude improvement of transfers in similar designs
  - Mitigates communication overhead bottleneck
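As a rough back-of-the-envelope illustration of what an order-of-magnitude faster transfer would mean, the eight-module timings measured on the PCI platform can be rescaled. The factor of 10 and the assumption that computation time is unchanged are simplifications made here, not measurements; clock-rate differences on the Altix are ignored and the function name is hypothetical.

```python
TRANSFER_GAIN = 10.0  # assumed: "approximate order-of-magnitude" faster transfers


def projected_speedup(dma_us, compute_us, software_us, gain=TRANSFER_GAIN):
    """Software time over (compute + DMA/gain); compute time held constant."""
    return software_us / (compute_us + dma_us / gain)


# Eight-module timings (microseconds) reported for the PCI-based platform.
dwt_40_tiles = {"dma_us": 3750 + 1900, "compute_us": 80 + 424, "software_us": 5219}
tier1_8_blocks = {"dma_us": 218 + 388, "compute_us": 175, "software_us": 2189}

for label, timing in (("DWT, 40 tiles", dwt_40_tiles),
                      ("Tier-1, 8 codeblocks", tier1_8_blocks)):
    measured = projected_speedup(gain=1.0, **timing)
    projected = projected_speedup(**timing)
    print(f"{label:22s} measured {measured:4.2f}x   projected {projected:4.2f}x")
```

Under these assumptions both blocks move from roughly break-even or modest gains to several-fold speedups, which is the sense in which a faster interconnect mitigates the communication bottleneck.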

Lessons Learned and Conclusions
- Lessons Learned
  - HW/SW codesign
    - Shared-memory systems more amenable to closely-coupled processing associated with communication-sensitive RC applications
    - PCI boards for servers effective when tasks are offloaded for processing with minimal or masked communication
  - Memory bandwidth constrains parallelism in DWT design
  - Serializing agent (arithmetic coder) in Tier-1 design is key limit to performance improvement
- Conclusions
  - Identifying and accelerating key components yields better system performance (with a wary eye on Amdahl's Law)
  - Performance enhancements achieved mostly through functional parallelism due to sequential processing constraints
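The Amdahl's Law caveat can be made concrete with a short calculation. The 0.90 accelerated fraction is taken from the profiling slide (more than 90% of execution time in DWT and Tier-1); the 10x factor for those blocks is an illustrative assumption, not a reported system result.

```python
def amdahl_speedup(accelerated_fraction, block_speedup):
    """Overall speedup when only part of the run time is accelerated."""
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / block_speedup)


fraction = 0.90                               # DWT + Tier-1 share from profiling
overall_10x = amdahl_speedup(fraction, 10.0)  # assumed 10x on those blocks
ceiling = 1.0 / (1.0 - fraction)              # limit as block speedup grows

print(f"10x on 90% of the work gives {overall_10x:.2f}x overall")
print(f"upper bound with 90% accelerated: {ceiling:.1f}x")
```

Even a perfect accelerator for DWT and Tier-1 leaves the remaining stages (MCT, quantization, Tier-2) as the bound, which is one reading of why the slides list moving MCT and Tier-2 onto the FPGA as future work.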

Future Work and Acknowledgments
- Future Work
  - Full system implementation on SGI Altix with RASC
    - Region of Interest capability
    - Lossy encoding and rate capability
    - MCT and Tier-2 encoding on FPGA as well
  - Single FPGA JPEG-2000 encoding application
- Acknowledgments
  - We wish to thank the following vendors for equipment and/or tools in support of this research: SGI, Nallatech, Xilinx, Aldec
  - Special thanks to SGI Digital Media group and SGI RASC engineers for their help and suggestions

References
[1] Adams, M.D. and Ward, R.K., "JasPer: a portable flexible open-source software tool kit for image coding/processing," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 04), pp. 241-244, May 2004.
[2] OpenJPEG. http://www.opegjpeg.org/
[3] Liu, L., Li, D., Li, Z., Wang, Z., and Chen, H., "A VLSI architecture of EBCOT encoder for JPEG2000," in 5th International Conference on ASIC, pp. 882-885, Oct. 2003.
[4] Chen, K., Lian, C., Chen, H., and Chen, L., "Analysis and architecture design of EBCOT for JPEG-2000," in IEEE International Symposium on Circuits and Systems, vol. 2, pp. 765-768, May 2001.
[5] Van Buren, D., "A high-rate JPEG2000 compression system for space," in IEEE Aerospace Conference, March 2005.
[6] Aouadi, I. and Hammami, O., "Analysis and hardware design of a scalable dual JPEG-2000 entropy coder," in Euromicro Symposium on Digital System Design (DSD 2004), pp. 227-233, Sept. 2004.
[7] Gangadhar, M. and Bhatia, D., "FPGA based EBCOT architecture for JPEG 2000," in IEEE International Conference on Field-Programmable Technology (FPT 03), pp. 228-233, Dec. 2003.
[8] Hung, K., Huang, Y., Truong, T., and Wang, C., "FPGA implementation for 2D discrete wavelet transform," in Electronics Letters, pp. 639-640, April 1998.
[9] Lakshminarayanan, G., Venkataramani, B., Senthil Kumar, J., Yousuf, A.K., and Sriram, G., "Design and FPGA implementation of image block encoders with 2D-DWT," in Conference on Convergent Technologies for Asia-Pacific Region (TENCON 2003), pp. 1015-1019, Oct. 2003.
[10] McCanny, P., Masud, S., and McCanny, J., "Design and implementation of the symmetrically extended 2-D wavelet transform," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 02), vol. 3, pp. 3108-3111, May 2002.
[11] Taubman, D., "High performance scalable image compression with EBCOT," in IEEE Trans. Image Processing, vol. 9, pp. 1158-1170, July 2000.
[12] Richardson, I.E.G., Video Codec Design: Developing Image and Video Compression Systems. Chichester, West Sussex, New York: John Wiley and Sons, Ltd (UK), 2002.
[13] Acharya, T. and Tsai, P.-S., JPEG 2000 Standard for Image Compression: Concepts, Algorithms, and VLSI Architectures. Hoboken, New Jersey: John Wiley and Sons, Inc., 2005.