Design, Implementation and Evaluation of a Task-parallel JPEG Decoder for the Libjpeg-turbo Library

Jingun Hong (1), Wasuwee Sodsong (1), Seongwook Chung (1), Cheong Ghil Kim (2), Yeongkyu Lim (3), Shin-Dug Kim (1) and Bernd Burgstaller (1)
(1) Yonsei University, (2) Namseoul University, (3) LG Electronics Inc.

Abstract

In this paper, we propose a task-parallel programming extension for the JPEG decoder of the libjpeg-turbo library. Efficient JPEG decoding is especially important for resource-constrained mobile devices such as smartphones, where decoding (e.g., browsing web pages that contain images, image search, and so on) is far more common than image encoding. The aim of our work is to utilize multiple CPU cores for JPEG decompression from a single client thread. Our method is orthogonal to libjpeg-turbo's support for data-parallelism (SIMD). Experimental evaluation of our approach on a 4-core Intel i7-2600k CPU shows speed-ups of up to 2.5x over the sequential version, and up to 34% over the SIMD version of the libjpeg-turbo JPEG decoder.

Keywords: JPEG decoding, multicores, parallel programming patterns, fork/join task parallelism

1. Introduction

The JPEG format is a de-facto standard for digital images on the world-wide web. Six of the ten most popular websites [1], including Facebook, YouTube, Yahoo, Amazon and Twitter, use JPEG images. W3Techs' web statistics [2] report JPEG as the most popular image format for web sites, with a penetration of 72%. Virtually all digital cameras and smartphones employ JPEG compression, which results in large amounts of new user content on a day-to-day basis. For example, more than 250 million photos are uploaded to Facebook every day [3].

Decoding JPEG images is a computationally expensive operation: compressed images undergo Huffman decompression, de-quantization, inverse DCT, upsampling, color space conversion (e.g., from YCbCr to RGB) and optional color quantization and precision reduction.
Decoding a color image of resolution 3072x2304 takes 112 ms on an Intel i7, and at resolution 720x960 the same image takes 22 ms to decompress (measurements were conducted on an instrumented version of the Google Chrome web browser). JPEG decoding thus constitutes a substantial part of webpage rendering. To enhance the web user experience, efficient JPEG decoding should take advantage of multicore architectures.

Libjpeg [4], the JPEG reference implementation from the Independent JPEG Group, is strictly sequential. Libjpeg-turbo [5] is a re-implementation of libjpeg that utilizes MMX, SSE, SSE2, and NEON SIMD instructions on x86 and ARM platforms. Libjpeg-turbo has found widespread use, e.g., with the Chrome and Firefox web browsers, WebKit [6], and the Ubuntu, Fedora and openSUSE Linux distributions. In our experiments, libjpeg-turbo accelerates JPEG decoding by a factor of 2 on x86. Neither libjpeg nor libjpeg-turbo is capable of utilizing multiple CPU cores. Considerable computing power thus remains unused during JPEG decoding. To the best of our knowledge, this is the first approach to apply task parallelism with the libjpeg-turbo library, although thread-parallel approaches to image processing algorithms exist [7, 8].

To parallelize JPEG decoding on multicore architectures, we extend the libjpeg-turbo library with task-level parallelism; images are partitioned and decoded on the available cores of a shared-memory multicore computer.

Figure 1. JPEG decoder path

Footnote 1: Corresponding Author

2. Background and Related Work

JPEG image encoding consists of color space transformation, downsampling, block splitting, discrete cosine transform, quantization and entropy coding. After the raw image has been loaded into main memory, it is broken down into blocks of 8x8 pixels. A 2D discrete cosine transform (DCT) is applied to each 8x8 block to separate it into its frequency components. Since the human eye is less sensitive to higher-frequency information, the quantization step divides the DCT coefficients by the entries of a quantization table and rounds the results, so that most of the higher-frequency coefficients become 0. This is the stage at which lossy compression takes place: the high-frequency components are discarded. After zig-zag reordering, as a final step, Huffman encoding encodes the whole picture by assigning the shortest codes to the most frequently occurring symbols.

Decoding a JPEG image takes the above steps in reverse order, as depicted in Figure 1: entropy decoding, dequantization, inverse DCT, upsampling and color conversion. First, Huffman decoding is applied to revert the Huffman encoding.
The JPEG decoder then takes each DCT coefficient matrix and multiplies it entrywise with the quantization matrix (dequantization). Next, the inverse DCT reconstructs the 8x8 block of samples from the DCT coefficients. Finally, after color conversion from the YCbCr to the RGB color space, the image can be displayed.

A preliminary version of this work, excluding design, implementation and evaluation of the task-parallel JPEG decoder and excluding combined task/data-parallel JPEG decoding, appeared at the IST conference [9].
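As an illustration of the dequantization, inverse DCT and color conversion steps described in this section, the following minimal Python sketch processes a single 8x8 block. It evaluates the textbook inverse DCT formula directly and is not the optimized (SIMD) routine used by libjpeg-turbo. All function names (dequantize, idct_8x8, to_samples, ycbcr_to_rgb) are hypothetical, and the color conversion assumes the full-range YCbCr convention commonly used with JFIF files.

```python
import math

N = 8  # JPEG block size


def dequantize(coeffs, qtable):
    """Entrywise product of quantized DCT coefficients and the quantization table."""
    return [[coeffs[u][v] * qtable[u][v] for v in range(N)] for u in range(N)]


def idct_8x8(block):
    """Direct (unoptimized) 2D inverse DCT of one 8x8 coefficient block."""
    def c(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for y in range(N):
            s = 0.0
            for u in range(N):
                for v in range(N):
                    s += (c(u) * c(v) * block[u][v]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[x][y] = s / 4.0
    return out


def to_samples(block):
    """Undo the level shift (+128), round, and clamp to the 8-bit range."""
    return [[min(255, max(0, int(round(v + 128)))) for v in row] for row in block]


def ycbcr_to_rgb(y, cb, cr):
    """Convert one full-range YCbCr pixel to RGB (assumed JFIF-style constants)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)

    def clamp(v):
        return min(255, max(0, int(round(v))))

    return clamp(r), clamp(g), clamp(b)


# Example: a block whose only non-zero coefficient is the DC term.
coeffs = [[0] * N for _ in range(N)]
coeffs[0][0] = 10
qtable = [[8] * N for _ in range(N)]
samples = to_samples(idct_8x8(dequantize(coeffs, qtable)))
```

A DC-only block decodes to a flat block of identical samples, which is a convenient sanity check for the scaling factors.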

3. Utilizing Task-parallelism with the Libjpeg-turbo Library

Figure 2 shows the overall software architecture of the libjpeg-turbo library. Libjpeg-turbo has been designed with the goal of conserving resources, especially memory. Both encoding and decoding are done in units of 8x8 pixel blocks. As depicted in Figure 2, a 2-tier controller and buffer hierarchy is used to manage decoding and storage of single rows of 8x8 pixel blocks in the various stages of the decoding process. We had to re-engineer the library to store the whole image in memory, to facilitate partitioning and task-parallel image decoding. The whole image is stored after the Huffman decoding stage. All other phases of the JPEG decoding path are then conducted in parallel.

Figure 2. Libjpeg-turbo software architecture. B1-B3 are buffers, each of which stores a single row of 8x8 blocks from the JPEG image

3.1. Design and Implementation of Fork/Join Task Parallelism with the Libjpeg-turbo JPEG Decoder

To employ task parallelism with JPEG decoding, we applied fork/join parallelism [10]. In the execution context of the client, the library creates a worker for each available core and partitions the image across those workers, which then decode in parallel. After finishing decoding, the workers are joined by the client. After joining all workers, the client resumes execution. Retrofitting fork/join task parallelism to the libjpeg-turbo library became possible once we had the whole image available. Because each worker incurs an almost identical computational load, load balance across cores is ensured.

Algorithm 1 shows the code executed by each of our workers. In the libjpeg-turbo library, decoding happens in rows of minimum coded units (MCUs), which consist of 8 or 16 pixel rows of the original image, depending on the upsampling factor of the image. Although we hold the whole compressed JPEG image in memory, we kept row-by-row decoding in the worker threads to exploit cache locality of image data.
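The fork/join scheme described above can be sketched in a few lines of Python. This is only a structural illustration under stated assumptions, not the libjpeg-turbo implementation: the names image_slice, decode_mcu_row, thread_kernel and decode_fork_join are hypothetical, decode_mcu_row is a placeholder for the real per-row work, and the contiguous, near-equal partitioning is an assumed slicing policy. Python threads also do not run pure-Python code on several cores at once, so the sketch shows the control flow rather than the achievable speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def image_slice(num_rows, num_workers, tid):
    """Contiguous range of MCU-row indices for worker tid (sizes differ by at most one)."""
    base, extra = divmod(num_rows, num_workers)
    start = tid * base + min(tid, extra)
    stop = start + base + (1 if tid < extra else 0)
    return range(start, stop)


def decode_mcu_row(row):
    """Placeholder for inverse DCT, upsampling and color conversion of one MCU row."""
    return [2 * c for c in row]


def thread_kernel(mcu_rows, out, num_workers, tid):
    """Work done by one worker: decode its slice row by row."""
    for i in image_slice(len(mcu_rows), num_workers, tid):
        out[i] = decode_mcu_row(mcu_rows[i])


def decode_fork_join(mcu_rows, num_workers):
    """Fork one task per worker, then join all of them before returning to the client."""
    out = [None] * len(mcu_rows)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(thread_kernel, mcu_rows, out, num_workers, tid)
                   for tid in range(num_workers)]
        for f in futures:
            f.result()  # join; re-raises any exception from the worker
    return out


rows = [[i, i + 1] for i in range(10)]
decoded = decode_fork_join(rows, 4)
```

Each worker writes only to the output rows of its own slice, so no locking is needed between workers; the only synchronization point is the join.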

Algorithm 1: ThreadKernel
  foreach MCU_row in ImageSlice(tid) do
    inverseDCT(MCU_row)
    upsampling(MCU_row)
    colorConversion(MCU_row)
  end loop

Figure 3. Execution times of the sequential version of libjpeg-turbo, the SSE-based SIMD version, and the SIMD/task-parallel combined version. The combined version is up to 2.52x faster than the sequential version, and 34% faster than the SIMD version

4. Experimental Results

We have implemented our task-parallel decoding extension with libjpeg-turbo [5] version 1.2.3. All experiments were conducted on a 4-core Intel i7-2600k 3.4GHz CPU running Ubuntu Linux 10.10, Linux kernel version 2.6.35-32. All source code was compiled with GCC version 4.4.5 and option -O3. Experiments were conducted with a selection of 100 JPEG images, with resolutions ranging from 16x16 up to 2200x2200 pixels. Experiments utilized all 4 cores/8 hyperthreads of the i7 processor.

Fig. 3 compares sequential, SSE-based SIMD and combined SIMD/fork-join task-parallel decoding. For the SIMD/fork-join decoding, each task-parallel worker thread runs the SIMD version of the JPEG decoder. The combined version is up to 2.52 times faster than the sequential version. Adding task parallelism to the SIMD version yields an improvement of up to 34%.
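These figures can be related to the sequential Huffman stage with a back-of-envelope calculation based on Amdahl's law. The sketch below rests on two assumptions that the measurements do not guarantee: that "34% faster" means the SIMD-only run time is 1.34 times the combined run time, and that the roughly 50% Huffman share reported for the combined version in Fig. 4 applies to the same image. The function name amdahl_speedup is hypothetical.

```python
def amdahl_speedup(serial_fraction, workers):
    """Amdahl's law: speed-up bound when only the non-serial part scales with workers."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


# Assumed inputs, normalized to the run time of the combined version.
t_combined = 1.0            # SIMD + task-parallel version
t_huffman = 0.5             # sequential Huffman share (about 50%, Fig. 4)
t_simd = 1.34 * t_combined  # SIMD-only version, assumed 34% slower

# Share of the SIMD-only run time that stays sequential.
serial_fraction = t_huffman / t_simd

# Speed-up observed on the parallelized stages alone.
parallel_part_speedup = (t_simd - t_huffman) / (t_combined - t_huffman)

# Amdahl bound for 4 workers, relative to the SIMD-only version.
bound_4_workers = amdahl_speedup(serial_fraction, 4)

print(round(serial_fraction, 2), round(parallel_part_speedup, 2), round(bound_4_workers, 2))
```

Under these assumptions, roughly 37% of the SIMD-only run time is sequential Huffman decoding, the parallelized stages alone run about 1.68x faster, and no number of additional workers could push the gain over the SIMD version beyond 1/0.37, i.e. about 2.7x, as long as Huffman decoding stays sequential.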

Figure 4. Proportions of Huffman decoding (sequential) and the parallel part of the decoding path with the SIMD/task-parallel combined version. It follows that with the combined version, around 50% of the execution time is spent on Huffman decoding

It should be noted that with both the SIMD version and the combined version, Huffman decoding is done sequentially. We therefore investigated the ratio of Huffman decoding to the later, parallel part of the decoding path from Fig. 1. The results are depicted in Fig. 4. As it turned out, with the SIMD/task-parallel combined version, Huffman decoding constitutes a serious sequential bottleneck: around 50% of the overall execution time is spent on Huffman decoding. It will thus become necessary to attempt to parallelize the Huffman decoder itself, as suggested in [11].

5. Conclusion and Future Work

We have proposed a task-parallel programming extension for decoding of JPEG images with the libjpeg-turbo library. Our method is orthogonal to libjpeg-turbo's support for data-parallelism (SIMD). We have achieved speed-ups of up to 2.5x over the sequential version, and up to 34% over the SIMD version of the libjpeg-turbo JPEG decoder on a 4-core Intel i7-2600k CPU.

As for future work, it should be noted that although the observed overhead from fork/join parallelism was low, different parallel programming patterns still need to be considered to orchestrate JPEG decoding with multiple clients. Such a scenario would likely occur, e.g., with multiple threads from a web browser, or with several client programs performing JPEG decoding in parallel. As more parallelism is exploited in JPEG decoding, Huffman decoding becomes a sequential bottleneck; it will thus become necessary to attempt to parallelize the Huffman decoder itself, as suggested in [11].

References

[1] Google Inc.: Most popular websites on the internet, http://mostpopularwebsites.net, retrieved (2012) January.
[2] W3Techs: http://w3techs.com, retrieved (2012) March.
[3] Facebook Inc.: Statistics, http://www.facebook.com/press/info.php?statistics, retrieved (2012) January.
[4] Libjpeg: http://libjpeg.sourceforge.net/, retrieved (2012) January.
[5] Libjpeg Turbo: http://sourceforge.net/projects/libjpeg-turbo, retrieved (2012) January.
[6] The WebKit Open Source Project: http://www.webkit.org/, retrieved (2012) March.
[7] J. K. Heo, Distributed Image Preprocessing Using Object Activation, The Journal of The Institute of Webcasting, Internet and Telecommunication 11(1), pp. 87-92, (2011) February.
[8] I. S. Ahn, G. S. Choi and S. Y. Kim, System Development of Iron Plate Defects Detection System Using Image Processing and Multi Thread Method, The Journal of The Institute of Webcasting, Internet and Telecommunication 9(3), pp. 145-153, (2009) June.
[9] J. Hong, W. Sodsong, S. Chung, C. G. Kim, Y. Lim, S. D. Kim and B. Burgstaller, Task-parallel JPEG Decoding with the Libjpeg-turbo Library, In: Proc. of the International Conference on Information Science and Technology (2012).
[10] T. Mattson, B. Sanders and B. Massingill, Patterns for parallel programming, First edn. Addison-Wesley Professional (2004).
[11] S. T. Klein and Y. Wiseman, Parallel Huffman decoding with applications to JPEG files, Comput. J. 46(5), pp. 487-497, (2003).