GPU Implementation of a Modified Signed Discrete Cosine Transform


The 9th International Conference on INFOrmatics and Systems (INFOS24) 5-7 December

Reem T. Haweel
Basic Science Department, Faculty of Computer and Information Science, Ain-Shams University, Cairo, Egypt
reem_tarek_@hotmail.com

Wail S. El-Kilani
Computer Systems Department, Faculty of Computer and Information Science, Ain-Shams University, Cairo, Egypt
wail.elkilani@gmail.com

Hassan H. Ramadan
Basic Science Department, Faculty of Computer and Information Science, Ain-Shams University, Cairo, Egypt
hramadan@eun.eg

Abstract
Real-time imaging is essential for internet multimedia and modern satellite communications. The Discrete Cosine Transform (DCT) is at the core of image processing tasks such as image compression and coding because of its high power compaction. The Signed DCT (SDCT) and its modifications approximate the DCT while requiring far fewer arithmetic operations, which is essential for speeding up real-time applications. However, computing the 2-D DCT or SDCT on a CPU is slow because of the large number of operations involved. This paper employs a Graphics Processing Unit (GPU), programmed through the Compute Unified Device Architecture (CUDA), to reduce the time required for these computations. An efficient and fast modified SDCT (MSDCT) is employed. The essential features of the MSDCT are presented, and a flow diagram is provided for its efficient implementation. Only 7 additions are required for both the forward and backward transformations. Computer analysis illustrates the speedup achieved by the GPU implementation of the MSDCT.

Keywords: discrete cosine transform; signed discrete cosine transform; parallel computation; CPU; GPU; CUDA.

I. INTRODUCTION
The rapid growth of digital imaging applications, including desktop publishing, multimedia, teleconferencing, and high-definition television, has increased the need for effective and standardized image compression techniques.
At the present state of technology, the only solution is to compress multimedia data before storage and transmission and to decompress it at the receiver for playback. Image compression addresses the problem of reducing the amount of data required to represent a digital image with acceptable quality. Coding redundancy is removed using Huffman codes, which contain the smallest possible number of code symbols (e.g., bits) per source symbol (e.g., gray-level value), subject to the constraint that the source symbols are coded one at a time. Visual redundancy, which is due to data that the human visual system ignores (i.e., visually nonessential information), is removed by employing the Discrete Cosine Transform (DCT) [1]. The DCT is an example of transform coding. It is built from real sinusoids and has many useful features. In addition to its orthogonal structure, the DCT has good power compaction properties: it concentrates the highest energies in the upper-left corner of the transformed image block and relocates the lesser energy, or information, to other areas. The DCT is the best substitute for the Karhunen-Loeve transform, which is considered statistically optimal for power concentration [2]. For this reason, the DCT is the core of image coding [3] and of video compression techniques such as JPEG, MPEG, MPEG2, H.261, and H263 [4].

In spite of the existence of many fast algorithms [5] that reduce the total number of operations required to compute such transforms, multiplication operations may be inevitable. To increase the speed of transformation while keeping the compaction properties of the DCT, the Signed Discrete Cosine Transform (SDCT) [6] was suggested. All the elements of the SDCT are +1 or -1, so no multiplication operations or transcendental expressions are required.
Moreover, the SDCT maintains the periodicity and spectral structure of its originating DCT and, in turn, its good decorrelation and energy compaction characteristics [6]. The computation of the SDCT requires 24 additions [6]. The introduction of the SDCT was followed by a stream of research papers, such as the Bouguezel-Ahmad-Swamy (BAS) series of algorithms [7]-[11]. The target is to modify the SDCT to further reduce the computational complexity and to achieve orthogonality. The strategy is to change some of the SDCT matrix elements and to clear others.

In spite of the efforts expended to reduce the number of operations required to compute the DCT, the SDCT, and its modifications, the 2-D implementation of such algorithms is very time consuming. Speeding up these 2-D computations is crucial for real-time applications such as internet multimedia and satellite communications. Since designing fast algorithms is not sufficient, it is important to seek faster implementation platforms, for example by increasing the operating frequency of the Central Processing Unit (CPU). However, the CPU clock frequency cannot be increased beyond some limit because of factors such as overheating. Since these algorithms are highly parallel, parallel computing offers a way to improve timing performance. The growing computing capability of the Graphics Processing Unit (GPU) has greatly advanced parallel computing [17]. A GPU can assign processing tasks to multiple threads and execute these threads simultaneously. This feature can speed up heavy data computation to a level that could not be imagined in the past. Because of rising architectural and compiler complexity, multicore processors have always challenged engineers. However, companies in visual computing and graphics processing, such as NVIDIA and AMD, have welcomed those challenges and opened new doors in parallel and high-performance computing [12]. The Compute Unified Device Architecture (CUDA), developed by NVIDIA, extends the C language with additional functions and provides an interface between the developer and the device for transferring data and distributing work between the GPU and the CPU [13]. It has the capability of identifying, programming, and tracing a single-core computation and of performing multiple tasks in parallel as the user requires [16].

This paper investigates the GPU implementation of an efficient modification of the SDCT. The rest of the paper is organized as follows. Section II summarizes the main features of the modified SDCT (MSDCT) algorithm. The GPU implementation of the DCT is discussed in Section III. The proposed GPU implementation of the MSDCT is presented in Section IV. The timing performance of the proposed GPU method is evaluated in Section V. Finally, Section VI concludes the work.

Copyright 24 by Faculty of Computers and Information, Cairo University

II. MODIFIED SIGNED DISCRETE COSINE TRANSFORM
The DCT matrix, T_DCT, of order N is defined as [1]:

$$T_{DCT}(i,j) = \begin{cases} \sqrt{\dfrac{1}{N}}, & i = 0,\; 0 \le j \le N-1 \\[2mm] \sqrt{\dfrac{2}{N}}\,\cos\dfrac{(2j+1)\,i\,\pi}{2N}, & 1 \le i \le N-1,\; 0 \le j \le N-1 \end{cases} \qquad (1)$$

To increase the speed of transformation while keeping the compaction properties of the DCT, the Signed Discrete Cosine Transform (SDCT) [6] was suggested, obtained simply by applying the signum function operator to the DCT elements in (1). The SDCT matrix, T_SDCT, is given by

$$T_{SDCT}(i,j) = \frac{1}{\sqrt{N}}\,\mathrm{sign}\{T_{DCT}(i,j)\} \qquad (2)$$

where sign{.} is the signum function defined as

$$\mathrm{sign}\{x\} = \begin{cases} +1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases} \qquad (3)$$

Apart from the common scale factor, all the elements of the transform are +1 or -1; that is, no multiplication operations or transcendental expressions are required.
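As an illustration of (1)-(3), the following Python sketch builds the order-N DCT matrix and the sign pattern of the SDCT. It is not taken from the paper: the function names `dct_matrix`, `sign` and `sdct_sign_pattern`, and the tolerance `eps` that treats near-zero cosines as zero, are assumptions made for this example. The common scale factor of (2) is left out so that the entries print as plain signs.

```python
import math


def dct_matrix(n):
    """Order-n DCT matrix following the definition in (1)."""
    rows = []
    for i in range(n):
        # Row 0 is scaled by sqrt(1/n); all other rows by sqrt(2/n).
        scale = math.sqrt(1.0 / n) if i == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * i * math.pi / (2 * n))
                     for j in range(n)])
    return rows


def sign(x, eps=1e-12):
    """Signum function of (3); eps (an assumption) absorbs rounding noise."""
    if x > eps:
        return 1
    if x < -eps:
        return -1
    return 0


def sdct_sign_pattern(n):
    """Unscaled SDCT entries sign{T_DCT(i, j)}, i.e. (2) without its scale."""
    return [[sign(v) for v in row] for row in dct_matrix(n)]


if __name__ == "__main__":
    # Print the 8-by-8 sign pattern, one row per line.
    for row in sdct_sign_pattern(8):
        print(" ".join("+" if v > 0 else "-" if v < 0 else "0" for v in row))
```

Because every entry is a sign, applying the pattern to a data vector needs only additions and subtractions, which is the property the text relies on.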
The SDCT maintains the periodicity and spectral structure of its originating DCT and, in turn, its good decorrelation and energy compaction characteristics [6]. The 8-by-8 SDCT transform matrix is given by [6]

$$T_{SDCT} = \frac{1}{\sqrt{8}}\begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ 1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\ 1 & -1 & -1 & -1 & 1 & 1 & 1 & -1 \\ 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & 1 & -1 & -1 & 1 & -1 \\ 1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\ 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \end{bmatrix} \qquad (4)$$

Unfortunately, the SDCT matrix is not orthogonal, so its reverse transformation is not simply its transpose. The computation of the SDCT requires 24 additions [6]. An interesting transform related to the SDCT is introduced in [14], with a transform

$$\hat{T} = D\,T \qquad (5)$$

where

D diag, 2, 2, 2,, 2, 2, 2, (6)

In T, some of the SDCT elements have been changed to ±0.5 and 24 elements have been cleared (set to zero). It has been shown in [14] that the power compaction of the transform, and consequently its compression capability, is high. The computation of the transform requires 7 additions and two shifts. The transform in [14] is not completely orthogonal: there are two nonzero off-diagonal elements in T T^t. However, the effect of these two nonzero elements on compression is negligible, and the transpose of the transform can be used as an approximation of its inverse.

An efficient Modified SDCT (MSDCT) transform [15] results from applying the signum function operator to the transform in [14] given in (5). The matrix T_P and its associated diagonal matrix are given by

$$T_P = \tilde{D}\,\tilde{T} \qquad (7)$$

where

$$\tilde{T} = \mathrm{sign}\{T\} \qquad (8)$$

and

D TT t T diag,, 8 2, 8, 2,, 8 2, 2 2 (9)

It has been shown that the MSDCT maintains the good power compaction and orthogonality properties of its originating transform [15]. While the same 24 zeros of (5) are maintained, the ±0.5 elements have been converted to ±1. Thus the shift operations required by the transform of [14] are eliminated. The fast tree computation for the MSDCT is shown in Fig. 1. Only 7 add operations are required.

Fig. 1. Flow diagram for the fast implementation of the proposed algorithm

The computational requirements for different 8x8 transforms are listed in Table I. The lowest complexity is that of the MSDCT.

TABLE I. COMPUTATIONAL REQUIREMENTS FOR 8X8 TRANSFORMS
Transform    Adds    Shifts    Mult.    Total
BAS [] SDCT [6] Transforms in [4], [5] BAS [7] 2 2 BAS [8], BAS [] BAS [9] 8 8 Transform in [2] Transform in [3] 7 7 Proposed transform 7 7

III. GPU IMPLEMENTATION OF DCT
The DCT is applied to an image by dividing the image into 8x8 blocks. Each block is processed independently of the others, so all image blocks can be processed efficiently in parallel, distributed among parallel computing units. The Graphics Processing Unit (GPU) is a parallel processor devoted to speeding up tasks with high computational burdens [17]. The Compute Unified Device Architecture (CUDA) adopts a unified hardware processing architecture to facilitate programming the GPU [17], [18]. The CUDA programming model is built on the concept of a host (CPU) and a device (GPU). Data is first read and serially processed on the host side. It is then transferred to the device global memory, where highly threaded parallel processing is applied. The functions that run parallel tasks on the GPU are called kernels. They are C functions that are executed by many CUDA threads. These CUDA threads form blocks with up to three dimensions.
All threads in one block run the same kernel. CUDA blocks are organized into a grid, which has up to two dimensions. Execution blocks are called CUDA blocks from here on, to distinguish them from logical image blocks. Several techniques for the efficient implementation of the DCT on both the CPU and the GPU using direct matrix multiplication are presented in [19]. These techniques showed that the DCT runs faster on the GPU than on the CPU. In [20], the DCT was implemented on a GPU using CUDA. In that implementation, a CORDIC-based Loeffler DCT is calculated in four stages that must be executed serially; parallelization is then applied within each stage.

IV. PROPOSED GPU IMPLEMENTATION OF MSDCT
Mathematically, the 2-D transformation F of one image block X is computed as

$$F = T\,X\,T^{t} \qquad (10)$$

In the 2-D implementation, a 1-D transformation is first applied to the columns of each 8x8 block using the transformation matrix in (7). The same transformation is then applied to the rows. In the proposed GPU implementation, each thread is responsible for applying a 1-D transformation to one row and another to one column. That is, each image block is 2-D transformed using 8 such threads. Since the MSDCT is parallelizable at all levels of detail, the size of a CUDA block logically has no effect on the performance of the algorithm. To improve performance and utilization, the block size is set in accordance with the hardware architecture. GPU multiprocessors create, manage, schedule, and execute threads in groups called warps [18]. For the NVIDIA GeForce 8x series, the number of threads in a warp is 32. The number of threads per CUDA block should be chosen as a multiple of the warp size to avoid wasting computing resources and to maximize parallel execution across the functional units of a multiprocessor. Consequently, the size of the CUDA block employed in the proposed GPU implementation is 32 threads. That is, each CUDA block maps to 4 image blocks.

The main GPU memory types are shared memory, global memory, and constant memory [17]. Each CUDA block has its own shared memory, which can be accessed by all threads in that block. Global memory is the only channel for communicating with the host (CPU); it can be accessed by all CUDA blocks on the GPU. Constant memory is a fast read-only memory. However, constant memory is not used by the proposed GPU implementation, since the MSDCT has no constant transform parameters as the conventional DCT does. In the proposed GPU implementation, the CPU reads a bitmap image and sends it to the GPU global memory, which is shared by the CUDA blocks. As described above, the shared memory of each CUDA block holds 4 image blocks. After the threads of each block perform the transformations, the transformed image blocks are gathered in global memory and sent back to the CPU.

V. PERFORMANCE EVALUATION
To illustrate the efficiency of the proposed GPU implementation of the MSDCT, a test database of bitmap images of different sizes has been employed. The database includes 265-by-265, 52-by-52, 48-by-48, 296-by-296 and 448-by-448 gray images, indexed from (1) to (5) respectively. In the first part of the evaluation, the DCT and MSDCT algorithms were run on both a CPU (Intel(R) Core(TM) i5-42m at 2.5 GHz, running Windows 7 32-bit) and a GPU (GeForce GT 72M of compute capability 2., with CUDA driver version 6, CUDA Toolkit 6 and Visual Studio 2). Fig. 2 shows the time taken by the DCT and the MSDCT for the database images on the CPU. Fig. 3 shows the time taken by the two algorithms on the GPU. Generally, the GPU time is much less than the CPU time. The time taken by the MSDCT is less than that taken by the DCT for all image sizes.
Table II lists the running times of the DCT and the MSDCT on the CPU and the GPU for different image sizes in milliseconds (ms). Fig. 4 shows the speedup obtained by employing the GPU for the MSDCT. The speedup ranges from x28 to x45 and increases as the image size increases.

TABLE II. RUNNING TIMES (MS) FOR DCT AND MSDCT ON CPU AND GPU
Image sizes    DCT on CPU    DCT on GPU    MSDCT on CPU    MSDCT on GPU

Fig. 2. Running times for the DCT and MSDCT on CPU
Fig. 3. Running times for the DCT and MSDCT on GPU
Fig. 4. Speed up of MSDCT on GPU over DCT on CPU (vertical axis: speed up order (x))

In the second part of the experiments, the running times of the proposed MSDCT are compared with those of the DCT implementation of [20], on both the CPU and the GPU. The running times in ms are given in Table III and in Figs. 5 and 6. The proposed MSDCT clearly outperforms the implementation of [20] on both the CPU and the GPU.

TABLE III. RUNNING TIMES (MS) FOR MSDCT AND IMPLEMENTATION OF [20]
Image sizes    MSDCT on CPU    [20] on CPU    MSDCT on GPU    [20] on GPU

Fig. 5. CPU running times for the MSDCT and the implementation of [20]
Fig. 6. GPU running times for the MSDCT and the implementation of [20]

VI. CONCLUSIONS
The Signed Discrete Cosine Transform (SDCT) approximates the DCT, which is efficiently employed in image processing tasks such as compression and coding. To further save computations, modifications of the SDCT have been explored. The strategy of such modifications is to convert some of the SDCT elements to zeros and to modify others. An efficient modified SDCT (MSDCT) transform with low complexity and high power compaction has been presented. The MSDCT results from applying the signum function operator to the high-performance transform in [2]. A flow diagram for the fast computation of the MSDCT is shown. The computational complexity is only 7 additions, which is the lowest among this family of transforms. The CPU time for the implementation of the 2-D DCT and related transforms such as the MSDCT is large. However, the use of the 2-D DCT and the MSDCT in image processing tasks such as compression is highly parallel. This paper investigates the GPU implementation of the MSDCT. The implementation has been achieved on an NVIDIA GPU employing CUDA. Performance evaluation has been conducted on a database of gray images with sizes ranging from 265-by-265 to 496-by-496. As expected, the time for the MSDCT is less than the time for the DCT on both the CPU and the GPU. The GPU implementation of the MSDCT is much faster than the implementation on the CPU.
The speedup achieved by the MSDCT ranges from x28 to x45. The speedup is higher for larger images.

REFERENCES
[1] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp , Dec
[2] R. J. Clark, "Relation between Karhunen-Loeve and cosine transform," Proc. Inst. Elec. Eng., pt. F, vol. 28, no 6, pp , 98.
[3] W. M. Abdelhafez and M. A. Mofaddel, "Hybrid scheme for lifting based image coding," Seventh International Conference on Computer Engineering & Systems (ICCES), pp.4-46, 22.
[4] G. Lakhani, "Modifying JPEG binary arithmetic coder for exploiting inter/intra-block and DCT coefficient sign redundancies," IEEE Trans. Image Processing, vol. 22, issue 4, pp , 23.
[5] H. S. Hou, "A fast recursive algorithm for computing the discrete cosine transform," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-35, no., pp , Oct
[6] T. I. Haweel, "A new square wave transform based on the DCT," Signal Processing, 8, pp , 2.
[7] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "A multiplication-free transform for image compression," in 2nd Int. Conf. Signals, Circuits and Systems, pp. -4, Nov. 28.
[8] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "Low complexity 8X8 transform for image compression," Electron. Lett., vol. 44, pp , Sep. 28.
[9] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "A fast 8x8 transform for image compression," in 29 Int. Conf. Microelectronics (ICM), pp , Dec. 29.
[10] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "A novel transform for image compression," in 53rd IEEE Int. Midwest Symp. Circuits and Systems (MWSCAS), pp , Aug. 2.
[11] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "A low complexity parametric transform for image compression," in Proc. IEEE Int. Symp. Circuits and Systems, 2.
[12] Whitepaper on NVIDIA's Next Generation CUDA Compute Architecture: Fermi, version..
[13] Peter N. Glaskowsky, "NVIDIA's Fermi: First Complete GPU Computing Architecture," whitepaper, September 29.

[14] Ranjan K. Senapati, Umesh C. Pati, and Kamala K. Mahapatra, "A low complexity orthogonal 8x8 transform matrix for fast image compression," Annual IEEE India Conference (INDICON), pp. -4, 2.
[15] Reem T. Haweel and Wail S. El Kilani, "A Fast Modified Signed Discrete Cosine Transform For Image Compression," 9th IEEE International Conference on Computer Engineering and Systems (ICCES 24), Dec 24.
[16] Liu Duo and Fan Xiao Ya, "Parallel program design for JPEG compression Encoding," 22 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 22).
[17] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU Computing," Proceedings of the IEEE, May 28, Volume: 96, Issue: 5, pp:
[18] Nvidia, CUDA C Programming Guide, Version 6., 24.
[19] Bo Fang, Guobin Shen, Shipeng Li, and Huifang Chen, "Techniques for Efficient DCT/IDCT Implementation on Generic GPU," IEEE International Symposium on Circuits and Systems (ISCAS), 25.
[20] Kgotlaetsile M. Modieginyane, Zenzo P. Ncube, and Naison Gasela, "CUDA based performance evaluation of the computational efficiency of the DCT image compression technique on both the CPU and GPU," Advanced Computing: An International Journal (ACIJ), Vol.4, No.3, May 23
