Parallel Processing of Multimedia Data in a Heterogeneous Computing Environment


Heegon Kim, Sungju Lee, Yongwha Chung, Daihee Park, and Taewoong Jeon
Dept. of Computer and Information Science, Korea University, Sejong, Korea
{khg86,peacfeel,ychungy,dhpark,jeon}@korea.ac.kr

Abstract. Many multimedia applications can now be parallelized on multicore platforms such as CPUs and GPUs. In this paper, we propose a parallel processing approach for a multimedia application that uses both the CPU and the GPU. Instead of distributing the parallelizable workload to either the CPU or the GPU (i.e., homogeneous computing), we distribute the workload to both the CPU and the GPU simultaneously (i.e., heterogeneous computing) by using OpenCL. Based on experimental results with a photomosaic application, we confirm that the proposed approach provides better performance than the typical parallel processing approach by making maximal use of the given resources.

Keywords: CPU, GPU, Heterogeneous Computing, OpenCL.

1 Introduction

As multicore processors are used in handheld devices as well as PCs and servers, parallel processing approaches have been developed for many applications[1-2]. For example, many approaches have been reported for parallelizing multimedia applications[3-4]. Furthermore, as handheld devices such as smartphones become more powerful, many users create their own content on them.

In this paper, we focus on parallelizing multimedia applications by using both the CPU and the GPU. These applications have sufficient parallelism, and many parallel processing results have been reported[5-7] using general-purpose programming on the GPU such as Nvidia's CUDA[8], in addition to Pthread[9] on the CPU. Recently, OpenCL[10] has been defined as a standard for heterogeneous parallel computing. It provides a cross-platform framework for writing software that can run on different kinds of devices, from multicore CPUs to GPUs.
That is, a parallel program written with OpenCL can be executed on either the CPU or the GPU[11]. Generally, a GPU provides better performance than a CPU for multimedia applications. However, a current multicore CPU is also a powerful processor and, when used together with the GPU, can reduce the total execution time.

We propose a load balancing approach which can overcome the performance limit of either CPU-only or GPU-only execution. We first parallelize a given multimedia application with OpenCL, and measure its execution time on the CPU and on the GPU, respectively. Then, we partition the parallelized workload into two parts, based on the relative performance of the GPU over the CPU. Finally, we assign the GPU portion of the workload to the GPU by using a non-blocking command, and then assign the remaining parallel portion to the CPU without waiting for a result from the GPU. By reducing the idle time on either the CPU or the GPU, we overlap the GPU execution maximally with the CPU execution.

The rest of the paper is structured as follows. Section 2 explains OpenCL[10] and the multimedia application Photomosaic[12]. Section 3 describes our proposed load balancing approach. The experimental results are given in Section 4, and conclusions are provided in Section 5.

Corresponding author.
James J. (Jong Hyuk) Park et al. (eds.), Multimedia and Ubiquitous Engineering, Lecture Notes in Electrical Engineering 308, DOI: / _4, Springer-Verlag Berlin Heidelberg

2 Background

2.1 OpenCL

OpenCL[10] is an open standard that provides a programming environment for accessing heterogeneous architectures. In particular, OpenCL (shown in Fig. 1) allows computational workloads to be executed on various multicore processors. Given the increasing availability of such processors, OpenCL plays a crucial role in enabling portable applications to access a wide range of computational resources. To achieve this aim, the OpenCL model introduces several levels of abstraction.

Platform abstracts the number and type of computing devices in a hardware platform. At this level, developers are given routines to query and manage the computing devices, and to create the contexts and work queues used to submit sets of instructions called kernels.

Execution is based on the concept of a kernel, which is a collection of instructions executed on a computing device (a multicore CPU or a GPU) called an OpenCL device. An OpenCL application is divided into two programs: host and kernel. The host program is executed on the CPU. It defines the context for the kernels and manages their execution. When the host submits a kernel for execution, an index space is defined, and an instance of the kernel executes for each point in this index space. This kernel instance is called a work-item and is identified by its point in the index space, which provides a global ID for the work-item. Each work-item executes the same code on different data. Work-items are in turn organized into work-groups, which provide a more coarse-grained decomposition of the index space.

Language describes the syntax and programming interface for writing kernels (sets of instructions that execute on computing devices such as multicore CPUs or GPUs).
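The index-space terminology above can be made concrete with a minimal sketch. It is written in plain Python rather than OpenCL, and it only models how a one-dimensional index space is decomposed; the names `WorkItem`, `enumerate_work_items`, and `run_kernel`, and the sizes used, are hypothetical and chosen for illustration. It assumes a zero global offset and a work-group size that divides the index-space size, in which case global ID = group ID x work-group size + local ID.

```python
from typing import Callable, List, NamedTuple, Sequence


class WorkItem(NamedTuple):
    global_id: int  # position in the whole index space
    group_id: int   # index of the work-group that contains the item
    local_id: int   # position inside that work-group


def enumerate_work_items(global_size: int, local_size: int) -> List[WorkItem]:
    """List every work-item of a 1-D index space (illustrative model only)."""
    if local_size <= 0 or global_size % local_size != 0:
        raise ValueError("local_size must be positive and divide global_size")
    items = []
    for global_id in range(global_size):
        # Integer division gives the work-group, the remainder the local ID.
        group_id, local_id = divmod(global_id, local_size)
        items.append(WorkItem(global_id, group_id, local_id))
    return items


def run_kernel(kernel: Callable, data: Sequence, local_size: int) -> list:
    """Run the same kernel body once per work-item, each on its own element."""
    out = [None] * len(data)
    for item in enumerate_work_items(len(data), local_size):
        out[item.global_id] = kernel(data[item.global_id])
    return out


if __name__ == "__main__":
    for item in enumerate_work_items(global_size=8, local_size=4):
        print(item)
    print(run_kernel(lambda x: x * x, [1, 2, 3, 4], local_size=2))
```

On a real device the loop in `run_kernel` would not be sequential: the work-items of the index space are scheduled across the device's compute units, which is what makes the model suitable for data-parallel workloads.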


Fig. 1. Platform model of OpenCL

2.2 Photomosaic

The word photomosaic[12] is a compound of "photograph" and "mosaic". The photomosaic divides a large image into several small parts, which are converted into small tile images of similar colors.

Fig. 2. Result of the photomosaic

Fig. 2 shows the similarity between the original image and the result image of the photomosaic. The result image is composed of many smaller tile images. In this paper, the photomosaic iterates the loop of image conversion 5 times per pixel.

3 Parallel Photomosaic

Each CPU core performs better than a GPU core, whereas the CPU has fewer cores than the GPU. The GPU, which has hundreds of cores, is more advantageous when the computation consists of many iterations of the same operation. A large number of studies of GPU-equipped environments that use only the GPU for parallel processing have been published[13-14].

The photomosaic has no data dependency among the tile images. Therefore, a parallel photomosaic in OpenCL is processed by assigning compute units to each tile image. In a typical OpenCL program, the host synchronizes with the device in blocking mode, so the host program waits while the kernel function executes (see Fig. 3).

Fig. 3. Typical parallel processing of photomosaic using GPU

For heterogeneous computing environments, we propose an approach which improves performance by using not only the GPU but also the CPU, reducing the CPU idle time (i.e., waiting time). OpenCL allows asynchronous processing through non-blocking mode, and in this paper non-blocking mode is used to reduce the CPU idle time. In non-blocking mode, both CPU and GPU resources can be used simultaneously, as shown in Fig. 4. Since the idle time is reduced, the proposed approach can achieve a higher speedup than typical parallel processing.

Fig. 4. Proposed parallel processing of photomosaic using both GPU and CPU

4 Experimental Results

To evaluate the proposed approach, we used an AMD Phenom II X4 955 Processor, a GeForce GTX 285, and the target image with resolution. The number of tile images is. The AMD Phenom II X4 955 Processor has four cores, and the GeForce GTX 285 has 240 cores. However, a GPU core provides lower performance than a CPU core. Also, many typical parallel processing studies with a GPU have focused on the GPU only.

First, the execution time of the photomosaic was measured to evaluate the parallel OpenCL speedup. The photomosaic was measured in three ways: sequential, parallel using GPU-only by OpenCL, and parallel using multicore CPU-only by OpenCL. Table 1 shows the sequential and parallel execution times of the photomosaic application. Multicore CPU-only was measured using the multicore CPU, and GPU-only was measured using the GPU. Multicore CPU(x%)+GPU(y%) was measured using both the multicore CPU and the GPU, with the multicore CPU taking an x% portion of the workload and the GPU a y% portion. The results show that CPU-only execution by OpenCL provides superlinear speedup (i.e., the 4-core CPU has a speedup of 17). The reason is that the cache-hit ratio improved greatly with the increased number of cores.
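The Multicore CPU(x%) + GPU(y%) configuration and the non-blocking overlap of Section 3 can be illustrated with a small sketch. It is plain Python, not the authors' OpenCL code: a worker thread stands in for the GPU command queue, and `gpu_share`, `split_tiles`, `process_heterogeneous`, the two kernel callables, and the example timings are all hypothetical. The paper states only that the split is based on the relative performance of the GPU over the CPU; the balance formula used here (GPU share = t_cpu / (t_cpu + t_gpu)) is an assumption that holds if execution time scales linearly with the number of tiles.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def gpu_share(t_cpu_only: float, t_gpu_only: float) -> float:
    """Fraction of the workload for the GPU so both devices finish together.

    Assumption: time is proportional to the number of tiles on each device,
    so share * t_gpu_only == (1 - share) * t_cpu_only at the balance point.
    """
    return t_cpu_only / (t_cpu_only + t_gpu_only)


def split_tiles(n_tiles: int, share: float, granularity: int = 1) -> Tuple[int, int]:
    """Return (gpu_tiles, cpu_tiles), rounding the GPU part to `granularity`.

    The granularity models the index-space constraint: only some split
    proportions are realizable.
    """
    gpu_tiles = round(n_tiles * share / granularity) * granularity
    gpu_tiles = max(0, min(n_tiles, gpu_tiles))
    return gpu_tiles, n_tiles - gpu_tiles


def process_heterogeneous(tiles: Sequence, share: float,
                          gpu_kernel: Callable, cpu_kernel: Callable,
                          granularity: int = 1) -> List:
    """Process the GPU portion asynchronously while the host does the rest."""
    gpu_n, _ = split_tiles(len(tiles), share, granularity)
    with ThreadPoolExecutor(max_workers=1) as pool:
        # submit() returns immediately; it stands in for a non-blocking enqueue.
        gpu_future = pool.submit(lambda: [gpu_kernel(t) for t in tiles[:gpu_n]])
        # The host works on its own portion instead of waiting idle.
        cpu_result = [cpu_kernel(t) for t in tiles[gpu_n:]]
        # Synchronize only once the host portion is finished.
        gpu_result = gpu_future.result()
    return gpu_result + cpu_result


if __name__ == "__main__":
    share = gpu_share(t_cpu_only=30.0, t_gpu_only=10.0)  # hypothetical timings
    print(share, split_tiles(100, share, granularity=25))
    print(process_heterogeneous(list(range(8)), share,
                                lambda t: t * 2, lambda t: t * 2))
```

In an actual OpenCL host program, the analogous overlap comes from enqueueing the GPU work without a blocking call (for example, passing a false `blocking_read` flag to `clEnqueueReadBuffer`) and synchronizing only after the CPU portion has been issued.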

Table 1. Sequential and parallel execution times of the photomosaic

                                        Execution time (sec)
Sequential processing                   340.
Parallel processing
    Multicore CPU-only
    GPU-only
    Multicore CPU(50%) + GPU(50%)
    Multicore CPU(25%) + GPU(75%)       8.40

Next, we measured the execution time of the photomosaic with the workload divided into two parts, one performed by the CPU and the other by the GPU. As Table 1 shows, the photomosaic divided into a 25% CPU portion and a 75% GPU portion provides better performance than the one using multicore CPU-only or GPU-only. These portions are constrained by the index space, so certain proportions that the relative GPU and CPU performance might suggest are not possible (e.g., CPU(33%) + GPU(66%)). The proposed approach achieves a speedup of 40, and yields 25% better performance than the one using GPU-only by OpenCL. However, if the two-part division is chosen inappropriately, the proposed approach provides lower performance than GPU-only. Fig. 5 shows the speedups with OpenCL achieved by four different ways of parallel processing.

Fig. 5. Speedup with OpenCL

5 Conclusions

We have proposed an efficient heterogeneous parallel processing approach to reduce CPU idle time. The approach, which uses both the CPU and the GPU through OpenCL, decreases the total execution time for better performance.

Experiments using both the CPU and the GPU for parallel processing have demonstrated that our approach can provide a speedup of 40 and (if properly load-balanced between the CPU and the GPU) 25% better performance than the commonly used parallel approach using the GPU only.

Acknowledgement. This research was supported by Basic Science Research Program through the National Research Foundation of Korea (funded by the Ministry of Education, Science and Technology, 2012R1A1A ) and BK21 Plus Program.

References

1. Held, J., Bautista, J., Koehl, S.: From a Few Cores to Many: A Tera-Scale Computing Research Overview. Intel White Paper (2006)
2. Levy, M., Conte, T.: Embedded Multicore Processors and Systems. IEEE Micro 29, 7-9 (2009)
3. Sihn, K., Baik, H., Kim, J., Bae, S., Song, J.: Novel Approaches to Parallel H.264 Decoder on Symmetric Multicore Systems. In: Proc. of International Conference on Acoustics, Speech, and Signal Processing, pp. (2009)
4. Chen, W., Hang, H.: H.264/AVC Motion Estimation Implementation on CUDA. In: Proc. of International Multimedia and Expo Conf., pp. (2008)
5. Shams, R., Sadeghi, P., Kennedy, R., Hartley, R.: A Survey of Medical Image Registration on Multicore and the GPU. IEEE Signal Processing Magazine 27(2), (2010)
6. Bienia, C., Kumar, S., Singh, J., Li, K.: The PARSEC Benchmark Suite: Characterization and Architectural Implications. In: Proc. of International Conference on Parallel Architectures and Compilation Techniques, pp. (2008)
7. Kim, H., Lee, S., Chung, Y., Pan, S.: Parallelizing H.264 and AES Collectively. KSII Tr. Internet & Info. Systems 7(9), (2013)
8. NVidia: NVidia CUDA Compute Unified Device Architecture Programming Guide. NVidia (2008)
9. Akhter, S., Roberts, J.: Multi-Core Programming - Increasing Performance through Software Multi-Threading. Intel Press, Hillsboro (2006)
10. Stone, J., Gohara, D., Shi, G.: OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Computing in Science and Engineering 12(3), (2010)
11. Gaetano, R., Pesquet-Popescu, B.: OpenCL Implementation of Motion Estimation for Cloud Video Processing. In: Proc. of International Symposium on Multimedia Signal Processing, pp. 1-6 (2011)
12. Silvers, R., Hawley, M.: Photomosaics. Henry Holt, New York (1997)
13. Cao, J., Xie, X.-f., Liang, J., Li, D.-d.: GPU Accelerated Target Tracking Method. In: Jin, D., Lin, S. (eds.) Advances in MSEC Vol. 1. AISC, vol. 128, pp. Springer, Heidelberg (2011)
14. Davendra, D., Zelinka, I.: GPU Based Enhanced Differential Evolution Algorithm: A Comparison between CUDA and OpenCL. Intelligent Systems Reference Library, vol. 38, pp. (2013)
