2016 IEEE/ACM International Conference on Mobile Software Engineering and Systems


Accelerating a computer vision algorithm on a mobile SoC using CPU-GPU co-processing - A case study on face detection

Youngwan Lee, Department of Information and Communication Engineering, Inha University, Incheon, Korea, youngwan88@gmail.com
Cheolyong Jang, Department of Information and Communication Engineering, Inha University, Incheon, Korea, cyjang@gmail.com
Hakil Kim, Department of Information and Communication Engineering, Inha University, Incheon, Korea, hikim@inha.ac.kr

ABSTRACT
Recently, mobile devices have become equipped with sophisticated hardware components such as a heterogeneous multi-core SoC that consists of a CPU, GPU, and DSP. This provides opportunities to realize computationally intensive computer vision applications using General Purpose GPU (GPGPU) programming tools such as Open Graphics Library for Embedded Systems (OpenGL ES) and Open Computing Language (OpenCL). As a case study, the aim of this research was to accelerate the Viola-Jones face detection algorithm, which is computationally expensive and whose use on mobile devices is limited because irregular memory access and imbalanced workloads result in long processing times. To solve these challenges, the proposed method adopts CPU-GPU task parallelism, sliding window parallelism, scale image parallelism, dynamic allocation of threads, and local memory optimization to reduce the computation time. The experimental results show that the proposed method achieved a speedup of 3.3 to 6.29 times compared to the well-optimized OpenCV implementation on a CPU. The proposed method can be adapted to other applications using mobile GPUs and CPUs.

Keywords
Computer vision; Mobile GPGPU; OpenGL ES 2.0; OpenCL; CPU-GPU co-processing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. MobileSoft '16, May 16-17, 2016, Austin, TX, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM.

1. INTRODUCTION
In recent years, the number of mobile devices with high-definition displays, high-resolution cameras, and application processors has grown rapidly, which has facilitated pragmatic computer vision applications such as face detection, mobile visual search, 3-D games, and augmented reality on mobile devices [7,14,15,19]. However, the practical use of computationally intensive computer vision applications on mobile devices is limited because of computational restrictions and limited performance compared to computers. To address this limitation, many researchers have tried to use GPUs as general purpose GPUs (GPGPUs), performing computations usually handled by CPUs, to accelerate image processing and computer vision algorithms [5,20,21] using GPU programming models such as Open Graphics Library for Embedded Systems 2.0 (OpenGL ES 2.0) [8] and Open Computing Language (OpenCL) [9]. However, many studies and advancements applied to desktop GPUs (dGPUs) are not suitable for mobile applications because of the difference between dGPUs and the mobile hardware architecture, namely the System-on-Chip (SoC), which integrates a CPU and a GPU. To achieve good performance, it is of great importance to analyze the algorithms and workloads on a mobile phone and to redesign an efficient workload partitioning policy for the mobile hardware architecture.
In this study, we present an acceleration and optimization method on mobile devices for an exemplar computer vision application, the widely used Viola-Jones face detection algorithm [17,18], to exploit the capability of mobile CPUs and GPUs using OpenGL ES and OpenCL. Because of irregular memory access and an imbalanced workload, it is challenging to optimize the Viola-Jones face detection algorithm on a mobile SoC. This paper addresses the problems of making full use of the computing power of both the mobile CPU and GPU. The rest of the paper is organized as follows: Section 2 explains the GPGPU image processing framework on a mobile device. Section 3 discusses related works on accelerating face detection algorithms with GPUs. Section 4 briefly describes the Viola-Jones face detection algorithm. Section 5 presents the proposed accelerated face detection algorithm based on CPU-GPU co-processing. The experimental results are shown in Section 6. Finally, Section 7 concludes this paper.

2. MOBILE GPGPU IMAGE PROCESSING FRAMEWORK
There are several differences between a mobile GPU and a dGPU. First, because a mobile GPU and CPU are both integrated into the application processor (the SoC), they can save data transfer time by sharing the same memory bus. Second, the memory bandwidth of the mobile GPU is much lower than that of a dGPU. Additionally, the mobile GPU has far fewer compute units than a dGPU. For these reasons, it is necessary to carefully analyze specific algorithms and find an efficient mapping of them onto the mobile SoC. OpenGL ES and OpenCL support mobile SoCs. OpenGL ES is an embedded version of OpenGL, a standard graphics API providing a graphics rendering pipeline as well as a GPGPU tool. OpenCL is an open parallel computing framework which can be used on heterogeneous platforms including CPUs, GPUs, and even DSPs. Because of the nature of shared memory on a mobile SoC, OpenGL ES and OpenCL can both access the same data in memory without any data copying, so processing takes place in the same memory rather than in additional separate allocations. Considering the mobile GPU as a combination of the main rendering device used by OpenGL ES and the main compute device used by OpenCL, these functionalities time-share the GPU. When a video stream provider such as the mobile device camera supplies frame data as a GLES texture data source in the global memory, the GPU can use it in the cl_mem format for an OpenCL compute kernel. After the OpenCL compute kernel is executed, the computed result data are stored as a GLES texture that the GPU can render on the display.

Figure 1. Mobile GPGPU Image Processing Framework.

3. RELATED WORKS
Many works have accelerated Viola-Jones face detection on a dGPU, rather than on a mobile GPU, using the Compute Unified Device Architecture (CUDA) [10] or OpenCL. Sharma et al. [16] presented a face detection and tracking algorithm based on Haar-like features on the GTX285 and achieved more than 20 times the processing performance for VGA images. Oro et al. [12,13] also proposed a Haar-like feature based face detection algorithm for HD video on the GTX470 and achieved a speedup of 2.5 times. However, they used CUDA, which is a GPGPU programming tool for NVIDIA GPUs only. Compared with OpenCL, which can be used on several kinds of compute components, it is unable to deal with the imbalanced workload problem encountered when implementing the Viola-Jones face detection algorithm on GPUs. Several studies have attempted to address the imbalanced computation problem [2,3,6,11].
Hefenbrock et al. [2] presented a multi-GPU solution that evaluates each detection window in a different thread and computes each scaled window in parallel on a different GPU. Obukhov [11] proposed another solution that consists of a stage-parallel and pixel-parallel implementation. Jia et al. [3] resolved the irregular workload problem of the GPU by using an Uberkernel and persistent threads. However, these studies do not utilize the CPU resources because most computations are executed on the GPU. Although Wang et al. [21] made use of the computational capability of both the CPU and GPU cores, their algorithm is only optimized for the Intel Sandy Bridge chipset. Making full use of the computing power of both the CPU and GPU on a mobile SoC, this paper presents a solution for the imbalanced computation problem of Viola-Jones face detection on a mobile device using OpenCL.

4. VIOLA-JONES ALGORITHM
The Viola-Jones object detection framework was proposed by Paul Viola and Michael Jones for face detection. Its cascade classifier is a particular case of ensemble learning that speeds up detection enough to achieve real-time processing. The classifier is trained with AdaBoost, a variant of boosting, which weights the Haar-like features according to their suitability for face detection. However, we only discuss the detection process because the training process does not affect the speed of face detection.

4.1 Haar-like features
Haar-like features in the Viola-Jones algorithm are used to judge whether an image region contains a face. Using Haar-like features makes it easier to find the edges, lines and salient regions of a face. As shown in Figure 5, Haar-like features, which consist of rectangular areas, are calculated as the difference between the summed intensity of the white areas and that of the black areas.

4.2 Integral image
As mentioned above, calculating Haar-like features is very time-consuming because it is repeated for every sliding window.
With the integral image, the sum of the intensity values within any rectangular area can be obtained using only the integral image values at the four corner points of that area.

4.3 Cascade classifier
A cascade can be seen as a strong classifier structure which consists of a sequence of stages, each containing a number of weak classifiers. The early stages have a simple structure because they contain only a few features; as the stages progress, the classifiers become more complex, making it more difficult to proceed to the next stage. As shown in Figure 2, if a sub-window cannot pass the initial classifier, it is decided that there is no face and the sub-window does not proceed to the next stage. In contrast, if the sub-window successfully passes every stage up to the last one, it is determined to be a face. The advantage is that, because a sub-window can fail at any stage, processing of a rejected sub-window stops after very little work, which saves much processing time.

4.4 Scaling & Exhaustive sliding window
Detection is carried out in each sliding window, called a detection window, which scans the whole image as shown in Figure 5. After all the sliding windows in an image are evaluated, the same process is repeated for rescaled images to detect faces of different sizes.

5. PROPOSED METHOD
In this section, the parallel implementation of the face detection algorithm is presented first, followed by the optimization techniques.

5.1 Implementation

Figure 2. Cascade classifier.
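As a reference point for the implementation that follows, the serial detection process of Section 4 (integral image lookup, a Haar-like feature, and the early-rejecting cascade of Figure 2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the two-rectangle feature layout, the `Weak` and `Stage` tuple formats, and the vote-counting rule are simplifying assumptions made here, and real trained cascades use weighted features and real-valued stage thresholds.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]
# Hypothetical weak classifier: (dx, dy, width, height, threshold)
Weak = Tuple[int, int, int, int, int]
# Hypothetical stage: (weak classifiers, minimum number of votes needed to pass)
Stage = Tuple[Sequence[Weak], int]


def integral_image(img: Image) -> Image:
    """Summed-area table of size (h+1) x (w+1) with a zero first row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: Image, x: int, y: int, w: int, h: int) -> int:
    """Sum of the w x h rectangle at (x, y), using only four table lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii: Image, x: int, y: int, w: int, h: int) -> int:
    """Assumed feature layout: left (white) half minus right (black) half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


def evaluate_window(ii: Image, x0: int, y0: int,
                    stages: Sequence[Stage]) -> Tuple[bool, int]:
    """Run the cascade on one detection window.

    Returns (is_face, number_of_stages_evaluated); a window that fails a
    stage is rejected immediately and later stages are never computed.
    """
    for index, (weak_classifiers, min_votes) in enumerate(stages, start=1):
        votes = 0
        for dx, dy, w, h, threshold in weak_classifiers:
            if two_rect_feature(ii, x0 + dx, y0 + dy, w, h) > threshold:
                votes += 1
        if votes < min_votes:
            return False, index
    return True, len(stages)


# Toy data: a window that is bright on the left and dark on the right.
bright_left = [[9, 9, 0, 0] for _ in range(4)]
flat = [[5, 5, 5, 5] for _ in range(4)]
stages: List[Stage] = [
    ([(0, 0, 4, 4, 10)], 1),  # stage 1: one coarse feature
    ([(0, 0, 2, 2, 5)], 1),   # stage 2: a finer feature
]
```

The varying number of stages evaluated per window is the source of the imbalanced workload discussed in this paper: most windows return after the first stage, while a few run through the whole cascade.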


5.1.1 Reducing search area
The proposed method adopts skin color filtering, which reduces the detection region, to accelerate the face detection algorithm. Skin color filtering is robust to rotation, scale, and occlusion of a face. In particular, we use an effective pixel-based skin detection method to achieve real-time processing [4]. Examples are shown in Figure 3. The skin-colored image is obtained from a color image with the color channels (R, G, B) by applying the color threshold (1):

$$R > 95 \;\land\; G > 40 \;\land\; B > 20 \;\land\; \max\{R, G, B\} - \min\{R, G, B\} > 15 \;\land\; |R - G| > 15 \;\land\; R > G \;\land\; R > B \tag{1}$$

Figure 3. (a) Skin color filtering. (b) Detected image.

Because the method is based on a fixed color threshold, non-skin pixels whose values are similar to skin are also considered skin candidates. However, skin color filtering is still an effective way to reduce the overall processing. Even in real skin-colored areas, some pixel values do not satisfy the threshold, resulting in black holes which influence detection performance. To solve this problem, this paper adopts the dilation technique, which fills in the holes in the skin-colored areas.

Design for parallelism

CPU-GPU task-level parallelism
Figure 4 shows a flow diagram of the proposed face detection algorithm based on CPU-GPU co-processing. The right box contains the OpenCL kernels executed on the GPU, and the left box contains the serial computations carried out on the CPU. The 2-dimensional image memory object that was converted from the texture data by the OpenGL ES pipeline is delivered to the OpenCL compute units. In the first step, image scaling and skin color filtering, which screens for skin-colored pixels, are carried out. After dilation of the skin-colored mask in the GPU kernel, the CPU acts as the host and reads the skin-colored mask from the GPU. It should be noted that when the CPU reads an image object from the GPU, the data transfer overhead between the CPU and GPU is negligible due to the characteristic of the shared memory system on a mobile SoC. Collection of the skin-colored pixels' coordinates on the CPU can be executed concurrently with the Integral kernel on the GPU, which enables the computing resources of both the CPU and GPU to be fully used at the same time. Finally, in the cascade GPU kernel, detection window computations are executed for the skin-colored pixels delivered from the CPU.

Figure 4. Flow diagram of the proposed face detection algorithm.

Figure 5. Combined image for the GPU kernel.

Sliding window parallelism
Data parallelism means that the same task is executed simultaneously on multiple processors across different pieces of distributed data. In particular, there should be no data dependencies affecting the execution order among the processors. As mentioned above, to implement face detection, a cascade classifier is computed to determine whether a face is in the sliding window. Because the same process is executed independently for millions of detection windows, data parallelism is very efficient here.

Scale image parallelism
The Viola-Jones face detection algorithm achieves scale invariance by processing several scales of the image. A naive implementation iterates over the scaled-down images, so that almost all kernels are launched repeatedly. In such a process, the kernels in the loop waste computation resources because of barrier synchronization. In addition, repeatedly launching the same kernel incurs kernel launch overheads. To solve this problem, as shown in Figure 5, we merge the scaled-down images into a single image. This method reduces the waste of computing resources by eliminating kernel iterations.
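The idea of merging the scaled-down images into one combined image, so that a single kernel launch covers every scale, can be illustrated with a small host-side sketch. The text does not specify how the images are arranged in Figure 5, so the left-to-right layout below is an assumption, as are the function names and the `scale_step`, `min_size`, and `win` parameters.

```python
from typing import List, Tuple

Size = Tuple[int, int]                 # (width, height)
Placement = Tuple[int, int, int, int]  # (x_offset, y_offset, width, height)


def scaled_sizes(width: int, height: int,
                 scale_step: float, min_size: int) -> List[Size]:
    """Sizes of the image pyramid, stopping once a side drops below min_size."""
    sizes: List[Size] = []
    factor = 1.0
    while True:
        w, h = int(width / factor), int(height / factor)
        if w < min_size or h < min_size:
            break
        sizes.append((w, h))
        factor *= scale_step
    return sizes


def pack_side_by_side(sizes: List[Size]) -> Tuple[List[Placement], int, int]:
    """Assumed layout: place every scaled image left to right on one canvas."""
    placements: List[Placement] = []
    x = 0
    for w, h in sizes:
        placements.append((x, 0, w, h))
        x += w
    canvas_height = max((h for _, h in sizes), default=0)
    return placements, x, canvas_height


def window_origins(placements: List[Placement], win: int) -> List[Tuple[int, int]]:
    """Origins of all win x win detection windows in the combined image.

    Each origin stays inside its own scaled image, so no window straddles
    two scales; every origin is independent and could map to one work-item.
    """
    origins: List[Tuple[int, int]] = []
    for ox, oy, w, h in placements:
        for y in range(h - win + 1):
            for x in range(w - win + 1):
                origins.append((ox + x, oy + y))
    return origins


sizes = scaled_sizes(96, 96, scale_step=2.0, min_size=24)
placements, canvas_w, canvas_h = pack_side_by_side(sizes)
origins = window_origins(placements, win=24)
```

In this sketch the list of origins plays the role of the index space of one kernel launch; a per-scale loop would instead launch one kernel for each entry of `sizes`.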
When we make a unified single image by combining all the scaled-down images, a 2-dimensional image memory object has more advantages than a 1-dimensional global memory buffer, which is commonly used in OpenCL: it not only allows data to be accessed more quickly but also makes it easier to handle boundary conditions.

5.2 Optimization

Dynamic allocation of work-items
Because a GPU uses the SIMT (Single Instruction Multiple Thread) programming model, work-groups are the units that are scheduled and executed in the GPU. Global work size refers to the total number of work-items (threads) in a GPU and is set as the size of the image. Each pixel of an image is computed by a work-item in the GPU. Local work size indicates the number of work-items included in a work-group. As mentioned in Section 4, in the serial CPU version of the Viola-Jones algorithm, faces are detected by sliding a detection window over the image. However, in the cascade GPU kernel each work-item has its own detection window in parallel, which means it is not necessary to slide the detection window. Non-face pixels are rejected at stages 1 or 2, where simpler classifiers are used to reject the majority of detection windows. As shown in Figure 6(a), work-items that are rejected early need to wait until all work-items in the same work-group finish their detection window computation, because the work-group is the unit of execution in the GPU, which results in idle work-items. If only one work-item is still working until the final stage, the other work-items are idle. Thus, there is a serious imbalanced computation problem which leads to poor performance. To address the imbalanced workload problem, this study presents a new approach that dynamically allocates the global work size according to the number of skin-colored pixels. In other words, work-items compute only the detection windows of skin-colored pixels, which are less likely to be rejected early; non-skin pixels are not computed at all, which prevents idle threads from occurring and takes full advantage of the GPU resources. Figure 6(b) shows that the global work size is allocated according to the number of skin-colored pixels, and there are few idle work-items in the GPU.

Figure 6. Reduction of idle work-items in a GPU. (a) Original NDRange. (b) Optimized NDRange.

Local memory optimization
Similar to a dGPU, a mobile GPU also has performance bottlenecks due to global memory access.
A mobile GPU suffers from a longer latency for off-chip global memory access than a dGPU. Therefore, memory optimization is essential in parallel image processing on a mobile GPU. Local memory, where work-items in the same work-group can share data, has a lower latency than global memory. Thus, loading shared data into the local memory can reduce global memory access and improve processing performance. However, one should note that as more local memory is required by a kernel, fewer work-items are available to execute it. Therefore, it is important to analyze whether the data are suitable for sharing among the work-items in a work-group and to find the optimal size of the data to load. When each work-item computes a detection window in the cascade kernel, the same classifier data trained in advance are used by all work-items. Thus, this study tried to find the optimal size of the classifier data to load and thereby partially load the classifier data into local memory. We loaded 3 features of the classifier data of cascade stage 1, which most work-items share, because as higher stages are reached, more work-items have returned and fewer classifier data are shared. An average reduction of 12% in execution time was observed after using local memory.

6. EXPERIMENTAL RESULTS

6.1 Experiment set-up
For the experiment, we chose as a test platform the Galaxy S5 LTE-A smartphone, which is driven by a Qualcomm application processor. Qualcomm is the clear leader in the smartphone application processor market with the Snapdragon series. The Galaxy S5 LTE-A is powered by a Snapdragon 801 SoC with a 2.45 GHz quad-core Krait 400 CPU and a 578 MHz Adreno 330 quad-core GPU. The Adreno 330 GPU supports advanced graphics APIs, including OpenGL ES 3.0 and OpenCL 1.2. The mobile operating system was Android 5.0. The OpenCV library was used to implement face detection for the CPU version.
In the performance evaluation, we experimented with two different datasets. The first is the Image of Groups dataset [1], which contains color frontal face images and group images composed of a number of people. Additionally, this dataset covers varying illumination conditions, faces of various races, and face sizes. We collected 60 images containing 622 faces as part of the dataset. The test images were resized to HD (720p), maintaining a fixed aspect ratio, to fit the output size of the mobile display. The other dataset is INHA FACE, whose images are HD and show people at different distances (1 m, 3 m, and 5 m). The reason we used this dataset is to evaluate the relationship between the processing time and the amount of skin-colored pixels.

Figure 7. Execution time in each kernel (Scaling & Skin Color Filtering, Dilation, Integral, Cascade) for the CPU-only, GPU-only, and CPU-GPU versions.

6.2 Accuracy & Execution time
We used cascade classification from the OpenCV 2.4.9 library, which is well known in the field of computer vision. We set the same configuration parameters and then compared the CPU implementation of the well-optimized OpenCV library, which is widely used and considered accurate, with our CPU-GPU implementation. The detection results from each version were the same, which means the acceleration of our CPU-GPU implementation causes no loss of detection accuracy. In the first experiment, we measured the processing time within each kernel and compared the proposed CPU-GPU version with the CPU-only and GPU-only versions. As shown in Figure 7, the cascade kernel took most of the time because of the computational complexity of evaluating the detection windows. Compared to the other versions, the proposed CPU-GPU version spent less time in the cascade kernel because it reduces the idleness of the work-items. In addition, the CPU-GPU version had the lowest time cost, with a computational speed 3.22 times faster than that of the CPU-only version.

Figure 8. Average execution time according to distance.

In the second experiment, we measured the execution time according to the amount of skin-colored pixels. As shown in Figure 10, the shorter the distance between the camera and the people, the more skin-colored pixels are found, which results in more computational effort. In contrast, as the distance to the camera becomes longer, fewer skin-colored pixels are detected. Our experiments were carried out under different scenarios with distances of 1, 3 and 5 m. At 1 m, the images contained the largest amount of skin-colored pixels, so the processing time was the longest.
It was also observed that an increase in distance causes a decrease in the number of skin-colored pixels, thereby reducing processing time. Finally, the processing time at 5 m is the shortest because of the smallest amount of skin-colored pixels. Compared with the other implementations, when the distance was 1 m, 3 m and 5 m, the processing time of the proposed CPU-GPU method was ms, 35.16 ms, and ms, respectively, which shows that the CPU-GPU method had the best performance regarding processing time.

Figure 9. Results from the Image of Groups dataset.

Figure 10. Results from the INHA FACE dataset.

Table 1. Comparison of the performance of different methods with the Image of Groups dataset
Method      Execution time (ms)    fps    Speedup
CPU only
GPU only                                  x
CPU-GPU                                   x

Table 2. Comparison of the performance of different methods with the INHA FACE dataset
Method      Execution time (ms)    fps    Speedup
CPU only
GPU only                                  x
CPU-GPU                                   x

Tables 1 and 2 compare the performance of each method using the Image of Groups and the INHA FACE datasets, respectively. The method proposed in this study achieves speedups of 3.3 times and 6.29 times over the CPU-only method with the Image of Groups and INHA FACE datasets, respectively. Additionally, note that real-time processing was obtained with the INHA FACE dataset.

7. CONCLUSIONS
This paper presents an optimized parallel implementation of the Viola-Jones face detection algorithm as a case study in mapping a computer vision application onto a mobile SoC using CPU-GPU co-processing. To exploit the computational power of both the CPU and GPU, we discussed several parallelization and optimization methods to accelerate the algorithm: CPU-GPU task parallelism, sliding window parallelism, scale image parallelism, dynamic allocation of work-items, and local memory optimization. These methods resolved the imbalanced workload problem and improved the processing time on mobile SoCs. The performance is much better than that of a well-optimized CPU implementation from the OpenCV library. Finally, for future work, we plan to experiment with power consumption and port this algorithm to other mobile devices to validate and optimize our work.

8. ACKNOWLEDGEMENTS
This work was supported by the Industrial Strategic Technology Development Program ( , The Development of Fusion Processor based on Multi-Shader GPU) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

9. REFERENCES
[1] Gallagher, A.C. and Chen, T. Understanding Images of Groups of People. Computer Vision and Pattern Recognition (CVPR). (2009).
[2] Hefenbrock, D., Oberg, J., Thanh, N.T.N., Kastner, R. and Baden, S.B. Accelerating Viola-Jones face detection to FPGA-level using GPUs. Proceedings - IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM. (2010).
[3] Jia, H., Zhang, Y., Wang, W. and Xu, J. Accelerating Viola-Jones Face Detection Algorithm on GPUs. IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems. (2012).
[4] Kakumanu, P., Makrogiannis, S. and Bourbakis, N. A survey of skin-color modeling and detection methods. Pattern Recognition. 40, 3 (2007).
[5] Kang, S.H., Lee, S. and Park, I.K. Parallelization and Optimization of Feature Detection Algorithms on Embedded GPU. (2014).
[6] Li, E., Wang, B., Yang, L., Peng, Y., Du, Y., Zhang, Y. and Chiu, Y.-J. GPU and CPU Cooperative Acceleration for Face Detection on Modern Processors. IEEE International Conference on Multimedia and Expo. (2012).
[7] Liu, X., Lou, Y., Yu, A. and Lang, B. Search by mobile image based on visual and spatial consistency. Multimedia and Expo (ICME). (2011), 1-6.
[8] Munshi, A. and Leech, J. OpenGL ES common profile specification version (full specification). Khronos Group.
[9] Munshi, A. OpenCL specification 1.1. Khronos OpenCL Working Group.
[10] Nvidia. CUDA Runtime API, March.
[11] Obukhov, A. Haar classifiers for object detection with CUDA. GPU Computing Gems Emerald Edition.
[12] Oro, D., Fernández, C., Segura, C., Martorell, X. and Hernando, J. Accelerating Boosting-Based Face Detection on GPUs. International Conference on Parallel Processing. (2012).
[13] Oro, D., Fernández, C., Saeta, J.R., Martorell, X. and Hernando, J. Real-time GPU-based face detection in HD video sequences. Proceedings of the IEEE International Conference on Computer Vision. (2011).
[14] Pulli, K., Baksheev, A., Kornyakov, K. and Eruhimov, V. Real-time computer vision with OpenCV. Communications of the ACM. 55, 6 (2012), 61.
[15] Rahman, M., Ren, J. and Kehtarnavaz, N. Real-time implementation of robust face detection on mobile platforms. Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference on. (2009).
[16] Sharma, B., Thota, R., Vydyanathan, N. and Kale, A. Towards a robust, real-time face processing system using CUDA-enabled GPUs. International Conference on High Performance Computing (HiPC). (2009).
[17] Viola, P. and Jones, M. Rapid object detection using a boosted cascade of simple features. Computer Vision and Pattern Recognition (CVPR). 1, I-511 to I-518.
[18] Viola, P. and Jones, M. Robust real-time face detection. International Journal of Computer Vision. 57, 2.
[19] Wagner, D. and Schmalstieg, D. History and future of tracking for mobile phone augmented reality. IEEE International Symposium on Ubiquitous Virtual Reality, 7-10.
[20] Wang, G., Rister, B. and Cavallaro, J.R. Workload analysis and efficient OpenCL-based implementation of SIFT algorithm on a smartphone. IEEE Global Conference on Signal and Information Processing (December 2013).
[21] Wang, G., Xiong, Y., Yun, J. and Cavallaro, J.R. Accelerating computer vision algorithms using OpenCL framework on the mobile GPU - A case study. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. (2013).
