GPGPU on Mobile Devices
Introduction
- Addressing GPGPU for very mobile devices
  - Tablets
  - Smartphones
Introduction
- Why dedicated GPUs in mobile devices?
  - Gaming
  - Physics simulation for realistic effects
  - 3D-GUI / compositor effects
Introduction
- Why GPGPU on mobile devices?
  - Computational photography
    - Image enhancement
    - Image editing
  - Computer vision
    - Visual/image recognition
    - Geo-localization
    - Token recognition
  - Augmented reality
  - Easier support for new media codecs
Hardware
- Dedicated GPUs
  - PowerVR's GPU series
    - E.g. SGX540/543, used for the Nexus S / iPad
    - 4-8 cores at ~200 MHz
    - 20-35 MTriangles/s / 1000 MPixel/s fill rate
  - Nvidia's Tegra series
    - CPU/GPU combination (usually ARM CPU core)
    - Used mostly in tablets and cars
    - ULP (ultra-low power) GeForce GPU
    - ~300-400 MHz core clock speed
    - 4 pixel and 4 vertex shader processors
    - Not a unified architecture!
Hardware Nvidia's Tegra Development Kit
Graphics/GPU APIs
- OpenGL ES is the embedded 3D graphics standard
  - OpenGL for Embedded Systems
  - http://www.khronos.org/opengles/
- OpenVG is the 2D rendering standard
  - Open Vector Graphics
  - http://www.khronos.org/openvg
- EGL
  - Embedded-System Graphics Library
  - Interface to the window system
  - Mobile version of WGL and GLX
OpenGL ES 2.0
- Mostly programmable pipeline
  - No more fixed-function pipeline
  - No glBegin() / glEnd()
  - Drawing only via vertex arrays
- OpenGL ES Shading Language
  - Similar to GLSL
  - Sample from OpenGL ES Quick Reference Card
- Frame buffer objects available
- Depth test, stencil test etc.
OpenCL Embedded Profile
OpenCL Embedded Profile
- Stripped-down version of OpenCL
  - Minimum requirements can be smaller
  - No 64-bit integers required (long/ulong support is optional)
  - Reduced floating-point accuracy
  - Nearest sampling for float texture images
  - 2D/3D image support is optional
- Might run on a DSP, not on the GPU!
  - Or even on a mixed CPU/GPU/DSP environment
OpenCL Embedded Profile
- "OpenCL Embedded Profile Prototype in Mobile Device"
  - Nokia
  - http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5336267
OpenCL Embedded Profile
- Querying available profiles
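A minimal sketch of such a query, written in Python for brevity. In the OpenCL C API the profile is reported as a string through clGetPlatformInfo with CL_PLATFORM_PROFILE (or clGetDeviceInfo with CL_DEVICE_PROFILE), and is either "FULL_PROFILE" or "EMBEDDED_PROFILE". The helper names below are illustrative, and `list_profiles` assumes the third-party pyopencl binding is installed and that at least one OpenCL platform is present.

```python
EMBEDDED = "EMBEDDED_PROFILE"
FULL = "FULL_PROFILE"


def is_embedded_profile(profile):
    """True if an OpenCL profile string names the embedded profile."""
    return profile.strip() == EMBEDDED


def list_profiles():
    """Return (platform name, device name, device profile) triples.

    Assumption: pyopencl is installed and an OpenCL platform is available;
    otherwise the import or the platform query raises an exception.
    """
    import pyopencl as cl  # third-party binding, imported lazily

    rows = []
    for platform in cl.get_platforms():
        for device in platform.get_devices():
            rows.append((platform.name, device.name, device.profile))
    return rows
```

On a device that only offers the embedded profile, an application would typically use such a check to avoid optional features like 64-bit integers or 3D images.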
Application 1
- Accelerating image recognition on mobile devices using GPGPU, SPIE 2011
  - Miguel Bordallo López et al., University of Oulu & Nokia Research
- Goal: face tracking on mobile devices
  - Local binary patterns as features
  - AdaBoost for classification
- Hardware platform
  - TI OMAP3530
  - ARM Cortex-A8
  - PowerVR SGX530
  - E.g. Nokia N900
Mobile Face Tracking
- Most steps on the GPU
- Linear classifier on CPU
  - $c(f_1, \dots, f_n; x) = \operatorname{sgn}\left(\sum_{i=1}^{n} w_i f_i(x)\right)$
- Courtesy of Miguel Bordallo López
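The linear classifier above takes the sign of a weighted sum of feature responses. A minimal Python sketch, assuming each feature is a callable on the input x and that a sum of exactly zero is mapped to +1 (the slide does not state a convention for that case):

```python
def classify(weights, features, x):
    """Return +1 or -1: the sign of sum_i w_i * f_i(x).

    Assumption: a score of exactly 0 is classified as +1.
    """
    score = sum(w * f(x) for w, f in zip(weights, features))
    return 1 if score >= 0 else -1


# Two toy features on a 2-element input (purely illustrative).
toy_features = [lambda x: x[0], lambda x: x[1]]
toy_weights = [0.5, -1.0]
```

With these toy values, `classify(toy_weights, toy_features, (4, 1))` evaluates 0.5*4 - 1.0*1 = 1 and returns 1, while the input `(1, 4)` gives a negative score and returns -1.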
Mobile Face Tracking
- Optional preprocessing
  - Convert to gray scale
  - One quarter of the image per color channel
  - Better utilization of vec4 shader units
- Courtesy of Miguel Bordallo López
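The packing idea can be sketched on the CPU as follows: four parts of the grayscale image are stored in the R, G, B and A channels of a texture with half the width and height, so one vec4 operation handles four gray pixels. How the original work splits the image is not stated on the slide; this sketch assumes the four quadrants, and the function name is illustrative.

```python
def pack_quarters(gray):
    """Pack the four quadrants of a grayscale image into one RGBA image.

    gray is a list of rows; the result has half the width and height, and
    each pixel is (top-left, top-right, bottom-left, bottom-right).
    Assumption: the "quarters" are the four image quadrants.
    """
    h, w = len(gray), len(gray[0])
    if h % 2 or w % 2:
        raise ValueError("width and height must be even")
    hh, hw = h // 2, w // 2
    return [
        [
            (gray[y][x], gray[y][x + hw], gray[y + hh][x], gray[y + hh][x + hw])
            for x in range(hw)
        ]
        for y in range(hh)
    ]
```

Pixels near the quadrant borders need neighbors from another channel, which a real shader has to handle separately; the sketch ignores that detail.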
Mobile Face Tracking
- Local binary patterns (LBP)
  - A.k.a. census transform
  - Ojala et al., 1994
  - Zabih & Woodfill, 1994
- Texture-based features
- Robust to illumination changes
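A minimal Python sketch of the basic 3x3 LBP operator: each of the 8 neighbors is compared with the center pixel, and the comparison results form an 8-bit code. The bit order (clockwise from the top-left corner) and the use of `>=` are assumptions; implementations differ on both.

```python
# Neighbour offsets (dy, dx), clockwise from the top-left corner (assumed order).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, y, x):
    """8-bit LBP code of the interior pixel (y, x) of a grayscale image."""
    center = img[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_image(img):
    """LBP codes for all interior pixels (the 1-pixel border is skipped)."""
    h, w = len(img), len(img[0])
    return [[lbp_code(img, y, x) for x in range(1, w - 1)] for y in range(1, h - 1)]
```

Because only the ordering of neighbor and center values matters, adding a constant brightness offset to every pixel leaves all codes unchanged, which is the robustness to illumination changes noted above.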
Mobile Face Tracking
- LBP extraction run-time
- Rescaling & image preprocessing
- Courtesy of Miguel Bordallo López
Mobile Face Tracking
- Both preprocessing & LBP extraction
- Power consumption
- Courtesy of Miguel Bordallo López
Application 2
- OpenCL for image processing, Nokia
  - "OpenCL embedded profile prototype in mobile device," J. Leskelä et al., IEEE Workshop on Signal Processing Systems, 2009.
- Geometric distortion + blurring + color transformation
- Based on OpenCL, not OpenGL ES
- Leskelä et al., 2009
OpenCL for Image Processing
- Kernel steps (Leskelä et al., 2009)
  - Kernel header
  - Local declarations
  - Distort geometry (texture lookup)
  - Fetch neighboring pixels & blur
  - Color transformation and write to destination
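The three processing stages named above (geometry distortion via a lookup, neighborhood blur, color transformation) can be illustrated with a CPU reference sketch in Python, where `process_pixel` plays the role of one work-item. This is not the kernel from Leskelä et al.: the radial distortion, the 3x3 box blur, and the channel inversion are all stand-ins chosen for illustration, as are the function names and the parameter k.

```python
def clamp(v, lo, hi):
    return max(lo, min(hi, v))


def fetch(src, x, y):
    """Nearest-neighbour lookup with clamp-to-edge addressing."""
    h, w = len(src), len(src[0])
    return src[clamp(int(round(y)), 0, h - 1)][clamp(int(round(x)), 0, w - 1)]


def distort(x, y, w, h, k):
    """Hypothetical radial distortion about the image centre (k = 0: identity)."""
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    dx, dy = x - cx, y - cy
    norm = (cx * cx + cy * cy) or 1.0  # avoid division by zero for 1x1 images
    scale = 1.0 + k * (dx * dx + dy * dy) / norm
    return cx + dx * scale, cy + dy * scale


def process_pixel(src, x, y, k=0.1):
    """One 'work-item': distort, blur, colour-transform a single RGBA pixel."""
    h, w = len(src), len(src[0])
    sx, sy = distort(x, y, w, h, k)            # 1. distorted source coordinate
    acc = [0.0, 0.0, 0.0, 0.0]
    for oy in (-1, 0, 1):                      # 2. 3x3 box blur around it
        for ox in (-1, 0, 1):
            for c, v in enumerate(fetch(src, sx + ox, sy + oy)):
                acc[c] += v
    r, g, b, a = (v / 9.0 for v in acc)
    return (255.0 - r, 255.0 - g, 255.0 - b, a)  # 3. illustrative inversion


def process_image(src, k=0.1):
    """Apply process_pixel to every destination pixel."""
    h, w = len(src), len(src[0])
    return [[process_pixel(src, x, y, k) for x in range(w)] for y in range(h)]
```

Each destination pixel depends only on reads from the source image, so all pixels can be computed independently, which is what makes this kind of filter chain a natural fit for one work-item per pixel.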
OpenCL for Image Processing
- Run-time comparison
  - CPU: ARM Cortex-A8, 550 MHz
  - GPU: PowerVR SGX530, 110 MHz
  - 3 MPixel RGBA images
- CPU only: 8.6 s/image
  - Slow FPU
  - Will work faster with fixed-point arithmetic
- GPU only: 2.4 s/image
- CPU+GPU: 2.5 s/image
  - Bad scheduling
- CPU+GPU improved: 2.02 s/image
- Leskelä et al., 2009
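The run-times above translate into speedups over the CPU-only version as follows. This is plain arithmetic on the quoted figures; the dictionary keys and function name are illustrative.

```python
# Run-times in seconds per 3 MPixel image, as quoted from Leskelä et al., 2009.
runtimes_s = {
    "CPU only": 8.6,
    "GPU only": 2.4,
    "CPU+GPU": 2.5,
    "CPU+GPU improved": 2.02,
}


def speedups(runtimes, baseline="CPU only"):
    """Speedup of every configuration relative to the baseline run-time."""
    base = runtimes[baseline]
    return {name: base / t for name, t in runtimes.items()}


for name, s in speedups(runtimes_s).items():
    print(f"{name}: {s:.2f}x")
```

The GPU-only version comes out at roughly 3.6x and the improved CPU+GPU version at roughly 4.3x, even though the GPU runs at one fifth of the CPU clock frequency.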
OpenCL for Image Processing
- Energy consumption
  - CPU: 3.93 J/frame
  - GPU: 0.56 J/frame (~14% of the CPU figure)
    - 0.26 J/frame due to CPU-GPU data transfer
- When power consumption matters
  - High parallelism at low clock frequencies (110 MHz) is better than low parallelism at high clock frequencies (550 MHz)
  - Dissipation increases super-linearly with frequency
- Leskelä et al., 2009
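The energy figures can be checked with a few lines of arithmetic. The sketch assumes that the 0.26 J/frame transfer cost is included in the 0.56 J/frame GPU total, which the slide suggests but does not state outright; the variable names are illustrative. As background for the last point, a common first-order model for CMOS dynamic power is P ≈ C·V²·f, and higher clock rates typically require a higher supply voltage V, so power grows faster than linearly with f.

```python
# Energy per frame in joules, as quoted from Leskelä et al., 2009.
cpu_j = 3.93
gpu_j = 0.56
transfer_j = 0.26  # assumed to be part of gpu_j

gpu_vs_cpu = gpu_j / cpu_j            # fraction of the CPU energy
transfer_share = transfer_j / gpu_j   # fraction of GPU energy spent on transfer
compute_j = gpu_j - transfer_j        # remaining GPU-side energy per frame

print(f"GPU energy relative to CPU: {gpu_vs_cpu:.0%}")
print(f"Transfer share of GPU energy: {transfer_share:.0%}")
print(f"GPU energy without transfer: {compute_j:.2f} J/frame")
```

Under that assumption, nearly half of the GPU-side energy goes into moving data between CPU and GPU, so reducing transfers would be the most direct way to lower the energy per frame further.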