GPGPU on Mobile Devices

Introduction
- Addressing GPGPU for very mobile devices
  - Tablets
  - Smartphones

Introduction: Why dedicated GPUs in mobile devices?
- Gaming
- Physics simulation for realistic effects
- 3D GUI / compositor effects

Introduction: Why GPGPU on mobile devices?
- Computational photography
  - Image enhancement
  - Image editing
- Computer vision
  - Visual/image recognition
  - Geo-localization
  - Token recognition
- Augmented reality
- Easier support for new media codecs

Hardware: Dedicated GPUs
- PowerVR's GPU series
  - E.g. SGX540/543, used in the Nexus S / iPad
  - 4-8 cores at ~200 MHz
  - 20-35 MTriangles/s, 1000 MPixel/s fill rate
- Nvidia's Tegra series
  - CPU/GPU combination (usually an ARM CPU core)
  - Used mostly in tablets and cars
  - ULP (ultra-low power) GeForce GPU
  - ~300-400 MHz core clock speed
  - 4 pixel and 4 vertex shader processors
  - Not a unified architecture!

Hardware: Nvidia's Tegra Development Kit

Graphics/GPU APIs
- OpenGL ES is the embedded 3D graphics standard
  - OpenGL for Embedded Systems
  - http://www.khronos.org/opengles/
- OpenVG is the 2D rendering standard
  - Open Vector Graphics
  - http://www.khronos.org/openvg
- EGL
  - Embedded-System Graphics Library
  - Interface to the window system
  - Mobile version of WGL and GLX

OpenGL ES 2.0
- Mostly programmable pipeline
  - No more fixed-function pipeline
  - No glBegin() / glEnd()
  - Drawing only via vertex arrays
- OpenGL ES Shading Language
  - Similar to GLSL
  - [Figure: sample from the OpenGL ES Quick Reference Card]
- Frame buffer objects available
- Depth test, stencil test etc.

OpenCL Embedded Profile

OpenCL Embedded Profile
- Stripped-down version of OpenCL
- Minimum requirements can be smaller
  - 64-bit integers need not be supported
  - Reduced floating-point accuracy
  - Only nearest sampling is required for float texture images
  - 2D/3D image support is optional
- Might run on a DSP, not on the GPU!
  - Or even in a mixed CPU/GPU/DSP environment

OpenCL Embedded Profile
- "OpenCL Embedded Profile Prototype in Mobile Device" (Nokia)
  - http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5336267

OpenCL Embedded Profile: Querying available profiles
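The listing that belonged to this slide is not preserved. As a rough sketch only: OpenCL reports the profile of a platform or device as the string "FULL_PROFILE" or "EMBEDDED_PROFILE" (CL_PLATFORM_PROFILE / CL_DEVICE_PROFILE). The example below assumes the third-party pyopencl binding and a working OpenCL driver; the helper names is_embedded_profile and list_device_profiles are hypothetical and not taken from the slides.

```python
def is_embedded_profile(profile_string):
    """True if an OpenCL profile string denotes the embedded profile."""
    return profile_string.strip() == "EMBEDDED_PROFILE"


def list_device_profiles():
    """Return (device name, profile string) pairs for all OpenCL devices.

    Assumption: pyopencl is installed and at least one OpenCL platform is
    available. The import is local so the helper above works without it.
    """
    import pyopencl as cl  # third-party binding, not part of the slides

    result = []
    for platform in cl.get_platforms():
        for device in platform.get_devices():
            name = device.get_info(cl.device_info.NAME)
            profile = device.get_info(cl.device_info.PROFILE)
            result.append((name, profile))
    return result
```

On a machine with an OpenCL driver, calling list_device_profiles() and passing each profile string to is_embedded_profile() would show which devices are limited to the embedded profile.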

Application 1
- "Accelerating image recognition on mobile devices using GPGPU", SPIE 2011
  - Miguel Bordallo López et al., University of Oulu & Nokia Research
- Goal: face tracking on mobile devices
  - Local binary patterns as features
  - AdaBoost for classification
- Hardware platform: TI OMAP3530
  - ARM Cortex-A8
  - PowerVR SGX530
  - E.g. Nokia N900

Mobile Face Tracking
- Most steps on the GPU
- Linear classifier on CPU:
  $c(f_1, \dots, f_n; x) = \operatorname{sgn}\left(\sum_{i=1}^{n} w_i f_i(x)\right)$
(Courtesy of Miguel Bordallo López)
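The linear classifier above is a weighted vote over weak classifiers. A minimal sketch, assuming each f_i returns +1 or -1 and that a score of exactly zero counts as positive (the slide does not say how ties are handled); the three threshold classifiers and their weights are made up for illustration.

```python
def sign(value):
    """Return +1 for non-negative input and -1 otherwise (ties go to +1)."""
    return 1 if value >= 0 else -1


def classify(x, weak_classifiers, weights):
    """Evaluate c(f_1..f_n; x) = sgn(sum_i w_i * f_i(x))."""
    score = sum(w * f(x) for w, f in zip(weights, weak_classifiers))
    return sign(score)


# Hypothetical weak classifiers: each one thresholds a single feature value.
weak = [
    lambda x: 1 if x[0] > 0.5 else -1,
    lambda x: 1 if x[1] > 0.2 else -1,
    lambda x: 1 if x[2] < 0.7 else -1,
]
weights = [0.6, 0.3, 0.4]  # illustrative weights, not from the paper

print(classify([0.9, 0.3, 0.1], weak, weights))  # all three vote +1
print(classify([0.9, 0.1, 0.9], weak, weights))  # two of three vote -1
```

In AdaBoost the weights w_i come out of training; here they are fixed by hand only to show the evaluation step that runs on the CPU.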

Mobile Face Tracking
- Optional preprocessing
  - Convert to gray scale
  - One quarter of the image per color channel
  - Better utilization of vec4 shader units
(Courtesy of Miguel Bordallo López)
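One way to read "one quarter of the image per color channel" is that the gray-scale image is split into four quadrants, which are stored in the R, G, B and A channels of a texture half as wide and half as high, so each vec4 operation processes four gray pixels at once. The sketch below assumes this quadrant layout and even image dimensions; the exact layout used by the authors is not given on the slide.

```python
def pack_quadrants(gray):
    """Pack a gray-scale image (list of rows) into RGBA tuples.

    Assumed layout: R = top-left, G = top-right, B = bottom-left,
    A = bottom-right quadrant. Width and height must be even.
    """
    height, width = len(gray), len(gray[0])
    half_h, half_w = height // 2, width // 2
    packed = []
    for y in range(half_h):
        row = []
        for x in range(half_w):
            row.append((
                gray[y][x],                    # R: top-left quadrant
                gray[y][x + half_w],           # G: top-right quadrant
                gray[y + half_h][x],           # B: bottom-left quadrant
                gray[y + half_h][x + half_w],  # A: bottom-right quadrant
            ))
        packed.append(row)
    return packed


# 4x4 test image whose pixel value encodes its position: 4 * y + x.
gray = [[4 * y + x for x in range(4)] for y in range(4)]
packed = pack_quadrants(gray)
print(packed[0][0])
```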

Mobile Face Tracking
- Local binary patterns (LBP)
  - A.k.a. census transform
  - Ojala et al., 1994
  - Zabih & Woodfill, 1994
- Texture-based features
- Robust to illumination changes
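A minimal sketch of the basic 3x3 LBP operator: each of the 8 neighbours is compared with the centre pixel and contributes one bit. The bit order (clockwise from the top-left neighbour) and the use of >= for the comparison are assumptions; conventions differ between papers. Because only comparisons are used, adding a constant brightness offset leaves the code unchanged, which illustrates the robustness to illumination changes.

```python
# Neighbour offsets (dy, dx), clockwise starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, y, x):
    """8-bit LBP code of the pixel at (y, x); needs a full 3x3 neighbourhood."""
    center = img[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_image(img):
    """LBP codes for all interior pixels (the one-pixel border is skipped)."""
    height, width = len(img), len(img[0])
    return [[lbp_code(img, y, x) for x in range(1, width - 1)]
            for y in range(1, height - 1)]


patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
brighter = [[v + 50 for v in row] for row in patch]

print(lbp_code(patch, 1, 1), lbp_code(brighter, 1, 1))
```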

Mobile Face Tracking
- LBP extraction run-time
- Rescaling & image preprocessing
(Courtesy of Miguel Bordallo López)

Mobile Face Tracking
- Both preprocessing & LBP extraction
- Power consumption
(Courtesy of Miguel Bordallo López)

Application 2
- OpenCL for image processing, Nokia
  - J. Leskelä et al., "OpenCL embedded profile prototype in mobile device," IEEE Workshop on Signal Processing Systems, 2009
- Geometric distortion + blurring + color transformation
- Based on OpenCL, not OpenGL ES
(Leskelä et al., 2009)

OpenCL for Image Processing: Kernel header
(Leskelä et al., 2009)

OpenCL for Image Processing: Local declarations
(Leskelä et al., 2009)

OpenCL for Image Processing: Distort geometry (texture lookup)
(Leskelä et al., 2009)

OpenCL for Image Processing: Fetch neighboring pixels & blur
(Leskelä et al., 2009)

OpenCL for Image Processing: Color transformation and write to destination
(Leskelä et al., 2009)
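The kernel listings for these steps are not preserved, so the following is only a CPU-side sketch of the same three stages applied per output pixel, the way a single work-item would handle them: distort the lookup coordinate, fetch and blur the neighbourhood, then transform the color and write the result. The concrete choices (horizontal mirroring as the distortion, a 3x3 box blur, color inversion, clamp-to-edge addressing) are placeholders and are not the operations used by Leskelä et al.

```python
def clamp(value, low, high):
    return max(low, min(high, value))


def distort(x, y, width, height):
    """Hypothetical geometric distortion: mirror the image horizontally."""
    return width - 1 - x, y


def fetch(img, x, y):
    """Pixel fetch with clamp-to-edge addressing."""
    height, width = len(img), len(img[0])
    return img[clamp(y, 0, height - 1)][clamp(x, 0, width - 1)]


def process_pixel(img, x, y):
    """Compute one output pixel; corresponds to one work-item."""
    height, width = len(img), len(img[0])
    # Step 1: distort geometry (where to read from).
    sx, sy = distort(x, y, width, height)
    # Step 2: fetch the 3x3 neighbourhood and average it (box blur).
    acc = [0.0, 0.0, 0.0]
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            pixel = fetch(img, sx + dx, sy + dy)
            for c in range(3):
                acc[c] += pixel[c]
    blurred = [a / 9.0 for a in acc]
    # Step 3: color transformation (placeholder: invert) and return.
    return tuple(255.0 - b for b in blurred)


def run(img):
    """Apply process_pixel to every pixel of an RGB image (list of rows)."""
    height, width = len(img), len(img[0])
    return [[process_pixel(img, x, y) for x in range(width)]
            for y in range(height)]


uniform = [[(10, 20, 30)] * 3 for _ in range(3)]
out = run(uniform)
print(out[0][0])
```

Fusing all three stages into one pass means each output pixel is written once, without intermediate images in memory.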

OpenCL for Image Processing: Run-time comparison
- CPU: ARM Cortex-A8, 550 MHz
- GPU: PowerVR SGX530, 110 MHz
- 3 MPixel RGBA images
- CPU only: 8.6 s/image
  - Slow FPU
  - Would run faster with fixed-point arithmetic
- GPU only: 2.4 s/image
- CPU+GPU: 2.5 s/image
  - Bad scheduling
- CPU+GPU improved: 2.02 s/image
(Leskelä et al., 2009)
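The speedups implied by these timings can be worked out directly. The short calculation below uses only the four run-times from the slide; the dictionary keys are labels chosen here for readability.

```python
# Run-times in seconds per 3 MPixel image, as given on the slide.
runtimes = {
    "cpu_only": 8.6,
    "gpu_only": 2.4,
    "cpu_gpu": 2.5,
    "cpu_gpu_improved": 2.02,
}

# Speedup of each configuration relative to the CPU-only baseline.
speedups = {name: runtimes["cpu_only"] / t for name, t in runtimes.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

The GPU alone is roughly 3.6 times faster than the CPU alone, the naive CPU+GPU split is slightly slower than GPU only, and the improved split reaches roughly 4.3 times.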

OpenCL for Image Processing: Energy consumption
- CPU: 3.93 J/frame
- GPU: 0.56 J/frame (~14% of the CPU figure)
  - 0.26 J/frame due to CPU-GPU data transfer
- When power consumption matters
  - High parallelism at low clock frequencies (110 MHz) is better than low parallelism at high clock frequencies (550 MHz)
  - Dissipation increases super-linearly with frequency
(Leskelä et al., 2009)
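The energy figures and the frequency argument can be checked with a few lines of arithmetic. Two assumptions are made here that the slide does not state: the 0.26 J transfer cost is taken to be part of the 0.56 J GPU total, and the super-linear dissipation is modelled with the common textbook approximation that dynamic power grows with the cube of the clock frequency when supply voltage is scaled along with it. The second part is a toy model with perfect parallel scaling, not a measurement.

```python
# Energy per frame in joules, as given on the slide.
cpu_energy = 3.93
gpu_energy = 0.56
transfer_energy = 0.26  # assumed to be included in gpu_energy

gpu_vs_cpu = gpu_energy / cpu_energy           # about 0.14
transfer_share = transfer_energy / gpu_energy  # about 0.46

# Toy model: power per unit ~ f**3 (assumption, see lead-in).
# Compare 1 unit at 550 MHz with 5 units at 110 MHz, i.e. the same
# aggregate clock rate if the work parallelizes perfectly.
fast_units, fast_mhz = 1, 550.0
slow_units, slow_mhz = 5, 110.0
relative_power = (slow_units * slow_mhz ** 3) / (fast_units * fast_mhz ** 3)

print(f"GPU energy / CPU energy: {gpu_vs_cpu:.0%}")
print(f"Transfer share of GPU energy: {transfer_share:.0%}")
print(f"Relative power, 5 x 110 MHz vs 1 x 550 MHz: {relative_power:.2f}")
```

Under this model the wide, slow configuration needs about 4% of the power of the narrow, fast one for the same nominal throughput, which points in the same direction as the measured 14%.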