A GRAPHICS PROCESSING UNIT IMPLEMENTATION OF THE PARTICLE FILTER

Gustaf Hendeby, Jeroen D. Hol, Rickard Karlsson, Fredrik Gustafsson
Department of Electrical Engineering, Automatic Control
Linköping University, Sweden
{hendeby, hol, rickard, fredrik}@isy.liu.se

ABSTRACT

Modern graphics cards for computers, and especially their graphics processing units (GPUs), are designed for fast rendering of graphics. In order to achieve this, GPUs are equipped with a parallel architecture which can be exploited for general-purpose computing on the GPU (GPGPU) as a complement to the central processing unit (CPU). In this paper, GPGPU techniques are used to make a parallel GPU implementation of state-of-the-art recursive Bayesian estimation using particle filters (PF). The modifications made to obtain a parallel particle filter, especially for the resampling step, are discussed, and the performance of the resulting GPU implementation is compared to that achieved with a traditional CPU implementation. The resulting GPU filter is faster than the CPU filter, with the same accuracy, when many particles are used, and it shows how the particle filter can be parallelized.

1. INTRODUCTION

Modern graphics processing units (GPUs) are designed to handle huge amounts of data about a scene and to render output to the screen in real time. To achieve this, the GPU is equipped with a single instruction multiple data (SIMD) parallel architecture. GPUs are developing rapidly in order to meet the ever increasing demands from the computer game industry, and as a side-effect, general-purpose computing on graphics processing units (GPGPU) has emerged to utilize this new source of computational power [1-3]. For highly parallelizable algorithms the GPU may even outperform the sequential central processing unit (CPU).

The particle filter (PF) is an algorithm to perform recursive Bayesian estimation [4-6]. Due to its nature, a large part of it consists of performing identical operations on many particles (samples), so it is potentially well suited for parallel implementation. Successful parallelization may lead to a drastic reduction of computation time and open up for new applications requiring large state-space descriptions with many particles. Nonetheless, filtering and estimation algorithms have only recently been investigated in this context, see for instance [7, 8].

There are many types of parallel hardware available nowadays; examples include multicore processors, field-programmable gate arrays (FPGAs), computer clusters, and GPUs. GPUs are low-cost and easily accessible SIMD parallel hardware; almost every new computer comes with a decent graphics card. Hence, GPUs are an interesting option for speeding up a PF and for testing parallel implementations.

A first GPU implementation of the PF was reported in [9] for a visual tracking computer vision application. In contrast, in this paper a general PF GPU implementation is developed. To the best of the authors' knowledge, no successful complete implementation of a general PF on a GPU has yet been reported, and this article aims to fill this gap: GPGPU techniques are used to implement a PF on a GPU, and its performance is compared to that of a CPU implementation. (This work has been funded by the Swedish Research Council (VR), the EU-IST project MATRIS, and the Strategic Research Center MOVIII, funded by the Swedish Foundation for Strategic Research, SSF.)

The paper is organized as follows: In Section 2, GPGPU programming is briefly introduced; this is then used in Sections 3 and 4 to discuss the various aspects of the PF requiring special attention for a GPU implementation. Results from CPU and GPU implementations are compared in Section 5, and concluding remarks are given in Section 6.

Figure 1: The graphics pipeline. The vertex and fragment processors can be programmed with user code which will be evaluated in parallel on several pipelines. (See Section 2.1.)
2. GENERAL PURPOSE GRAPHICS PROGRAMMING

GPUs operate according to the standardized graphics pipeline (see Figure 1), which is implemented at the hardware level [2]. This pipeline, which defines how graphics should be processed, is highly optimized for the typical graphics application, i.e., displaying 3D objects.

The vertex processor receives vertices, i.e., corners of the geometrical objects to display, and transforms and projects them to determine how the objects should be shown on the screen. All vertices are processed independently and as much in parallel as there are pipelines available. In the rasterizer it is determined what fragments, or potential pixels, the geometrical shapes may result in, and the fragments are passed on to the fragment processor. The fragments are then processed independently and as much in parallel as there are pipelines available, and the resulting color of the pixels is stored in the frame buffer before being shown on the screen.

At the hardware level the graphics pipeline is implemented using a number of processors, each having multiple pipelines performing the same instruction on different data. That is, GPUs are SIMD processors, and each processing pipeline can be thought of as a parallel sub-processor.

2.1 Programming the GPU

The two steps in the graphics pipeline open to programming are the vertex processor (working with the primitives making up the polygons to be rendered) and the fragment processor (working with fragments, i.e., potential pixels in the final result). Both these processors can be controlled with programs called shaders, and both consist of several parallel pipelines (sub-processors) for SIMD operations. Shaders, or GPU programs, were introduced to replace what used to be fixed functionality in the graphics pipeline with more flexible programmable processors. They were mainly intended to allow for more advanced graphics effects, but they also got GPGPU started.

Programming the vertex and fragment processors is in many respects very similar to programming a CPU, with limitations and extensions made to better support the graphics card and its intended usage, but it should be kept in mind that the code runs in parallel on multiple pipelines of the processor. Some prominent differences include the basic data types which are available: most operations of a GPU operate on colors (represented by one to four floating point numbers), and data is sent to and from the graphics card using textures (1D-3D arrays of color data). In newer generations of GPUs, 32-bit floating point operations are supported, but the rounding units do not fully conform to the IEEE floating point standard, hence providing somewhat poorer numerical accuracy.

In order to use the GPU for general-purpose calculations, a typical GPGPU application applies a program structure similar to Algorithm 1. These very simple steps make sure that the fragment program is executed once for every element of the data. The workload is automatically distributed over the available processor pipelines.

Algorithm 1 GPGPU skeleton program
1. Program the fragment shader with the desired operation.
2. Send the data to the GPU in the form of a texture.
3. Draw a rectangle of suitable size on the screen to start the calculation.
4. Read back the resulting texture to the CPU.

(Footnote: The stream processing capabilities of the upcoming GPU generations might change this rather complicated method of performing GPGPU.)

2.2 GPU Programming Language

There are various ways to access the GPU resources as a programmer, including C for graphics (Cg) [10] and OpenGL [11], which includes the OpenGL Shading Language (GLSL) [12]. This paper uses GLSL, which operates closer to the hardware than Cg. For more information and alternatives, see [1, 2, 10].

To run GLSL code on the GPU, the OpenGL application programming interface (API) is used [11, 12]. The GLSL code is passed as text to the API, which compiles and links the different shaders into binary code that is sent to the GPU and executed the next time the graphics card is asked to render a scene.
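To make the compile-and-link step just described concrete, the following C++ sketch shows how a GLSL fragment shader, passed as text, could be turned into a program object through the OpenGL 2.0 API. It is only a minimal illustration, not the paper's code: it assumes an OpenGL context and the GL 2.0 entry points are already available (here via GLEW), and error handling is reduced to a compile-status check. The example shader at the end performs a simple element-wise operation on a texture, the kind of per-fragment program used in the skeleton of Algorithm 1.

#include <GL/glew.h>   // assumed loader providing the OpenGL 2.0 entry points
#include <cstdio>

// Build a program object from fragment shader source text. The GLSL source is
// passed as text to the API, which compiles and links it into GPU binary code.
GLuint buildFragmentProgram(const char* fragmentSource)
{
    GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(shader, 1, &fragmentSource, NULL);
    glCompileShader(shader);

    GLint ok = GL_FALSE;
    glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
    if (ok != GL_TRUE) {
        char log[1024];
        glGetShaderInfoLog(shader, sizeof(log), NULL, log);
        std::fprintf(stderr, "shader compilation failed: %s\n", log);
        return 0;
    }

    GLuint program = glCreateProgram();
    glAttachShader(program, shader);
    glLinkProgram(program);   // the linked binary is what the GPU executes
    return program;
}

// Example GLSL fragment shader (hypothetical): squares every texture element,
// i.e., an element-wise operation executed once per fragment.
static const char* kSquareShader =
    "uniform sampler2D data;\n"
    "void main() {\n"
    "    vec4 v = texture2D(data, gl_TexCoord[0].xy);\n"
    "    gl_FragColor = v * v;\n"
    "}\n";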
3. RECURSIVE BAYESIAN ESTIMATION

The general nonlinear filtering problem is to estimate the state, x_t, of a state-space system

    x_{t+1} = f(x_t, w_t),        (1a)
    y_t = h(x_t) + e_t,           (1b)

where y_t are the measurements and w_t ~ p_w(w_t) and e_t ~ p_e(e_t) are the process and measurement noise, respectively. The function f describes the dynamics of the system, h the measurements, and p_w and p_e are probability density functions (PDFs) of the involved noise. For the important special case of linear-Gaussian dynamics and linear-Gaussian observations, the Kalman filter [13, 14] solves the estimation problem in an optimal way. A more general solution is the particle filter (PF) [4-6], which approximately solves the Bayesian inference for the posterior state distribution [15], given by

    p(x_{t+1} | Y_t) = ∫ p(x_{t+1} | x_t) p(x_t | Y_t) dx_t,                 (2a)
    p(x_t | Y_t) = p(y_t | x_t) p(x_t | Y_{t-1}) / p(y_t | Y_{t-1}),         (2b)

where Y_t = {y_i}_{i=1}^t is the set of available measurements. The PF uses statistical methods to approximate the integrals. The basic PF algorithm is given in Algorithm 2.

Algorithm 2 Basic Particle Filter [5]
1. Let t := 0 and generate N particles {x_0^(i)}_{i=1}^N ~ p(x_0).
2. Measurement update: compute the particle weights ω_t^(i) = p(y_t | x_t^(i)) / Σ_j p(y_t | x_t^(j)).
3. Resample:
   (a) Generate N uniform random numbers {u_t^(i)}_{i=1}^N ~ U(0,1).
   (b) Compute the cumulative weights c_t^(i) = Σ_{j=1}^i ω_t^(j).
   (c) Generate N new particles using u_t^(i) and c_t^(i): {x_t^(i*)}_{i=1}^N where Pr(x_t^(i*) = x_t^(j)) = ω_t^(j).
4. Time update:
   (a) Generate process noise {w_t^(i)}_{i=1}^N ~ p_w(w_t).
   (b) Simulate new particles x_{t+1}^(i) = f(x_t^(i*), w_t^(i)).
5. Let t := t + 1 and repeat from step 2.
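As a point of reference for the parallelization discussed next, the following C++ sketch runs one iteration of Algorithm 2 sequentially on the CPU for a scalar random-walk model. The model, the noise levels, the single measurement value, and the simple resampling loop are illustrative assumptions, not the model or code used in the paper; the sketch only spells out the measurement update, resampling, and time update steps that the GPU implementation parallelizes.

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    const int N = 1000;                                         // number of particles
    std::mt19937 rng(0);
    std::normal_distribution<double> processNoise(0.0, 0.5);   // p_w, assumed
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    const double measurementStd = 1.0;                          // p_e, assumed

    // 1. Initialization: draw N particles from p(x_0).
    std::vector<double> x(N), w(N), c(N), xNew(N);
    std::normal_distribution<double> prior(0.0, 1.0);
    for (int i = 0; i < N; ++i) x[i] = prior(rng);

    const double y = 0.7;                                       // one made-up measurement y_t

    // 2. Measurement update: unnormalized weights p(y_t | x_t^(i)), then normalization.
    double wSum = 0.0;
    for (int i = 0; i < N; ++i) {
        double e = y - x[i];                                    // h(x) = x in this toy model
        w[i] = std::exp(-0.5 * e * e / (measurementStd * measurementStd));
        wSum += w[i];
    }
    for (int i = 0; i < N; ++i) w[i] /= wSum;

    // 3. Resampling: cumulative weights and selection with uniform numbers,
    //    so that Pr(x^(i*) = x^(j)) = w^(j).
    c[0] = w[0];
    for (int i = 1; i < N; ++i) c[i] = c[i - 1] + w[i];
    for (int k = 0; k < N; ++k) {
        double u = uniform(rng);
        int i = 0;
        while (i < N - 1 && u >= c[i]) ++i;                     // find u in [c^(i-1), c^(i))
        xNew[k] = x[i];
    }
    x.swap(xNew);

    // 4. Time update: x_{t+1}^(i) = f(x_t^(i*), w_t^(i)) with f(x, w) = x + w.
    for (int i = 0; i < N; ++i) x[i] += processNoise(rng);

    // Point estimate after one iteration (mean of the particle cloud).
    double mean = 0.0;
    for (int i = 0; i < N; ++i) mean += x[i] / N;
    std::printf("estimate after one iteration: %f\n", mean);
    return 0;
}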

4. GPU BASED PARTICLE FILTER

To implement a parallel PF on a GPU, there are several aspects of Algorithm 2 that require special attention. Resampling and weight normalization are the two most challenging steps to implement in a parallel fashion, since in these steps all particles and their weights interact with each other. The main difficulties are cumulative summation, and selection and redistribution of particles. In the following sections, solutions suitable for parallel implementation are proposed for these tasks, together with a discussion of issues with random number generation, likelihood evaluation as part of the measurement update, and state propagation as part of the time update.

4.1 Random Number Generation

At present, state-of-the-art graphics cards do not have sufficient support for random number generation for usage in a PF, since the statistical properties of the built-in generators are too poor. The algorithm in this paper therefore relies on random numbers generated on the CPU and passed to the GPU. This introduces quite a lot of data transfer, as several random numbers per particle are required for one iteration of the PF. Uploading data to the graphics card is rather quick, but still some performance is lost. Generating random numbers on the GPU suitable for use in Monte Carlo simulations is an ongoing research topic, see e.g. [16-18]. Doing so would not only reduce data transport and allow a standalone GPU implementation; an efficient parallel version would also improve overall performance, as the random number generation itself takes a considerable amount of time.

4.2 Likelihood Evaluation and State Propagation

Both likelihood evaluation (as part of the measurement update) and state propagation (in the time update), Steps 2 and 4b of Algorithm 2, can be implemented straightforwardly in a parallel fashion, since all particles are handled independently. As a consequence, both operations can be performed in O(1) time with N parallel processors, i.e., one processing element per particle. To solve new filtering problems, only these two functions have to be modified; as no parallelization issues need to be addressed, this is easily accomplished.

In the presented GPU implementation, the particles x_t^(i) and the weights ω_t^(i) are stored in separate textures which are updated by the state propagation and the likelihood evaluation, respectively. Textures can only hold four-dimensional state vectors, but using multiple rendering targets the state vectors can easily be extended when needed. When the measurement noise is low-dimensional, the likelihood computations can be replaced with fast texture lookups utilizing hardware interpolation. Furthermore, as discussed above, the state propagation uses externally generated process noise, but it would also be possible to generate the random numbers on the GPU.

4.3 Summation

Summations are part of the weight normalization (during the measurement update) and the cumulative weight calculation (during resampling), Steps 2 and 3b of Algorithm 2. A cumulative sum can be implemented using a multi-pass scheme, where an adder tree is run forward and then backward, as illustrated in Figure 2. Running only the forward pass, the total sum is computed. This multi-pass scheme is a standard method for parallelizing seemingly sequential algorithms based on gather and scatter principles; the reference [2] describes these concepts in the GPU setting. In the forward pass, partial sums are created that are used in the backward pass to compute the missing partial sums and complete the cumulative sum. The resulting algorithm is O(log N) in time, given N parallel processors and N particles.

Figure 2: Illustration of a parallel implementation of cumulative sum generation of the numbers 1, 2, ..., 8. First the sum is calculated using a forward adder tree. Then the partial summation results are used by the backward adder to construct the cumulative sum 1, 3, 6, ..., 36.
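Written out sequentially, the forward and backward adder trees of Figure 2 might look as in the C++ sketch below. Each inner loop touches independent elements and corresponds to one parallel pass on the GPU; the power-of-two length and the CPU loop structure are simplifying assumptions for illustration.

#include <cstdio>
#include <vector>

// Inclusive cumulative sum via a forward (reduce) and a backward (distribute)
// adder tree, as in Figure 2. Each inner loop is data independent and is the
// part that would run in parallel on the GPU. The length is assumed to be a
// power of two.
void cumulativeSum(std::vector<float>& a)
{
    const int n = static_cast<int>(a.size());

    // Forward pass: build partial sums; afterwards a[n-1] holds the total sum.
    for (int stride = 2; stride <= n; stride *= 2)
        for (int i = stride - 1; i < n; i += stride)
            a[i] += a[i - stride / 2];

    // Backward pass: distribute the partial sums to complete the cumulative sum.
    for (int stride = n / 2; stride >= 2; stride /= 2)
        for (int i = stride - 1; i + stride / 2 < n; i += stride)
            a[i + stride / 2] += a[i];
}

int main()
{
    std::vector<float> a = {1, 2, 3, 4, 5, 6, 7, 8};
    cumulativeSum(a);
    for (float v : a) std::printf("%g ", v);   // prints 1 3 6 10 15 21 28 36
    std::printf("\n");
    return 0;
}

With one processing element per pair, each pass is O(1), and there are O(log N) forward and O(log N) backward passes, which is where the O(log N) time complexity quoted above comes from.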
4.4 Particle Selection

To prevent sample impoverishment, the resampling step, Step 3 of Algorithm 2, replaces the weighted particle distribution with an unweighted one. This is done by drawing a new set of particles {x_t^(i*)} with replacement from the original particles {x_t^(i)} in such a way that Pr(x_t^(i*) = x_t^(j)) = ω_t^(j). Standard resampling algorithms [4, 19, 20] select the particles by comparing uniform random numbers u^(k) to the cumulative sum of the normalized particle weights, c^(i), as illustrated in Figure 3. That is, assign

    x_t^(k*) = x_t^(i), with i such that u^(k) ∈ [c^(i-1), c^(i)),        (3)

which makes use of an explicit expression for the generalized inverse cumulative probability distribution.

Figure 3: Particle selection by comparing uniform random numbers to the cumulative sum of particle weights.

Different methods are used to generate the uniform random numbers [20]. Stratified resampling [19] generates the uniform random numbers according to

    u^(k) = ((k-1) + ũ^(k)) / N, with ũ^(k) ~ U(0,1),        (4)

whereas systematic resampling [19] uses

    u^(k) = ((k-1) + ũ) / N, with ũ ~ U(0,1),                (5)

where U(0,1) is the uniform distribution between 0 and 1. Both methods produce ordered uniform random numbers which have exactly one number in every interval of length 1/N, reducing the number of u^(k) to be compared to each c^(i) to a single one. This is the key property enabling a parallel implementation.
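For comparison with the GPU solution described next, a sequential C++ sketch of particle selection with systematic resampling, eqs. (3) and (5), is given below. Because the u^(k) are ordered, the cumulative weights are traversed only once; the function is an illustrative sketch, not the paper's implementation, which performs this selection on the GPU.

#include <cstddef>
#include <random>
#include <vector>

// Systematic resampling, eqs. (3) and (5): given normalized weights w, return
// for every k the index i of the particle such that u^(k) lies in [c^(i-1), c^(i)).
// Usage (hypothetical): auto idx = systematicResample(w, rng); xNew[k] = x[idx[k]];
std::vector<std::size_t> systematicResample(const std::vector<double>& w, std::mt19937& rng)
{
    const std::size_t N = w.size();
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    const double uTilde = uniform(rng);           // one draw shared by all u^(k), eq. (5)

    std::vector<std::size_t> index(N);
    std::size_t i = 0;
    double c = w[0];                               // running cumulative weight c^(i)
    for (std::size_t k = 0; k < N; ++k) {
        // With zero-based k this equals ((k+1) - 1 + uTilde) / N, i.e., eq. (5).
        const double u = (static_cast<double>(k) + uTilde) / static_cast<double>(N);
        while (u >= c && i + 1 < N) c += w[++i];   // advance until u^(k) < c^(i)
        index[k] = i;                              // x^(k*) = x^(i), eq. (3)
    }
    return index;
}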

Utilizing the rasterization functionality of the graphics pipeline, the selection of particles can be implemented in a single render pass: calculate vertices p^(i) by assigning the cumulative weights c^(i) to an equidistant grid, depending on the uniform random numbers u^(i). That is,

    p^(i) = ⌊N c^(i)⌋        if N c^(i) - ⌊N c^(i)⌋ < ũ^(⌈N c^(i)⌉),
    p^(i) = ⌊N c^(i)⌋ + 1    otherwise,                                (6)

where ⌊·⌋ is the floor operation. Drawing a line connecting the vertices p^(i) and associating a particle with every line segment, the rasterization process creates the resampled set of particles according to the length of each segment. This procedure is illustrated with an example in Figure 4, based upon the data in Figure 3. The computational complexity of this step is O(1) with N parallel processors, as the vertex positions can be calculated independently.

Figure 4: Particle selection on the GPU. The vertices p^(i), cumulative weights snapped to an equidistant grid, define a line where every segment represents a particle. Some vertices may coincide, resulting in line segments of zero length. The rasterizer creates particles according to the length of the corresponding line segments.

Unfortunately, the current generation of GPUs has a maximal texture size, limiting the number of particles that can be resampled as a single unit. To solve this, multiple subsets of particles are simultaneously resampled and then redistributed into different sets, similarly to what is described in [21]. This modification of the resampling step does not seem to significantly affect the performance of the particle filter as a whole.
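The C++ sketch below emulates on the CPU what the render pass does: it computes the vertices p^(i) according to eq. (6) with the systematic choice of u^(k), and interprets the difference p^(i) - p^(i-1) as the length of line segment i, i.e., the number of copies of particle i that the rasterizer would produce. The weights and the value of ũ are made-up illustrative numbers, and no textures or actual rasterization are involved; only the arithmetic is shown.

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    // Normalized weights of 8 particles (illustrative values).
    std::vector<double> w = {0.05, 0.20, 0.05, 0.15, 0.30, 0.02, 0.20, 0.03};
    const int N = static_cast<int>(w.size());
    const double uTilde = 0.37;                    // the single systematic draw, assumed

    // Cumulative weights c^(i) and vertices p^(i) snapped to the equidistant grid, eq. (6).
    std::vector<double> c(N);
    std::vector<int> p(N);
    double sum = 0.0;
    for (int i = 0; i < N; ++i) {
        sum += w[i];
        c[i] = sum;
        const double Nc = N * c[i];
        const double frac = Nc - std::floor(Nc);
        p[i] = static_cast<int>(std::floor(Nc)) + (frac < uTilde ? 0 : 1);
    }

    // Segment lengths: particle i is replicated p^(i) - p^(i-1) times
    // (zero-length segments mean the particle is discarded).
    int prev = 0;
    for (int i = 0; i < N; ++i) {
        std::printf("particle %d copied %d times\n", i, p[i] - prev);
        prev = p[i];
    }
    return 0;
}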
4.5 Complexity Considerations

From the descriptions of the different steps of the PF algorithm, it is clear that the resampling step is the bottleneck that determines the time complexity of the algorithm: O(log N), compared to O(N) for a sequential algorithm. The analysis above assumes that there are as many parallel processors as there are particles in the particle filter, i.e., N parallel elements. Today this is a bit too optimistic: a modern GPU has on the order of ten parallel pipelines, hence much fewer than the typical number of particles. However, the number of parallel units is constantly increasing, so the degree of parallelization is improving.

Especially the cumulative sum suffers from a low degree of parallelization. With full parallelization the time complexity of the operation is O(log N), whereas the sequential algorithm is O(N); however, the parallel implementation uses O(N log N) operations in total. As a result, with few pipelines and many particles the parallel implementation will be slower than the sequential one. However, as the degree of parallelization increases, this will matter less and less.

5. FILTER EVALUATION

To evaluate the designed GPU PF, two particle filters have been implemented: one standard PF running on the CPU and one, implemented as described in Section 4, running on the GPU. (The code for both implementations is written in C++ and compiled using gcc 3.4.6.) The filters were then used to filter data from a constant velocity tracking model, measured with two distance measuring sensors. The estimates obtained were very similar, with only small differences that can be explained by the different resampling methods (one set or multiple sets) and the presence of round-off errors. This shows that the GPU implementation works, and that the modification of the resampling step is acceptable.
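For concreteness, a constant velocity model with two distance measuring sensors of the kind used in the evaluation could be written as below. The paper does not state its exact parameters, so the sampling time, noise entry points, and sensor positions here are assumptions; the point is that, as noted in Section 4.2, only these two functions, f and h, need to be changed to apply the filter to a new problem.

#include <array>
#include <cmath>
#include <cstdio>

using State = std::array<double, 4>;        // [px, py, vx, vy]
using Measurement = std::array<double, 2>;  // ranges to two sensors

constexpr double T = 0.1;                   // sampling time (assumed)

// Dynamics f(x, w): constant velocity motion with acceleration-like noise w.
State f(const State& x, const std::array<double, 2>& w)
{
    return {{x[0] + T * x[2] + 0.5 * T * T * w[0],
             x[1] + T * x[3] + 0.5 * T * T * w[1],
             x[2] + T * w[0],
             x[3] + T * w[1]}};
}

// Measurement h(x): distances to two sensors at assumed positions.
Measurement h(const State& x)
{
    constexpr double s1[2] = {0.0, 0.0};
    constexpr double s2[2] = {10.0, 0.0};
    return {{std::hypot(x[0] - s1[0], x[1] - s1[1]),
             std::hypot(x[0] - s2[0], x[1] - s2[1])}};
}

int main()
{
    State x = {{0.0, 0.0, 1.0, 0.5}};        // position (0,0), velocity (1, 0.5)
    x = f(x, {{0.0, 0.0}});                  // one noise-free time update
    Measurement y = h(x);
    std::printf("ranges: %f %f\n", y[0], y[1]);
    return 0;
}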

The hardware used is presented in Table 1. Note that the particle selection and redistribution is conducted in 8 parallel pipelines, and that the rest of the steps in the PF algorithm are performed in 24 pipelines, i.e., N ≫ the number of pipelines.

Table 1: Hardware used for the evaluation.

GPU  Model:            NVIDIA GeForce 7900 GTX
     Driver:           2.1.0 NVIDIA 96.40
     Bus:              PCI Express, 14.4 GB/s
     Clock speed:      650 MHz
     Processors:       8/24 (vertex/fragment)

CPU  Model:            Intel Xeon 5130
     Clock speed:      2.0 GHz
     Memory:           2.0 GB
     Operating system: CentOS 4.4 (Linux)

To study the time complexity of the PF, simulations with 1000 time steps were run with different numbers of particles, and the time spent in the particle filters was recorded, excluding the generation of the random numbers, which was the same for both filter implementations. The results can be found in Figure 5. The maximum number of particles (10^6) may seem rather large for current applications; however, it helps to show the trend in computation time and shows that it is possible to use this many particles. This makes it possible to work with large state dimensions and opens up for PFs in new application areas.

Figure 5: Time comparison between the CPU and GPU implementations. The number of particles is large to show that the calculation is tractable and to show the effect of the parallelization. Note the log-log scale.

Some observations should be made: for few particles, the overhead from initializing and using the GPU is large, and hence the CPU implementation is the fastest. The CPU complexity follows a linear trend, whereas at first the GPU time hardly increases when using more particles; parallelization pays off. For even more particles there are not enough parallel processing units available and the complexity becomes linear, but the GPU implementation is still faster than the CPU one. Note that the particle selection is performed on 8 processors and the other steps on 24 (see Table 1), and hence the degree of parallelization is not very high for many particles.

A further analysis of the time spent in the GPU implementation shows in which part of the algorithm most of the time is spent. Figure 6 shows that most of the time is spent in the resampling step, and that the portion of time spent there increases with more particles. This is quite natural, since this step is the least parallel in its nature and requires multiple passes. Hence, optimization efforts should be directed at this part of the algorithm.

Figure 6: Relative time spent in the different parts of the GPU implementation (resampling, time update, measurement update).

6. CONCLUSIONS

In this paper, the first complete parallel implementation of a general particle filter on a GPU reported in the literature is described. Using simulations, the parallel GPU implementation is shown to outperform a CPU implementation in computation speed for many particles while maintaining the same filter quality. The techniques and solutions used in deriving the implementation can also be used to implement particle filters on other similar parallel architectures.

References

[1] GPGPU programming web site, 2006, http://www.gpgpu.org.
[2] M. Pharr, Ed., GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, Addison-Wesley, 2005.
[3] M. D. McCool, "Signal processing and general-purpose computing on GPUs," IEEE Signal Process. Mag., vol. 24, no. 3, pp. 109-114, May 2007.
[4] A. Doucet, N. de Freitas, and N. Gordon, Eds., Sequential Monte Carlo Methods in Practice, Statistics for Engineering and Information Science, Springer-Verlag, New York, 2001.
[5] N. J. Gordon, D. J. Salmond, and A. F. M. Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proc.-F, vol. 140, no. 2, pp. 107-113, Apr. 1993.
[6] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications, Artech House, 2004.
[7] S. Maskell, B. Alun-Jones, and M. Macleod, "A single instruction multiple data particle filter," in Proc. Nonlinear Statistical Signal Processing Workshop, Cambridge, UK, Sep. 2006.
[8] A. S. Montemayor, J. J. Pantrigo, A. Sánchez, and F. Fernández, "Particle filter on GPUs for real-time tracking," in Proc. SIGGRAPH, Los Angeles, CA, USA, Aug. 2004.
[9] A. S. Montemayor, J. J. Pantrigo, R. Cabido, B. R. Payne, Á. Sánchez, and F. Fernández, "Improving GPU particle filter with shader model 3.0 for visual tracking," in Proc. SIGGRAPH, Boston, MA, USA, Aug. 2006.
[10] NVIDIA developer web site, 2006, http://developer.nvidia.com.
[11] D. Shreiner, M. Woo, J. Neider, and T. Davis, OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 2, Addison-Wesley, 5th edition, 2005.
[12] R. J. Rost, OpenGL Shading Language, Addison-Wesley, 2nd edition, 2006.
[13] R. E. Kalman, "A new approach to linear filtering and prediction problems," Trans. ASME, vol. 82, Series D, pp. 35-45, Mar. 1960.
[14] T. Kailath, A. H. Sayed, and B. Hassibi, Linear Estimation, Prentice-Hall, 2000.
[15] A. H. Jazwinski, Stochastic Processes and Filtering Theory, vol. 64 of Mathematics in Science and Engineering, Academic Press, 1970.
[16] C. J. K. Tan, "The PLFG parallel pseudo-random number generator," Future Generation Computer Systems, vol. 18, pp. 693-698, 2002.
[17] A. De Matteis and S. Pagnutti, "Parallelization of random number generators and long-range correlations," Numer. Math., vol. 53, no. 5, pp. 595-608, 1988.
[18] M. Sussman, W. Crutchfield, and M. Papakipos, "Pseudorandom number generation on the GPU," in Graphics Hardware, Eurographics Symp. Proc., Vienna, Austria, Aug. 2006, pp. 87-94.
[19] G. Kitagawa, "Monte Carlo filter and smoother for non-Gaussian nonlinear state space models," J. Comput. and Graphical Stat., vol. 5, no. 1, pp. 1-25, Mar. 1996.
[20] J. D. Hol, T. B. Schön, and F. Gustafsson, "On resampling algorithms for particle filters," in Proc. Nonlinear Statistical Signal Processing Workshop, Cambridge, UK, Sep. 2006.
[21] M. Bolić, P. M. Djurić, and S. Hong, "Resampling algorithms and architectures for distributed particle filters," IEEE Trans. Signal Process., vol. 53, no. 7, pp. 2442-2450, July 2005.