
Institutionen för systemteknik
Department of Electrical Engineering

Master's thesis (Examensarbete)

GPU Implementation of the Particle Filter

Master's thesis in Electrical Engineering at the Institute of Technology, Linköping University
by Joakim Gebart

LiTH-ISY-EX--13/4698--SE
Linköping 2013

Department of Electrical Engineering, Linköpings universitet, SE Linköping, Sweden


GPU Implementation of the Particle Filter

Master's thesis in Electrical Engineering at the Institute of Technology, Linköping University
by Joakim Gebart

LiTH-ISY-EX--13/4698--SE

Supervisors: Niklas Wahlström (ISY, Linköpings universitet), Gustaf Hendeby (FOI)
Examiner: David Törnqvist (ISY, Linköpings universitet)

Linköping, 15 June 2013


Division, Department: Reglerteknik (Automatic Control), Department of Electrical Engineering, SE Linköping
Language: English (Engelska)
Report category: Examensarbete (Master's thesis)
ISRN: LiTH-ISY-EX--13/4698--SE
Title: GPU implementation av partikelfiltret / GPU Implementation of the Particle Filter
Author: Joakim Gebart
Keywords: GPGPU, Particle filtering, CUDA, Sequential Monte Carlo, C++


Abstract

This thesis work analyses the obstacles faced when adapting the particle filtering algorithm to run on massively parallel compute architectures. Graphics processing units are one example of such architectures, allowing the developer to distribute the computational load over hundreds or thousands of processor cores. This thesis studies an implementation written for NVIDIA GeForce GPUs, yielding varying speed-ups, up to 3000% in some cases, when compared to the equivalent algorithm performed on a CPU.

The particle filter, also known in the literature as sequential Monte Carlo methods, is an algorithm used for signal processing when the system generating the signals has highly nonlinear behaviour or non-Gaussian noise distributions, where a Kalman filter and its extended variants are not effective. The particle filter was chosen as a good candidate for parallelisation because of its inherently parallel nature. There are, however, several steps of the classic formulation where computations depend on other computations in the same step, which requires them to be run in sequence instead of in parallel. To avoid these difficulties, alternative ways of computing the results must be used, such as parallel scan operations and scatter/gather methods.

Another area where parallel programming is still not widespread is pseudo-random number generation. Pseudo-random numbers are required by the algorithm to simulate the process noise, as well as to avoid the particle depletion problem using a resampling step. In this thesis a recently published counter-based pseudo-random number generator is used.


Contents

1 Introduction
  1.1 Prior work
  1.2 Outline

2 Theory
  2.1 Linear state-space model
  2.2 General state-space model
  2.3 Bayesian filtering
  2.4 Linear filtering
  2.5 Nonlinear filtering
  2.6 Particle filtering
    2.6.1 Time update
    2.6.2 Measurement update
    2.6.3 Resampling
    2.6.4 Summary

3 Parallel programming in CUDA
  3.1 Brief history of GPGPU
  3.2 CUDA framework
  3.3 CUDA programming model
  3.4 Compute capability
  3.5 Branching and flow control
  3.6 CUDA memory architecture
  3.7 Coalescing memory operations
  3.8 High level libraries
    3.8.1 Thrust template library
    3.8.2 Random123 random number generation library
    3.8.3 Eigen linear algebra library

4 Parallelisation
  4.1 Time update
  4.2 Random number generation
    4.2.1 Recursive PRNGs
    4.2.2 Counter-based PRNGs
    4.2.3 The Box-Muller transform
    4.2.4 The Kaiser and Dickman method
  4.3 Measurement update
    4.3.1 Residual, e_k^i
    4.3.2 Likelihood, p(y_k | x_k^i)
  4.4 Resampling
    4.4.1 Parallel binary prefix operations
    4.4.2 Parallel reduction
    4.4.3 Normalizing weights
    4.4.4 Computing the cumulative probability distribution
    4.4.5 Computing the number of particle copies
    4.4.6 Copying particles
  4.5 Complexity summary

5 Implementation
  5.1 C++ templates
  5.2 Function objects
  5.3 Particle filtering framework
    5.3.1 The time update
    5.3.2 The measurement update
    5.3.3 Resampling

6 Results
  6.1 Hardware
  6.2 Software
  6.3 Applications
    6.3.1 2-dimensional terrain navigation using altitude sensor and range meter
    6.3.2 Tracking targets captured by an image processing system
  6.4 Comparing floating point results
  6.5 Memory usage
  6.6 Correctness of the implementation
  6.7 Execution time
  6.8 Impact of choice of random number generator
    6.8.1 2-dimensional terrain navigation
    6.8.2 Visual tracking

7 Conclusions
  7.1 Number of particles in relation to number of processing cores
  7.2 Comparing results between architectures
  7.3 Online applications
  7.4 Future work
    7.4.1 Implement Kalman filter time update function object
    7.4.2 Evaluating more resampling methods
    7.4.3 Physical memory layout optimizations in Terrain2d
    7.4.4 Parallelising the CPU implementation

Bibliography

1 Introduction

Graphics processing units (GPUs) have evolved tremendously over the last decade. Today they are quite capable in terms of computing abilities and are therefore used in other areas than computer graphics. An advantage of using GPUs for general-purpose computing is their large number of computing cores, usually numbering in the hundreds on a single chip, compared to the 4, 8, or even 12 cores in the CPUs available today. GPUs are a cheap source of computing power, but they should still be regarded as a complement to, not a replacement for, general-purpose processors because they have other associated limitations. These limitations come from design decisions needed for achieving the high density of cores, such as less cache and the difficulty of dealing with divergent branches; this is discussed at further length in chapter 3.

Particle filtering is a signal processing method that is suitable for nonlinear systems and non-Gaussian noise distributions, where Kalman filtering, extended Kalman filtering and unscented Kalman filtering do not perform well. The particle filter algorithm is computationally demanding because a large number of potential states, or particles, must be processed in each time step; however, because many of the computations are not dependent on each other, it is suitable for parallelisation, where the workload is distributed over multiple processors to speed up the computations.

This thesis studies one implementation where the computations in a particle filtering problem are parallelised and run on an NVIDIA GeForce GPU, yielding speed-ups when compared to the equivalent algorithm performed on a CPU. The accuracy of the implementation is evaluated by comparing the filter output on the GPU to the filter output of a reference implementation running on a CPU.

This thesis report is mainly targeted at engineers and engineering students interested in signal processing, computer science, and parallel algorithm implementation.

1.1 Prior work

Particle filters have been tested on parallel architectures before [Brun et al., 2002], [Teulière and Brun, 2003], including GPUs [Montemayor et al., 2004], [Hendeby et al., 2010]. However, these early examples of GPGPU¹ particle filtering were written for OpenGL or Direct3D, before computing frameworks such as CUDA and OpenCL were developed. Chao et al. [2010] describe a particle filter CUDA implementation that attempts to reduce the number of particles needed to achieve a good filter output by increasing the ratio of effective particles, additionally using a technique which allows for a localized resampling scheme, which yields a performance benefit when compared to a classical particle filter on a GPU. Different resampling methods have been studied in parallelisation contexts [Míguez, 2007], including new resampling schemes for utilizing parallel architectures [Bolic et al., 2004] and specifically GPUs [Murray, 2012].

¹ General-Purpose computing on Graphics Processing Units

1.2 Outline

The report begins with an introduction to state-space models, estimation and filtering in chapter 2. The particle filter algorithm is also introduced, and a detailed breakdown of the algorithm is provided which is used as the foundation for the rest of the report. An introduction to the CUDA framework and programming model is provided in chapter 3 for readers with no experience in parallel programming. Chapter 4 analyses each part of the particle filter algorithm with regard to the limitations and possibilities outlined in the previous chapter and provides solutions for calculating the results in parallel to allow performance gains on massively parallel architectures, such as GPUs. Chapter 5 introduces some important C++ concepts and provides a short description of the particle filtering framework created during the writing of this thesis. Three example applications of the particle filter are presented in chapter 6. These applications are used to collect results, and the gains in execution time from the parallelisation of the filter are presented in the same chapter, along with an analysis of the correctness of the implementation. The collective results are further discussed in chapter 7 and some conclusions are drawn. At the end, some suggestions for future work within the same domain as this thesis are presented.

2 Theory

Sensor measurements always contain undesired noise when used in the physical world. The purpose of signal processing is to extract useful information from noisy signals. For simple problems, such as removing a disturbing tone from a sound recording, it is often enough to simply remove the undesired frequencies from the signal spectrum using a band-stop filter, but for more complex problems it is possible to achieve better results using more of the knowledge about the system that is being measured. In many cases it is not even possible to directly measure the quantity that is needed; rather, some signal processing is required to get an estimate from some other measurement source.

2.1 Linear state-space model

To use knowledge about the system being measured it is often useful to formulate a state-space model of the system. The state-space model describes how the system behaves over time. For example, if the initial position, velocity and direction of a car are known, it is possible to predict the future position of the car by integrating the velocity over time. A simple 2-dimensional linear state-space model of a car moving in one dimension at a constant velocity is

x_1[k + 1] = x_1[k] + x_2[k] T_s    (2.1)
x_2[k + 1] = x_2[k]    (2.2)

where x_1 is the position and x_2 is the velocity of the vehicle, k is the current time

step, and T_s is the time step length. In most cases not all states are measurable; to represent this fact we add a measurement equation

y[k] = x_1[k]    (2.3)

where y[k] is the measurement. As can be seen in the above equation, the only measurement is the position.

It is often too complicated or not even possible to model the system completely. To account for the error introduced by simplifications, such as neglecting air resistance and rolling resistance, or by errors in the state equations or parameters, we introduce process noise v[k]. The measurements also have some noise; this is represented by measurement noise e[k]. The complete model is thus

x_1[k + 1] = x_1[k] + x_2[k] T_s + \frac{T_s^2}{2} v[k]    (2.4)
x_2[k + 1] = x_2[k] + T_s v[k]    (2.5)
y[k] = x_1[k] + e[k]    (2.6)

The process noise v[k] and measurement noise e[k] are independent, white stochastic processes with known probability density functions. The coefficients T_s^2/2 and T_s for the process noise come from sampling a continuous system and from the assumption that the physical noise can be seen as an external unknown force acting on the vehicle; see Gustafsson [2010] for an explanation. For a more thorough explanation of state-space models and model based prediction, see e.g. Gustafsson [2010].

For the remainder of this report the state will be used as a vector, and the notation x[k] will be written as x_k for readability:

x_k = \begin{pmatrix} x_1[k] \\ \vdots \\ x_{n_x}[k] \end{pmatrix}    (2.7)
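To make the model concrete, the following minimal C++ sketch (not part of the thesis; the sampling time and noise levels are hypothetical values chosen only for illustration) simulates the model (2.4)-(2.6) for a number of time steps:

    #include <random>

    // State of the constant-velocity model: x1 = position, x2 = velocity.
    struct CarState { double x1, x2; };

    // One time step of (2.4)-(2.5); v is a sample of the process noise v[k].
    CarState timeStep(CarState x, double Ts, double v) {
        return { x.x1 + x.x2 * Ts + 0.5 * Ts * Ts * v,
                 x.x2 + Ts * v };
    }

    int main() {
        std::mt19937 rng(0);
        std::normal_distribution<double> pv(0.0, 0.1), pe(0.0, 0.5); // hypothetical noise levels
        CarState x{0.0, 1.0};  // start at position 0 with velocity 1
        for (int k = 0; k < 100; ++k) {
            x = timeStep(x, 0.01, pv(rng));
            double y = x.x1 + pe(rng); // measurement equation (2.6): noisy position only
            (void)y;                   // a real application would feed y to a filter
        }
    }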

2.2 General state-space model

There are many real-world systems that are inherently nonlinear, where the process noise term v_k would become too large if approximated by a linear system and render the model useless for filtering. In order to cope with such systems we will need a more general model that allows for nonlinear relations between the states. A state-space model for a general nonlinear system is given by

x_{k+1} = f(x_k, u_k, v_k),    v_k \sim p_{v_k}    (2.8)
y_k = h(x_k) + e_k,    e_k \sim p_{e_k}    (2.9)

where x_k is the state vector, y_k is the measurement, and u_k is an external control input (e.g. the output from a controller). v_k is a disturbance caused by model errors and by noise in measuring u_k. e_k is noise in the measurement of y_k. The distributions of v_k and e_k can be arbitrary, but their probability density functions p_{v_k} and p_{e_k} are assumed known at filter design time. Only y_k and u_k are measurable. The above is a very general system model and very few assumptions are made about the modelled system, which makes it suitable for nonlinear systems and/or non-Gaussian noise distributions. Measurements and control inputs are assumed to arrive in order; see Orton and Marrs [2005] for an extension to systems with out-of-sequence measurements.

2.3 Bayesian filtering

Bayesian filtering is a statistical approach to filtering, using likelihoods and probability densities to extract useful information from a system. In Bayesian filtering the goal is to reconstruct the filtering distribution p(x_k | y_{1:k}) and prediction distribution p(x_{k+1} | y_{1:k}) of a system using all prior information, including all previous measurements. The filtering density describes the distribution of the state at the current time given all measurements up to the current time step. The prediction distribution describes the state distribution in the next time step given previous measurements up to the current time step, hence prediction. Bayesian filtering is a more general approach to the filtering problem than linear filtering, since the theory can be applied to any system dynamics and arbitrary noise distributions. Gustafsson [2010] gives a longer description of nonlinear filtering; below is a short summary.

Consider Bayes' theorem conditioned on the variable C,

p(A | B, C) = \frac{p(B | A, C) p(A | C)}{p(B | C)}    (2.10)

Substituting A = x_k, B = y_k, C = y_{1:k-1} and recognizing the Markov property of state-space models¹ yields

¹ A state distribution based on all previous measurements incorporates all possible information from the measurements, and therefore conditioning on both the state and the measurements adds no new information compared to conditioning on only the state.

p(x_k | y_{1:k}) = \frac{p(y_k | x_k) p(x_k | y_{1:k-1})}{p(y_k | y_{1:k-1})}    (2.11)

Equation (2.11) is known as a measurement update in Bayesian filtering because it incorporates the information from new measurements into the filtering distribution. p(y_k | x_k) is called the likelihood of the measurement. The denominator in (2.11) can be expressed by using the law of total probability,

p(y_k | y_{1:k-1}) = \int_{R^{n_x}} p(y_k | x_k) p(x_k | y_{1:k-1}) \, dx_k    (2.12)

Using Bayes' rule conditioned on C,

p(A, B | C) = p(A | B, C) p(B | C)    (2.13)

with A = x_{k+1}, B = x_k, C = y_{1:k} and again recognizing the Markov property of state-space models yields

p(x_{k+1}, x_k | y_{1:k}) = p(x_{k+1} | x_k) p(x_k | y_{1:k})    (2.14)

Integration on both sides with respect to x_k over the entire state-space (using the law of total probability) yields

p(x_{k+1} | y_{1:k}) = \int_{R^{n_x}} p(x_{k+1} | x_k) p(x_k | y_{1:k}) \, dx_k    (2.15)

This is called the time update in Bayesian filtering because it advances the state distributions to the next time step. By using (2.11) and (2.15) recursively and initiating with

p(x_1 | y_0) = p(x_0)    (2.16)

we arrive at the general Bayesian recursion

p(x_k | y_{1:k}) = \frac{p(y_k | x_k) p(x_k | y_{1:k-1})}{p(y_k | y_{1:k-1})}    (2.17a)
p(y_k | y_{1:k-1}) = \int_{R^{n_x}} p(y_k | x_k) p(x_k | y_{1:k-1}) \, dx_k    (2.17b)
p(x_{k+1} | y_{1:k}) = \int_{R^{n_x}} p(x_{k+1} | x_k) p(x_k | y_{1:k}) \, dx_k    (2.17c)

Equations (2.17) are the equations that need to be solved to arrive at the filtering and prediction densities. On-line filtering applications typically require a new estimate every time a new measurement is available. To prevent the computational complexity from increasing with the number of measurements (y_{1:k} grows by one measurement vector per new measurement), it is necessary to utilize the recursion in the Bayesian filtering framework, keeping only the most current state estimate at all times. From this result the actual computations become an iterative solution, with one time update per time step or tick and one measurement update per measurement.

2.4 Linear filtering

The linear filtering problem is to estimate the state of a linear dynamic system given observations of some measurable part of the system. The model in section 2.1 has a linear state update and is thus a linear system. For linear systems with Gaussian noise distributions it can be proven that the Kalman filter is an optimal solution (with regard to minimum variance) to the problem of reconstructing the unknown state x_k from all previous (noisy) measurements y_{1:k} [Kalman, 1960].

2.5 Nonlinear filtering

The extended Kalman filter (EKF) is an extension of the Kalman filter that allows systems with minor nonlinearities in the system dynamics and Gaussian or almost-Gaussian noise distributions. In each time step of the EKF, the nonlinear system model is linearised about the current state estimate using the Jacobian (partial derivatives) of the nonlinear state prediction function and Taylor series expansions. The EKF is thoroughly discussed in Jazwinski [1970]. Another approach, the unscented Kalman filter (UKF) [Julier et al., 1995], is to let a number of points in the state-space propagate through the model and afterwards recover the distribution mean and covariance. In some cases, this method

allows for prediction functions with greater nonlinearities than the EKF can handle. The above solutions only work on models with small nonlinearities, or else the process noise term will become too great and the filter will yield very little useful information. The UKF is somewhat better at handling nonlinearities than the EKF, but it still fails for greater nonlinearities. To handle arbitrary distributions and greater nonlinearities it is better to use other statistical methods, such as a point mass filter (PMF) or Monte Carlo approaches such as a particle filter (PF). The particle filter is described in section 2.6.

2.6 Particle filtering

For linear systems with Gaussian noise distributions it can be proven that the Kalman filter is the optimal solution to the problem of estimating the system state given noisy measurements. However, many systems are highly nonlinear or have non-Gaussian noise distributions. For such systems particle filtering (PF), or sequential Monte Carlo (SMC), is a better method for estimating the system's filtering distribution and prediction distribution. The particle filter was first introduced by Gordon et al. [1993].

The particle filter is a Bayesian approach to the filtering and prediction problems. The probability density function in the general Bayesian filtering framework is approximated by a set of N_p particles, or state-space vectors, each particle with its own associated weight. Thus, the density distributions can be seen as clouds of probability mass points. The particle filter consists of three parts:

Time update, for advancing the state, closely related to (2.17c).
Measurement update, when new measurements arrive, closely related to (2.17a).
Resampling, to prevent particle depletion.

The following sections describe each part in greater detail.

2.6.1 Time update

The time update uses the system model to predict the state distributions in the next time step in order to advance the time counter to the next time step. This step represents the time update described in (2.17c) in section 2.3 and updates the prediction density in the Bayesian framework. Predicting the future state is performed using a model of the system:

x_{k+1} = f(x_k, u_k, v_k),    v_k \sim p_{v_k}    (2.18)

where x_{k+1} is the new system state, x_k is the old system state, f(x_k, u_k, v_k) is a (possibly nonlinear) function for predicting the future state given the current state, u_k is a measurable control input, and v_k is a sample of the process noise, drawn from the process noise distribution p_{v_k}.

2.6.2 Measurement update

The measurement update modifies the particle weights according to the measurement noise distribution p_{e_k}. This step represents the measurement update step in the general Bayesian recursion in section 2.3 and updates the filtering density. The measurement update in the particle filter is

w^i_{k|k} = \frac{1}{c_k} w^i_{k|k-1} p(y_k | x^i_k)    (2.19)

where p(y_k | x^i_k) is the likelihood of the measurement; in an application this is closely related to the measurement noise distribution. (2.19) can be compared to (2.17a) in the general Bayesian recursion; the particle weight w^i_{k|k-1} can be compared to the previous prediction density p(x_k | y_{1:k-1}). c_k is a normalization weight,

c_k = \sum_{i=1}^{N_p} w^i_{k|k-1} p(y_k | x^i_k)    (2.20)

(2.20) can be compared to (2.17b), substituting the integral over a continuous distribution with a sum over discrete samples. With the normalization constant c_k defined as in (2.20), the sum of all particle weights after a measurement update is always 1.

2.6.3 Resampling

Because of process noise and measurement noise, particles will tend to diverge from the true state and all weights except one will tend to zero. This situation is called depletion or impoverishment, because the pool of useful particles is depleted. To prevent depletion, particles having a very low weight should be replaced by particles of a higher weight. There are many methods of resampling; some commonly used ones include multinomial resampling and systematic resampling [Douc et al., 2005].

Multinomial resampling can be likened to having all particles spread out around a wheel of fortune or roulette wheel, the width of the fields corresponding to the weights of the particles, and a needle pointing at the edge of the wheel [Blanco, 2009]. The needle points out one particle to put in the set of particles to use in the next time step. This method needs N_p spins of the wheel to sample N_p particles for the next time step. Systematic resampling, on the other hand, can be likened

to having the same roulette wheel as in multinomial resampling, but with N_p needles spread out at even distances around the edge of the wheel. The systematic resampling method thus needs only a single spin to decide N_p particles for the next time step. Figure 2.1 shows an illustration of the two resampling schemes. The consequence of using multiple needles is that if a particle has weight greater than 1/N_p it is guaranteed that at least one copy will be made, while when using only one needle it is possible, however unlikely, that a particle of weight greater than 1/N_p will not be copied at all. Systematic resampling is thus still a stochastic process, but a more deterministic one than multinomial resampling.

Figure 2.1: To the left, multinomial resampling as a fortune wheel. To the right, systematic resampling with N_p = 8. Each coloured field represents a single particle, the width of the field corresponding to the weight of the particle, thereby giving a particle of greater weight a greater chance of being pointed out by a needle after a spin. Spinning the right wheel once will point out 8 particles at once, while the left requires one spin per particle.

The spin of the wheel can be simulated by drawing a number from a uniform distribution. Using u_n \sim U(0, 1) to represent the location (distance from angle zero) of the needle along the edge of the spun wheel, it is necessary to find which particle is located at the chosen point. If all particles had the same weight it would be trivial to find the index of the particle pointed at: simply multiply the location pointed at by the number of particles and round the result to an integer to get a particle index. However, since particles are not of equal weight, this method will not work. An alternative way of finding the index is to first compute the cumulative sum of particle weights and then seek along this sequence to find the index of the particle whose region begins before the location of the needle and ends after the location of the needle.
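As a minimal illustration (sequential C++; the function name is hypothetical and the weights are assumed to be normalized), the index lookup for a single needle location is a lower bound search along the cumulative weight sum:

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Return the index of the particle whose cumulative-weight region
    // contains the needle location u, 0 <= u < 1.
    std::size_t pickParticle(const std::vector<double> &weights, double u) {
        std::vector<double> cdf(weights.size());
        std::partial_sum(weights.begin(), weights.end(), cdf.begin());
        // First element of cdf that is >= u, i.e. a lower bound search.
        return std::lower_bound(cdf.begin(), cdf.end(), u) - cdf.begin();
    }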

Let u be a vector of length N_p of needle locations, from now on called intersection levels, where N_p is the number of particles. The intersection levels can be chosen as uniformly distributed stochastic variables,

u_n \sim U(0, 1),    n = 0, 1, 2, ..., N_p - 1    (2.21)

This is a more formal description of multinomial resampling. The name multinomial resampling comes from the fact that the probability density of the count of particle copies is distributed according to the multinomial distribution, using the particle weights as parameters. To reduce the number of random samples needed, u can be chosen as a small random offset, a single sample from a uniform distribution, with uniform steps above it:

u_n = \frac{n}{N_p} + \xi,    n = 0, 1, 2, ..., N_p - 1,    \xi \sim U(0, \frac{1}{N_p})    (2.22)

This method with a single random offset is known as systematic resampling [Douc et al., 2005], stratified sampling [Carpenter et al., 1997] or stochastic universal sampling [Whitley, 1994]. In this report the term systematic resampling will be used to refer to this resampling scheme.

Given a finite number of particles, each with its own weight, calculate the cumulative weight of all particles up to and including the particle with index i:

cdf_k[i] = \sum_{j=1}^{i} w^j_{k|k}    (2.23)

cdf_k[i] is the cumulative density function of the discrete particle distribution. It is not necessary that the particles are ordered in any way. (2.23) is straightforward to compute sequentially in a computer program, while it takes some extra work to parallelise efficiently; this is further discussed in section 4.4.1. The index, i, of the particle at location u_n on the roulette wheel can be found by solving

cdf_k[i] = u_n    (2.24)

for i. This is easily performed in a sequential fashion by a search through the sequence cdf_k, also known as performing a lower bound search for u_n in cdf_k. Given a uniformly distributed level u_n it is more likely to draw a particle of greater weight than one of lesser weight. Figure 2.2 shows cdf_k[i] as a discrete function and the intersection levels u as horizontal lines. Each intersection between

cdf_k and a level line adds a copy of the corresponding particle to the set of particles in the next time step. A particle that crosses more than one level line will be duplicated multiple times in the set of particles for the next time step, once per intersection.

Figure 2.2: "Picking particles to copy": computed cumulative density function cdf_k[i] (probability versus particle index i) of a simulated distribution, along with the intersection levels u_n from (2.22) with N_p = 10 (dashed lines). The dashed level lines can be seen as a representation of the needles on the roulette wheel in figure 2.1. Each intersection between the CDF and a level line adds a copy of the corresponding particle to the set of particles in the next time step. In this figure, the particle with index i = 6 will be used once (one level line intersecting the CDF at i = 6), the particle with i = 7 will be used twice (two level intersections), and the particle with i = 8 will be discarded (no intersections).

Since the resampling step is only used to prevent depletion, it is not necessary to perform the resampling after each measurement update. One method to reduce the computational load is to count the effective particles, N_eff, the number of particles with weights above a threshold value, and do a resampling whenever that number falls below a defined quota. However, if the filter is designed to run in a real-time online situation it is necessary to be able to do the resampling before the next

measurement arrives; thus the resampling should be fast enough to run between each measurement update, and the only reason to skip the resampling step would be to save power. In the implementations created in this thesis work no such optimizations are done, and resampling is performed after each measurement update.

2.6.4 Summary

Algorithm 1 lists the complete particle filter algorithm described in this chapter; a minimal code skeleton of the same loop is sketched after the algorithm.

Algorithm 1 Particle filter

1. Distribute N_p particles x_0 over the state space:

   x^i_0 \sim p(x_0)    (2.25)

   where i = 1, 2, ..., N_p denotes the particle index.

2. Give each particle an initial weight:

   w^i_{1|0} = 1/N_p    (2.26)

3. For each time step k = 1, 2, ...:

   Time update: Update the estimated state for each particle,

   x^i_{k+1} = f(x^i_k, u_k, v^i_k),    v^i_k \sim p_{v_k}    (2.27)

   where u_k is a vector of all measurable external system inputs.

   Measurement update: If there is a new measurement, recalculate the particle weights,

   w^i_{k|k} = \frac{1}{c_k} w^i_{k|k-1} p(y_k | x^i_k)    (2.28)

   where

   c_k = \sum_{i=1}^{N_p} w^i_{k|k-1} p(y_k | x^i_k)    (2.29)

   and p(y_k | x^i_k) is the likelihood of the measurement given the particle state.

   Resampling: Sample the distribution in w_{k|k} N_p times to form a new particle set for the next time step.
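The following sequential C++ sketch (illustrative only, not the thesis framework) maps the steps of algorithm 1 onto code for the constant-velocity model from section 2.1, with a Gaussian likelihood and the systematic resampling levels of (2.22); all numeric values are hypothetical:

    #include <algorithm>
    #include <cmath>
    #include <numeric>
    #include <random>
    #include <vector>

    struct Particle { double pos, vel; };

    int main() {
        const std::size_t Np = 1000;
        const double Ts = 0.01, vStd = 0.1, eStd = 0.5;
        std::mt19937 rng(0);
        std::normal_distribution<double> pv(0.0, vStd);
        std::uniform_real_distribution<double> xiDist(0.0, 1.0 / Np);

        std::vector<Particle> x(Np, {0.0, 1.0});     // step 1: initial particle cloud
        std::vector<double> w(Np, 1.0 / Np);         // step 2: uniform initial weights
        std::vector<double> ys = {0.01, 0.02, 0.03}; // stand-in measurement sequence

        for (double y : ys) {                        // step 3: one iteration per time step
            // Time update (2.27): propagate each particle with sampled process noise.
            for (auto &p : x) {
                double v = pv(rng);
                p.pos += p.vel * Ts + 0.5 * Ts * Ts * v;
                p.vel += Ts * v;
            }
            // Measurement update (2.28)-(2.29): Gaussian likelihood of the residual.
            double ck = 0.0;
            for (std::size_t i = 0; i < Np; ++i) {
                double e = y - x[i].pos;
                w[i] *= std::exp(-0.5 * e * e / (eStd * eStd));
                ck += w[i];
            }
            for (auto &wi : w) wi /= ck;
            // Systematic resampling (2.22): one offset, Np evenly spaced levels.
            std::vector<double> cdf(Np);
            std::partial_sum(w.begin(), w.end(), cdf.begin());
            std::vector<Particle> next(Np);
            double xi = xiDist(rng);
            for (std::size_t n = 0; n < Np; ++n) {
                double un = double(n) / Np + xi;
                std::size_t i = std::lower_bound(cdf.begin(), cdf.end(), un) - cdf.begin();
                if (i >= Np) i = Np - 1;             // guard against rounding at the end
                next[n] = x[i];
            }
            x = next;
            std::fill(w.begin(), w.end(), 1.0 / Np); // copies are equally weighted again
        }
    }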


3 Parallel programming in CUDA

Parallel programming means designing a program to execute many operations in parallel instead of sequentially, one at a time. Some algorithms are parallel in their very nature and thus require almost no modifications to be able to run in parallel, while others are quite sequential in their original formulation but can still be modified to run in parallel by clever changes to the algorithm flow.

3.1 Brief history of GPGPU

Graphics processing units, GPUs, are designed for rendering 3D graphics in games and applications, but GPUs have become more and more programmable since the first shader-capable cards were launched in 2001 with the GeForce 3 series of cards. Shader programs are tiny programs that run in parallel on a GPU and are used in all modern games for calculating lighting and visual effects. By using carefully selected geometry and saving the output from the shader stages it was possible to use the GPU for other computations than displaying graphics on screen, but the close connection to graphics made it hard to use efficiently for unrelated problems, and it was mostly used for computer vision and image processing research. In 2006, NVIDIA launched their CUDA framework [NVIDIA, 2012c] to ease adoption of GPU technology for general purpose computing. AMD launched their counterpart, Stream SDK, later in the same year [AMD, 2006]. In 2008 the Khronos Group launched the OpenCL language in collaboration with Apple, IBM, AMD, NVIDIA and Intel [Khronos, 2008]. OpenCL is a heterogeneous computing framework, designed for executing on both CPU and GPU platforms.

3.2 CUDA framework

CUDA, or Compute Unified Device Architecture, is a parallel programming framework designed by NVIDIA for their GeForce and Tesla product lines. CUDA is a C-based language with some extensions for calling GPU kernels from CPU code. It is possible to interface with CUDA code from C++ programs just like interfacing regular C functions. Since its introduction in 2006, CUDA has become the most popular GPGPU framework, with far more published articles mentioning CUDA [Google, a] than the circa 4100 published articles containing the word OpenCL [Google, b]. Developers have created bindings for CUDA in other programming languages: Fortran, Haskell, Python, Java, MATLAB, Perl, Ruby and .NET are a few examples.

3.3 CUDA programming model

CUDA programs are started on the CPU, but the CPU-side code can call functions known as kernels on the GPU to perform computations. Kernels are written in CUDA C, a superset of C with support for launching GPU kernels via a special syntax; in addition, there is some support in CUDA C for C++ constructs such as classes and templates.

Some terms need to be explained to understand the next section. The following terms relate to the logical threads (logical as in software constructs):

thread: a single thread of execution.
block: a group of multiple threads that execute the same kernel.
grid: a group of blocks.

Figure 3.1 shows how the different levels of logical thread groups relate to each other, along with what kind of memory is available at each level. Figure 3.2 shows a block diagram of the Fermi architecture SM hardware. The following terms relate to the physical threads (physical as in hardware architecture dependent):

core: a single compute core; one core runs exactly one instruction at a time.
warp: a group of threads that execute in parallel on the hardware; a warp consists of 32 threads on current generation CUDA hardware.

Kernels are executed by one or more Streaming Multiprocessors (SM). A typical mid-to-high-end GeForce card from the Fermi family (GeForce 400 and GeForce 500 series) has 8-16 SMs on a single GPU; see section 6.1 for a more detailed description of the specific hardware used in this thesis work.

Each SM consists of 32 CUDA Cores (cores) on the hardware used in this thesis [NVIDIA, 2009]. Threads are scheduled for execution by the warp schedulers, seen at the top of figure 3.2. Each SM has two warp scheduler units that work in a lockstep fashion. The smallest unit that a warp scheduler can schedule is called a warp, which consists of 32 threads on all CUDA hardware released at the time of writing. Only one warp may execute at a time on each SM. Threads in CUDA are much more lightweight than CPU threads: context switches are cheaper, and all threads of a warp execute the same instruction, or have to wait while the other threads in the warp execute the instruction. This is called Single Instruction Multiple Thread (SIMT) and is similar to traditional CPU Single Instruction Multiple Data (SIMD) instructions such as SSE, AVX, NEON, Altivec etc. It has consequences when using conditional statements, as described in section 3.5.

Figure 3.1: CUDA thread concepts: each thread has per-thread local private memory, each block has per-block shared memory, and grids share a per-application context global memory. Source: NVIDIA [2009], figure used with permission from NVIDIA.
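As a concrete illustration of the kernel concept, the launch syntax and the per-thread indexing (a minimal sketch, not code from the thesis), the following CUDA program launches one thread per vector element:

    #include <cuda_runtime.h>

    // Kernel: each thread handles one element, identified by its block
    // index, the block size and its thread index within the block.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // the last block may be partially filled
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float)); // device allocations (initialisation omitted)
        cudaMalloc(&y, n * sizeof(float));
        int threadsPerBlock = 256;                                // block size
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // grid size
        saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y);        // kernel launch
        cudaDeviceSynchronize();
        cudaFree(x); cudaFree(y);
    }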

Figure 3.2: Fermi architecture streaming multiprocessor (SM) block diagram: two warp schedulers with dispatch units, a 32,768 x 32-bit register file, 32 CUDA cores, load/store units, special function units (SFU), and 64 KB of shared memory / L1 cache, backed by 1-4 GB of global memory. Source: NVIDIA [2009], figure used with permission from NVIDIA.

To allow for problems which demand more than 32 threads to solve, the CUDA threads are arranged into logical groups called blocks and grids, of sizes that are defined by the software developer. A block is a 3-dimensional collection of threads; each thread in the block has its own individual 3-dimensional identification number to allow the developer to distinguish between the threads in the kernel code. Threads within a single block can share data through shared memory, which reduces the load on global memory. Shared memory has a much lower latency than global memory but is a limited resource: the user can choose between 16 kB shared memory and 48 kB L1 cache, or 48 kB shared memory and 16 kB L1 cache. Several blocks of threads can in turn be grouped into a grid. Grids are 3-dimensional arrays of blocks. The maximum block size is tied to the available hardware resources, while grids can be of (almost) arbitrary size. Blocks within a grid can only share data through global memory, which is the on-GPU memory with the highest latency; more on this in section 3.6.

The identification numbers of threads and blocks are 3-dimensional because many applications have a natural way of dividing up the threads in a 3-dimensional space; for example, in a finite element (FEM) simulation the IDs could be correlated to the x, y, z position of the element handled by the thread. If a problem has fewer than three dimensions it is still possible to keep the length of the extra dimensions at

1, also known as singleton dimensions. A Fermi GPU can have 48 warps (1536 threads) active at once per SM, given that the threads use little enough local and shared memory for all to fit at the same time. Context switches between threads are fast, since registers are allocated to the threads and hence there is no need to save and restore registers and shared memory between thread switches. The result is that it is actually desirable to overallocate the hardware, since this hides memory stalls inside the kernels by letting the warp schedulers switch the currently active warp whenever a stall occurs.

3.4 Compute capability

NVIDIA uses the term Compute capability to refer to different versions of their CUDA hardware which have different capabilities. The Fermi card used in this thesis work is of compute capability 2.0. For a detailed description of what each Compute capability version represents, see NVIDIA [2012b]. The major addition in Compute capability 2.0 was full IEEE floating point support in both single and double precision computations. This is an important feature when comparing results computed on different architectures and is discussed further in chapter 6.

3.5 Branching and flow control

The thread warp is a hardware group of threads that execute on the same SM. Threads of a warp can be compared to sharing a common program counter between the threads; hence all threads must execute the same line of program code. If the code has some branching statements such as if... then... else, the warp must first execute the threads that enter the first block, while the other threads of the warp wait; next, the threads that enter the next block will execute while the other threads wait, and so on. Because of this behaviour, conditional statements should be avoided in GPU code if possible. When threads of a warp follow different lines of execution it is known as having divergent threads. While conditional blocks should be kept to a minimum inside CUDA kernels, it is sometimes possible to reorder statements so that all threads of the same warp follow only a single path of execution in an if... then... else block and mitigate this limitation.

3.6 CUDA memory architecture

GPU memory is organized in a hierarchical structure with three primary levels: local, shared and global. Figure 3.1 shows how the memory levels relate to the threading concepts introduced in section 3.3. Each thread has a small amount of thread-local memory accessible only by the thread; local memory contents are stored in registers. Operations on local memory are completed in one clock cycle. Each thread block has a somewhat larger amount of shared memory, which

is accessible by all threads in the same block. The remaining memory is called global memory. Global memory has a high total throughput (many GB/s) but a much higher latency than accessing registers in the SM. Current CUDA hardware can only access memory in 32-, 64-, or 128-byte blocks. In addition, there is a high latency for accessing global memory, in the order of hundreds of clock cycles. It is possible to hide this long latency by using more algorithm threads than the available hardware can run in parallel and letting the warp scheduler handle the memory stalls by issuing another warp when the active warp is waiting for a global memory operation to complete. The number of algorithm threads divided by the number of hardware threads is called the occupancy of a CUDA program.

3.7 Coalescing memory operations

All operations using global memory should be written in a way that allows the GPU to use all bytes in a 32-, 64- or 128-byte operation. To achieve this it is necessary to use separate arrays for different variables belonging to the same particle. The common practice in C++ programs that run on a CPU is to create a struct or class grouping together the attributes that belong to the same object and then create a single array of such objects; this method is called Array of Structs, or AoS, and is closely associated with Object Oriented Programming (OOP). OOP focuses on arranging data in a way that makes logical sense to the software architect and makes it easier to design programs. The opposite of OOP is called Data Oriented Programming (DOP); the focus in DOP is to arrange the data in a way that is more efficient for the computer hardware, grouping data primarily by which attribute it belongs to instead of by which object it belongs to. This is realized by using a single struct containing each attribute as an array of some primitive data type (int, float, double etc.); this approach is called Structure of Arrays, or SoA. Using an SoA approach allows the hardware to use all the loaded bytes in a 128-byte read instead of discarding the parts of each struct that are not needed by the current operation. The difference between the two layouts is illustrated in the sketch below.
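A minimal sketch of the two layouts for a particle set (the field names are hypothetical):

    // Array of Structs (AoS): a thread reading only `pos` wastes most of
    // each 128-byte memory transaction on the unused neighbouring fields.
    struct ParticleAoS { float pos; float vel; float weight; };
    // ParticleAoS particles[N];

    // Structure of Arrays (SoA): thread i reads pos[i]; consecutive threads
    // read consecutive addresses, so whole memory transactions are used.
    struct ParticlesSoA {
        float *pos;    // N positions, contiguous
        float *vel;    // N velocities, contiguous
        float *weight; // N weights, contiguous
    };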

3.8 High level libraries

This section describes the higher level libraries used in the thesis work.

3.8.1 Thrust template library

Thrust [Hoberock and Bell, 2012] is a collection of C++ templates for parallel algorithms written in CUDA. It contains parallel versions of many C++ standard template library algorithms for sorting, reductions and partial sums, and some specialized combined operations that optimize performance for GPU execution and minimize memory accesses and transfers. This library is used for its parallel algorithm primitives throughout the particle filtering framework described in chapter 5. The primitives used are described in section 4.4.

3.8.2 Random123 random number generation library

Random123 [Salmon, 2012b] is a library for counter-based pseudo-random number generation. The library contains both optimized CPU code and an efficient CUDA implementation of the counter-based pseudo-random number generators described in Salmon et al. [2011]. Random123 is being debated for inclusion in the Boost C++ library collection [Salmon, 2012a]. Random number generation is a complex subject and it is very difficult to develop good pseudo-random number generators. Random number generation is a major part of the time update step: generating the process noise, v^i_k in (2.27).

3.8.3 Eigen linear algebra library

Eigen is not a CUDA library but is used extensively on the CPU side in this thesis work. The Eigen library [Guennebaud et al., 2010] is a collection of C++ templates for common linear algebra operations, such as matrix-vector multiplications, matrix-matrix multiplications, matrix inverses, Cholesky factorization and much more. The implementations are highly optimized and make automatic use of vectorization instructions such as SSE and AVX.
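A small taste of the Eigen API (a trivial sketch, not thesis code):

    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        Eigen::Matrix2d A;
        A << 4.0, 1.0,
             1.0, 3.0;
        Eigen::Vector2d b(1.0, 2.0);
        Eigen::Vector2d y = A * b;             // matrix-vector multiplication
        Eigen::Vector2d x = A.ldlt().solve(b); // solve A x = b via a Cholesky-type factorization
        std::cout << y.transpose() << "\n" << x.transpose() << "\n";
    }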


4 Parallelisation

This chapter discusses parallelising the particle filter algorithm. It would be desirable to run one thread per particle, up to at least the number of computing cores available. However, some parts of the algorithm are more difficult to parallelise due to dependencies between the particles. In addition, given the CUDA programming model it can be beneficial to over-allocate threads so that the warp scheduler can keep the computing cores active while waiting for high-latency operations such as loads and stores in global memory. Preliminary profiling and analysis performed during the beginning of this thesis work indicated that the resampling step is the most complex part of a simple PF test program. The chapter is split up into sections, each covering a single component of the entire algorithm. The notation particle-local is used in this chapter to denote a variable that is specific to a given particle and independent of all other particles.

4.1 Time update

The system prediction function, f(x^i_k, u_k, v^i_k) in (2.8), does not have any dependencies between the particles. It needs to read one global variable, u_k, and two particle-local variables, x^i_k and v^i_k, and write one particle-local value, x^i_{k+1}. The process noise, v^i_k, however, must be sampled from its distribution before computing f(x^i_k, u_k, v^i_k); this requires a parallel random number generator and is explained in section 4.2. A kernel sketch of the time update is shown below.
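A minimal CUDA sketch of a time update kernel for the constant-velocity model from section 2.1 (illustrative only; the process noise samples are assumed to have been generated into a device array beforehand by a parallel PRNG, see section 4.2):

    // One thread per particle: read particle-local state, apply f, write back.
    // The state uses the SoA layout from section 3.7.
    __global__ void timeUpdate(int np, float Ts,
                               float *pos, float *vel,  // particle-local state
                               const float *noise) {    // pre-generated process noise
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < np) {
            float v = noise[i];                         // v_k^i
            pos[i] += vel[i] * Ts + 0.5f * Ts * Ts * v; // (2.4)
            vel[i] += Ts * v;                           // (2.5)
        }
    }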

4.2 Random number generation

Random or pseudo-random numbers are needed to generate process noise in the time update. A pseudo-random number generator (PRNG) is an algorithm/function that yields seemingly uncorrelated numbers in sequence; but since computers work in a deterministic fashion, a PRNG is not a source of truly random numbers.

4.2.1 Recursive PRNGs

Conventional pseudo-random number generators (PRNGs) are designed in a recursive fashion, where the generated number in the sequence is dependent on an internal state which in turn is dependent on the internal state in the previous iteration:

x_{k+1} = f(x_k),    k = 0, 1, ...    (4.1)

where x_0 is known as the seed. Since each state depends on the previous state, generating more numbers from the same sequence is an inherently sequential process. The number of articles concerning parallel pseudo-random number generation has increased recently with the introduction of massively parallel GPUs [Howes and Thomas, 2007], [Bradley et al., 2011], [Manssen et al., 2012].

To generate good quality random numbers in parallel it is necessary that each parallel thread generates a disjoint sequence of random numbers [Brent, 2004]. Each separate thread requires a unique instance of the PRNG with a seed that is distinct from all other threads, or the threads will follow the same sequence of pseudo-random numbers. To achieve disjoint sequences of pseudo-random numbers there are several alternatives [Manssen et al., 2012] using recursive pseudo-random number generators. A first idea might be to seed the generators in turn with another PRNG sequence and use a long-period generator to minimize the risk of overlapping sequences. However, to be certain that the seed yields a disjoint sequence it is necessary to seed all generators with the same seed and advance the state for each thread further than the number of pseudo-random numbers needed by the computations (per thread). By analysing the state advancing function it can also be possible to create a skip-ahead method for a PRNG. Bradley et al. [2011] describe CUDA implementations of the conventional PRNGs Mersenne Twister (MT19937), Sobol and MRG32k3a, including skip-ahead methods for each generator.

4.2.2 Counter-based PRNGs

Recently, an article [Salmon et al., 2011] was published that describes a parallel counter-based pseudo-random number generator. This generator uses the principle of encrypting a counter using a modified cryptographic cipher. The algorithms in the ciphers are modified to reduce the number of rounds applied, which decreases the number of operations (and computation time) needed but also reduces the cryptographic strength. These counter-based PRNGs all pass the TestU01

BigCrush [L'Ecuyer and Simard, 2007] test battery, which is currently the most used testing suite for random number generators [Manssen et al., 2012]. The use of a counter in the PRNG lowers the memory requirement for storing the state, and it is trivial to advance the generator an arbitrary number of steps. The counter-based PRNG library Random123 [Salmon, 2012b] was used in the implementation in this thesis.

4.2.3 The Box-Muller transform

The Box-Muller transform [Box and Muller, 1958] is used in this thesis to transform a pair of U(0, 1)-distributed numbers obtained from the Random123 PRNGs into a pair of Gaussian N(0, 1)-distributed numbers. The Box-Muller transform is defined as

X_1 = \sqrt{-2 \ln U_1} \cos(2\pi U_2),    U_1, U_2 \sim U(0, 1)    (4.2)
X_2 = \sqrt{-2 \ln U_1} \sin(2\pi U_2)    (4.3)

where X_1 and X_2 are independent and Gaussian N(0, 1)-distributed if U_1 and U_2 are independent and uniform U(0, 1)-distributed. For a derivation of the transform and a justification of the method, see Box and Muller [1958]. To keep the implementation complexity down, the Box-Muller transform is used in both the CPU and the GPU implementations in this thesis, even though there are faster methods available on the CPU side, e.g. Marsaglia and Tsang [2000].
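Combining the two building blocks, the following sketch draws a pair of N(0, 1) numbers for a given particle and time step with a counter-based generator followed by the Box-Muller transform (assuming the Philox4x32 C interface of Random123; treat the exact calls as illustrative):

    #include <Random123/philox.h>
    #include <cmath>

    // The counter encodes (time step k, particle index i), so every thread can
    // compute its own numbers independently, without any shared generator state.
    void gaussianPair(uint32_t k, uint32_t i, uint32_t seed,
                      double &x1, double &x2) {
        philox4x32_ctr_t ctr = {{k, i, 0u, 0u}};
        philox4x32_key_t key = {{seed, 0u}};
        philox4x32_ctr_t r = philox4x32(ctr, key); // four pseudo-random 32-bit words
        // Map two words to (0, 1]; the +1 offset avoids log(0) below.
        double u1 = (r.v[0] + 1.0) / 4294967296.0;
        double u2 = (r.v[1] + 1.0) / 4294967296.0;
        x1 = std::sqrt(-2.0 * std::log(u1)) * std::cos(2.0 * M_PI * u2); // (4.2)
        x2 = std::sqrt(-2.0 * std::log(u1)) * std::sin(2.0 * M_PI * u2); // (4.3)
    }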

4.2.4 The Kaiser and Dickman method

The multivariate normal distribution is a generalisation of the 1-dimensional Gaussian distribution to higher dimensions. A covariance matrix and a mean vector are used to describe the multivariate normal distribution instead of a scalar variance and a scalar mean. The probability density function, pdf(x), of the multivariate normal distribution is

pdf(x) = \frac{1}{\sqrt{(2\pi)^n |R|}} \exp\left(-\frac{1}{2} (x - \mu)^T R^{-1} (x - \mu)\right)    (4.4)

where x is an n-dimensional vector, µ is the mean and R is the covariance matrix; |R| is the determinant of the covariance. To generate new samples from the multivariate normal distribution with covariance R and mean µ, the most straightforward method is to use the Cholesky factorisation of R to perform an affine transform to the correct distribution from independent N(0, 1)-distributed numbers; this method is known as the Kaiser and Dickman method from their paper [Kaiser and Dickman, 1962].

Algorithm 2 Drawing samples from an N-dimensional multivariate normal distribution using the Kaiser and Dickman method.

1. Generate N independent N(0, 1)-distributed numbers, z \sim N(0, I).
2. Compute the Cholesky factorisation of the covariance matrix R, that is, find an L such that R = LL^T.
3. x = Lz + µ is then a multivariate normally distributed sample with mean µ and covariance matrix R.

Note: There are other methods of finding a factorisation of the matrix R for the affine transform, e.g. spectral decomposition, matrix square root etc. Cholesky factorisation is, however, already implemented in most linear algebra libraries and thus readily available.

The matrix factorisation in the Kaiser-Dickman method only needs to be performed whenever the covariance matrix changes. The applications in this thesis work all have constant process noise covariances, and therefore the matrices are factorised at filter initialization and the resulting matrix is stored for use in the time update. The matrix multiplication x = Lz + µ is performed in a sequential manner for each element of the matrix, but many particles are processed in parallel.
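A CPU-side sketch of algorithm 2 using Eigen (illustrative; the covariance matrix and mean are made-up values):

    #include <Eigen/Dense>
    #include <random>

    // Draw one sample from N(mu, R) given the precomputed Cholesky factor L.
    Eigen::VectorXd mvnSample(const Eigen::VectorXd &mu,
                              const Eigen::MatrixXd &L,  // R = L L^T
                              std::mt19937 &rng) {
        std::normal_distribution<double> n01(0.0, 1.0);
        Eigen::VectorXd z(mu.size());
        for (Eigen::Index i = 0; i < z.size(); ++i)
            z(i) = n01(rng);      // step 1: z ~ N(0, I)
        return L * z + mu;        // step 3: affine transform
    }

    int main() {
        Eigen::MatrixXd R(2, 2);
        R << 1.0, 0.5,
             0.5, 2.0;
        Eigen::MatrixXd L = R.llt().matrixL(); // step 2: factorise once, reuse every step
        std::mt19937 rng(0);
        Eigen::VectorXd mu = Eigen::VectorXd::Zero(2);
        Eigen::VectorXd x = mvnSample(mu, L, rng);
        (void)x;
    }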

4.3 Measurement update

As described in section 2.6.2, the measurement update first updates the weights and then normalizes the set of weights.

4.3.1 Residual, e^i_k

The residual is computed according to

e^i_k = y_k - h(x^i_k)    (4.5)

where x^i_k is the state vector and y_k is the new measurement. This computation is almost trivial to parallelise, as it only needs to read one global value, y_k, and one particle-local value, x^i_k, and writes one particle-local value, e^i_k.

4.3.2 Likelihood, p(y_k | x^i_k)

The likelihood is computed by using the probability density function p_{e_k}:

p(y_k | x^i_k) = p_{e_k}(e^i_k)    (4.6)

In the special case of Gaussian noise, using (4.4),

p_{e_k}(e^i_k) = \frac{1}{\sqrt{(2\pi)^n |R|}} \exp\left(-\frac{1}{2} (e^i_k - \mu)^T R^{-1} (e^i_k - \mu)\right)    (4.7)

where µ is the mean and R is the covariance matrix of the noise distribution, |R| is the determinant of the covariance, and n is the dimension of e^i_k. This computation, too, is straightforward to parallelise, as it uses one particle-local variable, e^i_k (possibly vector-valued), and two constants, µ and R (possibly vector- and matrix-valued), and writes one scalar-valued particle-local variable, p(y_k | x^i_k). The computation can be optimized by ignoring the constant normalizing factor and only computing the exponential function on each call, since all weights are normalized properly at the end of the measurement update. If the covariance matrix is diagonal, the vector-matrix-vector product in the argument of the exponential function can be simplified further in the implementation.

4.4 Resampling

The resampling step is the most difficult step for a GPU to process in parallel, because it involves random reads from global memory, which hurts performance because of the hardware architecture, see section 3.7.

4.4.1 Parallel binary prefix operations

Binary prefix operations, also known as scan operations, are the building blocks of many parallel algorithms and have been written about extensively before [Hillis and Steele, 1986], [Harris et al., 2007]. A binary prefix operation takes an input vector

x = (x_1, x_2, x_3, ..., x_n)    (4.8)

and any associative binary operator ⊕ (such as addition, multiplication, AND, OR, min, max etc.), and generates the sequence

S = scan(⊕, x) = (x_1, x_1 ⊕ x_2, x_1 ⊕ x_2 ⊕ x_3, ..., x_1 ⊕ ... ⊕ x_n)    (4.9)

Assuming n is a power of two¹, parallelising the above operation efficiently is done in two phases, an up-sweep and a down-sweep; the names stem from the original formulation of the results in a binary tree in Blelloch [1990].

¹ It is easy to extend the algorithm to values of n that are not powers of two by padding the vector with the chosen operator's identity (0 for addition/subtraction, 1 for multiplication/division etc.)

The elements of the vector x are seen as leaves of a balanced binary tree. During the up-sweep each node remembers the value of its left child and computes the value of the operator

applied to the values of its two child nodes, and passes the result to its parent. Each node on the same level can be computed in parallel, but all levels must be computed in sequence, starting from the bottom of the binary tree (the leaves) and moving up. During the down-sweep each node forwards to its left child the value from its parent, and to its right child the value of the operator applied to the stored value (that originated in its left child node during the up-sweep) and the value passed from its parent. Again, in this phase each node on the same level can be computed in parallel, but all levels must be computed in sequence, starting from the top of the binary tree and moving down to the leaves. By arranging the computations in a binary tree, the number of operations² needed will be O(N log N), with N leaves and log₂ N levels of the tree.

² Number of operations here means the total number of mathematical results that need to be computed.

4.4.2 Parallel reduction

A reduction operation is any operation that takes a vector and reduces it to a scalar; one example reduction is the sum-of-all-elements operation. It is trivial to implement a sum of all elements in a sequential fashion: simply iterate over all values and add them one at a time to an accumulator variable. To implement a reduction in parallel, one can use the same up-sweep as in the parallelisation of the binary prefix scan but omit the down-sweep. The required value can be found by applying the operator to the children of the top node. Because of the similarities with parallel binary prefix operations, the resulting complexity is of the same order for a parallel reduction as for a parallel binary prefix operation, although the number of operations should be about half for a reduction compared to a binary prefix operation, because of the missing down-sweep in the reduction.

4.4.3 Normalizing weights

To perform the resampling methods mentioned in section 2.6.3 it is necessary to know the sum of the weights as a normalization constant. The parallel reduction described above can be used to parallelise this computation. This normalization can be combined with the computation of the cumulative probability density in section 4.4.4 below, and is not necessary to perform as a separate step.

4.4.4 Computing the cumulative probability distribution

A binary prefix operation with ⊕ = + is used to compute the empirical cumulative distribution function in (2.23). A Thrust-based sketch of sections 4.4.2-4.4.4 is shown below.
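The following sketch uses Thrust primitives for the normalization and cumulative sum (illustrative; thrust::reduce and thrust::inclusive_scan are the library's parallel reduction and prefix sum):

    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/reduce.h>
    #include <thrust/scan.h>
    #include <thrust/transform.h>

    void weightsToCdf(thrust::device_vector<float> &w,
                      thrust::device_vector<float> &cdf) { // same size as w
        // Parallel reduction (section 4.4.2): sum of all weights.
        float c = thrust::reduce(w.begin(), w.end(), 0.0f);
        // Normalize all weights in parallel (section 4.4.3).
        using namespace thrust::placeholders;
        thrust::transform(w.begin(), w.end(), w.begin(), _1 / c);
        // Parallel prefix sum with op = + (section 4.4.4), yielding cdf_k of (2.23).
        thrust::inclusive_scan(w.begin(), w.end(), cdf.begin());
    }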

without violating the sorting order. A vectorised lower bound search performs the same search as a scalar lower bound search but takes a vector of keys and returns a vector of resulting positions. Finding the particles that intersect each level line as in figure 2.2 can be implemented as a vectorised lower bound search. A lower bound search using the cumulative distribution as the value sequence and the intersection level as the key yields the index of the first particle that crosses the level line. The implementation in this thesis computes all intersection levels $u_n$ and stores them in a vector $u$, then performs a vectorised lower bound search on all elements of $u$ in parallel. The results of the search are stored in a vector. Thrust provides an implementation of a vectorised lower bound where one thread is executed per key but each thread performs the search in a sequential manner; this method requires that the number of keys being searched for is large in order to utilise the GPU hardware and to hide the memory latency.

4.4.6 Copying particles

The particles found by the lower bound search in section 4.4.5 must be copied. This is done by first creating a new empty vector and then, for each particle, copying the particle indicated by the corresponding element in the index vector; this is sometimes called a gather operation. The index vector is the result of the lower bound search in section 4.4.5. Figure 4.1 shows the particle duplication process in three steps. The old_particles vector is discarded after the gather operation and replaced by the next_particles vector. Because of the random nature of the particle duplication process it is very difficult to implement the gather operation efficiently in parallel. It is, however, much more efficient to run one thread per element of the next_particles vector and have several threads reading from the same memory location (gather) than to execute one thread per element of the old_particles vector and let each thread write all of its own copies to the next_particles vector in sequence (scatter); the latter method results in uncoalesced, unaligned random writes to memory and divergent threads.

4.5 Complexity summary

Below is a breakdown of algorithm 1. Completely parallelisable means that there are no dependencies on other particles in that step, and thus it is possible to run each computation in a separate thread. The estimated complexities of the computations for N particles are:

Time update:
- Sampling the process noise is completely parallelisable given a parallel random number generator, O(N) operations.
- Predicting the future state, $x_{k+1}$, is completely parallelisable, O(N) operations.

[Figure 4.1: Copying particles as a part of the resampling step. The diagram shows the old_particles vector, the creation of an empty next_particles vector, the old-to-next mapping produced by the lower bound search, and the resulting particle set after copying.]

Measurement update:
- Computing $h(x_k^i)$ and solving the measurement equation in (2.19) for $e_k$ is completely parallelisable, O(N) operations.
- Computing the likelihood $p(y_k \mid x_k^i)$ for each $i$ is completely parallelisable, O(N) operations.
- Normalizing the particle weights is parallelisable by using a parallel reduction, O(N log₂ N) operations.

Resampling:
- Computing the cumulative distribution function of the particles is done using a parallel prefix sum, O(N log₂ N) operations.
- Computing the number of particle copies is parallelisable using a parallel binary search, O(N log₂ N) operations.
- Copying the particles can be done in parallel, O(N) operations, but requires either uncoalesced memory reads or uncoalesced memory writes, which hurt performance. A gather operation was chosen over a scatter operation since, in this particular case, it leaves no GPU threads idle, gives a deterministic number of operations (apart from waiting for high-latency global memory), and gives coalesced writes to global memory.

Amdahl's law is a formula for estimating the maximum theoretical speedup gained by adding more threads to an algorithm:

$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}} \qquad (4.10)$$

where $S(N)$ is the speedup achieved with $N$ parallel threads and $P$ is the proportion of the algorithm that can be parallelised. For a completely parallelisable algorithm, $P = 1$, while an algorithm in which it is impossible to perform anything at all in parallel yields $P = 0$. From (4.10) it can be seen that the time complexity of a completely parallelisable algorithm requiring $O(T)$ operations running on $M$ parallel threads can be approximated by

$$O\left(\frac{T}{M}\right) \approx \begin{cases} O(1), & T \ll M \\ O(T), & T \gg M \end{cases} \qquad (4.11)$$

From (4.11) it can be seen that, depending on the number of computations, the computation time will be close to constant when the number of computations is small compared to the number of threads, and gradually turn linear in time as the number of computations grows. A comparison of time complexities between the sequential version and the parallel version of the filter is given in table 4.1. The approximations in table 4.1 suggest that a large number of threads executing in parallel using the chosen parallel algorithm implementation could outperform a single thread running the sequential implementation.
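As a concrete numerical illustration of (4.10), with numbers chosen here purely for illustration and not taken from the thesis: if 95 % of an algorithm is parallelisable, $P = 0.95$, then

$$S(1024) = \frac{1}{(1 - 0.95) + \frac{0.95}{1024}} \approx 19.6, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - 0.95} = 20$$

so even with thousands of threads the serial 5 % caps the achievable speedup at a factor of 20, which is why every step of the particle filter above needs a parallel formulation for the GPU implementation to pay off.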

Table 4.1: Time complexity of sequential particle filter compared to parallel filter with M threads.

Step                      | Sequential time complexity | Parallel time complexity
--------------------------|----------------------------|--------------------------
Time update               |                            |
  Sampling process noise  | O(N)                       | O(N/M)
  Prediction              | O(N)                       | O(N/M)
Measurement update        |                            |
  Measurement             | O(N)                       | O(N/M)
  Likelihood              | O(N)                       | O(N/M)
  Normalization^a         | O(N)                       | O(N log₂ N / M)
Resampling                |                            |
  CDF                     | O(N)                       | O(N log₂ N / M)
  Number of copies        | O(N)                       | O(N log₂ N / M)
  Copying particles^b     | O(N)                       | O(N/M)

a) The normalization in the measurement update can be merged into the computation of the empirical cumulative density function in the resampling step.
b) Performance is highly dependent on the memory architecture of the CUDA hardware.
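To close the chapter, the building blocks of sections 4.4.3–4.4.6 can be made concrete using Thrust, the library used by the GPU implementation in chapter 5. The following is a minimal sketch, not the thesis code: the vector names, the scalar (one-dimensional) particle state and the fixed offset ξ are illustrative assumptions.

    // Sketch of the resampling building blocks (sections 4.4.3-4.4.6)
    // expressed with Thrust primitives.
    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <thrust/binary_search.h>
    #include <thrust/gather.h>
    #include <thrust/transform.h>
    #include <thrust/iterator/counting_iterator.h>

    // Intersection levels u_n = (n + xi) * wsum / N for systematic resampling.
    struct level_op {
        float xi, scale;
        level_op(float xi_, float scale_) : xi(xi_), scale(scale_) {}
        __host__ __device__ float operator()(int n) const {
            return (n + xi) * scale;
        }
    };

    void resample(thrust::device_vector<float>& weights,
                  thrust::device_vector<float>& states)
    {
        const int N = weights.size();

        // 4.4.3/4.4.4: cumulative sum of the (unnormalized) weights with a
        // parallel prefix scan; the normalization is folded into the levels
        // by scaling with the weight sum instead of dividing every weight.
        thrust::device_vector<float> cdf(N);
        thrust::inclusive_scan(weights.begin(), weights.end(), cdf.begin());
        const float wsum = cdf.back();

        const float xi = 0.5f;  // the single random draw, fixed for brevity
        thrust::device_vector<float> u(N);
        thrust::transform(thrust::counting_iterator<int>(0),
                          thrust::counting_iterator<int>(N),
                          u.begin(), level_op(xi, wsum / N));

        // 4.4.5: vectorised lower bound search, one index per new particle.
        // (A guard against floating point rounding at the top of the CDF is
        // omitted here.)
        thrust::device_vector<int> idx(N);
        thrust::lower_bound(cdf.begin(), cdf.end(),
                            u.begin(), u.end(), idx.begin());

        // 4.4.6: gather the selected particles into the next set.
        thrust::device_vector<float> next(N);
        thrust::gather(idx.begin(), idx.end(), states.begin(), next.begin());
        states.swap(next);
    }

Since the framework stores its particles in column-major order (see section 5.3), a full implementation would apply the gather once per state component, for example via thrust::permutation_iterator; the scalar state above merely keeps the sketch short.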

5 Implementation

During the writing of this thesis two implementations of the particle filter were developed. One implementation was written in C++ using standard template library (STL) components and the Eigen linear algebra library for optimized linear algebra calculations; this implementation runs solely on the CPU. The second implementation was written in CUDA C++ using the Thrust template library for algorithmic primitives. This CUDA implementation runs the computations on the GPU and the algorithm control on the CPU. Some parts of the initialization of the GPU implementation, such as computing the Cholesky factorization of the process noise covariance needed by the process noise generator, are done on the CPU, but these are only computed once when initializing the filter code.

5.1 C++ templates

C++ has a very powerful template system that allows for metaprogramming. A function can be written as a template where, for example, the data type of some variable is determined by a template parameter. Any good C++ book will give a thorough explanation of templates and their use cases, e.g. Stroustrup [2000], Lippman et al. [2005]. C++ templates are used extensively throughout the particle filtering framework to allow generic code for the particle filtering steps, which are algorithmically the same regardless of the number of states or the data types of the involved variables.
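As an illustration of this kind of genericity (a sketch with invented names, not code from the thesis framework), a filter building block can be templated on both the scalar type and the state dimension:

    // Sketch of the genericity that templates enable; the names (Scalar,
    // NX, State, accumulate_weighted) are invented for this example.
    #include <cstddef>

    // One particle state with NX components of type Scalar.
    template <typename Scalar, std::size_t NX>
    struct State {
        Scalar x[NX];
    };

    // acc := acc + w*src; the same source compiles for float or double and
    // for any state dimension, as used e.g. when forming a weighted state
    // estimate from the particle cloud.
    template <typename Scalar, std::size_t NX>
    void accumulate_weighted(State<Scalar, NX>& acc, Scalar w,
                             const State<Scalar, NX>& src) {
        for (std::size_t i = 0; i < NX; ++i)
            acc.x[i] += w * src.x[i];
    }

Instantiating accumulate_weighted<float, 4> or accumulate_weighted<double, 5> then yields type- and dimension-specialised code from a single implementation.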

5.2 Function objects

A function object is an instance of a C++ class that has an overloaded parenthesis operator (operator()(...)). A function object can be called like a regular function except that, since it is an object, it can also keep track of internal state or bound parameters. In the implementations used in this thesis work, both the time update and the measurement update are realised as function objects whose private data members hold internal variables, such as the process noise covariance or the newly arrived measurement (a sketch of such a function object is given at the end of this chapter). For further explanation of function objects, see any good C++ book [Stroustrup, 2000], [Lippman et al., 2005].

5.3 Particle filtering framework

A framework for particle filtering solutions was written in C++. The filter consists of a template for particle filters that can be combined with generic function objects for calculating the specific measurement equation and prediction needed by the problem. This is realised by way of C++ templates: the measurement update method takes an object as an argument and uses it as a function for computing the particle weights given the particle state, and the time update function takes an object as an argument and uses it as a function for computing the next particle state given the current state.

Two variants of the same framework were written. Both have the same public member functions, although because of implementation details the function objects have small differences in their function prototypes. The CUDA variant takes pointers to Scalar arrays where the CPU variant takes references to Eigen objects as function arguments. Eigen's Vector and Matrix classes provide the data() member function, which returns a pointer to the values in the vector. In addition, to run on the GPU the function objects need to have their member functions declared as __device__ functions to tell the CUDA compiler to compile them as device (GPU) functions. This framework uses column-major storage order in its physical memory layout, which matches the default storage orders of Eigen, MATLAB and OpenGL.

5.3.1 The time update

Function declaration:

    time_update(TimeUpdateFunctor& sys)

The time update member function takes one function object as an argument and applies it to each state vector in the particle collection. The function object sys is assumed to predict the state and replace the given state vector, $x_k$, with the prediction, $x_{k+1}$. The function prototype for sys is:
    Vector& sys::operator()(Vector&& x, unsigned int tid, unsigned int ctr2)

where tid and ctr2 are two integers used in the PRNG to provide random numbers that are independent of the pseudo-random numbers used by the other particles. Algorithm 3 outlines the steps performed by the framework during the time update.

Algorithm 3 Time update
For each particle do:
1. Calculate the prediction from the state, $x_{k+1}^i = \text{sys}(x_k^i, i, k)$. (Generation of process noise is integrated in sys().)

5.3.2 The measurement update

Function declaration:

    measurement_update(const MeasurementVector& y, const MeasurementDistribution& dist, MeasurementFunctor& measurement)

The measurement update member function takes one measurement vector and two function objects as arguments. The MeasurementFunctor object, measurement, is a function object that, when called with a particle state as its parameter, returns the noise-free measurement that would be generated if the given particle state were the true state; this is the equivalent of $h(x)$ in (2.19). The MeasurementDistribution object, dist, is a function object that, when called with a vector $e$, returns the probability density at position $e$ in the distribution; this is equivalent to the probability density function of $p_{e_k}$ in (2.19). Algorithm 4 outlines the steps performed by the framework during the measurement update.

Algorithm 4 Measurement update
For each particle do:
1. Compute the measurement from the state, $y_k^i = \text{measurement}(x_k^i)$
2. Compute the residual, $e_k^i = y - y_k^i$
3. Compute the likelihood, $w_k^i = \text{dist}(e_k^i)$
Finally, normalize all weights, $w_k^i := w_k^i / \sum_j w_k^j$

5.3.3 Resampling

Function declaration:

    resample_systematic(Scalar offset, unsigned int Np = 0)

The resample_systematic function implements the systematic resampling method described in section …. offset is the single random draw, $\xi$ in (2.22). Np is the desired number of particles after resampling, with the special value Np = 0 meaning
keep the same total number of particles. Thus, it is possible to modify the number of particles at run time during the resampling step. Algorithm 5 outlines the steps performed by the framework during the resampling step in the GPU implementation.

Algorithm 5 Parallel resampling
1. Compute the CDF using a parallel prefix scan, $\text{cdf}_k[i] = \sum_{j=1}^{i} w_k^j$.
2. For each particle, in parallel, do:
   (a) Generate the intersection level $u_n$ in (2.22) using $\xi$ = offset.
   (b) Perform a lower bound search for $u_n$ in $\text{cdf}_k$, store the index of the intersection as $i$.
   (c) Copy the state of particle $i$, $x_k^i$, to position $n$ in the next set of particles, $x_{k,\text{next}}^n := x_k^i$, where $x_{k,\text{next}}^n$ is a temporary vector.
3. Replace the set of particle states by the new set, $x_k^i := x_{k,\text{next}}^i$, $i = 1, 2, \ldots, N_p$.
4. Set the default weight on all particles, $w_k^i := N_p^{-1}$, $i = 1, 2, \ldots, N_p$.

The CPU implementation follows the same path of execution except that each particle is evaluated sequentially instead of in parallel, and the parallel prefix scan is replaced by a sequential cumulative sum:

$$\text{cdf}_k[i] = \text{cdf}_k[i-1] + w_k^i, \qquad \text{cdf}_k[0] = 0 \qquad (5.1)$$
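To make the function object interface concrete, the following is a sketch of what a pair of such functors might look like in the CUDA variant. The random-walk model, the seeding scheme and all names here are invented for illustration; in particular, the thesis uses a counter-based PRNG, for which the seeded Thrust engine below is only a stand-in.

    // Sketch of a time update function object for the CUDA variant
    // (hypothetical random-walk model).
    #include <thrust/random.h>

    struct RandomWalkTimeUpdate {
        float sigma;        // process noise standard deviation
        unsigned int nx;    // number of state components

        __device__ float* operator()(float* x, unsigned int tid,
                                     unsigned int ctr2) {
            // Derive a per-particle, per-time-step seed from (tid, ctr2) so
            // that every particle draws independent pseudo-random numbers.
            thrust::default_random_engine rng(tid * 0x9E3779B9u ^ ctr2);
            thrust::random::normal_distribution<float> noise(0.0f, sigma);
            for (unsigned int i = 0; i < nx; ++i)
                x[i] += noise(rng);   // x_{k+1} = x_k + w_k (random walk)
            return x;
        }
    };

    // A MeasurementDistribution functor in the same style, evaluating the
    // unnormalized diagonal-covariance form of (4.7) with mu = 0.
    struct DiagonalGaussianLikelihood {
        float inv_var[4];   // 1/sigma_i^2 per residual component (ny <= 4)
        unsigned int ny;    // number of measurement components

        __device__ float operator()(const float* e) const {
            float q = 0.0f;
            for (unsigned int i = 0; i < ny; ++i)
                q += e[i] * e[i] * inv_var[i];  // e^T R^{-1} e, diagonal R
            // The normalizing constant is omitted; the weights are
            // normalized at the end of the measurement update anyway.
            return expf(-0.5f * q);
        }
    };

The framework would invoke such objects once per particle, exactly as in algorithms 3 and 4 above.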

6 Results

This chapter contains the results obtained from running the implementations described in chapter 5.

6.1 Hardware

The tests were run on the following computer hardware:

- Lenovo ThinkStation
- NVIDIA GeForce GTX 570 (GF110 GPU), Fermi architecture, 1280 MB GDDR5 RAM
- Intel Xeon CPU, 3.07 GHz
- 6 GB DDR3 RAM

The GeForce card has 1280 MBytes of GDDR5 SDRAM on a 320-bit bus, and the memory runs at 1900 MHz, giving a theoretical bandwidth of 152 GBytes/s. The GPU core runs at a 732 MHz core clock. The GTX 570 has 15 streaming multiprocessors (SMs); each SM has 32 CUDA cores that run at twice the core clock, 1464 MHz. Each CUDA core can complete 2 single precision (32-bit) floating point operations per clock cycle (15 SMs × 32 cores × 2 operations × 1.464 GHz ≈ 1405 GFLOPS peak single precision), or 1 double precision (64-bit) floating point operation per 4 clock cycles. Each SM has 4 Special Function Units (SFUs), which can be used for computing transcendental functions such as sin, cos, √x, etc. Each SFU can compute 1 special function result per clock cycle, yielding 1 special function for an entire warp in 8 clock cycles. However, the SFU is only available for single precision calculations and the functions are not IEEE compliant with regard to special case handling and rounding. These functions are therefore implemented in software in
NVIDIA's CUDA C library. The GeForce family of GPUs has an artificial limit on double precision performance at 1/8 of the single precision performance; this is a way for the chip manufacturer to sell more workstation (Quadro) and compute (Tesla) cards¹. Quadro and Tesla cards have a double precision performance that is 1/2 of the single precision performance, which is the maximum that the Fermi architecture allows. Tesla and Quadro cards are, according to marketing, more thoroughly tested than their gaming counterparts, the GeForce line. In addition, Tesla cards often use ECC (error-correcting code) memory that can correct some bit-flip errors automatically. Haque and Pande [2010] suggest that Tesla cards and GeForce cards of the Tesla architecture² exhibit the same amount of soft errors, and thus the "more thoroughly tested" talk is only marketing nonsense; on the other hand, that paper uses Tesla³ cards without ECC, and the Fermi architecture adds ECC support to the Tesla line of cards, which can reduce memory errors.

All Fermi architecture cards have IEEE compliant floating point units. This means that rounding and precision of floating point numbers follow the same rules and definitions on the GPU and the CPU.

6.2 Software

The test code was compiled and run on the following:

- C/C++ compiler: GNU GCC … (Gentoo … p1.6, pie-0.5.2)
- CUDA compiler: CUDA compilation tools, release 5.0, V… (built on Tue Jul 31 17:46:14 PDT 2012)
- Operating system: Gentoo Linux, rolling release
- Display driver: nvidia-drivers …
- CUDA driver version: 5.0 (libcuda.so.…)

The code was compiled as a 64-bit binary at optimization level -O3, without debugging information, when compiling for result data collection. SSE vectorization was enabled for the Eigen library in the CPU implementation.

¹ In this context, Tesla is a product line of cards.
² In this context, Tesla is the architectural generation before the Fermi generation of GPUs. The Tesla architecture is the architecture of the GeForce 200 and GeForce 300 lines of cards. This is not to be confused with the Tesla line of cards, which exists with Tesla-, Fermi- and Kepler-architecture GPUs. Kepler is the architectural generation of the GeForce 600 line of cards, released in May/June 2012.
³ Both the architecture and the card line. Problem?

6.3 Applications

This section describes the applications of the particle filter that were developed during this thesis work.

6.3.1 2-dimensional terrain navigation using altitude sensor and range meter

This is an extension of the simple terrain navigation example in Gustafsson [2010], pp. …. The problem is extended to two-dimensional navigation, and states that represent velocity are added. An aircraft is flying at a known velocity and known altitude above a 2-dimensional landscape. The aircraft is equipped with a range sensor that measures the distance to the ground. The goal is to estimate the position of the aircraft in the X- and Y-directions given only noisy measurements and a height map of the studied area. The range sensor is modelled as a linear sensor measuring the vertical distance to the ground, $y_k$, with Gaussian measurement noise. For a state-space model we choose two states representing the position of the aircraft. To add further complications, it is quite likely that an aircraft cannot measure its absolute ground speed without either a GPS or image processing; therefore we add states for the velocity and yaw orientation of the aircraft, resulting in a standard 4-dimensional state-space representation using Cartesian position and polar velocity:

$$x_k = \begin{pmatrix} P_{x,k} \\ P_{y,k} \\ V_k \\ \theta_k \end{pmatrix} \qquad (6.1)$$

where $P_x$ and $P_y$ are the positions in the east-west and north-south directions of the map, respectively, $V$ is the velocity of the aircraft, and $\theta$ is the direction of the velocity vector.

The following (nonlinear) state-space model was used:

$$x_{k+1} = f(x_k) + w_k, \qquad w_k \sim \mathcal{N}(0, Q) \qquad (6.2a)$$

$$f(x_k) = \begin{pmatrix} P_{x,k} + V_k \cos\theta_k \\ P_{y,k} + V_k \sin\theta_k \\ V_k + a_k \\ \theta_k + \omega_k \end{pmatrix} \qquad (6.2b)$$

$$y_k = s_k - h(x_k) + e_k, \qquad e_k \sim \mathcal{N}(0, \sigma^2) \qquad (6.2c)$$

$$h(x_k) = \text{heightmap}(P_{x,k}, P_{y,k}) \qquad (6.2d)$$

$$Q = \ldots \qquad (6.2e)$$

$$\sigma = 10 \qquad (6.2f)$$

where $a_k$ is the acceleration and $\omega_k$ is the angular velocity of the aircraft. $s_k$ is the altitude of the aircraft; for simplicity of the example we assume that $s_k$ is known. $w_k$ is the process noise, additive Gaussian noise with covariance matrix $Q$. heightmap$(P_{x,k}, P_{y,k})$ is the terrain altitude at position $(P_{x,k}, P_{y,k})$. The altitude map is only defined on evenly spaced points on a grid (raster); the heightmap function uses bilinear interpolation to calculate continuous altitude values between the data points. In this implementation a true height map of a geographic region with a resolution of circa 50 m/pixel was used to generate simulated range measurements. The same height map was used in the filter as well. A predefined flight path was used for generating measurements, shown in figure 6.1. Artificial Gaussian noise with mean $\mu = 0$ and standard deviation $\sigma = 5$ [m] was added to the measurement vector to simulate noise in the range sensor.
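The bilinear interpolation performed by the heightmap function can be sketched as follows; the function and parameter names and the row-major raster layout are assumptions, and only the interpolation scheme itself comes from the description above.

    // Sketch of bilinear interpolation over a raster height map, as used by
    // heightmap() in (6.2d). Names and the row-major indexing are assumed.
    __host__ __device__
    float heightmap_bilinear(const float* map, int width, int height,
                             float px, float py)
    {
        // Integer cell coordinates and fractional position within the cell.
        int x0 = (int)px, y0 = (int)py;
        // Clamp so that the 2x2 neighbourhood stays inside the raster.
        if (x0 < 0) x0 = 0; if (x0 > width  - 2) x0 = width  - 2;
        if (y0 < 0) y0 = 0; if (y0 > height - 2) y0 = height - 2;
        float fx = px - x0, fy = py - y0;

        // The four surrounding grid samples.
        float h00 = map[y0 * width + x0];
        float h10 = map[y0 * width + x0 + 1];
        float h01 = map[(y0 + 1) * width + x0];
        float h11 = map[(y0 + 1) * width + x0 + 1];

        // Interpolate along x on both rows, then along y.
        float top    = h00 + fx * (h10 - h00);
        float bottom = h01 + fx * (h11 - h01);
        return top + fy * (bottom - top);
    }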

Figure 6.1: True flight path in the 2-dimensional navigation problem. The height map is shown in grey scale; brighter points correspond to higher terrain altitude. The flight path begins near the south-west corner of the map and ends near the north edge. The axis values are in pixels.

6.3.2 Tracking targets captured by an image processing system

In this scenario the goal is to track the 2-dimensional position of a target found by image processing on a camera. The image processing for finding a coordinate to use as a measurement has already been performed on the dataset. This dataset is from a real-world measurement. Data has been provided by the FoT project S3, Swedish Defence Research Agency (FOI) [Näsström et al., 2012]. The targets are assumed to move on a plane, with the camera placed at an angle towards the surface. The camera used in the data collection was placed on the roof of a four- or five-storey building, and the targets were people walking around in a park outside the building. The exact orientation of the camera was recorded at each frame to allow transforming points in the world coordinate space to the image plane.

[Figure 6.2: A sample frame from the camera used in the visual tracking problem.]

Figure 6.2 shows a frame from the camera. The sensor is a FIR sensor that sees the heat of the people moving in the park. The following state-space representation was used in this problem:

$$x_k = \begin{pmatrix} P_{x,k} \\ P_{y,k} \\ P_{z,k} \\ V_{x,k} \\ V_{y,k} \end{pmatrix} \qquad (6.3)$$

where $P_{\{x,y,z\},k}$ is the 3D position of the tracked target and $V_{\{x,y\},k}$ is the velocity of the target in the x,y-plane. Targets are assumed to have very little vertical velocity and to move about on the ground plane. The following (linear) system model was used:
