GPU-Accelerated Multiple Observer Siting


Parallelization of the Franklin-Vogt multiple observer siting algorithm.

Abstract

We present two fast parallel implementations of the Franklin-Vogt multiple observer terrain siting algorithm, using either OpenMP or NVIDIA CUDA. On our test terrain, assuming that observers can see out to a radius of interest of 100, finding the set of observers that has the greatest joint viewshed coverage takes only 20 seconds in CUDA, a factor of 60 speedup over the sequential version. The OpenMP version exhibits a factor of 16 speedup on a 16-core system.

Multiple observer siting finds a set of observers with a quasi-maximal joint viewshed for some terrain. Applications include radio transmission towers, Li-Fi base stations, mobile ad hoc networks, environmental monitoring sites, indoor wayfinding, and path planning for surveillance drones. The algorithm has four steps: finding the approximate visibility indices of all points, selecting a candidate subset of potential top observers, finding each candidate's viewshed, and greedily incrementing the set of final observers.

1 Introduction

The purpose of multiple observer siting (Franklin, 2002) is to place observers to cover the surface of a terrain or targets above the terrain. It is useful in the placement of radio transmission towers, mobile ad hoc networks, and environmental monitoring sites. Given a terrain represented as a digital elevation map (DEM), the algorithm assumes that observer and target points are placed above terrain points; therefore, the number of possible observer and target positions on the plane is the size of the terrain. The objective can be to cover as many targets as possible using a given number of observers, or to cover a given number of targets using as few observers as possible. In the usual sense, an observer covers a target that is within a radius of interest and is visible, i.e., has a direct line of sight, from the observer. If the targets are the terrain points (raised to a common height), then the targets visible from an observer constitute the viewshed of the observer, and the number of targets covered by an observer is the viewshed area of the observer.

There are many variations. The radius of interest is usually defined in 2D, but can also be defined in 3D. An observer can cover a target with a quality or probability (Akbarzadeh et al., 2013). A target may require coverage by at least k > 1 observers, for positioning or reliability. An observer may require coverage by at least one other observer, for communication purposes. Both the observers and the targets can be mobile (Efrat et al., 2012). The application may have restrictions other than the number of observers or targets, for example, the cost of placing an observer at each position. The terrain may be either the traditional surface of the earth, or the tops of buildings in an urban terrain.

Recent higher-resolution databases require more efficient algorithms. Current observers may be remotely operated or autonomous airborne vehicles, whose operators wish to optimize the flight path. This may require repeatedly recomputing the siting problem with slightly varying parameters. A recent application requiring optimized observer siting is Li-Fi, or light fidelity,

which switches LEDs on and off at high speed to effect high-speed communication. Its main advantage is its immunity to electronic interference. Its visibility computation is complicated by light's ability to reflect off shiny surfaces. Visibility is also applicable to GHz radio signals, which can be blocked by heavy objects like reinforced concrete pillars. That is relevant to siting the thousands of beacons that may be required for indoor wayfinding in large buildings like airports. (Schiphol Airport's wayfinding system has 2000 beacons.)

Other notable related research includes the many line-of-sight issues in the Modeling and Simulation community, discussed with comparisons of various LOS algorithms in Line-of-Sight Technical Working Group (2004); the relation of visibility to topographic features, studied in Lee (1992); and the pioneering work of Nagy (1994). Champion and Lavery (2002) studied line of sight on natural terrain defined by an L1-spline.

Multiple observer siting is a compute-intensive problem with considerable inherent parallelism. The multiple observer siting algorithm of Franklin and Vogt (2004, 2006) can be optimized to reduce its time and space complexities. Parallelizing the algorithm greatly increases its speed, so that the program is very fast for small terrains and not too slow for large terrains. GPUs are massively parallel devices containing thousands of processing units. They were designed for computer graphics applications but are fully programmable and optimized for SIMD vector operations. Using GPUs for scientific computing is called general-purpose computing on graphics processing units (GPGPU) (Luebke et al., 2004). Although a CPU core is much more powerful than a GPU core, a GPU has many more cores than a CPU. The latest GPUs have up to a few hundred GB/s of memory bandwidth and a few TFLOPS of single-precision processing power, although this theoretical peak performance is very hard to achieve.

The parallelization of line-of-sight and viewshed algorithms on terrains using GPGPU or multi-core CPUs is an active topic. Strnad (2011) parallelized the line-

of-sight calculations between two sets of points, a source set and a destination set, on a GPU, and implemented them on a multi-core CPU for comparison. Zhao et al. (2013) parallelized the R3 algorithm (Franklin and Ray, 1994) to compute viewsheds on a GPU. Their parallel algorithm combines coarse-scale and fine-scale domain decompositions to deal with memory limits and enhance memory access performance. Osterman (2012) parallelized the r.los module (R3 algorithm) of the open-source GRASS GIS on a GPU. Osterman et al. (2014) also parallelized the R2 algorithm (Franklin and Ray, 1994). Axell and Fridén (2015) parallelized and compared the R2 algorithm on a GPU and on a multi-core CPU. Bravo et al. (2015) parallelized the XDRAW algorithm (Franklin and Ray, 1994) to compute viewsheds on a multi-core CPU, after improving its IO efficiency and compatibility with SIMD instructions. Ferreira et al. (2014, 2016) parallelized the sweep-line algorithm of Kreveld (1996) to compute viewsheds on multi-core CPUs.

In multiple observer siting, Magalhães et al. (2010) proposed a local search heuristic to increase the percentage of a terrain covered by a set of k observers. Given a set of candidate observers, each subset of k observers is a solution S, and each solution with one observer different from S is a neighbor of S. Starting from an initial solution, the heuristic repeatedly replaces the current solution with a better neighbor, until a local optimum is found. Pena et al. (2014) improved the performance of the heuristic by accelerating the overlay of viewsheds on a GPU using dynamic programming and a sparse-dense matrix multiplication algorithm.

In this paper, we optimize and parallelize the multiple observer siting algorithm of Franklin and Vogt on a GPU. We also implement it on multi-core CPUs for comparison. Although visibility computation is an essential component, the full siting algorithm is different and more complicated. We review the multiple observer siting algorithm in the next section.

2 Multiple Observer Siting

Franklin and Vogt (2004, 2006) proposed an algorithm to select a set of observers to cover the surface of a terrain. Let the visibility index of a given terrain point be the number of terrain points visible from an observer at that point, divided by the total number of terrain points within the given radius of interest. The algorithm first computes an approximate visibility index for each terrain point, and then selects a set of terrain points with high visibility indexes as candidate observer positions. The observers at the candidate positions are called tentative observers. The algorithm then computes the viewshed of each tentative observer and greedily selects observers from the tentative observers to cover the terrain surface. As an option, the algorithm can select observers that are visible from other observers. At the top level, the algorithm has four sequential steps: vix, findmax, viewshed, and site.

vix, the first step, computes an approximate visibility index for each terrain point, normalized as an integer from 0 to 255 and stored in a byte. The parameters of vix include the number of rows or columns of the terrain, nrows; the radius of interest of observers, roi; the height of observers or targets above the terrain, height; and the number of random targets for each observer. For simplicity, the algorithm assumes that the terrain is square and that the observer height and the target height are equal, but these restrictions are easy to remove. Placing an observer at each terrain point, vix picks a number of random terrain points within roi as targets and calculates their line-of-sight visibility. Then it calculates the ratio of visible targets, normalized to 0-255, as the approximate visibility index of the terrain point.

findmax, the second step, selects a small subset of terrain points with high visibility indexes as tentative observers. Since highly visible points are often close to each other, for example along ridges and on water surfaces, findmax divides the terrain into square blocks and selects a number of highly visible points in each block
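To make the per-point computation concrete, here is a minimal sequential sketch of vix, under assumptions of ours rather than the authors' code: the DEM is a row-major array elev of 16-bit elevations, and lineOfSight and visibilityIndex are illustrative names. Distance along a ray is approximated by the step count on the dominant axis.

```cuda
// Minimal sketch of vix for one terrain point (illustrative, not the
// authors' code). elev is an nrows x nrows row-major elevation grid.
#include <cstdlib>
#include <cmath>
#include <algorithm>

// Exact line-of-sight test: walk the rasterized segment from observer
// (ox,oy) to target (tx,ty), tracking the maximum slope seen so far
// (the local horizon).
bool lineOfSight(const short *elev, int nrows, int ox, int oy,
                 int tx, int ty, float height) {
    float oz = elev[oy * nrows + ox] + height;        // observer eye level
    int n = std::max(std::abs(tx - ox), std::abs(ty - oy));
    if (n == 0) return true;                          // target == observer
    float maxSlope = -INFINITY;
    for (int i = 1; i < n; ++i) {                     // interior points only
        int x = ox + (int)std::lround((double)(tx - ox) * i / n);
        int y = oy + (int)std::lround((double)(ty - oy) * i / n);
        maxSlope = std::max(maxSlope, (elev[y * nrows + x] - oz) / i);
    }
    float tz = elev[ty * nrows + tx] + height;        // target is raised too
    return (tz - oz) / n >= maxSlope;                 // clears the horizon?
}

// Approximate visibility index of (ox,oy): the fraction of random
// targets within roi that are visible, normalized to a byte.
unsigned char visibilityIndex(const short *elev, int nrows, int ox, int oy,
                              int roi, float height, int ntargets) {
    int visible = 0;
    for (int t = 0; t < ntargets; ++t) {
        int dx, dy;
        do {                                          // random point in disk
            dx = rand() % (2 * roi + 1) - roi;
            dy = rand() % (2 * roi + 1) - roi;
        } while (dx * dx + dy * dy > roi * roi);
        int tx = std::min(std::max(ox + dx, 0), nrows - 1);
        int ty = std::min(std::max(oy + dy, 0), nrows - 1);
        if (lineOfSight(elev, nrows, ox, oy, tx, ty, height)) ++visible;
    }
    return (unsigned char)(255 * visible / ntargets);
}
```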

as tentative observers. The parameters of findmax include the number of points to select and the block width; findmax adjusts their values so that the last block is not too small and every block has the same number of tentative observers.

viewshed, the third step, computes the viewshed of each tentative observer using the R2 algorithm. Within a square of width 2roi+1 centered at an observer, the algorithm shoots a ray from the observer to each point on the boundary of the square and calculates the visibility of the points along the rasterized ray. Terrain points within roi along a ray are visited in order, and their visibility is determined by the local horizon. The observer and the points treated as targets are at height above the terrain, while the points treated as obstacles are on the terrain. Figure 1 shows a schematic illustration of the algorithm with roi = 3. On the left of the figure, an observer O is at the center of a 7x7 square, wherein the shaded cells are not within roi and the dashed lines are the rays from O to the boundary points. On the right of the figure, the points along rays OA and OB are shown on a 1D terrain, and their visibility is determined by the local horizon (dotted lines). Because of the discrete nature of the algorithm, a point is often on multiple rays, and it is considered visible if it is visible along any ray.

site, the last step, selects a set of observers from the tentative observers to cover the terrain. It uses a greedy algorithm that adds one tentative observer to the set of observers at a time. At first, the set of observers and their cumulative viewshed are empty. Each iteration computes the union area of the viewshed of each unused tentative observer with the cumulative viewshed, and finds the unused tentative observer with the largest union area. Then it adds that tentative observer to the set of observers and updates the cumulative viewshed as its union with the viewshed of the new observer. That is fast because it is just the union of two bit vectors. Since an unused viewshed cannot add more area to the cumulative viewshed in this iteration than it did in the previous iteration, it is unnecessary to compute its union area in this iteration if another unused viewshed that would add more area has already been encountered.
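To make the ray walk concrete, here is a minimal sketch of one R2 ray under the same illustrative assumptions as before (not the authors' code). Because a cell is only ever set to visible, cells lying on several rays end up visible if any ray sees them, matching the rule above.

```cuda
// Minimal sketch of one R2 ray from observer (ox,oy) to boundary point
// (bx,by) of the (2*roi+1)-wide square (illustrative names). viewshed
// is a (2*roi+1) x (2*roi+1) byte grid of 0/1 flags.
#include <cmath>
#include <algorithm>

void shootRay(const short *elev, int nrows, int ox, int oy,
              int bx, int by, int roi, float height,
              unsigned char *viewshed) {
    float oz = elev[oy * nrows + ox] + height;
    int n = std::max(std::abs(bx - ox), std::abs(by - oy));
    float horizon = -INFINITY;                 // max obstacle slope so far
    for (int i = 1; i <= n; ++i) {             // visit points in order
        int x = ox + (int)std::lround((double)(bx - ox) * i / n);
        int y = oy + (int)std::lround((double)(by - oy) * i / n);
        if (x < 0 || x >= nrows || y < 0 || y >= nrows) break;
        int dx = x - ox, dy = y - oy;
        if (dx * dx + dy * dy > roi * roi) break;      // beyond roi
        float dist = std::sqrt((float)(dx * dx + dy * dy));
        // The point as a target is raised by height; as an obstacle it
        // sits on the terrain.
        float targetSlope = (elev[y * nrows + x] + height - oz) / dist;
        float groundSlope = (elev[y * nrows + x] - oz) / dist;
        if (targetSlope >= horizon)
            viewshed[(dy + roi) * (2 * roi + 1) + (dx + roi)] = 1;
        horizon = std::max(horizon, groundSlope);
    }
}
```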

Figure 1: Computing a viewshed using the R2 algorithm.

The algorithm may stop at a maximum number of observers or at a minimum coverage of the terrain.

3 Optimization

First we optimize the multiple observer siting algorithm to reduce its time and space cost: it is important to parallelize an efficient sequential algorithm rather than an inefficient one. The space requirement matters for GPU parallel programs because GPU memory is smaller than CPU memory, and memory copies between them are expensive. The optimized algorithm has the same four steps as the original: vix, findmax, viewshed, and site.

vix is often the most time-consuming step. Just as it computes approximate visibility indexes by sampling targets, it can compute approximate line-of-sight visibility by sampling points along each line of sight, to further increase speed. It still uses random points within roi of an observer as targets. (Rana (2003) proposed using topographic feature points as targets, but that is not definitely better than using random points.) For each target, instead of computing line-of-sight visibility by evaluating all points along the line between the observer and the target, vix computes an approximate line-of-sight visibility by evaluating a subset of the points, with an interval between successive points of evaluation along the line of sight. The idea is like volume ray casting in computer graphics, which uses equidistant sampling points along a viewing ray for shading and compositing. Setting interval = 1 is equivalent to evaluating all the points. The choice of interval and its effects on the outcome are discussed in the results section.

Instead of selecting multiple tentative observers per block, findmax now selects one tentative observer per block. There are two benefits. First, selecting one tentative observer requires only scanning the block's points for the highest visibility index, while selecting multiple tentative observers requires sorting the points by visibility
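As a minimal sketch (same illustrative assumptions as the earlier sketches), an exponential interval changes only the loop of the exact test, reducing the cost per target from O(roi) to O(log roi) point evaluations:

```cuda
// Approximate line-of-sight test evaluating only points 1, 2, 4, 8, ...
// along the ray, densest near the observer (illustrative sketch).
#include <cmath>
#include <algorithm>

bool lineOfSightApprox(const short *elev, int nrows, int ox, int oy,
                       int tx, int ty, float height) {
    float oz = elev[oy * nrows + ox] + height;
    int n = std::max(std::abs(tx - ox), std::abs(ty - oy));
    if (n == 0) return true;
    float maxSlope = -INFINITY;
    for (int i = 1; i < n; i *= 2) {            // exponential interval
        int x = ox + (int)std::lround((double)(tx - ox) * i / n);
        int y = oy + (int)std::lround((double)(ty - oy) * i / n);
        maxSlope = std::max(maxSlope, (elev[y * nrows + x] - oz) / i);
    }
    float tz = elev[ty * nrows + tx] + height;
    return (tz - oz) / n >= maxSlope;
}
```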

index. Second, highly visible points are still close to each other within a block. With the same number of tentative observers, a smaller block width with one tentative observer per block gives better results than a larger block width with multiple tentative observers per block. The time complexity of findmax is linear in the data size, i.e., Θ(nrows^2).

viewshed still uses the R2 algorithm with rasterized rays, but other algorithms could be used to increase speed. For example, the XDRAW algorithm is faster but less accurate than R2. Wang et al. (2000) proposed a viewshed algorithm that uses a plane, instead of lines of sight, in each of 8 standard sectors around the observer to approximate the local horizon; it is faster but less accurate than XDRAW. Israelevitz (2003) extended XDRAW to increase accuracy by sacrificing speed. The number of tentative observers is nrows^2/bw^2 (bw = block width), and the time complexity of viewshed is O((nrows^2/bw^2) * roi^2).

A more compact representation is used for the viewsheds of tentative observers and the cumulative viewshed in site. A viewshed is (2roi+1) x (2roi+1) pixels, and the cumulative viewshed is nrows x nrows pixels. Each pixel uses only one bit, and each row of a viewshed is padded to word size. The bit representation is compact and fast to process using bitwise operators. (If rows are not padded, indexing is easier but boundary detection is harder.) The difficulty with this representation is the misalignment between a viewshed and the cumulative viewshed: a word in a viewshed usually overlaps two words in the cumulative viewshed.

The time cost of site depends very much on the number of tentative observers. Previously, site computed the union area of each unused viewshed V and the cumulative viewshed C in each iteration, which is very time consuming. Two modifications greatly reduce its time complexity. First, the size of V is (2roi+1)^2 and the size of C is nrows^2, so the time to compute the area of the union of V and C is O(nrows^2). To find the tentative observer to add, however, site can look for the unused viewshed that would add the largest extra area to the cumulative viewshed,
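To handle the misalignment, here is a sketch of OR-ing one word-padded viewshed row into the cumulative viewshed, with our illustrative names and an assumed LSB-first bit order (not the authors' code); each 32-bit viewshed word is split across two cumulative words:

```cuda
#include <cstdint>

// OR one bit-packed viewshed row into the cumulative viewshed row.
// The viewshed row begins at cumulative bit column c0, which is
// generally not word-aligned (LSB-first bit order assumed).
void orRowIntoCumulative(const uint32_t *vsRow, int vsWords,
                         uint32_t *cumRow, int c0) {
    int word  = c0 / 32;                       // first cumulative word hit
    int shift = c0 % 32;                       // bit offset within it
    for (int w = 0; w < vsWords; ++w) {
        uint32_t v = vsRow[w];
        cumRow[word + w] |= v << shift;        // low part of the word
        if (shift)                             // high part spills into the
            cumRow[word + w + 1] |= v >> (32 - shift);  // next word
    }
}
```

The extra-area computation described next follows the same traversal pattern, with the OR replaced by counting the set bits of v AND NOT c.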

instead of looking for the one with the largest union area. The extra area of V can be computed as the area of V \ C_V, where C_V is the corresponding (2roi+1)^2 region in C (possibly only partially inside C), which takes O(roi^2) time. In the bit representation, V \ C_V can be implemented as V AND (NOT C_V). Second, it is unnecessary to compute the extra area of all unused tentative observers in each iteration. The only ones whose extra area needs recomputing are within 2roi of the last added observer: if the distance between an unused tentative observer and the last added observer is larger than 2roi, then C_V, and therefore V \ C_V, stays the same. The number of tentative observers within 2roi is about 4*pi*roi^2/bw^2. In each iteration of site, the time to find the unused tentative observers whose extra area needs computing is O(nrows^2/bw^2); the time to compute their extra areas is O(roi^4/bw^2); the time to find the tentative observer to add (the one with the largest extra area) is O(nrows^2/bw^2); and the time to update C is O(roi^2). Summing up, the time complexity of one iteration of site is O(nrows^2/bw^2 + roi^4/bw^2 + roi^2). If roi/bw = O(1), which is reasonable, this is O(nrows^2/roi^2 + roi^2). The iteration stops when the coverage of C is not lower than a threshold, or when no unused tentative observer has a positive extra area. Algorithm 1 shows the procedure; a code sketch follows below.

Algorithm 1: The algorithm of site
Input: a DEM, a set of tentative observers, and their viewsheds
Output: a set of observers

    the set of observers and the cumulative viewshed are empty;
    while coverage is lower than a threshold and can be increased do
        foreach unused tentative observer do
            if the set of observers is empty or the tentative observer is within 2roi of the last added observer then
                compute the extra area that its viewshed would add to the cumulative viewshed;
            end
        end
        add the unused tentative observer with the largest extra area to the set of observers;
        update the cumulative viewshed as its union with the viewshed of the newly added observer;
    end
    return the set of observers;
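A compact host-side sketch of the optimized loop, with the viewshed machinery abstracted behind two illustrative callbacks (extraArea computing the area of V \ C_V, and unionInto computing C |= V); nothing here is the authors' code:

```cuda
#include <vector>
#include <cstdlib>

struct Obs { int x, y; };

// Greedy site loop (sketch). extraArea(i) = area of V_i \ C_{V_i};
// unionInto(i) ORs viewshed i into the cumulative viewshed C.
std::vector<int> site(const std::vector<Obs> &tent, int roi,
                      long long terrainArea, double targetCoverage,
                      long long (*extraArea)(int),
                      void (*unionInto)(int)) {
    int n = (int)tent.size();
    std::vector<long long> extra(n);
    std::vector<char> used(n, 0);
    std::vector<int> chosen;
    long long covered = 0;
    int last = -1;
    while (covered < (long long)(targetCoverage * terrainArea)) {
        int best = -1;
        long long bestExtra = 0;
        for (int i = 0; i < n; ++i) {
            if (used[i]) continue;
            // Recompute only near the last added observer; farther away,
            // C_V is unchanged, so the cached value is still exact.
            if (last < 0 ||
                (std::abs(tent[i].x - tent[last].x) <= 2 * roi &&
                 std::abs(tent[i].y - tent[last].y) <= 2 * roi))
                extra[i] = extraArea(i);
            if (extra[i] > bestExtra) { bestExtra = extra[i]; best = i; }
        }
        if (best < 0) break;            // no positive extra area remains
        used[best] = 1;
        unionInto(best);
        covered += bestExtra;
        last = best;
    }
    for (int i = 0; i < n; ++i) if (used[i]) chosen.push_back(i);
    return chosen;
}
```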

4 Parallelization

The multiple observer siting algorithm is compute-intensive but has considerable inherent parallelism, which parallel hardware can exploit to greatly increase its speed. The Compute Unified Device Architecture (CUDA) (NVIDIA, 2016) is a parallel computing platform and programming model for NVIDIA GPUs. We parallelize the algorithm using CUDA on an NVIDIA GPU, parallelizing the four steps separately. The K20Xm NVIDIA GPU has 14 streaming multiprocessors, each with 192 CUDA cores. To perform a task using CUDA, it is necessary to define a kernel function that is executed by a large number of CUDA threads. Each thread does a fraction of the work and executes on a CUDA core. Threads are grouped into thread blocks, and each thread block executes on one multiprocessor. The threads of a block are divided into warps of 32 threads, and each warp executes in lockstep (it is instruction locked). The threads of a block can be synchronized, while the threads of different blocks cannot be synchronized or, in general, even communicate.

vix, the first step, computes a visibility index for every terrain point. The number of points is usually very large, so we define a kernel function in which each CUDA thread computes the visibility index of one point. The function picks random targets for the point and calculates their line-of-sight visibility.

findmax, the second step, finds the point with the highest visibility index in every terrain block as a tentative observer. The number of terrain blocks is much smaller than the number of points. We define a kernel function in which each thread block finds the most visible point of one terrain block: each thread of the block finds the most visible point in a portion of the terrain block, and a parallel reduction then finds the most visible point of the whole terrain block.

viewshed, the third step, computes the viewshed of every tentative observer. The number of tentative observers is the same as the number of terrain blocks. We define a kernel function in which each thread block computes the viewshed of one tentative observer, each thread computing a slice of the viewshed. For example, if the block has 4 threads, then each thread computes a quarter of the viewshed; in the case of Figure 1, each thread would shoot 6 consecutive rays from the observer to points on the boundary of the square.
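A sketch of the findmax kernel under our assumptions (illustrative names; blockDim.x a power of two, at most 256; nrows divisible by bw): each thread scans a strided portion of the terrain block, and a shared-memory tree reduction picks the winner.

```cuda
// One thread block per terrain block; writes the index of the most
// visible point of each terrain block into tentative[] (sketch).
__global__ void findmaxKernel(const unsigned char *vix, int nrows, int bw,
                              int blocksPerRow, int *tentative) {
    __shared__ unsigned char bestVix[256];
    __shared__ int bestIdx[256];
    int tb = blockIdx.x;                                // terrain block id
    int bx = (tb % blocksPerRow) * bw;
    int by = (tb / blocksPerRow) * bw;
    unsigned char myVix = 0;
    int myIdx = by * nrows + bx;
    for (int i = threadIdx.x; i < bw * bw; i += blockDim.x) {
        int x = bx + i % bw, y = by + i / bw;
        unsigned char v = vix[y * nrows + x];
        if (v > myVix) { myVix = v; myIdx = y * nrows + x; }
    }
    bestVix[threadIdx.x] = myVix;
    bestIdx[threadIdx.x] = myIdx;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {      // tree reduction
        if (threadIdx.x < s &&
            bestVix[threadIdx.x + s] > bestVix[threadIdx.x]) {
            bestVix[threadIdx.x] = bestVix[threadIdx.x + s];
            bestIdx[threadIdx.x] = bestIdx[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) tentative[tb] = bestIdx[0];
}
```

A launch such as findmaxKernel<<<(nrows/bw)*(nrows/bw), 256>>>(vix, nrows, bw, nrows/bw, tentative) would then produce one tentative observer per terrain block.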

The major task in an iteration of site, the last step, is to compute the extra area of the unused tentative observers within 2roi of the last added observer. In an earlier attempt, we defined a kernel function in which each CUDA thread checks whether the extra area of one tentative observer needs computing and, if so, computes it. That function is slow: the extra area of most tentative observers does not need computing, so the workload is very unbalanced among the threads. The problem is that the function does two tasks, finding the tentative observers whose extra area needs computing and computing the extra areas; defining separate functions for the two tasks eliminates the problem. The first function checks, one tentative observer per CUDA thread, whether the extra area needs computing. The second function computes the extra area of one tentative observer per thread block, with each thread processing one or more rows of the viewshed. The two functions can be called in sequence from the CPU, or the second function can be called from the first; we choose the latter and have the first function call the second whenever an extra area needs computing. A third kernel function finds the unused tentative observer with the largest extra area: each thread block processes a portion of the tentative observers, each thread finds a candidate, and a parallel reduction selects the block's best; on the CPU, the tentative observer to add is found from the results of the thread blocks. A fourth kernel function updates the cumulative viewshed, each CUDA thread processing one row of the viewshed of the newly added observer. A single viewshed is too small to fully utilize the GPU, but it is still slower to update the cumulative viewshed on the CPU and copy it to the GPU.

To sum up, the CUDA program has seven kernel functions: one each for vix, findmax, and viewshed, and four for site. Functions 1-3 are called once, while
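A sketch of that split under our assumptions (illustrative names; the word-alignment shifting is omitted, so CV here is the already-extracted, aligned region of C). The nested launch uses CUDA dynamic parallelism, which the K20Xm (compute capability 3.5) supports when compiled with -rdc=true:

```cuda
// Kernel 5 (sketch): one thread block computes one observer's extra
// area by popcounting V AND NOT C_V, then tree-reducing the sum.
__global__ void extraAreaKernel(const unsigned *V, const unsigned *CV,
                                int words, long long *extra, int obs) {
    __shared__ int partial[128];
    int sum = 0;
    for (int w = threadIdx.x; w < words; w += blockDim.x)
        sum += __popc(V[w] & ~CV[w]);     // bits in V not yet covered
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) extra[obs] = partial[0];
}

// Kernel 4 (sketch): one thread per tentative observer; only observers
// near the last added one launch the per-observer computation.
__global__ void filterKernel(const int2 *tent, int n, int2 last, int roi,
                             const unsigned *const *Vs,
                             const unsigned *const *CVs,
                             int words, long long *extra) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (abs(tent[i].x - last.x) <= 2 * roi &&
        abs(tent[i].y - last.y) <= 2 * roi)
        extraAreaKernel<<<1, 128>>>(Vs[i], CVs[i], words, extra, i);
}
```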

functions 4-7 are called in a loop, and function 5 is called from function 4:

1. compute visibility indexes
2. select tentative observers
3. compute the viewsheds of tentative observers
4. find the unused tentative observers whose extra area needs computing
5. compute the extra area of a viewshed
6. find the unused tentative observer with the largest extra area
7. update the cumulative viewshed

For comparison, we parallelize the algorithm using OpenMP on multi-core CPUs. OpenMP uses compiler directives and library routines to direct parallel execution. In the OpenMP program, we use a #pragma omp parallel for schedule(guided) directive for the following for loops:

- compute the visibility index for each terrain point
- find the most visible point as a tentative observer for each terrain block
- compute the viewshed for each tentative observer
- check whether the extra area needs computing for each tentative observer
- compute the extra area for each row of a viewshed
- update the cumulative viewshed for each row of a viewshed

In addition, a parallel region (using a #pragma omp parallel directive) finds the unused tentative observer with the largest extra area: each thread finds the best tentative observer in a portion of the tentative observers, and then updates the global best in a critical region (using a #pragma omp critical directive), as sketched below.
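A minimal sketch of that parallel region (illustrative names; compiled with -fopenmp): each thread keeps a private best and merges it into the global best inside the critical region.

```cuda
#include <omp.h>

// Find the unused tentative observer with the largest extra area
// (sketch of the OpenMP parallel region described above).
int bestObserver(const long long *extra, const char *used, int n) {
    int best = -1;
    long long bestExtra = 0;
    #pragma omp parallel
    {
        int myBest = -1;
        long long myExtra = 0;
        #pragma omp for schedule(guided) nowait
        for (int i = 0; i < n; ++i)
            if (!used[i] && extra[i] > myExtra) {
                myExtra = extra[i];
                myBest = i;
            }
        #pragma omp critical
        if (myExtra > bestExtra) { bestExtra = myExtra; best = myBest; }
    }
    return best;
}
```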

5 Results and Discussion

We test the parallel and sequential programs on a machine with an NVIDIA Tesla K20Xm GPU accelerator (14 streaming multiprocessors, 192 CUDA cores per multiprocessor, and 6GB of GDDR5 memory), two Intel Xeon E5-2687W CPUs at 3.1GHz (8 cores and 16 hyper-threads per CPU), and 128GB of DDR3 memory, running Ubuntu LTS. The CUDA program is compiled with nvcc and gcc at optimization level -O2; the OpenMP and sequential programs are compiled with gcc at optimization level -O2. The terrain dataset is a Puget Sound DEM from Lindstrom and Pascucci (2001), extracted from USGS 10-meter DEMs. The unit of elevation is 0.1 meter and the range of values is [0, 43930]. Figure 2 shows a shaded relief plot of the terrain, which is about half mountains and half plains.

vix computes an approximate line-of-sight visibility with an interval between successive evaluation points along each line of sight. We use the CUDA program to test the effects of interval because it is the fastest. To evaluate the accuracy of the approximate visibility index, the exact visibility index map of the terrain, normalized to integers 0-255, is computed with roi = 100 (1000 meters) and observer/target height = 100 (10 meters). Figure 3 shows the exact visibility index map; highly visible points (light colored) lie on flat terrain: plains, valleys, and water.

In the simplest case, interval can be a constant. However, we found that points closer to the observer along the line of sight are more important for determining visibility: for example, a closer point with a higher elevation than the observer appears higher (has a larger altitude angle) than a more distant point with the same elevation. Therefore, evaluation points should become denser, and interval smaller, toward the observer. The same holds from the viewpoint of the target, so interval should also become smaller toward the target. We test the CUDA program with roi = 100, height = 100, block width bw = 50 (one tentative observer per 50x50 block), and target coverage = 95%.

Figure 2: Shaded relief plot of the terrain dataset.

vix uses 10, 50, or 250 random targets per point and different values of interval. Let the observer be at point 0 and the target at point 100 along the line of sight. Starting from point 1, the evaluation points are 1, 1 + interval, 1 + 2*interval, and so on, when interval is constant. The following values of interval (with the corresponding evaluation points) are tested; their generation is sketched below:

- 1: 99 points (1, 2, 3, ..., 99)
- 2: 50 points (1, 3, 5, ..., 99)
- 4: 25 points (1, 5, 9, ..., 97)
- 8: 13 points (1, 9, 17, ..., 97)
- 16: 7 points (1, 17, 33, ..., 97)
- 32: 4 points (1, 33, 65, 97)
- exponential: 7 points (1, 2, 4, ..., 64)
- Fibonacci: 10 points (1, 2, 3, ..., 89)
- bidirectional exponential: 12 points (1, 2, 4, ..., 96, 98, 99)
- bidirectional Fibonacci: 16 points (1, 2, 3, ..., 97, 98, 99)

Table 1 shows the running time of vix in seconds, the RMSE of the approximate visibility index map (VIM), and the number of observers selected for the target coverage; the smaller the number of observers, the better the result. More random targets per point produce a longer running time for vix, a smaller RMSE of the VIM, and fewer observers; the improvement from 10 to 50 targets is larger than the improvement from 50 to 250 targets. However, a smaller RMSE does not necessarily mean fewer observers. For example, interval = 8 has a smaller RMSE but more observers than interval = exponential or Fibonacci. A larger interval produces a shorter time, a larger RMSE, and more observers. Figure 4 shows that the exponential and Fibonacci schedules lie below, and are thus better than, the curve through interval = 1, 2, ..., 32, while bidirectional exponential and bidirectional Fibonacci lie almost on the curve.
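A sketch of generating these schedules (illustrative code; n = 100 reproduces the point counts listed above):

```cuda
#include <vector>
#include <algorithm>

// Evaluation points for a line of sight of length n (observer at 0,
// target at n); interior points only.
std::vector<int> constantPoints(int n, int interval) {  // 1, 1+d, 1+2d, ...
    std::vector<int> p;
    for (int i = 1; i < n; i += interval) p.push_back(i);
    return p;
}

std::vector<int> expPoints(int n) {                     // 1, 2, 4, 8, ...
    std::vector<int> p;
    for (int i = 1; i < n; i *= 2) p.push_back(i);
    return p;
}

std::vector<int> fibPoints(int n) {                     // 1, 2, 3, 5, 8, ...
    std::vector<int> p;
    for (int a = 1, b = 2; a < n; ) {
        p.push_back(a);
        int t = a + b; a = b; b = t;
    }
    return p;
}

// Bidirectional variant: run a schedule to the midpoint, then mirror it
// from the target, so points are dense at both ends. With n = 100 and
// expPoints this gives 1, 2, 4, 8, 16, 32, 68, 84, 92, 96, 98, 99.
std::vector<int> bidirectional(const std::vector<int> &half, int n) {
    std::vector<int> q;
    for (int i : half)
        if (i < n / 2) { q.push_back(i); q.push_back(n - i); }
    std::sort(q.begin(), q.end());
    return q;
}
```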

Figure 3: Exact visibility index map of the terrain, normalized to integers 0-255, with roi = 100 and height = 100.

We choose 50 targets and interval = exponential for vix, so that its time complexity is O(nrows^2 log(roi)).

Table 2 shows the results of the parallel and sequential programs with different combinations of roi and bw: roi = 50 and bw = 25; roi = 100 and bw = 50; roi = 200 and bw = 100; roi = 100 and bw = 25; and roi = 200 and bw = 50. The first three combinations have roi/bw = 2 and the last two have roi/bw = 4. The other parameters are height = 100 and target coverage = 95%. The number

Table 1: Results of vix using 10, 50, or 250 random targets per point. interval: the interval between successive evaluation points along the line of sight. Time: the running time of CUDA vix in seconds. RMSE: the RMSE of the approximate visibility index map. Observ.: the number of observers selected for 95% coverage.

Figure 4: The running time of CUDA vix in seconds versus the number of observers selected for 95% coverage. Data points are labeled with the value of interval.

of tentative observers is nrows^2/bw^2, for bw = 25, 50, or 100. Each result is an average over 10 runs of the program. The OpenMP program uses 50 threads with dynamic threads disabled. The results are the running times of vix, findmax, viewshed, site, and the whole program (total), in seconds, and the number of observers. The total time includes I/O time, which is about 0.2 seconds. The number of observers differs slightly among the programs because of parallel execution and randomization. A smaller bw and more tentative observers produce longer viewshed and site times but fewer observers. The time of vix is roughly proportional to log(roi). The time of findmax is very small. The time of viewshed is roughly proportional to roi and to the number of tentative observers. The time of site varies a lot; it is related to roi, bw, and the number of observers. With a fixed roi, the time of site increases as bw decreases and the number of tentative observers increases. With a fixed bw, it may either decrease or increase as roi increases.

Table 3 shows the speedups of the parallel programs over the sequential program. The speedup of vix is about 72 to 95 times for the CUDA program and 19 to 20 times for the OpenMP program. The speedup of the CUDA program decreases as roi increases, because evaluation points lie farther from the observer, so memory accesses for their elevations are less local, and because the CUDA threads in a warp are instruction locked and access memory at the same time. The speedup of the OpenMP program also decreases a little as roi increases. The speedup of findmax is about 5 to 10 times for the CUDA program and 13 to 16 times for the OpenMP program; the workload is too small for the CUDA program. As bw increases, the speedup of the CUDA program increases because each thread has more independent work, while the speedup of the OpenMP program decreases because the granularity of parallelism is larger. The speedup of viewshed is about 39 to 57 times for the CUDA program and 17 to 18 times for the OpenMP program. For the same reason as for vix, the speedup of the CUDA program decreases as roi increases (with the same bw), but increases as bw decreases and the number of tentative observers increases (with the same roi).

Table 2: Results of the CUDA, OpenMP, and sequential programs, averaged over 10 runs. vix, ..., total: running time in seconds. Observers: the number of observers selected for 95% coverage. (One sub-table each for the CUDA, OpenMP, and sequential programs, with columns roi, bw, vix, findmax, viewshed, site, Total, and Observers.)

Table 3: Speedups of the CUDA and OpenMP programs over the sequential program. (One sub-table each for the CUDA and OpenMP programs, with columns roi, bw, vix, findmax, viewshed, site, and Total.)

However, the speedup of the OpenMP program does not decrease as roi increases (with the same bw). The speedup of site is about 5 to 9 times for the CUDA program and 2 to 7 times for the OpenMP program; it increases as bw decreases (with the same roi) but may increase or decrease as roi increases (with the same bw). The speedup of the whole program is about 27 to 60 times for the CUDA program and 15 to 17 times for the OpenMP program; it may increase or decrease as roi increases (with the same bw), and decreases as bw decreases (with the same roi). The CUDA program is about 3 times as fast as the OpenMP program.

6 Conclusions

We have optimized the multiple observer siting algorithm and parallelized it using CUDA on a GPU and using OpenMP on multi-core CPUs. A sample execution time is as follows: on our test terrain, assuming that observers can see out to a radius of interest of 100, finding the set of observers that has

the greatest joint viewshed coverage takes 1200 seconds for the sequential program. The OpenMP program with 16 threads takes 71 seconds, while the CUDA program on an NVIDIA K20Xm GPU takes only 20 seconds. In general, the speedup of the CUDA program on an NVIDIA Tesla K20Xm GPU accelerator is about 30 to 60 times over the sequential program on a CPU core, and the speedup of the OpenMP program on two Intel Xeon E5-2687W CPUs with 16 cores is about 16 times over the sequential program. Both techniques are very effective at accelerating the program. The CUDA program is faster, while the OpenMP program is easier to implement. Due to overhead, the GPU is more efficient for long computations, while the CPU is more efficient for short computations; if a program contains both, it may be possible to achieve greater efficiency by combining GPU and CPU parallel execution.

There are two directions for future work. The first is to further increase speed by selecting multiple observers in each iteration of site, to increase parallelism. Because parallel execution is fast, the second is to reduce the number of observers selected for a target coverage by computing a more accurate visibility index map or by using more tentative observers. If the algorithm used all terrain points as tentative observers, it would need only two steps: viewshed, to compute viewsheds, and site, to select observers.

References

Akbarzadeh, V., Gagné, C., Parizeau, M., Argany, M., and Mostafavi, M. A., 2013. Probabilistic sensing model for sensor placement optimization based on line-of-sight coverage, IEEE Transactions on Instrumentation and Measurement, 62(2).

Axell, T. and Fridén, M., 2015. Comparison between GPU and parallel CPU optimizations in viewshed analysis, Master's thesis, Chalmers University of Technology, Gothenburg, Sweden.

Bravo, J. C., Sarjakoski, T., and Westerholm, J., 2015. Efficient implementation of a fast viewshed algorithm on SIMD architectures, Proceedings of the 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 4-6 March 2015, Turku, Finland (EUROMICRO, Sankt Augustin, Germany).

Champion, D. C. and Lavery, J. E., 2002. Line of sight in natural terrain determined by L1-spline and conventional methods, Proceedings of the 23rd Army Science Conference, 2-5 December 2002, Orlando, Florida (Department of the Army, Washington, DC).

Efrat, A., Mitchell, J. S. B., Sankararaman, S., and Myers, P., 2012. Efficient algorithms for pursuing moving evaders in terrains, Proceedings of the 20th International Conference on Advances in Geographic Information Systems, 6-9 November 2012, Redondo Beach, California (ACM, New York, New York).

Ferreira, C. R., Andrade, M. V. A., Magalhães, S. V. G., Franklin, W. R., and Pena, G. C., 2014. A parallel algorithm for viewshed computation on grid terrains, Journal of Information and Data Management, 5(2).

Ferreira, C. R., Andrade, M. V. A., Magalhães, S. V. G., and Franklin, W. R., 2016. An efficient external memory algorithm for terrain viewshed computation, ACM Transactions on Spatial Algorithms and Systems, 2(2).

Franklin, W. R., 2002. Siting observers on terrain, Advances in Spatial Data Handling: 10th International Symposium on Spatial Data Handling (Richardson, D. E. and Oosterom, P., editors), Springer-Verlag, Heidelberg, Germany.

Franklin, W. R. and Ray, C., 1994. Higher isn't necessarily better: visibility algorithms and experiments, Advances in GIS Research: Sixth International Symposium on Spatial Data Handling (Waugh, T. C. and Healey, R. G., editors), Taylor & Francis, Bristol, Pennsylvania.

Franklin, W. R. and Vogt, C., 2004. Efficient observer siting on large terrain cells, Proceedings of the Third International Conference on Geographic Information Science, 20-23 October 2004, Adelphi, Maryland (University of Maryland, College Park, Maryland).

Franklin, W. R. and Vogt, C., 2006. Tradeoffs when multiple observer siting on large terrain cells, Progress in Spatial Data Handling: 12th International Symposium on Spatial Data Handling (Riedl, A., Kainz, W., and Elmes, G. A., editors), Springer-Verlag, Heidelberg, Germany.

Israelevitz, D., 2003. A fast algorithm for approximate viewshed computation, Photogrammetric Engineering and Remote Sensing, 69(7).

Kreveld, M. V., 1996. Variations on sweep algorithms: efficient computation of extended viewsheds and class intervals, Proceedings of the 7th International Symposium on Spatial Data Handling, August 1996, Delft, Netherlands (IGU Commission on GIS, Charleston, South Carolina).

Lee, J., 1992. Visibility dominance and topographic features on digital elevation models, Proceedings of the 5th International Symposium on Spatial Data Handling, 3-7 August 1992, Charleston, South Carolina (IGU Commission on GIS, Charleston, South Carolina).

Lindstrom, P. and Pascucci, V., 2001. Visualization of large terrains made easy, Proceedings of the Conference on Visualization '01, October 2001, San Diego, California (IEEE Computer Society, Washington, DC).

Line-of-Sight Technical Working Group, 2004. Website, U.S. Army Topographic Engineering Center (last date accessed: 30 October 2008).

Luebke, D., Harris, M., Krüger, J., Purcell, T., Govindaraju, N., Buck, I., Woolley, C., and Lefohn, A., 2004. GPGPU: general purpose computation on graphics

hardware, ACM SIGGRAPH 2004 Course Notes, 8-12 August 2004, Los Angeles, California (ACM, New York, New York).

Magalhães, S. V. G., Andrade, M. V. A., and Ferreira, C., 2010. Heuristics to site observers in a terrain represented by a digital elevation matrix, Proceedings of the XI Brazilian Symposium on Geoinformatics, 28 November - 1 December 2010, Campos do Jordão, Brazil (INPE, São José dos Campos, Brazil).

Nagy, G., 1994. Terrain visibility, Computers & Graphics, 18(6).

NVIDIA, 2016. CUDA parallel computing platform, website, NVIDIA Corporation, Santa Clara, California (last date accessed: 1 July 2016).

Osterman, A., 2012. Implementation of the r.cuda.los module in the open source GRASS GIS by using parallel computation on the NVIDIA CUDA graphic cards, Elektrotehniški Vestnik, 79(1-2).

Osterman, A., Benedičič, L., and Ritoša, P., 2014. An IO-efficient parallel implementation of an R2 viewshed algorithm for large terrain maps on a CUDA GPU, International Journal of Geographical Information Science, 28(11).

Pena, G. C., Magalhães, S. V. G., Andrade, M. V. A., Franklin, W. R., Ferreira, C. R., and Li, W., 2014. An efficient GPU multiple-observer siting method based on sparse-matrix multiplication, Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data, 4-7 November 2014, Dallas, Texas (ACM, New York, New York).

Rana, S., 2003. Fast approximation of visibility dominance using topographic features as targets and the associated uncertainty, Photogrammetric Engineering and Remote Sensing, 69(8).

Strnad, D., 2011. Parallel terrain visibility calculation on the graphics processing unit, Concurrency and Computation: Practice and Experience, 23(18).

Wang, J., Robinson, G. J., and White, K., 2000. Generating viewsheds without using sightlines, Photogrammetric Engineering and Remote Sensing, 66(1).

Zhao, Y., Padmanabhan, A., and Wang, S., 2013. A parallel computing approach to viewshed analysis of large terrain data using graphics processing units, International Journal of Geographical Information Science, 27(2).


More information

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Lifan Xu Wei Wang Marco A. Alvarez John Cavazos Dongping Zhang Department of Computer and Information Science University of Delaware

More information

Lab 11: Terrain Analyses

Lab 11: Terrain Analyses Lab 11: Terrain Analyses What You ll Learn: Basic terrain analysis functions, including watershed, viewshed, and profile processing. There is a mix of old and new functions used in this lab. We ll explain

More information

Fast BVH Construction on GPUs

Fast BVH Construction on GPUs Fast BVH Construction on GPUs Published in EUROGRAGHICS, (2009) C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, D. Manocha University of North Carolina at Chapel Hill NVIDIA University of California

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs DOE Visiting Faculty Program Project Report

Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs DOE Visiting Faculty Program Project Report Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs 2013 DOE Visiting Faculty Program Project Report By Jianting Zhang (Visiting Faculty) (Department of Computer Science,

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

DIGITAL TERRAIN MODELLING. Endre Katona University of Szeged Department of Informatics

DIGITAL TERRAIN MODELLING. Endre Katona University of Szeged Department of Informatics DIGITAL TERRAIN MODELLING Endre Katona University of Szeged Department of Informatics katona@inf.u-szeged.hu The problem: data sources data structures algorithms DTM = Digital Terrain Model Terrain function:

More information

Parallel Systems. Project topics

Parallel Systems. Project topics Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a

More information

Shallow Water Simulations on Graphics Hardware

Shallow Water Simulations on Graphics Hardware Shallow Water Simulations on Graphics Hardware Ph.D. Thesis Presentation 2014-06-27 Martin Lilleeng Sætra Outline Introduction Parallel Computing and the GPU Simulating Shallow Water Flow Topics of Thesis

More information

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment Contemporary Engineering Sciences, Vol. 7, 2014, no. 24, 1415-1423 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.49174 Improved Integral Histogram Algorithm for Big Sized Images in CUDA

More information

GPU-accelerated data expansion for the Marching Cubes algorithm

GPU-accelerated data expansion for the Marching Cubes algorithm GPU-accelerated data expansion for the Marching Cubes algorithm San Jose (CA) September 23rd, 2010 Christopher Dyken, SINTEF Norway Gernot Ziegler, NVIDIA UK Agenda Motivation & Background Data Compaction

More information

DIGITAL SURFACE MODELS OF CITY AREAS BY VERY HIGH RESOLUTION SPACE IMAGERY

DIGITAL SURFACE MODELS OF CITY AREAS BY VERY HIGH RESOLUTION SPACE IMAGERY DIGITAL SURFACE MODELS OF CITY AREAS BY VERY HIGH RESOLUTION SPACE IMAGERY Jacobsen, K. University of Hannover, Institute of Photogrammetry and Geoinformation, Nienburger Str.1, D30167 Hannover phone +49

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Workloads Programmierung Paralleler und Verteilter Systeme (PPV)

Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Parallel Approach for Implementing Data Mining Algorithms

Parallel Approach for Implementing Data Mining Algorithms TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy Jan Verschelde joint with Genady Yoffe and Xiangcheng Yu University of Illinois at Chicago Department of Mathematics, Statistics,

More information

An efficient map-reduce algorithm for spatio-temporal analysis using Spark (GIS Cup)

An efficient map-reduce algorithm for spatio-temporal analysis using Spark (GIS Cup) Rensselaer Polytechnic Institute Universidade Federal de Viçosa An efficient map-reduce algorithm for spatio-temporal analysis using Spark (GIS Cup) Prof. Dr. W Randolph Franklin, RPI Salles Viana Gomes

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular

More information

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel

More information

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

THE PHOTOGRAMMETRIC LOAD CHAIN FOR ADS IMAGE DATA AN INTEGRAL APPROACH TO IMAGE CORRECTION AND RECTIFICATION

THE PHOTOGRAMMETRIC LOAD CHAIN FOR ADS IMAGE DATA AN INTEGRAL APPROACH TO IMAGE CORRECTION AND RECTIFICATION THE PHOTOGRAMMETRIC LOAD CHAIN FOR ADS IMAGE DATA AN INTEGRAL APPROACH TO IMAGE CORRECTION AND RECTIFICATION M. Downey a, 1, U. Tempelmann b a Pixelgrammetry Inc., suite 212, 5438 11 Street NE, Calgary,

More information

An Introduction to Lidar & Forestry May 2013

An Introduction to Lidar & Forestry May 2013 An Introduction to Lidar & Forestry May 2013 Introduction to Lidar & Forestry Lidar technology Derivatives from point clouds Applied to forestry Publish & Share Futures Lidar Light Detection And Ranging

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Implementation of the r.cuda.los module in the open source GRASS GIS by using parallel computation on the NVIDIA CUDA graphic cards

Implementation of the r.cuda.los module in the open source GRASS GIS by using parallel computation on the NVIDIA CUDA graphic cards ELEKTROTEHNIŠKI VESTNIK 79(1-2): 19 24, 2012 ENGLISH EDITION Implementation of the r.cuda.los module in the open source GRASS GIS by using parallel computation on the NVIDIA CUDA graphic cards Andrej Osterman

More information

Accelerated Load Balancing of Unstructured Meshes

Accelerated Load Balancing of Unstructured Meshes Accelerated Load Balancing of Unstructured Meshes Gerrett Diamond, Lucas Davis, and Cameron W. Smith Abstract Unstructured mesh applications running on large, parallel, distributed memory systems require

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Faster Simulations of the National Airspace System

Faster Simulations of the National Airspace System Faster Simulations of the National Airspace System PK Menon Monish Tandale Sandy Wiraatmadja Optimal Synthesis Inc. Joseph Rios NASA Ames Research Center NVIDIA GPU Technology Conference 2010, San Jose,

More information

CS/EE 217 Midterm. Question Possible Points Points Scored Total 100

CS/EE 217 Midterm. Question Possible Points Points Scored Total 100 CS/EE 217 Midterm ANSWER ALL QUESTIONS TIME ALLOWED 60 MINUTES Question Possible Points Points Scored 1 24 2 32 3 20 4 24 Total 100 Question 1] [24 Points] Given a GPGPU with 14 streaming multiprocessor

More information

[Youn *, 5(11): November 2018] ISSN DOI /zenodo Impact Factor

[Youn *, 5(11): November 2018] ISSN DOI /zenodo Impact Factor GLOBAL JOURNAL OF ENGINEERING SCIENCE AND RESEARCHES AUTOMATIC EXTRACTING DEM FROM DSM WITH CONSECUTIVE MORPHOLOGICAL FILTERING Junhee Youn *1 & Tae-Hoon Kim 2 *1,2 Korea Institute of Civil Engineering

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

High Quality DXT Compression using OpenCL for CUDA. Ignacio Castaño

High Quality DXT Compression using OpenCL for CUDA. Ignacio Castaño High Quality DXT Compression using OpenCL for CUDA Ignacio Castaño icastano@nvidia.com March 2009 Document Change History Version Date Responsible Reason for Change 0.1 02/01/2007 Ignacio Castaño First

More information

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational

More information

Geometric Considerations for Distribution of Sensors in Ad-hoc Sensor Networks

Geometric Considerations for Distribution of Sensors in Ad-hoc Sensor Networks Geometric Considerations for Distribution of Sensors in Ad-hoc Sensor Networks Ted Brown, Deniz Sarioz, Amotz Bar-Noy, Tom LaPorta, Dinesh Verma, Matthew Johnson, Hosam Rowaihy November 20, 2006 1 Introduction

More information

Topographic Lidar Data Employed to Map, Preserve U.S. History

Topographic Lidar Data Employed to Map, Preserve U.S. History OCTOBER 11, 2016 Topographic Lidar Data Employed to Map, Preserve U.S. History In August 2015, the National Park Service (NPS) contracted Woolpert for the Little Bighorn National Monument Mapping Project

More information

Simultaneous Solving of Linear Programming Problems in GPU

Simultaneous Solving of Linear Programming Problems in GPU Simultaneous Solving of Linear Programming Problems in GPU Amit Gurung* amitgurung@nitm.ac.in Binayak Das* binayak89cse@gmail.com Rajarshi Ray* raj.ray84@gmail.com * National Institute of Technology Meghalaya

More information