A GPU based Real-Time Line Detector using a Cascaded 2D Line Space

Jochen Hunz (jhunz@uni-koblenz.de), Anna Katharina Hebborn (ahebborn@uni-koblenz.de), Stefan Müller (stefanm@uni-koblenz.de)
University of Koblenz-Landau, Universitätsstraße 1, 56070 Koblenz, Germany

ABSTRACT

We propose the Line Space as a novel parameterization for lines in 2D images. The approach has similarities to the well known Hough Transform; however, we use a linear parameterization instead of an angular representation, leading to better quality and less redundancy. The Line Space is very well suited for GPU implementation since all potential lines in an image are captured through rasterization. In addition, we improve efficiency by introducing the Cascaded Line Space, in which the image is subdivided into smaller Line Spaces that are finally merged into the global Line Space. We implemented our approaches exploiting modern GPU facilities (i.e. compute shaders) and describe the details in this paper. Finally, we discuss the enormous potential of the Line Space for further extensions.

KEYWORDS

Image Processing, Visual Computing, Line Detection, GPGPU, Line Space

1 INTRODUCTION

Lines are typical image features of interest in computer vision and image analysis. Several use cases depend on capturing a set of lines in images accurately and efficiently. Since 1962, one established way to detect lines in images has been the Hough Transform [1]. The present form of the algorithm is largely due to the work of Duda and Hart [2]. They represent a line in its normal form, parameterizing it through its algebraic distance d to the origin and its normal's angle α. By restricting α to the interval [0, π), all normal parameters are unique. Therefore, every line in the input image corresponds to a unique point in the Hough accumulator. If the coordinate system's origin is the center of the image, the maximum distance d to the origin is d_max = √(N² + M²)/2 for an image of size N x M.
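For illustration, the normal-form voting scheme can be sketched as a minimal CPU reference in Python; the accumulator resolution of 180 angle steps and the example point set are illustrative assumptions, not values from the paper:

```python
import math

def hough_accumulator(points, width, height, n_alpha=180):
    """Vote points into a (alpha, d) accumulator using the normal form."""
    d_max = math.hypot(width, height) / 2.0       # origin at the image center
    n_d = 2 * int(math.ceil(d_max)) + 1           # discrete distances in [-d_max, d_max]
    acc = [[0] * n_d for _ in range(n_alpha)]
    for (x, y) in points:
        cx, cy = x - width / 2.0, y - height / 2.0  # shift origin to the center
        for a in range(n_alpha):                    # alpha restricted to [0, pi)
            alpha = a * math.pi / n_alpha
            d = cx * math.cos(alpha) + cy * math.sin(alpha)
            acc[a][int(round(d)) + n_d // 2] += 1
    return acc

# Collinear points on the horizontal line y = 40 in a 64 x 64 image:
acc = hough_accumulator([(x, 40) for x in range(8, 56)], 64, 64)
peak = max((v, a, d) for a, row in enumerate(acc) for d, v in enumerate(row))
```

All 48 collinear points vote into one common (α, d) cell, which is exactly the common intersection point of their sinusoidal curves in the accumulator.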
Thus, the resolution of the Hough accumulator depends on the number of discretization steps of α and on the size of the input image. That is, the accumulator consists of n_α x n_d cells, where n_α is the number of discrete angles and n_d is the number of discrete distances up to d_max. Transforming a set of collinear points from the input image to the Hough accumulator results in sinusoidal curves with one common point of intersection; finding this point of intersection yields a detected line. If the discretization of the angle α is not selected properly, it leads to a subsampling of the lines, so the lines' accuracy is not perfect. In contrast to this, our approach detects lines in images using a linear parameterization of lines instead of an angular representation and thereby considers all possible lines in an image.

In this paper, we propose a novel approach for detecting lines in images. Due to the computing power of modern GPUs and the versatile abilities of compute shaders, all possible lines in an image can be scanned. In section 2, we describe the idea and theory behind the Line Space and introduce the Cascaded Line Space to optimize the algorithm. Section 3 illustrates and discusses implementation details for detecting lines efficiently on the GPU using OpenGL Compute Shaders. Afterwards, section 4 presents our results in comparison to the Hough Transform, specifically the quality of the results and the algorithm's efficiency. Finally, we conclude this paper with a discussion of the results and an outlook on the possibilities and chances of the Line Space.

ISBN: 978-1-941968-02-4 2014 SDIWC

2 LINE SPACE

The term Line Space was introduced by Drettakis and Sillion [3] in the context of hierarchical radiosity simulation. In their paper, a line is considered as a link between two arbitrary surface elements, surrounded by a shaft covering all potential rays between both surface elements. For each shaft, a list of potential occluding candidates is computed during the radiosity simulation. The candidate list can be reused for further visibility computations (i.e. if a link needs to be refined) or for dynamic scene updates.

We use the term Line Space in a somewhat similar manner. However, instead of shafts between surfaces, we investigate lines between all border edges of an image, which is filtered by the Canny edge detector [4]. In figure 1 an edge image of size n x n is shown with an exemplary line. The border edges are numbered from 0 to 4n - 1. Now we consider all lines from every border edge to all border edges, each identified by a tuple (start edge number, end edge number). As a result, we get a two-dimensional space (the Line Space) of size 4n x 4n, where each index tuple represents a line in the image. As a consequence, the algorithm rasterizes 16n² lines in total without any applied optimization. The blue line in figure 1a is represented by the index (1,8) or (8,1) respectively. Thereby, the entry in the Line Space is the number of pixels in the Canny image with a value not equal to zero. As we can see, the Line Space has some interesting properties:

1. LS(s,e) = LS(e,s): it is symmetric and only the upper triangle is needed.
2. LS(s,s) = 0: the elements of the diagonal characterize degenerated lines with zero length and can be omitted.
3. Collinearity: lines between collinear edges are also degenerated, leading to blocks around the diagonal in this example.

Figure 1: (a) An empty image with one exemplary line and (b) the corresponding Line Space entries.

2.1 Cascaded Line Space

As the computation of the Line Space should be as fast as possible to remain real-time capable, we developed a Cascaded Line Space (CLS) to speed up the algorithm. The idea is rather simple: Instead of rasterizing lines between all border edges of an image, we divide the image into cascades of size k x k. In a similar way as described before, we number the border edges of each cascade from 0 to 4k - 1. Now we consider all lines within a cascade between these border edges, resulting in one Line Space per cascade. Thereby, each Line Space is 4k x 4k pixels in size. Consequently, we compute (n/k)² Line Spaces while rasterizing (n/k)² · 16k² = 16n² lines, which are as many lines as before. Because of that we can store the CLS in an image of size 4n x 4n. The advantage of this approach is that we rasterize short lines instead of longer lines through the whole image, which results in a significant speed increase. Figure 2a shows an image of size 16 x 16 divided into 4 x 4 cascades. The result of our algorithm is a CLS as in figure 2b.

In order to obtain the same information from the CLS as from the Line Space, we need to merge the CLS together. We can compute the next higher CLS hierarchy based on the existing one by merging groups of four Line Spaces into one Line Space. Thereby, the cascade intersection points of each line in the Line Space are determined as in figure 3. Here, the line intersects the cascades at the red, green, and orange points. The red points give the index tuples (2,5) and (5,2) within the upper left cascade, whereas the green points give the index tuples (12,13) and (13,12) within the lower left cascade, for instance. Knowing this for every intersection pair, we can perform simple lookups in the CLS to determine how many pixels are set along this line. The result is stored in the next higher CLS hierarchy at the proposed line entry. Doing this for all cascades leads to a new CLS with half as many cascades in each dimension. Repeating this procedure ld(n/k) times will result in the complete Line Space. Figure 4a shows the result of the first merge step of the CLS shown in figure 2b. The result is a new CLS with 2 x 2 cascades of size 8 x 8 each. The next and last merge step in this example is shown in figure 4b, which is the complete Line Space of the image in figure 2a: a CLS with only 1 cascade of size 16 x 16.

Figure 2: (a) An image of size 16 x 16 divided into 4 x 4 cascades and (b) the corresponding Cascaded Line Space.
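The plain (uncascaded) Line Space construction can be sketched as a brute-force CPU reference in Python; the clockwise border-edge numbering and the helper names are assumptions for illustration and may differ from the numbering in the figures:

```python
def edge_point(i, n):
    """Map border edge index i in [0, 4n) to a border pixel of an n x n
    image (assumed numbering: clockwise over top, right, bottom, left)."""
    s, o = divmod(i, n)
    if s == 0:
        return (o, 0)               # top row, left to right
    if s == 1:
        return (n - 1, o)           # right column, top to bottom
    if s == 2:
        return (n - 1 - o, n - 1)   # bottom row, right to left
    return (0, n - 1 - o)           # left column, bottom to top

def dda_hits(img, p0, p1):
    """Count non-zero pixels on the DDA-rasterized line from p0 to p1."""
    (x0, y0), (x1, y1) = p0, p1
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    dx, dy = (x1 - x0) / steps, (y1 - y0) / steps
    hits = 0
    for k in range(steps + 1):
        if img[int(round(y0 + k * dy))][int(round(x0 + k * dx))]:
            hits += 1
    return hits

def line_space(img):
    """Brute-force 4n x 4n Line Space of a binary n x n edge image."""
    n = len(img)
    ls = [[0] * (4 * n) for _ in range(4 * n)]
    for s in range(4 * n):
        for e in range(s + 1, 4 * n):   # symmetry: upper triangle only
            if s // n == e // n:        # same border side: collinear, skip
                continue
            ls[s][e] = dda_hits(img, edge_point(s, n), edge_point(e, n))
    return ls

# Example: an 8 x 8 image whose row y = 3 is fully set; the strongest
# Line Space entry is the horizontal line through that row (edges 11
# and 28 under the numbering above).
img = [[1 if y == 3 else 0 for x in range(8)] for y in range(8)]
ls = line_space(img)
```

The loop bounds directly encode properties 1 to 3 above: only the upper triangle is computed, the diagonal is skipped, and same-side (collinear) tuples are omitted.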
Figure 3: (a) Determined intersection points for each cascade and (b) the corresponding CLS entries, exemplary for the upper left image section of figure 2.

Figure 4: (a) Cascaded Line Space with 2 x 2 cascades and (b) the complete Line Space as a result of the last merge step.
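The property behind the merge phase is that a full-image line's hit count equals the sum of the hit counts of its per-cascade segments, so merging only needs per-cascade lookups. A small sanity check of this decomposition (a Python sketch; it groups rasterized pixels by cascade directly instead of using the cascade-local edge indices of figure 3):

```python
def rasterize(p0, p1):
    """Pixels on the DDA-rasterized line from p0 to p1."""
    (x0, y0), (x1, y1) = p0, p1
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    dx, dy = (x1 - x0) / steps, (y1 - y0) / steps
    return [(int(round(x0 + k * dx)), int(round(y0 + k * dy)))
            for k in range(steps + 1)]

def hits_via_cascades(img, p0, p1, k):
    """Sum the per-cascade hit counts of a line over k x k cascades."""
    per_cascade = {}
    for (x, y) in rasterize(p0, p1):
        cascade = (x // k, y // k)
        per_cascade[cascade] = per_cascade.get(cascade, 0) + img[y][x]
    return sum(per_cascade.values())

# A 16 x 16 image with a diagonal edge, split into 4 x 4 cascades:
img = [[1 if x == y else 0 for x in range(16)] for y in range(16)]
full = sum(img[y][x] for (x, y) in rasterize((0, 0), (15, 15)))
```

Here `hits_via_cascades(img, (0, 0), (15, 15), 4)` reproduces `full`: the line crosses four cascades and contributes four pixels in each.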
3 IMPLEMENTATION

As the Line Space is constructed through line rasterization, we implemented our approach using OpenGL 4 and the performance of modern GPUs. We obtained the best performance using Compute Shaders as they provide high-speed general-purpose computing. The dispatch of a Compute Shader can be fully configured: the shader is dispatched in one global work group, which is a three-dimensional space of local work groups. Each local work group itself forms a three-dimensional space of threads and is executed on one streaming multiprocessor of the GPU.

A Compute Shader invocation is dispatched for every border edge tuple (start edge, end edge) to determine whether a line appears between the edges or not. For this, a line from the start to the end edge is rasterized by the Digital Differential Analyzer (DDA) algorithm and the number of pixels with a value not equal to zero is counted. The Line Space stores this value, with the tuple serving as the Line Space's index. We reduce the computational complexity of the Line Space by using a Buffer Object to store all important tuples, omitting equivalent and degenerated lines at the same time. The following listing shows the core of the entire algorithm in the OpenGL Shading Language:

void main() {
    uint ID = gl_GlobalInvocationID.x;
    ivec2 tuple = tupleBuffer[ID];
    uint hits = DDA(tuple.x, tuple.y);
    imageStore(lineSpace, tuple, uvec4(hits));
}

For the CLS computation the core of the algorithm remains the same; only the input buffer and the compute shader dispatch vary. Now the buffer contains only the border edge tuples of one cascade and we dispatch a set of two-dimensional local work groups. Using this technique, we can determine the ID of the cascade inside the shader and thus compute the exact start and end point of the line in the image. After computing the number of line hits, the result is stored at its proposed cascade position.
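The buffer of important tuples can be prepared once on the host; a Python sketch under the numbering of section 2 (the helper name is hypothetical, not from the paper):

```python
def build_tuple_buffer(n):
    """Enumerate the unique border-edge tuples for an n x n image:
    keep only s < e (symmetry) and drop pairs on the same border side
    (collinear, degenerate lines)."""
    return [(s, e)
            for s in range(4 * n)
            for e in range(s + 1, 4 * n)
            if s // n != e // n]

buf = build_tuple_buffer(4)
```

Symmetry and degeneracy remove a constant fraction of the (4n)² candidates: C(4n, 2) - 4 · C(n, 2) = 6n² tuples remain, e.g. 96 for n = 4.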
To merge the cascades as fast as possible, the intersection points for every cascade level can be precomputed and made available in the shader through a buffer. It is also recommended to omit equivalent and degenerated lines as before.

3.1 Line Extraction

After computing the Line Space, it needs to be analyzed in a similar way as the Hough accumulator of the Hough Transform to extract the detected lines. As every entry in the Line Space stores the number of line hits, the simplest way to analyze the Line Space is to introduce a threshold t: every entry whose value is greater than t is regarded as a detected line. The value of the threshold should depend on the size of the input image as well as on the length of the lines which are expected. If the threshold is chosen too small, this approach detects too many lines. Conversely, an overlarge t would detect too few or, in the worst case, even no lines in the image. The neighborhood of a strong Line Space entry is also densely occupied, and therefore it is advantageous to suppress that neighborhood. Thus, we avoid detecting similar and false lines.

As the number of lines detected by thresholding depends on the threshold and the input image, a stable method which detects exactly m lines is desirable. In general, one could search for the m maxima by going through the Line Space sequentially. Nevertheless, a preferable technique is to search for the m maxima on the GPU. For this, the Line Space is divided into a grid using a compute shader. Each compute shader
invocation reduces the four entries of a grid cell to one entry so that only the largest entry remains (see figure 5). Repeating the image reduction ld(4n) times leads to the maximum value in the Line Space, and therefore to the most distinct line in the image. For this, each entry in the Line Space must additionally store its position. With this knowledge, the detected line can be deleted from the Line Space by setting its value to zero. The next maximum will then be another distinct line, different from the first one. Therefore, repeating this procedure m times delivers the m maxima of the Line Space. The results can be improved further if the neighborhoods of the maxima are deleted in the Line Space as well.

Figure 5: Image reduction on an n x n input image applied ld(n) = 3 times to find the global maximum. [5]

The computational performance of this algorithm depends on m. To detect any number of lines with almost constant performance, the Line Space can instead be sorted using a sorting algorithm which is appropriate for GPUs. There are two categories of sorting algorithms: data-driven ones and data-independent ones. Data-independent sorting algorithms are well suited for implementation on many processors and therefore for the GPU [6]. The most common algorithms in the literature are the bitonic merge sort [6] and the radix sort [7]. The first m texels of the sorted Line Space are the m most distinct lines in the input image. Consequently, if the unfiltered Line Space is sorted, the detected lines will be the same as those obtained by reducing the Line Space m · ld(4n) times without deleting the neighborhoods of the maxima.

To improve the detected lines, the Line Space can be filtered in a preprocessing step. One approach is to use the image reduction technique as in figure 5, but with the Line Space reduced fewer than ld(4n) times. This shrinks the regions a line in the input image produces in the Line Space and suppresses false lines in such regions.

4 RESULTS

Figure 6: Top row: input and Canny image of size 512 x 512. Bottom row: 8 most distinct lines detected by Hough (left) and the Line Space (right).

We compare the Line Space and the Cascaded Line Space to the Hough Transform. Both extract straight lines from an image, and in direct comparison both approaches find reasonable lines (see figure 6). To be as fair as possible, we use the well-tested and fast CPU and GPU implementations provided by OpenCV. As mentioned in the introduction, the Hough Transform's accuracy depends on the fineness of the discretization of the angle α and the algebraic distance d to the image coordinate system's origin. Therefore, simply applying the Hough Transform to images of different sizes using the same discretization parameters (e.g. an angular step of one
degree) is not a fair comparison, as the result's accuracy depends strongly on the parameters and the image size. Hence, the minimum angle Δα between two discrete lines in an image serves as the angular discretization for the Hough Transform, to be comparable to the Line Space.

4.1 Performance

Our test system consists of an Intel Core i7 CPU with 4 cores at 2.66 GHz and 12 GB main memory. The GPU is an Nvidia GTX 770 with 1536 cores and 2 GB video memory. The input Canny images vary between 64 x 64 and 1024 x 1024 pixels in size. We only use square images as our proposed implementation of the CLS solely runs on those. However, this is only an implementation detail and the algorithm is adaptable to arbitrarily sized images as well. Table 1 shows the average computation time in milliseconds for the Hough Transform running on the CPU and the GPU as well as the average computation times for the Line Space and the CLS. While evaluating the performance of the CLS, we consider the initial cascade computation time and the time to merge the cascades into the global Line Space separately. Here we use an initial cascading of k = 8, which provides the best performance in our test scenario. Consequently, a Canny image of size n x n = 1024 x 1024 consists of 128 x 128 cascades and as many Line Spaces.

The Hough Transform only considers pixels in the image with a value not equal to zero. Therefore, the Hough Transform's performance strongly depends on the number of these pixels in the image. To account for this circumstance, we use two different test images with a different amount of structure in them. Thus, we have two different test results for the Hough Transform for Canny images of n = 512, for instance: in the first image p = 6.7% of the pixels are not equal to zero, while the second image has twice as many non-zero pixels (p = 13.5%). Please note that the second image corresponds to the Canny image of figure 6.
The computation of the Line Space only depends on the image size and therefore does not require such a differentiation. For image sizes up to 128 x 128, both the Line Space and the CLS run significantly faster than the implementations of the Hough Transform.

Table 1: Average computation time in milliseconds for the Hough Transform on the CPU and the GPU, the Line Space and the Cascaded Line Space for a Canny image of size n x n. The Cascaded Line Space is separated into the computation time for the initial cascades and the merge phase. Two different images are tested. The minimum angle between two discrete lines in an image is given by Δα (in degrees); the amount of pixels with a value not equal to zero is given by p (in percent).

   n     Δα      p    Hough CPU   Hough GPU   Line Space   CLS Initial   CLS Merge   CLS Total
  64   0.451   15.3      2.14        2.11         0.11         0.02          0.06        0.08
  64   0.451   16.3      2.37        1.56         0.11         0.02          0.06        0.08
 128   0.226   13.4     23.30        1.56         0.60         0.07          0.30        0.37
 128   0.226   15.1     26.97        1.57         0.60         0.07          0.30        0.37
 256   0.112    9.8    147.89        2.70         4.43         0.24          1.49        1.73
 256   0.112   14.4    207.79        4.56         4.43         0.24          1.49        1.73
 512   0.06     6.7    780.57        8.50        34.13         0.91          7.45        8.36
 512   0.06    13.5   1576.35       13.23        34.13         0.91          7.45        8.36
1024   0.028    4.5   5040.72       30.62       247.53         3.31         36.46       39.77
1024   0.028   10.6  12952.30       59.52       247.53         3.31         36.46       39.77
The Line Space is more than 19 times faster than the GPU Hough Transform for n = 64 and p = 15.3%, and the CLS is even faster than the Line Space. For n = 128, the Line Space is still more than twice as fast as the GPU Hough Transform. For an image resolution of n = 256, the Line Space is significantly slower than the GPU Hough Transform for the first image (p = 9.8%); nevertheless, the CLS is still more than 1.5 times faster than the GPU Hough Transform. On the second image (p = 14.4%), the Hough Transform requires 4.56 ms, which is slower than the Line Space and significantly slower than the CLS. Considering images of size n = 512 and p = 6.7%, the CLS is also faster than the GPU Hough Transform, although the difference is not as big anymore. Notably, the initial computation of the cascades for k = 8 is very fast, requiring only 0.91 milliseconds; the merging of the initial cascades is the main time factor. The Hough Transform runs significantly slower on the second image (p = 13.5%) than the CLS. The resulting lines of that image are shown in figure 6. For n = 1024, the GPU Hough Transform is 29% faster than the CLS on image 1, but on image 2 the CLS is almost 1.5 times faster than the GPU Hough Transform. Again, the initial computation of the cascades is very fast, requiring only 3.31 milliseconds; as before, the expensive part is the merging. One must consider that for an image size of n = 1024 and an initial cascading of k = 8, a total of ld(n/k) = ld(128) = 7 merge steps are necessary. However, increasing k does not increase the performance. Therefore, it is desirable for future work to speed up the merging phase.

Overall, the CLS is faster than the Line Space for every image size in our scenario. The merging phase of the CLS is time consuming and therefore critical, whereas the computation of the initial cascades is very fast in contrast to the merge phase.
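The speedup factors quoted above follow directly from Table 1; a quick arithmetic check (Python, values in milliseconds taken from the table):

```python
# Selected timings from Table 1 (milliseconds), keyed by (n, p).
hough_gpu = {(64, 15.3): 2.11, (256, 9.8): 2.70,
             (1024, 4.5): 30.62, (1024, 10.6): 59.52}
line_space = {64: 0.11, 256: 4.43}
cls_total = {256: 1.73, 1024: 39.77}

speed_64 = hough_gpu[(64, 15.3)] / line_space[64]      # Line Space vs. GPU Hough, n = 64
speed_256 = hough_gpu[(256, 9.8)] / cls_total[256]     # CLS vs. GPU Hough, n = 256, image 1
ratio_1024_img1 = cls_total[1024] / hough_gpu[(1024, 4.5)]   # GPU Hough ahead on image 1
ratio_1024_img2 = hough_gpu[(1024, 10.6)] / cls_total[1024]  # CLS ahead on image 2
```

The ratios reproduce the claims in the text: a factor above 19 for n = 64, above 1.5 for n = 256, roughly 1.3 in favor of GPU Hough for n = 1024 on image 1, and almost 1.5 in favor of the CLS on image 2.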
For images with only a few non-zero pixels, the Hough Transform runs faster than the CLS. Even so, the CLS would already be the better choice for an image with p = 6.7%. In general, the CLS runs constantly fast, whereas the Hough Transform's performance greatly depends on the input image, which can be a disadvantage. Furthermore, the Hough Transform is very slow when running on the CPU, so the GPU implementation should be used in general.

5 DISCUSSION AND FUTURE WORK

We introduced the Line Space as a new and efficient parameterization for lines in 2D images. The major benefit of the Line Space is that all potential lines in an image are captured without any redundancy. In addition, the Line Space is well suited for GPU implementation. As a first application, we presented a global line detection algorithm with a Cascaded Line Space. The results of our brute-force implementation can already compete with the OpenCV GPU Hough Transform.

We have several ideas to optimize the merge phase as the most time consuming task. One idea is to use the group shared memory provided for GPU cores, which might result in a significant speed-up. Another idea is to merge Line Spaces directly, since for each line exit point in one cascade the line starting point in the next cascade is well defined. Using this approach, we are also working on a line segment detector. In conclusion, we are convinced that the Line Space has an enormous potential for line and line segment detection, since it provides an efficient basis for further optimization and more complex algorithms in this area.
6 REFERENCES

[1] Hough, Paul V. C. "Method and means for recognizing complex patterns." U.S. Patent No. 3,069,654. 18 Dec. 1962.
[2] Duda, Richard O., and Peter E. Hart. "Use of the Hough transformation to detect lines and curves in pictures." Communications of the ACM 15.1 (1972): 11-15.
[3] Drettakis, George, and François X. Sillion. "Interactive update of global illumination using a line-space hierarchy." Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques. ACM Press/Addison-Wesley Publishing Co., 1997, pp. 57-64.
[4] Canny, John. "A computational approach to edge detection." IEEE Transactions on Pattern Analysis and Machine Intelligence 8.6 (1986): 679-698.
[5] Buck, Ian, and Tim Purcell. "A toolkit for computation on GPUs." GPU Gems (2004): 621-636.
[6] Kipfer, Peter, and Rüdiger Westermann. "Improved GPU sorting." GPU Gems 2 (2005): 733-746.
[7] Harris, Mark, Shubhabrata Sengupta, and John D. Owens. "Parallel prefix sum (scan) with CUDA." GPU Gems 3 (2007): 851-876.