University of Cambridge
Engineering Part IIB
Module 4F12 - Computer Vision and Robotics

Mobile Computer Vision

[Cover figure: system overview -- a phone client extracts image features and matches them against a database built from a web server's master database of labelled images.]

Matthew Johnson
November 2010
Applied Computer Vision

Computer Vision is not only an exciting field of research; it is also one of the key new technologies driving innovation in business and industry today. Human vision is a fundamental and incredibly rich channel of information which we use for a huge array of activities, and advances in computer vision are creating new opportunities for innovation in the way in which we interact with our world through digital devices. In this lecture, we will focus on ways in which the concepts and techniques you have been exposed to so far in this course are being used today in the field of image recognition and search on embedded devices.
Impaired Vision

The first thing that becomes clear as one begins to work on a ubiquitous computing device like a mobile phone is that one can no longer take anything for granted. For example, most mobile phones in use today do not have any hardware support for division. This means that any division operation is translated in assembly into a long series of other instructions, making it incredibly costly. Here is a partial list of the challenges posed by mobile devices:

- Limited processor hardware (slow floating point operations, lack of functionality)
- Small cache (i.e. limited stack space for procedures)
- Slow access to a small supply of Random Access Memory (RAM)
- Costly disk access
- Arcane versions of known programming languages

How do we overcome these challenges?
Client/Server Model

[Figure: client/server architecture -- a thin phone client sends images to a web server, which extracts features, finds the closest match in its master database, and returns the result; a separate user interface uploads labelled images to the server.]

The most common (and, at first, the only) way of overcoming these challenges on mobile devices was to offload any significant computation to a remote server. The system on the phone becomes a thin client (meaning that it does no real computation) that sends the image to the server, which computes feature vectors, matches them against a database of known images, and returns the information on the match. In order to add new images to this system, a second system interacts with the server to upload images and labels.
Client/Server Model, cont.

This arrangement solves the problem of the mobile device's limited capabilities, but has several problems of its own:

- If the file server stops functioning, the entire system fails
- If communication to the file server is lost, the entire system fails
- The server(s) have to perform massive amounts of parallel computation and database indexing
- Large quantities of data (i.e. images) require lots of time and bandwidth to transfer
- There is a significant delay in response to user actions
Smart Clients

[Figure: smart client architecture -- the phone client extracts features and finds the closest match in a local copy of the database, exchanging only feature vectors and metadata with the server.]

An alternative approach is to embrace and overcome the difficulties of performing vision on mobile devices: through research, we can modify existing algorithms, or design entirely new ones, so that computer vision problems can be solved with the tools available and within the limits imposed. This would be considered a smart client model, because the client does some or all of the computation. In this model, the client is able to perform feature vector extraction and matching locally, which means that it must maintain a full or partial copy of the master database on the device.
Smart Clients, cont.

We can see quite clearly how a smart client system, while it has to overcome the challenges of running in a limited environment, also has many benefits:

- If the server malfunctions, clients can still operate with their current database
- Similarly, if communication to the server is lost, the system still operates
- All computation is done by the phones locally, thereby greatly reducing server-based computation
- All communication with the server consists of feature vectors and metadata, much smaller than images
- Response to user actions is immediate

Clearly there are great gains to be had, but first we must overcome the limitations imposed by mobile platforms.
Smart Client: System Design

If we are going to achieve Computer Vision on a mobile phone, we must first come to terms with reality. That reality dictates that we have no large constructs in memory, no division, and no floating point numbers. Most of the standard Computer Vision algorithms are simply not available for use. Instead, we will have to rely on more efficient algorithms and, where those are unavailable, on approximations. A crucial requirement for this task is knowing the literature. We must understand the basic underlying principles of a technique so that we know what can be jettisoned without reducing performance.

We are going to design a logo recognition system that can run on a mobile phone. It will be based on using a distance transform on the output of an edge detector to match logo contours. Thus, the system will contain three parts:

- Edge Detector
- Distance Transform
- Contour Fragment Matching
Logo Recognition

Logo recognition is a key task for many mobile vision applications. Restaurants, shops and other businesses take great care to have distinctive logos that are prominently displayed, making them ideal targets for user localization and value-added experiences.

[Figure: recognition pipeline -- edge detection, then the distance transform, then convolution with the logo template.]

We will use Canny edge detection, the Euclidean distance transform, and a simple threshold-based contour recognizer. In the course of developing this system, we will examine the main techniques by which computer vision algorithms can be adapted to work on an embedded device like a mobile phone.
Problems

Our chosen techniques pose some very difficult problems to solve. The Canny edge detection algorithm poses the biggest challenge. It works by following the peaks of the first derivative of a Gaussian convolved with the image to find continuous edge contours. In practice, it is implemented using the following steps, almost all of which pose intractable problems in their original forms:

- Convolution with a Gaussian: floating point arithmetic
- Computing derivatives in the horizontal and vertical directions: floating point arithmetic
- Computing edge magnitude: square root computation
- Non-maximum suppression: floating point arithmetic, arctangent computation
- Hysteresis: recursion, redundant memory access

The Euclidean distance transform consists of finding, at each pixel in the image, the Euclidean distance to the nearest edge. A naïve implementation is computationally intensive, but we will examine a technique that makes it very efficient.
Gaussian Approximation

When convolving an image with a Gaussian, one must perform 2k floating point multiplications per pixel, where k is the size of the Gaussian kernel, typically 6σ. The standard way of optimizing this is to convert the floating point values to integer approximations, with an integer division at the end. Unfortunately, this method of approximation is still too computationally intensive, requiring 2k integer multiplications and 2 integer divisions. The tool we want to use instead is the box filter approximation, as discussed by Rau and McClellan in [4]. A box filter simply sets a pixel to the mean of its neighboring pixels. One interesting property of a box filter in one dimension is that multiple applications approximate a Gaussian, as seen below:

[Figure: a box filter applied repeatedly (i = 0, 1, 2, 3) converges to a Gaussian profile.]
Gaussian Approximation, cont.

Another property is that the box filter is separable (that is, computing it in the horizontal and vertical directions separately is equivalent to computing it with a two-dimensional square filter), and as such it can be used to filter in two dimensions while losing none of its efficiency. Even more interesting is that, by using a rolling sum, it can be computed using one addition, one subtraction and one division per pass per pixel. As it takes about 3 passes to get a good approximation, this results in 6 additions, 6 subtractions and 6 divisions per pixel to convolve in the horizontal and vertical directions, regardless of the size of the kernel. Six divisions sounds like a limitation, but if the size of the kernel is a power of 2, then we can use bit shifting instead of division, leaving a very efficient Gaussian approximation whose outputs are integers. Because the outputs are integers, we have removed the floating point arithmetic from derivative computation. Similarly, if we store edge magnitude as the magnitude squared (i.e. gx² + gy²) then we avoid the square root computation, as all that is important is the ordering of edges by magnitude, not the actual magnitude values.
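The rolling-sum pass above can be sketched in one dimension as follows. This is a minimal illustration, not the lecture's implementation: the function name, the integer signal, and the (absent) border handling are assumptions; a real filter would pad the edges and run the pass three times in each direction.

```python
def box_filter_1d(signal, shift):
    """One box-filter pass over an integer signal with a rolling sum.

    The window holds 2**shift samples, so the mean is a bit shift and
    each output costs one addition, one subtraction and one shift --
    no multiplication or division, regardless of window size.
    """
    w = 1 << shift                       # window size, a power of two
    means = []
    total = sum(signal[:w])              # initial window sum
    means.append(total >> shift)
    for i in range(w, len(signal)):
        total += signal[i] - signal[i - w]   # roll the window forward
        means.append(total >> shift)
    return means
```

Repeating the pass about three times horizontally and three times vertically gives the Gaussian approximation described above, entirely in integer arithmetic.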
Non-Maximum Suppression

In order to suppress non-maximum edge signals, we must compare every pixel to those perpendicular to its edge direction. Some simple trigonometry allows us to avoid a full arctangent computation and simply use gx and gy to perform interpolation directly. While this is exact, it requires floating point computation, which is too expensive. Therefore, we will use another approximation trick: the lookup table.

[Figure: the four possible comparison sectors for an edge direction, numbered 0 to 3.]

In our system so far, gx and gy are integers. Thus, we can use them to index a two-dimensional array. As can be seen above, there are only 4 possible comparison sets, and thus each array index will hold an integer from 0 to 3 indicating which set to compare. We will fill this array during an initialization stage and thus avoid all floating point operations at runtime.
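One way such a lookup table might be realized is sketched below. All trigonometry runs once at initialization; at runtime a sector index in 0-3 is fetched with two array lookups. The clamp range R and the function names are illustrative assumptions -- a real implementation would size the table to the actual gradient range (or quantize the gradients first).

```python
import math

R = 64  # assumed clamp range for gradient components (illustrative)

def build_sector_table(r=R):
    """Precompute which 45-degree sector each clamped (gx, gy) pair
    falls into.  All floating point work happens here, once."""
    table = [[0] * (2 * r + 1) for _ in range(2 * r + 1)]
    for gx in range(-r, r + 1):
        for gy in range(-r, r + 1):
            angle = math.atan2(gy, gx) % math.pi     # direction mod 180 deg
            sector = int((angle + math.pi / 8) / (math.pi / 4)) % 4
            table[gx + r][gy + r] = sector
    return table

SECTORS = build_sector_table()

def sector(gx, gy, r=R):
    # runtime path: clamp and look up -- no trigonometry, no floats
    gx = max(-r, min(r, gx))
    gy = max(-r, min(r, gy))
    return SECTORS[gx + r][gy + r]
```

Non-maximum suppression then uses the returned sector to pick which pair of neighbours to compare against.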
Hysteresis

The biggest challenge is hysteresis. During hysteresis, we connect together contours using low and high magnitude thresholds. Each contour consists of edgels (single edges which have passed the non-maximum suppression stage) with a magnitude greater than the lower threshold that are joined together (using 8-way connectivity), with at least one edgel in the contour having a magnitude greater than the higher threshold. This is done by starting at an edgel above the higher threshold and recursively determining which lower-threshold pixels are connected to it.

This recursion can run very deep, and given the limited stack space of a mobile phone it is not feasible. Thus, we need a different way of doing this. Thankfully, there is a tool in computer vision which will solve this problem for us very efficiently.
Connected Components

Connected components analysis takes a binary image (pixels are either 1 or 0), finds which "on" pixels are connected together, and labels each group of connected pixels (i.e. each component) with a unique label. When implemented correctly the technique is very fast, and thus will provide what we need with a considerable speedup over the recursive method.

[Figure: a binary image input and the resulting labeled components.]
Connected Components, cont.

If we create a binary image from the edgels with magnitudes greater than the lower threshold (indicated by L in the figure below), then as described by Liu and Haralick [3] we can use connected components to find potential contours within the image by identifying all L pixels which are connected. Then, we simply exclude components which do not contain an edgel whose magnitude is greater than the higher threshold (indicated by H). This achieves the same effect while being far more efficient in computation and requiring no recursion whatsoever.

[Figure: thresholded maximum edges (L = above the lower threshold, H = above the higher threshold), the connected components found among them, and the extracted contours.]
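The idea can be sketched as follows, using a classical raster-scan labeling with union-find in place of recursion. This is an illustrative sketch, not the implementation from [3]: the data layout (magnitudes as a plain 2D list) and function names are assumptions.

```python
def hysteresis(mag, low, high):
    """Canny hysteresis without recursion: label the low-threshold mask
    with 8-way connected components, then keep only components that
    contain at least one pixel at or above the high threshold."""
    h, w = len(mag), len(mag[0])
    labels = [[0] * w for _ in range(h)]
    parent = [0]                       # union-find over provisional labels

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    next_label = 1
    for y in range(h):
        for x in range(w):
            if mag[y][x] < low:
                continue
            # labels of previously scanned 8-neighbours
            neigh = [labels[y][x - 1] if x > 0 else 0]
            if y > 0:
                neigh += [labels[y - 1][x],
                          labels[y - 1][x - 1] if x > 0 else 0,
                          labels[y - 1][x + 1] if x + 1 < w else 0]
            neigh = [n for n in neigh if n]
            if not neigh:
                parent.append(next_label)      # new provisional label
                labels[y][x] = next_label
                next_label += 1
            else:
                labels[y][x] = min(neigh)
                for n in neigh:                # merge touching labels
                    union(neigh[0], n)
    # second pass: which components contain a high-threshold pixel?
    strong = set()
    for y in range(h):
        for x in range(w):
            if labels[y][x] and mag[y][x] >= high:
                strong.add(find(labels[y][x]))
    return [[1 if labels[y][x] and find(labels[y][x]) in strong else 0
             for x in range(w)] for y in range(h)]
```

Everything here is iteration over fixed-size structures, so stack usage stays constant regardless of contour length.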
Euclidean Distance Transform

As mentioned previously, a naïve Euclidean distance transform is too computationally expensive for our needs. We are instead going to use a technique developed by Bailey in [1] which achieves the same accuracy using only integers and a minimum of computation. For a moment, let us reduce the distance transform to a one-dimensional problem. In one dimension, the distance from every index to the nearest edge index can be computed very quickly and efficiently in two passes, as seen below:

[Figure: two-pass computation of 1D distances -- a forward scan records the distance to the nearest edge on the left, and a backward scan folds in the nearest edge on the right.]
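A sketch of the two-pass 1D scan, assuming the row arrives as a list of 0/1 edge flags (the name and sentinel handling are illustrative):

```python
def row_distances(edges):
    """Distance from each index to the nearest edge flag in a 1D row,
    computed in one forward and one backward pass."""
    n = len(edges)
    big = n                      # larger than any achievable distance
    d = [0] * n
    # forward pass: distance to the nearest edge on the left
    last = -big
    for i in range(n):
        if edges[i]:
            last = i
        d[i] = min(big, i - last)
    # backward pass: fold in the nearest edge on the right
    last = 2 * big
    for i in range(n - 1, -1, -1):
        if edges[i]:
            last = i
        d[i] = min(d[i], last - i)
    return d
```

Both passes touch each pixel once, so the cost is linear in the row length and uses only integer additions and comparisons.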
Efficient Euclidean DT

Thus, for each row in an image we can find the distance from each pixel to the nearest edge in that row very efficiently. This is, naturally, insufficient. Think now of storing the square distance instead of the distance in each row pixel. This traces out a slice of a paraboloid of revolution whose center is at the edge pixel, as seen below.

[Figure: squared row distances, e.g. 4 1 0 1 4 9 ..., trace out slices of paraboloids centered at the edge pixels.]

The paraboloid centered at each edge extends across the entire image, intersecting with others from other edges at various points. Each paraboloid determines the area of influence of its edge. By evaluating all paraboloids at a pixel, we can determine which edge is nearest to that pixel.
Efficient Euclidean DT, cont.

If we now examine the columns of our image after storing the square distance in each row pixel, we see a record of all the edges which can affect that column. By calculating the value of each paraboloid at each index, we can see which has the least value and thus find the true squared distance to an edge at each row in the column, as seen below.

Initial: 9  4  36  81  25  16  25  100  49  16
Final:   5  4   5   8  13  16  17   20  17  16

At first, this seems like an O(N²) operation, but by scanning the column in order and keeping track of influence changes, it becomes O(N). Thus, the total complexity of the transform remains linear in the number of pixels.
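One well-known linear-time realization of this column pass is the lower-envelope scan of Felzenszwalb and Huttenlocher, sketched below for illustration; note this is a swapped-in formulation, not necessarily the exact variant Bailey describes in [1]. It takes the squared row distances down one column and returns the true squared distances.

```python
INF = float('inf')

def column_pass(f):
    """Given squared row distances f down a column, return
    min over r of f[r] + (q - r)**2 for every row q, in O(N)."""
    n = len(f)
    d = [0] * n
    v = [0] * n          # row indices of parabolas in the lower envelope
    z = [0.0] * (n + 1)  # boundaries between the parabolas' influence
    k = 0
    z[0], z[1] = -INF, INF

    def crossing(q, r):
        # intersection point of the parabolas rooted at rows q and r
        return ((f[q] + q * q) - (f[r] + r * r)) / (2 * q - 2 * r)

    for q in range(1, n):
        s = crossing(q, v[k])
        while s <= z[k]:             # q's parabola hides the last one
            k -= 1
            s = crossing(q, v[k])
        k += 1
        v[k], z[k], z[k + 1] = q, s, INF
    k = 0
    for q in range(n):
        while z[k + 1] < q:          # advance to the influencing parabola
            k += 1
        d[q] = (q - v[k]) ** 2 + f[v[k]]
    return d
```

Each parabola is pushed and popped at most once, which is what makes the pass linear rather than quadratic.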
Thin Client: Descriptor Compression

While we have discounted server-based methods to a certain extent, there have been several advances in descriptor compression in the past few years which make a server-based model increasingly feasible for large-scale deployment. Descriptor compression is an area of research that concentrates on encoding an image descriptor (e.g. a gradient histogram, SURF, SIFT) at a lower bitrate. There are several techniques in use, but they fall into two categories: quantization and entropy coding. We are going to look briefly at a generalized method for descriptor compression proposed by Johnson in [2]. In this method, any image descriptor meeting simple requirements can be compressed for optimized matching, resulting in significant savings in both storage and computational requirements while improving performance.

[Figure: descriptor compression pipeline -- real-valued descriptors extracted from image patches are compressed into small integer codes, which are compared using a pre-computed distance lookup matrix D and weights w.]
Huffman Tree Encoding

The technique used in [2] for compression is Huffman Tree Encoding. Huffman Trees are specialized binary trees in which each node has a value equal to the sum of its children. In this context, they can be used as a lossy method of encoding normalized distributions (like intensity or orientation histograms). First, the values in the distribution (a) are used to construct a Huffman Tree (b). The values are then approximated using a negative power of 2 equal to their depth in the tree (d). In this way, the distribution still sums to 1 but can be represented using N log(N) bits (c), where N is the length of the distribution.

[Figure: (a) a normalized distribution, (b) its Huffman tree, (c) the resulting depth codes, (d) the power-of-two approximation, which still sums to 1.]
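The depth-encoding step can be sketched as below, assuming the distribution arrives as a plain list of probabilities (the helper names are illustrative, and ties may resolve differently than in [2]). Each value is replaced by 2 raised to minus its leaf depth in the Huffman tree.

```python
import heapq

def huffman_depths(dist):
    """Depth of each value's leaf in a Huffman tree built over a
    normalized distribution."""
    if len(dist) == 1:
        return [0]
    # heap entries: (probability, tie-breaker, list of (index, depth))
    heap = [(p, i, [(i, 0)]) for i, p in enumerate(dist)]
    heapq.heapify(heap)
    tie = len(dist)
    while len(heap) > 1:
        pa, _, a = heapq.heappop(heap)
        pb, _, b = heapq.heappop(heap)
        merged = [(i, d + 1) for i, d in a + b]   # both subtrees sink a level
        heapq.heappush(heap, (pa + pb, tie, merged))
        tie += 1
    return [d for _, d in sorted(heap[0][2])]

def approximate(dist):
    # lossy encoding: each value becomes a negative power of two,
    # so the approximated distribution still sums to 1
    return [2.0 ** -d for d in huffman_depths(dist)]
```

Because each value is now fully described by a small integer depth, a descriptor of N bins needs only about N log(N) bits.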
Distance Lookup Matrix

One advantage of using Huffman Tree Encoding is that it limits the possible values at each index of the distribution. As most distance metrics (e.g. Minkowski norms, Kullback-Leibler Divergence) can be separated on a per-index basis, this means that we can pre-compute all possible distance scores in a lookup table. As the majority of computation time in an image retrieval application is spent computing distances between different descriptors, this optimization results in sizeable savings in computational requirements. In the figure below, two distributions (x and y) are compressed (cx and cy) and then compared using a distance lookup matrix D.

[Figure: two distributions x and y are compressed into depth codes cx and cy; their distance is accumulated from the lookup matrix D, e.g. 9 + 0 + 4 + 1 = 14.]
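Continuing under the same assumptions, a per-index lookup matrix for the squared Euclidean distance between depth codes might be sketched as follows (the names are illustrative, and [2] may use a different per-index score or scaling):

```python
def build_lookup(max_depth):
    """D[a][b] = squared difference between the values the depth codes
    a and b decode to, i.e. (2**-a - 2**-b)**2, computed once."""
    return [[(2.0 ** -a - 2.0 ** -b) ** 2 for b in range(max_depth + 1)]
            for a in range(max_depth + 1)]

def distance(cx, cy, D):
    # squared Euclidean distance between two compressed descriptors,
    # computed entirely from table lookups and additions
    return sum(D[a][b] for a, b in zip(cx, cy))
```

At match time, no arithmetic on the decoded values is needed at all: every per-index score is a single indexed read, which is what makes large-scale matching cheap.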
Further Reading

References

[1] D. Bailey. An efficient Euclidean distance transform. In Proc. of IWCIA04, pages 394-408, 2004.
[2] M. Johnson. Generalized descriptor compression for storage and matching. In Proc. of British Machine Vision Conf., Aberystwyth, Wales, August 2010.
[3] G. Liu and R. M. Haralick. Two practical issues in Canny's edge detector implementation. In Proc. of Int. Conf. on Pattern Recognition, volume 3, page 3680, 2000.
[4] R. Rau and J. McClellan. Efficient approximation of Gaussian filters. IEEE Trans. on Signal Processing, 45:468-471, February 1997.