Fast Natural Feature Tracking for Mobile Augmented Reality Applications


Jong-Seung Park (1), Byeong-Jo Bae (2), and Ramesh Jain (3)

(1) Dept. of Computer Science & Eng., University of Incheon, Korea
(2) Hyundai MnSoft, Inc., Seoul, Korea
(3) Dept. of Computer Science, University of California, Irvine, CA, USA

Abstract. Fast natural feature tracking is essential for making markerless augmented reality applications practical on low-performance mobile devices. To speed up the natural feature tracking process, which includes computationally expensive procedures, we propose a novel fast tracking method using optical flow, aimed at mobile augmented reality applications. Experimental results show that the proposed method significantly reduces the computational cost and also stabilizes the camera pose estimation process.

Keywords: Tracking, Natural feature, Augmented reality, Optical flow

1 Introduction

Natural features are features from unprepared scenes. Natural feature tracking (NFT) requires many computationally expensive operations. Most previous natural feature tracking methods include heavy feature extraction and pattern matching procedures for each input image frame [1]. Classical NFT approaches repeat the feature extraction and matching procedures for every input frame: they try to match the features extracted from the input frame against each of the registered patterns until a successful pattern is found. These extraction and matching procedures carry a heavy computational cost that low-performance mobile devices can hardly afford.

To speed up the NFT process, we propose a novel fast tracking method using feature-based optical flow. We implemented the proposed method on mobile devices to run in real time so that it can be viably used in mobile augmented reality applications. Moreover, during tracking, we keep the total number of feature points constant by inserting new feature points in proportion to the number of vanished feature points. The basic principle of speeding up the tracking process is to invoke the feature extraction and matching procedures less often and to restrict the area in which new features are extracted to small subregions. Once part of the input frame is matched to a specific pattern, we initiate tracking of the matched features through the successive video frames. As long as the tracking is successful, we do not perform new feature extraction and matching procedures.
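The following Python/OpenCV sketch illustrates the bootstrap-then-track control flow just described. It is a minimal reading of the paper, not the authors' implementation: ORB stands in for the unspecified scale- and rotation-invariant detector (the experiments later use SURF), and grab_frame, match_against_patterns, and track_points are hypothetical helpers.

```python
import cv2

# ORB is a freely available stand-in for the scale/rotation-invariant
# detector the method assumes; the paper's experiments use SURF.
detector = cv2.ORB_create(nfeatures=500)

tracking, prev_gray, pts = False, None, None
while True:
    frame = grab_frame()                           # hypothetical camera helper
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if not tracking:
        # Bootstrap path (expensive): full extraction + pattern matching.
        kps, desc = detector.detectAndCompute(gray, None)
        pts = match_against_patterns(kps, desc)    # hypothetical; None if no match
        tracking = pts is not None
    else:
        # Steady-state path (cheap): optical-flow tracking only,
        # no per-frame extraction or matching.
        pts, ok = track_points(prev_gray, gray, pts)  # e.g. pyramidal Lucas-Kanade
        tracking = ok                              # fall back to bootstrap on failure
    prev_gray = gray
```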

2 Tracking Natural Features

Feature tracking is the process of finding, in the successive image, the positions that correspond to image points in the first image. The measure of correspondence is based on the similarity of an image neighborhood within a fixed-size window. A well-known method is the feature tracker developed by Lucas and Kanade [2]. Let I and J be two consecutive grayscale images, and let the scalar quantities I(x, y) and J(x, y) be the pixel intensities at image coordinates (x, y). The consecutive image frame J contains most parts of the first image frame I. The position (x, y) in I moves to a new position (x + d_x, y + d_y) in J. The tracker must determine the disparity vector (d_x, d_y) at (x, y) from the intensity similarity of I and J. The similarity criterion is measured over the set of local neighborhood pixels, denoted by W, centered at the position (x, y). The disparity is commonly computed by minimizing the residual error due to the brightness differences. This approach stably tracks small feature movements in video frames taken in rapid succession, but it is still not reliable enough when the feature movements between frames are farther apart in time.
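The paper describes this residual only in words; written out in the standard Lucas-Kanade formulation, the disparity is the minimizer of the sum of squared brightness differences over the window W centered at (x, y):

    \epsilon(d_x, d_y) = \sum_{(u,\,v) \in W} \bigl( I(u, v) - J(u + d_x,\, v + d_y) \bigr)^2,

and the minimizing (d_x, d_y) is taken as the feature's displacement between I and J.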

Bay et al. [3] proposed an interest point detector and descriptor called Speeded Up Robust Features (SURF) to reduce the time spent on feature computation and matching. They approximate the second-order Gaussian derivatives of the Hessian-Laplace detector with box filters, which can be evaluated very quickly using integral images. SURF is currently regarded as a potentially more efficient descriptor than previous descriptors such as SIFT and Ferns. However, the known NFT methods are still too slow to track features in a live video, and the situation is even worse on a mobile device.

To speed up tracking, several approaches place strict restrictions on the target object to be tracked. Some mobile AR applications use ARToolKit-like approaches, which track only black-and-white markers. However, such marker-based approaches are not preferable, especially for outdoor AR applications, since they significantly restrict service domains for the sake of speedup.

Feature tracking is a necessary preprocessing step for the structure-from-motion problem, which recovers the 3D structure of captured scenes from images sensed over time. Because the feature matches are the only preliminary information for further vision-based inference, conventional point-based tracking schemes try to find as many feature points as possible. Most previous natural feature tracking schemes have focused on the description and matching of features between two consecutive images. These methods extract a new set of point features from each newly arrived image instead of reusing previously matched features. Extracting and matching a new point set is time-consuming and should be avoided, especially in real-time applications. Our claim is that the previously matched features should be reused to speed up the tracking process.

In vision-based augmented reality applications, the purpose of natural feature tracking is to compute a homography between a planar scene and a projected image. To ensure that a pattern can be identified, there must be a large number of feature points on the planar scene pattern, and a sufficient number of those feature points must be matched to points in the projected image. To identify the rectangular region of the projected pattern, a homography is computed from the matched point pairs. An application then uses the homography for further service-specific processing.

Conventional feature tracking approaches find correspondences between two consecutive images, namely I_t and I_{t+1}. They extract an initial set of feature points from the first frame and track their movements along the consecutive image frames. In contrast, the natural feature tracking approach in AR domains finds correspondences between an on-the-fly image and one of the pre-registered patterns. Previous NFT methods have focused on matching two point sets and extract new feature points from every on-the-fly image.

3 Speeding Up Feature Tracking

We found several cues for speeding up the tracking process. First, good features are not lost during tracking until they disappear from the field of view. This cue indicates that we do not have to extract features from every frame: the tracked feature positions in the next frame are likely to be detected as new feature locations anyway, so we can simply use the tracked locations instead of finding new ones. Second, it is not necessary to track a huge number of feature points to compute a homography, so we can reduce the feature matching time by limiting the number of features to be matched. Theoretically, a homography can be computed from only four correspondences; in practice, twenty points are enough to ensure robustness. Third, we can predict whether each feature point will disappear soon. This cue means that we can manage the set of tracked feature points efficiently: we can exclude feature points that will disappear soon and include new points that will stay in view for a long time. Based on these three cues, we invoke the feature extraction step less often and with fewer features.

The Lucas-Kanade feature tracker is a widely used feature tracking method [2]. It estimates the optical flow of each feature pixel by assuming that the flow is constant in a local neighborhood of the pixel. Bouguet [4] proposed a fast pyramidal implementation of the iterative Lucas-Kanade feature tracker [2]. The advantage of his pyramidal implementation is that the residual disparity vector at each hierarchical level can be kept very small, which allows large pixel motions while retaining fast convergence of the iterative tracker.
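As a concrete illustration (the paper gives no code), a single tracking step with a pyramidal Lucas-Kanade tracker in the spirit of Bouguet's implementation [4] can be written with OpenCV as below. The window size, pyramid depth, termination criteria, and the survivor threshold of 20 points (taken from the robustness remark above) are assumed values, not the authors' reported settings.

```python
import cv2

def track_points(prev_gray, next_gray, prev_pts):
    """Track (N, 1, 2) float32 points from prev_gray into next_gray."""
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, prev_pts, None,
        winSize=(21, 21),   # fixed-size similarity window W
        maxLevel=3,         # pyramid levels keep the per-level disparity small
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
    good = status.ravel() == 1
    # A homography needs only 4 correspondences; ~20 survivors are enough
    # for robustness, so tracking is declared successful if that many remain.
    return next_pts[good].reshape(-1, 1, 2), int(good.sum()) >= 20
```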

A classical NFT method has three main steps, as shown in Fig. 1. First, it acquires a new image frame and extracts features. Then, it performs feature matching between the extracted features and the features of the predefined patterns, which yields the matched pairs of feature points. Finally, it estimates the camera pose and performs image synthesis for rendering.

[Fig. 1. Comparison of the classical method (left: image acquisition & feature extraction, pattern matching, camera pose estimation & image synthesis, repeated for every frame) and the proposed method (right: a one-time bootstrap pattern matching step, then image acquisition & feature tracking followed by camera pose estimation & image synthesis).]

The feature extraction procedure is time-consuming since it involves several convolution operators that visit every pixel position, and it tends to extract an excessive number of features to avoid any failure in matching. To make matters worse, the feature extraction procedure is repeated for every input frame to obtain a new set of feature positions. The feature matching process must be repeated for each of the registered patterns, which means the matching step becomes slower as the number of patterns grows. It is therefore not appropriate to apply this classical NFT approach to low-performance mobile devices; instead, we need a new fast method with reduced overhead in feature extraction and matching.

The basic idea of our proposed method is to reduce the number of features by excluding unnecessary ones. In our method, as shown in Fig. 1, we perform the pattern matching procedure only once and track the matched features over the subsequent frames. Initially, we extract and describe features using a scale- and rotation-invariant interest point detector. The feature matching procedure tries to match the feature vectors against each of the predefined patterns. Once a matched pattern is found, we track the matched feature points over the next consecutive frames. Our tracking implementation is based on the pyramidal scheme of the Lucas-Kanade feature tracker.

4 Experimental Results

Experimental results show that the proposed method significantly reduces the computational cost and also stabilizes the camera pose estimation process. We captured a scene containing a specific pattern plate. While capturing the pattern plate, the camera was rotated, zoomed, and tilted, so the captured images of the pattern plate change according to the camera motion. We tested the accuracy of our algorithm when tracking the pattern in the captured frames. While tracking, we evaluated the homography between the input frame and the matched pattern using the matched pairs. The accuracy is measured as the sum of the differences between the corner positions predicted by the homography and the actual corner positions, which were specified manually.
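To make the accuracy measure concrete, the corner-based error described above could be computed as in the sketch below. Here pattern_image, pattern_pts, frame_pts, and true_corners are hypothetical inputs, and RANSAC with a 3-pixel reprojection threshold is an assumed choice rather than one stated in the paper.

```python
import cv2
import numpy as np

# pattern_pts / frame_pts: matched (N, 1, 2) float32 point sets, N >= 4.
H, _inliers = cv2.findHomography(pattern_pts, frame_pts, cv2.RANSAC, 3.0)

# Map the pattern's four corners into the input frame via H and compare
# them with manually specified ground-truth corner positions.
h, w = pattern_image.shape[:2]
corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
projected = cv2.perspectiveTransform(corners, H)
error = float(np.sum(np.linalg.norm(projected - true_corners, axis=2)))
```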

In the SURF method [3] there are abrupt increases of error in some frames, whereas in the proposed method the error is stable throughout the input stream. We also compared the processing time of SURF and the proposed method on the three platforms; Table 1 shows the comparison results.

Table 1. Comparison of the processing time for a single video frame (in ms).

  method    #patterns  capture  extract   match    track   render    total
  SURF          1       1.053   193.940    9.902     -     2.194   207.089
  SURF          5       1.063   194.035   71.775     -     2.409   269.282
  SURF         10       1.052   192.756  218.301     -     2.409   414.518
  Proposed      1       1.042     1.168    0.061   25.635  1.316    29.222
  Proposed      5       1.032     1.633    0.492   27.764  1.339    32.260
  Proposed     10       1.050     1.176    1.231   25.851  1.334    30.642

In our test, the SURF method takes at least 35 ms per frame, and its feature extraction step takes more than half of the total processing time. In the proposed method, the feature extraction and matching time is significantly reduced.

5 Conclusions

The heavy computational burden of classical NFT approaches makes run-time execution of NFT prohibitive on low-performance mobile devices. To speed up NFT, we proposed a fast feature tracking method based on optical flow. The proposed algorithm can feasibly track natural features in unknown and time-varying outdoor environments. To guarantee continuity of tracking without increasing the time complexity, we introduced partial feature extraction and matching in image subregions: as long as the tracking is successful, further feature extraction and matching are performed only on the subregions that contain no features. Experimental results on real data sets show that the proposed method is more than 7 times faster than the SURF method; it not only provides a significant speedup but also maintains the same level of accuracy.

References

1. Lim, M.J., Jung, H.W., Lee, K.Y.: Game-type recognition rehabilitation system based on augmented reality through object understanding. The Journal of the Institute of Webcasting, Internet and Telecommunication 11(3) (2011) 93-98
2. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence (1981) 674-679
3. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. Computer Vision and Image Understanding 110(3) (2008) 346-359
4. Bouguet, J.-Y.: Pyramidal implementation of the Lucas-Kanade feature tracker. Technical report, Intel Corporation Microprocessor Research Labs (2000)