Project report: Augmented reality with ARToolKit
FMA175 Image Analysis, Project
Mathematical Sciences, Lund Institute of Technology
Supervisor: Petter Strandmark
Fredrik Larsson (dt07fl2@student.lth.se)
December 5, 2011

1 Introduction

Augmented reality (AR) is the concept of enhancing the real physical world with an extra layer of information. Additionally, this should be done in real-time and also provide some means of interaction. In a computer application this can be achieved by analyzing a captured video feed using image analysis and computer vision algorithms and then rendering some object on top of the video image. Determining where and how to render the objects can be done in numerous ways. It is possible to use positioning systems such as GPS, gyroscopic sensors, or image analysis and computer vision algorithms that detect markers in the video feed. The latter is the approach discussed in this report. The main problem, common to all approaches, is to determine where the viewer is positioned and how the viewer is oriented in the real physical world.

The goal of this project is to explore the capabilities and limitations of a software library called ARToolKit. Using this library, a demo application has also been produced. The demo application is written in the C programming language with GNU/Linux as the target platform.

2 ARToolKit

ARToolKit is a software library that aids in the development of AR applications. It is written in C and is free for non-commercial use under the GNU General Public License. A more production-ready and better supported version is also available for non-free use. The software was originally developed by Dr. Hirokazu Kato but is currently maintained by the Human Interface Technology Laboratory at the University of Washington [1]. Since its initial release in the late 1990s it has undergone a rewrite, and the current incarnation of the toolkit was released in 2004. After that, a few sporadic releases have occurred up until its most recent version (2.72.1), released in 2007. At this time, not much seems to be going on in terms of further development of the library, at least judging by the project's official web site.

The library aims to be cross-platform and runs on most common operating systems, including Microsoft Windows, GNU/Linux and Mac OS X. Several ports and bindings exist for other languages and platforms, such as Java and Android [2].

2.1 Detection algorithm

The primary functionality of the ARToolKit library is to detect markers in a captured video frame. These markers typically consist of a black and white pattern with a thick frame. A number of sample patterns are bundled with the library, but it is also possible to create custom patterns. An example pattern is displayed in figure 1. This pattern is also used by the demo application developed during this project. The toolkit supports detecting multiple markers in the same image frame.

Figure 1: An example of a pattern that ARToolKit can detect.

The algorithm used to detect the pattern builds on a few basic concepts of image analysis. As a first step, the captured image is run through a thresholding filter, yielding a binary image; the threshold value is one of the few parameters that can be set by the user of the library (a sketch of this step is given below). The binary image is then passed through a connected-component labeling algorithm. The result of this pass is a labeling of the different regions of the image, and the goal is to find large regions, such as the wide black border shown in figure 1. From the information acquired from the labeling, the algorithm proceeds by detecting the contours of the pattern, from which one can extract the edges and corners of the pattern. This finalizes the detection algorithm, and the obtained information can be used in the next step, which computes the camera transform [3].
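To make the first step concrete, here is a minimal sketch in C of the thresholding pass that turns a grayscale frame into a binary image. This is an illustration only, not ARToolKit's actual implementation; the function name and buffer layout are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Threshold a grayscale frame into a binary image: pixels darker than
 * `thresh` become 1 (candidate marker border pixels), the rest become 0.
 * Hypothetical helper for illustration; in ARToolKit the result of this
 * kind of pass is fed into connected-component labeling. */
static void threshold_frame(const uint8_t *gray, uint8_t *binary,
                            size_t width, size_t height, uint8_t thresh)
{
    for (size_t i = 0; i < width * height; ++i)
        binary[i] = (gray[i] < thresh) ? 1 : 0;
}
```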

2.2 Computer vision

After a pattern has been detected in the video, a number of transformations are performed in order to be able to render a three-dimensional object on top of the frame. The mathematical model provided by the pinhole camera is simple and convenient, but does not correspond fully to the physical camera used to capture the image. It is, however, possible to idealize the camera using an affine transformation. This transformation is the $3 \times 3$ matrix

$$K = \begin{pmatrix} \alpha & -\alpha \cot\theta & u_0 \\ 0 & \dfrac{\beta}{\sin\theta} & v_0 \\ 0 & 0 & 1 \end{pmatrix},$$

which contains what are called the camera's intrinsic parameters [4]. Here $\alpha$ and $\beta$ are the magnification factors in the x and y directions respectively, expressed in pixel units. The parameter $\theta$ is the skew factor, i.e. the angle between the image axes, which should ideally be equal to 90° but may not be. Finally, $u_0$ and $v_0$ give the location of the principal point, in pixel units, which is the point where the optical axis intersects the image plane.

After the normalization, the detected pattern can be matched against a number of templates to determine which pattern has been detected. Next, using the lines and corners from the detection algorithm, a projective transformation is computed. The projective transformation maps the image plane onto itself with the perspective taken into account. An important property of this transformation is that a line maps to a line, with cross-ratios preserved. Finally, at this point the camera transform can be computed, which is a mapping between the camera's coordinate system and the world's. These computations need to be done every frame because the transformations depend on the real-world positions of both the markers and the camera. The intrinsic parameters, however, only change if the focal length of the camera changes, e.g. when zooming.
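As a worked example of how the intrinsic parameters act, the toy C function below applies $K$ to a point given in camera coordinates and performs the perspective division to obtain pixel coordinates. The parameter values in `main` are made up for illustration.

```c
#include <math.h>
#include <stdio.h>

/* Project a point (x, y, z) in camera coordinates to pixel coordinates
 * (u, v) using the intrinsic matrix K described above: row-wise
 * application of K followed by division by the homogeneous coordinate. */
static void project(double alpha, double beta, double theta,
                    double u0, double v0,
                    double x, double y, double z,
                    double *u, double *v)
{
    double uh = alpha * x - alpha * (cos(theta) / sin(theta)) * y + u0 * z;
    double vh = (beta / sin(theta)) * y + v0 * z;
    *u = uh / z;
    *v = vh / z;
}

int main(void)
{
    const double PI = 3.14159265358979323846;
    double u, v;
    /* Ideal skew (theta = 90 degrees), principal point at (320, 240);
     * all values are invented for this toy example. */
    project(800.0, 800.0, PI / 2.0, 320.0, 240.0,
            0.1, 0.2, 1.0, &u, &v);
    printf("pixel: (%.1f, %.1f)\n", u, v);  /* prints: pixel: (400.0, 400.0) */
    return 0;
}
```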

2.3 Computer graphics

ARToolKit is tightly integrated with the OpenGL graphics pipeline, which is used for the actual rendering. OpenGL has, put simply, three different spaces between which transformations are done. An object that is to be rendered to the screen first has its coordinates defined in its own model space. In order to place this object into a scene, the world transformation is applied, so that the coordinates are now in world space. Finally, the object is transformed into view space, which is defined by a camera model. These transformations operate on points in three dimensions given in homogeneous coordinates, and are thus matrices of size $4 \times 4$. They can be combined into a single transformation by multiplying the matrices together, often referred to as the model-view transform. The results of the detection and computer vision algorithms described in the previous section can be used to set up these matrices so that the rendered graphics appear at the right place in the captured video frame.

The rendering of a frame with ARToolKit normally starts with grabbing a frame from the video capture device and rendering it to a frame buffer. The previously described algorithms are then applied to the image in order to detect a pattern. If no marker is detected, the frame buffer is displayed on the screen and the rendering is complete. If a marker is detected, however, the model-view transformation matrix is computed and passed down to the OpenGL pipeline. Next, using the standard OpenGL draw commands, whatever geometry is desired can be rendered to the frame buffer. When the rendering is complete, the frame buffer is displayed on the screen and the next video frame can be grabbed from the camera.

3 Demo application

In order to test and analyze ARToolKit, a simple demonstration application was implemented. This application renders a four-vertex polygon, i.e. a quad, textured with an image, e.g. a photo. Additionally, in order for it to appear more realistic in the video frame, a few adjustments are made. The demo application uses OpenGL shaders to apply these adjustments in an efficient manner; the adjustments are described in the following sections. A screen capture of the application is displayed in figure 2.

3.1 White balance adjustment

In most cases, the white balance of the captured image and the rendered image do not match. In an attempt to overcome this discrepancy, a simple method of manual white balance calibration was implemented. Using the mouse, the user of the program can select a color $w = (R_w, G_w, B_w)$ from the captured video frame, which is then used as the white point. In this case the colors are 8-bit RGB values, i.e. each color component is in the range [0, 255]. To apply the white balance adjustment, a pixel's color $(R', G', B')$ is scaled into the resulting color $(R, G, B)$ by the transformation

$$\begin{pmatrix} R \\ G \\ B \end{pmatrix} = \begin{pmatrix} 255/R_w & 0 & 0 \\ 0 & 255/G_w & 0 \\ 0 & 0 & 255/B_w \end{pmatrix} \begin{pmatrix} R' \\ G' \\ B' \end{pmatrix}.$$

This adjustment gives the rendered image the same tint as the background video frame.
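A minimal CPU-side sketch of this scaling follows, assuming interleaved 8-bit RGB pixels; the demo itself performs the equivalent operation in an OpenGL shader, and the function name and buffer layout here are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Apply the diagonal white-balance matrix above to an interleaved 8-bit
 * RGB buffer of `npixels` pixels, so that the picked white point
 * (rw, gw, bw) maps to (255, 255, 255). Assumes a nonzero white point.
 * Illustrative CPU version of what the demo does on the GPU. */
static void white_balance(uint8_t *rgb, size_t npixels,
                          uint8_t rw, uint8_t gw, uint8_t bw)
{
    for (size_t i = 0; i < npixels; ++i) {
        unsigned r = rgb[3 * i + 0] * 255u / rw;
        unsigned g = rgb[3 * i + 1] * 255u / gw;
        unsigned b = rgb[3 * i + 2] * 255u / bw;
        /* Clamp: components brighter than the white point would overflow. */
        rgb[3 * i + 0] = (uint8_t)(r > 255u ? 255u : r);
        rgb[3 * i + 1] = (uint8_t)(g > 255u ? 255u : g);
        rgb[3 * i + 2] = (uint8_t)(b > 255u ? 255u : b);
    }
}
```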

Figure 2: The demo application in action.

3.2 Anti-aliasing

The discrete nature of a computer screen leads to jagged edges (aliasing) when objects are drawn to it, causing a disturbing transition from the background to the rendered object. This is a common problem in computer graphics that has to be dealt with if decent image quality is desired. There are many solutions to this problem, one of which is supported natively by OpenGL and by recent graphics hardware. This method is based on multisampling and requires the objects to be rendered in the correct order to work. There are also other methods of anti-aliasing; for instance, it is possible to use edge-detection algorithms in a post-processing step to find edges and then remove the jaggedness along them.

Due to the way ARToolKit renders the video feed by default, a way to incorporate the native multisample anti-aliasing described above was not found. However, a very simple anti-aliasing filter based on alpha blending was applied so that the edges of the rendered photo blend better with the background. This method simply makes the rendered image slightly transparent at the edges. The method is not in any way good, and will only work for rectangle-shaped objects, but for the purpose of this project it does the job and slightly improves the rendering quality. The results of the anti-aliasing are displayed in figure 3 (see also the sketch below).

Figure 3: Aliasing between the background image and the rendered image is evident in the left figure. On the right, the result of an attempt to remove these artifacts.
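The sketch below illustrates the general idea of the alpha-blending trick, assuming the quad carries texture coordinates in [0, 1]; the function and the band-width parameter are hypothetical, not the demo's actual code, which evaluates this per fragment on the GPU.

```c
/* Compute an alpha value that fades from 1 (interior) to 0 at the quad's
 * edges, given texture coordinates s, t in [0, 1]. `band` is the width of
 * the fading border in texture units. Only works for rectangular quads,
 * matching the limitation noted above. */
static float edge_alpha(float s, float t, float band)
{
    float ds = s < 1.0f - s ? s : 1.0f - s; /* distance to nearest vertical edge   */
    float dt = t < 1.0f - t ? t : 1.0f - t; /* distance to nearest horizontal edge */
    float d = ds < dt ? ds : dt;            /* distance to nearest edge overall    */
    return d >= band ? 1.0f : d / band;     /* linear fade inside the band         */
}
```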

4 Results

Augmented reality is a concept with many potential uses in many different areas. The marker-based method utilized by ARToolKit is a simple and easy to grasp way of achieving nice effects and interactivity. However, the method also has many drawbacks. For one thing, the pattern must be positioned so that all of it is visible in the video frame. If even the slightest part of it is covered or creased, for just a few frames, the detection will fail. It would of course be possible to use additional algorithms to approximate the pattern's location and orientation, but this is not supported by ARToolKit. Also, the observation angle of a pattern is naturally limited to the hemisphere above it. The image quality produced by the video capture device, along with the lighting conditions, is yet another factor that needs to be taken into account. More recent research in the area has produced new and more involved methods for augmented reality. One such method is Parallel Tracking and Mapping (PTAM), which needs no markers or precomputed maps [5] and therefore offers more flexibility.

ARToolKit is a quite dated and poorly documented piece of software. For making a simple demo application it does the job, but in order to do more advanced rendering a more powerful library is needed. In fact, even during the writing of the simple demo application in this project, its limitations were inhibiting. If one wishes to get involved in the underlying algorithms, digging around in the source code is pretty much the only option. On the other hand, there is a production-grade version of the library that is supposedly better supported and more stable.

Many more techniques than the ones experimented with during this project could be applied to improve the appearance of the final image, although the library is rather limiting when it comes to accessing more modern features of the OpenGL pipeline. One idea for further improvement is to approximate the noise that is present in the video frame and then apply that noise to the rendered image as well. The white balance calibration could also be done automatically by using a known white region in the video frame instead of a manually selected white point.

Another big issue that should be addressed in further work is the jittery appearance of the rendered image. This is caused by approximation errors that differ from one frame to the next. Very often, the difference in the computations is big enough that the position of the rendered object changes even though the camera is stationary. One possible solution would be to keep previous computations and interpolate between them to get smoother movement, as sketched below.
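As a sketch of that interpolation idea, a plain exponential moving average over the translation part of the estimated transform could look as follows. This is entirely an assumption for illustration, not something ARToolKit provides, and a full solution would also need to smooth the rotation (e.g. with quaternion interpolation).

```c
/* Exponentially smooth the translation part of the estimated marker
 * transform: smoothed = (1 - k) * smoothed + k * fresh, applied each
 * frame. Smaller k gives smoother but laggier motion. */
static void smooth_position(double smoothed[3], const double fresh[3], double k)
{
    for (int i = 0; i < 3; ++i)
        smoothed[i] = (1.0 - k) * smoothed[i] + k * fresh[i];
}
```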

Bibliography

[1] HIT Lab. ARToolKit Home Page. [online] Available at: http://www.hitl.washington.edu/artoolkit/ [Accessed 30 November 2011]

[2] nyatla.jp. FrontPage.en - NyARToolKit. [online] Available at: http://nyatla.jp/nyartoolkit/wiki/index.php?frontpage.en [Accessed 30 November 2011]

[3] HIT Lab. ARToolKit Documentation (Computer Vision Algorithm). [online] Available at: http://www.hitl.washington.edu/artoolkit/documentation/vision.htm [Accessed 30 November 2011]

[4] Forsyth, D.A. and Ponce, J., 2003. Computer Vision: A Modern Approach. Upper Saddle River, NJ: Pearson Education.

[5] Klein, G. Parallel Tracking and Mapping for Small AR Workspaces (PTAM). [online] Available at: http://www.robots.ox.ac.uk/~gk/ptam/ [Accessed 30 November 2011]