2016 International Conference on Computational Science and Computational Intelligence

IDE-3D: Predicting Indoor Depth Utilizing Geometric and Monocular Cues

Taylor Ripke, Department of Computer Science, Central Michigan University, Mount Pleasant, MI 48859. Email: Ripke1tj@cmich.edu
Roger Lee, Department of Computer Science, Central Michigan University, Mount Pleasant, MI 48859. Email: Lee1ry@cmich.edu

Abstract—Depth estimation and spatial awareness given a single monocular image is a challenging task for a computer, as depth information is not retained when the 3D world is projected onto a 2D plane. Therefore, we must combine our prior knowledge with other monocular cues present in the image, such as occlusion, texture variations, and shadows, to understand the depth of the image. In this paper, we present IDE-3D (Indoor Depth Estimation 3D), a tool designed to generate a box-model depth map of an indoor environment. The program combines a variety of input from the image, including 3D geometric shape estimation utilizing local and global scene structures, pixel analysis, and outlier removal, to produce a depth map of the image with acceptable results. We generate a box model of the room, apply our best fit algorithm to calculate the predicted depth of the room by analyzing the horizontal plane, and apply a depth-map gradient to it. The current application shows a successful implementation of our best fit algorithm in the controlled experiment by incorporating a box model and texture-gradient approach. Future work will include estimating the same depth using an object's shift relative to the focus.

Keywords: Image Segmentation, Depth Perception

I. INTRODUCTION

Throughout the years, research in computer vision has expanded into numerous subfields, including object recognition, neural networks, and depth estimation. These areas provide the foundation for many applications used today; however, there is still much work to be done.
In this paper, we investigate an approach to estimating depth in a single, monocular image utilizing pixel and geometrical analysis. Given several images, it is possible to accurately measure the depth of a scene. Computers can mimic the triangulation and overlay performed by humans to measure the depth of an object using two cameras. However, measuring depth given only a single image is a challenging task, as there are very few depth cues present in the image. Therefore, it is important to pay attention to the other monocular cues in the image, such as shading, perspective, size familiarity, and occlusion.

Figure 1. Computed depth map from our best fit algorithm using the box model approach.

For this project, we drew motivation from prior research in the field [5,6,7,8] and want to build upon previous techniques and contribute another method to enhance depth prediction in monocular images. To do this, our project is divided into two segments. In the first segment, we generate a box model of the room utilizing our best fit algorithm, after removing the outliers and objects from the calculations. The approach relies on finding the ground/wall boundary, as shown in Fig 1. Given a threshold, the boundaries are determined by ignoring everything that does not resemble a line. Our algorithm does not perform well where the boundary is not present, as discussed in the results. However, in cases where the algorithm can find the boundary, it separates the room into distinct regions, to which it applies the second phase of the algorithm. The current program is not designed to account for objects.

978-1-5090-5510-4/16 $31.00 2016 IEEE. DOI 10.1109/CSCI.2016.153
During the second phase of the algorithm, we apply a gradient technique to illustrate the depth of the room as accurately as possible. It does this by exploiting the geometry of the scene and finding similar patterns of pixels that form the boundaries of objects. As observed in Fig 1, the green-to-blue texture on the right wall shows the progression of depth in the image. The wall in the back of the image is a solid blue, showing that it has been classified as being the same distance from the camera. Finally, the increasing progression of yellow on the ground also shows the progression of depth.

We test our program by providing it with various indoor images that we gathered ourselves. We show that, in its current state, the program effectively produces box models of the rooms and applies a gradient to show the progression of depth in the scene. Rather than applying a gradient to the image as a whole, we segmented it using the box model approach and applied the gradients to the surfaces produced by the algorithm.

In the next section, we will discuss the previous work done in the field and the impact it has had on current research. A variety of techniques will be presented that illustrate numerous approaches that can be used to estimate depth in images. A current trend in research suggests a strong need to exploit the higher-level information in the scene, rather than focusing on local cues alone. The ability to understand what a group of pixels represents is more valuable than evaluating each pixel individually. However, it is important to consider all monocular cues present, as they provide crucial information about the image.

II. BACKGROUND AND RELATED WORK

As stated previously, estimating depth in a single image is a challenging task, as depth information is lost when the image is created. Regardless, humans have the capability to perceive depth in a monocular image using the information they have gained throughout their lifetime.
Therefore, it should not be unreasonable to think that one day computers may do the same. Perhaps one of the biggest problems in computer vision is effectively and efficiently recognizing and classifying objects. A computer may make the mistake of thinking that a person up close is taller than a skyscraper far away if the picture is taken at the right angle. However, it is possible to produce an accurate depth map utilizing other monocular cues present in the image. Previous techniques will be presented that show the numerous ways depth can be perceived in a single image.

Most approaches taken recently have focused on depth at the local scale [1,2,3,4]. While their results have been successful, it is important to use high-level information from the global structure of the scene. Individual pixels and local information alone are not enough to determine the context of the image. For example, if a computer were shown an array of blue pixels, it would have a hard time identifying whether it is viewing the sky, the ocean, a river, or perhaps a blueberry. However, if we can recognize that the image was taken outside, we can infer that the blue is the sky if it is in the upper half of the image.

A. Zhuo et al.

Most approaches taken in the past have focused on local cues rather than exploiting the global structure of the image. Zhuo et al. developed a hierarchical representation of the scene, which combines local depth with mid-level and global scene structures. They formulated single-image depth estimation in a graphical model by encoding the interactions across different layers of their hierarchy. By doing so, they were able to produce detailed depth estimates and extract higher-level information from the scene [5]. After conducting their experiments, they found that the mid-level structures contributed the most to the final accuracy of their model. In the future they plan to use semantic labels as part of the depth estimation calculations [5].
That should significantly increase the accuracy of the results: utilizing high-level information about the scene, they can classify different objects and recognize, for instance, that the sky is far away.

B. Liu et al.

Utilizing a different approach, Liu et al. used a pool of images for which the depth is known to help calculate the depth in an unknown image. They treated the task as a discrete-continuous optimization problem, where the discrete variables represented the relationships between neighboring superpixels and the continuous variables encoded the depths of the superpixels. By performing inference in a graphical model using particle belief propagation, they found a solution to the discrete-continuous optimization problem. The images with known depth are used to compute the unary potentials in the graphical model [6]. Similar to Zhuo et al., they plan to incorporate semantic labeling into their estimations.

C. Hedau et al.

Using the geometric information in a scene and the geometric representation of an object, it is possible to produce a detector for that object. The detector they built unifies contextual and geometric information to produce a probabilistic model of the scene. The locations of the walls and the floor in the image can refine the estimation for the 3D object. They show that it is possible to derive a 3D interpretation of the location of the object from a 2D image [7]. In addition, Hedau et al. also considered the challenge of recovering the spatial layout of indoor scenes given a monocular image. In most rooms, the distinct boundary that
marks the division between the floor and wall is partially or entirely occluded by furniture or objects in the room. Most algorithms used to identify the geometric context of the room rely on finding the ground-wall boundary. Instead, they employ a structured learning algorithm to find the parameters for their algorithm based on global perspective cues [8]. The algorithm employed in our research currently relies on finding the ground-wall boundary to produce a depth map. As shown in the results, our model has deficiencies when it cannot find the ground-wall boundary.

D. Saxena et al.

Researchers have been able to recreate a sense of depth perception in computers, to a certain degree of success, under certain circumstances. As challenging as optical illusions are for humans, they are even more difficult for a computer. Some approaches, such as that of Saxena et al., utilized a supervised training approach: they collected a training set of monocular images (indoor and outdoor environments such as trees, buildings, and sidewalks) and their corresponding ground-truth depth maps [3]. Using a Markov Random Field that incorporates multiscale local and global image features, they modeled the depths and the relations between depths at different points in the image [2]. Their approach combines monocular and stereo (triangulation) cues to estimate depth, showing improvements over utilizing only monocular or only stereo cues.

E. Eigen et al.

Other work has involved using deep network stacks to predict depth. Eigen et al. employed two deep network stacks: one that makes a global prediction and another that refines the prediction locally. It is important to note that they applied a scale-invariant error to measure depth relations rather than scale itself [4]. After training, their model achieved much success on both NYU Depth and KITTI without the need for superpixelation.
In the next section, we will describe the methodology and implementation of our model as inspired by previous research. We will begin with a brief overview of the system, followed by a detailed look at the approach we took. In the experiment section, we will describe how we achieved the results shown in Fig 4 and further areas for improvement. Finally, we will discuss how we can improve our program in the future.

III. METHODOLOGY

A. Overview

Unlike computers, humans have a remarkable capability for perceiving depth, even if one eye is not involved. Various cues such as shading, perspective, size familiarity, and occlusion are important in depth perception. The most powerful form of depth perception a human uses is stereo disparity. Each eye sends an image to the brain, which combines them to give the sense of 3D. However, some individuals are born without stereo vision. Instead, their brain compensates for this through active spatial creativity; in a sense, it overcompensates by paying attention to the other visual cues present. The approaches outlined previously present successful research concerning the problem of depth perception given a single monocular image. The approach we explored uses a geometric representation of the image utilizing a box model.

B. Our Approach

Existing systems utilize various techniques to recover the depth information lost when the 3D world is projected onto a 2D plane, with the most prominent focus being geometric information. Arguably the most difficult task is object recognition. A computer views an image as an array of pixels. Therefore, it makes logical sense to try to find patterns that resemble what we are looking for. We would expect, in most situations, to find a horizontal row of similarly colored pixels that represents the floor/wall boundary. However, object occlusion can corrupt the algorithm's calculations unless it is accounted for.
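As a minimal sketch of this boundary cue (our own illustration, not the paper's implementation), the hypothetical helper below scores each image row by the length of its longest run of similarly colored pixels; the color tolerance and minimum run length are assumed values, not thresholds from the paper.

```python
import numpy as np

def horizontal_run_score(row, tol=12.0, min_run=30):
    """Length of the longest run of similarly colored pixels in one row.

    A long run is a weak cue that the row may cross the floor/wall
    boundary. `tol` (max color distance between neighbors) and
    `min_run` are illustrative thresholds.
    """
    # Euclidean color distance between each pixel and its right neighbor.
    diffs = np.linalg.norm(np.diff(row.astype(float), axis=0), axis=1)
    best = run = 1
    for d in diffs:
        run = run + 1 if d < tol else 1
        best = max(best, run)
    return best if best >= min_run else 0

# Toy image: rows 0-3 are a uniform "wall", row 4 is random noise.
np.random.seed(0)
img = np.zeros((5, 100, 3), dtype=np.uint8)
img[4] = np.random.randint(0, 255, (100, 3))
scores = [horizontal_run_score(img[r]) for r in range(img.shape[0])]
```

On this toy input the uniform rows score the full row width, while the noisy row scores zero, mirroring the idea that only long coherent runs are treated as boundary candidates.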
Specifically, we are interested in developing a best fit algorithm that can detect the boundaries of the room after the room has been analyzed. We want to find the boundaries of the room and calculate a line of best fit, i.e., one that covers the most pixels of the same color within a certain threshold. The algorithm itself cannot accomplish this task alone. Instead, we must first remove the extraneous information from the scene.

C. Proposed System

For our algorithm to be able to estimate the depth of an image, some preprocessing must be done for maximum optimization and accuracy. First, we apply a Sobel operator to identify the edges. Once the image's edges have been identified, we run another algorithm that identifies edges based specifically on color similarity. The outliers and objects deemed not to be lines are removed by identifying regions of curvature. An outlier in this context can be described as a pixel or a group of pixels that do not conform to or represent an object and have no meaning. For example, a part of the wall may have a marker stain or a nail hole in it. These will be identified by the algorithm, but they are not essential to the larger picture. Following the successful completion of the outlier removal algorithm, we then scan the image to check for connectivity. Given some pixel, we want to check if that pixel is part of a larger group of pixels. For example, if there were a picture frame on the wall, and our algorithm targeted the upper-left pixel of the frame, we want to check
if there are more pixels around that pixel and whether they form any distinct shape. If there are no pixels, or not enough pixels, within a given radius, then we remove them from the calculations. More specifically, we want to identify patterns of pixels that are round or not straight, as they would not be considered a boundary. While checking, each pixel is stored in a temporary array in case the algorithm decides it is not part of the larger picture, so that they can all be removed easily. Next, we apply our best fit algorithm, which generates boundary lines each time it finds a pixel identified as a potential boundary. Each generation, the slope of the line is modified to see if that generation would be better than the previous. Once all boundaries have been identified, a gradient is applied to the image to show the gradual transition in depth at any point.

Figure 2. Pseudocode for Best Fit Algorithm

D. Implementation

The first step was to apply basic Sobel edge detection to identify the edges. Next, we grouped pixels into distinct regions based on color similarity. Then, we removed the outliers from the calculation. Utilizing a predetermined value, such as 10, we can tell the program to remove any pixels that do not have a connectivity of 10 or greater. This greatly reduces the number of pixels that need to be scanned by the best fit algorithm. This threshold can be modified to produce different results. Finally, the most prominent feature of our program is the best fit algorithm. As shown in Fig 2, the best fit algorithm loops through each pixel of the image until it comes across one that has a connectivity greater than 10. In a 240-generation loop, the slope of the line is changed and the pixels generated are overlaid on top of the pixels remaining after the outlier removal algorithm. The program keeps a running tally of how many pixels land on top of the others. The slope with the best fit is chosen, and that is where a room boundary is placed.
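The best fit search summarized in the Fig 2 pseudocode can be sketched roughly as follows. The 240-generation sweep is taken from the text; the uniform angle spacing and the way covered pixels are tallied are our assumptions.

```python
import numpy as np

def best_fit_line(mask, y0, x0, n_generations=240):
    """From a seed boundary pixel (y0, x0), try n_generations candidate
    slopes and keep the angle whose line covers the most candidate
    boundary pixels in `mask` (True = pixel kept by outlier removal)."""
    h, w = mask.shape
    best_angle, best_count = 0.0, -1
    for g in range(n_generations):
        theta = np.pi * g / n_generations        # one candidate slope per generation
        dy, dx = np.sin(theta), np.cos(theta)
        count = 0
        for t in range(-max(h, w), max(h, w)):   # step along the candidate line
            y = int(round(y0 + t * dy))
            x = int(round(x0 + t * dx))
            if 0 <= y < h and 0 <= x < w and mask[y, x]:
                count += 1                       # running tally of covered pixels
        if count > best_count:
            best_angle, best_count = theta, count
    return best_angle, best_count

# Toy example: a horizontal boundary along row 10 of a 20x40 mask.
mask = np.zeros((20, 40), dtype=bool)
mask[10, :] = True
angle, hits = best_fit_line(mask, 10, 20)
```

On this toy mask the horizontal candidate (angle 0) wins, covering all 40 boundary pixels, which matches the intuition that the ground/wall boundary is found where the tally peaks.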
As shown in Fig 4, the algorithm accurately finds the boundaries of the room when the horizontal axis is found.

E. Experiment

The experiment was conducted by capturing four static indoor images and then letting our program evaluate them. In this controlled experiment, we took images that had a defined ground/wall boundary, as well as some that didn't. The program first identified the color variations in the image and shaded pixels magenta if the difference between adjacent pixels exceeded a predetermined threshold. This allowed the program to identify the boundaries between objects with relative certainty. Next, we applied an outlier removal algorithm that checked the connectivity of the magenta pixels to determine whether a region was an object or a boundary, such as the ground/wall border. The outlier removal algorithm scanned through the image sequentially until it found a magenta pixel. When it found one, it would check connectivity by scanning the surrounding pixels and determining whether they were linear or curved. The initial threshold was thirty pixels: if those thirty pixels resembled a straight line, they were kept. Otherwise, the pixels were set back to the original color of the image. This helped narrow down what the algorithm had to process. Next, our best fit algorithm was applied to calculate the most likely positions of the walls. The small excerpt of pseudocode in Fig 2 shows the core process of the algorithm. The idea is that any magenta pixels remaining before the best fit algorithm runs may represent a boundary. The algorithm begins on a magenta pixel and calculates 240 variations of lines along which the pixel could lie. A simple counter variable keeps track of the line on which most of the magenta pixels fall. We calculated this and created the candidate lines for every such pixel.
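The linear-versus-curved test used in the outlier removal step above can be sketched as follows. This is our own illustration: the paper does not specify how linearity is measured, so we fit the cluster's principal axis with an SVD and threshold the worst deviation from it; the tolerance value is assumed.

```python
import numpy as np

def is_linear(points, tol=1.5):
    """Decide whether a small pixel cluster resembles a straight line.

    Fits the principal axis through the cluster and measures the worst
    perpendicular deviation from it; clusters that bend by more than
    `tol` pixels are treated as objects rather than boundary segments.
    """
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    # Principal direction of the cluster via SVD.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # Residual of each point after projecting onto the principal axis.
    projection = np.outer(centered @ direction, direction)
    deviation = np.linalg.norm(centered - projection, axis=1)
    return float(deviation.max()) <= tol

straight = [(y, 2 * y) for y in range(30)]      # 30 collinear pixels: kept
curved = [(y, y * y // 10) for y in range(30)]  # parabolic arc: rejected
```

Collinear clusters like `straight` pass the test, while an arc like `curved` is rejected and would be reset to the original image colors, as described above.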
Once the best fit algorithm identified a line, it removed those pixels from the next calculation so that additional lines could be found.
Once the boundaries of the floor and walls were determined, the box model was created. Next, we applied a texture-gradient pattern to the box model to produce a depth map of the room. The color shifts gradually from the outer edges of the image towards the center, showing the progression in depth.

F. Results

The current program can, in most cases, identify the boundaries of the room and correctly produce a box model. Although the program is currently limited to the controlled experiment, further improvements can be made to increase its accuracy and the approach it uses to model depth. Please refer to Figs 3 and 4 to view the results of the experiment. The limiting factor in our current method is the ability to define the ground-wall boundary. As shown in Fig 4, the algorithm appears to correctly identify where the ground meets the wall; however, we believe that this can be improved further. The texture gradient did not apply correctly in this scenario, unlike in the previous examples displayed.

Figure 3. Comparison of common features shared between programs

Initially, we wanted to compute a depth map that assigned each pixel or group of pixels a depth value indicating how far from the camera it was. We conducted an initial experiment to measure the depth at certain distances from the camera as computed by our program, but soon found inaccuracies at different angles and depths. These results were therefore not included in this study, as further development is necessary to improve their accuracy.

IV. FUTURE WORK

Currently, we are developing a method to more accurately predict edges, specifically by analyzing focus. The Sobel edge detection algorithm is efficient; however, it has difficulties with objects of similar color. For example, a distinct difficulty is determining the depths of two white pieces of paper at different distances from the camera.
Beyond monocular cues such as texture and occlusion, the most prominent source of depth information is binocular vision. The more our eyes converge, the closer an object is; as an object moves farther away, our eyes diverge toward parallel. When concentrating on a nearby object at the fovea, objects farther away appear doubled. This can be demonstrated by holding one thumb close to your face and focusing on it while holding your other thumb at arm's length away: the farther thumb appears doubled. When we shift focus to the farther thumb, the closer one is doubled in our periphery. In particular, the research discussed in this paper highlights an approach to producing a depth map using a box model generated from a single image. For future applications, we are developing a system where the shift of an object between views will help us determine depth more accurately. Specifically, two cameras side by side will take images focused on an object, and the difference in the shift between objects in the foreground and background gives us information about their depth. For example, a similar experiment will be run comparing the results of our monocular approach to stereo vision. The depth of an object or room can also be inferred by evaluating multiple duplicate edge lines and object variance. An object farther away shifts a smaller amount when alternately closing each eye than an object up close. At different depths in a room, an object's shift can be compared via two different cameras. We would expect objects that are closer to shift more relative to objects farther away. We also expect object duplicates (the same object viewed at different angles) to become separated by a greater difference over time. A future version of this application will estimate the depth information of objects in an environment directly related to the focus.

V.
CONCLUSION

The purpose of this program was to generate a box model of the room given the geometric cues present and to apply a gradient to the image to show the progression of depth. As shown in Fig 4, the results of our program were successful in the controlled experiment. However, as discussed previously, there are limitations to it. The current application was developed to quickly estimate the depth of a room. Further research is needed to improve the accuracy of
this program and will incorporate the methods discussed in the future work.

Figure 4. Results of experiment

REFERENCES

[1] B. Liu, S. Gould, and D. Koller. Single image depth estimation from predicted semantic labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[2] A. Saxena, S. H. Chung, and A. Y. Ng. 3-D depth reconstruction from a single still image. International Journal of Computer Vision, 76(1):53-69, 2008.

[3] A. Saxena, M. Sun, and A. Y. Ng. Learning 3-D scene structure from a single still image. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2007.

[4] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.

[5] W. Zhuo, M. Salzmann, X. He, and M. Liu. Indoor scene structure analysis for single image depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[6] M. Liu, M. Salzmann, and X. He. Discrete-continuous depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[7] V. Hedau, D. Hoiem, and D. Forsyth. Thinking inside the box: Using appearance models and context based on room geometry. In Proceedings of ECCV, pp. 224-237, 2010.

[8] V. Hedau, D. Hoiem, and D. Forsyth. Recovering the spatial layout of cluttered rooms. In Proceedings of ICCV, 2009.