
Deep Learning Based 3D Reconstruction of Indoor Scenes

Student Name: Srinidhi Hegde
Roll Number:

BTP report submitted in partial fulfillment of the requirements for the Degree of B.Tech. in Computer Science & Engineering on 18th April 2017

BTP Track: Research Track

BTP Advisors:
Dr. Saket Anand, Assistant Professor (CSE), IIIT-D
Dr. Ojaswa Sharma, Assistant Professor (CSE), IIIT-D

Indraprastha Institute of Information Technology, New Delhi

Student's Declaration

I hereby declare that the work presented in the report entitled "Learning Based 3D Reconstruction", submitted by me for the partial fulfillment of the requirements for the degree of Bachelor of Technology in Computer Science & Engineering at Indraprastha Institute of Information Technology, Delhi, is an authentic record of my work carried out under the guidance of Dr. Saket Anand and Dr. Ojaswa Sharma. Due acknowledgements have been given in the report to all material used. This work has not been submitted anywhere else for the award of any other degree.

Place & Date: ...                                          Srinidhi Hegde

Certificate

This is to certify that the above statement made by the candidate is correct to the best of my knowledge.

Place & Date: ...                                          Dr. Saket Anand

Place & Date: ...                                          Dr. Ojaswa Sharma

Abstract

Recent advancements in deep learning techniques have opened doors for a wide variety of applications. With growing interest in deep learning and geometry, many computer vision problems have been tackled using deep learning. In this work, we create a framework for learning-based 3D reconstruction of building interiors from multiple 2D images that capture the entire scene of interest. We use PoseNet to regress the camera pose and thereby establish spatial relationships between the constituents of a scene. This work is a step towards solving the bigger problem of reconstruction from incomplete data of the scene.

Keywords: 3D reconstruction, relocalization, deep learning, Convolutional Neural Network

Acknowledgments

I would like to express my special thanks of gratitude to my advisors, Dr. Saket Anand and Dr. Ojaswa Sharma, for their guidance and constant help. I am also grateful for their timely help and untiring effort in providing necessary and useful inputs. I would like to express my immense gratitude to my parents for their constant support and motivation. I would also like to thank IIIT Delhi for providing me with this wonderful opportunity and for access to all the resources and facilities used by me. I have put effort into this project; however, it would not have been possible without the kind support and help of many individuals. I would like to extend my sincere thanks to all of them.

Work Distribution

The work distribution is provided in terms of weekly completion of tasks.

Week 1 to Week 5: Literature Survey
Week 6 to Week 8: Data Generation and Modelling Lab
Week 9 to Week 12: Data Visualization
Week 13 to Week 14: PoseNet Training

Contents

1 Introduction
  1.1 Motivation
  1.2 3D-Reconstruction Problem
    1.2.1 Feature Points Identification and Correspondences
    1.2.2 Depth Estimation
    1.2.3 3D Model Registration
  1.3 Incorporating Learning with 3D-Reconstruction
2 Learning Approaches for Geometry and Reconstruction
  2.1 Learning Depth, Normals and Semantic Labels using Deep Learning
    2.1.1 Depth Estimation
    2.1.2 Surface Normal Estimation
    2.1.3 Semantic Segmentation
  2.2 PoseNet: A CNN for Real-Time 6-DOF Camera Relocalization
3 Creating Framework
  3.1 Data Collection and Processing
    3.1.1 Experimental Setup and Procedure
    3.1.2 Challenges
    3.1.3 Results
4 Summary and Overview
  4.1 Summary from Previous Part
  4.2 Overview
5 Deep Learning for Elements of 3D Reconstruction
  5.1 SceneNet - Synthetic Dataset
  5.2 Object Pose Estimation Using Siamese Network
6 Pipeline for 3D Reconstruction
  6.1 About the Pipeline
  6.2 Experiments and Results
7 Future Works
  7.1 Model Improvement
  7.2 Removing Constraints - Exploring Incomplete Data

PART - I (Monsoon Semester, 2016)

Chapter 1
Introduction

1.1 Motivation

3D reconstruction is a problem that enables rapid prototyping of geometrical models in 3D, and is thus an essential part of scene understanding. When combined with 3D reconstruction, learning can help automate a process that is tiresome to carry out manually. Pose estimation is equally important for 3D reconstruction, as it helps establish spatial relationships between different geometric entities in 3D. Automating pose estimation from images can therefore greatly reduce manual intervention.

3D reconstruction has a wide variety of applications in various domains and is an integral part of fields such as robotics, augmented reality and virtual reality. Automation in generating models helps with relocalization and path planning in robotics in real time. Faster reconstruction enables real-time rendering of geometric models, which is useful for various AR/VR applications. Extending this problem to reconstruction from incomplete data could help solve the real-time puzzle-assembly problem, a challenging problem in the domain of artificial intelligence.

1.2 3D-Reconstruction Problem

The problem of 3D reconstruction is one of the most challenging problems in the domain of computer vision. As the name suggests, it involves recreating 3D models out of incomplete data, and it applies to different forms of data. For 2D images, for example, it can be thought of as the inverse of the process of obtaining images from 3D real-world objects (photography and imaging). As is evident from the nature of the problem, one of the challenges lies in accurately and effectively recovering the depth information that was lost in forming the 2D images. 3D reconstruction can also be applied to retrieve a complete 3D model from an incomplete 3D point cloud. In this scenario, it can be challenging to estimate the exact transformations of the different parts of the model in order to register the parts into a composite final model. In the problem discussed here, the focus is both on 3D reconstruction from multiple 2D images and on registration among the models generated from different images. Some of the key concepts essential for solving the 3D reconstruction problem are discussed in the following subsections.

1.2.1 Feature Points Identification and Correspondences

For estimating the pose and transformation between two image pairs or two geometric model pairs, it is essential to identify feature points that are invariant to affine transformations. These help in uniquely identifying a point in an image or a 3D model.

1.2.2 Depth Estimation

Depth estimation involves generating a depth map from images: an image in which each pixel intensity represents the depth of the visible object from the camera. Conventional techniques for depth estimation involve a multi-view approach in which epipolar geometry is employed to estimate depth from multiple images.

1.2.3 3D Model Registration

Registration of 3D geometric models deals with the alignment of models or point clouds (sets of points in 3D space). The two prominent algorithms used for model registration are the Iterative Closest Point (ICP) algorithm and the Random Sample Consensus (RANSAC) algorithm.

Iterative Closest Point: ICP is one of the most widely used methods for 3D model registration. It was first proposed by Besl et al. [2]. The algorithm takes two geometric models (as sets of points), say A and B, and the point correspondences between them as input. It then estimates the transformation between A and B, chosen to minimize the distance between the corresponding points of A and B (a short code sketch of this alignment step is given at the end of this chapter).

Random Sample Consensus: RANSAC is a generic iterative framework for fitting models to unstructured data. In each iteration it samples the minimal set of data items that defines the model being fit, and it greedily updates the model to maximize the number of matching data points (inliers) [5].

1.3 Incorporating Learning with 3D-Reconstruction

The concept of learning is formally defined by Tom Mitchell in [9] as follows: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." Some advanced learning approaches have recently been employed for geometric understanding of scenes. In general, learning is a powerful tool that enables us to automate many of the processes involved in 3D reconstruction. From the visual and geometric cues in images, an algorithm can learn key features such as depth, position and orientation for better scene understanding. In recent years, deep neural network frameworks have been used to estimate depth, surface normals and semantic relations from the visual cues of a single image, as described in detail in Section 2.1. Furthermore, relocalization is an important problem that can also be learnt using deep learning frameworks (as discussed in Section 2.2).
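The alignment step inside each ICP iteration has a closed-form solution via the SVD. The following is a minimal sketch of the procedure from Section 1.2.3 in Python/NumPy, assuming nearest neighbours serve as correspondences and restricting to rigid (rotation plus translation) transforms; the function names are illustrative and not from any library used in this work.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(A, B):
    """Least-squares rotation R and translation t mapping point set A onto B (both Nx3)."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)                      # 3x3 cross-covariance of centred sets
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    t = cb - R @ ca
    return R, t

def icp(source, target, iters=50, tol=1e-6):
    """Align 'source' (Nx3) to 'target' (Mx3) by alternating correspondence and alignment."""
    tree = cKDTree(target)
    src = source.copy()
    prev_err = np.inf
    for _ in range(iters):
        dist, idx = tree.query(src)                # correspondences: nearest target points
        R, t = best_rigid_transform(src, target[idx])
        src = src @ R.T + t                        # apply the incremental transform
        err = dist.mean()
        if abs(prev_err - err) < tol:              # stop when the mean error plateaus
            break
        prev_err = err
    return src
```

Because the nearest-neighbour correspondences are only locally valid, this scheme depends heavily on a good initial alignment, which is why a coarse initialization (e.g. from RANSAC, as used later in Chapter 3) is usually paired with it.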

Chapter 2
Learning Approaches for Geometry and Reconstruction

A vast amount of work has been done in the field of 3D reconstruction, but incorporating learning into the 3D reconstruction problem has only recently become of interest. Hence there are few works pertaining to the problem of learning-based 3D reconstruction. Here is a brief overview of the related work.

2.1 Learning Depth, Normals and Semantic Labels using Deep Learning

This work by Eigen et al. [4] focuses on solving three important computer vision problems, namely depth estimation, surface normal estimation and semantic segmentation, using a single neural network architecture. The architecture used is a convolutional neural network (CNN), as shown in Figure 2.1. They use a multi-scale feature extraction process, with three different CNNs processing features at three different scales, to obtain a high-resolution feature set. Although a single architecture is used, the feature extraction and training have some finer variations for the different problems.

2.1.1 Depth Estimation

For depth estimation, the scene depth is modelled as a depth map. Let the predicted and ground-truth depth maps be D and D* respectively. The loss function used for training, with d = D - D*, is

L_{depth}(D, D^*) = \frac{1}{n}\sum_i d_i^2 - \frac{1}{2n^2}\Big(\sum_i d_i\Big)^2 + \frac{1}{n}\sum_i \big[(\nabla_x d_i)^2 + (\nabla_y d_i)^2\big]   (2.1)

where the sums are over valid pixels i, n is the number of pixels, and \nabla_x d_i and \nabla_y d_i are the horizontal and vertical image gradients of the difference. On the coarser scale, the network extracts depth information based on global geometric features of the image, such as vanishing points, object poses and alignment of structures, whereas on the finer scales it extracts local depth information, such as depth fluctuations due to object texture.
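Equation (2.1) translates almost directly into array operations. A small sketch in Python/NumPy, assuming forward differences for the gradient terms and a boolean mask of valid pixels; this is an illustration of the loss, not the authors' implementation:

```python
import numpy as np

def depth_loss(pred, gt, mask):
    """Scale-invariant depth loss of Eq. (2.1).
    pred, gt : HxW predicted / ground-truth depth maps
    mask     : HxW boolean map of valid pixels
    """
    d = np.where(mask, pred - gt, 0.0)
    n = mask.sum()
    # forward differences of the residual, counted only where both pixels are valid
    gx = (d[:, 1:] - d[:, :-1]) * (mask[:, 1:] & mask[:, :-1])
    gy = (d[1:, :] - d[:-1, :]) * (mask[1:, :] & mask[:-1, :])
    return ((d ** 2).sum() / n
            - 0.5 * d.sum() ** 2 / n ** 2
            + (gx ** 2).sum() / n + (gy ** 2).sum() / n)
```

The middle term is what makes the loss scale-invariant: a constant offset of the whole depth map changes the first and second terms in a way that partially cancels.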

Figure 2.1: Multi-scale CNN for multiple tasks, comprising three scales. Scale 1 predicts a coarse but spatially varying feature set covering the entire image area; Scale 2 provides finer predictions at an intermediate level of resolution; Scale 3 outputs a high-resolution, highly detailed map. Source: [4]

2.1.2 Surface Normal Estimation

Surface normals in a 3D geometric model are generally represented as an attribute (a vector in R^3) attached to the vertices of the model. Unlike depth estimation, predicting surface normals uses a three-channel output for the x, y and z components of the normal vector. A simple elementwise loss function compares the predicted normal at each pixel to the ground truth via a dot product:

L_{normals}(N, N^*) = -\frac{1}{n}\sum_i N_i \cdot N_i^* = -\frac{1}{n} N \cdot N^*   (2.2)

where N and N* are the predicted and ground-truth vector maps representing the surface normal at each pixel. The ground-truth vector map is computed with the approach proposed by Silberman et al. [11], which fits least-squares planes to a point cloud generated from the image.

2.1.3 Semantic Segmentation

The basic approach used for semantic segmentation is to estimate per-pixel semantic labels for the given image; these labels then cluster together pixels showing similar objects in the scene. The network discussed in this work estimates per-pixel semantic labels using a pixelwise softmax classifier, with the number of channels in the final output equal to the number of classes. The following pixelwise cross-entropy loss function is used for training:

L_{semantic}(C, C^*) = -\frac{1}{n}\sum_i C_i^* \log C_i   (2.3)
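The per-pixel losses (2.2) and (2.3) are similarly compact. A hedged sketch in Python/NumPy, assuming unit-norm normal maps and one-hot ground-truth labels, with the softmax of Eq. (2.4) below computed in a numerically stable form:

```python
import numpy as np

def normals_loss(pred, gt, mask):
    """Eq. (2.2): negative mean dot product; pred, gt are HxWx3 unit normal maps."""
    dots = (pred * gt).sum(axis=-1)                       # per-pixel N_i . N_i*
    return -dots[mask].mean()

def semantic_loss(logits, onehot, mask):
    """Eq. (2.3): pixelwise softmax cross-entropy; logits, onehot are HxWxK."""
    z = logits - logits.max(axis=-1, keepdims=True)       # shift for numerical stability
    log_softmax = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -(onehot * log_softmax).sum(axis=-1)             # -C* log C at each pixel
    return ce[mask].mean()
```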

Figure 2.2: GoogLeNet architecture with components. Source: [1]

where C_i, given by

C_i = \frac{e^{z_i}}{\sum_c e^{z_{i,c}}}   (2.4)

is the class prediction at pixel i given the output z of the final convolutional linear layer. The semantic labelling makes use of the depth and surface normal information obtained from the same network, as discussed in the previous sections.

2.2 PoseNet: A CNN for Real-Time 6-DOF Camera Relocalization

PoseNet [7] is a real-time system for solving the camera relocalization problem: the task of identifying the position and orientation, otherwise known as the pose, of the camera. PoseNet takes as input a 224x224 RGB image and regresses the camera's 6-DoF pose relative to the scene. The pose regressed by PoseNet is modelled as a vector p given by

p = [x, q]   (2.5)

where x is the 3D position of the camera and q is its orientation represented as a quaternion. PoseNet uses a modified GoogLeNet [10] architecture, as shown in Figure 2.2: the three softmax classifiers are replaced by affine regressors to output the pose, and another fully connected layer is inserted before the final regressor to form a localized feature vector for exploration and visualization. For implementation purposes we use PoseNet's implementation based on the Caffe library [6]. PoseNet is trained on an input image I using stochastic gradient descent with the following Euclidean loss function:

\mathrm{loss}(I) = \lVert \hat{x} - x \rVert_2 + \beta \Big\lVert \hat{q} - \frac{q}{\lVert q \rVert} \Big\rVert_2   (2.6)

where \beta is a scale factor chosen to keep the expected values of the position and orientation errors approximately equal. For better results, PoseNet is pretrained on large datasets such as ImageNet and Places; this transfer learning reduces the amount of data required for pose estimation.
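The pose loss (2.6) amounts to a weighted sum of two Euclidean distances. A minimal sketch in Python/NumPy; the value of beta shown is purely illustrative, since in [7] it is tuned per scene:

```python
import numpy as np

def pose_loss(x_pred, q_pred, x_gt, q_gt, beta=500.0):
    """Eq. (2.6): position error plus beta-weighted orientation error.
    x_*: 3-vectors (camera position); q_*: 4-vectors (quaternions).
    beta here is an assumed placeholder; it must be tuned per scene."""
    pos_err = np.linalg.norm(x_pred - x_gt)
    ori_err = np.linalg.norm(q_pred - q_gt / np.linalg.norm(q_gt))
    return pos_err + beta * ori_err
```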

Chapter 3
Creating Framework

In our work, we create a framework for learning-based 3D reconstruction of building interiors from multiple 2D images that capture the entire scene of interest. Learning to reconstruct requires estimating the camera poses, which give an estimate of the pose of the scene of interest with respect to the world coordinate frame; we therefore employ PoseNet to regress the camera pose. We first focus on creating the ground-truth data needed to train the pose regression, and then visualize the camera poses to gain useful insights into the data.

3.1 Data Collection and Processing

3.1.1 Experimental Setup and Procedure

To generate the ground truth, we captured RGB-D images, which carry an additional depth component alongside the three RGB color components. For capturing RGB-D images we used a Microsoft Kinect v2 together with Kinect Fusion for 3D reconstruction, and we captured the interiors of the Swarath Lab, IIIT-Delhi. The Kinect Fusion SDK performs 3D reconstruction of scenes in real time; it was used to reconstruct the lab in patches, and these patches were stitched together externally using ICP. Thus two levels of reconstruction were used to obtain the complete model of the lab. RANSAC tends to be very expensive when highly accurate model correspondences are required, but it is fast at lower accuracy; ICP, on the other hand, is slow but produces much better registration. We therefore use a combination of the two for model registration: first RANSAC aligns the significant planes in the model, and then ICP fine-tunes the registration to obtain an accurate model (a code sketch of this coarse-to-fine scheme is given at the end of Section 3.1).

3.1.2 Challenges

Due to physical constraints, the whole lab could not be captured in one run, so we created multiple patches from different locations in the lab, as shown in Figure 3.1. At each location we rotated the Kinect to cover its entire field of view as visible in the KinectFusion tool. After collecting the data for each portion of the lab, we stitched all the models together using ICP, with initializations based on visual features in the RGB-D images. While obtaining 3D models with Kinect Fusion, we obtained models with improper registration wherever too few features were available, so we introduced additional visual features by placing distinctly colored objects as markers; this helped us obtain more accurate models of the lab. In the final model obtained by fusing the different patches of the lab together, we see some irregularity, as shown in Figure 7.1: the walls of one portion of the lab are not registered properly. This kind of problem, created by the accumulation of errors over all the patches, is referred to as loop closure.

Figure 3.1: Patches of models. (a) to (e) represent different lab patches that were stitched to create the composite model. In all we created 18 patches.

3.1.3 Results

The point cloud data generated from mapping the Swarath Lab interiors is visualized using the open-source tool MeshLab, as shown in Figure 3.2. The camera poses are represented by identical frustums with different positions and orientations; these serve as the ground truth for training purposes. We also collected RGB-D images at the respective camera poses. RGB-D images are stored as two separate images: an RGB image and a greyscale image representing a scaled depth map.
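The coarse-to-fine registration of Section 3.1.1 can be approximated with off-the-shelf tools. Below is a hedged sketch using Open3D's RANSAC plane segmentation and ICP refinement; the actual stitching in this work was done with the Kinect Fusion SDK and external tools, so this illustrates the scheme rather than reproducing the code used:

```python
import numpy as np
import open3d as o3d

def align_dominant_planes(src, dst):
    """Coarse init: rotate src so its dominant RANSAC-fit plane normal matches dst's."""
    (a1, b1, c1, _), _ = src.segment_plane(0.02, ransac_n=3, num_iterations=1000)
    (a2, b2, c2, _), _ = dst.segment_plane(0.02, ransac_n=3, num_iterations=1000)
    n1 = np.array([a1, b1, c1]); n1 /= np.linalg.norm(n1)
    n2 = np.array([a2, b2, c2]); n2 /= np.linalg.norm(n2)
    # Rodrigues rotation taking n1 to n2; antiparallel normals are ignored in this sketch
    v = np.cross(n1, n2); s = np.linalg.norm(v); c = n1 @ n2
    K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    R = np.eye(3) if s < 1e-8 else np.eye(3) + K + K @ K * ((1 - c) / s**2)
    T = np.eye(4); T[:3, :3] = R
    return T

def register(src, dst):
    """RANSAC plane alignment for initialization, then point-to-point ICP refinement."""
    init = align_dominant_planes(src, dst)
    result = o3d.pipelines.registration.registration_icp(
        src, dst, max_correspondence_distance=0.05, init=init,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation
```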

Figure 3.2: Composite model. Visualization of the composite model of the lab created using ICP. Frustums represent the camera poses: the tip of a frustum gives the camera position, and the normal to its base gives the camera orientation.

PART - II (Winter Semester, 2017)

Chapter 4
Summary and Overview

4.1 Summary from Previous Part

In the previous part, we introduced the problem of 3D reconstruction as the generation of 3D geometrical models from 2D images. We discussed the techniques involved in 3D reconstruction using structure from motion (SfM), namely (i) feature point identification and correspondences, (ii) depth estimation from 2D images, and (iii) 3D model registration (based on ICP, RANSAC or a combination of both). Apart from traditional SfM techniques, we discussed how learning can be incorporated into the 3D reconstruction problem, along with the advantages and disadvantages of using learning algorithms. Employing learning, we reviewed important elements of 3D geometry and how they can be inferred using deep learning techniques. We showed that convolutional neural networks (CNNs) can be employed to find geometrical constructs such as depth, surface normals, scene semantics and camera poses. We described the three-in-one architecture proposed by Eigen et al. [4] for predicting depth, surface normals and semantic segmentation with a single CNN, and the PoseNet architecture proposed by Kendall et al. [7] for regressing the 6-DoF camera pose from an RGB image at test time.

Previously we also started building an end-to-end framework that could aid in generating 3D reconstructed models. We used the dataset collected from the Swarath Lab at IIIT-Delhi to generate ground truth via SfM for training, and used this data to fine-tune the PoseNet architecture for predicting camera poses; the resulting errors are shown in Figure 4.1. Furthermore, we produced some reconstruction results from SfM along with camera pose visualizations, and discussed the challenges and shortcomings of the generated model.

4.2 Overview

The second part of this thesis covers the work done in the second (Winter 2017) semester of the B.Tech Project. The first part was mostly focused on a literature survey and some elementary experiments, with the rest devoted to data collection for training. In the second part, the focus was on implementing some of the ideas discussed before, along with exploring different architectures for the various tasks of 3D reconstruction.

Figure 4.1: Errors in estimating camera position and camera orientation with the pretrained PoseNet model.

The next few chapters are organized as follows. Chapter 5 discusses ideas that help in computing geometric elements useful for reconstructing 3D geometric models. Chapter 6 presents the pipeline we propose, the experiments we performed, and the results of these experiments. Finally, Chapter 7 outlines future directions and interesting problems that can be addressed.

Chapter 5
Deep Learning for Elements of 3D Reconstruction

This chapter focuses on components of 3D reconstruction that can be combined to develop an end-to-end pipeline for the task. We discuss the performance of deep learning architectures on synthetically generated datasets, and then focus on an efficient object pose estimation technique that uses regression with a different class of deep learning architectures, the Siamese network.

5.1 SceneNet - Synthetic Dataset

SceneNet RGB-D [8] is a large-scale dataset of photorealistic RGB-D videos providing complete ground truth for a wide range of problems related to indoor scenes. The dataset is well suited to various computer vision problems such as semantic segmentation, estimation of geometric constructs, 3D reconstruction and metric SLAM. It contains 57 trajectories covered in 1000 video sequences, each consisting of 300 frames per video sequence. The dataset provides the RGB-D video sequences along with semantic annotations in some cases, per-frame camera poses (with accurate measurements between the opening and closing of the camera aperture), and class and geometric orientation information for all the CAD models present in the scene. Sample data from the dataset can be seen in Figure 5.1.

We used the SceneNet RGB-D dataset to fine-tune the pre-trained PoseNet model for regressing camera poses from single RGB images. The synthetic dataset could not yield accurate results for indoor scenes, with a positional error of 1.8 m and a large orientation error of 60 to 70 degrees on average. One possible reason for this behaviour is the availability of only a very small number of training samples per trajectory, namely 300.

5.2 Object Pose Estimation Using Siamese Network

The paper [3] presents an interesting idea for regressing object pose. The main problem it addresses is the regression of an object's pose in angle space, guided by a feature space. To improve plain regression in angle space, a constraint is applied to pairs of samples in a particular feature space: the distance between any two samples should be equal, or in the ideal case proportional, in both the angle space and the feature space.

Figure 5.1: Beyond the properties shown, the dataset can also provide optical flow vectors for the scene and class segmentation information with minimal processing.

We take as input a pair of RGB images x_1, x_2 and pass it through the Siamese regression network. The Siamese architecture consists of two (or more) branches of the same CNN that share weights and encode the two inputs in parallel. For this application, the CNN consists of two convolutional layers followed by two fully convolutional layers and finally a fully connected layer that outputs the 6-DoF object pose in quaternion representation. The Siamese regression uses the following loss function:

l_f = \sum_{n=1}^{K} \Big( \lVert f(x_{n,1}) - f(x_{n,2}) \rVert_2^2 - \lVert y_{n,1} - y_{n,2} \rVert_2^2 \Big)   (5.1)

Here, f(x_i) denotes the mapping of image x_i into the feature space of interest, y_i is the mapping of x_i in angle space (and also its training label), and the sum runs over the K training samples. The speciality of this configuration is that the Siamese network is needed only during training; at test time a single branch of the architecture, without any siblings, suffices, as expressed in Figure 5.2. The final loss function combines the feature (Siamese regression) loss l_f with a regression loss defined as

l_R = \sum_{n=1}^{K} \lVert g(f(x_n)) - y_n \rVert_2^2   (5.2)

where g(.) is a regression layer function. For better results, this combination is further combined with L2 regularization.

The most crucial part of this learning algorithm is learning a feature space that satisfies the aforementioned constraints while remaining discriminative over the input data. This is achieved with a learning technique known as triplet learning. Each triplet of training samples contains an anchor, a positive sample and a negative sample; we then optimize the anchor and the positive sample to be closer in feature space than the anchor and the negative one. A further improvement is made while preprocessing the dataset: creating well-formed batches with equal representation of positive and negative samples with respect to an anchor was found to improve results compared with a random ordering of the dataset.

Figure 5.2: The Siamese architecture essentially regresses relative distances between the feature and pose spaces. During testing, we extract one branch of the network and use it for regression.
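A sketch of the feature-space loss (5.1) and the regression loss (5.2) in Python/NumPy. The squared penalty on the distance mismatch in feature_loss is an assumption made here for illustration, since the outer operation is left implicit above; this is not the exact formulation of [3]:

```python
import numpy as np

def feature_loss(f1, f2, y1, y2):
    """Eq. (5.1): penalize mismatch between feature-space and angle-space distances.
    f1, f2: KxD feature embeddings of the image pairs; y1, y2: Kx4 pose labels.
    The squared outer penalty is an assumption; the report leaves it implicit."""
    d_feat = ((f1 - f2) ** 2).sum(axis=1)     # ||f(x_n,1) - f(x_n,2)||^2
    d_pose = ((y1 - y2) ** 2).sum(axis=1)     # ||y_n,1 - y_n,2||^2
    return ((d_feat - d_pose) ** 2).sum()

def regression_loss(g_of_f, y):
    """Eq. (5.2): plain L2 regression loss on the predicted poses."""
    return ((g_of_f - y) ** 2).sum()
```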

Chapter 6
Pipeline for 3D Reconstruction

6.1 About the Pipeline

For a complete end-to-end 3D reconstruction pipeline, we satisfy a dual objective: recovering camera poses and depth information for generating the 3D model. Depth maps encode the details of the local geometry of a 3D model, while camera poses provide the global information, since they convey the relative poses and transformations between views in the world coordinate frame and thus help in assembling a global picture. Finally, for registration of the different patches of the reconstructed model, we use surface normal information. An overview of the pipeline is shown in Figure 6.1. For estimating camera poses we used simple SfM techniques, and for depth estimation we used the CNN architecture proposed by Eigen et al. [4] to generate depth maps and surface normal maps from a single RGB image.

Figure 6.1: A block diagram representing the working of our proposed pipeline.

6.2 Experiments and Results

We generated depth maps and surface normals using the same method as in [4]. The input RGB images were collected using a Kinect and were downscaled to match the dimensions of the input layer of VGGNet; the obtained depth and surface normal maps had similar dimensions. The results generated by this method on our Swarath Lab dataset are shown in Figure 6.2. For estimating the camera poses we used SfM.

Figure 6.2: (From left to right) Original RGB image; corresponding (color-coded) predicted depth map; corresponding predicted surface normal map.
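The pipeline's two outputs combine by back-projecting each predicted depth map into a camera-frame point cloud and carrying it into the world frame with the estimated pose. A minimal sketch, where the pinhole intrinsics (fx, fy, cx, cy) and the world-from-camera pose (R, t) are assumed symbols for illustration, not values from the report:

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy, R, t):
    """Lift an HxW depth map to world-frame 3D points.
    depth in metres; (fx, fy, cx, cy): pinhole intrinsics;
    (R, t): world-from-camera rotation and translation (e.g. from PoseNet or SfM)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx                     # camera-frame coordinates
    y = (v - cy) * z / fy
    pts_cam = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    pts_cam = pts_cam[z.reshape(-1) > 0]      # drop invalid (zero-depth) pixels
    return pts_cam @ R.T + t                  # transform into the world frame
```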

Chapter 7
Future Works

7.1 Model Improvement

There is good scope for improving the model. The loop closure problem (see Figure 7.1) is one issue that needs to be addressed for accurate representation of any interior scene; a better feature matching scheme (that is, one creating robust correspondences between model features) together with bundle adjustment could be used to solve it. Other improvements include completing the mapping by including floor and ceiling models along with the walls of the interior rooms.

7.2 Removing Constraints - Exploring Incomplete Data

In the present work we use RGB-D information, which describes the entire 3D information, to train PoseNet, and we presently assume that the given data completely describes the scene of interest. In future we aim to solve, within this framework, the problem of reconstruction from incomplete 2D data, such as an incomplete set of 2D images, or the floor plan of a room or a building interior.

Figure 7.1: Effects of the loop closure problem due to accumulated errors from all patches.

Bibliography

[1] GoogLeNet component diagram. Available at googlenet_keras/googlenet_components.png.

[2] Besl, P. J., and McKay, N. D. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14, 2 (Feb 1992).

[3] Doumanoglou, A., Balntas, V., Kouskouridas, R., and Kim, T.-K. Siamese regression networks with efficient mid-level feature extraction for 3D object pose estimation. arXiv preprint (2016).

[4] Eigen, D., and Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision (2015).

[5] Fischler, M. A., and Bolles, R. C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24, 6 (June 1981).

[6] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (2014), ACM.

[7] Kendall, A., Grimes, M., and Cipolla, R. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision (2015).

[8] McCormac, J., Handa, A., Leutenegger, S., and Davison, A. J. SceneNet RGB-D: 5M photorealistic images of synthetic indoor trajectories with ground truth. CoRR (2016).

[9] Mitchell, T. M. Machine Learning, 1st ed. McGraw-Hill, Inc., New York, NY, USA, 1997.

[10] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).

[11] Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision (2012), Springer.


More information

The Hilbert Problems of Computer Vision. Jitendra Malik UC Berkeley & Google, Inc.

The Hilbert Problems of Computer Vision. Jitendra Malik UC Berkeley & Google, Inc. The Hilbert Problems of Computer Vision Jitendra Malik UC Berkeley & Google, Inc. This talk The computational power of the human brain Research is the art of the soluble Hilbert problems, circa 2004 Hilbert

More information

Deep Learning. Visualizing and Understanding Convolutional Networks. Christopher Funk. Pennsylvania State University.

Deep Learning. Visualizing and Understanding Convolutional Networks. Christopher Funk. Pennsylvania State University. Visualizing and Understanding Convolutional Networks Christopher Pennsylvania State University February 23, 2015 Some Slide Information taken from Pierre Sermanet (Google) presentation on and Computer

More information

Structured light 3D reconstruction

Structured light 3D reconstruction Structured light 3D reconstruction Reconstruction pipeline and industrial applications rodola@dsi.unive.it 11/05/2010 3D Reconstruction 3D reconstruction is the process of capturing the shape and appearance

More information

Finding Tiny Faces Supplementary Materials

Finding Tiny Faces Supplementary Materials Finding Tiny Faces Supplementary Materials Peiyun Hu, Deva Ramanan Robotics Institute Carnegie Mellon University {peiyunh,deva}@cs.cmu.edu 1. Error analysis Quantitative analysis We plot the distribution

More information

Using Machine Learning for Classification of Cancer Cells

Using Machine Learning for Classification of Cancer Cells Using Machine Learning for Classification of Cancer Cells Camille Biscarrat University of California, Berkeley I Introduction Cell screening is a commonly used technique in the development of new drugs.

More information

3D Photography: Stereo

3D Photography: Stereo 3D Photography: Stereo Marc Pollefeys, Torsten Sattler Spring 2016 http://www.cvg.ethz.ch/teaching/3dvision/ 3D Modeling with Depth Sensors Today s class Obtaining depth maps / range images unstructured

More information

Plane Based Free Stationing for Building Models

Plane Based Free Stationing for Building Models Christian MANTHE, Germany Key words: plane based building model, plane detection, plane based transformation SUMMARY 3D Building models are used to construct, manage and rebuild buildings. Thus, associated

More information

CS 4758: Automated Semantic Mapping of Environment

CS 4758: Automated Semantic Mapping of Environment CS 4758: Automated Semantic Mapping of Environment Dongsu Lee, ECE, M.Eng., dl624@cornell.edu Aperahama Parangi, CS, 2013, alp75@cornell.edu Abstract The purpose of this project is to program an Erratic

More information

Contexts and 3D Scenes

Contexts and 3D Scenes Contexts and 3D Scenes Computer Vision Jia-Bin Huang, Virginia Tech Many slides from D. Hoiem Administrative stuffs Final project presentation Nov 30 th 3:30 PM 4:45 PM Grading Three senior graders (30%)

More information

Motion Tracking and Event Understanding in Video Sequences

Motion Tracking and Event Understanding in Video Sequences Motion Tracking and Event Understanding in Video Sequences Isaac Cohen Elaine Kang, Jinman Kang Institute for Robotics and Intelligent Systems University of Southern California Los Angeles, CA Objectives!

More information

Image processing and features

Image processing and features Image processing and features Gabriele Bleser gabriele.bleser@dfki.de Thanks to Harald Wuest, Folker Wientapper and Marc Pollefeys Introduction Previous lectures: geometry Pose estimation Epipolar geometry

More information

Object Localization, Segmentation, Classification, and Pose Estimation in 3D Images using Deep Learning

Object Localization, Segmentation, Classification, and Pose Estimation in 3D Images using Deep Learning Allan Zelener Dissertation Proposal December 12 th 2016 Object Localization, Segmentation, Classification, and Pose Estimation in 3D Images using Deep Learning Overview 1. Introduction to 3D Object Identification

More information

Using Augmented Measurements to Improve the Convergence of ICP. Jacopo Serafin and Giorgio Grisetti

Using Augmented Measurements to Improve the Convergence of ICP. Jacopo Serafin and Giorgio Grisetti Jacopo Serafin and Giorgio Grisetti Point Cloud Registration We want to find the rotation and the translation that maximize the overlap between two point clouds Page 2 Point Cloud Registration We want

More information

Intrinsic3D: High-Quality 3D Reconstruction by Joint Appearance and Geometry Optimization with Spatially-Varying Lighting

Intrinsic3D: High-Quality 3D Reconstruction by Joint Appearance and Geometry Optimization with Spatially-Varying Lighting Intrinsic3D: High-Quality 3D Reconstruction by Joint Appearance and Geometry Optimization with Spatially-Varying Lighting R. Maier 1,2, K. Kim 1, D. Cremers 2, J. Kautz 1, M. Nießner 2,3 Fusion Ours 1

More information