Unsupervised Deep Learning for Scene Recognition

Akram Helou and Chau Nguyen

May 19, 2011

1 Introduction

Object and scene recognition are usually studied separately. However, research [2] shows that context from scene recognition can greatly improve object recognition performance: adding information about the type of scene an object is embedded in helps disambiguate the class of the object. Gist is an example of a global image feature descriptor for characterizing the global properties of a scene [3]. The aim of our project is to begin exploring the possibility of learning global scene features that can later be used to improve the performance of standard object detectors. The main advantage of learning features automatically is that it is difficult to hand engineer features that capture the full statistical properties of natural scenes.

2 Related Work

Prior work [1, 3] uses Gist features to classify images by scene type and to bias an object recognition algorithm to look for an object only in the region of the image where the object is most likely to be located. [3] considers the task of classifying images into types of scenes: on a dataset containing images belonging to eight different scenes, Gist features are fed into a classifier to achieve an accuracy of 83.7%. In general, Gist is often used in object recognition systems that make use of context. Clever features like Gist pervade computer vision; however, such features have been carefully engineered over years using specialized knowledge. To accelerate advances in computer vision research, it would be advantageous to learn good features automatically.
Recently, deep learning techniques have been shown to learn, in a completely unsupervised setting, image features that are competitive with hand engineered image features such as Gabor filters [6].

3 Algorithms and Implementations

3.1 Deep Learning

A deep learning algorithm is any algorithm that attempts to learn a hierarchical representation of the data, usually in an unsupervised way [4]. This hierarchical representation is then used for solving standard machine learning problems, namely prediction. Deep learning techniques, namely Deep Belief Nets (DBN) [5] and similar approaches, have many attractive properties, including:

1. Learning can be done in both unsupervised and supervised settings. [5]
2. A hierarchical non-linear representation of the data can boost performance on supervised learning tasks. [5]
3. Learning leads to a generative model of the data which can be efficiently sampled from. [5]
4. There are efficient algorithms for learning deep representations. [5]
5. A function represented using K layers may need an exponentially bigger representation in order to be accurately modeled with K-1 layers. [4]
6. The mammalian visual cortex has a hierarchical architecture. Furthermore, sparse DBNs have been shown to learn features similar to those found in the V1 and V2 cells of the visual cortex. [6]

3.2 Deep Belief Networks

A Deep Belief Network (DBN) is a generative multilayer directed graphical model with unrestricted weights between layers [5]. A graphical representation of a DBN can be seen in Figure 1. In a DBN, the top hidden layer and the hidden layer below it define a Restricted Boltzmann Machine (RBM). An RBM (Figure 1) is an undirected probabilistic generative model consisting of one hidden layer and one visible layer. There is a connection between every hidden unit and every visible unit, and these connections define a weight matrix W. Learning W via exact maximum likelihood is too slow in practice. Instead, the Contrastive Divergence (CD) algorithm is used to approximate learning in an RBM. CD optimizes a different objective function from the likelihood that maximum likelihood learning maximizes, but in practice it works well.
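
To make the CD update concrete, here is a minimal sketch of a single CD-1 step for a binary RBM in plain Python. It is not the code used in this project: the function names, the learning rate `lr`, and the choice to use probabilities rather than samples for the reconstruction are all assumptions made for illustration, and real implementations work on mini-batches with matrix operations.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, c):
    # P(h_j = 1 | v); W[i][j] connects visible unit i to hidden unit j.
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    # P(v_i = 1 | h) for every visible unit i.
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def cd1_update(v0, W, b, c, lr, rng):
    """Apply one CD-1 step for training vector v0, updating W, b, c in place.

    Returns the squared reconstruction error for v0.
    """
    ph0 = hidden_probs(v0, W, c)                        # positive phase
    h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]
    pv1 = visible_probs(h0, W, b)                       # one Gibbs step down
    ph1 = hidden_probs(pv1, W, c)                       # negative phase
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * ph0[j] - pv1[i] * ph1[j])
        b[i] += lr * (v0[i] - pv1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])
    return sum((v0[i] - pv1[i]) ** 2 for i in range(len(b)))


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]  # toy binary patterns
    n_visible, n_hidden = 4, 2
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
    b, c = [0.0] * n_visible, [0.0] * n_hidden
    for epoch in range(50):
        err = sum(cd1_update(v, W, b, c, 0.1, rng) for v in data)
    print("reconstruction error after last epoch:", err)
```

The update compares statistics measured on the data (`v0`, `ph0`) with statistics measured after one reconstruction step (`pv1`, `ph1`), which is what distinguishes CD from following the exact likelihood gradient.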
Overall, learning a DBN model proceeds by learning the weights connecting every hidden layer with the one below it, starting from the lowest level and treating each pair of adjacent layers as an RBM. This greedy learning procedure offers no formal guarantees of converging to the maximum likelihood model; however, it works well in practice.
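
The control flow of this greedy procedure can be sketched as follows. This is an illustrative outline only: `placeholder_trainer` is a hypothetical stand-in that returns small random weights instead of running CD, and all function names are assumptions rather than part of our MATLAB code. The point is the loop: train one layer, freeze it, and feed its hidden activities to the next layer.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def propagate_up(v, W, c):
    """Hidden activation probabilities of one layer for input vector v."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def placeholder_trainer(data, n_hidden, rng):
    """Hypothetical stand-in for an RBM trainer such as CD.

    Returns (W, c): small random weights and zero hidden biases.
    """
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
    return W, [0.0] * n_hidden


def greedy_pretrain(data, layer_sizes, train_layer=placeholder_trainer, seed=0):
    """Train a stack of layers one at a time, from the bottom up."""
    rng = random.Random(seed)
    layers = []
    layer_input = data
    for n_hidden in layer_sizes:
        W, c = train_layer(layer_input, n_hidden, rng)  # fit this layer only
        layers.append((W, c))
        # Freeze the layer; its hidden activities become the next layer's data.
        layer_input = [propagate_up(v, W, c) for v in layer_input]
    return layers


def extract_features(v, layers):
    """Concatenate the hidden activities of all layers for one input."""
    features, x = [], v
    for W, c in layers:
        x = propagate_up(x, W, c)
        features.extend(x)
    return features


if __name__ == "__main__":
    toy_data = [[0.0, 1.0, 1.0, 0.0, 1.0, 0.0] for _ in range(3)]
    stack = greedy_pretrain(toy_data, [4, 3])
    print(len(extract_features(toy_data[0], stack)))  # 4 + 3 features
```

Replacing `placeholder_trainer` with a real RBM trainer that has the same signature would turn this outline into actual layer-wise pretraining.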

Figure 1: Graphical representation of a Deep Belief Network

Figure 2: Graphical representation of a Convolutional Restricted Boltzmann Machine

3.3 Convolutional Deep Belief Networks

A Convolutional Deep Belief Network (CDBN) is a variant of DBN. The building blocks of a CDBN are Convolutional Restricted Boltzmann Machines (CRBM) (Figure 2). A CRBM differs from an RBM in that the weights connecting hidden units to visible units are constrained to be equal over a set of visible units. In the case of an image, these weights define a filter which is convolved with the image. Additionally, a CRBM has a pooling layer on top of the hidden layer. A pooling layer compresses the hidden layer by choosing the maximal hidden activity over every image patch. This is called maximum pooling. Maximum pooling makes learned features invariant to slight noise and translation. While learning a CDBN, the pooling layer from the first CRBM is fed into the next CRBM. This learning procedure allows for learning higher level features as we move up in the hierarchy of layers.

Overall, CDBNs are attractive for computer vision tasks. They scale to higher resolution images than DBNs because they have fewer parameters to learn. The convolutional structure of the network takes into account the structure of an image. Maximum pooling allows for learning shift invariant features.

3.4 Implementations

We found MATLAB code for training a DBN; however, it was poorly written and only worked on the MNIST handwritten digits data. We therefore heavily modified the code to make it more general and more tunable. We also added sparsity to the learning procedure because it was shown to help learn more edge-like features [6]. We were not able to find code for training a CDBN. Instead we implemented our own CDBN in MATLAB as described in [7].
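
The filtering and pooling steps described in Section 3.3 can be illustrated with a small plain-Python sketch. It is a simplification under stated assumptions: it uses a sliding-window filter response (cross-correlation, which equals convolution with a flipped filter) and deterministic maximum pooling over non-overlapping blocks, whereas the CRBM of [7] applies a non-linearity and uses probabilistic max pooling. The function names and the toy edge filter are chosen only for illustration.

```python
def correlate_valid(image, kernel):
    """Slide the filter over the image and return the 'valid' response map."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[r + a][c + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for c in range(iw - kw + 1)]
            for r in range(ih - kh + 1)]


def max_pool(fmap, block):
    """Keep the maximum of every non-overlapping block x block patch."""
    rows = len(fmap) // block
    cols = len(fmap[0]) // block
    return [[max(fmap[r * block + a][c * block + b]
                 for a in range(block) for b in range(block))
             for c in range(cols)]
            for r in range(rows)]


# A vertical edge, and the same edge shifted left by one pixel.
edge = [[0, 0, 1, 1, 1] for _ in range(4)]
edge_shifted = [[0, 1, 1, 1, 1] for _ in range(4)]
horizontal_gradient = [[-1, 1]]  # toy filter that responds to a dark-to-bright step

pooled = max_pool(correlate_valid(edge, horizontal_gradient), 2)
pooled_shifted = max_pool(correlate_valid(edge_shifted, horizontal_gradient), 2)
print(pooled, pooled_shifted)
```

The same filter is applied at every location, so the number of weights depends on the filter size and not on the image size. In this toy case the two pooled maps are identical even though the edge moved by one pixel, which is the kind of small-shift invariance that maximum pooling provides.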
Implementing DBN and CDBN was time consuming. Training DBN and CDBN is not straightforward because the models have upwards of eight parameters to tune and because the learning procedure, being an approximation, does not offer formal guarantees on performance and convergence. Auxiliary functions for monitoring overfitting and divergence conditions need to be written. Finally, the most challenging aspect of using these models is that they remain slow. Modeling images required more than a million parameters using DBN in our case. Therefore, we optimized our code and parallelized parts of our CDBN to take advantage of multiple cores. The code is available at http://code.google.com/p/deep-learning-obj-recog/. The code needs to be cleaned up and some functions are not properly documented yet.

4 Experimental Results

4.1 Data

We used Torralba's scene recognition data consisting of 2600 color images of size 256*256. The data contains 8 outdoor scene categories: coast, mountain, forest, open country, street, inside city, tall buildings, and highways.

4.2 Deep Belief Networks Results

Running DBN or CDBN is very time consuming; therefore we were not able to properly cross validate the models over a large number of parameter values. Instead we used a training and testing subset of the scene recognition data to quickly evaluate the data reconstruction error (footnote 1) of multiple DBN models over a range of parameters. We picked the best set of parameters and used them to train on 1800 images and test on the remaining images. After the DBN model has been learned, we use the activities of all the hidden layers as features for a classifier (footnote 2), which we use to classify the test data over the eight given scenes. Figure 3 describes the experiments we ran using the best model and shows the confusion matrix for each model. Unsurprisingly, color images coupled with higher resolution yield better accuracy. As a baseline, we used the raw pixels of the images as features for the classifier (footnote 3).
The baseline achieves an accuracy of 30% and 47% for gray and color images respectively. Figure 4 shows four of the learned features from each hidden layer. Admittedly, the features are difficult to interpret. Also, the features do not appear to capture higher level objects as we move up a layer, even though we restricted each layer to have sparse activities as in [6]. We are not very surprised by the results, as our model had more than a million parameters and was trained on a relatively minuscule dataset of only 1800 images. Figure 5 shows the generated image for a given testing image.

Footnote 1: We eventually discovered that data reconstruction error alone is not a good metric for assessing the generative performance of a DBN.
Footnote 2: We used the hidden activities over the same 1800 images used during training of the DBN. This is clearly undesirable but we did not find the time to rerun the experiments.
Footnote 3: If we had more time we would have used Gabor filters and/or other low level vision features. This would constitute a more appropriate baseline.

4.3 Convolutional Deep Belief Network Results

As we expected, CDBN does significantly better than DBN (Table 1). CDBN allowed us to scale to 100*100 resolution images and to learn low level features that look like edges and corners (Figure 6). Also, higher level filters appear to learn higher level features (Figures 7 and 8). However, these higher level features are not readily interpretable. This is not very surprising, since our data contains images depicting very different scenes, it is small in size, and our CDBN model is simpler than the one used to learn to detect images containing a single centered object [7]. Unlike the DBN experiments, we made sure to train the classifier on CDBN features inferred from images not included in the training set used for learning the CDBN model. Our CDBN model achieves an accuracy of 86.5% over the 400 images in the testing set (footnote 4). The features extracted from the CDBN actually manage to be more indicative of scene class than Gist features: a classifier trained with Gist features achieves an accuracy of 78.5%. Figures 9 and 10 indicate that our CDBN learns to generate reconstructions of the original images that are good enough to discern some of the scenes.

5 Discussion

Although the learned features from CDBN manage to be more predictive of scene class than Gist, we cannot decisively conclude that the learned features are always better than Gist for scene classification. On one hand, our experiments are not exhaustive enough. On the other hand, we used more than 50000 features for each image to achieve the aforementioned performance, while Gist only used 960 features.
However, our initial experiments show that the learned features are likely to compete with, or be better than, Gist, since the model we learned is relatively simple and used a small dataset. Furthermore, we did not use color information in our case, and unlike Gist we do not pad the border of the image with zeros to avoid artifacts, we do not prefilter the image, and we do not whiten the image. Also, Gist uses Gabor filters as a preprocessing step. The biggest obstacle we encountered while working on this project for the past three weeks was the lack of sufficient computational resources (fast multi-core machines). This prevented us from systematically exploring parameters, learning more complex models, and using more data.

Footnote 4: We made sure to include an equal number of images from each class in the training and testing phases.

Confusion matrix (Model 1)
                 t    i    s    h    c    o    m    f
tall building   61    1    9    0    1    5   13   10
inside city     15   31    6    4    4   23    7   10
street           2    6   76    4    1    3    3    5
highway          2    3    5   70    3   13    2    2
coast            5    1    6   14   25   33   12    4
open country     3    7    5   17   13   44    9    2
mountain        12    5    9    6    8   13   33   14
forest          16   11   10    1   11    7    7   37

Confusion matrix (Model 2)
                 t    i    s    h    c    o    m    f
tall building   65    8    8    0    4    3    9    3
inside city     10   47   10    4    7    6    8    8
street           7   18   65    0    2    1    3    4
highway          2    1    5   69    8   10    4    1
coast            5    3    1    6   58   11    8    8
open country     2    0    3    5    3   75    9    3
mountain        18    9    2    7   12   10   41    1
forest           8    9    5    0    3    7    3   65

Columns: t = tall building, i = inside city, s = street, h = highway, c = coast, o = open country, m = mountain, f = forest.

Model  Image size  Image type  DBN train  Classifier train  Classifier test  Kernel  Units per hidden layer  Sparse?  Accuracy
1      50*50       gray        1800       1800              800              Linear  [500 500 1000 1000]     Yes      47
2      75*75       color       1800       1800              800              Linear  [500 500 1000 1000]     Yes      60.62

Figure 3: Confusion matrices and details of experiments 1 and 2

Figure 4: Left to right: learned features from hidden units in layers 1, 2, 3, and 4 of experiment 1's DBN model
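
As a rough check of the "more than a million parameters" remark, the number of connection weights implied by the model details in Figure 3 can be counted directly. This sketch assumes fully connected consecutive layers, ignores bias terms, and assumes that a color image contributes three values per pixel to the visible layer; none of these details are stated explicitly in the text.

```python
def dbn_weight_count(n_visible, hidden_sizes):
    """Number of weights between consecutive fully connected layers."""
    sizes = [n_visible] + list(hidden_sizes)
    return sum(a * b for a, b in zip(sizes, sizes[1:]))


hidden_layers = [500, 500, 1000, 1000]

# Model 1: 50*50 gray images, one value per pixel.
model1_weights = dbn_weight_count(50 * 50, hidden_layers)

# Model 2: 75*75 color images, assuming three channels per pixel.
model2_weights = dbn_weight_count(75 * 75 * 3, hidden_layers)

print(model1_weights)  # 3000000
print(model2_weights)  # 10187500
```

Under these assumptions both models have millions of weights, compared with only 1800 training images.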

Figure 5: Left: one of the original images from the scene data. Right: corresponding generated image from experiment 1's DBN model

Confusion matrix (CDBN)
                 t    i    s    h    c    o    m    f
tall building   92    0    4    0    0    2    2    0
inside city      4   88    0    0    0    2    2    4
street           4    0   78    2    4    6    2    4
highway          0    0    0  100    0    0    0    0
coast            0    0    0   10   82    4    4    0
open country     0    6    0    4    2   86    2    0
mountain         2    8    2    4    0    2   90    2
forest           4    0    4    4    2    0    0   86

Confusion matrix (Gist)
                 t    i    s    h    c    o    m    f
tall building   84    4    6    0    2    0    4    0
inside city      8   78    8    4    0    2    0    0
street           6    6   86    0    0    2    0    0
highway          2    4    6   74    8    0    4    2
coast            0    0    0    8   74   14    0    4
open country     2    2    2   12   10   64    2    6
mountain         4    0    6    0    4    8   76    2
forest           2    0    0    0    0    2    4   92

Columns: t = tall building, i = inside city, s = street, h = highway, c = coast, o = open country, m = mountain, f = forest.

Model  Image size  Image type  DBN train  Classifier train  Classifier test  Kernel  Number of filters/layer  Sparse?  Accuracy
CDBN   100*100     gray        1800       400               400              Linear  [20 40 60]               Yes      86.5
Gist   128*128     color       NA         400               400              Linear  NA                       NA       78.5

Table 1: Confusion matrices and details of the CDBN and Gist experiments
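
The reported accuracy can be related to a confusion matrix with a short calculation. The sketch below uses the Gist matrix from Table 1 and assumes that each row holds percentages for one class and that all classes have the same number of test images (as stated in footnote 4), so that overall accuracy is the diagonal sum divided by the total. The helper name is illustrative.

```python
GIST_CONFUSION = [
    [84, 4, 6, 0, 2, 0, 4, 0],    # tall building
    [8, 78, 8, 4, 0, 2, 0, 0],    # inside city
    [6, 6, 86, 0, 0, 2, 0, 0],    # street
    [2, 4, 6, 74, 8, 0, 4, 2],    # highway
    [0, 0, 0, 8, 74, 14, 0, 4],   # coast
    [2, 2, 2, 12, 10, 64, 2, 6],  # open country
    [4, 0, 6, 0, 4, 8, 76, 2],    # mountain
    [2, 0, 0, 0, 0, 2, 4, 92],    # forest
]


def accuracy_percent(confusion):
    """Overall accuracy: correctly classified (diagonal) over all entries."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total


print(accuracy_percent(GIST_CONFUSION))  # 78.5
```

The result matches the Gist accuracy listed in Table 1.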

Figure 6: 20 filters learned by the first layer of our CDBN model

Figure 7: 10 of the 40 filters learned by the second layer of our CDBN model

Figure 8: 10 of the 60 filters learned by the third layer of our CDBN model

Figure 9: Original images from 8 classes

Figure 10: Corresponding generated images from our CDBN model

6 Future Work

We would like to see how effectively CDBNs can be used to model complex scenes, including indoor scenes. In order to continue this investigation, we would need to port our code so that it can be run on a GPU [8]. Right now, inferring more than 50000 features would be infeasible for real time applications, so it would be important to also investigate compressing this representation. Finally, after getting a better intuition for the limitations of CDBNs for modeling complex indoor and outdoor scenes, we will be very interested in investigating architectural modifications to CDBN. One interesting extension would be to enforce scale invariance, since complex scenes will have similar parts at different scales.

References

[1] A. Torralba and P. Sinha. Statistical context priming for object detection. In Proceedings of the International Conference on Computer Vision, 2001.
[2] A. Torralba, K. Murphy, and W. T. Freeman. Using the forest to see the trees: object recognition in context. Communications of the ACM, Research Highlights, 2010.
[3] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145-175, 2001.
[4] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, Vol. 2, 2009.
[5] G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
[6] Honglak Lee, Chaitu Ekanadham, and Andrew Y. Ng. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems (NIPS), 2008.
[7] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the Twenty-Sixth International Conference on Machine Learning (ICML), 2009.
[8] Rajat Raina, Anand Madhavan, and Andrew Y. Ng. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.