Online Learning for Object Recognition with a Hierarchical Visual Cortex Model


Stephan Kirstein, Heiko Wersing, and Edgar Körner
Honda Research Institute Europe GmbH, Carl-Legien-Str. 30, 63073 Offenbach am Main, Germany
{stephan.kirstein,heiko.wersing,edgar.koerner}@honda-ri.de

Abstract. We present an architecture for the online learning of object representations based on a visual cortex hierarchy developed earlier. We use the output of a topographical feature hierarchy to provide a view-based representation of three-dimensional objects as a form of visual short-term memory. Objects are represented in an incremental vector quantization model that selects and stores representative feature maps of object views together with the object label. New views are added to the representation based on their similarity to already stored views. The realized recognition system is a major step towards an immediate, high-performance, shape-based online recognition capability for arbitrary complex-shaped objects.

1 Introduction

Although object recognition is a long-studied subject in computer vision, research has so far mainly focused on achieving optimal recognition performance on selected data sets of object images. Since the training of a recognition system is normally done offline, the speed of learning has been considered less relevant in most approaches, leading to typical training times of several minutes to hours. Another problem is that the most powerful classifier architectures, such as multi-layer perceptrons or support vector machines, do not allow online training with the same performance as offline batch training. Due to the lack of rapid learning methods for complex shapes, research in man-machine interaction for robotics dealing with online learning of objects has mainly used histogram-based feature representations that offer fast processing [5, 1], but only limited representational and discriminatory capacity.

An interesting approach to online learning for object recognition was proposed by Bekel et al. [2]. Their VPL classifier consists of feature extraction based on vector quantization and PCA, and supervised classification using a local linear map architecture. Image acquisition is triggered by pointing gestures on a table and is followed by a training phase taking some minutes.

We suggest using a strategy similar to the hierarchical processing in the ventral pathway of the human visual system to speed up the object learning process considerably. The main idea is to use a sufficiently general feature representation

that remains unchanged, while object-specific learning is accomplished only at the highest levels of the hierarchy. We perform online learning of objects using a short-term memory and a similarity-based incremental collection of templates, using the intermediate-level feature representation of the visual hierarchy proposed in [6]. After a short introduction to our processing and memory model in Sect. 2, we demonstrate its effectiveness in an implementation of real-time online object learning in Sect. 3, and give our conclusions in Sect. 4.

Fig. 1. The visual hierarchical network structure. Based on an image $I_i$, the first feature-matching stage S1 computes a linear sign-insensitive receptive field summation, a Winner-Take-Most mechanism between features at the same position, and a final threshold function. We use Gabor filter receptive fields to perform a local orientation estimation in this layer. The C1 layer subsamples the S1 features by pooling down to a quarter of the original resolution in both directions, using a Gaussian receptive field and a sigmoidal nonlinearity. The features in the intermediate layer S2 are sensitive to local combinations of the features in the planes of the C1 layer and are thus capable of detecting more complex feature combinations in the input image. We use sparse coding for the unsupervised training of these so-called combination feature neurons. A second pooling stage in the layer C2 again performs spatial integration and reduces the resolution by one half in both directions. Object representatives are learnt using an incremental vector quantization approach with attached class labels. Representatives $r_k$ are computed as the output $x_i(I_i)$ of the hierarchy and are added based on sufficient Euclidean distance in the C2 feature space to the previously stored $r_k$ of the same object.

2 Hierarchical Visual Processing Model

Model Architecture. The visual hierarchical model proposed in [6] is based on a feed-forward architecture with weight-sharing [3] and a succession of feature-sensitive and pooling stages (see Fig. 1). For a comparison to other recent feed-forward models of recognition see [6].
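To make the stage structure concrete, the following is a minimal numpy sketch of this S1-C1-S2-C2 forward pass. All concrete values (filter sizes, the Winner-Take-Most fraction, pooling widths, thresholds) and the helper names are illustrative assumptions; in the paper the S2 combination weights are learnt by unsupervised sparse coding, whereas here they are passed in as placeholders.

import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def gabor_kernel(theta, size=9, sigma=2.5, wavelength=5.0):
    # Even-symmetric Gabor patch at orientation theta, with an isotropic
    # Gaussian envelope (all parameters are assumed, not from the paper).
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

def s1_layer(image, n_orientations=4, gamma=0.9, threshold=0.1):
    # Sign-insensitive receptive field summation (|Gabor response|),
    # Winner-Take-Most between features at the same position, final threshold.
    thetas = np.linspace(0, np.pi, n_orientations, endpoint=False)
    maps = np.stack([np.abs(convolve(image, gabor_kernel(t))) for t in thetas])
    winner = maps.max(axis=0)
    maps = np.where(maps >= gamma * winner, maps, 0.0)   # Winner-Take-Most
    return np.maximum(maps - threshold, 0.0)             # threshold function

def pool(maps, factor, sigma=1.0, steepness=4.0):
    # Gaussian spatial pooling, subsampling by `factor` in both directions,
    # and a sigmoidal nonlinearity.
    blurred = gaussian_filter(maps, sigma=(0, sigma, sigma))
    return 1.0 / (1.0 + np.exp(-steepness * blurred[:, ::factor, ::factor]))

def s2_layer(c1_maps, combination_weights):
    # Each S2 plane detects a local combination of C1 features. The weights
    # (shape (n_s2, n_c1, k, k)) would come from sparse-coding training.
    out = [np.maximum(sum(convolve(c1_maps[j], w[j])
                          for j in range(c1_maps.shape[0])), 0.0)
           for w in combination_weights]
    return np.stack(out)

def c2_features(image, combination_weights):
    # S1 -> C1 (quarter resolution) -> S2 -> C2 (half resolution again),
    # flattened into the high-dimensional sparse C2 vector x_i(I_i).
    c1 = pool(s1_layer(image), factor=4)
    return pool(s2_layer(c1, combination_weights), factor=2).ravel()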

The output of the sparse feature representation of the complex feature layer (C2) is used to incrementally build up the appearance-based object representation with an incremental vector quantization model. The extracted C2 features are sensitive to coarse local edge combinations such as T-junctions and corners. Given a set of $N$ input images $I_i$, $i = 1, \ldots, N$, the feature map outputs of the C2 layer of the hierarchy are computed as $x_i(I_i)$. The labeled object information is stored in a set of $M$ representatives $r_k$, $k = 1, \ldots, M$, that are incrementally collected. We define $R_l$ as the set of representatives $r_k$ that belong to object $l$. The acquisition of templates is based on a similarity threshold $S_T$: new views of an object are only collected into the object representation if their similarity to the previously stored templates in $R_l$ is less than $S_T$. The parameter $S_T$ is critical, characterizing a compromise between the accuracy of the object representation and the computation time. We denote the similarity between view $x_i$ and representative $r_k$ by $A_{ik}$ and compute it from the squared Euclidean distance in feature space as $A_{ik} = \exp(-\|x_i - r_k\|^2 / \sigma)$. Here, $\sigma$ is chosen for convenience such that the average similarity in a generic recognition setup is approximately 0.5.

Online Training. For one learning step, the similarity $A_{ik}$ between the current training vector $x_i$, labeled as object $l$, and all representatives $r_k \in R_l$ of the same object $l$ is calculated, and the maximum value is computed as $A_i^{\max} = \max_{k \in R_l} A_{ik}$. The training vector $x_i$ and the corresponding class label are added to the object representation if $A_i^{\max} < S_T$: assuming that $M$ representatives were present before, we then choose $r_{M+1} = x_i$. Otherwise we assume that the vector $x_i$ is already sufficiently represented by one $r_k$, and do not add it to the representation.

Online Recognition. Recognition of a test view $I_j$ is done with a nearest-neighbour search of the hierarchy output $x_j(I_j)$ over the set of representatives. In contrast to training, the similarity $A_{jk}$ must be calculated between the current C2 feature vector $x_j$ and all representatives $r_k$. The class label of the winning representative $r_{k_j^{\max}}$ with $k_j^{\max} = \arg\max_k A_{jk}$ is then assigned to the current validation vector $x_j$. Due to the non-destructive incremental learning process, online learning and recognition can be carried out at the same time, without a separation into training and testing phases.

Rejection. For a real application of the online learning system, e.g. in a robot interaction scenario, it is crucial not only to reach good classification results but also to reject unknown objects and clutter. This can be done based on the similarity of a test view to the winning representative. Because the structural complexity of the appearance variation differs between objects, the rejection can be largely improved by making the detection threshold dependent on an estimate of the object complexity, which can be obtained from the average number of non-zero elements of the C2 feature vectors.
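Read together, the training, recognition, and rejection steps amount to a small template memory. Below is a minimal Python sketch of this incremental vector quantization memory under the definitions above; the class and method names are ours, and the fixed rejection threshold is a simplification of the complexity-dependent threshold described in the Rejection paragraph.

import numpy as np

class OnlineObjectMemory:
    # Incremental vector quantization memory over C2 feature vectors.
    def __init__(self, similarity_threshold=0.9, sigma=1.0):
        self.S_T = similarity_threshold   # template acquisition threshold S_T
        self.sigma = sigma                # scaled so average similarity is ~0.5
        self.reps = []                    # representatives r_k
        self.labels = []                  # attached object labels

    def similarity(self, x, r):
        # A_ik = exp(-||x_i - r_k||^2 / sigma)
        return np.exp(-np.sum((x - r) ** 2) / self.sigma)

    def train_step(self, x, label):
        # Store view x as r_{M+1} only if its best similarity to the
        # stored views of the same object stays below S_T.
        same = (r for r, l in zip(self.reps, self.labels) if l == label)
        a_max = max((self.similarity(x, r) for r in same), default=0.0)
        if a_max < self.S_T:
            self.reps.append(np.array(x, copy=True))
            self.labels.append(label)

    def classify(self, x, reject_threshold=None):
        # Nearest-neighbour search over all representatives; optionally
        # reject clutter / unknown objects with a low winning similarity.
        # (The paper makes this threshold depend on an object-complexity
        # estimate, the average number of non-zero C2 elements; a fixed
        # value is used here for simplicity.)
        if not self.reps:
            return None
        sims = [self.similarity(x, r) for r in self.reps]
        k = int(np.argmax(sims))
        if reject_threshold is not None and sims[k] < reject_threshold:
            return None
        return self.labels[k]

Because training only ever appends representatives, train_step and classify can be interleaved frame by frame, which is exactly what makes the simultaneous online learning and recognition described above possible.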

3 Experimental Results

For our experiments we use a setup in which we show objects, held in a hand wearing a black glove, in front of a black background. Images are taken with a camera, segmented using local entropy thresholding, normalized in size (each view is 64×64 pixels), and converted to grey scale. We show each object by rotating it freely by hand for a few tens of seconds, which results in 500 input images $I_i$ per object. Another set of 500 images per object is recorded for validation. Some rotation examples are shown in Fig. 2a. The difficulty of this training ensemble lies in the high variation of the objects during rotation around three axes and the sometimes only partially segmented object views (e.g. mug and cup). Figure 2b shows how the similarity threshold $S_T$ influences the number of selected representatives; this number strongly depends on the complexity of the object, i.e. its shape-dependent appearance variation.

Fig. 2. Test images and the number of selected representatives for different similarity thresholds. (a) Some example images of the eight freely rotated objects, taken in front of a dark background and using a black glove for holding, which also causes some minor occlusion effects. The difficulty of this database is the rotation of the objects around three axes; additionally, some object views are only partially segmented. (b) Number of selected representative vectors for similarity thresholds between 0.70 and 0.95. The selected number strongly depends on the shape-dependent appearance variation of the objects.
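For illustration, a preprocessing step of this kind could be sketched with scikit-image as follows. The paper only names local entropy thresholding, size normalization, and grey-value conversion; the entropy window, the Otsu threshold on the entropy map, and the bounding-box crop are our assumptions.

import numpy as np
from skimage.filters import threshold_otsu
from skimage.filters.rank import entropy
from skimage.morphology import disk
from skimage.transform import resize
from skimage.util import img_as_ubyte

def segment_and_normalize(gray_image, window_radius=5, out_size=(64, 64)):
    # Local entropy is high on the textured object and low on the dark,
    # homogeneous background; thresholding it yields an object mask.
    img = img_as_ubyte(gray_image)            # rank filters expect integer images
    ent = entropy(img, disk(window_radius))
    mask = ent > threshold_otsu(ent)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None                           # no object visible in this frame
    # Crop the masked object to its bounding box and normalize to 64x64.
    crop = np.where(mask, img, 0)[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return resize(crop, out_size, anti_aliasing=True)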

The first investigation of training time demonstrates how long it takes to incrementally train one object using a real camera. The training speed is limited by the frame rate of the camera (12.5 Hz) and the computation time needed for the entropy segmentation, the extraction of the corresponding sparse C2 feature vector $x_i$ with 3200 dimensions, and the calculation of the similarities $A_{ik}$ (see Sect. 2). For the curves shown for the teapot and the cup, we first trained the other seven objects and then incrementally trained the teapot or cup as the eighth object. Figure 3a shows how long it takes until the newly added object can be robustly separated from all other objects.

We also investigated how fast our model performs on a saved image ensemble, without the limitation of the camera's frame rate and the segmentation. For this exploration all eight objects are trained in parallel. We started with a training ensemble of 25 training views per object and determined the required training time and the classification rate. Afterwards we increased the number of training views per object in steps of 25 views until all 500 training views per object were reached. Figure 3a (curve labeled "database") shows that the training phase takes less than one minute for a training ensemble of 8 × 500 = 4000 object views.

We compared the classification results (see Fig. 3b) of our model with a plain nearest-neighbour classifier (NNC), in which every training vector $x_i$ is directly used as a representative, and with a one-layered sigmoidal network trained by gradient-based supervised learning on the C2 feature vectors $x_i$. The sigmoidal network consists only of an input and an output layer, without hidden layers. For every object we used one output node, where each node has a linear scalar-product activation and a sigmoidal transfer function. We trained these networks either with all training vectors or with the representatives selected by our model.

Figure 3b shows that the exhaustive NNC and our model using C2 activations perform about equally well, which means that our model can reduce the number of stored representatives (2906 representatives are selected from 4000 training views (72.65%) for 64×64 pixel images and $S_T = 0.92$) without losing classification performance. The sigmoidal networks perform slightly worse than our model, but their representational resources are strongly reduced: we used only one linear discriminating weight vector per object, i.e. 8 weight vectors in total. Training these networks with the representatives selected by our model speeds up the training process but also slightly reduces the classification rates. Furthermore, using the C2 features considerably increases the classification performance of our model compared to the original grey-value images (see Fig. 3b).

Fig. 3. Classification rate over training time, and a comparison between a nearest-neighbour classifier (NNC, all training vectors used), our online learning model, and sigmoidal networks for different image dimensions. (a) Classification rate as a function of the computation time needed for the whole training set (labeled "database") and for the training of one object (teapot, cup) with a real camera. The training speed using a camera is also limited by the frame rate of the camera and by the segmentation. Good recognition performance can be achieved within 20-30 seconds of online training. (b) Comparison of classification rates between a simple NNC (all training vectors used), our model (C2 activations, original images), and sigmoidal networks for image dimensions of 32 to 64 pixels. The classification rates of the NNC and of our online learning model using the C2 activations are more or less equal, and the one-layered sigmoidal networks perform slightly worse. The classification results of our model using the C2 features are distinctly better than those using the original grey-value images.

The rejection capability of our model was tested with all 8 trained objects and 16000 different grey-value clutter images. These images were randomly cut out of large scenes and contain parts of different objects, portions of buildings, faces, and so on. The false negative rate for the 8 trained objects was 7.9%, while 8.3% of all clutter images were classified as an object.
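As a point of reference, the one-layered sigmoidal baseline can be sketched in a few lines: one scalar-product weight vector per object followed by a sigmoidal transfer function, trained with plain gradient descent. The squared-error loss, learning rate, and epoch count are assumptions; the paper specifies only gradient-based supervised learning on the C2 vectors.

import numpy as np

def train_sigmoidal_net(X, y, n_objects, lr=0.1, epochs=100):
    # X: (N, D) C2 feature vectors; y: (N,) integer object labels.
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(n_objects, X.shape[1]))
    b = np.zeros(n_objects)
    T = np.eye(n_objects)[y]                          # one-hot targets
    for _ in range(epochs):
        out = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))    # one sigmoid node per object
        delta = (out - T) * out * (1.0 - out)         # squared-error gradient
        W -= lr * delta.T @ X / len(X)
        b -= lr * delta.mean(axis=0)
    return W, b

def classify_sigmoidal(W, b, x):
    return int(np.argmax(x @ W.T + b))                # winning output node

Training on the representatives selected by the online memory instead of all views simply means passing the stored representatives and their labels as X and y, which mirrors the speed/accuracy trade-off reported above.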

4 Conclusion

We have shown that the hierarchical feature representation is well suited for online learning with an incremental vector quantization approach. Of particular relevance is the technical realization of appearance-based online learning of complex shapes in the context of man-machine interaction and humanoid robotics. This capability introduces many new possibilities for interaction scenarios and can incrementally increase the visual knowledge of a robot.

The simple template-based representation of objects in our approach allows a simple incremental buildup of the representation during online learning. Nevertheless, due to the exhaustive storage of high-dimensional feature map information, a similar approach seems prohibitive given the, arguably large, but finite neural resources of the brain. We have already investigated that a later offline refinement of the representation, such as supervised gradient-based training, can reduce the representational effort considerably. Such a process could be used to guide the transfer from the simple photographic short-term memory presented here to a more optimized long-term memory representation. The investigation of appropriate models related to the representation of objects in the inferotemporal cortex [4] will be the subject of future study.

Acknowledgments. We thank C. Goerick, M. Dunn, J. Eggert, and A. Ceravola for providing the image acquisition and processing system infrastructure.

References

1. Arsenio, A.: Developmental learning on a humanoid robot. In: Proc. IJCNN 2004, Budapest (2004)
2. Bekel, H., Bax, I., Heidemann, G., Ritter, H.: Adaptive computer vision: Online learning for object recognition. In: Rasmussen, C. E., et al. (eds.): Proc. DAGM 2004, Springer (2004) 447-454
3. Fukushima, K.: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36 (4) (1980) 193-202
4. Tanaka, K.: Inferotemporal cortex and object vision: stimulus selectivity and columnar organization. Annual Review of Neuroscience 19 (1996) 109-139
5. Steels, L., Kaplan, F.: AIBO's first words: the social learning of language and meaning. Evolution of Communication 4 (1) (2001) 3-32
6. Wersing, H., Körner, E.: Learning optimized features for hierarchical models of invariant object recognition. Neural Computation 15 (7) (2003) 1559-1588