A Hierarchial Model for Visual Perception

Similar documents
Scene Gist: A Holistic Generative Model of Natural Image

ICA mixture models for image processing

Efficient Visual Coding: From Retina To V2

C. Poultney S. Cho pra (NYU Courant Institute) Y. LeCun

Saliency Detection: A Spectral Residual Approach

Structural Similarity Sparse Coding

Dimensionality Reduction using Relative Attributes

Human Vision Based Object Recognition Sye-Min Christina Chan

A Keypoint Descriptor Inspired by Retinal Computation

Learning based face hallucination techniques: A survey

Selecting Models from Videos for Appearance-Based Face Recognition

Recursive ICA. Abstract

A Computational Approach To Understanding The Response Properties Of Cells In The Visual System

ELL 788 Computational Perception & Cognition July November 2015

Modeling Visual Cortex V4 in Naturalistic Conditions with Invari. Representations

Contour-Based Large Scale Image Retrieval

Does the Brain do Inverse Graphics?

Locating ego-centers in depth for hippocampal place cells

Detecting Salient Contours Using Orientation Energy Distribution. Part I: Thresholding Based on. Response Distribution

Unsupervised Learning

Features that draw visual attention: an information theoretic perspective

Diffusion Wavelets for Natural Image Analysis

Sparse Models in Image Understanding And Computer Vision

The most cited papers in Computer Vision

Improving Image Segmentation Quality Via Graph Theory

Self-Organizing Sparse Codes

Sequential Maximum Entropy Coding as Efficient Indexing for Rapid Navigation through Large Image Repositories

Contextual priming for artificial visual perception

Robust Pose Estimation using the SwissRanger SR-3000 Camera

Salient Region Detection and Segmentation in Images using Dynamic Mode Decomposition

Department of Psychology, Uris Hall, Cornell University, Ithaca, New York

Non-linear dimension reduction

Technical Report. Title: Manifold learning and Random Projections for multi-view object recognition

Natural image statistics and efficient coding

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

Manifold Learning for Video-to-Video Face Recognition

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Gabors Help Supervised Manifold Learning

Advances in Neural Information Processing Systems, 1999, In press. Unsupervised Classication with Non-Gaussian Mixture Models using ICA Te-Won Lee, Mi

By Suren Manvelyan,

Relative Constraints as Features

Bilevel Sparse Coding

Unsupervised Deep Learning for Scene Recognition

Locality Preserving Projections (LPP) Abstract

Recognizing Handwritten Digits Using the LLE Algorithm with Back Propagation

Head Frontal-View Identification Using Extended LLE

Why equivariance is better than premature invariance

Learning global properties of scene images based on their correlational structures

Perceptual Attention Through Image Feature Generation Based on Visio-Motor Map Learning

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS

SELECTION OF THE OPTIMAL PARAMETER VALUE FOR THE LOCALLY LINEAR EMBEDDING ALGORITHM. Olga Kouropteva, Oleg Okun and Matti Pietikäinen

Locality Preserving Projections (LPP) Abstract

Efficient Acquisition of Human Existence Priors from Motion Trajectories

Image Feature Generation by Visio-Motor Map Learning towards Selective Attention

Emergence of Topography and Complex Cell Properties from Natural Images using Extensions of ICA

Semi-supervised Data Representation via Affinity Graph Learning

The "tree-dependent components" of natural scenes are edge filters

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro

3D Object Recognition: A Model of View-Tuned Neurons

2 Feature Extraction and Signal Reconstruction. of Air and Bone Conduction Voices by Independent Component Analysis

Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection

The Analysis of Parameters t and k of LPP on Several Famous Face Databases

Subjective Randomness and Natural Scene Statistics

Query-Sensitive Similarity Measure for Content-Based Image Retrieval

Tight Clusters and Smooth Manifolds with the Harmonic Topographic Map.

Beyond Bags of Features

An Efficient Approach for Color Pattern Matching Using Image Mining

Texture-Based Processing in Early Vision and a Proposed Role for Coarse-Scale Segmentation

Introduction to visual computation and the primate visual system

Recognize Complex Events from Static Images by Fusing Deep Channels Supplementary Materials

Online Learning for Object Recognition with a Hierarchical Visual Cortex Model

OBJECT RECOGNITION ALGORITHM FOR MOBILE DEVICES

Improved Spatial Pyramid Matching for Image Classification

SPARSE CODES FOR NATURAL IMAGES. Davide Scaramuzza Autonomous Systems Lab (EPFL)

A THREE LAYERED MODEL TO PERFORM CHARACTER RECOGNITION FOR NOISY IMAGES

Using temporal seeding to constrain the disparity search range in stereo matching

Clustering & Dimensionality Reduction. 273A Intro Machine Learning

arxiv: v1 [cs.lg] 20 Dec 2013

A Developing Sensory Mapping for Robots

From natural scene statistics to models of neural coding and representation (part 1)

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies

Does the Brain do Inverse Graphics?

Automatic Segmentation of Semantic Classes in Raster Map Images

Machine Learning : Clustering, Self-Organizing Maps

Robotics Programming Laboratory

CS 231A Computer Vision (Fall 2011) Problem Set 4

Novel Lossy Compression Algorithms with Stacked Autoencoders

Sketchable Histograms of Oriented Gradients for Object Detection

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

Learning a Manifold as an Atlas Supplementary Material

Texture Image Segmentation using FCM

Lecture Topic Projects

Learning Spatio-temporally Invariant Representations from Video

Unsupervised Outlier Detection and Semi-Supervised Learning

Multiresponse Sparse Regression with Application to Multidimensional Scaling

Saliency Detection Using Quaternion Sparse Reconstruction

ORGANIZATION AND REPRESENTATION OF OBJECTS IN MULTI-SOURCE REMOTE SENSING IMAGE CLASSIFICATION

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Correcting User Guided Image Segmentation

Spatial Hierarchy of Textons Distributions for Scene Classification

Transcription:

A Hierarchial Model for Visual Perception Bolei Zhou 1 and Liqing Zhang 2 1 MOE-Microsoft Laboratory for Intelligent Computing and Intelligent Systems, and Department of Biomedical Engineering, Shanghai Jiao Tong University, No.800, Dongchuan Road, Shanghai, China zhoubolei@gmail.com 2 MOE-Microsoft Laboratory for Intelligent Computing and Intelligent Systems, and Department of Computer Science and Engineering, Shanghai Jiao Tong University, No.800, Dongchuan Road, Shanghai, China zhang-lq@cs.sjtu.edu.cn Abstract. This paper proposes a computational model for visual perception: the visual pathway is considered as a functional process of dimensionality reduction for input data, that is, to search for the intrinsic dimensionality of natural scene images. This model is hierarchically constructed, and final leads to the formation of a low-dimensional space called perceptual manifold. Further analysis of the perceptual manifold reveals that scene images which share similar perceptual similarities stay nearby in the manifold space, and the dimensions of the space could describe the spatial layout of scenes, which are like the degree of naturalness, openness supervised trained in [1]. Moreover, the implementation of scene retrieval task validates the topographic property of the perceptual manifold space. 1 Introduction From photoreceptor of retina to the unified scene perception in higher level cortex, the transformations of input signal in visual cortex are very complicated. The neurophysiological studies [2] indicate that through the hierarchical processing of information flow in visual cortex, the extremely high-dimensional input signal is gradually represented by fewer active neurons, which is believed to achieve a sparse coding [3] or efficient coding [4]. Meanwhile, the efficient coding theory proposed by Horace Barlow in 1961 gradually becomes the theoretical principles for nervous system and explain away much neuronal behaviors. One of the key predictions of efficient coding theory is that the visual cortex relies on the environmental statistics to encode and represent the signal [5], that is to say, the neuronal representation is closely related to the intrinsic properties of the input signal. Moreover, studies on the natural image statistics show that the natural images are usually embedded in a relatively low dimensional manifold of pixel space [7], and there is a large amount of information redundancy within the natural images. According to the efficient coding theory, it is assumed that the structure of neural computations should fully adapt to extract the intrinsic low dimensionality of natural image

to form the unified scene perception, in spite of the extremely high-dimensional raw sensory input from the retina [2]. Under the efficient coding theory and natural image statistics, we propose a computational model for visual perception: the visual pathway is considered as a functional process of dimensionality reduction for input signal, that is, to remove the information redundancy and to search for the intrinsic dimensionality of natural scene images. By pooling together the activity of local low-level feature detectors across large regions of the visual field, we build the population feature representation which is the statistical summary of the input scene image. Then, thousands of population feature representations of scene images are extracted, and to be mapped unsupervised to a low-dimensional space called perceptual manifold. Further analysis of this manifold reveal that scene images which share similar perceptual similarity stay nearby in the manifold space, and the dimensions of the manifold could describe continuous changes within the spatial layout of scenes, which are similar to the degree of naturalness and openness supervised trained in [1]. In addition, the implementation of scene retrieval task validates the topographic property of the perceptual manifold space. In the following section, the Perceptual Manifold model is extended in detail. 2 The Hierarchical Model One of the fundamental properties of visual cortex concerns the ability to integrate the local components of a visual image into a unified perception [2]. Based on the neurobiological mechanism of hierarchical signal processing in visual system, we build a computational architecture called Perceptual Manifold. The architecture includes three hierarchical layers: 1) local feature encoding, 2) population feature encoding and 3) perceptual manifold embedding (refer to Fig.1a). 2.1 Local Sparse Feature Encoding Experimental studies have shown that the receptive fields of simple cells in the primary visual cortex produce a sparse representation of input signal [5]. The standard efficient coding method [9] assumes that the image patch is transformed by a set of linear filters w i to output response s i. In matrix form, s = Wx (1) Or equivalently in terms of a basis set, x=as=w 1 s. Then, the filter response s i are assumed to be statistically independent, p(s) = i p(s i ) exp( i s i ) (2) Specifically, 150000 20 20 gray image patches from natural scenes are used in training. A set of 384 20 20 basis function is obtained. These basis functions

R N R M Manifold Embedding 0.01 0.005 0 0 50 100 150 200 250 300 Population coding Local sparse coding Visual stimulus a) b) Fig. 1. a) Schematic diagram of the Perceptual Manifold model. There are three hierarchical layers of sensory computation before the final formation of the perceptual space. b) A subset of filter basis W trained from natural image patches, they resemble receptive fields of the simple cells in V1. resemble the receptive field properties of simple cells, i.e., they are spatially localized, oriented, and band-pass in different spatial frequency bands (Fig.1b). Let W = [w 1, w 2,..., w 384 ] be the filter function. A vectorized image patch x can be decomposed into those statistically independent bases, in which only a small portion of bases are activated at one time. They are used as the first layer of local feature extraction in the architecture. 2.2 Population Feature Encoding Higher processing stages are not influenced by single V1 neuron alone but by the activity of a larger population [10]. The neurophysiological study [2] implies that on the population level, a linear pooling mechanism might be used by the visual cortex to extract the global response of the stimulus. In view of this evidence, our computational model includes a second layer of population encoding to sum up feature s response over the local visual fields. Let X = [x 1, x 2,..., x n,...] denote the sample matrix, where x n is the vectorized image patch sampled from scene image. The population feature component for the i th feature is: p i = i n w i x n n w i x n (3)

Thus, p = [p 1, p 2,..., p 384 ] is the population response of scene image. 2.3 Perceptual Space Embedding Furthermore, neurophysiological studies have often found that the firing rate of each neuron in a population can be written as a smooth function of a small number of variables [10], which supports the idea that the population activity might be constrained to lie on a low-dimensional manifold. To search for the underlying meaningful dimensionality of percept, the Local Linear Embedding [11] is applied as the nonlinear dimensionality reduction method for thousands of the population feature responses of scene images: First step: compute the weight W ij that best linearly reconstructs p i from its neighbor p j, minimizing: ε(w ) = i p i j W ij p j 2 (4) Second step: compute the low-dimensional embedding vectors y i best reconstructed by W ij, minimizing: φ(y) = i y i j W ij y j 2 (5) The resulting embedding space R M is called perceptual space, where vector y = [y 1, y 2,..., y M ] R M. The topographic properties of this space would be analyzed in the following experiment section. 3 The Experiment The dataset of natural images implemented here is from [12], which contains 3890 images from 13 semantic categories of natural scenes, like coast, forest, etc. And image size is normalized as 128 128 pixels. All 3890 images are used in the embedding process. The dimensionality of manifold space M is tuned as 20, so that the space embedding is R 384 R 20. Fig.2 shows the embedded sample points described by first three coordinates of the perceptual space, in which points of images from four categories are visualized, and are colored according to the images scene category. We can clearly see the clustering and nonlinear geometric property of the data points in the space. Moreover, the image retrieval task as follows validates its property. 3.1 Image Retrieval Image retrieval task is to retrieve the perceptually similar sets of images when given a target image. This task is implemented in our embedded perceptual space and achieves impressive results.

Fig. 2. a)colored points of scene images from four categories are visualized by the first three coordinates of the perceptual space, the geometric structure among those points of scene images is clearly nonlinear. b)some examples of target images with the sets of retrieved images. Despite the simplicity of the similarity metric used in the experiment, these pairs share close perceptual similarity. To show the topographic property of the perceptual space, image similarity metric is approximated simply as the Euclidean distance between the target i and retrieved ones j in the perceptual space, that is, D 2 (i, j) = M (y ih y jh ) 2 (6) h=1 Fig. 2 shows examples of target images and retrieved sets of K least metric images, here K is 6. Despite the simplicity of the similarity metric used in the experiment, the set of retrieved images are very similar to the target image. 4 Discussion and Conclusion The efficiency of the neural code depends both on the transformation that maps the input to the neural responses and on the statistics of the input signal [5]. In this paper, under the theory of efficient coding, a hierarchical framework is developed to explore the intrinsic dimensions of visual perception. By pooling together the activity of local low-level feature detectors across large region of the visual field, we build the population feature representation which is the statistical summary of the input scene image. Then, thousands of population feature representations of scene images are extracted, and mapped unsupervised to the low dimensional perceptual space. Analysis of the perceptual space reveals the topographic property that scene images with similar perceptual similarity stay nearby in the embedded space. Moreover, the implementation of scene retrieval task validates the topographic property of the perceptual space.

In addition, it is noteworthy that in the image retrieval task, most of the retrieved images belong to the same scene category of the target image handlabored in [12]. This would imply the topographic property of the embedded perceptual space corresponds to the semantic concept of scene image, and this interesting property would be discussed in our further work. 5 Acknowledgements The work was supported by the National Basic Research Program of China (Grant No. 2005CB724301) and the Science and Technology Commission of Shanghai Municipality (Grant No. 08511501701 ). References 1. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision 42 (2001) 2. Weliky, M., Fiser, J., Hunt, R.H., Wagner, D.N.: Coding of natural scenes in primary visual cortex. Neuron 37(4) (February 2003) 703 718 3. Olshausen, B.A., Field, D.J.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381(6583) (June 1996) 607 609 4. Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current Opinion in Neurobiology 14(4) (August 2004) 481 487 5. Simoncelli, E.P., Olshausen, B.: Natural image statistics and neural representation. Annual Review of Neuroscience 24 (may 2001) 1193 1216 6. Chklovskii, D.B., Mel, B.W., Svoboda, K.: Cortical rewiring and information storage. Nature 431(7010) (October 2004) 782 788 7. Srivastava, A., Lee, A.B., Simoncelli, E.P., c. Zhu, S.: On advances in statistical modeling of natural images. Journal of Mathematical Imaging and Vision 18 (2003) 17 33 8. Seung, S.H., Lee, D.D.: Cognition: The manifold ways of perception. Science 290(5500) (December 2000) 2268 2269 9. Bell, A.J., Sejnowski, T.J.: The independent components of natural scenes are edge filters. Vision Res 37(23) (December 1997) 3327 3338 10. Averbeck, B.B., Latham, P.E., Pouget, A.: Neural correlations, population coding and computation. Nature Reviews Neuroscience 7(5) 358 366 11. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500) (December 2000) 2323 2326 12. Li, F.F., Perona, P.: A bayesian hierarchical model for learning natural scene categories. In: CVPR 05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 05) - Volume 2, Washington, DC, USA, IEEE Computer Society (2005) 524 531