Object-Based Saliency Maps Harry Marr Computing BSc 2009/2010

The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.

(Signature of student)

Summary

This project aims to implement a system that will produce object-based saliency maps of images, based on computational models of biological vision. Feature-based representations of objects will be formed using an existing model of object recognition in the visual cortex. An object will be selected as the target; the goal is to find the location of this target object in a visual scene. In a single-object scene, finding the features present in the object is trivial. However, in a multi-object scene, finding the features that correspond to one specific target object is a challenging task. The features present in the target object, which are found using the feature-based representation of the object, will be used to identify regions in a novel visual scene that are good candidates for the location of this object. The candidate regions will be selected based on the features present within each region. A map will then be produced that indicates these salient regions in the visual input. This report includes an overview of the relevant background information that is necessary to understand the various aspects of the project. A discussion of the design and implementation will follow, demonstrating how the system was constructed. The results will be evaluated in the final section.

Acknowledgements

Firstly, I would like to express a great deal of gratitude to Marc de Kamps for his excellent guidance and supervision throughout the project. I would also like to thank Roy Ruddle for his invaluable feedback on my mid-term report. Finally, I would like to thank Dave Harrison and Sam Johnson for their constructive feedback on the first draft of my report.

Contents

1 Introduction
   1.1 Overview
   1.2 Aim
   1.3 Objectives
   1.4 Minimum Requirements
2 Background Research
   2.1 The Visual System
       2.1.1 Architecture
       2.1.2 The Two-Streams Hypothesis
       2.1.3 From Features to Objects
   2.2 Visual Attention and Saliency
       2.2.1 Attention
       2.2.2 Bottom-Up Saliency
       2.2.3 Object-Based Saliency
   2.3 Existing Approaches
       2.3.1 Models of Object Recognition in the Cortex
       2.3.2 From Knowing What to Knowing Where
   2.4 HMAX
       2.4.1 Structure
       2.4.2 Invariance Properties
   2.5 Learning Objects
       2.5.1 Self-Organising Maps
   2.6 Technologies
       2.6.1 MIIND
       2.6.2 LayerMappingLib
       2.6.3 Self-Organising Map Software
3 Project Plan
   3.1 Milestones and Deliverables
   3.2 Schedule
   3.3 Methodology
4 Design and Implementation
   4.1 System Overview
   4.2 Obtaining Feature Vectors
   4.3 Learning and Recognising Objects
       4.3.1 Training the Self-Organising Map
       4.3.2 Finding Clusters
   The Feedback Connections
       Feeding Back to S
   Creating the Saliency Map
   Visualising the S2 Feature Types
   Finding Suitable Object Templates
5 Results and Evaluation
   5.1 Evaluation of the Model
       Feature Vector Analysis
       The Circle Problem
       The Feature Dictionary
   5.2 Evaluation of Object Recognition
       Training the Self Organising Map
       Clustering the Self Organising Map
   5.3 Analysis of the Saliency Maps
Future Work
Bibliography
A Personal Reflection

Chapter 1

Introduction

1.1 Overview

Biological vision is a complex process that, despite its importance, is still not fully understood. Animals are able to process vast quantities of visual information extremely rapidly: from a large array of visual sensory data, objects may be recognised within milliseconds. This object recognition process is seemingly robust enough to cope with situations where objects are partially occluded, or where an object is only seen for a brief instant. The field of Computer Vision has made remarkable progress in recent decades, but is still not able to replicate the visual abilities of humans. Studying the biological approach to vision and the computational modelling of biological visual processes has two principal goals: firstly, to further understand how animals see on a neurological level, and secondly, to advance computer vision systems by taking inspiration from biological approaches.

To be able to make sense of the enormous amount of visual input that is received by the retina, animals select objects that they will consider, and direct their attention to these objects relatively early on in the visual process. This enables animals to ignore the majority of visual input, which is unrelated to the object they have chosen to see. Eliminating this large amount of irrelevant information is believed to be key to animals' ability to process visual scenes so rapidly. It is this system of object-based visual attention that this project focusses on.

The visual cortex (the part of the brain that processes visual sensory information) is widely believed to be a hierarchical neural network, consisting of numerous layers that are connected to one another. Visual sensory information is received by the first layer of the network, and a feedforward process

takes place: the information is fed through the network, following the hierarchy up to the higher levels. During this process, spatial information about the scene being observed is lost; by the higher levels of the network, a more abstract representation of the visual scene is present. This representation contains information about the features that are present in the scene: colours, shapes and intensities.

It has been proposed that the attentional mechanism that allows animals to locate regions in the visual input that are considered interesting comes from two processes. The first occurs at the lower levels of the hierarchy, picking out locations in the visual stimuli that contain salient features such as bright colours or sharp changes in brightness. The second process starts at the top of the hierarchy. It uses stored representations of objects that are considered interesting to determine the salient parts of the visual input. However, as no spatial information is present at the higher levels of the hierarchy, information about the features present in the objects that will be the focus of attention feeds backwards through the hierarchy to the lower layers. The spatial regions of the visual information that contain these features are what is considered interesting.

This project will focus on producing maps that show which parts of an image should be considered salient, given a target object to search for. This will be achieved by augmenting a computational model of biological object recognition with an object-based attentional mechanism that is inspired by current knowledge about how the brain performs this task. The model consists of the hierarchy of layers described above, and only includes a feedforward process, which forms feature-based representations of visual scenes. The attentional mechanism will be implemented by adding a feedback step that propagates information about an object's features down to the lower levels of the network.

1.2 Aim

The aim of this project is to implement a system that will produce object-based visual saliency maps of images. The system will first learn representations of objects in a set of training images. The system should then detect the presence of objects that it has learned in an input image. A saliency map will be produced that indicates the areas of the input image that are most likely to contain a target object.

1.3 Objectives

The objectives of the project are to:

- Conduct a review of the background literature on biological vision and research how object detection and recognition are performed in the human brain.
- Investigate existing computational models of biological vision and understand the structure and motivation behind the models.

- Use a model of object detection in the brain to produce a system that can detect if objects that have been learned are present in novel images.
- Research visual attention and saliency, and extend the model to enable it to produce object-based saliency maps that indicate likely candidate locations for target objects within a scene.

1.4 Minimum Requirements

The minimum requirements are:

- Obtain representations of objects present in an image by using an existing computational model of the visual cortex.
- Enable the system to learn the representations of objects in training images. The system should be able to identify clusters of images that contain objects of the same class, and use these clusters to determine the identity of new images presented that contain one of the objects that has been learned.
- Implement a feedback mechanism in the model that would allow likely positions of a target object in an image to be deduced.
- Produce a saliency map of an image, which indicates likely locations for target objects within the image.

The possible extensions are:

- Implement a user interface that will allow users to investigate the clusters of image representations, providing the user with information about where certain training examples fall within a cluster. This may be used to find a suitable template object representation to be used in the feedback mechanism.
- Produce a tool to visualise the features present in object representations.

Chapter 2

Background Research

2.1 The Visual System

The implementation of the system described in this project uses a computational model of parts of the visual system. In order to understand the motivation behind this model and other similar models, it is important to understand the basic structure and operation of the visual system. A complete description of the biology and functionality of the visual system is outside of the scope of this project, but the relevant parts will be described in enough detail to demonstrate the basis and motivation of the models that are used.

2.1.1 Architecture

Vision starts at the retina, a thin layer of neural tissue that lines the rear surface of the eye. Visual information is passed from the retina, through the lateral geniculate nucleus (LGN), which is located in the thalamus, to the visual cortex. The visual cortex is widely regarded to be a hierarchical network (which is assumed to be the case in this project) that consists of a number of functionally distinct areas, and is mostly contained within the occipital lobe of the cortex, which is found at the rear of the brain. Visual stimuli from the LGN arrive at the primary visual cortex (also known as V1), the area of the visual cortex that deals with the first stage of cortical visual processing [28].

Area V1 contains two main types of cells, simple and complex cells. Simple cells respond to edges oriented at certain angles within their receptive fields. The receptive fields of simple cells have both inhibitory and excitatory regions; the presence of stimulus in some areas will produce a positive

activation, but will result in a negative activation in other areas. Simple cells exhibit a summation property, in that a stimulus covering more of the excitatory region causes a greater response than one covering a smaller area [14]. Unlike simple cells, complex cells in the primary visual cortex do not have separate excitatory and inhibitory regions. A stimulus within the receptive field that is translated (but still within the receptive field), or scaled to a different size, will not cause a complex cell to produce a stronger or weaker activation [14]. So complex cells offer a certain degree of invariance to translation and scale.

2.1.2 The Two-Streams Hypothesis

It has been hypothesised, originally by Mishkin et al [19], that there are two functionally independent cortical pathways in the visual cortex: the ventral stream and the dorsal stream. The ventral stream is believed to play a major role in perceptual identification of objects; it is thought to start at V1 and terminate in the temporal lobe (passing through areas V2, V4, AIT and PIT). The dorsal stream provides information required for visually guided spatial actions, such as catching a ball or picking up an object. It is believed that the dorsal stream also starts at V1, but terminates in the parietal lobe [10]. The ventral stream is the primary concern of this project, as it is the pathway that contains the areas that play a role in object recognition and visual attention.

2.1.3 From Features to Objects

Objects play a major role in this project: both the recognition of objects and a model of object-based attention will be part of the system that is to be implemented. For this reason, a discussion of what is meant by 'object' and how objects may be represented in the brain is pertinent.

The visual cortex contains feedforward links between the different areas, which are organised in a hierarchical structure. The early layers (V1, V2) show a high degree of retinotopy, that is, the topography of the cells is similar to the spatial layout of the receptors in the retina. This means that at these lower layers, information about the locations of features within a scene is present. Going up the hierarchy, the complexity of the features that are represented increases, and the degree of retinotopy decreases. By the later stages of the hierarchy, when objects are represented more abstractly, little, if any, information concerning location remains [11]. At the higher levels, objects are represented as a combination of features rather than as visual stimuli that contain spatial information. As with 'object', it is important to understand exactly what is meant by 'feature'. Treisman and Gelade [24] propose that there are populations of specialised cells present in the early stages of visual processing that respond selectively to different feature dimensions, such as colour, orientation and direction of motion. Objects are represented and stored as a composition of features across several feature dimensions. Crucially, the representations of an object across different

feature dimensions are stored independently. This property provides the ability to recognise objects generally, and greatly reduces the number of representations that need to be stored for each object that is learned. Take the example of recognising cars: rather than storing a separate complete representation (including shape, colour and luminosity information) for each colour of car, a representation of a car's form may be stored independently of the information relating to the car's colour. This has two important consequences: firstly, many fewer car representations need be stored, as all cars of the same type will share one common representation of form; and secondly, it would be possible to recognise a car in a new colour, as it may be recognised based on its shape and structure alone. A parallel may be drawn between this idea and the concept of database normalisation: each feature is stored once; each composite structure is a list of features of which the whole is composed.

As a consequence of this representation of objects, there is a binding problem. In a multi-object scene, it becomes non-trivial to identify which features in the different feature dimensions belong to the same object. The simplest form of this problem is finding the position of a shape in the visual field.

2.2 Visual Attention and Saliency

An overview of the structure of the visual system is helpful in understanding the motivation and design choices of the models described later. However, the ultimate aim of this project is to investigate and implement a basic model for visual attention. A review of current theories of visual attention will be presented in this section.

2.2.1 Attention

Visual attention is the mechanism that selects regions of interest within a scene. It is argued that attention is vital for the correct perception of objects [24]. Feature Integration Theory makes a distinction between two stages of visual search: feature search and conjunction search. Feature search is a search of the visual field for primitive features; it is performed very rapidly, as each part of the visual field may be analysed independently, allowing this to be performed in parallel. Conjunction search is a serial process: information from remote parts of the brain must be integrated, which may be done in only one location at once [25], and hence it takes much longer. According to Feature Integration Theory, conjunction search is fundamental to the correct perception of objects, due to the nature of the representation of objects in the brain. Feature Integration Theory proposes that without visual attention, correct conjunctions of features across different feature dimensions cannot be formed, without which objects may not be correctly perceived.

The utility of visual attention in both regular and biologically-inspired object-recognition systems is noted by Tsotsos et al. The presence of an attentional mechanism reduces the complexity of the

analysis of a visual scene from being an NP-complete problem to one solvable in linear time [26]. The time complexity of the problem without attention is due to the combinatorial nature of selecting which parts of the image are to be processed: there are an exponential number of combinations of parts of the image. An attentional mechanism can determine which parts of the image are to be looked at first, dramatically reducing the size of the search problem. Using attention, the response time grows linearly with the size of the visual field; this has also been verified psychophysically [26].

It is mentioned above that attention selects regions of interest within a visual scene. What qualifies as a salient region, or a region of interest, is an important and pertinent question. Attention is believed to consist of a combination of two components: bottom-up, stimulus-based saliency and top-down, task-dependent saliency [15]. Both approaches are considered in the remainder of this section.

2.2.2 Bottom-Up Saliency

A great deal of the work on visual saliency has addressed bottom-up saliency, an approach that uses information about local features present in the visual stimulus to decide which parts of the input are salient. Saliency occurs in human vision: for example, brightly coloured objects in otherwise unsaturated scenes, such as a red jacket among black dinner jackets, involuntarily attract attention. It is claimed by Itti and Koch [15] that this type of processing is predominantly driven by bottom-up saliency.

Much of the work on bottom-up saliency is based on Feature Integration Theory: looking at each of the feature dimensions and computing a measure of saliency for the visual input based on the salient regions across the different dimensions. Itti and Koch describe a model of bottom-up saliency-based attention that stems from this idea [16]. In their model, three feature dimensions (colour, intensity and orientation) are extracted from an image at multiple scales. A feature map is then computed for each dimension: these maps indicate preliminary local saliency for a given feature dimension. This is done by using centre-surround cells, which are excitatory in the centre of their receptive field, though when stimuli are presented in the larger surrounding region the response is inhibited. These cells are effective at detecting which locations stand out from their surroundings, as they are most strongly activated when the region in the centre differs from the surrounding region. Suppose there is a layer of centre-surround cells that are sensitive to a certain luminosity. If the same luminosity is present in the entire receptive field of the cell, the output will be low due to inhibition. However, if the luminosity varies across the receptive field, a local spatial discontinuity will be detected by the cell, resulting in a strong response. The feature maps from the separate dimensions are later integrated into a single saliency map by taking the mean of the feature maps. This indicates which parts of the image are most interesting according to the chosen metrics. Areas which are considered salient across multiple dimensions are more likely to be considered salient in the final output of the model.
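To make the final combination step concrete, the following C++ sketch (an illustrative simplification, not the Itti and Koch implementation) integrates per-dimension feature maps into a single saliency map by normalising each map to the range [0, 1] and then taking the mean at every location; the map contents and sizes are assumed to come from earlier centre-surround processing.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    using Map = std::vector<std::vector<double>>;

    // Rescale a feature map to [0, 1] so that no single dimension dominates.
    void normalise(Map& m) {
        double lo = 1e300, hi = -1e300;
        for (const auto& row : m)
            for (double v : row) { lo = std::min(lo, v); hi = std::max(hi, v); }
        if (hi - lo < 1e-12) return;                  // flat map: leave unchanged
        for (auto& row : m)
            for (double& v : row) v = (v - lo) / (hi - lo);
    }

    // Combine normalised feature maps (e.g. colour, intensity, orientation)
    // into one saliency map by taking the mean at each location.
    Map combine(std::vector<Map> maps) {
        for (auto& m : maps) normalise(m);
        Map saliency = maps.at(0);
        for (std::size_t y = 0; y < saliency.size(); ++y)
            for (std::size_t x = 0; x < saliency[y].size(); ++x) {
                double sum = 0.0;
                for (const auto& m : maps) sum += m[y][x];
                saliency[y][x] = sum / maps.size();
            }
        return saliency;
    }

Locations that score highly in several dimensions end up with a high mean, which is exactly the behaviour described above.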

2.2.3 Object-Based Saliency

Object-based saliency addresses visual attention in a different manner to the bottom-up approach. Rather than searching for features that appear to be locally salient in a visual scene, prior knowledge of the features present in objects that have been previously learned is used to guide attention to the parts of the scene that are most likely to contain an object that is of interest. This process of selecting a target object to direct attention to is necessary in order to be able to perform an action on the target. Performing an action on a target object uses interactions between the ventral stream and the dorsal stream. The dorsal stream is responsible for initiating spatial actions, but it is thought to be the ventral stream that locates the target object in a visual scene and directs the dorsal stream about where to perform the action [27].

Neurological evidence for object-based attention has been demonstrated by Chelazzi et al [3]. Monkeys were presented with a complex image to hold in memory, which activated certain cells in the inferior temporal cortex (IT) that were tuned to the features present in the image. The monkeys were then shown 2-5 choices of images simultaneously, and required to make a saccade to the target image. Shortly before the eye movement, responses of cells that did not correspond to the target were suppressed; the neuronal response was dominated by the target. These results suggest that the selection of target objects (which is thought to be performed in the prefrontal cortex) is reflected in the inferior temporal cortex.

Stored representations of learned objects contain no structural information; however, the representations do include information about the features that are present in the objects. The description of the visual system so far includes no method for retrieving information about the location of objects from the retinotopic areas. This would require interactions between the later areas in the temporal cortex and the lower retinotopic areas. This could be achieved by means of feedback connections between layers in the ventral stream hierarchy. It is noted in [12] that this top-down approach relies on connections in the reverse direction in order to decompose an object into its constituent features. Nodes in the first layer of the feedback process (which is the last layer of the feedforward process, most likely one of the areas in the temporal cortex) would only be activated if they were active in the representation of an object that is being attended to. Following the hierarchy backwards down to the retinotopic layers (V1, V2 and V4), only the features that are activated in nodes in previous layers (features that are present in the target image) would be activated; irrelevant features would be inhibited. This would result in object-based feature selectivity at the retinotopic layers, from which the location of objects in a scene could be deduced.

It is worth mentioning that another form of top-down attention has been studied. Ghandi et al demonstrated [9] that the ability to perform a visual discrimination task is improved when a cue about the target's location is given in advance. This mechanism is known as spatial attention. However, this project will focus on object-based attention, and not include a model of spatial attention in the system.
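As a concrete illustration of this selectivity, the sketch below gates a retinotopic map of feedforward activations with a set of target feature types fed back from the top of the hierarchy; activations of features that do not belong to the target are inhibited, so only candidate target locations remain active. This is an illustrative simplification and not the mechanism as implemented in [12] or [27].

    #include <cstddef>
    #include <set>
    #include <vector>

    // One retinotopic location: an activation per feature type from the
    // feedforward pass.
    struct Location {
        std::vector<double> activation;   // indexed by feature type
    };

    // Keep only activations of feature types that belong to the target object
    // (the set fed back from the top of the hierarchy); inhibit the rest.
    void apply_feedback(std::vector<Location>& retinotopic_layer,
                        const std::set<std::size_t>& target_features) {
        for (auto& loc : retinotopic_layer)
            for (std::size_t f = 0; f < loc.activation.size(); ++f)
                if (target_features.count(f) == 0)
                    loc.activation[f] = 0.0;   // not part of the target: inhibit
    }

Locations whose remaining activations are strong are the candidates for the target object's position.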

2.3 Existing Approaches

The goals of this project include the design of a system that is able to recognise simple objects, using a model of object recognition in the visual cortex. The system must also have the ability to produce object-based saliency maps within a visual scene. Here, two of the most prominent existing models will be discussed and considered for inclusion in the system.

2.3.1 Models of Object Recognition in the Cortex

One of the first neural-network models of object recognition in the visual cortex was Fukushima's Neocognitron [6]. It consists of a series of alternating S and C layers, which are based around Hubel and Wiesel's simple and complex cells [14] respectively. This cascade of alternating layers is said to give the network an invariance to shifts in position: if the object being detected moves within the visual field, the output will be largely unaffected. Neocognitron has primarily been used for optical character recognition (OCR) and handwriting recognition. The features that are selected by the model are pixel-based simple line segments [7]. These feature types fit the problem of character recognition well, but are not well suited to object detection in camera images, as they are designed to target lines present within text rather than more general features that occur in real images.

Poggio and Riesenhuber proposed an alternative model, HMAX [21], that was heavily inspired by the ideas of Neocognitron. HMAX consists of only four main layers, two S layers and two C layers. The model has been shown to produce good results in object recognition [23, 22]. Two key differences exist between HMAX and Neocognitron. Firstly, HMAX proposes the use of a max operation in the C layers, which pools over nodes in a receptive field and takes the maximum activation, rather than using a linear combination of the activations. The use of this operation is claimed to improve the model's invariance to scale transformations and reduce the impact of background clutter. Secondly, rather than using pixel-based line segments as features, HMAX uses filters such as the first derivative of Gaussians to extract orientations from the input [21]. These filters may be oriented at a given angle on a 2-dimensional plane: using these oriented filters to pre-process pixel data allows the model to use information about orientations in the visual input rather than raw pixel data. This resembles biological visual systems more accurately than the approach taken by Neocognitron. For these reasons, HMAX has been selected as an appropriate biologically-inspired model for object recognition to be used in this project.

2.3.2 From Knowing What to Knowing Where

HMAX provides a suitable framework for recognising objects, but does not include an attentional mechanism. As mentioned above, a lack of an attentional mechanism poses problems for object recognition, and is not in line with current knowledge about biological vision. Implementing a

model of object-based attention is a key aim of this project; an existing approach to this problem will be discussed in the remainder of this section.

It is proposed by van der Velde and de Kamps [27] that the problem of object-based attention may be approached by means of a feedback network. Their model starts with a feedforward network that simulates the process of feature identification. The feedforward step is related to HMAX in that they are both hierarchical architectures that model the ventral stream. However, there is a fundamental difference in the approach to achieving translation invariance. The C layers of HMAX pool over all locations in the visual field, which means that there is no change in the output if an object being detected is shifted to a different location. The feedforward step of the model in [27] approaches the same goal by training the network using back-propagation to recognise an object at every possible location. As the approach used in HMAX is simpler and does not require the network to be trained with an object in every possible location, the feedforward step of the model described by van der Velde and de Kamps will not be used.

The second stage of the network described in [27] is the feedback stage. The feedback network has the same structure as the feedforward network, but the connections are reversed. Information about the identity of a detected object is present in the higher layers of the network (AIT) but contains no retinotopic information. The lower layers (V1, V2, V4) still retain a degree of retinotopy. The information about the object identity is passed through the feedback network to the retinotopic layers. As target-related cells are active in the higher layers, the feedback network should activate target-related cells at all locations in the retinotopic layers. If the activated target-related cells were also activated in the feedforward step, a match is found. These cells remain active, as they represent a possibility that part of the target object is present in that part of the visual field. Cells that were activated in the feedforward step that are not activated in the feedback step will be inhibited, as they are unlikely to represent part of the target object. This forms a basic model for object-based attention: attention may be directed to the locations represented by the cells that remain active after the feedback step. These locations are likely to contain part of a target object.

Though the discussed model will not be used directly, a similar model for object-based attention will be used. The system will attempt to combine the robust object-recognition capabilities of HMAX with the feedback mechanism described in [27]. This framework has the goal of retrieving the locations of known objects in images.

2.4 HMAX

HMAX is a computational model of object recognition in the visual cortex. The model describes a hierarchical feedforward network. It is an extension of the idea of complex cells being composed of simple cells, proposed by Hubel and Wiesel.

There are two classes of layers in the network: S layers, which are based on simple cells, and C layers (this follows the notation used by Fukushima [6]), which are modelled loosely on complex cells [21].

2.4.1 Structure

HMAX consists of a hierarchy of four main layers: S1, C1, S2 and C2. The first layer, S1, receives as input intensity values from a greyscale image. It consists of a multidimensional array of units, each of which takes the form of a first derivative of Gaussian function [21] or a Gabor function [22]. These functions are capable of detecting edges in the visual input that are oriented at a given angle. Visualisations of the filters may be seen in figure 2.1. The angle the filter is oriented at determines the angle of the edge that it is sensitive to. Functionally, the filters mimic the behaviour of simple cell receptive fields [22]. Each part of the input image will be covered by a number of S1 units. For each location, there are multiple units covering different scales (i.e. the receptive field size is different for the different units). These units are organised into scale bands, with two neighbouring scales in each band. As the receptive field functions are oriented at a certain angle, there are four copies of each unit at each location, each with receptive field functions oriented at different angles (0, 90, 180 and 270 degrees) [22].

Figure 2.1: Examples of the filter types used in the S1 layer of HMAX to extract oriented edges from an input image: (a) oriented first derivative of Gaussian filters; (b) oriented Gabor filters.

The next layer in HMAX, C1, corresponds to complex cells in the primary visual cortex. Complex cells show some invariance to scale and translation. This functionality is mirrored in the C1 units of HMAX. They pool over multiple S1 units at a given location; the number of S1 locations pooled over is dictated by the receptive field size of the C1 units. The pooling operation used by C1 is

max: the maximum activation of units in the receptive field is taken. This property helps to achieve the partial invariance to scale and translation. Taking only the maximum value of the predecessor units means that having more units active, or a different set of active units, will not necessarily alter the final activation. In addition to this spatial pooling, C1 units pool over different scales within each scale band. So C1 units contain 4 different scales corresponding to the scale bands in S1. Each C1 scale takes the maximum value of the scales in the corresponding S1 scale band. Again, as the S1 receptive field functions are oriented at four different angles, this entire operation occurs once for each orientation. This results in S × O C1 units at each location, where S is the number of scale bands and O is the number of orientations.

The following layer in the hierarchy is S2. Units in this layer are described as composite feature cells [21]. They pool over four neighbouring C1 cells (in a 2 × 2 grid) across all four orientations. This results in 4^4 = 256 different types of S2 cell, each a different combination of four orientations. This layer allows HMAX to represent more complex features and patterns than just the simple oriented lines that were present in the previous layers. The S2 layer does not, however, pool over multiple scale bands, so for each unit location in S2 there are 256 × S units.

After S2 comes the C2 layer. In this layer all retinotopy is lost. C2 units pool with a max operation over all S2 units in all locations and all scales for each S2 unit type (2 × 2 grid of orientations). This results in a total of 256 units in the C2 layer, as there is one for each S2 type. Each C2 unit represents the maximum activation for that type in the entire S2 layer across all scales.

Riesenhuber and Poggio [21] describe HMAX as having a layer of view-tuned units (VTUs) above the C2 layer. These units are trained to be sensitive to a specific view of an object. This is achieved by presenting a 3D object from many different views during the training process. This project, however, will not include these VTUs in the implementation of the model, as the primary purpose is to investigate and implement object-based attention. Including VTUs in the model would unnecessarily overcomplicate the implementation task; for the scope of this project it is sufficient to implement an attentional mechanism for objects viewed from a single angle.

A diagram showing the structure of HMAX has been included in figure 2.2. This diagram does not include all the information about the network; details have been omitted to keep the diagram comprehensible.

2.4.2 Invariance Properties

HMAX provides a significant degree of position and scale invariance: when a stimulus is moved within the visual scene, or enlarged or shrunk, the output of the model at the C2 layer should not be affected.
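The two pooling steps described above can be sketched in a few lines of C++. This is an illustrative simplification with assumed data layouts, not the LayerMappingLib implementation: a local max over a square receptive field (as in C1), and a global max over all positions and scales for one feature type (as in C2).

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // A retinotopic map of activations for one feature type at one scale.
    using FeatureMap = std::vector<std::vector<double>>;

    // C1-style pooling: max over a square receptive field of the previous layer.
    double pool_receptive_field(const FeatureMap& s_layer,
                                std::size_t y0, std::size_t x0, std::size_t size) {
        double best = 0.0;
        for (std::size_t y = y0; y < std::min(y0 + size, s_layer.size()); ++y)
            for (std::size_t x = x0; x < std::min(x0 + size, s_layer[y].size()); ++x)
                best = std::max(best, s_layer[y][x]);
        return best;
    }

    // C2-style pooling: max over all locations and all scales for one S2 type.
    // The result is a single value per feature type, so all retinotopy is lost.
    double pool_global(const std::vector<FeatureMap>& s2_maps_for_one_type) {
        double best = 0.0;
        for (const auto& scale : s2_maps_for_one_type)
            for (const auto& row : scale)
                for (double v : row) best = std::max(best, v);
        return best;
    }

Applying pool_global to each of the 256 S2 types yields the 256-dimensional C2 feature vector used throughout the rest of the system.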

Figure 2.2: A diagram showing the structure of HMAX. From bottom to top: the input image; S1, which detects orientations by applying oriented filters at multiple scales across the image; C1, which takes the max over scales within each scale band; S2, which pools over a 2 × 2 grid of C1 units to produce more complex features, for each scale band; and C2, which pools over S2 units in all scale bands and all locations for each feature type. Not all nodes present in the model are shown in this diagram, but enough have been included to give an impression of how the layers are connected.
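For reference, an oriented filter of the kind shown in figure 2.1 can be generated directly from the standard Gabor formulation. The sketch below builds such a kernel; the parameter values are illustrative assumptions, not the ones used in HMAX or LayerMappingLib.

    #include <cmath>
    #include <vector>

    // Build a square Gabor kernel oriented at `theta` radians. `sigma` controls
    // the Gaussian envelope, `lambda` the wavelength of the sinusoidal carrier
    // and `gamma` the aspect ratio; the defaults are arbitrary illustrative values.
    std::vector<std::vector<double>> gabor_kernel(int size, double theta,
                                                  double sigma = 2.0,
                                                  double lambda = 4.0,
                                                  double gamma = 0.5) {
        const double pi = std::acos(-1.0);
        std::vector<std::vector<double>> k(size, std::vector<double>(size));
        const int c = size / 2;
        for (int y = 0; y < size; ++y)
            for (int x = 0; x < size; ++x) {
                // Rotate the coordinates so the filter responds to edges at theta.
                const double xr = (x - c) * std::cos(theta) + (y - c) * std::sin(theta);
                const double yr = -(x - c) * std::sin(theta) + (y - c) * std::cos(theta);
                k[y][x] = std::exp(-(xr * xr + gamma * gamma * yr * yr) / (2 * sigma * sigma))
                          * std::cos(2 * pi * xr / lambda);
            }
        return k;
    }

Convolving the greyscale input with one such kernel per orientation gives the S1 responses from which the rest of the hierarchy is built.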

The key to HMAX's translation invariance is the pooling C layers [21]. Suppose there is a simple feature at the bottom left of an otherwise empty scene. Information about the location of this feature will be retained in the model up to the S2 layer, as the receptive fields of the first few layers cover a limited spatial area of the previous layer. However, as the C2 layer pools over all S2 units in all locations, the feature will be represented in the C2 layer as a strong activation in one or more of the 256 type-specific nodes. If the feature were to be moved to the top right of the scene, the model would be notably different in the layers up to S2, but the representation would likely be very similar in the C2 layer. This is due to the fact that units in the C2 layer represent the maximum activation for a specific unit type within the S2 layer, regardless of the position.

The pooling operation is also vital to the model's invariance to scale. Suppose a straight diagonal line is present in the visual field, and that it spans the entire image. The line will activate units in each of the first three layers that are sensitive to that orientation. However, each unit whose receptive field includes the line will result in a similar activation. When the final max operation is performed by the C2 layer, only the maximum activation would be picked, which would be similar to the activation if the line only appeared in the receptive field of one S2 unit.

2.5 Learning Objects

One of the objectives of this project is to be able to detect which of a set of learned objects is present in an image. HMAX includes no learning element; it is purely a feedforward process. The feature vector (the C2 activations) of a network gives a representation of the features present within an image. A feature vector obtained from a visual scene may be compared to feature vectors of learned objects to give an indication as to whether the object is present within the scene.

The common approach taken when using HMAX to detect objects is to train a linear classifier on the feature vectors of a set of training images, then use the classifier to infer the identity of an object present in a test image. Serre et al [23, 22] performed this classification process using linear Support Vector Machines (SVMs) and boosting algorithms (AdaBoost), and achieved good results. The accuracy of their results in object recognition tasks ranged from 94.% to 99.8% across seven image datasets. They compared these results to those achieved by a benchmark computer vision system, which achieved between 75.4% and 96.4% accuracy.

In this project, a different approach was taken to object classification. A self-organising map [17] is a type of artificial neural network that allows high-dimensional data to be organised into clusters. If the feature vectors of a set of training images are used to form the clusters, classification of objects in novel scenes should be possible by determining which cluster the feature vector falls into. One of the key benefits of self-organising maps is that they are an effective visualisation tool. Being able to visualise the clusters of objects will be a valuable aid in understanding how the feature vectors of

21 training images are related to each other. An overview of self-organising maps will be given in the remainder of this section, along with a discussion of their utility in this project Self-Organising Maps A Self-Organising Map (also known as a Kohonen Network) is an artificial neural network that is trained in an unsupervised manner to modify the internal state of the network, allowing features found in the training data to be modelled [2]. Fundamentally, a Self Organising Map (SOM) converts nonlinear relationships between high-dimensional data into geometric relationships on a low-dimensional (usually 2D) display. While the SOM maintains the core topological and metric relationships in the data, mapping the data on to a lower-dimensional display allows for abstractions to be made about the data [17]. A SOM consists of a 2-dimensional grid of nodes. Each node is initialised with a vector of weights; the initialisation values of these weights are typically chosen to be random or change linearly across the 2D structure[17]. It should be noted that the size of the weight vector will reflect the dimensionality of the data that is to be used in training; so if the training data contains three features, each node will have three weights. To train the SOM, training examples are presented in turn. The node that matches the training example most closely is found, typically by using a distance metric such as the Euclidean distance to compare the example input vector with a node s weight vector. The Euclidean distance is defined as N i=1 (x i y i ) 2, where X and Y are two vectors. The weights of the nodes within a neighbourhood surrounding the best-matching node are updated to be closer to the value of the training example using the following formula: m i (t + 1) = m i (t) + d(m i,c(t))α(t)[x(t) m i (t)] for each i N c (t) (2.1) where c(t) is the best-matching node at time t, N c (t) is the neighbourhood around the bestmatching node, d(m i,c) is the distance between a neighbourhood node and the best-matching node, x(t) is the current training example and α(t) is some scalar that defines the size of the learning step. Over time, during the execution of the algorithm, the size of the learning step and the neighbourhood size decrease. Training ends after a defined number of iterations of this training process have been completed. It is known that different parts of the brain, especially across the cerebral cortex, are responsible for specific functions, such as the analysis of sensory data (visual, auditory, etc.). Experimental research has shown that in many areas, sensory response signals are obtained in the same topological order on the cortex as they were received by the sensory organs [17]. For example, a spatial ordering can be seen in the auditory cortex that reflects the frequency response of the auditory system. The cells are ordered in the auditory cortex in such a way that they trace a logarithmic scale of frequency (i.e. low frequency sounds will generate a response in one end of the cortex region, and 15

For example, a spatial ordering can be seen in the auditory cortex that reflects the frequency response of the auditory system. The cells are ordered in the auditory cortex in such a way that they trace a logarithmic scale of frequency (i.e. low frequency sounds will generate a response in one end of the cortex region, and high frequency sounds will generate a response in the other end of the region) [17]. This evidence for self-organisation within the brain suggests that SOMs are an appropriate biologically-inspired model for learning patterns within sensory information, in this case the feature vectors output by HMAX.

As HMAX feature vectors represent objects in a scene, if the feature vectors generated from a number of training images are used as the input to a SOM, the SOM should organise the feature vectors that are similar in 256-dimensional space into geometric clusters in the map. The training images used should contain the object of interest in isolation, to prevent features present in the background from being included in the representation. Scenes that consist of the same objects often produce similar feature vectors (as suggested by the success in performing object recognition by using linear classifiers on feature vectors in [23]). The significance of this is that the geometric clusters formed in the SOM should correspond to images that contain a specific class of object, as long as the images show the objects in isolation with a neutral background.

To extend this idea to object detection, a SOM trained with HMAX feature vectors could be thought of as a set of geometric regions, each of which corresponds to an object that was present in the training images. To detect the presence of one of these objects in a novel scene, a feature vector would be calculated using HMAX for this new scene. The best-matching unit in the SOM for this feature vector could be found trivially (this is part of the SOM training algorithm). As the region that contains this best-matching unit corresponds to a specific object, it could be predicted that this object is present in the scene.

2.6 Technologies

2.6.1 MIIND

MIIND (Multiple Interacting Instantiations of Neural Dynamics) is a modular framework written in C++ that contains a number of libraries used for computational neuroscience modelling [5]. Of the various libraries in MIIND, LayerMappingLib is the one that is most relevant to this project. It is designed specifically for implementing hierarchical models of the ventral stream, of which HMAX is an example. LayerMappingLib was implemented with several specific models in mind, including Neocognitron and HMAX. A basic implementation of HMAX, built on LayerMappingLib, is included in the framework [5]. An implementation of HMAX written by the authors of the original paper that describes HMAX [21] also exists in Matlab. The LayerMappingLib implementation was decided to be the most appropriate solution for this project for two main reasons. Firstly, it has an extensible, object-oriented API, which will enable the modification of the implemented model to include a feedback mechanism. Secondly, MIIND is written in C++, which is considered to perform well and has a large number of libraries available, which will be used in other parts of the system.

2.6.2 LayerMappingLib

LayerMappingLib uses a somewhat unconventional approach to representing networks compared to the other libraries in the framework. This is mainly for efficiency: as the library is designed for developing specific types of hierarchical networks, certain optimisations may be made to improve network performance. As non-trivial modifications will be made to the HMAX implementation, which uses LayerMappingLib, a solid understanding of the design and API is necessary.

It is observed that many hierarchical networks use a non-linear function to combine the activations of afferent nodes (nodes in the receptive field of a given node) [5]. Many implementations of artificial neural networks perform a weighted summation over the afferent nodes, and often include a squashing function, such as a sigmoid function. However, the non-linear operations found in many hierarchical networks (for example, the max operation in HMAX) may not be represented by this model. Consider the case where activations are defined by

$a_i = g\Big(\sum_{a_j \in RF(a_i)} w_{ij} a_j\Big)$   (2.2)

where $g$ is the squashing function, $RF$ is the receptive field of a given node and $w$ is a matrix containing the weights of the connections between node $a_i$ and the previous layer (nodes $a_j$). In many hierarchical networks, the weight matrix is the same for each node: it covers the receptive field of the node, and is repeated across a layer. In this case, the calculation for the activation of a node may be thought of as a squashing function applied to a convolution of the (shared) weight matrix with the afferent nodes in the previous layer (equation 2.3). This structure is known as a convolutional network [18].

$a_i = g(RF(a_i) * W)$   (2.3)

In hierarchical networks, these convolutions are typically aimed at feature extraction. An example of this is HMAX: the S1 layer applies filters at different scales and orientations. LayerMappingLib represents networks in a similar way to convolutional networks. Layers in LayerMappingLib are referred to as feature-maps due to the fact that their purpose is typically to extract features from the previous layer [5]. Rather than storing weights for all nodes across a layer, LayerMappingLib uses a FeatureMapNode, which corresponds to a layer in a convolutional network. A FeatureMapNode stores a layer's parameters, including:

- the receptive field, which specifies the number of afferent nodes in the previous layer,
- the skip size, which corresponds to the amount by which the receptive fields of neighbouring nodes overlap,
- the filter that is to be applied to the receptive field.

Filters in LayerMappingLib are implemented as functions. Convolution is one of these functions, enabling convolutional networks to be implemented. Non-linear filters, such as the max operation in HMAX, are implemented as specialised functions.
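The sketch below mirrors this idea: one shared filter function applied across a layer with a given receptive field size and skip size. The struct and names are illustrative assumptions and do not reflect the actual LayerMappingLib API.

    #include <cstddef>
    #include <functional>
    #include <vector>

    using FeatureMap = std::vector<std::vector<double>>;

    // A filter reduces one receptive field (a patch of the previous layer,
    // given by its top-left corner and size) to a single value.
    using Filter = std::function<double(const FeatureMap&, std::size_t,
                                        std::size_t, std::size_t)>;

    // Hypothetical analogue of a feature-map node: shared parameters for a layer.
    struct FeatureMapNodeSketch {
        std::size_t receptive_field;  // patch size in the previous layer
        std::size_t skip;             // step between neighbouring receptive fields
        Filter filter;                // e.g. a convolution or a max operation

        FeatureMap apply(const FeatureMap& previous) const {
            FeatureMap out;
            for (std::size_t y = 0; y + receptive_field <= previous.size(); y += skip) {
                std::vector<double> row;
                for (std::size_t x = 0; x + receptive_field <= previous[y].size(); x += skip)
                    row.push_back(filter(previous, y, x, receptive_field));
                out.push_back(row);
            }
            return out;
        }
    };

Because the filter is just a function, a convolution with a fixed kernel (as in the S layers) and a max over the patch (as in the C layers) can both be plugged into the same structure, which is the property the text highlights.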

2.6.3 Self-Organising Map Software

Self-Organising Maps are used in this project to learn representations of multiple objects. The use of Self-Organising Maps in this project is discussed in detail in section 4.3. Numerous pieces of software exist that allow the creation and training of Self-Organising Maps, including SOMCode [4], KSOM [1] and SOMNetwork [13]. SOMCode and KSOM are open-source libraries for building Self-Organising Map simulations. SOMNetwork is an application that was written by Dave Harrison for his Final-Year Project. Unlike the other two libraries, it includes a graphical user interface that allows settings to be changed quickly, and displays real-time visualisations of the SOM training process. It is written in Java and has well-documented source code, written in an object-oriented style that allows it to be extended in a number of ways. Due to these two features, it has been selected as the software that will be used for creating Self-Organising Maps in this project. The user interface will make experimenting with different training parameters quicker and easier than it would be if it had to be done in code. The extensible API will allow the application to be modified to perform additional tasks necessary for this project.

Chapter 3

Project Plan

3.1 Milestones and Deliverables

The project was divided up into several milestones, in order to make it easier to keep track of the progress made. Identifying milestones was the starting point for the project schedule as well: a certain amount of time was allocated for each milestone, and progress was tracked. Some milestones also include associated deliverables, such as a piece of writing or a computer program.

1. Gain sufficient knowledge about the fields of Computational Neuroscience, Visual Perception and Biologically-Inspired Computing by performing a thorough literature review of each area.
2. Design the system that is to be built: outline the different components that will be necessary and determine which technologies will be used.
3. Development stage 1. Implement a program that uses the necessary libraries to run a computational model of biological vision on a given image, and output relevant information from the model. Associated deliverables: the program as described above.
4. Development stage 2. Use Self-Organising Maps to learn representations of classes of objects. Find clusters within the Self-Organising Maps that map to distinct classes of objects. Associated deliverables: an extended version of SOMNetwork that includes the ability to form clusters in a trained SOM.

5. Development stage 3. Construct a program that produces an object-based saliency map, given a test image and an image containing a target object. Associated deliverables: the program as described above.
6. Write up a mid-project report that describes the background to the problem that is to be solved, the aims and objectives for the project, and the progress that has been made so far. Associated deliverables: the mid-project report.
7. Implement the extensions specified. Associated deliverables: a modified version of SOMNetwork that includes a user interface for inspecting nodes and a tool for visualising features present in objects.
8. Write-up of the final report. Associated deliverables: a print-out and a digital copy of the final report.
9. Evaluate the system. Each part of the implemented system will be evaluated independently. The evaluation will be an ongoing process once a sufficient amount of the development has been completed.

3.2 Schedule

A schedule was devised at the start of the project to ensure that enough time would be available to carry out each part of the project. It was checked regularly that the work done was in line with the schedule. Despite efforts to stick to the initial schedule, it became apparent after the first few weeks of the project that the schedule would need to be revised. The background reading took considerably longer than was initially expected. This was in part due to the fact that some understanding of the Neuroscience of vision and perception was necessary to be able to complete the project to a suitable standard. As this field had not been previously studied, there was a great deal more to learn than was initially expected. Gantt charts were produced for both the initial plan (figure 3.1) and the revised plan (figure 3.2). Each item in the Gantt chart corresponds to one milestone.

Figure 3.1: The initial Gantt chart (milestones: background reading, system design, development stages 1-3, mid-project report, Easter break, development of extensions, report write-up, evaluation).

Figure 3.2: The revised Gantt chart.

3.3 Methodology

The common software development methodologies, such as the Waterfall model and Extreme Programming, are designed for use on large projects by teams of developers. This reduces their utility in this project, as it will be carried out by a single developer. However, some of the principles of these methodologies may be used on a single-person project. The system was divided up into three main subsystems, each of which was developed independently. Each of these subsystems was broken

up into smaller tasks, which were developed in turn in rapid iterations. Before a new feature was implemented, the previous feature was analysed and checked for correctness. The process consisted of the following steps:

1. Identification of requirements: each part of the system required multiple pieces of functionality to be implemented. The requirements of each of the subsystems were identified early on and prioritised.
2. Produce a prototype implementation: the main features of a particular subsystem were implemented in order of priority. Each feature was validated before moving on to the next.
3. Analyse the prototype implementation: the first iteration of each subsystem was analysed as a whole. Missing features and errors in the code were identified.
4. Perform an additional iteration: taking into account the issues found in step 3, improve the current implementation with another iteration of development. If necessary, go back to step 3 to perform another iteration.

Chapter 4

Design and Implementation

4.1 System Overview

The system is divided into three distinct subsystems, each of which provides different functionality and satisfies a distinct objective. The first objective of the system is to be able to find the feature vector (the activation of the C2 layer of HMAX) for a given object. Secondly, the system should have the ability to organise or cluster a collection of feature vectors of different objects. From this, the system should be able to determine which of the objects used in training is most likely to be present in a test image. Finally, the system should be able to produce an object-based saliency map of an image, given a template feature vector; i.e. given the template of an object, what are the most important candidate locations for the object in a test image.

4.2 Obtaining Feature Vectors

The first part of the system has one purpose: to produce a feature vector for a given image. The LayerMappingLib implementation of HMAX is suitable for this purpose. The five main stages required to produce the feature vector are:

1. Read in the input image and convert it to a greyscale intensity image.
2. Initialise a new HMAX network.
3. Feed the image in as the input to the network.

4. Evolve the network: run the model for the given input.
5. Read out the values of the C2 layer of the network.

To read in the image, Magick++, the C++ interface to ImageMagick, was used. ImageMagick is a suite of open-source image manipulation programs and libraries; it was chosen for three main reasons. Firstly, it supports a wide range of file formats. This is a useful feature as it allows a wide range of test images to be used in the system. Secondly, the library's documentation seems to be more complete than many of the alternatives. Finally, ImageMagick is a commonly used package, so it is already installed on many systems, and is available for three of the major platforms (Linux, Mac OS X and Windows).

Once an image is read in and an intensity image is produced, an instance of the HMAX network needs to be created. As mentioned in section 2.6.1, MIIND includes a predefined implementation of HMAX, which was used in this part of the system. Once the model is created, it is initialised using the raw intensity pixel data from the image. LayerMappingLib provides a high-level interface to networks (NetworkInterface); this only exposes basic functionality, but is sufficient for the simple task of finding the output activation of a network. Using this interface, the model is evolved with the loaded image as the input. Evolving the network executes the feedforward process, which calculates the activations at each layer. The C2 activations may then be read out of the model and written to a file for further analysis.

Figure 4.1: Charts showing the feature vectors of two simple images. Each bar corresponds to the C2 activation of a specific S2 feature type (a 2 × 2 grid of orientations).

Initially, simple images of lines oriented at different angles were used. Using simple images makes the interpretation of the output of the network much easier to understand.
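As an illustration of the first of the five stages, the sketch below reads a plain-text PGM (P2) image into a vector of normalised intensities. The project itself uses Magick++ so that a wide range of formats can be read; this is only a self-contained stand-in for that step, not the code used in the system.

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    // Read a plain-text PGM (P2) greyscale image into intensities in [0, 1].
    // Assumes a well-formed file with no comment lines in the header.
    std::vector<double> load_pgm_intensities(const std::string& path,
                                             int& width, int& height) {
        std::ifstream in(path);
        std::string magic;
        int maxval = 0;
        in >> magic >> width >> height >> maxval;   // header: "P2", size, max value
        std::vector<double> intensities;
        intensities.reserve(static_cast<std::size_t>(width) * height);
        for (int value; in >> value; )
            intensities.push_back(static_cast<double>(value) / maxval);
        return intensities;
    }

The resulting intensity values are what would be passed to the HMAX network as its input layer.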

Without a thorough understanding of the output of the network, performing object recognition and implementing the feedback mechanism would be much more difficult. Figure 4.1 shows the feature vectors of images containing horizontal and vertical lines. The horizontal line's feature vector shows spikes towards the left of the chart, the largest being in the first column, whereas the vertical line's feature vector shows highest activity around the 170th column, about three quarters of the way along. The charts show that the difference between feature vectors is subtle: most values are between 0.4 and 0.6. This is due to the fact that the filters used in the S1 layer do not produce a dramatically different output for different orientations. Despite the small range of values, the difference is enough to be able to deduce information about the features that are represented, and to tell feature vectors of different objects apart. Each column in the charts represents a feature type in the C2 layer (a 2 × 2 grid of orientations). From this feature vector it is possible to determine which feature types are most present in the image. This is explained in more depth and evaluated in chapter 5.

4.3 Learning and Recognising Objects

To learn the representations of objects, the feature vectors of a set of training images must be computed. The part of the system described in the previous section is responsible for calculating feature vectors. This process may be run for a set of training images, resulting in a feature vector for each of the objects in the training images. Self-Organising Maps are used to classify new images. This was done by training a SOM with the feature vectors of a set of training images. A feature vector corresponds to the activations of the C2 layer of HMAX. As discussed in section 2.4, there are 256 nodes in the C2 layer. MIIND represents a node's activation as a real number that varies between 0 and 1, so a feature vector is a 256-dimensional vector of real numbers. The first task is to initialise a SOM with the feature vectors of the training data.

4.3.1 Training the Self-Organising Map

The chosen SOM software (SOMNetwork) provides a graphical user interface for creating and training SOMs. It allows the user to load in training data in the form of a text file that specifies the dimensionality of the data, then contains one training example on each line. Training examples should be formatted as a space-separated list of real numbers. The first part of the system, which deals with computing feature vectors, was modified so that it could run the model on a large number of images sequentially, then write the computed feature vectors to a file in the format required by SOMNetwork. This training file is used as the input to SOMNetwork, from which a map is created and trained. After training, each node in the SOM will have a 256-dimensional weight vector (known as a codebook vector) associated with it.
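A sketch of writing such a training file is shown below; the exact header expected by SOMNetwork is an assumption (the dimensionality on the first line, then one space-separated vector per line, as described above).

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    // Write feature vectors in the format described above: the dimensionality of
    // the data first, then one space-separated training example per line.
    void write_training_file(const std::string& path,
                             const std::vector<std::vector<double>>& vectors) {
        if (vectors.empty()) return;
        std::ofstream out(path);
        out << vectors.front().size() << "\n";          // e.g. 256 for C2 activations
        for (const auto& v : vectors) {
            for (std::size_t i = 0; i < v.size(); ++i)
                out << (i ? " " : "") << v[i];
            out << "\n";
        }
    }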

After training, each node in the SOM will have a 256-dimensional weight vector, known as a codebook vector, associated with it. Codebooks that are close in data space will appear topographically close in the SOM. An effective method of visualising SOMs is the D-matrix. A D-matrix displays each node as a greyscale value; the brightness of the value corresponds to the average distance of the node from its neighbours. A dark node will be close to its spatial neighbours in data space, whereas a light node will be far apart from (at least some of) its neighbours in data space.

Figure 4.2: A sample of 12 of the images from the training set that was used. 200 images of each class of object were present in the training set; in each image the object size and position was determined randomly.

A SOM requires numerous examples of each object type to accurately extract the data landscape, therefore the first step is to generate or collect test data. A program was written that creates 600 images: 200 containing a square, 200 containing a circle and 200 containing a triangle (a sketch of such a generator is given at the end of this subsection). These training images were always of the same pixel dimensions. The size and position of the shapes within the images varied randomly, but the shapes were always entirely contained within the bounds of the image. A sample of some of the images included in the training set is shown in figure 4.2. Feature vectors were calculated for each of these training images, and collated into a file that may be parsed by SOMNetwork. The task was to train a SOM on these 600 feature vectors to form clusters for each of the different shape types. Figure 4.3(a) shows the D-matrix for this trained SOM. As the dark nodes are close in data space to their neighbours, it is possible to see three distinct clusters formed in the SOM; the lighter sections are the borders of the clusters. The labelled SOM (figure 4.3(b)) shows that each cluster contains a specific shape. It should be noted that the labels have been included to show which shapes mapped to which clusters; the labels are not required for training the network, as SOMs are trained in an unsupervised manner.
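An illustrative sketch of such a generator is shown below, for the circle class only. The image size, the range of shape sizes and the file names are arbitrary choices rather than the values used in the project; squares and triangles could be produced in the same way with Magick::DrawableRectangle and Magick::DrawablePolygon.

    #include <Magick++.h>
    #include <cstdlib>
    #include <sstream>

    int main(int argc, char** argv)
    {
        Magick::InitializeMagick(argv[0]);
        const int w = 128, h = 128;                      // fixed image dimensions
        for (int i = 0; i < 200; ++i) {
            int size = 20 + std::rand() % 40;            // random radius
            int cx = size + std::rand() % (w - 2 * size);  // keep the shape inside
            int cy = size + std::rand() % (h - 2 * size);  // the image bounds

            Magick::Image image(Magick::Geometry(w, h), Magick::Color("white"));
            image.strokeColor(Magick::Color("black"));
            image.fillColor(Magick::Color("white"));
            // Circle defined by its centre and a point on its perimeter.
            image.draw(Magick::DrawableCircle(cx, cy, cx + size, cy));

            std::ostringstream name;
            name << "circle_" << i << ".png";
            image.write(name.str());
        }
        return 0;
    }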

Figure 4.3: (a) D-matrix, (b) labelled D-matrix, for a SOM trained on images of circles, squares and triangles. The right image includes labels to illustrate which nodes each shape was closest to. The labels are the first letter of the corresponding shape name.

4.3.2 Finding Clusters

To be able to determine the identity of objects in test images, the boundaries of the clusters in the SOM need to be found. Once the distinct clusters have been found, the feature vector of a test image may be identified by finding the cluster that contains the node that matches the new feature vector best. An advantage of using self-organising maps over other clustering algorithms such as K-means is that the number of clusters does not need to be specified in advance. This allows the system to be trained on a number of examples, and to learn the representations without having to specify the number of clusters a priori.

To find the clusters, a region-growing algorithm was used. Once the SOM is trained, the training data is presented once more. For each example in the training data, the best-matching unit in the SOM is found. The best-matching unit for a training example is the node in the SOM with the codebook that has the smallest distance (as defined by the Euclidean distance) from the example. If this node is not in a cluster, a new cluster will be created. Nodes surrounding this unit will then be added to the cluster repeatedly, as long as their distance from the initial unit (again calculated using the Euclidean distance) does not exceed a predefined threshold.
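A minimal, self-contained sketch of this region-growing step is given below. It does not use SOMNetwork's actual classes (the Visitor-based implementation is described next); the SOM is modelled as a plain grid of codebook vectors and all names are illustrative.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Illustrative stand-in for a trained SOM: width*height codebooks, row-major.
    struct Som {
        std::size_t width, height, dim;
        std::vector<double> codebooks;
        const double* at(std::size_t x, std::size_t y) const {
            return &codebooks[(y * width + x) * dim];
        }
    };

    static double euclidean(const double* a, const double* b, std::size_t dim) {
        double sum = 0.0;
        for (std::size_t i = 0; i < dim; ++i) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(sum);
    }

    // Find the node whose codebook is closest to the given example.
    static std::size_t bestMatchingUnit(const Som& som, const double* example) {
        std::size_t best = 0;
        double bestDist = euclidean(som.at(0, 0), example, som.dim);
        for (std::size_t y = 0; y < som.height; ++y)
            for (std::size_t x = 0; x < som.width; ++x) {
                double d = euclidean(som.at(x, y), example, som.dim);
                if (d < bestDist) { bestDist = d; best = y * som.width + x; }
            }
        return best;
    }

    // Grow a cluster outwards from the best-matching unit of one training example:
    // neighbouring nodes join while their codebook distance from the seed node does
    // not exceed the threshold. clusterIds holds -1 for nodes not yet in a cluster.
    void growCluster(const Som& som, const double* example, double threshold,
                     std::vector<int>& clusterIds, int& nextClusterId)
    {
        std::size_t seed = bestMatchingUnit(som, example);
        if (clusterIds[seed] != -1) return;          // already part of a cluster
        int id = nextClusterId++;
        clusterIds[seed] = id;
        const double* seedCb = &som.codebooks[seed * som.dim];
        std::vector<std::size_t> frontier(1, seed);
        while (!frontier.empty()) {
            std::size_t node = frontier.back(); frontier.pop_back();
            std::size_t x = node % som.width, y = node / som.width;
            const int dx[] = {1, -1, 0, 0}, dy[] = {0, 0, 1, -1};
            for (int k = 0; k < 4; ++k) {
                int nx = int(x) + dx[k], ny = int(y) + dy[k];
                if (nx < 0 || ny < 0 || nx >= int(som.width) || ny >= int(som.height)) continue;
                std::size_t n = std::size_t(ny) * som.width + std::size_t(nx);
                if (clusterIds[n] != -1) continue;
                if (euclidean(som.at(nx, ny), seedCb, som.dim) <= threshold) {
                    clusterIds[n] = id;
                    frontier.push_back(n);
                }
            }
        }
    }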

This operation is similar to the weight-update operation in the sense that the best-matching unit for a training example is found, then an operation takes place on the neighbourhood surrounding that unit. This similarity may be used as a basis for the implementation of the region-growing algorithm. The way in which the node updating code in SOMNetwork has been written allows the region-growing extension code to follow the patterns and techniques used by the update process. To understand the implementation of the clustering algorithm, an overview of the structure of SOMNetwork's node updating system will be given.

SOMNetwork uses the Visitor design pattern for updating nodes' weights. The Visitor pattern is used for performing operations on the elements of an object structure without having to alter the classes of the elements that the structure is composed of [8]. Rather than changing the elements, an abstract Visitor class is created that has a method that accepts an element. Operations may be defined by inheriting from the Visitor class, and implementing the method that accepts the element to perform an operation. SOMNetwork defines an AbstractTopologyVisitor that may be used to perform an operation on each node in the network. The visitor includes the traversal logic, allowing the operation-specific subclasses to work on any network topology without having to worry about navigation. Having the navigation logic separated from the operation logic proved to be a useful feature when the region-growing algorithm was implemented. Nodes are traversed in a specific order, which is dependent on the topology of the SOM. Edge cases must be handled when the edge of the SOM is reached. The fact that the navigation implementation was defined in the abstract visitor class meant that it did not have to be repeated when the clustering code was written.

Figure 4.4: The D-matrix for the SOM shown in figure 4.3 after clustering has been performed.

To cluster the nodes, a ClusteringTopologyVisitor class was defined that assigns nodes to clusters depending on their distance from the best-matching unit for a training example. Figure 4.4 shows an example of a SOM after clustering; each cluster is drawn in a different colour. To determine the identity of objects in new images, test image feature vectors need to be loaded in and assigned to a cluster. Once the SOM is clustered, a similar approach to the clustering mechanism may be used. New clusters will not be created, but as the nodes in the SOM already belong to clusters, the test examples can use the cluster of the best-matching node. The SOMNetwork program was modified to allow test data to be loaded and clustered; once the SOM was clustered, the clusters and the test examples that are contained within them are written to a file. By looking at the number of examples from each object class which fall into a certain cluster, the object that the cluster represents may be inferred.

4.4 The Feedback Connections

To produce object-based saliency maps, feedback connections from C2 to the lower retinotopic layers are needed. Object-based saliency maps show the regions of an image that are most likely to contain

a target object. For this to be possible, a representation of the target object is required. This representation is in the form of a feature vector: a vector containing the C2 activations for the given object. It is information from this feature vector that is carried down to the lower layers of the network. As a saliency map requires spatial information, which is absent at the highest level of HMAX, the S2 layer is used, as it retains a significant amount of retinotopy but is still capable of representing the complex features used in C2. A node in the S2 layer should be considered salient if there is both a strong activation in the target feature vector for the node's type (grid of orientations) and the node is strongly activated after the feedforward process. The saliency map will show the S2 layer activations, combining the nodes' values with the respective activations from the target object's feature vector.

4.4.1 Feeding Back to S2

It is necessary to determine which S2 nodes each C2 node is connected to. To do this using LayerMappingLib, the higher-level NetworkInterface is not sufficient as it does not provide direct access to the nodes in the network. Instead, the FeatureMapNetwork was used, which provides lower-level access to the network. A FeatureMapNetwork provides an interface that allows iteration over the C2 nodes (instances of FeatureMapNode) of the network. The next task is to find the predecessor nodes for each C2 node; once these are determined, the saliency map may be constructed by combining S2 activations with their corresponding C2 activation (the way in which these values are combined is important and will be discussed later). However, LayerMappingLib does not provide a way to access the nodes of the previous layer. The library was modified by adding the methods predecessor_nodes_begin() and predecessor_nodes_end() to the FeatureMapNode class, which allow iteration over the nodes in the previous layer. It should be noted that a FeatureMapNode represents a feature map (the shared weights and connection pattern of all nodes in the layer) rather than a node in the conventional artificial neural network sense. The FeatureMapNode provides access to the activations at different locations using the method FeatureMapNode::activation_begin(). For the S2 layer, there will be one FeatureMapNode for each node type (a combination of four orientations). Each FeatureMapNode will contain the activations for that node type at all locations in the layer.

4.4.2 Creating the Saliency Map

The created saliency map will be an image that is of the same dimensions as the largest scale band in the S2 layer; there will be one pixel in the image for every node. This brings up an important point: there are four scale bands in S2, each of which differs in size. A saliency map will be produced for each scale band, which will allow the output to be analysed at multiple scales. A unified saliency map may later be produced by combining the four intermediate scale bands.

Before the saliency map is created, data is allocated to hold temporary saliency values. This array of temporary data is necessary as the saliency values may exceed the maximum value that can be present in an image (1.0 when pixels are represented as doubles). This means that the temporary saliency values that are used during the calculations cannot be inserted directly into the image that will be used as the final saliency map, as values in images are forced to be within the range that a pixel is represented by. To determine the amount of data that must be allocated to store the temporary saliency data, the dimensions of the S2 layer are retrieved; this is specific to each scale band. All elements in the array are initialised to 0. The following process then takes place for each C2 node of the template feature vector (a sketch of the whole accumulation is given below):

1. The corresponding C2 node in the network is retrieved for the current value in the template feature vector. So for the first element in the template feature vector, the first C2 node in the network will be retrieved.

2. The predecessor nodes are found for this C2 node by traversing backwards through the network to the S2 layer. As each C2 node represents one S2 node type, there is only one FeatureMapNode in each scale band of the S2 layer for the C2 node.

3. The activation values of the FeatureMapNode are iterated over. For each activation, the corresponding value in the array of saliency data (which was initialised to 0) is incremented by the product of the activation value of the S2 node and the value of the corresponding C2 node from the template feature vector.

(a) Template image (b) Test image (c) Saliency maps
Figure 4.5: Saliency maps for the four scale bands of S2. The saliency at each pixel was calculated using the mean saliency of the different S2 feature types at each location. The saliency values were normalised after they were calculated to improve the contrast.
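The accumulation described in the three steps above can be sketched as follows for a single scale band. The flat-vector data layout and the function name are illustrative stand-ins; the actual implementation walks the LayerMappingLib network to obtain the S2 activations.

    #include <cstddef>
    #include <vector>

    // s2Activations[f] holds the activations of S2 feature type f at every
    // location of the band (width*height values); templateC2[f] is the C2
    // activation of the same feature type in the target object's feature vector.
    std::vector<double> meanSaliency(const std::vector<std::vector<double> >& s2Activations,
                                     const std::vector<double>& templateC2,
                                     std::size_t width, std::size_t height)
    {
        std::vector<double> saliency(width * height, 0.0);   // temporary saliency data
        for (std::size_t f = 0; f < templateC2.size(); ++f) {
            const std::vector<double>& band = s2Activations[f];
            for (std::size_t i = 0; i < saliency.size(); ++i) {
                // product of the S2 activation and the template's C2 activation
                saliency[i] += band[i] * templateC2[f];
            }
        }
        for (std::size_t i = 0; i < saliency.size(); ++i)
            saliency[i] /= double(templateC2.size());   // mean across all feature types
        return saliency;
    }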

At the end of this process, the temporary array of saliency data contains the sum of the saliency values across all C2 nodes. Each value in the array is then divided by the number of C2 nodes in the feature vector. This results in each value of the array containing the mean saliency value across all feature types. An image is then created using the calculated saliency values as greyscale pixel data. The image is saved to disk as the final saliency map.

Initial results showed images with extremely low contrast: the parts of the image considered salient were difficult to distinguish from less salient stimuli. To make the results easier to interpret, the saliency values were normalised between 0 and 1 (so the lowest saliency value is mapped to 0 and the highest is mapped to 1) before they were saved to the image. An example of the normalised output is shown in figure 4.5(c). Figure 4.5(a) was used as the template and figure 4.5(b) was used as the test image. This example shows that the line oriented in the same direction as the one in the template image is considered salient, but the other is not.

The first implementation of this part of the system used the mean of the saliency values at each pixel to produce the final map. The implementation was later amended to optionally use the maximum saliency value instead. This functionality was added by altering two of the steps used. Firstly, rather than adding the saliency value for each S2 type to the saliency data, the new value was calculated, compared with the value previously stored in the array, and set as the new value if it was larger. Secondly, if the maximum was being used, the final values would not be divided by the number of C2 nodes. The results of this revised approach may be seen in figure 4.6.

Figure 4.6: Saliency maps produced using the max operation at each location for calculating saliency. No normalisation was performed on these maps.

One motivation for using the max operation was to get around the issue of the low-contrast saliency maps produced by the initial approach. Many parts of the image that should be considered salient will only be considered salient for one particular S2 feature type. In the example of the diagonal lines in figure 4.5, the diagonal line in the test image that is similar to the line in the template will only be considered salient by S2 feature types that contain the correct diagonal orientation. Using the mean of the values calculated from all filter types results in a relatively small saliency value, as only a quarter of the filter types contain the correct orientation, and even fewer contain the correct orientation in multiple positions. As previously mentioned, normalisation was used to overcome this issue. However, this approach is not ideal. It is successful when it is known that the target object will be present in the test image, but when this is not known in advance it presents a problem: the most salient part of the image, even if not actually salient, will appear extremely salient after normalisation.
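The max variant changes only the combination step of the previous sketch: keep the largest contribution seen at each location and skip the final division. The same illustrative data layout is assumed.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    std::vector<double> maxSaliency(const std::vector<std::vector<double> >& s2Activations,
                                    const std::vector<double>& templateC2,
                                    std::size_t width, std::size_t height)
    {
        std::vector<double> saliency(width * height, 0.0);
        for (std::size_t f = 0; f < templateC2.size(); ++f)
            for (std::size_t i = 0; i < saliency.size(); ++i)
                saliency[i] = std::max(saliency[i], s2Activations[f][i] * templateC2[f]);
        return saliency;   // no normalisation step is required for this variant
    }

Because every contribution is a product of values in [0, 1], the result already lies within the range a pixel can hold, which is why the division and normalisation steps can be dropped.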

This is because the highest value, even if extremely low, will be scaled up to 1. Taking the maximum should avoid the issue of contrast, as only the most salient value for each location will be considered. This allows a saliency map to be produced without performing normalisation on the data. The saliency maps produced and the role of the max operation are discussed further in section 5.3.

4.5 Visualising the S2 Feature Types

To understand and be able to evaluate the performance of the model, it is useful to know which C2 nodes correspond to which S2 feature types. With this knowledge, it is possible to show which S2 feature types are most active in a given image, and validate that a template feature vector accurately represents the image it was generated from. It is only possible to determine the S2 feature type that a C2 node represents by traversing backward through the network from C2 to S2. Now that the feedback mechanism has been implemented, this can be done. For each C2 node, the S2 predecessor FeatureMapNode was found. From the FeatureMapNode it is possible to determine which C1 nodes the S2 node is connected to, and in which order they are connected. C1 nodes each represent a specific orientation, so from this information it is possible to determine the four orientations that are present in an S2 node's feature type, and hence the structure of the feature type used at a specific C2 node. A mapping between each C2 node and the orientations present in the corresponding feature type was created and saved to a file.

A program was written that accepts the index of a C2 node, and creates a visualisation of the corresponding S2 feature type, using the precomputed mapping of C2 nodes to feature types. The S2 feature types for the first 10 C2 nodes are shown in figure 4.7(a) and the feature types for the last 10 C2 nodes are shown in figure 4.7(b).

Figure 4.7: (a) The S2 feature types for the first 10 C2 nodes. (b) The S2 feature types for the last 10 C2 nodes.
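A sketch of how such a visualisation could be drawn with Magick++ is shown below: one oriented bar per cell of the 2×2 grid. The cell size, stroke width and the assumption that the four orientations are supplied in degrees (as read from the precomputed mapping file) are illustrative choices, not details taken from the project.

    #include <Magick++.h>
    #include <cmath>
    #include <string>
    #include <vector>

    // Draw the 2x2 grid of oriented bars for one S2 feature type.
    void drawFeatureType(const std::vector<double>& orientationsDeg,
                         const std::string& outPath)
    {
        const int cell = 32;
        const double pi = 3.14159265358979;
        Magick::Image image(Magick::Geometry(2 * cell, 2 * cell), Magick::Color("white"));
        image.strokeColor(Magick::Color("black"));
        image.strokeWidth(3.0);
        for (int i = 0; i < 4; ++i) {
            double cx = (i % 2) * cell + cell / 2.0;   // centre of this grid cell
            double cy = (i / 2) * cell + cell / 2.0;
            double angle = orientationsDeg[i] * pi / 180.0;
            double dx = 0.4 * cell * std::cos(angle);
            double dy = 0.4 * cell * std::sin(angle);
            image.draw(Magick::DrawableLine(cx - dx, cy - dy, cx + dx, cy + dy));
        }
        image.write(outPath);
    }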

4.6 Finding Suitable Object Templates

The selection of a template feature vector is important for the feedback part of the system. When trying to determine the location of an object of a certain class within a scene, it is preferable to use a template that corresponds to an average object in its class. In the example of recognising cars, there may well be many cars that have relatively similar appearance (four-door saloons and hatchbacks, for example). However, there may be some examples in a set of training images that are considered outliers, such as an oddly-shaped sports car. When trying to recognise cars in novel test images (assuming the test set is made up mostly of saloons and hatchbacks, like the training set), the system will perform better on average if one of the more common cars is used as the template. If the oddly-shaped sports car was used, it would be a good match for a few other similar cars, but would likely consist of a different spectrum of features to the majority of the cars that are in the test images, and hence be a poor match for them.

A trained Self-Organising Map can be used as a tool for selecting an appropriate template feature vector. A feature vector that is similar to a large number of other feature vectors that represent the same object is likely to be more representative of the class of objects than one that is dissimilar to the other feature vectors. In a trained SOM, a number of feature vectors may map to one particular node. Even if this is not the case, nodes that are close in data space may be identified using the D-matrix. Feature vectors that map to nodes that are close in data space are likely to be relatively similar.

Figure 4.8: An extension to SOMNetwork that allows nodes in the D-matrix to be inspected.

The SOM application SOMNetwork was extended to include an interactive tool for inspecting nodes in a trained SOM. Users may click on a node in the D-matrix to view the codebook value of the node, and to see which of the training images were matched to the node. A screenshot of this tool may be seen in figure 4.8. This tool may be used to find feature vectors that are similar to a large number of other feature vectors of objects in the same class, making it a useful aid in choosing an appropriate template for the feedback mechanism.

Chapter 5

Results and Evaluation

5.1 Evaluation of the Model

In this section the feedforward part of the system will be evaluated. The feature vectors produced by HMAX will be discussed, and potential issues with the model will be identified.

5.1.1 Feature Vector Analysis

It is important to understand the performance of the feedforward step of the system to be able to evaluate the object recognition functionality and be able to analyse the saliency maps. The performance of the feedforward network will be evaluated by investigating the feature vectors produced for simple images, and checking that the correct features are active. The tool described in section 4.5 was extended to enable this kind of validation. The tool initially accepted a C2 node index as a command line argument, and produced an image containing a visualisation of the feature type that the node represented. To be able to evaluate feature vectors, the tool was modified to accept a file containing a feature vector as a command line argument. From this feature vector file, the tool produces visualisations of the most and least active feature types for the given feature vector (examples of these visualisations may be seen in figures 4.7, 5.1 and 5.2). The number of images produced may be altered via a command line option.

This tool was used to find the five most active feature types for images of simple lines, oriented at different angles. Initially, horizontal, vertical and corner-like features appeared to be the most active features in all images. This was even the case when a blank image was used: the activations for

these features were much larger than the rest. Spikes were showing in charts of the feature vector of the blank image, even though no features were present. It was these observations that led to the discovery that the implementation of HMAX was taking image borders into account. Even when the background colour was changed, the borders still showed up as features.

This issue was solved by changing the implementation of the feedforward part of the system. Rather than running the whole network using the simple NetworkInterface API, each layer was processed independently. After the S1 layer had processed the input from the image data, the activations of the nodes around the edges of the layer were set to 0. The width of the border was taken to be the receptive field size of the layer, which is defined in the model definition for HMAX. After this modification had been made, the spikes disappeared from the charts of the blank image's feature vector. It should be noted that all results shown in this report were created after this modification was made to the system.
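The border fix described above amounts to zeroing a band of activations around the edge of the S1 layer. A minimal sketch is given below, assuming the activations are held in a flat row-major buffer (an illustrative stand-in for the layer's real storage).

    #include <cstddef>
    #include <vector>

    // Set to zero every activation within one receptive field of the layer's
    // edges, so that the image border is not picked up as a feature.
    void zeroBorder(std::vector<double>& activations,
                    std::size_t width, std::size_t height,
                    std::size_t receptiveFieldSize)
    {
        for (std::size_t y = 0; y < height; ++y)
            for (std::size_t x = 0; x < width; ++x) {
                bool onBorder = x < receptiveFieldSize || y < receptiveFieldSize ||
                                x + receptiveFieldSize >= width ||
                                y + receptiveFieldSize >= height;
                if (onBorder) activations[y * width + x] = 0.0;
            }
    }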

Once this border issue had been fixed, the original test using the oriented lines was re-run. The images and the corresponding most-active feature types may be seen in figure 5.1. The results show that for each line, the most active feature type is the one that is composed of four orientations that are at the same angle as the line. The next four most active features also consist mostly of orientations at this angle.

Figure 5.1: The five most active feature types for four images of simple lines. The feature types with the highest activations are on the left; the activations decrease going right.

Next, the same test was performed on images containing simple shapes. The shapes used were a square, a circle and a triangle (as shown in figure 4.2). The results are shown in figure 5.2. The square's results were as expected: the most active features represented the straight edges of the square, and other features represent its corners. Similarly, the triangle produced results that were in line with expectations: the diagonal lines and the horizontal line were represented strongly, as was the corner at the top of the triangle. The circle produced interesting results: the most active features represented all of the quantised line orientations (both diagonal orientations, as well as horizontal and vertical orientations). These results are not unexpected, but they present an interesting problem that will be discussed in the following section.

Figure 5.2: The five most active feature types for three simple shapes. The feature types with the highest activations are on the left; the activations decrease going right.

5.1.2 The Circle Problem

The results in figure 5.2 show that the most active features in the circle are similar to the active features in the triangle, and to a certain degree, the square. This presents an issue with using HMAX for object-based attention. A circle contains most of the features that are active in other shapes. This means that when a saliency map is produced using a circle as the template, a different shape such as a triangle may be considered salient, as it contains a subset of the features present in the target object. Consider the most active feature types that are present in a circle (as shown in figure 5.2). Four out of the five feature types are also among the five most active feature types of the triangle. In fact the first, second and fourth most active feature types of the circle would be likely to appear in the majority of the triangle, as the triangle is predominantly composed of its three sides, which is what these feature types represent. These features would produce a strong activation in a saliency map for the triangle. It is also conceivable that more complex objects may contain strong representations of an even larger number of different features, to a greater degree than a circle.

5.1.3 The Feature Dictionary

The feature dictionary that HMAX uses is the 256 feature types: combinations of four possible orientations arranged in a 2×2 grid. For simple images this is sufficient to produce a saliency map, but for more complex images it presents some issues. The Circle Problem is an example of these issues coming into effect: as the features are relatively simple, it is not uncommon that one object may contain a superset of the features present in another unrelated object. This makes certain objects difficult to distinguish from each other when implementing an attentional mechanism, as no information about the relative locations of the features is available. If relative location information regarding the features within an object was available, it would be possible to discriminate more strictly about which parts of an image are likely to contain a target object, and which parts are not. For example, if it was known that the triangle not only contained a horizontal orientation, but contained multiple horizontal orientations in a row (the bottom edge of the triangle), it would be possible to disregard a circle as a possible triangle, as it does not contain a row of horizontal orientations. Though storing relative positional information about features within an object may improve the performance of the model, it is not in line with biological evidence of how objects are represented (as discussed in section 2.1.2).

An alternative adjustment that could be made is an expansion of the dictionary of primitive features that are represented in the model. Even a simple modification such as adding a larger number of orientations to the S1 filters (such as using 8 oriented filters rather than 4) may help to improve the model's feature specificity. An alternative approach to constructing feature dictionaries is that, rather than using predefined features, the dictionary of features is learned as the model is trained. This approach would result in a feature dictionary that fits the objects being used more accurately. This approach has been used by the Computer Vision community: Winn et al. [29] describe an approach where filters are applied in multiple locations to a set of images. The filter responses are aggregated over the entire set of images and clustered using K-means. The centroids of the clusters (referred to as textons) are used as the primitives of the feature dictionary for the model. Applying a similar method to HMAX could potentially improve the results of the attentional mechanism, as features would be more specific to particular objects, which would result in fewer distinct object types having strongly overlapping feature spectra.

5.2 Evaluation of Object Recognition

To analyse the performance of the object recognition part of the system, it must be determined how well clusters are formed in the Self-Organising Map. If the clusters do not separate different object types (that is, if multiple object types fall into one cluster), it is much more difficult to determine the identity of a previously unseen object. If it is not clear which object type a cluster represents, assigning a test image to that cluster means very little, as it cannot be accurately classified. It is also preferable

that there is one cluster for each object type rather than having multiple clusters for each object type. If multiple clusters represented the same object type, the objects falling in the separate clusters would effectively be considered to be in different classes, indicating weak classification performance.

5.2.1 Training the Self Organising Map

To find how well a trained SOM performs object recognition, it is possible to use the file that is created by the modification to SOMNetwork described in section 4.3.2. The file contains the labels that are optionally assigned to test data, and the id of the cluster that each point falls into. If labels are used that make it possible to determine the class of the objects in the test data, the file may be processed to calculate what percentage of each class of object was assigned to each cluster. A script was written to parse this file and perform these calculations so that statistics could be produced about the accuracy of the object classification system.

The first step in ensuring robust clustering is altering the training parameters of the SOM. The dimensions of the SOM determine the number of nodes used. This is an important parameter: too few nodes could mean that it is impossible to separate different object types. On the other hand, too many nodes could result in large areas of the SOM that are not close to any object type. Overfitting may also occur in the case of too many nodes, resulting in multiple clusters representing a single class of object, which impedes the system's ability to generalise about the representation of an object. Clustering was performed with SOMs of different dimensions on a number of generated sets of images, each containing 200 images of triangles, 200 of squares and 200 of circles (as shown in figure 4.2). Results proved optimal when the size of the SOM was between 8×8 and . A SOM that has too many nodes is less of a problem than one with too few, as objects may still be classified to some degree (although, as discussed, generalisation to novel objects is likely to be poor). For this reason, a SOM size of was decided to be optimal for the images being used in training.

The next parameter that may be adjusted is the number of iterations that are used when training the SOM. Each iteration presents one training example to the SOM and applies the update rule. Using more iterations will show the training examples a larger number of times, which should result in a better clustered SOM. However, as each iteration performs the update rule, which is a relatively expensive operation, no more iterations should be used than are necessary, as the running time of the training will be greatly increased. As each iteration presents one training example, to properly represent the training set it is necessary to select a number of iterations that is a multiple of the number of examples in the training set. As there were 600 examples in the training set, different numbers of iterations were tried, each time incrementing the number by the size of the training set. For each number of iterations, the quantisation error (Q-error) was inspected. The Q-error is a measure of how well the SOM represents the training data; this also gives an indication as to how correctly the SOM will be able to classify new data. It is calculated by taking the average of the distance between each training example and its best-matching unit [17].
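The Q-error calculation defined above can be sketched directly. The plain-vector representation of the codebooks and training examples is an illustrative stand-in for the values held by the trained SOMNetwork map.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Average distance between each training example and its best-matching unit.
    double quantisationError(const std::vector<std::vector<double> >& codebooks,
                             const std::vector<std::vector<double> >& examples)
    {
        double total = 0.0;
        for (std::size_t i = 0; i < examples.size(); ++i) {
            double best = -1.0;
            for (std::size_t n = 0; n < codebooks.size(); ++n) {
                double sum = 0.0;
                for (std::size_t d = 0; d < examples[i].size(); ++d) {
                    double diff = codebooks[n][d] - examples[i][d];
                    sum += diff * diff;
                }
                double dist = std::sqrt(sum);               // Euclidean distance
                if (best < 0.0 || dist < best) best = dist; // best-matching unit
            }
            total += best;
        }
        return examples.empty() ? 0.0 : total / double(examples.size());
    }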

Table 5.1: The Q-error for different numbers of iterations in the SOM (columns: number of iterations, Q-error).

Table 5.2: The Q-error for different SOM learning rates (columns: learning rate, Q-error).

A low Q-error is generally considered to indicate positive performance; however, if there are more nodes in the SOM than training data, the Q-error will be low regardless of the performance of the SOM. This issue can be avoided by ensuring that more training examples are used than there are nodes in the SOM. This forces some nodes to represent multiple input vectors and thus generalise and avoid overfitting. The number of iterations and their respective Q-errors are shown in table 5.1. The table shows that the Q-error is worse for 1800 iterations and 2400 iterations than it is for 1200. For this reason, 1200 iterations were used when training the SOM.

The other important parameter is the learning rate. The learning rate denotes how much a node's weights are changed to match a training sample when it is updated. The Q-errors of multiple learning rates were tested; the results are displayed in table 5.2. The optimal value is 0.5, which was selected as the learning rate for the experiments that were run.

5.2.2 Clustering the Self Organising Map

Once the SOM has been trained, discrete clusters are formed. As described in section 4.3.2, the clusters are found by using a region-growing algorithm to form clusters around certain nodes. The nodes picked are the best-matching units of examples in the training data. The region around a node expands and includes additional nodes whose distance from the original node is less than a threshold. The distance between two nodes is calculated as the Euclidean distance. The value of the threshold is important as it dictates how different nodes within a cluster can be from each other, and hence the number of nodes included in a cluster. Different values were tested, and a visual inspection of the clusters created was performed to determine how many clusters were

created, and how well the training data was divided into the different clusters. The script that produces statistics from the mapping between test examples and clusters was also used to calculate the quality of the clustering for each threshold value. A distance threshold of 0.15 was selected as the optimum, but the results showed that variation of 0.05 either way didn't make a noticeable impact on the clustering.

Using the selected threshold, the training data consisting of squares, triangles and circles was used to train and cluster a SOM. A different set of images containing the same shapes was created, and assigned clusters by finding which node in the SOM their value was closest to. A summary of how much of each class of object fell into each cluster may be found in table 5.3.

Cluster id    Triangles    Squares    Circles
0             94.5%        0.0%       0.5%
1             0.0%         0.0%       99.5%
2             0.0%         0.0%       0.0%
3             5.5%         100.0%     0.0%

Table 5.3: The percentages of each class of shape that were assigned to each cluster.

The results show that each of the shapes is assigned mainly to one cluster. The triangles clustered the least well out of all the shapes: 5.5% of the triangles were assigned to cluster 3, which was otherwise dominated by squares. The shapes that fell in a cluster that did not predominantly consist of shapes of the same class were analysed in depth. A similarity between all of these shapes was found: they were considerably smaller than the average shape size. Small shapes may contain different features to large shapes; a small triangle may contain a feature that represents the inner three borders of the triangle, but a large triangle will not, as there is no scale band in HMAX that covers an area large enough to enclose the entire triangle.

Although the triangles did not cluster as well as the other shapes, the outliers did not fall into the cluster that contained predominantly circles; this may seem inconsistent with the circle problem described earlier. However, this result does not contradict the circle problem. The circle problem occurs because the feedback mechanism marks all features that are present in both the template and the visual stimuli as salient. Most features present in a triangle are also present in a circle, so when a circle is used as the template most of the triangle is marked as salient. The fact that only most of the triangle's features occur in circles (rather than all of the triangle's features) is important. There are still some features, such as the corners, that can discriminate between the triangle and the circle. The circle also includes vertically-oriented features that are not present in the triangle. The fact that there are features that may be used to tell the two shapes apart means that the SOM is still capable of separating the shapes into different clusters. The features used to differentiate between the shapes may not have much effect on the saliency map, but their presence is critical to the clustering stage.

To further validate this property, two SOMs were created; each was trained with three types of image. The first SOM was trained with 200 images of circles, 200 of squares and 200 containing circles and squares (this may be seen in figure 5.3). The second SOM was trained with 200 images of

circles, 200 of triangles and 200 of circles and triangles. The results showed that a cluster was formed for each of the types of image; the images that contained both objects were in each case represented by a distinct cluster. This shows that the shapes contain features that are sufficient to tell them apart from the other shapes; if a shape consisted of the same set of features that were present in another shape, they would be clustered together.

(a) D-matrix (b) Labelled D-matrix
Figure 5.3: The D-matrix for a SOM trained on images of circles, images of squares and images containing both shapes. The labels shown in the image on the right are taken from the first letter of the shape name; b stands for both, and represents the images that contained both circles and squares.

Table 5.3 also shows that a cluster was formed that had no shapes assigned to it (cluster 2). This may be the case because the clusters were formed by growing regions around nodes that matched the training examples. If an outlier in the training examples is used to form a cluster, the cluster will be very small in size. The small size is a consequence of the example being an outlier: very few, if any, other training examples will be matched to the node that the outlier was assigned to. This means that the surrounding nodes will not get updated to be more similar to the initial node. If the surrounding nodes differ substantially from the node, when the cluster is formed it will not expand to any other nodes, as their distance will exceed the threshold defined. If a similar outlier does not exist in the test data, no examples will be assigned to the cluster, which results in empty clusters.

5.3 Analysis of the Saliency Maps

A saliency map shows a visual representation of which areas of a scene should be considered important. It may be used to direct attention to the most salient parts of a scene. Attention is shifted to different regions in order of their saliency values in the saliency map [16, 20]. Implementing this attentional mechanism was outside the scope of this project; the final goal was

to produce the saliency maps. Given the lack of an attentional system, the best way to evaluate the saliency maps is a visual inspection; in doing so it is possible to determine the areas considered most salient, as they are the brightest areas in the map.

First, the simple case of oriented lines was considered. Before the saliency maps can be applied to more complex objects such as shapes, it is important to determine that the system produces meaningful results in this simple case. As described in section 4.4.2, the implementation includes two methods for producing saliency maps. The first computes the saliency at each position for every C2 node (each of which represents a feature type), then takes the mean value at each location to be the final value. The final saliency map is then normalised so that all values fall between 0 and 1 to improve the contrast. Like the first method, the second method starts by computing the saliency at each location for each C2 node, but when combining all the values at a given location, it takes the maximum saliency rather than the mean. Using this method requires no normalisation. Both types of saliency map were created in this test. The test image used contains four lines, each of a different orientation. The test was performed four times, each time using a different template image. The results are shown in figures 5.4 to 5.7.

(a) Template image (b) Test image (c) Saliency map using mean (d) Saliency map using max
Figure 5.4: Saliency maps of a scene containing four lines oriented at different angles, with a vertical line as the target object. The correct line is clear in the saliency maps that use mean, and highly dominant in the saliency maps that use max.

(a) Template image (b) Test image (c) Saliency map using mean (d) Saliency map using max
Figure 5.5: Saliency maps of a scene containing four lines oriented at different angles, with a horizontal line as the target object.

(a) Template image (b) Test image (c) Saliency map using mean (d) Saliency map using max
Figure 5.6: Saliency maps of a scene containing four lines oriented at different angles, with a diagonal line as the target object.

(a) Template image (b) Test image (c) Saliency map using mean (d) Saliency map using max
Figure 5.7: Saliency maps of a scene containing four lines oriented at different angles, with a diagonal line as the target object.

In each case, the desired result is that the line in the test image that is similar to the line in the template image appears bright, and that the rest of the image is dark. This indicates that the system has chosen the correct line to be the salient one, as the regions considered salient should correspond to the target object. The results show that the second approach to producing saliency maps, taking the maximum value at each location, produces the better results. In the images that used the mean value at each location, the target line appears slightly brighter than the others, but not significantly so. The results of using the maximum value at each location show the target line as considerably brighter, and hence more salient, than the others. Due to the preferable performance of the second approach, it will be used exclusively for further results in this chapter.

Although the target objects are brighter than the other objects in the saliency map, the other objects do stand out from the background, indicating that they are considered more salient than the background. This happens for two reasons:

1. The filters that are used to extract orientations from the image at the S1 layer are still activated a small amount, even if the area of the image they cover does not exactly correspond to the orientation that the filter is sensitive to.

2. The lines in the template images contain features other than the main orientation of the line. In this example, the main other features are the edges that are perpendicular to the direction of the


More information

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 18: Deep learning and Vision: Convolutional neural networks Teacher: Gianni A. Di Caro DEEP, SHALLOW, CONNECTED, SPARSE? Fully connected multi-layer feed-forward perceptrons: More powerful

More information

Practice Exam Sample Solutions

Practice Exam Sample Solutions CS 675 Computer Vision Instructor: Marc Pomplun Practice Exam Sample Solutions Note that in the actual exam, no calculators, no books, and no notes allowed. Question 1: out of points Question 2: out of

More information

Why equivariance is better than premature invariance

Why equivariance is better than premature invariance 1 Why equivariance is better than premature invariance Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of Toronto with contributions from Sida Wang

More information

Saliency Extraction for Gaze-Contingent Displays

Saliency Extraction for Gaze-Contingent Displays In: Workshop on Organic Computing, P. Dadam, M. Reichert (eds.), Proceedings of the 34th GI-Jahrestagung, Vol. 2, 646 650, Ulm, September 2004. Saliency Extraction for Gaze-Contingent Displays Martin Böhme,

More information

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao Classifying Images with Visual/Textual Cues By Steven Kappes and Yan Cao Motivation Image search Building large sets of classified images Robotics Background Object recognition is unsolved Deformable shaped

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1.1 Introduction Pattern recognition is a set of mathematical, statistical and heuristic techniques used in executing `man-like' tasks on computers. Pattern recognition plays an

More information

Modeling the top neuron layers of the human retina. Laurens Koppenol ( ) Universiteit Utrecht

Modeling the top neuron layers of the human retina. Laurens Koppenol ( ) Universiteit Utrecht Modeling the top neuron layers of the human retina Laurens Koppenol (3245578) 23-06-2013 Universiteit Utrecht 1 Contents Introduction... 3 Background... 4 Photoreceptors... 4 Scotopic vision... 5 Photopic

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin

More information

Does the Brain do Inverse Graphics?

Does the Brain do Inverse Graphics? Does the Brain do Inverse Graphics? Geoffrey Hinton, Alex Krizhevsky, Navdeep Jaitly, Tijmen Tieleman & Yichuan Tang Department of Computer Science University of Toronto The representation used by the

More information

Attentional Based Multiple-Object Tracking

Attentional Based Multiple-Object Tracking Attentional Based Multiple-Object Tracking Mark Calafut Stanford University mcalafut@stanford.edu Abstract This paper investigates the attentional based tracking framework of Bazzani et al. (2011) and

More information

Neural Network and Deep Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Neural Network and Deep Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Neural Network and Deep Learning Early history of deep learning Deep learning dates back to 1940s: known as cybernetics in the 1940s-60s, connectionism in the 1980s-90s, and under the current name starting

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

Glyphs. Presentation Overview. What is a Glyph!? Cont. What is a Glyph!? Glyph Fundamentals. Goal of Paper. Presented by Bertrand Low

Glyphs. Presentation Overview. What is a Glyph!? Cont. What is a Glyph!? Glyph Fundamentals. Goal of Paper. Presented by Bertrand Low Presentation Overview Glyphs Presented by Bertrand Low A Taxonomy of Glyph Placement Strategies for Multidimensional Data Visualization Matthew O. Ward, Information Visualization Journal, Palmgrave,, Volume

More information

Detecting Salient Contours Using Orientation Energy Distribution. Part I: Thresholding Based on. Response Distribution

Detecting Salient Contours Using Orientation Energy Distribution. Part I: Thresholding Based on. Response Distribution Detecting Salient Contours Using Orientation Energy Distribution The Problem: How Does the Visual System Detect Salient Contours? CPSC 636 Slide12, Spring 212 Yoonsuck Choe Co-work with S. Sarma and H.-C.

More information

Representing 3D Objects: An Introduction to Object Centered and Viewer Centered Models

Representing 3D Objects: An Introduction to Object Centered and Viewer Centered Models Representing 3D Objects: An Introduction to Object Centered and Viewer Centered Models Guangyu Zhu CMSC828J Advanced Topics in Information Processing: Approaches to Representing and Recognizing Objects

More information

Modeling Attention to Salient Proto-objects

Modeling Attention to Salient Proto-objects Modeling Attention to Salient Proto-objects Dirk Bernhardt-Walther Collaborators: Laurent Itti Christof Koch Pietro Perona Tommy Poggio Maximilian Riesenhuber Ueli Rutishauser Beckman Institute, University

More information

Character Recognition Using Convolutional Neural Networks

Character Recognition Using Convolutional Neural Networks Character Recognition Using Convolutional Neural Networks David Bouchain Seminar Statistical Learning Theory University of Ulm, Germany Institute for Neural Information Processing Winter 2006/2007 Abstract

More information

Project proposal: Statistical models of visual neurons

Project proposal: Statistical models of visual neurons Project proposal: Statistical models of visual neurons Anna Sotnikova asotniko@math.umd.edu Project Advisor: Prof. Daniel A. Butts dab@umd.edu Department of Biology Abstract Studying visual neurons may

More information

Chapter 7. Conclusions and Future Work

Chapter 7. Conclusions and Future Work Chapter 7 Conclusions and Future Work In this dissertation, we have presented a new way of analyzing a basic building block in computer graphics rendering algorithms the computational interaction between

More information

Coordinate Transformations in Parietal Cortex. Computational Models of Neural Systems Lecture 7.1. David S. Touretzky November, 2013

Coordinate Transformations in Parietal Cortex. Computational Models of Neural Systems Lecture 7.1. David S. Touretzky November, 2013 Coordinate Transformations in Parietal Cortex Lecture 7.1 David S. Touretzky November, 2013 Outline Anderson: parietal cells represent locations of visual stimuli. Zipser and Anderson: a backprop network

More information

Human Vision Based Object Recognition Sye-Min Christina Chan

Human Vision Based Object Recognition Sye-Min Christina Chan Human Vision Based Object Recognition Sye-Min Christina Chan Abstract Serre, Wolf, and Poggio introduced an object recognition algorithm that simulates image processing in visual cortex and claimed to

More information

More Learning. Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA

More Learning. Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA More Learning Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA 1 Ensembles An ensemble is a set of classifiers whose combined results give the final decision. test feature vector

More information

Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach

Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach Vandit Gajjar gajjar.vandit.381@ldce.ac.in Ayesha Gurnani gurnani.ayesha.52@ldce.ac.in Yash Khandhediya khandhediya.yash.364@ldce.ac.in

More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

A robust method for automatic player detection in sport videos

A robust method for automatic player detection in sport videos A robust method for automatic player detection in sport videos A. Lehuger 1 S. Duffner 1 C. Garcia 1 1 Orange Labs 4, rue du clos courtel, 35512 Cesson-Sévigné {antoine.lehuger, stefan.duffner, christophe.garcia}@orange-ftgroup.com

More information

Slides adapted from Marshall Tappen and Bryan Russell. Algorithms in Nature. Non-negative matrix factorization

Slides adapted from Marshall Tappen and Bryan Russell. Algorithms in Nature. Non-negative matrix factorization Slides adapted from Marshall Tappen and Bryan Russell Algorithms in Nature Non-negative matrix factorization Dimensionality Reduction The curse of dimensionality: Too many features makes it difficult to

More information

Volume Rendering. Computer Animation and Visualisation Lecture 9. Taku Komura. Institute for Perception, Action & Behaviour School of Informatics

Volume Rendering. Computer Animation and Visualisation Lecture 9. Taku Komura. Institute for Perception, Action & Behaviour School of Informatics Volume Rendering Computer Animation and Visualisation Lecture 9 Taku Komura Institute for Perception, Action & Behaviour School of Informatics Volume Rendering 1 Volume Data Usually, a data uniformly distributed

More information

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VI (Nov Dec. 2014), PP 29-33 Analysis of Image and Video Using Color, Texture and Shape Features

More information

Final Exam Study Guide

Final Exam Study Guide Final Exam Study Guide Exam Window: 28th April, 12:00am EST to 30th April, 11:59pm EST Description As indicated in class the goal of the exam is to encourage you to review the material from the course.

More information

5.6 Self-organizing maps (SOM) [Book, Sect. 10.3]

5.6 Self-organizing maps (SOM) [Book, Sect. 10.3] Ch.5 Classification and Clustering 5.6 Self-organizing maps (SOM) [Book, Sect. 10.3] The self-organizing map (SOM) method, introduced by Kohonen (1982, 2001), approximates a dataset in multidimensional

More information

Structured Completion Predictors Applied to Image Segmentation

Structured Completion Predictors Applied to Image Segmentation Structured Completion Predictors Applied to Image Segmentation Dmitriy Brezhnev, Raphael-Joel Lim, Anirudh Venkatesh December 16, 2011 Abstract Multi-image segmentation makes use of global and local features

More information

Joint design of data analysis algorithms and user interface for video applications

Joint design of data analysis algorithms and user interface for video applications Joint design of data analysis algorithms and user interface for video applications Nebojsa Jojic Microsoft Research Sumit Basu Microsoft Research Nemanja Petrovic University of Illinois Brendan Frey University

More information

Two-Stream Convolutional Networks for Action Recognition in Videos

Two-Stream Convolutional Networks for Action Recognition in Videos Two-Stream Convolutional Networks for Action Recognition in Videos Karen Simonyan Andrew Zisserman Cemil Zalluhoğlu Introduction Aim Extend deep Convolution Networks to action recognition in video. Motivation

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

Particle Swarm Optimization applied to Pattern Recognition

Particle Swarm Optimization applied to Pattern Recognition Particle Swarm Optimization applied to Pattern Recognition by Abel Mengistu Advisor: Dr. Raheel Ahmad CS Senior Research 2011 Manchester College May, 2011-1 - Table of Contents Introduction... - 3 - Objectives...

More information

A SYNOPTIC ACCOUNT FOR TEXTURE SEGMENTATION: FROM EDGE- TO REGION-BASED MECHANISMS

A SYNOPTIC ACCOUNT FOR TEXTURE SEGMENTATION: FROM EDGE- TO REGION-BASED MECHANISMS A SYNOPTIC ACCOUNT FOR TEXTURE SEGMENTATION: FROM EDGE- TO REGION-BASED MECHANISMS Enrico Giora and Clara Casco Department of General Psychology, University of Padua, Italy Abstract Edge-based energy models

More information

Neurally Inspired Mechanisms for the Dynamic Visual Attention Map Generation Task

Neurally Inspired Mechanisms for the Dynamic Visual Attention Map Generation Task Neurally Inspired Mechanisms for the Dynamic Visual Attention Map Generation Task Maria T. López 1, Miguel A. Fernández 1, Antonio Fernández-Caballero 1, and Ana E. Delgado 2 1 Departamento de Informática

More information

Salient Region Detection and Segmentation in Images using Dynamic Mode Decomposition

Salient Region Detection and Segmentation in Images using Dynamic Mode Decomposition Salient Region Detection and Segmentation in Images using Dynamic Mode Decomposition Sikha O K 1, Sachin Kumar S 2, K P Soman 2 1 Department of Computer Science 2 Centre for Computational Engineering and

More information

FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU

FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU 1. Introduction Face detection of human beings has garnered a lot of interest and research in recent years. There are quite a few relatively

More information

Quasi-thematic Features Detection & Tracking. Future Rover Long-Distance Autonomous Navigation

Quasi-thematic Features Detection & Tracking. Future Rover Long-Distance Autonomous Navigation Quasi-thematic Feature Detection And Tracking For Future Rover Long-Distance Autonomous Navigation Authors: Affan Shaukat, Conrad Spiteri, Yang Gao, Said Al-Milli, and Abhinav Bajpai Surrey Space Centre,

More information

then assume that we are given the image of one of these textures captured by a camera at a different (longer) distance and with unknown direction of i

then assume that we are given the image of one of these textures captured by a camera at a different (longer) distance and with unknown direction of i Image Texture Prediction using Colour Photometric Stereo Xavier Lladó 1, Joan Mart 1, and Maria Petrou 2 1 Institute of Informatics and Applications, University of Girona, 1771, Girona, Spain fllado,joanmg@eia.udg.es

More information

Deep Convolutional Neural Networks. Nov. 20th, 2015 Bruce Draper

Deep Convolutional Neural Networks. Nov. 20th, 2015 Bruce Draper Deep Convolutional Neural Networks Nov. 20th, 2015 Bruce Draper Background: Fully-connected single layer neural networks Feed-forward classification Trained through back-propagation Example Computer Vision

More information

Multiple-Choice Questionnaire Group C

Multiple-Choice Questionnaire Group C Family name: Vision and Machine-Learning Given name: 1/28/2011 Multiple-Choice naire Group C No documents authorized. There can be several right answers to a question. Marking-scheme: 2 points if all right

More information

Advanced Techniques for Mobile Robotics Bag-of-Words Models & Appearance-Based Mapping

Advanced Techniques for Mobile Robotics Bag-of-Words Models & Appearance-Based Mapping Advanced Techniques for Mobile Robotics Bag-of-Words Models & Appearance-Based Mapping Wolfram Burgard, Cyrill Stachniss, Kai Arras, Maren Bennewitz Motivation: Analogy to Documents O f a l l t h e s e

More information

Texture Image Segmentation using FCM

Texture Image Segmentation using FCM Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore Texture Image Segmentation using FCM Kanchan S. Deshmukh + M.G.M

More information

Sparse Models in Image Understanding And Computer Vision

Sparse Models in Image Understanding And Computer Vision Sparse Models in Image Understanding And Computer Vision Jayaraman J. Thiagarajan Arizona State University Collaborators Prof. Andreas Spanias Karthikeyan Natesan Ramamurthy Sparsity Sparsity of a vector

More information