Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach

Abstract

Automatic linguistic indexing of pictures is an important but highly challenging problem for researchers in content-based image retrieval. So far, three categories of technology have been developed to realize content-based image retrieval systems. In this paper, the author introduces a 2D MHMM (2-dimensional Multiresolution Hidden Markov Model) modeling approach to solve this problem. Experiments have demonstrated the good accuracy of this approach and its high potential in linguistic indexing of photographic images.

Automatic linguistic indexing of pictures is a challenging problem for content-based image retrieval systems. Why? Let's first look at what content-based image retrieval is.

1. Content-based image retrieval definition:

Content-based image retrieval (CBIR) aims at efficient retrieval of relevant images from large image databases based on automatically derived imagery features. So far there are three categories of CBIR technology.

1.1. Categories of CBIR technology:

High-level semantics description: This kind of CBIR system is typically divided into two components. One component processes the semantic information of every image before it is stored in the image database; the other processes user queries. Semantic information is described using an ontology, so the whole system is actually an image knowledge base. Projects include The Helsinki University
Museum, MINDSWAP, etc. Big problem: this kind of system does not process the physical features of images, so users cannot query images by physical features. Moreover, all the semantic information is edited manually by the builders of the image database, which is hard work.

Low-level feature classification: This kind of CBIR system analyzes an image's physical features and returns the images with the highest match probabilities to the user. Some of these systems do incorporate semantic feedback, but their focus is on physical features; semantic information is just a tool to help classify images more accurately. Projects include QBIC, PicSeek, MARS, Netra, etc. Big problem: these systems cannot satisfy users' semantic queries, for example, "Give me one image of this dog when it was a puppy." These systems do not know what a dog is, what a puppy is, and so on. So a third technology was developed.

Linking high-level semantic description to low-level feature information: this means automatically assigning comprehensive textual descriptions to pictures, which imitates the way human beings think when we see an image. The technology introduced in this article belongs to this category.

1.2. Problems in automatic annotation of images:

Automatic mapping between low-level features and knowledge: this means that before an image is stored in the image database, the system can infer the image's physical features and assign the knowledge corresponding to those features to the image (e.g., who is the person in the picture? What is he doing?). Modeling the semantic content: this means deciding what kind of domain knowledge the computer can use to describe a given kind of image, and how the computer can acquire and store that knowledge. This article solves part of the first problem. A natural way of summarizing an image's physical features is statistical: using statistical models, the computer can classify images according to statistical rules.
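To make the statistical-classification idea concrete, here is a toy sketch. This is entirely my own illustration, not the paper's method: each concept is represented by a single Gaussian over block feature vectors, and a feature vector is assigned to the concept with the highest log-likelihood. The concept names, means, and covariances are made up.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian, computed directly."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(cov, d))

# Two hypothetical concept models over 6-D block features (made-up parameters).
models = {
    "tiger": (np.zeros(6), np.eye(6)),
    "beach": (np.ones(6) * 2, np.eye(6) * 0.5),
}

x = np.ones(6) * 2.1  # a feature vector close to the "beach" model
best = max(models, key=lambda c: gaussian_logpdf(x, *models[c]))
print(best)  # beach
```

The real system replaces the single Gaussian per concept with a 2-D MHMM, but the decision rule, picking the concept whose model assigns the query the highest likelihood, is the same shape.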
Now let us see what this article's approach is.
2. Approach: Using a 2D MHMM (2-dimensional multiresolution Hidden Markov Model)

Why choose a 2D MHMM? An HMM is suitable for block-based image classification. In block-based classification, the image is first divided into blocks, so block size is the critical point. If the block size is large, each block will include more objects, which makes it hard to classify. If the block size is small, there will be dependence between blocks, and an HMM can be used to model this dependence. For the HMM, the image's category is the state, and its feature vector is the observation symbol for the state. Compared with a 1-D HMM, a 2-D HMM solves the problem of overlocalization. In a 2-D HMM there is a set of superstates, and within each superstate there is a set of simple Markovian states. Superstates correspond to rows of the 2-D grid, and the simple states correspond to the columns within one superstate. This concept is reflected in Figure 1: the state transition probability of A2 depends on A1 and A3. In particular applications, this model works better than a 1-D HMM.

Figure 1
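The 2-D transition structure can be sketched in code. This is a minimal illustration, not the paper's implementation: the uniform initial distribution and the boundary handling for the first row and column are my own assumptions, introduced only so the sketch is self-contained.

```python
import numpy as np

M = 3                              # number of states (hypothetical)
rng = np.random.default_rng(0)

# Transition probabilities a[m, n, l] = P(s[i,j]=l | s[i-1,j]=m, s[i,j-1]=n).
a = rng.random((M, M, M))
a /= a.sum(axis=2, keepdims=True)  # normalize over the current state l

pi = np.full(M, 1.0 / M)           # uniform initial distribution (assumption)

def grid_log_prob(states, a, pi):
    """Log-probability of a 2-D grid of states, scanned row by row.
    Boundary blocks fall back to simpler conditionals (my assumption)."""
    logp = 0.0
    rows, cols = states.shape
    for i in range(rows):
        for j in range(cols):
            l = states[i, j]
            if i == 0 and j == 0:
                logp += np.log(pi[l])
            elif i == 0:                      # top row: condition on left only
                n = states[i, j - 1]
                logp += np.log(a[:, n, l].sum() / a[:, n, :].sum())
            elif j == 0:                      # left column: condition on top only
                m = states[i - 1, j]
                logp += np.log(a[m, :, l].sum() / a[m, :, :].sum())
            else:
                m, n = states[i - 1, j], states[i, j - 1]
                logp += np.log(a[m, n, l])
    return logp

grid = rng.integers(0, M, size=(4, 4))
print(grid_log_prob(grid, a, pi))
```

The key point the sketch shows is that each interior block's state is conditioned on its top and left neighbors through the shared table a[m, n, l], which is exactly the dependence a 1-D chain cannot express.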
In a 2-D HMM, global information can be used efficiently. But from the viewpoint of computational complexity, it is desirable to use larger blocks, while larger blocks tend to include more objects. To resolve this tension, the author introduces multiresolution: images at lower resolutions have fewer states than images at higher resolutions.

Figure 2
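The parent-child relationship across resolutions forms a quadtree: each block at a coarser resolution covers four blocks at the next finer resolution. A small sketch of the index mapping (my own illustration of the structure, not code from the paper):

```python
def parent(i, j):
    """Index of the parent block at the next coarser resolution."""
    return i // 2, j // 2

def children(i, j):
    """Indices of the four child blocks at the next finer resolution."""
    return [(2 * i + di, 2 * j + dj) for di in (0, 1) for dj in (0, 1)]

# Every child maps back to its parent:
print(children(1, 2))                        # [(2, 4), (2, 5), (3, 4), (3, 5)]
assert all(parent(ci, cj) == (1, 2) for ci, cj in children(1, 2))
```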
2.1. Application architecture

The following figure shows the application architecture of this technology.

Figure 3

2.1.1. Select one category of images to train for one concept: A concept corresponds to a particular category of images. (A concept does not just correspond to one word; a cluster of words can be considered one concept.) These images do not have to be visually similar.

2.1.2. Extract features from this category of images: Every picture is 384 x 256 pixels. An image is partitioned into 4 x 4 blocks. For each block, the system extracts a six-dimensional feature vector using a wavelet transform.

2.1.3. Statistical modeling: To build a 2-D MHMM, several assumptions are made:

1. Let s_{i,j} be the state of block (i, j) and u_{i,j} the feature vector of block (i, j). The state transition probability is P(s_{i,j} | s_{i-1,j}, s_{i,j-1}) = a_{m,n,l}, where m = s_{i-1,j}, n = s_{i,j-1}, and l = s_{i,j}.

2. The second assumption is that, given its state, each feature vector follows a (multivariate) Gaussian distribution.
3. For the MHMM, denote the set of resolutions by {1, ..., R}, with r = R the finest resolution. Let the collection of block indices at resolution r be N^(r).

4. In particular, given the states at the parent resolution, the states at the current resolution are conditionally independent of the other preceding resolutions.

5. In addition, given its state, a feature vector at any resolution is conditionally independent of any other states and feature vectors.

6. Several independence assumptions: the feature vector of a block is conditionally independent of information on other blocks once the state of that block is known; the states at one resolution are conditionally independent of the other preceding resolutions, given the parent resolution.

7. According to the above assumptions, the joint probability of a particular set of states and feature vectors at one resolution factors as P({s_{i,j}, u_{i,j}}) = P({s_{i,j}}) * prod_{(i,j)} P(u_{i,j} | s_{i,j}).

8. Also assume that child blocks descended from different parent blocks are conditionally independent: the states of child blocks are independent of the states of their "uncle" blocks, but the state transition probabilities depend on the state of their parent block, so the transition probabilities are computed conditioned on the parent state.

9. The joint probability of states and feature vectors at all the resolutions in (1)
is then derived as the product across resolutions: P({s^(r), u^(r) : r = 1, ..., R}) = P({s^(1), u^(1)}) * prod_{r=2}^{R} P({s^(r), u^(r)} | {s^(r-1)}). To summarize, a 2D MHMM captures both the inter-scale and intra-scale statistical dependence: the inter-scale dependence is modeled by the Markov chain over resolutions, and the intra-scale dependence is modeled by the HMM.

10. The model is trained using the EM algorithm.

2.1.4. Automatic linguistic indexing of pictures

After obtaining the models, we can start automatic linguistic indexing of pictures. Use the model of each concept to compute the log likelihood of the query image under that model, and sort the log values to find the k top-ranked categories. (The selection of k is somewhat arbitrary. An adaptive way to decide k would be to use categories with likelihoods exceeding a threshold; however, it is found that the range of likelihoods computed from a query image varies greatly depending on the category the image belongs to, so a fixed threshold is not useful. When there are a large number of categories in the database, it is observed that choosing a fixed number of top-ranked categories tends to yield relatively robust annotation.)

After getting the k candidate concepts, the author does not simply use all of their words to annotate the image: k may be too large for a short description of one image. So a statistical method is introduced to select a subset of words from the k concepts. Suppose a word appears j times among the k top-ranked categories; the system computes the probability of the word appearing that often purely by chance. A small probability indicates that the word is unlikely to have appeared simply by chance, and hence indicates a high level of significance for the word.
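Under my reading of the scheme, this chance probability is a hypergeometric tail: if a word is attached to m of the M categories in the whole database, the probability that it appears in at least j of k randomly selected categories is the tail sum below. The function and variable names are my own.

```python
from math import comb

def word_significance(j, k, m, M):
    """P(word appears in >= j of k randomly chosen categories), given that
    the word is attached to m of the M categories in total (hypergeometric
    tail). A small value suggests the word did not land among the top-ranked
    categories merely by chance."""
    total = comb(M, k)
    return sum(comb(m, i) * comb(M - m, k - i)
               for i in range(j, min(k, m) + 1)) / total

# A word attached to only 5 of 600 categories that still shows up in 3 of the
# top 10 ranked categories is far less likely to be chance than a word
# attached to 300 of the 600 categories:
print(word_significance(3, 10, 5, 600))    # very small
print(word_significance(3, 10, 300, 600))  # close to 1
```

This also makes the "favors rare words" property discussed next quantitative: for fixed j and k, the tail probability shrinks as m shrinks, so rare words score as more significant.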
The advantage: the proposed scheme of choosing words favors rare words. It tends to provide relatively specific or interesting information about the query image and avoids words that fit a large number of image categories.

2.2. Experiment

The author conducted experiments on the COREL dataset, which includes 600 categories with 100 images in each. They trained 600 concepts, using 40 training images for each concept, and tested on 4,630 images outside the training set. Words were manually assigned to every image category. Training each of the 600 categories took 15-40 minutes on an 800 MHz Pentium III PC.
2.2.1. Accuracy

Table 2

Accuracy means the match percentage over the 4,630 test images, where "match" means that the category the system assigns to a test image is the category the image actually belongs to.

2.3. Conclusion and future work

2.3.1. Conclusion

My opinion: this article proposes an approach that tackles part of the first problem in automatic annotation of pictures. It can also be seen as a classification technology for pictures. Its advantage over other low-level feature classification technologies is that it links concepts and features in order to establish a concept index, making keyword queries somewhat intelligent. But because it does not address the second problem, it is still not intelligent enough for content-based image retrieval. For the other query method, in which the user inputs an image, the paper does not give a comparison with other technologies.

The following conclusions are from the article itself; you can take a look at them in the paper. The paper proposes a 2D MHMM modeling approach to solve the problem of automatic linguistic indexing of pictures; the index is the model of one category of pictures. Advantages of this approach: 1. Models for different concepts can be independently trained and retrained, so the system has good scalability. 2. The spatial relation among image pixels within and across resolutions is taken into consideration, with probabilistic likelihood as a universal measure. Limitations: 1. The concept dictionary is trained using only 2D images without a sense of object
size. 2. Forty training images are insufficient for the computer program to build a reliable model for a complex concept.

2.3.2. Future work

Improve the indexing speed of the system by using approximation in the likelihood computation. A rule-based system may be used to post-process the automatically annotated words to eliminate conflicting semantics.