Extraction of Web Image Information: Semantic or Visual Cues?

Similar documents
Keywords Data alignment, Data annotation, Web database, Search Result Record

ImgSeek: Capturing User s Intent For Internet Image Search

AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH

Image Similarity Measurements Using Hmok- Simrank

Video annotation based on adaptive annular spatial partition scheme

ANovel Approach to Collect Training Images from WWW for Image Thesaurus Building

ACM MM Dong Liu, Shuicheng Yan, Yong Rui and Hong-Jiang Zhang

A Review on Identifying the Main Content From Web Pages

Annotating Multiple Web Databases Using Svm

A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP

Volume 2, Issue 6, June 2014 International Journal of Advance Research in Computer Science and Management Studies

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating

An Efficient Methodology for Image Rich Information Retrieval

ISSN: , (2015): DOI:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 2013 ISSN:

Studying the Impact of Text Summarization on Contextual Advertising

Recognising Informative Web Page Blocks Using Visual Segmentation for Efficient Information Extraction

Semantic Image Clustering Using Object Relation Network

Deep Web Crawling and Mining for Building Advanced Search Application

AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS

IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK

Tag Based Image Search by Social Re-ranking

A Web Page Segmentation Method by using Headlines to Web Contents as Separators and its Evaluations

Content Based Image Retrieval with Semantic Features using Object Ontology

SIEVE Search Images Effectively through Visual Elimination

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2

A hybrid method to categorize HTML documents

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages

A REVIEW ON SEARCH BASED FACE ANNOTATION USING WEAKLY LABELED FACIAL IMAGES

Domain-specific Concept-based Information Retrieval System

Webpage Understanding: Beyond Page-Level Search

Inferring User Search for Feedback Sessions

A Robust Wipe Detection Algorithm

International Journal of Advanced Research in Computer Science and Software Engineering

MURDOCH RESEARCH REPOSITORY

Topic Diversity Method for Image Re-Ranking

Survey on Recommendation of Personalized Travel Sequence

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier

Holistic Correlation of Color Models, Color Features and Distance Metrics on Content-Based Image Retrieval

A Supervised Method for Multi-keyword Web Crawling on Web Forums

Page Segmentation by Web Content Clustering

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

Knowledge Engineering in Search Engines

An Enhanced Image Retrieval Using K-Mean Clustering Algorithm in Integrating Text and Visual Features

Web Page Segmentation for Small Screen Devices Using Tag Path Clustering Approach

Focused crawling: a new approach to topic-specific Web resource discovery. Authors

Sign Language Recognition using Dynamic Time Warping and Hand Shape Distance Based on Histogram of Oriented Gradient Features

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

An Efficient Semantic Image Retrieval based on Color and Texture Features and Data Mining Techniques

Context based Re-ranking of Web Documents (CReWD)

Advanced Community Question Answering Sites with Multimedia Answers

Latest development in image feature representation and extraction

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries

An Annotation Tool for Semantic Documents

INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP

Text Document Clustering Using DPM with Concept and Feature Analysis

Query Difficulty Prediction for Contextual Image Retrieval

Overview of Web Mining Techniques and its Application towards Web

Creating Large-scale Training and Test Corpora for Extracting Structured Data from the Web

Mining Contents in Web Pages and Ranking of Web Pages Using Cosine Similarity

Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization

Crawler with Search Engine based Simple Web Application System for Forum Mining

A Miniature-Based Image Retrieval System

User Profiling for Interest-focused Browsing History

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI

A New Technique to Optimize User s Browsing Session using Data Mining

Cluster-based Instance Consolidation For Subsequent Matching

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining

R. R. Badre Associate Professor Department of Computer Engineering MIT Academy of Engineering, Pune, Maharashtra, India

Video Analysis for Browsing and Printing

A Novel Approach for Restructuring Web Search Results by Feedback Sessions Using Fuzzy clustering

Automation of URL Discovery and Flattering Mechanism in Live Forum Threads

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure

Fuzzy C-means Clustering with Temporal-based Membership Function

QUERY REGION DETERMINATION BASED ON REGION IMPORTANCE INDEX AND RELATIVE POSITION FOR REGION-BASED IMAGE RETRIEVAL

Data Extraction and Alignment in Web Databases

Context Based Web Indexing For Semantic Web

Seeing and Reading Red: Hue and Color-word Correlation in Images and Attendant Text on the WWW

ANALYSIS OF DENSE AND SPARSE PATTERNS TO IMPROVE MINING EFFICIENCY

A new method of comparing webpages

RSDC 09: Tag Recommendation Using Keywords and Association Rules

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

Document Retrieval using Predication Similarity

A Hierarchical Document Clustering Approach with Frequent Itemsets

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Using XML Logical Structure to Retrieve (Multimedia) Objects

A Bayesian Approach to Hybrid Image Retrieval

Ranking Web Pages by Associating Keywords with Locations

Image Classification Using Wavelet Coefficients in Low-pass Bands

International Journal of Advance Engineering and Research Development. A Survey of Document Search Using Images

Ontology based Web Page Topic Identification

Searching the Web [Arasu 01]

Clustering Technique with Potter stemmer and Hypergraph Algorithms for Multi-featured Query Processing

Mining Structured Objects (Data Records) Based on Maximum Region Detection by Text Content Comparison From Website

Patch-Based Image Classification Using Image Epitomes

Extracting Representative Image from Web Page

Extracting Representative Image from Web Page

[Supriya, 4(11): November, 2015] ISSN: (I2OR), Publication Impact Factor: 3.785

Transcription:

Extraction of Web Image Information: Semantic or Visual Cues? Georgina Tryfou and Nicolas Tsapatsoulis Cyprus University of Technology, Department of Communication and Internet Studies, Limassol, Cyprus Abstract. Text based approaches for web image information retrieval have been exploited for many years, however the noisy textual content of the web pages makes their task challenging. Moreover, text based systems that retrieve information from textual sources such as image file names, anchor texts, existing keywords and, of course, surrounding text often share the inability to correctly assign all relevant text to an image and discard the irrelevant. A novel method for indexing web images is discussed in the present paper. The main concern of the proposed system is to overcome the obstacle of correctly assigning textual information to web images, while disregarding text that is unrelated to them. The proposed system uses visual cues in order to cluster a web page into several regions and compares this method to the use of semantic information and the realization of a k-means clustering. The evaluation reveals the advantages and disadvantages of the different clustering techniques and confirms the validity of the proposed method for web image indexing. 1 Introduction and Related Work Numerous web image search engines have been developed as the amount of digital image collections on the web constantly increases [1]. These systems share the objective to minimize the necessary human interaction while offering an intuitive image search. The two main approaches that exist in the literature for content extraction and representation of web images are: (i) the text-based and (ii) the visual feature-based methods. The text-based approaches use associate text (i.e image file names, anchor texts, surrounding paragraphs) to derive the content of images. The text blocks that are used as concept sources for images may be extracted with several methods, with the following four being the most popular: (i) fixed-size sequence of terms [2], (ii) DOM tree structure [3, 4], (iii) Web page segmentation [5] and (iv) hybrid versions of the above [6]. The first approach is time-efficient but yields poor results since the extracted text may be irrelevant to the image, or on the other hand, important parts of the relevant text may be discarded. Approaches that use the DOM tree structure of the web page are in general not adaptive and they are designed for specific design patterns. Web page segmentation is a more adequate solution to the problem since it is adaptable to different web page styles. Most of the proposed algorithms in this

field though, are not designed specifically for the problem of image indexing and therefore often deliver poor results. The proposed system uses information obtained following the web page segmentation approach. The paper is organized as follows: Section 2 presents the architecture of the proposed system. Section 3 presents the evaluation of the proposed system. Finally some conclusions and future perspectives are given in Sec. 4. 2 System Architecture The general architecture of the proposed system is depicted in Fig. 1. As shown there the system consists of two main parts, which are described in the following sections. Part I Part II Web page Visual Segmentation Images and Text Blocks Text to Image Assignment Clustering of Text blocks Image Description Fig. 1. The general architecture of the proposed system. 2.1 Visual Segmentation The content extraction of web images is based on textual information that exists in the same web document and refers to this image. In order to determine to which image the various text parts of a web page refer to, we use the visual cues which are connected to the outline and the presentation of the hosting web page. In order to obtain the set of visual segments that form a web page, we use the Visual Based Page Segmentation (VIPS) algorithm [7], which extracts the semantic structure of a web page based on its visual representation. It attempts to make full use of the page layout structure by extracting blocks from the DOM tree structure of the web page and locating separators among these blocks. Each web page is represented as a set of blocks that bare similar Degree of Coherence (DOC). With the permitted DOC (pdoc) set to its maximum value, we obtain a set of visual blocks that consist of visually indivisible contents. An example of a web page segmentation using pdoc = 10 (i.e. maximum allowed value) is illustrated in Fig. 2. 2.2 Text to Image Assignment Each text block found in the web page has to be assigned to an image block. In other words, we attempt to determine to which image, each textual block refers

How much does the Earth's gravity pull on me? Block 1 [edit] It's easy to find your weight on Earth by using a scale. You have weight because the Earth's gravity pulls you towards its center. Normally, the ground or the floor get in the way, making you feel 'stuck' to them. There are several kinds of scales: 1) Comparing of 2 masses (weights). You put the thing(s) you want to weigh on one pan (like some marbles), and then you put several "weights" on the other pan until the pointer shows that both pans have equal weights on them. Then you look at the pan with the known weights on it, and add them all up. The total is the mass of the thing(s) you want to weigh. 2) A spring balance usually has a hook on it, with a pan. You put the thing(s) you want to weigh on the pan, the spring is pulled, and the greater the weight, the further the spring is pulled. That distance, calibrated in pounds or kilogram (or whatever), is usually shown either on a dial or on a linear scale. Block 2 Fig.2. The results of VIPS algorithm on a fragment of a web page. Each visual block is marked with a rectangle region: dashed-red when the block is an image and continuousblue when it is text. to. After this decision, the corresponding text blocks will be adequate to use for the extraction of the image information. The processing that takes place for this module is presented in Fig. 3 and each part is described in further details in the following paragraphs. Visual Blocks Text to Image Assignment Image/text Blocks Separation Distance Calculation Clustering Image Description Fig.3. The processing that takes place in the second part of the system: Text to Image Assignment Image/text Blocks Separation. The first task towards the assignment of each text block to the image block it refers to, is to determine whether a block contains image content or text. The use of the maximum pdoc during VIPS execution certifies that for a well formed HTML document, each block will either be an image or a text block indicating that no blocks with mixed content will be returned. The HTML source code that corresponds to each one of these blocks, is returned by the VIPS algorithm and it is used in order to classify them into two categories: (i) image blocks and (ii) text blocks. Distance Calculation. Once the blocks are separated into these categories, the Euclidean distance between every image/text block pair is calculated making use of the Cartesian coordinates returned by the VIPS algorithm. The distance calculation in this case is not a trivial problem since the goal is to quantify the intuitive understanding of how close two visual blocks are. This understanding depends not only on the distance of the centres of the two visual blocks, but also on their shape, size and relative position. In order to solve the distance calculation problem several approaches were taken into consideration. The calculation

of the distance between the closest edges of the blocks was found to offer the better representation of the block distance. Clustering. After the distance calculation the web page has to be clustered into regions. Each region or cluster is defined by an image in its center and contains all the text blocks that have been found to refer to this image. For the clustering we took into consideration two different approaches. The first one is based only in visual cues and the location of the various blocks while the second approach mines semantic information and implements a k-means clustering. Clustering Based on Visual Cues. In this approach, the text blocks are assigned to the corresponding cluster making use of the visual information that is available for them: the Euclidean coordinates which are returned be the VIPS algorithm are used for distance calculation and each text block is assigned to the cluster, whose center it is closest to. However, it is possible that one or more blocks of text do not refer to a certain image of the web page. For this reason, it is necessary to discard one or more text blocks from the calculated clusters. In order to determine which blocks of text are irrelevant to the image they are connected to, the blocks whose distance to the cluster center (i. e. corresponding web image) is longer than a defined threshold t, are discarded. In order to calculate this threshold, the distances d c i that appear in a cluster c are normalized to the maximum distance. Using the normalized values d c i the threshold t is calculated as t ed = t +m ed s ed, where t is a static, predefined threshold (in our experiment is empirically set to 0.1), m ed is the mean value of the euclidean distances found in the cluster c and s ed is their standard deviation. Clustering Based on Semantic Information. Semantic information, as obtained from the application of the Vector Space Model [8], is used in the second approach in order to create a clustering which is based on the content of each block rather than its location on the web page. The web page is considered the corpus from which the vocabulary is extracted. Each text block tb i is one of the corpus documents and it is expressed as a vector of term weights as v i = [w 1i w 2i...w Ni ] T. The term w ti indicates the weight of text t in text block tb i and is calculated using the tf*idf [8] statistic which expresses how important is each word for the representation of a text block. The k-means algorithm is then used on the vectors of term weights in order to cluster the text blocks into M regions. Since each text block may refer to any image but it may also be irrelevant to every image, the number of clusters M is equal to the total number of images found in the web page plus one for text blocks irrelevant to every image. Once the k-means algorithm is executed each text block is assigned to a specific cluster. However, it is not yet determined to which image each cluster refers to. To find the solution in this problem we considered the average distances among images and clusters as well as an initialization to the k-means algorithm based on the output of the first approach to the clustering procedure. The results presented in Sec. 3 are obtained using this initialization.

3 Evaluation A set of manually labelled web images has been collected in order to create a corpus, based on which, the clustering of text blocks is evaluated. This corpus was obtained using an annotation tool that we designed in order to facilitate web image labelling. The annotator had to assign relevant text to each one of the images that exist in web pages which were rendered in the default browser. A dataset that consists of 40 web pages and their annotations was created in order to evaluate our method. Using the above described corpus and the evaluation measures Precision, Recall and F-score as defined in [9], we obtained two different sets of results using the clustering methods described earlier. In the first set of results, the whole processing is based on visual information as obtained from the application of the VIPS algorithm. In the second set, the results are based on semantic information, as it is obtained from the Vector Space Model applied on the content of the web page. As shown in Table 1 there is a significant variation on the efficiency of the two approaches with the first approach having the highest average F-score. Table 1. The results for different Text Block Discarding Methods. Clustering Method Precision Recall F-score Visual 0.8610 0.8836 0.8962 Semantic 0.3576 0.4528 0.4533 The execution of the proposed method yields an average F-score equal to 0.8610 for the total of the 131 annotated images. As shown in Fig. 4, more than 80% of the text blocks are identified with Recall and Precision values higher than 0.8, indicating that the system succeeds in retrieving most of the blocks that according to the annotation refer to a certain image. 1 Cumulative Distribution Function 0.8 0.6 0.4 0.2 0 0 0.2 0.4 0.6 0.8 1 Content F-measure Fig. 4. Cumulative distribution function of F-measure.

4 Conclusions In this paper we presented an image indexing system that uses textual information in order to extract the concept of the images that are found in a web page. The method uses visual cues in order to identify the segments of the web page and calculates euclidean distances among these segments. It delivers a semantic or euclidean clustering of the contents of a web page in order to assign textual information to the existing images. During the experimenting and the evaluation it was found that a clustering based on semantic information of the contextual information delivers poor results compared to the clustering that is based on visual information. This stresses out the importance of understanding and processing the web pages as a structured visual document when it comes to web image indexing rather than an unstructured bag of words. The weight vectors that were used for the k-means clustering are in general sparse vectors, since the length of the vocabulary is not proportional to the size of each text block. Moreover, when two neighbouring text blocks refer to the same image and contain similar semantic content, the authors usually select synonyms to express the same meaning. The Vector Space Model does not account such connections among different terms. It is therefore expected that the use of an ontology that describes the semantic distances and relations among different words and phrases will improve the results of the k-means clustering and this is the direction of our future study. References [1] Sclaroff, S., La Cascia, M., Sethi, S., Taycher, L.: Unifying textual and visual cues for content-based image retrieval on the world wide web. Computer Vision and Image Understanding 75(1 2) (1999) 86 98 [2] Feng, H., Shi, R., Chua, T.S.: A bootstrapping framework for annotating and retrieving www images. In: Proc. of the 12th ACM Int. Conf. on Multimedia. (2004) [3] Hua, Z., Wang, X.J., Liu, Q., Lu, H.: Semantic knowledge extraction and annotation for web images. In: Proc. of the 13th ACM Int. Conf. on Multimedia. (2005) 467 470 [4] Fauzi, F., Hong, J.L., Belkhatir, M.: Webpage segmentation for extracting images and their surrounding contextual information. In: Proc. of the 17th ACM Int. Conf. on Multimedia. (2009) [5] He, X., Cai, D., Wen, J.R., Ma, W.Y., Zhang, H.J.: Clustering and searching www images using link and page layout analysis. ACM Trans. on Multimedia Computing, Communications and Applications 3(2) (June 2007) [6] Alcic, S., Conrad, S.: A clustering-based approach to web image context extraction. In: Proc. of the 19th ACM Int. Conf. on Multimedia. (2011) 74 79 [7] Cai, D., Yu, S., Wen, J.R., Ma, W.Y.: Vips: a vision based page segmentation algorithm. Technical report, Microsoft Research (2003) [8] Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18(11) (1975) 613 620 [9] Alcic, S., Conrad, S.: Measuring performance of web image context extraction. In: Proc. of the 10th Int. Workshop on Multimedia Data Mining. (July 2010) 1 8