A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information

George E. Tsekouras*, Damianos Gavalas, Stefanos Filios, Antonios D. Niros, and George Bafaloukas

Department of Cultural Technology and Communication, University of the Aegean, 81100 Mytilene, Lesvos, Greece
gtsek@ct.aegean.gr, dgavalas@aegean.gr

Abstract. We present a novel focused crawling method for extracting and processing cultural data from the web in a fully automated fashion. After downloading the pages, we extract from each document a number of words for each thematic cultural area. We then create multidimensional document vectors comprising the most frequent word occurrences. The dissimilarity between these vectors is measured by the Hamming distance. In the last stage, we employ cluster analysis to partition the document vectors into a number of clusters. Finally, our approach is illustrated via a proof-of-concept application which scrutinizes hundreds of web pages spanning different cultural thematic areas.

Keywords: web crawling, HTML parser, document vector, cluster analysis, Hamming distance, similarity measure, filtering.

1 Introduction

Web crawlers typically perform a simulated browsing of the web by extracting links from pages, downloading all of them and repeating the process ad infinitum. This process requires enormous amounts of hardware and network resources and ends up with a large fraction of the visible web on the crawler's storage array. When information about a predefined topic is desired, however, a specialization of this process called focused crawling is used [1]. When searching for further relevant web pages, the focused crawler starts from the given pages and recursively explores the linked web pages [1, 2]. While simple crawlers perform a breadth-first search of the whole web, focused crawlers explore only a small portion of the web using a best-first search guided by the user's interest and based on similarity estimations [1, 2].

To maintain a fast information retrieval process, a focused web crawler has to classify web documents according to shared characteristics. One of the most efficient ways to address this issue is cluster analysis. This article introduces a novel clustering-based focused crawler that involves: (a) creation of multidimensional document vectors comprising the most frequent word occurrences; (b) calculation of the Hamming distances between the culture-related documents; (c) partitioning of the documents into a number of clusters.

* Corresponding author.

J. Darzentas et al. (Eds.): SETN 2008, LNAI 5138, pp. 419-424, 2008. © Springer-Verlag Berlin Heidelberg 2008

The remainder of the article is organized as follows: Section 2 describes the structure of the focused web crawler, Section 3 discusses the experimental results, and Section 4 presents the conclusions of the present work.

2 The Proposed Method

Prior to clustering the web documents, our algorithm involves the retrieval (crawling) and parsing of the documents, as well as the calculation of their document vectors and of the distance matrix. The high-level process of crawling, parsing, filtering and clustering of the downloaded web pages is illustrated in Figure 1. In detail, our algorithm is described in the following subsections.

[Figure 1. The high-level process of the proposed focused web-crawler: the web crawler fetches HTML source documents from the Internet; the HTML parser produces XML content; noise words are filtered against a general dictionary; cultural terms are counted against a cultural dictionary to decide whether a page is a cultural document; document vectors are created; Hamming distances among documents are calculated; and the documents are clustered offline into thematic cultural clusters.]

2.1 Crawling Procedure

For the retrieval of web pages we utilize a simple recursive procedure which enables breadth-first searches through the links in web pages across the Internet. The application downloads the first document, retrieves the web links included within the page and then recursively downloads the pages these links point to, until the requested number of documents has been downloaded. The respective pseudo-code implementation is given in Figure 2.

    Initialize: UrlsDone = {}; UrlsTodo = {firstsite.seed.htm, secondsite.seed.htm, ...}
    Repeat:
        url = UrlsTodo.getNext()
        ip = DNSlookup(url.getHostName())
        html = DownloadPage(ip, url.getPath())
        UrlsDone.insert(url)
        newUrls = parseForLinks(html)
        For each newUrl:
            If not UrlsDone.contains(newUrl) then UrlsTodo.insert(newUrl)

Fig. 2. The web crawling algorithm
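Figure 2 is pseudocode only; below is a minimal Python rendering of the same breadth-first crawl loop, offered as a sketch rather than the authors' implementation. The use of the requests library, the regular-expression link extraction, the seed URLs and the page limit are illustrative assumptions.

    # Minimal breadth-first crawler sketch; seeds, page limit and link extraction are assumptions.
    import re
    from collections import deque
    from urllib.parse import urljoin

    import requests

    def crawl(seed_urls, max_pages=100):
        urls_done = set()
        urls_todo = deque(seed_urls)
        pages = {}                                        # url -> raw HTML
        link_pattern = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

        while urls_todo and len(pages) < max_pages:
            url = urls_todo.popleft()
            if url in urls_done:
                continue
            try:
                html = requests.get(url, timeout=5).text  # DownloadPage(ip, path)
            except requests.RequestException:
                continue
            urls_done.add(url)                            # UrlsDone.insert(url)
            pages[url] = html
            for link in link_pattern.findall(html):       # parseForLinks(html)
                new_url = urljoin(url, link)
                if new_url not in urls_done:              # skip already visited pages
                    urls_todo.append(new_url)
        return pages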

2.2 Parsing of HTML Documents

The parser keeps the title and the clear-text content of the document, as well as the URL addresses its links point to. The title and the document's body content are then translated into XML format. Noise words such as articles, as well as punctuation marks, are filtered out. Documents that do not include sufficient cultural content are deleted. The documents pointed to by the extracted URL addresses are retrieved and parsed in the next execution round of the algorithm, unless they have already been appended to the UrlsDone list.

2.3 Calculation of the Document Vectors

For each parsed document we calculate the respective document vector, denoted as DV. The DV lists the descriptive and most useful words included in the document, together with their frequencies of appearance. For instance, if a document D_i includes the words a, b, c and d with frequencies 3, 2, 8 and 6 respectively, then its document vector is DV_i = [3a, 2b, 8c, 6d]. The dimension of DV_i equals the number of words it includes (|DV_i| = 4 in the previous example) and varies from document to document. Next, we reorder each DV_i in descending order of word frequency, so the vector of the previous example becomes DV_i = [8c, 6d, 3a, 2b]. Finally, we filter DV_i, keeping only a fixed number T of words, so that all DV_i are of equal dimension. Thus, for T = 2, the vector of the previous example becomes DV_i = [8c, 6d]. The filtering discards some information, since we no longer know which low-frequency words are included in each document. If no filtering is applied, the worst-case scenario for a dictionary of W words is a W-dimensional DV_i, in which every word of the dictionary appears exactly once in the document.

2.4 Calculation of the Document Vectors' Distance Matrix

We calculate the distance matrix DM of the parsed web documents. Each scalar element DM_{i,j} of this matrix equals the distance between the document vectors DV_i and DV_j, DM_{i,j} = d(DV_i, DV_j), and represents the dissimilarity of the words included in the corresponding documents; that is, the more the words (and their frequencies) of documents D_i and D_j differ, the higher the value of DM_{i,j}. The values DM_{i,j} are calculated based on the Hamming distance: if a word w with frequency f appears in D_i but not in D_j, or vice versa, DM_{i,j} increases by f; otherwise, if w appears in D_i with frequency f_i and in D_j with frequency f_j, DM_{i,j} increases by the absolute difference |f_i - f_j|. The distance matrix provides the input to the clustering algorithm presented in the following section.
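The sketch below illustrates the two steps just described: building a top-T document vector and computing the frequency-based Hamming distance between two such vectors. The tokenization and the helper names are illustrative assumptions, not the authors' code.

    # Sketch: top-T document vectors and the frequency-based Hamming distance.
    from collections import Counter

    def document_vector(text, t=30, noise_words=frozenset()):
        """Keep the T most frequent non-noise words of a document as {word: frequency}."""
        words = [w for w in text.lower().split() if w.isalpha() and w not in noise_words]
        return dict(Counter(words).most_common(t))

    def hamming_distance(dv_i, dv_j):
        """Words present in only one vector add their full frequency; shared words add |f_i - f_j|."""
        return sum(abs(dv_i.get(w, 0) - dv_j.get(w, 0)) for w in set(dv_i) | set(dv_j))

    # Example: DV_i = [8c, 6d] and DV_j = [5c, 2e] give a distance of 3 + 6 + 2 = 11.
    assert hamming_distance({"c": 8, "d": 6}, {"c": 5, "e": 2}) == 11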

2.5 Clustering Process

Let X = {DV_1, DV_2, ..., DV_N} be a set of N document vectors of equal dimension. The potential of the i-th document vector is defined as

    Z_i = Σ_{j=1}^{N} S_{ji},   (1 ≤ i ≤ N)                                  (1)

where S_{ji} is the similarity measure between DV_j and DV_i, given as

    S_{ji} = exp{ -α d(DV_j, DV_i) },   α ∈ (0, 1)                           (2)

A document vector with a high potential value is surrounded by many neighboring document vectors and is therefore a good candidate to be a cluster center. Based on this remark, the potential-based clustering algorithm is described as follows.

Step 1. Select values for the design parameters α ∈ (0, 1) and β ∈ (0, 1). Initially, set the number of clusters to n = 0.
Step 2. Using eqs (1) and (2) and the distance matrix, calculate the potential values of all document vectors DV_i (1 ≤ i ≤ N).
Step 3. Set n = n + 1.
Step 4. Calculate the maximum potential value

    Z_max = max_{1 ≤ i ≤ N} { Z_i }                                          (3)

and select the document vector DV_imax that corresponds to Z_max as the center of the n-th cluster: C_n = DV_imax.
Step 5. Remove from the set X all document vectors whose similarity to DV_imax (given in eq. (2)) is greater than β, and assign them to the n-th cluster.
Step 6. If X is empty, stop. Otherwise, return to Step 2.
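A compact sketch of Steps 1-6, consuming the Hamming distance matrix of Section 2.4 and using the similarity of eq. (2); the default parameter values and the data layout (a plain list of lists for the distance matrix) are illustrative assumptions.

    # Sketch of the potential-based clustering of Section 2.5 (parameter defaults are assumptions).
    import math

    def potential_clustering(distance_matrix, alpha=0.5, beta=0.3):
        """Cluster document indices given their pairwise Hamming distances (Steps 1-6)."""
        n_docs = len(distance_matrix)
        # eq. (2): similarity decreases exponentially with distance
        similarity = [[math.exp(-alpha * distance_matrix[i][j]) for j in range(n_docs)]
                      for i in range(n_docs)]
        remaining = set(range(n_docs))
        clusters = []                                   # list of (center index, member indices)
        while remaining:
            # eq. (1): potential of each remaining document vector
            potential = {i: sum(similarity[j][i] for j in remaining) for i in remaining}
            center = max(potential, key=potential.get)  # eq. (3): vector of maximum potential
            members = {i for i in remaining if similarity[i][center] > beta}
            members.add(center)                         # the center always joins its own cluster
            clusters.append((center, members))
            remaining -= members                        # Step 5: remove assigned vectors from X
        return clusters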

The implementation of the above algorithm requires a priori knowledge of the values of α and β. To simplify the approach, we set α = 0.5 and calculate the optimal value of β by using the following cluster validity index,

    V = COMP / SEP                                                           (7)

where COMP is the cluster compactness measure and SEP the cluster separation measure, given respectively as

    COMP = Σ_{k=1}^{n} Z_{C_k}                                               (4)

and

    SEP = g( min_{i,j} { d(C_i, C_j) } )                                     (5)

where in (4) Z_{C_k} is the potential value of the k-th cluster center and in (5) g is a strictly increasing function. Here, we choose the form

    g(x) = x^q,   q ∈ (1, ∞)                                                 (6)

The parameter q is used to normalize the separation measure in order to cancel undesired effects related to the number of document vectors as well as the number of clusters. The main objective, then, is to select the value of the parameter β that minimizes the validity index V.
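Continuing the sketch above, one possible way to pick β is to run the clustering for several candidate values and keep the one that minimizes V = COMP/SEP, with COMP, SEP and g as in eqs (4)-(6). The candidate grid, the exponent q = 2, and the choice to evaluate the center potentials over all documents are illustrative assumptions.

    # Sketch: selecting β by minimizing the validity index V = COMP / SEP (assumptions noted above).
    import math

    def validity_index(distance_matrix, clusters, alpha=0.5, q=2.0):
        """V = COMP / SEP for a partition produced by potential_clustering()."""
        n_docs = len(distance_matrix)
        centers = [center for center, _ in clusters]
        # COMP (eq. 4): sum of the potentials of the cluster centers (here taken over all documents).
        comp = sum(math.exp(-alpha * distance_matrix[j][c]) for c in centers for j in range(n_docs))
        if len(centers) < 2:
            return float("inf")
        # SEP (eqs. 5-6): strictly increasing function of the minimum inter-center distance.
        min_dist = min(distance_matrix[a][b] for a in centers for b in centers if a != b)
        return comp / (min_dist ** q) if min_dist > 0 else float("inf")

    def select_beta(distance_matrix, candidates=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
        """Run the clustering for each candidate β and keep the value with the smallest V."""
        scored = [(validity_index(distance_matrix, potential_clustering(distance_matrix, beta=b)), b)
                  for b in candidates]
        return min(scored)[1]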

3 Experimental Evaluation

Target topics were defined and the page samples were obtained through meta-searching the Yahoo search engine. We chose 15 categories related to cultural information, which are depicted in Table 1. For each category, we downloaded 1000 web pages to train the algorithm. After generating the dictionary, we selected for each category the 200 most frequently reported words, using the inverse document frequency (IDF) of each word [3]. In the next step, we set the dimension of each document vector to T = 30. Note that the feature-space dimension could be kept larger, but the computational cost would then increase. To test the method, we downloaded another 1000 pages for each category and used the well-known Harvest Ratio [4, 5] to compare the method against other algorithms. The results are presented in Table 1, where we can verify that, except for a few cases, our method outperformed the other methods.

Table 1. Comparative classification results with respect to the Harvest Ratio

Category                        Best-First   Accelerated Focused   First-Order   Proposed
                                Search       Crawler [4]           Crawler [5]   Method
Cultural conservation           48.7%        65.3%                 68.9%         69.6%
Cultural heritage               52.1%        72.4%                 73.7%         77.2%
Painting                        70.6%        72.5%                 76.4%         77.1%
Sculpture                       52.0%        67.2%                 71.6%         72.1%
Dancing                         66.8%        73.8%                 88.8%         90.6%
Cinematograph                   67.4%        84.7%                 90.2%         85.4%
Architecture                    55.4%        50.5%                 78.3%         72.1%
Museum                          59.7%        59.8%                 80.0%         84.9%
Archaeology                     60.8%        64.9%                 82.5%         84.0%
Folklore                        65.2%        85.4%                 93.0%         93.4%
Music                           71.5%        87.2%                 91.5%         92.0%
Theatre                         58.8%        74.0%                 90.6%         87.1%
Cultural Events                 63.3%        68.4%                 82.8%         84.7%
Audiovisual Arts                68.8%        69.0%                 78.6%         82.9%
Graphics Design / Art History   48.7%        59.6%                 60.9%         59.0%

4 Conclusions

We have shown how cluster analysis can be efficiently incorporated into a focused web crawler. The basic idea of the approach is to create multidimensional document vectors, each of which corresponds to a specific web page. The dissimilarity between two distinct document vectors is measured using the Hamming distance. We then use a clustering algorithm to classify the set of all document vectors into a number of clusters, where the respective cluster centers are objects from the original data set that satisfy specific conditions. The classification of unknown web pages is accomplished using the minimum Hamming distance. Several experimental simulations took place, which verified the efficiency of the proposed method.

References

[1] Huang, Y., Ye, Y.-M.: whunter: A Focused Web Crawler - A Tool for Digital Library. In: Chen, Z., Chen, H., Miao, Q., Fu, Y., Fox, E., Lim, E.-P. (eds.) ICADL 2004. LNCS, vol. 3334, pp. 519-522. Springer, Heidelberg (2004)
[2] Zhu, Q.: An Algorithm for the Focused Web Crawler. In: Proceedings of the 6th International Conference on Machine Learning and Cybernetics, Hong Kong (2007)
[3] Tsekouras, G.E., Anagnostopoulos, C.N., Gavalas, D., Economou, D.: Classification of Web Documents Using Fuzzy Logic Categorical Data Clustering. In: Boukis, C., Pnevmatikakis, A., Polymenakos, L. (eds.) Artificial Intelligence and Innovations: From Theory to Applications, pp. 93-100. Springer, Heidelberg (2007)
[4] Chakrabarti, S., Punera, K., Subramanyam, M.: Accelerated Focused Crawling through Online Relevance Feedback. In: Proceedings of WWW 2002, pp. 148-159 (2002)
[5] Xu, Q., Zuo, W.: First-order Focused Crawling. In: Proceedings of the International Conference on World Wide Web (WWW 2007), Banff, Alberta, Canada (2007)