A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information

George E. Tsekouras*, Damianos Gavalas, Stefanos Filios, Antonios D. Niros, and George Bafaloukas
Department of Cultural Technology and Communication, University of the Aegean, 81100, Mytilene, Lesvos, Greece
gtsek@ct.aegean.gr, dgavalas@aegean.gr

Abstract. We present a novel focused crawling method for extracting and processing cultural data from the web in a fully automated fashion. After downloading the pages, we extract from each document a number of words for each thematic cultural area. We then create multidimensional document vectors comprising the most frequent word occurrences. The dissimilarity between these vectors is measured by the Hamming distance. In the last stage, we employ cluster analysis to partition the document vectors into a number of clusters. Finally, our approach is illustrated via a proof-of-concept application which scrutinizes hundreds of web pages spanning different cultural thematic areas.

Keywords: web crawling, HTML parser, document vector, cluster analysis, Hamming distance, similarity measure, filtering.

1 Introduction

Web crawlers typically perform a simulated browsing of the web by extracting links from pages, downloading all of them and repeating the process ad infinitum. This process requires enormous amounts of hardware and network resources, ending up with a large fraction of the visible web on the crawler's storage array. When information about a predefined topic is desired, though, a specialization of the aforementioned process called focused crawling is used [1]. When searching for further relevant web pages, the focused crawler starts from the given pages and recursively explores the linked web pages [1, 2].
While simple crawlers perform a breadth-first search of the whole web, focused crawlers explore only a small portion of it, using a best-first search guided by the user's interests and based on similarity estimations [1, 2]. To maintain a fast information retrieval process, a focused web crawler has to classify web documents according to certain shared characteristics. One of the most efficient approaches to this end is cluster analysis. This article introduces a novel clustering-based focused crawler that involves: (a) the creation of a multidimensional document vector comprising the most frequent word occurrences; (b) the calculation of the Hamming distances between the culture-related documents; (c) the partitioning of the documents into a number of clusters.

* Corresponding author.

J. Darzentas et al. (Eds.): SETN 2008, LNAI 5138, pp. 419-424, 2008. Springer-Verlag Berlin Heidelberg 2008
The remainder of the article is organized as follows: Section 2 describes the structure of the focused web crawler. Section 3 discusses the experimental results and Section 4 presents the conclusions of the present work.

2 The Proposed Method

Prior to clustering the web documents, our algorithm involves the retrieval (crawling) and parsing of the documents, as well as the calculation of their document vectors and distance matrix. The high-level process of crawling, parsing, filtering and clustering of the downloaded web pages is illustrated in Figure 1. In detail, our algorithm is described in the following subsections.

Fig. 1. The high-level process of the proposed focused web-crawler. (Figure: the web crawler downloads HTML source documents from the Internet; the HTML parser produces XML content and filters noise words against a general dictionary; cultural terms in the segmented-word XML are counted against a cultural dictionary to decide whether each document is a cultural document; document vectors are then created, the Hamming distances among documents are calculated, and the documents are clustered offline into thematic cultural clusters.)

2.1 Crawling Procedure

For the retrieval of web pages we utilize a simple recursive procedure which enables breadth-first searches through the links in web pages across the Internet. The application downloads the first document, retrieves the web links included within the page and then recursively downloads the pages where these links point to, until the requested number of documents has been downloaded. The respective pseudo-code implementation is given in Figure 2.
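As a concrete illustration of this procedure, a minimal, single-threaded Python sketch follows. All names here are hypothetical; the fetch function is injected to stand in for the real DNS lookup and HTTP download, and the sketch omits the politeness policies and error handling a production crawler would need:

```python
from collections import deque
import re

# Naive link extraction; a real crawler would use a proper HTML parser.
LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def crawl(seeds, fetch, max_pages):
    """Breadth-first crawl: download a page, extract its links,
    enqueue unseen URLs, and repeat until max_pages are fetched."""
    urls_todo = deque(seeds)
    urls_done = set()
    pages = {}
    while urls_todo and len(pages) < max_pages:
        url = urls_todo.popleft()
        if url in urls_done:
            continue
        html = fetch(url) or ""
        urls_done.add(url)
        pages[url] = html
        for new_url in LINK_RE.findall(html):
            if new_url not in urls_done:
                urls_todo.append(new_url)
    return pages

# Usage with a tiny in-memory "web" standing in for real HTTP fetches:
fake_web = {
    "a.htm": '<a href="b.htm">B</a> <a href="c.htm">C</a>',
    "b.htm": '<a href="a.htm">A</a>',
    "c.htm": "",
}
pages = crawl(["a.htm"], fake_web.get, max_pages=10)
print(sorted(pages))  # ['a.htm', 'b.htm', 'c.htm']
```

The FIFO queue (`deque.popleft`) is what makes the traversal breadth-first; swapping it for a priority queue ordered by a relevance score would turn this into the best-first search that focused crawlers use.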
Initialize: UrlsDone = ∅; UrlsTodo = { firstsite.seed.htm, secondsite.seed.htm, ... }
Repeat:
  url = UrlsTodo.getNext()
  ip = DNSlookup( url.gethostname() )
  html = DownloadPage( ip, url.getpath() )
  UrlsDone.insert( url )
  newurls = parseforlinks( html )
  For each newurl:
    If not UrlsDone.contains( newurl ) then
      UrlsTodo.insert( newurl )

Fig. 2. The web crawling algorithm

2.2 Parsing of HTML Documents

The parser maintains the title and the clear-text content of the document, along with the URL addresses that the document's links point to. The title and the document's body content are then translated to XML format. Noise words, such as articles, as well as punctuation marks, are filtered out. Documents that do not include sufficient cultural content are deleted. The documents pointed to by the extracted URL addresses are retrieved and parsed in the next execution round of the algorithm, unless they have already been appended to the UrlsDone list.

2.3 Calculation of the Document Vectors

For each parsed document we calculate the respective document vector, denoted as DV. The DV indicates the descriptive and most useful words included within the document, together with their frequencies of appearance. For instance, if a document D_i includes the words a, b, c and d with frequencies 3, 2, 8 and 6 respectively, then its document vector will be: DV_i = [3a, 2b, 8c, 6d]. The dimension of DV_i equals the number of words it includes (in the previous example, |DV_i| = 4) and varies for each document. Next, we reorder each DV_i in descending order of word frequencies, so the vector of the previous example becomes: DV_i = [8c, 6d, 3a, 2b]. Finally, we filter each DV_i, maintaining only a specific number T of words, so that all DV_i's are of equal dimension. Thus, for T = 2, the vector of the previous example becomes: DV_i = [8c, 6d]. The filtering discards some information, since we have no knowledge of which low-frequency words are included in each document.
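The document-vector construction above, together with the Hamming-style distance of Section 2.4 that compares such vectors, can be sketched in Python as follows. This is a minimal illustration under the assumption that tokenisation and noise-word filtering have already been done by the parser:

```python
from collections import Counter

def document_vector(words, t):
    """Count word frequencies, order them by descending frequency and
    keep only the T most frequent words, so that every document vector
    has the same dimension T."""
    return Counter(words).most_common(t)

def hamming_distance(dv_i, dv_j):
    """Hamming-style distance between two document vectors: a word that
    appears in only one vector contributes its full frequency; a word
    shared by both contributes |f_i - f_j|."""
    fi, fj = dict(dv_i), dict(dv_j)
    return sum(abs(fi.get(w, 0) - fj.get(w, 0)) for w in set(fi) | set(fj))

# Reproducing the example from the text: words a, b, c, d with
# frequencies 3, 2, 8 and 6, truncated to T = 2.
doc = ["a"] * 3 + ["b"] * 2 + ["c"] * 8 + ["d"] * 6
dv = document_vector(doc, 2)
print(dv)                                          # [('c', 8), ('d', 6)]
print(hamming_distance(dv, [("c", 5), ("e", 1)]))  # |8-5| + 6 + 1 = 10
```

`Counter.most_common(t)` performs both the descending-frequency reordering and the truncation to T words in one step.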
If no filtering is applied, then the worst-case scenario for a dictionary of W words is a W-dimensional DV_i, where each word of the dictionary appears only once in the document.

2.4 Calculation of the Document Vectors' Distance Matrix

We calculate the distance matrix DM of the parsed web documents. Each scalar element DM_i,j of this matrix equals the distance between the document vectors DV_i and DV_j: DM_i,j = d(DV_i, DV_j), which represents the dissimilarity of the words included within the corresponding documents; i.e., the more different the words (and their frequencies) of documents D_i and D_j, the higher the DM_i,j value. The distance matrix values DM_i,j are calculated based on the Hamming distance calculation method: if a word w with
frequency f appears in D_i and not in D_j, or vice versa, DM_i,j increases by f. Otherwise, w appears in D_i with frequency f_i and in D_j with frequency f_j, and DM_i,j increases by the absolute difference |f_i - f_j|. The distance matrix provides the input of the clustering algorithm presented in the following section.

2.5 Clustering Process

Let X = { DV_1, DV_2, ..., DV_N } be a set of N document vectors of equal dimension. The potential of the i-th document vector is defined as follows:

  Z_i = Σ_{j=1}^{N} S_ji,  (1 ≤ i ≤ N)    (1)

where S_ji is the similarity measure between DV_j and DV_i, given as follows:

  S_ji = exp{ -α d(DV_j, DV_i) },  α ∈ (0, 1)    (2)

A document vector that exhibits a high potential value is surrounded by many neighboring document vectors; therefore, it is a good nominee for a cluster center. Based on this remark, the potential-based clustering algorithm is described as follows:

Step 1) Select values for the design parameters α ∈ (0, 1) and β ∈ (0, 1). Initially, set the number of clusters to n = 0.
Step 2) Using eqs (1) and (2) and the distance matrix, calculate the potential values of all document vectors DV_i (1 ≤ i ≤ N).
Step 3) Set n = n + 1.
Step 4) Calculate the maximum potential value:

  Z_max = max_{1 ≤ i ≤ N} { Z_i }    (3)

Select the document vector DV_imax that corresponds to Z_max as the center of the n-th cluster: C_n = DV_imax.
Step 5) Remove from the set X all the document vectors whose similarity (given in eq. (2)) with DV_imax is greater than β, and assign them to the n-th cluster.
Step 6) If X is empty, stop. Otherwise, return to Step 2.

The implementation of the above algorithm requires a priori knowledge of the values of α and β. To simplify the approach, we set α = 0.5 and calculate the optimal value of β by using the following cluster validity index:

  V = COMP / SEP    (7)
where COMP is the cluster compactness measure and SEP the cluster separation measure, which are respectively given as:

  COMP = Σ_{k=1}^{n} Z_Ck    (4)

and

  SEP = g( min_{i,j} { d(C_i, C_j) } )    (5)

where, in (4), Z_Ck is the potential value of the k-th cluster center and, in (5), g is a strictly increasing function. Here, we choose the following form:

  g(x) = x^q,  q ∈ (1, ∞)    (6)

In eq. (6), the parameter q is used to normalize the separation measure, in order to cancel undesired effects related to the number of document vectors as well as the number of clusters. The main objective, then, is to select the value of the parameter β that minimizes the validity index V.

Table 1. Comparative classification results with respect to the Harvest Ratio

Category (in row order): Cultural conservation; Cultural heritage; Painting; Sculpture; Dancing; Cinematograph; Architecture; Museum; Archaeology; Folklore; Music; Theatre; Cultural Events; Audiovisual Arts; Graphics Design; Art History
Best-First Search: 48.7%, 52.1%, 70.6%, 52.0%, 66.8%, 67.4%, 55.4%, 59.7%, 60.8%, 65.2%, 71.5%, 58.8%, 63.3%, 68.8%, 48.7%
Accelerated Focused Crawler [4]: 65.3%, 72.4%, 72.5%, 67.2%, 73.8%, 84.7%, 50.5%, 59.8%, 64.9%, 85.4%, 87.2%, 74.0%, 68.4%, 69.0%, 59.6%
First-Order Crawler [5]: 68.9%, 73.7%, 76.4%, 71.6%, 88.8%, 90.2%, 78.3%, 80.0%, 82.5%, 93.0%, 91.5%, 90.6%, 82.8%, 78.6%, 60.9%
Proposed Method: 69.6%, 77.2%, 77.1%, 72.1%, 90.6%, 85.4%, 72.1%, 84.9%, 84.0%, 93.4%, 92.0%, 87.1%, 84.7%, 82.9%, 59.0%

3 Experimental Evaluation

Target topics were defined and the page samples were obtained through meta-searching the Yahoo search engine. We chose 15 categories related to cultural
information, which are depicted in Table 1. For each category, we downloaded 1000 web pages to train the algorithm. After generating the dictionary, for each category we selected the 200 most frequently occurring words, using the inverse document frequency (IDF) of each word [3]. In the next step, we set the dimension of each document vector to T = 30. Note that the feature-space dimension could be kept larger, but in that case the computational cost would increase. To test the method, we downloaded another 1000 pages for each category and utilized the well-known Harvest Ratio [4, 5] to compare the method with other algorithms. The results are presented in Table 1, where we can easily verify that, except for a few cases, our method outperformed the rest of the methods.

4 Conclusions

We have shown how cluster analysis can be efficiently incorporated into a focused web crawler. The basic idea of the approach is to create multidimensional document vectors, each of which corresponds to a specific web page. The dissimilarity between two distinct document vectors is measured using the Hamming distance. Then, we use a clustering algorithm to classify the set of all document vectors into a number of clusters, where the respective cluster centers are objects from the original data set that satisfy specific conditions. The classification of unknown web pages is accomplished by using the minimum Hamming distance. Several experimental simulations took place, which verified the efficiency of the proposed method.

References

[1] Huang, Y., Ye, Y.-M.: whunter: A Focused Web Crawler - A Tool for Digital Library. In: Chen, Z., Chen, H., Miao, Q., Fu, Y., Fox, E., Lim, E.-p. (eds.) ICADL 2004. LNCS, vol. 3334, pp. 519-522. Springer, Heidelberg (2004)
[2] Zhu, Q.: An algorithm for the focused web crawler.
In: Proceedings of the 6th International Conference on Machine Learning and Cybernetics, Hong Kong (2007)
[3] Tsekouras, G.E., Anagnostopoulos, C.N., Gavalas, D., Economou, D.: Classification of Web Documents using Fuzzy Logic Categorical Data Clustering. In: Boukis, C., Pnevmatikakis, A., Polymenakos, L. (eds.) Artificial Intelligence and Innovations: From Theory to Applications, pp. 93-100. Springer, Heidelberg (2007)
[4] Chakrabarti, S., Punera, K., Subramanyam, M.: Accelerated focused crawling through online relevance feedback. In: WWW, pp. 148-159 (2002)
[5] Xu, Q., Zuo, W.: First-order Focused Crawling. In: Proceedings of the International Conference WWW 2007, Banff, Alberta, Canada (2007)