Web Document Clustering based on Document Structure

Khaled M. Hammouda and Mohamed S. Kamel
Department of Systems Design Engineering, University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
E-mail: {hammouda,mkamel}@pami.uwaterloo.ca
Corresponding author

Abstract

Document clustering techniques mostly rely on single-term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, document structure should be reflected in the underlying data model. This paper presents a framework for web document clustering based on two important concepts. The first is the web document structure, which is currently ignored by most approaches, even though the (semi-)structure of a web document provides significant information about its content. The second is finding the relationships between documents based on local context using a new phrase matching technique, so that documents are indexed based on phrases rather than on individual words, as is widely done in current systems. A novel document data model, the Document Index Graph, is designed specifically to facilitate phrase matching between documents. The combination of these two concepts creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in web document clustering over traditional methods. To make the approach applicable to online clustering, an incremental clustering algorithm guided by the maximization of cluster cohesiveness is also presented.

Keywords: web mining, document clustering, document similarity, document structure, document index graph, phrase matching.

1 Introduction

In an effort to keep up with the tremendous growth of the World Wide Web, many research projects have targeted the question of how to organize such information in a way that makes it easier for end users to find the information they want efficiently and accurately. Information on the web is present in the form of text documents (formatted in HTML), which is why many web document processing systems are rooted in text data mining techniques. Text data mining shares many concepts with traditional data mining methods. Data mining includes many techniques that can unveil inherent structure in the underlying data. One of these techniques is clustering. Applied to text data, clustering methods try to identify inherent groupings of the text documents so that a set of clusters is produced in which clusters exhibit high intra-cluster similarity and low inter-cluster similarity [5]. Generally speaking, text document clustering methods attempt to segregate the documents into groups where each group represents a topic that is different from the topics represented by the other groups [8].

By applying text mining in the web domain, the process becomes what is known as web mining. There are three types of web mining in general, according to Kosala et al. [16]: (1) web structure mining; (2) web usage mining; and (3) web content mining. We are mainly interested in the last type, web content mining.

Any clustering technique relies on four concepts: (1) a data representation model, (2) a similarity measure, (3) a cluster model, and (4) a clustering algorithm that builds the clusters using the data model and the similarity measure. Most of the document clustering methods in use today are based on the Vector Space Model [1, 21, 20, 22], a very widely used data model for text classification and clustering. The Vector Space Model represents documents as feature vectors of the terms (words) that appear in the document set. Each feature vector contains term-weights

(usually term-frequencies) of the terms appearing in that document. Similarity between documents is measured using one of several similarity measures that are based on such a feature vector. Examples include the cosine measure and the Jaccard measure. Clustering methods based on this model make use of single-term analysis only; they do not make use of any word proximity or phrase-based analysis (throughout this paper the term phrase means a sequence of words, not the grammatical structure of a sentence).

The motivation behind the work in this paper is our belief that document clustering should be based not only on single-word analysis, but on phrases as well. Phrase-based analysis means that the similarity between documents should be based on matching phrases rather than on single words only. The work reported in the literature on using phrases in document clustering is limited; most efforts have been targeted toward single-word analysis. The methods used for text clustering include decision trees [7, 15, 18, 28], statistical analysis [9, 10, 15], neural nets [11], inductive logic programming [6, 14], and rule-based systems [23, 24], among others. These methods are at the crossroads of more than one research area, such as databases (DB), information retrieval (IR), and artificial intelligence (AI), including machine learning (ML) and natural language processing (NLP).

The most relevant work to what is presented here is that of Oren Zamir et al. [31, 32, 30]. They proposed a phrase-based document clustering approach based on Suffix Tree Clustering (STC). The method basically involves the use of a trie (a compact tree) structure to represent shared suffixes between documents. Based on these shared suffixes they identify base clusters of documents, which are then combined into final clusters based on a connected-component graph algorithm. They claim to achieve n log(n) performance and produce high quality clusters. The results they showed were encouraging, but the suffix tree model could be argued to have a high number of redundancies in

terms of the suffixes stored in the tree.

In this paper we propose a system for web document clustering based on document structure. The system consists of four components:

1. A web document restructuring scheme that identifies different document parts and assigns levels of significance to these parts according to their importance.

2. A novel document representation model, the Document Index Graph (DIG), that captures the structure of sentences in the document set, rather than single words only. The DIG model is based on graph theory and utilizes graph properties to match any-length phrases from a document against any number of previously seen documents in a time nearly proportional to the number of words of the document.

3. A phrase-based similarity measure for scoring the similarity between two documents according to the matching phrases and their significance.

4. An incremental document clustering method based on maintaining high cluster cohesiveness using a new cluster quality concept called the Similarity Histogram.

The integration of these four components proved to be of superior performance to traditional document clustering methods. Although the whole system performs quite well, each component could be used independently of the others. The overall system design is illustrated in Figure 1. The proposed document model is used to measure the similarity between the documents using a new similarity measure that makes use of phrase-based matching. The similarity calculation between documents is based on a combination of single-term similarity and phrase-based similarity.

[Figure 1: Web Document Clustering System Design. Web documents pass through document structure identification (producing well-structured XML documents and marking significant text such as the title, section headings, bold, italic, and coloured text, hyperlinks, image alternate text, keywords stored in HTML, and the abstract or introductory paragraph), Document Index Graph representation, phrase matching and document similarity calculation, and incremental clustering to produce document clusters.]

Similarity based on matching phrases between documents is shown to have a more significant effect on the clustering quality due to its insensitivity to noisy terms that could lead to an incorrect similarity measure. The proposed incremental document clustering method relies on improving the pair-wise document similarity distribution inside each cluster so that similarities are maximized in each cluster. The quality of the clusters produced using this system was higher than that produced using traditional clustering methods, with improvement in clustering ranging from 20% to 70% over traditional methods.

The rest of this paper is organized as follows. Section 2 presents an analysis of the important features of semi-structured web documents. Section 3 introduces the Document Index Graph model. Section 4 presents the phrase-based similarity measure. Section 5 presents our proposed incremental clustering algorithm. Section 6 presents our experimental results. Finally, we conclude and discuss future work in the last section.

2 Web document structure analysis

Web documents are known to be semi-structured. HTML tags are used to designate different parts of the document. However, since the HTML language is meant for specifying the layout of the document, it is used to present the document to the user in a friendly manner rather than to specify the structure of the data in the document; hence such documents are only semi-structured. It is still possible, though, to identify key parts of the document based on this structure. The idea is that some parts of the document are more informative than others, and thus have different levels of significance depending on where they appear in the document and the tags that surround them. It is less informative to treat the title of the document, for example, and the text body equally.

The proposed system analyzes the HTML document and restructures it according to a predetermined scheme that assigns different levels of significance to different document parts. The result is a well-structured XML document that corresponds to the original HTML document, but with significance levels assigned to the different parts of the original document. Currently we assign one of three levels of significance to the different parts: HIGH, MEDIUM, and LOW. Examples of HIGH significance parts are the title, meta keywords, meta description, and section headings. Examples of MEDIUM significance parts are text appearing in bold, italics, or color, hyper-linked text, image alternate text, and table captions. LOW significance parts usually comprise the document body text that was not assigned any of the other levels. This structuring scheme is exploited in measuring the similarity between two documents (see section 4 for details). For example, if we have a phrase match of HIGH significance in both documents, the similarity is rewarded more than if the match was for LOW significance phrases. This is justified by arguing that a phrase match in titles, for example, is much more informative than a phrase match in body text.
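As a rough illustration only (not the authors' implementation), the following sketch shows one way such a mapping from HTML parts to significance levels might look. The tag groupings follow the examples listed above; the class name SignificanceParser and the exact rules are our own assumptions.

```python
# Illustrative sketch: maps HTML text fragments to the HIGH/MEDIUM/LOW
# significance levels described above. Tag groupings follow the examples
# in the text; the actual system's rules may differ.
from html.parser import HTMLParser

HIGH_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}        # title, section headings
MEDIUM_TAGS = {"b", "strong", "i", "em", "font", "a", "caption"}  # bold, italic, colored, links, captions

class SignificanceParser(HTMLParser):
    """Assigns a significance level to each text fragment of an HTML page."""
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.parts = []   # (significance, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "meta":  # meta keywords / description are HIGH significance
            attrs = dict(attrs)
            if attrs.get("name") in ("keywords", "description") and attrs.get("content"):
                self.parts.append(("HIGH", attrs["content"]))

    def handle_endtag(self, tag):
        if tag in self.stack:
            self.stack.remove(tag)

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if any(t in HIGH_TAGS for t in self.stack):
            level = "HIGH"
        elif any(t in MEDIUM_TAGS for t in self.stack):
            level = "MEDIUM"
        else:
            level = "LOW"  # plain body text
        self.parts.append((level, text))

parser = SignificanceParser()
parser.feed("<html><title>River Rafting</title><body><b>Wild</b> river adventures</body></html>")
print(parser.parts)  # [('HIGH', 'River Rafting'), ('MEDIUM', 'Wild'), ('LOW', 'river adventures')]
```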

A sentence boundary detector algorithm was developed to locate sentence boundaries in the documents. The algorithm is based on a finite state machine lexical analyzer with heuristic rules for finding the boundaries. A similar approach is used to find word boundaries. About 98% of the actual boundaries are correctly detected. Achieving 100% accuracy, however, requires natural language processing techniques and underlying knowledge of the data set domain, which is beyond the scope of this paper. Nevertheless, the resulting documents contain very accurate sentence and word separation, with negligible noise. Finally, a document cleaning step is performed to remove stop-words that have no significance, and to stem the words using the popular Porter Stemmer algorithm [19].

3 Document Index Graph

To achieve better clustering results, the data model that underlies the clustering method must accurately capture the salient features of the data. Under the Vector Space Model, the document data is represented as a feature vector of terms with different weights assigned to the terms according to their frequency of appearance in the document. It does not represent any relation between the words, so sentences are broken down into their individual components without any representation of the sentence structure. The proposed Document Index Graph (DIG for short) indexes the documents while maintaining the sentence structure of the original documents. This allows us to make use of more informative phrase matching rather than individual word matching. Moreover, the DIG also captures the different levels of significance of the original sentences, thus allowing us to make use of sentence significance.

3.1 DIG structure

The Document Index Graph (DIG for short) is a directed graph (digraph) G = (V, E), where V is a set of nodes {v_1, v_2, ..., v_n}, in which each node v represents a unique word in the entire document set; and E is a set of edges {e_1, e_2, ..., e_m}, such that each edge e is an ordered pair of nodes (v_i, v_j). Edge (v_i, v_j) goes from v_i to v_j, and v_j is adjacent to v_i. There is an edge from v_i to v_j if, and only if, the word v_j appears immediately after the word v_i in any document.

The above definition suggests that the number of nodes in the graph equals the number of unique words in the document set, i.e. the vocabulary of the document set, since each node represents a single word in the whole document set. Nodes in the graph carry information about the documents they appeared in, along with sentence path information. Sentence structure is maintained by recording the edge along which each sentence continues. This essentially creates an inverted list of the documents, but with sentence information recorded in the inverted list.

Assume a sentence of m words appearing in one document consists of the word sequence {v_1, v_2, ..., v_m}. The sentence is represented in the graph by a path from v_1 to v_m, such that (v_1, v_2), (v_2, v_3), ..., (v_{m-1}, v_m) are edges in the graph. Path information is stored in the vertices along the path to uniquely identify each sentence. Sentences that share sub-phrases will have shared parts of their paths in the graph that correspond to the shared sub-phrases.

The structure maintained in each node is a table of documents. Each document entry in the document table records the term frequency of the word in that document. Since words can appear in different parts of a document with different levels of significance, the recorded term frequency is actually broken into those levels of significance, with a frequency count per level per document entry.
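To make the node structure concrete, here is a minimal sketch (our own illustration, not the authors' code) of a DIG node with its document table, per-significance-level frequency counts, and per-document edge table as described above; the names WordNode and DocumentEntry are hypothetical.

```python
# Minimal illustrative sketch of the DIG node structure described above.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class DocumentEntry:
    """Per-document information stored in a word node."""
    # term frequency of this word in the document, broken down by significance level
    freq: dict = field(default_factory=lambda: {"HIGH": 0, "MEDIUM": 0, "LOW": 0})
    # edge table: next_word -> list of (sentence_id, position) pairs, recording
    # along which edge each sentence continues
    edges: dict = field(default_factory=lambda: defaultdict(list))

@dataclass
class WordNode:
    """A node of the Document Index Graph: one unique word in the document set."""
    word: str
    # document table: doc_id -> DocumentEntry
    docs: dict = field(default_factory=dict)

    def entry(self, doc_id):
        return self.docs.setdefault(doc_id, DocumentEntry())

# The graph itself is simply a dictionary from word to node.
graph: dict = {}

def add_occurrence(word, doc_id, level, next_word, sentence_id, position):
    """Record one occurrence of `word` in `doc_id`, continuing to `next_word`."""
    node = graph.setdefault(word, WordNode(word))
    e = node.entry(doc_id)
    e.freq[level] += 1
    if next_word is not None:
        e.edges[next_word].append((sentence_id, position))
```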

This structure helps in achieving a more accurate similarity measure based on level of significance later on. Since the graph is directed, each node maintains a list of outgoing edges per document entry. This list of edges tells us which sentence continues along which edge. The task of creating a sentence path in the graph is thus reduced to recording the necessary information in this edge table to reflect the structure of the sentences.

[Figure 2: Example of the Document Index Graph built from three documents. Document 1: "river rafting", "mild river rafting", "river rafting trips". Document 2: "wild river adventures", "river rafting vacation plan". Document 3: "fishing trips", "fishing vacation plan", "booking fishing trips", "river fishing".]

To better illustrate the graph structure, Figure 2 presents a simple example graph that represents three documents. Each document contains a number of sentences with some overlap between the documents. As seen from the graph, an edge is created between two nodes only if the words represented by the two nodes appear successively in any document. Thus, sentences map into paths in the

graph. Dotted lines represent sentences from document 1, dash-dotted lines represent sentences from document 2, and dashed lines represent sentences from document 3. As mentioned earlier, matching phrases between documents becomes a task of finding shared paths in the graph between different documents.

The example presented here is a simple one. Real web documents will contain hundreds or thousands of words. With a very large document set, the graph could become more complex in terms of memory usage. Typically, the number of graph nodes will be exactly the same as the number of unique words in the data set. The number of edges is about 4 to 6 times the number of nodes (that is, the average degree of a node).

3.2 Constructing the graph

The DIG is built incrementally by processing one document at a time. When a new document is introduced, it is scanned in sequential fashion, and the graph is updated with the new sentence information as necessary. New words are added to the graph as necessary and connected with other nodes to reflect the sentence structure. The graph building process becomes less memory demanding when no (or very few) new words are introduced by a new document. At this point the graph becomes more stable, and the only operation needed is to update the sentence structure in the graph to accommodate the new sentences introduced. It is critical to note that introducing a new document will only require the inspection (or addition) of those words that appear in that document, and not every node in the graph. This is where the efficiency of the model comes from. Along with indexing the sentence structure, the level of significance of each sentence is also recorded in the graph. This allows us to recall such information when we match sentences from other documents.

[Figure 3: Incremental construction of the Document Index Graph, showing the graph after adding Document 1, Document 2, and Document 3 in turn.]

Continuing from the example introduced earlier, the process of constructing the graph that represents the three documents is illustrated in Figure 3. The emphasis here is on the incremental construction process, where new nodes are added and new edges are created incrementally upon introducing a new document. Unlike traditional phrase matching techniques usually used in the information retrieval literature, the Document Index Graph provides complete information about full phrase matching between every pair of documents. While traditional phrase matching methods are aimed at searching and retrieving documents that have phrases matching a specific query, the Document Index Graph is aimed at providing information about the degree of overlap between every pair of documents. This

information will help in determining the degree of similarity between documents, as will be explained in section 4.

3.3 Detecting Matching Phrases

Upon introducing a new document, finding matching phrases from previously seen documents becomes an easy task using the DIG. Algorithm 1 describes the process of both incremental graph building and phrase matching. The procedure starts with a new document to process (line 1). We expect the new document to have well-defined sentence boundaries; each sentence is processed individually. This is important because we do not want to match a phrase that spans two sentences (which could break the local context we are looking for). It is also important to know the original sentence length so that it can be used in the similarity calculation (section 4). For each sentence (for loop at line 2) we process the words in the sentence sequentially, adding new words as new nodes to the graph, and constructing a path in the graph (by adding new edges if necessary) to represent the sentence we are processing. Matching the phrases from previous documents is done by keeping a list L that holds an entry for every previous document that shares a phrase with the current document D. As we continue along the sentence path, we update L by adding new matching phrases and their respective document identifiers, and extending phrase matches from the previous iteration (lines 10 and 11). If there are no matching phrases at some point, we just update the respective nodes of the graph to reflect the new sentence path (lines 13 and 14). After the whole document is processed, L will contain all the matching phrases between the current document and any previous document that shared at least one phrase with the new document. Finally, we output L as the list of documents with matching phrases and all the necessary information about the matching phrases.

Algorithm 1 Document Index Graph construction and phrase matching

 1: D ← new document
 2: for each sentence s in D do
 3:   w_1 ← first word in s
 4:   if w_1 is not in G then
 5:     Add w_1 to G
 6:   end if
 7:   L ← empty list  {L is a list of matching phrases}
 8:   for each word w_i in {w_2, w_3, ..., w_k} in s do
 9:     if (w_{i-1}, w_i) is an edge in G then
10:       Extend phrase matches in L for sentences that continue along (w_{i-1}, w_i)
11:       Add new phrase matches to L
12:     else
13:       Add edge (w_{i-1}, w_i) to G
14:       Update sentence path in nodes w_{i-1} and w_i
15:     end if
16:   end for
17: end for
18: Output matching phrases in L

The above algorithm is capable of matching any-length phrases between a new document D and all previously seen documents in roughly O(m) time, where m is the number of words in document D.
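The following is a rough Python sketch of this procedure, not the authors' code. It is simplified in that it only records per-document word adjacency rather than full sentence paths, so it may over-extend a match in rare cases; it nevertheless illustrates the incremental graph update and the list L of shared phrases.

```python
# Illustrative, simplified sketch of Algorithm 1. The graph is kept as:
# word -> {doc_id -> set of words that follow `word` in that document}.
from collections import defaultdict

graph = defaultdict(lambda: defaultdict(set))

def add_and_match(doc_id, sentences):
    """Add a document to the graph and return phrases shared with earlier documents.

    `sentences` is a list of token lists. Returns {other_doc_id: [matching phrases]}.
    """
    matches = defaultdict(list)
    for sent in sentences:
        # active[d] = length of the phrase ending at the previous word that doc d shares
        active = {}
        for i in range(1, len(sent)):
            prev, word = sent[i - 1], sent[i]
            sharing = {d for d, nxt in graph[prev].items() if word in nxt and d != doc_id}
            new_active = {}
            for d in sharing:
                # extend an existing match, or start a new two-word match
                new_active[d] = active.get(d, 1) + 1
            # matches that could not be extended are emitted now
            for d, length in active.items():
                if d not in new_active and length >= 2:
                    matches[d].append(" ".join(sent[i - length:i]))
            active = new_active
            # update the graph with the new edge for this document
            graph[prev][doc_id].add(word)
        # flush matches that reach the end of the sentence
        for d, length in active.items():
            if length >= 2:
                matches[d].append(" ".join(sent[len(sent) - length:]))
    return dict(matches)

add_and_match("doc1", [["river", "rafting"], ["mild", "river", "rafting"]])
print(add_and_match("doc2", [["wild", "river", "adventures"],
                             ["river", "rafting", "vacation", "plan"]]))
# -> {'doc1': ['river rafting']}
```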

The step at line 10 in the algorithm, where we extend the matching phrases as we continue along an existing path, may seem not to be a constant-time step, because as the graph builds up the number of matching phrases becomes larger, and consequently, when moving along an existing path, we have to match more phrases. However, it turns out that the size of the list of matching phrases remains roughly constant even with very large document sets, because a certain phrase is shared by only a small set of documents, which on average tends to be a constant number.

4 A phrase-based similarity measure

As mentioned earlier, phrases convey local context information, which is essential in determining an accurate similarity between documents. Towards this end we devised a similarity measure based on matching phrases rather than individual terms. This measure exploits the information extracted from the previous phrase matching algorithm to better judge the similarity between the documents. This is related to the work of Isaacs et al. [12], who used a pair-wise probabilistic document similarity measure based on Information Theory. Although they showed it could improve on traditional similarity measures, it is still fundamentally based on the vector space model representation.

The phrase similarity between two documents is calculated based on the list of matching phrases between the two documents. This similarity measure is a function of four factors:

- The number of matching phrases P,
- The lengths of the matching phrases (l_i : i = 1, 2, ..., P),
- The frequencies of the matching phrases in both documents (f_{i1} and f_{i2} : i = 1, 2, ..., P), and
- The levels of significance (weights) of the matching phrases in both documents (w_{i1} and w_{i2} : i = 1, 2, ..., P).

Frequency of phrases is an important factor in the similarity measure: the more frequently a phrase appears in both documents, the more similar they tend to be. Similarly, the level of significance of the matching phrase in both documents should be taken into consideration. The phrase similarity between two documents, d_1 and d_2, is calculated using the following empirical equation:

sim_p(d_1, d_2) = \frac{\sum_{i=1}^{P} \left[ g(l_i) \, (f_{i1} w_{i1} + f_{i2} w_{i2}) \right]^2}{\sum_j |s_{j1}| w_{j1} + \sum_k |s_{k2}| w_{k2}}    (1)

where g(l_i) is a function that scores the matching phrase length, giving a higher score as the matching phrase length approaches the length of the original sentence; |s_{j1}| and |s_{k2}| are the original sentence lengths from documents d_1 and d_2, respectively. The equation rewards longer phrase matches with higher levels of significance and with higher frequency in both documents. The function g(l_i) in the implemented system was defined as:

g(l_i) = (|ms_i| / |s_i|)^\gamma    (2)

where |ms_i| is the matching phrase length, |s_i| is the original sentence length, and γ is a sentence fragmentation factor with values greater than or equal to 1. If γ is 1, two halves of a sentence could be matched independently and together would be treated as a whole-sentence match. However, by increasing γ we can avoid this situation and score whole-sentence matches higher than fractions of sentences. A value of 1.2 for γ was found to produce the best results.
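As a small sketch of how equations (1) and (2) can be evaluated, the following assumes the phrase-match list has already been produced (for example by the matching sketch above); the input structures and field names are our own assumptions, not the paper's notation.

```python
# Illustrative computation of equations (1) and (2); hypothetical input structures.

def g(match_len, sentence_len, gamma=1.2):
    """Equation (2): score of a matching phrase relative to its original sentence."""
    return (match_len / sentence_len) ** gamma

def phrase_similarity(matches, sentences1, sentences2, gamma=1.2):
    """Equation (1).

    `matches` is a list of dicts, one per matching phrase i, with keys:
      len      - matching phrase length (l_i)
      sent_len - length of the original sentence it came from
      f1, f2   - frequency of the phrase in d1 and d2
      w1, w2   - significance weight of the phrase in d1 and d2
    `sentences1`/`sentences2` are lists of (sentence_length, weight) pairs for d1 and d2.
    """
    numerator = sum(
        (g(m["len"], m["sent_len"], gamma) * (m["f1"] * m["w1"] + m["f2"] * m["w2"])) ** 2
        for m in matches
    )
    denominator = (sum(s * w for s, w in sentences1) +
                   sum(s * w for s, w in sentences2))
    return numerator / denominator if denominator else 0.0

# With gamma = 1.2, matching only half of a sentence scores 0.5**1.2 ~= 0.435,
# noticeably less than the 0.5 it would receive with gamma = 1.
example = [{"len": 2, "sent_len": 3, "f1": 1, "f2": 2, "w1": 3, "w2": 1}]
print(phrase_similarity(example, [(3, 3), (4, 1)], [(5, 1)]))
```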

The normalization by the lengths of the two documents in equation (1) is necessary to be able to compare the similarities across document pairs.

4.1 Combining single-term and phrase similarities

If the similarity between documents is based solely on matching phrases, and not on single terms at the same time, related documents could be judged as non-similar if they do not share enough phrases (a typical case that could happen in many situations). Shared phrases provide important local context matching, but sometimes similarity based on phrases alone is not sufficient. To alleviate this problem, and to produce high quality clusters, we combined a single-term similarity measure with our phrase-based similarity measure. We used the cosine correlation similarity measure [21, 22], with TF-IDF (Term Frequency-Inverse Document Frequency) term weights, as the single-term similarity measure. The cosine measure was chosen due to its wide use in the document clustering literature, and because it has been described as being able to capture human categorization behavior well [26]. The TF-IDF weighting is also a widely used term weighting scheme [29]. Recall that the cosine measure calculates the cosine of the angle between two document vectors. Accordingly, our term-based similarity measure (sim_t) is given as:

sim_t(d_1, d_2) = \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}    (3)

where the vectors d_1 and d_2 are represented as term weights calculated using the TF-IDF weighting scheme. The combination of the term-based and the phrase-based similarity measures is a weighted average of the two quantities from equations (1) and (3), and is given by equation (4).

sim(d_1, d_2) = \alpha \, sim_p(d_1, d_2) + (1 - \alpha) \, sim_t(d_1, d_2)    (4)

where α is a value in the interval [0, 1] that determines the weight of the phrase similarity measure, or, as we call it, the Similarity Blend Factor. According to the experimental results discussed in section 6, we found that a value between 0.6 and 0.8 for α results in the maximum improvement in clustering quality.

5 Incremental Document Clustering

In this section we present a brief overview of incremental clustering algorithms, introduce the proposed algorithm, which is based on pair-wise document similarity, and employ it as part of the whole web document clustering system. The role of a document similarity measure is to provide a judgement of the closeness of documents to each other. However, it is up to the clustering method how to make use of such a similarity calculation. The idea here is to employ an incremental clustering method that will exploit our similarity measure to produce clusters of high quality (assessing the quality of clustering is described in section 6).

Incremental clustering is an essential strategy for online applications, where time is a critical factor for usability. Incremental clustering algorithms work by processing data objects one at a time, incrementally assigning data objects to their respective clusters as they progress. The process is simple enough, but faces several challenges, including:

- How to determine to which cluster the next object should be assigned?
- How to deal with the problem of insertion order?
- Once an object has been assigned to a cluster, should its assignment to the cluster be frozen, or is it allowed to be re-assigned to other clusters later on?

Usually a heuristic method is employed to deal with the above challenges. A good incremental clustering algorithm has to find the respective cluster for each newly introduced object without significantly sacrificing the accuracy of clustering due to insertion order or fixed object-to-cluster assignment. We will briefly discuss two incremental clustering methods in the light of the above challenges, before we introduce our proposed method.

Suffix Tree Clustering (STC). Introduced by Zamir et al. [31] in 1997, the idea behind the STC algorithm is to build a tree of phrase suffixes shared between multiple documents. The documents sharing a suffix are considered a base cluster. Base clusters are then combined together if they have a document overlap of 50% or more. The algorithm has two drawbacks. First, although the structure used is a compact tree, suffixes can appear multiple times if they are part of larger shared suffixes. The other drawback is that the second phase of the algorithm is not incremental: combining base clusters into final clusters has to be done in a non-incremental way. The algorithm deals properly with the insertion order problem, though, since any insertion order will lead to the same resulting suffix tree.

DC-tree Clustering. The DC-tree incremental algorithm was introduced by Wong et al. [27] in 2000. The algorithm is based on the B+-tree structure. Unlike the STC algorithm, this algorithm is based on a vector space representation of the documents. Most of the algorithm operations are borrowed from B+-tree operations. Each node in the tree is a representation of a cluster, where a cluster is represented by the combined feature vectors of its individual documents. Inserting a new document involves comparison of the document feature vector with the cluster vectors at one level of the tree,

and descending to the most similar cluster. The algorithm defines several parameters and thresholds for the various operations. The algorithm suffers from two problems, though. First, once a document is assigned to a cluster it is not allowed to be re-assigned later to a newly created cluster. Second, as a consequence of the first drawback, clusters are not allowed to overlap; i.e. a document can belong to only one cluster.

5.1 Similarity histogram-based incremental clustering

The clustering approach proposed here is an incremental, dynamic method of building the clusters. We adopt an overlapped cluster model. The key concept of the proposed clustering method is to keep each cluster at a high degree of coherency at any time. We represent the coherency of a cluster with a new concept called the Cluster Similarity Histogram.

Cluster Similarity Histogram: a concise statistical representation of the distribution of the pairwise document similarities in the cluster. The bins of the histogram correspond to fixed similarity value intervals, and each bin contains the count of pair-wise document similarities falling in the corresponding interval.

Figure 4 shows a typical cluster similarity histogram, where the distribution is almost normal. A perfect cluster would have a histogram where the similarities are all maximum, while a loose cluster would have a histogram where the similarities are all minimum.

5.2 Creating coherent clusters incrementally

Our objective is to keep each cluster as coherent as possible. In terms of the similarity histogram concept, this translates to maximizing the number of similarities in the high-similarity intervals.

[Figure 4: Typical Cluster Similarity Histogram (count of pair-wise similarities per similarity interval).]

To achieve this goal in an incremental fashion, we judge the effect of adding a new document to a certain cluster. If the document is going to significantly degrade the distribution of the similarities in the cluster, it should not be added; otherwise it is added. A much stricter strategy would be to add only documents that will enhance the similarity distribution. However, this could create a problem with perfect clusters: a document would be rejected by the cluster even if it has high similarity to most of the documents in the cluster (because the cluster is already perfect).

We judge the quality of a similarity histogram (cluster cohesiveness) by calculating the ratio of the count of similarities above a certain similarity threshold S_T to the total count of similarities. The higher this ratio, the more coherent the cluster. Let n be the number of documents in a cluster. The number of pair-wise similarities in the cluster is m = n(n-1)/2. Let S = {s_i : i = 1, ..., m} be the set of similarities in the cluster. The histogram of the similarities in the cluster is represented as:

H = \{ h_i : i = 1, \ldots, B \}    (5a)
h_i = \mathrm{count}(s_k), \quad s_{li} < s_k < s_{ui}    (5b)

where B is the number of histogram bins, h_i is the count of similarities in bin i, s_{li} is the lower similarity bound of bin i, and s_{ui} is the upper similarity bound of bin i.

The histogram ratio of a cluster is the measure of cohesiveness of the cluster as described above, and is calculated as:

HR(C) = \frac{\sum_{i=T}^{B} h_i}{\sum_{j=1}^{B} h_j}    (6a)
T = \lceil S_T \cdot B \rceil    (6b)

where HR is the histogram ratio, C is the cluster under consideration, S_T is the similarity threshold, and T is the bin number corresponding to the similarity threshold.

Basically, we would like to keep the histogram ratio of each cluster high. However, since we allow documents that can degrade the histogram ratio to be added, this could eventually result in a chain effect of degrading the ratio to zero. To prevent this, we set a minimum histogram ratio HR_min that clusters should maintain. We also do not allow adding a document that would bring down the histogram ratio significantly (even if the ratio would still be above HR_min). This is to prevent a bad document from severely bringing down cluster quality by a single document addition.

We now present the incremental clustering algorithm based on the above framework (Algorithm 2). The algorithm works incrementally by receiving a new document and, for each cluster, calculating the cluster histogram before and after simulating the addition of the document (lines 3-5). The old and

new histogram ratios are compared, and if the new ratio is greater than or equal to the old one, the document is added to the cluster. If the new ratio is less than the old one by no more than ε and still above HR_min, it is also added (lines 6-8); otherwise it is not added. If after checking all clusters the document was not assigned to any cluster, a new cluster is created and the document is added to it (lines 10-13).

Algorithm 2 Similarity Histogram-based Incremental Document Clustering

 1: L ← empty list  {cluster list}
 2: for each document D do
 3:   for each cluster C in L do
 4:     HR_old = HR(C)
 5:     Simulate adding D to C
 6:     HR_new = HR(C)
 7:     if (HR_new ≥ HR_old) OR ((HR_new > HR_min) AND (HR_old − HR_new < ε)) then
 8:       Add D to C
 9:     end if
10:   end for
11:   if D was not added to any cluster then
12:     Create a new cluster C
13:     Add D to C
14:     Add C to L
15:   end if
16: end for
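A rough Python sketch of this procedure is given below; it is our own illustration, not the authors' code. The similarity function sim is assumed to be the combined measure of equation (4), and the parameter values are placeholders rather than the settings used in the paper.

```python
# Illustrative sketch of similarity-histogram-based incremental clustering.
import math
from itertools import combinations

B, S_T = 10, 0.5            # number of bins and similarity threshold (placeholders)
HR_MIN, EPSILON = 0.3, 0.05  # minimum histogram ratio and allowed degradation

def histogram_ratio(docs, sim):
    """Equations (5) and (6): fraction of pair-wise similarities above the threshold."""
    sims = [sim(a, b) for a, b in combinations(docs, 2)]
    if not sims:
        return 1.0                       # a single-document cluster is trivially coherent
    bins = [0] * B
    for s in sims:
        bins[min(int(s * B), B - 1)] += 1
    t = math.ceil(S_T * B)               # first bin counted as "high similarity"
    return sum(bins[t:]) / len(sims)

def cluster_incrementally(documents, sim):
    """Algorithm 2: add each document to every cluster whose histogram ratio allows it."""
    clusters = []
    for d in documents:
        assigned = False
        for c in clusters:
            hr_old = histogram_ratio(c, sim)
            hr_new = histogram_ratio(c + [d], sim)   # simulate adding d
            if hr_new >= hr_old or (hr_new > HR_MIN and hr_old - hr_new < EPSILON):
                c.append(d)
                assigned = True
        if not assigned:
            clusters.append([d])
        # (document re-assignment between clusters, section 5.3, is omitted here)
    return clusters
```

Note that, as in Algorithm 2, a document may be added to more than one cluster, which is what allows the overlapped cluster model.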

5.3 Dealing with insertion order problems

Our strategy for the insertion order problem is to implement a document reassignment scheme. Older documents that were added before new clusters were created should have the chance to be reassigned to newly created clusters. Only documents that seem to be bad for a certain cluster are tagged and considered for reassignment to other clusters. The documents that are candidates to leave a cluster are those whose leaving would increase the cluster similarity histogram ratio; i.e. the cluster is better off without them. For each document we keep a record of what the histogram ratio would be if the document were not in the cluster. If this value is greater than the current histogram ratio, then the document is a candidate for leaving the cluster. Upon adding a new document to any cluster, we consult the documents that are candidates for leaving that cluster. If any such document can be added to another cluster, we move it to that cluster, thus benefiting both clusters. This strategy creates a dynamic negotiation scheme between clusters for document assignment. It also allows for overlapping clusters and dynamic incremental document clustering.

6 Experimental Results

In order to test the effectiveness of the web clustering system, we conducted a set of experiments using our proposed data model, phrase matching, similarity measure, and incremental clustering method. The experiments were divided into two sets. We first tested the effectiveness of the Document Index Graph model, presented in section 3, and the accompanying phrase matching algorithm for calculating the similarity between documents based on phrases versus individual words only. The second set of experiments evaluated the accuracy of the incremental document clustering algorithm, presented in section 5, based on the cluster cohesiveness measure using similarity histograms.

6.1 Experimental setup

Because the proposed system was designed to make use of the semi-structure of web documents, regular text corpora were not used. Our experimental setup consisted of two web document sets. The first consists of 314 web documents collected from various University of Waterloo web sites, such as the Graduate Studies Office, Information Systems and Technology, Health Services, Career Services, and Co-operative Education, as well as other Canadian web sites. The documents were classified, according to their content, into 10 different categories. In order to allow for independent testing and the reproduction of the results presented here, this document collection can be downloaded at: http://pami.uwaterloo.ca/ hammouda/webdata/. The second data set is a collection of Reuters news articles from the Yahoo! news site. The set contains 2340 documents classified into 20 different categories (with some relevancy between the categories as well). The second data set was used by Boley et al. in [4, 2, 3]. Table 1 summarizes the two data sets.

Table 1: Data Sets Descriptions

  Data Set   Description                                 Categories   Documents
  DS1        UofW web site, Canadian web sites           10           314
  DS2        Reuters news articles (from Yahoo! news)    20           2340

6.2 Evaluation measures

In order to evaluate the quality of the clustering, we adopted two quality measures widely used in the text mining literature for the purpose of document clustering [25]. The first is the F-measure, which

combines the Precision and Recall ideas from the Information Retrieval literature. The precision and recall of a cluster j with respect to a class i are defined as:

P = \mathrm{Precision}(i, j) = \frac{N_{ij}}{N_j}    (7a)
R = \mathrm{Recall}(i, j) = \frac{N_{ij}}{N_i}    (7b)

where N_{ij} is the number of members of class i in cluster j, N_j is the number of members of cluster j, and N_i is the number of members of class i.

The F-measure of a class i is defined as:

F(i) = \frac{2PR}{P + R}    (8)

With respect to class i, we consider the cluster with the highest F-measure to be the cluster j that maps to class i, and that F-measure becomes the score for class i. The overall F-measure for the clustering result C is the weighted average of the F-measures of the classes:

F_C = \frac{\sum_i |i| \, F(i)}{\sum_i |i|}    (9)

where |i| is the number of objects in class i. The higher the overall F-measure, the better the clustering, due to the higher accuracy of the clusters mapping to the original classes.

The second measure is the Entropy, which provides a measure of goodness for un-nested clusters or for the clusters at one level of a hierarchical clustering. Entropy tells us how homogeneous a cluster is. The higher the homogeneity of a cluster, the lower the entropy, and vice versa. The entropy of a cluster containing only one object (perfect homogeneity) is zero. For every cluster j in the clustering result C we compute p_{ij}, the probability that a member of cluster j belongs to class i. The entropy of each cluster j is calculated using the standard formula

E_j = -\sum_i p_{ij} \log(p_{ij}), where the sum is taken over all classes. The total entropy for a set of clusters is calculated as the sum of the entropies of the clusters, weighted by the size of each cluster:

E_C = \sum_{j=1}^{m} \frac{N_j}{N} E_j    (10)

where N_j is the size of cluster j, and N is the total number of data objects. Basically, we would like to maximize the F-measure and minimize the Entropy of clusters to achieve high quality clustering.

6.3 Effect of phrase-based similarity on clustering quality

The similarities calculated by our algorithm were used to construct a similarity matrix between the documents. We elected to use three standard document clustering techniques for testing the effect of phrase similarity on clustering [13]: (1) Hierarchical Agglomerative Clustering (HAC), (2) Single Pass Clustering, and (3) K-Nearest Neighbor Clustering (k-NN); although k-NN is mostly known as a classification method, it has also been used for clustering (an example can be found in [17]). For each of the algorithms, we constructed the similarity matrix and let the algorithm cluster the documents based on the presented similarity matrix. The results listed in Table 2 show the improvement in the clustering quality on the first data set using the combined similarity measure. The improvements shown were achieved at a similarity blend factor between 70% and 80% (phrase similarity weight). The parameters chosen for the different algorithms were the ones that produced the best results. The percentage of improvement ranges from a 19.5% to 60.6% increase in F-measure quality, and a 9.1% to 46.2% drop in Entropy (lower is better for Entropy). It is obvious that the phrase-based similarity plays an important role in accurately judging the relation between documents. It is known that Single Pass clustering is very sensitive to noise; that is why it has the worst performance.
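For reference, here is a small sketch (our own, not taken from the paper) of how the F-measure and entropy defined in section 6.2 can be computed from a labelled clustering result; the input format is a hypothetical convenience.

```python
# Illustrative computation of the F-measure (equations 7-9) and entropy (10).
import math
from collections import Counter

def overall_f_measure(clusters, labels):
    """`clusters` is a list of lists of object ids; `labels` maps object id -> class."""
    class_sizes = Counter(labels.values())
    total = sum(class_sizes.values())
    score = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for cluster in clusters:
            n_ij = sum(1 for obj in cluster if labels[obj] == cls)
            if n_ij == 0:
                continue
            p, r = n_ij / len(cluster), n_ij / n_i      # precision and recall of the cluster
            best = max(best, 2 * p * r / (p + r))       # F-measure of the best-matching cluster
        score += n_i * best
    return score / total

def overall_entropy(clusters, labels):
    n = sum(len(c) for c in clusters)
    total = 0.0
    for cluster in clusters:
        counts = Counter(labels[obj] for obj in cluster)
        e_j = -sum((c / len(cluster)) * math.log(c / len(cluster)) for c in counts.values())
        total += (len(cluster) / n) * e_j               # weight each cluster by its size
    return total

labels = {1: "sports", 2: "sports", 3: "finance", 4: "finance"}
print(overall_f_measure([[1, 2], [3, 4]], labels))  # 1.0 for a perfect clustering
print(overall_entropy([[1, 2], [3, 4]], labels))    # 0.0 for perfectly homogeneous clusters
```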

[Figure 5: Effect of phrase similarity on clustering quality for HAC, Single Pass, and k-NN, as the similarity blend factor (alpha) varies from 0 to 1. (a) Effect of phrase similarity on F-measure; (b) effect of phrase similarity on Entropy.]

Table 2: Phrase-based clustering improvement

                   Single-Term Similarity      Combined Similarity       Improvement
                   F-measure    Entropy        F-measure    Entropy
  HAC (a)          0.709        0.351          0.904        0.103        +19.5% F, -24.8% E
  Single Pass (b)  0.427        0.613          0.817        0.151        +39.0% F, -46.2% E
  k-NN (c)         0.228        0.173          0.834        0.082        +60.6% F, -9.1% E

  (a) Complete Linkage was used as the cluster distance measure for the HAC method, since it tends to produce tight clusters with small diameter.
  (b) A document-to-cluster similarity threshold of 0.25 was used.
  (c) A K of 5 and a cluster similarity threshold of 0.25 were used.

However, when the phrase similarity was introduced, the quality of the clusters produced was pushed close to that produced by HAC and k-NN. In order to better understand the effect of the phrase similarity on the clustering quality, we generated a clustering quality profile against the similarity blend factor. Figure 5(a) illustrates the effect of introducing the phrase similarity on the F-measure of the resulting clusters. It is obvious that the phrase similarity enhances the F-measure of the clustering until a certain point (around a weight of 80%), after which its effect starts to bring down the quality. As mentioned in section 4.1, phrases alone cannot capture all the similarity information between documents; the single-term similarity is still required, but to a smaller degree. The same can be seen from the Entropy profile in Figure 5(b), where Entropy is minimized at around an 80% contribution of phrase similarity against 20% for the single-term similarity. The results show that both evaluation measures are optimized in the same trend with respect

to the blend factor. Since two independent evaluation measures demonstrate the clustering quality improvement, we are confident that the results are not biased by either evaluation measure.

6.4 Incremental clustering evaluation

Our proposed incremental document clustering method was evaluated using both data sets mentioned earlier. We relied on the same evaluation measures discussed above, as well as another measure called Overall Similarity (O-S), which is the average of the similarities inside each cluster. Higher overall similarity means better cluster cohesiveness.

Table 3: Proposed Clustering Method Improvement

                      Data Set 1                          Data Set 2
                      F-measure   Entropy   O-S           F-measure   Entropy   O-S
  Proposed Method     0.931       0.119     0.504         0.682       0.156     0.497
  HAC                 0.709       0.351     0.455         0.584       0.281     0.398
  Single Pass         0.427       0.613     0.385         0.502       0.250     0.311
  k-NN                0.228       0.173     0.367         0.522       0.161     0.452

Table 3 shows the results of the proposed clustering method against HAC, Single Pass, and k-NN clustering. For the first data set, the improvement was very significant, reaching over 70% improvement over k-NN (in terms of F-measure), 25% improvement over HAC, and 53% improvement over Single Pass. This is attributed to the fact that the different categories of documents do not have a great deal of overlap, which enables the algorithm to avoid noisy similarities from other clusters. For the second data set, an improvement between 10% and 18% was achieved over the other methods.

However, the F-measure was not as high as for the first data set. By examining the actual documents and their classification, it turns out that the documents do not have enough overlap in each single class, which makes it difficult to obtain an accurate similarity calculation between the documents. Nevertheless, we were able to push the quality of clustering further by relying on accurate and robust phrase matching similarity calculation, and achieve higher clustering quality.

[Figure 6: Quality of Clustering Comparison on both data sets for the Proposed Method (with and without re-assignment), HAC, Single Pass, and k-NN. (a) Clustering quality in terms of F-measure; (b) clustering quality in terms of Entropy.]

Figure 6 shows the above-mentioned results more clearly, showing the achieved improvement in comparison with the other methods. The figure also shows the effect of applying the re-assignment

strategy discussed in section 5.3. The problem with incremental clustering is that documents often do not end up where they should be. The re-assignment strategy we chose re-assigns documents that are seen as bad for some clusters to other clusters that can accept them, all based on the idea of increasing the cluster similarity histogram ratio. The re-assignment strategy showed a slight improvement over the same method without document re-assignment, as shown in the figure.

7 Conclusion

We presented a system composed of four decoupled components in an attempt to improve document clustering in the web domain. The information in web documents does not lie in their content only, but also in their inherent semi-structure. By exploiting this structure we can achieve better clustering results. We presented a web document analysis component that is capable of identifying the structure of web documents and building structured documents out of the semi-structured web documents.

The second component, and perhaps the most important one with the greatest impact on performance, is the new document model introduced in this paper, the Document Index Graph. This model is based on indexing web documents using phrases and their levels of significance. Such a model enables us to perform phrase matching and similarity calculation between documents in a very robust and accurate way. The quality of clustering achieved using this model significantly surpasses that of traditional vector space model based approaches.

The third component is the phrase-based similarity measure. By carefully examining the factors affecting the degree of overlap between documents, we devised a phrase-based similarity measure that is capable of accurate calculation of pair-wise document similarity.

The fourth component is an incremental document clustering method based on maintaining high cluster cohesiveness by improving the pair-wise document similarity distribution inside each cluster. The merit of such a design is that each component could be utilized independently of the others. But we are confident that the combination of these components leads to better results, as justified by the results presented in this paper. By adopting different standard clustering techniques to test against our model, we are very confident that this model is well justified.

There are a number of future research directions to extend and improve this work. One direction in which this work might continue is to improve the accuracy of similarity calculation between documents by employing different similarity calculation strategies. Although the current scheme proved more accurate than traditional methods, there is still room for improvement. Although the work presented here is aimed at web document clustering, it could easily be adapted to any document type as well; however, it would then not benefit from the semi-structure found in web documents. Our intention is to investigate the usage of such a model on standard corpora and see its effect on clustering compared to traditional methods.

References

[1] K. Aas and L. Eikvil. Text categorisation: A survey. Technical Report 941, Norwegian Computing Center, June 1999.

[2] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Partitioning-based clustering for web document categorization. Decision Support Systems, 27:329-341, 1999.

[3] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Document categorization and query generation on the World Wide Web using WebACE. AI Review, 13(5-6):365-391, 1999.

[4] D. Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4):325-344, 1998.

[5] K. Cios, W. Pedrycs, and R. Swiniarski. Data Mining Methods for Knowledge Discovery. Kluwer Academic Publishers, Boston, 1998.

[6] W. W. Cohen. Learning to classify English text with ILP methods. In Proceedings of the 5th International Workshop on Inductive Logic Programming, pages 3-24. Department of Computer Science, Katholieke Universiteit Leuven, 1995.

[7] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, pages 148-155, November 1998.