
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 10, OCTOBER 2004

Efficient Phrase-Based Document Indexing for Web Document Clustering

Khaled M. Hammouda, Student Member, IEEE, and Mohamed S. Kamel, Senior Member, IEEE

The authors are with the Department of Systems Design Engineering, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1. {hammouda, mkamel}@pami.uwaterloo.ca. Manuscript received 30 July 2002; revised 11 Apr. 2003; accepted 27 Aug. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number.

Abstract: Document clustering techniques mostly rely on single-term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, more informative features, including phrases and their weights, are particularly important. Document clustering is useful in many applications, such as automatic categorization of documents, grouping search engine results, and building a taxonomy of documents. This paper presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the Document Index Graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.

Index Terms: Web mining, document similarity, phrase-based indexing, document clustering, document structure, document index graph, phrase matching.

1 INTRODUCTION

IN an effort to keep up with the tremendous growth of the World Wide Web, many research projects have targeted how to organize such information in a way that makes it easier for end users to find the information they want efficiently and accurately. Information on the Web is present in the form of text documents (formatted in HTML), and that is the reason many Web document processing systems are rooted in text data mining techniques. Text mining shares many concepts with traditional data mining methods. Data mining includes many techniques that can unveil inherent structure in the underlying data. One of these techniques is clustering. When applied to textual data, clustering methods try to identify inherent groupings of the text documents so that a set of clusters is produced in which clusters exhibit high intracluster similarity and low intercluster similarity [1]. Generally speaking, text document clustering methods attempt to segregate the documents into groups where each group represents some topic that is different from the topics represented by the other groups [2]. By applying text mining in the Web domain, the process becomes what is known as Web mining. There are three types of Web mining in general, according to Kosala and Blockeel [3]: 1) Web structure mining, 2) Web usage mining, and 3) Web content mining. We are mainly interested in the last type. Applications of document clustering include: clustering of retrieved documents to present organized and understandable results to the user (e.g., [4]), clustering documents in a collection (e.g., digital libraries), automated (or semiautomated) creation of document taxonomies (e.g., Yahoo and Open Directory styles), and efficient information retrieval by focusing on relevant subsets (clusters) rather than whole collections. The methods used for text clustering include decision trees [5], [6], [7], [8], statistical analysis [9], [10], [6], neural nets [11], inductive logic programming [12], [13], and rule-based systems [14], [15], among others. These methods are at the crossroads of more than one research area, such as databases (DB), information retrieval (IR), and artificial intelligence (AI), including machine learning (ML) and Natural Language Processing (NLP). Any clustering technique relies on four concepts: 1. a data representation model, 2. a similarity measure, 3. a cluster model, and 4. a clustering algorithm that builds the clusters using the data model and the similarity measure. Most of the document clustering methods that are in use today are based on the Vector Space Model [16], [17], [18], [19], which is a very widely used data model for text classification and clustering. The Vector Space Model represents documents as a feature vector of the terms (words) that appear in the document set. Each feature vector contains term weights (usually term frequencies) of the terms appearing in that document. Similarity between documents is measured using one of several similarity measures that are based on such a feature vector. Examples include the cosine measure and the Jaccard measure. Clustering methods based on this model make use of single-term analysis only; they do not make use of any word proximity or phrase-based analysis. (Throughout this paper, the term phrase means a sequence of words, not the grammatical structure of a sentence.)

Fig. 1. Web document clustering system design.

The motivation behind the work in this paper is that we believe that document clustering should be based not only on single-word analysis, but on phrases as well. Phrase-based analysis means that the similarity between documents should be based on matching phrases rather than on single words only. The work that has been reported in the literature about using phrases in document clustering is limited. Most efforts have been targeted toward single-word analysis. The most relevant work to what is presented here is that of Zamir et al. [20], [21], [4]. They proposed a phrase-based document clustering approach based on Suffix Tree Clustering (STC). The method basically involves the use of a trie (a compact tree) structure to represent shared suffixes between documents. Based on these shared suffixes, they identify base clusters of documents, which are then combined into final clusters based on a connected-component graph algorithm. They claim to achieve O(n log n) performance and produce high quality clusters. The results they showed were encouraging, but the suffix tree model could be argued to have a high number of redundancies in terms of the suffixes stored in the tree. In this paper, we propose a system for Web clustering based on two key concepts. The first is the use of weighted phrases as an essential constituent of documents. Similarity between documents will be based on matching phrases and their weights. The second concept is the incremental clustering of documents using a histogram-based method to maximize the tightness of clusters by carefully watching the similarity distribution inside each cluster. The system consists of four components:

1. A Web document restructuring scheme that identifies different document parts and assigns levels of significance to these parts according to their importance.

2. A novel phrase-based document indexing model, the Document Index Graph (DIG), that captures the structure of sentences in the document set, rather than single words only. The DIG model is based on graph theory and utilizes graph properties to match any-length phrases from a document to any number of previously seen documents in a time nearly proportional to the number of words of the document.

3. A phrase-based similarity measure for scoring the similarity between two documents according to the matching phrases and their significance.

4. An incremental document clustering method based on maintaining high cluster cohesiveness using a new cluster quality concept called the Similarity Histogram.

The integration of these four components proved to be of superior performance to traditional document clustering methods. Although the whole system performance is quite good, each component could be used independently of the others. The overall system design is illustrated in Fig. 1. The proposed phrase-based document indexing model is used to measure the similarity between the documents using a new similarity measure that makes use of phrase-based matching. The similarity calculation between documents is based on a combination of single-term similarity and phrase-based similarity. Similarity based on matching phrases between documents is shown to have a more significant effect on the clustering quality due to its insensitivity to noisy terms that could lead to incorrect similarity measures.
The proposed incremental document clustering method relies on improving the pair-wise document similarity distribution inside each cluster so that similarities are maximized in each cluster. The quality of the clusters produced using this system was higher than that of clusters produced using traditional clustering methods. The improvement over traditional clustering methods was 10 to 29 percent.

The rest of this paper is organized as follows: Section 2 presents an analysis of the important features of semistructured Web documents. Section 3 introduces the Document Index Graph model. Section 4 presents the phrase-based similarity measure. Section 5 presents our proposed incremental clustering algorithm. Section 6 discusses the experimental results. Finally, we conclude and discuss future work in the last section.

2 WEB DOCUMENT STRUCTURE ANALYSIS

Web documents are known to be semistructured. HTML tags are used to designate different parts of the document. However, since the HTML language is meant for specifying the layout of the document, it is used to present the document to the user in a friendly manner rather than to specify the structure of the data in the document; hence, such documents are semistructured. However, it is still possible to identify key parts of the document based on this structure. The idea is that some parts of the document are more informative than other parts, thus having different levels of significance based on where they appear in the document and the tags that surround them. It is less informative to treat the title of the document, for example, and the body text equally. The proposed system analyzes the HTML document and restructures it according to a predetermined structure that assigns different levels of significance to different document parts. The result is a well-structured XML document that corresponds to the original HTML document, but with significance levels assigned to the different parts of the original document. Currently, we assign one of three levels of significance to the different parts: HIGH, MEDIUM, and LOW. Examples of HIGH significance parts are the title, meta-keywords, meta-description, and section headings. Examples of MEDIUM significance parts are text that appears in bold, italics, or color, hyperlinked text, image alternate text, and table captions. LOW significance parts usually comprise the document body text that was not assigned any of the other levels. This structuring scheme is exploited in measuring the similarity between two documents (see Section 4 for details). For example, if we have a phrase match of HIGH significance in both documents, the similarity is rewarded more than if the match were between LOW significance phrases. This is justified by arguing that a phrase match in titles, for example, is much more informative than a phrase match in body text. A sentence boundary detector algorithm was developed to locate sentence boundaries in the documents. The algorithm is based on a finite state machine lexical analyzer with heuristic rules for finding the boundaries. A similar approach is used to find word boundaries. About 98 percent of the actual boundaries are correctly detected. The resulting documents contain very accurate sentence separation and word separation, with negligible noise. Finally, a document cleaning step is performed to remove stop-words that have no significance, and to stem the words using the popular Porter Stemmer algorithm [22].

2.1 Document Representation

A formal model is presented here that represents document features as sentences rather than individual words. The model assumes that the constituents of a document are a set of sentences, which in turn are composed of a set of terms.
A document is represented as a vector of sentences:

d_i = { s_ij : j = 1, ..., p_i },    (1a)
s_ij = { t_ijk : k = 1, ..., l_ij ; w_ij },    (1b)

where
d_i is document i,
s_ij is sentence j in document i,
p_i is the number of sentences in document i,
t_ijk is term k of sentence s_ij,
l_ij is the length of sentence s_ij, and
w_ij is the weight associated with sentence s_ij.

The above definition is a direct mapping of the actual document to a formal representation that breaks a document into a set of sentences. Sentence weights are assigned according to their significance, as discussed in Section 2. This definition does not consider the frequency of sentences (or parts of sentences) as a sentence weight. Sentence frequency will be taken into consideration when matching phrases between documents. The rationale for deferring sentence frequency calculation was to perform a lazy computation upon matching part of a sentence with other documents, rather than computing all possible frequencies upfront that might not be used in similarity calculation later.

3 DOCUMENT INDEX GRAPH

To achieve better clustering results, the data model that underlies the clustering method must accurately capture the salient features of the data. According to the Vector Space Model, the document data is represented as a feature vector of terms with different weights assigned to the terms according to their frequency of appearance in the document. It does not represent any relation between the words, so sentences are broken down into their individual components without any representation of the sentence structure. The proposed Document Index Graph (DIG for short) indexes the documents while maintaining the sentence structure in the original documents. This allows us to make use of more informative phrase matching rather than individual word matching. Moreover, DIG also captures the different levels of significance of the original sentences, thus allowing us to make use of sentence significance. Suffix trees are the closest structure to the proposed model, but they suffer from huge redundancy [23]. Apostolico [24] gives more than 40 references on suffix trees, and Manber and Myers [25] add more recent ones. However, the proposed DIG model is not just an extension or an enhancement of suffix trees; it takes a different perspective on how to match phrases efficiently, without the need for storing redundant information. Phrasal indexing has been widely used in the information retrieval literature [26], [27]. The work presented here takes it a step further toward an efficient way of indexing phrases with emphasis on applying phrase-based similarity as a way of clustering documents accurately.
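Before describing the DIG structure, it may help to pin down the representation of (1a) and (1b) concretely. The following is a minimal sketch in Python (not from the paper; class and field names are illustrative) of a document as a list of significance-weighted sentences.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Sentence:
    terms: List[str]   # t_ijk, k = 1..l_ij (terms after stop-word removal and stemming)
    weight: float      # w_ij, derived from the HIGH/MEDIUM/LOW significance level

    def length(self) -> int:
        return len(self.terms)   # l_ij

@dataclass
class Document:
    doc_id: int
    sentences: List[Sentence] = field(default_factory=list)   # s_ij, j = 1..p_i

# A two-sentence document whose first sentence came from a HIGH-significance part;
# the weights are made-up values for illustration only.
doc = Document(1, [Sentence(["river", "rafting"], 3.0),
                   Sentence(["wild", "river", "adventures"], 1.0)])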

3.1 DIG Structure Overview

The DIG is a directed graph (digraph) G = (V, E), where V is a set of nodes {v_1, v_2, ..., v_n}, in which each node v represents a unique word in the entire document set; and E is a set of edges {e_1, e_2, ..., e_m}, such that each edge e is an ordered pair of nodes (v_i, v_j). Edge (v_i, v_j) is from v_i to v_j, and v_j is adjacent to v_i. There will be an edge from v_i to v_j if, and only if, the word v_j appears immediately after the word v_i in any document. A set of edges is said to correspond to a sentence in a document if the edges link the nodes corresponding to the sentence in the same order the words appeared in the sentence. The above definition of the graph implies that the number of nodes in the graph is the number of unique words in the document set, i.e., the vocabulary of the document set, since each node represents a single word in the whole document set. Nodes in the graph carry information about the documents they appeared in, along with the sentence path information. Sentence structure is maintained by recording the edge along which each sentence continues. This essentially creates an inverted list of the documents, but with sentence information recorded in the inverted list. Assume a sentence of m words appearing in one document consists of the following word sequence: {v_1, v_2, ..., v_m}. The sentence is represented in the graph by a path from v_1 to v_m, such that (v_1, v_2), (v_2, v_3), ..., (v_{m-1}, v_m) are edges in the graph. Path information is stored in the vertices along the path to uniquely identify each sentence. Sentences that share subphrases will have shared parts of their paths in the graph that correspond to the shared subphrase. To better illustrate the graph structure, Fig. 2 presents a simple example graph that represents three documents.

Fig. 2. Example of the document index graph.

Each document contains a number of sentences with some overlap between the documents. As seen from the graph, an edge is created between two nodes only if the words represented by the two nodes appear successively in any document. Thus, sentences map into paths in the graph. Dotted lines represent sentences from Document 1, dash-dotted lines represent sentences from Document 2, and dashed lines represent sentences from Document 3. If a phrase appears more than once in a document, the frequency of the individual words making up the phrase is increased, and the sentence information in the nodes reflects the multiple occurrences of such a phrase. As mentioned earlier, matching phrases between documents becomes a task of finding shared paths in the graph between different documents.

3.2 DIG Detailed Structure

This section provides details of the phrase indexing structure to serve as a reference for implementation purposes. Phrase indexing information is stored in the graph nodes themselves in the form of document tables.

Fig. 3. DIG structure detail.

Fig. 3 illustrates

the information stored in one of the nodes from the previous example. Basically, the structure maintained in each node is a table of documents. Each document entry in the document table records the term frequency (TF) of the word in that document. Since words can appear in different parts of a document with different levels of significance, the recorded term frequency is actually broken into those levels of significance, with a frequency count per level (these are the three numbers under the TF column). This structure helps in achieving a more accurate similarity measure based on the level of significance. Since the graph is directed, each node maintains a list of outgoing edges per document entry. This list of edges tells us which sentence continues along which edge. The task of creating a sentence path in the graph is thus to record the necessary information in this edge table to reflect the structure of the sentences. The document table structure kept in each node consists of the following items: Document ID, Term Frequency (for different levels of significance), and Edge Table. The edge table is a set of outgoing edges E_d, a subset of E_v, where E_v is the set of outgoing edges that belong to node v (which, in turn, is a subset of the whole edge set E). E_d is the set of outgoing edges for a specific document entry in the document table, and it holds the path information required to follow a certain sentence in the graph. Each such edge table (of a specific document) maintains the different sentence instances that might appear in the document, to accommodate multiple occurrences of the same sentence (or subsentence). For example, the word river appeared in Document 1 three times: as the first word in sentence 1, s_1(1), the second word in sentence 2, s_2(2), and the first word in sentence 3, s_3(1). Thus, this whole structure maintains full information about each sentence in each document, and that is what facilitates complete phrase matching of any length.

3.3 DIG Construction

The DIG is built incrementally by processing one document at a time. When a new document is introduced, it is scanned in sequential fashion, and the graph is updated with the new sentence information as necessary. New words are added to the graph as necessary and connected with other nodes to reflect the sentence structure. The graph building process becomes less memory demanding when no new words are introduced by a new document (or very few new words are introduced). At this point, the graph becomes more stable, and the only operation needed is to update the sentence structure in the graph to accommodate the new sentences introduced. It is very critical to note that introducing a new document will only require the inspection (or addition) of those words that appear in that document, and not every node in the graph. This is where the efficiency of the model comes from. Along with indexing the sentence structure, the level of significance of each sentence is also recorded in the graph. This allows us to recall such information when we measure the similarity with other documents. Continuing from the example introduced earlier, the process of constructing the graph that represents the three documents is illustrated in Fig. 4.

Fig. 4. Incremental construction of the document index graph.
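A rough sketch of the per-node bookkeeping described in Section 3.2, which the incremental construction step updates as each document arrives. This is not the authors' implementation; the container layout and field names are assumptions, and it reuses the Document/Sentence sketch given earlier.

from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class DocumentEntry:
    # term frequency of the node's word in this document, split by significance level
    tf: Dict[str, int] = field(default_factory=lambda: {"HIGH": 0, "MEDIUM": 0, "LOW": 0})
    # edge table: next_word -> (sentence index, word position) pairs at which a
    # sentence of this document continues along the edge (this_word -> next_word)
    edges: Dict[str, List[Tuple[int, int]]] = field(default_factory=lambda: defaultdict(list))

@dataclass
class Node:
    word: str
    doc_table: Dict[int, DocumentEntry] = field(default_factory=dict)  # document table

# The DIG itself can be held as a word -> Node map; edges are implicit in the
# per-document edge tables, so only words that actually occur are ever touched.
DIG = Dict[str, Node]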
The emphasis here is on the incremental construction process, where new nodes are added and new edges are created incrementally upon introducing a new document. We now define the incremental

DIG construction process formally in terms of graph properties and operations.

Document Subgraph. Each document d_i is mapped to a subgraph g_i that represents this document in a standalone manner (an example is the first step in Fig. 4). Each subgraph can be viewed as a detached subset of the DIG that represents the corresponding document in terms of the DIG properties: g_i = (V_i, E_i), where V_i is the set of nodes corresponding to the unique words of d_i, and E_i is the set of edges representing the sentence paths of d_i.

Cumulative DIG. Let the DIG representation of the documents processed up to document d_{i-1} be G_{i-1}, and that of the documents processed up to document d_i be G_i. Computing G_i is done by merging G_{i-1} with the subgraph g_i: G_i = G_{i-1} ∪ g_i. G_i is said to be the Cumulative DIG of the documents processed up to document d_i.

Phrase Matching. A list of matching phrases between documents d_i and d_j is computed by intersecting the subgraphs of the two documents, g_i and g_j, respectively. Let M_ij denote such a list; then M_ij = g_i ∩ g_j. A list of matching phrases between document d_i and all previously processed documents is computed by intersecting the document subgraph g_i with the cumulative DIG G_{i-1}. Let M_i denote such a list; then M_i = g_i ∩ G_{i-1}.

Unlike traditional phrase matching techniques that are usually used in the information retrieval literature, DIG provides complete information about full phrase matching between every pair of documents. While traditional phrase matching methods are aimed at searching and retrieval of documents that have matching phrases to a specific query, DIG is aimed at providing information about the degree of overlap between every pair of documents. This information will help in determining the degree of similarity between documents, as will be explained in Section 4.

3.4 DIG Construction and Phrase Matching Algorithm

Upon introducing a new document, finding matching phrases from previously seen documents becomes an easy task using DIG. Algorithm 1 (Fig. 5) describes the process of both incremental graph building and phrase matching.

Fig. 5. Algorithm 1: DIG incremental construction and phrase matching.

Instead of building document subgraphs and intersecting

them with the cumulative DIG, the algorithm incrementally incorporates new documents into the DIG while collecting matching phrases from previous documents at the same time. The procedure starts with a new document to process (line 1). Matching phrases from previous documents is done by keeping a list M that holds an entry for every previous document that shares a phrase with the current document d_i. For each sentence (for loop at line 3), we process the words in the sentence sequentially, adding new words (as new nodes) to the graph and constructing a path in the graph (by adding new edges if necessary) to represent the sentence we are processing. As we continue along the sentence path, we update M by adding new matching phrases and their respective document identifiers, and by extending phrase matches from the previous iteration (lines 14 to 16). We first consult the document table of v_{k-1} for documents that have sentences that continue along the edge e_k. Those documents share at least two terms with the current sentence under consideration. We examine the list M for any previous matching phrases (from previous iterations) to extend the current two-term phrase match (on edge e_k). This allows the extension of previous matches and can continue for any-length phrase match. If there are no matching phrases at some point, we just update the respective nodes of the graph to reflect the new sentence path (line 19). After the whole document is processed, M will contain all the matching phrases between the current document and any previous document that shared at least one phrase with the new document. Finally, we update G_i to be the current cumulative DIG and output M as the list of documents with all the necessary information about the matching phrases, which will be used in similarity calculation later.

The algorithm's performance is largely determined by the size of the document table at each node in the graph. A node's document table size is essentially the inverse document frequency (idf) of the term represented by the node. The larger the average document table size, the more computation is needed to process a newly introduced document.

TABLE 1 DIG Size Statistics

According to the DIG size statistics given in Table 1, the average document table size grows slowly with the number of documents. This slow increase is believed to be critical to the efficiency of the algorithm, since computation should not be affected much by the growth of the average document table size. In the worst-case scenario, the average document table size will become as large as the number of documents in the data set, in which case the algorithm's performance would be quadratic because every word is shared by all documents. In practice, however, we typically have small overlap in documents for each word, which helps the algorithm perform well. As explained in more detail in the next section, the average number of matching documents at any node tends to grow slowly. The actual performance depends on how much overlap of phrases there is in the document set; the more matching phrases, the more time it takes to process the whole set, but the more accuracy we get for similarity, and vice versa. The trade-off is corpus-dependent, but in general, for Web documents, it is typically a balance between speed and accuracy.
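A simplified sketch of the incremental construction and matching loop of Algorithm 1, building on the Node/DocumentEntry sketch above. It is not the published algorithm: significance levels, sentence weights, and multiple sentence instances are omitted, and matches are returned simply as word tuples per previously seen document.

from collections import defaultdict

def add_document(dig, doc):
    """Add `doc` to the DIG and return {other_doc_id: [matching phrases]}."""
    matches = defaultdict(list)
    for s_idx, sent in enumerate(doc.sentences):
        words = sent.terms
        active = {}   # doc_id -> phrase currently being extended with that document
        for k, word in enumerate(words):
            node = dig.setdefault(word, Node(word))
            entry = node.doc_table.setdefault(doc.doc_id, DocumentEntry())
            entry.tf["LOW"] += 1   # significance levels not distinguished in this sketch
            if k > 0:
                prev = dig[words[k - 1]]
                # record that this sentence continues along the edge (words[k-1] -> word)
                prev.doc_table[doc.doc_id].edges[word].append((s_idx, k - 1))
                # previously seen documents whose sentences also continue along this edge
                sharing = {d for d, e in prev.doc_table.items()
                           if d != doc.doc_id and word in e.edges}
                extended = {}
                for d in sharing:
                    extended[d] = active[d] + (word,) if d in active else (words[k - 1], word)
                for d, phrase in active.items():   # matches that stop here are finalized
                    if d not in extended:
                        matches[d].append(phrase)
                active = extended
        for d, phrase in active.items():           # flush remaining matches at sentence end
            matches[d].append(phrase)
    return matches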
This efficient construction/phrase-matching performance lends itself to online incremental processing, such as processing the results of a Web search engine's retrieved list of documents. The algorithm processed 2,000 newsgroup articles in as little as 44 seconds, while it processed 2,340 moderately sized Web documents in a little over 5 minutes. Performance results are discussed in Section 6.

3.5 DIG Complexity

The example presented here is a simple one. Real Web documents will contain hundreds of words. With a very large document set, the graph could become more complex in terms of memory usage. By definition, the number of graph nodes will be exactly the same as the number of unique words in the data set. In the worst case, the number of edges will be m² (where m is the number of unique words), if every word is followed by every other word in the corpus. However, typically, the number of edges is around one order of magnitude larger than the number of nodes. In terms of memory usage compared to the vector space model, if we assume that we do not maintain phrase indexing structures, the model will use memory as large as the number of nonempty entries in a term-by-document vector space model matrix (since it represents an inverted list of term-to-document term frequencies). If we maintain phrase indexing structures, we require extra memory as large as the number of documents times the average number of terms per document. More formally, assume that

n is the number of documents in the data set,
m is the number of unique terms in the data set,
idf_avg is the average inverse document frequency (IDF), and
q is the average number of terms per document;

then the space requirement of the model is:

Size(G) = 3 (m × idf_avg) + q × n.    (2)

The first term in the equation is the space required for the model without phrase indexing (3 term frequency entries per term for each document it appears in). According to the statistics shown in Table 1, the average idf (which is the average document table size of a node) is quite low and tends to grow slowly. The second term accounts for the phrase indexing requirement, where we need to store information about the location of each term in each document. If we assume that idf_avg and q are constants (which we believe is not a stretched assumption), then the asymptotic upper bound on the size of DIG is near-linear. Table 1 shows some statistics about the size of DIG using a data set of 2,340 Web documents, with an average of 289 words per document (see Section 6 for details about the DS2 data set). The average size of the different components of the DIG (document tables, edge tables, and sentence tables) saturates at low values as the graph grows. The number of different tables necessary to maintain the DIG, however, grows to an order of magnitude larger than the number of nodes.

Fig. 6. DIG size scalability.

Fig. 6 clearly shows that the DIG size grows linearly with the number of documents, as we demonstrated earlier in (2). Although specifically designed for phrase indexing, this design of the model allows for the ability to do away with the phrase indexing structure and work only with single-term processing, just like the vector space model. However, the storage of information about single terms is done in a compact way without the need to store a very large sparse term-by-document matrix.

4 A PHRASE-BASED SIMILARITY MEASURE

As mentioned earlier, phrases convey local context information, which is essential in determining an accurate similarity between documents. Toward this end, we devised a similarity measure based on matching phrases rather than individual terms. This measure exploits the information extracted from the previous phrase matching algorithm to better judge the similarity between the documents. This is related to the work of Isaacs and Aslam [28], who used a pair-wise probabilistic document similarity measure based on Information Theory. Although they showed it could improve on traditional similarity measures, it is still fundamentally based on the vector space model representation. The phrase similarity between two documents is calculated based on the list of matching phrases between the two documents. From an information theoretic point of view, the similarity between two objects is regarded as how much they share in common. The cosine and the Jaccard measures are indeed of such a nature, but they are essentially used as single-term based similarity measures. Lin [29] gave a formal definition for any information theoretic similarity measure in the form of:

sim(x, y) = (x ∩ y) / (x ∪ y).    (3)

The basic assumption here is that the similarity between two documents is based on the ratio of how much they overlap to their union, all in terms of phrases. This definition coincides with the major assumption of the cosine and the Jaccard measures, and with Lin's definition as well. This phrase-based similarity measure is a function of four factors:

- the number of matching phrases P,
- the lengths of the matching phrases (l_i : i = 1, 2, ..., P),
- the frequencies of the matching phrases in both documents (f_1i and f_2i : i = 1, 2, ..., P), and
- the levels of significance (weight) of the matching phrases in both documents (w_1i and w_2i : i = 1, 2, ..., P).

Frequency of phrases is an important factor in the similarity measure. The more frequently a phrase appears in both documents, the more similar the documents tend to be. Similarly, the level of significance of the matching phrase in both documents should be taken into consideration. The phrase similarity between two documents, d_1 and d_2, is calculated using the following empirical equation:

sim_p(d_1, d_2) = sqrt( Σ_{i=1}^{P} [g(l_i) (f_1i w_1i + f_2i w_2i)]² ) / ( Σ_j |s_1j| w_1j + Σ_k |s_2k| w_2k ),    (4)

where g(l_i) is a function that scores the matching phrase length, giving a higher score as the matching phrase length approaches the length of the original sentence; |s_1j| and |s_2k| are the original sentence lengths from documents d_1 and d_2, respectively. The equation rewards longer phrase matches with a higher level of significance and with higher frequency in both documents. The function g(l_i) in the implemented system was defined as:

g(l_i) = (l_i / |s_i|)^γ,    (5)

where |s_i| is the original phrase length and γ is a sentence fragmentation factor with values greater than or equal to 1. If γ is 1, two halves of a sentence could be matched independently and would be treated as a whole sentence match. However, by increasing γ we can avoid this situation and score whole sentence matches higher than fractions of sentences. A value of 1.2 for γ was found to produce the best results. The normalization by the length of the two documents in (4) is necessary to be able to compare the similarities from other documents.

4.1 Combining Single-Term and Phrase Similarities

If the similarity between documents is based solely on matching phrases, and not on single terms at the same time, related documents could be judged as nonsimilar if they do not share enough phrases (a typical case). Shared phrases provide important local context matching, but sometimes similarity based on phrases only is not sufficient. To alleviate this problem, and to produce high quality clusters, we combined a single-term similarity measure with our phrase-based similarity measure. Experimental results to justify this claim are given in Section 6.3. We used the cosine correlation similarity measure [17], [19], with TF-IDF (Term Frequency-Inverse Document Frequency) term weights, as the single-term similarity measure. The cosine measure was chosen due to its wide use in the document clustering literature, and since it is described as being able to capture human categorization behavior well [30]. The TF-IDF weighting is also a widely used term weighting scheme [31]. Recall that the cosine measure calculates the cosine of the angle between the two document vectors. Accordingly, our term-based similarity measure (sim_t) is given as:

sim_t(d_1, d_2) = cos(d_1, d_2) = (d_1 · d_2) / (||d_1|| ||d_2||),    (6)

where the vectors d_1 and d_2 are represented as term weights calculated using the TF-IDF weighting scheme. The combination of the term-based and the phrase-based similarity measures is a weighted average of the two quantities from (4) and (6), and is given by (7).
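A rough sketch of how (4), (5), and (6) can be computed once the DIG matching step has produced the matching-phrase list. This is not the authors' code; the input formats (tuples of phrase length, originating sentence length, frequencies, and weights; TF-IDF vectors as dicts) are assumptions made for illustration.

import math

def phrase_sim(matched, sents1, sents2, gamma=1.2):
    """Eq. (4)/(5). `matched` holds tuples (l, s_len, f1, w1, f2, w2); `sents1`
    and `sents2` are (term list, weight) pairs for the sentences of d1 and d2."""
    num = 0.0
    for l, s_len, f1, w1, f2, w2 in matched:
        g = (l / s_len) ** gamma                       # Eq. (5): favour near-complete sentences
        num += (g * (f1 * w1 + f2 * w2)) ** 2
    denom = (sum(len(t) * w for t, w in sents1) +
             sum(len(t) * w for t, w in sents2))
    return math.sqrt(num) / denom if denom else 0.0

def term_sim(v1, v2):
    """Eq. (6): cosine of TF-IDF vectors given as term -> weight dicts."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0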
The reason for separating single terms and phrases in the similarity equation, as opposed to treating a single term as a one-word phrase, is to be able to evaluate the blending factor between the two quantities and see the effect of phrases on similarity as opposed to single terms:

sim(d_1, d_2) = α · sim_p(d_1, d_2) + (1 − α) · sim_t(d_1, d_2),    (7)

where α is a value in the interval [0, 1] which determines the weight of the phrase similarity measure, or, as we call it, the Similarity Blend Factor. According to the experimental results discussed in Section 6, we found that a value between 0.6 and 0.8 for α results in the maximum improvement in clustering quality.

5 INCREMENTAL DOCUMENT CLUSTERING

In this section, we present a brief overview of incremental clustering algorithms and introduce the proposed algorithm, based on pair-wise document similarity, employing it as part of the whole Web document clustering system. The role of a document similarity measure is to provide a judgement of the closeness of documents to each other. However, it is up to the clustering method how to make use of such similarity calculation. Steinbach et al. [32] give a good comparison of document clustering techniques. A large array of data clustering methods can also be found in [33], [34]. Charikar et al. [35] discussed an incremental hierarchical clustering algorithm as well. Beil et al. [36] proposed a clustering algorithm based on frequent terms that addresses the high dimensionality problem of text data sets. Pantel and Lin [37] proposed the CBC document clustering algorithm, which finds cluster representatives as a way to decide on the membership of clusters later. The idea here is to employ an incremental clustering method that will exploit our similarity measure to produce clusters of high quality (assessing the quality of clustering is described in Section 6). Incremental clustering is an essential strategy for online applications, where time is a critical factor for usability. Incremental clustering algorithms work by processing data objects one at a time, incrementally assigning data objects to their respective clusters as they progress. The process is simple enough, but faces several challenges. How do we determine to which cluster the next object should be assigned? How do we deal with the problem of insertion order? Once an object has been assigned to a cluster, should its assignment to the cluster be frozen, or is it allowed to be reassigned to other clusters later on? Usually, a heuristic method is employed to deal with the above challenges. A good incremental clustering algorithm has to find the respective cluster for each newly introduced object without significantly sacrificing the accuracy of clustering due to insertion order or fixed object-to-cluster assignment. We will briefly discuss four incremental clustering methods in light of the above challenges before we introduce our proposed method.

Single-Pass Clustering [38], [1]. This algorithm basically processes documents sequentially and compares each document to all existing clusters. If the similarity between

the document and any cluster is above a certain threshold, then the document is added to the closest cluster; otherwise, it forms its own cluster. Usually, the similarity between a document and a cluster is determined by computing the average similarity of the document to all documents in that cluster.

K-Nearest Neighbor Clustering [39], [1]. Although K-NN is mostly known for classification, it has also been used for clustering (an example can be found in [40]). For each new document, the algorithm computes its similarity to every other document and chooses the top k documents. The new document is assigned to the cluster to which the majority of the top k documents are assigned.

Suffix Tree Clustering (STC). Introduced by Zamir et al. [20] in 1997, the idea behind the STC algorithm is to build a tree of phrase suffixes shared between multiple documents. The documents sharing a suffix are considered a base cluster. Base clusters are then combined if they have a document overlap of 50 percent or more. The algorithm has two drawbacks. First, although the structure used is a compact tree, suffixes can appear multiple times if they are part of larger shared suffixes. The other drawback is that the second phase of the algorithm is not incremental. Combining base clusters into final clusters has to be done in a nonincremental way. The algorithm deals properly with the insertion order problem, though, since any insertion order will lead to the same resultant suffix tree.

DC-Tree Clustering. The DC-tree incremental algorithm was introduced by Wong and Fu [41]. The algorithm is based on the B+-tree structure. Unlike the STC algorithm, this algorithm is based on a vector space representation of the documents. Most of the algorithm operations are borrowed from B+-tree operations. Each node in the tree is a representation of a cluster, where a cluster is represented by the combined feature vectors of its individual documents. Inserting a new document involves comparing the document feature vector with the cluster vectors at one level of the tree and descending to the most similar cluster. The algorithm defines several parameters and thresholds for the various operations. It suffers from two problems, though. First, once a document is assigned to a cluster, it is not allowed to be reassigned later to a newly created cluster. Second, as a consequence of the first drawback, clusters are not allowed to overlap, i.e., a document can belong to only one cluster.

5.1 Similarity Histogram-Based Incremental Clustering

The clustering approach proposed here is an incremental dynamic method of building the clusters. We adopt an overlapped cluster model. The key concept of the similarity histogram-based clustering method (referred to as SHC hereafter) is to keep each cluster at a high degree of coherency at any time. We represent the coherency of a cluster with a new concept called the Cluster Similarity Histogram.

Cluster Similarity Histogram. A concise statistical representation of the distribution of pair-wise document similarities in the cluster. A number of bins in the histogram correspond to fixed similarity value intervals. Each bin contains the count of pair-wise document similarities in the corresponding interval. Fig. 7 shows a typical cluster similarity histogram, where the distribution is almost a normal distribution.

Fig. 7. Cluster similarity histogram.
A perfect cluster would have a histogram where the similarities are all maximum, while a loose cluster would have a histogram where the similarities are all minimum.

5.2 Creating Coherent Clusters Incrementally

Our objective is to keep each cluster as coherent as possible. In terms of the similarity histogram concept, this translates to maximizing the number of similarities in the high similarity intervals. To achieve this goal in an incremental fashion, we judge the effect of adding a new document to a certain cluster. If the document is going to degrade the distribution of the similarities in the cluster very much, it should not be added; otherwise, it is added. A much stricter strategy would be to add only documents that enhance the similarity distribution. However, this could create a problem with perfect clusters: the document would be rejected by the cluster even if it has high similarity to most of the documents in the cluster (because the cluster is perfect). We judge the quality of a similarity histogram (cluster cohesiveness) by calculating the ratio of the count of similarities above a certain similarity threshold S_T to the total count of similarities. The higher this ratio, the more coherent the cluster. Let n_c be the number of documents in a cluster. The number of pair-wise similarities in the cluster is m_c = n_c (n_c + 1)/2. Let S = {s_i : i = 1, ..., m_c} be the set of similarities in the cluster. The histogram of the similarities in the cluster is represented as:

H = {h_i : i = 1, ..., B},    (8a)
h_i = count(s_k) such that s_li ≤ s_k < s_ui,    (8b)

where
B is the number of histogram bins,
h_i is the count of similarities in bin i,
s_li is the lower similarity bound of bin i, and
s_ui is the upper similarity bound of bin i.

The histogram ratio (HR) of a cluster is the measure of cohesiveness of the cluster as described above, and is calculated as:

HR_c = ( Σ_{i=T}^{B} h_i ) / ( Σ_{j=1}^{B} h_j ),    (9a)
T = ⌊S_T · B⌋,    (9b)

where
HR_c is the histogram ratio of cluster c,
S_T is the similarity threshold, and
T is the bin number corresponding to the similarity threshold.

Basically, we would like to keep the histogram ratio of each cluster high. However, since we allow documents that can degrade the histogram ratio to be added, this could eventually result in a chain effect of degrading the ratio to zero. To prevent this, we set a minimum histogram ratio HR_min that clusters should maintain. We also do not allow adding a document that would bring the histogram ratio down significantly (even if it remains above HR_min). This is to prevent a bad document from severely bringing down cluster quality through a single document addition. We now present the incremental clustering algorithm based on the above framework (see Fig. 8, Algorithm 2).

Fig. 8. Algorithm 2: similarity histogram-based incremental document clustering.

The algorithm works incrementally by receiving a new document and, for each cluster, calculating the cluster histogram ratio before and after simulating the addition of the document (lines 4-6). The old and new histogram ratios are compared, and if the new ratio is greater than or equal to the old one, the document is added to the cluster. If the new ratio is less than the old one by no more than ε and still above HR_min, it is also added (lines 7-9). Otherwise, it is not added. If, after checking all clusters, the document was not assigned to any cluster, a new cluster is created and the document is added to it (lines 11-15). In comparison with the criteria of single-pass clustering and K-NN clustering, the similarity histogram ratio as a coherency measure provides a more representative measure of the tightness of the documents in the cluster, and of how an external document would affect such tightness. On the other hand, single-pass clustering compares the external document to the average of the similarities in the cluster, while the K-NN method takes into consideration only a few similarities that might be outliers, which is why we sometimes need to increase the value of the parameter k to get better results from K-NN. This was the main reason for devising such a concise cluster coherency measure and employing it in assessing the effect of external documents on each cluster.

5.3 SHC Complexity Analysis

By definition, the time complexity of the similarity histogram-based clustering algorithm is O(n²), since for each new document we must compute its similarity to all previously seen documents. This is a property of all algorithms that work based on a document similarity matrix. However, the similarity histogram representation gives us an advantage with typical document corpora. Typically, a document similarity vector (containing its similarity to every other document) will be sufficiently sparse. This is usually because quite a large percentage of documents do not share any words (after removal of stop-words), especially documents from different classes, so their similarity is zero. We can take advantage of this sparse vector by compacting zero elements into the first bin of the similarity histogram of each cluster and only processing nonzero elements, which actually affect the computation time of the algorithm. This strategy saves computation and makes the algorithm subquadratic in typical situations, as demonstrated by the experiments in Section 6.4.
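Returning to the acceptance test of Algorithm 2, the histogram ratio of (8)-(9) and the add/reject decision can be sketched as follows. This is illustrative only; the bin count, similarity threshold, HR_min, and epsilon values are made-up defaults, not the paper's settings.

def histogram_ratio(sims, bins=10, sim_threshold=0.3):
    """Eqs. (8)-(9): fraction of pair-wise similarities at or above S_T."""
    if not sims:
        return 1.0
    hist = [0] * bins
    for s in sims:
        hist[min(int(s * bins), bins - 1)] += 1   # fixed-interval bins over [0, 1]
    t = int(sim_threshold * bins)                  # Eq. (9b): T = floor(S_T * B)
    return sum(hist[t:]) / len(sims)               # Eq. (9a)

def should_add(cluster_sims, new_doc_sims, hr_min=0.5, eps=0.05):
    """Algorithm 2 test: simulate the addition and compare histogram ratios.
    `cluster_sims` are the cluster's current pair-wise similarities and
    `new_doc_sims` the new document's similarities to the cluster members."""
    hr_old = histogram_ratio(cluster_sims)
    hr_new = histogram_ratio(cluster_sims + new_doc_sims)
    return hr_new >= hr_old or (hr_old - hr_new < eps and hr_new > hr_min)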
Space requirements for the algorithm depend on whether similarities are precomputed, or each new document's similarity to other documents is computed as it arrives. In the former case, the space requirement is O(n²), while in the latter it is O(n), since we only need to store information about the current document similarity vector;

the vector is then discarded after the document is processed, and we only keep histogram information for each cluster, which is O(r), where r is the number of clusters.

5.4 Dealing with Insertion Order Problems

Our strategy for the insertion order problem is to implement a document reassignment strategy. This strategy does not completely eliminate the problem, but it helps reduce its effect; i.e., the process is nondeterministic and different insertion orderings will result in different partitionings of the documents. Older documents that were added before new clusters were created should have the chance to be reassigned to newly created clusters. Only documents that seem to be bad for a certain cluster are tagged and considered for reassignment to other clusters. The documents that are candidates to leave a cluster are the documents that, if they were removed from the cluster, would increase the cluster similarity histogram ratio; i.e., the cluster is better off without them. We keep with each document a value indicating the histogram ratio if the document were not in the cluster. If this value is greater than the current histogram ratio, then the document is a candidate for leaving the cluster. This tagging of bad documents allows clusters to be reassessed periodically so as to remove those documents. A bad document will be removed from a cluster if, and only if, we can find one or more other clusters that can accept the document. By accept, we mean the document will either increase their histogram ratio or decrease it by no more than ε, thus benefiting both the initial and the recipient clusters. This strategy creates a dynamic negotiation scheme between clusters for document assignment. It also allows for overlapping clusters and dynamic incremental document clustering.

6 EXPERIMENTAL RESULTS

In order to test the effectiveness of the Web clustering system, we conducted a set of experiments using our proposed data model, phrase matching, similarity measure, and incremental clustering method. The experiments conducted were divided into two sets. We first tested the effectiveness of the DIG model, presented in Section 3, and the accompanying phrase matching algorithm for calculating the similarity between documents based on phrases versus individual words only. The second set of experiments was to evaluate the accuracy of the incremental document clustering algorithm, presented in Section 5, based on the cluster cohesiveness measure using similarity histograms.

6.1 Experimental Setup

The availability of Web document data sets suitable for clustering is limited. However, we used three data sets, two of which are Web document data sets, and the third is a collection of articles posted on various USENET newsgroups. Table 2 describes the data sets.

TABLE 2 Data Sets Descriptions

The first data set (DS1) is a collection of 314 Web documents manually collected and labelled from various University of Waterloo and Canadian Web sites (the document collection can be downloaded at: uwaterloo.ca/~hammouda/webdata/). This data set was used in [42]. It is categorized manually based on topic description. This data set has a moderate degree of overlap between the different classes. The second data set (DS2) is a collection of 2,340 Reuters news articles posted on Yahoo! news, and was used by Boley in [43], [44], [45]. The categories of the data set come from the Yahoo! categories of the Reuters news feed. The overlap between classes is quite low in this data set. The third data set is a subset of the full 20-newsgroups collection of USENET newsgroup articles, available from the UCI KDD Archive. Each newsgroup constitutes a different category, with varying overlap between them; some newsgroups are very related (e.g., talk.politics.mideast and talk.politics.misc) and others are not related at all (e.g., comp.graphics and talk.religion.misc).

6.2 Evaluation Measures

In order to evaluate the quality of the clustering, we adopted three quality measures widely used in the text mining literature for the purpose of document clustering [32]. The first is the F-measure, which combines the Precision and Recall ideas from the Information Retrieval literature.
The precision and recall of a cluster j with respect to a class i are defined as:

P = Precision(i, j) = N_ij / N_j,    (10a)
R = Recall(i, j) = N_ij / N_i,    (10b)

where
N_ij is the number of members of class i in cluster j,
N_j is the number of members of cluster j, and
N_i is the number of members of class i.

The F-measure of a class i is defined as:

F(i) = 2PR / (P + R).    (11)

With respect to class i, we consider the cluster with the highest F-measure to be the cluster that maps to class i, and that F-measure becomes the score for class i. The overall F-measure for the clustering result C is the weighted average of the F-measures of the individual classes:

F_C = ( Σ_i |i| F(i) ) / ( Σ_i |i| ),    (12)

where |i| is the number of objects in class i. The higher the overall F-measure, the better the clustering, due to the higher accuracy of the clusters mapping to the original classes. The second measure is the Entropy, which provides a measure of goodness for unnested clusters or for the clusters at one level of a hierarchical clustering. Entropy tells us how homogeneous a cluster is. The higher the homogeneity of a cluster, the lower the entropy, and vice versa. The entropy of a cluster containing only one object (perfect homogeneity) is zero. For every cluster j in the clustering result C, we compute p_ij, the probability that a member of cluster j belongs to class i. The entropy of each cluster j is calculated using the standard formula E_j = −Σ_i p_ij log(p_ij), where the sum is taken over all classes. The total entropy for a set of clusters is calculated as the sum of the entropies of the clusters, weighted by the size of each cluster:

E_C = Σ_{j=1}^{m} (N_j / N) E_j,    (13)

where N_j is the size of cluster j, and N is the total number of data objects. The third measure is the Overall Similarity, which is the average of the similarities inside each cluster. It is usually used in the absence of document labels. Overall similarity measures cluster cohesiveness using the weighted internal cluster similarity. For a cluster C, the overall similarity is given by:

S_C = (1 / |C|²) Σ_{x∈C} Σ_{y∈C} sim(x, y).    (14)

Basically, we would like to maximize the F-measure and Overall Similarity, and minimize the Entropy of clusters, to achieve high quality clustering.

6.3 Effect of Phrase-Based Similarity on Clustering Quality

The similarities calculated by our algorithm were used to construct a similarity matrix between the documents. We elected to use three standard document clustering techniques for testing the effect of phrase similarity on clustering [33]: 1. Hierarchical Agglomerative Clustering (HAC), 2. Single-Pass Clustering, and 3. K-Nearest Neighbor Clustering (K-NN). For each of the algorithms, we constructed the similarity matrix and let the algorithm cluster the documents based on the presented similarity matrix.

TABLE 3 Phrase-Based Clustering Improvement. (a) Complete linkage was used as the cluster distance measure for the HAC method since it tends to produce tight clusters with small diameter. (b) A document-to-cluster similarity threshold of 0.25 was used. (c) A k of 5 and a cluster similarity threshold of 0.25 were used.

The results listed in Table 3 show the improvement in the clustering quality on the first data set using the combined similarity measure. The improvements shown were achieved at a similarity blend factor between 70 and 80 percent (phrase similarity weight). The parameters chosen for the different algorithms were the ones that produced the best results. The percentage of improvement ranges from a 19.5 to 60.6 percent increase in the F-measure quality, and from a 9.1 to 46.2 percent drop in Entropy (lower is better for Entropy). It is obvious that the phrase-based similarity plays an important role in accurately judging the relation between documents.
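As a concrete reference for how the scores reported in Table 3 are obtained, the F-measure of (10)-(12) and the entropy of (13) can be computed roughly as follows. This is a sketch assuming each document carries one true class label and one assigned cluster label; overlapping clusters would need a small extension.

import math
from collections import Counter

def overall_f_measure(classes, clusters):
    """Eqs. (10)-(12): `classes` and `clusters` are parallel label lists."""
    class_sizes, cluster_sizes = Counter(classes), Counter(clusters)
    joint = Counter(zip(classes, clusters))                 # N_ij
    total = 0.0
    for i, n_i in class_sizes.items():
        best = 0.0
        for j, n_j in cluster_sizes.items():
            n_ij = joint[(i, j)]
            if n_ij:
                p, r = n_ij / n_j, n_ij / n_i               # Eqs. (10a), (10b)
                best = max(best, 2 * p * r / (p + r))       # Eq. (11)
        total += n_i * best
    return total / len(classes)                             # Eq. (12)

def total_entropy(classes, clusters):
    """Eq. (13): size-weighted sum of per-cluster entropies."""
    n, result = len(classes), 0.0
    for j in set(clusters):
        members = [c for c, k in zip(classes, clusters) if k == j]
        e_j = -sum((m / len(members)) * math.log(m / len(members))
                   for m in Counter(members).values())
        result += (len(members) / n) * e_j
    return result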
It is known that Single-Pass clustering is very sensitive to noise, which is why it has the worst performance. However, when the phrase similarity was introduced, the quality of the clusters produced was pushed close to that produced by HAC and K-NN. In order to better understand the effect of the phrase similarity on the clustering quality, we generated a clustering quality profile against the similarity blend factor. Fig. 9 illustrates the effect of introducing the phrase similarity on the F-measure and the entropy of the resulting clusters. The alpha parameter is the similarity blend factor presented in (7). It is clear that the phrase similarity enhances the quality of clustering up to a certain point (around a weight of 80 percent), after which its effect starts to bring the quality down. As we mentioned in Section 4.1, phrases alone cannot capture all the similarity information between documents; the single-term similarity is still required, but to a smaller degree. The results show that both evaluation measures are optimized in the same trend with respect to the blend factor. Having two evaluation measures confirm the clustering quality improvement gives us confidence that the results are not biased by either of them.

Fig. 9. Effect of phrase similarity on clustering quality. (a) DS1 F-measure. (b) DS1 Entropy. (c) DS2 F-measure. (d) DS2 Entropy. (e) DS3 F-measure. (f) DS3 Entropy.

The performance of the model was closely examined to make sure that the phrase matching algorithm is scalable enough for moderate to large data sets. The experiments were performed on a Pentium 4, 1.8 GHz machine with 512 MB of main memory. The system was written in C++. Fig. 10 shows the performance of the graph construction and phrase matching algorithm for the two larger data sets (DS2 and DS3). On both data sets, the algorithm performed in near-linear time. Although the two data sets contain a similar number of documents, DS2 took about an order of magnitude longer than DS3 to build the graph and complete the phrase matching. This is attributed to two factors: 1) the average number of words per document in DS2 is almost twice that of DS3, so more phrases are matched per document; and 2) DS2 has, on average, a larger amount of shared phrases between documents than DS3. Newsgroup articles rarely share a large number of phrases (except when someone quotes another post), so, on average, we do not need to match a large number of phrases per document.
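Returning to the blend-factor experiment of Fig. 9, the quality profile could be generated along the lines of the following sketch. It is only an illustration under our own assumptions: the exact combination formula of (7) is not reproduced here (a simple convex blend is assumed), and cluster_fn and evaluate_fn are placeholders for whichever clustering algorithm and quality measure are being tested.

```python
def blended_similarity(phrase_sim, term_sim, alpha):
    """Convex blend of phrase-based and single-term similarity; alpha is the phrase weight."""
    return alpha * phrase_sim + (1.0 - alpha) * term_sim

def quality_profile(num_docs, phrase_sim, term_sim, cluster_fn, evaluate_fn):
    """Sweep the blend factor and record clustering quality at each setting.

    phrase_sim and term_sim are pair-wise similarity matrices (e.g., lists of lists);
    cluster_fn and evaluate_fn stand in for any clustering algorithm and quality
    measure (F-measure, entropy, ...).
    """
    profile = []
    for step in range(11):                       # alpha = 0.0, 0.1, ..., 1.0
        alpha = step / 10
        combined = [[blended_similarity(phrase_sim[x][y], term_sim[x][y], alpha)
                     for y in range(num_docs)] for x in range(num_docs)]
        clusters = cluster_fn(combined)
        profile.append((alpha, evaluate_fn(clusters)))
    return profile
```

Plotting the recorded (alpha, quality) pairs for each algorithm would reproduce the shape of a profile such as Fig. 9.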

Fig. 10. DIG performance. (a) DIG performance using the Yahoo! news data set. (b) DIG performance using the 20-newsgroup data set.

6.4 SHC Evaluation

The SHC method was evaluated using two of the data sets mentioned earlier, DS1 and DS2, and the same evaluation measures discussed above. Table 4 shows the results of SHC against HAC, Single-Pass, and K-NN clustering. For the first data set, the improvement was very significant: more than 20 percent improvement over K-NN (in terms of F-measure), 3 percent improvement over HAC, and 29 percent improvement over Single-Pass. For the second data set, an improvement of 10 to 18 percent was achieved over the other methods, although the absolute F-measure was lower than for the first data set. The parameters chosen for the different algorithms were the ones that produced the best results. Examining the actual documents in DS2 and their classification shows that the documents do not have enough overlap within each individual class, which makes it difficult to obtain an accurate similarity calculation between the documents. Nevertheless, by relying on accurate and robust phrase-matching similarity calculation, we were able to push the quality of clustering further.

TABLE 4. SHC Improvement. (a) Overall similarity. (b) Complete Linkage was used as the cluster distance measure for the HAC method since it tends to produce tight clusters with small diameter. (c) A document-to-cluster similarity threshold of 0.25 was used. (d) A k of 5 and a cluster similarity threshold of 0.25 were used.

Fig. 11 shows these results more clearly, illustrating the achieved improvement in comparison with the other methods. The figure also shows the effect of applying the reassignment strategy discussed in Section 5.4. The reassignment strategy showed only a slight improvement over the same method without document reassignment. This can be attributed to the distance between the original clusters: if the clusters are sufficiently distant from each other, the probability of a document being assigned to a cluster it does not belong to in the first place is very low. We believe the data sets chosen for the experiments are of such a nature, with only a slight degree of overlap.

Fig. 11. Quality of clustering comparison. (a) Clustering quality F-measure. (b) Clustering quality entropy.

The time performance comparison of the different clustering algorithms is illustrated in Fig. 12, showing the performance for both data sets. The performance of SHC is comparable to Single-Pass and K-NN, while being much better than HAC. The gain in performance over HAC arises because HAC spends considerable time recalculating the similarities between the newly merged cluster and all other clusters during every iteration, which degrades its performance significantly. On the other hand, SHC, Single-Pass, and K-NN share the same general strategy for processing documents, without having to recalculate similarities at each step. Thus, while the SHC algorithm generates better-quality clustering, it still exhibits the same, or better, performance as other incremental algorithms in its class.

Fig. 12. Clustering performance. (a) Clustering performance DS1. (b) Clustering performance DS2.
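As a rough illustration only (the details of SHC are given in Section 5 and are not reproduced here), the incremental clustering scheme compared above can be pictured as a single pass over the documents in which a document joins the first cluster whose cohesiveness it does not hurt. The histogram-ratio test below is a simplified stand-in for the paper's criterion, and the threshold values are taken from the experimental settings listed in the table notes.

```python
def incremental_cluster(docs, sim, sim_threshold=0.25, min_ratio=0.3):
    """Single-pass incremental clustering sketch with a simple cohesiveness test.

    sim(x, y) returns the pair-wise similarity of two documents.  A document joins
    the first cluster in which the fraction of its similarities to existing members
    that reach sim_threshold is at least min_ratio; otherwise it seeds a new cluster.
    """
    clusters = []
    for d in docs:
        placed = False
        for cluster in clusters:
            sims = [sim(d, member) for member in cluster]
            ratio = sum(1 for s in sims if s >= sim_threshold) / len(sims)
            if ratio >= min_ratio:
                cluster.append(d)
                placed = True
                break
        if not placed:
            clusters.append([d])
    return clusters
```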
7 CONCLUSIONS AND FUTURE RESEARCH

We presented a system composed of four components in an attempt to improve document clustering in the Web domain. Information in Web documents does not lie in the content only, but also in their inherent semistructure. We presented a Web document analysis component that is capable of identifying the weights of phrases in Web documents and breaking each document into its sentence constituents for further processing.

The second component, and perhaps the one with the most impact on performance, is the new document model introduced in this paper, the Document Index Graph. This model is based on indexing Web documents using phrases and their levels of significance. Such a model enables us to perform phrase matching and similarity calculation between documents in a very robust, efficient, and accurate way. The quality of clustering achieved using this model significantly surpasses that of traditional vector space model based approaches.

The third component is the phrase-based similarity measure. By carefully examining the factors affecting the degree of overlap between documents, we devised a phrase-based similarity measure that is capable of accurate calculation of pair-wise document similarity.

The fourth component is an incremental document clustering method based on maintaining high cluster cohesiveness by improving the pair-wise document similarity distribution inside each cluster. The merit of such a design is that each component can be utilized independently of the others; however, we are confident that the combination of these components leads to better results, as demonstrated by the results presented in this paper. By testing our model with several standard clustering techniques and different evaluation measures, we are confident that the model is well justified.

Potential applications of this framework include automatic grouping of search engine results, building taxonomies of text corpora, phrase-based information retrieval, document similarity measurement, detection of plagiarism, and many others.
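As a deliberately simplified illustration of the phrase-indexing idea behind the Document Index Graph component, the sketch below indexes consecutive word pairs and retrieves the pairs shared by two documents. It is an approximation under our own assumptions, not the paper's actual data structure, which additionally records sentence paths, phrase significance levels, and supports matching phrases of arbitrary length.

```python
from collections import defaultdict

def build_pair_index(docs):
    """Index consecutive word pairs to the documents that contain them.

    docs maps a document id to a list of sentences, each a list of words.
    """
    index = defaultdict(set)
    for doc_id, sentences in docs.items():
        for sentence in sentences:
            for w1, w2 in zip(sentence, sentence[1:]):
                index[(w1, w2)].add(doc_id)
    return index

def shared_pairs(index, doc_a, doc_b):
    """Two-word phrases indexed for both documents; longer matches would chain pairs."""
    return [pair for pair, ids in index.items() if doc_a in ids and doc_b in ids]
```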

There are a number of future research directions to extend and improve this work. One direction is to improve the accuracy of similarity calculation between documents by employing different similarity calculation strategies. Although the current scheme proved more accurate than traditional methods, there is still room for improvement. In addition, although the work presented here is aimed at Web document clustering, it can easily be adapted to other document types as well; however, it would then not benefit from the semistructure found in Web documents. Our intention is to investigate the use of this model on standard corpora and examine its effect on clustering compared to traditional methods.

ACKNOWLEDGMENTS

This work has been partially supported by a strategic grant from the Natural Sciences and Engineering Research Council of Canada (NSERC).

REFERENCES

[1] K. Cios, W. Pedrycz, and R. Swiniarski, Data Mining Methods for Knowledge Discovery. Boston: Kluwer Academic Publishers.
[2] W.B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, N.J.: Prentice Hall.
[3] R. Kosala and H. Blockeel, Web Mining Research: A Survey, ACM SIGKDD Explorations Newsletter, vol. 2, no. 1, pp. 1-15.
[4] O. Zamir and O. Etzioni, Grouper: A Dynamic Clustering Interface to Web Search Results, Computer Networks, vol. 31.
[5] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, Inductive Learning Algorithms and Representations for Text Categorization, Proc. Seventh Int'l Conf. Information and Knowledge Management, Nov.
[6] H. Kargupta, I. Hamzaoglu, and B. Stafford, Distributed Data Mining Using an Agent Based Architecture, Proc. Knowledge Discovery and Data Mining.
[7] U.Y. Nahm and R.J. Mooney, A Mutually Beneficial Integration of Data Mining and Information Extraction, Proc. 17th Nat'l Conf. Artificial Intelligence (AAAI-00).
[8] Y. Yang, J. Carbonell, R. Brown, T. Pierce, B. Archibald, and X. Liu, Learning Approaches for Detecting and Tracking News Events, IEEE Intelligent Systems, vol. 14, no. 4.
[9] D. Freitag and A. McCallum, Information Extraction with HMMs and Shrinkage, Proc. AAAI-99 Workshop Machine Learning for Information Extraction.
[10] T. Hofmann, The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data, Proc. 16th Int'l Joint Conf. Artificial Intelligence (IJCAI-99).
[11] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, WEBSOM Self-Organizing Maps of Document Collections, Proc. WSOM 97, Workshop Self-Organizing Maps, June.
[12] W.W. Cohen, Learning to Classify English Text with ILP Methods, Proc. Fifth Int'l Workshop Inductive Logic Programming, pp. 3-24.
[13] M. Junker, M. Sintek, and M. Rinck, Learning for Text Categorization and Information Extraction with ILP, Proc. First Workshop Learning Language in Logic, J. Cussens, ed.
[14] S. Scott and S. Matwin, Feature Engineering for Text Classification, Proc. 16th Int'l Conf. Machine Learning (ICML-99).
[15] S. Soderland, Learning Information Extraction Rules for Semi-Structured and Free Text, Machine Learning, vol. 34, nos. 1-3.
[16] K. Aas and L. Eikvil, Text Categorisation: A Survey, Technical Report 941, Norwegian Computing Center, June.
[17] G. Salton, A. Wong, and C. Yang, A Vector Space Model for Automatic Indexing, Comm. ACM, vol. 18, no. 11, Nov.
[18] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Computer Science Series, New York: McGraw-Hill.
[19] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Reading, Mass.: Addison Wesley.
[20] O. Zamir, O. Etzioni, O. Madani, and R.M. Karp, Fast and Intuitive Clustering of Web Documents, Proc. Third Int'l Conf. Knowledge Discovery and Data Mining, Aug.
[21] O. Zamir and O. Etzioni, Web Document Clustering: A Feasibility Demonstration, Proc. 21st Ann. Int'l ACM SIGIR Conf.
[22] M.F. Porter, An Algorithm for Suffix Stripping, Program, vol. 14, no. 3, July.
[23] S. Kurtz, Reducing the Space Requirement of Suffix Trees, Software Practice and Experience, vol. 29, no. 13.
[24] A. Apostolico, The Myriad Virtues of Subword Trees, Combinatorial Algorithms on Words, A. Apostolico and Z. Galil, eds. (NATO ISI Series).
[25] U. Manber and G. Myers, Suffix Arrays: A New Method for On-Line String Searches, SIAM J. Computing, vol. 22, no. 5.
[26] J.L. Fagan, Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods, PhD thesis, Dept. of Computer Science, Cornell Univ., Sept.
[27] M.F. Caropreso, S. Matwin, and F. Sebastiani, Statistical Phrases in Automated Text Categorization, Technical Report IEI-B, Pisa, Italy.
[28] J.D. Isaacs and J.A. Aslam, Investigating Measures for Pairwise Document Similarity, Technical Report PCS-TR99-357, Dartmouth College, Computer Science, Hanover, N.H., June.
[29] D. Lin, An Information-Theoretic Definition of Similarity, Proc. 15th Int'l Conf. Machine Learning.
[30] A. Strehl, J. Ghosh, and R. Mooney, Impact of Similarity Measures on Web-Page Clustering, Proc. 17th Nat'l Conf. Artificial Intelligence: Workshop Artificial Intelligence for Web Search (AAAI 2000), July.
[31] Y. Yang and J.O. Pedersen, A Comparative Study on Feature Selection in Text Categorization, Proc. 14th Int'l Conf. Machine Learning (ICML 97).
[32] M. Steinbach, G. Karypis, and V. Kumar, A Comparison of Document Clustering Techniques, Proc. KDD-2000 Workshop on Text Mining, Aug.
[33] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall.
[34] A.K. Jain, M.N. Murty, and P.J. Flynn, Data Clustering: A Review, ACM Computing Surveys, vol. 31, no. 3.
[35] M. Charikar, C. Chekuri, T. Feder, and R. Motwani, Incremental Clustering and Dynamic Information Retrieval, Proc. 29th Ann. ACM Symp. Theory of Computing.
[36] F. Beil, M. Ester, and X. Xu, Frequent Term-Based Text Clustering, Proc. Eighth Int'l Conf. Knowledge Discovery and Data Mining (KDD 2002).
[37] P. Pantel and D. Lin, Document Clustering with Committees, Proc. ACM SIGIR Conf. Research and Development in Information Retrieval.
[38] D.R. Hill, A Vector Clustering Technique, Mechanized Information Storage, Retrieval and Dissemination, K. Samuelson, ed. Amsterdam: North-Holland Publishing.
[39] B.V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. McGraw-Hill Computer Science Series, IEEE CS Press.
[40] S.Y. Lu and K.S. Fu, A Sentence-to-Sentence Clustering Procedure for Pattern Analysis, IEEE Trans. Systems, Man, and Cybernetics, vol. 8.
[41] W. Wong and A. Fu, Incremental Document Clustering for Web Page Classification, Proc. Int'l Conf. Information Soc. in the 21st Century: Emerging Technologies and New Challenges (IS2000), 2000.

[42] K. Hammouda and M. Kamel, Phrase-Based Document Similarity Based on an Index Graph Model, Proc. IEEE Int'l Conf. Data Mining (ICDM 02), Dec.
[43] D. Boley, Principal Direction Divisive Partitioning, Data Mining and Knowledge Discovery, vol. 2, no. 4.
[44] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, Partitioning-Based Clustering for Web Document Categorization, Decision Support Systems, vol. 27.
[45] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, Document Categorization and Query Generation on the World Wide Web Using WebACE, AI Rev., vol. 13, nos. 5-6.

Khaled M. Hammouda received the BSc degree in computer engineering from the Faculty of Engineering, Cairo University, Egypt, and the MASc degree from the Department of Systems Design Engineering, University of Waterloo, Canada. From 1997 to 2000, he was with the Department of Computer Science, Faculty of Computers and Information, Cairo University, as a teaching assistant. Currently, he is a PhD candidate at the University of Waterloo and a member of the Pattern Analysis and Machine Intelligence Research Group at the Department of Systems Design Engineering. His research interests are in data mining, especially Web mining, knowledge discovery from text data, and distributed data mining. He is interested in Web content analysis and document clustering using machine intelligence techniques based on efficient structures and algorithms. He is a student member of the IEEE.

Mohamed S. Kamel received the PhD degree in computer science from the University of Toronto, Canada. He is at present a professor and director of the Pattern Analysis and Machine Intelligence Laboratory at the Department of Systems Design Engineering, University of Waterloo, Canada. Dr. Kamel holds a Canada Research Chair in Cooperative Intelligent Systems. He has authored and coauthored more than 180 papers in journals and conference proceedings, two patents, and numerous technical and industrial project reports. Under his supervision, 44 PhD and MASc students have completed their degrees. Dr. Kamel is a member of the ACM, the AAAI, the CIPS, and the APEO, a senior member of the IEEE, editor-in-chief of the International Journal of Robotics and Automation, associate editor of four international journals, and guest editor for special issues in four journals. He is a member of the board of directors and cofounder of Virtek Vision International in Waterloo. He is a consultant to many companies, including NCR, IBM, Nortel, VRP, and CSA.
