Web Document Clustering based on Document Structure


Khaled M. Hammouda and Mohamed S. Kamel
Department of Systems Design Engineering
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1

Abstract

Document clustering techniques mostly rely on single-term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, document structure should be reflected in the underlying data model. This paper presents a framework for web document clustering based on two important concepts. The first is the web document structure, which is largely ignored by current clustering approaches, even though the (semi-)structure of a web document provides significant information about its content. The second is finding the relationships between documents based on local context, using a new phrase matching technique, so that documents are indexed based on phrases rather than individual words, as is widely done in current systems. A novel document data model, the Document Index Graph, is designed specifically to facilitate phrase matching between documents. The combination of these two concepts creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in web document clustering over traditional methods. To make the approach applicable to online clustering, an incremental clustering algorithm guided by the maximization of cluster cohesiveness is also presented.

Keywords: web mining, document clustering, document similarity, document structure, document index graph, phrase matching.

1 Introduction

In an effort to keep up with the tremendous growth of the World Wide Web, many research projects have targeted how to organize such information in a way that makes it easier for end users to find what they want efficiently and accurately. Information on the web is present in the form of text documents (formatted in HTML), which is why many web document processing systems are rooted in text data mining techniques. Text data mining shares many concepts with traditional data mining methods. Data mining includes many techniques that can unveil inherent structure in the underlying data. One of these techniques is clustering. Applied to text data, clustering methods try to identify inherent groupings of the text documents so that a set of clusters is produced in which clusters exhibit high intra-cluster similarity and low inter-cluster similarity [5]. Generally speaking, text document clustering methods attempt to segregate the documents into groups where each group represents a topic that is different from the topics represented by the other groups [8]. By applying text mining in the web domain, the process becomes what is known as web mining. According to Kosala et al. [16], there are three types of web mining in general: (1) web structure mining; (2) web usage mining; and (3) web content mining. We are mainly interested in the last type, web content mining.

Any clustering technique relies on four concepts: (1) a data representation model, (2) a similarity measure, (3) a cluster model, and (4) a clustering algorithm that builds the clusters using the data model and the similarity measure. Most of the document clustering methods in use today are based on the Vector Space Model [1, 20, 21, 22], a very widely used data model for text classification and clustering. The Vector Space Model represents documents as feature vectors of the terms (words) that appear in the entire document set. Each feature vector contains term weights (usually term frequencies) of the terms appearing in that document.

Similarity between documents is measured using one of several similarity measures that are based on such a feature vector. Examples include the cosine measure and the Jaccard measure. Clustering methods based on this model make use of single-term analysis only; they do not make use of any word proximity or phrase-based analysis¹.

The motivation behind the work in this paper is our belief that document clustering should be based not only on single-word analysis, but on phrases as well. Phrase-based analysis means that the similarity between documents should be based on matching phrases rather than on single words only. The work reported in the literature about using phrases in document clustering is limited; most efforts have been targeted toward single-word analysis. The methods used for text clustering include decision trees [7, 15, 18, 28], statistical analysis [9, 10, 15], neural nets [11], inductive logic programming [6, 14], and rule-based systems [23, 24], among others. These methods are at the crossroads of more than one research area, such as databases (DB), information retrieval (IR), and artificial intelligence (AI), including machine learning (ML) and natural language processing (NLP).

The most relevant work to what is presented here is that of Zamir et al. [30, 31, 32]. They proposed a phrase-based document clustering approach based on Suffix Tree Clustering (STC). The method involves the use of a trie (a compact tree) structure to represent shared suffixes between documents. Based on these shared suffixes, base clusters of documents are identified and then combined into final clusters using a connected-component graph algorithm. They claim to achieve O(n log n) performance and to produce high quality clusters. The results they showed were encouraging, but the suffix tree model could be argued to have a high number of redundancies in terms of the suffixes stored in the tree.

¹Throughout this paper the term phrase means a sequence of words, and not the grammatical structure of a sentence.

In this paper we propose a system for web document clustering based on document structure. The system consists of four components:

1. A web document restructuring scheme that identifies different document parts and assigns levels of significance to these parts according to their importance.

2. A novel document representation model, the Document Index Graph (DIG), that captures the structure of sentences in the document set, rather than single words only. The DIG model is based on graph theory and utilizes graph properties to match any-length phrases from a document against any number of previously seen documents in a time nearly proportional to the number of words in the document.

3. A phrase-based similarity measure for scoring the similarity between two documents according to the matching phrases and their significance.

4. An incremental document clustering method based on maintaining high cluster cohesiveness using a new cluster quality concept called the Similarity Histogram.

The integration of these four components proved to be of superior performance to traditional document clustering methods. Although the performance of the whole system is quite good, each component can be used independently of the others. The overall system design is illustrated in Figure 1. The proposed document model is used to measure the similarity between the documents using a new similarity measure that makes use of phrase-based matching. The similarity calculation between documents is based on a combination of single-term similarity and phrase-based similarity.

[Figure 1: Web Document Clustering System Design. The diagram shows web documents passing through document structure identification (producing well-structured XML documents), Document Index Graph representation (phrase matching), document similarity calculation, and incremental clustering to produce document clusters.]

Similarity based on matching phrases between documents is shown to have a more significant effect on the clustering quality, due to its insensitivity to noisy terms that could otherwise lead to an incorrect similarity measure. The proposed incremental document clustering method relies on improving the pair-wise document similarity distribution inside each cluster so that similarities within each cluster are maximized. The quality of the clusters produced by this system was higher than that of clusters produced by traditional clustering methods; the improvement ranged from 20% to 70% over those methods.

The rest of this paper is organized as follows. Section 2 presents an analysis of the important features of semi-structured web documents. Section 3 introduces the Document Index Graph model. Section 4 presents the phrase-based similarity measure. Section 5 presents our proposed incremental clustering algorithm. Section 6 presents our experimental results. Finally, we conclude and discuss future work in the last section.

2 Web document structure analysis

Web documents are known to be semi-structured. HTML tags are used to designate different parts of a document; however, since HTML is meant to specify the layout of the document, presenting it to the user in a friendly manner rather than specifying the structure of the data in it, web documents remain only semi-structured. It is nevertheless still possible to identify key parts of a document based on this structure. The idea is that some parts of the document are more informative than others, and thus have different levels of significance depending on where they appear in the document and the tags that surround them. It is less informative to treat, for example, the title of the document and the text body equally.

The proposed system analyzes the HTML document and restructures it according to a predetermined structure that assigns different levels of significance to different document parts. The result is a well-structured XML document that corresponds to the original HTML document, but with significance levels assigned to the different parts of the original document. Currently we assign one of three levels of significance to the different parts: HIGH, MEDIUM, and LOW. Examples of HIGH significance parts are the title, meta keywords, meta description, and section headings. Examples of MEDIUM significance parts are text appearing in bold, italics, or color, hyperlinked text, image alternate text, and table captions. LOW significance parts usually comprise the document body text that was not assigned any of the other levels.

This structuring scheme is exploited in measuring the similarity between two documents (see section 4 for details). For example, if we have a phrase match of HIGH significance in both documents, the similarity is rewarded more than if the match were between LOW significance phrases. This is justified by arguing that a phrase match in titles, for example, is much more informative than a phrase match in body text.
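To make the restructuring concrete, the following is a minimal sketch of how HTML parts might be mapped to significance levels. The tag lists and the function name are illustrative assumptions, not the paper's exact specification.

```python
# Hypothetical tag-to-significance mapping; the paper's actual scheme also
# inspects meta keywords/descriptions and emits a restructured XML document.
HIGH = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}
MEDIUM = {"b", "strong", "i", "em", "font", "a", "caption"}

def significance(tag, attrs=None):
    """Return the significance level for text enclosed by an HTML tag."""
    if tag in HIGH:
        return "HIGH"
    if tag in MEDIUM or (tag == "img" and attrs and "alt" in attrs):
        return "MEDIUM"
    return "LOW"   # plain body text
```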

A sentence boundary detector algorithm was developed to locate sentence boundaries in the documents. The algorithm is based on a finite state machine lexical analyzer with heuristic rules for finding the boundaries; a similar approach is used to find word boundaries. About 98% of the actual boundaries are correctly detected. Achieving 100% accuracy, however, would require natural language processing techniques and underlying knowledge of the data set domain, which is beyond the scope of this paper. The resulting documents nonetheless contain very accurate sentence and word separation, with negligible noise. Finally, a document cleaning step is performed to remove stop-words that have no significance, and to stem the words using the popular Porter Stemmer algorithm [19].
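As a rough illustration of this preprocessing step, the sketch below splits sentences with a simple punctuation heuristic (a crude stand-in for the paper's finite-state analyzer), removes stop-words, and applies Porter stemming. The abridged stop-word list is an assumption.

```python
import re
from nltk.stem.porter import PorterStemmer  # the Porter Stemmer of [19]

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}  # abridged

def preprocess(text):
    """Split text into sentences, then into cleaned, stemmed word lists."""
    stemmer = PorterStemmer()
    sentences = re.split(r"(?<=[.!?])\s+", text)     # heuristic boundary rule
    return [[stemmer.stem(w)
             for w in re.findall(r"[a-z']+", s.lower())
             if w not in STOPWORDS]
            for s in sentences]
```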

3 Document Index Graph

To achieve better clustering results, the data model underlying the clustering method must accurately capture the salient features of the data. Under the Vector Space Model, document data is represented as a feature vector of terms, with weights assigned to the terms according to their frequency of appearance in the document. The model does not represent any relation between the words, so sentences are broken down into their individual components without any representation of the sentence structure. The proposed Document Index Graph (DIG) indexes the documents while maintaining the sentence structure of the original documents. This allows us to make use of more informative phrase matching rather than matching of individual words. Moreover, the DIG also captures the levels of significance of the original sentences, allowing us to make use of sentence significance as well.

3.1 DIG structure

The Document Index Graph (DIG for short) is a directed graph (digraph) G = (V, E), where

V is a set of nodes {v_1, v_2, ..., v_n}, in which each node v represents a unique word in the entire document set; and

E is a set of edges {e_1, e_2, ..., e_m}, in which each edge e is an ordered pair of nodes (v_i, v_j). Edge (v_i, v_j) is directed from v_i to v_j, and v_j is adjacent to v_i. There is an edge from v_i to v_j if, and only if, the word v_j appears immediately after the word v_i in some document.

The above definition implies that the number of nodes in the graph equals the number of unique words in the document set, i.e. the vocabulary of the document set, since each node represents a single word of the whole document set.

Nodes in the graph carry information about the documents in which the corresponding word appeared, along with sentence path information. Sentence structure is maintained by recording the edge along which each sentence continues. This essentially creates an inverted list of the documents, but with sentence information recorded in the inverted list. Assume a sentence of m words appearing in one document consists of the word sequence {v_1, v_2, ..., v_m}. The sentence is represented in the graph by a path from v_1 to v_m, such that (v_1, v_2), (v_2, v_3), ..., (v_{m-1}, v_m) are edges in the graph. Path information is stored in the vertices along the path to uniquely identify each sentence. Sentences that share sub-phrases will have shared parts of their paths in the graph corresponding to the shared sub-phrases.

The structure maintained in each node is a table of documents. Each document entry in the document table records the term frequency of the word in that document. Since words can appear in different parts of a document with different levels of significance, the recorded term frequency is actually broken down by level of significance, with a frequency count per level per document entry.
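The per-node bookkeeping described above might be sketched as follows; the field names and layout are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class NodeEntry:
    """What one DIG node records for one document (hypothetical layout)."""
    # term frequency of this word, broken down by significance level
    freq: dict = field(default_factory=lambda: {"HIGH": 0, "MEDIUM": 0, "LOW": 0})
    # edge table: next_word -> list of (sentence_id, position) pairs telling
    # which sentences of this document continue along that outgoing edge
    edges: dict = field(default_factory=dict)

@dataclass
class Node:
    word: str
    documents: dict = field(default_factory=dict)  # doc_id -> NodeEntry
```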

This structure helps in achieving a more accurate similarity measure based on level of significance later on. Since the graph is directed, each node maintains a list of outgoing edges per document entry. This list of edges tells us which sentence continues along which edge. The task of creating a sentence path in the graph is thus reduced to recording the necessary information in this edge table to reflect the structure of the sentences.

[Figure 2: Example of the Document Index Graph, built from three documents. Document 1: "river rafting", "mild river rafting", "river rafting trips". Document 2: "wild river adventures", "river rafting vacation plan". Document 3: "fishing trips", "fishing vacation plan", "booking fishing trips", "river fishing".]

To better illustrate the graph structure, Figure 2 presents a simple example graph representing three documents. Each document contains a number of sentences, with some overlap between the documents. As seen from the graph, an edge is created between two nodes only if the words represented by the two nodes appear in succession in some document. Thus, sentences map into paths in the graph.

Dotted lines represent sentences from document 1, dash-dotted lines represent sentences from document 2, and dashed lines represent sentences from document 3. As mentioned earlier, matching phrases between documents becomes a task of finding shared paths in the graph between different documents.

The example presented here is a simple one; real web documents contain hundreds or thousands of words, and with a very large document set the graph can become complex in terms of memory usage. Typically, the number of graph nodes will be exactly the number of unique words in the data set, while the number of edges is about 4 to 6 times the number of nodes (that is, the average degree of a node).

3.2 Constructing the graph

The DIG is built incrementally by processing one document at a time. When a new document is introduced, it is scanned sequentially and the graph is updated with the new sentence information as necessary. New words are added to the graph as needed and connected to other nodes to reflect the sentence structure. The graph building process becomes less memory demanding when a new document introduces no (or very few) new words. At this point the graph becomes more stable, and the only operation needed is to update the sentence structure in the graph to accommodate the newly introduced sentences. It is critical to note that introducing a new document requires the inspection (or addition) of only those words that appear in that document, not of every node in the graph; this is where the efficiency of the model comes from. Along with indexing the sentence structure, the level of significance of each sentence is also recorded in the graph. This allows us to recall such information when matching sentences from other documents.

[Figure 3: Incremental construction of the Document Index Graph, showing the graph after processing Document 1, Document 2, and Document 3 in turn.]

Continuing the example introduced earlier, the process of constructing the graph that represents the three documents is illustrated in Figure 3. The emphasis here is on the incremental construction process, where new nodes are added and new edges are created incrementally as each new document is introduced.

Unlike the traditional phrase matching techniques usually used in the information retrieval literature, the Document Index Graph provides complete information about full phrase matching between every pair of documents. While traditional phrase matching methods are aimed at searching for and retrieving documents that have phrases matching a specific query, the Document Index Graph is aimed at providing information about the degree of overlap between every pair of documents. This information helps in determining the degree of similarity between documents, as will be explained in section 4.

3.3 Detecting Matching Phrases

Upon introducing a new document, finding matching phrases from previously seen documents becomes an easy task using the DIG. Algorithm 1 describes the process of both incremental graph building and phrase matching. The procedure starts with a new document to process (line 1). We expect the new document to have well-defined sentence boundaries; each sentence is processed individually. This is important because we do not want to match a phrase that spans two sentences (which would break the local context we are looking for). It is also important to know the original sentence length, since it is used in the similarity calculation (section 4).

For each sentence (the for loop at line 2) we process the words in the sentence sequentially, adding new words as new nodes to the graph and constructing a path in the graph (adding new edges if necessary) to represent the sentence being processed. Matching phrases from previous documents is done by keeping a list L that holds an entry for every previous document that shares a phrase with the current document D. As we continue along the sentence path, we update L by adding new matching phrases and their respective document identifiers, and by extending phrase matches from the previous iteration (lines 10 and 11). If there are no matching phrases at some point, we just update the respective nodes of the graph to reflect the new sentence path (lines 13 and 14). After the whole document is processed, L will contain all the matching phrases between the current document and every previous document that shares at least one phrase with it. Finally we output L as the list of documents with matching phrases and all the necessary information about the matching phrases.

Algorithm 1 Document Index Graph construction and phrase matching
 1: D ← new document
 2: for each sentence s in D do
 3:     w_1 ← first word in s
 4:     if w_1 is not in G then
 5:         add w_1 to G
 6:     end if
 7:     L ← empty list   {L is a list of matching phrases}
 8:     for each word w_i ∈ {w_2, w_3, ..., w_k} in s do
 9:         if (w_{i-1}, w_i) is an edge in G then
10:             extend phrase matches in L for sentences that continue along (w_{i-1}, w_i)
11:             add new phrase matches to L
12:         else
13:             add edge (w_{i-1}, w_i) to G
14:             update sentence path in nodes w_{i-1} and w_i
15:         end if
16:     end for
17: end for
18: output matching phrases in L

The above algorithm is capable of matching any-length phrases between a new document D and all previously seen documents in roughly O(m) time, where m is the number of words in document D.
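The following toy implementation illustrates the same edge-table idea in Python. It is a sketch under simplifying assumptions (edges keyed by word pairs, one emitted match per matching source sentence), not the paper's exact bookkeeping.

```python
from collections import defaultdict

class DIG:
    def __init__(self):
        # edge table: (word, next_word) -> set of (doc_id, sent_id, pos)
        self.edges = defaultdict(set)

    def add_and_match(self, doc_id, sentences):
        """Add a document (a list of word lists) to the graph and return
        phrases shared with previously seen documents as (doc, phrase) pairs."""
        matches = []
        for sid, words in enumerate(sentences):
            open_matches = {}   # (doc, sent, pos) -> start index of the match
            for i in range(len(words) - 1):
                edge = (words[i], words[i + 1])
                extended = {}
                for (d, s, p) in self.edges[edge]:   # who else used this edge
                    if d != doc_id:
                        # extend a match whose previous edge was (d, s, p-1),
                        # or start a fresh two-word match at position i
                        extended[(d, s, p)] = open_matches.get((d, s, p - 1), i)
                for (d, s, p), start in open_matches.items():
                    if (d, s, p + 1) not in extended:    # match ended here
                        matches.append((d, words[start:i + 1]))
                open_matches = extended
                self.edges[edge].add((doc_id, sid, i))   # record our traversal
            for (d, s, p), start in open_matches.items():
                matches.append((d, words[start:]))       # matches reaching the end
        return matches

g = DIG()
g.add_and_match(1, [["river", "rafting"], ["mild", "river", "rafting"]])
print(g.add_and_match(2, [["river", "rafting", "trips"]]))
# one ('river', 'rafting') match per matching source sentence of document 1
```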

The step at line 10 of the algorithm, where we extend the matching phrases as we continue along an existing path, may seem not to be a constant-time step: as the graph builds up, the number of matching phrases grows, and consequently moving along an existing path requires matching more phrases. However, it turns out that the size of the list of matching phrases stays roughly constant even for very large document sets, because any particular phrase is shared by only a small set of documents, which on average tends to be a constant number.

4 A phrase-based similarity measure

As mentioned earlier, phrases convey local context information, which is essential in determining an accurate similarity between documents. Toward this end we devised a similarity measure based on matching phrases rather than individual terms. This measure exploits the information extracted by the phrase matching algorithm above to better judge the similarity between documents. It is related to the work of Isaacs et al. [12], who used a pair-wise probabilistic document similarity measure based on information theory; although they showed it could improve on traditional similarity measures, it is still fundamentally based on the vector space representation.

The phrase similarity between two documents is calculated from the list of matching phrases between them. It is a function of four factors:

- the number of matching phrases P,
- the lengths of the matching phrases (l_i : i = 1, 2, ..., P),
- the frequencies of the matching phrases in both documents (f_{i1} and f_{i2} : i = 1, 2, ..., P), and

- the levels of significance (weights) of the matching phrases in both documents (w_{i1} and w_{i2} : i = 1, 2, ..., P).

The frequency of a phrase is an important factor in the similarity measure: the more frequently a phrase appears in both documents, the more similar they tend to be. Similarly, the level of significance of the matching phrase in both documents should be taken into consideration. The phrase similarity between two documents, d_1 and d_2, is calculated using the following empirical equation:

    sim_p(d_1, d_2) = (Σ_{i=1}^{P} [g(l_i) · (f_{i1} w_{i1} + f_{i2} w_{i2})]^2) / (Σ_j s_{j1} w_{j1} + Σ_k s_{k2} w_{k2})    (1)

where g(l_i) is a function that scores the matching phrase length, giving a higher score as the matching phrase length approaches the length of the original sentence, and s_{j1} and s_{k2} are the original sentence lengths from documents d_1 and d_2, respectively. The equation rewards phrase matches that are longer, of higher significance, and more frequent in both documents. In the implemented system, the function g(l_i) was taken as:

    g(l_i) = (|ms_i| / |s_i|)^γ    (2)

where |ms_i| is the matching phrase length, |s_i| is the original sentence length, and γ is a sentence fragmentation factor with a value greater than or equal to 1. If γ is 1, two halves of a sentence could be matched independently and would be scored as highly as a whole-sentence match. By increasing γ we avoid this situation and score whole-sentence matches higher than fragments of sentences. A value of 1.2 for γ was found to produce the best results.
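As one possible reading of equations (1) and (2), the sketch below computes sim_p from a precomputed match list; the tuple layout, and the choice of which sentence length feeds g, are assumptions, since the paper leaves those details open.

```python
def phrase_similarity(matches, sents1, sents2, gamma=1.2):
    """Sketch of equations (1)-(2).
    matches: list of (match_len, orig_sent_len, f1, w1, f2, w2) per matching phrase.
    sents1/sents2: list of (sentence_len, weight) pairs for documents d1 and d2."""
    num = 0.0
    for (ms, s, f1, w1, f2, w2) in matches:
        g = (ms / s) ** gamma                    # equation (2)
        num += (g * (f1 * w1 + f2 * w2)) ** 2    # one summand of equation (1)
    denom = sum(l * w for l, w in sents1) + sum(l * w for l, w in sents2)
    return num / denom if denom else 0.0
```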

The normalization by the lengths of the two documents in equation (1) is necessary to make similarities comparable across different document pairs.

4.1 Combining single-term and phrase similarities

If the similarity between documents is based solely on matching phrases, and not on single terms at the same time, related documents could be judged non-similar when they do not share enough phrases (a typical case that can happen in many situations). Shared phrases provide important local context matching, but sometimes similarity based on phrases alone is not sufficient. To alleviate this problem, and to produce high quality clusters, we combined a single-term similarity measure with our phrase-based similarity measure. We used the cosine correlation similarity measure [21, 22], with TF-IDF (Term Frequency-Inverse Document Frequency) term weights, as the single-term similarity measure. The cosine measure was chosen due to its wide use in the document clustering literature, and because it has been described as capturing human categorization behavior well [26]. TF-IDF is likewise a widely used term weighting scheme [29]. Recall that the cosine measure calculates the cosine of the angle between the two document vectors. Accordingly, our term-based similarity measure sim_t is given as:

    sim_t(d_1, d_2) = cos(d_1, d_2) = (d_1 · d_2) / (||d_1|| ||d_2||)    (3)

where the vectors d_1 and d_2 are represented as term weights calculated using the TF-IDF weighting scheme. The combination of the term-based and phrase-based similarity measures is a weighted average of the two quantities from equations (1) and (3), and is given by equation (4).

    sim(d_1, d_2) = α · sim_p(d_1, d_2) + (1 − α) · sim_t(d_1, d_2)    (4)

where α is a value in the interval [0, 1] that determines the weight of the phrase similarity measure, or, as we call it, the Similarity Blend Factor. According to the experimental results discussed in section 6, a value between 0.6 and 0.8 for α results in the maximum improvement in clustering quality.
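A minimal sketch of the blend in equation (4), with the cosine term of equation (3) computed from TF-IDF weight dictionaries; the dictionary-based representation is an assumption, and sim_p is taken as precomputed (e.g. by the phrase_similarity sketch above).

```python
import math

def cosine(v1, v2):
    """Equation (3): cosine of the angle between two TF-IDF weight dicts."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm = math.sqrt(sum(w * w for w in v1.values())) \
         * math.sqrt(sum(w * w for w in v2.values()))
    return dot / norm if norm else 0.0

def blended_similarity(sim_p, v1, v2, alpha=0.7):
    """Equation (4): weighted average of phrase and single-term similarity."""
    return alpha * sim_p + (1 - alpha) * cosine(v1, v2)
```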

5 Incremental Document Clustering

In this section we present a brief overview of incremental clustering algorithms and introduce the proposed algorithm, which is based on pair-wise document similarity and is employed as part of the whole web document clustering system.

The role of a document similarity measure is to provide a judgement on the closeness of documents to each other. However, it is up to the clustering method to decide how to make use of such similarity calculations. The idea here is to employ an incremental clustering method that exploits our similarity measure to produce clusters of high quality (assessing the quality of clustering is described in section 6). Incremental clustering is an essential strategy for online applications, where time is a critical factor for usability. Incremental clustering algorithms work by processing data objects one at a time, incrementally assigning them to their respective clusters as they progress. The process is simple in principle, but faces several challenges, including:

- How do we determine to which cluster the next object should be assigned?
- How do we deal with the problem of insertion order?
- Once an object has been assigned to a cluster, should its assignment be frozen, or should it be allowed to be re-assigned to other clusters later on?

Usually a heuristic method is employed to deal with these challenges. A good incremental clustering algorithm has to find the respective cluster for each newly introduced object without significantly sacrificing clustering accuracy to insertion order or fixed object-to-cluster assignments. Before introducing our proposed method, we briefly discuss two incremental clustering methods in the light of these challenges.

Suffix Tree Clustering (STC). Introduced by Zamir et al. [31] in 1997, the idea behind the STC algorithm is to build a tree of phrase suffixes shared between multiple documents. The documents sharing a suffix are considered a base cluster. Base clusters are then combined if they have a document overlap of 50% or more. The algorithm has two drawbacks. First, although the structure used is a compact tree, suffixes can appear multiple times when they are part of larger shared suffixes. Second, the second phase of the algorithm is not incremental: combining base clusters into final clusters has to be done in a non-incremental way. The algorithm deals properly with the insertion order problem, though, since any insertion order leads to the same resulting suffix tree.

DC-tree Clustering. The DC-tree incremental algorithm was introduced by Wong et al. [27]. The algorithm is based on the B+-tree structure. Unlike the STC algorithm, it is based on the vector space representation of the documents. Most of its operations are borrowed from B+-tree operations. Each node in the tree represents a cluster, where a cluster is represented by the combined feature vectors of its individual documents. Inserting a new document involves comparing the document feature vector with the cluster vectors at one level of the tree, and descending to the most similar cluster.

The algorithm defines several parameters and thresholds for the various operations, and it suffers from two problems. First, once a document is assigned to a cluster it is not allowed to be re-assigned later to a newly created cluster. Second, as a consequence of the first drawback, clusters are not allowed to overlap; i.e. a document can belong to only one cluster.

5.1 Similarity histogram-based incremental clustering

The clustering approach proposed here is an incremental, dynamic method of building the clusters, and we adopt an overlapped cluster model. The key idea is to keep each cluster at a high degree of coherency at all times. We represent the coherency of a cluster with a new concept called the Cluster Similarity Histogram.

Cluster Similarity Histogram: a concise statistical representation of the distribution of pair-wise document similarities within the cluster. The histogram's bins correspond to fixed similarity-value intervals, and each bin contains the count of pair-wise document similarities falling in the corresponding interval.

Figure 4 shows a typical cluster similarity histogram, where the distribution is close to normal. A perfect cluster would have a histogram where the similarities are all at the maximum, while a loose cluster would have a histogram where the similarities are all at the minimum.

5.2 Creating coherent clusters incrementally

Our objective is to keep each cluster as coherent as possible. In terms of the similarity histogram concept, this translates to maximizing the number of similarities in the high-similarity intervals.

[Figure 4: Typical Cluster Similarity Histogram — counts of pair-wise similarities plotted against similarity intervals.]

To achieve this goal in an incremental fashion, we judge the effect of adding a new document to a given cluster. If the document would degrade the distribution of similarities in the cluster too much, it is not added; otherwise it is added. A much stricter strategy would be to add only documents that enhance the similarity distribution. However, this creates a problem with perfect clusters: a document would be rejected even if it has high similarity to most of the documents in the cluster, simply because the cluster is already perfect.

We judge the quality of a similarity histogram (cluster cohesiveness) by calculating the ratio of the count of similarities above a certain similarity threshold S_T to the total count of similarities. The higher this ratio, the more coherent the cluster. Let n be the number of documents in a cluster. The number of pair-wise similarities in the cluster is m = n(n − 1)/2. Let S = {s_i : i = 1, ..., m} be the set of similarities in the cluster. The histogram of the similarities in the cluster is represented as:

    H = {h_i : i = 1, ..., B}    (5a)
    h_i = count(s_k),  s_{li} < s_k < s_{ui}    (5b)

where B is the number of histogram bins, h_i is the count of similarities in bin i, and s_{li} and s_{ui} are the lower and upper similarity bounds of bin i, respectively.

The histogram ratio of a cluster is the measure of cluster cohesiveness described above, and is calculated as:

    HR(C) = (Σ_{i=T}^{B} h_i) / (Σ_{j=1}^{B} h_j)    (6a)
    T = ⌈S_T · B⌉    (6b)

where HR is the histogram ratio, C is the cluster under consideration, S_T is the similarity threshold, and T is the bin number corresponding to the similarity threshold.

Basically, we would like to keep the histogram ratio of each cluster high. However, since we allow documents that degrade the histogram ratio to be added, this could result in a chain effect that eventually degrades the ratio to zero. To prevent this, we set a minimum histogram ratio HR_min that clusters must maintain. We also do not allow adding a document that would bring the histogram ratio down significantly (even if it remains above HR_min); this prevents a single bad document from severely degrading cluster quality.
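A minimal sketch of equations (5) and (6), assuming B equal-width bins over [0, 1]; the binning boundaries are one reasonable reading of the paper's fixed intervals.

```python
def histogram_ratio(similarities, bins=10, sim_threshold=0.5):
    """HR(C) of equation (6a): the fraction of pair-wise similarities
    falling in bins at or above the threshold bin T (equation 6b)."""
    counts = [0] * bins                              # h_i of equation (5)
    for s in similarities:
        counts[min(int(s * bins), bins - 1)] += 1    # equal-width bins on [0, 1]
    t = int(sim_threshold * bins)                    # index of the threshold bin
    total = sum(counts)
    return sum(counts[t:]) / total if total else 0.0
```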

We now present the incremental clustering algorithm based on the above framework (Algorithm 2). The algorithm works incrementally by receiving a new document and, for each cluster, calculating the cluster histogram ratio before and after simulating the addition of the document (lines 3-5). The old and new histogram ratios are compared, and if the new ratio is greater than or equal to the old one, the document is added to the cluster. If the new ratio is lower than the old one by no more than ε and still above HR_min, the document is also added (lines 6-8). Otherwise it is not added. If, after checking all clusters, the document has not been assigned to any cluster, a new cluster is created and the document is added to it (lines 10-13).

Algorithm 2 Similarity Histogram-based Incremental Document Clustering
 1: L ← empty list   {cluster list}
 2: for each document D do
 3:     for each cluster C in L do
 4:         HR_old = HR(C)
 5:         simulate adding D to C
 6:         HR_new = HR(C)
 7:         if (HR_new ≥ HR_old) OR ((HR_new > HR_min) AND (HR_old − HR_new < ε)) then
 8:             add D to C
 9:         end if
10:     end for
11:     if D was not added to any cluster then
12:         create a new cluster C
13:         add D to C
14:         add C to L
15:     end if
16: end for
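The following sketch mirrors Algorithm 2 in Python, reusing histogram_ratio from the previous sketch; sim(a, b) is an assumed similarity oracle, e.g. the blended measure of equation (4).

```python
def cluster_incrementally(docs, sim, hr_min=0.3, eps=0.05):
    """Sketch of Algorithm 2. docs: iterable of document ids;
    sim(a, b): pair-wise similarity in [0, 1]."""
    clusters = []                          # each cluster is a list of doc ids
    for d in docs:
        added = False
        for c in clusters:
            old = [sim(a, b) for i, a in enumerate(c) for b in c[i + 1:]]
            new = old + [sim(d, m) for m in c]       # simulate adding d
            hr_old, hr_new = histogram_ratio(old), histogram_ratio(new)
            if hr_new >= hr_old or (hr_new > hr_min and hr_old - hr_new < eps):
                c.append(d)                # overlapping clusters are allowed,
                added = True               # so keep scanning the other clusters
        if not added:
            clusters.append([d])           # start a new cluster for d
    return clusters
```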

5.3 Dealing with insertion order problems

Our strategy for the insertion order problem is to implement a document re-assignment scheme. Older documents that were added before newer clusters were created should have the chance to be re-assigned to those clusters. Only documents that appear to be bad for a certain cluster are tagged and considered for re-assignment. The documents that are candidates to leave a cluster are those whose removal would increase the cluster similarity histogram ratio; i.e. the cluster is better off without them. For each document we keep a record of what the cluster's histogram ratio would be without it; if this value is greater than the current histogram ratio, the document is a candidate for leaving the cluster. Upon adding a new document to any cluster, we consult the documents that are candidates for leaving that cluster. If any such document can be added to another cluster, we move it there, benefiting both clusters. This strategy creates a dynamic negotiation scheme between clusters for document assignment. It also allows for overlapping clusters and dynamic, incremental document clustering.
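The leave-candidate test might be sketched as below, again reusing histogram_ratio from the earlier sketch; the exhaustive recomputation is for clarity only, since the paper keeps these ratios as running records.

```python
def leave_candidates(cluster, sim):
    """Return documents whose removal would raise the cluster's
    similarity histogram ratio (candidates for re-assignment)."""
    def pair_sims(members):
        return [sim(a, b) for i, a in enumerate(members) for b in members[i + 1:]]
    hr_full = histogram_ratio(pair_sims(cluster))
    return [d for d in cluster
            if histogram_ratio(pair_sims([m for m in cluster if m != d])) > hr_full]
```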

6 Experimental Results

In order to test the effectiveness of the web clustering system, we conducted a set of experiments using our proposed data model, phrase matching, similarity measure, and incremental clustering method. The experiments were divided into two sets. We first tested the effectiveness of the Document Index Graph model, presented in section 3, and the accompanying phrase matching algorithm for calculating the similarity between documents based on phrases versus individual words only. The second set of experiments evaluated the accuracy of the incremental document clustering algorithm, presented in section 5, based on the cluster cohesiveness measure using similarity histograms.

Data Set   Description                                 Categories   Documents
DS1        UofW web site, Canadian web sites           10           314
DS2        Reuters news articles (from Yahoo! news)    20           2340

Table 1: Data Sets Descriptions

6.1 Experimental setup

Because the proposed system was designed to make use of the semi-structure of web documents, regular text corpora were not used. Our experimental setup consisted of two web document sets. The first consists of 314 web documents collected from various University of Waterloo web sites, such as the Graduate Studies Office, Information Systems and Technology, Health Services, Career Services, and Co-operative Education, and from other Canadian web sites. The documents were classified, according to their content, into 10 different categories. To allow for independent testing and reproduction of the results presented here, this document collection can be downloaded at: hammouda/webdata/. The second data set is a collection of Reuters news articles from the Yahoo! news site. It contains 2340 documents classified into 20 different categories (with some relevance between the categories as well). The second data set was used by Boley et al. in [2, 3, 4]. Table 1 summarizes the two data sets.

6.2 Evaluation measures

In order to evaluate the quality of the clustering, we adopted two quality measures widely used in the text mining literature for document clustering [25]. The first is the F-measure, which combines the Precision and Recall ideas from the information retrieval literature.

The precision and recall of a cluster j with respect to a class i are defined as:

    P = Precision(i, j) = N_{ij} / N_j    (7a)
    R = Recall(i, j) = N_{ij} / N_i    (7b)

where N_{ij} is the number of members of class i in cluster j, N_j is the number of members of cluster j, and N_i is the number of members of class i.

The F-measure of a class i is defined as:

    F(i) = 2PR / (P + R)    (8)

With respect to class i, we consider the cluster with the highest F-measure to be the cluster j that maps to class i, and that F-measure becomes the score for class i. The overall F-measure for the clustering result C is the weighted average of the F-measures of the individual classes:

    F_C = (Σ_i |i| · F(i)) / (Σ_i |i|)    (9)

where |i| is the number of objects in class i. The higher the overall F-measure, the better the clustering, due to the higher accuracy of the clusters mapping to the original classes.
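A minimal sketch of the overall F-measure computation of equations (7)-(9), given a class-by-cluster contingency table (the Entropy measure, defined next, follows the same weighted-sum pattern):

```python
def overall_f_measure(counts):
    """Sketch of equations (7)-(9). counts[i][j] = number of members of
    class i found in cluster j (a class-by-cluster contingency table)."""
    cluster_sizes = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    total, weighted = 0, 0.0
    for row in counts:                        # one row per class i
        class_size = sum(row)                 # N_i
        best_f = 0.0
        for j, n_ij in enumerate(row):
            if n_ij == 0:
                continue
            p = n_ij / cluster_sizes[j]       # Precision(i, j), eq. (7a)
            r = n_ij / class_size             # Recall(i, j), eq. (7b)
            best_f = max(best_f, 2 * p * r / (p + r))   # eq. (8)
        weighted += class_size * best_f       # |i| * F(i), eq. (9)
        total += class_size
    return weighted / total
```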

The second measure is the Entropy, which provides a measure of goodness for un-nested clusters, or for the clusters at one level of a hierarchical clustering. Entropy tells us how homogeneous a cluster is: the higher the homogeneity of a cluster, the lower its entropy, and vice versa. The entropy of a cluster containing only one object (perfect homogeneity) is zero. For every cluster j in the clustering result C we compute p_{ij}, the probability that a member of cluster j belongs to class i. The entropy of each cluster j is calculated using the standard formula E_j = −Σ_i p_{ij} log(p_{ij}), where the sum is taken over all classes. The total entropy for a set of clusters is the sum of the entropies of the clusters, weighted by cluster size:

    E_C = Σ_{j=1}^{m} (N_j / N) · E_j    (10)

where N_j is the size of cluster j, and N is the total number of data objects. Basically, we would like to maximize the F-measure and minimize the Entropy of the clusters to achieve high quality clustering.

6.3 Effect of phrase-based similarity on clustering quality

The similarities calculated by our algorithm were used to construct a similarity matrix between the documents. We elected to use three standard document clustering techniques for testing the effect of phrase similarity on clustering [13]: (1) Hierarchical Agglomerative Clustering (HAC), (2) Single Pass Clustering, and (3) K-Nearest Neighbor Clustering (k-NN)². For each of the algorithms, we constructed the similarity matrix and let the algorithm cluster the documents based on it.

The results listed in Table 2 show the improvement in clustering quality on the first data set using the combined similarity measure. The improvements shown were achieved at a similarity blend factor between 70% and 80% (phrase similarity weight). The parameters chosen for the different algorithms were the ones that produced the best results. The percentage of improvement ranges from a 19.5% to 60.6% increase in F-measure quality, and a 9.1% to 46.2% drop in Entropy (lower is better for Entropy). It is obvious that the phrase-based similarity plays an important role in accurately judging the relation between documents. Single Pass clustering is known to be very sensitive to noise, which is why it has the worst performance; however, when the phrase similarity was introduced, the quality of the clusters it produced was pushed close to that produced by HAC and k-NN.

²Although k-NN is mostly known as a classification method, it has also been used for clustering (an example can be found in [17]).

[Figure 5: Effect of phrase similarity on clustering quality — (a) effect on F-measure and (b) effect on Entropy, each plotted against the similarity blend factor (alpha) for HAC, Single Pass, and k-NN.]

Table 2: Phrase-based clustering improvement

               Single-Term Similarity    Combined Similarity    Improvement
               F-measure    Entropy      F-measure    Entropy
HAC^a              –            –            –            –     –%F, −24.8%E
Single Pass^b      –            –            –            –     –%F, −46.2%E
k-NN^c             –            –            –            –     –%F, −9.1%E

^a Complete Linkage was used as the cluster distance measure for the HAC method, since it tends to produce tight clusters with small diameter.
^b A document-to-cluster similarity threshold of 0.25 was used.
^c A K of 5 and a cluster similarity threshold of 0.25 were used.

In order to better understand the effect of phrase similarity on clustering quality, we generated a clustering quality profile against the similarity blend factor. Figure 5(a) illustrates the effect of introducing phrase similarity on the F-measure of the resulting clusters. The phrase similarity clearly enhances the F-measure of the clustering up to a certain point (around a weight of 80%), after which its effect starts to bring the quality down. As mentioned in section 4.1, phrases alone cannot capture all the similarity information between documents; single-term similarity is still required, though to a smaller degree. The same can be seen in the Entropy profile in Figure 5(b), where Entropy is minimized at around an 80% contribution of phrase similarity against 20% for single-term similarity. The results show that both evaluation measures are optimized in the same trend with respect to the blend factor.

By having two independent evaluation measures confirm the improvement in clustering quality, we are confident that the results are not biased by either evaluation measure.

6.4 Incremental clustering evaluation

Our proposed incremental document clustering method was evaluated using both data sets described earlier. We relied on the same evaluation measures discussed above, as well as on another measure called Overall Similarity: the average of the pair-wise similarities inside each cluster. Higher overall similarity indicates better cluster cohesiveness.

                      Data Set 1                     Data Set 2
                 F-measure  Entropy  O-S^a     F-measure  Entropy  O-S^a
Proposed Method      –          –      –           –          –      –
HAC                  –          –      –           –          –      –
Single Pass          –          –      –           –          –      –
k-NN                 –          –      –           –          –      –

^a Overall Similarity

Table 3: Proposed Clustering Method Improvement

Table 3 shows the results of the proposed clustering method against HAC, Single Pass, and k-NN clustering. For the first data set, the improvement was very significant, reaching over 70% improvement over k-NN (in terms of F-measure), 25% improvement over HAC, and 53% improvement over Single Pass. This is attributed to the fact that the different categories of the documents do not have a great deal of overlap, which enables the algorithm to avoid noisy similarities from other clusters. For the second data set, an improvement between 10% and 18% was achieved over the other methods.

However, the F-measure was not as high as for the first data set. Examining the actual documents and their classification reveals that the documents within each single class do not have enough overlap, which makes accurate similarity calculation between the documents difficult. Nevertheless, we were able to push the quality of the clustering further by relying on accurate and robust phrase-matching similarity calculation, and to achieve higher clustering quality.

[Figure 6: Quality of Clustering Comparison — (a) F-measure and (b) Entropy on both data sets for the proposed method (with and without re-assignment), HAC, Single Pass, and k-NN.]

Figure 6 shows the above results more clearly, illustrating the achieved improvement in comparison with the other methods. The figure also shows the effect of applying the re-assignment strategy discussed in section 5.3.

The problem with incremental clustering is that documents often do not end up in the clusters where they best belong. The re-assignment strategy we use moves documents that are seen as bad for their clusters to other clusters that can accept them, all based on the idea of increasing the cluster similarity histogram ratio. As shown in the figure, the re-assignment strategy yielded a slight improvement over the same method without document re-assignment.

7 Conclusion

We presented a system composed of four decoupled components in an attempt to improve document clustering in the web domain. Information in web documents lies not only in their content, but also in their inherent semi-structure; by exploiting this structure we can achieve better clustering results. We presented a web document analysis component capable of identifying the structure of web documents and building well-structured documents out of the semi-structured originals.

The second component, and perhaps the most important one in terms of impact on performance, is the new document model introduced in this paper, the Document Index Graph. This model indexes web documents using phrases and their levels of significance. Such a model enables us to perform phrase matching and similarity calculation between documents in a very robust and accurate way. The quality of clustering achieved using this model significantly surpasses that of traditional vector space model based approaches.

The third component is the phrase-based similarity measure. By carefully examining the factors affecting the degree of overlap between documents, we devised a phrase-based similarity measure capable of accurately calculating pair-wise document similarity.

The fourth component is an incremental document clustering method based on maintaining high cluster cohesiveness by improving the pair-wise document similarity distribution inside each cluster. The merit of such a design is that each component can be utilized independently of the others, but we are confident that the combination of the components leads to better results, as justified by the results presented in this paper. By testing our model with different standard clustering techniques, we are also confident that the model itself is well justified.

There are a number of future research directions for extending and improving this work. One direction is to improve the accuracy of similarity calculation between documents by employing different similarity calculation strategies. Although the current scheme proved more accurate than traditional methods, there is still room for improvement. In addition, although the work presented here is aimed at web document clustering, it could easily be adapted to other document types as well; it would, however, not benefit from the semi-structure found in web documents. Our intention is to investigate the use of such a model on standard corpora and to study its effect on clustering compared to traditional methods.

References

[1] K. Aas and L. Eikvil. Text categorisation: A survey. Technical Report 941, Norwegian Computing Center, June.

[2] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Partitioning-based clustering for web document categorization. Decision Support Systems, 27.

[3] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Document categorization and query generation on the World Wide Web using WebACE. AI Review, 13(5-6).

[4] D. Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4).

[5] K. Cios, W. Pedrycz, and R. Swiniarski. Data Mining Methods for Knowledge Discovery. Kluwer Academic Publishers, Boston.

[6] W. W. Cohen. Learning to classify English text with ILP methods. In Proceedings of the 5th International Workshop on Inductive Logic Programming. Department of Computer Science, Katholieke Universiteit Leuven.

[7] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, November 1998.


Information Integration of Partially Labeled Data Information Integration of Partially Labeled Data Steffen Rendle and Lars Schmidt-Thieme Information Systems and Machine Learning Lab, University of Hildesheim srendle@ismll.uni-hildesheim.de, schmidt-thieme@ismll.uni-hildesheim.de

More information

Text Documents clustering using K Means Algorithm

Text Documents clustering using K Means Algorithm Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals

More information

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters Types of general clustering methods Clustering Algorithms for general similarity measures agglomerative versus divisive algorithms agglomerative = bottom-up build up clusters from single objects divisive

More information

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & 6367(Print), ISSN 0976 6375(Online) Volume 3, Issue 1, January- June (2012), TECHNOLOGY (IJCET) IAEME ISSN 0976 6367(Print) ISSN 0976 6375(Online) Volume

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Document Clustering For Forensic Investigation

Document Clustering For Forensic Investigation Document Clustering For Forensic Investigation Yogesh J. Kadam 1, Yogesh R. Chavan 2, Shailesh R. Kharat 3, Pradnya R. Ahire 4 1Student, Computer Department, S.V.I.T. Nasik, Maharashtra, India 2Student,

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 2 Part 4: Dividing Networks into Clusters The problem l Graph partitioning

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky The Chinese University of Hong Kong Abstract Husky is a distributed computing system, achieving outstanding

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey G. Shivaprasad, N. V. Subbareddy and U. Dinesh Acharya

More information

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Engineering, Technology & Applied Science Research Vol. 8, No. 1, 2018, 2562-2567 2562 Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Mrunal S. Bewoor Department

More information

Heading-Based Sectional Hierarchy Identification for HTML Documents

Heading-Based Sectional Hierarchy Identification for HTML Documents Heading-Based Sectional Hierarchy Identification for HTML Documents 1 Dept. of Computer Engineering, Boğaziçi University, Bebek, İstanbul, 34342, Turkey F. Canan Pembe 1,2 and Tunga Güngör 1 2 Dept. of

More information

A Comparison of Document Clustering Techniques

A Comparison of Document Clustering Techniques A Comparison of Document Clustering Techniques M. Steinbach, G. Karypis, V. Kumar Present by Leo Chen Feb-01 Leo Chen 1 Road Map Background & Motivation (2) Basic (6) Vector Space Model Cluster Quality

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

Web Data mining-a Research area in Web usage mining

Web Data mining-a Research area in Web usage mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 13, Issue 1 (Jul. - Aug. 2013), PP 22-26 Web Data mining-a Research area in Web usage mining 1 V.S.Thiyagarajan,

More information

Information Extraction Techniques in Terrorism Surveillance

Information Extraction Techniques in Terrorism Surveillance Information Extraction Techniques in Terrorism Surveillance Roman Tekhov Abstract. The article gives a brief overview of what information extraction is and how it might be used for the purposes of counter-terrorism

More information

Component ranking and Automatic Query Refinement for XML Retrieval

Component ranking and Automatic Query Refinement for XML Retrieval Component ranking and Automatic uery Refinement for XML Retrieval Yosi Mass, Matan Mandelbrod IBM Research Lab Haifa 31905, Israel {yosimass, matan}@il.ibm.com Abstract ueries over XML documents challenge

More information

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3

More information

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining D.Kavinya 1 Student, Department of CSE, K.S.Rangasamy College of Technology, Tiruchengode, Tamil Nadu, India 1

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

Centroid Based Text Clustering

Centroid Based Text Clustering Centroid Based Text Clustering Priti Maheshwari Jitendra Agrawal School of Information Technology Rajiv Gandhi Technical University BHOPAL [M.P] India Abstract--Web mining is a burgeoning new field that

More information

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

Similarity search in multimedia databases

Similarity search in multimedia databases Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

Associating Terms with Text Categories

Associating Terms with Text Categories Associating Terms with Text Categories Osmar R. Zaïane Department of Computing Science University of Alberta Edmonton, AB, Canada zaiane@cs.ualberta.ca Maria-Luiza Antonie Department of Computing Science

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Chapter 10. Conclusion Discussion

Chapter 10. Conclusion Discussion Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with

More information

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Neelam Singh neelamjain.jain@gmail.com Neha Garg nehagarg.february@gmail.com Janmejay Pant geujay2010@gmail.com

More information

Chapter 2. Related Work

Chapter 2. Related Work Chapter 2 Related Work There are three areas of research highly related to our exploration in this dissertation, namely sequential pattern mining, multiple alignment, and approximate frequent pattern mining.

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document

More information

Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 5:00pm-6:15pm, Monday, October 26th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

CHAPTER 7 CONCLUSION AND FUTURE WORK

CHAPTER 7 CONCLUSION AND FUTURE WORK CHAPTER 7 CONCLUSION AND FUTURE WORK 7.1 Conclusion Data pre-processing is very important in data mining process. Certain data cleaning techniques usually are not applicable to all kinds of data. Deduplication

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge

Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge Exploiting Internal and External Semantics for the Using World Knowledge, 1,2 Nan Sun, 1 Chao Zhang, 1 Tat-Seng Chua 1 1 School of Computing National University of Singapore 2 School of Computer Science

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

Chapter 4: Text Clustering

Chapter 4: Text Clustering 4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can

More information

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily

More information

IRCE at the NTCIR-12 IMine-2 Task

IRCE at the NTCIR-12 IMine-2 Task IRCE at the NTCIR-12 IMine-2 Task Ximei Song University of Tsukuba songximei@slis.tsukuba.ac.jp Yuka Egusa National Institute for Educational Policy Research yuka@nier.go.jp Masao Takaku University of

More information

A Deep Relevance Matching Model for Ad-hoc Retrieval

A Deep Relevance Matching Model for Ad-hoc Retrieval A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese

More information

Similarity Matrix Based Session Clustering by Sequence Alignment Using Dynamic Programming

Similarity Matrix Based Session Clustering by Sequence Alignment Using Dynamic Programming Similarity Matrix Based Session Clustering by Sequence Alignment Using Dynamic Programming Dr.K.Duraiswamy Dean, Academic K.S.Rangasamy College of Technology Tiruchengode, India V. Valli Mayil (Corresponding

More information

Bipartite Graph Partitioning and Content-based Image Clustering

Bipartite Graph Partitioning and Content-based Image Clustering Bipartite Graph Partitioning and Content-based Image Clustering Guoping Qiu School of Computer Science The University of Nottingham qiu @ cs.nott.ac.uk Abstract This paper presents a method to model the

More information