Applying Key Segments Schema to Generalized Star Clustering

Similar documents
ACONS: A New Algorithm for Clustering Documents

Semi-Supervised Clustering with Partial Background Information

Keyword Extraction by KNN considering Similarity among Features

International Journal of Advanced Research in Computer Science and Software Engineering

Text Documents clustering using K Means Algorithm

Making Retrieval Faster Through Document Clustering

Information Retrieval. (M&S Ch 15)

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014.

Text Categorization (I)

Encoding Words into String Vectors for Word Categorization

An Overview of Concept Based and Advanced Text Clustering Methods.

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts

String Vector based KNN for Text Categorization

Introduction to Information Retrieval

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

CS 6320 Natural Language Processing

Mining Quantitative Association Rules on Overlapped Intervals

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

modern database systems lecture 4 : information retrieval

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Chapter 6: Information Retrieval and Web Search. An introduction

A hybrid method to categorize HTML documents

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Concept-Based Document Similarity Based on Suffix Tree Document

Multimedia Information Systems

Text Mining: A Burgeoning technology for knowledge extraction

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

A Distributed Retrieval System for NTCIR-5 Patent Retrieval Task

Online algorithms for clustering problems

A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Index Terms:- Document classification, document clustering, similarity measure, accuracy, classifiers, clustering algorithms.

Available online at ScienceDirect. Procedia Computer Science 82 (2016 ) 28 34

Retrieval of Highly Related Documents Containing Gene-Disease Association

Hierarchical Document Clustering

Text Document Clustering Using DPM with Concept and Feature Analysis

Knowledge Discovery and Data Mining 1 (VO) ( )

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

Comparative Study of Subspace Clustering Algorithms

An Improvement of Centroid-Based Classification Algorithm for Text Classification

Knowledge Engineering in Search Engines

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Automatic Summarization

Information Retrieval and Web Search

A Fast Approximated k Median Algorithm

A probabilistic description-oriented approach for categorising Web documents

Clustering Algorithms for Data Stream

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Boolean Model. Hongning Wang

Q: Given a set of keywords how can we return relevant documents quickly?

An Efficient Hash-based Association Rule Mining Approach for Document Clustering

Nearest Neighbor Classification

Agglomerative clustering on vertically partitioned data

Reading group on Ontologies and NLP:

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

ResPubliQA 2010

Text clustering based on a divide and merge strategy

Theme Identification in RDF Graphs

A Model for Information Retrieval Agent System Based on Keywords Distribution

CADIAL Search Engine at INEX

Information Retrieval. Information Retrieval and Web Search

Determination of Similarity Threshold in Clustering Problems for Large Data Sets

Development of Generic Search Method Based on Transformation Invariance

Noisy Text Clustering

Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining

Social Media Computing

CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS

Color-Based Classification of Natural Rock Images Using Classifier Combinations

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

Information Retrieval CSCI

Similarity search in multimedia databases

Semi supervised clustering for Text Clustering

Clustering with Lower Bound on Similarity

Collaborative Rough Clustering

International ejournals

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach

CHAPTER 4: CLUSTER ANALYSIS

Chapter 27 Introduction to Information Retrieval and Web Search

Finding Hubs and authorities using Information scent to improve the Information Retrieval precision

Improving the Performance of Search Engine With Respect To Content Mining Kr.Jansi, L.Radha

Mining Web Data. Lijun Zhang

Consensus clustering by graph based approach

A Universal Model for XML Information Retrieval

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A Document Graph Based Query Focused Multi- Document Summarizer

Introduction to Information Retrieval

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Top-k Keyword Search Over Graphs Based On Backward Search

Information Retrieval: Retrieval Models

Search Engines. Information Retrieval in Practice

Texture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map

VK Multimedia Information Systems

ijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Using a Painting Metaphor to Rate Large Numbers

Circle Graphs: New Visualization Tools for Text-Mining


Abstract. The clustering process, like other text mining tasks, depends critically on a proper representation of documents. In this paper we propose the Filtered by Key Segment (FKS) vector, obtained from the whole document but considering only the terms appearing in a predefined key segment, as a document representation schema. The evaluation experiments show that our proposal, applied to the Vector Space Model and the Global Association Distance Model with the Generalized Star algorithm, outperforms the original models.

1 Introduction

Clustering is the process of grouping data into classes or clusters so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Dissimilarities are assessed from the attribute values describing the objects; often, distance measures are used. Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning, and cluster analysis has been widely used in numerous applications, including pattern recognition, data analysis, image processing, and market research.

Initially, document clustering was evaluated as a means of improving the results of information retrieval systems [10, 2]. Clustering has been proposed as an efficient way of automatically finding related or new topics, in filtering tasks [1], and for grouping retrieved documents into a list of meaningful categories, facilitating query processing by searching only the clusters closest to the query [15].

Several algorithms have been proposed for document clustering. One of them is Generalized Star (GStar), presented and evaluated by Pérez et al. in [8]. They introduced a new definition of star that allows a different star-shaped subgraph; in this way GStar retains the strengths of its predecessor algorithms while solving their drawbacks.
The experiments comparing GStar against the original Star [1], the Extended Star algorithm [7], and other traditional clustering algorithms such as Single and Average Link [6] show that Generalized Star outperforms those algorithms using the Vector Space Model (VSM) [12] as the document representation model. Nevertheless, we consider that representation models applied to whole documents, but filtered by a key segment, generally perform better than the original representation models. In this paper we evaluate the performance of GStar using the Vector Space Model, the Global Association Distance Model, and their respective filtered-by-key-segment variants in order to test this hypothesis.

The basic outline of this paper is as follows. Section 2 is dedicated to the representation models considered in the experimentation phase. The Generalized Star algorithm is described in Section 3. The experimental results are discussed in Section 4, and the conclusions of the research, together with some ideas about future directions, are presented in Section 5.

2 Document representation

The clustering process, like any other text mining task, is carried out in two main stages: a pre-processing stage and a discovery stage. In the first stage, texts are transformed into a structured or semi-structured representation, simpler and more amenable to automatic processing by computers. In the second stage these representations are analyzed in order to discover interesting patterns, i.e. clusters.

In the pre-processing stage a set of operations is performed to simplify and standardize the documents being analyzed. Some of these operations implement disambiguation methods and stemming processes, identifying concepts and syntagmatic structures. These concepts and structures can be organized in different forms but, in general, they are treated as groups or bags of words, usually structured using a vector space model [12]. In the vector space model, each document is a vector of terms. The values of these vectors can be taken as weights according to the term occurrences in the document or in the document collection, under different interpretations [5]: Boolean, Term Frequency (Tf), and Term Frequency-Inverse Document Frequency (Tf-Idf).
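
As an illustration, the Tf and Tf-Idf weighting interpretations can be sketched in a few lines of Python. This is our own minimal sketch, not part of the original models: the token lists, the natural-log Idf variant, and all identifiers are assumptions.

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Term Frequency (Tf): one bag-of-words vector per tokenized document."""
    return [Counter(doc) for doc in docs]

def tfidf_vectors(docs):
    """Tf-Idf: tf(t, d) * log(N / df(t)), with df(t) the document frequency."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # each document counts once per term
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

docs = [["star", "cluster", "star"], ["cluster", "graph"], ["graph", "star"]]
vecs = tfidf_vectors(docs)
```

The Boolean interpretation would simply replace each Tf count by 1.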
These term vectors are used in the second stage, among other tasks, to analyze the similarities between documents, or groups of them, using different measures such as the cosine of the angle between the vectors, defined as [5]:

sim(d_i, d_j) = \cos(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|} = \frac{\sum_r w_{ir} w_{jr}}{\sqrt{\sum_r w_{ir}^2} \, \sqrt{\sum_r w_{jr}^2}},   (1)

where d_i, d_j are the vectors of documents i and j, \|d_i\|, \|d_j\| are the norms of the vectors, and w_{ir}, w_{jr} are the term weights in the vectors d_i and d_j, respectively.

Another representation model is the Global Association Distance Model (GADM) [11]. GADM can be defined as a vector space model in which each term is weighted by its global association strength (3). Nevertheless, in contrast to the original VSM, which measures the relevance of a term by the number of its occurrences in a document, GADM considers the co-occurrences (actually, the association strengths) among terms in sentences, paragraphs, and so on. A document d can thus be modelled by a vector of global association strengths (2):

d = (g_{t_1}, \ldots, g_{t_n}),   (2)

where

g_{t_r} = \sum_{t_s \in d} \frac{1}{D_{rs}}   (3)
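
A minimal sketch of Eq. (1) over sparse vectors, assuming documents are represented as dicts mapping terms to weights (the representation and the identifiers are our own choices):

```python
import math

def cosine(di, dj):
    """Cosine similarity (Eq. 1) between two sparse term-weight vectors."""
    dot = sum(w * dj.get(t, 0.0) for t, w in di.items())
    ni = math.sqrt(sum(w * w for w in di.values()))
    nj = math.sqrt(sum(w * w for w in dj.values()))
    return dot / (ni * nj) if ni and nj else 0.0

s = cosine({"star": 2.0, "graph": 1.0}, {"star": 1.0})
```

Because only terms present in `di` contribute to the dot product, the sparse-dict form avoids materializing the full vocabulary.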

and the formal distance between these terms (D_{rs}) is defined as follows, considering the distance by paragraph without ignoring the natural co-occurrence of terms appearing in the same sentence, where (p_r, n_r) and (p_s, n_s) are the paragraph and sentence numbers of terms t_r and t_s, respectively:

D_{rs} = \begin{cases} 1 & (r = s) \lor [(p_r = p_s) \land (n_r = n_s)] \\ |p_r - p_s| + 2 & \text{otherwise} \end{cases}   (4)

Although flat structures are the simplest way of processing document collections, this linear model provides only a limited means of measuring similarities between semistructured documents. In a semistructured representation it is not necessary to use all the information [3]. In some approaches a predefined structure is considered and information is fed into the structure provided; in other approaches, documents are allowed to have specific structure types (such as trees or segments).

A semistructured approach is not an unusual way of representing documents. For example, in academic papers, authors are asked to write a few words that concisely describe their work (the title), a few paragraphs that outline it (the abstract), a few pages that precisely describe it (the body), and finally a summary (the conclusion). In newsworthy information, the inverted pyramid style is generally used: the conclusion or summary of the news is moved to the front of the article, putting the main idea into the first paragraph [13].

If we try to simplify the representation of a document by a single vector, perhaps we should choose between a whole-document vector and a vector from a key segment (for instance, the abstract in academic papers or the first paragraph in news), both as flat structures. Notice that in key segment vectors terms usually appear only once, so term weights lose their relevance in a VSM or GADM model.
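
Equations (2)-(4) can be sketched as follows. For simplicity this sketch assumes a single (paragraph, sentence) position per term; the positions, identifiers, and data layout are our own assumptions for illustration.

```python
def formal_distance(r, s, pos):
    """D_rs from Eq. (4); pos maps each term to its (paragraph, sentence) pair."""
    (pr, nr), (ps, ns) = pos[r], pos[s]
    if r == s or (pr == ps and nr == ns):
        return 1
    return abs(pr - ps) + 2

def global_strengths(pos):
    """g_tr = sum over terms t_s in the document of 1 / D_rs (Eqs. 2-3)."""
    return {r: sum(1.0 / formal_distance(r, s, pos) for s in pos)
            for r in pos}

# one (paragraph, sentence) position per term -- a simplifying assumption
pos = {"star": (1, 1), "cluster": (1, 1), "graph": (2, 1)}
g = global_strengths(pos)
```

Here "star" and "cluster" share a sentence (distance 1), while "graph" sits one paragraph away (distance |1 - 2| + 2 = 3), so nearby terms dominate each association strength.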
We have been considering an alternative way to represent semistructured documents by a single vector, without losing the importance of the key segment while taking into account the relevance of the terms in the whole document (the term weight). We have named it the vector Filtered by Key Segment. A vector Filtered by Key Segment (FKS) is a vector model (VSM or GADM) obtained from the whole document but considering only the terms appearing in a predefined key segment. This kind of vector could even be used in a structure-weighting approach, where the vector associated with each segment is constructed from the whole document. Besides, it should be noticed that FKS vectors can reduce the dimensionality problem usually present in any document processing system. In FKS from VSM, the term weight is the occurrence frequency in the whole document; in FKS from GADM, the term weight is the global association strength, computed over the whole document, between the term and the rest of the terms of the key segment.
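
The FKS construction for the VSM case reduces to a filter: keep the whole-document weights, but only for terms of the key segment. A minimal sketch (the example weights and the "abstract" key segment are hypothetical):

```python
def fks(whole_doc_vector, key_segment_terms):
    """Filtered-by-Key-Segment vector: only terms occurring in the key
    segment survive, but each keeps its whole-document weight."""
    keys = set(key_segment_terms)
    return {t: w for t, w in whole_doc_vector.items() if t in keys}

v = {"star": 5, "cluster": 3, "graph": 1}   # whole-document Tf weights
abstract = ["star", "graph"]                # hypothetical key segment
filtered = fks(v, abstract)
```

Note how the resulting vector is as small as the key segment's vocabulary (easing the dimensionality problem) while the weights still reflect the whole document.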

3 Generalized Star algorithm

The Generalized Star algorithm was proposed by Pérez et al. in [8] to solve the drawbacks of its predecessors: the Star algorithm [1] and the Extended Star algorithm [7]. This algorithm represents the document collection by its thresholded similarity graph, finding overlapping dense sub-graphs.

Let V = {d_1, \ldots, d_n} be a collection of documents and Sim(d_i, d_j) a symmetric similarity function between documents d_i and d_j. We call the similarity graph the undirected weighted graph G = (V, E, w) whose vertices correspond to documents and in which each weighted edge corresponds to the similarity between two documents. Given a similarity threshold \sigma defined by the user, we define the thresholded graph G_\sigma as the undirected graph obtained from G by eliminating all the edges whose weights are lower than \sigma.

The set of Weak Satellites (WeakSats) and the set of Potential Satellites (PotSats) of a vertex o are defined by expressions (5) and (6), respectively:

o.WeakSats = \{ s \mid s \in o.adj \land |o.adj| \geq |s.adj| \},   (5)

o.PotSats = \{ s \mid s \in o.adj \land |o.WeakSats| \geq |s.WeakSats| \}.   (6)

The WeakSats and PotSats degrees of a vertex o are defined as the number of vertices in its sets of Weak Satellites and Potential Satellites, respectively. Considering the aforementioned sets, a generalized star-shaped sub-graph of m + 1 vertices in G_\sigma consists of a single center c and m adjacent vertices, such that |c.PotSats| \geq |v.PotSats| for all v \in c.PotSats. Starting from this definition and guaranteeing a full cover C of G_\sigma, the algorithm must satisfy the following post-conditions:

\forall x \in V, \; x \in C \lor x.adj \cap C \neq \emptyset,   (7)

\forall c \in C, \; \forall v \in c.PotSats, \; |c.PotSats| \geq |v.PotSats|.   (8)

The first condition (7) guarantees that each object of the collection belongs to at least one group, as a center or as a satellite. Condition (8) states that all the centers satisfy the generalized star-shaped sub-graph definition.
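
The thresholded graph and the two satellite sets of Eqs. (5)-(6) can be sketched as follows, assuming documents are indexed by integers and similarities are given as a matrix (our own representation, chosen for brevity):

```python
def thresholded_adjacency(sims, sigma):
    """Adjacency sets of G_sigma: keep edges with similarity >= sigma."""
    n = len(sims)
    return [{j for j in range(n) if j != i and sims[i][j] >= sigma}
            for i in range(n)]

def weak_and_pot_sats(adj):
    """WeakSats and PotSats of every vertex, per Eqs. (5)-(6)."""
    n = len(adj)
    weak = [{s for s in adj[o] if len(adj[o]) >= len(adj[s])}
            for o in range(n)]
    pot = [{s for s in adj[o] if len(weak[o]) >= len(weak[s])}
           for o in range(n)]
    return weak, pot

sims = [[1.0, 0.8, 0.2],
        [0.8, 1.0, 0.5],
        [0.2, 0.5, 1.0]]
adj = thresholded_adjacency(sims, 0.5)   # G_sigma for sigma = 0.5
weak, pot = weak_and_pot_sats(adj)
```

In this toy graph only the middle vertex dominates its neighbors, so it alone collects potential satellites.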
The set of Necessary Satellites (NecSats) of o is the set of its adjacent vertices that could depend on o to be covered. This concept is needed only during cluster generation. Initially, NecSats takes the value of PotSats, but it can shrink during the clustering process as more documents are covered by stars.

Let C be the set of centers obtained by the algorithm. A center vertex c is considered redundant if it satisfies the following conditions:

1. c.adj \cap C \neq \emptyset, i.e. vertex c has at least one adjacent center in its neighborhood.

2. \forall s \in c.PotSats, \; s \in C \lor |s.adj \cap C| > 1, i.e. each such vertex s is itself a center or has more than one adjacent center (a neighboring center different from c) in its neighborhood.

Table 1. Pseudo-code of the Generalized Star algorithm

Algorithm 1: GStar
Input: V = {d_1, d_2, \ldots, d_n} - set of documents; \sigma - similarity threshold
Output: SC - set of clusters
 1  forall vertex d_i \in V do
 2      d_i.adj := {d_j | d_j \in V \land Sim(d_i, d_j) \geq \sigma};
 3  forall vertex d \in V do
 4      d.WeakSats := {s | s \in d.adj \land |d.adj| \geq |s.adj|};
 5  forall vertex d \in V do
 6      d.PotSats := {s | s \in d.adj \land |d.WeakSats| \geq |s.WeakSats|};
 7      d.NecSats := d.PotSats;
 8  end
 9  L := V;
10  C := \emptyset;
11  while L \neq \emptyset do
12      d := arg max { |d_i.PotSats| : d_i \in L };
13      Update(d, C, L);
14  end
15  Sort C in ascending order by |PotSats|;
16  SC := \emptyset;
17  forall center c \in C do
18      if c is redundant then C := C \ {c};
19      else SC := SC \cup {{c} \cup c.adj};

The Generalized Star algorithm is summarized in Table 1. The procedure Update (Table 2) marks a vertex as a center, deletes it from L, and updates the set NecSats of each of its necessary satellites.

Table 2. Pseudo-code of the Update procedure

Procedure Update
Input: d - selected center; C - set of cluster centers; L - set of unprocessed vertices
Output: C, L
 1  C := C \cup {d};
 2  L := L \ {d};
 3  forall s \in d.NecSats do
 4      s.NecSats := s.NecSats \ {d};
 5      if s.NecSats = \emptyset then L := L \ {s};
 6      forall c \in s.adj \ {d} do
 7          c.NecSats := c.NecSats \ {s};
 8          if (c.NecSats = \emptyset) \land (c.adj \cap C \neq \emptyset) then L := L \ {c};
 9      end
10  end
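
Assuming adjacency sets from the thresholded graph (vertices as integer indices, our own representation), the pseudo-code above translates almost line for line into Python. This is a sketch for illustration, not the authors' implementation:

```python
def gstar(adj):
    """Sketch of the GStar pseudo-code; adj[i] is vertex i's neighborhood
    in the thresholded graph G_sigma, given as a set of vertex indices."""
    n = len(adj)
    weak = [{s for s in adj[d] if len(adj[d]) >= len(adj[s])} for d in range(n)]
    pot = [{s for s in adj[d] if len(weak[d]) >= len(weak[s])} for d in range(n)]
    nec = [set(p) for p in pot]            # NecSats starts equal to PotSats
    L, C = set(range(n)), []

    def update(d):                          # Table 2: mark d as a center
        C.append(d)
        L.discard(d)
        for s in set(nec[d]):
            nec[s].discard(d)
            if not nec[s]:
                L.discard(s)
            for c in adj[s] - {d}:
                nec[c].discard(s)
                if not nec[c] and (adj[c] & set(C)):
                    L.discard(c)

    while L:                                # pick the vertex with max |PotSats|
        update(max(L, key=lambda v: len(pot[v])))

    C.sort(key=lambda c: len(pot[c]))       # ascending PotSats degree
    centers, clusters = set(C), []
    for c in C:                             # drop redundant centers
        redundant = bool(adj[c] & (centers - {c})) and all(
            s in centers or len(adj[s] & centers) > 1 for s in pot[c])
        if redundant:
            centers.discard(c)
        else:
            clusters.append({c} | adj[c])
    return clusters

clusters = gstar([{1, 2}, {0, 2}, {0, 1, 3}, {2}])
```

On the toy graph above, vertex 2 has the largest PotSats degree, becomes the only center, and its star covers the whole collection, satisfying post-condition (7).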

The GStar method, like the original Star algorithm and the two versions of the Extended algorithm, generates clusters which may overlap and also guarantees that the pairwise similarity between satellite vertices in a generalized star-shaped sub-graph is high. Unlike its predecessors, the GStar algorithm cannot produce illogical clusters, because all the centers satisfy the generalized star-shaped sub-graph definition. Nor does the GStar algorithm produce uncovered vertices; this property is ensured by the fulfillment of post-condition (7), and it avoids the generation of the unnecessary clusters produced by the two versions of the Extended algorithm. The Extended Star method certainly removes the dependence on data order; nevertheless, as was indicated in [8], this is necessary only when that dependence affects the quality of the resulting clusters. Thus, the GStar algorithm solves the dependence on data order (for non-symmetric or similar solutions) observed in the Star algorithm.

In [8] the GStar algorithm was compared with the original Star and Extended Star methods on two standard document collections, considering the Jaccard index [9] and the F-measure [4]. The experiments showed that GStar outperforms the previous methods under these measures and also in the number and density of the generated clusters. These results proved the validity of GStar for clustering tasks.

4 Experimental results

In this section we present the experimental evaluation of GStar clustering using the Vector Space Model, the Global Association Distance Model, and their corresponding filtered-by-key-segment variants proposed in Section 2. The GStar clustering results are evaluated with the same method and criterion to ensure a fair comparison across these representation models. Two document collections widely used in document clustering research were used in the experiments: TREC-5 (1) and Reuters-21578 (2).
These collections are heterogeneous with respect to document size, cluster size, number of classes, and document distribution. The TREC-5 data set contains news items in Spanish published by AFP during 1994-1995; Reuters-21578 is a Reuters Ltd. news corpus in English. We excluded from the collections the empty documents and those without an associated topic. The main characteristics of the collections are summarized in Table 3, which also lists the number of overlapping documents in each collection. All documents were preprocessed and lemmatized with TreeTagger [14], and the similarity among documents was computed with the traditional cosine measure. The FKS models were applied considering the first paragraph of each news item as the key segment.

(1) http://trec.nist.gov/pubs/trec5
(2) ftp://canberra.cs.umass.edu/pub/reuters

Table 3. Characteristics of the document collections

Collection   Documents   Overlapping doc.   Topics   Language
AFP          695         16                 25       Spanish
Reuters      10377       1722               119      English

The literature abounds in measures, defined by multiple authors, for comparing two partitions of the same set. The most widely used are the Jaccard index and the F-measure.

Jaccard index. This index (denoted j) takes into account the objects simultaneously joined [9]. It is defined as follows:

j(A, B) = \frac{n_{11}}{\frac{N(N-1)}{2} - n_{00}}   (9)

Here n_{11} denotes the number of pairs of objects which are in the same cluster in A and also in the same cluster in B; similarly, n_{00} is the number of pairs of objects which are in different clusters in A and also in different clusters in B. The performance of the algorithm on the document collections under the Jaccard index is shown in Fig. 1 (A) and (B) and in Fig. 2 (A) and (B), respectively.

F-measure. The aforementioned index and others are usually applied to partitions. In order to better evaluate overlapping clusterings, we have considered the F-measure calculated over pairs of points, as defined in [4]. Denoted F, this measure is the harmonic mean of precision and recall (10):

F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall},   (10)

Precision = \frac{n_{11}}{\text{number of identified pairs}}, \quad Recall = \frac{n_{11}}{\text{number of true pairs}}

The performance of the algorithm on the document collections under the F-measure is shown in Fig. 1 (C) and (D) and in Fig. 2 (C) and (D), respectively.

As can be seen in Fig. 1 and Fig. 2, the accuracy obtained by GStar using FKS (GStar-FKS) from VSM is in most cases (for all the indexes) better than or comparable with that obtained by GStar using the original VSM, and the accuracy obtained by GStar-FKS from GADM is in all cases better than or similar to that obtained with the original GADM. Note that the best performance of GStar is attained with the GADM filtered by a key segment for both document collections.
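
Both pairwise measures can be computed directly from the sets of co-clustered pairs; note that the denominator of Eq. (9) equals the number of pairs joined in A, in B, or in both. The sketch below uses this rewriting, with toy clusterings as our own illustrative input:

```python
from itertools import combinations

def _pairs(groups):
    """All unordered pairs of objects that share some cluster."""
    p = set()
    for g in groups:
        p.update(combinations(sorted(g), 2))
    return p

def pairwise_scores(pred, truth):
    """Jaccard index (Eq. 9) and pairwise F-measure (Eq. 10)."""
    pa, pb = _pairs(pred), _pairs(truth)
    n11 = len(pa & pb)                     # pairs joined in both clusterings
    j = n11 / len(pa | pb) if pa | pb else 1.0
    prec = n11 / len(pa) if pa else 0.0    # n11 / identified pairs
    rec = n11 / len(pb) if pb else 0.0     # n11 / true pairs
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return j, f

j_val, f_val = pairwise_scores([{1, 2}, {3}], [{1, 2, 3}])
```

Building pair sets per cluster also makes the measures well defined for overlapping clusterings, as required for GStar's output.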
We also compared the results obtained by GStar-FKS under both models, to determine which model offers the best results for each measure. The performance of GStar-FKS on the document collections, considering the Jaccard index and the F-measure, is shown in Fig. 3.

Fig. 1. Behavior on the TREC-5 collection using FKS by VSM (A, C) and FKS by GADM (B, D), with the Jaccard index and the F-measure, respectively

Fig. 2. Behavior on the Reuters collection using FKS by VSM (A, C) and FKS by GADM (B, D), with the Jaccard index and the F-measure, respectively

As can be seen, FKS-GADM outperforms FKS-VSM in most cases on the TREC-5 collection, while the performance of the two FKS models is comparable on the Reuters collection, though in that corpus the best single value is obtained by FKS-GADM. Thus, FKS-GADM represents a 10.91% and a 7.27% average improvement over FKS-VSM on the AFP collection, under the Jaccard index and the F-measure respectively; in the case of the Reuters collection, FKS-VSM slightly outperforms FKS-GADM, by 1.34% and 1.19% under the two measures.

Fig. 3. Behavior on the AFP (A, B) and Reuters (C, D) collections with the Jaccard index and the F-measure, using FKS by VSM and FKS by GADM

5 Conclusions

In order to achieve better clustering performance with GStar, in this paper we have proposed the Filtered by Key Segment (FKS) vector, obtained from the whole document but considering only the terms appearing in a predefined key segment, as the document representation schema; and we have shown, with experimental results on two news collections, the improvement of our proposal over the original Vector Space Model and the original Global Association Distance Model.

References

1. Aslam, J., Pelekhov, K. and Rus, D.: Using Star Clusters for Filtering, In Proceedings of the Ninth International Conference on Information and Knowledge Management, USA, 2000.
2. Aslam, J., Pelekhov, K. and Rus, D.: The Star Clustering Algorithm for Static and Dynamic Information Organization, Journal of Graph Algorithms and Applications, Vol. 8, No. 1, pp. 95-129, 2004.
3. Baeza-Yates, R. and Ribeiro-Neto, B.: Modern Information Retrieval, ACM Press, Addison-Wesley, 1999.
4. Banerjee, A., Krumpelman, C., Basu, S., Mooney, R. and Ghosh, J.: Model-Based Overlapping Clustering, In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 2005.
5. Berry, M.: Survey of Text Mining: Clustering, Classification and Retrieval, Springer, 2004.
6. Duda, R., Hart, P. and Stork, D.: Pattern Classification, John Wiley & Sons Inc., 2001.
7. Gil-García, R. J., Badía-Contelles, J. M. and Pons-Porrata, A.: Extended Star Clustering Algorithm, In Proceedings of CIARP 03, LNCS 2905, pp. 480-487, 2003.

8. Pérez-Suárez, A. and Medina-Pagola, J. E.: A Clustering Algorithm based on Generalized Stars, In Proceedings of the 5th International Conference on Machine Learning and Data Mining (MLDM 2007), LNAI 4571, Leipzig, Germany, 2007.
9. Kuncheva, L. and Hadjitodorov, S.: Using Diversity in Cluster Ensembles, In Proceedings of IEEE SMC 2004, The Netherlands, 2004.
10. van Rijsbergen, C. J.: Information Retrieval, Butterworths, London, 2nd edition, 1979.
11. Medina-Pagola, J. E., Rodríguez, A. Y., Hechavarría, A. and Hernández-Palancar, J.: Document Representation using Global Association Distance Model, In Proceedings of ECIR 2007, LNCS 4425, pp. 565-572, 2007.
12. Salton, G.: The SMART Retrieval System - Experiments in Automatic Document Processing, Prentice-Hall, Englewood Cliffs, New Jersey, 1971.
13. Scanlan, C.: Writing from the Top Down: Pros and Cons of the Inverted Pyramid, Poynter Institute, retrieved July 27, 2006, URL: http://www.poynter.org/column.asp?id=52&aid=38693.
14. Schmid, H.: Probabilistic Part-Of-Speech Tagging Using Decision Trees, In: International Conference on New Methods in Language Processing, Manchester, UK, 1994.
15. Zhong, S. and Ghosh, J.: A Comparative Study of Generative Models for Document Clustering, In Proceedings of the SDM Workshop on Clustering High Dimensional Data and Its Applications, 2003.