Applying Key Segments Schema to Generalized Star Clustering

Abstract. The clustering process, like other text mining tasks, depends critically on a proper representation of documents. In this paper we propose the Filtered by Key Segment (FKS) vector, obtained from the whole document but considering only the terms appearing in a predefined key segment, as the document representation schema. The evaluation experiments show that our proposal, applied to the Vector Space Model and the Global Association Distance Model with the Generalized Star algorithm, outperforms the original models.

1 Introduction

Clustering is the process of grouping data into classes or clusters so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects; often, distance measures are used. Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning. Cluster analysis has been widely used in numerous applications, including pattern recognition, data analysis, image processing, and market research. Initially, document clustering was evaluated for improving the results of information retrieval systems [10, 2]. Clustering has been proposed as an efficient way of automatically finding related or new topics, in filtering tasks [1], and for grouping the retrieved documents into a list of meaningful categories, facilitating query processing by searching only the clusters closest to the query [15]. Several algorithms have been proposed for document clustering. One of them is Generalized Star (GStar), presented and evaluated by Pérez et al. in [8]. They introduced a new definition of star allowing a different star-shaped sub-graph; in this way, GStar retains the strengths of the previous algorithms while solving their drawbacks.
The experiments comparing GStar against the original Star [1] and Extended Star [7] algorithms, as well as against traditional clustering algorithms such as Single and Average Link [6], show that the Generalized Star outperforms those algorithms using the Vector Space Model (VSM) [12] as the document representation model. Nevertheless, we consider that representation models applied to whole documents, but filtered by a key segment, generally produce better performance than the original representation models. In this paper we evaluate the performance of GStar using the Vector Space Model, the Global Association Distance Model, and their respective variants filtered by a key segment, in order to verify this claim.
The basic outline of this paper is as follows. Section 2 is dedicated to the representation models considered in the experimentation phase. The Generalized Star algorithm is described in section 3. The experimental results are discussed in section 4, and the conclusions of the research and some ideas about future directions are presented in section 5.

2 Document representation

The clustering process, like any other text mining task, is carried out in two main stages: a pre-processing stage and a discovery stage. In the first stage, texts are transformed into a kind of structured or semi-structured representation, simpler and more amenable to automatic processing by computers. In the second stage, these representations are analyzed in order to discover interesting patterns, i.e. clusters. In the pre-processing stage a set of operations is performed to simplify and standardize the documents being analyzed. Some of these operations implement disambiguation methods and stemming processes, identifying concepts and syntagmatic structures. These concepts and structures can be organized in different forms but, in general, they are considered as groups or bags of words, usually structured using a vector space model [12]. In the vector space model, each document is a vector of terms. The values of these vectors can be taken as weights according to the term occurrences in the document or in the document collection, considering different interpretations [5]: Boolean, Term Frequency (Tf) and Term Frequency-Inverse Document Frequency (Tf-Idf).
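The Tf-Idf interpretation above can be sketched as follows. The toy corpus, the tokenization, and the particular Idf formula (logarithm of N over document frequency, with no smoothing) are illustrative assumptions of this sketch, not the exact weighting used in the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight the terms of each document by Tf-Idf: tf(t, d) * log(N / df(t)),
    where df(t) is the number of documents containing term t."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

# Toy collection of three tokenized "documents".
docs = [["star", "cluster", "star"], ["vector", "model"], ["star", "model"]]
vecs = tfidf_vectors(docs)
```

A term occurring in every document gets weight zero, while a frequent term confined to few documents is weighted highest, which is the usual rationale for Tf-Idf.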
These term vectors are used in the second stage, among other tasks, to analyze the similarities between documents, or groups of them, using different measures such as the cosine, applied to the angle between the vectors and defined as [5]:

sim(d_i, d_j) = cos(d_i, d_j) = (d_i · d_j) / (‖d_i‖ ‖d_j‖) = Σ_r w_ir w_jr / (√(Σ_r w_ir²) √(Σ_r w_jr²)),   (1)

where d_i, d_j are the vectors of documents i and j; ‖d_i‖, ‖d_j‖ the norms of the vectors; and w_ir, w_jr the term weights in the vectors d_i, d_j, respectively. Another representation model is the Global Association Distance Model (GADM) [11]. GADM can be defined as a vector space model (VSM) where each term is weighted by its global association strength (3). Nevertheless, in contradistinction to the original VSM, which measures the relevance of a term by the number of its occurrences in a document, GADM considers the co-occurrences (actually, the association strengths) amongst terms in sentences, paragraphs and so on. Thus, a document d can be modelled by a vector of global association strengths (2):

d = (g_t1, ..., g_tn),   (2)

where

g_tr = Σ_{t_s ∈ d} 1 / D_rs   (3)
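Equation (1) over sparse term-weight vectors can be sketched as follows; representing each vector as a dict mapping term to weight is an implementation choice of this sketch.

```python
import math

def cosine_sim(d_i, d_j):
    """Cosine similarity of equation (1) between two sparse term-weight
    vectors given as dicts {term: weight}."""
    dot = sum(w * d_j.get(t, 0.0) for t, w in d_i.items())
    norm_i = math.sqrt(sum(w * w for w in d_i.values()))
    norm_j = math.sqrt(sum(w * w for w in d_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0                   # empty vector: define similarity as 0
    return dot / (norm_i * norm_j)
```

Identical vectors yield similarity 1, and vectors with no shared terms yield 0, matching the usual behavior of the cosine measure.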
and the formal distance between these terms (D_rs) is defined as follows, considering the distance by paragraph, without ignoring the natural co-occurrence when terms appear in the same sentence, and taking (p_r, n_r), (p_s, n_s) as the paragraph and sentence numbers of terms t_r and t_s respectively:

D_rs = 1, if (r = s) ∨ [(p_r = p_s) ∧ (n_r = n_s)]; |p_r − p_s| + 2, otherwise.   (4)

Although flat structures are the simplest way of processing document collections, these linear models provide a limited means to measure similarities between semistructured documents. In a semistructured representation, it is not necessary to use all the information [3]. In some approaches, a predefined structure is considered and information is fed into the structure provided. In other approaches, documents are allowed to have specific structure types (such as trees or segments). A semistructured approach is not an odd way of representing documents. For example, in academic papers, authors are asked to write a few words that concisely describe their work (the title), a few paragraphs that outline their work (the abstract), a few pages that precisely describe the work (the body), and finally a summary of the work (the conclusion). In newsworthy information, the inverted pyramid style is generally used. In this style, the conclusion or summary of the news is moved up to the front of the article, putting the main idea into the first paragraph [13]. If we try to simplify the representation of a document by a single vector, perhaps we should choose between a whole-document vector and a vector from a key segment (for instance, the abstract in academic papers or the first paragraph in news), both as flat structures. Notice that in key segment vectors, terms usually appear only once, and term weights lose their relevance in a VSM or GADM model.
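Equations (3) and (4) can be sketched as follows. The position table and the simplification of one occurrence per term are illustrative assumptions of this sketch, not part of the GADM definition in [11].

```python
def formal_distance(r, s, pos):
    """Formal distance D_rs of equation (4). `pos` maps each term to its
    (paragraph, sentence) position."""
    if r == s or pos[r] == pos[s]:   # same term, or same paragraph and sentence
        return 1
    return abs(pos[r][0] - pos[s][0]) + 2

def global_strength(r, pos):
    """Global association strength g_tr of equation (3): the sum of 1/D_rs
    over every term t_s of the document."""
    return sum(1.0 / formal_distance(r, s, pos) for s in pos)

# Three terms: two in paragraph 1 (different sentences), one in paragraph 2.
pos = {"star": (1, 1), "cluster": (1, 2), "model": (2, 1)}
g_star = global_strength("star", pos)
```

Here "star" contributes 1 for itself, 1/2 for "cluster" (same paragraph, different sentence) and 1/3 for "model" (adjacent paragraph), so terms whose neighbors cluster near them receive larger strengths.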
We have been considering an alternative way of representing semistructured documents by a single vector, without losing the importance of the key segment but taking into account the relevance of the terms in the whole document (term weight). We have named it the vector Filtered by Key Segment. A vector Filtered by Key Segment (FKS) is a vector model (VSM or GADM) obtained from the whole document but considering only the terms appearing in a predefined key segment. This kind of vector could even be used in a structure weighting approach, where the vector associated with each segment is constructed from the whole document. Besides, it should be noticed that FKS vectors could reduce the dimensionality problem usually present in any document processing system. In FKS from VSM, the term weight is taken as the occurrence frequency in the whole document; in FKS from GADM, the term weight is taken as the global association strength, over the whole document, between the term and the rest of the terms of the key segment.
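A minimal sketch of an FKS vector under the VSM (Tf) interpretation; the tokenized abstract and body below are illustrative assumptions.

```python
from collections import Counter

def fks_vector(document_terms, key_segment_terms):
    """FKS from VSM: term frequencies are computed over the WHOLE document,
    but only terms that occur in the key segment are kept."""
    key = set(key_segment_terms)
    tf = Counter(document_terms)
    return {t: f for t, f in tf.items() if t in key}

# Key segment (e.g. an abstract or a first paragraph) plus the document body.
abstract = ["star", "model"]
body = ["star", "cluster", "graph", "star", "model"]
vector = fks_vector(abstract + body, abstract)
```

Terms outside the key segment ("cluster", "graph") are dropped, which is how FKS reduces dimensionality, while the kept weights still reflect whole-document frequencies.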
3 Generalized Star algorithm

The Generalized Star algorithm was proposed by Pérez et al. in [8] to solve the drawbacks present in its previous algorithms: the Star algorithm [1] and the Extended Star algorithm [7]. This algorithm represents the document collection by its thresholded similarity graph, finding overlapping dense sub-graphs. Let V = {d_1, ..., d_n} be a collection of documents and Sim(d_i, d_j) a (symmetric) similarity function between documents d_i and d_j; we call similarity graph an undirected and weighted graph G = (V, E, w), where vertices correspond to documents and each weighted edge corresponds to the similarity between two documents. Considering a similarity threshold σ defined by the user, we can define a thresholded graph G_σ as the undirected graph obtained from G by eliminating all the edges whose weights are lower than σ. The set of Weak Satellites (WeakSats) and the set of Potential Satellites (PotSats) of a vertex o are defined by expressions (5) and (6) respectively:

o.WeakSats = {s | s ∈ o.adj ∧ |o.adj| ≥ |s.adj|},   (5)

o.PotSats = {s | s ∈ o.adj ∧ |o.WeakSats| ≥ |s.WeakSats|}.   (6)

The WeakSats and PotSats degrees of a vertex o are defined as the number of vertices included in its sets of Weak Satellites and Potential Satellites respectively. Considering the aforementioned sets, a generalized star-shaped sub-graph of m + 1 vertices in G_σ consists of a single center c and m adjacent vertices, such that |c.PotSats| ≥ |v.PotSats| for all v ∈ c.PotSats. Starting from this definition and guaranteeing a full cover C of G_σ, the algorithm should satisfy the following post-conditions:

∀x ∈ V, x ∈ C ∨ x.adj ∩ C ≠ ∅,   (7)

∀c ∈ C, ∀v ∈ c.PotSats, |c.PotSats| ≥ |v.PotSats|.   (8)

The first condition (7) guarantees that each object of the collection belongs to at least one group, as a center or as a satellite. Besides, condition (8) indicates that all the centers satisfy the generalized star-shaped sub-graph definition.
The set of Necessary Satellites (NecSats) of o is the set of its adjacent vertices that could depend on o to be covered. This concept is needed only during the cluster generation. Initially, NecSats takes the value of PotSats, but it can shrink during the clustering process as more documents are covered by stars. Let C be the set of centers obtained by the algorithm; a center vertex c is considered redundant if it satisfies the following conditions:

1. ∃d ∈ c.adj ∩ C, i.e., vertex c has at least one adjacent center in its neighborhood.
2. ∀s ∈ c.PotSats, s ∈ C ∨ |s.adj ∩ C| > 1, i.e., each vertex s is a center itself or has more than one adjacent center (hence a neighboring center different from c) in its neighborhood.
Table 1. Pseudo-code of the Generalized Star Algorithm

Algorithm 1: GStar
Input: V = {d_1, d_2, ..., d_n} set of documents, σ similarity threshold
Output: SC set of clusters
1  forall vertex d_i ∈ V do
2      d_i.adj := {d_j | d_j ∈ V ∧ Sim(d_i, d_j) ≥ σ};
3  forall vertex d ∈ V do
4      d.WeakSats := {s | s ∈ d.adj ∧ |d.adj| ≥ |s.adj|};
5  forall vertex d ∈ V do
6      d.PotSats := {s | s ∈ d.adj ∧ |d.WeakSats| ≥ |s.WeakSats|};
7      d.NecSats := d.PotSats;
8  end
9  L := V;
10 C := ∅;
11 while L ≠ ∅ do
12     d := arg max{|d_i.PotSats| : d_i ∈ L};
13     Update(d, C, L);
14 end
15 Sort C in ascending order by |PotSats|;
16 SC := ∅;
17 forall center c ∈ C do
18     if c is redundant then C := C \ {c};
19     else SC := SC ∪ {{c} ∪ c.adj};

The Generalized Star algorithm is summarized in Table 1. The procedure Update (see Table 2) is applied to mark a vertex as a center, deleting it from L, and to update the set NecSats of each of its necessary satellites.

Table 2. Pseudo-code of the Update Procedure

Procedure Update
Input: d selected center, C set of cluster centers, L set of unprocessed vertices
Output: C, L
1  C := C ∪ {d};
2  L := L \ {d};
3  forall s ∈ d.NecSats do
4      s.NecSats := s.NecSats \ {d};
5      if s.NecSats = ∅ then L := L \ {s};
6      forall c ∈ s.adj \ {d} do
7          c.NecSats := c.NecSats \ {s};
8          if (c.NecSats = ∅) ∧ (c.adj ∩ C ≠ ∅) then L := L \ {c};
9      end
10 end
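The pseudo-code of Tables 1 and 2 can be condensed into a runnable sketch. The dictionary-based graph layout and the tiny similarity matrix in the example are assumptions of this sketch, not code from [8].

```python
def gstar(n, sim, sigma):
    """Sketch of GStar over n documents with a symmetric similarity `sim`."""
    adj = {i: {j for j in range(n) if j != i and sim(i, j) >= sigma}
           for i in range(n)}
    weak = {i: {s for s in adj[i] if len(adj[i]) >= len(adj[s])}
            for i in range(n)}                       # WeakSats, equation (5)
    pot = {i: {s for s in adj[i] if len(weak[i]) >= len(weak[s])}
           for i in range(n)}                        # PotSats, equation (6)
    nec = {i: set(pot[i]) for i in range(n)}         # NecSats starts as PotSats
    L, centers = set(range(n)), []
    while L:
        d = max(L, key=lambda i: len(pot[i]))        # highest PotSats degree
        centers.append(d)
        L.discard(d)
        for s in nec[d]:                             # Update procedure (Table 2)
            nec[s].discard(d)
            if not nec[s]:
                L.discard(s)
            for c in adj[s] - {d}:
                nec[c].discard(s)
                if not nec[c] and adj[c] & set(centers):
                    L.discard(c)
    # Remove redundant centers in ascending order of PotSats degree.
    C = set(centers)
    clusters = []
    for c in sorted(centers, key=lambda i: len(pot[i])):
        redundant = bool(adj[c] & (C - {c})) and all(
            s in C or len(adj[s] & C) > 1 for s in pot[c])
        if redundant:
            C.discard(c)
        else:
            clusters.append({c} | adj[c])
    return clusters

# Four documents: 0, 1, 2 mutually similar, 3 similar only to 2.
S = [[1.0, 0.8, 0.7, 0.1],
     [0.8, 1.0, 0.9, 0.1],
     [0.7, 0.9, 1.0, 0.6],
     [0.1, 0.1, 0.6, 1.0]]
clusters = gstar(4, lambda a, b: S[a][b], sigma=0.5)
```

In this example vertex 2 has the largest PotSats degree, is chosen as the single center, and covers all four documents, so one cluster {0, 1, 2, 3} is produced.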
The GStar method, like the original Star algorithm and the two versions of the Extended algorithm, generates clusters which can overlap, and it also guarantees that the pairwise similarity between satellite vertices in a generalized star-shaped sub-graph is high. Unlike its previous algorithms, the GStar algorithm cannot produce illogical clusters, because all the centers satisfy the generalized star-shaped sub-graph definition. The GStar algorithm does not produce uncovered vertices either; this property is ensured by the fulfillment of post-condition (7), and it avoids the generation of unnecessary clusters observed in the two versions of the Extended algorithm. Independence from the data order is a property that the Extended Star method certainly achieves. Nevertheless, as was indicated in [8], it is necessary only when that dependence affects the quality of the resulting clusters. Thus, the GStar algorithm solves the dependence on data order (for non-symmetric or similar solutions) observed in the Star algorithm. In [8] the GStar algorithm was compared with the original Star and Extended Star methods on two standard document collections considering the Jaccard index [9] and the F-measure [4]. The experimentation showed that GStar outperforms the previous methods considering the aforementioned measures and also considering the number and density of the generated clusters. These results proved the validity of GStar for clustering tasks.

4 Experimental results

In this section we present the experimental evaluation of GStar clustering using the Vector Space Model, the Global Association Distance Model, and their corresponding variants filtered by a key segment, proposed in section 2. The GStar clustering results are evaluated with the same method and criterion to ensure a fair comparison across these representation models. Two document collections widely used in document clustering research were used in the experiments: TREC-5 1 and Reuters-21578 2.
These collections are heterogeneous as to document size, cluster size, number of classes, and document distribution. The TREC-5 data set contains news items in Spanish published by AFP during 1994 and 1995; Reuters-21578 is a Reuters Ltd. news corpus in English. We excluded from the document collections the empty documents and also those documents that do not have an associated topic. The main characteristics of these collections are summarized in Table 3. We also included in this table the number of overlapping documents for each collection. All documents were preprocessed and lemmatized with TreeTagger [14], and in order to obtain the similarity amongst documents we used the traditional cosine measure. The FKS models were applied considering the first paragraph of each news item as the key segment.

1 http://trec.nist.gov/pubs/trec5
2 ftp://canberra.cs.umass.edu/pub/reuters
Table 3. Characteristics of document collections

Collect.   Doc.     Overlap. doc.   Topics   Lang.
AFP        695      16              25       Spanish
Reuters    10377    1722            119      English

The literature abounds in measures defined by multiple authors to compare two partitions of the same set. The most widely used are the Jaccard index and the F-measure.

Jaccard index. This index (denoted j) takes into account the objects simultaneously joined [9]. It is defined as follows:

j(A, B) = n_11 / (N(N − 1)/2 − n_00)   (9)

In this index, n_11 denotes the number of pairs of objects which are in the same cluster in A and are also in the same cluster in B. Similarly, n_00 is the number of pairs of objects which are in different clusters in A and are also in different clusters in B. The performances of the algorithm on the document collections considering the Jaccard index are shown in Fig. 1 (A) and (B) and in Fig. 2 (A) and (B) respectively.

F-measure. The aforementioned index and others are usually applied to partitions. In order to make a better evaluation of overlapping clustering, we have considered the F-measure calculated over pairs of points, as defined in [4]. Denoted F, this measure is the harmonic mean of precision and recall (10):

F = (2 · Precision · Recall) / (Precision + Recall),   (10)

Precision = n_11 / Number of identified pairs,   Recall = n_11 / Number of true pairs.

The performances of the algorithm on the document collections considering the F-measure are shown in Fig. 1 (C) and (D) and in Fig. 2 (C) and (D) respectively. As can be seen in Fig. 1 and Fig. 2, the accuracy obtained by GStar using FKS (GStar-FKS) from VSM is in most cases (for all the indexes) better than or comparable with that obtained by GStar using the original VSM, and the accuracy obtained by GStar-FKS from GADM is in all cases better than or similar to that obtained with the original GADM. Note that the best performance of GStar is obtained with the GADM filtered by a key segment for both document collections.
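For hard (non-overlapping) labelings, equations (9) and (10) can be sketched as follows. The label-list representation is an assumption of this sketch; the paper applies the pairwise F-measure to overlapping clusterings, which would need a richer representation.

```python
from itertools import combinations

def pair_counts(a, b):
    """n11: pairs of objects placed together in both labelings a and b;
    n00: pairs separated in both."""
    n11 = n00 = 0
    for i, j in combinations(range(len(a)), 2):
        same_a, same_b = a[i] == a[j], b[i] == b[j]
        n11 += same_a and same_b
        n00 += not same_a and not same_b
    return n11, n00

def jaccard_index(a, b):
    """Equation (9): j(A, B) = n11 / (N(N-1)/2 - n00)."""
    n11, n00 = pair_counts(a, b)
    return n11 / (len(a) * (len(a) - 1) // 2 - n00)

def pairwise_f(pred, truth):
    """Equation (10): harmonic mean of pairwise precision and recall."""
    n11, _ = pair_counts(pred, truth)
    identified = sum(p == q for p, q in combinations(pred, 2))   # pairs joined in pred
    true_pairs = sum(p == q for p, q in combinations(truth, 2))  # pairs joined in truth
    precision, recall = n11 / identified, n11 / true_pairs
    return 2 * precision * recall / (precision + recall)

pred, truth = [0, 0, 1, 1], [0, 0, 0, 1]
```

On this toy example n_11 = 1 and n_00 = 2, giving j = 1/4, precision 1/2, recall 1/3 and F = 0.4.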
We also compared the results obtained by GStar-FKS under both models in order to determine which model offers the best results for each measure. The performances of GStar-FKS on the document collections considering the Jaccard index and the F-measure are shown in Fig. 3. As we can see, FKS-GADM outperforms FKS-VSM in most cases in the TREC-5 collection, and the performance is comparable in both FKS models for
Fig. 1. Behavior in the TREC-5 collection using FKS by VSM (A, C) and FKS by GADM (B, D) with the Jaccard index and the F-measure, respectively.

Fig. 2. Behavior in the Reuters collection using FKS by VSM (A, C) and FKS by GADM (B, D) with the Jaccard index and the F-measure, respectively.

the Reuters collection, though in this corpus the best performance value is obtained with FKS-GADM. Thus, FKS-GADM represents a 10.91% and a 7.27% average improvement in performance compared to FKS-VSM in the AFP collection considering the Jaccard index and the F-measure respectively; in the case of the Reuters collection, FKS-VSM slightly outperforms FKS-GADM, by 1.34% and 1.19% considering both measures.
Fig. 3. Behavior in the AFP (A, B) and Reuters (C, D) collections with the Jaccard index and the F-measure using FKS by VSM and FKS by GADM.

5 Conclusions

In order to achieve a better clustering performance with GStar, in this paper we have proposed the Filtered by Key Segment (FKS) vector, obtained from the whole document but considering only the terms appearing in a predefined key segment, as the document representation schema; and we have shown, with the experimental results on two collections of news items, the improvement of our proposal with respect to the original Vector Space Model and the original Global Association Distance Model.

References

1. Aslam, J., Pelekhov, K. and Rus, D.: Using Star Clusters for Filtering, In Proceedings of the Ninth International Conference on Information and Knowledge Management, USA, 2000.
2. Aslam, J., Pelekhov, K. and Rus, D.: The Star Clustering Algorithm for Static and Dynamic Information Organization, Journal of Graph Algorithms and Applications, Vol. 8, No. 1, pp. 95-129, 2004.
3. Baeza-Yates, R. and Ribeiro-Neto, B.: Modern Information Retrieval, ACM Press, Addison-Wesley, 1999.
4. Banerjee, A., Krumpelman, C., Basu, S., Mooney, R. and Ghosh, J.: Model-Based Overlapping Clustering, In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 2005.
5. Berry, M.: Survey of Text Mining: Clustering, Classification and Retrieval, Springer, 2004.
6. Duda, R., Hart, P. and Stork, D.: Pattern Classification, John Wiley & Sons Inc., 2001.
7. Gil-García, R. J., Badía-Contelles, J. M. and Pons-Porrata, A.: Extended Star Clustering Algorithm, In Proceedings of CIARP 2003, LNCS 2905, pp. 480-487, 2003.
8. Pérez-Suárez, A. and Medina-Pagola, J. E.: A Clustering Algorithm based on Generalized Stars, In Proceedings of the 5th International Conference on Machine Learning and Data Mining (MLDM 2007), LNAI 4571, Leipzig, Germany, 2007.
9. Kuncheva, L. and Hadjitodorov, S.: Using Diversity in Cluster Ensembles, In Proceedings of IEEE SMC 2004, The Netherlands, 2004.
10. van Rijsbergen, C. J.: Information Retrieval, Butterworths, London, 2nd edition, 1979.
11. Medina-Pagola, J. E., Rodríguez, A. Y., Hechavarría, A. and Hernández Palancar, J.: Document Representation using Global Association Distance Model, In Proceedings of ECIR 2007, LNCS 4425, pp. 565-572, 2007.
12. Salton, G.: The SMART Retrieval System - Experiments in Automatic Document Processing, Prentice-Hall, Englewood Cliffs, New Jersey, 1971.
13. Scanlan, C.: Writing from the Top Down: Pros and Cons of the Inverted Pyramid, Poynter Institute, Retrieved on July 27, 2006, URL: http://www.poynter.org/column.asp?id=52&aid=38693.
14. Schmid, H.: Probabilistic Part-of-Speech Tagging Using Decision Trees, In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK, 1994.
15. Zhong, S. and Ghosh, J.: A Comparative Study of Generative Models for Document Clustering, In Proceedings of the SDM Workshop on Clustering High Dimensional Data and Its Applications, 2003.