Applying Key Segments Schema to Generalized Star Clustering
- Stephany Chase
- 6 years ago
Abstract. The clustering process, like other text mining tasks, depends critically on the proper representation of documents. In this paper we propose a Filtered by Key Segment (FKS) vector, obtained from the whole document but considering only the terms appearing in a predefined key segment, as the document representation schema. The evaluation experiments show that our proposal, applied to the Vector Space Model and the Global Association Distance Model with the Generalized Star algorithm, outperforms the original models.

1 Introduction

Clustering is the process of grouping data into classes or clusters so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. Dissimilarities are assessed from the attribute values describing the objects, often by means of distance measures. Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning. Cluster analysis has been widely used in numerous applications, including pattern recognition, data analysis, image processing, and market research.

Initially, document clustering was evaluated as a way of improving the results of information retrieval systems [10, 2]. Clustering has also been proposed as an efficient way of automatically finding related or new topics in filtering tasks [1], and of grouping retrieved documents into a list of meaningful categories, which facilitates query processing by searching only the clusters closest to the query [15].

Several algorithms have been proposed for document clustering. One of them is Generalized Star (GStar), presented and evaluated by Pérez et al. in [8]. They introduced a new definition of star that allows a different star-shaped sub-graph; in this way, GStar retains the strengths of the previous algorithms while solving their drawbacks.
The experiments comparing GStar against the original Star [1] and Extended Star [7] algorithms, and against other traditional clustering algorithms such as Single Link and Average Link [6], show that Generalized Star outperforms those algorithms when the Vector Space Model (VSM) [12] is used as the document representation model. Nevertheless, we consider that representation models applied to whole documents but filtered by a key segment produce, in general, better performance than the original representation models. In this paper we evaluate the performance of GStar using the Vector Space Model, the Global Association Distance Model, and their respective variants filtered by a key segment, in order to test this hypothesis.
The rest of this paper is organized as follows. Section 2 describes the representation models considered in the experimentation phase. The Generalized Star algorithm is described in Section 3. The experimental results are discussed in Section 4, and the conclusions of the research, together with some ideas about future directions, are presented in Section 5.

2 Document representation

The clustering process, like any other text mining task, is carried out in two main stages: a pre-processing stage and a discovery stage. In the first stage, texts are transformed into a structured or semi-structured representation that is simpler and more useful for automatic processing. In the second stage, these representations are analyzed in order to discover interesting patterns, i.e. clusters.

In the pre-processing stage, a set of operations is performed to simplify and standardize the documents being analyzed. Some of these operations implement disambiguation methods and stemming processes, identifying concepts and syntagmatic structures. These concepts and structures can be organized in different forms but, in general, they are treated as groups or bags of words, usually structured using a vector space model [12]. In the vector space model, each document is a vector of terms. The values of these vectors are weights assigned according to the term occurrences in the document or in the document collection, under one of several interpretations [5]: Boolean, Term Frequency (Tf) and Term Frequency-Inverse Document Frequency (Tf-Idf).
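The three weighting interpretations named above can be illustrated with a minimal sketch. The function name `term_weights`, the toy documents, and the particular Idf variant (natural logarithm of N divided by the document frequency) are assumptions made for this example; the paper does not state which Tf-Idf formula it uses.

```python
import math
from collections import Counter

def term_weights(docs, scheme="tfidf"):
    """Return one {term: weight} dict per tokenized document.

    `scheme` is "boolean", "tf" or "tfidf". The idf used here,
    log(N / df), is one common variant and is an assumption.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        if scheme == "boolean":
            vec = {t: 1.0 for t in tf}
        elif scheme == "tf":
            vec = {t: float(c) for t, c in tf.items()}
        elif scheme == "tfidf":
            vec = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vectors

# Hypothetical toy collection of two already-tokenized documents.
docs = [["star", "cluster", "star"], ["cluster", "graph"]]
tf_vectors = term_weights(docs, "tf")
```

With this Idf variant, a term that occurs in every document (here "cluster") receives weight zero, which is one reason the choice of weighting matters for the similarity computed later.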
These vectors of terms are used in the second stage, among other tasks, to analyze the similarities between documents, or groups of them, using measures such as the cosine of the angle between the vectors, defined as [5]:

$$\mathrm{sim}(d_i, d_j) = \cos(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\|\,\|d_j\|} = \frac{\sum_r w_{ir}\, w_{jr}}{\sqrt{\sum_r w_{ir}^2}\,\sqrt{\sum_r w_{jr}^2}} \tag{1}$$

where $d_i$, $d_j$ are the vectors of documents $i$, $j$; $\|d_i\|$, $\|d_j\|$ are the norms of the vectors; and $w_{ir}$, $w_{jr}$ are the term weights in the vectors $d_i$, $d_j$, respectively.

Another representation model is the Global Association Distance Model (GADM) [11]. GADM can be defined as a vector space model (VSM) where each term is weighted by its global association strength (3). Nevertheless, in contrast to the original VSM, which measures term relevance by the number of occurrences in a document, GADM considers the co-occurrences (actually, the association strengths) among terms in sentences, paragraphs and so on. So, a document $d$ can be modelled by a vector of global association strengths (2):

$$d = (g_{t_1}, \ldots, g_{t_n}) \tag{2}$$

where

$$g_{t_r} = \sum_{t_s \in d} \frac{1}{D_{rs}} \tag{3}$$
and the formal distance between these terms ($D_{rs}$) is defined as follows, considering the distance by paragraph without ignoring the natural co-occurrence of terms that appear in the same sentence. Let $(p_r, n_r)$ and $(p_s, n_s)$ be the paragraph and sentence numbers of terms $t_r$ and $t_s$ respectively:

$$D_{rs} = \begin{cases} 1 & (r = s) \lor [(p_r = p_s) \land (n_r = n_s)] \\ |p_r - p_s| + 2 & \text{otherwise} \end{cases} \tag{4}$$

Although flat structures are the simplest way of processing document collections, this linear model provides only a limited means of measuring similarities between semistructured documents. In a semistructured representation, it is not necessary to use all the information [3]. In some approaches, a predefined structure is considered and information is fed into it. In other approaches, documents are allowed to have specific structure types (such as trees or segments).

A semistructured approach is not an unusual way of representing documents. For example, in academic papers, authors are asked to write a few words that concisely describe their work (the title), a few paragraphs that outline it (the abstract), a few pages that describe it precisely (the body), and finally a summary (the conclusion). News items generally follow the inverted pyramid style, in which the conclusion or summary of the news is moved to the front of the article, putting the main idea into the first paragraph [13].

If we try to simplify the representation of a document to a single vector, we could choose between a whole-document vector and a vector built from a key segment (for instance, the abstract in academic papers or the first paragraph in news), both as flat structures. Notice that in key segment vectors terms usually appear only once, so term weights lose their relevance in a VSM or GADM model.
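The global association strengths of equations (2) to (4) can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes that the sum in (3) runs over every term occurrence in the document, that all occurrences of the same term are accumulated into one weight, and that the condition r = s in (4) refers to the same occurrence. The function names and the input format (a list of `(term, paragraph, sentence)` triples) are hypothetical.

```python
from collections import defaultdict

def distance(occ_r, occ_s):
    """Formal distance of equation (4) between two term occurrences.

    Each occurrence is (index, paragraph_no, sentence_no); `index`
    identifies the occurrence so that r = s can be tested (assumption).
    """
    (idx_r, p_r, n_r), (idx_s, p_s, n_s) = occ_r, occ_s
    if idx_r == idx_s or (p_r == p_s and n_r == n_s):
        return 1
    return abs(p_r - p_s) + 2

def gadm_vector(occurrences):
    """Global association strength per term, following equation (3).

    `occurrences` is a list of (term, paragraph_no, sentence_no) in
    reading order; strengths of repeated terms are summed (assumption).
    """
    strengths = defaultdict(float)
    for i, (term_r, p_r, n_r) in enumerate(occurrences):
        for j, (_, p_s, n_s) in enumerate(occurrences):
            strengths[term_r] += 1.0 / distance((i, p_r, n_r), (j, p_s, n_s))
    return dict(strengths)

# Hypothetical document: two paragraphs, three sentences in total.
occurrences = [
    ("star", 1, 1), ("cluster", 1, 1),  # paragraph 1, sentence 1
    ("graph", 1, 2),                    # paragraph 1, sentence 2
    ("star", 2, 1),                     # paragraph 2, sentence 1
]
gadm = gadm_vector(occurrences)
```

The double loop is quadratic in the number of occurrences, which is acceptable for a sketch; a practical version would group occurrences by paragraph and sentence first.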
We have been considering an alternative way of representing semistructured documents by a single vector, without losing the importance of the key segment but taking into account the relevance of the terms in the whole document (the term weight). We have named it the vector Filtered by Key Segment. A vector Filtered by Key Segment (FKS) is a vector model (VSM or GADM) obtained from the whole document but considering only the terms appearing in a predefined key segment. This kind of vector could even be used in a structure weighting approach, where the vector associated with each segment is constructed from the whole document. Besides, FKS vectors could reduce the dimensionality problem that usually arises in any document processing system. In FKS from VSM, the term weight is the occurrence frequency in the whole document; in FKS from GADM, the term weight is the global association strength, in the whole document, between the term and the rest of the terms of the key segment.
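As a minimal sketch of the FKS idea for the VSM case, the code below keeps only the terms of the key segment and weights them by their frequency in the whole document, then compares two such vectors with the cosine of equation (1). The function names and the toy tokens are hypothetical, and plain term frequency is assumed as the weight.

```python
import math
from collections import Counter

def fks_vsm_vector(document_tokens, key_segment_tokens):
    """Whole-document term frequencies, restricted to key-segment terms."""
    key_terms = set(key_segment_tokens)
    tf = Counter(document_tokens)
    return {t: float(tf[t]) for t in key_terms if tf[t] > 0}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty vectors (assumption)
    return dot / (norm_u * norm_v)

# Hypothetical news item: the key segment is its first paragraph,
# so the key-segment tokens are also part of the whole document.
key_segment = ["star", "cluster"]
document = key_segment + ["graph", "star", "cluster", "star"]
fks = fks_vsm_vector(document, key_segment)
```

Here "graph" is dropped because it does not occur in the key segment, while "star" keeps its whole-document frequency of 3 rather than the single occurrence it has inside the key segment.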
3 Generalized Star algorithm

The Generalized Star algorithm was proposed by Pérez et al. in [8] to solve the drawbacks of the previous algorithms: the Star algorithm [1] and the Extended Star algorithm [7]. This algorithm represents the document collection by its thresholded similarity graph, finding overlapping dense sub-graphs.

Let $V = \{d_1, \ldots, d_n\}$ be a collection of documents and $Sim(d_i, d_j)$ a symmetric similarity function between documents $d_i$ and $d_j$. We call similarity graph an undirected and weighted graph $G = (V, E, w)$, where vertices correspond to documents and each weighted edge corresponds to the similarity between two documents. Given a similarity threshold $\sigma$ defined by the user, the thresholded graph $G_\sigma$ is the undirected graph obtained from $G$ by eliminating all the edges whose weights are lower than $\sigma$.

Let $o.Adj$ denote the set of vertices adjacent to a vertex $o$ in $G_\sigma$. The set of Weak Satellites (WeakSats) and the set of Potential Satellites (PotSats) of $o$ are defined by expressions (5) and (6) respectively:

$$o.WeakSats = \{\, s \mid s \in o.Adj \land |o.Adj| \geq |s.Adj| \,\} \tag{5}$$

$$o.PotSats = \{\, s \mid s \in o.Adj \land |o.WeakSats| \geq |s.WeakSats| \,\} \tag{6}$$

The WeakSats and PotSats degrees of a vertex $o$ are defined as the number of vertices included in its sets of Weak Satellites and Potential Satellites respectively. Considering these sets, a generalized star-shaped sub-graph of $m + 1$ vertices in $G_\sigma$ consists of a single center $c$ and $m$ adjacent vertices, such that $|c.PotSats| \geq |v.PotSats|$ for all $v \in c.PotSats$.

Starting from this definition and guaranteeing a full cover $C$ of $G_\sigma$, the algorithm should satisfy the following post-conditions:

$$\forall x \in V,\; x \in C \lor x.Adj \cap C \neq \emptyset \tag{7}$$

$$\forall c \in C,\; \forall v \in c.PotSats,\; |c.PotSats| \geq |v.PotSats| \tag{8}$$

Condition (7) guarantees that each object of the collection belongs to at least one group, as a center or as a satellite. Condition (8) indicates that all the centers satisfy the generalized star-shaped sub-graph definition.
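The construction of the thresholded graph and of the satellite sets in expressions (5) and (6) can be sketched as follows. The function names and the toy similarity matrix are hypothetical, and self-loops are excluded from the adjacency sets, which is an assumption: the definitions above do not say whether a document counts as adjacent to itself.

```python
def thresholded_adjacency(sim, sigma):
    """Adjacency sets of G_sigma from a symmetric similarity matrix.

    An edge is kept when its weight is at least sigma; a vertex is not
    considered adjacent to itself (assumption).
    """
    n = len(sim)
    return {i: {j for j in range(n) if j != i and sim[i][j] >= sigma}
            for i in range(n)}

def satellite_sets(adj):
    """WeakSats and PotSats of every vertex, following (5) and (6)."""
    weak = {o: {s for s in adj[o] if len(adj[o]) >= len(adj[s])}
            for o in adj}
    pot = {o: {s for s in adj[o] if len(weak[o]) >= len(weak[s])}
           for o in adj}
    return weak, pot

# Hypothetical similarities among four documents.
sim = [
    [1.0, 0.8, 0.7, 0.6],
    [0.8, 1.0, 0.5, 0.1],
    [0.7, 0.5, 1.0, 0.2],
    [0.6, 0.1, 0.2, 1.0],
]
adj = thresholded_adjacency(sim, sigma=0.5)
weak, pot = satellite_sets(adj)
```

In this toy graph, document 0 is adjacent to all the others and keeps them all as potential satellites, while document 3, whose only neighbour has a higher degree, has no satellites at all.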
The set of Necessary Satellites (NecSats) of $o$ is the set of its adjacent vertices that could depend on $o$ to be covered. This concept is necessary only during cluster generation. Initially, NecSats takes the value of PotSats, but it can shrink during the clustering process as more documents are covered by stars.

Let $C$ be the set of centers obtained by the algorithm. A center vertex $c$ is considered redundant if it satisfies the following conditions:

1. $c.Adj \cap C \neq \emptyset$, i.e. vertex $c$ has at least one adjacent center in its neighborhood.
2. $\forall s \in c.PotSats,\; s \in C \lor |s.Adj \cap C| > 1$, i.e. vertex $s$ has more than one adjacent center (a neighboring center different from $c$) in its neighborhood, or vertex $s$ is itself a center.
Table 1. Pseudo-code of Generalized Star Algorithm

Algorithm 1: GStar
Input: V = {d_1, d_2, ..., d_n} - set of documents, σ - similarity threshold
Output: SC - set of clusters

    forall vertex d_i ∈ V do
        d_i.Adj := {d_j | d_j ∈ V ∧ Sim(d_i, d_j) ≥ σ};
    forall vertex d ∈ V do
        d.WeakSats := {s | s ∈ d.Adj ∧ |d.Adj| ≥ |s.Adj|};
    forall vertex d ∈ V do
        d.PotSats := {s | s ∈ d.Adj ∧ |d.WeakSats| ≥ |s.WeakSats|};
        d.NecSats := d.PotSats;
    end
    L := V;
    C := ∅;
    while L ≠ ∅ do
        d := arg max{|d_i.PotSats|, d_i ∈ L};
        Update(d, C, L);
    end
    Sort C in ascending order by |PotSats|;
    SC := ∅;
    forall center c ∈ C do
        if c is redundant then C := C \ {c};
        else SC := SC ∪ {{c} ∪ c.Adj};

The Generalized Star algorithm is summarized in Table 1. The procedure Update (see Table 2) is applied to mark a vertex as a center, deleting it from L, and to update the set NecSats of each of its necessary satellites.

Table 2. Pseudo-code of Update

Procedure Update
Input: d - selected center, C - set of cluster centers, L - set of unprocessed vertices
Output: C, L

    C := C ∪ {d};
    L := L \ {d};
    forall s ∈ d.NecSats do
        s.NecSats := s.NecSats \ {d};
        if s.NecSats = ∅ then L := L \ {s};
        forall c ∈ s.Adj \ {d} do
            c.NecSats := c.NecSats \ {s};
            if (c.NecSats = ∅) ∧ (c.Adj ∩ C ≠ ∅) then L := L \ {c};
        end
    end
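The two procedures in Tables 1 and 2 can be put together in a compact sketch. It is a reading of the pseudo-code under stated assumptions rather than the authors' implementation: the graph is given directly as adjacency sets without self-loops, ties in the arg max are broken by vertex order (the tables do not say how), and the redundancy test is evaluated against the centers that remain at that moment. The function name `gstar` and the example graph are hypothetical.

```python
def gstar(adj):
    """Cluster a thresholded graph given as {vertex: set_of_neighbours}."""
    weak = {o: {s for s in adj[o] if len(adj[o]) >= len(adj[s])} for o in adj}
    pot = {o: {s for s in adj[o] if len(weak[o]) >= len(weak[s])} for o in adj}
    nec = {o: set(pot[o]) for o in adj}   # NecSats starts as PotSats

    unprocessed = set(adj)                # L in Table 1
    centers = []                          # C in Table 1, in selection order

    def update(d):
        """Procedure Update of Table 2."""
        centers.append(d)
        unprocessed.discard(d)
        center_set = set(centers)
        for s in list(nec[d]):
            nec[s].discard(d)
            if not nec[s]:
                unprocessed.discard(s)
            for c in adj[s] - {d}:
                nec[c].discard(s)
                if not nec[c] and adj[c] & center_set:
                    unprocessed.discard(c)

    while unprocessed:
        # Tie-breaking by vertex order is an assumption of this sketch.
        d = max(sorted(unprocessed), key=lambda v: len(pot[v]))
        update(d)

    # Remove redundant centers, visiting them by ascending |PotSats|.
    center_set = set(centers)
    clusters = []
    for c in sorted(centers, key=lambda v: len(pot[v])):
        has_adjacent_center = bool(adj[c] & center_set)
        satellites_covered = all(
            s in center_set or len(adj[s] & center_set) > 1 for s in pot[c]
        )
        if has_adjacent_center and satellites_covered:
            center_set.discard(c)
        else:
            clusters.append({c} | adj[c])
    return clusters

# Hypothetical thresholded graph with two star-like regions sharing vertex 3.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {3, 5}, 5: {4}}
clusters = gstar(graph)
```

On this example the sketch selects vertices 0 and 4 as centers, and vertex 3 ends up in both clusters, which illustrates the overlapping behaviour of the method.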
The GStar method, like the original Star algorithm and the two versions of the Extended Star algorithm, generates clusters that may overlap, and it also guarantees that the pairwise similarity between satellite vertices in a generalized star-shaped sub-graph is high. Unlike the previous algorithms, GStar cannot produce illogical clusters, because all the centers satisfy the generalized star-shaped sub-graph definition. GStar does not leave vertices uncovered either (this property is ensured by the fulfillment of post-condition (7)), and it avoids the generation of unnecessary clusters that occurs in the two versions of the Extended algorithm.

The dependence on data order is a property that the Extended Star method certainly solves. Nevertheless, as indicated in [8], solving it is necessary only when that dependence affects the quality of the resulting clusters. Thus, the GStar algorithm solves the dependence on data order (for non-symmetric or similar solutions) observed in the Star algorithm.

In [8] the GStar algorithm was compared with the original Star and the Extended Star methods on two standard document collections, considering the Jaccard index [9] and the F-measure [4]. The experiments showed that GStar outperforms the previous methods on these measures and also with respect to the number and density of the generated clusters. These results proved the validity of GStar for clustering tasks.

4 Experimental results

In this section we present the experimental evaluation of GStar clustering using the Vector Space Model, the Global Association Distance Model, and their corresponding variants filtered by a key segment, proposed in Section 2. The GStar clustering results are evaluated with the same method and criteria to ensure a fair comparison across these representation models.
Two document collections widely used in document clustering research were used in the experiments: TREC-5 and Reuters (ftp://canberra.cs.umass.edu/pub/reuters). They are heterogeneous with respect to document size, cluster size, number of classes, and document distribution. The TREC-5 data set contains news items in Spanish published by AFP during [...]; Reuters is a Reuters Ltd. news corpus in English. We excluded from the collections the empty documents and the documents that do not have an associated topic. The main characteristics of these collections are summarized in Table 3, which also includes the number of overlapping documents for each collection. All documents were preprocessed and lemmatized with TreeTagger [14], and the similarity among documents was obtained with the traditional cosine measure. The FKS models were applied considering the first paragraph of each news item as the key segment.
Table 3. Characteristics of document collections

Collection   Documents   Overlapping documents   Topics   Language
AFP                                                       Spanish
Reuters                                                   English

The literature abounds in measures defined by multiple authors to compare two partitions of the same set. The most widely used are the Jaccard index and the F-measure.

Jaccard index. This index (denoted $j$) takes into account the objects simultaneously joined [9]. It is defined as follows:

$$j(A, B) = \frac{n_{11}}{\frac{N(N-1)}{2} - n_{00}} \tag{9}$$

In this index, $n_{11}$ denotes the number of pairs of objects that are in the same cluster in $A$ and also in the same cluster in $B$. Similarly, $n_{00}$ is the number of pairs of objects that are in different clusters in $A$ and also in different clusters in $B$. The performance of the algorithm on the document collections considering the Jaccard index is shown in Fig. 1 (A) and (B) and in Fig. 2 (A) and (B) respectively.

F-measure. The aforementioned index and others are usually applied to partitions. In order to better evaluate overlapping clusterings, we have considered the F-measure calculated over pairs of points, as defined in [4]. Denoted as $F$, this measure is the harmonic mean of precision and recall (10):

$$F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{10}$$

$$Precision = \frac{n_{11}}{\text{Number of identified pairs}}, \qquad Recall = \frac{n_{11}}{\text{Number of true pairs}}$$

The performance of the algorithm on the document collections considering the F-measure is shown in Fig. 1 (C) and (D) and in Fig. 2 (C) and (D) respectively.

As can be seen in Fig. 1 and Fig. 2, the accuracy obtained by GStar using FKS (GStar-FKS) from VSM is in most cases (for all the indexes) better than or comparable to that obtained by GStar using the original VSM, and the accuracy obtained by GStar-FKS from GADM is in all cases better than or similar to that obtained with the original GADM. Note that the best performance of GStar is obtained with the GADM filtered by a key segment for both document collections.
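The pair-counting measures of equations (9) and (10) can be sketched as follows. For overlapping clusterings, two objects are treated here as being "in the same cluster" when they share at least one cluster; this reading, the function names, and the toy clusterings are assumptions made for the example.

```python
from itertools import combinations

def co_clustered_pairs(clusters):
    """Unordered pairs of objects that share at least one cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs

def pairwise_scores(found, truth, n_objects):
    """Jaccard index (9) and pairwise precision, recall and F-measure (10)."""
    identified = co_clustered_pairs(found)
    true_pairs = co_clustered_pairs(truth)
    total = n_objects * (n_objects - 1) // 2
    n11 = len(identified & true_pairs)          # joined in both clusterings
    n00 = total - len(identified | true_pairs)  # separated in both
    jaccard = n11 / (total - n00) if total > n00 else 0.0
    precision = n11 / len(identified) if identified else 0.0
    recall = n11 / len(true_pairs) if true_pairs else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"jaccard": jaccard, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical result with one overlapping document (3) against a
# hypothetical reference grouping of six documents.
found = [{0, 1, 2, 3}, {3, 4, 5}]
truth = [{0, 1, 2}, {3, 4, 5}]
scores = pairwise_scores(found, truth, n_objects=6)
```

In this example all six true pairs are recovered, so recall is 1, while the three extra pairs introduced by placing document 3 in the first cluster lower both precision and the Jaccard index to 2/3.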
Fig. 1. Behavior in the TREC-5 collection using FKS by VSM (A, C) and FKS by GADM (B, D), with the Jaccard index and the F-measure respectively.

Fig. 2. Behavior in the Reuters collection using FKS by VSM (A, C) and FKS by GADM (B, D), with the Jaccard index and the F-measure respectively.

We also compared the results obtained by GStar-FKS with both models to determine which model offers the best results for each measure. The performance of GStar-FKS on the document collections considering the Jaccard index and the F-measure is shown in Fig. 3. As we can see, FKS-GADM outperforms FKS-VSM in most cases on the TREC-5 collection, and the performance of both FKS models is comparable for the Reuters collection, though in this corpus the best performance value is obtained for FKS-GADM. Thus, FKS-GADM represents an average improvement of 10.91% and 7.27% over FKS-VSM on the AFP collection considering the Jaccard index and the F-measure respectively; in the case of the Reuters collection, FKS-VSM slightly outperforms FKS-GADM, by 1.34% and 1.19% for the two measures.
Fig. 3. Behavior in the AFP (A, B) and Reuters (C, D) collections with the Jaccard index and the F-measure using FKS by VSM and FKS by GADM.

5 Conclusions

In order to achieve a better clustering performance with GStar, in this paper we have proposed a Filtered by Key Segment (FKS) vector, obtained from the whole document but considering only the terms appearing in a predefined key segment, as the document representation schema. The experimental results on two collections of news items show the improvement of our proposal over the original Vector Space Model and the original Global Association Distance Model.

References

1. Aslam, J., Pelekhov, K. and Rus, D.: Using Star Clusters for Filtering. In Proceedings of the Ninth International Conference on Information and Knowledge Management, USA.
2. Aslam, J., Pelekhov, K. and Rus, D.: The Star Clustering Algorithm for Static and Dynamic Information Organization. Journal of Graph Algorithms and Applications, Vol. 8, No. 1.
3. Baeza-Yates, R. and Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press, Addison-Wesley.
4. Banerjee, A., Krumpelman, C., Basu, S., Mooney, R. and Ghosh, J.: Model Based Overlapping Clustering. In Proceedings of International Conference on Knowledge Discovery and Data Mining (KDD).
5. Berry, M.: Survey of Text Mining, Clustering, Classification and Retrieval. Springer.
6. Duda, R., Hart, P. and Stork, D.: Pattern Classification. John Wiley & Sons Inc.
7. Gil-García, R. J., Badía-Contelles, J. M. and Pons-Porrata, A.: Extended Star Clustering Algorithm. In Proceedings of CIARP 03, LNCS 2905, 2003.
8. Pérez-Suárez, A. and Medina-Pagola, J. E.: A Clustering Algorithm based on Generalized Stars. In Proceedings of the 5th International Conference on Machine Learning and Data Mining (MLDM2007), LNAI 4571, Leipzig, Germany.
9. Kuncheva, L. and Hadjitodorov, S.: Using Diversity in Cluster Ensembles. In Proceedings of IEEE SMC 2004, The Netherlands.
10. van Rijsbergen, C. J.: Information Retrieval. Butterworths, London, 2nd edition.
11. Medina Pagola, J. E., Rodríguez, A. Y., Hechavarría, A. and Hernández Palancar, J.: Document Representation using Global Association Distance Model. In Proc. of ECIR 2007, LNCS 4425.
12. Salton, G.: The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, New Jersey.
13. Scanlan, C.: Writing from the Top Down: Pros and Cons of the Inverted Pyramid. Poynter Institute. Retrieved on July 27, 2006.
14. Schmid, H.: Probabilistic Part-Of-Speech Tagging Using Decision Tree. In: International Conference on New Methods in Language Processing, Manchester, UK (1994).
15. Zhong, S. and Ghosh, J.: A Comparative Study of Generative Models for Document Clustering. In Proceedings of SDM Workshop on Clustering High Dimensional Data and Its Applications, 2003.
Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval
More informationA Fast Approximated k Median Algorithm
A Fast Approximated k Median Algorithm Eva Gómez Ballester, Luisa Micó, Jose Oncina Universidad de Alicante, Departamento de Lenguajes y Sistemas Informáticos {eva, mico,oncina}@dlsi.ua.es Abstract. The
More informationA probabilistic description-oriented approach for categorising Web documents
A probabilistic description-oriented approach for categorising Web documents Norbert Gövert Mounia Lalmas Norbert Fuhr University of Dortmund {goevert,mounia,fuhr}@ls6.cs.uni-dortmund.de Abstract The automatic
More informationClustering Algorithms for Data Stream
Clustering Algorithms for Data Stream Karishma Nadhe 1, Prof. P. M. Chawan 2 1Student, Dept of CS & IT, VJTI Mumbai, Maharashtra, India 2Professor, Dept of CS & IT, VJTI Mumbai, Maharashtra, India Abstract:
More informationEnhancing Clustering Results In Hierarchical Approach By Mvs Measures
International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach
More informationBoolean Model. Hongning Wang
Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer
More informationQ: Given a set of keywords how can we return relevant documents quickly?
Keyword Search Traditional B+index is good for answering 1-dimensional range or point query Q: What about keyword search? Geo-spatial queries? Q: Documents on Computer Science? Q: Nearby coffee shops?
More informationAn Efficient Hash-based Association Rule Mining Approach for Document Clustering
An Efficient Hash-based Association Rule Mining Approach for Document Clustering NOHA NEGM #1, PASSENT ELKAFRAWY #2, ABD-ELBADEEH SALEM * 3 # Faculty of Science, Menoufia University Shebin El-Kom, EGYPT
More informationNearest Neighbor Classification
Nearest Neighbor Classification Charles Elkan elkan@cs.ucsd.edu October 9, 2007 The nearest-neighbor method is perhaps the simplest of all algorithms for predicting the class of a test example. The training
More informationAgglomerative clustering on vertically partitioned data
Agglomerative clustering on vertically partitioned data R.Senkamalavalli Research Scholar, Department of Computer Science and Engg., SCSVMV University, Enathur, Kanchipuram 631 561 sengu_cool@yahoo.com
More informationReading group on Ontologies and NLP:
Reading group on Ontologies and NLP: Machine Learning27th infebruary Automated 2014 1 / 25 Te Reading group on Ontologies and NLP: Machine Learning in Automated Text Categorization, by Fabrizio Sebastianini.
More informationIntroduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.
Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How
More informationResPubliQA 2010
SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first
More informationText clustering based on a divide and merge strategy
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 55 (2015 ) 825 832 Information Technology and Quantitative Management (ITQM 2015) Text clustering based on a divide and
More informationTheme Identification in RDF Graphs
Theme Identification in RDF Graphs Hanane Ouksili PRiSM, Univ. Versailles St Quentin, UMR CNRS 8144, Versailles France hanane.ouksili@prism.uvsq.fr Abstract. An increasing number of RDF datasets is published
More informationA Model for Information Retrieval Agent System Based on Keywords Distribution
A Model for Information Retrieval Agent System Based on Keywords Distribution Jae-Woo LEE Dept of Computer Science, Kyungbok College, 3, Sinpyeong-ri, Pocheon-si, 487-77, Gyeonggi-do, Korea It2c@koreaackr
More informationCADIAL Search Engine at INEX
CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr
More informationInformation Retrieval. Information Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent
More informationDetermination of Similarity Threshold in Clustering Problems for Large Data Sets
Determination of Similarity Threshold in Clustering Problems for Large Data Sets Guillermo Sánchez-Díaz 1 and José F. Martínez-Trinidad 2 1 Center of Technologies Research on Information and Systems, The
More informationDevelopment of Generic Search Method Based on Transformation Invariance
Development of Generic Search Method Based on Transformation Invariance Fuminori Adachi, Takashi Washio, Hiroshi Motoda and *Hidemitsu Hanafusa I.S.I.R., Osaka University, {adachi, washio, motoda}@ar.sanken.osaka-u.ac.jp
More informationNoisy Text Clustering
R E S E A R C H R E P O R T Noisy Text Clustering David Grangier 1 Alessandro Vinciarelli 2 IDIAP RR 04-31 I D I A P December 2004 1 IDIAP, CP 592, 1920 Martigny, Switzerland, grangier@idiap.ch 2 IDIAP,
More informationInformation Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining
Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining 1 Vishakha D. Bhope, 2 Sachin N. Deshmukh 1,2 Department of Computer Science & Information Technology, Dr. BAM
More informationSocial Media Computing
Social Media Computing Lecture 4: Introduction to Information Retrieval and Classification Lecturer: Aleksandr Farseev E-mail: farseev@u.nus.edu Slides: http://farseev.com/ainlfruct.html At the beginning,
More informationCHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS
CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS 4.1 Introduction Although MST-based clustering methods are effective for complex data, they require quadratic computational time which is high for
More informationColor-Based Classification of Natural Rock Images Using Classifier Combinations
Color-Based Classification of Natural Rock Images Using Classifier Combinations Leena Lepistö, Iivari Kunttu, and Ari Visa Tampere University of Technology, Institute of Signal Processing, P.O. Box 553,
More informationBetter Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web
Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl
More informationNDoT: Nearest Neighbor Distance Based Outlier Detection Technique
NDoT: Nearest Neighbor Distance Based Outlier Detection Technique Neminath Hubballi 1, Bidyut Kr. Patra 2, and Sukumar Nandi 1 1 Department of Computer Science & Engineering, Indian Institute of Technology
More informationInformation Retrieval CSCI
Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1
More informationSimilarity search in multimedia databases
Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:
More informationSemi supervised clustering for Text Clustering
Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering
More informationClustering with Lower Bound on Similarity
Clustering with Lower Bound on Similarity Mohammad Al Hasan, Saeed Salem, Benjarath Pupacdi 2, and Mohammed J. Zaki Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 2 Chulabhorn
More informationCollaborative Rough Clustering
Collaborative Rough Clustering Sushmita Mitra, Haider Banka, and Witold Pedrycz Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India {sushmita, hbanka r}@isical.ac.in Dept. of Electrical
More informationInternational ejournals
Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT
More informationLetter Pair Similarity Classification and URL Ranking Based on Feedback Approach
Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach P.T.Shijili 1 P.G Student, Department of CSE, Dr.Nallini Institute of Engineering & Technology, Dharapuram, Tamilnadu, India
More informationCHAPTER 4: CLUSTER ANALYSIS
CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationFinding Hubs and authorities using Information scent to improve the Information Retrieval precision
Finding Hubs and authorities using Information scent to improve the Information Retrieval precision Suruchi Chawla 1, Dr Punam Bedi 2 1 Department of Computer Science, University of Delhi, Delhi, INDIA
More informationImproving the Performance of Search Engine With Respect To Content Mining Kr.Jansi, L.Radha
Improving the Performance of Search Engine With Respect To Content Mining Kr.Jansi, L.Radha 1 Asst. Professor, Srm University, Chennai 2 Mtech, Srm University, Chennai Abstract R- Google is a dedicated
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationConsensus clustering by graph based approach
Consensus clustering by graph based approach Haytham Elghazel 1, Khalid Benabdeslemi 1 and Fatma Hamdi 2 1- University of Lyon 1, LIESP, EA4125, F-69622 Villeurbanne, Lyon, France; {elghazel,kbenabde}@bat710.univ-lyon1.fr
More informationA Universal Model for XML Information Retrieval
A Universal Model for XML Information Retrieval Maria Izabel M. Azevedo 1, Lucas Pantuza Amorim 2, and Nívio Ziviani 3 1 Department of Computer Science, State University of Montes Claros, Montes Claros,
More informationA novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems
A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics
More informationBetter Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web
Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl
More informationA Document Graph Based Query Focused Multi- Document Summarizer
A Document Graph Based Query Focused Multi- Document Summarizer By Sibabrata Paladhi and Dr. Sivaji Bandyopadhyay Department of Computer Science and Engineering Jadavpur University Jadavpur, Kolkata India
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation
More informationDepartment of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _
COURSE DELIVERY PLAN - THEORY Page 1 of 6 Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ LP: CS6007 Rev. No: 01 Date: 27/06/2017 Sub.
More informationTop-k Keyword Search Over Graphs Based On Backward Search
Top-k Keyword Search Over Graphs Based On Backward Search Jia-Hui Zeng, Jiu-Ming Huang, Shu-Qiang Yang 1College of Computer National University of Defense Technology, Changsha, China 2College of Computer
More informationInformation Retrieval: Retrieval Models
CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems
More informationTexture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map
Texture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map Markus Turtinen, Topi Mäenpää, and Matti Pietikäinen Machine Vision Group, P.O.Box 4500, FIN-90014 University
More informationVK Multimedia Information Systems
VK Multimedia Information Systems Mathias Lux, mlux@itec.uni-klu.ac.at This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Results Exercise 01 Exercise 02 Retrieval
More informationijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System
ijade Reporter An Intelligent Multi-agent Based Context Aware Reporting System Eddie C.L. Chan and Raymond S.T. Lee The Department of Computing, The Hong Kong Polytechnic University, Hung Hong, Kowloon,
More informationBasic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval
Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert
More informationUsing a Painting Metaphor to Rate Large Numbers
1 Using a Painting Metaphor to Rate Large Numbers of Objects Patrick Baudisch Integrated Publication and Information Systems Institute (IPSI) German National Research Center for Information Technology
More informationCircle Graphs: New Visualization Tools for Text-Mining
Circle Graphs: New Visualization Tools for Text-Mining Yonatan Aumann, Ronen Feldman, Yaron Ben Yehuda, David Landau, Orly Liphstat, Yonatan Schler Department of Mathematics and Computer Science Bar-Ilan
More information