Applying Key Segments Schema to Generalized Star Clustering


Abstract. The clustering process, like other text mining tasks, depends critically on the proper representation of documents. In this paper we propose a Filtered by Key Segment (FKS) vector, obtained from the whole document but considering only the terms appearing in a predefined key segment, as the document representation schema. The evaluation experiments show that our proposal, applied to the Vector Space Model and the Global Association Distance Model using the Generalized Star algorithm, outperforms the original models.

1 Introduction

Clustering is the process of grouping data into classes or clusters so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. Dissimilarities are assessed from the attribute values describing the objects, often by means of distance measures. Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning. Cluster analysis has been widely used in numerous applications, including pattern recognition, data analysis, image processing, and market research.

Initially, document clustering was evaluated as a means of improving the results of information retrieval systems [10, 2]. Clustering has also been proposed as an efficient way of automatically finding related or new topics, in filtering tasks [1], and for grouping retrieved documents into a list of meaningful categories, which facilitates query processing by searching only the clusters closest to the query [15].

Several algorithms have been proposed for document clustering. One of them is Generalized Star (GStar), presented and evaluated by Pérez et al. in [8]. They introduced a new definition of star that allows a different star-shaped sub-graph; in this way GStar retains the strengths of the previous algorithms and solves their drawbacks. The experiments comparing GStar against the original Star algorithm [1], the Extended Star algorithms [7], and other traditional clustering algorithms such as Single Link and Average Link [6] show that Generalized Star outperforms those algorithms when the Vector Space Model (VSM) [12] is used as the document representation model.

Nevertheless, we consider that representation models applied to whole documents, but filtered by a key segment, generally perform better than the original representation models. In this paper we evaluate the performance of GStar using the Vector Space Model, the Global Association Distance Model, and their respective variants filtered by a key segment, in order to verify this claim.

The basic outline of this paper is as follows. Section 2 is dedicated to the representation models considered in the experimentation phase. The Generalized Star algorithm is described in Section 3. The experimental results are discussed in Section 4, and the conclusions of the research, together with some ideas about future directions, are presented in Section 5.

2 Document representation

The clustering process, like any other text mining task, is carried out in two main stages: a pre-processing stage and a discovery stage. In the first stage, texts are transformed into a structured or semi-structured representation that is simpler and more useful for automatic processing. In the second stage these representations are analyzed in order to discover interesting patterns, i.e. clusters.

In the pre-processing stage a set of operations is applied to simplify and standardize the documents being analyzed. Some of these operations implement disambiguation methods and stemming processes, identifying concepts and syntagmatic structures. These concepts and structures can be organized in different forms but, in general, they are considered as groups or bags of words, usually structured using a vector space model [12].

In the vector space model, each document is a vector of terms. The values of these vectors are weights assigned according to the term occurrences in the document or in the document collection, under one of several interpretations [5]: Boolean, Term Frequency (Tf), and Term Frequency-Inverse Document Frequency (Tf-Idf).
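The three weighting interpretations named above can be illustrated with a minimal Python sketch. The function name `term_weights`, the token lists, and the particular IDF variant, log(N / df), are assumptions made for illustration; the paper does not state which Tf-Idf formula it uses.

```python
import math
from collections import Counter

def term_weights(docs, scheme="tf"):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        if scheme == "boolean":
            vec = {t: 1.0 for t in c}
        elif scheme == "tf":
            vec = {t: float(f) for t, f in c.items()}
        elif scheme == "tfidf":
            # Assumed IDF variant: log(N / df). Other variants exist.
            vec = {t: f * math.log(n_docs / df[t]) for t, f in c.items()}
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vectors

# Hypothetical two-document collection, already tokenized.
docs = [["star", "cluster", "star"], ["cluster", "graph"]]
tf = term_weights(docs, "tf")
```

Under this IDF variant a term that occurs in every document, such as "cluster" here, receives weight zero, which is one reason smoothed variants are often preferred in practice.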
These vectors of terms are used in the second stage, among other tasks, to analyze the similarities between documents, or groups of them, using measures such as the cosine of the angle between the vectors, defined as [5]:

$$\mathrm{sim}(d_i, d_j) = \cos(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|} = \frac{\sum_r w_{ir} w_{jr}}{\sqrt{\sum_r w_{ir}^2 \, \sum_r w_{jr}^2}} \tag{1}$$

where $d_i$, $d_j$ are the vectors of documents $i$ and $j$, $\|d_i\|$, $\|d_j\|$ are the norms of the vectors, and $w_{ir}$, $w_{jr}$ are the term weights in the vectors $d_i$, $d_j$, respectively.

Another representation model is the Global Association Distance Model (GADM) [11]. GADM can be defined as a vector space model (VSM) in which each term is weighted by its global association strength (3). In contrast to the original VSM, which measures term relevance by the number of occurrences in a document, GADM considers the co-occurrences (more precisely, the association strengths) amongst terms in sentences, paragraphs, and so on. A document $d$ can therefore be modelled by a vector of global association strengths (2):

$$d = (g_{t_1}, \ldots, g_{t_n}) \tag{2}$$

where

$$g_{t_r} = \sum_{t_s \in d} \frac{1}{D_{rs}} \tag{3}$$

and the formal distance between these terms ($D_{rs}$) is defined as follows, considering the distance by paragraph without ignoring the natural co-occurrence of terms appearing in the same sentence. Here $(p_r, n_r)$ and $(p_s, n_s)$ are the paragraph and sentence numbers of terms $t_r$ and $t_s$ respectively:

$$D_{rs} = \begin{cases} 1 & (r = s) \vee [(p_r = p_s) \wedge (n_r = n_s)] \\ |p_r - p_s| + 2 & \text{otherwise} \end{cases} \tag{4}$$

Although flat structures are the simplest way of processing document collections, this linear model provides only a limited means of measuring similarities between semistructured documents. In a semistructured representation, it is not necessary to use all the information [3]. In some approaches, a predefined structure is considered and information is fed into the structure provided. In other approaches, documents are allowed to have specific structure types (such as trees or segments).

A semistructured approach is not an unusual way of representing documents. For example, in academic papers, authors are asked to write a few words that concisely describe their work (the title), a few paragraphs that outline it (the abstract), a few pages that describe it precisely (the body), and finally a summary (the conclusion). In news writing, the inverted pyramid style is generally used. In this style, the conclusion or summary of the news is moved to the front of the article, putting the main idea into the first paragraph [13].

If we try to simplify the representation of a document to a single vector, we might choose between a whole-document vector and a vector built from a key segment (for instance, the abstract in academic papers or the first paragraph in news), both as flat structures. Notice that in key segment vectors terms usually appear only once, so term weights lose their relevance in a VSM or GADM model.
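Equations (2) to (4) can be sketched in Python as follows. This is only an illustration under stated assumptions: the indices r and s are taken to range over term occurrences, the strengths of repeated occurrences of the same term are summed, and the names `distance` and `gadm_vector` as well as the input layout are hypothetical. The original GADM paper [11] should be consulted for the exact definition.

```python
from collections import defaultdict

def distance(occ_r, occ_s):
    """Equation (4). Each occurrence is (index, paragraph, sentence)."""
    (r, p_r, n_r), (s, p_s, n_s) = occ_r, occ_s
    if r == s or (p_r == p_s and n_r == n_s):
        return 1
    return abs(p_r - p_s) + 2

def gadm_vector(occurrences):
    """occurrences: list of (term, paragraph, sentence) tuples.

    Assumption: strengths are accumulated per occurrence and then
    summed per term, as in equation (3).
    """
    indexed = [(i, p, n) for i, (_, p, n) in enumerate(occurrences)]
    strengths = defaultdict(float)
    for (term, _, _), occ_r in zip(occurrences, indexed):
        for occ_s in indexed:
            strengths[term] += 1.0 / distance(occ_r, occ_s)
    return dict(strengths)

# Hypothetical document: three terms in paragraph 1, one in paragraph 2.
doc = [("star", 1, 1), ("cluster", 1, 1), ("graph", 1, 2), ("star", 2, 1)]
g = gadm_vector(doc)
```

In this toy document "star" accumulates the largest strength because it occurs twice, while "graph" scores lowest because it shares a sentence with no other term.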
We have been considering an alternative way of representing semistructured documents by a single vector, without losing the importance of the key segment but taking into account the relevance of the terms in the whole document (the term weight). We call it the vector Filtered by Key Segment. A vector Filtered by Key Segment (FKS) is a vector model (VSM or GADM) obtained from the whole document but considering only the terms appearing in a predefined key segment. This kind of vector could even be used in a structure weighting approach, where the vector associated with each segment is constructed from the whole document. Moreover, FKS vectors can reduce the dimensionality problem that usually arises in document processing systems.

In FKS from VSM, the term weight is the occurrence frequency in the whole document. In FKS from GADM, the term weight is the global association strength, computed over the whole document, between the term and the rest of the terms of the key segment.
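A minimal Python sketch of an FKS vector built from VSM, together with the cosine measure of equation (1), is given below. The function names `fks_vsm` and `cosine` and the sample token lists are hypothetical, and raw term frequency is assumed as the weight, following the description above.

```python
import math
from collections import Counter

def fks_vsm(document_tokens, key_segment_tokens):
    """Term frequencies over the whole document, kept only for
    terms that appear in the key segment."""
    key_terms = set(key_segment_tokens)
    tf = Counter(document_tokens)
    return {t: float(tf[t]) for t in key_terms if tf[t] > 0}

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical news items; the key segment is the first paragraph.
doc_a = ["election", "results", "election", "vote", "economy", "election"]
key_a = ["election", "results"]
doc_b = ["election", "vote", "vote", "results"]
key_b = ["vote", "election"]

vec_a = fks_vsm(doc_a, key_a)   # "vote" and "economy" are filtered out
vec_b = fks_vsm(doc_b, key_b)   # "results" is filtered out
similarity = cosine(vec_a, vec_b)
```

The vectors keep whole-document frequencies (for example, "election" weighs 3 in the first document) while their dimensionality is bounded by the size of the key segment.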

3 Generalized Star algorithm

The Generalized Star algorithm was proposed by Pérez et al. in [8] to solve the drawbacks of its predecessors: the Star algorithm [1] and the Extended Star algorithm [7]. This algorithm represents the document collection by its thresholded similarity graph and finds overlapping dense sub-graphs in it.

Let $V = \{d_1, \ldots, d_n\}$ be a collection of documents and $Sim(d_i, d_j)$ a symmetric similarity function between documents $d_i$ and $d_j$. We call similarity graph the undirected and weighted graph $G = (V, E, w)$, where vertices correspond to documents and each weighted edge corresponds to the similarity between two documents. Given a similarity threshold $\sigma$ defined by the user, the thresholded graph $G_\sigma$ is the undirected graph obtained from $G$ by eliminating all the edges whose weights are lower than $\sigma$.

Let $o.Adj$ denote the set of vertices adjacent to a vertex $o$ in $G_\sigma$. The set of Weak Satellites (WeakSats) and the set of Potential Satellites (PotSats) of $o$ are defined by expressions (5) and (6) respectively:

$$o.WeakSats = \{ s \mid s \in o.Adj \wedge |o.Adj| \geq |s.Adj| \} \tag{5}$$

$$o.PotSats = \{ s \mid s \in o.Adj \wedge |o.WeakSats| \geq |s.WeakSats| \} \tag{6}$$

The WeakSats and PotSats degrees of a vertex $o$ are defined as the number of vertices included in its sets of Weak Satellites and Potential Satellites respectively. Using these sets, a generalized star-shaped sub-graph of $m + 1$ vertices in $G_\sigma$ consists of a single center $c$ and $m$ adjacent vertices, such that $|c.PotSats| \geq |v.PotSats|$ for all $v \in c.PotSats$.

Starting from this definition and guaranteeing a full cover $C$ of $G_\sigma$, the algorithm should satisfy the following postconditions:

$$\forall x \in V, \; x \in C \vee x.Adj \cap C \neq \emptyset \tag{7}$$

$$\forall c \in C, \; \forall v \in c.PotSats, \; |c.PotSats| \geq |v.PotSats| \tag{8}$$

Condition (7) guarantees that each object of the collection belongs to at least one group, as a center or as a satellite. Condition (8) indicates that all the centers satisfy the definition of a generalized star-shaped sub-graph.

The set of Necessary Satellites (NecSats) of $o$ is the set of its adjacent vertices that could depend on $o$ to be covered. This concept is needed only during cluster generation. Initially, NecSats takes the value of PotSats, but it can shrink during the clustering process as more documents are covered by stars.

Let $C$ be the set of centers obtained by the algorithm. A center vertex $c$ is considered redundant if it satisfies the following conditions:

1. $\exists d \in c.Adj \cap C$, i.e. vertex $c$ has at least one adjacent center in its neighborhood.
2. $\forall s \in c.PotSats, \; s \in C \vee |s.Adj \cap C| > 1$, i.e. vertex $s$ has more than one adjacent center (a neighboring center other than $c$) in its neighborhood, or vertex $s$ is itself a center.
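The thresholded graph and the satellite sets of expressions (5) and (6) can be sketched in Python as follows. The function names, the similarity matrix, and the threshold value are hypothetical, and self-loops are excluded from the adjacency sets, which is an assumption not stated explicitly in the text.

```python
def thresholded_adjacency(sim, sigma):
    """Adjacency sets of G_sigma: keep edges with similarity >= sigma.

    Assumption: a vertex is not considered adjacent to itself.
    """
    n = len(sim)
    return {
        i: {j for j in range(n) if j != i and sim[i][j] >= sigma}
        for i in range(n)
    }

def satellites(adj):
    """Return the WeakSats and PotSats sets of every vertex."""
    weak = {
        o: {s for s in adj[o] if len(adj[o]) >= len(adj[s])}
        for o in adj
    }
    pot = {
        o: {s for s in adj[o] if len(weak[o]) >= len(weak[s])}
        for o in adj
    }
    return weak, pot

# Hypothetical symmetric similarity matrix for four documents.
sim = [
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.2, 0.1],
    [0.7, 0.2, 1.0, 0.6],
    [0.1, 0.1, 0.6, 1.0],
]
adj = thresholded_adjacency(sim, sigma=0.5)
weak, pot = satellites(adj)
```

With this matrix, vertices 0 and 2 each have two potential satellites, while vertices 1 and 3 have none, so only 0 and 2 are natural candidates for centers.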

Table 1. Pseudo-code of the Generalized Star algorithm

    Algorithm 1: GStar
    Input:  V = {d_1, d_2, ..., d_n} - set of documents, σ - similarity threshold
    Output: SC - set of clusters

    forall vertex d_i ∈ V do
        d_i.Adj := {d_j | d_j ∈ V ∧ Sim(d_i, d_j) ≥ σ};
    forall vertex d ∈ V do
        d.WeakSats := {s | s ∈ d.Adj ∧ |d.Adj| ≥ |s.Adj|};
    forall vertex d ∈ V do
        d.PotSats := {s | s ∈ d.Adj ∧ |d.WeakSats| ≥ |s.WeakSats|};
        d.NecSats := d.PotSats;
    end
    L := V;
    C := ∅;
    while L ≠ ∅ do
        d := arg max{|d_i.PotSats|, d_i ∈ L};
        Update(d, C, L);
    end
    Sort C in ascending order by |PotSats|;
    SC := ∅;
    forall center c ∈ C do
        if c is redundant then C := C \ {c};
        else SC := SC ∪ {{c} ∪ c.Adj};

The Generalized Star algorithm is summarized in Table 1. The procedure Update (see Table 2) marks a vertex as a center, deletes it from L, and updates the set NecSats of each of its necessary satellites.

Table 2. Pseudo-code of the Update procedure

    Procedure Update
    Input:  d - selected center, C - set of cluster centers, L - set of unprocessed vertices
    Output: C, L

    C := C ∪ {d};
    L := L \ {d};
    forall s ∈ d.NecSats do
        s.NecSats := s.NecSats \ {d};
        if s.NecSats = ∅ then L := L \ {s};
        forall c ∈ s.Adj \ {d} do
            c.NecSats := c.NecSats \ {s};
            if (c.NecSats = ∅) ∧ (c.Adj ∩ C ≠ ∅) then L := L \ {c};
        end
    end
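The pseudo-code of Tables 1 and 2 can be transcribed into a short Python sketch. It is an illustrative reading of the pseudo-code, not the authors' implementation: the function name `gstar`, the adjacency-dictionary input, and the tie-breaking rule (smallest vertex label first, which requires orderable vertex labels) are assumptions.

```python
def gstar(adj):
    """adj: {vertex: set of neighbours} of the thresholded graph."""
    weak = {d: {s for s in adj[d] if len(adj[d]) >= len(adj[s])} for d in adj}
    pot = {d: {s for s in adj[d] if len(weak[d]) >= len(weak[s])} for d in adj}
    nec = {d: set(pot[d]) for d in adj}   # NecSats starts as PotSats

    unprocessed = set(adj)                # L in Table 1
    centers = set()                       # C in Table 1

    def update(d):
        """Table 2: mark d as a center and update NecSats."""
        centers.add(d)
        unprocessed.discard(d)
        for s in sorted(nec[d]):
            nec[s].discard(d)
            if not nec[s]:
                unprocessed.discard(s)
            for c in adj[s] - {d}:
                nec[c].discard(s)
                if not nec[c] and adj[c] & centers:
                    unprocessed.discard(c)

    while unprocessed:
        # Highest PotSats degree first; ties go to the smallest label.
        d = max(sorted(unprocessed), key=lambda v: len(pot[v]))
        update(d)

    clusters = []
    for c in sorted(centers, key=lambda v: (len(pot[v]), v)):
        has_adjacent_center = bool(adj[c] & centers)
        satellites_covered = all(
            s in centers or len(adj[s] & centers) > 1 for s in pot[c]
        )
        if has_adjacent_center and satellites_covered:
            centers.discard(c)            # c is redundant
        else:
            clusters.append({c} | adj[c])
    return clusters

# Hypothetical thresholded graph: a path 1 - 0 - 2 - 3.
graph = {0: {1, 2}, 1: {0}, 2: {0, 3}, 3: {2}}
result = gstar(graph)
```

On this small path graph, vertices 0 and 2 are both kept as centers, and the two resulting clusters overlap in vertices 0 and 2, which illustrates the overlapping nature of the method.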

Like the original Star algorithm and the two versions of the Extended algorithm, the GStar method generates clusters that can overlap, and it also guarantees that the pairwise similarity between satellite vertices in a generalized star-shaped sub-graph is high. Unlike its predecessors, the GStar algorithm cannot produce illogical clusters, because all the centers satisfy the definition of a generalized star-shaped sub-graph. The GStar algorithm does not produce uncovered vertices either; this property is ensured by the fulfillment of postcondition (7). It also avoids the generation of the unnecessary clusters produced by the two versions of the Extended algorithm.

The dependence on data order is a property that the Extended Star method certainly solves. Nevertheless, as was indicated in [8], solving it is necessary only when that dependence affects the quality of the resulting clusters. Thus, the GStar algorithm solves the dependence on data order (for non-symmetric or similar solutions) observed in the Star algorithm.

In [8] the GStar algorithm was compared with the original Star and the Extended Star methods on two standard document collections, considering the Jaccard index [9] and the F-measure [4]. The experiments showed that GStar outperforms the previous methods on these measures and also with respect to the number and density of the generated clusters. These results support the validity of GStar for clustering tasks.

4 Experimental results

In this section we present the experimental evaluation of GStar clustering using the Vector Space Model, the Global Association Distance Model, and their corresponding variants filtered by a key segment, proposed in Section 2. The GStar clustering results are evaluated with the same method and criterion to ensure a fair comparison across these representation models.

Two document collections widely used in document clustering research were used in the experiments: TREC-5 and Reuters. They are heterogeneous in document size, cluster size, number of classes, and document distribution. The TREC-5 data set contains news items in Spanish published by AFP; Reuters is a Reuters LTD news corpus in English. We excluded from the collections the empty documents and also those documents that do not have an associated topic. The main characteristics of these collections are summarized in Table 3. We also included in this table the number of overlapping documents for each collection.

All documents were preprocessed and lemmatized with TreeTagger [14], and the similarity amongst documents was obtained with the traditional cosine measure. The FKS models were applied considering the first paragraph of each news item as the key segment.

Footnote: the Reuters collection is available at ftp://canberra.cs.umass.edu/pub/reuters

Table 3. Characteristics of document collections

    Collection | Documents | Overlapping documents | Topics | Language
    AFP        |           |                       |        | Spanish
    Reuters    |           |                       |        | English

The literature abounds in measures defined by multiple authors to compare two partitions of the same set. The most widely used are the Jaccard index and the F-measure.

Jaccard index. This index (denoted $j$) takes into account the objects that are simultaneously joined [9]. It is defined as follows:

$$j(A, B) = \frac{n_{11}}{\frac{N(N-1)}{2} - n_{00}} \tag{9}$$

In this index, $n_{11}$ denotes the number of pairs of objects that are in the same cluster in $A$ and also in the same cluster in $B$. Similarly, $n_{00}$ is the number of pairs of objects that are in different clusters in $A$ and also in different clusters in $B$. The performance of the algorithm on the document collections considering the Jaccard index is shown in Fig. 1 (A) and (B) and in Fig. 2 (A) and (B) respectively.

F-measure. The aforementioned index and others are usually applied to partitions. In order to better evaluate overlapping clusterings, we have considered the F-measure calculated over pairs of points, as defined in [4]. This measure is the harmonic mean of precision and recall (10):

$$F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{10}$$

$$Precision = \frac{n_{11}}{\text{Number of identified pairs}}, \qquad Recall = \frac{n_{11}}{\text{Number of true pairs}}$$

The performance of the algorithm on the document collections considering the F-measure is shown in Fig. 1 (C) and (D) and in Fig. 2 (C) and (D) respectively.

As can be seen in Fig. 1 and Fig. 2, the accuracy obtained by GStar using FKS (GStar-FKS) from VSM is in most cases (for all the indexes) better than or comparable with that obtained by GStar using the original VSM, and the accuracy obtained by GStar-FKS from GADM is in all cases better than or similar to that obtained with the original GADM. Note that the best performance of GStar corresponds to the GADM filtered by a key segment for both document collections.
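The pairwise measures of equations (9) and (10) can be sketched in Python as follows. The function names and the toy clusterings are hypothetical; with overlapping clusters, a pair is assumed to count as joined if the two objects share at least one cluster.

```python
from itertools import combinations

def co_clustered_pairs(clusters):
    """Set of object pairs that share at least one cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs

def jaccard_and_f(found, truth, n_objects):
    """Pairwise Jaccard index (9) and F-measure (10)."""
    pairs_found = co_clustered_pairs(found)
    pairs_true = co_clustered_pairs(truth)
    n11 = len(pairs_found & pairs_true)
    total = n_objects * (n_objects - 1) // 2
    n00 = total - len(pairs_found | pairs_true)
    jaccard = n11 / (total - n00) if total > n00 else 0.0
    precision = n11 / len(pairs_found) if pairs_found else 0.0
    recall = n11 / len(pairs_true) if pairs_true else 0.0
    if precision + recall == 0:
        return jaccard, 0.0
    return jaccard, 2 * precision * recall / (precision + recall)

# Hypothetical example: an overlapping clustering against two true topics.
found = [{0, 1, 2}, {0, 2, 3}]
truth = [{0, 1}, {2, 3}]
j, f = jaccard_and_f(found, truth, n_objects=4)
```

Here the clustering identifies five pairs, two of which are true pairs, so precision is 0.4 and recall is 1.0.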
We also compared the results obtained by GStar-FKS with both models to determine which model offers the best results for each measure. The performance of GStar-FKS on the document collections considering the Jaccard index and the F-measure is shown in Fig. 3. As we can see, FKS-GADM outperforms FKS-VSM in most cases on the TREC-5 collection, and the performance of both FKS models is comparable on the Reuters collection, though in this corpus the best performance value is obtained with FKS-GADM. Thus, FKS-GADM represents a 10.91% and a 7.27% average improvement over FKS-VSM on the AFP collection for the Jaccard index and the F-measure respectively; in the case of the Reuters collection, FKS-VSM slightly outperforms FKS-GADM, by 1.34% and 1.19% for the two measures.

Fig. 1. Behavior in the TREC-5 collection using FKS by VSM (A, C) and FKS by GADM (B, D), with Jaccard index and F-measure respectively.

Fig. 2. Behavior in the Reuters collection using FKS by VSM (A, C) and FKS by GADM (B, D), with Jaccard index and F-measure respectively.

Fig. 3. Behavior in the AFP (A, B) and Reuters (C, D) collections with Jaccard index and F-measure, using FKS by VSM and FKS by GADM.

5 Conclusions

In order to achieve better clustering performance with GStar, in this paper we have proposed a Filtered by Key Segment (FKS) vector, obtained from the whole document but considering only the terms appearing in a predefined key segment, as the document representation schema. The experimental results on two collections of news items show the improvement of our proposal over the original Vector Space Model and the original Global Association Distance Model.

References

1. Aslam, J., Pelekhov, K. and Rus, D.: Using Star Clusters for Filtering. In Proceedings of the Ninth International Conference on Information and Knowledge Management, USA.
2. Aslam, J., Pelekhov, K. and Rus, D.: The Star Clustering Algorithm for Static and Dynamic Information Organization. Journal of Graph Algorithms and Applications, Vol. 8, No. 1.
3. Baeza-Yates, R. and Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press, Addison-Wesley.
4. Banerjee, A., Krumpelman, C., Basu, S., Mooney, R. and Ghosh, J.: Model Based Overlapping Clustering. In Proceedings of International Conference on Knowledge Discovery and Data Mining (KDD).
5. Berry, M.: Survey of Text Mining, Clustering, Classification and Retrieval. Springer.
6. Duda, R., Hart, P. and Stork, D.: Pattern Classification. John Wiley & Sons Inc.
7. Gil-García, R. J., Badía-Contelles, J. M. and Pons-Porrata, A.: Extended Star Clustering Algorithm. In Proceedings of CIARP 03, LNCS 2905, 2003.
8. Pérez-Suárez, A. and Medina-Pagola, J. E.: A Clustering Algorithm based on Generalized Stars. In Proceedings of the 5th International Conference on Machine Learning and Data Mining (MLDM2007), LNAI 4571, Leipzig, Germany.
9. Kuncheva, L. and Hadjitodorov, S.: Using Diversity in Cluster Ensembles. In Proceedings of IEEE SMC 2004, The Netherlands.
10. van Rijsbergen, C. J.: Information Retrieval. Butterworths, London, 2nd edition.
11. Medina Pagola, J. E., Rodríguez, A. Y., Hechavarría, A. and Hernández Palancar, J.: Document Representation using Global Association Distance Model. In Proc. of ECIR 2007, LNCS 4425.
12. Salton, G.: The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, New Jersey.
13. Scanlan, C.: Writing from the Top Down: Pros and Cons of the Inverted Pyramid. Poynter Institute. Retrieved on July 27, 2006.
14. Schmid, H.: Probabilistic Part-Of-Speech Tagging Using Decision Tree. In: International Conference on New Methods in Language Processing, Manchester, UK (1994).
15. Zhong, S. and Ghosh, J.: A Comparative Study of Generative Models for Document Clustering. In Proceedings of SDM Workshop on Clustering High Dimensional Data and Its Applications, 2003.

ACONS: A New Algorithm for Clustering Documents

ACONS: A New Algorithm for Clustering Documents ACONS: A New Algorithm for Clustering Documents Andrés Gago Alonso, Airel Pérez Suárez, and José E. Medina Pagola Advanced Technologies Application Center (CENATAV), 7a 21812 e/ 218 y 222, Rpto. Siboney,

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

Text Documents clustering using K Means Algorithm

Text Documents clustering using K Means Algorithm Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals

More information

Making Retrieval Faster Through Document Clustering

Making Retrieval Faster Through Document Clustering R E S E A R C H R E P O R T I D I A P Making Retrieval Faster Through Document Clustering David Grangier 1 Alessandro Vinciarelli 2 IDIAP RR 04-02 January 23, 2004 D a l l e M o l l e I n s t i t u t e

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014.

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014. A B S T R A C T International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Information Retrieval Models and Searching Methodologies: Survey Balwinder Saini*,Vikram Singh,Satish

More information

Text Categorization (I)

Text Categorization (I) CS473 CS-473 Text Categorization (I) Luo Si Department of Computer Science Purdue University Text Categorization (I) Outline Introduction to the task of text categorization Manual v.s. automatic text categorization

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

An Overview of Concept Based and Advanced Text Clustering Methods.

An Overview of Concept Based and Advanced Text Clustering Methods. An Overview of Concept Based and Advanced Text Clustering Methods. B.Jyothi, D.Sailaja, Dr.Y.Srinivasa Rao, GITAM, ANITS, GITAM, Asst.Professor Asst.Professor Professor Abstract: Most of the common techniques

More information

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Kwangcheol Shin 1, Sang-Yong Han 1, and Alexander Gelbukh 1,2 1 Computer Science and Engineering Department, Chung-Ang University,

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Mining Quantitative Association Rules on Overlapped Intervals

Mining Quantitative Association Rules on Overlapped Intervals Mining Quantitative Association Rules on Overlapped Intervals Qiang Tong 1,3, Baoping Yan 2, and Yuanchun Zhou 1,3 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China {tongqiang,

More information

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Neelam Singh neelamjain.jain@gmail.com Neha Garg nehagarg.february@gmail.com Janmejay Pant geujay2010@gmail.com

More information

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVII, Number 4, 2012 CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL IOAN BADARINZA AND ADRIAN STERCA Abstract. In this paper

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

Concept-Based Document Similarity Based on Suffix Tree Document

Concept-Based Document Similarity Based on Suffix Tree Document Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri

More information

Multimedia Information Systems
