OntoExtractor: A Fuzzy-Based Approach to Content and Structure-Based Metadata Extraction


Paolo Ceravolo, Ernesto Damiani, Marcello Leida, and Marco Viviani

Università degli Studi di Milano, Dipartimento di Tecnologie dell'Informazione, via Bramante, Crema (CR), Italy
{ceravolo, damiani, leida, viviani}@dti.unimi.it

Abstract. This paper describes OntoExtractor, a tool for extracting metadata from heterogeneous sources of information, producing a quick-and-dirty hierarchy of knowledge. The tool is specifically tailored for quick classification of semi-structured data, which makes it convenient for dealing with web-based data sources.

1 Introduction

Typically, knowledge management techniques use metadata to specify the content, quality, type, creation, and context of a data item. A number of specialized formats for the creation of metadata exist; a typical example is the Resource Description Framework (RDF). Metadata can, however, be stored in any format, such as free text, Extensible Markup Language (XML), or database entries. All of these formats must rely on a vocabulary, which can have different degrees of formality. If this vocabulary is compliant with a set of logical axioms, it is called an ontology.

There are a number of well-known advantages in using information extracted from data instead of the data themselves. On the one hand, because of their small size compared to the data they describe, metadata are more easily shareable than data. Thanks to metadata sharing, information about data becomes readily available to anyone seeking it. Thus, metadata make data discovery easier and reduce data duplication. On the other hand, some important drawbacks are restraining the diffusion of metadata formats. First of all, building a knowledge-base is an onerous process. The domain analysis involves different activities that are often difficult to integrate, because they are usually performed by different professional roles.
In addition, the high cost of knowledge-base building conflicts with important knowledge management principles. Any knowledge management activity needs to be configured for a given domain. But every domain evolves, and the knowledge-bases related to it have to evolve as well. If the domain is evolving rapidly, a misalignment may result between the actual domain's state of affairs and the knowledge-base. Moreover, classical knowledge extraction technologies are not tailored for web-based data. These techniques have been widely tested with successful results, but they present some limitations. First, they need a high number of documents (typically, many thousands) to work properly. Secondly, they hardly take document structure into account and are therefore unsuitable for the semi-structured document formats used on the Web.

In this paper we present OntoExtractor, a tool supporting knowledge extraction activities in a web-based environment. OntoExtractor was designed to be inserted in a more general system aimed at managing the whole Ontology Life Cycle [5]. The classification produced as output is transformed into a standard metadata format and proposed to a community of users. Feedback from the community is collected in order to refine the classification, discarding metadata that express irrelevant classes or misclassified documents [4].

In order to support continuous domain evolution, OntoExtractor is designed to quickly produce a preliminary classification of a knowledge base. The tool supports heterogeneous sources of information, including semi-structured data. A fuzzy representation of document vectors allows documents to be segmented according to their structural topology, assigning a different relevance value to each segment. Another important feature of OntoExtractor is that it produces different classifications, organizing the classes of documents according to different degrees of cohesion. This feature allows users to quickly discard a classification that is not coherent with their vision of the domain.

R. Meersman, Z. Tari, P. Herrero et al. (Eds.): OTM Workshops 2006, LNCS 4278. © Springer-Verlag Berlin Heidelberg 2006
The paper is organized as follows: Section 2 introduces the tool, Section 3 describes the format adopted for document representation, Section 4 explains the techniques used in the structural classification of documents, Section 5 explains the techniques used in the content classification, and Section 6 presents the conclusions.

2 OntoExtractor

OntoExtractor is a tool, developed in the context of the KIWI project (see footnote 1), which extracts metadata from heterogeneous sources of information, producing a quick-and-dirty hierarchy of knowledge. The construction of the hierarchy occurs in a bottom-up fashion: starting from the heterogeneous document set, a clustering process groups documents into meaningful clusters. These clusters identify the backbone hierarchy of the ontology. Construction of the hierarchy is a three-step process, composed of the following phases:

1. Normalizing the incoming documents into XML format [9].
2. Clustering the documents according to their structure, using a Fuzzy Bag representation of the XML tree [3] [6].
3. Refining the structural clustering by analyzing the content of the documents, producing a semantic clustering of the documents.

Fig. 1. Overview of the OntoExtractor process

Footnote 1: This work was partly funded by the Italian Ministry of Research Fund for Basic Research (FIRB) under projects RBAU01CLNB 001 Knowledge Management for the Web Infrastructure (KIWI).

3 Normalize the Knowledge Base

The first step in our process is choosing a common representation format for the information to be managed. Data may come from different and heterogeneous sources, including unstructured, semi-structured or structured information, such as textual documents, HTML files, XML files or records in a database. In order to reconcile these different data sources, we developed a set of wrapper applications that transform the most commonly used document formats into a target XML representation. The wrapping process is shown in Figure 2. For semi-structured and structured sources the wrapper does not have much to do: all it has to perform is a mapping between the original data and elements in the target XML tree.

Unstructured sources of information need additional processing aimed at extracting the hidden structure of the documents. This phase uses well-known text-segmentation techniques [9] in order to find relations among parts of a text. It is an iterative process that takes as input a text blob (a continuous flow of characters representing the whole content of a document) and gives as output a set of text segments identified by the text segmentation process. The process stops when no text blob can be segmented further. At this point, a post-processing phase analyzes the resulting tree structure and generates the corresponding XML document.

In the current version of the OntoExtractor software, a regular-expression matching approach is also available to discover regular patterns such as section titles in the documents, which helps to control the text segmentation process. This is a preliminary approach that compares each row of the document with a regular expression. For example, we used the expression [0-9]+(([.]?)|([.]?[0-9]+))*(\s+\w+)+ to match chapter, section and paragraph headlines, which are usually preceded by numbers separated by a ".".
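The row-by-row heading detection described above can be sketched as follows. This is a minimal illustration, not the OntoExtractor implementation: the pattern is a simplified variant of the expression quoted in the text (which is partly garbled in this transcription), and the names `HEADING` and `find_headings` are hypothetical.

```python
import re

# Assumption: a simplified variant of the heading pattern quoted in the text.
# It accepts a dotted section number ("3", "3.1", "2.1.3.") followed by one
# or more whitespace-separated words.
HEADING = re.compile(r"[0-9]+(\.[0-9]+)*\.?(\s+\w+)+")


def find_headings(text):
    """Return the rows of `text` that look like numbered headlines."""
    headings = []
    for row in text.splitlines():
        row = row.strip()
        # fullmatch requires the whole row to fit the pattern, so ordinary
        # sentences ending in punctuation are rejected.
        if HEADING.fullmatch(row):
            headings.append(row)
    return headings
```

Rows that match could then be used as segment boundaries by the text segmentation step. A row such as "65 documents were used" would also match, so in practice the output probably needs a further check.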

Fig. 2. Wrapping process

4 Clustering by Structure

The OntoExtractor tool uses a flat encoding for the internal representation of XML documents for processing and analysis purposes. Documents are represented as Fuzzy Bags, i.e. collections of elements that may contain duplicates. Since the importance of tags can differ, a different weight (in the range from 0 to 1) can be assigned to each tag in the document. In other words, for each element in the XML document d, the Fuzzy Bag encoding of d contains a Fuzzy Element whose membership value is determined by the position of the tag in the document's structure or by other topological properties. The OntoExtractor tool currently provides two different algorithms to calculate the membership function of a Fuzzy Element:

1. Nesting: this is a lossy representation of the original document's topology, because the membership value does not keep track of the parent tag of the current tag, as shown in Figure 3. Given a vocabulary V = {R/1, a/0.9, b/0.8, d/0.6, e/0.4}, applying the nesting weighting function to a generic XML document, such as A.xml or B.xml, we obtain the fuzzy bag A = B = {R/1, a/0.3, a/0.225, b/0.2, d/0.3, e/0.2}. The membership value for each element is $M = V_e / L$, where:
   M: membership value;
   $V_e$: weight of the tag in the vocabulary;
   L: nesting level of the tag, with $L_{root} = 0$ (the example values above, however, are obtained by counting the root as level 1, which also avoids a division by zero at the root).

Fig. 3. Two generic XML documents A.xml and B.xml

2. MV: this is an experimental algorithm introduced by our group, which keeps memory of the parent tag. The membership value for each element is $M = (V_e + M_p) / L$, where:
   M: membership value;
   $M_p$: membership value of the parent tag (taken as 0 for the root, which has no parent);
   $V_e$: weight of the tag in the vocabulary;
   L: nesting level of the tag, defined as for the Nesting algorithm.

The MV membership value helps, in certain cases, to keep memory of the tree structure of the original document. Referring to Figure 3 and using the same vocabulary V, applying the MV weighting function to the tree representation of the two XML documents A.xml and B.xml we obtain A = {R/1, a/0.53, a/0.36, b/0.33, d/0.8, e/0.7} and B = {R/1, a/0.56, a/0.37, b/0.34, d/0.8, e/0.7}, which are different. Figure 4 shows the differences in processing an XML document coming from Amazon with the Nesting and MV algorithms.

In order to compare the XML documents modeled as fuzzy bags, we use the well-known similarity measures studied in [1] [2]. We privileged measures giving a higher similarity weight to bags where the elements (tags) belonging to the intersection are less nested. This is motivated by the fact that, if a tag is near the root, it seems reasonable to assume that it has a higher semantic value. In OntoExtractor the comparison between two Fuzzy Bags is computed using the Jaccard norm:

$$S(B_1, B_2) = \mathrm{Approx}\left(\frac{|B_1 \cap B_2|}{|B_1 \cup B_2|}\right)$$

where:
   $B_1$ and $B_2$ are the input fuzzy bags;
   $\cap$ is the intersection operator;
   $\cup$ is the union operator;
   $|\cdot|$ is the cardinality operator;
   Approx() is the approximation operator;
   S is the similarity value between $B_1$ and $B_2$.
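The two weighting functions and the Jaccard comparison can be illustrated with a short sketch. Several points are assumptions rather than facts from the paper: the trees `TREE_A` and `TREE_B` are hypothetical shapes chosen so that the computed memberships reproduce the example bags (Figure 3 is not available here), the root is counted as level 1, cardinality is taken as the sum of memberships with intersection and union paired by rank (min and max), and Approx() is left out because its definition is given only in [3] and [6].

```python
from collections import defaultdict

VOCAB = {"R": 1.0, "a": 0.9, "b": 0.8, "d": 0.6, "e": 0.4}

# Hypothetical trees as (tag, children) pairs; assumed shapes, see lead-in.
TREE_A = ("R", [("d", []), ("e", [("a", [("a", []), ("b", [])])])])
TREE_B = ("R", [("d", [("a", [("a", []), ("b", [])])]), ("e", [])])


def fuzzy_bag(tree, vocab, keep_parent=False):
    """Flatten a tree into a sorted list of (tag, membership) pairs.

    keep_parent=False gives the Nesting weighting M = V_e / L;
    keep_parent=True gives the MV weighting M = (V_e + M_p) / L.
    The root is counted as level 1 (assumption).
    """
    bag = []

    def visit(node, level, parent_m):
        tag, children = node
        weight = vocab[tag] + (parent_m if keep_parent else 0.0)
        m = weight / level
        bag.append((tag, m))
        for child in children:
            visit(child, level + 1, m)

    visit(tree, 1, 0.0)
    return sorted(bag)


def jaccard(bag1, bag2):
    """Sum-of-min over sum-of-max, pairing duplicates of a tag by rank."""
    by_tag1, by_tag2 = defaultdict(list), defaultdict(list)
    for tag, m in bag1:
        by_tag1[tag].append(m)
    for tag, m in bag2:
        by_tag2[tag].append(m)
    inter = union = 0.0
    for tag in set(by_tag1) | set(by_tag2):
        m1 = sorted(by_tag1[tag], reverse=True)
        m2 = sorted(by_tag2[tag], reverse=True)
        n = max(len(m1), len(m2))
        m1 += [0.0] * (n - len(m1))  # pad the shorter list with zeros
        m2 += [0.0] * (n - len(m2))
        inter += sum(min(x, y) for x, y in zip(m1, m2))
        union += sum(max(x, y) for x, y in zip(m1, m2))
    return inter / union if union else 1.0
```

Under these assumptions the Nesting bags of the two trees are identical, so their similarity is 1, while the MV bags differ slightly and yield a similarity just below 1, which is the behaviour the text attributes to the two algorithms.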

Fig. 4. Fuzzy Bags generated by the Nesting and MV algorithms, and the XML representation of the document.

For more theoretical information about this norm and how the union, intersection, approximation and cardinality operations are expressed, please refer to [3] and [6]. Using this norm, the tool performs a partitional clustering technique that is a hybrid between the K-means and K-NN clustering algorithms. OntoExtractor uses an alpha-cut value as a threshold for the clustering process, so that the initial number of clusters (k) does not have to be supplied, which avoids some of the clustering problems related to the k-means algorithm. The clustering algorithm compares each document with the centroid of each cluster, considering only the highest similarity value. If this value is greater than the given alpha, the document is inserted in the corresponding cluster; otherwise a new empty cluster is generated and the document is inserted in it.

The OntoExtractor tool offers two different ways to calculate the centroid of each cluster. The first method chooses the document that has the smallest representative Fuzzy Bag; in this method the centroid always corresponds to a real document. The second method generates a new Fuzzy Bag as the union of all the Fuzzy Bags in the cluster; in this case the generated Fuzzy Bag does not necessarily correspond to a real document.

5 Clustering by Content

The second clustering process that we propose is based on the content connected to leaf nodes. Content-based clustering is performed independently for each structural cluster selected, so it is possible to give different clustering criteria for each structural cluster generated, as shown in Figure 5. Note that users can select which clustering process to perform; for instance, if there is no need for structural clustering, then only content-based clustering is performed.

Fig. 5. Domain class subdivision based on structure (a) and refinement based on content (b)

It is important to remember that our clustering technique works on XML documents that are somehow structured. Therefore we compute content-based similarity at tag level, comparing the content of the same tag in different documents. Then we compute content-based similarity at document level by aggregating the tag-level similarity values.

Fig. 6. Tag-level comparison between data belonging to the same tag in different documents

Referring to Figure 6, it is necessary to choose two different functions. The first is a function f to compare data belonging to tags with the same name in different documents:

$$f_a(a[data]_A;\ a[data]_B);\quad f_b(b[data]_A;\ b[data]_B);\quad f_c(c[data]_A;\ c[data]_B) \tag{1}$$

The second is a function F to aggregate the individual f values: $F(f_a, f_b, f_c)$. We have two possibilities for choosing the F function:

F is a t-norm: conjunction of the single values ($f_a \wedge f_b \wedge f_c$);
F is a t-conorm: disjunction of the single values ($f_a \vee f_b \vee f_c$).

Fig. 7. A: comparison in case of null values. B: comparison in case of nested values.

Referring to Figure 7, it is evident that we also need to consider cases where a tag is not present in a document, and cases of documents having multiple instances of the same tag at different nesting levels. In the first case we have:

$$f_b(\mathrm{null};\ b[data]_B) = 0 \tag{2}$$

and in the second case we evaluate the distance between the tags using the formula:

$$f_x = \max_{p,k}\ \frac{1}{1+\Delta}\, f_{x_{p,k}}\big(x_p[data]_A;\ x_k[data]_B\big) \tag{3}$$

$$\Delta = |\mu(x_p) - \mu(x_k)| \tag{4}$$

Occurrences of terms have distinct informative roles depending on the tags they belong to. So it is possible either to define a different function f for each group of data of the same tag in different documents, or to choose a function that considers the membership value $\mu(x_i)$ associated with the i-th tag.

We represent the content of each tag (the a[data], b[data], c[data], ... in (1)) with the well-known Vector Space Model, widely used in modern information retrieval systems. The vector space model (VSM) is an algebraic model used for information filtering and information retrieval. It represents natural language documents in a formal manner by the use of vectors in a multi-dimensional space. The vector space model usually builds a documents-terms matrix and processes it to generate the document-terms vectors. Our approach is similar, but we generate one matrix for each tag in the document; correspondingly, we generate a tag-terms vector. There are several methods to generate the tag-terms vector, such as LSA (Latent Semantic Analysis [7]) or SVD (Singular Value Decomposition), a well-known method of matrix reduction that adds latent semantic meaning to the vectors.
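The tag-level comparison of equations (1) to (4) and its aggregation can be sketched as below. The choices here are illustrative assumptions, not the tool's implementation: f is taken to be the cosine similarity of raw term-count vectors, the t-norm and t-conorm are taken to be `min` and `max`, and each document is a hypothetical dictionary mapping a tag name to a list of (membership, text) occurrences.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity of term-count vectors (one possible choice of f)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())
                     * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def tag_similarity(occ_a, occ_b, f=cosine):
    """Compare all occurrences of one tag in two documents.

    occ_a, occ_b: lists of (membership, text) pairs for the same tag name.
    """
    if not occ_a or not occ_b:
        return 0.0  # equation (2): the tag is missing in one document
    # Equations (3) and (4): best pairing, discounted by the membership gap.
    return max(f(ta, tb) / (1.0 + abs(ma - mb))
               for ma, ta in occ_a for mb, tb in occ_b)


def document_similarity(doc_a, doc_b, aggregate=min):
    """Aggregate tag-level values with F (min as t-norm, max as t-conorm)."""
    tags = set(doc_a) | set(doc_b)
    if not tags:
        return 0.0
    return aggregate(tag_similarity(doc_a.get(t, []), doc_b.get(t, []))
                     for t in tags)
```

With `aggregate=min`, a single missing tag drives the document similarity to 0, whereas `aggregate=max` keeps the best tag-level match; this is the practical difference between the conjunctive and disjunctive choices of F.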
In OntoExtractor, generating the tag-terms vectors is a three-step process:

Generating the tags-terms matrix: for each tag in the document, a documents-terms matrix is produced. It is important to remember that we do not consider the document as a unique text blob; instead, we build the documents-terms matrix at the tag level. If a tag is not present in a document, a row of zeros is added to the matrix. Each entry in the matrix can be computed in several ways, by choosing one of the weighting methods implemented in the tool. At present it is possible to choose among tf-idf, tf-df, tf and term occurrence.

Transforming the matrix: once the matrix has been generated, we process it with a matrix transformation. The user can choose between keeping the original matrix and transforming it with LSA by means of SVD. This method relies on the assumption that any $m \times n$ matrix A (with $m \geq n$) can be written as the product of an $m \times n$ column-orthogonal matrix U, an $n \times n$ diagonal matrix $\Sigma$ with positive or zero elements, and the transpose of an $n \times n$ orthogonal matrix V. More generally, suppose M is an $m \times n$ matrix whose entries come from the field K, which is either the field of real numbers or the field of complex numbers. Then there exists a factorization of the form $M = U \Sigma V^{*}$, where U is an $m \times m$ unitary matrix over K, the matrix $\Sigma$ is $m \times n$ with non-negative numbers on the diagonal and zeros off the diagonal, and $V^{*}$ denotes the conjugate transpose of V, an $n \times n$ unitary matrix over K. Such a factorization is called a singular-value decomposition of M. The matrix V thus contains a set of orthogonal input or analysing base-vector directions for M. The matrix U contains a set of orthogonal output base-vector directions for M. The matrix $\Sigma$ contains the singular values, which can be thought of as scalar gain controls by which each corresponding input is multiplied to give a corresponding output. After the matrix decomposition we generate a new $m \times n$ matrix using an r-reduction of the original SVD decomposition: $M_r = U_r \Sigma_r V_r^{*}$.
Only the r column vectors of U and the r row vectors of $V^{*}$ corresponding to the non-zero singular values $\Sigma_r$ are calculated. The resulting matrix is no longer sparse: it is densely populated by values with hidden semantic meaning.

Storing the vectors: each row in the matrix is stored in the associated tag in the document model as a new Fuzzy Bag, with the terms as the elements and the entries in the vector as membership values. Tag contents are now represented by Fuzzy Bags, and we can compare them by means of different distance measures; for instance, we can use traditional vector-space distances such as the cosine distance.

6 Conclusions and Further Work

In order to avoid the synonymy and polysemy problems, new processors that use external ontologies to identify concepts will be added in the next versions of OntoExtractor. This approach, however, introduces other problems that have to be considered; one of these is Word Sense Disambiguation (WSD). The validity of this tool must be evaluated within the complete system into which it is inserted. Further work will provide a report on evaluations of the KIWI system.

References

1. B. Bouchon-Meunier, M. Rifqi, S. Bothorel: Towards general measures of comparison of objects. Fuzzy Sets and Systems, volume 84.
2. P. Bosc, E. Damiani: Fuzzy Service Selection in a Distributed Object-Oriented Environment. IEEE Transactions on Fuzzy Systems, volume 9, no. 5.
3. P. Ceravolo, M.C. Nocerino, M. Viviani: Knowledge extraction from semi-structured data based on fuzzy techniques. Knowledge-Based Intelligent Information and Engineering Systems, Proceedings of the 8th International Conference, KES 2004, Part III.
4. P. Ceravolo, E. Damiani, M. Viviani: Adding a Peer-to-Peer Trust Layer to Metadata Generators. Lecture Notes in Computer Science, Volume 3762.
5. P. Ceravolo, A. Corallo, E. Damiani, G. Elia, M. Viviani, A. Zilli: Bottom-up extraction and maintenance of ontology-based metadata. Fuzzy Logic and the Semantic Web, Computer Intelligence, Elsevier.
6. E. Damiani, M.C. Nocerino, M. Viviani: Knowledge extraction from an XML data flow: building a taxonomy based on clustering technique. Current Issues in Data and Knowledge Engineering, Proceedings of EUROFUSE 2004: 8th Meeting of the EURO Working Group on Fuzzy Sets.
7. T. K. Landauer, P. W. Foltz, D. Laham: Introduction to Latent Semantic Analysis. Discourse Processes, 25.
8. G. Salton, C. Buckley: Term Weighting Approaches in Automatic Text Retrieval. Technical Report, Cornell University, 1987.
9. G. Salton, A. Singhal, C. Buckley, M. Mitra: Automatic Text Decomposition Using Text Segments and Text Themes. Conference on Hypertext, pages 53-65, 1996.


More information

Recommender System. What is it? How to build it? Challenges. R package: recommenderlab

Recommender System. What is it? How to build it? Challenges. R package: recommenderlab Recommender System What is it? How to build it? Challenges R package: recommenderlab 1 What is a recommender system Wiki definition: A recommender system or a recommendation system (sometimes replacing

More information

Clustering and Dimensionality Reduction. Stony Brook University CSE545, Fall 2017

Clustering and Dimensionality Reduction. Stony Brook University CSE545, Fall 2017 Clustering and Dimensionality Reduction Stony Brook University CSE545, Fall 2017 Goal: Generalize to new data Model New Data? Original Data Does the model accurately reflect new data? Supervised vs. Unsupervised

More information

60-538: Information Retrieval

60-538: Information Retrieval 60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are

More information

Information Retrieval. Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

Function approximation using RBF network. 10 basis functions and 25 data points.

Function approximation using RBF network. 10 basis functions and 25 data points. 1 Function approximation using RBF network F (x j ) = m 1 w i ϕ( x j t i ) i=1 j = 1... N, m 1 = 10, N = 25 10 basis functions and 25 data points. Basis function centers are plotted with circles and data

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

Text Mining: A Burgeoning technology for knowledge extraction

Text Mining: A Burgeoning technology for knowledge extraction Text Mining: A Burgeoning technology for knowledge extraction 1 Anshika Singh, 2 Dr. Udayan Ghosh 1 HCL Technologies Ltd., Noida, 2 University School of Information &Communication Technology, Dwarka, Delhi.

More information

Sub-process discovery: Opportunities for Process Diagnostics

Sub-process discovery: Opportunities for Process Diagnostics Sub-process discovery: Opportunities for Process Diagnostics Raykenler Yzquierdo-Herrera 1, Rogelio Silverio-Castro 1, Manuel Lazo-Cortés 1 1 Faculty 3, University of the Informatics Sciences. Habana,

More information

Fuzzy Set-Theoretical Approach for Comparing Objects with Fuzzy Attributes

Fuzzy Set-Theoretical Approach for Comparing Objects with Fuzzy Attributes Fuzzy Set-Theoretical Approach for Comparing Objects with Fuzzy Attributes Y. Bashon, D. Neagu, M.J. Ridley Department of Computing University of Bradford Bradford, BD7 DP, UK e-mail: {Y.Bashon, D.Neagu,

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,

More information

Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining

Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining 1 Vishakha D. Bhope, 2 Sachin N. Deshmukh 1,2 Department of Computer Science & Information Technology, Dr. BAM

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

User Profiling for Interest-focused Browsing History

User Profiling for Interest-focused Browsing History User Profiling for Interest-focused Browsing History Miha Grčar, Dunja Mladenič, Marko Grobelnik Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia {Miha.Grcar, Dunja.Mladenic, Marko.Grobelnik}@ijs.si

More information

Lesson 5 Web Service Interface Definition (Part II)

Lesson 5 Web Service Interface Definition (Part II) Lesson 5 Web Service Interface Definition (Part II) Service Oriented Architectures Security Module 1 - Basic technologies Unit 3 WSDL Ernesto Damiani Università di Milano Controlling the style (1) The

More information

Feature selection. LING 572 Fei Xia

Feature selection. LING 572 Fei Xia Feature selection LING 572 Fei Xia 1 Creating attribute-value table x 1 x 2 f 1 f 2 f K y Choose features: Define feature templates Instantiate the feature templates Dimensionality reduction: feature selection

More information

Intelligent flexible query answering Using Fuzzy Ontologies

Intelligent flexible query answering Using Fuzzy Ontologies International Conference on Control, Engineering & Information Technology (CEIT 14) Proceedings - Copyright IPCO-2014, pp. 262-277 ISSN 2356-5608 Intelligent flexible query answering Using Fuzzy Ontologies

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information

Circle Graphs: New Visualization Tools for Text-Mining

Circle Graphs: New Visualization Tools for Text-Mining Circle Graphs: New Visualization Tools for Text-Mining Yonatan Aumann, Ronen Feldman, Yaron Ben Yehuda, David Landau, Orly Liphstat, Yonatan Schler Department of Mathematics and Computer Science Bar-Ilan

More information

2.3 Algorithms Using Map-Reduce

2.3 Algorithms Using Map-Reduce 28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure

More information

Ontology Development. Qing He

Ontology Development. Qing He A tutorial report for SENG 609.22 Agent Based Software Engineering Course Instructor: Dr. Behrouz H. Far Ontology Development Qing He 1 Why develop an ontology? In recent years the development of ontologies

More information

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Manning, Raghavan, and Schütze http://www.informationretrieval.org OVERVIEW Introduction Basic XML Concepts Challenges

More information

Introducing fuzzy quantification in OWL 2 ontologies

Introducing fuzzy quantification in OWL 2 ontologies Introducing fuzzy quantification in OWL 2 ontologies Francesca Alessandra Lisi and Corrado Mencar Dipartimento di Informatica, Centro Interdipartimentale di Logica e Applicazioni Università degli Studi

More information

Community Detection. Community

Community Detection. Community Community Detection Community In social sciences: Community is formed by individuals such that those within a group interact with each other more frequently than with those outside the group a.k.a. group,

More information

Clustering. Distance Measures Hierarchical Clustering. k -Means Algorithms

Clustering. Distance Measures Hierarchical Clustering. k -Means Algorithms Clustering Distance Measures Hierarchical Clustering k -Means Algorithms 1 The Problem of Clustering Given a set of points, with a notion of distance between points, group the points into some number of

More information

Exploring Ancient Texts with a User Driven Concept Search

Exploring Ancient Texts with a User Driven Concept Search Exploring Ancient Texts with a User Driven Concept Search Muhammad Faisal Cheema, Stefan Jänicke, Christoph Weilbach, Judith Blumenstein, Gerik Scheuermann Leipzig University, Germany exchange: Exploring

More information

A ew Algorithm for Community Identification in Linked Data

A ew Algorithm for Community Identification in Linked Data A ew Algorithm for Community Identification in Linked Data Nacim Fateh Chikhi, Bernard Rothenburger, Nathalie Aussenac-Gilles Institut de Recherche en Informatique de Toulouse 118, route de Narbonne 31062

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

Evaluation Methods for Focused Crawling

Evaluation Methods for Focused Crawling Evaluation Methods for Focused Crawling Andrea Passerini, Paolo Frasconi, and Giovanni Soda DSI, University of Florence, ITALY {passerini,paolo,giovanni}@dsi.ing.unifi.it Abstract. The exponential growth

More information

Image Access and Data Mining: An Approach

Image Access and Data Mining: An Approach Image Access and Data Mining: An Approach Chabane Djeraba IRIN, Ecole Polythechnique de l Université de Nantes, 2 rue de la Houssinière, BP 92208-44322 Nantes Cedex 3, France djeraba@irin.univ-nantes.fr

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

Improving Recognition through Object Sub-categorization

Improving Recognition through Object Sub-categorization Improving Recognition through Object Sub-categorization Al Mansur and Yoshinori Kuno Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570,

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval

More information

A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang

A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang University of Massachusetts Amherst Introduction Singular Value Decomposition (SVD) A: m n matrix (m n) U, V: orthogonal

More information

Simple Method for Ontology Automatic Extraction from Documents

Simple Method for Ontology Automatic Extraction from Documents Simple Method for Ontology Automatic Extraction from Documents Andreia Dal Ponte Novelli Dept. of Computer Science Aeronautic Technological Institute Dept. of Informatics Federal Institute of Sao Paulo

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John

More information

Improving Information Retrieval Effectiveness in Peer-to-Peer Networks through Query Piggybacking

Improving Information Retrieval Effectiveness in Peer-to-Peer Networks through Query Piggybacking Improving Information Retrieval Effectiveness in Peer-to-Peer Networks through Query Piggybacking Emanuele Di Buccio, Ivano Masiero, and Massimo Melucci Department of Information Engineering, University

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

The UCD community has made this article openly available. Please share how this access benefits you. Your story matters!

The UCD community has made this article openly available. Please share how this access benefits you. Your story matters! Provided by the author(s) and University College Dublin Library in accordance with publisher policies., Please cite the published version when available. Title Context enabled semantic granularity Authors(s)

More information

Studying the Impact of Text Summarization on Contextual Advertising

Studying the Impact of Text Summarization on Contextual Advertising Studying the Impact of Text Summarization on Contextual Advertising G. Armano, A. Giuliani, and E. Vargiu Intelligent Agents and Soft-Computing Group Dept. of Electrical and Electronic Engineering University

More information

Analysis and Latent Semantic Indexing

Analysis and Latent Semantic Indexing 18 Principal Component Analysis and Latent Semantic Indexing Understand the basics of principal component analysis and latent semantic index- Lab Objective: ing. Principal Component Analysis Understanding

More information

CS231A Course Notes 4: Stereo Systems and Structure from Motion

CS231A Course Notes 4: Stereo Systems and Structure from Motion CS231A Course Notes 4: Stereo Systems and Structure from Motion Kenji Hata and Silvio Savarese 1 Introduction In the previous notes, we covered how adding additional viewpoints of a scene can greatly enhance

More information

Distributed Information Retrieval using LSI. Markus Watzl and Rade Kutil

Distributed Information Retrieval using LSI. Markus Watzl and Rade Kutil Distributed Information Retrieval using LSI Markus Watzl and Rade Kutil Abstract. Latent semantic indexing (LSI) is a recently developed method for information retrieval (IR). It is a modification of the

More information

Ontology based Web Page Topic Identification

Ontology based Web Page Topic Identification Ontology based Web Page Topic Identification Abhishek Singh Rathore Department of Computer Science & Engineering Maulana Azad National Institute of Technology Bhopal, India Devshri Roy Department of Computer

More information

Matching and Alignment: What is the Cost of User Post-match Effort?

Matching and Alignment: What is the Cost of User Post-match Effort? Matching and Alignment: What is the Cost of User Post-match Effort? (Short paper) Fabien Duchateau 1 and Zohra Bellahsene 2 and Remi Coletta 2 1 Norwegian University of Science and Technology NO-7491 Trondheim,

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

Lecture Telecooperation. D. Fensel Leopold-Franzens- Universität Innsbruck

Lecture Telecooperation. D. Fensel Leopold-Franzens- Universität Innsbruck Lecture Telecooperation D. Fensel Leopold-Franzens- Universität Innsbruck First Lecture: Introduction: Semantic Web & Ontology Introduction Semantic Web and Ontology Part I Introduction into the subject

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Information Retrieval. hussein suleman uct cs

Information Retrieval. hussein suleman uct cs Information Management Information Retrieval hussein suleman uct cs 303 2004 Introduction Information retrieval is the process of locating the most relevant information to satisfy a specific information

More information

Knowledge Engineering in Search Engines

Knowledge Engineering in Search Engines San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2012 Knowledge Engineering in Search Engines Yun-Chieh Lin Follow this and additional works at:

More information

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION 6 NEURAL NETWORK BASED PATH PLANNING ALGORITHM 61 INTRODUCTION In previous chapters path planning algorithms such as trigonometry based path planning algorithm and direction based path planning algorithm

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy {ienco,meo,botta}@di.unito.it Abstract. Feature selection is an important

More information

Modern GPUs (Graphics Processing Units)

Modern GPUs (Graphics Processing Units) Modern GPUs (Graphics Processing Units) Powerful data parallel computation platform. High computation density, high memory bandwidth. Relatively low cost. NVIDIA GTX 580 512 cores 1.6 Tera FLOPs 1.5 GB

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko,

More information

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013 Lecture 24: Image Retrieval: Part II Visual Computing Systems Review: K-D tree Spatial partitioning hierarchy K = dimensionality of space (below: K = 2) 3 2 1 3 3 4 2 Counts of points in leaf nodes Nearest

More information

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining. About the Tutorial Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts

More information

The Application Research of Semantic Web Technology and Clickstream Data Mart in Tourism Electronic Commerce Website Bo Liu

The Application Research of Semantic Web Technology and Clickstream Data Mart in Tourism Electronic Commerce Website Bo Liu International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) The Application Research of Semantic Web Technology and Clickstream Data Mart in Tourism Electronic Commerce

More information

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Outline Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Lecture 10 CS 410/510 Information Retrieval on the Internet Query reformulation Sources of relevance for feedback Using

More information

DataRover: A Taxonomy Based Crawler for Automated Data Extraction from Data-Intensive Websites

DataRover: A Taxonomy Based Crawler for Automated Data Extraction from Data-Intensive Websites DataRover: A Taxonomy Based Crawler for Automated Data Extraction from Data-Intensive Websites H. Davulcu, S. Koduri, S. Nagarajan Department of Computer Science and Engineering Arizona State University,

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Data Distortion for Privacy Protection in a Terrorist Analysis System

Data Distortion for Privacy Protection in a Terrorist Analysis System Data Distortion for Privacy Protection in a Terrorist Analysis System Shuting Xu, Jun Zhang, Dianwei Han, and Jie Wang Department of Computer Science, University of Kentucky, Lexington KY 40506-0046, USA

More information